Re: [PATCH 1/3] POWERPC: don't cast a pointer to pointer of list_head
On Thursday 06 December 2007 20:33, Li Zefan wrote:
> The casting is safe only when the list_head member is the first member
> of the structure. Even so, I don't think it's too safe :)

It might technically work, but it could break more easily. So even if you
find places where list_head is the first member of a structure, it would
be nice to explicitly use the list_head member and avoid casts IMO.

Thanks,
Nick

Signed-off-by: Li Zefan [EMAIL PROTECTED]
---
 arch/ppc/syslib/ocp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/ppc/syslib/ocp.c b/arch/ppc/syslib/ocp.c
index 3f5be2c..d42d408 100644
--- a/arch/ppc/syslib/ocp.c
+++ b/arch/ppc/syslib/ocp.c
@@ -376,7 +376,7 @@ ocp_remove_one_device(unsigned int vendor, unsigned int function, int index)
 	down_write(&ocp_devices_sem);
 	dev = __ocp_find_device(vendor, function, index);
-	list_del((struct list_head *)dev);
+	list_del(&dev->link);
 	up_write(&ocp_devices_sem);
 	DBG(("ocp: ocp_remove_one_device(vendor: %x, function: %x, index: %d)... done.\n", vendor, function, index));

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch] xmon bitlock fix
(sorry, untested for lack of hardware)

--

xmon uses a bit lock spinlock but doesn't close the critical section when
releasing it. It doesn't seem like a big deal because it will eventually
break out of the lock anyway, but presumably that's only in exceptional
cases where some error is tolerated, while the lack of a memory barrier
could allow incorrect results during normal operation as well.

Convert it to use a regular spinlock instead.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
---
Index: linux-2.6/arch/ppc/xmon/start.c
===================================================================
--- linux-2.6.orig/arch/ppc/xmon/start.c
+++ linux-2.6/arch/ppc/xmon/start.c
@@ -92,16 +92,14 @@ xmon_write(void *handle, void *ptr, int
 	char *p = ptr;
 	int i, c, ct;
-
-#ifdef CONFIG_SMP
-	static unsigned long xmon_write_lock;
+	static DEFINE_SPINLOCK(xmon_write_lock);
 	int lock_wait = 100;
 	int locked;

-	while ((locked = test_and_set_bit(0, &xmon_write_lock)) != 0)
+	while (!(locked = spin_trylock(&xmon_write_lock))) {
 		if (--lock_wait == 0)
 			break;
-#endif
+	}

 	if (!scc_initialized)
 		xmon_init_scc();
@@ -122,10 +120,9 @@ xmon_write(void *handle, void *ptr, int
 		eieio();
 	}

-#ifdef CONFIG_SMP
-	if (!locked)
-		clear_bit(0, &xmon_write_lock);
-#endif
+	if (locked)
+		spin_unlock(&xmon_write_lock);
+
 	return nb;
 }
[patch] powerpc: hash lock use lock bitops
This isn't a bugfix, but may help performance slightly...

--

The powerpc 64-bit hash pte lock bit is an actual lock, so it can take
advantage of lock bitops for slightly more optimal memory barriers
(can avoid an lwsync in the trylock).

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
---
Index: linux-2.6/arch/powerpc/mm/hash_native_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hash_native_64.c
+++ linux-2.6/arch/powerpc/mm/hash_native_64.c
@@ -113,7 +113,7 @@ static inline void native_lock_hpte(stru
 	unsigned long *word = &hptep->v;

 	while (1) {
-		if (!test_and_set_bit(HPTE_LOCK_BIT, word))
+		if (!test_and_set_bit_lock(HPTE_LOCK_BIT, word))
 			break;
 		while(test_bit(HPTE_LOCK_BIT, word))
 			cpu_relax();
@@ -124,8 +124,7 @@ static inline void native_unlock_hpte(st
 	unsigned long *word = &hptep->v;

-	asm volatile("lwsync" : : : "memory");
-	clear_bit(HPTE_LOCK_BIT, word);
+	clear_bit_unlock(HPTE_LOCK_BIT, word);
 }

 static long native_hpte_insert(unsigned long hpte_group, unsigned long va,
Re: [patch] xmon bitlock fix
On Tue, Nov 20, 2007 at 04:28:02PM +1100, Paul Mackerras wrote:
> Nick Piggin writes:
>
> > xmon uses a bit lock spinlock but doesn't close the critical section
> > when releasing it. It doesn't seem like a big deal because it will
> > eventually break out of the lock anyway, but presumably that's only
> > in exceptional cases where some error is tolerated, while the lack
> > of a memory barrier could allow incorrect results during normal
> > operation as well.
> >
> > Convert it to use a regular spinlock instead.
>
> I'd rather not. The idea is that xmon is as independent as possible of
> the rest of the kernel, so that it will work even when lots of kernel
> data structures are corrupted. If spinlocks were simple spin loops
> then maybe, but they're not these days.

Fair enough, I guess you still want lockdep kernels to work.

> As for the memory barrier comment, I don't think it is a reason, since
> test_and_set_bit acts as a barrier, and in any case the worst thing

test_and_set_bit is a barrier (although heavier than required to acquire
a lock). But clear_bit isn't strong enough... you do have eieio() in
there, making it probably less of a problem, but that won't get executed
if nb is 0, and it is still doing the xmon_init_scc() outside the loop.
I just prefer to make it a simple lock that doesn't rely on barriers or
the specific structure of the critical section.

> that could happen is that the characters from different cpu's outputs
> get interleaved (that's one reason why the loop has a timeout, the
> other is for robustness). In any case this is in arch/ppc which is
> dying code. I don't think there are any SMP platforms supported in
> arch/ppc any more except for some (fairly rare) PReP platforms, and I
> hope to get PReP moved over soon.

I was just grepping the tree, and trying to reduce code which might be
fragile, broken, or a poor example.

> Finally, why do you say that it doesn't close the critical section?
> Possibly the locked variable is badly named (it's zero when we get the
> lock) but AFAICS the code is actually correct.

Well the unlock itself doesn't close it properly, because clear_bit
doesn't order previous loads or stores. How about this?

---

xmon uses a bit lock spinlock but doesn't close the critical section when
releasing it. It doesn't seem like a big deal because it will eventually
break out of the lock anyway, but presumably that's only in exceptional
cases where some error is tolerated, while the lack of a memory barrier
could allow incorrect results during normal operation as well.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/arch/ppc/xmon/start.c
===================================================================
--- linux-2.6.orig/arch/ppc/xmon/start.c
+++ linux-2.6/arch/ppc/xmon/start.c
@@ -98,7 +98,7 @@ xmon_write(void *handle, void *ptr, int
 	int lock_wait = 100;
 	int locked;

-	while ((locked = test_and_set_bit(0, &xmon_write_lock)) != 0)
+	while ((locked = test_and_set_bit_lock(0, &xmon_write_lock)) != 0)
 		if (--lock_wait == 0)
 			break;
 #endif
@@ -124,7 +124,7 @@ xmon_write(void *handle, void *ptr, int
 #ifdef CONFIG_SMP
 	if (!locked)
-		clear_bit(0, &xmon_write_lock);
+		clear_bit_unlock(0, &xmon_write_lock);
 #endif
 	return nb;
 }
Re: [patch] powerpc: hash lock use lock bitops
On Tue, Nov 20, 2007 at 05:08:24PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2007-11-20 at 06:09 +0100, Nick Piggin wrote:
> > This isn't a bugfix, but may help performance slightly...
> >
> > powerpc 64-bit hash pte lock bit is an actual lock, so it can take
> > advantage of lock bitops for slightly more optimal memory barriers
> > (can avoid an lwsync in the trylock).
> >
> > Signed-off-by: Nick Piggin [EMAIL PROTECTED]
> > Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
>
> Looks nice, I'll try it out on a G5 and let you know.

Cool, thanks (I don't have mine handy ATM...).

BTW, here is another thing which you might want to think about. Again
untested for temporary lack of hardware.

--

The radix-tree is now RCU safe, so powerpc can avoid the games it was
playing in order to have a lockless read side. Saves an irq
disable/enable, a couple of __get_cpu_var()s, a cacheline, and a memory
barrier in the fastpath. Should save a cycle or two...

---
Index: linux-2.6/arch/powerpc/kernel/irq.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/irq.c
+++ linux-2.6/arch/powerpc/kernel/irq.c
@@ -406,8 +406,6 @@ void do_softirq(void)
 static LIST_HEAD(irq_hosts);
 static DEFINE_SPINLOCK(irq_big_lock);
-static DEFINE_PER_CPU(unsigned int, irq_radix_reader);
-static unsigned int irq_radix_writer;
 struct irq_map_entry irq_map[NR_IRQS];
 static unsigned int irq_virq_count = NR_IRQS;
 static struct irq_host *irq_default_host;
@@ -550,57 +548,6 @@ void irq_set_virq_count(unsigned int cou
 	irq_virq_count = count;
 }

-/* radix tree not lockless safe ! we use a brlock-type mecanism
- * for now, until we can use a lockless radix tree
- */
-static void irq_radix_wrlock(unsigned long *flags)
-{
-	unsigned int cpu, ok;
-
-	spin_lock_irqsave(&irq_big_lock, *flags);
-	irq_radix_writer = 1;
-	smp_mb();
-	do {
-		barrier();
-		ok = 1;
-		for_each_possible_cpu(cpu) {
-			if (per_cpu(irq_radix_reader, cpu)) {
-				ok = 0;
-				break;
-			}
-		}
-		if (!ok)
-			cpu_relax();
-	} while(!ok);
-}
-
-static void irq_radix_wrunlock(unsigned long flags)
-{
-	smp_wmb();
-	irq_radix_writer = 0;
-	spin_unlock_irqrestore(&irq_big_lock, flags);
-}
-
-static void irq_radix_rdlock(unsigned long *flags)
-{
-	local_irq_save(*flags);
-	__get_cpu_var(irq_radix_reader) = 1;
-	smp_mb();
-	if (likely(irq_radix_writer == 0))
-		return;
-	__get_cpu_var(irq_radix_reader) = 0;
-	smp_wmb();
-	spin_lock(&irq_big_lock);
-	__get_cpu_var(irq_radix_reader) = 1;
-	spin_unlock(&irq_big_lock);
-}
-
-static void irq_radix_rdunlock(unsigned long flags)
-{
-	__get_cpu_var(irq_radix_reader) = 0;
-	local_irq_restore(flags);
-}
-
 static int irq_setup_virq(struct irq_host *host, unsigned int virq,
 			  irq_hw_number_t hwirq)
 {
@@ -791,9 +738,9 @@ void irq_dispose_mapping(unsigned int vi
 		/* Check if radix tree allocated yet */
 		if (host->revmap_data.tree.gfp_mask == 0)
 			break;
-		irq_radix_wrlock(&flags);
+		spin_lock_irqsave(&irq_big_lock, flags);
 		radix_tree_delete(&host->revmap_data.tree, hwirq);
-		irq_radix_wrunlock(flags);
+		spin_unlock_irqrestore(&irq_big_lock, flags);
 		break;
 	}
@@ -861,9 +808,9 @@ unsigned int irq_radix_revmap(struct irq
 		return irq_find_mapping(host, hwirq);

 	/* Now try to resolve */
-	irq_radix_rdlock(&flags);
+	rcu_read_lock();
 	ptr = radix_tree_lookup(tree, hwirq);
-	irq_radix_rdunlock(flags);
+	rcu_read_unlock();

 	/* Found it, return */
 	if (ptr) {
@@ -874,9 +821,9 @@ unsigned int irq_radix_revmap(struct irq
 	/* If not there, try to insert it */
 	virq = irq_find_mapping(host, hwirq);
 	if (virq != NO_IRQ) {
-		irq_radix_wrlock(&flags);
+		spin_lock_irqsave(&irq_big_lock, flags);
 		radix_tree_insert(tree, hwirq, &irq_map[virq]);
-		irq_radix_wrunlock(flags);
+		spin_unlock_irqrestore(&irq_big_lock, flags);
 	}
 	return virq;
 }
@@ -989,12 +936,12 @@ static int irq_late_init(void)
 	struct irq_host *h;
 	unsigned long flags;

-	irq_radix_wrlock(&flags);
+	spin_lock_irqsave(&irq_big_lock, flags);
 	list_for_each_entry(h, &irq_hosts, link) {
 		if (h->revmap_type == IRQ_HOST_MAP_TREE)
 			INIT_RADIX_TREE(&h->revmap_data.tree, GFP_ATOMIC);
 	}
-	irq_radix_wrunlock(flags);
+	spin_unlock_irqrestore(&irq_big_lock, flags);

 	return 0
Re: [PATCH] synchronize_irq needs a barrier
On Friday 19 October 2007 12:32, Herbert Xu wrote:
> First of all let's agree on some basic assumptions:
>
> * A pair of spin lock/unlock subsumes the effect of a full mb.

Not unless you mean a pair of spin lock/unlock as in 2 spin lock/unlock
pairs (4 operations):

	*X = 10;
	spin_lock(&lock);
	/* *Y speculatively loaded here */
	/* store to *X leaves CPU store queue here */
	spin_unlock(&lock);
	y = *Y;

> * A spin lock in general only equates to (SS/SL/LL).
> * A spin unlock in general only equates to (SS/LS).

I don't use the sparc barriers, so they don't come naturally to me ;)
I think both loads and stores can pass into the critical section, by
having the spin_lock pass earlier ops, or by having spin_unlock be
passed by later ones. In particular, a load after a spin unlock may pass
before the spin unlock.

> Here is the race (with tg3 as the only example that I know of). The
> driver attempts to quiesce interrupts such that after the call to
> synchronize_irq it is intended that no further IRQ handler calls for
> that device will do any work besides acking the IRQ. Here is how it
> does it:
>
> CPU0				CPU1
> 				spin lock
> 				load irq_sync
> irq_sync = 1
> smp_mb
> synchronize_irq()
> 	while (IRQ_INPROGRESS)
> 		wait
> 	return
> 				set IRQ_INPROGRESS
> 				spin unlock
> 				tg3_msi
> 					ack IRQ
> 					if (irq_sync)
> 						return
> 					do work
>
> The problem here is that the load of irq_sync in the handler has
> passed above the setting of IRQ_INPROGRESS. Linus's patch fixes it
> because this becomes:
>
> CPU0				CPU1
> 				spin lock
> 				load irq_sync
> irq_sync = 1
> smp_mb
> synchronize_irq()
> 				set IRQ_INPROGRESS
> 				spin unlock
> 				spin lock
> 				spin unlock
> 				tg3_msi
> 					ack IRQ
> 					if (irq_sync)
> 						return
> 					do work
> 	while (IRQ_INPROGRESS)
> 		wait
> 				spin lock
> 				clear IRQ_INPROGRESS
> 				spin unlock
> 	return
>
> Even though we still do the work on the right, we will now notice the
> INPROGRESS flag on the left and wait.
>
> It's hard to fix this in the drivers because they'd either have to
> access the desc lock or add a full mb to the fast path on the right.
>
> Once this goes in we can also remove the smp_mb from tg3.c.
>
> BTW, a lot of drivers (including the fusion example Ben quoted) call
> synchronize_irq before free_irq. This is unnecessary because the
> latter already calls it anyway.
>
> Cheers,
Re: [PATCH] synchronize_irq needs a barrier
On Friday 19 October 2007 13:28, Herbert Xu wrote:
> Nick Piggin [EMAIL PROTECTED] wrote:
> >
> > > First of all let's agree on some basic assumptions:
> > >
> > > * A pair of spin lock/unlock subsumes the effect of a full mb.
> >
> > Not unless you mean a pair of spin lock/unlock as in 2 spin
> > lock/unlock pairs (4 operations):
> >
> > 	*X = 10;
> > 	spin_lock(&lock);
> > 	/* *Y speculatively loaded here */
> > 	/* store to *X leaves CPU store queue here */
> > 	spin_unlock(&lock);
> > 	y = *Y;
>
> Good point. Although in this case we're still safe because in the
> worst cases:
>
> CPU0				CPU1
> irq_sync = 1
> synchronize_irq
> 	spin lock
> 	load IRQ_INPROGRESS
> 				irq_sync is visible
> 	spin unlock
> 				spin lock
> 				load irq_sync
> 	while (IRQ_INPROGRESS)
> 		wait
> 	return
> 				set IRQ_INPROGRESS
> 				spin unlock
> 				tg3_msi
> 					ack IRQ
> 					if (irq_sync)
> 						return
> 				spin lock
> 				clear IRQ_INPROGRESS
> 				spin unlock
>
> CPU0				CPU1
> 				spin lock
> 				load irq_sync
> irq_sync = 1
> synchronize_irq
> 				set IRQ_INPROGRESS
> 				spin unlock
> 	spin lock
> 	load IRQ_INPROGRESS
> 	irq_sync is visible
> 	spin unlock
> 	while (IRQ_INPROGRESS)
> 		wait
> 				tg3_msi
> 					ack IRQ
> 					if (irq_sync)
> 						return
> 					do work
> 				spin lock
> 				clear IRQ_INPROGRESS
> 				spin unlock
> 	return
>
> So because we're using the same lock on both sides, it does do the
> right thing even without the memory barrier.

Yeah, if you've got the lock on both sides there, then I agree it will
be correct.
Re: SYSFS: need a noncaching read
On Wednesday 12 September 2007 20:01, Greg KH wrote:
> On Wed, Sep 12, 2007 at 07:32:07AM +0200, Robert Schwebel wrote:
> > On Tue, Sep 11, 2007 at 11:43:17AM +0200, Heiko Schocher wrote:
> > > I have developed a device driver and use sysfs to export some
> > > registers to userspace.
> >
> > Uuuh, ugly. Don't do that. Device drivers are there to abstract
> > things, not to play around with registers from userspace.
> >
> > > I opened the sysfs file for one register and did some reads from
> > > this file, but I always got the same value from the register,
> > > which is not OK, because the registers are changing. So I found
> > > out that sysfs caches the reads... :-(
> >
> > Yes, it does. What you can do is close() the file handle between
> > accesses, which makes it work but is slow.
>
> Do an lseek back to 0 and then re-read, you will get called in your
> driver again.

Can you do a pread with offset 0 to avoid the two syscalls? (which some
people seem to be concerned about)
Re: wmb vs mmiowb
On Thu, Aug 30, 2007 at 02:42:41PM -0500, Brent Casavant wrote:
> On Thu, 30 Aug 2007, Nick Piggin wrote:
>
> > I don't know whether this is exactly a correct implementation of
> > Linux's barrier semantics. On one hand, wmb _is_ ordering the stores
> > as they come out of the CPU; on the other, it isn't ordering normal
> > stores with respect to writel from the POV of the device (which
> > seems to be what is expected by the docs and device driver writers).
>
> Or, as I think of it, it's not ordering cacheable stores with respect
> to uncacheable stores from the perspective of other CPUs in the
> system. That's what's really at the heart of the concern for SN2.

AFAIKS, the issue is simply that it is not ordering cacheable stores
with respect to uncacheable stores from a _single_ CPU. I'll elaborate
further down.

> > And on the other side, it just doesn't seem so useful just to know
> > that stores coming out of the CPU are ordered if they can be
> > reordered by an intermediate.
>
> Well, it helps when certain classes of stores need to be ordered with
> respect to each other. On SN2, wmb() still ensures that cacheable
> stores are issued in a particular order, and thus seen by other CPUs
> in a particular order. That is still important, even when IO devices
> are not in the mix.

Well, we have smp_wmb() for that.

> > Why even have wmb() at all, if it doesn't actually order stores to
> > IO and RAM?
>
> It orders the class of stores which target RAM. It doesn't order the
> two separate classes of stores (RAM and IO) with respect to each
> other.

wmb() *really* is supposed to order all stores. As far as I gather,
devices often need it for something like this:

	*dma_buffer = blah;
	wmb();
	writel(START_DMA, iomem);

One problem for sn2 seems to be that wmb is called like 500 times in
drivers/ and would be really heavy to turn it into mmiowb. On the other
hand, I really don't like how it's just gone and said "oh, the normal
Linux semantics are too hard, so make wmb() mean something slightly
different, and add a totally new mmiowb() concept".

Device driver writers already get barriers totally wrong. mmiowb is
being completely misused already (and probably wmb too).

> mmiowb() when used in conjunction with a lock which serializes access
> to an IO device ensures that the order of stores to the IO device from
> different CPUs is well-defined. That's what we're really after here.

But if we're pragmatic, we could say that stores which are sitting in
the CPU's chipset, where they can potentially be reordered, can still
_conceptually_ be considered to be in some kind of store queue of the
CPU. This would mean that wmb() does have to order these WRT cacheable
stores coming from a single CPU. And once you do that, sn2 will _also_
do the right thing with multiple CPUs.

> > I guess it is too expensive for you to have mmiowb() in every wmb(),
> > because _most_ of the time, all that's needed is ordering between
> > IOs.
>
> I think it's the other way around. Most of the time all you need is
> ordering between RAM stores, so mmiowb() would kill performance if it
> was called every time wmb() was invoked.

No, we have smp_wmb() for that.

> > So why not have io_mb(), io_rmb(), io_wmb(), which order IOs but
> > ignore system memory. Then the non-prefixed primitives order
> > everything (to the point that wmb() is like mmiowb on sn2).
>
> I'm not sure I follow. Here's the bad sequence we're working with:
>
> CPU A		CPU B		Lock owner	IO device sees
> -----		-----		----------	--------------
> ...		...		unowned
> lock()		...		CPU A
> writel(val_a)	lock()		...
> unlock()	...		CPU B
> ...		writel(val_b)	...
> ...		unlock()	unowned
> ...		...		...		val_b
> ...		...		...		val_a
>
> The cacheable store to RAM from CPU A to perform the unlock was not
> ordered with respect to the uncacheable writel() to the IO device.
> CPU B, which has a different uncacheable store path to the IO device
> in the NUMA system, saw the effect of the RAM store before CPU A's
> uncacheable store arrived at the IO device. CPU B then owned the lock,
> performed its own uncacheable store to the IO device, and released the
> lock. The two uncacheable stores are taking different routes to the
> device, and end up arriving in the wrong order.
>
> mmiowb() solves this by causing the following:
>
> CPU A		CPU B		Lock owner	IO device sees
> -----		-----		----------	--------------
> ...		...		unowned
> lock()		...		CPU A
> writel(val_a)	lock()		...
> mmiowb()	...		...		val_a
> unlock()	...		CPU B
> ...		writel(val_b)	...
> ...		mmiowb
Re: [patch 1/2] powerpc: rmb fix
On Thu, Aug 23, 2007 at 07:57:20PM +0200, Segher Boessenkool wrote:
> > > The powerpc kernel needs to have full sync insns in every I/O
> > > accessor in order to enforce all the ordering rules Linux demands.
> > > It's a bloody shame, but the alternative would be to make the
> > > barriers lots more expensive. A third alternative would be to
> >
> > Well, lots more expensive compared to what you have now. But what
> > you have now is like having those expensive barriers between *every*
> > io access.
>
> Yeah. But I/O reads are very expensive anyway, and the barriers are
> used for more than just I/O ordering.

rmb() should only be used when IO ordering matters; smp_rmb() is for
regular ordering. So doesn't the fact that IO reads are very expensive
anyway lend more weight to _having_ the full IO ordering in rmb()?

> > > I/O writes are a different thing; ideally, they would use only
> > > eieio, if anything at all.
> >
> > For IO to IO ordering, yes, eieio would be ideal. I don't know that
> > there is really such a primitive for that in Linux.
>
> io_wmb().
>
> Maybe the tradeoff isn't optimal. The I/O primitives didn't have all
> those syncs in there before; they got added because some bad
> interaction with spinlocks was discovered, if my memory isn't failing
> me.

I think it may have been because IO ops are pretty strongly ordered on
x86, and it was thought that a fair amount of code was relying on that.
So the old primitives were made quite strongly ordered, and new ones
were added to avoid the overhead. Unfortunately, you can't actually use
the unordered ones unless you have working barrier instructions. Hence
why I think rmb() should be an IO barrier.

But I'm not pushing this too hard. You guys all know the gory ppc
details better than I, so I'll just leave it with you to work out
whether it is the right thing to do.

> > > have barrier ops that do not order everything, but just A vs. B
> > > for various choices of A and B (coherent accesses, MMIO accesses,
> > > etc.)
> >
> > The non-smp_ variant is supposed to order everything, AFAIK. Maybe
> > you could get more fancy and have PIO vs MMIO etc. etc., but it
> > looks like this whole area is in a pretty sticky state anyway, so
> > let's not think about that.
>
> *Thinking* about it is fun. Trying to get the code merged would be a
> different thing ;-)

;)
Re: wmb vs mmiowb
On Thu, Aug 23, 2007 at 09:16:42AM -0700, Linus Torvalds wrote:
> On Thu, 23 Aug 2007, Nick Piggin wrote:
> >
> > Also, FWIW, there are some advantages of deferring the mmiowb thingy
> > until the point of unlock. And that is exactly what ppc64 does.
>
> But you're missing a big point: for 99.9% of all hardware, mmiowb() is
> a total no-op. So when you talk about advantages, you're not talking
> about any *real* advantage, are you?

You're in a feisty mood today ;)

I guess on the 0.1% of hardware where it is not a noop, there might be a
real advantage... but that was just handwaving anyway. My real point was
that I'd like things to be more easily understandable. I think we are
agreed at this point that mmiowb without some form of CPU
synchronisation is a bug, and it is also not of the same type of barrier
that we normally think about in the kernel (it could be like an
MPI-style rendezvous barrier between the CPU and the IO fabric).

Anyway, the point is that device drivers seem to have enough on their
plate already. Look at bcm43xx, for example. Most of this guy's
mmiowb()s are completely wrong and should be wmb(). mmiowb() is only a
wmb() on ppc because, as I said, ppc's spin_unlock does not order IOs
like most other architectures'. On alpha, for example, spin_unlock does
order IOs, so mmiowb is a noop, and this is broken (non-sn2 ia64 should
also be a noop here, because their unlock orders IOs, but it seems that
mmiowb semantics are so non-obvious that they either got it wrong
themselves, or assumed device drivers surely would).
Re: wmb vs mmiowb
On Thu, Aug 23, 2007 at 09:56:16AM -0700, Jesse Barnes wrote:
> On Thursday, August 23, 2007 12:27 am Benjamin Herrenschmidt wrote:
> > > Of course, the normal memory barrier would usually be a
> > > spin_unlock() or something like that, not a wmb(). In fact, I
> > > don't think the powerpc implementation (as an example of this)
> > > will actually synchronize with anything *but* a spin_unlock().
> >
> > We are even more sneaky in the sense that we set a per-cpu flag on
> > any MMIO write and do the sync automatically in spin_unlock() :-)
>
> Yeah, that's a reasonable thing to do, and in fact I think there's
> code to do something similar when a task is switched out (this keeps
> user level drivers from having to do mmiowb() type things).

It might be worth doing that and removing mmiowb completely. Or, if
that's too expensive, I'd like to see an API that is more explicitly
for keeping IOs inside critical sections.

> FWIW, I think I had an earlier version of the patch that used the name
> pioflush() or something similar; the only confusing thing about that
> name is that the primitive doesn't actually force I/Os down to the
> device level, just to the closest bridge.

Yeah, that's what I found when trying to think of a name ;) It is like
an intermediate-level flush for the platform code, but from the POV of
the driver writer, it is nothing of the sort ;)
Re: wmb vs mmiowb
On Thu, Aug 23, 2007 at 06:27:42PM +0200, Benjamin Herrenschmidt wrote:
> On Thu, 2007-08-23 at 09:16 -0700, Linus Torvalds wrote:
> > On Thu, 23 Aug 2007, Nick Piggin wrote:
> > >
> > > Also, FWIW, there are some advantages of deferring the mmiowb
> > > thingy until the point of unlock. And that is exactly what ppc64
> > > does.
> >
> > But you're missing a big point: for 99.9% of all hardware, mmiowb()
> > is a total no-op. So when you talk about advantages, you're not
> > talking about any *real* advantage, are you?
>
> I wonder whether it might be worth removing mmiowb and having all
> archs that matter do like ppc64 though... It's just yet another
> confusing barrier that most driver writers get wrong.

Only sn2 and powerpc really matter, actually (for different reasons).

All SMP architectures other than powerpc appear to have barrier
instructions that order all memory operations, so IOs never leak out of
locking primitives. This is why powerpc wants a wmb (not mmiowb) before
spin_unlock to order IOs (pity about other locking primitives).

And all platforms other than sn2 don't appear to reorder IOs after they
leave the CPU, so only sn2 needs to do the mmiowb thing before
spin_unlock.
Re: wmb vs mmiowb
On Wed, Aug 22, 2007 at 12:02:11PM -0700, Jesse Barnes wrote:
> On Wednesday, August 22, 2007 11:07 am Linus Torvalds wrote:
> > It really seems like it is some completely different concept from a
> > barrier. And it shows, on the platform where it really matters
> > (sn2), where the thing actually spins.
>
> I agree that it probably isn't a write barrier per se. Think of it as
> a "tie two subsystems together" thing. Right, maybe it's not the best
> name, but as long as you separate your memory access types, you can
> think of it as a real write barrier, just for mmio accesses (well,
> uncached accesses really).

If we have the following situation (all vars start at 0):

CPU0			CPU1			CPU2
spin_lock(&lock);				~
A = 1;						~
wmb();						~
B = 2;						~
spin_unlock(&lock);				X = B;
			spin_lock(&lock);	rmb();
			A = 10;			Y = A;
			wmb();			~
			B = 11;			~
			spin_unlock(&lock);	~

(I use the '~' just to show CPU2 is not specifically temporally related
to CPU0 or CPU1.)

Then CPU2 could have X==11 and Y==1, according to the Linux abstract
memory consistency model, couldn't it? I think so, and I think this is
what your mmiowb is trying to protect.

> In the above situation, CPU2 would just use the spinlock -- I don't
> think we have a simple primitive that CPU0 and 1 can call to prevent
> this reordering at CPU2.

An IO device obviously can't use a spinlock :)

(And it doesn't just matter on sn2. It also matters on powerpc64,
although I think they just set a flag and do the *real* sync in the
spin_unlock() path.)

> Yeah, they keep threatening to use this instead, but I'm not sure how
> easy it would be. Also they may have more devices/drivers to worry
> about than sn2, so maybe changing over would mean too much driver
> debugging (well, auditing really, since it's not that hard to know
> where to put them). Irix actually had an io_unlock() routine that did
> this implicitly, but iirc that was shot down for Linux...

Why was it shot down? Seems like a pretty good idea to me ;)

I'm clueless when it comes to drivers, but I see a lot of mmiowb()s
that are not paired with spin_unlock. How are these obvious? (ie. what
is the pattern?) It looks like some might be lockless FIFOs (or maybe
I'm just not aware of where the locks are). Can you just quickly
illustrate the problem being solved?

Thanks,
Nick
Re: wmb vs mmiowb
On Wed, Aug 22, 2007 at 07:57:56PM -0700, Linus Torvalds wrote:
> On Thu, 23 Aug 2007, Nick Piggin wrote:
> > > Irix actually had an io_unlock() routine that did this implicitly,
> > > but iirc that was shot down for Linux...
> >
> > Why was it shot down? Seems like a pretty good idea to me ;)
>
> It's horrible. We'd need it for *every* single spinlock type. We have
> lots of them. So the choice is between:
>
>  - sane: mmiowb() followed by any of the existing spin_unlock()
>    variants (plain, _irq(), _bh(), _irqrestore())
>
>  - insane: multiply our current set of unlock primitives by two, by
>    making io versions for them all:
>    spin_unlock_io[_irq|_irqrestore|_bh]()
>
> but there's actually an EVEN WORSE problem with the stupid Irix
> approach, namely that it requires that the unlocker be aware of the
> exact details of what happens inside the lock. If the locking is done
> at an outer layer, that's not at all obvious!

OK, but we'd have some kind of functions that are called not to
serialise the CPUs, but to serialise the IO. It would be up to the
calling code to already provide CPU synchronisation:

	serialize_io() / unserialize_io() (or a nicer name)

If we could pass in some kind of relevant resource (eg. the IO memory
or device or something), then we might even be able to put debug checks
there to ensure two CPUs are never inside the same critical IO section
at once.

> In other words, Irix (once again) made a horrible and idiotic choice.

We could make a better one. I don't think mmiowb is really insane, but
I'd worry about it being confused with a regular type of barrier, and
that CPU synchronisation needs to be provided for it to work or make
sense.

> Big surprise. Irix was probably the flakiest and worst of all the
> commercial proprietary unixes. No taste.

Is it? I've never used it ;)
Re: wmb vs mmiowb
On Wed, Aug 22, 2007 at 07:57:56PM -0700, Linus Torvalds wrote:
> On Thu, 23 Aug 2007, Nick Piggin wrote:
> > > Irix actually had an io_unlock() routine that did this implicitly,
> > > but iirc that was shot down for Linux...
> >
> > Why was it shot down? Seems like a pretty good idea to me ;)
>
> It's horrible. We'd need it for *every* single spinlock type. We have
> lots of them. So the choice is between:
>
>  - sane: mmiowb() followed by any of the existing spin_unlock()
>    variants (plain, _irq(), _bh(), _irqrestore())
>
>  - insane: multiply our current set of unlock primitives by two, by
>    making io versions for them all:
>    spin_unlock_io[_irq|_irqrestore|_bh]()
>
> but there's actually an EVEN WORSE problem with the stupid Irix
> approach, namely that it requires that the unlocker be aware of the
> exact details of what happens inside the lock. If the locking is done
> at an outer layer, that's not at all obvious!

Also, FWIW, there are some advantages of deferring the mmiowb thingy
until the point of unlock. The disadvantage is that the caller may not
know if the inner layer performed IOs that require the mmiowb, but the
advantage of waiting until unlock is that the wait is deferred for as
long as possible, and will hopefully be a shorter one when performed
later.
Re: [patch 1/2] powerpc: rmb fix
On Tue, Aug 21, 2007 at 09:43:17PM +0200, Segher Boessenkool wrote: #define mb() __asm__ __volatile__ ("sync" : : : "memory") -#define rmb() __asm__ __volatile__ (__stringify(LWSYNC) : : : "memory") +#define rmb() __asm__ __volatile__ ("sync" : : : "memory") #define wmb() __asm__ __volatile__ ("sync" : : : "memory") #define read_barrier_depends() do { } while(0) @@ -42,7 +42,7 @@ #ifdef __KERNEL__ #ifdef CONFIG_SMP #define smp_mb() mb() -#define smp_rmb() rmb() +#define smp_rmb() __asm__ __volatile__ (__stringify(LWSYNC) : : : "memory") #define smp_wmb() eieio() #define smp_read_barrier_depends() read_barrier_depends() #else I had to think about this one for a while. It looks at first glance to be the right thing to do. But I do wonder how long rmb() has been lwsync Since the {ppc,ppc64} -> powerpc merge. and if as a practical matter that has caused any problems? It has not as far as I know. If this isn't causing any problems maybe there is some logic we are overlooking? The I/O accessor functions enforce the necessary ordering already I believe. Hmm, I never followed those discussions last year about IO ordering, and I can't see where (if) it was documented anywhere :( It appears that legacy code is handled by defining the old IO accessors to be completely ordered, and introducing new __raw_ variants that are not (OTOH, it seems like other architectures are implementing the __raw prefix as in-order unless there is a _relaxed postfix). Drivers are definitely using these __raw_ accessors, and from a quick look, they do appear to be hoping that *mb() is going to order access for them.
Re: [patch 1/2] powerpc: rmb fix
On Wed, Aug 22, 2007 at 05:29:50AM +0200, Segher Boessenkool wrote: If this isn't causing any problems maybe there is some logic we are overlooking? The I/O accessor functions enforce the necessary ordering already I believe. Ah, it looks like you might be right, IO should appear to go in-order, in which case the rmb() would simply need to order cacheable loads. Interesting way to do things... are drivers simply not up to scratch enough to allow out of order IO? The powerpc kernel needs to have full sync insns in every I/O accessor in order to enforce all the ordering rules Linux demands. It's a bloody shame, but the alternative would be to make the barriers lots more expensive. A third alternative would be to Well lots more expensive compared to what you have now. But what you have now is like having those expensive barriers between *every* io access. have barrier ops that do not order everything, but just A vs. B for various choices of A and B (coherent accesses, MMIO accesses, etc.) The non-smp_ variant is supposed to order everything, AFAIK. Maybe you could get more fancy and have PIO vs MMIO etc etc. but it looks like this whole area is in a pretty sticky state anyway so let's not think about that. Anyway, this raises another question -- if IO accessors have the right ordering, why is wmb() not an lwsync as well? There appear to be many more wmb() calls than rmb()... Input MMIO accessors are {sync, load, stall pipeline until load came back}. That's a full ordering on both sides. Output MMIO on the other hand is done with {sync, store}. Now since wmb() has to order MMIO writes vs. main memory writes, we need a full sync here. On some (most, all?) CPUs an eieio is actually enough btw. The barrier insn could be put at the end of all MMIO write ops too, but I believe that would be more expensive (in execution time; in code size it definitely would be, of course). Ah, that explains why wmb() is a sync. 
Doesn't seem like a very good idea though, if the rationale of having fully ordered IO accessors was because drivers didn't have enough barriers in them.
Re: [patch 1/2] powerpc: rmb fix
On Wed, Aug 22, 2007 at 05:33:16AM +0200, Segher Boessenkool wrote: The I/O accessor functions enforce the necessary ordering already I believe. Hmm, I never followed those discussions last year about IO ordering, and I can't see where (if) it was documented anywhere :( The comments in system.h weren't updated with the last fix, I think. It appears that legacy code is handled by defining the old IO accessors to be completely ordered, and introducing new __raw_ variants that are not (OTOH, it seems like other architectures are implementing the __raw prefix as in-order unless there is a _relaxed postfix). __raw_XX() is for platform code only, which can do the needed barriers without having to use the heavy hammer like everything else unfortunately does. [EMAIL PROTECTED]:~/usr/src/linux-2.6/drivers egrep '__raw_(write|read)' -r * | wc -l 685 Drivers are definitely using these __raw_ accessors, and from a quick look, they do appear to be hoping that *mb() is going to order access for them. Which drivers? There are maybe a dozen that use the raw accessors, and use non-smp_ memory barriers. I just looked at drivers/video/tgafb.c, which indeed appears to intermix them.
wmb vs mmiowb
Hi, I'm ignorant when it comes to IO access, so I hope this isn't rubbish (if it is, I would appreciate being corrected). It took me more than a glance to see what the difference is supposed to be between wmb() and mmiowb(). I think especially because mmiowb isn't really like a write barrier. wmb is supposed to order all writes coming out of a single CPU, so that's pretty simple. The problem is that writes coming from different CPUs can be seen by the device in a different order from which they were written, even if the order of writes is guaranteed (eg. by a spinlock) and issued in the right order WRT the locking (ie. using wmb()). And this can happen because the writes can get posted away and reordered by the IO fabric (I think). mmiowb ensures the writes are seen by the device in the correct order. It doesn't seem like this primary function of mmiowb has anything to do with a write barrier that we are used to (it may have a secondary semantic of a wmb as well, but let's ignore that for now). A write barrier will never provide you with those semantics (writes from 2 CPUs seen in the same order by a 3rd party). If anything, I think it is closer to being a read barrier issued on behalf of the target device. But even that I think is not much better, because the target is not participating in the synchronisation that the CPUs are, so the read barrier request could still arrive at the device out of order WRT the other CPU's writes. It really seems like it is some completely different concept from a barrier. And it shows, on the platform where it really matters (sn2), where the thing actually spins. I don't know exactly how it should be categorised. On one hand, it is kind of like a critical section, and would work beautifully if we could just hide it inside spin_lock_io/spin_unlock_io. On the other hand, it seems like it is often used separately from locks, where it looks decidedly less like a critical section or release barrier. 
How can such uses be correct if they are worried about multi-CPU ordering but don't have anything to synchronize the CPUs? Or are they cleverly enforcing CPU ordering some other way? (in which case, maybe an acquire/release API really would make sense?). I don't really have a big point, except that I would like to know whether I'm on the right track, and wish the thing could have a better name/api. Thanks, Nick
Re: [patch 1/2] powerpc: smp_wmb speedup
Sorry, this is patch 2/2 of course. On Tue, Aug 21, 2007 at 04:16:52AM +0200, Nick Piggin wrote: This one is perhaps not as straightforward. I'm pretty limited in the types of powerpc machines I can test with, so I don't actually know whether this is the right thing to do on power5/6 etc. I can supply the simple test program I used if anybody is interested. --- On my dual G5, lwsync is over 5 times faster than eieio when used in a simple test case (that actually makes real use of lwsync to provide write ordering). This is not surprising, as it avoids the IO access synchronisation of eieio, and still permits the important relaxation of executing loads before stores. On sub-architectures where lwsync is unavailable, eieio is retained, as it should be faster than the alternative full sync (eieio is a proper subset of sync). Signed-off-by: Nick Piggin [EMAIL PROTECTED] Index: linux-2.6/include/asm-powerpc/system.h === --- linux-2.6.orig/include/asm-powerpc/system.h +++ linux-2.6/include/asm-powerpc/system.h @@ -43,7 +43,11 @@ #ifdef CONFIG_SMP #define smp_mb() mb() #define smp_rmb() __asm__ __volatile__ (__stringify(LWSYNC) : : : "memory") +#ifdef __SUBARCH_HAS_LWSYNC +#define smp_wmb() __asm__ __volatile__ (__stringify(LWSYNC) : : : "memory") +#else #define smp_wmb() eieio() +#endif #define smp_read_barrier_depends() read_barrier_depends() #else #define smp_mb() barrier()