Re: [PATCH 1/3] POWERPC: don't cast a pointer to pointer of list_head

2007-12-06 Thread Nick Piggin
On Thursday 06 December 2007 20:33, Li Zefan wrote:
 The casting is safe only when the list_head member is the
 first member of the structure.

Even so, I don't think it's too safe :) It might technically work,
but it could break more easily.

So even if you find places where list_head is the first member of
a structure, it would be nice to explicitly use the list_head member
and avoid casts IMO.

Thanks,
Nick

 Signed-off-by: Li Zefan [EMAIL PROTECTED]

 ---
  arch/ppc/syslib/ocp.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/arch/ppc/syslib/ocp.c b/arch/ppc/syslib/ocp.c
 index 3f5be2c..d42d408 100644
 --- a/arch/ppc/syslib/ocp.c
 +++ b/arch/ppc/syslib/ocp.c
 @@ -376,7 +376,7 @@ ocp_remove_one_device(unsigned int vendor, unsigned int function, int index)
 
 	down_write(&ocp_devices_sem);
 	dev = __ocp_find_device(vendor, function, index);
 -	list_del((struct list_head *)dev);
 +	list_del(&dev->link);
 	up_write(&ocp_devices_sem);
 
 	DBG(("ocp: ocp_remove_one_device(vendor: %x, function: %x, index: %d)... done.\n", vendor, function, index));
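
For illustration, the shape of the fix being recommended here, as a minimal sketch (the helper names are made up; only list_del(&dev->link) comes from the patch):

#include <linux/list.h>

struct ocp_device {
	struct list_head link;	/* need not be the first member */
	/* driver-specific fields ... */
};

/* correct no matter where 'link' sits inside the structure */
static void ocp_unlink(struct ocp_device *dev)
{
	list_del(&dev->link);
}

/* going the other way, from a list_head back to its container */
static struct ocp_device *ocp_from_node(struct list_head *node)
{
	return list_entry(node, struct ocp_device, link);
}

The cast in the old code only happened to work because the list_head was the first member; list_entry()/container_of() encodes the offset explicitly, so the code keeps working if the layout changes.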


[patch] xmon bitlock fix

2007-11-19 Thread Nick Piggin
(sorry, untested for lack of hardware)
--
xmon uses a bit lock spinlock but doesn't close the critical section
when releasing it. It doesn't seem like a big deal because it will
eventually break out of the lock anyway, but presumably that's only
in exceptional cases where some error is tolerated, while the lack of a memory
barrier could allow incorrect results during normal operation
as well.

Convert it to use a regular spinlock instead.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
---
Index: linux-2.6/arch/ppc/xmon/start.c
===
--- linux-2.6.orig/arch/ppc/xmon/start.c
+++ linux-2.6/arch/ppc/xmon/start.c
@@ -92,16 +92,14 @@ xmon_write(void *handle, void *ptr, int 
 {
char *p = ptr;
int i, c, ct;
-
-#ifdef CONFIG_SMP
-   static unsigned long xmon_write_lock;
+   static DEFINE_SPINLOCK(xmon_write_lock);
int lock_wait = 100;
int locked;
 
-   while ((locked = test_and_set_bit(0, &xmon_write_lock)) != 0)
+   while (!(locked = spin_trylock(&xmon_write_lock))) {
if (--lock_wait == 0)
break;
-#endif
+   }
 
if (!scc_initialized)
xmon_init_scc();
@@ -122,10 +120,9 @@ xmon_write(void *handle, void *ptr, int 
eieio();
}
 
-#ifdef CONFIG_SMP
-   if (!locked)
-   clear_bit(0, &xmon_write_lock);
-#endif
+   if (locked)
+   spin_unlock(&xmon_write_lock);
+
return nb;
 }
 


[patch] powerpc: hash lock use lock bitops

2007-11-19 Thread Nick Piggin
This isn't a bugfix, but may help performance slightly...

--
powerpc 64-bit hash pte lock bit is an actual lock, so it can take advantage
of lock bitops for slightly more optimal memory barriers (can avoid an lwsync
in the trylock).

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
---
Index: linux-2.6/arch/powerpc/mm/hash_native_64.c
===
--- linux-2.6.orig/arch/powerpc/mm/hash_native_64.c
+++ linux-2.6/arch/powerpc/mm/hash_native_64.c
@@ -113,7 +113,7 @@ static inline void native_lock_hpte(stru
unsigned long *word = &hptep->v;
 
while (1) {
-   if (!test_and_set_bit(HPTE_LOCK_BIT, word))
+   if (!test_and_set_bit_lock(HPTE_LOCK_BIT, word))
break;
while(test_bit(HPTE_LOCK_BIT, word))
cpu_relax();
@@ -124,8 +124,7 @@ static inline void native_unlock_hpte(st
 {
unsigned long *word = &hptep->v;
 
-   asm volatile("lwsync":::"memory");
-   clear_bit(HPTE_LOCK_BIT, word);
+   clear_bit_unlock(HPTE_LOCK_BIT, word);
 }
 
 static long native_hpte_insert(unsigned long hpte_group, unsigned long va,


Re: [patch] xmon bitlock fix

2007-11-19 Thread Nick Piggin
On Tue, Nov 20, 2007 at 04:28:02PM +1100, Paul Mackerras wrote:
 Nick Piggin writes:
 
  xmon uses a bit lock spinlock but doesn't close the critical section
  when releasing it. It doesn't seem like a big deal because it will
  eventually break out of the lock anyway, but presumably that's only
  in exceptional cases where some error is tolerated, while the lack of a
  memory barrier could allow incorrect results during normal operation
  as well.
  
  Convert it to use a regular spinlock instead.
 
 I'd rather not.  The idea is that xmon is as independent as possible
 of the rest of the kernel, so that it will work even when lots of
 kernel data structures are corrupted.  If spinlocks were simple spin
 loops then maybe, but they're not these days.

Fair enough, I guess you still want lockdep kernels to work.

 
 As for the memory barrier comment, I don't think it is a reason, since
 test_and_set_bit acts as a barrier, and in any case the worst thing

test_and_set_bit is a barrier (although heavier than required to acquire
a lock). But clear_bit isn't strong enough... you do have eieio() in
there, making it probably less of a problem, but that won't get executed if
nb is 0, and it is still doing the xmon_init_scc() outside the loop. I
just prefer to make it a simple lock that doesn't rely on barriers or
the specific structure of the critical section.
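
To make the distinction concrete, a bit lock built on the lock bitops looks roughly like this (a generic sketch, not the xmon code):

#include <linux/bitops.h>
#include <asm/processor.h>	/* cpu_relax() */

static inline void bit_lock(unsigned long *word)
{
	/* test_and_set_bit_lock() has acquire semantics: accesses inside
	 * the critical section cannot be reordered before it */
	while (test_and_set_bit_lock(0, word))
		cpu_relax();
}

static inline void bit_unlock(unsigned long *word)
{
	/* clear_bit_unlock() has release semantics: it orders all prior
	 * loads and stores before the bit is cleared, which a plain
	 * clear_bit() does not guarantee */
	clear_bit_unlock(0, word);
}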


 that could happen is that the characters from different cpu's outputs
 get interleaved (that's one reason why the loop has a timeout, the
 other is for robustness).
 
 In any case this is in arch/ppc which is dying code.  I don't think
 there are any SMP platforms supported in arch/ppc any more except for
 some (fairly rare) PReP platforms, and I hope to get PReP moved over
 soon.
 
I was just grepping the tree, and trying to reduce code which might
be fragile, broken, or a poor example.


 Finally, why do you say that it doesn't close the critical section?
 Possibly the locked variable is badly named (it's zero when we get
 the lock) but AFAICS the code is actually correct.

Well the unlock itself doesn't close it properly because clear_bit
doesn't order previous loads or stores.

How about this?
---

xmon uses a bit lock spinlock but doesn't close the critical section
when releasing it. It doesn't seem like a big deal because it will
eventually break out of the lock anyway, but presumably that's only
in exceptional cases where some error is tolerated, while the lack of a memory
barrier could allow incorrect results during normal operation
as well.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/arch/ppc/xmon/start.c
===
--- linux-2.6.orig/arch/ppc/xmon/start.c
+++ linux-2.6/arch/ppc/xmon/start.c
@@ -98,7 +98,7 @@ xmon_write(void *handle, void *ptr, int 
int lock_wait = 100;
int locked;
 
-   while ((locked = test_and_set_bit(0, &xmon_write_lock)) != 0)
+   while ((locked = test_and_set_bit_lock(0, &xmon_write_lock)) != 0)
if (--lock_wait == 0)
break;
 #endif
@@ -124,7 +124,7 @@ xmon_write(void *handle, void *ptr, int 
 
 #ifdef CONFIG_SMP
if (!locked)
-   clear_bit(0, &xmon_write_lock);
+   clear_bit_unlock(0, &xmon_write_lock);
 #endif
return nb;
 }


Re: [patch] powerpc: hash lock use lock bitops

2007-11-19 Thread Nick Piggin
On Tue, Nov 20, 2007 at 05:08:24PM +1100, Benjamin Herrenschmidt wrote:
 
 On Tue, 2007-11-20 at 06:09 +0100, Nick Piggin wrote:
  This isn't a bugfix, but may help performance slightly...
  
  --
  powerpc 64-bit hash pte lock bit is an actual lock, so it can take advantage
  of lock bitops for slightly more optimal memory barriers (can avoid an 
  lwsync
  in the trylock).
  
  Signed-off-by: Nick Piggin [EMAIL PROTECTED]
  Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
  ---
 
 Looks nice, I'll try it out on a G5 and let you know.

Cool, thanks (I don't have mine handy ATM...).

BTW, here is another thing which you might want to think about. Again
untested for temporary lack of hardware.

--

The radix-tree is now RCU safe, so powerpc can avoid the games it was playing
in order to have a lockless readside. Saves an irq disable/enable, a couple of
__get_cpu_var()s, a cacheline, and a memory barrier, in the fastpath. Should
save a cycle or two...
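
Stripped of the surrounding diff, the locking scheme the patch below moves to is roughly this (a sketch reusing the names from the patch; error handling omitted):

	/* readers: plain RCU read-side section, no irq disable, no per-cpu flag */
	rcu_read_lock();
	ptr = radix_tree_lookup(&host->revmap_data.tree, hwirq);
	rcu_read_unlock();

	/* writers: still serialised against each other by the existing irq_big_lock */
	spin_lock_irqsave(&irq_big_lock, flags);
	radix_tree_insert(&host->revmap_data.tree, hwirq, &irq_map[virq]);
	spin_unlock_irqrestore(&irq_big_lock, flags);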

---
Index: linux-2.6/arch/powerpc/kernel/irq.c
===
--- linux-2.6.orig/arch/powerpc/kernel/irq.c
+++ linux-2.6/arch/powerpc/kernel/irq.c
@@ -406,8 +406,6 @@ void do_softirq(void)
 
 static LIST_HEAD(irq_hosts);
 static DEFINE_SPINLOCK(irq_big_lock);
-static DEFINE_PER_CPU(unsigned int, irq_radix_reader);
-static unsigned int irq_radix_writer;
 struct irq_map_entry irq_map[NR_IRQS];
 static unsigned int irq_virq_count = NR_IRQS;
 static struct irq_host *irq_default_host;
@@ -550,57 +548,6 @@ void irq_set_virq_count(unsigned int cou
irq_virq_count = count;
 }
 
-/* radix tree not lockless safe ! we use a brlock-type mecanism
- * for now, until we can use a lockless radix tree
- */
-static void irq_radix_wrlock(unsigned long *flags)
-{
-   unsigned int cpu, ok;
-
-   spin_lock_irqsave(&irq_big_lock, *flags);
-   irq_radix_writer = 1;
-   smp_mb();
-   do {
-   barrier();
-   ok = 1;
-   for_each_possible_cpu(cpu) {
-   if (per_cpu(irq_radix_reader, cpu)) {
-   ok = 0;
-   break;
-   }
-   }
-   if (!ok)
-   cpu_relax();
-   } while(!ok);
-}
-
-static void irq_radix_wrunlock(unsigned long flags)
-{
-   smp_wmb();
-   irq_radix_writer = 0;
-   spin_unlock_irqrestore(&irq_big_lock, flags);
-}
-
-static void irq_radix_rdlock(unsigned long *flags)
-{
-   local_irq_save(*flags);
-   __get_cpu_var(irq_radix_reader) = 1;
-   smp_mb();
-   if (likely(irq_radix_writer == 0))
-   return;
-   __get_cpu_var(irq_radix_reader) = 0;
-   smp_wmb();
-   spin_lock(&irq_big_lock);
-   __get_cpu_var(irq_radix_reader) = 1;
-   spin_unlock(&irq_big_lock);
-}
-
-static void irq_radix_rdunlock(unsigned long flags)
-{
-   __get_cpu_var(irq_radix_reader) = 0;
-   local_irq_restore(flags);
-}
-
 static int irq_setup_virq(struct irq_host *host, unsigned int virq,
irq_hw_number_t hwirq)
 {
@@ -791,9 +738,9 @@ void irq_dispose_mapping(unsigned int vi
/* Check if radix tree allocated yet */
if (host->revmap_data.tree.gfp_mask == 0)
break;
-   irq_radix_wrlock(&flags);
+   spin_lock_irqsave(&irq_big_lock, flags);
   radix_tree_delete(&host->revmap_data.tree, hwirq);
-   irq_radix_wrunlock(flags);
+   spin_unlock_irqrestore(&irq_big_lock, flags);
break;
}
 
@@ -861,9 +808,9 @@ unsigned int irq_radix_revmap(struct irq
return irq_find_mapping(host, hwirq);
 
/* Now try to resolve */
-   irq_radix_rdlock(&flags);
+   rcu_read_lock();
ptr = radix_tree_lookup(tree, hwirq);
-   irq_radix_rdunlock(flags);
+   rcu_read_unlock();
 
/* Found it, return */
if (ptr) {
@@ -874,9 +821,9 @@ unsigned int irq_radix_revmap(struct irq
/* If not there, try to insert it */
virq = irq_find_mapping(host, hwirq);
if (virq != NO_IRQ) {
-   irq_radix_wrlock(&flags);
+   spin_lock_irqsave(&irq_big_lock, flags);
   radix_tree_insert(tree, hwirq, &irq_map[virq]);
-   irq_radix_wrunlock(flags);
+   spin_unlock_irqrestore(&irq_big_lock, flags);
}
return virq;
 }
@@ -989,12 +936,12 @@ static int irq_late_init(void)
struct irq_host *h;
unsigned long flags;
 
-   irq_radix_wrlock(&flags);
+   spin_lock_irqsave(&irq_big_lock, flags);
   list_for_each_entry(h, &irq_hosts, link) {
   if (h->revmap_type == IRQ_HOST_MAP_TREE)
   INIT_RADIX_TREE(&h->revmap_data.tree, GFP_ATOMIC);
   }
-   irq_radix_wrunlock(flags);
+   spin_unlock_irqrestore(&irq_big_lock, flags);
 
return 0

Re: [PATCH] synchronize_irq needs a barrier

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 12:32, Herbert Xu wrote:

 First of all let's agree on some basic assumptions:

 * A pair of spin lock/unlock subsumes the effect of a full mb.

Not unless you mean a pair of spin lock/unlock as in
2 spin lock/unlock pairs (4 operations).

*X = 10;
spin_lock(lock);
/* *Y speculatively loaded here */
/* store to *X leaves CPU store queue here */
spin_unlock(lock);
y = *Y;

 * A spin lock in general only equates to (SS/SL/LL).
 * A spin unlock in general only equates to (SS/LS).

I don't use the sparc barriers, so they don't come naturally to
me ;)

I think both loads and stores can pass into the critical section
by having the spin_lock pass earlier ops, or by having spin_unlock
be passed by later ones.


 In particular, a load after a spin unlock may pass before the
 spin unlock.

 Here is the race (with tg3 as the only example that I know of).
 The driver attempts to quiesce interrupts such that after the
 call to synchronize_irq it is intended that no further IRQ
 handler calls for that device will do any work besides acking
 the IRQ.

 Here is how it does it:

 CPU0  CPU1
   spin lock
   load irq_sync
 irq_sync = 1
 smp_mb
 synchronize_irq()
   while (IRQ_INPROGRESS)
   wait
   return
   set IRQ_INPROGRESS
   spin unlock
   tg3_msi
   ack IRQ
   if (irq_sync)
   return
   do work

 The problem here is that load of irq_sync in the handler has
 passed above the setting of IRQ_INPROGRESS.

 Linus's patch fixes it because this becomes:

 CPU0  CPU1
   spin lock
   load irq_sync
 irq_sync = 1
 smp_mb
 synchronize_irq
   set IRQ_INPROGRESS
   spin unlock
   spin lock
   spin unlock
   tg3_msi
   ack IRQ
   if (irq_sync)
   return
   do work
   while (IRQ_INPROGRESS)
   wait
   spin lock
   clear IRQ_INPROGRESS
   spin unlock
   return

 Even though we still do the work on the right we will now notice
 the INPROGRESS flag on the left and wait.

 It's hard to fix this in the drivers because they'd either have
 to access the desc lock or add a full mb to the fast path on the
 right.

 Once this goes in we can also remove the smp_mb from tg3.c.  BTW,
 a lot of drivers (including the fusion example Ben quoted) call
 synchronize_irq before free_irq.  This is unnecessary because
 the latter already calls it anyway.

 Cheers,


Re: [PATCH] synchronize_irq needs a barrier

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 13:28, Herbert Xu wrote:
 Nick Piggin [EMAIL PROTECTED] wrote:
  First of all let's agree on some basic assumptions:
 
  * A pair of spin lock/unlock subsumes the effect of a full mb.
 
  Not unless you mean a pair of spin lock/unlock as in
  2 spin lock/unlock pairs (4 operations).
 
  *X = 10;
  spin_lock(lock);
  /* *Y speculatively loaded here */
  /* store to *X leaves CPU store queue here */
  spin_unlock(lock);
  y = *Y;

 Good point.

 Although in this case we're still safe because in the worst
 cases:

 CPU0  CPU1
 irq_sync = 1
 synchronize_irq
   spin lock
   load IRQ_INPROGRESS
 irq_sync sync is visible
   spin unlock
   spin lock
   load irq_sync
   while (IRQ_INPROGRESS)
   wait
   return
   set IRQ_INPROGRESS
   spin unlock
   tg3_msi
   ack IRQ
   if (irq_sync)
   return
   spin lock
   clear IRQ_INPROGRESS
   spin unlock

 

 CPU0  CPU1
   spin lock
   load irq_sync
 irq_sync = 1
 synchronize_irq
   set IRQ_INPROGRESS
   spin unlock
   spin lock
   load IRQ_INPROGRESS
 irq_sync sync is visible
   spin unlock
   while (IRQ_INPROGRESS)
   wait
   tg3_msi
   ack IRQ
   if (irq_sync)
   return
   do work
   spin lock
   clear IRQ_INPROGRESS
   spin unlock
   return

 So because we're using the same lock on both sides, it does
 do the right thing even without the memory barrier.


Yeah, if you've got the lock on both sides there, then I
agree it will be correct.


Re: SYSFS: need a noncaching read

2007-09-12 Thread Nick Piggin
On Wednesday 12 September 2007 20:01, Greg KH wrote:
 On Wed, Sep 12, 2007 at 07:32:07AM +0200, Robert Schwebel wrote:
  On Tue, Sep 11, 2007 at 11:43:17AM +0200, Heiko Schocher wrote:
   I have developed a device driver and use the sysFS to export some
   registers to userspace.
 
  Uuuh, uggly. Don't do that. Device drivers are there to abstract things,
  not to play around with registers from userspace.
 
   I opened the sysFS File for one register and did some reads from this
   File, but I alwas becoming the same value from the register, whats not
   OK, because they are changing. So I found out that the sysFS caches
   the reads ... :-(
 
  Yes, it does. What you can do is close()ing the file handle between
  accesses, which makes it work but is slow.

 Do an lseek back to 0 and then re-read, you will get called in your
 driver again.

Can you do a pread with offset 0 to avoid the two syscalls? (which some
people seem to be concerned about)
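
In code, the idea would be something like this (the sysfs path is a placeholder, and whether a pread() at offset 0 triggers a fresh read from the driver is exactly the question being asked here):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[64];
	ssize_t n;
	int fd = open("/sys/devices/.../my_register", O_RDONLY);

	if (fd < 0)
		return 1;
	while ((n = pread(fd, buf, sizeof(buf) - 1, 0)) > 0) {
		/* one syscall per sample instead of lseek() + read() */
		buf[n] = '\0';
		printf("%s", buf);
		sleep(1);
	}
	close(fd);
	return 0;
}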


Re: wmb vs mmiowb

2007-09-03 Thread Nick Piggin
On Thu, Aug 30, 2007 at 02:42:41PM -0500, Brent Casavant wrote:
 On Thu, 30 Aug 2007, Nick Piggin wrote:
 
  I don't know whether this is exactly a correct implementation of
  Linux's barrier semantics. On one hand, wmb _is_ ordering the stores
  as they come out of the CPU; on the other, it isn't ordering normal
  stores with respect to writel from the POV of the device (which is
  seems to be what is expected by the docs and device driver writers).
 
 Or, as I think of it, it's not ordering cacheable stores with respect
 to uncacheable stores from the perspective of other CPUs in the system.
 That's what's really at the heart of the concern for SN2.

AFAIKS, the issue is simply that it is not ordering cacheable stores
with respect to uncacheable stores from a _single_ CPU. I'll elaborate
further down.


  And on the other side, it just doesn't seem so useful just to know
  that stores coming out of the CPU are ordered if they can be reordered
  by an intermediate.
 
 Well, it helps when certain classes of stores need to be ordered with
 respect to each other.  On SN2, wmb() still ensures that cacheable stores
 are issued in a particular order, and thus seen by other CPUs in a
 particular order.  That is still important, even when IO devices are not
 in the mix.

Well, we have smp_wmb() for that.


  Why even have wmb() at all, if it doesn't actually
  order stores to IO and RAM?
 
 It orders the class of stores which target RAM.  It doesn't order the
 two separate classes of stores (RAM and IO) with respect to each other.

wmb() *really* is supposed to order all stores. As far as I gather,
devices often need it for something like this:

*dma_buffer = blah;
wmb();
writel(START_DMA, iomem);

One problem for sn2 seems to be that wmb is called like 500 times in
drivers/ and would be really heavy to turn it into mmiowb. On the
other hand, I really don't like how it's just gone and said oh the
normal Linux semantics are too hard, so make wmb() mean something
slightly different, and add a totally new mmiowb() concept. Device
driver writers already get barriers totally wrong. mmiowb is being
completely misused already (and probably wmb too).


 mmiowb() when used in conjunction with a lock which serializes access
 to an IO device ensures that the order of stores to the IO device from
 different CPUs is well-defined.  That's what we're really after here.

But if we're pragmatic, we could say that stores which are sitting in
the CPU's chipset where they can potentially be reordered, can still
_conceptually_ be considered to be in some kind of store queue of the
CPU. This would mean that wmb() does have to order these WRT cacheable
stores coming from a single CPU.

And once you do that, sn2 will _also_ do the right thing with multiple
CPUs.


  I guess it is too expensive for you to have mmiowb() in every wmb(),
  because _most_ of the time, all that's needed is ordering between IOs.
 
 I think it's the other way around.  Most of the time all you need is
 ordering between RAM stores, so mmiowb() would kill performance if it
 was called every time wmb() was invoked.

No, we have smp_wmb() for that.

 
  So why not have io_mb(), io_rmb(), io_wmb(), which order IOs but ignore
  system memory. Then the non-prefixed primitives order everything (to the
  point that wmb() is like mmiowb on sn2).
 
 I'm not sure I follow.  Here's the bad sequence we're working with:
 
   CPU A   CPU B   Lock owner  IO device sees 
   -   -   --  --
   ... ... unowned
   lock()  ... CPU A
   writel(val_a)   lock()  ...
   unlock()CPU B
   ... write(val_b)...
   ... unlock()unowned
   ... ... ... val_b
   ... ... ... val_a
 
 
 The cacheable store to RAM from CPU A to perform the unlock was
 not ordered with respect to the uncacheable writel() to the IO device.
 CPU B, which has a different uncacheable store path to the IO device
 in the NUMA system, saw the effect of the RAM store before CPU A's
 uncacheable store arrived at the IO device.  CPU B then owned the
 lock, performed its own uncacheable store to the IO device, and
 released the lock.  The two uncacheable stores are taking different
 routes to the device, and end up arriving in the wrong order.
 
 mmiowb() solves this by causing the following:
 
   CPU A   CPU B   Lock owner  IO device sees 
   -   -   --  --
   ... ... Unowned
   lock()  ... CPU A
   writel(val_a)   lock()  ...
   mmiowb()... val_a
   unlock()CPU B
   ... write(val_b)...
   ... mmiowb

Re: [patch 1/2] powerpc: rmb fix

2007-08-23 Thread Nick Piggin
On Thu, Aug 23, 2007 at 07:57:20PM +0200, Segher Boessenkool wrote:
 The powerpc kernel needs to have full sync insns in every I/O
 accessor in order to enforce all the ordering rules Linux demands.
 It's a bloody shame, but the alternative would be to make the
 barriers lots more expensive.  A third alternative would be to
 
 Well lots more expensive compared to what you have now. But what
 you have now is like having those expensive barriers between
 *every* io access.
 
 Yeah.  But I/O reads are very expensive anyway, and the barriers
 are used for more than just I/O ordering.

rmb() should only be used when IO ordering matters. smp_rmb() is
for regular ordering.

So doesn't the fact IO reads are very expensive anyway lend more
weight _to have_ the full IO ordering in rmb()?
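
The distinction being drawn, spelled out (a sketch; the flag/data/status names are illustrative):

	/* CPU-to-CPU only: consuming data published by another CPU */
	while (!data_ready)
		cpu_relax();
	smp_rmb();		/* order the flag load before the data loads */
	val = data;

	/* CPU-to-device: an MMIO status read followed by a read of a DMA buffer */
	status = readl(regs + STATUS_REG);
	rmb();			/* order the MMIO read before the memory read */
	val = dma_buf->len;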


 I/O writes are a different thing; ideally, they would use only
 eieio, if anything at all.

For IO to IO ordering, yes eieio would be ideal. I don't know that
there is really such a primitive for that in Linux. io_wmb().

 
 Maybe the tradeoff isn't optimal.  The I/O primitives didn't have
 all those syncs in there before, they got added because some bad
 interaction with spinlocks was discovered, if my memory isn't failing
 me.

I think it may have been because IO ops are pretty strongly ordered
on x86, and it was thought that a fair amount of code was relying on
that. So the old primitives were made quite strongly ordered, and
new ones were added to avoid the overhead. Unfortunately, you can't
actually use the unordered ones unless you have working barrier
instructions. Hence why I think rmb() should be an IO barrier.

But I'm not pushing this too hard. You guys all know the gory ppc
details better than I, so I'll just leave it with you to work out
whether it is the right thing to do.


 have barrier ops that do not order everything, but just A vs. B
 for various choices of A and B (coherent accesses, MMIO accesses,
 etc.)
 
 The non-smp_ variant is supposed to order everything, AFAIK. Maybe
 you could get more fancy and have PIO vs MMIO etc etc. but it looks
 like this whole area is in a pretty sticky state anyway so let's
 not think about that.
 
 *Thinking* about it is fun.  Trying to get the code merged would be
 a different thing ;-)

;)


Re: wmb vs mmiowb

2007-08-23 Thread Nick Piggin
On Thu, Aug 23, 2007 at 09:16:42AM -0700, Linus Torvalds wrote:
 
 
 On Thu, 23 Aug 2007, Nick Piggin wrote:
  
  Also, FWIW, there are some advantages of deferring the mmiowb thingy
  until the point of unlock.
 
 And that is exactly what ppc64 does.
 
 But you're missing a big point: for 99.9% of all hardware, mmiowb() is a 
 total no-op. So when you talk about advantages, you're not talking about 
 any *real* advantage, are you?

You're in a feisty mood today ;)

I guess on the 0.1% of hardware where it is not a noop, there might be a
real advantage... but that was just handwaving anyway. My real point was
that I'd like things to be more easily understandable.

I think we are agreed at this point that mmiowb without some form of CPU
synchronisation is a bug, and it is also not of the same type of barrier
that we normally think about in the kernel (it could be like a MPI style
rendezvous barrier between the CPU and the IO fabric). Anyway, point is
that device drivers seem to have enough on their plate already.

Look at bcm43xx, for example. Most of this guy's mmiowb()s are completely
wrong and should be wmb(). mmiowb() is only a wmb() on ppc because as I
said, ppc's spin_unlock does not order IOs like most other architectures.
On alpha, for example, spin_unlock does order IOs, so mmiowb is a noop,
and this is broken (non-sn2 ia64 should also be a noop here, because
their unlock orders IOs, but it seems that mmiowb semantics are so
non-obvious that they either got it wrong themselves, or assumed device
drivers surely would).




Re: wmb vs mmiowb

2007-08-23 Thread Nick Piggin
On Thu, Aug 23, 2007 at 09:56:16AM -0700, Jesse Barnes wrote:
 On Thursday, August 23, 2007 12:27 am Benjamin Herrenschmidt wrote:
   Of course, the normal memory barrier would usually be a
   spin_unlock() or something like that, not a wmb(). In fact, I
   don't think the powerpc implementation (as an example of this) will
   actually synchronize with anything *but* a spin_unlock().
 
  We are even more sneaky in the sense that we set a per-cpu flag on
  any MMIO write and do the sync automatically in spin_unlock() :-)
 
 Yeah, that's a reasonable thing to do, and in fact I think there's code 
 to do something similar when a task is switched out (this keeps user 
 level drivers from having do mmiowb() type things).

It might be worth doing that and removing mmiowb completely. Or, if
that's too expensive, I'd like to see an API that is more explicitly
for keeping IOs inside critical sections.
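
Roughly, the trick being described works like this (a simplified sketch, not the actual ppc64 implementation, which keeps the flag in the paca; my_writel/my_spin_unlock are illustrative wrappers):

#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/io.h>

static DEFINE_PER_CPU(int, io_sync_pending);

static inline void my_writel(u32 val, void __iomem *addr)
{
	writel(val, addr);
	/* remember that this CPU has an MMIO write in flight */
	__get_cpu_var(io_sync_pending) = 1;
}

static inline void my_spin_unlock(spinlock_t *lock)
{
	/* pay for the heavy sync only if an MMIO write actually happened
	 * under the lock */
	if (unlikely(__get_cpu_var(io_sync_pending))) {
		mb();
		__get_cpu_var(io_sync_pending) = 0;
	}
	spin_unlock(lock);
}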


 FWIW, I think I had an earlier version of the patch that used the name 
 pioflush() or something similar, the only confusing thing about that 
 name is that the primitive doesn't actually force I/Os down to the 
 device level, just to the closest bridge.

Yeah, that's what I found when trying to think of a name ;) It is
like an intermediate-level flush for the platform code, but from the
POV of the driver writer, it is nothing of the sort ;)
 


Re: wmb vs mmiowb

2007-08-23 Thread Nick Piggin
On Thu, Aug 23, 2007 at 06:27:42PM +0200, Benjamin Herrenschmidt wrote:
 On Thu, 2007-08-23 at 09:16 -0700, Linus Torvalds wrote:
  
  On Thu, 23 Aug 2007, Nick Piggin wrote:
   
   Also, FWIW, there are some advantages of deferring the mmiowb thingy
   until the point of unlock.
  
  And that is exactly what ppc64 does.
  
  But you're missing a big point: for 99.9% of all hardware, mmiowb() is a 
  total no-op. So when you talk about advantages, you're not talking about 
  any *real* advantage, are you?
 
 I wonder whether it might be worth removing mmiowb and having all archs
 that matter do like ppc64 though... It's just yet another confusing
 barrier that most driver writers get wrong..

Only sn2 and powerpc really matter, actually (for different reasons).

All smp architectures other than powerpc appear to have barrier
instructions that order all memory operations, so IOs never leak
out of locking primitives. This is why powerpc wants a wmb (not
mmiowb) before spin_unlock to order IOs (pity about other locking
primitives).

And all platforms other than sn2 don't appear to reorder IOs after
they leave the CPU, so only sn2 needs to do the mmiowb thing before
spin_unlock.


Re: wmb vs mmiowb

2007-08-22 Thread Nick Piggin
On Wed, Aug 22, 2007 at 12:02:11PM -0700, Jesse Barnes wrote:
 On Wednesday, August 22, 2007 11:07 am Linus Torvalds wrote:
   It really seems like it is some completely different concept from a
   barrier. And it shows, on the platform where it really matters
   (sn2), where the thing actually spins.
 
  I agree that it probably isn't a write barrier per se. Think of it
  as a tie two subsystems together thing.
 
 Right, maybe it's not the best name, but as long as you separate your 
 memory access types, you can think of it as a real write barrier, just 
 for mmio accesses (well uncached access really).

If we have the following situation (all vars start at 0)
CPU0CPU1CPU2
spin_lock(lock);   ~
A = 1;  ~
wmb();  ~
B = 2;  ~
spin_unlock(lock); X = B;
spin_lock(lock);   rmb();
A = 10; Y = A;
wmb();  ~
B = 11; ~
spin_unlock(lock); ~

(I use the '~' just to show CPU2 is not specifically temporally
related to CPU0 or CPU1).

Then CPU2 could have X==11 and Y==1, according to the Linux abstract
memory consistency model, couldn't it? I think so, and I think this
is what your mmiowb is trying to protect. In the above situation,
CPU2 would just use the spinlock -- I don't think we have a simple
primitive that CPU0 and 1 can call to prevent this reordering at
CPU2. An IO device obviously can't use a spinlock :).


  (And it doesn't just matter on sn2. It also matters on powerpc64,
  although I think they just set a flag and do the *real* sync in the
  spin_unlock() path).
 
 Yeah, they keep threatening to use this instead, but I'm not sure how 
 easy it would be.  Also they may have more devices/drivers to worry 
 about than sn2, so maybe changing over would mean too much driver 
 debugging (well auditing really since it's not that hard to know where 
 to put them).  Irix actually had an io_unlock() routine that did this 
 implicitly, but iirc that was shot down for Linux...

Why was it shot down? Seems like a pretty good idea to me ;)

I'm clueless when it comes to drivers, but I see a lot of mmiowb()
that are not paired with spin_unlock. How are these obvious? (ie.
what is the pattern?) It looks like some might be lockless FIFOs (or
maybe I'm just not aware of where the locks are). Can you just quickly
illustrate the problem being solved?

Thanks,
Nick


Re: wmb vs mmiowb

2007-08-22 Thread Nick Piggin
On Wed, Aug 22, 2007 at 07:57:56PM -0700, Linus Torvalds wrote:
 
 
 On Thu, 23 Aug 2007, Nick Piggin wrote:
 
   Irix actually had an io_unlock() routine that did this 
   implicitly, but iirc that was shot down for Linux...
  
  Why was it shot down? Seems like a pretty good idea to me ;)
 
 It's horrible. We'd need it for *every* single spinlock type. We have lots 
 of them. 
 
 So the choice is between:
 
  - sane:
 
   mmiowb()
 
followed by any of the existing spin_unlock() variants (plain, 
_irq(), _bh(), _irqrestore())
 
  - insane: multiply our current set of unlock primitives by two, by making 
io versions for them all:
 
   spin_unlock_io[_irq|_irqrestore|_bh]()
 
 but there's actually an EVEN WORSE problem with the stupid Irix approach, 
 namely that it requires that the unlocker be aware of the exact details of 
 what happens inside the lock. If the locking is done at an outer layer, 
 that's not at all obvious!

OK, but we'd have some kind of functions that are called not to
serialise the CPUs, but to serialise the IO. It would be up to
the calling code to already provide CPU synchronisation.

serialize_io(); / unserialize_io(); / a nicer name

If we could pass in some kind of relevant resource (eg. the IO
memory or device or something), then we might even be able to
put debug checks there to ensure two CPUs are never inside the
same critical IO section at once.
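
Something along these lines, in other words (an entirely hypothetical API, only to make the idea concrete; serialize_io()/unserialize_io() do not exist):

	spin_lock(&hw_lock);		/* caller provides the CPU-level exclusion */
	serialize_io(hw_resource);	/* could assert no other CPU is inside an IO
					 * section for this resource */
	writel(val, regs + CTRL_REG);
	unserialize_io(hw_resource);	/* does the mmiowb()-style wait, so a later
					 * lock holder's MMIO cannot overtake ours */
	spin_unlock(&hw_lock);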

 
 In other words, Irix (once again) made a horrible and idiotic choice. 

We could make a better one. I don't think mmiowb is really insane, but
I'd worry about it being confused with a regular type of barrier and that CPU
synchronisation needs to be provided for it to work or make sense.
 

 Big surprise. Irix was probably the flakiest and worst of all the 
 commercial proprietary unixes. No taste.

Is it? I've never used it ;)


Re: wmb vs mmiowb

2007-08-22 Thread Nick Piggin
On Wed, Aug 22, 2007 at 07:57:56PM -0700, Linus Torvalds wrote:
 
 
 On Thu, 23 Aug 2007, Nick Piggin wrote:
 
   Irix actually had an io_unlock() routine that did this 
   implicitly, but iirc that was shot down for Linux...
  
  Why was it shot down? Seems like a pretty good idea to me ;)
 
 It's horrible. We'd need it for *every* single spinlock type. We have lots 
 of them. 
 
 So the choice is between:
 
  - sane:
 
   mmiowb()
 
followed by any of the existing spin_unlock() variants (plain, 
_irq(), _bh(), _irqrestore())
 
  - insane: multiply our current set of unlock primitives by two, by making 
io versions for them all:
 
   spin_unlock_io[_irq|_irqrestore|_bh]()
 
 but there's actually an EVEN WORSE problem with the stupid Irix approach, 
 namely that it requires that the unlocker be aware of the exact details of 
 what happens inside the lock. If the locking is done at an outer layer, 
 that's not at all obvious!

Also, FWIW, there are some advantages of deferring the mmiowb thingy
until the point of unlock. The disadvantage is that the caller may not
know if the inner layer performed ios that require the mmiowb, but
the advantage of waiting until unlock is that the wait is deferred
for as long as possible, and will hopefully be a shorter one when
performed later.



Re: [patch 1/2] powerpc: rmb fix

2007-08-21 Thread Nick Piggin
On Tue, Aug 21, 2007 at 09:43:17PM +0200, Segher Boessenkool wrote:
  #define mb()   __asm__ __volatile__ ("sync" : : : "memory")
 -#define rmb()  __asm__ __volatile__ (__stringify(LWSYNC) : : : "memory")
 +#define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
  #define read_barrier_depends()  do { } while(0)
 
 @@ -42,7 +42,7 @@
  #ifdef __KERNEL__
  #ifdef CONFIG_SMP
  #define smp_mb()   mb()
 -#define smp_rmb()  rmb()
 +#define smp_rmb()  __asm__ __volatile__ (__stringify(LWSYNC) : : : "memory")
  #define smp_wmb()  eieio()
  #define smp_read_barrier_depends() read_barrier_depends()
  #else
 
 I had to think about this one for awhile.  It looks at first glance to 
 be the right
 thing to do.  But I do wonder how long rmb() has been lwsync
 
 Since the {ppc,ppc64} - powerpc merge.
 
 and if as a practical matter that has caused any problems?
 
 It has not as far as I know.
 
 If this isn't causing any problems maybe there
 is some logic we are overlooking?
 
 The I/O accessor functions enforce the necessary ordering
 already I believe.

Hmm, I never followed those discussions last year about IO ordering, and
I can't see where (if) it was documented anywhere :(

It appears that legacy code is handled by defining the old IO accessors to
be completely ordered, and introducing new __raw_ variants that are not
(OTOH, it seems like other architectures are implementing __raw prefix as
in-order unless there is a _relaxed postfix).

Drivers are definitely using these __raw_ accessors, and from a quick
look, they do appear to be hoping that *mb() is going to order access for
them.
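
The kind of usage in question looks something like this (illustrative register offsets; the point is that the __raw_ accessors themselves impose no ordering, so the driver leans on the barrier):

	__raw_writel(desc_addr, regs + DESC_REG);
	wmb();				/* hoped to keep the two MMIO stores, and any
					 * preceding memory stores, in order */
	__raw_writel(CMD_GO, regs + CMD_REG);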



Re: [patch 1/2] powerpc: rmb fix

2007-08-21 Thread Nick Piggin
On Wed, Aug 22, 2007 at 05:29:50AM +0200, Segher Boessenkool wrote:
 If this isn't causing any problems maybe there
 is some logic we are overlooking?
 
 The I/O accessor functions enforce the necessary ordering
 already I believe.
 
 Ah, it looks like you might be right, IO should appear to go in-order, 
 in
 which case the rmb() would simply need to order cacheable loads. 
 Interesting
 way to do things... are drivers simply not up to scratch enough to 
 allow
 out of order IO?
 
 The powerpc kernel needs to have full sync insns in every I/O
 accessor in order to enforce all the ordering rules Linux demands.
 It's a bloody shame, but the alternative would be to make the
 barriers lots more expensive.  A third alternative would be to

Well lots more expensive compared to what you have now. But what
you have now is like having those expensive barriers between
*every* io access.


 have barrier ops that do not order everything, but just A vs. B
 for various choices of A and B (coherent accesses, MMIO accesses,
 etc.)
 
The non-smp_ variant is supposed to order everything, AFAIK. Maybe
you could get more fancy and have PIO vs MMIO etc etc. but it looks
like this whole area is in a pretty sticky state anyway so let's
not think about that.


 Anyway, this raises another question -- if IO accessors have the right
 ordering, why is wmb() not an lwsync as well? There appears to be many
 more wmb() calls than rmb()...
 
 Input MMIO accessors are {sync, load, stall pipeline until load came 
 back}.
 That's a full ordering on both sides.
 
 Output MMIO on the other hand is done with {sync, store}.  Now since
 wmb() has to order MMIO writes vs. main memory writes, we need a full
 sync here.  On some (most, all?) CPUs an eieio is actually enough btw.
 The barrier insn could be put at the end of all MMIO write ops too,
 but I believe that would be more expensive (in execution time; in code
 size it definitely would be, of course).

Ah, that explains why wmb() is a sync. Doesn't seem like a very good
idea though, if the rationale of having fully ordered IO accessors was
because drivers didn't have enough barriers in them.




Re: [patch 1/2] powerpc: rmb fix

2007-08-21 Thread Nick Piggin
On Wed, Aug 22, 2007 at 05:33:16AM +0200, Segher Boessenkool wrote:
 The I/O accessor functions enforce the necessary ordering
 already I believe.
 
 Hmm, I never followed those discussions last year about IO ordering, 
 and
 I can't see where (if) it was documented anywhere :(
 
 The comments in system.h weren't updated with the last fix, I think.
 
 It appears that legacy code is handled by defining the old IO 
 accessors to
 be completely ordered, and introducing new __raw_ variants that are not
 (OTOH, it seems like other architectures are implementing __raw prefix 
 as
 inorder unless there is a _relaxed postfix).
 
 __raw_XX() is for platform code only, which can do the needed
 barriers without having to use the heavy hammer like everything
 else unfortunately does.


[EMAIL PROTECTED]:~/usr/src/linux-2.6/drivers egrep '__raw_(write|read)' -r * | wc -l
685

 
 Drivers are definitely using these __raw_ accessors, and from a quick
 look, they do appear to be hoping that *mb() is going to order access 
 for
 them.
 
 Which drivers?

There are maybe a dozen that use the raw accessors, and use non-smp_
memory barriers. I just looked at drivers/video/tgafb.c, which
indeed appears to intermix them.



wmb vs mmiowb

2007-08-21 Thread Nick Piggin
Hi,

I'm ignorant when it comes to IO access, so I hope this isn't rubbish (if
it is, I would appreciate being corrected).

It took me more than a glance to see what the difference is supposed to be
between wmb() and mmiowb(). I think especially because mmiowb isn't really
like a write barrier.

wmb is supposed to order all writes coming out of a single CPU, so that's
pretty simple.

The problem is that writes coming from different CPUs can be seen by the
device in a different order from which they were written, even if the order
of writes is guaranteed (eg. by a spinlock) and issued in the right order
WRT the locking (ie. using wmb()).
And this can happen because the writes can get posted away and reordered by
the IO fabric (I think). mmiowb ensures the writes are seen by the device
in the correct order.
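
The canonical pattern mmiowb() was introduced for looks like this (a sketch; names are illustrative):

	spin_lock_irqsave(&dev_lock, flags);
	writel(a, regs + REG_A);
	writel(b, regs + REG_B);
	mmiowb();	/* make sure this CPU's MMIO writes reach the device
			 * before the next lock holder's writes can */
	spin_unlock_irqrestore(&dev_lock, flags);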

It doesn't seem like this primary function of mmiowb has anything to do
with a write barrier that we are used to (it may have a secondary semantic
of a wmb as well, but let's ignore that for now). A write barrier will
never provide you with those semantics (writes from 2 CPUs seen in the
same order by a 3rd party). If anything, I think it is closer to being
a read barrier issued on behalf of the target device.  But even that I
think is not much better, because the target is not participating in the
synchronisation that the CPUs are, so the read barrier request could
still arrive at the device out of order WRT the other CPU's writes.

It really seems like it is some completely different concept from a
barrier. And it shows, on the platform where it really matters (sn2), where
the thing actually spins.

I don't know exactly how it should be categorised. On one hand, it is
kind of like a critical section, and would work beautifully if we could
just hide it inside spin_lock_io/spin_unlock_io. On the other hand, it
seems like it is often used separately from locks, where it looks decidedly
less like a critical section or release barrier. How can such uses be
correct if they are worried about multi-CPU ordering but don't have
anything to synchronize the CPUs? Or are they cleverly enforcing CPU
ordering some other way? (in which case, maybe an acquire/release API
really would make sense?).

I don't really have a big point, except that I would like to know whether
I'm on the right track, and wish the thing could have a better name/api.

Thanks,
Nick



Re: [patch 1/2] powerpc: smp_wmb speedup

2007-08-20 Thread Nick Piggin
Sorry, this is patch 2/2 of course.

On Tue, Aug 21, 2007 at 04:16:52AM +0200, Nick Piggin wrote:
 This one is perhaps not as straightforward. I'm pretty limited in the types
 of powerpc machines I can test with, so I don't actually know whether this
 is the right thing to do on power5/6 etc. I can supply the simple test program
 I used if anybody is interested.
 
 ---
 On my dual G5, lwsync is over 5 times faster than eieio when used in a simple
 test case (that actually makes real use of lwsync to provide write ordering).
 
 This is not surprising, as it avoids the IO access synchronisation of eieio,
 and still permits the important relaxation of executing loads before stores.
  On sub-architectures where lwsync is unavailable, eieio is retained, as
 it should be faster than the alternative full sync (eieio is a proper subset
 of sync).
 
 Signed-off-by: Nick Piggin [EMAIL PROTECTED]
 
 Index: linux-2.6/include/asm-powerpc/system.h
 ===
 --- linux-2.6.orig/include/asm-powerpc/system.h
 +++ linux-2.6/include/asm-powerpc/system.h
 @@ -43,7 +43,11 @@
  #ifdef CONFIG_SMP
  #define smp_mb() mb()
  #define smp_rmb()	__asm__ __volatile__ (__stringify(LWSYNC) : : : "memory")
 +#ifdef __SUBARCH_HAS_LWSYNC
 +#define smp_wmb()	__asm__ __volatile__ (__stringify(LWSYNC) : : : "memory")
 +#else
  #define smp_wmb()	eieio()
 +#endif
  #define smp_read_barrier_depends()   read_barrier_depends()
  #else
  #define smp_mb() barrier()
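
For reference, a sketch of the kind of microbenchmark such a comparison needs (hypothetical, not the author's actual test program; powerpc-only because of the inline asm): one thread publishes data behind a flag using the barrier under test, the other checks it never sees the flag ahead of the data. Timing the producer loop with "lwsync" vs. "eieio" in WMB_UNDER_TEST() gives the relative cost.

#include <pthread.h>
#include <stdio.h>

#define ITERS			10000000L
#define WMB_UNDER_TEST()	asm volatile("lwsync" ::: "memory")	/* or "eieio" */
#define RMB()			asm volatile("lwsync" ::: "memory")

static volatile long data, flag;

static void *producer(void *arg)
{
	long i;
	for (i = 1; i <= ITERS; i++) {
		data = i;
		WMB_UNDER_TEST();	/* order the data store before the flag store */
		flag = i;
	}
	return NULL;
}

static void *consumer(void *arg)
{
	long i;
	for (i = 0; i < ITERS; i++) {
		long f = flag;
		RMB();			/* order the flag load before the data load */
		long d = data;
		if (d < f)		/* data must be at least as new as the flag */
			printf("ordering violation: flag=%ld data=%ld\n", f, d);
	}
	return NULL;
}

int main(void)
{
	pthread_t p, c;

	pthread_create(&p, NULL, producer, NULL);
	pthread_create(&c, NULL, consumer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	return 0;
}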

