Re: vm_ops.page_mkwrite() fails with vmalloc on 2.6.23

2007-10-29 Thread Nick Piggin
On Monday 29 October 2007 23:35, Peter Zijlstra wrote:
> On Mon, 2007-10-29 at 11:11 +0100, Peter Zijlstra wrote:
> > On Mon, 2007-10-29 at 01:17 -0700, Jaya Kumar wrote:
> > > On 10/29/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > > > On Mon, 22 Oct 2007 16:40:57 +0200 Stefani Seibold 
<[EMAIL PROTECTED]> wrote:
> > > > > The problem original occurs with the fb_defio driver
> > > > > (driver/video/fb_defio.c). This driver use the
> > > > > vm_ops.page_mkwrite() handler for tracking the modified pages,
> > > > > which will be in an extra thread handled, to perform the IO and
> > > > > clean and write protect all pages with page_clean().
> > >
> > > Hi,
> > >
> > > An aside, I just tested that deferred IO works fine on
> > > 2.6.22.10/pxa255.
> > >
> > > I understood from the thread that PeterZ is looking into page_mkclean
> > > changes which I guess went into 2.6.23. I'm also happy to help in any
> > > way if the way we're doing fb_defio needs to change.
> >
> > Yeah, its the truncate race stuff introduced by Nick in
> >   d0217ac04ca6591841e5665f518e38064f4e65bd
> >
> > I'm a bit at a loss on how to go around fixing this. One ugly idea I had
> > was to check page->mapping before going into page_mkwrite() and when
> > that is null, don't bother with the truncate check.
>
> Something like this


I think it's a fine minimal patch. Maybe add a comment to say exactly
what we're doing here (pagecache generally just uses !mapping to test
for truncate).

Otherwise, Acked-by: Nick Piggin <[EMAIL PROTECTED]>, thanks!

> ---
>  mm/memory.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/mm/memory.c
> ===
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -2300,6 +2300,8 @@ static int __do_fault(struct mm_struct *
>* to become writable
>*/
>   if (vma->vm_ops->page_mkwrite) {
> + struct address_space *mapping = page->mapping;
> +
>   unlock_page(page);
>   if (vma->vm_ops->page_mkwrite(vma, page) < 0) {
>   ret = VM_FAULT_SIGBUS;
> @@ -2314,7 +2316,7 @@ static int __do_fault(struct mm_struct *
>* reworking page_mkwrite locking API, which
>* is better done later.
>*/
> - if (!page->mapping) {
> + if (mapping != page->mapping) {
>   ret = 0;
>   anon = 1; /* no anon but release 
> vmf.page */
>   goto out;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [2.6.23-rt3] NMI watchdog trace of deadlock

2007-10-27 Thread Nick Piggin
On Saturday 27 October 2007 15:21, Mike Galbraith wrote:
> Greetings,
>
> For quite a while now, RT kernels have been locking up on me
> occasionally while my back is turned.  Yesterday, the little bugger
> finally pounced while my serial console box was up and waiting.
>
> [10138.162953] WARNING: at arch/i386/kernel/smp.c:581
> native_smp_call_function_mask() [10138.170583]  []
> show_trace_log_lvl+0x1a/0x30
> [10138.175796]  [] show_trace+0x12/0x14
> [10138.180291]  [] dump_stack+0x16/0x18
> [10138.184769]  [] native_smp_call_function_mask+0x138/0x13d
> [10138.191117]  [] smp_call_function+0x1e/0x24
> [10138.196210]  [] on_each_cpu+0x25/0x50
> [10138.200807]  [] flush_tlb_all+0x1e/0x20
> [10138.205553]  [] kmap_high+0x1b6/0x417
> [10138.210118]  [] kmap+0x4d/0x4f
> [10138.214102]  [] ntfs_end_buffer_async_read+0x228/0x2f9
> [10138.220163]  [] end_bio_bh_io_sync+0x26/0x3f
> [10138.225352]  [] bio_endio+0x42/0x6d
> [10138.229769]  [] __end_that_request_first+0x115/0x4ac
> [10138.235682]  [] end_that_request_chunk+0x8/0xa
> [10138.241052]  [] ide_end_request+0x55/0x10a
> [10138.246058]  [] ide_dma_intr+0x6f/0xac
> [10138.250727]  [] ide_intr+0x93/0x1e0
> [10138.255125]  [] handle_IRQ_event+0x5c/0xc9

Looks like ntfs is kmap()ing from interrupt context. Should
be using kmap_atomic instead, I think.

But the ntfs code I have seems to do just that...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [interesting] smattering of possible memory ordering bugs

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 13:35, Benjamin Herrenschmidt wrote:

[acks]

Thanks for those...

> > Index: linux-2.6/include/asm-powerpc/mmu_context.h
> > ===
> > --- linux-2.6.orig/include/asm-powerpc/mmu_context.h
> > +++ linux-2.6/include/asm-powerpc/mmu_context.h
> > @@ -129,7 +129,7 @@ static inline void get_mmu_context(struc
> > steal_context();
> >  #endif
> > ctx = next_mmu_context;
> > -   while (test_and_set_bit(ctx, context_map)) {
> > +   while (test_and_set_bit_lock(ctx, context_map)) {
> > ctx = find_next_zero_bit(context_map, LAST_CONTEXT+1,
> > ctx); if (ctx > LAST_CONTEXT)
> > ctx = 0;
> > @@ -158,7 +158,7 @@ static inline void destroy_context(struc
> >  {
> > preempt_disable();
> > if (mm->context.id != NO_CONTEXT) {
> > -   clear_bit(mm->context.id, context_map);
> > +   clear_bit_unlock(mm->context.id, context_map);
> > mm->context.id = NO_CONTEXT;
> >  #ifdef FEW_CONTEXTS
> > atomic_inc(&nr_free_contexts);
>
> I don't think the previous code was wrong... it's not a locked section
> and we don't care about ordering previous stores. It's an allocation, it
> should be fine. In general, bitmap allocators should be allright.

Well if it is just allocating an arbitrary _number_ out of a bitmap
and nothing else (eg. like the pid allocator), then you don't need
barriers.


> Ignore the FEW_CONTEXTS stuff for now :-) At this point, it's UP only
> and will be replaced sooner or later.

OK. Then I agree, provided you're doing the correct synchronisation
or flushing etc. when destroying a context (which presumably you are).

I'll drop those bits then.

Thanks,
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[interesting] smattering of possible memory ordering bugs

2007-10-25 Thread Nick Piggin
Hi,

Just out of interest, I did a grep for files containing test_and_set_bit
as well as clear_bit (excluding obvious ones like include/asm-*/bitops.h).

Quite a few interesting things. There is a lot of stuff in drivers/* that
could be suspect, WRT memory barriers, including lots I didn't touch. Lot
of work. But forget that...

Some surprises with more obvious bugs, which I have added some cc's for.
powerpc seems to have an mmu context allocation bug, (and possible
improvement in hpte locking with the new bitops).

floppy is using a crazy open coded bit mutex, convert to regular mutex.

vt has something strange too.

fs-writeback could be made a little safer by properly closing the "critical"
section (the existing sections aren't really critical, but you never know
what the future holds ;))

jfs seems to be missing an obvious smp_mb__before_clear_bit (and a less
obvious smp_mb__after_clear_bit)

xfs seems to be missing smp_mb__before

Lots of code that allocates things (eg. msi interrupts) is suspicious. I'm
not exactly sure if these allocation bitmaps are actually used to protect
data structures or not...

I'll have to get round to preparing proper patches for these things if I
don't hear nacks...
---
 arch/arm/mach-davinci/gpio.c  |4 ++--
 arch/arm/mach-imx/generic.c   |4 ++--
 arch/arm/mach-iop13xx/msi.c   |4 ++--
 arch/arm/mach-ns9xxx/gpio.c   |4 ++--
 arch/avr32/mach-at32ap/pio.c  |1 +
 arch/powerpc/mm/hash_native_64.c  |5 ++---
 arch/ppc/xmon/start.c |   20 +---
 block/ll_rw_blk.c |   20 +---
 drivers/block/cciss.c |4 ++--
 drivers/block/cpqarray.c  |4 ++--
 drivers/block/floppy.c|   36 
 drivers/char/vt.c |4 ++--
 drivers/net/ibm_newemac/mal.c |5 ++---
 drivers/net/s2io.c|8 
 drivers/usb/misc/phidgetservo.c   |6 +++---
 fs/fs-writeback.c |4 ++--
 fs/jfs/jfs_metapage.c |5 +++--
 fs/xfs/xfs_mount.c|4 ++--
 include/asm-powerpc/mmu_context.h |4 ++--
 include/asm-ppc/mmu_context.h |4 ++--
 include/linux/aio.h   |5 -
 include/linux/interrupt.h |3 ++-
 include/linux/netdevice.h |5 ++---
 net/bluetooth/cmtp/core.c |4 ++--
 net/core/dev.c|5 ++---
 26 files changed, 75 insertions(+), 98 deletions(-)

Index: linux-2.6/arch/arm/mach-davinci/gpio.c
===
--- linux-2.6.orig/arch/arm/mach-davinci/gpio.c
+++ linux-2.6/arch/arm/mach-davinci/gpio.c
@@ -34,7 +34,7 @@ int gpio_request(unsigned gpio, const ch
 	if (gpio >= DAVINCI_N_GPIO)
 		return -EINVAL;
 
-	if (test_and_set_bit(gpio, gpio_in_use))
+	if (test_and_set_bit_lock(gpio, gpio_in_use))
 		return -EBUSY;
 
 	return 0;
@@ -46,7 +46,7 @@ void gpio_free(unsigned gpio)
 	if (gpio >= DAVINCI_N_GPIO)
 		return;
 
-	clear_bit(gpio, gpio_in_use);
+	clear_bit_unlock(gpio, gpio_in_use);
 }
 EXPORT_SYMBOL(gpio_free);
 
Index: linux-2.6/arch/arm/mach-imx/generic.c
===
--- linux-2.6.orig/arch/arm/mach-imx/generic.c
+++ linux-2.6/arch/arm/mach-imx/generic.c
@@ -107,7 +107,7 @@ int imx_gpio_request(unsigned gpio, cons
 		return -EINVAL;
 	}
 
-	if(test_and_set_bit(gpio, imx_gpio_alloc_map)) {
+	if(test_and_set_bit_lock(gpio, imx_gpio_alloc_map)) {
 		printk(KERN_ERR "imx_gpio: GPIO %d already used. Allocation for \"%s\" failed\n",
 			gpio, label ? label : "?");
 		return -EBUSY;
@@ -123,7 +123,7 @@ void imx_gpio_free(unsigned gpio)
 	if(gpio >= (GPIO_PORT_MAX + 1) * 32)
 		return;
 
-	clear_bit(gpio, imx_gpio_alloc_map);
+	clear_bit_unlock(gpio, imx_gpio_alloc_map);
 }
 
 EXPORT_SYMBOL(imx_gpio_free);
Index: linux-2.6/arch/arm/mach-iop13xx/msi.c
===
--- linux-2.6.orig/arch/arm/mach-iop13xx/msi.c
+++ linux-2.6/arch/arm/mach-iop13xx/msi.c
@@ -135,7 +135,7 @@ again:
 	if (irq > NR_IRQS)
 		return -ENOSPC;
 	/* test_and_set_bit operates on 32-bits at a time */
-	if (test_and_set_bit(pos, msi_irq_in_use))
+	if (test_and_set_bit_lock(pos, msi_irq_in_use))
 		goto again;
 
 	dynamic_irq_init(irq);
@@ -149,7 +149,7 @@ void destroy_irq(unsigned int irq)
 
 	dynamic_irq_cleanup(irq);
 
-	clear_bit(pos, msi_irq_in_use);
+	clear_bit_unlock(pos, msi_irq_in_use);
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
Index: linux-2.6/arch/arm/mach-ns9xxx/gpio.c
===
--- linux-2.6.orig/arch/arm/mach-ns9xxx/gpio.c
+++ linux-2.6/arch/arm/mach-ns9xxx/gpio.c
@@ -85,7 +85,7 @@ static inline void __iomem *ns9xxx_gpio_
 int gpio_request(unsigned gpio, const char *label)
 {
 	if (likely(ns9xxx_valid_gpio(gpio)))
-		return test_and_set_bit(gpio, gpiores) ? -EBUSY : 0;
+

Re: Is gcc thread-unsafe?

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 09:55, Andi Kleen wrote:
> > But we don't actually know what it is, and it could change with
> > different architectures or versions of gcc. I think the sanest thing
> > is for gcc to help us out here, seeing as there is this very well
> > defined requirement that we want.
> >
> > If you really still want the optimisation to occur,
>
> I don't think it makes sense for memory. It may may make sense for
> registers.
>
> But given that my kernel doesn't seem to contain any instances
> at all it's probably not too useful for it anyways.
>
> So I wouldn't have a problem disabling it, but it would
> be better if they fixed their heuristics to only apply it when
> it makes sense.

That's what I mean -- disabling it for memory. I mean, I am just
talking about the conditional => unconditional store to a shared
variable optimisation.

So for example, adc, sbb, and cmov are all still fine when they
are used for the right things. I don't want to turn them off
because they really are quite useful.

As far as it being theoretical... I hope it is. But we should
nip this in the "bud" (gcc 3.x does this too, sigh) before it
causes problems for us (and any and all other threaded programs
and libraries out there). And by that I mean ask them for a
compiler option rather than start adding volatile.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Is gcc thread-unsafe?

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 09:09, Andi Kleen wrote:
> On Friday 26 October 2007 00:49:42 Nick Piggin wrote:

> > Marking volatile I think is out of the question. To start with,
> > volatile creates really poor code (and most of the time we actually
> > do want the code in critical sections to be as tight as possible).
>
> Poor code is better than broken code I would say. Besides
> the cross CPU synchronization paths are likely dominated by
> cache misses anyways; it's unlikely they're actually limited
> by the core CPU. So it'll probably not matter all that much
> if the code is poor or not.

But we have to mark all memory access inside the critical section
as volatile, even after the CPU synchronisation, and often the
common case where there is no contention or cacheline bouncing.

Sure, real code is dominated by cache misses anyway, etc etc.
However volatile prevents a lot of real optimisations that we'd
actually want to do, and increases icache size etc.


> > But also because I don't think these bugs are just going to be
> > found easily.
> >
> > > It might be useful to come up with some kind of assembler pattern
> > > matcher to check if any such code is generated for the kernel
> > > and try it with different compiler versions.
> >
> > Hard to know how to do it. If you can, then it would be interesting.
>
> I checked my kernel for "adc" at least (for the trylock/++ pattern)
> and couldn't find any (in fact all of them were in
> data the compiler thought to be code). That was not a allyesconfig build,
> so i missed some code.

sbb as well.


> For cmov it's at first harder because they're much more frequent
> and cmov to memory is a multiple instruction pattern (the instruction
> does only support memory source operands). Luckily gcc
> doesn't know the if (x) mem = a; => ptr = cmov(x, &a, &dummy); *ptr = a;
> transformation trick so I don't think there are actually any conditional
> stores on current x86.
>
> It might be a problem on other architectures which support true
> conditional stores though (like IA64 or ARM)

It might just depend on the instruction costs that gcc knows about.
For example, if ld/st is expensive, they might hoist them out of
loops etc. You don't even need to have one of these predicate or
pseudo predicate instructions.


> Also I'm not sure if gcc doesn't know any other tricks like the
> conditional add using carry, although I cannot think of any related
> to stores from the top of my hat.
>
> Anyways, if it's only conditional add if we ever catch such a case
> it could be also handled with inline assembly similar to local_inc()

But we don't actually know what it is, and it could change with
different architectures or versions of gcc. I think the sanest thing
is for gcc to help us out here, seeing as there is this very well
defined requirement that we want.

If you really still want the optimisation to occur, I don't think it
is too much to use a local variable for the accumulator (eg. in the
simple example case).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Is gcc thread-unsafe?

2007-10-25 Thread Nick Piggin
Can you retain cc list, please?

On Friday 26 October 2007 07:42, David Schwartz wrote:
>   I asked a collection of knowledgeable people I know about the issue. The
> consensus is that the optimization is not permitted in POSIX code but that
> it is permitted in pure C code. The basic argument goes like this:
>
>   To make POSIX-compliant code even possible, surely optimizations that 
> add
> writes to variables must be prohibited. That is -- if POSIX prohibits
> writing to a variable in certain cases only the programmer can detect, then
> a POSIX-compliant compiler cannot write to a variable except where
> explicitly told to do so. Any optimization that *adds* a write to a
> variable that would not otherwise occur *must* be prohibited.
>
>   Otherwise, it is literally impossible to comply with the POSIX 
> requirement
> that concurrent modifications and reads to shared variables take place
> while holding a mutex.

Now all you have to do is tell this to the gcc developers ;)


>   The simplest solution is simply to ditch the optimization. If it really
> isn't even an optimization, then that's an easy way out.

For some things, I'd expect it will be an optimisation, which is why
they're even doing it. Even on x86 perhaps, where they do tricks with
sbb/adc. If it avoids an unpredictable branch, it could help. Actually
a silly microbenchmark shows it's worth 10% to do a cmov vs an
unpredictable conditional jump, but another 2.5% to do the adc and
unconditional store (which is the problematic bit).

And for unshared things like local variables where their address
hasn't escaped, it's fine.

Still, I guess that for most non-stack variables, you would actually
_want_ to do a cmov rather than the adc, even in a single threaded
program. Because you don't want to touch the cacheline if possible,
let alone dirty it.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Is gcc thread-unsafe?

2007-10-25 Thread Nick Piggin
On Thursday 25 October 2007 17:15, Andi Kleen wrote:
> On Thursday 25 October 2007 05:24, Nick Piggin wrote:
> > Basically, what the gcc developers are saying is that gcc is
> > free to load and store to any memory location, so long as it
> > behaves as if the instructions were executed in sequence.
>
> This case is clearly a bug, a very likely code pessimization.
> I guess it wasn't intentional, just an optimization that is useful
> for local register values doing too much.

Although there can be cases where it looks much more like an
optimisation (eg. where the branch and increment occurs much
more often), but it would still be a bug. Granted they are
rather constructed cases, but I don't think you want to rely on
the fact that most of the time it's OK.


> > I guess that dynamically allocated memory and computed pointers
> > are more difficult for gcc to do anything unsafe with, because
> > it is harder to tell if a given function has deallocated the
> > memory.
>
> Often accesses happen without function calls inbetween.
> Also I think newer gcc (not 3.x) can determine if a pointer
> "escapes" or not so that might not protect against it.
>
> > Any thoughts?
>
> We don't have much choice: If such a case is found it has to be marked
> volatile or that particular compiler version be unsupported.

Marking volatile I think is out of the question. To start with,
volatile creates really poor code (and most of the time we actually
do want the code in critical sections to be as tight as possible).
But also because I don't think these bugs are just going to be
found easily.


> It might be useful to come up with some kind of assembler pattern
> matcher to check if any such code is generated for the kernel
> and try it with different compiler versions.

Hard to know how to do it. If you can, then it would be interesting.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] fs: restore nobh

2007-10-25 Thread Nick Piggin
On Thu, Oct 25, 2007 at 09:07:36PM +0200, Jan Kara wrote:
>   Hi,
> 
> > This is overdue, sorry. Got a little complicated, and I've been away from
> > my filesystem test setup so I didn't want ot send it (lucky, coz I found
> > a bug after more substantial testing).
> > 
> > Anyway, RFC?
>   Hmm, maybe one comment/question:
> 
> > Index: linux-2.6/fs/buffer.c
> > ===
> > --- linux-2.6.orig/fs/buffer.c  2007-10-08 14:09:35.0 +1000
> > +++ linux-2.6/fs/buffer.c   2007-10-08 16:45:28.0 +1000
> ...
> 
> > -/*
> > - * Make sure any changes to nobh_commit_write() are reflected in
> > - * nobh_truncate_page(), since it doesn't call commit_write().
> > - */
> > -int nobh_commit_write(struct file *file, struct page *page,
> > -   unsigned from, unsigned to)
> > +int nobh_write_end(struct file *file, struct address_space *mapping,
> > +   loff_t pos, unsigned len, unsigned copied,
> > +   struct page *page, void *fsdata)
> >  {
> > struct inode *inode = page->mapping->host;
> > -   loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> > +   struct buffer_head *head = NULL;
> > +   struct buffer_head *bh;
> >  
> > -   if (page_has_buffers(page))
> > -   return generic_commit_write(file, page, from, to);
> > +   if (!PageMappedToDisk(page)) {
> > +   if (unlikely(copied < len) && !page_has_buffers(page))
> > +   attach_nobh_buffers(page, head);
> > +   if (page_has_buffers(page))
> > +   return generic_write_end(file, mapping, pos, len, 
> > copied, page, fsdata);
> > +   }
>   So we can have a page with attached buffers but we decide to not call
> generic_write_end()...
> 
> > SetPageUptodate(page);
> > set_page_dirty(page);
>   And we just set the page dirty but we leave buffers clean. So cannot
> subsequent try_to_free_buffers() come, mark page as clean (as buffers
> are clean) and discard the data?

Hmm, we might just be OK here because set_page_dirty should dirty the
buffers. However, I think you have spotted a mistake here and it would be
better if the generic_write_end test was outside the PageMapedToDisk
check.

Thanks!

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] bitops kernel-doc: inline instead of macro

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 02:55, Randy Dunlap wrote:
> > Hmm, can we simply do
> >
> > static inline int test_and_set_bit_lock(int nr, volatile unsigned long *
> > addr) {
> > return test_and_set_bit(nr, addr);
> > }
> >
> > please ?
>
> Certainly.  That does look better.

Thanks!

>
> ---
>
> From: Randy Dunlap <[EMAIL PROTECTED]>
>
> Use duplicated inline functions for test_and_set_bit_lock() on x86
> instead of #define macros, thus avoiding a bad example.  This allows
> kernel-doc to run cleanly instead of terminating with an error:
>
> Error(linux-2.6.24-rc1//include/asm-x86/bitops_32.h:188): cannot understand
> prototype: 'test_and_set_bit_lock test_and_set_bit '
>
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
> ---
>  include/asm-x86/bitops_32.h |7 +--
>  include/asm-x86/bitops_64.h |7 +--
>  2 files changed, 10 insertions(+), 4 deletions(-)
>
> --- linux-2.6.24-rc1.orig/include/asm-x86/bitops_32.h
> +++ linux-2.6.24-rc1/include/asm-x86/bitops_32.h
> @@ -183,9 +183,12 @@ static inline int test_and_set_bit(int n
>   * @nr: Bit to set
>   * @addr: Address to count from
>   *
> - * This is the same as test_and_set_bit on x86
> + * This is the same as test_and_set_bit on x86.
>   */
> -#define test_and_set_bit_lock test_and_set_bit
> +static inline int test_and_set_bit_lock(int nr, volatile unsigned long *
> addr) +{
> + return test_and_set_bit(nr, addr);
> +}
>
>  /**
>   * __test_and_set_bit - Set a bit and return its old value
> --- linux-2.6.24-rc1.orig/include/asm-x86/bitops_64.h
> +++ linux-2.6.24-rc1/include/asm-x86/bitops_64.h
> @@ -173,9 +173,12 @@ static __inline__ int test_and_set_bit(i
>   * @nr: Bit to set
>   * @addr: Address to count from
>   *
> - * This is the same as test_and_set_bit on x86
> + * This is the same as test_and_set_bit on x86.
>   */
> -#define test_and_set_bit_lock test_and_set_bit
> +static __inline__ int test_and_set_bit_lock(int nr, volatile void * addr)
> +{
> + return test_and_set_bit(nr, addr);
> +}
>
>  /**
>   * __test_and_set_bit - Set a bit and return its old value
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: sysfs sys/kernel/ namespace (was Re: [PATCH 0/2] add new notifier function ,take2)

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 15:45, Greg KH wrote:
> On Thu, Oct 25, 2007 at 12:31:06PM +1000, Nick Piggin wrote:
> > On Wednesday 24 October 2007 21:12, Kay Sievers wrote:
> > > On 10/24/07, Nick Piggin <[EMAIL PROTECTED]> wrote:

> > > It was intended to be something like /proc/sys/kernel/ only.
> >
> > Really? So you'd be happy to have a /sys/dev /sys/fs /sys/kernel
> > /sys/net /sys/vm etc? "kernel" to me shouldn't really imply the
> > stuff under the kernel/ source directory or other random stuff
> > that doesn't fit into another directory, but attributes that are
> > directly related to the kernel software (rather than directly
> > associated with any device).
>
> What would you want in /sys/net and /sys/dev and /sys/vm?  I don't mind
> putting subdirs in /sys/kernel/ if you want it.

I guess potentially things that are today in /proc/sys/*. Sysfs is much
closer to the "right" place for this kind of attributes than procfs,
isn't it?


> > It would be nice to get a sysfs content maintainer or two. Just
> > having new additions occasionally reviewed along with the rest of
> > a patch, by random people, doesn't really aid consistency. Would it
> > be much trouble to ask that _all_ additions to sysfs be accompanied
> > by notification to this maintainer, along with a few line description?
> > (then merge would require SOB from said maintainer).
>
> No, I would _love_ that.  We should make the requirement that all new
> sysfs files be documented in Documentation/API/ like that details.

Obviously I'm for that too. A mandatory cc to a linux-abi list,
documentation, and an acked-by from the relevant API maintainers, etc.
All it needs is upstream to agree and sometime to implement it.


> I'll be glad to review it, but as it's pretty trivial to add sysfs
> files, everyone ends up doing it :)

If it fits with the overall direction that yourself / Kay / everyone
else has in mind, then yes. Problem is that if this stuff goes
unreviewed, or reviewed by different people who don't have a coherent
idea of what the API should look like, then it ends in a mess that you
can't fix easily.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Is gcc thread-unsafe?

2007-10-24 Thread Nick Piggin
Hi David,

[BTW. can you retain cc lists, please?]

On Thursday 25 October 2007 14:29, David Schwartz wrote:
> > Well that's exactly right. For threaded programs (and maybe even
> > real-world non-threaded ones in general), you don't want to be
> > even _reading_ global variables if you don't need to. Cache misses
> > and cacheline bouncing could easily cause performance to completely
> > tank in some cases while only gaining a cycle or two in
> > microbenchmarks for doing these funny x86 predication things.
>
> For some CPUs, replacing an conditional branch with a conditional move is a
> *huge* win because it cannot be mispredicted.

A *conditional* store should no be a problem.

However the funny trick of doing this conditional add (implemented with
unconditional store), is what is going to cause breakage.

On the CPUs where predicated instructions are a big win, I'd expect
they should also implement a conditional store for use here. However
they might be slower than an unconditional store (eg. x86's cmov),
and in those cases, gcc might just do the non-conditional store.


> In general, compilers should 
> optimize for unshared data since that's much more common in typical code.
> Even for shared data, the usual case is that you are going to access the
> data few times, so pulling the cache line to the CPU is essentially free
> since it will happen eventually.

This is not just a question of data that you were going to use anyway.
gcc generates memory accesses to locations that would never be accessed
Even stores. It is basically impossible to say that this is a real
performance win. Even on single threaded code: consider that cache
misses take the vast majority of time in many loads, which gives a
little hint that maybe it's a bad idea to do this ;)


> Heuristics may show that the vast majority of such constructs write anyway.
> So the optimization may also be valid based on such heuristics.

I'd never say the optimisation would always be useless. But it's a nasty
thing to have on by default, and apparently even with no good way to
supress it even if we want to.


> A better question is whether it's legal for a compiler that claims to
> support POSIX threads. I'm going to post on comp.programming.threads, where
> the threading experts hang out.

Either way, I think we really need a way to turn it off for Linux.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/5] Add lock_page_killable

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 14:11, Andrew Morton wrote:
> On Wed, 24 Oct 2007 08:24:57 -0400 Matthew Wilcox <[EMAIL PROTECTED]> wrote:
> > and associated infrastructure such as sync_page_killable and
> > fatal_signal_pending.  Use lock_page_killable in
> > do_generic_mapping_read() to allow us to kill `cat' of a file on an
> > NFS-mounted filesystem.
>
> whoa, big change.
>
> What exactly are the semantics here?  If the process has actually been
> killed (ie: we know that userspace won't be running again) then we break
> out of a lock_page() and allow the process to exit?  ie: it's basically
> invisible to userspace?

The actual conversions should also be relatively useful groundwork
if we ever want to make more things become generally interruptible.


> If so, it sounds OK.  I guess.  We're still screwed if the process is doing
> a synchronous write and lots of other scenarios.

I don't think it will matter in too many situations. If the process is
doing a synchronous write, nothing is guaranteed until the syscall
returns success...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Is gcc thread-unsafe?

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 13:46, Arjan van de Ven wrote:
> On Thu, 25 Oct 2007 13:24:49 +1000
>
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Andi spotted this exchange on the gcc list. I don't think he's
> > brought it up here yet, but it worries me enough that I'd like
> > to discuss it.
> >
> > Starts here
> > http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html
> >
> > Concrete example here
> > http://gcc.gnu.org/ml/gcc/2007-10/msg00275.html
> >
> > Basically, what the gcc developers are saying is that gcc is
> > free to load and store to any memory location, so long as it
> > behaves as if the instructions were executed in sequence.
>
> this optimization btw is a serious mis-optimization, it makes memory
> more dirty and causes cachelines to become unshared I'm sure it
> works great on microbenchmarks but it sucks bigtime for anything real

Well that's exactly right. For threaded programs (and maybe even
real-world non-threaded ones in general), you don't want to be
even _reading_ global variables if you don't need to. Cache misses
and cacheline bouncing could easily cause performance to completely
tank in some cases while only gaining a cycle or two in
microbenchmarks for doing these funny x86 predication things.

I'm not sure about ia64 -- I _hope_ that for most of their
predication stuff, they also predicate the stores, rather than
just store unconditionally and rely on the source operand not
changing in the case they didn't intend the memory to change.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] TASK_KILLABLE

2007-10-24 Thread Nick Piggin
On Friday 19 October 2007 08:25, Matthew Wilcox wrote:
> This series of patches introduces the facility to deliver only fatal
> signals to tasks which are otherwise waiting uninterruptibly.


This is pretty nice I think. It also is a significant piece of
infrastructure required to fix some of the main oom kill deadlocks.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/5] Make wait_on_retry_sync_kiocb killable

2007-10-24 Thread Nick Piggin
On Friday 19 October 2007 08:26, Matthew Wilcox wrote:
> Use TASK_KILLABLE to allow wait_on_retry_sync_kiocb to return -EINTR.
> All callers then check the return value and break out of their loops.
>
> Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>
> ---
>  fs/read_write.c |   17 -
>  1 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 124693e..3196a3b 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -218,14 +218,15 @@ Einval:
>   return -EINVAL;
>  }
>
> -static void wait_on_retry_sync_kiocb(struct kiocb *iocb)
> +static int wait_on_retry_sync_kiocb(struct kiocb *iocb)
>  {
> - set_current_state(TASK_UNINTERRUPTIBLE);
> + set_current_state(TASK_KILLABLE);
>   if (!kiocbIsKicked(iocb))
>   schedule();
>   else
>   kiocbClearKicked(iocb);
>   __set_current_state(TASK_RUNNING);
> + return fatal_signal_pending(current) ? -EINTR : 0;

Although the EINTR never gets to userspace anyway, is there a good
reason why the last patch for do_generic_mapping_read doesn't pass
back -EINTR?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/5] Use macros instead of TASK_ flags

2007-10-24 Thread Nick Piggin
On Friday 19 October 2007 08:25, Matthew Wilcox wrote:
> Abstracting away direct uses of TASK_ flags allows us to change the
> definitions of the task flags more easily.
>
> Also restructure do_wait() a little
>
> Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>
> ---
>  arch/ia64/kernel/perfmon.c |4 +-
>  fs/proc/array.c|9 +---
>  fs/proc/base.c |2 +-
>  include/linux/sched.h  |   15 +++
>  include/linux/wait.h   |   11 +++--
>  kernel/exit.c  |   90
> +++ kernel/power/process.c |   
> 7 +--
>  kernel/ptrace.c|8 ++--
>  kernel/sched.c |   15 +++
>  kernel/signal.c|6 +-
>  kernel/wait.c  |2 +-
>  11 files changed, 83 insertions(+), 86 deletions(-)
>
> diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
> index f55fa07..6b0a6cf 100644
> --- a/arch/ia64/kernel/perfmon.c
> +++ b/arch/ia64/kernel/perfmon.c
> @@ -2630,7 +2630,7 @@ pfm_task_incompatible(pfm_context_t *ctx, struct
> task_struct *task) */
>   if (task == current) return 0;
>
> - if ((task->state != TASK_STOPPED) && (task->state != TASK_TRACED)) {
> + if (!is_task_stopped_or_traced(task)) {
>   DPRINT(("cannot attach to non-stopped task [%d] state=%ld\n", 
> task->pid,
> task->state)); return -EBUSY;
>   }
> @@ -4790,7 +4790,7 @@ recheck:
>* the task must be stopped.
>*/
>   if (PFM_CMD_STOPPED(cmd)) {
> - if ((task->state != TASK_STOPPED) && (task->state != 
> TASK_TRACED)) {
> + if (!is_task_stopped_or_traced(task)) {
>   DPRINT(("[%d] task not in stopped state\n", task->pid));
>   return -EBUSY;
>   }
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 27b59f5..8939bf0 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -140,13 +140,8 @@ static const char *task_state_array[] = {
>
>  static inline const char *get_task_state(struct task_struct *tsk)
>  {
> - unsigned int state = (tsk->state & (TASK_RUNNING |
> - TASK_INTERRUPTIBLE |
> - TASK_UNINTERRUPTIBLE |
> - TASK_STOPPED |
> - TASK_TRACED)) |
> - (tsk->exit_state & (EXIT_ZOMBIE |
> - EXIT_DEAD));
> + unsigned int state = (tsk->state & TASK_REPORT) |
> + (tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD));
>   const char **p = &task_state_array[0];
>
>   while (state) {
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4fe74d1..e7e1815 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -196,7 +196,7 @@ static int proc_root_link(struct inode *inode, struct
> dentry **dentry, struct vf (task == current || \
>   (task->parent == current && \
>   (task->ptrace & PT_PTRACED) && \
> -  (task->state == TASK_STOPPED || task->state == TASK_TRACED) && \
> +  (is_task_stopped_or_traced(task)) && \
>security_ptrace(current,task) == 0))
>
>  static int proc_pid_cmdline(struct task_struct *task, char * buffer)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c204ab0..5ef5253 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -177,6 +177,21 @@ print_cfs_rq(struct seq_file *m, int cpu, struct
> cfs_rq *cfs_rq) /* in tsk->state again */
>  #define TASK_DEAD64
>
> +/* Convenience macros for the sake of wake_up */
> +#define TASK_NORMAL  (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
> +#define TASK_ALL (TASK_NORMAL | TASK_STOPPED | TASK_TRACED)
> +
> +/* get_task_state() */
> +#define TASK_REPORT  (TASK_RUNNING | TASK_INTERRUPTIBLE | \
> +  TASK_UNINTERRUPTIBLE | TASK_STOPPED | \
> +  TASK_TRACED)

I think it would be nicer if you made it explicit in the name that
these are not individual flags. Maybe it doesn't matter though...

Also, TASK_NORMAL / TASK_ALL aren't very good names. TASK_SLEEP_NORMAL
TASK_SLEEP_ALL might be a bit more helpful?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Is gcc thread-unsafe?

2007-10-24 Thread Nick Piggin
Hi,

Andi spotted this exchange on the gcc list. I don't think he's
brought it up here yet, but it worries me enough that I'd like
to discuss it.

Starts here
http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html

Concrete example here
http://gcc.gnu.org/ml/gcc/2007-10/msg00275.html

Basically, what the gcc developers are saying is that gcc is
free to load and store to any memory location, so long as it
behaves as if the instructions were executed in sequence.

I guess that dynamically allocated memory and computed pointers
are more difficult for gcc to do anything unsafe with, because
it is harder to tell if a given function has deallocated the
memory. However even that could theoretically happen in future
if the compiler can work out the address comes from a global
variable or is not changed intermediately.

Linux makes extensive use of both trylocks and interruptible
locks (ie. which automatically result in divergant code paths,
one of which holds the lock, the other doesn't). However there
are also other code paths which will either hold a particular
lock or will not hold it, depending on context or some flags
etc. barrier() doesn't help.

For x86, obviously the example above shows it can be miscompiled,
but it is probably relatively hard to make it happen for a non
trivial sequence. For an ISA with lots of predicated instructions
like ia64, it would seem to be much more likely. But of course
we don't want even the possibility of failures.

The gcc guys seem to be saying to mark everything volatile that
could be touched in a critical section. This is insane for Linux.

Any thoughts?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB 0:1 SLAB (OOM during massive parallel kernel builds)

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 12:43, Christoph Lameter wrote:
> On Thu, 25 Oct 2007, Nick Piggin wrote:
> > > Ummm... all unreclaimable is set! Are you mlocking the pages in memory?
> > > Or what causes this? All pages under writeback? What is the dirty ratio
> > > set to?
> >
> > Why is SLUB behaving differently, though.
>
> Nore sure. Are we really sure that this does not occur using SLAB?

>From the reports it seems pretty consistent. I guess it could well
be something that may occur with SLAB *if the conditions are a bit
different*...

> > Memory efficiency wouldn't be the reason, would it? I mean, SLUB
> > should be more efficient than SLAB, plus have less data lying around
> > in queues.
>
> SLAB may have data around in queues which (if the stars align the right
> way) may allow it to go longer without having to get a page from the page
> allocator.

But page allocs from slab isn't where the OOMs are occurring, so this
seems unlikely (also, the all_unreclaimable logic now should be pretty
strict, so you have to really run the machine out of memory (1GB of
swap gets fully used, then his DMA32 zone is scanned 8 times without
reclaiming a single page).

That said, parallel kernel compiling can really change a lot in memory
footprint depending on small variations in timing. So it might not be
anything to worry about.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB 0:1 SLAB (OOM during massive parallel kernel builds)

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 12:15, Christoph Lameter wrote:
> On Wed, 24 Oct 2007, Alexey Dobriyan wrote:
> > [12728.701398] DMA free:8032kB min:32kB low:40kB high:48kB active:2716kB
> > inactive:2208kB present:12744kB pages_scanned:9299 all_unreclaimable?
> > yes [12728.701567] lowmem_reserve[]: 0 2003 2003 2003 [12728.701654]
>
> Ummm... all unreclaimable is set! Are you mlocking the pages in memory? Or
> what causes this? All pages under writeback? What is the dirty ratio set
> to?

Why is SLUB behaving differently, though.

Memory efficiency wouldn't be the reason, would it? I mean, SLUB
should be more efficient than SLAB, plus have less data lying around
in queues.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: sysfs sys/kernel/ namespace (was Re: [PATCH 0/2] add new notifier function ,take2)

2007-10-24 Thread Nick Piggin
On Wednesday 24 October 2007 21:12, Kay Sievers wrote:
> On 10/24/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > On Tuesday 23 October 2007 10:55, Takenori Nagano wrote:
> > > Nick Piggin wrote:
> > > > One thing I'd suggest is not to use debugfs, if it is going to
> > > > be a useful end-user feature.
> > >
> > > Is /sys/kernel/notifier_name/ an appropriate place?
> >
> > I'm curious about the /sys/kernel/ namespace. I had presumed that
> > it is intended to replace /proc/sys/ basically with the same
> > functionality.
>
> It was intended to be something like /proc/sys/kernel/ only.

Really? So you'd be happy to have a /sys/dev /sys/fs /sys/kernel
/sys/net /sys/vm etc? "kernel" to me shouldn't really imply the
stuff under the kernel/ source directory or other random stuff
that doesn't fit into another directory, but attributes that are
directly related to the kernel software (rather than directly
associated with any device).


> > I _assume_ these are system software stats and
> > tunables that are not exactly linked to device drivers (OTOH,
> > where do you draw the line? eg. Would filesystems go here?
>
> We already have /sys/fs/ ?
>
> > Core network algorithm tunables might, but per interface ones probably
> > not...).
>
> We will merge the nonsense of "block/", "class/" and "bus/" to one
> "subsystem". The block, class, bus directories will only be kept as
> symlinks for compatibility. Then every subsystem has a directory like:
> /sys/subsystem/block/, /sys/subsystem/net/ and the devices of the
> subsystem are in a devices/ directory below that. Just like the
> /sys/bus/< name>/devices/ layout looks today. All subsystem-global
> tunables can go below the /sys/subsystem// directory, without
> clashing with the list of devices or anything else.

Makes sense.


> > I don't know. Is there guidelines for sysfs (and procfs for that
> > matter)? Is anyone maintaining it (not the infrastructure, but
> > the actual content)?
>
> Unfortunately, there was never really a guideline.
>
> > It's kind of ironic that /proc/sys/ looks like one of the best
> > organised directories in proc, while /sys/kernel seems to be in
> > danger of becoming a mess: it has kexec and uevent files in the
> > base directory, rather than in subdirectories...
>
> True, just looking at it now, people do crazy things like:
> /sys/kernel/notes, which is a file with binary content, and a name
> nobody will ever be able to guess what it is good for. That should
> definitely go into a section/ directory.  Also the VM stuff there
> should probably move to a /sys/vm/ directory along with the weird
> placed top-level /sys/slab/.

Top level directory IMO should be kept as sparse as possible. If
you agree to /sys/mm for example, that's fine, but then slab should
go under that. (I'd prefer all to go underneath /sys/kernel, but...).

It would be nice to get a sysfs content maintainer or two. Just
having new additions occasionally reviewed along with the rest of
a patch, by random people, doesn't really aid consistency. Would it
be much trouble to ask that _all_ additions to sysfs be accompanied
by notification to this maintainer, along with a few line description?
(then merge would require SOB from said maintainer).

For that matter, this should be the case for *all* userspace API
changes ([EMAIL PROTECTED])
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/5] wait: use lock bitops for __wait_on_bit_lock

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 11:14, Andrew Morton wrote:
> On Wed, 24 Oct 2007 18:13:06 +1000 [EMAIL PROTECTED] wrote:
> > Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
> >
> > ---
> >  kernel/wait.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > Index: linux-2.6/kernel/wait.c
> > ===
> > --- linux-2.6.orig/kernel/wait.c
> > +++ linux-2.6/kernel/wait.c
> > @@ -195,7 +195,7 @@ __wait_on_bit_lock(wait_queue_head_t *wq
> > if ((ret = (*action)(q->key.flags)))
> > break;
> > }
> > -   } while (test_and_set_bit(q->key.bit_nr, q->key.flags));
> > +   } while (test_and_set_bit_lock(q->key.bit_nr, q->key.flags));
> > finish_wait(wq, &q->wait);
> > return ret;
> >  }
>
> Sorry, I'm just not going to apply a patch like that.
>
> I mean, how the heck is anyone else supposed to understand what you're up
> to?

Hmm, I might just withdraw this patch 1/5. This is generally a slowpath,
and it's hard to guarantee that any exported user doesn't rely on the
full barrier here (not that they would know whether they do or not, let
alone document the fact).

I think all the others are pretty clear, though? (read on if no)


> There's a bit of documentation in Documentation/atomic_ops.txt 
> (probably enough, I guess) but it is totally unobvious to 98.3% of kernel
> developers when they should use test_and_set_bit() versus
> test_and_set_bit_lock() and it is far too much work to work out why it was
> used in __wait_on_bit_lock(), whether it is correct, what advantages it
> brings and whether and where others should emulate it.

If you set a bit for the purpose of opening a critical section, then
you can use this. And clear_bit_unlock to close it.

The advantages are that it is faster, and the hapless driver writer
doesn't have to butcher or forget about doing the correct
smp_mb__before_clear_bit(), or have reviewers wondering exactly WTF
that barrier is for, etc.

Basically, it is faster, harder to get wrong, and more self-docuemnting.

In general, others should not emulate it, because they should be
using our regular locking primitives instead. If they really must
roll their own, then using these would be nice, IMO.


> So in my opinion this submission isn't of sufficient quality to be
> included in Linux.
>
> IOW: please write changelogs.  Preferably good ones.

2/5: "tasklet_trylock opens a critical section. tasklet_unlock closes it.
  hence, _lock bitops can be used for the bitops"

5/5: "trylock_page opens a critical section. unlock_page closes it. hence,
  _lock bitops can be used for the bitops"

5/5: "trylock_buffer opens a critical section. unlock_buffer closes it.
  hence, _lock bitops can be used for the bitops"

Are those helpful?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] bitops kernel-doc: expand macro

2007-10-24 Thread Nick Piggin
On Wednesday 24 October 2007 15:09, Randy Dunlap wrote:
> From: Randy Dunlap <[EMAIL PROTECTED]>
>
> Can we expand this macro definition, or should I look for a way to
> fool^W teach kernel-doc about this?
>
> scripts/kernel-doc says:
> Error(linux-2.6.24-rc1//include/asm-x86/bitops_32.h:188): cannot understand
> prototype: 'test_and_set_bit_lock test_and_set_bit '

Actually, it probably looks a bit nicer like this anyway. If you grep
for it, then you can actually see the parameters...

On third thoughts, an inline function might be the best thing to do,
and also avoid setting a bad example. What do you think?

>
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
> ---
>  include/asm-x86/bitops_32.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- linux-2.6.24-rc1.orig/include/asm-x86/bitops_32.h
> +++ linux-2.6.24-rc1/include/asm-x86/bitops_32.h
> @@ -185,7 +185,7 @@ static inline int test_and_set_bit(int n
>   *
>   * This is the same as test_and_set_bit on x86
>   */
> -#define test_and_set_bit_lock test_and_set_bit
> +#define test_and_set_bit_lock(nr, addr) test_and_set_bit(nr, addr)
>
>  /**
>   * __test_and_set_bit - Set a bit and return its old value
> ---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


sysfs sys/kernel/ namespace (was Re: [PATCH 0/2] add new notifier function ,take2)

2007-10-23 Thread Nick Piggin
On Tuesday 23 October 2007 10:55, Takenori Nagano wrote:
> Nick Piggin wrote:

> > One thing I'd suggest is not to use debugfs, if it is going to
> > be a useful end-user feature.
>
> Is /sys/kernel/notifier_name/ an appropriate place?

Hi list,

I'm curious about the /sys/kernel/ namespace. I had presumed that
it is intended to replace /proc/sys/ basically with the same
functionality. I _assume_ these are system software stats and
tunables that are not exactly linked to device drivers (OTOH,
where do you draw the line? eg. Would filesystems go here? Core
network algorithm tunables might, but per interface ones probably
not...).

I don't know. Is there guidelines for sysfs (and procfs for that
matter)? Is anyone maintaining it (not the infrastructure, but
the actual content)?

It's kind of ironic that /proc/sys/ looks like one of the best
organised directories in proc, while /sys/kernel seems to be in
danger of becoming a mess: it has kexec and uevent files in the
base directory, rather than in subdirectories...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interaction between Xen and XFS: stray RW mappings

2007-10-21 Thread Nick Piggin
On Monday 22 October 2007 14:28, dean gaudet wrote:
> On Sun, 21 Oct 2007, Jeremy Fitzhardinge wrote:
> > dean gaudet wrote:
> > > On Mon, 15 Oct 2007, Nick Piggin wrote:
> > >> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> > >> because it generally has to invalidate TLBs on all CPUs.
> > >
> > > why is that?  ignoring 32-bit archs we have heaps of address space
> > > available... couldn't the kernel just burn address space and delay
> > > global TLB invalidate by some relatively long time (say 1 second)?
> >
> > Yes, that's precisely the problem.  xfs does delay the unmap, leaving
> > stray mappings, which upsets Xen.
>
> sounds like a bug in xen to me :)

You could call it a bug I think. I don't know much about Xen though,
whether or not it expects to be able to run an arbitrary OS kernel.

Presumably, the hypervisor _could_ write protect and trap writes to
*all* page table page mappings.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Monday 22 October 2007 04:39, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
> >> Christian Borntraeger <[EMAIL PROTECTED]> writes:
> >>
> >> Let me put it another way.  Looking at /proc/slabinfo I can get
> >> 37 buffer_heads per page.  I can allocate 10% of memory in
> >> buffer_heads before we start to reclaim them.  So it requires just
> >> over 3.7 buffer_heads on very page of low memory to even trigger
> >> this case.  That is a large 1k filesystem or a weird sized partition,
> >> that we have written to directly.
> >
> > On a highmem machine it it could be relatively common.
>
> Possibly.  But the same proportions still hold.  1k filesystems
> are not the default these days and ramdisks are relatively uncommon.
> The memory quantities involved are all low mem.

You don't need 1K filesystems to have buffers attached though,
of course. You can hit the limit with a 4K filesystem with less
than 8GB in pagecache I'd say.



> > You don't want to change that for a stable patch, however.
> > It fixes the bug.
>
> No it avoids the bug which is something slightly different.
> Further I contend that it is not obviously correct that there
> are no other side effects (because it doesn't actually fix the
> bug), and that makes it of dubious value for a backport.

The bug in question isn't exactly that it uses buffercache for its
backing store (although that's obviously rather hairy as well). It's
this specific problem sequence. And it does fix the problem.


> If I had to slap a patch on there at this point just implementing
> an empty try_to_release_page (which disables try_to_free_buffers)
> would be my choice.

How is that better? Now you're making the exact same change for
all filesystems that you didn't think was obviously correct for
rd.c.


> > I just don't think what you have is the proper fix. Calling
> > into the core vfs and vm because right now it does something
> > that works for you but is completely unrelated to what you
> > are conceptually doing is not the right fix.
>
> I think there is a strong conceptual relation and other code
> doing largely the same thing is already in the kernel (ramfs).  Plus
> my gut feel says shared code will make maintenance easier.

ramfs is rather a different case. Filesystems intimately know
about the pagecache.


> You do have a point that the reuse may not be perfect and if that
> is the case we need to watch out for the potential to mess things
> up.
>
> So far I don't see any problems with the reuse.

It's just wrong. I guess if you don't see that by now, then we
have to just disagree.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Monday 22 October 2007 03:56, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > OK, I missed that you set the new inode's aops to the ramdisk_aops
> > rather than the bd_inode. Which doesn't make a lot of sense because
> > you just have a lot of useless aops there now.
>
> Not totally useless as you have mentioned they are accessed by
> the lru so we still need something there just not much.

Well, the ones that are unused are totally useless ;) I would
have expected them to be removed.


> >> Frankly I suspect the whole
> >> issue is to subtle and rare to make any backport make any sense.  My
> >> apologies Christian.
> >
> > It's a data corruption issue. I think it should be fixed.
>
> Of course it should be fixed.  I just don't know if a backport
> makes sense.  The problem once understood is hard to trigger
> and easy to avoid.

I mean backported. That's just me though, I don't know the nuances
of -stable releases. It could be that they rather not risk introducing
something worse which would be fair enough.


> >> Well those pages are only accessed through rd_blkdev_pagecache_IO
> >> and rd_ioctl.
> >
> > Wrong. It will be via the LRU, will get ->writepage() called,
> > block_invalidate_page, etc. And I guess also via sb->s_inodes, where
> > drop_pagecache_sb might do stuff to it (although it probably escapes
> > harm). But you're right that it isn't the obviously correct fix for
> > the problem.
>
> Yes.  We will be accessed via the LRU.  Yes I knew that.

OK it just didn't sound like it, seeing as you said that's the only
way they are accessed.


> The truth is 
> whatever we do we will be visible to the LRU.

No. With my patch, nothing in the ramdisk code is visible to the LRU.
Which is how it should be.


> My preferences run 
> towards having something that is less of a special case then a new
> potentially huge cache that is completely unaccounted for, that we
> have to build and maintain all of the infrastructure for
> independently.

It's not a cache, and it's not unaccounted for. It's specified exactly
with the rd sizing parameters. I don't know why you would say your
patch is better in this regard. Your ramdisk backing store will be
accounted as pagecache, which is completely wrong.


> The ramdisk code doesn't seem interesting enough 
> long term to get people to do independent maintenance.
>
> With my patch we are on the road to being just like the ramdisk
> for maintenance purposes code except having a different GFP mask.

You can be using highmem, BTW. And technically it probably isn't
correct to use GFP_USER.


> > If you think it is a nice way to go, then I think you are
> > blind ;)
>
> Well we each have different tastes.  I think my patch
> is a sane sensible small incremental step that does just what
> is needed to fix the problem.   It doesn't drag a lot of other
> stuff into the problem like a rewrite would so we can easily verify
> it.

The small incremental step that fixes the problem is Christian's
patch.


> > It's horrible. And using truncate_inode_pages / grab_cache_page and
> > new_inode is an incredible argument to save a few lines of code. You
> > obviously didn't realise your so called private pages would get
> > accessed via the LRU, for example.
>
> I did but but that is relatively minor.  Using the pagecache this
> way for this purpose is a well established idiom in the kernel
> so I didn't figure I was introducing anything to hard to maintain.

Where else is this an established idiom?


> > This is making a relatively
> > larger logical change than my patch, because now as well as having
> > a separate buffer cache and backing store, you are also making the
> > backing store pages visible to the VM.
>
> I am continuing to have the backing store pages visible to the VM,
> and from that perspective it is a smaller change then where we are
> today.

It is smaller lines of code. It is a larger change. Because what you
are doing is 2 things. You are firstly discontinuing the use of the
buffer cache for the backing store, and secondly you are introducing
a new backing store which registers an inode with the vfs and pages
with the pagecache.

My patch does the same thing without those two last questionable
steps.

You now have to treat those backing store pages as pagecache pages,
and hope you have set the right flags and registered the right aops.


> Not that we can truly hide pages from the VM. 

Any page you allocate is your private page. The problem is you're
just sending them back to the VM again.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] add new notifier function ,take2

2007-10-21 Thread Nick Piggin
On Thursday 18 October 2007 18:52, Takenori Nagano wrote:
> Vivek Goyal wrote:

> > > My stance is that _all_ the RAS tools (kdb, kgdb, nlkd, netdump, lkcd,
> > > crash, kdump etc.) should be using a common interface that safely puts
> > > the entire system in a stopped state and saves the state of each cpu.
> > > Then each tool can do what it likes, instead of every RAS tool doing
> > > its own thing and they all conflict with each other, which is why this
> > > thread started.
> > >
> > > It is not the kernel's job to decide which RAS tool runs first, second
> > > etc., it is the user's decision to set that policy. Different sites
> > > will want different orders, some will say "go straight to kdump", other
> > > sites will want to invoke a debugger first. Sites must be able to
> > > define that policy, but we hard code the policy into the kernel.
>
> I agreed with him and I made new notifier function that users can change
> the order. Priority value in notifier blocks are hardcoded. If users want
> to change list order, they have to rebuild kernel. I think it is very
> unhappy.

Is it possible to use a single bit of common code and a single
notifier for these things? Or is it too difficult?

One thing I'd suggest is not to use debugfs, if it is going to
be a useful end-user feature.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
> Christian Borntraeger <[EMAIL PROTECTED]> writes:

> Let me put it another way.  Looking at /proc/slabinfo I can get
> 37 buffer_heads per page.  I can allocate 10% of memory in
> buffer_heads before we start to reclaim them.  So it requires just
> over 3.7 buffer_heads on very page of low memory to even trigger
> this case.  That is a large 1k filesystem or a weird sized partition,
> that we have written to directly.

On a highmem machine it it could be relatively common.


> > I still dont fully understand what issues you have with my patch.
> > - it obviously fixes the problem
> > - I am not aware of any regression it introduces
> > - its small
>
> My primary issue with your patch is that it continues the saga the
> trying to use buffer cache to store the data which is a serious
> review problem, and clearly not what we want to do long term.

You don't want to change that for a stable patch, however.
It fixes the bug.


> > One concern you had, was the fact that buffer heads are out of sync with
> > struct pages. Testing your first patch revealed that this is actually
> > needed by reiserfs - and maybe others.
> > I can also see, that my patch looks a bit like a bandaid that cobbles the
> > rd pieces together.
> >
> > Is there anything else, that makes my patch unmergeable in your
> > opinion?
>
> For linus's tree the consensus is that to fix rd.c that we
> need to have a backing store that is stored somewhere besides
> in the page cache/buffer cache for /dev/ram0.   Doing that prevents
> all of the weird issues.
>
> Now we have the question of which patch gets us there.  I contend
> I have implemented it with my last little patch that this thread
> is a reply to.  Nick hasn't seen that just yet.

Or ever will. It wasn't that my whole argument against it is
based on that I mistakenly thought your patch served the bdev
inode directly from its backing store.


> So if we have a small patch that can implement the proper long
> term fix I contend we are in better shape.

I just don't think what you have is the proper fix. Calling
into the core vfs and vm because right now it does something
that works for you but is completely unrelated to what you
are conceptually doing is not the right fix.

Also, the patch I posted is big because it did other stuff
with dynamically allocated ramdisks from loop (ie. a modern
rewrite). As it is applied to rd.c and split into chunks, the
actual patch to switch to the new backing store isn't actually
that big. I'll submit it to -mm after things stabilise after
the merge window too.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG at mm/filemap.c:1749 (2.6.24, jffs2, unionfs)

2007-10-21 Thread Nick Piggin
On Sunday 21 October 2007 18:55, David Woodhouse wrote:
> On Fri, 2007-10-19 at 17:16 +1000, Nick Piggin wrote:
> > if (writtenlen) {
> > -   if (inode->i_size < (pg->index << PAGE_CACHE_SHIFT) +
> > start + writtenlen) { -   inode->i_size = (pg->index
> > << PAGE_CACHE_SHIFT) + start + writtenlen; +   if
> > (inode->i_size < pos + start + writtenlen) {
> > +   inode->i_size = pos + start + writtenlen;
>
> This part seems wrong. Shouldn't it just be pos + writtenlen -- which is
> basically what it was already: pos==(pg->index<http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Sunday 21 October 2007 16:48, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > Yes it does. It is exactly breaking the coherency between block
> > device and filesystem metadata coherency that Andrew cared about.
> > Whether or not that matters, that is a much bigger conceptual
> > change than simply using slightly more (reclaimable) memory in
> > some situations that my patch does.
> >
> > If you want to try convincing people to break that coherency,
> > fine, but it has to be done consistently and everywhere rather than
> > for a special case in rd.c.
>
> Nick.  Reread the patch.  The only thing your arguments have
> established for me is that this patch is not obviously correct.  Which
> makes it ineligible for a back port.

OK, I missed that you set the new inode's aops to the ramdisk_aops
rather than the bd_inode. Which doesn't make a lot of sense because
you just have a lot of useless aops there now.


> Frankly I suspect the whole 
> issue is to subtle and rare to make any backport make any sense.  My
> apologies Christian.

It's a data corruption issue. I think it should be fixed.


> >> The only way we make it to that inode is through block
> >> device I/O so it lives at exactly the same level in the hierarchy as
> >> a real block device.
> >
> > No, it doesn't. A real block device driver does have its own
> > buffer cache as it's backing store. It doesn't know about
> > readpage or writepage or set_page_dirty or buffers or pagecache.
>
> Well those pages are only accessed through rd_blkdev_pagecache_IO
> and rd_ioctl.

Wrong. It will be via the LRU, will get ->writepage() called,
block_invalidate_page, etc. And I guess also via sb->s_inodes, where
drop_pagecache_sb might do stuff to it (although it probably escapes
harm). But you're right that it isn't the obviously correct fix for
the problem.


> >> My patch is the considered rewrite boiled down
> >> to it's essentials and made a trivial patch.
> >
> > What's the considered rewrite here? The rewrite I posted is the
> > only one so far that's come up that I would consider [worthy],
> > while these patches are just more of the same wrongness.
>
> Well it looks like you were blind when you read the patch.

If you think it is a nice way to go, then I think you are
blind ;)


> Because the semantics between the two are almost identical,
> except I managed to implement BLKFLSBUF in a backwards compatible
> way by flushing both the buffer cache and my private cache.  You
> failed to flush the buffer cache in your implementation.

Obviously a simple typo that can be fixed by adding one line
of code.


> Yes. I use an inode 99% for it's mapping and the mapping 99% for it's
> radix_tree.  But having truncate_inode_pages and grab_cache_page
> continue to work sure is convenient.

It's horrible. And using truncate_inode_pages / grab_cache_page and
new_inode is an incredible argument to save a few lines of code. You
obviously didn't realise your so called private pages would get
accessed via the LRU, for example. This is making a relatively
larger logical change than my patch, because now as well as having
a separate buffer cache and backing store, you are also making the
backing store pages visible to the VM.


> I certainly think it makes it a 
> lot simpler to audit the code to change just one thing at a time (the
> backing store) then to rip out and replace everything and then try and
> prove that the two patches are equivalent.

I think it's a bad idea just to stir the shit. We should take the
simple fix for the problem, and then fix it properly.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings.

2007-10-20 Thread Nick Piggin
On Sunday 21 October 2007 14:53, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
> >> Andrew Morton <[EMAIL PROTECTED]> writes:
> >> > I don't think we little angels want to tread here.  There are so many
> >> > weirdo things out there which will break if we bust the coherence
> >> > between the fs and /dev/hda1.
> >>
> >> We broke coherence between the fs and /dev/hda1 when we introduced
> >> the page cache years ago,
> >
> > Not for metadata. And I wouldn't expect many filesystem analysis
> > tools to care about data.
>
> Well tools like dump certainly weren't happy when we made the change.

Doesn't that give you any suspicion that other tools mightn't
be happy if we make this change, then?


> >> and weird hacky cases like
> >> unmap_underlying_metadata don't change that.
> >
> > unmap_underlying_metadata isn't about raw block device access at
> > all, though (if you write to the filesystem via the blockdevice
> > when it isn't expecting it, it's going to blow up regardless).
>
> Well my goal with separating things is so that we could decouple two
> pieces of code that have different usage scenarios, and where
> supporting both scenarios simultaneously appears to me to needlessly
> complicate the code.
>
> Added to that we could then tune the two pieces of code for their
> different users.

I don't see too much complication from it. If we can actually
simplify things or make useful tuning, maybe it will be worth
doing.


> >> Currently only
> >> metadata is more or less in sync with the contents of /dev/hda1.
> >
> > It either is or it isn't, right? And it is, isn't it? (at least
> > for the common filesystems).
>
> ext2 doesn't store directories in the buffer cache.

Oh that's what you mean. OK, agreed there. But for the filesystems
and types of metadata that can now expect to have coherency, doing
this will break that expectation.

Again, I have no opinions either way on whether we should do that
in the long run. But doing it as a kneejerk response to braindead
rd.c code is wrong because of what *might* go wrong and we don't
know about.


> Journaling filesystems and filesystems that do ordered writes
> game the buffer cache.  Putting in data that should not yet
> be written to disk.  That gaming is where reiserfs goes BUG
> and where JBD moves the dirty bit to a different dirty bit.

Filesystems really want better control of writeback, I think.
This isn't really a consequence of the unified blockdev pagecache
/ metadata buffer cache, it is just that most of the important
things they do are with metadata.

If they have their own metadata inode, then they'll need to game
the cache for it, or the writeback code for that inode somehow
too.


> So as far as I can tell what is in the buffer cache is not really
> in sync with what should be on disk at any given movement except
> when everything is clean.

Naturally. It is a writeback cache.


> My suspicion is that actually reading from disk is likely to
> give a more coherent view of things.  Because there at least
> we have the writes as they are expected to be seen by fsck
> to recover the data, and a snapshot there should at least
> be recoverable.  Whereas a snapshot provides not such guarantees.

ext3 fsck I don't think is supposed to be run under a read/write
filesystem, so it's going to explode if you do that regardless.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-20 Thread Nick Piggin
On Sunday 21 October 2007 15:10, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Saturday 20 October 2007 08:51, Eric W. Biederman wrote:
> >> Currently the ramdisk tries to keep the block device page cache pages
> >> from being marked clean and dropped from memory.  That fails for
> >> filesystems that use the buffer cache because the buffer cache is not
> >> an ordinary buffer cache user and depends on the generic block device
> >> address space operations being used.
> >>
> >> To fix all of those associated problems this patch allocates a private
> >> inode to store the ramdisk pages in.
> >>
> >> The result is slightly more memory used for metadata, an extra copying
> >> when reading or writing directly to the block device, and changing the
> >> software block size does not loose the contents of the ramdisk.  Most
> >> of all this ensures we don't loose data during normal use of the
> >> ramdisk.
> >>
> >> I deliberately avoid the cleanup that is now possible because this
> >> patch is intended to be a bug fix.
> >
> > This just breaks coherency again like the last patch. That's a
> > really bad idea especially for stable (even if nothing actually
> > was to break, we'd likely never know about it anyway).
>
> Not a chance.

Yes it does. It is exactly breaking the coherency between block
device and filesystem metadata coherency that Andrew cared about.
Whether or not that matters, that is a much bigger conceptual
change than simply using slightly more (reclaimable) memory in
some situations that my patch does.

If you want to try convincing people to break that coherency,
fine, but it has to be done consistently and everywhere rather than
for a special case in rd.c.


> The only way we make it to that inode is through block 
> device I/O so it lives at exactly the same level in the hierarchy as
> a real block device.

No, it doesn't. A real block device driver does have its own
buffer cache as it's backing store. It doesn't know about
readpage or writepage or set_page_dirty or buffers or pagecache.


> My patch is the considered rewrite boiled down 
> to it's essentials and made a trivial patch.

What's the considered rewrite here? The rewrite I posted is the
only one so far that's come up that I would consider [worthy],
while these patches are just more of the same wrongness.


> It fundamentally fixes the problem, and doesn't attempt to reconcile
> the incompatible expectations of the ramdisk code and the buffer cache.

It fixes the bug in question, but not because it makes any
fundamental improvement to the conceptual ickyness of rd.c. I
don't know what you mean about attempting to reconcile the
incompatible [stuff]?


> > Christian's patch should go upstream and into stable. For 2.6.25-6,
> > my rewrite should just replace what's there. Using address spaces
> > to hold the ramdisk pages just confuses the issue even if they
> > *aren't* actually wired up to the vfs at all. Saving 20 lines is
> > not a good reason to use them.
>
> Well is more like saving 100 lines.  Not having to reexamine complicated
> infrastructure code and doing things the same way ramfs is.

Using radix_tree_insert instead of add_to_page_cache is hardly
complicated. If anything it is simpler because it isn't actually
confusing the issues and it is much better fit with real block
device drivers, which is actually the thing that is familiar to
block device driver maintainers.


> I think 
> that combination is a good reason.  Especially since I can do with a
> 16 line patch as I just demonstrated.  It is a solid and simple
> incremental change.

My opinion is that it is a much bigger behavioural change because
it results in incoherent buffer cache / blockdev cache, and it
results in code which is still ugly and conceptually wrong.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-20 Thread Nick Piggin
On Saturday 20 October 2007 08:51, Eric W. Biederman wrote:
> Currently the ramdisk tries to keep the block device page cache pages
> from being marked clean and dropped from memory.  That fails for
> filesystems that use the buffer cache because the buffer cache is not
> an ordinary buffer cache user and depends on the generic block device
> address space operations being used.
>
> To fix all of those associated problems this patch allocates a private
> inode to store the ramdisk pages in.
>
> The result is slightly more memory used for metadata, an extra copying
> when reading or writing directly to the block device, and changing the
> software block size does not loose the contents of the ramdisk.  Most
> of all this ensures we don't loose data during normal use of the
> ramdisk.
>
> I deliberately avoid the cleanup that is now possible because this
> patch is intended to be a bug fix.

This just breaks coherency again like the last patch. That's a
really bad idea especially for stable (even if nothing actually
was to break, we'd likely never know about it anyway).

Christian's patch should go upstream and into stable. For 2.6.25-6,
my rewrite should just replace what's there. Using address spaces
to hold the ramdisk pages just confuses the issue even if they
*aren't* actually wired up to the vfs at all. Saving 20 lines is
not a good reason to use them.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings.

2007-10-20 Thread Nick Piggin
On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> > I don't think we little angels want to tread here.  There are so many
> > weirdo things out there which will break if we bust the coherence between
> > the fs and /dev/hda1.
>
> We broke coherence between the fs and /dev/hda1 when we introduced
> the page cache years ago,

Not for metadata. And I wouldn't expect many filesystem analysis
tools to care about data.


> and weird hacky cases like 
> unmap_underlying_metadata don't change that.

unmap_underlying_metadata isn't about raw block device access at
all, though (if you write to the filesystem via the blockdevice
when it isn't expecting it, it's going to blow up regardless).


> Currently only 
> metadata is more or less in sync with the contents of /dev/hda1.

It either is or it isn't, right? And it is, isn't it? (at least
for the common filesystems).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG at mm/filemap.c:1749 (2.6.24, jffs2, unionfs)

2007-10-19 Thread Nick Piggin
On Friday 19 October 2007 17:03, Nick Piggin wrote:
> On Friday 19 October 2007 16:05, Erez Zadok wrote:
> > David,
> >
> > I'm testing unionfs on top of jffs2, using 2.6.24 as of linus's commit
> > 4fa4d23fa20de67df919030c1216295664866ad7.  All of my unionfs tests pass
> > when unionfs is stacked on top of jffs2, other than my truncate test --
> > whic tries to truncate files up/down (through the union, which then is
> > passed through to the lower jffs2 f/s).  The same truncate test passes on
> > all other file systems I've tried unionfs/2.6.24 with, as well as all of
> > the earlier kernels that unionfs runs on (2.6.9--2.6.23).  So I tend to
> > think this bug is more probably due to something else going on in 2.6.24,
> > possibly wrt jffs2/mtd.  (Of course, it's still possible that unionfs
> > isn't doing something right -- any pointers?)
> >
> > The oops trace is included below.  Is this a known issue and if so, any
> > fixes?  If this is the first you hear of this problem, let me know and
> > I'll try to narrow it down further.
>
> It's had quite a lot of recent changes in that area -- the "new aops"
> patches.
>
> They've been getting quite a bit of testing in -mm and no such problems,
> but I doubt anyone was doing much unionfs over jffs2, or even much jffs2
> testing with -mm.
>
> The bug smells like jffs2 is actually passing back a "written" length
> greater than the length we passed into it.
>
> The following might show what's happening.

Hmm, looks like jffs2_write_end is writing more than we actually ask it
to, and returns that back.

unsigned aligned_start = start & ~3;

and

if (end == PAGE_CACHE_SIZE) {
/* When writing out the end of a page, write out the
   _whole_ page. This helps to reduce the number of
   nodes in files which have many short writes, like
   syslog files. */
start = aligned_start = 0;
}

These "longer" writes are fine, but they shouldn't get propagated back
to the vm/vfs. Something like the following patch might fix it.

Index: linux-2.6/fs/jffs2/file.c
===
--- linux-2.6.orig/fs/jffs2/file.c
+++ linux-2.6/fs/jffs2/file.c
@@ -255,7 +255,7 @@ static int jffs2_write_end(struct file *
 		   _whole_ page. This helps to reduce the number of
 		   nodes in files which have many short writes, like
 		   syslog files. */
-		start = aligned_start = 0;
+		aligned_start = 0;
 	}
 
 	ri = jffs2_alloc_raw_inode();
@@ -291,14 +291,11 @@ static int jffs2_write_end(struct file *
 	}
 
 	/* Adjust writtenlen for the padding we did, so we don't confuse our caller */
-	if (writtenlen < (start&3))
-		writtenlen = 0;
-	else
-		writtenlen -= (start&3);
+	writtenlen -= min(writtenlen, (start - aligned_start));
 
 	if (writtenlen) {
-		if (inode->i_size < (pg->index << PAGE_CACHE_SHIFT) + start + writtenlen) {
-			inode->i_size = (pg->index << PAGE_CACHE_SHIFT) + start + writtenlen;
+		if (inode->i_size < pos + start + writtenlen) {
+			inode->i_size = pos + start + writtenlen;
 			inode->i_blocks = (inode->i_size + 511) >> 9;
 
 			inode->i_ctime = inode->i_mtime = ITIME(je32_to_cpu(ri->ctime));


Re: BUG at mm/filemap.c:1749 (2.6.24, jffs2, unionfs)

2007-10-19 Thread Nick Piggin
On Friday 19 October 2007 16:05, Erez Zadok wrote:
> David,
>
> I'm testing unionfs on top of jffs2, using 2.6.24 as of linus's commit
> 4fa4d23fa20de67df919030c1216295664866ad7.  All of my unionfs tests pass
> when unionfs is stacked on top of jffs2, other than my truncate test --
> whic tries to truncate files up/down (through the union, which then is
> passed through to the lower jffs2 f/s).  The same truncate test passes on
> all other file systems I've tried unionfs/2.6.24 with, as well as all of
> the earlier kernels that unionfs runs on (2.6.9--2.6.23).  So I tend to
> think this bug is more probably due to something else going on in 2.6.24,
> possibly wrt jffs2/mtd.  (Of course, it's still possible that unionfs isn't
> doing something right -- any pointers?)
>
> The oops trace is included below.  Is this a known issue and if so, any
> fixes?  If this is the first you hear of this problem, let me know and I'll
> try to narrow it down further.

It's had quite a lot of recent changes in that area -- the "new aops"
patches.

They've been getting quite a bit of testing in -mm and no such problems,
but I doubt anyone was doing much unionfs over jffs2, or even much jffs2
testing with -mm.

The bug smells like jffs2 is actually passing back a "written" length
greater than the length we passed into it.

The following might show what's happening.
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2184,6 +2184,7 @@ fs_write_aop_error:
 	return written ? written : status;
 }
 
+#include 
 static ssize_t generic_perform_write(struct file *file,
 struct iov_iter *i, loff_t pos)
 {
@@ -2243,6 +2244,13 @@ again:
 		page, fsdata);
 		if (unlikely(status < 0))
 			break;
+		if (status > copied) {
+			print_symbol("%s returned more than it should!\n", (unsigned long)a_ops->write_end);
+			printk("status = %ld, copied = %lu\n", status, copied);
+			dump_stack();
+			status = copied; /* try to fix */
+		}
+
 		copied = status;
 
 		cond_resched();


[patch] x86: lock bitops

2007-10-18 Thread Nick Piggin

I missed an obvious one!

x86 CPUs are defined not to reorder stores past earlier loads, so there is
no hardware memory barrier required to implement a release-consistent store
(all stores are, by definition).

So ditch the generic lock bitops, and implement optimised versions for x86,
which removes the mfence from __clear_bit_unlock (which is already a useful
primitive for SLUB).

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
Index: linux-2.6/include/asm-x86/bitops_32.h
===
--- linux-2.6.orig/include/asm-x86/bitops_32.h
+++ linux-2.6/include/asm-x86/bitops_32.h
@@ -76,6 +76,20 @@ static inline void clear_bit(int nr, vol
:"Ir" (nr));
 }
 
+/*
+ * clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * clear_bit() is atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock.
+ */
+static inline void clear_bit_unlock(unsigned long nr, volatile unsigned long 
*addr)
+{
+   barrier();
+   clear_bit(nr, addr);
+}
+
 static inline void __clear_bit(int nr, volatile unsigned long * addr)
 {
__asm__ __volatile__(
@@ -83,6 +97,25 @@ static inline void __clear_bit(int nr, v
:"+m" (ADDR)
:"Ir" (nr));
 }
+
+/*
+ * __clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * __clear_bit() is non-atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock if no other CPUs can concurrently
+ * modify other bits in the word.
+ *
+ * No memory barrier is required here, because x86 cannot reorder stores past
+ * older loads. Same principle as spin_unlock.
+ */
+static inline void __clear_bit_unlock(unsigned long nr, volatile unsigned long 
*addr)
+{
+   barrier();
+   __clear_bit(nr, addr);
+}
+
 #define smp_mb__before_clear_bit() barrier()
 #define smp_mb__after_clear_bit()  barrier()
 
@@ -142,6 +175,15 @@ static inline int test_and_set_bit(int n
 }
 
 /**
+ * test_and_set_bit_lock - Set a bit and return its old value for lock
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This is the same as test_and_set_bit on x86
+ */
+#define test_and_set_bit_lock test_and_set_bit
+
+/**
  * __test_and_set_bit - Set a bit and return its old value
  * @nr: Bit to set
  * @addr: Address to count from
@@ -402,7 +444,6 @@ static inline int fls(int x)
 }
 
 #include 
-#include 
 
 #endif /* __KERNEL__ */
 
Index: linux-2.6/include/asm-x86/bitops_64.h
===
--- linux-2.6.orig/include/asm-x86/bitops_64.h
+++ linux-2.6/include/asm-x86/bitops_64.h
@@ -68,6 +68,20 @@ static __inline__ void clear_bit(int nr,
:"dIr" (nr));
 }
 
+/*
+ * clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * clear_bit() is atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock.
+ */
+static inline void clear_bit_unlock(unsigned long nr, volatile unsigned long 
*addr)
+{
+   barrier();
+   clear_bit(nr, addr);
+}
+
 static __inline__ void __clear_bit(int nr, volatile void * addr)
 {
__asm__ __volatile__(
@@ -76,6 +90,24 @@ static __inline__ void __clear_bit(int n
:"dIr" (nr));
 }
 
+/*
+ * __clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * __clear_bit() is non-atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock if no other CPUs can concurrently
+ * modify other bits in the word.
+ *
+ * No memory barrier is required here, because x86 cannot reorder stores past
+ * older loads. Same principle as spin_unlock.
+ */
+static inline void __clear_bit_unlock(unsigned long nr, volatile unsigned long 
*addr)
+{
+   barrier();
+   __clear_bit(nr, addr);
+}
+
 #define smp_mb__before_clear_bit() barrier()
 #define smp_mb__after_clear_bit()  barrier()
 
@@ -133,6 +165,15 @@ static __inline__ int test_and_set_bit(i
 }
 
 /**
+ * test_and_set_bit_lock - Set a bit and return its old value for lock
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This is the same as test_and_set_bit on x86
+ */
+#define test_and_set_bit_lock test_and_set_bit
+
+/**
  * __test_and_set_bit - Set a bit and return its old value
  * @nr: Bit to set
  * @addr: Address to count from
@@ -408,7 +449,6 @@ static __inline__ int fls(int x)
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
 #include 
-#include 
 
 #endif /* __KERNEL__ */
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] synchronize_irq needs a barrier

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 13:28, Herbert Xu wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> >> First of all let's agree on some basic assumptions:
> >>
> >> * A pair of spin lock/unlock subsumes the effect of a full mb.
> >
> > Not unless you mean a pair of spin lock/unlock as in
> > 2 spin lock/unlock pairs (4 operations).
> >
> > *X = 10;
> > spin_lock(&lock);
> > /* *Y speculatively loaded here */
> > /* store to *X leaves CPU store queue here */
> > spin_unlock(&lock);
> > y = *Y;
>
> Good point.
>
> Although in this case we're still safe because in the worst
> cases:
>
> CPU0  CPU1
> irq_sync = 1
> synchronize_irq
>   spin lock
>   load IRQ_INPROGRESS
> irq_sync sync is visible
>   spin unlock
>   spin lock
>   load irq_sync
>   while (IRQ_INPROGRESS)
>   wait
>   return
>   set IRQ_INPROGRESS
>   spin unlock
>   tg3_msi
>   ack IRQ
>   if (irq_sync)
>   return
>   spin lock
>   clear IRQ_INPROGRESS
>   spin unlock
>
> 
>
> CPU0  CPU1
>   spin lock
>   load irq_sync
> irq_sync = 1
> synchronize_irq
>   set IRQ_INPROGRESS
>   spin unlock
>   spin lock
>   load IRQ_INPROGRESS
> irq_sync sync is visible
>   spin unlock
>   while (IRQ_INPROGRESS)
>   wait
>   tg3_msi
>   ack IRQ
>   if (irq_sync)
>   return
>   do work
>   spin lock
>   clear IRQ_INPROGRESS
>   spin unlock
>   return
>
> So because we're using the same lock on both sides, it does
> do the right thing even without the memory barrier.


Yeah, if you've got the lock on both sides there, then I
agree it will be correct.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] synchronize_irq needs a barrier

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 12:32, Herbert Xu wrote:

> First of all let's agree on some basic assumptions:
>
> * A pair of spin lock/unlock subsumes the effect of a full mb.

Not unless you mean a pair of spin lock/unlock as in
2 spin lock/unlock pairs (4 operations).

*X = 10;
spin_lock(&lock);
/* *Y speculatively loaded here */
/* store to *X leaves CPU store queue here */
spin_unlock(&lock);
y = *Y;

> * A spin lock in general only equates to (SS/SL/LL).
> * A spin unlock in general only equates to (SS/LS).

I don't use the sparc barriers, so they don't come naturally to
me ;)

I think both loads and stores can pass into the critical section
by having the spin_lock pass earlier ops, or by having spin_unlock
be passed by later ones.


> In particular, a load after a spin unlock may pass before the
> spin unlock.
>
> Here is the race (with tg3 as the only example that I know of).
> The driver attempts to quiesce interrupts such that after the
> call to synchronize_irq it is intended that no further IRQ
> handler calls for that device will do any work besides acking
> the IRQ.
>
> Here is how it does it:
>
> CPU0  CPU1
>   spin lock
>   load irq_sync
> irq_sync = 1
> smp_mb
> synchronize_irq()
>   while (IRQ_INPROGRESS)
>   wait
>   return
>   set IRQ_INPROGRESS
>   spin unlock
>   tg3_msi
>   ack IRQ
>   if (irq_sync)
>   return
>   do work
>
> The problem here is that load of irq_sync in the handler has
> passed above the setting of IRQ_INPROGRESS.
>
> Linus's patch fixes it because this becomes:
>
> CPU0  CPU1
>   spin lock
>   load irq_sync
> irq_sync = 1
> smp_mb
> synchronize_irq
>   set IRQ_INPROGRESS
>   spin unlock
>   spin lock
>   spin unlock
>   tg3_msi
>   ack IRQ
>   if (irq_sync)
>   return
>   do work
>   while (IRQ_INPROGRESS)
>   wait
>   spin lock
>   clear IRQ_INPROGRESS
>   spin unlock
>   return
>
> Even though we still do the work on the right we will now notice
> the INPROGRESS flag on the left and wait.
>
> It's hard to fix this in the drivers because they'd either have
> to access the desc lock or add a full mb to the fast path on the
> right.
>
> Once this goes in we can also remove the smp_mb from tg3.c.  BTW,
> a lot of drivers (including the fusion example Ben quoted) call
> synchronize_irq before free_irq.  This is unnecessary because
> the latter already calls it anyway.
>
> Cheers,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: Avoid atomic operation for slab_unlock

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 12:01, Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Nick Piggin wrote:
> > > Yes that is what I attempted to do with the write barrier. To my
> > > knowledge there are no reads that could bleed out and I wanted to avoid
> > > a full fence instruction there.
> >
> > Oh, OK. Bit risky ;) You might be right, but anyway I think it
> > should be just as fast with the optimised bit_unlock on most
> > architectures.
>
> How expensive is the fence? An store with release semantics would be safer
> and okay for IA64.

I'm not sure, I had an idea it was relatively expensive on ia64,
but I didn't really test with a good workload (a microbenchmark
probably isn't that good because it won't generate too much out
of order memory traffic that needs to be fenced).


> > Which reminds me, it would be interesting to test the ia64
> > implementation I did. For the non-atomic unlock, I'm actually
> > doing an atomic operation there so that it can use the release
> > barrier rather than the mf. Maybe it's faster the other way around
> > though? Will be useful to test with something that isn't a trivial
> > loop, so the slub case would be a good benchmark.
>
> Lets avoid mf (too expensive) and just use a store with release semantics.

OK, that's what I've done at the moment.


> Where can I find your patchset? I looked through lkml but did not see it.

Infrastructure in -mm, starting at bitops-introduce-lock-ops.patch.
bit_spin_lock-use-lock-bitops.patch and ia64-lock-bitops.patch are
ones to look at.

The rest of the patches I have queued here, apart from the SLUB patch,
I guess aren't so interesting to you (they don't do anything fancy
like convert to non-atomic unlocks, just switch things like page and
buffer locks to use new bitops).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: Avoid atomic operation for slab_unlock

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 11:21, Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Nick Piggin wrote:
> > Ah, thanks, but can we just use my earlier patch that does the
> > proper __bit_spin_unlock which is provided by
> > bit_spin_lock-use-lock-bitops.patch
>
> Ok.
>
> > This primitive should have a better chance at being correct, and
> > also potentially be more optimised for each architecture (it
> > only has to provide release consistency).
>
> Yes that is what I attempted to do with the write barrier. To my knowledge
> there are no reads that could bleed out and I wanted to avoid a full fence
> instruction there.

Oh, OK. Bit risky ;) You might be right, but anyway I think it
should be just as fast with the optimised bit_unlock on most
architectures.

Which reminds me, it would be interesting to test the ia64
implementation I did. For the non-atomic unlock, I'm actually
doing an atomic operation there so that it can use the release
barrier rather than the mf. Maybe it's faster the other way around
though? Will be useful to test with something that isn't a trivial
loop, so the slub case would be a good benchmark.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kswapd0 inefficient?

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 08:05, Richard Jelinek wrote:
> Hello guys,
>
> I'm not subscribed to this list, so if you find this question valid
> enough to answer it, please cc me. Thanks.
>
> This is what the top-output looks like on my machine after having
> copied about 550GB of data from a twofish256 crypted disk to a raid
> array:
> --
> Mem:   8178452k total,  8132180k used,46272k free,  2743480k buffers
> Swap:0k total,0k used,0k free,  4563032k cached
>
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>  5954 root   0 -20 000 R   62  0.0  96:42.61 loop0
>  6014 root  18   0 23744  19m  484 R   20  0.2  25:45.31 cp
>   255 root  10  -5 000 S8  0.0  10:21.82 kswapd0
>  6011 root  10  -5 000 D6  0.0   4:15.66 kjournald
> ...yadda yadda...
> --
>
> And what do we see here? We see loop0 and cp eating up some
> time. That's ok for me considered the work they do. kjournald is also
> ok for me, but I ask myself: why the heck has kswapd0 crunched 10+
> minutes of CPU time?
>
> I mean what does kswapd0 do?
> http://www.linuxforums.org/forum/linux-kernel/65380-what-does-kswapd0-do.ht
>ml
>
> And I have no swap - right? So it should just shut up - IMHO. Or am I
> missing something?

kswapd also reclaims pagecache, not just anonymous memory. It runs
in response to memory pressure and if it wasn't around, then all
your apps requesting memory would have to do basically the same
amount of work themselves.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: Avoid atomic operation for slab_unlock

2007-10-18 Thread Nick Piggin
[sorry, I read and replied to my inbox before mailing lists...
please ignore the last mail on this patch, and reply to this
one which is properly threaded]

On Friday 19 October 2007 08:15, Christoph Lameter wrote:
> Currently page flags are only modified in SLUB under page lock. This means
> that we do not need an atomic operation to release the lock since there
> is nothing we can race against that is modifying page flags. We can simply
> clear the bit without the use of an atomic operation and make sure that
> this change becomes visible after the other changes to slab metadata
> through a memory barrier.
>
> The performance of slab_free() increases 10-15% (SMP configuration doing
> a long series of remote frees).

Ah, thanks, but can we just use my earlier patch that does the
proper __bit_spin_unlock which is provided by
bit_spin_lock-use-lock-bitops.patch

This primitive should have a better chance at being correct, and
also potentially be more optimised for each architecture (it
only has to provide release consistency).

I have attached the patch here just for reference, but actually
I am submitting it properly as part of a patch series today, now
that the base bit lock patches have been sent upstream.


>  mm/slub.c |   15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff -puN mm/slub.c~slub-avoid-atomic-operation-for-slab_unlock mm/slub.c
> --- a/mm/slub.c~slub-avoid-atomic-operation-for-slab_unlock
> +++ a/mm/slub.c
> @@ -1181,9 +1181,22 @@ static __always_inline void slab_lock(st
>   bit_spin_lock(PG_locked, &page->flags);
>  }
>
> +/*
> + * Slab unlock version that avoids having to use atomic operations
> + * (echos some of the code of bit_spin_unlock!)
> + */
>  static __always_inline void slab_unlock(struct page *page)
>  {
> - bit_spin_unlock(PG_locked, &page->flags);
> +#ifdef CONFIG_SMP
> + unsigned long flags;
> +
> + flags = page->flags & ~(1 << PG_locked);
> +
> + smp_wmb();
> + page->flags = flags;
> +#endif
> + preempt_enable();
> + __release(bitlock);
>  }

This looks wrong, because it would allow the store unlocking
flags to pass a load within the critical section.

stores aren't allowed to pass loads in x86 (only vice versa),
so you might have been confused by looking at x86's spinlocks
into thinking this will work. However on powerpc and sparc, I
don't think it gives you the right types of barriers.

Slub can use the non-atomic version to unlock because other flags will not
get modified with the lock held.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
 mm/slub.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -1185,7 +1185,7 @@ static __always_inline void slab_lock(st
 
 static __always_inline void slab_unlock(struct page *page)
 {
-	bit_spin_unlock(PG_locked, &page->flags);
+	__bit_spin_unlock(PG_locked, &page->flags);
 }
 
 static __always_inline int slab_trylock(struct page *page)


Re: + slub-avoid-atomic-operation-for-slab_unlock.patch added to -mm tree

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 08:31, [EMAIL PROTECTED] wrote:

> --
> Subject: SLUB: avoid atomic operation for slab_unlock
> From: Christoph Lameter <[EMAIL PROTECTED]>
>
> Currently page flags are only modified in SLUB under page lock.  This means
> that we do not need an atomic operation to release the lock since there is
> nothing we can race against that is modifying page flags.  We can simply
> clear the bit without the use of an atomic operation and make sure that
> this change becomes visible after the other changes to slab metadata
> through a memory barrier.
>
> The performance of slab_free() increases 10-15% (SMP configuration doing
> a long series of remote frees).

Ah, thanks, but can we just use my earlier patch that does the
proper __bit_spin_unlock which is provided by
bit_spin_lock-use-lock-bitops.patch

This primitive should have a better chance at being correct, and
also potentially be more optimised for each architecture (it
only has to provide release consistency).

I have attached the patch here just for reference, but actually
I am submitting it properly as part of a patch series today, now
that the base bit lock patches have been sent upstream.


>  mm/slub.c |   15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff -puN mm/slub.c~slub-avoid-atomic-operation-for-slab_unlock mm/slub.c
> --- a/mm/slub.c~slub-avoid-atomic-operation-for-slab_unlock
> +++ a/mm/slub.c
> @@ -1181,9 +1181,22 @@ static __always_inline void slab_lock(st
>   bit_spin_lock(PG_locked, &page->flags);
>  }
>
> +/*
> + * Slab unlock version that avoids having to use atomic operations
> + * (echos some of the code of bit_spin_unlock!)
> + */
>  static __always_inline void slab_unlock(struct page *page)
>  {
> - bit_spin_unlock(PG_locked, &page->flags);
> +#ifdef CONFIG_SMP
> + unsigned long flags;
> +
> + flags = page->flags & ~(1 << PG_locked);
> +
> + smp_wmb();
> + page->flags = flags;
> +#endif
> + preempt_enable();
> + __release(bitlock);
>  }

This looks wrong, because it would allow the store unlocking
flags to pass a load within the critical section.

stores aren't allowed to pass loads in x86 (only vice versa),
so you might have been confused by looking at x86's spinlocks
into thinking this will work. However on powerpc and sparc, I
don't think it gives you the right types of barriers.
Slub can use the non-atomic version to unlock because other flags will not
get modified with the lock held.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
 mm/slub.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -1185,7 +1185,7 @@ static __always_inline void slab_lock(st
 
 static __always_inline void slab_unlock(struct page *page)
 {
-	bit_spin_unlock(PG_locked, &page->flags);
+	__bit_spin_unlock(PG_locked, &page->flags);
 }
 
 static __always_inline int slab_trylock(struct page *page)


Re: Kernel 2.6.21.6:"swapper: page allocation failure. order:1, mode:0x20"

2007-10-18 Thread Nick Piggin
On Thursday 18 October 2007 16:16, Andrew A. Razdolsky wrote:
> Hello!
>
> In attachments i did pick all info i know about this failure.

Hi,

Does this actually cause problems for your system? Occasional
page allocation failures from interrupt context are expected.

If you are getting a lot of these messages, you can try
increasing /proc/sys/vm/min_free_kbytes until they go away.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How Inactive may be much greather than cached?

2007-10-18 Thread Nick Piggin
On Thursday 18 October 2007 17:14, Vasily Averin wrote:
> Nick Piggin wrote:
> > Hi,
> >
> > On Thursday 18 October 2007 16:24, Vasily Averin wrote:
> >> Hi all,
> >>
> >> could anybody explain how "inactive" may be much greater than "cached"?
> >> stress test (http://weather.ou.edu/~apw/projects/stress/) that writes
> >> into removed files in cycle puts the node to the following state:
> >>
> >> MemTotal: 16401648 kB
> >> MemFree:636644 kB
> >> Buffers:   1122556 kB
> >> Cached: 362880 kB
> >> SwapCached:700 kB
> >> Active:1604180 kB
> >> Inactive: 13609828 kB
> >>
> >> At the first glance memory should be freed on file closing, nobody
> >> refers to file and ext3_delete_inode() truncates inode. We can see that
> >> memory is go away from "cached", however could somebody explain why it
> >> become "invalid" instead be freed? Who holds the references to these
> >> pages?
> >
> > Buffers, swap cache, and anonymous.
>
> But buffers and swap cache are low (1.1 Gb and 700kB in this example) and
> anonymous should go away when process finished.

Ah, I didn't see it was an order of magnitude out.

Some filesystems, including I believe, ext3 with data=ordered,
can leave orphaned pages around after they have been truncated
out of the pagecache. These pages get left on the LRU and vmscan
reclaims them pretty easily.

Try ext3 data=writeback, or even ext2.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How Inactive may be much greather than cached?

2007-10-17 Thread Nick Piggin
Hi,

On Thursday 18 October 2007 16:24, Vasily Averin wrote:
> Hi all,
>
> could anybody explain how "inactive" may be much greater than "cached"?
> stress test (http://weather.ou.edu/~apw/projects/stress/) that writes into
> removed files in cycle puts the node to the following state:
>
> MemTotal: 16401648 kB
> MemFree: 636644 kB
> Buffers: 1122556 kB
> Cached: 362880 kB
> SwapCached: 700 kB
> Active: 1604180 kB
> Inactive: 13609828 kB
>
> At the first glance memory should be freed on file closing, nobody refers
> to file and ext3_delete_inode() truncates inode. We can see that memory is
> go away from "cached", however could somebody explain why it become
> "invalid" instead be freed? Who holds the references to these pages?

Buffers, swap cache, and anonymous.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings.

2007-10-17 Thread Nick Piggin
On Thursday 18 October 2007 13:59, Eric W. Biederman wrote:
> If filesystems care at all they want absolute control over the buffer
> cache.  Controlling which buffers are dirty and when.  Because we
> keep the buffer cache in the page cache for the block device we have
> not quite been giving filesystems that control leading to really weird
> bugs.

Mmm. Like I said, when a live filesystem is mounted on a bdev,
it isn't like you want userspace to go dancing around on it without
knowing exactly what it is doing.

The kernel more or less does the right thing here with respect to
the *state* of the data[*] (that is, buffer heads and pagecache).
It's when you actually start changing the data itself around is when
you'll blow up the filesystem.

[*] The ramdisk code is simply buggy, right? (and not the buffer
cache)

The idea of your patch in theory is OK, but Andrew raises valid
points about potential coherency problems, I think.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-17 Thread Nick Piggin
On Thursday 18 October 2007 04:45, Eric W. Biederman wrote:
> At this point my concern is what makes a clean code change in the
> kernel.  Because user space can currently play with buffer_heads
> by way of the block device and cause lots of havoc (see the recent

Well if userspace is writing to the filesystem metadata via the
blockdevice while it is running... that's the definition of havoc,
isn't it? ;) Whether or not the writes are going via a unified
metadata/blockdev cache or separate ones.

You really just have to not do that.

The actual reiserfs problem being seen is not because of userspace
going silly, but because ramdisk is hijacking the dirty bits.


> If that change is made then it happens that the current ramdisk
> would not need to worry about buffer heads and all of that
> nastiness and could just lock pages in the page cache.  It would not
> be quite as good for testing filesystems but retaining the existing
> characteristics would be simple.

No, it wouldn't. Because if you're proposing to split up the buffer
cache and the metadata cache, then you're back to a 2 cache
solution which is basically has the memory characteristics of my
proposal while still being horribly incestuous with the pagecache.


> After having looked a bit deeper the buffer_heads and the block
> devices don't look as intricately tied up as I had first thought.
> We still have the nasty case of:
>   if (buffer_new(bh))
>   unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> That I don't know how it got merged.  But otherwise the caches
> are fully separate.

Well its needed because some filesystems forget about their old
metadata. It's not really there to solve aliasing with the blockdev
pagecache.


> So currently it looks to me like there are two big things that will
> clean up that part of the code a lot:
> - moving the metadata buffer_heads to a magic filesystem inode.
> - Using a simpler non-buffer_head returning version of get_block
>   so we can make simple generic code for generating BIOs.

Although this is going off the track of the ramdisk problem. For
that we should just do the rewrite.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-17 Thread Nick Piggin
On Wednesday 17 October 2007 20:30, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Tuesday 16 October 2007 18:08, Nick Piggin wrote:
> >> On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:
> >> > > What magic restrictions on page allocations? Actually we have
> >> > > fewer restrictions on page allocations because we can use
> >> > > highmem!
> >> >
> >> > With the proposed rewrite yes.
> >
> > Here's a quick first hack...
> >
> > Comments?
>
> I have beaten my version of this into working shape, and things
> seem ok.
>
> However I'm beginning to think that the real solution is to remove
> the dependence on buffer heads for caching the disk mapping for
> data pages, and move the metadata buffer heads off of the block
> device page cache pages.  Although I am just a touch concerned
> there may be an issue with using filesystem tools while the
> filesystem is mounted if I move the metadata buffer heads.
>
> If we were to move the metadata buffer heads (assuming I haven't
> missed some weird dependency) then I think there a bunch of
> weird corner cases that would be simplified.

I'd prefer just to go along the lines of what I posted. It's
a pure block device driver which knows absolutely nothing about
vm or vfs.

What are you guys using rd for, and is it really important to
have this supposed buffercache optimisation?


> I guess that is where I look next.
>
> Oh for what it is worth I took a quick look at fsblock and I don't think
> struct fsblock makes much sense as a block mapping translation layer for
> the data path where the current page caches works well.

Except the page cache doesn't store any such translation. fsblock
works very nicely as a buffer_head and nobh-mode replacement there,
but we could probably go one better (for many filesystems) by using
a tree.


> For less 
> then the cost of 1 fsblock I can cache all of the translations for a
> 1K filesystem on a 4K page.

I'm not exactly sure what you mean, unless you're talking about an
extent based data structure. fsblock is fairly slim as far as a
generic solution goes. You could probably save 4 or 8 bytes in it,
but I don't know if it's worth the trouble.


> I haven't looked to see if fsblock makes sense to use as a buffer head
> replacement yet.

Well I don't want to get too far off topic on the subject of fsblock,
and block mappings (because I think the ramdisk driver simply should
have nothing to do with any of that)... but fsblock is exactly a
buffer head replacement (so if it doesn't make sense, then I've
screwed something up badly! ;))
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LFENCE instruction (was: [rfc][patch 3/3] x86: optimise barriers)

2007-10-17 Thread Nick Piggin
On Wed, Oct 17, 2007 at 01:51:17PM +0800, Herbert Xu wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> >
> > Also, for non-wb memory. I don't think the Intel document referenced
> > says anything about this, but the AMD document says that loads can pass
> > loads (page 8, rule b).
> > 
> > This is why our rmb() is still an lfence.
> 
> BTW, Xen (in particular, the code in drivers/xen) uses mb/rmb/wmb
> instead of smp_mb/smp_rmb/smp_wmb when it accesses memory that's
> shared with other Xen domains or the hypervisor.
> 
> The reason this is necessary is because even if a Xen domain is
> UP the hypervisor might be SMP.
> 
> It would be nice if we can have these adopt the new SMP barriers
> on x86 instead of the IO ones as they currently do.

That's a good point actually. Something like raw_smp_*mb, which
always orders memory, but only for regular WB operatoins. I could
put that on the todo list...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LFENCE instruction (was: [rfc][patch 3/3] x86: optimise barriers)

2007-10-17 Thread Nick Piggin
On Wed, Oct 17, 2007 at 02:30:32AM +0200, Mikulas Patocka wrote:
> > > You already must not place any data structures into WC memory --- for 
> > > example, spinlocks wouldn't work there.
> > 
> > What do you mean "already"?
> 
> I mean "in current kernel" (I checked it in 2.6.22)

Ahh, that's not "current kernel", though ;)

4071c718555d955a35e9651f77086096ad87d498

 
> > If we already have drivers loading data from
> > WC memory, then rmb() needs to order them, whether or not they actually
> > need it. If that were prohibitively costly, then we'd introduce a new
> > barrier which does not order WC memory, right?
> > 
> > 
> > > wmb() also won't work on WC 
> > > memory, because it assumes that writes are ordered.
> > 
> > You mean the one defined like this:
> >   #define wmb()   asm volatile("sfence" ::: "memory")
> > ? If it assumed writes are ordered, then it would just be a barrier().
> 
> You read wrong part of the include file. Really, it is 
> (2.6.22,include/asm-i386/system.h):
> #ifdef CONFIG_X86_OOSTORE
> #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", 
> X86_FEATURE_XMM)
> #else
> #define wmb()   __asm__ __volatile__ ("": : :"memory")
> #endif
> 
> CONFIG_X86_OOSTORE is dependent on MWINCHIP3D || MWINCHIP2 || MWINCHIPC6
> --- so on Intel and AMD, it is really just barrier().
> 
> So drivers can't assume that wmb() works on write-combining memory.

Drivers should be able to assume that wmb() orders _everything_ (except
some whacky Altix thing, which I really want to fold under wmb at some
point anyway).

So I decided that old x86 semantics isn't right, and now it really is a
lock op / sfence everywhere.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-16 Thread Nick Piggin
On Wednesday 17 October 2007 11:13, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:

> > We have 2 problems. First is that, for testing/consistency, we
> > don't want BLKFLSBUF to throw out the data. Maybe hardly anything
> > uses BLKFLSBUF now, so it could be just a minor problem, but still
> > one to fix.
>
> Hmm.  This is interesting because we won't be doing anything that
> effects correctness if we don't special case BLKFLSBUF just something
> that effects efficiency.  So I think we can get away with just
> changing blkflsbuf as long as there is a way to get rid of
> the data.

Technically, it does change correctness: after BLKFLSBUF, the
ramdisk should contain zeroes.

I'm assuming it would also cause problems in tight embedded
environments if ramdisk ram is supposed to be thrown away but
isn't. So maybe not technically a correctness problem, but could
be the difference between working and not working.


> > Second is actually throwing out the ramdisk data. dd from /dev/null
> > isn't trivial because it isn't a "command" from the kernel's POV.
> > rd could examine the writes to see if they are zero and page aligned,
> > I suppose... but if you're transitioning everyone over to a new
> > method anyway, might as well make it a nice one ;)
>
> Well I was thinking you can examine the page you just wrote to
> and if it is all zero's you don't need to cache that page anymore.
> Call it intelligent compression.
>
> Further it does make forwards and backwards compatibility simple
> because all you would have to do to reliably free a ramdisk is:
>
> dd if=/dev/zero of=/dev/ramX
> blockdev --flushbufs /dev/ramX

Sure, you could do that, but you still presumably need to support
the old behaviour.

As a test vehicle for filesystems, I'd much rather it didn't do this
of course, because subsequent writes would need to reallocate the
page again.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-16 Thread Nick Piggin
On Wednesday 17 October 2007 09:48, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Wednesday 17 October 2007 07:28, Theodore Tso wrote:
> >> On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote:
> >> > +/*
> >> > + * ram device BLKFLSBUF has special semantics, we want to 
> >> > actually
> >> > + * release and destroy the ramdisk data.
> >> > + */
> >>
> >> We won't be able to fix completely this for a while time, but the fact
> >> that BLKFLSBUF has special semantics has always been a major wart.
> >> Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a
> >> deperecation printk for BLKFLSBUF when passed to the ramdisk?  I doubt
> >> there are many tools that actually take advantage of this wierd aspect
> >> of ramdisks, so hopefully it's something we could remove in a 18
> >> months or so...
> >
> > It would be nice to be able to do that, I agree. The new ramdisk
> > code will be able to flush the buffer cache and destroy its data
> > separately, so it can actually be implemented.
>
> So the practical problem are peoples legacy boot setups but those
> are quickly going away.

After that, is the ramdisk useful for anything aside from testing?


> The sane thing is probably something that can be taken as a low
> level format command for the block device.
>
> Say: dd if=/dev/zero of=/dev/ramX

We have 2 problems. First is that, for testing/consistency, we
don't want BLKFLSBUF to throw out the data. Maybe hardly anything
uses BLKFLSBUF now, so it could be just a minor problem, but still
one to fix.

Second is actually throwing out the ramdisk data. dd from /dev/null
isn't trivial because it isn't a "command" from the kernel's POV.
rd could examine the writes to see if they are zero and page aligned,
I suppose... but if you're transitioning everyone over to a new
method anyway, might as well make it a nice one ;)


> I know rewriting the drive with all zeroes can cause a modern
> disk to redo it's low level format.  And that is something
> we can definitely implement without any backwards compatibility
> problems.
>
> Hmm. Do we have anything special for punching holes in files?
> That would be another sane route to take to remove the special
> case for clearing the memory.

truncate_range, I suppose. A file descriptor syscall based
alternative for madvise would be nice though (like fallocate).

We could always put something in /sys/block/ram*/xxx
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LFENCE instruction (was: [rfc][patch 3/3] x86: optimise barriers)

2007-10-16 Thread Nick Piggin
On Wed, Oct 17, 2007 at 01:05:16AM +0200, Mikulas Patocka wrote:
> > > I see, AMD says that WC memory loads can be out-of-order.
> > > 
> > > There is very little usability to it --- framebuffer and AGP aperture is 
> > > the only piece of memory that is WC and no kernel structures are placed 
> > > there, so it is possible to remove that lfence.
> > 
> > No. In Linux kernel, rmb() means that all previous loads, including to
> > any IO regions, will be executed before any subsequent load.
> 
> You already must not place any data structures into WC memory --- for 
> example, spinlocks wouldn't work there.

What do you mean "already"? If we already have drivers loading data from
WC memory, then rmb() needs to order them, whether or not they actually
need it. If that were prohibitively costly, then we'd introduce a new
barrier which does not order WC memory, right?


> wmb() also won't work on WC 
> memory, because it assumes that writes are ordered.

You mean the one defined like this:
  #define wmb()   asm volatile("sfence" ::: "memory")
? If it assumed writes are ordered, then it would just be a barrier().


> > How can you possibly get rid of lfence from there just because you may
> > happen to *know* that it isn't used (btw. the IO serialisation isn't for
> > kernel data structures, it is for actual IO operations, generally).
> 
> IO regions are in uncached memory, and x86 already serializes it fine. It 
> flushes any write buffers on access to uncached memory.
> 
> (BTW. what is the general portable rule for serializing writel() and 
> readl()? On x86 they are serialized in hardware, but what on other archs?)

Most tend to order them strongly these days. There are also relaxed
variants for architectures that can take advantage of them.


> > Doing that would lead to an unmaintainable mess. If drivers don't need rmb,
> > then they don't call it.
> 
> If wmb() doesn't currently work on write-combining memory, why should 
> rmb() work there?

I don't understand why you say wmb() doesn't work on WC memory. What part
of which spec are you reading (or, given your mistrust of specs, what CPU
are you seeing failures with)?
 

> The purpose of rmb() is to enforce ordering on architectures that don't 
> force it in hardware --- that is not the case of x86.

Well it clearly is the case because I just pointed you to a document
that says they can go out of order. If you want to argue that existing
implementations do not, then by all means go ahead and send a patch to
Linus and see what he says about it ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LFENCE instruction (was: [rfc][patch 3/3] x86: optimise barriers)

2007-10-16 Thread Nick Piggin
On Tue, Oct 16, 2007 at 12:33:54PM +0200, Mikulas Patocka wrote:
> 
> 
> On Tue, 16 Oct 2007, Nick Piggin wrote:
> 
> > > > The cpus also have an explicit set of instructions that deliberately do 
> > > > unordered stores/loads, and s/lfence etc are mostly designed for those.
> > > 
> > > I know about unordered stores (movnti & similar) --- they basically use 
> > > write-combining method on memory that is normally write-back --- and they 
> > > need sfence. But which one instruction does unordered load and needs 
> > > lefence?
> > 
> > Also, for non-wb memory. I don't think the Intel document referenced
> > says anything about this, but the AMD document says that loads can pass
> > loads (page 8, rule b).
> > 
> > This is why our rmb() is still an lfence.
> 
> I see, AMD says that WC memory loads can be out-of-order.
> 
> There is very little usability to it --- framebuffer and AGP aperture is 
> the only piece of memory that is WC and no kernel structures are placed 
> there, so it is possible to remove that lfence.

No. In Linux kernel, rmb() means that all previous loads, including to
any IO regions, will be executed before any subsequent load.

How can you possibly get rid of lfence from there just because you may
happen to *know* that it isn't used (btw. the IO serialisation isn't for
kernel data structures, it is for actual IO operations, generally).

Doing that would lead to an unmaintainable mess. If drivers don't need rmb,
then they don't call it.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-16 Thread Nick Piggin
On Wednesday 17 October 2007 07:28, Theodore Tso wrote:
> On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote:
> > +   /*
> > +* ram device BLKFLSBUF has special semantics, we want to actually
> > +* release and destroy the ramdisk data.
> > +*/
>
> We won't be able to fix completely this for a while time, but the fact
> that BLKFLSBUF has special semantics has always been a major wart.
> Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a
> deperecation printk for BLKFLSBUF when passed to the ramdisk?  I doubt
> there are many tools that actually take advantage of this wierd aspect
> of ramdisks, so hopefully it's something we could remove in a 18
> months or so...

It would be nice to be able to do that, I agree. The new ramdisk
code will be able to flush the buffer cache and destroy its data
separately, so it can actually be implemented.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-16 Thread Nick Piggin
On Wednesday 17 October 2007 05:06, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Tuesday 16 October 2007 08:42, Eric W. Biederman wrote:
> >> I have not observed this case but it is possible to get a dirty page
> >> cache with clean buffer heads if we get a clean ramdisk page with
> >> buffer heads generated by a filesystem calling __getblk and then write
> >> to that page from user space through the block device.  Then we just
> >> need to hit the proper window and try_to_free_buffers() will mark that
> >> page clean and eventually drop it.  Ouch!
> >>
> >> To fix this use the generic __set_page_dirty_buffers in the ramdisk
> >> code so that when we mark a page dirty we also mark it's buffer heads
> >> dirty.
> >
> > Hmm, so we can also have some filesystems writing their own buffers
> > out by hand (clear_buffer_dirty, submit buffer for IO). Other places
> > will do similarly dodgy things with filesystem metadata
> > (fsync_buffers_list, for example).
> >
> > So your buffers get cleaned again, then your pages get cleaned.
>
> So I just took a little bit of time to look at and think about
> the path you are referring to, and I don't see a problem.
>
> The rule with the buffer dirty bit is that you first clear it
> and then you submit the write.  When the write request finally
> makes it's way to rd.c we copy the data if necessary and call
> set_page_dirty.  Which will then mark the page and the buffer
> dirty again.

Oh, maybe you're right. I didn't see it redirty the page there.


> In essence the ramdisk code just attempts to lock buffers in
> ram by setting their dirty bit, just like we do for pages
> in ramfs.

Yeah, which is half the reason why its so complicated. Logically
it should just hold another reference on the pages rather than
interfere with pagecache state, but it can't do that because it
doesn't always know when a new page is inserted.


> > While I said it was a good fix when I saw the patch earlier, I think
> > it's not closing the entire hole, and as such, Christian's patch is
> > probably the way to go for stable.
>
> I thought through the logic in try_to_free_buffers and it actually
> makes sense to me now.  mark_buffer_dirty sets the page dirty bit
> so dirty buffers reside on dirty pages.  When we submit I/O we
> aren't guaranteed to submit all of the dirty buffers for a page
> at once, so we don't clear the page dirty bit.  With the result
> that we can end up with pages with the dirty bit set but all of
> their buffers are clean.

Yep.


> Since we rather deliberately allow truly clean pages to be dropped
> from the ramdisk overriding try_to_free_buffer_pages looks wrong
> because then except for invalidate we can not remove buffers
> from truly clean pages.

Yeah, if your fix works I guess it is better to use it and converge
code rather than diverge it even more.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-16 Thread Nick Piggin
On Tuesday 16 October 2007 18:17, Jan Engelhardt wrote:
> On Oct 16 2007 18:07, Nick Piggin wrote:
> >Changed. But it will hopefully just completely replace rd.c,
> >so I will probably just rename it to rd.c at some point (and
> >change .config options to stay compatible). Unless someone
> >sees a problem with that?
>
> I do not see a problem with keeping brd either.

Just doesn't seem to be any point in making it a new and different
module, assuming it can support exactly the same semantics. I'm
only doing so in these first diffs so that they are easier to read
and also easier to do a side by side comparison / builds with the
old rd.c


> >> It also does not seem needed, since it did not exist before.
> >> It should go, you can set the variable with brd.rd_nr=XXX (same
> >> goes for ramdisk_size).
> >
> >But only if it's a module?
>
> Attributes always work. Try vt.default_red=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
> and you will see.

Ah, nice. (I don't use them much!). Still, backward compat I
think is needed if we are to replace rd.c.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-16 Thread Nick Piggin
On Tuesday 16 October 2007 08:42, Eric W. Biederman wrote:
> I have not observed this case but it is possible to get a dirty page
> cache with clean buffer heads if we get a clean ramdisk page with
> buffer heads generated by a filesystem calling __getblk and then write
> to that page from user space through the block device.  Then we just
> need to hit the proper window and try_to_free_buffers() will mark that
> page clean and eventually drop it.  Ouch!
>
> To fix this use the generic __set_page_dirty_buffers in the ramdisk
> code so that when we mark a page dirty we also mark it's buffer heads
> dirty.

Hmm, so we can also have some filesystems writing their own buffers
out by hand (clear_buffer_dirty, submit buffer for IO). Other places
will do similarly dodgy things with filesystem metadata
(fsync_buffers_list, for example).

So your buffers get cleaned again, then your pages get cleaned.

While I said it was a good fix when I saw the patch earlier, I think
it's not closing the entire hole, and as such, Christian's patch is
probably the way to go for stable.

For mainline, *if* we want to keep the old rd.c around at all, I
don't see any harm in this patch so long as Christian's is merged
as well. Sharing common code is always good.

>
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
> ---
>  drivers/block/rd.c |   13 +
>  1 files changed, 1 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/block/rd.c b/drivers/block/rd.c
> index 701ea77..84163da 100644
> --- a/drivers/block/rd.c
> +++ b/drivers/block/rd.c
> @@ -178,23 +178,12 @@ static int ramdisk_writepages(struct address_space
> *mapping, return 0;
>  }
>
> -/*
> - * ramdisk blockdev pages have their own ->set_page_dirty() because we
> don't - * want them to contribute to dirty memory accounting.
> - */
> -static int ramdisk_set_page_dirty(struct page *page)
> -{
> - if (!TestSetPageDirty(page))
> - return 1;
> - return 0;
> -}
> -
>  static const struct address_space_operations ramdisk_aops = {
>   .readpage   = ramdisk_readpage,
>   .prepare_write  = ramdisk_prepare_write,
>   .commit_write   = ramdisk_commit_write,
>   .writepage  = ramdisk_writepage,
> - .set_page_dirty = ramdisk_set_page_dirty,
> + .set_page_dirty = __set_page_dirty_buffers,
>   .writepages = ramdisk_writepages,
>  };
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Preserve the dirty bit in init_page_buffers()

2007-10-16 Thread Nick Piggin
On Tuesday 16 October 2007 08:40, Eric W. Biederman wrote:
> The problem:  When we are trying to free buffers try_to_free_buffers()
> will look at ramdisk pages with clean buffer heads and remove the
> dirty bit from the page.  Resulting in ramdisk pages with data that get
> removed from the page cache.  Ouch!
>
> Buffer heads appear on ramdisk pages when a filesystem calls
> __getblk() which through a series of function calls eventually calls
> init_page_buffers().
>
> So to fix the mismatch between buffer head and page state this patch
> modifies init_page_buffers() to transfer the dirty bit from the page to
> the buffer heads like we currently do for the uptodate bit.
>
> This patch is safe as only __getblk calls init_page_buffers, and
> there are only two implementations of block devices cached in the
> page cache.  The generic implementation in block_dev.c and the
> implementation in rd.c.
>
> The generic implementation of block devices always does everything
> in terms of buffer heads so it always has buffer heads allocated
> before a page is marked dirty so this change does not affect it.

This is probably a good idea. Was this causing the reiserfs problems?
If so, I think we should be concentrating on what the real problem
is with reiserfs... (or at least why this so obviously correct
looking patch is wrong).

Acked-by: Nick Piggin <[EMAIL PROTECTED]>

>
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
> ---
>  fs/buffer.c |3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 75b51df..8b87beb 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -972,6 +972,7 @@ init_page_buffers(struct page *page, struct
> block_device *bdev, struct buffer_head *head = page_buffers(page);
>   struct buffer_head *bh = head;
>   int uptodate = PageUptodate(page);
> + int dirty = PageDirty(page);
>
>   do {
>   if (!buffer_mapped(bh)) {
> @@ -980,6 +981,8 @@ init_page_buffers(struct page *page, struct
> block_device *bdev, bh->b_blocknr = block;
>   if (uptodate)
>   set_buffer_uptodate(bh);
> + if (dirty)
> + set_buffer_dirty(bh);
>   set_buffer_mapped(bh);
>   }
>   block++;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-16 Thread Nick Piggin
On Tuesday 16 October 2007 17:52, Jan Engelhardt wrote:
> On Oct 16 2007 17:47, Nick Piggin wrote:
> >Here's a quick first hack...
>
> Inline patches preferred ;-)

Thanks for reviewing it anyway ;)


> >+config BLK_DEV_BRD
> >+tristate "RAM block device support"
> >+---help---
> >+  This is a new  based block driver that replaces BLK_DEV_RAM.
>
> based on what? -^

RAM based.


> >+  To compile this driver as a module, choose M here: the
> >+  module will be called rd.
>
> called brd.ko.

Changed. But it will hopefully just completely replace rd.c,
so I will probably just rename it to rd.c at some point (and
change .config options to stay compatible). Unless someone
sees a problem with that?


> >+/*
> >+ * And now the modules code and kernel interface.
> >+ */
> >+static int rd_nr;
> >+static int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
>
> Perhaps unsigned?
> Perhaps even long for rd_size?

I've taken most of that stuff out of rd.c in an effort to stay
back compatible. I don't know if it really matters to use long?


> >+module_param(rd_nr, int, 0);
> >+MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
> >+module_param(rd_size, int, 0);
> >+MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
> >+MODULE_LICENSE("GPL");
> >+MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
> >+
> >+/* options - nonmodular */
> >+#ifndef MODULE
> >+static int __init ramdisk_size(char *str)
> >+{
> >+rd_size = simple_strtol(str,NULL,0);
> >+return 1;
> >+}
>
> Is this, besides for compatibility, really needed?
>
> >+static int __init ramdisk_size2(char *str)
> >+{
> >+return ramdisk_size(str);
> >+}
> >+static int __init rd_nr(char *str)
>
> Err! Overlapping symbols! The rd_nr() function collides with the rd_nr
> variable.

Thanks... %s gone awry. Fixed to the expected names.


> It also does not seem needed, since it did not exist before. 
> It should go, you can set the variable with brd.rd_nr=XXX (same
> goes for ramdisk_size).

But only if it's a module?


> What's the point of ramdisk_size2()? 

Back compat. When rewriting the internals, I want to try avoid
changing any externals if possible. Whether this is the Right
Way to do it or not, I don't know :P


> >+{
> >+rd_nr = simple_strtol(str, NULL, 0);
> >+return 1;
> >+}
> >+__setup("ramdisk=", ramdisk_size);
> >+__setup("ramdisk_size=", ramdisk_size2);
>
> __setup("ramdisk_size=", ramdisk_size); maybe, or does not that work?

Didn't try it, but the rd.c code does the same thing so I guess it
doesn't (or didn't, when it was written).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch][rfc] rewrite ramdisk

2007-10-16 Thread Nick Piggin
On Tuesday 16 October 2007 18:08, Nick Piggin wrote:
> On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:

> > > What magic restrictions on page allocations? Actually we have
> > > fewer restrictions on page allocations because we can use
> > > highmem!
> >
> > With the proposed rewrite yes.

Here's a quick first hack...

Comments?
This is a rewrite of the ramdisk block device driver.

The old one is really difficult because it effectively implements a block
device which serves data out of its own buffer cache. This makes it crap.

The new one is more like a regular block device driver. It has no idea
about vm/vfs stuff. It's backing store is similar to the buffer cache
(a simple radix-tree of pages), but it doesn't know anything about page
cache (the pages in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem
metadata access goes through an extra copy and gets stored in RAM twice.
However, this downside is only slight, because the real buffercache of the
device is now reclaimable (because we're not playing crazy games with it),
so under memory intensive situations, footprint should effectively be the
same -- maybe even a slight advantage to the new driver because it can also
reclaim buffer heads.

The fact that it now goes through all the regular vm/fs paths makes it
much more useful for testing, too.

---
Index: linux-2.6/drivers/block/Kconfig
===
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -329,6 +329,7 @@ config BLK_DEV_UB
 
 config BLK_DEV_RAM
 	tristate "RAM disk support"
+	depends on !BLK_DEV_BRD
 	---help---
 	  Saying Y here will allow you to use a portion of your RAM memory as
 	  a block device, so that you can make file systems on it, read and
@@ -346,10 +347,24 @@ config BLK_DEV_RAM
 	  Most normal users won't need the RAM disk functionality, and can
 	  thus say N here.
 
+config BLK_DEV_BRD
+	tristate "RAM block device support"
+	---help---
+	  This is a new  based block driver that replaces BLK_DEV_RAM.
+
+	  Note that the kernel command line option "ramdisk=XX" is now
+	  obsolete. For details, read .
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called rd.
+
+	  Most normal users won't need the RAM disk functionality, and can
+	  thus say N here.
+
 config BLK_DEV_RAM_COUNT
 	int "Default number of RAM disks"
 	default "16"
-	depends on BLK_DEV_RAM
+	depends on BLK_DEV_RAM || BLK_DEV_BRD
 	help
 	  The default value is 16 RAM disks. Change this if you know what
 	  are doing. If you boot from a filesystem that needs to be extracted
@@ -357,7 +372,7 @@ config BLK_DEV_RAM_COUNT
 
 config BLK_DEV_RAM_SIZE
 	int "Default RAM disk size (kbytes)"
-	depends on BLK_DEV_RAM
+	depends on BLK_DEV_RAM || BLK_DEV_BRD
 	default "4096"
 	help
 	  The default value is 4096 kilobytes. Only change this if you know
Index: linux-2.6/drivers/block/Makefile
===
--- linux-2.6.orig/drivers/block/Makefile
+++ linux-2.6/drivers/block/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_PS3_DISK)		+= ps3disk.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= rd.o
+obj-$(CONFIG_BLK_DEV_BRD)	+= brd.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_DEV_PS2)	+= ps2esdi.o
 obj-$(CONFIG_BLK_DEV_XD)	+= xd.o
Index: linux-2.6/drivers/block/brd.c
=======
--- /dev/null
+++ linux-2.6/drivers/block/brd.c
@@ -0,0 +1,467 @@
+/*
+ * Ram backed block device driver.
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ *
+ * Parts derived from drivers/block/rd.c, and drivers/block/loop.c.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define PAGE_SECTORS (1 << (PAGE_SHIFT - 9))
+
+struct brd_device {
+	int		brd_number;
+	int		brd_refcnt;
+	loff_t		brd_offset;
+	loff_t		brd_sizelimit;
+	unsigned	brd_blocksize;
+
+	struct request_queue	*brd_queue;
+	struct gendisk		*brd_disk;
+
+	spinlock_t		brd_lock;
+	struct radix_tree_root	brd_pages;
+
+	struct list_head	brd_list;
+};
+
+static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
+{
+	unsigned long idx;
+	struct page *page;
+
+	rcu_read_lock();
+	idx = sector >> (PAGE_SHIFT - 9);
+	page = radix_tree_lookup(&brd->brd_pages, idx);
+	rcu_read_unlock();
+
+	BUG_ON(page && page->index != idx);
+
+	return page;
+}
+
+static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
+{
+	unsigned long idx;
+	struct page *page;
+
+	page = brd_lookup_page(brd, sector);
+	if (page)
+		return page;
+
+	page = alloc_pa

Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> >> make_page_uptodate() is most hideous part I have run into.
> >> It has to know details about other layers to now what not
> >> to stomp.  I think my incorrect simplification of this is what messed
> >> things up, last round.
> >
> > Not really, it's just named funny. That's just a minor utility
> > function that more or less does what it says it should do.
> >
> > The main problem is really that it's implementing a block device
> > who's data comes from its own buffercache :P. I think.
>
> Well to put it another way, mark_page_uptodate() is the only
> place where we really need to know about the upper layers.
> Given that you can kill ramdisks by coding it as:
>
> static void make_page_uptodate(struct page *page)
> {
>   clear_highpage(page);
>   flush_dcache_page(page);
>   SetPageUptodate(page);
> }
>
> Something is seriously non-intuitive about that function if
> you understand the usual rules for how to use the page cache.

You're overwriting some buffers that were uptodate and dirty.
That would be expected to cause problems.


> The problem is that we support a case in the buffer cache
> where pages are partially uptodate and only the buffer_heads
> remember which parts are valid.  Assuming we are using them
> correctly.
>
> Having to walk through all of the buffer heads in make_page_uptodate
> seems to me to be a nasty layering violation in rd.c

Sure, but it's not just about the buffers. It's the pagecache
in general. It is supposed to be invisible to the device driver
and sitting above it, and yet it is taking the buffercache and
using it to pull its data out of.


> > I think it's worthwhile, given that we'd have a "real" looking
> > block device and minus these bugs.
>
> For testing purposes I think I can agree with that.

What non-testing uses does it have?


> >> Having a separate store would
> >> solve some of the problems, and probably remove the need
> >> for carefully specifying the ramdisk block size.  We would
> >> still need the magic restictions on page allocations though
> >> and it we would use them more often as the initial write to the
> >> ramdisk would not populate the pages we need.
> >
> > What magic restrictions on page allocations? Actually we have
> > fewer restrictions on page allocations because we can use
> > highmem!
>
> With the proposed rewrite yes.
>
> > And the lowmem buffercache pages that we currently pin
> > (unsuccessfully, in the case of this bug) are now completely
> > reclaimable. And all your buffer heads are now reclaimable.
>
> Hmm.  Good point.  So in net it should save memory even if
> it consumes a little more in the worst case.

Highmem systems would definitely like it. For others, yes, all
the duplicated pages should be able to get reclaimed if memory
gets tight, along with the buffer heads, so yeah footprint may
be a tad smaller.


> > If you mean GFP_NOIO... I don't see any problem. Block device
> > drivers have to allocate memory with GFP_NOIO; this may have
> > been considered magic or deep badness back when the code was
> > written, but it's pretty simple and accepted now.
>
> Well I always figured it was a bit rude allocating large amounts
> of memory GFP_NOIO but whatever.

You'd rather not, of course, but with dirty data limits now,
it doesn't matter much. (and I doubt anybody outside testing
is going to be hammering like crazy on rd).

Note that the buffercache based ramdisk driver is going to
also be allocating with GFP_NOFS if you're talking about a
filesystem writing to its metadata. In most systems, GFP_NOFS
isn't much different to GFP_NOIO.

We could introduce a mode which allocates pages up front
quite easily if it were a problem (which I doubt it ever would
be).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: OOM killer gripe (was Re: What still uses the block layer?)

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 14:38, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Tuesday 16 October 2007 13:55, Eric W. Biederman wrote:

> > I don't follow your logic. We don't need SWAP > RAM in order to swap
> > effectively, IMO.
>
> The steady state of a system that is heavily and usably swapping but
> not thrashing is that all of the pages in RAM are in the swap cache,
> at least that used to be the case.

Yeah, it works better in 2.6 (and, IIRC later 2.4 kernels).


> > I don't know if there is a causal relationship there. I mean, I
> > think it's been a long time since thrashing was ever a viable mode
> > of operation, right?
>
> Right.  But swapping heavily has been a viable mode of operation
> and that the vast gap in disk random IO performance seems to have
> hurt significantly.

Or, just not improved as fast as everything else is improving.
There isn't too much the kernel can do about that. It just
relatively changes the point at which you'd consider "swapping
heavily", right?


> It be very clear is used to able to run a problem at little below
> full speed with the disk pegged with swap traffic, and I did this
> regularly when I started out with linux.

I can do this now. In make -jhuge tests for example, you can get
a 4GB, 4 core machine to max out a disk with swapping and still
have 0 idle time. Of course you can also go past that point and
your idle time comes up. That's not new though.


> > Maybe desktops just have less need for swapping now, so nobody sees
> > it much until something goes _really_ bad. When I'm using my 256MB
> > machine, unused stuff goes to swap.
>
> There is a bit of truth in the fact that there is less need for
> swapping now.  At the same time however swapping simply does not
> work well right now, and I'm not at all certain why.
>
> >> the disk for is very limited.   I wonder if we could figure out
> >> how to push and pull 1M or bigger chunks into and out of swap?
> >
> > Pulling in 1MB pages can really easily end up compounding the
> > thrashing problem unless you're very sure a significant amount
> > of it will be used.
>
> It's a hard call.  The I/O time for 1MB of contiguous disk data
> is about the I/O time of 512 bytes of contiguous disk data.

And if you're thrashing, then by definition you need to throw
out 1MB of your working set in order to read it in.


> >> I don't know if swap has actually worked since we vmscan stopped
> >> going over the virtual addresses.
> >
> > I do, and it does ;)
>
> Really?  Not just the pushing of unused stuff into swap.

We had several bugs and things that caused swapping performance
regressions vs 2.4 in earlyish 2.6. After those were fixed, we're
pretty competitive with 2.4 in some basic tests I was using. I
haven't run them for a fair while, so something might have broken
since then, I don't know.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: OOM killer gripe (was Re: What still uses the block layer?)

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 13:55, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:

> > How much swap do you have configured? You really shouldn't configure
> > so much unless you do want the kernel to actually use it all, right?
>
> No.
>
> There are three basic swapping scenarios.
> - Pushing unused data out of ram
> - Swapping
> - Thrashing
>
> To effectively swap you need SWAP > RAM because after a little while of
> swapping all of your pages in RAM should be assigned a location in the
> page cache.

I don't follow your logic. We don't need SWAP > RAM in order to swap
effectively, IMO.


> I have not heard of many people swapping and not thrashing lately.
> I think part of the problem is that we do random access to the swap
> partition which makes us seek limited.  And since the number of
> seeks per unit time has been increasing at a linear or slower rate
> that if we are doing random disk I/O then the amount we can use

I don't know if there is a causal relationship there. I mean, I
think it's been a long time since thrashing was ever a viable mode
of operation, right?

Maybe desktops just have less need for swapping now, so nobody sees
it much until something goes _really_ bad. When I'm using my 256MB
machine, unused stuff goes to swap.


> the disk for is very limited.   I wonder if we could figure out
> how to push and pull 1M or bigger chunks into and out of swap?

Pulling in 1MB pages can really easily end up compounding the
thrashing problem unless you're very sure a significant amount
of it will be used.


> I don't know if swap has actually worked since we vmscan stopped
> going over the virtual addresses.

I do, and it does ;)


> > Because if we're not really conservative about OOM killing, then the
> > user who actually really did want to use all the swap they configured
> > gets angry when we kill their jobs without using it all.
>
> I totally agree. The fact that the OOM killer started is a sign that
> the system was completely overwhelmed and nothing better could happen.
>
> In this case my gut feel says limiting the total number of processes
> would have been much more effective then anything at all to do with
> swap. make -j reminds me of the classic fork bomb.

Yep.


> > Would an oom-kill-someone-now sysrq be of help, I wonder?
>
> Well we have SAQ which should kill everything on your current VT
> which should include X and all of it's children.

Which is exactly what you don't want to do if you've just forkbombed
yourself. I missed the fact that we now have a manual oom kill...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 13:14, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Monday 15 October 2007 19:16, Andrew Morton wrote:
> >> On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <[EMAIL PROTECTED]>
> >
> > wrote:
> >> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> >> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> >> > > unmaintained, so decided to sent the patch to you :-).
> >> > > I have CCed Ted, who did work on the code in the 90s. I found no
> >> > > current email address of Chad Page.
> >> >
> >> > This really needs to be fixed...
> >>
> >> rd.c is fairly mind-boggling vfs abuse.
> >
> > Why do you say that? I guess it is _different_, by necessity(?)
> > Is there anything that is really bad?
>
> make_page_uptodate() is most hideous part I have run into.
> It has to know details about other layers to now what not
> to stomp.  I think my incorrect simplification of this is what messed
> things up, last round.

Not really, it's just named funny. That's just a minor utility
function that more or less does what it says it should do.

The main problem is really that it's implementing a block device
who's data comes from its own buffercache :P. I think.


> > I guess it's not nice
> > for operating on the pagecache from its request_fn, but the
> > alternative is to duplicate pages for backing store and buffer
> > cache (actually that might not be a bad alternative really).
>
> Cool. Triple buffering :)  Although I guess that would only
> apply to metadata these days.

Double buffering. You no longer serve data out of your buffer
cache. All filesystem data was already double buffered anyway,
so we'd be just losing out on one layer of savings for metadata.
I think it's worthwhile, given that we'd have a "real" looking
block device and minus these bugs.


> Having a separate store would 
> solve some of the problems, and probably remove the need
> for carefully specifying the ramdisk block size.  We would
> still need the magic restictions on page allocations though
> and it we would use them more often as the initial write to the
> ramdisk would not populate the pages we need.

What magic restrictions on page allocations? Actually we have
fewer restrictions on page allocations because we can use
highmem! And the lowmem buffercache pages that we currently pin
(unsuccessfully, in the case of this bug) are now completely
reclaimable. And all your buffer heads are now reclaimable.

If you mean GFP_NOIO... I don't see any problem. Block device
drivers have to allocate memory with GFP_NOIO; this may have
been considered magic or deep badness back when the code was
written, but it's pretty simple and accepted now.


> A very ugly bit seems to be the fact that we assume we can
> dereference bh->b_data without any special magic which
> means the ramdisk must live in low memory on 32bit machines.

Yeah but that's not rd.c. You need to rewrite the buffer layer
to fix that (see fsblock ;)).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 3/3] x86: optimise barriers

2007-10-15 Thread Nick Piggin
On Mon, Oct 15, 2007 at 11:10:00AM +0200, Jarek Poplawski wrote:
> On Mon, Oct 15, 2007 at 10:09:24AM +0200, Nick Piggin wrote:
> ...
> > Has performance really been much problem for you? (even before the
> > lfence instruction, when you theoretically had to use a locked op)?
> > I mean, I'd struggle to find a place in the Linux kernel where there
> > is actually a measurable difference anywhere... and we're pretty
> > performance critical and I think we have a reasonable amount of lockless
> > code (I guess we may not have a lot of tight computational loops, though).
> > I'd be interested to know what, if any, application had found these
> > barriers to be problematic...
> 
> I'm not performance-words at all, so I can't help you, sorry. But, I
> understand people who care about this, and think there is a popular
> conviction barriers and locked instructions are costly, so I'm
> surprised there is any "problem" now with finding these gains...

It's more expensive than nothing, sure. However in real code, algorithmic
complexity, cache misses and cacheline bouncing tend to be much bigger
issues.

I can't think of a place in the kernel where smp_rmb matters _that_ much.
seqlocks maybe (timers, dcache lookup), vmscan... Obviously removing the
lfence is not going to hurt. Maybe we even gain 0.01% performance in
someone's workload.

Also, remember: if loads are already in-order, then lfence is a noop,
right? (in practice it seems to have to do a _little_ bit of work, but
it's like a dozen cycles).


> > The thing is that those documents are not defining what a particular
> > implementation does, but how the architecture is defined (ie. what
> > must some arbitrary software/hardware provide and what may it expect).
> 
> I'm not sure this is the right way to tell it. If there is no
> distinction between what is and what could be, how can I believe in
> similar Alpha or Itanium stuff? IMHO, these manuals sometimes look
> like they describe some real hardware mechanisms, and sometimes they
> mention about possible changes and reserved features too. So, when
> they don't mention you could think it's a present behavior.

No. Why are you reading that much into it? I know for a fact that some
non-x86 architectures actual implementations have stronger ordering than
their ISA allows. It's nothing to do with you "believing" how the hardware
works. That's not what the document is describing (directly).


> > It's pretty natural that Intel started out with a weaker guarantee
> > than their CPUs of the time actually supported, and tightened it up
> > after (presumably) deciding not to implement such relaxed semantics
> > for the forseeable future.
> 
> As a matter of fact it's not natural for me at all. I expected the
> other direction, and I still doubt programmers' intentions could be
> "automatically" predicted good enough, so IMHO, it's not for long.

Really? Consider the consequences if, instead of releasing this latest
document tightening consistency, Intel found that out of order loads
were worth 5% more performance and implemented them in their next chip.
The chip could be completely backwards compatible, but all your old code
would break, because it was broken to begin with (because it was outside
the spec).

IMO Intel did exactly the right thing from an engineering perspective,
and so did Linux to always follow the spec.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LFENCE instruction (was: [rfc][patch 3/3] x86: optimise barriers)

2007-10-15 Thread Nick Piggin
On Tue, Oct 16, 2007 at 12:08:01AM +0200, Mikulas Patocka wrote:
> > On Mon, 15 Oct 2007 22:47:42 +0200 (CEST)
> > Mikulas Patocka <[EMAIL PROTECTED]> wrote:
> > 
> > > > According to latest memory ordering specification documents from
> > > > Intel and AMD, both manufacturers are committed to in-order loads
> > > > from cacheable memory for the x86 architecture. Hence, smp_rmb()
> > > > may be a simple barrier.
> > > >
> > > > http://developer.intel.com/products/processor/manuals/318147.pdf 
> > > > http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
> > > 
> > > Hi
> > > 
> > > I'm just wondering about one thing --- what is LFENCE instruction
> > > good for?
> > > 
> > > SFENCE is for enforcing ordering in write-combining buffers (it
> > > doesn't have sense in write-back cache mode).
> > > MFENCE is for preventing of moving stores past loads.
> > > 
> > > But what is LFENCE for? I read the above documents and they already
> > > say that CPUs have ordered loads.
> > > 
> > 
> > The cpus also have an explicit set of instructions that deliberately do 
> > unordered stores/loads, and s/lfence etc are mostly designed for those.
> 
> I know about unordered stores (movnti & similar) --- they basically use 
> write-combining method on memory that is normally write-back --- and they 
> need sfence. But which one instruction does unordered load and needs 
> lefence?

Also, for non-wb memory. I don't think the Intel document referenced
says anything about this, but the AMD document says that loads can pass
loads (page 8, rule b).

This is why our rmb() is still an lfence.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull] scheduler updates for v2.6.24

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 00:17, Ingo Molnar wrote:
> Linus, please pull the latest scheduler git tree from:
>
>git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
>
> It contains lots of scheduler updates from lots of people - hopefully
> the last big one for quite some time. Most of the focus was on
> performance (both micro-performance and scalability/balancing), but
> there's the fair-scheduling feature now Kconfig selectable too. Find the
> shortlog below.

Nice work...

However it's a pity all the balancing stuff got wildly changed
in 2.6.23 and then somewhat changed back again now.

Despite appearances, a lot of those things weren't actually
*completely* arbitrary values. I fear that it will make finding
performance regressions harder than it should have...

Anyway.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: nfs mmap adventure (was: 2.6.23-mm1)

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 00:06, David Howells wrote:
> Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > I get funny SIGBUS' like so:
> >
> > fault
> >   if (->page_mkwrite() < 0)
> > nfs_vm_page_mkwrite()
> >   nfs_write_begin()
> > nfs_flush_incompatible()
> >   nfs_wb_page()
> > nfs_wb_page_priority()
> >   nfs_sync_mapping_wait()
> > nfs_wait_on_request_locked()
> >   nfs_wait_on_request()
> > nfs_wait_bit_interruptible()
> >   return -ERESTARTSYS
> > SIGBUS
> >
> > trying to figure out what to do about this...
>
> Hmmm...  It sounds like the fault handler should deliver the appropriate
> signal, should ->page_mkwrite() return ERESTARTSYS, and then retry the
> access instruction that caused the fault when the signal handler has
> finished running.

I don't think the fault handler is currently in any position to do
that ATM. It is possible to make it interruptible in some contexts,
but faults from kernel code may not be able to cope.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interaction between Xen and XFS: stray RW mappings

2007-10-15 Thread Nick Piggin
On Monday 15 October 2007 21:07, Andi Kleen wrote:
> On Tue, Oct 16, 2007 at 12:56:46AM +1000, Nick Piggin wrote:
> > Is this true even if you don't write through those old mappings?
>
> I think it happened for reads too.  It is a little counter intuitive
> because in theory the CPU doesn't need to write back non dirty lines,
> but in the one case which took so long to debug exactly this happened
> somehow.
>
> At it is undefined for reads and writes in the architecture so
> better be safe than sorry.

Yes, typo. I meant reads or writes.


> And x86 CPUs are out of order and do speculative executation
> and that can lead to arbitary memory accesses even if the code
> never touches an particular address.
>
> Newer Intel CPUs have something called self-snoop which was supposed
> to handle this; but in some situations it doesn't seem to catch it
> either.

Fair enough, so we have to have this lazy tlb flush hook for
Xen/PAT/etc. I don't think it should be much problem to
implement.


> > Is DRM or AGP then not also broken with lazy highmem flushing, or
> > how do they solve that?
>
> AGP doesn't allocate highmem pages.  Not sure about the DRM code.

Hmm, OK. It looks like DRM vmallocs memory (which gives highmem).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Nick Piggin
On Monday 15 October 2007 19:16, Andrew Morton wrote:
> On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <[EMAIL PROTECTED]> 
wrote:
> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> > > unmaintained, so decided to sent the patch to you :-).
> > > I have CCed Ted, who did work on the code in the 90s. I found no
> > > current email address of Chad Page.
> >
> > This really needs to be fixed...
>
> rd.c is fairly mind-boggling vfs abuse.

Why do you say that? I guess it is _different_, by necessity(?)
Is there anything that is really bad? I guess it's not nice
for operating on the pagecache from its request_fn, but the
alternative is to duplicate pages for backing store and buffer
cache (actually that might not be a bad alternative really).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: OOM killer gripe (was Re: What still uses the block layer?)

2007-10-15 Thread Nick Piggin
On Monday 15 October 2007 19:52, Rob Landley wrote:
> On Monday 15 October 2007 8:37:44 am Nick Piggin wrote:
> > > Virtual memory isn't perfect.  I've _always_ been able to come up with
> > > examples where it just doesn't work for me.  This doesn't mean VM
> > > overcommit should be abolished, because it's useful more often than
> > > not.
> >
> > I hate to go completely offtopic here, but disks are so incredibly
> > slow when compared to RAM that there is really nothing the kernel
> > can do about this.
>
> I know.
>
> > Presumably the job will finish, given infinite
> > time.
>
> I gave it about half an hour, then it locked solid and stopped writing to
> the disk at all.  (I gave it another 5 minutes at that point, then held
> down the power button.)

Maybe it was a bug then. Hard to say without backtraces ;)


> > You really shouldn't configure
> > so much unless you do want the kernel to actually use it all, right?
>
> Two words: "Software suspend".  I've actually been thinking of increasing
> it on the next install...

Kernel doesn't know that you want to use it for suspend but not
regular swapping, unfortunately.


> > Because if we're not really conservative about OOM killing, then the
> > user who actually really did want to use all the swap they configured
> > gets angry when we kill their jobs without using it all.
>
> I tend to lower "swappiness" and when that happens all sorts of stuff goes
> weird.  Software suspend used to say says it can't free enough memory if I
> put swappiness at 0 (dunno if it still does).  This time the OOM killer
> never triggered before hard deadlock.  (I think I had it around 20 or 40 or
> some such.)
>
> > Would an oom-kill-someone-now sysrq be of help, I wonder?
>
> *shrug* It might.  I was a letting it run hoping it would complete itself
> when it locked solid.  (The keyboard LEDs weren't flashing, so I don't
> _think_ it paniced.  I was in X so I wouldn't have seen a message...)

If you can work out where things are spinning/sleeping when that happens,
along with sysrq+M data, then it could make for a useful bug report. Not
entirely helpful, but if it is a reproducible problem for you, then you
might be able to get that data from outside X.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interaction between Xen and XFS: stray RW mappings

2007-10-15 Thread Nick Piggin
On Monday 15 October 2007 19:36, Andi Kleen wrote:
> David Chinner <[EMAIL PROTECTED]> writes:
> > And yes, we delay unmapping pages until we have a batch of them
> > to unmap. vmap and vunmap do not scale, so this is batching helps
> > alleviate some of the worst of the problems.
>
> You're keeping vmaps around for already freed pages?

> That will be a big problem for proper PAT support, which needs
> to track all mappings to memory. It's not just a problem for Xen.
>
> In fact I suspect it is already broken with DRM or AGP for example which
> can use UC and WC mappings -- if you keep the mapping around and
> DRM or AGP turns the page in another mapping uncacheable you're
> creating an illegal cache attribute alias. These are known to occasionally
> create cache corruptions on several x86s; giving ___VERY___ hard to debug
> bugs once a blue moon.

Is this true even if you don't write through those old mappings?
Is DRM or AGP then not also broken with lazy highmem flushing, or
how do they solve that?


> Probably it'll require some generic VM batching mechanism where
> Xen or PAT code can hook into the list or force unmap the mappings
> as needed.

Definitely.


> Definitely needs to be fixed if true. You're lucky that Xen caught it
> in time.

I've been thinking that a simple debug mode could be a good idea.
Have a new field in the struct page to count the number of possible
unflushed mappings to the page so that things like PAT could go
BUG_ON appropriate conditions. Would that be helpful?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Nick Piggin
On Monday 15 October 2007 19:05, Christian Borntraeger wrote:
> Am Montag, 15. Oktober 2007 schrieb Nick Piggin:
> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> > > unmaintained, so decided to sent the patch to you :-).
> > > I have CCed Ted, who did work on the code in the 90s. I found no
> > > current email address of Chad Page.
> >
> > This really needs to be fixed...
>
> I obviously agree ;-)
> We have seen this problem happen several times.
>
> > I can't make up my mind between the approaches to fixing it.
> >
> > On one hand, I would actually prefer to really mark the buffers
> > dirty (as in: Eric's fix for this problem[*]) than this patch,
> > and this seems a bit like a bandaid...
>
> I have never seen these patches, so I cannot comment on them.

> > On the other hand, the wound being covered by the bandaid is
> > actually the code in the buffer layer that does this latent
> > "cleaning" of the page because it sadly doesn't really keep
> > track of the pagecache state. But it *still* feels like we
> > should be marking the rd page's buffers dirty which should
> > avoid this problem anyway.
>
> Yes, that would solve the problem as well. As long as we fix
> the problem, I am happy. On the other hand, do you see any
> obvious problem with this "bandaid"?

I don't think so -- in fact, it could be the best candidate for
a minimal fix for stable kernels (anyone disagree? if not, maybe
you could also send this to the stable maintainers?).

But I do want to have this fixed in a "nice" way. eg. I'd like
it to mark the buffers dirty because that actually results in
more reuse of generic kernel code, and also should make rd
behave more naturally (I like using it to test filesystems
because it can expose a lot more concurrency than something like
loop on tmpfs). It should also be possible to actually have
rd's buffer heads get reclaimed as well, preferably while
exercising the common buffer paths and without writing much new
code.

All of that is secondary to fixing the data corruption problem
of course! But the fact that those alternate patches do exist now
means I want to just bring them into the discussion again before
merging one or the other.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Nick Piggin
On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> unmaintained, so decided to sent the patch to you :-).
> I have CCed Ted, who did work on the code in the 90s. I found no current
> email address of Chad Page.

This really needs to be fixed...

I can't make up my mind between the approaches to fixing it.

On one hand, I would actually prefer to really mark the buffers
dirty (as in: Eric's fix for this problem[*]) than this patch,
and this seems a bit like a bandaid...

On the other hand, the wound being covered by the bandaid is
actually the code in the buffer layer that does this latent
"cleaning" of the page because it sadly doesn't really keep
track of the pagecache state. But it *still* feels like we
should be marking the rd page's buffers dirty which should
avoid this problem anyway.

[*] However, hmm, with Eric's patch I guess we'd still have a hole
where filesystems that write their buffers by hand think they are
"cleaning" these things and we're back to square one. That could
be fixed by marking the buffers dirty again?

Why were Eric's patches dropped, BTW? I don't remember.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


OOM killer gripe (was Re: What still uses the block layer?)

2007-10-15 Thread Nick Piggin
On Monday 15 October 2007 18:04, Rob Landley wrote:
> On Sunday 14 October 2007 8:45:03 pm Theodore Tso wrote:

> > > excuse for conflating different categories of devices in the first
> > > place.
> >
> > See the thinkpad Ultrabay drive example above.
>
> Last week I drove my laptop so deep into swap (with a "make -j" on qemu)
> that after half an hour trying to repaint my kmail window, it locked solid.
> Again.  You'd think the oom killer would come to the rescue, but it didn't.
> Maybe Ubuntu disabled it.  I have _2_gigs_ of ram in this sucker, on a
> stock Ubuntu 7.04 install (with the "upgrade all" tab pressed a few times),
> and yet I managed to make it swap itself to death one more time.
>
> Virtual memory isn't perfect.  I've _always_ been able to come up with
> examples where it just doesn't work for me.  This doesn't mean VM
> overcommit should be abolished, because it's useful more often than not.

I hate to go completely offtopic here, but disks are so incredibly
slow when compared to RAM that there is really nothing the kernel
can do about this. Presumably the job will finish, given infinite
time.

How much swap do you have configured? You really shouldn't configure
so much unless you do want the kernel to actually use it all, right?
Because if we're not really conservative about OOM killing, then the
user who actually really did want to use all the swap they configured
gets angry when we kill their jobs without using it all.

Would an oom-kill-someone-now sysrq be of help, I wonder?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 3/3] x86: optimise barriers

2007-10-15 Thread Nick Piggin
On Mon, Oct 15, 2007 at 09:44:05AM +0200, Jarek Poplawski wrote:
> On Fri, Oct 12, 2007 at 08:13:52AM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Fri, 12 Oct 2007, Jarek Poplawski wrote:
> ...
> > So no, there's no way a software person could have afforded to say "it 
> > seems to work on my setup even without the barrier". On a dual-socket 
> > setup with s shared bus, that says absolutely *nothing* about the 
> > behaviour of the exact same CPU when used with a multi-bus chipset. Not to 
> > mention another revisions of the same CPU - much less a whole other 
> > microarchitecture.
> 
> Yes, I still can't believe this, but after some more reading I start
> to admit such things can happen in computer "science" too... I've
> mentioned a lost performance, but as a matter of fact I've been more
> concerned with the problem of truth:
> 
> From: Intel(R) 64 and IA-32 Architectures Software Developer's Manual
> Volume 3A:
> 
>"7.2.2 Memory Ordering in P6 and More Recent Processor Families
> ...
> 1. Reads can be carried out speculatively and in any order.
> ..."
> 
> So, it looks to me like almost the 1-st Commandment. Some people (like
> me) did believe this, others tried to check, and it was respected for
> years notwithstanding nobody had ever seen such an event.

I'd say that's exactly what Intel wanted. It's pretty common (we do
it all the time in the kernel too) to create an API which places a
stronger requirement on the caller than is actually required. It can
make changes much less painful.

Has performance really been much problem for you? (even before the
lfence instruction, when you theoretically had to use a locked op)?
I mean, I'd struggle to find a place in the Linux kernel where there
is actually a measurable difference anywhere... and we're pretty
performance critical and I think we have a reasonable amount of lockless
code (I guess we may not have a lot of tight computational loops, though).
I'd be interested to know what, if any, application had found these
barriers to be problematic...

 
> And then, a few years later, we have this:
> 
> From: Intel(R) 64 Architecture Memory Ordering White Paper
> 
> "2 Memory ordering for write-back (WB) memory
>  ...
>  Intel 64 memory ordering obeys the following principles:
>  1. Loads are not reordered with other loads.
>  ..."
> 
> I know, technically this doesn't have to be a contradiction (for not
> WB), but to me it's something like: "OK, Elvis lives and this guy is
> not real Paul McCartney too" in an official CIA statement!

The thing is that those documents are not defining what a particular
implementation does, but how the architecture is defined (ie. what
must some arbitrary software/hardware provide and what may it expect).

It's pretty natural that Intel started out with a weaker guarantee
than their CPUs of the time actually supported, and tightened it up
after (presumably) deciding not to implement such relaxed semantics
for the forseeable future.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ARCH_FREE_PTE_NR 5350 on x86_64

2007-10-15 Thread Nick Piggin
On Monday 15 October 2007 16:54, Alok kataria wrote:
> Hi,
>
> Looking at the tlb_flush code path and its co-relation with
> ARCH_FREE_PTE_NR, on x86-64 architecture. I think we still don't use
> the ARCH_FREE_PTE_NR of 5350 as the caching value for the mmu_gathers
> structure, instead fallback to using 506 due to some typo errors in
> the code.
>
> Found this link in the archives.
>
> http://www.ussg.iu.edu/hypermail/linux/kernel/0509.1/1821.html
>
> I don't think anything has been done on this yet (looked at 2.6.23).
> Do let me know if its only a typo that needs a fix, or we are still
> waiting for some other other changes to come into effect.

ARCH_FREE_PTE_NR should probably just get ripped out completely at
this point. Something similar (but working) could be added back if
anybody actually had a workload where it helps to increase the mmu
batch size.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] vivi, videobuf_to_vmalloc() and related breakage

2007-10-14 Thread Nick Piggin
On Monday 15 October 2007 12:01, Al Viro wrote:
>   AFAICS, videobuf-vmalloc use of mem->vma and mem->vmalloc is
> bogus.
>
> You obtain the latter with vmalloc_user(); so far, so good.  Then you have
> retval=remap_vmalloc_range(vma, mem->vmalloc,0);
> where vma is given to you by mmap(); again, fine - we get the memory
> pointed to be mem->vmalloc() mapped at vma->vm_start.
>
> Now we get the trouble: things like
>
> static void vivi_fillbuff(struct vivi_dev *dev,struct vivi_buffer *buf)
> {
>   ...
>   void *vbuf=videobuf_to_vmalloc (&buf->vb);
>   ...
>   copy_to_user(vbuf + ..., ..., ...)
>
> get vbuf equal to ->vmalloc of buf->vp.priv and that is _not_ a userland
> address.  Giving it to copy_to_user() is not going to do anything good.
> On some targets it'll fail, on some - write to unrelated user memory.
> What is going on there?  If that's an attempt to copy into that buffer
> allocated by vmalloc_user(), why are we doing copy_to_user() at all?

Right you are. remap_vmalloc_range doesn't turn the passed vmalloc
area into user memory (it creates a completely new mapping).

Presumably it either wants to copy_to_user to that new mapping, or
memcpy to ->vmalloc? Would the former be an attempt to avoid some
virtual aliasing issues?


> But there's more; we have made a copy of vma (kmalloc+memcpy), stored it in
> mem->vma and later we cheerfully do remap_vmalloc_range(mem->vma,).
> And kfree that mem->vma immediately afterwards.  What the hell?  It might
> not break now, but that seems to be playing very fast and loose with the
> warranties provided by VM.

I don't know why one would be finding remap_vmalloc_range to fail
it mmap time but not later? Should just do it at mmap time and if
that is failing, then work out why (or ask linux-mm for help).
Actually there is probably a window where we can get subsequent
anonymous pages faulted into the empty vma there if we haven't
remapped it, then the subsequent attempt to remap will hit the
BUG_ON in remap_pte_range.

(that's aside from the big conceptual problem with passing in an
"invented" vma... don't do that! (: )
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interaction between Xen and XFS: stray RW mappings

2007-10-14 Thread Nick Piggin
On Monday 15 October 2007 10:57, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> > because it generally has to invalidate TLBs on all CPUs.
>
> I see.
>
> > I'm looking at some more general solutions to this (already have some
> > batching / lazy unmapping that replaces the XFS specific one), however
> > they are still likely going to leave vmap mappings around after freeing
> > the page.
>
> Hm.  Could there be a call to shoot down any lazy mappings of a page, so
> the Xen pagetable code could use it on any pagetable page?  Ideally one
> that could be used on any page, but only causes expensive operations
> where needed.

Yeah, it would be possible. The easiest way would just be to shoot down
all lazy vmaps (because you're doing the global IPIs anyway, which are
the expensive thing, at which point you may as well purge the rest of
your lazy mappings).

If it is sufficiently rare, then it could be the simplest thing to do.


> > We _could_ hold on to the pages as well, but that's pretty inefficient.
> > The memory cost of keeping the mappings around tends to be well under
> > 1% the cost of the page itself. OTOH we could also avoid lazy flushes
> > on architectures where it is not costly. Either way, it probably would
> > require an arch hook or even a couple of ifdefs in mm/vmalloc.c for
> > Xen. Although... it would be nice if Xen could take advantage of some
> > of these optimisations as well.
>
> In general the lazy unmappings won't worry Xen.  It's only for the
> specific case of allocating memory for pagetables.  Xen can do a bit of
> extra optimisation for cross-cpu tlb flushes (if the target vcpus are
> not currently running, then you don't need to do anything), but they're
> still an expensive operation, so the optimisation is definitely useful.

OK.


> > What's the actual problem for Xen? Anything that can be changed?
>
> Not easily.  Xen doesn't use shadow pagetables.  Instead, it gives the
> guest domains direct access to the real CPU's pagetable, but makes sure
> they're always mapped RO so that the hypervisor can control updates to
> the pagetables (either by trapping writes or via explicit hypercalls).
> This means that when constructing a new pagetable, Xen will verify that
> all the mappings of pages making up the new pagetable are RO before
> allowing it to be used.  If there are stray RW mappings of those pages,
> pagetable construction will fail.

OK, I see. Because even though it is technically safe where we are
using it (because nothing writes through the mappings after the page
is freed), a corrupted guest could use the same window to do bad
things with the pagetables?


> Aside from XFS, the only other case I've found where there could be
> stray RW mappings is when using high pages which are still in the kmap
> cache; I added an explicit call to flush the kmap cache to handle this.
> If vmap and kmap can be unified (at least the lazy unmap aspects of
> them), then that would be a nice little cleanup.

vmap is slightly harder than kmap in some respects. However it would
be really nice to get vmap fast and general enough to completely
replace all the kmap crud -- that's one goal, but the first thing
I'm doing is to concentrate on just vmap to work out how to make it
as fast as possible.

For Xen -- shouldn't be a big deal. We can have a single Linux mm API
to call, and we can do the right thing WRT vmap/kamp. I should try to
merge my current lazy vmap patches which replace the XFS stuff, so we
can implement such an API and fix your XFS issue? That's not going to
happen for at least a cycle or two though, so in the meantime maybe
an ifdef for that XFS vmap batching code would help?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interaction between Xen and XFS: stray RW mappings

2007-10-14 Thread Nick Piggin
On Monday 15 October 2007 09:12, Jeremy Fitzhardinge wrote:
> David Chinner wrote:
> > You mean xfs_buf.c.
>
> Yes, sorry.
>
> > And yes, we delay unmapping pages until we have a batch of them
> > to unmap. vmap and vunmap do not scale, so this is batching helps
> > alleviate some of the worst of the problems.
>
> How much performance does it cost?  What kind of workloads would it show
> up under?
>
> > Realistically, if this delayed release of vmaps is a problem for
> > Xen, then I think that some generic VM solution is needed to this
> > problem as vmap() is likely to become more common in future (think
> > large blocks in filesystems). Nick - any comments?
>
> Well, the only real problem is that the pages are returned to the free
> pool and reallocated while still being part of a mapping.  If the pages
> are still owned by the filesystem/pagecache, then there's no problem.
>
> What's the lifetime of things being vmapped/unmapped in xfs?  Are they
> necessarily being freed when they're unmapped, or could unmapping of
> freed memory be more immediate than other memory?

Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
because it generally has to invalidate TLBs on all CPUs.

I'm looking at some more general solutions to this (already have some
batching / lazy unmapping that replaces the XFS specific one), however
they are still likely going to leave vmap mappings around after freeing
the page.

We _could_ hold on to the pages as well, but that's pretty inefficient.
The memory cost of keeping the mappings around tends to be well under
1% the cost of the page itself. OTOH we could also avoid lazy flushes
on architectures where it is not costly. Either way, it probably would
require an arch hook or even a couple of ifdefs in mm/vmalloc.c for
Xen. Although... it would be nice if Xen could take advantage of some
of these optimisations as well.

What's the actual problem for Xen? Anything that can be changed?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/2] hdaps: fix locking

2007-10-14 Thread Nick Piggin
On Sun, Oct 14, 2007 at 09:25:23AM +0200, Nick Piggin wrote:
> Here are a couple of fixes for the hdaps driver. I have kind of been
> blocking out the bug traces caused by these (the 2nd patch, actually)
> thinking that it's one of those transient / churn things... but it's
> getting annoying now because I think it turns off lockdep. After this
> patch it no longer produces warnings, but I don't actually know if it
> does the right thing (because I don't really know what the driver
> does or how to test it anyway!).
> 

Ahh, forget that, I see the patch in -mm now, which should go in instead.


> ---
> 
> hdaps was using incorrect mutex_trylock return code.
> 
> Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
> 
> ---
> Index: linux-2.6/drivers/hwmon/hdaps.c
> ===
> --- linux-2.6.orig/drivers/hwmon/hdaps.c
> +++ linux-2.6/drivers/hwmon/hdaps.c
> @@ -328,22 +328,20 @@ static void hdaps_mousedev_poll(unsigned
>   int x, y;
>  
>   /* Cannot sleep.  Try nonblockingly.  If we fail, try again later. */
> - if (mutex_trylock(&hdaps_mtx)) {
> - mod_timer(&hdaps_timer,jiffies + HDAPS_POLL_PERIOD);
> - return;
> - }
> + if (!mutex_trylock(&hdaps_mtx))
> + goto out;
>  
>   if (__hdaps_read_pair(HDAPS_PORT_XPOS, HDAPS_PORT_YPOS, &x, &y))
> - goto out;
> + goto out_lock;
>  
>   input_report_abs(hdaps_idev, ABS_X, x - rest_x);
>   input_report_abs(hdaps_idev, ABS_Y, y - rest_y);
>   input_sync(hdaps_idev);
>  
> - mod_timer(&hdaps_timer, jiffies + HDAPS_POLL_PERIOD);
> -
> -out:
> +out_lock:
>   mutex_unlock(&hdaps_mtx);
> +out:
> + mod_timer(&hdaps_timer, jiffies + HDAPS_POLL_PERIOD);
>  }
>  
>  
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/2] hdaps: fix locking

2007-10-14 Thread Nick Piggin
Here are a couple of fixes for the hdaps driver. I have kind of been
blocking out the bug traces caused by these (the 2nd patch, actually)
thinking that it's one of those transient / churn things... but it's
getting annoying now because I think it turns off lockdep. After this
patch it no longer produces warnings, but I don't actually know if it
does the right thing (because I don't really know what the driver
does or how to test it anyway!).

---

hdaps was using incorrect mutex_trylock return code.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
Index: linux-2.6/drivers/hwmon/hdaps.c
===
--- linux-2.6.orig/drivers/hwmon/hdaps.c
+++ linux-2.6/drivers/hwmon/hdaps.c
@@ -328,22 +328,20 @@ static void hdaps_mousedev_poll(unsigned
int x, y;
 
/* Cannot sleep.  Try nonblockingly.  If we fail, try again later. */
-   if (mutex_trylock(&hdaps_mtx)) {
-   mod_timer(&hdaps_timer,jiffies + HDAPS_POLL_PERIOD);
-   return;
-   }
+   if (!mutex_trylock(&hdaps_mtx))
+   goto out;
 
if (__hdaps_read_pair(HDAPS_PORT_XPOS, HDAPS_PORT_YPOS, &x, &y))
-   goto out;
+   goto out_lock;
 
input_report_abs(hdaps_idev, ABS_X, x - rest_x);
input_report_abs(hdaps_idev, ABS_Y, y - rest_y);
input_sync(hdaps_idev);
 
-   mod_timer(&hdaps_timer, jiffies + HDAPS_POLL_PERIOD);
-
-out:
+out_lock:
mutex_unlock(&hdaps_mtx);
+out:
+   mod_timer(&hdaps_timer, jiffies + HDAPS_POLL_PERIOD);
 }
 
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: avoid dirtying shared mappings on mlock

2007-10-12 Thread Nick Piggin
On Friday 12 October 2007 20:50, Peter Zijlstra wrote:
> On Fri, 2007-10-12 at 04:14 +1000, Nick Piggin wrote:
> > On Friday 12 October 2007 20:37, Peter Zijlstra wrote:

> > > The pages will still be read-only due to dirty tracking, so the first
> > > write will still do page_mkwrite().
> >
> > Which can SIGBUS, no?
>
> Sure, but that is no different than any other mmap'ed write. I'm not
> seeing how an mlocked region is special here.

Well it is a change in behaviour (admittedly, so was the change
to SIGBUS mmaped writes in the first place). It's a matter of
semantics I guess. Is the current behaviour actually a _problem_
for anyone? If not, then do we need to change it?

I'm not saying it does matter, just that it might matter ;)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: avoid dirtying shared mappings on mlock

2007-10-12 Thread Nick Piggin
On Friday 12 October 2007 20:37, Peter Zijlstra wrote:
> On Fri, 2007-10-12 at 02:57 +1000, Nick Piggin wrote:
> > On Friday 12 October 2007 19:03, Peter Zijlstra wrote:
> > > Subject: mm: avoid dirtying shared mappings on mlock
> > >
> > > Suleiman noticed that shared mappings get dirtied when mlocked.
> > > Avoid this by teaching make_pages_present about this case.
> > >
> > > Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> > > Acked-by: Suleiman Souhlal <[EMAIL PROTECTED]>
> >
> > Umm, I don't see the other piece of this thread, so I don't
> > know what the actual problem was.
> >
> > But I would really rather not do this. If you do this, then you
> > now can get random SIGBUSes when you write into the memory if it
> > can't allocate blocks or ... (some other filesystem specific
> > condition).
>
> I'm not getting this, make_pages_present() only has to ensure all the
> pages are read from disk and in memory. How is this different from a
> read-scan?

I guess because we've mlocked a region that has PROT_WRITE access...
but actually, I suppose mlock doesn't technically require that we
can write to the memory, only that the page isn't swapped out.

Still, it is nice to be able to have a reasonable guarantee of
writability.


> The pages will still be read-only due to dirty tracking, so the first
> write will still do page_mkwrite().

Which can SIGBUS, no?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 3/3] x86: optimise barriers

2007-10-12 Thread Nick Piggin
On Fri, Oct 12, 2007 at 11:55:05AM +0200, Jarek Poplawski wrote:
> On Fri, Oct 12, 2007 at 10:57:33AM +0200, Nick Piggin wrote:
> > 
> > I don't know quite what you're saying... the CPUs could probably get
> > performance by having weakly ordered loads, OTOH I think the Intel
> > ones might already do this speculatively so they appear in order but
> > essentially have the performance of weak order.
> 
> I meant: if there is any reordering possible this should be quite
> distinctly visible.

It's not. Not in the cases where it is explicitly allowed and actively
exploited (loads passing stores), but most definitely not distinctly
visible in errata cases that have slipped through all the V&V.


> because why would any vendor enable such nasty
> things if not for performance. But now I start to doubt: of course
> there is such a possibility someone makes this reordering for some
> other reasons which could be so rare it's hard to check.

Yes: it isn't the explicitly allowed reorderings that we care
about here (because obviously we're retaining the barriers for those).
It would be cases of bugs in the CPUs meaning they don't follow the
standard. But how far do you take your mistrust of a CPU? You could
ask gcc to insert locked ops between every load and store operation?
Or keep it switched off to ensure no bugs ;)


> Anyway, it seems any heavy testing such as yours, should give us the
> same informations years earlier than any vendors manual and then any
> gain is multiplied by millions of users. Then only still doubtful
> cases could be treated with additional caution and some debugging
> code.

Firstly, while it can be possible to write a code to show up reordering,
it is really hard (ie. impossible) to guarantee no reordering happens. For
example, it may have only showed up on SMT+SMP P4 CPUs with some obscure
interactions between threads and cores involving more than 2 threads.

Secondly, even if we were sure that no current implementations reordered
loads, we don't want to go outside the bounds of the specification
because we might break on some future CPUs. This isn't a big performance
win.


> > All existing processors as far as we know are in-order WRT loads vs
> > loads and stores vs stores. It was just a matter of getting the docs
> > clarified, which gives us more confidence that we're correct and a
> > reasonable guarnatee of forward compatibility.
> 
> After reading this Intel's legal information I don't think you should
> feel so much more forward confident...

Yes, but that's the same way I feel after reading *any* legal "information" ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 3/3] x86: optimise barriers

2007-10-12 Thread Nick Piggin
On Fri, Oct 12, 2007 at 11:12:13AM +0200, Jarek Poplawski wrote:
> On Fri, Oct 12, 2007 at 10:42:34AM +0200, Helge Hafting wrote:
> > Jarek Poplawski wrote:
> > >On 04-10-2007 07:23, Nick Piggin wrote:
> > >  
> > >>According to latest memory ordering specification documents from Intel and
> > >>AMD, both manufacturers are committed to in-order loads from cacheable 
> > >>memory
> > >>for the x86 architecture. Hence, smp_rmb() may be a simple barrier.
> > >>
> > >...
> > >
> > >Great news!
> > >
> > >First it looks like a really great thing that it's revealed at last.
> > >But then... there is probably some confusion: did we have to use
> > >ineffective code for so long?
> > >  
> > You could have tried the optimization before, and
> > gotten better performance. But if without solid knowledge that
> > the optimization is _valid_, you risk having a kernel
> > that performs great but suffer the occational glitch and
> > therefore is unstable and crash the machine "now and then".
> > This sort of thing can't really be figured out by experimentation, because
> > the bad cases might happen only with some processors, some
> > combinations of memory/chipsets, or with some minimum
> > number of processors.  Such problems can be very hard
> > to find, especially considering that other plain bugs also
> > cause crashes.
> > 
> > Therefore, the "ineffective code" was used because it was
> > the only safe alternative. Now we know, so now we may optimize.
> 
> Sorry, I don't understand this logic at all. Since bad cases
> happen independently from any specifications and Intel doesn't
> take any legal responsibility for such information, it seems we
> should better still not optimize?

We already do in probably more critical and lible to be problematic
cases (notably, spin_unlock).

So unless there is reasonable information for us to believe this
will be a problem, IMO the best thing to do is stick with the
specs. Intel is pretty reasonable with documenting errata I think.

With memory barriers specifically, I'm sure we have many more bugs
in the kernel than AMD or Intel have in their chips ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: avoid dirtying shared mappings on mlock

2007-10-12 Thread Nick Piggin
On Friday 12 October 2007 02:57, Nick Piggin wrote:
> On Friday 12 October 2007 19:03, Peter Zijlstra wrote:
> > Subject: mm: avoid dirtying shared mappings on mlock
> >
> > Suleiman noticed that shared mappings get dirtied when mlocked.
> > Avoid this by teaching make_pages_present about this case.
> >
> > Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> > Acked-by: Suleiman Souhlal <[EMAIL PROTECTED]>
>
> Umm, I don't see the other piece of this thread, so I don't
> know what the actual problem was.

Found it, but no more clues. Presumably it's some horrible
google workload... they're pretty happy to carry these kinds
of patches internally, right? ;)


> But I would really rather not do this. If you do this, then you
> now can get random SIGBUSes when you write into the memory if it
> can't allocate blocks or ... (some other filesystem specific
> condition).
>
> I agree it feels suboptimal, but we've _always_ done this, right?
> Is it suddenly a problem? Unless a really nice general way is
> made to solve this, optimising it like this makes it harder to
> ensure good semantics for all corner cases I think (and then once
> the optimisation is in place, it's a lot harder to remove it).

Yeah, I really would rather not do this. If we actually go
through the whole fault path in mlock, then it doesn't
really matter what future baggage we attach to fault handlers...
(OK, we still technically have some problems with invalidations,
but mostly they're avoidable unless doing something silly).

How about just doing another PROT_READ mmap, and mlocking that?
(I was going to suggest another syscall, but that's probably
overkill).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: avoid dirtying shared mappings on mlock

2007-10-12 Thread Nick Piggin
On Friday 12 October 2007 19:03, Peter Zijlstra wrote:
> Subject: mm: avoid dirtying shared mappings on mlock
>
> Suleiman noticed that shared mappings get dirtied when mlocked.
> Avoid this by teaching make_pages_present about this case.
>
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> Acked-by: Suleiman Souhlal <[EMAIL PROTECTED]>

Umm, I don't see the other piece of this thread, so I don't
know what the actual problem was.

But I would really rather not do this. If you do this, then you
now can get random SIGBUSes when you write into the memory if it
can't allocate blocks or ... (some other filesystem specific
condition).

I agree it feels suboptimal, but we've _always_ done this, right?
Is it suddenly a problem? Unless a really nice general way is
made to solve this, optimising it like this makes it harder to
ensure good semantics for all corner cases I think (and then once
the optimisation is in place, it's a lot harder to remove it).

I'll go away and have a better look for the old thread.

> ---
>  mm/memory.c |7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/mm/memory.c
> ===
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -2719,7 +2719,12 @@ int make_pages_present(unsigned long add
>   vma = find_vma(current->mm, addr);
>   if (!vma)
>   return -1;
> - write = (vma->vm_flags & VM_WRITE) != 0;
> + /*
> +  * We want to touch writable mappings with a write fault in order
> +  * to break COW, except for shared mappings because these don't COW
> +  * and we would not want to dirty them for nothing.
> +  */
> + write = (vma->vm_flags & (VM_WRITE|VM_SHARED)) == VM_WRITE;
>   BUG_ON(addr >= end);
>   BUG_ON(end > vma->vm_end);
>   len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] Remove iget() and read_inode() [try #4]

2007-10-12 Thread Nick Piggin
On Friday 12 October 2007 19:07, David Howells wrote:
> Hi Linus,
>
> Here's a set of patches that remove all calls to iget() and all
> read_inode() functions.  They should be removed for two reasons: firstly
> they don't lend themselves to good error handling, and secondly their
> presence is a temptation for code outside a filesystem to call iget() to
> access inodes within that filesystem.

Is there any particular reason not to push these through -mm?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 3/3] x86: optimise barriers

2007-10-12 Thread Nick Piggin
On Fri, Oct 12, 2007 at 10:25:34AM +0200, Jarek Poplawski wrote:
> On 04-10-2007 07:23, Nick Piggin wrote:
> > According to latest memory ordering specification documents from Intel and
> > AMD, both manufacturers are committed to in-order loads from cacheable 
> > memory
> > for the x86 architecture. Hence, smp_rmb() may be a simple barrier.
> ...
> 
> Great news!
> 
> First it looks like a really great thing that it's revealed at last.
> But then... there is probably some confusion: did we have to use
> ineffective code for so long?

I'm not sure exactly what the situation is with the manufacturers,
but maybe they (at least Intel) wanted to keep their options open
WRT their barrier semantics, even if current implementations were
not taking full liberty of them.

 
> First again, we could try to blame Intel etc. But then, wait a minute:
> is it such a mystery knowledge? If this reordering is done there are
> some easy rules broken (just like in examples from these manuals). And
> if somebody cared to do this for optimization, then this is probably
> noticeable optimization, let's say 5 or 10%. Then any test shouldn't
> need to take very long to tell the truth in less than 100 loops!

I don't know quite what you're saying... the CPUs could probably get
performance by having weakly ordered loads, OTOH I think the Intel
ones might already do this speculatively so they appear in order but
essentially have the performance of weak order.

If you're just talking about this patch, then it probably isn't much
performance gain. I'm guessing you'd be lucky to measure it from
userspace.


> So, maybe linux needs something like this, instead of waiting few
> years with each new model for vendors goodwill? IMHO, even for less
> popular processors, this could be checked under some debugging option
> at the system start (after disabling suspicios barrier for a while
> plus some WARN_ONs).

I don't know if that would be worthwhile. It actually isn't always
trivial to trigger reordering. For example, on my dual-core core2,
in order to see reads pass writes, I have to do work on a set that
exceeds the cache size and does a huge amount of work to ensure it
is going to trigger that. If you can actually come up with a test
case that triggers load/load or store/store reordering, I'm sure
Intel / AMD would like to see it ;)

All existing processors as far as we know are in-order WRT loads vs
loads and stores vs stores. It was just a matter of getting the docs
clarified, which gives us more confidence that we're correct and a
reasonable guarnatee of forward compatibility.

So, I think the plan is just to merge these 3 patches during the
current window.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.6.23

2007-10-11 Thread Nick Piggin
On Friday 12 October 2007 15:46, Ingo Molnar wrote:
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> > ;) I think you snipped the important bit:
> >
> > "the peak is terrible but it has virtually no dropoff and performs
> > better under load than the default 2.6.21 scheduler." (verbatim)
>
> hm, i understood that peak remark to be in reference to FreeBSD's
> scheduler (which the FreeBSD guys are primarily interested in
> obviously), not v2.6.21 - but i could be wrong.

I think the Linux peak has always been roughly as good as their
best FreeBSD ones (eg. http://people.freebsd.org/~jeff/sysbench.png).
Obviously in that graph, Linux sucks because of the malloc/mmap_sem
issue. It also shows what he is calling the terrible CFS peak, I
guess.

In my own tests, after that was fixed, Linux's peak got even a bit
higher, so that's the benchmark for performance.


> In any case, there is indeed a regression with sysbench and a low number
> of threads, and it's being fixed. The peak got improved visibly in
> sched-devel:
>
>   http://people.redhat.com/mingo/misc/sysbench-sched-devel.jpg
>
> but there is still some peak regression left, i'm testing a patch for
> that.

OK good. Once that's fixed, we'll hopefully be competitive with
FreeBSD again in this test :)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 spinlock hang in kswapd under heavy disk write loads

2007-10-11 Thread Nick Piggin
On Friday 12 October 2007 02:23, Mr. Berkley Shands wrote:
> With DEBUG_SLAB on, I can run only a very short time under 2.6.23
> before a kernel panic.
>
> [  626.028180] eth0: too many iterations (6) in nv_nic_irq.
> [  626.167583] eth0: too many iterations (6) in nv_nic_irq.
> [  626.206729] eth0: too many iterations (6) in nv_nic_irq.
> [  626.400171] eth0: too many iterations (6) in nv_nic_irq.
> [  626.446374] eth0: too many iterations (6) in nv_nic_irq.
>
> followed by the crash at
> forcedeth:nv_nic_irq_optimized+0x89
> handle_IRQ_event+0x25
> handle_edge_IRQ+0xe3
> do_IRQ

OK, thanks. Probably we should try to fix this problem before
attempting the more elusive looking corruption.

CC'ed forcedeth maintainer and netdev.

Can you reiterate what your hardware is (including lspci)?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.6.23

2007-10-11 Thread Nick Piggin
On Wednesday 10 October 2007 20:14, Ingo Molnar wrote:
> * Nicholas Miell <[EMAIL PROTECTED]> wrote:
> > Does CFS still generate the following sysbench graphs with 2.6.23, or
> > did that get fixed?
> >
> > http://people.freebsd.org/~kris/scaling/linux-pgsql.png
> > http://people.freebsd.org/~kris/scaling/linux-mysql.png
>
> as far as my testsystem goes, v2.6.23 beats v2.6.22.9 in sysbench:
>
> http://redhat.com/~mingo/misc/sysbench.jpg
>
> As you can see it in the graph, v2.6.23 schedules much more consistently
> too. [ v2.6.22 has a small (but potentially statistically insignificant)
> edge at 4-6 clients, and CFS has a slightly better peak (which is
> statistically insignificant). ]
>
> ( Config is at http://redhat.com/~mingo/misc/config, system is Core2Duo
>   1.83 GHz, mysql-5.0.45, glibc-2.6. Nothing fancy either in the config
>   nor in the setup - everything is pretty close to the defaults. )
>
> i'm aware of a 2.6.21 vs. 2.6.23 sysbench regression report, and it
> apparently got resolved after various changes to the test environment:
>
>http://jeffr-tech.livejournal.com/10103.html
>
>  " [] has virtually no dropoff and performs better under load than
>the default 2.6.21 scheduler. " (paraphrased)

;) I think you snipped the important bit:

"the peak is terrible but it has virtually no dropoff and performs
better under load than the default 2.6.21 scheduler." (verbatim)

The dropoff under load was due to trivially avoided mmap_sem
contention in the kernel and glibc (and not-very-scalable mysql
heap locking), rather than specifically anything the scheduler
was doing wrong, I think (when the scheduler chose to start
preempting threads holding locks, then performance would tank.
Exactly when that point was reached, and what happens afterwards
was probably just luck.)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 spinlock hang in kswapd under heavy disk write loads

2007-10-10 Thread Nick Piggin
On Friday 12 October 2007 10:56, Berkley Shands wrote:
> 100% reproducible on the two motherboards in question.
> Does not happen on any other motherboard I have in my possession
> (not tyan, not uniwide, not socket 940...)
>
> No errors, no dmesg, nothing with debug_spinlock set.
>  shows lots (when it works), but by then too many things are
> locked up to be of much use. I can get into KDB and look around
> (2.6.22 for kdb - it hangs there too). Even access to the local disk is
> blocked.
> Processes in core and running remain there (iostat, top, ...).
> I personally think the bios are suspect on the PCIe, as symptoms change
> with the bios rev. I did a major paper on SAS performance with one H8DMi,
> but it got a bios rev, and now crashes. Missed interrupt? APIC sending an
> interrupt to multiple cpus? I don't know.
>
> Tell me what to look at, and I can get you the info. It usually takes 20
> seconds
> to go bang, using either the LSIELP or the rocket raid 2340. Other
> controllers
> are too slow. 2.6.20 does not lock up. It is also 200MB/Sec slower in
> writing :-)
>
> thanks for the response.

OK, it does sound suspiciously like a hardware bug, or some
unrelated software bug that is causing memory scribbles...

A few things you could do.

One is that you could verify that it indeed is the kswapd_wait
spinlock that it is spinning on, and then when you see the lockup,
you could verify that no other tasks are holding the lock. (it is
quite an inner lock, so you shouldn't have to wade through call
chains...). That would confirm corruption. Dumping the lock
contents and the fields in the structure around the lock might
give a clue.

You could put the spinlock somewhere else and see what happens
(move it around in the structure, or get even more creative...).
or do something like have 2 spinlocks, and when you encounter
the lockup, verify whether or not they agree.

(It sounds like you're pretty capable, but if you want me to have
a look at doing a patch or two to help, let me know.)

Another is to bisect the problem, however as you say the kernel
is going slower, so you may just bisect to the point where it
is sustaining enough load to trigger the bug, so this may not be
worth you time just yet.

You could _try_ turning on slab debugging. If there is random
corruption, it might get caught. Maybe it will just change
things enough to hide the problem though.

Thanks for reporting!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 spinlock hang in kswapd under heavy disk write loads

2007-10-10 Thread Nick Piggin
On Thursday 11 October 2007 01:33, Berkley Shands wrote:
> 2.6.23 with CONFIG_DEBUG_SPINLOCK on does not hang under very high write
> loads to either an LSIELP (write rate 1.1GB/Sec) or to a highpoint
> RR2340 (write rate 1.0GB/Sec). With CONFIG_DEBUG_SPINLOCK off however, the
> system hangs with kswapd getting 100% of the cpu and most if not all other
> processes are locked out. Sometimes even the keyboard is locked out.
>
> This is seen on a supermicro H8DM3-2 and H8DMi-2 motherboards, 16 GB RAM,
>  SE or 2216 processors. Not seen on a Uniwide 3546ES, or on a
> Supermicro H8DM8 with 8222 CPUS.
>
> kswapd sits at (according to KDB)
> __spinlock_irqsave + 0x15
> prepare_to_wait + 0x15
> kswapd + 0xe3
> kthread + 0x47


This is the pgdat->kswapd_wait waitqueue lock, by the looks?
It should be basically impossible to get this lock wrong, so
perhaps what happens is some memory corruption which is padded
(or some other random change) by spinlock debugging?

Is this very reproducable? Does your dmesg have any funny
messages after performing this heavy write load with
CONFIG_DEBUG_SPINLOCK set?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-10 Thread Nick Piggin
On Wednesday 10 October 2007 15:20, Linus Torvalds wrote:
> On Wed, 10 Oct 2007, Hugh Dickins wrote:
> > On Tue, 9 Oct 2007, Nick Piggin wrote:
> > > by it ;) To prove my point: the *first* approach I posted to fix this
> > > problem was exactly a patch to special-case the zero_page refcounting
> > > which was removed with my PageReserved patch. Neither Hugh nor yourself
> > > liked it one bit!
> >
> > True (speaking for me; I forget whether Linus ever got to see it).
>
> The problem is, those first "remove ref-counting" patches were ugly
> *regardless* of ZERO_PAGE.
>
> We (yes, largely I) fixed up the mess since. The whole vm_normal_page()
> and the magic PFN_REMAP thing got rid of a lot of the problems.
>
> And I bet that we could do something very similar wrt the zero page too.
>
> Basically, the ZERO page could act pretty much exactly like a PFN_REMAP
> page: the VM would not touch it. No rmap, no page refcounting, no nothing.
>
> This following patch is not meant to be even half-way correct (it's not
> even _remotely_ tested), but is just meant to be a rough "grep for
> ZERO_PAGE in the VM, and see what happens if you don't ref-count it".
>
> Would something like the below work? I dunno. But I suspect it would. I

Sure it will work. It's not completely trivial like your patch,
though. The VM has to know about ZERO_PAGE if you also want it
to do the "optimised" wp (what you have won't work because it
will break all other "not normal" pages which are non-zero I think).

And your follow_page_page path is not going to do the right thing
for ZERO_PAGE either I think.



> doubt anybody has the energy to actually try to actually follow through on
> it, which is why I'm not pushing on it any more, and why I'll accept

Sure they have.

http://marc.info/?l=linux-mm&m=117515508009729&w=2

OK, this patch was open coding the tests rather than putting them in
vm_normal_page, but vm_normal_page doesn't magically make it a whole
lot cleaner (a _little_ bit cleaner, I agree, but in my current patch
I still need a vm_normal_or_zero_page() function).


> Nick's patch to just remove ZERO_PAGE, but I really *am* very unhappy
> about this.

Well that's not very good...


> The "page refcounting cleanups" in the VM back when were really painful.
> And dammit, I felt like I was the one who had to clean them up after you
> guys. Which makes me really testy on this subject.

OK, but in this case we'll not have a big hard-to-revert set of
changes that fundamentally alter assumptions throughout the vm.
It will be more a case of "if somebody screams, put the zero page
back", won't it?


> Totally half-assed untested patch to follow, not meant for anything but a
> "I think this kind of approach should have worked too" comment.
>
> So I'm not pushing the patch below, I'm just fighting for people realizing
> that
>
>  - the kernel has *always* (since pretty much day 1) done that ZERO_PAGE
>thing. This means that I would not be at all surprised if some
>application basically depends on it. I've written test-programs that
>depends on it - maybe people have written other code that basically has
>been written for and tested with a kernel that has basically always
>made read-only zero pages extra cheap.
>
>So while it may be true that removing ZERO_PAGE won't affect anybody, I
>don't think it's a given, and I also don't think it's sane calling
>people "crazy" for depending on something that has always been true
>under Linux for the last 15+ years. There are few behaviors that have
>been around for that long.

That's the main question. Maybe my wording was a little strong, but
I simply personally couldn't think of sane uses of zero page. I'm
not prepared to argue that none could possibly exist.

It just seems like now might be a good time to just _try_ removing
the zero page, because of this peripheral problem caused by my
refcounting patch. If it doesn't work out, then at least we'll be
wiser for it, we can document why the zero page is needed, and add
it back with the refcounting exceptions.


>  - make sure the commit message is accurate as to need for this (ie not
>claim that the ZERO_PAGE itself was the problem, and give some actual
>performance numbers on what is going on)

OK, maybe this is where we are not on the same page.
There are 2 issues really. Firstly, performance problem of
refcounting the zero-page -- we've established that it causes
this livelock and that we should stop refcounting it, right?

Second issue is the performance difference between removing the
zero page complet

<    1   2   3   4   5   6   7   8   9   10   >