Re: Is gcc thread-unsafe?

2007-10-25 Thread Nick Piggin
Can you retain cc list, please?

On Friday 26 October 2007 07:42, David Schwartz wrote:
>   I asked a collection of knowledgeable people I know about the issue. The
> consensus is that the optimization is not permitted in POSIX code but that
> it is permitted in pure C code. The basic argument goes like this:
>
>   To make POSIX-compliant code even possible, surely optimizations that 
> add
> writes to variables must be prohibited. That is -- if POSIX prohibits
> writing to a variable in certain cases only the programmer can detect, then
> a POSIX-compliant compiler cannot write to a variable except where
> explicitly told to do so. Any optimization that *adds* a write to a
> variable that would not otherwise occur *must* be prohibited.
>
>   Otherwise, it is literally impossible to comply with the POSIX 
> requirement
> that concurrent modifications and reads to shared variables take place
> while holding a mutex.

Now all you have to do is tell this to the gcc developers ;)


>   The simplest solution is simply to ditch the optimization. If it really
> isn't even an optimization, then that's an easy way out.

For some things, I'd expect it really is an optimisation, which is why
they're doing it at all. Even on x86 perhaps, where they do tricks with
sbb/adc. If it avoids an unpredictable branch, it could help: a silly
microbenchmark shows a cmov is worth 10% over an unpredictable
conditional jump, and the adc plus unconditional store (which is the
problematic bit) is worth a further 2.5% on top of that.
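
To make the problematic transformation concrete, here is a sketch in
C-level pseudo-code (not anything gcc literally emits; 'counter',
'cond' and 'tmp' are made-up names) of what the branchless version
amounts to:

	/* what the programmer wrote */
	if (cond)
		counter++;

	/* what the sete/adc version effectively does -- note that
	 * the store to counter happens even when !cond */
	tmp = counter;
	tmp += (cond != 0);
	counter = tmp;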

And for unshared things like local variables where their address
hasn't escaped, it's fine.

Still, I guess that for most non-stack variables, you would actually
_want_ to do a cmov rather than the adc, even in a single threaded
program. Because you don't want to touch the cacheline if possible,
let alone dirty it.


Re: Is gcc thread-unsafe?

2007-10-25 Thread Nick Piggin
On Thursday 25 October 2007 17:15, Andi Kleen wrote:
> On Thursday 25 October 2007 05:24, Nick Piggin wrote:
> > Basically, what the gcc developers are saying is that gcc is
> > free to load and store to any memory location, so long as it
> > behaves as if the instructions were executed in sequence.
>
> This case is clearly a bug, a very likely code pessimization.
> I guess it wasn't intentional, just an optimization that is useful
> for local register values doing too much.

There can be cases where it looks much more like an optimisation
(eg. where the branch and increment occur much more often), but it
would still be a bug. Granted, they are rather contrived cases, but I
don't think you want to rely on the fact that most of the time it's
OK.


> > I guess that dynamically allocated memory and computed pointers
> > are more difficult for gcc to do anything unsafe with, because
> > it is harder to tell if a given function has deallocated the
> > memory.
>
> Often accesses happen without function calls in between.
> Also I think newer gcc (not 3.x) can determine if a pointer
> "escapes" or not, so that might not protect against it.
>
> > Any thoughts?
>
> We don't have much choice: If such a case is found it has to be marked
> volatile or that particular compiler version be unsupported.

Marking volatile I think is out of the question. To start with,
volatile creates really poor code (and most of the time we actually
do want the code in critical sections to be as tight as possible).
But also because I don't think these bugs are just going to be
found easily.


> It might be useful to come up with some kind of assembler pattern
> matcher to check if any such code is generated for the kernel
> and try it with different compiler versions.

Hard to know how to do it. If you can, then it would be interesting.


Re: [patch] fs: restore nobh

2007-10-25 Thread Nick Piggin
On Thu, Oct 25, 2007 at 09:07:36PM +0200, Jan Kara wrote:
>   Hi,
> 
> > This is overdue, sorry. Got a little complicated, and I've been away from
> > my filesystem test setup so I didn't want to send it (lucky, coz I found
> > a bug after more substantial testing).
> > 
> > Anyway, RFC?
>   Hmm, maybe one comment/question:
> 
> > Index: linux-2.6/fs/buffer.c
> > ===
> > --- linux-2.6.orig/fs/buffer.c  2007-10-08 14:09:35.0 +1000
> > +++ linux-2.6/fs/buffer.c   2007-10-08 16:45:28.0 +1000
> ...
> 
> > -/*
> > - * Make sure any changes to nobh_commit_write() are reflected in
> > - * nobh_truncate_page(), since it doesn't call commit_write().
> > - */
> > -int nobh_commit_write(struct file *file, struct page *page,
> > -   unsigned from, unsigned to)
> > +int nobh_write_end(struct file *file, struct address_space *mapping,
> > +   loff_t pos, unsigned len, unsigned copied,
> > +   struct page *page, void *fsdata)
> >  {
> > struct inode *inode = page->mapping->host;
> > -   loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> > +   struct buffer_head *head = NULL;
> > +   struct buffer_head *bh;
> >  
> > -   if (page_has_buffers(page))
> > -   return generic_commit_write(file, page, from, to);
> > +   if (!PageMappedToDisk(page)) {
> > +   if (unlikely(copied < len) && !page_has_buffers(page))
> > +   attach_nobh_buffers(page, head);
> > +   if (page_has_buffers(page))
> > +   return generic_write_end(file, mapping, pos, len, 
> > copied, page, fsdata);
> > +   }
>   So we can have a page with attached buffers but we decide to not call
> generic_write_end()...
> 
> > SetPageUptodate(page);
> > set_page_dirty(page);
>   And we just set the page dirty but we leave buffers clean. So cannot
> subsequent try_to_free_buffers() come, mark page as clean (as buffers
> are clean) and discard the data?

Hmm, we might just be OK here because set_page_dirty should dirty the
buffers. However, I think you have spotted a mistake: it would be
better if the generic_write_end test were outside the PageMappedToDisk
check.
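
Something like this, maybe (an untested sketch of what I mean, not a
final patch -- do the short-copy buffer attach and the buffers test
unconditionally, rather than only when !PageMappedToDisk):

	if (unlikely(copied < len) && !page_has_buffers(page))
		attach_nobh_buffers(page, head);
	if (page_has_buffers(page))
		return generic_write_end(file, mapping, pos, len,
					copied, page, fsdata);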

Thanks!



Re: [PATCH v2] bitops kernel-doc: inline instead of macro

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 02:55, Randy Dunlap wrote:
> > Hmm, can we simply do
> >
> > static inline int test_and_set_bit_lock(int nr, volatile unsigned long *
> > addr) {
> > return test_and_set_bit(nr, addr);
> > }
> >
> > please ?
>
> Certainly.  That does look better.

Thanks!

>
> ---
>
> From: Randy Dunlap <[EMAIL PROTECTED]>
>
> Use duplicated inline functions for test_and_set_bit_lock() on x86
> instead of #define macros, thus avoiding a bad example.  This allows
> kernel-doc to run cleanly instead of terminating with an error:
>
> Error(linux-2.6.24-rc1//include/asm-x86/bitops_32.h:188): cannot understand
> prototype: 'test_and_set_bit_lock test_and_set_bit '
>
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
> ---
>  include/asm-x86/bitops_32.h |7 +--
>  include/asm-x86/bitops_64.h |7 +--
>  2 files changed, 10 insertions(+), 4 deletions(-)
>
> --- linux-2.6.24-rc1.orig/include/asm-x86/bitops_32.h
> +++ linux-2.6.24-rc1/include/asm-x86/bitops_32.h
> @@ -183,9 +183,12 @@ static inline int test_and_set_bit(int n
>   * @nr: Bit to set
>   * @addr: Address to count from
>   *
> - * This is the same as test_and_set_bit on x86
> + * This is the same as test_and_set_bit on x86.
>   */
> -#define test_and_set_bit_lock test_and_set_bit
> +static inline int test_and_set_bit_lock(int nr, volatile unsigned long *
> addr) +{
> + return test_and_set_bit(nr, addr);
> +}
>
>  /**
>   * __test_and_set_bit - Set a bit and return its old value
> --- linux-2.6.24-rc1.orig/include/asm-x86/bitops_64.h
> +++ linux-2.6.24-rc1/include/asm-x86/bitops_64.h
> @@ -173,9 +173,12 @@ static __inline__ int test_and_set_bit(i
>   * @nr: Bit to set
>   * @addr: Address to count from
>   *
> - * This is the same as test_and_set_bit on x86
> + * This is the same as test_and_set_bit on x86.
>   */
> -#define test_and_set_bit_lock test_and_set_bit
> +static __inline__ int test_and_set_bit_lock(int nr, volatile void * addr)
> +{
> + return test_and_set_bit(nr, addr);
> +}
>
>  /**
>   * __test_and_set_bit - Set a bit and return its old value


Re: sysfs sys/kernel/ namespace (was Re: [PATCH 0/2] add new notifier function ,take2)

2007-10-25 Thread Nick Piggin
On Thursday 25 October 2007 15:45, Greg KH wrote:
> On Thu, Oct 25, 2007 at 12:31:06PM +1000, Nick Piggin wrote:
> > On Wednesday 24 October 2007 21:12, Kay Sievers wrote:
> > > On 10/24/07, Nick Piggin <[EMAIL PROTECTED]> wrote:

> > > It was intended to be something like /proc/sys/kernel/ only.
> >
> > Really? So you'd be happy to have a /sys/dev /sys/fs /sys/kernel
> > /sys/net /sys/vm etc? "kernel" to me shouldn't really imply the
> > stuff under the kernel/ source directory or other random stuff
> > that doesn't fit into another directory, but attributes that are
> > directly related to the kernel software (rather than directly
> > associated with any device).
>
> What would you want in /sys/net and /sys/dev and /sys/vm?  I don't mind
> putting subdirs in /sys/kernel/ if you want it.

I guess potentially things that are today in /proc/sys/*. Sysfs is much
closer to the "right" place for this kind of attribute than procfs,
isn't it?


> > It would be nice to get a sysfs content maintainer or two. Just
> > having new additions occasionally reviewed along with the rest of
> > a patch, by random people, doesn't really aid consistency. Would it
> > be much trouble to ask that _all_ additions to sysfs be accompanied
> > by notification to this maintainer, along with a few line description?
> > (then merge would require SOB from said maintainer).
>
> No, I would _love_ that.  We should make the requirement that all new
> sysfs files be documented in Documentation/API/ like that details.

Obviously I'm for that too. A mandatory cc to a linux-abi list,
documentation, and an acked-by from the relevant API maintainers, etc.
All it needs is upstream to agree and someone to implement it.


> I'll be glad to review it, but as it's pretty trivial to add sysfs
> files, everyone ends up doing it :)

If it fits with the overall direction that yourself / Kay / everyone
else has in mind, then yes. Problem is that if this stuff goes
unreviewed, or reviewed by different people who don't have a coherent
idea of what the API should look like, then it ends in a mess that you
can't fix easily.


Re: Is gcc thread-unsafe?

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 09:09, Andi Kleen wrote:
> On Friday 26 October 2007 00:49:42 Nick Piggin wrote:
>
> > Marking volatile I think is out of the question. To start with,
> > volatile creates really poor code (and most of the time we actually
> > do want the code in critical sections to be as tight as possible).
>
> Poor code is better than broken code I would say. Besides,
> the cross-CPU synchronization paths are likely dominated by
> cache misses anyway; it's unlikely they're actually limited
> by the core CPU. So it'll probably not matter all that much
> if the code is poor or not.

But we would have to mark all memory accesses inside the critical
section as volatile, even after the CPU synchronisation, and often in
the common case where there is no contention or cacheline bouncing.

Sure, real code is dominated by cache misses anyway, etc etc.
However volatile prevents a lot of real optimisations that we'd
actually want to do, and increases icache size etc.


> > But also because I don't think these bugs are just going to be
> > found easily.
>
> > > It might be useful to come up with some kind of assembler pattern
> > > matcher to check if any such code is generated for the kernel
> > > and try it with different compiler versions.
>
> > Hard to know how to do it. If you can, then it would be interesting.

> I checked my kernel for adc at least (for the trylock/++ pattern)
> and couldn't find any (in fact all of them were in
> data the compiler thought to be code). That was not an allyesconfig
> build, so I missed some code.

sbb as well.


> For cmov it's at first harder because they're much more frequent,
> and cmov to memory is a multiple instruction pattern (the instruction
> only supports memory source operands). Luckily gcc
> doesn't know the "if (x) mem = a;" => "ptr = cmov(x, a, dummy);
> *ptr = a;" transformation trick, so I don't think there are actually
> any conditional stores on current x86.
>
> It might be a problem on other architectures which support true
> conditional stores, though (like IA64 or ARM).

It might just depend on the instruction costs that gcc knows about.
For example, if ld/st is expensive, they might hoist them out of
loops etc. You don't even need to have one of these predicate or
pseudo predicate instructions.
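
For example, something as dumb as this (a contrived sketch; 'hits' is
an invented shared variable) could hit the same bug purely via register
promotion, with no fancy instructions at all:

	extern int hits;	/* shared with other threads */

	void count_negative(int *p, int n)
	{
		int i;

		/* 'hits' may be loaded once, kept in a register, and
		 * stored back unconditionally after the loop -- a
		 * speculative store if no p[i] was ever negative */
		for (i = 0; i < n; i++)
			if (p[i] < 0)
				hits++;
	}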


> Also I'm not sure if gcc doesn't know any other tricks like the
> conditional add using carry, although I cannot think of any related
> to stores off the top of my head.
>
> Anyway, if it's only the conditional add, if we ever catch such a case
> it could also be handled with inline assembly, similar to local_inc().

But we don't actually know what it is, and it could change with
different architectures or versions of gcc. I think the sanest thing
is for gcc to help us out here, seeing as there is this very well
defined requirement that we want.

If you really still want the optimisation to occur, I don't think it
is too much to use a local variable for the accumulator (eg. in the
simple example case).


Re: Is gcc thread-unsafe?

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 09:55, Andi Kleen wrote:
> > But we don't actually know what it is, and it could change with
> > different architectures or versions of gcc. I think the sanest thing
> > is for gcc to help us out here, seeing as there is this very well
> > defined requirement that we want.
> >
> > If you really still want the optimisation to occur,
>
> I don't think it makes sense for memory. It may make sense for
> registers.
>
> But given that my kernel doesn't seem to contain any instances
> at all, it's probably not too useful for it anyway.
>
> So I wouldn't have a problem disabling it, but it would
> be better if they fixed their heuristics to only apply it when
> it makes sense.

That's what I mean -- disabling it for memory. I mean, I am just
talking about the conditional => unconditional store to a shared
variable optimisation.

So for example, adc, sbb, and cmov are all still fine when they
are used for the right things. I don't want to turn them off
because they really are quite useful.

As far as it being theoretical... I hope it is. But we should
nip this in the bud (gcc 3.x does this too, sigh) before it
causes problems for us (and any and all other threaded programs
and libraries out there). And by that I mean ask them for a
compiler option rather than start adding volatile.



[interesting] smattering of possible memory ordering bugs

2007-10-25 Thread Nick Piggin
Hi,

Just out of interest, I did a grep for files containing test_and_set_bit
as well as clear_bit (excluding obvious ones like include/asm-*/bitops.h).

Quite a few interesting things. There is a lot of stuff in drivers/* that
could be suspect WRT memory barriers, including lots I didn't touch. Lots
of work. But forget that...

Some surprises with more obvious bugs, which I have added some cc's for.
powerpc seems to have an mmu context allocation bug (and a possible
improvement in hpte locking with the new bitops).

floppy is using a crazy open coded bit mutex, convert to regular mutex.

vt has something strange too.

fs-writeback could be made a little safer by properly closing the critical
section (the existing sections aren't really critical, but you never know
what the future holds ;))

jfs seems to be missing an obvious smp_mb__before_clear_bit (and a less
obvious smp_mb__after_clear_bit)

xfs seems to be missing smp_mb__before

Lots of code that allocates things (eg. msi interrupts) is suspicious. I'm
not exactly sure if these allocation bitmaps are actually used to protect
data structures or not...

I'll have to get round to preparing proper patches for these things if I
don't hear nacks...
---
 arch/arm/mach-davinci/gpio.c  |4 ++--
 arch/arm/mach-imx/generic.c   |4 ++--
 arch/arm/mach-iop13xx/msi.c   |4 ++--
 arch/arm/mach-ns9xxx/gpio.c   |4 ++--
 arch/avr32/mach-at32ap/pio.c  |1 +
 arch/powerpc/mm/hash_native_64.c  |5 ++---
 arch/ppc/xmon/start.c |   20 +---
 block/ll_rw_blk.c |   20 +---
 drivers/block/cciss.c |4 ++--
 drivers/block/cpqarray.c  |4 ++--
 drivers/block/floppy.c|   36 
 drivers/char/vt.c |4 ++--
 drivers/net/ibm_newemac/mal.c |5 ++---
 drivers/net/s2io.c|8 
 drivers/usb/misc/phidgetservo.c   |6 +++---
 fs/fs-writeback.c |4 ++--
 fs/jfs/jfs_metapage.c |5 +++--
 fs/xfs/xfs_mount.c|4 ++--
 include/asm-powerpc/mmu_context.h |4 ++--
 include/asm-ppc/mmu_context.h |4 ++--
 include/linux/aio.h   |5 -
 include/linux/interrupt.h |3 ++-
 include/linux/netdevice.h |5 ++---
 net/bluetooth/cmtp/core.c |4 ++--
 net/core/dev.c|5 ++---
 26 files changed, 75 insertions(+), 98 deletions(-)

Index: linux-2.6/arch/arm/mach-davinci/gpio.c
===
--- linux-2.6.orig/arch/arm/mach-davinci/gpio.c
+++ linux-2.6/arch/arm/mach-davinci/gpio.c
@@ -34,7 +34,7 @@ int gpio_request(unsigned gpio, const ch
 	if (gpio >= DAVINCI_N_GPIO)
 		return -EINVAL;
 
-	if (test_and_set_bit(gpio, gpio_in_use))
+	if (test_and_set_bit_lock(gpio, gpio_in_use))
 		return -EBUSY;
 
 	return 0;
@@ -46,7 +46,7 @@ void gpio_free(unsigned gpio)
 	if (gpio >= DAVINCI_N_GPIO)
 		return;
 
-	clear_bit(gpio, gpio_in_use);
+	clear_bit_unlock(gpio, gpio_in_use);
 }
 EXPORT_SYMBOL(gpio_free);
 
Index: linux-2.6/arch/arm/mach-imx/generic.c
===
--- linux-2.6.orig/arch/arm/mach-imx/generic.c
+++ linux-2.6/arch/arm/mach-imx/generic.c
@@ -107,7 +107,7 @@ int imx_gpio_request(unsigned gpio, cons
 		return -EINVAL;
 	}
 
-	if(test_and_set_bit(gpio, imx_gpio_alloc_map)) {
+	if(test_and_set_bit_lock(gpio, imx_gpio_alloc_map)) {
 		printk(KERN_ERR "imx_gpio: GPIO %d already used. Allocation for \"%s\" failed\n",
 			gpio, label ? label : "?");
 		return -EBUSY;
@@ -123,7 +123,7 @@ void imx_gpio_free(unsigned gpio)
 	if(gpio >= (GPIO_PORT_MAX + 1) * 32)
 		return;
 
-	clear_bit(gpio, imx_gpio_alloc_map);
+	clear_bit_unlock(gpio, imx_gpio_alloc_map);
 }
 
 EXPORT_SYMBOL(imx_gpio_free);
Index: linux-2.6/arch/arm/mach-iop13xx/msi.c
===
--- linux-2.6.orig/arch/arm/mach-iop13xx/msi.c
+++ linux-2.6/arch/arm/mach-iop13xx/msi.c
@@ -135,7 +135,7 @@ again:
 	if (irq > NR_IRQS)
 		return -ENOSPC;
 	/* test_and_set_bit operates on 32-bits at a time */
-	if (test_and_set_bit(pos, msi_irq_in_use))
+	if (test_and_set_bit_lock(pos, msi_irq_in_use))
 		goto again;
 
 	dynamic_irq_init(irq);
@@ -149,7 +149,7 @@ void destroy_irq(unsigned int irq)
 
 	dynamic_irq_cleanup(irq);
 
-	clear_bit(pos, msi_irq_in_use);
+	clear_bit_unlock(pos, msi_irq_in_use);
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
Index: linux-2.6/arch/arm/mach-ns9xxx/gpio.c
===
--- linux-2.6.orig/arch/arm/mach-ns9xxx/gpio.c
+++ linux-2.6/arch/arm/mach-ns9xxx/gpio.c
@@ -85,7 +85,7 @@ static inline void __iomem *ns9xxx_gpio_
 int gpio_request(unsigned gpio, const char *label)
 {
 	if (likely(ns9xxx_valid_gpio(gpio)))
-		return test_and_set_bit(gpio, gpiores) ? -EBUSY : 0;
+		return 

Re: [interesting] smattering of possible memory ordering bugs

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 13:35, Benjamin Herrenschmidt wrote:

[acks]

Thanks for those...

> > Index: linux-2.6/include/asm-powerpc/mmu_context.h
> > ===
> > --- linux-2.6.orig/include/asm-powerpc/mmu_context.h
> > +++ linux-2.6/include/asm-powerpc/mmu_context.h
> > @@ -129,7 +129,7 @@ static inline void get_mmu_context(struc
> > 		steal_context();
> >  #endif
> > 	ctx = next_mmu_context;
> > -	while (test_and_set_bit(ctx, context_map)) {
> > +	while (test_and_set_bit_lock(ctx, context_map)) {
> > 		ctx = find_next_zero_bit(context_map, LAST_CONTEXT+1, ctx);
> > 		if (ctx > LAST_CONTEXT)
> > 			ctx = 0;
> > @@ -158,7 +158,7 @@ static inline void destroy_context(struc
> >  {
> > 	preempt_disable();
> > 	if (mm->context.id != NO_CONTEXT) {
> > -		clear_bit(mm->context.id, context_map);
> > +		clear_bit_unlock(mm->context.id, context_map);
> > 		mm->context.id = NO_CONTEXT;
> >  #ifdef FEW_CONTEXTS
> > 		atomic_inc(&nr_free_contexts);

> I don't think the previous code was wrong... it's not a locked section
> and we don't care about ordering previous stores. It's an allocation, it
> should be fine. In general, bitmap allocators should be all right.

Well if it is just allocating an arbitrary _number_ out of a bitmap
and nothing else (eg. like the pid allocator), then you don't need
barriers.
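
ie. the distinction is roughly this (a sketch with invented names):

	/* pure number allocator: the bit protects nothing but itself,
	 * so the plain atomic bitops are enough */
	do {
		id = find_next_zero_bit(map, MAX_ID, 0);
	} while (id < MAX_ID && test_and_set_bit(id, map));

	/* a bit standing in for a lock over other data: needs
	 * acquire/release ordering, ie. test_and_set_bit_lock()
	 * and clear_bit_unlock() */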


> Ignore the FEW_CONTEXTS stuff for now :-) At this point, it's UP only
> and will be replaced sooner or later.

OK. Then I agree, provided you're doing the correct synchronisation
or flushing etc. when destroying a context (which presumably you are).

I'll drop those bits then.

Thanks,
Nick


Re: Is gcc thread-unsafe?

2007-10-24 Thread Nick Piggin
Hi David,

[BTW. can you retain cc lists, please?]

On Thursday 25 October 2007 14:29, David Schwartz wrote:
> > Well that's exactly right. For threaded programs (and maybe even
> > real-world non-threaded ones in general), you don't want to be
> > even _reading_ global variables if you don't need to. Cache misses
> > and cacheline bouncing could easily cause performance to completely
> > tank in some cases while only gaining a cycle or two in
> > microbenchmarks for doing these funny x86 predication things.
>
> For some CPUs, replacing an conditional branch with a conditional move is a
> *huge* win because it cannot be mispredicted.

A *conditional* store should not be a problem.

However, the funny trick of doing a conditional add implemented with
an unconditional store is what is going to cause breakage.

On the CPUs where predicated instructions are a big win, I'd expect
they should also implement a conditional store for use here. However
they might be slower than an unconditional store (eg. x86's cmov),
and in those cases, gcc might just do the non-conditional store.


> In general, compilers should 
> optimize for unshared data since that's much more common in typical code.
> Even for shared data, the usual case is that you are going to access the
> data few times, so pulling the cache line to the CPU is essentially free
> since it will happen eventually.

This is not just a question of data that you were going to use anyway:
gcc generates memory accesses to locations that would never otherwise
be accessed. Even stores. It is basically impossible to say that this
is a real performance win. Even in single threaded code: consider that
cache misses take the vast majority of time in many workloads, which
gives a little hint that maybe it's a bad idea to do this ;)


> Heuristics may show that the vast majority of such constructs write anyway.
> So the optimization may also be valid based on such heuristics.

I'd never say the optimisation would always be useless. But it's a
nasty thing to have on by default, and apparently there is no good way
to suppress it even if we want to.


> A better question is whether it's legal for a compiler that claims to
> support POSIX threads. I'm going to post on comp.programming.threads, where
> the threading experts hang out.

Either way, I think we really need a way to turn it off for Linux.


Re: [PATCH 4/5] Add lock_page_killable

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 14:11, Andrew Morton wrote:
> On Wed, 24 Oct 2007 08:24:57 -0400 Matthew Wilcox <[EMAIL PROTECTED]> wrote:
> > and associated infrastructure such as sync_page_killable and
> > fatal_signal_pending.  Use lock_page_killable in
> > do_generic_mapping_read() to allow us to kill `cat' of a file on an
> > NFS-mounted filesystem.
>
> whoa, big change.
>
> What exactly are the semantics here?  If the process has actually been
> killed (ie: we know that userspace won't be running again) then we break
> out of a lock_page() and allow the process to exit?  ie: it's basically
> invisible to userspace?

The actual conversions should also be relatively useful groundwork
if we ever want to make more things become generally interruptible.


> If so, it sounds OK.  I guess.  We're still screwed if the process is doing
> a synchronous write and lots of other scenarios.

I don't think it will matter in too many situations. If the process is
doing a synchronous write, nothing is guaranteed until the syscall
returns success...
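
For illustration, the calling pattern these conversions enable looks
roughly like this (a sketch, following the interface in this series):

	error = lock_page_killable(page);
	if (unlikely(error))
		return error;	/* fatal signal pending; task is exiting */
	/* ... operate on the locked page ... */
	unlock_page(page);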


Re: Is gcc thread-unsafe?

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 13:46, Arjan van de Ven wrote:
> On Thu, 25 Oct 2007 13:24:49 +1000
>
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Andi spotted this exchange on the gcc list. I don't think he's
> > brought it up here yet, but it worries me enough that I'd like
> > to discuss it.
> >
> > Starts here
> > http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html
> >
> > Concrete example here
> > http://gcc.gnu.org/ml/gcc/2007-10/msg00275.html
> >
> > Basically, what the gcc developers are saying is that gcc is
> > free to load and store to any memory location, so long as it
> > behaves as if the instructions were executed in sequence.
>
> this optimization btw is a serious mis-optimization; it makes memory
> more dirty and causes cachelines to become unshared. I'm sure it
> works great on microbenchmarks but it sucks bigtime for anything real

Well that's exactly right. For threaded programs (and maybe even
real-world non-threaded ones in general), you don't want to be
even _reading_ global variables if you don't need to. Cache misses
and cacheline bouncing could easily cause performance to completely
tank in some cases while only gaining a cycle or two in
microbenchmarks for doing these funny x86 predication things.

I'm not sure about ia64 -- I _hope_ that for most of their
predication stuff, they also predicate the stores, rather than
just store unconditionally and rely on the source operand not
changing in the case they didn't intend the memory to change.


Re: [PATCH 0/5] TASK_KILLABLE

2007-10-24 Thread Nick Piggin
On Friday 19 October 2007 08:25, Matthew Wilcox wrote:
> This series of patches introduces the facility to deliver only fatal
> signals to tasks which are otherwise waiting uninterruptibly.


This is pretty nice I think. It also is a significant piece of
infrastructure required to fix some of the main oom kill deadlocks.


Re: [PATCH 5/5] Make wait_on_retry_sync_kiocb killable

2007-10-24 Thread Nick Piggin
On Friday 19 October 2007 08:26, Matthew Wilcox wrote:
> Use TASK_KILLABLE to allow wait_on_retry_sync_kiocb to return -EINTR.
> All callers then check the return value and break out of their loops.
>
> Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>
> ---
>  fs/read_write.c |   17 -
>  1 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 124693e..3196a3b 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -218,14 +218,15 @@ Einval:
>   return -EINVAL;
>  }
>
> -static void wait_on_retry_sync_kiocb(struct kiocb *iocb)
> +static int wait_on_retry_sync_kiocb(struct kiocb *iocb)
>  {
> - set_current_state(TASK_UNINTERRUPTIBLE);
> + set_current_state(TASK_KILLABLE);
>   if (!kiocbIsKicked(iocb))
>   schedule();
>   else
>   kiocbClearKicked(iocb);
>   __set_current_state(TASK_RUNNING);
> + return fatal_signal_pending(current) ? -EINTR : 0;

Although the EINTR never gets to userspace anyway, is there a good
reason why the last patch for do_generic_mapping_read doesn't pass
back -EINTR?


Re: [PATCH 2/5] Use macros instead of TASK_ flags

2007-10-24 Thread Nick Piggin
On Friday 19 October 2007 08:25, Matthew Wilcox wrote:
> Abstracting away direct uses of TASK_ flags allows us to change the
> definitions of the task flags more easily.
>
> Also restructure do_wait() a little
>
> Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>
> ---
>  arch/ia64/kernel/perfmon.c |4 +-
>  fs/proc/array.c|9 +---
>  fs/proc/base.c |2 +-
>  include/linux/sched.h  |   15 +++
>  include/linux/wait.h   |   11 +++--
>  kernel/exit.c  |   90 +++
>  kernel/power/process.c |7 +--
>  kernel/ptrace.c|8 ++--
>  kernel/sched.c |   15 +++
>  kernel/signal.c|6 +-
>  kernel/wait.c  |2 +-
>  11 files changed, 83 insertions(+), 86 deletions(-)
>
> diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
> index f55fa07..6b0a6cf 100644
> --- a/arch/ia64/kernel/perfmon.c
> +++ b/arch/ia64/kernel/perfmon.c
> @@ -2630,7 +2630,7 @@ pfm_task_incompatible(pfm_context_t *ctx, struct
> task_struct *task) */
>   if (task == current) return 0;
>
> - if ((task->state != TASK_STOPPED) && (task->state != TASK_TRACED)) {
> + if (!is_task_stopped_or_traced(task)) {
>   DPRINT(("cannot attach to non-stopped task [%d] state=%ld\n", 
> task->pid,
> task->state)); return -EBUSY;
>   }
> @@ -4790,7 +4790,7 @@ recheck:
>* the task must be stopped.
>*/
>   if (PFM_CMD_STOPPED(cmd)) {
> - if ((task->state != TASK_STOPPED) && (task->state != 
> TASK_TRACED)) {
> + if (!is_task_stopped_or_traced(task)) {
>   DPRINT(("[%d] task not in stopped state\n", task->pid));
>   return -EBUSY;
>   }
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 27b59f5..8939bf0 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -140,13 +140,8 @@ static const char *task_state_array[] = {
>
>  static inline const char *get_task_state(struct task_struct *tsk)
>  {
> - unsigned int state = (tsk->state & (TASK_RUNNING |
> - TASK_INTERRUPTIBLE |
> - TASK_UNINTERRUPTIBLE |
> - TASK_STOPPED |
> - TASK_TRACED)) |
> - (tsk->exit_state & (EXIT_ZOMBIE |
> - EXIT_DEAD));
> + unsigned int state = (tsk->state & TASK_REPORT) |
> + (tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD));
>   const char **p = &task_state_array[0];
>
>   while (state) {
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4fe74d1..e7e1815 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -196,7 +196,7 @@ static int proc_root_link(struct inode *inode, struct
> dentry **dentry, struct vf
>  (task == current || \
>   (task->parent == current && \
>   (task->ptrace & PT_PTRACED) && \
> -  (task->state == TASK_STOPPED || task->state == TASK_TRACED) && \
> +  (is_task_stopped_or_traced(task)) && \
>security_ptrace(current,task) == 0))
>
>  static int proc_pid_cmdline(struct task_struct *task, char * buffer)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c204ab0..5ef5253 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -177,6 +177,21 @@ print_cfs_rq(struct seq_file *m, int cpu, struct
> cfs_rq *cfs_rq) /* in tsk->state again */
>  #define TASK_DEAD64
>
> +/* Convenience macros for the sake of wake_up */
> +#define TASK_NORMAL  (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
> +#define TASK_ALL (TASK_NORMAL | TASK_STOPPED | TASK_TRACED)
> +
> +/* get_task_state() */
> +#define TASK_REPORT  (TASK_RUNNING | TASK_INTERRUPTIBLE | \
> +  TASK_UNINTERRUPTIBLE | TASK_STOPPED | \
> +  TASK_TRACED)

I think it would be nicer if you made it explicit in the name that
these are not individual flags. Maybe it doesn't matter though...

Also, TASK_NORMAL / TASK_ALL aren't very good names. TASK_SLEEP_NORMAL /
TASK_SLEEP_ALL might be a bit more helpful?


Is gcc thread-unsafe?

2007-10-24 Thread Nick Piggin
Hi,

Andi spotted this exchange on the gcc list. I don't think he's
brought it up here yet, but it worries me enough that I'd like
to discuss it.

Starts here
http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html

Concrete example here
http://gcc.gnu.org/ml/gcc/2007-10/msg00275.html

Basically, what the gcc developers are saying is that gcc is
free to load and store to any memory location, so long as it
behaves as if the instructions were executed in sequence.
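
The concrete example is essentially the trylock/++ pattern. A minimal
userspace version looks like this (paraphrased -- see the link above
for the real thing):

	pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
	int acquires_count;

	int trylock(void)
	{
		int res = pthread_mutex_trylock(&mutex);
		if (res == 0)
			++acquires_count;
		return res;
	}

gcc can compile the conditional increment into a load, an
add-with-carry and an unconditional store of acquires_count, so the
store happens even when the mutex was not taken -- racing with
whatever thread does hold it.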

I guess that dynamically allocated memory and computed pointers
are more difficult for gcc to do anything unsafe with, because
it is harder to tell if a given function has deallocated the
memory. However, even that could theoretically happen in future
if the compiler can work out that the address comes from a global
variable or is not changed in between.

Linux makes extensive use of both trylocks and interruptible
locks (ie. ones which automatically result in divergent code paths,
one of which holds the lock and the other doesn't). However there
are also other code paths which will either hold a particular
lock or will not hold it, depending on context or some flags
etc. barrier() doesn't help.

For x86, obviously the example above shows it can be miscompiled,
but it is probably relatively hard to make it happen for a non
trivial sequence. For an ISA with lots of predicated instructions
like ia64, it would seem to be much more likely. But of course
we don't want even the possibility of failures.

The gcc guys seem to be saying to mark everything volatile that
could be touched in a critical section. This is insane for Linux.

Any thoughts?


Re: SLUB 0:1 SLAB (OOM during massive parallel kernel builds)

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 12:43, Christoph Lameter wrote:
> On Thu, 25 Oct 2007, Nick Piggin wrote:
> > > Ummm... all unreclaimable is set! Are you mlocking the pages in memory?
> > > Or what causes this? All pages under writeback? What is the dirty ratio
> > > set to?
> >
> > Why is SLUB behaving differently, though.
>
> Not sure. Are we really sure that this does not occur using SLAB?

From the reports it seems pretty consistent. I guess it could well
be something that may occur with SLAB *if the conditions are a bit
different*...

> > Memory efficiency wouldn't be the reason, would it? I mean, SLUB
> > should be more efficient than SLAB, plus have less data lying around
> > in queues.
>
> SLAB may have data around in queues which (if the stars align the right
> way) may allow it to go longer without having to get a page from the page
> allocator.

But page allocs from slab aren't where the OOMs are occurring, so this
seems unlikely. (Also, the all_unreclaimable logic now should be pretty
strict, so you have to really run the machine out of memory: 1GB of
swap gets fully used, then his DMA32 zone is scanned 8 times without
reclaiming a single page.)

That said, parallel kernel compiling can really change a lot in memory
footprint depending on small variations in timing. So it might not be
anything to worry about.


Re: SLUB 0:1 SLAB (OOM during massive parallel kernel builds)

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 12:15, Christoph Lameter wrote:
> On Wed, 24 Oct 2007, Alexey Dobriyan wrote:
> > [12728.701398] DMA free:8032kB min:32kB low:40kB high:48kB active:2716kB
> > inactive:2208kB present:12744kB pages_scanned:9299 all_unreclaimable?
> > yes [12728.701567] lowmem_reserve[]: 0 2003 2003 2003 [12728.701654]
>
> Ummm... all unreclaimable is set! Are you mlocking the pages in memory? Or
> what causes this? All pages under writeback? What is the dirty ratio set
> to?

Why is SLUB behaving differently, though.

Memory efficiency wouldn't be the reason, would it? I mean, SLUB
should be more efficient than SLAB, plus have less data lying around
in queues.


Re: sysfs sys/kernel/ namespace (was Re: [PATCH 0/2] add new notifier function ,take2)

2007-10-24 Thread Nick Piggin
On Wednesday 24 October 2007 21:12, Kay Sievers wrote:
> On 10/24/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > On Tuesday 23 October 2007 10:55, Takenori Nagano wrote:
> > > Nick Piggin wrote:
> > > > One thing I'd suggest is not to use debugfs, if it is going to
> > > > be a useful end-user feature.
> > >
> > > Is /sys/kernel/notifier_name/ an appropriate place?
> >
> > I'm curious about the /sys/kernel/ namespace. I had presumed that
> > it is intended to replace /proc/sys/ basically with the same
> > functionality.
>
> It was intended to be something like /proc/sys/kernel/ only.

Really? So you'd be happy to have a /sys/dev /sys/fs /sys/kernel
/sys/net /sys/vm etc? "kernel" to me shouldn't really imply the
stuff under the kernel/ source directory or other random stuff
that doesn't fit into another directory, but attributes that are
directly related to the kernel software (rather than directly
associated with any device).


> > I _assume_ these are system software stats and
> > tunables that are not exactly linked to device drivers (OTOH,
> > where do you draw the line? eg. Would filesystems go here?
>
> We already have /sys/fs/ ?
>
> > Core network algorithm tunables might, but per interface ones probably
> > not...).
>
> We will merge the nonsense of "block/", "class/" and "bus/" to one
> "subsystem". The block, class, bus directories will only be kept as
> symlinks for compatibility. Then every subsystem has a directory like:
> /sys/subsystem/block/, /sys/subsystem/net/ and the devices of the
> subsystem are in a devices/ directory below that. Just like the
> /sys/bus/<name>/devices/ layout looks today. All subsystem-global
> tunables can go below the /sys/subsystem/<name>/ directory, without
> clashing with the list of devices or anything else.

Makes sense.


> > I don't know. Are there guidelines for sysfs (and procfs for that
> > matter)? Is anyone maintaining it (not the infrastructure, but
> > the actual content)?
>
> Unfortunately, there was never really a guideline.
>
> > It's kind of ironic that /proc/sys/ looks like one of the best
> > organised directories in proc, while /sys/kernel seems to be in
> > danger of becoming a mess: it has kexec and uevent files in the
> > base directory, rather than in subdirectories...
>
> True, just looking at it now, people do crazy things like:
> /sys/kernel/notes, which is a file with binary content, and a name
> nobody will ever be able to guess what it is good for. That should
> definitely go into a section/ directory.  Also the VM stuff there
> should probably move to a /sys/vm/ directory along with the weirdly
> placed top-level /sys/slab/.

Top level directory IMO should be kept as sparse as possible. If
you agree to /sys/mm for example, that's fine, but then slab should
go under that. (I'd prefer all to go underneath /sys/kernel, but...).

It would be nice to get a sysfs content maintainer or two. Just
having new additions occasionally reviewed along with the rest of
a patch, by random people, doesn't really aid consistency. Would it
be much trouble to ask that _all_ additions to sysfs be accompanied
by notification to this maintainer, along with a few line description?
(then merge would require SOB from said maintainer).

For that matter, this should be the case for *all* userspace API
changes ([EMAIL PROTECTED])


Re: [patch 1/5] wait: use lock bitops for __wait_on_bit_lock

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 11:14, Andrew Morton wrote:
> On Wed, 24 Oct 2007 18:13:06 +1000 [EMAIL PROTECTED] wrote:
> > Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
> >
> > ---
> >  kernel/wait.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > Index: linux-2.6/kernel/wait.c
> > ===
> > --- linux-2.6.orig/kernel/wait.c
> > +++ linux-2.6/kernel/wait.c
> > @@ -195,7 +195,7 @@ __wait_on_bit_lock(wait_queue_head_t *wq
> > if ((ret = (*action)(q->key.flags)))
> > break;
> > }
> > -   } while (test_and_set_bit(q->key.bit_nr, q->key.flags));
> > +   } while (test_and_set_bit_lock(q->key.bit_nr, q->key.flags));
> > 	finish_wait(wq, &q->wait);
> > return ret;
> >  }
>
> Sorry, I'm just not going to apply a patch like that.
>
> I mean, how the heck is anyone else supposed to understand what you're up
> to?

Hmm, I might just withdraw this patch 1/5. This is generally a slowpath,
and it's hard to guarantee that any exported user doesn't rely on the
full barrier here (not that they would know whether they do or not, let
alone document the fact).

I think all the others are pretty clear, though? (read on if no)


> There's a bit of documentation in Documentation/atomic_ops.txt 
> (probably enough, I guess) but it is totally unobvious to 98.3% of kernel
> developers when they should use test_and_set_bit() versus
> test_and_set_bit_lock() and it is far too much work to work out why it was
> used in __wait_on_bit_lock(), whether it is correct, what advantages it
> brings and whether and where others should emulate it.

If you set a bit for the purpose of opening a critical section, then
you can use this. And clear_bit_unlock to close it.

The advantages are that it is faster, and the hapless driver writer
doesn't have to butcher or forget about doing the correct
smp_mb__before_clear_bit(), or have reviewers wondering exactly WTF
that barrier is for, etc.

Basically, it is faster, harder to get wrong, and more self-documenting.
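
For example, a hand-rolled bit lock then looks like this (a sketch;
MY_BIT and 'word' are invented names):

	while (test_and_set_bit_lock(MY_BIT, &word))
		cpu_relax();	/* spin until we own the bit */
	/* critical section: accesses cannot leak outside the lock */
	clear_bit_unlock(MY_BIT, &word);	/* no smp_mb__before_clear_bit() needed */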

In general, others should not emulate it, because they should be
using our regular locking primitives instead. If they really must
roll their own, then using these would be nice, IMO.


> So in my opinion this submission isn't of sufficient quality to be
> included in Linux.
>
> IOW: please write changelogs.  Preferably good ones.

2/5: "tasklet_trylock opens a critical section. tasklet_unlock closes it.
  hence, _lock bitops can be used for the bitops"

5/5: "trylock_page opens a critical section. unlock_page closes it. hence,
  _lock bitops can be used for the bitops"

5/5: "trylock_buffer opens a critical section. unlock_buffer closes it.
  hence, _lock bitops can be used for the bitops"

Are those helpful?


Re: [PATCH] bitops kernel-doc: expand macro

2007-10-24 Thread Nick Piggin
On Wednesday 24 October 2007 15:09, Randy Dunlap wrote:
> From: Randy Dunlap <[EMAIL PROTECTED]>
>
> Can we expand this macro definition, or should I look for a way to
> fool^W teach kernel-doc about this?
>
> scripts/kernel-doc says:
> Error(linux-2.6.24-rc1//include/asm-x86/bitops_32.h:188): cannot understand
> prototype: 'test_and_set_bit_lock test_and_set_bit '

Actually, it probably looks a bit nicer like this anyway. If you grep
for it, then you can actually see the parameters...

On third thoughts, an inline function might be the best thing to do,
and also avoid setting a bad example. What do you think?
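
Something like this, say (a sketch only -- on x86 the plain
test_and_set_bit already has the required barrier semantics, so the
wrapper can just call it):

	static inline int test_and_set_bit_lock(int nr, volatile unsigned long *addr)
	{
		return test_and_set_bit(nr, addr);
	}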

>
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
> ---
>  include/asm-x86/bitops_32.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- linux-2.6.24-rc1.orig/include/asm-x86/bitops_32.h
> +++ linux-2.6.24-rc1/include/asm-x86/bitops_32.h
> @@ -185,7 +185,7 @@ static inline int test_and_set_bit(int n
>   *
>   * This is the same as test_and_set_bit on x86
>   */
> -#define test_and_set_bit_lock test_and_set_bit
> +#define test_and_set_bit_lock(nr, addr) test_and_set_bit(nr, addr)
>
>  /**
>   * __test_and_set_bit - Set a bit and return its old value
> ---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


sysfs sys/kernel/ namespace (was Re: [PATCH 0/2] add new notifier function ,take2)

2007-10-24 Thread Nick Piggin
On Tuesday 23 October 2007 10:55, Takenori Nagano wrote:
> Nick Piggin wrote:

> > One thing I'd suggest is not to use debugfs, if it is going to
> > be a useful end-user feature.
>
> Is /sys/kernel/notifier_name/ an appropriate place?

Hi list,

I'm curious about the /sys/kernel/ namespace. I had presumed that
it is intended to replace /proc/sys/ basically with the same
functionality. I _assume_ these are system software stats and
tunables that are not exactly linked to device drivers (OTOH,
where do you draw the line? eg. Would filesystems go here? Core
network algorithm tunables might, but per interface ones probably
not...).

I don't know. Are there guidelines for sysfs (and procfs for that
matter)? Is anyone maintaining it (not the infrastructure, but
the actual content)?

It's kind of ironic that /proc/sys/ looks like one of the best
organised directories in proc, while /sys/kernel seems to be in
danger of becoming a mess: it has kexec and uevent files in the
base directory, rather than in subdirectories...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB 0:1 SLAB (OOM during massive parallel kernel builds)

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 12:15, Christoph Lameter wrote:
> On Wed, 24 Oct 2007, Alexey Dobriyan wrote:
> > [12728.701398] DMA free:8032kB min:32kB low:40kB high:48kB active:2716kB
> > inactive:2208kB present:12744kB pages_scanned:9299 all_unreclaimable?
> > yes [12728.701567] lowmem_reserve[]: 0 2003 2003 2003 [12728.701654]
>
> Ummm... all unreclaimable is set! Are you mlocking the pages in memory? Or
> what causes this? All pages under writeback? What is the dirty ratio set
> to?

Why is SLUB behaving differently, though.

Memory efficiency wouldn't be the reason, would it? I mean, SLUB
should be more efficient than SLAB, plus have less data lying around
in queues.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB 0:1 SLAB (OOM during massive parallel kernel builds)

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 12:43, Christoph Lameter wrote:
> On Thu, 25 Oct 2007, Nick Piggin wrote:
> > > Ummm... all unreclaimable is set! Are you mlocking the pages in memory?
> > > Or what causes this? All pages under writeback? What is the dirty ratio
> > > set to?
> >
> > Why is SLUB behaving differently, though.
>
> Not sure. Are we really sure that this does not occur using SLAB?

From the reports it seems pretty consistent. I guess it could well
be something that may occur with SLAB *if the conditions are a bit
different*...

> > Memory efficiency wouldn't be the reason, would it? I mean, SLUB
> > should be more efficient than SLAB, plus have less data lying around
> > in queues.
>
> SLAB may have data around in queues which (if the stars align the right
> way) may allow it to go longer without having to get a page from the page
> allocator.

But page allocs from slab aren't where the OOMs are occurring, so this
seems unlikely (also, the all_unreclaimable logic now should be pretty
strict, so you have to really run the machine out of memory: 1GB of
swap gets fully used, then his DMA32 zone is scanned 8 times without
reclaiming a single page).

That said, parallel kernel compiling can really change a lot in memory
footprint depending on small variations in timing. So it might not be
anything to worry about.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Is gcc thread-unsafe?

2007-10-24 Thread Nick Piggin
Hi,

Andi spotted this exchange on the gcc list. I don't think he's
brought it up here yet, but it worries me enough that I'd like
to discuss it.

Starts here
http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html

Concrete example here
http://gcc.gnu.org/ml/gcc/2007-10/msg00275.html

Basically, what the gcc developers are saying is that gcc is
free to load and store to any memory location, so long as it
behaves as if the instructions were executed in sequence.
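
To save chasing the links, the concrete example is (give or take
names -- this is a reconstruction, not a quote) a conditional
increment:

	#include <pthread.h>

	int acquires_count = 0;

	int trylock_and_count(pthread_mutex_t *m)
	{
		int r = pthread_mutex_trylock(m);
		if (r == 0)
			++acquires_count;	/* only valid while holding m */
		return r;
	}

which gcc can compile on x86 to roughly:

	call	pthread_mutex_trylock
	cmpl	$1, %eax		# carry flag set iff %eax == 0
	adcl	$0, acquires_count	# unconditional load+store!

ie. acquires_count is read and written back even when the trylock
failed, racing with whatever thread actually does hold the mutex.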

I guess that dynamically allocated memory and computed pointers
are more difficult for gcc to do anything unsafe with, because
it is harder to tell if a given function has deallocated the
memory. However even that could theoretically happen in future
if the compiler can work out the address comes from a global
variable or is not changed intermediately.

Linux makes extensive use of both trylocks and interruptible
locks (ie. which automatically result in divergent code paths,
one of which holds the lock, the other doesn't). However there
are also other code paths which will either hold a particular
lock or will not hold it, depending on context or some flags
etc. barrier() doesn't help.

For x86, obviously the example above shows it can be miscompiled,
but it is probably relatively hard to make it happen for a non
trivial sequence. For an ISA with lots of predicated instructions
like ia64, it would seem to be much more likely. But of course
we don't want even the possibility of failures.

The gcc guys seem to be saying to mark everything volatile that
could be touched in a critical section. This is insane for Linux.

Any thoughts?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/5] Use macros instead of TASK_ flags

2007-10-24 Thread Nick Piggin
On Friday 19 October 2007 08:25, Matthew Wilcox wrote:
> Abstracting away direct uses of TASK_ flags allows us to change the
> definitions of the task flags more easily.
>
> Also restructure do_wait() a little
>
> Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>
> ---
>  arch/ia64/kernel/perfmon.c |    4 +-
>  fs/proc/array.c            |    9 +---
>  fs/proc/base.c             |    2 +-
>  include/linux/sched.h      |   15 +++
>  include/linux/wait.h       |   11 +++--
>  kernel/exit.c              |   90 +++
>  kernel/power/process.c     |    7 +--
>  kernel/ptrace.c            |    8 ++--
>  kernel/sched.c             |   15 +++
>  kernel/signal.c            |    6 +-
>  kernel/wait.c              |    2 +-
>  11 files changed, 83 insertions(+), 86 deletions(-)
>
> diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
> index f55fa07..6b0a6cf 100644
> --- a/arch/ia64/kernel/perfmon.c
> +++ b/arch/ia64/kernel/perfmon.c
> @@ -2630,7 +2630,7 @@ pfm_task_incompatible(pfm_context_t *ctx, struct
> task_struct *task) */
>  	if (task == current) return 0;
>
> -	if ((task->state != TASK_STOPPED) && (task->state != TASK_TRACED)) {
> +	if (!is_task_stopped_or_traced(task)) {
>  		DPRINT(("cannot attach to non-stopped task [%d] state=%ld\n",
>  			task->pid, task->state));
>  		return -EBUSY;
>  	}
> @@ -4790,7 +4790,7 @@ recheck:
>  	 * the task must be stopped.
>  	 */
>  	if (PFM_CMD_STOPPED(cmd)) {
> -		if ((task->state != TASK_STOPPED) && (task->state != TASK_TRACED)) {
> +		if (!is_task_stopped_or_traced(task)) {
>  			DPRINT(("[%d] task not in stopped state\n", task->pid));
>  			return -EBUSY;
>  		}
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 27b59f5..8939bf0 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -140,13 +140,8 @@ static const char *task_state_array[] = {
>
>  static inline const char *get_task_state(struct task_struct *tsk)
>  {
> -	unsigned int state = (tsk->state & (TASK_RUNNING |
> -					TASK_INTERRUPTIBLE |
> -					TASK_UNINTERRUPTIBLE |
> -					TASK_STOPPED |
> -					TASK_TRACED)) |
> -			(tsk->exit_state & (EXIT_ZOMBIE |
> -					EXIT_DEAD));
> +	unsigned int state = (tsk->state & TASK_REPORT) |
> +			(tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD));
>  	const char **p = &task_state_array[0];
>
>  	while (state) {
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4fe74d1..e7e1815 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -196,7 +196,7 @@ static int proc_root_link(struct inode *inode, struct
> dentry **dentry, struct vf (task == current || \
>  	(task->parent == current && \
>  	(task->ptrace & PT_PTRACED) && \
> -	(task->state == TASK_STOPPED || task->state == TASK_TRACED) && \
> +	(is_task_stopped_or_traced(task)) && \
>  	 security_ptrace(current,task) == 0))
>
>  static int proc_pid_cmdline(struct task_struct *task, char * buffer)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c204ab0..5ef5253 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -177,6 +177,21 @@ print_cfs_rq(struct seq_file *m, int cpu, struct
> cfs_rq *cfs_rq) /* in tsk->state again */
>  #define TASK_DEAD		64
>
> +/* Convenience macros for the sake of wake_up */
> +#define TASK_NORMAL		(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
> +#define TASK_ALL		(TASK_NORMAL | TASK_STOPPED | TASK_TRACED)
> +
> +/* get_task_state() */
> +#define TASK_REPORT		(TASK_RUNNING | TASK_INTERRUPTIBLE | \
> +				 TASK_UNINTERRUPTIBLE | TASK_STOPPED | \
> +				 TASK_TRACED)

I think it would be nicer if you made it explicit in the name that
these are not individual flags. Maybe it doesn't matter though...

Also, TASK_NORMAL / TASK_ALL aren't very good names. TASK_SLEEP_NORMAL /
TASK_SLEEP_ALL might be a bit more helpful?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/5] Make wait_on_retry_sync_kiocb killable

2007-10-24 Thread Nick Piggin
On Friday 19 October 2007 08:26, Matthew Wilcox wrote:
> Use TASK_KILLABLE to allow wait_on_retry_sync_kiocb to return -EINTR.
> All callers then check the return value and break out of their loops.
>
> Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>
> ---
>  fs/read_write.c |   17 ++++++++++++-----
>  1 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 124693e..3196a3b 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -218,14 +218,15 @@ Einval:
>  	return -EINVAL;
>  }
>
> -static void wait_on_retry_sync_kiocb(struct kiocb *iocb)
> +static int wait_on_retry_sync_kiocb(struct kiocb *iocb)
>  {
> -	set_current_state(TASK_UNINTERRUPTIBLE);
> +	set_current_state(TASK_KILLABLE);
>  	if (!kiocbIsKicked(iocb))
>  		schedule();
>  	else
>  		kiocbClearKicked(iocb);
>  	__set_current_state(TASK_RUNNING);
> +	return fatal_signal_pending(current) ? -EINTR : 0;

Although the EINTR never gets to userspace anyway, is there a good
reason why the last patch for do_generic_mapping_read doesn't pass
back -EINTR?
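
For the archives, the caller side then looks something like this
(sketched from the patch description, not quoted from the patch):

	for (;;) {
		ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
		if (ret != -EIOCBRETRY)
			break;
		if (wait_on_retry_sync_kiocb(&kiocb)) {
			ret = -EINTR;	/* fatal signal: stop retrying */
			break;
		}
	}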
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] TASK_KILLABLE

2007-10-24 Thread Nick Piggin
On Friday 19 October 2007 08:25, Matthew Wilcox wrote:
> This series of patches introduces the facility to deliver only fatal
> signals to tasks which are otherwise waiting uninterruptibly.


This is pretty nice I think. It also is a significant piece of
infrastructure required to fix some of the main oom kill deadlocks.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Is gcc thread-unsafe?

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 13:46, Arjan van de Ven wrote:
> On Thu, 25 Oct 2007 13:24:49 +1000
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Andi spotted this exchange on the gcc list. I don't think he's
> > brought it up here yet, but it worries me enough that I'd like
> > to discuss it.
> >
> > Starts here
> > http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html
> >
> > Concrete example here
> > http://gcc.gnu.org/ml/gcc/2007-10/msg00275.html
> >
> > Basically, what the gcc developers are saying is that gcc is
> > free to load and store to any memory location, so long as it
> > behaves as if the instructions were executed in sequence.
>
> this optimization btw is a serious mis-optimization, it makes memory
> more dirty and causes cachelines to become unshared. I'm sure it
> works great on microbenchmarks but it sucks bigtime for anything real

Well that's exactly right. For threaded programs (and maybe even
real-world non-threaded ones in general), you don't want to be
even _reading_ global variables if you don't need to. Cache misses
and cacheline bouncing could easily cause performance to completely
tank in some cases while only gaining a cycle or two in
microbenchmarks for doing these funny x86 predication things.

I'm not sure about ia64 -- I _hope_ that for most of their
predication stuff, they also predicate the stores, rather than
just store unconditionally and rely on the source operand not
changing in the case they didn't intend the memory to change.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/5] Add lock_page_killable

2007-10-24 Thread Nick Piggin
On Thursday 25 October 2007 14:11, Andrew Morton wrote:
> On Wed, 24 Oct 2007 08:24:57 -0400 Matthew Wilcox <[EMAIL PROTECTED]> wrote:
> > and associated infrastructure such as sync_page_killable and
> > fatal_signal_pending.  Use lock_page_killable in
> > do_generic_mapping_read() to allow us to kill `cat' of a file on an
> > NFS-mounted filesystem.
>
> whoa, big change.
>
> What exactly are the semantics here?  If the process has actually been
> killed (ie: we know that userspace won't be running again) then we break
> out of a lock_page() and allow the process to exit?  ie: it's basically
> invisible to userspace?

The actual conversions should also be relatively useful groundwork
if we ever want to make more things become generally interruptible.


> If so, it sounds OK.  I guess.  We're still screwed if the process is doing
> a synchronous write and lots of other scenarios.

I don't think it will matter in too many situations. If the process is
doing a synchronous write, nothing is guaranteed until the syscall
returns success...
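
The shape of the do_generic_mapping_read() conversion, as I read the
description (a sketch; the error label is illustrative):

	error = lock_page_killable(page);	/* was: lock_page(page); */
	if (unlikely(error))
		goto readpage_error;	/* unwind instead of sleeping in D state */

so a fatal signal pops the task out of the page-lock wait and the read
path unwinds with an error that the dying process never gets to see.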
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Is gcc thread-unsafe?

2007-10-24 Thread Nick Piggin
Hi David,

[BTW. can you retain cc lists, please?]

On Thursday 25 October 2007 14:29, David Schwartz wrote:
> > Well that's exactly right. For threaded programs (and maybe even
> > real-world non-threaded ones in general), you don't want to be
> > even _reading_ global variables if you don't need to. Cache misses
> > and cacheline bouncing could easily cause performance to completely
> > tank in some cases while only gaining a cycle or two in
> > microbenchmarks for doing these funny x86 predication things.
>
> For some CPUs, replacing a conditional branch with a conditional move is a
> *huge* win because it cannot be mispredicted.

A *conditional* store should not be a problem.

However the funny trick of doing this conditional add (implemented with
an unconditional store) is what is going to cause breakage.

On the CPUs where predicated instructions are a big win, I'd expect
they should also implement a conditional store for use here. However
they might be slower than an unconditional store (eg. x86's cmov),
and in those cases, gcc might just do the non-conditional store.
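
For illustration (hand-written ia64-ish assembly, not compiler
output), a fully predicated increment makes the store conditional too:

	cmp.eq	p6, p7 = 0, r8		// p6 <- (r8 == 0)
	;;
	(p6) ld4	r9 = [r10]	// all three operations are
	;;				// skipped when p6 is clear,
	(p6) adds	r9 = 1, r9	// so the memory location is
	;;				// never touched
	(p6) st4	[r10] = r9

whereas the x86 adc trick always performs the store:

	cmpl	$1, %eax		# carry iff %eax == 0
	adcl	$0, counter		# read-modify-write regardless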


> In general, compilers should
> optimize for unshared data since that's much more common in typical code.
> Even for shared data, the usual case is that you are going to access the
> data few times, so pulling the cache line to the CPU is essentially free
> since it will happen eventually.

This is not just a question of data that you were going to use anyway.
gcc generates memory accesses to locations that would never be accessed.
Even stores. It is basically impossible to say that this is a real
performance win. Even on single threaded code: consider that cache
misses take the vast majority of time in many loads, which gives a
little hint that maybe it's a bad idea to do this ;)


> Heuristics may show that the vast majority of such constructs write anyway.
> So the optimization may also be valid based on such heuristics.

I'd never say the optimisation would always be useless. But it's a nasty
thing to have on by default, and apparently even with no good way to
suppress it even if we want to.


> A better question is whether it's legal for a compiler that claims to
> support POSIX threads. I'm going to post on comp.programming.threads, where
> the threading experts hang out.

Either way, I think we really need a way to turn it off for Linux.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interaction between Xen and XFS: stray RW mappings

2007-10-21 Thread Nick Piggin
On Monday 22 October 2007 14:28, dean gaudet wrote:
> On Sun, 21 Oct 2007, Jeremy Fitzhardinge wrote:
> > dean gaudet wrote:
> > > On Mon, 15 Oct 2007, Nick Piggin wrote:
> > >> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> > >> because it generally has to invalidate TLBs on all CPUs.
> > >
> > > why is that?  ignoring 32-bit archs we have heaps of address space
> > > available... couldn't the kernel just burn address space and delay
> > > global TLB invalidate by some relatively long time (say 1 second)?
> >
> > Yes, that's precisely the problem.  xfs does delay the unmap, leaving
> > stray mappings, which upsets Xen.
>
> sounds like a bug in xen to me :)

You could call it a bug I think. I don't know much about Xen though,
whether or not it expects to be able to run an arbitrary OS kernel.

Presumably, the hypervisor _could_ write protect and trap writes to
*all* page table page mappings.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Monday 22 October 2007 04:39, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
> >> Christian Borntraeger <[EMAIL PROTECTED]> writes:
> >>
> >> Let me put it another way.  Looking at /proc/slabinfo I can get
> >> 37 buffer_heads per page.  I can allocate 10% of memory in
> >> buffer_heads before we start to reclaim them.  So it requires just
> > over 3.7 buffer_heads on every page of low memory to even trigger
> >> this case.  That is a large 1k filesystem or a weird sized partition,
> >> that we have written to directly.
> >
> > On a highmem machine it could be relatively common.
>
> Possibly.  But the same proportions still hold.  1k filesystems
> are not the default these days and ramdisks are relatively uncommon.
> The memory quantities involved are all low mem.

You don't need 1K filesystems to have buffers attached though,
of course. You can hit the limit with a 4K filesystem with less
than 8GB in pagecache I'd say.



> > You don't want to change that for a stable patch, however.
> > It fixes the bug.
>
> No it avoids the bug which is something slightly different.
> Further I contend that it is not obviously correct that there
> are no other side effects (because it doesn't actually fix the
> bug), and that makes it of dubious value for a backport.

The bug in question isn't exactly that it uses buffercache for its
backing store (although that's obviously rather hairy as well). It's
this specific problem sequence. And it does fix the problem.


> If I had to slap a patch on there at this point just implementing
> an empty try_to_release_page (which disables try_to_free_buffers)
> would be my choice.

How is that better? Now you're making the exact same change for
all filesystems that you didn't think was obviously correct for
rd.c.


> > I just don't think what you have is the proper fix. Calling
> > into the core vfs and vm because right now it does something
> > that works for you but is completely unrelated to what you
> > are conceptually doing is not the right fix.
>
> I think there is a strong conceptual relation and other code
> doing largely the same thing is already in the kernel (ramfs).  Plus
> my gut feel says shared code will make maintenance easier.

ramfs is rather a different case. Filesystems intimately know
about the pagecache.


> You do have a point that the reuse may not be perfect and if that
> is the case we need to watch out for the potential to mess things
> up.
>
> So far I don't see any problems with the reuse.

It's just wrong. I guess if you don't see that by now, then we
have to just disagree.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Monday 22 October 2007 03:56, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > OK, I missed that you set the new inode's aops to the ramdisk_aops
> > rather than the bd_inode. Which doesn't make a lot of sense because
> > you just have a lot of useless aops there now.
>
> Not totally useless as you have mentioned they are accessed by
> the lru so we still need something there just not much.

Well, the ones that are unused are totally useless ;) I would
have expected them to be removed.


> >> Frankly I suspect the whole
> issue is too subtle and rare to make any backport make any sense.  My
> >> apologies Christian.
> >
> > It's a data corruption issue. I think it should be fixed.
>
> Of course it should be fixed.  I just don't know if a backport
> makes sense.  The problem once understood is hard to trigger
> and easy to avoid.

I mean backported. That's just me though, I don't know the nuances
of -stable releases. It could be that they rather not risk introducing
something worse which would be fair enough.


> >> Well those pages are only accessed through rd_blkdev_pagecache_IO
> >> and rd_ioctl.
> >
> > Wrong. It will be via the LRU, will get ->writepage() called,
> > block_invalidate_page, etc. And I guess also via sb->s_inodes, where
> > drop_pagecache_sb might do stuff to it (although it probably escapes
> > harm). But you're right that it isn't the obviously correct fix for
> > the problem.
>
> Yes.  We will be accessed via the LRU.  Yes I knew that.

OK it just didn't sound like it, seeing as you said that's the only
way they are accessed.


> The truth is 
> whatever we do we will be visible to the LRU.

No. With my patch, nothing in the ramdisk code is visible to the LRU.
Which is how it should be.


> My preferences run 
> towards having something that is less of a special case than a new
> potentially huge cache that is completely unaccounted for, that we
> have to build and maintain all of the infrastructure for
> independently.

It's not a cache, and it's not unaccounted for. It's specified exactly
with the rd sizing parameters. I don't know why you would say your
patch is better in this regard. Your ramdisk backing store will be
accounted as pagecache, which is completely wrong.


> The ramdisk code doesn't seem interesting enough 
> long term to get people to do independent maintenance.
>
> With my patch we are on the road to being just like the ramdisk
> code for maintenance purposes, except having a different GFP mask.

You can be using highmem, BTW. And technically it probably isn't
correct to use GFP_USER.


> > If you think it is a nice way to go, then I think you are
> > blind ;)
>
> Well we each have different tastes.  I think my patch
> is a sane sensible small incremental step that does just what
> is needed to fix the problem.   It doesn't drag a lot of other
> stuff into the problem like a rewrite would so we can easily verify
> it.

The small incremental step that fixes the problem is Christian's
patch.


> > It's horrible. And using truncate_inode_pages / grab_cache_page and
> > new_inode is an incredible argument to save a few lines of code. You
> > obviously didn't realise your so called private pages would get
> > accessed via the LRU, for example.
>
> I did, but that is relatively minor.  Using the pagecache this
> way for this purpose is a well established idiom in the kernel
> so I didn't figure I was introducing anything to hard to maintain.

Where else is this an established idiom?


> > This is making a relatively
> > larger logical change than my patch, because now as well as having
> > a separate buffer cache and backing store, you are also making the
> > backing store pages visible to the VM.
>
> I am continuing to have the backing store pages visible to the VM,
> and from that perspective it is a smaller change then where we are
> today.

It is fewer lines of code. It is a larger change. Because what you
are doing is 2 things. You are firstly discontinuing the use of the
buffer cache for the backing store, and secondly you are introducing
a new backing store which registers an inode with the vfs and pages
with the pagecache.

My patch does the same thing without those two last questionable
steps.

You now have to treat those backing store pages as pagecache pages,
and hope you have set the right flags and registered the right aops.


> Not that we can truly hide pages from the VM. 

Any page you allocate is your private page. The problem is you're
just sending them back to the VM again.
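
ie. something of this shape (made-up names; a sketch of the approach,
not the actual patch):

	/* Pages live only in this driver-private radix tree. They are
	 * never inserted into the pagecache or put on the LRU, so the
	 * VM never walks over them. */
	static struct page *rd_lookup_or_alloc_page(struct rd_dev *rd, pgoff_t idx)
	{
		struct page *page;

		page = radix_tree_lookup(&rd->pages, idx);
		if (!page) {
			page = alloc_page(GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM);
			if (!page)
				return NULL;
			/* real code would radix_tree_preload() first */
			if (radix_tree_insert(&rd->pages, idx, page)) {
				__free_page(page);
				return NULL;
			}
		}
		return page;
	}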
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] add new notifier function ,take2

2007-10-21 Thread Nick Piggin
On Thursday 18 October 2007 18:52, Takenori Nagano wrote:
> Vivek Goyal wrote:

> > > My stance is that _all_ the RAS tools (kdb, kgdb, nlkd, netdump, lkcd,
> > > crash, kdump etc.) should be using a common interface that safely puts
> > > the entire system in a stopped state and saves the state of each cpu.
> > > Then each tool can do what it likes, instead of every RAS tool doing
> > > its own thing and they all conflict with each other, which is why this
> > > thread started.
> > >
> > > It is not the kernel's job to decide which RAS tool runs first, second
> > > etc., it is the user's decision to set that policy. Different sites
> > > will want different orders, some will say "go straight to kdump", other
> > > sites will want to invoke a debugger first. Sites must be able to
> > > define that policy, but we hard code the policy into the kernel.
>
> I agreed with him and I made new notifier function that users can change
> the order. Priority value in notifier blocks are hardcoded. If users want
> to change list order, they have to rebuild kernel. I think it is very
> unhappy.

Is it possible to use a single bit of common code and a single
notifier for these things? Or is it too difficult?

One thing I'd suggest is not to use debugfs, if it is going to
be a useful end-user feature.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
> Christian Borntraeger <[EMAIL PROTECTED]> writes:

> Let me put it another way.  Looking at /proc/slabinfo I can get
> 37 buffer_heads per page.  I can allocate 10% of memory in
> buffer_heads before we start to reclaim them.  So it requires just
> over 3.7 buffer_heads on every page of low memory to even trigger
> this case.  That is a large 1k filesystem or a weird sized partition,
> that we have written to directly.

On a highmem machine it could be relatively common.


> > I still don't fully understand what issues you have with my patch.
> > - it obviously fixes the problem
> > - I am not aware of any regression it introduces
> > - its small
>
> My primary issue with your patch is that it continues the saga of
> trying to use the buffer cache to store the data, which is a serious
> review problem, and clearly not what we want to do long term.

You don't want to change that for a stable patch, however.
It fixes the bug.


> > One concern you had, was the fact that buffer heads are out of sync with
> > struct pages. Testing your first patch revealed that this is actually
> > needed by reiserfs - and maybe others.
> > I can also see, that my patch looks a bit like a bandaid that cobbles the
> > rd pieces together.
> >
> > Is there anything else, that makes my patch unmergeable in your
> > opinion?
>
> For linus's tree the consensus is that to fix rd.c we
> need to have a backing store that is stored somewhere besides
> in the page cache/buffer cache for /dev/ram0.   Doing that prevents
> all of the weird issues.
>
> Now we have the question of which patch gets us there.  I contend
> I have implemented it with my last little patch that this thread
> is a reply to.  Nick hasn't seen that just yet.

Or ever will. It isn't as though my whole argument against it were
based on my mistaken thought that your patch served the bdev
inode directly from its backing store.


> So if we have a small patch that can implement the proper long
> term fix I contend we are in better shape.

I just don't think what you have is the proper fix. Calling
into the core vfs and vm because right now it does something
that works for you but is completely unrelated to what you
are conceptually doing is not the right fix.

Also, the patch I posted is big because it did other stuff
with dynamically allocated ramdisks from loop (ie. a modern
rewrite). As it is applied to rd.c and split into chunks, the
actual patch to switch to the new backing store isn't actually
that big. I'll submit it to -mm after things stabilise after
the merge window too.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG at mm/filemap.c:1749 (2.6.24, jffs2, unionfs)

2007-10-21 Thread Nick Piggin
On Sunday 21 October 2007 18:55, David Woodhouse wrote:
> On Fri, 2007-10-19 at 17:16 +1000, Nick Piggin wrote:
> > if (writtenlen) {
> > -		if (inode->i_size < (pg->index << PAGE_CACHE_SHIFT) +
> > start + writtenlen) {
> > -			inode->i_size = (pg->index << PAGE_CACHE_SHIFT) +
> > start + writtenlen;
> > +		if (inode->i_size < pos + start + writtenlen) {
> > +			inode->i_size = pos + start + writtenlen;
>
> This part seems wrong. Shouldn't it just be pos + writtenlen -- which is
> basically what it was already: pos==(pg->index<<PAGE_CACHE_SHIFT)+start

Yeah you're right, thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Sunday 21 October 2007 16:48, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > Yes it does. It is exactly breaking the coherency between block
> > device and filesystem metadata coherency that Andrew cared about.
> > Whether or not that matters, that is a much bigger conceptual
> > change than simply using slightly more (reclaimable) memory in
> > some situations that my patch does.
> >
> > If you want to try convincing people to break that coherency,
> > fine, but it has to be done consistently and everywhere rather than
> > for a special case in rd.c.
>
> Nick.  Reread the patch.  The only thing your arguments have
> established for me is that this patch is not obviously correct.  Which
> makes it ineligible for a back port.

OK, I missed that you set the new inode's aops to the ramdisk_aops
rather than the bd_inode. Which doesn't make a lot of sense because
you just have a lot of useless aops there now.


> Frankly I suspect the whole 
 issue is too subtle and rare to make any backport make any sense.  My
> apologies Christian.

It's a data corruption issue. I think it should be fixed.


> >> The only way we make it to that inode is through block
> >> device I/O so it lives at exactly the same level in the hierarchy as
> >> a real block device.
> >
> > No, it doesn't. A real block device driver does have its own
> > buffer cache as its backing store. It doesn't know about
> > readpage or writepage or set_page_dirty or buffers or pagecache.
>
> Well those pages are only accessed through rd_blkdev_pagecache_IO
> and rd_ioctl.

Wrong. It will be via the LRU, will get ->writepage() called,
block_invalidate_page, etc. And I guess also via sb->s_inodes, where
drop_pagecache_sb might do stuff to it (although it probably escapes
harm). But you're right that it isn't the obviously correct fix for
the problem.


> >> My patch is the considered rewrite boiled down
> >> to its essentials and made a trivial patch.
> >
> > What's the considered rewrite here? The rewrite I posted is the
> > only one so far that's come up that I would consider [worthy],
> > while these patches are just more of the same wrongness.
>
> Well it looks like you were blind when you read the patch.

If you think it is a nice way to go, then I think you are
blind ;)


> Because the semantics between the two are almost identical,
> except I managed to implement BLKFLSBUF in a backwards compatible
> way by flushing both the buffer cache and my private cache.  You
> failed to flush the buffer cache in your implementation.

Obviously a simple typo that can be fixed by adding one line
of code.


> Yes. I use an inode 99% for its mapping and the mapping 99% for its
> radix_tree.  But having truncate_inode_pages and grab_cache_page
> continue to work sure is convenient.

It's horrible. And using truncate_inode_pages / grab_cache_page and
new_inode is an incredible argument to save a few lines of code. You
obviously didn't realise your so called private pages would get
accessed via the LRU, for example. This is making a relatively
larger logical change than my patch, because now as well as having
a separate buffer cache and backing store, you are also making the
backing store pages visible to the VM.


> I certainly think it makes it a 
> lot simpler to audit the code to change just one thing at a time (the
> backing store) than to rip out and replace everything and then try and
> prove that the two patches are equivalent.

I think it's a bad idea just to stir the shit. We should take the
simple fix for the problem, and then fix it properly.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Sunday 21 October 2007 16:48, Eric W. Biederman wrote:
 Nick Piggin [EMAIL PROTECTED] writes:
  Yes it does. It is exactly breaking the coherency between block
  device and filesystem metadata coherency that Andrew cared about.
  Whether or not that matters, that is a much bigger conceptual
  change than simply using slightly more (reclaimable) memory in
  some situations that my patch does.
 
  If you want to try convincing people to break that coherency,
  fine, but it has to be done consistently and everywhere rather than
  for a special case in rd.c.

 Nick.  Reread the patch.  The only thing your arguments have
 established for me is that this patch is not obviously correct.  Which
 makes it ineligible for a back port.

OK, I missed that you set the new inode's aops to the ramdisk_aops
rather than the bd_inode. Which doesn't make a lot of sense because
you just have a lot of useless aops there now.


 Frankly I suspect the whole 
 issue is to subtle and rare to make any backport make any sense.  My
 apologies Christian.

It's a data corruption issue. I think it should be fixed.


  The only way we make it to that inode is through block
  device I/O so it lives at exactly the same level in the hierarchy as
  a real block device.
 
  No, it doesn't. A real block device driver does have its own
  buffer cache as it's backing store. It doesn't know about
  readpage or writepage or set_page_dirty or buffers or pagecache.

 Well those pages are only accessed through rd_blkdev_pagecache_IO
 and rd_ioctl.

Wrong. It will be via the LRU, will get -writepage() called,
block_invalidate_page, etc. And I guess also via sb-s_inodes, where
drop_pagecache_sb might do stuff to it (although it probably escapes
harm). But you're right that it isn't the obviously correct fix for
the problem.


  My patch is the considered rewrite boiled down
  to it's essentials and made a trivial patch.
 
  What's the considered rewrite here? The rewrite I posted is the
  only one so far that's come up that I would consider [worthy],
  while these patches are just more of the same wrongness.

 Well it looks like you were blind when you read the patch.

If you think it is a nice way to go, then I think you are
blind ;)


 Because the semantics between the two are almost identical,
 except I managed to implement BLKFLSBUF in a backwards compatible
 way by flushing both the buffer cache and my private cache.  You
 failed to flush the buffer cache in your implementation.

Obviously a simple typo that can be fixed by adding one line
of code.


 Yes. I use an inode 99% for it's mapping and the mapping 99% for it's
 radix_tree.  But having truncate_inode_pages and grab_cache_page
 continue to work sure is convenient.

It's horrible. And using truncate_inode_pages / grab_cache_page and
new_inode is an incredible argument to save a few lines of code. You
obviously didn't realise your so called private pages would get
accessed via the LRU, for example. This is making a relatively
larger logical change than my patch, because now as well as having
a separate buffer cache and backing store, you are also making the
backing store pages visible to the VM.


 I certainly think it makes it a 
 lot simpler to audit the code to change just one thing at a time (the
 backing store) then to rip out and replace everything and then try and
 prove that the two patches are equivalent.

I think it's a bad idea just to stir the shit. We should take the
simple fix for the problem, and then fix it properly.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
 Christian Borntraeger [EMAIL PROTECTED] writes:

 Let me put it another way.  Looking at /proc/slabinfo I can get
 37 buffer_heads per page.  I can allocate 10% of memory in
 buffer_heads before we start to reclaim them.  So it requires just
 over 3.7 buffer_heads on very page of low memory to even trigger
 this case.  That is a large 1k filesystem or a weird sized partition,
 that we have written to directly.

On a highmem machine it it could be relatively common.


  I still dont fully understand what issues you have with my patch.
  - it obviously fixes the problem
  - I am not aware of any regression it introduces
  - its small

 My primary issue with your patch is that it continues the saga the
 trying to use buffer cache to store the data which is a serious
 review problem, and clearly not what we want to do long term.

You don't want to change that for a stable patch, however.
It fixes the bug.


  One concern you had, was the fact that buffer heads are out of sync with
  struct pages. Testing your first patch revealed that this is actually
  needed by reiserfs - and maybe others.
  I can also see, that my patch looks a bit like a bandaid that cobbles the
  rd pieces together.
 
  Is there anything else, that makes my patch unmergeable in your
  opinion?

 For linus's tree the consensus is that to fix rd.c that we
 need to have a backing store that is stored somewhere besides
 in the page cache/buffer cache for /dev/ram0.   Doing that prevents
 all of the weird issues.

 Now we have the question of which patch gets us there.  I contend
 I have implemented it with my last little patch that this thread
 is a reply to.  Nick hasn't seen that just yet.

Or ever will. It wasn't that my whole argument against it is
based on that I mistakenly thought your patch served the bdev
inode directly from its backing store.


 So if we have a small patch that can implement the proper long
 term fix I contend we are in better shape.

I just don't think what you have is the proper fix. Calling
into the core vfs and vm because right now it does something
that works for you but is completely unrelated to what you
are conceptually doing is not the right fix.

Also, the patch I posted is big because it did other stuff
with dynamically allocated ramdisks from loop (ie. a modern
rewrite). As it is applied to rd.c and split into chunks, the
actual patch to switch to the new backing store isn't actually
that big. I'll submit it to -mm after things stabilise after
the merge window too.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG at mm/filemap.c:1749 (2.6.24, jffs2, unionfs)

2007-10-21 Thread Nick Piggin
On Sunday 21 October 2007 18:55, David Woodhouse wrote:
 On Fri, 2007-10-19 at 17:16 +1000, Nick Piggin wrote:
  if (writtenlen) {
  -   if (inode-i_size  (pg-index  PAGE_CACHE_SHIFT) +
  start + writtenlen) { -   inode-i_size = (pg-index
   PAGE_CACHE_SHIFT) + start + writtenlen; +   if
  (inode-i_size  pos + start + writtenlen) {
  +   inode-i_size = pos + start + writtenlen;

 This part seems wrong. Shouldn't it just be pos + writtenlen -- which is
 basically what it was already: pos==(pg-indexPAGE_CACHE_SHIFT)+start

Yeah you're right, thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] add new notifier function ,take2

2007-10-21 Thread Nick Piggin
On Thursday 18 October 2007 18:52, Takenori Nagano wrote:
 Vivek Goyal wrote:

   My stance is that _all_ the RAS tools (kdb, kgdb, nlkd, netdump, lkcd,
   crash, kdump etc.) should be using a common interface that safely puts
   the entire system in a stopped state and saves the state of each cpu.
   Then each tool can do what it likes, instead of every RAS tool doing
   its own thing and they all conflict with each other, which is why this
   thread started.
  
   It is not the kernel's job to decide which RAS tool runs first, second
   etc., it is the user's decision to set that policy. Different sites
   will want different orders, some will say go straight to kdump, other
   sites will want to invoke a debugger first. Sites must be able to
   define that policy, but we hard code the policy into the kernel.

 I agreed with him and I made new notifier function that users can change
 the order. Priority value in notifier blocks are hardcoded. If users want
 to change list order, they have to rebuild kernel. I think it is very
 unhappy.

Is it possible to use a single bit of common code and a single
notifier for these things? Or is it too difficult?

One thing I'd suggest is not to use debugfs, if it is going to
be a useful end-user feature.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Monday 22 October 2007 03:56, Eric W. Biederman wrote:
 Nick Piggin [EMAIL PROTECTED] writes:
  OK, I missed that you set the new inode's aops to the ramdisk_aops
  rather than the bd_inode. Which doesn't make a lot of sense because
  you just have a lot of useless aops there now.

 Not totally useless as you have mentioned they are accessed by
 the lru so we still need something there just not much.

Well, the ones that are unused are totally useless ;) I would
have expected them to be removed.


  Frankly I suspect the whole
  issue is to subtle and rare to make any backport make any sense.  My
  apologies Christian.
 
  It's a data corruption issue. I think it should be fixed.

 Of course it should be fixed.  I just don't know if a backport
 makes sense.  The problem once understood is hard to trigger
 and easy to avoid.

I mean backported. That's just me though, I don't know the nuances
of -stable releases. It could be that they rather not risk introducing
something worse which would be fair enough.


> > > Well those pages are only accessed through rd_blkdev_pagecache_IO
> > > and rd_ioctl.
> >
> > Wrong. It will be via the LRU, will get ->writepage() called,
> > block_invalidate_page, etc. And I guess also via sb->s_inodes, where
> > drop_pagecache_sb might do stuff to it (although it probably escapes
> > harm). But you're right that it isn't the obviously correct fix for
> > the problem.
>
> Yes.  We will be accessed via the LRU.  Yes I knew that.

OK it just didn't sound like it, seeing as you said that's the only
way they are accessed.


> The truth is
> whatever we do we will be visible to the LRU.

No. With my patch, nothing in the ramdisk code is visible to the LRU.
Which is how it should be.


> My preferences run
> towards having something that is less of a special case than a new
> potentially huge cache that is completely unaccounted for, that we
> have to build and maintain all of the infrastructure for
> independently.

It's not a cache, and it's not unaccounted for. It's specified exactly
with the rd sizing parameters. I don't know why you would say your
patch is better in this regard. Your ramdisk backing store will be
accounted as pagecache, which is completely wrong.


> The ramdisk code doesn't seem interesting enough
> long term to get people to do independent maintenance.
>
> With my patch we are on the road to being just like the ramfs code
> for maintenance purposes, except having a different GFP mask.

You can be using highmem, BTW. And technically it probably isn't
correct to use GFP_USER.
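
(As a sketch of what I mean -- assuming nothing about your patch beyond
what is quoted here -- the data pages could come from highmem, e.g.:

	page = alloc_page(GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM);

GFP_NOIO because the allocation can happen from the block I/O path, and
__GFP_HIGHMEM because the backing pages only need a kernel mapping
around the actual copy.)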


> > If you think it is a nice way to go, then I think you are
> > blind ;)
>
> Well we each have different tastes.  I think my patch
> is a sane sensible small incremental step that does just what
> is needed to fix the problem.   It doesn't drag a lot of other
> stuff into the problem like a rewrite would so we can easily verify
> it.

The small incremental step that fixes the problem is Christian's
patch.


> > It's horrible. And using truncate_inode_pages / grab_cache_page and
> > new_inode is an incredible argument to save a few lines of code. You
> > obviously didn't realise your so called private pages would get
> > accessed via the LRU, for example.
>
> I did, but that is relatively minor.  Using the pagecache this
> way for this purpose is a well established idiom in the kernel
> so I didn't figure I was introducing anything too hard to maintain.

Where else is this an established idiom?


> > This is making a relatively
> > larger logical change than my patch, because now as well as having
> > a separate buffer cache and backing store, you are also making the
> > backing store pages visible to the VM.
>
> I am continuing to have the backing store pages visible to the VM,
> and from that perspective it is a smaller change than where we are
> today.

It is fewer lines of code, but it is a larger change. Because what you
are doing is 2 things. You are firstly discontinuing the use of the
buffer cache for the backing store, and secondly you are introducing
a new backing store which registers an inode with the vfs and pages
with the pagecache.

My patch does the same thing without those two last questionable
steps.

You now have to treat those backing store pages as pagecache pages,
and hope you have set the right flags and registered the right aops.


> Not that we can truly hide pages from the VM.

Any page you allocate is your private page. The problem is you're
just sending them back to the VM again.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Nick Piggin
On Monday 22 October 2007 04:39, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
> > > Christian Borntraeger <[EMAIL PROTECTED]> writes:
> > >
> > > Let me put it another way.  Looking at /proc/slabinfo I can get
> > > 37 buffer_heads per page.  I can allocate 10% of memory in
> > > buffer_heads before we start to reclaim them.  So it requires just
> > > over 3.7 buffer_heads on every page of low memory to even trigger
> > > this case.  That is a large 1k filesystem or a weird sized partition,
> > > that we have written to directly.
> >
> > On a highmem machine it could be relatively common.
>
> Possibly.  But the same proportions still hold.  1k filesystems
> are not the default these days and ramdisks are relatively uncommon.
> The memory quantities involved are all low mem.

You don't need 1K filesystems to have buffers attached though,
of course. You can hit the limit with a 4K filesystem with less
than 8GB in pagecache I'd say.
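
(Rough arithmetic, using the 37-buffer_heads-per-slab-page figure quoted
above and assuming ~896MB of lowmem on an i386 highmem box: 10% of
lowmem is ~90MB, or about 23000 slab pages, which is about 850000
buffer_heads. With 4K blocks each buffer_head pins one 4K pagecache
page, so roughly 3.3GB of buffered pagecache reaches the reclaim
threshold -- comfortably under 8GB.)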



> > You don't want to change that for a stable patch, however.
> > It fixes the bug.
>
> No it avoids the bug which is something slightly different.
> Further I contend that it is not obviously correct that there
> are no other side effects (because it doesn't actually fix the
> bug), and that makes it of dubious value for a backport.

The bug in question isn't exactly that it uses buffercache for its
backing store (although that's obviously rather hairy as well). It's
this specific problem sequence. And it does fix the problem.


> If I had to slap a patch on there at this point just implementing
> an empty try_to_release_page (which disables try_to_free_buffers)
> would be my choice.

How is that better? Now you're making the exact same change for
all filesystems that you didn't think was obviously correct for
rd.c.


> > I just don't think what you have is the proper fix. Calling
> > into the core vfs and vm because right now it does something
> > that works for you but is completely unrelated to what you
> > are conceptually doing is not the right fix.
>
> I think there is a strong conceptual relation and other code
> doing largely the same thing is already in the kernel (ramfs).  Plus
> my gut feel says shared code will make maintenance easier.

ramfs is rather a different case. Filesystems intimately know
about the pagecache.


> You do have a point that the reuse may not be perfect and if that
> is the case we need to watch out for the potential to mess things
> up.
>
> So far I don't see any problems with the reuse.

It's just wrong. I guess if you don't see that by now, then we
have to just disagree.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interaction between Xen and XFS: stray RW mappings

2007-10-21 Thread Nick Piggin
On Monday 22 October 2007 14:28, dean gaudet wrote:
> On Sun, 21 Oct 2007, Jeremy Fitzhardinge wrote:
> > dean gaudet wrote:
> > > On Mon, 15 Oct 2007, Nick Piggin wrote:
> > > > Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> > > > because it generally has to invalidate TLBs on all CPUs.
> > >
> > > why is that?  ignoring 32-bit archs we have heaps of address space
> > > available... couldn't the kernel just burn address space and delay
> > > global TLB invalidate by some relatively long time (say 1 second)?
> >
> > Yes, that's precisely the problem.  xfs does delay the unmap, leaving
> > stray mappings, which upsets Xen.
>
> sounds like a bug in xen to me :)

You could call it a bug I think. I don't know much about Xen though,
whether or not it expects to be able to run an arbitrary OS kernel.

Presumably, the hypervisor _could_ write protect and trap writes to
*all* page table page mappings.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings.

2007-10-20 Thread Nick Piggin
On Sunday 21 October 2007 14:53, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
> >> Andrew Morton <[EMAIL PROTECTED]> writes:
> >> > I don't think we little angels want to tread here.  There are so many
> >> > weirdo things out there which will break if we bust the coherence
> >> > between the fs and /dev/hda1.
> >>
> >> We broke coherence between the fs and /dev/hda1 when we introduced
> >> the page cache years ago,
> >
> > Not for metadata. And I wouldn't expect many filesystem analysis
> > tools to care about data.
>
> Well tools like dump certainly weren't happy when we made the change.

Doesn't that give you any suspicion that other tools mightn't
be happy if we make this change, then?


> >> and weird hacky cases like
> >> unmap_underlying_metadata don't change that.
> >
> > unmap_underlying_metadata isn't about raw block device access at
> > all, though (if you write to the filesystem via the blockdevice
> > when it isn't expecting it, it's going to blow up regardless).
>
> Well my goal with separating things is so that we could decouple two
> pieces of code that have different usage scenarios, and where
> supporting both scenarios simultaneously appears to me to needlessly
> complicate the code.
>
> Added to that we could then tune the two pieces of code for their
> different users.

I don't see too much complication from it. If we can actually
simplify things or make useful tuning, maybe it will be worth
doing.


> >> Currently only
> >> metadata is more or less in sync with the contents of /dev/hda1.
> >
> > It either is or it isn't, right? And it is, isn't it? (at least
> > for the common filesystems).
>
> ext2 doesn't store directories in the buffer cache.

Oh that's what you mean. OK, agreed there. But for the filesystems
and types of metadata that can now expect to have coherency, doing
this will break that expectation.

Again, I have no opinions either way on whether we should do that
in the long run. But doing it as a kneejerk response to braindead
rd.c code is wrong because of what *might* go wrong that we don't
know about.


> Journaling filesystems and filesystems that do ordered writes
> game the buffer cache.  Putting in data that should not yet
> be written to disk.  That gaming is where reiserfs goes BUG
> and where JBD moves the dirty bit to a different dirty bit.

Filesystems really want better control of writeback, I think.
This isn't really a consequence of the unified blockdev pagecache
/ metadata buffer cache, it is just that most of the important
things they do are with metadata.

If they have their own metadata inode, then they'll need to game
the cache for it, or the writeback code for that inode somehow
too.


> So as far as I can tell what is in the buffer cache is not really
> in sync with what should be on disk at any given moment except
> when everything is clean.

Naturally. It is a writeback cache.


> My suspicion is that actually reading from disk is likely to
> give a more coherent view of things.  Because there at least
> we have the writes as they are expected to be seen by fsck
> to recover the data, and a snapshot there should at least
> be recoverable.  Whereas a snapshot provides no such guarantees.

I don't think ext3 fsck is supposed to be run on a read/write-mounted
filesystem, so it's going to explode if you do that regardless.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-20 Thread Nick Piggin
On Sunday 21 October 2007 15:10, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Saturday 20 October 2007 08:51, Eric W. Biederman wrote:
> >> Currently the ramdisk tries to keep the block device page cache pages
> >> from being marked clean and dropped from memory.  That fails for
> >> filesystems that use the buffer cache because the buffer cache is not
> >> an ordinary buffer cache user and depends on the generic block device
> >> address space operations being used.
> >>
> >> To fix all of those associated problems this patch allocates a private
> >> inode to store the ramdisk pages in.
> >>
> >> The result is slightly more memory used for metadata, an extra copy
> >> when reading or writing directly to the block device, and changing the
> >> software block size does not lose the contents of the ramdisk.  Most
> >> of all this ensures we don't lose data during normal use of the
> >> ramdisk.
> >>
> >> I deliberately avoid the cleanup that is now possible because this
> >> patch is intended to be a bug fix.
> >
> > This just breaks coherency again like the last patch. That's a
> > really bad idea especially for stable (even if nothing actually
> > was to break, we'd likely never know about it anyway).
>
> Not a chance.

Yes it does. It is exactly breaking the coherency between block
device and filesystem metadata coherency that Andrew cared about.
Whether or not that matters, that is a much bigger conceptual
change than simply using slightly more (reclaimable) memory in
some situations that my patch does.

If you want to try convincing people to break that coherency,
fine, but it has to be done consistently and everywhere rather than
for a special case in rd.c.


> The only way we make it to that inode is through block 
> device I/O so it lives at exactly the same level in the hierarchy as
> a real block device.

No, it doesn't. A real block device driver doesn't have its own
buffer cache as its backing store. It doesn't know about
readpage or writepage or set_page_dirty or buffers or pagecache.


> My patch is the considered rewrite boiled down
> to its essentials and made a trivial patch.

What's the considered rewrite here? The rewrite I posted is the
only one so far that's come up that I would consider [worthy],
while these patches are just more of the same wrongness.


> It fundamentally fixes the problem, and doesn't attempt to reconcile
> the incompatible expectations of the ramdisk code and the buffer cache.

It fixes the bug in question, but not because it makes any
fundamental improvement to the conceptual ickyness of rd.c. I
don't know what you mean about attempting to reconcile the
incompatible [stuff]?


> > Christian's patch should go upstream and into stable. For 2.6.25-6,
> > my rewrite should just replace what's there. Using address spaces
> > to hold the ramdisk pages just confuses the issue even if they
> > *aren't* actually wired up to the vfs at all. Saving 20 lines is
> > not a good reason to use them.
>
> Well it is more like saving 100 lines.  Not having to reexamine complicated
> infrastructure code and doing things the same way ramfs does.

Using radix_tree_insert instead of add_to_page_cache is hardly
complicated. If anything it is simpler because it isn't actually
confusing the issues and it is much better fit with real block
device drivers, which is actually the thing that is familiar to
block device driver maintainers.
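
(For flavour, a minimal sketch of that approach -- the structure and
field names here are made up, and this is not the exact code from my
rewrite:

	static struct page *rd_insert_page(struct rd_dev *rd, pgoff_t idx)
	{
		struct page *page;

		page = alloc_page(GFP_NOIO | __GFP_ZERO);
		if (!page)
			return NULL;
		if (radix_tree_preload(GFP_NOIO)) {
			__free_page(page);
			return NULL;
		}
		spin_lock(&rd->lock);
		if (radix_tree_insert(&rd->pages, idx, page)) {
			/* Lost a race: another CPU inserted this index first. */
			__free_page(page);
			page = radix_tree_lookup(&rd->pages, idx);
		}
		spin_unlock(&rd->lock);
		radix_tree_preload_end();
		return page;
	}

The pages live only in the driver's private radix tree: no aops, no
page flag games, nothing for the LRU or writeback to find.)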


> I think 
> that combination is a good reason.  Especially since I can do with a
> 16 line patch as I just demonstrated.  It is a solid and simple
> incremental change.

My opinion is that it is a much bigger behavioural change because
it results in incoherent buffer cache / blockdev cache, and it
results in code which is still ugly and conceptually wrong.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Use a private inode for backing storage

2007-10-20 Thread Nick Piggin
On Saturday 20 October 2007 08:51, Eric W. Biederman wrote:
> Currently the ramdisk tries to keep the block device page cache pages
> from being marked clean and dropped from memory.  That fails for
> filesystems that use the buffer cache because the buffer cache is not
> an ordinary buffer cache user and depends on the generic block device
> address space operations being used.
>
> To fix all of those associated problems this patch allocates a private
> inode to store the ramdisk pages in.
>
> The result is slightly more memory used for metadata, an extra copy
> when reading or writing directly to the block device, and changing the
> software block size does not lose the contents of the ramdisk.  Most
> of all this ensures we don't lose data during normal use of the
> ramdisk.
>
> I deliberately avoid the cleanup that is now possible because this
> patch is intended to be a bug fix.

This just breaks coherency again like the last patch. That's a
really bad idea especially for stable (even if nothing actually
was to break, we'd likely never know about it anyway).

Christian's patch should go upstream and into stable. For 2.6.25-6,
my rewrite should just replace what's there. Using address spaces
to hold the ramdisk pages just confuses the issue even if they
*aren't* actually wired up to the vfs at all. Saving 20 lines is
not a good reason to use them.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings.

2007-10-20 Thread Nick Piggin
On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> > I don't think we little angels want to tread here.  There are so many
> > weirdo things out there which will break if we bust the coherence between
> > the fs and /dev/hda1.
>
> We broke coherence between the fs and /dev/hda1 when we introduced
> the page cache years ago,

Not for metadata. And I wouldn't expect many filesystem analysis
tools to care about data.


> and weird hacky cases like 
> unmap_underlying_metadata don't change that.

unmap_underlying_metadata isn't about raw block device access at
all, though (if you write to the filesystem via the blockdevice
when it isn't expecting it, it's going to blow up regardless).


> Currently only 
> metadata is more or less in sync with the contents of /dev/hda1.

It either is or it isn't, right? And it is, isn't it? (at least
for the common filesystems).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG at mm/filemap.c:1749 (2.6.24, jffs2, unionfs)

2007-10-19 Thread Nick Piggin
On Friday 19 October 2007 17:03, Nick Piggin wrote:
> On Friday 19 October 2007 16:05, Erez Zadok wrote:
> > David,
> >
> > I'm testing unionfs on top of jffs2, using 2.6.24 as of linus's commit
> > 4fa4d23fa20de67df919030c1216295664866ad7.  All of my unionfs tests pass
> > when unionfs is stacked on top of jffs2, other than my truncate test --
> > which tries to truncate files up/down (through the union, which then is
> > passed through to the lower jffs2 f/s).  The same truncate test passes on
> > all other file systems I've tried unionfs/2.6.24 with, as well as all of
> > the earlier kernels that unionfs runs on (2.6.9--2.6.23).  So I tend to
> > think this bug is more probably due to something else going on in 2.6.24,
> > possibly wrt jffs2/mtd.  (Of course, it's still possible that unionfs
> > isn't doing something right -- any pointers?)
> >
> > The oops trace is included below.  Is this a known issue and if so, any
> > fixes?  If this is the first you hear of this problem, let me know and
> > I'll try to narrow it down further.
>
> It's had quite a lot of recent changes in that area -- the "new aops"
> patches.
>
> They've been getting quite a bit of testing in -mm and no such problems,
> but I doubt anyone was doing much unionfs over jffs2, or even much jffs2
> testing with -mm.
>
> The bug smells like jffs2 is actually passing back a "written" length
> greater than the length we passed into it.
>
> The following might show what's happening.

Hmm, looks like jffs2_write_end is writing more than we actually ask it
to, and returns that back.

unsigned aligned_start = start & ~3;

and

if (end == PAGE_CACHE_SIZE) {
/* When writing out the end of a page, write out the
   _whole_ page. This helps to reduce the number of
   nodes in files which have many short writes, like
   syslog files. */
start = aligned_start = 0;
}

These "longer" writes are fine, but they shouldn't get propagated back
to the vm/vfs. Something like the following patch might fix it.

Index: linux-2.6/fs/jffs2/file.c
===
--- linux-2.6.orig/fs/jffs2/file.c
+++ linux-2.6/fs/jffs2/file.c
@@ -255,7 +255,7 @@ static int jffs2_write_end(struct file *
 		   _whole_ page. This helps to reduce the number of
 		   nodes in files which have many short writes, like
 		   syslog files. */
-		start = aligned_start = 0;
+		aligned_start = 0;
 	}
 
 	ri = jffs2_alloc_raw_inode();
@@ -291,14 +291,11 @@ static int jffs2_write_end(struct file *
 	}
 
 	/* Adjust writtenlen for the padding we did, so we don't confuse our caller */
-	if (writtenlen < (start&3))
-		writtenlen = 0;
-	else
-		writtenlen -= (start&3);
+	writtenlen -= min(writtenlen, (start - aligned_start));
 
 	if (writtenlen) {
-		if (inode->i_size < (pg->index << PAGE_CACHE_SHIFT) + start + writtenlen) {
-			inode->i_size = (pg->index << PAGE_CACHE_SHIFT) + start + writtenlen;
+		if (inode->i_size < pos + start + writtenlen) {
+			inode->i_size = pos + start + writtenlen;
 			inode->i_blocks = (inode->i_size + 511) >> 9;
 
 			inode->i_ctime = inode->i_mtime = ITIME(je32_to_cpu(ri->ctime));


Re: BUG at mm/filemap.c:1749 (2.6.24, jffs2, unionfs)

2007-10-19 Thread Nick Piggin
On Friday 19 October 2007 16:05, Erez Zadok wrote:
> David,
>
> I'm testing unionfs on top of jffs2, using 2.6.24 as of linus's commit
> 4fa4d23fa20de67df919030c1216295664866ad7.  All of my unionfs tests pass
> when unionfs is stacked on top of jffs2, other than my truncate test --
> which tries to truncate files up/down (through the union, which then is
> passed through to the lower jffs2 f/s).  The same truncate test passes on
> all other file systems I've tried unionfs/2.6.24 with, as well as all of
> the earlier kernels that unionfs runs on (2.6.9--2.6.23).  So I tend to
> think this bug is more probably due to something else going on in 2.6.24,
> possibly wrt jffs2/mtd.  (Of course, it's still possible that unionfs isn't
> doing something right -- any pointers?)
>
> The oops trace is included below.  Is this a known issue and if so, any
> fixes?  If this is the first you hear of this problem, let me know and I'll
> try to narrow it down further.

It's had quite a lot of recent changes in that area -- the "new aops"
patches.

They've been getting quite a bit of testing in -mm and no such problems,
but I doubt anyone was doing much unionfs over jffs2, or even much jffs2
testing with -mm.

The bug smells like jffs2 is actually passing back a "written" length
greater than the length we passed into it.

The following might show what's happening.
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2184,6 +2184,7 @@ fs_write_aop_error:
 	return written ? written : status;
 }
 
+#include <linux/kallsyms.h>
 static ssize_t generic_perform_write(struct file *file,
 struct iov_iter *i, loff_t pos)
 {
@@ -2243,6 +2244,13 @@ again:
 		page, fsdata);
 		if (unlikely(status < 0))
 			break;
+		if (status > copied) {
+			print_symbol("%s returned more than it should!\n", (unsigned long)a_ops->write_end);
+			printk("status = %ld, copied = %lu\n", status, copied);
+			dump_stack();
+			status = copied; /* try to fix */
+		}
+
 		copied = status;
 
 		cond_resched();


[patch] x86: lock bitops

2007-10-18 Thread Nick Piggin

I missed an obvious one!

x86 CPUs are defined not to reorder stores past earlier loads, so there is
no hardware memory barrier required to implement a release-consistent store
(all stores are, by definition).

So ditch the generic lock bitops, and implement optimised versions for x86,
which removes the mfence from __clear_bit_unlock (which is already a useful
primitive for SLUB).

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
Index: linux-2.6/include/asm-x86/bitops_32.h
===
--- linux-2.6.orig/include/asm-x86/bitops_32.h
+++ linux-2.6/include/asm-x86/bitops_32.h
@@ -76,6 +76,20 @@ static inline void clear_bit(int nr, vol
:"Ir" (nr));
 }
 
+/*
+ * clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * clear_bit() is atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock.
+ */
+static inline void clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
+{
+   barrier();
+   clear_bit(nr, addr);
+}
+
 static inline void __clear_bit(int nr, volatile unsigned long * addr)
 {
__asm__ __volatile__(
@@ -83,6 +97,25 @@ static inline void __clear_bit(int nr, v
:"+m" (ADDR)
:"Ir" (nr));
 }
+
+/*
+ * __clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * __clear_bit() is non-atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock if no other CPUs can concurrently
+ * modify other bits in the word.
+ *
+ * No memory barrier is required here, because x86 cannot reorder stores past
+ * older loads. Same principle as spin_unlock.
+ */
+static inline void __clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
+{
+   barrier();
+   __clear_bit(nr, addr);
+}
+
 #define smp_mb__before_clear_bit() barrier()
 #define smp_mb__after_clear_bit()  barrier()
 
@@ -142,6 +175,15 @@ static inline int test_and_set_bit(int n
 }
 
 /**
+ * test_and_set_bit_lock - Set a bit and return its old value for lock
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This is the same as test_and_set_bit on x86
+ */
+#define test_and_set_bit_lock test_and_set_bit
+
+/**
  * __test_and_set_bit - Set a bit and return its old value
  * @nr: Bit to set
  * @addr: Address to count from
@@ -402,7 +444,6 @@ static inline int fls(int x)
 }
 
 #include <asm-generic/bitops/hweight.h>
-#include <asm-generic/bitops/lock.h>
 
 #endif /* __KERNEL__ */
 
Index: linux-2.6/include/asm-x86/bitops_64.h
===
--- linux-2.6.orig/include/asm-x86/bitops_64.h
+++ linux-2.6/include/asm-x86/bitops_64.h
@@ -68,6 +68,20 @@ static __inline__ void clear_bit(int nr,
:"dIr" (nr));
 }
 
+/*
+ * clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * clear_bit() is atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock.
+ */
+static inline void clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
+{
+   barrier();
+   clear_bit(nr, addr);
+}
+
 static __inline__ void __clear_bit(int nr, volatile void * addr)
 {
__asm__ __volatile__(
@@ -76,6 +90,24 @@ static __inline__ void __clear_bit(int n
:"dIr" (nr));
 }
 
+/*
+ * __clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * __clear_bit() is non-atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock if no other CPUs can concurrently
+ * modify other bits in the word.
+ *
+ * No memory barrier is required here, because x86 cannot reorder stores past
+ * older loads. Same principle as spin_unlock.
+ */
+static inline void __clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
+{
+   barrier();
+   __clear_bit(nr, addr);
+}
+
 #define smp_mb__before_clear_bit() barrier()
 #define smp_mb__after_clear_bit()  barrier()
 
@@ -133,6 +165,15 @@ static __inline__ int test_and_set_bit(i
 }
 
 /**
+ * test_and_set_bit_lock - Set a bit and return its old value for lock
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This is the same as test_and_set_bit on x86
+ */
+#define test_and_set_bit_lock test_and_set_bit
+
+/**
  * __test_and_set_bit - Set a bit and return its old value
  * @nr: Bit to set
  * @addr: Address to count from
@@ -408,7 +449,6 @@ static __inline__ int fls(int x)
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
 #include <asm-generic/bitops/hweight.h>
-#include <asm-generic/bitops/lock.h>
 
 #endif /* __KERNEL__ */
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] synchronize_irq needs a barrier

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 13:28, Herbert Xu wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> >> First of all let's agree on some basic assumptions:
> >>
> >> * A pair of spin lock/unlock subsumes the effect of a full mb.
> >
> > Not unless you mean a pair of spin lock/unlock as in
> > 2 spin lock/unlock pairs (4 operations).
> >
> > *X = 10;
> > spin_lock();
> > /* *Y speculatively loaded here */
> > /* store to *X leaves CPU store queue here */
> > spin_unlock();
> > y = *Y;
>
> Good point.
>
> Although in this case we're still safe because in the worst
> cases:
>
> CPU0  CPU1
> irq_sync = 1
> synchronize_irq
>   spin lock
>   load IRQ_INPROGRESS
> irq_sync sync is visible
>   spin unlock
>   spin lock
>   load irq_sync
>   while (IRQ_INPROGRESS)
>   wait
>   return
>   set IRQ_INPROGRESS
>   spin unlock
>   tg3_msi
>   ack IRQ
>   if (irq_sync)
>   return
>   spin lock
>   clear IRQ_INPROGRESS
>   spin unlock
>
> 
>
> CPU0  CPU1
>   spin lock
>   load irq_sync
> irq_sync = 1
> synchronize_irq
>   set IRQ_INPROGRESS
>   spin unlock
>   spin lock
>   load IRQ_INPROGRESS
> irq_sync sync is visible
>   spin unlock
>   while (IRQ_INPROGRESS)
>   wait
>   tg3_msi
>   ack IRQ
>   if (irq_sync)
>   return
>   do work
>   spin lock
>   clear IRQ_INPROGRESS
>   spin unlock
>   return
>
> So because we're using the same lock on both sides, it does
> do the right thing even without the memory barrier.


Yeah, if you've got the lock on both sides there, then I
agree it will be correct.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] synchronize_irq needs a barrier

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 12:32, Herbert Xu wrote:

> First of all let's agree on some basic assumptions:
>
> * A pair of spin lock/unlock subsumes the effect of a full mb.

Not unless you mean a pair of spin lock/unlock as in
2 spin lock/unlock pairs (4 operations).

*X = 10;
spin_lock();
/* *Y speculatively loaded here */
/* store to *X leaves CPU store queue here */
spin_unlock();
y = *Y;

> * A spin lock in general only equates to (SS/SL/LL).
> * A spin unlock in general only equates to (SS/LS).

I don't use the sparc barriers, so they don't come naturally to
me ;)

I think both loads and stores can pass into the critical section
by having the spin_lock pass earlier ops, or by having spin_unlock
be passed by later ones.
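
A sketch of what I mean (annotations mine, not from anyone's patch):

	x = *A;			/* may move forward, past spin_lock() */
	spin_lock(&lock);
	...
	spin_unlock(&lock);
	*B = y;			/* may move backward, past spin_unlock() */

Acquire/release semantics allow both movements: operations outside may
leak *into* the critical section, but nothing inside may leak out.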


> In particular, a load after a spin unlock may pass before the
> spin unlock.
>
> Here is the race (with tg3 as the only example that I know of).
> The driver attempts to quiesce interrupts such that after the
> call to synchronize_irq it is intended that no further IRQ
> handler calls for that device will do any work besides acking
> the IRQ.
>
> Here is how it does it:
>
> CPU0  CPU1
>   spin lock
>   load irq_sync
> irq_sync = 1
> smp_mb
> synchronize_irq()
>   while (IRQ_INPROGRESS)
>   wait
>   return
>   set IRQ_INPROGRESS
>   spin unlock
>   tg3_msi
>   ack IRQ
>   if (irq_sync)
>   return
>   do work
>
> The problem here is that load of irq_sync in the handler has
> passed above the setting of IRQ_INPROGRESS.
>
> Linus's patch fixes it because this becomes:
>
> CPU0  CPU1
>   spin lock
>   load irq_sync
> irq_sync = 1
> smp_mb
> synchronize_irq
>   set IRQ_INPROGRESS
>   spin unlock
>   spin lock
>   spin unlock
>   tg3_msi
>   ack IRQ
>   if (irq_sync)
>   return
>   do work
>   while (IRQ_INPROGRESS)
>   wait
>   spin lock
>   clear IRQ_INPROGRESS
>   spin unlock
>   return
>
> Even though we still do the work on the right we will now notice
> the INPROGRESS flag on the left and wait.
>
> It's hard to fix this in the drivers because they'd either have
> to access the desc lock or add a full mb to the fast path on the
> right.
>
> Once this goes in we can also remove the smp_mb from tg3.c.  BTW,
> a lot of drivers (including the fusion example Ben quoted) call
> synchronize_irq before free_irq.  This is unnecessary because
> the latter already calls it anyway.
>
> Cheers,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: Avoid atomic operation for slab_unlock

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 12:01, Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Nick Piggin wrote:
> > > Yes that is what I attempted to do with the write barrier. To my
> > > knowledge there are no reads that could bleed out and I wanted to avoid
> > > a full fence instruction there.
> >
> > Oh, OK. Bit risky ;) You might be right, but anyway I think it
> > should be just as fast with the optimised bit_unlock on most
> > architectures.
>
> How expensive is the fence? An store with release semantics would be safer
> and okay for IA64.

I'm not sure, I had an idea it was relatively expensive on ia64,
but I didn't really test with a good workload (a microbenchmark
probably isn't that good because it won't generate too much out
of order memory traffic that needs to be fenced).


> > Which reminds me, it would be interesting to test the ia64
> > implementation I did. For the non-atomic unlock, I'm actually
> > doing an atomic operation there so that it can use the release
> > barrier rather than the mf. Maybe it's faster the other way around
> > though? Will be useful to test with something that isn't a trivial
> > loop, so the slub case would be a good benchmark.
>
> Lets avoid mf (too expensive) and just use a store with release semantics.

OK, that's what I've done at the moment.


> Where can I find your patchset? I looked through lkml but did not see it.

Infrastructure in -mm, starting at bitops-introduce-lock-ops.patch.
bit_spin_lock-use-lock-bitops.patch and ia64-lock-bitops.patch are
ones to look at.

The rest of the patches I have queued here, apart from the SLUB patch,
I guess aren't so interesting to you (they don't do anything fancy
like convert to non-atomic unlocks, just switch things like page and
buffer locks to use new bitops).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: Avoid atomic operation for slab_unlock

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 11:21, Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Nick Piggin wrote:
> > Ah, thanks, but can we just use my earlier patch that does the
> > proper __bit_spin_unlock which is provided by
> > bit_spin_lock-use-lock-bitops.patch
>
> Ok.
>
> > This primitive should have a better chance at being correct, and
> > also potentially be more optimised for each architecture (it
> > only has to provide release consistency).
>
> Yes that is what I attempted to do with the write barrier. To my knowledge
> there are no reads that could bleed out and I wanted to avoid a full fence
> instruction there.

Oh, OK. Bit risky ;) You might be right, but anyway I think it
should be just as fast with the optimised bit_unlock on most
architectures.

Which reminds me, it would be interesting to test the ia64
implementation I did. For the non-atomic unlock, I'm actually
doing an atomic operation there so that it can use the release
barrier rather than the mf. Maybe it's faster the other way around
though? Will be useful to test with something that isn't a trivial
loop, so the slub case would be a good benchmark.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kswapd0 inefficient?

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 08:05, Richard Jelinek wrote:
> Hello guys,
>
> I'm not subscribed to this list, so if you find this question valid
> enough to answer it, please cc me. Thanks.
>
> This is what the top-output looks like on my machine after having
> copied about 550GB of data from a twofish256 crypted disk to a raid
> array:
> --
> Mem:   8178452k total,  8132180k used,46272k free,  2743480k buffers
> Swap:0k total,0k used,0k free,  4563032k cached
>
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>  5954 root   0 -20 000 R   62  0.0  96:42.61 loop0
>  6014 root  18   0 23744  19m  484 R   20  0.2  25:45.31 cp
>   255 root  10  -5 000 S8  0.0  10:21.82 kswapd0
>  6011 root  10  -5 000 D6  0.0   4:15.66 kjournald
> ...yadda yadda...
> --
>
> And what do we see here? We see loop0 and cp eating up some
> time. That's ok for me considered the work they do. kjournald is also
> ok for me, but I ask myself: why the heck has kswapd0 crunched 10+
> minutes of CPU time?
>
> I mean what does kswapd0 do?
> http://www.linuxforums.org/forum/linux-kernel/65380-what-does-kswapd0-do.html
>
> And I have no swap - right? So it should just shut up - IMHO. Or am I
> missing something?

kswapd also reclaims pagecache, not just anonymous memory. It runs
in response to memory pressure and if it wasn't around, then all
your apps requesting memory would have to do basically the same
amount of work themselves.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: Avoid atomic operation for slab_unlock

2007-10-18 Thread Nick Piggin
[sorry, I read and replied to my inbox before mailing lists...
please ignore the last mail on this patch, and reply to this
one which is properly threaded]

On Friday 19 October 2007 08:15, Christoph Lameter wrote:
> Currently page flags are only modified in SLUB under page lock. This means
> that we do not need an atomic operation to release the lock since there
> is nothing we can race against that is modifying page flags. We can simply
> clear the bit without the use of an atomic operation and make sure that
> this change becomes visible after the other changes to slab metadata
> through a memory barrier.
>
> The performance of slab_free() increases 10-15% (SMP configuration doing
> a long series of remote frees).

Ah, thanks, but can we just use my earlier patch that does the
proper __bit_spin_unlock which is provided by
bit_spin_lock-use-lock-bitops.patch

This primitive should have a better chance at being correct, and
also potentially be more optimised for each architecture (it
only has to provide release consistency).

I have attached the patch here just for reference, but actually
I am submitting it properly as part of a patch series today, now
that the base bit lock patches have been sent upstream.


>  mm/slub.c |   15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff -puN mm/slub.c~slub-avoid-atomic-operation-for-slab_unlock mm/slub.c
> --- a/mm/slub.c~slub-avoid-atomic-operation-for-slab_unlock
> +++ a/mm/slub.c
> @@ -1181,9 +1181,22 @@ static __always_inline void slab_lock(st
>   bit_spin_lock(PG_locked, &page->flags);
>  }
>
> +/*
> + * Slab unlock version that avoids having to use atomic operations
> + * (echos some of the code of bit_spin_unlock!)
> + */
>  static __always_inline void slab_unlock(struct page *page)
>  {
> - bit_spin_unlock(PG_locked, &page->flags);
> +#ifdef CONFIG_SMP
> + unsigned long flags;
> +
> + flags = page->flags & ~(1 << PG_locked);
> +
> + smp_wmb();
> + page->flags = flags;
> +#endif
> + preempt_enable();
> + __release(bitlock);
>  }

This looks wrong, because it would allow the store unlocking
flags to pass a load within the critical section.

stores aren't allowed to pass loads in x86 (only vice versa),
so you might have been confused by looking at x86's spinlocks
into thinking this will work. However on powerpc and sparc, I
don't think it gives you the right types of barriers.
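
To make that concrete -- an illustrative sketch, not code from the
patch above:

	slab_lock(page);
	object = page->freelist;	/* load inside the critical section */
	...
	smp_wmb();			/* orders stores against stores only */
	page->flags = flags;		/* the unlocking store */

smp_wmb() does nothing to order the earlier load against the unlocking
store, so a weakly ordered CPU (powerpc, say) may effectively complete
the page->freelist load after the lock is visibly free, while another
CPU is already modifying the freelist.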

Slub can use the non-atomic version to unlock because other flags will not
get modified with the lock held.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
 mm/slub.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -1185,7 +1185,7 @@ static __always_inline void slab_lock(st
 
 static __always_inline void slab_unlock(struct page *page)
 {
-	bit_spin_unlock(PG_locked, &page->flags);
+	__bit_spin_unlock(PG_locked, &page->flags);
 }
 
 static __always_inline int slab_trylock(struct page *page)


Re: + slub-avoid-atomic-operation-for-slab_unlock.patch added to -mm tree

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 08:31, [EMAIL PROTECTED] wrote:

> --
> Subject: SLUB: avoid atomic operation for slab_unlock
> From: Christoph Lameter <[EMAIL PROTECTED]>
>
> Currently page flags are only modified in SLUB under page lock.  This means
> that we do not need an atomic operation to release the lock since there is
> nothing we can race against that is modifying page flags.  We can simply
> clear the bit without the use of an atomic operation and make sure that
> this change becomes visible after the other changes to slab metadata
> through a memory barrier.
>
> The performance of slab_free() increases 10-15% (SMP configuration doing
> a long series of remote frees).

Ah, thanks, but can we just use my earlier patch that does the
proper __bit_spin_unlock which is provided by
bit_spin_lock-use-lock-bitops.patch

This primitive should have a better chance at being correct, and
also potentially be more optimised for each architecture (it
only has to provide release consistency).

I have attached the patch here just for reference, but actually
I am submitting it properly as part of a patch series today, now
that the base bit lock patches have been sent upstream.


>  mm/slub.c |   15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff -puN mm/slub.c~slub-avoid-atomic-operation-for-slab_unlock mm/slub.c
> --- a/mm/slub.c~slub-avoid-atomic-operation-for-slab_unlock
> +++ a/mm/slub.c
> @@ -1181,9 +1181,22 @@ static __always_inline void slab_lock(st
>   bit_spin_lock(PG_locked, &page->flags);
>  }
>
> +/*
> + * Slab unlock version that avoids having to use atomic operations
> + * (echos some of the code of bit_spin_unlock!)
> + */
>  static __always_inline void slab_unlock(struct page *page)
>  {
> - bit_spin_unlock(PG_locked, &page->flags);
> +#ifdef CONFIG_SMP
> + unsigned long flags;
> +
> + flags = page->flags & ~(1 << PG_locked);
> +
> + smp_wmb();
> + page->flags = flags;
> +#endif
> + preempt_enable();
> + __release(bitlock);
>  }

This looks wrong, because it would allow the store unlocking
flags to pass a load within the critical section.

stores aren't allowed to pass loads in x86 (only vice versa),
so you might have been confused by looking at x86's spinlocks
into thinking this will work. However on powerpc and sparc, I
don't think it gives you the right types of barriers.
Slub can use the non-atomic version to unlock because other flags will not
get modified with the lock held.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
 mm/slub.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -1185,7 +1185,7 @@ static __always_inline void slab_lock(st
 
 static __always_inline void slab_unlock(struct page *page)
 {
-	bit_spin_unlock(PG_locked, &page->flags);
+	__bit_spin_unlock(PG_locked, &page->flags);
 }
 
 static __always_inline int slab_trylock(struct page *page)


Re: Kernel 2.6.21.6:"swapper: page allocation failure. order:1, mode:0x20"

2007-10-18 Thread Nick Piggin
On Thursday 18 October 2007 16:16, Andrew A. Razdolsky wrote:
> Hello!
>
> In attachments i did pick all info i know about this failure.

Hi,

Does this actually cause problems for your system? Occasional
page allocation failures from interrupt context are expected.

If you are getting a lot of these messages, you can try
increasing /proc/sys/vm/min_free_kbytes until they go away.
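
For example, something like

	echo 16384 > /proc/sys/vm/min_free_kbytes

where 16384 is just an arbitrary figure to start from (the value is in
kilobytes).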
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How Inactive may be much greather than cached?

2007-10-18 Thread Nick Piggin
On Thursday 18 October 2007 17:14, Vasily Averin wrote:
> Nick Piggin wrote:
> > Hi,
> >
> > On Thursday 18 October 2007 16:24, Vasily Averin wrote:
> >> Hi all,
> >>
> >> could anybody explain how "inactive" may be much greater than "cached"?
> >> stress test (http://weather.ou.edu/~apw/projects/stress/) that writes
> >> into removed files in cycle puts the node to the following state:
> >>
> >> MemTotal: 16401648 kB
> >> MemFree:636644 kB
> >> Buffers:   1122556 kB
> >> Cached: 362880 kB
> >> SwapCached:700 kB
> >> Active:1604180 kB
> >> Inactive: 13609828 kB
> >>
> >> At the first glance memory should be freed on file closing, nobody
> >> refers to file and ext3_delete_inode() truncates inode. We can see that
> >> memory is go away from "cached", however could somebody explain why it
> >> become "invalid" instead be freed? Who holds the references to these
> >> pages?
> >
> > Buffers, swap cache, and anonymous.
>
> But buffers and swap cache are low (1.1 Gb and 700kB in this example) and
> anonymous should go away when process finished.

Ah, I didn't see it was an order of magnitude out.

Some filesystems, including, I believe, ext3 with data=ordered,
can leave orphaned pages around after they have been truncated
out of the pagecache. These pages get left on the LRU and vmscan
reclaims them pretty easily.

Try ext3 data=writeback, or even ext2.
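(For a scratch mount that would be something like
"mount -t ext3 -o data=writeback /dev/sdX /mnt", with /dev/sdX and /mnt
standing in for your device and mountpoint.)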
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How Inactive may be much greather than cached?

2007-10-18 Thread Nick Piggin
Hi,

On Thursday 18 October 2007 16:24, Vasily Averin wrote:
> Hi all,
>
> could anybody explain how "inactive" may be much greater than "cached"?
> stress test (http://weather.ou.edu/~apw/projects/stress/) that writes into
> removed files in cycle puts the node to the following state:
>
> MemTotal: 16401648 kB
> MemFree: 636644 kB
> Buffers: 1122556 kB
> Cached: 362880 kB
> SwapCached: 700 kB
> Active: 1604180 kB
> Inactive: 13609828 kB
>
> At the first glance memory should be freed on file closing, nobody refers
> to file and ext3_delete_inode() truncates inode. We can see that memory is
> go away from "cached", however could somebody explain why it become
> "invalid" instead be freed? Who holds the references to these pages?

Buffers, swap cache, and anonymous.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] x86: lock bitops

2007-10-18 Thread Nick Piggin

I missed an obvious one!

x86 CPUs are defined not to reorder stores past earlier loads, so there is
no hardware memory barrier required to implement a release-consistent store
(all stores are, by definition).

So ditch the generic lock bitops, and implement optimised versions for x86,
which removes the mfence from __clear_bit_unlock (which is already a useful
primitive for SLUB).

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
Index: linux-2.6/include/asm-x86/bitops_32.h
===
--- linux-2.6.orig/include/asm-x86/bitops_32.h
+++ linux-2.6/include/asm-x86/bitops_32.h
@@ -76,6 +76,20 @@ static inline void clear_bit(int nr, vol
		:"Ir" (nr));
 }
 
+/*
+ * clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * clear_bit() is atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock.
+ */
+static inline void clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
+{
+   barrier();
+   clear_bit(nr, addr);
+}
+
 static inline void __clear_bit(int nr, volatile unsigned long * addr)
 {
__asm__ __volatile__(
@@ -83,6 +97,25 @@ static inline void __clear_bit(int nr, v
		:"+m" (ADDR)
		:"Ir" (nr));
 }
+
+/*
+ * __clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * __clear_bit() is non-atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock if no other CPUs can concurrently
+ * modify other bits in the word.
+ *
+ * No memory barrier is required here, because x86 cannot reorder stores past
+ * older loads. Same principle as spin_unlock.
+ */
+static inline void __clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
+{
+   barrier();
+   __clear_bit(nr, addr);
+}
+
 #define smp_mb__before_clear_bit() barrier()
 #define smp_mb__after_clear_bit()  barrier()
 
@@ -142,6 +175,15 @@ static inline int test_and_set_bit(int n
 }
 
 /**
+ * test_and_set_bit_lock - Set a bit and return its old value for lock
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This is the same as test_and_set_bit on x86
+ */
+#define test_and_set_bit_lock test_and_set_bit
+
+/**
  * __test_and_set_bit - Set a bit and return its old value
  * @nr: Bit to set
  * @addr: Address to count from
@@ -402,7 +444,6 @@ static inline int fls(int x)
 }
 
 #include <asm-generic/bitops/hweight.h>
-#include <asm-generic/bitops/lock.h>
 
 #endif /* __KERNEL__ */
 
Index: linux-2.6/include/asm-x86/bitops_64.h
===
--- linux-2.6.orig/include/asm-x86/bitops_64.h
+++ linux-2.6/include/asm-x86/bitops_64.h
@@ -68,6 +68,20 @@ static __inline__ void clear_bit(int nr,
		:"dIr" (nr));
 }
 
+/*
+ * clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * clear_bit() is atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock.
+ */
+static inline void clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
+{
+   barrier();
+   clear_bit(nr, addr);
+}
+
 static __inline__ void __clear_bit(int nr, volatile void * addr)
 {
__asm__ __volatile__(
@@ -76,6 +90,24 @@ static __inline__ void __clear_bit(int n
		:"dIr" (nr));
 }
 
+/*
+ * __clear_bit_unlock - Clears a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * __clear_bit() is non-atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock if no other CPUs can concurrently
+ * modify other bits in the word.
+ *
+ * No memory barrier is required here, because x86 cannot reorder stores past
+ * older loads. Same principle as spin_unlock.
+ */
+static inline void __clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
+{
+   barrier();
+   __clear_bit(nr, addr);
+}
+
 #define smp_mb__before_clear_bit() barrier()
 #define smp_mb__after_clear_bit()  barrier()
 
@@ -133,6 +165,15 @@ static __inline__ int test_and_set_bit(i
 }
 
 /**
+ * test_and_set_bit_lock - Set a bit and return its old value for lock
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This is the same as test_and_set_bit on x86
+ */
+#define test_and_set_bit_lock test_and_set_bit
+
+/**
  * __test_and_set_bit - Set a bit and return its old value
  * @nr: Bit to set
  * @addr: Address to count from
@@ -408,7 +449,6 @@ static __inline__ int fls(int x)
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
 #include <asm-generic/bitops/hweight.h>
-#include <asm-generic/bitops/lock.h>
 
 #endif /* __KERNEL__ */
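
As a usage sketch (illustrative only; the function names are made up),
the lock bitops pair up like a bit spinlock:

	static inline void example_lock(struct page *page)
	{
		while (test_and_set_bit_lock(PG_locked, &page->flags))
			cpu_relax();	/* spin until the bit is released */
	}

	static inline void example_unlock(struct page *page)
	{
		/* non-atomic version: only safe when no other bits in the
		 * word can be modified while the lock is held */
		__clear_bit_unlock(PG_locked, &page->flags);
	}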
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] synchronize_irq needs a barrier

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 13:28, Herbert Xu wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > First of all let's agree on some basic assumptions:
> > >
> > > * A pair of spin lock/unlock subsumes the effect of a full mb.
> >
> > Not unless you mean a pair of spin lock/unlock as in
> > 2 spin lock/unlock pairs (4 operations).
> >
> > *X = 10;
> > spin_lock(lock);
> > /* *Y speculatively loaded here */
> > /* store to *X leaves CPU store queue here */
> > spin_unlock(lock);
> > y = *Y;
>
> Good point.
>
> Although in this case we're still safe because in the worst
> cases:
>
> CPU0                            CPU1
> irq_sync = 1
> synchronize_irq
> spin lock
> load IRQ_INPROGRESS
> irq_sync sync is visible
> spin unlock
>                                 spin lock
>                                 load irq_sync
> while (IRQ_INPROGRESS)
> wait
> return
>                                 set IRQ_INPROGRESS
>                                 spin unlock
>                                 tg3_msi
>                                 ack IRQ
>                                 if (irq_sync)
>                                 return
>                                 spin lock
>                                 clear IRQ_INPROGRESS
>                                 spin unlock
>
>
> CPU0                            CPU1
>                                 spin lock
>                                 load irq_sync
> irq_sync = 1
> synchronize_irq
>                                 set IRQ_INPROGRESS
>                                 spin unlock
> spin lock
> load IRQ_INPROGRESS
> irq_sync sync is visible
> spin unlock
> while (IRQ_INPROGRESS)
> wait
>                                 tg3_msi
>                                 ack IRQ
>                                 if (irq_sync)
>                                 return
>                                 do work
>                                 spin lock
>                                 clear IRQ_INPROGRESS
>                                 spin unlock
> return
>
> So because we're using the same lock on both sides, it does
> do the right thing even without the memory barrier.


Yeah, if you've got the lock on both sides there, then I
agree it will be correct.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] synchronize_irq needs a barrier

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 12:32, Herbert Xu wrote:

> First of all let's agree on some basic assumptions:
>
> * A pair of spin lock/unlock subsumes the effect of a full mb.

Not unless you mean a pair of spin lock/unlock as in
2 spin lock/unlock pairs (4 operations).

*X = 10;
spin_lock(lock);
/* *Y speculatively loaded here */
/* store to *X leaves CPU store queue here */
spin_unlock(lock);
y = *Y;

> * A spin lock in general only equates to (SS/SL/LL).
> * A spin unlock in general only equates to (SS/LS).

I don't use the sparc barriers, so they don't come naturally to
me ;)

I think both loads and stores can pass into the critical section
by having the spin_lock pass earlier ops, or by having spin_unlock
be passed by later ones.


> In particular, a load after a spin unlock may pass before the
> spin unlock.
>
> Here is the race (with tg3 as the only example that I know of).
> The driver attempts to quiesce interrupts such that after the
> call to synchronize_irq it is intended that no further IRQ
> handler calls for that device will do any work besides acking
> the IRQ.
>
> Here is how it does it:
>
> CPU0                            CPU1
>                                 spin lock
>                                 load irq_sync
> irq_sync = 1
> smp_mb
> synchronize_irq()
> while (IRQ_INPROGRESS)
> wait
> return
>                                 set IRQ_INPROGRESS
>                                 spin unlock
>                                 tg3_msi
>                                 ack IRQ
>                                 if (irq_sync)
>                                 return
>                                 do work
>
> The problem here is that load of irq_sync in the handler has
> passed above the setting of IRQ_INPROGRESS.
>
> Linus's patch fixes it because this becomes:
>
> CPU0                            CPU1
>                                 spin lock
>                                 load irq_sync
> irq_sync = 1
> smp_mb
> synchronize_irq
>                                 set IRQ_INPROGRESS
>                                 spin unlock
> spin lock
> spin unlock
>                                 tg3_msi
>                                 ack IRQ
>                                 if (irq_sync)
>                                 return
>                                 do work
> while (IRQ_INPROGRESS)
> wait
>                                 spin lock
>                                 clear IRQ_INPROGRESS
>                                 spin unlock
> return
>
> Even though we still do the work on the right we will now notice
> the INPROGRESS flag on the left and wait.
>
> It's hard to fix this in the drivers because they'd either have
> to access the desc lock or add a full mb to the fast path on the
> right.
>
> Once this goes in we can also remove the smp_mb from tg3.c.  BTW,
> a lot of drivers (including the fusion example Ben quoted) call
> synchronize_irq before free_irq.  This is unnecessary because
> the latter already calls it anyway.
>
> Cheers,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: Avoid atomic operation for slab_unlock

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 11:21, Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Nick Piggin wrote:
> > Ah, thanks, but can we just use my earlier patch that does the
> > proper __bit_spin_unlock which is provided by
> > bit_spin_lock-use-lock-bitops.patch
>
> Ok.
>
> > This primitive should have a better chance at being correct, and
> > also potentially be more optimised for each architecture (it
> > only has to provide release consistency).
>
> Yes that is what I attempted to do with the write barrier. To my knowledge
> there are no reads that could bleed out and I wanted to avoid a full fence
> instruction there.

Oh, OK. Bit risky ;) You might be right, but anyway I think it
should be just as fast with the optimised bit_unlock on most
architectures.

Which reminds me, it would be interesting to test the ia64
implementation I did. For the non-atomic unlock, I'm actually
doing an atomic operation there so that it can use the release
barrier rather than the mf. Maybe it's faster the other way around
though? Will be useful to test with something that isn't a trivial
loop, so the slub case would be a good benchmark.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: Avoid atomic operation for slab_unlock

2007-10-18 Thread Nick Piggin
On Friday 19 October 2007 12:01, Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Nick Piggin wrote:
> > > Yes that is what I attempted to do with the write barrier. To my
> > > knowledge there are no reads that could bleed out and I wanted to avoid
> > > a full fence instruction there.
> >
> > Oh, OK. Bit risky ;) You might be right, but anyway I think it
> > should be just as fast with the optimised bit_unlock on most
> > architectures.
>
> How expensive is the fence? An store with release semantics would be safer
> and okay for IA64.

I'm not sure, I had an idea it was relatively expensive on ia64,
but I didn't really test with a good workload (a microbenchmark
probably isn't that good because it won't generate too much out
of order memory traffic that needs to be fenced).


> > Which reminds me, it would be interesting to test the ia64
> > implementation I did. For the non-atomic unlock, I'm actually
> > doing an atomic operation there so that it can use the release
> > barrier rather than the mf. Maybe it's faster the other way around
> > though? Will be useful to test with something that isn't a trivial
> > loop, so the slub case would be a good benchmark.
>
> Lets avoid mf (too expensive) and just use a store with release semantics.

OK, that's what I've done at the moment.


> Where can I find your patchset? I looked through lkml but did not see it.

Infrastructure in -mm, starting at bitops-introduce-lock-ops.patch.
bit_spin_lock-use-lock-bitops.patch and ia64-lock-bitops.patch are
ones to look at.

The rest of the patches I have queued here, apart from the SLUB patch,
I guess aren't so interesting to you (they don't do anything fancy
like convert to non-atomic unlocks, just switch things like page and
buffer locks to use new bitops).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings.

2007-10-17 Thread Nick Piggin
On Thursday 18 October 2007 13:59, Eric W. Biederman wrote:
> If filesystems care at all they want absolute control over the buffer
> cache.  Controlling which buffers are dirty and when.  Because we
> keep the buffer cache in the page cache for the block device we have
> not quite been giving filesystems that control leading to really weird
> bugs.

Mmm. Like I said, when a live filesystem is mounted on a bdev,
it isn't like you want userspace to go dancing around on it without
knowing exactly what it is doing.

The kernel more or less does the right thing here with respect to
the *state* of the data[*] (that is, buffer heads and pagecache).
It's when you actually start changing the data itself around that
you'll blow up the filesystem.

[*] The ramdisk code is simply buggy, right? (and not the buffer
cache)

The idea of your patch in theory is OK, but Andrew raises valid
points about potential coherency problems, I think.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-17 Thread Nick Piggin
On Thursday 18 October 2007 04:45, Eric W. Biederman wrote:
> At this point my concern is what makes a clean code change in the
> kernel.  Because user space can currently play with buffer_heads
> by way of the block device and cause lots of havoc (see the recent

Well if userspace is writing to the filesystem metadata via the
blockdevice while it is running... that's the definition of havoc,
isn't it? ;) Whether or not the writes are going via a unified
metadata/blockdev cache or separate ones.

You really just have to not do that.

The actual reiserfs problem being seen is not because of userspace
going silly, but because ramdisk is hijacking the dirty bits.


> If that change is made then it happens that the current ramdisk
> would not need to worry about buffer heads and all of that
> nastiness and could just lock pages in the page cache.  It would not
> be quite as good for testing filesystems but retaining the existing
> characteristics would be simple.

No, it wouldn't. Because if you're proposing to split up the buffer
cache and the metadata cache, then you're back to a 2 cache
solution which basically has the memory characteristics of my
proposal while still being horribly incestuous with the pagecache.


> After having looked a bit deeper the buffer_heads and the block
> devices don't look as intricately tied up as I had first thought.
> We still have the nasty case of:
>   if (buffer_new(bh))
>   unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> That I don't know how it got merged.  But otherwise the caches
> are fully separate.

Well it's needed because some filesystems forget about their old
metadata. It's not really there to solve aliasing with the blockdev
pagecache.


> So currently it looks to me like there are two big things that will
> clean up that part of the code a lot:
> - moving the metadata buffer_heads to a magic filesystem inode.
> - Using a simpler non-buffer_head returning version of get_block
>   so we can make simple generic code for generating BIOs.

Although this is going off the track of the ramdisk problem. For
that we should just do the rewrite.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-17 Thread Nick Piggin
On Wednesday 17 October 2007 20:30, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Tuesday 16 October 2007 18:08, Nick Piggin wrote:
> >> On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:
> >> > > What magic restrictions on page allocations? Actually we have
> >> > > fewer restrictions on page allocations because we can use
> >> > > highmem!
> >> >
> >> > With the proposed rewrite yes.
> >
> > Here's a quick first hack...
> >
> > Comments?
>
> I have beaten my version of this into working shape, and things
> seem ok.
>
> However I'm beginning to think that the real solution is to remove
> the dependence on buffer heads for caching the disk mapping for
> data pages, and move the metadata buffer heads off of the block
> device page cache pages.  Although I am just a touch concerned
> there may be an issue with using filesystem tools while the
> filesystem is mounted if I move the metadata buffer heads.
>
> If we were to move the metadata buffer heads (assuming I haven't
> missed some weird dependency) then I think there a bunch of
> weird corner cases that would be simplified.

I'd prefer just to go along the lines of what I posted. It's
a pure block device driver which knows absolutely nothing about
vm or vfs.

What are you guys using rd for, and is it really important to
have this supposed buffercache optimisation?


> I guess that is where I look next.
>
> Oh for what it is worth I took a quick look at fsblock and I don't think
> struct fsblock makes much sense as a block mapping translation layer for
> the data path where the current page caches works well.

Except the page cache doesn't store any such translation. fsblock
works very nicely as a buffer_head and nobh-mode replacement there,
but we could probably go one better (for many filesystems) by using
a tree.


> For less 
> then the cost of 1 fsblock I can cache all of the translations for a
> 1K filesystem on a 4K page.

I'm not exactly sure what you mean, unless you're talking about an
extent based data structure. fsblock is fairly slim as far as a
generic solution goes. You could probably save 4 or 8 bytes in it,
but I don't know if it's worth the trouble.


> I haven't looked to see if fsblock makes sense to use as a buffer head
> replacement yet.

Well I don't want to get too far off topic on the subject of fsblock,
and block mappings (because I think the ramdisk driver simply should
have nothing to do with any of that)... but fsblock is exactly a
buffer head replacement (so if it doesn't make sense, then I've
screwed something up badly! ;))
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LFENCE instruction (was: [rfc][patch 3/3] x86: optimise barriers)

2007-10-17 Thread Nick Piggin
On Wed, Oct 17, 2007 at 01:51:17PM +0800, Herbert Xu wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> >
> > Also, for non-wb memory. I don't think the Intel document referenced
> > says anything about this, but the AMD document says that loads can pass
> > loads (page 8, rule b).
> > 
> > This is why our rmb() is still an lfence.
> 
> BTW, Xen (in particular, the code in drivers/xen) uses mb/rmb/wmb
> instead of smp_mb/smp_rmb/smp_wmb when it accesses memory that's
> shared with other Xen domains or the hypervisor.
> 
> The reason this is necessary is because even if a Xen domain is
> UP the hypervisor might be SMP.
> 
> It would be nice if we can have these adopt the new SMP barriers
> on x86 instead of the IO ones as they currently do.

That's a good point actually. Something like raw_smp_*mb, which
always orders memory, but only for regular WB operations. I could
put that on the todo list...
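
Roughly what I have in mind (sketch only; name and details to be
decided, 32-bit shown):

	/*
	 * Order accesses to WB memory even on !CONFIG_SMP kernels,
	 * e.g. for a UP guest running under an SMP hypervisor.
	 */
	#define raw_smp_mb() \
		asm volatile("lock; addl $0,0(%%esp)" ::: "memory")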

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LFENCE instruction (was: [rfc][patch 3/3] x86: optimise barriers)

2007-10-17 Thread Nick Piggin
On Wed, Oct 17, 2007 at 02:30:32AM +0200, Mikulas Patocka wrote:
> > > You already must not place any data structures into WC memory --- for 
> > > example, spinlocks wouldn't work there.
> > 
> > What do you mean "already"?
> 
> I mean "in current kernel" (I checked it in 2.6.22)

Ahh, that's not "current kernel", though ;)

4071c718555d955a35e9651f77086096ad87d498

 
> > If we already have drivers loading data from
> > WC memory, then rmb() needs to order them, whether or not they actually
> > need it. If that were prohibitively costly, then we'd introduce a new
> > barrier which does not order WC memory, right?
> > 
> > 
> > > wmb() also won't work on WC 
> > > memory, because it assumes that writes are ordered.
> > 
> > You mean the one defined like this:
> >   #define wmb()   asm volatile("sfence" ::: "memory")
> > ? If it assumed writes are ordered, then it would just be a barrier().
> 
> You read wrong part of the include file. Really, it is 
> (2.6.22,include/asm-i386/system.h):
> #ifdef CONFIG_X86_OOSTORE
> #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", 
> X86_FEATURE_XMM)
> #else
> #define wmb()   __asm__ __volatile__ ("": : :"memory")
> #endif
> 
> CONFIG_X86_OOSTORE is dependent on MWINCHIP3D || MWINCHIP2 || MWINCHIPC6
> --- so on Intel and AMD, it is really just barrier().
> 
> So drivers can't assume that wmb() works on write-combining memory.

Drivers should be able to assume that wmb() orders _everything_ (except
some whacky Altix thing, which I really want to fold under wmb at some
point anyway).

So I decided that old x86 semantics isn't right, and now it really is a
lock op / sfence everywhere.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-16 Thread Nick Piggin
On Wednesday 17 October 2007 11:13, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:

> > We have 2 problems. First is that, for testing/consistency, we
> > don't want BLKFLSBUF to throw out the data. Maybe hardly anything
> > uses BLKFLSBUF now, so it could be just a minor problem, but still
> > one to fix.
>
> Hmm.  This is interesting because we won't be doing anything that
> effects correctness if we don't special case BLKFLSBUF just something
> that effects efficiency.  So I think we can get away with just
> changing blkflsbuf as long as there is a way to get rid of
> the data.

Technically, it does change correctness: after BLKFLSBUF, the
ramdisk should contain zeroes.

I'm assuming it would also cause problems in tight embedded
environments if ramdisk ram is supposed to be thrown away but
isn't. So maybe not technically a correctness problem, but could
be the difference between working and not working.


> > Second is actually throwing out the ramdisk data. dd from /dev/zero
> > isn't trivial because it isn't a "command" from the kernel's POV.
> > rd could examine the writes to see if they are zero and page aligned,
> > I suppose... but if you're transitioning everyone over to a new
> > method anyway, might as well make it a nice one ;)
>
> Well I was thinking you can examine the page you just wrote to
> and if it is all zero's you don't need to cache that page anymore.
> Call it intelligent compression.
>
> Further it does make forwards and backwards compatibility simple
> because all you would have to do to reliably free a ramdisk is:
>
> dd if=/dev/zero of=/dev/ramX
> blockdev --flushbufs /dev/ramX

Sure, you could do that, but you still presumably need to support
the old behaviour.

As a test vehicle for filesystems, I'd much rather it didn't do this
of course, because subsequent writes would need to reallocate the
page again.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][rfc] rewrite ramdisk

2007-10-16 Thread Nick Piggin
On Wednesday 17 October 2007 09:48, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Wednesday 17 October 2007 07:28, Theodore Tso wrote:
> >> On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote:
> >> > +/*
> >> > + * ram device BLKFLSBUF has special semantics, we want to 
> >> > actually
> >> > + * release and destroy the ramdisk data.
> >> > + */
> >>
> >> We won't be able to fix completely this for a while time, but the fact
> >> that BLKFLSBUF has special semantics has always been a major wart.
> >> Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a
> >> deperecation printk for BLKFLSBUF when passed to the ramdisk?  I doubt
> >> there are many tools that actually take advantage of this wierd aspect
> >> of ramdisks, so hopefully it's something we could remove in a 18
> >> months or so...
> >
> > It would be nice to be able to do that, I agree. The new ramdisk
> > code will be able to flush the buffer cache and destroy its data
> > separately, so it can actually be implemented.
>
> So the practical problem are peoples legacy boot setups but those
> are quickly going away.

After that, is the ramdisk useful for anything aside from testing?


> The sane thing is probably something that can be taken as a low
> level format command for the block device.
>
> Say: dd if=/dev/zero of=/dev/ramX

We have 2 problems. First is that, for testing/consistency, we
don't want BLKFLSBUF to throw out the data. Maybe hardly anything
uses BLKFLSBUF now, so it could be just a minor problem, but still
one to fix.

Second is actually throwing out the ramdisk data. dd from /dev/zero
isn't trivial because it isn't a "command" from the kernel's POV.
rd could examine the writes to see if they are zero and page aligned,
I suppose... but if you're transitioning everyone over to a new
method anyway, might as well make it a nice one ;)
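
The check itself would be trivial (sketch only, assuming the page is
already kmapped):

	static int page_all_zeroes(const unsigned long *p)
	{
		unsigned int i;

		for (i = 0; i < PAGE_SIZE / sizeof(*p); i++)
			if (p[i])
				return 0;	/* data present, keep the page */
		return 1;			/* page can be dropped */
	}

The awkward part isn't the scan, it's wiring it into the write path.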


> I know rewriting the drive with all zeroes can cause a modern
> disk to redo it's low level format.  And that is something
> we can definitely implement without any backwards compatibility
> problems.
>
> Hmm. Do we have anything special for punching holes in files?
> That would be another sane route to take to remove the special
> case for clearing the memory.

truncate_range, I suppose. A file descriptor syscall based
alternative for madvise would be nice though (like fallocate).

We could always put something in /sys/block/ram*/xxx
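
For comparison, hole punching from userspace currently looks something
like this (sketch; MADV_REMOVE only works on tmpfs-style mappings at
the moment, I believe):

	#include <sys/mman.h>

	/* punch a hole over len bytes of an mmap()ed region */
	static int punch_hole(void *addr, size_t len)
	{
		return madvise(addr, len, MADV_REMOVE);
	}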
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

