Re: [PATCH] nfs: fix congestion control -v4

2007-02-20 Thread Peter Zijlstra
On Fri, 2007-01-26 at 00:51 -0800, Andrew Morton wrote:
> On Fri, 26 Jan 2007 09:03:37 +0100
> Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, 2007-01-25 at 22:04 -0800, Andrew Morton wrote:
> > > On Thu, 25 Jan 2007 21:31:43 -0800 (PST)
> > > Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > > 
> > > > On Thu, 25 Jan 2007, Andrew Morton wrote:
> > > > 
> > > > > atomic_t is 32-bit.  Put 16TB of memory under writeback and blam.
> > > > 
> > > > We have systems with 8TB main memory and are able to get to 16TB.
> > > 
> > > But I bet you don't use 4k pages on 'em ;)
> > > 
> > > > Better change it now.
> > > 
> > > yup.
> > 
> > I can change to atomic_long_t but that would make this patch depend on
> > Mathieu Desnoyers' atomic.h patch series.

^^^

> > Do I send out a -v5 with this, or should I send an incremental patch
> > once that hits your tree?
> 
> A patch against next -mm would suit, thanks.


As promised change the atomic_t in struct nfs_server to atomic_long_t in
anticipation of machines that can handle 8+TB of (4K) pages under writeback.

However I suspect other things in NFS will start going *bang* by then.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/nfs/write.c|4 ++--
 include/linux/nfs_fs_sb.h |2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -224,7 +224,7 @@ static void nfs_set_page_writeback(struc
struct inode *inode = page->mapping->host;
struct nfs_server *nfss = NFS_SERVER(inode);
 
-   if (atomic_inc_return(&nfss->writeback) >
+   if (atomic_long_inc_return(&nfss->writeback) >
NFS_CONGESTION_ON_THRESH)
set_bdi_congested(&nfss->backing_dev_info, WRITE);
}
@@ -236,7 +236,7 @@ static void nfs_end_page_writeback(struc
struct nfs_server *nfss = NFS_SERVER(inode);
 
end_page_writeback(page);
-   if (atomic_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+   if (atomic_long_dec_return(&nfss->writeback) < 
NFS_CONGESTION_OFF_THRESH) {
clear_bdi_congested(&nfss->backing_dev_info, WRITE);
congestion_end(WRITE);
}
Index: linux-2.6/include/linux/nfs_fs_sb.h
===
--- linux-2.6.orig/include/linux/nfs_fs_sb.h
+++ linux-2.6/include/linux/nfs_fs_sb.h
@@ -82,7 +82,7 @@ struct nfs_server {
struct rpc_clnt *   client_acl; /* ACL RPC client handle */
struct nfs_iostats *io_stats;   /* I/O statistics */
struct backing_dev_info backing_dev_info;
-   atomic_twriteback;  /* number of writeback pages */
+   atomic_long_t   writeback;  /* number of writeback pages */
int flags;  /* various flags */
unsigned intcaps;   /* server capabilities */
unsigned intrsize;  /* read size */


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v4

2007-01-26 Thread Peter Zijlstra
On Fri, 2007-01-26 at 00:51 -0800, Andrew Morton wrote:

> A patch against next -mm would suit, thanks.
> 
> (But we already use atomic_long_t in generic code?)

but there is currently no atomic_long_{inc,dec}_return, or any
atomic_long_*_return function for that matter.

Mathieu adds these missing functions.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v4

2007-01-26 Thread Andrew Morton
On Fri, 26 Jan 2007 09:03:37 +0100
Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-01-25 at 22:04 -0800, Andrew Morton wrote:
> > On Thu, 25 Jan 2007 21:31:43 -0800 (PST)
> > Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > 
> > > On Thu, 25 Jan 2007, Andrew Morton wrote:
> > > 
> > > > atomic_t is 32-bit.  Put 16TB of memory under writeback and blam.
> > > 
> > > We have systems with 8TB main memory and are able to get to 16TB.
> > 
> > But I bet you don't use 4k pages on 'em ;)
> > 
> > > Better change it now.
> > 
> > yup.
> 
> I can change to atomic_long_t but that would make this patch depend on
> Mathieu Desnoyers' atomic.h patch series.
> 
> Do I send out a -v5 with this, or should I send an incremental patch
> once that hits your tree?

A patch against next -mm would suit, thanks.

(But we already use atomic_long_t in generic code?)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v4

2007-01-26 Thread Peter Zijlstra
On Fri, 2007-01-26 at 09:00 +0100, Peter Zijlstra wrote:
> On Thu, 2007-01-25 at 21:02 -0800, Andrew Morton wrote:
> > On Thu, 25 Jan 2007 16:32:28 +0100
> > Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > > +long congestion_wait_interruptible(int rw, long timeout)
> > > +{
> > > + long ret;
> > > + DEFINE_WAIT(wait);
> > > + wait_queue_head_t *wqh = &congestion_wqh[rw];
> > > +
> > > + prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> > > + if (signal_pending(current))
> > > + ret = -ERESTARTSYS;
> > > + else
> > > + ret = io_schedule_timeout(timeout);
> > > + finish_wait(wqh, &wait);
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL(congestion_wait_interruptible);
> > 
> > I think this can share code with congestion_wait()?
> > 
> > static long __congestion_wait(int rw, long timeout, int state)
> > {
> > long ret;
> > DEFINE_WAIT(wait);
> > wait_queue_head_t *wqh = &congestion_wqh[rw];
> > 
> > prepare_to_wait(wqh, &wait, state);
> > ret = io_schedule_timeout(timeout);
> > finish_wait(wqh, &wait);
> > return ret;
> > }
> > 
> > long congestion_wait_interruptible(int rw, long timeout)
> > {
> > long ret = __congestion_wait(rw, timeout);
> > 
> > if (signal_pending(current))
> > ret = -ERESTARTSYS;
> > return ret;
> > }
> > 
> > it's only infinitesimally less efficient..
> 
> All the other _interruptible functions check signal_pending before
> calling schedule. Which seems to make sense since its called in a loop
> anyway, and if the loop condition turns false when interrupted you might
> as well just finish up instead of bailing out.
> 
> However if you'd rather see your version, who am I to object ;-)

ok, first wake up, then reply to emails :-)

How about this:

long congestion_wait_interruptible(int rw, long timeout)
{
long ret;

if (signal_pending(current))
ret = -ERESTARTSYS;
else
ret = congestion_wait(rw, timeout);
return ret;
}
EXPORT_SYMBOL(congestion_wait_interruptible);


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v4

2007-01-26 Thread Peter Zijlstra
On Thu, 2007-01-25 at 22:04 -0800, Andrew Morton wrote:
> On Thu, 25 Jan 2007 21:31:43 -0800 (PST)
> Christoph Lameter <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, 25 Jan 2007, Andrew Morton wrote:
> > 
> > > atomic_t is 32-bit.  Put 16TB of memory under writeback and blam.
> > 
> > We have systems with 8TB main memory and are able to get to 16TB.
> 
> But I bet you don't use 4k pages on 'em ;)
> 
> > Better change it now.
> 
> yup.

I can change to atomic_long_t but that would make this patch depend on
Mathieu Desnoyers' atomic.h patch series.

Do I send out a -v5 with this, or should I send an incremental patch
once that hits your tree?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v4

2007-01-26 Thread Peter Zijlstra
On Thu, 2007-01-25 at 21:02 -0800, Andrew Morton wrote:
> On Thu, 25 Jan 2007 16:32:28 +0100
> Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > +long congestion_wait_interruptible(int rw, long timeout)
> > +{
> > +   long ret;
> > +   DEFINE_WAIT(wait);
> > +   wait_queue_head_t *wqh = &congestion_wqh[rw];
> > +
> > +   prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> > +   if (signal_pending(current))
> > +   ret = -ERESTARTSYS;
> > +   else
> > +   ret = io_schedule_timeout(timeout);
> > +   finish_wait(wqh, &wait);
> > +   return ret;
> > +}
> > +EXPORT_SYMBOL(congestion_wait_interruptible);
> 
> I think this can share code with congestion_wait()?
> 
> static long __congestion_wait(int rw, long timeout, int state)
> {
>   long ret;
>   DEFINE_WAIT(wait);
>   wait_queue_head_t *wqh = &congestion_wqh[rw];
> 
>   prepare_to_wait(wqh, &wait, state);
>   ret = io_schedule_timeout(timeout);
>   finish_wait(wqh, &wait);
>   return ret;
> }
> 
> long congestion_wait_interruptible(int rw, long timeout)
> {
>   long ret = __congestion_wait(rw, timeout);
> 
>   if (signal_pending(current))
>   ret = -ERESTARTSYS;
>   return ret;
> }
> 
> it's only infinitesimally less efficient..

All the other _interruptible functions check signal_pending before
calling schedule. Which seems to make sense since its called in a loop
anyway, and if the loop condition turns false when interrupted you might
as well just finish up instead of bailing out.

However if you'd rather see your version, who am I to object ;-)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v4

2007-01-25 Thread Christoph Lameter
On Thu, 25 Jan 2007, Andrew Morton wrote:

> > We have systems with 8TB main memory and are able to get to 16TB.
> 
> But I bet you don't use 4k pages on 'em ;)

IA64 can be configured for 4k pagesize but yes 16k is the default. There 
are plans to go much higher though. Plus there may be other reasons that 
will force us to 4k pagesize on some configurations.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v4

2007-01-25 Thread Andrew Morton
On Thu, 25 Jan 2007 21:31:43 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Thu, 25 Jan 2007, Andrew Morton wrote:
> 
> > atomic_t is 32-bit.  Put 16TB of memory under writeback and blam.
> 
> We have systems with 8TB main memory and are able to get to 16TB.

But I bet you don't use 4k pages on 'em ;)

> Better change it now.

yup.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v4

2007-01-25 Thread Christoph Lameter
On Thu, 25 Jan 2007, Andrew Morton wrote:

> atomic_t is 32-bit.  Put 16TB of memory under writeback and blam.

We have systems with 8TB main memory and are able to get to 16TB.
Better change it now.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v4

2007-01-25 Thread Andrew Morton
On Thu, 25 Jan 2007 16:32:28 +0100
Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> Hopefully the last version ;-)
> 
> 
> ---
> Subject: nfs: fix congestion control
> 
> The current NFS client congestion logic is severely broken, it marks the 
> backing
> device congested during each nfs_writepages() call but doesn't mirror this in
> nfs_writepage() which makes for deadlocks. Also it implements its own 
> waitqueue.
> 
> Replace this by a more regular congestion implementation that puts a cap on 
> the
> number of active writeback pages and uses the bdi congestion waitqueue.
> 
> Also always use an interruptible wait since it makes sense to be able to 
> SIGKILL the process even for mounts without 'intr'.
> 
> ..
>
> --- linux-2.6-git.orig/include/linux/nfs_fs_sb.h  2007-01-25 
> 16:07:03.0 +0100
> +++ linux-2.6-git/include/linux/nfs_fs_sb.h   2007-01-25 16:07:12.0 
> +0100
> @@ -82,6 +82,7 @@ struct nfs_server {
>   struct rpc_clnt *   client_acl; /* ACL RPC client handle */
>   struct nfs_iostats *io_stats;   /* I/O statistics */
>   struct backing_dev_info backing_dev_info;
> + atomic_twriteback;  /* number of writeback pages */

We're going to get in trouble with this sort of thing within a few years. 
atomic_t is 32-bit.  Put 16TB of memory under writeback and blam.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v4

2007-01-25 Thread Andrew Morton
On Thu, 25 Jan 2007 16:32:28 +0100
Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> +long congestion_wait_interruptible(int rw, long timeout)
> +{
> + long ret;
> + DEFINE_WAIT(wait);
> + wait_queue_head_t *wqh = &congestion_wqh[rw];
> +
> + prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> + if (signal_pending(current))
> + ret = -ERESTARTSYS;
> + else
> + ret = io_schedule_timeout(timeout);
> + finish_wait(wqh, &wait);
> + return ret;
> +}
> +EXPORT_SYMBOL(congestion_wait_interruptible);

I think this can share code with congestion_wait()?

static long __congestion_wait(int rw, long timeout, int state)
{
long ret;
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[rw];

prepare_to_wait(wqh, &wait, state);
ret = io_schedule_timeout(timeout);
finish_wait(wqh, &wait);
return ret;
}

long congestion_wait_interruptible(int rw, long timeout)
{
long ret = __congestion_wait(rw, timeout);

if (signal_pending(current))
ret = -ERESTARTSYS;
return ret;
}

it's only infinitesimally less efficient..
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] nfs: fix congestion control -v4

2007-01-25 Thread Peter Zijlstra
Hopefully the last version ;-)


---
Subject: nfs: fix congestion control

The current NFS client congestion logic is severely broken, it marks the backing
device congested during each nfs_writepages() call but doesn't mirror this in
nfs_writepage() which makes for deadlocks. Also it implements its own waitqueue.

Replace this by a more regular congestion implementation that puts a cap on the
number of active writeback pages and uses the bdi congestion waitqueue.

Also always use an interruptible wait since it makes sense to be able to 
SIGKILL the process even for mounts without 'intr'.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Cc: Trond Myklebust <[EMAIL PROTECTED]>
---
 fs/nfs/super.c  |4 -
 fs/nfs/sysctl.c |8 +++
 fs/nfs/write.c  |  116 
 include/linux/backing-dev.h |1 
 include/linux/nfs_fs.h  |1 
 include/linux/nfs_fs_sb.h   |1 
 mm/backing-dev.c|   16 ++
 7 files changed, 103 insertions(+), 44 deletions(-)

Index: linux-2.6-git/fs/nfs/write.c
===
--- linux-2.6-git.orig/fs/nfs/write.c   2007-01-25 16:07:03.0 +0100
+++ linux-2.6-git/fs/nfs/write.c2007-01-25 16:10:48.0 +0100
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -78,7 +79,7 @@ static struct nfs_page * nfs_update_requ
struct page *,
unsigned int, unsigned int);
 static void nfs_mark_request_dirty(struct nfs_page *req);
-static int nfs_wait_on_write_congestion(struct address_space *, int);
+static int nfs_wait_on_write_congestion(struct address_space *);
 static int nfs_wait_on_requests(struct inode *, unsigned long, unsigned int);
 static long nfs_flush_mapping(struct address_space *mapping, struct 
writeback_control *wbc, int how);
 static const struct rpc_call_ops nfs_write_partial_ops;
@@ -89,8 +90,6 @@ static struct kmem_cache *nfs_wdata_cach
 static mempool_t *nfs_wdata_mempool;
 static mempool_t *nfs_commit_mempool;
 
-static DECLARE_WAIT_QUEUE_HEAD(nfs_write_congestion);
-
 struct nfs_write_data *nfs_commit_alloc(void)
 {
struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
@@ -245,6 +244,39 @@ static int wb_priority(struct writeback_
 }
 
 /*
+ * NFS congestion control
+ */
+
+int nfs_congestion_kb;
+
+#define NFS_CONGESTION_ON_THRESH   (nfs_congestion_kb >> (PAGE_SHIFT-10))
+#define NFS_CONGESTION_OFF_THRESH  \
+   (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+
+static inline void nfs_set_page_writeback(struct page *page)
+{
+   if (!test_set_page_writeback(page)) {
+   struct inode *inode = page->mapping->host;
+   struct nfs_server *nfss = NFS_SERVER(inode);
+
+   if (atomic_inc_return(&nfss->writeback) > 
NFS_CONGESTION_ON_THRESH)
+   set_bdi_congested(&nfss->backing_dev_info, WRITE);
+   }
+}
+
+static inline void nfs_end_page_writeback(struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   struct nfs_server *nfss = NFS_SERVER(inode);
+
+   end_page_writeback(page);
+   if (atomic_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+   clear_bdi_congested(&nfss->backing_dev_info, WRITE);
+   congestion_end(WRITE);
+   }
+}
+
+/*
  * Find an associated nfs write request, and prepare to flush it out
  * Returns 1 if there was no write request, or if the request was
  * already tagged by nfs_set_page_dirty.Returns 0 if the request
@@ -281,7 +313,7 @@ static int nfs_page_mark_flush(struct pa
spin_unlock(req_lock);
if (test_and_set_bit(PG_FLUSHING, &req->wb_flags) == 0) {
nfs_mark_request_dirty(req);
-   set_page_writeback(page);
+   nfs_set_page_writeback(page);
}
ret = test_bit(PG_NEED_FLUSH, &req->wb_flags);
nfs_unlock_request(req);
@@ -336,13 +368,8 @@ int nfs_writepage(struct page *page, str
return err; 
 }
 
-/*
- * Note: causes nfs_update_request() to block on the assumption
- *  that the writeback is generated due to memory pressure.
- */
 int nfs_writepages(struct address_space *mapping, struct writeback_control 
*wbc)
 {
-   struct backing_dev_info *bdi = mapping->backing_dev_info;
struct inode *inode = mapping->host;
int err;
 
@@ -351,11 +378,6 @@ int nfs_writepages(struct address_space 
err = generic_writepages(mapping, wbc);
if (err)
return err;
-   while (test_and_set_bit(BDI_write_congested, &bdi->state) != 0) {
-   if (wbc->nonblocking)
-   return 0;
-   nfs_wait_on_write_congestion(mapping, 0);
-   }
err = nfs_flush_mapping(mapping, wbc, wb_priority(wbc));
if (err < 0)
g

Re: [PATCH] nfs: fix congestion control -v3

2007-01-22 Thread Christoph Lameter
On Sat, 20 Jan 2007, Peter Zijlstra wrote:

> Subject: nfs: fix congestion control

I am not sure if its too valuable since I have limited experience with NFS 
but it looks fine to me.

Acked-by: Christoph Lameter <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control -v3

2007-01-22 Thread Trond Myklebust
On Sat, 2007-01-20 at 08:01 +0100, Peter Zijlstra wrote:
> Subject: nfs: fix congestion control
> 
> The current NFS client congestion logic is severely broken, it marks the 
> backing
> device congested during each nfs_writepages() call but doesn't mirror this in
> nfs_writepage() which makes for deadlocks. Also it implements its own 
> waitqueue.
> 
> Replace this by a more regular congestion implementation that puts a cap on 
> the
> number of active writeback pages and uses the bdi congestion waitqueue.
> 
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> Cc: Trond Myklebust <[EMAIL PROTECTED]>
> ---
>  fs/nfs/super.c  |4 -
>  fs/nfs/sysctl.c |8 +++
>  fs/nfs/write.c  |  116 
> 
>  include/linux/backing-dev.h |1 
>  include/linux/nfs_fs.h  |1 
>  include/linux/nfs_fs_sb.h   |1 
>  mm/backing-dev.c|   16 ++
>  7 files changed, 104 insertions(+), 43 deletions(-)
> 
> Index: linux-2.6-git/fs/nfs/write.c
> ===
> --- linux-2.6-git.orig/fs/nfs/write.c 2007-01-20 07:20:10.0 +0100
> +++ linux-2.6-git/fs/nfs/write.c  2007-01-20 07:20:12.0 +0100
> @@ -89,8 +89,6 @@ static struct kmem_cache *nfs_wdata_cach
>  static mempool_t *nfs_wdata_mempool;
>  static mempool_t *nfs_commit_mempool;
>  
> -static DECLARE_WAIT_QUEUE_HEAD(nfs_write_congestion);
> -
>  struct nfs_write_data *nfs_commit_alloc(void)
>  {
>   struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
> @@ -245,6 +243,39 @@ static int wb_priority(struct writeback_
>  }
>  
>  /*
> + * NFS congestion control
> + */
> +
> +int nfs_congestion_kb;
> +
> +#define NFS_CONGESTION_ON_THRESH (nfs_congestion_kb >> (PAGE_SHIFT-10))
> +#define NFS_CONGESTION_OFF_THRESH\
> + (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
> +
> +static inline void nfs_set_page_writeback(struct page *page)
> +{
> + if (!test_set_page_writeback(page)) {
> + struct inode *inode = page->mapping->host;
> + struct nfs_server *nfss = NFS_SERVER(inode);
> +
> + if (atomic_inc_return(&nfss->writeback) > 
> NFS_CONGESTION_ON_THRESH)
> + set_bdi_congested(&nfss->backing_dev_info, WRITE);
> + }
> +}
> +
> +static inline void nfs_end_page_writeback(struct page *page)
> +{
> + struct inode *inode = page->mapping->host;
> + struct nfs_server *nfss = NFS_SERVER(inode);
> +
> + end_page_writeback(page);
> + if (atomic_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
> + clear_bdi_congested(&nfss->backing_dev_info, WRITE);
> + congestion_end(WRITE);
> + }
> +}
> +
> +/*
>   * Find an associated nfs write request, and prepare to flush it out
>   * Returns 1 if there was no write request, or if the request was
>   * already tagged by nfs_set_page_dirty.Returns 0 if the request
> @@ -281,7 +312,7 @@ static int nfs_page_mark_flush(struct pa
>   spin_unlock(req_lock);
>   if (test_and_set_bit(PG_FLUSHING, &req->wb_flags) == 0) {
>   nfs_mark_request_dirty(req);
> - set_page_writeback(page);
> + nfs_set_page_writeback(page);
>   }
>   ret = test_bit(PG_NEED_FLUSH, &req->wb_flags);
>   nfs_unlock_request(req);
> @@ -336,13 +367,8 @@ int nfs_writepage(struct page *page, str
>   return err; 
>  }
>  
> -/*
> - * Note: causes nfs_update_request() to block on the assumption
> - *that the writeback is generated due to memory pressure.
> - */
>  int nfs_writepages(struct address_space *mapping, struct writeback_control 
> *wbc)
>  {
> - struct backing_dev_info *bdi = mapping->backing_dev_info;
>   struct inode *inode = mapping->host;
>   int err;
>  
> @@ -351,11 +377,6 @@ int nfs_writepages(struct address_space 
>   err = generic_writepages(mapping, wbc);
>   if (err)
>   return err;
> - while (test_and_set_bit(BDI_write_congested, &bdi->state) != 0) {
> - if (wbc->nonblocking)
> - return 0;
> - nfs_wait_on_write_congestion(mapping, 0);
> - }
>   err = nfs_flush_mapping(mapping, wbc, wb_priority(wbc));
>   if (err < 0)
>   goto out;
> @@ -369,9 +390,6 @@ int nfs_writepages(struct address_space 
>   if (err > 0)
>   err = 0;
>  out:
> - clear_bit(BDI_write_congested, &bdi->state);
> - wake_up_all(&nfs_write_congestion);
> - congestion_end(WRITE);
>   return err;
>  }
>  
> @@ -401,7 +419,7 @@ static int nfs_inode_add_request(struct 
>  }
>  
>  /*
> - * Insert a write request into an inode
> + * Remove a write request from an inode
>   */
>  static void nfs_inode_remove_request(struct nfs_page *req)
>  {
> @@ -585,8 +603,8 @@ static inline int nfs_scan_commit(struct
>  
>  static int nfs_wait_on_write_congestion(struct address_space *mapping, int 
> int

[PATCH] nfs: fix congestion control -v3

2007-01-19 Thread Peter Zijlstra
Subject: nfs: fix congestion control

The current NFS client congestion logic is severely broken, it marks the backing
device congested during each nfs_writepages() call but doesn't mirror this in
nfs_writepage() which makes for deadlocks. Also it implements its own waitqueue.

Replace this by a more regular congestion implementation that puts a cap on the
number of active writeback pages and uses the bdi congestion waitqueue.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Cc: Trond Myklebust <[EMAIL PROTECTED]>
---
 fs/nfs/super.c  |4 -
 fs/nfs/sysctl.c |8 +++
 fs/nfs/write.c  |  116 
 include/linux/backing-dev.h |1 
 include/linux/nfs_fs.h  |1 
 include/linux/nfs_fs_sb.h   |1 
 mm/backing-dev.c|   16 ++
 7 files changed, 104 insertions(+), 43 deletions(-)

Index: linux-2.6-git/fs/nfs/write.c
===
--- linux-2.6-git.orig/fs/nfs/write.c   2007-01-20 07:20:10.0 +0100
+++ linux-2.6-git/fs/nfs/write.c2007-01-20 07:20:12.0 +0100
@@ -89,8 +89,6 @@ static struct kmem_cache *nfs_wdata_cach
 static mempool_t *nfs_wdata_mempool;
 static mempool_t *nfs_commit_mempool;
 
-static DECLARE_WAIT_QUEUE_HEAD(nfs_write_congestion);
-
 struct nfs_write_data *nfs_commit_alloc(void)
 {
struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
@@ -245,6 +243,39 @@ static int wb_priority(struct writeback_
 }
 
 /*
+ * NFS congestion control
+ */
+
+int nfs_congestion_kb;
+
+#define NFS_CONGESTION_ON_THRESH   (nfs_congestion_kb >> (PAGE_SHIFT-10))
+#define NFS_CONGESTION_OFF_THRESH  \
+   (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+
+static inline void nfs_set_page_writeback(struct page *page)
+{
+   if (!test_set_page_writeback(page)) {
+   struct inode *inode = page->mapping->host;
+   struct nfs_server *nfss = NFS_SERVER(inode);
+
+   if (atomic_inc_return(&nfss->writeback) > 
NFS_CONGESTION_ON_THRESH)
+   set_bdi_congested(&nfss->backing_dev_info, WRITE);
+   }
+}
+
+static inline void nfs_end_page_writeback(struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   struct nfs_server *nfss = NFS_SERVER(inode);
+
+   end_page_writeback(page);
+   if (atomic_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+   clear_bdi_congested(&nfss->backing_dev_info, WRITE);
+   congestion_end(WRITE);
+   }
+}
+
+/*
  * Find an associated nfs write request, and prepare to flush it out
  * Returns 1 if there was no write request, or if the request was
  * already tagged by nfs_set_page_dirty.Returns 0 if the request
@@ -281,7 +312,7 @@ static int nfs_page_mark_flush(struct pa
spin_unlock(req_lock);
if (test_and_set_bit(PG_FLUSHING, &req->wb_flags) == 0) {
nfs_mark_request_dirty(req);
-   set_page_writeback(page);
+   nfs_set_page_writeback(page);
}
ret = test_bit(PG_NEED_FLUSH, &req->wb_flags);
nfs_unlock_request(req);
@@ -336,13 +367,8 @@ int nfs_writepage(struct page *page, str
return err; 
 }
 
-/*
- * Note: causes nfs_update_request() to block on the assumption
- *  that the writeback is generated due to memory pressure.
- */
 int nfs_writepages(struct address_space *mapping, struct writeback_control 
*wbc)
 {
-   struct backing_dev_info *bdi = mapping->backing_dev_info;
struct inode *inode = mapping->host;
int err;
 
@@ -351,11 +377,6 @@ int nfs_writepages(struct address_space 
err = generic_writepages(mapping, wbc);
if (err)
return err;
-   while (test_and_set_bit(BDI_write_congested, &bdi->state) != 0) {
-   if (wbc->nonblocking)
-   return 0;
-   nfs_wait_on_write_congestion(mapping, 0);
-   }
err = nfs_flush_mapping(mapping, wbc, wb_priority(wbc));
if (err < 0)
goto out;
@@ -369,9 +390,6 @@ int nfs_writepages(struct address_space 
if (err > 0)
err = 0;
 out:
-   clear_bit(BDI_write_congested, &bdi->state);
-   wake_up_all(&nfs_write_congestion);
-   congestion_end(WRITE);
return err;
 }
 
@@ -401,7 +419,7 @@ static int nfs_inode_add_request(struct 
 }
 
 /*
- * Insert a write request into an inode
+ * Remove a write request from an inode
  */
 static void nfs_inode_remove_request(struct nfs_page *req)
 {
@@ -585,8 +603,8 @@ static inline int nfs_scan_commit(struct
 
 static int nfs_wait_on_write_congestion(struct address_space *mapping, int 
intr)
 {
+   struct inode *inode = mapping->host;
struct backing_dev_info *bdi = mapping->backing_dev_info;
-   DEFINE_WAIT(wait);
int ret = 0;
 
might_sleep();
@@ -594,31 +612,27 @@ static int nfs_wait_on_writ

Re: [PATCH] nfs: fix congestion control

2007-01-19 Thread Christoph Lameter
On Fri, 19 Jan 2007, Trond Myklebust wrote:

> That would be good as a default, but I've been thinking that we could
> perhaps also add a sysctl in /proc/sys/fs/nfs in order to make it a
> tunable?

Good idea.
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-19 Thread Trond Myklebust
On Fri, 2007-01-19 at 18:57 +0100, Peter Zijlstra wrote:
> On Fri, 2007-01-19 at 09:20 -0800, Christoph Lameter wrote:
> > On Fri, 19 Jan 2007, Peter Zijlstra wrote:
> > 
> > > + /*
> > > +  * NFS congestion size, scale with available memory.
> > > +  *
> > 
> > Well this all depends on the memory available to the running process.
> > If the process is just allowed to allocate from a subset of memory 
> > (cpusets) then this may need to be lower.
> > 
> > > +  *  64MB:8192k
> > > +  * 128MB:   11585k
> > > +  * 256MB:   16384k
> > > +  * 512MB:   23170k
> > > +  *   1GB:   32768k
> > > +  *   2GB:   46340k
> > > +  *   4GB:   65536k
> > > +  *   8GB:   92681k
> > > +  *  16GB:  131072k
> > 
> > Hmmm... lets say we have the worst case of an 8TB IA64 system with 1k 
> > nodes of 8G each.
> 
> Eeuh, right. Glad to have you around to remind how puny my boxens
> are :-)
> 
> >  On Ia64 the number of pages is 8TB/16KB pagesize = 512 
> > million pages. Thus nfs_congestion_size is 724064 pages which is 
> > 11.1Gbytes?
> > 
> > If we now restrict a cpuset to a single node then have a 
> > nfs_congestion_size of 11.1G vs an available memory on a node of 8G.
> 
> Right, perhaps cap this to a max of 256M. That would allow 128 2M RPC
> transfers, much more would not be needed I guess. Trond?

That would be good as a default, but I've been thinking that we could
perhaps also add a sysctl in /proc/sys/fs/nfs in order to make it a
tunable?

Cheers,
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-19 Thread Christoph Lameter
On Fri, 19 Jan 2007, Peter Zijlstra wrote:

> Eeuh, right. Glad to have you around to remind how puny my boxens
> are :-)

Sorry about that but it was unavoidable if we want to get to reasonable 
limits that will work in all situations.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-19 Thread Peter Zijlstra
On Fri, 2007-01-19 at 09:20 -0800, Christoph Lameter wrote:
> On Fri, 19 Jan 2007, Peter Zijlstra wrote:
> 
> > +   /*
> > +* NFS congestion size, scale with available memory.
> > +*
> 
> Well this all depends on the memory available to the running process.
> If the process is just allowed to allocate from a subset of memory 
> (cpusets) then this may need to be lower.
> 
> > +*  64MB:8192k
> > +* 128MB:   11585k
> > +* 256MB:   16384k
> > +* 512MB:   23170k
> > +*   1GB:   32768k
> > +*   2GB:   46340k
> > +*   4GB:   65536k
> > +*   8GB:   92681k
> > +*  16GB:  131072k
> 
> Hmmm... lets say we have the worst case of an 8TB IA64 system with 1k 
> nodes of 8G each.

Eeuh, right. Glad to have you around to remind how puny my boxens
are :-)

>  On Ia64 the number of pages is 8TB/16KB pagesize = 512 
> million pages. Thus nfs_congestion_size is 724064 pages which is 
> 11.1Gbytes?
> 
> If we now restrict a cpuset to a single node then have a 
> nfs_congestion_size of 11.1G vs an available memory on a node of 8G.

Right, perhaps cap this to a max of 256M. That would allow 128 2M RPC
transfers, much more would not be needed I guess. Trond?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-19 Thread Peter Zijlstra
On Fri, 2007-01-19 at 11:51 -0500, Trond Myklebust wrote:

> > So with that out of the way I now have this
> 
> Looks much better. Just one obvious buglet...

> > @@ -1565,6 +1579,23 @@ int __init nfs_init_writepagecache(void)
> > if (nfs_commit_mempool == NULL)
> > return -ENOMEM;
> >  
> > +   /*
> > +* NFS congestion size, scale with available memory.
> > +*
> > +*  64MB:8192k
> > +* 128MB:   11585k
> > +* 256MB:   16384k
> > +* 512MB:   23170k
> > +*   1GB:   32768k
> > +*   2GB:   46340k
> > +*   4GB:   65536k
> > +*   8GB:   92681k
> > +*  16GB:  131072k
> > +*
> > +* This allows larger machines to have larger/more transfers.
> > +*/
> > +   nfs_congestion_size = 32*int_sqrt(totalram_pages);
> > +
>   ^^^ nfs_congestion_pages?

Ah, yeah, forgot to refresh the patch one last time before sending
out :-(.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-19 Thread Christoph Lameter
On Fri, 19 Jan 2007, Peter Zijlstra wrote:

> + /*
> +  * NFS congestion size, scale with available memory.
> +  *

Well this all depends on the memory available to the running process.
If the process is just allowed to allocate from a subset of memory 
(cpusets) then this may need to be lower.

> +  *  64MB:8192k
> +  * 128MB:   11585k
> +  * 256MB:   16384k
> +  * 512MB:   23170k
> +  *   1GB:   32768k
> +  *   2GB:   46340k
> +  *   4GB:   65536k
> +  *   8GB:   92681k
> +  *  16GB:  131072k

Hmmm... lets say we have the worst case of an 8TB IA64 system with 1k 
nodes of 8G each. On Ia64 the number of pages is 8TB/16KB pagesize = 512 
million pages. Thus nfs_congestion_size is 724064 pages which is 
11.1Gbytes?

If we now restrict a cpuset to a single node then have a 
nfs_congestion_size of 11.1G vs an available memory on a node of 8G.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-19 Thread Trond Myklebust
On Fri, 2007-01-19 at 14:07 +0100, Peter Zijlstra wrote:
> On Fri, 2007-01-19 at 10:34 +0100, Peter Zijlstra wrote:
> > On Thu, 2007-01-18 at 10:49 -0500, Trond Myklebust wrote:
> > 
> > > After the dirty page has been written to unstable storage, it marks the
> > > inode using I_DIRTY_DATASYNC, which should then ensure that the VFS
> > > calls write_inode() on the next pass through __sync_single_inode.
> > 
> > > I'd rather like to see fs/fs-writeback.c do this correctly (assuming
> > > that it is incorrect now).
> > 
> > balance_dirty_pages()
> >   wbc.sync_mode = WB_SYNC_NONE;
> >   writeback_inodes()
> > sync_sb_inodes()
> >   __writeback_single_inode()
> > __sync_single_inode()
> >   write_inode()
> > nfs_write_inode()
> > 
> > Ah, yes, I see. That ought to work.
> > 
> > /me goes verify he didn't mess it up himself...
> 
> And of course I did. This seems to work.
> 
> The problem I had was that throttle_vm_writeout() got stuck on commit
> pages. But that can be solved with a much much simpler and better
> solution. Force all swap traffic to be |FLUSH_STABLE, unstable swap
> space doesn't make sense anyway :-)

Yep. Since we don't care too much about metadata updates on the swap
file (the file size should never change), we could perhaps make the
swapper write out the pages with DATA_SYNC rather than the full
FILE_SYNC.

> So with that out of the way I now have this

Looks much better. Just one obvious buglet...

> ---
> 
> Subject: nfs: fix congestion control
> 
> The current NFS client congestion logic is severely broken, it marks the 
> backing
> device congested during each nfs_writepages() call but doesn't mirror this in
> nfs_writepage() which makes for deadlocks. Also it implements its own 
> waitqueue.
> 
> Replace this by a more regular congestion implementation that puts a cap on 
> the
> number of active writeback pages and uses the bdi congestion waitqueue.
> 
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> Cc: Trond Myklebust <[EMAIL PROTECTED]>
> ---
>  fs/nfs/write.c  |  113 
> 
>  include/linux/backing-dev.h |1 
>  include/linux/nfs_fs_sb.h   |1 
>  mm/backing-dev.c|   16 ++
>  4 files changed, 90 insertions(+), 41 deletions(-)
> 
> Index: linux-2.6-git/fs/nfs/write.c
> ===
> --- linux-2.6-git.orig/fs/nfs/write.c 2007-01-19 13:57:49.0 +0100
> +++ linux-2.6-git/fs/nfs/write.c  2007-01-19 14:03:30.0 +0100
> @@ -89,8 +89,6 @@ static struct kmem_cache *nfs_wdata_cach
>  static mempool_t *nfs_wdata_mempool;
>  static mempool_t *nfs_commit_mempool;
>  
> -static DECLARE_WAIT_QUEUE_HEAD(nfs_write_congestion);
> -
>  struct nfs_write_data *nfs_commit_alloc(void)
>  {
>   struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
> @@ -245,6 +243,39 @@ static int wb_priority(struct writeback_
>  }
>  
>  /*
> + * NFS congestion control
> + */
> +
> +static unsigned long nfs_congestion_pages;
> +
> +#define NFS_CONGESTION_ON_THRESH nfs_congestion_pages
> +#define NFS_CONGESTION_OFF_THRESH\
> + (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
> +
> +static inline void nfs_set_page_writeback(struct page *page)
> +{
> + if (!test_set_page_writeback(page)) {
> + struct inode *inode = page->mapping->host;
> + struct nfs_server *nfss = NFS_SERVER(inode);
> +
> + if (atomic_inc_return(&nfss->writeback) > 
> NFS_CONGESTION_ON_THRESH)
> + set_bdi_congested(&nfss->backing_dev_info, WRITE);
> + }
> +}
> +
> +static inline void nfs_end_page_writeback(struct page *page)
> +{
> + struct inode *inode = page->mapping->host;
> + struct nfs_server *nfss = NFS_SERVER(inode);
> +
> + end_page_writeback(page);
> + if (atomic_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
> + clear_bdi_congested(&nfss->backing_dev_info, WRITE);
> + congestion_end(WRITE);
> + }
> +}
> +
> +/*
>   * Find an associated nfs write request, and prepare to flush it out
>   * Returns 1 if there was no write request, or if the request was
>   * already tagged by nfs_set_page_dirty.Returns 0 if the request
> @@ -281,7 +312,7 @@ static int nfs_page_mark_flush(struct pa
>   spin_unlock(req_lock);
>   if (test_and_set_bit(PG_FLUSHING, &req->wb_flags) == 0) {
>   nfs_mark_request_dirty(req);
> - set_page_writeback(page);
> + nfs_set_page_writeback(page);
>   }
>   ret = test_bit(PG_NEED_FLUSH, &req->wb_flags);
>   nfs_unlock_request(req);
> @@ -336,13 +367,8 @@ int nfs_writepage(struct page *page, str
>   return err; 
>  }
>  
> -/*
> - * Note: causes nfs_update_request() to block on the assumption
> - *that the writeback is generated due to memory pressure.
> - */
>  int nfs_writepages(struct address_space *mapping, struct w

Re: [PATCH] nfs: fix congestion control

2007-01-19 Thread Peter Zijlstra
On Fri, 2007-01-19 at 10:34 +0100, Peter Zijlstra wrote:
> On Thu, 2007-01-18 at 10:49 -0500, Trond Myklebust wrote:
> 
> > After the dirty page has been written to unstable storage, it marks the
> > inode using I_DIRTY_DATASYNC, which should then ensure that the VFS
> > calls write_inode() on the next pass through __sync_single_inode.
> 
> > I'd rather like to see fs/fs-writeback.c do this correctly (assuming
> > that it is incorrect now).
> 
> balance_dirty_pages()
>   wbc.sync_mode = WB_SYNC_NONE;
>   writeback_inodes()
> sync_sb_inodes()
>   __writeback_single_inode()
> __sync_single_inode()
>   write_inode()
> nfs_write_inode()
> 
> Ah, yes, I see. That ought to work.
> 
> /me goes verify he didn't mess it up himself...

And of course I did. This seems to work.

The problem I had was that throttle_vm_writeout() got stuck on commit
pages. But that can be solved with a much much simpler and better
solution. Force all swap traffic to be |FLUSH_STABLE, unstable swap
space doesn't make sense anyway :-)

So with that out of the way I now have this

---

Subject: nfs: fix congestion control

The current NFS client congestion logic is severely broken, it marks the backing
device congested during each nfs_writepages() call but doesn't mirror this in
nfs_writepage() which makes for deadlocks. Also it implements its own waitqueue.

Replace this by a more regular congestion implementation that puts a cap on the
number of active writeback pages and uses the bdi congestion waitqueue.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Cc: Trond Myklebust <[EMAIL PROTECTED]>
---
 fs/nfs/write.c  |  113 
 include/linux/backing-dev.h |1 
 include/linux/nfs_fs_sb.h   |1 
 mm/backing-dev.c|   16 ++
 4 files changed, 90 insertions(+), 41 deletions(-)

Index: linux-2.6-git/fs/nfs/write.c
===
--- linux-2.6-git.orig/fs/nfs/write.c   2007-01-19 13:57:49.0 +0100
+++ linux-2.6-git/fs/nfs/write.c2007-01-19 14:03:30.0 +0100
@@ -89,8 +89,6 @@ static struct kmem_cache *nfs_wdata_cach
 static mempool_t *nfs_wdata_mempool;
 static mempool_t *nfs_commit_mempool;
 
-static DECLARE_WAIT_QUEUE_HEAD(nfs_write_congestion);
-
 struct nfs_write_data *nfs_commit_alloc(void)
 {
struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
@@ -245,6 +243,39 @@ static int wb_priority(struct writeback_
 }
 
 /*
+ * NFS congestion control
+ */
+
+static unsigned long nfs_congestion_pages;
+
+#define NFS_CONGESTION_ON_THRESH   nfs_congestion_pages
+#define NFS_CONGESTION_OFF_THRESH  \
+   (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+
+static inline void nfs_set_page_writeback(struct page *page)
+{
+   if (!test_set_page_writeback(page)) {
+   struct inode *inode = page->mapping->host;
+   struct nfs_server *nfss = NFS_SERVER(inode);
+
+   if (atomic_inc_return(&nfss->writeback) > 
NFS_CONGESTION_ON_THRESH)
+   set_bdi_congested(&nfss->backing_dev_info, WRITE);
+   }
+}
+
+static inline void nfs_end_page_writeback(struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   struct nfs_server *nfss = NFS_SERVER(inode);
+
+   end_page_writeback(page);
+   if (atomic_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+   clear_bdi_congested(&nfss->backing_dev_info, WRITE);
+   congestion_end(WRITE);
+   }
+}
+
+/*
  * Find an associated nfs write request, and prepare to flush it out
  * Returns 1 if there was no write request, or if the request was
  * already tagged by nfs_set_page_dirty.Returns 0 if the request
@@ -281,7 +312,7 @@ static int nfs_page_mark_flush(struct pa
spin_unlock(req_lock);
if (test_and_set_bit(PG_FLUSHING, &req->wb_flags) == 0) {
nfs_mark_request_dirty(req);
-   set_page_writeback(page);
+   nfs_set_page_writeback(page);
}
ret = test_bit(PG_NEED_FLUSH, &req->wb_flags);
nfs_unlock_request(req);
@@ -336,13 +367,8 @@ int nfs_writepage(struct page *page, str
return err; 
 }
 
-/*
- * Note: causes nfs_update_request() to block on the assumption
- *  that the writeback is generated due to memory pressure.
- */
 int nfs_writepages(struct address_space *mapping, struct writeback_control 
*wbc)
 {
-   struct backing_dev_info *bdi = mapping->backing_dev_info;
struct inode *inode = mapping->host;
int err;
 
@@ -351,11 +377,6 @@ int nfs_writepages(struct address_space 
err = generic_writepages(mapping, wbc);
if (err)
return err;
-   while (test_and_set_bit(BDI_write_congested, &bdi->state) != 0) {
-   if (wbc->nonblocking)
-   return 0;
-   nfs_wait_on_write_congestion(mapping, 0);

Re: [PATCH] nfs: fix congestion control

2007-01-19 Thread Peter Zijlstra
On Thu, 2007-01-18 at 10:49 -0500, Trond Myklebust wrote:

> After the dirty page has been written to unstable storage, it marks the
> inode using I_DIRTY_DATASYNC, which should then ensure that the VFS
> calls write_inode() on the next pass through __sync_single_inode.

> I'd rather like to see fs/fs-writeback.c do this correctly (assuming
> that it is incorrect now).

balance_dirty_pages()
  wbc.sync_mode = WB_SYNC_NONE;
  writeback_inodes()
sync_sb_inodes()
  __writeback_single_inode()
__sync_single_inode()
  write_inode()
nfs_write_inode()

Ah, yes, I see. That ought to work.

/me goes verify he didn't mess it up himself...



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-18 Thread Trond Myklebust
On Thu, 2007-01-18 at 14:27 +0100, Peter Zijlstra wrote:
> On Wed, 2007-01-17 at 16:54 -0500, Trond Myklebust wrote:
> > On Wed, 2007-01-17 at 22:52 +0100, Peter Zijlstra wrote:
> > 
> > > > 
> > > > > Index: linux-2.6-git/fs/inode.c
> > > > > ===
> > > > > --- linux-2.6-git.orig/fs/inode.c 2007-01-12 08:03:47.0 
> > > > > +0100
> > > > > +++ linux-2.6-git/fs/inode.c  2007-01-12 08:53:26.0 +0100
> > > > > @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtabl
> > > > >   * the i_state of an inode while it is in use..
> > > > >   */
> > > > >  DEFINE_SPINLOCK(inode_lock);
> > > > > +EXPORT_SYMBOL_GPL(inode_lock);
> > > > 
> > > > Hmmm... Commits to all NFS servers will be globally serialized via the 
> > > > inode_lock?
> > > 
> > > Hmm, right, thats not good indeed, I can pull the call to
> > > nfs_commit_list() out of that loop.
> > 
> > There is no reason to modify any of the commit stuff. Please just drop
> > that code.
> 
> I thought you agreed to flushing commit pages when hitting the dirty page
> limit.

After the dirty page has been written to unstable storage, it marks the
inode using I_DIRTY_DATASYNC, which should then ensure that the VFS
calls write_inode() on the next pass through __sync_single_inode.

> Or would you rather see a notifier chain from congestion_wait() calling
> into the various NFS mounts?

I'd rather like to see fs/fs-writeback.c do this correctly (assuming
that it is incorrect now).

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-18 Thread Peter Zijlstra
On Wed, 2007-01-17 at 16:54 -0500, Trond Myklebust wrote:
> On Wed, 2007-01-17 at 22:52 +0100, Peter Zijlstra wrote:
> 
> > > 
> > > > Index: linux-2.6-git/fs/inode.c
> > > > ===
> > > > --- linux-2.6-git.orig/fs/inode.c   2007-01-12 08:03:47.0 
> > > > +0100
> > > > +++ linux-2.6-git/fs/inode.c2007-01-12 08:53:26.0 +0100
> > > > @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtabl
> > > >   * the i_state of an inode while it is in use..
> > > >   */
> > > >  DEFINE_SPINLOCK(inode_lock);
> > > > +EXPORT_SYMBOL_GPL(inode_lock);
> > > 
> > > Hmmm... Commits to all NFS servers will be globally serialized via the 
> > > inode_lock?
> > 
> > Hmm, right, thats not good indeed, I can pull the call to
> > nfs_commit_list() out of that loop.
> 
> There is no reason to modify any of the commit stuff. Please just drop
> that code.

I thought you agreed to flushing commit pages when hitting the dirty page
limit.

Or would you rather see a notifier chain from congestion_wait() calling
into the various NFS mounts?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-17 Thread Christoph Hellwig
> --- linux-2.6-git.orig/fs/inode.c 2007-01-12 08:03:47.0 +0100
> +++ linux-2.6-git/fs/inode.c  2007-01-12 08:53:26.0 +0100
> @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtabl
>   * the i_state of an inode while it is in use..
>   */
>  DEFINE_SPINLOCK(inode_lock);
> +EXPORT_SYMBOL_GPL(inode_lock);

Btw, big "no fucking way" here.  There is no chance we're going to export
this, even _GPL
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-17 Thread Trond Myklebust
On Wed, 2007-01-17 at 22:52 +0100, Peter Zijlstra wrote:

> > 
> > > Index: linux-2.6-git/fs/inode.c
> > > ===
> > > --- linux-2.6-git.orig/fs/inode.c 2007-01-12 08:03:47.0 +0100
> > > +++ linux-2.6-git/fs/inode.c  2007-01-12 08:53:26.0 +0100
> > > @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtabl
> > >   * the i_state of an inode while it is in use..
> > >   */
> > >  DEFINE_SPINLOCK(inode_lock);
> > > +EXPORT_SYMBOL_GPL(inode_lock);
> > 
> > Hmmm... Commits to all NFS servers will be globally serialized via the 
> > inode_lock?
> 
> Hmm, right, thats not good indeed, I can pull the call to
> nfs_commit_list() out of that loop.

There is no reason to modify any of the commit stuff. Please just drop
that code.

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-17 Thread Peter Zijlstra
On Wed, 2007-01-17 at 12:05 -0800, Christoph Lameter wrote:
> On Tue, 16 Jan 2007, Peter Zijlstra wrote:
> 
> > The current NFS client congestion logic is severely broken, it marks the
> > backing device congested during each nfs_writepages() call and implements
> > its own waitqueue.
> 
> This is the magic bullet that Andrew is looking for to fix the NFS issues?

Dunno if its magical, but it does solve a few issues I ran into.

> > Index: linux-2.6-git/include/linux/nfs_fs_sb.h
> > ===
> > --- linux-2.6-git.orig/include/linux/nfs_fs_sb.h2007-01-12 
> > 08:03:47.0 +0100
> > +++ linux-2.6-git/include/linux/nfs_fs_sb.h 2007-01-12 08:53:26.0 
> > +0100
> > @@ -82,6 +82,8 @@ struct nfs_server {
> > struct rpc_clnt *   client_acl; /* ACL RPC client handle */
> > struct nfs_iostats *io_stats;   /* I/O statistics */
> > struct backing_dev_info backing_dev_info;
> > +   atomic_twriteback;  /* number of writeback pages */
> > +   atomic_tcommit; /* number of commit pages */
> > int flags;  /* various flags */
> 
> I think writeback is frequently incremented? Would it be possible to avoid
> a single global instance of an atomic_t here? In a busy NFS system 
> with lots of processors writing via NFS this may cause a hot cacheline 
> that limits write speed.

This would be per NFS mount, pretty global indeed. But not different
that other backing_dev_info's. request_queue::nr_requests suffers a
similar fate.

> Would it be possible to use NR_WRITEBACK? If not then maybe add another
> ZVC counter named NFS_NFS_WRITEBACK?

Its a per backing_dev_info thing. So using the zone counters will not
work.

> > Index: linux-2.6-git/mm/page-writeback.c
> > ===
> > --- linux-2.6-git.orig/mm/page-writeback.c  2007-01-12 08:03:47.0 
> > +0100
> > +++ linux-2.6-git/mm/page-writeback.c   2007-01-12 08:53:26.0 
> > +0100
> > @@ -167,6 +167,12 @@ get_dirty_limits(long *pbackground, long
> > *pdirty = dirty;
> >  }
> >  
> > +int dirty_pages_exceeded(struct address_space *mapping)
> > +{
> > +   return dirty_exceeded;
> > +}
> > +EXPORT_SYMBOL_GPL(dirty_pages_exceeded);
> > +
> 
> Export the variable instead of adding a new function? Why does it take an 
> address space parameter that is not used?
> 

Yeah, that function used to be larger.

> 
> > Index: linux-2.6-git/fs/inode.c
> > ===
> > --- linux-2.6-git.orig/fs/inode.c   2007-01-12 08:03:47.0 +0100
> > +++ linux-2.6-git/fs/inode.c2007-01-12 08:53:26.0 +0100
> > @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtabl
> >   * the i_state of an inode while it is in use..
> >   */
> >  DEFINE_SPINLOCK(inode_lock);
> > +EXPORT_SYMBOL_GPL(inode_lock);
> 
> Hmmm... Commits to all NFS servers will be globally serialized via the 
> inode_lock?

Hmm, right, thats not good indeed, I can pull the call to
nfs_commit_list() out of that loop.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-17 Thread Christoph Lameter
On Tue, 16 Jan 2007, Peter Zijlstra wrote:

> The current NFS client congestion logic is severely broken, it marks the
> backing device congested during each nfs_writepages() call and implements
> its own waitqueue.

This is the magic bullet that Andrew is looking for to fix the NFS issues?

> Index: linux-2.6-git/include/linux/nfs_fs_sb.h
> ===
> --- linux-2.6-git.orig/include/linux/nfs_fs_sb.h  2007-01-12 
> 08:03:47.0 +0100
> +++ linux-2.6-git/include/linux/nfs_fs_sb.h   2007-01-12 08:53:26.0 
> +0100
> @@ -82,6 +82,8 @@ struct nfs_server {
>   struct rpc_clnt *   client_acl; /* ACL RPC client handle */
>   struct nfs_iostats *io_stats;   /* I/O statistics */
>   struct backing_dev_info backing_dev_info;
> + atomic_twriteback;  /* number of writeback pages */
> + atomic_tcommit; /* number of commit pages */
>   int flags;  /* various flags */

I think writeback is frequently incremented? Would it be possible to avoid
a single global instance of an atomic_t here? In a busy NFS system 
with lots of processors writing via NFS this may cause a hot cacheline 
that limits write speed.

Would it be possible to use NR_WRITEBACK? If not then maybe add another
ZVC counter named NFS_NFS_WRITEBACK?

> Index: linux-2.6-git/mm/page-writeback.c
> ===
> --- linux-2.6-git.orig/mm/page-writeback.c2007-01-12 08:03:47.0 
> +0100
> +++ linux-2.6-git/mm/page-writeback.c 2007-01-12 08:53:26.0 +0100
> @@ -167,6 +167,12 @@ get_dirty_limits(long *pbackground, long
>   *pdirty = dirty;
>  }
>  
> +int dirty_pages_exceeded(struct address_space *mapping)
> +{
> + return dirty_exceeded;
> +}
> +EXPORT_SYMBOL_GPL(dirty_pages_exceeded);
> +

Export the variable instead of adding a new function? Why does it take an 
address space parameter that is not used?


> Index: linux-2.6-git/fs/inode.c
> ===
> --- linux-2.6-git.orig/fs/inode.c 2007-01-12 08:03:47.0 +0100
> +++ linux-2.6-git/fs/inode.c  2007-01-12 08:53:26.0 +0100
> @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtabl
>   * the i_state of an inode while it is in use..
>   */
>  DEFINE_SPINLOCK(inode_lock);
> +EXPORT_SYMBOL_GPL(inode_lock);

Hmmm... Commits to all NFS servers will be globally serialized via the 
inode_lock?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-17 Thread Trond Myklebust
On Wed, 2007-01-17 at 15:29 +0100, Peter Zijlstra wrote:
> I was thinking that since the server needs to actually sync the page a
> commit might be quite expensive (timewise), hence I didn't want to flush
> too much, and interleave them with writing out some real pages to
> utilise bandwidth.

Most servers just call fsync()/fdatasync() whenever we send a COMMIT, in
which case the extra round trips are just adding unnecessary latency.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-17 Thread Peter Zijlstra
On Wed, 2007-01-17 at 08:50 -0500, Trond Myklebust wrote:
> On Wed, 2007-01-17 at 09:49 +0100, Peter Zijlstra wrote: 
> > > They are certainly _not_ dirty pages. They are pages that have been
> > > written to the server but are not yet guaranteed to have hit the disk
> > > (they were only written to the server's page cache). We don't care if
> > > they are paged in or swapped out on the local client.
> > > 
> > > \All the COMMIT does, is to ask the server to write the data from its
> > > page cache onto disk. Once that has been done, we can release the pages.
> > > If the commit fails, then we iterate through the whole writepage()
> > > process again. The commit itself does, however, not even look at the
> > > page data.
> > 
> > Thou art correct from an NFS point of view, however for the VM they are
> > (still) just dirty pages and we need shed them.
> > 
> > You talk of swapping them out, they are filecache pages not swapcache
> > pages. The writepage() process needs to complete and that entails
> > committing them.
> 
> My point is that we can and should collect as many of the little buggers
> as we can and treat them with ONE commit call. We don't look at the
> data, we don't lock the pages, we don't care what the VM is doing with
> them. Throttling is not only unnecessary, it is actually a bad idea
> since it slows up the rate at which we can free up the pages.

Ah, OK.

I was thinking that since the server needs to actually sync the page a
commit might be quite expensive (timewise), hence I didn't want to flush
too much, and interleave them with writing out some real pages to
utilise bandwidth.

But if you think I should just bulk commit I can do that.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-17 Thread Trond Myklebust
On Wed, 2007-01-17 at 09:49 +0100, Peter Zijlstra wrote: 
> > They are certainly _not_ dirty pages. They are pages that have been
> > written to the server but are not yet guaranteed to have hit the disk
> > (they were only written to the server's page cache). We don't care if
> > they are paged in or swapped out on the local client.
> > 
> > \All the COMMIT does, is to ask the server to write the data from its
> > page cache onto disk. Once that has been done, we can release the pages.
> > If the commit fails, then we iterate through the whole writepage()
> > process again. The commit itself does, however, not even look at the
> > page data.
> 
> Thou art correct from an NFS point of view, however for the VM they are
> (still) just dirty pages and we need shed them.
> 
> You talk of swapping them out, they are filecache pages not swapcache
> pages. The writepage() process needs to complete and that entails
> committing them.

My point is that we can and should collect as many of the little buggers
as we can and treat them with ONE commit call. We don't look at the
data, we don't lock the pages, we don't care what the VM is doing with
them. Throttling is not only unnecessary, it is actually a bad idea
since it slows up the rate at which we can free up the pages.

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-17 Thread Peter Zijlstra
On Wed, 2007-01-17 at 01:15 -0500, Trond Myklebust wrote:
> On Wed, 2007-01-17 at 03:41 +0100, Peter Zijlstra wrote:
> > On Tue, 2007-01-16 at 17:27 -0500, Trond Myklebust wrote:
> > > On Tue, 2007-01-16 at 23:08 +0100, Peter Zijlstra wrote:
> > > > Subject: nfs: fix congestion control
> > > > 
> > > > The current NFS client congestion logic is severely broken, it marks the
> > > > backing device congested during each nfs_writepages() call and 
> > > > implements
> > > > its own waitqueue.
> > > > 
> > > > Replace this by a more regular congestion implementation that puts a cap
> > > > on the number of active writeback pages and uses the bdi congestion 
> > > > waitqueue.
> > > > 
> > > > NFSv[34] commit pages are allowed to go unchecked as long as we are 
> > > > under 
> > > > the dirty page limit and not in direct reclaim.
> > 
> > > 
> > > What on earth is the point of adding congestion control to COMMIT?
> > > Strongly NACKed.
> > 
> > They are dirty pages, how are we getting rid of them when we reached the
> > dirty limit?
> 
> They are certainly _not_ dirty pages. They are pages that have been
> written to the server but are not yet guaranteed to have hit the disk
> (they were only written to the server's page cache). We don't care if
> they are paged in or swapped out on the local client.
> 
> \All the COMMIT does, is to ask the server to write the data from its
> page cache onto disk. Once that has been done, we can release the pages.
> If the commit fails, then we iterate through the whole writepage()
> process again. The commit itself does, however, not even look at the
> page data.

Thou art correct from an NFS point of view, however for the VM they are
(still) just dirty pages and we need to shed them.

You talk of swapping them out, they are filecache pages not swapcache
pages. The writepage() process needs to complete and that entails
committing them.

> > > Why 16MB of on-the-wire data? Why not 32, or 128, or ...
> > 
> > Andrew always promotes a fixed number for congestion control, I pulled
> > one from a dark place. I have no problem with a more dynamic solution.
> > 
> > > Solaris already allows you to send 2MB of write data in a single RPC
> > > request, and the RPC engine has for some time allowed you to tune the
> > > number of simultaneous RPC requests you have on the wire: Chuck has
> > > already shown that read/write performance is greatly improved by upping
> > > that value to 64 or more in the case of RPC over TCP. Why are we then
> > > suddenly telling people that they are limited to 8 simultaneous writes?
> > 
> > min(max RPC size * max concurrent RPC reqs, dirty threshold) then?
> 
> That would be far preferable. For instance, it allows those who have
> long latency fat pipes to actually use the bandwidth optimally when
> writing out the data.

I will make it so then ;-)

I found max concurrent RPCs to be xprt_{tcp,udp}_slot_table_entries(?),
where might I find the max RPC size? 

(Any pointer to just RTFM are equally well received)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-16 Thread Trond Myklebust
On Wed, 2007-01-17 at 03:41 +0100, Peter Zijlstra wrote:
> On Tue, 2007-01-16 at 17:27 -0500, Trond Myklebust wrote:
> > On Tue, 2007-01-16 at 23:08 +0100, Peter Zijlstra wrote:
> > > Subject: nfs: fix congestion control
> > > 
> > > The current NFS client congestion logic is severely broken, it marks the
> > > backing device congested during each nfs_writepages() call and implements
> > > its own waitqueue.
> > > 
> > > Replace this by a more regular congestion implementation that puts a cap
> > > on the number of active writeback pages and uses the bdi congestion 
> > > waitqueue.
> > > 
> > > NFSv[34] commit pages are allowed to go unchecked as long as we are under 
> > > the dirty page limit and not in direct reclaim.
> 
> > 
> > What on earth is the point of adding congestion control to COMMIT?
> > Strongly NACKed.
> 
> They are dirty pages, how are we getting rid of them when we reached the
> dirty limit?

They are certainly _not_ dirty pages. They are pages that have been
written to the server but are not yet guaranteed to have hit the disk
(they were only written to the server's page cache). We don't care if
they are paged in or swapped out on the local client.

All the COMMIT does, is to ask the server to write the data from its
page cache onto disk. Once that has been done, we can release the pages.
If the commit fails, then we iterate through the whole writepage()
process again. The commit itself does, however, not even look at the
page data.

> > Why 16MB of on-the-wire data? Why not 32, or 128, or ...
> 
> Andrew always promotes a fixed number for congestion control, I pulled
> one from a dark place. I have no problem with a more dynamic solution.
> 
> > Solaris already allows you to send 2MB of write data in a single RPC
> > request, and the RPC engine has for some time allowed you to tune the
> > number of simultaneous RPC requests you have on the wire: Chuck has
> > already shown that read/write performance is greatly improved by upping
> > that value to 64 or more in the case of RPC over TCP. Why are we then
> > suddenly telling people that they are limited to 8 simultaneous writes?
> 
> min(max RPC size * max concurrent RPC reqs, dirty threshold) then?

That would be far preferable. For instance, it allows those who have
long latency fat pipes to actually use the bandwidth optimally when
writing out the data.

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-16 Thread Peter Zijlstra
On Tue, 2007-01-16 at 17:27 -0500, Trond Myklebust wrote:
> On Tue, 2007-01-16 at 23:08 +0100, Peter Zijlstra wrote:
> > Subject: nfs: fix congestion control
> > 
> > The current NFS client congestion logic is severely broken, it marks the
> > backing device congested during each nfs_writepages() call and implements
> > its own waitqueue.
> > 
> > Replace this by a more regular congestion implementation that puts a cap
> > on the number of active writeback pages and uses the bdi congestion 
> > waitqueue.
> > 
> > NFSv[34] commit pages are allowed to go unchecked as long as we are under 
> > the dirty page limit and not in direct reclaim.

> 
> What on earth is the point of adding congestion control to COMMIT?
> Strongly NACKed.

They are dirty pages, how are we getting rid of them when we reached the
dirty limit?

> Why 16MB of on-the-wire data? Why not 32, or 128, or ...

Andrew always promotes a fixed number for congestion control, I pulled
one from a dark place. I have no problem with a more dynamic solution.

> Solaris already allows you to send 2MB of write data in a single RPC
> request, and the RPC engine has for some time allowed you to tune the
> number of simultaneous RPC requests you have on the wire: Chuck has
> already shown that read/write performance is greatly improved by upping
> that value to 64 or more in the case of RPC over TCP. Why are we then
> suddenly telling people that they are limited to 8 simultaneous writes?

min(max RPC size * max concurrent RPC reqs, dirty threshold) then?



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-16 Thread Trond Myklebust

On Tue, 2007-01-16 at 23:08 +0100, Peter Zijlstra wrote:
> Subject: nfs: fix congestion control
> 
> The current NFS client congestion logic is severely broken, it marks the
> backing device congested during each nfs_writepages() call and implements
> its own waitqueue.
> 
> Replace this by a more regular congestion implementation that puts a cap
> on the number of active writeback pages and uses the bdi congestion waitqueue.
> 
> NFSv[34] commit pages are allowed to go unchecked as long as we are under 
> the dirty page limit and not in direct reclaim.
> 
>   A buxom young lass from Neale's Flat,
>   Bore triplets, named Matt, Pat and Tat.
>   "Oh Lord," she protested,
>   "'Tis somewhat congested ...
>   "You've given me no tit for Tat." 


What on earth is the point of adding congestion control to COMMIT?
Strongly NACKed.

Why 16MB of on-the-wire data? Why not 32, or 128, or ...
Solaris already allows you to send 2MB of write data in a single RPC
request, and the RPC engine has for some time allowed you to tune the
number of simultaneous RPC requests you have on the wire: Chuck has
already shown that read/write performance is greatly improved by upping
that value to 64 or more in the case of RPC over TCP. Why are we then
suddenly telling people that they are limited to 8 simultaneous writes?

Trond


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] nfs: fix congestion control

2007-01-16 Thread Peter Zijlstra
Subject: nfs: fix congestion control

The current NFS client congestion logic is severely broken, it marks the
backing device congested during each nfs_writepages() call and implements
its own waitqueue.

Replace this by a more regular congestion implementation that puts a cap
on the number of active writeback pages and uses the bdi congestion waitqueue.

NFSv[34] commit pages are allowed to go unchecked as long as we are under 
the dirty page limit and not in direct reclaim.

A buxom young lass from Neale's Flat,
Bore triplets, named Matt, Pat and Tat.
"Oh Lord," she protested,
"'Tis somewhat congested ...
"You've given me no tit for Tat." 

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/inode.c  |1 
 fs/nfs/pagelist.c   |8 +
 fs/nfs/write.c  |  257 +++-
 include/linux/backing-dev.h |1 
 include/linux/nfs_fs_sb.h   |2 
 include/linux/nfs_page.h|2 
 include/linux/writeback.h   |1 
 mm/backing-dev.c|   16 ++
 mm/page-writeback.c |6 +
 9 files changed, 240 insertions(+), 54 deletions(-)

Index: linux-2.6-git/fs/nfs/write.c
===
--- linux-2.6-git.orig/fs/nfs/write.c   2007-01-12 08:03:47.0 +0100
+++ linux-2.6-git/fs/nfs/write.c2007-01-12 09:31:41.0 +0100
@@ -89,8 +89,6 @@ static struct kmem_cache *nfs_wdata_cach
 static mempool_t *nfs_wdata_mempool;
 static mempool_t *nfs_commit_mempool;
 
-static DECLARE_WAIT_QUEUE_HEAD(nfs_write_congestion);
-
 struct nfs_write_data *nfs_commit_alloc(void)
 {
struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
@@ -245,6 +243,91 @@ static int wb_priority(struct writeback_
 }
 
 /*
+ * NFS congestion control
+ *
+ * Congestion is composed of two parts, writeback and commit pages.
+ * Writeback pages are active, that is, all that is required to get out of
+ * writeback state is sit around and wait. Commit pages on the other hand
+ * are not and they need a little nudge to go away.
+ *
+ * Normally we want to maximise the number of commit pages for it allows the
+ * server to make best use of unstable storage. However when we are running
+ * low on memory we do need to reduce the commit pages.
+ *
+ * Hence writeback pages are managed in the conventional way, but for commit
+ * pages we poll on every write.
+ *
+ * The threshold is picked so that it allows for 16M of in flight data
+ * given current transfer speeds that seems reasonable.
+ */
+
+#define NFS_CONGESTION_SIZE16
+#define NFS_CONGESTION_ON_THRESH   (NFS_CONGESTION_SIZE << (20 - 
PAGE_SHIFT))
+#define NFS_CONGESTION_OFF_THRESH  \
+   (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+
+/*
+ * Include the commit pages into the congestion logic when there is either
+ * pressure from the total dirty limit or when we're in direct reclaim.
+ */
+static inline int commit_pressure(struct inode *inode)
+{
+   return dirty_pages_exceeded(inode->i_mapping) ||
+   (current->flags & PF_MEMALLOC);
+}
+
+static void nfs_congestion_on(struct inode *inode)
+{
+   struct nfs_server *nfss = NFS_SERVER(inode);
+
+   if (atomic_read(&nfss->writeback) > NFS_CONGESTION_ON_THRESH)
+   set_bdi_congested(&nfss->backing_dev_info, WRITE);
+}
+
+static void nfs_congestion_off(struct inode *inode)
+{
+   struct nfs_server *nfss = NFS_SERVER(inode);
+
+   if (atomic_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+   clear_bdi_congested(&nfss->backing_dev_info, WRITE);
+   congestion_end(WRITE);
+   }
+}
+
+static inline void nfs_set_page_writeback(struct page *page)
+{
+   if (!test_set_page_writeback(page)) {
+   struct inode *inode = page->mapping->host;
+   atomic_inc(&NFS_SERVER(inode)->writeback);
+   nfs_congestion_on(inode);
+   }
+}
+
+static inline void nfs_end_page_writeback(struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   end_page_writeback(page);
+   atomic_dec(&NFS_SERVER(inode)->writeback);
+   nfs_congestion_off(inode);
+}
+
+static inline void nfs_set_page_commit(struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   inc_zone_page_state(page, NR_UNSTABLE_NFS);
+   atomic_inc(&NFS_SERVER(inode)->commit);
+   nfs_congestion_on(inode);
+}
+
+static inline void nfs_end_page_commit(struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   dec_zone_page_state(page, NR_UNSTABLE_NFS);
+   atomic_dec(&NFS_SERVER(inode)->commit);
+   nfs_congestion_off(inode);
+}
+
+/*
  * Find an associated nfs write request, and prepare to flush it out
  * Returns 1 if there was no write request, or if the request was
  * already tagged by nfs_set_page_dirty. Returns 0 if the request
@@ -281,7 +364,7 @@ static