Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-14 Thread Miklos Szeredi
> Only if the queue depth is not bound. Queue depths are bound and so
> the distance we can go over the threshold is limited.  This is the
> fundamental principle on which the throttling is based.
> 
> Hence, if the queue is not full, then we will have either written
> dirty pages to it (i.e. wbc->nr_to_write != write_chunk, so we will throttle
> or continue normally if write_chunk was written) or we have no more
> dirty pages left.
> 
> Having no dirty pages left on the bdi and it not being congested
> means we effectively have a clean, idle bdi. We should not be trying
> to throttle writeback here - we can't do anything to improve the
> situation by continuing to try to do writeback on this bdi, so we
> may as well give up and let the writer continue. Once we have dirty
> pages on the bdi, we'll get throttled appropriately.

OK, you convinced me.

How about this patch?  I introduced a new wbc counter, that sums the
number of dirty pages encountered, including ones already under
writeback.

Dave, big thanks for your insights.

Miklos

Index: linux/include/linux/writeback.h
===================================================================
--- linux.orig/include/linux/writeback.h	2007-03-14 22:43:42.000000000 +0100
+++ linux/include/linux/writeback.h	2007-03-14 22:58:56.000000000 +0100
@@ -44,6 +44,7 @@ struct writeback_control {
long nr_to_write;   /* Write this many pages, and decrement
   this for each page written */
long pages_skipped; /* Pages which were not written */
+   long nr_dirty;  /* Number of dirty pages encountered */
 
/*
 * For a_ops->writepages(): is start or end are non-zero then this is
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-03-14 22:41:01.000000000 +0100
+++ linux/mm/page-writeback.c	2007-03-14 23:00:20.000000000 +0100
@@ -220,6 +220,17 @@ static void balance_dirty_pages(struct a
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
break;  /* We've done our duty */
+
+   /*
+* If just a few dirty pages were encountered, and
+* the queue is not congested, then allow this dirty
+* producer to continue.  This resolves the deadlock
+* that happens when one filesystem writes back data
+* through another.  It should also help when a slow
+* device is completely blocking other writes.
+*/
+   if (wbc.nr_dirty < 8 && !bdi_write_congested(bdi))
+   break;
}
congestion_wait(WRITE, HZ/10);
}
@@ -612,6 +623,7 @@ retry:
  min(end - index, 
(pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
 
+   wbc->nr_dirty += nr_pages;
scanned = 1;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
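Outside the patch itself, the intended effect of the new break condition can be modelled in plain user-space C. This is a sketch only: `struct wbc_sim` is a simplified stand-in for the real writeback_control, `queue_congested` stands in for bdi_write_congested(bdi), and the threshold of 8 is the value used in the patch above.

```c
/* User-space model (not kernel code) of the break condition added by
 * the patch.  All names here are illustrative stand-ins. */
#include <assert.h>
#include <stdbool.h>

struct wbc_sim {
	long nr_to_write;	/* budget left after writeback_inodes() */
	long nr_dirty;		/* dirty pages encountered, incl. writeback */
};

/* true => the dirtying process may leave balance_dirty_pages() */
static bool may_break_out(const struct wbc_sim *wbc, long write_chunk,
			  bool queue_congested)
{
	long pages_written = write_chunk - wbc->nr_to_write;

	if (pages_written >= write_chunk)
		return true;		/* wrote our full chunk: duty done */

	/* New rule: hardly any dirty pages were seen and the queue is
	 * idle, so throttling can make no progress -- let the writer go. */
	return wbc->nr_dirty < 8 && !queue_congested;
}
```

The first branch is the pre-existing "done our duty" exit; only the final return is new behaviour.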
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-13 Thread David Chinner
On Tue, Mar 13, 2007 at 09:21:59AM +0100, Miklos Szeredi wrote:
> > > read request
> > > sys_write
> > >   mutex_lock(i_mutex)
> > >   ...
> > >  balance_dirty_pages
> > > submit write requests
> > > loop ... write requests completed ... dirty still over limit ... 
> > >   ... loop forever
> > 
> > Hmmm - the situation in balance_dirty_pages() after an attempt
> > to writeback_inodes(&wbc) that has written nothing because there
> > is nothing to write would be:
> > 
> > wbc->nr_to_write == write_chunk &&
> > wbc->pages_skipped == 0 &&
> > wbc->encountered_congestion == 0 &&
> > !bdi_congested(wbc->bdi)
> > 
> > What happens if you make that an exit condition to the loop?
> 
> That's almost right.  The only problem is that even if there's no
> congestion, the device queue can be holding a great amount of yet
> unwritten pages.  So exiting on this condition would mean that
> dirty+writeback could go way over the threshold.

Only if the queue depth is not bound. Queue depths are bound and so
the distance we can go over the threshold is limited.  This is the
fundamental principle on which the throttling is based.

Hence, if the queue is not full, then we will have either written
dirty pages to it (i.e. wbc->nr_to_write != write_chunk, so we will throttle
or continue normally if write_chunk was written) or we have no more
dirty pages left.

Having no dirty pages left on the bdi and it not being congested
means we effectively have a clean, idle bdi. We should not be trying
to throttle writeback here - we can't do anything to improve the
situation by continuing to try to do writeback on this bdi, so we
may as well give up and let the writer continue. Once we have dirty
pages on the bdi, we'll get throttled appropriately.

The point I'm making here is that if the bdi is not congested, any
pages dirtied on that bdi can be cleaned _quickly_ and so writing
more pages to it isn't a big deal even if we are over the global
dirty threshold.

Remember, the global dirty threshold is not really a hard limit -
it's a threshold at which we change behaviour. Throttling idle bdi's
does not contribute usefully to reducing the number of dirty pages
in the system; all it really does is deny service to devices that could
otherwise be doing useful work.

> How much of a problem would this be?  I don't know; I guess it depends on
> many things: how many queues, how many requests per queue, how many
> bytes per request.

Right, and most people don't have enough devices in their system for
this to be a problem. Even those of us who do have enough devices
for this to potentially be a problem usually have enough RAM in
the machine that it is not a problem.

> > Or alternatively, adding another bit to the wbc structure to
> > say "there was nothing to do" and setting that if we find
> > list_empty(&sb->s_dirty) when trying to flush dirty inodes.
> > 
> > [ FWIW, this may also solve another problem of fast block devices
> > being throttled incorrectly when a slow block dev is consuming
> > all the dirty pages... ]
> 
> There may be a patch floating around, which I think basically does
> this, but only as long as the dirty+writeback are over a soft limit,
> but under the hard limit.
> 
> When over the hard limit, balance_dirty_pages still loops until
> dirty+writeback go below the threshold.

The difference between the two methods is that if there is any hard
limit that results in balance_dirty_pages looping then you have a
potential deadlock.  Hence the soft+hard limits will reduce the
occurrence but not remove the deadlock. Breaking out of the loop
when there is nothing to do simply means we'll reenter again
with something to do very shortly (and *then* throttle) if the
process continues to write.
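The difference is easy to see in a toy model (user-space C, nothing kernel-specific; the page counts and the iteration cap are arbitrary illustrative choices): under a hard limit the throttled task spins forever when none of the over-limit pages are its own to clean, while the break-out policy escapes at once and gets re-throttled on its next write.

```c
/* Toy simulation contrasting a hard dirty limit with the
 * break-out-when-idle policy.  Purely illustrative. */
#include <assert.h>
#include <stdbool.h>

/* Returns the number of iterations before the throttled task escapes,
 * or -1 if it would spin forever (approximated by `max_iters`). */
static int throttle_sim(long global_dirty, long threshold, long bdi_dirty,
			bool break_when_idle, int max_iters)
{
	for (int i = 0; i < max_iters; i++) {
		if (global_dirty <= threshold)
			return i;		/* back under the limit */
		if (bdi_dirty > 0) {
			bdi_dirty--;		/* clean one of our own pages */
			global_dirty--;
			continue;
		}
		if (break_when_idle)
			return i;		/* nothing to do here: give up */
		/* hard-limit policy: keep waiting on someone else's pages */
	}
	return -1;
}
```

Note that when the task does have its own dirty pages, both policies behave identically: throttling only diverges on an idle bdi.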

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-13 Thread Miklos Szeredi
> > > IIUC, your problem is that there's another bdi that holds all the
> > > dirty pages, and this throttle loop never flushes pages from that
> > > other bdi and we sleep instead. It seems to me that the fundamental
> > > problem is that to clean the pages we need to flush both bdi's, not
> > > just the bdi we are directly dirtying.
> > 
> > This is what happens:
> > 
> > write fault on upper filesystem
> >   balance_dirty_pages
> > submit write requests
> >   loop ...
> 
> Isn't this loop transferring the dirty state from the upper
> filesystem to the lower filesystem?

What this loop is doing is putting write requests in the request
queue, and in so doing transforming page state from dirty to
writeback.

> What I don't see here is how the pages on this filesystem are not
> getting cleaned if the lower filesystem is being flushed properly.

Because the lower filesystem writes back one request, but then gets
stuck in balance_dirty_pages before returning.  So the write request
is never completed.

The problem is that balance_dirty_pages is waiting for the condition
that the global number of dirty+writeback pages goes below the
threshold.  But this condition can only be satisfied if
balance_dirty_pages() returns.
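That circular wait can be written down as a minimal model (plain C sketch; the iteration cap stands in for "forever"): the loop's exit condition depends on writeback completions that can only happen after the loop itself returns, so nothing the loop observes ever changes.

```c
/* The lower fs's balance_dirty_pages(), reduced to its skeleton.  The
 * in-flight pages belong to the upper fs and complete only when the
 * fuse daemon -- the very task running this loop -- returns to serve
 * them, so `dirty_writeback` never changes from inside the loop. */
#include <assert.h>

static int lower_balance_dirty(long dirty_writeback, long threshold,
			       int max_iters)
{
	for (int i = 0; i < max_iters; i++) {
		if (dirty_writeback <= threshold)
			return i;	/* would exit -- but nothing in
					 * this loop can make it happen */
	}
	return -1;			/* spins forever: the deadlock */
}
```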

> I'm probably missing something big and obvious, but I'm not
> familiar with the exact workings of FUSE so please excuse my
> ignorance.
> 
> > --- fuse IPC ---
> > [fuse loopback fs thread 1]
> 
> This is the lower filesystem? Or a callback thread for
> doing the write requests to the lower filesystem?

This is the fuse daemon.  It's a normal process that reads requests
from /dev/fuse, serves these requests then writes the reply back onto
/dev/fuse.  It is usually multithreaded, so it can serve many requests
in parallel.

The loopback filesystem serves the requests by issuing the relevant
filesystem syscalls on the underlying fs.

> > read request
> > sys_write
> >   mutex_lock(i_mutex)
> >   ...
> >  balance_dirty_pages
> > submit write requests
> > loop ... write requests completed ... dirty still over limit ... 
> > ... loop forever
> 
> Hmmm - the situation in balance_dirty_pages() after an attempt
> to writeback_inodes(&wbc) that has written nothing because there
> is nothing to write would be:
> 
>   wbc->nr_to_write == write_chunk &&
>   wbc->pages_skipped == 0 &&
>   wbc->encountered_congestion == 0 &&
>   !bdi_congested(wbc->bdi)
> 
> What happens if you make that an exit condition to the loop?

That's almost right.  The only problem is that even if there's no
congestion, the device queue can be holding a great amount of yet
unwritten pages.  So exiting on this condition would mean that
dirty+writeback could go way over the threshold.

How much of a problem would this be?  I don't know; I guess it depends on
many things: how many queues, how many requests per queue, how many
bytes per request.

> Or alternatively, adding another bit to the wbc structure to
> say "there was nothing to do" and setting that if we find
> list_empty(&sb->s_dirty) when trying to flush dirty inodes.
> 
> [ FWIW, this may also solve another problem of fast block devices
> being throttled incorrectly when a slow block dev is consuming
> all the dirty pages... ]

There may be a patch floating around, which I think basically does
this, but only as long as the dirty+writeback are over a soft limit,
but under the hard limit.

When over the hard limit, balance_dirty_pages still loops until
dirty+writeback go below the threshold.

Thanks,
Miklos


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-12 Thread David Chinner
On Mon, Mar 12, 2007 at 11:36:16PM +0100, Miklos Szeredi wrote:
> I'll try to explain the reason for the deadlock first.

Ah, thanks for that.

> > IIUC, your problem is that there's another bdi that holds all the
> > dirty pages, and this throttle loop never flushes pages from that
> > other bdi and we sleep instead. It seems to me that the fundamental
> > problem is that to clean the pages we need to flush both bdi's, not
> > just the bdi we are directly dirtying.
> 
> This is what happens:
> 
> write fault on upper filesystem
>   balance_dirty_pages
> submit write requests
>   loop ...

Isn't this loop transferring the dirty state from the upper
filesystem to the lower filesystem? What I don't see here is
how the pages on this filesystem are not getting cleaned if
the lower filesystem is being flushed properly.

I'm probably missing something big and obvious, but I'm not
familiar with the exact workings of FUSE so please excuse my
ignorance.

> --- fuse IPC ---
> [fuse loopback fs thread 1]

This is the lower filesystem? Or a callback thread for
doing the write requests to the lower filesystem?

> read request
> sys_write
>   mutex_lock(i_mutex)
>   ...
>  balance_dirty_pages
> submit write requests
> loop ... write requests completed ... dirty still over limit ... 
>   ... loop forever

Hmmm - the situation in balance_dirty_pages() after an attempt
to writeback_inodes(&wbc) that has written nothing because there
is nothing to write would be:

wbc->nr_to_write == write_chunk &&
wbc->pages_skipped == 0 &&
wbc->encountered_congestion == 0 &&
!bdi_congested(wbc->bdi)

What happens if you make that an exit condition to the loop?
Or alternatively, adding another bit to the wbc structure to
say "there was nothing to do" and setting that if we find
list_empty(&sb->s_dirty) when trying to flush dirty inodes.

[ FWIW, this may also solve another problem of fast block devices
being throttled incorrectly when a slow block dev is consuming
all the dirty pages... ]

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-12 Thread Miklos Szeredi
I'll try to explain the reason for the deadlock first.

> IIUC, your problem is that there's another bdi that holds all the
> dirty pages, and this throttle loop never flushes pages from that
> other bdi and we sleep instead. It seems to me that the fundamental
> problem is that to clean the pages we need to flush both bdi's, not
> just the bdi we are directly dirtying.

This is what happens:

write fault on upper filesystem
  balance_dirty_pages
submit write requests
  loop ...
--- fuse IPC ---
[fuse loopback fs thread 1]
read request
sys_write
  mutex_lock(i_mutex)
  ...
 balance_dirty_pages
submit write requests
loop ... write requests completed ... dirty still over limit ... 
... loop forever

[fuse loopback fs thread 2]
read request
sys_write
  mutex_lock(i_mutex) blocks

So the queue for the upper filesystem is full.  The queue for the
lower filesystem is empty.  There are no dirty pages in the lower
filesystem.

So kicking pdflush for the lower filesystem doesn't help, there's
nothing to do.  balance_dirty_pages() for the lower filesystem should
just realize that there's nothing to do and return, and then there
would be progress.

So there's really no need to do any accounting, just some
logic to determine that a backing dev is nearly or completely
quiescent.

And getting out of this tight situation doesn't have to be efficient.
This is probably a very rare corner case, that almost never happens in
real life, only with aggressive test tools like bash_shared_mapping.

> > OK.  How about just accounting writeback pages?  That should be much
> > less of a problem, since normally writeback is started from
> > pdflush/kupdate in large batches without any concurrency.
> 
> Except when you are throttling you bounce the cacheline around
> each cpu as it triggers foreground writeback.

Yeah, we'd lose a bit of CPU, but not any write performance, since it
is being throttled back anyway.

> > Or is it possible to export the state of the device queue to mm?
> > E.g. could balance_dirty_pages() query the backing dev if there are
> > any outstanding write requests?
> 
> Not directly - writeback_in_progress(bdi) is a coarse measure
> indicating pdflush is active on this bdi, which implies outstanding
> write requests.

Hmm, not quite what I need.

> > > I'd call this a showstopper right now - maybe you need to look at
> > > something like the ZVC code that Christoph Lameter wrote, perhaps?
> > 
> > That's rather a heavyweight approach for this I think.
> 
> But if you want to use per-page accounting, you are going to
> need a per-cpu or per-zone set of counters on each bdi to do
> this without introducing regressions.

Yes, this is an option, but I hope for a simpler solution.

Thanks,
Miklos


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-12 Thread David Chinner
On Mon, Mar 12, 2007 at 12:40:47PM +0100, Miklos Szeredi wrote:
> > > I have no idea how serious the scalability problems with this are.  If
> > > they are serious, different solutions can probably be found for the
> > > above, but this is certainly the simplest.
> > 
> > Atomic operations to a single per-backing device from all CPUs at once?
> > That's a pretty serious scalability issue and it will cause a major
> > performance regression for XFS.
> 
> OK.  How about just accounting writeback pages?  That should be much
> less of a problem, since normally writeback is started from
> pdflush/kupdate in large batches without any concurrency.

Except when you are throttling you bounce the cacheline around
each cpu as it triggers foreground writeback.

> Or is it possible to export the state of the device queue to mm?
> E.g. could balance_dirty_pages() query the backing dev if there are
> any outstanding write requests?

Not directly - writeback_in_progress(bdi) is a coarse measure
indicating pdflush is active on this bdi, which implies outstanding
write requests.

> > I'd call this a showstopper right now - maybe you need to look at
> > something like the ZVC code that Christoph Lameter wrote, perhaps?
> 
> That's rather a heavyweight approach for this I think.

But if you want to use per-page accounting, you are going to
need a per-cpu or per-zone set of counters on each bdi to do
this without introducing regressions.

> The only info balance_dirty_pages() really needs is whether there are
> any dirty+writeback bound for the backing dev or not.

writeback bound (i.e. writing as fast as we can) is probably
indicated fairly reliably by bdi_congested(bdi).

Now all you need is the number of dirty pages.

> It knows about the dirty pages, since it calls writeback_inodes() which
> scans the dirty pages for this backing dev looking for ones to write
> out.

It scans the dirty inode list for dirty inodes which indirectly finds
the dirty pages. It does not know about the number of dirty pages
directly...

> If after returning from writeback_inodes() wbc->nr_to_write
> didn't decrease and wbc->pages_skipped is zero then we know that there
> are no more dirty pages for the device.  Or at least there are no
> dirty pages which aren't already under writeback.

Sure, you can tell if there are _no_ dirty pages on the bdi, but
if there are dirty pages, you can't tell how many there are. Your
followup patches need to know how many dirty+writeback pages there
are on the bdi, so I don't really see any way you can solve the
deadlock in this manner without scalable bdi->nr_dirty accounting.



IIUC, your problem is that there's another bdi that holds all the
dirty pages, and this throttle loop never flushes pages from that
other bdi and we sleep instead. It seems to me that the fundamental
problem is that to clean the pages we need to flush both bdi's, not
just the bdi we are directly dirtying.

How about a "dependent bdi" link? i.e. if you have a loopback
filesystem, it has a direct bdi (the loopback device) and a
dependent bdi - the bdi that belongs to the underlying filesystem.

When we enter the throttle loop we flush from the direct bdi
and if we fail to flush all the pages we require, we flush
the dependent bdi (maybe even just kick pdflush for that bdi)
before we call congestion_wait() and go to sleep. This way
we are always making progress cleaning pages on the machine,
not just transferring dirty pages from one bdi to another.

Wouldn't that solve the deadlock without needing painful
accounting?
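A rough sketch of what such a link might look like (user-space C; `bdi_sim`, its field names, and the transfer model are all invented for illustration). In this toy, flushing a bdi that has a dependent merely moves the dirt down one level, and only the bottom-most bdi, a real device, actually cleans pages:

```c
/* Illustrative model of a "dependent bdi" chain, not kernel code. */
#include <assert.h>
#include <stddef.h>

struct bdi_sim {
	long dirty;			/* dirty pages against this bdi */
	struct bdi_sim *dependent;	/* underlying fs's bdi, or NULL */
};

/* Flush up to `want` pages, cascading to the dependent bdi; returns the
 * number of pages actually cleaned (i.e. that reached a real device). */
static long flush_chain(struct bdi_sim *bdi, long want)
{
	long moved = bdi->dirty < want ? bdi->dirty : want;

	bdi->dirty -= moved;
	if (!bdi->dependent)
		return moved;		/* real device: pages are now clean */

	/* Loopback-style bdi: writing here only dirties the lower fs,
	 * so kick the dependent to make real progress. */
	bdi->dependent->dirty += moved;
	return flush_chain(bdi->dependent, want);
}
```

The point of the cascade is exactly the one made above: every pass ends at a real device, so the machine as a whole always cleans pages instead of just shuffling dirty state between bdis.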

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-12 Thread Miklos Szeredi
> > I have no idea how serious the scalability problems with this are.  If
> > they are serious, different solutions can probably be found for the
> > above, but this is certainly the simplest.
> 
> Atomic operations to a single per-backing device from all CPUs at once?
> That's a pretty serious scalability issue and it will cause a major
> performance regression for XFS.

OK.  How about just accounting writeback pages?  That should be much
less of a problem, since normally writeback is started from
pdflush/kupdate in large batches without any concurrency.

Or is it possible to export the state of the device queue to mm?
E.g. could balance_dirty_pages() query the backing dev if there are
any outstanding write requests?

> I'd call this a showstopper right now - maybe you need to look at
> something like the ZVC code that Christoph Lameter wrote, perhaps?

That's rather a heavyweight approach for this I think.

The only info balance_dirty_pages() really needs is whether there are
any dirty+writeback bound for the backing dev or not.

It knows about the dirty pages, since it calls writeback_inodes() which
scans the dirty pages for this backing dev looking for ones to write
out.  If after returning from writeback_inodes() wbc->nr_to_write
didn't decrease and wbc->pages_skipped is zero then we know that there
are no more dirty pages for the device.  Or at least there are no
dirty pages which aren't already under writeback.
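That check can be written down directly (a user-space C sketch of the condition only, not kernel code; congestion is passed in as a boolean where the real code would test bdi_congested()):

```c
/* After a writeback_inodes() pass: nothing written, nothing skipped,
 * and the queue not congested => no dirty pages remain on this bdi
 * (or none that are not already under writeback). */
#include <assert.h>
#include <stdbool.h>

static bool bdi_has_no_dirty_work(long nr_to_write_before,
				  long nr_to_write_after,
				  long pages_skipped, bool congested)
{
	return nr_to_write_after == nr_to_write_before &&
	       pages_skipped == 0 && !congested;
}
```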

Thanks,
Miklos


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-11 Thread David Chinner
On Tue, Mar 06, 2007 at 07:04:46PM +0100, Miklos Szeredi wrote:
> From: Andrew Morton <[EMAIL PROTECTED]>
> 
> [EMAIL PROTECTED]: bugfix]
> 
> Miklos Szeredi <[EMAIL PROTECTED]>:
> 
> Changes:
>  - updated to apply after clear_page_dirty_for_io() race fix
> 
> This is needed for
> 
>  - balance_dirty_pages() deadlock fix
>  - fuse dirty page accounting
> 
> I have no idea how serious the scalability problems with this are.  If
> they are serious, different solutions can probably be found for the
> above, but this is certainly the simplest.

Atomic operations to a single per-backing device from all CPUs at once?
That's a pretty serious scalability issue and it will cause a major
performance regression for XFS.

I'd call this a showstopper right now - maybe you need to look at
something like the ZVC code that Christoph Lameter wrote, perhaps?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


[patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-06 Thread Miklos Szeredi
From: Andrew Morton <[EMAIL PROTECTED]>

[EMAIL PROTECTED]: bugfix]

Miklos Szeredi <[EMAIL PROTECTED]>:

Changes:
 - updated to apply after clear_page_dirty_for_io() race fix

This is needed for

 - balance_dirty_pages() deadlock fix
 - fuse dirty page accounting

I have no idea how serious the scalability problems with this are.  If
they are serious, different solutions can probably be found for the
above, but this is certainly the simplest.

Signed-off-by: Tomoki Sekiyama <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/block/ll_rw_blk.c
===================================================================
--- linux.orig/block/ll_rw_blk.c	2007-03-06 11:19:16.000000000 +0100
+++ linux/block/ll_rw_blk.c	2007-03-06 13:40:08.000000000 +0100
@@ -201,6 +201,8 @@ EXPORT_SYMBOL(blk_queue_softirq_done);
  **/
 void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
 {
+   struct backing_dev_info *bdi = &q->backing_dev_info;
+
/*
 * set defaults
 */
@@ -208,9 +210,11 @@ void blk_queue_make_request(request_queu
blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
q->make_request_fn = mfn;
-   q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-   q->backing_dev_info.state = 0;
-   q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
+   bdi->ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+   bdi->state = 0;
+   bdi->capabilities = BDI_CAP_MAP_COPY;
+   atomic_long_set(&bdi->nr_dirty, 0);
+   atomic_long_set(&bdi->nr_writeback, 0);
blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
blk_queue_hardsect_size(q, 512);
blk_queue_dma_alignment(q, 511);
@@ -3922,6 +3926,19 @@ static ssize_t queue_max_hw_sectors_show
return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%lu\n",
+   atomic_long_read(&q->backing_dev_info.nr_dirty));
+
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%lu\n",
+   atomic_long_read(&q->backing_dev_info.nr_writeback));
+
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -3946,6 +3963,16 @@ static struct queue_sysfs_entry queue_ma
.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_nr_dirty_entry = {
+   .attr = {.name = "nr_dirty", .mode = S_IRUGO },
+   .show = queue_nr_dirty_show,
+};
+
+static struct queue_sysfs_entry queue_nr_writeback_entry = {
+   .attr = {.name = "nr_writeback", .mode = S_IRUGO },
+   .show = queue_nr_writeback_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
.show = elv_iosched_show,
@@ -3957,6 +3984,8 @@ static struct attribute *default_attrs[]
&queue_ra_entry.attr,
&queue_max_hw_sectors_entry.attr,
&queue_max_sectors_entry.attr,
+   &queue_nr_dirty_entry.attr,
+   &queue_nr_writeback_entry.attr,
&queue_iosched_entry.attr,
NULL,
 };
Index: linux/include/linux/backing-dev.h
===================================================================
--- linux.orig/include/linux/backing-dev.h	2007-03-06 11:19:18.000000000 +0100
+++ linux/include/linux/backing-dev.h	2007-03-06 13:40:08.000000000 +0100
@@ -28,6 +28,8 @@ struct backing_dev_info {
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long state;/* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
+   atomic_long_t nr_dirty; /* Pages dirty against this BDI */
+   atomic_long_t nr_writeback;/* Pages under writeback against this BDI */
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data;   /* Pointer to aux data for congested func */
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-03-06 13:28:26.000000000 +0100
+++ linux/mm/page-writeback.c	2007-03-06 13:45:55.000000000 +0100
@@ -743,6 +743,7 @@ void generic_page_dirtied(struct page *p
if (mapping) { /* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+   atomic_long_inc(&mapping->backing_dev_info->nr_dirty);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(&mapping->page_tree,
@@ -896,6