Re: [PATCH] remove throttle_vm_writeout()
On Mon, Oct 08, 2007 at 09:54:33AM +1000, David Chinner wrote:
> On Fri, Oct 05, 2007 at 08:30:28PM +0800, Fengguang Wu wrote:
> > The improvement could be:
> > - kswapd is now explicitly preferred to do the writeout;
>
> Careful. kswapd is much less efficient at writeout than pdflush
> because it does not do low->high offset writeback per address space.
> It just flushes the pages in LRU order and that turns writeback into
> a non-sequential mess. I/O sizes decrease substantially and
> throughput falls through the floor.
>
> So if you want kswapd to take over all the writeback, it needs to do
> writeback in the same manner as the background flushes, i.e. by
> grabbing page->mapping and flushing that in sequential order rather
> than just the page on the end of the LRU.
>
> I documented the effect of kswapd taking over writeback in this
> paper (section 5.3):
>
> http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

Ah, indeed. That means introducing a new "really congested" threshold
for kswapd is *dangerous*. I realized this later on, and am now heading
in another direction.

The basic idea is to
- rotate pdflush-issued writeback pages for kswapd;
- use the more precise zone_rotate_wait() to throttle kswapd.

The code is a quick hack and not tested yet. Early comments are more
than welcome.

Fengguang
---
 include/linux/mmzone.h |    1 +
 mm/filemap.c           |    5 ++++-
 mm/page_alloc.c        |    1 +
 mm/swap.c              |   13 +++++++++++++
 mm/vmscan.c            |   12 ++++++++++--
 5 files changed, 29 insertions(+), 3 deletions(-)

--- linux-2.6.23-rc8-mm2.orig/include/linux/mmzone.h
+++ linux-2.6.23-rc8-mm2/include/linux/mmzone.h
@@ -316,6 +316,7 @@ struct zone {
 	wait_queue_head_t	* wait_table;
 	unsigned long		wait_table_hash_nr_entries;
 	unsigned long		wait_table_bits;
+	wait_queue_head_t	wait_rotate;

 	/*
 	 * Discontig memory support fields.
--- linux-2.6.23-rc8-mm2.orig/mm/filemap.c
+++ linux-2.6.23-rc8-mm2/mm/filemap.c
@@ -558,12 +558,15 @@ EXPORT_SYMBOL(unlock_page);
  */
 void end_page_writeback(struct page *page)
 {
-	if (!TestClearPageReclaim(page) || rotate_reclaimable_page(page)) {
+	int r = 1;
+
+	if (!TestClearPageReclaim(page) || (r = rotate_reclaimable_page(page))) {
 		if (!test_clear_page_writeback(page))
 			BUG();
 	}
 	smp_mb__after_clear_bit();
 	wake_up_page(page, PG_writeback);
+	if (!r)
+		wake_up(&page_zone(page)->wait_rotate);
 }
 EXPORT_SYMBOL(end_page_writeback);

--- linux-2.6.23-rc8-mm2.orig/mm/page_alloc.c
+++ linux-2.6.23-rc8-mm2/mm/page_alloc.c
@@ -3482,6 +3482,7 @@ static void __meminit free_area_init_cor
 		zone->prev_priority = DEF_PRIORITY;

 		zone_pcp_init(zone);
+		init_waitqueue_head(&zone->wait_rotate);
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
 		zone->nr_scan_active = 0;

--- linux-2.6.23-rc8-mm2.orig/mm/vmscan.c
+++ linux-2.6.23-rc8-mm2/mm/vmscan.c
@@ -50,6 +50,7 @@ struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
+	unsigned long nr_dirty_writeback;

 	/* This context's GFP mask */
 	gfp_t gfp_mask;
@@ -558,8 +559,10 @@ static unsigned long shrink_page_list(st
 			case PAGE_ACTIVATE:
 				goto activate_locked;
 			case PAGE_SUCCESS:
-				if (PageWriteback(page) || PageDirty(page))
+				if (PageWriteback(page) || PageDirty(page)) {
+					sc->nr_dirty_writeback++;
 					goto keep;
+				}
 				/*
 				 * A synchronous write - probably a ramdisk.  Go
 				 * ahead and try to reclaim the page.
@@ -620,6 +623,10 @@ keep_locked:
 keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page));
+		if (PageLocked(page) && PageWriteback(page)) {
+			SetPageReclaim(page);
+			sc->nr_dirty_writeback++;
+		}
 	}
 	list_splice(&ret_pages, page_list);
 	if (pagevec_count(&freed_pvec))
@@ -1184,7 +1191,8 @@ static unsigned long shrink_zone(int pri
 		}
 	}

-	throttle_vm_writeout(sc->gfp_mask);
+	if (!nr_reclaimed && sc->nr_dirty_writeback)
+		zone_rotate_wait(zone, HZ/100);

 	return nr_reclaimed;
 }

--- linux-2.6.23-rc8-mm2.orig/mm/swap.c
+++ linux-2.6.23-rc8-mm2/mm/swap.c
@@ -174,6 +174,19 @@ int rotate_reclaimable_page(struct page
 	return 0;
 }

+long zone_rotate_wait(struct zone *z, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
Re: [PATCH] remove throttle_vm_writeout()
On Fri, Oct 05, 2007 at 08:30:28PM +0800, Fengguang Wu wrote:
> The improvement could be:
> - kswapd is now explicitly preferred to do the writeout;

Careful. kswapd is much less efficient at writeout than pdflush
because it does not do low->high offset writeback per address space.
It just flushes the pages in LRU order and that turns writeback into
a non-sequential mess. I/O sizes decrease substantially and
throughput falls through the floor.

So if you want kswapd to take over all the writeback, it needs to do
writeback in the same manner as the background flushes, i.e. by
grabbing page->mapping and flushing that in sequential order rather
than just the page on the end of the LRU.

I documented the effect of kswapd taking over writeback in this
paper (section 5.3):

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH] remove throttle_vm_writeout()
On Fri, Oct 05, 2007 at 10:20:05AM -0700, Andrew Morton wrote:
> On Fri, 5 Oct 2007 20:30:28 +0800
> Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
> > > Author: akpm
> > > Date: Sun Dec 22 01:07:33 2002 +
> > >
> > > [PATCH] Give kswapd writeback higher priority than pdflush
> > >
> > > The `low latency page reclaim' design works by preventing page
> > > allocators from blocking on request queues (and by preventing them
> > > from blocking against writeback of individual pages, but that is
> > > immaterial here).
> > >
> > > This has a problem under some situations. pdflush (or a write(2)
> > > caller) could be saturating the queue with highmem pages. This
> > > prevents anyone from writing back ZONE_NORMAL pages. We end up doing
> > > enormous amounts of scanning.
> >
> > Sorry, I cannot understand it. We now have balanced aging between
> > zones. So the page allocations are expected to distribute proportionally
> > between ZONE_HIGHMEM and ZONE_NORMAL?
>
> Sure, but we don't have one disk queue per disk per zone! The queue is
> shared by all the zones. So if writeback from one zone has filled the
> queue up, the kernel can't write back data from another zone.

Hmm, that's a problem. But I guess when one zone is full, other zones
will not be far away... It's a "sooner or later" problem.

> (Well, it can, by blocking in get_request_wait(), but that causes long and
> uncontrollable latencies).

I guess PF_SWAPWRITE processes still have a good probability of getting
stuck in get_request_wait(), because balance_dirty_pages() is allowed to
disregard the congestion. It will be exhausting the available request
slots all the time.
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |    1 +
 1 file changed, 1 insertion(+)

--- linux-2.6.23-rc8-mm2.orig/mm/page-writeback.c
+++ linux-2.6.23-rc8-mm2/mm/page-writeback.c
@@ -400,6 +400,7 @@ static void balance_dirty_pages(struct a
 			.sync_mode	= WB_SYNC_NONE,
 			.older_than_this = NULL,
 			.nr_to_write	= write_chunk,
+			.nonblocking	= 1,
 			.range_cyclic	= 1,
 		};

> > > A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
> > > then kill the mmapping applications. The machine instantly goes from
> > > 0% of memory dirty to 95% or more. pdflush kicks in and starts writing
> > > the least-recently-dirtied pages, which are all highmem. The queue is
> > > congested so nobody will write back ZONE_NORMAL pages. kswapd chews
> > > 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
> > > efficiency (pages_reclaimed/pages_scanned) falls to 2%.
> > >
> > > So this patch changes the policy for kswapd. kswapd may use all of a
> > > request queue, and is prepared to block on request queues.
> > >
> > > What will now happen in the above scenario is:
> > >
> > > 1: The page allocator scans some pages, fails to reclaim enough
> > >    memory and takes a nap in blk_congestion_wait().
> > >
> > > 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
> > >    back pages. (These pages will be rotated to the tail of the
> > >    inactive list at IO-completion interrupt time).
> > >
> > >    This writeback will saturate the queue with ZONE_NORMAL pages.
> > >    Conveniently, pdflush will avoid the congested queues. So we end up
> > >    writing the correct pages.
> > >
> > > In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
> > > efficiency rises from 2% to 40% and things are generally a lot happier.
> >
> > We may see the same problem and improvement in the absence of the 'all
> > writeback goes to one zone' assumption.
> >
> > The problem could be:
> > - dirty_thresh is exceeded, so balance_dirty_pages() starts syncing
> >   data and quickly _congests_ the queue;
>
> Or someone ran fsync(), or pdflush is writing back data because it exceeded
> dirty_writeback_centisecs, etc.

Ah, yes.

> > - dirty pages are slowly but continuously turned into clean pages by
> >   balance_dirty_pages(), but they still stay in the same place in the LRU;
> > - the zones are mostly dirty/writeback pages, so kswapd has a hard time
> >   finding the randomly distributed clean pages;
> > - kswapd cannot do the writeout because the queue is congested!
> >
> > The improvement could be:
> > - kswapd is now explicitly preferred to do the writeout;
> > - the pages written by kswapd will be rotated and easy for kswapd to
> >   reclaim;
> > - it becomes possible for kswapd to wait for the congested queue,
> >   instead of doing the vmscan like mad.
>
> Yeah. In 2.4 and early 2.5, page-reclaim (both direct reclaim and kswapd,
> iirc)
Re: [PATCH] remove throttle_vm_writeout()
On Fri, Oct 05, 2007 at 08:32:19PM +0200, Peter Zijlstra wrote:
>
> On Fri, 2007-10-05 at 13:50 -0400, Trond Myklebust wrote:
> > On Fri, 2007-10-05 at 12:57 +0200, Peter Zijlstra wrote:
> > > In this patch I totally ignored unstable, but I'm not sure that's the
> > > proper thing to do. I'd need to figure out what happens to an unstable
> > > page when passed into pageout() - or if it's passed to pageout() at all.
> > >
> > > If unstable pages would be passed to pageout(), and it would properly
> > > convert them to writeback and clean them, then there is nothing wrong.
> >
> > Why would we want to do that? That would be a hell of a lot of work
> > (locking pages, setting flags, unlocking pages, ...) for absolutely no
> > reason.
> >
> > Unstable writes are writes which have been sent to the server, but which
> > haven't been written to disk on the server. A single RPC command is then
> > sent (COMMIT) which basically tells the server to call fsync(). After
> > that is successful, we can free up the pages, but we do that with no
> > extra manipulation of the pages themselves: no page locks, just removal
> > from the NFS private radix tree, and freeing up of the NFS private
> > structures.
> >
> > We only need to touch the pages again in the unlikely case that the
> > COMMIT fails because the server has rebooted. In this case we have to
> > resend the writes, and so the pages are marked as dirty, so we can go
> > through the whole writepages() rigmarole again...
> >
> > So, no. I don't see sending pages through pageout() as being at all
> > helpful.
>
> Well, the thing is, we throttle pageout in throttle_vm_writeout(). As it
> stands we can deadlock there because it just waits for the numbers to
> drop, and unstable pages don't automagically disappear. Only
> write_inodes() - normally called from balance_dirty_pages() - will call
> COMMIT.

I wonder whether

	if (!bdi_nr_writeback)
		break;

or something like that could avoid the deadlock?
> So my thought was that calling pageout() on an unstable page would do
> the COMMIT - we're low on memory, otherwise we would not be paging, so
> getting rid of unstable pages seems to make sense to me.

I guess "many unstable pages" would be a better condition, if we are
taking this route.
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 15:23 -0400, Trond Myklebust wrote:
> On Fri, 2007-10-05 at 15:20 -0400, Trond Myklebust wrote:
> > On Fri, 2007-10-05 at 20:32 +0200, Peter Zijlstra wrote:
> > > Well, the thing is, we throttle pageout in throttle_vm_writeout(). As it
> > > stands we can deadlock there because it just waits for the numbers to
> > > drop, and unstable pages don't automagically disappear. Only
> > > write_inodes() - normally called from balance_dirty_pages() - will call
> > > COMMIT.
> > >
> > > So my thought was that calling pageout() on an unstable page would do
> > > the COMMIT - we're low on memory, otherwise we would not be paging, so
> > > getting rid of unstable pages seems to make sense to me.
> >
> > Why not rather track which mappings have large numbers of outstanding
> > unstable writes at the VM level, and then add some form of callback to
> > allow it to notify the filesystem when it needs to flush them out?

That would be nice; it's just that the pageout throttling is not quite
that sophisticated. But we'll see what we can come up with.

> BTW: Please note that at least in the case of NFS, you will have to
> allow for the fact that the filesystem may not be _able_ to cause the
> numbers to drop. If the server is unavailable, then we may be stuck
> in unstable page limbo for quite some time.

Agreed. It would be nice if that were handled in such a manner that it
does not take down all other paging. The regular write path, which only
bothers with balance_dirty_pages(), already does this nicely.
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 09:32:57 +0200
Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> I think just adding nr_cpus * ratelimit_pages to the dirty_thresh in
> throttle_vm_writeout() will also solve the problem.

Agreed, that should fix the main latency issues.

-- 
All Rights Reversed
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 15:20 -0400, Trond Myklebust wrote:
> On Fri, 2007-10-05 at 20:32 +0200, Peter Zijlstra wrote:
> > Well, the thing is, we throttle pageout in throttle_vm_writeout(). As it
> > stands we can deadlock there because it just waits for the numbers to
> > drop, and unstable pages don't automagically disappear. Only
> > write_inodes() - normally called from balance_dirty_pages() - will call
> > COMMIT.
> >
> > So my thought was that calling pageout() on an unstable page would do
> > the COMMIT - we're low on memory, otherwise we would not be paging, so
> > getting rid of unstable pages seems to make sense to me.
>
> Why not rather track which mappings have large numbers of outstanding
> unstable writes at the VM level, and then add some form of callback to
> allow it to notify the filesystem when it needs to flush them out?
>
> Cheers
>   Trond

BTW: Please note that at least in the case of NFS, you will have to
allow for the fact that the filesystem may not be _able_ to cause the
numbers to drop. If the server is unavailable, then we may be stuck
in unstable page limbo for quite some time.

Trond
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 20:32 +0200, Peter Zijlstra wrote:
> Well, the thing is, we throttle pageout in throttle_vm_writeout(). As it
> stands we can deadlock there because it just waits for the numbers to
> drop, and unstable pages don't automagically disappear. Only
> write_inodes() - normally called from balance_dirty_pages() - will call
> COMMIT.
>
> So my thought was that calling pageout() on an unstable page would do
> the COMMIT - we're low on memory, otherwise we would not be paging, so
> getting rid of unstable pages seems to make sense to me.

Why not rather track which mappings have large numbers of outstanding
unstable writes at the VM level, and then add some form of callback to
allow it to notify the filesystem when it needs to flush them out?

Cheers
  Trond
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 13:50 -0400, Trond Myklebust wrote:
> On Fri, 2007-10-05 at 12:57 +0200, Peter Zijlstra wrote:
> > In this patch I totally ignored unstable, but I'm not sure that's the
> > proper thing to do. I'd need to figure out what happens to an unstable
> > page when passed into pageout() - or if it's passed to pageout() at all.
> >
> > If unstable pages would be passed to pageout(), and it would properly
> > convert them to writeback and clean them, then there is nothing wrong.
>
> Why would we want to do that? That would be a hell of a lot of work
> (locking pages, setting flags, unlocking pages, ...) for absolutely no
> reason.
>
> Unstable writes are writes which have been sent to the server, but which
> haven't been written to disk on the server. A single RPC command is then
> sent (COMMIT) which basically tells the server to call fsync(). After
> that is successful, we can free up the pages, but we do that with no
> extra manipulation of the pages themselves: no page locks, just removal
> from the NFS private radix tree, and freeing up of the NFS private
> structures.
>
> We only need to touch the pages again in the unlikely case that the
> COMMIT fails because the server has rebooted. In this case we have to
> resend the writes, and so the pages are marked as dirty, so we can go
> through the whole writepages() rigmarole again...
>
> So, no. I don't see sending pages through pageout() as being at all
> helpful.

Well, the thing is, we throttle pageout in throttle_vm_writeout(). As it
stands we can deadlock there because it just waits for the numbers to
drop, and unstable pages don't automagically disappear. Only
write_inodes() - normally called from balance_dirty_pages() - will call
COMMIT.

So my thought was that calling pageout() on an unstable page would do
the COMMIT - we're low on memory, otherwise we would not be paging, so
getting rid of unstable pages seems to make sense to me.
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 12:57 +0200, Peter Zijlstra wrote:
> In this patch I totally ignored unstable, but I'm not sure that's the
> proper thing to do. I'd need to figure out what happens to an unstable
> page when passed into pageout() - or if it's passed to pageout() at all.
>
> If unstable pages would be passed to pageout(), and it would properly
> convert them to writeback and clean them, then there is nothing wrong.

Why would we want to do that? That would be a hell of a lot of work
(locking pages, setting flags, unlocking pages, ...) for absolutely no
reason.

Unstable writes are writes which have been sent to the server, but which
haven't been written to disk on the server. A single RPC command is then
sent (COMMIT) which basically tells the server to call fsync(). After
that is successful, we can free up the pages, but we do that with no
extra manipulation of the pages themselves: no page locks, just removal
from the NFS private radix tree, and freeing up of the NFS private
structures.

We only need to touch the pages again in the unlikely case that the
COMMIT fails because the server has rebooted. In this case we have to
resend the writes, and so the pages are marked as dirty, so we can go
through the whole writepages() rigmarole again...

So, no. I don't see sending pages through pageout() as being at all
helpful.

Trond
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 5 Oct 2007 20:30:28 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
> > Author: akpm
> > Date: Sun Dec 22 01:07:33 2002 +
> >
> > [PATCH] Give kswapd writeback higher priority than pdflush
> >
> > The `low latency page reclaim' design works by preventing page
> > allocators from blocking on request queues (and by preventing them from
> > blocking against writeback of individual pages, but that is immaterial
> > here).
> >
> > This has a problem under some situations. pdflush (or a write(2)
> > caller) could be saturating the queue with highmem pages. This
> > prevents anyone from writing back ZONE_NORMAL pages. We end up doing
> > enormous amounts of scanning.
>
> Sorry, I cannot understand it. We now have balanced aging between
> zones. So the page allocations are expected to distribute proportionally
> between ZONE_HIGHMEM and ZONE_NORMAL?

Sure, but we don't have one disk queue per disk per zone! The queue is
shared by all the zones. So if writeback from one zone has filled the
queue up, the kernel can't write back data from another zone.

(Well, it can, by blocking in get_request_wait(), but that causes long and
uncontrollable latencies).

> > A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
> > then kill the mmapping applications. The machine instantly goes from
> > 0% of memory dirty to 95% or more. pdflush kicks in and starts writing
> > the least-recently-dirtied pages, which are all highmem. The queue is
> > congested so nobody will write back ZONE_NORMAL pages. kswapd chews
> > 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
> > efficiency (pages_reclaimed/pages_scanned) falls to 2%.
> >
> > So this patch changes the policy for kswapd. kswapd may use all of a
> > request queue, and is prepared to block on request queues.
> >
> > What will now happen in the above scenario is:
> >
> > 1: The page allocator scans some pages, fails to reclaim enough
> >    memory and takes a nap in blk_congestion_wait().
> >
> > 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
> >    back pages. (These pages will be rotated to the tail of the
> >    inactive list at IO-completion interrupt time).
> >
> >    This writeback will saturate the queue with ZONE_NORMAL pages.
> >    Conveniently, pdflush will avoid the congested queues. So we end up
> >    writing the correct pages.
> >
> > In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
> > efficiency rises from 2% to 40% and things are generally a lot happier.
>
> We may see the same problem and improvement in the absence of the 'all
> writeback goes to one zone' assumption.
>
> The problem could be:
> - dirty_thresh is exceeded, so balance_dirty_pages() starts syncing
>   data and quickly _congests_ the queue;

Or someone ran fsync(), or pdflush is writing back data because it exceeded
dirty_writeback_centisecs, etc.

> - dirty pages are slowly but continuously turned into clean pages by
>   balance_dirty_pages(), but they still stay in the same place in the LRU;
> - the zones are mostly dirty/writeback pages, so kswapd has a hard time
>   finding the randomly distributed clean pages;
> - kswapd cannot do the writeout because the queue is congested!
>
> The improvement could be:
> - kswapd is now explicitly preferred to do the writeout;
> - the pages written by kswapd will be rotated and easy for kswapd to
>   reclaim;
> - it becomes possible for kswapd to wait for the congested queue,
>   instead of doing the vmscan like mad.

Yeah. In 2.4 and early 2.5, page-reclaim (both direct reclaim and kswapd,
iirc) would throttle by waiting on writeout of a particular page. This was
a poor design, because writeback against a *particular* page can take
anywhere from one millisecond to thirty seconds to complete, depending upon
where the disk head is and all that stuff.
The critical change I made was to switch the throttling algorithm from
"wait for one page to get written" to "wait for _any_ page to get
written", because reclaim really doesn't care _which_ page got written:
we want to wake up and start scanning again when _any_ page got written.

That's what congestion_wait() does. It is pretty crude. It could be that
writeback completed against pages which aren't in the correct zone, or it
could be that some other task went and allocated the just-cleaned pages
before this task can get running and reclaim them, or it could be that the
just-written-back pages weren't reclaimable after all, etc.

It would take a mind-boggling amount of logic and locking to make all this
100% accurate and the need has never been demonstrated. So page reclaim
presently should be viewed as a polling algorithm, where the rate of
polling is paced by the rate at which the IO system can retire
Re: [PATCH] remove throttle_vm_writeout()
>> I think that's an improvement in all respects.
>>
>> However it still does not generally address the deadlock scenario: if
>> there's a small DMA zone, and fuse manages to put all of those pages
>> under writeout, then there's trouble.

Miklos> And the only way to solve that AFAICS, is to make sure fuse
Miklos> never uses more than e.g. 50% of _any_ zone for page cache.
Miklos> And that may need some tweaking in the allocator...

So what happens if I have three different FUSE mounts, all under heavy
write pressure? It's not a FUSE problem, it's a VM problem as far as I
can see. All I did was extrapolate from the 50% number (where did that
come from?) and triple it to go over 100%, since we obviously shouldn't
take 100% of any zone, right?

So the real cure is to have some way to rate-limit zone usage, making it
harder and harder to allocate in a zone as the zone gets more and more
full. But how do you do this in a non-deadlocky way?

But hey, I'm not that knowledgeable about the VM.
Re: [PATCH] remove throttle_vm_writeout()
On Thu, Oct 04, 2007 at 10:46:50AM -0700, Andrew Morton wrote: > On Thu, 04 Oct 2007 18:47:07 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > > > > > > But that said, there might be better ways to do that. > > > > > > > > > > Sure, if we do need to globally limit the number of under-writeback > > > > > pages, then I think we need to do it independently of the dirty > > > > > accounting. > > > > > > > > It need not be global, it could be per BDI as well, but yes. > > > > > > For per-bdi limits we have the queue length. > > > > Agreed, except for: > > > > static int may_write_to_queue(struct backing_dev_info *bdi) > > { > > if (current->flags & PF_SWAPWRITE) > > return 1; > > if (!bdi_write_congested(bdi)) > > return 1; > > if (bdi == current->backing_dev_info) > > return 1; > > return 0; > > } > > > > Which will write to congested queues. Anybody know why? > > > commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09 > Author: akpm > Date: Sun Dec 22 01:07:33 2002 + > > [PATCH] Give kswapd writeback higher priority than pdflush > > The `low latency page reclaim' design works by preventing page > allocators from blocking on request queues (and by preventing them from > blocking against writeback of individual pages, but that is immaterial > here). > > This has a problem under some situations. pdflush (or a write(2) > caller) could be saturating the queue with highmem pages. This > prevents anyone from writing back ZONE_NORMAL pages. We end up doing > enormous amounts of scanning. Sorry, I cannot understand it. We now have balanced aging between zones. So the page allocations are expected to distribute proportionally between ZONE_HIGHMEM and ZONE_NORMAL? > A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory, > then kill the mmapping applications. The machine instantly goes from > 0% of memory dirty to 95% or more. pdflush kicks in and starts writing > the least-recently-dirtied pages, which are all highmem. 
The queue is > congested so nobody will write back ZONE_NORMAL pages. kswapd chews > 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim > efficiency (pages_reclaimed/pages_scanned) falls to 2%. > > So this patch changes the policy for kswapd. kswapd may use all of a > request queue, and is prepared to block on request queues. > > What will now happen in the above scenario is: > > 1: The page allocator scans some pages, fails to reclaim enough >memory and takes a nap in blk_congestion_wait(). > > 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing >back pages. (These pages will be rotated to the tail of the >inactive list at IO-completion interrupt time). > >This writeback will saturate the queue with ZONE_NORMAL pages. >Conveniently, pdflush will avoid the congested queues. So we end up >writing the correct pages. > > In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim > efficiency rises from 2% to 40% and things are generally a lot happier. We may see the same problem and improvement in the absence of the 'all writeback goes to one zone' assumption. The problem could be:
- dirty_thresh is exceeded, so balance_dirty_pages() starts syncing data and quickly _congests_ the queue;
- dirty pages are slowly but continuously turned into clean pages by balance_dirty_pages(), but they still stay in the same place in the LRU;
- the zones are mostly dirty/writeback pages, kswapd has a hard time finding the randomly distributed clean pages;
- kswapd cannot do the writeout because the queue is congested!

The improvement could be:
- kswapd is now explicitly preferred to do the writeout;
- the pages written by kswapd will be rotated and easy for kswapd to reclaim;
- it becomes possible for kswapd to wait for the congested queue, instead of doing the vmscan like mad.

The congestion wait looks like a pretty natural way to throttle the kswapd. 
Instead of doing the vmscan at 1000MB/s and actually freeing pages at 60MB/s (about the write throughput), kswapd will be relaxed to do vmscan at maybe 150MB/s. Fengguang --- > The downside is that kswapd may now do a lot less page reclaim, > increasing page allocation latency, causing more direct reclaim, > increasing lock contention in the VM, etc. But I have not been able to > demonstrate that in testing. > > > The other problem is that there is only one kswapd, and there are lots > of disks. That is a generic problem - without being able to co-opt > user processes we don't have enough threads to keep lots of disks > saturated. > > One fix for this would be to add an additional "really congested" > threshold in the request queues, so kswapd can still perform > nonblocking writeout. This gives
Re: [PATCH] remove throttle_vm_writeout()
> Limiting FUSE to say 50% (suggestion from your other email) sounds like > a horrible hack to me. - Need more time to think on this. I don't really understand all that page balancing stuff, but I think this will probably never or very rarely happen, because the allocator will prefer the bigger zones, and the dirty page limiting will not let the bigger zones get too full of dirty pages. And even if it can happen, it's not necessarily a fuse-only thing. It makes tons of sense to make sure that we don't fully dirty _any_ specialized zone. One special zone group are the low-memory pages. And currently balance_dirty_pages() makes sure we don't fill that up with dirty file backed pages. So something like that should make sense for other special zones like DMA as well. I'm not saying it's trivial, or even possible to implement, just thinking... Miklos
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 12:27 +0200, Miklos Szeredi wrote: > > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > > index 4ef4d22..eff2438 100644 > > --- a/include/linux/writeback.h > > +++ b/include/linux/writeback.h > > @@ -88,7 +88,7 @@ static inline void wait_on_inode(struct inode *inode) > > int wakeup_pdflush(long nr_pages); > > void laptop_io_completion(void); > > void laptop_sync_completion(void); > > -void throttle_vm_writeout(gfp_t gfp_mask); > > +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask); > > > > /* These are exported to sysctl. */ > > extern int dirty_background_ratio; > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > > index eec1481..f949997 100644 > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -326,11 +326,8 @@ void balance_dirty_pages_ratelimited_nr(struct > > address_space *mapping, > > } > > EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); > > > > -void throttle_vm_writeout(gfp_t gfp_mask) > > +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask) > > { > > - long background_thresh; > > - long dirty_thresh; > > - > > if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) { > > /* > > * The caller might hold locks which can prevent IO completion > > @@ -342,17 +339,16 @@ void throttle_vm_writeout(gfp_t gfp_mask) > > } > > > > for ( ; ; ) { > > - get_dirty_limits(&background_thresh, &dirty_thresh, NULL); > > + unsigned long thresh = zone_page_state(zone, NR_ACTIVE) + > > + zone_page_state(zone, NR_INACTIVE); > > > > -/* > > - * Boost the allowable dirty threshold a bit for page > > - * allocators so they don't get DoS'ed by heavy writers > > - */ > > -dirty_thresh += dirty_thresh / 10; /* wh... 
*/ > > + /* > >* wait when 75% of the zone's pages are under writeback > >*/ > > + thresh -= thresh >> 2; > > + if (zone_page_state(zone, NR_WRITEBACK) < thresh) > > + break; > > > > -if (global_page_state(NR_UNSTABLE_NFS) + > > - global_page_state(NR_WRITEBACK) <= dirty_thresh) > > - break; > > congestion_wait(WRITE, HZ/10); > > } > > } > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 1be5a63..7dd6bd9 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -948,7 +948,7 @@ static unsigned long shrink_zone(int priority, struct > > zone *zone, > > } > > } > > > > - throttle_vm_writeout(sc->gfp_mask); > > + throttle_vm_writeout(zone, sc->gfp_mask); > > > > atomic_dec(&zone->reclaim_in_progress); > > return nr_reclaimed; > > > > > > I think that's an improvement in all respects. > > However it still does not generally address the deadlock scenario: if > there's a small DMA zone, and fuse manages to put all of those pages > under writeout, then there's trouble. > > But it's not really fuse specific. If it was a normal filesystem that > did that, and it needed a GFP_DMA allocation for writeout, it is in > trouble also, as that allocation would fail (at least no deadlock). > > Or is GFP_DMA never used by fs/io writeout paths? I agree that it's not complete. (hence the lack of sign-off etc.) 'normally' writeback pages just need an interrupt to signal the stuff is written back. ie. the writeback completion path is atomic. [ result of the thinking below -- the above is esp. true for swap pages - so maybe we should ensure that !swap traffic can never exceed this 75% - that way, swap can always get us out of a tight spot ] Maybe we should make that a requirement, although I see how that becomes rather hard for FUSE (which is where Andrew's PF_MEMALLOC suggestion comes from - but I really dislike PF_MEMALLOC exposed to userspace). Limiting FUSE to say 50% (suggestion from your other email) sounds like a horrible hack to me. - Need more time to think on this. .. 
o O [ process of thinking ] While I think, I might as well tell the problem I had with throttle_vm_writeout(), which is nfs unstable pages. Those don't go away by waiting for it - as throttle_vm_writeout() does. So once you have an excess of those, you're stuck as well. In this patch I totally ignored unstable, but I'm not sure that's the proper thing to do, I'd need to figure out what happens to an unstable page when passed into pageout() - or if it's passed to pageout at all. If unstable pages would be passed to pageout(), and it would properly convert them to writeback and clean them, then there is nothing wrong. (Trond?)
Re: [PATCH] remove throttle_vm_writeout()
> I think that's an improvement in all respects. > > However it still does not generally address the deadlock scenario: if > there's a small DMA zone, and fuse manages to put all of those pages > under writeout, then there's trouble. And the only way to solve that AFAICS, is to make sure fuse never uses more than e.g. 50% of _any_ zone for page cache. And that may need some tweaking in the allocator... Miklos
Re: [PATCH] remove throttle_vm_writeout()
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h > index 4ef4d22..eff2438 100644 > --- a/include/linux/writeback.h > +++ b/include/linux/writeback.h > @@ -88,7 +88,7 @@ static inline void wait_on_inode(struct inode *inode) > int wakeup_pdflush(long nr_pages); > void laptop_io_completion(void); > void laptop_sync_completion(void); > -void throttle_vm_writeout(gfp_t gfp_mask); > +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask); > > /* These are exported to sysctl. */ > extern int dirty_background_ratio; > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index eec1481..f949997 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -326,11 +326,8 @@ void balance_dirty_pages_ratelimited_nr(struct > address_space *mapping, > } > EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); > > -void throttle_vm_writeout(gfp_t gfp_mask) > +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask) > { > - long background_thresh; > - long dirty_thresh; > - > if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) { > /* >* The caller might hold locks which can prevent IO completion > @@ -342,17 +339,16 @@ void throttle_vm_writeout(gfp_t gfp_mask) > } > > for ( ; ; ) { > - get_dirty_limits(&background_thresh, &dirty_thresh, NULL); > + unsigned long thresh = zone_page_state(zone, NR_ACTIVE) + > + zone_page_state(zone, NR_INACTIVE); > > -/* > - * Boost the allowable dirty threshold a bit for page > - * allocators so they don't get DoS'ed by heavy writers > - */ > -dirty_thresh += dirty_thresh / 10; /* wh... 
*/ > + /* > + * wait when 75% of the zone's pages are under writeback > + */ > + thresh -= thresh >> 2; > + if (zone_page_state(zone, NR_WRITEBACK) < thresh) > + break; > > -if (global_page_state(NR_UNSTABLE_NFS) + > - global_page_state(NR_WRITEBACK) <= dirty_thresh) > - break; > congestion_wait(WRITE, HZ/10); > } > } > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 1be5a63..7dd6bd9 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -948,7 +948,7 @@ static unsigned long shrink_zone(int priority, struct > zone *zone, > } > } > > - throttle_vm_writeout(sc->gfp_mask); > + throttle_vm_writeout(zone, sc->gfp_mask); > > atomic_dec(&zone->reclaim_in_progress); > return nr_reclaimed; > > I think that's an improvement in all respects. However it still does not generally address the deadlock scenario: if there's a small DMA zone, and fuse manages to put all of those pages under writeout, then there's trouble. But it's not really fuse specific. If it was a normal filesystem that did that, and it needed a GFP_DMA allocation for writeout, it is in trouble also, as that allocation would fail (at least no deadlock). Or is GFP_DMA never used by fs/io writeout paths? Miklos
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 11:22 +0200, Miklos Szeredi wrote:
> > So how do we end up with more writeback pages than that? should we teach
> > pdflush about these limits as well?
>
> Ugh.
>
> I think we should rather fix vmscan to not spin when all pages of a
> zone are already under writeout. Which is the _real_ problem,
> according to Andrew.

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4ef4d22..eff2438 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -88,7 +88,7 @@ static inline void wait_on_inode(struct inode *inode)
 int wakeup_pdflush(long nr_pages);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index eec1481..f949997 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -326,11 +326,8 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask)
 {
-	long background_thresh;
-	long dirty_thresh;
-
 	if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -342,17 +339,16 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	}
 
 	for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		unsigned long thresh = zone_page_state(zone, NR_ACTIVE) +
+				       zone_page_state(zone, NR_INACTIVE);
 
-		/*
-		 * Boost the allowable dirty threshold a bit for page
-		 * allocators so they don't get DoS'ed by heavy writers
-		 */
-		dirty_thresh += dirty_thresh / 10;	/* wh... */
+		/*
+		 * wait when 75% of the zone's pages are under writeback
+		 */
+		thresh -= thresh >> 2;
+		if (zone_page_state(zone, NR_WRITEBACK) < thresh)
+			break;
 
-		if (global_page_state(NR_UNSTABLE_NFS) +
-		    global_page_state(NR_WRITEBACK) <= dirty_thresh)
-			break;
 		congestion_wait(WRITE, HZ/10);
 	}
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1be5a63..7dd6bd9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -948,7 +948,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 		}
 	}
 
-	throttle_vm_writeout(sc->gfp_mask);
+	throttle_vm_writeout(zone, sc->gfp_mask);
 
 	atomic_dec(&zone->reclaim_in_progress);
 	return nr_reclaimed;
Re: [PATCH] remove throttle_vm_writeout()
> So how do we end up with more writeback pages than that? should we teach > pdflush about these limits as well? Ugh. I think we should rather fix vmscan to not spin when all pages of a zone are already under writeout. Which is the _real_ problem, according to Andrew. Miklos
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 2007-10-04 at 17:48 -0700, Andrew Morton wrote: > On Fri, 05 Oct 2007 02:12:30 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > > > > > > I don't think I understand that. Sure, it _shouldn't_ be a problem. But > > > it > > > _is_. That's what we're trying to fix, isn't it? > > > > The problem, I believe is in the memory allocation code, not in fuse. > > fuse is trying to do something which page reclaim was not designed for. > Stuff broke. > > > In the example, memory allocation may be blocking indefinitely, > > because we have 4MB under writeback, even though 28MB can still be > > made available. And that _should_ be fixable. > > Well yes. But we need to work out how, without re-breaking the thing which > throttle_vm_writeout() fixed. I'm thinking the really_congested thing will also fix this. By only allowing a limited amount of extra writeback. > > > > So the only thing the kernel should be careful about, is not to block > > > > on an allocation if not strictly necessary. > > > > > > > > Actually a trivial fix for this problem could be to just tweak the > > > > thresholds, so to make the above scenario impossible. Although I'm > > > > still not convinced this patch is perfect, because the dirty > > > > threshold can actually change in time... 
> > > > > > > > Index: linux/mm/page-writeback.c > > > > === > > > > --- linux.orig/mm/page-writeback.c 2007-10-05 00:31:01.0 > > > > +0200 > > > > +++ linux/mm/page-writeback.c 2007-10-05 00:50:11.0 +0200 > > > > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask > > > > for ( ; ; ) { > > > > get_dirty_limits(&background_thresh, &dirty_thresh, > > > > NULL, NULL); > > > > > > > > + /* > > > > +* Make sure the threshold is over the hard limit of > > > > +* dirty_thresh + ratelimit_pages * nr_cpus > > > > +*/ > > > > + dirty_thresh += ratelimit_pages * num_online_cpus(); > > > > + > > > > /* > > > > * Boost the allowable dirty threshold a bit for page > > > > * allocators so they don't get DoS'ed by heavy writers > > > > > > I can probably kind of guess what you're trying to do here. But if > > > ratelimit_pages * num_online_cpus() exceeds the size of the offending zone > > > then things might go bad. > > > > I think the admin can do quite a bit of other damage, by setting > > dirty_ratio too high. > > > > Maybe this writeback throttling should just have a fixed limit of 80% > > ZONE_NORMAL, and limit dirty_ratio to something like 50%. > > Bear in mind that the same problem will occur for the 16MB ZONE_DMA, and > we cannot limit the system-wide dirty-memory threshold to 12MB. > > iow, throttle_vm_writeout() needs to become zone-aware. Then it only > throttles when, say, 80% of ZONE_FOO is under writeback. As it stands 110% of dirty limit can already be larger than say zone_dma (and likely is), so that is not a new bug - and I don't think it's the thing Miklos runs into. The problem Miklos is seeing (and I, just in a different form), is that throttle_vm_writeout() gets stuck because balance_dirty_pages() gets called once every ratelimit_pages (per cpu). So we can have nr_cpus * ratelimit_pages extra. /me thinks ok I confused myself. by calling balance_dirty_pages() once every ratelimit_pages (per cpu) allows for nr_cpus() * ratelimit_pages extra _dirty_ pages. 
But balance_dirty_pages() will make it:

  nr_dirty + nr_unstable + nr_writeback < thresh

So even if it writes out all of the dirty pages, we still have:

  nr_unstable + nr_writeback < thresh

So at any one time nr_writeback should not exceed thresh. But it does!? So how do we end up with more writeback pages than that? should we teach pdflush about these limits as well?
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 2007-10-04 at 16:09 -0700, Andrew Morton wrote: > On Fri, 05 Oct 2007 00:39:16 +0200 > Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > > > throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps > > > fixing > > > that would fix your deadlock. That's doubtful, but I don't know anything > > > about your deadlock so I cannot say. > > > > No, doing the throttling per-zone won't in itself fix the deadlock. > > > > Here's a deadlock example: > > > > Total memory = 32M > > /proc/sys/vm/dirty_ratio = 10 > > dirty_threshold = 3M > > ratelimit_pages = 1M > > > > Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap on > > a fuse fs. Page balancing is called which turns all these into > > writeback pages. > > > > Then userspace filesystem gets a write request, and tries to allocate > > memory needed to complete the writeout. > > > > That will possibly trigger direct reclaim, and throttle_vm_writeout() > > will be called. That will block until nr_writeback goes below 3.3M > > (dirty_threshold + 10%). But since all 4M of writeback is from the > > fuse fs, that will never happen. > > > > Does that explain it better? > > > > yup, thanks. > > This is a somewhat general problem: a userspace process is in the IO path. > Userspace block drivers, for example - pretty much anything which involves > kernel->userspace upcalls for storage applications. > > I solved it once in the past by marking the userspace process as > PF_MEMALLOC and I believe that others have implemented the same hack. > > I suspect that what we need is a general solution, and that the solution > will involve explicitly telling the kernel that this process is one which > actually cleans memory and needs special treatment. > > Because I bet there will be other corner-cases where such a process needs > kernel help, and there might be optimisation opportunities as well. 
> > Problem is, any such mark-me-as-special syscall would need to be > privileged, and FUSE servers presently don't require special perms (do > they?) I think just adding nr_cpus * ratelimit_pages to the dirty_thresh in throttle_vm_writeout() will also solve the problem.
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 2007-10-04 at 16:09 -0700, Andrew Morton wrote: On Fri, 05 Oct 2007 00:39:16 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote: throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps fixing that would fix your deadlock. That's doubtful, but I don't know anything about your deadlock so I cannot say. No, doing the throttling per-zone won't in itself fix the deadlock. Here's a deadlock example: Total memory = 32M /proc/sys/vm/dirty_ratio = 10 dirty_threshold = 3M ratelimit_pages = 1M Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap on a fuse fs. Page balancing is called which turns all these into writeback pages. Then userspace filesystem gets a write request, and tries to allocate memory needed to complete the writeout. That will possibly trigger direct reclaim, and throttle_vm_writeout() will be called. That will block until nr_writeback goes below 3.3M (dirty_threshold + 10%). But since all 4M of writeback is from the fuse fs, that will never happen. Does that explain it better? yup, thanks. This is a somewhat general problem: a userspace process is in the IO path. Userspace block drivers, for example - pretty much anything which involves kernel-userspace upcalls for storage applications. I solved it once in the past by marking the userspace process as PF_MEMALLOC and I beleive that others have implemented the same hack. I suspect that what we need is a general solution, and that the solution will involve explicitly telling the kernel that this process is one which actually cleans memory and needs special treatment. Because I bet there will be other corner-cases where such a process needs kernel help, and there might be optimisation opportunities as well. Problem is, any such mark-me-as-special syscall would need to be privileged, and FUSE servers presently don't require special perms (do they?) 
I think just adding nr_cpus * ratelimit_pages to the dirth_thresh in throttle_vm_writeout() will also solve the problem signature.asc Description: This is a digitally signed message part
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 2007-10-04 at 17:48 -0700, Andrew Morton wrote: On Fri, 05 Oct 2007 02:12:30 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote: I don't think I understand that. Sure, it _shouldn't_ be a problem. But it _is_. That's what we're trying to fix, isn't it? The problem, I believe is in the memory allocation code, not in fuse. fuse is trying to do something which page reclaim was not designed for. Stuff broke. In the example, memory allocation may be blocking indefinitely, because we have 4MB under writeback, even though 28MB can still be made available. And that _should_ be fixable. Well yes. But we need to work out how, without re-breaking the thing which throttle_vm_writeout() fixed. I'm thinking the really_congested thing will also fix this. By only allowing a limited amount of extra writeback. So the only thing the kernel should be careful about, is not to block on an allocation if not strictly necessary. Actually a trivial fix for this problem could be to just tweak the thresholds, so to make the above scenario impossible. Although I'm still not convinced, this patch is perfect, because the dirty threshold can actually change in time... Index: linux/mm/page-writeback.c === --- linux.orig/mm/page-writeback.c 2007-10-05 00:31:01.0 +0200 +++ linux/mm/page-writeback.c 2007-10-05 00:50:11.0 +0200 @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask for ( ; ; ) { get_dirty_limits(background_thresh, dirty_thresh, NULL, NULL); + /* +* Make sure the theshold is over the hard limit of +* dirty_thresh + ratelimit_pages * nr_cpus +*/ + dirty_thresh += ratelimit_pages * num_online_cpus(); + /* * Boost the allowable dirty threshold a bit for page * allocators so they don't get DoS'ed by heavy writers I can probably kind of guess what you're trying to do here. But if ratelimit_pages * num_online_cpus() exceeds the size of the offending zone then things might go bad. I think the admin can do quite a bit of other damage, by setting dirty_ratio too high. 
Maybe this writeback throttling should just have a fixed limit of 80% ZONE_NORMAL, and limit dirty_ratio to something like 50%. Bear in mind that the same problem will occur for the 16MB ZONE_DMA, and we cannot limit the system-wide dirty-memory threshold to 12MB. iow, throttle_vm_writeout() needs to become zone-aware. Then it only throttles when, say, 80% of ZONE_FOO is under writeback. As it stand 110% of dirty limit can already be larger than say zone_dma (and likely is), so that is not a new bug - and I don't think its the thing Miklos runs into. The problem Miklos is seeing (and I, just in a different form), is that throttle_vm_writeout() gets stuck because balance_dirty_pages() gets called once every ratelimit_pages (per cpu). So we can have nr_cpus * ratelimit_pages extra. /me thinks ok I confused myself. by calling balance_dirty_pages() once every ratelimit_pages (per cpu) allows for nr_cpus() * ratelimit_pages extra _dirty_ pages. But balance_dirty_pages() will make it: nr_dirty + nr_unstable + nr_writeback thresh So even if it writes out all of the dirty pages, we still have: nr_unstable + nr_writeback thresh So at any one time nr_writeback should not exceed thresh. But it does!? So how do we end up with more writeback pages than that? should we teach pdflush about these limits as well? signature.asc Description: This is a digitally signed message part
Re: [PATCH] remove throttle_vm_writeout()
So how do we end up with more writeback pages than that? should we teach pdflush about these limits as well? Ugh. I think we should rather fix vmscan to not spin when all pages of a zone are already under writeout. Which is the _real_ problem, according to Andrew. Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 11:22 +0200, Miklos Szeredi wrote: So how do we end up with more writeback pages than that? should we teach pdflush about these limits as well? Ugh. I think we should rather fix vmscan to not spin when all pages of a zone are already under writeout. Which is the _real_ problem, according to Andrew. diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 4ef4d22..eff2438 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -88,7 +88,7 @@ static inline void wait_on_inode(struct inode *inode) int wakeup_pdflush(long nr_pages); void laptop_io_completion(void); void laptop_sync_completion(void); -void throttle_vm_writeout(gfp_t gfp_mask); +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask); /* These are exported to sysctl. */ extern int dirty_background_ratio; diff --git a/mm/page-writeback.c b/mm/page-writeback.c index eec1481..f949997 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -326,11 +326,8 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping, } EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); -void throttle_vm_writeout(gfp_t gfp_mask) +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask) { - long background_thresh; - long dirty_thresh; - if ((gfp_mask (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) { /* * The caller might hold locks which can prevent IO completion @@ -342,17 +339,16 @@ void throttle_vm_writeout(gfp_t gfp_mask) } for ( ; ; ) { - get_dirty_limits(background_thresh, dirty_thresh, NULL); + unsigned long thresh = zone_page_state(zone, NR_ACTIVE) + + zone_page_state(zone, NR_INACTIVE); -/* - * Boost the allowable dirty threshold a bit for page - * allocators so they don't get DoS'ed by heavy writers - */ -dirty_thresh += dirty_thresh / 10; /* wh... 
*/ + /* +* wait when 75% of the zone's pages are under writeback +*/ + thresh -= thresh 2; + if (zone_page_state(zone, NR_WRITEBACK) thresh) + break; -if (global_page_state(NR_UNSTABLE_NFS) + - global_page_state(NR_WRITEBACK) = dirty_thresh) - break; congestion_wait(WRITE, HZ/10); } } diff --git a/mm/vmscan.c b/mm/vmscan.c index 1be5a63..7dd6bd9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -948,7 +948,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone, } } - throttle_vm_writeout(sc-gfp_mask); + throttle_vm_writeout(zone, sc-gfp_mask); atomic_dec(zone-reclaim_in_progress); return nr_reclaimed; signature.asc Description: This is a digitally signed message part
Re: [PATCH] remove throttle_vm_writeout()
diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 4ef4d22..eff2438 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -88,7 +88,7 @@ static inline void wait_on_inode(struct inode *inode) int wakeup_pdflush(long nr_pages); void laptop_io_completion(void); void laptop_sync_completion(void); -void throttle_vm_writeout(gfp_t gfp_mask); +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask); /* These are exported to sysctl. */ extern int dirty_background_ratio; diff --git a/mm/page-writeback.c b/mm/page-writeback.c index eec1481..f949997 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -326,11 +326,8 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping, } EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); -void throttle_vm_writeout(gfp_t gfp_mask) +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask) { - long background_thresh; - long dirty_thresh; - if ((gfp_mask (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) { /* * The caller might hold locks which can prevent IO completion @@ -342,17 +339,16 @@ void throttle_vm_writeout(gfp_t gfp_mask) } for ( ; ; ) { - get_dirty_limits(background_thresh, dirty_thresh, NULL); + unsigned long thresh = zone_page_state(zone, NR_ACTIVE) + + zone_page_state(zone, NR_INACTIVE); -/* - * Boost the allowable dirty threshold a bit for page - * allocators so they don't get DoS'ed by heavy writers - */ -dirty_thresh += dirty_thresh / 10; /* wh... 
*/ + /* + * wait when 75% of the zone's pages are under writeback + */ + thresh -= thresh >> 2; + if (zone_page_state(zone, NR_WRITEBACK) > thresh) + break; - if (global_page_state(NR_UNSTABLE_NFS) + - global_page_state(NR_WRITEBACK) >= dirty_thresh) - break; congestion_wait(WRITE, HZ/10); } } diff --git a/mm/vmscan.c b/mm/vmscan.c index 1be5a63..7dd6bd9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -948,7 +948,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone, } } - throttle_vm_writeout(sc->gfp_mask); + throttle_vm_writeout(zone, sc->gfp_mask); atomic_dec(&zone->reclaim_in_progress); return nr_reclaimed; I think that's an improvement in all respects. However it still does not generally address the deadlock scenario: if there's a small DMA zone, and fuse manages to put all of those pages under writeout, then there's trouble. But it's not really fuse specific. If it was a normal filesystem that did that, and it needed a GFP_DMA allocation for writeout, it is in trouble also, as that allocation would fail (at least no deadlock). Or is GFP_DMA never used by fs/io writeout paths? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] remove throttle_vm_writeout()
On Thu, Oct 04, 2007 at 10:46:50AM -0700, Andrew Morton wrote: On Thu, 04 Oct 2007 18:47:07 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: But that said, there might be better ways to do that. Sure, if we do need to globally limit the number of under-writeback pages, then I think we need to do it independently of the dirty accounting. It need not be global, it could be per BDI as well, but yes. For per-bdi limits we have the queue length. Agreed, except for: static int may_write_to_queue(struct backing_dev_info *bdi) { if (current->flags & PF_SWAPWRITE) return 1; if (!bdi_write_congested(bdi)) return 1; if (bdi == current->backing_dev_info) return 1; return 0; } Which will write to congested queues. Anybody know why? commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09 Author: akpm akpm Date: Sun Dec 22 01:07:33 2002 + [PATCH] Give kswapd writeback higher priority than pdflush The `low latency page reclaim' design works by preventing page allocators from blocking on request queues (and by preventing them from blocking against writeback of individual pages, but that is immaterial here). This has a problem under some situations. pdflush (or a write(2) caller) could be saturating the queue with highmem pages. This prevents anyone from writing back ZONE_NORMAL pages. We end up doing enormous amounts of scanning. Sorry, I cannot understand it. We now have balanced aging between zones. So the page allocations are expected to distribute proportionally between ZONE_HIGHMEM and ZONE_NORMAL? A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory, then kill the mmapping applications. The machine instantly goes from 0% of memory dirty to 95% or more. pdflush kicks in and starts writing the least-recently-dirtied pages, which are all highmem. The queue is congested so nobody will write back ZONE_NORMAL pages. kswapd chews 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim efficiency (pages_reclaimed/pages_scanned) falls to 2%. 
So this patch changes the policy for kswapd. kswapd may use all of a request queue, and is prepared to block on request queues. What will now happen in the above scenario is: 1: The page allocator scans some pages, fails to reclaim enough memory and takes a nap in blk_congestion_wait(). 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing back pages. (These pages will be rotated to the tail of the inactive list at IO-completion interrupt time). This writeback will saturate the queue with ZONE_NORMAL pages. Conveniently, pdflush will avoid the congested queues. So we end up writing the correct pages. In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim efficiency rises from 2% to 40% and things are generally a lot happier. We may see the same problem and improvement in the absence of the 'all writeback goes to one zone' assumption. The problem could be: - dirty_thresh is exceeded, so balance_dirty_pages() starts syncing data and quickly _congests_ the queue; - dirty pages are slowly but continuously turned into clean pages by balance_dirty_pages(), but they still stay in the same place in LRU; - the zones are mostly dirty/writeback pages, kswapd has a hard time finding the randomly distributed clean pages; - kswapd cannot do the writeout because the queue is congested! The improvement could be: - kswapd is now explicitly preferred to do the writeout; - the pages written by kswapd will be rotated and easy for kswapd to reclaim; - it becomes possible for kswapd to wait for the congested queue, instead of doing the vmscan like mad. The congestion wait looks like a pretty natural way to throttle the kswapd. Instead of doing the vmscan at 1000MB/s and actually freeing pages at 60MB/s (about the write throughput), kswapd will be relaxed to do vmscan at maybe 150MB/s. Fengguang --- The downside is that kswapd may now do a lot less page reclaim, increasing page allocation latency, causing more direct reclaim, increasing lock contention in the VM, etc. 
But I have not been able to demonstrate that in testing. The other problem is that there is only one kswapd, and there are lots of disks. That is a generic problem - without being able to co-opt user processes we don't have enough threads to keep lots of disks saturated. One fix for this would be to add an additional really congested threshold in the request queues, so kswapd can still perform nonblocking writeout. This gives kswapd priority over pdflush while allowing kswapd to feed many disk queues. I doubt if this will be called for.
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 12:27 +0200, Miklos Szeredi wrote: diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 4ef4d22..eff2438 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -88,7 +88,7 @@ static inline void wait_on_inode(struct inode *inode) int wakeup_pdflush(long nr_pages); void laptop_io_completion(void); void laptop_sync_completion(void); -void throttle_vm_writeout(gfp_t gfp_mask); +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask); /* These are exported to sysctl. */ extern int dirty_background_ratio; diff --git a/mm/page-writeback.c b/mm/page-writeback.c index eec1481..f949997 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -326,11 +326,8 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping, } EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); -void throttle_vm_writeout(gfp_t gfp_mask) +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask) { - long background_thresh; - long dirty_thresh; - if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) { /* * The caller might hold locks which can prevent IO completion @@ -342,17 +339,16 @@ void throttle_vm_writeout(gfp_t gfp_mask) } for ( ; ; ) { - get_dirty_limits(&background_thresh, &dirty_thresh, NULL); + unsigned long thresh = zone_page_state(zone, NR_ACTIVE) + + zone_page_state(zone, NR_INACTIVE); -/* - * Boost the allowable dirty threshold a bit for page - * allocators so they don't get DoS'ed by heavy writers - */ -dirty_thresh += dirty_thresh / 10; /* wh... 
*/ + /* + * wait when 75% of the zone's pages are under writeback + */ + thresh -= thresh >> 2; + if (zone_page_state(zone, NR_WRITEBACK) > thresh) + break; - if (global_page_state(NR_UNSTABLE_NFS) + - global_page_state(NR_WRITEBACK) >= dirty_thresh) - break; congestion_wait(WRITE, HZ/10); } } diff --git a/mm/vmscan.c b/mm/vmscan.c index 1be5a63..7dd6bd9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -948,7 +948,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone, } } - throttle_vm_writeout(sc->gfp_mask); + throttle_vm_writeout(zone, sc->gfp_mask); atomic_dec(&zone->reclaim_in_progress); return nr_reclaimed; I think that's an improvement in all respects. However it still does not generally address the deadlock scenario: if there's a small DMA zone, and fuse manages to put all of those pages under writeout, then there's trouble. But it's not really fuse specific. If it was a normal filesystem that did that, and it needed a GFP_DMA allocation for writeout, it is in trouble also, as that allocation would fail (at least no deadlock). Or is GFP_DMA never used by fs/io writeout paths? I agree that it's not complete. (hence the lack of sign-off etc.) 'normally' writeback pages just need an interrupt to signal the stuff is written back. ie. the writeback completion path is atomic. [ result of the thinking below -- the above is esp. true for swap pages - so maybe we should ensure that !swap traffic can never exceed this 75% - that way, swap can always get us out of a tight spot ] Maybe we should make that a requirement, although I see how that becomes rather hard for FUSE (which is where Andrew's PF_MEMALLOC suggestion comes from - but I really dislike PF_MEMALLOC exposed to userspace). Limiting FUSE to say 50% (suggestion from your other email) sounds like a horrible hack to me. - Need more time to think on this. .. o O [ process of thinking ] While I think, I might as well tell the problem I had with throttle_vm_writeout(), which is nfs unstable pages. 
Those don't go away by waiting for it - as throttle_vm_writeout() does. So once you have an excess of those, you're stuck as well. In this patch I totally ignored unstable, but I'm not sure that's the proper thing to do, I'd need to figure out what happens to an unstable page when passed into pageout() - or if it's passed to pageout at all. If unstable pages would be passed to pageout(), and it would properly convert them to writeback and clean them, then there is nothing wrong. (Trond?)
Re: [PATCH] remove throttle_vm_writeout()
I think that's an improvement in all respects. However it still does not generally address the deadlock scenario: if there's a small DMA zone, and fuse manages to put all of those pages under writeout, then there's trouble. And the only way to solve that AFAICS, is to make sure fuse never uses more than e.g. 50% of _any_ zone for page cache. And that may need some tweaking in the allocator... Miklos
Re: [PATCH] remove throttle_vm_writeout()
Limiting FUSE to say 50% (suggestion from your other email) sounds like a horrible hack to me. - Need more time to think on this. I don't really understand all that page balancing stuff, but I think this will probably never or very rarely happen, because the allocator will prefer the bigger zones, and the dirty page limiting will not let the bigger zones get too full of dirty pages. And even if it can happen, it's not necessarily a fuse-only thing. It makes tons of sense to make sure that we don't fully dirty _any_ specialized zone. One special zone group are the low-memory pages. And currently balance_dirty_pages() makes sure we don't fill that up with dirty file backed pages. So something like that should make sense for other special zones like DMA as well. I'm not saying it's trivial, or even possible to implement, just thinking... Miklos
Re: [PATCH] remove throttle_vm_writeout()
I think that's an improvement in all respects. However it still does not generally address the deadlock scenario: if there's a small DMA zone, and fuse manages to put all of those pages under writeout, then there's trouble. Miklos> And the only way to solve that AFAICS, is to make sure fuse Miklos> never uses more than e.g. 50% of _any_ zone for page cache. Miklos> And that may need some tweaking in the allocator... So what happens if I have three different FUSE mounts, all under heavy write pressure? It's not a FUSE problem, it's a VM problem as far as I can see. All I did was extrapolate from the 50% number (where did that come from?) and triple it to go over 100%, since we obviously shouldn't take 100% of any zone, right? So the real cure is to have some way to rate limit zone usage, making it harder and harder to allocate in a zone as the zone gets more and more full. But how do you do this in a non-deadlocky way? But hey, I'm not that knowledgeable about the VM.
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 5 Oct 2007 20:30:28 +0800 Fengguang Wu [EMAIL PROTECTED] wrote: commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09 Author: akpm akpm Date: Sun Dec 22 01:07:33 2002 + [PATCH] Give kswapd writeback higher priority than pdflush The `low latency page reclaim' design works by preventing page allocators from blocking on request queues (and by preventing them from blocking against writeback of individual pages, but that is immaterial here). This has a problem under some situations. pdflush (or a write(2) caller) could be saturating the queue with highmem pages. This prevents anyone from writing back ZONE_NORMAL pages. We end up doing enormous amounts of scanning. Sorry, I cannot understand it. We now have balanced aging between zones. So the page allocations are expected to distribute proportionally between ZONE_HIGHMEM and ZONE_NORMAL? Sure, but we don't have one disk queue per disk per zone! The queue is shared by all the zones. So if writeback from one zone has filled the queue up, the kernel can't write back data from another zone. (Well, it can, by blocking in get_request_wait(), but that causes long and uncontrollable latencies). A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory, then kill the mmapping applications. The machine instantly goes from 0% of memory dirty to 95% or more. pdflush kicks in and starts writing the least-recently-dirtied pages, which are all highmem. The queue is congested so nobody will write back ZONE_NORMAL pages. kswapd chews 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim efficiency (pages_reclaimed/pages_scanned) falls to 2%. So this patch changes the policy for kswapd. kswapd may use all of a request queue, and is prepared to block on request queues. What will now happen in the above scenario is: 1: The page allocator scans some pages, fails to reclaim enough memory and takes a nap in blk_congestion_wait(). 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing back pages. 
(These pages will be rotated to the tail of the inactive list at IO-completion interrupt time). This writeback will saturate the queue with ZONE_NORMAL pages. Conveniently, pdflush will avoid the congested queues. So we end up writing the correct pages. In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim efficiency rises from 2% to 40% and things are generally a lot happier. We may see the same problem and improvement in the absence of the 'all writeback goes to one zone' assumption. The problem could be: - dirty_thresh is exceeded, so balance_dirty_pages() starts syncing data and quickly _congests_ the queue; Or someone ran fsync(), or pdflush is writing back data because it exceeded dirty_writeback_centisecs, etc. - dirty pages are slowly but continuously turned into clean pages by balance_dirty_pages(), but they still stay in the same place in LRU; - the zones are mostly dirty/writeback pages, kswapd has a hard time finding the randomly distributed clean pages; - kswapd cannot do the writeout because the queue is congested! The improvement could be: - kswapd is now explicitly preferred to do the writeout; - the pages written by kswapd will be rotated and easy for kswapd to reclaim; - it becomes possible for kswapd to wait for the congested queue, instead of doing the vmscan like mad. Yeah. In 2.4 and early 2.5, page-reclaim (both direct reclaim and kswapd, iirc) would throttle by waiting on writeout of a particular page. This was a poor design, because writeback against a *particular* page can take anywhere from one millisecond to thirty seconds to complete, depending upon where the disk head is and all that stuff. The critical change I made was to switch the throttling algorithm from wait for one page to get written to wait for _any_ page to get written. Because reclaim really doesn't care _which_ page got written: we want to wake up and start scanning again when _any_ page got written. That's what congestion_wait() does. It is pretty crude. 
It could be that writeback completed against pages which aren't in the correct zone, or it could be that some other task went and allocated the just-cleaned pages before this task can get running and reclaim them, or it could be that the just-written-back pages weren't reclaimable after all, etc. It would take a mind-boggling amount of logic and locking to make all this 100% accurate and the need has never been demonstrated. So page reclaim presently should be viewed as a polling algorithm, where the rate of polling is paced by the rate at which the IO system can retire writes. The congestion wait looks like a pretty natural way to throttle the kswapd. Instead of doing the
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 12:57 +0200, Peter Zijlstra wrote: In this patch I totally ignored unstable, but I'm not sure that's the proper thing to do, I'd need to figure out what happens to an unstable page when passed into pageout() - or if it's passed to pageout at all. If unstable pages would be passed to pageout(), and it would properly convert them to writeback and clean them, then there is nothing wrong. Why would we want to do that? That would be a hell of a lot of work (locking pages, setting flags, unlocking pages, ...) for absolutely no reason. Unstable writes are writes which have been sent to the server, but which haven't been written to disk on the server. A single RPC command is then sent (COMMIT) which basically tells the server to call fsync(). After that is successful, we can free up the pages, but we do that with no extra manipulation of the pages themselves: no page locks, just removal from the NFS private radix tree, and freeing up of the NFS private structures. We only need to touch the pages again in the unlikely case that the COMMIT fails because the server has rebooted. In this case we have to resend the writes, and so the pages are marked as dirty, so we can go through the whole writepages() rigmarole again... So, no. I don't see sending pages through pageout() as being at all helpful. Trond
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 13:50 -0400, Trond Myklebust wrote: On Fri, 2007-10-05 at 12:57 +0200, Peter Zijlstra wrote: In this patch I totally ignored unstable, but I'm not sure that's the proper thing to do, I'd need to figure out what happens to an unstable page when passed into pageout() - or if it's passed to pageout at all. If unstable pages would be passed to pageout(), and it would properly convert them to writeback and clean them, then there is nothing wrong. Why would we want to do that? That would be a hell of a lot of work (locking pages, setting flags, unlocking pages, ...) for absolutely no reason. Unstable writes are writes which have been sent to the server, but which haven't been written to disk on the server. A single RPC command is then sent (COMMIT) which basically tells the server to call fsync(). After that is successful, we can free up the pages, but we do that with no extra manipulation of the pages themselves: no page locks, just removal from the NFS private radix tree, and freeing up of the NFS private structures. We only need to touch the pages again in the unlikely case that the COMMIT fails because the server has rebooted. In this case we have to resend the writes, and so the pages are marked as dirty, so we can go through the whole writepages() rigmarole again... So, no. I don't see sending pages through pageout() as being at all helpful. Well, the thing is, we throttle pageout in throttle_vm_writeout(). As it stands we can deadlock there because it just waits for the numbers to drop, and unstable pages don't automagically disappear. Only write_inodes() - normally called from balance_dirty_pages() - will call COMMIT. So my thought was that calling pageout() on an unstable page would do the COMMIT - we're low on memory, otherwise we would not be paging, so getting rid of unstable pages seems to make sense to me. 
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 20:32 +0200, Peter Zijlstra wrote: Well, the thing is, we throttle pageout in throttle_vm_writeout(). As it stands we can deadlock there because it just waits for the numbers to drop, and unstable pages don't automagically disappear. Only write_inodes() - normally called from balance_dirty_pages() - will call COMMIT. So my thought was that calling pageout() on an unstable page would do the COMMIT - we're low on memory, otherwise we would not be paging, so getting rid of unstable pages seems to make sense to me. Why not rather track which mappings have large numbers of outstanding unstable writes at the VM level, and then add some form of callback to allow it to notify the filesystem when it needs to flush them out? Cheers Trond
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 15:20 -0400, Trond Myklebust wrote: On Fri, 2007-10-05 at 20:32 +0200, Peter Zijlstra wrote: Well, the thing is, we throttle pageout in throttle_vm_writeout(). As it stands we can deadlock there because it just waits for the numbers to drop, and unstable pages don't automagically disappear. Only write_inodes() - normally called from balance_dirty_pages() - will call COMMIT. So my thought was that calling pageout() on an unstable page would do the COMMIT - we're low on memory, otherwise we would not be paging, so getting rid of unstable pages seems to make sense to me. Why not rather track which mappings have large numbers of outstanding unstable writes at the VM level, and then add some form of callback to allow it to notify the filesystem when it needs to flush them out? Cheers Trond BTW: Please note that at least in the case of NFS, you will have to allow for the fact that the filesystem may not be _able_ to cause the numbers to drop. If the server is unavailable, then we may be stuck in unstable page limbo for quite some time. Trond
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 09:32:57 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: I think just adding nr_cpus * ratelimit_pages to the dirty_thresh in throttle_vm_writeout() will also solve the problem Agreed, that should fix the main latency issues. -- All Rights Reversed
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 2007-10-05 at 15:23 -0400, Trond Myklebust wrote: On Fri, 2007-10-05 at 15:20 -0400, Trond Myklebust wrote: On Fri, 2007-10-05 at 20:32 +0200, Peter Zijlstra wrote: Well, the thing is, we throttle pageout in throttle_vm_writeout(). As it stands we can deadlock there because it just waits for the numbers to drop, and unstable pages don't automagically disappear. Only write_inodes() - normally called from balance_dirty_pages() - will call COMMIT. So my thought was that calling pageout() on an unstable page would do the COMMIT - we're low on memory, otherwise we would not be paging, so getting rid of unstable pages seems to make sense to me. Why not rather track which mappings have large numbers of outstanding unstable writes at the VM level, and then add some form of callback to allow it to notify the filesystem when it needs to flush them out? That would be nice, it's just that the pageout throttling is not quite that sophisticated. But we'll see what we can come up with. BTW: Please note that at least in the case of NFS, you will have to allow for the fact that the filesystem may not be _able_ to cause the numbers to drop. If the server is unavailable, then we may be stuck in unstable page limbo for quite some time. Agreed, it would be nice if that is handled in such a manner that it does not take down all other paging. The regular write path that only bothers with balance_dirty_pages() already does this nicely.
Re: [PATCH] remove throttle_vm_writeout()
On Fri, Oct 05, 2007 at 08:32:19PM +0200, Peter Zijlstra wrote: On Fri, 2007-10-05 at 13:50 -0400, Trond Myklebust wrote: On Fri, 2007-10-05 at 12:57 +0200, Peter Zijlstra wrote: In this patch I totally ignored unstable, but I'm not sure that's the proper thing to do, I'd need to figure out what happens to an unstable page when passed into pageout() - or if it's passed to pageout at all. If unstable pages would be passed to pageout(), and it would properly convert them to writeback and clean them, then there is nothing wrong. Why would we want to do that? That would be a hell of a lot of work (locking pages, setting flags, unlocking pages, ...) for absolutely no reason. Unstable writes are writes which have been sent to the server, but which haven't been written to disk on the server. A single RPC command is then sent (COMMIT) which basically tells the server to call fsync(). After that is successful, we can free up the pages, but we do that with no extra manipulation of the pages themselves: no page locks, just removal from the NFS private radix tree, and freeing up of the NFS private structures. We only need to touch the pages again in the unlikely case that the COMMIT fails because the server has rebooted. In this case we have to resend the writes, and so the pages are marked as dirty, so we can go through the whole writepages() rigmarole again... So, no. I don't see sending pages through pageout() as being at all helpful. Well, the thing is, we throttle pageout in throttle_vm_writeout(). As it stands we can deadlock there because it just waits for the numbers to drop, and unstable pages don't automagically disappear. Only write_inodes() - normally called from balance_dirty_pages() - will call COMMIT. I wonder whether if (!bdi_nr_writeback) break; or something like that could avoid the deadlock? 
So my thought was that calling pageout() on an unstable page would do the COMMIT - we're low on memory, otherwise we would not be paging, so getting rid of unstable pages seems to make sense to me. I guess getting rid of many unstable pages at once would be better, if we take this route.
Re: [PATCH] remove throttle_vm_writeout()
On Fri, Oct 05, 2007 at 10:20:05AM -0700, Andrew Morton wrote: On Fri, 5 Oct 2007 20:30:28 +0800 Fengguang Wu [EMAIL PROTECTED] wrote: commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09 Author: akpm akpm Date: Sun Dec 22 01:07:33 2002 + [PATCH] Give kswapd writeback higher priority than pdflush The `low latency page reclaim' design works by preventing page allocators from blocking on request queues (and by preventing them from blocking against writeback of individual pages, but that is immaterial here). This has a problem under some situations. pdflush (or a write(2) caller) could be saturating the queue with highmem pages. This prevents anyone from writing back ZONE_NORMAL pages. We end up doing enormous amounts of scanning. Sorry, I cannot understand it. We now have balanced aging between zones. So the page allocations are expected to distribute proportionally between ZONE_HIGHMEM and ZONE_NORMAL? Sure, but we don't have one disk queue per disk per zone! The queue is shared by all the zones. So if writeback from one zone has filled the queue up, the kernel can't write back data from another zone. Hmm, that's a problem. But I guess when one zone is full, other zones will not be far away... It's a sooner or later problem. (Well, it can, by blocking in get_request_wait(), but that causes long and uncontrollable latencies). I guess PF_SWAPWRITE processes still have a good probability of getting stuck in get_request_wait(). Because balance_dirty_pages() is allowed to disregard the congestion. It will be exhausting the available request slots all the time. 
Signed-off-by: Fengguang Wu [EMAIL PROTECTED] --- mm/page-writeback.c |1 + 1 file changed, 1 insertion(+) --- linux-2.6.23-rc8-mm2.orig/mm/page-writeback.c +++ linux-2.6.23-rc8-mm2/mm/page-writeback.c @@ -400,6 +400,7 @@ static void balance_dirty_pages(struct a .sync_mode = WB_SYNC_NONE, .older_than_this = NULL, .nr_to_write = write_chunk, + .nonblocking = 1, .range_cyclic = 1, }; A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory, then kill the mmapping applications. The machine instantly goes from 0% of memory dirty to 95% or more. pdflush kicks in and starts writing the least-recently-dirtied pages, which are all highmem. The queue is congested so nobody will write back ZONE_NORMAL pages. kswapd chews 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim efficiency (pages_reclaimed/pages_scanned) falls to 2%. So this patch changes the policy for kswapd. kswapd may use all of a request queue, and is prepared to block on request queues. What will now happen in the above scenario is: 1: The page allocator scans some pages, fails to reclaim enough memory and takes a nap in blk_congestion_wait(). 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing back pages. (These pages will be rotated to the tail of the inactive list at IO-completion interrupt time). This writeback will saturate the queue with ZONE_NORMAL pages. Conveniently, pdflush will avoid the congested queues. So we end up writing the correct pages. In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim efficiency rises from 2% to 40% and things are generally a lot happier. We may see the same problem and improvement in the absence of the 'all writeback goes to one zone' assumption. The problem could be: - dirty_thresh is exceeded, so balance_dirty_pages() starts syncing data and quickly _congests_ the queue; Or someone ran fsync(), or pdflush is writing back data because it exceeded dirty_writeback_centisecs, etc. Ah, yes. 
- dirty pages are slowly but continuously turned into clean pages by balance_dirty_pages(), but they still stay in the same place in LRU; - the zones are mostly dirty/writeback pages, kswapd has a hard time finding the randomly distributed clean pages; - kswapd cannot do the writeout because the queue is congested! The improvement could be: - kswapd is now explicitly preferred to do the writeout; - the pages written by kswapd will be rotated and easy for kswapd to reclaim; - it becomes possible for kswapd to wait for the congested queue, instead of doing the vmscan like mad. Yeah. In 2.4 and early 2.5, page-reclaim (both direct reclaim and kswapd, iirc) would throttle by waiting on writeout of a particular page. This was a poor design, because writeback against a *particular* page can take anywhere from one millisecond to thirty seconds to complete,
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 02:12:30 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > > > I don't think I understand that. Sure, it _shouldn't_ be a problem. But it > > _is_. That's what we're trying to fix, isn't it? > > The problem, I believe, is in the memory allocation code, not in fuse. fuse is trying to do something which page reclaim was not designed for. Stuff broke. > In the example, memory allocation may be blocking indefinitely, > because we have 4MB under writeback, even though 28MB can still be > made available. And that _should_ be fixable. Well yes. But we need to work out how, without re-breaking the thing which throttle_vm_writeout() fixed. > > > So the only thing the kernel should be careful about, is not to block > > > on an allocation if not strictly necessary. > > > > > > Actually a trivial fix for this problem could be to just tweak the > > > thresholds, so to make the above scenario impossible. Although I'm > > > still not convinced this patch is perfect, because the dirty > > > threshold can actually change in time... > > > > > > Index: linux/mm/page-writeback.c > > > === > > > --- linux.orig/mm/page-writeback.c 2007-10-05 00:31:01.0 > > > +0200 > > > +++ linux/mm/page-writeback.c 2007-10-05 00:50:11.0 +0200 > > > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask > > > for ( ; ; ) { > > > get_dirty_limits(&background_thresh, &dirty_thresh, NULL, > > > NULL); > > > > > > + /* > > > +* Make sure the threshold is over the hard limit of > > > +* dirty_thresh + ratelimit_pages * nr_cpus > > > +*/ > > > + dirty_thresh += ratelimit_pages * num_online_cpus(); > > > + > > > /* > > > * Boost the allowable dirty threshold a bit for page > > > * allocators so they don't get DoS'ed by heavy writers > > > > I can probably kind of guess what you're trying to do here. But if > > ratelimit_pages * num_online_cpus() exceeds the size of the offending zone > > then things might go bad. 
> > I think the admin can do quite a bit of other damage, by setting > dirty_ratio too high. > > Maybe this writeback throttling should just have a fixed limit of 80% > ZONE_NORMAL, and limit dirty_ratio to something like 50%. Bear in mind that the same problem will occur for the 16MB ZONE_DMA, and we cannot limit the system-wide dirty-memory threshold to 12MB. iow, throttle_vm_writeout() needs to become zone-aware. Then it only throttles when, say, 80% of ZONE_FOO is under writeback. Except I don't think that'll fix the problem 100%: if your fuse kernel component somehow manages to put 80% of ZONE_FOO under writeback (and remember this might be only 12MB on a 16GB machine) then we get stuck again - the fuse server process (is that the correct terminology, btw?) ends up waiting upon itself. I'll think about it a bit. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] remove throttle_vm_writeout()
> > > This is a somewhat general problem: a userspace process is in the IO > > > path. > > > Userspace block drivers, for example - pretty much anything which involves > > > kernel->userspace upcalls for storage applications. > > > > > > I solved it once in the past by marking the userspace process as > > > PF_MEMALLOC and I believe that others have implemented the same hack. > > > > > > I suspect that what we need is a general solution, and that the solution > > > will involve explicitly telling the kernel that this process is one which > > > actually cleans memory and needs special treatment. > > > > > > Because I bet there will be other corner-cases where such a process needs > > > kernel help, and there might be optimisation opportunities as well. > > > > > > Problem is, any such mark-me-as-special syscall would need to be > > > privileged, and FUSE servers presently don't require special perms (do > > > they?) > > > > No, and that's a rather important feature, that I'd rather not give > > up. > > Can fuse do it? Perhaps the fs can diddle the server's task_struct at > registration time? No, it's futile. What if another process is involved (ssh in case of sshfs), etc. > > But with the dirty limiting, the memory cleaning really shouldn't > > be a problem, as there is plenty of memory _not_ used for dirty file > > data, that the filesystem can use during the writeback. > > I don't think I understand that. Sure, it _shouldn't_ be a problem. But it > _is_. That's what we're trying to fix, isn't it? The problem, I believe, is in the memory allocation code, not in fuse. In the example, memory allocation may be blocking indefinitely, because we have 4MB under writeback, even though 28MB can still be made available. And that _should_ be fixable. > > So the only thing the kernel should be careful about, is not to block > > on an allocation if not strictly necessary. 
> > > > Actually a trivial fix for this problem could be to just tweak the > > thresholds, so to make the above scenario impossible. Although I'm > > still not convinced this patch is perfect, because the dirty > > threshold can actually change in time... > > > > Index: linux/mm/page-writeback.c > > === > > --- linux.orig/mm/page-writeback.c 2007-10-05 00:31:01.0 +0200 > > +++ linux/mm/page-writeback.c 2007-10-05 00:50:11.0 +0200 > > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask > > for ( ; ; ) { > > get_dirty_limits(&background_thresh, &dirty_thresh, NULL, > > NULL); > > > > + /* > > +* Make sure the threshold is over the hard limit of > > +* dirty_thresh + ratelimit_pages * nr_cpus > > +*/ > > + dirty_thresh += ratelimit_pages * num_online_cpus(); > > + > > /* > > * Boost the allowable dirty threshold a bit for page > > * allocators so they don't get DoS'ed by heavy writers > > I can probably kind of guess what you're trying to do here. But if > ratelimit_pages * num_online_cpus() exceeds the size of the offending zone > then things might go bad. I think the admin can do quite a bit of other damage, by setting dirty_ratio too high. Maybe this writeback throttling should just have a fixed limit of 80% ZONE_NORMAL, and limit dirty_ratio to something like 50%. Miklos 
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 01:26:12 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > This is a somewhat general problem: a userspace process is in the IO path. > > Userspace block drivers, for example - pretty much anything which involves > > kernel->userspace upcalls for storage applications. > > > > I solved it once in the past by marking the userspace process as > > PF_MEMALLOC and I believe that others have implemented the same hack. > > > > I suspect that what we need is a general solution, and that the solution > > will involve explicitly telling the kernel that this process is one which > > actually cleans memory and needs special treatment. > > > > Because I bet there will be other corner-cases where such a process needs > > kernel help, and there might be optimisation opportunities as well. > > > > Problem is, any such mark-me-as-special syscall would need to be > > privileged, and FUSE servers presently don't require special perms (do > > they?) > > No, and that's a rather important feature, that I'd rather not give > up. Can fuse do it? Perhaps the fs can diddle the server's task_struct at registration time? > But with the dirty limiting, the memory cleaning really shouldn't > be a problem, as there is plenty of memory _not_ used for dirty file > data, that the filesystem can use during the writeback. I don't think I understand that. Sure, it _shouldn't_ be a problem. But it _is_. That's what we're trying to fix, isn't it? > So the only thing the kernel should be careful about, is not to block > on an allocation if not strictly necessary. > > Actually a trivial fix for this problem could be to just tweak the > thresholds, so to make the above scenario impossible. Although I'm > still not convinced this patch is perfect, because the dirty > threshold can actually change in time... 
> > Index: linux/mm/page-writeback.c > === > --- linux.orig/mm/page-writeback.c 2007-10-05 00:31:01.0 +0200 > +++ linux/mm/page-writeback.c 2007-10-05 00:50:11.0 +0200 > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask > for ( ; ; ) { > get_dirty_limits(&background_thresh, &dirty_thresh, NULL, > NULL); > > + /* > +* Make sure the threshold is over the hard limit of > +* dirty_thresh + ratelimit_pages * nr_cpus > +*/ > + dirty_thresh += ratelimit_pages * num_online_cpus(); > + > /* > * Boost the allowable dirty threshold a bit for page > * allocators so they don't get DoS'ed by heavy writers I can probably kind of guess what you're trying to do here. But if ratelimit_pages * num_online_cpus() exceeds the size of the offending zone then things might go bad. 
Re: [PATCH] remove throttle_vm_writeout()
> This is a somewhat general problem: a userspace process is in the IO path. > Userspace block drivers, for example - pretty much anything which involves > kernel->userspace upcalls for storage applications. > > I solved it once in the past by marking the userspace process as > PF_MEMALLOC and I believe that others have implemented the same hack. > > I suspect that what we need is a general solution, and that the solution > will involve explicitly telling the kernel that this process is one which > actually cleans memory and needs special treatment. > > Because I bet there will be other corner-cases where such a process needs > kernel help, and there might be optimisation opportunities as well. > > Problem is, any such mark-me-as-special syscall would need to be > privileged, and FUSE servers presently don't require special perms (do > they?) No, and that's a rather important feature, that I'd rather not give up. But with the dirty limiting, the memory cleaning really shouldn't be a problem, as there is plenty of memory _not_ used for dirty file data, that the filesystem can use during the writeback. So the only thing the kernel should be careful about, is not to block on an allocation if not strictly necessary. Actually a trivial fix for this problem could be to just tweak the thresholds, so to make the above scenario impossible. Although I'm still not convinced this patch is perfect, because the dirty threshold can actually change in time... 
Index: linux/mm/page-writeback.c === --- linux.orig/mm/page-writeback.c 2007-10-05 00:31:01.0 +0200 +++ linux/mm/page-writeback.c 2007-10-05 00:50:11.0 +0200 @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask for ( ; ; ) { get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL); + /* +* Make sure the threshold is over the hard limit of +* dirty_thresh + ratelimit_pages * nr_cpus +*/ + dirty_thresh += ratelimit_pages * num_online_cpus(); + /* * Boost the allowable dirty threshold a bit for page * allocators so they don't get DoS'ed by heavy writers 
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 00:39:16 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps fixing > > that would fix your deadlock. That's doubtful, but I don't know anything > > about your deadlock so I cannot say. > > No, doing the throttling per-zone won't in itself fix the deadlock. > > Here's a deadlock example: > > Total memory = 32M > /proc/sys/vm/dirty_ratio = 10 > dirty_threshold = 3M > ratelimit_pages = 1M > > Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap on > a fuse fs. Page balancing is called which turns all these into > writeback pages. > > Then userspace filesystem gets a write request, and tries to allocate > memory needed to complete the writeout. > > That will possibly trigger direct reclaim, and throttle_vm_writeout() > will be called. That will block until nr_writeback goes below 3.3M > (dirty_threshold + 10%). But since all 4M of writeback is from the > fuse fs, that will never happen. > > Does that explain it better? > yup, thanks. This is a somewhat general problem: a userspace process is in the IO path. Userspace block drivers, for example - pretty much anything which involves kernel->userspace upcalls for storage applications. I solved it once in the past by marking the userspace process as PF_MEMALLOC and I believe that others have implemented the same hack. I suspect that what we need is a general solution, and that the solution will involve explicitly telling the kernel that this process is one which actually cleans memory and needs special treatment. Because I bet there will be other corner-cases where such a process needs kernel help, and there might be optimisation opportunities as well. Problem is, any such mark-me-as-special syscall would need to be privileged, and FUSE servers presently don't require special perms (do they?) 
Re: [PATCH] remove throttle_vm_writeout()
> None of the above. > > [PATCH] vm: pageout throttling > > With silly pageout testcases it is possible to place huge amounts of > memory > under I/O. With a large request queue (CFQ uses 8192 requests) it is > possible to place _all_ memory under I/O at the same time. > > This means that all memory is pinned and unreclaimable and the VM gets > upset and goes oom. > > The patch limits the amount of memory which is under pageout writeout to > be > a little more than the amount of memory at which balance_dirty_pages() > callers will synchronously throttle. > > This means that heavy pageout activity can starve heavy writeback activity > completely, but heavy writeback activity will not cause starvation of > pageout. Because we don't want a simple `dd' to be causing excessive > latencies in page reclaim. > > afaict that problem is still there. It is possible to get all of > ZONE_NORMAL dirty on a highmem machine. With a large queue (or lots of > queues), vmscan can then place all of ZONE_NORMAL under IO. > > It could be that we've fixed this problem via other means in the interim, > but from a quick peek it seems to me that the scanner will still do a 100% > CPU burn when all of a zone's pages are under writeback. Ah, OK. I did read the changelog, but you added quite a bit of translation ;) > throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps fixing > that would fix your deadlock. That's doubtful, but I don't know anything > about your deadlock so I cannot say. No, doing the throttling per-zone won't in itself fix the deadlock. Here's a deadlock example: Total memory = 32M /proc/sys/vm/dirty_ratio = 10 dirty_threshold = 3M ratelimit_pages = 1M Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap on a fuse fs. Page balancing is called which turns all these into writeback pages. Then userspace filesystem gets a write request, and tries to allocate memory needed to complete the writeout. 
That will possibly trigger direct reclaim, and throttle_vm_writeout() will be called. That will block until nr_writeback goes below 3.3M (dirty_threshold + 10%). But since all 4M of writeback is from the fuse fs, that will never happen. Does that explain it better? Miklos 
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 04 Oct 2007 14:25:22 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > From: Miklos Szeredi <[EMAIL PROTECTED]> > > By relying on the global dirty limits, this can cause a deadlock when > devices are stacked. > > If the stacking is done through a fuse filesystem, the __GFP_FS, > __GFP_IO tests won't help: the process doing the allocation doesn't > have any special flag. This description of the bug-which-is-being-fixed is nowhere near adequate enough for a reviewer to understand the problem. This makes it hard to suggest alternative fixes. > So why exactly does this function exist? That's described in the changelog for the patch which added throttle_vm_writeout(). Unsurprisingly ;) > Direct reclaim does not _increase_ the number of dirty pages in the > system, so rate limiting it seems somewhat pointless. > > There are two cases: > > 1) File backed pages -> file > > dirty + writeback count remains constant > > 2) Anonymous pages -> swap > > writeback count increases, dirty balancing will hold back file > writeback in favor of swap > > So the real question is: does case 2 need rate limiting, or is it OK > to let the device queue fill with swap pages as fast as possible? None of the above. [PATCH] vm: pageout throttling With silly pageout testcases it is possible to place huge amounts of memory under I/O. With a large request queue (CFQ uses 8192 requests) it is possible to place _all_ memory under I/O at the same time. This means that all memory is pinned and unreclaimable and the VM gets upset and goes oom. The patch limits the amount of memory which is under pageout writeout to be a little more than the amount of memory at which balance_dirty_pages() callers will synchronously throttle. This means that heavy pageout activity can starve heavy writeback activity completely, but heavy writeback activity will not cause starvation of pageout. Because we don't want a simple `dd' to be causing excessive latencies in page reclaim. afaict that problem is still there. 
It is possible to get all of ZONE_NORMAL dirty on a highmem machine. With a large queue (or lots of queues), vmscan can then place all of ZONE_NORMAL under IO. It could be that we've fixed this problem via other means in the interim, but from a quick peek it seems to me that the scanner will still do a 100% CPU burn when all of a zone's pages are under writeback. throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps fixing that would fix your deadlock. That's doubtful, but I don't know anything about your deadlock so I cannot say. 
Re: [PATCH] remove throttle_vm_writeout()
> Yeah, I'm guestimating O on a per device basis, but I agree that the > current ratio limiting is quite crude. I'm not at all sorry to see > throttle_vm_writeout() go, I just wanted to make a point that what it > does is not quite without merit - we agree that it can be done better > differently. Yes. So what is it to be? Is limiting by device queues enough? Or do we need some global limit? If so, the cleanest way I see is to separately account and limit swap-writeback pages, so the global counters don't interfere with the limiting. This shouldn't be hard to do, as we have the per-bdi writeback counting infrastructure already, and also a pseudo bdi for swap in swapper_space.backing_dev_info. Miklos 
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 04 Oct 2007 20:10:10 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > On Thu, 2007-10-04 at 10:46 -0700, Andrew Morton wrote: > > On Thu, 04 Oct 2007 18:47:07 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > > > static int may_write_to_queue(struct backing_dev_info *bdi) > > > { > > > if (current->flags & PF_SWAPWRITE) > > > return 1; > > > if (!bdi_write_congested(bdi)) > > > return 1; > > > if (bdi == current->backing_dev_info) > > > return 1; > > > return 0; > > > } > > > > > > Which will write to congested queues. Anybody know why? > > OK, I guess I could have found that :-/ Nice changelog, if I do say so myself ;) > > One fix for this would be to add an additional "really congested" > > threshold in the request queues, so kswapd can still perform > > nonblocking writeout. This gives kswapd priority over pdflush while > > allowing kswapd to feed many disk queues. I doubt if this will be > > called for. > > I could do that. I guess first you'd need to be able to reproduce the problem which that patch fixed, then check that it remains fixed. Sigh. That problem was fairly subtle. We could re-break reclaim in this way and not find out about it for six months. There's a lesson here. Several. 
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 2007-10-04 at 10:46 -0700, Andrew Morton wrote: > On Thu, 04 Oct 2007 18:47:07 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > static int may_write_to_queue(struct backing_dev_info *bdi) > > { > > if (current->flags & PF_SWAPWRITE) > > return 1; > > if (!bdi_write_congested(bdi)) > > return 1; > > if (bdi == current->backing_dev_info) > > return 1; > > return 0; > > } > > > > Which will write to congested queues. Anybody know why? OK, I guess I could have found that :-/ > commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09 > Author: akpm > Date: Sun Dec 22 01:07:33 2002 + > > [PATCH] Give kswapd writeback higher priority than pdflush > > The `low latency page reclaim' design works by preventing page > allocators from blocking on request queues (and by preventing them from > blocking against writeback of individual pages, but that is immaterial > here). > > This has a problem under some situations. pdflush (or a write(2) > caller) could be saturating the queue with highmem pages. This > prevents anyone from writing back ZONE_NORMAL pages. We end up doing > enormous amounts of scanning. > > A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory, > then kill the mmapping applications. The machine instantly goes from > 0% of memory dirty to 95% or more. With dirty page tracking this is not supposed to happen anymore. > pdflush kicks in and starts writing > the least-recently-dirtied pages, which are all highmem. with highmem >> normal, and user pages preferring highmem, this will likely still be true. > The queue is > congested so nobody will write back ZONE_NORMAL pages. kswapd chews > 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim > efficiency (pages_reclaimed/pages_scanned) falls to 2%. So, the problem is a heavy writer vs swap. Which is still possible. > So this patch changes the policy for kswapd. kswapd may use all of a > request queue, and is prepared to block on request queues. 
So request queues have a limit above the congestion level on which they will block? NFS doesn't have that AFAIK > What will now happen in the above scenario is: > > 1: The page allocator scans some pages, fails to reclaim enough >memory and takes a nap in blk_congestion_wait(). > > 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing >back pages. (These pages will be rotated to the tail of the >inactive list at IO-completion interrupt time). > >This writeback will saturate the queue with ZONE_NORMAL pages. >Conveniently, pdflush will avoid the congested queues. So we end up >writing the correct pages. > > In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim > efficiency rises from 2% to 40% and things are generally a lot happier. > > > The downside is that kswapd may now do a lot less page reclaim, > increasing page allocation latency, causing more direct reclaim, > increasing lock contention in the VM, etc. But I have not been able to > demonstrate that in testing. > > > The other problem is that there is only one kswapd, and there are lots > of disks. That is a generic problem - without being able to co-opt > user processes we don't have enough threads to keep lots of disks > saturated. > > One fix for this would be to add an additional "really congested" > threshold in the request queues, so kswapd can still perform > nonblocking writeout. This gives kswapd priority over pdflush while > allowing kswapd to feed many disk queues. I doubt if this will be > called for. I could do that. 
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 04 Oct 2007 18:47:07 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > > > > But that said, there might be better ways to do that. > > > > > > > > Sure, if we do need to globally limit the number of under-writeback > > > > pages, then I think we need to do it independently of the dirty > > > > accounting. > > > > > > It need not be global, it could be per BDI as well, but yes. > > > > For per-bdi limits we have the queue length. > > Agreed, except for: > > static int may_write_to_queue(struct backing_dev_info *bdi) > { > if (current->flags & PF_SWAPWRITE) > return 1; > if (!bdi_write_congested(bdi)) > return 1; > if (bdi == current->backing_dev_info) > return 1; > return 0; > } > > Which will write to congested queues. Anybody know why? commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09 Author: akpm Date: Sun Dec 22 01:07:33 2002 + [PATCH] Give kswapd writeback higher priority than pdflush The `low latency page reclaim' design works by preventing page allocators from blocking on request queues (and by preventing them from blocking against writeback of individual pages, but that is immaterial here). This has a problem under some situations. pdflush (or a write(2) caller) could be saturating the queue with highmem pages. This prevents anyone from writing back ZONE_NORMAL pages. We end up doing enormous amounts of scanning. A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory, then kill the mmapping applications. The machine instantly goes from 0% of memory dirty to 95% or more. pdflush kicks in and starts writing the least-recently-dirtied pages, which are all highmem. The queue is congested so nobody will write back ZONE_NORMAL pages. kswapd chews 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim efficiency (pages_reclaimed/pages_scanned) falls to 2%. So this patch changes the policy for kswapd. kswapd may use all of a request queue, and is prepared to block on request queues. 
What will now happen in the above scenario is: 1: The page allocator scans some pages, fails to reclaim enough memory and takes a nap in blk_congestion_wait(). 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing back pages. (These pages will be rotated to the tail of the inactive list at IO-completion interrupt time). This writeback will saturate the queue with ZONE_NORMAL pages. Conveniently, pdflush will avoid the congested queues. So we end up writing the correct pages. In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim efficiency rises from 2% to 40% and things are generally a lot happier. The downside is that kswapd may now do a lot less page reclaim, increasing page allocation latency, causing more direct reclaim, increasing lock contention in the VM, etc. But I have not been able to demonstrate that in testing. The other problem is that there is only one kswapd, and there are lots of disks. That is a generic problem - without being able to co-opt user processes we don't have enough threads to keep lots of disks saturated. One fix for this would be to add an additional "really congested" threshold in the request queues, so kswapd can still perform nonblocking writeout. This gives kswapd priority over pdflush while allowing kswapd to feed many disk queues. I doubt if this will be called for. BKrev: 3e051055aitHp3bZBPSqmq21KGs5aQ diff --git a/include/linux/swap.h b/include/linux/swap.h index c635f39..9ab0209 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -7,6 +7,7 @@ #include #include #include #include +#include #include #include @@ -14,6 +15,11 @@ #define SWAP_FLAG_PREFER 0x8000 /* set i #define SWAP_FLAG_PRIO_MASK0x7fff #define SWAP_FLAG_PRIO_SHIFT 0 +static inline int current_is_kswapd(void) +{ + return current->flags & PF_KSWAPD; +} + /* * MAX_SWAPFILES defines the maximum number of swaptypes: things which can * be swapped to. 
The swap type and the offset into that swap type are diff --git a/mm/vmscan.c b/mm/vmscan.c index aeab1e3..a8b9d2c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -204,6 +204,19 @@ static inline int is_page_cache_freeable return page_count(page) - !!PagePrivate(page) == 2; } +static int may_write_to_queue(struct backing_dev_info *bdi) +{ + if (current_is_kswapd()) + return 1; + if (current_is_pdflush()) /* This is unlikely, but why not... */ + return 1; + if (!bdi_write_congested(bdi)) + return 1; + if (bdi == current->backing_dev_info) + return 1; + return 0; +} + /* * shrink_list returns the number of reclaimed
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 2007-10-04 at 15:49 +0200, Miklos Szeredi wrote: > > > > Which can only happen when it is larger than 10% of dirty_thresh. > > > > > > > > Which is even more unlikely since it doesn't account nr_dirty (as I > > > > think it should). > > > > > > I think nr_dirty is totally irrelevant. Since we don't care about > > > case 1), and in case 2) nr_dirty doesn't play any role. > > > > Ah, but its correct to have since we compare against dirty_thresh, which > > is defined to be a unit of nr_dirty + nr_unstable + nr_writeback. if we > > take one of these out, then we get an undefined amount of space extra. > > Yeah, I guess the point of the function was to limit nr_write to > _anything_ smaller than the total memory. *grin*, crude :-/ > > > > As for 2), yes I think having a limit on the total number of pages in > > > > flight is a good thing. > > > > > > Why? > > > > for my swapping over network thingies I need to put a bound on the > > amount of outgoing traffic in flight because that bounds the amount of > > memory consumed by the sending side. > > I guess you will have some request queue with limited length, no? See below. > The main problem seems to be if devices use up all the reserved memory > for queuing write requests. Limiting the in-flight pages is a very > crude way to solve this, the assumptions are: > > O: overhead as a fraction of the request size > T: total memory > R: reserved memory > T-R: may be full of anon pages > > so if (T-R)*O > R we are in trouble. > > if we limit the writeback memory to L and L*O < R we are OK. But we > don't know O (it's device dependent). We can make an estimate and > calculate L based on that, but that will be a number totally > independent of the dirty threshold. Yeah, I'm guestimating O on a per device basis, but I agree that the current ratio limiting is quite crude. 
I'm not at all sorry to see throttle_vm_writeback() go, I just wanted to make a point that what it does is not quite without merrit - we agree that it can be done better differently. > > > > But that said, there might be better ways to do that. > > > > > > Sure, if we do need to globally limit the number of under-writeback > > > pages, then I think we need to do it independently of the dirty > > > accounting. > > > > It need not be global, it could be per BDI as well, but yes. > > For per-bdi limits we have the queue length. Agreed, except for: static int may_write_to_queue(struct backing_dev_info *bdi) { if (current->flags & PF_SWAPWRITE) return 1; if (!bdi_write_congested(bdi)) return 1; if (bdi == current->backing_dev_info) return 1; return 0; } Which will write to congested queues. Anybody know why? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
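Miklos's T/R/O bound above is easy to sanity-check numerically. Here is a minimal userspace sketch of the two inequalities from the message; the function names and the numbers in the comments are made up for illustration, only T, R, O and L are the quantities Miklos defined:

```c
#include <assert.h>

/*
 * O: per-request overhead as a fraction of the request size
 * T: total memory, R: reserved memory (same units as T)
 *
 * If everything outside the reserve (T - R) is anon and queued for
 * writeout at once, the overhead alone can exceed the reserve when
 * (T - R) * O > R.
 */
static int reserve_can_be_exhausted(double total, double reserved,
				    double overhead_frac)
{
	return (total - reserved) * overhead_frac > reserved;
}

/*
 * A cap L on memory under writeback is safe (under these assumptions)
 * exactly when L * O < R -- a bound independent of the dirty threshold.
 */
static int writeback_limit_is_safe(double limit, double reserved,
				   double overhead_frac)
{
	return limit * overhead_frac < reserved;
}
```

For example, with T = 1024, R = 16 and a 5% overhead, the reserve can be exhausted, while a writeback cap of L = 200 keeps the overhead (10) under the reserve.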
Re: [PATCH] remove throttle_vm_writeout()
> > > Which can only happen when it is larger than 10% of dirty_thresh.
> > >
> > > Which is even more unlikely since it doesn't account nr_dirty (as I
> > > think it should).
> >
> > I think nr_dirty is totally irrelevant.  Since we don't care about
> > case 1), and in case 2) nr_dirty doesn't play any role.
>
> Ah, but it's correct to have since we compare against dirty_thresh,
> which is defined to be a unit of nr_dirty + nr_unstable +
> nr_writeback.  If we take one of these out, then we get an undefined
> amount of space extra.

Yeah, I guess the point of the function was to limit nr_write to
_anything_ smaller than the total memory.

> > > As for 2), yes I think having a limit on the total number of pages
> > > in flight is a good thing.
> >
> > Why?
>
> for my swapping over network thingies I need to put a bound on the
> amount of outgoing traffic in flight because that bounds the amount of
> memory consumed by the sending side.

I guess you will have some request queue with limited length, no?

The main problem seems to be if devices use up all the reserved memory
for queuing write requests.  Limiting the in-flight pages is a very
crude way to solve this, the assumptions are:

  O: overhead as a fraction of the request size
  T: total memory
  R: reserved memory
  T-R: may be full of anon pages

so if (T-R)*O > R we are in trouble.

if we limit the writeback memory to L and L*O < R we are OK.  But we
don't know O (it's device dependent).  We can make an estimate and
calculate L based on that, but that will be a number totally
independent of the dirty threshold.

> > > But that said, there might be better ways to do that.
> >
> > Sure, if we do need to globally limit the number of under-writeback
> > pages, then I think we need to do it independently of the dirty
> > accounting.
>
> It need not be global, it could be per BDI as well, but yes.

For per-bdi limits we have the queue length.
Miklos
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 2007-10-04 at 15:00 +0200, Miklos Szeredi wrote:
> > > 1) File backed pages -> file
> > >
> > >    dirty + writeback count remains constant
> > >
> > > 2) Anonymous pages -> swap
> > >
> > >    writeback count increases, dirty balancing will hold back file
> > >    writeback in favor of swap
> > >
> > > So the real question is: does case 2 need rate limiting, or is it OK
> > > to let the device queue fill with swap pages as fast as possible?
> >
> > Because balance_dirty_pages() maintains:
> >
> >   nr_dirty + nr_unstable + nr_writeback <
> >       total_dirty + nr_cpus * ratelimit_pages
> >
> > throttle_vm_writeout() _should_ not deadlock on that, unless you're
> > caught in the error term: nr_cpus * ratelimit_pages.
>
> And it does get caught on that in small memory machines.  This
> deadlock is easily reproducible on a 32MB UML instance.

Ah, yes, for those that is indeed easily doable.

> I haven't yet tested with the per-bdi patches, but I don't think they
> make a difference in this case.

Correct, they would not.

> > Which can only happen when it is larger than 10% of dirty_thresh.
> >
> > Which is even more unlikely since it doesn't account nr_dirty (as I
> > think it should).
>
> I think nr_dirty is totally irrelevant.  Since we don't care about
> case 1), and in case 2) nr_dirty doesn't play any role.

Ah, but it's correct to have since we compare against dirty_thresh,
which is defined to be a unit of nr_dirty + nr_unstable + nr_writeback.
If we take one of these out, then we get an undefined amount of space
extra.

> > As for 2), yes I think having a limit on the total number of pages in
> > flight is a good thing.
>
> Why?

for my swapping over network thingies I need to put a bound on the
amount of outgoing traffic in flight because that bounds the amount of
memory consumed by the sending side.

> > But that said, there might be better ways to do that.
> Sure, if we do need to globally limit the number of under-writeback
> pages, then I think we need to do it independently of the dirty
> accounting.

It need not be global, it could be per BDI as well, but yes.
Re: [PATCH] remove throttle_vm_writeout()
> > 1) File backed pages -> file
> >
> >    dirty + writeback count remains constant
> >
> > 2) Anonymous pages -> swap
> >
> >    writeback count increases, dirty balancing will hold back file
> >    writeback in favor of swap
> >
> > So the real question is: does case 2 need rate limiting, or is it OK
> > to let the device queue fill with swap pages as fast as possible?
>
> Because balance_dirty_pages() maintains:
>
>   nr_dirty + nr_unstable + nr_writeback <
>       total_dirty + nr_cpus * ratelimit_pages
>
> throttle_vm_writeout() _should_ not deadlock on that, unless you're
> caught in the error term: nr_cpus * ratelimit_pages.

And it does get caught on that in small memory machines.  This deadlock
is easily reproducible on a 32MB UML instance.  I haven't yet tested
with the per-bdi patches, but I don't think they make a difference in
this case.

> Which can only happen when it is larger than 10% of dirty_thresh.
>
> Which is even more unlikely since it doesn't account nr_dirty (as I
> think it should).

I think nr_dirty is totally irrelevant.  Since we don't care about case
1), and in case 2) nr_dirty doesn't play any role.

> As for 2), yes I think having a limit on the total number of pages in
> flight is a good thing.

Why?

> But that said, there might be better ways to do that.

Sure, if we do need to globally limit the number of under-writeback
pages, then I think we need to do it independently of the dirty
accounting.

Miklos
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 2007-10-04 at 14:25 +0200, Miklos Szeredi wrote:
> This in preparation for the writable mmap patches for fuse.  I know it
> conflicts with
>
>   writeback-remove-unnecessary-wait-in-throttle_vm_writeout.patch
>
> but if this function is to be removed, it doesn't make much sense to
> fix it first ;)
> ---
>
> From: Miklos Szeredi <[EMAIL PROTECTED]>
>
> By relying on the global dirty limits, this can cause a deadlock when
> devices are stacked.
>
> If the stacking is done through a fuse filesystem, the __GFP_FS,
> __GFP_IO tests won't help: the process doing the allocation doesn't
> have any special flag.
>
> So why exactly does this function exist?
>
> Direct reclaim does not _increase_ the number of dirty pages in the
> system, so rate limiting it seems somewhat pointless.
>
> There are two cases:
>
> 1) File backed pages -> file
>
>    dirty + writeback count remains constant
>
> 2) Anonymous pages -> swap
>
>    writeback count increases, dirty balancing will hold back file
>    writeback in favor of swap
>
> So the real question is: does case 2 need rate limiting, or is it OK
> to let the device queue fill with swap pages as fast as possible?

Because balance_dirty_pages() maintains:

  nr_dirty + nr_unstable + nr_writeback <
      total_dirty + nr_cpus * ratelimit_pages

throttle_vm_writeout() _should_ not deadlock on that, unless you're
caught in the error term: nr_cpus * ratelimit_pages.

Which can only happen when it is larger than 10% of dirty_thresh.

Which is even more unlikely since it doesn't account nr_dirty (as I
think it should).

As for 2), yes I think having a limit on the total number of pages in
flight is a good thing.

But that said, there might be better ways to do that.
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 04 Oct 2007 18:47:07 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> > > > > But that said, there might be better ways to do that.
> > > >
> > > > Sure, if we do need to globally limit the number of
> > > > under-writeback pages, then I think we need to do it
> > > > independently of the dirty accounting.
> > >
> > > It need not be global, it could be per BDI as well, but yes.
> >
> > For per-bdi limits we have the queue length.
>
> Agreed, except for:
>
> static int may_write_to_queue(struct backing_dev_info *bdi)
> {
> 	if (current->flags & PF_SWAPWRITE)
> 		return 1;
> 	if (!bdi_write_congested(bdi))
> 		return 1;
> 	if (bdi == current->backing_dev_info)
> 		return 1;
> 	return 0;
> }
>
> Which will write to congested queues.  Anybody know why?

commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
Author: akpm <akpm>
Date:   Sun Dec 22 01:07:33 2002 +0000

    [PATCH] Give kswapd writeback higher priority than pdflush

    The `low latency page reclaim' design works by preventing page
    allocators from blocking on request queues (and by preventing them
    from blocking against writeback of individual pages, but that is
    immaterial here).

    This has a problem under some situations.  pdflush (or a write(2)
    caller) could be saturating the queue with highmem pages.  This
    prevents anyone from writing back ZONE_NORMAL pages.  We end up
    doing enormous amounts of scanning.

    A test case is to mmap(MAP_SHARED) almost all of a 4G machine's
    memory, then kill the mmapping applications.  The machine instantly
    goes from 0% of memory dirty to 95% or more.  pdflush kicks in and
    starts writing the least-recently-dirtied pages, which are all
    highmem.  The queue is congested so nobody will write back
    ZONE_NORMAL pages.  kswapd chews 50% of the CPU scanning past dirty
    ZONE_NORMAL pages and page reclaim efficiency
    (pages_reclaimed/pages_scanned) falls to 2%.

    So this patch changes the policy for kswapd.  kswapd may use all of
    a request queue, and is prepared to block on request queues.

    What will now happen in the above scenario is:

    1: The page allocator scans some pages, fails to reclaim enough
       memory and takes a nap in blk_congestion_wait().

    2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
       back pages.  (These pages will be rotated to the tail of the
       inactive list at IO-completion interrupt time).

       This writeback will saturate the queue with ZONE_NORMAL pages.
       Conveniently, pdflush will avoid the congested queues.  So we
       end up writing the correct pages.

    In this test, kswapd CPU utilisation falls from 50% to 2%, page
    reclaim efficiency rises from 2% to 40% and things are generally a
    lot happier.

    The downside is that kswapd may now do a lot less page reclaim,
    increasing page allocation latency, causing more direct reclaim,
    increasing lock contention in the VM, etc.  But I have not been
    able to demonstrate that in testing.

    The other problem is that there is only one kswapd, and there are
    lots of disks.  That is a generic problem - without being able to
    co-opt user processes we don't have enough threads to keep lots of
    disks saturated.

    One fix for this would be to add an additional really congested
    threshold in the request queues, so kswapd can still perform
    nonblocking writeout.  This gives kswapd priority over pdflush
    while allowing kswapd to feed many disk queues.  I doubt if this
    will be called for.

BKrev: 3e051055aitHp3bZBPSqmq21KGs5aQ

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c635f39..9ab0209 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -7,6 +7,7 @@
 #include <linux/kdev_t.h>
 #include <linux/linkage.h>
 #include <linux/mmzone.h>
 #include <linux/list.h>
+#include <linux/sched.h>
 #include <asm/atomic.h>
 #include <asm/page.h>
@@ -14,6 +15,11 @@ #define SWAP_FLAG_PREFER	0x8000	/* set i
 #define SWAP_FLAG_PRIO_MASK	0x7fff
 #define SWAP_FLAG_PRIO_SHIFT	0
+
+static inline int current_is_kswapd(void)
+{
+	return current->flags & PF_KSWAPD;
+}
+
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are

diff --git a/mm/vmscan.c b/mm/vmscan.c
index aeab1e3..a8b9d2c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,6 +204,19 @@ static inline int is_page_cache_freeable
 	return page_count(page) - !!PagePrivate(page) == 2;
 }
+
+static int may_write_to_queue(struct backing_dev_info *bdi)
+{
+	if (current_is_kswapd())
+		return 1;
+	if (current_is_pdflush())	/* This is unlikely, but why not... */
+		return 1;
+	if (!bdi_write_congested(bdi))
+		return 1;
+	if (bdi == current->backing_dev_info)
+		return 1;
+	return 0;
+}
+
 /*
  *
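The decision chain in may_write_to_queue() above is just a priority ladder. A userspace paraphrase of that ladder (not kernel code; `current`'s flags and backing device have to be passed in explicitly here, and all names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for struct backing_dev_info; identity is all that matters. */
struct fake_bdi { int id; };

/*
 * Sketch of the policy the patch above installs: kswapd and pdflush may
 * always write (kswapd is even prepared to block on the queue); anyone
 * else backs off from a congested queue unless it is their own backing
 * device (in which case they would block on it during writeout anyway).
 */
static bool sketch_may_write_to_queue(bool is_kswapd, bool is_pdflush,
				      bool queue_congested,
				      const struct fake_bdi *bdi,
				      const struct fake_bdi *current_bdi)
{
	if (is_kswapd)
		return true;	/* may saturate and block on queues */
	if (is_pdflush)
		return true;
	if (!queue_congested)
		return true;	/* room left, anyone may write */
	if (bdi == current_bdi)
		return true;	/* our own device: write anyway */
	return false;		/* ordinary reclaimer vs congested queue */
}
```

So a direct reclaimer hitting someone else's congested queue is the only caller turned away, which is exactly the "low latency page reclaim" behaviour the changelog describes.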
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 2007-10-04 at 10:46 -0700, Andrew Morton wrote:
> On Thu, 04 Oct 2007 18:47:07 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > static int may_write_to_queue(struct backing_dev_info *bdi)
> > {
> > 	if (current->flags & PF_SWAPWRITE)
> > 		return 1;
> > 	if (!bdi_write_congested(bdi))
> > 		return 1;
> > 	if (bdi == current->backing_dev_info)
> > 		return 1;
> > 	return 0;
> > }
> >
> > Which will write to congested queues.  Anybody know why?

OK, I guess I could have found that :-/

> commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
> Author: akpm <akpm>
> Date:   Sun Dec 22 01:07:33 2002 +0000
>
>     [PATCH] Give kswapd writeback higher priority than pdflush
>
>     The `low latency page reclaim' design works by preventing page
>     allocators from blocking on request queues (and by preventing them
>     from blocking against writeback of individual pages, but that is
>     immaterial here).
>
>     This has a problem under some situations.  pdflush (or a write(2)
>     caller) could be saturating the queue with highmem pages.  This
>     prevents anyone from writing back ZONE_NORMAL pages.  We end up
>     doing enormous amounts of scanning.
>
>     A test case is to mmap(MAP_SHARED) almost all of a 4G machine's
>     memory, then kill the mmapping applications.  The machine
>     instantly goes from 0% of memory dirty to 95% or more.

With dirty page tracking this is not supposed to happen anymore.

>     pdflush kicks in and starts writing the least-recently-dirtied
>     pages, which are all highmem.

with highmem normal, and user pages preferring highmem, this will
likely still be true.

>     The queue is congested so nobody will write back ZONE_NORMAL
>     pages.  kswapd chews 50% of the CPU scanning past dirty
>     ZONE_NORMAL pages and page reclaim efficiency
>     (pages_reclaimed/pages_scanned) falls to 2%.

So, the problem is a heavy writer vs swap.  Which is still possible.

>     So this patch changes the policy for kswapd.  kswapd may use all
>     of a request queue, and is prepared to block on request queues.

So request queues have a limit above the congestion level on which they
will block?  NFS doesn't have that AFAIK.

>     What will now happen in the above scenario is:
>
>     1: The page allocator scans some pages, fails to reclaim enough
>        memory and takes a nap in blk_congestion_wait().
>
>     2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
>        back pages.  (These pages will be rotated to the tail of the
>        inactive list at IO-completion interrupt time).
>
>        This writeback will saturate the queue with ZONE_NORMAL pages.
>        Conveniently, pdflush will avoid the congested queues.  So we
>        end up writing the correct pages.
>
>     In this test, kswapd CPU utilisation falls from 50% to 2%, page
>     reclaim efficiency rises from 2% to 40% and things are generally
>     a lot happier.
>
>     The downside is that kswapd may now do a lot less page reclaim,
>     increasing page allocation latency, causing more direct reclaim,
>     increasing lock contention in the VM, etc.  But I have not been
>     able to demonstrate that in testing.
>
>     The other problem is that there is only one kswapd, and there are
>     lots of disks.  That is a generic problem - without being able to
>     co-opt user processes we don't have enough threads to keep lots
>     of disks saturated.
>
>     One fix for this would be to add an additional really congested
>     threshold in the request queues, so kswapd can still perform
>     nonblocking writeout.  This gives kswapd priority over pdflush
>     while allowing kswapd to feed many disk queues.  I doubt if this
>     will be called for.

I could do that.
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 04 Oct 2007 20:10:10 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-10-04 at 10:46 -0700, Andrew Morton wrote:
> > On Thu, 04 Oct 2007 18:47:07 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > > static int may_write_to_queue(struct backing_dev_info *bdi)
> > > {
> > > 	if (current->flags & PF_SWAPWRITE)
> > > 		return 1;
> > > 	if (!bdi_write_congested(bdi))
> > > 		return 1;
> > > 	if (bdi == current->backing_dev_info)
> > > 		return 1;
> > > 	return 0;
> > > }
> > >
> > > Which will write to congested queues.  Anybody know why?
>
> OK, I guess I could have found that :-/

Nice changelog, if I do say so myself ;)

> >     One fix for this would be to add an additional really congested
> >     threshold in the request queues, so kswapd can still perform
> >     nonblocking writeout.  This gives kswapd priority over pdflush
> >     while allowing kswapd to feed many disk queues.  I doubt if this
> >     will be called for.
>
> I could do that.

I guess first you'd need to be able to reproduce the problem which that
patch fixed, then check that it remains fixed.  Sigh.  That problem was
fairly subtle.

We could re-break reclaim in this way and not find out about it for six
months.  There's a lesson here.  Several.
Re: [PATCH] remove throttle_vm_writeout()
> Yeah, I'm guesstimating O on a per device basis, but I agree that the
> current ratio limiting is quite crude.
>
> I'm not at all sorry to see throttle_vm_writeback() go, I just wanted
> to make a point that what it does is not quite without merit - we
> agree that it can be done better differently.

Yes.  So what is it to be?  Is limiting by device queues enough?  Or do
we need some global limit?

If so, the cleanest way I see is to separately account and limit
swap-writeback pages, so the global counters don't interfere with the
limiting.  This shouldn't be hard to do, as we have the per-bdi
writeback counting infrastructure already, and also a pseudo bdi for
swap in swapper_space.backing_dev_info.

Miklos
Re: [PATCH] remove throttle_vm_writeout()
On Thu, 04 Oct 2007 14:25:22 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> From: Miklos Szeredi <[EMAIL PROTECTED]>
>
> By relying on the global dirty limits, this can cause a deadlock when
> devices are stacked.
>
> If the stacking is done through a fuse filesystem, the __GFP_FS,
> __GFP_IO tests won't help: the process doing the allocation doesn't
> have any special flag.

This description of the bug-which-is-being-fixed is nowhere near
adequate enough for a reviewer to understand the problem.  This makes
it hard to suggest alternative fixes.

> So why exactly does this function exist?

That's described in the changelog for the patch which added
throttle_vm_writeout().  Unsurprisingly ;)

> Direct reclaim does not _increase_ the number of dirty pages in the
> system, so rate limiting it seems somewhat pointless.
>
> There are two cases:
>
> 1) File backed pages -> file
>
>    dirty + writeback count remains constant
>
> 2) Anonymous pages -> swap
>
>    writeback count increases, dirty balancing will hold back file
>    writeback in favor of swap
>
> So the real question is: does case 2 need rate limiting, or is it OK
> to let the device queue fill with swap pages as fast as possible?

None of the above.

    [PATCH] vm: pageout throttling

    With silly pageout testcases it is possible to place huge amounts
    of memory under I/O.  With a large request queue (CFQ uses 8192
    requests) it is possible to place _all_ memory under I/O at the
    same time.

    This means that all memory is pinned and unreclaimable and the VM
    gets upset and goes oom.

    The patch limits the amount of memory which is under pageout
    writeout to be a little more than the amount of memory at which
    balance_dirty_pages() callers will synchronously throttle.

    This means that heavy pageout activity can starve heavy writeback
    activity completely, but heavy writeback activity will not cause
    starvation of pageout.  Because we don't want a simple `dd' to be
    causing excessive latencies in page reclaim.

afaict that problem is still there.

It is possible to get all of ZONE_NORMAL dirty on a highmem machine.
With a large queue (or lots of queues), vmscan can then place all of
ZONE_NORMAL under IO.

It could be that we've fixed this problem via other means in the
interim, but from a quick peek it seems to me that the scanner will
still do a 100% CPU burn when all of a zone's pages are under
writeback.

throttle_vm_writeout() should be a per-zone thing, I guess.  Perhaps
fixing that would fix your deadlock.  That's doubtful, but I don't know
anything about your deadlock so I cannot say.
Re: [PATCH] remove throttle_vm_writeout()
> None of the above.
>
>   [PATCH] vm: pageout throttling
>
>   With silly pageout testcases it is possible to place huge amounts
>   of memory under I/O. With a large request queue (CFQ uses 8192
>   requests) it is possible to place _all_ memory under I/O at the
>   same time.
>
>   This means that all memory is pinned and unreclaimable and the VM
>   gets upset and goes oom.
>
>   The patch limits the amount of memory which is under pageout
>   writeout to be a little more than the amount of memory at which
>   balance_dirty_pages() callers will synchronously throttle.
>
>   This means that heavy pageout activity can starve heavy writeback
>   activity completely, but heavy writeback activity will not cause
>   starvation of pageout.
>
> Because we don't want a simple `dd' to be causing excessive latencies
> in page reclaim.
>
> afaict that problem is still there.
>
> It is possible to get all of ZONE_NORMAL dirty on a highmem machine.
> With a large queue (or lots of queues), vmscan can then place all of
> ZONE_NORMAL under IO.
>
> It could be that we've fixed this problem via other means in the
> interim, but from a quick peek it seems to me that the scanner will
> still do a 100% CPU burn when all of a zone's pages are under
> writeback.

Ah, OK. I did read the changelog, but you added quite a bit of translation ;)

> throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps
> fixing that would fix your deadlock. That's doubtful, but I don't
> know anything about your deadlock so I cannot say.

No, doing the throttling per-zone won't in itself fix the deadlock. Here's a deadlock example:

Total memory = 32M
/proc/sys/vm/dirty_ratio = 10
dirty_threshold = 3M
ratelimit_pages = 1M

Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap on a fuse fs. Page balancing is called which turns all these into writeback pages. Then userspace filesystem gets a write request, and tries to allocate memory needed to complete the writeout. That will possibly trigger direct reclaim, and throttle_vm_writeout() will be called.
That will block until nr_writeback goes below 3.3M (dirty_threshold + 10%). But since all 4M of writeback is from the fuse fs, that will never happen.

Does that explain it better?

Miklos
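Plugging the example's figures into a quick check (units are tenths of a megabyte so the integer arithmetic stays exact; the helper name is hypothetical) shows why the wait can never terminate:

```c
#include <assert.h>

/*
 * The deadlock arithmetic from the example above, in tenths of a
 * megabyte.  throttle_vm_writeout() waits until nr_writeback drops
 * below dirty_threshold + 10%; if every writeback page belongs to the
 * stuck fuse fs, nr_writeback never drops at all.
 */
static int stuck_forever(long nr_writeback, long dirty_thresh)
{
	long throttle_limit = dirty_thresh + dirty_thresh / 10;

	return nr_writeback > throttle_limit;
}
```

With 4M (40 tenths) of fuse writeback against a 3M threshold (3.3M limit), the condition holds permanently; only if writeback could fall to 3.3M or below would the waiter make progress.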
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 00:39:16 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

> > throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps
> > fixing that would fix your deadlock. That's doubtful, but I don't
> > know anything about your deadlock so I cannot say.
>
> No, doing the throttling per-zone won't in itself fix the deadlock.
> Here's a deadlock example:
>
> Total memory = 32M
> /proc/sys/vm/dirty_ratio = 10
> dirty_threshold = 3M
> ratelimit_pages = 1M
>
> Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap
> on a fuse fs. Page balancing is called which turns all these into
> writeback pages. Then userspace filesystem gets a write request, and
> tries to allocate memory needed to complete the writeout. That will
> possibly trigger direct reclaim, and throttle_vm_writeout() will be
> called.
>
> That will block until nr_writeback goes below 3.3M (dirty_threshold +
> 10%). But since all 4M of writeback is from the fuse fs, that will
> never happen.
>
> Does that explain it better?

yup, thanks.

This is a somewhat general problem: a userspace process is in the IO path. Userspace block drivers, for example - pretty much anything which involves kernel-userspace upcalls for storage applications.

I solved it once in the past by marking the userspace process as PF_MEMALLOC and I believe that others have implemented the same hack.

I suspect that what we need is a general solution, and that the solution will involve explicitly telling the kernel that this process is one which actually cleans memory and needs special treatment. Because I bet there will be other corner-cases where such a process needs kernel help, and there might be optimisation opportunities as well.

Problem is, any such mark-me-as-special syscall would need to be privileged, and FUSE servers presently don't require special perms (do they?)
Re: [PATCH] remove throttle_vm_writeout()
> This is a somewhat general problem: a userspace process is in the IO
> path. Userspace block drivers, for example - pretty much anything
> which involves kernel-userspace upcalls for storage applications.
>
> I solved it once in the past by marking the userspace process as
> PF_MEMALLOC and I believe that others have implemented the same hack.
>
> I suspect that what we need is a general solution, and that the
> solution will involve explicitly telling the kernel that this process
> is one which actually cleans memory and needs special treatment.
> Because I bet there will be other corner-cases where such a process
> needs kernel help, and there might be optimisation opportunities as
> well.
>
> Problem is, any such mark-me-as-special syscall would need to be
> privileged, and FUSE servers presently don't require special perms
> (do they?)

No, and that's a rather important feature, that I'd rather not give up.

But with the dirty limiting, the memory cleaning really shouldn't be a problem, as there is plenty of memory _not_ used for dirty file data, that the filesystem can use during the writeback. So the only thing the kernel should be careful about is not to block on an allocation if not strictly necessary.

Actually a trivial fix for this problem could be to just tweak the thresholds, so to make the above scenario impossible. Although I'm still not convinced this patch is perfect, because the dirty threshold can actually change in time...
Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c	2007-10-05 00:31:01.0 +0200
+++ linux/mm/page-writeback.c	2007-10-05 00:50:11.0 +0200
@@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	for ( ; ; ) {
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);

+		/*
+		 * Make sure the threshold is over the hard limit of
+		 * dirty_thresh + ratelimit_pages * nr_cpus
+		 */
+		dirty_thresh += ratelimit_pages * num_online_cpus();
+
 		/*
 		 * Boost the allowable dirty threshold a bit for page
 		 * allocators so they don't get DoS'ed by heavy writers
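The effect of the tweak is just to raise the throttle cut-off by the amount balance_dirty_pages() callers can legitimately overshoot: each CPU may dirty up to ratelimit_pages before the ratelimit check fires. A minimal model of the resulting cut-off (names hypothetical, not kernel code):

```c
#include <assert.h>

/*
 * Model of the proposed tweak above: the writeback total can
 * legitimately overshoot dirty_thresh by ratelimit_pages per online
 * CPU, so raise the throttle cut-off by that amount to keep
 * throttle_vm_writeout() from waiting on a count that can never drop
 * far enough.  Units are pages; names are illustrative.
 */
static long throttle_cutoff(long dirty_thresh, long ratelimit_pages,
			    int nr_cpus)
{
	return dirty_thresh + ratelimit_pages * (long)nr_cpus;
}
```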
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 01:26:12 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

> > This is a somewhat general problem: a userspace process is in the
> > IO path. Userspace block drivers, for example - pretty much
> > anything which involves kernel-userspace upcalls for storage
> > applications.
> >
> > I solved it once in the past by marking the userspace process as
> > PF_MEMALLOC and I believe that others have implemented the same
> > hack.
> >
> > I suspect that what we need is a general solution, and that the
> > solution will involve explicitly telling the kernel that this
> > process is one which actually cleans memory and needs special
> > treatment. Because I bet there will be other corner-cases where
> > such a process needs kernel help, and there might be optimisation
> > opportunities as well.
> >
> > Problem is, any such mark-me-as-special syscall would need to be
> > privileged, and FUSE servers presently don't require special perms
> > (do they?)
>
> No, and that's a rather important feature, that I'd rather not give
> up.

Can fuse do it? Perhaps the fs can diddle the server's task_struct at registration time?

> But with the dirty limiting, the memory cleaning really shouldn't be
> a problem, as there is plenty of memory _not_ used for dirty file
> data, that the filesystem can use during the writeback.

I don't think I understand that. Sure, it _shouldn't_ be a problem. But it _is_. That's what we're trying to fix, isn't it?

> So the only thing the kernel should be careful about is not to block
> on an allocation if not strictly necessary.
>
> Actually a trivial fix for this problem could be to just tweak the
> thresholds, so to make the above scenario impossible. Although I'm
> still not convinced this patch is perfect, because the dirty
> threshold can actually change in time...
> Index: linux/mm/page-writeback.c
> ===
> --- linux.orig/mm/page-writeback.c	2007-10-05 00:31:01.0 +0200
> +++ linux/mm/page-writeback.c	2007-10-05 00:50:11.0 +0200
> @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask
>  	for ( ; ; ) {
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>
> +		/*
> +		 * Make sure the threshold is over the hard limit of
> +		 * dirty_thresh + ratelimit_pages * nr_cpus
> +		 */
> +		dirty_thresh += ratelimit_pages * num_online_cpus();
> +
>  		/*
>  		 * Boost the allowable dirty threshold a bit for page
>  		 * allocators so they don't get DoS'ed by heavy writers

I can probably kind of guess what you're trying to do here. But if ratelimit_pages * num_online_cpus() exceeds the size of the offending zone then things might go bad.
Re: [PATCH] remove throttle_vm_writeout()
> > > This is a somewhat general problem: a userspace process is in the
> > > IO path. Userspace block drivers, for example - pretty much
> > > anything which involves kernel-userspace upcalls for storage
> > > applications.
> > >
> > > I solved it once in the past by marking the userspace process as
> > > PF_MEMALLOC and I believe that others have implemented the same
> > > hack.
> > >
> > > I suspect that what we need is a general solution, and that the
> > > solution will involve explicitly telling the kernel that this
> > > process is one which actually cleans memory and needs special
> > > treatment. Because I bet there will be other corner-cases where
> > > such a process needs kernel help, and there might be optimisation
> > > opportunities as well.
> > >
> > > Problem is, any such mark-me-as-special syscall would need to be
> > > privileged, and FUSE servers presently don't require special
> > > perms (do they?)
> >
> > No, and that's a rather important feature, that I'd rather not give
> > up.
>
> Can fuse do it? Perhaps the fs can diddle the server's task_struct at
> registration time?

No, it's futile. What if another process is involved (ssh in case of sshfs), etc.

> > But with the dirty limiting, the memory cleaning really shouldn't
> > be a problem, as there is plenty of memory _not_ used for dirty
> > file data, that the filesystem can use during the writeback.
>
> I don't think I understand that. Sure, it _shouldn't_ be a problem.
> But it _is_. That's what we're trying to fix, isn't it?

The problem, I believe, is in the memory allocation code, not in fuse. In the example, memory allocation may be blocking indefinitely, because we have 4MB under writeback, even though 28MB can still be made available. And that _should_ be fixable.

> > So the only thing the kernel should be careful about is not to
> > block on an allocation if not strictly necessary.
> >
> > Actually a trivial fix for this problem could be to just tweak the
> > thresholds, so to make the above scenario impossible. Although I'm
> > still not convinced this patch is perfect, because the dirty
> > threshold can actually change in time...
> > Index: linux/mm/page-writeback.c
> > ===
> > --- linux.orig/mm/page-writeback.c	2007-10-05 00:31:01.0 +0200
> > +++ linux/mm/page-writeback.c	2007-10-05 00:50:11.0 +0200
> > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask
> >  	for ( ; ; ) {
> >  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> >
> > +		/*
> > +		 * Make sure the threshold is over the hard limit of
> > +		 * dirty_thresh + ratelimit_pages * nr_cpus
> > +		 */
> > +		dirty_thresh += ratelimit_pages * num_online_cpus();
> > +
> >  		/*
> >  		 * Boost the allowable dirty threshold a bit for page
> >  		 * allocators so they don't get DoS'ed by heavy writers
>
> I can probably kind of guess what you're trying to do here. But if
> ratelimit_pages * num_online_cpus() exceeds the size of the offending
> zone then things might go bad.

I think the admin can do quite a bit of other damage by setting dirty_ratio too high. Maybe this writeback throttling should just have a fixed limit of 80% of ZONE_NORMAL, and limit dirty_ratio to something like 50%.

Miklos
Re: [PATCH] remove throttle_vm_writeout()
On Fri, 05 Oct 2007 02:12:30 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

> > I don't think I understand that. Sure, it _shouldn't_ be a problem.
> > But it _is_. That's what we're trying to fix, isn't it?
>
> The problem, I believe, is in the memory allocation code, not in
> fuse.

fuse is trying to do something which page reclaim was not designed for. Stuff broke.

> In the example, memory allocation may be blocking indefinitely,
> because we have 4MB under writeback, even though 28MB can still be
> made available. And that _should_ be fixable.

Well yes. But we need to work out how, without re-breaking the thing which throttle_vm_writeout() fixed.

> > > So the only thing the kernel should be careful about is not to
> > > block on an allocation if not strictly necessary.
> > >
> > > Actually a trivial fix for this problem could be to just tweak
> > > the thresholds, so to make the above scenario impossible.
> > > Although I'm still not convinced this patch is perfect, because
> > > the dirty threshold can actually change in time...
> > >
> > > Index: linux/mm/page-writeback.c
> > > ===
> > > --- linux.orig/mm/page-writeback.c	2007-10-05 00:31:01.0 +0200
> > > +++ linux/mm/page-writeback.c	2007-10-05 00:50:11.0 +0200
> > > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask
> > >  	for ( ; ; ) {
> > >  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> > >
> > > +		/*
> > > +		 * Make sure the threshold is over the hard limit of
> > > +		 * dirty_thresh + ratelimit_pages * nr_cpus
> > > +		 */
> > > +		dirty_thresh += ratelimit_pages * num_online_cpus();
> > > +
> > >  		/*
> > >  		 * Boost the allowable dirty threshold a bit for page
> > >  		 * allocators so they don't get DoS'ed by heavy writers
> >
> > I can probably kind of guess what you're trying to do here. But if
> > ratelimit_pages * num_online_cpus() exceeds the size of the
> > offending zone then things might go bad.
>
> I think the admin can do quite a bit of other damage by setting
> dirty_ratio too high. Maybe this writeback throttling should just
> have a fixed limit of 80% of ZONE_NORMAL, and limit dirty_ratio to
> something like 50%.
Bear in mind that the same problem will occur for the 16MB ZONE_DMA, and we cannot limit the system-wide dirty-memory threshold to 12MB.

iow, throttle_vm_writeout() needs to become zone-aware. Then it only throttles when, say, 80% of ZONE_FOO is under writeback.

Except I don't think that'll fix the problem 100%: if your fuse kernel component somehow manages to put 80% of ZONE_FOO under writeback (and remember this might be only 12MB on a 16GB machine) then we get stuck again - the fuse server process (is that the correct terminology, btw?) ends up waiting upon itself.

I'll think about it a bit.
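A zone-aware predicate along the lines suggested here might look like the sketch below. The 80% cut-off and the names come from this mail's suggestion, not from any actual implementation:

```c
#include <assert.h>

/*
 * Sketch of a zone-aware throttle test: block reclaim only when most
 * of *this* zone (say 80% of its pages) is under writeback, instead
 * of comparing against the system-wide dirty threshold.  The cut-off
 * and names are illustrative.
 */
static int zone_writeout_congested(long zone_pages, long zone_writeback)
{
	/* integer form of: zone_writeback > 80% of zone_pages */
	return zone_writeback * 10 > zone_pages * 8;
}
```

Per-zone accounting of writeback pages would be needed to evaluate this, which is part of why it is only a direction and not a patch.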