Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi

2007-10-03 Thread Fengguang Wu
On Wed, Oct 03, 2007 at 01:46:52PM +0100, richard kennedy wrote:
> On Tue, 2007-10-02 at 10:00 +0800, Fengguang Wu wrote:
> > ---
> >  mm/page-writeback.c |5 +
> >  1 file changed, 5 insertions(+)
> > 
> > --- linux-2.6.22.orig/mm/page-writeback.c
> > +++ linux-2.6.22/mm/page-writeback.c
> > @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a
> > pages_written += write_chunk - wbc.nr_to_write;
> > if (pages_written >= write_chunk)
> > break;  /* We've done our duty */
> > +   if (list_empty(&mapping->host->i_sb->s_dirty) &&
> > +   list_empty(&mapping->host->i_sb->s_io) &&
> > +   nr_reclaimable + global_page_state(NR_WRITEBACK) <=
> > +   dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT)))
> > +   break;
> > }
> > congestion_wait(WRITE, HZ/10);
> > }
> 
> I've been testing 2.6.23-rc9 + this patch all morning but have just seen
> a lockup. As usual it happened  just after a large file copy finished
> and while nr_dirty is still large. I'm sorry to say I didn't have a
> serial console running so I don't have an other info. I will try again
> and see if I can capture some more data.
> 
> I did notice that at the beginning of my tests the dirty blocks are
> written back more quickly than usual
> 
> nr_dirty count after the copy finished and then 60 seconds later :-
> after copy+60 seconds
> 73520 0
> 73533 0
> 68554 1
>  
> but after several iterations of my testcase & just before the lockup
> 68560 57165
> 71974 62896
>  
> which is about the same as a unpatched kernel.

Hi Richard,

Thank you for the testing.

However, my patch is kind of duplicate efforts. I was taking the
'do it if simple' attitude.  I can continue to improve it if you 
really want it. Otherwise I'd recommend you to test the coming
2.6.24-rc1 or backport the -mm writeback patches back to 2.6.23 and
test it there. Peter has did a good job on it.

Fengguang

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi

2007-10-03 Thread richard kennedy
On Tue, 2007-10-02 at 10:00 +0800, Fengguang Wu wrote:
> On Mon, Oct 01, 2007 at 11:57:34AM -0400, Chuck Ebbert wrote:
> > On 09/29/2007 07:04 AM, Fengguang Wu wrote:
...
> 
> (expecting real world confirmations...)
> 
> Here is a new safer version.  It's more ugly though.
> 
> ---
> writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
> 
> On a busy-writing system, a writer could be hold up infinitely on a
> light-load device. It will be trying to sync more than available dirty data.
> 
> The problem case:
> 
> 0. sda/nr_dirty >= dirty_limit;
>sdb/nr_dirty == 0
> 1. dd writes 32 pages on sdb
> 2. balance_dirty_pages() blocks dd, and tries to write 6MB.
> 3. it never gets there: there's only 128KB dirty data.
> 4. dd may be blocked for a lng time
> 
> Fix it by returning on 'zero dirty inodes' in the current bdi.
> (In fact there are slight differences between 'dirty inodes' and 'dirty 
> pages'.
> But there is no available counters for 'dirty pages'.)
> 
> But the newly introduced 'break' could make the nr_writeback drift away
> above the dirty limit. The workaround is to limit the error under 1MB.
> 
> Cc: Chuck Ebbert <[EMAIL PROTECTED]>
> Cc: Peter Zijlstra <[EMAIL PROTECTED]>
> Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
> ---
>  mm/page-writeback.c |5 +
>  1 file changed, 5 insertions(+)
> 
> --- linux-2.6.22.orig/mm/page-writeback.c
> +++ linux-2.6.22/mm/page-writeback.c
> @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a
>   pages_written += write_chunk - wbc.nr_to_write;
>   if (pages_written >= write_chunk)
>   break;  /* We've done our duty */
> + if (list_empty(&mapping->host->i_sb->s_dirty) &&
> + list_empty(&mapping->host->i_sb->s_io) &&
> + nr_reclaimable + global_page_state(NR_WRITEBACK) <=
> + dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT)))
> + break;
>   }
>   congestion_wait(WRITE, HZ/10);
>   }

I've been testing 2.6.23-rc9 + this patch all morning but have just seen
a lockup. As usual it happened  just after a large file copy finished
and while nr_dirty is still large. I'm sorry to say I didn't have a
serial console running so I don't have an other info. I will try again
and see if I can capture some more data.

I did notice that at the beginning of my tests the dirty blocks are
written back more quickly than usual

nr_dirty count after the copy finished and then 60 seconds later :-
after copy  +60 seconds
73520   0
73533   0
68554   1
 
but after several iterations of my testcase & just before the lockup
68560   57165
71974   62896
 
which is about the same as a unpatched kernel.

Richard
   

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi

2007-10-02 Thread Chuck Ebbert
On 10/02/2007 09:27 AM, Fengguang Wu wrote:
> On Tue, Oct 02, 2007 at 08:13:27PM +0800, Fengguang Wu wrote:
> [...]
>> One more serious problem is, a busy writer could also drain all the
>> dirty pages and make (nr_writeback == dirty_limit+1MB). In that case,
>> I suspect the light-load sdb writer still have good chance to
>> make progress(need confirmation).
> 
> Well it seems to be a really tricky issue without knowing the per-bdi
> numbers. Maybe we could just encourage users to upgrade to 2.6.24...
> 

Yeah, and if that doesn't work there's always 2.6.25... :/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi

2007-10-02 Thread Fengguang Wu
On Tue, Oct 02, 2007 at 08:13:27PM +0800, Fengguang Wu wrote:
[...]
> One more serious problem is, a busy writer could also drain all the
> dirty pages and make (nr_writeback == dirty_limit+1MB). In that case,
> I suspect the light-load sdb writer still have good chance to
> make progress(need confirmation).

Well it seems to be a really tricky issue without knowing the per-bdi
numbers. Maybe we could just encourage users to upgrade to 2.6.24...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi

2007-10-02 Thread Fengguang Wu
On Mon, Oct 01, 2007 at 07:14:57PM -0700, Andrew Morton wrote:
> On Tue, 2 Oct 2007 10:00:40 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> 
> > writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
> > 
> > On a busy-writing system, a writer could be hold up infinitely on a
> > light-load device. It will be trying to sync more than available dirty data.
> > 
> > The problem case:
> > 
> > 0. sda/nr_dirty >= dirty_limit;
> >sdb/nr_dirty == 0
> > 1. dd writes 32 pages on sdb
> > 2. balance_dirty_pages() blocks dd, and tries to write 6MB.
> > 3. it never gets there: there's only 128KB dirty data.
> > 4. dd may be blocked for a lng time
> 
> Please quantify lng.

There're only two 'break' conditions in the loop:
1. nr_dirty + nr_unstable + nr_writeback < dirty_limit
   => *mostly* FALSE for a busy system
   => *always* FALSE in Chakri's stucked NFS case
2. nr_written >= 6MB
   for a light-load bdi:
   => *never* TRUE until there comes many new writers, contributing
  more dirty pages to sync
   => more worse, those new writers will also stuck here...
  the obvious unbalance here is:
   each writer contributes only 32KB new dirty pages, but
   want to consume (not necessarily available) 6MB

So lng = min(global-less-busy-time, bdi-many-new-writers-arrival-time).

> > Fix it by returning on 'zero dirty inodes' in the current bdi.
> > (In fact there are slight differences between 'dirty inodes' and 'dirty 
> > pages'.
> > But there is no available counters for 'dirty pages'.)
> > 
> > But the newly introduced 'break' could make the nr_writeback drift away
> > above the dirty limit. The workaround is to limit the error under 1MB.
> 
> I'm still not sure that we fully understand this yet.
> 
> If the sdb writer is stuck in balance_dirty_pages() then all sda writers
> will be in balance_dirty_pages() too, madly writing stuff out to sda.  And
> pdflush will be writing out sda as well.  All this writeout to sda should
> release the sdb writer.
> 
> Why isn't this happening?

You are right in the reasoning. The exact consequence is:
the light-load sdb is made as _unresponsive_ as the busy sda

Hence Chakri's case: whenever NFS is stuck, every device get stuck.

> 
> > Cc: Chuck Ebbert <[EMAIL PROTECTED]>
> > Cc: Peter Zijlstra <[EMAIL PROTECTED]>
> > Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
> > ---
> >  mm/page-writeback.c |5 +
> >  1 file changed, 5 insertions(+)
> > 
> > --- linux-2.6.22.orig/mm/page-writeback.c
> > +++ linux-2.6.22/mm/page-writeback.c
> > @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a
> > pages_written += write_chunk - wbc.nr_to_write;
> > if (pages_written >= write_chunk)
> > break;  /* We've done our duty */
> > +   if (list_empty(&mapping->host->i_sb->s_dirty) &&
> > +   list_empty(&mapping->host->i_sb->s_io) &&
> > +   nr_reclaimable + global_page_state(NR_WRITEBACK) <=
> > +   dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT)))
> > +   break;
> > }
> > congestion_wait(WRITE, HZ/10);
> > }
> 
> Well that has a nice safetly net.  Perhaps it could fail a bit later on,
> but that depends on why it's failing.

In theory, every CPU/paralle writer could contribute 8 pages of error.
Hence we get 1MB/32KB = 32 (CPUs/writers).

One more serious problem is, a busy writer could also drain all the
dirty pages and make (nr_writeback == dirty_limit+1MB). In that case,
I suspect the light-load sdb writer still have good chance to
make progress(need confirmation).

> How well tested was this?

Not well tested till now. My system becomes unusable soon after
starting the NFS write(even before plugging the network). I'm seeing
large latencies in try_to_wake_up(). Hope that Ingo could help it out.

> If we merge this for 2.6.23 then I expect that we'll immediately unmerge it
> for 2.6.24 because Peter's stuff fixes this problem by other means.
> 
> Do we all agree with the above sentence?

Yeah, Peter and me were both aware of the timing.
This patch is only meant for 2.6.23 and 2.6.22.10.

Fengguang

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi

2007-10-01 Thread Andrew Morton
On Tue, 2 Oct 2007 10:00:40 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:

> writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
> 
> On a busy-writing system, a writer could be hold up infinitely on a
> light-load device. It will be trying to sync more than available dirty data.
> 
> The problem case:
> 
> 0. sda/nr_dirty >= dirty_limit;
>sdb/nr_dirty == 0
> 1. dd writes 32 pages on sdb
> 2. balance_dirty_pages() blocks dd, and tries to write 6MB.
> 3. it never gets there: there's only 128KB dirty data.
> 4. dd may be blocked for a lng time

Please quantify lng.
 
> Fix it by returning on 'zero dirty inodes' in the current bdi.
> (In fact there are slight differences between 'dirty inodes' and 'dirty 
> pages'.
> But there is no available counters for 'dirty pages'.)
> 
> But the newly introduced 'break' could make the nr_writeback drift away
> above the dirty limit. The workaround is to limit the error under 1MB.

I'm still not sure that we fully understand this yet.

If the sdb writer is stuck in balance_dirty_pages() then all sda writers
will be in balance_dirty_pages() too, madly writing stuff out to sda.  And
pdflush will be writing out sda as well.  All this writeout to sda should
release the sdb writer.

Why isn't this happening?


> Cc: Chuck Ebbert <[EMAIL PROTECTED]>
> Cc: Peter Zijlstra <[EMAIL PROTECTED]>
> Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
> ---
>  mm/page-writeback.c |5 +
>  1 file changed, 5 insertions(+)
> 
> --- linux-2.6.22.orig/mm/page-writeback.c
> +++ linux-2.6.22/mm/page-writeback.c
> @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a
>   pages_written += write_chunk - wbc.nr_to_write;
>   if (pages_written >= write_chunk)
>   break;  /* We've done our duty */
> + if (list_empty(&mapping->host->i_sb->s_dirty) &&
> + list_empty(&mapping->host->i_sb->s_io) &&
> + nr_reclaimable + global_page_state(NR_WRITEBACK) <=
> + dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT)))
> + break;
>   }
>   congestion_wait(WRITE, HZ/10);
>   }

Well that has a nice safetly net.  Perhaps it could fail a bit later on,
but that depends on why it's failing.

How well tested was this?

If we merge this for 2.6.23 then I expect that we'll immediately unmerge it
for 2.6.24 because Peter's stuff fixes this problem by other means.

Do we all agree with the above sentence?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/