On Wed, 19 Sep 2007, Andy Whitcroft wrote:
> Seems I have a case of a largish i386 NUMA (NUMA-Q) which has a mkfs
> stuck in a 'D' wait:
> 
>  =======================
> mkfs.ext2     D c10220f4     0  6233   6222
>  [<c12194da>] io_schedule_timeout+0x1e/0x28
>  [<c10454b4>] congestion_wait+0x62/0x7a
>  [<c10402af>] get_dirty_limits+0x16a/0x172
>  [<c104040b>] balance_dirty_pages+0x154/0x1be
>  [<c103bda3>] generic_perform_write+0x168/0x18a
>  [<c103be38>] generic_file_buffered_write+0x73/0x107
>  [<c103c346>] __generic_file_aio_write_nolock+0x47a/0x4a5
>  [<c103c3b9>] generic_file_aio_write_nolock+0x48/0x9b
>  [<c105d2d6>] do_sync_write+0xbf/0xfc
>  [<c105d3a0>] vfs_write+0x8d/0x108
>  [<c105d4c3>] sys_write+0x41/0x67
>  [<c100260a>] syscall_call+0x7/0xb
>  =======================

[edited out some bogus lines from stale stack]

> This machine and others have run numerous test runs on this kernel and
> this is the first time I've seen a hang like this.

I've been seeing something like that on 4-way PPC64: in my case I have
shells hanging in D state while trying to append to the kernel build log
on ext3 (the builds themselves run elsewhere, in tmpfs), with one of the
shells holding i_mutex and stuck doing congestion_waits from
balance_dirty_pages.

> I wonder if this is the ultimate cause of the couple of mainline hangs
> which were seen, but not diagnosed.

My *guess* is that this is peculiar to 2.6.23-rc6-mm1, and comes from
Peter's mm-per-device-dirty-threshold.patch.  printks in
balance_dirty_pages showed bdi_nr_reclaimable 0, bdi_nr_writeback 24,
bdi_thresh 1 (though I've not done enough to check whether those really
correlate with the hangs), and I'm wondering if the bdi_stat_sum
business is also needed on the !nr_reclaimable path.
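
To illustrate why an approximate read and an exact sum can disagree when
the threshold is tiny, here is a minimal user-space sketch of a batched
per-CPU counter.  It is not the kernel's implementation: the names
approx_counter, counter_add, counter_read, counter_sum, MODEL_CPUS and
MODEL_BATCH are hypothetical, and the batch size is an arbitrary
assumption chosen for the example.

```c
#include <assert.h>

#define MODEL_CPUS  4   /* hypothetical CPU count for this model */
#define MODEL_BATCH 8   /* hypothetical per-CPU batch size */

/* Toy model of a batched counter; NOT the kernel's data structure. */
struct approx_counter {
	long global;              /* value a cheap, approximate read returns */
	long delta[MODEL_CPUS];   /* per-CPU changes not yet folded in */
};

/* Add n on one CPU; fold into the global count once a batch is reached. */
static void counter_add(struct approx_counter *c, int cpu, long n)
{
	assert(cpu >= 0 && cpu < MODEL_CPUS);
	c->delta[cpu] += n;
	if (c->delta[cpu] >= MODEL_BATCH || c->delta[cpu] <= -MODEL_BATCH) {
		c->global += c->delta[cpu];
		c->delta[cpu] = 0;
	}
}

/* Cheap read: ignores whatever is still sitting in the per-CPU deltas. */
static long counter_read(const struct approx_counter *c)
{
	return c->global;
}

/* Exact read: global count plus every outstanding per-CPU delta. */
static long counter_sum(const struct approx_counter *c)
{
	long sum = c->global;
	int cpu;

	for (cpu = 0; cpu < MODEL_CPUS; cpu++)
		sum += c->delta[cpu];
	return sum;
}
```

In this model the cheap read can be off by up to MODEL_BATCH - 1 per CPU
in either direction, so comparing it against a threshold of 1 says very
little; only the exact sum is meaningful at that scale.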

So I'm running now with the patch below, good so far, but can't judge
until tomorrow whether it has actually addressed the problem seen.
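
The control-flow change in the patch can be modelled for a single pass
of the loop as follows.  This is only a sketch under assumptions: struct
bdi_counts and both functions are hypothetical, the "exact" values stand
in for what bdi_stat_sum would return (nothing reported here shows what
they actually were), get_dirty_limits is assumed to leave the threshold
unchanged, and the pages_written exit is left out.

```c
#include <stdbool.h>

/* Hypothetical snapshot of one pass; not a kernel structure. */
struct bdi_counts {
	long approx_reclaimable;  /* models bdi_stat(BDI_RECLAIMABLE) */
	long approx_writeback;    /* models bdi_stat(BDI_WRITEBACK) */
	long exact_reclaimable;   /* models bdi_stat_sum(BDI_RECLAIMABLE) */
	long exact_writeback;     /* models bdi_stat_sum(BDI_WRITEBACK) */
	long thresh;              /* models bdi_thresh */
	long stat_error;          /* models bdi_stat_error(bdi) */
};

/* Before the patch: the accurate recheck sits inside if (reclaimable). */
static bool old_pass_exits(const struct bdi_counts *c)
{
	long r = c->approx_reclaimable;
	long w = c->approx_writeback;

	if (r + w <= c->thresh)
		return true;
	if (r) {
		if (c->thresh < 2 * c->stat_error) {
			r = c->exact_reclaimable;
			w = c->exact_writeback;
		}
		if (r + w <= c->thresh)
			return true;
	}
	return false;   /* falls through to congestion_wait */
}

/* After the patch: the accurate recheck runs on every pass. */
static bool new_pass_exits(const struct bdi_counts *c)
{
	long r = c->approx_reclaimable;
	long w = c->approx_writeback;

	if (r + w <= c->thresh)
		return true;
	if (c->thresh < 2 * c->stat_error) {
		r = c->exact_reclaimable;
		w = c->exact_writeback;
	}
	return r + w <= c->thresh;
}
```

With approximate counts of 0 reclaimable and 24 writeback against a
threshold of 1, the old pass can never exit, whatever the exact counts
are; the new pass exits as soon as the exact sum is at or below the
threshold.  If the exact counts really are above the threshold, both
versions keep waiting.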

Not-yet-Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |   53 +++++++++++++++++++-----------------------
 1 file changed, 24 insertions(+), 29 deletions(-)

--- 2.6.23-rc6-mm1/mm/page-writeback.c  2007-09-18 12:28:25.000000000 +0100
+++ linux/mm/page-writeback.c   2007-09-19 20:00:46.000000000 +0100
@@ -379,7 +379,7 @@ static void balance_dirty_pages(struct a
                bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
                bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
                if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
-                               break;
+                       break;
 
                if (!bdi->dirty_exceeded)
                        bdi->dirty_exceeded = 1;
@@ -392,39 +392,34 @@ static void balance_dirty_pages(struct a
                 */
                if (bdi_nr_reclaimable) {
                        writeback_inodes(&wbc);
-
+                       pages_written += write_chunk - wbc.nr_to_write;
                        get_dirty_limits(&background_thresh, &dirty_thresh,
                                       &bdi_thresh, bdi);
+               }
 
-                       /*
-                        * In order to avoid the stacked BDI deadlock we need
-                        * to ensure we accurately count the 'dirty' pages when
-                        * the threshold is low.
-                        *
-                        * Otherwise it would be possible to get thresh+n pages
-                        * reported dirty, even though there are thresh-m pages
-                        * actually dirty; with m+n sitting in the percpu
-                        * deltas.
-                        */
-                       if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-                               bdi_nr_reclaimable =
-                                       bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-                               bdi_nr_writeback =
-                                       bdi_stat_sum(bdi, BDI_WRITEBACK);
-                       } else {
-                               bdi_nr_reclaimable =
-                                       bdi_stat(bdi, BDI_RECLAIMABLE);
-                               bdi_nr_writeback =
-                                       bdi_stat(bdi, BDI_WRITEBACK);
-                       }
+               /*
+                * In order to avoid the stacked BDI deadlock we need
+                * to ensure we accurately count the 'dirty' pages when
+                * the threshold is low.
+                *
+                * Otherwise it would be possible to get thresh+n pages
+                * reported dirty, even though there are thresh-m pages
+                * actually dirty; with m+n sitting in the percpu
+                * deltas.
+                */
+               if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+                       bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+                       bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+               } else if (bdi_nr_reclaimable) {
+                       bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+                       bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+               }
 
-                       if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
-                               break;
+               if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+                       break;
+               if (pages_written >= write_chunk)
+                       break;          /* We've done our duty */
 
-                       pages_written += write_chunk - wbc.nr_to_write;
-                       if (pages_written >= write_chunk)
-                               break;          /* We've done our duty */
-               }
                congestion_wait(WRITE, HZ/10);
        }
 