Re: writeout stalls in current -git

2007-11-20 Thread David Chinner
On Tue, Nov 20, 2007 at 02:16:17PM +0100, Damien Wyart wrote:
> Hello,
> 
> > > > Ok, so it's not synchronous writes that we are doing - we're just
> > > > submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> > > > The "synchronous" nature appears to be coming from higher level
> > > > locking when reclaiming inodes (on the flush lock). It appears that
> > > > inode write clustering is failing completely so we are writing the
> > > > same block multiple times i.e. once for each inode in the cluster we
> > > > have to write.
> 
> > > Works for me. The only remaining stalls are sub second and look
> > > completely valid, considering the amount of files being removed.
> > 
> > > Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>
> 
> * David Chinner <[EMAIL PROTECTED]> [2007-11-08 11:38]:
> > Great - thanks for reporting the problem and testing the fix.
> 
> This patch has not yet made its way into 2.6.24 (rc3). Is it intended?
> Maybe the fix can wait for 2.6.25, but wanted to make sure...

The patch is in the XFS dev tree being QA'd, and we will push it
to 2.6.24-rcX in the next few days.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-20 Thread Damien Wyart
Hello,

> > > Ok, so it's not synchronous writes that we are doing - we're just
> > > submitting bio's tagged as WRITE_SYNC to get the I/O issued
> > > quickly. The "synchronous" nature appears to be coming from higher
> > > level locking when reclaiming inodes (on the flush lock). It
> > > appears that inode write clustering is failing completely so we
> > > are writing the same block multiple times i.e. once for each inode
> > > in the cluster we have to write.

> > Works for me. The only remaining stalls are sub second and look
> > completely valid, considering the amount of files being removed.
> 
> > Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>

* David Chinner <[EMAIL PROTECTED]> [2007-11-08 11:38]:
> Great - thanks for reporting the problem and testing the fix.

This patch has not yet made its way into 2.6.24 (rc3). Is it intended?
Maybe the fix can wait for 2.6.25, but wanted to make sure...

-- 
Damien Wyart
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Patch tags [was writeout stalls in current -git]

2007-11-09 Thread Adrian Bunk
On Fri, Nov 09, 2007 at 09:10:47AM -0700, Jonathan Corbet wrote:
> Adrian Bunk <[EMAIL PROTECTED]> wrote:
> 
> > What's missing is a definition of which of them are formal tags that must 
> > be explicitly given (look at point 13 in SubmittingPatches).
> > 
> > Signed-off-by: and Reviewed-by: are the formal tags someone must have 
> > explicitly given and that correspond to some statement.
> > 
> > OTOH, I can translate a "sounds fine" or "works for me" someone else 
> > gave me into an Acked-by: or Tested-by: tag, respectively.
> 
> The discussion of the Cc: tag says:
> 
>   This is the only tag which might be added without an explicit
>   action by the person it names.
> 
> I think that addresses your comment, no?  Certainly I wouldn't feel that
> I could add any of the other tags to a patch I posted - that's the job
> of the person named there.  

Acked-by: and Tested-by: do require explicit actions by the person they 
name, but that person is not required to explicitly give the tag.

If a user said "the patch works for me" I would consider it overly 
bureaucratic to ask the user for a formal tag.

> jon

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Patch tags [was writeout stalls in current -git]

2007-11-09 Thread Jonathan Corbet
Adrian Bunk <[EMAIL PROTECTED]> wrote:

> What's missing is a definition of which of them are formal tags that must 
> be explicitly given (look at point 13 in SubmittingPatches).
> 
> Signed-off-by: and Reviewed-by: are the formal tags someone must have 
> explicitly given and that correspond to some statement.
> 
> OTOH, I can translate a "sounds fine" or "works for me" someone else 
> gave me into an Acked-by: or Tested-by: tag, respectively.

The discussion of the Cc: tag says:

This is the only tag which might be added without an explicit
action by the person it names.

I think that addresses your comment, no?  Certainly I wouldn't feel that
I could add any of the other tags to a patch I posted - that's the job
of the person named there.  

jon
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-07 Thread David Chinner
On Wed, Nov 07, 2007 at 08:15:06AM +0100, Torsten Kaiser wrote:
> On 11/7/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > Ok, so it's not synchronous writes that we are doing - we're just
> > submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> > The "synchronous" nature appears to be coming from higher level
> > locking when reclaiming inodes (on the flush lock). It appears that
> > inode write clustering is failing completely so we are writing the
> > same block multiple times i.e. once for each inode in the cluster we
> > have to write.
> 
> Works for me. The only remaining stalls are sub second and look
> completely valid, considering the amount of files being removed.

> Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>

Great - thanks for reporting the problem and testing the fix.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-06 Thread Torsten Kaiser
On 11/7/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Ok, so it's not synchronous writes that we are doing - we're just
> submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> The "synchronous" nature appears to be coming from higher level
> locking when reclaiming inodes (on the flush lock). It appears that
> inode write clustering is failing completely so we are writing the
> same block multiple times i.e. once for each inode in the cluster we
> have to write.

Works for me. The only remaining stalls are sub second and look
completely valid, considering the amount of files being removed.

iostat 10 from this test:
 3  0  0 3500192332 20495600   105  8512 1809 6473  6 10 83  1
 0  0  0 3500200332 20457600 0  4367 1355 3712  2  6 92  0
 2  0  0 3504264332 20352800 0  6805 1912 4967  4  8 88  0
 0  0  0 3511632332 20352800 0  2843  805 1791  2  4 94  0
 0  0  0 3516852332 20351600 0  3375  879 2712  3  5 93  0
 0  0  0 3530544332 20266800   186   776  488 1152  4  2 89  4
 0  0  0 3574788332 20496000   226   326  358  787  0  1 98  0
 0  0  0 3576820332 20496000 0   376  332  737  0  0 99  0
 0  0  0 3578432332 20496000 0   356  293  606  1  1 99  0
 0  0  0 3580192332 20496000 0   101  104  384  0  0 99  0

I'm pleased to note that this is now much faster again.
Thanks!

Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>

CC's please note: It looks like this was really a different problem
than the 100% iowait that was seen with reiserfs.
Also the one complete stall I have seen is probably something else.
But I have not been able to reproduce this again with -mm and have
never seen this on mainline, so I will just ignore that single event
until I see it again.

Torsten

> ---
>  fs/xfs/xfs_iget.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
> ===================================================================
> --- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-11-02 13:44:46.0 +1100
> +++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
> @@ -248,7 +248,7 @@ finish_inode:
> icl = NULL;
> if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
> first_index, 1)) {
> -   if ((iq->i_ino & mask) == first_index)
> +   if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
> icl = iq->i_cluster;
> }
>
>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-06 Thread David Chinner
On Wed, Nov 07, 2007 at 10:31:14AM +1100, David Chinner wrote:
> On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> > On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > > Rather than vmstat, can you use something like iostat to show how busy 
> > > your
> > > disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.
> > 
> > Both "vmstat 10" and "iostat -x 10" output from this test:
> > procs ---memory-- ---swap-- -io -system-- 
> > cpu
> >  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id 
> > wa
> >  2  0  0 3700592  0  85424003183  108  244  2  1 95 
> >  1
> > -> emerge reads something, don't know for sure what...
> >  1  0  0 3665352  0  8794000   239 2  343  585  2  1 97 
> >  0
> 
> > 
> > The last 20% of the btrace look more or less completely like this, no
> > other programs do any IO...
> > 
> > 253,03   104626   526.293450729   974  C  WS 79344288 + 8 [0]
> > 253,03   104627   526.293455078   974  C  WS 79344296 + 8 [0]
> > 253,0136469   444.513863133  1068  Q  WS 154998480 + 8 [xfssyncd]
> > 253,0136470   444.513863135  1068  Q  WS 154998488 + 8 [xfssyncd]
> ^^
> Apparently we are doing synchronous writes. That would explain why
> it is slow. We shouldn't be doing synchronous writes here. I'll see if
> I can reproduce this.
> 
> 
> 
> Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
> look into this further and send a patch when I have something concrete.

Ok, so it's not synchronous writes that we are doing - we're just
submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
The "synchronous" nature appears to be coming from higher level
locking when reclaiming inodes (on the flush lock). It appears that
inode write clustering is failing completely so we are writing the
same block multiple times i.e. once for each inode in the cluster we
have to write.
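
A minimal sketch of what that tagging amounts to (assuming the stock
2.6.24-era definition of WRITE_SYNC as WRITE plus the BIO_RW_SYNC hint;
this is not the actual XFS submission path, just the shape of the call):

#include <linux/bio.h>

/*
 * Sketch only: WRITE_SYNC sets the BIO_RW_SYNC hint so the request
 * queue is unplugged and the I/O is dispatched right away.
 * submit_bio() still returns without waiting for completion, so
 * nothing here blocks on the write itself.
 */
static void issue_hinted_write(struct bio *bio)
{
	submit_bio(WRITE_SYNC, bio);	/* WRITE | (1 << BIO_RW_SYNC) */
}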

This must be a side effect of some other change as we haven't
changed anything in the reclaim code recently.

/me scurries off to run some tests 

Indeed it is. The patch below should fix the problem - the inode
clusters weren't getting set up properly when inodes were being
read in or allocated. This is a regression, introduced by this
mod:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=da353b0d64e070ae7c5342a0d56ec20ae9ef5cfb
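
A rough illustration of why the old check never matched (variable values
hypothetical, not the real lookup code): first_index is an AG-relative
inode number masked to the cluster boundary, while i_ino still carries
the AG bits, so the comparison essentially never succeeds, icl stays
NULL and every inode gets flushed on its own:

/* Illustrative fragment, mirroring the one-line fix further down. */
xfs_agino_t	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
xfs_agino_t	first_index = agino & mask;	/* AG-relative cluster start */

/* old check: full inode number vs AG-relative index, almost never true */
if ((iq->i_ino & mask) == first_index)
	icl = iq->i_cluster;

/* fixed check: convert to an AG-relative inode number before masking */
if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
	icl = iq->i_cluster;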

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/xfs_iget.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-11-02 13:44:46.0 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
@@ -248,7 +248,7 @@ finish_inode:
icl = NULL;
if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
first_index, 1)) {
-   if ((iq->i_ino & mask) == first_index)
+   if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
icl = iq->i_cluster;
}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-06 Thread David Chinner
On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > Rather than vmstat, can you use something like iostat to show how busy your
> > disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.
> 
> Both "vmstat 10" and "iostat -x 10" output from this test:
> procs ---memory-- ---swap-- -io -system-- cpu
>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
>  2  0  0 3700592  0  85424003183  108  244  2  1 95  1
> -> emerge reads something, don't know for sure what...
>  1  0  0 3665352  0  8794000   239 2  343  585  2  1 97  0

> 
> The last 20% of the btrace look more or less completely like this, no
> other programs do any IO...
> 
> 253,03   104626   526.293450729   974  C  WS 79344288 + 8 [0]
> 253,03   104627   526.293455078   974  C  WS 79344296 + 8 [0]
> 253,0136469   444.513863133  1068  Q  WS 154998480 + 8 [xfssyncd]
> 253,0136470   444.513863135  1068  Q  WS 154998488 + 8 [xfssyncd]
^^
Apparently we are doing synchronous writes. That would explain why
it is slow. We shouldn't be doing synchronous writes here. I'll see if
I can reproduce this.



Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
look into this further and send a patch when I have something concrete.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Patch tags [was writeout stalls in current -git]

2007-11-06 Thread Adrian Bunk
On Tue, Nov 06, 2007 at 09:25:12AM -0700, Jonathan Corbet wrote:
> Andrew wrote:
> 
> > > Reviewed-by: Fengguang Wu <[EMAIL PROTECTED]> 
> > 
> > I would prefer Tested-by: :(
> 
> This seems like as good an opportunity as any to toss my patch tags
> document out there one more time.  I still think it's a good idea to
> codify some sort of consensus on what these tags mean... 

What's missing is a definition of which of them are formal tags that must 
be explicitly given (look at point 13 in SubmittingPatches).

Signed-off-by: and Reviewed-by: are the formal tags someone must have 
explicitly given and that correspond to some statement.

OTOH, I can translate a "sounds fine" or "works for me" someone else 
gave me into an Acked-by: or Tested-by: tag, respectively.

> jon
>...
> --- /dev/null
> +++ b/Documentation/patch-tags
> @@ -0,0 +1,76 @@
> +Patches headed for the mainline may contain a variety of tags documenting
> +who played a hand in (or was at least aware of) their progress.  All of
> +these tags have the form:
> +
> + Something-done-by: Full name <[EMAIL PROTECTED]> [optional random stuff]
> +
> +These tags are:
> +
> +From:  The original author of the patch.  This tag will ensure
> + that credit is properly given when somebody other than the
> + original author submits the patch.
> +
> +Signed-off-by:   A person adding a Signed-off-by tag is attesting that 
> the
> + patch is, to the best of his or her knowledge, legally able
> + to be merged into the mainline and distributed under the
> + terms of the GNU General Public License, version 2.  See
> + the Developer's Certificate of Origin, found in
> + Documentation/SubmittingPatches, for the precise meaning of
> + Signed-off-by.  This tag assures upstream maintainers that
> + the provenance of the patch is known and allows the origin
> + of the patch to be reviewed should copyright questions
> + arise.
> +
> +Acked-by:  The person named (who should be an active developer in the
> + area addressed by the patch) is aware of the patch and has
> + no objection to its inclusion; it informs upstream
> + maintainers that a certain degree of consensus on the patch
> + has been achieved.  An Acked-by tag does not imply any
> + involvement in the development of the patch or that a
> + detailed review was done. 
> +
> +Reviewed-by: The patch has been reviewed and found acceptable according
> + to the Reviewer's Statement as found at the bottom of this
> + file.  A Reviewed-by tag is a statement of opinion that the
> + patch is an appropriate modification of the kernel without
> + any remaining serious technical issues.  Any interested
> + reviewer (who has done the work) can offer a Reviewed-by
> + tag for a patch.  This tag serves to give credit to
> + reviewers and to inform maintainers of the degree of review
> + which has been done on the patch.
> +
> +Cc:  The person named was given the opportunity to comment on
> + the patch.  This is the only tag which might be added
> + without an explicit action by the person it names.  This
> + tag documents that potentially interested parties have been
> + included in the discussion.
> +
> +Tested-by:   The patch has been successfully tested (in some
> + environment) by the person named.  This tag informs
> + maintainers that some testing has been performed, provides
> + a means to locate testers for future patches, and ensures
> + credit for the testers.
> +
> +
> +
> +
> +Reviewer's statement of oversight, v0.02
> +
> +By offering my Reviewed-by: tag, I state that:
> +
> + (a) I have carried out a technical review of this patch to evaluate its
> + appropriateness and readiness for inclusion into the mainline kernel. 
> +
> + (b) Any problems, concerns, or questions relating to the patch have been
> + communicated back to the submitter.  I am satisfied with the
> + submitter's response to my comments.
> +
> + (c) While there may be things that could be improved with this submission,
> + I believe that it is, at this time, (1) a worthwhile modification to
> + the kernel, and (2) free of known issues which would argue against its
> + inclusion.
> +
> + (d) While I have reviewed the patch and believe it to be sound, I do not
> + (unless explicitly stated elsewhere) make any warranties or guarantees
> + that it will achieve its stated purpose or function properly in any
> + given situation.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao 

Re: writeout stalls in current -git

2007-11-06 Thread Torsten Kaiser
On 11/6/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> --
> Subject: writeback: remove pages_skipped accounting in 
> __block_write_full_page()
> From: Fengguang Wu <[EMAIL PROTECTED]>
>
> Miklos Szeredi <[EMAIL PROTECTED]> and me identified a writeback bug:
[snip]
>  fs/buffer.c |1 -
>  fs/xfs/linux-2.6/xfs_aops.c |5 ++---
>  2 files changed, 2 insertions(+), 4 deletions(-)

I have now tested v2.6.24-rc1-748-g2655e2c with the above patch reverted.
It still stalls.

On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Rather than vmstat, can you use something like iostat to show how busy your
> disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.

Both "vmstat 10" and "iostat -x 10" output from this test:
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  0  0 3700592  0  85424003183  108  244  2  1 95  1
-> emerge reads something, don't know for sure what...
 1  0  0 3665352  0  8794000   239 2  343  585  2  1 97  0
 0  0  0 3657728  0  9122800   32235  445  833  0  0 99  0
 1  0  0 3653136  0  9469200   33033  455  844  1  1 98  0
 0  0  0 3646836  0  9772000   289 3  422  751  1  1 98  0
 0  0  0 3616468  0  9969200   18533  399  614  9  3 87  1
-> starts to remove the kernel tree
 0  0  0 3610452  0 10259200   138  3598 1398 3945  3  6 90  1
 0  0  0 3607136  0 10454800 2  5962 1919 6070  4  9 87  0
 0  0  0 3606636  0 10508000 0  1539  810 2200  1  2 97  0
-> first stall 28 sec.
 0  0  0 3606592  0 10529200 0   698  679 1390  0  1 99  0
 0  0  0 3606440  0 10553200 0   658  690 1457  0  0 99  0
 0  0  0 3606068  0 10612800 1  1780  947 1982  1  3 96  0
-> second stall 24 sec.
 0  0  0 3606036  0 10646400 4   858  758 1457  0  1 98  0
 0  0  0 3605380  0 10687200 0  1173  807 1880  1  2 97  0
 0  0  0 3605000  0 10774800 1  2413 1103 2996  2  4 94  0
-> third stall 38 sec.
 0  0  0 3604488  0 1084720045   897  748 1577  0  1 98  0
 0  0  0 3604176  0 10876400 0   824  752 1700  0  1 98  0
 0  0  0 3604012  0 10898800 0   660  643 1237  0  1 99  0
 0  0  0 3608936  0 11012000 1  3490 1232 3455  3  5 91  0
-> fourth stall 64 sec.
 1  0  0 3609060  0 11029600 0   568  669 1222  0  1 99  0
 0  0  0 3609464  0 11049600 0   604  638 1366  0  1 99  0
 0  0  0 3609244  0 11074000 0   844  714 1282  0  1 99  0
 0  0  0 3609508  0 11091200 0   552  584 1185  1  1 99  0
 2  0  0 3609436  0 3200 0   658  643 1442  0  1 99  0
 0  0  0 3609212  0 11134800 0   714  637 1382  0  0 99  0
 0  0  0 3619132  0 11049200   130  1086  736 1870  4  3 91  2
 0  0  0 3657016  0 11549600   466   589  718 1367  1  1 98  0
-> emerge finishes, dirty data was <1Mb the whole time, and now stays below 300kb
(btrace running...)
 0  0  0 3657844  0 11566000 0   564  635 1226  1  1 99  0
 0  0  0 3658236  0 11584000 0   582  600 1248  1  0 99  0
 0  0  0 3658296  0 11601200 0   566  606 1232  1  1 99  0
 0  0  0 3657924  0 11621200 0   688  596 1321  1  0 99  0
 0  0  0 3658252  0 11641600 0   631  642 1356  1  0 98  0
 0  0  0 3658184  0 11659200 0   566  575 1273  0  0 99  0
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  0  0 3658344  0 11677200 0   649  606 1301  0  0 99  0
 0  0  0 3658548  0 11697600 0   617  624 1345  0  0 99  0
 0  0  0 3659204  0 11716000 0   550  576 1223  1  1 99  0
 0  0  0 3659944  0 11734400 0   620  583 1272  0  0 99  0
 0  0  0 3660548  0 11754000 0   605  611 1338  0  0 99  0
 0  0  0 3661236  0 11773200 0   582  569 1275  0  0 99  0
 0  0  0 3662420  0 11788800 0   590  571 1157  0  0 99  0
 0  0  0 3664324  0 11806800 0   566  553 1222  1  1 99  0
 0  0  0 3665240  0 11816800 0   401  574  862  0  0 99  0
 0  0  0 3666984  0 11828000 0   454  574  958  1  1 99  0
 0  0  0 3668664  0 11840000 0   396  559  946  0  0 99  0
 0  0  0 3670628  0 11849600 0   296  495  784  0  0 99  0
 0  0  0 3671316  0 118496  

Re: writeout stalls in current -git

2007-11-06 Thread Torsten Kaiser
On 11/6/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:
>
> > I'm struggling to understand what possible changed in XFS or writeback that
> > would lead to stalls like this, esp. as you appear to be removing files when
> > the stalls occur.
>
> Just a crazy idea,..
>
> Could there be a set_page_dirty() that doesn't have
> balance_dirty_pages() call near? For example modifying meta data in
> unlink?
>
> Such a situation could lead to an excess of dirty pages and the next
> call to balance_dirty_pages() would appear to stall, as it would
> desperately try to get below the limit again.

Only if accounting of the dirty pages is also broken.
In the unmerge testcase I see most of the time only <200kb of dirty
data in /proc/meminfo.

The system has 4Gb of RAM so I'm not sure if it should ever be valid
to stall even the emerge/install testcase.

Torsten

Now building a kernel with the skipped-pages-accounting-patch reverted...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-06 Thread Peter Zijlstra
On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:

> I'm struggling to understand what possible changed in XFS or writeback that
> would lead to stalls like this, esp. as you appear to be removing files when
> the stalls occur. 

Just a crazy idea,..

Could there be a set_page_dirty() that doesn't have
balance_dirty_pages() call near? For example modifying meta data in
unlink?

Such a situation could lead to an excess of dirty pages and the next
call to balance_dirty_pages() would appear to stall, as it would
desperately try to get below the limit again.
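
A sketch of the pattern in question, as a hypothetical filesystem path
(not taken from XFS or any real filesystem):

#include <linux/mm.h>
#include <linux/writeback.h>

static void example_dirty_metadata(struct address_space *mapping,
				   struct page *page)
{
	set_page_dirty(page);		/* bumps the dirty page accounting */

	/*
	 * If a path like this never comes near a
	 * balance_dirty_pages_ratelimited() call, nothing throttles the
	 * dirtier: dirty pages pile up and whichever task hits
	 * balance_dirty_pages() next appears to stall while it works
	 * the system back under the dirty limit.
	 */
	balance_dirty_pages_ratelimited(mapping);
}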

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Patch tags [was writeout stalls in current -git]

2007-11-06 Thread Balbir Singh
> This seems like as good an opportunity as any to toss my patch tags
> document out there one more time.  I still think it's a good idea to
> codify some sort of consensus on what these tags mean...
>
> jon
>

[snip]

> +By offering my Reviewed-by: tag, I state that:
> +
> + (a) I have carried out a technical review of this patch to evaluate its
> + appropriateness and readiness for inclusion into the mainline kernel.
> +
> + (b) Any problems, concerns, or questions relating to the patch have been
> + communicated back to the submitter.  I am satisfied with the
> + submitter's response to my comments.
> +
> + (c) While there may be things that could be improved with this submission,
> + I believe that it is, at this time, (1) a worthwhile modification to
> + the kernel, and (2) free of known issues which would argue against its
> + inclusion.
> +
> + (d) While I have reviewed the patch and believe it to be sound, I do not
> + (unless explicitly stated elsewhere) make any warranties or guarantees
> + that it will achieve its stated purpose or function properly in any
> + given situation.

How about adding a Commented-on-by?

Initial version(s) that are not suitable or still shaping up are
commented-on by several people. A person who comments on one version
might not do a thorough review of the entire code, but through a
series of comments has contributed by pushing the developer in the
correct direction.

Balbir
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Patch tags [was writeout stalls in current -git]

2007-11-06 Thread Jonathan Corbet
Andrew wrote:

> > Reviewed-by: Fengguang Wu <[EMAIL PROTECTED]> 
> 
> I would prefer Tested-by: :(

This seems like as good an opportunity as any to toss my patch tags
document out there one more time.  I still think it's a good idea to
codify some sort of consensus on what these tags mean... 

jon

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index 299615d..1948a93 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -286,6 +286,8 @@ parport.txt
- how to use the parallel-port driver.
 parport-lowlevel.txt
- description and usage of the low level parallel port functions.
+patch-tags
+   - description of the tags which can be added to patches
 pci-error-recovery.txt
- info on PCI error recovery.
 pci.txt
diff --git a/Documentation/patch-tags b/Documentation/patch-tags
new file mode 100644
index 0000000..6acde5e
--- /dev/null
+++ b/Documentation/patch-tags
@@ -0,0 +1,76 @@
+Patches headed for the mainline may contain a variety of tags documenting
+who played a hand in (or was at least aware of) their progress.  All of
+these tags have the form:
+
+   Something-done-by: Full name <[EMAIL PROTECTED]> [optional random stuff]
+
+These tags are:
+
+From:  The original author of the patch.  This tag will ensure
+   that credit is properly given when somebody other than the
+   original author submits the patch.
+
+Signed-off-by: A person adding a Signed-off-by tag is attesting that the
+   patch is, to the best of his or her knowledge, legally able
+   to be merged into the mainline and distributed under the
+   terms of the GNU General Public License, version 2.  See
+   the Developer's Certificate of Origin, found in
+   Documentation/SubmittingPatches, for the precise meaning of
+   Signed-off-by.  This tag assures upstream maintainers that
+   the provenance of the patch is known and allows the origin
+   of the patch to be reviewed should copyright questions
+   arise.
+
+Acked-by:  The person named (who should be an active developer in the
+   area addressed by the patch) is aware of the patch and has
+   no objection to its inclusion; it informs upstream
+   maintainers that a certain degree of consensus on the patch
+   has been achieved.  An Acked-by tag does not imply any
+   involvement in the development of the patch or that a
+   detailed review was done. 
+
+Reviewed-by:   The patch has been reviewed and found acceptable according
+   to the Reviewer's Statement as found at the bottom of this
+   file.  A Reviewed-by tag is a statement of opinion that the
+   patch is an appropriate modification of the kernel without
+   any remaining serious technical issues.  Any interested
+   reviewer (who has done the work) can offer a Reviewed-by
+   tag for a patch.  This tag serves to give credit to
+   reviewers and to inform maintainers of the degree of review
+   which has been done on the patch.
+
+Cc:  The person named was given the opportunity to comment on
+   the patch.  This is the only tag which might be added
+   without an explicit action by the person it names.  This
+   tag documents that potentially interested parties have been
+   included in the discussion.
+
+Tested-by: The patch has been successfully tested (in some
+   environment) by the person named.  This tag informs
+   maintainers that some testing has been performed, provides
+   a means to locate testers for future patches, and ensures
+   credit for the testers.
+
+
+
+
+Reviewer's statement of oversight, v0.02
+
+By offering my Reviewed-by: tag, I state that:
+
+ (a) I have carried out a technical review of this patch to evaluate its
+ appropriateness and readiness for inclusion into the mainline kernel. 
+
+ (b) Any problems, concerns, or questions relating to the patch have been
+ communicated back to the submitter.  I am satisfied with the
+ submitter's response to my comments.
+
+ (c) While there may be things that could be improved with this submission,
+ I believe that it is, at this time, (1) a worthwhile modification to
+ the kernel, and (2) free of known issues which would argue against its
+ inclusion.
+
+ (d) While I have reviewed the patch and believe it to be sound, I do not
+ (unless explicitly stated elsewhere) make any warranties or guarantees
+ that it will achieve its stated purpose or function properly in any
+ given situation.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  

Re: writeout stalls in current -git

2007-11-06 Thread Peter Zijlstra
On Mon, 2007-11-05 at 15:57 -0800, Andrew Morton wrote:

> > > Subject: mm: speed up writeback ramp-up on clean systems
> > > 
> > > We allow violation of bdi limits if there is a lot of room on the
> > > system. Once we hit half the total limit we start enforcing bdi limits
> > > and bdi ramp-up should happen. Doing it this way avoids many small
> > > writeouts on an otherwise idle system and should also speed up the
> > > ramp-up.
> 
> Given the problems we're having in there I'm a bit reluctant to go tossing
> hastily put together and inadequately tested stuff onto the fire.  And
> that's what this patch looks like to me.

Not really hastily, I think it was written before the stuff hit
mainline. Inadequately tested, perhaps: it's been in my and probably Wu's
kernels for a while. Granted that's not a lot of testing in the face of
those who have problems atm.

> Wanna convince me otherwise?

I'm perfectly happy with this patch earning its credits in -mm for a
while and maybe going in around -rc4 or something like that (hoping that
by then we've fixed these nagging issues).

Another patch I did come up with yesterday - not driven by any problems
in that area - could perhaps join this one on that path:

---
Subject: mm: bdi: tweak task dirty penalty

Penalizing heavy dirtiers with 1/8-th the total dirty limit might be rather
excessive on large memory machines. Use sqrt to scale it sub-linearly.

Update the comment while we're there.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |   12 
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6-2/mm/page-writeback.c
===================================================================
--- linux-2.6-2.orig/mm/page-writeback.c
+++ linux-2.6-2/mm/page-writeback.c
@@ -213,17 +213,21 @@ static inline void task_dirties_fraction
 }
 
 /*
- * scale the dirty limit
+ * Task specific dirty limit:
  *
- * task specific dirty limit:
+ *   dirty -= 8 * sqrt(dirty) * p_{t}
  *
- *   dirty -= (dirty/8) * p_{t}
+ * Penalize tasks that dirty a lot of pages by lowering their dirty limit. This
+ * keeps infrequent dirtiers from getting stuck behind another task's dirty
+ * pages.
+ *
+ * Use a sub-linear function to scale the penalty; we only need a little room.
  */
 void task_dirty_limit(struct task_struct *tsk, long *pdirty)
 {
long numerator, denominator;
long dirty = *pdirty;
-   u64 inv = dirty >> 3;
+   u64 inv = 8*int_sqrt(dirty);
 
task_dirties_fraction(tsk, &numerator, &denominator);
inv *= numerator;


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-06 Thread Fengguang Wu
On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> [  547.20] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58858 > 
> global 12829 72 0 wc __ tw 0 sk 0
> [  550.48] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57834 > 
> global 12017 62 0 wc __ tw 0 sk 0
> [  552.71] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56810 > 
> global 11133 83 0 wc __ tw 0 sk 0
> [  558.66] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55786 > 
> global 10470 33 0 wc _M tw 0 sk 0
4s
> [  562.75] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54762 > 
> global 10555 69 0 wc _M tw 0 sk 0
3s
> [  565.15] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53738 > 
> global 9562 498 0 wc _M tw -2 sk 0
4s
> [  569.49] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52712 > 
> global 8960 2 0 wc _M tw 0 sk 0
3s
> [  572.91] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51688 > 
> global 8088 205 0 wc _M tw -13 sk 0
2s
> [  574.61] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50651 > 
> global 7114 188 0 wc _M tw -1 sk 0
10s
> [  584.27] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49626 > 
> global 14544 0 0 wc _M tw -1 sk 0
9s
> [  593.05] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48601 > 
> global 24583 736 0 wc _M tw -1 sk 0
7s
> [  600.18] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47576 > 
> global 27004 6 0 wc _M tw 587 sk 0
> [  600.18] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47139 > 
> global 27004 6 0 wc __ tw 1014 sk 0

The above messages and the below 'D' state pdflush indicate that one
single writeback_inodes(4MB) call takes a long time (up to 10s!) to complete.
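
(The 4MB here is the per-call chunk wb_kupdate() asks for; below is a
paraphrased sketch of that loop, assuming the usual MAX_WRITEBACK_PAGES
of 1024 4k pages, not a verbatim copy of mm/page-writeback.c:)

	while (nr_to_write > 0) {
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;	/* 1024 pages == 4MB */
		writeback_inodes(&wbc);			/* the call stalling above */
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
		if (wbc.nr_to_write > 0)
			break;	/* wrote less than a full chunk */
	}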

Let's try reverting the below patch with `patch -R`? It looks like
the most relevant change - if it's not a low level bug.

> [note] first stall, the output from emerge stops, so it seems it can
> not start processing the next file until the stall ends
> [  630.00] SysRq : Emergency Sync
> [  630.12] Emergency Sync complete
> [  632.85] SysRq : Show Blocked State
> [  632.85]   taskPC stack   pid father
> [  632.85] pdflush   D 81000f091788 0   285  2
> [  632.85]  810005d4da80 0046 0800
> 0071
> [  632.85]  81000fd52400 8022d61c 80819b00
> 80819b00
> [  632.85]  80815f40 80819b00 810100316f98
> 
> [  632.85] Call Trace:
> [  632.85]  [] task_rq_lock+0x4c/0x90
> [  632.85]  [] __wake_up_common+0x5a/0x90
> [  632.85]  [] __down+0xa7/0x11e
> [  632.85]  [] default_wake_function+0x0/0x10
> [  632.85]  [] __down_failed+0x35/0x3a
> [  632.85]  [] xfs_buf_lock+0x3e/0x40
> [  632.85]  [] _xfs_buf_find+0x13e/0x240
> [  632.85]  [] xfs_buf_get_flags+0x6f/0x190
> [  632.85]  [] xfs_buf_read_flags+0x12/0xa0
> [  632.85]  [] xfs_trans_read_buf+0x64/0x340
> [  632.85]  [] xfs_itobp+0x81/0x1e0
> [  632.85]  [] write_cache_pages+0x123/0x330
> [  632.85]  [] xfs_iflush+0xfe/0x520
> [  632.85]  [] __down_read_trylock+0x42/0x60
> [  632.85]  [] xfs_inode_flush+0x179/0x1b0
> [  632.85]  [] xfs_fs_write_inode+0x2f/0x90
> [  632.85]  [] __writeback_single_inode+0x2ac/0x380
> [  632.85]  [] dm_table_any_congested+0x2e/0x80
> [  632.85]  [] generic_sync_sb_inodes+0x20d/0x330
> [  632.85]  [] writeback_inodes+0xa2/0xe0
> [  632.85]  [] wb_kupdate+0xa6/0x140
> [  632.85]  [] pdflush+0x0/0x1e0
> [  632.85]  [] pdflush+0x110/0x1e0
> [  632.85]  [] wb_kupdate+0x0/0x140
> [  632.85]  [] kthread+0x4b/0x80
> [  632.85]  [] child_rip+0xa/0x12
> [  632.85]  [] kthread+0x0/0x80
> [  632.85]  [] child_rip+0x0/0x12
> [  632.85]
> [  632.85] emergeD  0  6220   6129
> [  632.85]  810103ced9f8 0086 
> 0071
> [  632.85]  81000fd52cf8  80819b00
> 80819b00
> [  632.85]  80815f40 80819b00 810103ced9b8
> 810103ced9a8
> [  632.85] Call Trace:
> [  632.85]  [] __down+0xa7/0x11e
> [  632.85]  [] default_wake_function+0x0/0x10
> [  632.85]  [] __down_failed+0x35/0x3a
> [  632.85]  [] xfs_buf_lock+0x3e/0x40
> [  632.85]  [] _xfs_buf_find+0x13e/0x240
> [  632.85]  [] xfs_buf_get_flags+0x6f/0x190
> [  632.85]  [] xfs_buf_read_flags+0x12/0xa0
> [  632.85]  [] xfs_trans_read_buf+0x64/0x340
> [  632.85]  [] xfs_itobp+0x81/0x1e0
> [  632.85]  [] xfs_buf_rele+0x2e/0xd0
> [  632.85]  [] xfs_iflush+0xfe/0x520
> [  632.85]  [] __down_read_trylock+0x42/0x60
> [  632.85]  [] xfs_inode_item_push+0x12/0x20
> [  632.85]  [] xfs_trans_push_ail+0x267/0x2b0
> [  632.85]  [] xfs_log_reserve+0x72/0x120
> [  632.85]  [] xfs_trans_reserve+0xa8/0x210
> [  

Re: writeout stalls in current -git

2007-11-06 Thread Fengguang Wu
On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
 [  547.20] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58858  
 global 12829 72 0 wc __ tw 0 sk 0
 [  550.48] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57834  
 global 12017 62 0 wc __ tw 0 sk 0
 [  552.71] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56810  
 global 11133 83 0 wc __ tw 0 sk 0
 [  558.66] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55786  
 global 10470 33 0 wc _M tw 0 sk 0
4s
 [  562.75] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54762  
 global 10555 69 0 wc _M tw 0 sk 0
3s
 [  565.15] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53738  
 global 9562 498 0 wc _M tw -2 sk 0
4s
 [  569.49] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52712  
 global 8960 2 0 wc _M tw 0 sk 0
3s
 [  572.91] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51688  
 global 8088 205 0 wc _M tw -13 sk 0
2s
 [  574.61] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50651  
 global 7114 188 0 wc _M tw -1 sk 0
10s
 [  584.27] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49626  
 global 14544 0 0 wc _M tw -1 sk 0
9s
 [  593.05] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48601  
 global 24583 736 0 wc _M tw -1 sk 0
7s
 [  600.18] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47576  
 global 27004 6 0 wc _M tw 587 sk 0
 [  600.18] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47139  
 global 27004 6 0 wc __ tw 1014 sk 0

The above messages and the below 'D' state pdflush indicate that one
single writeback_inodes(4MB) call takes a long time(up to 10s!) to complete.

Let's try reverting the below patch with `patch -R`? It looks like
the most relevant change - if it's not a low level bug.

 [note] first stall, the output from emerge stops, so it seems it can
 not start processing the next file until the stall ends
 [  630.00] SysRq : Emergency Sync
 [  630.12] Emergency Sync complete
 [  632.85] SysRq : Show Blocked State
 [  632.85]   taskPC stack   pid father
 [  632.85] pdflush   D 81000f091788 0   285  2
 [  632.85]  810005d4da80 0046 0800
 0071
 [  632.85]  81000fd52400 8022d61c 80819b00
 80819b00
 [  632.85]  80815f40 80819b00 810100316f98
 
 [  632.85] Call Trace:
 [  632.85]  [8022d61c] task_rq_lock+0x4c/0x90
 [  632.85]  [8022c8ea] __wake_up_common+0x5a/0x90
 [  632.85]  [805b16e7] __down+0xa7/0x11e
 [  632.85]  [8022da70] default_wake_function+0x0/0x10
 [  632.85]  [805b1365] __down_failed+0x35/0x3a
 [  632.85]  [803752ce] xfs_buf_lock+0x3e/0x40
 [  632.85]  [8037740e] _xfs_buf_find+0x13e/0x240
 [  632.85]  [8037757f] xfs_buf_get_flags+0x6f/0x190
 [  632.85]  [803776b2] xfs_buf_read_flags+0x12/0xa0
 [  632.85]  [80368824] xfs_trans_read_buf+0x64/0x340
 [  632.85]  [80352361] xfs_itobp+0x81/0x1e0
 [  632.85]  [8026b293] write_cache_pages+0x123/0x330
 [  632.85]  [80354d0e] xfs_iflush+0xfe/0x520
 [  632.85]  [803ae5d2] __down_read_trylock+0x42/0x60
 [  632.85]  [8036ed49] xfs_inode_flush+0x179/0x1b0
 [  632.85]  [8037ca8f] xfs_fs_write_inode+0x2f/0x90
 [  632.85]  [802b3aac] __writeback_single_inode+0x2ac/0x380
 [  632.85]  [804d074e] dm_table_any_congested+0x2e/0x80
 [  632.85]  [802b3f9d] generic_sync_sb_inodes+0x20d/0x330
 [  632.85]  [802b4532] writeback_inodes+0xa2/0xe0
 [  632.85]  [8026bfd6] wb_kupdate+0xa6/0x140
 [  632.85]  [8026c4b0] pdflush+0x0/0x1e0
 [  632.85]  [8026c5c0] pdflush+0x110/0x1e0
 [  632.85]  [8026bf30] wb_kupdate+0x0/0x140
 [  632.85]  [8024a32b] kthread+0x4b/0x80
 [  632.85]  [8020c9d8] child_rip+0xa/0x12
 [  632.85]  [8024a2e0] kthread+0x0/0x80
 [  632.85]  [8020c9ce] child_rip+0x0/0x12
 [  632.85]
 [  632.85] emergeD  0  6220   6129
 [  632.85]  810103ced9f8 0086 
 0071
 [  632.85]  81000fd52cf8  80819b00
 80819b00
 [  632.85]  80815f40 80819b00 810103ced9b8
 810103ced9a8
 [  632.85] Call Trace:
 [  632.85]  [805b16e7] __down+0xa7/0x11e
 [  632.85]  [8022da70] default_wake_function+0x0/0x10
 [  632.85]  [805b1365] __down_failed+0x35/0x3a
 [  632.85]  [803752ce] xfs_buf_lock+0x3e/0x40
 [  632.85]  [8037740e] _xfs_buf_find+0x13e/0x240
 [  632.85]  [8037757f] xfs_buf_get_flags+0x6f/0x190
 [  632.85]  [803776b2] 

Re: writeout stalls in current -git

2007-11-06 Thread Peter Zijlstra
On Mon, 2007-11-05 at 15:57 -0800, Andrew Morton wrote:

   Subject: mm: speed up writeback ramp-up on clean systems
   
   We allow violation of bdi limits if there is a lot of room on the
   system. Once we hit half the total limit we start enforcing bdi limits
   and bdi ramp-up should happen. Doing it this way avoids many small
   writeouts on an otherwise idle system and should also speed up the
   ramp-up.
 
 Given the problems we're having in there I'm a bit reluctant to go tossing
 hastily put together and inadequately tested stuff onto the fire.  And
 that's what this patch looks like to me.

Not really hastily, I think it was written before the stuff hit
mainline. Inadequately tested, perhaps, its been in my and probably Wu's
kernels for a while. Granted that's not a lot of testing in the face of
those who have problems atm.

 Wanna convince me otherwise?

I'm perfectly happy with this patch earning its credits in -mm for a
while and maybe going in around -rc4 or something like that (hoping that
by then we've fixed these nagging issues).

Another patch I did come up with yesterday - not driven by any problems
in that area - could perhaps join this one on that path:

---
Subject: mm: bdi: tweak task dirty penalty

Penalizing heavy dirtiers with 1/8-th the total dirty limit might be rather
excessive on large memory machines. Use sqrt to scale it sub-linearly.

Update the comment while we're there.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page-writeback.c |   12 
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6-2/mm/page-writeback.c
===
--- linux-2.6-2.orig/mm/page-writeback.c
+++ linux-2.6-2/mm/page-writeback.c
@@ -213,17 +213,21 @@ static inline void task_dirties_fraction
 }
 
 /*
- * scale the dirty limit
+ * Task specific dirty limit:
  *
- * task specific dirty limit:
+ *   dirty -= 8 * sqrt(dirty) * p_{t}
  *
- *   dirty -= (dirty/8) * p_{t}
+ * Penalize tasks that dirty a lot of pages by lowering their dirty limit. This
+ * avoids infrequent dirtiers from getting stuck in this other guys dirty
+ * pages.
+ *
+ * Use a sub-linear function to scale the penalty, we only need a little room.
  */
 void task_dirty_limit(struct task_struct *tsk, long *pdirty)
 {
long numerator, denominator;
long dirty = *pdirty;
-   u64 inv = dirty  3;
+   u64 inv = 8*int_sqrt(dirty);
 
task_dirties_fraction(tsk, numerator, denominator);
inv *= numerator;


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Patch tags [was writeout stalls in current -git]

2007-11-06 Thread Jonathan Corbet
Andrew wrote:

  Reviewed-by: Fengguang Wu [EMAIL PROTECTED] 
 
 I would prefer Tested-by: :(

This seems like as good an opportunity as any to toss my patch tags
document out there one more time.  I still think it's a good idea to
codify some sort of consensus on what these tags mean... 

jon

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index 299615d..1948a93 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -286,6 +286,8 @@ parport.txt
- how to use the parallel-port driver.
 parport-lowlevel.txt
- description and usage of the low level parallel port functions.
+patch-tags
+   - description of the tags which can be added to patches
 pci-error-recovery.txt
- info on PCI error recovery.
 pci.txt
diff --git a/Documentation/patch-tags b/Documentation/patch-tags
new file mode 100644
index 000..6acde5e
--- /dev/null
+++ b/Documentation/patch-tags
@@ -0,0 +1,76 @@
+Patches headed for the mainline may contain a variety of tags documenting
+who played a hand in (or was at least aware of) their progress.  All of
+these tags have the form:
+
+   Something-done-by: Full name [EMAIL PROTECTED] [optional random stuff]
+
+These tags are:
+
+From:  The original author of the patch.  This tag will ensure
+   that credit is properly given when somebody other than the
+   original author submits the patch.
+
+Signed-off-by: A person adding a Signed-off-by tag is attesting that the
+   patch is, to the best of his or her knowledge, legally able
+   to be merged into the mainline and distributed under the
+   terms of the GNU General Public License, version 2.  See
+   the Developer's Certificate of Origin, found in
+   Documentation/SubmittingPatches, for the precise meaning of
+   Signed-off-by.  This tag assures upstream maintainers that
+   the provenance of the patch is known and allows the origin
+   of the patch to be reviewed should copyright questions
+   arise.
+
+Acked-by:  The person named (who should be an active developer in the
+   area addressed by the patch) is aware of the patch and has
+   no objection to its inclusion; it informs upstream
+   maintainers that a certain degree of consensus on the patch
+   as been achieved..  An Acked-by tag does not imply any
+   involvement in the development of the patch or that a
+   detailed review was done. 
+
+Reviewed-by:   The patch has been reviewed and found acceptable according
+   to the Reviewer's Statement as found at the bottom of this
+   file.  A Reviewed-by tag is a statement of opinion that the
+   patch is an appropriate modification of the kernel without
+   any remaining serious technical issues.  Any interested
+   reviewer (who has done the work) can offer a Reviewed-by
+   tag for a patch.  This tag serves to give credit to
+   reviewers and to inform maintainers of the degree of review
+   which has been done on the patch.
+
+Cc:The person named was given the opportunity to comment on
+   the patch.  This is the only tag which might be added
+   without an explicit action by the person it names.  This
+   tag documents that potentially interested parties have been
+   included in the discussion.
+
+Tested-by: The patch has been successfully tested (in some
+   environment) by the person named.  This tag informs
+   maintainers that some testing has been performed, provides
+   a means to locate testers for future patches, and ensures
+   credit for the testers.
+
+
+
+
+Reviewer's statement of oversight, v0.02
+
+By offering my Reviewed-by: tag, I state that:
+
+ (a) I have carried out a technical review of this patch to evaluate its
+ appropriateness and readiness for inclusion into the mainline kernel. 
+
+ (b) Any problems, concerns, or questions relating to the patch have been
+ communicated back to the submitter.  I am satisfied with the
+ submitter's response to my comments.
+
+ (c) While there may be things that could be improved with this submission,
+ I believe that it is, at this time, (1) a worthwhile modification to
+ the kernel, and (2) free of known issues which would argue against its
+ inclusion.
+
+ (d) While I have reviewed the patch and believe it to be sound, I do not
+ (unless explicitly stated elsewhere) make any warranties or guarantees
+ that it will achieve its stated purpose or function properly in any
+ given situation.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  

Re: Patch tags [was writeout stalls in current -git]

2007-11-06 Thread Balbir Singh
 This seems like as good an opportunity as any to toss my patch tags
 document out there one more time.  I still think it's a good idea to
 codify some sort of consensus on what these tags mean...

 jon


[snip]

 +By offering my Reviewed-by: tag, I state that:
 +
 + (a) I have carried out a technical review of this patch to evaluate its
 + appropriateness and readiness for inclusion into the mainline kernel.
 +
 + (b) Any problems, concerns, or questions relating to the patch have been
 + communicated back to the submitter.  I am satisfied with the
 + submitter's response to my comments.
 +
 + (c) While there may be things that could be improved with this submission,
 + I believe that it is, at this time, (1) a worthwhile modification to
 + the kernel, and (2) free of known issues which would argue against its
 + inclusion.
 +
 + (d) While I have reviewed the patch and believe it to be sound, I do not
 + (unless explicitly stated elsewhere) make any warranties or guarantees
 + that it will achieve its stated purpose or function properly in any
 + given situation.

How about adding a Commented-on-by?

Initial version(s) that are not suitable or still shaping up are
commented-on by several people. A person who comments on one version
might not do a thorough review of the entire code, but through a
series of comments has contributed by pushing the developer in the
correct direction.

Balbir
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-06 Thread Peter Zijlstra
On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:

 I'm struggling to understand what possible changed in XFS or writeback that
 would lead to stalls like this, esp. as you appear to be removing files when
 the stalls occur. 

Just a crazy idea,..

Could there be a set_page_dirty() that doesn't have
balance_dirty_pages() call near? For example modifying meta data in
unlink?

Such a situation could lead to an excess of dirty pages and the next
call to balance_dirty_pages() would appear to stall, as it would
desperately try to get below the limit again.
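
To put rough numbers on the effect described above, here is a toy calculation
in plain userspace C with invented figures -- this is not kernel code and not
a claim about the real thresholds.  A task that is throttled as it dirties
pages only ever waits for a small overshoot to be written back; a dirtying
path that never calls balance_dirty_pages() leaves the whole backlog for the
next caller, which then appears to stall for the full catch-up time.

#include <stdio.h>

int main(void)
{
	const long dirty_limit   = 1000;  /* allowed dirty pages (invented)      */
	const long writeout_rate = 500;   /* pages cleaned per second (invented) */
	const long backlog       = 20000; /* pages dirtied without throttling    */
	const long overshoot     = 32;    /* typical ratelimit batch (invented)  */

	/* throttled as it goes: each call only waits out the small overshoot */
	printf("per-call wait when throttled incrementally: ~%ld ms\n",
	       overshoot * 1000 / writeout_rate);

	/* the dirtying path skipped the throttle: the next caller waits until
	 * the whole backlog is back under the limit */
	printf("wait for the caller that inherits the backlog: ~%ld s\n",
	       (backlog - dirty_limit) / writeout_rate);
	return 0;
}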

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-06 Thread Torsten Kaiser
On 11/6/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
 On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:

  I'm struggling to understand what possible changed in XFS or writeback that
  would lead to stalls like this, esp. as you appear to be removing files when
  the stalls occur.

 Just a crazy idea,..

 Could there be a set_page_dirty() that doesn't have
 balance_dirty_pages() call near? For example modifying meta data in
 unlink?

 Such a situation could lead to an excess of dirty pages and the next
 call to balance_dirty_pages() would appear to stall, as it would
 desperately try to get below the limit again.

Only if accounting of the dirty pages is also broken.
In the unmerge testcase I see most of the time only 200kb of dirty
data in /proc/meminfo.

The system has 4Gb of RAM so I'm not sure if it should ever be valid
to stall even the emerge/install testcase.

Torsten

Now building a kernel with the skipped-pages-accounting-patch reverted...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-06 Thread Torsten Kaiser
On 11/6/07, Fengguang Wu [EMAIL PROTECTED] wrote:
 --
 Subject: writeback: remove pages_skipped accounting in __block_write_full_page()
 From: Fengguang Wu [EMAIL PROTECTED]

 Miklos Szeredi [EMAIL PROTECTED] and me identified a writeback bug:
[snip]
  fs/buffer.c |1 -
  fs/xfs/linux-2.6/xfs_aops.c |5 ++---
  2 files changed, 2 insertions(+), 4 deletions(-)

I have now tested v2.6.24-rc1-748-g2655e2c with above patch reverted.
This does still stall.

On 11/6/07, David Chinner [EMAIL PROTECTED] wrote:
 Rather than vmstat, can you use something like iostat to show how busy your
 disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.

Both vmstat 10 and iostat -x 10 output from this test:
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  0  0 3700592  0  85424003183  108  244  2  1 95  1
-> emerge reads something, don't know for sure what...
 1  0  0 3665352  0  8794000   239 2  343  585  2  1 97  0
 0  0  0 3657728  0  9122800   32235  445  833  0  0 99  0
 1  0  0 3653136  0  9469200   33033  455  844  1  1 98  0
 0  0  0 3646836  0  9772000   289 3  422  751  1  1 98  0
 0  0  0 3616468  0  9969200   18533  399  614  9  3 87  1
-> starts to remove the kernel tree
 0  0  0 3610452  0 10259200   138  3598 1398 3945  3  6 90  1
 0  0  0 3607136  0 10454800 2  5962 1919 6070  4  9 87  0
 0  0  0 3606636  0 10508000 0  1539  810 2200  1  2 97  0
-> first stall 28 sec.
 0  0  0 3606592  0 10529200 0   698  679 1390  0  1 99  0
 0  0  0 3606440  0 10553200 0   658  690 1457  0  0 99  0
 0  0  0 3606068  0 10612800 1  1780  947 1982  1  3 96  0
-> second stall 24 sec.
 0  0  0 3606036  0 10646400 4   858  758 1457  0  1 98  0
 0  0  0 3605380  0 10687200 0  1173  807 1880  1  2 97  0
 0  0  0 3605000  0 10774800 1  2413 1103 2996  2  4 94  0
-> third stall 38 sec.
 0  0  0 3604488  0 1084720045   897  748 1577  0  1 98  0
 0  0  0 3604176  0 10876400 0   824  752 1700  0  1 98  0
 0  0  0 3604012  0 10898800 0   660  643 1237  0  1 99  0
 0  0  0 3608936  0 11012000 1  3490 1232 3455  3  5 91  0
-> fourth stall 64 sec.
 1  0  0 3609060  0 11029600 0   568  669 1222  0  1 99  0
 0  0  0 3609464  0 11049600 0   604  638 1366  0  1 99  0
 0  0  0 3609244  0 11074000 0   844  714 1282  0  1 99  0
 0  0  0 3609508  0 11091200 0   552  584 1185  1  1 99  0
 2  0  0 3609436  0 3200 0   658  643 1442  0  1 99  0
 0  0  0 3609212  0 11134800 0   714  637 1382  0  0 99  0
 0  0  0 3619132  0 11049200   130  1086  736 1870  4  3 91  2
 0  0  0 3657016  0 11549600   466   589  718 1367  1  1 98  0
-> emerge finishes, dirty data was 1Mb the whole time, stays now below 300kb
(btrace running...)
 0  0  0 3657844  0 11566000 0   564  635 1226  1  1 99  0
 0  0  0 3658236  0 11584000 0   582  600 1248  1  0 99  0
 0  0  0 3658296  0 11601200 0   566  606 1232  1  1 99  0
 0  0  0 3657924  0 11621200 0   688  596 1321  1  0 99  0
 0  0  0 3658252  0 11641600 0   631  642 1356  1  0 98  0
 0  0  0 3658184  0 11659200 0   566  575 1273  0  0 99  0
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  0  0 3658344  0 11677200 0   649  606 1301  0  0 99  0
 0  0  0 3658548  0 11697600 0   617  624 1345  0  0 99  0
 0  0  0 3659204  0 11716000 0   550  576 1223  1  1 99  0
 0  0  0 3659944  0 11734400 0   620  583 1272  0  0 99  0
 0  0  0 3660548  0 11754000 0   605  611 1338  0  0 99  0
 0  0  0 3661236  0 11773200 0   582  569 1275  0  0 99  0
 0  0  0 3662420  0 11788800 0   590  571 1157  0  0 99  0
 0  0  0 3664324  0 11806800 0   566  553 1222  1  1 99  0
 0  0  0 3665240  0 11816800 0   401  574  862  0  0 99  0
 0  0  0 3666984  0 11828000 0   454  574  958  1  1 99  0
 0  0  0 3668664  0 11840000 0   396  559  946  0  0 99  0
 0  0  0 3670628  0 11849600 0   296  495  784  0  0 99  0
 0  0  0 3671316  0 11849600 036  334  307 

Re: Patch tags [was writeout stalls in current -git]

2007-11-06 Thread Adrian Bunk
On Tue, Nov 06, 2007 at 09:25:12AM -0700, Jonathan Corbet wrote:
 Andrew wrote:
 
   Reviewed-by: Fengguang Wu [EMAIL PROTECTED] 
  
  I would prefer Tested-by: :(
 
 This seems like as good an opportunity as any to toss my patch tags
 document out there one more time.  I still think it's a good idea to
 codify some sort of consensus on what these tags mean... 

What's missing is a definition of which of them are formal tags that must
be explicitly given (look at point 13 in SubmittingPatches).

Signed-off-by: and Reviewed-by: are the formal tags someone must have 
explicitly given and that correspond to some statement.

OTOH, I can translate a "sounds fine" or "works for me" someone else
gave me into an Acked-by: resp. Tested-by: tag.

 jon
...
 --- /dev/null
 +++ b/Documentation/patch-tags
 @@ -0,0 +1,76 @@
 +Patches headed for the mainline may contain a variety of tags documenting
 +who played a hand in (or was at least aware of) their progress.  All of
 +these tags have the form:
 +
 + Something-done-by: Full name [EMAIL PROTECTED] [optional random stuff]
 +
 +These tags are:
 +
 +From:    The original author of the patch.  This tag will ensure
 + that credit is properly given when somebody other than the
 + original author submits the patch.
 +
 +Signed-off-by:   A person adding a Signed-off-by tag is attesting that the
 + patch is, to the best of his or her knowledge, legally able
 + to be merged into the mainline and distributed under the
 + terms of the GNU General Public License, version 2.  See
 + the Developer's Certificate of Origin, found in
 + Documentation/SubmittingPatches, for the precise meaning of
 + Signed-off-by.  This tag assures upstream maintainers that
 + the provenance of the patch is known and allows the origin
 + of the patch to be reviewed should copyright questions
 + arise.
 +
 +Acked-by:    The person named (who should be an active developer in the
 + area addressed by the patch) is aware of the patch and has
 + no objection to its inclusion; it informs upstream
 + maintainers that a certain degree of consensus on the patch
 + has been achieved.  An Acked-by tag does not imply any
 + involvement in the development of the patch or that a
 + detailed review was done. 
 +
 +Reviewed-by: The patch has been reviewed and found acceptable according
 + to the Reviewer's Statement as found at the bottom of this
 + file.  A Reviewed-by tag is a statement of opinion that the
 + patch is an appropriate modification of the kernel without
 + any remaining serious technical issues.  Any interested
 + reviewer (who has done the work) can offer a Reviewed-by
 + tag for a patch.  This tag serves to give credit to
 + reviewers and to inform maintainers of the degree of review
 + which has been done on the patch.
 +
 +Cc:  The person named was given the opportunity to comment on
 + the patch.  This is the only tag which might be added
 + without an explicit action by the person it names.  This
 + tag documents that potentially interested parties have been
 + included in the discussion.
 +
 +Tested-by:   The patch has been successfully tested (in some
 + environment) by the person named.  This tag informs
 + maintainers that some testing has been performed, provides
 + a means to locate testers for future patches, and ensures
 + credit for the testers.
 +
 +
 +
 +
 +Reviewer's statement of oversight, v0.02
 +
 +By offering my Reviewed-by: tag, I state that:
 +
 + (a) I have carried out a technical review of this patch to evaluate its
 + appropriateness and readiness for inclusion into the mainline kernel. 
 +
 + (b) Any problems, concerns, or questions relating to the patch have been
 + communicated back to the submitter.  I am satisfied with the
 + submitter's response to my comments.
 +
 + (c) While there may be things that could be improved with this submission,
 + I believe that it is, at this time, (1) a worthwhile modification to
 + the kernel, and (2) free of known issues which would argue against its
 + inclusion.
 +
 + (d) While I have reviewed the patch and believe it to be sound, I do not
 + (unless explicitly stated elsewhere) make any warranties or guarantees
 + that it will achieve its stated purpose or function properly in any
 + given situation.

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed

-
To unsubscribe from this 

Re: writeout stalls in current -git

2007-11-06 Thread David Chinner
On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
 On 11/6/07, David Chinner [EMAIL PROTECTED] wrote:
  Rather than vmstat, can you use something like iostat to show how busy your
  disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.
 
 Both vmstat 10 and iostat -x 10 output from this test:
 procs ---memory-- ---swap-- -io -system-- cpu
  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
  2  0  0 3700592  0  85424003183  108  244  2  1 95  1
 - emerge reads something, don't knwo for sure what...
  1  0  0 3665352  0  8794000   239 2  343  585  2  1 97  0

 
 The last 20% of the btrace look more or less completely like this, no
 other programs do any IO...
 
 253,03   104626   526.293450729   974  C  WS 79344288 + 8 [0]
 253,03   104627   526.293455078   974  C  WS 79344296 + 8 [0]
 253,0136469   444.513863133  1068  Q  WS 154998480 + 8 [xfssyncd]
 253,0136470   444.513863135  1068  Q  WS 154998488 + 8 [xfssyncd]
^^
Apparently we are doing synchronous writes. That would explain why
it is slow. We shouldn't be doing synchronous writes here. I'll see if
I can reproduce this.

goes off and looks

Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
look into this further and send a patch when I have something concrete.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-06 Thread David Chinner
On Wed, Nov 07, 2007 at 10:31:14AM +1100, David Chinner wrote:
 On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
  On 11/6/07, David Chinner [EMAIL PROTECTED] wrote:
   Rather than vmstat, can you use something like iostat to show how busy 
   your
   disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.
  
  Both vmstat 10 and iostat -x 10 output from this test:
  procs ---memory-- ---swap-- -io -system-- 
  cpu
   r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id 
  wa
   2  0  0 3700592  0  85424003183  108  244  2  1 95 
   1
  - emerge reads something, don't knwo for sure what...
   1  0  0 3665352  0  8794000   239 2  343  585  2  1 97 
   0
 
  
  The last 20% of the btrace look more or less completely like this, no
  other programs do any IO...
  
  253,03   104626   526.293450729   974  C  WS 79344288 + 8 [0]
  253,03   104627   526.293455078   974  C  WS 79344296 + 8 [0]
  253,0136469   444.513863133  1068  Q  WS 154998480 + 8 [xfssyncd]
  253,0136470   444.513863135  1068  Q  WS 154998488 + 8 [xfssyncd]
 ^^
 Apparently we are doing synchronous writes. That would explain why
 it is slow. We shouldn't be doing synchronous writes here. I'll see if
 I can reproduce this.
 
 goes off and looks
 
 Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
 look into this further and send a patch when I have something concrete.

Ok, so it's not synchronous writes that we are doing - we're just
submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
The "synchronous" nature appears to be coming from higher level
locking when reclaiming inodes (on the flush lock). It appears that
inode write clustering is failing completely so we are writing the
same block multiple times i.e. once for each inode in the cluster we
have to write.

This must be a side effect of some other change as we haven't
changed anything in the reclaim code recently.

/me scurries off to run some tests 

Indeed it is. The patch below should fix the problem - the inode
clusters weren't getting set up properly when inodes were being
read in or allocated. This is a regression, introduced by this
mod:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=da353b0d64e070ae7c5342a0d56ec20ae9ef5cfb

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/xfs_iget.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-11-02 13:44:46.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c	2007-11-07 13:08:42.534440675 +1100
@@ -248,7 +248,7 @@ finish_inode:
 	icl = NULL;
 	if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
 					first_index, 1)) {
-		if ((iq->i_ino & mask) == first_index)
+		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
 			icl = iq->i_cluster;
 	}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
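
To see why the one-line change in Dave's patch above has such a large effect,
here is a stand-alone toy model of the comparison.  The bit layout and cluster
size are invented purely for illustration and are not the real XFS geometry or
macros: an XFS inode number carries the allocation group in its high bits,
while the per-AG inode radix tree and first_index are AG-relative, so masking
the full inode number can only ever match in AG 0 and clustering silently
fails everywhere else.

#include <stdio.h>

#define AGINO_BITS   23                    /* AG-relative bits (invented)      */
#define AGINO_MASK   ((1u << AGINO_BITS) - 1)
#define CLUSTER_MASK (~(32u - 1))          /* 32 inodes per cluster (invented) */

int main(void)
{
	unsigned int agno = 1, agino = 100;
	unsigned int ino = (agno << AGINO_BITS) | agino; /* full inode number */
	unsigned int first_index = agino & CLUSTER_MASK; /* 96, AG-relative   */

	/* old test: full inode number vs AG-relative index -- never true for
	 * AG != 0, so icl stays NULL and every inode is flushed on its own */
	printf("old test matches:   %d\n", (ino & CLUSTER_MASK) == first_index);

	/* fixed test: strip the AG bits first (the job XFS_INO_TO_AGINO does),
	 * so inodes sharing a cluster are found and written together */
	printf("fixed test matches: %d\n",
	       ((ino & AGINO_MASK) & CLUSTER_MASK) == first_index);
	return 0;
}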


Re: writeout stalls in current -git

2007-11-06 Thread Torsten Kaiser
On 11/7/07, David Chinner [EMAIL PROTECTED] wrote:
 Ok, so it's not synchronous writes that we are doing - we're just
 submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
 The "synchronous" nature appears to be coming from higher level
 locking when reclaiming inodes (on the flush lock). It appears that
 inode write clustering is failing completely so we are writing the
 same block multiple times i.e. once for each inode in the cluster we
 have to write.

Works for me. The only remaining stalls are sub second and look
completely valid, considering the amount of files being removed.

iostat 10 from this test:
 3  0  0 3500192332 20495600   105  8512 1809 6473  6 10 83  1
 0  0  0 3500200332 20457600 0  4367 1355 3712  2  6 92  0
 2  0  0 3504264332 20352800 0  6805 1912 4967  4  8 88  0
 0  0  0 3511632332 20352800 0  2843  805 1791  2  4 94  0
 0  0  0 3516852332 20351600 0  3375  879 2712  3  5 93  0
 0  0  0 3530544332 20266800   186   776  488 1152  4  2 89  4
 0  0  0 3574788332 20496000   226   326  358  787  0  1 98  0
 0  0  0 3576820332 20496000 0   376  332  737  0  0 99  0
 0  0  0 3578432332 20496000 0   356  293  606  1  1 99  0
 0  0  0 3580192332 20496000 0   101  104  384  0  0 99  0

I'm pleased to note that this is now much faster again.
Thanks!

Tested-by: Torsten Kaiser [EMAIL PROTECTED]

CC's please note: It looks like this was really a different problem
than the 100% iowait that was seen with reiserfs.
Also the one complete stall I have seen is probably something else.
But I have not been able to reproduce this again with -mm and have
never seen this on mainline, so I will just ignore that single event
until I see it again.

Torsten

 ---
  fs/xfs/xfs_iget.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
 ===
 --- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-11-02 13:44:46.000000000 +1100
 +++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c	2007-11-07 13:08:42.534440675 +1100
 @@ -248,7 +248,7 @@ finish_inode:
  	icl = NULL;
  	if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
  					first_index, 1)) {
 -		if ((iq->i_ino & mask) == first_index)
 +		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
  			icl = iq->i_cluster;
  	}


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-05 Thread Torsten Kaiser
On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote:
> > On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > > Ok, so it's probably a side effect of the writeback changes.
> > >
> > > Attached are two patches (two because one was in a separate patchset as
> > > a standalone change) that should prevent async writeback from blocking
> > > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> > > Can you see if this fixes the problem?
> >
> > Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the missapplied 
> > raid5-patch
> > Applying your two patches ontop of that does not fix the stalls.
>
> So you are having RAID5 problems as well?

The first 2.6.24-rc1-git-kernel that I patched with your patches did
not boot for me. (Oops sent in one of my previous mails) But given
that the stacktrace was not xfs related and I had seen this patch on
the lkml, I tried to fix this Oops this way.
I did not have troubles with the RAID5 otherwise.

> I'm struggling to understand what possible changed in XFS or writeback that
> would lead to stalls like this, esp. as you appear to be removing files when
> the stalls occur. Rather than vmstat, can you use something like iostat to
> show how busy your disks are?  i.e. are we seeing RMW cycles in the raid5 or
> some such issue.

Will do this this evening.

> OOC, what is the 'xfs_info <mtpt>' output for your filesystem?

meta-data=/dev/mapper/root   isize=256    agcount=32, agsize=4731132 blks
 =   sectsz=512   attr=1
data =   bsize=4096   blocks=151396224, imaxpct=25
 =   sunit=0  swidth=0 blks, unwritten=1
naming   =version 2  bsize=4096
log  =internal   bsize=4096   blocks=32768, version=1
 =   sectsz=512   sunit=0 blks, lazy-count=0
realtime =none   extsz=4096   blocks=0, rtextents=0


> > vmstat 10 output from unmerging (uninstalling) a kernel:
> >  1  0  0 3512188332 19264400   18512  368  735 10  3 85  1
> > -> emerge starts to remove the kernel source files
> >  3  0  0 3506624332 1928360015  9825 2458 8307  7 12 81  0
> >  0  0  0 3507212332 19283600 0   554  630 1233  0  1 99  0
> >  0  0  0 3507292332 19283600 0   537  580 1328  0  1 99  0
> >  0  0  0 3507168332 19283600 0   633  626 1380  0  1 99  0
> >  0  0  0 3507116332 19283600 0  1510  768 2030  1  2 97  0
> >  0  0  0 3507596332 19283600 0   524  540 1544  0  0 99  0
> > procs ---memory-- ---swap-- -io -system-- cpu
> >  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
> >  0  0  0 3507540332 19283600 0   489  551 1293  0  0 99  0
> >  0  0  0 3507528332 19283600 0   527  510 1432  1  1 99  0
> >  0  0  0 3508052332 19284000 0  2088  910 2964  2  3 95  0
> >  0  0  0 3507888332 19284000 0   442  565 1383  1  1 99  0
> >  0  0  0 3508704332 19284000 0   497  529 1479  0  0 99  0
> >  0  0  0 3508704332 19284000 0   594  595 1458  0  0 99  0
> >  0  0  0 3511492332 19284000 0  2381 1028 2941  2  3 95  0
> >  0  0  0 3510684332 19284000 0   699  600 1390  0  0 99  0
> >  0  0  0 3511636332 19284000 0   741  661 1641  0  0 100  0
> >  0  0  0 3524020332 19284000 0  2452 1080 3910  2  3 95  0
> >  0  0  0 3524040332 19284400 0   530  617 1297  0  0 99  0
> >  0  0  0 3524128332 19284400 0   812  674 1667  0  1 99  0
> >  0  0  0 3527000332 19367200   339   721  754 1681  3  2 93  1
> > -> emerge is finished, no dirty or writeback data in /proc/meminfo
>
> At this point, can you run a "sync" and see how long that takes to
> complete?

Already tried that: http://lkml.org/lkml/2007/11/2/178
See the logs from the second unmerge in the second half of the mail.

The sync did not stop this writeout, but returned immediately.

> The only thing I can think that would be written out after
> this point is inodes, but even then it seems to go on for a long,
> long time and it really doesn't seem like XFS is holding up the
> inode writes.

Yes, I completely agree that this is much too long. That's why I included
the after-emerge-finished parts of the logs. But I still partly
suspect xfs, because the xfssyncd shows up when I hit SysRq+W.

> Another option is to use blktrace/blkparse to determine which process is
> issuing this I/O.
>
> >  0  0  0 3583780332 19506000 0   494  555 1080  0  1 99  0
> >  0  0

Re: writeout stalls in current -git

2007-11-05 Thread David Chinner
On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote:
> On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > Ok, so it's probably a side effect of the writeback changes.
> >
> > Attached are two patches (two because one was in a separate patchset as
> > a standalone change) that should prevent async writeback from blocking
> > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> > Can you see if this fixes the problem?
> 
> Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the misapplied raid5-patch
> Applying your two patches on top of that does not fix the stalls.

So you are having RAID5 problems as well?

I'm struggling to understand what possible changed in XFS or writeback that
would lead to stalls like this, esp. as you appear to be removing files when
the stalls occur. Rather than vmstat, can you use something like iostat to
show how busy your disks are?  i.e. are we seeing RMW cycles in the raid5 or
some such issue.

OOC, what is the 'xfs_info <mtpt>' output for your filesystem? 

> vmstat 10 output from unmerging (uninstalling) a kernel:
>  1  0  0 3512188332 19264400   18512  368  735 10  3 85  1
> -> emerge starts to remove the kernel source files
>  3  0  0 3506624332 1928360015  9825 2458 8307  7 12 81  0
>  0  0  0 3507212332 19283600 0   554  630 1233  0  1 99  0
>  0  0  0 3507292332 19283600 0   537  580 1328  0  1 99  0
>  0  0  0 3507168332 19283600 0   633  626 1380  0  1 99  0
>  0  0  0 3507116332 19283600 0  1510  768 2030  1  2 97  0
>  0  0  0 3507596332 19283600 0   524  540 1544  0  0 99  0
> procs ---memory-- ---swap-- -io -system-- cpu
>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
>  0  0  0 3507540332 19283600 0   489  551 1293  0  0 99  0
>  0  0  0 3507528332 19283600 0   527  510 1432  1  1 99  0
>  0  0  0 3508052332 19284000 0  2088  910 2964  2  3 95  0
>  0  0  0 3507888332 19284000 0   442  565 1383  1  1 99  0
>  0  0  0 3508704332 19284000 0   497  529 1479  0  0 99  0
>  0  0  0 3508704332 19284000 0   594  595 1458  0  0 99  0
>  0  0  0 3511492332 19284000 0  2381 1028 2941  2  3 95  0
>  0  0  0 3510684332 19284000 0   699  600 1390  0  0 99  0
>  0  0  0 3511636332 19284000 0   741  661 1641  0  0 100  0
>  0  0  0 3524020332 19284000 0  2452 1080 3910  2  3 95  0
>  0  0  0 3524040332 19284400 0   530  617 1297  0  0 99  0
>  0  0  0 3524128332 19284400 0   812  674 1667  0  1 99  0
>  0  0  0 3527000332 19367200   339   721  754 1681  3  2 93  1
> -> emerge is finished, no dirty or writeback data in /proc/meminfo

At this point, can you run a "sync" and see how long that takes to
complete? The only thing I can think that would be written out after
this point is inodes, but even then it seems to go on for a long,
long time and it really doesn't seem like XFS is holding up the
inode writes.

Another option is to use blktrace/blkparse to determine which process is
issuing this I/O.
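
For reference, one way to get that attribution is to trace the block device
and pipe the events straight into blkparse, which prints the name and PID of
the task issuing each request.  The device name below is only a guess based
on the md1_raid5 thread in Torsten's earlier oops report; adjust as needed:

    blktrace -d /dev/md1 -o - | blkparse -i -

The btrace output quoted elsewhere in this thread is the shortcut form of the
same pipeline.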

>  0  0  0 3583780332 19506000 0   494  555 1080  0  1 99  0
>  0  0  0 3584352332 19506000 099  347  559  0  0 99  0
>  0  0  0 3585232332 19506000 011  301  621  0  0 99  0
> -> disks go idle.
> 
> So these patches do not seem to be the source of these excessive disk 
> writes...

Well, the patches I posted should prevent blocking in the places that it
was seen, so if that does not stop the slowdowns then either the writeback
code is not feeding us inodes fast enough or the block device below is
having some kind of problem

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-05 Thread Andrew Morton
On Fri, 2 Nov 2007 18:33:29 +0800
Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> > On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
> > 
> > > Interestingly, no background_writeout() appears, but only
> > > balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
> > > block the process.
> > 
> > Yeah, the background threshold is not (yet) scaled. So it can happen
> > that the bdi_dirty limit is below the background limit.
> > 
> > I'm curious though as to these stalls, though, I can't seem to think of
> > what goes wrong.. esp since most writeback seems to happen from pdflush.
> 
> Me confused too. The new debug patch will confirm whether emerge is
> waiting in balance_dirty_pages().
> 
> > (or I'm totally misreading it - quite a possible as I'm still recovering
> > from a serious cold and not all the green stuff has yet figured out its
> > proper place wrt brain cells 'n stuff)
> 
> Do take care of yourself.
> 
> > 
> > I still have this patch floating around:
> 
> I think this patch is OK for 2.6.24 :-)
> 
> Reviewed-by: Fengguang Wu <[EMAIL PROTECTED]> 

I would prefer Tested-by: :(

> > 
> > ---
> > Subject: mm: speed up writeback ramp-up on clean systems
> > 
> > We allow violation of bdi limits if there is a lot of room on the
> > system. Once we hit half the total limit we start enforcing bdi limits
> > and bdi ramp-up should happen. Doing it this way avoids many small
> > writeouts on an otherwise idle system and should also speed up the
> > ramp-up.

Given the problems we're having in there I'm a bit reluctant to go tossing
hastily put together and inadequately tested stuff onto the fire.  And
that's what this patch looks like to me.

Wanna convince me otherwise?
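
For readers following along, the heuristic being debated can be sketched as a
stand-alone toy with invented numbers and a simplified cutoff -- this is not
the actual mm/page-writeback.c change.  The per-backing-device (bdi) limit is
only enforced once total dirty memory passes roughly half of the global limit,
which avoids many tiny writeouts while the per-bdi thresholds are still
ramping up on an otherwise idle system.

#include <stdio.h>
#include <stdbool.h>

static bool should_throttle(long total_dirty, long bdi_dirty,
			    long bdi_thresh, long dirty_thresh)
{
	if (bdi_dirty <= bdi_thresh)
		return false;            /* under the per-bdi limit anyway    */
	if (total_dirty <= dirty_thresh / 2)
		return false;            /* plenty of global room: let it run */
	return true;                     /* enforce the per-bdi limit         */
}

int main(void)
{
	/* mostly idle system with a small, still-ramping bdi threshold */
	printf("%d\n", should_throttle(1000, 800, 500, 20000));   /* prints 0 */
	/* busy system past half the global limit: bdi limit applies */
	printf("%d\n", should_throttle(12000, 800, 500, 20000));  /* prints 1 */
	return 0;
}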
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-05 Thread Torsten Kaiser
On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the misapplied raid5-patch
Applying your two patches on top of that does not fix the stalls.

vmstat 10 output from unmerging (uninstalling) a kernel:
 1  0  0 3512188332 19264400   18512  368  735 10  3 85  1
-> emerge starts to remove the kernel source files
 3  0  0 3506624332 1928360015  9825 2458 8307  7 12 81  0
 0  0  0 3507212332 19283600 0   554  630 1233  0  1 99  0
 0  0  0 3507292332 19283600 0   537  580 1328  0  1 99  0
 0  0  0 3507168332 19283600 0   633  626 1380  0  1 99  0
 0  0  0 3507116332 19283600 0  1510  768 2030  1  2 97  0
 0  0  0 3507596332 19283600 0   524  540 1544  0  0 99  0
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 0  0  0 3507540332 19283600 0   489  551 1293  0  0 99  0
 0  0  0 3507528332 19283600 0   527  510 1432  1  1 99  0
 0  0  0 3508052332 19284000 0  2088  910 2964  2  3 95  0
 0  0  0 3507888332 19284000 0   442  565 1383  1  1 99  0
 0  0  0 3508704332 19284000 0   497  529 1479  0  0 99  0
 0  0  0 3508704332 19284000 0   594  595 1458  0  0 99  0
 0  0  0 3511492332 19284000 0  2381 1028 2941  2  3 95  0
 0  0  0 3510684332 19284000 0   699  600 1390  0  0 99  0
 0  0  0 3511636332 19284000 0   741  661 1641  0  0 100  0
 0  0  0 3524020332 19284000 0  2452 1080 3910  2  3 95  0
 0  0  0 3524040332 19284400 0   530  617 1297  0  0 99  0
 0  0  0 3524128332 19284400 0   812  674 1667  0  1 99  0
 0  0  0 3527000332 19367200   339   721  754 1681  3  2 93  1
-> emerge is finished, no dirty or writeback data in /proc/meminfo
 0  0  0 3571056332 19476800   111   639  632 1344  0  1 99  0
 0  0  0 3571260332 19476800 0   757  688 1405  1  0 99  0
 0  0  0 3571156332 19476800 0   753  641 1361  0  0 99  0
 0  0  0 3571404332 19476800 0   766  653 1389  0  0 99  0
 1  0  0 3571136332 19476800 6   764  669 1488  0  0 99  0
 0  0  0 3571668332 19482400 0   764  657 1482  0  0 99  0
 0  0  0 3571848332 19482400 0   673  659 1406  0  0 99  0
 0  0  0 3571908332 1950520022   753  638 1500  0  1 99  0
 0  0  0 3573052332 19505200 0   765  631 1482  0  1 99  0
 0  0  0 3574144332 19505200 0   771  640 1497  0  0 99  0
 0  0  0 3573468332 19505200 0   458  485 1251  0  0 99  0
 0  0  0 3574184332 19505200 0   427  474 1192  0  0 100  0
 0  0  0 3575092332 19505200 0   461  482 1235  0  0 99  0
 0  0  0 3576368332 19505600 0   582  556 1310  0  0 99  0
 0  0  0 3579300332 19505600 0   695  571 1402  0  0 99  0
 0  0  0 3580376332 19505600 0   417  568  906  0  0 99  0
 0  0  0 3581212332 19505600 0   421  559  977  0  1 99  0
 0  0  0 3583780332 19506000 0   494  555 1080  0  1 99  0
 0  0  0 3584352332 19506000 099  347  559  0  0 99  0
 0  0  0 3585232332 19506000 011  301  621  0  0 99  0
-> disks go idle.

So these patches do not seem to be the source of these excessive disk writes...

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-04 Thread Torsten Kaiser
On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote:
> > I can now confirm, that I see this also with the current 
> > mainline-git-version
> > I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
> > plus the fix for the sg changes in ieee1394.
>
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Applied both patches against the kernel mentioned above.
This blows up at boot:
[   80.807589] Filesystem "dm-0": Disabling barriers, not supported by
the underlying device
[   80.820241] XFS mounting filesystem dm-0
[   80.913144] [ cut here ]
[   80.914932] kernel BUG at drivers/md/raid5.c:143!
[   80.916751] invalid opcode:  [1] SMP
[   80.918338] CPU 3
[   80.919142] Modules linked in:
[   80.920345] Pid: 974, comm: md1_raid5 Not tainted 2.6.24-rc1 #3
[   80.922628] RIP: 0010:[]  []
__release_stripe+0x164/0x170
[   80.925935] RSP: 0018:8100060e7dd0  EFLAGS: 00010002
[   80.927987] RAX:  RBX: 81010141c288 RCX: 
[   80.930738] RDX:  RSI: 81010141c288 RDI: 810004fb3200
[   80.933488] RBP: 810004fb3200 R08:  R09: 0005
[   80.936240] R10: 0e00 R11: e200038465e8 R12: 81010141c298
[   80.938990] R13: 0286 R14: 810004fb3330 R15: 
[   80.941741] FS:  0060c870() GS:810100313700()
knlGS:
[   80.944861] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
[   80.947080] CR2: 7fff7b295000 CR3: 000101842000 CR4: 06e0
[   80.949830] DR0:  DR1:  DR2: 
[   80.952580] DR3:  DR6: 0ff0 DR7: 0400
[   80.955332] Process md1_raid5 (pid: 974, threadinfo
8100060e6000, task 81000645c730)
[   80.958584] Stack:  81010141c288 01f4
810004fb3200 804b6f2d
[   80.961761]  01f4 81010141c288 804c8bd0

[   80.964681]  8100060e7ee8 804bd094 81000645c730
8100060e7e70
[   80.967518] Call Trace:
[   80.968558]  [] release_stripe+0x3d/0x60
[   80.970677]  [] md_thread+0x0/0x100
[   80.972629]  [] raid5d+0x344/0x450
[   80.974549]  [] process_timeout+0x0/0x10
[   80.976668]  [] schedule_timeout+0x5a/0xd0
[   80.978855]  [] md_thread+0x0/0x100
[   80.980807]  [] md_thread+0x30/0x100
[   80.982794]  [] autoremove_wake_function+0x0/0x30
[   80.985214]  [] md_thread+0x0/0x100
[   80.987167]  [] kthread+0x4b/0x80
[   80.989054]  [] child_rip+0xa/0x12
[   80.990972]  [] kthread+0x0/0x80
[   80.992824]  [] child_rip+0x0/0x12
[   80.994743]
[   80.995588]
[   80.995588] Code: 0f 0b eb fe 0f 1f 84 00 00 00 00 00 48 83 ec 28
48 89 5c 24
[   80.999307] RIP  [] __release_stripe+0x164/0x170
[   81.001711]  RSP 

Switching back to unpatched 2.6.23-mm1 boots sucessfull...

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-04 Thread David Chinner
On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote:
> On 11/2/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > That's stalled waiting on the inode cluster buffer lock. That implies
> > that the inode lcuser is already being written out and the inode has
> > been redirtied during writeout.
> >
> > Does the kernel you are testing have the "flush inodes in ascending
> > inode number order" patches applied? If so, can you remove that
> > patch and see if the problem goes away?
> 
> I can now confirm, that I see this also with the current mainline-git-version
> I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
> plus the fix for the sg changes in ieee1394.

Ok, so it's probably a side effect of the writeback changes.

Attached are two patches (two because one was in a separate patchset as
a standalone change) that should prevent async writeback from blocking
on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
Can you see if this fixes the problem?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


---
 fs/xfs/xfs_inode.c |  283 -
 1 file changed, 129 insertions(+), 154 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c	2007-09-12 15:41:22.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c	2007-09-13 08:57:06.395641940 +1000
@@ -124,6 +124,126 @@ xfs_inobp_check(
 #endif
 
 /*
+ * Simple wrapper for calling xfs_imap() that includes error
+ * and bounds checking
+ */
+STATIC int
+xfs_ino_to_imap(
+   xfs_mount_t *mp,
+   xfs_trans_t *tp,
+   xfs_ino_t   ino,
+   xfs_imap_t  *imap,
+   uintimap_flags)
+{
+   int error;
+
+   error = xfs_imap(mp, tp, ino, imap, imap_flags);
+   if (error) {
+   cmn_err(CE_WARN, "xfs_ino_to_imap: xfs_imap()  returned an "
+   "error %d on %s.  Returning error.",
+   error, mp->m_fsname);
+   return error;
+   }
+
+   /*
+* If the inode number maps to a block outside the bounds
+* of the file system then return NULL rather than calling
+* read_buf and panicing when we get an error from the
+* driver.
+*/
+   if ((imap->im_blkno + imap->im_len) >
+   XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) {
+   xfs_fs_cmn_err(CE_ALERT, mp, "xfs_ino_to_imap: "
+   "(imap->im_blkno (0x%llx) + imap->im_len (0x%llx)) > "
+   " XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) (0x%llx)",
+   (unsigned long long) imap->im_blkno,
+   (unsigned long long) imap->im_len,
+   XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks));
+   return XFS_ERROR(EINVAL);
+   }
+   return 0;
+}
+
+/*
+ * Find the buffer associated with the given inode map
+ * We do basic validation checks on the buffer once it has been
+ * retrieved from disk.
+ */
+STATIC int
+xfs_imap_to_bp(
+   xfs_mount_t *mp,
+   xfs_trans_t *tp,
+   xfs_imap_t  *imap,
+   xfs_buf_t   **bpp,
+   uintbuf_flags,
+   uintimap_flags)
+{
+   int error;
+   int i;
+   int ni;
+   xfs_buf_t   *bp;
+
+   error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno,
+  (int)imap->im_len, XFS_BUF_LOCK, &bp);
+   if (error) {
+   cmn_err(CE_WARN, "xfs_imap_to_bp: xfs_trans_read_buf()returned "
+   "an error %d on %s.  Returning error.",
+   error, mp->m_fsname);
+   return error;
+   }
+
+   /*
+* Validate the magic number and version of every inode in the buffer
+* (if DEBUG kernel) or the first inode in the buffer, otherwise.
+*/
+#ifdef DEBUG
+   ni = BBTOB(imap->im_len) >> mp->m_sb.sb_inodelog;
+#else  /* usual case */
+   ni = 1;
+#endif
+
+   for (i = 0; i < ni; i++) {
+   int di_ok;
+   xfs_dinode_t*dip;
+
+   dip = (xfs_dinode_t *)xfs_buf_offset(bp,
+   (i << mp->m_sb.sb_inodelog));
+   di_ok = be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC &&
+   XFS_DINODE_GOOD_VERSION(dip->di_core.di_version);
+   if (unlikely(XFS_TEST_ERROR(!di_ok, mp,
+   XFS_ERRTAG_ITOBP_INOTOBP,
+   XFS_RANDOM_ITOBP_INOTOBP))) {
+   if (imap_flags & XFS_IMAP_BULKSTAT) {
+   xfs_trans_brelse(tp, bp);
+   return XFS_ERROR(EINVAL);

Re: writeout stalls in current -git

2007-11-04 Thread Torsten Kaiser
On 11/2/07, David Chinner <[EMAIL PROTECTED]> wrote:
> That's stalled waiting on the inode cluster buffer lock. That implies
> that the inode lcuser is already being written out and the inode has
> been redirtied during writeout.
>
> Does the kernel you are testing have the "flush inodes in ascending
> inode number order" patches applied? If so, can you remove that
> patch and see if the problem goes away?

I can now confirm, that I see this also with the current mainline-git-version
I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
plus the fix for the sg changes in ieee1394.
Bisecting would be troublesome, as the sg changes prevent mainline from
booting with my normal config / kill my network.

treogen ~ # vmstat 10
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
-> starting emerge
 1  0  0 3627072332 157724009713   41  189  2  2 94  2
 0  0  0 3607240332 16373600   59910  332  951  2  1 93  4
 0  0  0 3601920332 16759200   380 2  218  870  1  1 98  0
 0  0  0 3596356332 17164800   40421  182  818  0  0 99  0
 0  0  0 3579328332 18043600   87812  147  912  1  1 97  2
 0  0  0 3575376332 18277600   236 4  244  953  1  1 95  3
 2  1  0 3571792332 18508400   232 7  256 1003  2  1 95  2
 0  0  0 3564844332 18736400   228   605  246 1167  2  1 93  4
 0  0  0 3562128332 18978400   230 4  527 1238  2  1 93  4
 0  1  0 3558764332 19196400   21624  438 1059  1  1 93  6
 0  0  0 3555120332 19386800   19936  406  959  0  0 92  8
 0  0  0 3552008332 19592800   19711  458 1023  1  1 90  8
 0  0  0 3548728332 19766000   183 7  496 1086  1  1 90  8
 0  0  0 3545560332 19937200   170 8  483 1017  1  1 90  9
 0  1  0 3542124332 20125600   190 1  544 1137  1  1 88 10
 1  0  0 3536924332 20329600   195 7  637 1209  2  1 89  8
 1  1  0 3485096332 24918400   10116 10372 4537 13  3 76  8
 2  0  0 3442004332 27972800  108640  219 1349  7  3 87  4
-> emerge is done reading its package database
 1  0  0 3254796332 44863600 027  128 8360 24  6 70  0
 2  0  0 3143304332 554016004733  213 4480 16 11 72  1
-> kernel unpacked
 1  0  0 3125700332 56041600 120  122 1675 24  1 75  0
 1  0  0 3117356332 56796800 0   674  157 2975 24  2 73  1
 2  0  0 3111636332 57373600 0  1143  151 1924 23  1 75  1
 2  0  0 3102836332 58133200 0   890  153 1330 24  1 75  0
 1  0  0 3097236332 58736000 0   656  194 1593 24  1 74  0
 1  0  0 3086824332 59548000 0   812  235 2657 25  1 74  0
-> tar.bz2 created, installing starts now
 0  0  0 3091612332 6010240082   708  499 2397 17  4 78  1
 0  0  0 3086088332 6021800069  2459  769 2237  3  4 88  6
 0  0  0 3085916332 60223600 2  1752  693  949  1  2 96  1
 0  0  0 3084544332 6035640066  4057 1176 2850  3  6 91  0
 0  0  0 3078780332 6055720098  3194 1169 3288  5  6 89  0
 0  0  0 3077940332 6059240017  1139  823 1547  1  2 97  0
 0  0  0 3078268332 60592400 0   888  807 1329  0  1 99  0
-> first short stall
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 0  0  0 3077040332 60592400 0  1950  785 1495  0  2 89  8
 0  0  0 3076588332 60589600 2  3807  925 2046  1  4 95  0
 0  0  0 3076900332 6060520011  2564  768 1471  1  3 95  1
 0  0  0 3071584332 6079280087  2499 1108 3433  4  6 90  0
-> second longer stall
(emerge was not able to complete a single filemove until the 'resume' line)
 0  0  0 3071592332 60792800 0   693  692 1289  0  0 99  0
 0  0  0 3072584332 60792800 0   792  731 1507  0  1 99  0
 0  0  0 3072840332 60792800 0   806  707 1521  0  1 99  0
 0  0  0 3072724332 60792800 0   782  695 1372  0  0 99  0
 0  0  0 3072972332 60792800 0   677  612 1301  0  0 99  0
 0  0  0 3072772332 60792800 0   738  681 1352  1  1 99  0
 0  0  0 3073020332 60792800 0   785  708 1328  0  1 99  0
 0  0  0 3072896332 60792800 0   833  722 1383  0  0 99  0
-> emerge resumed
 0  0  0 3069476332 60797200 2  4885  812 2062  1  4 90  5
 1  0  0 3069648332 60806800 4 

Re: writeout stalls in current -git

2007-11-04 Thread Torsten Kaiser
On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote:
> > I can now confirm that I also see this with the current
> > mainline git version
> > I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
> > plus the fix for the sg changes in ieee1394.
>
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Applied both patches against the kernel mentioned above.
This blows up at boot:
[   80.807589] Filesystem dm-0: Disabling barriers, not supported by
the underlying device
[   80.820241] XFS mounting filesystem dm-0
[   80.913144] [ cut here ]
[   80.914932] kernel BUG at drivers/md/raid5.c:143!
[   80.916751] invalid opcode:  [1] SMP
[   80.918338] CPU 3
[   80.919142] Modules linked in:
[   80.920345] Pid: 974, comm: md1_raid5 Not tainted 2.6.24-rc1 #3
[   80.922628] RIP: 0010:[804b6ee4]  [804b6ee4]
__release_stripe+0x164/0x170
[   80.925935] RSP: 0018:8100060e7dd0  EFLAGS: 00010002
[   80.927987] RAX:  RBX: 81010141c288 RCX: 
[   80.930738] RDX:  RSI: 81010141c288 RDI: 810004fb3200
[   80.933488] RBP: 810004fb3200 R08:  R09: 0005
[   80.936240] R10: 0e00 R11: e200038465e8 R12: 81010141c298
[   80.938990] R13: 0286 R14: 810004fb3330 R15: 
[   80.941741] FS:  0060c870() GS:810100313700()
knlGS:
[   80.944861] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
[   80.947080] CR2: 7fff7b295000 CR3: 000101842000 CR4: 06e0
[   80.949830] DR0:  DR1:  DR2: 
[   80.952580] DR3:  DR6: 0ff0 DR7: 0400
[   80.955332] Process md1_raid5 (pid: 974, threadinfo
8100060e6000, task 81000645c730)
[   80.958584] Stack:  81010141c288 01f4
810004fb3200 804b6f2d
[   80.961761]  01f4 81010141c288 804c8bd0

[   80.964681]  8100060e7ee8 804bd094 81000645c730
8100060e7e70
[   80.967518] Call Trace:
[   80.968558]  [804b6f2d] release_stripe+0x3d/0x60
[   80.970677]  [804c8bd0] md_thread+0x0/0x100
[   80.972629]  [804bd094] raid5d+0x344/0x450
[   80.974549]  [8023df10] process_timeout+0x0/0x10
[   80.976668]  [805ae1ca] schedule_timeout+0x5a/0xd0
[   80.978855]  [804c8bd0] md_thread+0x0/0x100
[   80.980807]  [804c8c00] md_thread+0x30/0x100
[   80.982794]  [80249f20] autoremove_wake_function+0x0/0x30
[   80.985214]  [804c8bd0] md_thread+0x0/0x100
[   80.987167]  [80249b3b] kthread+0x4b/0x80
[   80.989054]  [8020c9c8] child_rip+0xa/0x12
[   80.990972]  [80249af0] kthread+0x0/0x80
[   80.992824]  [8020c9be] child_rip+0x0/0x12
[   80.994743]
[   80.995588]
[   80.995588] Code: 0f 0b eb fe 0f 1f 84 00 00 00 00 00 48 83 ec 28
48 89 5c 24
[   80.999307] RIP  [804b6ee4] __release_stripe+0x164/0x170
[   81.001711]  RSP 8100060e7dd0

Switching back to unpatched 2.6.23-mm1 boots successfully...

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
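
The idea behind the two patches described above, that async writeback should
not sleep on a busy inode cluster buffer while sync writeback still may, can
be illustrated with a small standalone sketch.  This is only an analogy: a
pthread mutex stands in for the XFS buffer lock, and none of the names below
come from the actual patches.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t cluster_buf_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical flush path: a sync caller is allowed to wait for the
 * cluster buffer, an async (background writeback) caller tries the lock
 * and defers the inode if the buffer is busy. */
static int flush_cluster(bool sync)
{
	if (sync) {
		pthread_mutex_lock(&cluster_buf_lock);	/* may block */
	} else if (pthread_mutex_trylock(&cluster_buf_lock) != 0) {
		return -1;	/* busy: skip now, the flusher retries later */
	}

	/* ... the cluster buffer would be written out here ... */

	pthread_mutex_unlock(&cluster_buf_lock);
	return 0;
}

int main(void)
{
	printf("sync flush:  %d\n", flush_cluster(true));
	printf("async flush: %d\n", flush_cluster(false));
	return 0;
}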


Re: writeout stalls in current -git

2007-11-02 Thread Torsten Kaiser
On 11/2/07, David Chinner <[EMAIL PROTECTED]> wrote:
> On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> > [  630.00] SysRq : Emergency Sync
> > [  630.12] Emergency Sync complete
> > [  632.85] SysRq : Show Blocked State
> > [  632.85]   taskPC stack   pid father
> > [  632.85] pdflush   D 81000f091788 0   285  2
> > [  632.85]  810005d4da80 0046 0800
> > 0071
> > [  632.85]  81000fd52400 8022d61c 80819b00
> > 80819b00
> > [  632.85]  80815f40 80819b00 810100316f98
> > 
> > [  632.85] Call Trace:
> > [  632.85]  [] task_rq_lock+0x4c/0x90
> > [  632.85]  [] __wake_up_common+0x5a/0x90
> > [  632.85]  [] __down+0xa7/0x11e
> > [  632.85]  [] default_wake_function+0x0/0x10
> > [  632.85]  [] __down_failed+0x35/0x3a
> > [  632.85]  [] xfs_buf_lock+0x3e/0x40
> > [  632.85]  [] _xfs_buf_find+0x13e/0x240
> > [  632.85]  [] xfs_buf_get_flags+0x6f/0x190
> > [  632.85]  [] xfs_buf_read_flags+0x12/0xa0
> > [  632.85]  [] xfs_trans_read_buf+0x64/0x340
> > [  632.85]  [] xfs_itobp+0x81/0x1e0
> > [  632.85]  [] write_cache_pages+0x123/0x330
> > [  632.85]  [] xfs_iflush+0xfe/0x520
>
> That's stalled waiting on the inode cluster buffer lock. That implies
> that the inode cluster is already being written out and the inode has
> been redirtied during writeout.
>
> Does the kernel you are testing have the "flush inodes in ascending
> inode number order" patches applied? If so, can you remove that
> patch and see if the problem goes away?

It's 2.6.23-mm1 with only some small fixes.

In it's broken-out directory I see:
git-xfs.patch

and

writeback-fix-periodic-superblock-dirty-inode-flushing.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
writeback-fix-time-ordering-of-the-per-superblock-inode-lists-8.patch
writeback-introduce-writeback_controlmore_io-to-indicate-more-io.patch

I don't know if the patch you mentioned is part of that version of the
mm-patchset.

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-02 Thread David Chinner
On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> [  630.00] SysRq : Emergency Sync
> [  630.12] Emergency Sync complete
> [  632.85] SysRq : Show Blocked State
> [  632.85]   taskPC stack   pid father
> [  632.85] pdflush   D 81000f091788 0   285  2
> [  632.85]  810005d4da80 0046 0800
> 0071
> [  632.85]  81000fd52400 8022d61c 80819b00
> 80819b00
> [  632.85]  80815f40 80819b00 810100316f98
> 
> [  632.85] Call Trace:
> [  632.85]  [] task_rq_lock+0x4c/0x90
> [  632.85]  [] __wake_up_common+0x5a/0x90
> [  632.85]  [] __down+0xa7/0x11e
> [  632.85]  [] default_wake_function+0x0/0x10
> [  632.85]  [] __down_failed+0x35/0x3a
> [  632.85]  [] xfs_buf_lock+0x3e/0x40
> [  632.85]  [] _xfs_buf_find+0x13e/0x240
> [  632.85]  [] xfs_buf_get_flags+0x6f/0x190
> [  632.85]  [] xfs_buf_read_flags+0x12/0xa0
> [  632.85]  [] xfs_trans_read_buf+0x64/0x340
> [  632.85]  [] xfs_itobp+0x81/0x1e0
> [  632.85]  [] write_cache_pages+0x123/0x330
> [  632.85]  [] xfs_iflush+0xfe/0x520

That's stalled waiting on the inode cluster buffer lock. That implies
that the inode cluster is already being written out and the inode has
been redirtied during writeout.

Does the kernel you are testing have the "flush inodes in ascending
inode number order" patches applied? If so, can you remove that
patch and see if the problem goes away?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-02 Thread Torsten Kaiser
On 11/2/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
>
> > Interestingly, no background_writeout() appears, but only
> > balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
> > block the process.
>
> Yeah, the background threshold is not (yet) scaled. So it can happen
> that the bdi_dirty limit is below the background limit.

I still have not seen a trigger of the "throttle_vm_writeout".
This time installing 2.6.24-rc1 again it not even triggerd any other
debugs apart from the one in wb_kupdate.
But 300Mb of new files might still not trigger this with 4Gb of RAM.

I'm currently testing 2.6.23-mm1 with this patch and the second
writeback-debug patch.

> I'm curious though as to these stalls, though, I can't seem to think of
> what goes wrong.. esp since most writeback seems to happen from pdflush.

I also don't know. But looking at the time the system takes to write
out 8kb, I'm starting to suspect that something is writing this out,
but not marking it clean... (Or redirtying it immediately?)

> (or I'm totally misreading it - quite a possible as I'm still recovering
> from a serious cold and not all the green stuff has yet figured out its
> proper place wrt brain cells 'n stuff)

Get well soon!

> I still have this patch floating around:
>
> ---
> Subject: mm: speed up writeback ramp-up on clean systems

applied, but did not fix the stalls.

Here is the complete log from vmstat 10 and the syslog from an install of
vanilla 2.6.24-rc1.
(Please note: I installed the source of vanilla 2.6.24-rc1, but I am
still using 2.6.23-mm1!)
All lines with [note] are my comments about what the system was doing,
both logs are from the same run, so the notes should be more or less
in sync. I used SysRq+L to insert the SysRq-Helptext into the syslog
as marker...

The visible effects are similar to the unmerge run, but the stalls
during the moving only started later. The same effect was visible after
emerge finished and almost all dirty data was written: I can still hear
the disks and see the hdd light flickering (mostly on) for much, much
longer than it should take to write 8kb.

vmstat 10:
[note]emerge start
 1  0  0 3668496332 18774800 029   39  491  3  0 96  0
 1  0  0 3623940332 10008317 1724 3893 15  2 81  1
 0  0  0 3559488332 25243200  102148 11719 4536  9  4 74 13
 2  0  0 3482220332 311916007060   93 3818 11  3 86  0
 1  0  0 3289352332 48693200 235   33 11997 25  3 72  0
 1  0  0 3174036332 596412001033   35 3937 21  4 75  0
 2  0  0 3215756332 55529200 628   85  742 12 12 76  0
 2  0  0 3202128332 5597920032 9   34 1566 31  1 68  0
 2  0  0 3192804332 568072006046  172 4206 30  2 67  1
 3  0  0 3202424332 57262000 020  111 2223 27  1 72  0
 1  0  0 3196112332 57890000 0  1649  149 2763 25  2 73  0
 1  0  0 3190004332 58495600 017  110 2270 25  1 74  0
 1  0  0 3183952332 59084000 011  104 2553 25  1 74  0
 1  0  0 3176952332 59706800 0  2153  124 2886 25  2 72  0
 1  0  0 3171044332 60259200 022  109 2580 26  1 73  0
 1  0  0 3174896332 60549600   173  1441  312 2249  9  6 84  1
 1  0  0 3165204332 61185600   569  3221  606 4236  4  7 87  1
 0  0  0 3160856332 61351600   116  2281  570 3077  3  5 92  0
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 0  0  0 3154712332 61520000   108  2166  528 3038  3  4 93  0
 0  0  0 3156008332 6154200018  1941  537 1015  0  2 97  0
 0  0  0 3156652332 61550400 8  2232  547  900  0  2 98  0
 0  0  0 3156748332 6156720012  1932  537  947  0  2 98  0
 0  0  0 3154720332 6159000014  2204  584 1256  1  2 97  0
 0  0  0 3154256332 6160600010  2676  610 1317  1  3 96  0
 1  0  0 3152488332 61628400 9  1994  573 1024  1  2 97  0
 0  0  0 3152404332 61640800 4  2218  540  904  0  2 97  0
 0  0  0 3151244332 6171560044  2198  598 1921  2  4 94  0
 0  0  0 3147224332 61867200   110  1802  644 2575  3  4 93  0
 0  0  0 3144608332 6198240080  1590  543 1900  2  4 95  0
 0  0  0 3140768332 62144800   111  1758  657 2735  3  4 93  0
 0  0  0 3140816332 6218960026   801  531 1667  1  2 98  0
[note] first stall, SysRq+W
 1  0  0 3127620332 62189600 0   640  490 1381  2  1 97  0
 0  0  0 3127780332 62190000 0   627  475 1531  2  1 98  

Re: writeout stalls in current -git

2007-11-02 Thread Torsten Kaiser
On 11/2/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> I guess the new debug printks will provide more hints on it.

The "throttle_vm_writeout" did not trigger for my new workload.
Except for one (the first), the "balance_dirty_pages" messages came from
line 445, the newly added one.

But I found an other workload that looks much more ... hmm ... 'mad'.

If I do an unmerge the emerge program will read all files to
revalidate their checksums and then delete them. If I do this unmerge the
progress of emerge will stall periodically for ~47 seconds. (Two times
I used a stopwatch to get this value. I think all other stalls were
identical; at least in KSysGuard they looked evenly spaced.)

What really counts as 'mad' is this output from vmstat 10:
0  0  0 3639044332 17742000   29220  101  618  1  1 98  0
 1  0  0 3624068332 18062800   32322  137  663  5  2 93  0
 0  0  0 3602456332 18397200   30123  159  641  9  3 87  2
-> this was emerge collecting its package database
 0  0  0 3600052332 1842640019  7743  823 5543  3  8 89  0
 0  0  0 3599332332 18428000 1  2532  517 2341  1  2 97  0
-> normal removing, now the emerge stalls
 0  0  0 3599404332 18428000 0   551  323 1290  0  0 99  0
 0  0  0 3599648332 18428000 0   644  314 1222  0  1 99  0
 0  0  0 3599648332 18428400 0   569  296 1242  0  0 99  0
 0  0  0 3599868332 18428800 0  2362  320 2735  1  2 97  0
-> resumes for a short time, then stalls again
 0  0  0 3599488332 18428800 0   584  292 1395  0  0 99  0
 0  0  0 3600216332 18428800 0   550  301 1361  0  0 99  0
 0  0  0 3594176332 18429600 0   562  300 1373  2  1 97  0
 0  0  0 3594648332 18429600 0  1278  336 1881  1  1 98  0
 0  0  0 3594172332 18430800 1  2812  421 2840  1  4 95  0
-> and again
 0  0  0 3594296332 18430800 0   545  342 1283  0  0 99  0
 0  0  0 3594376332 18430800 0   561  319 1314  0  1 99  0
 0  0  0 3594340332 18430800 0   586  327 1258  0  1 99  0
 0  0  0 3594644332 18430800 0   498  248 1376  0  0 99  0
 0  0  0 3595116332 18434800 0  3519  565 3452  2  4 95  0
-> and again
 0  0  0 3595320332 18434800 0   483  284 1163  0  0 99  0
 3  0  0 3595444332 18435200 0   498  247 1173  3  0 97  0
 1  0  0 3585108332 18460000 0  1298  644 2394  1  1 98  0
 1  0  0 3588152332 18460800 0  3154  520 3221  2  4 94  0
-> and again
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  0  0 3588540332 18460800 0   574  268 1332  0  1 99  0
 1  0  0 3588744332 18460800 0   546  335 1289  0  0 99  0
 1  0  0 3588628332 18460800 0   638  348 1257  0  1 99  0
 1  0  0 3588952332 18460800 0   567  310 1226  0  1 99  0
 1  0  0 3603644332 1849720059  2821  531 2419  3  4 91  1
 1  0  0 3649476332 18627200   370   395  380 1335  1  1 98  0
-> emerge finishes, and now the system goes 'mad'
The Dirty: line from /proc/meminfo stays at 8 or 12 kB, but the
system is writing like 'mad':
 1  0  0 3650616332 18627600 0   424  296 1126  0  1 99  0
 1  0  0 3650708332 18627600 0   418  249 1190  0  0 99  0
 1  0  0 3650716332 18627600 0   418  256 1151  0  1 99  0
 1  0  0 3650816332 18627600 0   420  257 1120  0  0 99  0
 1  0  0 3651132332 18627600 0   418  269 1145  0  0 99  0
 1  0  0 3651332332 18628000 0   419  294 1099  0  1 99  0
 1  0  0 3651732332 18628000 0   423  311 1072  0  1 99  0
 1  0  0 3652048332 18628000 0   400  317 1127  0  0 99  0
 1  0  0 3652024332 18628000 0   426  346 1066  0  1 99  0
 2  0  0 3652304332 18628000 0   425  357 1132  0  1 99  0
 2  0  0 3652652332 18628000 0   416  364 1184  0  0 99  0
 1  0  0 3652836332 18628000 0   413  397 1110  0  1 99  0
 1  0  0 3652852332 18628400 0   426  427 1290  0  1 99  0
 1  0  0 3652060332 1864200014   404  421 1768  1  1 97  0
 1  0  0 3652904332 18642000 0   418  437 1792  1  1 98  0
 1  0  0 3653572332 18642000 0   410  442 1481  1  1 99  0
 2  0  0 3653872332 18642000 0   410  451 1206  0  1 99  0
 3  0  0 3654572332 18642000 0   414  479 1341  0  1 99  0
 1  0  0 3651720332 18983200   341   420  540 1600  1  1 98  1
 

Re: writeout stalls in current -git

2007-11-02 Thread Fengguang Wu
On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
> 
> > Interestingly, no background_writeout() appears, but only
> > balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
> > block the process.
> 
> Yeah, the background threshold is not (yet) scaled. So it can happen
> that the bdi_dirty limit is below the background limit.
> 
> I'm curious though as to these stalls, though, I can't seem to think of
> what goes wrong.. esp since most writeback seems to happen from pdflush.

Me confused too. The new debug patch will confirm whether emerge is
waiting in balance_dirty_pages().

> (or I'm totally misreading it - quite a possible as I'm still recovering
> from a serious cold and not all the green stuff has yet figured out its
> proper place wrt brain cells 'n stuff)

Do take care of yourself.

> 
> I still have this patch floating around:

I think this patch is OK for 2.6.24 :-)

Reviewed-by: Fengguang Wu <[EMAIL PROTECTED]> 

> 
> ---
> Subject: mm: speed up writeback ramp-up on clean systems
> 
> We allow violation of bdi limits if there is a lot of room on the
> system. Once we hit half the total limit we start enforcing bdi limits
> and bdi ramp-up should happen. Doing it this way avoids many small
> writeouts on an otherwise idle system and should also speed up the
> ramp-up.
> 
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
>
> ---
>  mm/page-writeback.c |   19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/mm/page-writeback.c
> ===
> --- linux-2.6.orig/mm/page-writeback.c2007-09-28 10:08:33.937415368 
> +0200
> +++ linux-2.6/mm/page-writeback.c 2007-09-28 10:54:26.018247516 +0200
> @@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
>   */
>  static void balance_dirty_pages(struct address_space *mapping)
>  {
> - long bdi_nr_reclaimable;
> - long bdi_nr_writeback;
> + long nr_reclaimable, bdi_nr_reclaimable;
> + long nr_writeback, bdi_nr_writeback;
>   long background_thresh;
>   long dirty_thresh;
>   long bdi_thresh;
> @@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
>  
>   get_dirty_limits(&background_thresh, &dirty_thresh,
>   &bdi_thresh, bdi);
> +
> + nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> + global_page_state(NR_UNSTABLE_NFS);
> + nr_writeback = global_page_state(NR_WRITEBACK);
> +
>   bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
>   bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> +
>   if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
>   break;
>  
> + /*
> +  * Throttle it only when the background writeback cannot
> +  * catch-up. This avoids (excessively) small writeouts
> +  * when the bdi limits are ramping up.
> +  */
> + if (nr_reclaimable + nr_writeback <
> + (background_thresh + dirty_thresh) / 2)
> + break;
> +
>   if (!bdi->dirty_exceeded)
>   bdi->dirty_exceeded = 1;
>  
> 
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-02 Thread Peter Zijlstra
On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:

> Interestingly, no background_writeout() appears, but only
> balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
> block the process.

Yeah, the background threshold is not (yet) scaled. So it can happen
that the bdi_dirty limit is below the background limit.

I'm curious though as to these stalls, though, I can't seem to think of
what goes wrong.. esp since most writeback seems to happen from pdflush.

(or I'm totally misreading it - quite a possible as I'm still recovering
from a serious cold and not all the green stuff has yet figured out its
proper place wrt brain cells 'n stuff)


I still have this patch floating around:

---
Subject: mm: speed up writeback ramp-up on clean systems

We allow violation of bdi limits if there is a lot of room on the
system. Once we hit half the total limit we start enforcing bdi limits
and bdi ramp-up should happen. Doing it this way avoids many small
writeouts on an otherwise idle system and should also speed up the
ramp-up.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |   19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-09-28 10:08:33.937415368 +0200
+++ linux-2.6/mm/page-writeback.c   2007-09-28 10:54:26.018247516 +0200
@@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-   long bdi_nr_reclaimable;
-   long bdi_nr_writeback;
+   long nr_reclaimable, bdi_nr_reclaimable;
+   long nr_writeback, bdi_nr_writeback;
long background_thresh;
long dirty_thresh;
long bdi_thresh;
@@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
 
	get_dirty_limits(&background_thresh, &dirty_thresh,
			 &bdi_thresh, bdi);
+
+   nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+   global_page_state(NR_UNSTABLE_NFS);
+   nr_writeback = global_page_state(NR_WRITEBACK);
+
bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+
if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
break;
 
+   /*
+* Throttle it only when the background writeback cannot
+* catch-up. This avoids (excessively) small writeouts
+* when the bdi limits are ramping up.
+*/
+   if (nr_reclaimable + nr_writeback <
+   (background_thresh + dirty_thresh) / 2)
+   break;
+
if (!bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;
 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
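
The behavioural change in Peter's hunk above boils down to one extra
early-exit in the throttling loop.  A standalone restatement of the
decision, with plain longs instead of the kernel's page counters (this is
not the kernel code itself):

#include <stdio.h>

/* Returns 1 if the writer should be throttled, 0 if it may continue.
 * Mirrors the check added above: per-bdi throttling is skipped while the
 * global dirty state is still below the midpoint of the background and
 * dirty thresholds, so background writeback can catch up on its own. */
static int should_throttle(long nr_reclaimable, long nr_writeback,
			   long bdi_nr_reclaimable, long bdi_nr_writeback,
			   long background_thresh, long dirty_thresh,
			   long bdi_thresh)
{
	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
		return 0;		/* this bdi is under its own limit */

	if (nr_reclaimable + nr_writeback <
	    (background_thresh + dirty_thresh) / 2)
		return 0;		/* background writeback can catch up */

	return 1;
}

int main(void)
{
	/* Example in pages: background 50k, dirty 100k, bdi limit 10k.
	 * The bdi is over its limit (15k > 10k), but global dirty state
	 * (60k) is below the 75k midpoint, so the writer is not throttled. */
	printf("%d\n", should_throttle(50000, 10000, 12000, 3000,
				       50000, 100000, 10000));
	return 0;
}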


Re: writeout stalls in current -git

2007-11-02 Thread Fengguang Wu
On Fri, Nov 02, 2007 at 08:42:05AM +0100, Torsten Kaiser wrote:
> The Subject is still misleading; I'm using 2.6.23-mm1.
> 
> On 11/2/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> > > On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > > > Since 2.6.23-mm1 I also experience strange hangs during heavy 
> > > > > writeouts.
> > > > > Each time I noticed this I was using emerge (package util from the
> > > > > gentoo distribution) to install/upgrade a package. The last step,
> > > > > where this hang occurred, is moving the prepared files from a tmpfs
> > > > > partition to the main xfs filesystem.
> > > > > The hangs were not fatal, after a few seconds everything resumed
> > > > > normal, so I was not able to capture a good image of what was
> > > > > happening.
> > > >
> > > > Thank you for the detailed report.
> > > >
> > > How severe were the hangs? Only writeouts stalled, all apps stalled, or
> > > > cannot type and run new commands?
> > >
> > > Only writeout stalled. The emerge that was moving the files hung, but
> > everything else worked normally.
> > I was able to run new commands, like copying /proc/meminfo.
> >
> > But you mentioned in the next mail that `watch cat /proc/meminfo`
> > could also be blocked for some time - I guess in the same time emerge
> > was stalled?
> 
> The behavior was different on these stalls.
> On the first report the writeout stopped completely, the emerge stopped,
> but at that time a cat /proc/meminfo >~/stall/meminfo did succeed and
> not stall.
> About the watch cat /proc/meminfo, I will write in the answer to the
> other mail...

OK.

> > > [snip]
> > > > > After this SysRq+W writeback resumed again. Possible that writing
> > > > > above into the syslog triggered that.
> > > >
> > > > Maybe. Are the log files on another disk/partition?
> > >
> > > No, everything was going to /
> > >
> > What might be interesting is that doing cat /proc/meminfo
> > >~/stall/meminfo did not resume the writeback. So there might be some
> > threshold that was only broken with the additional write from
> > syslog-ng. Or syslog-ng does some flushing, I don't know. (I'm using the
> >
> > Have you tried explicit `sync`? ;-)
> 
> No. I wanted to see what was stalled. So I started by collecting info
> from /proc and then the SysRq+W. And after hitting SysRQ the writeout
> started to resume without any further action.
> 
> But I think I have seen a `sync` stall also. During another emerge I
> noticed the system slowing down and wanted to use `sync` to speed up
> the writeout. The result was that the writeout did not speed up
> immediately, only after around a minute. The `sync` only returned at
> that time.
> Can writers starve `sync`?

I guess the new debug printks will provide more hints on it.

> > > syslog-ng package from gentoo:
> > > http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> > >
> > > > The source tmpfs is mounted without any special parameters, but the
> > > > > target xfs filesystem resides on a dm-crypt device that is on top a 3
> > > > > disk RAID5 md.
> > > > > During the hang all CPUs where idle.
> > > >
> > > > No iowaits? ;-)
> > >
> > > No, I have a KSysGuard in my taskbar that showed no activity at all.
> > >
> > > OK, the subject does not match for my case, but there was also a tmpfs
> > > involved. And I found no thread with stalls on xfs. :-)
> >
> > Do you mean it is actually related with tmpfs?
> 
> I don't know. It's just that I have seen tmpfs also redirtying inodes
> in these logs and the stalling emerge is moving files from tmpfs to
> xfs.
> It could be, but I don't know enough about tmpfs internals to really be sure.
> I just wanted to mention, that tmpfs is involved somehow.

The requeue messages for tmpfs are not pleasant, but known to be fine ;-) 

Fengguang

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-02 Thread Torsten Kaiser
On 11/2/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> On Thu, Nov 01, 2007 at 08:00:10PM +0100, Torsten Kaiser wrote:
> > On 11/1/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> > > On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > > Thank you. Maybe we can start by the applied debug patch :-)
> > >
> > > Will applied it and try to recreate this.
> >
> > Patch applied, used emerge to install a 2.6.24-rc1 kernel.
> >
> > I had no complete stalls, but three times during the move from tmpfs
> > to the main xfs the emerge got noticeable slower. There still was
> > writeout happening, but as emerge prints out every file it has written
> > during the pause not one file was processed.
> >
> > vmstat 10:
> > procs ---memory-- ---swap-- -io -system-- 
> > cpu
> >  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id 
> > wa
> >  0  1  0 3146424332 61476800   134  1849  438 2515  3  4 91 
> >  2
> >  0  0  0 3146644332 61478400 2  1628  507  646  0  2 85 
> > 13
> >  0  0  0 3146868332 61486800 5  2359  527 1076  0  3 97 
> >  0
> >  1  0  0 3144372332 6161480096  2829  607 2666  2  5 92 
> >  0
> > -> normal writeout
> >  0  0  0 3140560332 61814400   152  2764  633 3308  3  6 91 
> >  0
> >  0  0  0 3137332332 61990800   114  1801  588 2858  3  4 93 
> >  0
> >  0  0  0 3136912332 6201360020   827  393 1605  1  2 98 
> >  0
> > -> first stall
>
> 'stall': vmstat's output stalls for some time, or emerge stalls for
> the next several vmstat lines?

emerge stalls. The vmstat did work normally.

> >  0  0  0 3137088332 62013600 0   557  339 1437  0  1 99 
> >  0
> >  0  0  0 3137160332 62013600 0   642  310 1400  0  1 99 
> >  0

So these last three lines indicate that for ~30 seconds
the writeout was much slower than normal.

> >  0  0  0 3136588332 62017200 6  2972  527 1195  0  3 80 
> > 16
> >  0  0  0 3136276332 6203480010  2668  558 1195  0  3 96 
> >  0
> >  0  0  0 3135228332 62042400 8  2712  522 1311  0  4 96 
> >  0
> >  0  0  0 3131740332 6215240075  2935  559 2457  2  5 93 
> >  0
> >  0  0  0 3128348332 6229720085  1470  490 2607  3  4 93 
> >  0
> >  0  0  0 3129292332 62297200 0   527  353 1398  0  1 99 
> >  0
> > -> second longer stall
> >  0  0  0 3128520332 62302800 6   488  249 1390  0  1 99 
> >  0
> >  0  0  0 3128236332 62302800 0   482  222 1222  0  1 99 
> >  0
> >  0  0  0 3128408332 62302800 0   585  269 1301  0  0 99 
> >  0
> >  0  0  0 3128532332 62302800 0   610  262 1278  0  0 99 
> >  0
> >  0  0  0 3128568332 62302800 0   636  345 1639  0  1 99 
> >  0
> >  0  0  0 3129032332 62304000 1   664  337 1466  0  1 99 
> >  0
> >  0  0  0 3129484332 62304000 0   658  300 1508  0  0 
> > 100  0
> >  0  0  0 3129576332 62304000 0   562  271 1454  0  1 99 
> >  0
> >  0  0  0 3129736332 62304000 0   627  278 1406  0  1 99 
> >  0
> >  0  0  0 3129368332 62304000 0   507  274 1301  0  1 99 
> >  0
> >  0  0  0 3129004332 62304000 0   444  211 1213  0  0 99 
> >  0

The second time the slowdown was much longer.

> >  0  1  0 3127260332 62304000 0  1036  305 1242  0  1 95 
> >  4
> >  0  0  0 3126280332 62312800 7  4241  555 1575  1  5 84 
> > 10
> >  0  0  0 3124948332 62323200 6  4194  529 1505  1  4 95 
> >  0
> >  0  0  0 3125228332 6241680058  1966  586 1964  2  4 94 
> >  0
> > -> emerge resumed to normal speed, without any intervention from my side
> >  0  0  0 3120932332 62590400   112  1546  546 2565  3  4 93 
> >  0
> >  0  0  0 3118012332 62756800   128  1542  612 2705  3  4 93 
> >  0
>
> Interesting, the 'bo' never falls to zero.

Yes, I was not able to recreate the complete stall from the first
mail, but even this slowdown does not look completely healthy.
I "hope" this is the same bug, as I seem to be able to trigger this
slowdown much more easily.

[snip logs]
>
> Interestingly, no background_writeout() appears, but only
> balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
> block the process.

Yes, I noticed that too.
The only time I have seen background_writeout was during bootup and shutdown.

As for the stalled watch cat /proc/meminfo:
That happened on the third slowdown/stall, when emerge was already finished.

> > If I see the complete stall again, I will post that too.
>
> Thank you, could you run it with the attached new debug patch?

I will, but it will have to wait until the 

Re: writeout stalls in current -git

2007-11-02 Thread Torsten Kaiser
The Subject is still misleading; I'm using 2.6.23-mm1.

On 11/2/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> > On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > > > Each time I noticed this I was using emerge (package util from the
> > > > gentoo distribution) to install/upgrade a package. The last step,
> > > > where this hang occurred, is moving the prepared files from a tmpfs
> > > > partition to the main xfs filesystem.
> > > > The hangs were not fatal, after a few seconds everything resumed
> > > > normal, so I was not able to capture a good image of what was
> > > > happening.
> > >
> > > Thank you for the detailed report.
> > >
> > How severe were the hangs? Only writeouts stalled, all apps stalled, or
> > > cannot type and run new commands?
> >
> > Only writeout stalled. The emerge that was moving the files hung, but
> everything else worked normally.
> I was able to run new commands, like copying /proc/meminfo.
>
> But you mentioned in the next mail that `watch cat /proc/meminfo`
> could also be blocked for some time - I guess in the same time emerge
> was stalled?

The behavior was different on these stalls.
On the first report the writeout stopped completely, the emerge stopped,
but at that time a cat /proc/meminfo >~/stall/meminfo did succeed and
not stall.
About the watch cat /proc/meminfo, I will write in the answer to the
other mail...

> > [snip]
> > > > After this SysRq+W writeback resumed again. Possible that writing
> > > > above into the syslog triggered that.
> > >
> > > Maybe. Are the log files on another disk/partition?
> >
> > No, everything was going to /
> >
> > What might be interesting is that doing cat /proc/meminfo
> > >~/stall/meminfo did not resume the writeback. So there might be some
> > threshold that was only broken with the additional write from
> > syslog-ng. Or syslog-ng does some flushing, I don't know. (I'm using the
>
> Have you tried explicit `sync`? ;-)

No. I wanted to see what was stalled. So I started by collecting info
from /proc and then the SysRq+W. And after hitting SysRQ the writeout
started to resume without any further action.

But I think I have seen a `sync` stall also. During another emerge I
noticed the system slowing down and wanted to use `sync` to speed up
the writeout. The result was that the writeout did not speed up
immediately, only after around a minute. The `sync` only returned at
that time.
Can writers starve `sync`?

> > syslog-ng package from gentoo:
> > http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> >
> > The source tmpfs is mounted without any special parameters, but the
> > > > target xfs filesystem resides on a dm-crypt device that is on top a 3
> > > > disk RAID5 md.
> > > > During the hang all CPUs where idle.
> > >
> > > No iowaits? ;-)
> >
> > No, I have a KSysGuard in my taskbar that showed no activity at all.
> >
> > OK, the subject does not match for my case, but there was also a tmpfs
> > involved. And I found no thread with stalls on xfs. :-)
>
> Do you mean it is actually related with tmpfs?

I don't know. It's just that I have seen tmpfs also redirtying inodes
in these logs and the stalling emerge is moving files from tmpfs to
xfs.
It could be, but I don't know enough about tmpfs internals to really be sure.
I just wanted to mention, that tmpfs is involved somehow.

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-02 Thread Torsten Kaiser
The Subject is still missleading, I'm using 2.6.23-mm1.

On 11/2/07, Fengguang Wu [EMAIL PROTECTED] wrote:
 On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
  On 11/1/07, Fengguang Wu [EMAIL PROTECTED] wrote:
   On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
Each time I noticed this I was using emerge (package util from the
gentoo distribution) to install/upgrade a package. The last step,
where this hang occurred, is moving the prepared files from a tmpfs
partion to the main xfs filesystem.
The hangs where not fatal, after a few second everything resumed
normal, so I was not able to capture a good image of what was
happening.
  
   Thank you for the detailed report.
  
   How severe was the hangs? Only writeouts stalled, all apps stalled, or
   cannot type and run new commands?
 
  Only writeout stalled. The emerge that was moving the files hung, but
  everything else worked normaly.
  I was able to run new commands, like coping the /proc/meminfo.

 But you mentioned in the next mail that `watch cat /proc/meminfo`
 could also be blocked for some time - I guess in the same time emerge
 was stalled?

The behavior was different on these stalls.
On first report the writeout stopped completly, the emerge stopped,
but at that time a cat /proc/meminfo ~/stall/meminfo did succedd and
not stall.
About the watch cat /proc/meminfo, I will write in the answer to the
other mail...

  [snip]
After this SysRq+W writeback resumed again. Possible that writing
above into the syslog triggered that.
  
   Maybe. Are the log files on another disk/partition?
 
  No, everything was going to /
 
  What might be interesting is, that doing cat /proc/meminfo
  ~/stall/meminfo did not resume the writeback. So there might some
  threshold that only was broken with the additional write from
  syslog-ng. Or syslog-ng does some flushing, I dont now. (I'm using the

 Have you tried explicit `sync`? ;-)

No. I wanted to see what is stalled. So I startet by collecting info
from /proc and then the SysRq+W. And after hitting SysRQ the writeout
started to resume without any further action.

But I think I have seen a `sync` stall also. During an other emerge I
noticed the system slowing down and wanted to use `sync` to speed up
the writeout. The result was, that the writeout did not speed up
imiedetly only after around a minitue. The `sync` only returned at
that time.
Can writers starve `sync`?

  syslog-ng package from gentoo:
  http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
 
The source tmpfs is mounted with any special parameters, but the
target xfs filesystem resides on a dm-crypt device that is on top a 3
disk RAID5 md.
During the hang all CPUs where idle.
  
   No iowaits? ;-)
 
  No, I have a KSysGuard in my taskbar that showed no activity at all.
 
  OK, the subject does not match for my case, but there was also a tmpfs
  involved. And I found no thread with stalls on xfs. :-)

 Do you mean it is actually related with tmpfs?

I don't know. It's just that I have seen tmpfs also redirtieing inodes
in these logs and the stalling emerge is moving files from tmpfs to
xfs.
It could be, but I don't know enough about tmpfs internals to really be sure.
I just wanted to mention, that tmpfs is involved somehow.

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-02 Thread Fengguang Wu
On Fri, Nov 02, 2007 at 08:42:05AM +0100, Torsten Kaiser wrote:
 The Subject is still missleading, I'm using 2.6.23-mm1.
 
 On 11/2/07, Fengguang Wu [EMAIL PROTECTED] wrote:
  On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
   On 11/1/07, Fengguang Wu [EMAIL PROTECTED] wrote:
On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
 Since 2.6.23-mm1 I also experience strange hangs during heavy 
 writeouts.
 Each time I noticed this I was using emerge (package util from the
 gentoo distribution) to install/upgrade a package. The last step,
 where this hang occurred, is moving the prepared files from a tmpfs
 partion to the main xfs filesystem.
 The hangs where not fatal, after a few second everything resumed
 normal, so I was not able to capture a good image of what was
 happening.
   
Thank you for the detailed report.
   
How severe was the hangs? Only writeouts stalled, all apps stalled, or
cannot type and run new commands?
  
   Only writeout stalled. The emerge that was moving the files hung, but
   everything else worked normaly.
   I was able to run new commands, like coping the /proc/meminfo.
 
  But you mentioned in the next mail that `watch cat /proc/meminfo`
  could also be blocked for some time - I guess in the same time emerge
  was stalled?
 
 The behavior was different on these stalls.
 On first report the writeout stopped completly, the emerge stopped,
 but at that time a cat /proc/meminfo ~/stall/meminfo did succedd and
 not stall.
 About the watch cat /proc/meminfo, I will write in the answer to the
 other mail...

OK.

   [snip]
 After this SysRq+W writeback resumed again. Possible that writing
 above into the syslog triggered that.
   
Maybe. Are the log files on another disk/partition?
  
   No, everything was going to /
  
   What might be interesting is, that doing cat /proc/meminfo
   ~/stall/meminfo did not resume the writeback. So there might some
   threshold that only was broken with the additional write from
   syslog-ng. Or syslog-ng does some flushing, I dont now. (I'm using the
 
  Have you tried explicit `sync`? ;-)
 
 No. I wanted to see what is stalled. So I startet by collecting info
 from /proc and then the SysRq+W. And after hitting SysRQ the writeout
 started to resume without any further action.
 
 But I think I have seen a `sync` stall also. During an other emerge I
 noticed the system slowing down and wanted to use `sync` to speed up
 the writeout. The result was, that the writeout did not speed up
 imiedetly only after around a minitue. The `sync` only returned at
 that time.
 Can writers starve `sync`?

I guess the new debug printks will provide more hints on it.

   syslog-ng package from gentoo:
   http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
  
 The source tmpfs is mounted with any special parameters, but the
 target xfs filesystem resides on a dm-crypt device that is on top a 3
 disk RAID5 md.
 During the hang all CPUs where idle.
   
No iowaits? ;-)
  
   No, I have a KSysGuard in my taskbar that showed no activity at all.
  
   OK, the subject does not match for my case, but there was also a tmpfs
   involved. And I found no thread with stalls on xfs. :-)
 
  Do you mean it is actually related with tmpfs?
 
 I don't know. It's just that I have seen tmpfs also redirtieing inodes
 in these logs and the stalling emerge is moving files from tmpfs to
 xfs.
 It could be, but I don't know enough about tmpfs internals to really be sure.
 I just wanted to mention, that tmpfs is involved somehow.

The requeue messages for tmpfs are not pleasant, but known to be fine ;-) 

Fengguang

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-02 Thread Torsten Kaiser
On 11/2/07, Fengguang Wu [EMAIL PROTECTED] wrote:
 I guess the new debug printks will provide more hints on it.

The throttle_vm_writeout did not trigger for my new workload.
Except one (the first) balance_dirty_pages came from line 445, the
newly added.

But I found an other workload that looks much more ... hmm ... 'mad'.

If I do an unmerge the emerge program will read all files to
revalidate their checksum and then delete it. If I do this unmerge the
progress of emerge will stall periodically for ~47 second. (Two times
I used a stopwatch to get this value. I think all other stalls where
identical, at least in KSysGuard they looked evenly spaced)

What really counts as 'mad' is this output from vmstat 10:
0  0  0 3639044332 17742000   29220  101  618  1  1 98  0
 1  0  0 3624068332 18062800   32322  137  663  5  2 93  0
 0  0  0 3602456332 18397200   30123  159  641  9  3 87  2
- this was emerge collecting its package database
 0  0  0 3600052332 1842640019  7743  823 5543  3  8 89  0
 0  0  0 3599332332 18428000 1  2532  517 2341  1  2 97  0
- normal removing, now the emerge stalls
 0  0  0 3599404332 18428000 0   551  323 1290  0  0 99  0
 0  0  0 3599648332 18428000 0   644  314 1222  0  1 99  0
 0  0  0 3599648332 18428400 0   569  296 1242  0  0 99  0
 0  0  0 3599868332 18428800 0  2362  320 2735  1  2 97  0
- resumes for a short time, then stalls again
 0  0  0 3599488332 18428800 0   584  292 1395  0  0 99  0
 0  0  0 3600216332 18428800 0   550  301 1361  0  0 99  0
 0  0  0 3594176332 18429600 0   562  300 1373  2  1 97  0
 0  0  0 3594648332 18429600 0  1278  336 1881  1  1 98  0
 0  0  0 3594172332 18430800 1  2812  421 2840  1  4 95  0
- and again
 0  0  0 3594296332 18430800 0   545  342 1283  0  0 99  0
 0  0  0 3594376332 18430800 0   561  319 1314  0  1 99  0
 0  0  0 3594340332 18430800 0   586  327 1258  0  1 99  0
 0  0  0 3594644332 18430800 0   498  248 1376  0  0 99  0
 0  0  0 3595116332 18434800 0  3519  565 3452  2  4 95  0
- and again
 0  0  0 3595320332 18434800 0   483  284 1163  0  0 99  0
 3  0  0 3595444332 18435200 0   498  247 1173  3  0 97  0
 1  0  0 3585108332 18460000 0  1298  644 2394  1  1 98  0
 1  0  0 3588152332 18460800 0  3154  520 3221  2  4 94  0
- and again
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  0  0 3588540332 18460800 0   574  268 1332  0  1 99  0
 1  0  0 3588744332 18460800 0   546  335 1289  0  0 99  0
 1  0  0 3588628332 18460800 0   638  348 1257  0  1 99  0
 1  0  0 3588952332 18460800 0   567  310 1226  0  1 99  0
 1  0  0 3603644332 1849720059  2821  531 2419  3  4 91  1
 1  0  0 3649476332 18627200   370   395  380 1335  1  1 98  0
- emerge finishes, and now the system goes 'mad'
The Dirty:-line from /proc/meminfo stays at 8 or 12 kB, but there
system is writing like 'mad':
 1  0  0 3650616332 18627600 0   424  296 1126  0  1 99  0
 1  0  0 3650708332 18627600 0   418  249 1190  0  0 99  0
 1  0  0 3650716332 18627600 0   418  256 1151  0  1 99  0
 1  0  0 3650816332 18627600 0   420  257 1120  0  0 99  0
 1  0  0 3651132332 18627600 0   418  269 1145  0  0 99  0
 1  0  0 3651332332 18628000 0   419  294 1099  0  1 99  0
 1  0  0 3651732332 18628000 0   423  311 1072  0  1 99  0
 1  0  0 3652048332 18628000 0   400  317 1127  0  0 99  0
 1  0  0 3652024332 18628000 0   426  346 1066  0  1 99  0
 2  0  0 3652304332 18628000 0   425  357 1132  0  1 99  0
 2  0  0 3652652332 18628000 0   416  364 1184  0  0 99  0
 1  0  0 3652836332 18628000 0   413  397 1110  0  1 99  0
 1  0  0 3652852332 18628400 0   426  427 1290  0  1 99  0
 1  0  0 3652060332 1864200014   404  421 1768  1  1 97  0
 1  0  0 3652904332 18642000 0   418  437 1792  1  1 98  0
 1  0  0 3653572332 18642000 0   410  442 1481  1  1 99  0
 2  0  0 3653872332 18642000 0   410  451 1206  0  1 99  0
 3  0  0 3654572332 18642000 0   414  479 1341  0  1 99  0
 1  0  0 3651720332 18983200   341   420  540 1600  1  1 98  1
 1  0  0 

Re: writeout stalls in current -git

2007-11-02 Thread Fengguang Wu
On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
 On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
 
  Interestingly, no background_writeout() appears, but only
  balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
  block the process.
 
 Yeah, the background threshold is not (yet) scaled. So it can happen
 that the bdi_dirty limit is below the background limit.
 
 I'm curious though as to these stalls, though, I can't seem to think of
 what goes wrong.. esp since most writeback seems to happen from pdflush.

Me confused too. The new debug patch will confirm whether emerge is
waiting in balance_dirty_pages().

 (or I'm totally misreading it - quite a possible as I'm still recovering
 from a serious cold and not all the green stuff has yet figured out its
 proper place wrt brain cells 'n stuff)

Do take care of yourself.

 
 I still have this patch floating around:

I think this patch is OK for 2.6.24 :-)

Reviewed-by: Fengguang Wu [EMAIL PROTECTED] 

 
 ---
 Subject: mm: speed up writeback ramp-up on clean systems
 
 We allow violation of bdi limits if there is a lot of room on the
 system. Once we hit half the total limit we start enforcing bdi limits
 and bdi ramp-up should happen. Doing it this way avoids many small
 writeouts on an otherwise idle system and should also speed up the
 ramp-up.
 
 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]

 ---
  mm/page-writeback.c |   19 +++++++++++++++++--
  1 file changed, 17 insertions(+), 2 deletions(-)
 
 Index: linux-2.6/mm/page-writeback.c
 ===================================================================
 --- linux-2.6.orig/mm/page-writeback.c        2007-09-28 10:08:33.937415368 +0200
 +++ linux-2.6/mm/page-writeback.c     2007-09-28 10:54:26.018247516 +0200
 @@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
    */
  static void balance_dirty_pages(struct address_space *mapping)
  {
 -        long bdi_nr_reclaimable;
 -        long bdi_nr_writeback;
 +        long nr_reclaimable, bdi_nr_reclaimable;
 +        long nr_writeback, bdi_nr_writeback;
          long background_thresh;
          long dirty_thresh;
          long bdi_thresh;
 @@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
 
                  get_dirty_limits(&background_thresh, &dirty_thresh,
                                  &bdi_thresh, bdi);
 +
 +                nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 +                                global_page_state(NR_UNSTABLE_NFS);
 +                nr_writeback = global_page_state(NR_WRITEBACK);
 +
                  bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
                  bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 +
                  if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
                          break;
 
 +                /*
 +                 * Throttle it only when the background writeback cannot
 +                 * catch-up. This avoids (excessively) small writeouts
 +                 * when the bdi limits are ramping up.
 +                 */
 +                if (nr_reclaimable + nr_writeback <
 +                                (background_thresh + dirty_thresh) / 2)
 +                        break;
 +
                  if (!bdi->dirty_exceeded)
                          bdi->dirty_exceeded = 1;
  
 
 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-02 Thread Torsten Kaiser
On 11/2/07, Fengguang Wu [EMAIL PROTECTED] wrote:
 On Thu, Nov 01, 2007 at 08:00:10PM +0100, Torsten Kaiser wrote:
  On 11/1/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
   On 11/1/07, Fengguang Wu [EMAIL PROTECTED] wrote:
Thank you. Maybe we can start by the applied debug patch :-)
  
   Will apply it and try to recreate this.
 
  Patch applied, used emerge to install a 2.6.24-rc1 kernel.
 
  I had no complete stalls, but three times during the move from tmpfs
  to the main xfs the emerge got noticeable slower. There still was
  writeout happening, but as emerge prints out every file it has written
  during the pause not one file was processed.
 
  vmstat 10:
  procs ---memory-- ---swap-- -io -system-- 
  cpu
   r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id 
  wa
   0  1  0 3146424332 61476800   134  1849  438 2515  3  4 91 
   2
   0  0  0 3146644332 61478400 2  1628  507  646  0  2 85 
  13
   0  0  0 3146868332 61486800 5  2359  527 1076  0  3 97 
   0
   1  0  0 3144372332 6161480096  2829  607 2666  2  5 92 
   0
  - normal writeout
   0  0  0 3140560332 61814400   152  2764  633 3308  3  6 91 
   0
   0  0  0 3137332332 61990800   114  1801  588 2858  3  4 93 
   0
   0  0  0 3136912332 6201360020   827  393 1605  1  2 98 
   0
  - first stall

 'stall': vmstat's output stalls for some time, or emerge stalls for
 the next several vmstat lines?

emerge stalls. The vmstat did work normally.

   0  0  0 3137088332 62013600 0   557  339 1437  0  1 99 
   0
   0  0  0 3137160332 62013600 0   642  310 1400  0  1 99 
   0

So these last three lines indicate that for ~30 seconds
the writeout was much slower than normal.

   0  0  0 3136588332 62017200 6  2972  527 1195  0  3 80 
  16
   0  0  0 3136276332 6203480010  2668  558 1195  0  3 96 
   0
   0  0  0 3135228332 62042400 8  2712  522 1311  0  4 96 
   0
   0  0  0 3131740332 6215240075  2935  559 2457  2  5 93 
   0
   0  0  0 3128348332 6229720085  1470  490 2607  3  4 93 
   0
   0  0  0 3129292332 62297200 0   527  353 1398  0  1 99 
   0
  - second longer stall
   0  0  0 3128520332 62302800 6   488  249 1390  0  1 99 
   0
   0  0  0 3128236332 62302800 0   482  222 1222  0  1 99 
   0
   0  0  0 3128408332 62302800 0   585  269 1301  0  0 99 
   0
   0  0  0 3128532332 62302800 0   610  262 1278  0  0 99 
   0
   0  0  0 3128568332 62302800 0   636  345 1639  0  1 99 
   0
   0  0  0 3129032332 62304000 1   664  337 1466  0  1 99 
   0
   0  0  0 3129484332 62304000 0   658  300 1508  0  0 
  100  0
   0  0  0 3129576332 62304000 0   562  271 1454  0  1 99 
   0
   0  0  0 3129736332 62304000 0   627  278 1406  0  1 99 
   0
   0  0  0 3129368332 62304000 0   507  274 1301  0  1 99 
   0
   0  0  0 3129004332 62304000 0   444  211 1213  0  0 99 
   0

The second time the slowdown was much longer.

   0  1  0 3127260332 62304000 0  1036  305 1242  0  1 95 
   4
   0  0  0 3126280332 62312800 7  4241  555 1575  1  5 84 
  10
   0  0  0 3124948332 62323200 6  4194  529 1505  1  4 95 
   0
   0  0  0 3125228332 6241680058  1966  586 1964  2  4 94 
   0
  - emerge resumed to normal speed, without any intervention from my side
   0  0  0 3120932332 62590400   112  1546  546 2565  3  4 93 
   0
   0  0  0 3118012332 62756800   128  1542  612 2705  3  4 93 
   0

 Interesting, the 'bo' never falls to zero.

Yes, I was not able to recreate the complete stall from the first
mail, but even this slowdown does not look completely healthy.
I hope this is the same bug, as I seem to be able to trigger this
slowdown much more easily.

[snip logs]

 Interestingly, no background_writeout() appears, but only
 balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
 block the process.

Yes, I noticed that too.
The only time I have seen background_writeout was during bootup and shutdown.

As for the stalled watch cat /proc/meminfo:
That happened on the third slowdown/stall, when emerge was already finished.

  If I see the complete stall again, I will post that too.

 Thank you, could you run it with the attached new debug patch?

I will, but it will have to wait until the evening.

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please 

Re: writeout stalls in current -git

2007-11-02 Thread Peter Zijlstra
On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:

 Interestingly, no background_writeout() appears, but only
 balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
 block the process.

Yeah, the background threshold is not (yet) scaled. So it can happen
that the bdi_dirty limit is below the background limit.
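
To put rough numbers on that point (everything below is invented for
illustration, not read from the reporter's machine): the global thresholds
are plain percentages of dirtyable memory, while the per-bdi threshold is
the dirty threshold scaled by that bdi's recent share of writeout, so a bdi
that has not done much writeback yet can sit far below even the background
threshold:

#include <stdio.h>

int main(void)
{
        /* hypothetical figures, in kB of dirtyable memory */
        long dirtyable = 4L * 1024 * 1024;              /* roughly a 4GB box */
        long background_thresh = dirtyable * 5 / 100;   /* dirty_background_ratio-style */
        long dirty_thresh = dirtyable * 10 / 100;       /* vm_dirty_ratio-style */
        double bdi_share = 0.02;        /* bdi has done ~2% of recent writeout */
        long bdi_thresh = (long)(dirty_thresh * bdi_share);

        printf("background %ld kB, dirty %ld kB, bdi %ld kB\n",
               background_thresh, dirty_thresh, bdi_thresh);
        printf("bdi_thresh %s background_thresh\n",
               bdi_thresh < background_thresh ? "is below" : "is not below");
        return 0;
}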

I'm curious though as to these stalls, though, I can't seem to think of
what goes wrong.. esp since most writeback seems to happen from pdflush.

(or I'm totally misreading it - quite possible as I'm still recovering
from a serious cold and not all the green stuff has yet figured out its
proper place wrt brain cells 'n stuff)


I still have this patch floating around:

---
Subject: mm: speed up writeback ramp-up on clean systems

We allow violation of bdi limits if there is a lot of room on the
system. Once we hit half the total limit we start enforcing bdi limits
and bdi ramp-up should happen. Doing it this way avoids many small
writeouts on an otherwise idle system and should also speed up the
ramp-up.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page-writeback.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c  2007-09-28 10:08:33.937415368 +0200
+++ linux-2.6/mm/page-writeback.c       2007-09-28 10:54:26.018247516 +0200
@@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-        long bdi_nr_reclaimable;
-        long bdi_nr_writeback;
+        long nr_reclaimable, bdi_nr_reclaimable;
+        long nr_writeback, bdi_nr_writeback;
         long background_thresh;
         long dirty_thresh;
         long bdi_thresh;
@@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
 
                 get_dirty_limits(&background_thresh, &dirty_thresh,
                                 &bdi_thresh, bdi);
+
+                nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+                                global_page_state(NR_UNSTABLE_NFS);
+                nr_writeback = global_page_state(NR_WRITEBACK);
+
                 bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
                 bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+
                 if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
                         break;
 
+                /*
+                 * Throttle it only when the background writeback cannot
+                 * catch-up. This avoids (excessively) small writeouts
+                 * when the bdi limits are ramping up.
+                 */
+                if (nr_reclaimable + nr_writeback <
+                                (background_thresh + dirty_thresh) / 2)
+                        break;
+
                 if (!bdi->dirty_exceeded)
                         bdi->dirty_exceeded = 1;
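
As a toy model of the effect of the added check (invented page counts, not
kernel state): the dirtying task is only throttled once the global
dirty+writeback total has passed the halfway point between the background
and hard limits, so a writer that is over its still-tiny, ramping-up
bdi_thresh on an otherwise idle system is left alone:

#include <stdio.h>

static int throttle(long nr_reclaimable, long nr_writeback,
                    long bdi_dirty, long bdi_thresh,
                    long background_thresh, long dirty_thresh)
{
        if (bdi_dirty <= bdi_thresh)
                return 0;                       /* bdi under its own limit */
        if (nr_reclaimable + nr_writeback <
            (background_thresh + dirty_thresh) / 2)
                return 0;                       /* new check: plenty of global room */
        return 1;                               /* throttle the dirtier */
}

int main(void)
{
        /* hypothetical figures, in pages */
        long background_thresh = 50000, dirty_thresh = 100000;
        long bdi_thresh = 800;                  /* bdi limit still ramping up */

        printf("idle system:   throttle=%d\n",
               throttle(3000, 500, 3500, bdi_thresh,
                        background_thresh, dirty_thresh));
        printf("loaded system: throttle=%d\n",
               throttle(70000, 20000, 60000, bdi_thresh,
                        background_thresh, dirty_thresh));
        return 0;
}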
 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-02 Thread Torsten Kaiser
On 11/2/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
 On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:

  Interestingly, no background_writeout() appears, but only
  balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
  block the process.

 Yeah, the background threshold is not (yet) scaled. So it can happen
 that the bdi_dirty limit is below the background limit.

I still have not seen a trigger of the throttle_vm_writeout.
This time, installing 2.6.24-rc1 again did not even trigger any of the other
debug messages apart from the one in wb_kupdate.
But 300Mb of new files might still not trigger this with 4Gb of RAM.

I'm currently testing 2.6.23-mm1 with this patch and the second
writeback-debug patch.

 I'm curious though as to these stalls, though, I can't seem to think of
 what goes wrong.. esp since most writeback seems to happen from pdflush.

I also don't know. But looking at the time the system takes to write
out 8kb, I'm starting to suspect that something is writing this out,
but not marking it clean... (Or redirtying it immediately?)

 (or I'm totally misreading it - quite a possible as I'm still recovering
 from a serious cold and not all the green stuff has yet figured out its
 proper place wrt brain cells 'n stuff)

Get well soon!

 I still have this patch floating around:

 ---
 Subject: mm: speed up writeback ramp-up on clean systems

applied, but did not fix the stalls.

Here is the complete log from vmstat 10 and the syslog from an install of
vanilla 2.6.24-rc1.
(Please note: I installed the source of vanilla 2.6.24-rc1, but I am
still using 2.6.23-mm1!)
All lines with [note] are my comments about what the system was doing,
both logs are from the same run, so the notes should be more or less
in sync. I used SysRq+L to insert the SysRq help text into the syslog
as a marker...

The visible effects are similar to the unmerge run, but the stalls
during the moving only started later. The same effect was visible after
emerge finished and almost all dirty data had been written: I can still
hear the disks and see the hdd light flickering (mostly on) for much,
much longer than it should take to write 8kb.

vmstat 10:
[note]emerge start
 1  0  0 3668496332 18774800 029   39  491  3  0 96  0
 1  0  0 3623940332 10008317 1724 3893 15  2 81  1
 0  0  0 3559488332 25243200  102148 11719 4536  9  4 74 13
 2  0  0 3482220332 311916007060   93 3818 11  3 86  0
 1  0  0 3289352332 48693200 235   33 11997 25  3 72  0
 1  0  0 3174036332 596412001033   35 3937 21  4 75  0
 2  0  0 3215756332 55529200 628   85  742 12 12 76  0
 2  0  0 3202128332 5597920032 9   34 1566 31  1 68  0
 2  0  0 3192804332 568072006046  172 4206 30  2 67  1
 3  0  0 3202424332 57262000 020  111 2223 27  1 72  0
 1  0  0 3196112332 57890000 0  1649  149 2763 25  2 73  0
 1  0  0 3190004332 58495600 017  110 2270 25  1 74  0
 1  0  0 3183952332 59084000 011  104 2553 25  1 74  0
 1  0  0 3176952332 59706800 0  2153  124 2886 25  2 72  0
 1  0  0 3171044332 60259200 022  109 2580 26  1 73  0
 1  0  0 3174896332 60549600   173  1441  312 2249  9  6 84  1
 1  0  0 3165204332 61185600   569  3221  606 4236  4  7 87  1
 0  0  0 3160856332 61351600   116  2281  570 3077  3  5 92  0
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 0  0  0 3154712332 61520000   108  2166  528 3038  3  4 93  0
 0  0  0 3156008332 6154200018  1941  537 1015  0  2 97  0
 0  0  0 3156652332 61550400 8  2232  547  900  0  2 98  0
 0  0  0 3156748332 6156720012  1932  537  947  0  2 98  0
 0  0  0 3154720332 6159000014  2204  584 1256  1  2 97  0
 0  0  0 3154256332 6160600010  2676  610 1317  1  3 96  0
 1  0  0 3152488332 61628400 9  1994  573 1024  1  2 97  0
 0  0  0 3152404332 61640800 4  2218  540  904  0  2 97  0
 0  0  0 3151244332 6171560044  2198  598 1921  2  4 94  0
 0  0  0 3147224332 61867200   110  1802  644 2575  3  4 93  0
 0  0  0 3144608332 6198240080  1590  543 1900  2  4 95  0
 0  0  0 3140768332 62144800   111  1758  657 2735  3  4 93  0
 0  0  0 3140816332 6218960026   801  531 1667  1  2 98  0
[note] first stall, SysRq+W
 1  0  0 3127620332 62189600 0   640  490 1381  2  1 97  0
 0  0  0 3127780332 62190000 0   627  475 1531  2  1 98  0
 0  0  0 3127560  

Re: writeout stalls in current -git

2007-11-02 Thread David Chinner
On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
 [  630.00] SysRq : Emergency Sync
 [  630.12] Emergency Sync complete
 [  632.85] SysRq : Show Blocked State
 [  632.85]   taskPC stack   pid father
 [  632.85] pdflush   D 81000f091788 0   285  2
 [  632.85]  810005d4da80 0046 0800
 0071
 [  632.85]  81000fd52400 8022d61c 80819b00
 80819b00
 [  632.85]  80815f40 80819b00 810100316f98
 
 [  632.85] Call Trace:
 [  632.85]  [8022d61c] task_rq_lock+0x4c/0x90
 [  632.85]  [8022c8ea] __wake_up_common+0x5a/0x90
 [  632.85]  [805b16e7] __down+0xa7/0x11e
 [  632.85]  [8022da70] default_wake_function+0x0/0x10
 [  632.85]  [805b1365] __down_failed+0x35/0x3a
 [  632.85]  [803752ce] xfs_buf_lock+0x3e/0x40
 [  632.85]  [8037740e] _xfs_buf_find+0x13e/0x240
 [  632.85]  [8037757f] xfs_buf_get_flags+0x6f/0x190
 [  632.85]  [803776b2] xfs_buf_read_flags+0x12/0xa0
 [  632.85]  [80368824] xfs_trans_read_buf+0x64/0x340
 [  632.85]  [80352361] xfs_itobp+0x81/0x1e0
 [  632.85]  [8026b293] write_cache_pages+0x123/0x330
 [  632.85]  [80354d0e] xfs_iflush+0xfe/0x520

That's stalled waiting on the inode cluster buffer lock. That implies
that the inode cluster is already being written out and the inode has
been redirtied during writeout.
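
As a user-space analogy of that situation (this is not XFS code and every
name in it is invented): several in-core inodes share one on-disk cluster
buffer, and whoever flushes any of them has to hold the cluster buffer lock,
so a second flusher - think pdflush coming in through xfs_iflush - simply
sleeps until the first writer's I/O completes, which from the outside looks
like a stall:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t cluster_buf_lock = PTHREAD_MUTEX_INITIALIZER;

static void flush_inode(const char *who, int inode, int io_seconds)
{
        printf("%s: flushing inode %d, waiting for cluster buffer\n", who, inode);
        pthread_mutex_lock(&cluster_buf_lock);
        printf("%s: got cluster buffer, writing...\n", who);
        sleep(io_seconds);                      /* pretend this is the disk I/O */
        pthread_mutex_unlock(&cluster_buf_lock);
        printf("%s: done with inode %d\n", who, inode);
}

static void *pdflush_thread(void *arg)
{
        (void)arg;
        sleep(1);                               /* let the first writer take the lock */
        flush_inode("pdflush", 43, 1);          /* different inode, same cluster */
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, pdflush_thread, NULL);
        flush_inode("first writer", 42, 5);     /* long write holds the lock */
        pthread_join(t, NULL);
        return 0;
}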

Does the kernel you are testing have the "flush inodes in ascending
inode number order" patches applied? If so, can you remove that
patch and see if the problem goes away?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-02 Thread Torsten Kaiser
On 11/2/07, David Chinner [EMAIL PROTECTED] wrote:
 On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
  [  630.00] SysRq : Emergency Sync
  [  630.12] Emergency Sync complete
  [  632.85] SysRq : Show Blocked State
  [  632.85]   taskPC stack   pid father
  [  632.85] pdflush   D 81000f091788 0   285  2
  [  632.85]  810005d4da80 0046 0800
  0071
  [  632.85]  81000fd52400 8022d61c 80819b00
  80819b00
  [  632.85]  80815f40 80819b00 810100316f98
  
  [  632.85] Call Trace:
  [  632.85]  [8022d61c] task_rq_lock+0x4c/0x90
  [  632.85]  [8022c8ea] __wake_up_common+0x5a/0x90
  [  632.85]  [805b16e7] __down+0xa7/0x11e
  [  632.85]  [8022da70] default_wake_function+0x0/0x10
  [  632.85]  [805b1365] __down_failed+0x35/0x3a
  [  632.85]  [803752ce] xfs_buf_lock+0x3e/0x40
  [  632.85]  [8037740e] _xfs_buf_find+0x13e/0x240
  [  632.85]  [8037757f] xfs_buf_get_flags+0x6f/0x190
  [  632.85]  [803776b2] xfs_buf_read_flags+0x12/0xa0
  [  632.85]  [80368824] xfs_trans_read_buf+0x64/0x340
  [  632.85]  [80352361] xfs_itobp+0x81/0x1e0
  [  632.85]  [8026b293] write_cache_pages+0x123/0x330
  [  632.85]  [80354d0e] xfs_iflush+0xfe/0x520

 That's stalled waiting on the inode cluster buffer lock. That implies
 that the inode cluster is already being written out and the inode has
 been redirtied during writeout.

 Does the kernel you are testing have the "flush inodes in ascending
 inode number order" patches applied? If so, can you remove that
 patch and see if the problem goes away?

It's 2.6.23-mm1 with only some small fixes.

In its broken-out directory I see:
git-xfs.patch

and

writeback-fix-periodic-superblock-dirty-inode-flushing.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
writeback-fix-time-ordering-of-the-per-superblock-inode-lists-8.patch
writeback-introduce-writeback_controlmore_io-to-indicate-more-io.patch

I don't know if the patch you mentioned is part of that version of the
mm-patchset.

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


writeout stalls in current -git

2007-11-01 Thread Fengguang Wu
On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > > Each time I noticed this I was using emerge (package util from the
> > > gentoo distribution) to install/upgrade a package. The last step,
> > > where this hang occurred, is moving the prepared files from a tmpfs
> > > partition to the main xfs filesystem.
> > > The hangs were not fatal, after a few seconds everything resumed
> > > normal, so I was not able to capture a good image of what was
> > > happening.
> >
> > Thank you for the detailed report.
> >
> > How severe was the hangs? Only writeouts stalled, all apps stalled, or
> > cannot type and run new commands?
> 
> Only writeout stalled. The emerge that was moving the files hung, but
> everything else worked normally.
> I was able to run new commands, like copying the /proc/meminfo.

But you mentioned in the next mail that `watch cat /proc/meminfo`
could also be blocked for some time - I guess in the same time emerge
was stalled?

> [snip]
> > > After this SysRq+W writeback resumed again. Possible that writing
> > > above into the syslog triggered that.
> >
> > Maybe. Are the log files on another disk/partition?
> 
> No, everything was going to /
> 
> What might be interesting is, that doing cat /proc/meminfo
> >~/stall/meminfo did not resume the writeback. So there might be some
> threshold that was only broken by the additional write from
> syslog-ng. Or syslog-ng does some flushing, I don't know. (I'm using the

Have you tried explicit `sync`? ;-)

> syslog-ng package from gentoo:
> http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> 
> > > The source tmpfs is mounted without any special parameters, but the
> > > target xfs filesystem resides on a dm-crypt device that is on top a 3
> > > disk RAID5 md.
> > > During the hang all CPUs were idle.
> >
> > No iowaits? ;-)
> 
> No, I have a KSysGuard in my taskbar that showed no activity at all.
> 
> OK, the subject does not match for my case, but there was also a tmpfs
> involved. And I found no thread with stalls on xfs. :-)

Do you mean it is actually related with tmpfs?

> > > The system is x86_64 with CONFIG_NO_HZ=y, but was still receiving ~330
> > > interrupts per second because of the bttv driver. (But I was not using
> > > that device at this time.)
> > >
> > > I'm willing to test patches or provide more information, but lack
> > > a good testcase to trigger this on demand.
> >
> > Thank you. Maybe we can start by the applied debug patch :-)
> 
> Will apply it and try to recreate this.
> 
> Thanks for looking into it.

Thank you for the rich information, too :-)

Fengguang

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: writeout stalls in current -git

2007-11-01 Thread Fengguang Wu
On Thu, Nov 01, 2007 at 08:00:10PM +0100, Torsten Kaiser wrote:
 On 11/1/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
  On 11/1/07, Fengguang Wu [EMAIL PROTECTED] wrote:
   Thank you. Maybe we can start by the applied debug patch :-)
 
  Will apply it and try to recreate this.
 
 Patch applied, used emerge to install a 2.6.24-rc1 kernel.
 
 I had no complete stalls, but three times during the move from tmpfs
 to the main xfs the emerge got noticeable slower. There still was
 writeout happening, but as emerge prints out every file it has written
 during the pause not one file was processed.
 
 vmstat 10:
 procs ---memory-- ---swap-- -io -system-- cpu
  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
  0  1  0 3146424332 61476800   134  1849  438 2515  3  4 91  2
  0  0  0 3146644332 61478400 2  1628  507  646  0  2 85 13
  0  0  0 3146868332 61486800 5  2359  527 1076  0  3 97  0
  1  0  0 3144372332 6161480096  2829  607 2666  2  5 92  0
 - normal writeout
  0  0  0 3140560332 61814400   152  2764  633 3308  3  6 91  0
  0  0  0 3137332332 61990800   114  1801  588 2858  3  4 93  0
  0  0  0 3136912332 6201360020   827  393 1605  1  2 98  0
 - first stall

'stall': vmstat's output stalls for some time, or emerge stalls for
the next several vmstat lines?

  0  0  0 3137088332 62013600 0   557  339 1437  0  1 99  0
  0  0  0 3137160332 62013600 0   642  310 1400  0  1 99  0
  0  0  0 3136588332 62017200 6  2972  527 1195  0  3 80 16
  0  0  0 3136276332 6203480010  2668  558 1195  0  3 96  0
  0  0  0 3135228332 62042400 8  2712  522 1311  0  4 96  0
  0  0  0 3131740332 6215240075  2935  559 2457  2  5 93  0
  0  0  0 3128348332 6229720085  1470  490 2607  3  4 93  0
  0  0  0 3129292332 62297200 0   527  353 1398  0  1 99  0
 - second longer stall
  0  0  0 3128520332 62302800 6   488  249 1390  0  1 99  0
  0  0  0 3128236332 62302800 0   482  222 1222  0  1 99  0
  0  0  0 3128408332 62302800 0   585  269 1301  0  0 99  0
  0  0  0 3128532332 62302800 0   610  262 1278  0  0 99  0
  0  0  0 3128568332 62302800 0   636  345 1639  0  1 99  0
  0  0  0 3129032332 62304000 1   664  337 1466  0  1 99  0
  0  0  0 3129484332 62304000 0   658  300 1508  0  0 100   0
  0  0  0 3129576332 62304000 0   562  271 1454  0  1 99  0
  0  0  0 3129736332 62304000 0   627  278 1406  0  1 99  0
  0  0  0 3129368332 62304000 0   507  274 1301  0  1 99  0
  0  0  0 3129004332 62304000 0   444  211 1213  0  0 99  0
  0  1  0 3127260332 62304000 0  1036  305 1242  0  1 95  4
  0  0  0 3126280332 62312800 7  4241  555 1575  1  5 84 10
  0  0  0 3124948332 62323200 6  4194  529 1505  1  4 95  0
  0  0  0 3125228332 6241680058  1966  586 1964  2  4 94  0
 - emerge resumed to normal speed, without any intervention from my side
  0  0  0 3120932332 62590400   112  1546  546 2565  3  4 93  0
  0  0  0 3118012332 62756800   128  1542  612 2705  3  4 93  0

Interesting, the 'bo' never falls to zero.

 
 From syslog:
 first stall:
 [  575.05] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47259  
 global 610 0 0 wc __ tw 1023 sk 0
 [  586.35] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50465  
 global 6117 0 0 wc _M tw 967 sk 0
 [  586.36] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50408  
 global 6117 0 0 wc __ tw 1022 sk 0
 [  599.90] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 53523  
 global 11141 0 0 wc __ tw 1009 sk 0
 [  635.78] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 59397  
 global 12757 124 0 wc __ tw 0 sk 0
 [  638.47] mm/page-writeback.c 418 balance_dirty_pages:  emerge(6113) 
 1536 global 11405 51 0 wc __ tw 0 sk 0
 [  638.82] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 58373  
 global 11276 48 0 wc __ tw -1 sk 0
 [  641.26] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 57348  
 global 10565 100 0 wc __ tw 0 sk 0
 [  643.98] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 56324  
 global 9788 103 0 wc __ tw -1 sk 0
 [  646.12] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 55299  
 global 8912 6 0 wc __ tw 0 sk 0
 
 second stall:
 [  664.04] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48117  
 global 2864 81 0 wc _M tw -13 sk 0
 [  664.40] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47080  
 global 1995 137 0 wc _M tw 176 sk 0
 [  664.51]