Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-05 Thread Fengguang Wu
On Fri, Oct 05, 2007 at 05:41:03PM +1000, David Chinner wrote:
> On Fri, Oct 05, 2007 at 11:36:52AM +0800, Fengguang Wu wrote:
> > On Thu, Oct 04, 2007 at 03:03:44PM +1000, David Chinner wrote:
> > > On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote:

> > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { /* all-written or blockade... */
> >         if (wbc.encountered_congestion || wbc.more_io) /* blockade! */
> >                 congestion_wait(WRITE, HZ/10);
> >         else                                           /* all-written! */
> >                 break;
> > }
> 
> From this, if we have more_io on one superblock and we skip pages on a
> different superblock, the combination of the two will cause us to stop
> writeback for a while. Is this the right thing to do?

No, the two cases will occur at the same time for a super_block.
See below.

> > We can also read the whole background_writeout() logic as
> > 
> > while (!done) {
> > /* sync _all_ sync-able data */
> > congestion_wait(100ms);
> > }
> 
> To me it reads as:
> 
>   while (!done) {
>   /* sync all data or until one inode skips */
>   congestion_wait(up to 100ms);
>   }
> 
> and it ignores that we might have more superblocks with dirty data
> on them that we haven't flushed because we skipped pages on
> an inode on a different block device.

AFAIK, generic_sync_sb_inodes() will simply skip the inode in trouble
and _continue_ to sync other inodes:

if (wbc->pages_skipped != pages_skipped) {
        /*
         * writeback is not making progress due to locked
         * buffers.  Skip this inode for now.
         */
        redirty_tail(inode);
}

Note that there's no "break" here.
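
For context, here is a condensed sketch of how that check sits inside the
per-superblock inode scan. This is illustrative only, simplified from the
2.6.23-era fs/fs-writeback.c, and field and helper details may differ:

/* illustrative sketch, not a verbatim copy of the kernel source */
while (!list_empty(&sb->s_io)) {
        struct inode *inode = list_entry(sb->s_io.prev,
                                         struct inode, i_list);
        long pages_skipped = wbc->pages_skipped;

        __writeback_single_inode(inode, wbc);   /* may skip locked pages */

        if (wbc->pages_skipped != pages_skipped)
                redirty_tail(inode);    /* park this inode, keep scanning */

        if (wbc->nr_to_write <= 0)
                break;                  /* write quota spent, not an error */
}

The troubled inode is merely re-queued; the scan itself only stops once the
write quota is spent.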

> > Sure, the queues should be filled as fast as possible.
> > How fast can we fill the queue? Let's measure it:
> > 
> > //generated by the patch below
> > 
> > [  871.430700] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 54289 
> > global 29911 0 0 wc _M tw -12 sk 0
> > [  871.444718] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 53253 
> > global 28857 0 0 wc _M tw -12 sk 0
> > [  871.458764] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 52217 
> > global 27834 0 0 wc _M tw -12 sk 0
> > [  871.472797] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 51181 
> > global 26780 0 0 wc _M tw -12 sk 0
> > [  871.486825] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 50145 
> > global 25757 0 0 wc _M tw -12 sk 0
> > [  871.500857] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 49109 
> > global 24734 0 0 wc _M tw -12 sk 0
> > [  871.514864] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 48073 
> > global 23680 0 0 wc _M tw -12 sk 0
> > [  871.528889] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 47037 
> > global 22657 0 0 wc _M tw -12 sk 0
> > [  871.542894] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 46001 
> > global 21603 0 0 wc _M tw -12 sk 0
> > [  871.556927] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 44965 
> > global 20580 0 0 wc _M tw -12 sk 0
> > [  871.570961] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 43929 
> > global 19557 0 0 wc _M tw -12 sk 0
> > [  871.584992] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 42893 
> > global 18503 0 0 wc _M tw -12 sk 0
> > [  871.599005] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 41857 
> > global 17480 0 0 wc _M tw -12 sk 0
> > [  871.613027] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 40821 
> > global 16426 0 0 wc _M tw -12 sk 0
> > [  871.628626] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 39785 
> > global 15403 961 0 wc _M tw -12 sk 0
> > [  871.644439] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 38749 
> > global 14380 1550 0 wc _M tw -12 sk 0
> > [  871.660267] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 37713 
> > global 13326 2573 0 wc _M tw -12 sk 0
> > [  871.676236] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 36677 
> > global 12303 3224 0 wc _M tw -12 sk 0
> > [  871.692021] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 35641 
> > global 11280 4154 0 wc _M tw -12 sk 0
> > [  871.707824] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 34605 
> > global 10226 4929 0 wc _M tw -12 sk 0
> > [  871.723638] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 33569 
> > global 9203 5735 0 wc _M tw -12 sk 0
> > [  871.739708] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 32533 
> > global 8149 6603 0 wc _M tw -12 sk 0
> > [  871.756407] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 31497 
> > global 7126 7409 0 wc _M tw -12 sk 0
> > [  871.772165] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 30461 
> > global 6103 8246 0 wc _M tw -12 sk 0
> > [  871.788035] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 29425 
> > global 5049 9052 0 wc _M tw -12 sk 0
> > [  871.803896] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 28389 
> > global 4026 9982 0 wc _M tw 

Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-05 Thread David Chinner
On Fri, Oct 05, 2007 at 11:36:52AM +0800, Fengguang Wu wrote:
> On Thu, Oct 04, 2007 at 03:03:44PM +1000, David Chinner wrote:
> > On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote:
> > > OK, I guess it is the focus of all your questions: Why should we sleep
> > > in congestion_wait() and possibly hurt the write throughput? I'll try
> > > to summarize it:
> > > 
> > > - congestion_wait() is necessary
> > > Besides device congestion, there may be other blockades we have to
> > > wait on, e.g. temporary page locks, NFS/journal issues (I guess).
> > 
> > We skip locked pages in writeback, and if some filesystems have
> > blocking issues that require non-blocking writeback waits for some
> > I/O to complete before re-entering writeback, then perhaps they should be
> > setting wbc->encountered_congestion to tell writeback to back off.
> 
> We have wbc->pages_skipped for that :-)

I walked right into that one ;)

> if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { /* all-written or blockade... */
>         if (wbc.encountered_congestion || wbc.more_io) /* blockade! */
>                 congestion_wait(WRITE, HZ/10);
>         else                                           /* all-written! */
>                 break;
> }

From this, if we have more_io on one superblock and we skip pages on a
different superblock, the combination of the two will cause us to stop
writeback for a while. Is this the right thing to do?

> We can also read the whole background_writeout() logic as
> 
> while (!done) {
> /* sync _all_ sync-able data */
> congestion_wait(100ms);
> }

To me it reads as:

while (!done) {
/* sync all data or until one inode skips */
congestion_wait(up to 100ms);
}

and it ignores that we might have more superblocks with dirty data
on them that we haven't flushed because we skipped pages on
an inode on a different block device.


> Note that it's far from "wait 100ms for every 4MB" (which is merely
> the worst possible case).

If that's the worst case, then it's far better than the current
"wait 30s for every 4MB".  ;)

Still, if it can be improved...

> > > - congestion_wait() won't hurt write throughput
> > > When not congested, congestion_wait() will be woken up on each write
> > > completion.
> > 
> > What happens if the I/O we issued has already completed before we
> > got back up to the congestion_wait() call? We'll spend 100ms
> > sleeping when we shouldn't have and throughput goes down by 10% on
> > every occurrence
> 
> Ah, that was out of my imagination. Maybe we could do with
> 
> if (wbc.more_io)
> congestion_wait(WRITE, 1);
> 
> It's at least 10 times better.

And probably good enough to make it unnoticeable.
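
For a rough sense of scale (assuming standard jiffy arithmetic):
congestion_wait(WRITE, HZ/10) bounds the sleep at 100ms regardless of HZ,
while congestion_wait(WRITE, 1) bounds it at one jiffy, i.e. about 10ms at
HZ=100, 4ms at HZ=250, or 1ms at HZ=1000, which is where the "at least 10
times better" figure comes from. Either way the sleep can still end early on
the write-completion wakeups described above.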

> "going mad" means "busy waiting".

Ah, OK. That I understand ;)

> > > > > So for your question of queue depth, the answer is: the queue length
> > > > > will not build up in the first place. 
> > > > 
> > > > Which queue are you talking about here? The block device queue?
> > > 
> > > Yes, the elevator's queues.
> > 
> > I think this is the wrong thing to be doing and is detrimental
> > to I/O performance because it will reduce elevator efficiency.
> > 
> > The elevator can only work efficiently if we allow the queues to
> > build up. The deeper the queue, the better the elevator can sort the
> > I/O requests and keep the device at maximum efficiency.  If we don't
> > push enough I/O into the queues then we miss opportunities to combine
> > adjacent I/Os and reduce the seek load of writeback. Also, a shallow
> > queue will run dry if we don't get back to it in time, which is
> > possible if we wait for I/O to complete before we go and flush
> > more...
> 
> Sure, the queues should be filled as fast as possible.
> How fast can we fill the queue? Let's measure it:
> 
> //generated by the patch below
> 
> [  871.430700] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 54289 global 
> 29911 0 0 wc _M tw -12 sk 0
> [  871.444718] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 53253 global 
> 28857 0 0 wc _M tw -12 sk 0
> [  871.458764] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 52217 global 
> 27834 0 0 wc _M tw -12 sk 0
> [  871.472797] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 51181 global 
> 26780 0 0 wc _M tw -12 sk 0
> [  871.486825] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 50145 global 
> 25757 0 0 wc _M tw -12 sk 0
> [  871.500857] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 49109 global 
> 24734 0 0 wc _M tw -12 sk 0
> [  871.514864] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 48073 global 
> 23680 0 0 wc _M tw -12 sk 0
> [  871.528889] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 47037 global 
> 22657 0 0 wc _M tw -12 sk 0
> [  871.542894] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 46001 global 
> 21603 0 0 wc _M tw -12 sk 0
> [  871.556927] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 44965 global 
> 20580 0 0 wc _M tw -12 sk 0
> 

Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-04 Thread Fengguang Wu
On Thu, Oct 04, 2007 at 03:03:44PM +1000, David Chinner wrote:
> On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote:
> > On Wed, Oct 03, 2007 at 12:41:19PM +1000, David Chinner wrote:
> > > On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote:
> > > > On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> > > > > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > > > > > wbc.pages_skipped = 0;
> > > > > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > > > > > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > > > > > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > > > > > /* Wrote less than expected */
> > > > > > -   congestion_wait(WRITE, HZ/10);
> > > > > > -   if (!wbc.encountered_congestion)
> > > > > > +   if (wbc.encountered_congestion || wbc.more_io)
> > > > > > +   congestion_wait(WRITE, HZ/10);
> > > > > > +   else
> > > > > > break;
> > > > > > }
> > > > > 
> > > > > Why do you call congestion_wait() if there is more I/O to issue?  If
> > > > > we have a fast filesystem, this might cause the device queues to
> > > > > fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> > > > > will have trouble keeping the queues full, right?
> > > > 
> > > > You mean slow writers and fast RAID? That would be exactly the case
> > > > these patches try to improve.
> > > 
> > > I mean any writers and a fast block device (raid or otherwise).
> > > 
> > > > This patchset makes kupdate/background writeback more responsible,
> > > > so that if (avg-write-speed < device-capabilities), the dirty data are
> > > > synced timely, and we don't have to go for balance_dirty_pages().
> > > 
> > > Sure, but I'm asking about the effect of the patches on the
> > > (avg-write-speed == device-capabilities) case. I agree that
> > > they are necessary for timely syncing of data but I'm trying
> > > to understand what effect they have on the normal write case
> > 
> > > (i.e. keeping the disk at full write throughput).
> > 
> > OK, I guess it is the focus of all your questions: Why should we sleep
> > in congestion_wait() and possibly hurt the write throughput? I'll try
> > to summarize it:
> > 
> > - congestion_wait() is necessary
> > Besides device congestion, there may be other blockades we have to
> > wait on, e.g. temporary page locks, NFS/journal issues (I guess).
> 
> We skip locked pages in writeback, and if some filesystems have
> blocking issues that require non-blocking writeback waits for some
> I/O to complete before re-entering writeback, then perhaps they should be
> setting wbc->encountered_congestion to tell writeback to back off.

We have wbc->pages_skipped for that :-)

> The question I'm asking is: if more_io tells us we have more
> work to do, why do we have to sleep first if the block dev is
> able to take more I/O?

See below.

> > 
> > - congestion_wait() is called only when necessary
> > congestion_wait() will only be called when we saw blockades:
> > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> >         congestion_wait(WRITE, HZ/10);
> > }
> > So in the normal case, it may well write 128MB of data without any waiting.
> 
> Sure, but wbc.more_io doesn't indicate a blockade - just that there
> is more work to do, right?
 
It's not wbc.more_io but the context (wbc.pages_skipped > 0) that indicates
a blockade:

if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { /* all-written or blockade... */
        if (wbc.encountered_congestion || wbc.more_io) /* blockade! */
                congestion_wait(WRITE, HZ/10);
        else                                           /* all-written! */
                break;
}

We can also read the whole background_writeout() logic as

while (!done) {
/* sync _all_ sync-able data */
congestion_wait(100ms);
}

And an example run could be:

sync 1000MB, skipped 100MB
congestion_wait(100ms);
sync 100MB, skipped 10MB
congestion_wait(100ms);
sync 10MB, all done

Note that it's far from "wait 100ms for every 4MB" (which is merely
the worst possible case).
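
To put the worst case in numbers (assuming 4KB pages): a pass submits at most
MAX_WRITEBACK_PAGES = 1024 pages = 4MB, so if every pass really were followed
by a full 100ms congestion_wait(), the waits alone would cap writeback at
roughly 4MB / 100ms = 40MB/s no matter how fast the device is. The example run
above is meant to show why the typical case is far better: the skipped amount
shrinks on each pass, so only a handful of waits occur per large sync.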

> > - congestion_wait() won't hurt write throughput
> When not congested, congestion_wait() will be woken up on each write
> > completion.
> 
> What happens if the I/O we issued has already completed before we
> got back up to the congestion_wait() call? We'll spend 100ms
> sleeping when we shouldn't have and throughput goes down by 10% on
> every occurrence

Ah, that was out of my imagination. Maybe we could do with

if (wbc.more_io)
congestion_wait(WRITE, 1);

It's at least 10 times better.

> if we've got more work to do, then we should do it without an
> arbitrary, non-deterministic delay being inserted. If the delay is
> needed to prevent the system from "going mad" (whatever that means),
> then what's the explanation for the system "going mad"?

"going mad" means "busy waiting".

Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-03 Thread David Chinner
On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote:
> On Wed, Oct 03, 2007 at 12:41:19PM +1000, David Chinner wrote:
> > On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote:
> > > On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> > > > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > > > >   wbc.pages_skipped = 0;
> > > > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > > > >   min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > > > >   if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > > > >   /* Wrote less than expected */
> > > > > - congestion_wait(WRITE, HZ/10);
> > > > > - if (!wbc.encountered_congestion)
> > > > > + if (wbc.encountered_congestion || wbc.more_io)
> > > > > + congestion_wait(WRITE, HZ/10);
> > > > > + else
> > > > >   break;
> > > > >   }
> > > > 
> > > > Why do you call congestion_wait() if there is more I/O to issue?  If
> > > > we have a fast filesystem, this might cause the device queues to
> > > > fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> > > > will have trouble keeping the queues full, right?
> > > 
> > > You mean slow writers and fast RAID? That would be exactly the case
> > > these patches try to improve.
> > 
> > I mean any writers and a fast block device (raid or otherwise).
> > 
> > > This patchset makes kupdate/background writeback more responsible,
> > > so that if (avg-write-speed < device-capabilities), the dirty data are
> > > synced timely, and we don't have to go for balance_dirty_pages().
> > 
> > Sure, but I'm asking about the effect of the patches on the
> > (avg-write-speed == device-capabilities) case. I agree that
> > they are necessary for timely syncing of data but I'm trying
> > to understand what effect they have on the normal write case
> 
> > (i.e. keeping the disk at full write throughput).
> 
> OK, I guess it is the focus of all your questions: Why should we sleep
> in congestion_wait() and possibly hurt the write throughput? I'll try
> to summarize it:
> 
> - congestion_wait() is necessary
> Besides device congestion, there may be other blockades we have to
> wait on, e.g. temporary page locks, NFS/journal issues (I guess).

We skip locked pages in writeback, and if some filesystems have
blocking issues that require non-blocking writeback waits for some
I/O to complete before re-entering writeback, then perhaps they should be
setting wbc->encountered_congestion to tell writeback to back off.

The question I'm asking is: if more_io tells us we have more
work to do, why do we have to sleep first if the block dev is
able to take more I/O?

> 
> - congestion_wait() is called only when necessary
> congestion_wait() will only be called when we saw blockades:
> if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
>         congestion_wait(WRITE, HZ/10);
> }
> So in the normal case, it may well write 128MB of data without any waiting.

Sure, but wbc.more_io doesn't indicate a blockade - just that there
is more work to do, right?

> - congestion_wait() won't hurt write throughput
> When not congested, congestion_wait() will be woken up on each write
> completion.

What happens if the I/O we issued has already completed before we
got back up to the congestion_wait() call? We'll spend 100ms
sleeping when we shouldn't have and throughput goes down by 10% on
every occurrence

if we've got more work to do, then we should do it without an
arbitrary, non-deterministic delay being inserted. If the delay is
needed to prevent the system from "going mad" (whatever that means),
then what's the explanation for the system "going mad"?

> Note that MAX_WRITEBACK_PAGES=1024 and
> /sys/block/sda/queue/max_sectors_kb=512(for me),
> which means we are given the chance to sync 4MB on every 512KB written,
> which means we are able to submit write IOs 8 times faster than the
> device capability. congestion_wait() is a magical timer :-)

So, with Jens Axboe's sglist chaining, that single I/O could now
be up to 32MB on some hardware. IOWs, we push 1024 pages, and that
could end up as a single I/O being issued to disk.

Your magic just broke. :/
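
One way to put numbers on that concern (a reading of the argument above, using
figures quoted earlier in the thread): a 1024-page pass is 4MB, and with
max_sectors_kb=512 it is split into eight requests, so the flusher can expect
several completion wakeups while a pass worth of data drains; if sg-list
chaining lets the driver take the whole 4MB (or more) as one request, the pass
collapses into a single I/O, there is at most one wakeup per pass, and the
"submit 8 times faster than the device" margin disappears.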

> > > So for your question of queue depth, the answer is: the queue length
> > > will not build up in the first place. 
> > 
> > Which queue are you talking about here? The block device queue?
> 
> Yes, the elevator's queues.

I think this is the wrong thing to be doing and is detrimental
to I/O performance because it will reduce elevator efficiency.

The elevator can only work efficiently if we allow the queues to
build up. The deeper the queue, the better the elevator can sort the
I/O requests and keep the device at maximum efficiency.  If we don't
push enough I/O into the queues then we miss opportunities to combine
adjacent I/Os and reduce the seek load of writeback. Also, a shallow
queue will run dry if we don't get back to it in time, which is
possible if we wait for I/O to complete before we go and flush more.

Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-03 Thread Fengguang Wu
On Wed, Oct 03, 2007 at 12:41:19PM +1000, David Chinner wrote:
> On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote:
> > On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> > > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > > > wbc.pages_skipped = 0;
> > > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > > > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > > > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > > > /* Wrote less than expected */
> > > > -   congestion_wait(WRITE, HZ/10);
> > > > -   if (!wbc.encountered_congestion)
> > > > +   if (wbc.encountered_congestion || wbc.more_io)
> > > > +   congestion_wait(WRITE, HZ/10);
> > > > +   else
> > > > break;
> > > > }
> > > 
> > > Why do you call congestion_wait() if there is more I/O to issue?  If
> > > we have a fast filesystem, this might cause the device queues to
> > > fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> > > will have trouble keeping the queues full, right?
> > 
> > You mean slow writers and fast RAID? That would be exactly the case
> > these patches try to improve.
> 
> I mean any writers and a fast block device (raid or otherwise).
> 
> > This patchset makes kupdate/background writeback more responsible,
> > so that if (avg-write-speed < device-capabilities), the dirty data are
> > synced timely, and we don't have to go for balance_dirty_pages().
> 
> Sure, but I'm asking about the effect of the patches on the
> (avg-write-speed == device-capabilities) case. I agree that
> they are necessary for timely syncing of data but I'm trying
> to understand what effect they have on the normal write case

> (i.e. keeping the disk at full write throughput).

OK, I guess it is the focus of all your questions: Why should we sleep
in congestion_wait() and possibly hurt the write throughput? I'll try
to summarize it:

- congestion_wait() is necessary
Besides device congestion, there may be other blockades we have to
wait on, e.g. temporary page locks, NFS/journal issues (I guess).

- congestion_wait() is called only when necessary
congestion_wait() will only be called when we saw blockades:
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
        congestion_wait(WRITE, HZ/10);
}
So in the normal case, it may well write 128MB of data without any waiting.

- congestion_wait() won't hurt write throughput
When not congested, congestion_wait() will be woken up on each write
completion. Note that MAX_WRITEBACK_PAGES=1024 and
/sys/block/sda/queue/max_sectors_kb=512 (for me),
which means we are given the chance to sync 4MB on every 512KB written,
which means we are able to submit write IOs 8 times faster than the
device capability. congestion_wait() is a magical timer :-)
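
The arithmetic behind that claim (assuming 4KB pages): MAX_WRITEBACK_PAGES =
1024 pages = 4MB may be submitted per writeback pass, while max_sectors_kb=512
means a request can complete, and wake congestion_wait(), after every 512KB;
4MB / 512KB = 8, so submission can in principle run eight times ahead of
completion.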

> > So for your question of queue depth, the answer is: the queue length
> > will not build up in the first place. 
> 
> Which queue are you talking about here? The block device queue?

Yes, the elevator's queues.

Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-02 Thread David Chinner
On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote:
> On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > >   wbc.pages_skipped = 0;
> > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > >   min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > >   if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > >   /* Wrote less than expected */
> > > - congestion_wait(WRITE, HZ/10);
> > > - if (!wbc.encountered_congestion)
> > > + if (wbc.encountered_congestion || wbc.more_io)
> > > + congestion_wait(WRITE, HZ/10);
> > > + else
> > >   break;
> > >   }
> > 
> > Why do you call congestion_wait() if there is more I/O to issue?  If
> > we have a fast filesystem, this might cause the device queues to
> > fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> > will have trouble keeping the queues full, right?
> 
> You mean slow writers and fast RAID? That would be exactly the case
> these patches try to improve.

I mean any writers and a fast block device (raid or otherwise).

> This patchset makes kupdate/background writeback more responsible,
> so that if (avg-write-speed < device-capabilities), the dirty data are
> synced timely, and we don't have to go for balance_dirty_pages().

Sure, but I'm asking about the effect of the patches on the
(avg-write-speed == device-capabilities) case. I agree that
they are necessary for timely syncing of data but I'm trying
to understand what effect they have on the normal write case
(i.e. keeping the disk at full write throughput).

> So for your question of queue depth, the answer is: the queue length
> will not build up in the first place. 

Which queue are you talking about here? The block device queue?

> Also the name of congestion_wait() could be misleading:
> - when not congested, congestion_wait() will wakeup on write
>   completions;
> - when congested, congestion_wait() could also wakeup on write
>   completions on other non-congested devices.
> So congestion_wait(100ms) normally only takes 0.1-10ms.

True, but if we know we are not congested and have more work
to do, why sleep at all?

> For the more_io case, congestion_wait() serves more like 'to take a
> breath'. Tests show that the system could go mad without it.

I'm interested to know what tests show that pushing more I/O when
you don't have block device congestion makes the system go mad (and
what mad means).  It sounds to me like it's hiding (yet another)
bug in the writeback code...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-02 Thread Fengguang Wu
On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > wbc.pages_skipped = 0;
> > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > /* Wrote less than expected */
> > -   congestion_wait(WRITE, HZ/10);
> > -   if (!wbc.encountered_congestion)
> > +   if (wbc.encountered_congestion || wbc.more_io)
> > +   congestion_wait(WRITE, HZ/10);
> > +   else
> > break;
> > }
> 
> Why do you call congestion_wait() if there is more I/O to issue?  If
> we have a fast filesystem, this might cause the device queues to
> fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> will have trouble keeping the queues full, right?

You mean slow writers and fast RAID? That would be exactly the case
these patches try to improve.

The old writeback behavior is sluggish when there is
- a single big dirty file;
- a single congested device.
The queues may well build up slowly, hit background_limit, and
continue to build up until they hit dirty_limit. That means:
- kupdate writeback could leave behind many expired dirty data
- background writeback used to return prematurely
- eventually it relies on balance_dirty_pages() to do the job,
  which means
  - writers get throttled unnecessarily
  - dirty_limit pages are pinned unnecessarily

This patchset makes kupdate/background writeback more responsible,
so that if (avg-write-speed < device-capabilities), the dirty data are
synced timely, and we don't have to go for balance_dirty_pages().

So for your question of queue depth, the answer is: the queue length
will not build up in the first place. 
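
For readers joining the thread here, the background_writeout() loop under
discussion, with the hunk above applied, reads roughly as below. This is
condensed from the hunks quoted in this thread; the wbc setup and the
exit-threshold test are paraphrased:

/* condensed sketch of background_writeout() with this patch applied */
for (;;) {
        if (below_background_thresh() && min_pages <= 0)
                break;                  /* paraphrased exit test */
        wbc.encountered_congestion = 0;
        wbc.nr_to_write = MAX_WRITEBACK_PAGES;  /* 1024 pages = 4MB */
        wbc.pages_skipped = 0;
        writeback_inodes(&wbc);
        min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
        if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
                /* wrote less than expected: blocked somewhere, or done */
                if (wbc.encountered_congestion || wbc.more_io)
                        congestion_wait(WRITE, HZ/10);
                else
                        break;          /* nothing left that we can sync */
        }
}

The patch only changes the inner test: a pass that wrote less than its quota
now waits briefly and retries whenever more_io (or congestion) says there is
still work, instead of returning early.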

Also the name of congestion_wait() could be misleading:
- when not congested, congestion_wait() will wakeup on write
  completions;
- when congested, congestion_wait() could also wakeup on write
  completions on other non-congested devices.
So congestion_wait(100ms) normally only takes 0.1-10ms.
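
For reference, the mechanism behind that behaviour looks roughly like the
sketch below. This is a simplified rendering, not a verbatim copy of the
mm/backing-dev.c code of the time, and congestion_wqh is its internal
per-direction waitqueue:

/* simplified sketch of congestion_wait(); details may differ */
long congestion_wait(int rw, long timeout)
{
        long ret;
        DEFINE_WAIT(wait);
        wait_queue_head_t *wqh = &congestion_wqh[rw];

        prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
        /* sleep until the timeout expires or the block layer wakes the
         * queue, i.e. the write-completion wakeups described above */
        ret = io_schedule_timeout(timeout);
        finish_wait(wqh, &wait);
        return ret;
}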

For the more_io case, congestion_wait() serves more like 'to take a
breath'. Tests show that the system could go mad without it.

Regards,
Fengguang



Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-02 Thread David Chinner
On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
>   wbc.pages_skipped = 0;
> @@ -560,8 +561,9 @@ static void background_writeout(unsigned
>   min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>   if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
>   /* Wrote less than expected */
> - congestion_wait(WRITE, HZ/10);
> - if (!wbc.encountered_congestion)
> + if (wbc.encountered_congestion || wbc.more_io)
> + congestion_wait(WRITE, HZ/10);
> + else
>   break;
>   }

Why do you call congestion_wait() if there is more I/O to issue?  If
we have a fast filesystem, this might cause the device queues to
fill, then drain on congestion_wait(), then fill again, etc. i.e. we
will have trouble keeping the queues full, right?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group