Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-18 Thread Mike Snitzer
On Jan 18, 2008 3:00 PM, Mike Snitzer <[EMAIL PROTECTED]> wrote:
>
> On Jan 18, 2008 12:46 PM, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> >
> >
> > On Fri, 18 Jan 2008, Mel Gorman wrote:
> > >
> > > Right, and this is consistent with other complaints about the PFN of the
> > > page mattering to some hardware.
> >
> > I don't think it's actually the PFN per se.
> >
> > I think it's simply that some controllers (quite probably affected by both
> > driver and hardware limits) have some subtle interactions with the size of
> > the IO commands.
> >
> > For example, let's say that you have a controller that has some limit X on
> > the size of IO in flight (whether due to hardware or driver issues doesn't
> > really matter) in addition to a limit on the scatter-gather list size.
> > They all tend to have limits, and they differ.
> >
> > Now, the PFN doesn't matter per se, but the allocation pattern definitely
> > matters for whether the IO's are physically contiguous, and thus matters
> > for the size of the scatter-gather thing.
> >
> > Now, generally the rule-of-thumb is that you want big commands, so
> > physical merging is good for you, but I could well imagine that the IO
> > limits interact, and end up hurting each other. Let's say that a better
> > allocation order allows for bigger contiguous physical areas, and thus
> > fewer scatter-gather entries.
> >
> > What does that result in? The obvious answer is
> >
> >   "Better performance obviously, because the controller needs to do fewer
> >scatter-gather lookups, and the requests are bigger, because there are
> >fewer IO's that hit scatter-gather limits!"
> >
> > Agreed?
> >
> > Except maybe the *real* answer for some controllers ends up being
> >
> >   "Worse performance, because individual commands grow because they don't
> >hit the per-command limits, but now we hit the global size-in-flight
> >limits and have many fewer of these good commands in flight. And while
> >the commands are larger, it means that there are fewer outstanding
> >commands, which can mean that the disk cannot scheduling things
> >as well, or makes high latency of command generation by the controller
> >much more visible because there aren't enough concurrent requests
> >queued up to hide it"
> >
> > Is this the reason? I have no idea. But somebody who knows the AACRAID
> > hardware and driver limits might think about interactions like that.
> > Sometimes you actually might want to have smaller individual commands if
> > there is some other limit that means that it can be more advantageous to
> > have many small requests over a few big ones.
> >
> > RAID might well make it worse. Maybe small requests work better because
> > they are simpler to schedule because they only hit one disk (e.g. if you
> > have simple striping)! So that's another reason why one *large* request
> > may actually be slower than two requests half the size, even if it's
> > against the "normal rule".
> >
> > And it may be that that AACRAID box takes a big hit on DIO exactly because
> > DIO has been optimized almost purely for making one command as big as
> > possible.
> >
> > Just a theory.
>
> Oddly enough, I'm seeing the opposite here with 2.6.22.16 w/ AACRAID
> configured with 5 LUNs (each a 2-disk HW RAID0, 1024k stripesz).  That
> is, with dd the avgrq-sz (from iostat) shows DIO to be ~130k whereas
> non-DIO is a mere ~13k! (NOTE: with aacraid, max_hw_sectors_kb=192)
...
> I can fire up 2.6.24-rc8 in short order to see if things are vastly
> improved (as Martin seems to indicate that he is happy with AACRAID on
> 2.6.24-rc8).  Although even Martin's AACRAID numbers from 2.6.19.2 are
> still quite good (relative to mine).  Martin, can you share any tuning
> you may have done to get AACRAID to where it is for you right now?

I can confirm that 2.6.24-rc8 behaves as Martin has posted for the
AACRAID: slower DIO with a smaller avgrq-sz, and much faster buffered IO
(for my config anyway) with a much larger avgrq-sz (~180k).

I have no idea why 2.6.22.16's request size on non-DIO is _so_ small...

Mike


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-18 Thread Mike Snitzer
On Jan 18, 2008 12:46 PM, Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>
> On Fri, 18 Jan 2008, Mel Gorman wrote:
> >
> > Right, and this is consistent with other complaints about the PFN of the
> > page mattering to some hardware.
>
> I don't think it's actually the PFN per se.
>
> I think it's simply that some controllers (quite probably affected by both
> driver and hardware limits) have some subtle interactions with the size of
> the IO commands.
>
> For example, let's say that you have a controller that has some limit X on
> the size of IO in flight (whether due to hardware or driver issues doesn't
> really matter) in addition to a limit on the scatter-gather list size.
> They all tend to have limits, and they differ.
>
> Now, the PFN doesn't matter per se, but the allocation pattern definitely
> matters for whether the IO's are physically contiguous, and thus matters
> for the size of the scatter-gather thing.
>
> Now, generally the rule-of-thumb is that you want big commands, so
> physical merging is good for you, but I could well imagine that the IO
> limits interact, and end up hurting each other. Let's say that a better
> allocation order allows for bigger contiguous physical areas, and thus
> fewer scatter-gather entries.
>
> What does that result in? The obvious answer is
>
>   "Better performance obviously, because the controller needs to do fewer
>scatter-gather lookups, and the requests are bigger, because there are
>fewer IO's that hit scatter-gather limits!"
>
> Agreed?
>
> Except maybe the *real* answer for some controllers ends up being
>
>   "Worse performance, because individual commands grow because they don't
>hit the per-command limits, but now we hit the global size-in-flight
>limits and have many fewer of these good commands in flight. And while
>the commands are larger, it means that there are fewer outstanding
>commands, which can mean that the disk cannot scheduling things
>as well, or makes high latency of command generation by the controller
>much more visible because there aren't enough concurrent requests
>queued up to hide it"
>
> Is this the reason? I have no idea. But somebody who knows the AACRAID
> hardware and driver limits might think about interactions like that.
> Sometimes you actually might want to have smaller individual commands if
> there is some other limit that means that it can be more advantageous to
> have many small requests over a few big ones.
>
> RAID might well make it worse. Maybe small requests work better because
> they are simpler to schedule because they only hit one disk (e.g. if you
> have simple striping)! So that's another reason why one *large* request
> may actually be slower than two requests half the size, even if it's
> against the "normal rule".
>
> And it may be that that AACRAID box takes a big hit on DIO exactly because
> DIO has been optimized almost purely for making one command as big as
> possible.
>
> Just a theory.
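
For concreteness: the per-command and size-in-flight limits described
above are exactly the knobs a 2.6-era SCSI low-level driver advertises
through its host template.  A minimal sketch with made-up values (not
aacraid's actual numbers):

#include <scsi/scsi_host.h>

/*
 * Illustrative limits only.  max_sectors caps the size of a single
 * command, sg_tablesize caps its scatter-gather list, and can_queue
 * caps how many commands may be in flight at once.  Larger requests
 * consume the can_queue budget in bigger chunks, leaving fewer
 * commands outstanding -- the interaction described above.
 */
static struct scsi_host_template example_sht = {
	.name		= "example",
	.can_queue	= 110,	/* total commands in flight */
	.cmd_per_lun	= 32,	/* default per-LUN queue depth */
	.sg_tablesize	= 17,	/* scatter-gather entries per command */
	.max_sectors	= 384,	/* 384 x 512B = 192k per command */
};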

Oddly enough, I'm seeing the opposite here with 2.6.22.16 w/ AACRAID
configured with 5 LUNs (each a 2-disk HW RAID0, 1024k stripesz).  That
is, with dd the avgrq-sz (from iostat) shows DIO to be ~130k whereas
non-DIO is a mere ~13k! (NOTE: with aacraid, max_hw_sectors_kb=192)

DIO cmdline:  dd if=/dev/zero of=/dev/sdX bs=8192k count=1k oflag=direct
non-DIO cmdline: dd if=/dev/zero of=/dev/sdX bs=8192k count=1k

DIO is ~80MB/s on all 5 LUNs for a total of ~400MB/s
non-DIO is only ~12MB/s on all 5 LUNs for a mere ~70MB/s aggregate
(deadline w/ nr_requests=32)

That calls into question the theory that small requests are beneficial
for AACRAID.  Martin, what are you seeing for the average request size
when you're conducting your AACRAID tests?

I can fire up 2.6.24-rc8 in short order to see if things are vastly
improved (as Martin seems to indicate that he is happy with AACRAID on
2.6.24-rc8).  Although even Martin's AACRAID numbers from 2.6.19.2 are
still quite good (relative to mine).  Martin, can you share any tuning
you may have done to get AACRAID to where it is for you right now?

regards,
Mike


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Mike Snitzer
On Jan 17, 2008 8:52 AM, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
>
> - Original Message 
> > From: Fengguang Wu <[EMAIL PROTECTED]>
> > To: Martin Knoblauch <[EMAIL PROTECTED]>
> > Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> > [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; 
> > "linux-ext4@vger.kernel.org" ; Linus Torvalds 
> > <[EMAIL PROTECTED]>
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> >
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have?  Just heavily test our own 2.6.24 + your
> > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > >
> > > Hi Fengguang, Mike,
> > >
> > >  I can add myself to Mike's question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > > showing quite nice improvement of the overall writeback situation and it
> > > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > > apparently already has reverted "...2250b". I will definitely repeat my
> > > tests with -rc8 and report.
> >
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> >
> Hi Fengguang,
>
>  something really bad has happened between -rc3 and -rc6. Embarrassingly I 
> did not catch that earlier :-(
>
>  Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 ,
> dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in
> pre-2.6.24. The only test that is still good is mix3, which I attribute to
> the per-BDI stuff.
>
>  At the moment I am frantically trying to find when things went down. I did
> run -rc8 and rc8+yourpatch. No difference from what I see with -rc6. Sorry
> that I cannot provide any input to your patch.
>
> Depressed
> Martin

Martin,

I've backported Peter's perbdi patchset to 2.6.22.x.  I can share it
with anyone who might be interested.

As expected, it has yielded 2.6.24-rcX-level scaling.  Given the test
result matrix you previously posted, 2.6.22.x+perbdi might give you
what you're looking for (sans the improved writeback that 2.6.24 was
thought to be providing).  That is, much improved scaling with better
O_DIRECT and network throughput.  Just a thought...

Unfortunately, my priorities (and computing resources) have shifted
and I won't be able to thoroughly test Fengguang's new writeback patch
on 2.6.24-rc8... thereby missing out on providing the justification and
testing that might help get _some_ writeback improvement into 2.6.24
final.

Not to mention that the window for writeback improvement is all but
closed, considering the release timetable in the 2.6.24-rc8
announcement.

regards,
Mike


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-16 Thread Mike Snitzer
On Jan 16, 2008 9:15 AM, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> - Original Message 
> > From: Fengguang Wu <[EMAIL PROTECTED]>
> > To: Martin Knoblauch <[EMAIL PROTECTED]>
> > Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> > [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; 
> > "linux-ext4@vger.kernel.org" ; Linus Torvalds 
> > <[EMAIL PROTECTED]>
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> >
>
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have?  Just heavily test our own 2.6.24 + your
> > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > >
> > > Hi Fengguang, Mike,
> > >
> > >  I can add myself to Mike's question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > > showing quite nice improvement of the overall writeback situation and it
> > > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > > apparently already has reverted "...2250b". I will definitely repeat my
> > > tests with -rc8 and report.
> >
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> >
>
>  Will do tomorrow or Friday. Actually a patch against -rc8 would be nicer
> for me, as I have not looked at -rc7 due to holidays and some of the
> reported problems with it.

Fengguang's latest writeback patch applies cleanly, builds, and boots on 2.6.24-rc8.

I'll be able to share ext3 performance results (relative to 2.6.24-rc7) shortly.

Mike
>
>
> > Fengguang
> > ---
> >  fs/fs-writeback.c         |   17 +++--
> >  include/linux/writeback.h |    1 +
> >  mm/page-writeback.c       |    9 ++---
> >  3 files changed, 22 insertions(+), 5 deletions(-)
> >
> > --- linux.orig/fs/fs-writeback.c
> > +++ linux/fs/fs-writeback.c
> > @@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
> >  			 * soon as the queue becomes uncongested.
> >  			 */
> >  			inode->i_state |= I_DIRTY_PAGES;
> > -			requeue_io(inode);
> > +			if (wbc->nr_to_write <= 0)
> > +				/*
> > +				 * slice used up: queue for next turn
> > +				 */
> > +				requeue_io(inode);
> > +			else
> > +				/*
> > +				 * somehow blocked: retry later
> > +				 */
> > +				redirty_tail(inode);
> >  		} else {
> >  			/*
> >  			 * Otherwise fully redirty the inode so that
> > @@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
> >  		iput(inode);
> >  		cond_resched();
> >  		spin_lock(&inode_lock);
> > -		if (wbc->nr_to_write <= 0)
> > +		if (wbc->nr_to_write <= 0) {
> > +			wbc->more_io = 1;
> >  			break;
> > +		}
> > +		if (!list_empty(&sb->s_more_io))
> > +			wbc->more_io = 1;
> >  	}
> >  	return;		/* Leave any unwritten inodes on s_io */
> >  }
> > --- linux.orig/include/linux/writeback.h
> > +++ linux/include/linux/writeback.h
> > @@ -62,6 +62,7 @@ struct writeback_control {
> >  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
> >  	unsigned for_writepages:1;	/* This is a writepages() call */
> >  	unsigned range_cyclic:1;	/* range_start is cyclic */
> > +	unsigned more_io:1;		/* more io to be dispatched */
> >  };
> >
> >  /*
> > --- linux.orig/mm/page-writeback.c
> > +++ linux/mm/page-writeback.c
> > @@ -558,6 +558,7 @@ static void background_writeout(unsigned
> >  		    global_page_state(NR_UNSTABLE_NFS) < background_thresh
> >  				&& min_pages <= 0)
> >  			break;
> > +		wbc.more_io = 0;
> >  		wbc.encountered_congestion = 0;
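
For readers without a 2.6.24 tree handy, the two queues the first hunk
chooses between behave roughly as follows; this is a simplified sketch
from memory of the era's fs/fs-writeback.c helpers, not verbatim kernel
source:

#include <linux/fs.h>
#include <linux/jiffies.h>
#include <linux/list.h>

/*
 * requeue_io(): park the inode on s_more_io so it is serviced again
 * before the current writeback pass is considered finished -- the
 * right thing when the write slice is merely used up.
 */
static void requeue_io(struct inode *inode)
{
	list_move(&inode->i_list, &inode->i_sb->s_more_io);
}

/*
 * redirty_tail(): send the inode back to s_dirty with a refreshed
 * timestamp so it is retried on a later pass -- the right thing when
 * writeback was blocked for some other reason.
 */
static void redirty_tail(struct inode *inode)
{
	inode->dirtied_when = jiffies;	/* simplified */
	list_move(&inode->i_list, &inode->i_sb->s_dirty);
}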

Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-15 Thread Mike Snitzer
On Jan 14, 2008 7:50 AM, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> >
> > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > >
> > > > Joerg, this patch fixed the bug for me :-)
> > >
> > > Fengguang, congratulations, I can confirm that your patch fixed the
> > > bug! With previous kernels the bug showed up after each reboot. Now,
> > > when booting the patched kernel everything is fine and there is no
> > > longer any suspicious iowait!
> > >
> > > Do you have an idea why this problem appeared in 2.6.24? Did somebody
> > > change the ext2 code or is it related to the changes in the scheduler?
> >
> > It was Fengguang who changed the inode writeback code, and I guess the
> > new and improved code was less able to deal with these funny corner
> > cases. But he has been very good at tracking them down and solving them,
> > kudos to him for that work!
>
> Thank you.
>
> In particular the bug is triggered by the patch named:
> "writeback: introduce writeback_control.more_io to indicate more io"
> That patch is meant to speed up writeback, but unfortunately its
> aggressiveness has exposed bugs in reiserfs, jfs and now ext2.
>
> Linus, given the number of bugs it triggered, I'd recommend reverting
> this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> push it back to the -mm tree for more testing?

Fengguang,

I'd like to better understand where your writeback work stands
relative to 2.6.24-rcX and -mm.  To be clear, your changes in
2.6.24-rc7 have been benchmarked at a ~33% sequential write
performance improvement with ext3 (as compared to 2.6.22; CFS could be
helping, etc., but still).  Very impressive!

Given this improvement, it is unfortunate to see your request to revert
2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, but it is understandable if
you're not confident in it for 2.6.24.

That said, you recently posted an -mm patchset that first reverts
2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
the "slow writes for concurrent large and small file writes" bug:
http://lkml.org/lkml/2008/1/15/132

For those interested in using your writeback improvements in
production sooner rather than later (primarily with ext3); what
recommendations do you have?  Just heavily test our own 2.6.24 + your
evolving "close, but not ready for merge" -mm writeback patchset?

regards,
Mike