Re: [Patch] Output of L1,L2 and L3 cache sizes to /proc/cpuinfo

2001-05-22 Thread Martin Knoblauch

Tomas Telensky wrote:
> 
> On 21 May 2001, H. Peter Anvin wrote:
> 
> > Followup to:  <[EMAIL PROTECTED]>
> > By author:"Martin.Knoblauch" <[EMAIL PROTECTED]>
> > In newsgroup: linux.dev.kernel
> > >
> > > Hi,
> > >
> > >  while trying to enhance a small hardware inventory script, I found that
> > > cpuinfo is missing the details of L1, L2 and L3 size, although they may
be available at boot time. One could of course grep them from "dmesg"
> > > output, but that may scroll away on long lived systems.
> > >
> >
> > Any particular reason this needs to be done in the kernel, as opposed
> 
> It is already done in the kernel, because it is being displayed :)
> So, once evaluated, why not give it to /proc/cpuinfo. I think it makes
> sense and keeps things in order.
> 

 That came to my mind as a pro argument also. The work is already done
in setup.c, so why not expose it at the same place where the other stuff
is. After all, it is just a more detailed output of the already
available "cache size" line.

Martin
PS: At least, I am not being ignored :-) No need for me to complain...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Patch] Output of L1,L2 and L3 cache sizes to /proc/cpuinfo

2001-05-22 Thread Martin Knoblauch

"H. Peter Anvin" wrote:
> 
> "Martin.Knoblauch" wrote:
> >
> >  After some checking, I could have made the answer a bit less terse:
> >
> > - it would require that the kernel is compiled with cpuid [module]
> > support
> >   - not everybody may want to enable this, just for getting one or two
> > harmless numbers.
> 
> If so, then that's their problem.  We're not here to solve the problem of
> stupid system administrators.
>

 They may not be stupid, just misled :-( When Intel created the "cpuid"
feature somewhere along the P3 line, they gave a stupid reason for it and
created a big public uproar. As silly as I think that was (on both
sides), the term "cpuid" is tainted. Some people just fear it like hell.
Anyway.
 
> > - you would need a utility with root permission to analyze the cpuid
> > info. The
> >   cache info does not seem to be there in clear ascii.
> 
> Bullsh*t.  /dev/cpu/%d/cpuid is supposed to be mode 444 (world readable.)
> 

 Thank you :-) In any case, on my system (SuSE 7.1) the files are mode
400.
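
 (Just a sketch of what such a utility would do, assuming the cpuid
driver's convention that the file offset selects the leaf and each read
returns EAX/EBX/ECX/EDX as 16 raw bytes - which is exactly why the cache
info is not there in "clear ascii", and why mode 400 vs. 444 matters:)

/*
 * Hedged sketch: fetch CPUID leaf 0x80000006 through /dev/cpu/0/cpuid
 * and decode the L2 size from ECX[31:16]. Needs the cpuid driver and
 * read permission on the device node.
 */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	uint32_t regs[4];	/* EAX, EBX, ECX, EDX */
	int fd = open("/dev/cpu/0/cpuid", O_RDONLY);

	if (fd < 0 || pread(fd, regs, sizeof(regs), 0x80000006) != sizeof(regs)) {
		perror("/dev/cpu/0/cpuid");
		return 1;
	}
	printf("L2 cache: %u KB\n", regs[2] >> 16);
	close(fd);
	return 0;
}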

Martin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Head parking (was: IBM HDAPS things are looking up)

2005-07-07 Thread Martin Knoblauch
On Thu, Jul 07 2005, Pekka Enberg wrote:
> Jens Axboe wrote:
> > > ATA7 defines a park maneuvre, I don't know how well supported it is
> > > yet though. You can test with this little app, if it says 'head
> > > parked' it works. If not, it has just idled the drive.
>
> On 7/7/05, Lenz Grimmer <[EMAIL PROTECTED]> wrote:
> > Great! Thanks for digging this up - it works on my T42, using a Fujitsu
> > MHT2080AH drive:
>
> Works on my T42p which uses a Hitachi HTS726060M9AT00 drive. I don't
> hear any sound, though.

 Interesting. Same notebook, same drive. The program says "not parked"
:-( This is on FC2 with a pretty much vanilla 2.6.9 kernel.

[EMAIL PROTECTED] tmp]# uname -a
Linux l15833 2.6.9-noagp #1 Wed May 4 16:09:14 CEST 2005 i686 i686 i386
GNU/Linux
[EMAIL PROTECTED] tmp]# hdparm -i /dev/hda

/dev/hda:

 Model=HTS726060M9AT00, FwRev=MH4OA6BA, SerialNo=MRH403M4GS88XB
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=DualPortCache, BuffSize=7877kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117210240
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
 Drive conforms to: ATA/ATAPI-6 T13 1410D revision 3a:

 * signifies the current active mode

[EMAIL PROTECTED] tmp]# ./park /dev/hda
head not parked 4c
[EMAIL PROTECTED] tmp]#
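
 (For anyone wondering what the little test app does: roughly something
like the sketch below - a reconstruction, not Jens' actual program - it
issues the ATA-7 IDLE IMMEDIATE with UNLOAD FEATURE through the legacy
HDIO_DRIVE_TASK ioctl and checks whether the drive acknowledges the
unload in the LBA low register:)

/*
 * Hedged sketch of a head-park test. Assumes the old IDE HDIO_DRIVE_TASK
 * ioctl (7-byte taskfile: cmd, feature, nsect, lbal, lbam, lbah, device)
 * and the ATA-7 convention that a drive honouring the unload returns
 * 0xc4 in LBA low. Run as root, e.g. "./park /dev/hda".
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

int main(int argc, char **argv)
{
	unsigned char args[7] = {
		0xe1,	/* IDLE IMMEDIATE */
		0x44,	/* feature: UNLOAD */
		0x00,	/* sector count */
		0x4c,	/* LBA low  = 'L' */
		0x4e,	/* LBA mid  = 'N' */
		0x55,	/* LBA high = 'U' */
		0x00,	/* device register */
	};
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_NONBLOCK);
	if (fd < 0 || ioctl(fd, HDIO_DRIVE_TASK, args) < 0) {
		perror(argv[1]);
		return 1;
	}
	if (args[3] == 0xc4)	/* drive reported "heads unloaded" */
		printf("head parked\n");
	else
		printf("head not parked %02x\n", args[3]);
	close(fd);
	return 0;
}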

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Head parking (was: IBM HDAPS things are looking up)

2005-07-07 Thread Martin Knoblauch


--- Pekka Enberg <[EMAIL PROTECTED]> wrote:

> 
> Martin, don't trim the cc!
> 

 sorry about that, but I did not have the CC at the time of reply. I read
LKML from the archives and respond by cut and paste.

Martin


----------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Head parking (was: IBM HDAPS things are looking up)

2005-07-07 Thread Martin Knoblauch
--- Pekka Enberg <[EMAIL PROTECTED]> wrote:

> On 7/7/05, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> >  Interesting. Same notebook, same drive. The program says "not parked"
> > :-( This is on FC2 with a pretty much vanilla 2.6.9 kernel.
> > 
> > [EMAIL PROTECTED] tmp]# hdparm -i /dev/hda
> > 
> > /dev/hda:
> > 
> >  Model=HTS726060M9AT00, FwRev=MH4OA6BA, SerialNo=MRH403M4GS88XB
> 
> haji ~ # hdparm -i /dev/hda
> 
> /dev/hda:
> 
>  Model=HTS726060M9AT00, FwRev=MH4OA6DA, SerialNo=MRH453M4H2A6PB

 OK, different FW levels. After upgrading my disk to MH4OA6GA my head
parks :-) Minimum required level for this disk seems to be A6DA. Hope
this info is useful. 

[EMAIL PROTECTED] tmp]# hdparm -i /dev/hda

/dev/hda:

 Model=HTS726060M9AT00, FwRev=MH4OA6GA, SerialNo=MRH403M4GS88XB
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=DualPortCache, BuffSize=7877kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117210240
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
 Drive conforms to: ATA/ATAPI-6 T13 1410D revision 3a:

 * signifies the current active mode

[EMAIL PROTECTED] tmp]# ./park /dev/hda
head parked
[EMAIL PROTECTED] tmp]#


Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Head parking (was: IBM HDAPS things are looking up)

2005-07-07 Thread Martin Knoblauch
--- Alejandro Bonilla <[EMAIL PROTECTED]> wrote:

> 
> > --- Pekka Enberg <[EMAIL PROTECTED]> wrote:
> >
> > > On 7/7/05, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > > >  Interesting. Same notebook, same drive. The program says "not parked"
> > > > :-( This is on FC2 with a pretty much vanilla 2.6.9 kernel.
> > > >
> > > > [EMAIL PROTECTED] tmp]# hdparm -i /dev/hda
> > > >
> > > > /dev/hda:
> > > >
> > > >  Model=HTS726060M9AT00, FwRev=MH4OA6BA, SerialNo=MRH403M4GS88XB
> > >
> > > haji ~ # hdparm -i /dev/hda
> > >
> > > /dev/hda:
> > >
> > >  Model=HTS726060M9AT00, FwRev=MH4OA6DA, SerialNo=MRH453M4H2A6PB
> >
> >  OK, different FW levels. After upgrading my disk to MH4OA6GA my head
> > parks :-) Minimum required level for this disk seems to be A6DA. Hope
> > this info is useful.
> 
> Martin,
> 
>   Simply upgrading your firmware fixed your problem of not being able to
> park the head?
> 

 Yup. Do not forget that FW is very powerful. Likely the parking
feature was added after A6BA.

 Basically I saw that the only difference between me and Pekka was the
FW (discounting the different CPU speed and Kernel version). I googled
around and found the IBM FW page at:

http://www-306.ibm.com/pc/support/site.wss/document.do?sitestyle=ibm&lndocid=MIGR-41008

 Download is simple, just don't use the "IBM Download Manager". Main
problem is that one needs a bootable floppy drive and "the other OS" to
create a bootable floppy. It would be great if IBM could provide floppy
images for use with "dd" for the poor Linux users.

 Then I pondered over the risk involved with the update. Curiosity won
:-) And now the head parks. BUT - I definitely do not encourage anybody
to perform the procedure. Do at your own risk after thinking about the
possible consequences ...

 Anyway, someone reported a non-working HTS548040M9AT00 with FW
revision MG2OA53A. The newest revision, from the same floppy image, is
A5HA.

Cheers
Martin 

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Hdaps-devel] RE: Head parking (was: IBM HDAPS things are looking up)

2005-07-07 Thread Martin Knoblauch
--- Dave Hansen <[EMAIL PROTECTED]> wrote:

> On Thu, 2005-07-07 at 10:14 -0700, Martin Knoblauch wrote:
> >  Basically I saw that the only difference between me and Pekka was the
> > FW (discounting the different CPU speed and kernel version). I googled
> > around and found the IBM FW page at:
> > 
> > http://www-306.ibm.com/pc/support/site.wss/document.do?sitestyle=ibm&lndocid=MIGR-41008
> > 
> >  Download is simple, just don't use the "IBM Download Manager". Main
> > problem is that one needs a bootable floppy drive and "the other OS" to
> > create a bootable floppy. It would be great if IBM could provide floppy
> > images for use with "dd" for the poor Linux users.
> 
> Did you really need to make 18 diskettes?
>

 yikes - no !! :-) Somewhere on that page there is a table that tells
you which of the 18 floppies is for your disk. In my case it was #13.
 
> I have the feeling that this will work for many T4[012]p? users:
> 
> http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=TPAD-HDFIRM
> 

 Yeah, I think that is the "DA" version. You still need "the other OS",
although you don't need the floppy.

 If IBM would provide a CD image (bootable ISO) containing FW for all
supported drives - that would be great. No need for the "other OS" any
more.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Hdaps-devel] RE: Head parking (was: IBM HDAPS things are looking up)

2005-07-07 Thread Martin Knoblauch
--- Dave Hansen <[EMAIL PROTECTED]> wrote:

> 
> I have the feeling that this will work for many T4[012]p? users:
> 
> http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=TPAD-HDFIRM
> 

 Actually, I think your feeling is wrong. Looking at the readme.txt it
seems version 7.1 of the upgrade floppy has the "BA" firmware that I
had on my disk in the beginning (not parking the heads).

Cheers
Martin

------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Hdaps-devel] RE: Head parking (was: IBM HDAPS things are looking up)

2005-07-07 Thread Martin Knoblauch
--- Erik Mouw <[EMAIL PROTECTED]> wrote:

> On Thu, Jul 07, 2005 at 11:45:38AM -0700, Martin Knoblauch wrote:
> >  If IBM would provide a CD image (bootable ISO) containing FW for all
> > supported drives - that would be great. No need for the "other OS" any
> > more.
> 
> I can imagine IBM doesn't do that because in that way you can't update
> the firmware of the CD/DVD drive. Bootable FreeDOS floppy images would
> be a nice idea, though.
> 
> 

 now, this is getting off-topic. The CD image I proposed would be only
for the hard disks.

 Bootable DOS floppy images that one could just "dd" onto the floppy
would be great, because they eliminate the need for "the other OS", but
you still need a floppy drive. I am not sure how many Notebook owners
actually have one. The hardest part in my FW upgrade was actually
finding a drive in our company.

Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-18 Thread Martin Knoblauch

--- Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Fri, 18 Jan 2008, Mel Gorman wrote:
> > 
> > Right, and this is consistent with other complaints about the PFN
> > of the page mattering to some hardware.
> 
> I don't think it's actually the PFN per se.
> 
> I think it's simply that some controllers (quite probably affected by
> both  driver and hardware limits) have some subtle interactions with
> the size of  the IO commands.
> 
> For example, let's say that you have a controller that has some limit
> X on  the size of IO in flight (whether due to hardware or driver
> issues doesn't  really matter) in addition to a limit on the size
> of the scatter-gather  size. They all tend to have limits, and
> they differ.
> 
> Now, the PFN doesn't matter per se, but the allocation pattern
> definitely  matters for whether the IO's are physically
> contiguous, and thus matters  for the size of the scatter-gather
> thing.
> 
> Now, generally the rule-of-thumb is that you want big commands, so 
> physical merging is good for you, but I could well imagine that the
> IO  limits interact, and end up hurting each other. Let's say that a
> better  allocation order allows for bigger contiguous physical areas,
> and thus  fewer scatter-gather entries.
> 
> What does that result in? The obvious answer is
> 
>   "Better performance obviously, because the controller needs to do
> fewer scatter-gather lookups, and the requests are bigger, because
> there are fewer IO's that hit scatter-gather limits!"
> 
> Agreed?
> 
> Except maybe the *real* answer for some controllers end up being
> 
>   "Worse performance, because individual commands grow because they
> don't  hit the per-command limits, but now we hit the global
> size-in-flight limits and have many fewer of these good commands in
> flight. And while the commands are larger, it means that there
> are fewer outstanding commands, which can mean that the disk
> cannot schedule things as well, or makes high latency of command
> generation by the controller much more visible because there aren't
> enough concurrent requests queued up to hide it"
> 
> Is this the reason? I have no idea. But somebody who knows the
> AACRAID hardware and driver limits might think about interactions
> like that. Sometimes you actually might want to have smaller 
> individual commands if there is some other limit that means that
> it can be more advantageous to have many small requests over a
> few big ones.
> 
> RAID might well make it worse. Maybe small requests work better
> because they are simpler to schedule because they only hit one
> disk (eg if you have simple striping)! So that's another reason
> why one *large* request may actually be slower than two requests
> half the size, even if it's against the "normal rule".
> 
> And it may be that that AACRAID box takes a big hit on DIO
> exactly because DIO has been optimized almost purely for making
> one command as big as possible.
> 
> Just a theory.
> 
>   Linus

 just to make one thing clear - I am not so much concerned about the
performance of AACRAID. It is OK with or without Mel's patch. It is
better with Mel's patch. The regression in DIO compared to 2.6.19.2 is
completely independent of Mel's stuff.

 What interests me much more is the behaviour of the CCISS+LVM based
system. Here I see a huge benefit of reverting Mel's patch.

 I dirtied the system after reboot as Mel suggested (24 parallel kernel
build) and repeated the tests. The dirtying did not make any
difference. Here are the results:

Test      -rc8      -rc8-without-Mels-Patch
dd1       57        94
dd1-dir   87        86
dd2       2x8.5     2x45
dd2-dir   2x43      2x43
dd3       3x7       3x30
dd3-dir   3x28.5    3x28.5
mix3      59,2x25   98,2x24

 The big IO size with Mel's patch really has a devastating effect on
the parallel write. Nowhere near the value one would expect, while the
numbers are perfect without Mel's patch as in rc1-rc5. Too bad I did not
see this earlier. Maybe we could have found a solution for .24.

 At least, rc1-rc5 have shown that the CCISS system can do well. Now
the question is which part of the system does not cope well with the
larger IO sizes? Is it the CCISS controller, LVM, or both? I am open to
suggestions on how to debug that. 
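
 (One thing I can at least compare quickly are the request-size caps the
block layer exposes for the two setups - a throwaway sketch, where the
sysfs paths are just placeholders for my actual cciss/aacraid devices:)

/*
 * Hedged sketch: print the per-queue request size limits from sysfs so
 * the CCISS and AACRAID boxes can be compared. Device names below are
 * placeholders, adjust as needed.
 */
#include <stdio.h>

static void show(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%-50s %s", path, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	show("/sys/block/cciss!c0d0/queue/max_sectors_kb");
	show("/sys/block/cciss!c0d0/queue/max_hw_sectors_kb");
	show("/sys/block/sda/queue/max_sectors_kb");
	show("/sys/block/sda/queue/max_hw_sectors_kb");
	return 0;
}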

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] writeback: speed up writeback of big dirty files

2008-01-19 Thread Martin Knoblauch
- Original Message 
> From: Fengguang Wu <[EMAIL PROTECTED]>
> To: Linus Torvalds <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Martin Knoblauch <[EMAIL PROTECTED]>; 
> Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>
> Sent: Thursday, January 17, 2008 6:28:18 AM
> Subject: [PATCH] writeback: speed up writeback of big dirty files
> 
> On Jan 16, 2008 9:15 AM, Martin Knoblauch wrote:
> > Fengguang's latest writeback patch applies cleanly, builds, boots on
> > 2.6.24-rc8.
> 
> Linus, if possible, I'd suggest this patch be merged for 2.6.24.
> 
> It's a safer version of the reverted patch. It was tested on
> ext2/ext3/jfs/xfs/reiserfs and won't 100% iowait even without the
> other bug fixing patches.
> 
> Fengguang
> ---
> 
> writeback: speed up writeback of big dirty files
> 
> After making dirty a 100M file, the normal behavior is to
> start the writeback for all data after 30s delays. But
> sometimes the following happens instead:
> 
> - after 30s:    ~4M
> - after 5s:     ~4M
> - after 5s:     all remaining 92M
> 
> Some analysis shows that the internal io dispatch queues go like this:
> 
>            s_io            s_more_io
>            -------------------------
>     1)     100M,1K         0
>     2)     1K              96M
>     3)     0               96M
> 
> 1) initial state with a 100M file and a 1K file
> 2) 4M written, nr_to_write <= 0, so write more
> 3) 1K written, nr_to_write > 0, no more writes (BUG)
> nr_to_write > 0 in (3) fools the upper layer to think that data have all
> been written out. The big dirty file is actually still sitting in
> s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io
> becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
> may starve newly expired inodes in s_dirty. It is also not an option to
> draw inodes from both s_more_io and s_dirty, and let the loop go on: this
> might lead to livelocks, and might also starve other superblocks in sync
> time (well, kupdate may still starve some superblocks, that's another bug).
> 
> We have to return when a full scan of s_io completes. So nr_to_write > 0
> does not necessarily mean that "all data are written". This patch
> introduces a flag writeback_control.more_io to indicate that more io should
> be done. With it the big dirty file no longer has to wait for the next
> kupdate invocation 5s later.
> 
> In sync_sb_inodes() we only set more_io on super_blocks we actually
> visited. This avoids the interaction between two pdflush daemons.
> 
> Also in __sync_single_inode() we don't blindly keep requeuing the io if the
> filesystem cannot progress. Failing to do so may lead to 100% iowait.
> 
> Tested-by: Mike Snitzer 
> Signed-off-by: Fengguang Wu 
> ---
>  fs/fs-writeback.c         |   18 ++++++++++++++++--
>  include/linux/writeback.h |    1 +
>  mm/page-writeback.c       |    9 ++++++---
>  3 files changed, 23 insertions(+), 5 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -284,7 +284,17 @@ __sync_single_inode(struct inode *inode,
>  				 * soon as the queue becomes uncongested.
>  				 */
>  				inode->i_state |= I_DIRTY_PAGES;
> -				requeue_io(inode);
> +				if (wbc->nr_to_write <= 0) {
> +					/*
> +					 * slice used up: queue for next turn
> +					 */
> +					requeue_io(inode);
> +				} else {
> +					/*
> +					 * somehow blocked: retry later
> +					 */
> +					redirty_tail(inode);
> +				}
>  			} else {
>  				/*
>  				 * Otherwise fully redirty the inode so that
> @@ -479,8 +489,12 @@ sync_sb_inodes(struct super_block *sb, s
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&inode_lock);
> -		if (wbc->nr_to_write <= 0)
> +		if (wbc->nr_to_write <= 0) {
> +			wbc->more_io = 1;
>  			break;
> +		}
> +		if (!list_empty(&sb->s_more_io))
> +			wbc->more_io = 1;
>  	}
>  	return;		/* Leave any unwritten inodes on s_io */
>  }
> --- linux.orig/include/linux/writeback.h

Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-19 Thread Martin Knoblauch
- Original Message 
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Linus Torvalds <[EMAIL PROTECTED]>
> Cc: Mel Gorman <[EMAIL PROTECTED]>; Martin Knoblauch <[EMAIL PROTECTED]>; 
> Fengguang Wu <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL 
> PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; 
> "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
> Sent: Friday, January 18, 2008 11:47:02 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> > I can fire up 2.6.24-rc8 in short order to see if things are vastly
> > improved (as Martin seems to indicate that he is happy with
> > AACRAID on 2.6.24-rc8).  Although even Martin's AACRAID
> > numbers from 2.6.19.2 are still quite good (relative to mine).  Martin can you share any tuning
> > you may have done to get AACRAID to where it is for you right now?
Mike,

 I have always been happy with the AACRAID box compared to the CCISS system. 
Even with the "regression" in 2.6.24-rc1..rc5 it was more than acceptable to 
me. For me the differences between 2.6.19  and 2.6.24-rc8 on the AACRAID setup 
are:

- 11% (single stream) to 25% (dual/triple stream) regression in DIO. Something 
I do not care much about. I just measure it for reference.
+ the very nice behaviour when writing to different targets (mix3), which I 
attribute to Peter's per-bdi stuff.

 And until -rc6 I was extremely pleased with the cool speedup I saw on my CCISS 
boxes. This would have been the next "production" kernel for me. But let's 
discuss this under a separate topic. It has nothing to do with the original 
wait-io issue.

 Oh, before I forget. There has been no tuning for the AACRAID. The system is 
an IBM x3650 with built-in AACRAID and battery-backed write cache. The disks 
are 6x142GB/15krpm in a RAID5 setup. I see one big difference between your and 
my tests. I do 1MB writes to simulate the behaviour of the real applications, 
while yours seem to be much smaller.
 
Cheers
Martin



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-22 Thread Martin Knoblauch

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de

- Original Message 
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>; 
> Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; [EMAIL PROTECTED]; Jens Axboe <[EMAIL PROTECTED]>; Milan Broz 
> <[EMAIL PROTECTED]>; Neil Brown <[EMAIL PROTECTED]>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Fri, Jan 18, 2008 at 11:01:11AM -0800, Martin Knoblauch wrote:
> >  At least, rc1-rc5 have shown that the CCISS system can do well. Now
> > the question is which part of the system does not cope well with the
> > larger IO sizes? Is it the CCISS controller, LVM, or both? I am open to
> > suggestions on how to debug that. 
> 
> What is your LVM device configuration?
>   E.g. 'dmsetup table' and 'dmsetup info -c' output.
> Some configurations lead to large IOs getting split up on the way through
> device-mapper.
>
Hi Alasdair,

 here is the output, the filesystem in question is on LogVol02:

  [EMAIL PROTECTED] ~]# dmsetup table
VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
VolGroup00-LogVol00: 0 67108864 linear 104:2 384
[EMAIL PROTECTED] ~]# dmsetup info -c
Name                Maj Min Stat Open Targ Event  UUID
VolGroup00-LogVol02 253   1 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmOZ4OzOgGQIdF3qDx6fJmlZukXXLIy39R
VolGroup00-LogVol01 253   2 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4Ogmfn2CcAd2Fh7i48twe8PZc2XK5bSOe1Fq
VolGroup00-LogVol00 253   0 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmfYjxQKFP3zw2fGsezJN7ypSrfmP7oSvE

> See if these patches make any difference:
> 
>   http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
> 
> dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
> dm-introduce-merge_bvec_fn.patch
> dm-linear-add-merge.patch
> dm-table-remove-merge_bvec-sector-restriction.patch
>  

 thanks for the suggestion. Are they supposed to apply to mainline?

Cheers
Martin



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-22 Thread Martin Knoblauch
- Original Message 
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>; 
> Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; [EMAIL PROTECTED]; Jens Axboe <[EMAIL PROTECTED]>; Milan Broz 
> <[EMAIL PROTECTED]>; Neil Brown <[EMAIL PROTECTED]>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> 
> See if these patches make any difference:
> 
>   http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
> 
> dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
> dm-introduce-merge_bvec_fn.patch
> dm-linear-add-merge.patch
> dm-table-remove-merge_bvec-sector-restriction.patch
>  


 nope. Exactly the same poor results. To rule out LVM/DM I really have to see 
what happens if I set up a system with filesystems directly on partitions. Might 
take some time though.

Cheers
Martin



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-23 Thread Martin Knoblauch
- Original Message 
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>; 
> Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; [EMAIL PROTECTED]; Jens Axboe <[EMAIL PROTECTED]>; Milan Broz 
> <[EMAIL PROTECTED]>; Neil Brown <[EMAIL PROTECTED]>
> Sent: Wednesday, January 23, 2008 12:40:52 AM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Tue, Jan 22, 2008 at 07:25:15AM -0800, Martin Knoblauch wrote:
> >   [EMAIL PROTECTED] ~]# dmsetup table
> > VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
> > VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
> > VolGroup00-LogVol00: 0 67108864 linear 104:2 384
> 
> The IO should pass straight through simple linear targets like
> that without needing to get broken up, so I wouldn't expect those patches to
> make any difference in this particular case.
> 

Alasdair,

 LVM/DM are off the hook :-) I converted one box to using partitions directly and 
the performance is the same disappointment as with LVM/DM. Thanks anyway for 
looking at my problem.

 I will move the discussion now to a new thread, targeting CCISS directly. 

Cheers
Martin


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-03 Thread Martin Knoblauch

--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Mon, 2007-10-01 at 14:22 -0700, Andrew Morton wrote:
> 
> > nfs-remove-congestion_end.patch
> > lib-percpu_counter_add.patch
> > lib-percpu_counter_sub.patch
> > lib-percpu_counter-variable-batch.patch
> > lib-make-percpu_counter_add-take-s64.patch
> > lib-percpu_counter_set.patch
> > lib-percpu_counter_sum_positive.patch
> > lib-percpu_count_sum.patch
> > lib-percpu_counter_init-error-handling.patch
> > lib-percpu_counter_init_irq.patch
> > mm-bdi-init-hooks.patch
> > mm-scalable-bdi-statistics-counters.patch
> > mm-count-reclaimable-pages-per-bdi.patch
> > mm-count-writeback-pages-per-bdi.patch
> 
> This one:
> > mm-expose-bdi-statistics-in-sysfs.patch
> 
> > lib-floating-proportions.patch
> > mm-per-device-dirty-threshold.patch
> > mm-per-device-dirty-threshold-warning-fix.patch
> > mm-per-device-dirty-threshold-fix.patch
> > mm-dirty-balancing-for-tasks.patch
> > mm-dirty-balancing-for-tasks-warning-fix.patch
> 
> And, this one:
> > debug-sysfs-files-for-the-current-ratio-size-total.patch
> 
> 
> I'm not sure polluting /sys/block/<dev>/queue/ like that is The Right
> Thing. These patches sure were handy when debugging this, but not sure
> they want to move to mainline.
> 
> Maybe we want /sys/bdi/<bdi>/ or maybe /debug/bdi/<bdi>/
> 
> Opinions?
> 
Hi Peter,

 my only opinion is that it is great to see that stuff moving into
mainline. If it really goes in, there will be one more very interested
rc-tester :-)

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] sluggish writeback fixes

2007-10-03 Thread Martin Knoblauch

--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> Andrew,
> 
> The following patches fix the sluggish writeback behavior.
> They are well understood and well tested - but not yet widely tested.
> 
> The first patch reverts the debugging -mm only
> check_dirty_inode_list.patch -
> which is no longer necessary.
> 
> The following 4 patches do the real jobs:
> 
> [PATCH 2/5] writeback: fix time ordering of the per superblock inode
> lists 8
> [PATCH 3/5] writeback: fix ntfs with sb_has_dirty_inodes()
> [PATCH 4/5] writeback: remove pages_skipped accounting in
> __block_write_full_page()
> [PATCH 5/5] writeback: introduce writeback_control.more_io to
> indicate more io
> 
> They share the same goal as the following patches in -mm. Therefore
> I'd
> recommend to put the last 4 new ones after them:
> 
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
> writeback-fix-comment-use-helper-function.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
> writeback-fix-periodic-superblock-dirty-inode-flushing.patch
> 
> Regards,
> Fengguang
Hi Fengguang,

 now that Peter's stuff seems to make it into mainline, do you think
your fixes should go in as well? Would definitely help to broaden the
tester base. Definitely by one very interested tester :-)

Keep up the good work
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related

2007-12-28 Thread Martin Knoblauch
Hi,

 currently I am tracking down an "interesting" effect when writing to a 
Solaris-10/Sparc based server. The server exports two filesystems. One UFS, one 
VXFS. The filesystems are mounted NFS3/TCP, no special options. Linux kernel in 
question is 2.6.24-rc6, but it happens with earlier kernels (2.6.19.2, 
2.6.22.6) as well. The client is x86_64 with 8 GB of ram. 

 The problem: when writing to the VXFS based filesystem, performance drops 
dramatically when the filesize reaches or exceeds "dirty_ratio". For a 
dirty_ratio of 10% (about 800MB) files below 750 MB are transfered with about 
30 MB/sec. Anything above 770 MB drops down to below 10 MB/sec. If I perform 
the same tests on the UFS based FS, performance stays at about 30 MB/sec until 
3GB and likely larger (I just stopped at 3 GB).

 Any ideas what could cause this difference? Any suggestions on debugging it?

spsdm5:/lfs/test_ufs on /mnt/test_ufs type nfs 
(rw,proto=tcp,nfsvers=3,hard,intr,addr=160.50.118.37)
spsdm5:/lfs/test_vxfs on /mnt/test_vxfs type nfs 
(rw,proto=tcp,nfsvers=3,hard,intr,addr=160.50.118.37)

Cheers
Martin
PS: Please CC me, as I am not subscribed. Don't worry about the spamtrap name 
:-)

------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related

2007-12-29 Thread Martin Knoblauch
- Original Message 
> From: Chris Snook <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Sent: Friday, December 28, 2007 7:45:13 PM
> Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW 
> related
> 
> Martin Knoblauch wrote:
> > Hi,
> > 
> > currently I am tracking down an "interesting" effect when writing to a
> > Solaris-10/Sparc based server. The server exports two filesystems. One UFS,
> > one VXFS. The filesystems are mounted NFS3/TCP, no special options. Linux
> > kernel in question is 2.6.24-rc6, but it happens with earlier kernels
> > (2.6.19.2, 2.6.22.6) as well. The client is x86_64 with 8 GB of ram.
> > 
> > The problem: when writing to the VXFS based filesystem, performance drops
> > dramatically when the filesize reaches or exceeds "dirty_ratio". For a
> > dirty_ratio of 10% (about 800MB) files below 750 MB are transfered with
> > about 30 MB/sec. Anything above 770 MB drops down to below 10 MB/sec. If I
> > perform the same tests on the UFS based FS, performance stays at about 30
> > MB/sec until 3GB and likely larger (I just stopped at 3 GB).
> > 
> > Any ideas what could cause this difference? Any suggestions on debugging it?
> 
> 1) Try normal NFS tuning, such as rsize/wsize tuning.
>

  rsize/wsize only have minimal effect. The negotiated  size seems to be 
optimal.

> 2) You're entering synchronous writeback mode, so you can delay the problem
> by raising dirty_ratio to 100, or reduce the size of the problem by
> lowering dirty_ratio to 1.  Either one could help.
> 

 For experiments, sure. But I do not think that I want to have 8 GB of dirty 
pages [potentially] lying around. Are you sure that 1% is a useful value for 
dirty_ratio? Looking at the code, it seems a minimum of 5% is enforced in 
"page-writeback.c:get_dirty_limits":

	dirty_ratio = vm_dirty_ratio;
	if (dirty_ratio > unmapped_ratio / 2)
		dirty_ratio = unmapped_ratio / 2;

	if (dirty_ratio < 5)
		dirty_ratio = 5;
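
 In other words, with vm.dirty_ratio set to 1 the arithmetic would still end
up at the 5% floor. A toy calculation along the lines of the quoted code (the
memory and unmapped numbers are just assumptions for an 8 GB client like mine):

/*
 * Hedged sketch of the clamp in get_dirty_limits(): even if vm_dirty_ratio
 * is set to 1, the effective ratio is forced back up to 5%.
 */
#include <stdio.h>

int main(void)
{
	long total_mb = 8192;		/* assumption: ~8 GB of usable memory */
	int vm_dirty_ratio = 1;		/* what the sysctl would be set to */
	int unmapped_ratio = 90;	/* assumption: most memory unmapped */
	int dirty_ratio = vm_dirty_ratio;

	if (dirty_ratio > unmapped_ratio / 2)
		dirty_ratio = unmapped_ratio / 2;
	if (dirty_ratio < 5)
		dirty_ratio = 5;

	printf("effective dirty limit: %d%% = ~%ld MB\n",
	       dirty_ratio, total_mb * dirty_ratio / 100);
	return 0;
}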


> 3) It sounds like the bottleneck is the vxfs filesystem.  It only *appears*
> on the client side because writes up until dirty_ratio get buffered on
> the client. 

 Sure, the fact that a UFS (or SAM-FS) based FS behaves well in the same 
situation points in that direction.

>   If you can confirm that the server is actually writing stuff to disk
> slower when the client is in writeback mode, then it's possible the Linux
> NFS client is doing something inefficient in writeback mode.
> 

 I will try to get an iostat trace from the Sun side. Thanks for the suggestion.

Cheers
Martin
PS: Happy Year 2008 to all Kernel Hackers and their families



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related

2007-12-29 Thread Martin Knoblauch
- Original Message 
> From: Chris Snook <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Sent: Friday, December 28, 2007 7:45:13 PM
> Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW 
> related
> 
> Martin Knoblauch wrote:
> > Hi,
> > 
> > currently I am tracking down an "interesting" effect when writing

> 3) It sounds like the bottleneck is the vxfs filesystem.  It
> only *appears* on  the client side because writes up until dirty_ratio
> get buffered on the client. 
>   If you can confirm that the server is actually writing stuff to
> disk slower  when the client is in writeback mode, then it's possible
> the Linux NFSclient is  doing something inefficient in writeback mode.
> 

So, below is the output of "iostat -d -l 1 md111" during two runs. The first run is 
with 750 MB, the second with 850MB.

// 750MB
$ iostat -d -l 1 md111 2
   md111
kps tps serv
 22   0   14
  0   00
  0   0   13
29347 468   12
37040 593   17
30938 492   25
30421 491   25
41626 676   16
42913 703   14
39890 647   15
9009 1417
8963 1417
5143  817
34814 547   10
49323 775   12
28624 4516
 22   16
 finish
  0   00
  0   00

 Here it seems that the disk is writing for 26-28 seconds with avg. 29 MB/sec. 
Fine.

// 850MB
$ iostat -d -l 1 md111 2
   md111
kps tps serv
  0   00
11275 180   10
39874 635   14
37403 587   17
24341 392   30
25989 423   26
22464 375   30
21922 361   32
27924 450   26
21507 342   21
9217 153   15
9260 150   15
9544 155   15
9298 150   14
10118 162   11
15505 250   12
27513 448   14
26698 436   15
26144 431   15
25201 412   14
 38 seconds in run
  0   00
  0   00
579  17   12
  0   00
  0   00
  0   00
  0   00
518   9   16
485   86
  9   17
514   97
  0   00
  0   00
541   98
532  106
  0   00
  0   00
650  127
  0   00
242   89
1023  185
304   56
418   87
283   55
303   58
527  106
  0   00
  0   00
  0   00
  5   1   13
  0   00
  0   00
  0   00
  0   00
  0   00
  0   0   11
  0   00
  0   00
  0   00
  1   0   15
  0   00
 96   2   15
138   3   10
11057 1756
17549 2806
351   85
  0   00
# 218 seconds in run, finish.

 So, for the first 38 seconds everything looks similar to the 750 MB case. For 
the next roughly 180 seconds, most of the time nothing happens. Averaging 4.1 MB/sec.

Maybe it is time to capture the traffic. What are the best tcpdump parameters 
for NFS? I always forget :-(

Cheers
Martin



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Stack warning from 2.6.24-rc

2007-12-04 Thread Martin Knoblauch
- Original Message 
> From: Ingo Molnar <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org
> Sent: Tuesday, December 4, 2007 12:52:23 PM
> Subject: Re: Stack warning from 2.6.24-rc
> 
> 
> * Martin Knoblauch  wrote:
> 
> >  I see the following stack warning(s) on a IBM x3650 (2xDual-Core, 8 
> >  GB, AACRAID with 6x146GB RAID5) running 2.6.24-rc3/rc4:
> > 
> > [  180.739846] mount.nfs used greatest stack depth: 3192 bytes left
> > [  666.121007] bash used greatest stack depth: 3160 bytes left
> > 
> >  Nothing bad has happened so far. The message does not show on a 
> >  similarly configured HP/DL-380g4 (CCISS instead of AACRAID) running 
> >  rc3. Anything to worry? Anything I can do to help debugging?
> 
> those are generated by:
> 
>   CONFIG_DEBUG_STACKOVERFLOW=y
>   CONFIG_DEBUG_STACK_USAGE=y
> 
> and look quite harmless. If they were much closer to zero it would be
> a problem.
> 
> Ingo
> 

 OK, I will ignore it then. I was just surprised to see it.

Thanks
Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


What is the unit of "nr_writeback"?

2007-12-04 Thread Martin Knoblauch
Hi,

 forgive the stupid question. What is the unit of "nr_writeback"? One would 
usually assume a rate, but looking at the code I see it added together with 
nr_dirty and nr_unstable, somehow defeating the assumption.
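
 If it is indeed a plain page count, something like the little sketch below
(assuming the usual "name value" format of /proc/vmstat) would print it in
more familiar units:

/*
 * Hedged sketch: read nr_writeback from /proc/vmstat and convert the
 * page count into kilobytes.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char key[64];
	unsigned long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fscanf(f, "%63s %lu", key, &val) == 2) {
		if (!strcmp(key, "nr_writeback"))
			printf("nr_writeback: %lu pages (~%lu KB)\n",
			       val, val * getpagesize() / 1024);
	}
	fclose(f);
	return 0;
}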

Cheers
Martin
------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/11] writeback bug fixes and simplifications

2008-01-11 Thread Martin Knoblauch
- Original Message 
> From: WU Fengguang <[EMAIL PROTECTED]>
> To: Hans-Peter Jansen <[EMAIL PROTECTED]>
> Cc: Sascha Warner <[EMAIL PROTECTED]>; Andrew Morton <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; Peter Zijlstra <[EMAIL PROTECTED]>
> Sent: Wednesday, January 9, 2008 4:33:32 AM
> Subject: Re: [PATCH 00/11] writeback bug fixes and simplifications
> 
> On Sat, Dec 29, 2007 at 03:56:59PM +0100, Hans-Peter Jansen wrote:
> > On Friday, 28 December 2007, Sascha Warner wrote:
> > > Andrew Morton wrote:
> > > > On Thu, 27 Dec 2007 23:08:40 +0100 Sascha Warner wrote:
> > > >> Hi,
> > > >>
> > > >> I applied your patches to 2.6.24-rc6-mm1, but now I am faced with one
> > > >> pdflush often using 100% CPU for a long time. There seem to be some
> > > >> rare pauses from its 100% usage, however.
> > > >>
> > > >> On ~23 minutes uptime i have ~19 minutes pdflush runtime.
> > > >>
> > > >> This is on E6600, x86_64, 2 Gig RAM, SATA HDD, running on gentoo
> > > >> ~x64_64
> > > >>
> > > >> Let me know if you need more info.
> > > >
> > > > (some) cc's restored.  Please, always do reply-to-all.
> > >
> > > Hi Wu,
> > 
> > Sascha, if you want to address Fengguang by his first name, note that
> > chinese and bavarians (and some others I forgot now, too) typically use the
> > order:
> >   
> >   lastname firstname 
> > 
> > when they spell their names. Another evidence is that the name Wu is a
> > pretty common chinese family name.
> > 
> > Fengguang, if it's the other way around, correct me please (and I'm going to
> > wear a big brown paper bag for the rest of the day..). 
> 
> You are right. We normally do "Fengguang" or "Mr. Wu" :-)
> For LKML the first name is less ambiguous.
> 
> Thanks,
> Fengguang
> 

 Just cannot resist. Hans-Peter mentions Bavarians using Lastname-Givenname as 
well. This is only true in a folklore context (or when you are very deep in the 
countryside). Officially the Bavarians use the usual German Givenname/Lastname 
order. Although they will never admit to being Germans, of course :-)


Cheers
Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related

2008-01-14 Thread Martin Knoblauch
- Original Message 
> From: Martin Knoblauch <[EMAIL PROTECTED]>
> To: Chris Snook <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; spam trap <[EMAIL 
> PROTECTED]>
> Sent: Saturday, December 29, 2007 12:11:08 PM
> Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW 
> related
> 
> - Original Message 
> > From: Chris Snook 
> > To: Martin Knoblauch 
> > Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> > Sent: Friday, December 28, 2007 7:45:13 PM
> > Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW 
> > related
> > 
> > Martin Knoblauch wrote:
> > > Hi,
> > > 
> > > currently I am tracking down an "interesting" effect when writing
> 
> > 3) It sounds like the bottleneck is the vxfs filesystem.  It
> > only *appears* on the client side because writes up until dirty_ratio
> > get buffered on the client. 
> >   If you can confirm that the server is actually writing stuff to
> > disk slower when the client is in writeback mode, then it's possible
> > the Linux NFS client is doing something inefficient in writeback mode.
> > 
> 
> So, below is the output of "iostat -d -l 1 md111" during two runs. The first
> run is with 750 MB, the second with 850MB.
> 
> // 750MB
> $ iostat -d -l 1 md111 2
>md111
> kps tps serv
>  22   0   14
>   0   00
>   0   0   13
> 29347 468   12
> 37040 593   17
> 30938 492   25
> 30421 491   25
> 41626 676   16
> 42913 703   14
> 39890 647   15
> 9009 1417
> 8963 1417
> 5143  817
> 34814 547   10
> 49323 775   12
> 28624 4516
>  22   16
>  finish
>   0   00
>   0   00
> 
>  Here it seems that the disk is writing for 26-28 seconds with avg. 29
> MB/sec. Fine.
> 
> // 850MB
> $ iostat -d -l 1 md111 2
>md111
> kps tps serv
>   0   00
> 11275 180   10
> 39874 635   14
> 37403 587   17
> 24341 392   30
> 25989 423   26
> 22464 375   30
> 21922 361   32
> 27924 450   26
> 21507 342   21
> 9217 153   15
> 9260 150   15
> 9544 155   15
> 9298 150   14
> 10118 162   11
> 15505 250   12
> 27513 448   14
> 26698 436   15
> 26144 431   15
> 25201 412   14
>  38 seconds in run
>   0   00
>   0   00
> 579  17   12
>   0   00
>   0   00
>   0   00
>   0   00
> 518   9   16
> 485   86
>   9   17
> 514   97
>   0   00
>   0   00
> 541   98
> 532  106
>   0   00
>   0   00
> 650  127
>   0   00
> 242   89
> 1023  185
> 304   56
> 418   87
> 283   55
> 303   58
> 527  106
>   0   00
>   0   00
>   0   00
>   5   1   13
>   0   00
>   0   00
>   0   00
>   0   00
>   0   00
>   0   0   11
>   0   00
>   0   00
>   0   00
>   1   0   15
>   0   00
>  96   2   15
> 138   3   10
> 11057 1756
> 17549 2806
> 351   85
>   0   00
> # 218 seconds in run, finish.
> 
>  So, for the first 38 seconds everything looks similar to the 750
> MB case. For the next about 180 seconds most time nothing happens.
> Averaging 4.1 MB/sec.
> 
> Maybe it is time to capture the traffic. What are the best
> tcpdump parameters for NFS? I always forget :-(
> 
> Cheers
> Martin
> 
> 
Hi,

 now that the seasonal festivities are over - Happy New Year btw. - any 
comments/suggestions on my problem?

Cheers
Martin



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-16 Thread Martin Knoblauch
- Original Message 
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Fengguang Wu <[EMAIL PROTECTED]>
> Cc: Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar 
> <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" 
> <[EMAIL PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; Andrew Morton 
> <[EMAIL PROTECTED]>
> Sent: Tuesday, January 15, 2008 10:13:22 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Jan 14, 2008 7:50 AM, Fengguang Wu  wrote:
> > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> > >
> > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > > >
> > > > > Joerg, this patch fixed the bug for me :-)
> > > >
> > > > Fengguang, congratulations, I can confirm that your patch fixed the
> > > > bug! With previous kernels the bug showed up after each reboot. Now,
> > > > when booting the patched kernel everything is fine and there is no
> > > > longer any suspicious iowait!
> > > >
> > > > Do you have an idea why this problem appeared in 2.6.24? Did somebody
> > > > change the ext2 code or is it related to the changes in the scheduler?
> > >
> > > It was Fengguang who changed the inode writeback code, and I guess the
> > > new and improved code was less able to deal with these funny corner
> > > cases. But he has been very good in tracking them down and solving them,
> > > kudos to him for that work!
> >
> > Thank you.
> >
> > In particular the bug is triggered by the patch named:
> > "writeback: introduce writeback_control.more_io to indicate more io"
> > That patch means to speed up writeback, but unfortunately its
> > aggressiveness has disclosed bugs in reiserfs, jfs and now ext2.
> >
> > Linus, given the number of bugs it triggered, I'd recommend reverting
> > this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> > push it back to the -mm tree for more testing?
> 
> Fengguang,
> 
> I'd like to better understand where your writeback work stands
> relative to 2.6.24-rcX and -mm.  To be clear, your changes in
> 2.6.24-rc7 have been benchmarked to provide a ~33% sequential write
> performance improvement with ext3 (as compared to 2.6.22, CFS could be
> helping, etc but...).  Very impressive!
> 
> Given this improvement it is unfortunate to see your request to revert
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b but it is understandable if
> you're not confident in it for 2.6.24.
> 
> That said, you recently posted an -mm patchset that first reverts
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
> the "slow writes for concurrent large and small file writes" bug:
> http://lkml.org/lkml/2008/1/15/132
> 
> For those interested in using your writeback improvements in
> production sooner rather than later (primarily with ext3); what
> recommendations do you have?  Just heavily test our own 2.6.24 + your
> evolving "close, but not ready for merge" -mm writeback patchset?
> 
Hi Fengguang, Mike,

 I can add myself to Mike's question. It would be good to know a "roadmap" for 
the writeback changes. Testing 2.6.24-rcX so far has been showing quite nice 
improvement of the overall writeback situation and it would be sad to see this 
[partially] gone in 2.6.24-final. Linus apparently has already reverted 
"...2250b". I will definitely repeat my tests with -rc8 and report.

 Cheers
Martin




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-16 Thread Martin Knoblauch
- Original Message 
> From: Fengguang Wu <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus 
> Torvalds <[EMAIL PROTECTED]>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have?  Just heavily test our own 2.6.24 + your
> > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > 
> > Hi Fengguang, Mike,
> > 
> >  I can add myself to Mike's question. It would be good to know a
> > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > showing quite nice improvement of the overall writeback situation and it
> > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > apparently has already reverted "...2250b". I will definitely repeat my
> > tests with -rc8 and report.
> 
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
> 

 Will do tomorrow or Friday. Actually a patch against -rc8 would be nicer for 
me, as I have not looked at -rc7 due to holidays and some of the reported 
problems with it.

Cheers
Martin

> Fengguang
> ---
>  fs/fs-writeback.c         |   17 +++++++++++++++--
>  include/linux/writeback.h |    1 +
>  mm/page-writeback.c       |    9 ++++++---
>  3 files changed, 22 insertions(+), 5 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
>  				 * soon as the queue becomes uncongested.
>  				 */
>  				inode->i_state |= I_DIRTY_PAGES;
> -				requeue_io(inode);
> +				if (wbc->nr_to_write <= 0)
> +					/*
> +					 * slice used up: queue for next turn
> +					 */
> +					requeue_io(inode);
> +				else
> +					/*
> +					 * somehow blocked: retry later
> +					 */
> +					redirty_tail(inode);
>  			} else {
>  				/*
>  				 * Otherwise fully redirty the inode so that
> @@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&inode_lock);
> -		if (wbc->nr_to_write <= 0)
> +		if (wbc->nr_to_write <= 0) {
> +			wbc->more_io = 1;
>  			break;
> +		}
> +		if (!list_empty(&sb->s_more_io))
> +			wbc->more_io = 1;
>  	}
>  	return;		/* Leave any unwritten inodes on s_io */
>  }
> --- linux.orig/include/linux/writeback.h
> +++ linux/include/linux/writeback.h
> @@ -62,6 +62,7 @@ struct writeback_control {
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned for_writepages:1;	/* This is a writepages() call */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
> +	unsigned more_io:1;		/* more io to be dispatched */
>  };
>  
>  /*
> --- linux.orig/mm/page-writeback.c
> +++ linux/mm/page-writeback.c
> @@ -558,6 +558,7 @@ static void background_writeout(unsigned
>  			global_page_state(NR_UNSTABLE_NFS) < background_thresh
>  				&& min_pages <= 0)
>  			break;
> +		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		wbc.pages_skipped = 0;
> @@ -565,8 +566,9 @@ static void background_writeout(unsigned
>  		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>  		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
>  			/* Wrote less than expected */
> -			congestion_wait(WRITE, HZ/10);
> -			if (!wbc.encountered_congestion)
> +			if (wbc.encountered_congestion || wbc.more_io)
> +				congestion_wait(WRITE, HZ/10);
> +			else
>  				break;
>  		}
>  	}
> @@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
>  			global_page_state(N

Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Fengguang Wu <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus 
> Torvalds <[EMAIL PROTECTED]>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have?  Just heavily test our own 2.6.24 + your
> > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > 
> > Hi Fengguang, Mike,
> > 
> >  I can add myself to Mikes question. It would be good to know a
> > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > showing quite nice improvement of the overall writeback situation and
> > it would be sad to see this [partially] gone in 2.6.24-final. Linus
> > apparently already has reverted "...2250b". I will definitely repeat my
> > tests with -rc8 and report.
> 
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
> 
Hi Fengguang,

 something really bad has happened between -rc3 and -rc6. Embarrassingly I did 
not catch that earlier :-(

 Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 
is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 
2.6.24. The only test that is still good is mix3, which I attribute to the 
per-BDI stuff.

 At the moment I am frantically trying to find when things went down. I did run 
-rc8 and rc8+yourpatch. No difference to what I see with -rc6. Sorry that I 
cannot provide any input to your patch.

Depressed
Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Martin Knoblauch <[EMAIL PROTECTED]>
> To: Fengguang Wu <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus 
> Torvalds <[EMAIL PROTECTED]>
> Sent: Thursday, January 17, 2008 2:52:58 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> - Original Message 
> > From: Fengguang Wu 
> > To: Martin Knoblauch 
> > Cc: Mike Snitzer; Peter Zijlstra; [EMAIL PROTECTED]; Ingo Molnar;
> > linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]"; Linus Torvalds
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> > 
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have?  Just heavily test our own 2.6.24 + your
> > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > > 
> > > Hi Fengguang, Mike,
> > > 
> > >  I can add myself to Mikes question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > > showing quite nice improvement of the overall writeback situation and
> > > it would be sad to see this [partially] gone in 2.6.24-final. Linus
> > > apparently already has reverted "...2250b". I will definitely repeat my
> > > tests with -rc8 and report.
> > 
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> > 
> Hi Fengguang,
> 
>  something really bad has happened between -rc3 and -rc6. Embarrassingly
> I did not catch that earlier :-(
> 
>  Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 ,
> dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as
> in pre 2.6.24. The only test that is still good is mix3, which I attribute
> to the per-BDI stuff.
> 
>  At the moment I am frantically trying to find when things went down. I
> did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6.
> Sorry that I cannot provide any input to your patch.
> 

 OK, the change happened between rc5 and rc6. Just following a gut feeling, I 
reverted

#commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
#Author: Mel Gorman <[EMAIL PROTECTED]>
#Date:   Mon Dec 17 16:20:05 2007 -0800
#
#mm: fix page allocation for larger I/O segments
#
#In some cases the IO subsystem is able to merge requests if the pages are
#adjacent in physical memory.  This was achieved in the allocator by having
#expand() return pages in physically contiguous order in situations were a
#large buddy was split.  However, list-based anti-fragmentation changed the
#order pages were returned in to avoid searching in buffered_rmqueue() for a
#page of the appropriate migrate type.
#
#This patch restores behaviour of rmqueue_bulk() preserving the physical
#order of pages returned by the allocator without incurring increased search
#costs for anti-fragmentation.
#
#Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
#Cc: James Bottomley <[EMAIL PROTECTED]>
#Cc: Jens Axboe <[EMAIL PROTECTED]>
#Cc: Mark Lord <[EMAIL PROTECTED]>
#Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
#Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c2007-12-21 04:14:11.305633890 +
+++ linux-2.6.24-rc6/mm/page_alloc.c2007-12-21 04:14:17.746985697 +
@@ -847,8 +847,19 @@
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
+
+   /*
+* Split buddy pages returned by expand() are received here
+* in physical page order. The page is added to the callers and
+* list and the list head then moves forward. From the callers
+* perspective, the linked list is ordered by page number in
+* some conditions. This is useful for IO devices that can
+* merge IO requests if the physical p

Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus 
> Torvalds <[EMAIL PROTECTED]>
> Sent: Thursday, January 17, 2008 5:11:50 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> 
> I've backported Peter's perbdi patchset to 2.6.22.x.  I can share it
> with anyone who might be interested.
> 
> As expected, it has yielded 2.6.24-rcX level scaling.  Given the test
> result matrix you previously posted, 2.6.22.x+perbdi might give you
> what you're looking for (sans improved writeback that 2.6.24 was
> thought to be providing).  That is, much improved scaling with better
> O_DIRECT and network throughput.  Just a thought...
> 
> Unfortunately, my priorities (and computing resources) have shifted
> and I won't be able to thoroughly test Fengguang's new writeback patch
> on 2.6.24-rc8... whereby missing out on providing
> justification/testing to others on _some_ improved writeback being
> included in 2.6.24 final.
> 
> Not to mention the window for writeback improvement is all but closed
> considering the 2.6.24-rc8 announcement's 2.6.24 final release
> timetable.
> 
Mike,

 thanks for the offer, but the improved throughput is my #1 priority nowadays.
And while the better scaling for different targets is nothing to frown upon, the
much better scaling when writing to the same target would have been the big
winner for me.

 Anyway, I located the "offending" commit. Let's see what the experts say.


Cheers
Martin



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Mel Gorman <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
> Sent: Thursday, January 17, 2008 9:23:57 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On (17/01/08 09:44), Martin Knoblauch didst pronounce:
> > > > > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > > > > For those interested in using your writeback improvements in
> > > > > > > production sooner rather than later (primarily with ext3); what
> > > > > > > recommendations do you have?  Just heavily test our own 2.6.24 +
> > > > > > > your evolving "close, but not ready for merge" -mm writeback
> > > > > > > patchset?
> > > > > > 
> > > > >  I can add myself to Mikes question. It would be good to know a
> > > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > > > > been showing quite nice improvement of the overall writeback
> > > > > situation and it would be sad to see this [partially] gone in
> > > > > 2.6.24-final. Linus apparently already has reverted "...2250b". I
> > > > > will definitely repeat my tests with -rc8 and report.
> > > > > 
> > > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > > > Maybe we can push it to 2.6.24 after your testing.
> > > > 
> > > Hi Fengguang,
> > > 
> > > something really bad has happened between -rc3 and -rc6.
> > > Embarrassingly I did not catch that earlier :-(
> > > Compared to the numbers I posted in
> > > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> > > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24.
> > > The only test that is still good is mix3, which I attribute to
> > > the per-BDI stuff.
> 
> I suspect that the IO hardware you have is very sensitive to the
> color of the physical page. I wonder, do you boot the system cleanly
> and then run these tests? If so, it would be interesting to know what
> happens if you stress the system first (many kernel compiles for example,
> basically anything that would use a lot of memory in different ways for some
> time) to randomise the free lists a bit and then run your test. You'd need to 
> run
> the test three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the patch you
> identified reverted.
>

 The effect definitely depends on the IO hardware. I performed the same
tests on a different box with an AACRAID controller and there things look
different. Basically the "offending" commit helps single-stream performance
on that box, while dual/triple stream are not affected. So I suspect that
the CCISS is just not behaving well.

 And yes, the tests are usually done on a freshly booted box. Of course, I 
repeat them
a few times. On the CCISS box the numbers are very constant. On the AACRAID box
they vary quite a bit.

 I can certainly stress the box before doing the tests. Please define "many" 
for the kernel
compiles :-)

> > 
> >  OK, the change happened between rc5 and rc6. Just following a
> > gut feeling, I reverted
> > 
> > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> > #Author: Mel Gorman 
> > #Date:   Mon Dec 17 16:20:05 2007 -0800
> > #

> > 
> > This has brought back the good results I observed and reported.
> > I do not know what to make out of this. At least on the systems
> > I care about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory,
> > SmartaArray6i controller with 4x72GB SCSI disks as RAID5 (battery
> > protected writeback cache enabled) and gigabit networking (tg3)) this
> > optimisation is a dissaster.
> > 
> 
> That patch was not an optimisation, it was a regression fix
> against 2.6.23 and I don't believe reverting it is an option. Other IO
> hardware benefits from having the allocator supply pages in PFN order.

 I think this late in the 2.6.24 game we just should leave things as they are. 
But
we should try to find a way to make CCISS faster, as it apparently can be 
fas

Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-18 Thread Martin Knoblauch
- Original Message 
> From: Mel Gorman <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
> Sent: Thursday, January 17, 2008 11:12:21 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> >
> >  The effect definitely depends on the IO hardware. I performed the
> > same tests on a different box with an AACRAID controller and there
> > things look different.
> 
> I take it different also means it does not show this odd performance
> behaviour and is similar whether the patch is applied or not?
>

Here are the numbers (MB/s) from the AACRAID box, after a fresh boot:

Test       2.6.19.2   2.6.24-rc6   2.6.24-rc6 w/o 81eabcbe (*)
dd1           325        350           290
dd1-dir       180        160           160
dd2          2x90      2x113         2x110
dd2-dir     2x120       2x92          2x93
dd3          3x54       3x70          3x70
dd3-dir      3x83       3x64          3x64
mix3      55,2x30   400,2x25      310,2x25

(*) -rc6 with commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d reverted

 What we are seeing here is that:

a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO system 
compared to the CCISS box
b) Reverting your patch hurts single stream
c) dual/triple stream are not affected by your patch and are improved over 
2.6.19
d) the mix3 performance is improved compared to 2.6.19.
d1) reverting your patch hurts the local-disk part of mix3
e) the AACRAID setup is definitely faster than the CCISS.

 So, on this box your patch is definitely needed to get the pre-2.6.24 
performance
when writing a single big file.

 Actually things on the CCISS box might be even more complicated. I forgot the 
fact
that on that box we have ext2/LVM/DM/Hardware, while on the AACRAID box we have
ext2/Hardware. Do you think that the LVM/MD are sensitive to the page 
order/coloring?

 Anyway: does your patch only address this performance issue, or are there also
data integrity concerns without it? I may consider reverting the patch for my
production environment. It really helps two thirds of my boxes big time, while 
it does
not hurt the other third that much :-)
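
 In case it helps anyone following along, reverting just that one commit in
a local tree is straightforward (sketch only; the patch file name is a
placeholder, and the git variant assumes a git checkout of the tree):

  # revert the single commit identified above in a git tree
  git revert 81eabcbe0b991ddef5216f30ae91c4b226d54b6d

  # or, against a plain -rc8 source tree, apply the hunk shown above in reverse
  patch -p1 -R < page-alloc-order-fix.patch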

> > 
> >  I can certainly stress the box before doing the tests. Please
> > define "many" for the kernel compiles :-)
> > 
> 
> With 8GiB of RAM, try making 24 copies of the kernel and compiling them
> all simultaneously. Running that for 20-30 minutes should be enough
> to randomise the freelists affecting what color of page is used for the
> dd test.
> 

 ouch :-) OK, I will try that.
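
 For the record, the stress run I have in mind looks roughly like this
(paths, tree version and -j value are placeholders, not a verbatim script):

  # unpack 24 copies of a kernel tree and build them all at once
  for i in $(seq 1 24); do
      mkdir -p /scratch/stress/$i
      tar -C /scratch/stress/$i -xjf /tmp/linux-2.6.24-rc8.tar.bz2
  done
  for i in $(seq 1 24); do
      ( cd /scratch/stress/$i/linux-2.6.24-rc8 && make -s defconfig && make -s -j2 ) &
  done
  wait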

Martin



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: VM Requirement Document - v0.0

2001-06-27 Thread Martin Knoblauch

>> * If we're getting low cache hit rates, don't flush 
>> processes to swap. 
>> * If we're getting good cache hit rates, flush old, idle 
>> processes to swap. 

Rik> ... but I fail to see this one. If we get a low cache hit rate, 
Rik> couldn't that just mean we allocated too little memory for the 
Rik> cache ? 

 maybe more specific: if the hit-rate is low and the cache is already
70+% of the system's memory, the chances may be slim that more cache is
going to improve the hit-rate.

 I do not care much whether the cache is using 99% of the system's memory
or 50%. As long as there is free memory, using it for cache is great. I
care a lot if the cache takes down interactivity, because it pushes out
processes that it thinks are idle, but that I need in 5 seconds. The
cache's pressure against processes should decrease with the (relative)
size of the cache. Especially in low hit-rate situations.

 OT: I asked the question before somewhere else. Are there interfaces to
the VM that expose the various cache sizes and, more important,
hit-rates to userland? I would love to see (or maybe help writing in my
free time) a tool to just visualize/analyze the efficiency of the VM
system.

Martin
-- 
------
Martin Knoblauch |email:  [EMAIL PROTECTED]
TeraPort GmbH|Phone:  +49-89-510857-309
C+ITS|Fax:+49-89-510857-111
http://www.teraport.de   |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



ReiserFS patches vs. 2.4.5-ac series

2001-06-27 Thread Martin Knoblauch

Hi,

 what is the current relation between the reiserfs patches at
namesys.com and the 2.4.5-ac series kernel?

 Namesys seems to have a small one for the "umount" problem and two 
bigger ones (knfsd and knfsd+quota+mount). All apply cleanly to vanilla
2.4.5, but the bigger ones fail against ac18 and ac19 (earlier ones too,
I would guess). Are some of the knfsd/quota fixes already in -ac?

Thanks
Martin
-- 
------
Martin Knoblauch |email:  [EMAIL PROTECTED]
TeraPort GmbH|Phone:  +49-89-510857-309
C+ITS|Fax:+49-89-510857-111
http://www.teraport.de   |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: VM Requirement Document - v0.0

2001-06-27 Thread Martin Knoblauch

Rik van Riel wrote:
> 
> On Wed, 27 Jun 2001, Martin Knoblauch wrote:
> 
> >  I do not care much whether the cache is using 99% of the systems memory
> > or 50%. As long as there is free memory, using it for cache is great. I
> > care a lot if the cache takes down interactivity, because it pushes out
> > processes that it thinks idle, but that I need in 5 seconds. The caches
> > pressure against processes
> 
> Too bad that processes are in general cached INSIDE the cache.
> 
> You'll have to write a new balancing story now ;)
> 

 maybe that is part of "the answer" :-)

Martin
-- 
--
Martin Knoblauch |email:  [EMAIL PROTECTED]
TeraPort GmbH|Phone:  +49-89-510857-309
C+ITS|Fax:+49-89-510857-111
http://www.teraport.de   |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: VM Requirement Document - v0.0

2001-06-28 Thread Martin Knoblauch

Helge Hafting wrote:
> 
> Martin Knoblauch wrote:
> 
> >
> >  maybe more specific: If the hit-rate is low and the cache is already
> > 70+% of the systems memory, the chances maybe slim that more cache is
> > going to improve the hit-rate.
> >
> Oh, but this is posible.  You can get into situations where
> the (file cache) working set needs 80% or so of memory
> to get a near-perfect hitrate, and where
> using 70% of memory will trash madly due to the file access

 That's why I said "maybe" :-) Sure, another 5% of cache may improve
things, but it may also kill the interactive performance. That's why
there should probably be more than one VM strategy to accommodate servers
and workstations/laptops.

> pattern.  And this won't be a problem either, if
> the working set of "other" (non-file)
> stuff is below 20% of memory.  The total size of
> non-file stuff may be above 20% though, so something goes
> into swap.
> 

 And that is the problem. Too much seems to go into swap. At least for
interactive work. Unfortunately, with 128MB of memory I cannot entirely
turn off swap. I will see how things are going once I have 256 or 512 MB
(hopefully soon :-)

> I definitely want the machine to work under such circumstances,
> so an arbitrary limit of 70% won't work.
>

 Do not take the 70% as an arbitrary limit. I never said that. The 70%
is just my situation. The problems may arise at 60% cache or at 97.38%
cache.
 
> Preventing swap-trashing at all cost doesn't help if the

 Never said at all cost.

> machine loose to io-trashing instead.  Performance will be
> just as much down, although perhaps more satisfying because
> people aren't that surprised if explicit file operations
> take a long time.  They hate it when moving the mouse
> or something cause a disk access even if their
> apps runs faster. :-(
> 

 Absolutely true. And if the main purpose of the machine is interactive
work (we do want Linux to be a success on the desktop, don't we?), it
should not be hampered by an IO improvement that may be only of
secondary importance to the user (who is the final "customer" for all the
work that is done on the kernel :-). On big servers a little paging now
and then may be absolutely OK, as long as the IO is going strong.

 I have been observing the discussions of VM behaviour in 2.4.x for some
time. They are mostly very entertaining and revealing. But they also
show that one solution does not seem to benefit all possible scenarios.
Therefore either more than one VM strategy is necessary, or better means
of tuning the cache behaviour, or both. Definitely better ways of
measuring the VM efficiency seem to be needed.

 While implementing VM strategies is probably out of the question for a
lot of the people who complain, I hope that at least my complaints are
kind of useful.

Martin
-- 
--
Martin Knoblauch |email:  [EMAIL PROTECTED]
TeraPort GmbH|Phone:  +49-89-510857-309
C+ITS|Fax:+49-89-510857-111
http://www.teraport.de   |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Cosmetic JFFS patch.

2001-06-29 Thread Martin Knoblauch

>Olaf Hering wrote: 
>> kde.o. 2.5? 
>
>Good idea! Graphics needs to be in the kernel to be fast. Windows 
>proved that. 

 thought SGI proved that :-)

Martin
-- 
------
Martin Knoblauch |email:  [EMAIL PROTECTED]
TeraPort GmbH|Phone:  +49-89-510857-309
C+ITS|Fax:+49-89-510857-111
http://www.teraport.de   |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Announcing Journaled File System (JFS) release 1.0.0 available

2001-06-29 Thread Martin Knoblauch

Hi,

 first of all congratulations for finishing the initial first release.
Some questions, just out of curiosity:


>* Fast recovery after a system crash or power outage 
>
>* Journaling for file system integrity 
>
>* Journaling of meta-data only 
>

 does this mean JFS/Linux always journals only the meta-data, or is that
an option? Does it perform full data-journaling under AIX?

>* Extent-based allocation 
>
>* Excellent overall performance 
>
>* 64 bit file system 
>
>* Built to scale. In memory and on-disk data structures are designed to 
>  scale beyond practical limit 

 Is this scaling only for size, or also for performance (many disks on
many controllers) like XFS (at least on SGI iron)?

Thanks
Martin
-- 
------
Martin Knoblauch |email:  [EMAIL PROTECTED]
TeraPort GmbH|Phone:  +49-89-510857-309
C+ITS|Fax:+49-89-510857-111
http://www.teraport.de   |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



VM behaviour under 2.4.5-ac21

2001-06-29 Thread Martin Knoblauch

Hi,

 just something positive for the weekend. With 2.4.5-ac21, the behaviour
on my laptop (128MB plus twice that as swap) seems a bit more sane. When I
start new large applications now, the "used" portion of VM actually
pushes against the cache instead of forcing stuff into swap. It is still
using swap, but the effects on interactivity are much lighter.

 So, if this is a preview of 2.4.6 behaviour, there may be a light at
the end of the tunnel.

Have a good weekend
Martin
-- 
------
Martin Knoblauch |email:  [EMAIL PROTECTED]
TeraPort GmbH|Phone:  +49-89-510857-309
C+ITS|Fax:+49-89-510857-111
http://www.teraport.de   |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: iozone write 50% regression in kernel 2.6.24-rc1

2007-11-12 Thread Martin Knoblauch
- Original Message 
> From: "Zhang, Yanmin" <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]; LKML 
> Sent: Monday, November 12, 2007 1:45:57 AM
> Subject: Re: iozone write 50% regression in kernel 2.6.24-rc1
> 
> On Fri, 2007-11-09 at 04:36 -0800, Martin Knoblauch wrote:
> > - Original Message 
> > > From: "Zhang, Yanmin" 
> > > To: [EMAIL PROTECTED]
> > > Cc: LKML 
> > > Sent: Friday, November 9, 2007 10:47:52 AM
> > > Subject: iozone write 50% regression in kernel 2.6.24-rc1
> > > 
> > > Comparing with 2.6.23, iozone sequential write/rewrite (512M) has
> > > 50%
> > > 
> >  regression
> > > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> > > 
> > > My machine has 8 processor cores and 8GB memory.
> > > 
> > > By bisect, I located patch
> >
> >
> 
 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h
> =
> > > 04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> > > 
> > > 
> > > Another behavior: with kernel 2.6.23, if I run iozone for many
> > > times
> > > 
> >  after rebooting machine,
> > > the result looks stable. But with 2.6.24-rc1, the first run of
> > > iozone
> > > 
> >  got a very small result and
> > > following run has 4Xorig_result.
> > > 
> > > What I reported is the regression of 2nd/3rd run, because first run
> > > has
> > > 
> >  bigger regression.
> > > 
> > > I also tried to change
> > > /proc/sys/vm/dirty_ratio,dirty_backgroud_ratio
> > > 
> >  and didn't get improvement.
> >  could you tell us the exact iozone command you are using?
> iozone -i 0 -r 4k -s 512m
> 

 OK, I definitely do not see the reported effect.  On a HP Proliant with a 
RAID5 on CCISS I get:

2.6.19.2: 654-738 MB/sec write, 1126-1154 MB/sec rewrite
2.6.24-rc2: 772-820 MB/sec write, 1495-1539 MB/sec rewrite

 The first run is always slowest, all subsequent runs are faster and the same 
speed.

> 
> >  I would like to repeat it on my setup, because I definitely see
> the
> 
 opposite behaviour in 2.6.24-rc1/rc2. The speed there is much
> better
> 
 than in 2.6.22 and before (I skipped 2.6.23, because I was waiting
> for
> 
 the per-bdi changes). I definitely do not see the difference between
> 1st
> 
 and subsequent runs. But then, I do my tests with 5GB file sizes like:
> > 
> > iozone3_283/src/current/iozone -t 5 -F /scratch/X1
> /scratch/X2
> 
 /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1
> My machine uses SATA (AHCI) disk.
> 

 4x72GB SCSI disks building a RAID5 on a CCISS controller with battery backed 
write cache. Systems are 2 CPUs (64-bit) with 8 GB memory. I could test on some 
IBM boxes (2x dual core, 8 GB) with RAID5 on "aacraid", but I need some time to 
free up one of the boxes.

Cheers
Martin



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


2.6.24-rc1: First impressions

2007-10-26 Thread Martin Knoblauch
Hi ,

 just to give some feedback on 2.6.24-rc1. For some time I have been tracking 
IO/writeback problems that hurt system responsiveness big-time. I tested Peter's 
stuff together with Fengguang's additions and it looked promising. Therefore I 
was very happy to see Peter's stuff going into 2.6.24 and waited eagerly for 
rc1. In short, I am impressed. This really looks good. IO throughput is great 
and I could not reproduce the responsiveness problems so far.

 Below are some numbers from my brute-force I/O tests that I can use to bring 
responsiveness down. My platform is a HP/DL380g4, dual CPUs, HT-enabled, 8 GB 
memory, SmartArray 6i controller with 4x72GB SCSI disks as RAID5 (battery 
protected writeback cache enabled) and gigabit networking (tg3). User space is 
64-bit RHEL4.3

 I am basically doing copies using "dd" with 1MB blocksize. Local filesystem 
is ext2 (noatime). IO scheduler is deadline, as it tends to give the best 
results. NFS3 server is a Sun/T2000/Solaris10. The tests are:

dd1 - copy 16 GB from /dev/zero to local FS
dd1-dir - same, but using O_DIRECT for output
dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to local FS
dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero lo local FS
net1 - copy 5.2 GB from NFS3 share to local FS
mix3 - copy 3x5.2 GB from /dev/zero to local disk and two NFS3 shares
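
 For reference, the dd tests are essentially of this form (file names and the
NFS mount points below are placeholders; the O_DIRECT variants additionally
assume a dd that understands oflag=direct):

  # dd1: single 16 GB stream to the local ext2 FS, 1 MB blocks
  dd if=/dev/zero of=/scratch/dd1.out bs=1M count=16000
  sync

  # dd3: three 5.2 GB streams to the local FS in parallel
  for i in 1 2 3; do
      dd if=/dev/zero of=/scratch/dd3.$i bs=1M count=5200 &
  done
  wait
  sync

  # mix3: one local stream plus one stream to each of two NFS3 mounts
  dd if=/dev/zero of=/scratch/mix.local bs=1M count=5200 &
  dd if=/dev/zero of=/nfs/a/mix.1 bs=1M count=5200 &
  dd if=/dev/zero of=/nfs/b/mix.2 bs=1M count=5200 &
  wait
  sync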

 I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1. All units are MB/sec.

test       2.6.19.2   2.6.22.6   2.6.24-rc1

dd1            28         50         96
dd1-dir        88         88         86
dd2        2x16.5       2x11     2x44.5
dd2-dir      2x44       2x44       2x43
dd3         3x9.8      3x8.7       3x30
dd3-dir    3x29.5     3x29.5     3x28.5
net1        30-33      50-55      37-52
mix3        17/32      25/50      96/35   (disk/combined-network)


 Some observations:

- single threaded disk speed really went up with 2.6.24-rc1. It is now even 
better than O_DIRECT
- O_DIRECT took a slight hit compared to the older kernels. Not an issue for 
me, but maybe others care
- multi threaded non-O_DIRECT scales for the first time ever. Almost no 
loss compared to single threaded !!
- network throughput took a hit from 2.6.22.6 and is not as repeatable. Still 
better than 2.6.19.2 though

 What actually surprises me most is the big performance win on the single 
threaded non O_DIRECT dd test. I did not expect that :-) What I had hoped for 
was of course the scalability.

 So, this looks great and most likely I will push 2.6.24 (maybe .X) into my 
environment.

Happy weekend
Martin

------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc1: First impressions

2007-10-29 Thread Martin Knoblauch
- Original Message 
> From: Andrew Morton <[EMAIL PROTECTED]>
> To: Arjan van de Ven <[EMAIL PROTECTED]>
> Cc: Ingo Molnar <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; 
> linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL 
> PROTECTED]; [EMAIL PROTECTED]
> Sent: Saturday, October 27, 2007 7:59:51 AM
> Subject: Re: 2.6.24-rc1: First impressions
> 
> On Fri, 26 Oct 2007 22:46:57 -0700 Arjan van de
> Ven
> 
  wrote:
> 
> > > > > dd1 - copy 16 GB from /dev/zero to local FS
> > > > > dd1-dir - same, but using O_DIRECT for output
> > > > > dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to
> local
> 
 FS
> > > > > dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero lo
> local
> 
 FS
> > > > > net1 - copy 5.2 GB from NFS3 share to local FS
> > > > > mix3 - copy 3x5.2 GB from /dev/zero to local disk and two NFS3
> > > > > shares
> > > > > 
> > > > >  I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1. All
> > > > > units are MB/sec.
> > > > > 
> > > > > test   2.6.19.2 2.6.22.62.6.24.-rc1
> > > >
> >
> 
 
> > > > > dd1  28   50 96
> > > > > dd1-dir  88   88 86
> > > > > dd2  2x16.5 2x11 2x44.5
> > > > > dd2-dir2x44 2x44   2x43
> > > > > dd3   3x9.83x8.7   3x30
> > > > > dd3-dir  3x29.5   3x29.5 3x28.5
> > > > > net1  30-3350-55  37-52
> > > > > mix3  17/3225/50  96/35
> > > > > (disk/combined-network)
> > > > 
> > > > wow, really nice results!
> > > 
> > > Those changes seem suspiciously large to me.  I wonder if
> there's
> 
 less
> > > physical IO happening during the timed run, and
> correspondingly
> 
 more
> > > afterwards.
> > > 
> > 
> > another option... this is ext2.. didn't the ext2 reservation
> stuff
> 
 get
> > merged into -rc1? for ext3 that gave a 4x or so speed boost (much
> > better sequential allocation pattern)
> > 
> 
> Yes, one would expect that to make a large difference in
> dd2/dd2-dir
> 
 and
> dd3/dd3-dir - but only on SMP.  On UP there's not enough concurrency
> in the fs block allocator for any damage to occur.
>

 Just for the record, the tests are done on SMP.
 
> Reservations won't affect dd1 though, and that went faster too.
>

 This is the one result that surprised me most, as I did not really expect any 
big moves here. I am not complaining :-), but definitely it would be nice to 
understand the why.

Cheers
Martin
> 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc1: First impressions

2007-10-29 Thread Martin Knoblauch
- Original Message 
> From: Ingo Molnar <[EMAIL PROTECTED]>
> To: Andrew Morton <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Sent: Friday, October 26, 2007 9:33:40 PM
> Subject: Re: 2.6.24-rc1: First impressions
> 
> 
> * Andrew Morton  wrote:
> 
> > > > dd1 - copy 16 GB from /dev/zero to local FS
> > > > dd1-dir - same, but using O_DIRECT for output
> > > > dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to
> local
> 
 FS
> > > > dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero lo
> local
> 
 FS
> > > > net1 - copy 5.2 GB from NFS3 share to local FS
> > > > mix3 - copy 3x5.2 GB from /dev/zero to local disk and two
> NFS3
> 
 shares
> > > > 
> > > >  I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1.
> All
> 
 units 
> > > >  are MB/sec.
> > > > 
> > > > test   2.6.19.2 2.6.22.62.6.24.-rc1
> > > > 
> > > > dd1  28   50 96
> > > > dd1-dir  88   88 86
> > > > dd2  2x16.5 2x11 2x44.5
> > > > dd2-dir2x44 2x44   2x43
> > > > dd3   3x9.83x8.7   3x30
> > > > dd3-dir  3x29.5   3x29.5 3x28.5
> > > > net1  30-3350-55  37-52
> > > > mix3  17/3225/50 
> 96/35
> 
 (disk/combined-network)
> > > 
> > > wow, really nice results!
> > 
> > Those changes seem suspiciously large to me.  I wonder if
> there's
> 
 less 
> > physical IO happening during the timed run, and correspondingly more 
> > afterwards.
> 
> so a final 'sync' should be added to the test too, and the time
> it
> 
 takes 
> factored into the bandwidth numbers?
> 

 One of the reasons I do 15 GB transfers is to make sure that I am well above 
the possible page cache size. And of course I am doing a final sync to finish 
the runs :-) The sync is also running faster in 2.6.24-rc1.

 If I factor it in, the results for dd1/dd3 are:

test        2.6.19.2   2.6.22.6   2.6.24-rc1
sync time      18sec      19sec       6sec
dd1             27.5       47.5         92
dd3            3x9.1      3x8.5       3x29
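
 The adjusted numbers are simply total data divided by (transfer time + sync
time); as a quick sanity check for dd1 on 2.6.24-rc1, taking 16000 MB as a
round figure for the 16 GB file:

  # ~167 s at 96 MB/s for the transfer, plus the 6 s sync
  echo "scale=1; 16000 / (16000/96 + 6)" | bc
  # -> ~92 MB/s, matching the dd1 row above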

So basically including the sync time makes 2.6.24-rc1 even more promising. Now, 
I know that my benchmark numbers are crude and show only a very small aspect 
of system performance. But - it is an aspect I care about a lot. And those 
benchmarks match my use case pretty well.

Cheers
Martin





-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Understanding I/O behaviour - next try

2007-08-28 Thread Martin Knoblauch
Keywords: I/O, bdi-v9, cfs

Hi,

 a while ago I asked a few questions on the Linux I/O behaviour,
because I was (and still am) fighting some "misbehaviour" related to heavy
I/O.

 The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
The performance of the block device with O_DIRECT is about 90 MB/sec.

 The problematic behaviour comes when we are moving large files through
the system. The file usage in this case is mostly "use once" or
streaming. As soon as the amount of file data is larger than 7.5 GB, we
see occasional unresponsiveness of the system (e.g. no more ssh
connections into the box) of more than 1 or 2 minutes (!) duration
(kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
some other poor guys being in "D" state.
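
 To give an idea how I spot the victims: a simple listing of the "D" state
tasks is enough (illustrative only; any ps with stat/wchan columns will do):

  # tasks in uninterruptible sleep and the kernel function they block in
  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'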

 The data flows in basically three modes. All of them are affected:

local-disk -> NFS
NFS -> local-disk
NFS -> NFS

 NFS is V3/TCP.

 So, I made a few experiments in the last few days, using three
different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 an 2.6.22.5+bdi-v9.

 The first observation (independent of the kernel) is that we *should*
use O_DIRECT, at least for output to the local disk. Here we see about
90 MB/sec write performance. A simple "dd" using 1,2 and 3 parallel
threads to the same block device (through a ext2 FS) gives:

O_Direct: 88 MB/s, 2x44, 3x29.5
non-O_DIRECT: 51 MB/s, 2x19, 3x12.5

- Observation 1a: IO schedulers are mostly equivalent, with CFQ
slightly worse than AS and DEADLINE
- Observation 1b: when using a 2.6.22.5+cfs20.4, the non-O_DIRECT
performance goes [slightly] down. With three threads it is 3x10 MB/s.
Ingo?
- Observation 1c: bdi-v9 does not help in this case, which is not
surprising.

 The real question here is why the non-O_DIRECT case is so slow. Is
this a general thing? Is this related to the CCISS controller? Using
O_DIRECT is unfortunately not an option for us.

 When using three different targets (local disk plus two different NFS
Filesystems) bdi-v9 is a big winner. Without it, all threads are [seem
to be] limited to the speed of the slowest FS. With bdi-v9 we see a
considerable speedup.

 Just by chance I found out that doing all I/O in sync mode does
prevent the load from going up. Of course, I/O throughput is not
stellar (but not much worse than the non-O_DIRECT case). But the
responsiveness seems OK. Maybe a solution, as this can be controlled via
mount (would be great for O_DIRECT :-).
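
 As a sketch, the sync behaviour can be requested per filesystem at mount
time (device names and mount points are placeholders):

  # local ext2 scratch FS, all writes synchronous
  mount -o remount,sync /scratch

  # NFS3 client mount with synchronous writes
  mount -t nfs -o vers=3,tcp,sync server:/export /mnt/data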

 In general 2.6.22 seems to be better than 2.6.19, but this is highly
subjective :-( I am using the following settings in /proc. They seem to
provide the smoothest responsiveness:

vm.dirty_background_ratio = 1
vm.dirty_ratio = 1
vm.swappiness = 1
vm.vfs_cache_pressure = 1
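
 They can be applied at runtime directly via /proc, e.g.:

  echo 1 > /proc/sys/vm/dirty_background_ratio
  echo 1 > /proc/sys/vm/dirty_ratio
  echo 1 > /proc/sys/vm/swappiness
  echo 1 > /proc/sys/vm/vfs_cache_pressure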

 Another thing I saw during my tests is that when writing to NFS, the
"dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
or a bug?

 In any case, view this as a report for one specific load case that does
not behave very well. It seems there are ways to make things better
(sync, per device throttling, ...), but nothing "perfect" yet. Use-once
does seem to be a problem.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 28, 2007 at 08:53:07AM -0700, Martin Knoblauch wrote:
> [...]
> >  The basic setup is a dual x86_64 box with 8 GB of memory. The
> DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write
> cache.
> > The performance of the block device with O_DIRECT is about 90
> MB/sec.
> > 
> >  The problematic behaviour comes when we are moving large files
> through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5
> GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads
> and
> > some other poor guys being in "D" state.
> [...]
> >  Just by chance I found out that doing all I/O inc sync-mode does
> > prevent the load from going up. Of course, I/O throughput is not
> > stellar (but not much worse than the non-O_DIRECT case). But the
> > responsiveness seem OK. Maybe a solution, as this can be controlled
> via
> > mount (would be great for O_DIRECT :-).
> > 
> >  In general 2.6.22 seems to bee better that 2.6.19, but this is
> highly
> > subjective :-( I am using the following setting in /proc. They seem
> to
> > provide the smoothest responsiveness:
> > 
> > vm.dirty_background_ratio = 1
> > vm.dirty_ratio = 1
> > vm.swappiness = 1
> > vm.vfs_cache_pressure = 1
> 
> You are apparently running into the sluggish kupdate-style writeback
> problem with large files: huge amount of dirty pages are getting
> accumulated and flushed to the disk all at once when dirty background
> ratio is reached. The current -mm tree has some fixes for it, and
> there are some more in my tree. Martin, I'll send you the patch if
> you'd like to try it out.
>
Hi Fengguang,

 Yeah, that pretty much describes the situation we end up in. Although
"sluggish" is much too friendly if we hit the situation :-)

 Yes, I am very interested to check out your patch. I saw your
postings on LKML and was already curious. Any chance you have
something against 2.6.22-stable? I have reasons not to move to -23 or
-mm.

> >  Another thing I saw during my tests is that when writing to NFS,
> the
> > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual
> thing,
> > or a bug?
> 
> What are the nr_unstable numbers?
>

 Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
numbers for the disk case. Good to know.

 For NFS, the nr_writeback numbers seem surprisingly high. They also go
to 80-90k (pages ?). In the disk case they rarely go over 12k.

Cheers
Martin
> Fengguang
> 
> 


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Wed, Aug 29, 2007 at 01:15:45AM -0700, Martin Knoblauch wrote:
> > 
> > --- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > 
> > > You are apparently running into the sluggish kupdate-style
> writeback
> > > problem with large files: huge amount of dirty pages are getting
> > > accumulated and flushed to the disk all at once when dirty
> background
> > > ratio is reached. The current -mm tree has some fixes for it, and
> > > there are some more in my tree. Martin, I'll send you the patch
> if
> > > you'd like to try it out.
> > >
> > Hi Fengguang,
> > 
> >  Yeah, that pretty much describes the situation we end up. Although
> > "sluggish" is much to friendly if we hit the situation :-)
> > 
> >  Yes, I am very interested  to check out your patch. I saw your
> > postings on LKML already and was already curious. Any chance you
> have
> > something agains 2.6.22-stable? I have reasons not to move to -23
> or
> > -mm.
> 
> Well, they are a dozen patches from various sources.  I managed to
> back-port them. It compiles and runs, however I cannot guarantee
> more...
>

 Thanks. I understand the limited scope of the warranty :-) I will give
it a spin today.
 
> > > >  Another thing I saw during my tests is that when writing to
> NFS,
> > > the
> > > > "dirty" or "nr_dirty" numbers are always 0. Is this a
> conceptual
> > > thing,
> > > > or a bug?
> > > 
> > > What are the nr_unstable numbers?
> > >
> > 
> >  Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> > numbers for the disk case. Good to know.
> > 
> >  For NFS, the nr_writeback numbers seem surprisingly high. They
> also go
> > to 80-90k (pages ?). In the disk case they rarely go over 12k.
> 
> Maybe the difference of throttling one single 'cp' and a dozen
> 'nfsd'?
>

 No "nfsd" running on that box. It is just a client.

Cheers
Martin
 

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Jens Axboe <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 28 2007, Martin Knoblauch wrote:
> > Keywords: I/O, bdi-v9, cfs
> > 
> 
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack,
> see
> if it makes a difference (and please verify in dmesg that it prints
> the
> message about limiting depth!):
> 
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c,
> struct pci_dev *pdev)
>   if (board_id == products[i].board_id) {
>   c->product_name = products[i].product_name;
>   c->access = *(products[i].access);
> +#if 0
>   c->nr_cmds = products[i].nr_cmds;
> +#else
> + c->nr_cmds = 2;
> + printk("cciss: limited max commands to 2\n");
> +#endif
>   break;
>   }
>   }
> 
> -- 
> Jens Axboe
> 
> 
>
Hi Jens,

 thanks for the suggestion. Unfortunately the non-direct [parallel]
writes to the device got considerably slower. I guess the "6i"
controller copes better with higher values.

 Can nr_cmds be changed at runtime? Maybe there is an optimal setting.

[   69.438851] SCSI subsystem initialized
[   69.442712] HP CISS Driver (v 3.6.14)
[   69.442871] ACPI: PCI Interrupt :04:03.0[A] -> GSI 51 (level,
low) -> IRQ 51
[   69.442899] cciss: limited max commands to 2 (Smart Array 6i)
[   69.482370] cciss0: <0x46> at PCI :04:03.0 IRQ 51 using DAC
[   69.494352]   blocks= 426759840 block_size= 512
[   69.498350]   heads=255, sectors=32, cylinders=52299
[   69.498352]
[   69.498509]   blocks= 426759840 block_size= 512
[   69.498602]   heads=255, sectors=32, cylinders=52299
[   69.498604]
[   69.498608]  cciss/c0d0: p1 p2

Cheers
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Chuck Ebbert <[EMAIL PROTECTED]> wrote:

> On 08/28/2007 11:53 AM, Martin Knoblauch wrote:
> > 
> >  The basic setup is a dual x86_64 box with 8 GB of memory. The
> DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write
> cache.
> > The performance of the block device with O_DIRECT is about 90
> MB/sec.
> > 
> >  The problematic behaviour comes when we are moving large files
> through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5
> GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads
> and
> > some other poor guys being in "D" state.
> 
> Try booting with "mem=4096M", "mem=2048M", ...
> 
> 

 hmm. I tried 1024M a while ago and IIRC did not see a lot [any]
difference. But as it is no big deal, I will repeat it tomorrow.

 Just curious - what are you expecting? Why should it help?

Thanks
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: regression of autofs for current git?

2007-08-30 Thread Martin Knoblauch
On Wed, 2007-08-29 at 20:09 -0700, Ian Kent wrote:
>
>http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=75180df2ed467866ada839fe73cf7cc7d75c0a22
>
>This (and it's related patches) may be the problem.
>I can probably tell if you post your map or if you strace the
automount
>process managing the a problem mount point and look for mount
returning
>EBUSY when it should succeed.

 Likely. That is the one that will break the user-space automounter as
well (and keeps me from .23). I don't care very much about what the
default is, but it would be great if the new behaviour could be
globally changed at run- (or boot-) time. It will be some time until
the new mount option makes it into the distros.
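
 For reference, once the option is available, the old behaviour would be
requested per mount roughly like this (server and paths are placeholders):

  mount -t nfs -o ro,nosharecache server:/export /net/tools
  mount -t nfs -o rw,nosharecache server:/export /net/scratch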

Cheers
Martin
PS: Sorry, but I likely killed the CC list


------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-30 Thread Martin Knoblauch

--- Robert Hancock <[EMAIL PROTECTED]> wrote:

> 
> I saw a bulletin from HP recently that sugggested disabling the 
> write-back cache on some Smart Array controllers as a workaround
> because 
> it reduced performance in applications that did large bulk writes. 
> Presumably they are planning on releasing some updated firmware that 
> fixes this eventually..
> 
> -- 
> Robert Hancock  Saskatoon, SK, Canada
> To email, remove "nospam" from [EMAIL PROTECTED]
> Home Page: http://www.roberthancock.com/
> 
Robert,

 just checked it out. At least with the "6i", you do not want to
disable the WBC :-) Performance really goes down the toilet for all
cases.

 Do you still have a pointer to that bulletin?

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-30 Thread Martin Knoblauch

--- Jens Axboe <[EMAIL PROTECTED]> wrote:

> 
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack,
> see
> if it makes a difference (and please verify in dmesg that it prints
> the
> message about limiting depth!):
> 
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c,
> struct pci_dev *pdev)
>   if (board_id == products[i].board_id) {
>   c->product_name = products[i].product_name;
>   c->access = *(products[i].access);
> +#if 0
>   c->nr_cmds = products[i].nr_cmds;
> +#else
> + c->nr_cmds = 2;
> + printk("cciss: limited max commands to 2\n");
> +#endif
>   break;
>   }
>   }
> 
> -- 
> Jens Axboe
> 
> 
Hi Jens,

 how exactly is the queue depth related to the max # of commands? I
ask, because with the 2.6.22 kernel the "maximum queue depth since
init" never seems to be higher than 16, even with much higher
outstanding commands. On a 2.6.19 kernel, maximum queue depth is much
higher, just a bit below "max # of commands since init".

[2.6.22]# cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Max sectors: 2048
Current Q depth: 0
Current # commands on controller: 145
Max Q depth since init: 16
Max # commands on controller since init: 204
Max SG entries since init: 31
Sequential access devices: 0

[2.6.19] cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 197
Max # commands on controller since init: 198
Max SG entries since init: 31
Sequential access devices: 0
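
 A crude way to watch these counters during a run (nothing more than a
loop around the cat shown above):

  while true; do
      grep -E 'Q depth|commands on controller' /proc/driver/cciss/cciss0
      sleep 1
  done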

Cheers
Martin




--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: recent nfs change causes autofs regression

2007-08-31 Thread Martin Knoblauch

--- Ian Kent <[EMAIL PROTECTED]> wrote:

> On Thu, 30 Aug 2007, Linus Torvalds wrote:
> > 
> > 
> > On Fri, 31 Aug 2007, Trond Myklebust wrote:
> > > 
> > > It did not. The previous behaviour was to always silently
> override the
> > > user mount options.
> > 
> > ..so it still worked for any sane setup, at least.
> > 
> > You broke that. Hua gave good reasons for why he cannot use the
> current 
> > kernel. It's a regression.
> > 
> > In other words, the new behaviour is *worse* than the behaviour you
> 
> > consider to be the incorrect one.
> > 
> 
> This all came about due to complains about not being able to mount
> the 
> same server file system with different options, most commonly ro vs.
> rw 
> which I think was due to the shared super block changes some time
> ago. 
> And, to some extent, I have to plead guilty for not complaining
> enough 
> about this default in the beginning, which is basically unacceptable
> for 
> sure.
> 
> We have seen breakage in Fedora with the introduction of the patches
> and 
> this is typical of it. It also breaks amd and admins have no way of 
> altering this that I'm aware of (help us here Ion).
> 
> I understand Tronds concerns but the fact remains that other Unixs
> allow 
> this behaviour but don't assert cache coherancy and many sysadmin
> don't 
> realize this. So the broken behavior is expected to work and we can't
> 
> simply stop allowing it unless we want to attend a public hanging
> with us 
> as the paticipants.
> 
> There is no question that the new behavior is worse and this change
> is 
> unacceptable as a solution to the original problem.
> 
> I really think that reversing the default, as has been suggested, 
> documenting the risk in the mount.nfs man page and perhaps issuing a 
> warning from the kernel is a better way to handle this. At least we
> will 
> be doing more to raise public awareness of the issue than others.
> 

 I can only second that. Changing the default behavior in this way is
really bad.

 Not that I am disagreeing with the technical reasons, but the change
breaks working setups. And -EBUSY is not very helpful as a message
here. It does not matter that the user tools may handle the breakage
incorrectly. The users (admins) had working setups for years. And they
were obviously working "good enough".

 And one should not forget that it will take considerable time until
"nosharecache" trickles down into distributions.

 If the situation stays this way, quite a few people will not be able
to move beyond 2.6.22 for some time. E.g. I am working for a
company that operates some Linux "clusters" at a few German automotive
companies. For certain reasons everything there is based on
automounter maps (both autofs and amd style). We have almost zero
influence on that setup. The maps are a mess - we will run into the
sharecache problem. At the same time I am trying to fight the notorious
"system turns into frozen molasses on moderate I/O load". There may be
some interesting developments coming after 2.6.22. Not good :-(

 What I would like to see done for the situation at hand is:

- make "nosharecache" the default for the foreseeable future (see the
example below)
- log any attempt to mount option-inconsistent NFS filesystems to dmesg
and syslog (apparently the NFS client is able to detect them :-). Do
this regardless of the "nosharecache" option. This way admins will at
least be made aware of the situation.
- in a year or so we can talk about making the default safe. With
proper advertising.
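
 To illustrate what the first point would spare admins from - a minimal
sketch with a made-up server/export name; with the new default,
mounting the same export twice with different options only works if
"nosharecache" is given explicitly, otherwise the second mount fails
with -EBUSY:

mount -t nfs -o ro,nosharecache server1:/export /mnt/ro
mount -t nfs -o rw,nosharecache server1:/export /mnt/rw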

 Just my € 0.02.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: recent nfs change causes autofs regression

2007-09-03 Thread Martin Knoblauch

--- Jakob Oestergaard <[EMAIL PROTECTED]> wrote:

> On Fri, Aug 31, 2007 at 09:43:29AM -0700, Linus Torvalds wrote:
> ...
> > This is *not* a security hole. In order to make it a security hole,
> you 
> > need to be root in the first place.
> 
> Non-root users can write to places where root might believe they
> cannot write
> because he might be under the mistaken assumption that ro means ro.
> 
> I am under the impression that that could have implications in some
> setups.
>

 That was never in question.
 
> ...
> > 
> >  - it's a misfeature that people are used to, and has been around
> forever.
> 
> Sure, they're used it it, but I doubt they are aware of it.
>

 So, the right thing to do (tm) is to make them aware without breaking
their setup. 

 Log any detected inconsistencies in the dmesg buffer and to syslog. If
the sysadmin is not competent enough to notice, too bad.
 
Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RFC: [PATCH] Small patch on top of per device dirty throttling -v9

2007-09-03 Thread Martin Knoblauch

--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> > --- Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > > 
> > > > Peter,
> > > > 
> > > >  any chance to get a rollup against 2.6.22-stable?
> > > > 
> > > >  The 2.6.23 series may not be usable for me due to the
> > > > nosharedcache changes for NFS (the new default will massively
> > > > disturb the user-space automounter).
> > > 
> > > I'll see what I can do, bit busy with other stuff atm, hopefully
> > > after
> > > the weekend.
> > > 
> > Hi Peter,
> > 
> >  any progress on a version against 2.6.22.5? I have seen the very
> > positive report from Jeffrey W. Baker and would really love to test
> > your patch. But as I said, anything newer than 2.6.22.x might not
> be an
> > option due to the NFS changes.
> 
> mindless port, seems to compile and boot on my test box ymmv.
> 
Hi Peter,

 while doing my tests I observed that setting dirty_ratio below 5% did
not make a difference at all. Just by chance I found that this
apparently is an enforced limit in mm/page-writeback.c.

 With the patch below I have lowered the limit to 2%. With that, things
look a lot better on my systems. Load during write stays below 1.5 for
one writer. Responsiveness is good.

This may even help without the throttling patch. Not sure that this is
the right thing to do, but it helps :-)
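
 For anyone who wants to check this on their own box, a small sketch of
how one could watch whether the lowered limit actually has an effect
(the /scratch path is just an example, values chosen arbitrarily):

#!/bin/sh
# Lower the ratio, start one writer and watch dirty/writeback memory.
echo 2 > /proc/sys/vm/dirty_ratio
dd if=/dev/zero of=/scratch/X1 bs=1M count=5000 &
DD_PID=$!
while kill -0 $DD_PID 2>/dev/null; do
        grep -E '^(Dirty|Writeback):' /proc/meminfo
        sleep 5
done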

Cheers
Martin

--- linux-2.6.22.5-bdi-v9/mm/page-writeback.c
+++ linux-2.6.22.6+bdi-v9/mm/page-writeback.c
@@ -311,8 +311,11 @@
if (dirty_ratio > unmapped_ratio / 2)
dirty_ratio = unmapped_ratio / 2;

-   if (dirty_ratio < 5)
-   dirty_ratio = 5;
+/*
+** MKN: Lower enforced limit from 5% to 2%
+*/
+   if (dirty_ratio < 2)
+   dirty_ratio = 2;

background_ratio = dirty_background_ratio;
if (background_ratio >= dirty_ratio)


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: huge improvement with per-device dirty throttling

2007-09-04 Thread Martin Knoblauch

--- Leroy van Logchem <[EMAIL PROTECTED]> wrote:

> Andrea Arcangeli wrote:
> > On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote:
> >> Ok perhaps the new adaptive dirty limits helps your single disk
> >> a lot too. But your improvements seem to be more "collateral
> damage" @)
> >>
> >> But if that was true it might be enough to just change the dirty
> limits
> >> to get the same effect on your system. You might want to play with
> >> /proc/sys/vm/dirty_*
> > 
> > The adaptive dirty limit is per task so it can't be reproduced with
> > global sysctl. It made quite some difference when I researched into
> it
> > in function of time. This isn't in function of time but it
> certainly
> > makes a lot of difference too, actually it's the most important
> part
> > of the patchset for most people, the rest is for the corner cases
> that
> > aren't handled right currently (writing to a slow device with
> > writeback cache has always been hanging the whole thing).
> 
> 
> Self-tuning > static sysctl's. The last years we needed to use very 
> small values for dirty_ratio and dirty_background_ratio to soften the
> 
> latency problems we have during sustained writes. Imo these patches 
> really help in many cases, please commit to mainline.
> 
> -- 
> Leroy
> 

 while it helps in some situations, I did some tests today with
2.6.22.6+bdi-v9 (Peter was so kind) which seem to indicate that it
hurts NFS writes. Anyone seen similar effects?

 Otherwise I would just second your request. It definitely helps the
problematic performance of my CCISS based RAID5 volume.

Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: huge improvement with per-device dirty throttling

2007-09-05 Thread Martin Knoblauch

--- Andrea Arcangeli <[EMAIL PROTECTED]> wrote:

> On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote:
> > Ok perhaps the new adaptive dirty limits helps your single disk
> > a lot too. But your improvements seem to be more "collateral
> damage" @)
> > 
> > But if that was true it might be enough to just change the dirty
> limits
> > to get the same effect on your system. You might want to play with
> > /proc/sys/vm/dirty_*
> 
> The adaptive dirty limit is per task so it can't be reproduced with
> global sysctl. It made quite some difference when I researched into
> it
> in function of time. This isn't in function of time but it certainly
> makes a lot of difference too, actually it's the most important part
> of the patchset for most people, the rest is for the corner cases
> that

> aren't handled right currently (writing to a slow device with
> writeback cache has always been hanging the whole thing).

 I didn't see that remark before. I just realized that "slow device with
writeback cache" pretty well describes the CCISS controller in the
DL380g4. Could you elaborate why that is a problematic case?

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: huge improvement with per-device dirty throttling

2007-09-06 Thread Martin Knoblauch

--- Martin Knoblauch <[EMAIL PROTECTED]> wrote:

> 
> --- Leroy van Logchem <[EMAIL PROTECTED]> wrote:
> 
> > Andrea Arcangeli wrote:
> > > On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote:
> > >> Ok perhaps the new adaptive dirty limits helps your single disk
> > >> a lot too. But your improvements seem to be more "collateral
> > damage" @)
> > >>
> > >> But if that was true it might be enough to just change the dirty
> > limits
> > >> to get the same effect on your system. You might want to play
> with
> > >> /proc/sys/vm/dirty_*
> > > 
> > > The adaptive dirty limit is per task so it can't be reproduced
> with
> > > global sysctl. It made quite some difference when I researched
> into
> > it
> > > in function of time. This isn't in function of time but it
> > certainly
> > > makes a lot of difference too, actually it's the most important
> > part
> > > of the patchset for most people, the rest is for the corner cases
> > that
> > > aren't handled right currently (writing to a slow device with
> > > writeback cache has always been hanging the whole thing).
> > 
> > 
> > Self-tuning > static sysctl's. The last years we needed to use very
> 
> > small values for dirty_ratio and dirty_background_ratio to soften
> the
> > 
> > latency problems we have during sustained writes. Imo these patches
> 
> > really help in many cases, please commit to mainline.
> > 
> > -- 
> > Leroy
> > 
> 
>  while it helps in some situations, I did some tests today with
> 2.6.22.6+bdi-v9 (Peter was so kind) which seem to indicate that it
> hurts NFS writes. Anyone seen similar effects?
> 
>  Otherwise I would just second your request. It definitely helps the
> problematic performance of my CCISS based RAID5 volume.
> 

 please disregard my comment about NFS write performance. What I have
seen is caused by some other stuff I am toying with.

 So, I second your request to push this forward.

Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 00/23] per device dirty throttling -v9

2007-08-16 Thread Martin Knoblauch
>Per device dirty throttling patches
>
>These patches aim to improve balance_dirty_pages() and directly
>address three issues:
>1) inter device starvation
>2) stacked device deadlocks
>3) inter process starvation
>
>1 and 2 are a direct result from removing the global dirty
>limit and using per device dirty limits. By giving each device
>its own dirty limit is will no longer starve another device,
>and the cyclic dependancy on the dirty limit is broken.
>
>In order to efficiently distribute the dirty limit across
>the independant devices a floating proportion is used, this
>will allocate a share of the total limit proportional to the
>device's recent activity.
>
>3 is done by also scaling the dirty limit proportional to the
>current task's recent dirty rate.
>
>Changes since -v8:
>- cleanup of the proportion code
>- fix percpu_counter_add(&counter, -(unsigned long))
>- fix per task dirty rate code
>- fwd port to .23-rc2-mm2

Peter,

 any chance to get a rollup against 2.6.22-stable?

 The 2.6.23 series may not be usable for me due to the
nosharedcache changes for NFS (the new default will massively
disturb the user-space automounter).

Cheers
Martin 


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 00/23] per device dirty throttling -v9

2007-08-16 Thread Martin Knoblauch

--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> 
> > Peter,
> > 
> >  any chance to get a rollup against 2.6.22-stable?
> > 
> >  The 2.6.23 series may not be usable for me due to the
> > nosharedcache changes for NFS (the new default will massively
> > disturb the user-space automounter).
> 
> I'll see what I can do, bit busy with other stuff atm, hopefully
> after the weekend.
> 
Hi Peter,

 that would be highly appreciated. Thanks a lot in advance.

Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: iozone write 50% regression in kernel 2.6.24-rc1

2007-11-09 Thread Martin Knoblauch
- Original Message 
> From: "Zhang, Yanmin" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: LKML 
> Sent: Friday, November 9, 2007 10:47:52 AM
> Subject: iozone write 50% regression in kernel 2.6.24-rc1
> 
> Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50%
> regression in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> 
> My machine has 8 processor cores and 8GB memory.
> 
> By bisect, I located patch
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=
> 04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> 
> Another behavior: with kernel 2.6.23, if I run iozone for many times
> after rebooting machine, the result looks stable. But with 2.6.24-rc1,
> the first run of iozone got a very small result and following run has
> 4Xorig_result.
> 
> What I reported is the regression of 2nd/3rd run, because first run
> has bigger regression.
> 
> I also tried to change /proc/sys/vm/dirty_ratio,dirty_backgroud_ratio
> and didn't get improvement.
> 
> -yanmin
> -
Hi Yanmin,

 could you tell us the exact iozone command you are using? I would like to 
repeat it on my setup, because I definitely see the opposite behaviour in 
2.6.24-rc1/rc2. The speed there is much better than in 2.6.22 and before (I 
skipped 2.6.23, because I was waiting for the per-bdi changes). I definitely do 
not see the difference between 1st and subsequent runs. But then, I do my tests 
with 5GB file sizes like:

iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 
/scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1

Kind regards
Martin



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Binary Drivers

2006-12-26 Thread Martin Knoblauch
ntract. If
it says "works with XXX" (and it does), you have no right to demand
that it works with YYY, or that the manufacturer has to help you make
it work with YYY.

 The manufacturer may not be allowed to *actively* prevent you from
making it work with YYY, but I see no legal problem (IANAL, in any
jurisdiction of the world) if they make it hard for you by being
*passive*.

 If they promised that it works with YYY, it is another story. They are
obliged to make it work or compensate you. How they make it work is up
to them, as long as they keep the promise. Whether you are satisfied is
up to you.

>If you retain some rights over something, then you are not selling it
>in the normal sense. You are selling a subset of the rights to it,
>and the buy must be told what rights he is getting and what rights
>he is not getting.

 They are not keeping any right from you. They are just not being
helpful.

 And now let's stop the car nonsense :-)

Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Binary Drivers

2006-12-26 Thread Martin Knoblauch
On 12/25/06, David Schwartz <[EMAIL PROTECTED]> wrote:

>   If I bought the car from the manufacturer, it also must
> include any rights the manufacturer might have to the car's use.
> That includes using the car to violate emission control measures.
> If I didn't buy the right to use the car that way (insofar as
> that right was owned by the car manufacturer), I didn't
> buy the whole care -- just *some* of the rights to use it.

 just to be dense - what makes you think that the car manufacturer has
any legal right to violate emission control measures? What utter
nonsense (sorry).

 So, let's stop the stupid car comparisons. They are not being funny any
more.

Martin

------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Binary Drivers

2006-12-26 Thread Martin Knoblauch

--- James C Georgas <[EMAIL PROTECTED]> wrote:

> On Tue, 2006-26-12 at 03:20 -0800, Martin Knoblauch wrote:
> > On 12/25/06, David Schwartz <[EMAIL PROTECTED]> wrote:
> > 
> > >   If I bought the car from the manufacturer, it also must
> > > include any rights the manufacturer might have to the car's use.
> > > That includes using the car to violate emission control measures.
> > > If I didn't buy the right to use the car that way (insofar as
> > > that right was owned by the car manufacturer), I didn't
> > > buy the whole care -- just *some* of the rights to use it.
> > 
> >  just to be dense - what makes you think that the car manufacturer
> has
> > any legal right to violate emission control measures? What an utter
> > nonsense (sorry).
> > 
> >  So, lets stop the stupid car comparisons. They are no being funny
> any
> > more.
> 
> Let's summarize the current situation:
> 
> 1) Hardware vendors don't have to tell us how to program their
> products, as long as they provide some way to use it 
> (i.e. binary blob driver).
>

 Correct, as far as I can tell.
 
> 2) Hardware vendors don't want to tell us how to program their
> products, because they think this information is their secret
> sauce (or maybe their competitor's secret sauce).
>

 - or they are ashamed to show the world what kind of crap they sell
 - or they have lost (or never had) the documentation themselves. I tend
not to believe this

> 3) Hardware vendors don't tell us how to program their products,
> because they know about (1) and they believe (2).
>

 - or they are just ignorant
  
> 4) We need products with datasheets because of our development model.
>

 - correct
 
> 5) We want products with capabilities that these vendors advertise.
>

 we want open-spec products that meet the performance of the high-end
closed-spec products
 
> 6) Products that satisfy both (4) and (5) are often scarce or
> non-existent.
>

 unfortunately
 
> 
> So far, the suggestions I've seen to resolve the above conflict fall
> into three categories:
> 
> a) Force vendors to provide datasheets. 
> 
> b) Entice vendors to provide datasheets.
> 
> c) Reverse engineer the hardware and write our own datasheets.
> 
> Solution (a) involves denial of point (1), mostly through the use of
> analogy and allegory. Alternatively, one can try to change the law
> through government channels.
>

  good luck
 
> Solution (b) requires market pressure, charity, or visionary
> management.
> We can't exert enough market pressure currently to make much
> difference.
> Charity sometimes gives us datasheets for old hardware. Visionary
> management is the future.
> 

 - Old hardware is not interesting in most markets
 - Visionary management is rare

> Solution (c) is what we do now, with varying degrees of success. A
> good example is the R300 support in the radeon DRM module.
> 

 But the R300 does not meet 5)

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Binary Drivers

2006-12-26 Thread Martin Knoblauch

--- Trent Waddington <[EMAIL PROTECTED]> wrote:

> On 12/26/06, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> >
> >  Oh, if only for Christmas - stop this stupid car comparisons. They
> are
> > just that - utter nonsense.
> >
> >  And now lets stop the car nonsense  :-)
> 
> I agree, if you really want to talk about cars, I can relate the woes
> I've heard from mechanics about how impossible it is to service new
> model Fords these days.

 A behaviour that is not very different from the GMs, BMWs,
Daimler-Chryslers, Toyotas, "you name them" of this world.

 I never said I liked the attitude.

>  Without the engine management systems
> diagnostics devices they can't do anything.  Ford controls who gets
> these devices and demands a cut of every service, essentially setting
> the price.  Service centers that don't play ball don't get the
> devices or get the devices taken away from them if they question 
> Ford's pricing policies.  Of course, this should be illegal, and our
> governments should be enforcing antitrust laws, but Ford is a big
> company and has lots of lawyers..
>

 Actually we have/had a similar situation here in Germany. We are used
to having "licensed dealerships" which are only allowed to sell one car
brand. This might be illegal under EU law now.

> Repco and other after market manufacturers can't easily make a clone
> of these devices like they do every other part, because reverse
> engineering software is not really as advanced as reverse engineering
> spare parts.. or maybe software reverse engineering is just so much
> more expensive than automotive reverse engineering that it is not
> cost effective to clone these devices.. or maybe they're just afraid
> of the lawyers too.
> 

 Understanding software is more difficult, because you also have to
understand the working principle of the underlying hardware, which you
often have no specs for either. So you have to reverse engineer both
layers.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


How to detect multi-core and/or HT-enabled CPUs in 2.4.x and 2.6.x kernels

2006-12-27 Thread Martin Knoblauch
Hi, (please CC on replies, thanks)

 for the ganglia project (http://ganglia.sourceforge.net/) we are
trying to find a heuristic to determine the number of physical CPU
"cores" as opposed to virtual processors added by enabling HT. The
method should work on 2.4 and 2.6 kernels.

 So far it seems that looking at the "physical id", "core id" and "cpu
cores" of /proc/cpuinfo is the way to go.

 In 2.6 I would try to find the distinct "physical id"s and sum up
the corresponding "cpu cores". The question is whether this would
work for 2.4 based systems.

 Does anybody recall when the "physical id", "core id" and "cpu cores"
were added to /proc/cpuinfo ?
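
 To make the heuristic concrete, roughly what I have in mind - a
shell/awk sketch only, not the actual Ganglia code, and no claim that
it covers every /proc/cpuinfo layout out there:

#!/bin/sh
# Sum "cpu cores" once per distinct "physical id"; if those fields are
# missing (e.g. older 2.4 kernels), fall back to counting "processor"
# lines, which then includes HT siblings.
awk -F': *' '
        /^processor/   { nproc++ }
        /^physical id/ { phys = $2 }
        /^cpu cores/   { cores[phys] = $2 }
        END {
                total = 0
                for (p in cores) total += cores[p]
                if (total == 0) total = nproc
                print total
        }' /proc/cpuinfo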

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How to detect multi-core and/or HT-enabled CPUs in 2.4.x and 2.6.x kernels

2006-12-27 Thread Martin Knoblauch

--- Arjan van de Ven <[EMAIL PROTECTED]> wrote:

> On Wed, 2006-12-27 at 06:16 -0800, Martin Knoblauch wrote:
> > Hi, (please CC on replies, thanks)
> > 
> >  for the ganglia project (http://ganglia.sourceforge.net/) we are
> > trying to find a heuristics to determine the number of physical CPU
> > "cores" as opposed to virtual processors added by enabling HT. The
> > method should work on 2.4 and 2.6 kernels.
> 
> I have a counter question for you.. what are you trying to do with
> the
> "these two are SMT sibblings" information ?
> 
> Because I suspect "HT" is the wrong level of detection for what you
> really want to achieve
> 
> If you want to decide "shares caches" then at least 2.6 kernels
> directly
> export that (and HT is just the wrong way to go about this). 
> -- 
Hi Arjan,

 one piece of information that Ganglia collects for a node is the
"number of CPUs", originally meaning "physical CPUs". With the
introduction of HT and multi-core things are a bit more complex now. We
have decided that HT siblings do not qualify as "real" CPUs, while
multi-cores do.

 Currently we are doing "sysconf(_SC_NPROCESSORS_ONLN)". But this
includes both physical and virtual (HT) cores. We are looking for a
method that only shows "real iron" and works on 2.6 and 2.4 kernels.
Whether this has any practical value is a completely different question.
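
 (A trivial aside: getconf exposes the same number on the command line,
so it is easy to see on any given box what that call returns - HT
siblings included:)

getconf _NPROCESSORS_ONLN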

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How to detect multi-core and/or HT-enabled CPUs in 2.4.x and 2.6.x kernels

2006-12-27 Thread Martin Knoblauch

--- Gleb Natapov <[EMAIL PROTECTED]> wrote:

> On Wed, Dec 27, 2006 at 04:13:00PM +0100, Arjan van de Ven wrote:
> > The original p4 HT to a large degree suffered from a too small
> cache
> > that now was shared. SMT in general isn't per se all that different
> in
> > performance than dual core, at least not on a fundamental level,
> it's
> > all a matter of how many resources each thread has on average. With
> dual
> > core sharing the cache for example, that already is part HT.
> Putting the
> > "boundary" at HT-but-not-dual-core is going to be highly artificial
> and
> > while it may work for the current hardware, in general it's not a
> good
> > way of separating things (just look at the PowerPC processors,
> those are
> > highly SMT as well), and I suspect that your distinction is just
> going
> > to break all the time over the next 10 years ;) Or even today on
> the
> > current "large cache" P4 processors with HT it already breaks.
> (just
> > those tend to be the expensive models so more rare)
> > 
> If I run two threads that are doing only calculations and very little
> or no
> IO at all on the same socket will modern HT and dual core be the same
> (or close) performance wise?
> 
Hi Gleb,
 
 this is a really interesting question. Ganglia is coming [originally]
from the HPC side of computing. At least in the past HT as implemented
on XEONs did help a lot. Running two CPU+memory-bandwidth intensive
processes on the same physical CPU would at best result in a 50/50
performance split. So, knowing how many "real" CPUs are in a system is
interesting to us.

 Other workloads (like lots of java threads doing mixed IO and CPU
stuff) of course can benefit from HT.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How to detect multi-core and/or HT-enabled CPUs in 2.4.x and 2.6.x kernels

2006-12-27 Thread Martin Knoblauch

--- Gleb Natapov <[EMAIL PROTECTED]> wrote:

> > 
> If I run two threads that are doing only calculations and very little
> or no
> IO at all on the same socket will modern HT and dual core be the same
> (or close) performance wise?
> 

 actually I wanted to write that "HT as implemented on XEONs did not
help a lot for HPC workloads in the past"

Cheers
Martin

------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How to detect multi-core and/or HT-enabled CPUs in 2.4.x and 2.6.x kernels

2006-12-27 Thread Martin Knoblauch
>In article <[EMAIL PROTECTED]> you
wrote:
>> once your program (and many others) have such a check, then the next
>> step will be pressure on the kernel code to "fake" the old situation
>> when there is a processor where  no
longer
>> holds. It's basically a road to madness :-(
>
> I agree that for HPC sizing a benchmark with various levels of 
> parallelity are better. The question is, if the code in question
> only is for inventory reasons. In that case I would do something
> like x sockets, y cores and z cm threads.
>
> Bernd

 For sizing purposes, doing benchmarks is the only way. For the purpose
of Ganglia the sockets/cores/threads info is purely for inventory. And
we are likely going to add the new information to our metrics.

 But - we still need to find a way to extract the info :-)

Cheers
Martin
PS: I have likely killed the CC this time. Sorry.

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[2.6.19] NFS: server error: fileid changed

2006-12-11 Thread Martin Knoblauch
Hi, [please CC me, as I am not subscribed]

 after updating a RHEL4 box (EM64T based) to a plain 2.6.19 kernel, we
are seeing repeated occurrences of the following messages (about every
45-50 minutes).

 It is always the same server (a NetApp filer, mounted via the
user-space automounter "amd") and the expected/got numbers seem to
repeat.

 Is there a  way to find out which files are involved? Nothing seems to
be obviously breaking, but I do not like to get my logfiles filled up. 

[ 9337.747546] NFS: server nvgm022 error: fileid changed
[ 9337.747549] fsid 0:25: expected fileid 0x7a6f3d, got 0x65be80
[ 9338.020427] NFS: server nvgm022 error: fileid changed
[ 9338.020430] fsid 0:25: expected fileid 0x15f5d7c, got 0x9f9900
[ 9338.070147] NFS: server nvgm022 error: fileid changed
[ 9338.070150] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9338.338896] NFS: server nvgm022 error: fileid changed
[ 9338.338899] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9338.370207] NFS: server nvgm022 error: fileid changed
[ 9338.370210] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9338.634437] NFS: server nvgm022 error: fileid changed
[ 9338.634439] fsid 0:25: expected fileid 0x7a6f3d, got 0x22070e
[ 9338.698383] NFS: server nvgm022 error: fileid changed
[ 9338.698385] fsid 0:25: expected fileid 0x7a6f3d, got 0x352777
[ 9338.949952] NFS: server nvgm022 error: fileid changed
[ 9338.949954] fsid 0:25: expected fileid 0x15f5d7c, got 0x5988c4
[ 9339.042473] NFS: server nvgm022 error: fileid changed
[ 9339.042476] fsid 0:25: expected fileid 0x7a6f3d, got 0x9f9900
[ 9339.267338] NFS: server nvgm022 error: fileid changed
[ 9339.267341] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9339.309921] NFS: server nvgm022 error: fileid changed
[ 9339.309923] fsid 0:25: expected fileid 0x15f5d7c, got 0x65be80
[ 9339.405146] NFS: server nvgm022 error: fileid changed
[ 9339.405149] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9339.433816] NFS: server nvgm022 error: fileid changed
[ 9339.433819] fsid 0:25: expected fileid 0x15f5d7c, got 0x65be80
[ 9340.149325] NFS: server nvgm022 error: fileid changed
[ 9340.149328] fsid 0:25: expected fileid 0x7a6f3d, got 0x19bc55
[ 9340.173278] NFS: server nvgm022 error: fileid changed
[ 9340.173281] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9340.324517] NFS: server nvgm022 error: fileid changed
[ 9340.324520] fsid 0:25: expected fileid 0x15f5d7c, got 0x11c9001

Thanks
Martin


------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [2.6.19] NFS: server error: fileid changed

2006-12-11 Thread Martin Knoblauch

--- Trond Myklebust <[EMAIL PROTECTED]> wrote:

> On Mon, 2006-12-11 at 08:09 -0800, Martin Knoblauch wrote:
> > Hi, [please CC me, as I am not subscribed]
> > 
> >  after updating a RHEL4 box (EM64T based) to a plain 2.6.19 kernel,
> we
> > are seeing repeated occurences of the following messages (about
> every
> > 45-50 minutes).
> > 
> >  It is always the same server (a NetApp filer, mounted via the
> > user-space automounter "amd") and the expected/got numbers seem to
> > repeat.
> 
> Are you seeing it _without_ amd? The usual reason for the errors you
> see are bogus replay cache replies. For that reason, the kernel is
> usually very careful when initialising its value for the
> XID: we set part of it using the clock value, and part of it
> using a random number generator.
> I'm not so sure that other services are as careful.
>

 So far, we are only seeing it on amd-mounted filesystems, not on
static NFS mounts. Unfortunately, it is difficult to avoid "amd" in
our environment.
 
> >  Is there a  way to find out which files are involved? Nothing
> seems to
> > be obviously breaking, but I do not like to get my logfiles filled
> up. 
> 
> The fileid is the same as the inode number. Just convert those
> hexadecimal values into ordinary numbers, then search for them using
> 'ls
> -i'.
> 

 thanks. will check that out.
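
 For the archive, the lookup Trond describes boils down to something
like this (just a sketch; the /net/nvgm022 path is only a guess for
wherever amd mounts that filer here):

ino=$(printf '%d' 0x7a6f3d)      # 0x7a6f3d -> 8023869
find /net/nvgm022 -xdev -inum "$ino" -print 2>/dev/null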

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [2.6.19] NFS: server error: fileid changed

2006-12-11 Thread Martin Knoblauch

--- Trond Myklebust <[EMAIL PROTECTED]> wrote:

> On Mon, 2006-12-11 at 15:44 -0800, Martin Knoblauch wrote:
> >  So far, we are only seeing it on amd-mounted filesystems, not on
> > static NFS mounts. Unfortunatelly, it is difficult to avoid "amd"
> in
> > our environment.
> 
> Any chance you could try substituting a recent version of autofs?
> This
> sort of problem is more likely to happen on partitions that are
> unmounted and then remounted often. I'd just like to figure out if
> this
> is something that we need to fix in the kernel, or if it is purely an
> amd problem.
> 
> Cheers
>   Trond
> 
Hi Trond,

 unfortunately I have no control over the mounting maps, as they are
maintained by different people. So the answer is no. Unfortunately
the customer has decided on using am-utils. This has been hurting us
(and them) for years ...

 You are likely correct when you hint towards partitions which are
frequently remounted.

 In any case, your help is appreciated.

Cheers
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [2.6.19] NFS: server error: fileid changed

2006-12-15 Thread Martin Knoblauch

--- Trond Myklebust <[EMAIL PROTECTED]> wrote:

> 
> >  Is there a  way to find out which files are involved? Nothing
> seems to
> > be obviously breaking, but I do not like to get my logfiles filled
> up. 
> 
> The fileid is the same as the inode number. Just convert those
> hexadecimal values into ordinary numbers, then search for them using
> 'ls
> -i'.
> 
> Trond
> 
> > [ 9337.747546] NFS: server nvgm022 error: fileid changed
> > [ 9337.747549] fsid 0:25: expected fileid 0x7a6f3d, got 0x65be80
Hi Trond, 

 just curious: how is the fsid related to mounted filesystems? What
does "0:25" stand for?

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


File-locking problems with RHEL4 kernel (2.6.9-42.0.3.ELsmp) under high load

2006-11-27 Thread Martin Knoblauch
Hi,

 first of all, yes - I know that this kernel is very old and it is not
an official LKML kernel. No need to tell me, no need to waste bandwidth
by telling me :-) I just post here, because I got no response
"elsewhere".

 Second - please CC me on any reply, as I am not subscribed.

 OK. Here is the problem. Said RHEL4 kernel seems to have problems with
file-locking when the system is under high, likely network related,
load. The symptoms are that things using file locking (rpm, the
user-space automounter amd) fail to obtain locks, usually reporting
timeout problems.

 The system in question is a HP/DL380G4 with two single-core EM64T CPUs
and 8GB of memory. The network interfaces are "tg3".

 The high load can be triggered by copying three 3 GB files in parallel
from an NFS server (Solaris10, NFS, TCP, 1GBit) to another NFS server
(RHEL4, NFS, TCP, 100 MBit). The measured network performance is OK.
During this operation the system goes to loads around/above 10.
Overall responsiveness feels good, but software doing file-locking or
opening a new ssh connection takes extremely long.

 So, if anyone has an idea or hint, it will be highly appreciated.

Cheers
Martin

----------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch

--- Daniel J Blueman <[EMAIL PROTECTED]> wrote:

> On 5 Jul, 16:50, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> >  for a customer we are operating a rackful of HP/DL380/G4 boxes
> that
> > have given us some problems with system responsiveness under [I/O
> > triggered] system load.
> [snip]
> 
> IIRC, the locking in the CCISS driver was pretty heavy until later in
> the 2.6 series (2.6.16?) kernels; I don't think they were backported
> to the 1000 or so patches that comprise RH EL 4 kernels.
> 
> With write performance being really poor on the Smartarray
> controllers
> without the battery-backed write cache, and with less-good locking,
> performance can really suck.
> 
> On a total quiescent hp DL380 G2 (dual PIII, 1.13GHz Tualatin 512KB
> L2$) running RH EL 5 (2.6.18) with a 32MB SmartArray 5i controller
> with 6x36GB 10K RPM SCSI disks and all latest firmware:
> 
> # dd if=/dev/cciss/c0d0p2 of=/dev/zero bs=1024k count=1000
> 509+1 records in
> 509+1 records out
> 534643200 bytes (535 MB) copied, 11.6336 seconds, 46.0 MB/s
> 
> # dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 22.3091 seconds, 4.7 MB/s
> 
> Oh dear! There are internal performance problems with this
> controller.
> The SmartArray 5i in the newer DL380 G3 (dual P4 2.8GHz, 512KB L2$)
> is
> perhaps twice the read performance (PCI-X helps some) but still
> sucks.
> 
> I'd get the BBWC in or install another controller.
> 
Hi Daniel,

 thanks for the suggestion. The DL380g4 boxes have the "6i" and all
systems are equipped with the BBWC (192 MB, split 50/50).

 The thing is not really a speed demon, but sufficient for the task.

 The problem really seems to be related to the VM system not writing
out dirty pages early enough and then getting into trouble when the
pressure gets too high.

Cheers
Martin



--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-09 Thread Martin Knoblauch

--- Jesper Juhl <[EMAIL PROTECTED]> wrote:

> On 05/07/07, Jesper Juhl <[EMAIL PROTECTED]> wrote:
> > On 05/07/07, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> >
> > I'd suspect you can't get both at 100%.
> >
> > I'd guess you are probably using a 100Hz no-preempt kernel.  Have
> you
> > tried a 1000Hz + preempt kernel?   Sure, you'll get a bit lower
> > overall throughput, but interactive responsiveness should be better
> -
> > if it is, then you could experiment with various combinations of
> > CONFIG_PREEMPT, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT_NONE and
> > CONFIG_HZ_1000, CONFIG_HZ_300, CONFIG_HZ_250, CONFIG_HZ_100 to see
> > what gives you the best balance between throughput and interactive
> > responsiveness (you could also throw CONFIG_PREEMPT_BKL and/or
> > CONFIG_NO_HZ, but I don't think the impact will be as significant
> as
> > with the other options, so to keep things simple I'd leave those
> out
> > at first) .
> >
> > I'd guess that something like CONFIG_PREEMPT_VOLUNTARY +
> CONFIG_HZ_300
> > would probably be a good compromise for you, but just to see if
> > there's any effect at all, start out with CONFIG_PREEMPT +
> > CONFIG_HZ_1000.
> >
> 
> I'm currious, did you ever try playing around with CONFIG_PREEMPT*
> and
> CONFIG_HZ* to see if that had any noticable impact on interactive
> performance and stuff like logging into the box via ssh etc...?
> 
> -- 
> Jesper Juhl <[EMAIL PROTECTED]>
> Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please  http://www.expita.com/nomime.html
> 
> 
Hi Jesper,

 my initial kernel was [EMAIL PROTECTED] I have switched to 300HZ, but
have not observed much difference. The config is now:

config-2.6.22-rc7:# CONFIG_PREEMPT_NONE is not set
config-2.6.22-rc7:CONFIG_PREEMPT_VOLUNTARY=y
config-2.6.22-rc7:# CONFIG_PREEMPT is not set
config-2.6.22-rc7:CONFIG_PREEMPT_BKL=y

Cheers


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/17] per device dirty throttling -v7

2007-07-18 Thread Martin Knoblauch
Miklos Szeredi wrote:

>> Latest version of the per bdi dirty throttling patches.
>>
>> Most of the changes since last time are little cleanups and more
>> detail in the split out of the floating proportion into their
>> own little lib.
>>
>> Patches are against 2.6.22-rc4-mm2
>>
>> A rollup of all this against 2.6.21 is available here:
>>
http://programming.kicks-ass.net/kernel-patches/balance_dirty_pages/2.6.21-per_bdi_dirty_pages.patch
>>
>> This patch-set passes the starve an USB stick test..
>
>I've done some testing of several problem cases.

 just curious - what are the plans towards inclusion in mainline?

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Linux 2.6.22-rc7

2007-07-03 Thread Martin Knoblauch
>Ok, Linux-2.6.22-rc7 is out there.
>
>It's hopefully (almost certainly) the last -rc before the final 2.6.22
>release, and we should be in pretty good shape. The flow of patches
has
>really slowed down and the regression list has shrunk a lot.
>
>The shortlog/diffstat reflects that, with the biggest part of the -rc7
>patch being literally just a power defconfig update.
>
>The patches are mostly trivial fixes, a few new device ID's, and the
>appended shortlog really does pretty much explain it.
>
>Final testing always appreciated, of course,
>
>Linus

 For what it is worth - rc7 compiles and boots here
(HP/DL380G4, 2x86_64, 8GB, cciss, 2xtg3). The subjective feeling(*) is
much better than with the original RHEL4 kernel and better than 2.6.19
on the same box.

(*) Our main problem with 2.6 kernels so far is a tendency towards
really bad responsiveness under I/O related load.

Cheers
Martin

------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Understanding I/O behaviour

2007-07-05 Thread Martin Knoblauch
Hi,

 for a customer we are operating a rackful of HP/DL380/G4 boxes that
have given us some problems with system responsiveness under [I/O
triggered] system load.

 The systems in question have the following HW:

2x Intel/EM64T CPUs
8GB memory
CCISS Raid controller with 4x72GB SCSI disks as RAID5
2x BCM5704 NIC (using tg3)

 The distribution is RHEL4. We have tested several kernels including
the original 2.6.9, 2.6.19.2, 2.6.22-rc7 and 2.6.22-rc7+cfs-v18.

 One part of the workload is when several processes try to write 5 GB
each to the local filesystem (ext2->LVM->CCISS). When this happens, the
load goes up to 12 and responsiveness goes down. This means that from one
moment to the next things like opening an ssh connection to the host in
question, or doing "df", take forever (minutes). Especially bad with the
vendor kernel, better (but not perfect) with 2.6.19 and 2.6.22-rc7.

 The load basically comes from the writing processes and up to 12
"pdflush" threads all being in "D" state.

 So, what I would like to understand is how we can maximize the
responsiveness of the system, while keeping disk throughput at maximum.

 During my investigation I basically performed the following test,
because it represents the kind of situation that causes trouble:


$ cat dd3.sh
echo "Start 3 dd processes: "`date`
dd if=/dev/zero of=/scratch/X1 bs=1M count=5000&
dd if=/dev/zero of=/scratch/X2 bs=1M count=5000&
dd if=/dev/zero of=/scratch/X3 bs=1M count=5000&
wait
echo "Finish 3 dd processes: "`date`
sync
echo "Finish sync: "`date`
rm -f /scratch/X?
echo "Files removed: "`date`


 This results in the following timings. All with the anticipatory
scheduler, because it gives the best results:

2.6.19.2, HT: 10m
2.6.19.2, non-HT: 8m45s
2.6.22-rc7, HT: 10m
2.6.22-rc7, non-HT: 6m
2.6.22-rc7+cfs_v18, HT: 10m40s
2.6.22-rc7+cfs_v18, non-HT: 10m45s

 The "felt" responsiveness was best with the last two kernels, although
the load profile over time looks identical in all cases.

 So, a few questions:

a) any idea why disabling HT improves throughput, except for the cfs
kernels? For plain 2.6.22 the difference is quite substantial
b) any ideas how to optimize the settings of the /proc/sys/vm/
parameters? The documentation is a bit thin here.

Thanks in advance
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch

--- Jesper Juhl <[EMAIL PROTECTED]> wrote:

> On 06/07/07, Robert Hancock <[EMAIL PROTECTED]> wrote:
> [snip]
> >
> > Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
> > helps. This workload will fill up memory with dirty data very
> quickly,
> > and it seems like system responsiveness often goes down the toilet
> when
> > this happens and the system is going crazy trying to write it all
> out.
> >
> 
> Perhaps trying out a different elevator would also be worthwhile.
> 

 AS seems to be the best one (NOOP and deadline seem to be equally OK).
CFQ gives about 10-15% less throughput, except for the kernel with the
cfs cpu scheduler, where CFQ is on par with the other IO schedulers.

Thanks
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch

--- Robert Hancock <[EMAIL PROTECTED]> wrote:

> 
> Try playing with reducing /proc/sys/vm/dirty_ratio and see how that 
> helps. This workload will fill up memory with dirty data very
> quickly, 
> and it seems like system responsiveness often goes down the toilet
> when 
> this happens and the system is going crazy trying to write it all
> out.
> 

 Definitely the "going crazy" part is the worst problem I see with 2.6
based kernels (late 2.4 was really better in this corner case).

 I am just now playing with dirty_ratio. Does anybody know what the lower
limit is? "0" seems acceptable, but does it actually imply "write out
immediately"?

 Another problem: the VM parameters are not really well documented in
their behaviour and interdependence.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch
>>b) any ideas how to optimize the settings of the /proc/sys/vm/
>>parameters? The documentation is a bit thin here.
>>
>>
>I cant offer any advice there, but is raid-5 really the best choice
>for your needs? I would not choose raid-5 for a system that is
>regularly performing lots of large writes at the same time, dont
>forget that each write can require several reads to recalculate the
>partity.
>
>Does the raid card have much cache ram?
>

 192 MB, split 50/50 between read and write.

>If you can afford to loose some space raid-10 would probably perform
>better.

 RAID5 most likely is not the best solution and I would not use it if
the described use-case was happening all the time. It happens a few
times a day and then things go down when all memory is filled with
page-cache.

 And the same also happens when copying large amounts of data from one
NFS mounted FS to another NFS mounted FS. No disk is involved there.
Memory fills with page-cache until it reaches a ceiling and then for
some time responsiveness is really, really bad.

 I am just now playing with the dirty_* stuff. Maybe it helps.

Cheers
Martin



------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch
Martin Knoblauch wrote:
>--- Robert Hancock <[EMAIL PROTECTED]> wrote:
>
>>
>> Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
>> helps. This workload will fill up memory with dirty data very
>> quickly,
>> and it seems like system responsiveness often goes down the toilet
>> when
>> this happens and the system is going crazy trying to write it all
>> out.
>>
>
>Definitely the "going crazy" part is the worst problem I see with 2.6
>based kernels (late 2.4 was really better in this corner case).
>
>I am just now playing with dirty_ratio. Anybody knows what the lower
>limit is? "0" seems acceptabel, but does it actually imply "write out
>immediatelly"?
>
>Another problem, the VM parameters are not really well documented in
>their behaviour and interdependence.

 Lowering dirty_ratio just leads to a more imbalanced write-speed for
the three dd's. Even when lowering the number to 0, the high load
stays.

 Now, in another experiment I mounted the FS with "sync". And now the
load stays below/around 3. No more "pdflush" daemons going wild. And
the responsiveness is good, with no drops.

 My question is now: is there a parameter that one can use to force
immediate writeout for every process? This may hurt overall performance
of the system, but might really help my situation. Setting dirty_ratio
to 0 does not seem to do it.
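
 For completeness, a sketch of the two knobs that seem relevant here
(values picked more or less arbitrarily):

# synchronous writes on the scratch FS - what gave the load of ~3 above
mount -o remount,sync /scratch

# alternatively, make pdflush start writing dirty data out much earlier
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
echo 200 > /proc/sys/vm/dirty_expire_centisecs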

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch
Brice Figureau wrote:

>> CFQ gives less (about 10-15%) throughput except for the kernel
>> with the
>> cfs cpu scheduler, where CFQ is on par with the other IO
>> schedulers.
>>
>
>Please have a look to kernel bug #7372:
>http://bugzilla.kernel.org/show_bug.cgi?id=7372
>
>It seems I encountered the almost same issue.
>
>The fix on my side, beside running 2.6.17 (which was working fine
>for me) was to:
>1) have /proc/sys/vm/vfs_cache_pressure=1
>2) have /proc/sys/vm/dirty_ratio=1 and 
> /proc/sys/vm/dirty_background_ratio=1
>3) have /proc/sys/vm/swappiness=2
>4) run Peter Zijlstra: per dirty device throttling patch on the
> top of 2.6.21.5:
>http://www.ussg.iu.edu/hypermail/linux/kernel/0706.1/2776.html

Brice,

 any of them sufficient, or all of them needed together? Just to avoid
confusion.
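
 (In the meantime, a minimal sketch of how 1)-3) could be applied at
runtime for a test run - 4) obviously needs the patch itself:)

echo 1 > /proc/sys/vm/vfs_cache_pressure
echo 1 > /proc/sys/vm/dirty_ratio
echo 1 > /proc/sys/vm/dirty_background_ratio
echo 2 > /proc/sys/vm/swappiness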

Cheers
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 00/23] per device dirty throttling -v9

2007-08-23 Thread Martin Knoblauch

--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> 
> > Peter,
> > 
> >  any chance to get a rollup against 2.6.22-stable?
> > 
> >  The 2.6.23 series may not be usable for me due to the
> > nosharedcache changes for NFS (the new default will massively
> > disturb the user-space automounter).
> 
> I'll see what I can do, bit busy with other stuff atm, hopefully
> after
> the weekend.
> 
Hi Peter,

 any progress on a version against 2.6.22.5? I have seen the very
positive report from Jeffrey W. Baker and would really love to test
your patch. But as I said, anything newer than 2.6.22.x might not be an
option due to the NFS changes.

Kind regards
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 00/23] per device dirty throttling -v9

2007-08-24 Thread Martin Knoblauch

--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> > --- Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > > 
> > > > Peter,
> > > > 
> > > >  any chance to get a rollup against 2.6.22-stable?
> > > > 
> > > >  The 2.6.23 series may not be usable for me due to the
> > > > nosharedcache changes for NFS (the new default will massively
> > > > disturb the user-space automounter).
> > > 
> > > I'll see what I can do, bit busy with other stuff atm, hopefully
> > > after
> > > the weekend.
> > > 
> > Hi Peter,
> > 
> >  any progress on a version against 2.6.22.5? I have seen the very
> > positive report from Jeffrey W. Baker and would really love to test
> > your patch. But as I said, anything newer than 2.6.22.x might not
> be an
> > option due to the NFS changes.
> 
> mindless port, seems to compile and boot on my test box ymmv.
> 
> I think .5 should not present anything other than trivial rejects if
> anything. But I'm not keeping -stable in my git remotes so I can't
> say
> for sure.

Hi Peter,

 thanks a lot. It applies to 2.6.22.5 almost cleanly, with just one
8-line offset in readahead.c.

 I will report testing results separately.

Thanks
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/