Re: [Patch] Output of L1,L2 and L3 cache sizes to /proc/cpuinfo
Tomas Telensky wrote:
> > On 21 May 2001, H. Peter Anvin wrote:
> > > Followup to: <[EMAIL PROTECTED]>
> > By author: "Martin.Knoblauch" <[EMAIL PROTECTED]>
> > In newsgroup: linux.dev.kernel
> > >
> > > Hi,
> > >
> > > while trying to enhance a small hardware inventory script, I found that
> > > cpuinfo is missing the details of L1, L2 and L3 size, although they may
> > > be available at boot time. One could of course grep them from "dmesg"
> > > output, but that may scroll away on long-lived systems.
> > >
> > Any particular reason this needs to be done in the kernel, as opposed
>
> It is already done in the kernel, because it is already being displayed :)
> So, once evaluated, why not give it to /proc/cpuinfo. I think it makes
> sense and puts things in order.
>

That came to my mind as a pro argument as well. The work is already done in
setup.c, so why not expose it at the same place where the other stuff is.
After all, it is just a more detailed output of the already available
"cache size" line.

Martin

PS: At least, I am not being ignored :-) No need for me to complain...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
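For context: the aggregate "cache size" line is already exported by /proc/cpuinfo, while the per-level L1/L2/L3 details are only printed at boot, which is what the inventory script runs into. A rough sketch of the status quo (the dmesg wording varies by CPU and kernel version, so the grep pattern is only an assumption):

    # the single aggregate line /proc/cpuinfo already carries (one per CPU)
    grep -m1 'cache size' /proc/cpuinfo

    # the per-level details are only in the boot messages and may have
    # scrolled out of the ring buffer on long-lived systems
    dmesg | grep -i cache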
Re: [Patch] Output of L1,L2 and L3 cache sizes to /proc/cpuinfo
"H. Peter Anvin" wrote: > > "Martin.Knoblauch" wrote: > > > > After some checking, I could have made the answer a bit less terse: > > > > - it would require that the kernel is compiled with cpuid [module] > > support > > - not everybody may want enable this, just for getting one or two > > harmless numbers. > > If so, then that's their problem. We're not here to solve the problem of > stupid system administrators. > They may not be stupid, just mislead :-( When Intel created the "cpuid" Feature some way along the P3 line, they gave a stupid reason for it and created a big public uproar. As silly as I think that was (on both sides), the term "cpuid" is tainted. Some people just fear it like hell. Anyway. > > - you would need a utility with root permission to analyze the cpuid > > info. The > > cahce info does not seem to be there in clear ascii. > > Bullsh*t. /dev/cpu/%d/cpuid is supposed to be mode 444 (world readable.) > Thanks you :-) In any case, on my system (Suse 7.1) the files are mode 400. Martin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Head parking (was: IBM HDAPS things are looking up)
On Thu, Jul 07 2005, Pekka Enberg wrote:
> Jens Axboe wrote:
> > > ATA7 defines a park maneuver, I don't know how well supported it is
> > > yet though. You can test with this little app, if it says 'head
> > > parked' it works. If not, it has just idled the drive.
>
> On 7/7/05, Lenz Grimmer <[EMAIL PROTECTED]> wrote:
> > Great! Thanks for digging this up - it works on my T42, using a Fujitsu
> > MHT2080AH drive:
>
> Works on my T42p which uses a Hitachi HTS726060M9AT00 drive. I don't
> hear any sound, though.

Interesting. Same notebook, same drive. The program says "not parked" :-(
This is on FC2 with a pretty much vanilla 2.6.9 kernel.

[EMAIL PROTECTED] tmp]# uname -a
Linux l15833 2.6.9-noagp #1 Wed May 4 16:09:14 CEST 2005 i686 i686 i386 GNU/Linux

[EMAIL PROTECTED] tmp]# hdparm -i /dev/hda

/dev/hda:

 Model=HTS726060M9AT00, FwRev=MH4OA6BA, SerialNo=MRH403M4GS88XB
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=DualPortCache, BuffSize=7877kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117210240
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
 Drive conforms to: ATA/ATAPI-6 T13 1410D revision 3a:

 * signifies the current active mode

[EMAIL PROTECTED] tmp]# ./park /dev/hda
head not parked 4c
[EMAIL PROTECTED] tmp]#

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
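Jens' little test program is not reproduced in this digest. For what it is worth, later kernels (libata with shock-protection support, which does not apply to the 2.6.9/2.6.12 kernels discussed here) expose the same IDLE IMMEDIATE / UNLOAD idea generically through sysfs, so a rough equivalent of the park test on such a kernel would be:

    # ask the drive to park its heads for 3 seconds (value in milliseconds);
    # writing 0 unparks early
    echo 3000 > /sys/block/sda/device/unload_heads

    # reading back reports the milliseconds left on the park timer
    cat /sys/block/sda/device/unload_heads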
Re: Head parking (was: IBM HDAPS things are looking up)
--- Pekka Enberg <[EMAIL PROTECTED]> wrote:
>
> Martin, don't trim the cc!
>

Sorry about that, but I did not have the CC at the time of reply. I read LKML
from the archives and respond by cut and paste.

Martin
----------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Head parking (was: IBM HDAPS things are looking up)
--- Pekka Enberg <[EMAIL PROTECTED]> wrote:
> On 7/7/05, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > Interesting. Same notebook, same drive. The program says "not parked"
> > :-( This is on FC2 with a pretty much vanilla 2.6.9 kernel.
> >
> > [EMAIL PROTECTED] tmp]# hdparm -i /dev/hda
> >
> > /dev/hda:
> >
> >  Model=HTS726060M9AT00, FwRev=MH4OA6BA, SerialNo=MRH403M4GS88XB
>
> haji ~ # hdparm -i /dev/hda
>
> /dev/hda:
>
>  Model=HTS726060M9AT00, FwRev=MH4OA6DA, SerialNo=MRH453M4H2A6PB

OK, different FW levels. After upgrading my disk to MH4OA6GA my head parks :-)
Minimum required level for this disk seems to be A6DA. Hope this info is useful.

[EMAIL PROTECTED] tmp]# hdparm -i /dev/hda

/dev/hda:

 Model=HTS726060M9AT00, FwRev=MH4OA6GA, SerialNo=MRH403M4GS88XB
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=DualPortCache, BuffSize=7877kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117210240
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
 Drive conforms to: ATA/ATAPI-6 T13 1410D revision 3a:

 * signifies the current active mode

[EMAIL PROTECTED] tmp]# ./park /dev/hda
head parked
[EMAIL PROTECTED] tmp]#

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: Head parking (was: IBM HDAPS things are looking up)
--- Alejandro Bonilla <[EMAIL PROTECTED]> wrote:
> > --- Pekka Enberg <[EMAIL PROTECTED]> wrote:
> > > On 7/7/05, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > > > Interesting. Same notebook, same drive. The program says "not parked"
> > > > :-( This is on FC2 with a pretty much vanilla 2.6.9 kernel.
> > > >
> > > > [EMAIL PROTECTED] tmp]# hdparm -i /dev/hda
> > > >
> > > > /dev/hda:
> > > >
> > > >  Model=HTS726060M9AT00, FwRev=MH4OA6BA, SerialNo=MRH403M4GS88XB
> > >
> > > haji ~ # hdparm -i /dev/hda
> > >
> > > /dev/hda:
> > >
> > >  Model=HTS726060M9AT00, FwRev=MH4OA6DA, SerialNo=MRH453M4H2A6PB
> >
> > OK, different FW levels. After upgrading my disk to MH4OA6GA my head
> > parks :-) Minimum required level for this disk seems to be A6DA. Hope
> > this info is useful.
>
> Martin,
>
> Simply upgrading your firmware fixed your problem for being able to park
> the head?
>

Yup. Do not forget that FW is very powerful. Likely the parking feature was
added after A6BA.

Basically I saw that the only difference between me and Pekka was the FW
(discounting the different CPU speed and kernel version). I googled around and
found the IBM FW page at:

http://www-306.ibm.com/pc/support/site.wss/document.do?sitestyle=ibm&lndocid=MIGR-41008

Download is simple, just don't use the "IBM Download Manager". The main problem
is that one needs a bootable floppy drive and "the other OS" to create a
bootable floppy. It would be great if IBM could provide floppy images for use
with "dd" for the poor Linux users.

Then I pondered over the risk involved with the update. Curiosity won :-) And
now the head parks. BUT - I definitely do not encourage anybody to perform the
procedure. Do it at your own risk after thinking about the possible
consequences ...

Anyway, someone reported a non-working HTS548040M9AT00 with FW revision
MG2OA53A. The newest revision, from the same floppy image, is A5HA.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [Hdaps-devel] RE: Head parking (was: IBM HDAPS things are looking up)
--- Dave Hansen <[EMAIL PROTECTED]> wrote: > On Thu, 2005-07-07 at 10:14 -0700, Martin Knoblauch wrote: > > Basically I saw that the only difference between me and Pekka was > the > > FW (discounting the different CPU speed and Kernel version). I > googled > > around and found the IBM FW page at: > > > > > http://www-306.ibm.com/pc/support/site.wss/document.do?sitestyle=ibm&lndocid=MIGR-41008 > > > > Download is simple, just don't use the "IBM Download Manager". > Main > > problem is that one needs a bootable floopy drive and "the other > OS" to > > create a bootable floppy. It would be great if IBM could provide > floppy > > images for use with "dd" for the poor Linux users. > > Did you really need to make 18 diskettes? > yikes - no !! :-) Somewhere on that page there is a table that tells you which of the 18 floppies is for your disk. In my case it was #13. > I have the feeling that this will work for many T4[012]p? users: > > http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=TPAD-HDFIRM > Yeah, I think that is the "DA" version. You still need "the other OS", although you don't need the floppy. If IBM would provide a CD image (bootable ISO) containing FW for all supported drives - that would be great. No need for the "other OS" any more. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Hdaps-devel] RE: Head parking (was: IBM HDAPS things are looking up)
--- Dave Hansen <[EMAIL PROTECTED]> wrote: > > I have the feeling that this will work for many T4[012]p? users: > > http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=TPAD-HDFIRM > Actually, I think your feeling is wrong. Looking at the readme.txt it seems version 7.1 of the upgrade floppy has the "BA" firmware that I had on my disk in the beginning (not parking the heads). Cheers Martin ------ Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Hdaps-devel] RE: Head parking (was: IBM HDAPS things are looking up)
--- Erik Mouw <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 07, 2005 at 11:45:38AM -0700, Martin Knoblauch wrote:
> > If IBM would provide a CD image (bootable ISO) containing FW for all
> > supported drives - that would be great. No need for the "other OS" any
> > more.
>
> I can imagine IBM doesn't do that because in that way you can't update
> the firmware of the CD/DVD drive. Bootable FreeDOS floppy images would
> be a nice idea, though.
>
> now, this is getting off-topic.

The CD image I proposed would be only for the hard disks.

Bootable FreeDOS floppy images that one could just "dd" onto a floppy would be
great, because they eliminate the need for "the other OS", but you still need a
floppy drive. I am not sure how many notebook owners actually have one. The
hardest part in my FW upgrade was actually finding a drive in our company.

Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
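If such raw images existed, the user side really would be this small (a sketch; the image file name is made up, and everything on the floppy is overwritten):

    dd if=hdd-fw-update.img of=/dev/fd0 bs=1440k count=1
    sync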
Re: regression: 100% io-wait with 2.6.24-rcX
--- Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Fri, 18 Jan 2008, Mel Gorman wrote: > > > > Right, and this is consistent with other complaints about the PFN > > of the page mattering to some hardware. > > I don't think it's actually the PFN per se. > > I think it's simply that some controllers (quite probably affected by > both driver and hardware limits) have some subtle interactions with > the size of the IO commands. > > For example, let's say that you have a controller that has some limit > X on the size of IO in flight (whether due to hardware or driver > issues doesn't really matter) in addition to a limit on the size > of the scatter-gather size. They all tend to have limits, and > they differ. > > Now, the PFN doesn't matter per se, but the allocation pattern > definitely matters for whether the IO's are physically > contiguous, and thus matters for the size of the scatter-gather > thing. > > Now, generally the rule-of-thumb is that you want big commands, so > physical merging is good for you, but I could well imagine that the > IO limits interact, and end up hurting each other. Let's say that a > better allocation order allows for bigger contiguous physical areas, > and thus fewer scatter-gather entries. > > What does that result in? The obvious answer is > > "Better performance obviously, because the controller needs to do > fewer scatter-gather lookups, and the requests are bigger, because > there are fewer IO's that hit scatter-gather limits!" > > Agreed? > > Except maybe the *real* answer for some controllers end up being > > "Worse performance, because individual commands grow because they > don't hit the per-command limits, but now we hit the global > size-in-flight limits and have many fewer of these good commands in > flight. And while the commands are larger, it means that there > are fewer outstanding commands, which can mean that the disk > cannot scheduling things as well, or makes high latency of command > generation by the controller much more visible because there aren't > enough concurrent requests queued up to hide it" > > Is this the reason? I have no idea. But somebody who knows the > AACRAID hardware and driver limits might think about interactions > like that. Sometimes you actually might want to have smaller > individual commands if there is some other limit that means that > it can be more advantageous to have many small requests over a > few big onees. > > RAID might well make it worse. Maybe small requests work better > because they are simpler to schedule because they only hit one > disk (eg if you have simple striping)! So that's another reason > why one *large* request may actually be slower than two requests > half the size, even if it's against the "normal rule". > > And it may be that that AACRAID box takes a big hit on DIO > exactly because DIO has been optimized almost purely for making > one command as big as possible. > > Just a theory. > > Linus just to make one thing clear - I am not so much concerned about the performance of AACRAID. It is OK with or without Mel's patch. It is better with Mel's patch. The regression in DIO compared to 2.6.19.2 is completely independent of Mel's stuff. What interests me much more is the behaviour of the CCISS+LVM based system. Here I see a huge benefit of reverting Mel's patch. I dirtied the system after reboot as Mel suggested (24 parallel kernel build) and repeated the tests. The dirtying did not make any difference. 
Here are the results:

Test      -rc8      -rc8 without Mel's patch
dd1       57        94
dd1-dir   87        86
dd2       2x8.5     2x45
dd2-dir   2x43      2x43
dd3       3x7       3x30
dd3-dir   3x28.5    3x28.5
mix3      59,2x25   98,2x24

The big IO size with Mel's patch really has a devastating effect on the
parallel writes. Nowhere near the values one would expect, while the numbers
are perfect without Mel's patch, as in rc1-rc5. Too bad I did not see this
earlier. Maybe we could have found a solution for .24.

At least, rc1-rc5 have shown that the CCISS system can do well. Now the
question is which part of the system does not cope well with the larger IO
sizes? Is it the CCISS controller, LVM, or both? I am open to suggestions on
how to debug that.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
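One low-impact way to narrow down the "which layer" question, not taken from the original thread but a reasonable next step, is to watch the request sizes that actually reach the cciss device while one of the dd tests runs (device naming may differ; avgrq-sz is reported in 512-byte sectors):

    # average request size and queue utilisation per second
    iostat -x 1 cciss/c0d0

    # or capture the full request stream for offline analysis; needs the
    # blktrace tools and a kernel built with CONFIG_BLK_DEV_IO_TRACE
    blktrace -d /dev/cciss/c0d0 -o - | blkparse -i -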
Re: [PATCH] writeback: speed up writeback of big dirty files
Original Message > From: Fengguang Wu <[EMAIL PROTECTED]> > To: Linus Torvalds <[EMAIL PROTECTED]> > Cc: Mike Snitzer <[EMAIL PROTECTED]>; Martin Knoblauch <[EMAIL PROTECTED]>; > Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL > PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL > PROTECTED]> > Sent: Thursday, January 17, 2008 6:28:18 AM > Subject: [PATCH] writeback: speed up writeback of big dirty files > > On Jan 16, 2008 9:15 AM, Martin Knoblauch > > wrote: > > Fengguang's latest writeback patch applies cleanly, builds, boots > on > 2.6.24-rc8. > > Linus, if possible, I'd suggest this patch be merged for 2.6.24. > > It's a safer version of the reverted patch. It was tested on > ext2/ext3/jfs/xfs/reiserfs and won't 100% iowait even without the > other bug fixing patches. > > Fengguang > --- > > writeback: speed up writeback of big dirty files > > After making dirty a 100M file, the normal behavior is to > start the writeback for all data after 30s delays. But > sometimes the following happens instead: > > - after 30s:~4M > - after 5s: ~4M > - after 5s: all remaining 92M > > Some analyze shows that the internal io dispatch queues goes like this: > > s_ios_more_io > - > 1)100M,1K 0 > 2)1K 96M > 3)0 96M > 1) initial state with a 100M file and a 1K file > 2) 4M written, nr_to_write <= 0, so write more > 3) 1K written, nr_to_write > 0, no more writes(BUG) > nr_to_write > 0 in (3) fools the upper layer to think that data > have > all been > written out. The big dirty file is actually still sitting in > s_more_io. > We > cannot simply splice s_more_io back to s_io as soon as s_io > becomes > empty, and > let the loop in generic_sync_sb_inodes() continue: this may > starve > newly > expired inodes in s_dirty. It is also not an option to draw > inodes > from both > s_more_io and s_dirty, an let the loop go on: this might lead to > live > locks, > and might also starve other superblocks in sync time(well kupdate > may > still > starve some superblocks, that's another bug). > We have to return when a full scan of s_io completes. So nr_to_write > > > 0 does > not necessarily mean that "all data are written". This patch > introduces > a flag > writeback_control.more_io to indicate that more io should be done. > With > it the > big dirty file no longer has to wait for the next kupdate invocation > 5s > later. > > In sync_sb_inodes() we only set more_io on super_blocks we > actually > visited. > This aviods the interaction between two pdflush deamons. > > Also in __sync_single_inode() we don't blindly keep requeuing the io > if > the > filesystem cannot progress. Failing to do so may lead to 100% iowait. > > Tested-by: Mike Snitzer > Signed-off-by: Fengguang Wu > --- > fs/fs-writeback.c | 18 -- > include/linux/writeback.h |1 + > mm/page-writeback.c |9 ++--- > 3 files changed, 23 insertions(+), 5 deletions(-) > > --- linux.orig/fs/fs-writeback.c > +++ linux/fs/fs-writeback.c > @@ -284,7 +284,17 @@ __sync_single_inode(struct inode *inode, > * soon as the queue becomes uncongested. 
> */ > inode->i_state |= I_DIRTY_PAGES; > -requeue_io(inode); > +if (wbc->nr_to_write <= 0) { > +/* > + * slice used up: queue for next turn > + */ > +requeue_io(inode); > +} else { > +/* > + * somehow blocked: retry later > + */ > +redirty_tail(inode); > +} > } else { > /* > * Otherwise fully redirty the inode so that > @@ -479,8 +489,12 @@ sync_sb_inodes(struct super_block *sb, s > iput(inode); > cond_resched(); > spin_lock(&inode_lock); > -if (wbc->nr_to_write <= 0) > +if (wbc->nr_to_write <= 0) { > +wbc->more_io = 1; > break; > +} > +if (!list_empty(&sb->s_more_io)) > +wbc->more_io = 1; > } > return;/* Leave any unwritten inodes on s_io */ > } > --- linux.orig/include/linux/writeback.h
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Linus Torvalds <[EMAIL PROTECTED]>
> Cc: Mel Gorman <[EMAIL PROTECTED]>; Martin Knoblauch <[EMAIL PROTECTED]>;
>     Fengguang Wu <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>;
>     [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>;
>     linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>;
>     [EMAIL PROTECTED]
> Sent: Friday, January 18, 2008 11:47:02 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> > I can fire up 2.6.24-rc8 in short order to see if things are vastly
> > improved (as Martin seems to indicate that he is happy with AACRAID on
> > 2.6.24-rc8). Although even Martin's AACRAID numbers from 2.6.19.2 are
> > still quite good (relative to mine). Martin, can you share any tuning
> > you may have done to get AACRAID to where it is for you right now?

Mike,

I have always been happy with the AACRAID box compared to the CCISS system.
Even with the "regression" in 2.6.24-rc1..rc5 it was more than acceptable to
me. For me the differences between 2.6.19 and 2.6.24-rc8 on the AACRAID setup
are:

- 11% (single stream) to 25% (dual/triple stream) regression in DIO. Something
  I do not care much about. I just measure it for reference.
+ the very nice behaviour when writing to different targets (mix3), which I
  attribute to Peter's per-bdi stuff.

And until -rc6 I was extremely pleased with the cool speedup I saw on my CCISS
boxes. This would have been the next "production" kernel for me. But let's
discuss this under a separate topic. It has nothing to do with the original
wait-io issue.

Oh, before I forget: there has been no tuning for the AACRAID. The system is an
IBM x3650 with built-in AACRAID and battery-backed write cache. The disks are
6x142GB/15krpm in a RAID5 setup.

I see one big difference between your and my tests. I do 1MB writes to simulate
the behaviour of the real applications, while yours seem to be much smaller.

Cheers
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
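For the record, the load pattern described here boils down to something like the following (a sketch; path, size and the dd invocation are placeholders, and oflag=direct is what separates the DIO runs from the buffered ones):

    # buffered 1 MB writes (the "dd1" style single-stream test)
    dd if=/dev/zero of=/scratch/testfile bs=1M count=5000

    # the same stream through O_DIRECT, bypassing the page cache ("dd1-dir")
    dd if=/dev/zero of=/scratch/testfile bs=1M count=5000 oflag=direct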
Re: regression: 100% io-wait with 2.6.24-rcX
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

- Original Message
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>;
>     Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>;
>     Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar
>     <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]"
>     <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Jens Axboe <[EMAIL PROTECTED]>;
>     Milan Broz <[EMAIL PROTECTED]>; Neil Brown <[EMAIL PROTECTED]>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Fri, Jan 18, 2008 at 11:01:11AM -0800, Martin Knoblauch wrote:
> > At least, rc1-rc5 have shown that the CCISS system can do well. Now
> > the question is which part of the system does not cope well with the
> > larger IO sizes? Is it the CCISS controller, LVM or both. I am open to
> > suggestions on how to debug that.
>
> What is your LVM device configuration?
> E.g. 'dmsetup table' and 'dmsetup info -c' output.
> Some configurations lead to large IOs getting split up on the way through
> device-mapper.
>

Hi Alasdair,

here is the output; the filesystem in question is on LogVol02:

[EMAIL PROTECTED] ~]# dmsetup table
VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
VolGroup00-LogVol00: 0 67108864 linear 104:2 384

[EMAIL PROTECTED] ~]# dmsetup info -c
Name                 Maj Min Stat Open Targ Event  UUID
VolGroup00-LogVol02  253   1 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmOZ4OzOgGQIdF3qDx6fJmlZukXXLIy39R
VolGroup00-LogVol01  253   2 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4Ogmfn2CcAd2Fh7i48twe8PZc2XK5bSOe1Fq
VolGroup00-LogVol00  253   0 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmfYjxQKFP3zw2fGsezJN7ypSrfmP7oSvE

> See if these patches make any difference:
> http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
>
> dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
> dm-introduce-merge_bvec_fn.patch
> dm-linear-add-merge.patch
> dm-table-remove-merge_bvec-sector-restriction.patch
>

Thanks for the suggestion. Are they supposed to apply to mainline?

Cheers
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>;
>     Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>;
>     Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar
>     <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]"
>     <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Jens Axboe <[EMAIL PROTECTED]>;
>     Milan Broz <[EMAIL PROTECTED]>; Neil Brown <[EMAIL PROTECTED]>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> See if these patches make any difference:
> http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
>
> dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
> dm-introduce-merge_bvec_fn.patch
> dm-linear-add-merge.patch
> dm-table-remove-merge_bvec-sector-restriction.patch
>

Nope. Exactly the same poor results. To rule out LVM/DM I really have to see
what happens if I set up a system with filesystems directly on partitions.
Might take some time though.

Cheers
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>;
>     Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>;
>     Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar
>     <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]"
>     <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Jens Axboe <[EMAIL PROTECTED]>;
>     Milan Broz <[EMAIL PROTECTED]>; Neil Brown <[EMAIL PROTECTED]>
> Sent: Wednesday, January 23, 2008 12:40:52 AM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Tue, Jan 22, 2008 at 07:25:15AM -0800, Martin Knoblauch wrote:
> > [EMAIL PROTECTED] ~]# dmsetup table
> > VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
> > VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
> > VolGroup00-LogVol00: 0 67108864 linear 104:2 384
>
> The IO should pass straight through simple linear targets like that without
> needing to get broken up, so I wouldn't expect those patches to make any
> difference in this particular case.
>

Alasdair,

LVM/DM are off the hook :-) I converted one box to using partitions directly
and the performance is the same disappointment as with LVM/DM. Thanks anyway
for looking at my problem.

I will move the discussion now to a new thread, targeting CCISS directly.

Cheers
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)
--- Peter Zijlstra <[EMAIL PROTECTED]> wrote: > On Mon, 2007-10-01 at 14:22 -0700, Andrew Morton wrote: > > > nfs-remove-congestion_end.patch > > lib-percpu_counter_add.patch > > lib-percpu_counter_sub.patch > > lib-percpu_counter-variable-batch.patch > > lib-make-percpu_counter_add-take-s64.patch > > lib-percpu_counter_set.patch > > lib-percpu_counter_sum_positive.patch > > lib-percpu_count_sum.patch > > lib-percpu_counter_init-error-handling.patch > > lib-percpu_counter_init_irq.patch > > mm-bdi-init-hooks.patch > > mm-scalable-bdi-statistics-counters.patch > > mm-count-reclaimable-pages-per-bdi.patch > > mm-count-writeback-pages-per-bdi.patch > > This one: > > mm-expose-bdi-statistics-in-sysfs.patch > > > lib-floating-proportions.patch > > mm-per-device-dirty-threshold.patch > > mm-per-device-dirty-threshold-warning-fix.patch > > mm-per-device-dirty-threshold-fix.patch > > mm-dirty-balancing-for-tasks.patch > > mm-dirty-balancing-for-tasks-warning-fix.patch > > And, this one: > > debug-sysfs-files-for-the-current-ratio-size-total.patch > > > I'm not sure polluting /sys/block//queue/ like that is The Right > Thing. These patches sure were handy when debugging this, but not > sure > they want to move to maineline. > > Maybe we want /sys/bdi// or maybe /debug/bdi// > > Opinions? > Hi Peter, my only opinion is that it is great to see that stuff moving into mainline. If it really goes in, there will be one more very interested rc-tester :-) Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
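For later readers: the per-BDI knobs did end up in mainline, but under /sys/class/bdi/ rather than under /sys/block/<dev>/queue/. On a reasonably recent kernel the interface looks roughly like this (the 8:0 device number is just an example):

    # one directory per backing device, named major:minor (plus "default")
    ls /sys/class/bdi/

    # a device's share of the global dirty limit can be biased with these
    cat /sys/class/bdi/8:0/min_ratio /sys/class/bdi/8:0/max_ratio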
Re: [PATCH 0/5] sluggish writeback fixes
--- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> Andrew,
>
> The following patches fix the sluggish writeback behavior.
> They are well understood and well tested - but not yet widely tested.
>
> The first patch reverts the debugging -mm only check_dirty_inode_list.patch -
> which is no longer necessary.
>
> The following 4 patches do the real jobs:
>
> [PATCH 2/5] writeback: fix time ordering of the per superblock inode lists 8
> [PATCH 3/5] writeback: fix ntfs with sb_has_dirty_inodes()
> [PATCH 4/5] writeback: remove pages_skipped accounting in __block_write_full_page()
> [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
>
> They share the same goal as the following patches in -mm. Therefore I'd
> recommend to put the last 4 new ones after them:
>
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
> writeback-fix-comment-use-helper-function.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
> writeback-fix-periodic-superblock-dirty-inode-flushing.patch
>
> Regards,
> Fengguang

Hi Fengguang,

now that Peter's stuff seems to make it into mainline, do you think your fixes
should go in as well? Would definitely help to broaden the tester base.
Definitely by one very interested tester :-)

Keep up the good work
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
Hi,

currently I am tracking down an "interesting" effect when writing to a
Solaris-10/Sparc based server. The server exports two filesystems. One UFS, one
VXFS. The filesystems are mounted NFS3/TCP, no special options. The Linux
kernel in question is 2.6.24-rc6, but it happens with earlier kernels
(2.6.19.2, 2.6.22.6) as well. The client is x86_64 with 8 GB of RAM.

The problem: when writing to the VXFS based filesystem, performance drops
dramatically when the filesize reaches or exceeds "dirty_ratio". For a
dirty_ratio of 10% (about 800MB) files below 750 MB are transferred with about
30 MB/sec. Anything above 770 MB drops down to below 10 MB/sec. If I perform
the same tests on the UFS based FS, performance stays at about 30 MB/sec until
3GB and likely larger (I just stopped at 3 GB).

Any ideas what could cause this difference? Any suggestions on debugging it?

spsdm5:/lfs/test_ufs on /mnt/test_ufs type nfs (rw,proto=tcp,nfsvers=3,hard,intr,addr=160.50.118.37)
spsdm5:/lfs/test_vxfs on /mnt/test_vxfs type nfs (rw,proto=tcp,nfsvers=3,hard,intr,addr=160.50.118.37)

Cheers
Martin

PS: Please CC me, as I am not subscribed. Don't worry about the spamtrap name :-)
------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
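A sketch of how the threshold in question can be inspected and varied for this experiment (the values are examples; vm.dirty_ratio is a percentage of memory, so ~10% of 8 GB matches the ~800 MB knee described above):

    # current writeback thresholds
    sysctl vm.dirty_ratio vm.dirty_background_ratio

    # temporarily lower the hard limit, then re-run the copy test
    sysctl -w vm.dirty_ratio=5

    # watch the dirty/writeback/unstable page totals while the copy runs
    watch -n1 'grep -E "Dirty|Writeback|NFS_Unstable" /proc/meminfo'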
Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
- Original Message
> From: Chris Snook <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Sent: Friday, December 28, 2007 7:45:13 PM
> Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
>
> Martin Knoblauch wrote:
> > Hi,
> >
> > currently I am tracking down an "interesting" effect when writing to a
> > Solaris-10/Sparc based server. The server exports two filesystems. One UFS,
> > one VXFS. The filesystems are mounted NFS3/TCP, no special options. Linux
> > kernel in question is 2.6.24-rc6, but it happens with earlier kernels
> > (2.6.19.2, 2.6.22.6) as well. The client is x86_64 with 8 GB of RAM.
> >
> > The problem: when writing to the VXFS based filesystem, performance drops
> > dramatically when the filesize reaches or exceeds "dirty_ratio". For a
> > dirty_ratio of 10% (about 800MB) files below 750 MB are transferred with
> > about 30 MB/sec. Anything above 770 MB drops down to below 10 MB/sec. If I
> > perform the same tests on the UFS based FS, performance stays at about 30
> > MB/sec until 3GB and likely larger (I just stopped at 3 GB).
> >
> > Any ideas what could cause this difference? Any suggestions on debugging it?
>
> 1) Try normal NFS tuning, such as rsize/wsize tuning.
>

rsize/wsize have only a minimal effect. The negotiated size seems to be optimal.

> 2) You're entering synchronous writeback mode, so you can delay the
> problem by raising dirty_ratio to 100, or reduce the size of the problem
> by lowering dirty_ratio to 1. Either one could help.
>

For experiments, sure. But I do not think that I want to have 8 GB of dirty
pages [potentially] lying around. Are you sure that 1% is a useful value for
dirty_ratio? Looking at the code, it seems a minimum of 5% is enforced in
"page-writeback.c:get_dirty_limits":

        dirty_ratio = vm_dirty_ratio;
        if (dirty_ratio > unmapped_ratio / 2)
                dirty_ratio = unmapped_ratio / 2;

        if (dirty_ratio < 5)
                dirty_ratio = 5;

> 3) It sounds like the bottleneck is the vxfs filesystem. It only *appears*
> on the client side because writes up until dirty_ratio get buffered on
> the client.

Sure, the fact that a UFS (or SAM-FS) based FS behaves well in the same
situation points in that direction.

> If you can confirm that the server is actually writing stuff to disk
> slower when the client is in writeback mode, then it's possible the Linux
> NFS client is doing something inefficient in writeback mode.
>

I will try to get an iostat trace from the Sun side. Thanks for the suggestion.

Cheers
Martin

PS: Happy Year 2008 to all Kernel Hackers and their families
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
- Original Message > From: Chris Snook <[EMAIL PROTECTED]> > To: Martin Knoblauch <[EMAIL PROTECTED]> > Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED] > Sent: Friday, December 28, 2007 7:45:13 PM > Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW > related > > Martin Knoblauch wrote: > > Hi, > > > > currently I am tracking down an "interesting" effect when writing > 3) It sounds like the bottleneck is the vxfs filesystem. It > only *appears* on the client side because writes up until dirty_ratio > get buffered on the client. > If you can confirm that the server is actually writing stuff to > disk slower when the client is in writeback mode, then it's possible > the Linux NFSclient is doing something inefficient in writeback mode. > so, is the output of "iostat -d -l1 d111" during two runs. The first run is with 750 MB, the second with 850MB. // 750MB $ iostat -d -l 1 md111 2 md111 kps tps serv 22 0 14 0 00 0 0 13 29347 468 12 37040 593 17 30938 492 25 30421 491 25 41626 676 16 42913 703 14 39890 647 15 9009 1417 8963 1417 5143 817 34814 547 10 49323 775 12 28624 4516 22 16 finish 0 00 0 00 Here it seems that the disk is writing for 26-28 seconds with avg. 29 MB/sec. Fine. // 850MB $ iostat -d -l 1 md111 2 md111 kps tps serv 0 00 11275 180 10 39874 635 14 37403 587 17 24341 392 30 25989 423 26 22464 375 30 21922 361 32 27924 450 26 21507 342 21 9217 153 15 9260 150 15 9544 155 15 9298 150 14 10118 162 11 15505 250 12 27513 448 14 26698 436 15 26144 431 15 25201 412 14 38 seconds in run 0 00 0 00 579 17 12 0 00 0 00 0 00 0 00 518 9 16 485 86 9 17 514 97 0 00 0 00 541 98 532 106 0 00 0 00 650 127 0 00 242 89 1023 185 304 56 418 87 283 55 303 58 527 106 0 00 0 00 0 00 5 1 13 0 00 0 00 0 00 0 00 0 00 0 0 11 0 00 0 00 0 00 1 0 15 0 00 96 2 15 138 3 10 11057 1756 17549 2806 351 85 0 00 # 218 seconds in run, finish. So, for the first 38 seconds everything looks similar to the 750 MB case. For the next about 180 seconds most time nothing happens. Averaging 4.1 MB/sec. Maybe it is time to capture the traffic. What are the best tcpdump parameters for NFS? I always forget :-( Cheers Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
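Since the question comes up at the end: a reasonable starting point for capturing the NFSv3-over-TCP traffic for later inspection with wireshark/tshark (interface and host are placeholders; -s 0 keeps full frames so the RPCs can be decoded, and 2049 is the NFS port):

    tcpdump -i eth0 -s 0 -w nfs-trace.pcap host spsdm5 and port 2049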
Re: Stack warning from 2.6.24-rc
- Original Message > From: Ingo Molnar <[EMAIL PROTECTED]> > To: Martin Knoblauch <[EMAIL PROTECTED]> > Cc: linux-kernel@vger.kernel.org > Sent: Tuesday, December 4, 2007 12:52:23 PM > Subject: Re: Stack warning from 2.6.24-rc > > > * Martin Knoblauch wrote: > > > I see the following stack warning(s) on a IBM x3650 (2xDual-Core, 8 > > GB, AACRAID with 6x146GB RAID5) running 2.6.24-rc3/rc4: > > > > [ 180.739846] mount.nfs used greatest stack depth: 3192 bytes left > > [ 666.121007] bash used greatest stack depth: 3160 bytes left > > > > Nothing bad has happened so far. The message does not show on a > > similarly configured HP/DL-380g4 (CCISS instead of AACRAID) running > > rc3. Anything to worry? Anything I can do to help debugging? > > those are generated by: > > CONFIG_DEBUG_STACKOVERFLOW=y > CONFIG_DEBUG_STACK_USAGE=y > > and look quite harmless. If they were much closer to zero it would be > a problem. > > Ingo > OK, I will ignore it then. I was just surprised to see it. Thanks Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
What is the unit of "nr_writeback"?
Hi, forgive the stupid question. What is the unit of "nr_writeback"? One would usually assume a rate, but looking at the code I see it added together with nr_dirty and nr_unstable, somehow defeating the assumption. Cheers Martin ------ Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
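To make the units concrete: nr_writeback, like nr_dirty and nr_unstable, is an instantaneous count of pages in that state rather than a rate, which is why summing the three is meaningful. The counters can be watched directly (a sketch):

    # counts are in pages; multiply by the page size for bytes
    grep -E 'nr_dirty|nr_writeback|nr_unstable' /proc/vmstat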
Re: [PATCH 00/11] writeback bug fixes and simplifications
- Original Message > From: WU Fengguang <[EMAIL PROTECTED]> > To: Hans-Peter Jansen <[EMAIL PROTECTED]> > Cc: Sascha Warner <[EMAIL PROTECTED]>; Andrew Morton <[EMAIL PROTECTED]>; > linux-kernel@vger.kernel.org; Peter Zijlstra <[EMAIL PROTECTED]> > Sent: Wednesday, January 9, 2008 4:33:32 AM > Subject: Re: [PATCH 00/11] writeback bug fixes and simplifications > > On Sat, Dec 29, 2007 at 03:56:59PM +0100, Hans-Peter Jansen wrote: > > Am Freitag, 28. Dezember 2007 schrieb Sascha Warner: > > > Andrew Morton wrote: > > > > On Thu, 27 Dec 2007 23:08:40 +0100 Sascha > Warner > > > wrote: > > > >> Hi, > > > >> > > > >> I applied your patches to 2.6.24-rc6-mm1, but now I am > faced > with one > > > >> pdflush often using 100% CPU for a long time. There seem to > be > some > > > >> rare pauses from its 100% usage, however. > > > >> > > > >> On ~23 minutes uptime i have ~19 minutes pdflush runtime. > > > >> > > > >> This is on E6600, x86_64, 2 Gig RAM, SATA HDD, running on gentoo > > > >> ~x64_64 > > > >> > > > >> Let me know if you need more info. > > > > > > > > (some) cc's restored. Please, always do reply-to-all. > > > > > > Hi Wu, > > > > Sascha, if you want to address Fengguang by his first name, note > that > > > chinese and bavarians (and some others I forgot now, too) > typically > use the > > order: > > > > lastname firstname > > > > when they spell their names. Another evidence is, that the name Wu > is > a > > pretty common chinese family name. > > > > Fengguang, if it's the other way around, correct me please (and > I'm > going to > > wear a big brown paper bag for the rest of the day..). > > You are right. We normally do "Fengguang" or "Mr. Wu" :-) > For LKML the first name is less ambiguous. > > Thanks, > Fengguang > Just cannot resist. Hans-Peter mentions Bavarian using Lastname-Givenname as well. This is only true in a folklore context (or when you are very deep in the countryside). Officially the bavarians use the usual German Given/Lastname. Although they will never admit to be Germans, of course :-) Cheers Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
- Original Message > From: Martin Knoblauch <[EMAIL PROTECTED]> > To: Chris Snook <[EMAIL PROTECTED]> > Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; spam trap <[EMAIL > PROTECTED]> > Sent: Saturday, December 29, 2007 12:11:08 PM > Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW > related > > - Original Message > > From: Chris Snook > > To: Martin Knoblauch > > Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED] > > Sent: Friday, December 28, 2007 7:45:13 PM > > Subject: Re: Strange NFS write performance > Linux->Solaris-10/VXFS, > maybe VW related > > > > Martin Knoblauch wrote: > > > Hi, > > > > > > currently I am tracking down an "interesting" effect when writing > > > 3) It sounds like the bottleneck is the vxfs filesystem. It > > only *appears* on the client side because writes up > until > dirty_ratio > > get buffered on the client. > > If you can confirm that the server is actually writing stuff to > > disk slower when the client is in writeback mode, then it's possible > > the Linux NFSclient is doing something inefficient in > writeback > mode. > > > > so, is the output of "iostat -d -l1 d111" during two runs. The > first > run is with 750 MB, the second with 850MB. > > // 750MB > $ iostat -d -l 1 md111 2 >md111 > kps tps serv > 22 0 14 > 0 00 > 0 0 13 > 29347 468 12 > 37040 593 17 > 30938 492 25 > 30421 491 25 > 41626 676 16 > 42913 703 14 > 39890 647 15 > 9009 1417 > 8963 1417 > 5143 817 > 34814 547 10 > 49323 775 12 > 28624 4516 > 22 16 > finish > 0 00 > 0 00 > > Here it seems that the disk is writing for 26-28 seconds with avg. > 29 > MB/sec. Fine. > > // 850MB > $ iostat -d -l 1 md111 2 >md111 > kps tps serv > 0 00 > 11275 180 10 > 39874 635 14 > 37403 587 17 > 24341 392 30 > 25989 423 26 > 22464 375 30 > 21922 361 32 > 27924 450 26 > 21507 342 21 > 9217 153 15 > 9260 150 15 > 9544 155 15 > 9298 150 14 > 10118 162 11 > 15505 250 12 > 27513 448 14 > 26698 436 15 > 26144 431 15 > 25201 412 14 > 38 seconds in run > 0 00 > 0 00 > 579 17 12 > 0 00 > 0 00 > 0 00 > 0 00 > 518 9 16 > 485 86 > 9 17 > 514 97 > 0 00 > 0 00 > 541 98 > 532 106 > 0 00 > 0 00 > 650 127 > 0 00 > 242 89 > 1023 185 > 304 56 > 418 87 > 283 55 > 303 58 > 527 106 > 0 00 > 0 00 > 0 00 > 5 1 13 > 0 00 > 0 00 > 0 00 > 0 00 > 0 00 > 0 0 11 > 0 00 > 0 00 > 0 00 > 1 0 15 > 0 00 > 96 2 15 > 138 3 10 > 11057 1756 > 17549 2806 > 351 85 > 0 00 > # 218 seconds in run, finish. > > So, for the first 38 seconds everything looks similar to the 750 > MB case. For the next about 180 seconds most time nothing happens. > Averaging 4.1 MB/sec. > > Maybe it is time to capture the traffic. What are the best > tcpdump parameters for NFS? I always forget :-( > > Cheers > Martin > > Hi, now that the seasonal festivities are over - Happy New Year btw. - any comments/suggestions on my problem? Cheers Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message > From: Mike Snitzer <[EMAIL PROTECTED]> > To: Fengguang Wu <[EMAIL PROTECTED]> > Cc: Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar > <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" > <[EMAIL PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; Andrew Morton > <[EMAIL PROTECTED]> > Sent: Tuesday, January 15, 2008 10:13:22 PM > Subject: Re: regression: 100% io-wait with 2.6.24-rcX > > On Jan 14, 2008 7:50 AM, Fengguang Wu wrote: > > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote: > > > > > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote: > > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu: > > > > > > > > > Joerg, this patch fixed the bug for me :-) > > > > > > > > Fengguang, congratulations, I can confirm that your patch > fixed > the bug! With > > > > previous kernels the bug showed up after each reboot. Now, > when > booting the > > > > patched kernel everything is fine and there is no longer > any > suspicious > > > > iowait! > > > > > > > > Do you have an idea why this problem appeared in 2.6.24? > Did > somebody change > > > > the ext2 code or is it related to the changes in the scheduler? > > > > > > It was Fengguang who changed the inode writeback code, and I > guess > the > > > new and improved code was less able do deal with these funny corner > > > cases. But he has been very good in tracking them down and > solving > them, > > > kudos to him for that work! > > > > Thank you. > > > > In particular the bug is triggered by the patch named: > > "writeback: introduce writeback_control.more_io to > indicate > more io" > > That patch means to speed up writeback, but unfortunately its > > aggressiveness has disclosed bugs in reiserfs, jfs and now ext2. > > > > Linus, given the number of bugs it triggered, I'd recommend revert > > this patch(git commit > 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). > Let's > > push it back to -mm tree for more testings? > > Fengguang, > > I'd like to better understand where your writeback work stands > relative to 2.6.24-rcX and -mm. To be clear, your changes in > 2.6.24-rc7 have been benchmarked to provide a ~33% sequential write > performance improvement with ext3 (as compared to 2.6.22, CFS could be > helping, etc but...). Very impressive! > > Given this improvement it is unfortunate to see your request to revert > 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b but it is understandable if > you're not confident in it for 2.6.24. > > That said, you recently posted an -mm patchset that first reverts > 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address > the "slow writes for concurrent large and small file writes" bug: > http://lkml.org/lkml/2008/1/15/132 > > For those interested in using your writeback improvements in > production sooner rather than later (primarily with ext3); what > recommendations do you have? Just heavily test our own 2.6.24 + your > evolving "close, but not ready for merge" -mm writeback patchset? > Hi Fengguang, Mike, I can add myself to Mikes question. It would be good to know a "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been showing quite nice improvement of the overall writeback situation and it would be sad to see this [partially] gone in 2.6.24-final. Linus apparently already has reverted "...2250b". I will definitely repeat my tests with -rc8. and report. 
Cheers Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message > From: Fengguang Wu <[EMAIL PROTECTED]> > To: Martin Knoblauch <[EMAIL PROTECTED]> > Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; > [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; > linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus > Torvalds <[EMAIL PROTECTED]> > Sent: Wednesday, January 16, 2008 1:00:04 PM > Subject: Re: regression: 100% io-wait with 2.6.24-rcX > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote: > > > For those interested in using your writeback improvements in > > > production sooner rather than later (primarily with ext3); what > > > recommendations do you have? Just heavily test our own 2.6.24 > + > your > > > evolving "close, but not ready for merge" -mm writeback patchset? > > > > > Hi Fengguang, Mike, > > > > I can add myself to Mikes question. It would be good to know > a > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has > been > showing quite nice improvement of the overall writeback situation and > it > would be sad to see this [partially] gone in 2.6.24-final. > Linus > apparently already has reverted "...2250b". I will definitely repeat my > tests > with -rc8. and report. > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7? > Maybe we can push it to 2.6.24 after your testing. > Will do tomorrow or friday. Actually a patch against -rc8 would be nicer for me, as I have not looked at -rc7 due to holidays and some of the reported problems with it. Cheers Martin > Fengguang > --- > fs/fs-writeback.c | 17 +++-- > include/linux/writeback.h |1 + > mm/page-writeback.c |9 ++--- > 3 files changed, 22 insertions(+), 5 deletions(-) > > --- linux.orig/fs/fs-writeback.c > +++ linux/fs/fs-writeback.c > @@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode, > * soon as the queue becomes uncongested. 
> */ > inode->i_state |= I_DIRTY_PAGES; > -requeue_io(inode); > +if (wbc->nr_to_write <= 0) > +/* > + * slice used up: queue for next turn > + */ > +requeue_io(inode); > +else > +/* > + * somehow blocked: retry later > + */ > +redirty_tail(inode); > } else { > /* > * Otherwise fully redirty the inode so that > @@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s > iput(inode); > cond_resched(); > spin_lock(&inode_lock); > -if (wbc->nr_to_write <= 0) > +if (wbc->nr_to_write <= 0) { > +wbc->more_io = 1; > break; > +} > +if (!list_empty(&sb->s_more_io)) > +wbc->more_io = 1; > } > return;/* Leave any unwritten inodes on s_io */ > } > --- linux.orig/include/linux/writeback.h > +++ linux/include/linux/writeback.h > @@ -62,6 +62,7 @@ struct writeback_control { > unsigned for_reclaim:1;/* Invoked from the page > allocator > */ > unsigned for_writepages:1;/* This is a writepages() call */ > unsigned range_cyclic:1;/* range_start is cyclic */ > +unsigned more_io:1;/* more io to be dispatched */ > }; > > /* > --- linux.orig/mm/page-writeback.c > +++ linux/mm/page-writeback.c > @@ -558,6 +558,7 @@ static void background_writeout(unsigned > global_page_state(NR_UNSTABLE_NFS) < background_thresh > && min_pages <= 0) > break; > +wbc.more_io = 0; > wbc.encountered_congestion = 0; > wbc.nr_to_write = MAX_WRITEBACK_PAGES; > wbc.pages_skipped = 0; > @@ -565,8 +566,9 @@ static void background_writeout(unsigned > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write; > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { > /* Wrote less than expected */ > -congestion_wait(WRITE, HZ/10); > -if (!wbc.encountered_congestion) > +if (wbc.encountered_congestion || wbc.more_io) > +congestion_wait(WRITE, HZ/10); > +else > break; > } > } > @@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg > global_page_state(N
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message > From: Fengguang Wu <[EMAIL PROTECTED]> > To: Martin Knoblauch <[EMAIL PROTECTED]> > Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; > [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; > linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus > Torvalds <[EMAIL PROTECTED]> > Sent: Wednesday, January 16, 2008 1:00:04 PM > Subject: Re: regression: 100% io-wait with 2.6.24-rcX > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote: > > > For those interested in using your writeback improvements in > > > production sooner rather than later (primarily with ext3); what > > > recommendations do you have? Just heavily test our own 2.6.24 > + > your > > > evolving "close, but not ready for merge" -mm writeback patchset? > > > > > Hi Fengguang, Mike, > > > > I can add myself to Mikes question. It would be good to know > a > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has > been > showing quite nice improvement of the overall writeback situation and > it > would be sad to see this [partially] gone in 2.6.24-final. > Linus > apparently already has reverted "...2250b". I will definitely repeat my > tests > with -rc8. and report. > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7? > Maybe we can push it to 2.6.24 after your testing. > Hi Fengguang, something really bad has happened between -rc3 and -rc6. Embarrassingly I did not catch that earlier :-( Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only test that is still good is mix3, which I attribute to the per-BDI stuff. At the moment I am frantically trying to find when things went down. I did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6. Sorry that I cannot provide any input to your patch. Depressed Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message > From: Martin Knoblauch <[EMAIL PROTECTED]> > To: Fengguang Wu <[EMAIL PROTECTED]> > Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; > [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; > linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus > Torvalds <[EMAIL PROTECTED]> > Sent: Thursday, January 17, 2008 2:52:58 PM > Subject: Re: regression: 100% io-wait with 2.6.24-rcX > > - Original Message > > From: Fengguang Wu > > To: Martin Knoblauch > > Cc: Mike Snitzer ; Peter > Zijlstra > ; [EMAIL PROTECTED]; Ingo Molnar > ; > linux-kernel@vger.kernel.org; > "[EMAIL PROTECTED]" > ; Linus > Torvalds > > > Sent: Wednesday, January 16, 2008 1:00:04 PM > > Subject: Re: regression: 100% io-wait with 2.6.24-rcX > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote: > > > > For those interested in using your writeback improvements in > > > > production sooner rather than later (primarily with ext3); what > > > > recommendations do you have? Just heavily test our own 2.6.24 > > + > > > your > > > > evolving "close, but not ready for merge" -mm writeback patchset? > > > > > > > Hi Fengguang, Mike, > > > > > > I can add myself to Mikes question. It would be good to know > > a > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has > > been > > > showing quite nice improvement of the overall writeback situation and > > it > > > would be sad to see this [partially] gone in 2.6.24-final. > > Linus > > > apparently already has reverted "...2250b". I will definitely > repeat > my > > tests > > > with -rc8. and report. > > > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7? > > Maybe we can push it to 2.6.24 after your testing. > > > Hi Fengguang, > > something really bad has happened between -rc3 and > -rc6. > Embarrassingly I did not catch that earlier :-( > > Compared to the numbers I posted > in > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec > (slight > plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only > test > that is still good is mix3, which I attribute to the per-BDI stuff. > > At the moment I am frantically trying to find when things went down. > I > did run -rc8 and rc8+yourpatch. No difference to what I see with > -rc6. > Sorry that I cannot provide any input to your patch. > OK, the change happened between rc5 and rc6. Just following a gut feeling, I reverted #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d #Author: Mel Gorman <[EMAIL PROTECTED]> #Date: Mon Dec 17 16:20:05 2007 -0800 # #mm: fix page allocation for larger I/O segments # #In some cases the IO subsystem is able to merge requests if the pages are #adjacent in physical memory. This was achieved in the allocator by having #expand() return pages in physically contiguous order in situations were a #large buddy was split. However, list-based anti-fragmentation changed the #order pages were returned in to avoid searching in buffered_rmqueue() for a #page of the appropriate migrate type. # #This patch restores behaviour of rmqueue_bulk() preserving the physical #order of pages returned by the allocator without incurring increased search #costs for anti-fragmentation. 
#
#Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
#Cc: James Bottomley <[EMAIL PROTECTED]>
#Cc: Jens Axboe <[EMAIL PROTECTED]>
#Cc: Mark Lord <[EMAIL PROTECTED]
#Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
#Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c	2007-12-21 04:14:11.305633890 +
+++ linux-2.6.24-rc6/mm/page_alloc.c	2007-12-21 04:14:17.746985697 +
@@ -847,8 +847,19 @@
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
 			break;
+
+		/*
+		 * Split buddy pages returned by expand() are received here
+		 * in physical page order. The page is added to the callers and
+		 * list and the list head then moves forward. From the callers
+		 * perspective, the linked list is ordered by page number in
+		 * some conditions. This is useful for IO devices that can
+		 * merge IO requests if the physical p
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message > From: Mike Snitzer <[EMAIL PROTECTED]> > To: Martin Knoblauch <[EMAIL PROTECTED]> > Cc: Fengguang Wu <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; > [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; > linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus > Torvalds <[EMAIL PROTECTED]> > Sent: Thursday, January 17, 2008 5:11:50 PM > Subject: Re: regression: 100% io-wait with 2.6.24-rcX > > > I've backported Peter's perbdi patchset to 2.6.22.x. I can share it > with anyone who might be interested. > > As expected, it has yielded 2.6.24-rcX level scaling. Given the test > result matrix you previously posted, 2.6.22.x+perbdi might give you > what you're looking for (sans improved writeback that 2.6.24 was > thought to be providing). That is, much improved scaling with better > O_DIRECT and network throughput. Just a thought... > > Unfortunately, my priorities (and computing resources) have shifted > and I won't be able to thoroughly test Fengguang's new writeback patch > on 2.6.24-rc8... whereby missing out on providing > justification/testing to others on _some_ improved writeback being > included in 2.6.24 final. > > Not to mention the window for writeback improvement is all but closed > considering the 2.6.24-rc8 announcement's 2.6.24 final release > timetable. > Mike, thanks for the offer, but the improved throughput is my #1 priority nowadays. And while the better scaling for different targets is nothing to frown upon, the much better scaling when writing to the same target would have been the big winner for me. Anyway, I located the "offending" commit. Lets see what the experts say. Cheers Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message > From: Mel Gorman <[EMAIL PROTECTED]> > To: Martin Knoblauch <[EMAIL PROTECTED]> > Cc: Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter > Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL > PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL > PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; [EMAIL PROTECTED] > Sent: Thursday, January 17, 2008 9:23:57 PM > Subject: Re: regression: 100% io-wait with 2.6.24-rcX > > On (17/01/08 09:44), Martin Knoblauch didst pronounce: > > > > > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, > Martin > Knoblauch wrote: > > > > > > > For those interested in using your writeback > improvements > in > > > > > > > production sooner rather than later (primarily with > ext3); > what > > > > > > > recommendations do you have? Just heavily test our > own > 2.6.24 > > > > > > > evolving "close, but not ready for merge" -mm > writeback > patchset? > > > > > > > > > > > > > > > > > > > I can add myself to Mikes question. It would be good to > know > a > > > > > > > > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so > far > has > > > > > been showing quite nice improvement of the overall > writeback > situation and > > > > > it would be sad to see this [partially] gone in 2.6.24-final. > > > > > Linus apparently already has reverted "...2250b". I > will > definitely > > > > > repeat my tests with -rc8. and report. > > > > > > > > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7? > > > > Maybe we can push it to 2.6.24 after your testing. > > > > > > > Hi Fengguang, > > > > > > something really bad has happened between -rc3 and -rc6. > > > Embarrassingly I did not catch that earlier :-( > > > Compared to the numbers I posted in > > > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec > > > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24. > > > The only test that is still good is mix3, which I attribute to > > > the per-BDI stuff. > > I suspect that the IO hardware you have is very sensitive to the > color of the physical page. I wonder, do you boot the system cleanly > and then run these tests? If so, it would be interesting to know what > happens if you stress the system first (many kernel compiles for example, > basically anything that would use a lot of memory in different ways for some > time) to randomise the free lists a bit and then run your test. You'd need to > run > the test three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the patch you > identified reverted. > The effect is defintely depending on the IO hardware. I performed the same tests on a different box with an AACRAID controller and there things look different. Basically the "offending" commit helps seingle stream performance on that box, while dual/triple stream are not affected. So I suspect that the CCISS is just not behaving well. And yes, the tests are usually done on a freshly booted box. Of course, I repeat them a few times. On the CCISS box the numbers are very constant. On the AACRAID box they vary quite a bit. I can certainly stress the box before doing the tests. Please define "many" for the kernel compiles :-) > > > > OK, the change happened between rc5 and rc6. Just following a > > gut feeling, I reverted > > > > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d > > #Author: Mel Gorman > > #Date: Mon Dec 17 16:20:05 2007 -0800 > > # > > > > This has brought back the good results I observed and reported. > > I do not know what to make out of this. 
At least on the systems
> > I care about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory,
> > Smart Array 6i controller with 4x72GB SCSI disks as RAID5 (battery
> > protected writeback cache enabled) and gigabit networking (tg3)) this
> > optimisation is a disaster.
>
> That patch was not an optimisation, it was a regression fix against
> 2.6.23 and I don't believe reverting it is an option. Other IO
> hardware benefits from having the allocator supply pages in PFN order.

I think this late in the 2.6.24 game we should just leave things as they are. But we should try to find a way to make CCISS faster, as it apparently can be faster.
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message
> From: Mel Gorman <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
> Sent: Thursday, January 17, 2008 11:12:21 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> > The effect is definitely depending on the IO hardware. I performed
> > the same tests on a different box with an AACRAID controller and
> > there things look different.
>
> I take it different also means it does not show this odd performance
> behaviour and is similar whether the patch is applied or not?

Here are the numbers (MB/s) from the AACRAID box, after a fresh boot:

Test     2.6.19.2  2.6.24-rc6  2.6.24-rc6-81eabcbe0b991ddef5216f30ae91c4b226d54b6d
dd1      325       350         290
dd1-dir  180       160         160
dd2      2x90      2x113       2x110
dd2-dir  2x120     2x92        2x93
dd3      3x54      3x70        3x70
dd3-dir  3x83      3x64        3x64
mix3     55,2x30   400,2x25    310,2x25

What we are seeing here is that:

a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO system compared to the CCISS box
b) Reverting your patch hurts single stream
c) dual/triple stream are not affected by your patch and are improved over 2.6.19
d) the mix3 performance is improved compared to 2.6.19
d1) reverting your patch hurts the local-disk part of mix3
e) the AACRAID setup is definitely faster than the CCISS

So, on this box your patch is definitely needed to get the pre-2.6.24 performance when writing a single big file.

Actually, things on the CCISS box might be even more complicated. I forgot the fact that on that box we have ext2/LVM/DM/Hardware, while on the AACRAID box we have ext2/Hardware. Do you think that the LVM/MD are sensitive to the page order/coloring?

Anyway: does your patch only address this performance issue, or are there also data integrity concerns without it? I may consider reverting the patch for my production environment. It really helps two thirds of my boxes big time, while it does not hurt the other third that much :-)

> > I can certainly stress the box before doing the tests. Please
> > define "many" for the kernel compiles :-)
>
> With 8GiB of RAM, try making 24 copies of the kernel and compiling them
> all simultaneously. Running that for 20-30 minutes should be enough
> to randomise the freelists affecting what color of page is used for the
> dd test.

ouch :-) OK, I will try that.

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: VM Requirement Document - v0.0
>> * If we're getting low cache hit rates, don't flush
>> processes to swap.
>> * If we're getting good cache hit rates, flush old, idle
>> processes to swap.

Rik> ... but I fail to see this one. If we get a low cache hit rate,
Rik> couldn't that just mean we allocated too little memory for the
Rik> cache ?

maybe more specific: If the hit-rate is low and the cache is already 70+% of the system's memory, the chances may be slim that more cache is going to improve the hit-rate.

I do not care much whether the cache is using 99% of the system's memory or 50%. As long as there is free memory, using it for cache is great. I care a lot if the cache takes down interactivity, because it pushes out processes that it thinks are idle, but that I need in 5 seconds. The cache's pressure against processes should decrease with the (relative) size of the cache. Especially in low hit-rate situations.

OT: I asked the question before somewhere else. Are there interfaces to the VM that expose the various cache sizes and, more importantly, hit-rates to userland? I would love to see (or maybe help write, in my free time) a tool to just visualize/analyze the efficiency of the VM system.

Martin
--
------
Martin Knoblauch       |email:  [EMAIL PROTECTED]
TeraPort GmbH          |Phone:  +49-89-510857-309
C+ITS                  |Fax:    +49-89-510857-111
http://www.teraport.de |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
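On the cache-size half of that question, the numbers the kernel already exports can at least be sampled from /proc/meminfo; hit rates are not exported, so any tool would have to estimate them indirectly. A minimal sketch of such a sampler follows (it simply prints a few of the standard /proc/meminfo fields; field selection is an assumption, adjust to taste):

/*
 * Minimal sketch: sample the cache-related numbers the kernel already
 * exports in /proc/meminfo.  Hit rates are not exported, so a real
 * efficiency tool would have to estimate them from repeated samples.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f) {
                perror("/proc/meminfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, "MemTotal:", 9) ||
                    !strncmp(line, "MemFree:", 8) ||
                    !strncmp(line, "Cached:", 7) ||
                    !strncmp(line, "SwapFree:", 9))
                        fputs(line, stdout);    /* print the matching field */
        }
        fclose(f);
        return 0;
}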
ReiserFS patches vs. 2.4.5-ac series
Hi, what is the current relation between the reiserfs patches at namesys.com and the 2.4.5-ac series kernel? Namesys seems to have a small one for the "umount" problem and two bigger ones (knfsd and knfsd+quota+mount). All apply cleanly to vanilla 2.4.5, but the bigger ones fail against ac18 and ac19 (earlier ones as well, I would guess). Are some of the knfsd/quota fixes already in -ac?

Thanks
Martin
--
------
Martin Knoblauch       |email:  [EMAIL PROTECTED]
TeraPort GmbH          |Phone:  +49-89-510857-309
C+ITS                  |Fax:    +49-89-510857-111
http://www.teraport.de |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: VM Requirement Document - v0.0
Rik van Riel wrote: > > On Wed, 27 Jun 2001, Martin Knoblauch wrote: > > > I do not care much whether the cache is using 99% of the systems memory > > or 50%. As long as there is free memory, using it for cache is great. I > > care a lot if the cache takes down interactivity, because it pushes out > > processes that it thinks idle, but that I need in 5 seconds. The caches > > pressure against processes > > Too bad that processes are in general cached INSIDE the cache. > > You'll have to write a new balancing story now ;) > maybe that is part of "the answer" :-) Martin -- -- Martin Knoblauch |email: [EMAIL PROTECTED] TeraPort GmbH|Phone: +49-89-510857-309 C+ITS|Fax:+49-89-510857-111 http://www.teraport.de |Mobile: +49-170-4904759 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: VM Requirement Document - v0.0
Helge Hafting wrote:
>
> Martin Knoblauch wrote:
> >
> > maybe more specific: If the hit-rate is low and the cache is already
> > 70+% of the systems memory, the chances may be slim that more cache is
> > going to improve the hit-rate.
>
> Oh, but this is possible. You can get into situations where
> the (file cache) working set needs 80% or so of memory
> to get a near-perfect hitrate, and where
> using 70% of memory will thrash madly due to the file access

that's why I said "maybe" :-) Sure, another 5% of cache may improve things, but it also may kill the interactive performance. That's why there should probably be more than one VM strategy, to accommodate servers and workstations/laptops.

> pattern. And this won't be a problem either, if
> the working set of "other" (non-file)
> stuff is below 20% of memory. The total size of
> non-file stuff may be above 20% though, so something goes
> into swap.

And that is the problem. Too much seems to go into swap. At least for interactive work. Unfortunately, with 128MB of memory I cannot entirely turn off swap. I will see how things are going once I have 256 or 512 MB (hopefully soon :-)

> I definitely want the machine to work under such circumstances,
> so an arbitrary limit of 70% won't work.

Do not take the 70% as an arbitrary limit. I never said that. The 70% is just my situation. The problems may arise at 60% cache or at 97.38% cache.

> Preventing swap-thrashing at all cost doesn't help if the

Never said at all cost.

> machine loses to io-thrashing instead. Performance will be
> just as much down, although perhaps more satisfying because
> people aren't that surprised if explicit file operations
> take a long time. They hate it when moving the mouse
> or something causes a disk access even if their
> apps run faster. :-(

Absolutely true. And if the main purpose of the machine is interactive work (we do want Linux to be a success on the desktop, don't we?), it should not be hampered by an IO improvement that may be only of secondary importance to the user (who is the final "customer" for all the work that is done to the kernel :-). On big servers a little paging now and then may be absolutely OK, as long as the IO is going strong.

I have been observing the discussions of VM behaviour in 2.4.x for some time. They are mostly very entertaining and revealing. But they also show that one solution does not seem to benefit all possible scenarios. Therefore either more than one VM strategy is necessary, or better means of tuning the cache behaviour, or both. Definitely better ways of measuring the VM efficiency seem to be needed. While implementing VM strategies is probably out of the question for a lot of the people that complain, I hope that at least my complaints are kind of useful.

Martin
--
--
Martin Knoblauch       |email:  [EMAIL PROTECTED]
TeraPort GmbH          |Phone:  +49-89-510857-309
C+ITS                  |Fax:    +49-89-510857-111
http://www.teraport.de |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Cosmetic JFFS patch.
>Olaf Hering wrote: >> kde.o. 2.5? > >Good idea! Graphics needs to be in the kernel to be fast. Windows >proved that. thought SGI proved that :-) Martin -- ------ Martin Knoblauch |email: [EMAIL PROTECTED] TeraPort GmbH|Phone: +49-89-510857-309 C+ITS|Fax:+49-89-510857-111 http://www.teraport.de |Mobile: +49-170-4904759 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Announcing Journaled File System (JFS) release 1.0.0 available
Hi, first of all congratulations on finishing the initial release. Some questions, just out of curiosity:

>* Fast recovery after a system crash or power outage
>
>* Journaling for file system integrity
>
>* Journaling of meta-data only

does this mean JFS/Linux always journals only the meta-data, or is that an option? Does it perform full data-journaling under AIX?

>* Extent-based allocation
>
>* Excellent overall performance
>
>* 64 bit file system
>
>* Built to scale. In memory and on-disk data structures are designed to
>  scale beyond practical limit

Is this scaling only for size, or also for performance (many disks on many controllers) like XFS (at least on SGI iron)?

Thanks
Martin
--
------
Martin Knoblauch       |email:  [EMAIL PROTECTED]
TeraPort GmbH          |Phone:  +49-89-510857-309
C+ITS                  |Fax:    +49-89-510857-111
http://www.teraport.de |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
VM behaviour under 2.4.5-ac21
Hi, just something positive for the weekend. With 2.4.5-ac21, the behaviour on my laptop (128MB, plus twice that much swap) seems a bit more sane. When I start new large applications now, the "used" portion of VM actually pushes against the cache instead of forcing stuff into swap. It is still using swap, but the effects on interactivity are much lighter. So, if this is a preview of 2.4.6 behaviour, there may be a light at the end of the tunnel.

Have a good weekend
Martin
--
------
Martin Knoblauch       |email:  [EMAIL PROTECTED]
TeraPort GmbH          |Phone:  +49-89-510857-309
C+ITS                  |Fax:    +49-89-510857-111
http://www.teraport.de |Mobile: +49-170-4904759
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: iozone write 50% regression in kernel 2.6.24-rc1
- Original Message > From: "Zhang, Yanmin" <[EMAIL PROTECTED]> > To: Martin Knoblauch <[EMAIL PROTECTED]> > Cc: [EMAIL PROTECTED]; LKML > Sent: Monday, November 12, 2007 1:45:57 AM > Subject: Re: iozone write 50% regression in kernel 2.6.24-rc1 > > On Fri, 2007-11-09 at 04:36 -0800, Martin Knoblauch wrote: > > - Original Message > > > From: "Zhang, Yanmin" > > > To: [EMAIL PROTECTED] > > > Cc: LKML > > > Sent: Friday, November 9, 2007 10:47:52 AM > > > Subject: iozone write 50% regression in kernel 2.6.24-rc1 > > > > > > Comparing with 2.6.23, iozone sequential write/rewrite (512M) has > > > 50% > > > > > regression > > > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression. > > > > > > My machine has 8 processor cores and 8GB memory. > > > > > > By bisect, I located patch > > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h > = > > > 04fbfdc14e5f48463820d6b9807daa5e9c92c51f. > > > > > > > > > Another behavior: with kernel 2.6.23, if I run iozone for many > > > times > > > > > after rebooting machine, > > > the result looks stable. But with 2.6.24-rc1, the first run of > > > iozone > > > > > got a very small result and > > > following run has 4Xorig_result. > > > > > > What I reported is the regression of 2nd/3rd run, because first run > > > has > > > > > bigger regression. > > > > > > I also tried to change > > > /proc/sys/vm/dirty_ratio,dirty_backgroud_ratio > > > > > and didn't get improvement. > > could you tell us the exact iozone command you are using? > iozone -i 0 -r 4k -s 512m > OK, I definitely do not see the reported effect. On a HP Proliant with a RAID5 on CCISS I get: 2.6.19.2: 654-738 MB/sec write, 1126-1154 MB/sec rewrite 2.6.24-rc2: 772-820 MB/sec write, 1495-1539 MB/sec rewrite The first run is always slowest, all subsequent runs are faster and the same speed. > > > I would like to repeat it on my setup, because I definitely see > the > opposite behaviour in 2.6.24-rc1/rc2. The speed there is much > better > than in 2.6.22 and before (I skipped 2.6.23, because I was waiting > for > the per-bdi changes). I definitely do not see the difference between > 1st > and subsequent runs. But then, I do my tests with 5GB file sizes like: > > > > iozone3_283/src/current/iozone -t 5 -F /scratch/X1 > /scratch/X2 > /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1 > My machine uses SATA (AHCI) disk. > 4x72GB SCSI disks building a RAID5 on a CCISS controller with battery backed write cache. Systems are 2 CPUs (64-bit) with 8 GB memory. I could test on some IBM boxes (2x dual core, 8 GB) with RAID5 on "aacraid", but I need some time to free up one of the boxes. Cheers Martin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.24-rc1: First impressions
Hi, just to give some feedback on 2.6.24-rc1.

For some time I have been tracking IO/writeback problems that hurt system responsiveness big-time. I tested Peter's stuff together with Fengguang's additions and it looked promising. Therefore I was very happy to see Peter's stuff going into 2.6.24 and waited eagerly for rc1. In short, I am impressed. This really looks good. IO throughput is great and I could not reproduce the responsiveness problems so far.

Below are some numbers from my brute-force I/O tests that I can use to bring responsiveness down. My platform is a HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory, Smart Array 6i controller with 4x72GB SCSI disks as RAID5 (battery protected writeback cache enabled) and gigabit networking (tg3). User space is 64-bit RHEL4.3. I am basically doing copies using "dd" with 1MB blocksize. Local filesystem is ext2 (noatime). IO scheduler is deadline, as it tends to give best results. NFS3 server is a Sun/T2000/Solaris10.

The tests are:

dd1         - copy 16 GB from /dev/zero to local FS
dd1-dir     - same, but using O_DIRECT for output
dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to local FS
dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero to local FS
net1        - copy 5.2 GB from NFS3 share to local FS
mix3        - copy 3x5.2 GB from /dev/zero to local disk and two NFS3 shares

I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1. All units are MB/sec.

test     2.6.19.2  2.6.22.6  2.6.24-rc1
dd1      28        50        96
dd1-dir  88        88        86
dd2      2x16.5    2x11      2x44.5
dd2-dir  2x44      2x44      2x43
dd3      3x9.8     3x8.7     3x30
dd3-dir  3x29.5    3x29.5    3x28.5
net1     30-33     50-55     37-52
mix3     17/32     25/50     96/35    (disk/combined-network)

Some observations:

- single threaded disk speed really went up with 2.6.24-rc1. It is now even better than O_DIRECT
- O_DIRECT took a slight hit compared to the older kernels. Not an issue for me, but maybe others care
- multi threaded non O_DIRECT scales for the first time ever. Almost no loss compared to single threaded !!
- network throughput took a hit from 2.6.22.6 and is not as repeatable. Still better than 2.6.19.2 though

What actually surprises me most is the big performance win on the single threaded non O_DIRECT dd test. I did not expect that :-) What I had hoped for was of course the scalability. So, this looks great and most likely I will push 2.6.24 (maybe .X) into my environment.

Happy weekend
Martin
------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc1: First impressions
- Original Message > From: Andrew Morton <[EMAIL PROTECTED]> > To: Arjan van de Ven <[EMAIL PROTECTED]> > Cc: Ingo Molnar <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; > linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL > PROTECTED]; [EMAIL PROTECTED] > Sent: Saturday, October 27, 2007 7:59:51 AM > Subject: Re: 2.6.24-rc1: First impressions > > On Fri, 26 Oct 2007 22:46:57 -0700 Arjan van de > Ven > wrote: > > > > > > dd1 - copy 16 GB from /dev/zero to local FS > > > > > dd1-dir - same, but using O_DIRECT for output > > > > > dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to > local > FS > > > > > dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero lo > local > FS > > > > > net1 - copy 5.2 GB from NFS3 share to local FS > > > > > mix3 - copy 3x5.2 GB from /dev/zero to local disk and two NFS3 > > > > > shares > > > > > > > > > > I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1. All > > > > > units are MB/sec. > > > > > > > > > > test 2.6.19.2 2.6.22.62.6.24.-rc1 > > > > > > > > > > > > dd1 28 50 96 > > > > > dd1-dir 88 88 86 > > > > > dd2 2x16.5 2x11 2x44.5 > > > > > dd2-dir2x44 2x44 2x43 > > > > > dd3 3x9.83x8.7 3x30 > > > > > dd3-dir 3x29.5 3x29.5 3x28.5 > > > > > net1 30-3350-55 37-52 > > > > > mix3 17/3225/50 96/35 > > > > > (disk/combined-network) > > > > > > > > wow, really nice results! > > > > > > Those changes seem suspiciously large to me. I wonder if > there's > less > > > physical IO happening during the timed run, and > correspondingly > more > > > afterwards. > > > > > > > another option... this is ext2.. didn't the ext2 reservation > stuff > get > > merged into -rc1? for ext3 that gave a 4x or so speed boost (much > > better sequential allocation pattern) > > > > Yes, one would expect that to make a large difference in > dd2/dd2-dir > and > dd3/dd3-dir - but only on SMP. On UP there's not enough concurrency > in the fs block allocator for any damage to occur. > Just for the record the test are done on SMP. > Reservations won't affect dd1 though, and that went faster too. > This is the one result that surprised me most, as I did not really expect any big moves here. I am not complaining :-), but definitely it would be nice to understand the why. Cheers Martin > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc1: First impressions
- Original Message
> From: Ingo Molnar <[EMAIL PROTECTED]>
> To: Andrew Morton <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Sent: Friday, October 26, 2007 9:33:40 PM
> Subject: Re: 2.6.24-rc1: First impressions
>
> * Andrew Morton wrote:
>
> > > > dd1 - copy 16 GB from /dev/zero to local FS
> > > > dd1-dir - same, but using O_DIRECT for output
> > > > dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to local FS
> > > > dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero to local FS
> > > > net1 - copy 5.2 GB from NFS3 share to local FS
> > > > mix3 - copy 3x5.2 GB from /dev/zero to local disk and two NFS3 shares
> > > >
> > > > I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1. All units
> > > > are MB/sec.
> > > >
> > > > test     2.6.19.2  2.6.22.6  2.6.24-rc1
> > > > dd1      28        50        96
> > > > dd1-dir  88        88        86
> > > > dd2      2x16.5    2x11      2x44.5
> > > > dd2-dir  2x44      2x44      2x43
> > > > dd3      3x9.8     3x8.7     3x30
> > > > dd3-dir  3x29.5    3x29.5    3x28.5
> > > > net1     30-33     50-55     37-52
> > > > mix3     17/32     25/50     96/35    (disk/combined-network)
> > >
> > > wow, really nice results!
> >
> > Those changes seem suspiciously large to me. I wonder if there's less
> > physical IO happening during the timed run, and correspondingly more
> > afterwards.
>
> so a final 'sync' should be added to the test too, and the time it takes
> factored into the bandwidth numbers?

One of the reasons I do 15 GB transfers is to make sure that I am well above the possible page cache size. And of course I am doing a final sync to finish the runs :-) The sync is also running faster in 2.6.24-rc1. If I factor it in, the results for dd1/dd3 are:

test       2.6.19.2  2.6.22.6  2.6.24-rc1
sync time  18sec     19sec     6sec
dd1        27.5      47.5      92
dd3        3x9.1     3x8.5     3x29

So basically including the sync time makes 2.6.24-rc1 even more promising.

Now, I know that my benchmark numbers are crude and show only a very small aspect of system performance. But - it is an aspect I care about a lot. And those benchmarks match my use-case pretty well.

Cheers
Martin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
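The adjusted dd1/dd3 figures above look like straightforward arithmetic: total data moved divided by the dd run time plus the final sync time. A small sketch of that calculation, using the 16 GB dd1 case and the rates and sync times from the tables above; the exact results depend on the actually measured wall times, so treat this as an illustration only:

/*
 * Sketch: effective throughput when the final sync is charged to the
 * copy.  Rates and sync times are taken from the tables above; the dd
 * run time is approximated as total_mb / rate.
 */
#include <stdio.h>

static double effective_mbs(double total_mb, double rate_mbs, double sync_s)
{
        double dd_seconds = total_mb / rate_mbs;   /* approximate dd wall time */
        return total_mb / (dd_seconds + sync_s);   /* charge the sync as well */
}

int main(void)
{
        printf("dd1, 2.6.19.2:   ~%.1f MB/s\n", effective_mbs(16000.0, 28.0, 18.0));
        printf("dd1, 2.6.24-rc1: ~%.1f MB/s\n", effective_mbs(16000.0, 96.0, 6.0));
        return 0;
}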
Understanding I/O behaviour - next try
Keywords: I/O, bdi-v9, cfs

Hi, a while ago I asked a few questions on the Linux I/O behaviour, because I was (and still am) fighting some "misbehaviour" related to heavy I/O.

The basic setup is a dual x86_64 box with 8 GB of memory. The DL380 has a HW RAID5, made from 4x72GB disks and about 100 MB write cache. The performance of the block device with O_DIRECT is about 90 MB/sec.

The problematic behaviour comes when we are moving large files through the system. The file usage in this case is mostly "use once" or streaming. As soon as the amount of file data is larger than 7.5 GB, we see occasional unresponsiveness of the system (e.g. no more ssh connections into the box) of more than 1 or 2 minutes (!) duration (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and some other poor guys being in "D" state.

The data flows in basically three modes. All of them are affected:

local-disk -> NFS
NFS -> local-disk
NFS -> NFS

NFS is V3/TCP.

So, I made a few experiments in the last few days, using three different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.

The first observation (independent of the kernel) is that we *should* use O_DIRECT, at least for output to the local disk. Here we see about 90 MB/sec write performance. A simple "dd" using 1, 2 and 3 parallel threads to the same block device (through an ext2 FS) gives:

O_DIRECT:     88 MB/s, 2x44, 3x29.5
non-O_DIRECT: 51 MB/s, 2x19, 3x12.5

- Observation 1a: IO schedulers are mostly equivalent, with CFQ slightly worse than AS and DEADLINE
- Observation 1b: when using 2.6.22.5+cfs20.4, the non-O_DIRECT performance goes [slightly] down. With three threads it is 3x10 MB/s. Ingo?
- Observation 1c: bdi-v9 does not help in this case, which is not surprising.

The real question here is why the non-O_DIRECT case is so slow. Is this a general thing? Is this related to the CCISS controller? Using O_DIRECT is unfortunately not an option for us.

When using three different targets (local disk plus two different NFS filesystems) bdi-v9 is a big winner. Without it, all threads are [seem to be] limited to the speed of the slowest FS. With bdi-v9 we see a considerable speedup.

Just by chance I found out that doing all I/O in sync mode does prevent the load from going up. Of course, I/O throughput is not stellar (but not much worse than the non-O_DIRECT case). But the responsiveness seems OK. Maybe a solution, as this can be controlled via mount (would be great for O_DIRECT :-).

In general 2.6.22 seems to be better than 2.6.19, but this is highly subjective :-( I am using the following settings in /proc. They seem to provide the smoothest responsiveness:

vm.dirty_background_ratio = 1
vm.dirty_ratio = 1
vm.swappiness = 1
vm.vfs_cache_pressure = 1

Another thing I saw during my tests is that when writing to NFS, the "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing, or a bug?

In any case, view this as a report for one specific loadcase that does not behave very well. It seems there are ways to make things better (sync, per device throttling, ...), but nothing "perfect" yet. Use once does seem to be a problem.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
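For reference, part of what makes O_DIRECT awkward to adopt in applications like the ones above is its alignment requirements. A minimal sketch of an O_DIRECT streaming writer along the lines of the dd runs follows; the output path and the 1 GB size are made up for illustration, and the required alignment granularity varies with kernel and filesystem (sector size or block size), so this is a sketch rather than a drop-in tool:

/*
 * Sketch of an O_DIRECT streaming write, roughly what the dd runs do.
 * O_DIRECT requires buffer, offset and transfer size to be aligned,
 * hence the posix_memalign() and the fixed 1 MB block size.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK (1024 * 1024)             /* 1 MB per write, like the tests */

int main(void)
{
        void *buf;
        int i, fd;

        if (posix_memalign(&buf, 4096, BLOCK)) {
                perror("posix_memalign");
                return 1;
        }
        memset(buf, 0, BLOCK);          /* mimic reading from /dev/zero */

        fd = open("/scratch/testfile",  /* hypothetical output path */
                  O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < 1024; i++)      /* 1 GB total for the sketch */
                if (write(fd, buf, BLOCK) != BLOCK) {
                        perror("write");
                        return 1;
                }
        fsync(fd);                      /* flush whatever is still cached */
        close(fd);
        free(buf);
        return 0;
}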
Re: Understanding I/O behaviour - next try
--- Fengguang Wu <[EMAIL PROTECTED]> wrote: > On Tue, Aug 28, 2007 at 08:53:07AM -0700, Martin Knoblauch wrote: > [...] > > The basic setup is a dual x86_64 box with 8 GB of memory. The > DL380 > > has a HW RAID5, made from 4x72GB disks and about 100 MB write > cache. > > The performance of the block device with O_DIRECT is about 90 > MB/sec. > > > > The problematic behaviour comes when we are moving large files > through > > the system. The file usage in this case is mostly "use once" or > > streaming. As soon as the amount of file data is larger than 7.5 > GB, we > > see occasional unresponsiveness of the system (e.g. no more ssh > > connections into the box) of more than 1 or 2 minutes (!) duration > > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads > and > > some other poor guys being in "D" state. > [...] > > Just by chance I found out that doing all I/O inc sync-mode does > > prevent the load from going up. Of course, I/O throughput is not > > stellar (but not much worse than the non-O_DIRECT case). But the > > responsiveness seem OK. Maybe a solution, as this can be controlled > via > > mount (would be great for O_DIRECT :-). > > > > In general 2.6.22 seems to bee better that 2.6.19, but this is > highly > > subjective :-( I am using the following setting in /proc. They seem > to > > provide the smoothest responsiveness: > > > > vm.dirty_background_ratio = 1 > > vm.dirty_ratio = 1 > > vm.swappiness = 1 > > vm.vfs_cache_pressure = 1 > > You are apparently running into the sluggish kupdate-style writeback > problem with large files: huge amount of dirty pages are getting > accumulated and flushed to the disk all at once when dirty background > ratio is reached. The current -mm tree has some fixes for it, and > there are some more in my tree. Martin, I'll send you the patch if > you'd like to try it out. > Hi Fengguang, Yeah, that pretty much describes the situation we end up. Although "sluggish" is much to friendly if we hit the situation :-) Yes, I am very interested to check out your patch. I saw your postings on LKML already and was already curious. Any chance you have something agains 2.6.22-stable? I have reasons not to move to -23 or -mm. > > Another thing I saw during my tests is that when writing to NFS, > the > > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual > thing, > > or a bug? > > What are the nr_unstable numbers? > Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty numbers for the disk case. Good to know. For NFS, the nr_writeback numbers seem surprisingly high. They also go to 80-90k (pages ?). In the disk case they rarely go over 12k. Cheers Martin > Fengguang > > -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Understanding I/O behaviour - next try
--- Fengguang Wu <[EMAIL PROTECTED]> wrote: > On Wed, Aug 29, 2007 at 01:15:45AM -0700, Martin Knoblauch wrote: > > > > --- Fengguang Wu <[EMAIL PROTECTED]> wrote: > > > > > You are apparently running into the sluggish kupdate-style > writeback > > > problem with large files: huge amount of dirty pages are getting > > > accumulated and flushed to the disk all at once when dirty > background > > > ratio is reached. The current -mm tree has some fixes for it, and > > > there are some more in my tree. Martin, I'll send you the patch > if > > > you'd like to try it out. > > > > > Hi Fengguang, > > > > Yeah, that pretty much describes the situation we end up. Although > > "sluggish" is much to friendly if we hit the situation :-) > > > > Yes, I am very interested to check out your patch. I saw your > > postings on LKML already and was already curious. Any chance you > have > > something agains 2.6.22-stable? I have reasons not to move to -23 > or > > -mm. > > Well, they are a dozen patches from various sources. I managed to > back-port them. It compiles and runs, however I cannot guarantee > more... > Thanks. I understand the limited scope of the warranty :-) I will give it a spin today. > > > > Another thing I saw during my tests is that when writing to > NFS, > > > the > > > > "dirty" or "nr_dirty" numbers are always 0. Is this a > conceptual > > > thing, > > > > or a bug? > > > > > > What are the nr_unstable numbers? > > > > > > > Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty > > numbers for the disk case. Good to know. > > > > For NFS, the nr_writeback numbers seem surprisingly high. They > also go > > to 80-90k (pages ?). In the disk case they rarely go over 12k. > > Maybe the difference of throttling one single 'cp' and a dozen > 'nfsd'? > No "nfsd" running on that box. It is just a client. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
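A small way to watch the counters discussed here while a test runs: on 2.6.22-era kernels the page states show up in /proc/vmstat as nr_dirty, nr_writeback and nr_unstable (the NFS case). A minimal sketch of a once-per-second poller, assuming that file layout:

/*
 * Sketch: print the writeback-related counters from /proc/vmstat once
 * a second, to watch nr_dirty/nr_writeback/nr_unstable during a test.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char line[128];

        for (;;) {
                FILE *f = fopen("/proc/vmstat", "r");

                if (!f) {
                        perror("/proc/vmstat");
                        return 1;
                }
                while (fgets(line, sizeof(line), f))
                        if (!strncmp(line, "nr_dirty ", 9) ||
                            !strncmp(line, "nr_writeback ", 13) ||
                            !strncmp(line, "nr_unstable ", 12))
                                fputs(line, stdout);
                fclose(f);
                printf("--\n");          /* separator between samples */
                sleep(1);
        }
        return 0;
}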
Re: Understanding I/O behaviour - next try
--- Jens Axboe <[EMAIL PROTECTED]> wrote: > On Tue, Aug 28 2007, Martin Knoblauch wrote: > > Keywords: I/O, bdi-v9, cfs > > > > Try limiting the queue depth on the cciss device, some of those are > notoriously bad at starving commands. Something like the below hack, > see > if it makes a difference (and please verify in dmesg that it prints > the > message about limiting depth!): > > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c > index 084358a..257e1c3 100644 > --- a/drivers/block/cciss.c > +++ b/drivers/block/cciss.c > @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c, > struct pci_dev *pdev) > if (board_id == products[i].board_id) { > c->product_name = products[i].product_name; > c->access = *(products[i].access); > +#if 0 > c->nr_cmds = products[i].nr_cmds; > +#else > + c->nr_cmds = 2; > + printk("cciss: limited max commands to 2\n"); > +#endif > break; > } > } > > -- > Jens Axboe > > > Hi Jens, thanks for the suggestion. Unfortunatelly the non-direct [parallel] writes to the device got considreably slower. I guess the "6i" controller copes better with higher values. Can nr_cmds be changed at runtime? Maybe there is a optimal setting. [ 69.438851] SCSI subsystem initialized [ 69.442712] HP CISS Driver (v 3.6.14) [ 69.442871] ACPI: PCI Interrupt :04:03.0[A] -> GSI 51 (level, low) -> IRQ 51 [ 69.442899] cciss: limited max commands to 2 (Smart Array 6i) [ 69.482370] cciss0: <0x46> at PCI :04:03.0 IRQ 51 using DAC [ 69.494352] blocks= 426759840 block_size= 512 [ 69.498350] heads=255, sectors=32, cylinders=52299 [ 69.498352] [ 69.498509] blocks= 426759840 block_size= 512 [ 69.498602] heads=255, sectors=32, cylinders=52299 [ 69.498604] [ 69.498608] cciss/c0d0: p1 p2 Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Understanding I/O behaviour - next try
--- Chuck Ebbert <[EMAIL PROTECTED]> wrote: > On 08/28/2007 11:53 AM, Martin Knoblauch wrote: > > > > The basic setup is a dual x86_64 box with 8 GB of memory. The > DL380 > > has a HW RAID5, made from 4x72GB disks and about 100 MB write > cache. > > The performance of the block device with O_DIRECT is about 90 > MB/sec. > > > > The problematic behaviour comes when we are moving large files > through > > the system. The file usage in this case is mostly "use once" or > > streaming. As soon as the amount of file data is larger than 7.5 > GB, we > > see occasional unresponsiveness of the system (e.g. no more ssh > > connections into the box) of more than 1 or 2 minutes (!) duration > > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads > and > > some other poor guys being in "D" state. > > Try booting with "mem=4096M", "mem=2048M", ... > > hmm. I tried 1024M a while ago and IIRC did not see a lot [any] difference. But as it is no big deal, I will repeat it tomorrow. Just curious - what are you expecting? Why should it help? Thanks Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: regression of autofs for current git?
On Wed, 2007-08-29 at 20:09 -0700, Ian Kent wrote: > >http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=75180df2ed467866ada839fe73cf7cc7d75c0a22 > >This (and it's related patches) may be the problem. >I can probably tell if you post your map or if you strace the automount >process managing the a problem mount point and look for mount returning >EBUSY when it should succeed. Likely. That is the one that will break the user-space automounter as well (and keeps me from .23). I don't care very much about what the default is, but it would be great if the new behaviour could be globally changed at run- (or boot-) time. It will be some time until the new mount option makes it into the distros. Cheers Martin PS: Sorry, but I likely killed the CC list ------ Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Understanding I/O behaviour - next try
--- Robert Hancock <[EMAIL PROTECTED]> wrote: > > I saw a bulletin from HP recently that sugggested disabling the > write-back cache on some Smart Array controllers as a workaround > because > it reduced performance in applications that did large bulk writes. > Presumably they are planning on releasing some updated firmware that > fixes this eventually.. > > -- > Robert Hancock Saskatoon, SK, Canada > To email, remove "nospam" from [EMAIL PROTECTED] > Home Page: http://www.roberthancock.com/ > Robert, just checked it out. At least with the "6i", you do not want to disable the WBC :-) Performance really goes down the toilet for all cases. Do you still have a pointer to that bulletin? Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Understanding I/O behaviour - next try
--- Jens Axboe <[EMAIL PROTECTED]> wrote:

> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack, see
> if it makes a difference (and please verify in dmesg that it prints the
> message about limiting depth!):
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
> 		if (board_id == products[i].board_id) {
> 			c->product_name = products[i].product_name;
> 			c->access = *(products[i].access);
> +#if 0
> 			c->nr_cmds = products[i].nr_cmds;
> +#else
> +			c->nr_cmds = 2;
> +			printk("cciss: limited max commands to 2\n");
> +#endif
> 			break;
> 		}
> 	}
>
> --
> Jens Axboe

Hi Jens,

how exactly is the queue depth related to the max # of commands? I ask, because with the 2.6.22 kernel the "maximum queue depth since init" never seems to be higher than 16, even with much higher outstanding commands. On a 2.6.19 kernel, maximum queue depth is much higher, just a bit below "max # of commands since init".

[2.6.22]# cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Max sectors: 2048
Current Q depth: 0
Current # commands on controller: 145
Max Q depth since init: 16
Max # commands on controller since init: 204
Max SG entries since init: 31
Sequential access devices: 0

[2.6.19] cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 197
Max # commands on controller since init: 198
Max SG entries since init: 31
Sequential access devices: 0

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: recent nfs change causes autofs regression
--- Ian Kent <[EMAIL PROTECTED]> wrote:

> On Thu, 30 Aug 2007, Linus Torvalds wrote:
> >
> > On Fri, 31 Aug 2007, Trond Myklebust wrote:
> > >
> > > It did not. The previous behaviour was to always silently override the
> > > user mount options.
> >
> > ..so it still worked for any sane setup, at least.
> >
> > You broke that. Hua gave good reasons for why he cannot use the current
> > kernel. It's a regression.
> >
> > In other words, the new behaviour is *worse* than the behaviour you
> > consider to be the incorrect one.
>
> This all came about due to complaints about not being able to mount the
> same server file system with different options, most commonly ro vs. rw,
> which I think was due to the shared super block changes some time ago.
> And, to some extent, I have to plead guilty for not complaining enough
> about this default in the beginning, which is basically unacceptable for
> sure.
>
> We have seen breakage in Fedora with the introduction of the patches and
> this is typical of it. It also breaks amd and admins have no way of
> altering this that I'm aware of (help us here Ion).
>
> I understand Trond's concerns but the fact remains that other Unixes allow
> this behaviour but don't assert cache coherency and many sysadmins don't
> realize this. So the broken behavior is expected to work and we can't
> simply stop allowing it unless we want to attend a public hanging with us
> as the participants.
>
> There is no question that the new behavior is worse and this change is
> unacceptable as a solution to the original problem.
>
> I really think that reversing the default, as has been suggested,
> documenting the risk in the mount.nfs man page and perhaps issuing a
> warning from the kernel is a better way to handle this. At least we will
> be doing more to raise public awareness of the issue than others.

I can only second that. Changing the default behavior in this way is really bad. Not that I am disagreeing with the technical reasons, but the change breaks working setups. And -EBUSY is not very helpful as a message here. It does not matter that the user tools may handle the breakage incorrectly. The users (admins) had working setups for years. And they were obviously working "good enough". And one should not forget that there will be a considerable time until "nosharecache" trickles down into distributions. If the situation stays this way, quite a few people will not be able to move beyond 2.6.22 for some time.

E.g. I am working for a company that operates some Linux "clusters" at a few German automotive companies. For certain reasons everything there is based on automounter maps (both autofs and amd style). We have almost zero influence on that setup. The maps are a mess - we will run into the sharecache problem. At the same time I am trying to fight the notorious "system turns into frozen molasses on moderate I/O load". There may be some interesting developments coming after 2.6.22. Not good :-(

What I would like to see done for the situation at hand is:

- make "nosharecache" the default for the foreseeable future
- log any attempt to mount option-inconsistent NFS filesystems to dmesg and syslog (apparently the NFS client is able to detect them :-). Do this regardless of the "nosharecache" option. This way admins will at least be made aware of the situation.
- In a year or so we can talk about making the default safe. With proper advertising.

Just my 0.02.
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: recent nfs change causes autofs regression
--- Jakob Oestergaard <[EMAIL PROTECTED]> wrote:

> On Fri, Aug 31, 2007 at 09:43:29AM -0700, Linus Torvalds wrote:
> ...
> > This is *not* a security hole. In order to make it a security hole, you
> > need to be root in the first place.
>
> Non-root users can write to places where root might believe they cannot write
> because he might be under the mistaken assumption that ro means ro.
>
> I am under the impression that that could have implications in some setups.

That was never in question.

> ...
> >
> > - it's a misfeature that people are used to, and has been around forever.
>
> Sure, they're used to it, but I doubt they are aware of it.

So, the right thing to do (tm) is to make them aware without breaking their setup. Log any detected inconsistencies in the dmesg buffer and to syslog. If the sysadmin is not competent enough to notice, too bad.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RFC: [PATCH] Small patch on top of per device dirty throttling -v9
--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> > --- Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> >
> > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > >
> > > > Peter,
> > > >
> > > > any chance to get a rollup against 2.6.22-stable?
> > > >
> > > > The 2.6.23 series may not be usable for me due to the
> > > > nosharedcache changes for NFS (the new default will massively
> > > > disturb the user-space automounter).
> > >
> > > I'll see what I can do, bit busy with other stuff atm, hopefully after
> > > the weekend.
> >
> > Hi Peter,
> >
> > any progress on a version against 2.6.22.5? I have seen the very
> > positive report from Jeffrey W. Baker and would really love to test
> > your patch. But as I said, anything newer than 2.6.22.x might not be an
> > option due to the NFS changes.
>
> mindless port, seems to compile and boot on my test box ymmv.

Hi Peter,

while doing my tests I observed that setting dirty_ratio below 5% did not make a difference at all. Just by chance I found that this apparently is an enforced limit in mm/page-writeback.c. With the patch below I have lowered the limit to 2%. With that, things look a lot better on my systems. Load during write stays below 1.5 for one writer. Responsiveness is good. This may even help without the throttling patch.

Not sure that this is the right thing to do, but it helps :-)

Cheers
Martin

--- linux-2.6.22.5-bdi-v9/mm/page-writeback.c
+++ linux-2.6.22.6+bdi-v9/mm/page-writeback.c
@@ -311,8 +311,11 @@
 	if (dirty_ratio > unmapped_ratio / 2)
 		dirty_ratio = unmapped_ratio / 2;

-	if (dirty_ratio < 5)
-		dirty_ratio = 5;
+/*
+** MKN: Lower enforced limit from 5% to 2%
+*/
+	if (dirty_ratio < 2)
+		dirty_ratio = 2;

 	background_ratio = dirty_background_ratio;
 	if (background_ratio >= dirty_ratio)
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
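To put the 5% vs. 2% floor into megabytes on the 8 GB boxes discussed in this thread: the ratio is applied against the kernel's notion of dirtyable memory, so the floor roughly decides how much data may sit dirty before throttling starts. A back-of-the-envelope sketch follows; the 7.8 GB figure is an assumption, and the real calculation in mm/page-writeback.c also applies the unmapped-ratio clamp visible in the patch above.

/*
 * Back-of-the-envelope: what the dirty_ratio floor means in megabytes.
 * Assumes roughly 7.8 GB of dirtyable memory on an 8 GB box; the real
 * kernel calculation (mm/page-writeback.c) is more involved.
 */
#include <stdio.h>

int main(void)
{
        double dirtyable_mb = 7.8 * 1024;
        int ratios[] = { 5, 2, 1 };
        int i;

        for (i = 0; i < 3; i++)
                printf("dirty_ratio %d%% -> throttling above ~%.0f MB of dirty data\n",
                       ratios[i], dirtyable_mb * ratios[i] / 100.0);
        return 0;
}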
Re: huge improvement with per-device dirty throttling
--- Leroy van Logchem <[EMAIL PROTECTED]> wrote: > Andrea Arcangeli wrote: > > On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote: > >> Ok perhaps the new adaptive dirty limits helps your single disk > >> a lot too. But your improvements seem to be more "collateral > damage" @) > >> > >> But if that was true it might be enough to just change the dirty > limits > >> to get the same effect on your system. You might want to play with > >> /proc/sys/vm/dirty_* > > > > The adaptive dirty limit is per task so it can't be reproduced with > > global sysctl. It made quite some difference when I researched into > it > > in function of time. This isn't in function of time but it > certainly > > makes a lot of difference too, actually it's the most important > part > > of the patchset for most people, the rest is for the corner cases > that > > aren't handled right currently (writing to a slow device with > > writeback cache has always been hanging the whole thing). > > > Self-tuning > static sysctl's. The last years we needed to use very > small values for dirty_ratio and dirty_background_ratio to soften the > > latency problems we have during sustained writes. Imo these patches > really help in many cases, please commit to mainline. > > -- > Leroy > while it helps in some situations, I did some tests today with 2.6.22.6+bdi-v9 (Peter was so kind) which seem to indicate that it hurts NFS writes. Anyone seen similar effects? Otherwise I would just second your request. It definitely helps the problematic performance of my CCISS-based RAID5 volume. Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: huge improvement with per-device dirty throttling
--- Andrea Arcangeli <[EMAIL PROTECTED]> wrote: > On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote: > > Ok perhaps the new adaptive dirty limits helps your single disk > > a lot too. But your improvements seem to be more "collateral > damage" @) > > > > But if that was true it might be enough to just change the dirty > limits > > to get the same effect on your system. You might want to play with > > /proc/sys/vm/dirty_* > > The adaptive dirty limit is per task so it can't be reproduced with > global sysctl. It made quite some difference when I researched into > it > in function of time. This isn't in function of time but it certainly > makes a lot of difference too, actually it's the most important part > of the patchset for most people, the rest is for the corner cases > that > aren't handled right currently (writing to a slow device with > writeback cache has always been hanging the whole thing). didn't see that remark before. I just realized that "slow device with writeback cache" pretty well describes the CCISS controller in the DL380g4. Could you elaborate why that is a problematic case? Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: huge improvement with per-device dirty throttling
--- Martin Knoblauch <[EMAIL PROTECTED]> wrote: > > --- Leroy van Logchem <[EMAIL PROTECTED]> wrote: > > > Andrea Arcangeli wrote: > > > On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote: > > >> Ok perhaps the new adaptive dirty limits helps your single disk > > >> a lot too. But your improvements seem to be more "collateral > > damage" @) > > >> > > >> But if that was true it might be enough to just change the dirty > > limits > > >> to get the same effect on your system. You might want to play > with > > >> /proc/sys/vm/dirty_* > > > > > > The adaptive dirty limit is per task so it can't be reproduced > with > > > global sysctl. It made quite some difference when I researched > into > > it > > > in function of time. This isn't in function of time but it > > certainly > > > makes a lot of difference too, actually it's the most important > > part > > > of the patchset for most people, the rest is for the corner cases > > that > > > aren't handled right currently (writing to a slow device with > > > writeback cache has always been hanging the whole thing). > > > > > > Self-tuning > static sysctl's. The last years we needed to use very > > > small values for dirty_ratio and dirty_background_ratio to soften > the > > > > latency problems we have during sustained writes. Imo these patches > > > really help in many cases, please commit to mainline. > > > > -- > > Leroy > > > > while it helps in some situations, I did some tests today with > 2.6.22.6+bdi-v9 (Peter was so kind) which seem to indicate that it > hurts NFS writes. Anyone seen similar effects? > > Otherwise I would just second your request. It definitely helps the > problematic performance of my CCISS based RAID5 volume. > please disregard my comment about NFS write performance. What I have seen is caused by some other stuff I am toying with. So, I second your request to push this forward. Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 00/23] per device dirty throttling -v9
>Per device dirty throttling patches
>
>These patches aim to improve balance_dirty_pages() and directly
>address three issues:
>1) inter device starvation
>2) stacked device deadlocks
>3) inter process starvation
>
>1 and 2 are a direct result from removing the global dirty
>limit and using per device dirty limits. By giving each device
>its own dirty limit it will no longer starve another device,
>and the cyclic dependency on the dirty limit is broken.
>
>In order to efficiently distribute the dirty limit across
>the independent devices a floating proportion is used; this
>will allocate a share of the total limit proportional to the
>device's recent activity.
>
>3 is done by also scaling the dirty limit proportional to the
>current task's recent dirty rate.
>
>Changes since -v8:
>- cleanup of the proportion code
>- fix percpu_counter_add(&counter, -(unsigned long))
>- fix per task dirty rate code
>- fwd port to .23-rc2-mm2

Peter, any chance to get a rollup against 2.6.22-stable? The 2.6.23 series may not be usable for me due to the nosharedcache changes for NFS (the new default will massively disturb the user-space automounter). Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 00/23] per device dirty throttling -v9
--- Peter Zijlstra <[EMAIL PROTECTED]> wrote: > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote: > > > Peter, > > > > any chance to get a rollup against 2.6.22-stable? > > > > The 2.6.23 series may not be usable for me due to the > > nosharedcache changes for NFS (the new default will massively > > disturb the user-space automounter). > > I'll see what I can do, bit busy with other stuff atm, hopefully > after the weekend. > Hi Peter, that would be highly appreciated. Thanks a lot in advance. Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: iozone write 50% regression in kernel 2.6.24-rc1
- Original Message > From: "Zhang, Yanmin" <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Cc: LKML > Sent: Friday, November 9, 2007 10:47:52 AM > Subject: iozone write 50% regression in kernel 2.6.24-rc1 > > Comparing with 2.6.23, iozone sequential write/rewrite (512M) has > 50% > regression > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression. > > My machine has 8 processor cores and 8GB memory. > > By bisect, I located patch > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h= > 04fbfdc14e5f48463820d6b9807daa5e9c92c51f. > > > Another behavior: with kernel 2.6.23, if I run iozone for many > times > after rebooting machine, > the result looks stable. But with 2.6.24-rc1, the first run of > iozone > got a very small result and > following run has 4Xorig_result. > > What I reported is the regression of 2nd/3rd run, because first run > has > bigger regression. > > I also tried to change > /proc/sys/vm/dirty_ratio,dirty_backgroud_ratio > and didn't get improvement. > > -yanmin > - Hi Yanmin, could you tell us the exact iozone command you are using? I would like to repeat it on my setup, because I definitely see the opposite behaviour in 2.6.24-rc1/rc2. The speed there is much better than in 2.6.22 and before (I skipped 2.6.23, because I was waiting for the per-bdi changes). I definitely do not see the difference between 1st and subsequent runs. But then, I do my tests with 5GB file sizes like: iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1 Kind regards Martin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Binary Drivers
ntract. If it says "works with XXX" (and it does), you have no right to demand that it works with YYY, or that the manufacturer has to help you make it work with YYY. The manufacturer may not be allowed to *actively* prevent you from making it work with YYY, but I see no legal problem (IANAL, in any jurisdiction of the world) if they make it hard for you by being *passive*. If they promised that it works with YYY, it is another story. They are obliged to make it work or compensate you. How they make it work is up to them, as long as they keep the promise. Whether you are satisfied is up to you. >If you retain some rights over something, then you are not selling it >in the normal sense. You are selling a subset of the rights to it, >and the buyer must be told what rights he is getting and what rights >he is not getting. They are not keeping any right from you. They are just not being helpful. And now let's stop the car nonsense :-) Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Binary Drivers
On 12/25/06, David Schwartz <[EMAIL PROTECTED]> wrote: > If I bought the car from the manufacturer, it also must > include any rights the manufacturer might have to the car's use. > That includes using the car to violate emission control measures. > If I didn't buy the right to use the car that way (insofar as > that right was owned by the car manufacturer), I didn't > buy the whole car -- just *some* of the rights to use it. just to be dense - what makes you think that the car manufacturer has any legal right to violate emission control measures? What utter nonsense (sorry). So, let's stop the stupid car comparisons. They are not being funny any more. Martin ------ Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Binary Drivers
--- James C Georgas <[EMAIL PROTECTED]> wrote: > On Tue, 2006-26-12 at 03:20 -0800, Martin Knoblauch wrote: > > On 12/25/06, David Schwartz <[EMAIL PROTECTED]> wrote: > > > > > If I bought the car from the manufacturer, it also must > > > include any rights the manufacturer might have to the car's use. > > > That includes using the car to violate emission control measures. > > > If I didn't buy the right to use the car that way (insofar as > > > that right was owned by the car manufacturer), I didn't > > > buy the whole car -- just *some* of the rights to use it. > > > > just to be dense - what makes you think that the car manufacturer > has > > any legal right to violate emission control measures? What utter > > nonsense (sorry). > > > > So, let's stop the stupid car comparisons. They are not being funny > any > > more. > > Let's summarize the current situation: > > 1) Hardware vendors don't have to tell us how to program their > products, as long as they provide some way to use it > (i.e. binary blob driver). > Correct, as far as I can tell. > 2) Hardware vendors don't want to tell us how to program their > products, because they think this information is their secret > sauce (or maybe their competitor's secret sauce). > - or they are ashamed to show the world what kind of crap they sell - or they have lost (never had) the documentation themselves. I tend not to believe this > 3) Hardware vendors don't tell us how to program their products, > because they know about (1) and they believe (2). > - or they are just ignorant > 4) We need products with datasheets because of our development model. > - correct > 5) We want products with capabilities that these vendors advertise. > we want open-spec products that meet the performance of the high-end closed-spec products > 6) Products that satisfy both (4) and (5) are often scarce or > non-existent. > unfortunately > > So far, the suggestions I've seen to resolve the above conflict fall > into three categories: > > a) Force vendors to provide datasheets. > > b) Entice vendors to provide datasheets. > > c) Reverse engineer the hardware and write our own datasheets. > > Solution (a) involves denial of point (1), mostly through the use of > analogy and allegory. Alternatively, one can try to change the law > through government channels. > good luck > Solution (b) requires market pressure, charity, or visionary > management. > We can't exert enough market pressure currently to make much > difference. > Charity sometimes gives us datasheets for old hardware. Visionary > management is the future. > - Old hardware is not interesting in most markets - Visionary management is rare > Solution (c) is what we do now, with varying degrees of success. A > good example is the R300 support in the radeon DRM module. > But the R300 does not meet 5) Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Binary Drivers
--- Trent Waddington <[EMAIL PROTECTED]> wrote: > On 12/26/06, Martin Knoblauch <[EMAIL PROTECTED]> wrote: > > > > Oh, if only for Christmas - stop this stupid car comparisons. They > are > > just that - utter nonsense. > > > > And now let's stop the car nonsense :-) > > I agree, if you really want to talk about cars, I can relate the woes > I've heard from mechanics about how impossible it is to service new > model Fords these days. A behaviour that is not very different from the GMs, BMWs, Daimler-Chryslers, Toyotas, "you name them" of this world. I never said I liked the attitude. > Without the engine management systems > diagnostics devices they can't do anything. Ford controls who gets > these devices and demands a cut of every service, essentially setting > the price. Service centers that don't play ball don't get the > devices or get the devices taken away from them if they question > Ford's pricing policies. Of course, this should be illegal, and our > governments should be enforcing antitrust laws, but Ford is a big > company and has lots of lawyers.. > Actually we have/had a similar situation here in Germany. We are used to having "licensed dealerships" which are only allowed to sell one car brand. This might be illegal by EU laws now. > Repco and other after market manufacturers can't easily make a clone > of these devices like they do every other part, because reverse > engineering software is not really as advanced as reverse engineering > spare parts.. or maybe software reverse engineering is just so much > more expensive than automotive reverse engineering that it is not > cost effective to clone these devices.. or maybe they're just afraid > of the lawyers too. > Understanding software is more difficult, because you also have to understand the working principle of the underlying hardware, which you often have no specs for either. So you have to reverse engineer both layers. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
How to detect multi-core and/or HT-enabled CPUs in 2.4.x and 2.6.x kernels
Hi, (please CC on replies, thanks) for the ganglia project (http://ganglia.sourceforge.net/) we are trying to find a heuristic to determine the number of physical CPU "cores" as opposed to virtual processors added by enabling HT. The method should work on 2.4 and 2.6 kernels. So far it seems that looking at the "physical id", "core id" and "cpu cores" fields of /proc/cpuinfo is the way to go. In 2.6 I would try to find the distinct "physical id"s and sum up the corresponding "cpu cores". The question is whether this would work for 2.4 based systems. Does anybody recall when the "physical id", "core id" and "cpu cores" were added to /proc/cpuinfo ? Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
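In shell terms, the heuristic described above amounts to something like the following sketch. It assumes /proc/cpuinfo actually exposes "physical id" and "cpu cores", which older 2.4 kernels generally do not, so a fallback to the plain "processor" count is still needed:

#!/bin/sh
# Count physical cores: take the distinct "physical id" values and add up
# the "cpu cores" reported for each package. Prints "unknown" when the
# fields are missing (e.g. on most 2.4 kernels).
awk -F: '
/^physical id/ { phys = $2 }
/^cpu cores/   { cores[phys] = $2 }
END {
    total = 0
    for (p in cores) total += cores[p]
    if (total > 0) print total; else print "unknown"
}' /proc/cpuinfo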
Re: How to detect multi-core and/or HT-enabled CPUs in 2.4.x and 2.6.x kernels
--- Arjan van de Ven <[EMAIL PROTECTED]> wrote: > On Wed, 2006-12-27 at 06:16 -0800, Martin Knoblauch wrote: > > Hi, (please CC on replies, thanks) > > > > for the ganglia project (http://ganglia.sourceforge.net/) we are > > trying to find a heuristic to determine the number of physical CPU > > "cores" as opposed to virtual processors added by enabling HT. The > > method should work on 2.4 and 2.6 kernels. > > I have a counter question for you.. what are you trying to do with > the > "these two are SMT siblings" information ? > > Because I suspect "HT" is the wrong level of detection for what you > really want to achieve > > If you want to decide "shares caches" then at least 2.6 kernels > directly > export that (and HT is just the wrong way to go about this). > -- Hi Arjan, one piece of information that Ganglia collects for a node is the "number of CPUs", originally meaning "physical CPUs". With the introduction of HT and multi-core things are a bit more complex now. We have decided that HT siblings do not qualify as "real" CPUs, while multi-cores do. Currently we are doing "sysconf(_SC_NPROCESSORS_ONLN)". But this includes both physical and virtual (HT) cores. We are looking for a method that only shows "real iron" and works on 2.6 and 2.4 kernels. Whether this has any practical value is a completely different question. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
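For comparison, both counts can be approximated from the shell (a sketch; getconf is simply the command-line counterpart of the sysconf() call mentioned above, and the package count again relies on "physical id" being present in /proc/cpuinfo):

getconf _NPROCESSORS_ONLN                             # counts HT siblings, like sysconf(_SC_NPROCESSORS_ONLN)
grep '^physical id' /proc/cpuinfo | sort -u | wc -l   # number of physical packages (sockets) only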
Re: How to detect multi-core and/or HT-enabled CPUs in 2.4.x and 2.6.x kernels
--- Gleb Natapov <[EMAIL PROTECTED]> wrote: > On Wed, Dec 27, 2006 at 04:13:00PM +0100, Arjan van de Ven wrote: > > The original p4 HT to a large degree suffered from a too small > cache > > that now was shared. SMT in general isn't per se all that different > in > > performance than dual core, at least not on a fundamental level, > it's > > all a matter of how many resources each thread has on average. With > dual > > core sharing the cache for example, that already is part HT. > Putting the > > "boundary" at HT-but-not-dual-core is going to be highly artificial > and > > while it may work for the current hardware, in general it's not a > good > > way of separating things (just look at the PowerPC processors, > those are > > highly SMT as well), and I suspect that your distinction is just > going > > to break all the time over the next 10 years ;) Or even today on > the > > current "large cache" P4 processors with HT it already breaks. > (just > > those tend to be the expensive models so more rare) > > > If I run two threads that are doing only calculations and very little > or no > IO at all on the same socket will modern HT and dual core be the same > (or close) performance wise? > Hi Gleb, this is a really interesting question. Ganglia is coming [originally] from the HPC side of computing. At least in the past HT as implemented on XEONs did help a lot. Running two CPU+memory-bandwidth intensive processes on the same physical CPU would at best result in a 50/50 performance split. So, knowing how many "real" CPUs are in a system is interesting to us. Other workloads (like lots of java threads doing mixed IO and CPU stuff) of course can benefit from HT. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: How to detect multi-core and/or HT-enabled CPUs in 2.4.x and 2.6.x kernels
--- Gleb Natapov <[EMAIL PROTECTED]> wrote: > > > If I run two threads that are doing only calculations and very little > or no > IO at all on the same socket will modern HT and dual core be the same > (or close) performance wise? > actually I wanted to write that "HT as implemented on XEONs did not help a lot for HPC workloads in the past" Cheers Martin ------ Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: How to detect multi-core and/or HT-enabled CPUs in 2.4.x and 2.6.x kernels
>In article <[EMAIL PROTECTED]> you wrote: >> once your program (and many others) have such a check, then the next >> step will be pressure on the kernel code to "fake" the old situation >> when there is a processor where no longer >> holds. It's basically a road to madness :-( > > I agree that for HPC sizing a benchmark with various levels of > parallelity are better. The question is, if the code in question > only is for inventory reasons. In that case I would do something > like x sockets, y cores and z cm threads. > > Bernd For sizing purposes, doing benchmarks is the only way. For the purpose of Ganglia the sockets/cores/threads info is purely for inventory. And we are likely going to add the new information to our metrics. But - we still need to find a way to extract the info :-) Cheers Martin PS: I have likely killed the CC this time. Sorry. -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
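One possible shape for that extraction, again only where /proc/cpuinfo carries the fields (a sketch of the "x sockets, y cores, z threads" inventory; on kernels without "physical id"/"cpu cores" only the thread count is reliable):

sockets=$(grep '^physical id' /proc/cpuinfo | sort -u | wc -l)
cores=$(awk -F: '/^physical id/ {p=$2} /^cpu cores/ {c[p]=$2} END {t=0; for (i in c) t+=c[i]; print t}' /proc/cpuinfo)
threads=$(grep -c '^processor' /proc/cpuinfo)
echo "$sockets sockets, $cores cores, $threads threads"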
[2.6.19] NFS: server error: fileid changed
Hi, [please CC me, as I am not subscribed] after updating a RHEL4 box (EM64T based) to a plain 2.6.19 kernel, we are seeing repeated occurrences of the following messages (about every 45-50 minutes). It is always the same server (a NetApp filer, mounted via the user-space automounter "amd") and the expected/got numbers seem to repeat. Is there a way to find out which files are involved? Nothing seems to be obviously breaking, but I do not like to get my logfiles filled up.

[ 9337.747546] NFS: server nvgm022 error: fileid changed
[ 9337.747549] fsid 0:25: expected fileid 0x7a6f3d, got 0x65be80
[ 9338.020427] NFS: server nvgm022 error: fileid changed
[ 9338.020430] fsid 0:25: expected fileid 0x15f5d7c, got 0x9f9900
[ 9338.070147] NFS: server nvgm022 error: fileid changed
[ 9338.070150] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9338.338896] NFS: server nvgm022 error: fileid changed
[ 9338.338899] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9338.370207] NFS: server nvgm022 error: fileid changed
[ 9338.370210] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9338.634437] NFS: server nvgm022 error: fileid changed
[ 9338.634439] fsid 0:25: expected fileid 0x7a6f3d, got 0x22070e
[ 9338.698383] NFS: server nvgm022 error: fileid changed
[ 9338.698385] fsid 0:25: expected fileid 0x7a6f3d, got 0x352777
[ 9338.949952] NFS: server nvgm022 error: fileid changed
[ 9338.949954] fsid 0:25: expected fileid 0x15f5d7c, got 0x5988c4
[ 9339.042473] NFS: server nvgm022 error: fileid changed
[ 9339.042476] fsid 0:25: expected fileid 0x7a6f3d, got 0x9f9900
[ 9339.267338] NFS: server nvgm022 error: fileid changed
[ 9339.267341] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9339.309921] NFS: server nvgm022 error: fileid changed
[ 9339.309923] fsid 0:25: expected fileid 0x15f5d7c, got 0x65be80
[ 9339.405146] NFS: server nvgm022 error: fileid changed
[ 9339.405149] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9339.433816] NFS: server nvgm022 error: fileid changed
[ 9339.433819] fsid 0:25: expected fileid 0x15f5d7c, got 0x65be80
[ 9340.149325] NFS: server nvgm022 error: fileid changed
[ 9340.149328] fsid 0:25: expected fileid 0x7a6f3d, got 0x19bc55
[ 9340.173278] NFS: server nvgm022 error: fileid changed
[ 9340.173281] fsid 0:25: expected fileid 0x15f5d7c, got 0x22070e
[ 9340.324517] NFS: server nvgm022 error: fileid changed
[ 9340.324520] fsid 0:25: expected fileid 0x15f5d7c, got 0x11c9001

Thanks Martin ------ Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6.19] NFS: server error: fileid changed
--- Trond Myklebust <[EMAIL PROTECTED]> wrote: > On Mon, 2006-12-11 at 08:09 -0800, Martin Knoblauch wrote: > > Hi, [please CC me, as I am not subscribed] > > > > after updating a RHEL4 box (EM64T based) to a plain 2.6.19 kernel, > we > > are seeing repeated occurrences of the following messages (about > every > > 45-50 minutes). > > > > It is always the same server (a NetApp filer, mounted via the > > user-space automounter "amd") and the expected/got numbers seem to > > repeat. > > Are you seeing it _without_ amd? The usual reason for the errors you > see are bogus replay cache replies. For that reason, the kernel is > usually very careful when initialising its value for the > XID: we set part of it using the clock value, and part of it > using a random number generator. > I'm not so sure that other services are as careful. > So far, we are only seeing it on amd-mounted filesystems, not on static NFS mounts. Unfortunately, it is difficult to avoid "amd" in our environment. > > Is there a way to find out which files are involved? Nothing > seems to > > be obviously breaking, but I do not like to get my logfiles filled > up. > > The fileid is the same as the inode number. Just convert those > hexadecimal values into ordinary numbers, then search for them using > 'ls > -i'. > Thanks, will check that out. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
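In practice that lookup is easier with find(1) than by scanning 'ls -i' output by hand. A sketch, using the first fileid from the log above; the mount point is only an example standing in for wherever amd mounts nvgm022:

printf '%d\n' 0x7a6f3d                                   # 8023869, the decimal inode number
find /net/nvgm022 -xdev -inum "$(printf '%d' 0x7a6f3d)" 2>/dev/null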
Re: [2.6.19] NFS: server error: fileid changed
--- Trond Myklebust <[EMAIL PROTECTED]> wrote: > On Mon, 2006-12-11 at 15:44 -0800, Martin Knoblauch wrote: > > So far, we are only seeing it on amd-mounted filesystems, not on > > static NFS mounts. Unfortunately, it is difficult to avoid "amd" > in > > our environment. > > Any chance you could try substituting a recent version of autofs? > This > sort of problem is more likely to happen on partitions that are > unmounted and then remounted often. I'd just like to figure out if > this > is something that we need to fix in the kernel, or if it is purely an > amd problem. > > Cheers > Trond > Hi Trond, unfortunately I have no control over the mounting maps, as they are maintained by different people. So the answer is no. Unfortunately the customer has decided on using am-utils. This has been hurting us (and them) for years ... You are likely correct when you hint towards partitions which are frequently remounted. In any case, your help is appreciated. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6.19] NFS: server error: fileid changed
--- Trond Myklebust <[EMAIL PROTECTED]> wrote: > > > Is there a way to find out which files are involved? Nothing > seems to > > be obviously breaking, but I do not like to get my logfiles filled > up. > > The fileid is the same as the inode number. Just convert those > hexadecimal values into ordinary numbers, then search for them using > 'ls > -i'. > > Trond > > > [ 9337.747546] NFS: server nvgm022 error: fileid changed > > [ 9337.747549] fsid 0:25: expected fileid 0x7a6f3d, got 0x65be80 Hi Trond, just curious: how is the fsid related to mounted filesystems? What does "0:25" stand for? Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
File-locking problems with RHEL4 kernel (2.6.9-42.0.3.ELsmp) under high load
Hi, first of all, yes - I know that this kernel is very old and it is not an official LKML kernel. No need to tell me, no need to waste bandwidth by telling me :-) I just post here, because I got no response "elsewhere". Second - please CC me on any reply, as I am not subscribed. OK. Here is the problem. Said RHEL4 kernel seems to have problems with file-locking when the system is under high, likely network related, load. The symptoms are that things using file locking (rpm, the user-space automounter amd) fail to obtain locks, usually stating timeout problems. The system in question is a HP/DL380G4 with dual single-core EM64T CPUs and 8GB of memory. The network interfaces are "tg3". The high load can be triggered by copying three 3 GB files in parallel from an NFS server (Solaris10, NFS, TCP, 1GBit) to another NFS server (RHEL4, NFS, TCP, 100 MBit). The measured network performance is OK. During this operation the system goes to loads around/above 10. Overall responsiveness feels good, but software doing file-locking or opening a new ssh connection takes extremely long. So, if anyone has an idea or hint, it will be highly appreciated. Cheers Martin ---------- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Understanding I/O behaviour
--- Daniel J Blueman <[EMAIL PROTECTED]> wrote: > On 5 Jul, 16:50, Martin Knoblauch <[EMAIL PROTECTED]> wrote: > > Hi, > > > > for a customer we are operating a rackful of HP/DL380/G4 boxes > that > > have given us some problems with system responsiveness under [I/O > > triggered] system load. > [snip] > > IIRC, the locking in the CCISS driver was pretty heavy until later in > the 2.6 series (2.6.16?) kernels; I don't think they were backported > to the 1000 or so patches that comprise RH EL 4 kernels. > > With write performance being really poor on the Smartarray > controllers > without the battery-backed write cache, and with less-good locking, > performance can really suck. > > On a total quiescent hp DL380 G2 (dual PIII, 1.13GHz Tualatin 512KB > L2$) running RH EL 5 (2.6.18) with a 32MB SmartArray 5i controller > with 6x36GB 10K RPM SCSI disks and all latest firmware: > > # dd if=/dev/cciss/c0d0p2 of=/dev/zero bs=1024k count=1000 > 509+1 records in > 509+1 records out > 534643200 bytes (535 MB) copied, 11.6336 seconds, 46.0 MB/s > > # dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=1024k count=100 > 100+0 records in > 100+0 records out > 104857600 bytes (105 MB) copied, 22.3091 seconds, 4.7 MB/s > > Oh dear! There are internal performance problems with this > controller. > The SmartArray 5i in the newer DL380 G3 (dual P4 2.8GHz, 512KB L2$) > is > perhaps twice the read performance (PCI-X helps some) but still > sucks. > > I'd get the BBWC in or install another controller. > Hi Daniel, thanks for the suggestion. The DL380g4 boxes have the "6i" and all systems are equipped with the BBWC (192 MB, split 50/50). The thing is not really a speed daemon, but sufficient for the task. The problem really seems to be related to the VM system not writing out dirty pages early enough and then getting into trouble when the pressure gets too high. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Understanding I/O behaviour
--- Jesper Juhl <[EMAIL PROTECTED]> wrote: > On 05/07/07, Jesper Juhl <[EMAIL PROTECTED]> wrote: > > On 05/07/07, Martin Knoblauch <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > > > > I'd suspect you can't get both at 100%. > > > > I'd guess you are probably using a 100Hz no-preempt kernel. Have > you > > tried a 1000Hz + preempt kernel? Sure, you'll get a bit lower > > overall throughput, but interactive responsiveness should be better > - > > if it is, then you could experiment with various combinations of > > CONFIG_PREEMPT, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT_NONE and > > CONFIG_HZ_1000, CONFIG_HZ_300, CONFIG_HZ_250, CONFIG_HZ_100 to see > > what gives you the best balance between throughput and interactive > > responsiveness (you could also throw CONFIG_PREEMPT_BKL and/or > > CONFIG_NO_HZ, but I don't think the impact will be as significant > as > > with the other options, so to keep things simple I'd leave those > out > > at first) . > > > > I'd guess that something like CONFIG_PREEMPT_VOLUNTARY + > CONFIG_HZ_300 > > would probably be a good compromise for you, but just to see if > > there's any effect at all, start out with CONFIG_PREEMPT + > > CONFIG_HZ_1000. > > > > I'm curious, did you ever try playing around with CONFIG_PREEMPT* > and > CONFIG_HZ* to see if that had any noticeable impact on interactive > performance and stuff like logging into the box via ssh etc...? > > -- > Jesper Juhl <[EMAIL PROTECTED]> > Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html > Plain text mails only, please http://www.expita.com/nomime.html > > Hi Jesper, my initial kernel was [EMAIL PROTECTED] I have switched to 300HZ, but have not observed much difference. The config is now:

config-2.6.22-rc7:# CONFIG_PREEMPT_NONE is not set
config-2.6.22-rc7:CONFIG_PREEMPT_VOLUNTARY=y
config-2.6.22-rc7:# CONFIG_PREEMPT is not set
config-2.6.22-rc7:CONFIG_PREEMPT_BKL=y

Cheers -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/17] per device dirty throttling -v7
Miklos Szeredi wrote: >> Latest version of the per bdi dirty throttling patches. >> >> Most of the changes since last time are little cleanups and more >> detail in the split out of the floating proportion into their >> own little lib. >> >> Patches are against 2.6.22-rc4-mm2 >> >> A rollup of all this against 2.6.21 is available here: >> http://programming.kicks-ass.net/kernel-patches/balance_dirty_pages/2.6.21-per_bdi_dirty_pages.patch >> >> This patch-set passes the starve an USB stick test.. > >I've done some testing of several problem cases. just curious - what are the plans towards inclusion in mainline? Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Linux 2.6.22-rc7
>Ok, Linux-2.6.22-rc7 is out there. > >It's hopefully (almost certainly) the last -rc before the final 2.6.22 >release, and we should be in pretty good shape. The flow of patches has >really slowed down and the regression list has shrunk a lot. > >The shortlog/diffstat reflects that, with the biggest part of the -rc7 >patch being literally just a power defconfig update. > >The patches are mostly trivial fixes, a few new device ID's, and the >appended shortlog really does pretty much explain it. > >Final testing always appreciated, of course, > >Linus For what it is worth - rc7 compiles and boots here (HP/DL380G4, 2x x86_64, 8GB, cciss, 2x tg3). The subjective feeling(*) is much better than the original RHEL4 kernel and better than 2.6.19 on the same box. (*) Our main problem with 2.6 kernels so far is a tendency to really bad responsiveness under I/O related load. Cheers Martin ------ Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Understanding I/O behaviour
Hi, for a customer we are operating a rackful of HP/DL380/G4 boxes that have given us some problems with system responsiveness under [I/O triggered] system load. The systems in question have the following HW:

2x Intel/EM64T CPUs
8GB memory
CCISS RAID controller with 4x72GB SCSI disks as RAID5
2x BCM5704 NIC (using tg3)

The distribution is RHEL4. We have tested several kernels including the original 2.6.9, 2.6.19.2, 2.6.22-rc7 and 2.6.22-rc7+cfs-v18. One part of the workload is when several processes try to write 5 GB each to the local filesystem (ext2->LVM->CCISS). When this happens, the load goes up to 12 and responsiveness goes down. This means that from one moment to the next things like opening a ssh connection to the host in question, or doing "df", take forever (minutes). Especially bad with the vendor kernel, better (but not perfect) with 2.6.19 and 2.6.22-rc7. The load basically comes from the writing processes and up to 12 "pdflush" threads all being in "D" state. So, what I would like to understand is how we can maximize the responsiveness of the system, while keeping disk throughput at maximum. During my investigation I basically performed the following test, because it represents the kind of situation that causes trouble:

$ cat dd3.sh
echo "Start 3 dd processes: "`date`
dd if=/dev/zero of=/scratch/X1 bs=1M count=5000&
dd if=/dev/zero of=/scratch/X2 bs=1M count=5000&
dd if=/dev/zero of=/scratch/X3 bs=1M count=5000&
wait
echo "Finish 3 dd processes: "`date`
sync
echo "Finish sync: "`date`
rm -f /scratch/X?
echo "Files removed: "`date`

This results in the following timings, all with the anticipatory scheduler, because it gives the best results:

2.6.19.2, HT: 10m
2.6.19.2, non-HT: 8m45s
2.6.22-rc7, HT: 10m
2.6.22-rc7, non-HT: 6m
2.6.22-rc7+cfs_v18, HT: 10m40s
2.6.22-rc7+cfs_v18, non-HT: 10m45s

The "felt" responsiveness was best with the last two kernels, although the load profile over time looks identical in all cases. So, a few questions: a) any idea why disabling HT improves throughput, except for the cfs kernels? For plain 2.6.22 the difference is quite substantial. b) any ideas how to optimize the settings of the /proc/sys/vm/ parameters? The documentation is a bit thin here. Thanks in advance Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
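A simple way to watch what the VM is doing while dd3.sh runs (a sketch, not part of the original test setup) is to sample the dirty and writeback page counters; this shows how far Dirty climbs before the pdflush threads pile up in "D" state:

while sleep 5; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
done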
Re: Understanding I/O behaviour
--- Jesper Juhl <[EMAIL PROTECTED]> wrote: > On 06/07/07, Robert Hancock <[EMAIL PROTECTED]> wrote: > [snip] > > > > Try playing with reducing /proc/sys/vm/dirty_ratio and see how that > > helps. This workload will fill up memory with dirty data very > quickly, > > and it seems like system responsiveness often goes down the toilet > when > > this happens and the system is going crazy trying to write it all > out. > > > > Perhaps trying out a different elevator would also be worthwhile. > AS seems to be the best one (noop and deadline seem to be equally OK). CFQ gives about 10-15% less throughput, except for the kernel with the cfs cpu scheduler, where CFQ is on par with the other IO schedulers. Thanks Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
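For anyone repeating the elevator comparison: the scheduler can be switched per device at runtime instead of rebooting with elevator=..., e.g. for a cciss volume (the device name is an example, check /sys/block for the actual one):

cat /sys/block/cciss!c0d0/queue/scheduler           # the active elevator is shown in brackets
echo anticipatory > /sys/block/cciss!c0d0/queue/scheduler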
Re: Understanding I/O behaviour
--- Robert Hancock <[EMAIL PROTECTED]> wrote: > > Try playing with reducing /proc/sys/vm/dirty_ratio and see how that > helps. This workload will fill up memory with dirty data very > quickly, > and it seems like system responsiveness often goes down the toilet > when > this happens and the system is going crazy trying to write it all > out. > Definitely the "going crazy" part is the worst problem I see with 2.6 based kernels (late 2.4 was really better in this corner case). I am just now playing with dirty_ratio. Does anybody know what the lower limit is? "0" seems acceptable, but does it actually imply "write out immediately"? Another problem: the VM parameters are not really well documented in their behaviour and interdependence. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Understanding I/O behaviour
>>b) any ideas how to optimize the settings of the /proc/sys/vm/ >>parameters? The documentation is a bit thin here. >> >> >I can't offer any advice there, but is raid-5 really the best choice >for your needs? I would not choose raid-5 for a system that is >regularly performing lots of large writes at the same time, don't >forget that each write can require several reads to recalculate the >parity. > >Does the raid card have much cache ram? > 192 MB, split 50/50 between read and write. >If you can afford to lose some space raid-10 would probably perform >better. RAID5 most likely is not the best solution and I would not use it if the described use-case was happening all the time. It happens a few times a day and then things go down when all memory is filled with page-cache. And the same also happens when copying large amounts of data from one NFS-mounted FS to another NFS-mounted FS. No disk involved there. Memory fills with page-cache until it reaches a ceiling and then for some time responsiveness is really, really bad. I am just now playing with the dirty_* stuff. Maybe it helps. Cheers Martin ------ Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Understanding I/O behaviour
Martin Knoblauch wrote: >--- Robert Hancock <[EMAIL PROTECTED]> wrote: > >> >> Try playing with reducing /proc/sys/vm/dirty_ratio and see how that >> helps. This workload will fill up memory with dirty data very >> quickly, >> and it seems like system responsiveness often goes down the toilet >> when >> this happens and the system is going crazy trying to write it all >> out. >> > >Definitely the "going crazy" part is the worst problem I see with 2.6 >based kernels (late 2.4 was really better in this corner case). > >I am just now playing with dirty_ratio. Does anybody know what the lower >limit is? "0" seems acceptable, but does it actually imply "write out >immediately"? > >Another problem: the VM parameters are not really well documented in >their behaviour and interdependence. Lowering dirty_ratio just leads to more imbalanced write-speed for the three dd's. Even when lowering the number to 0, the high load stays. Now, in another experiment I mounted the FS with "sync". And now the load stays below/around 3. No more "pdflush" daemons going wild. And the responsiveness is good, with no drops. My question is now: is there a parameter that one can use to force immediate writeout for every process? This may hurt overall performance of the system, but might really help my situation. Setting dirty_ratio to 0 does not seem to do it. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
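Two ways to approximate "write out immediately" without a kernel change, as far as I can tell (sketches only; the sync mount is what was tested above, while O_DIRECT via dd's oflag is an extra suggestion not taken from this thread, needs a reasonably recent GNU dd, and changes the test by bypassing the page cache entirely):

mount -o remount,sync /scratch                                   # per filesystem, as in the test above
dd if=/dev/zero of=/scratch/X1 bs=1M count=5000 oflag=direct     # per process, bypasses the page cache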
Re: Understanding I/O behaviour
Brice Figureau wrote: >> CFQ gives about 10-15% less throughput, except for the kernel >> with the >> cfs cpu scheduler, where CFQ is on par with the other IO >> schedulers. >> >
>Please have a look at kernel bug #7372:
>http://bugzilla.kernel.org/show_bug.cgi?id=7372
>
>It seems I encountered almost the same issue.
>
>The fix on my side, beside running 2.6.17 (which was working fine
>for me) was to:
>1) have /proc/sys/vm/vfs_cache_pressure=1
>2) have /proc/sys/vm/dirty_ratio=1 and
>   /proc/sys/vm/dirty_background_ratio=1
>3) have /proc/sys/vm/swappiness=2
>4) run Peter Zijlstra: per dirty device throttling patch on the
>   top of 2.6.21.5:
>http://www.ussg.iu.edu/hypermail/linux/kernel/0706.1/2776.html

Brice, is any one of them sufficient, or are all of them needed together? Just to avoid confusion. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
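For quick testing, points 1-3 of that list can be applied at runtime as sysctls (a sketch; the values revert at reboot unless added to /etc/sysctl.conf, and point 4 still requires the patched kernel):

sysctl -w vm.vfs_cache_pressure=1
sysctl -w vm.dirty_ratio=1
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.swappiness=2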
RE: [PATCH 00/23] per device dirty throttling -v9
--- Peter Zijlstra <[EMAIL PROTECTED]> wrote: > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote: > > > Peter, > > > > any chance to get a rollup against 2.6.22-stable? > > > > The 2.6.23 series may not be usable for me due to the > > nosharedcache changes for NFS (the new default will massively > > disturb the user-space automounter). > > I'll see what I can do, bit busy with other stuff atm, hopefully > after > the weekend. > Hi Peter, any progress on a version against 2.6.22.5? I have seen the very positive report from Jeffrey W. Baker and would really love to test your patch. But as I said, anything newer than 2.6.22.x might not be an option due to the NFS changes. Kind regards Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 00/23] per device dirty throttling -v9
--- Peter Zijlstra <[EMAIL PROTECTED]> wrote: > On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote: > > --- Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > > > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote: > > > > > > > Peter, > > > > > > > > any chance to get a rollup against 2.6.22-stable? > > > > > > > > The 2.6.23 series may not be usable for me due to the > > > > nosharedcache changes for NFS (the new default will massively > > > > disturb the user-space automounter). > > > > > > I'll see what I can do, bit busy with other stuff atm, hopefully > > > after > > > the weekend. > > > > > Hi Peter, > > > > any progress on a version against 2.6.22.5? I have seen the very > > positive report from Jeffrey W. Baker and would really love to test > > your patch. But as I said, anything newer than 2.6.22.x might not > be an > > option due to the NFS changes. > > mindless port, seems to compile and boot on my test box ymmv. > > I think .5 should not present anything other than trivial rejects if > anything. But I'm not keeping -stable in my git remotes so I can't > say > for sure. Hi Peter, thanks a lot. It applies to 2.6.22.5 almost cleanly, with just one 8-line offset in readahead.c. I will report testing-results separately. Thanks Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/