Re: Understanding I/O behaviour - next try

2007-09-13 Thread Peter Zijlstra
On Wed, 2007-08-29 at 01:15 -0700, Martin Knoblauch wrote:

> > >  Another thing I saw during my tests is that when writing to NFS, the
> > > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
> > > or a bug?
> > 
> > What are the nr_unstable numbers?

NFS has the concept of unstable storage, that is, a state where it is
agreed that the page has been transferred to the remote server but has
not yet been written to disk there.

>  Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> numbers for the disk case. Good to know.
> 
>  For NFS, the nr_writeback numbers seem surprisingly high. They also go
> to 80-90k (pages ?). In the disk case they rarely go over 12k.

see: /proc/sys/fs/nfs/nfs_congestion_kb

That is the limit for when the nfs BDI is marked congested, so
nfs_writeout + nfs_unstable <= nfs_congestion_kb

The nfs_dirty always being 0 just means that pages very quickly start
their writeout cycle.
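
For reference, a quick way to watch the counters discussed above while a
copy is in flight (a sketch only; it assumes a 2.6.2x kernel with the usual
/proc layout, and the 5 second interval is arbitrary):

  cat /proc/sys/fs/nfs/nfs_congestion_kb
  while true; do
      grep -E 'nr_dirty|nr_writeback|nr_unstable' /proc/vmstat
      sleep 5
  done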




Re: Understanding I/O behaviour - next try

2007-08-30 Thread Martin Knoblauch

--- Jens Axboe <[EMAIL PROTECTED]> wrote:

> 
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack,
> see
> if it makes a difference (and please verify in dmesg that it prints
> the
> message about limiting depth!):
> 
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c,
> struct pci_dev *pdev)
>   if (board_id == products[i].board_id) {
>   c->product_name = products[i].product_name;
>   c->access = *(products[i].access);
> +#if 0
>   c->nr_cmds = products[i].nr_cmds;
> +#else
> + c->nr_cmds = 2;
> + printk("cciss: limited max commands to 2\n");
> +#endif
>   break;
>   }
>   }
> 
> -- 
> Jens Axboe
> 
> 
Hi Jens,

 how exactly is the queue depth related to the max # of commands? I
ask because, with the 2.6.22 kernel, the "maximum queue depth since
init" never seems to go higher than 16, even with a much larger number
of outstanding commands. On a 2.6.19 kernel, the maximum queue depth is
much higher, just a bit below the "max # of commands since init".

[2.6.22]# cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Max sectors: 2048
Current Q depth: 0
Current # commands on controller: 145
Max Q depth since init: 16
Max # commands on controller since init: 204
Max SG entries since init: 31
Sequential access devices: 0

[2.6.19] cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 197
Max # commands on controller since init: 198
Max SG entries since init: 31
Sequential access devices: 0

Cheers
Martin




--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


Re: Understanding I/O behaviour - next try

2007-08-30 Thread Martin Knoblauch

--- Robert Hancock <[EMAIL PROTECTED]> wrote:

> 
> I saw a bulletin from HP recently that suggested disabling the 
> write-back cache on some Smart Array controllers as a workaround
> because 
> it reduced performance in applications that did large bulk writes. 
> Presumably they are planning on releasing some updated firmware that 
> fixes this eventually..
> 
> -- 
> Robert Hancock  Saskatoon, SK, Canada
> To email, remove "nospam" from [EMAIL PROTECTED]
> Home Page: http://www.roberthancock.com/
> 
Robert,

 just checked it out. At least with the "6i", you do not want to
disable the WBC :-) Performance really goes down the toilet for all
cases.

 Do you still have a pointer to that bulletin?

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Chuck Ebbert <[EMAIL PROTECTED]> wrote:

> On 08/28/2007 11:53 AM, Martin Knoblauch wrote:
> > 
> >  The basic setup is a dual x86_64 box with 8 GB of memory. The
> DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write
> cache.
> > The performance of the block device with O_DIRECT is about 90
> MB/sec.
> > 
> >  The problematic behaviour comes when we are moving large files
> through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5
> GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads
> and
> > some other poor guys being in "D" state.
> 
> Try booting with "mem=4096M", "mem=2048M", ...
> 
> 

 hmm. I tried 1024M a while ago and IIRC did not see a lot [any]
difference. But as it is no big deal, I will repeat it tomorrow.

 Just curious - what are you expecting? Why should it help?

Thanks
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Chuck Ebbert
On 08/28/2007 11:53 AM, Martin Knoblauch wrote:
> 
>  The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> The performance of the block device with O_DIRECT is about 90 MB/sec.
> 
>  The problematic behaviour comes when we are moving large files through
> the system. The file usage in this case is mostly "use once" or
> streaming. As soon as the amount of file data is larger than 7.5 GB, we
> see occasional unresponsiveness of the system (e.g. no more ssh
> connections into the box) of more than 1 or 2 minutes (!) duration
> (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
> some other poor guys being in "D" state.

Try booting with "mem=4096M", "mem=2048M", ...
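
For example, appended to the kernel line of the boot loader config (a sketch
only; the paths and root device below are illustrative, not taken from the
reported setup):

  # /boot/grub/grub.conf
  kernel /vmlinuz-2.6.22.5 ro root=/dev/cciss/c0d0p2 mem=4096M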


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Robert Hancock

Jens Axboe wrote:

On Tue, Aug 28 2007, Martin Knoblauch wrote:

Keywords: I/O, bdi-v9, cfs

Hi,

 a while ago I asked a few questions on the Linux I/O behaviour,
because I was (and still am) fighting some "misbehaviour" related to heavy
I/O.

 The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
The performance of the block device with O_DIRECT is about 90 MB/sec.

 The problematic behaviour comes when we are moving large files through
the system. The file usage in this case is mostly "use once" or
streaming. As soon as the amount of file data is larger than 7.5 GB, we
see occasional unresponsiveness of the system (e.g. no more ssh
connections into the box) of more than 1 or 2 minutes (!) duration
(kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
some other poor guys being in "D" state.

 The data flows in basically three modes. All of them are affected:

local-disk -> NFS
NFS -> local-disk
NFS -> NFS

 NFS is V3/TCP.

 So, I made a few experiments in the last few days, using three
different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.

 The first observation (independent of the kernel) is that we *should*
use O_DIRECT, at least for output to the local disk. Here we see about
90 MB/sec write performance. A simple "dd" using 1, 2, and 3 parallel
threads to the same block device (through an ext2 FS) gives:

O_Direct: 88 MB/s, 2x44, 3x29.5
non-O_DIRECT: 51 MB/s, 2x19, 3x12.5

- Observation 1a: IO schedulers are mostly equivalent, with CFQ
slightly worse than AS and DEADLINE
- Observation 1b: when using a 2.6.22.5+cfs20.4, the non-O_DIRECT
performance goes [slightly] down. With three threads it is 3x10 MB/s.
Ingo?
- Observation 1c: bdi-v9 does not help in this case, which is not
surprising.

 The real question here is why the non-O_DIRECT case is so slow. Is
this a general thing? Is this related to the CCISS controller? Using
O_DIRECT is unfortunately not an option for us.

 When using three different targets (local disk plus two different NFS
Filesystems) bdi-v9 is a big winner. Without it, all threads are [seem
to be] limited to the speed of the slowest FS. With bdi-v9 we see a
considerable speedup.

 Just by chance I found out that doing all I/O in sync mode does
prevent the load from going up. Of course, I/O throughput is not
stellar (but not much worse than the non-O_DIRECT case). But the
responsiveness seems OK. Maybe a solution, as this can be controlled via
mount (would be great for O_DIRECT :-).

 In general 2.6.22 seems to be better than 2.6.19, but this is highly
subjective :-( I am using the following settings in /proc. They seem to
provide the smoothest responsiveness:

vm.dirty_background_ratio = 1
vm.dirty_ratio = 1
vm.swappiness = 1
vm.vfs_cache_pressure = 1

 Another thing I saw during my tests is that when writing to NFS, the
"dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
or a bug?

 In any case, view this as a report for one specific loadcase that does
not behave very well. It seems there are ways to make things better
(sync, per device throttling, ...), but nothing "perfect" yet. Use once
does seem to be a problem.


Try limiting the queue depth on the cciss device, some of those are
notoriously bad at starving commands. Something like the below hack, see
if it makes a difference (and please verify in dmesg that it prints the
message about limiting depth!):


I saw a bulletin from HP recently that suggested disabling the 
write-back cache on some Smart Array controllers as a workaround because 
it reduced performance in applications that did large bulk writes. 
Presumably they are planning on releasing some updated firmware that 
fixes this eventually..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Jens Axboe <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 28 2007, Martin Knoblauch wrote:
> > Keywords: I/O, bdi-v9, cfs
> > 
> 
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack,
> see
> if it makes a difference (and please verify in dmesg that it prints
> the
> message about limiting depth!):
> 
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c,
> struct pci_dev *pdev)
>   if (board_id == products[i].board_id) {
>   c->product_name = products[i].product_name;
>   c->access = *(products[i].access);
> +#if 0
>   c->nr_cmds = products[i].nr_cmds;
> +#else
> + c->nr_cmds = 2;
> + printk("cciss: limited max commands to 2\n");
> +#endif
>   break;
>   }
>   }
> 
> -- 
> Jens Axboe
> 
> 
>
Hi Jens,

 thanks for the suggestion. Unfortunately, the non-direct [parallel]
writes to the device got considerably slower. I guess the "6i"
controller copes better with higher values.

 Can nr_cmds be changed at runtime? Maybe there is an optimal setting.

[   69.438851] SCSI subsystem initialized
[   69.442712] HP CISS Driver (v 3.6.14)
[   69.442871] ACPI: PCI Interrupt :04:03.0[A] -> GSI 51 (level,
low) -> IRQ 51
[   69.442899] cciss: limited max commands to 2 (Smart Array 6i)
[   69.482370] cciss0: <0x46> at PCI :04:03.0 IRQ 51 using DAC
[   69.494352]   blocks= 426759840 block_size= 512
[   69.498350]   heads=255, sectors=32, cylinders=52299
[   69.498352]
[   69.498509]   blocks= 426759840 block_size= 512
[   69.498602]   heads=255, sectors=32, cylinders=52299
[   69.498604]
[   69.498608]  cciss/c0d0: p1 p2

Cheers
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Jens Axboe
On Tue, Aug 28 2007, Martin Knoblauch wrote:
> Keywords: I/O, bdi-v9, cfs
> 
> Hi,
> 
>  a while ago I asked a few questions on the Linux I/O behaviour,
> because I was (and still am) fighting some "misbehaviour" related to heavy
> I/O.
> 
>  The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> The performance of the block device with O_DIRECT is about 90 MB/sec.
> 
>  The problematic behaviour comes when we are moving large files through
> the system. The file usage in this case is mostly "use once" or
> streaming. As soon as the amount of file data is larger than 7.5 GB, we
> see occasional unresponsiveness of the system (e.g. no more ssh
> connections into the box) of more than 1 or 2 minutes (!) duration
> (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
> some other poor guys being in "D" state.
> 
>  The data flows in basically three modes. All of them are affected:
> 
> local-disk -> NFS
> NFS -> local-disk
> NFS -> NFS
> 
>  NFS is V3/TCP.
> 
>  So, I made a few experiments in the last few days, using three
> different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.
> 
>  The first observation (independent of the kernel) is that we *should*
> use O_DIRECT, at least for output to the local disk. Here we see about
> 90 MB/sec write performance. A simple "dd" using 1, 2, and 3 parallel
> threads to the same block device (through an ext2 FS) gives:
> 
> O_Direct: 88 MB/s, 2x44, 3x29.5
> non-O_DIRECT: 51 MB/s, 2x19, 3x12.5
> 
> - Observation 1a: IO schedulers are mostly equivalent, with CFQ
> slightly worse than AS and DEADLINE
> - Observation 1b: when using a 2.6.22.5+cfs20.4, the non-O_DIRECT
> performance goes [slightly] down. With three threads it is 3x10 MB/s.
> Ingo?
> - Observation 1c: bdi-v9 does not help in this case, which is not
> surprising.
> 
>  The real question here is why the non-O_DIRECT case is so slow. Is
> this a general thing? Is this related to the CCISS controller? Using
> O_DIRECT is unfortunately not an option for us.
> 
>  When using three different targets (local disk plus two different NFS
> Filesystems) bdi-v9 is a big winner. Without it, all threads are [seem
> to be] limited to the speed of the slowest FS. With bdi-v9 we see a
> considerable speedup.
> 
>  Just by chance I found out that doing all I/O in sync mode does
> prevent the load from going up. Of course, I/O throughput is not
> stellar (but not much worse than the non-O_DIRECT case). But the
> responsiveness seems OK. Maybe a solution, as this can be controlled via
> mount (would be great for O_DIRECT :-).
> 
>  In general 2.6.22 seems to be better than 2.6.19, but this is highly
> subjective :-( I am using the following settings in /proc. They seem to
> provide the smoothest responsiveness:
> 
> vm.dirty_background_ratio = 1
> vm.dirty_ratio = 1
> vm.swappiness = 1
> vm.vfs_cache_pressure = 1
> 
>  Another thing I saw during my tests is that when writing to NFS, the
> "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
> or a bug?
> 
>  In any case, view this as a report for one specific loadcase that does
> not behave very well. It seems there are ways to make things better
> (sync, per device throttling, ...), but nothing "perfect" yet. Use once
> does seem to be a problem.

Try limiting the queue depth on the cciss device, some of those are
notoriously bad at starving commands. Something like the below hack, see
if it makes a difference (and please verify in dmesg that it prints the
message about limiting depth!):

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 084358a..257e1c3 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c, struct pci_dev 
*pdev)
if (board_id == products[i].board_id) {
c->product_name = products[i].product_name;
c->access = *(products[i].access);
+#if 0
c->nr_cmds = products[i].nr_cmds;
+#else
+   c->nr_cmds = 2;
+   printk("cciss: limited max commands to 2\n");
+#endif
break;
}
}
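
A quick way to confirm the hack is in effect after booting the patched
kernel (a sketch; the string matches the printk added above, and the /proc
file is the one quoted elsewhere in this thread):

  dmesg | grep 'cciss: limited max commands'
  cat /proc/driver/cciss/cciss0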

-- 
Jens Axboe



Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Wed, Aug 29, 2007 at 01:15:45AM -0700, Martin Knoblauch wrote:
> > 
> > --- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > 
> > > You are apparently running into the sluggish kupdate-style
> writeback
> > > problem with large files: huge amount of dirty pages are getting
> > > accumulated and flushed to the disk all at once when dirty
> background
> > > ratio is reached. The current -mm tree has some fixes for it, and
> > > there are some more in my tree. Martin, I'll send you the patch
> if
> > > you'd like to try it out.
> > >
> > Hi Fengguang,
> > 
> >  Yeah, that pretty much describes the situation we end up in. Although
> > "sluggish" is much too friendly if we hit the situation :-)
> > 
> >  Yes, I am very interested in checking out your patch. I saw your
> > postings on LKML already and was already curious. Any chance you have
> > something against 2.6.22-stable? I have reasons not to move to -23 or
> > -mm.
> 
> Well, they are a dozen patches from various sources.  I managed to
> back-port them. It compiles and runs, however I cannot guarantee
> more...
>

 Thanks. I understand the limited scope of the warranty :-) I will give
it a spin today.
 
> > > >  Another thing I saw during my tests is that when writing to
> NFS,
> > > the
> > > > "dirty" or "nr_dirty" numbers are always 0. Is this a
> conceptual
> > > thing,
> > > > or a bug?
> > > 
> > > What are the nr_unstable numbers?
> > >
> > 
> >  Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> > numbers for the disk case. Good to know.
> > 
> >  For NFS, the nr_writeback numbers seem surprisingly high. They
> also go
> > to 80-90k (pages ?). In the disk case they rarely go over 12k.
> 
> Maybe the difference of throttling one single 'cp' and a dozen
> 'nfsd'?
>

 No "nfsd" running on that box. It is just a client.

Cheers
Martin
 

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Fengguang Wu
On Wed, Aug 29, 2007 at 01:15:45AM -0700, Martin Knoblauch wrote:
> 
> --- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> 
> > You are apparently running into the sluggish kupdate-style writeback
> > problem with large files: huge amount of dirty pages are getting
> > accumulated and flushed to the disk all at once when dirty background
> > ratio is reached. The current -mm tree has some fixes for it, and
> > there are some more in my tree. Martin, I'll send you the patch if
> > you'd like to try it out.
> >
> Hi Fengguang,
> 
>  Yeah, that pretty much describes the situation we end up in. Although
> "sluggish" is much too friendly if we hit the situation :-)
> 
>  Yes, I am very interested in checking out your patch. I saw your
> postings on LKML already and was already curious. Any chance you have
> something against 2.6.22-stable? I have reasons not to move to -23 or
> -mm.

Well, they are a dozen patches from various sources.  I managed to
back-port them. It compiles and runs, however I cannot guarantee
more...

> > >  Another thing I saw during my tests is that when writing to NFS,
> > the
> > > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual
> > thing,
> > > or a bug?
> > 
> > What are the nr_unstable numbers?
> >
> 
>  Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> numbers for the disk case. Good to know.
> 
>  For NFS, the nr_writeback numbers seem surprisingly high. They also go
> to 80-90k (pages ?). In the disk case they rarely go over 12k.

Maybe the difference of throttling one single 'cp' and a dozen 'nfsd'?

Fengguang
--- linux-2.6.22.orig/fs/fs-writeback.c
+++ linux-2.6.22/fs/fs-writeback.c
@@ -24,6 +24,148 @@
 #include <linux/buffer_head.h>
 #include "internal.h"
 
+/*
+ * Add @inode to its superblock's radix tree of dirty inodes.
+ *
+ * - the radix tree is indexed by inode number
+ * - inode_tree is not authoritative; inode_list is
+ * - inode_tree is a superset of inode_list: it is possible that an inode
+ *   get synced elsewhere and moved to other lists, while still remaining
+ *   in the radix tree.
+ */
+static void add_to_dirty_tree(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	int e;
+
+	e = radix_tree_preload(GFP_ATOMIC);
+	if (!e) {
+		e = radix_tree_insert(&dt->inode_tree, inode->i_ino, inode);
+		/*
+		 * - inode numbers are not necessarily unique
+		 * - an inode might somehow be redirtied and resent to us
+		 */
+		if (!e) {
+			__iget(inode);
+			dt->nr_inodes++;
+			if (dt->max_index < inode->i_ino)
+				dt->max_index = inode->i_ino;
+			list_move(&inode->i_list, &sb->s_dirty_tree.inode_list);
+		}
+		radix_tree_preload_end();
+	}
+}
+
+#define DIRTY_SCAN_BATCH	16
+#define DIRTY_SCAN_ALL		LONG_MAX
+#define DIRTY_SCAN_REMAINING	(LONG_MAX-1)
+
+/*
+ * Scan the dirty inode tree and pull some inodes onto s_io.
+ * It could go beyond @end - it is a soft/approx limit.
+ */
+static unsigned long scan_dirty_tree(struct super_block *sb,
+	unsigned long begin, unsigned long end)
+{
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	struct inode *inodes[DIRTY_SCAN_BATCH];
+	struct inode *inode = NULL;
+	int i, j;
+	void *p;
+
+	while (begin < end) {
+		j = radix_tree_gang_lookup(&dt->inode_tree, (void **)inodes,
+					begin, DIRTY_SCAN_BATCH);
+		if (!j)
+			break;
+		for (i = 0; i < j; i++) {
+			inode = inodes[i];
+			if (end != DIRTY_SCAN_ALL) {
+				/* skip young volatile ones */
+				if (time_after(inode->dirtied_when,
+					jiffies - dirty_volatile_interval)) {
+					inodes[i] = 0;
+					continue;
+				}
+			}
+
+			dt->nr_inodes--;
+			p = radix_tree_delete(&dt->inode_tree, inode->i_ino);
+			BUG_ON(!p);
+
+			if (!(inode->i_state & I_SYNC))
+				list_move(&inode->i_list, &sb->s_io);
+		}
+		begin = inode->i_ino + 1;
+
+		spin_unlock(&inode_lock);
+		for (i = 0; i < j; i++)
+			if (inodes[i])
+				iput(inodes[i]);
+		cond_resched();
+		spin_lock(&inode_lock);
+	}
+
+	return begin;
+}
+
+/*
+ * Move a cluster of dirty inodes to the io dispatch queue.
+ */
+static void dispatch_cluster_inodes(struct super_block *sb,
+	unsigned long *older_than_this)
+{
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	int scan_interval = dirty_expire_interval - dirty_volatile_interval;
+	unsigned long begin;
+	unsigned long end;
+
+	if (!older_than_this) {
+		/*
+		 * Be aggressive: either it is a sync(), or we fall into
+		 * background writeback because kupdate-style writebacks
+		 * could not catch up with fast writers.
+		 */
+		begin = 0;
+		end = DIRTY_SCAN_ALL;
+	} else if (time_after_eq(jiffies,
+				dt->start_jiffies + scan_interval)) {
+		begin = dt->next_index;
+		end = DIRTY_SCAN_REMAINING; /* complete this sweep */
+	} else {
+		unsigned long time_total = max(scan_interval, 1);
+		unsigned long time_delta = jiffies - dt->start_jiffies;
+		unsigned long scan_total = dt->max_index;
+		unsigned long scan_delta = scan_total * time_delta / time_total;
+
+		begin = dt->next_index;
+		end = scan_delta;
+	}
+
+	scan_dirty_tree(sb, begin, end);
+
+	if (end < DIRTY_SCAN_REMAINING) {

Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 28, 2007 at 08:53:07AM -0700, Martin Knoblauch wrote:
> [...]
> >  The basic setup is a dual x86_64 box with 8 GB of memory. The
> DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write
> cache.
> > The performance of the block device with O_DIRECT is about 90
> MB/sec.
> > 
> >  The problematic behaviour comes when we are moving large files
> through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5
> GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads
> and
> > some other poor guys being in "D" state.
> [...]
> >  Just by chance I found out that doing all I/O in sync mode does
> > prevent the load from going up. Of course, I/O throughput is not
> > stellar (but not much worse than the non-O_DIRECT case). But the
> > responsiveness seems OK. Maybe a solution, as this can be controlled via
> > mount (would be great for O_DIRECT :-).
> > 
> >  In general 2.6.22 seems to be better than 2.6.19, but this is highly
> > subjective :-( I am using the following settings in /proc. They seem to
> > provide the smoothest responsiveness:
> > 
> > vm.dirty_background_ratio = 1
> > vm.dirty_ratio = 1
> > vm.swappiness = 1
> > vm.vfs_cache_pressure = 1
> 
> You are apparently running into the sluggish kupdate-style writeback
> problem with large files: huge amount of dirty pages are getting
> accumulated and flushed to the disk all at once when dirty background
> ratio is reached. The current -mm tree has some fixes for it, and
> there are some more in my tree. Martin, I'll send you the patch if
> you'd like to try it out.
>
Hi Fengguang,

 Yeah, that pretty much describes the situation we end up in. Although
"sluggish" is much too friendly if we hit the situation :-)

 Yes, I am very interested in checking out your patch. I saw your
postings on LKML already and was already curious. Any chance you have
something against 2.6.22-stable? I have reasons not to move to -23 or
-mm.

> >  Another thing I saw during my tests is that when writing to NFS,
> the
> > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual
> thing,
> > or a bug?
> 
> What are the nr_unstable numbers?
>

 Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
numbers for the disk case. Good to know.

 For NFS, the nr_writeback numbers seem surprisingly high. They also go
to 80-90k (pages ?). In the disk case they rarely go over 12k.

Cheers
Martin
> Fengguang
> 
> 


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


Re: Understanding I/O behaviour - next try

2007-08-28 Thread Fengguang Wu
On Tue, Aug 28, 2007 at 08:53:07AM -0700, Martin Knoblauch wrote:
[...]
>  The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> The performance of the block device with O_DIRECT is about 90 MB/sec.
> 
>  The problematic behaviour comes when we are moving large files through
> the system. The file usage in this case is mostly "use once" or
> streaming. As soon as the amount of file data is larger than 7.5 GB, we
> see occasional unresponsiveness of the system (e.g. no more ssh
> connections into the box) of more than 1 or 2 minutes (!) duration
> (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
> some other poor guys being in "D" state.
[...]
>  Just by chance I found out that doing all I/O in sync-mode does
> prevent the load from going up. Of course, I/O throughput is not
> stellar (but not much worse than the non-O_DIRECT case). But the
> responsiveness seems OK. Maybe a solution, as this can be controlled via
> mount (would be great for O_DIRECT :-).
> 
>  In general 2.6.22 seems to be better than 2.6.19, but this is highly
> subjective :-( I am using the following settings in /proc. They seem to
> provide the smoothest responsiveness:
> 
> vm.dirty_background_ratio = 1
> vm.dirty_ratio = 1
> vm.swappiness = 1
> vm.vfs_cache_pressure = 1

You are apparently running into the sluggish kupdate-style writeback
problem with large files: huge amounts of dirty pages accumulate and
are then flushed to the disk all at once when the dirty background
ratio is reached. The current -mm tree has some fixes for it, and
there are some more in my tree. Martin, I'll send you the patch if
you'd like to try it out.
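
(As a reference point, the kupdate-style writeback is driven by a few knobs
under /proc/sys/vm; a rough sketch of tightening them -- the values are only
an illustration, not a recommendation:

# expire dirty data after 10s instead of the default 30s
echo 1000 > /proc/sys/vm/dirty_expire_centisecs
# run the kupdate-style pdflush work every 2.5s instead of every 5s
echo 250 > /proc/sys/vm/dirty_writeback_centisecs
)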

>  Another thing I saw during my tests is that when writing to NFS, the
> "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
> or a bug?

What are the nr_unstable numbers?
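
(They can be watched alongside the dirty/writeback counters with something
like the following -- assuming this kernel exports them in /proc/vmstat:

watch -n 1 'grep -E "nr_(dirty|writeback|unstable)" /proc/vmstat'
)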

Fengguang

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Understanding I/O behaviour - next try

2007-08-28 Thread Martin Knoblauch
Keywords: I/O, bdi-v9, cfs

Hi,

 a while ago I asked a few questions on the Linux I/O behaviour,
because I was (still am) fighting some "misbehaviour" related to heavy
I/O.

 The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
The performance of the block device with O_DIRECT is about 90 MB/sec.

 The problematic behaviour comes when we are moving large files through
the system. The file usage in this case is mostly "use once" or
streaming. As soon as the amount of file data is larger than 7.5 GB, we
see occasional unresponsiveness of the system (e.g. no more ssh
connections into the box) of more than 1 or 2 minutes (!) duration
(kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
some other poor guys being in "D" state.

 The data flows in basically three modes. All of them are affected:

local-disk -> NFS
NFS -> local-disk
NFS -> NFS

 NFS is V3/TCP.

 So, I made a few experiments in the last few days, using three
different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.

 The first observation (independent of the kernel) is that we *should*
use O_DIRECT, at least for output to the local disk. Here we see about
90 MB/sec write performance. A simple "dd" using 1, 2 and 3 parallel
threads to the same block device (through an ext2 FS) gives:

O_Direct: 88 MB/s, 2x44, 3x29.5
non-O_DIRECT: 51 MB/s, 2x19, 3x12.5

- Observation 1a: IO schedulers are mostly equivalent, with CFQ
slightly worse than AS and DEADLINE
- Observation 1b: when using a 2.6.22.5+cfs20.4, the non-O_DIRECT
performance goes [slightly] down. With three threads it is 3x10 MB/s.
Ingo?
- Observation 1c: bdi-v9 does not help in this case, which is not
surprising.
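
(For reference, the comparison above boils down to commands roughly like
these; the file names are placeholders, and oflag=direct needs a reasonably
recent GNU dd:

# buffered write through the page cache
dd if=/dev/zero of=/scratch/buffered1 bs=1M count=5000
# O_DIRECT write, bypassing the page cache
dd if=/dev/zero of=/scratch/direct1 bs=1M count=5000 oflag=direct
)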

 The real question here is why the non-O_DIRECT case is so slow. Is
this a general thing? Is it related to the CCISS controller? Using
O_DIRECT is unfortunately not an option for us.

 When using three different targets (local disk plus two different NFS
Filesystems) bdi-v9 is a big winner. Without it, all threads are [seem
to be] limited to the speed of the slowest FS. With bdi-v9 we see a
considerable speedup.

 Just by chance I found out that doing all I/O in sync-mode does
prevent the load from going up. Of course, I/O throughput is not
stellar (but not much worse than the non-O_DIRECT case). But the
responsiveness seems OK. Maybe a solution, as this can be controlled via
mount (would be great for O_DIRECT :-).

 In general 2.6.22 seems to be better than 2.6.19, but this is highly
subjective :-( I am using the following settings in /proc. They seem to
provide the smoothest responsiveness:

vm.dirty_background_ratio = 1
vm.dirty_ratio = 1
vm.swappiness = 1
vm.vfs_cache_pressure = 1

 Another thing I saw during my tests is that when writing to NFS, the
"dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
or a bug?

 In any case, view this as a report for one specific load case that
does not behave very well. It seems there are ways to make things
better (sync, per-device throttling, ...), but nothing "perfect" yet.
"Use once" does seem to be a problem.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-09 Thread Martin Knoblauch

--- Jesper Juhl <[EMAIL PROTECTED]> wrote:

> On 05/07/07, Jesper Juhl <[EMAIL PROTECTED]> wrote:
> > On 05/07/07, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> >
> > I'd suspect you can't get both at 100%.
> >
> > I'd guess you are probably using a 100Hz no-preempt kernel.  Have
> you
> > tried a 1000Hz + preempt kernel?   Sure, you'll get a bit lower
> > overall throughput, but interactive responsiveness should be better
> -
> > if it is, then you could experiment with various combinations of
> > CONFIG_PREEMPT, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT_NONE and
> > CONFIG_HZ_1000, CONFIG_HZ_300, CONFIG_HZ_250, CONFIG_HZ_100 to see
> > what gives you the best balance between throughput and interactive
> > responsiveness (you could also throw CONFIG_PREEMPT_BKL and/or
> > CONFIG_NO_HZ, but I don't think the impact will be as significant
> as
> > with the other options, so to keep things simple I'd leave those
> out
> > at first) .
> >
> > I'd guess that something like CONFIG_PREEMPT_VOLUNTARY +
> CONFIG_HZ_300
> > would probably be a good compromise for you, but just to see if
> > there's any effect at all, start out with CONFIG_PREEMPT +
> > CONFIG_HZ_1000.
> >
> 
> I'm curious, did you ever try playing around with CONFIG_PREEMPT*
> and
> CONFIG_HZ* to see if that had any noticeable impact on interactive
> performance and stuff like logging into the box via ssh etc...?
> 
> -- 
> Jesper Juhl <[EMAIL PROTECTED]>
> Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please  http://www.expita.com/nomime.html
> 
> 
Hi Jesper,

 my initial kernel was [EMAIL PROTECTED] I have switched to 300HZ, but
have not observed much difference. The config is now:

config-2.6.22-rc7:# CONFIG_PREEMPT_NONE is not set
config-2.6.22-rc7:CONFIG_PREEMPT_VOLUNTARY=y
config-2.6.22-rc7:# CONFIG_PREEMPT is not set
config-2.6.22-rc7:CONFIG_PREEMPT_BKL=y

Cheers


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-08 Thread Jesper Juhl

On 05/07/07, Jesper Juhl <[EMAIL PROTECTED]> wrote:

On 05/07/07, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> Hi,
>
>  for a customer we are operating a rackful of HP/DL380/G4 boxes that
> have given us some problems with system responsiveness under [I/O
> triggered] system load.
>
>  The systems in question have the following HW:
>
> 2x Intel/EM64T CPUs
> 8GB memory
> CCISS Raid controller with 4x72GB SCSI disks as RAID5
> 2x BCM5704 NIC (using tg3)
>
>  The distribution is RHEL4. We have tested several kernels including
> the original 2.6.9, 2.6.19.2, 2.6.22-rc7 and 2.6.22-rc7+cfs-v18.
>
>  One part of the workload is when several processes try to write 5 GB
> each to the local filesystem (ext2->LVM->CCISS). When this happens, the
> load goes up to 12 and responsiveness goes down. This means from one
> moment to the next things like opening a ssh connection to the host in
> question, or doing "df" take forever (minutes). Especially bad with the
> vendor kernel, better (but not perfect) with 2.6.19 and 2.6.22-rc7.
>
>  The load basically comes from the writing processes and up to 12
> "pdflush" threads all being in "D" state.
>
>  So, what I would like to understand is how we can maximize the
> responsiveness of the system, while keeping disk throughput at maximum.
>

I'd suspect you can't get both at 100%.

I'd guess you are probably using a 100Hz no-preempt kernel.  Have you
tried a 1000Hz + preempt kernel?   Sure, you'll get a bit lower
overall throughput, but interactive responsiveness should be better -
if it is, then you could experiment with various combinations of
CONFIG_PREEMPT, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT_NONE and
CONFIG_HZ_1000, CONFIG_HZ_300, CONFIG_HZ_250, CONFIG_HZ_100 to see
what gives you the best balance between throughput and interactive
responsiveness (you could also throw CONFIG_PREEMPT_BKL and/or
CONFIG_NO_HZ, but I don't think the impact will be as significant as
with the other options, so to keep things simple I'd leave those out
at first) .

I'd guess that something like CONFIG_PREEMPT_VOLUNTARY + CONFIG_HZ_300
would probably be a good compromise for you, but just to see if
there's any effect at all, start out with CONFIG_PREEMPT +
CONFIG_HZ_1000.



I'm curious, did you ever try playing around with CONFIG_PREEMPT* and
CONFIG_HZ* to see if that had any noticeable impact on interactive
performance and stuff like logging into the box via ssh etc...?

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-07 Thread Leroy van Logchem
>  I am just now playing with dirty_ratio. Does anybody know what the lower
> limit is? "0" seems acceptable, but does it actually imply "write out
> immediately"?

You should "watch -n 1 cat /proc/meminfo" and monitor the Dirty and Writeback
while lowering the amount the kernel may keep dirty. The solution we are hoping
for is are the per device dirty throttling -v7 patches.
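
(For example, to watch just the relevant fields -- the names below are the
ones recent 2.6 kernels print in /proc/meminfo:

watch -n 1 'grep -E "^(Dirty|Writeback|NFS_Unstable):" /proc/meminfo'
)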

-- 
Leroy

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Daniel J Blueman

> On 5 Jul, 16:50, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> >  for a customer we are operating a rackful of HP/DL380/G4 boxes
> that
> > have given us some problems with system responsiveness under [I/O
> > triggered] system load.
> [snip]
>
> IIRC, the locking in the CCISS driver was pretty heavy until later in
> the 2.6 series (2.6.16?) kernels; I don't think they were backported
> to the 1000 or so patches that comprise RH EL 4 kernels.
>
> With write performance being really poor on the Smartarray
> controllers
> without the battery-backed write cache, and with less-good locking,
> performance can really suck.
>
> On a total quiescent hp DL380 G2 (dual PIII, 1.13GHz Tualatin 512KB
> L2$) running RH EL 5 (2.6.18) with a 32MB SmartArray 5i controller
> with 6x36GB 10K RPM SCSI disks and all latest firmware:
>
> # dd if=/dev/cciss/c0d0p2 of=/dev/zero bs=1024k count=1000
> 509+1 records in
> 509+1 records out
> 534643200 bytes (535 MB) copied, 11.6336 seconds, 46.0 MB/s
>
> # dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 22.3091 seconds, 4.7 MB/s
>
> Oh dear! There are internal performance problems with this
> controller.
> The SmartArray 5i in the newer DL380 G3 (dual P4 2.8GHz, 512KB L2$)
> is
> perhaps twice the read performance (PCI-X helps some) but still
> sucks.
>
> I'd get the BBWC in or install another controller.
>
Hi Daniel,

 thanks for the suggestion. The DL380g4 boxes have the "6i" and all
systems are equipped with the BBWC (192 MB, split 50/50).

 The thing is not really a speed demon, but sufficient for the task.

 The problem really seems to be related to the VM system not writing
out dirty pages early enough and then getting into trouble when the
pressure gets too high.


Hmm...check out /proc/sys/vm/dirty_* and the documentation in the
kernel tree for this.
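
(i.e. something along the lines of:

grep . /proc/sys/vm/dirty_*        # current values of all the dirty_* knobs
less Documentation/sysctl/vm.txt   # in the kernel source tree
)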

Just measuring single-spindle performance, it's still poor on RH EL4
(2.6.9) x86-64 with 64MB SmartArray 6i (w/o BBWC):

# swapoff -av
swapoff on /dev/cciss/c0d0p2

# time dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1024k count=1000
real    0m49.717s  <-- 20 MB/s

# time dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=1024k count=1000
real    0m25.372s  <-- 39 MB/s

Daniel
--
Daniel J Blueman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch

--- Daniel J Blueman <[EMAIL PROTECTED]> wrote:

> On 5 Jul, 16:50, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> >  for a customer we are operating a rackful of HP/DL380/G4 boxes
> that
> > have given us some problems with system responsiveness under [I/O
> > triggered] system load.
> [snip]
> 
> IIRC, the locking in the CCISS driver was pretty heavy until later in
> the 2.6 series (2.6.16?) kernels; I don't think they were backported
> to the 1000 or so patches that comprise RH EL 4 kernels.
> 
> With write performance being really poor on the Smartarray
> controllers
> without the battery-backed write cache, and with less-good locking,
> performance can really suck.
> 
> On a total quiescent hp DL380 G2 (dual PIII, 1.13GHz Tualatin 512KB
> L2$) running RH EL 5 (2.6.18) with a 32MB SmartArray 5i controller
> with 6x36GB 10K RPM SCSI disks and all latest firmware:
> 
> # dd if=/dev/cciss/c0d0p2 of=/dev/zero bs=1024k count=1000
> 509+1 records in
> 509+1 records out
> 534643200 bytes (535 MB) copied, 11.6336 seconds, 46.0 MB/s
> 
> # dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 22.3091 seconds, 4.7 MB/s
> 
> Oh dear! There are internal performance problems with this
> controller.
> The SmartArray 5i in the newer DL380 G3 (dual P4 2.8GHz, 512KB L2$)
> is
> perhaps twice the read performance (PCI-X helps some) but still
> sucks.
> 
> I'd get the BBWC in or install another controller.
> 
Hi Daniel,

 thanks for the suggestion. The DL380g4 boxes have the "6i" and all
systems are equipped with the BBWC (192 MB, split 50/50).

 The thing is not really a speed demon, but sufficient for the task.

 The problem really seems to be related to the VM system not writing
out dirty pages early enough and then getting into trouble when the
pressure gets too high.

Cheers
Martin



--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Daniel J Blueman

On 5 Jul, 16:50, Martin Knoblauch <[EMAIL PROTECTED]> wrote:

Hi,

 for a customer we are operating a rackful of HP/DL380/G4 boxes that
have given us some problems with system responsiveness under [I/O
triggered] system load.

[snip]

IIRC, the locking in the CCISS driver was pretty heavy until later in
the 2.6 series (2.6.16?) kernels; I don't think those fixes were backported
into the 1000 or so patches that comprise the RH EL 4 kernels.

With write performance being really poor on the Smartarray controllers
without the battery-backed write cache, and with less-good locking,
performance can really suck.

On a total quiescent hp DL380 G2 (dual PIII, 1.13GHz Tualatin 512KB
L2$) running RH EL 5 (2.6.18) with a 32MB SmartArray 5i controller
with 6x36GB 10K RPM SCSI disks and all latest firmware:

# dd if=/dev/cciss/c0d0p2 of=/dev/zero bs=1024k count=1000
509+1 records in
509+1 records out
534643200 bytes (535 MB) copied, 11.6336 seconds, 46.0 MB/s

# dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=1024k count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 22.3091 seconds, 4.7 MB/s

Oh dear! There are internal performance problems with this controller.
The SmartArray 5i in the newer DL380 G3 (dual P4 2.8GHz, 512KB L2$) is
perhaps twice the read performance (PCI-X helps some) but still sucks.

I'd get the BBWC in or install another controller.

Daniel
--
Daniel J Blueman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch
Brice Figureau wrote:

>> CFQ gives less (about 10-15%) throughput except for the kernel
>> with the
>> cfs cpu scheduler, where CFQ is on par with the other IO
>> schedulers.
>>
>
>Please have a look to kernel bug #7372:
>http://bugzilla.kernel.org/show_bug.cgi?id=7372
>
>It seems I encountered the almost same issue.
>
>The fix on my side, beside running 2.6.17 (which was working fine
>for me) was to:
>1) have /proc/sys/vm/vfs_cache_pressure=1
>2) have /proc/sys/vm/dirty_ratio=1 and 
> /proc/sys/vm/dirty_background_ratio=1
>3) have /proc/sys/vm/swappiness=2
>4) run Peter Zijlstra: per dirty device throttling patch on the
> top of 2.6.21.5:
>http://www.ussg.iu.edu/hypermail/linux/kernel/0706.1/2776.html

Brice,

 Are any of them sufficient on their own, or are all of them needed
together? Just to avoid confusion.

Cheers
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Brice Figureau
Martin Knoblauch <spamtrap at knobisoft.de> writes:

> --- Jesper Juhl <jesper.juhl at gmail.com> wrote:
> 
> > On 06/07/07, Robert Hancock <hancockr at shaw.ca> wrote:
> > [snip]
> > >
> > > Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
> > > helps. This workload will fill up memory with dirty data very
> > quickly,
> > > and it seems like system responsiveness often goes down the toilet
> > when
> > > this happens and the system is going crazy trying to write it all
> > out.
> > >
> > 
> > Perhaps trying out a different elevator would also be worthwhile.
> > 
> 
>  AS seems to be the best one (NOOP and DeadLine seem to be equally OK).
> CFQ gives less (about 10-15%) throughput except for the kernel with the
> cfs cpu scheduler, where CFQ is on par with the other IO schedulers.
> 

Please have a look to kernel bug #7372:
http://bugzilla.kernel.org/show_bug.cgi?id=7372

It seems I encountered the almost same issue.

The fix on my side, besides running 2.6.17 (which was working fine for me),
was to:
 1) have /proc/sys/vm/vfs_cache_pressure=1
 2) have /proc/sys/vm/dirty_ratio=1 and /proc/sys/vm/dirty_background_ratio=1
 3) have /proc/sys/vm/swappiness=2
 4) run Peter Zijlstra: per dirty device throttling patch on the top of 
2.6.21.5:
http://www.ussg.iu.edu/hypermail/linux/kernel/0706.1/2776.html
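
(For reference, items 1-3 can be applied at runtime with sysctl, e.g.:

sysctl -w vm.vfs_cache_pressure=1
sysctl -w vm.dirty_ratio=1 vm.dirty_background_ratio=1
sysctl -w vm.swappiness=2

and made persistent via /etc/sysctl.conf.)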

Hope that helps,
--
Brice Figureau

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch
Martin Knoblauch wrote:
>--- Robert Hancock <[EMAIL PROTECTED]> wrote:
>
>>
>> Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
>> helps. This workload will fill up memory with dirty data very
>> quickly,
>> and it seems like system responsiveness often goes down the toilet
>> when
>> this happens and the system is going crazy trying to write it all
>> out.
>>
>
>Definitely the "going crazy" part is the worst problem I see with 2.6
>based kernels (late 2.4 was really better in this corner case).
>
>I am just now playing with dirty_ratio. Does anybody know what the lower
>limit is? "0" seems acceptable, but does it actually imply "write out
>immediately"?
>
>Another problem, the VM parameters are not really well documented in
>their behaviour and interdependence.

 Lowering dirty_ratio just leads to more imbalanced write speeds for
the three dd's. Even when lowering the number to 0, the high load
stays.

 Now, in another experiment I mounted the FS with "sync". With that the
load stays below/around 3. No more "pdflush" daemons going wild. And
the responsiveness is good, with no drops.
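
(In practice that was just a remount with the sync option, roughly -- the
mount point here is the /scratch FS from the dd test:

mount -o remount,sync /scratch
)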

 My question is now: is there a parameter that one can use to force
immediate writeout for every process? This may hurt overall performance
of the system, but might really help my situation. Setting dirty_ratio
to 0 does not seem to do it.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch
>>b) any ideas how to optimize the settings of the /proc/sys/vm/
>>parameters? The documentation is a bit thin here.
>>
>>
>I can't offer any advice there, but is raid-5 really the best choice
>for your needs? I would not choose raid-5 for a system that is
>regularly performing lots of large writes at the same time, don't
>forget that each write can require several reads to recalculate the
>parity.
>
>Does the raid card have much cache ram?
>

 192 MB, split 50/50 between read and write.

>If you can afford to lose some space, raid-10 would probably perform
>better.

 RAID5 most likely is not the best solution and I would not use it if
the described use-case was happening all the time. It happens a few
times a day and then things go down when all memory is filled with
page-cache.

 And the same also happens when copying large amounts of data from one
NFS-mounted FS to another NFS-mounted FS. No disk involved there.
Memory fills with page-cache until it reaches a ceiling and then for
some time responsiveness is really, really bad.

 I am just now playing with the dirty_* stuff. Maybe it helps.

Cheers
Martin



--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch

--- Robert Hancock <[EMAIL PROTECTED]> wrote:

> 
> Try playing with reducing /proc/sys/vm/dirty_ratio and see how that 
> helps. This workload will fill up memory with dirty data very
> quickly, 
> and it seems like system responsiveness often goes down the toilet
> when 
> this happens and the system is going crazy trying to write it all
> out.
> 

 Definitely the "going crazy" part is the worst problem I see with 2.6
based kernels (late 2.4 was really better in this corner case).

 I am just now playing with dirty_ratio. Does anybody know what the lower
limit is? "0" seems acceptable, but does it actually imply "write out
immediately"?

 Another problem: the VM parameters are not really well documented in
their behaviour and interdependence.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch

--- Jesper Juhl <[EMAIL PROTECTED]> wrote:

> On 06/07/07, Robert Hancock <[EMAIL PROTECTED]> wrote:
> [snip]
> >
> > Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
> > helps. This workload will fill up memory with dirty data very
> quickly,
> > and it seems like system responsiveness often goes down the toilet
> when
> > this happens and the system is going crazy trying to write it all
> out.
> >
> 
> Perhaps trying out a different elevator would also be worthwhile.
> 

 AS seems to be the best one (NOOP and DEADLINE seem to be equally OK).
CFQ gives about 10-15% less throughput, except for the kernel with the
cfs cpu scheduler, where CFQ is on par with the other IO schedulers.

Thanks
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-05 Thread Jesper Juhl

On 06/07/07, Robert Hancock <[EMAIL PROTECTED]> wrote:
[snip]


Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
helps. This workload will fill up memory with dirty data very quickly,
and it seems like system responsiveness often goes down the toilet when
this happens and the system is going crazy trying to write it all out.



Perhaps trying out a different elevator would also be worthwhile.

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-05 Thread Robert Hancock

Martin Knoblauch wrote:

Hi,

 for a customer we are operating a rackful of HP/DL380/G4 boxes that
have given us some problems with system responsiveness under [I/O
triggered] system load.

 The systems in question have the following HW:

2x Intel/EM64T CPUs
8GB memory
CCISS Raid controller with 4x72GB SCSI disks as RAID5
2x BCM5704 NIC (using tg3)

 The distribution is RHEL4. We have tested several kernels including
the original 2.6.9, 2.6.19.2, 2.6.22-rc7 and 2.6.22-rc7+cfs-v18.

 One part of the workload is when several processes try to write 5 GB
each to the local filesystem (ext2->LVM->CCISS). When this happens, the
load goes up to 12 and responsiveness goes down. This means from one
moment to the next things like opening a ssh connection to the host in
question, or doing "df" take forever (minutes). Especially bad with the
vendor kernel, better (but not perfect) with 2.6.19 and 2.6.22-rc7.

 The load basically comes from the writing processes and up to 12
"pdflush" threads all being in "D" state.

 So, what I would like to understand is how we can maximize the
responsiveness of the system, while keeping disk throughput at maximum.

 During my investigation I basically performed the following test,
because it represents the kind of trouble situation:


$ cat dd3.sh
echo "Start 3 dd processes: "`date`
dd if=/dev/zero of=/scratch/X1 bs=1M count=5000&
dd if=/dev/zero of=/scratch/X2 bs=1M count=5000&
dd if=/dev/zero of=/scratch/X3 bs=1M count=5000&
wait
echo "Finish 3 dd processes: "`date`
sync
echo "Finish sync: "`date`
rm -f /scratch/X?
echo "Files removed: "`date`


 This results in the following timings. All with the anticipatory
scheduler, because it gives the best results:

2.6.19.2, HT: 10m
2.6.19.2, non-HT: 8m45s
2.6.22-rc7, HT: 10m
2.6.22-rc7, non-HT: 6m
2.6.22-rc7+cfs_v18, HT: 10m40s
2.6.22-rc7+cfs_v18, non-HT: 10m45s

 The "felt" responsiveness was best with the last two kernels, although
the load profile over time looks identical in all cases.

 So, a few questions:

a) any idea why disabling HT improves throughput, except for the cfs
kernels? For plain 2.6.22 the difference is quite substantial
b) any ideas how to optimize the settings of the /proc/sys/vm/
parameters? The documentation is a bit thin here.


Try playing with reducing /proc/sys/vm/dirty_ratio and see how that 
helps. This workload will fill up memory with dirty data very quickly, 
and it seems like system responsiveness often goes down the toilet when 
this happens and the system is going crazy trying to write it all out.
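
Something like the following, for example (the numbers are only a starting
point to experiment with):

# check the current settings, then try lower values
grep . /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio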


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-05 Thread Jesper Juhl

On 05/07/07, Martin Knoblauch <[EMAIL PROTECTED]> wrote:

Hi,

 for a customer we are operating a rackful of HP/DL380/G4 boxes that
have given us some problems with system responsiveness under [I/O
triggered] system load.

 The systems in question have the following HW:

2x Intel/EM64T CPUs
8GB memory
CCISS Raid controller with 4x72GB SCSI disks as RAID5
2x BCM5704 NIC (using tg3)

 The distribution is RHEL4. We have tested several kernels including
the original 2.6.9, 2.6.19.2, 2.6.22-rc7 and 2.6.22-rc7+cfs-v18.

 One part of the workload is when several processes try to write 5 GB
each to the local filesystem (ext2->LVM->CCISS). When this happens, the
load goes up to 12 and responsiveness goes down. This means that from one
moment to the next, things like opening an ssh connection to the host in
question or doing "df" take forever (minutes). Especially bad with the
vendor kernel, better (but not perfect) with 2.6.19 and 2.6.22-rc7.

 The load basically comes from the writing processes and up to 12
"pdflush" threads all being in "D" state.

 So, what I would like to understand is how we can maximize the
responsiveness of the system, while keeping disk throughput at maximum.



I'd suspect you can't get both at 100%.

I'd guess you are probably using a 100Hz no-preempt kernel.  Have you
tried a 1000Hz + preempt kernel?   Sure, you'll get a bit lower
overall throughput, but interactive responsiveness should be better -
if it is, then you could experiment with various combinations of
CONFIG_PREEMPT, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT_NONE and
CONFIG_HZ_1000, CONFIG_HZ_300, CONFIG_HZ_250, CONFIG_HZ_100 to see
what gives you the best balance between throughput and interactive
responsiveness (you could also throw CONFIG_PREEMPT_BKL and/or
CONFIG_NO_HZ into the mix, but I don't think the impact will be as
significant as with the other options, so to keep things simple I'd
leave those out at first).

I'd guess that something like CONFIG_PREEMPT_VOLUNTARY + CONFIG_HZ_300
would probably be a good compromise for you, but just to see if
there's any effect at all, start out with CONFIG_PREEMPT +
CONFIG_HZ_1000.
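
A quick way to confirm what a given kernel was actually built with,
assuming the config was installed under /boot (as RHEL does) or is
exposed via /proc/config.gz when CONFIG_IKCONFIG_PROC is enabled:

# List the HZ and preemption settings of the running kernel.
grep -E 'CONFIG_HZ|CONFIG_PREEMPT' /boot/config-$(uname -r)

# Or, if the kernel was built with CONFIG_IKCONFIG_PROC:
zcat /proc/config.gz | grep -E 'CONFIG_HZ|CONFIG_PREEMPT'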

Hope that helps.

(PS. please don't do crap like using that spamtrap@ address and have
people manually replace it with the one from your .signature when
posting on LKML - it's annoying as hell)

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html


Re: Understanding I/O behaviour

2007-07-05 Thread Andrew Lyon

On 7/5/07, Martin Knoblauch <[EMAIL PROTECTED]> wrote:

Hi,

 for a customer we are operating a rackful of HP/DL380/G4 boxes that
have given us some problems with system responsiveness under [I/O
triggered] system load.

 The systems in question have the following HW:

2x Intel/EM64T CPUs
8GB memory
CCISS Raid controller with 4x72GB SCSI disks as RAID5
2x BCM5704 NIC (using tg3)

 The distribution is RHEL4. We have tested several kernels including
the original 2.6.9, 2.6.19.2, 2.6.22-rc7 and 2.6.22-rc7+cfs-v18.

 One part of the workload is when several processes try to write 5 GB
each to the local filesystem (ext2->LVM->CCISS). When this happens, the
load goes up to 12 and responsiveness goes down. This means that from one
moment to the next, things like opening an ssh connection to the host in
question or doing "df" take forever (minutes). Especially bad with the
vendor kernel, better (but not perfect) with 2.6.19 and 2.6.22-rc7.

 The load basically comes from the writing processes and up to 12
"pdflush" threads all being in "D" state.

 So, what I would like to understand is how we can maximize the
responsiveness of the system, while keeping disk throughput at maximum.

 During my investigation I basically performed the following test,
because it reproduces the kind of situation that causes trouble:


$ cat dd3.sh
echo "Start 3 dd processes: "`date`
dd if=/dev/zero of=/scratch/X1 bs=1M count=5000&
dd if=/dev/zero of=/scratch/X2 bs=1M count=5000&
dd if=/dev/zero of=/scratch/X3 bs=1M count=5000&
wait
echo "Finish 3 dd processes: "`date`
sync
echo "Finish sync: "`date`
rm -f /scratch/X?
echo "Files removed: "`date`


 This results in the following timings. All with the anticipatory
scheduler, because it gives the best results:

2.6.19.2, HT: 10m
2.6.19.2, non-HT: 8m45s
2.6.22-rc7, HT: 10m
2.6.22-rc7, non-HT: 6m
2.6.22-rc7+cfs_v18, HT: 10m40s
2.6.22-rc7+cfs_v18, non-HT: 10m45s

 The "felt" responsiveness was best with the last two kernels, although
the load profile over time looks identical in all cases.

 So, a few questions:

a) any idea why disabling HT improves throughput, except for the cfs
kernels? For plain 2.6.22 the difference is quite substantial.


Under certain loads HT can reduce performance. I have had serious
performance problems on Windows terminal servers with HT enabled, and
I now disable it on all servers, no matter what OS they run.

Why? http://blogs.msdn.com/slavao/archive/2005/11/12/492119.aspx
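
Whether HT is actually active can be checked from userspace; assuming
the usual single-core Xeons in a DL380 G4, each physical package should
report 2 siblings with HT enabled and 1 with it disabled:

# Count logical CPUs per physical package.
grep -E '^(physical id|siblings)' /proc/cpuinfo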


b) any ideas how to optimize the settings of the /proc/sys/vm/
parameters? The documentation is a bit thin here.


I can't offer any advice there, but is raid-5 really the best choice
for your needs? I would not choose raid-5 for a system that is
regularly performing lots of large writes at the same time; don't
forget that each write can require several reads to recalculate the
parity.

Does the raid card have much cache ram?

If you can afford to lose some space, raid-10 would probably perform better.
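
To put a number on the parity overhead, a back-of-the-envelope
illustration for the 4-disk RAID-5 here (3 data chunks plus 1 parity
chunk per stripe):

  partial-stripe update: read old data + read old parity
                         + write new data + write new parity
                         = 4 I/Os for 1 chunk of payload
  full-stripe write:     write 3 data chunks + write 1 parity chunk
                         = 4 I/Os for 3 chunks of payload, no reads

So whether the controller and its write cache manage to coalesce the dd
streams into full-stripe writes makes a big difference for this workload.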

Andy



Thanks in advance
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de




Understanding I/O behaviour

2007-07-05 Thread Martin Knoblauch
Hi,

 for a customer we are operating a rackful of HP/DL380/G4 boxes that
have given us some problems with system responsiveness under [I/O
triggered] system load.

 The systems in question have the following HW:

2x Intel/EM64T CPUs
8GB memory
CCISS Raid controller with 4x72GB SCSI disks as RAID5
2x BCM5704 NIC (using tg3)

 The distribution is RHEL4. We have tested several kernels including
the original 2.6.9, 2.6.19.2, 2.6.22-rc7 and 2.6.22-rc7+cfs-v18.

 One part of the workload is when several processes try to write 5 GB
each to the local filesystem (ext2->LVM->CCISS). When this happens, the
load goes up to 12 and responsiveness goes down. This means that from one
moment to the next, things like opening an ssh connection to the host in
question or doing "df" take forever (minutes). Especially bad with the
vendor kernel, better (but not perfect) with 2.6.19 and 2.6.22-rc7.

 The load basically comes from the writing processes and up to 12
"pdflush" threads all being in "D" state.

 So, what I would like to understand is how we can maximize the
responsiveness of the system, while keeping disk throughput at maximum.

 During my investigation I basically performed the following test,
because it reproduces the kind of situation that causes trouble:


$ cat dd3.sh
echo "Start 3 dd processes: "`date`
dd if=/dev/zero of=/scratch/X1 bs=1M count=5000&
dd if=/dev/zero of=/scratch/X2 bs=1M count=5000&
dd if=/dev/zero of=/scratch/X3 bs=1M count=5000&
wait
echo "Finish 3 dd processes: "`date`
sync
echo "Finish sync: "`date`
rm -f /scratch/X?
echo "Files removed: "`date`


 This results in the following timings. All with the anticipatory
scheduler, because it gives the best results:

2.6.19.2, HT: 10m
2.6.19.2, non-HT: 8m45s
2.6.22-rc7, HT: 10m
2.6.22-rc7, non-HT: 6m
2.6.22-rc7+cfs_v18, HT: 10m40s
2.6.22-rc7+cfs_v18, non-HT: 10m45s

 The "felt" responsiveness was best with the last two kernels, although
the load profile over time looks identical in all cases.

 So, a few questions:

a) any idea why disabling HT improves throughput, except for the cfs
kernels? For plain 2.6.22 the difference is quite substantial.
b) any ideas how to optimize the settings of the /proc/sys/vm/
parameters? The documentation is a bit thin here.
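
One way to watch what the VM is doing while dd3.sh runs, using nothing
more than /proc/meminfo and ps (so it works on all of the kernels
above), would be something like:

# Print the dirty/writeback totals and the number of pdflush threads
# stuck in D state every 5 seconds.
while sleep 5; do
    date
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    echo -n "pdflush in D state: "
    ps -eo stat,comm | awk '$2 == "pdflush" && $1 ~ /^D/' | wc -l
done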

Thanks in advance
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


Re: Understanding I/O behaviour

2007-07-05 Thread Jesper Juhl

On 06/07/07, Robert Hancock <[EMAIL PROTECTED]> wrote:
[snip]


Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
helps. This workload will fill up memory with dirty data very quickly,
and it seems like system responsiveness often goes down the toilet when
this happens and the system is going crazy trying to write it all out.



Perhaps trying out a different elevator would also be worthwhile.
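
For reference, the elevator can be switched per device at runtime,
provided the other schedulers are compiled in or available as modules
(the cciss device name below is a guess for this controller; check what
actually shows up under /sys/block):

# Show the available elevators; the active one is in brackets.
cat '/sys/block/cciss!c0d0/queue/scheduler'

# Switch just this array to e.g. deadline and re-run dd3.sh.
echo deadline > '/sys/block/cciss!c0d0/queue/scheduler'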

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html

