There is definitely something strange going on here.
As the bonnie test below shows, the SCSI disk used
for my tests should vastly outperform the old IDE one:
Seagate ST318451LW (SCSI):
        -------Sequential Output-------- ---Sequential Input-- --Random--
        -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
     MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
    200 21544 96.8 51367 51.4 11141 16.3 17729 58.2 40968 40.4 602.9  5.4

Quantum Fireball ST3.2A (IDE):
        -------Sequential Output-------- ---Sequential Input-- --Random--
        -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
     MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
    200  3884 72.8  4513 86.0  1781 36.4  3144 89.9  4052 95.3 131.5  0.9
I used a program based on Mike Black's "Blah Blah" test
(shown below) in which 200 write()+fdatasync()s are
performed. Each write() outputs either 20 or 4096 bytes.
On my Celeron 533 MHz, 128 MB RAM hardware with an ext2 fs,
the "block" size seen by the sd driver for each
fdatasync() is 4096 bytes. Linux kernel 2.4.2 is being used. The
fs/buffer.c __wait_on_buffer() routine waits for IO
completion in response to fdatasync(). Timings have been
done with Andrew Morton's timepegs (units are microseconds).
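[For a rough cross-check without kernel instrumentation, the same
per-pair latency can be measured from userspace by bracketing each
write()+fdatasync() with gettimeofday(). A minimal sketch, using the
same file name and sizes as the test program at the end of this
message; its numbers include syscall entry/exit overhead on both
sides, so they should read slightly higher than the timepeg
intervals:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>

#define NUM_BLKS 200
#define BLK_SIZE 20     /* or 4096 */

char buff[BLK_SIZE];    /* zero filled (BSS) */

int main(int argc, char * argv[])
{
    struct timeval t0, t1;
    long us;
    int fd, k;

    fd = open("tst.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (k = 0; k < NUM_BLKS; ++k) {
        gettimeofday(&t0, NULL);
        write(fd, buff, BLK_SIZE);
        fdatasync(fd);
        gettimeofday(&t1, NULL);
        us = (t1.tv_sec - t0.tv_sec) * 1000000L +
             (t1.tv_usec - t0.tv_usec);
        printf("%3d %ld\n", k, us);  /* microseconds per pair */
    }
    close(fd);
    return 0;
}]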
Here are the IDE results:
IDE 20*200:
Destination                   Count        Min        Max    Average       Total
enter __wait_on_buffer:0 ->
  leave __wait_on_buffer:0      200   1,037.23   6,487.72   1,252.19  250,439.80
leave __wait_on_buffer:0 ->
  enter __wait_on_buffer:0      199       7.32      21.05       7.82    1,557.05

IDE 4096*200:
Destination                   Count        Min        Max    Average       Total
enter __wait_on_buffer:0 ->
  leave __wait_on_buffer:0      200   1,037.06   7,354.21   1,243.78  248,756.64
leave __wait_on_buffer:0 ->
  enter __wait_on_buffer:0      199      23.01      67.32      37.03    7,370.51
So the size of each transfer doesn't matter to this IDE
disk. Now the same test for the SCSI disk:
SCSI 20*200:
Destination                   Count        Min        Max    Average       Total
enter __wait_on_buffer:0 ->
  enter sd_init_command:0       200       1.86      13.27       2.05      411.48
enter sd_init_command:0 ->
  enter rw_intr:0               200     320.87   5,398.56   3,417.30  683,461.25
enter rw_intr:0 ->
  leave __wait_on_buffer:0      200       4.04      15.81       4.42      885.73
leave __wait_on_buffer:0 ->
  enter __wait_on_buffer:0      199       8.78      14.39       9.26    1,844.23

SCSI 4096*200:
Destination                   Count        Min        Max    Average       Total
enter __wait_on_buffer:0 ->
  enter sd_init_command:0       200       1.97      13.20       2.21      443.52
enter sd_init_command:0 ->
  enter rw_intr:0               200     109.53  13,997.50   1,327.47  265,495.87
enter rw_intr:0 ->
  leave __wait_on_buffer:0      200       4.37      22.50       4.75      951.44
leave __wait_on_buffer:0 ->
  enter __wait_on_buffer:0      199      22.40      42.20      24.27    4,831.34
The extra timepegs inside the SCSI subsystem show that
the IO transaction to that disk really did take that
long. [Initially I suspected a "plugging" type
elevator bug, but that isn't supported by the above
and various other timepegs not shown.]
Since there is a wait on completion for every write,
tagged queuing should not be involved.
So writing more data to the SCSI disk speeds it up!
I suspect the critical point in the "20*200" test is
that the same sequence of 8 512-byte sectors is being
written to disk 200 times (200 * 20 bytes = 4000 bytes,
all of which fit in one 4096-byte block). BTW, that disk
spins at 15K rpm, so one rotation takes 60,000/15,000 =
4 ms, and it has a 4 MB cache.
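[If that is the critical point, a variant that seeks to a fresh
4096-byte-aligned offset before each 20-byte write should avoid
the penalty, since every fdatasync() would then flush a different
block. A minimal sketch of such a variant; the tst2.txt name and
the 4096-byte stride are arbitrary choices of mine:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NUM_BLKS 200

static char buff[20];   /* zero filled */

int main(int argc, char * argv[])
{
    int fd, k;

    fd = open("tst2.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (k = 0; k < NUM_BLKS; ++k) {
        /* land each 20 byte write in its own 4096 byte block */
        lseek(fd, (off_t)k * 4096, SEEK_SET);
        write(fd, buff, sizeof(buff));
        fdatasync(fd);
    }
    close(fd);
    return 0;
}]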
Even though the SCSI disk's "cache" mode page indicates
that the write cache is on, it would seem that writing
the same sectors continually causes flushes to the medium
(and hence the associated delay). Here is scu's output
of the "cache" mode page:
$ scu -f /dev/sda show page cache
Cache Control Parameters (Page 0x8 - Current Values):
Mode Parameter Header:
Mode Data Length: 31
Medium Type: 0 (Default Medium Type)
Device Specific Parameter: 0x10 (Supports DPO & FUA bits)
Block Descriptor Length: 8
Mode Parameter Block Descriptor:
Density Code: 0x2
Number of Logical Blocks: 2289239 (1117.792 megabytes)
Logical Block Length: 512
Page Header / Data:
Page Code: 0x8
Parameters Savable: Yes
Page Length: 18
Read Cache Disable (RCD): No
Multiplication Factor (MF): Off
Write Cache Enable (WCE): Yes
Cache Segment Size Enable (SIZE): Off
Discontinuity (DISC): On
Caching Analysis Permitted (CAP): Disabled
Abort Pre-Fetch (ABPF): Off
Initiator Control Enable (IC): Off
Write Retention Priority: 0 (Not distinguished)
Demand Read Retention Priority: 0 (Not distinguished)
Disable Prefetch Transfer Length: 65535 blocks
Minimum Prefetch: 0 blocks
Maximum Prefetch: 65535 blocks
Maximum Prefetch Ceiling: 65535 blocks
Vendor Specific Cache Bits (VS): 0
Disable Read-Ahead (DRA): No
Logical Block Cache Segment Size: Off (Cache size interpreted as bytes)
Force Sequential Write (FSW): Yes (Blocks written in sequential order)
Number of Cache Segments: 20
Cache Segment Size: 0 bytes
Non-Cache Segment Size: 0 bytes
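[For anyone who wants to read that page programmatically rather
than via scu, below is a minimal sketch that issues MODE SENSE(6)
for page 0x8 through the sg driver's SG_IO ioctl and prints the
WCE and RCD bits. It assumes the sg version shipped with lk 2.4
(sg 3.x, which has SG_IO) and that /dev/sg0 maps to this disk;
actually changing a bit would need a MODE SELECT on top of this,
which is not shown:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(int argc, char * argv[])
{
    unsigned char cdb[6] = {0x1a, 0, 0x08, 0, 252, 0};
                           /* MODE SENSE(6), caching page, current values */
    unsigned char buf[252];
    unsigned char sense[32];
    sg_io_hdr_t io;
    int fd, off;

    fd = open("/dev/sg0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/sg0");
        return 1;
    }
    memset(&io, 0, sizeof(io));
    io.interface_id = 'S';
    io.cmd_len = sizeof(cdb);
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_FROM_DEV;
    io.dxfer_len = sizeof(buf);
    io.dxferp = buf;
    io.mx_sb_len = sizeof(sense);
    io.sbp = sense;
    io.timeout = 20000;    /* milliseconds */
    if (ioctl(fd, SG_IO, &io) < 0) {
        perror("SG_IO");
        return 1;
    }
    off = 4 + buf[3];      /* skip mode header + block descriptor(s) */
    printf("WCE=%d RCD=%d\n", (buf[off + 2] >> 2) & 1,
           buf[off + 2] & 1);
    close(fd);
    return 0;
}]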
Perhaps someone has an idea of which of the above settings
can be tweaked to make the write caching work better in
this case.
Doug Gilbert
----------------------------------------------------------
Test program:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NUM_BLKS 200
#define BLK_SIZE 4096   /* use either 20 or 4096 */

char buff[BLK_SIZE * NUM_BLKS];   /* zero filled (BSS) */

int main(int argc, char * argv[])
{
    int fd, k;

    fd = open("tst.txt", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open tst.txt");
        return 1;
    }
    /* NUM_BLKS write()+fdatasync() pairs, BLK_SIZE bytes per write */
    for (k = 0; k < NUM_BLKS; ++k) {
        if (write(fd, buff + (k * BLK_SIZE), BLK_SIZE) < 0)
            perror("write");
        fdatasync(fd);
    }
    close(fd);
    return 0;
}
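It can be built and timed with something like:

gcc -O2 -o tst tst.c && time ./tst

(tst.c being whatever name the source is saved under).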
Mike Black wrote:
>
> Here's a strace -r on IDE:
> 0.001488 write(3, "\214\1\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000516 fdatasync(0x3) = 0
> 0.001530 write(3, "\215\1\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000513 fdatasync(0x3) = 0
> 0.001555 write(3, "\216\1\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000517 fdatasync(0x3) = 0
> 0.001494 write(3, "\217\1\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000515 fdatasync(0x3) = 0
> 0.001495 write(3, "\220\1\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000522 fdatasync(0x3) = 0
>
> Here it is on SCSI:
> 0.049285 write(3, "\3\0\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000689 fdatasync(0x3) = 0
> 0.049148 write(3, "\4\0\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000516 fdatasync(0x3) = 0
> 0.049318 write(3, "\5\0\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000516 fdatasync(0x3) = 0
> 0.049343 write(3, "\6\0\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
>
> Looks like a constant 50ms delay on each fdatasync() on SCSI vs .5ms for
> IDE. Maybe IDE isn't really doing a sync?? I find .5ms to be a little too
> good.
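[A 0.5 ms fdatasync() is consistent with the IDE disk acknowledging
writes out of its onboard write cache rather than from the medium.
One way to check, not done here, would be to turn the drive's write
cache off and rerun, e.g.:

hdparm -W0 /dev/hda     (-W1 turns it back on)

with /dev/hda being whichever disk holds the test file.]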
>
> I did this on 4 different machines with different SCSI cards (including RAID5
> and non-RAID), disks, and IDE drives with the same behavior.
>
> ________________________________________
> Michael D. Black Principal Engineer
> [EMAIL PROTECTED] 321-676-2923,x203
> http://www.csihq.com Computer Science Innovations
> http://www.csihq.com/~mike My home page
> FAX 321-676-2355
> ----- Original Message -----
> From: "Jeremy Hansen" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Friday, March 02, 2001 11:27 AM
> Subject: scsi vs ide performance on fsync's
>
> We're doing some mysql benchmarking. For some reason it seems that ide
> drives are currently beating a scsi raid array and it seems to be related
> to fsync's. Bonnie stats show the scsi array to blow away ide as
> expected, but mysql tests still have the IDE beating on plain insert
> speeds. Can anyone explain how this is possible, or perhaps explain how
> our testing may be flawed?
>
> Here's the bonnie stats:
>
> IDE Drive:
>
> Version 1.00g      ------Sequential Output------ --Sequential Input- --Random-
>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> jeremy        300M  9026  94 17524  12  8173   9  7269  83 23678   7 102.9   0
>                    ------Sequential Create------ --------Random Create--------
>                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                 16   469  98  1476  98 16855  89   459  98  7132  99   688  25
>
> SCSI Array:
>
> Version 1.00g      ------Sequential Output------ --Sequential Input- --Random-
>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> orville       300M  8433 100 134143 99 127982 99  8016 100 374457 99 1583.4  6
>                    ------Sequential Create------ --------Random Create--------
>                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                 16   503  13 +++++ +++   538  13   490  13 +++++ +++   428  11
>
> So...obviously from the bonnie stats, the scsi array blows away the
> ide...but using the little c program I've attached, here's what we
> get for fsync stats:
>
> IDE Drive:
>
> jeremy:~# time ./xlog file.out fsync
>
> real 0m1.850s
> user 0m0.000s
> sys 0m0.220s
>
> SCSI Array:
>
> [root@orville mysql_data]# time /root/xlog file.out fsync
>
> real 0m23.586s
> user 0m0.010s
> sys 0m0.110s
>
> I would appreciate any help understand what I'm seeing here and any
> suggestions on how to improve the performance.
>
> The SCSI adapter on the raid array is an Adaptec 39160, the raid
> controller is a CMD-7040. Kernel 2.4.0 using XFS for the filesystem on
> the raid array, kernel 2.2.18 on ext2 on the IDE drive. The filesystem is
> not the problem, as I get almost the exact same results running this on
> ext2 on the raid array.
>
> Thanks
> -jeremy
>
> --
> this is my sig.