[ Wednesday, September 15, 1999 ] [EMAIL PROTECTED] wrote:
> > Also read the /usr/doc info on calculating stride.
>
> The version I have doesn't mention anything useful in connection with
> RAID-1, only RAID-4/5, so I left it alone. I'd be glad to change this
> to any reasonable number, though.
I don't understand how stride would matter, since its point is to
tell ext2 about striping at lower levels of the device so it can
optimize its access patterns... since raid1 is simply mirroring and
no striping exists, I don't see what's up here.
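(For reference -- and hedging a bit, since I'm going from the
Software-RAID docs rather than the mke2fs source -- stride is just the
RAID chunk size divided by the ext2 block size, e.g. 64KB chunks with
4KB blocks gives stride = 64/4 = 16, passed as "mke2fs -b 4096 -R
stride=16". With no chunks to align to, the number is meaningless for
raid1.)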
My question would be whether a block read of an md raid1 device across
2 drives does what I think is the Right Thing... breaks the block into
2 equal sections, asks each drive for one of the sections, then passes
the block back once both finish...
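Just to make the question concrete, here's a throwaway user-space
sketch of the dispatch I have in mind (entirely hypothetical -- made-up
names, and the kernel would of course be juggling buffer heads rather
than pread() calls):
--------------------
/* split_read.c: hypothetical model of a mirrored read that splits one
 * logical block across two identical mirrors -- NOT what raid1.c
 * actually does, just the idea made concrete. */
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

static ssize_t mirror_read(int fd0, int fd1, void *buf, size_t len,
                           off_t off)
{
        size_t half = len / 2;

        /* first half from mirror 0, second half from mirror 1; since
         * the mirrors are identical, either disk could serve either
         * half */
        if (pread(fd0, buf, half, off) != (ssize_t) half)
                return -1;
        if (pread(fd1, (char *) buf + half, len - half, off + half)
            != (ssize_t) (len - half))
                return -1;
        return len;
}

int main(int argc, char **argv)
{
        char buf[8192];                 /* one bonnie-sized block */
        int fd0, fd1;

        if (argc != 3)
                return 1;
        fd0 = open(argv[1], O_RDONLY);  /* mirror 0 */
        fd1 = open(argv[2], O_RDONLY);  /* mirror 1 */
        if (fd0 < 0 || fd1 < 0)
                return 1;
        if (mirror_read(fd0, fd1, buf, sizeof(buf), 0) < 0)
                return 1;
        printf("read %d bytes, half from each mirror\n",
               (int) sizeof(buf));
        return 0;
}
--------------------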
>               -------Sequential Output-------- ---Sequential Input-- --Random--
>               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
> pa        384  3312 47.9 10074 14.6  4463 12.6  3494 49.4 12235 13.2  77.6  2.2
> pb        384  3332 47.9  9320 12.9  4411 11.7  3468 48.9 11991 13.4  83.5  1.8
> sum       768  6644 95.8 19394 27.5  8874 24.3  6962 98.3 24226 26.6 161.1  4.0
>
>
> This pretty clearly suggests that the hardware is capable of a lot more
> than the RAID-1 is actually doing. I'd expect RAID-1 block writes to
> be held to single-disk speed, around 10 MB/s, since every block goes
> to both disks (although they seem faster, perhaps because of the
> cache). But reads should at least be in the ballpark of 20-24 MB/s,
> shouldn't they, rather than the 9 MB/s I'm getting?
Certainly sounds pretty logical...
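(And the arithmetic backs it up: 12235 + 11991 = 24226 K/sec of block
input if both spindles stream in parallel, so 20-24 MB/s is the right
target, versus the ~9 MB/s the md device actually delivers.)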
I'm still trying to figure out raid1.c/raid1_make_request... here's
my (probably bad) attempt at a translation...
1) while loop to get a new buffer head struct, then zero it
2) reset atomic reads (READA) to plain reads (not sure why?)
3) set up the buffer head information
4) check for rw of READ or READA (it can't currently be READA due to
   the previous code, but hopefully a decent -O2 catches this)
5) Now here's where I get confused... whether we stick to the same
   mirror disk during this read depends on 2 things:
        if (bh->b_blocknr * sectors == raid_conf->next_sect)
   This checks the sector offset into the md block device against
   the sector we would have read next last time around, i.e. it's
   simply detecting a sequential read.  The hope is to optimize by
   going right back to the same drive, since we should save some
   seek time (although in the benchmarking case, where we do lots
   of sequential I/O, switching isn't as painful, since the
   mirror(s) will have their arms in the same place as well).
   Since we want to avoid the seek if possible, we stick to the
   disk UNLESS we've hit a hard-coded limit of 64KB (128 sectors):
        if (raid_conf->sect_count >= mirror->sect_limit)
                switch_disks = 1;
   (wouldn't that make more sense as > rather than >=?)
   Now I think this is key, since bonnie ends up sending 8
   consecutive block reads to the *same drive* (it deals in 8KB
   chunks, i.e. 16 sectors each).  If the limit were set down to 8
   sectors (4KB), we would switch drives in the middle of each 8KB
   block, so every read would get split across both drives... of
   course, perhaps the raid1 block size doesn't allow that to
   happen... the multiple hard-coded 128's do scare me, though :)
If someone can let me know whether it breaks the code for that 128
to be much smaller, I'll change it and compare bonnie runs in both
cases for improvement.
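In the meantime, here's a quick user-space mock of the decision (only
the if/else is real raid1.c logic, transcribed from the read branch
below; everything else is scaffolding) -- twiddle sect_limit and watch
where the switches land:
--------------------
/* balance_mock.c: which mirror serves each of a run of consecutive
 * 8KB block reads, under the current read-balancing rule */
#include <stdio.h>

int main(void)
{
        const int sectors_per_blk = 8192 >> 9;  /* bonnie's 8KB reads = 16 sectors */
        const int sect_limit = 128;             /* the hard-coded 64KB limit */
        long next_sect = 0;
        int sect_count = 0, disk = 0, blk;

        for (blk = 0; blk < 24; blk++) {
                int switch_disks = 0;

                printf("block %2d -> mirror %d\n", blk, disk);

                if ((long) blk * sectors_per_blk == next_sect) {
                        /* sequential continuation: stay put until we
                         * hit sect_limit (>= as in the stock code; my
                         * patch would make this >) */
                        sect_count += sectors_per_blk;
                        if (sect_count >= sect_limit)
                                switch_disks = 1;
                } else
                        switch_disks = 1;       /* seeking anyway, switch */
                next_sect = (long) (blk + 1) * sectors_per_blk;

                if (switch_disks) {
                        sect_count = 0;
                        disk ^= 1;              /* round-robin over 2 mirrors */
                }
        }
        return 0;
}
--------------------
With sect_limit at 128 you get runs of 8 blocks per mirror, which is
exactly the bonnie behavior described above.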
Just to (imho) help improve readability...
--------------------
--- raid1.c~ Wed Sep 15 21:45:18 1999
+++ raid1.c Wed Sep 15 22:15:07 1999
@@ -203,7 +203,8 @@
struct raid1_data *raid_conf = (struct raid1_data *) mddev->private;
struct buffer_head *mirror_bh[MD_SB_DISKS], *bh_req;
struct raid1_bh * r1_bh;
- int n = raid_conf->raid_disks, i, sum_bhs = 0, switch_disks = 0, sectors;
+ int n = raid_conf->raid_disks, i, sum_bhs = 0,
+ switch_disks = 0, sectors_per_blk;
struct mirror_info *mirror;
PRINTK(("raid1_make_request().\n"));
@@ -239,14 +240,14 @@
PRINTK(("raid1_make_request(), read branch.\n"));
mirror = raid_conf->mirrors + last_used;
bh->b_rdev = mirror->dev;
- sectors = bh->b_size >> 9;
- if (bh->b_blocknr * sectors == raid_conf->next_sect) {
- raid_conf->sect_count += sectors;
- if (raid_conf->sect_count >= mirror->sect_limit)
+ sectors_per_blk = bh->b_size >> 9;
+ if (bh->b_blocknr * sectors_per_blk == raid_conf->next_sect) {
+ raid_conf->sect_count += sectors_per_blk;
+ if (raid_conf->sect_count > mirror->sect_limit)
switch_disks = 1;
} else
switch_disks = 1;
- raid_conf->next_sect = (bh->b_blocknr + 1) * sectors;
+ raid_conf->next_sect = (bh->b_blocknr + 1) * sectors_per_blk;
if (switch_disks) {
PRINTK(("read-balancing: switching %d -> %d (%d sectors)\n",
last_used, mirror->next, raid_conf->sect_count));
raid_conf->sect_count = 0;
--------------------
and then what I *think* (looking forward to the correction) should
help cut out one very common address calculation, by moving mirrors[]
to the front so its address is the same as raid1_data's:
--------------------
--- raid1.h~ Wed Sep 15 21:55:44 1999
+++ raid1.h Wed Sep 15 21:55:47 1999
@@ -19,7 +19,6 @@
};
struct raid1_data {
- struct md_dev *mddev;
struct mirror_info mirrors[MD_SB_DISKS]; /* RAID1 devices, 2 to
MD_SB_DISKS */
int raid_disks;
int working_disks; /* Number of working disks */
@@ -27,6 +26,9 @@
unsigned long next_sect;
int sect_count;
int resync_running;
+ struct md_dev *mddev; /* since should be a little-used pointer back
+ up, move to the end to save another address
+ calculation in a common case */
};
/*
--------------------
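And a quick stand-alone sanity check (stub types, made-up array size)
that the shuffle really does put mirrors[] at offset 0, so that
raid_conf->mirrors + last_used needs no extra constant add:
--------------------
/* offset_check.c: compare mirrors[] offsets in the two layouts */
#include <stdio.h>
#include <stddef.h>

#define MD_SB_DISKS 12                  /* stand-in value */

struct md_dev;                          /* opaque stand-in */
struct mirror_info { int dev, next, sect_limit; };

struct raid1_data_old {                 /* current layout */
        struct md_dev *mddev;
        struct mirror_info mirrors[MD_SB_DISKS];
};

struct raid1_data_new {                 /* patched layout */
        struct mirror_info mirrors[MD_SB_DISKS];
        struct md_dev *mddev;
};

int main(void)
{
        printf("mirrors offset, old layout: %d\n",
               (int) offsetof(struct raid1_data_old, mirrors));
        printf("mirrors offset, new layout: %d\n",
               (int) offsetof(struct raid1_data_new, mirrors));
        return 0;
}
--------------------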
James
"looking forwarding to mingo laying the proverbial smack down"
Manning
--
Miscellaneous Engineer --- IBM Netfinity Performance Development