Is the raid1readbalance patch production ready?

2000-07-21 Thread Malcolm Beattie

Is the raid1readbalance-2.2.15-B2 patch (when applied against a
2.2.16+linux-2.2.16-raid-B2 kernel) rock-solid and production
quality? Can I trust 750GB of users' email to it? Is it guaranteed
to behave the same way in failure modes as the unpatched RAID code
does? Is anyone using it heavily in a production system?
(Not that I expect any other answer except maybe a resounding
"probably" :-)

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



Re: speed and scaling

2000-07-13 Thread Malcolm Beattie

Seth Vidal writes:
>  I have an odd question. Where I work we will, in the next year, be in a
> position to have to process about a terabyte or more of data. The data is
> probably going to be shipped on tapes to us but then it needs to be read
> from disks and analyzed. The process is segmentable so it's reasonable to
> be able to break it down into 2-4 sections for processing so arguably only
> 500gb per machine will be needed. I'd like to get the fastest possible
> access rates from a single machine to the data. Ideally 90MB/s+
> 
> So we're considering the following:
> 
> Dual Processor P3 something.
> ~1gb ram.
> multiple 75gb ultra 160 drives - probably ibm's 10krpm drives
> Adaptec's best 160 controller that is supported by linux. 
> 
> The data does not have to be redundant or stable - since it can be
> restored from tape at almost any time.
> 
> so I'd like to put this in a software raid 0 array for the speed.
> 
> So my questions are these:
>  Is 90MB/s a reasonable speed to be able to achieve in a raid0 array
> across say 5-8 drives?
> What controllers/drives should I be looking at?

Here are actual benchmarks from one of my systems.

dbench:

2 Throughput 123.637 MB/sec (NB=154.546 MB/sec  1236.37 MBit/sec)
4 Throughput 109.7 MB/sec (NB=137.126 MB/sec  1097 MBit/sec)
32 Throughput 77.7743 MB/sec (NB=97.2178 MB/sec  777.743 MBit/sec)
64 Throughput 64.3793 MB/sec (NB=80.4741 MB/sec  643.793 MBit/sec)


Bonnie:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         2000  9585 99.1 51312 26.0 28675 45.3  9224 94.8 81720 73.3 512.2  4.2

That is with a Dell 4400, 2 x 600 MHz Pentium III Coppermine CPUs
with 256K cache, 1GB RAM, one 64-bit 66MHz PCI bus (and one 33MHz
PCI bus). The disk subsystem is the built-in dual-channel Ultra160
Adaptec 7899 with 8 Quantum Atlas IV 9GB SCA disks attached to one
channel. The benchmarks above were done on an ext2 filesystem with
4KB blocksize and stride 16 (64KB chunk / 4KB block) created on a
7-way stripe of the above disks using software RAID (0.90 on kernel
2.2.x) with a 64KB chunksize.
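
For reference, the RAID and filesystem setup was along these lines
(abbreviated; the device names are only examples and the real raidtab
lists all seven partitions):

-- cut here --
# /etc/raidtab: 7-way RAID0 stripe with 64KB chunks
raiddev /dev/md0
    raid-level              0
    nr-raid-disks           7
    persistent-superblock   1
    chunk-size              64
    device                  /dev/sdb1
    raid-disk               0
    device                  /dev/sdc1
    raid-disk               1
    # ...and so on for the remaining five disks

# ext2 with 4KB blocks; stride = chunk-size / block-size = 64KB / 4KB = 16
mke2fs -b 4096 -R stride=16 /dev/md0
-- cut here --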

If you use both SCSI channels and use 36GB disks (let alone 75GB
ones), you'll get 480GB of disk without even needing to plug another
SCSI card in. With larger disks or another SCSI card or two you could
go larger/faster.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



Bug in 2.2.14 + raid-2.2.14-B1

2000-04-14 Thread Malcolm Beattie

I reported this bug to linux-raid on March 27 and to linux-kernel
a week later and had zero responses from either. In case my previous
message was too long, here it is again in brief.

Kernel 2.2.14 + raid-2.2.14-B1 as shipped with Red Hat 6.x.
RAID5 across multiple SCSI disks.
Spin down one disk with ioctl SCSI_IOCTL_STOP_UNIT to simulate error.
Kernel logs

md: bug in file raid5.c, line 659

   **
   *  *
   **

followed by complete lock up of all activity on /dev/md0, including
any attempt to do raidhot{add,remove}.

*Please* can someone comment/help?

The reason I am using a disk spin down to simulate failure is that
echo "scsi remove-single-device 0 0 1 0" > /proc/scsi/scsi
doesn't work for me with kernel 2.2. The underlying write gives EBUSY
which the kernel source says means the disk is busy. This worked fine
for me (along with add-single-device) on kernel 2.0 with RAID 0.90.

*Please* can someone help, even if only by saying
"scsi remove-single-device works fine for me with 2.2" or "no it
doesn't work for me either but I don't care"?

This problem is preventing the upgrade to 2.2 of a number of Linux
servers and has meant that I've had to bring a new large server into
service without the benefit of RAID (since it needs kernel 2.2 for
other reasons).

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



raid-2.2.14-B1 reconstruction bug and problems

2000-03-27 Thread Malcolm Beattie

I've been using RAID 0.90 with the 2.0 kernel on a bunch of production
boxes (RAID5) and the disk failure handling and reconstruction has
worked fine, both in tests and (once) in real life when a disk failed.
I'm now trying 2.2.14 + raid-2.2.14-B1 (as shipped in the Red Hat 6.x
kernel) and have come across both a problem with testing disk failure
and also an apparent bug in RAID error handling:

-- cut here --
SCSI disk error : host 0 channel 0 id 8 lun 0 return code = 2802 
[valid=0] Info fld=0x0, Current sd08:61: sense key Not Ready 
Additional sense indicates Logical unit not ready, initializing command required 
scsidisk I/O error: dev 08:61, sector 265176 
md: bug in file raid5.c, line 659 
  
   ** 
   *  * 
   ** 
-- cut here --

followed by a detailed dump of the RAID superblock information. After
that, any commands (including raidhotremove/raidhotadd) which try to
touch the RAID array hang in uninterruptible sleep and so do any
processes which were accessing the RAID filesystem at the time of the
failure. The above was triggered by my simulation of a disk failure
which I did by spinning the disk down with the SCSI_IOCTL_STOP_UNIT
ioctl.
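
(In case anyone wants to reproduce this, the spin-down is just a few
lines of C along the lines of the sketch below; the device name is
only an example and you need to run it as root.)

-- cut here --
/* spindown.c: ask a SCSI disk to spin down so that subsequent I/O
 * to it fails. Build with "gcc -o spindown spindown.c". */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/scsi_ioctl.h>    /* for SCSI_IOCTL_STOP_UNIT */

int main(int argc, char **argv)
{
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        perror(argv[1]);
        return 1;
    }
    /* Send a STOP UNIT to the drive; the next real I/O should then
     * fail and the md layer ought to notice the disk has gone bad. */
    if (ioctl(fd, SCSI_IOCTL_STOP_UNIT, NULL) < 0) {
        perror("SCSI_IOCTL_STOP_UNIT");
        return 1;
    }
    close(fd);
    return 0;
}
-- cut here --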

That leads to the second problem: the reason I used that method of
simulating a disk failure was that the old method:
echo "scsi remove-single-device 0 0 3 0" > /proc/scsi/scsi
has stopped working with kernel 2.2. strace shows that the write()
returns with errno EBUSY. linux/drivers/scsi/scsi.c shows that this
is because the access_count of Scsi_Device structure is non-zero.
Looking at the equivalent 2.0 source doesn't seem to show any semantic
changes and yet the same command under 2.0 works fine. Please can
anyone help? Otherwise this server is going to have to run without
the added reliability of RAID5, which would be disappointing.

As an act of desperation I even wrote a little kernel module to change
the access_count back to zero and then ran the
"...remove-single-device...". This time, the device did get removed
properly, RAID noticed the removal and went properly into degraded
mode. Unfortunately, once again, all processes accessing the RAID
filesystem and then any raidhotadd/raidhotremove/umount commands all
hung in uninterruptible state. Nothing in this mailing list or
anywhere else I can find with web searches seems to have had this
problem so I'm at a loss what to do. Any help would be gratefully
received. In case it matters, this is on an SMP system (2 CPUs) and
the disks are all SCSI disks on a bus with an Adaptec 7899 adapter,
using the aic7xxx driver 5.1.72/3.2.4. In case anyone wants the kernel
module to alter a SCSI device access_count, here it is:

-- cut here --
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include "/usr/src/linux/drivers/scsi/scsi.h"
#include "/usr/src/linux/drivers/scsi/hosts.h"

static int host = 0;
static int channel = 0;
static int id = 0;
static int lun = 0;
static int delta = 0;

MODULE_PARM(host, "i");
MODULE_PARM(channel, "i");
MODULE_PARM(id, "i");
MODULE_PARM(lun, "i");
MODULE_PARM(delta, "i");

int init_module(void)
{
    struct Scsi_Host *hba;
    Scsi_Device *scd;

    printk("scsiaccesscount starting\n");

    /* Find the HBA with the requested host number */
    for (hba = scsi_hostlist; hba; hba = hba->next)
        if (hba->host_no == host)
            break;

    if (!hba)
        return -ENODEV;

    /* Find the device with the requested channel/id/lun on that HBA */
    for (scd = hba->host_queue; scd; scd = scd->next)
        if (scd->channel == channel && scd->id == id && scd->lun == lun)
            break;

    if (!scd)
        return -ENODEV;

    printk("access_count is %d\n", scd->access_count);
    if (delta) {
        scd->access_count += delta;
        printk("changed access_count to %d\n", scd->access_count);
    }

    /* Fail the load on purpose so the module never stays resident;
       the printk output above is all we want. */
    return -EIO;
}
-- cut here --

Use it as
insmod scsiaccesscount.o host=0 channel=0 id=3 lun=0
to show the access count for ID 3 on host 0, channel 0, and
insmod scsiaccesscount.o host=0 channel=0 id=3 lun=0 delta=-1
to subtract one from the access_count. Obviously this is just for
debugging and may not be safe to do at all (and indeed wasn't in my
case).

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



raid5 checksumming chooses wrong function

2000-03-14 Thread Malcolm Beattie

When booting my new Dell 4400, pre-installed with Red Hat 6.1, the
raid5 checksumming function it chooses is not the fastest. I get:

raid5: measuring checksumming speed
raid5: KNI detected,...
   pIII_kni: 1078.611 MB/sec
raid5: MMX detected,...
   pII_mmx : 1304.925
   p5_mmx  : 1381.125
   8 regs  : 1029.081
   32 regs :  584.073
using fastest function: pIII_kni (1078.611 MB/sec)

Is there a good reason for it choosing pIII_kni (in which case the
wording of the message "fastest" needs changing) or is it a bug? If
no one else sees this, I'll dig in and see if I can fix it: maybe
it's because the two sets of function lists are dependent on
particular hardware (first for KNI, then for MMX) and something isn't
getting zeroed or set to the max in between.

Benchmarking it on a stripeset of 7 x 9GB disks on an Ultra3 bus on
one of the Adaptec 7899 channels, it's impressively fast: 81MB/s block
reads and 512 seeks/s in bonnie and 50MB/s (500 "netbench" Mbits/sec)
running dbench with 128 threads. I've done tiotest runs too and I'll
be doing more benchmarks on RAID5 soon. If anyone wants me to post
figures, I'll do so.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



Re: Is Raid Stable?

1999-10-08 Thread Malcolm Beattie

[EMAIL PROTECTED] writes:
> For more direct results, I've got a raid 1 on my web server, a raid 0 
> in test (stupid jedi mind trick - what happens with a 2gig and 200mb
> stripe set), and I hope to try raid 5 real soon now.

I'm using software RAID5 on our IMAP server mailstore cluster: 6 nodes
each with 6 x 9GB 10KRPM disks installed in 3 hot-swap hot-everything
Sun D1000 arrays. It's up to 8300 users now and will be tripling over
the next year or two. No problems. I also use software RAID5 (5 x 9GB
10K RPM disks in a Sun hot-swap multipack) for our web server
(2 million hits/week, 90+ main sites, thousands of small user sites).
No problems. I also use it on our mirror server (similar hardware).
No problems. I trust it.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



quotacheck with -DEXT2_DIRECT broken with RAID

1999-05-10 Thread Malcolm Beattie

I have some systems using Red Hat 5.2 plus raid0145-19981215-2.0.36
with raidtools-19981214-0.90.tar.gz. On each I have six disks
configured as a single RAID5 device and with an ext2 filesystem with
4K blocksize and -Rstride=16 (since the raid chunk-size is 64).

On running quotacheck (from the quota-1.55-9 RPM from Red Hat 5.2), it
writes complete random trash into the "amount used" fields of the
quota file: it's clearly picking up not only random block data but
also random uid data. This caused people to appear over-quota. Oops.
Recompiling quotacheck from source with -DEXT2_DIRECT turned off makes
it work fine. (In fact, it's nice and fast despite the fact that it
has to walk the filesystem tree itself.) It looks as though libext2fs
(or possibly the way quotacheck calls it) is broken when used with one
or more of the following:
(1) any RAID device
(2) any ext2 filesystem with 4K blocks
(3) any ext2 filesystem with -Rstride=n

I think (1) is unlikely and that it's probably (2) or (3) (but RAID users
are the most likely to tweak those). Has anyone else run into this or
can anyone else please try to reproduce it? Once I know exactly what's
causing it I'll contact the quota tools maintainer and/or Red Hat.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



Re: benefits of journaling for soft RAID ?

1999-02-12 Thread Malcolm Beattie

Matti Aarnio writes:
>   Take another; a popular public FTP site with storage
>   consisting of multiple 30-50 GB RAID5 filesystems.
>   Total capacity around 300 GB.  It crashes for some
>   reason, when will it be online ?  (That is a system
>   where a slow day is 50 GB worth of anonymous FTP
>   traffic.)
> 
>   That FTP site happens to run journaled filesystem,
>   it will be back online in 5 minutes.  (Or then the
>   problem requires hardware service which takes more
>   time..)

The ftp site running a non-journalled ext2 filesystem will be back
in about 13 minutes for multiple 50 GB RAID5 arrays which are nicely
parallelised. Please do some benchmarks, like I have, before just
assuming that fsck takes forever on largish filesystems.
Yes, 5 minutes is faster than 13 minutes but the Solaris boxes I have
which *do* have journalled filesystems still take significantly
longer to boot than my Linux boxes because the rest of the boot is
horribly slow. Solaris seems to take forever prodding its various bits
of hardware before it gets around to booting properly.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



Re: fsck performance on large RAID arrays ?

1999-02-10 Thread Malcolm Beattie

Richard Jones writes:
> Malcolm Beattie wrote:
> > 
> > 
> > Mounting mine when clean takes 4 seconds. I wonder if you used a 1k
> > block size for your filesystem. That greatly increases the time to
> > check the bitmaps upon mounting (though you can turn this off with
> > mount -o check=none). It also greatly decreases the performance of
> > the filesystem.
> 
> Quite probably, if that was the default. Can you
> point me to any other things I should be changing
> (eg. stripe size, in particular). Given the myriad
> different possibilities and very limited time, I
> didn't experiment to choose optimal block size or
> stripe size.
> 
> I don't have the bonnie benchmarks to hand, although
> they were quite acceptable - bottom line was 13 MBytes/sec
> throughput IIRC. I posted them on linux-raid before,
> so you should be able to dig them up from the archives.

The only archive of this list that AltaVista found me was a local one
which didn't go back far enough. I suggest you look at your "Random
Seeks" figure in bonnie, not the bandwidth figures. I would suggest
recreating the whole thing from scratch with
chunk-size  64
in your /etc/raidtab. Then use
mke2fs -b 4096 -R stride=16 /dev/md0
to create the filesystem (and wait until /proc/mdstat shows the
rebuild has finished). Then try bonnie again and see if the "Random
Seeks" figure has improved. Then try putting on lots of data and
testing fsck times. Oh, and since you're not using SCSI disks, check
that "hdparm /dev/hda" (and the other disks) shows you have using_dma
and unmaskirq set to 1.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



Re: fsck performance on large RAID arrays ?

1999-02-09 Thread Malcolm Beattie

Richard Jones writes:
> Benno Senoner wrote:
> > 
> > Hi,
> > does anyone of you know how long it takes,
> > to e2fsck (after an unclean shutdown) for example a soft-raid5 array of
> > a total size of about 40-50 GB
> > ( example : 6 disk with 9GB  (UW SCSI) )
> > assume the machine is a PII300 - PII400
> > 
> > assume that the raid-array is almost filled with data (so that e2fsck
> > takes longer)
> 
> This is affected by so many different factors,
> that it's really impossible for me to give an
> estimate for your machine. 

Indeed.

>However, as a guide,
> our machine was:
> 
>   P-II 233 MHz
>   256 MB RAM
>   6 * UltraDMA drives with measured throughput
>   of 16 MBytes/sec
>   RAID space: 42 GB after formatting
> 
> with the drive about 20% full we had fsck times of
> 20 mins and 33% full of about 30 mins. 

The primary differences between yours and mine (which took 13 minutes
to fsck when 70% full) are (1) mine used SCSI, which has tagged
queuing, and (2) mine had 10K RPM disks, which also improve seek
times. It looks as though Benno's resembles mine a bit more closely.

>In both cases,
> mounting a clean filesystem took about 2 mins.

Mounting mine when clean takes 4 seconds. I wonder if you used a 1k
block size for your filesystem. That greatly increases the time to
check the bitmaps upon mounting (though you can turn this off with
mount -o check=none). It also greatly decreases the performance of
the filesystem.

> Which just goes to show that ext2 is not a suitable
> filesystem for large disk arrays. 

Not at all. You just have to be more careful with choosing hardware
and tuning software with larger arrays. As a matter of interest,
what figures does bonnie give on your array? I would guess that the
fsck time is probably mostly affected by random seeks. I get 282/sec.

>   Roll on journalling
> in 2.3, I say!

That will be nice but you can use ext2 as-is on large systems if care
is taken at the design/tuning stage.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



Re: fsck performance on large RAID arrays ?

1999-02-09 Thread Malcolm Beattie

Benno Senoner writes:
> does anyone of you know how long it takes,
> to e2fsck (after an unclean shutdown) for example a soft-raid5 array of
> a total size of about 40-50 GB
> ( example : 6 disk with 9GB  (UW SCSI) )
> assume the machine is a PII300 - PII400
> 
> assume that the raid-array is almost filled with data (so that e2fsck
> takes longer)
> 
> 
> can times go up to 1h ?

I did some benchmarks last Wednesday for a thread on the server-linux
mailing list. Here's a cut-and-paste job:

Hardware: 350 MHz Pentium II PC, 512 MB RAM, BT958D SCSI adapter.
  Sun D1000 disk array with 6 x 9 GB 10K RPM disks.
Software: Linux 2.0.36 + latest RAID patch.
Filesystem configured as a single 43 GB RAID5 ext2 filesystem with
4k blocks and 64k RAID5 chunk-size.

I created 25 subdirectories on the filesystem and in each untarred
four copies of the Linux 2.2.1 source tree (each is ~4000 files
totalling 63 MB untarred).

fsck took 8 minutes.

Then I added 100 subdirectories in each of those subdirectories and
into each of those directories put five 1MB files. (The server is
actually going to be an IMAP server and this mimics half-load quite
well). The result is 18 GB used on the filesystem.

fsck took 10.5 minutes.

Then I added another 100 subdirectories in each of the 25 directories
and put another five 1MB files in each of those. The result is 30 GB
used on the filesystem.

fsck took 13 minutes.

The important points are probably that (a) the disks are 10K RPM
which helps random I/O and (b) the filesystem block size is 4k. Don't
even think about using a 1k block size on a large filesystem (unless
you have a really weird environment).

> Is it very unsafe to remove fsck at boot ?

Yes. You might as well leave it in since if the filesystem was
unmounted cleanly then fsck doesn't bother checking it fully and
continues straight on.

> what about checking /proc/mdstat at boot time and then determining if
> e2fsck should be run or not ?
> In theory if the array was shut down cleanly , the filesystem should be
> in a consistent status.
> please correct me if I am wrong.

That's wrong. The consistency of the array and the consistency of the
filesystem on it are two independent issues.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



Re: RAID1 on / for RH 5.2

1999-01-07 Thread Malcolm Beattie

Mike Brodbelt writes:
> Get rid of the raidtools that shipped with RedHat. Then remove all reference to
> RAID in /etc/rc.d/rc.sysinit and /etc/rc.d/init.d/halt. Then get hold of the
> raidtools 0.90 source, and compile that. Kernel - 2.0.38 doesn't exist :-). I
> suggest using a clean 2.0.36 source tree, and patching that for RAID support.
> Don't use the kernel source provided with RH 5.2, it's a 2.0.36 pre release,
> with lots of weird patches added. 

The original kernel RPMs for Red Hat 5.2 (kernel-*-2.0.36-0.7.*.rpm)
were indeed a pre-release 2.0.36 kernel. However, the update RPMs which
are now available for 5.2 (kernel-*-2.0.36-1.*.rpm) are real 2.0.36
kernels. The only thing to be careful about (well, the only thing that
bit me anyway) is that you need to do a "make mrproper" to get rid of
the foo.ver files in include/linux/modules, otherwise the stale entries
for the md_foo functions in md.ver clash with the new ones which go in
ksyms.ver for recent RAID patches. After installing the kernel-source
RPM and patching with the RAID patch (the patch hunk to defconfig is
rejected but it's benign: just apply it by hand), do the make mrproper,
make config, make dep and make zImage. Then do make modules;
make modules_install. Pick a name for your new kernel version
(say -foo) and do
mv /lib/modules/2.0.36{,-foo}
cd /usr/src/linux
strings vmlinux | grep 'Linux version' > /lib/modules/2.0.36-foo/.rhkmvtag
cp arch/i386/boot/zImage /boot/vmlinuz-2.0.36-foo
cp System.map /boot/System.map-2.0.36-foo
cp /boot/module-info{,-2.0.36-foo}
then add an entry to /etc/lilo.conf and rerun lilo. If you stick with
the Red Hat way of doing kernels then the multiple versions of the
kernel will coexist more nicely for modules and such like.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



Some hw v. sw RAID benchmarks

1999-01-07 Thread Malcolm Beattie

Here for your edification and amusement are some benchmarks comparing
hardware v. software RAID for fairly similar setups.

Sun sell two versions of their 12-disk hot-swap dual-everything disk
array (codename Dilbert):
 * the D1000 is a "dumb" array presenting 6 disks on each of two
   Ultra Wide Differential SCSI busses.
 * the A1000 is similar but has an internal hardware RAID module
   which connects to the two busses internally, does its "RAID thing"
   and presents a single Ultra Wide Differential bus to the outside
   world and talks to an intelligent adapter card on the host side.

We have the following configurations which I benchmarked using bonnie:

System 1: A1000 array with 6 x 10K RPM 4GB wide SCSI drives and 64MB
  NVRAM cache connected to a Sun Ultra 5 with a 270 MHz
  UltraSPARC IIi CPU and 320 MB RAM running Solaris 2.6 via a
  Symbios 53C875-based card.
System 2: D1000 array with 6 x 10K RPM 9GB wide SCSI drives on one
  of its two busses connected to a PC with a 350 MHz PII CPU
  and 512 MB RAM running Linux 2.0.36 with the
  raid-19981214-0.90 RAID patch.

Both systems were set up as a single 6 disk RAID5 group. System 1 had
a standard Solaris UFS filesystem on the resulting 20GB logical drive.
System 2 used chunk-size 64 for its RAID5 configuration (defaults for
other settings) and a single ext2 filesystem (with blocksize 4096 and
stride=16). Bonnie was run on both as the only non-idle process on a
1000 MB file.

               System 1                System 2
Seq output
----------
  per char     7268 K/s @ 66.7% CPU    5104 K/s @ 88.6% CPU
  block       12850 K/s @ 31.9% CPU   12922 K/s @ 16.4% CPU
  rewrite      8221 K/s @ 45.1% CPU    5973 K/s @ 16.9% CPU

Seq input
---------
  per char     8275 K/s @ 99.2% CPU    5058 K/s @ 96.1% CPU
  block       21856 K/s @ 46.4% CPU   13080 K/s @ 15.2% CPU

Random Seeks   293.0 /s @  8.7% CPU    282.3 /s @  5.7% CPU

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



RAID0 striping two RAID5 arrays

1998-11-11 Thread Malcolm Beattie

I'm setting up an ftp/mirror server out of a PC with a 4GB EIDE disk,
three 6GB EIDE disks taken from other PCs that don't need them and an
old SCSI array of 4 x 4GB disks. Rather than mess around with
allocating 7 or 8 different filesystems to different parts of the ftp
hierarchy, I want to raid them into one large blob. What seems to me
an obvious way to get reasonable performance and reliability is to
make the 4 x 4GB disks into a RAID5 array
(resulting in 12 GB of visible store), make the 3 x 6GB disks into a
separate RAID5 array (another 12GB) and then RAID0 stripe the two
12GB md devices into one big 24GB one. Is doing RAID0 over RAID5 a
possible/reasonable thing to do with the latest md/raidtools? Need I
choose any non-default chunk sizes or suchlike to tune things better?
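
What I have in mind is roughly the raidtab below (device names made
up, most of the device/raid-disk pairs elided), assuming the md
devices can themselves be listed as components of another array:

-- cut here --
# 4 x 4GB SCSI disks -> 12GB visible
raiddev /dev/md0
    raid-level              5
    nr-raid-disks           4
    persistent-superblock   1
    chunk-size              64
    device                  /dev/sda1
    raid-disk               0
    # ...sdb1, sdc1, sdd1 likewise

# 3 x 6GB EIDE disks -> 12GB visible
raiddev /dev/md1
    raid-level              5
    nr-raid-disks           3
    persistent-superblock   1
    chunk-size              64
    device                  /dev/hdb1
    raid-disk               0
    # ...hdc1, hdd1 likewise

# RAID0 stripe of the two arrays -> one 24GB device
raiddev /dev/md2
    raid-level              0
    nr-raid-disks           2
    persistent-superblock   1
    chunk-size              64
    device                  /dev/md0
    raid-disk               0
    device                  /dev/md1
    raid-disk               1
-- cut here --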

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services



Re: Is this possible/feasible

1998-10-20 Thread Malcolm Beattie

Stephen C. Tweedie writes:
> Hi,
> 
> On Sun, 18 Oct 1998 15:55:35 +0200 (CEST), MOLNAR Ingo
> <[EMAIL PROTECTED]> said:
> 
> > On Sun, 18 Oct 1998, Tod Detre wrote:
> 
> >> in 2.1 kernels you can mak nfs a block device.  raid can work with block
> >> devices so if you raid5 several nfs computers one can go down, but you
> >> still can go on. 
> 
> > you probably want to use Stephen Tweedie's NBD (Network Block Device),
> 
> Heh, thanks, but the credit is Pavel Machek's.  I've just been testing
> and bug-fixing it.
> 
> > which works over TCP and is such more reliable and works over bigger
> > distance and larger dropped packets range. You can even have 5 disks on 5
> > continents put together into a big RAID5 array. (ment to survive a
> > meteorite up to the size of a few 10 miles ;) and you can loopback it
> > through a crypt^H^H^H^H^Hcompression module too before sending it out to
> > the net. 
> 
> Of course, you'll need to manually reconstruct the raid array as
> appropriate, and you don't get raid autostart on a networked block
> device either.  However, it ought to be fun to watch, and I'm hoping we
> can integrate this method of operation into some of the clustering
> technology now appearing on Linux to do failover of NFS services if one
> of the networked raid hosts dies.  Just remount the raid on another
> machine using the surviving networked disks, remount ext2fs and migrate
> the NFS server's IP address: voila!

There's a way which should give better performance in the general case
that I think I've mentioned on this mailing list before. It avoids the
overhead of a synchronous NBD since when you're migrating a disk to a
new system, there's no constraint that the remote system have up to
date data all the time. It's a combination of a little kernel driver
called breq and a simple user-mode program. The basic idea is to add
a few lines to make_request in ll_rw_blk.c in "case WRITE". When breq
is turned on for a particular device (done by an ioctl on /dev/breq
which appears as a character device to user-land), the block number of
the request is simply written to a 4K ring buffer. That's the only
kernel patch needed. The breq device driver module sucks out the data
from the ring buffer and feeds it to the reader.

To do a filesystem migration, there's a bmigrate user-land program
which effectively has two independent threads (actually it uses
select() but it's easier to think of as threads). You start with a
bitmap with one bit for each block on the device you're migrating,
setting the bitmap to all 1s. You make a TCP connection to a daemon
on the new system (described below). One thread of bmigrate does

while (1) {
blocknum_t n; /* 32 bits */
read(breq_fd, &n, 4);
bitmap[n] = 1; /* mark block n dirty */
}

The other thread does

while (n = find_first_set_bit(bitmap)) {
struct { int n; char data[512]; } binfo;
bitmap[n] = 0; /* mark it clean */
lseek(raw_device_fd, (off_t)n * 512, SEEK_SET);
read(raw_device_fd, &binfo.data, sizeof(binfo.data));
binfo.n = n;
write(remote_socket, &binfo, sizeof(binfo));
}

The daemon on the other end of the connection just does

while (read(client_socket, &binfo, sizeof(binfo))) {
lseek(raw_device_fd, binfo.n * 512, SEEK_SET);
write(raw_device_fd, &binfo.data, 512);
}

This is all completely asynchronous to the migrating filesystem so
it's not as slow as a network block device. Now, gradually the
bitmap gets cleared as the migrater writes across the data and
catches up with ongoing write activity. Eventually, there are only
a "few" bits set. At that time, you take down the RAID device, let
the migrater finish sending the last few blocks, then bring up the
new system on the same IP number with its newly migrated data.

The code for bmigrate is sitting on my PC at home and I haven't
quite had the time to do the breq thing properly yet. I've not quite
figured out what context make_request runs in and how to synchronise
writing to the ring buffer with the ioctl code to shut it off.
Does make_request get called from interrupts or bottom halves?
What's the new-fangled SMP-safe way to do such locking in a way that
make_request doesn't have to get a slow lock every time it wants to
write data?

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services