Is the raid1readbalance patch production ready?
Is the raid1readbalance-2.2.15-B2 patch (when applied against a 2.2.16+linux-2.2.16-raid-B2 kernel) rock-solid and production quality? Can I trust 750GB of users' email to it? Is it guaranteed to behave the same during failure modes that the non-patched RAID code does? Is anyone using it heavily in a production system? (Not that I expect any other answer except maybe for a resounding "probably" :-) --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
Re: speed and scaling
Seth Vidal writes:
> I have an odd question. Where I work we will, in the next year, be in a
> position to have to process about a terabyte or more of data. The data is
> probably going to be shipped on tapes to us but then it needs to be read
> from disks and analyzed. The process is segmentable so its reasonable to
> be able to break it down into 2-4 sections for processing so arguably only
> 500gb per machine will be needed. I'd like to get the fastest possible
> access rates from a single machine to the data. Ideally 90MB/s+
>
> So were considering the following:
>
> Dual Processor P3 something.
> ~1gb ram.
> multiple 75gb ultra 160 drives - probably ibm's 10krpm drives
> Adaptec's best 160 controller that is supported by linux.
>
> The data does not have to be redundant or stable - since it can be
> restored from tape at almost any time.
>
> so I'd like to put this in a software raid 0 array for the speed.
>
> So my questions are these:
> Is 90MB/s a reasonable speed to be able to achieve in a raid0 array
> across say 5-8 drives?
> What controllers/drives should I be looking at?

Here are actual benchmarks from one of my systems.

dbench:
 2 Throughput 123.637 MB/sec (NB=154.546 MB/sec 1236.37 MBit/sec)
 4 Throughput 109.7 MB/sec (NB=137.126 MB/sec 1097 MBit/sec)
32 Throughput 77.7743 MB/sec (NB=97.2178 MB/sec 777.743 MBit/sec)
64 Throughput 64.3793 MB/sec (NB=80.4741 MB/sec 643.793 MBit/sec)

Bonnie:
            -------Sequential Output-------- ---Sequential Input-- --Random--
            -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine  MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
       2000  9585 99.1 51312 26.0 28675 45.3  9224 94.8 81720 73.3 512.2  4.2

That is with a Dell 4400, 2 x 600 MHz Pentium III Coppermine CPUs with 256K cache, 1GB RAM, one 64-bit 66MHz PCI bus (and one 33MHz PCI bus). The disk subsystem is a built-in Adaptec 7899 (dual 160MB/sec channels) with 8 Quantum ATLAS IV 9GB SCA disks attached to one channel.
The benchmarks above were done on an ext2 filesystem with 4KB blocksize and stride 16 created on a 7-way stripe of the above disks using software RAID (0.9 on kernel 2.2.x) with 64KB chunksize. If you use both SCSI channels and use 36GB disks (let alone 75GB ones), you'll get 480GB of disk without even needing to plug another SCSI card in. With larger disks or another SCSI card or two you could go larger/faster. --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
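Not from the original thread, but a back-of-envelope sketch of the 90MB/s question: if RAID0 reads scaled linearly with drive count (they won't quite, once the bus and controller saturate), the number of drives needed is just a ceiling division. The per-drive rates in the comment are illustrative assumptions, not measurements.

```c
/* Naive RAID0 sizing: assumes sequential read throughput scales
 * linearly with the number of striped drives, ignoring bus and
 * controller limits.  per_drive_mb is one disk's measured rate. */
static int drives_needed(int target_mb, int per_drive_mb)
{
    return (target_mb + per_drive_mb - 1) / per_drive_mb; /* ceiling */
}

/* e.g. assuming ~18MB/s per 10K RPM disk, drives_needed(90, 18) is 5,
 * so the 5-8 drive range asked about above is at least plausible. */
```

The interesting limit in practice is the shared bus: eight drives at 18MB/s each already exceed a single SCSI channel, which is why splitting the disks across both 7899 channels matters.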
Bug in 2.2.14 + raid-2.2.14-B1
I reported this bug to linux-raid on March 27 and to linux-kernel a week later and had zero responses from either. In case my previous message was too long, here it is again in brief.

Kernel 2.2.14 + raid-2.2.14-B1 as shipped with Red Hat 6.x. RAID5 across multiple SCSI disks. Spin down one disk with ioctl SCSI_IOCTL_STOP_UNIT to simulate an error. The kernel logs

    md: bug in file raid5.c, line 659

followed by a complete lock up of all activity on /dev/md0, including any attempt to do raidhot{add,remove}. *Please* can someone comment/help?

The reason I am using a disk spin down to simulate failure is that

    echo "scsi remove-single-device 0 0 1 0" > /proc/scsi/scsi

doesn't work for me with kernel 2.2. The underlying write gives EBUSY, which the kernel source says means the disk is busy. This worked fine for me (along with add-single-device) on kernel 2.0 with RAID 0.90.

*Please* can someone help, even if only by saying "scsi remove-single-device works fine for me with 2.2" or "no, it doesn't work for me either but I don't care"? This problem is preventing the upgrade to 2.2 of a number of Linux servers and has meant that I've had to bring a new large server into service without the benefit of RAID (since it needs kernel 2.2 for other reasons). --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
raid-2.2.14-B1 reconstruction bug and problems
I've been using RAID 0.90 with the 2.0 kernel on a bunch of production boxes (RAID5) and the disk failure handling and reconstruction has worked fine, both in tests and (once) in real life when a disk failed. I'm now trying 2.2.14 + raid-2.2.14-B1 (as shipped in the Red Hat 6.x kernel) and have come across both a problem with testing disk failure and also an apparent bug in RAID error handling:

-- cut here --
SCSI disk error : host 0 channel 0 id 8 lun 0 return code = 2802
[valid=0] Info fld=0x0, Current sd08:61: sense key Not Ready
Additional sense indicates Logical unit not ready, initializing command required
scsidisk I/O error: dev 08:61, sector 265176
md: bug in file raid5.c, line 659
-- cut here --

followed by a detailed dump of the RAID superblock information. After that, any commands (including raidhotremove/raidhotadd) which try to touch the RAID array hang in uninterruptible sleep, and so do any processes which were accessing the RAID filesystem at the time of the failure.

The above was triggered by my simulation of a disk failure, which I did by spinning the disk down with the SCSI_IOCTL_STOP_UNIT ioctl. That leads to the second problem: the reason I used that method of simulating a disk failure was that the old method:

    echo "scsi remove-single-device 0 0 3 0" > /proc/scsi/scsi

has stopped working with kernel 2.2. strace shows that the write() returns with errno EBUSY. linux/drivers/scsi/scsi.c shows that this is because the access_count of the Scsi_Device structure is non-zero. Looking at the equivalent 2.0 source doesn't seem to show any semantic changes, and yet the same command under 2.0 works fine. Please can anyone help, otherwise this server is going to have to run without the added reliability of RAID5, which would be disappointing? As an act of desperation I even wrote a little kernel module to change the access_count back to zero and then ran the "...remove-single-device...".
This time, the device did get removed properly, RAID noticed the removal and went properly into degraded mode. Unfortunately, once again, all processes accessing the RAID filesystem, and then any raidhotadd/raidhotremove/umount commands, hung in uninterruptible state. Nothing in this mailing list or anywhere else I can find with web searches seems to have had this problem so I'm at a loss what to do. Any help would be gratefully received. In case it matters, this is on an SMP system (2 CPUs) and the disks are all SCSI disks on a bus with an Adaptec 7899 adapter, using the aic7xxx driver 5.1.72/3.2.4.

In case anyone wants the kernel module to alter a SCSI device access_count, here it is:

-- cut here --
/* The four angle-bracket #include lines were lost from the archived
 * post; these are assumed candidates for a 2.2-era module. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/sched.h>
#include "/usr/src/linux/drivers/scsi/scsi.h"
#include "/usr/src/linux/drivers/scsi/hosts.h"

static int host = 0;
static int channel = 0;
static int id = 0;
static int lun = 0;
static int delta = 0;

MODULE_PARM(host, "i");
MODULE_PARM(channel, "i");
MODULE_PARM(id, "i");
MODULE_PARM(lun, "i");
MODULE_PARM(delta, "i");

int init_module(void)
{
    struct Scsi_Host *hba;
    Scsi_Device *scd;

    printk("scsiaccesscount starting\n");
    for (hba = scsi_hostlist; hba; hba = hba->next)
        if (hba->host_no == host)
            break;
    if (!hba)
        return -ENODEV;
    for (scd = hba->host_queue; scd; scd = scd->next)
        if (scd->channel == channel && scd->id == id && scd->lun == lun)
            break;
    if (!scd)
        return -ENODEV;
    printk("access_count is %d\n", scd->access_count);
    if (delta) {
        scd->access_count += delta;
        printk("changed access_count to %d\n", scd->access_count);
    }
    return -EIO;
}
-- cut here --

Use it as

    insmod scsiaccesscount.o host=0 channel=0 id=3 lun=0

to show the access count for ID 3 on bus 0 channel 0, and

    insmod scsiaccesscount.o host=0 channel=0 id=3 lun=0 delta=-1

to subtract one from the access_count. Obviously this is just for debugging and may not be safe to do at all (and indeed wasn't in my case).
--Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
raid5 checksumming chooses wrong function
When booting my new Dell 4400, pre-installed with Red Hat 6.1, the raid5 checksumming function it chooses is not the fastest. I get:

raid5: measuring checksumming speed
raid5: KNI detected,...
   pIII_kni :  1078.611 MB/sec
raid5: MMX detected,...
   pII_mmx  :  1304.925
   p5_mmx   :  1381.125
   8 regs   :  1029.081
   32 regs  :   584.073
using fastest function: pIII_kni (1078.611 MB/sec)

Is there a good reason for it choosing pIII_kni (in which case the wording of the message "fastest" needs changing) or is it a bug? If no one else sees this, I'll dig in and see if I can fix it: maybe it's because the two sets of function lists are dependent on particular hardware (first for KNI, then for MMX) and something isn't getting zeroed or set to the max in between.

Benchmarking it on a stripeset of 7 x 9GB disks on an Ultra3 bus with one of the Adaptec 7899 channels, it's impressively fast: 81MB/s block reads and 512 seeks/s in bonnie, and 50MB/s (500 "netbench" Mbits/sec) running dbench with 128 threads. I've done tiotest runs too and I'll be doing more benchmarks on RAID5 soon. If anyone wants me to post figures, I'll do so. --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
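If the bug is what it looks like (the KNI and MMX candidate lists each tracked separately, with nothing comparing across the two passes), the fix amounts to a single max-scan over all the measured speeds. A minimal sketch of the intended selection, using the figures from the log above — this is my illustration, not the kernel's actual code:

```c
#include <stddef.h>

/* Pick the overall fastest checksum routine: one scan over all
 * measured speeds (MB/sec), regardless of which hardware-detection
 * pass produced each candidate. */
static size_t fastest_index(const double speed[], size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (speed[i] > speed[best])
            best = i;
    return best;
}

/* With the log's figures {pIII_kni, pII_mmx, p5_mmx, 8regs, 32regs} =
 * {1078.611, 1304.925, 1381.125, 1029.081, 584.073}, the winner
 * should be p5_mmx at index 2, not pIII_kni. */
```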
Re: Is Raid Stable?
[EMAIL PROTECTED] writes: > For more direct results, I've got a raid 1 on my web server, a raid 0 > in test (stupid jedi mind trick - what happens with a 2gig and 200mb > stripe set), and I hope to try raid 5 real soon now. I'm using software RAID5 on our IMAP server mailstore cluster: 6 nodes each with 6 x 9GB 10KRPM disks installed in 3 hot-swap hot-everything Sun D1000 arrays. It's up to 8300 users now and will be tripling over the next year or two. No problems. I also use software RAID5 (5 x 9GB 10K RPM disks in a Sun hot-swap multipack) for our web server (2 million hits/week, 90+ main sites, thousands of small user sites). No problems. I also use it on our mirror server (similar hardware). No problems. I trust it. --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
quotacheck with -DEXT2_DIRECT broken with RAID
I have some systems using Red Hat 5.2 plus raid0145-19981215-2.0.36 with raidtools-19981214-0.90.tar.gz. On each I have six disks configured as a single RAID5 device with an ext2 filesystem with 4K blocksize and -Rstride=16 (since the raid chunk-size is 64).

On running quotacheck (from the quota-1.55-9 RPM from Red Hat 5.2), it writes complete random trash into the "amount used" fields of the quota file: it's clearly picking up not only random block data but also random uid data. This caused people to appear over-quota. Oops. Recompiling quotacheck from source with -DEXT2_DIRECT turned off makes it work fine. (In fact, it's nice and fast despite the fact that it has to walk the filesystem tree itself.)

It looks as though libext2fs (or possibly the way quotacheck calls it) is broken when used with one or more of the following:
(1) any RAID device
(2) any ext2 filesystem with 4K blocks
(3) any ext2 filesystem with -Rstride=n

I think (1) is unlikely and that it's probably (2) or (3) (but RAID users are the most likely to tweak those). Has anyone else run into this or can anyone else please try to reproduce it? Once I know exactly what's causing it I'll contact the quota tools maintainer and/or Red Hat. --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
Re: benefits of journaling for soft RAID ?
Matti Aarnio writes:
> Take another; a popular public FTP site with storage
> consisting of multiple 30-50 GB RAID5 filesystems.
> Total capacity around 300 GB. It crashes for some
> reason, when will it be online ? (That is a system
> where a slow day is 50 GB worth of anonymous FTP
> traffic.)
>
> That FTP site happens to run journaled filesystem,
> it will be back online in 5 minutes. (Or then the
> problem requires hardware service which takes more
> time..)

The ftp site running a non-journalled ext2 filesystem will be back in about 13 minutes for multiple 50 GB RAID5 arrays which are nicely parallelised. Please do some benchmarks, like I have, before just assuming that fsck takes forever on largish filesystems. Yes, 5 minutes is faster than 13 minutes, but the Solaris boxes I have which *do* have journalled filesystems still take significantly longer to boot than my Linux boxes because the rest of the boot is horribly slow. Solaris seems to take forever prodding its various bits of hardware before it gets around to booting properly. --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
Re: fsck performance on large RAID arrays ?
Richard Jones writes:
> Malcolm Beattie wrote:
> >
> > Mounting mine when clean takes 4 seconds. I wonder if you used a 1k
> > block size for your filesystem. That greatly increases the time to
> > check the bitmaps upon mounting (though you can turn this off with
> > mount -o check=none). It also greatly decreases the performance of
> > the filesystem.
>
> Quite probably, if that was the default. Can you
> point me to any other things I should be changing
> (eg. stripe size, in particular). Given the myriad
> different possibilities and very limited time, I
> didn't experiment to choose optimal block size or
> stripe size.
>
> I don't have the bonnie benchmarks to hand, although
> they were quite acceptable - bottom line was 13 MBytes/sec
> throughput IIRC. I posted them on linux-raid before,
> so you should be able to dig them up from the archives.

The only archive of this list that AltaVista found for me was a local one which didn't go back far enough. I suggest you look at your "Random Seeks" figure in bonnie, not the bandwidth figures. I would suggest recreating the whole thing from scratch with chunk-size 64 in your /etc/raidtab. Then use

    mke2fs -b 4096 -R stride=16 /dev/md0

to create the filesystem (and wait until /proc/mdstat shows the rebuild has finished). Then try bonnie again and see if the "Random Seeks" figure has improved. Then try putting on lots of data and testing fsck times. Oh, and since you're not using SCSI disks, check that "hdparm /dev/hda" (and the other disks) shows using_dma and unmaskirq set to 1. --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
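The stride=16 in that mke2fs line isn't arbitrary: it's the number of filesystem blocks per RAID chunk. A one-line sketch of the arithmetic (the function name is mine, not from any raidtool):

```c
/* ext2 stride for software RAID: filesystem blocks per RAID chunk.
 * chunk_kb is the raidtab chunk-size in KB, block_bytes the ext2
 * block size chosen with mke2fs -b. */
static int ext2_stride(int chunk_kb, int block_bytes)
{
    return chunk_kb * 1024 / block_bytes;
}

/* ext2_stride(64, 4096) == 16, matching "-R stride=16" above; a
 * 1k-block filesystem on the same array would want stride=64. */
```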
Re: fsck performance on large RAID arrays ?
Richard Jones writes:
> Benno Senoner wrote:
> > Hi,
> > does anyone of you know how long it takes,
> > to e2fsck (after an unclean shutdown) for example a soft-raid5 array of
> > a total size of about 40-50 GB
> > ( example : 6 disk with 9GB (UW SCSI) )
> > assume the machine is a PII300 - PII400
> >
> > assume that the raid-array is almost filled with data (so that e2fsck
> > takes longer)
>
> This is affected by so many different factors,
> that it's really impossible for me to give an
> estimate for your machine.

Indeed.

> However, as a guide,
> our machine was:
>
> P-II 233 MHz
> 256 MB RAM
> 6 * UltraDMA drives with measured throughput
> of 16 MBytes/sec
> RAID space: 42 GB after formatting
>
> with the drive about 20% full we had fsck times of
> 20 mins and 33% full of about 30 mins.

The primary differences between yours and mine (which took 13 minutes to fsck when 70% full) are (1) mine used SCSI, which has tagged queuing, and (2) mine had 10K RPM disks, which also improves seek times. It looks as though Benno's resembles mine a bit more closely.

> In both cases,
> mounting a clean filesystem took about 2 mins.

Mounting mine when clean takes 4 seconds. I wonder if you used a 1k block size for your filesystem. That greatly increases the time to check the bitmaps upon mounting (though you can turn this off with mount -o check=none). It also greatly decreases the performance of the filesystem.

> Which just goes to show that ext2 is not a suitable
> filesystem for large disk arrays.

Not at all. You just have to be more careful with choosing hardware and tuning software with larger arrays. As a matter of interest, what figures does bonnie give on your array? I would guess that the fsck time is probably mostly affected by random seeks. I get 282/sec.

> Roll on journalling
> in 2.3, I say!

That will be nice but you can use ext2 as-is on large systems if care is taken at the design/tuning stage.
--Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
Re: fsck performance on large RAID arrays ?
Benno Senoner writes:
> does anyone of you know how long it takes,
> to e2fsck (after an unclean shutdown) for example a soft-raid5 array of
> a total size of about 40-50 GB
> ( example : 6 disk with 9GB (UW SCSI) )
> assume the machine is a PII300 - PII400
>
> assume that the raid-array is almost filled with data (so that e2fsck
> takes longer)
>
> can times go up to 1h ?

I did some benchmarks last Wednesday for a thread on the server-linux mailing list. Here's a cut-and-paste job:

Hardware: 350 MHz Pentium II PC, 512 MB RAM, BT958D SCSI adapter. Sun D1000 disk array with 6 x 9 GB 10K RPM disks. Software: Linux 2.0.36 + latest RAID patch. Filesystem configured as a single 43 GB RAID5 ext2 filesystem with 4k blocks and 64k RAID5 chunk-size.

I created 25 subdirectories on the filesystem and in each untarred four copies of the Linux 2.2.1 source tree (each is ~4000 files totalling 63 MB untarred). fsck took 8 minutes.

Then I added 100 subdirectories in each of those subdirectories and into each of those directories put five 1MB files. (The server is actually going to be an IMAP server and this mimics half-load quite well.) The result is 18 GB used on the filesystem. fsck took 10.5 minutes.

Then I added another 100 subdirectories in each of the 25 directories and put another five 1MB files in each of those. The result is 30 GB used on the filesystem. fsck took 13 minutes.

The important points are probably that (a) the disks are 10K RPM, which helps random I/O, and (b) the filesystem block size is 4k. Don't even think about using a 1k block size on a large filesystem (unless you have a really weird environment).

> Is it very unsafe to remove fsck at boot ?

Yes. You might as well leave it in, since if the filesystem was unmounted cleanly then fsck doesn't bother checking it fully and continues straight on.

> what about checking /proc/mdstat at boot time and then determining if
> e2fsck should be run or not ?
> In theory if the array was shut down cleanly , the filesystem should be
> in a consistent status.
> please correct me if I am wrong .

That's wrong. The consistency of the array and the consistency of the filesystem on it are two independent issues. --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
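As a sanity check (my arithmetic, not part of the post), the reported 18 GB and 30 GB figures do follow from the directory layout described:

```c
/* Recompute the data volume at each stage of the fsck benchmark:
 * 25 top-level dirs, each with four ~63MB kernel source trees, then
 * one or two rounds of 100 subdirs x five 1MB files per top-level
 * dir. */
static int mb_after_step(int step)
{
    int mb = 25 * 4 * 63;       /* kernel trees: 6300 MB            */
    if (step >= 2)
        mb += 25 * 100 * 5;     /* + 12500 MB -> ~18 GB used        */
    if (step >= 3)
        mb += 25 * 100 * 5;     /* + 12500 MB -> ~30 GB used        */
    return mb;
}
```

So fsck time grew from 8 to 13 minutes while the data roughly quintupled, which supports the point that seek rate, not raw size, dominates.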
Re: RAID1 on / for RH 5.2
Mike Brodbelt writes:
> Get rid of the raidtools that shipped with RedHat. Then remove all reference to
> RAID in /etc/rc.d/rc.sysinit and /etc/rc.d/init.d/halt. Then get hold of the
> raidtools 0.90 source, and compile that. Kernel - 2.0.38 doesn't exist :-). I
> suggest using a clean 2.0.36 source tree, and patching that for RAID support.
> Don't use the kernel source provided with RH 5.2, it's a 2.0.36 pre release,
> with lots of weird patches added.

The original kernel RPMs for Red Hat 5.2 (kernel-*-2.0.36-0.7.*.rpm) were indeed a pre-release 2.0.36 kernel. However, the update RPMs which are now available for 5.2 (kernel-*-2.0.36-1.*.rpm) are real 2.0.36 kernels. The only thing to be careful about (well, the only thing that bit me anyway) is that you need to do a "make mrproper" to get rid of the foo.ver files in include/modules, otherwise the stale entries for the md_foo functions in md.ver clash with the new ones which go in ksyms.ver for recent RAID patches.

After installing the kernel-source RPM and patching with the RAID patch (the patch hunk to defconfig is rejected but it's benign: just apply it by hand), do the make mrproper, make config and make zImage. Then do make modules and make modules_install. Pick a name for your new kernel version (say -foo) and do

    mv /lib/modules/2.0.36{,-foo}
    cd /usr/src/linux
    strings vmlinux | grep 'Linux version' > /lib/modules/2.0.36-foo/.rhkmvtag
    cp arch/i386/boot/zImage /boot/vmlinuz-2.0.36-foo
    cp System.map /boot/System.map-2.0.36-foo
    cp /boot/module-info{,-2.0.36-foo}

then add an entry to /etc/lilo.conf and rerun lilo. If you stick with the Red Hat way of doing kernels then the multiple versions of the kernel will coexist more nicely for modules and such like. --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
Some hw v. sw RAID benchmarks
Here for your edification and amusement are some benchmarks comparing hardware v. software RAID for fairly similar setups. Sun sell two versions of their 12-disk hot-swap dual-everything disk array (codename Dilbert):

* the D1000 is a "dumb" array presenting 6 disks on each of two Ultra Wide Differential SCSI busses.
* the A1000 is similar but has an internal hardware RAID module which connects to the two busses internally, does its "RAID thing" and presents a single Ultra Wide Differential bus to the outside world, talking to an intelligent adapter card on the host's side.

We have the following configurations which I benchmarked using bonnie:

System 1: A1000 array with 6 x 10K RPM 4GB wide SCSI drives and 64MB NVRAM cache, connected to a Sun Ultra 5 with a 270 MHz UltraSPARC IIi CPU and 320 MB RAM running Solaris 2.6 via a Symbios 53C875-based card.

System 2: D1000 array with 6 x 10K RPM 9GB wide SCSI drives on one of its two busses, connected to a PC with a 350 MHz PII CPU and 512 MB RAM running Linux 2.0.36 with the raid-19981214-0.90 RAID patch.

Both systems were set up as a single 6-disk RAID5 group. System 1 had a standard Solaris UFS filesystem on the resulting 20GB logical drive. System 2 used chunk-size 64 for its RAID5 configuration (defaults for other settings) and a single ext2 filesystem (with blocksize 4096 and stride=16). Bonnie was run on both as the only non-idle process on a 1000 MB file.

                          System 1               System 2
Seq output -- per char    7268 K/s @ 66.7% CPU   5104 K/s @ 88.6% CPU
           -- block      12850 K/s @ 31.9% CPU  12922 K/s @ 16.4% CPU
           -- rewrite     8221 K/s @ 45.1% CPU   5973 K/s @ 16.9% CPU
Seq input  -- per char    8275 K/s @ 99.2% CPU   5058 K/s @ 96.1% CPU
           -- block      21856 K/s @ 46.4% CPU  13080 K/s @ 15.2% CPU
Random seeks             293.0 /s @  8.7% CPU   282.3 /s @  5.7% CPU

--Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
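For reference (my arithmetic, not part of the post), the "20GB logical drive" for system 1 is just the usual RAID5 capacity rule:

```c
/* RAID5 usable capacity: one disk's worth of space goes to the
 * (distributed) parity, whatever the array size. */
static int raid5_capacity_gb(int ndisks, int disk_gb)
{
    return (ndisks - 1) * disk_gb;
}

/* raid5_capacity_gb(6, 4) == 20 for system 1's six 4GB drives;
 * system 2's six 9GB drives give 45GB raw, which lines up with the
 * ~43GB-after-formatting figures quoted elsewhere in this digest. */
```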
RAID0 striping two RAID5 arrays
I'm setting up an ftp/mirror server out of a PC with a 4GB EIDE disk, three 6GB EIDE disks taken from other PCs that don't need them and an old SCSI array of 4 x 4GB disks. Rather than mess around with allocating 7 or 8 different filesystems to different parts of the ftp hierarchy, I want to raid them into one large blob. What seems an obvious way to me which should provide reasonable performance and reliability is to make the 4 x 4 GB disks into a RAID5 array (resulting in 12 GB of visible store), make the 3 x 6GB disks into a separate RAID5 array (another 12GB) and then RAID0 stripe the two 12GB md devices into one big 24GB one. Is doing RAID0 over RAID5 a possible/reasonable thing to do with the latest md/raidtools? Need I choose any non-default chunk sizes or suchlike to tune things better? --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
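The capacities in the proposal can be sketched with the same parity rule; this is an illustrative check of the stated numbers, not output from any raidtool:

```c
/* Nested array sizing for the proposed layout: each RAID5 member set
 * loses one disk to parity, and RAID0 over the two resulting md
 * devices simply sums them. */
static int raid5_gb(int ndisks, int disk_gb) { return (ndisks - 1) * disk_gb; }
static int raid0_gb(int a_gb, int b_gb)      { return a_gb + b_gb; }

/* raid0_gb(raid5_gb(4, 4), raid5_gb(3, 6)) is 12 + 12 = 24 GB,
 * matching the "one big 24GB" device described above. */
```

Keeping the two RAID5 members the same size (12GB each) also means the RAID0 layer stripes evenly rather than degenerating into concatenation at the tail.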
Re: Is this possible/feasible
Stephen C. Tweedie writes:
> Hi,
>
> On Sun, 18 Oct 1998 15:55:35 +0200 (CEST), MOLNAR Ingo
> <[EMAIL PROTECTED]> said:
>
> > On Sun, 18 Oct 1998, Tod Detre wrote:
>
> >> in 2.1 kernels you can mak nfs a block device. raid can work with block
> >> devices so if you raid5 several nfs computers one can go down, but you
> >> still can go on.
>
> > you probably want to use Stephen Tweedie's NBD (Network Block Device),
>
> Heh, thanks, but the credit is Pavel Machek's. I've just been testing
> and bug-fixing it.
>
> > which works over TCP and is as such more reliable and works over bigger
> > distance and larger dropped packets range. You can even have 5 disks on 5
> > continents put together into a big RAID5 array. (meant to survive a
> > meteorite up to the size of a few 10 miles ;) and you can loopback it
> > through a crypt^H^H^H^H^Hcompression module too before sending it out to
> > the net.
>
> Of course, you'll need to manually reconstruct the raid array as
> appropriate, and you don't get raid autostart on a networked block
> device either. However, it ought to be fun to watch, and I'm hoping we
> can integrate this method of operation into some of the clustering
> technology now appearing on Linux to do failover of NFS services if one
> of the networked raid hosts dies. Just remount the raid on another
> machine using the surviving networked disks, remount ext2fs and migrate
> the NFS server's IP address: voila!

There's a way which should give better performance in the general case, which I think I've mentioned on this mailing list before. It avoids the overhead of a synchronous NBD, since when you're migrating a disk to a new system there's no constraint that the remote system have up-to-date data all the time. It's a combination of a little kernel driver called breq and a simple user-mode program. The basic idea is to add a few lines to make_request in ll_rw_blk.c in "case WRITE".
When breq is turned on for a particular device (done by an ioctl on /dev/breq, which appears as a character device to user-land), the block number of the request is simply written to a 4K ring buffer. That's the only kernel patch needed. The breq device driver module sucks out the data from the ring buffer and feeds it to the reader.

To do a filesystem migration, there's a bmigrate user-land program which effectively has two independent threads (actually it uses select() but it's easier to think of as threads). You start with a bitmap with one bit for each block on the device you're migrating, setting the bitmap to all 1s. You make a TCP connection to a daemon on the new system (described below). One thread of bmigrate does

    while (1) {
        blocknum_t n;   /* 32 bits */
        read(breq_fd, &n, 4);
        bitmap[n] = 1;  /* mark block n dirty */
    }

The other thread does

    while (n = find_first_set_bit(bitmap)) {
        struct { int n; char data[512]; } binfo;
        bitmap[n] = 0;  /* mark it clean */
        read(raw_device_fd, &binfo.data, sizeof(binfo.data));
        binfo.n = n;
        write(remote_socket, &binfo, sizeof(binfo));
    }

The daemon on the other end of the connection just does

    while (read(client_socket, &binfo, sizeof(binfo))) {
        lseek(raw_device_fd, binfo.n * 512, SEEK_SET);
        write(raw_device_fd, &binfo.data, 512);
    }

This is all completely asynchronous to the migrating filesystem so it's not as slow as a network block device. Now, gradually the bitmap gets cleared as the migrater writes across the data and catches up with ongoing write activity. Eventually, there are only a "few" bits set. At that time, you take down the RAID device, let the migrater finish sending the last few blocks, then bring up the new system on the same IP number with its newly migrated data. The code for bmigrate is sitting on my PC at home and I haven't quite had the time to do the breq thing properly yet.
I've not quite figured out what context make_request runs in and how to synchronise writing to the ring buffer with the ioctl code to shut it off. Does make_request get called from interrupts or bottom halves? What's the new-fangled SMP-safe way to do such locking in a way that make_request doesn't have to get a slow lock every time it wants to write data? --Malcolm -- Malcolm Beattie <[EMAIL PROTECTED]> Unix Systems Programmer Oxford University Computing Services
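The bmigrate loops above can be exercised without breq or a raw device; the following is my in-memory restatement of the dirty-bitmap logic (a byte-per-block "bitmap" and fixed arrays standing in for the device and the socket), not the poster's actual code:

```c
#include <string.h>

#define NBLOCKS 8
#define BLKSZ   16

static unsigned char dirty[NBLOCKS];  /* byte per block, stand-in bitmap */

static void mark_dirty(int n) { dirty[n] = 1; }

static int find_first_dirty(void)
{
    for (int n = 0; n < NBLOCKS; n++)
        if (dirty[n])
            return n;
    return -1;
}

/* One migration pass: copy every dirty block from src to dst.  The
 * dirty mark is cleared *before* the copy, as in the post, so that a
 * write which re-dirties the block mid-copy is never lost.  Returns
 * the number of blocks copied. */
static int migrate(const char *src, char *dst)
{
    int n, copied = 0;
    while ((n = find_first_dirty()) >= 0) {
        dirty[n] = 0;
        memcpy(dst + n * BLKSZ, src + n * BLKSZ, BLKSZ);
        copied++;
    }
    return copied;
}
```

In the real setup the read from /dev/breq replaces mark_dirty and the copy goes over the TCP socket, but the convergence argument is the same: the loop runs until no dirty bits remain.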