Hardware RAID and Linux software RAID on a real production server
This is a report on a real production experiment using both Linux software RAID and a Mylex hardware RAID controller (real production tends to be even harder than tests, even at a lower load, because more special situations happen). My production server has:
- 2 x 8 GB in Linux software RAID 1, on a BusLogic BT958 controller (ultra wide SCSI at 40 MB/s)
- 4 x 45 GB in Linux software RAID 5 (no spare), on a BusLogic BT958 controller (ultra wide SCSI at 20 MB/s)
- 6 x 50 GB in hardware RAID 5 (1 spare), on a Mylex AcceleRAID 250 (ultra wide LVD SCSI at 80 MB/s, also called Ultra2 SCSI)

All cables are short and of high quality, and the whole server is connected to a UPS. There is also an air cooler to prevent excessive heat. NB: the four-disk ultra wide SCSI chain runs at only 20 MB/s because non-LVD ultra SCSI is unreliable at 40 MB/s when there are more than 3 disks or the cable is longer than 1.5 meters. I also noticed that the BusLogic controller is unreliable at recovering from SCSI bus errors (it sadly loops forever on SCSI resets), as opposed to disk errors, so it is really important to set a conservative SCSI bus speed.

The box ran like a charm for several months (under the same load, the previous OS/2 server crashed roughly once a week), but then the Mylex controller suddenly put all the disks offline (dead) at once, while I was on holiday: not only humans can be vicious. I just had to force them all back online, check consistency, and it restarted. A few weeks later, it put all the disks offline again at once. This time it refused to come back nicely: the consistency check failed, and it put everything back offline after only a few minutes.

So I decided to remove the Mylex controller and put in a Tekram DC390U2W. This is the model I selected from a small survey I did several months ago on this mailing list, asking people which LVD SCSI controller they were using and how happy they were with it (Tekram came first, Adaptec second).

The problem was then how to read, with Linux software RAID 5, the data written by the Mylex controller. I wrote a few test programs and found that the Mylex AcceleRAID controller uses what Linux software RAID calls the right-asymmetric parity algorithm (you will find the relevant part of the /etc/raidtab file I use at the end of this mail), so I could read the data using Linux software RAID (mkraid --force --dangerous-no-resync /dev/md2). Great.

So, first I checked all the disk surfaces, reading the whole content sequentially, one disk after the other ... and found no error! Then I checked the RAID 5 parity using an extended version of the Pliant RAID conversion software (extended to handle the right-asymmetric parity algorithm) and found only two corrupted chunks (a rough sketch of this kind of check is given below). Third, I recomputed the MD5 checksums of all 175000 files and compared them with the values in the database that I keep up to date on a second computer; only three files were corrupted (so I will need to insert only 3 CDs to get things fixed). Lastly, I tried 'e2fsck -n' on the new Linux software RAID 5 partition and it found no error.

So, the final unanswered question is: why did the Mylex controller fail that ungracefully if no disk contains dead blocks?

My conclusion from this experiment is that Linux software RAID is even more reliable (the two RAID sets handled by Linux software had no problem, since I use the right conservative bus speed), and much more flexible (it enabled me to check the disks individually and to freely select the RAID 5 configuration matching the existing data). Thanks to Ingo and the others for this great work.
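Here is the rough sketch mentioned above of that kind of parity check (in Python, as an illustration only; the actual check was done with the Pliant tool): in the right-asymmetric layout, stripe s keeps its parity chunk on disk s % ndisks, the data chunks fill the remaining disks in increasing disk order, and a consistent stripe XORs to zero across all members.

import functools

CHUNK = 64 * 1024                               # chunk-size 64 in raidtab means 64 KB chunks

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def check_stripes(devices, nstripes):
    "devices: file objects opened read-only on the member disks, in raid-disk order"
    bad = []
    for s in range(nstripes):
        chunks = []
        for d in devices:
            d.seek(s * CHUNK)
            chunks.append(d.read(CHUNK))
        parity_disk = s % len(devices)          # right-asymmetric: parity starts on disk 0 and moves right
        if any(functools.reduce(xor, chunks)):  # data XOR parity must be all zeroes
            bad.append((s, parity_disk))
    return bad

Called with the five member devices listed in the raidtab at the end of this mail, opened read-only, it returns the inconsistent stripes together with the disk that held their parity.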
I hope that this fairly long story (mixed with several others) can help others decide more safely what they will trust most with their real production data (an infinite incremental backup and an MD5 database are also great assets in any case), and know that it is possible to switch in the middle.

raiddev /dev/md2
    raid-level              5
    parity-algorithm        right-asymmetric
    nr-raid-disks           5
    nr-spare-disks          0
    persistent-superblock   0
    chunk-size              64
    device                  /dev/sdh
    raid-disk               0
    device                  /dev/sdi
    raid-disk               1
    device                  /dev/sdj
    raid-disk               2
    device                  /dev/sdk
    raid-disk               3
    device                  /dev/sdl
    raid-disk               4
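For what it is worth, this is a minimal sketch (Python; the 'md5list' file name and layout are just an illustration, not the exact tool I use) of the kind of MD5 database check mentioned above: the reference checksums live on a second machine as 'checksum  path' lines, and each file is re-hashed and compared.

import hashlib, os

def md5_of(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def corrupted_files(root, dbfile="md5list"):
    "return the paths whose current MD5 no longer matches the reference database"
    bad = []
    for line in open(dbfile):
        checksum, path = line.rstrip("\n").split("  ", 1)
        full = os.path.join(root, path)
        if not os.path.exists(full) or md5_of(full) != checksum:
            bad.append(path)
    return bad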
Re: Hardware RAID and Linux software RAID on a real production server
Leonard N. Zubkoff wrote:

  Generally, the Mylex PCI RAID controllers take disks offline when certain types of unrecoverable errors occur. The driver will log the reason for any disk being killed as a console message. Without further information as to precisely why the disks were taken offline and whether they all were taken offline simultaneously, it's hard to know what happened. Firmware bugs in either the controller firmware or disk drives are a plausible reason, as would be a problem with the SCSI controller chip on the AcceleRAID, or an electrical problem on the SCSI bus.

I removed the Mylex controller, and since it is a production server I cannot experiment on it, so I cannot gather any more information (I am sorry about that, because I find it very important to spend some time helping maintainers fix things). I also know that dmesg output is important for maintainers, so please find below what I saved before removing the controller. Please also notice that the output contains many lines related to the fact that I tried to force the disks back online. I am sorry not to have the very first error messages, but there was so much output from the indirect trouble that followed the initial problem that the beginning of dmesg was already truncated when I logged on to the machine.

The most significant message is probably:

DAC960#0: Physical Drive 0:x killed because of bad tag returned from drive

but I do not find it meaningful at all, and since there is no source code available to study, that is why I stopped trying to cope with this controller. Also, if you want to test the controller yourself, I can ask my company to send it to you and let you keep it, since we are not going to use it any more.

In the coming months I will use the same set of disks, with the same cables (except the one linking to the controller, since the pins are different), driven by Linux software RAID and the Tekram DC390U2W, so I will send you news of any failure that might happen.

I also discovered that Mylex does not use the last megabyte of each disk, so I can use persistent-superblock 1 in my new /etc/raidtab file. This may be interesting for you to know, since it means that switching from a Mylex AcceleRAID to Linux software RAID 0.90 can be done without clearing the data.

Regards, and many thanks for the great work that you do, even if my personal experiment is leading me to drop all sophisticated devices and use simpler ones instead, with the sophisticated features performed by free (source code available) software that I can read in case of failure.

Hubert Tonneau

* DAC960 RAID Driver Version 2.2.4 of 23 August 1999 *
Copyright 1998-1999 by Leonard N. Zubkoff [EMAIL PROTECTED]
Configuring Mylex DAC960PTL1 PCI RAID Controller
  Firmware Version: 4.06-0-60, Channels: 1, Memory Size: 8MB
  PCI Bus: 0, Device: 13, Function: 1, I/O Address: Unassigned
  PCI Address: 0xF680 mapped at 0xD000, IRQ Channel: 9
  Controller Queue Depth: 128, Maximum Blocks per Command: 128
  Driver Queue Depth: 127, Maximum Scatter/Gather Segments: 33
  Stripe Size: 64KB, Segment Size: 8KB, BIOS Geometry: 128/32
  Physical Devices:
    0:1  Vendor: SEAGATE  Model: ST150176LC  Revision: 0001
         Serial Number: NQ05082119480B8Z
    0:2  Vendor: SEAGATE  Model: ST150176LC  Revision: 0001
         Serial Number: NQ051852194804HG
         Disk Status: Dead, 97691648 blocks, 4 resets
    0:3  Vendor: SEAGATE  Model: ST150176LC  Revision: 0001
         Serial Number: NQ0516021948K2Z7
         Disk Status: Dead, 97691648 blocks, 4 resets
    0:4  Vendor: SEAGATE  Model: ST150176LC  Revision: 0001
         Serial Number: NQ0105171948JQZA
         Disk Status: Dead, 97691648 blocks, 4 resets
    0:5  Vendor: SEAGATE  Model: ST150176LC  Revision: 0001
         Serial Number: NQ05085919480B5E
         Disk Status: Dead, 97691648 blocks, 4 resets
    0:6  Vendor: SEAGATE  Model: ST150176LC  Revision: 0001
         Serial Number: NQ0282161948JQMF
         Disk Status: Dead, 97691648 blocks, 4 resets
  Logical Drives:
    /dev/rd/c0d0: RAID-5, Offline, 390766592 blocks, Write Back
  No Rebuild or Consistency Check in Progress
DAC960#0: Make Online of Physical Drive 0:6 Succeeded
DAC960#0: Physical Drive 0:6 is now ONLINE
DAC960#0: Make Online of Physical Drive 0:6 Illegal
DAC960#0: Make Online of Physical Drive 0:2 Succeeded
DAC960#0: Physical Drive 0:2 is now ONLINE
DAC960#0: Make Online of Physical Drive 0:3 Succeeded
DAC960#0: Physical Drive 0:3 is now ONLINE
DAC960#0: Make Online of Physical Drive 0:1 Illegal
DAC960#0: Make Online of Physical Drive 0:4 Succeeded
DAC960#0: Make Online of Physical Drive 0:5 Succeeded
DAC960#0: Physical Drive 0:4 is now ONLINE
DAC960#0: Physical Drive 0:5 is now ONLINE
DAC960#0: Logical Drive 0 (/dev/rd/c0d0) is now ONLINE
DAC960#0: Make Online of Physical Drive 0:6 Illegal
DAC960#0: Make Online of Physical Drive 0:1 Illegal
rd/c0d0
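A side note on the persistent-superblock point above: as far as I understand the RAID 0.90 format (this is my reading of it, so check before relying on it), the superblock is written in the last 64 KB-aligned 64 KB block of each member device, hence within its final 128 KB; since the Mylex firmware leaves the last megabyte unused, writing it cannot overwrite array data. The arithmetic, as a small Python sketch:

MD_RESERVED = 64 * 1024                         # size of the reserved superblock area

def superblock_offset(device_size_bytes):
    # round down to a 64 KB boundary, then back off one reserved block
    return (device_size_bytes & ~(MD_RESERVED - 1)) - MD_RESERVED

size = 97691648 * 512                           # one of the disks from the log above (512-byte blocks)
print(superblock_offset(size), "bytes in,", size - superblock_offset(size), "bytes from the end")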
Reconfiguring a RAID system without data loss
Release 2 of the Pliant utility for reconfiguring a software RAID (adding disks, changing the RAID level or the chunk size) without losing data is available (at your own risk: test it on sample data before using it on real data). The new code is even expected to survive a disk failure in the middle of the RAID reconfiguration (provided the RAID level is 1 or 5 :-) ). The script will be included in Pliant release 28, which will be available at http://pliant.cams.ehess.fr/ in a few days. If some of you want to stress test it, I can send a snapshot by email right now, because the changes that will happen in Pliant before release 28 have nothing to do with this RAID script.

Ingo wrote that this script should be written in C for several reasons:
- speed
- portability

These are my answers to these arguments:
- Although it would be faster if written in C, because the Pliant code generator is not as efficient as GCC, this is not that important:
  . there is nothing in Pliant itself that makes it slower than C, so at some point the speed will be exactly the same as with C;
  . on a 300 MHz processor, the conversion speed could be about 10 MB/s, so it is completely IO bound on my laptop.
- The portability argument is a valuable one, because Pliant currently runs only on i386 (an Alpha port is planned for next year, but others are not scheduled yet).
- There is an additional strong argument for writing the script in C: Pliant is a young language that still changes a lot.

Now, these are my arguments for writing the script in Pliant:
- Compiling a C program is a big problem because a C compiler is dumb: you need a set of extra tools (make, a configure script, ...) in order to cope with the differences between machines. On the other hand, a Pliant script is always run directly from its source code (Pliant is what I call a dynamic compiler; you could also say on the fly, or not stupid), so for free software it is definitely better because you get rid of the binaries nightmare.
- In standard C there is no provision for 64-bit arithmetic, whereas in Pliant unlimited integers are a basic feature.
- A Pliant program is debugged faster, because when you set the debugging level to 2 or more, all arithmetic overflows are reported: this was important for such an application.
- The Pliant code is probably shorter than the C code would be.
- I prefer Pliant (I designed it :-) ).

Ingo also suggested that at some point this conversion should be done on the fly, in kernel space. This would be the ultimate refinement, but it is not worth the trouble (except for changing the chunk size) until we get the ability to resize an ext2 partition without unmounting it. It seems to be possible already, but with strong restrictions, so for the moment the 'ext2resize' utility seems to be the safest solution. Believing that no Pliant code will ever go into the Linux kernel might also be wrong in the long run. My opinion is that it may well happen the other way round at some point: Pliant would be the boot loader that loads and compiles the Linux kernel on the fly. Pliant has provision for several syntaxes, so the parts of Linux that are written in C would be seen as Pliant programs: a Pliant C parser module would make it transparent.
Pliant would then bring flexibility to the Linux kernel: the ability to compile some drivers on the fly when the hardware is detected or when some network frames are received, or to recompile on the fly with different optimisation options once some statistics have been collected; the only cost would be a bigger kernel (because Pliant contains the compiler and some extra information). Lastly, the Pliant compiling machinery is much more powerful than that of C compilers, so on a large project such as the Linux kernel you can get much cleaner and easier to maintain code, even while keeping the C syntax.

As a conclusion, I will maintain the script, in Pliant, until somebody rewrites it in C. Ingo, please do not include the script in the RAID user level tools, since it may need to be adjusted if I make more changes in Pliant itself: just mention that it is available and give the Pliant URL. I may also send you a small HTML page explaining how to use it, which could be included in the documentation of the user level tools.

This is a sample usage of the script:

pliant module /pliant/admin/raid.pli command 'raid_convert "/dev/md0" "/dev/hda5 /dev/hda6 /dev/hda7" "/dev/hda5 /dev/hda6 /dev/hda7 /dev/hda8" 5 64*2^10'

In this command, 5 is the requested RAID level and 64*2^10 (= 64K) is the requested chunk size.

Regards,
Hubert Tonneau
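PS: to make the conversion idea concrete, here is a rough sketch (Python, only an illustration; it is not the Pliant code) of the address arithmetic such a conversion relies on: computing which disk and which stripe hold logical data chunk number k of a RAID 5 array, so that each chunk can be read back from the old geometry and rewritten into the new one. The symmetric layout variants differ only in that the data chunks start on the disk following the parity disk.

def raid5_place(k, ndisks, algorithm="left-asymmetric"):
    "return (disk, stripe) holding logical data chunk number k"
    data_disks = ndisks - 1
    stripe, d = divmod(k, data_disks)
    if algorithm == "left-asymmetric":
        parity = data_disks - stripe % ndisks   # parity walks from the last disk downwards
    else:                                       # "right-asymmetric"
        parity = stripe % ndisks                # parity walks from the first disk upwards
    if d >= parity:                             # data chunks skip the parity disk
        d += 1
    return d, stripe

# example: where logical chunk 5 sits before and after growing an array from 3 to 4 disks
print(raid5_place(5, 3), "->", raid5_place(5, 4))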
RE: question about adding a disk: Now possible
This is a Pliant (http://pliant.cams.ehess.fr/) script that should enable you to make changes in a RAID configuration with no data loss (or at least the hope of it).

Let's take an example. The old /etc/raidtab configuration file is:

raiddev /dev/md0
    raid-level              5
    nr-raid-disks           3
    nr-spare-disks          0
    persistent-superblock   1
    chunk-size              64
    device                  /dev/hda5
    raid-disk               0
    device                  /dev/hda6
    raid-disk               1
    device                  /dev/hda7
    raid-disk               2

The new one you want is:

raiddev /dev/md0
    raid-level              5
    nr-raid-disks           4
    nr-spare-disks          0
    persistent-superblock   1
    chunk-size              4
    device                  /dev/hda5
    raid-disk               0
    device                  /dev/hda6
    raid-disk               1
    device                  /dev/hda7
    raid-disk               2
    device                  /dev/hda8
    raid-disk               3

1) Run the following Pliant command:

pliant module /sample/raidconvert.pli command 'raid_convert "/dev/md0" "/dev/hda5 /dev/hda6 /dev/hda7" "/dev/hda5 /dev/hda6 /dev/hda7 /dev/hda8" 5 4*1024'

- parameter 2 of 'raid_convert' is the list of the RAID devices in the old RAID configuration (spare disks should not be listed)
- parameter 3 is the list of the RAID devices in the new RAID configuration (spare disks should not be listed)
- parameter 4 is the RAID level of the new RAID configuration
- parameter 5 is the new chunk size

2) Modify your /etc/raidtab.

3) Use the 'mkraid' command to recreate the new RAID array (your data should be preserved).

This script should also enable you to remove some disks, or to change the RAID level or the chunk size, BUT TEST IT ON SAMPLES BEFORE APPLYING IT TO ANY SERIOUS DATA. You must use the 'ext2resize' command BEFORE step 1 if your new RAID array will be smaller than the old one, or after step 3 if it is bigger. You should also be aware that if anything goes wrong during the conversion (any I/O error), the program will abort ungracefully and all data will be lost.

Good luck, brave people.

Hubert Tonneau

# Copyright (C) 1999 Hubert Tonneau [EMAIL PROTECTED]
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License version 2
# as published by the Free Software Foundation.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# version 2 along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
# release 1 for Pliant release 25

module "/pliant/v1.pli"
module "/pliant/meta.pli"

constant sector_size 512

function os_llseek handle high low result whence -> err
  arg Int handle ; arg uInt high low ; arg Address result ; arg Int whence ; arg Int err
  kernel_function 140

function pgcd a b -> g
  arg Intn a b g
  var Intn x := a
  var Intn y := b
  g := y
  while x<>0
    g := x
    x := y%x
    y := g

function ppcm a b -> m
  arg Intn a b m
  m := a*b\(pgcd a b)

function min a b -> c
  arg Intn a b c
  c := shunt a<=b a b

type Device
  field Int handle <- -1
  field Str name

method d open device_name
  arg_w Device d ; arg Str device_name
  d name := device_name
  var Str namez := device_name+"[0]"
  d handle := os_open namez:characters 2 0
  if d:handle<0
    error "Failed to open device "+d:name

method d close
  arg_w Device d
  if (os_close d:handle)<0
    error "Failed to close device "+d:name
  d handle := -1

method d seek position
  arg Device d ; arg Intn position
  if position%sector_size<>0
    error "Misaligned seek applied to device "+d:name+" ("+(cast position Str)+")"
  var uInt high := cast position\(cast 2 Intn)^32 uInt
  var uInt low := cast position%(cast 2 Intn)^32 uInt
  check high*(cast 2 Intn)^32+low=position
  if (os_llseek d:handle high low addressof:(var uInt64 result) 0)<0
    error "Failed to set position for device "+d:name

method d read buffer size
  arg Device d ; arg Address buffer ; arg Int size
  var Int red := os_read d:handle buffer size
  if red<size
    error "Failed to read from device "+d:name

method d write buffer size -> status
  arg Device d ; arg Address buffer ; arg Int size ; arg Status status
  var Int written := os_write d:handle buffer size
  if written<size
    error "Failed to write to device "+d:name

type Raid
  field Str device_name
  field Array:Device devices
  field Int level
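A side note on the 'seek' method above: kernel_function 140 is the _llseek system call on i386, which wants the 64-bit byte offset split into two 32-bit words. The same split and consistency check, as a tiny Python illustration (an illustration only, not part of the script):

def split_offset(position):
    high, low = divmod(position, 2 ** 32)       # the split the 'seek' method performs
    assert high * 2 ** 32 + low == position     # the same consistency check as the 'check' line
    return high, low

print(split_offset(60 * 2 ** 30))               # a position 60 GB into a device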
Reliable SCSI LVD controller for Linux?
What is the most reliable LVD SCSI controller for Linux? I use several BusLogic controllers, but as far as I know they do not have an LVD version, which is absolutely necessary for long SCSI chains, and my BusLogic controllers went into an infinite reset loop several times, which RAID cannot protect against. I also tested a Mylex AcceleRAID with its integrated RAID software, but it is expensive and not very flexible: you cannot change the RAID configuration remotely, since the RAID configuration program is accessible only at boot time; as far as I know you cannot have different SCSI channels use different SCSI speeds; and you cannot build a RAID set involving disks on several controllers. Lastly, their RAID software is not open, so it is hard to trust. On the other hand, I believe you can add drives to existing RAID sets.
linear over raid1 deadlocks
With the following configuration, any attempt to access /dev/md1 will lock the process in the D (disk sleep) state:

raiddev /dev/md0
    raid-level              1
    nr-raid-disks           2
    nr-spare-disks          0
    persistent-superblock   1
    chunk-size              64
    device                  /dev/hda5
    raid-disk               0
    device                  /dev/hda6
    raid-disk               1
raiddev /dev/md1
    raid-level              linear
    nr-raid-disks           2
    chunk-size              4
    persistent-superblock   0
    device                  /dev/md0
    raid-disk               0
    device                  /dev/hda7
    raid-disk               1

On the other hand, the following configuration works just fine:

raiddev /dev/md0
    raid-level              1
    nr-raid-disks           2
    nr-spare-disks          0
    persistent-superblock   1
    chunk-size              64
    device                  /dev/hda5
    raid-disk               0
    device                  /dev/hda6
    raid-disk               1
raiddev /dev/md1
    raid-level              linear
    nr-raid-disks           2
    chunk-size              4
    persistent-superblock   0
    device                  /dev/hda8
    raid-disk               0
    device                  /dev/hda7
    raid-disk               1
raid0145-19990724-2.0.37 compiling problem
Applying raid0145-19990724-2.0.37 makes kernel 2.0.37 modules fail to compile if CONFIG_BLK_DEV_SR is selected as a module (no such problem with 2.2.12-pre4 for the same configuration):

gcc -D__KERNEL__ -I/usr/src/linux-2.0.37/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strength-reduce -pipe -m486 -malign-loops=2 -malign-jumps=2 -malign-functions=2 -DCPU=686 -DMODULE -c -o sr_ioctl.o sr_ioctl.c
ld -m elf_i386 -m elf_i386 -r -o sr_mod.o sr.o sr_ioctl.o
sr_ioctl.o(.data+0x0): multiple definition of `kernel_version'
sr.o(.data+0x0): first defined here
make[2]: *** [sr_mod.o] Error 1
make[2]: Leaving directory `/usr/src/linux-2.0.37/drivers/scsi'
make[1]: *** [modules] Error 2
make[1]: Leaving directory `/usr/src/linux-2.0.37/drivers'
make: *** [modules] Error 2
raid 0.90 a bit rough with 2.0.37 kernel
I just set up RAID 1 for the small disks (2 x 8 GB) on my production server. (For the large ones, 4 x 50 GB, I'll wait a bit, since reloading the data in case of failure would require feeding many, many CDs, whereas I have an additional disk-to-disk backup for the small disks.)

When booting the 2.0.37 kernel, the boot process stopped and asked for the root password in order to drop to a shell instead of booting normally. This is bad behaviour, since no RAID device appears in /etc/fstab, so a problem with the RAID devices should not prevent a normal boot. The result for me was very bad: I use 'vnc' to configure the server remotely, so stopping the normal boot process made me lose control of the server (the server is 100 miles away from me). Switching to a shell instead of a normal boot can be reasonable behaviour when the / partition is damaged, but not when there is a problem (moreover, in this case, a virtual one) on a partition which is not mounted during the boot process.

So I had somebody insert a floppy in order to boot with an NFS root, I renamed /etc/raidtab to /etc/raidtab0, and rebooted again from the hard disk. The boot process went fine, but then 'raidstart -a' complained with something like 'bad argument' or 'bad device' (sorry, I don't remember the exact message). Lastly, I switched to the 2.2.12-pre4 kernel and everything ran just fine.

Regards,
Hubert Tonneau