Hardware RAID and Linux software RAID on a real production server

2000-08-13 Thread Hubert Tonneau

This is a report about real production experience using both Linux
software RAID and a Mylex hardware RAID controller (real production
tends to be even harder than tests, even at a lower load, since
more unusual situations happen).

My production server has:
2 x 8 GB Linux software RAID 1, Buslogic BT958 controller (Ultra Wide SCSI at 40 MB/s).
4 x 45 GB Linux software RAID 5 (0 spare), Buslogic BT958 controller (Ultra Wide SCSI at 20 MB/s).
6 x 50 GB hardware RAID 5 (1 spare), Mylex AcceleRAID 250 (Ultra Wide LVD SCSI at 80 MB/s, also called Ultra2 SCSI).
All cables are short and of high quality, and the whole server is connected to a UPS.
There is also an air cooler to prevent excessive heat.

NB: the 4-disk Ultra Wide SCSI chain is running at only 20 MB/s, since non-LVD
Ultra SCSI is unreliable at 40 MB/s when there are more than 3 disks or
the cable is longer than 1.5 meters. I also noticed that the Buslogic controller
does not recover reliably from SCSI bus errors (it will sadly loop infinitely
on SCSI resets), as opposed to disk errors, so it's really important to set a
conservative SCSI bus speed.

The box ran like a charm for several months (under such a load, the previous
OS/2 server was crashing roughly once a week), but suddenly the Mylex
controller put all disks offline (dead) at once (while I was on holiday:
not only humans can be vicious). I just had to force them all back online,
check consistency, and it restarted.
A few weeks later, it put all disks offline at once again. This time, it
refused to come back nicely (the consistency check failed,
and it was putting everything back offline after only a few minutes).

So, I decided to remove the Mylex controller and put in a Tekram DC390U2W.
This is the model I selected from a small survey I did several months
ago on this mailing list, asking people what LVD SCSI controller
they were using and how happy they were with it. (Tekram came first, and
Adaptec second.)

The problem was how to read, using Linux software RAID 5, the data
written by the Mylex controller. I wrote a few test programs, and found that
the Mylex AcceleRAID controller uses what Linux software RAID calls the
right-asymmetric parity algorithm (you will find the relevant part of the
/etc/raidtab file I use at the end of this mail), so I could read the data
using Linux software RAID (mkraid --force --dangerous-no-resync /dev/md2).
Great.
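
For readers who are not familiar with the parity algorithm names, here is a
minimal sketch (in Python, not one of my test programs) of what
'right-asymmetric' means as the Linux software RAID driver defines it: for
stripe s over n disks, the parity chunk sits on disk s%n, and the data chunks
fill the remaining disks in increasing disk order. The function name is mine.

def right_asymmetric_stripe(stripe, n_disks):
    # Return (parity disk, ordered list of data disks) for one stripe.
    parity = stripe % n_disks
    data = [d for d in range(n_disks) if d != parity]
    return parity, data

# Example with 5 disks: the parity column starts on disk 0 and moves
# one disk to the right on each successive stripe.
for s in range(5):
    print(s, right_asymmetric_stripe(s, 5))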

So, I first checked all the disk surfaces, reading all the content
sequentially, one disk after the other ... and found no error!

Then, I checked the RAID 5 parity using an extended version of the Pliant RAID
conversion software (extended to handle the right-asymmetric parity algorithm)
and found only two corrupted chunks.
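
The check itself does not even need to know which chunk of a stripe holds the
parity: in RAID 5, the XOR of all the chunks of a stripe (data plus parity)
must be zero. A minimal sketch of such a check (in Python, not the Pliant tool
I actually used; the device names and chunk size are the ones from my raidtab,
shown only as an example):

from functools import reduce

CHUNK = 64 * 1024                       # 64 KB chunks
DISKS = ["/dev/sdh", "/dev/sdi", "/dev/sdj", "/dev/sdk", "/dev/sdl"]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def check_parity(disks=DISKS, chunk=CHUNK):
    files = [open(d, "rb") for d in disks]
    bad, stripe = [], 0
    while True:
        chunks = [f.read(chunk) for f in files]
        if any(len(c) < chunk for c in chunks):
            break                       # reached the end of the smallest device
        if reduce(xor, chunks) != bytes(chunk):
            bad.append(stripe)          # this stripe's parity is corrupted
        stripe += 1
    for f in files:
        f.close()
    return bad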

Third, I recomputed the MD5 checksums of all 175,000 files and
compared them to the values in the database that I keep up to date on
a second computer. Only three files were corrupted (so I will need to
insert only 3 CDs to get things fixed).
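
The MD5 database itself is nothing sophisticated; here is a minimal sketch of
the idea (in Python, not my actual tool), with the database being a plain
'path <tab> md5' text file kept on the second computer (names are examples):

import hashlib, os

def md5_of(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def scan(root):
    # Hash every file under root.
    db = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            db[path] = md5_of(path)
    return db

def corrupted(old_db_path, root):
    # Files present in both listings whose checksum changed.
    with open(old_db_path) as f:
        old = dict(line.rstrip("\n").split("\t") for line in f)
    new = scan(root)
    return [p for p, h in old.items() if p in new and new[p] != h]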

Lastly, I tried 'e2fsck -n' on the new Linux software RAID 5 partition
and discovered no error.

So, the final unanswered question is: why did the Mylex controller fail
that ungracefully if no disk contains dead blocks?
My experimental conclusion is that Linux software RAID is even more
reliable (the two RAID sets handled by Linux software have had no problem
since I use a suitably conservative bus speed), and much more flexible (it
enabled me to check the disks individually, and to freely select the right
RAID 5 configuration to match the existing data).
Thanks to Ingo and the others for this great work.

I hope that this fairly long story (combined with several others) can help
people decide more safely what they will trust most for their real work
data (an infinite incremental backup and an MD5 database are great
assets in any case), and know that it's possible to switch in the middle.


raiddev /dev/md2
raid-level            5
parity-algorithm      right-asymmetric
nr-raid-disks         5
nr-spare-disks        0
persistent-superblock 0
chunk-size            64

device                /dev/sdh
raid-disk             0
device                /dev/sdi
raid-disk             1
device                /dev/sdj
raid-disk             2
device                /dev/sdk
raid-disk             3
device                /dev/sdl
raid-disk             4



Re: Hardware RAID and Linux software RAID on a real production server

2000-08-13 Thread Hubert Tonneau

Leonard N. Zubkoff wrote:
 
 Generally, the Mylex PCI RAID controllers take disks offline when certain types
 of unrecoverable errors occur.  The driver will log the reason for any disk
 being killed as a console message.  Without further information as to precisely
 why the disks were taken offline and whether they all were taken offline
 simultaneously, it's hard to know what happened.  Firmware bugs in either the
 controller firmware or disk drives are a plausible reason, as would be a
 problem with the SCSI controller chip on the AcceleRAID, or an electrical
 problem on the SCSI bus.

I removed the Mylex controller, and since it's a production server,
I cannot do experiments on this one, so I cannot get any more information
(I'm sorry about that, because I find it very important to spend some time
helping maintainers fix things).
I also know that dmesg output is important for maintainers, so please find
below what I saved before removing the controller. Please also notice that
the output contains many lines related to the fact that I tried to
force the disks back online. I'm sorry about not having the very first error
messages, but there was so much output from the indirect troubles that
happened after the initial problem that the beginning of dmesg was already
truncated when I logged on to the machine.

The most significant message is probably:
  DAC960#0: Physical Drive 0:x killed because of bad tag returned from drive
but I don't find it meaningful at all, and since there is no source code
available to scan, I stopped trying to cope with this controller.

Also, if you want to test the controller yourself, I can ask my company
to send it and give it to you, since we are not going to use it any more.

In the coming months, I will use the same disk set, with the same cables
(except the one linking to the controller, since the pins are different),
driven using Linux software RAID and the Tekram DC390U2W, so I will
send you news about any failure that may happen.

I also discovered that the Mylex does not use the last megabyte of each
disk, so I can use
  persistent-superblock 1
in my new /etc/raidtab file.
This can be interesting for you to know, since it means that changing
from a Mylex AcceleRAID to Linux software RAID 0.90 can be done without
clearing the data.
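
For reference, here is a minimal sketch (in Python; the size is an
illustrative example) of where the RAID 0.90 persistent superblock lands, as I
understand the on-disk format: it lives in a 64 KB reserved area at the end of
each member device, at the last 64 KB boundary, which is why a controller that
never touches the last megabyte leaves room for it without clearing data.

MD_RESERVED = 64 * 1024          # reserved area at the end of each member device

def superblock_offset(device_size_bytes):
    # Byte offset of the md 0.90 superblock: the last 64 KB-aligned block.
    return (device_size_bytes & ~(MD_RESERVED - 1)) - MD_RESERVED

size = 50 * 1000 * 1000 * 1000   # a "50 GB" disk, as an example
print(size - superblock_offset(size))   # distance from the end: under 128 KB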

Regards, and many thanks for the great work that you do, even if my
personal experience is leading me to drop all sophisticated devices
and rather use simpler ones, with the sophisticated features performed
by free (source code available) software that I can read
in case of failure.

Hubert Tonneau

* DAC960 RAID Driver Version 2.2.4 of 23 August 1999 *
Copyright 1998-1999 by Leonard N. Zubkoff [EMAIL PROTECTED]
Configuring Mylex DAC960PTL1 PCI RAID Controller
  Firmware Version: 4.06-0-60, Channels: 1, Memory Size: 8MB
  PCI Bus: 0, Device: 13, Function: 1, I/O Address: Unassigned
  PCI Address: 0xF680 mapped at 0xD000, IRQ Channel: 9
  Controller Queue Depth: 128, Maximum Blocks per Command: 128
  Driver Queue Depth: 127, Maximum Scatter/Gather Segments: 33
  Stripe Size: 64KB, Segment Size: 8KB, BIOS Geometry: 128/32
  Physical Devices:
0:1  Vendor: SEAGATE   Model: ST150176LC  Revision: 0001
 Serial Number: NQ05082119480B8Z
0:2  Vendor: SEAGATE   Model: ST150176LC  Revision: 0001
 Serial Number: NQ051852194804HG
 Disk Status: Dead, 97691648 blocks, 4 resets
0:3  Vendor: SEAGATE   Model: ST150176LC  Revision: 0001
 Serial Number: NQ0516021948K2Z7
 Disk Status: Dead, 97691648 blocks, 4 resets
0:4  Vendor: SEAGATE   Model: ST150176LC  Revision: 0001
 Serial Number: NQ0105171948JQZA
 Disk Status: Dead, 97691648 blocks, 4 resets
0:5  Vendor: SEAGATE   Model: ST150176LC  Revision: 0001
 Serial Number: NQ05085919480B5E
 Disk Status: Dead, 97691648 blocks, 4 resets
0:6  Vendor: SEAGATE   Model: ST150176LC  Revision: 0001
 Serial Number: NQ0282161948JQMF
 Disk Status: Dead, 97691648 blocks, 4 resets
  Logical Drives:
/dev/rd/c0d0: RAID-5, Offline, 390766592 blocks, Write Back
  No Rebuild or Consistency Check in Progress

DAC960#0: Make Online of Physical Drive 0:6 Succeeded
DAC960#0: Physical Drive 0:6 is now ONLINE
DAC960#0: Make Online of Physical Drive 0:6 Illegal
DAC960#0: Make Online of Physical Drive 0:2 Succeeded
DAC960#0: Physical Drive 0:2 is now ONLINE
DAC960#0: Make Online of Physical Drive 0:3 Succeeded
DAC960#0: Physical Drive 0:3 is now ONLINE
DAC960#0: Make Online of Physical Drive 0:1 Illegal
DAC960#0: Make Online of Physical Drive 0:4 Succeeded
DAC960#0: Make Online of Physical Drive 0:5 Succeeded
DAC960#0: Physical Drive 0:4 is now ONLINE
DAC960#0: Physical Drive 0:5 is now ONLINE
DAC960#0: Logical Drive 0 (/dev/rd/c0d0) is now ONLINE
DAC960#0: Make Online of Physical Drive 0:6 Illegal
DAC960#0: Make Online of Physical Drive 0:1 Illegal
 rd/c0d0

Reconfiguring a RAID system without data loss

1999-10-01 Thread Hubert Tonneau

Release 2 of the Pliant utility for reconfiguring a software RAID (adding
disks, changing the RAID level or the chunk size) without losing data (at
your own risk: test it on sample data before using it on real data) is
available.
The new code is even expected to survive a disk failure in the middle
of the RAID reconfiguration (provided the RAID level is 1 or 5 :-) )

The script will be included in Pliant release 28, which will be available
at http://pliant.cams.ehess.fr/ in a few days.

If some of you want to stress test it, I can send a snapshot by email now,
because the changes that will happen in Pliant before release 28 have
nothing to do with this RAID script.

-

Ingo wrote that this script should be written in C for several reasons:
- speed
- portability

These are my answers to these arguments:
- Although it would be faster if written in C, because the Pliant code
  generator is not as efficient as GCC, this is not that important because:
  . there is nothing in Pliant itself that makes it slower than C, so at
    some point the speed will be exactly the same as with C.
  . on a 300 MHz processor, the conversion speed could be about 10 MB/s,
    so it is completely I/O bound on my laptop.
- The portability argument is more valuable, because Pliant currently
  runs only on i386 (an Alpha port is planned for next year, but other
  ports are not scheduled yet).
- There is an additional strong argument for writing the script in C:
  Pliant is a young language that still changes a lot.

Now, these are my arguments for writing the script in Pliant:
- Compiling a C program is a big problem because a C compiler is dumb:
  you need a set of extra tools (make, ./configure scripts, ...) in
  order to cope with the differences between various machines.
  On the other hand, a Pliant script is always run directly from
  its source code (Pliant is what I call a dynamic compiler; you
  could also read 'on the fly', or 'not stupid'), so for free software
  it's definitely better because you get rid of the binaries
  nightmare.
- In standard C, there is no provision for 64-bit arithmetic,
  whereas in Pliant, unlimited integers are a basic feature.
- A Pliant program is debugged faster, because when you set the
  debugging level to 2 or more, all arithmetic overflows are
  reported: this was important for such an application.
- The Pliant code is probably shorter than the C code would be.
- I prefer Pliant (I designed it :-) )

Ingo also suggested that at some point this conversion should be
done on the fly in kernel space.
This would be the ultimate refinement, but it is not worth the
trouble (except for changing the chunk size) until we get the ability
to resize an ext2 partition without unmounting it. That seems to be
already possible, but with strong restrictions, so, for the moment, the
'ext2resize' utility seems to be the safest solution.
Now, believing that no Pliant code will ever go into the Linux kernel
might also be wrong in the long term. My opinion is that it may well
happen the other way round at some point: Pliant would be the boot
loader that loads and compiles the Linux kernel on the fly. Pliant has
provision for several syntaxes, so the parts of Linux that are written
in C would be seen as Pliant programs: a Pliant C parser module would
make it transparent. Pliant would then bring flexibility to the Linux
kernel: the ability to compile some drivers on the fly when the hardware
is detected or some network frames are received, or to recompile on the
fly with different optimisation options once some statistics have been
collected, and the cost would only be a bigger kernel (because Pliant
contains the compiler and some extra information).
Lastly, the Pliant compiling machinery is much more powerful than that
of C compilers, so on a large project such as the Linux kernel you can
get much cleaner and easier to maintain code, even while keeping the C
syntax.

-

As a conclusion, I will maintain the script, in Pliant, until somebody
rewrites it in C.
Ingo, please don't include the script in the RAID user-level tools,
since it may need to be adjusted if I make more changes in Pliant itself:
just mention that it's available, plus the Pliant URL. I may also send
you a small HTML page explaining how to use it, which could be included
in the documentation of the user-level tools.

This is a sample usage of the script:
pliant module /pliant/admin/raid.pli command 'raid_convert "/dev/md0"
  "/dev/hda5 /dev/hda6 /dev/hda7" "/dev/hda5 /dev/hda6 /dev/hda7 /dev/hda8" 5 64*2^10'

The fourth parameter (5) is the requested RAID level.
The fifth parameter (64*2^10 = 64 KB) is the requested chunk size.
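
For those who prefer to see the idea rather than read the Pliant source, here
is a minimal sketch (in Python, not the script itself) of the core remapping:
each logical data chunk is located under the old geometry and rewritten at its
place under the new geometry. The layout function below is the usual
rotating-parity RAID 5 layout, shown only as an illustration; the sketch
assumes the same chunk size on both sides and ignores the ordering needed for
safe in-place copies, RAID level changes, and error handling, all of which the
real script has to care about. Parity on the new array can then be rebuilt by
the resync that mkraid triggers.

def raid5_map(logical_chunk, n_disks):
    # Map a logical data chunk to (disk index, stripe index) for a
    # rotating-parity RAID 5 layout over n_disks members.
    data_disks = n_disks - 1
    stripe = logical_chunk // data_disks
    parity = (n_disks - 1) - stripe % n_disks
    disk = (parity + 1 + logical_chunk % data_disks) % n_disks
    return disk, stripe

def convert(read_chunk, write_chunk, n_chunks, old_disks, new_disks):
    # read_chunk / write_chunk are callables taking (disk, stripe);
    # they would read from / write to the underlying devices.
    for logical in range(n_chunks):
        write_chunk(raid5_map(logical, new_disks),
                    read_chunk(raid5_map(logical, old_disks)))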

Regards,
Hubert Tonneau



RE: question about adding a disk: Now possible

1999-09-18 Thread Hubert Tonneau

This is a Pliant (http://pliant.cams.ehess.fr/) script that should enable
you to make changes in a RAID configuration with (or with the hope of)
no data loss.

Let's take an example:

The old /etc/raidtab configuration file is:

raiddev /dev/md0
raid-level            5
nr-raid-disks         3
nr-spare-disks        0
persistent-superblock 1
chunk-size            64

device                /dev/hda5
raid-disk             0
device                /dev/hda6
raid-disk             1
device                /dev/hda7
raid-disk             2

The new one you want is:

raiddev /dev/md0
raid-level            5
nr-raid-disks         4
nr-spare-disks        0
persistent-superblock 1
chunk-size            4

device                /dev/hda5
raid-disk             0
device                /dev/hda6
raid-disk             1
device                /dev/hda7
raid-disk             2
device                /dev/hda8
raid-disk             3

1) Run the following Pliant command:
pliant module /sample/raidconvert.pli command 'raid_convert "/dev/md0"
  "/dev/hda5 /dev/hda6 /dev/hda7" "/dev/hda5 /dev/hda6 /dev/hda7 /dev/hda8" 5 4*1024'
- parameter 2 of 'raid_convert' is the list of the RAID devices
  in the old RAID configuration (spare disks should not be listed)
- parameter 3 is the list of the RAID devices
  in the new RAID configuration (spare disks should not be listed)
- parameter 4 is the RAID level of the new RAID configuration
- parameter 5 is the new chunk size (in bytes)

2) Modify your /etc/raidtab

3) Use the 'mkraid' command in order to recreate the new RAID array
   (your data should be preserved).

This script should also enable you to remove some disks, or to change the
RAID level or the chunk size, BUT TEST IT ON SAMPLES BEFORE APPLYING IT
TO ANY SERIOUS DATA.

You must use the 'ext2resize' command BEFORE step 1 if your new RAID array
will be smaller than the old one, or after step 3 if it's bigger.
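
Whether the new array is smaller is just a matter of comparing usable sizes;
here is a minimal sketch of the RAID 5 capacity rule used for that decision
(in Python, not part of the script; the member sizes are made-up examples):

def raid5_capacity_kb(n_disks, smallest_member_kb, chunk_kb):
    # Usable RAID 5 capacity: (n-1) members' worth of space, each member
    # rounded down to a whole number of chunks.
    per_member = (smallest_member_kb // chunk_kb) * chunk_kb
    return (n_disks - 1) * per_member

old = raid5_capacity_kb(3, 2000000, 64)   # 3 members of ~2 GB, 64 KB chunks
new = raid5_capacity_kb(4, 2000000, 4)    # 4 members of ~2 GB, 4 KB chunks
print("ext2resize BEFORE step 1" if new < old else "ext2resize after step 3")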

You should also be aware that if anything goes wrong during the
conversion (any I/O error), the program will abort ungracefully and
all data will be lost.

Good luck, brave people.
Hubert Tonneau


# Copyright (C) 1999  Hubert Tonneau  [EMAIL PROTECTED]
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License version 2
# as published by the Free Software Foundation.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# version 2 along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.

# release 1 for Pliant release 25

module "/pliant/v1.pli"
module "/pliant/meta.pli"

constant sector_size 512

function os_llseek handle high low result whence - err
  arg Int handle ; arg uInt high low ; arg Address result ; arg Int whence ; arg Int err
  kernel_function 140
  

# pgcd = greatest common divisor (Euclid's algorithm)
function pgcd a b - g
  arg Intn a b g
  var Intn x := a
  var Intn y := b
  g := y
  while x>0
    g := x
    x := y%x
    y := g

# ppcm = least common multiple
function ppcm a b - m
  arg Intn a b m
  m := a*b\(pgcd a b)

function min a b - c
  arg Intn a b c
  c := shunt a<=b a b


type Device
  field Int handle - -1
  field Str name

method d open device_name
  arg_w Device d ; arg Str device_name
  d name := device_name
  var Str namez := device_name+"[0]"
  d handle := os_open namez:characters 2 0
  if d:handle<0
    error "Failed to open device "+d:name

method d close
  arg_w Device d 
  if (os_close d:handle)<0
    error "Failed to close device "+d:name
  d handle := -1

method d seek position
  arg Device d ; arg Intn position
  if position%sector_size>0
    error "Misaligned seek applied to device "+d:name+" ("+(cast position Str)+")"
  var uInt high := cast position\(cast 2 Intn)^32 uInt
  var uInt low := cast position%(cast 2 Intn)^32 uInt
  check high*(cast 2 Intn)^32+low=position
  if (os_llseek d:handle high low addressof:(var uInt64 result) 0)<0
    error "Failed to set position for device "+d:name

method d read buffer size
  arg Device d ; arg Address buffer ; arg Int size
  var Int red := os_read d:handle buffer size
  if red<size
    error "Failed to read from device "+d:name

method d write buffer size - status
  arg Device d ; arg Address buffer ; arg Int size ; arg Status status
  var Int written := os_write d:handle buffer size
  if written<size
    error "Failed to write to device "+d:name


type Raid
  field Str device_name
  field Array:Device devices
  field Int level
  

Reliable SCSI LVD controller for Linux?

1999-09-07 Thread Hubert Tonneau

What is the most reliable LVD SCSI controller for Linux?

(I use several Buslogic controllers, but as far as I know they don't
 have an LVD version, which is absolutely necessary for long SCSI chains,
 and my Buslogic controllers went into an infinite reset loop several times,
 which RAID cannot protect against.
 I also tested a Mylex AcceleRAID with its integrated RAID software,
 but it's expensive and not very flexible (you cannot remotely change
 the RAID configuration, since the RAID configuration program is
 accessible at boot time only, and as far as I know, you cannot have
 different SCSI channels use different SCSI speeds, and you cannot
 have a RAID set involving disks on several controllers), and lastly
 their RAID software is not open, so it's hard to trust it. On the
 other hand, I believe that you can add drives to existing RAID sets.)



linear over raid1 deadlocks

1999-09-07 Thread Hubert Tonneau

With the following configuration, any attempt to access /dev/md1 will
lock the process in D (disk sleep) state: 

raiddev /dev/md0
raid-level            1
nr-raid-disks         2
nr-spare-disks        0
persistent-superblock 1
chunk-size            64

device                /dev/hda5
raid-disk             0
device                /dev/hda6
raid-disk             1

raiddev /dev/md1
raid-level  linear
nr-raid-disks   2
chunk-size  4

persistent-superblock 0
device  /dev/md0
raid-disk   0
device  /dev/hda7
raid-disk   1

On the other hand, the following configuration works just fine:

raiddev /dev/md0
raid-level            1
nr-raid-disks         2
nr-spare-disks        0
persistent-superblock 1
chunk-size            64

device                /dev/hda5
raid-disk             0
device                /dev/hda6
raid-disk             1

raiddev /dev/md1
raid-level  linear
nr-raid-disks   2
chunk-size  4

persistent-superblock 0
device  /dev/hda8
raid-disk   0
device  /dev/hda7
raid-disk   1



raid0145-19990724-2.0.37 compiling problem

1999-08-16 Thread Hubert Tonneau

Applying raid0145-19990724-2.0.37 makes the kernel 2.0.37 modules fail
to compile if CONFIG_BLK_DEV_SR is selected as a module.
(There is no problem with 2.2.12-pre4 for the same configuration.)

gcc -D__KERNEL__ -I/usr/src/linux-2.0.37/include -Wall -Wstrict-prototypes -O2
-fomit-frame-pointer -fno-strength-reduce -pipe -m486 -malign-loops=2
-malign-jumps=2 -malign-functions=2 -DCPU=686 -DMODULE  -c -o sr_ioctl.o
sr_ioctl.c
ld -m elf_i386 -m elf_i386 -r -o sr_mod.o sr.o sr_ioctl.o
sr_ioctl.o(.data+0x0): multiple definition of `kernel_version'
sr.o(.data+0x0): first defined here
make[2]: *** [sr_mod.o] Error 1
make[2]: Leaving directory `/usr/src/linux-2.0.37/drivers/scsi'
make[1]: *** [modules] Error 2
make[1]: Leaving directory `/usr/src/linux-2.0.37/drivers'
make: *** [modules] Error 2



raid 0.90 a bit rough with 2.0.37 kernel

1999-08-16 Thread Hubert Tonneau

I just installed RAID 1 for the small disks (2 x 8 GB)
on my production server (for the large ones, 4 x 50 GB, I'll wait a bit,
since reloading the data in case of failure would require feeding many,
many CDs, whereas I have an additional disk-to-disk backup for the small
disks).

When booting the 2.0.37 kernel, the boot process stopped and asked for the
root password in order to get to a shell, instead of booting normally.
This is bad behaviour, since no RAID device appears in /etc/fstab, so
a problem with the RAID drives should not prevent a normal boot. The
result for me was very bad, since I use 'vnc' to configure the
server remotely, so stopping the normal boot process made me lose control
of the server (the server is 100 miles away from me).
Switching to a shell instead of a normal boot can be reasonable
behaviour when the / partition is damaged, but not when there is
a problem (moreover, in this case a virtual one) on a partition
which is not mounted during the boot process.

So I had somebody insert a floppy in order to boot with an NFS root,
I renamed /etc/raidtab to /etc/raidtab0, and I rebooted again from the hard
disk.
The boot process went fine, but then 'raidstart -a' complained with
something like 'bad argument' or 'bad device' (sorry, I don't remember the
exact message).

Lastly, I switched to the 2.2.12-pre4 kernel and everything ran just fine.

Regards,
Hubert Tonneau