Re: product testimonials
> Notice that it checks every 3 seconds, but emails every 10 minutes
> (prevents the inbox from filling up overnight).
>
> What does it look like when a drive dies?  I presume something like:
>
> [..UD]
>
> Then, perhaps just doing a (Perl) regexp: if (/\[[^\]]*D[^\]]*\]/)
> then report the failure?

What I've seen is that it looks like this:

[UU_UU]

until the drive is marked as dead, and then it changes to:

[UUDUU]

(I believe.) I'd want to know about the _ until otherwise noted, and I'd
want to be able to touch an ignore file so if it's rebuilding I don't hear
about it every few minutes.

I'll see what I can hack up. It wouldn't be a bad idea to put a few of
these up on a raid-related website so people could see their options.

-sv
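A minimal sketch of the check Seth describes: flag any `_` or `D` inside the
mdstat status brackets, and stay quiet if an ignore file has been touched
(e.g. during a rebuild). The file paths and the ignore-file name are
illustrative assumptions, not from the thread; the sample line stands in for
a real /proc/mdstat.

```shell
#!/bin/sh
# Sketch only: flag any non-U character inside the [UU...] status
# brackets of an mdstat-style file, unless an ignore file exists.
# RAID_STAT and IGNORE_FILE paths are assumptions, not from the thread.
RAID_STAT="./mdstat.sample"
IGNORE_FILE="./raid-monitor.ignore"

# Build a sample "degraded" mdstat line so the sketch is self-contained.
echo 'md0 : active raid5 sda1[0] sdb1[1] sdc1[2] [5/4] [UU_UU]' > "$RAID_STAT"

if [ -e "$IGNORE_FILE" ]; then
    # Operator touched the ignore file (rebuild in progress): stay quiet.
    echo "ignored"
elif grep -E '\[U*[_D][U_D]*\]' "$RAID_STAT" >/dev/null; then
    # A bracket group containing _ (missing) or D (dead) means trouble.
    echo "RAID degraded"
else
    echo "RAID OK"
fi
rm -f "$RAID_STAT"
```

The regexp deliberately ignores the per-device `sda1[0]` brackets, since
those never contain `_` or `D`.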
Re: product testimonials
On Mon, Mar 20, 2000 at 10:36:40AM -0800, [EMAIL PROTECTED] wrote:
> while true
> do
>   sleep 3
>   if [ -n "`cat $RAID_STAT | perl -ne 'if (/(.*\[U*[^\]\[U]+U*\])$/) { print \"Failure! $1\n\"; }'`" ]
>   then
>     cat $RAID_STAT | mail -s " Raid Failure Warning " $ADMIN_EMAIL
>     sleep 600
>   fi
> done

Ugh. Why would you want to check every _3 seconds_ for a disk failure?
That's a lot of wasted cycles.

Instead of forking a few different programs like that, how about a cronjob
that runs every 5 minutes and just diffs mdstat and a copy of a good
mdstat:

*/5 * * * * /usr/bin/diff /mdstat.good /proc/mdstat

Just do "cp /proc/mdstat /mdstat.good", and that's it. You'll get a mail
if there's a difference (including failure), and no mail if they're the
same. You still get mailed every 5 minutes, but you don't use quite so
many cycles and processes to do it.

(Worst case, you have a 5-10 minute window with a failed disk. If you're
worried about it, change the */5 to * ...)

You could also throw this into a monitor using "mon"
(http://www.kernel.org/software/mon), which you can set to mail you once
an hour/etc.

--
Randomly Generated Tagline:
I'd love to, but I want to spend more time with my blender.
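The baseline-diff idea above can be sketched with ordinary files standing in
for /proc/mdstat (all paths here are illustrative). Cron mails a job's
output, so the warning mechanism is simply that diff prints something when
the current state has drifted from the saved-good copy:

```shell
#!/bin/sh
# Sketch of the diff-against-baseline idea; paths are assumptions.
GOOD="./mdstat.good"
CUR="./mdstat.cur"

# Simulate "cp /proc/mdstat /mdstat.good" with a sample healthy state.
echo 'md0 : active raid1 hda1[0] hdb1[1] [2/2] [UU]' > "$GOOD"
# Later, a degraded state appears in the live file:
echo 'md0 : active raid1 hda1[0] hdb1[1](F) [2/1] [U_]' > "$CUR"

# diff exits non-zero and prints the change when the states differ;
# run from cron, that output becomes the warning mail.
if ! diff "$GOOD" "$CUR"; then
    echo "mdstat changed -- check the array"
fi
rm -f "$GOOD" "$CUR"
```

One design note: after replacing a drive you have to re-run the
"cp /proc/mdstat /mdstat.good" step, or the diff will keep firing forever.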
Re: product testimonials
On Mon, Mar 20, 2000 at 10:29:28AM -0500, Seth Vidal wrote:
>[...]
> On the subject of semi-silent failures: Has anyone written a script to
> monitor the [U]'s in the /proc/mdstat location? It would be fairly
> trivial (start beeping the system speaker loudly and emailing
> repetitively) Has this already been done, or should I work on it?

OK, as usual I've taken somebody else's idea and run with it. Here's a
little script daemon which monitors the raids for failures:

---SOF---
#!/bin/bash

PATH="/bin:/usr/bin:/usr/local/bin:${PATH}"
ADMIN_EMAIL="root@localhost"
RAID_STAT="/proc/mdstat"

while true
do
  sleep 3
  if [ -n "`cat $RAID_STAT | perl -ne 'if (/(.*\[U*[^\]\[U]+U*\])$/) { print \"Failure! $1\n\"; }'`" ]
  then
    cat $RAID_STAT | mail -s " Raid Failure Warning " $ADMIN_EMAIL
    sleep 600
  fi
done
---EOF---

(Watch for wrapped lines. There are a couple long ones.)

If you want to test it, do this:

cat /proc/mdstat > testfile

then change RAID_STAT to "testfile". Start the daemon and let it run.
After a while, simulate the failure by editing 'testfile'. If it works,
kill the daemon and point RAID_STAT back at the real file (/proc/mdstat).

To use it in rc.local (or wherever), just start it like this:

su nobody raid-monitor.sh &

(Or change 'nobody' to some other unprivileged user of your choice.)


Phil

--
Philip Edelbrock -- IS Manager -- Edge Design, Corvallis, OR
[EMAIL PROTECTED] -- http://www.netroedge.com/~phil
 PGP F16: 01 D2 FD 01 B5 46 F4 F0 3A 8B 9D 7E 14 7F FB 7A
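The 'testfile' simulation Phil describes can be exercised without touching a
real array: feed a healthy and a degraded mdstat-style status line through
the same Perl pattern the daemon uses. The sample lines are illustrative,
not captured from a real /proc/mdstat.

```shell
#!/bin/sh
# Run the daemon's Perl failure pattern over simulated mdstat lines.
# It matches only when the [U...] bracket group contains a non-U slot.
check() {
    echo "$1" | perl -ne 'if (/(.*\[U*[^\]\[U]+U*\])$/) { print "Failure! $1\n"; }'
}

check '      983040 blocks [2/2] [UU]'   # healthy: prints nothing
check '      983040 blocks [2/1] [U_]'   # degraded: prints a Failure! line
```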
Re: product testimonials
On Mon, Mar 20, 2000 at 10:29:28AM -0500, Seth Vidal wrote:
>[...]
> On the subject of semi-silent failures: Has anyone written a script to
> monitor the [U]'s in the /proc/mdstat location? It would be fairly
> trivial (start beeping the system speaker loudly and emailing
> repetitively) Has this already been done, or should I work on it?

Hmm, good idea! I made a simple shell script a while ago to monitor
lm-sensors stuff (fan speeds, power supplies, temperatures, etc... see
www.lm-sensors.nu). It wasn't too hard. In rc.local, I just started it in
the background and it would email me if things went awry. Here's what it
looked like:

---SOF---
PATH="/bin:/usr/bin:/usr/local/bin:${PATH}"
ADMIN_EMAIL="root@localhost"

if [ -n "`sensors | grep ALARM`" ]
then
  echo "Pending Alarms on start up!  Exiting!"
  exit
fi

while true
do
  sleep 3
  sensors | /root/bin/getsensors.pl | /root/bin/displayit.pl
  if [ -n "`sensors | grep ALARM`" ]
  then
    sensors | mail -s " Hardware Health Warning " $ADMIN_EMAIL
    sleep 600
  fi
done
---EOF---

Btw, that line just after the 'sleep 3' is to display stuff on an LCD
panel in a 5.25" drive, so ignore that.

Notice that it checks every 3 seconds, but emails every 10 minutes
(prevents the inbox from filling up overnight).

What does it look like when a drive dies? I presume something like:

[..UD]

Then, perhaps just doing a (Perl) regexp: if (/\[[^\]]*D[^\]]*\]/)
then report the failure?


Phil

--
Philip Edelbrock -- IS Manager -- Edge Design, Corvallis, OR
[EMAIL PROTECTED] -- http://www.netroedge.com/~phil
 PGP F16: 01 D2 FD 01 B5 46 F4 F0 3A 8B 9D 7E 14 7F FB 7A
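Phil's proposed pattern can be tried on sample status strings. Note that
elsewhere in the thread Seth reports a failed slot shows up as `_` first
and only later as `D`, so a D-only match may miss the early warning; the
sample strings below are illustrative.

```shell
#!/bin/sh
# Try Phil's proposed dead-drive pattern on sample status strings.
# \[[^]]*D[^]]*\] : a bracket group containing a 'D' anywhere inside.
for s in '[..UD]' '[UUDUU]' '[UUUUU]' '[UU_UU]'; do
    if echo "$s" | grep -E '\[[^]]*D[^]]*\]' >/dev/null; then
        echo "$s -> failure"
    else
        echo "$s -> ok"    # note: [UU_UU] slips through a D-only match
    fi
done
```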
Re: product testimonials
> Well, we've been using assorted versions of the 0.90 raid code for over a
> year in a couple of servers. We've had mostly good success with both the
> raid1 and raid5 code. I don't have any raid5 disk failure stories (yet
> ;-), but we are using EIDE drives so I expect one before TOO long ;-)
>
> Raid5 has given us good performance and reliability so far. Now, we do
> have a raid1 array that did something interesting (and bad). One of the
> drives was failing intermittently (and fairly silently) and had been
> removed from the array. Unfortunately, backups were also failing (again,
> fairly silently). On a normal power down / reboot, it appears that the
> wrong drive was marked as master and on the reboot it re-synced to the
> drive that had been out of the array for a couple of months. (yeah, yeah,
> we need a sys-admin ;-) Anyway, 2 months of data went down the tubes. No
> level of raid is a replacement for good backups.

On the subject of semi-silent failures: Has anyone written a script to
monitor the [U]'s in the /proc/mdstat location? It would be fairly
trivial (start beeping the system speaker loudly and emailing
repetitively). Has this already been done, or should I work on it?

Thanks
-sv
Re: product testimonials
Well, we've been using assorted versions of the 0.90 raid code for over a
year in a couple of servers. We've had mostly good success with both the
raid1 and raid5 code. I don't have any raid5 disk failure stories (yet
;-), but we are using EIDE drives so I expect one before TOO long ;-)

Raid5 has given us good performance and reliability so far. Now, we do
have a raid1 array that did something interesting (and bad). One of the
drives was failing intermittently (and fairly silently) and had been
removed from the array. Unfortunately, backups were also failing (again,
fairly silently). On a normal power down / reboot, it appears that the
wrong drive was marked as master and on the reboot it re-synced to the
drive that had been out of the array for a couple of months. (yeah, yeah,
we need a sys-admin ;-) Anyway, 2 months of data went down the tubes. No
level of raid is a replacement for good backups.

Keith

P.S. That system was 2.0.0-pre3, I think, with the matching raidtools.

On Sun, 19 Mar 2000, Seth Vidal wrote:
> Hi folks,
> I've got a user in my dept who is thinking about using software raid5
> (after I explained the advantages to them) - but they want "testimonials"
> ie: - people who have used software raid5 under linux and have had it save
> their ass or have had it work correctly and keep them from a costly backup
> restore. IE: success stories. Also I would like to hear some failure
> stories too - sort of horror stories - now the obscure situations I don't
> care about - if you got an axe stuck in your drive by accident and it
> killed the entire array then I feel sorry for you but I don't consider
> that average use.
>
> Can anyone give some testimonials on the 0.90 raid?
> thanks
>
> -sv
Re: product testimonials
I have not tested RAID5, but RAID1 has spared me much. One of the HDs
failed, and I simply tugged it out (it was the master disk, so I had to
change the jumpers). I copy the MBR and the root partition after every
major change (because they are not raided), so it just booted up in
degraded mode. When I got the second HD back, I plugged it in, did a
disk-wide dd, and also ran the appropriate raid commands, and lo! it
started working like a charm!

This was on two similar disks though (IBM 7200rpm), run from the standard
DMA controller on a BE6-2. Was a pain to set up, though...

_
|  Martin Eriksson <[EMAIL PROTECTED]>
|  http://www.fysnet.nu/  |  ICQ: 5869159

----- Original Message -----
From: "Seth Vidal" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, March 20, 2000 4:16 AM
Subject: product testimonials

> Hi folks,
> I've got a user in my dept who is thinking about using software raid5
> (after I explained the advantages to them) - but they want "testimonials"
> ie: - people who have used software raid5 under linux and have had it save
> their ass or have had it work correctly and keep them from a costly backup
> restore. IE: success stories. Also I would like to hear some failure
> stories too - sort of horror stories - now the obscure situations I don't
> care about - if you got an axe stuck in your drive by accident and it
> killed the entire array then I feel sorry for you but I don't consider
> that average use.
>
> Can anyone give some testimonials on the 0.90 raid?
> thanks
>
> -sv
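Martin's MBR-backup step can be sketched with ordinary files standing in
for the disk devices (the real thing would read /dev/hda or similar, which
is why this sketch fakes it; all file names are illustrative). The MBR is
the first 512-byte sector: boot code plus the partition table.

```shell
#!/bin/sh
# Sketch of the "copy the MBR" idea, using files in place of devices;
# dd semantics are identical against a real /dev/hda.
dd if=/dev/urandom of=./disk-a bs=512 count=4 2>/dev/null  # fake source disk
dd if=./disk-a of=./mbr.backup bs=512 count=1 2>/dev/null  # save sector 0
dd if=./mbr.backup of=./disk-b bs=512 count=1 2>/dev/null  # seed replacement
# First 512 bytes of both "disks" should now be identical.
cmp -n 512 ./disk-a ./disk-b && echo "MBR copied"
rm -f ./disk-a ./disk-b ./mbr.backup
```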