Re: product testimonials

2000-03-21 Thread Seth Vidal

> Notice that it checks every 3 seconds, but emails every 10 minutes
> (prevents the inbox from filling up overnight).
> 
> What does it look like when a drive dies?  I presume something like:
> 
> [..UD]
> 
> Then, perhaps just doing a (Perl) regexp: if (/\[[^\]]*D[^\]]*\]/)
> then report the failure?

what I've seen is that it looks like this:
[UU_UU] until the drive is marked as dead.

and then it changes to:
[UUDUU] (I believe)

I'd want to know about the _ until otherwise noted.

and I'd want to be able to touch an ignore file, so that if the array is
rebuilding I don't hear about it every few minutes.

I'll see what I can hack up.
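
Something along these lines, maybe (a completely untested sketch; the
regexp, the check interval, and the /etc/raidmon.ignore path are all just
guesses on my part):

---SOF---
#!/bin/bash
# Untested sketch: warn whenever the mdstat status string contains a '_',
# but stay quiet while an ignore file exists (e.g. during a rebuild).
# The regexp, the 60 second interval, and the ignore-file path are guesses.

ADMIN_EMAIL="root@localhost"
RAID_STAT="/proc/mdstat"
IGNORE_FILE="/etc/raidmon.ignore"

while true
do
  sleep 60
  # 'touch /etc/raidmon.ignore' silences it while an array rebuilds
  if [ ! -e "$IGNORE_FILE" ] && grep -q '\[U*_[U_]*\]' $RAID_STAT
  then
    cat $RAID_STAT | mail -s " Raid Degraded Warning " $ADMIN_EMAIL
    sleep 600
  fi
done
---EOF---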

it wouldn't be a bad idea to put a few of these up on a raid-related
website so people could see their options.

-sv





Re: product testimonials

2000-03-20 Thread Theo Van Dinter

On Mon, Mar 20, 2000 at 10:36:40AM -0800, [EMAIL PROTECTED] wrote:
> while true
> do
>  sleep 3
>  if [ -n "`cat $RAID_STAT | perl -ne 'if (/(.*\[U*[^\]\[U]+U*\])$/) { print \"Failure! $1\n\"; }'`" ]
>  then
> cat $RAID_STAT | mail -s " Raid Failure Warning " $ADMIN_EMAIL
> sleep 600
>  fi
> done

Ugh.  Why would you want to check every _3 seconds_ for a disk failure?
That's a lot of wasted cycles.  Instead of forking a few different programs
like that, how about:

a cronjob that runs every 5 minutes and just diffs /proc/mdstat against a
saved known-good copy:

*/5 * * * * /usr/bin/diff /mdstat.good /proc/mdstat

Just do "cp /proc/mdstat /mdstat.good", and that's it.  You'll get a mail if
there's a difference (including failure), and no mail if they're the same.
If something does fail you'll still get mailed every 5 minutes, but you don't use quite so many cycles
and processes to do it.  (worst case, you have a 5-10 minute window with a
failed disk.  If you're worried about it, change the */5 to * ...)  You could
also throw this into a monitor using "mon"
(http://www.kernel.org/software/mon) which you can set to mail you once an
hour/etc.
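
Putting it together, the whole setup is just this (a sketch; the MAILTO
line assumes Vixie cron, drop it if your cron differs):

# run once, while all the disks are up:
cp /proc/mdstat /mdstat.good

# then in the crontab:
MAILTO=root
*/5 * * * * /usr/bin/diff /mdstat.good /proc/mdstat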

-- 
Randomly Generated Tagline:
I'd love to, but I want to spend more time with my blender.



Re: product testimonials

2000-03-20 Thread phil

On Mon, Mar 20, 2000 at 10:29:28AM -0500, Seth Vidal wrote:
>[...] 
> On the subject of semi-silent failures: Has anyone written a script to
> monitor the [U]'s in the /proc/mdstat location? It would be fairly
> trivial (start beeping the system speaker loudly and emailing
> repetitively) Has this already been or should I work on it?

OK, as usual I've taken somebody else's idea and ran with it.  Here's
a little script daemon which monitors the raids for failures:

---SOF---
#!/bin/bash

PATH="/bin:/usr/bin:/usr/local/bin:${PATH}"

ADMIN_EMAIL="root@localhost"

RAID_STAT="/proc/mdstat"

while true
do
 sleep 3
 # warn if the status string at the end of a line contains anything but U
 if [ -n "`cat $RAID_STAT | perl -ne 'if (/(.*\[U*[^\]\[U]+U*\])$/) { print \"Failure! $1\n\"; }'`" ]
 then
cat $RAID_STAT | mail -s " Raid Failure Warning " $ADMIN_EMAIL
sleep 600
 fi
done
---EOF---

(Watch for wrapped lines.  There are a couple long ones.)

If you want to test it, then do this:

cat /proc/mdstat > testfile

then change RAID_STAT in the script to "testfile".

Then, start the daemon and let it run.  After a while, simulate the
failure by editing 'testfile'.  If it works, kill the daemon, and
point RAID_STAT to the real file (/proc/mdstat).
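
For example, assuming the healthy status string is [UUU] (adjust to
whatever your real /proc/mdstat shows), something like this fakes a dead
member in the copy:

 perl -pi -e 's/\[UUU\]/[UU_]/' testfile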

To use it in rc.local (or where ever), just start it like this:

su nobody raid-monitor.sh &

(Or change 'nobody' to some other unprivileged user of your choice.)


Phil

-- 
Philip Edelbrock -- IS Manager -- Edge Design, Corvallis, OR
   [EMAIL PROTECTED] -- http://www.netroedge.com/~phil
 PGP F16: 01 D2 FD 01 B5 46 F4 F0  3A 8B 9D 7E 14 7F FB 7A



Re: product testimonials

2000-03-20 Thread phil

On Mon, Mar 20, 2000 at 10:29:28AM -0500, Seth Vidal wrote:
>[...] 
> On the subject of semi-silent failures: Has anyone written a script to
> monitor the [U]'s in the /proc/mdstat location? It would be fairly
> trivial (start beeping the system speaker loudly and emailing
> repetitively) Has this already been or should I work on it?

Humm, good idea!  I made a simple shell script a while ago to monitor
lm-sensors stuff (fan speeds, power supplies, temperatures, etc.; see
www.lm-sensors.nu).  It wasn't too hard.  In rc.local, I just started it
in the background and it would email me if things went awry.

Here's what it looked like:

---SOF---
PATH="/bin:/usr/bin:/usr/local/bin:${PATH}"

ADMIN_EMAIL="root@localhost"

if [ -n "`sensors | grep ALARM`" ]
then
echo "Pending Alarms on start up!  Exiting!"
exit
fi

while true
do
 sleep 3
 sensors | /root/bin/getsensors.pl | /root/bin/displayit.pl
 if [ -n "`sensors | grep ALARM`" ]
 then
sensors | mail -s " Hardware Health Warning " $ADMIN_EMAIL
sleep 600
 fi
done
---EOF---

Btw, the line just after the 'sleep 3' is there to display stuff on an LCD
panel in a 5.25" drive bay, so ignore that.

Notice that it checks every 3 seconds, but emails every 10 minutes
(prevents the inbox from filling up overnight).

What does it look like when a drive dies?  I presume something like:

[..UD]

Then, perhaps just doing a (Perl) regexp: if (/\[[^\]]*D[^\]]*\]/)
then report the failure?
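
A quick way to sanity-check a pattern like that against made-up input
(and if the kernel actually marks a dead disk with '_' rather than 'D',
matching either should cover both cases):

echo '[..UD]'  | perl -ne 'print "Failure! $_" if /\[[^\]]*[D_][^\]]*\]/'
echo '[UU_UU]' | perl -ne 'print "Failure! $_" if /\[[^\]]*[D_][^\]]*\]/'

Both lines should print a warning if the pattern is right.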


Phil

-- 
Philip Edelbrock -- IS Manager -- Edge Design, Corvallis, OR
   [EMAIL PROTECTED] -- http://www.netroedge.com/~phil
 PGP F16: 01 D2 FD 01 B5 46 F4 F0  3A 8B 9D 7E 14 7F FB 7A



Re: product testimonials

2000-03-20 Thread Seth Vidal

> Well, we've been using assorted versions of the 0.90 raid code for over a
> year in a couple of servers.  We've had mostly good success with both the
> raid1 and raid5 code.  I don't have any raid5 disk failure stories (yet
> ;-), but we are using EIDE drives so I expect one before TOO long ;-)
> 
> Raid5 has given us good performance and reliability so far.  Now, we do
> have a raid1 array that did something interesting (and bad).  One of the
> drives was failing intermittently (and fairly silently) and had been
> removed from the array.  Unfortunately, backups were also failing (again,
> fairly silently). On a normal power down / reboot, it appears that the
> wrong drive was marked as master and on the reboot it re-synced to the
> drive that had been out of the array for a couple of months. (yeah, yeah,
> we need a sys-admin ;-) Anyway, 2 months of data went down the tubes.  No
> level of raid is a replacement for good backups.  

On the subject of semi-silent failures: has anyone written a script to
monitor the [U]'s in /proc/mdstat? It would be fairly trivial (start
beeping the system speaker loudly and emailing repeatedly). Has this
already been done, or should I work on it?
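
(For the beeping part, something as simple as writing BEL to the console
in a loop would probably do; a rough sketch, assuming whatever runs it can
write to /dev/console:)

 while true; do echo -ne '\a' > /dev/console; sleep 1; done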

Thanks

-sv




Re: product testimonials

2000-03-20 Thread Keith Underwood

Well, we've been using assorted versions of the 0.90 raid code for over a
year in a couple of servers.  We've had mostly good success with both the
raid1 and raid5 code.  I don't have any raid5 disk failure stories (yet
;-), but we are using EIDE drives so I expect one before TOO long ;-)

Raid5 has given us good performance and reliability so far.  Now, we do
have a raid1 array that did something interesting (and bad).  One of the
drives was failing intermittently (and fairly silently) and had been
removed from the array.  Unfortunately, backups were also failing (again,
fairly silently). On a normal power down / reboot, it appears that the
wrong drive was marked as master and on the reboot it re-synced to the
drive that had been out of the array for a couple of months. (yeah, yeah,
we need a sys-admin ;-) Anyway, 2 months of data went down the tubes.  No
level of raid is a replacement for good backups.  

Keith

P.S. That system was 2.0.0-pre3, I think, with the matching raidtools.


On Sun, 19 Mar 2000, Seth Vidal wrote:

> Hi folks,
>  I've got a user in my dept who is thinking about using software raid5
> (after I explained the advantages to them) - but they want "testimonials"
> ie: - people who have used software raid5 under linux and have had it save
> their ass or have had it work correctly and keep them from a costly backup
> restore. IE: success stories. Also I would like to hear some failure
> stories too - sort of horror stories - now the obscure situations I don't
> care about - if you got an axe stuck in your drive by accident and it
> killed the entire array then I  feel sorry for you but I don't consider
> that average use.
> 
> Can anyone give some testimonials on the 0.90 raid?
> thanks
> 
> -sv
> 
> 



Re: product testimonials

2000-03-19 Thread Martin Eriksson

I have not tested RAID5, but RAID1 has spared me a lot.  One of the HDs
failed, and I just tugged it out (it was the master disk, so I had to
change the jumpers).

I copy the MBR and the root partition to the other disk after every major
change (because they are not raided), so the machine just booted up in
degraded mode.

When I got the second HD back, I plugged it in, did a disk-wide dd, ran the
appropriate raid commands, and lo! it started working like a charm!
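
Roughly, that sequence looks something like this (only a sketch: the device
names are made up, and raidhotadd is the 0.90 raidtools command I'm
assuming counts as the "appropriate raid command" here):

# disk-wide copy from the surviving disk onto the replacement
dd if=/dev/hda of=/dev/hdc

# then hand the raid partition back to the kernel and let it resync
raidhotadd /dev/md0 /dev/hdc1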

This was on two similar disks though (IBM 7200rpm), run from the standard
DMA controller on a BE6-2.

Was a pain to set up though...

_
|  Martin Eriksson <[EMAIL PROTECTED]>
|  http://www.fysnet.nu/
|  ICQ: 5869159


- Original Message -
From: "Seth Vidal" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, March 20, 2000 4:16 AM
Subject: product testimonials


> Hi folks,
>  I've got a user in my dept who is thinking about using software raid5
> (after I explained the advantages to them) - but they want "testimonials"
> ie: - people who have used software raid5 under linux and have had it save
> their ass or have had it work correctly and keep them from a costly backup
> restore. IE: success stories. Also I would like to hear some failure
> stories too - sort of horror stories - now the obscure situations I don't
> care about - if you got an axe stuck in your drive by accident and it
> killed the entire array then I  feel sorry for you but I don't consider
> that average use.
>
> Can anyone give some testimonials on the 0.90 raid?
> thanks
>
> -sv