Re: [zfs-discuss] Hardware Failure Best Practices

2010-03-08 Thread Giovanni Tirloni
On Mon, Mar 8, 2010 at 2:00 PM, Chris Dunbar wrote:

> Hello,
>
> I just found this list and am very excited that you all are here! I have a
> homemade ZFS server that serves as our poor man's Thumper (we named it
> thumpthis) and provides primarily NFS shares for our VMware environment. As
> is often the case, the server has developed a hardware problem mere days
> before I am ready to go live with a new replacement server (thumpthat). At
> first the problem appeared to be a bad drive, but now I am not so sure. I
> would like to sanity check my thought process with this list and see if
> anybody has some different ideas. Here is a quick timeline of the trouble:
>
> 1. I noticed the following when running a routine zpool status:
>
> 
>           mirror    DEGRADED     0     0     0
>             c3t2d0  ONLINE       0     0     0
>             c3t3d0  REMOVED      0  368K     0
> 
>
> 2. I determined which drive appeared to be offline by watching drive lights
> and then rebooted the server.
>
> 3. Initially the drive appeared to be fine and ZFS picked it back up and
> resilvered the mirror. About 30 minutes later I noticed that the same drive
> was again marked REMOVED.
>
> 4. I shut the server down and replaced the drive with a new, larger disk.
>
> 5. I ran zpool replace tank c3t3d0 and it happily went to work on the
> replacement drive. A few hours later the resilver was complete and all
> seemed well.
>
> 6. The next day, about 12 hours after installing the new drive, I found the
> same error message (here's the whole pool):
>
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        DEGRADED     0     0     0
>           mirror    ONLINE       0     0     0
>             c3t0d0  ONLINE       0     0     0
>             c3t1d0  ONLINE       0     0     0
>           mirror    DEGRADED     0     0     0
>             c3t2d0  ONLINE       0     0     0
>             c3t3d0  REMOVED      0  370K     0
>           mirror    ONLINE       0     0     0
>             c4t0d0  ONLINE       0     0     0
>             c4t1d0  ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c4t2d0  ONLINE       0     0     0
>             c4t3d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> This is where I am now. Either my new hard drive is bad (not impossible) or
> I am looking at some other hardware failure, possibly the AOC-SAT2-MV8
> controller card. I have a spare controller card (same make and model
> purchased at the same time we built the server) and plan to replace that
> tonight. Does that seem like the correct course of action? Are there any
> steps I can take beforehand to zero in on the problem? Any words of
> encouragement or wisdom?
>

What does `iostat -En` say?
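
If the output is long, a few lines of Python can pull out the per-device
error counters. This is only a rough sketch: it assumes the usual Solaris
layout where each device block starts with a line such as
"c3t3d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0", and the
function names are made up for illustration.

```python
import re

# Assumed header line of each device block in `iostat -En` output:
#   c3t3d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
HEADER = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+"
    r"Transport Errors:\s*(?P<transport>\d+)"
)


def parse_iostat_errors(text):
    """Map device name -> {"soft": n, "hard": n, "transport": n}."""
    counts = {}
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            counts[match.group("dev")] = {
                key: int(match.group(key))
                for key in ("soft", "hard", "transport")
            }
    return counts


def suspects(counts):
    """Devices reporting any hard or transport errors, sorted by name."""
    return sorted(
        dev for dev, c in counts.items() if c["hard"] or c["transport"]
    )
```

Feed it the text captured from `iostat -En`, for example via
`subprocess.check_output(["iostat", "-En"], text=True)`. Transport errors
piling up on a single device tend to point at the path to the disk (cable,
backplane, port) rather than the disk itself, though that is a rule of
thumb, not a guarantee.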

My suggestion is to replace the cable that's connecting the c3t3d0 disk.

IMHO, the cable is much more likely to be faulty than a single port on the
disk controller.
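
Whichever part you swap first, it helps to watch whether the REMOVED state
follows the disk, the cable, or the port. Here is a minimal Python sketch
that lists every row of a `zpool status` config table that is not ONLINE.
The sample text is the pool listing from your message (trimmed); the
function name and the set of states it recognises are my own assumptions.

```python
# States this sketch recognises in the STATE column (assumed list).
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "REMOVED", "UNAVAIL"}


def unhealthy_devices(status_text):
    """Return (name, state, write_errors) for each row that is not ONLINE."""
    bad = []
    for line in status_text.splitlines():
        fields = line.split()
        # Config rows look like: NAME STATE READ WRITE CKSUM
        if len(fields) >= 5 and fields[1] in STATES and fields[1] != "ONLINE":
            bad.append((fields[0], fields[1], fields[3]))
    return bad


STATUS = """\
        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
          mirror    DEGRADED     0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  REMOVED      0  370K     0
"""

for name, state, writes in unhealthy_devices(STATUS):
    print(name, state, writes)
# prints:
# tank DEGRADED 0
# mirror DEGRADED 0
# c3t3d0 REMOVED 370K
```

If the same target shows up again after a cable swap, the controller port
becomes the more likely suspect; if the problem moves with the cable, you
have your answer.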

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Hardware Failure Best Practices

2010-03-08 Thread Chris Dunbar
Hello,

I just found this list and am very excited that you all are here! I have a 
homemade ZFS server that serves as our poor man's Thumper (we named it 
thumpthis) and provides primarily NFS shares for our VMware environment. As is 
often the case, the server has developed a hardware problem mere days before I 
am ready to go live with a new replacement server (thumpthat). At first the 
problem appeared to be a bad drive, but now I am not so sure. I would like to 
sanity check my thought process with this list and see if anybody has some 
different ideas. Here is a quick timeline of the trouble:

1. I noticed the following when running a routine zpool status:


          mirror    DEGRADED     0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  REMOVED      0  368K     0


2. I determined which drive appeared to be offline by watching drive lights and 
then rebooted the server.

3. Initially the drive appeared to be fine and ZFS picked it back up and 
resilvered the mirror. About 30 minutes later I noticed that the same drive was 
again marked REMOVED.

4. I shut the server down and replaced the drive with a new, larger disk.

5. I ran zpool replace tank c3t3d0 and it happily went to work on the 
replacement drive. A few hours later the resilver was complete and all seemed 
well.

6. The next day, about 12 hours after installing the new drive, I found the same 
error message (here's the whole pool):

config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
          mirror    DEGRADED     0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  REMOVED      0  370K     0
          mirror    ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0

errors: No known data errors

This is where I am now. Either my new hard drive is bad (not impossible) or I 
am looking at some other hardware failure, possibly the AOC-SAT2-MV8 controller 
card. I have a spare controller card (same make and model purchased at the same 
time we built the server) and plan to replace that tonight. Does that seem like 
the correct course of action? Are there any steps I can take beforehand to zero 
in on the problem? Any words of encouragement or wisdom?

Regards,
Chris Dunbar
Earthside, LLC
