[zfs-discuss] Resilvering after drive failure - keeps restarting - any guesses why/how to fix?

2008-10-29 Thread Michael Stalnaker
All;

I have a large ZFS tank with four raidz2 groups in it. Each of these groups
is 11 disks, and I have four hot-spare disks in the system. The system is
running OpenSolaris build snv_90. One of these groups had a disk failure,
which the OS correctly detected; it replaced the failed disk with one of the
hot spares and began resilvering.

Now it gets interesting. The resilver runs for about an hour, then stops. If
I put zpool status -v in a while loop with a 10 minute sleep, I see the
repair proceed; then, with no messages of ANY kind, it'll silently quit and
start over. I'm attaching the output of zpool status -v from an hour ago and
from just now below. Has anyone seen this, or have any ideas as to the
cause? Is there a timeout or priority I need to change in a tunable or
something?
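
For reference, the watch loop is roughly this (the date call is just to
timestamp each sample; tank1 is the pool shown below):

while true; do
    date                     # timestamp each sample
    zpool status -v tank1    # watch the "scrub: resilver in progress" line
    sleep 600                # 10 minutes
done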

--Mike



One Hour Ago:

[EMAIL PROTECTED]:~$ zpool status -v
  pool: tank1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
 scrub: resilver in progress for 0h46m, 3.96% done, 18h39m to go
config:

NAME           STATE     READ WRITE CKSUM
tank1          DEGRADED     0     0     0
  raidz2       DEGRADED     0     0     0
    c1t13d0    ONLINE       0     0     0
    spare      DEGRADED     0     0     0
      c1t14d0  FAULTED      0     0     0  too many errors
      c1t11d0  ONLINE       0     0     0
    c1t15d0    ONLINE       0     0     0
    c1t16d0    ONLINE       0     0     0
    c1t17d0    ONLINE       0     0     0
    c1t18d0    ONLINE       0     0     0
    c1t19d0    ONLINE       0     0     0
    c1t20d0    ONLINE       0     0     0
    c1t21d0    ONLINE       0     0     0
    c1t22d0    ONLINE       0     0     0
    c1t23d0    ONLINE       0     0     0
  raidz2       ONLINE       0     0     0
    c1t0d0     ONLINE       0     0     0
    c1t1d0     ONLINE       0     0     0
    c1t2d0     ONLINE       0     0     0
    c1t3d0     ONLINE       0     0     0
    c1t4d0     ONLINE       0     0     0
    c1t5d0     ONLINE       0     0     0
    c1t6d0     ONLINE       0     0     0
    c1t7d0     ONLINE       0     0     0
    c1t8d0     ONLINE       0     0     0
    c1t9d0     ONLINE       0     0     0
    c1t10d0    ONLINE       0     0     0
  raidz2       ONLINE       0     0     0
    c2t13d0    ONLINE       0     0     0
    c2t14d0    ONLINE       0     0     0
    c2t15d0    ONLINE       0     0     0
    c2t16d0    ONLINE       0     0     0
    c2t17d0    ONLINE       0     0     0
    c2t18d0    ONLINE       0     0     0
    c2t19d0    ONLINE       0     0     0
    c2t20d0    ONLINE       0     0     0
    c2t21d0    ONLINE       0     0     0
    c2t22d0    ONLINE       0     0     0
    c2t23d0    ONLINE       0     0     0
  raidz2       ONLINE       0     0     0
    c2t0d0     ONLINE       0     0     0
    c2t1d0     ONLINE       0     0     0
    c2t2d0     ONLINE       0     0     0
    c2t3d0     ONLINE       0     0     0
    c1t24d0    ONLINE       0     0     0
    c2t5d0     ONLINE       0     0     0
    c2t6d0     ONLINE       0     0     0
    c2t7d0     ONLINE       0     0     0
    c2t8d0     ONLINE       0     0     0
    c2t9d0     ONLINE       0     0     0
    c2t10d0    ONLINE       0     0     0
spares
  c1t11d0      INUSE     currently in use
  c2t24d0      AVAIL
  c2t11d0      AVAIL
  c2t4d0       AVAIL

errors: No known data errors

Just Now:

[EMAIL PROTECTED]:~$ zpool status -v
  pool: tank1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
 scrub: resilver in progress for 0h24m, 2.23% done, 17h51m to go
config:

NAME           STATE     READ WRITE CKSUM
tank1          DEGRADED     0     0     0
  raidz2       DEGRADED     0     0     0
    c1t13d0    ONLINE       0     0     0
    spare      DEGRADED     0     0     0
      c1t14d0  FAULTED      0     0     0  too many errors
      c1t11d0  ONLINE       0     0     0
    c1t15d0    ONLINE       0     0     0
    c1t16d0    ONLINE       0     0     0
    c1t17d0    ONLINE       0     0     0
    c1t18d0    ONLINE

[zfs-discuss] A couple basic questions re: zfs sharenfs

2008-09-18 Thread Michael Stalnaker
All;

I'm sure I'm missing something basic here. I need to do the following
things, and can't for the life of me figure out how:

1. Export a ZFS filesystem over NFS, but restrict access to a limited set of
hosts and/or subnets, e.g. 10.9.8.0/24 and 10.9.9.5.
2. Give root access to a ZFS file system over NFS.

I'm sure this is doable with the right options, but I can't figure out how.

Any suggestions?
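
For what it's worth, the closest I can come up with from the sharenfs
property and the share_nfs(1M) option names is something like the following,
but I can't tell whether the access-list forms (@network/prefix and a bare
IP for a single host) are actually legal here, and tank1/export is just a
placeholder dataset name:

# Restrict read/write and root access to one subnet and one host.
# Unsure whether a bare IP is accepted for a single host, or whether
# it has to be a hostname/netgroup.
zfs set sharenfs='rw=@10.9.8.0/24:10.9.9.5,root=@10.9.8.0/24:10.9.9.5' tank1/export

# Check what actually got shared:
zfs get sharenfs tank1/export
share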


[zfs-discuss] Problems replacing a failed drive.

2008-02-29 Thread Michael Stalnaker
I have a 24-disk SATA array running on OpenSolaris Nevada, b78. We had a
drive fail, and I've replaced the device but can't get the system to
recognize that I replaced the drive.

zpool status -v shows the failed drive:

[EMAIL PROTECTED] ~]$ zpool status -v
  pool: LogData
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
 scrub: resilver completed with 0 errors on Wed Feb 27 11:51:45 2008
config:

NAME         STATE     READ WRITE CKSUM
LogData      DEGRADED     0     0     0
  raidz2     DEGRADED     0     0     0
    c0t12d0  ONLINE       0     0     0
    c0t5d0   ONLINE       0     0     0
    c0t0d0   ONLINE       0     0     0
    c0t4d0   ONLINE       0     0     0
    c0t8d0   ONLINE       0     0     0
    c0t16d0  ONLINE       0     0     0
    c0t20d0  ONLINE       0     0     0
    c0t1d0   ONLINE       0     0     0
    c0t9d0   ONLINE       0     0     0
    c0t13d0  ONLINE       0     0     0
    c0t17d0  ONLINE       0     0     0
    c0t20d0  FAULTED      0     0     0  too many errors
    c0t2d0   ONLINE       0     0     0
    c0t6d0   ONLINE       0     0     0
    c0t10d0  ONLINE       0     0     0
    c0t14d0  ONLINE       0     0     0
    c0t18d0  ONLINE       0     0     0
    c0t22d0  ONLINE       0     0     0
    c0t3d0   ONLINE       0     0     0
    c0t7d0   ONLINE       0     0     0
    c0t11d0  ONLINE       0     0     0
    c0t15d0  ONLINE       0     0     0
    c0t19d0  ONLINE       0     0     0
    c0t23d0  ONLINE       0     0     0

errors: No known data errors


I tried doing a zpool clear with no luck:

[EMAIL PROTECTED] ~]# zpool clear LogData c0t20d0
[EMAIL PROTECTED] ~]# zpool status -v
  pool: LogData
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
 scrub: resilver completed with 0 errors on Wed Feb 27 11:51:45 2008
config:

NAME         STATE     READ WRITE CKSUM
LogData      DEGRADED     0     0     0
  raidz2     DEGRADED     0     0     0
    c0t12d0  ONLINE       0     0     0
    c0t5d0   ONLINE       0     0     0
    c0t0d0   ONLINE       0     0     0
    c0t4d0   ONLINE       0     0     0
    c0t8d0   ONLINE       0     0     0
    c0t16d0  ONLINE       0     0     0
    c0t20d0  ONLINE       0     0     0
    c0t1d0   ONLINE       0     0     0
    c0t9d0   ONLINE       0     0     0
    c0t13d0  ONLINE       0     0     0
    c0t17d0  ONLINE       0     0     0
    c0t20d0  FAULTED      0     0     0  too many errors
    c0t2d0   ONLINE       0     0     0
    c0t6d0   ONLINE       0     0     0
    c0t10d0  ONLINE       0     0     0
    c0t14d0  ONLINE       0     0     0
    c0t18d0  ONLINE       0     0     0
    c0t22d0  ONLINE       0     0     0
    c0t3d0   ONLINE       0     0     0
    c0t7d0   ONLINE       0     0     0

And I've tried zpool replace:

[EMAIL PROTECTED] ~]# 
[EMAIL PROTECTED] ~]# zpool replace -f LogData c0t20d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c0t20d0s0 is part of active ZFS pool LogData. Please see zpool(1M).


So... what am I missing here, folks?
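
Unless someone tells me otherwise, my next guess was going to be something
along these lines: offline the faulted device, clean up the device links,
and retry the replace. The second replace form assumes an unused disk;
c0t21d0 is just a placeholder name.

# Take the already-FAULTED device out of service:
zpool offline LogData c0t20d0

# Rebuild /dev links so the new physical disk shows up cleanly:
devfsadm -C

# Retry the in-place replace (new disk in the same slot/name):
zpool replace LogData c0t20d0

# ...or point ZFS at a different, unused disk explicitly:
zpool replace LogData c0t20d0 c0t21d0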

Any help would be appreciated.

-Mike




[zfs-discuss] Sun 5220 as a ZFS Server?

2008-02-04 Thread Michael Stalnaker
We're looking at building out several ZFS servers, and are considering an x86
platform vs. a Sun 5520 as the base platform. Any comments from the floor on
comparative performance as a ZFS server? We'd be using the LSI 3801
controllers in either case.


[zfs-discuss] Disk array problems - any suggestions?

2008-01-12 Thread Michael Stalnaker
All;

I have a 24-disk SATA array attached to an HP DL160 with an LSI 3801E as the
controller. We've been seeing errors that look like:

WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci8086,[EMAIL PROTECTED],3/pci1000,[EMAIL PROTECTED] (mpt0);
Disconnected command timeout for Target 23
WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci8086,[EMAIL PROTECTED],3/pci1000,[EMAIL PROTECTED] (mpt0);
Disconnected command timeout for Target 23

   SCSI transport failed: reason 'reset': giving up

WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci8086,[EMAIL PROTECTED],3/pci1000,[EMAIL PROTECTED] (mpt0);
Disconnected command timeout for Target 23
WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci8086,[EMAIL PROTECTED],3/pci1000,[EMAIL PROTECTED] (mpt0);
Disconnected command timeout for Target 23

When these occur, the system hangs on any access to the array and never
recovers. After some discussions with folks at Sun, I rebuilt the system
from Solaris 10 x86 Update 4 to run OpenSolaris; it's currently on Solaris
Express (Nevada) build 78, and these errors are continuing. The drives are
750 GB Hitachis, and after a power cycle and reboot the error does not
persist on one drive. Each of the drives is in a carrier with some active
electronics to adapt the SATA drives for SAS use. My fear at the moment is
that there's some sort of problem with the 24-drive enclosure itself, as the
drives appear to be fine, and I can't believe we're seeing an intermittent
failure across a number of drives.

Any suggestions would be appreciated.
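
Next time it hangs, the data I'm planning to capture is roughly this (just
the standard per-device error counters, the FMA error log, and the mpt
kernel messages):

# Per-device soft/hard/transport error counters; on this controller
# Target 23 should show up as t23 in the device name.
iostat -En

# Fault-management error telemetry from around the time of the hang:
fmdump -e
fmdump -eV | less

# Kernel messages from the mpt driver:
grep -i mpt /var/adm/messages | tail -50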

--Mike Stalnaker



Re: [zfs-discuss] Major problem with a new ZFS setup

2007-11-08 Thread Michael Stalnaker
We weren't able to do anything at all, and finally rebooted the system. When
we did, everything came back normally, even with the target that was
reporting errors before. We're using an LSI 3801-E, a PCI-E controller
that's on the supported device list. Right now, I'm trying to figure out if
there's a different controller we should be using with Solaris 10 Release 4
(x86) that will handle a drive issue more gracefully. I know folks are
working on this part of the code, but I need to get as far along as I can
right now. :)



On 11/8/07 8:43 PM, Ian Collins [EMAIL PROTECTED] wrote:

 Michael Stalnaker wrote:
 
 Finally trying to do a zpool status yields:
 
 [EMAIL PROTECTED]:/# zpool status -v
   pool: LogData
  state: ONLINE
 status: One or more devices has experienced an unrecoverable error.  An
 attempt was made to correct the error.  Applications are unaffected.
 action: Determine if the device needs to be replaced, and clear the errors
 using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
  scrub: none requested
 
 At which point the shell hangs, and cannot be control-c'd.
 
 
 Any thoughts on how to proceed? I'm guessing we have a bad disk, but I'm not
 sure. Anything you can recommend to diagnose this would be welcome.
 
   
 Are you able to run a zpool scrub?
 
 Ian
