[zfs-discuss] Resilvering after drive failure - keeps restarting - any guesses why/how to fix?
All;

I have a large ZFS tank with four raidz2 groups in it. Each of these groups is 11 disks, and I have four hot spare disks in the system. The system is running OpenSolaris build snv_90. One of these groups has had a disk failure, which the OS correctly detected, replaced with one of the hot spares, and began rebuilding. Now it gets interesting. The resilver runs for about 1 hour, then stops. If I put zpool status -v in a while loop with a 10 minute sleep, I see the repair proceed, then with no messages of ANY kind, it'll silently quit and start over. I'm attaching the output of zpool status -v from an hour ago and then from just now below.

Has anyone seen this, or have any ideas as to the cause? Is there a timeout or priority I need to change in a tunable or something?

--Mike

One Hour Ago:

[EMAIL PROTECTED]:~$ zpool status -v
  pool: tank1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: resilver in progress for 0h46m, 3.96% done, 18h39m to go
config:

        NAME           STATE     READ WRITE CKSUM
        tank1          DEGRADED     0     0     0
          raidz2       DEGRADED     0     0     0
            c1t13d0    ONLINE       0     0     0
            spare      DEGRADED     0     0     0
              c1t14d0  FAULTED      0     0     0  too many errors
              c1t11d0  ONLINE       0     0     0
            c1t15d0    ONLINE       0     0     0
            c1t16d0    ONLINE       0     0     0
            c1t17d0    ONLINE       0     0     0
            c1t18d0    ONLINE       0     0     0
            c1t19d0    ONLINE       0     0     0
            c1t20d0    ONLINE       0     0     0
            c1t21d0    ONLINE       0     0     0
            c1t22d0    ONLINE       0     0     0
            c1t23d0    ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c1t0d0     ONLINE       0     0     0
            c1t1d0     ONLINE       0     0     0
            c1t2d0     ONLINE       0     0     0
            c1t3d0     ONLINE       0     0     0
            c1t4d0     ONLINE       0     0     0
            c1t5d0     ONLINE       0     0     0
            c1t6d0     ONLINE       0     0     0
            c1t7d0     ONLINE       0     0     0
            c1t8d0     ONLINE       0     0     0
            c1t9d0     ONLINE       0     0     0
            c1t10d0    ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c2t13d0    ONLINE       0     0     0
            c2t14d0    ONLINE       0     0     0
            c2t15d0    ONLINE       0     0     0
            c2t16d0    ONLINE       0     0     0
            c2t17d0    ONLINE       0     0     0
            c2t18d0    ONLINE       0     0     0
            c2t19d0    ONLINE       0     0     0
            c2t20d0    ONLINE       0     0     0
            c2t21d0    ONLINE       0     0     0
            c2t22d0    ONLINE       0     0     0
            c2t23d0    ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c2t0d0     ONLINE       0     0     0
            c2t1d0     ONLINE       0     0     0
            c2t2d0     ONLINE       0     0     0
            c2t3d0     ONLINE       0     0     0
            c1t24d0    ONLINE       0     0     0
            c2t5d0     ONLINE       0     0     0
            c2t6d0     ONLINE       0     0     0
            c2t7d0     ONLINE       0     0     0
            c2t8d0     ONLINE       0     0     0
            c2t9d0     ONLINE       0     0     0
            c2t10d0    ONLINE       0     0     0
        spares
          c1t11d0      INUSE     currently in use
          c2t24d0      AVAIL
          c2t11d0      AVAIL
          c2t4d0       AVAIL

errors: No known data errors

Just Now:

[EMAIL PROTECTED]:~$ zpool status -v
  pool: tank1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: resilver in progress for 0h24m, 2.23% done, 17h51m to go
config:

        NAME           STATE     READ WRITE CKSUM
        tank1          DEGRADED     0     0     0
          raidz2       DEGRADED     0     0     0
            c1t13d0    ONLINE       0     0     0
            spare      DEGRADED     0     0     0
              c1t14d0  FAULTED      0     0     0  too many errors
              c1t11d0  ONLINE       0     0     0
            c1t15d0    ONLINE       0     0     0
            c1t16d0    ONLINE       0     0     0
            c1t17d0    ONLINE       0     0     0
            c1t18d0    ONLINE
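For reference, the watch loop I mentioned is nothing fancier than something along these lines (approximate; the pool name matches the output above):

    #!/bin/sh
    # Log a timestamp and the resilver progress line every 10 minutes,
    # so a silent restart shows up as the percentage dropping back down.
    while true; do
        date
        zpool status -v tank1 | grep "resilver in progress"
        sleep 600
    done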
[zfs-discuss] A couple basic questions re: zfs sharenfs
All;

I'm sure I'm missing something basic here. I need to do the following things, and can't for the life of me figure out how:

1. Export a ZFS filesystem over NFS, but restrict access to a limited set of hosts and/or subnets, e.g. 10.9.8.0/24 and 10.9.9.5 only.
2. Give root access to a ZFS filesystem over NFS.

I'm sure this is doable with the right options, but I can't figure out how. Any suggestions?
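My best guess so far is that it all lives in the sharenfs property value, roughly like the following, but I haven't been able to confirm the syntax (the filesystem name here is just a made-up example):

    # Guess 1: read/write access limited to one subnet plus one host (untested):
    zfs set sharenfs='rw=@10.9.8.0/24:10.9.9.5' tank/export

    # Guess 2: same access list, plus root access for those clients (also untested):
    zfs set sharenfs='rw=@10.9.8.0/24:10.9.9.5,root=@10.9.8.0/24:10.9.9.5' tank/export

Is that the right general shape, or am I off in the weeds?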
[zfs-discuss] Problems replacing a failed drive.
I have a 24 disk SATA array running on OpenSolaris Nevada, b78. We had a drive fail, and I've replaced the device but can't get the system to recognize that I replaced the drive.

zpool status -v shows the failed drive:

[EMAIL PROTECTED] ~]$ zpool status -v
  pool: LogData
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: resilver completed with 0 errors on Wed Feb 27 11:51:45 2008
config:

        NAME         STATE     READ WRITE CKSUM
        LogData      DEGRADED     0     0     0
          raidz2     DEGRADED     0     0     0
            c0t12d0  ONLINE       0     0     0
            c0t5d0   ONLINE       0     0     0
            c0t0d0   ONLINE       0     0     0
            c0t4d0   ONLINE       0     0     0
            c0t8d0   ONLINE       0     0     0
            c0t16d0  ONLINE       0     0     0
            c0t20d0  ONLINE       0     0     0
            c0t1d0   ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
            c0t17d0  ONLINE       0     0     0
            c0t20d0  FAULTED      0     0     0  too many errors
            c0t2d0   ONLINE       0     0     0
            c0t6d0   ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c0t14d0  ONLINE       0     0     0
            c0t18d0  ONLINE       0     0     0
            c0t22d0  ONLINE       0     0     0
            c0t3d0   ONLINE       0     0     0
            c0t7d0   ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c0t15d0  ONLINE       0     0     0
            c0t19d0  ONLINE       0     0     0
            c0t23d0  ONLINE       0     0     0

errors: No known data errors

I tried doing a zpool clear with no luck:

[EMAIL PROTECTED] ~]# zpool clear LogData c0t20d0
[EMAIL PROTECTED] ~]# zpool status -v
  pool: LogData
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: resilver completed with 0 errors on Wed Feb 27 11:51:45 2008
config:

        NAME         STATE     READ WRITE CKSUM
        LogData      DEGRADED     0     0     0
          raidz2     DEGRADED     0     0     0
            c0t12d0  ONLINE       0     0     0
            c0t5d0   ONLINE       0     0     0
            c0t0d0   ONLINE       0     0     0
            c0t4d0   ONLINE       0     0     0
            c0t8d0   ONLINE       0     0     0
            c0t16d0  ONLINE       0     0     0
            c0t20d0  ONLINE       0     0     0
            c0t1d0   ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
            c0t17d0  ONLINE       0     0     0
            c0t20d0  FAULTED      0     0     0  too many errors
            c0t2d0   ONLINE       0     0     0
            c0t6d0   ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c0t14d0  ONLINE       0     0     0
            c0t18d0  ONLINE       0     0     0
            c0t22d0  ONLINE       0     0     0
            c0t3d0   ONLINE       0     0     0
            c0t7d0   ONLINE       0     0     0

And I've tried zpool replace:

[EMAIL PROTECTED] ~]#
[EMAIL PROTECTED] ~]# zpool replace -f LogData c0t20d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c0t20d0s0 is part of active ZFS pool LogData. Please see zpool(1M).

So... What am I missing here folks? Any help would be appreciated.

-Mike
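The only other idea I have is that the replacement disk is being rejected because it still carries the old pool's label. The next thing I'm tempted to try is wiping the label and re-running the replace, roughly like this (not run yet, and I'd like a sanity check before I zero anything):

    # Wipe the old ZFS label from the front of the replacement disk.
    # Destructive to anything on c0t20d0 -- and ZFS also keeps labels at
    # the end of the device, so this may not be sufficient on its own:
    dd if=/dev/zero of=/dev/rdsk/c0t20d0s0 bs=1024k count=64

    # Then retry the in-place replacement:
    zpool replace LogData c0t20d0

Does that sound sane, or is there a cleaner way?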
[zfs-discuss] Sun 5220 as a ZFS Server?
We're looking at building out several ZFS servers, and are considering an x86 platform vs. a Sun 5220 as the base platform. Any comments from the floor on comparative performance as a ZFS server? We'd be using the LSI 3801 controllers in either case.
[zfs-discuss] Disk array problems - any suggestions?
All;

I have a 24-disk SATA array attached to an HP DL160 with an LSI 3801E for the controller. We've been seeing errors that look like:

WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci8086,[EMAIL PROTECTED],3/pci1000,[EMAIL PROTECTED] (mpt0); Disconnected command timeout for Target 23
WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci8086,[EMAIL PROTECTED],3/pci1000,[EMAIL PROTECTED] (mpt0); Disconnected command timeout for Target 23
SCSI transport failed: reason 'reset': giving up
WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci8086,[EMAIL PROTECTED],3/pci1000,[EMAIL PROTECTED] (mpt0); Disconnected command timeout for Target 23
WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci8086,[EMAIL PROTECTED],3/pci1000,[EMAIL PROTECTED] (mpt0); Disconnected command timeout for Target 23

When these occur, the system hangs on any access to the array and never recovers. After some discussions with some folks at Sun, I rebuilt the system from Solaris 10 x86 Update 4 to run OpenSolaris. It's currently on Solaris Express (Nevada) build 78, and these errors are continuing.

The drives are the 750GB Hitachis, and after a power cycle and reboot, the error does not persist on any one drive. Each of the drives is in a carrier with some active electronics to adapt the SATA drives for SAS use. My fear at the moment is that there's some sort of problem with the 24-drive enclosure itself, as the drives appear to be fine, and I cannot believe we're seeing an intermittent failure across a number of drives.

Any suggestions would be appreciated.

--Mike Stalnaker
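For anyone who wants to suggest specific diagnostics: I can easily pull the standard per-device error counters and fault-management logs, e.g. something like:

    # Per-device soft/hard/transport error counters since boot:
    iostat -En

    # Error telemetry from the fault manager, to see which targets the
    # mpt timeouts were logged against:
    fmdump -eV | more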
Re: [zfs-discuss] Major problem with a new ZFS setup
We weren't able to do anything at all, and finally rebooted the system. When we did, everything came back normally, even with the target that was reporting errors before.

We're using an LSI PCI-E controller that's on the supported device list, an LSI 3801-E. Right now, I'm trying to figure out if there's a different controller we should be using with Solaris 10 Release 4 (x86) that will handle a drive issue more gracefully. I know folks are working on this part of the code, but I need to get as far along as I can right now. :)

On 11/8/07 8:43 PM, Ian Collins [EMAIL PROTECTED] wrote:

> Michael Stalnaker wrote:
>> Finally trying to do a zpool status yields:
>>
>> [EMAIL PROTECTED]:/# zpool status -v
>>   pool: LogData
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error. An
>>         attempt was made to correct the error. Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>>         using 'zpool clear' or replace the device with 'zpool replace'.
>>    see: http://www.sun.com/msg/ZFS-8000-9P
>>  scrub: none requested
>>
>> At which point the shell hangs, and cannot be control-c'd.
>>
>> Any thoughts on how to proceed? I'm guessing we have a bad disk, but I'm
>> not sure. Anything you can recommend to diagnose this would be welcome.
>
> Are you able to run a zpool scrub?
>
> Ian
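For completeness, the scrub and diagnosis commands we'll try next look roughly like this (assuming zpool status stays responsive after the reboot):

    # Start a full scrub of the pool, then watch for read/write/checksum
    # errors accumulating against a single device:
    zpool scrub LogData
    zpool status -v LogData

    # Also check whether the fault manager has already diagnosed a bad disk:
    fmadm faulty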