[zfs-discuss] Disk keeps resilvering, was: Replacing a disk never completes
On 09/22/10 04:27 PM, Ben Miller wrote:
> On 09/21/10 09:16 AM, Ben Miller wrote:
>> I had tried a clear a few times with no luck. I just did a detach, which did
>> remove the old disk and has now triggered another resilver that hopefully
>> works. I had tried a remove rather than a detach before, but that doesn't
>> work on raidz2...
>> thanks,
>> Ben
>
> I made some progress. That resilver completed with 4 errors. I cleared those
> and still had the one error "<metadata>:<0x0>", so I started a scrub. The
> scrub restarted the resilver on c4t0d0 again though! There are currently no
> errors anyway, but the resilver will be running for the next day+. Is this
> another bug, or will doing a scrub eventually lead to a scrub of the pool
> instead of the resilver?
>
> Ben

Well, not much progress. The one permanent error "<metadata>:<0x0>" came back, and the disk keeps wanting to resilver whenever I try to do a scrub. Now, after the last resilver, I have more checksum errors on the pool, but not on any disks:

        NAME              STATE     READ WRITE CKSUM
        pool2             ONLINE       0     0    37
        ...
          raidz2-1        ONLINE       0     0    74

All other checksum totals are 0. So, three problems:

1. How do I get the disk to stop resilvering?
2. How do you get checksum errors on the pool when no disk is identified? If I clear them and let the resilver run again, more checksum errors appear. How do I get rid of these errors?
3. How do I get rid of the <metadata>:<0x0> error? I'm currently destroying old snapshots (though that bug was fixed quite a while ago, and I'm running b134). I can try unmounting and remounting filesystems next (all are currently mounted). I can also schedule a reboot for next week if anyone thinks that would help.

thanks,
Ben

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Replacing a disk never completes
On 09/21/10 09:16 AM, Ben Miller wrote:
> On 09/20/10 10:45 AM, Giovanni Tirloni wrote:
>> On Thu, Sep 16, 2010 at 9:36 AM, Ben Miller <bmil...@mail.eecis.udel.edu> wrote:
>>> I have an X4540 running b134 where I'm replacing 500GB disks with 2TB
>>> disks (Seagate Constellation) and the pool seems sick now. [...] Any
>>> ideas how to get this disk finished being replaced without rebuilding
>>> the pool and restoring from backup? The pool is working, but is
>>> reporting as degraded and with checksum errors. [...]
>>
>> Try to run a `zpool clear pool2` and see if it clears the errors. If not,
>> you may have to detach `c4t0d0s0/o`. I believe it's a bug that was fixed
>> in recent builds.
>
> I had tried a clear a few times with no luck. I just did a detach, which
> did remove the old disk and has now triggered another resilver that
> hopefully works. I had tried a remove rather than a detach before, but
> that doesn't work on raidz2...
> thanks,
> Ben

I made some progress. That resilver completed with 4 errors. I cleared those and still had the one error "<metadata>:<0x0>", so I started a scrub. The scrub restarted the resilver on c4t0d0 again though! There are currently no errors anyway, but the resilver will be running for the next day+. Is this another bug, or will doing a scrub eventually lead to a scrub of the pool instead of the resilver?

Ben
Re: [zfs-discuss] Replacing a disk never completes
On 09/20/10 10:45 AM, Giovanni Tirloni wrote:
> On Thu, Sep 16, 2010 at 9:36 AM, Ben Miller <bmil...@mail.eecis.udel.edu> wrote:
>> I have an X4540 running b134 where I'm replacing 500GB disks with 2TB
>> disks (Seagate Constellation) and the pool seems sick now. [...] Any
>> ideas how to get this disk finished being replaced without rebuilding the
>> pool and restoring from backup? The pool is working, but is reporting as
>> degraded and with checksum errors. [...]
>
> Try to run a `zpool clear pool2` and see if it clears the errors. If not,
> you may have to detach `c4t0d0s0/o`. I believe it's a bug that was fixed in
> recent builds.
>
> --
> Giovanni Tirloni
> gtirl...@sysdroid.com

I had tried a clear a few times with no luck. I just did a detach, which did remove the old disk and has now triggered another resilver that hopefully works. I had tried a remove rather than a detach before, but that doesn't work on raidz2...

thanks,
Ben
[zfs-discuss] Replacing a disk never completes
I have an X4540 running b134 where I'm replacing 500GB disks with 2TB disks (Seagate Constellation) and the pool seems sick now. The pool has four raidz2 vdevs (8+2), where the first set of 10 disks was replaced a few months ago. I replaced two disks in the second set (c2t0d0, c3t0d0) a couple of weeks ago, but have been unable to get the third disk (c4t0d0) to finish replacing. I have tried the resilver for c4t0d0 four times now, and the pool also comes up with checksum errors and a permanent error (<metadata>:<0x0>).

The first resilver was from 'zpool replace', which came up with checksum errors. I cleared the errors, which triggered the second resilver (same result). I then did a 'zpool scrub', which started the third resilver and also identified three permanent errors (the two additional ones were in files in snapshots, which I then destroyed). I then did a 'zpool clear' and another scrub, which started the fourth resilver attempt. This last attempt identified another file with errors in a snapshot that I have now destroyed.

Any ideas how to get this disk finished being replaced without rebuilding the pool and restoring from backup? The pool is working, but is reporting as degraded and with checksum errors. Here is what the pool currently looks like:

# zpool status -v pool2
  pool: pool2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 33h9m with 4 errors on Thu Sep 16 00:28:14
config:

        NAME              STATE     READ WRITE CKSUM
        pool2             DEGRADED     0     0     8
          raidz2-0        ONLINE       0     0     0
            c0t4d0        ONLINE       0     0     0
            c1t4d0        ONLINE       0     0     0
            c2t4d0        ONLINE       0     0     0
            c3t4d0        ONLINE       0     0     0
            c4t4d0        ONLINE       0     0     0
            c5t4d0        ONLINE       0     0     0
            c2t5d0        ONLINE       0     0     0
            c3t5d0        ONLINE       0     0     0
            c4t5d0        ONLINE       0     0     0
            c5t5d0        ONLINE       0     0     0
          raidz2-1        DEGRADED     0     0    14
            c0t5d0        ONLINE       0     0     0
            c1t5d0        ONLINE       0     0     0
            c2t1d0        ONLINE       0     0     0
            c3t1d0        ONLINE       0     0     0
            c4t1d0        ONLINE       0     0     0
            c5t1d0        ONLINE       0     0     0
            c2t0d0        ONLINE       0     0     0
            c3t0d0        ONLINE       0     0     0
            replacing-8   DEGRADED     0     0     0
              c4t0d0s0/o  OFFLINE      0     0     0
              c4t0d0      ONLINE       0     0     0  268G resilvered
            c5t0d0        ONLINE       0     0     0
          raidz2-2        ONLINE       0     0     0
            c0t6d0        ONLINE       0     0     0
            c1t6d0        ONLINE       0     0     0
            c2t6d0        ONLINE       0     0     0
            c3t6d0        ONLINE       0     0     0
            c4t6d0        ONLINE       0     0     0
            c5t6d0        ONLINE       0     0     0
            c2t7d0        ONLINE       0     0     0
            c3t7d0        ONLINE       0     0     0
            c4t7d0        ONLINE       0     0     0
            c5t7d0        ONLINE       0     0     0
          raidz2-3        ONLINE       0     0     0
            c0t7d0        ONLINE       0     0     0
            c1t7d0        ONLINE       0     0     0
            c2t3d0        ONLINE       0     0     0
            c3t3d0        ONLINE       0     0     0
            c4t3d0        ONLINE       0     0     0
            c5t3d0        ONLINE       0     0     0
            c2t2d0        ONLINE       0     0     0
            c3t2d0        ONLINE       0     0     0
            c4t2d0        ONLINE       0     0     0
            c5t2d0        ONLINE       0     0     0
        logs
          mirror-4        ONLINE       0     0     0
            c0t1d0s0      ONLINE       0     0     0
            c1t3d0s0      ONLINE       0     0     0
        cache
          c0t3d0s7        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <0x167a2>:<0x552ed>

(This second file was in a snapshot I destroyed after the resilver completed.)

# zpool list pool2
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH    ALTROOT
pool2  31.8T  13.8T  17.9T  43%  1.65x  DEGRADED  -

The slog is a mirror of two SLC SSDs and the L2ARC is an MLC SSD.

thanks,
Ben
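A quick sanity check on the SIZE figure above, in shell arithmetic. 'zpool list' reports raw capacity with raidz parity included, so the number should roughly equal the sum of all data-disk sizes; the per-disk GiB figures below are approximate decimal-to-binary conversions, and the small log and cache SSDs are ignored:

```shell
# One raidz2 vdev (10 disks) is already upgraded to 2TB drives; the
# other three vdevs (30 disks) still hold 500GB drives.
# 2 TB ~ 1863 GiB, 500 GB ~ 465 GiB.
upgraded=$(( 10 * 1863 ))        # GiB in the upgraded vdev
remaining=$(( 30 * 465 ))        # GiB in the three 500GB vdevs
echo "expected raw size: $(( upgraded + remaining )) GiB"
# prints: expected raw size: 32580 GiB
```

32580 GiB is roughly 31.8 TiB, which agrees with the reported pool SIZE.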
Re: [zfs-discuss] Pool is wrong size in b134
Cindy,

The other two pools are 2-disk mirrors (rpool and another).

Ben

Cindy Swearingen wrote:
> Hi Ben,
>
> Any other details about this pool, like how it might be different from the
> other two pools on this system, might be helpful... I'm going to try to
> reproduce this problem. We'll be in touch.
>
> Thanks,
>
> Cindy
>
> On 06/17/10 07:02, Ben Miller wrote:
>> I upgraded a server today that has been running SXCE b111 to the
>> OpenSolaris preview b134. It has three pools and two are fine, but one
>> comes up with no space available in the pool (a SCSI JBOD of 300GB
>> disks). The zpool version is at 14. [...] Any idea why the pool is
>> showing up as the wrong size in b134, and is there anything else to try?
>> I don't want to upgrade the pool version yet and then not be able to
>> revert back...
>> thanks,
>> Ben
[zfs-discuss] Pool is wrong size in b134
I upgraded a server today that has been running SXCE b111 to the OpenSolaris preview b134. It has three pools; two are fine, but one comes up with no space available in the pool (a SCSI JBOD of 300GB disks). The zpool version is at 14. I tried exporting the pool and re-importing, and I get several errors like this on both export and import:

# zpool export pool1
WARNING: metaslab_free_dva(): bad DVA 0:645838978048
WARNING: metaslab_free_dva(): bad DVA 0:645843271168
...

I tried removing the zpool.cache file, rebooting, and importing; that produces no warnings, but the pool still reports the wrong avail and size.

# zfs list pool1
NAME    USED  AVAIL  REFER  MOUNTPOINT
pool1   396G      0  3.22M  /export/home

# zpool list pool1
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
pool1  476G   341G  135G  71%  1.00x  ONLINE  -

# zpool status pool1
  pool: pool1
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the pool
        will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pool1        ONLINE       0     0     0
          raidz2-0   ONLINE       0     0     0
            c1t8d0   ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c1t13d0  ONLINE       0     0     0
            c1t14d0  ONLINE       0     0     0

errors: No known data errors

If I export again, I get the metaslab_free_dva() warnings; importing again gives no warnings, but the same numbers as above. If I try to remove or truncate files, I get "no free space" errors. I reverted back to b111, and here is what the pool really looks like:

# zfs list pool1
NAME    USED  AVAIL  REFER  MOUNTPOINT
pool1   396G   970G  3.22M  /export/home

# zpool list pool1
NAME    SIZE  USED  AVAIL  CAP  HEALTH  ALTROOT
pool1  1.91T  557G  1.36T  28%  ONLINE  -

# zpool status pool1
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pool1        ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c1t8d0   ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c1t13d0  ONLINE       0     0     0
            c1t14d0  ONLINE       0     0     0

errors: No known data errors

Also, the disks were replaced one at a time last year, from 73GB to 300GB, to increase the size of the pool. Any idea why the pool is showing up as the wrong size in b134, and is there anything else to try? I don't want to upgrade the pool version yet and then not be able to revert back...

thanks,
Ben
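Some back-of-the-envelope arithmetic on the two SIZE values above. 'zpool list' shows raw capacity with parity included, so a 7-disk raidz2 should report roughly seven times the per-disk size; the GiB conversions are approximate:

```shell
# A 300 GB disk is ~279 GiB; a 73 GB disk is ~68 GiB.
echo "b111: $(( 7 * 279 )) GiB"   # prints: b111: 1953 GiB (~1.91T, as reported)
echo "b134: $(( 7 * 68 )) GiB"    # prints: b134: 476 GiB
```

The b134 figure lines up exactly with the pre-replacement 73GB disks, which suggests b134 is still seeing the old device sizes rather than the expanded ones.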
Re: [zfs-discuss] zpool status -x strangeness
# zpool status -xv
all pools are healthy

Ben

> What does 'zpool status -xv' show?
>
> On Tue, Jan 27, 2009 at 8:01 AM, Ben Miller wrote:
>> I forgot the pool that's having problems was recreated recently, so it's
>> already at zfs version 3. I just did a 'zfs upgrade -a' for another pool,
>> but some of those filesystems failed since they are busy and couldn't be
>> unmounted.
>>
>> # zfs upgrade -a
>> cannot unmount '/var/mysql': Device busy
>> cannot unmount '/var/postfix': Device busy
>>
>> 6 filesystems upgraded
>> 821 filesystems already at this version
>>
>> Ben

--
This message posted from opensolaris.org
Re: [zfs-discuss] zpool status -x strangeness
I forgot the pool that's having problems was recreated recently, so it's already at zfs version 3. I just did a 'zfs upgrade -a' for another pool, but some of those filesystems failed since they are busy and couldn't be unmounted.

# zfs upgrade -a
cannot unmount '/var/mysql': Device busy
cannot unmount '/var/postfix': Device busy

6 filesystems upgraded
821 filesystems already at this version

Ben

> You can upgrade live. 'zfs upgrade' with no arguments shows you the zfs
> version status of filesystems present without upgrading.
>
> On Jan 24, 2009, at 10:19 AM, Ben Miller wrote:
>> We haven't done 'zfs upgrade ...' yet. I'll give that a try the next time
>> the system can be taken down.
>>
>> Ben
>>
>>> A little gotcha that I found in my 10u6 update process was that 'zpool
>>> upgrade [poolname]' is not the same as 'zfs upgrade
>>> [poolname]/[filesystem(s)]'
>>>
>>> What does 'zfs upgrade' say? I'm not saying this is the source of your
>>> problem, but it's a detail that seemed to affect stability for me.
Re: [zfs-discuss] zpool status -x strangeness
We haven't done 'zfs upgrade ...' yet. I'll give that a try the next time the system can be taken down.

Ben

> A little gotcha that I found in my 10u6 update process was that 'zpool
> upgrade [poolname]' is not the same as 'zfs upgrade
> [poolname]/[filesystem(s)]'
>
> What does 'zfs upgrade' say? I'm not saying this is the source of your
> problem, but it's a detail that seemed to affect stability for me.
>
> On Thu, Jan 22, 2009 at 7:25 AM, Ben Miller wrote:
>> The pools are upgraded to version 10. Also, this is on Solaris 10u6.
Re: [zfs-discuss] zpool status -x strangeness
The pools are upgraded to version 10. Also, this is on Solaris 10u6.

# zpool upgrade
This system is currently running ZFS pool version 10.

All pools are formatted using this version.

Ben

> What's the output of 'zfs upgrade' and 'zpool upgrade'? (I'm just curious -
> I had a similar situation which seems to be resolved now that I've gone to
> Solaris 10u6 or OpenSolaris 2008.11.)
>
> On Wed, Jan 21, 2009 at 2:11 PM, Ben Miller wrote:
>> Bug ID is 6793967.
>>
>> This problem just happened again.
>>
>> % zpool status pool1
>>   pool: pool1
>>  state: DEGRADED
>>  scrub: resilver completed after 0h48m with 0 errors on Mon Jan  5 12:30:52 2009
>> config:
>>
>>         NAME           STATE     READ WRITE CKSUM
>>         pool1          DEGRADED     0     0     0
>>           raidz2       DEGRADED     0     0     0
>>             c4t8d0s0   ONLINE       0     0     0
>>             c4t9d0s0   ONLINE       0     0     0
>>             c4t10d0s0  ONLINE       0     0     0
>>             c4t11d0s0  ONLINE       0     0     0
>>             c4t12d0s0  REMOVED      0     0     0
>>             c4t13d0s0  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> % zpool status -x
>> all pools are healthy
>>
>> # zpool online pool1 c4t12d0s0
>> % zpool status -x
>>   pool: pool1
>>  state: ONLINE
>> status: One or more devices is currently being resilvered.  The pool will
>>         continue to function, possibly in a degraded state.
>> action: Wait for the resilver to complete.
>>  scrub: resilver in progress for 0h0m, 0.12% done, 2h38m to go
>> config:
>>
>>         NAME           STATE     READ WRITE CKSUM
>>         pool1          ONLINE       0     0     0
>>           raidz2       ONLINE       0     0     0
>>             c4t8d0s0   ONLINE       0     0     0
>>             c4t9d0s0   ONLINE       0     0     0
>>             c4t10d0s0  ONLINE       0     0     0
>>             c4t11d0s0  ONLINE       0     0     0
>>             c4t12d0s0  ONLINE       0     0     0
>>             c4t13d0s0  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> Ben
Re: [zfs-discuss] zpool status -x strangeness
Bug ID is 6793967.

This problem just happened again.

% zpool status pool1
  pool: pool1
 state: DEGRADED
 scrub: resilver completed after 0h48m with 0 errors on Mon Jan  5 12:30:52 2009
config:

        NAME           STATE     READ WRITE CKSUM
        pool1          DEGRADED     0     0     0
          raidz2       DEGRADED     0     0     0
            c4t8d0s0   ONLINE       0     0     0
            c4t9d0s0   ONLINE       0     0     0
            c4t10d0s0  ONLINE       0     0     0
            c4t11d0s0  ONLINE       0     0     0
            c4t12d0s0  REMOVED      0     0     0
            c4t13d0s0  ONLINE       0     0     0

errors: No known data errors

% zpool status -x
all pools are healthy

# zpool online pool1 c4t12d0s0
% zpool status -x
  pool: pool1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.12% done, 2h38m to go
config:

        NAME           STATE     READ WRITE CKSUM
        pool1          ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c4t8d0s0   ONLINE       0     0     0
            c4t9d0s0   ONLINE       0     0     0
            c4t10d0s0  ONLINE       0     0     0
            c4t11d0s0  ONLINE       0     0     0
            c4t12d0s0  ONLINE       0     0     0
            c4t13d0s0  ONLINE       0     0     0

errors: No known data errors

Ben

> I just put in a (low priority) bug report on this.
>
> Ben
>
>> This post from close to a year ago never received a response. We just had
>> this same thing happen to another server that is running Solaris 10 U6.
>> One of the disks was marked as removed and the pool degraded, but 'zpool
>> status -x' says all pools are healthy. After doing a 'zpool online' on
>> the disk, it resilvered fine. Any ideas why 'zpool status -x' reports all
>> healthy while 'zpool status' shows a pool in degraded mode?
>>
>> thanks,
>> Ben
>>
>>> We run a cron job that does a 'zpool status -x' to check for any
>>> degraded pools. We just happened to find a pool degraded this morning by
>>> running 'zpool status' by hand and were surprised that it was degraded,
>>> as we didn't get a notice from the cron job.
>>>
>>> # uname -srvp
>>> SunOS 5.11 snv_78 i386
>>>
>>> # zpool status -x
>>> all pools are healthy
>>>
>>> # zpool status pool1
>>>   pool: pool1
>>>  state: DEGRADED
>>>  scrub: none requested
>>> config:
>>>
>>>         NAME         STATE     READ WRITE CKSUM
>>>         pool1        DEGRADED     0     0     0
>>>           raidz1     DEGRADED     0     0     0
>>>             c1t8d0   REMOVED      0     0     0
>>>             c1t9d0   ONLINE       0     0     0
>>>             c1t10d0  ONLINE       0     0     0
>>>             c1t11d0  ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>> I'm going to look into it now why the disk is listed as removed.
>>>
>>> Does this look like a bug with 'zpool status -x'?
>>>
>>> Ben
Re: [zfs-discuss] zpool status -x strangeness
I just put in a (low priority) bug report on this.

Ben

> This post from close to a year ago never received a response. We just had
> this same thing happen to another server that is running Solaris 10 U6. One
> of the disks was marked as removed and the pool degraded, but 'zpool status
> -x' says all pools are healthy. After doing a 'zpool online' on the disk,
> it resilvered fine. Any ideas why 'zpool status -x' reports all healthy
> while 'zpool status' shows a pool in degraded mode?
>
> thanks,
> Ben
>
>> We run a cron job that does a 'zpool status -x' to check for any degraded
>> pools. We just happened to find a pool degraded this morning by running
>> 'zpool status' by hand and were surprised that it was degraded, as we
>> didn't get a notice from the cron job.
>>
>> # uname -srvp
>> SunOS 5.11 snv_78 i386
>>
>> # zpool status -x
>> all pools are healthy
>>
>> # zpool status pool1
>>   pool: pool1
>>  state: DEGRADED
>>  scrub: none requested
>> config:
>>
>>         NAME         STATE     READ WRITE CKSUM
>>         pool1        DEGRADED     0     0     0
>>           raidz1     DEGRADED     0     0     0
>>             c1t8d0   REMOVED      0     0     0
>>             c1t9d0   ONLINE       0     0     0
>>             c1t10d0  ONLINE       0     0     0
>>             c1t11d0  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> I'm going to look into it now why the disk is listed as removed.
>>
>> Does this look like a bug with 'zpool status -x'?
>>
>> Ben
Re: [zfs-discuss] zpool status -x strangeness
This post from close to a year ago never received a response. We just had this same thing happen to another server that is running Solaris 10 U6. One of the disks was marked as removed and the pool degraded, but 'zpool status -x' says all pools are healthy. After doing a 'zpool online' on the disk, it resilvered fine. Any ideas why 'zpool status -x' reports all healthy while 'zpool status' shows a pool in degraded mode?

thanks,
Ben

> We run a cron job that does a 'zpool status -x' to check for any degraded
> pools. We just happened to find a pool degraded this morning by running
> 'zpool status' by hand and were surprised that it was degraded, as we
> didn't get a notice from the cron job.
>
> # uname -srvp
> SunOS 5.11 snv_78 i386
>
> # zpool status -x
> all pools are healthy
>
> # zpool status pool1
>   pool: pool1
>  state: DEGRADED
>  scrub: none requested
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         pool1        DEGRADED     0     0     0
>           raidz1     DEGRADED     0     0     0
>             c1t8d0   REMOVED      0     0     0
>             c1t9d0   ONLINE       0     0     0
>             c1t10d0  ONLINE       0     0     0
>             c1t11d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> I'm going to look into it now why the disk is listed as removed.
>
> Does this look like a bug with 'zpool status -x'?
>
> Ben
[zfs-discuss] zpool status -x strangeness on b78
We run a cron job that does a 'zpool status -x' to check for any degraded pools. We just happened to find a pool degraded this morning by running 'zpool status' by hand, and were surprised that it was degraded as we didn't get a notice from the cron job.

# uname -srvp
SunOS 5.11 snv_78 i386

# zpool status -x
all pools are healthy

# zpool status pool1
  pool: pool1
 state: DEGRADED
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pool1        DEGRADED     0     0     0
          raidz1     DEGRADED     0     0     0
            c1t8d0   REMOVED      0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0

errors: No known data errors

I'm going to look into it now why the disk is listed as removed.

Does this look like a bug with 'zpool status -x'?

Ben
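Until the 'zpool status -x' discrepancy is understood, a monitoring cron job can sidestep it by checking each pool's health column directly instead of trusting the "-x" summary. A minimal, hypothetical sketch (the function name is invented): in practice the input would come from `zpool list -H -o name,health`, but the filter itself is plain awk, so it is demonstrated below against captured sample output.

```shell
# Print the name of any pool whose health column is not ONLINE.
unhealthy_pools() {
  awk '$2 != "ONLINE" { print $1 }'
}

# Demo against sample 'zpool list -H -o name,health' output;
# in cron this would be:  zpool list -H -o name,health | unhealthy_pools
printf 'pool1\tDEGRADED\nrpool\tONLINE\n' | unhealthy_pools
# prints: pool1
```

A wrapper script could then mail anything this prints to an administrator, regardless of what 'zpool status -x' claims.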
Re: [zfs-discuss] System hang caused by a "bad" snapshot
>>> Hello Matthew,
>>> Tuesday, September 12, 2006, 7:57:45 PM, you wrote:
>>>
>>> MA> Ben Miller wrote:
>>> >> I had a strange ZFS problem this morning. The entire system would
>>> >> hang when mounting the ZFS filesystems. After trial and error I
>>> >> determined that the problem was with one of the 2500 ZFS
>>> >> filesystems. When mounting that users' home the system would hang
>>> >> and need to be rebooted. After I removed the snapshots (9 of them)
>>> >> for that filesystem everything was fine.
>>> >>
>>> >> I don't know how to reproduce this and didn't get a crash dump. I
>>> >> don't remember seeing anything about this before so I wanted to
>>> >> report it and see if anyone has any ideas.
>>>
>>> MA> Hmm, that sounds pretty bizarre, since I don't think that mounting
>>> MA> a filesystem really interacts with snapshots at all.
>>> MA> Unfortunately, I don't think we'll be able to diagnose this without
>>> MA> a crash dump or reproducibility. If it happens again, force a crash
>>> MA> dump while the system is hung and we can take a look at it.
>>>
>>> Maybe it wasn't hung after all. I've seen similar behavior here
>>> sometimes. Were the disks used in the pool actually working?
>>
>> There was lots of activity on the disks (iostat and status LEDs) until
>> it got to this one filesystem, and then everything stopped. 'zpool
>> iostat 5' stopped running, the shell wouldn't respond, and activity on
>> the disks stopped. This fs is relatively small (175M used of a 512M
>> quota).
>>
>>> Sometimes it takes a lot of time (30-50 minutes) to mount a file
>>> system - it's rare, but it happens. And during this ZFS reads from
>>> those disks in the pool. I did report it here some time ago.
>>
>> In my case the system crashed during the evening and it was left hung
>> up when I came in during the morning, so it was hung for a good 9-10
>> hours.
>> The problem happened again last night, but for a different users'
>> filesystem. I took a crash dump with it hung, and the back trace looks
>> like this:
>>
>> > ::status
>> debugging crash dump vmcore.0 (64-bit) from hostname
>> operating system: 5.11 snv_40 (sun4u)
>> panic message: sync initiated
>> dump content: kernel pages only
>>
>> > ::stack
>> 0xf0046a3c(f005a4d8, 2a100047818, 181d010, 18378a8, 1849000, f005a4d8)
>> prom_enter_mon+0x24(2, 183c000, 18b7000, 2a100046c61, 1812158, 181b4c8)
>> debug_enter+0x110(0, a, a, 180fc00, 0, 183e000)
>> abort_seq_softintr+0x8c(180fc00, 18abc00, 180c000, 2a100047d98, 1, 1859800)
>> intr_thread+0x170(600019de0e0, 0, 6000d7bfc98, 600019de110, 600019de110,
>> 600019de110)
>> zfs_delete_thread_target+8(600019de080, , 0, 600019de080,
>> 6000d791ae8, 60001aed428)
>> zfs_delete_thread+0x164(600019de080, 6000d7bfc88, 1, 2a100c4faca,
>> 2a100c4fac8, 600019de0e0)
>> thread_start+4(600019de080, 0, 0, 0, 0, 0)
>>
>> In single user I set the mountpoint for that user to be none and then
>> brought the system up fine. Then I destroyed the snapshots for that user
>> and their filesystem mounted fine. In this case the quota was reached
>> with the snapshots, and 52% used without.
>>
>> Ben

Hate to re-open something from a year ago, but we just had this problem happen again. We have been running Solaris 10u3 on this system for a while. I searched the bug reports but couldn't find anything on this. I also think I understand what happened a little more. We take snapshots at noon, and the system hung up during that time. When trying to reboot, the system would hang on the ZFS mounts. After I booted into single user and removed the snapshot from the filesystem causing the problem, everything was fine. The filesystem in question was at 100% use with snapshots in use.

Here's the back trace for the system when it was hung:

> ::stack
0xf0046a3c(f005a4d8, 2a10004f828, 0, 181c850, 1848400, f005a4d8)
prom_enter_mon+0x24(0, 0, 183b400, 1, 1812140, 181ae60)
debug_enter+0x118(0, a, a, 180fc00, 0, 183d400)
abort_seq_softintr+0x94(180fc00, 18a9800, 180c000, 2a10004fd98, 1, 1857c00)
intr_thread+0x170(2, 30007b64bc0, 0, c001ed9, 110, 6000240)
0x985c8(300adca4c40, 0, 0, 0, 0, 30007b64bc0)
dbuf_hold_impl+0x28(60008cd02e8, 0, 0, 0, 7b648d73, 2a105bb57c8)
dbuf_hold_level+0x18(60008cd02e8, 0,
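The traces above both show "panic message: sync initiated", which is how a dump is typically forced on a hung SPARC system: the machine won't respond to commands, so the panic has to be triggered from the console. As a sketch of the preparation Matthew's earlier advice implies (the device path and directory below are examples, not from this system):

```shell
# Beforehand: make sure a dump device and savecore directory are configured.
dumpadm -d /dev/dsk/c0t0d0s1         # example dedicated dump slice
dumpadm -s /var/crash/myhost         # where savecore writes vmcore.N at boot

# When the hang recurs: send a break from the console (STOP-A on a local
# keyboard, or '~#' over a tip/console session), then at the 'ok' prompt:
#   ok sync
# This panics the machine and writes the dump; savecore recovers it on reboot.
```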
[zfs-discuss] Re: Remove files when at quota limit
Has anyone else run into this situation? Does anyone have any solutions other than removing snapshots or increasing the quota? I'd like to put in an RFE to reserve some space so files can be removed when users are at their quota. Any thoughts from the ZFS team?

Ben

> We have around 1000 users, all with quotas set on their ZFS filesystems on
> Solaris 10 U3. We take snapshots daily and rotate out the week-old ones.
> The situation is that some users ignore the advice of keeping space used
> below 80% and keep creating large temporary files. They then try to remove
> files when the space used is 100% and get over-quota messages. We then
> need to remove some or all of their snapshots to free space. Is there
> anything being worked on to keep some space reserved so files can be
> removed when at the quota limit, or some other solution? What are other
> people doing in this situation? We have also set up alternate filesystems
> for users with transient data that we do not take snapshots on, but we
> still have this problem on home directories.
>
> thanks,
> Ben
[zfs-discuss] Remove files when at quota limit
We have around 1000 users, all with quotas set on their ZFS filesystems on Solaris 10 U3. We take snapshots daily and rotate out the week-old ones. The situation is that some users ignore the advice of keeping space used below 80% and keep creating large temporary files. They then try to remove files when the space used is 100% and get over-quota messages. We then need to remove some or all of their snapshots to free space. Is there anything being worked on to keep some space reserved so files can be removed when at the quota limit, or some other solution? What are other people doing in this situation? We have also set up alternate filesystems for users with transient data that we do not take snapshots on, but we still have this problem on home directories.

thanks,
Ben
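One workaround that sometimes helps when a filesystem with no snapshots is at its quota: removing a file needs to write metadata, but truncating it in place frees its data blocks first, after which the unlink can succeed. A hedged sketch (the path is invented; note this does not help when snapshots still hold references to the old blocks, which is the harder case described above):

```shell
# Hypothetical example: free a large file on a filesystem at its quota.
bigfile=/export/home/user/bigfile      # invented path
cp /dev/null "$bigfile"                # truncate in place, freeing data blocks
rm "$bigfile"                          # the unlink can now allocate metadata
```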
[zfs-discuss] Re: Re: ZFS disables nfs/server on a host
I just threw a truss into the SMF script and rebooted the test system, and it failed again. The truss output is at http://www.eecis.udel.edu/~bmiller/zfs.truss-Apr27-2007

thanks,
Ben
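For reference, the wrapping probably looks something like the following (a hypothetical sketch; the exact edit to the method script isn't shown in the thread, and truss is Solaris-only, so this sketch only assembles and prints the command line rather than running it):

```shell
# Hypothetical truss wrapping for the failing share step inside the SMF
# start method: -f follows child processes (the popen'd share command),
# -o sends the syscall trace to a log file for later inspection.
TRUSS_LOG=/var/tmp/zfs.truss
TRUSS_CMD="truss -f -o $TRUSS_LOG /usr/sbin/zfs share -a"
echo "$TRUSS_CMD"
```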
[zfs-discuss] Re: ZFS disables nfs/server on a host
I was able to duplicate this problem on a test Ultra 10. I put in a workaround by adding a service that depends on /milestone/multi-user-server which does a 'zfs share -a'. It's strange this hasn't happened on other systems, but maybe it's related to slower systems...

Ben
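A minimal sketch of what such a workaround service's manifest might look like (the service name, FMRI, and timeouts below are illustrative assumptions, not taken from this thread):

```xml
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!-- Hypothetical manifest: re-run 'zfs share -a' once the
     multi-user-server milestone has been reached. -->
<service_bundle type="manifest" name="zfs-reshare">
  <service name="site/zfs-reshare" type="service" version="1">
    <create_default_instance enabled="true"/>
    <single_instance/>
    <dependency name="multi-user-server" grouping="require_all"
                restart_on="none" type="service">
      <service_fmri value="svc:/milestone/multi-user-server"/>
    </dependency>
    <exec_method type="method" name="start"
                 exec="/usr/sbin/zfs share -a" timeout_seconds="60"/>
    <exec_method type="method" name="stop" exec=":true" timeout_seconds="60"/>
    <property_group name="startd" type="framework">
      <!-- transient: the start method runs once and exits -->
      <propval name="duration" type="astring" value="transient"/>
    </property_group>
  </service>
</service_bundle>
```

The manifest would be imported with `svccfg import` and the service enabled with `svcadm enable site/zfs-reshare`.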
[zfs-discuss] Re: ZFS disables nfs/server on a host
I just rebooted this host this morning and the same thing happened again. I have the core file from zfs.

[ Apr 26 07:47:01 Executing start method ("/lib/svc/method/nfs-server start") ]
Assertion failed: pclose(fp) == 0, file ../common/libzfs_mount.c, line 380, function zfs_share
Abort - core dumped

Why would nfs/server be disabled instead of going into maintenance with this error?
[zfs-discuss] Re: ZFS disables nfs/server on a host
It does seem like an ordering problem, but nfs/server should be starting up late enough with SMF dependencies. I need to see if I can duplicate the problem on a test system...
[zfs-discuss] ZFS disables nfs/server on a host
I have an Ultra 10 client running Sol10 U3 that has a zfs pool set up on the extra space of the internal ide disk. There's just the one fs and it is shared with the sharenfs property. When this system reboots nfs/server ends up getting disabled, and this is the error from the SMF logs:

[ Apr 16 08:41:22 Executing start method ("/lib/svc/method/nfs-server start") ]
[ Apr 16 08:41:24 Method "start" exited with status 0 ]
[ Apr 18 10:59:23 Executing start method ("/lib/svc/method/nfs-server start") ]
Assertion failed: pclose(fp) == 0, file ../common/libzfs_mount.c, line 380, function zfs_share

If I re-enable nfs/server after the system is up it's fine. The system was recently upgraded to use zfs, and this has happened on the last two reboots. We have lots of other systems that share nfs through zfs fine, and I didn't see a similar problem on the list. Any ideas?

Ben
[zfs-discuss] Re: Re[2]: System hang caused by a "bad" snapshot
The problem happened again last night, but for a different user's filesystem. I took a crash dump with it hung, and the back trace looks like this:

> ::status
debugging crash dump vmcore.0 (64-bit) from hostname
operating system: 5.11 snv_40 (sun4u)
panic message: sync initiated
dump content: kernel pages only
> ::stack
0xf0046a3c(f005a4d8, 2a100047818, 181d010, 18378a8, 1849000, f005a4d8)
prom_enter_mon+0x24(2, 183c000, 18b7000, 2a100046c61, 1812158, 181b4c8)
debug_enter+0x110(0, a, a, 180fc00, 0, 183e000)
abort_seq_softintr+0x8c(180fc00, 18abc00, 180c000, 2a100047d98, 1, 1859800)
intr_thread+0x170(600019de0e0, 0, 6000d7bfc98, 600019de110, 600019de110, 600019de110)
zfs_delete_thread_target+8(600019de080, , 0, 600019de080, 6000d791ae8, 60001aed428)
zfs_delete_thread+0x164(600019de080, 6000d7bfc88, 1, 2a100c4faca, 2a100c4fac8, 600019de0e0)
thread_start+4(600019de080, 0, 0, 0, 0, 0)

In single user I set the mountpoint for that user to none and then brought the system up fine. Then I destroyed the snapshots for that user and their filesystem mounted fine. In this case the quota was reached with the snapshots, and 52% used without.

Ben
[zfs-discuss] Re: Re[2]: System hang caused by a "bad" snapshot
> Hello Matthew,
>
> Tuesday, September 12, 2006, 7:57:45 PM, you wrote:
>
> MA> Ben Miller wrote:
> >> I had a strange ZFS problem this morning. The entire system would
> >> hang when mounting the ZFS filesystems. After trial and error I
> >> determined that the problem was with one of the 2500 ZFS filesystems.
> >> When mounting that user's home the system would hang and need to be
> >> rebooted. After I removed the snapshots (9 of them) for that
> >> filesystem everything was fine.
> >>
> >> I don't know how to reproduce this and didn't get a crash dump. I
> >> don't remember seeing anything about this before so I wanted to
> >> report it and see if anyone has any ideas.
>
> MA> Hmm, that sounds pretty bizarre, since I don't think that mounting a
> MA> filesystem really interacts with snapshots at all.
> MA> Unfortunately, I don't think we'll be able to diagnose this without a
> MA> crash dump or reproducibility. If it happens again, force a crash dump
> MA> while the system is hung and we can take a look at it.
>
> Maybe it wasn't hung after all. I've seen similar behavior here
> sometimes. Were the disks used in the pool actually working?

There was lots of activity on the disks (iostat and status LEDs) until it got to this one filesystem, and then everything stopped. 'zpool iostat 5' stopped running, the shell wouldn't respond, and activity on the disks stopped. This fs is relatively small (175M used of a 512M quota).

> Sometimes it takes a lot of time (30-50 minutes) to mount a file system
> - it's rare, but it happens. And during this ZFS reads from those
> disks in the pool. I did report it here some time ago.

In my case the system crashed during the evening and it was left hung when I came in during the morning, so it was hung for a good 9-10 hours.

Ben
[zfs-discuss] System hang caused by a "bad" snapshot
I had a strange ZFS problem this morning. The entire system would hang when mounting the ZFS filesystems. After trial and error I determined that the problem was with one of the 2500 ZFS filesystems. When mounting that user's home the system would hang and need to be rebooted. After I removed the snapshots (9 of them) for that filesystem everything was fine.

I don't know how to reproduce this and didn't get a crash dump. I don't remember seeing anything about this before, so I wanted to report it and see if anyone has any ideas. The system is a Sun Fire 280R with 3GB of RAM running SXCR b40. The pool looks like this (I'm running a scrub currently):

# zpool status pool1
  pool: pool1
 state: ONLINE
 scrub: scrub in progress, 78.61% done, 0h18m to go
config:

        NAME         STATE     READ WRITE CKSUM
        pool1        ONLINE       0     0     0
          raidz      ONLINE       0     0     0
            c1t8d0   ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0

errors: No known data errors

Ben