[zfs-discuss] Disk keeps resilvering, was: Replacing a disk never completes

2010-09-30 Thread Ben Miller

On 09/22/10 04:27 PM, Ben Miller wrote:

On 09/21/10 09:16 AM, Ben Miller wrote:



I had tried a clear a few times with no luck. I just did a detach and that
did remove the old disk and has now triggered another resilver which
hopefully works. I had tried a remove rather than a detach before, but that
doesn't work on raidz2...

thanks,
Ben


I made some progress. That resilver completed with 4 errors. I cleared
those and still had the one error "<metadata>:<0x0>" so I started a scrub.
The scrub restarted the resilver on c4t0d0 again though! There currently
are no errors anyway, but the resilver will be running for the next day+.
Is this another bug or will doing a scrub eventually lead to a scrub of the
pool instead of the resilver?

Ben


	Well, not much progress.  The one permanent error "<metadata>:<0x0>" came 
back.  And the disk keeps wanting to resilver when trying to do a scrub. 
Now after the last resilver I have more checksum errors on the pool, but 
not on any disks:

NAME        STATE   READ WRITE CKSUM
pool2       ONLINE     0     0    37
...
  raidz2-1  ONLINE     0     0    74

All other checksum totals are 0.  So three problems:
1. How to get the disk to stop resilvering?

	2. How do you get checksum errors on the pool, but no disk is identified?
If I clear them and let the resilver go again, more checksum errors
appear.  So how do I get rid of these errors?

	3. How to get rid of the metadata:0x0 error?  I'm currently destroying old
snapshots (though that bug was fixed quite a while ago and I'm running
b134).  I can try unmounting filesystems and remounting next (all are
currently mounted).  I can also schedule a reboot for next week if anyone
thinks that would help.  A rough sketch of the commands I'm trying is below.
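
A rough sketch of the commands in question for 2) and 3); the filesystem and
snapshot names are placeholders, not the real ones:

  # zpool status -v pool2               (re-check the pool-level CKSUM counters)
  # zpool clear pool2
  # zfs list -t snapshot -r pool2       (find old snapshots to destroy)
  # zfs destroy pool2/somefs@oldsnap
  # zfs unmount pool2/somefs            (then remount, before trying a reboot)
  # zfs mount pool2/somefs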


thanks,
Ben



Re: [zfs-discuss] Replacing a disk never completes

2010-09-22 Thread Ben Miller

On 09/21/10 09:16 AM, Ben Miller wrote:

On 09/20/10 10:45 AM, Giovanni Tirloni wrote:

On Thu, Sep 16, 2010 at 9:36 AM, Ben Miller <bmil...@mail.eecis.udel.edu> wrote:

I have an X4540 running b134 where I'm replacing 500GB disks with 2TB
disks (Seagate Constellation) and the pool seems sick now. The pool
has four raidz2 vdevs (8+2) where the first set of 10 disks were
replaced a few months ago. I replaced two disks in the second set
(c2t0d0, c3t0d0) a couple of weeks ago, but have been unable to get the
third disk to finish replacing (c4t0d0).

I have tried the resilver for c4t0d0 four times now and the pool also
comes up with checksum errors and a permanent error (<metadata>:<0x0>).
The first resilver was from 'zpool replace', which came up with
checksum errors. I cleared the errors which triggered the second
resilver (same result). I then did a 'zpool scrub' which started the
third resilver and also identified three permanent errors (the two
additional were in files in snapshots which I then destroyed). I then
did a 'zpool clear' and then another scrub which started the fourth
resilver attempt. This last attempt identified another file with
errors in a snapshot that I have now destroyed.

Any ideas how to get this disk finished being replaced without
rebuilding the pool and restoring from backup? The pool is working,
but is reporting as degraded and with checksum errors.


[...]

Try to run a `zpool clear pool2` and see if it clears the errors. If not, you
may have to detach `c4t0d0s0/o`.

I believe it's a bug that was fixed in recent builds.


I had tried a clear a few times with no luck. I just did a detach and that
did remove the old disk and has now triggered another resilver which
hopefully works. I had tried a remove rather than a detach before, but that
doesn't work on raidz2...

thanks,
Ben

	I made some progress.  That resilver completed with 4 errors.  I cleared 
those and still had the one error "<metadata>:<0x0>" so I started a scrub. 
 The scrub restarted the resilver on c4t0d0 again though!  There currently 
are no errors anyway, but the resilver will be running for the next day+. 
Is this another bug or will doing a scrub eventually lead to a scrub of the 
pool instead of the resilver?


Ben


Re: [zfs-discuss] Replacing a disk never completes

2010-09-21 Thread Ben Miller

On 09/20/10 10:45 AM, Giovanni Tirloni wrote:

On Thu, Sep 16, 2010 at 9:36 AM, Ben Miller <bmil...@mail.eecis.udel.edu> wrote:

I have an X4540 running b134 where I'm replacing 500GB disks with 2TB
disks (Seagate Constellation) and the pool seems sick now.  The pool
has four raidz2 vdevs (8+2) where the first set of 10 disks were
replaced a few months ago.  I replaced two disks in the second set
(c2t0d0, c3t0d0) a couple of weeks ago, but have been unable to get the
third disk to finish replacing (c4t0d0).

I have tried the resilver for c4t0d0 four times now and the pool also
comes up with checksum errors and a permanent error (<metadata>:<0x0>).
  The first resilver was from 'zpool replace', which came up with
checksum errors.  I cleared the errors which triggered the second
resilver (same result).  I then did a 'zpool scrub' which started the
third resilver and also identified three permanent errors (the two
additional were in files in snapshots which I then destroyed).  I then
did a 'zpool clear' and then another scrub which started the fourth
resilver attempt.  This last attempt identified another file with
errors in a snapshot that I have now destroyed.

Any ideas how to get this disk finished being replaced without
rebuilding the pool and restoring from backup?  The pool is working,
but is reporting as degraded and with checksum errors.


[...]

Try to run a `zpool clear pool2` and see if it clears the errors. If not, you
may have to detach `c4t0d0s0/o`.

I believe it's a bug that was fixed in recent builds.

	I had tried a clear a few times with no luck.  I just did a detach and 
that did remove the old disk and has now triggered another resilver which 
hopefully works.  I had tried a remove rather than a detach before, but 
that doesn't work on raidz2...
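
For completeness, the sequence here was roughly the following (pool and device
names as in the status output from my first message):

  # zpool clear pool2                   (tried a few times, no luck)
  # zpool detach pool2 c4t0d0s0/o       (drops the old half of replacing-8)
  # zpool status -v pool2               (watch the new resilver)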


thanks,
Ben


--
Giovanni Tirloni
gtirl...@sysdroid.com




[zfs-discuss] Replacing a disk never completes

2010-09-16 Thread Ben Miller
I have an X4540 running b134 where I'm replacing 500GB disks with 2TB disks 
(Seagate Constellation) and the pool seems sick now.  The pool has four 
raidz2 vdevs (8+2) where the first set of 10 disks were replaced a few 
months ago.  I replaced two disks in the second set (c2t0d0, c3t0d0) a 
couple of weeks ago, but have been unable to get the third disk to finish 
replacing (c4t0d0).


I have tried the resilver for c4t0d0 four times now and the pool also comes 
up with checksum errors and a permanent error (<metadata>:<0x0>).  The 
first resilver was from 'zpool replace', which came up with checksum 
errors.  I cleared the errors which triggered the second resilver (same 
result).  I then did a 'zpool scrub' which started the third resilver and 
also identified three permanent errors (the two additional were in files in 
snapshots which I then destroyed).  I then did a 'zpool clear' and then 
another scrub which started the fourth resilver attempt.  This last attempt 
identified another file with errors in a snapshot that I have now destroyed.
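
For reference, the cycle so far has looked roughly like this (the snapshot
name is a placeholder for the ones that held the damaged files):

  # zpool replace pool2 c4t0d0          (1st resilver; ended with checksum errors)
  # zpool clear pool2                   (kicked off the 2nd resilver, same result)
  # zpool scrub pool2                   (3rd resilver; found 3 permanent errors)
  # zfs destroy pool2/somefs@somesnap   (destroyed the snapshots holding bad files)
  # zpool clear pool2
  # zpool scrub pool2                   (4th resilver attempt)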


Any ideas how to get this disk finished being replaced without rebuilding 
the pool and restoring from backup?  The pool is working, but is reporting 
as degraded and with checksum errors.


Here is what the pool currently looks like:

 # zpool status -v pool2
  pool: pool2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 33h9m with 4 errors on Thu Sep 16 00:28:14
config:

NAME              STATE     READ WRITE CKSUM
pool2             DEGRADED     0     0     8
  raidz2-0        ONLINE       0     0     0
    c0t4d0        ONLINE       0     0     0
    c1t4d0        ONLINE       0     0     0
    c2t4d0        ONLINE       0     0     0
    c3t4d0        ONLINE       0     0     0
    c4t4d0        ONLINE       0     0     0
    c5t4d0        ONLINE       0     0     0
    c2t5d0        ONLINE       0     0     0
    c3t5d0        ONLINE       0     0     0
    c4t5d0        ONLINE       0     0     0
    c5t5d0        ONLINE       0     0     0
  raidz2-1        DEGRADED     0     0    14
    c0t5d0        ONLINE       0     0     0
    c1t5d0        ONLINE       0     0     0
    c2t1d0        ONLINE       0     0     0
    c3t1d0        ONLINE       0     0     0
    c4t1d0        ONLINE       0     0     0
    c5t1d0        ONLINE       0     0     0
    c2t0d0        ONLINE       0     0     0
    c3t0d0        ONLINE       0     0     0
    replacing-8   DEGRADED     0     0     0
      c4t0d0s0/o  OFFLINE      0     0     0
      c4t0d0      ONLINE       0     0     0  268G resilvered
    c5t0d0        ONLINE       0     0     0
  raidz2-2        ONLINE       0     0     0
    c0t6d0        ONLINE       0     0     0
    c1t6d0        ONLINE       0     0     0
    c2t6d0        ONLINE       0     0     0
    c3t6d0        ONLINE       0     0     0
    c4t6d0        ONLINE       0     0     0
    c5t6d0        ONLINE       0     0     0
    c2t7d0        ONLINE       0     0     0
    c3t7d0        ONLINE       0     0     0
    c4t7d0        ONLINE       0     0     0
    c5t7d0        ONLINE       0     0     0
  raidz2-3        ONLINE       0     0     0
    c0t7d0        ONLINE       0     0     0
    c1t7d0        ONLINE       0     0     0
    c2t3d0        ONLINE       0     0     0
    c3t3d0        ONLINE       0     0     0
    c4t3d0        ONLINE       0     0     0
    c5t3d0        ONLINE       0     0     0
    c2t2d0        ONLINE       0     0     0
    c3t2d0        ONLINE       0     0     0
    c4t2d0        ONLINE       0     0     0
    c5t2d0        ONLINE       0     0     0
logs
  mirror-4        ONLINE       0     0     0
    c0t1d0s0      ONLINE       0     0     0
    c1t3d0s0      ONLINE       0     0     0
cache
  c0t3d0s7        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

<metadata>:<0x0>
<0x167a2>:<0x552ed>
(This second file was in a snapshot I destroyed after the resilver 
completed).


# zpool list pool2
NAME    SIZE  ALLOC   FREE  CAP  DEDUP    HEALTH  ALTROOT
pool2  31.8T  13.8T  17.9T  43%  1.65x  DEGRADED  -

The slog is a mirror of two SLC SSDs and the L2ARC is an MLC SSD.
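
For reference, log and cache devices like these are attached with something
along these lines (the device names are the ones in the status output above;
this is a generic sketch, not the exact history of this pool):

  # zpool add pool2 log mirror c0t1d0s0 c1t3d0s0
  # zpool add pool2 cache c0t3d0s7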

thanks,
Ben

Re: [zfs-discuss] Pool is wrong size in b134

2010-06-17 Thread Ben Miller

Cindy,
The other two pools are two-disk mirrors (rpool and another).

Ben

Cindy Swearingen wrote:

Hi Ben,

Any other details about this pool, like how it might be different from 
the other two pools on this system, might be helpful...


I'm going to try to reproduce this problem.

We'll be in touch.

Thanks,

Cindy

On 06/17/10 07:02, Ben Miller wrote:
I upgraded a server today that has been running SXCE b111 to the 
OpenSolaris preview b134.  It has three pools and two are fine, but 
one comes up with no space available in the pool (SCSI jbod of 300GB 
disks). The zpool version is at 14.


I tried exporting the pool and re-importing and I get several errors 
like this both exporting and importing:


# zpool export pool1
WARNING: metaslab_free_dva(): bad DVA 0:645838978048
WARNING: metaslab_free_dva(): bad DVA 0:645843271168
...

I tried removing the zpool.cache file, rebooting, and importing, and 
received no warnings, but the pool still reports the wrong avail and size.


# zfs list pool1
NAME    USED  AVAIL  REFER  MOUNTPOINT
pool1   396G      0  3.22M  /export/home
# zpool list pool1
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
pool1  476G   341G  135G  71%  1.00x  ONLINE  -
# zpool status pool1
  pool: pool1
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: none requested
config:

NAME         STATE     READ WRITE CKSUM
pool1        ONLINE       0     0     0
  raidz2-0   ONLINE       0     0     0
    c1t8d0   ONLINE       0     0     0
    c1t9d0   ONLINE       0     0     0
    c1t10d0  ONLINE       0     0     0
    c1t11d0  ONLINE       0     0     0
    c1t12d0  ONLINE       0     0     0
    c1t13d0  ONLINE       0     0     0
    c1t14d0  ONLINE       0     0     0

errors: No known data errors

I try exporting and again get the metaslab_free_dva() warnings.  
Imported again with no warnings, but same numbers as above.  If I try 
to remove files or truncate files, I receive "no free space" errors.


I reverted back to b111 and here is what the pool really looks like.

# zfs list pool1
NAME    USED  AVAIL  REFER  MOUNTPOINT
pool1   396G   970G  3.22M  /export/home
# zpool list pool1
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
pool1  1.91T   557G  1.36T  28%  ONLINE  -
# zpool status pool1
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

NAME         STATE     READ WRITE CKSUM
pool1        ONLINE       0     0     0
  raidz2     ONLINE       0     0     0
    c1t8d0   ONLINE       0     0     0
    c1t9d0   ONLINE       0     0     0
    c1t10d0  ONLINE       0     0     0
    c1t11d0  ONLINE       0     0     0
    c1t12d0  ONLINE       0     0     0
    c1t13d0  ONLINE       0     0     0
    c1t14d0  ONLINE       0     0     0

errors: No known data errors

Also, the disks were replaced one at a time last year from 73GB to 
300GB to increase the size of the pool.  Any idea why the pool is 
showing up as the wrong size in b134, and is there anything else to try?  I 
don't want to upgrade the pool version yet and then not be able to 
revert back...


thanks,
Ben



[zfs-discuss] Pool is wrong size in b134

2010-06-17 Thread Ben Miller
I upgraded a server today that has been running SXCE b111 to the 
OpenSolaris preview b134.  It has three pools and two are fine, but one 
comes up with no space available in the pool (SCSI jbod of 300GB disks). 
The zpool version is at 14.


I tried exporting the pool and re-importing and I get several errors like 
this both exporting and importing:


# zpool export pool1
WARNING: metaslab_free_dva(): bad DVA 0:645838978048
WARNING: metaslab_free_dva(): bad DVA 0:645843271168
...

I tried removing the zpool.cache file, rebooting, and importing, and received 
no warnings, but the pool still reports the wrong avail and size.
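
The zpool.cache steps were basically the following (the path is the default
location of the cache file; this is a sketch of what's described above):

  # rm /etc/zfs/zpool.cache
  # reboot
  # zpool import pool1                  (imports with no warnings, but the size
                                         and avail are still wrong)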


# zfs list pool1
NAME    USED  AVAIL  REFER  MOUNTPOINT
pool1   396G      0  3.22M  /export/home
# zpool list pool1
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
pool1  476G   341G  135G  71%  1.00x  ONLINE  -
# zpool status pool1
  pool: pool1
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: none requested
config:

NAME         STATE     READ WRITE CKSUM
pool1        ONLINE       0     0     0
  raidz2-0   ONLINE       0     0     0
    c1t8d0   ONLINE       0     0     0
    c1t9d0   ONLINE       0     0     0
    c1t10d0  ONLINE       0     0     0
    c1t11d0  ONLINE       0     0     0
    c1t12d0  ONLINE       0     0     0
    c1t13d0  ONLINE       0     0     0
    c1t14d0  ONLINE       0     0     0

errors: No known data errors

I try exporting and again get the metaslab_free_dva() warnings.  Imported 
again with no warnings, but same numbers as above.  If I try to remove 
files or truncate files, I receive "no free space" errors.


I reverted back to b111 and here is what the pool really looks like.

# zfs list pool1
NAME    USED  AVAIL  REFER  MOUNTPOINT
pool1   396G   970G  3.22M  /export/home
# zpool list pool1
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
pool1  1.91T   557G  1.36T  28%  ONLINE  -
# zpool status pool1
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

NAME         STATE     READ WRITE CKSUM
pool1        ONLINE       0     0     0
  raidz2     ONLINE       0     0     0
    c1t8d0   ONLINE       0     0     0
    c1t9d0   ONLINE       0     0     0
    c1t10d0  ONLINE       0     0     0
    c1t11d0  ONLINE       0     0     0
    c1t12d0  ONLINE       0     0     0
    c1t13d0  ONLINE       0     0     0
    c1t14d0  ONLINE       0     0     0

errors: No known data errors

Also, the disks were replaced one at a time last year from 73GB to 300GB to 
increase the size of the pool.  Any idea why the pool is showing up as the 
wrong size in b134, and is there anything else to try?  I don't want to upgrade 
the pool version yet and then not be able to revert back...
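
The replacement last year was the usual one-at-a-time loop, roughly:

  # zpool replace pool1 c1t8d0          (swap in the 300GB disk for the 73GB one)
  # zpool status pool1                  (wait for the resilver to finish)
  ... repeat for c1t9d0 through c1t14d0 ...
  # zpool list pool1                    (the extra space showed up once the last
                                         disk was done)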


thanks,
Ben



Re: [zfs-discuss] zpool status -x strangeness

2009-01-28 Thread Ben Miller
# zpool status -xv
all pools are healthy

Ben

> What does 'zpool status -xv' show?
> 
> On Tue, Jan 27, 2009 at 8:01 AM, Ben Miller
>  wrote:
> > I forgot the pool that's having problems was
> recreated recently so it's already at zfs version 3.
> I just did a 'zfs upgrade -a' for another pool, but
> some of those filesystems failed since they are busy
>  and couldn't be unmounted.
> 
> > # zfs upgrade -a
> > cannot unmount '/var/mysql': Device busy
> > cannot unmount '/var/postfix': Device busy
> > 
> > 6 filesystems upgraded
> > 821 filesystems already at this version
> >
> > Ben
> >


Re: [zfs-discuss] zpool status -x strangeness

2009-01-27 Thread Ben Miller
I forgot the pool that's having problems was recreated recently so it's already 
at zfs version 3.  I just did a 'zfs upgrade -a' for another pool, but some of 
those filesystems failed since they are busy and couldn't be unmounted.

# zfs upgrade -a
cannot unmount '/var/mysql': Device busy
cannot unmount '/var/postfix': Device busy

6 filesystems upgraded
821 filesystems already at this version

Ben

> You can upgrade live.  'zfs upgrade' with no
> arguments shows you the  
> zfs version status of filesystems present without
> upgrading.
> 
> 
> 
> On Jan 24, 2009, at 10:19 AM, Ben Miller
>  wrote:
> 
> > We haven't done 'zfs upgrade ...' any.  I'll give
> that a try the  
> > next time the system can be taken down.
> >
> > Ben
> >
> >> A little gotcha that I found in my 10u6 update
> >> process was that 'zpool
> >> upgrade [poolname]' is not the same as 'zfs
> upgrade
> >> [poolname]/[filesystem(s)]'
> >>
> >> What does 'zfs upgrade' say?  I'm not saying this
> is
> >> the source of
> >> your problem, but it's a detail that seemed to
> affect
> >> stability for
> >> me.
> >>
> >>


Re: [zfs-discuss] zpool status -x strangeness

2009-01-24 Thread Ben Miller
We haven't done a 'zfs upgrade ...' on any of them.  I'll give that a try the 
next time the system can be taken down.
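
For anyone hitting the same gotcha, the commands in question (this is the
general usage; version numbers will vary by build):

  # zpool upgrade pool1                 (upgrades the pool's on-disk format only)
  # zfs upgrade                         (no arguments: just lists the filesystem
                                         versions without changing anything)
  # zfs upgrade -a                      (upgrades every filesystem; ones that
                                         can't be unmounted fail with "Device busy")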

Ben

> A little gotcha that I found in my 10u6 update
> process was that 'zpool
> upgrade [poolname]' is not the same as 'zfs upgrade
> [poolname]/[filesystem(s)]'
> 
> What does 'zfs upgrade' say?  I'm not saying this is
> the source of
> your problem, but it's a detail that seemed to affect
> stability for
> me.
> 
> 
> On Thu, Jan 22, 2009 at 7:25 AM, Ben Miller
> > The pools are upgraded to version 10.  Also, this
> is on Solaris 10u6.
> >


Re: [zfs-discuss] zpool status -x strangeness

2009-01-22 Thread Ben Miller
The pools are upgraded to version 10.  Also, this is on Solaris 10u6.

# zpool upgrade
This system is currently running ZFS pool version 10.

All pools are formatted using this version.

Ben

> What's the output of 'zfs upgrade' and 'zpool
> upgrade'? (I'm just
> curious - I had a similar situation which seems to be
> resolved now
> that I've gone to Solaris 10u6 or OpenSolaris
> 2008.11).
> 
> 
> 
> On Wed, Jan 21, 2009 at 2:11 PM, Ben Miller
>  wrote:
> > Bug ID is 6793967.
> >
> > This problem just happened again.
> > % zpool status pool1
> >  pool: pool1
> >  state: DEGRADED
> >  scrub: resilver completed after 0h48m with 0
> errors on Mon Jan  5 12:30:52 2009
> > config:
> >
> >NAME   STATE READ WRITE CKSUM
> >pool1  DEGRADED 0 0 0
> >  raidz2   DEGRADED 0 0 0
> >c4t8d0s0   ONLINE   0 0 0
> >c4t9d0s0   ONLINE   0 0 0
> >c4t10d0s0  ONLINE   0 0 0
> >c4t11d0s0  ONLINE   0 0 0
> >c4t12d0s0  REMOVED  0 0 0
> >c4t13d0s0  ONLINE   0 0 0
> >
> > errors: No known data errors
> >
> > % zpool status -x
> > all pools are healthy
> > %
> > # zpool online pool1 c4t12d0s0
> > % zpool status -x
> >  pool: pool1
> >  state: ONLINE
> > status: One or more devices is currently being
> resilvered.  The pool will
> >continue to function, possibly in a degraded
> state.
> > action: Wait for the resilver to complete.
> >  scrub: resilver in progress for 0h0m, 0.12% done,
> 2h38m to go
> > config:
> >
> >NAME   STATE READ WRITE CKSUM
> >pool1  ONLINE   0 0 0
> >  raidz2   ONLINE   0 0 0
> >c4t8d0s0   ONLINE   0 0 0
> >c4t9d0s0   ONLINE   0 0 0
> >c4t10d0s0  ONLINE   0 0 0
> >c4t11d0s0  ONLINE   0 0 0
> >c4t12d0s0  ONLINE   0 0 0
> >c4t13d0s0  ONLINE   0 0 0
> >
> > errors: No known data errors
> > %
> >
> > Ben
> >


Re: [zfs-discuss] zpool status -x strangeness

2009-01-21 Thread Ben Miller
Bug ID is 6793967.

This problem just happened again.
% zpool status pool1
  pool: pool1
 state: DEGRADED
 scrub: resilver completed after 0h48m with 0 errors on Mon Jan  5 12:30:52 2009
config:

NAME           STATE     READ WRITE CKSUM
pool1          DEGRADED     0     0     0
  raidz2       DEGRADED     0     0     0
    c4t8d0s0   ONLINE       0     0     0
    c4t9d0s0   ONLINE       0     0     0
    c4t10d0s0  ONLINE       0     0     0
    c4t11d0s0  ONLINE       0     0     0
    c4t12d0s0  REMOVED      0     0     0
    c4t13d0s0  ONLINE       0     0     0

errors: No known data errors

% zpool status -x
all pools are healthy
%
# zpool online pool1 c4t12d0s0
% zpool status -x
  pool: pool1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.12% done, 2h38m to go
config:

NAME           STATE     READ WRITE CKSUM
pool1          ONLINE       0     0     0
  raidz2       ONLINE       0     0     0
    c4t8d0s0   ONLINE       0     0     0
    c4t9d0s0   ONLINE       0     0     0
    c4t10d0s0  ONLINE       0     0     0
    c4t11d0s0  ONLINE       0     0     0
    c4t12d0s0  ONLINE       0     0     0
    c4t13d0s0  ONLINE       0     0     0

errors: No known data errors
%

Ben

> I just put in a (low priority) bug report on this.
> 
> Ben
> 
> > This post from close to a year ago never received
> a
> > response.  We just had this same thing happen to
> > another server that is running Solaris 10 U6.  One
> of
> > the disks was marked as removed and the pool
> > degraded, but 'zpool status -x' says all pools are
> > healthy.  After doing an 'zpool online' on the
> disk
> > it resilvered in fine.  Any ideas why 'zpool
> status
> > -x' reports all healthy while 'zpool status' shows
> a
> > pool in degraded mode?
> > 
> > thanks,
> > Ben
> > 
> > > We run a cron job that does a 'zpool status -x'
> to
> > > check for any degraded pools.  We just happened
> to
> > > find a pool degraded this morning by running
> > 'zpool
> > > status' by hand and were surprised that it was
> > > degraded as we didn't get a notice from the cron
> > > job.
> > > 
> > > # uname -srvp
> > > SunOS 5.11 snv_78 i386
> > > 
> > > # zpool status -x
> > > all pools are healthy
> > > 
> > > # zpool status pool1
> > >   pool: pool1
> > >  state: DEGRADED
> > >  scrub: none requested
> > > config:
> > > 
> > > NAME         STATE     READ WRITE CKSUM
> > > pool1        DEGRADED     0     0     0
> > >   raidz1     DEGRADED     0     0     0
> > >     c1t8d0   REMOVED      0     0     0
> > >     c1t9d0   ONLINE       0     0     0
> > >     c1t10d0  ONLINE       0     0     0
> > >     c1t11d0  ONLINE       0     0     0
> > > 
> > > errors: No known data errors
> > > 
> > > I'm going to look into it now why the disk is
> > listed
> > > as removed.
> > > 
> > > Does this look like a bug with 'zpool status
> -x'?
> > > 
> > > Ben


Re: [zfs-discuss] zpool status -x strangeness

2009-01-12 Thread Ben Miller
I just put in a (low priority) bug report on this.

Ben

> This post from close to a year ago never received a
> response.  We just had this same thing happen to
> another server that is running Solaris 10 U6.  One of
> the disks was marked as removed and the pool
> degraded, but 'zpool status -x' says all pools are
> healthy.  After doing an 'zpool online' on the disk
> it resilvered in fine.  Any ideas why 'zpool status
> -x' reports all healthy while 'zpool status' shows a
> pool in degraded mode?
> 
> thanks,
> Ben
> 
> > We run a cron job that does a 'zpool status -x' to
> > check for any degraded pools.  We just happened to
> > find a pool degraded this morning by running
> 'zpool
> > status' by hand and were surprised that it was
> > degraded as we didn't get a notice from the cron
> > job.
> > 
> > # uname -srvp
> > SunOS 5.11 snv_78 i386
> > 
> > # zpool status -x
> > all pools are healthy
> > 
> > # zpool status pool1
> >   pool: pool1
> >  state: DEGRADED
> >  scrub: none requested
> > config:
> > 
> > NAME         STATE     READ WRITE CKSUM
> > pool1        DEGRADED     0     0     0
> >   raidz1     DEGRADED     0     0     0
> >     c1t8d0   REMOVED      0     0     0
> >     c1t9d0   ONLINE       0     0     0
> >     c1t10d0  ONLINE       0     0     0
> >     c1t11d0  ONLINE       0     0     0
> > 
> > errors: No known data errors
> > 
> > I'm going to look into it now why the disk is
> listed
> > as removed.
> > 
> > Does this look like a bug with 'zpool status -x'?
> > 
> > Ben


Re: [zfs-discuss] zpool status -x strangeness

2009-01-07 Thread Ben Miller
This post from close to a year ago never received a response.  We just had this 
same thing happen to another server that is running Solaris 10 U6.  One of the 
disks was marked as removed and the pool degraded, but 'zpool status -x' says 
all pools are healthy.  After doing a 'zpool online' on the disk it resilvered 
in fine.  Any ideas why 'zpool status -x' reports all healthy while 'zpool 
status' shows a pool in degraded mode?
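
A minimal sketch of what we see and do, using the device names from the output
quoted below:

  # zpool status pool1                  (shows c1t8d0 REMOVED, pool DEGRADED)
  # zpool status -x                     (still claims "all pools are healthy")
  # zpool online pool1 c1t8d0           (disk resilvers back in fine)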

thanks,
Ben

> We run a cron job that does a 'zpool status -x' to
> check for any degraded pools.  We just happened to
> find a pool degraded this morning by running 'zpool
> status' by hand and were surprised that it was
> degraded as we didn't get a notice from the cron
> job.
> 
> # uname -srvp
> SunOS 5.11 snv_78 i386
> 
> # zpool status -x
> all pools are healthy
> 
> # zpool status pool1
>   pool: pool1
>  state: DEGRADED
>  scrub: none requested
> config:
> 
> NAME         STATE     READ WRITE CKSUM
> pool1        DEGRADED     0     0     0
>   raidz1     DEGRADED     0     0     0
>     c1t8d0   REMOVED      0     0     0
>     c1t9d0   ONLINE       0     0     0
>     c1t10d0  ONLINE       0     0     0
>     c1t11d0  ONLINE       0     0     0
> 
> errors: No known data errors
> 
> I'm going to look into it now why the disk is listed
> as removed.
> 
> Does this look like a bug with 'zpool status -x'?
> 
> Ben


[zfs-discuss] zpool status -x strangeness on b78

2008-02-06 Thread Ben Miller
We run a cron job that does a 'zpool status -x' to check for any degraded 
pools.  We just happened to find a pool degraded this morning by running 'zpool 
status' by hand and were surprised that it was degraded as we didn't get a 
notice from the cron job.
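
The cron job is essentially a small wrapper like the one below (the recipient
and script path are site-specific assumptions), which is exactly the check
that the behavior below defeats:

  #!/bin/sh
  # periodic check run from cron; mail only when zpool status -x complains
  out=`/usr/sbin/zpool status -x`
  if [ "$out" != "all pools are healthy" ]; then
          echo "$out" | mailx -s "zpool problem on `hostname`" root
  fi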

# uname -srvp
SunOS 5.11 snv_78 i386

# zpool status -x
all pools are healthy

# zpool status pool1
  pool: pool1
 state: DEGRADED
 scrub: none requested
config:

NAME         STATE     READ WRITE CKSUM
pool1        DEGRADED     0     0     0
  raidz1     DEGRADED     0     0     0
    c1t8d0   REMOVED      0     0     0
    c1t9d0   ONLINE       0     0     0
    c1t10d0  ONLINE       0     0     0
    c1t11d0  ONLINE       0     0     0

errors: No known data errors

I'm going to look into it now why the disk is listed as removed.

Does this look like a bug with 'zpool status -x'?

Ben
 
 


Re: [zfs-discuss] System hang caused by a "bad" snapshot

2007-09-18 Thread Ben Miller
> > > Hello Matthew,
> > > Tuesday, September 12, 2006, 7:57:45 PM, you
> > wrote:
> > > MA> Ben Miller wrote:
> > > >> I had a strange ZFS problem this morning.
>  The
>  > entire system would
>  > >> hang when mounting the ZFS filesystems.  After
>  > trial and error I
> > >> determined that the problem was with one of
>  the
>  > 2500 ZFS filesystems.
> > >> When mounting that users' home the system
>  would
>  > hang and need to be
>  > >> rebooted.  After I removed the snapshots (9 of
>  > them) for that
>  > >> filesystem everything was fine.
>  > >> 
>  > >> I don't know how to reproduce this and didn't
>  get
>  > a crash dump.  I
>  > >> don't remember seeing anything about this
>  before
>  > so I wanted to
>  > >> report it and see if anyone has any ideas.
>  > 
> > MA> Hmm, that sounds pretty bizarre, since I
>  don't
>  > think that mounting a 
>  > MA> filesystem doesn't really interact with
>  snapshots
>  > at all. 
>  > MA> Unfortunately, I don't think we'll be able to
>  > diagnose this without a 
>  > MA> crash dump or reproducibility.  If it happens
>  > again, force a crash dump
>  > MA> while the system is hung and we can take a
>  look
>  > at it.
>  > 
>  > Maybe it wasn't hung after all. I've seen similar
>  > behavior here
>  > sometimes. Did your disks used in a pool were
>  > actually working?
>  > 
>  
>  There was lots of activity on the disks (iostat and
> status LEDs) until it got to this one filesystem
>  and
>  everything stopped.  'zpool iostat 5' stopped
>  running, the shell wouldn't respond and activity on
>  the disks stopped.  This fs is relatively small
>(175M used of a 512M quota).
>  Sometimes it takes a lot of time (30-50minutes) to
>  > mount a file system
>  > - it's rare, but it happens. And during this ZFS
>  > reads from those
>  > disks in a pool. I did report it here some time
>  ago.
>  > 
>  In my case the system crashed during the evening
>  and it was left hung up when I came in during the
>   morning, so it was hung for a good 9-10 hours.
> 
> The problem happened again last night, but for a
> different users' filesystem.  I took a crash dump
> with it hung and the back trace looks like this:
> > ::status
> debugging crash dump vmcore.0 (64-bit) from hostname
> operating system: 5.11 snv_40 (sun4u)
> panic message: sync initiated
> dump content: kernel pages only
> > ::stack
> 0xf0046a3c(f005a4d8, 2a100047818, 181d010, 18378a8,
> 1849000, f005a4d8)
> prom_enter_mon+0x24(2, 183c000, 18b7000, 2a100046c61,
> 1812158, 181b4c8)
> debug_enter+0x110(0, a, a, 180fc00, 0, 183e000)
> abort_seq_softintr+0x8c(180fc00, 18abc00, 180c000,
> 2a100047d98, 1, 1859800)
> intr_thread+0x170(600019de0e0, 0, 6000d7bfc98,
> 600019de110, 600019de110, 
> 600019de110)
> zfs_delete_thread_target+8(600019de080,
> , 0, 600019de080, 
> 6000d791ae8, 60001aed428)
> zfs_delete_thread+0x164(600019de080, 6000d7bfc88, 1,
> 2a100c4faca, 2a100c4fac8, 
> 600019de0e0)
> thread_start+4(600019de080, 0, 0, 0, 0, 0)
> 
> In single user I set the mountpoint for that user to
> be none and then brought the system up fine.  Then I
> destroyed the snapshots for that user and their
> filesystem mounted fine.  In this case the quota was
> reached with the snapshots and 52% used without.
> 
> Ben

Hate to re-open something from a year ago, but we just had this problem happen 
again.  We have been running Solaris 10u3 on this system for a while.  I 
searched the bug reports, but couldn't find anything on this.  I also think I 
understand what happened a little more.  We take snapshots at noon and the 
system hung up during that time.  When trying to reboot, the system would hang 
on the ZFS mounts.  After I boot into single user and remove the snapshot from 
the filesystem causing the problem, everything is fine.  The filesystem in 
question is at 100% use with snapshots included.
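
The recovery steps, roughly (the dataset and mountpoint names are placeholders
for the affected home directory):

  ok boot -s                                    (boot single user)
  # zfs set mountpoint=none pool1/home/baduser  (so the boot can finish)
  # zfs list -t snapshot -r pool1/home/baduser
  # zfs destroy pool1/home/baduser@snap         (the snapshot(s) pinning it at 100%)
  # zfs set mountpoint=/export/home/baduser pool1/home/baduser
  # zfs mount pool1/home/baduser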

Here's the back trace for the system when it was hung:
> ::stack
0xf0046a3c(f005a4d8, 2a10004f828, 0, 181c850, 1848400, f005a4d8)
prom_enter_mon+0x24(0, 0, 183b400, 1, 1812140, 181ae60)
debug_enter+0x118(0, a, a, 180fc00, 0, 183d400)
abort_seq_softintr+0x94(180fc00, 18a9800, 180c000, 2a10004fd98, 1, 1857c00)
intr_thread+0x170(2, 30007b64bc0, 0, c001ed9, 110, 6000240)
0x985c8(300adca4c40, 0, 0, 0, 0, 30007b64bc0)
dbuf_hold_impl+0x28(60008cd02e8, 0, 0, 0, 7b648d73, 2a105bb57c8)
dbuf_hold_level+0x18(60008cd02e8, 0, 

[zfs-discuss] Re: Remove files when at quota limit

2007-05-15 Thread Ben Miller
Has anyone else run into this situation?  Does anyone have any solutions other 
than removing snapshots or increasing the quota?  I'd like to put in an RFE to 
reserve some space so files can be removed when users are at their quota.  Any 
thoughts from the ZFS team?

Ben

> We have around 1000 users all with quotas set on
> their ZFS filesystems on Solaris 10 U3.  We take
> snapshots daily and rotate out the week old ones.
> The situation is that some users ignore the advice
> of keeping space used below 80% and keep creating
> large temporary files.  They then try to remove
> files when the space used is 100% and get over quota
> messages.  We then need to remove some or all of
> their snapshots to free space.  Is there anything
> being worked on to keep some space reserved so files
> can be removed when at the quota limit or some other
> solution?  What are other people doing in this
> situation?  We have also set up alternate
> filesystems for users with transient data that we do
> not take snapshots on, but we still have this
>  problem on home directories.
> 
> thanks,
> Ben
 
 


[zfs-discuss] Remove files when at quota limit

2007-05-10 Thread Ben Miller
We have around 1000 users all with quotas set on their ZFS filesystems on 
Solaris 10 U3.  We take snapshots daily and rotate out the week old ones.  The 
situation is that some users ignore the advice of keeping space used below 80% 
and keep creating large temporary files.  They then try to remove files when 
the space used is 100% and get over quota messages.  We then need to remove 
some or all of their snapshots to free space.  Is there anything being worked 
on to keep some space reserved so files can be removed when at the quota limit 
or some other solution?  What are other people doing in this situation?  We 
have also set up alternate filesystems for users with transient data that we do 
not take snapshots on, but we still have this problem on home directories.
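
What we do today when a user hits the limit, plus (for the archive) the
property that later ZFS releases added for this; the names and sizes are
placeholders, and refquota is not available on the 10 U3 systems described
here:

  # zfs list -t snapshot -r pool1/home/someuser   (find the week's snapshots)
  # zfs destroy pool1/home/someuser@daily.mon     (free enough space for the rm)

  Later builds add a 'refquota' property that limits only the live data, so
  snapshot space no longer counts against the user's limit:

  # zfs set refquota=2G pool1/home/someuser
  # zfs set quota=4G pool1/home/someuser          (overall cap, snapshots included)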

thanks,
Ben
 
 


[zfs-discuss] Re: Re: ZFS disables nfs/server on a host

2007-04-27 Thread Ben Miller
I just put a truss in the SMF script and rebooted the test system, and it 
failed again.
The truss output is at http://www.eecis.udel.edu/~bmiller/zfs.truss-Apr27-2007

thanks,
Ben
 
 


[zfs-discuss] Re: ZFS disables nfs/server on a host

2007-04-26 Thread Ben Miller
I was able to duplicate this problem on a test Ultra 10.  I put in a workaround 
by adding a service that depends on /milestone/multi-user-server which does a 
'zfs share -a'.  It's strange this hasn't happened on other systems, but maybe 
it's related to slower systems...
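
For the archive, the manual recovery and the gist of the workaround (the site
service itself is local; FMRIs spelled out here):

  # svcs -x nfs/server                    (shows the service disabled after boot)
  # svcadm enable svc:/network/nfs/server:default
  # zfs share -a                          (what the extra service runs once
                                           milestone/multi-user-server is up)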

Ben
 
 


[zfs-discuss] Re: ZFS disables nfs/server on a host

2007-04-26 Thread Ben Miller
I just rebooted this host this morning and the same thing happened again.  I 
have the core file from zfs.

[ Apr 26 07:47:01 Executing start method ("/lib/svc/method/nfs-server start") ]
Assertion failed: pclose(fp) == 0, file ../common/libzfs_mount.c, line 380, 
function zfs_share
Abort - core dumped

Why would nfs/server be disabled instead of going into maintenance with this 
error?
 
 


[zfs-discuss] Re: ZFS disables nfs/server on a host

2007-04-19 Thread Ben Miller
It does seem like an ordering problem, but nfs/server should be starting up 
late enough with SMF dependencies.  I need to see if I can duplicate the 
problem on a test system...
 
 


[zfs-discuss] ZFS disables nfs/server on a host

2007-04-19 Thread Ben Miller
I have an Ultra 10 client running Sol10 U3 that has a zfs pool set up on the 
extra space of the internal ide disk.  There's just the one fs and it is shared 
with the sharenfs property.  When this system reboots nfs/server ends up 
getting disabled and this is the error from the SMF logs:

[ Apr 16 08:41:22 Executing start method ("/lib/svc/method/nfs-server start") ]
[ Apr 16 08:41:24 Method "start" exited with status 0 ]
[ Apr 18 10:59:23 Executing start method ("/lib/svc/method/nfs-server start") ]
Assertion failed: pclose(fp) == 0, file ../common/libzfs_mount.c, line 380, 
function zfs_share

If I re-enable nfs/server after the system is up it's fine.  The system was 
recently upgraded to use zfs and this has happened on the last two reboots.  We 
have lots of other systems that share nfs through zfs fine and I didn't see a 
similar problem on the list.  Any ideas?

Ben
 
 


[zfs-discuss] Re: Re[2]: System hang caused by a "bad" snapshot

2006-09-27 Thread Ben Miller
> > Hello Matthew,
> > Tuesday, September 12, 2006, 7:57:45 PM, you
> wrote:
> > MA> Ben Miller wrote:
> > >> I had a strange ZFS problem this morning.  The
> > entire system would
> > >> hang when mounting the ZFS filesystems.  After
> > trial and error I
> > >> determined that the problem was with one of the
> > 2500 ZFS filesystems.
> > >> When mounting that users' home the system would
> > hang and need to be
> > >> rebooted.  After I removed the snapshots (9 of
> > them) for that
> > >> filesystem everything was fine.
> > >> 
> > >> I don't know how to reproduce this and didn't
> get
> > a crash dump.  I
> > >> don't remember seeing anything about this
> before
> > so I wanted to
> > >> report it and see if anyone has any ideas.
> > 
> > MA> Hmm, that sounds pretty bizarre, since I don't
> > think that mounting a 
> > MA> filesystem doesn't really interact with
> snapshots
> > at all. 
> > MA> Unfortunately, I don't think we'll be able to
> > diagnose this without a 
> > MA> crash dump or reproducibility.  If it happens
> > again, force a crash dump
> > MA> while the system is hung and we can take a
> look
> > at it.
> > 
> > Maybe it wasn't hung after all. I've seen similar
> > behavior here
> > sometimes. Did your disks used in a pool were
> > actually working?
> > 
> 
> There was lots of activity on the disks (iostat and
> status LEDs) until it got to this one filesystem and
> everything stopped.  'zpool iostat 5' stopped
> running, the shell wouldn't respond and activity on
> the disks stopped.  This fs is relatively small
>   (175M used of a 512M quota).
> > Sometimes it takes a lot of time (30-50minutes) to
> > mount a file system
> > - it's rare, but it happens. And during this ZFS
> > reads from those
> > disks in a pool. I did report it here some time
> ago.
> > 
> In my case the system crashed during the evening
> and it was left hung up when I came in during the
>  morning, so it was hung for a good 9-10 hours.
> 
  The problem happened again last night, but for a different user's filesystem. 
I took a crash dump with it hung and the back trace looks like this:
> ::status
debugging crash dump vmcore.0 (64-bit) from hostname
operating system: 5.11 snv_40 (sun4u)
panic message: sync initiated
dump content: kernel pages only
> ::stack
0xf0046a3c(f005a4d8, 2a100047818, 181d010, 18378a8, 1849000, f005a4d8)
prom_enter_mon+0x24(2, 183c000, 18b7000, 2a100046c61, 1812158, 181b4c8)
debug_enter+0x110(0, a, a, 180fc00, 0, 183e000)
abort_seq_softintr+0x8c(180fc00, 18abc00, 180c000, 2a100047d98, 1, 1859800)
intr_thread+0x170(600019de0e0, 0, 6000d7bfc98, 600019de110, 600019de110, 
600019de110)
zfs_delete_thread_target+8(600019de080, , 0, 600019de080, 
6000d791ae8, 60001aed428)
zfs_delete_thread+0x164(600019de080, 6000d7bfc88, 1, 2a100c4faca, 2a100c4fac8, 
600019de0e0)
thread_start+4(600019de080, 0, 0, 0, 0, 0)

In single user I set the mountpoint for that user to be none and then brought 
the system up fine.  Then I destroyed the snapshots for that user and their 
filesystem mounted fine.  In this case the quota was reached with the snapshots 
and 52% used without.
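
For anyone else chasing this: the "sync initiated" panic message above just
means the dump was forced by hand (break to the ok prompt and type 'sync'
while the box is hung).  The dump is then read back with mdb, roughly:

  ok sync                                 (forces the panic and crash dump)
  # cd /var/crash/`hostname`              (default savecore directory)
  # mdb unix.0 vmcore.0
  > ::status
  > ::stack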

Ben
 
 


[zfs-discuss] Re: Re[2]: System hang caused by a "bad" snapshot

2006-09-13 Thread Ben Miller
> Hello Matthew,
> Tuesday, September 12, 2006, 7:57:45 PM, you wrote:
> MA> Ben Miller wrote:
> >> I had a strange ZFS problem this morning.  The
> entire system would
> >> hang when mounting the ZFS filesystems.  After
> trial and error I
> >> determined that the problem was with one of the
> 2500 ZFS filesystems.
> >> When mounting that users' home the system would
> hang and need to be
> >> rebooted.  After I removed the snapshots (9 of
> them) for that
> >> filesystem everything was fine.
> >> 
> >> I don't know how to reproduce this and didn't get
> a crash dump.  I
> >> don't remember seeing anything about this before
> so I wanted to
> >> report it and see if anyone has any ideas.
> 
> MA> Hmm, that sounds pretty bizarre, since I don't
> think that mounting a 
> MA> filesystem doesn't really interact with snapshots
> at all. 
> MA> Unfortunately, I don't think we'll be able to
> diagnose this without a 
> MA> crash dump or reproducibility.  If it happens
> again, force a crash dump
> MA> while the system is hung and we can take a look
> at it.
> 
> Maybe it wasn't hung after all. I've seen similar
> behavior here
> sometimes. Did your disks used in a pool were
> actually working?
> 

  There was lots of activity on the disks (iostat and status LEDs) until it got 
to this one filesystem and everything stopped.  'zpool iostat 5' stopped 
running, the shell wouldn't respond and activity on the disks stopped.  This fs 
is relatively small  (175M used of a 512M quota).

> Sometimes it takes a lot of time (30-50minutes) to
> mount a file system
> - it's rare, but it happens. And during this ZFS
> reads from those
> disks in a pool. I did report it here some time ago.
> 
  In my case the system crashed during the evening and it was left hung up when 
I came in during the morning, so it was hung for a good 9-10 hours.

Ben
 
 


[zfs-discuss] System hang caused by a "bad" snapshot

2006-09-12 Thread Ben Miller
I had a strange ZFS problem this morning.  The entire system would hang when 
mounting the ZFS filesystems.  After trial and error I determined that the 
problem was with one of the 2500 ZFS filesystems.  When mounting that user's 
home the system would hang and need to be rebooted.  After I removed the 
snapshots (9 of them) for that filesystem everything was fine.
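
The cleanup amounted to the following (the dataset name is a placeholder for
that user's home):

  # zfs list -t snapshot -r pool1/home/someuser   (lists the 9 snapshots)
  # zfs destroy pool1/home/someuser@snap1         (repeat for each snapshot)
  # zfs mount -a                                  (mounts now finish without hanging)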

I don't know how to reproduce this and didn't get a crash dump.  I don't 
remember seeing anything about this before so I wanted to report it and see if 
anyone has any ideas.

The system is a Sun Fire 280R with 3GB of RAM running SXCR b40.
The pool looks like this (I'm running a scrub currently):
# zpool status pool1
  pool: pool1
 state: ONLINE
 scrub: scrub in progress, 78.61% done, 0h18m to go
config:

NAME         STATE     READ WRITE CKSUM
pool1        ONLINE       0     0     0
  raidz      ONLINE       0     0     0
    c1t8d0   ONLINE       0     0     0
    c1t9d0   ONLINE       0     0     0
    c1t10d0  ONLINE       0     0     0
    c1t11d0  ONLINE       0     0     0

errors: No known data errors

Ben
 
 