Re: [zfs-discuss] Recovering from an apparent ZFS Hang

2010-07-13 Thread Brian Leonard
Actually, there's still the primary issue of this post - the apparent hang. At 
the moment, I have 3 zpool commands running, all apparently hung and doing 
nothing:

bleon...@opensolaris:~$ ps -ef | grep zpool
root 20465 20411   0 18:10:44 pts/4   0:00 zpool clear r5pool
root 20408 20403   0 18:08:19 pts/3   0:00 zpool status r5pool
root 20396 17612   0 18:08:04 pts/2   0:00 zpool scrub r5pool

You can see that none of them is doing any real work; they all seem to be waiting on something:

bleon...@opensolaris:~# ptime -p 20465
real    12:25.188031517
user     0.004037420
sys      0.008682963

bleon...@opensolaris:~# ptime -p 20408
real    15:03.977246851
user     0.002700817
sys      0.005662413

bleon...@opensolaris:~# ptime -p 20396
real    15:24.793176743
user     0.002954137
sys      0.014851215

And as I said earlier, I can't interrupt or kill any of these processes. Time for a hard reboot.
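
Before I do, here's roughly what I can capture in case it's useful to anyone (a sketch; it assumes pstack and mdb still respond while the pool is wedged, and the PIDs are the ones from the ps output above):

# user-level stacks of the stuck zpool commands
pstack 20465
pstack 20408
pstack 20396

# kernel stacks of every thread, to see where ZFS is blocked (needs root)
echo "::threadlist -v" | mdb -k > /var/tmp/threadlist.out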

/Brian


Re: [zfs-discuss] Recovering from an apparent ZFS Hang

2010-07-13 Thread Brian Leonard
Hi Cindy,

I'm trying to demonstrate how ZFS behaves when a disk fails. The drive 
enclosure I'm using (http://www.icydock.com/product/mb561us-4s-1.html) claims to 
support hot swap, but that's not what I'm experiencing: when I plug the disk 
back in, none of the four disks is recognized until I restart the enclosure.

This same demo works fine when using USB sticks, and maybe that's because each 
USB stick has its own controller.
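
In case it helps, this is roughly how I check whether the OS sees the drives again after re-inserting one (a sketch; these are standard OpenSolaris commands, nothing enclosure-specific is assumed):

# list attachment points; the enclosure's disks should reappear here
cfgadm -al

# list removable/USB-attached media
rmformat -l

# non-interactive listing of all disks format knows about
format < /dev/null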

Thanks for your help,
Brian


[zfs-discuss] Recovering from an apparent ZFS Hang

2010-07-12 Thread Brian Leonard
Hi,

I'm currently trying to work with a quad-bay USB drive enclosure. I've created 
a raidz pool as follows:

bleon...@opensolaris:~# zpool status r5pool
  pool: r5pool
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
r5poolONLINE   0 0 0
  raidz1  ONLINE   0 0 0
c1t0d0p0  ONLINE   0 0 0
c1t0d1p0  ONLINE   0 0 0
c1t0d2p0  ONLINE   0 0 0
c1t0d3p0  ONLINE   0 0 0

errors: No known data errors

If I pop a disk and run a zpool scrub, the fault is noted:

bleon...@opensolaris:~# zpool scrub r5pool
bleon...@opensolaris:~# zpool status r5pool
  pool: r5pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid.  Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h0m with 0 errors on Mon Jul 12 12:35:46 2010
config:

NAME  STATE READ WRITE CKSUM
r5poolDEGRADED 0 0 0
  raidz1  DEGRADED 0 0 0
c1t0d0p0  ONLINE   0 0 0
c1t0d1p0  ONLINE   0 0 0
c1t0d2p0  FAULTED  0 0 0  corrupted data
c1t0d3p0  ONLINE   0 0 0

errors: No known data errors
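
As an aside, the same degraded state can be produced without pulling hardware, which sidesteps the enclosure's hot-swap quirk. A sketch using the standard offline/online subcommands (device name is the one from my pool above):

# take one leaf vdev offline to simulate a failure
zpool offline r5pool c1t0d2p0
zpool status r5pool      # the vdev shows OFFLINE and the raidz1 is DEGRADED

# bring it back; ZFS resilvers whatever was written in the meantime
zpool online r5pool c1t0d2p0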

However, it's when I pop the disk back in that everything goes south. If I run 
a zpool scrub at this point, the command appears to just hang.

Running zpool status again shows the scrub will finish in 2 minutes, but it 
never does. You can see it's been running for 33 minutes already, and there's 
no data in the pool.

bleon...@opensolaris:/r5pool# zpool status r5pool
  pool: r5pool
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 0h33m, 92.41% done, 0h2m to go
config:

NAME  STATE READ WRITE CKSUM
r5poolONLINE   0 0 0
  raidz1  ONLINE   0 0 0
c1t0d0p0  ONLINE   0 0 0
c1t0d1p0  ONLINE   0 0 0
c1t0d2p0  ONLINE   0 0 0
c1t0d3p0  ONLINE   0 0 0

errors: 24 data errors, use '-v' for a list

zpool scrub -s r5pool doesn't have any effect.

I can't even kill the scrub process. Even a reboot command at this point will 
hang the machine, so I have to hard power-cycle the machine to get everything 
back to normal. There must be a more elegant solution, right?
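
For what it's worth, here's what I plan to try on the next attempt (a sketch; failmode is a standard pool property, but I haven't verified that it avoids this particular hang):

# make failed I/O return an error instead of blocking (the default is failmode=wait)
zpool set failmode=continue r5pool

# after re-inserting the disk (and restarting the enclosure if necessary):
zpool clear r5pool        # reset error counts and resume suspended I/O
zpool status -v r5pool    # confirm the pool state before scrubbing again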


Re: [zfs-discuss] Permanent errors detected in :<0x13>

2010-06-30 Thread W Brian Leonard
Well, I was doing a ZFS send / receive to back up a large (60 GB) chunk of 
data, which never completed. A zpool clear at that point just hung and I 
had to reboot the system, after which it appeared to come up clean. As 
soon as I tried the backup again, I noticed the pool reported the error 
you see below - but the backup did complete, as the pool remained online.
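
For reference, the backup was essentially the usual snapshot-and-send pattern (the dataset and target names here are placeholders, not my actual ones):

# snapshot the source, then replicate the snapshot into the backup pool
zfs snapshot tank/data@backup-1
zfs send tank/data@backup-1 | zfs receive external/data-backup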


Thanks for your help Cindy,
Brian

Cindy Swearingen wrote:



I reviewed the zpool clear syntax (looking at my own docs) and didn't
remember that a one-device pool probably doesn't need the device
specified. For pools with many devices, you might want to just clear
the errors on a particular device.

USB sticks for pools are problematic. It would be good to know what
caused these errors to try to prevent them in the future.

We know that USB devices don't generate/fabricate device IDs, so they
are prone to problems when moved, changed, or re-inserted, but without
more info it's hard to tell what happened.

cs

On 06/29/10 14:13, W Brian Leonard wrote:
Interesting, this time it worked! Does specifying the device to clear 
cause the command to behave differently? I had assumed that without the 
device specification, the clear would just apply to all devices in 
the pool (of which there is just the one).


Thanks,
Brian

Cindy Swearingen wrote:

Hi Brian,

Because the pool is still online and the metadata is redundant, maybe
these errors were caused by a brief hiccup from the USB device's
physical connection. You might try:

# zpool clear external c0t0d0p0

Then, run a scrub:

# zpool scrub external

If the above fails, then please identify the Solaris release and what
events preceded this problem.

Thanks,

Cindy




On 06/29/10 11:15, W Brian Leonard wrote:

Hi Cindy,

The scrub didn't help and yes, this is an external USB device.

Thanks,
Brian

Cindy Swearingen wrote:

Hi Brian,

You might try running a scrub on this pool.

Is this an external USB device?

Thanks,

Cindy

On 06/29/10 09:16, Brian Leonard wrote:

Hi,

I have a zpool which is currently reporting that the 
":<0x13>" file is corrupt:


bleon...@opensolaris:~$ pfexec zpool status -xv external
  pool: external
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
externalONLINE   0 0 0
  c0t0d0p0  ONLINE   0 0 0

errors: Permanent errors have been detected in the following files:

:<0x13>

Otherwise, as you can see, the pool is online. As it's unclear to 
me how to restore the ":<0x13>" file, is my only option 
for correcting this error to destroy and recreate the pool?


Thanks,
Brian






--
W Brian Leonard
Principal Product Manager
860.206.6093
http://blogs.sun.com/observatory



Re: [zfs-discuss] Permanent errors detected in :<0x13>

2010-06-30 Thread W Brian Leonard
Interesting, this time it worked! Does specifying the device to clear 
cause the command to behave differently? I had assumed that without the 
device specification, the clear would just apply to all devices in the pool 
(of which there is just the one).


Thanks,
Brian

Cindy Swearingen wrote:

Hi Brian,

Because the pool is still online and the metadata is redundant, maybe
these errors were caused by a brief hiccup from the USB device's
physical connection. You might try:

# zpool clear external c0t0d0p0

Then, run a scrub:

# zpool scrub external

If the above fails, then please identify the Solaris release and what
events preceded this problem.

Thanks,

Cindy




On 06/29/10 11:15, W Brian Leonard wrote:

Hi Cindy,

The scrub didn't help and yes, this is an external USB device.

Thanks,
Brian

Cindy Swearingen wrote:

Hi Brian,

You might try running a scrub on this pool.

Is this an external USB device?

Thanks,

Cindy

On 06/29/10 09:16, Brian Leonard wrote:

Hi,

I have a zpool which is currently reporting that the 
":<0x13>" file is corrupt:


bleon...@opensolaris:~$ pfexec zpool status -xv external
  pool: external
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
externalONLINE   0 0 0
  c0t0d0p0  ONLINE   0 0 0

errors: Permanent errors have been detected in the following files:

:<0x13>

Otherwise, as you can see, the pool is online. As it's unclear to 
me how to restore the ":<0x13>" file, is my only option 
for correcting this error to destroy and recreate the pool?


Thanks,
Brian




--
W Brian Leonard
Principal Product Manager
860.206.6093
http://blogs.sun.com/observatory



Re: [zfs-discuss] Permanent errors detected in :<0x13>

2010-06-30 Thread W Brian Leonard

Hi Cindy,

The scrub didn't help and yes, this is an external USB device.

Thanks,
Brian

Cindy Swearingen wrote:

Hi Brian,

You might try running a scrub on this pool.

Is this an external USB device?

Thanks,

Cindy

On 06/29/10 09:16, Brian Leonard wrote:

Hi,

I have a zpool which is currently reporting that the 
":<0x13>" file is corrupt:


bleon...@opensolaris:~$ pfexec zpool status -xv external
  pool: external
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
externalONLINE   0 0 0
  c0t0d0p0  ONLINE   0 0 0

errors: Permanent errors have been detected in the following files:

:<0x13>

Otherwise, as you can see, the pool is online. As it's unclear to me 
how to restore the ":<0x13>" file, is my only option for 
correcting this error to destroy and recreate the pool?


Thanks,
Brian


--
W Brian Leonard
Principal Product Manager
860.206.6093
http://blogs.sun.com/observatory



[zfs-discuss] Permanent errors detected in :<0x13>

2010-06-29 Thread Brian Leonard
Hi,

I have a zpool which is currently reporting that the ":<0x13>" file 
is corrupt:

bleon...@opensolaris:~$ pfexec zpool status -xv external
  pool: external
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
externalONLINE   0 0 0
  c0t0d0p0  ONLINE   0 0 0

errors: Permanent errors have been detected in the following files:

:<0x13>

Otherwise, as you can see, the pool is online. As it's unclear to me how to 
restore the ":<0x13>" file, is my only option for correcting this 
error to destroy and recreate the pool?
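
Before going the destroy-and-recreate route, this is what I'm planning to try (a sketch; my understanding is that the permanent-error list is refreshed as scrubs complete, so an entry can take a scrub or two to disappear once the underlying problem is gone):

# reset the error counters, then re-verify the pool
zpool clear external
zpool scrub external
zpool status -v external   # check whether <0x13> is still listed

# a second pass is sometimes needed; the error list is refreshed at scrub completion
zpool scrub external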

Thanks,
Brian


Re: [zfs-discuss] invalid vdev configuration

2009-06-04 Thread Brian Leonard
> Check contents of /dev/dsk and /dev/rdsk to see if
> there are some 
> missing links there for devices in question. You may
> want to run
> 
> devfsadm -c disk -sv
> devfsadm -c disk -Csv
> 
> and see if it reports anything.

There were quite a few links it removed, all on c0.
 
> Try to move c6d1p0 and c7d1p0 out of /dev/dsk and
> /dev/rdsk and see if 
> you can import the pool.
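
Here's roughly what I ran, per the suggestion above (the "aside" directories are just scratch locations I picked):

# stash the p0 links out of the way so the import only sees the whole-disk nodes
mkdir -p /var/tmp/dsk-aside /var/tmp/rdsk-aside
mv /dev/dsk/c6d1p0 /dev/dsk/c7d1p0 /var/tmp/dsk-aside/
mv /dev/rdsk/c6d1p0 /dev/rdsk/c7d1p0 /var/tmp/rdsk-aside/

# then retry the import
zpool import vault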

That worked! It was able to import the pool on c6d1 and c7d1.  Clearly I have a
little more reading to do regarding how Solaris manages disks.  Thanks!


Re: [zfs-discuss] invalid vdev configuration

2009-06-04 Thread Brian Leonard
> Hmm... export the pool again.  Then try simply "zpool import" 
> and it should show the way it sees vault.  Reply with that output.

zpool export vault
cannot open 'vault': no such pool


zpool import
  pool: vault
id: 196786381623412270
 state: UNAVAIL
action: The pool cannot be imported due to damaged devices or data.
config:

vault   UNAVAIL  insufficient replicas
  mirrorUNAVAIL  corrupted data
c6d1p0  ONLINE
c7d1p0  ONLINE


Re: [zfs-discuss] invalid vdev configuration

2009-06-04 Thread Brian Leonard
> Since you did not export the pool, it may be looking for the wrong
> devices.  Try this:
> zpool export vault
> zpool import vault

That was the first thing I tried, with no luck.

> Above, I used slice 0 as an example, your system may use a
> different slice.  But you can run zdb -l on all of them to find

Aha, zdb found complete label sets for the "vault" pool on /dev/rdsk/c6d1 and 
c7d1.  The incomplete labels were on c6d1p0 and c7d1p0.  Could I just zpool 
replace c6d1p0 with c6d1 and c7d1p0 with c7d1?
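
In the meantime, this is how I'm checking which device nodes carry a full set of labels (a sketch; zdb -l prints four labels per device, so I just count the failures):

# a count of 0 "failed to unpack" lines means all four labels on that node are readable
for d in /dev/rdsk/c6d1 /dev/rdsk/c6d1p0 /dev/rdsk/c7d1 /dev/rdsk/c7d1p0; do
    echo "== $d =="
    zdb -l $d | grep -c "failed to unpack"
done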


[zfs-discuss] invalid vdev configuration

2009-06-03 Thread Brian Leonard
I had a machine die the other day and take one of its zfs pools with it. I 
booted the new machine, with the same disks but a different SATA controller, 
and the rpool was mounted but another pool "vault" was not.  If I try to import 
it I get "invalid vdev configuration".  fmdump shows zfs.vdev.bad_label, and 
checking the label with zdb I find labels 2 and 3 missing.  How can I get my 
pool back?  Thanks.

snv_98

zpool import
  pool: vault
id: 196786381623412270
 state: UNAVAIL
action: The pool cannot be imported due to damaged devices or data.
config:

vault   UNAVAIL  insufficient replicas
  mirrorUNAVAIL  corrupted data
c6d1p0  ONLINE
c7d1p0  ONLINE


fmdump -eV
Jun 04 2009 07:43:47.165169453 ereport.fs.zfs.vdev.bad_label
nvlist version: 0
class = ereport.fs.zfs.vdev.bad_label
ena = 0x8ebd8837ae1
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x2bb202be54c462e
vdev = 0xaa3f2fd35788620b
(end detector)

pool = vault
pool_guid = 0x2bb202be54c462e
pool_context = 2
pool_failmode = wait
vdev_guid = 0xaa3f2fd35788620b
vdev_type = mirror
parent_guid = 0x2bb202be54c462e
parent_type = root
prev_state = 0x7
__ttl = 0x1
__tod = 0x4a27c183 0x9d8492d

Jun 04 2009 07:43:47.165169794 ereport.fs.zfs.zpool
nvlist version: 0
class = ereport.fs.zfs.zpool
ena = 0x8ebd8837ae1
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x2bb202be54c462e
(end detector)

pool = vault
pool_guid = 0x2bb202be54c462e
pool_context = 2
pool_failmode = wait
__ttl = 0x1
__tod = 0x4a27c183 0x9d84a82


zdb -l /dev/rdsk/c6d1p0

LABEL 0

version=13
name='vault'
state=0
txg=42243
pool_guid=196786381623412270
hostid=997759551
hostname='philo'
top_guid=12267576494733681163
guid=16901406274466991796
vdev_tree
type='mirror'
id=0
guid=12267576494733681163
whole_disk=0
metaslab_array=14
metaslab_shift=33
ashift=9
asize=1000199946240
is_log=0
children[0]
type='disk'
id=0
guid=16901406274466991796
path='/dev/dsk/c1t1d0p0'
devid='id1,s...@f3b789a3f48e44b860003d3320001/q'
phys_path='/p...@0,0/pci1043,8...@7/d...@1,0:q'
whole_disk=0
DTL=77
children[1]
type='disk'
id=1
guid=6231056817092537765
path='/dev/dsk/c1t0d0p0'
devid='id1,s...@f3b789a3f48e44b86000263f9/q'
phys_path='/p...@0,0/pci1043,8...@7/d...@0,0:q'
whole_disk=0
DTL=76

LABEL 1

version=13
name='vault'
state=0
txg=42243
pool_guid=196786381623412270
hostid=997759551
hostname='philo'
top_guid=12267576494733681163
guid=16901406274466991796
vdev_tree
type='mirror'
id=0
guid=12267576494733681163
whole_disk=0
metaslab_array=14
metaslab_shift=33
ashift=9
asize=1000199946240
is_log=0
children[0]
type='disk'
id=0
guid=16901406274466991796
path='/dev/dsk/c1t1d0p0'
devid='id1,s...@f3b789a3f48e44b860003d3320001/q'
phys_path='/p...@0,0/pci1043,8...@7/d...@1,0:q'
whole_disk=0
DTL=77
children[1]
type='disk'
id=1
guid=6231056817092537765
path='/dev/dsk/c1t0d0p0'
devid='id1,s...@f3b789a3f48e44b86000263f9/q'
phys_path='/p...@0,0/pci1043,8...@7/d...@0,0:q'
whole_disk=0
DTL=76

LABEL 2

failed to unpack label 2

LABEL 3

failed to unpack label 3


Re: [zfs-discuss] zfs cp hangs when the mirrors are removed ..

2009-01-08 Thread Brian Leonard
Karthik, did you ever file a bug for this? I'm experiencing the same hang and 
wondering how to recover.

/Brian