Re: [zfs-discuss] Recovering from an apparent ZFS Hang

2010-07-13 Thread Brian Leonard
Hi Cindy,

I'm trying to demonstrate how ZFS behaves when a disk fails. The drive 
enclosure I'm using (http://www.icydock.com/product/mb561us-4s-1.html) says it 
supports hot swap, but that's not what I'm experiencing. When I plug the disk 
back in, all 4 disks are no longer recognizable until I restart the enclosure.

This same demo works fine when using USB sticks, and maybe that's because each 
USB stick has its own controller.
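
If it helps, one way to check whether the OS has actually re-enumerated the drives after a re-insert might be something like this (cfgadm and rmformat are the stock Solaris tools; the grep just narrows the output):

bleon...@opensolaris:~# cfgadm -al | grep usb
bleon...@opensolaris:~# rmformat

If the re-inserted disk doesn't show up in either listing, the enclosure is probably dropping the whole USB bridge rather than hot-swapping the single bay.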

Thanks for your help,
Brian


Re: [zfs-discuss] Recovering from an apparent ZFS Hang

2010-07-13 Thread Brian Leonard
Actually, there's still the primary issue of this post - the apparent hang. At 
the moment, I have 3 zpool commands running, all apparently hung and doing 
nothing:

bleon...@opensolaris:~$ ps -ef | grep zpool
root 20465 20411   0 18:10:44 pts/4   0:00 zpool clear r5pool
root 20408 20403   0 18:08:19 pts/3   0:00 zpool status r5pool
root 20396 17612   0 18:08:04 pts/2   0:00 zpool scrub r5pool

You can see that none of them is doing much work; they all appear to be waiting on something:

bleon...@opensolaris:~# ptime -p 20465
real    12:25.188031517
user        0.004037420
sys         0.008682963

bleon...@opensolaris:~# ptime -p 20408
real    15:03.977246851
user        0.002700817
sys         0.005662413

bleon...@opensolaris:~# ptime -p 20396
real    15:24.793176743
user        0.002954137
sys         0.014851215

And as I said earlier, I can't interrupt or kill any of these processes. Time for a hard reboot.
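
If this happens again, grabbing stack traces before the hard reboot might show what the commands are actually blocked on; roughly, using the pids above:

bleon...@opensolaris:~# pstack 20465
bleon...@opensolaris:~# echo "::threadlist -v" | mdb -k

pstack prints the user-level stack of the hung zpool clear, and the mdb invocation dumps kernel thread stacks, which is usually where a stuck ZFS I/O would show up.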

/Brian


[zfs-discuss] Recovering from an apparent ZFS Hang

2010-07-12 Thread Brian Leonard
Hi,

I'm currently trying to work with a quad-bay USB drive enclosure. I've created 
a raidz pool as follows:

bleon...@opensolaris:~# zpool status r5pool
  pool: r5pool
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
r5poolONLINE   0 0 0
  raidz1  ONLINE   0 0 0
c1t0d0p0  ONLINE   0 0 0
c1t0d1p0  ONLINE   0 0 0
c1t0d2p0  ONLINE   0 0 0
c1t0d3p0  ONLINE   0 0 0

errors: No known data errors
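
For reference, the pool was created with something along these lines (reconstructed from the status output, so the exact command is an assumption):

bleon...@opensolaris:~# zpool create r5pool raidz1 c1t0d0p0 c1t0d1p0 c1t0d2p0 c1t0d3p0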

If I pop a disk and run a zpool scrub, the fault is noted:

bleon...@opensolaris:~# zpool scrub r5pool
bleon...@opensolaris:~# zpool status r5pool
  pool: r5pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid.  Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h0m with 0 errors on Mon Jul 12 12:35:46 2010
config:

NAME  STATE READ WRITE CKSUM
r5poolDEGRADED 0 0 0
  raidz1  DEGRADED 0 0 0
c1t0d0p0  ONLINE   0 0 0
c1t0d1p0  ONLINE   0 0 0
c1t0d2p0  FAULTED  0 0 0  corrupted data
c1t0d3p0  ONLINE   0 0 0

errors: No known data errors

However, it's when I pop the disk back in that everything goes south. If I run 
a zpool scrub at this point, the command appears to just hang.

Running zpool status again shows the scrub will finish in 2 minutes, but it never does. You can see it's been running for 33 minutes already, and there's no data in the pool.

bleon...@opensolaris:/r5pool# zpool status r5pool
  pool: r5pool
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 0h33m, 92.41% done, 0h2m to go
config:

NAME  STATE READ WRITE CKSUM
r5poolONLINE   0 0 0
  raidz1  ONLINE   0 0 0
c1t0d0p0  ONLINE   0 0 0
c1t0d1p0  ONLINE   0 0 0
c1t0d2p0  ONLINE   0 0 0
c1t0d3p0  ONLINE   0 0 0

errors: 24 data errors, use '-v' for a list

Stopping the scrub with 'zpool scrub -s r5pool' doesn't have any effect either.
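
Before retrying the scrub, it might be worth confirming that the reinserted disk is visible to the OS at all, for example:

bleon...@opensolaris:~# iostat -En
bleon...@opensolaris:~# format < /dev/null

If c1t0d2 doesn't appear in either listing, ZFS has nothing to resilver or scrub against and the pool just keeps waiting on it.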

I can't even kill the scrub process. Even a reboot command at this point hangs the machine, so I have to hard power-cycle it to get everything back to normal. There must be a more elegant solution, right?
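
One guess at why everything blocks rather than returning errors: the pool's failmode property defaults to 'wait', which suspends all I/O (including the zpool commands themselves) until the devices come back or a 'zpool clear' succeeds. If that's what is happening here, something like the following, done while the pool is healthy, would make ZFS return errors instead of hanging:

bleon...@opensolaris:~# zpool get failmode r5pool
bleon...@opensolaris:~# zpool set failmode=continue r5pool

That wouldn't stop the enclosure from dropping all four disks, but it might at least avoid the hard power-cycle.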


Re: [zfs-discuss] Recovering from an apparent ZFS Hang

2010-07-12 Thread Cindy Swearingen

Hi Brian,

What are you trying to determine? How the pool behaves when a drive is
yanked out?

It's hard to tell how a pool will react with external USB drives. I think
it will also depend on how the system handles the device removal.

I created a similar raidz pool with non-USB devices, offlined a disk,
and ran a scrub. It worked as expected; see the output below. Could
you retry your test with an offline rather than a yank and see if
the system still hangs?

In addition, we don't support pools that are created on p* devices.
Use the whole-disk c1t0d* names instead.
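
For example, recreating the pool on whole disks (test data only, and assuming the device names from your status output) would look like:

# zpool destroy r5pool
# zpool create r5pool raidz1 c1t0d0 c1t0d1 c1t0d2 c1t0d3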

Thanks,

Cindy

# zpool create rzpool raidz1 c2t6d0 c2t7d0 c2t8d0
# zpool offline rzpool c2t8d0
# zpool status rzpool
  pool: rzpool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
 scan: none requested
config:

NAMESTATE READ WRITE CKSUM
rzpool  DEGRADED 0 0 0
  raidz1-0  DEGRADED 0 0 0
c2t6d0  ONLINE   0 0 0
c2t7d0  ONLINE   0 0 0
c2t8d0  OFFLINE  0 0 0

errors: No known data errors
# zpool scrub rzpool
# zpool status rzpool
  pool: rzpool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
 scan: scrub repaired 0 in 0h0m with 0 errors on Mon Jul 12 09:56:36 2010
config:

NAMESTATE READ WRITE CKSUM
rzpool  DEGRADED 0 0 0
  raidz1-0  DEGRADED 0 0 0
c2t6d0  ONLINE   0 0 0
c2t7d0  ONLINE   0 0 0
c2t8d0  OFFLINE  0 0 0

errors: No known data errors
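
(The offlined disk was presumably brought back at this point, e.g. with:

# zpool online rzpool c2t8d0

which is what kicks off the small resilver shown in the next status.)
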
# zpool status rzpool
  pool: rzpool
 state: ONLINE
 scan: resilvered 14K in 0h0m with 0 errors on Mon Jul 12 10:12:55 2010
config:

NAMESTATE READ WRITE CKSUM
rzpool  ONLINE   0 0 0
  raidz1-0  ONLINE   0 0 0
c2t6d0  ONLINE   0 0 0
c2t7d0  ONLINE   0 0 0
c2t8d0  ONLINE   0 0 0

errors: No known data errors


On 07/12/10 10:45, Brian Leonard wrote:
