Re: [zfs-discuss] Recovering from an apparent ZFS Hang
Hi Cindy,

I'm trying to demonstrate how ZFS behaves when a disk fails. The drive enclosure I'm using (http://www.icydock.com/product/mb561us-4s-1.html) claims to support hot swap, but that's not what I'm experiencing: when I plug the disk back in, none of the 4 disks is recognized until I restart the enclosure. The same demo works fine with USB sticks, perhaps because each stick has its own controller.

Thanks for your help,
Brian
-- 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recovering from an apparent ZFS Hang
Actually, there's still the primary issue of this post - the apparent hang. At the moment I have 3 zpool commands running, all apparently hung and doing nothing:

bleon...@opensolaris:~$ ps -ef | grep zpool
    root 20465 20411   0 18:10:44 pts/4       0:00 zpool clear r5pool
    root 20408 20403   0 18:08:19 pts/3       0:00 zpool status r5pool
    root 20396 17612   0 18:08:04 pts/2       0:00 zpool scrub r5pool

You can see none of them is very busy; they all seem to be waiting on something:

bleon...@opensolaris:~# ptime -p 20465
real    12:25.188031517
user        0.004037420
sys         0.008682963
bleon...@opensolaris:~# ptime -p 20408
real    15:03.977246851
user        0.002700817
sys         0.005662413
bleon...@opensolaris:~# ptime -p 20396
real    15:24.793176743
user        0.002954137
sys         0.014851215

And as I said earlier, I can't Control-C or kill any of these processes. Time for a hard reboot.

/Brian
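The ptime numbers above tell the story: minutes of wall-clock time against milliseconds of CPU time means the commands are asleep in the kernel, not spinning. For anyone wanting to take the same measurement with plain ps, here is a small portable sketch (the sleep process is just an illustrative stand-in for a hung zpool command):

```shell
#!/bin/sh
# Report elapsed vs. consumed CPU time for a PID; a large elapsed time
# paired with near-zero CPU time means the process is blocked in a
# wait, not computing.
check_blocked() {
    ps -o etime= -o time= -p "$1" | while read -r elapsed cpu; do
        echo "pid $1: elapsed=$elapsed cpu=$cpu"
    done
}

sleep 60 &            # stand-in for a hung zpool command
pid=$!
check_blocked "$pid"
kill "$pid" 2>/dev/null
```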
[zfs-discuss] Recovering from an apparent ZFS Hang
Hi,

I'm currently trying to work with a quad-bay USB drive enclosure. I've created a raidz pool as follows:

bleon...@opensolaris:~# zpool status r5pool
  pool: r5pool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        r5pool        ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c1t0d0p0  ONLINE       0     0     0
            c1t0d1p0  ONLINE       0     0     0
            c1t0d2p0  ONLINE       0     0     0
            c1t0d3p0  ONLINE       0     0     0

errors: No known data errors

If I pop a disk and run a zpool scrub, the fault is noted:

bleon...@opensolaris:~# zpool scrub r5pool
bleon...@opensolaris:~# zpool status r5pool
  pool: r5pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h0m with 0 errors on Mon Jul 12 12:35:46 2010
config:

        NAME          STATE     READ WRITE CKSUM
        r5pool        DEGRADED     0     0     0
          raidz1      DEGRADED     0     0     0
            c1t0d0p0  ONLINE       0     0     0
            c1t0d1p0  ONLINE       0     0     0
            c1t0d2p0  FAULTED      0     0     0  corrupted data
            c1t0d3p0  ONLINE       0     0     0

errors: No known data errors

However, it's when I pop the disk back in that everything goes south. If I run a zpool scrub at this point, the command appears to just hang. Running zpool status again shows the scrub will finish in 2 minutes, but it never does. You can see it's been running for 33 minutes already, and there's no data in the pool:

bleon...@opensolaris:/r5pool# zpool status r5pool
  pool: r5pool
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 0h33m, 92.41% done, 0h2m to go
config:

        NAME          STATE     READ WRITE CKSUM
        r5pool        ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c1t0d0p0  ONLINE       0     0     0
            c1t0d1p0  ONLINE       0     0     0
            c1t0d2p0  ONLINE       0     0     0
            c1t0d3p0  ONLINE       0     0     0

errors: 24 data errors, use '-v' for a list

zpool scrub -s r5pool doesn't have any effect, and I can't even kill the scrub process. Even a reboot command at this point hangs the machine, so I have to hard power-cycle it to get everything back to normal. There must be a more elegant solution, right?
-- 
This message posted from opensolaris.org
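For what it's worth, the hang-until-reboot behavior is consistent with the pool's failmode property, which defaults to wait: once ZFS suspends I/O after device failures (the ZFS-8000-HC state above), blocked commands sleep until the pool recovers. A minimal sketch of inspecting and changing it follows; it is written dry-run style (each zpool step is echoed, and only actually executed where the zpool command exists), with r5pool taken from the post:

```shell
#!/bin/sh
# Dry-run helper: print each zpool step, execute it only if zpool
# is actually installed on this machine.
run() {
    echo "+ $*"
    if command -v zpool >/dev/null 2>&1; then "$@"; fi
}

run zpool get failmode r5pool          # default is failmode=wait
# failmode=continue makes ZFS return EIO to new I/O requests instead
# of blocking everything while the pool is suspended
run zpool set failmode=continue r5pool
run zpool clear r5pool                 # retry I/O once devices are back
```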
Re: [zfs-discuss] Recovering from an apparent ZFS Hang
Hi Brian,

What are you trying to determine - how the pool behaves when a drive is yanked out? It's hard to tell how a pool will react with external USB drives; I think it will also depend on how the system handles a device removal.

I created a similar raidz pool with non-USB devices, offlined a disk, and ran a scrub. It works as expected; see the output below. Could you retry your test with an offline rather than a yank and see if the system still hangs?

In addition, we don't support pools that are created on p* devices. Use the c1t0d* names instead.

Thanks,

Cindy

# zpool create rzpool raidz1 c2t6d0 c2t7d0 c2t8d0
# zpool offline rzpool c2t8d0
# zpool status rzpool
  pool: rzpool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rzpool      DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            c2t6d0  ONLINE       0     0     0
            c2t7d0  ONLINE       0     0     0
            c2t8d0  OFFLINE      0     0     0

errors: No known data errors
# zpool scrub rzpool
# zpool status rzpool
  pool: rzpool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon Jul 12 09:56:36 2010
config:

        NAME        STATE     READ WRITE CKSUM
        rzpool      DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            c2t6d0  ONLINE       0     0     0
            c2t7d0  ONLINE       0     0     0
            c2t8d0  OFFLINE      0     0     0

errors: No known data errors
# zpool status rzpool
  pool: rzpool
 state: ONLINE
  scan: resilvered 14K in 0h0m with 0 errors on Mon Jul 12 10:12:55 2010
config:

        NAME        STATE     READ WRITE CKSUM
        rzpool      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c2t6d0  ONLINE       0     0     0
            c2t7d0  ONLINE       0     0     0
            c2t8d0  ONLINE       0     0     0

errors: No known data errors

On 07/12/10 10:45, Brian Leonard wrote:
> [original post quoted above - snipped]
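For Brian's setup, the controlled offline/swap/online sequence Cindy suggests might look like the sketch below. It is dry-run style (each zpool step is echoed, and only executed where zpool exists); the pool name is Brian's, and c1t0d2 is an example device name using the whole-disk form Cindy recommends instead of the p* names:

```shell
#!/bin/sh
# Dry-run helper: print each zpool step, execute it only if zpool
# is actually installed on this machine.
run() {
    echo "+ $*"
    if command -v zpool >/dev/null 2>&1; then "$@"; fi
}

run zpool offline r5pool c1t0d2    # take the disk out of service first
# ...physically remove and reinsert the drive here...
run zpool online r5pool c1t0d2     # resilver starts automatically
run zpool scrub r5pool             # then verify the whole pool
run zpool status r5pool
```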