Re: [zfs-discuss] Recovering from an apparent ZFS Hang
Actually, there's still the primary issue of this post - the apparent hang. At the moment, I have 3 zpool commands running, all apparently hung and doing nothing:

bleon...@opensolaris:~$ ps -ef | grep zpool
    root 20465 20411   0 18:10:44 pts/4       0:00 zpool clear r5pool
    root 20408 20403   0 18:08:19 pts/3       0:00 zpool status r5pool
    root 20396 17612   0 18:08:04 pts/2       0:00 zpool scrub r5pool

You can see none of them has consumed any real CPU time; they all seem to be waiting on something:

bleon...@opensolaris:~# ptime -p 20465
real    12:25.188031517
user        0.004037420
sys         0.008682963

bleon...@opensolaris:~# ptime -p 20408
real    15:03.977246851
user        0.002700817
sys         0.005662413

bleon...@opensolaris:~# ptime -p 20396
real    15:24.793176743
user        0.002954137
sys         0.014851215

And as I said earlier, I can't Ctrl-C or kill any of these processes. Time for a hard reboot.

/Brian
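Before the hard reboot, it may be worth capturing where those commands are blocked; a sketch, assuming mdb(1) is available on the box (PIDs taken from the ps output above):

  # Userland stack of one of the hung commands:
  pstack 20465

  # Kernel stacks of every zpool process, via the kernel debugger:
  echo "::pgrep zpool | ::walk thread | ::findstack -v" | mdb -k

Threads parked in something like zio_wait or txg_wait_synced would suggest the pool is blocked indefinitely on the USB device, which would also explain why even the reboot command hangs.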
Re: [zfs-discuss] Recovering from an apparent ZFS Hang
Hi Cindy,

I'm trying to demonstrate how ZFS behaves when a disk fails. The drive enclosure I'm using (http://www.icydock.com/product/mb561us-4s-1.html) says it supports hot swap, but that's not what I'm experiencing. When I plug the disk back in, none of the 4 disks is recognized until I restart the enclosure. This same demo works fine when using USB sticks, perhaps because each USB stick has its own controller.

Thanks for your help,
Brian
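One way to narrow down whether the enclosure or the OS is at fault is to check whether the attachment points re-enumerate after the re-insert. A sketch; the ap_id below is hypothetical, since I don't know how this enclosure presents itself:

  # List attachment points; the enclosure's ports should appear here:
  cfgadm -al

  # If a port shows up unconfigured after the disk is re-inserted,
  # try reconfiguring it (the ap_id 'usb8/1' is hypothetical):
  cfgadm -c configure usb8/1

If cfgadm never sees the port come back until the enclosure is power-cycled, the hot-swap problem is on the enclosure side rather than in ZFS.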
[zfs-discuss] Recovering from an apparent ZFS Hang
Hi, I'm currently trying to work with a quad-bay USB drive enclosure. I've created a raidz pool as follows:

bleon...@opensolaris:~# zpool status r5pool
  pool: r5pool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        r5pool        ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c1t0d0p0  ONLINE       0     0     0
            c1t0d1p0  ONLINE       0     0     0
            c1t0d2p0  ONLINE       0     0     0
            c1t0d3p0  ONLINE       0     0     0

errors: No known data errors

If I pop a disk and run a zpool scrub, the fault is noted:

bleon...@opensolaris:~# zpool scrub r5pool
bleon...@opensolaris:~# zpool status r5pool
  pool: r5pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h0m with 0 errors on Mon Jul 12 12:35:46 2010
config:

        NAME          STATE     READ WRITE CKSUM
        r5pool        DEGRADED     0     0     0
          raidz1      DEGRADED     0     0     0
            c1t0d0p0  ONLINE       0     0     0
            c1t0d1p0  ONLINE       0     0     0
            c1t0d2p0  FAULTED      0     0     0  corrupted data
            c1t0d3p0  ONLINE       0     0     0

errors: No known data errors

However, it's when I pop the disk back in that everything goes south. If I run a zpool scrub at this point, the command appears to just hang. Running zpool status again shows the scrub will finish in 2 minutes, but it never does. You can see it's been running for 33 minutes already, and there's no data in the pool:

bleon...@opensolaris:/r5pool# zpool status r5pool
  pool: r5pool
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 0h33m, 92.41% done, 0h2m to go
config:

        NAME          STATE     READ WRITE CKSUM
        r5pool        ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c1t0d0p0  ONLINE       0     0     0
            c1t0d1p0  ONLINE       0     0     0
            c1t0d2p0  ONLINE       0     0     0
            c1t0d3p0  ONLINE       0     0     0

errors: 24 data errors, use '-v' for a list

'zpool scrub -s r5pool' has no effect, and I can't even kill the scrub process. Even a reboot command at this point will hang the machine, so I have to hard power-cycle it to get everything back to normal. There must be a more elegant solution, right?
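For anyone staging the same demo, a gentler alternative to the live pull, assuming the point is to show degraded and recovery behavior rather than to exercise the enclosure's hot-swap, is to fault a member administratively. A sketch using the device names above:

  # Take one member offline instead of physically pulling it:
  zpool offline r5pool c1t0d2p0
  zpool status r5pool              # pool now reports DEGRADED

  # Later, bring the member back and verify the pool heals:
  zpool online r5pool c1t0d2p0
  zpool clear r5pool
  zpool scrub r5pool

This keeps the demo entirely inside ZFS and avoids whatever the enclosure does to the other three disks on a physical re-insert.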
Re: [zfs-discuss] Permanent errors detected in <metadata>:<0x13>
Well, I was doing a ZFS send / receive to back up a large (60 GB) set of data, which never completed. A zpool clear at that point just hung and I had to reboot the system, after which it appeared to come up clean. As soon as I tried the backup again I noticed the pool reported the error you see below, but the backup did complete as the pool remained online.

Thanks for your help Cindy,
Brian

Cindy Swearingen wrote:

I reviewed the zpool clear syntax (looking at my own docs) and didn't remember that a one-device pool probably doesn't need the device specified. For pools with many devices, you might want to just clear the errors on a particular device.

USB sticks for pools are problematic. It would be good to know what caused these errors to try to prevent them in the future. We know that USB devices don't generate/fabricate device IDs, so they are prone to problems when being moved, changed, or re-inserted, but without more info, it's hard to tell what happened.

cs

On 06/29/10 14:13, W Brian Leonard wrote:

Interesting, this time it worked! Does specifying the device to clear cause the command to behave differently? I had assumed that without the device specification, the clear would just apply to all devices in the pool (of which there is just the one).

Thanks,
Brian

Cindy Swearingen wrote:

Hi Brian,

Because the pool is still online and the metadata is redundant, maybe these errors were caused by a brief hiccup in the USB device's physical connection. You might try:

# zpool clear external c0t0d0p0

Then, run a scrub:

# zpool scrub external

If the above fails, then please identify the Solaris release and what events preceded this problem.

Thanks,
Cindy

On 06/29/10 11:15, W Brian Leonard wrote:

Hi Cindy,

The scrub didn't help and yes, this is an external USB device.

Thanks,
Brian

Cindy Swearingen wrote:

Hi Brian,

You might try running a scrub on this pool. Is this an external USB device?

Thanks,
Cindy

On 06/29/10 09:16, Brian Leonard wrote:

Hi, I have a zpool which is currently reporting that the "<metadata>:<0x13>" file is corrupt:

bleon...@opensolaris:~$ pfexec zpool status -xv external
  pool: external
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        external    ONLINE       0     0     0
          c0t0d0p0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x13>

Otherwise, as you can see, the pool is online. As it's unclear to me how to restore the "<metadata>:<0x13>" file, is my only option for correcting this error to destroy and recreate the pool?

Thanks,
Brian

-- 
W Brian Leonard
Principal Product Manager
860.206.6093
http://blogs.sun.com/observatory
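As an aside for anyone retracing this thread: the backup in question was a plain send / receive. A minimal sketch of that pattern; the dataset and target pool names are hypothetical, since they aren't shown here:

  # Snapshot first, then stream the snapshot to the receiving pool
  # ('external/data' and 'backuppool' are hypothetical names):
  pfexec zfs snapshot external/data@backup-20100629
  pfexec zfs send external/data@backup-20100629 | \
      pfexec zfs receive backuppool/data

If the USB link drops mid-stream, the receive fails and has to be restarted from the snapshot, which may be what left the pool in the state described above.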
Re: [zfs-discuss] Permanent errors detected in <metadata>:<0x13>
Interesting, this time it worked! Does specifying the device to clear cause the command to behave differently? I had assumed that without the device specification, the clear would just apply to all devices in the pool (of which there is just the one).

Thanks,
Brian

Cindy Swearingen wrote:

Hi Brian,

Because the pool is still online and the metadata is redundant, maybe these errors were caused by a brief hiccup in the USB device's physical connection. You might try:

# zpool clear external c0t0d0p0

Then, run a scrub:

# zpool scrub external

If the above fails, then please identify the Solaris release and what events preceded this problem.

Thanks,
Cindy

On 06/29/10 11:15, W Brian Leonard wrote:

Hi Cindy,

The scrub didn't help and yes, this is an external USB device.

Thanks,
Brian

Cindy Swearingen wrote:

Hi Brian,

You might try running a scrub on this pool. Is this an external USB device?

Thanks,
Cindy

On 06/29/10 09:16, Brian Leonard wrote:

Hi, I have a zpool which is currently reporting that the "<metadata>:<0x13>" file is corrupt:

bleon...@opensolaris:~$ pfexec zpool status -xv external
  pool: external
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        external    ONLINE       0     0     0
          c0t0d0p0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x13>

Otherwise, as you can see, the pool is online. As it's unclear to me how to restore the "<metadata>:<0x13>" file, is my only option for correcting this error to destroy and recreate the pool?

Thanks,
Brian

-- 
W Brian Leonard
Principal Product Manager
860.206.6093
http://blogs.sun.com/observatory
Re: [zfs-discuss] Permanent errors detected in <metadata>:<0x13>
Hi Cindy,

The scrub didn't help and yes, this is an external USB device.

Thanks,
Brian

Cindy Swearingen wrote:

Hi Brian,

You might try running a scrub on this pool. Is this an external USB device?

Thanks,
Cindy

On 06/29/10 09:16, Brian Leonard wrote:

Hi, I have a zpool which is currently reporting that the "<metadata>:<0x13>" file is corrupt:

bleon...@opensolaris:~$ pfexec zpool status -xv external
  pool: external
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        external    ONLINE       0     0     0
          c0t0d0p0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x13>

Otherwise, as you can see, the pool is online. As it's unclear to me how to restore the "<metadata>:<0x13>" file, is my only option for correcting this error to destroy and recreate the pool?

Thanks,
Brian

-- 
W Brian Leonard
Principal Product Manager
860.206.6093
http://blogs.sun.com/observatory
[zfs-discuss] Permanent errors detected in <metadata>:<0x13>
Hi, I have a zpool which is currently reporting that the "<metadata>:<0x13>" file is corrupt:

bleon...@opensolaris:~$ pfexec zpool status -xv external
  pool: external
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        external    ONLINE       0     0     0
          c0t0d0p0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x13>

Otherwise, as you can see, the pool is online. As it's unclear to me how to restore the "<metadata>:<0x13>" file, is my only option for correcting this error to destroy and recreate the pool?

Thanks,
Brian
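For readers landing here from a search: the "<metadata>" prefix means the damaged object (0x13) is pool metadata, not a user file, so there is no file to restore. A minimal recovery sketch under that reading, with the device name taken from the status output above:

  pfexec zpool clear external c0t0d0p0    # reset the error counters
  pfexec zpool scrub external             # verify the whole pool
  # After the scrub completes, re-check whether the entry is gone:
  pfexec zpool status -xv external

If the entry survives a clean scrub followed by a clear, the metadata object is genuinely damaged, and destroying and recreating the pool is likely the only way to get rid of it.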
Re: [zfs-discuss] invalid vdev configuration
> Check contents of /dev/dsk and /dev/rdsk to see if there are some
> missing links there for devices in question. You may want to run
>
> devfsadm -c disk -sv
> devfsadm -c disk -Csv
>
> and see if it reports anything.

There were quite a few links it removed, all on c0.

> Try to move c6d1p0 and c7d1p0 out of /dev/dsk and /dev/rdsk and see if
> you can import the pool.

That worked! It was able to import the pool on c6d1 and c7d1. Clearly I have a little more reading to do regarding how Solaris manages disks. Thanks!
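An alternative to moving links out of /dev/dsk, assuming the stale p0 links were all that was confusing the importer, is to point the import at a directory containing only the device nodes you want considered, via zpool import's -d flag. A sketch with the node names from this thread:

  # Build a directory holding just the intended links and search only it:
  mkdir /tmp/vault-dev
  ln -s /dev/dsk/c6d1 /dev/dsk/c7d1 /tmp/vault-dev/
  zpool import -d /tmp/vault-dev vault

This leaves /dev/dsk untouched, so devfsadm can't undo the fix behind your back.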
Re: [zfs-discuss] invalid vdev configuration
> Hmm... export the pool again. Then try simply "zpool import"
> and it should show the way it sees vault. Reply with that output.

zpool export vault
cannot open 'vault': no such pool

zpool import
  pool: vault
    id: 196786381623412270
 state: UNAVAIL
action: The pool cannot be imported due to damaged devices or data.
config:

        vault       UNAVAIL  insufficient replicas
          mirror    UNAVAIL  corrupted data
            c6d1p0  ONLINE
            c7d1p0  ONLINE
Re: [zfs-discuss] invalid vdev configuration
> Since you did not export the pool, it may be looking for the wrong
> devices. Try this:
>
> zpool export vault
> zpool import vault

That was the first thing I tried, with no luck.

> Above, I used slice 0 as an example, your system may use a
> different slice. But you can run zdb -l on all of them to find

Aha, zdb found complete label sets for the "vault" pool on /dev/rdsk/c6d1 and c7d1. The incomplete labels were on c6d1p0 and c7d1p0. Could I just zpool replace c6d1p0 with c6d1, and c7d1p0 with c7d1?
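Spelling out the quoted advice as a loop may save others some typing; the device names are the ones from this thread, and which slice or partition nodes exist will vary by system:

  for d in /dev/rdsk/c6d1 /dev/rdsk/c6d1s0 /dev/rdsk/c6d1p0 \
           /dev/rdsk/c7d1 /dev/rdsk/c7d1s0 /dev/rdsk/c7d1p0; do
      echo "== $d =="
      zdb -l "$d" | egrep "LABEL|failed"   # which labels unpack on each node
  done

A node with a complete label set shows LABEL 0 through LABEL 3 with no "failed to unpack" lines.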
[zfs-discuss] invalid vdev configuration
I had a machine die the other day and take one of its zfs pools with it. I booted the new machine, with the same disks but a different SATA controller, and the rpool was mounted but another pool "vault" was not. If I try to import it I get "invalid vdev configuration". fmdump shows zfs.vdev.bad_label, and checking the label with zdb I find labels 2 and 3 missing. How can I get my pool back? Thanks.

snv_98

zpool import
  pool: vault
    id: 196786381623412270
 state: UNAVAIL
action: The pool cannot be imported due to damaged devices or data.
config:

        vault       UNAVAIL  insufficient replicas
          mirror    UNAVAIL  corrupted data
            c6d1p0  ONLINE
            c7d1p0  ONLINE

fmdump -eV
Jun 04 2009 07:43:47.165169453 ereport.fs.zfs.vdev.bad_label
nvlist version: 0
        class = ereport.fs.zfs.vdev.bad_label
        ena = 0x8ebd8837ae1
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x2bb202be54c462e
                vdev = 0xaa3f2fd35788620b
        (end detector)
        pool = vault
        pool_guid = 0x2bb202be54c462e
        pool_context = 2
        pool_failmode = wait
        vdev_guid = 0xaa3f2fd35788620b
        vdev_type = mirror
        parent_guid = 0x2bb202be54c462e
        parent_type = root
        prev_state = 0x7
        __ttl = 0x1
        __tod = 0x4a27c183 0x9d8492d

Jun 04 2009 07:43:47.165169794 ereport.fs.zfs.zpool
nvlist version: 0
        class = ereport.fs.zfs.zpool
        ena = 0x8ebd8837ae1
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x2bb202be54c462e
        (end detector)
        pool = vault
        pool_guid = 0x2bb202be54c462e
        pool_context = 2
        pool_failmode = wait
        __ttl = 0x1
        __tod = 0x4a27c183 0x9d84a82

zdb -l /dev/rdsk/c6d1p0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=13
    name='vault'
    state=0
    txg=42243
    pool_guid=196786381623412270
    hostid=997759551
    hostname='philo'
    top_guid=12267576494733681163
    guid=16901406274466991796
    vdev_tree
        type='mirror'
        id=0
        guid=12267576494733681163
        whole_disk=0
        metaslab_array=14
        metaslab_shift=33
        ashift=9
        asize=1000199946240
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=16901406274466991796
                path='/dev/dsk/c1t1d0p0'
                devid='id1,s...@f3b789a3f48e44b860003d3320001/q'
                phys_path='/p...@0,0/pci1043,8...@7/d...@1,0:q'
                whole_disk=0
                DTL=77
        children[1]
                type='disk'
                id=1
                guid=6231056817092537765
                path='/dev/dsk/c1t0d0p0'
                devid='id1,s...@f3b789a3f48e44b86000263f9/q'
                phys_path='/p...@0,0/pci1043,8...@7/d...@0,0:q'
                whole_disk=0
                DTL=76
--------------------------------------------
LABEL 1
--------------------------------------------
    (contents identical to LABEL 0)
--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3
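Some background on the symptom, which may help others searching the archives: ZFS writes four copies of the vdev label, two at the front of the device and two in the last 512 KB. Labels 0 and 1 being intact while 2 and 3 fail to unpack therefore usually means zdb is probing a node whose size differs from the device the pool was labeled on, so the reads for the back labels land at the wrong offset; p0 nodes are a common culprit. A quick check under that assumption (node names from the output above; whether a bare whole-disk node exists varies by system):

  # The back labels are read relative to the size of the node probed,
  # so compare what the partition node and the whole-disk node report:
  zdb -l /dev/rdsk/c6d1p0 | egrep "LABEL|failed"
  zdb -l /dev/rdsk/c6d1   | egrep "LABEL|failed"

If the whole-disk node shows all four labels, importing against that node is the fix, which is what the follow-ups in this thread confirm.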
Re: [zfs-discuss] zfs cp hangs when the mirrors are removed ..
Karthik, did you ever file a bug for this? I'm experiencing the same hang and am wondering how to recover.

/Brian