Re: [zfs-discuss] lost zpool when server restarted.
Looking at the txg numbers, it seems that the labels on the two devices
that are unavailable now may be stale:

Krzys wrote:
> When I do zdb on emcpower3a, which seems to be OK from the zpool
> perspective, I get the following output:
>
> bash-3.00# zdb -lv /dev/dsk/emcpower3a
> --------------------------------------------
> LABEL 0
> --------------------------------------------
>     version=3
>     name='mypool'
>     state=0
>     txg=4367380
>     pool_guid=4148251638983938048
>     top_guid=9690155374174551757
>     guid=9690155374174551757
>     vdev_tree
>         type='disk'
>         id=2
>         guid=9690155374174551757
>         path='/dev/dsk/emcpower3a'
>         whole_disk=0
>         metaslab_array=1813
>         metaslab_shift=30
>         ashift=9
>         asize=134208815104

Here we have txg=4367380, but on the other two devices (or at least on
one of them) txg=4367379:

> But when I do zdb on emcpower0a, which seems to be not that OK, I get
> the following output:
>
> bash-3.00# zdb -lv /dev/dsk/emcpower0a
> --------------------------------------------
> LABEL 0
> --------------------------------------------
>     version=3
>     name='mypool'
>     state=0
>     txg=4367379
>     pool_guid=4148251638983938048
>     top_guid=14125143252243381576
>     guid=14125143252243381576
>     vdev_tree
>         type='disk'
>         id=0
>         guid=14125143252243381576
>         path='/dev/dsk/emcpower0a'
>         whole_disk=0
>         metaslab_array=13
>         metaslab_shift=29
>         ashift=9
>         asize=107365269504
>         DTL=727
>
> The same is true for emcpower2a in my pool.

What does 'zdb -uuu mypool' say?

> Is there a way to fix the failed LABELs 2 and 3? I know you need four
> of them, but is there a way to reconstruct them somehow?

It looks like the problem is not that labels 2 and 3 are missing, but
that labels 0 and 1 are stale.

> Or is my pool lost completely, so that I need to recreate it? It
> would be odd if a reboot of a server could cause such a disaster.

There is a Dirty Time Log object allocated for the device with
unreadable labels, which means the device in question was unavailable
for some time -- so something odd may have been going on with your
storage a while back (prior to the reboot).

> But I was unable to find anywhere where people were able to repair or
> recreate those LABELs. How would I recover my zpool? Any help or
> suggestion is greatly appreciated.

Have you seen this thread:
http://www.opensolaris.org/jive/thread.jspa?messageID=220125 ? I think
some of that experience may be applicable to this case as well.

Btw, what kind of Solaris are you running?

wbr,
victor
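An aside for anyone following along: a quick way to line up the label
txg values across all three devices is a short loop over the zdb
output. This is a minimal sketch, assuming the device names from this
thread and that each label prints LABEL and txg= lines as shown above:

#!/bin/sh
# Print each label header and its txg for every device in the pool,
# making txg differences between the devices easy to spot.
for d in emcpower0a emcpower2a emcpower3a; do
    echo "=== $d ==="
    zdb -lv /dev/dsk/$d | egrep 'LABEL|txg='
done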
Re: [zfs-discuss] lost zpool when server restarted.
It's OK that you're missing labels 2 and 3 -- there are four copies
precisely so that you can afford to lose a few. Labels 2 and 3 are at
the end of the disk. The fact that only they are missing makes me
wonder if someone resized the LUNs. Growing them would be OK, but
shrinking them would indeed cause the pool to fail to open (since part
of it was amputated).

There ought to be more helpful diagnostics in the FMA error log.
After a failed attempt to import, type this:

# fmdump -ev

and let me know what it says.

Jeff

On Tue, Apr 29, 2008 at 03:31:53PM -0400, Krzys wrote:
> I have a problem on one of my systems with zfs. I used to have a
> zpool created from 3 LUNs on a SAN. I did not have to put any RAID or
> anything on it, since it was already using RAID on the SAN. Anyway,
> the server rebooted and I cannot see my pools. When I try to import
> the pool, it fails. I am using an EMC CLARiiON as the SAN, with
> PowerPath.
>
> # zpool list
> no pools available
> # zpool import -f
>   pool: mypool
>     id: 4148251638983938048
>  state: FAULTED
> status: One or more devices are missing from the system.
> action: The pool cannot be imported. Attach the missing devices and
>         try again.
>    see: http://www.sun.com/msg/ZFS-8000-3C
> config:
>
>         mypool        UNAVAIL  insufficient replicas
>           emcpower0a  UNAVAIL  cannot open
>           emcpower2a  UNAVAIL  cannot open
>           emcpower3a  ONLINE
>
> I think I am able to see all the LUNs, and I should be able to access
> them on my Sun box.
>
> # powermt display dev=all
> Pseudo name=emcpower0a
> CLARiiON ID=APM00070202835 [NRHAPP02]
> Logical device ID=6006016045201A001264FB20990FDC11 [LUN 13]
> state=alive; policy=CLAROpt; priority=0; queued-IOs=0
> Owner: default=SP B, current=SP B
> ==============================================================================
> --------------- Host ---------------   - Stor -   -- I/O Path -   -- Stats ---
> ###  HW Path              I/O Paths    Interf.    Mode    State   Q-IOs Errors
> ==============================================================================
> 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016041E035A4d0s0  SP A4  active  alive  0  0
> 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016941E035A4d0s0  SP B5  active  alive  0  0
> 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016141E035A4d0s0  SP A5  active  alive  0  0
> 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016841E035A4d0s0  SP B4  active  alive  0  0
>
> Pseudo name=emcpower1a
> CLARiiON ID=APM00070202835 [NRHAPP02]
> Logical device ID=6006016045201A004C1388343C10DC11 [LUN 14]
> state=alive; policy=CLAROpt; priority=0; queued-IOs=0
> Owner: default=SP B, current=SP B
> ==============================================================================
> --------------- Host ---------------   - Stor -   -- I/O Path -   -- Stats ---
> ###  HW Path              I/O Paths    Interf.    Mode    State   Q-IOs Errors
> ==============================================================================
> 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016041E035A4d1s0  SP A4  active  alive  0  0
> 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016941E035A4d1s0  SP B5  active  alive  0  0
> 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016141E035A4d1s0  SP A5  active  alive  0  0
> 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016841E035A4d1s0  SP B4  active  alive  0  0
>
> Pseudo name=emcpower3a
> CLARiiON ID=APM00070202835 [NRHAPP02]
> Logical device ID=6006016045201A00A82C68514E86DC11 [LUN 7]
> state=alive; policy=CLAROpt; priority=0; queued-IOs=0
> Owner: default=SP B, current=SP B
> ==============================================================================
> --------------- Host ---------------   - Stor -   -- I/O Path -   -- Stats ---
> ###  HW Path              I/O Paths    Interf.    Mode    State   Q-IOs Errors
> ==============================================================================
> 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016041E035A4d3s0  SP A4  active  alive  0  0
> 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016941E035A4d3s0  SP B5  active  alive  0  0
> 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016141E035A4d3s0  SP A5  active  alive  0  0
> 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016841E035A4d3s0  SP B4  active  alive  0  0
>
> Pseudo name=emcpower2a
> CLARiiON ID=APM00070202835 [NRHAPP02]
> Logical device ID=600601604B141B00C2F6DB2AC349DC11 [LUN 24]
> state=alive; policy=CLAROpt; priority=0; queued-IOs=0
> Owner: default=SP B, current=SP B
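Jeff's shrunken-LUN theory can be tested directly, because the labels
live at fixed offsets: labels 0 and 1 in the first 512 KB of the
device, labels 2 and 3 in the last 512 KB. The sketch below uses the
asize recorded in the emcpower0a label quoted earlier as an
approximation of the device size (asize is the allocatable size,
slightly smaller than the raw device, so the end offsets are
approximate); the point is whether reads near the recorded end of the
device succeed at all. Note that dd reading past end-of-device returns
short rather than failing, hence the byte count:

#!/bin/bash
# Sketch: read each 256 KB label region of emcpower0a and report how
# many bytes come back.  On a LUN that shrank, the regions near the
# end return 0 bytes (or an I/O error) -- the "labels 2 and 3
# missing" symptom.
dev=/dev/rdsk/emcpower0a
asize=107365269504           # from label 0 of emcpower0a, quoted above
for off in 0 262144 $((asize - 524288)) $((asize - 262144)); do
    bytes=$(dd if="$dev" bs=262144 skip=$((off / 262144)) count=1 \
            2>/dev/null | wc -c)
    echo "label region at offset $off: read $bytes of 262144 bytes"
done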
Re: [zfs-discuss] lost zpool when server restarted.
> Looking at the txg numbers, it seems that the labels on the two
> devices that are unavailable now may be stale:

Actually, they look OK. The txg values in the label indicate the last
txg in which the pool configuration changed for the devices in that
top-level vdev (e.g. a mirror or RAID-Z group), not the last txg
synced.

Jeff
Re: [zfs-discuss] lost zpool when server restarted.
Jeff Bonwick wrote:
>> Looking at the txg numbers, it seems that the labels on the two
>> devices that are unavailable now may be stale:
>
> Actually, they look OK. The txg values in the label indicate the
> last txg in which the pool configuration changed for the devices in
> that top-level vdev (e.g. a mirror or RAID-Z group), not the last
> txg synced.

Agreed -- I jumped to conclusions here. Still, there is a difference
between the two labels presented. Since this pool had been running for
a while, I suppose there were no admin-initiated configuration changes,
so the config change may be due to the allocation of the DTL object --
correct?

It would still be interesting to know the txg of the selected
uberblock, to see how long ago that change happened. It would also be
interesting to know why the server rebooted.

Victor
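When the pool cannot be imported there is nothing for 'zdb -uuu mypool'
to open, but the uberblock txgs can still be read straight off the
disk. The following is a rough sketch based on the on-disk label
format: the second half of each 256 KB label is a 128 KB uberblock
array, which at ashift=9 is 128 slots of 1 KB, each beginning with the
8-byte fields ub_magic (0x00bab10c), ub_version, and ub_txg. Byte order
and checksum validation are left to the reader; among valid slots, the
one with the highest txg is the active uberblock:

#!/bin/bash
# Sketch: hex-dump magic, version, and txg (3 x 8 bytes) of every
# uberblock slot in label 0 of emcpower3a.  The uberblock array starts
# 128 KB into the label, i.e. 128 KB into the device for label 0.
dev=/dev/rdsk/emcpower3a
slot=0
while [ $slot -lt 128 ]; do
    printf "slot %3d:" $slot
    dd if="$dev" bs=1024 skip=$((128 + slot)) count=1 2>/dev/null |
        od -A n -t x8 -N 24 | tr '\n' ' '
    echo
    slot=$((slot + 1))
done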
Re: [zfs-discuss] lost zpool when server restarted.
Because this system was in production I had to recover fairly quickly,
so I was unable to experiment with it much more; we had to destroy it,
recreate a new pool, and then recover the data from tapes. It is a
mystery why it rebooted in the middle of the night -- we could not
figure that out, nor why the pool had this problem. So, unfortunately,
I will not be able to follow up on what you, Victor and Jeff, were
suggesting.

Before we destroyed that pool, I did capture the output of fmdump on
that system to see what failed. As you can see, it happened at around
3:54 AM on Sunday morning, when no admin was on the system to break
anything. The only thing I can think of would be the backups running,
which could generate more traffic; but then, I have had that system set
up this way for over a year, and no changes were made to it from the
storage perspective.

Yes, I did see this URL:
http://www.opensolaris.org/jive/thread.jspa?messageID=220125
but unfortunately I was unable to apply it in my situation, as I had no
idea what values to apply... :(

Anyway, here is the fmdump output:

bash-3.00# fmdump -eV
TIME                           CLASS
Apr 27 2008 03:54:05.605369200 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
        class = ereport.fs.zfs.vdev.open_failed
        ena = 0x18594234ea1
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x39918ce32491d000
                vdev = 0xc40696f31f78fd48
        (end detector)
        pool = mypool
        pool_guid = 0x39918ce32491d000
        pool_context = 1
        vdev_guid = 0xc40696f31f78fd48
        vdev_type = disk
        vdev_path = /dev/dsk/emcpower0a
        parent_guid = 0x39918ce32491d000
        parent_type = root
        prev_state = 0x1
        __ttl = 0x1
        __tod = 0x4814311d 0x24153370

Apr 27 2008 03:54:05.605369725 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
        class = ereport.fs.zfs.vdev.open_failed
        ena = 0x18594234ea1
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x39918ce32491d000
                vdev = 0xd56fa2d7686dae8c
        (end detector)
        pool = mypool
        pool_guid = 0x39918ce32491d000
        pool_context = 1
        vdev_guid = 0xd56fa2d7686dae8c
        vdev_type = disk
        vdev_path = /dev/dsk/emcpower2a
        parent_guid = 0x39918ce32491d000
        parent_type = root
        prev_state = 0x1
        __ttl = 0x1
        __tod = 0x4814311d 0x2415357d

Apr 27 2008 03:54:05.605369225 ereport.fs.zfs.zpool
nvlist version: 0
        class = ereport.fs.zfs.zpool
        ena = 0x18594234ea1
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x39918ce32491d000
        (end detector)
        pool = mypool
        pool_guid = 0x39918ce32491d000
        pool_context = 1
        __ttl = 0x1
        __tod = 0x4814311d 0x24153389

Apr 27 2008 03:56:28.180698100 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
        class = ereport.fs.zfs.vdev.open_failed
        ena = 0x398b69181e00401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x39918ce32491d000
                vdev = 0xc40696f31f78fd48
        (end detector)
        pool = mypool
        pool_guid = 0x39918ce32491d000
        pool_context = 1
        vdev_guid = 0xc40696f31f78fd48
        vdev_type = disk
        vdev_path = /dev/dsk/emcpower0a
        parent_guid = 0x39918ce32491d000
        parent_type = root
        prev_state = 0x1
        __ttl = 0x1
        __tod = 0x481431ac 0xac53bf4

Apr 27 2008 03:56:28.180698375 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
        class = ereport.fs.zfs.vdev.open_failed
        ena = 0x398b69181e00401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x39918ce32491d000
                vdev = 0xd56fa2d7686dae8c
        (end detector)
        pool = mypool
        pool_guid = 0x39918ce32491d000
        pool_context = 1
        vdev_guid = 0xd56fa2d7686dae8c
        vdev_type = disk
        vdev_path = /dev/dsk/emcpower2a
        parent_guid = 0x39918ce32491d000
        parent_type = root
        prev_state = 0x1
        __ttl = 0x1
        __tod = 0x481431ac 0xac53d07

Apr 27 2008 03:56:28.180698500 ereport.fs.zfs.zpool
nvlist version: 0
        class = ereport.fs.zfs.zpool
        ena = 0x398b69181e00401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x39918ce32491d000
        (end detector)
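As an aside, a dump like the one above boils down nicely to a per-vdev
failure count with a one-liner -- a sketch, assuming the "name = value"
layout that fmdump -eV prints:

# Count ereports per class and vdev path.  For the dump above this
# shows two open_failed events each for emcpower0a and emcpower2a
# (one per import attempt), matching the zpool import status.
fmdump -eV | awk '/class = /     { cls = $3 }
                  /vdev_path = / { print cls, $3 }' | sort | uniq -c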
[zfs-discuss] lost zpool when server restarted.
I have a problem on one of my systems with zfs. I used to have a zpool
created from 3 LUNs on a SAN. I did not have to put any RAID or
anything on it, since it was already using RAID on the SAN. Anyway, the
server rebooted and I cannot see my pools. When I try to import the
pool, it fails. I am using an EMC CLARiiON as the SAN, with PowerPath.

# zpool list
no pools available
# zpool import -f
  pool: mypool
    id: 4148251638983938048
 state: FAULTED
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing devices and
        try again.
   see: http://www.sun.com/msg/ZFS-8000-3C
config:

        mypool        UNAVAIL  insufficient replicas
          emcpower0a  UNAVAIL  cannot open
          emcpower2a  UNAVAIL  cannot open
          emcpower3a  ONLINE

I think I am able to see all the LUNs, and I should be able to access
them on my Sun box.

# powermt display dev=all
Pseudo name=emcpower0a
CLARiiON ID=APM00070202835 [NRHAPP02]
Logical device ID=6006016045201A001264FB20990FDC11 [LUN 13]
state=alive; policy=CLAROpt; priority=0; queued-IOs=0
Owner: default=SP B, current=SP B
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path -   -- Stats ---
###  HW Path              I/O Paths    Interf.    Mode    State   Q-IOs Errors
==============================================================================
3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016041E035A4d0s0  SP A4  active  alive  0  0
3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016941E035A4d0s0  SP B5  active  alive  0  0
3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016141E035A4d0s0  SP A5  active  alive  0  0
3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016841E035A4d0s0  SP B4  active  alive  0  0

Pseudo name=emcpower1a
CLARiiON ID=APM00070202835 [NRHAPP02]
Logical device ID=6006016045201A004C1388343C10DC11 [LUN 14]
state=alive; policy=CLAROpt; priority=0; queued-IOs=0
Owner: default=SP B, current=SP B
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path -   -- Stats ---
###  HW Path              I/O Paths    Interf.    Mode    State   Q-IOs Errors
==============================================================================
3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016041E035A4d1s0  SP A4  active  alive  0  0
3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016941E035A4d1s0  SP B5  active  alive  0  0
3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016141E035A4d1s0  SP A5  active  alive  0  0
3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016841E035A4d1s0  SP B4  active  alive  0  0

Pseudo name=emcpower3a
CLARiiON ID=APM00070202835 [NRHAPP02]
Logical device ID=6006016045201A00A82C68514E86DC11 [LUN 7]
state=alive; policy=CLAROpt; priority=0; queued-IOs=0
Owner: default=SP B, current=SP B
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path -   -- Stats ---
###  HW Path              I/O Paths    Interf.    Mode    State   Q-IOs Errors
==============================================================================
3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016041E035A4d3s0  SP A4  active  alive  0  0
3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016941E035A4d3s0  SP B5  active  alive  0  0
3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016141E035A4d3s0  SP A5  active  alive  0  0
3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016841E035A4d3s0  SP B4  active  alive  0  0

Pseudo name=emcpower2a
CLARiiON ID=APM00070202835 [NRHAPP02]
Logical device ID=600601604B141B00C2F6DB2AC349DC11 [LUN 24]
state=alive; policy=CLAROpt; priority=0; queued-IOs=0
Owner: default=SP B, current=SP B
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path -   -- Stats ---
###  HW Path              I/O Paths    Interf.    Mode    State   Q-IOs Errors
==============================================================================
3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016041E035A4d2s0  SP A4  active  alive  0  0
3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c2t5006016941E035A4d2s0  SP B5  active  alive  0  0
3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016141E035A4d2s0  SP A5  active  alive  0  0
3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0  c3t5006016841E035A4d2s0  SP B4  active  alive  0  0
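Given the claim that all the LUNs are visible to the host, a first
sanity check (a sketch, using the pseudo-device names above) is simply
to try reading a block from each raw device the pool uses:

#!/bin/sh
# Try to read one 512-byte block from each PowerPath pseudo device in
# the pool; a device that cannot even be opened here explains the
# "cannot open" status in the zpool import output.
for d in emcpower0a emcpower2a emcpower3a; do
    if dd if=/dev/rdsk/$d of=/dev/null bs=512 count=1 2>/dev/null; then
        echo "$d: readable"
    else
        echo "$d: cannot open or read"
    fi
done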