Re: [zfs-discuss] zfs destroy hanging
Some more info - the system won't shut down; issuing shutdown -g0 -i5 just sits there doing nothing.

Then I tried to find locks on the savecore I took - mdb crashes:

mdb -k ./unix.1 ./vmcore.1
mdb: failed to read panicbuf and panic_reg -- current register set will be unavailable
Loading modules: [ unix genunix specfs dtrace cpu.generic uppc pcplusmp scsi_vhci zfs sd ip hook neti sctp arp usba uhci fctl fcip fcp md cpc random crypto smbsrv nfs lofs logindmux ptm ufs sppp nsmb ipc ]
> ::findlocks
mdb: type graph not yet built; run ::typegraph.
> ::typegraph
typegraph: pass = initial
mdb: failed to read slab at 0x4350b001065fff0: no mapping for address
*** mdb: received signal ABRT at:
    [1] libc.so.1`_lwp_kill+0xa()
    [2] libc.so.1`raise+0x19()
    [3] libumem.so.1`umem_do_abort+0x1c()
    [4] libumem.so.1`umem_err_recoverable+0xb8()
    [5] libumem.so.1`process_free+0x17e()
    [6] libumem.so.1`free+0x16()
    [7] mdb`mdb_free+0x3b()
    [8] genunix.so`avl_walk_fini+0x26()
    [9] genunix.so`combined_walk_fini+0x40()
    [10] mdb`mdb_wcb_destroy+0x3d()
    [11] mdb`walk_common+0xb8()
    [12] mdb`mdb_pwalk+0x3f()
    [13] genunix.so`kmem_estimate_allocated+0x37()
    [14] genunix.so`typegraph_estimate+0x2e()
    [15] genunix.so`list_walk_step+0xa2()
    [16] mdb`walk_step+0x5e()
    [17] mdb`walk_common+0x7d()
    [18] mdb`mdb_pwalk+0x3f()
    [19] mdb`mdb_walk+0xc()
    [20] genunix.so`typegraph+0xd1()
    [21] mdb`dcmd_invoke+0x64()
    [22] mdb`mdb_call_idcmd+0xff()
    [23] mdb`mdb_call+0x390()
    [24] mdb`yyparse+0x4e5()
    [25] mdb`mdb_run+0x2cd()
    [26] mdb`main+0x1246()
    [27] mdb`_start+0x6c()
mdb: (c)ore dump, (q)uit, (r)ecover, or (s)top for debugger [cqrs]?

--- I hit [r] ---

mdb: unloading module 'genunix' ...
Segmentation Fault (core dumped)
Re: [zfs-discuss] zfs destroy hanging
And pstack won't give a stack for the bootadm process:

devu...@zfs05:/var/crash/zfs05# pstack 23870
23870:  /sbin/bootadm -a update_all
devu...@zfs05:/var/crash/zfs05# pstack -F 23870
23870:  /sbin/bootadm -a update_all
devu...@zfs05:/var/crash/zfs05# kill -9 23870
devu...@zfs05:/var/crash/zfs05# kill -9 23870
devu...@zfs05:/var/crash/zfs05# kill -9 23870
devu...@zfs05:/var/crash/zfs05# pstack -F 23870
pstack: cannot examine 23870: unanticipated system error
devu...@zfs05:/var/crash/zfs05# pstack -F 23870
pstack: cannot examine 23870: unanticipated system error
devu...@zfs05:/var/crash/zfs05# ps -ef | grep bootadm
    root 24214 23890   0 11:11:00 pts/13      0:00 grep bootadm
    root 23870 23847   0 10:35:03 pts/8       0:00 /sbin/bootadm -a update_all
devu...@zfs05:/var/crash/zfs05# kill -9 23870
devu...@zfs05:/var/crash/zfs05# ps -ef | grep bootadm
    root 24220 23890   0 11:11:13 pts/13      0:00 grep bootadm
    root 23870 23847   0 10:35:03 pts/8       0:00 /sbin/bootadm -a update_all
devu...@zfs05:/var/crash/zfs05# pstack -F 23870
pstack: cannot examine 23870: unanticipated system error
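Since pstack can't even examine the process and SIGKILL never takes effect, it is almost certainly stuck inside the kernel rather than in userland. One way to see where (just a sketch - run as root on the live system; the 0t prefix tells mdb the PID is decimal) would be to pull the kernel stacks of that PID's threads:

# echo "0t23870::pid2proc | ::walk thread | ::findstack -v" | mdb -k

If the stack ends in a txg or DSL-sync style wait inside the zfs module, that would explain why the signal is never delivered.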
Re: [zfs-discuss] zfs destroy hanging
We have just got a hang like this. Here's the output of ps -ef | grep zfs:

    root   425     7   0   Jun 17 console     0:00 /usr/lib/saf/ttymon -g -d /dev/console -l console -m ldterm,ttcompat -h -p zfs0
    root 22879 22876   0 18:18:37 ?           0:01 /usr/sbin/zfs rollback -r tank/aa
    root 22884     1   0 18:19:02 ?           0:00 /usr/sbin/zfs clone tank/templates/x tank/yy
    root 23188 23006   0 18:42:50 pts/8       0:00 grep z
    root 22883 22826   0 18:19:01 ?           0:00 /usr/sbin/zfs destroy -rf tank/aa
    root 22880 22855   0 18:18:37 ?           0:01 /usr/sbin/zfs rollback -r tank/aa
    root 22930     1   0 18:24:33 ?           0:00 /usr/sbin/zfs clone tank/bb tank/ccc
    root 22961 23010   0 18:25:42 pts/2       0:00 zfs list
    root 22995 22945   0 18:27:11 pts/4       0:00 zfs list

I have created a crash dump, though it is rather large (around 0.5 GB compressed). Should I upload it here (I would rather not), or is there someone I should send it to directly?

The symptoms are otherwise the same - the destroy hangs, the process is not killable even by kill -9, and all other zfs commands hang waiting for it to complete, even zfs list.
Re: [zfs-discuss] zfs destroy hanging
Forgot to mention -

1. This system was installed as 2008.11, so it should have no upgrade issues.

2. Not sure how to do the mdb -k on the dump, the only thing it produced is the following:

> ::status
debugging live kernel (64-bit) on zfs05
operating system: 5.11 snv_101b (i86pc)
> $C
Re: [zfs-discuss] zfs destroy hanging
Ok, sorry for spamming - got some more info from mdb -k:

devu...@zfs05:/var/crash/zfs05# mdb -k unix.0 vmcore.0
mdb: failed to read panicbuf and panic_reg -- current register set will be unavailable
Loading modules: [ unix genunix specfs dtrace cpu.generic uppc pcplusmp scsi_vhci zfs sd ip hook neti sctp arp usba uhci fctl fcip fcp md cpc random crypto smbsrv nfs lofs logindmux ptm ufs sppp nsmb ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from zfs05
operating system: 5.11 snv_101b (i86pc)
panic message:
dump content: kernel pages only
> $C
> ::stack
> ::pgrep zfs
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  22884      1  22511  22511      0 0x4a004900 ff0317295350 zfs
R  22930      1  22930  22918      0 0x4a004000 ff0306ca68f8 zfs
R  22879  22876  22825  22825      0 0x4a004000 ff0306bf4918 zfs
R  22880  22855  22825  22825      0 0x4a004900 ff031754f068 zfs
R  22883  22826  22825  22825      0 0x4a004000 ff0306ca5038 zfs
R  22995  22945  22995  22939      0 0x4a004000 ff0306caa6d8 zfs
R  22961  23010  22961  22974      0 0x4a004000 ff031728f050 zfs
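A natural next step in that same mdb session (a sketch - these walkers and dcmds should be present in a snv_101b mdb, but I haven't verified them against this particular dump) is to feed those proc addresses to the thread walker and print each kernel stack, which should show what the destroy - and everything queued up behind it - is blocked on:

> ::pgrep zfs | ::walk thread | ::findstack -v

or, for just the hung destroy (PID 22883, ADDR ff0306ca5038 above):

> ff0306ca5038::walk thread | ::findstack -v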
Re: [zfs-discuss] zfs destroy hanging
Thanks, I've filed your message where I can easily get at it even if I'm having trouble with the server. I'm afraid I'd rather I didn't get the chance to use it, but if something weird does go on, I'm happy to have the procedure to capture information that might help get it identified and fixed.

On Sat, February 14, 2009 16:30, James C. McPherson wrote:
> Hi David,
> if this happens to you again, you could help get more data on the
> problem by getting a crash dump, either forced, or via reboot, or
> (if you have a dedicated dump device) via savecore:
>
> (dedicated dump dev)  # savecore -L /var/crash/`uname -n`

[rest snipped to save bandwidth]

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Re: [zfs-discuss] zfs destroy hanging
I think you can kill the destroy command process using traditional methods.

Perhaps your slowness issue is because the pool is an older format. I've not had these problems since upgrading to the zfs version that comes default with 2008.11.

On Fri, Feb 13, 2009 at 4:14 PM, David Dyer-Bennet <d...@dd-b.net> wrote:
> This shouldn't be taking anywhere *near* half an hour. The snapshots
> differ trivially, by one or two files and less than 10k of data (they're
> test results from working on my backup script). But so far, it's still
> sitting there after more than half an hour.
>
> local...@fsfs:~/src/bup2# zfs destroy ruin/export
> cannot destroy 'ruin/export': filesystem has children
> use '-r' to destroy the following datasets:
> ruin/export/h...@bup-20090210-202557utc
> ruin/export/h...@20090210-213902utc
> ruin/export/home/local...@first
> ruin/export/home/local...@second
> ruin/export/home/local...@bup-20090210-202557utc
> ruin/export/home/local...@20090210-213902utc
> ruin/export/home/localddb
> ruin/export/home
> local...@fsfs:~/src/bup2# zfs destroy -r ruin/export
>
> It's still hung. Ah, here's zfs list output from shortly before I
> started the destroy:
>
> ruin                        474G  440G  431G   /backups/ruin
> ruin/export                35.0M  440G   18K   /backups/ruin/export
> ruin/export/home           35.0M  440G   19K   /export/home
> ruin/export/home/localddb    35M  440G  27.8M  /export/home/localddb
>
> As you can see, the ruin/export/home filesystem (and subs) is NOT large.
>
> iostat shows no activity on pool ruin over a minute.
>
> local...@fsfs:~$ pfexec zpool iostat ruin 10
>                capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> ruin         474G   454G     10      0  1.13M    840
> ruin         474G   454G      0      0      0      0
> ruin         474G   454G      0      0      0      0
> ruin         474G   454G      0      0      0      0
> ruin         474G   454G      0      0      0      0
> ruin         474G   454G      0      0      0      0
> ruin         474G   454G      0      0      0      0
> ruin         474G   454G      0      0      0      0
> ruin         474G   454G      0      0      0      0
>
> The pool still thinks it is healthy.
>
> local...@fsfs:~$ zpool status -v ruin
>   pool: ruin
>  state: ONLINE
> status: The pool is formatted using an older on-disk format. The pool
>         can still be used, but some features are unavailable.
> action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
>         pool will no longer be accessible on older software versions.
>  scrub: scrub completed after 4h42m with 0 errors on Mon Feb 9 19:10:49 2009
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         ruin        ONLINE       0     0     0
>           c7t0d0    ONLINE       0     0     0
>
> errors: No known data errors
>
> There is still a process out there trying to run that destroy. It
> doesn't appear to be using much cpu time.
>
> local...@fsfs:~$ ps -ef | grep zfs
> localddb  7291  7228   0 15:10:56 pts/4       0:00 grep zfs
>     root  7223  7101   0 14:18:27 pts/3       0:00 zfs destroy -r ruin/export
>
> Running 2008.11.
>
> local...@fsfs:~$ uname -a
> SunOS fsfs 5.11 snv_101b i86pc i386 i86pc Solaris
>
> Any suggestions? Eventually I'll kill the process by the gentlest way
> that works, I suppose (if it doesn't complete).
>
> --
> David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
> Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
> Photos: http://dd-b.net/photography/gallery/
> Dragaera: http://dragaera.info
Re: [zfs-discuss] zfs destroy hanging
On Sat, February 14, 2009 13:04, Blake wrote:
> I think you can kill the destroy command process using traditional
> methods.

kill and kill -9 failed. In fact, rebooting failed; I had to use a hard reset (it shut down most of the way, but then got stuck).

> Perhaps your slowness issue is because the pool is an older format.
> I've not had these problems since upgrading to the zfs version that
> comes default with 2008.11

We can hope. In case that's the cause, I upgraded the pool format (after considering whether I'd be needing to access it with older software; hope I was right :-)).

The pool did import and scrub cleanly, anyway. That's hopeful. Also this particular pool is a scratch pool at the moment, so I'm not risking losing data, only risking losing confidence in ZFS. It's also a USB external disk.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
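For anyone following along, the upgrade itself is short - a sketch, with "ruin" being the pool in question here, and keeping in mind it's one-way as far as older software is concerned:

$ pfexec zpool upgrade -v       # list the on-disk versions this build supports
$ pfexec zpool upgrade ruin     # upgrade the pool to the current version

There is also a separate "zfs upgrade" for filesystem versions, but the pool format is what the zpool status message above complains about.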
Re: [zfs-discuss] zfs destroy hanging
On Sat, 14 Feb 2009 15:40:04 -0600 (CST)
David Dyer-Bennet <d...@dd-b.net> wrote:

> On Sat, February 14, 2009 13:04, Blake wrote:
>> I think you can kill the destroy command process using traditional
>> methods.
>
> kill and kill -9 failed. In fact, rebooting failed; I had to use a hard
> reset (it shut down most of the way, but then got stuck).
>
>> Perhaps your slowness issue is because the pool is an older format.
>> I've not had these problems since upgrading to the zfs version that
>> comes default with 2008.11
>
> We can hope. In case that's the cause, I upgraded the pool format (after
> considering whether I'd be needing to access it with older software;
> hope I was right :-)). The pool did import and scrub cleanly, anyway.
> That's hopeful. Also this particular pool is a scratch pool at the
> moment, so I'm not risking losing data, only risking losing confidence
> in ZFS. It's also a USB external disk.

Hi David,
if this happens to you again, you could help get more data on the problem by getting a crash dump, either forced, or via reboot, or (if you have a dedicated dump device) via savecore:

(dedicated dump dev)  # savecore -L /var/crash/`uname -n`
or
# reboot -dq
(forced, 64bit mode)  # echo "0>rip" | mdb -kw
(forced, 32bit mode)  # echo "0>eip" | mdb -kw

Try the command line options first; only use the mdb kick in the guts if the other two fail.

Once you've got the core, you could post the output of ::status and $C when run over the core with mdb -k.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp       http://www.jmcp.homeunix.com/blog
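The savecore -L route only works if a dedicated dump device is already configured. A quick way to check, and to set one up if needed (a sketch - the zvol path shown is just the usual OpenSolaris 2008.11 default, adjust for your layout):

# dumpadm                                # show the current dump device and savecore directory
# dumpadm -d /dev/zvol/dsk/rpool/dump    # example: point crash dumps at a dedicated zvol

With that in place, savecore -L can snapshot the live kernel without rebooting.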
[zfs-discuss] zfs destroy hanging
This shouldn't be taking anywhere *near* half an hour. The snapshots differ trivially, by one or two files and less than 10k of data (they're test results from working on my backup script). But so far, it's still sitting there after more than half an hour.

local...@fsfs:~/src/bup2# zfs destroy ruin/export
cannot destroy 'ruin/export': filesystem has children
use '-r' to destroy the following datasets:
ruin/export/h...@bup-20090210-202557utc
ruin/export/h...@20090210-213902utc
ruin/export/home/local...@first
ruin/export/home/local...@second
ruin/export/home/local...@bup-20090210-202557utc
ruin/export/home/local...@20090210-213902utc
ruin/export/home/localddb
ruin/export/home
local...@fsfs:~/src/bup2# zfs destroy -r ruin/export

It's still hung. Ah, here's zfs list output from shortly before I started the destroy:

ruin                        474G  440G  431G   /backups/ruin
ruin/export                35.0M  440G   18K   /backups/ruin/export
ruin/export/home           35.0M  440G   19K   /export/home
ruin/export/home/localddb    35M  440G  27.8M  /export/home/localddb

As you can see, the ruin/export/home filesystem (and subs) is NOT large.

iostat shows no activity on pool ruin over a minute.

local...@fsfs:~$ pfexec zpool iostat ruin 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
ruin         474G   454G     10      0  1.13M    840
ruin         474G   454G      0      0      0      0
ruin         474G   454G      0      0      0      0
ruin         474G   454G      0      0      0      0
ruin         474G   454G      0      0      0      0
ruin         474G   454G      0      0      0      0
ruin         474G   454G      0      0      0      0
ruin         474G   454G      0      0      0      0
ruin         474G   454G      0      0      0      0

The pool still thinks it is healthy.

local...@fsfs:~$ zpool status -v ruin
  pool: ruin
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 4h42m with 0 errors on Mon Feb 9 19:10:49 2009
config:

        NAME        STATE     READ WRITE CKSUM
        ruin        ONLINE       0     0     0
          c7t0d0    ONLINE       0     0     0

errors: No known data errors

There is still a process out there trying to run that destroy. It doesn't appear to be using much cpu time.

local...@fsfs:~$ ps -ef | grep zfs
localddb  7291  7228   0 15:10:56 pts/4       0:00 grep zfs
    root  7223  7101   0 14:18:27 pts/3       0:00 zfs destroy -r ruin/export

Running 2008.11.

local...@fsfs:~$ uname -a
SunOS fsfs 5.11 snv_101b i86pc i386 i86pc Solaris

Any suggestions? Eventually I'll kill the process by the gentlest way that works, I suppose (if it doesn't complete).

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
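One additional check that might narrow this down while the destroy sits there (a sketch using stock DTrace, not something suggested in this thread: spa_sync is entered once per transaction group, so it prints the txg number each time the pool syncs - if nothing prints for a while, the pool really is wedged rather than just slow):

# dtrace -n 'fbt:zfs:spa_sync:entry { trace(arg1); }'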