Hi folks,

I'll capture as requested, but to clarify: the testing has been done with datasets ranging from very small (literally a few KB) through a few MB and up to a few TB, and we hang at the same spot, just as the recv begins. Correct: old --> old, no problems.

To give some background, we have several PB of ZFS deployed, primarily on OI, pre-Hipster. We have several systems that run Hipster (for driver support) that are able to receive these streams without any issues (I can't give you exact illumos git revisions at this time, but the span of time is prior to Sept 2016), and these pre-9/2016-ish versions of OI Hipster can successfully send streams to our old OI systems. We run into trouble with systems later than 9/2016 (OmniOS or OI) in that they cannot receive. I haven't looked at OmniOS LTS to see if I can narrow that down yet either. I have also had similar hangs of these later systems against ZFSoL hosts, but have not tested/debugged those extensively.

Stay tuned for dtrace.

Joe

From: Matthew Ahrens <mahr...@delphix.com>
Reply-To: "discuss@lists.illumos.org" <discuss@lists.illumos.org>
Date: Tuesday, November 8, 2016 at 6:41 PM
To: Illumos Discussion <discuss@lists.illumos.org>, Paul Dagnelie <p...@delphix.com>
Subject: Re: [discuss] ZFS recv hangs

It sounds like you're saying that you hit the problem when sending from new -> old, but not when sending the same filesystems from old -> old?

Another clue here could be the output of:

zfs send ... | zstreamdump -v | gzip >file.gz

Though that may be redundant with the dtrace output I mentioned. But if you could get the zstreamdump from both the new and old systems, we could compare them to determine what's happening differently.

--matt
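Concretely, one way to capture and compare the two dumps might look like this. This is only a sketch: the snapshot name is the dpool01/test@now one used later in the thread, and the scratch file names are placeholders. Run the send on both the old-style and new-style senders against the same snapshot, then diff the dumps on one host.

# on each sender, against the same snapshot
zfs send dpool01/test@now | zstreamdump -v | gzip > /var/tmp/zsd-$(hostname).txt.gz

# after copying both dumps to one machine
gzip -dc zsd-oldhost.txt.gz > old.txt
gzip -dc zsd-newhost.txt.gz > new.txt
diff old.txt new.txt | less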
On Tue, Nov 8, 2016 at 4:22 PM, Matthew Ahrens <mahr...@delphix.com> wrote:

On Thu, Nov 3, 2016 at 6:18 AM, Hetrick, Joseph P <joseph-hetr...@uiowa.edu> wrote:

Per Alex suggestion to see where ZFS is at during the hang period:

THREAD           STATE    SOBJ        COUNT
ffffff007a8b3c40 SLEEP    CV              3
                 swtch+0x145
                 cv_timedwait_hires+0xe0
                 cv_timedwait+0x5a
                 txg_thread_wait+0x7c
                 txg_sync_thread+0x118
                 thread_start+8

ffffff007a292c40 SLEEP    CV              3
                 swtch+0x145
                 cv_wait+0x61
                 spa_thread+0x225
                 thread_start+8

ffffff007a8aac40 SLEEP    CV              3
                 swtch+0x145
                 cv_wait+0x61
                 txg_thread_wait+0x5f
                 txg_quiesce_thread+0x94
                 thread_start+8

ffffff007a1bbc40 SLEEP    CV              1
                 swtch+0x145
                 cv_timedwait_hires+0xe0
                 cv_timedwait+0x5a
                 arc_reclaim_thread+0x13d
                 thread_start+8

ffffff007a1c1c40 SLEEP    CV              1
                 swtch+0x145
                 cv_timedwait_hires+0xe0
                 cv_timedwait+0x5a
                 l2arc_feed_thread+0xa1
                 thread_start+8

ffffff11bde0f4a0 ONPROC   <NONE>          1
                 mutex_exit
                 dbuf_hold_impl+0x81
                 dnode_next_offset_level+0xee
                 dnode_next_offset+0xa2
                 dmu_object_next+0x54
                 restore_freeobjects+0x7e
                 dmu_recv_stream+0x7f1
                 zfs_ioc_recv+0x416
                 zfsdev_ioctl+0x347
                 cdev_ioctl+0x45
                 spec_ioctl+0x5a
                 fop_ioctl+0x7b
                 ioctl+0x18e
                 _sys_sysenter_post_swapgs+0x149

echo "::stacks -m zfs" | mdb -k

THREAD           STATE    SOBJ        COUNT
ffffff007a8b3c40 SLEEP    CV              3
                 swtch+0x145
                 cv_timedwait_hires+0xe0
                 cv_timedwait+0x5a
                 txg_thread_wait+0x7c
                 txg_sync_thread+0x118
                 thread_start+8

ffffff007a292c40 SLEEP    CV              3
                 swtch+0x145
                 cv_wait+0x61
                 spa_thread+0x225
                 thread_start+8

ffffff007a8aac40 SLEEP    CV              3
                 swtch+0x145
                 cv_wait+0x61
                 txg_thread_wait+0x5f
                 txg_quiesce_thread+0x94
                 thread_start+8

ffffff007a1bbc40 SLEEP    CV              1
                 swtch+0x145
                 cv_timedwait_hires+0xe0
                 cv_timedwait+0x5a
                 arc_reclaim_thread+0x13d
                 thread_start+8

ffffff007a1c1c40 SLEEP    CV              1
                 swtch+0x145
                 cv_timedwait_hires+0xe0
                 cv_timedwait+0x5a
                 l2arc_feed_thread+0xa1
                 thread_start+8

ffffff11bde0f4a0 ONPROC   <NONE>          1
                 dbuf_hash+0xdc
                 0xffffff11ca05c460
                 dbuf_hold_impl+0x59
                 dnode_next_offset_level+0xee
                 dnode_next_offset+0xa2
                 dmu_object_next+0x54
                 restore_freeobjects+0x7e
                 dmu_recv_stream+0x7f1
                 zfs_ioc_recv+0x416
                 zfsdev_ioctl+0x347
                 cdev_ioctl+0x45
                 spec_ioctl+0x5a
                 fop_ioctl+0x7b
                 ioctl+0x18e
                 _sys_sysenter_post_swapgs+0x149

Are you sure that it's hung? This stack trace seems to indicate that the receive is running, and processing a FREEOBJECTS record. It's possible that this is for a huge number of objects, which could take a long time (perhaps more than it should).

If you can reproduce this, can you capture the record we are processing, e.g. with dtrace:

dtrace -n 'restore_freeobjects:entry{print(*args[1])}'

The last thing printed should be the one that we "hang" on.

FYI - you must be running bits that do not include this commit, which renamed restore_freeobjects():

commit a2cdcdd260232b58202b11a9bfc0103c9449ed52
Author: Paul Dagnelie <p...@delphix.com>
Date:   Fri Jul 17 14:51:38 2015 -0700

    5960 zfs recv should prefetch indirect blocks
    5925 zfs receive -o origin=
    Reviewed by: Prakash Surya <prakash.su...@delphix.com>
    Reviewed by: Matthew Ahrens <mahr...@delphix.com>

--matt
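A slightly expanded variant of that one-liner can show whether the receive is making progress and how large each FREEOBJECTS record is. This is only a sketch: it assumes the pre-rename restore_freeobjects() symbol Matt mentions, typed fbt args (CTF), and the drr_firstobj/drr_numobjs fields of the drr_freeobjects record; if the typed args don't resolve on these bits, Matt's print(*args[1]) form is the fallback.

dtrace -qn '
fbt::restore_freeobjects:entry
{
        /* args[1] is the drr_freeobjects record being applied */
        printf("%Y  FREEOBJECTS firstobj=%u numobjs=%u\n",
            walltimestamp, args[1]->drr_firstobj, args[1]->drr_numobjs);
}'

If the last line printed before the apparent hang shows an enormous numobjs, that would fit the theory above that the receive is grinding through a huge FREEOBJECTS range rather than being stuck.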
Where the action was:

zfs recv -v dpool01/wtf <test-15-out
receiving full stream of dpool01/test@now into dpool01/wtf@now

test-15-out is the output of "zfs send dpool01/test@now >test-15-out", which was then sent to the node. It's only about 48 KB in size, with no filesystem data (though the problem also exists when I have a filesystem with data). I've created a few identical filesystems on a few nodes and done some hex compares with them, but nothing extensive beyond "I differences".

Thanks Alex,

Joe

On 11/2/16, 11:15 AM, "Hetrick, Joseph P" <joseph-hetr...@uiowa.edu> wrote:

Hi folks,

We've run into an odd issue that seems concerning. Our shop runs OpenIndiana and we've got several versions in play. Recently, while testing a new system that is much more recent (a bleeding-edge OI Hipster release), we discovered that zfs sends to older systems caused hangs. By older, we're talking the same zfs/zpool versions (5/28) and no visible property differences. I can provide more info if told what is useful, but the gist is that:

zfs send of a vanilla dataset (no properties defined other than defaults) to any "older" system causes the recv to hang, and eventually the host will crash. Truss'ing the recv process doesn't seem to give a lot of info as to the cause. The filesystem snapshot is received, and then that's it. No fancy send or recv args are in play (zfs send of the dataset via netcat or mbuffer or ssh to a "zfs recv -v <dest>"; the shape of these transfers is sketched after this message). A close comparison of zfs and pool properties shows no difference. On a whim we even created pools and datasets that were downversioned below the senders'.

We've seen this in hosts a bit later than illumos-a7317ce, but not before (and certainly a bit later), and in where we are now: illumos-2816291. Oddly, illumos-a7317ce systems appear to be able to receive these datasets just fine, and we've had no problems with systems of that vintage sending to older systems.

Any ideas and instruction are most welcome,

Joe
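For readers following along, the transfers described above reduce to roughly the following. The dataset and file names are the ones used earlier in the thread; the host name is a placeholder, and ssh is just one of the transports mentioned.

# stream piped straight to the receiver
zfs send dpool01/test@now | ssh oldhost zfs recv -v dpool01/wtf

# or via an intermediate file, as in the test-15-out case above
zfs send dpool01/test@now > test-15-out
# copy test-15-out to the receiving node, then:
zfs recv -v dpool01/wtf < test-15-out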