*Synopsis*: ZFS Send/ZFS Receive: Receive Side of Pipe is not created - Causes
process hang - ksh93
CR 6876768/solaris_nevada changed on Feb 24 2010 by <User 1-16P5I3>
=== Field ============ === New Value ============= === Old Value =============
Evaluation             New Note
Status                 5-Cause Known               3-Accepted
====================== =========================== ===========================
*Change Request ID*: 6876768/solaris_nevada
*Synopsis*: ZFS Send/ZFS Receive: Receive Side of Pipe is not created - Causes
process hang - ksh93
Product: solaris
Category: shell
Subcategory: korn93
Type: Defect
Subtype:
Status: 5-Cause Known
Substatus:
Priority: 2-High
Introduced In Release:
Introduced In Build:
Responsible Engineer: <User 1-16P5I3>
Keywords: BOP
=== *Description* ============================================================
zfs send/zfs receive operations hang under ksh93.
The problem is that the right-hand ("receive") side of the pipe is never
created. The included truss output shows that the zfs receive process is not
started: in the failure mode, truss does not even create its output file for
the receive side.
There MAY also be ioctl involvement, as indicated by the last line of the
included truss output:
1955: ioctl(3, ZFS_IOC_SEND, 0x08044A00) (sleeping...)
It may also be that when the receive side of an anonymous pipe dies, the
corresponding send side of the pipe is not killed off.
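For reference, a minimal sketch of the kind of loop and pipeline involved is
shown below. The pool, dataset, and snapshot names here are placeholders, not
the customer's actual names (his scripts are attached).

  #!/usr/bin/ksh93
  # Hypothetical reproduction sketch: replicate each snapshot of a source
  # dataset into a destination dataset through an anonymous pipe.
  SRC=testpool/source          # assumed source dataset
  DST=testpool/destination     # assumed destination dataset
  for snap in $(zfs list -H -t snapshot -o name -r "$SRC"); do
      # In the failure mode the left-hand "zfs send" sleeps in ZFS_IOC_SEND
      # while the right-hand "zfs receive" is never started at all.
      zfs send "$snap" | zfs receive -F "$DST"
  done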
Potentially related CRs:
6859444
6782948
4657448 - The customer reports that the test included in CR 4657448 does not
fail on his host; he believes the failure depends on having multiple cores
and on the number of processes in the pipeline.
In addition, the customer's scripts are included as an attachment. You will
find the following:
1) diff showing change that eliminated the hang (w/o changing shell)
2) r1.2 of the script that includes the hang
3) current version of script that includes various changes and does not (yet)
cause hang (but multiple copies, run in parallel, can induce the zfs 3-way
deadlock described in SR#71405530)
4) script used to create test zpool
5) script to create 100 test zfs [operations]
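Items 4 and 5 refer to the customer's own attached scripts. Purely as an
illustration of that kind of setup, a sketch might look like the following
(pool name, backing file, sizes, and filesystem count are assumptions):

  #!/usr/bin/ksh93
  # Illustrative only -- the customer's actual setup scripts are attached.
  POOL=testpool
  mkfile 1g /var/tmp/${POOL}.img             # file-backed vdev for testing
  zpool create "$POOL" /var/tmp/${POOL}.img
  # Create 100 test filesystems, each with an initial snapshot.
  integer i
  for ((i = 1; i <= 100; i++)); do
      zfs create "${POOL}/fs${i}"
      zfs snapshot "${POOL}/fs${i}@backupzfs-initial"
  done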
*** (#1 of 1): 2009-08-28 00:23:49 GMT+00:00 <User 1-90DA97>
*** Last Edit: 2009-09-23 15:13:21 GMT+00:00 <User 1-90DA97>
=== *Public Comments* ========================================================
I don't see any indication that this is a ZFS bug. Please provide a crash dump
from when the system is "hung".
This bug will be closed or recategorized as a ksh bug on September 10 if more
information is not provided.
*** (#1 of 9): 2009-08-28 04:38:45 GMT+00:00 <User 1-5Q-9707>
In examining the different trusses provided by the customer, we see that the
action the customer wants to perform is
zfs send | zfs receive
We trussed both sides of the pipe with:
truss -o sendside zfs send | truss -o recvside zfs receive
This process was run in a loop to transfer multiple snapshots, and the truss
output file names are unique across the entire loop. At some point in the
process the left-hand side hangs, and no truss output file is created for the
receive side.
This strongly suggests that the receive side of the pipe was never created at
all, and that the "zfs receive" command itself is not part of the issue.
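The tracing loop might have looked roughly like the sketch below; the output
file naming and snapshot selection here are assumptions, only the truss
invocation itself is taken from the actual run:

  # Sketch of the tracing loop; dataset names and paths are placeholders.
  integer n=0
  for snap in $(zfs list -H -t snapshot -o name -r testpool/source); do
      (( n++ ))
      truss -o /var/tmp/sendside.$n zfs send "$snap" |
          truss -o /var/tmp/recvside.$n zfs receive -F testpool/destination
  done
  # In the failure case, sendside.$n exists and ends in the sleeping
  # ZFS_IOC_SEND ioctl, while recvside.$n is never created at all.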
This issue was reproducible on the customer's system. It might take an hour
or so, but the hang would eventually occur. Since the customer uses these
scripts to make backups, they are run at regular intervals and the chance of
hitting the failure is very high.
In an attempt to isolate the issue, we changed the send/receive pair from
"zfs" to "dd". While we moved the same amount of data, in over 300 iterations
we did not see a single hang.
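The substitution might look roughly like this; the block size, count, and
file names are assumptions, the point being that both ends of the pipe are
then ordinary user-space readers and writers:

  # Same shell, same anonymous pipe, comparable data volume, no zfs ioctl.
  for ((i = 1; i <= 300; i++)); do
      dd if=/dev/zero bs=1024k count=1024 2>/dev/null |
          dd of=/var/tmp/ddtest.out bs=1024k 2>/dev/null
  done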
The analysis suggests that this MIGHT be an interaction between ksh and the
kernel/ZFS. The "zfs send" command makes a single ioctl() call which
generates a large amount of data; that data is moved directly from ZFS to the
output file descriptor, bypassing user space.
Additional point: this lockup is NOT seen if the user pipes across the
network, e.g. zfs send | ssh remote.system.com zfs receive
*** (#2 of 9): 2009-08-28 12:45:35 GMT+00:00 <User 1-92VH66>
Based on the additional information that Chris provided, is a crash dump still
required?
*** (#3 of 9): 2009-08-28 15:03:01 GMT+00:00 <User 1-90DA97>
From the latest update it looks like this is a Solaris ksh93 bug.
*** (#4 of 9): 2009-09-23 15:43:34 GMT+00:00 <User 1-1SURPB>
Transferring to shell/korn93 based on the last comment for further
investigation.
*** (#5 of 9): 2009-10-01 17:13:43 GMT+00:00 <User 1-3GMVGZ>
Roland Mainz, the OpenSolaris ksh93 integration project lead, asks:
Does the problem go away if you apply the ksh93-integration update2 tarballs
from
http://www.opensolaris.org/os/project/ksh93-integration/downloads/2009-09-22/
You can find his e-mail address and the project mailing list/web forum on
the ksh93 project pages at:
http://opensolaris.org/os/project/ksh93-integration/
*** (#6 of 9): 2009-10-01 20:24:07 GMT+00:00 <User 1-5Q-1267>
Queried customer and asked that he download and apply the eval/test tarballs to
determine if they resolve the issue. Will update with results.
*** (#7 of 9): 2009-10-09 17:04:42 GMT+00:00 <User 1-90DA97>
Please see attachment 6876768emailupdate.txt for update queries related to
customer response... Thanks.
*** (#8 of 9): 2009-10-13 15:52:13 GMT+00:00 <User 1-90DA97>
Note from Roland:
<nrubsig> Note for the customer: The ksh93-integration 2009-09-22 binaries are
very close (minus packaging changes) to what will end up in
OpenSolaris.2010.03. It is highly unlikely that they cause problems (and the
only reason why they are not labelled "gamma" or "final" binaries is that I
simply copied the page template from previous versions).
*** (#9 of 9): 2009-10-14 11:42:38 GMT+00:00 <User 1-5Q-5197>
=== *Workaround* =============================================================
/usr/bin/ksh seems to be working fine. I captured the process listing
(below), which shows that the 'zfs receive' process is kicked off and runs
for some time.
521 /usr/lib/ssh/sshd
2525 /usr/lib/ssh/sshd
2526 /usr/lib/ssh/sshd
2531 bash
2546 bash
2547 bash
16571 dtrace -s /tmp/load/trace.d -o /var/tmp/trace.out -c
./zfs_hang_script.ksh
16572 /usr/bin/ksh ./zfs_hang_script.ksh
29539 zfs send ACS/acsnfs6/ACS/mail/<email address
omitted>:backupzfs-2010-02-11-09:
29540 zfs receive -F ACSTEST/mail/26
Can you ask the customer to run /usr/bin/ksh instead of /usr/bin/ksh93?
In the meantime, I am running the mailsync program with DTrace probes, which
should tell us why the 'zfs recv' process (in the ksh93 case) exits.
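If the script selects its interpreter via its #! line (assuming it currently
names /usr/bin/ksh93 there), the suggested workaround amounts to either
editing that line or invoking the legacy shell explicitly, for example:

  # Either change the first line of zfs_hang_script.ksh from
  #   #!/usr/bin/ksh93
  # to
  #   #!/usr/bin/ksh
  # or run the script under the legacy interpreter directly:
  /usr/bin/ksh ./zfs_hang_script.ksh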
*** (#1 of 1): 2010-02-15 12:07:24 GMT+00:00 <User 1-16P5I3>
=== *Additional Details* =====================================================
Targeted Release: solaris_nevada
Commit To Fix In Build:
Fixed In Build:
Integrated In Build:
Verified In Build:
See Also: 4657448, 6782948, 6859444
Duplicate of:
Hooks:
Hook1:
Hook2:
Hook3:
Hook4:
Hook5:
Hook6:
Program Management:
Root Cause:
Fix Affects Documentation: No
Fix Affects Localization: No
=== *History* ================================================================
Date Submitted: 2009-08-28 00:23:46 GMT+00:00
Submitted By: <User 1-90DA97>
Status Changed     Date Updated                    Updated By
2-Incomplete       2009-08-28 04:38:45 GMT+00:00   <User 1-5Q-9707>
3-Accepted         2009-11-12 14:32:34 GMT+00:00   <User 1-90DA97>
5-Cause Known      2010-02-24 14:07:05 GMT+00:00   <User 1-16P5I3>
=== *Service Request* ========================================================
Impact: Significant
Functionality: Secondary
Severity: 3
Product Name: solaris
Product Release: osol_2009.06
Product Build: osol_2009.06
Operating System: osol_2009.06
Hardware:
Submitted Date: 2009-08-28 00:23:49 GMT+00:00
=== *Multiple Release (MR) Cluster* - 6876768 ================================
ID: +6876768/solaris_nevada
SubCR Number: 6876768
Targeted Release: solaris_nevada
=== *SubCR* ==================================================================
ID: 6876768/osol_2009.06u6
Status: 3-Accepted
Substatus:
Priority: 3-Medium
Responsible Engineer: <User 1-16P5I3>
SubCR Number: 2182414
Targeted Release: osol_2009.06u6
Commit To Fix In Build:
Fixed In Build:
Integrated In Build:
Verified In Build:
Hook1:
Hook2:
Program Management: