James Carlson wrote:
> I recall seeing questions from customers and/or ISVs who wanted to
> have this in Solaris, but I don't see that any RFEs were ever filed.
Actually, Jerry Chu tried to do this project a long time ago.
He even filed a 1-pager (dated 7/22/97) with some technical
explanations. I've forgotten why the project did not work out.
In case people are interested, I include the technical part of
the 1-pager below.
---
2.1. Project Description
Provide an efficient, copyless, in-kernel shortcut for moving
data between different I/O objects. The premise is that no
user inspection and/or manipulation of the data is needed.
This project provides a new system call, splice(), that moves
data entirely within the kernel address space between two I/O
objects identified by file descriptors. The motivation is to
improve I/O performance and reduce CPU utilization.
The exact level where data is moved depends on the types
of the file descriptors and the implementation. On a system
where device-to-device DMA is supported, it's conceivable that
data can be moved from one device to another without going
through the host memory.
The types of I/O objects supported by splice() will be defined
later. The final list will depend on both usefulness and ease
of implementation. E.g. splicing a file and a network socket
together can be very useful, but splicing a file and a raw
device together seems to be of questionable use.
Both synchronous and asynchronous modes of operation will be
supported. The exact semantics of an asynch splice() need to
be defined later.
Many details of the API, including the argument list and its
semantics, the states of the file descriptors before and after
splice(), etc., need to be fleshed out later. In essence, the
purpose of the new API is to provide new, useful services to
applications. Therefore its shape shall be driven by what best
suits the applications' needs.
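For concreteness, one possible (purely hypothetical) shape of
the call is sketched below; the prototype, the SPLICE_ASYNC
flag name and the send_file() helper are illustrative
assumptions only, not part of the proposal.

/*
 * HYPOTHETICAL sketch only: the argument list is deliberately left
 * open above, so this prototype and the SPLICE_ASYNC flag name are
 * assumptions made for illustration.
 */
#include <sys/types.h>

/* Move up to 'count' bytes from fd_in (starting at off_in) to fd_out. */
extern ssize_t splice(int fd_in, off_t off_in, int fd_out,
    size_t count, int flags);

#define SPLICE_ASYNC    0x1     /* assumed flag selecting the asynch mode */

/* Example use: send a whole file out over a connected socket. */
static int
send_file(int file_fd, int sock_fd, size_t file_size)
{
        off_t off = 0;

        while ((size_t)off < file_size) {
                ssize_t n = splice(file_fd, off, sock_fd,
                    file_size - (size_t)off, 0);
                if (n <= 0)
                        return (-1);    /* error or unexpected EOF */
                off += n;               /* no data ever enters user space */
        }
        return (0);
}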
4. Technical Description
Work has been done before to address some I/O efficiency
issues. This includes Zero Copy TCP, as_pagelock, and UFS
direct/IO. There is also an fbufs proposal. But splice() is
different in two key aspects:
i) Zero Copy TCP, as_pagelock, and fbufs all address the case
where data needs to traverse the user address space for some
user processing. splice(), however, is based on the premise
that no user-level data manipulation is needed, so a shortcut
can be built entirely in the kernel, without involving the
user address space.
ii) Zero Copy TCP can also be used to move data between a file
and the network. But since it uses the existing APIs, it
incurs additional overhead in order to uphold all the API
semantics. E.g. data needs to be mapped into the user address
space, and copy-on-write protection might be needed too.
Since splice() defines a new API, it has a free hand to do
the job in the most efficient way.
There are two main challenges facing the project. The first
is to efficiently move data from one form of "container",
supplied by the source I/O object, to another that can be
consumed by the target I/O object.
The second is to figure out a control scheme that attains
optimal performance. This is usually achieved by minimizing
the number of context switches required, and by using asynch
I/O to achieve good I/O interleaving. The exact scheme may
vary, depending on the types of the I/O objects involved.
There are three different types of "data container" employed
by I/O objects to hold data (a sketch of one possible in-kernel
descriptor for them follows the list) -
A. physical memory pages (pp) with a [vnode, offset] identity -
Used by regular files or block devices.
Note that normally a virtual memory mapping has to be set up
first, before data in a memory page is accessible by the CPU.
This is usually done through a VM segment driver. It's also
possible to set up mappings in an ad-hoc manner.
B. mblks -
Used by network endpoints/STREAMS devices.
C. temporary (un-named) buffers or memory pages -
Used by raw/character (non-STREAMS) devices.
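A minimal sketch of such a descriptor is below, assuming the
standard Solaris kernel headers; the splice_container type and
its field names are invented for illustration, while page_t,
mblk_t and vnode_t are the existing types referred to above.

/*
 * ILLUSTRATIVE only: a hypothetical tagged descriptor that an
 * in-kernel splice layer could use to hand any of the three
 * container types from a source to a sink.
 */
#include <sys/types.h>
#include <sys/vnode.h>          /* vnode_t */
#include <sys/stream.h>         /* mblk_t */
#include <vm/page.h>            /* page_t */

typedef enum {
        SPLICE_C_PAGE,          /* A: page with a [vnode, offset] identity */
        SPLICE_C_MBLK,          /* B: STREAMS message block */
        SPLICE_C_BUF            /* C: temporary, un-named kernel buffer */
} splice_ctype_t;

typedef struct splice_container {
        splice_ctype_t  sc_type;
        size_t          sc_len;
        union {
                struct {
                        vnode_t         *vp;    /* identity: [vnode, offset] */
                        u_offset_t      off;
                        page_t          *pp;
                } page;
                mblk_t                  *mp;
                void                    *buf;
        } sc_u;
} splice_container_t;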
The potential amount of CPU saving from splice() will depend
on the types of the I/O objects being spliced together. At the
very least there will be a saving of one uio-to-uio copy
operation. Further possible savings include user/kernel virtual
memory mappings, use of kernel memory, and context switches.
The latter will depend on how well the data flow can be made
to stream.
In the following I'll attempt to describe some possible
implementations for five different types of splices:
1. A <-> A (file to file)
Since the same kind of data container (physical memory pages)
is used in both the source and the sink, this is relatively
easy. Either use VOP_PAGEIO to source and sink data pages,
thus bypassing the page cache, or use VOP_GETPAGE to fetch the
source page, rename it to the target [vnode, offset], then
sink it using VOP_PUTPAGE.
Saving: at least one user mapping, one kernel mapping and one
data copy.
If both the source and the sink reside on the same filesystem,
the splice() operation can be pushed one level down, below the
vnode layer, in order to gain more efficiency. A new vnode op
will have to be invented.
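A minimal sketch of the VOP_GETPAGE / page rename / VOP_PUTPAGE
path described above follows. The three helpers are simplified,
hypothetical stand-ins for the real (and much wider) vnode and
page interfaces; locking and error handling are omitted.

#include <sys/types.h>
#include <sys/param.h>          /* PAGESIZE */
#include <sys/vnode.h>
#include <vm/page.h>

/* Hypothetical, simplified stand-ins for the real interfaces. */
extern page_t *file_get_page(vnode_t *vp, u_offset_t off);
extern void file_rename_page(page_t *pp, vnode_t *vp, u_offset_t off);
extern void file_put_page(vnode_t *vp, u_offset_t off, size_t len);

static void
splice_file_to_file(vnode_t *src, vnode_t *dst, u_offset_t off, size_t len)
{
        u_offset_t pos;

        for (pos = 0; pos < len; pos += PAGESIZE) {
                /* Fetch the source page into the page cache. */
                page_t *pp = file_get_page(src, off + pos);

                /* Give it the target [vnode, offset] identity... */
                file_rename_page(pp, dst, off + pos);

                /* ...and let the sink filesystem write it out. */
                file_put_page(dst, off + pos, PAGESIZE);
        }
}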
2. A -> B (file to network or STREAMS device)
This is more difficult than the previous one because the source
and the sink have incompatible data containers, one uses
physical addresses, the other uses virtual addresses.
One solution is to map the physical memory pages into the
kernel virtual address space before making the mblk carry
the virtual pointers. A more radical approach is to modify
the mblk structure to carry physical memory too. This approach
only works when none of the kernel STREAMS modules need to
access the data. Unfortunately, for a networking protocol like
TCP, this is not the case. TCP needs to checksum the data
before sending it out. However, if the data checksumming can
be done using physical addresses, or be completely offloaded to
special hardware, this becomes feasible. Note that in this
scheme all the modules/drivers involved have to understand the
new mblk data format.
Saving: at least one data copy. If the mblk is made to carry
physical pages directly, it will save kernel mappings too.
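A rough sketch of the first approach, assuming the standard
STREAMS esballoc(9F) interface: the page is mapped into kernel
virtual address space and lent to an mblk, and the free routine
unmaps it once the network stack is done with the data. The
map_page_to_kva()/unmap_page_kva() helpers are hypothetical
placeholders for whatever mapping mechanism is chosen.

#include <sys/types.h>
#include <sys/stream.h>
#include <sys/kmem.h>
#include <vm/page.h>

/* Hypothetical helpers for mapping a page into kernel VA space. */
extern caddr_t map_page_to_kva(page_t *pp);
extern void unmap_page_kva(caddr_t kva);

typedef struct page_loan {
        frtn_t  pl_frtn;        /* must live until the free routine runs */
        caddr_t pl_kva;
} page_loan_t;

static void
splice_page_free(char *arg)
{
        page_loan_t *pl = (page_loan_t *)arg;

        unmap_page_kva(pl->pl_kva);
        kmem_free(pl, sizeof (*pl));
}

/* Wrap one page of file data in an mblk without copying it. */
static mblk_t *
splice_page_to_mblk(page_t *pp, size_t len)
{
        page_loan_t *pl = kmem_alloc(sizeof (*pl), KM_SLEEP);
        mblk_t *mp;

        pl->pl_kva = map_page_to_kva(pp);
        pl->pl_frtn.free_func = splice_page_free;
        pl->pl_frtn.free_arg = (char *)pl;

        mp = esballoc((unsigned char *)pl->pl_kva, len, BPRI_MED,
            &pl->pl_frtn);
        if (mp == NULL) {
                splice_page_free((char *)pl);   /* unmap and free on failure */
                return (NULL);
        }
        mp->b_wptr += len;      /* mark the data as valid */
        return (mp);
}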
3. B -> A (network or STREAMS device to file)
Unless the mblk has been made to carry physical memory, which
can be easily consumed by a file vnode using VOP_PAGEIO, or
page renaming followed by VOP_PUTPAGE, there is not much one
can do other than what's already done in "Zero-Copy TCP".
Note that one restriction above is that the network MTU has to
be sufficiently large to cover a whole memory page.
4. B <-> B (network to network or STREAMS devices)
Since both the source and the sink use mblks, this is
relatively easy. A STREAMS multiplexor will be added on top
of both the source and the sink streams to serve as a conduit
between them.
Any processing required on the data can be made into kernel
modules and pushed between the source and the sink. No data
needs to enter the user address space.
Saving: multiple user mappings and data copy operations.
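A minimal sketch of the forwarding idea appears below: a put
routine on the read side of the source stream simply passes
each data message down the write side of the sink stream. The
multiplexor plumbing that links the two streams, and the
service-routine handling of flow control, are omitted;
sink_wq() is a hypothetical helper returning the sink's write
queue.

#include <sys/types.h>
#include <sys/stream.h>
#include <sys/strsun.h>         /* DB_TYPE */

extern queue_t *sink_wq(queue_t *srcq);         /* hypothetical lookup */

static int
conduit_rput(queue_t *q, mblk_t *mp)
{
        queue_t *wq = sink_wq(q);

        if (DB_TYPE(mp) == M_DATA && canputnext(wq)) {
                putnext(wq, mp);        /* hand the data straight to the sink */
        } else {
                putq(q, mp);            /* defer to the service routine */
        }
        return (0);
}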
5. C -> C (raw device to raw device)
Allocate temporary physical memory pages to carry data. No
user buffer or even kernel mapping is needed!
Saving: no virtual memory mappings; no va-to-pp lookups.
By default, splice() runs synchronously in the user context.
This is relatively easy to implement since the switching
of the context can closely mimic the equivalent read()
followed by write() lock-step we currently have. Synchronous
mode also gives the user more control. E.g. the rate at which
data is moved can be controlled easily.
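For comparison, the user-level lock-step that a synchronous
splice() replaces looks like the standard copy loop below; each
block of data is copied into a user buffer by read(2) and back
out by write(2), crossing the user/kernel boundary twice per
block.

#include <unistd.h>

static int
copy_loop(int src_fd, int dst_fd)
{
        char buf[8192];
        ssize_t n;

        while ((n = read(src_fd, buf, sizeof (buf))) > 0) {
                ssize_t off = 0;

                /* write() may be partial; loop until the block is out. */
                while (off < n) {
                        ssize_t w = write(dst_fd, buf + off,
                            (size_t)(n - off));
                        if (w < 0)
                                return (-1);
                        off += w;
                }
        }
        return (n < 0 ? -1 : 0);
}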
Since data is not entering the user address space, constantly
switching in and out of the user context is a waste of precious
CPU cycles. In the ideal, simplistic streaming case, control
only switches between the two I/O completion threads, one for
the source and one for the sink. I.e. the read completion thread
will schedule the write, and the write completion thread will
kick start the next read. But this may put too much work on
interrupt threads. Also, the amount of change required in the
existing I/O subsystems to support full streaming may be
substantial. More design work has to be done here to figure out
a good balance between performance and complexity.
5. Reference Documents
Zero Copy TCP - PSARC case number 950407
as_pagelock - PSARC case number 951205
UFS direct/IO - ??
fbufs - PSARC case number 951103
"Exploiting In-kernel Data Paths to Improve I/O Throughput and
CPU Availability", 1993 Winter USENIX
"Zero-Copy TCP in Solaris", 1996 USENIX
"An Analysis of Process and Memory Models to Support High-Speed
Networking in a UNIX Environment", 1996 USENIX
"Container Shipping - Operating System Support for I/O-
Intensive Applications", 1994 Computer
TransmitFile() man page from Winsock2, WIN32 SDK for Windows NT
--
K. Poon.