James Carlson wrote:

> I recall seeing questions from customers and/or ISVs who wanted to
> have this in Solaris, but I don't see that any RFEs were ever filed.


Actually, Jerry Chu tried to do this project a long, long time
ago.  He even filed a 1-pager (dated 7/22/97) with some technical
explanations.  I forget now why the project did not work out.
In case people are interested, I include the technical part of
the 1-pager below.

---

   2.1. Project Description

        Provide an efficient, copyless, in-kernel shortcut for moving
        data between different I/O objects. The premise is that no
        user inspection and/or manipulation of the data is needed.

        This project provides a new system call, splice(), that
        moves data entirely within the kernel address space between
        two I/O objects identified by file descriptors. The motivation
        is to improve I/O performance and reduce CPU utilization.

        The exact level where data is moved depends on the types
        of the file descriptors and the implementation. On a system
        where device-to-device DMA is supported, it's conceivable that
        data can be moved from one device to another without going
        through the host memory.

        The types of I/O objects supported by splice() will be defined
        later. The final list will depend on both usefulness and ease
        of implementation. For example, splicing a file and a network
        socket together can be very useful, but splicing a file and
        a raw device together seems to be of questionable use.

        Both synchronous and asynchronous modes of operation will be
        supported. The exact semantics of an asynchronous splice()
        need to be defined later.

        Many details of the API, including the argument list, the
        semantics of the arguments, the states of the file descriptors
        before and after splice(), etc., need to be fleshed out later.
        In essence, the purpose of the new API is to provide new,
        useful services to applications. Therefore its format shall
        be driven by what best suits the applications' needs.
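
        To make the discussion concrete, one possible shape for such
        an API is sketched below. The prototype, the argument list,
        the SPLICE_ASYNC flag and the send_file() example are all
        hypothetical; nothing here is committed.

            #include <sys/types.h>

            #define SPLICE_ASYNC    0x1     /* hypothetical flag: don't block */

            /* hypothetical prototype; the real argument list is undefined */
            extern ssize_t splice(int src_fd, int dst_fd, size_t len, int flags);

            /* hypothetical use: shuttle a whole file out over a TCP socket */
            static ssize_t
            send_file(int file_fd, int sock_fd)
            {
                    ssize_t n, total = 0;

                    while ((n = splice(file_fd, sock_fd, 64 * 1024, 0)) > 0)
                            total += n;
                    return (n < 0 ? -1 : total);
            }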

4. Technical Description

        Work has been done before to address some I/O efficiency
        issues. This includes Zero Copy TCP, as_pagelock, and UFS
        direct I/O. There is also an fbufs proposal. But splice() is
        different in two key respects:

        i) Zero Copy TCP, as_pagelock, and fbufs all address the
        case where data needs to traverse the user address space
        for some user processing. However, splice() is based on
        the premise that no user-level data manipulation is needed, so
        a shortcut can be built entirely in the kernel, without
        involving the user address space.

        ii) Zero Copy TCP can also be used to move data between a file
        and the network. But since it uses the existing APIs, additional
        overhead is incurred in order to uphold all the API semantics.
        For example, data needs to be mapped into the user address
        space, and copy-on-write protection might be needed too.

        Since splice() defines a new API, it has a free hand to do
        the job in the most efficient way.

        There are two main challenges facing the project. The first
        is to efficiently move data from one form of "container",
        supplied by the source I/O object, to another that can be
        consumed by the target I/O object.

        The second is to figure out a control scheme to attain the
        optimal performance. This is usually achieved by minimizing
        the number of context switches required, and by using asynch
        I/O to achieve good I/O interleaving. The exact scheme may
        vary, depending on the types of the I/O objects involved.

        There are three different types of "data container" employed
        by I/O objects to hold data -

        A. physical memory pages (pp) with a [vnode, offset] identity -
        Used by regular files or block devices.

        Note that normally a virtual memory mapping has to be set up
        first, before data in a memory page is accessible by the CPU.
        This is usually done through a VM segment driver. It's also
        possible to set up mappings in an ad-hoc manner.

        B. mblks -
        Used by network endpoints/STREAMS devices.

        C. temporary (un-named) buffers or memory pages -
        Used by raw/character (non-STREAMS) devices.
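
        A grossly simplified way to picture the three containers is
        sketched below. The real Solaris structures (page_t, mblk_t,
        struct buf) carry many more fields; the splice_* types here
        are invented purely for illustration.

            #include <sys/types.h>          /* caddr_t, u_offset_t */
            #include <sys/vnode.h>          /* struct vnode */

            /* A: identity-bearing physical page (cf. page_t) */
            typedef struct splice_page {
                    struct vnode    *sp_vnode;      /* file this page caches */
                    u_offset_t      sp_offset;      /* byte offset in that file */
            } splice_page_t;

            /* B: STREAMS message block (cf. mblk_t) */
            typedef struct splice_mblk {
                    unsigned char       *sm_rptr;   /* first valid byte (kernel VA) */
                    unsigned char       *sm_wptr;   /* one past the last valid byte */
                    struct splice_mblk  *sm_cont;   /* rest of the same message */
            } splice_mblk_t;

            /* C: anonymous buffer for raw/character devices (cf. struct buf) */
            typedef struct splice_buf {
                    caddr_t sb_addr;        /* un-named kernel buffer */
                    size_t  sb_count;       /* transfer length in bytes */
            } splice_buf_t;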

        The potential amount of CPU saving from splice() will depend
        on the types of the I/O objects being spliced together. At the
        very least there will be the saving of a uio-to-uio copy
        operation. Further possible savings include user/kernel virtual
        memory mappings, use of kernel memory, and context switches.
        The latter will depend on how close to fully "streaming" the
        data flow can be made.

        In the following I'll attempt to describe some possible
        implementations for five different types of splices:

        1. A <-> A (file to file)

        Since the same kind of data container (physical memory pages)
        is used in both the source and the sink, this is relatively
        easy. Either use VOP_PAGEIO to source and sink data pages,
        thus bypassing the page cache, or use VOP_GETPAGE to fetch the
        source page, rename it to the target [vnode, offset], then
        sink it using VOP_PUTPAGE.

        Saving: at least one user mapping, one kernel mapping and one
        data copy.

        If both the source and the sink reside on the same filesystem,
        the splice() operation can be pushed one level down below the
        vnode layer, in order to gain more efficiency. A new vnode-op
        will have to be invented.
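
        A rough sketch of the page-renaming variant above. The
        splice_getpage(), splice_rename_page() and splice_putpage()
        helpers are invented stand-ins for VOP_GETPAGE, page renaming
        and VOP_PUTPAGE, whose real argument lists are considerably
        longer.

            #include <sys/types.h>
            #include <sys/param.h>          /* PAGESIZE */
            #include <sys/vnode.h>
            #include <vm/page.h>

            /* invented stand-ins, not the real VOP/VM interfaces */
            extern int  splice_getpage(vnode_t *, u_offset_t, page_t **);
            extern void splice_rename_page(page_t *, vnode_t *, u_offset_t);
            extern int  splice_putpage(vnode_t *, u_offset_t, page_t *);

            static int
            splice_file_to_file(vnode_t *src_vp, vnode_t *dst_vp,
                u_offset_t off, size_t len)
            {
                    u_offset_t  pos;
                    page_t      *pp;
                    int         err;

                    for (pos = off; pos < off + len; pos += PAGESIZE) {
                            /* fetch the source page (VOP_GETPAGE) */
                            if ((err = splice_getpage(src_vp, pos, &pp)) != 0)
                                    return (err);
                            /* give the page its new [vnode, offset] identity */
                            splice_rename_page(pp, dst_vp, pos);
                            /* sink it to the target file (VOP_PUTPAGE) */
                            if ((err = splice_putpage(dst_vp, pos, pp)) != 0)
                                    return (err);
                    }
                    return (0);
            }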

        2. A -> B (file to network or STREAMS device)

        This is more difficult than the previous one because the source
        and the sink have incompatible data containers: one uses
        physical addresses, the other virtual addresses.

        One solution is to map the physical memory pages into the
        kernel virtual address space before making the mblk carry
        the virtual pointers. A more radical approach is to modify
        the mblk structure to carry physical memory too. This approach
        only works when none of the kernel STREAMS modules need to
        access the data. Unfortunately, for a networking protocol like
        TCP, this is not the case. TCP needs to checksum the data
        before sending them out. However, if the data checksumming can
        be done using physical addresses, or be completely offloaded to
        special hardware, this becomes feasible. Note that in this
        scheme all the modules/drivers involved have to understand the
        new mblk data format.

        Saving: at least one data copy. If mblk is made to carry
        physical pages directly, it'll save kernel mappings too.
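
        A sketch of the first option, mapping the page into the kernel
        virtual address space and letting the mblk carry the virtual
        pointer. esballoc(), putnext() and frtn_t are existing
        STREAMS/DDI interfaces; map_page_into_kas(), page_wrap_free()
        and the page_wrap_t bookkeeping are invented for the
        illustration.

            #include <sys/types.h>
            #include <sys/stream.h>
            #include <sys/kmem.h>
            #include <sys/errno.h>
            #include <vm/page.h>

            /* invented per-transfer bookkeeping for the free routine */
            typedef struct page_wrap {
                    frtn_t  pw_frtn;
                    caddr_t pw_kaddr;
                    page_t  *pw_page;
            } page_wrap_t;

            extern caddr_t map_page_into_kas(page_t *);     /* invented */
            extern void    page_wrap_free(caddr_t);         /* invented: unmap, free */

            static int
            splice_page_to_stream(queue_t *wq, page_t *pp, size_t len)
            {
                    page_wrap_t *pw;
                    mblk_t      *mp;

                    pw = kmem_alloc(sizeof (*pw), KM_SLEEP);
                    pw->pw_page = pp;
                    pw->pw_kaddr = map_page_into_kas(pp);   /* no data copy */
                    pw->pw_frtn.free_func = page_wrap_free; /* runs at freemsg() time */
                    pw->pw_frtn.free_arg = (caddr_t)pw;

                    /* wrap the mapped page in an mblk without copying it */
                    mp = esballoc((uchar_t *)pw->pw_kaddr, len, BPRI_MED,
                        &pw->pw_frtn);
                    if (mp == NULL) {
                            page_wrap_free((caddr_t)pw);    /* unmap and free */
                            return (ENOMEM);
                    }

                    mp->b_wptr = mp->b_rptr + len;
                    putnext(wq, mp);        /* hand the message down the stream */
                    return (0);
            }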

        3. B -> A (network or STREAMS device to file)

        Unless the mblk has been made to carry physical memory, which
        can be easily consumed by a file vnode using VOP_PAGEIO, or
        page renaming followed by VOP_PUTPAGE, there is not much one
        can do other than what's already done in "Zero-Copy TCP".

        Note that one restriction above is that the network MTU has to
        be sufficiently large to cover a whole memory page.

        4. B <-> B (network to network or STREAMS devices)

        Since both the source and the sink use mblks, this is
        relatively easy. A STREAMS multiplexor will be added on top
        of both the source and the sink streams to serve as a conduit
        between them.

        Any processing required on the data can be made into kernel
        modules and pushed between the source and the sink. No data
        needs to enter the user address space.

        Saving: multiple user mappings and data copy operations.
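
        For illustration, the user-level setup for such a conduit could
        be as simple as linking both endpoints under a multiplexing
        driver. I_LINK is the standard STREAMS ioctl for this;
        /dev/splice_mux and the forwarding behavior of the driver
        behind it are hypothetical.

            #include <stropts.h>
            #include <fcntl.h>
            #include <unistd.h>

            /*
             * /dev/splice_mux is a hypothetical multiplexing driver that
             * simply forwards mblks between the streams linked below it.
             */
            static int
            splice_streams(int src_fd, int dst_fd)
            {
                    int mux_fd;

                    if ((mux_fd = open("/dev/splice_mux", O_RDWR)) < 0)
                            return (-1);

                    /* link both endpoints below the multiplexor */
                    if (ioctl(mux_fd, I_LINK, src_fd) < 0 ||
                        ioctl(mux_fd, I_LINK, dst_fd) < 0) {
                            (void) close(mux_fd);
                            return (-1);
                    }

                    /*
                     * From here on data flows between the two lower streams
                     * entirely in the kernel; the caller only keeps mux_fd
                     * open to hold the links in place.
                     */
                    return (mux_fd);
            }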

        5. C -> C (raw device to raw device)

        Allocate temporary physical memory pages to carry data. No
        user buffer or even kernel mapping is needed!

        Saving: no virtual memory mappings, no va-to-pp lookups.

        By default, splice() runs synchronously in the user context.
        This is relatively easy to implement since the switching
        of the context can closely mimic the equivalent read()
        followed by write() lock-step we currently have. Synchronous
        mode also gives the user more control. For example, the rate
        at which data is moved can be controlled easily.
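
        For reference, the read()-then-write() lock-step being mimicked
        looks like the loop below; a synchronous splice() would keep
        the same pacing and error handling but remove the round trip
        through the user buffer.

            #include <sys/types.h>
            #include <unistd.h>

            #define CHUNK   8192

            static int
            copy_loop(int src_fd, int dst_fd)
            {
                    char    buf[CHUNK];     /* the detour splice() eliminates */
                    ssize_t n;

                    while ((n = read(src_fd, buf, sizeof (buf))) > 0) {
                            if (write(dst_fd, buf, (size_t)n) != n)
                                    return (-1);    /* short write or error */
                    }
                    return (n < 0 ? -1 : 0);
            }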

        Since data does not enter the user address space, constantly
        switching in and out of the user context is a waste of precious
        CPU cycles. In the ideal, simplistic streaming case, control
        only switches between the two I/O completion threads, one for
        the source and one for the sink. I.e. the read completion thread
        will schedule the write, and the write completion thread will
        kick-start the next read. But this may put too much work on
        interrupt threads. Also the amount of change needed in the
        existing I/O subsystems to support full streaming may be
        substantial. More design work has to be done here to figure out
        a good balance between performance and complexity.
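
        One way to picture the fully streaming case: two completion
        handlers keep kicking each other, so no user context switch is
        needed between chunks. splice_ctx_t, splice_start_read() and
        splice_start_write() are invented names for whatever
        asynchronous primitives the final design would use.

            /* invented per-splice context and async primitives */
            typedef struct splice_ctx splice_ctx_t;

            extern void splice_start_read(splice_ctx_t *);   /* invented */
            extern void splice_start_write(splice_ctx_t *);  /* invented */

            /* runs in the source's I/O completion thread */
            static void
            splice_read_done(splice_ctx_t *ctx)
            {
                    splice_start_write(ctx);        /* schedule the write ... */
            }

            /* runs in the sink's I/O completion thread */
            static void
            splice_write_done(splice_ctx_t *ctx)
            {
                    splice_start_read(ctx);         /* ... which kick-starts the next read */
            }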


5. Reference Documents

        Zero Copy TCP   - PSARC case number 950407
        as_pagelock     - PSARC case number 951205
        UFS direct I/O  - ??
        fbufs           - PSARC case number 951103

        "Exploiting In-kernel Data Paths to Improve I/O Throughput and
        CPU Availability", 1993 Winter USENIX

        "Zero-Copy TCP in Solaris", 1996 USENIX

        "An Analysis of Process and Memory Models to Support High-Speed
        Networking in a UNIX Environment", 1996 USENIX

        "Container Shipping - Operating System Support for I/O-
        Intensive Applications", 1994 Computer

        TransmitFile() man page from Winsock2, WIN32 SDK for Windows NT



-- 

                                                K. Poon.
                                                [EMAIL PROTECTED]
