On Fri, Sep 04, 2015 at 04:17:03PM -0400, Anna Schumaker wrote:
> copy_file_range() is a new system call for copying ranges of data
> completely in the kernel.  This gives filesystems an opportunity to
> implement some kind of "copy acceleration", such as reflinks or
> server-side-copy (in the case of NFS).
> 
> Signed-off-by: Anna Schumaker <anna.schuma...@netapp.com>
> ---
>  man2/copy_file_range.2 | 168 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 168 insertions(+)
>  create mode 100644 man2/copy_file_range.2
> 
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> new file mode 100644
> index 0000000..4a4cb73
> --- /dev/null
> +++ b/man2/copy_file_range.2
> @@ -0,0 +1,168 @@
> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <anna.schuma...@netapp.com>
> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +copy_file_range \- Copy a range of data from one file to another
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/copy.h>
> +.B #include <sys/syscall.h>
> +.B #include <unistd.h>
> +
> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
> +.BI "                int " fd_out ", loff_t * " off_out ", size_t " len ",
> +.BI "                unsigned int " flags );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR copy_file_range ()
> +system call performs an in-kernel copy between two file descriptors
> +without all that tedious mucking about in userspace.

;)

> +It copies up to
> +.I len
> +bytes of data from file descriptor
> +.I fd_in
> +to file descriptor
> +.I fd_out
> +at
> +.IR off_out .
> +The file descriptors must not refer to the same file.

Why?  btrfs (and XFS) reflink can handle the case of a file sharing blocks
with itself.

> +
> +The following semantics apply for
> +.IR fd_in ,
> +and similar statements apply to
> +.IR off_out :
> +.IP * 3
> +If
> +.I off_in
> +is NULL, then bytes are read from
> +.I fd_in
> +starting from the current file offset and the current
> +file offset is adjusted appropriately.
> +.IP *
> +If
> +.I off_in
> +is not NULL, then
> +.I off_in
> +must point to a buffer that specifies the starting
> +offset where bytes from
> +.I fd_in
> +will be read.  The current file offset of
> +.I fd_in
> +is not changed, but
> +.I off_in
> +is adjusted appropriately.
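
A minimal sketch of the two cases described above, in case it helps readers
of the page. The helper names copy_at() and copy_at_current() are
hypothetical and not part of the patch; the sketch assumes the syscall number
is exposed as __NR_copy_file_range, as in the SYNOPSIS, and passes flags as 0.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy with explicit offsets.  Per the description,
 * the descriptors' own file offsets are left alone and *pos_in / *pos_out
 * are advanced by the number of bytes copied.
 */
ssize_t copy_at(int fd_in, loff_t *pos_in,
                int fd_out, loff_t *pos_out, size_t len)
{
    return syscall(__NR_copy_file_range, fd_in, pos_in,
                   fd_out, pos_out, len, 0u);
}

/*
 * Hypothetical helper: pass NULL for both offsets, so the kernel reads and
 * writes at the current file offsets and advances them itself.
 */
ssize_t copy_at_current(int fd_in, int fd_out, size_t len)
{
    return syscall(__NR_copy_file_range, fd_in, NULL, fd_out, NULL, len, 0u);
}
```

With copy_at(), a caller can copy from several regions of fd_in without
calling lseek() in between; with copy_at_current(), repeated calls simply walk
through both files.
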
> +.PP
> +The default behavior of
> +.BR copy_file_range ()
> +is filesystem specific, and might result in creating a
> +copy-on-write reflink.
> +In the event that a given filesystem does not implement
> +any form of copy acceleration, the kernel will perform
> +a deep copy of the requested range by reading bytes from

I wonder if it's wise to allow deep copies -- what happens if len == 1T?
Will this syscall just block for a really long time?
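
Whatever the answer turns out to be, a caller can at least bound how much it
asks for in a single call. The following is only a sketch under assumptions:
next_chunk() and copy_bounded() are hypothetical names, flags is passed as 0,
and it says nothing about how long the kernel takes to service one chunk.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: size of the next request, capped at max_chunk. */
size_t next_chunk(size_t remaining, size_t max_chunk)
{
    return remaining < max_chunk ? remaining : max_chunk;
}

/*
 * Hypothetical helper: copy up to len bytes at the current file offsets,
 * never asking the kernel for more than max_chunk bytes in one call.
 * Returns the number of bytes copied (possibly short, like write()),
 * or -1 with errno set if nothing was copied before an error.
 */
ssize_t copy_bounded(int fd_in, int fd_out, size_t len, size_t max_chunk)
{
    size_t done = 0;

    while (done < len) {
        ssize_t ret = syscall(__NR_copy_file_range, fd_in, NULL,
                              fd_out, NULL,
                              next_chunk(len - done, max_chunk), 0u);
        if (ret == -1)
            return done ? (ssize_t)done : -1;
        if (ret == 0)   /* no progress: stop rather than spin */
            break;
        done += (size_t)ret;
        /* Control is back in userspace here: a caller could check a
         * cancellation flag or report progress before the next chunk. */
    }
    return (ssize_t)done;
}
```

Something like copy_bounded(fd_in, fd_out, len, 64 << 20) would keep each
request to 64 MiB; the chunk size is an arbitrary illustration, not a
recommendation.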

> +.I fd_in
> +and writing them to
> +.IR fd_out .

"...if COPY_REFLINK is not set in flags."

> +
> +Currently, Linux only supports the following flag:
> +.TP 1.9i
> +.B COPY_REFLINK
> +Only perform the copy if the filesystem can do it as a reflink.
> +Do not fall back on performing a deep copy.
> +.SH RETURN VALUE
> +Upon successful completion,
> +.BR copy_file_range ()
> +will return the number of bytes copied between files.
> +This could be less than the length originally requested.
> +
> +On error,
> +.BR copy_file_range ()
> +returns \-1 and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP
> +.B EBADF
> +One or more file descriptors are not valid,
> +or do not have proper read-write mode.

"or fd_out is not opened for writing"?

> +.TP
> +.B EINVAL
> +Requested range extends beyond the end of the file;
> +.I flags
> +argument is set to an invalid value.
> +.TP
> +.B EOPNOTSUPP
> +.B COPY_REFLINK
> +was specified in
> +.IR flags ,
> +but the target filesystem does not support reflinks.
> +.TP
> +.B EXDEV
> +Target filesystem doesn't support cross-filesystem copies.
> +.SH VERSIONS

Perhaps this ought to list a few more errors (EIO, ENOSPC, ENOSYS, EPERM...)
that can be returned?  (I was looking at the fallocate manpage.)
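
To illustrate why the list matters to callers, here is a rough sketch of how
an application might sort these errors into "retry with a plain read()/write()
loop" versus "report and give up". can_fall_back() is a hypothetical name and
the policy is only one possible choice; in particular, a caller that set
COPY_REFLINK on purpose may not want a fallback on EOPNOTSUPP.

```c
#include <errno.h>

/*
 * Hypothetical policy: return 1 if the caller could reasonably retry with
 * a userspace read()/write() loop, 0 if the error should be reported as is.
 */
int can_fall_back(int err)
{
    switch (err) {
    case ENOSYS:      /* kernel does not provide the syscall */
    case EOPNOTSUPP:  /* COPY_REFLINK requested, filesystem cannot reflink */
    case EXDEV:       /* cross-filesystem copy not supported */
        return 1;
    default:          /* EBADF, EINVAL, EIO, ENOSPC, EPERM, ... */
        return 0;
    }
}
```

A caller would check can_fall_back(errno) right after a -1 return, before any
other call that might overwrite errno.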

--D

> +The
> +.BR copy_file_range ()
> +system call first appeared in Linux 4.3.
> +.SH CONFORMING TO
> +The
> +.BR copy_file_range ()
> +system call is a nonstandard Linux extension.
> +.SH EXAMPLE
> +.nf
> +
> +#define _GNU_SOURCE
> +#include <fcntl.h>
> +#include <linux/copy.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/stat.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +
> +int main(int argc, char **argv)
> +{
> +    int fd_in, fd_out;
> +    struct stat stat;
> +    loff_t len, ret;
> +
> +    if (argc != 3) {
> +        fprintf(stderr, "Usage: %s <pathname> <pathname>\n", argv[0]);
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    fd_in = open(argv[1], O_RDONLY);
> +    if (fd_in == -1) {
> +        perror("open (argv[1])");
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    if (fstat(fd_in, &stat) == -1) {
> +        perror("fstat");
> +        exit(EXIT_FAILURE);
> +    }
> +    len = stat.st_size;
> +
> +    fd_out = open(argv[2], O_WRONLY | O_CREAT, 0644);
> +    if (fd_out == -1) {
> +        perror("open (argv[2])");
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    do {
> +        ret = syscall(__NR_copy_file_range, fd_in, NULL,
> +                      fd_out, NULL, len, 0);
> +        if (ret == -1) {
> +            perror("copy_file_range");
> +            exit(EXIT_FAILURE);
> +        }
> +
> +        len -= ret;
> +    } while (len > 0);
> +
> +    close(fd_in);
> +    close(fd_out);
> +    exit(EXIT_SUCCESS);
> +}
> +.fi
> +.SH SEE ALSO
> +.BR splice (2)
> -- 
> 2.5.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
