Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-14 Thread Zach Brown
On Tue, Apr 14, 2015 at 12:23:25PM -0700, Christoph Hellwig wrote:
> On Tue, Apr 14, 2015 at 11:54:08AM -0700, Zach Brown wrote:
> > Is this relying on btrfs range cloning being atomic?  It certainly
> > doesn't look atomic.  It can modify items across an arbitrarily large
> > number of leaf blocks.  It can make the changes across multiple
> > transactions which could introduce partial modification on reboot after
> > crashes.  It can fail (the dynamic duo: enomem, eio) and leave the
> > desintation partially modified.
> 
> I didn't mean atomic in the failure atomic sense, but in the sense of
> being atomic vs other writes, similar to how Posix specifies it for
> writes vs other writes.  Guess I need to express this intent better.

Ah, right, OK.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-14 Thread Zach Brown
On Tue, Apr 14, 2015 at 02:29:06PM -0400, J. Bruce Fields wrote:
> On Tue, Apr 14, 2015 at 11:22:41AM -0700, Zach Brown wrote:
> > On Tue, Apr 14, 2015 at 02:19:11PM -0400, J. Bruce Fields wrote:
> > > On Tue, Apr 14, 2015 at 01:16:13PM -0400, Anna Schumaker wrote:
> > > > On 04/14/2015 12:53 PM, Christoph Hellwig wrote:
> > > > > On Sat, Apr 11, 2015 at 09:04:02AM -0400, Jeff Layton wrote:
> > > > >> Yuck! How the heck do you clean up the mess if that happens? I
> > > > >> guess you're just stuck redoing the copy with normal READ/WRITE?
> > > > >>
> > > > >> Maybe we need to have the interface return a hard error in that
> > > > >> case and not try to give back any sort of offset?
> > > > > 
> > > > > The NFSv4.2 COPY interface is a train wreck.  At least for Linux I'd
> > > > > expect us to simply ignore it and only implement my new CLONE
> > > > > operation with sane semantics.  That is unless someone can show some
> > > > > real life use case for the inter server copy, in which case we'll
> > > > > have to deal with that mess.  But getting that one right at the VFS
> > > > > level will be a nightmare anyway.
> > > > > 
> > > > > Make this a vote from me to not support partial copies and just
> > > > > return and error in that case.
> > > > 
> > > > Agreed.  Looking at the v4.2 spec, COPY does take ca_consecutive and a
> > > > ca_synchronous flags that let the client state if the copy should be
> > > > done consecutively or synchronously.  I expected to always set
> > > > consecutive to "true" for the Linux client.
> > > 
> > > That's supposed to mean results are well-defined in the partial-copy
> > > case, but I think Christoph's suggesting eliminating the partial-copy
> > > case entirely?
> > > 
> > > Which would be fine with me.
> > > 
> > > It might actually have been me advocating for partial copies.  But that
> > > was only because a partial-copy-handling-loop seemed simpler to me than
> > > progress callbacks if we were going to support long-running copies.
> > > 
> > > I'm happy enough not to have it at all.
> > 
> > Ah, OK, that's great news.
> > 
> > I thought at one point we were worried about very long running RPCs on
> > the server.  Are we not worried about that now?
> > 
> > Is the client expected to cut the work up into arbitrarily managable
> > chunks?  Is the server expected to fail COPY/CLONE requests that it
> > thinks would take way too long?  Something else?
> 
> Christoph is proposing a CLONE rpc that's required to be atomic:
> 
>   
> https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-35#section-15.13
>   "The CLONE operation is atomic, that is either all changes or no
>   changes are seen by the client or other clients."
> 
> So that couldn't be really long-running (or the server is nuts).
> 
> So that'd mean Anna would rip out the server-side copy loop and we'd
> initially just support btrfs or whatever.

Is this relying on btrfs range cloning being atomic?  It certainly
doesn't look atomic.  It can modify items across an arbitrarily large
number of leaf blocks.  It can make the changes across multiple
transactions which could introduce partial modification on reboot after
crashes.  It can fail (the dynamic duo: enomem, eio) and leave the
desintation partially modified.

> I mean the server-side copy loop may also be useful but I'm all for
> wiring up the obvious case first.

Sure, I'm all for wiring up the simple version that doesn't return
partial progress.  If that'll work for you guys.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-14 Thread Zach Brown
On Tue, Apr 14, 2015 at 02:19:11PM -0400, J. Bruce Fields wrote:
> On Tue, Apr 14, 2015 at 01:16:13PM -0400, Anna Schumaker wrote:
> > On 04/14/2015 12:53 PM, Christoph Hellwig wrote:
> > > On Sat, Apr 11, 2015 at 09:04:02AM -0400, Jeff Layton wrote:
> > >> Yuck! How the heck do you clean up the mess if that happens? I
> > >> guess you're just stuck redoing the copy with normal READ/WRITE?
> > >>
> > >> Maybe we need to have the interface return a hard error in that
> > >> case and not try to give back any sort of offset?
> > > 
> > > The NFSv4.2 COPY interface is a train wreck.  At least for Linux I'd
> > > expect us to simply ignore it and only implement my new CLONE
> > > operation with sane semantics.  That is unless someone can show some
> > > real life use case for the inter server copy, in which case we'll
> > > have to deal with that mess.  But getting that one right at the VFS
> > > level will be a nightmare anyway.
> > > 
> > > Make this a vote from me to not support partial copies and just
> > > return and error in that case.
> > 
> > Agreed.  Looking at the v4.2 spec, COPY does take ca_consecutive and a
> > ca_synchronous flags that let the client state if the copy should be
> > done consecutively or synchronously.  I expected to always set
> > consecutive to "true" for the Linux client.
> 
> That's supposed to mean results are well-defined in the partial-copy
> case, but I think Christoph's suggesting eliminating the partial-copy
> case entirely?
> 
> Which would be fine with me.
> 
> It might actually have been me advocating for partial copies.  But that
> was only because a partial-copy-handling-loop seemed simpler to me than
> progress callbacks if we were going to support long-running copies.
> 
> I'm happy enough not to have it at all.

Ah, OK, that's great news.

I thought at one point we were worried about very long running RPCs on
the server.  Are we not worried about that now?

Is the client expected to cut the work up into arbitrarily managable
chunks?  Is the server expected to fail COPY/CLONE requests that it
thinks would take way too long?  Something else?

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-13 Thread Zach Brown
> > >> Could we perhaps instead of a length, define a 'pos_in_start' and a
> > >> 'pos_in_end' offset (with the latter being -1 for a full-file copy)
> > >> and then return an 'loff_t' value stating where the copy ended?
> > >
> > > Well, the resulting offset will be set if the caller provided it.  So
> > > they could already be getting the copied length from that.  But they
> > > might not specify the offsets.  Maybe they're just using the results to
> > > total up a completion indicator.
> > >
> > > Maybe we could make the length a pointer like the offsets that's set to
> > > the copied length on return.
> > 
> > That works, but why do we care so much about the difference between a
> > length and an offset as a return value?
> > 
> 
> I think it just comes down to potential confusion for users. What's
> more useful, the number of bytes actually copied, or the offset into the
> file where the copy ended?
> 
> I tend to the think an offset is more useful for someone trying to
> copy a file in chunks, particularly if the file is sparse. That gives
> them a clear place to continue the copy.
> 
> So, I think I agree with Trond that phrasing this interface in terms of
> file offsets seems like it might be more useful. That also neatly
> sidesteps the size_t limitations on 32-bit platforms.

Yeah, fair enough.  I'll rework it.

> > To be fair, the NFS copy offload also allows the copy to proceed out
> > of order, in which case the range of copied data could be
> > non-contiguous in the case of a failure. However neither the length
> > nor the offset case will give you the full story in that case. Any
> > return value can at best be considered to define an offset range whose
> > contents need to be checked for success/failure.
> > 
> 
> Yuck! How the heck do you clean up the mess if that happens? I guess
> you're just stuck redoing the copy with normal READ/WRITE?

I don't think anyone will worry about checking file contents.

Yes, technically you can get fragmented completion past the initial
contiguous region that the interface told you is done.   You can get
that with O_DIRECT today.

But it's a rare case that is not worth worrying about.  You'll retry at
the contiguous offset until it doesn't make progress and then fall back
to read/write.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-10 Thread Zach Brown
On Fri, Apr 10, 2015 at 06:36:41PM -0400, Trond Myklebust wrote:
> On Fri, Apr 10, 2015 at 6:00 PM, Zach Brown  wrote:

> > +
> > +/*
> > + * copy_file_range() differs from regular file read and write in that it
> > + * specifically allows return partial success.  When it does so is up to
> > + * the copy_file_range method.
> > + */
> > +ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > +   struct file *file_out, loff_t pos_out,
> > +   size_t len, int flags)
> 
> I'm going to repeat a gripe with this interface. I really don't think
> we should treat copy_file_range() as taking a size_t length, since
> that is not sufficient to do a full file copy on 32-bit systems w/ LFS
> support.

*nod*.  The length type is limited by the syscall return type and the
arbitrary desire to mimic read/write.

I sympathize with wanting to copy giant files with operations that don't
scale with file size because files can be enormous but sparse.

> Could we perhaps instead of a length, define a 'pos_in_start' and a
> 'pos_in_end' offset (with the latter being -1 for a full-file copy)
> and then return an 'loff_t' value stating where the copy ended?

Well, the resulting offset will be set if the caller provided it.  So
they could already be getting the copied length from that.  But they
might not specify the offsets.  Maybe they're just using the results to
total up a completion indicator.

Maybe we could make the length a pointer like the offsets that's set to
the copied length on return.

This all seems pretty gross.  Does anyone else have a vote?

(And I'll argue strongly against creating magical offset values that
change behaviour.  If we want to ignore arguments and get the length
from the source file we'd add a flag to do so.)

> Note that both btrfs and NFSv4.2 allow for 64-bit lengths, so this
> interface would be closer to what is already in use anyway.

Yeah, btrfs doesn't allow partial progress.  It returns 0 on success.
We could also do that but people have expressed an interest in returning
partial progress.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC 3/3] btrfs: add .copy_file_range file operation

2015-04-10 Thread Zach Brown
This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown 
---
 fs/btrfs/ctree.h |  3 ++
 fs/btrfs/file.c  |  1 +
 fs/btrfs/ioctl.c | 91 
 3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f9c89ca..f7cfa26 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3958,6 +3958,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct 
inode *inode,
  loff_t pos, size_t write_bytes,
  struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 30982bb..49989899 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2820,6 +2820,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = btrfs_ioctl,
 #endif
+   .copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 74609b9..0eb008e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3537,17 +3537,16 @@ out:
return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-  u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+   u64 off, u64 olen, u64 destoff)
 {
struct inode *inode = file_inode(file);
+   struct inode *src = file_inode(file_src);
struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct fd src_file;
-   struct inode *src;
int ret;
u64 len = olen;
u64 bs = root->fs_info->sb->s_blocksize;
-   int same_inode = 0;
+   int same_inode = src == inode;
 
/*
 * TODO:
@@ -3560,49 +3559,20 @@ static noinline long btrfs_ioctl_clone(struct file 
*file, unsigned long srcfd,
 *   be either compressed or non-compressed.
 */
 
-   /* the destination must be opened for writing */
-   if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-   return -EINVAL;
-
if (btrfs_root_readonly(root))
return -EROFS;
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   return ret;
-
-   src_file = fdget(srcfd);
-   if (!src_file.file) {
-   ret = -EBADF;
-   goto out_drop_write;
-   }
-
-   ret = -EXDEV;
-   if (src_file.file->f_path.mnt != file->f_path.mnt)
-   goto out_fput;
-
-   src = file_inode(src_file.file);
-
-   ret = -EINVAL;
-   if (src == inode)
-   same_inode = 1;
-
-   /* the src must be open for reading */
-   if (!(src_file.file->f_mode & FMODE_READ))
-   goto out_fput;
+   if (file_src->f_path.mnt != file->f_path.mnt ||
+   src->i_sb != inode->i_sb)
+   return -EXDEV;
 
/* don't make the dst file partly checksummed */
if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-   goto out_fput;
+   return -EINVAL;
 
-   ret = -EISDIR;
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-   goto out_fput;
-
-   ret = -EXDEV;
-   if (src->i_sb != inode->i_sb)
-   goto out_fput;
+   return -EISDIR;
 
if (!same_inode) {
if (inode < src) {
@@ -3690,6 +3660,49 @@ out_unlock:
} else {
mutex_unlock(&src->i_mutex);
}
+   return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags)
+{
+   ssize_t ret;
+
+   ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+   if (ret == 0)
+   ret = len;
+   return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+  u64 off, u64 olen, u64 destoff)
+{
+   struct fd src_file;
+   int ret;
+
+   /* the destination must be opened for writing */
+   if (!(file->f_mode & FMODE_WRITE) || (

[PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-10 Thread Zach Brown
Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown 
---
 fs/read_write.c   | 129 ++
 include/linux/fs.h|   3 +
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c   |   1 +
 4 files changed, 136 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 8e1b687..c65ce1d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #include 
@@ -1424,3 +1425,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, 
in_fd,
return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   size_t len, int flags)
+{
+   struct inode *inode_in;
+   struct inode *inode_out;
+   ssize_t ret;
+
+   if (flags)
+   return -EINVAL;
+
+   if (len == 0)
+   return 0;
+
+   /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+   ret = rw_verify_area(READ, file_in, &pos_in, len);
+   if (ret >= 0)
+   ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+   if (ret < 0)
+   return ret;
+
+   if (!(file_in->f_mode & FMODE_READ) ||
+   !(file_out->f_mode & FMODE_WRITE) ||
+   (file_out->f_flags & O_APPEND) ||
+   !file_in->f_op || !file_in->f_op->copy_file_range)
+   return -EINVAL;
+
+   inode_in = file_inode(file_in);
+   inode_out = file_inode(file_out);
+
+   /* make sure offsets don't wrap and the input is inside i_size */
+   if (pos_in + len < pos_in || pos_out + len < pos_out ||
+   pos_in + len > i_size_read(inode_in))
+   return -EINVAL;
+
+   /* this could be relaxed once a method supports cross-fs copies */
+   if (inode_in->i_sb != inode_out->i_sb ||
+   file_in->f_path.mnt != file_out->f_path.mnt)
+   return -EXDEV;
+
+   /* forbid ranges in the same file */
+   if (inode_in == inode_out)
+   return -EINVAL;
+
+   ret = mnt_want_write_file(file_out);
+   if (ret)
+   return ret;
+
+   ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+len, flags);
+   if (ret > 0) {
+   fsnotify_access(file_in);
+   add_rchar(current, ret);
+   fsnotify_modify(file_out);
+   add_wchar(current, ret);
+   }
+   inc_syscr(current);
+   inc_syscw(current);
+
+   mnt_drop_write_file(file_out);
+
+   return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+   int, fd_out, loff_t __user *, off_out,
+   size_t, len, unsigned int, flags)
+{
+   loff_t pos_in;
+   loff_t pos_out;
+   struct fd f_in;
+   struct fd f_out;
+   ssize_t ret;
+
+   f_in = fdget(fd_in);
+   f_out = fdget(fd_out);
+   if (!f_in.file || !f_out.file) {
+   ret = -EBADF;
+   goto out;
+   }
+
+   ret = -EFAULT;
+   if (off_in) {
+   if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+   goto out;
+   } else {
+   pos_in = f_in.file->f_pos;
+   }
+
+   if (off_out) {
+   if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+   goto out;
+   } else {
+   pos_out = f_out.file->f_pos;
+   }
+
+   ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+ flags);
+   if (ret > 0) {
+  

[PATCH RFC 0/3] simple copy offloading system call

2015-04-10 Thread Zach Brown
Hello everyone!

Here's my current attempt at the most basic system call interface for
offloading copying between files.  The system call and vfs function
are relatively light wrappers around the file_operation method that
does the heavy lifting.

There was interest at LSF in getting the basic infrastructure merged
before worrying about adding behavioural flags and more complicated
implementations.  This series only offers a refactoring of the btrfs
clone ioctl as an example of an implementation of the file
copy_file_range method.

I've added support for copy_file_range() to xfs_io in xfsprogs and
have the start of an xfstest that tests the system call.  I'll send
those to fstests@.

So how does this look?

Do we want to merge this and let the NFS and block XCOPY patches add
their changes when they're ready?

- z

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC 2/3] x86: add sys_copy_file_range to syscall tables

2015-04-10 Thread Zach Brown
Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown 
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..88d0025 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356i386memfd_createsys_memfd_create
 357i386bpf sys_bpf
 358i386execveatsys_execveat
stub32_execveat
+359i386copy_file_range sys_copy_file_range
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..81802c5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320common  kexec_file_load sys_kexec_file_load
 321common  bpf sys_bpf
 32264  execveatstub_execveat
+323common  copy_file_range sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] target/file: fix inclusive vfs_fsync_range() end

2014-10-07 Thread Zach Brown
On Mon, Oct 06, 2014 at 11:39:45PM -0700, Christoph Hellwig wrote:
> On Mon, Oct 06, 2014 at 04:40:13PM -0700, Zach Brown wrote:
> > Both of the file target's calls to vfs_fsync_range() got the end offset
> > off by one.  The range is inclusive, not exclusive.  It would sync a bit
> > more data than was required.
> > 
> > The sync path already tested the length of the range and fell back to
> > LLONG_MAX so I copied that pattern in the rw path.
> > 
> > This is untested. I found the errors by inspection while following other
> > code.
> 
> Maybe it's time to move vfs_fsync_range to a more normal calling
> convention?

Yeah, I wanted to just fix the bugs before going too far.  

The current interface does seem like a bad fit.  3 of the 5 non-core
callers got it wrong.  They all generate start,end from off,count and
fall back to LLONG_MAX.

I'll put something together if no one beats me to it.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] target/file: fix inclusive vfs_fsync_range() end

2014-10-06 Thread Zach Brown
Both of the file target's calls to vfs_fsync_range() got the end offset
off by one.  The range is inclusive, not exclusive.  It would sync a bit
more data than was required.

The sync path already tested the length of the range and fell back to
LLONG_MAX so I copied that pattern in the rw path.

This is untested. I found the errors by inspection while following other
code.

Signed-off-by: Zach Brown 
---
 drivers/target/target_core_file.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/target/target_core_file.c 
b/drivers/target/target_core_file.c
index 7d6cdda..176588f 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -415,7 +415,7 @@ fd_execute_sync_cache(struct se_cmd *cmd)
} else {
start = cmd->t_task_lba * dev->dev_attrib.block_size;
if (cmd->data_length)
-   end = start + cmd->data_length;
+   end = start + cmd->data_length - 1;
else
end = LLONG_MAX;
}
@@ -680,7 +680,12 @@ fd_execute_rw(struct se_cmd *cmd, struct scatterlist *sgl, 
u32 sgl_nents,
struct fd_dev *fd_dev = FD_DEV(dev);
loff_t start = cmd->t_task_lba *
dev->dev_attrib.block_size;
-   loff_t end = start + cmd->data_length;
+   loff_t end;
+
+   if (cmd->data_length)
+   end = start + cmd->data_length - 1;
+   else
+   end = LLONG_MAX;
 
vfs_fsync_range(fd_dev->fd_file, start, end, 1);
}
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/6] io: define an interface for IO extensions

2014-04-02 Thread Zach Brown
> > I'd just remove this generic teardown callback path entirely.  If
> > there's PI state hanging off the iocb tear it down during iocb teardown.
> 
> Hmm, I thought aio_complete /was/ iocb teardown time.

Well, usually :).  If you build up before aio_run_iocb() then you nead
to teardown in kiocb_free(), which is also called by aio_complete().

> > (Isn't there some allocate-and-copy-from-userspace helper now? But..)
> 
>  Is there?  I didn't find one when I looked, but it wasn't an 
> exhaustive
> search.

I could have sworn that I saw something.. ah, right, memdup_user().

> > I don't like the rudundancy of the implicit size requirement by a
> > field's flag being set being duplicated by the explicit size argument.
> > What does that give us, exactly?
> 
> Either another sanity check or another way to screw up, depending on how you
> look at it.  I'd been considering shortening the size field to u32 and adding 
> a
> magic number field, but I wonder if that's really necessary.  Seems like it
> shouldn't be -- if userland screws up, it's not hard to kill the process.
> (Or segv it, or...)

I don't think I'd bother.  The bits should be enough and are already
necessary to have explicit indicators of fields being set.

> > Fields in the iocb  As each of these are initialized I'd just
> > test the presence bits and __get_user() the userspace arguemnts
> > directly, or copy_from_user() something slightly more complicated on to
> > the stack.
> >
> > That gets rid of us having to care about the size at all.  It stops us
> > from allocating a kernel copy and pinning it for the duration of the IO.
> > We'd just be sampling the present userspace arguments as we initialie
> > the iocb during submission.
> 
> I like this idea.  For the PI extension, nothing particularly error-prone
> happens in teardown, which allows the flexibility to copy_from_user any
> arguments required, and to copy_to_user any setup errors that happen.  I can
> get rid a lot of allocate-and-copy nonsense, as you point out.
> 
> Ok, I'll migrate my patches towards this strategy, and let's see how much code
> goes away. :)

Cool :).

> I've also noticed a bug where if you make one of these PI-extended calls on a
> file living on a filesystem, it'll extend the io request's range to be
> filesystem block-aligned, which causes all kinds of havoc with the user
> provided PI buffers, since they now need to be extended to fit the added
> blocks.  Alternately, one could require PI IOs to be fs-block aligned when
> dealing with regular files. 

I think, like O_DIRECT, it just has to be aligned or fail :(.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/6] aio/dio: enable PI passthrough

2014-04-02 Thread Zach Brown
> One thing I'm not sure about: What's the largest IO (in terms of # of blocks,
> not # of struct iovecs) that I can throw at the kernel?

Yeah, dunno.  I'd guess big :).  I'd hope that the PI code already has a
way to clamp the size of bios if there's a limit to the size of PI data
that can be managed downstream?

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO

2014-04-02 Thread Zach Brown
> 's a small number of them with strong semantics because they're a part
> of the syscall ABI.

("There's" a small number of them..  vim troubles :))

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/6] aio/dio: enable PI passthrough

2014-04-02 Thread Zach Brown
> +static int setup_pi_ext(struct kiocb *req, int is_write)
> +{
> + struct file *file = req->ki_filp;
> + struct io_extension *ext = &req->ki_ioext->ke_kern;
> + void *p;
> + unsigned long start, end;
> + int retval;
> +
> + if (!(file->f_flags & O_DIRECT)) {
> + pr_debug("EINVAL: can't use PI without O_DIRECT.\n");
> + return -EINVAL;
> + }
> +
> + BUG_ON(req->ki_ioext->ke_pi_iter.pi_userpages);
> +
> + end = (((unsigned long)ext->ie_pi_buf) + ext->ie_pi_buflen +
> + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + start = ((unsigned long)ext->ie_pi_buf) >> PAGE_SHIFT;
> + req->ki_ioext->ke_pi_iter.pi_offset = offset_in_page(ext->ie_pi_buf);
> + req->ki_ioext->ke_pi_iter.pi_len = ext->ie_pi_buflen;
> + req->ki_ioext->ke_pi_iter.pi_nrpages = end - start;
> + p = kzalloc(req->ki_ioext->ke_pi_iter.pi_nrpages *
> + sizeof(struct page *),
> + GFP_NOIO);

Can userspace give us bad data and get us to generate insane allcation
attempt warnings?

> + if (p == NULL) {
> + pr_err("%s: no room for page array?\n", __func__);
> + return -ENOMEM;
> + }
> + req->ki_ioext->ke_pi_iter.pi_userpages = p;
> +
> + retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf,
> +  req->ki_ioext->ke_pi_iter.pi_nrpages,
> +  is_write,

Isn't this is_write backwards?  If it's a write syscall then the PI
pages is going to be read from.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/6] io: define an interface for IO extensions

2014-04-02 Thread Zach Brown
> @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long 
> res2)
>   struct io_event *ev_page, *event;
>   unsigned long   flags;
>   unsigned tail, pos;
> + int ret;
> +
> + ret = io_teardown_extensions(iocb);
> + if (ret) {
> + if (!res)
> + res = ret;
> + else if (!res2)
> + res2 = ret;
> + else
> + pr_err("error %d tearing down aio extensions\n", ret);
> + }

This ends up trying to copy the kernel's io_extension copy back to
userspace from interrupts, which obviously won't fly.

And to what end?  So that maybe someone can later add an 'extension'
that can fill in some field that's then copied to userspace?  But by
copying the entire argument struct back?

Let's not get ahead of ourselves.  If they're going to try and give
userspace some feedback after IO completion they're going to have to try
a lot harder because they don't have acces to the submitting task
context anymore.  They'd have to pin some reference to a feedback
mechanism in the in-flight io.  I think we'd want that explicit in the
iocb, not hiding off on the other side of this extension interface.

I'd just remove this generic teardown callback path entirely.  If
there's PI state hanging off the iocb tear it down during iocb teardown.

> +struct io_extension_type {
> + unsigned int type;
> + unsigned int extension_struct_size;
> + int (*setup_fn)(struct kiocb *, int is_write);
> + int (*destroy_fn)(struct kiocb *);
> +};

I'd also get rid of all of this.  More below.

> +static int io_setup_extensions(struct kiocb *req, int is_write,
> +struct io_extension __user *ioext)
> +{
> + struct io_extension_type *iet;
> + __u64 sz, has;
> + int ret;
> +
> + /* Check size of buffer */
> + if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz
> + return -EFAULT;
> + if (sz > PAGE_SIZE ||
> + sz > sizeof(struct io_extension) ||
> + sz < IO_EXT_SIZE(ie_has))
> + return -EINVAL;
> +
> + /* Check that the buffer's big enough */
> + if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has
> + return -EFAULT;
> + ret = io_check_bufsize(has, sz);
> + if (ret)
> + return ret;
> +
> + /* Copy from userland */
> + req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO);
> + if (!req->ki_ioext)
> + return -ENOMEM;
> +
> + req->ki_ioext->ke_user = ioext;
> + if (unlikely(copy_from_user(&req->ki_ioext->ke_kern, ioext, sz))) {
> + ret = -EFAULT;
> + goto out;
> + }

(Isn't there some allocate-and-copy-from-userspace helper now? But..)

I don't like the rudundancy of the implicit size requirement by a
field's flag being set being duplicated by the explicit size argument.
What does that give us, exactly?

Our notion of the total size only seems to only matter if we're copying
the entire struct from userspace and I'm don't think we need to do that.

For each argument, we're translating it into some kernel equivalent,
right?  Fields in the iocb  As each of these are initialized I'd just
test the presence bits and __get_user() the userspace arguemnts
directly, or copy_from_user() something slightly more complicated on to
the stack.

That gets rid of us having to care about the size at all.  It stops us
from allocating a kernel copy and pinning it for the duration of the IO.
We'd just be sampling the present userspace arguments as we initialie
the iocb during submission.

> + /* Try to initialize all the extensions */
> + has = 0;
> + for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
> + if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
> + continue;
> + ret = iet->setup_fn(req, is_write);
> + if (ret) {
> + req->ki_ioext->ke_kern.ie_has = has;
> + goto out_destroy;
> + }
> + has |= iet->type;
> + }

So instead of doing all this we'd test explicit bits and act
accordingly.  If they're trivial translations between userspace fields
and iocb fields we could just do it inline in this helper that'd be more
like iocb_parse_more_args(iocb, struct __user *ptr).  For more
complicated stuff, like the PI page pinning, it could call out to PI.

> + user_ext = (struct io_extension __user *)iocb->aio_extension_ptr;

Need a __force there?  Has this been run through sparse?

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] fs/bio-integrity: remove duplicate code

2014-04-02 Thread Zach Brown
> +static int bio_integrity_generate_verify(struct bio *bio, int operate)
>  {

> + if (operate)
> + sector = bio->bi_iter.bi_sector;
> + else
> + sector = bio->bi_integrity->bip_iter.bi_sector;

> + if (operate) {
> + bi->generate_fn(&bix);
> + } else {
> + ret = bi->verify_fn(&bix);
> + if (ret) {
> + kunmap_atomic(kaddr);
> + return ret;
> + }
> + }

I was glad to see this replaced with explicit sector and func arguments
in later refactoring in the 6/ patch.

But I don't think the function poiner casts in that 6/ patch are wise
(Or even safe all the time, given crazy function pointer trampolines?
Is that still a thing?).  I'd have made a single walk_fn type that
returns and have the non-returning iterators just return 0.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO

2014-04-02 Thread Zach Brown
On Mon, Mar 24, 2014 at 09:22:31AM -0700, Darrick J. Wong wrote:
> This RFC provides a rough implementation of a mechanism to allow
> userspace to attach protection information (e.g. T10 DIF) data to a
> disk write and to receive the information alongside a disk read.

I have some comments for you! :)  Mostly about the interface up in aio.
I don't have all that much to say about the bio/pi bits.

> Patch #2 implements a generic IO extension interface so that we can
> receive a struct io_extension from userspace containing the structure
> size, a flag telling us which extensions we'd like to use (ie_has),
> and (eventually) extension data.  There's a small framework for
> mapping ie_has bits to actual extensions.

I still really don't think that we should be thinking of these as
generic extensions.  We're talking about arguments to syscalls.  's a
small number of them with strong semantics because they're a part of the
syscall ABI.  I don't think we should implement them by iterating over
per-field ops structs.

Anyway, more in reply to the patches.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Zach Brown
On Fri, Mar 21, 2014 at 03:54:37PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
>
> > I'm inclined to agree with Zach on this item.  Ultimately, we need an 
> > extensible data structure that can be grown without completely revising 
> > the ABI as new parameters are added.  We need something that is either 
> > TLV based, or an extensible array.
> 
> Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to indicate
> that this struct iocb has extensions attached to it.  Then, iocb.aio_reserved2
> becomes a pointer to an array of extension descriptors, and iocb.aio_reqprio
> becomes a u16 that tells us the array length.  The libaio.h equivalents are
> iocb.u.c.flags, iocb.u.c.__pad3, and iocb.aio_reqprio, respectively.
> 
> Next, let's define a conceptual structure for aio extensions:
> 
> struct iocb_extension {
>   void *ie_buf;
>   unsigned int ie_buflen;
>   unsigned int ie_type;
>   unsigned int ie_flags;
> };
> 
> The actual definitions can be defined in a similar fashion to the other aio
> structures so that the structures are padded to the same layout regardless of
> bitness.  As mentioned above, iocb.aio_reserved2 points to an array of these.

I'm firmly in the camp that doesn't want to go down this abstract road.
We had this conversation with Kent when he wanted to do something very
similar.

What happens if there are duplicate ie_types?  Is that universally
prohibited, validity left up to the types that are duplicated?  What if
the len is not the right size?  Who checks that?  What if the extension
(they're arguments, but one thing at a time) is writable and the buf
pointers overlap or is unaligned?  Is that cool, who checks it?

Who defines the acceptable set?  Can drivers make up their own weird
types?  How does strace print all this?  How does the security module
universe declare policies that can forbid or allow these things?

Personally, I think this level of dynamism is not worth the complexity.

Can we instead just have a nice easy struct with fixed members that only
grows?

struct some_more_args {
u64 has; /* = HAS_PI_VEC; */
u64 pi_vec_ptr;
u64 pi_vec_nr_segs;
};

struct some_more_args {
u64 has; /* = HAS_PI_VEC | HAS_MAGIC_THING */
u64 pi_vec_ptr;
u64 pi_vec_nr_segs;
u64 magic_thing;
};

If it only grows and has bits indicating presence then I think we're
good.   You only fetch the space for the bits that are indicated.  You
can return errors for bits you don't recognize.  You could perhaps offer
some way to announce the bits you recognize.

I'll admit, though, that I don't really like having to fetch the 'has'
bits first to find out how large the rest of the struct is.  Maybe
that's not worth worrying about.

Thoughts?  Am I out to lunch here?

> Question: Do we want to allow ie_buf to be struct iovec[]?  Can we leave that
> to the extension designer to decide if they want to support either a S-G list,
> one big (vaddr) buffer, or toggle flags?

No idea.  Either seems doable.  I'd aim for simpler to reduce the number
of weird cases to handle or forbid (iovecs with a byte per page!) unless
Martin thinks people want to vector the PI goo.

> I think so.  Let's see how much we can get done.

FWIW, I'm happy to chat about this in person at LSF next week.  I'll be
around.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Zach Brown
On Fri, Mar 21, 2014 at 03:20:25PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 21, 2014 at 11:23:32AM -0700, Zach Brown wrote:
> > On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> > > This RFC provides a rough implementation of a mechanism to allow
> > > userspace to attach protection information (e.g. T10 DIF) data to a
> > > disk write and to receive the information alongside a disk read.  The
> > > interface is an extension to the AIO interface: two new commands
> > > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > > arg list is interpreted to point to a buffer containing a header,
> > > followed by the the PI data.
> > 
> > Instead of adding commands that indicate that the final element is a
> > magical pi buffer, why not expand the iocb?
> > 
> > In the user iocb, a bit in aio_flags could indicate that aio_reserved2
> > is a pointer to an extension of the iocb.  In that extension could be a
> > full iov *, nr_segs for PI data.
> > 
> > You'd then translate that into a bigger kernel kiocb with a specific
> > pointer to PI data rather than having to bubble the tests for this magic
> > final iovec down through the kernel.
> > 
> > +   if (iocb->ki_flags & KIOCB_USE_PI) {
> > +   nr_segs--;
> > +   pi_iov = (struct iovec *)(iov + nr_segs);
> > +   }
> > 
> > I suggest this because there's already pressure to extend the iocb.
> > Folks want io priority inputs, completion time outputs, etc.
> 
> I'm curious about the reqprio field -- it seems like it was put there to
> request some kind of IO priority change, but the kernel doesn't use it.

The user-facing iocbs were derived from the posix aio interface which
has a reqprio field (aio(7), aio_reqprio).  I don't think anything's
ever been done with it.

I don't know more about what current io prio stuff people might want to
specify..  ioprio_set(2) args instead of having to bounce through
syscalls and current-> for each op?  cgroup bits?  No idea.

> If aio_reserved2 becomes a (flag-guarded) pointer to an array of aio
> extensions, I'd be tempted to reuse the reqprio to signal the length of the
> extension array, and if anyone wants to start using reqprio, they could add it
> as an extension.

I'll admit, I'm hesitant to cannibalize reqprio for this.  It's a lame
s16.  But maybe it'll be the least awful alternative.

> (More about this in my response to Ben LaHaise.)

(I'll go reply over there too.)

> > And heck, on the sync rw syscall side, add variant that have a pointer
> > to this same extension struct.  There's nothing inherently aio specific
> > about having lots more per-io inputs and outputs.
> 
> I'm curious -- what kinds of extensions do you envision for sync()?

Sorry, that was poorly worded.  By 'sync' I meant the synchronous
classic sys_*write* syscalls.  Maybe we should add another variant with
a "struct io_goo *" pointer, or whatever.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Zach Brown
On Fri, Mar 21, 2014 at 02:39:59PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 21, 2014 at 10:57:31AM -0400, Jeff Moyer wrote:
> > "Darrick J. Wong"  writes:
> > 
> > > This RFC provides a rough implementation of a mechanism to allow
> > > userspace to attach protection information (e.g. T10 DIF) data to a
> > > disk write and to receive the information alongside a disk read.  The
> > > interface is an extension to the AIO interface: two new commands
> > > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > 
> > Sorry for the shallow question, but what does that M stand for?
> 
> Hmmm... I really don't remember why I picked 'M'.  Probably because it implied
> that the IO has extra 'M'etadata associated with it.

Magical! :)

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Zach Brown
On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> This RFC provides a rough implementation of a mechanism to allow
> userspace to attach protection information (e.g. T10 DIF) data to a
> disk write and to receive the information alongside a disk read.  The
> interface is an extension to the AIO interface: two new commands
> (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> arg list is interpreted to point to a buffer containing a header,
> followed by the the PI data.

Instead of adding commands that indicate that the final element is a
magical pi buffer, why not expand the iocb?

In the user iocb, a bit in aio_flags could indicate that aio_reserved2
is a pointer to an extension of the iocb.  In that extension could be a
full iov *, nr_segs for PI data.

You'd then translate that into a bigger kernel kiocb with a specific
pointer to PI data rather than having to bubble the tests for this magic
final iovec down through the kernel.

+   if (iocb->ki_flags & KIOCB_USE_PI) {
+   nr_segs--;
+   pi_iov = (struct iovec *)(iov + nr_segs);
+   }

I suggest this because there's already pressure to extend the iocb.
Folks want io priority inputs, completion time outputs, etc.

It's a much cleaner way to extend the interface without an explosion of
command enums that are really combinations of per-io arguments that are
present or not.

And heck, on the sync rw syscall side, add variant that have a pointer
to this same extension struct.  There's nothing inherently aio specific
about having lots more per-io inputs and outputs.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[LSF/MM TOPIC] copy offloading

2014-01-15 Thread Zach Brown

Discussing copy offloading at LSF is appropriate because it can involve
so many layers of the stack:

 - high level syscall interface
 - in-kernel high level entry point for nfsd 
 - fs specific implementations (btrfs and ocfs2 cow, nfs) 
 - vfs helper for offloading block copies for ext*,xfs
 - bio offload requests for cow block devices like bcache/dm-cache
 - encoding offload bios into scsi reqs
 - processing virt guest device offload requests with host syscalls

Getting the user and in-kernel interfaces right to support all these
moving parts has proven tricky.  The more input, the better.

It's been a while since I sent out a refreshed version of the series.
That'll be remedied before LSF rolls around :). 

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v2 0/9] target: Add support for EXTENDED_COPY (VAAI) offload emulation

2013-08-26 Thread Zach Brown
On Mon, Aug 26, 2013 at 10:02:59PM +, Nicholas A. Bellinger wrote:
> From: Nicholas Bellinger 
> 
> Hi folks,
> 
> This -v2 series adds support to target-core for generic EXTENDED_COPY offload
> emulation as defined by SPC-4 using virtual (IBLOCK, FILEIO, RAMDISK)
> backends.

Cool, thanks for sending this out.  I'll just stare blankly but
supportively at the SCSI details.

I've been experimenting with the reasonable suggestion of using splice
as the entry point into the high level fs methods that'd be backed by
this stuff.  Let me get it cleaned up and sent out for review.

Hopefully with a few iterations of that we could test some block file
systems on top of this.

> This implemenation fully supports copy offload between the same device
> backend, and across multiple device backends.  It supports copy offload
> transparently across multiple target ports of different fabrics, eg:
> iSCSI -> FC, FC -> iSER, iSER -> FCoE and so on.

That's exciting!  I doubt that we want to be derailed by getting this
working in the first pass, but it'd be fun to enable cross-fs copying
once we got the fundamentals worked out

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-07 Thread Zach Brown
On Thu, Feb 07, 2013 at 11:19:59AM -0500, Jeff Moyer wrote:
> Boaz Harrosh  writes:
> >> 
> >> For aio we just need to add additional fields to an existing structure.
> >> 
> >> So yeah, I'd be interested in that discussion as well.
> 
> Sure, it's easy to start there, but then you eventually end up having to
> add a non-aio interface as well.  Let's not take the latter off the
> table.

I agree that a sync variant should't be ignored, but needing a sync
interface with PI arguments also shouldn't get in the way of adding
support to the aio+dio path.  Simply because it's what people use :/.

> I'm not sure how that's directly related to aio, but ok.  If we're going
> to rewrite the aio code, I think Zach's acall would be a good start, at
> least on the API front:
>   http://lwn.net/Articles/316806/

Yeah, I'm happy to chat about this stuff if people are interested.  I
think I'd do things differently today than what was done in that aged
acall prototype.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html