Josef asked that I check out the offline dedup patches. I hope that I found the most recent posting in my archives :).
The first three patches seemed fine.  I have questions about the
read/compare/clone core:

> +		addr = kmap(page);
> +		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
> +		kunmap(page);

I think you need flush_dcache_page() here 'cause there could be writable
user address mappings of these page cache pages.

> +static void btrfs_double_lock(struct inode *inode1, u64 loff1,
> +			      struct inode *inode2, u64 loff2, u64 len)
> +{
> +	if (inode1 < inode2) {
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
> +		lock_extent_range(inode1, loff1, len);
> +		lock_extent_range(inode2, loff2, len);
> +	} else {
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
> +		lock_extent_range(inode2, loff2, len);
> +		lock_extent_range(inode1, loff1, len);
> +	}
> +}

I'd do it a little differently to reduce the surface area for future 1/2
copy-paste typos.  (And maybe make it safe for when src == dst?)

	if (inode1 > inode2) {
		swap(inode1, inode2);
		swap(loff1, loff2);
	}

	mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
	lock_extent_range(inode1, loff1, len);

	if (inode1 != inode2) {
		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
		lock_extent_range(inode2, loff2, len);
	}

(Note the offsets have to travel with their inodes when they swap.)  If
lockdep lets us do this, anyway :/.  (Then maybe a similar swap to match
in unlock, I'm not sure.)
> +	if (copy_from_user(&tmp,
> +			   (struct btrfs_ioctl_same_args __user *)argp,
> +			   sizeof(tmp))) {
> +		ret = -EFAULT;
> +		goto out_drop_write;
> +	}
> +
> +	args_size = sizeof(tmp) + (tmp.dest_count *
> +			sizeof(struct btrfs_ioctl_same_extent_info));
> +
> +	/* Keep size of ioctl argument sane */
> +	if (args_size > PAGE_CACHE_SIZE) {
> +		ret = -E2BIG;
> +		goto out_drop_write;
> +	}
> +
> +	args = kmalloc(args_size, GFP_NOFS);
> +	if (!args) {
> +		ret = -ENOMEM;
> +		goto out_drop_write;
> +	}
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(args,
> +			   (struct btrfs_ioctl_same_args __user *)argp,
> +			   args_size))
> +		goto out;
> +
> +	/* Make sure args didn't change magically between copies. */
> +	if (memcmp(&tmp, args, sizeof(tmp)))
> +		goto out;
> +
> +	if ((sizeof(tmp) + (sizeof(*info) * args->dest_count)) > args_size)
> +		goto out;
> +
> +	/* pre-format 'out' fields to sane default values */
> +	for (i = 0; i < args->dest_count; i++) {
> +		info = &args->info[i];
> +		info->bytes_deduped = 0;
> +		info->status = 0;
> +	}

I'd get rid of all this code by only copying each input argument on to
the stack as it's needed and by getting rid of the writable output
struct fields.  (more on this later)

> +	/*
> +	 * Limit the total length we will dedupe for each operation.
> +	 * This is intended to bound the entire ioctl to something sane.
> +	 */
> +	if (len > BTRFS_MAX_DEDUPE_LEN)
> +		len = BTRFS_MAX_DEDUPE_LEN;

[ ... ]

> +	src_buffer = btrfs_kvmalloc(len, GFP_NOFS);
> +	dst_buffer = btrfs_kvmalloc(len, GFP_NOFS);
> +	if (!src_buffer || !dst_buffer) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}

Why use these allocated intermediate buffers at all?  Regardless of how
large a region you decide to lock and clone at a time, you can get page
refs for both files and use kmap_atomic() to compare without copying.
And if you don't have the buffers, why not lock and clone as much as the
user asked for?
I could understand wanting to reduce lock hold times, but you could do
that by doing the initial io waits on shared page cache references and
only using the expensive locks to verify the pages and perform the
clone.  But hopefully that complexity isn't really needed?

> +		if (S_ISDIR(dst->i_mode)) {
> +			info->status = -EISDIR;
> +			goto next;
> +		}

Does anything stop us from getting here with symlinks?  It looks like
btrfs_symlink() helpfully overrides i_fop with the _file_ operations
which have .ioctl.  I didn't actually try it.

> +	if (copy_to_user(argp, args, args_size))
> +		ret = -EFAULT;

I don't think it should be possible to return -EFAULT after all the work
is done just because some input memory was accidentally read-only for
whatever reason.

> +#define BTRFS_SAME_DATA_DIFFERS	1
> +/* For extent-same ioctl */
> +struct btrfs_ioctl_same_extent_info {
> +	__s64 fd;		/* in - destination file */
> +	__u64 logical_offset;	/* in - start of extent in destination */
> +	__u64 bytes_deduped;	/* out - total # of bytes we were able
> +				 * to dedupe from this file */
> +	/* status of this dedupe operation:
> +	 * 0 if dedup succeeds
> +	 * < 0 for error
> +	 * == BTRFS_SAME_DATA_DIFFERS if data differs
> +	 */
> +	__s32 status;		/* out - see above description */
> +	__u32 reserved;
> +};

As I said, I'd get rid of the output fields.  Like the other vectored io
syscalls, the return value can indicate the number of initial
consecutive bytes that worked.  When no progress was made then it can
return errors.  Userspace is left to sort out the resulting state and
figure out the extents to retry in exactly the same way that it found
the initial extents to attempt to dedupe in the first place.

(And imagine strace trying to print the inputs and outputs.  Poor, poor,
strace! :))

I hope this helps!

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html