[PATCH] btrfs: skip waiting on ordered range for special files

2015-09-11 Thread Jeff Mahoney
In btrfs_evict_inode, we properly truncate the page cache for evicted
inodes but then we call btrfs_wait_ordered_range for every inode as well.
It's the right thing to do for regular files but results in incorrect
behavior for device inodes for block devices.

filemap_fdatawrite_range gets called with inode->i_mapping which gets
resolved to the block device inode before getting passed to
wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi.  What happens
next depends on whether there's an open file handle associated with the
inode.  If there is, we write to the block device, which is unexpected
behavior.  If there isn't, we fall through normally and inode->i_data is used.
We can also end up racing against open/close which can result in crashes
when i_mapping points to a block device inode that has been closed.

Since there can't be any page cache associated with special file inodes,
it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
the problem.

Cc: 
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
Tested-by: Christoph Biedl 
Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5035,7 +5035,8 @@ void btrfs_evict_inode(struct inode *ino
 		goto no_delete;
 	}
 	/* do we really want it for ->i_nlink > 0 and zero btrfs_root_refs? */
-	btrfs_wait_ordered_range(inode, 0, (u64)-1);
+	if (!special_file(inode->i_mode))
+		btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	btrfs_free_io_failure_record(inode, 0, (u64)-1);
 

-- 
Jeff Mahoney
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
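
The check the patch adds can be tried from userspace. The sketch below is a minimal illustration, not kernel code: is_special_file() mirrors what the kernel's special_file() macro tests (character device, block device, FIFO, socket), and needs_ordered_wait() is a hypothetical name for the decision the patched btrfs_evict_inode() makes.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Userspace mirror of the kernel's special_file() test: true for
 * character devices, block devices, FIFOs and sockets.
 */
static bool is_special_file(mode_t mode)
{
	return S_ISCHR(mode) || S_ISBLK(mode) || S_ISFIFO(mode) || S_ISSOCK(mode);
}

/*
 * Hypothetical helper modelling the patched eviction path: only inodes
 * that are not special files go on to wait for ordered extents.
 */
static bool needs_ordered_wait(mode_t mode)
{
	return !is_special_file(mode);
}
```

With this model, a regular file or directory mode still takes the wait, while a block device node such as one copied into a chroot's /dev does not.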


Re: btrfs regression since 4.X kernel NULL pointer dereference

2015-09-11 Thread Stefan Priebe


On 11.09.2015 at 21:05, Jeff Mahoney wrote:

On 9/11/15 2:55 PM, Jeff Mahoney wrote:

On 8/25/15 5:00 AM, Christoph Hellwig wrote:

I think this is btrfs using a struct block_device that doesn't
have a valid queue pointer in its gendisk for ->s_bdev.  And
there are some fishy looking ->s_bdev assignments in the code
which I suspect are related to it:



fs/btrfs/dev-replace.c: if (fs_info->sb->s_bdev == src_device->bdev)
fs/btrfs/dev-replace.c: fs_info->sb->s_bdev = tgt_device->bdev;
fs/btrfs/volumes.c: if (device->bdev == root->fs_info->sb->s_bdev)
fs/btrfs/volumes.c: root->fs_info->sb->s_bdev = next_device->bdev;
fs/btrfs/volumes.c: if (tgtdev->bdev == fs_info->sb->s_bdev)
fs/btrfs/volumes.c: fs_info->sb->s_bdev = next_device->bdev;


The report at https://bugzilla.kernel.org/show_bug.cgi?id=100911
tracks it down a bit further and it's bdev->bd_disk == NULL instead
of the queue in the gendisk. I don't think that the s_bdev stuff
is related, though I'd certainly love to see that bit go away.

If we're calling blk_get_backing_dev_info, that means we're
already using an inode that has blockdev_superblock and the btrfs
superblock isn't even involved.

We're getting there because btrfs_evict_inode ->
btrfs_wait_ordered_range -> btrfs_fdatawrite_range ->
filemap_fdatawrite_range gets called with inode->i_mapping.  That
mapping gets passed down through __filemap_fdatawrite_range to
wbc_attach_fdatawrite_inode where the inode passed is mapping->host
-- which will be the block device inode rather than the btrfs
device node inode.  That inode is the one ultimately checked in
inode_to_bdi.

So it looks like we're causing writeback on an unrelated block
device that was opened using a device node hosted on btrfs, which
is obviously wrong.

I don't think snapshot removal is even a requirement to trigger
this. I expect it's possible to trigger with two device nodes for
the same block device where one is getting closed and cleaned up
while the eviction of the other happens.  The device nodes wouldn't
even need to be on the same fs.

Other file systems use &inode->i_data in eviction.  Is it that
simple here?


Your patch works fine here. Did some simple tests already.

Thanks!

Stefan



Incidentally, this explanation also covers why I was unable to
reproduce it locally.  SLES systems use devtmpfs and I just bind
mounted it into my chroot environment like I normally would.  When I
cp'd /dev into the test environment, I was able to reproduce immediately.

-Jeff

--
Jeff Mahoney
SUSE Labs
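
The i_mapping versus i_data distinction in the call chain above can be shown with a toy model. This is a sketch under stated assumptions: the toy_* types and functions are hypothetical stand-ins that keep only the three fields the explanation relies on (the embedded i_data, the i_mapping pointer, and the mapping's host), and they are not the kernel structures themselves.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;

/* Toy stand-in for struct address_space: only the owner pointer. */
struct toy_mapping {
	struct toy_inode *host;
};

/* Toy stand-in for struct inode: an embedded mapping plus a pointer. */
struct toy_inode {
	struct toy_mapping i_data;	/* always owned by this inode */
	struct toy_mapping *i_mapping;	/* usually &i_data, may be redirected */
};

static void toy_inode_init(struct toy_inode *inode)
{
	inode->i_data.host = inode;
	inode->i_mapping = &inode->i_data;
}

/* Model of opening a device node: i_mapping is pointed at the bdev inode. */
static void toy_open_device_node(struct toy_inode *node, struct toy_inode *bdev)
{
	node->i_mapping = bdev->i_mapping;
}

/* Which inode does writeback started from inode->i_mapping end up on? */
static struct toy_inode *toy_writeback_target(struct toy_inode *inode)
{
	return inode->i_mapping->host;
}
```

Before the "open", the writeback target of the device node is the node itself; afterwards it is the block device inode, while &node->i_data still belongs to the node. That is the difference between passing inode->i_mapping and &inode->i_data during eviction.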


Re: btrfs regression since 4.X kernel NULL pointer dereference

2015-09-11 Thread Christoph Biedl
Stefan Priebe wrote...

> Thanks. We're using schroot like the user in this bugreport:
> https://bugzilla.kernel.org/show_bug.cgi?id=100911
> 
> But he also claims he found another way to reproduce using vgcfgbackup (last
> comment).

FWIW, this was still the same thing, just stripped down to get closer
to the actual cause. It began with building a private Debian package
that has lvm2 as build dependency. Daemons are disabled in the build
chroot (via policy-rc.d) but the vgcfgbackup invocation in lvm2's
postinst is still run.

Also, a quick test of Jeff's update to the Bugzilla ticket looks very
good. More tests will follow.

Christoph


Re: [PATCH] Btrfs: keep dropped roots in cache until transaction commit

2015-09-11 Thread Mark Fasheh
On Fri, Sep 11, 2015 at 09:25:25AM +0800, Qu Wenruo wrote:
> Josef Bacik wrote on 2015/09/10 16:27 -0400:
> >When dropping a snapshot we need to account for the qgroup changes.  If we
> >drop the snapshot in all one go then the backref code will fail to find
> >blocks from the snapshot we dropped since it won't be able to find the root
> >in the fs root cache.  This can lead to us failing to find refs from other
> >roots that pointed at blocks in the now deleted root.  To handle this we
> >need to not remove the fs roots from the cache until after we process the
> >qgroup operations.  Do this by adding dropped roots to a list on the
> >transaction, and letting the transaction remove the roots at the same time
> >it drops the commit roots.  This will keep all of the backref searching
> >code in sync properly, and fixes a problem Mark was seeing with snapshot
> >delete and qgroups.  Thanks,
> 
> Mark will definitely be happy with this patch, as quite a good basis
> for snapshot deletion.

Indeed. My tests against a kernel with Josef's patches and my snapshot
deletion code seem to be passing. I'll have something on the list shortly.

Thanks,
--Mark

--
Mark Fasheh


[PATCH v2 7/9] vfs: Remove copy_file_range mountpoint checks

2015-09-11 Thread Anna Schumaker
I still want to do an in-kernel copy even if the files are on different
mountpoints, and NFS has a "server to server" copy that expects two
files on different mountpoints.  Let's have individual filesystems
implement this check instead.

Signed-off-by: Anna Schumaker 
---
 fs/read_write.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index ac32388..363bd3e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1366,11 +1366,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	    pos_in + len > i_size_read(inode_in))
 		return -EINVAL;
 
-	/* this could be relaxed once a method supports cross-fs copies */
-	if (inode_in->i_sb != inode_out->i_sb ||
-	    file_in->f_path.mnt != file_out->f_path.mnt)
-		return -EXDEV;
-
 	if (len == 0)
 		return 0;
 
-- 
2.5.1



[PATCH v2 0/9] VFS: In-kernel copy system call

2015-09-11 Thread Anna Schumaker
Copy system calls came up during Plumbers a couple of weeks ago, because
several filesystems (including NFS and XFS) are currently working on copy
acceleration implementations.  We haven't heard from Zach Brown in a while,
so I volunteered to push his patches upstream so individual filesystems
don't need to keep writing their own ioctls.

Changes in v2:
- Update against the most recent Linus kernel
  - Fix conflicts due to new system calls
- Remove requirement that inode_in == inode_out
- Drop patch to add mountpoint checking to btrfs
  - btrfs already did this check
- Rename COPY_REFLINK -> COPY_FR_REFLINK
- Add COPY_FR_COPY flag
- Expand flags == 0 to (COPY_FR_COPY | COPY_FR_REFLINK)
- Remove checking for invalid flags
- Create a new function for handling pagecache copies
- Move rw_verify_area() checks into the new pagecache-copy function
  - Use the return value from rw_verify_area() to set amount of data to copy
- Update man page

I tested the COPY_FR_COPY flag by using /dev/urandom to generate files of
varying sizes and copying them.  I compared the output from `time` against
that of `cp` to see if there is any noticeable difference.  I think there
have been some libvirt changes since my first set of trials, because this
time around cpu usage was down significantly.  VFS copy was also slightly
faster than /usr/bin/cp in all cases.  Values in the tables below are
averages across multiple trials.


 /usr/bin/cp |   512  |  1024  |  1536  |  2048  |  2560  |  3072  |  5120
-------------|--------|--------|--------|--------|--------|--------|--------
        user |  0.00s |  0.01s |  0.01s |  0.01s |  0.01s |  0.01s |  0.02s
      system |  0.68s |  0.48s |  0.74s |  0.99s |  1.25s |  1.50s |  2.51s
         cpu |    34% |    14% |    14% |    15% |    14% |    14% |    15%
       total |  1.993 |  3.314 |  4.994 |  6.599 |  8.627 | 10.079 | 16.852


    VFS copy |   512  |  1024  |  1536  |  2048  |  2560  |  3072  |  5120
-------------|--------|--------|--------|--------|--------|--------|--------
        user |  0.00s |  0.00s |  0.00s |  0.00s |  0.00s |  0.00s |  0.00s
      system |  0.65s |  0.46s |  0.70s |  0.93s |  1.18s |  1.41s |  2.37s
         cpu |    35% |    14% |    15% |    14% |    14% |    14% |    14%
       total |  1.870 |  3.084 |  4.613 |  6.206 |  7.884 |  9.372 | 15.904


Questions?  Comments?  Thoughts?

Anna


Anna Schumaker (6):
  vfs: Copy should check len after file open mode
  vfs: Copy shouldn't forbid ranges inside the same file
  vfs: Copy should use file_out rather than file_in
  vfs: Remove copy_file_range mountpoint checks
  vfs: copy_file_range() can do a pagecache copy with splice
  btrfs: btrfs_copy_file_range() only supports reflinks

Zach Brown (3):
  vfs: add copy_file_range syscall and vfs helper
  x86: add sys_copy_file_range to syscall tables
  btrfs: add .copy_file_range file operation

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/btrfs/ctree.h                       |   3 +
 fs/btrfs/file.c                        |   1 +
 fs/btrfs/ioctl.c                       |  95 ++--
 fs/read_write.c                        | 132 +
 include/linux/copy.h                   |   6 ++
 include/linux/fs.h                     |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/Kbuild              |   1 +
 include/uapi/linux/copy.h              |   7 ++
 kernel/sys_ni.c                        |   1 +
 12 files changed, 215 insertions(+), 40 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

-- 
2.5.1
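
As a rough cross-check of the "slightly faster" observation, the sketch below recomputes the relative saving from the "total" rows of the two timing tables in this cover letter. The helper names are hypothetical, and the numbers are copied from the tables as posted, so this says nothing beyond what those averages already contain.

```c
#include <assert.h>
#include <stddef.h>

/* "total" rows copied from the two tables, in column order. */
static const double cp_total[]  = { 1.993, 3.314, 4.994, 6.599, 8.627, 10.079, 16.852 };
static const double vfs_total[] = { 1.870, 3.084, 4.613, 6.206, 7.884,  9.372, 15.904 };

/* Relative reduction in elapsed time, as a percentage of the baseline. */
static double percent_saved(double baseline, double candidate)
{
	return 100.0 * (baseline - candidate) / baseline;
}

/* Smallest and largest saving across all columns. */
static void saved_range(double *min, double *max)
{
	size_t n = sizeof(cp_total) / sizeof(cp_total[0]);

	*min = *max = percent_saved(cp_total[0], vfs_total[0]);
	for (size_t i = 1; i < n; i++) {
		double p = percent_saved(cp_total[i], vfs_total[i]);

		if (p < *min)
			*min = p;
		if (p > *max)
			*max = p;
	}
}
```

On these figures the VFS copy total is between roughly 5.6% and 8.6% lower than the /usr/bin/cp total, depending on the column.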



[PATCH v2 3/9] btrfs: add .copy_file_range file operation

2015-09-11 Thread Anna Schumaker
From: Zach Brown 

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown 
Signed-off-by: Anna Schumaker 
Reviewed-by: Josef Bacik 
Reviewed-by: David Sterba 
---
 fs/btrfs/ctree.h |  3 ++
 fs/btrfs/file.c  |  1 +
 fs/btrfs/ioctl.c | 91 
 3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..5d06a4f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3996,6 +3996,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
  loff_t pos, size_t write_bytes,
  struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..b05449c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2816,6 +2816,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = btrfs_ioctl,
 #endif
+   .copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0adf542..4311554 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3727,17 +3727,16 @@ out:
return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-  u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+   u64 off, u64 olen, u64 destoff)
 {
struct inode *inode = file_inode(file);
+   struct inode *src = file_inode(file_src);
struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct fd src_file;
-   struct inode *src;
int ret;
u64 len = olen;
u64 bs = root->fs_info->sb->s_blocksize;
-   int same_inode = 0;
+   int same_inode = src == inode;
 
/*
 * TODO:
@@ -3750,49 +3749,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 *   be either compressed or non-compressed.
 */
 
-   /* the destination must be opened for writing */
-   if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-   return -EINVAL;
-
if (btrfs_root_readonly(root))
return -EROFS;
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   return ret;
-
-   src_file = fdget(srcfd);
-   if (!src_file.file) {
-   ret = -EBADF;
-   goto out_drop_write;
-   }
-
-   ret = -EXDEV;
-   if (src_file.file->f_path.mnt != file->f_path.mnt)
-   goto out_fput;
-
-   src = file_inode(src_file.file);
-
-   ret = -EINVAL;
-   if (src == inode)
-   same_inode = 1;
-
-   /* the src must be open for reading */
-   if (!(src_file.file->f_mode & FMODE_READ))
-   goto out_fput;
+   if (file_src->f_path.mnt != file->f_path.mnt ||
+   src->i_sb != inode->i_sb)
+   return -EXDEV;
 
/* don't make the dst file partly checksummed */
if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-   goto out_fput;
+   return -EINVAL;
 
-   ret = -EISDIR;
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-   goto out_fput;
-
-   ret = -EXDEV;
-   if (src->i_sb != inode->i_sb)
-   goto out_fput;
+   return -EISDIR;
 
if (!same_inode) {
btrfs_double_inode_lock(src, inode);
@@ -3869,6 +3839,49 @@ out_unlock:
btrfs_double_inode_unlock(src, inode);
else
mutex_unlock(&src->i_mutex);
+	return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, int flags)
+{
+	ssize_t ret;
+
+	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+	if (ret == 0)
+		ret = len;
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+  u64 off, u64 olen, u64 destoff)
+{
+   struct fd src_file;
+   int ret;
+
+   /* the destination must be opened fo

[PATCH v2 1/9] vfs: add copy_file_range syscall and vfs helper

2015-09-11 Thread Anna Schumaker
From: Zach Brown 

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.
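
From userspace, a call with existing descriptors might look like the sketch below. It rests on several assumptions: SYS_copy_file_range comes from the installed kernel headers (it will not exist on systems that predate the syscall), flags is left at 0 as this patch requires, and copy_fd_range() is a hypothetical helper name. The read/write loop is a userspace fallback added for illustration and is not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from fd_in to fd_out, starting
 * at each descriptor's current file offset.  Returns the number of bytes
 * known to be copied, or -1 with errno set if nothing was copied.
 */
static ssize_t copy_fd_range(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;

	/* First try the in-kernel copy; it may stop early (partial success). */
	while (done < len) {
		/* NULL offsets: the kernel uses and advances the file offsets. */
		long n = syscall(SYS_copy_file_range, fd_in, NULL, fd_out, NULL,
				 len - done, 0u);

		if (n < 0 && errno == EINTR)
			continue;
		if (n < 0)
			break;			/* unsupported or failed */
		if (n == 0)
			return (ssize_t)done;	/* end of input */
		done += (size_t)n;
	}

	/* Userspace fallback: plain read/write for whatever is left. */
	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t r = read(fd_in, buf, want);
		ssize_t off = 0;

		if (r < 0 && errno == EINTR)
			continue;
		if (r < 0)
			return done ? (ssize_t)done : -1;
		if (r == 0)
			break;
		while (off < r) {
			ssize_t w = write(fd_out, buf + off, (size_t)(r - off));

			if (w < 0 && errno == EINTR)
				continue;
			if (w < 0)
				return done ? (ssize_t)done : -1;
			off += w;
		}
		done += (size_t)r;
	}
	return (ssize_t)done;
}
```

A caller would open both files itself (for example with open(2) from <fcntl.h>), which matches the description above: the destination is an existing file descriptor, not a path.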

Signed-off-by: Zach Brown 
[Anna Schumaker:  Change -EINVAL to -EBADF during file verification]
Signed-off-by: Anna Schumaker 
---
 fs/read_write.c                   | 129 ++
 include/linux/fs.h                |   3 +
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c                   |   1 +
 4 files changed, 136 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..82c4933 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #include 
@@ -1327,3 +1328,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 	return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out,
+			    size_t len, int flags)
+{
+	struct inode *inode_in;
+	struct inode *inode_out;
+	ssize_t ret;
+
+	if (flags)
+		return -EINVAL;
+
+	if (len == 0)
+		return 0;
+
+	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0)
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+	if (ret < 0)
+		return ret;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_in->f_op || !file_in->f_op->copy_file_range)
+		return -EBADF;
+
+	inode_in = file_inode(file_in);
+	inode_out = file_inode(file_out);
+
+	/* make sure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > i_size_read(inode_in))
+		return -EINVAL;
+
+	/* this could be relaxed once a method supports cross-fs copies */
+	if (inode_in->i_sb != inode_out->i_sb ||
+	    file_in->f_path.mnt != file_out->f_path.mnt)
+		return -EXDEV;
+
+	/* forbid ranges in the same file */
+	if (inode_in == inode_out)
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					     len, flags);
+	if (ret > 0) {
+		fsnotify_access(file_in);
+		add_rchar(current, ret);
+		fsnotify_modify(file_out);
+		add_wchar(current, ret);
+	}
+	inc_syscr(current);
+	inc_syscw(current);
+
+	mnt_drop_write_file(file_out);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+		int, fd_out, loff_t __user *, off_out,
+		size_t, len, unsigned int, flags)
+{
+	loff_t pos_in;
+	loff_t pos_out;
+	struct fd f_in;
+	struct fd f_out;
+	ssize_t ret;
+
+	f_in = fdget(fd_in);
+	f_out = fdget(fd_out);
+	if (!f_in.file || !f_out.file) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	ret = -EFAULT;
+	if (off_in) {
+		if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_in = f_in.file->f_pos;
+	}
+
+	if (off_out) {
+		if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_out = f_out.file->f_pos;
+	}
+
+	ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+				  flags);
+	if (ret > 0) {
+   

[PATCH v2 8/9] vfs: copy_file_range() can do a pagecache copy with splice

2015-09-11 Thread Anna Schumaker
The NFS server will need some kind of fallback for filesystems that don't
have any kind of copy acceleration, and it should be generally useful to
have an in-kernel copy to avoid lots of switches between kernel and user
space.

I make this configurable by adding two new flags.  Users who only want a
reflink can pass COPY_FR_REFLINK, and users who want a full data copy can
pass COPY_FR_COPY.  The default (flags=0) means to first attempt a
reflink, but use the pagecache if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.
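
The flag handling described above can be modelled in a few lines. This is a toy sketch, not the kernel code: the numeric flag values are illustrative only (the real definitions live in include/uapi/linux/copy.h, which this archive does not show), toy_fs_copy() stands in for a reflink-only filesystem method such as the btrfs one in this series, and choose_copy_path() is a hypothetical name for the decision made in vfs_copy_file_range().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative values only; the real ones are not shown in this thread. */
#define COPY_FR_COPY    (1 << 0)
#define COPY_FR_REFLINK (1 << 1)

enum copy_path { PATH_FS_METHOD, PATH_PAGECACHE, PATH_ERROR };

/* Toy filesystem method that only knows how to reflink. */
static int toy_fs_copy(bool fs_can_reflink, int flags)
{
	if (!fs_can_reflink || !(flags & COPY_FR_REFLINK))
		return -EOPNOTSUPP;
	return 0;
}

/* Which path ends up servicing the copy for a given flags value? */
static enum copy_path choose_copy_path(bool has_fs_method, bool fs_can_reflink,
				       int flags)
{
	int ret = -EOPNOTSUPP;

	if (flags == 0)		/* default: try a reflink, then fall back */
		flags = COPY_FR_COPY | COPY_FR_REFLINK;

	if (has_fs_method)
		ret = toy_fs_copy(fs_can_reflink, flags);
	if (ret >= 0)
		return PATH_FS_METHOD;
	if (flags & COPY_FR_COPY)
		return PATH_PAGECACHE;
	return PATH_ERROR;
}
```

In this model, flags == 0 on a reflink-capable filesystem takes the filesystem method, COPY_FR_COPY alone always ends in the pagecache copy, and COPY_FR_REFLINK alone fails when no reflink is possible.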

Signed-off-by: Anna Schumaker 
---
v2:
- Rename COPY_REFLINK -> COPY_FR_REFLINK
- Introduce COPY_FR_COPY flag
- Flags == 0 is really COPY_FR_COPY|COPY_FR_REFLINK
- Drop check for invalid flags
- Move call to do_splice_direct() into a new function
- Move rw_verify_area() checks into the new fallback function
---
 fs/read_write.c           | 56 ---
 include/linux/copy.h      |  6 +
 include/uapi/linux/Kbuild |  1 +
 include/uapi/linux/copy.h |  7 ++
 4 files changed, 48 insertions(+), 22 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

diff --git a/fs/read_write.c b/fs/read_write.c
index 363bd3e..ba24884 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -7,6 +7,7 @@
 #include  
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 }
 #endif
 
+static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
+				       struct file *file_out, loff_t pos_out,
+				       size_t len)
+{
+	ssize_t ret;
+
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0) {
+		len = ret;
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+		if (ret >= 0)
+			len = ret;
+	}
+	if (ret < 0)
+		return ret;
+
+	file_start_write(file_out);
+	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
+	file_end_write(file_out);
+
+	return ret;
+}
+
 /*
  * copy_file_range() differs from regular file read and write in that it
  * specifically allows return partial success.  When it does so is up to
@@ -1338,34 +1362,17 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			    struct file *file_out, loff_t pos_out,
 			    size_t len, int flags)
 {
-	struct inode *inode_in;
-	struct inode *inode_out;
 	ssize_t ret;
 
-	if (flags)
-		return -EINVAL;
-
-	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
-	ret = rw_verify_area(READ, file_in, &pos_in, len);
-	if (ret >= 0)
-		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
-	if (ret < 0)
-		return ret;
+	if (flags == 0)
+		flags = COPY_FR_COPY | COPY_FR_REFLINK;
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
 	    (file_out->f_flags & O_APPEND) ||
-	    !file_out->f_op || !file_out->f_op->copy_file_range)
+	    !file_in->f_op)
 		return -EBADF;
 
-	inode_in = file_inode(file_in);
-	inode_out = file_inode(file_out);
-
-	/* make sure offsets don't wrap and the input is inside i_size */
-	if (pos_in + len < pos_in || pos_out + len < pos_out ||
-	    pos_in + len > i_size_read(inode_in))
-		return -EINVAL;
-
 	if (len == 0)
 		return 0;
 
@@ -1373,8 +1380,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					      len, flags);
+	ret = -EOPNOTSUPP;
+	if (file_out->f_op->copy_file_range)
+		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+						      pos_out, len, flags);
+	if ((ret < 0) && (flags & COPY_FR_COPY))
+		ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
+					      pos_out, len);
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);
diff --git a/include/linux/copy.h b/include/linux/copy.h
new file mode 100644
index 000..fd54543
--- /dev/null
+++ b/include/linux/copy.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_COPY_H
+#define _LINUX_COPY_H
+
+#include 
+
+#endif /* _LINUX_COPY_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 70ff1d9..d46830a 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -90,6 +90,7 @@ header-y += c

[PATCH v2 4/9] vfs: Copy should check len after file open mode

2015-09-11 Thread Anna Schumaker
I don't think it makes sense to report that a copy succeeded if the
files aren't open properly.

Signed-off-by: Anna Schumaker 
---
 fs/read_write.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 82c4933..38cc251 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1345,9 +1345,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (flags)
 		return -EINVAL;
 
-	if (len == 0)
-		return 0;
-
 	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
 	ret = rw_verify_area(READ, file_in, &pos_in, len);
 	if (ret >= 0)
@@ -1378,6 +1375,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (inode_in == inode_out)
 		return -EINVAL;
 
+	if (len == 0)
+		return 0;
+
 	ret = mnt_want_write_file(file_out);
 	if (ret)
 		return ret;
-- 
2.5.1



[PATCH v2 9/9] btrfs: btrfs_copy_file_range() only supports reflinks

2015-09-11 Thread Anna Schumaker
Reject copies that don't have the COPY_FR_REFLINK flag set.

Signed-off-by: Anna Schumaker 
---
 fs/btrfs/ioctl.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4311554..2e14b91 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3848,6 +3849,9 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 {
 	ssize_t ret;
 
+	if (!(flags & COPY_FR_REFLINK))
+		return -EOPNOTSUPP;
+
 	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
 	if (ret == 0)
 		ret = len;
-- 
2.5.1



[PATCH v2 5/9] vfs: Copy shouldn't forbid ranges inside the same file

2015-09-11 Thread Anna Schumaker
This is perfectly valid for BTRFS and XFS, so let's leave this up to
filesystems to check.

Signed-off-by: Anna Schumaker 
---
 fs/read_write.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 38cc251..d32549b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1371,10 +1371,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	    file_in->f_path.mnt != file_out->f_path.mnt)
 		return -EXDEV;
 
-	/* forbid ranges in the same file */
-	if (inode_in == inode_out)
-		return -EINVAL;
-
 	if (len == 0)
 		return 0;
 
-- 
2.5.1



[PATCH v2 2/9] x86: add sys_copy_file_range to syscall tables

2015-09-11 Thread Anna Schumaker
From: Zach Brown 

Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown 
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker 
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 477bfa6..6867783 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -381,3 +381,4 @@
 372	i386	recvmsg			sys_recvmsg			compat_sys_recvmsg
 373	i386	shutdown		sys_shutdown
 374	i386	userfaultfd		sys_userfaultfd
+375	i386	copy_file_range		sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 81c4906..23baaa5 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -330,6 +330,7 @@
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
+324	common	copy_file_range		sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.5.1



[PATCH v2 10/9] copy_file_range.2: New page documenting copy_file_range()

2015-09-11 Thread Anna Schumaker
copy_file_range() is a new system call for copying ranges of data
completely in the kernel.  This gives filesystems an opportunity to
implement some kind of "copy acceleration", such as reflinks or
server-side-copy (in the case of NFS).

Signed-off-by: Anna Schumaker 
---
 man2/copy_file_range.2 | 188 +
 1 file changed, 188 insertions(+)
 create mode 100644 man2/copy_file_range.2

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
new file mode 100644
index 000..84912b5
--- /dev/null
+++ b/man2/copy_file_range.2
@@ -0,0 +1,188 @@
+.\"This manpage is Copyright (C) 2015 Anna Schumaker
+.TH COPY_FILE_RANGE 2 2015-08-31 "Linux" "Linux Programmer's Manual"
+.SH NAME
+copy_file_range \- Copy a range of data from one file to another
+.SH SYNOPSIS
+.nf
+.B #include 
+.B #include 
+.B #include 
+
+.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
+.BI "int " fd_out ", loff_t * " off_out ", size_t " len ",
+.BI "unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR copy_file_range ()
+system call performs an in-kernel copy between two file descriptors
+without all that tedious mucking about in userspace.
+It copies up to
+.I len
+bytes of data from file descriptor
+.I fd_in
+to file descriptor
+.IR fd_out ,
+overwriting any data that exists within the requested range.
+
+The following semantics apply for
+.IR off_in ,
+and similar statements apply to
+.IR off_out :
+.IP * 3
+If
+.I off_in
+is NULL, then bytes are read from
+.I fd_in
+starting from the current file offset and the current
+file offset is adjusted appropriately.
+.IP *
+If
+.I off_in
+is not NULL, then
+.I off_in
+must point to a buffer that specifies the starting
+offset where bytes from
+.I fd_in
+will be read.  The current file offset of
+.I fd_in
+is not changed, but
+.I off_in
+is adjusted appropriately.
+.PP
+
+The
+.I flags
+argument is a bit mask composed by OR-ing together zero
+or more of the following flags:
+.TP 1.9i
+.B COPY_FR_COPY
+Copy all the file data in the requested range.
+Some filesystems, like NFS, might be able to accelerate this copy
+to avoid unnecessary data transfers.
+.TP
+.B COPY_FR_REFLINK
+Create a lightweight "reflink", where data is not copied until
+one of the files is modified.
+.PP
+The default behavior
+.RI ( flags
+== 0) is to try creating a reflink,
+and if reflinking fails
+.BR copy_file_range ()
+will fall back on performing a full data copy.
+This is equivalent to setting
+.I flags
+equal to
+.RB ( COPY_FR_COPY | COPY_FR_REFLINK ).
+.SH RETURN VALUE
+Upon successful completion,
+.BR copy_file_range ()
+will return the number of bytes copied between files.
+This could be less than the length originally requested.
+
+On error,
+.BR copy_file_range ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+One or more file descriptors are not valid,
+or do not have proper read-write mode;
+.I fd_in
+is not open for reading; or
+.I fd_out
+is not open for writing.
+.TP
+.B EINVAL
+Requested range extends beyond the end of the file; or the
+.I flags
+argument is set to an invalid value.
+.TP
+.B EIO
+A low level I/O error occurred while copying.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOSPC
+There is not enough space to complete the copy.
+.TP
+.B EOPNOTSUPP
+.B COPY_FR_REFLINK
+was specified in
+.IR flags ,
+but the target filesystem does not support reflinks.
+.TP
+.B EXDEV
+Target filesystem doesn't support cross-filesystem copies.
+.SH VERSIONS
+The
+.BR copy_file_range ()
+system call first appeared in Linux 4.4.
+.SH CONFORMING TO
+The
+.BR copy_file_range ()
+system call is a nonstandard Linux extension.
+.SH EXAMPLE
+.nf
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+
+int main(int argc, char **argv)
+{
+    int fd_in, fd_out;
+    struct stat stat;
+    loff_t len, ret;
+
+    if (argc != 3) {
+        fprintf(stderr, "Usage: %s  \\n", argv[0]);
+        exit(EXIT_FAILURE);
+    }
+
+    fd_in = open(argv[1], O_RDONLY);
+    if (fd_in == -1) {
+        perror("open (argv[1])");
+        exit(EXIT_FAILURE);
+    }
+
+    if (fstat(fd_in, &stat) == -1) {
+        perror("fstat");
+        exit(EXIT_FAILURE);
+    }
+    len = stat.st_size;
+
+    fd_out = creat(argv[2], 0644);
+    if (fd_out == -1) {
+        perror("creat (argv[2])");
+        exit(EXIT_FAILURE);
+    }
+
+    do {
+        ret = syscall(__NR_copy_file_range, fd_in, NULL,
+                      fd_out, NULL, len, 0);
+        if (ret == -1) {
+            perror("copy_file_range");
+            exit(EXIT_FAILURE);
+        }
+
+        len -= ret;
+    } while (len > 0);
+
+    close(fd_in);
+    close(fd_out);
+    exit(EXIT_SUCCESS);
+}
+.fi
+.SH SEE ALSO
+.BR splice (2)
-- 
2.5.1
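
The copy loop in the EXAMPLE section above can be sketched a little more defensively. This is only an illustration, not part of the patch: the helper names copy_range_once() and copy_all() are made up here, the sketch assumes the C library headers define SYS_copy_file_range, and it passes flags == 0 (the documented default). It stops when the call returns 0, so a source file that is shorter than expected cannot make the loop spin, and it falls back to plain read()/write() when the system call is unavailable.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy up to len bytes once, using the current file
 * offsets of both descriptors.  Returns bytes copied, 0 at EOF, -1 on error. */
static ssize_t copy_range_once(int fd_in, int fd_out, size_t len)
{
#ifdef SYS_copy_file_range
    ssize_t n = syscall(SYS_copy_file_range, fd_in, NULL, fd_out, NULL, len, 0U);

    /* Only fall back when the kernel or filesystem cannot do the copy. */
    if (n >= 0 || (errno != ENOSYS && errno != EXDEV &&
                   errno != EINVAL && errno != EOPNOTSUPP))
        return n;
#endif
    /* Fallback: bounce the data through a user-space buffer. */
    char buf[65536];
    if (len > sizeof(buf))
        len = sizeof(buf);

    ssize_t r = read(fd_in, buf, len);
    if (r <= 0)
        return r;

    ssize_t done = 0;
    while (done < r) {
        ssize_t w = write(fd_out, buf + done, (size_t)(r - done));
        if (w < 0)
            return -1;
        done += w;
    }
    return done;
}

/* Hypothetical helper: copy len bytes, tolerating short copies. */
static int copy_all(int fd_in, int fd_out, off_t len)
{
    while (len > 0) {
        ssize_t n = copy_range_once(fd_in, fd_out, (size_t)len);
        if (n < 0)
            return -1;
        if (n == 0)
            break;      /* EOF on the source: stop instead of looping forever */
        len -= n;
    }
    return 0;
}
```

A caller would open both files as in the man page example and call copy_all(fd_in, fd_out, stat.st_size).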


[PATCH v2 6/9] vfs: Copy should use file_out rather than file_in

2015-09-11 Thread Anna Schumaker
The way to think about this is that the destination filesystem reads the
data from the source file and processes it accordingly.  This is
especially important to avoid an infinite loop when doing a "server to
server" copy on NFS.

Signed-off-by: Anna Schumaker 
---
 fs/read_write.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d32549b..ac32388 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1355,7 +1355,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
(file_out->f_flags & O_APPEND) ||
-   !file_in->f_op || !file_in->f_op->copy_file_range)
+   !file_out->f_op || !file_out->f_op->copy_file_range)
return -EBADF;
 
inode_in = file_inode(file_in);
@@ -1378,8 +1378,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (ret)
return ret;
 
-   ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-len, flags);
+   ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+ len, flags);
if (ret > 0) {
fsnotify_access(file_in);
add_rchar(current, ret);
-- 
2.5.1
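
The dispatch rule this patch introduces can be modelled in user space. The sketch below is a toy, not kernel code: every toy_* name is invented for illustration, and the real vfs_copy_file_range() performs more checks than shown. It only demonstrates that the copy is routed through the destination file's operations, so a destination without a copy_file_range method is rejected even when the source has one.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

struct toy_file;

/* Toy stand-in for the relevant part of struct file_operations. */
struct toy_file_ops {
    ssize_t (*copy_file_range)(struct toy_file *in, struct toy_file *out,
                               size_t len);
};

struct toy_file {
    const struct toy_file_ops *f_op;
};

/* Pretend filesystem method: reports that every requested byte was copied. */
static ssize_t toy_fs_copy(struct toy_file *in, struct toy_file *out, size_t len)
{
    (void)in;
    (void)out;
    return (ssize_t)len;
}

static const struct toy_file_ops toy_copy_ops = { .copy_file_range = toy_fs_copy };
static const struct toy_file_ops toy_plain_ops = { .copy_file_range = NULL };

/* Dispatch through the destination (out), mirroring the patched check. */
static ssize_t toy_vfs_copy(struct toy_file *in, struct toy_file *out, size_t len)
{
    if (!out->f_op || !out->f_op->copy_file_range)
        return -EBADF;
    return out->f_op->copy_file_range(in, out, len);
}
```

With this rule the destination filesystem decides how to read from the source, which is the behaviour the commit message asks for.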



Re: btrfs regression since 4.X kernel NULL pointer dereference

2015-09-11 Thread Chris Mason
On Fri, Sep 11, 2015 at 02:55:17PM -0400, Jeff Mahoney wrote:
> On 8/25/15 5:00 AM, Christoph Hellwig wrote:
> > I think this is btrfs using a struct block_device that doesn't
> > have a valid queue pointer in its gendisk for ->s_bdev.  And there
> > are some fishy looking ->s_bdev assignments in the code which I
> > suspect are related to it:
> > 
> > fs/btrfs/dev-replace.c: if (fs_info->sb->s_bdev == src_device->bdev)
> > fs/btrfs/dev-replace.c: fs_info->sb->s_bdev = tgt_device->bdev;
> > fs/btrfs/volumes.c: if (device->bdev == root->fs_info->sb->s_bdev)
> > fs/btrfs/volumes.c: root->fs_info->sb->s_bdev = next_device->bdev;
> > fs/btrfs/volumes.c: if (tgtdev->bdev == fs_info->sb->s_bdev)
> > fs/btrfs/volumes.c: fs_info->sb->s_bdev = next_device->bdev;
> 
> The report at https://bugzilla.kernel.org/show_bug.cgi?id=100911
> tracks it down a bit further and it's bdev->bd_disk == NULL instead of
> the queue in the gendisk. I don't think that the s_bdev stuff is
> related, though I'd certainly love to see that bit go away.
> 
> If we're calling blk_get_backing_dev_info, that means we're already
> using an inode that has blockdev_superblock and the btrfs superblock
> isn't even involved.
> 
> We're getting there because btrfs_evict_inode ->
> btrfs_wait_ordered_range -> btrfs_fdatawrite_range ->
> filemap_fdatawrite_range gets called with inode->i_mapping.  That
> mapping gets passed down through __filemap_fdatawrite_range to
> wbc_attach_fdatawrite_inode where the inode passed is mapping->host --
> which will be the block device inode rather than the btrfs device node
> inode.  That inode is the one ultimately checked in inode_to_bdi.
> 
> So it looks like we're causing writeback on an unrelated block device
> that was opened using a device node hosted on btrfs, which is
> obviously wrong.
> 
> I don't think snapshot removal is even a requirement to trigger this.
>  I expect it's possible to trigger with two device nodes for the same
> block device where one is getting closed and cleaned up while the
> eviction of the other happens.  The device nodes wouldn't even need to
> be on the same fs.
> 
> Other file systems use &inode->i_data in eviction.  Is it that simple
> here?

Oh, ok I'm following now.  This really should explain it.  Jeff
mentioned that he's working on a patch to skip the wait_ordered_range
dance based on i_mode.  Thanks Jeff!

-chris


Re: btrfs regression since 4.X kernel NULL pointer dereference

2015-09-11 Thread Jeff Mahoney
On 9/11/15 2:55 PM, Jeff Mahoney wrote:
> On 8/25/15 5:00 AM, Christoph Hellwig wrote:
>> I think this is btrfs using a struct block_device that doesn't
>> have a valid queue pointer in its gendisk for ->s_bdev.  And
>> there are some fishy looking ->s_bdev assignments in the code
>> which I suspect are related to it:
> 
>> fs/btrfs/dev-replace.c: if (fs_info->sb->s_bdev == src_device->bdev)
>> fs/btrfs/dev-replace.c: fs_info->sb->s_bdev = tgt_device->bdev;
>> fs/btrfs/volumes.c: if (device->bdev == root->fs_info->sb->s_bdev)
>> fs/btrfs/volumes.c: root->fs_info->sb->s_bdev = next_device->bdev;
>> fs/btrfs/volumes.c: if (tgtdev->bdev == fs_info->sb->s_bdev)
>> fs/btrfs/volumes.c: fs_info->sb->s_bdev = next_device->bdev;
> 
> The report at https://bugzilla.kernel.org/show_bug.cgi?id=100911 
> tracks it down a bit further and it's bdev->bd_disk == NULL instead
> of the queue in the gendisk. I don't think that the s_bdev stuff
> is related, though I'd certainly love to see that bit go away.
> 
> If we're calling blk_get_backing_dev_info, that means we're
> already using an inode that has blockdev_superblock and the btrfs
> superblock isn't even involved.
> 
> We're getting there because btrfs_evict_inode -> 
> btrfs_wait_ordered_range -> btrfs_fdatawrite_range -> 
> filemap_fdatawrite_range gets called with inode->i_mapping.  That 
> mapping gets passed down through __filemap_fdatawrite_range to 
> wbc_attach_fdatawrite_inode where the inode passed is mapping->host
> -- which will be the block device inode rather than the btrfs
> device node inode.  That inode is the one ultimately checked in
> inode_to_bdi.
> 
> So it looks like we're causing writeback on an unrelated block
> device that was opened using a device node hosted on btrfs, which
> is obviously wrong.
> 
> I don't think snapshot removal is even a requirement to trigger
> this. I expect it's possible to trigger with two device nodes for
> the same block device where one is getting closed and cleaned up
> while the eviction of the other happens.  The device nodes wouldn't
> even need to be on the same fs.
> 
> Other file systems use &inode->i_data in eviction.  Is it that
> simple here?

Incidentally, this explanation also covers why I was unable to
reproduce it locally.  SLES systems use devtmpfs and I just bind
mounted it into my chroot environment like I normally would.  When I
cp'd /dev into the test environment, I was able to reproduce immediately.

-Jeff

-- 
Jeff Mahoney
SUSE Labs


Re: btrfs regression since 4.X kernel NULL pointer dereference

2015-09-11 Thread Jeff Mahoney
On 8/25/15 5:00 AM, Christoph Hellwig wrote:
> I think this is btrfs using a struct block_device that doesn't
> have a valid queue pointer in its gendisk for ->s_bdev.  And there
> are some fishy looking ->s_bdev assignments in the code which I
> suspect are related to it:
> 
> fs/btrfs/dev-replace.c: if (fs_info->sb->s_bdev == src_device->bdev)
> fs/btrfs/dev-replace.c: fs_info->sb->s_bdev = tgt_device->bdev;
> fs/btrfs/volumes.c: if (device->bdev == root->fs_info->sb->s_bdev)
> fs/btrfs/volumes.c: root->fs_info->sb->s_bdev = next_device->bdev;
> fs/btrfs/volumes.c: if (tgtdev->bdev == fs_info->sb->s_bdev)
> fs/btrfs/volumes.c: fs_info->sb->s_bdev = next_device->bdev;

The report at https://bugzilla.kernel.org/show_bug.cgi?id=100911
tracks it down a bit further and it's bdev->bd_disk == NULL instead of
the queue in the gendisk. I don't think that the s_bdev stuff is
related, though I'd certainly love to see that bit go away.

If we're calling blk_get_backing_dev_info, that means we're already
using an inode that has blockdev_superblock and the btrfs superblock
isn't even involved.

We're getting there because btrfs_evict_inode ->
btrfs_wait_ordered_range -> btrfs_fdatawrite_range ->
filemap_fdatawrite_range gets called with inode->i_mapping.  That
mapping gets passed down through __filemap_fdatawrite_range to
wbc_attach_fdatawrite_inode where the inode passed is mapping->host --
which will be the block device inode rather than the btrfs device node
inode.  That inode is the one ultimately checked in inode_to_bdi.

So it looks like we're causing writeback on an unrelated block device
that was opened using a device node hosted on btrfs, which is
obviously wrong.

I don't think snapshot removal is even a requirement to trigger this.
 I expect it's possible to trigger with two device nodes for the same
block device where one is getting closed and cleaned up while the
eviction of the other happens.  The device nodes wouldn't even need to
be on the same fs.

Other file systems use &inode->i_data in eviction.  Is it that simple
here?

-Jeff

-- 
Jeff Mahoney
SUSE Labs
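
The difference between inode->i_mapping and &inode->i_data described in this message can be shown with a small user-space model. Everything here is a simplification with invented toy_* names: it assumes, as the message describes, that opening a block device node redirects the node's i_mapping to the block device inode's mapping while the node's own i_data stays untouched.

```c
#include <stddef.h>

struct toy_inode;

/* Toy address_space: only the back-pointer to its owning inode. */
struct toy_mapping {
    struct toy_inode *host;
};

struct toy_inode {
    struct toy_mapping i_data;      /* the inode's own mapping */
    struct toy_mapping *i_mapping;  /* normally points at i_data */
};

static void toy_inode_init(struct toy_inode *inode)
{
    inode->i_data.host = inode;
    inode->i_mapping = &inode->i_data;
}

/* Model of opening a device node: i_mapping now refers to the block
 * device inode's mapping, not to the node's own i_data. */
static void toy_open_blkdev_node(struct toy_inode *node,
                                 struct toy_inode *bdev_inode)
{
    node->i_mapping = bdev_inode->i_mapping;
}

/* Writeback acts on mapping->host, whichever inode that happens to be. */
static struct toy_inode *toy_writeback_host(struct toy_mapping *mapping)
{
    return mapping->host;
}
```

In this model, following node->i_mapping after the open reaches the block device inode, whereas &node->i_data still reaches the device node itself, which is why eviction code that uses i_data does not touch the unrelated device.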


[GIT PULL] Btrfs

2015-09-11 Thread Chris Mason
Hi Linus,

We have a few more for the btrfs tree in my for-linus-4.3 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.3

These are small cleanups, and also some fixes for our async worker
thread initialization.

I was having some trouble testing these, but it ended up being a
combination of changing around my test servers and a shiny new schedule
while atomic from the new start/finish_plug in writeback_sb_inodes().

That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't
changing a bunch of things in my test setup at once it would have
been really clear.  Fix for writeback_sb_inodes() on the way as well.

Zhao Lei (4) commits (+66/-88):
btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk (+2/-8)
btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures (+33/-40)
btrfs: Update out-of-date "skip parity stripe" comment (+1/-1)
btrfs: Add raid56 support for updating (+30/-39)

Qu Wenruo (1) commits (+35/-24):
btrfs: async_thread: Fix workqueue 'max_active' value when initializing

Tsutomu Itoh (1) commits (+3/-6):
Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called

Total: (6) commits (+104/-118)

 fs/btrfs/async-thread.c | 57 ++---
 fs/btrfs/async-thread.h |  2 +-
 fs/btrfs/dev-replace.c  |  3 +-
 fs/btrfs/disk-io.c  | 76 +++--
 fs/btrfs/disk-io.h  |  1 +
 fs/btrfs/inode.c|  3 +-
 fs/btrfs/scrub.c| 12 ++--
 fs/btrfs/tree-defrag.c  |  3 +-
 fs/btrfs/volumes.c  | 21 +++---
 9 files changed, 82 insertions(+), 96 deletions(-)


Re: [PATCH] Btrfs: keep dropped roots in cache until transaciton commit

2015-09-11 Thread Josef Bacik

On 09/10/2015 09:25 PM, Qu Wenruo wrote:
>
> Josef Bacik wrote on 2015/09/10 16:27 -0400:
>> When dropping a snapshot we need to account for the qgroup changes.  If we drop
>> the snapshot in all one go then the backref code will fail to find blocks from
>> the snapshot we dropped since it won't be able to find the root in the fs root
>> cache.  This can lead to us failing to find refs from other roots that pointed
>> at blocks in the now deleted root.  To handle this we need to not remove the fs
>> roots from the cache until after we process the qgroup operations.  Do this by
>> adding dropped roots to a list on the transaction, and letting the transaction
>> remove the roots at the same time it drops the commit roots.  This will keep all
>> of the backref searching code in sync properly, and fixes a problem Mark was
>> seeing with snapshot delete and qgroups.  Thanks,Btrfs: keep dropped roots in
>> cache until transaciton commit
>>
>> When dropping a snapshot we need to account for the qgroup changes.  If we drop
>> the snapshot in all one go then the backref code will fail to find blocks from
>> the snapshot we dropped since it won't be able to find the root in the fs root
>> cache.  This can lead to us failing to find refs from other roots that pointed
>> at blocks in the now deleted root.  To handle this we need to not remove the fs
>> roots from the cache until after we process the qgroup operations.  Do this by
>> adding dropped roots to a list on the transaction, and letting the transaction
>> remove the roots at the same time it drops the commit roots.  This will keep all
>> of the backref searching code in sync properly, and fixes a problem Mark was
>> seeing with snapshot delete and qgroups.  Thanks,
>
> Mark will definitely be happy with this patch, as quite a good basis for
> snapshot deletion.
>
> BTW, the commit message seems to be repeating itself.

Argh, I usually notice when that happens; there's some weird vim key
combo that I accidentally hit pretty regularly that duplicates everything
I just typed.  Thanks,


Josef



Re: [PATCH 0/4] Fix for btrfs-convert chunk type and fsck support

2015-09-11 Thread Qu Wenruo


On 2015-09-11 22:56, David Sterba wrote:
> On Thu, Sep 10, 2015 at 10:34:13AM +0800, Qu Wenruo wrote:
>> Again the buggy btrfs-convert, even David tried to ban mixed-bg features
>> for btrfs-convert, it will still put data and metadata extents into the
>> same chunk, without marking the chunk mixed.
>>
>> So in the patchset, first add fsck support for such problem, and then
>> force btrfs-convert to use mixed block group.
>
> I don't think this is a good option for now.

I'm OK with the decision not to force mixed bg right now.
As you mentioned, it would be better not to do it until we fix the whole
problem.

> People convert
> many-terabytes filesystems. Unless there's a way how to convert such
> filesystem to the split data/metadata type I don't want to force mixed
> bg to convert.

But even for the multi-terabyte convert case, I'm not sure it is a good
idea to use separate data/metadata chunks.
Even for ext4, I'm not sure it stores extents/metadata in a good
manner.
IIRC, ext* allocates space from the middle of free space to avoid
fragmentation. (I'm not familiar with ext*, so I can be totally wrong.)

So ext* may leave a lot of small holes in its free space.

And for convert, all ext* data and metadata must be covered by btrfs
DATA chunks, and the btrfs metadata is then stored in the remaining space.
That results in either tons of small metadata chunks between scattered
data chunks, or almost no space left for metadata.

So IMHO, for the converted case, mixed bg would be a quite good and
generic choice.

> The bug you describe is there, but I wonder why didn't
> we notice problems that arise from it.

Personally speaking, the COW nature of btrfs does a quite good job of
hiding the bug, and even self-heals it.

For metadata in a non-mixed DATA chunk, in the read case the kernel won't
detect anything wrong as long as the block passes the
generation/csum/backref check.

For writing metadata in a non-mixed DATA chunk, COW will allocate the new
tree block from a METADATA chunk.
And if we have enough metadata operations, the problem will just
disappear after all metadata is COWed.

Only some corner cases will trigger a WARNING or BUG.

We can add some extra checks in check_tree_block() to catch such cases,
but I think it would have a bad impact on performance.

Thanks,
Qu






Re: [PATCH 1/3] btrfs-progs: Fix infinite loop of btrfs-image

2015-09-11 Thread David Sterba
On Wed, Sep 09, 2015 at 09:32:21PM +0800, Zhao Lei wrote:
> Bug:
>  # btrfs-image -t0 -c9 /dev/sda6 /tmp/btrfs_image.img
>  (hang)
>  # btrfs-image -r -t0 /tmp/btrfs_image.img /dev/sda6
>  (hang)
> 
> Reason:
>  Program need to create at least 1 thread for compress/decompress.
>  but if user specify -t0 in argument, it overwrite the default value
>  of 1, then the program really created 0 thread, and caused above
>  error.
> 
> Fix:
>  We can add a check, to make program not allow -t0 argument,
>  but there are still exist another problem.
>  for ex, in node with 4 cpus:
>  btrfs-image -c9 -t1: 4 threads (1 means use nr_cpus)
>  -c9 -t2: 2 threads
>  -c9 -t3: 3 threads
>  ...
>  (-t1 have more threads than -t2 and -t3)
> 
>  So we change to use value of 0 as "use nr_cpus threads", then:
>  btrfs-image [no -t arg]: use nr_cpus threads
>  -t0: use nr_cpus threads
>  -t val:  use val threads.
> 
> Signed-off-by: Zhao Lei 

All 3 patches applied, thanks.
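
The -t semantics described in the quoted fix can be sketched as a small helper. This is not the btrfs-image code: resolve_thread_count() and online_cpus() are hypothetical names, and the only behaviour taken from the patch description is that 0 (or no -t argument) means "use nr_cpus threads" while any positive value is used as given.

```c
#include <unistd.h>

/* Hypothetical helper: map the -t argument to a worker thread count.
 * requested == 0 means "use one thread per online CPU"; at least one
 * thread is always returned so compression can make progress. */
static long resolve_thread_count(long requested, long nr_cpus)
{
    if (requested > 0)
        return requested;
    return nr_cpus > 0 ? nr_cpus : 1;
}

/* Number of online CPUs as reported by the C library (-1 on failure). */
static long online_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

With this mapping, -t1 really means one thread, which removes the oddity of -t1 starting more threads than -t2 or -t3.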


Re: [PATCH 0/4] Fix for btrfs-convert chunk type and fsck support

2015-09-11 Thread David Sterba
On Thu, Sep 10, 2015 at 10:34:13AM +0800, Qu Wenruo wrote:
> Again the buggy btrfs-convert, even David tried to ban mixed-bg features
> for btrfs-convert, it will still put data and metadata extents into the
> same chunk, without marking the chunk mixed.
> 
> So in the patchset, first add fsck support for such problem, and then
> force btrfs-convert to use mixed block group.

I don't think this is a good option for now. People convert
many-terabytes filesystems. Unless there's a way how to convert such
filesystem to the split data/metadata type I don't want to force mixed
bg to convert. The bug you describe is there, but I wonder why didn't
we notice problems that arise from it.


Re: [PATCH] btrfs-progs: fix cross stripe boundary check

2015-09-11 Thread David Sterba
On Fri, Sep 11, 2015 at 08:27:17AM +0800, Qu Wenruo wrote:
> BTW, any idea to add mkfs test?

Yes, I'll add a test that will cycle through combinations of various
options (nodesize, raid profiles).


Re: [PATCH] btrfs-progs: fix cross stripe boundary check

2015-09-11 Thread David Sterba
On Fri, Sep 11, 2015 at 03:16:02PM +0200, Holger Hoffstätte wrote:
> Am I correct that this also causes false positives with btrfs check? I just
> ran a sanity check on an fs that had no problems whatsoever and was
> definitely not converted (so 16k nodesize) and got thousands of
> cross-stripe complaints; repair didn't help. Applying the patch seems to
> have fixed those; it completes without problems now.

Yes you are. If the node blocks end at the stripe boundary, they're
incorrectly marked as stripe crossing. The 64k nodesize case was quick
to detect that because this holds for all nodes. Thanks for your
report, this means the bug is not that rare. I'll do a release within a
few days including this patch.
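
The off-by-one can be illustrated with a stand-alone check. This is a sketch under assumptions: the function names and TOY_STRIPE_LEN are invented here, the 64K stripe length is assumed, and the "buggy" form is a guess at the shape of the error (comparing against start + len, the first byte after the block, instead of the block's last byte) rather than a copy of the btrfs-progs source.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_STRIPE_LEN (64 * 1024ULL)   /* assumed stripe length */

/* Off-by-one variant: uses the first byte *after* the block, so a block
 * that ends exactly on a stripe boundary is reported as crossing it. */
static bool crosses_stripe_buggy(uint64_t start, uint64_t len)
{
    return start / TOY_STRIPE_LEN != (start + len) / TOY_STRIPE_LEN;
}

/* Corrected variant: compares the stripes of the first and last byte. */
static bool crosses_stripe(uint64_t start, uint64_t len)
{
    return start / TOY_STRIPE_LEN != (start + len - 1) / TOY_STRIPE_LEN;
}
```

Under these assumptions an aligned 64K node always ends on a boundary, so the buggy check flags every node, while a 16K node is flagged only when it is the last one in its stripe. That matches the reports in this thread.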


Re: [PATCH] btrfs-progs: fix cross stripe boundary check

2015-09-11 Thread Holger Hoffstätte
On Thu, Sep 10, 2015 at 5:02 PM, David Sterba  wrote:
> Commit 854437ca3c228d8ab3eb24d2efc1c21b5d56a635 ("btrfs-progs:
> extent-tree: avoid allocating tree block that crosses stripe boundary")
> does not work for 64k nodesize. Due to an off-by-one error, all queries
> to check_crossing_stripes will return that all extents cross a stripe
> and this will lead to a false ENOSPC. This crashes later
>
> $ ./mkfs.btrfs -n 64k image
>
> ./mkfs.btrfs(btrfs_reserve_extent+0xb77)[0x417f38]
> ./mkfs.btrfs(btrfs_alloc_free_block+0x57)[0x417fe0]
> ./mkfs.btrfs(__btrfs_cow_block+0x163)[0x408eb7]
> ./mkfs.btrfs(btrfs_cow_block+0xd0)[0x4097c4]
> ./mkfs.btrfs(btrfs_search_slot+0x16f)[0x40be4d]
> ./mkfs.btrfs(btrfs_insert_empty_items+0xc0)[0x40d5f9]
> ./mkfs.btrfs(btrfs_insert_item+0x99)[0x40da5f]
> ./mkfs.btrfs(btrfs_make_block_group+0x4d)[0x41705c]
> ./mkfs.btrfs(main+0xeef)[0x434b56]

Am I correct that this also causes false positives with btrfs check? I just
ran a sanity check on an fs that had no problems whatsoever and was
definitely not converted (so 16k nodesize) and got thousands of
cross-stripe complaints; repair didn't help. Applying the patch seems to
have fixed those; it completes without problems now.

Holger


Aw: Re: btrfs corruption / bug after sending and receiving and repair

2015-09-11 Thread Jonas von Malottki


> Sent: Friday, 11 September 2015 at 11:49
> From: "Filipe David Manana" 
> To: "Jonas von Malottki" 
> Cc: "linux-btrfs@vger.kernel.org" 
> Subject: Re: btrfs corruption / bug after sending and receiving and repair
>
> On Fri, Sep 11, 2015 at 10:39 AM, Jonas von Malottki  wrote:
> > Hi Btrfslers,
> >
> > I was playing around with send and receive facility to store backups at a 
> > remote machine. Unfortunately I send more data to a device that it could 
> > handle. So the receive operation was ended with "could not write file, no 
> > more space left on device". So far so good, no big deal. To my surprise the 
> > snapshot was transferred but was incomplete (you could actually cd into the 
> > snap and view files, but some were missing). Naturally I didn't trust the 
> > snapshot so I deleted it via btrfs sub del. As I needed more space I 
> > deleted also all other subvolumes. So the device was emtpy, a btrfs subvol 
> > list -a /mntpoint showed nothing, but there was still around 500gb on the 
> > btrfs volume (btrfs fil df), so I dismounted the dev and started a btrfs 
> > check --repair /dev/sdb2 (output below) followed by mounting it via 
> > subvolid  "mount -t btrfs -o subvolid=0 /dev/sdb2 /mnt/btrfs2". But it 
> > showed that there were still 27% of the device in use (2TB device) with 
> > supposedly nothing on it. So I tried to dismount it to reformat the device, 
> > but the umount just hang and a quick look into dmesg (below) showed that 
> > something was freaked up.
> >
> > No real damage done, just wanted you to know maybe you can fix a bug. I can 
> > leave the device for a few days if you would like to have special 
> > information. But i'll have to reboot at least.
> >
> > Thanks for all cool btrfs features though :).
> >
> > Best regards
> > Jonas
> >
> >
> >
> > Output requested (but defunct after the whole stuff happened):
> >
> > vid@tauon:~$   uname -a
> > Linux tauon 3.19.0-28-generic #30-Ubuntu SMP Mon Aug 31 15:52:51 UTC 2015 
> > x86_64 x86_64 x86_64 GNU/Linux
> > vid@tauon:~$ btrfs --version
> > Btrfs v3.17
> > vid@tauon:~$  btrfs fi show
> > ERROR: unable to access '/mnt/btrfs1'
> > ERROR: could not open /dev/sdb2
> > ERROR: could not open /dev/sdc1
> > Btrfs v3.17
> >
> >
> > Output from Repair:
> >
> > vid@tauon:/mnt$ sudo btrfs check --repair  /dev/sdb2
> > enabling repair mode
> > Fixed 0 roots.
> > Checking filesystem on /dev/sdb2
> > UUID: 1af082d6-10f5-45b6-8373-c67b5c595ed6
> > checking extents
> > checking free space cache
> > cache and super generation don't match, space cache will be invalidated
> > checking fs roots
> > checking csums
> > checking root refs
> > found 256752151257 bytes used err is 0
> > total csum bytes: 512030428
> > total tree bytes: 557645824
> > total fs tree bytes: 5079040
> > total extent tree bytes: 4685824
> > btree space waste bytes: 31440848
> > file data blocks allocated: 524450885632
> >  referenced 524450885632
> > Btrfs v3.17
> >
> >
> >
> >
> > Output from Dmesg
> >
> >
> > [84963.514765] btrfs[6456]: segfault at 0 ip 7fcba71e36b4 sp 
> > 7ffe01ba8c40 error 4 in libc-2.21.so[7fcba71a8000+1c]
> > [86002.789758] BTRFS info (device sdb2): disk space caching is enabled
> > [86003.549464] BTRFS: checking UUID tree
> > [86077.520375] [ cut here ]
> > [86077.520381] kernel BUG at 
> > /build/linux-5xFjum/linux-3.19.0/fs/btrfs/inode.c:3142!
> > [86077.520383] invalid opcode:  [#1] SMP
> > [86077.520386] Modules linked in: cfg80211 snd_hda_codec_hdmi gpio_ich 
> > kvm_intel kvm snd_emu10k1_synth snd_emux_synth snd_seq_midi_emul 
> > snd_seq_virmidi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel 
> > snd_emu10k1 snd_hda_controller snd_hda_codec snd_util_mem serio_raw 
> > snd_hwdep snd_ac97_codec ac97_bus snd_seq_midi snd_seq_midi_event joydev 
> > snd_rawmidi nvidia(POE) lpc_ich snd_pcm snd_seq emu10k1_gp snd_seq_device 
> > gameport snd_timer snd drm soundcore x38_edac 8250_fintek edac_core shpchp 
> > mac_hid it87 hwmon_vid coretemp parport_pc ppdev lp parport autofs4 
> > pata_acpi btrfs xor raid6_pq hid_generic usbhid hid firewire_ohci 
> > firewire_core crc_itu_t r8169 mii ahci pata_jmicron libahci
> > [86077.520421] CPU: 1 PID: 6519 Comm: btrfs-cleaner Tainted: P   OE 
> >  3.19.0-28-generic #30-Ubuntu
> > [86077.520422] Hardware name: Gigabyte Technology Co., Ltd. 
> > X38-DQ6/X38-DQ6, BIOS F9F 07/30/2008
> > [86077.520425] task: 88020ed68000 ti: 88020ee9 task.ti: 
> > 88020ee9
> > [86077.520426] RIP: 0010:[]  [] 
> > btrfs_orphan_add+0x1c0/0x1e0 [btrfs]
> > [86077.520447] RSP: 0018:88020ee93c38  EFLAGS: 00010286
> > [86077.520449] RAX: ffe4 RBX: 880002004800 RCX: 
> > 880104cfa000
> > [86077.520450] RDX: 510e RSI: 0004 RDI: 
> > 880104cfa138
> > [86077.520452] RBP: 88020ee93c78 R08: 0001db10 R09: 
> > 880210fcf090
> > [86077.520453] R10: 88022fc9db50 R11: ea0008744840 R12

Re: btrfs corruption / bug after sending and receiving and repair

2015-09-11 Thread Filipe David Manana
On Fri, Sep 11, 2015 at 10:39 AM, Jonas von Malottki  wrote:
> Hi Btrfslers,
>
> I was playing around with the send and receive facility to store backups on a
> remote machine. Unfortunately I sent more data to a device than it could
> handle, so the receive operation ended with "could not write file, no
> more space left on device". So far so good, no big deal. To my surprise the
> snapshot was transferred but was incomplete (you could actually cd into the
> snap and view files, but some were missing). Naturally I didn't trust the
> snapshot so I deleted it via btrfs sub del. As I needed more space I also
> deleted all other subvolumes. So the device was empty, a btrfs subvol list -a
> /mntpoint showed nothing, but there was still around 500 GB in use on the btrfs
> volume (btrfs fil df), so I unmounted the device and started a btrfs check
> --repair /dev/sdb2 (output below) followed by mounting it via subvolid
> "mount -t btrfs -o subvolid=0 /dev/sdb2 /mnt/btrfs2". But it showed that
> 27% of the device was still in use (2TB device) with supposedly
> nothing on it. So I tried to unmount it to reformat the device, but the
> umount just hung and a quick look into dmesg (below) showed that something
> was freaked up.
>
> No real damage done; I just wanted to let you know, in case it helps you fix
> a bug. I can leave the device as it is for a few days if you would like more
> information, but I'll have to reboot at least.
>
> Thanks for all cool btrfs features though :).
>
> Best regards
> Jonas
>
>
>
> Output requested (but defunct after the whole stuff happened):
>
> vid@tauon:~$   uname -a
> Linux tauon 3.19.0-28-generic #30-Ubuntu SMP Mon Aug 31 15:52:51 UTC 2015 
> x86_64 x86_64 x86_64 GNU/Linux
> vid@tauon:~$ btrfs --version
> Btrfs v3.17
> vid@tauon:~$  btrfs fi show
> ERROR: unable to access '/mnt/btrfs1'
> ERROR: could not open /dev/sdb2
> ERROR: could not open /dev/sdc1
> Btrfs v3.17
>
>
> Output from Repair:
>
> vid@tauon:/mnt$ sudo btrfs check --repair  /dev/sdb2
> enabling repair mode
> Fixed 0 roots.
> Checking filesystem on /dev/sdb2
> UUID: 1af082d6-10f5-45b6-8373-c67b5c595ed6
> checking extents
> checking free space cache
> cache and super generation don't match, space cache will be invalidated
> checking fs roots
> checking csums
> checking root refs
> found 256752151257 bytes used err is 0
> total csum bytes: 512030428
> total tree bytes: 557645824
> total fs tree bytes: 5079040
> total extent tree bytes: 4685824
> btree space waste bytes: 31440848
> file data blocks allocated: 524450885632
>  referenced 524450885632
> Btrfs v3.17
>
>
>
>
> Output from Dmesg
>
>
> [84963.514765] btrfs[6456]: segfault at 0 ip 7fcba71e36b4 sp 
> 7ffe01ba8c40 error 4 in libc-2.21.so[7fcba71a8000+1c]
> [86002.789758] BTRFS info (device sdb2): disk space caching is enabled
> [86003.549464] BTRFS: checking UUID tree
> [86077.520375] [ cut here ]
> [86077.520381] kernel BUG at 
> /build/linux-5xFjum/linux-3.19.0/fs/btrfs/inode.c:3142!
> [86077.520383] invalid opcode:  [#1] SMP
> [86077.520386] Modules linked in: cfg80211 snd_hda_codec_hdmi gpio_ich 
> kvm_intel kvm snd_emu10k1_synth snd_emux_synth snd_seq_midi_emul 
> snd_seq_virmidi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel 
> snd_emu10k1 snd_hda_controller snd_hda_codec snd_util_mem serio_raw snd_hwdep 
> snd_ac97_codec ac97_bus snd_seq_midi snd_seq_midi_event joydev snd_rawmidi 
> nvidia(POE) lpc_ich snd_pcm snd_seq emu10k1_gp snd_seq_device gameport 
> snd_timer snd drm soundcore x38_edac 8250_fintek edac_core shpchp mac_hid 
> it87 hwmon_vid coretemp parport_pc ppdev lp parport autofs4 pata_acpi btrfs 
> xor raid6_pq hid_generic usbhid hid firewire_ohci firewire_core crc_itu_t 
> r8169 mii ahci pata_jmicron libahci
> [86077.520421] CPU: 1 PID: 6519 Comm: btrfs-cleaner Tainted: P   OE  
> 3.19.0-28-generic #30-Ubuntu
> [86077.520422] Hardware name: Gigabyte Technology Co., Ltd. X38-DQ6/X38-DQ6, 
> BIOS F9F 07/30/2008
> [86077.520425] task: 88020ed68000 ti: 88020ee9 task.ti: 
> 88020ee9
> [86077.520426] RIP: 0010:[]  [] 
> btrfs_orphan_add+0x1c0/0x1e0 [btrfs]
> [86077.520447] RSP: 0018:88020ee93c38  EFLAGS: 00010286
> [86077.520449] RAX: ffe4 RBX: 880002004800 RCX: 
> 880104cfa000
> [86077.520450] RDX: 510e RSI: 0004 RDI: 
> 880104cfa138
> [86077.520452] RBP: 88020ee93c78 R08: 0001db10 R09: 
> 880210fcf090
> [86077.520453] R10: 88022fc9db50 R11: ea0008744840 R12: 
> 880105660578
> [86077.520455] R13: 8802226ef630 R14: 880002004c58 R15: 
> 0001
> [86077.520456] FS:  () GS:88022fc8() 
> knlGS:
> [86077.520458] CS:  0010 DS:  ES:  CR0: 8005003b
> [86077.520460] CR2: 7f8167699148 CR3: 01c13000 CR4: 
> 07e0
> [86077.520461] Stack:
> [86077.520462]  88020ee93c78 c0408ca5 880104c

btrfs corruption / bug after sending and receiving and repair

2015-09-11 Thread Jonas von Malottki
Hi Btrfslers,

I was playing around with the send and receive facility to store backups on a
remote machine. Unfortunately I sent more data to a device than it could
handle, so the receive operation ended with "could not write file, no more
space left on device". So far so good, no big deal. To my surprise the snapshot
was transferred but was incomplete (you could actually cd into the snap and
view files, but some were missing). Naturally I didn't trust the snapshot so I
deleted it via btrfs sub del. As I needed more space I also deleted all other
subvolumes. So the device was empty, a btrfs subvol list -a /mntpoint showed
nothing, but there was still around 500 GB in use on the btrfs volume (btrfs fil df),
so I unmounted the device and started a btrfs check --repair /dev/sdb2 (output
below) followed by mounting it via subvolid "mount -t btrfs -o subvolid=0
/dev/sdb2 /mnt/btrfs2". But it showed that 27% of the device was still
in use (2TB device) with supposedly nothing on it. So I tried to unmount it to
reformat the device, but the umount just hung and a quick look into dmesg
(below) showed that something was freaked up.

No real damage done; I just wanted to let you know, in case it helps you fix a
bug. I can leave the device as it is for a few days if you would like more
information, but I'll have to reboot at least.

Thanks for all cool btrfs features though :).

Best regards 
Jonas 



Output requested (but defunct after the whole stuff happened):

vid@tauon:~$   uname -a
Linux tauon 3.19.0-28-generic #30-Ubuntu SMP Mon Aug 31 15:52:51 UTC 2015 
x86_64 x86_64 x86_64 GNU/Linux
vid@tauon:~$ btrfs --version
Btrfs v3.17
vid@tauon:~$  btrfs fi show
ERROR: unable to access '/mnt/btrfs1'
ERROR: could not open /dev/sdb2
ERROR: could not open /dev/sdc1
Btrfs v3.17


Output from Repair:

vid@tauon:/mnt$ sudo btrfs check --repair  /dev/sdb2 
enabling repair mode
Fixed 0 roots.
Checking filesystem on /dev/sdb2
UUID: 1af082d6-10f5-45b6-8373-c67b5c595ed6
checking extents
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
checking csums
checking root refs
found 256752151257 bytes used err is 0
total csum bytes: 512030428
total tree bytes: 557645824
total fs tree bytes: 5079040
total extent tree bytes: 4685824
btree space waste bytes: 31440848
file data blocks allocated: 524450885632
 referenced 524450885632
Btrfs v3.17




Output from Dmesg


[84963.514765] btrfs[6456]: segfault at 0 ip 7fcba71e36b4 sp 
7ffe01ba8c40 error 4 in libc-2.21.so[7fcba71a8000+1c]
[86002.789758] BTRFS info (device sdb2): disk space caching is enabled
[86003.549464] BTRFS: checking UUID tree
[86077.520375] [ cut here ]
[86077.520381] kernel BUG at 
/build/linux-5xFjum/linux-3.19.0/fs/btrfs/inode.c:3142!
[86077.520383] invalid opcode:  [#1] SMP 
[86077.520386] Modules linked in: cfg80211 snd_hda_codec_hdmi gpio_ich 
kvm_intel kvm snd_emu10k1_synth snd_emux_synth snd_seq_midi_emul 
snd_seq_virmidi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel 
snd_emu10k1 snd_hda_controller snd_hda_codec snd_util_mem serio_raw snd_hwdep 
snd_ac97_codec ac97_bus snd_seq_midi snd_seq_midi_event joydev snd_rawmidi 
nvidia(POE) lpc_ich snd_pcm snd_seq emu10k1_gp snd_seq_device gameport 
snd_timer snd drm soundcore x38_edac 8250_fintek edac_core shpchp mac_hid it87 
hwmon_vid coretemp parport_pc ppdev lp parport autofs4 pata_acpi btrfs xor 
raid6_pq hid_generic usbhid hid firewire_ohci firewire_core crc_itu_t r8169 mii 
ahci pata_jmicron libahci
[86077.520421] CPU: 1 PID: 6519 Comm: btrfs-cleaner Tainted: P   OE  
3.19.0-28-generic #30-Ubuntu
[86077.520422] Hardware name: Gigabyte Technology Co., Ltd. X38-DQ6/X38-DQ6, 
BIOS F9F 07/30/2008
[86077.520425] task: 88020ed68000 ti: 88020ee9 task.ti: 
88020ee9
[86077.520426] RIP: 0010:[]  [] 
btrfs_orphan_add+0x1c0/0x1e0 [btrfs]
[86077.520447] RSP: 0018:88020ee93c38  EFLAGS: 00010286
[86077.520449] RAX: ffe4 RBX: 880002004800 RCX: 880104cfa000
[86077.520450] RDX: 510e RSI: 0004 RDI: 880104cfa138
[86077.520452] RBP: 88020ee93c78 R08: 0001db10 R09: 880210fcf090
[86077.520453] R10: 88022fc9db50 R11: ea0008744840 R12: 880105660578
[86077.520455] R13: 8802226ef630 R14: 880002004c58 R15: 0001
[86077.520456] FS:  () GS:88022fc8() 
knlGS:
[86077.520458] CS:  0010 DS:  ES:  CR0: 8005003b
[86077.520460] CR2: 7f8167699148 CR3: 01c13000 CR4: 07e0
[86077.520461] Stack:
[86077.520462]  88020ee93c78 c0408ca5 880104cfa000 
880002001800
[86077.520465]  880210fcf090 880105660578 880223b7ec00 
8801b8337d80
[86077.520468]  88020ee93d08 c03b42da 880210fcf098 
880210fcf110
[86077.520470] Call Trace:
[86077.520487]  [] ? lookup_free_space_inode+0x45/0xf0 [btr

Re: [PATCH 0/2] btrfs: fortification for GFP_NOFS allocations

2015-09-11 Thread Michal Hocko
On Wed 09-09-15 18:13:39, Vlastimil Babka wrote:
> On 08/19/2015 08:17 PM, Chris Mason wrote:
> >On Wed, Aug 19, 2015 at 02:17:39PM +0200, mho...@kernel.org wrote:
> >>Hi,
> >>these two patches were sent as a part of a larger RFC which aims at
> >>allowing GFP_NOFS allocations to fail to help sort out memory reclaim
> >>issues bound to the current behavior
> >>(http://marc.info/?l=linux-mm&m=143876830616538&w=2).
> >>
> >>It is clear that the GFP_NOFS behavior change is a long term
> >>plan but these patches should be good enough even with that change in
> >>place. It also seems that Chris wasn't opposed and would be willing to
> >>take them http://marc.info/?l=linux-mm&m=143991792427165&w=2 so here we
> >>come. I have rephrased the changelogs to not refer to the patch which
> >>changes the NOFS behavior.
> >>
> >>Just to clarify: these two patches allowed my particular testcase
> >>(mentioned in the cover referenced above) to survive, but that doesn't mean
> >>that failing GFP_NOFS allocations are OK now. I have seen some other places
> >>where a GFP_NOFS allocation is followed by BUG_ON(ALLOC_FAILED). I have
> >>not encountered them though.
> >>
> >>Let me know if you would prefer other changes.
> >
> >My plan is to start with these two and take more as required.
> 
> I've previously noticed in __set_extent_bit() things like:
> 
> 	if (!prealloc && (mask & __GFP_WAIT)) {
> 		prealloc = alloc_extent_state(mask);
> 		BUG_ON(!prealloc);
> 	}
> 
> and later:
> 
> 	prealloc = alloc_extent_state_atomic(prealloc);
> 	BUG_ON(!prealloc);

Yes. I have also noticed many other places:
$ git grep "BUG_ON.*ENOMEM" -- fs/btrfs/ | wc -l
47

I have talked to David Sterba and he said this is on his todo list.
So this will likely take some more time but it is definitely good to
sort out.
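
To illustrate the general direction, here is a minimal userspace sketch of
returning -ENOMEM to the caller instead of hitting BUG_ON() when an allocation
fails. This is not btrfs code: struct state, alloc_state(), set_range() and
the fail_next_alloc hook are hypothetical names made up for the example, and
calloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for an extent state; not the btrfs structure. */
struct state {
	unsigned long start;
	unsigned long end;
};

/* Illustration-only hook: when non-zero, the next allocation fails once. */
static int fail_next_alloc;

/* Allocate a zeroed state, or return NULL to simulate memory pressure. */
static struct state *alloc_state(void)
{
	if (fail_next_alloc) {
		fail_next_alloc = 0;
		return NULL;
	}
	return calloc(1, sizeof(struct state));
}

/*
 * Record the range [start, end] in a newly allocated state.
 * Where the quoted code does BUG_ON(!prealloc), this sketch reports
 * -ENOMEM and leaves *out untouched so the caller can unwind or retry.
 */
static int set_range(unsigned long start, unsigned long end,
		     struct state **out)
{
	struct state *prealloc;

	assert(out != NULL);	/* programming error, not a runtime failure */

	prealloc = alloc_state();
	if (!prealloc)
		return -ENOMEM;

	prealloc->start = start;
	prealloc->end = end;
	*out = prealloc;
	return 0;
}
```

A caller would check the return value of set_range() and propagate the error
up its own call chain. Doing the same in the real code is the harder part,
because every caller of the affected function has to be able to handle the
failure.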
-- 
Michal Hocko
SUSE Labs