Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-26 Thread Theodore Ts'o
On Wed, Nov 26, 2014 at 10:48:51AM +1100, Dave Chinner wrote:
 No abuse necessary at all. Just a different inode_dirtied_after()
 check is requires if the inode is on the time dirty list in
 move_expired_inodes().

I'm still not sure what you have in mind here.  When would this be
checked?  It sounds like you want to set a timeout such that when an
inode which had its timestamps updated lazily 24 hours earlier, the
inode would get written out.  Yes?  But that implies something is
going to have to scan the list of inodes on the dirty time list
periodically.  When are you proposing that this take place?

The various approaches that come to mind all seem more complex than
what I have in this patch 3 of 4, and I'm not sure it's worth the
complexity.

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v4 4/7] vfs: add lazytime tracepoints for better debugging

2014-11-26 Thread Theodore Ts'o
Signed-off-by: Theodore Ts'o ty...@mit.edu
---
 fs/fs-writeback.c | 1 +
 fs/inode.c| 5 +
 2 files changed, 6 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 529480a..3d87174 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -27,6 +27,7 @@
 #include linux/backing-dev.h
 #include linux/tracepoint.h
 #include linux/device.h
+#include trace/events/fs.h
 #include internal.h
 
 /*
diff --git a/fs/inode.c b/fs/inode.c
index 37c0429..b2fea60 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -20,6 +20,9 @@
 #include linux/list_lru.h
 #include internal.h
 
+#define CREATE_TRACE_POINTS
+#include trace/events/fs.h
+
 /*
  * Inode locking rules:
  *
@@ -1443,6 +1446,7 @@ retry:
inode-i_op-write_time(inode);
else if (inode-i_sb-s_op-write_inode)
mark_inode_dirty_sync(inode);
+   trace_fs_lazytime_iput(inode);
goto retry;
}
iput_final(inode);
@@ -1561,6 +1565,7 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
inode-i_ts_dirty_day = days_since_boot;
spin_unlock(inode-i_lock);
inode_requeue_dirtytime(inode);
+   trace_fs_lazytime_defer(inode);
return 0;
}
 force_dirty:
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v4 3/7] vfs: don't let the dirty time inodes get more than a day stale

2014-11-26 Thread Theodore Ts'o
Guarantee that the on-disk timestamps will be no more than 24 hours
stale.

Signed-off-by: Theodore Ts'o ty...@mit.edu
---
 fs/fs-writeback.c  |  1 +
 fs/inode.c | 18 ++
 include/linux/fs.h |  1 +
 3 files changed, 20 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef8c5d8..529480a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1143,6 +1143,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (flags  (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
trace_writeback_dirty_inode_start(inode, flags);
 
+   inode-i_ts_dirty_day = 0;
if (sb-s_op-dirty_inode)
sb-s_op-dirty_inode(inode, flags);
 
diff --git a/fs/inode.c b/fs/inode.c
index 9e464cc..37c0429 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1510,6 +1510,8 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
+   struct timespec uptime;
+   unsigned short days_since_boot;
int ret;
 
if (inode-i_op-update_time) {
@@ -1526,11 +1528,26 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (flags  S_MTIME)
inode-i_mtime = *time;
}
+   /*
+* If i_ts_dirty_day is zero, then either we have not deferred
+* timestamp updates, or the system has been up for less than
+* a day (so days_since_boot is zero), so we defer timestamp
+* updates in that case and set the I_DIRTY_TIME flag.  If a
+* day or more has passed, then i_ts_dirty_day will be
+* different from days_since_boot, and then we should update
+* the on-disk inode and then we can clear i_ts_dirty_day.
+*/
if ((inode-i_sb-s_flags  MS_LAZYTIME) 
!(flags  S_VERSION) 
!(inode-i_state  (I_DIRTY_SYNC | I_DIRTY_DATASYNC))) {
if (inode-i_state  I_DIRTY_TIME)
return 0;
+   get_monotonic_boottime(uptime);
+   days_since_boot = div_u64(uptime.tv_sec, 86400);
+   if (inode-i_ts_dirty_day 
+   (inode-i_ts_dirty_day != days_since_boot))
+   goto force_dirty;
+
spin_lock(inode-i_lock);
if (inode-i_state  (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
spin_unlock(inode-i_lock);
@@ -1541,6 +1558,7 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
return 0;
}
inode-i_state |= I_DIRTY_TIME;
+   inode-i_ts_dirty_day = days_since_boot;
spin_unlock(inode-i_lock);
inode_requeue_dirtytime(inode);
return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 55cf34d..d0a2181 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
+   unsigned short  i_ts_dirty_day;
unsigned inti_blkbits;
blkcnt_ti_blocks;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v4 5/7] vfs: add find_active_inode_nowait() function

2014-11-26 Thread Theodore Ts'o
Add a new function find_active_inode_nowait() which will never block.
If there is an inode being freed or is still being initialized, this
function will return NULL instead of blocking waiting for an inode to
be freed or to finish initializing.  Hence, a negative return from
this function does not mean that inode number is free for use.  It is
useful for callers that want to opportunistically do some work on an
inode only if it is present and available in the cache, and where
blocking is not an option.

Signed-off-by: Theodore Ts'o ty...@mit.edu
---
 fs/inode.c | 36 
 include/linux/fs.h |  2 ++
 2 files changed, 38 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index b2fea60..0b4c6ae 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1283,6 +1283,42 @@ struct inode *ilookup(struct super_block *sb, unsigned 
long ino)
 }
 EXPORT_SYMBOL(ilookup);
 
+/**
+ * find_active_inode_nowait - find an active inode in the inode cache
+ * @sb:super block of file system to search
+ * @ino:   inode number to search for
+ *
+ * Search for an active inode @ino in the inode cache, and if the
+ * inode is in the cache, the inode is returned with an incremented
+ * reference count.  If the inode is being freed or is newly
+ * initialized, return nothing instead of trying to wait for the inode
+ * initialization or destruction to be complete.
+ */
+struct inode *find_active_inode_nowait(struct super_block *sb,
+  unsigned long ino)
+{
+   struct hlist_head *head = inode_hashtable + hash(sb, ino);
+   struct inode *inode, *ret_inode = NULL;
+
+   spin_lock(inode_hash_lock);
+   hlist_for_each_entry(inode, head, i_hash) {
+   if ((inode-i_ino != ino) ||
+   (inode-i_sb != sb))
+   continue;
+   spin_lock(inode-i_lock);
+   if ((inode-i_state  (I_FREEING | I_WILL_FREE | I_NEW)) == 0) {
+   __iget(inode);
+   ret_inode = inode;
+   }
+   spin_unlock(inode-i_lock);
+   goto out;
+   }
+out:
+   spin_unlock(inode_hash_lock);
+   return ret_inode;
+}
+EXPORT_SYMBOL(find_active_inode_nowait);
+
 int insert_inode_locked(struct inode *inode)
 {
struct super_block *sb = inode-i_sb;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d0a2181..dc615ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2419,6 +2419,8 @@ extern struct inode *ilookup(struct super_block *sb, 
unsigned long ino);
 
 extern struct inode * iget5_locked(struct super_block *, unsigned long, int 
(*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
 extern struct inode * iget_locked(struct super_block *, unsigned long);
+extern struct inode *find_active_inode_nowait(struct super_block *,
+ unsigned long);
 extern int insert_inode_locked4(struct inode *, unsigned long, int 
(*test)(struct inode *, void *), void *);
 extern int insert_inode_locked(struct inode *);
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v4 7/7] btrfs: add an is_readonly() so btrfs can use common code for update_time()

2014-11-26 Thread Theodore Ts'o
The only reason btrfs cloned code from the VFS layer was so it could
add a check to see if a subvolume is read-ony.  Instead of doing that,
let's add a new inode operation which allows a file system to return
an error if the inode is read-only, and use that in update_time().
There may be other places where the VFS layer may want to know that
btrfs would want to treat an inode is read-only.

With this commit, there are no remaining users of update_time() in the
inode operations structure, so we can remove it and simply things
further.

Signed-off-by: Theodore Ts'o ty...@mit.edu
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba dste...@suse.cz
---
 fs/btrfs/inode.c   | 26 ++
 fs/inode.c | 22 +++---
 include/linux/fs.h |  2 +-
 3 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a5e0d0d..0bfe3a4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5554,26 +5554,12 @@ static int btrfs_dirty_inode(struct inode *inode)
return ret;
 }
 
-/*
- * This is a copy of file_update_time.  We need this so we can return error on
- * ENOSPC for updating the inode in the case of file write and mmap writes.
- */
-static int btrfs_update_time(struct inode *inode, struct timespec *now,
-int flags)
+static int btrfs_is_readonly(struct inode *inode)
 {
struct btrfs_root *root = BTRFS_I(inode)-root;
 
if (btrfs_root_readonly(root))
return -EROFS;
-
-   if (flags  S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags  S_CTIME)
-   inode-i_ctime = *now;
-   if (flags  S_MTIME)
-   inode-i_mtime = *now;
-   if (flags  S_ATIME)
-   inode-i_atime = *now;
return 0;
 }
 
@@ -9466,8 +9452,8 @@ static const struct inode_operations 
btrfs_dir_inode_operations = {
.permission = btrfs_permission,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
.tmpfile= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9475,8 +9461,8 @@ static const struct inode_operations 
btrfs_dir_ro_inode_operations = {
.permission = btrfs_permission,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9546,8 +9532,8 @@ static const struct inode_operations 
btrfs_file_inode_operations = {
.fiemap = btrfs_fiemap,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
.getattr= btrfs_getattr,
@@ -9559,8 +9545,8 @@ static const struct inode_operations 
btrfs_special_inode_operations = {
.removexattr= btrfs_removexattr,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink   = generic_readlink,
@@ -9573,8 +9559,8 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
.getxattr   = btrfs_getxattr,
.listxattr  = btrfs_listxattr,
.removexattr= btrfs_removexattr,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 0b4c6ae..e29bd2d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1554,20 +1554,20 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
unsigned short days_since_boot;
int ret;
 
-   if (inode-i_op-update_time) {
-   ret = inode-i_op-update_time(inode, time, flags);
+   if (inode-i_op-is_readonly) {
+   ret = inode-i_op-is_readonly(inode);
if (ret)
return ret;
-   } else {
-   if (flags  S_ATIME)
-   inode-i_atime = *time;
-   if (flags  S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags  S_CTIME)
-   inode-i_ctime = *time;
-   if (flags  S_MTIME)
-   inode-i_mtime = *time;
}
+   if (flags  S_ATIME)

[PATCH-v4 0/7] add support for a lazytime mount option

2014-11-26 Thread Theodore Ts'o
This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.

This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load.  So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.

There is still an ext4-specific optimization, which may be applicable
for other file systems which store more than one inode in a block, but
it will require file system specific code.  It is purely optional,
however.

Please note the changes to update_time() and the new write_time() inode
operations functions, which impact btrfs and xfs.  The changes are
fairly simple, but I would appreciate confirmation from the btrfs and
xfs teams that I got things right.   Thanks!!

Any objections to my carrying these patches in the ext4 git tree?

Changes since -v3:
   - inodes with I_DIRTY_TIME set are placed on a new bdi list,
b_dirty_time.  This allows filesystem-level syncs to more
easily iterate over those inodes that need to have their
timestamps written to disk.
   - dirty timestamps will be written out asynchronously on the final
iput, instead of when the inode gets evicted.
   - separate the definition of the new function
find_active_inode_nowait() to a separate patch
   - create separate flag masks: I_DIRTY_WB and I_DIRTY_INODE, which
   indicate whether the inode needs to be on the write back lists,
   or whether the inode itself is dirty, while I_DIRTY means any one
   of the inode dirty flags are set.  This simplifies the fs
   writeback logic which needs to test for different combinations of
   the inode dirty flags in different places.

Changes since -v2:
   - If update_time() updates i_version, it will not use lazytime (i..e,
   the inode will be marked dirty so the change will be persisted on to
   disk sooner rather than later).  Yes, this eliminates the
   benefits of lazytime if the user is experting the file system via
   NFSv4.  Sad, but NFS's requirements seem to mandate this.
   - Fix time wrapping bug 49 days after the system boots (on a system
with a 32-bit jiffies).   Use get_monotonic_boottime() instead.
   - Clean up type warning in include/tracing/ext4.h
   - Added explicit parenthesis for stylistic reasons
   - Added an is_readonly() inode operations method so btrfs doesn't
   have to duplicate code in update_time().

Changes since -v1:
   - Added explanatory comments in update_time() regarding i_ts_dirty_days
   - Fix type used for days_since_boot
   - Improve SMP scalability in update_time and ext4_update_other_inodes_time
   - Added tracepoints to help test and characterize how often and under
 what circumstances inodes have their timestamps lazily updated


Theodore Ts'o (7):
  vfs: split update_time() into update_time() and write_time()
  vfs: add support for a lazytime mount option
  vfs: don't let the dirty time inodes get more than a day stale
  vfs: add lazytime tracepoints for better debugging
  vfs: add find_active_inode_nowait() function
  ext4: add support for a lazytime mount option
  btrfs: add an is_readonly() so btrfs can use common code for
update_time()

 Documentation/filesystems/Locking |   2 +
 fs/btrfs/inode.c  |  34 +---
 fs/ext4/inode.c   |  49 -
 fs/ext4/super.c   |   9 +++
 fs/fs-writeback.c |  57 +--
 fs/inode.c| 113 +++---
 fs/proc_namespace.c   |   1 +
 fs/sync.c |   8 +++
 fs/xfs/xfs_iops.c |  39 ++---
 include/linux/backing-dev.h   |   1 +
 include/linux/fs.h|  17 +-
 include/trace/events/ext4.h   |  30 ++
 include/uapi/linux/fs.h   |   1 +
 mm/backing-dev.c  |   9 ++-
 14 files changed, 306 insertions(+), 64 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-26 Thread Theodore Ts'o
In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Currently, only btrfs and xfs uses update_time().

We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.

Signed-off-by: Theodore Ts'o ty...@mit.edu
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba dste...@suse.cz
---
 Documentation/filesystems/Locking |  2 ++
 fs/btrfs/inode.c  | 10 ++
 fs/inode.c| 29 ++---
 fs/xfs/xfs_iops.c | 39 ---
 include/linux/fs.h|  1 +
 5 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index b30753c..e49861d 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -63,6 +63,7 @@ prototypes:
int (*removexattr) (struct dentry *, const char *);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, 
u64 len);
void (*update_time)(struct inode *, struct timespec *, int);
+   void (*write_time)(struct inode *);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
@@ -95,6 +96,7 @@ listxattr:no
 removexattr:   yes
 fiemap:no
 update_time:   no
+write_time:no
 atomic_open:   yes
 tmpfile:   no
 dentry_open:   no
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct 
timespec *now,
inode-i_mtime = *now;
if (flags  S_ATIME)
inode-i_atime = *now;
+   return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
return btrfs_dirty_inode(inode);
 }
 
@@ -9462,6 +9467,7 @@ static const struct inode_operations 
btrfs_dir_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
.tmpfile= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations 
btrfs_dir_ro_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations 
btrfs_file_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
.getattr= btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations 
btrfs_special_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink   = generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
.listxattr  = btrfs_listxattr,
.removexattr= btrfs_removexattr,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
-   if (inode-i_op-update_time)
-   return inode-i_op-update_time(inode, time, flags);
-
-   if (flags  S_ATIME)
-   inode-i_atime = *time;
-   if (flags  S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags  S_CTIME)
-   inode-i_ctime = *time;
-   if (flags  S_MTIME)
-   inode-i_mtime = *time;
+   int ret;
+
+   if (inode-i_op-update_time) {
+   ret = inode-i_op-update_time(inode, time, flags);
+   if (ret)
+   return ret;
+   } else {
+   if (flags  S_ATIME)
+   inode-i_atime = *time;
+   if (flags  S_VERSION)
+   

[PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-26 Thread Theodore Ts'o
Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o ty...@mit.edu
---
 fs/ext4/inode.c | 49 ++---
 fs/ext4/super.c |  9 +
 include/trace/events/ext4.h | 30 +++
 3 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5653fa4..8308c82 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+ unsigned long orig_ino, char *buf)
+{
+   struct ext4_inode_info  *ei;
+   struct ext4_inode   *raw_inode;
+   unsigned long   ino;
+   struct inode*inode;
+   int i, inodes_per_block = EXT4_SB(sb)-s_inodes_per_block;
+   int inode_size = EXT4_INODE_SIZE(sb);
+
+   ino = orig_ino  ~(inodes_per_block - 1);
+   for (i = 0; i  inodes_per_block; i++, ino++, buf += inode_size) {
+   if (ino == orig_ino)
+   continue;
+   inode = find_active_inode_nowait(sb, ino);
+   if (!inode ||
+   (inode-i_state  I_DIRTY_TIME) == 0 ||
+   !spin_trylock(inode-i_lock)) {
+   iput(inode);
+   continue;
+   }
+   inode-i_state = ~I_DIRTY_TIME;
+   inode-i_ts_dirty_day = 0;
+   spin_unlock(inode-i_lock);
+   inode_requeue_dirtytime(inode);
+
+   ei = EXT4_I(inode);
+   raw_inode = (struct ext4_inode *) buf;
+
+   spin_lock(ei-i_raw_lock);
+   EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+   ext4_inode_csum_set(inode, raw_inode, ei);
+   spin_unlock(ei-i_raw_lock);
+   trace_ext4_other_inode_update_time(inode, orig_ino);
+   iput(inode);
+   }
+}
+
+
+/*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
  * buffer_head in the inode location struct.
@@ -4237,7 +4282,6 @@ static int ext4_do_update_inode(handle_t *handle,
for (block = 0; block  EXT4_N_BLOCKS; block++)
raw_inode-i_block[block] = ei-i_data[block];
}
-
if (likely(!test_opt2(inode-i_sb, HURD_COMPAT))) {
raw_inode-i_disk_version = cpu_to_le32(inode-i_version);
if (ei-i_extra_isize) {
@@ -4248,10 +4292,9 @@ static int ext4_do_update_inode(handle_t *handle,
cpu_to_le16(ei-i_extra_isize);
}
}
-
ext4_inode_csum_set(inode, raw_inode, ei);
-
spin_unlock(ei-i_raw_lock);
+   ext4_update_other_inodes_time(inode-i_sb, inode-i_ino, bh-b_data);
 
BUFFER_TRACE(bh, call ext4_handle_dirty_metadata);
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 58859bc..93a2b7a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1132,6 +1132,7 @@ enum {
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_usrquota, Opt_grpquota, Opt_i_version,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+   Opt_lazytime, Opt_nolazytime,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
{Opt_i_version, i_version},
{Opt_stripe, stripe=%u},
{Opt_delalloc, delalloc},
+   {Opt_lazytime, lazytime},
+   {Opt_nolazytime, nolazytime},
{Opt_nodelalloc, nodelalloc},
{Opt_removed, mblk_io_submit},
{Opt_removed, nomblk_io_submit},
@@ -1452,6 +1455,12 @@ static int handle_mount_opt(struct super_block *sb, char 
*opt, int token,
case Opt_i_version:
sb-s_flags |= MS_I_VERSION;
return 1;
+   case Opt_lazytime:
+   sb-s_flags |= MS_LAZYTIME;
+   return 1;
+   case Opt_nolazytime:
+   sb-s_flags 

[PATCH-v4 2/7] vfs: add support for a lazytime mount option

2014-11-26 Thread Theodore Ts'o
Add a new mount option which enables a new lazytime mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.

For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies ---
which is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o ty...@mit.edu
---
 fs/fs-writeback.c   | 55 -
 fs/inode.c  | 43 ++-
 fs/proc_namespace.c |  1 +
 fs/sync.c   |  8 +++
 include/linux/backing-dev.h |  1 +
 include/linux/fs.h  | 11 +++--
 include/uapi/linux/fs.h |  1 +
 mm/backing-dev.c|  9 ++--
 8 files changed, 113 insertions(+), 16 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ef8c5d8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -397,7 +397,7 @@ static void requeue_inode(struct inode *inode, struct 
bdi_writeback *wb,
 * shot. If still dirty, it will be redirty_tail()'ed below.  Update
 * the dirty time to prevent enqueue and sync it again.
 */
-   if ((inode-i_state  I_DIRTY) 
+   if ((inode-i_state  I_DIRTY_WB) 
(wbc-sync_mode == WB_SYNC_ALL || wbc-tagged_writepages))
inode-dirtied_when = jiffies;
 
@@ -428,13 +428,15 @@ static void requeue_inode(struct inode *inode, struct 
bdi_writeback *wb,
 */
redirty_tail(inode, wb);
}
-   } else if (inode-i_state  I_DIRTY) {
+   } else if (inode-i_state  I_DIRTY_WB) {
/*
 * Filesystems can dirty the inode during writeback operations,
 * such as delayed allocation during submission or metadata
 * updates after data IO completion.
 */
redirty_tail(inode, wb);
+   } else if (inode-i_state  I_DIRTY_TIME) {
+   list_move(inode-i_wb_list, wb-b_dirty_time);
} else {
/* The inode is clean. Remove from writeback lists. */
list_del_init(inode-i_wb_list);
@@ -482,11 +484,11 @@ __writeback_single_inode(struct inode *inode, struct 
writeback_control *wbc)
/* Clear I_DIRTY_PAGES if we've written out all dirty pages */
if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode-i_state = ~I_DIRTY_PAGES;
-   dirty = inode-i_state  I_DIRTY;
-   inode-i_state = ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+   dirty = inode-i_state  I_DIRTY_INODE;
+   inode-i_state = ~I_DIRTY_INODE;
spin_unlock(inode-i_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
-   if (dirty  (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
+   if (dirty) {
int err = write_inode(inode, wbc);
if (ret == 0)
ret = err;
@@ -1162,7 +1164,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 
spin_lock(inode-i_lock);
if ((inode-i_state  flags) != flags) {
-   const int was_dirty = inode-i_state  I_DIRTY;
+   const int was_dirty = inode-i_state  I_DIRTY_WB;
 
inode-i_state |= flags;
 
@@ -1224,6 +1226,24 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+void inode_requeue_dirtytime(struct inode *inode)
+{
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+   spin_lock(bdi-wb.list_lock);
+   spin_lock(inode-i_lock);
+   if ((inode-i_state  I_DIRTY_WB) == 0) {
+   if (inode-i_state  I_DIRTY_TIME)
+   list_move(inode-i_wb_list, bdi-wb.b_dirty_time);
+   else
+   list_del_init(inode-i_wb_list);
+   }
+   spin_unlock(inode-i_lock);
+   spin_unlock(bdi-wb.list_lock);
+
+}
+EXPORT_SYMBOL(inode_requeue_dirtytime);
+
 static void wait_sb_inodes(struct super_block *sb)
 {
struct inode *inode, *old_inode = NULL;
@@ -1277,6 +1297,28 @@ static void wait_sb_inodes(struct super_block *sb)
iput(old_inode);
 }
 
+/*
+ * Take all of the indoes on the dirty_time list, and mark them as
+ * dirty, 

resetting device stats

2014-11-26 Thread Russell Coker
When running Debian kernel version 3.16.0-4-amd64 and btrfs-tools version 
3.17-1.1 I ran a btrfs replace operation to replace a 3TB disk that was giving 
read errors with a new 4TB disk.

After the replace the btrfs device stats command reported that the 4TB disk 
had 16 read errors.  It appears that the device stats are copied in a replace 
operation.

I believe that this is a bug.  There is no reason to claim that the new 4TB 
disk had read errors when it was the old 3TB disk that had the errors.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-26 Thread Holger Hoffstätte
On Tue, 25 Nov 2014 15:17:58 -0800, John Williams wrote:

 2) CityHash : for 256-bit hashes on all systems
 https://code.google.com/p/cityhash/

Btw this is now superseded by Farmhash:
https://code.google.com/p/farmhash/

-h

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 05/11] Btrfs, raid56: use a variant to record the operation type

2014-11-26 Thread Miao Xie
We will introduce new operation type later, if we still use integer
variant as bool variant to record the operation type, we would add new
variant and increase the size of raid bio structure. It is not good,
by this patch, we define different number for different operation,
and we can just use a variant to record the operation type.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/raid56.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6013d88..bfc406d 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -62,6 +62,11 @@
 
 #define RBIO_CACHE_SIZE 1024
 
+enum btrfs_rbio_ops {
+   BTRFS_RBIO_WRITE= 0,
+   BTRFS_RBIO_READ_REBUILD = 1,
+};
+
 struct btrfs_raid_bio {
struct btrfs_fs_info *fs_info;
struct btrfs_bio *bbio;
@@ -124,7 +129,7 @@ struct btrfs_raid_bio {
 * differently from a parity rebuild as part of
 * rmw
 */
-   int read_rebuild;
+   enum btrfs_rbio_ops operation;
 
/* first bad stripe */
int faila;
@@ -147,7 +152,6 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
-
atomic_t stripes_pending;
 
atomic_t error;
@@ -583,8 +587,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
return 0;
 
/* reads can't merge with writes */
-   if (last-read_rebuild !=
-   cur-read_rebuild) {
+   if (last-operation != cur-operation) {
return 0;
}
 
@@ -777,9 +780,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
spin_unlock(rbio-bio_list_lock);
spin_unlock_irqrestore(h-lock, flags);
 
-   if (next-read_rebuild)
+   if (next-operation == BTRFS_RBIO_READ_REBUILD)
async_read_rebuild(next);
-   else {
+   else if (next-operation == BTRFS_RBIO_WRITE){
steal_rbio(rbio, next);
async_rmw_stripe(next);
}
@@ -1713,6 +1716,7 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
}
bio_list_add(rbio-bio_list, bio);
rbio-bio_list_bytes = bio-bi_iter.bi_size;
+   rbio-operation = BTRFS_RBIO_WRITE;
 
/*
 * don't plug on full rbios, just get them out the door
@@ -1761,7 +1765,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
faila = rbio-faila;
failb = rbio-failb;
 
-   if (rbio-read_rebuild) {
+   if (rbio-operation == BTRFS_RBIO_READ_REBUILD) {
spin_lock_irq(rbio-bio_list_lock);
set_bit(RBIO_RMW_LOCKED_BIT, rbio-flags);
spin_unlock_irq(rbio-bio_list_lock);
@@ -1778,7 +1782,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio-read_rebuild 
+   if (rbio-operation == BTRFS_RBIO_READ_REBUILD 
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1871,7 +1875,7 @@ pstripe:
 * know they can be trusted.  If this was a read reconstruction,
 * other endio functions will fiddle the uptodate bits
 */
-   if (!rbio-read_rebuild) {
+   if (rbio-operation == BTRFS_RBIO_WRITE) {
for (i = 0;  i  nr_pages; i++) {
if (faila != -1) {
page = rbio_stripe_page(rbio, faila, i);
@@ -1888,7 +1892,7 @@ pstripe:
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio-read_rebuild 
+   if (rbio-operation == BTRFS_RBIO_READ_REBUILD 
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1904,7 +1908,7 @@ cleanup:
 
 cleanup_io:
 
-   if (rbio-read_rebuild) {
+   if (rbio-operation == BTRFS_RBIO_READ_REBUILD) {
if (err == 0)
cache_rbio_pages(rbio);
else
@@ -2042,7 +2046,7 @@ out:
return 0;
 
 cleanup:
-   if (rbio-read_rebuild)
+   if (rbio-operation == BTRFS_RBIO_READ_REBUILD)
rbio_orig_end_io(rbio, -EIO, 0);
return -EIO;
 }
@@ -2068,7 +2072,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct 
bio *bio,
 
if (hold_bbio)

[PATCH v3 00/11] Implement device scrub/replace for RAID56

2014-11-26 Thread Miao Xie
This patchset implement the device scrub/replace function for RAID56, the
most implementation of the common data is similar to the other RAID type.
The differentia or difficulty is the parity process. The basic idea is reading
and check the data which has checksum out of the raid56 stripe lock, if the
data is right, then lock the raid56 stripe, read out the other data in the
same stripe, if no IO error happens, calculate the parity and check the
original one, if the original parity is right, the scrub parity passes.
or write out the new one. But if the common data(not parity) that we read out
is wrong, we will try to recover it, and then check and repair the parity.

And in order to avoid making the code more and more complex, we copy some
code of common data process for the parity, the cleanup work is in my
TODO list.

We have done some test, the patchset worked well. Of course, more tests
are welcome. If you are interesting to use it or test it, you can pull
the patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Changelog v2 - v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Fix possible deadlock caused by the pending bios in the plug list
  when the io submitters were going to sleep.
- Fix undealt use-after-free problem of the source device in the final
  device replace procedure.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 - v2:
- Change some function names in raid56.c to make them fit the code style
  of the raid56.

Thanks
Miao

Miao Xie (8):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, raid56: fix use-after-free problem in the final device replace
procedure on raid56
  Btrfs: fix possible deadlock caused by pending I/O in plug list

Zhao Lei (3):
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in
__btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/dev-replace.c |  20 +-
 fs/btrfs/raid56.c  | 746 -
 fs/btrfs/raid56.h  |  16 +-
 fs/btrfs/scrub.c   | 803 +++--
 fs/btrfs/volumes.c |  52 +++-
 fs/btrfs/volumes.h |  14 +-
 6 files changed, 1521 insertions(+), 130 deletions(-)

-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 08/11] Btrfs, replace: write raid56 parity into the replace target device

2014-11-26 Thread Miao Xie
This function reused the code of parity scrub, and we just write
the right parity or corrected parity into the target device before
the parity scrub end.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/raid56.c | 23 +++
 fs/btrfs/scrub.c  |  2 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6f82c1b..cfa449f 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2311,7 +2311,9 @@ static void raid_write_parity_end_io(struct bio *bio, int 
err)
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 int need_check)
 {
+   struct btrfs_bio *bbio = rbio-bbio;
void *pointers[rbio-real_stripes];
+   DECLARE_BITMAP(pbitmap, rbio-stripe_npages);
int nr_data = rbio-nr_data;
int stripe;
int pagenr;
@@ -2321,6 +2323,7 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
struct page *q_page = NULL;
struct bio_list bio_list;
struct bio *bio;
+   int is_replace = 0;
int ret;
 
bio_list_init(bio_list);
@@ -2334,6 +2337,11 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
BUG();
}
 
+   if (bbio-num_tgtdevs  bbio-tgtdev_map[rbio-scrubp]) {
+   is_replace = 1;
+   bitmap_copy(pbitmap, rbio-dbitmap, rbio-stripe_npages);
+   }
+
/*
 * Because the higher layers(scrubber) are unlikely to
 * use this area of the disk again soon, so don't cache
@@ -2422,6 +2430,21 @@ writeback:
goto cleanup;
}
 
+   if (!is_replace)
+   goto submit_write;
+
+   for_each_set_bit(pagenr, pbitmap, rbio-stripe_npages) {
+   struct page *page;
+
+   page = rbio_stripe_page(rbio, rbio-scrubp, pagenr);
+   ret = rbio_add_io_page(rbio, bio_list, page,
+  bbio-tgtdev_map[rbio-scrubp],
+  pagenr, rbio-stripe_len);
+   if (ret)
+   goto cleanup;
+   }
+
+submit_write:
nr_data = bio_list_size(bio_list);
if (!nr_data) {
/* Every parity is right */
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 7f95afc..0ae837f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2714,7 +2714,7 @@ static void scrub_parity_check_and_repair(struct 
scrub_parity *sparity)
goto out;
 
length = sparity-logic_end - sparity-logic_start + 1;
-   ret = btrfs_map_sblock(sctx-dev_root-fs_info, REQ_GET_READ_MIRRORS,
+   ret = btrfs_map_sblock(sctx-dev_root-fs_info, WRITE,
   sparity-logic_start,
   length, bbio, 0, raid_map);
if (ret || !bbio || !raid_map)
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 01/11] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition

2014-11-26 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

bbio_ret in this condition is always !NULL because previous code
already have a check-and-skip:
4908 if (!bbio_ret)
4909 goto out;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..41b0dff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,8 +5162,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
BTRFS_BLOCK_GROUP_RAID6)) {
u64 tmp;
 
-   if (bbio_ret  ((rw  REQ_WRITE) || mirror_num  1)
-raid_map_ret) {
+   if (raid_map_ret  ((rw  REQ_WRITE) || mirror_num  1)) {
int i, rot;
 
/* push stripe_nr back to the start of the full stripe 
*/
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 02/11] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block

2014-11-26 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

stripe_index's value was set again in latter line:
stripe_index = 0;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41b0dff..66d5035 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,9 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
 
/* push stripe_nr back to the start of the full stripe 
*/
stripe_nr = raid56_full_stripe_start;
-   do_div(stripe_nr, stripe_len);
-
-   stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+   do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
/* RAID[56] write or recovery. Return all stripes */
num_stripes = map-num_stripes;
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 03/11] Btrfs, raid56: don't change bbio and raid_map

2014-11-26 Thread Miao Xie
Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any variant of bbio and don't free it at
the end of IO request. So we introduced similar variants into the raid
bio, and don't access those bbio's variants any more.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/raid56.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6a41631..c54b0e6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
+
+   atomic_t stripes_pending;
+
+   atomic_t error;
/*
 * these are two arrays of pointers.  We allocate the
 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-   if (!atomic_dec_and_test(rbio-bbio-stripes_pending))
+   if (!atomic_dec_and_test(rbio-stripes_pending))
return;
 
err = 0;
 
/* OK, we have read all the stripes we need to. */
-   if (atomic_read(rbio-bbio-error)  rbio-bbio-max_errors)
+   if (atomic_read(rbio-error)  rbio-bbio-max_errors)
err = -EIO;
 
rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
rbio-faila = -1;
rbio-failb = -1;
atomic_set(rbio-refs, 1);
+   atomic_set(rbio-error, 0);
+   atomic_set(rbio-stripes_pending, 0);
 
/*
 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
set_bit(RBIO_RMW_LOCKED_BIT, rbio-flags);
spin_unlock_irq(rbio-bio_list_lock);
 
-   atomic_set(rbio-bbio-error, 0);
+   atomic_set(rbio-error, 0);
 
/*
 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
}
}
 
-   atomic_set(bbio-stripes_pending, bio_list_size(bio_list));
-   BUG_ON(atomic_read(bbio-stripes_pending) == 0);
+   atomic_set(rbio-stripes_pending, bio_list_size(bio_list));
+   BUG_ON(atomic_read(rbio-stripes_pending) == 0);
 
while (1) {
bio = bio_list_pop(bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, 
int failed)
if (rbio-faila == -1) {
/* first failure on this rbio */
rbio-faila = failed;
-   atomic_inc(rbio-bbio-error);
+   atomic_inc(rbio-error);
} else if (rbio-failb == -1) {
/* second failure on this rbio */
rbio-failb = failed;
-   atomic_inc(rbio-bbio-error);
+   atomic_inc(rbio-error);
} else {
ret = -EIO;
}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-   if (!atomic_dec_and_test(rbio-bbio-stripes_pending))
+   if (!atomic_dec_and_test(rbio-stripes_pending))
return;
 
err = 0;
-   if (atomic_read(rbio-bbio-error)  rbio-bbio-max_errors)
+   if (atomic_read(rbio-error)  rbio-bbio-max_errors)
goto cleanup;
 
/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio 
*rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
int bios_to_read = 0;
-   struct btrfs_bio *bbio = rbio-bbio;
struct bio_list bio_list;
int ret;
int nr_pages = DIV_ROUND_UP(rbio-stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
index_rbio_pages(rbio);
 
-   atomic_set(rbio-bbio-error, 0);
+   atomic_set(rbio-error, 0);
/*
 * build a list of bios to read all the missing parts of this
 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 * the bbio may be freed once we submit the last bio.  Make sure
 * not to touch it after that
 */
-   atomic_set(bbio-stripes_pending, bios_to_read);
+   atomic_set(rbio-stripes_pending, bios_to_read);
while (1) {
bio = bio_list_pop(bio_list);
if (!bio)
@@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int 
err)
set_bio_pages_uptodate(bio);
bio_put(bio);
 
-   if (!atomic_dec_and_test(rbio-bbio-stripes_pending))
+   if (!atomic_dec_and_test(rbio-stripes_pending))
return;
 
-   if (atomic_read(rbio-bbio-error)  rbio-bbio-max_errors)
+   if 

[PATCH v3 06/11] Btrfs, raid56: support parity scrub on raid56

2014-11-26 Thread Miao Xie
The implementation is:
- Read and check all the data with checksum in the same stripe.
  All the data which has checksum is COW data, and we are sure
  that it is not changed though we don't lock the stripe. because
  the space of that data just can be reclaimed after the current
  transction is committed, and then the fs can use it to store the
  other data, but when doing scrub, we hold the current transaction,
  that is that data can not be recovered, it is safe that read and check
  it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
  The data without checksum and the parity may be changed if we don't
  lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
  is not right
- Unlock the stripe

If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.

And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v2 - v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 - v2:
- None.
---
 fs/btrfs/raid56.c | 514 -
 fs/btrfs/raid56.h |  12 ++
 fs/btrfs/scrub.c  | 609 --
 3 files changed, 1115 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index bfc406d..3b99cbc 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -65,6 +65,7 @@
 enum btrfs_rbio_ops {
BTRFS_RBIO_WRITE= 0,
BTRFS_RBIO_READ_REBUILD = 1,
+   BTRFS_RBIO_PARITY_SCRUB = 2,
 };
 
 struct btrfs_raid_bio {
@@ -123,6 +124,7 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int stripe_npages;
/*
 * set if we're doing a parity rebuild
 * for a read from higher up, which is handled
@@ -137,6 +139,7 @@ struct btrfs_raid_bio {
/* second bad stripe (for raid6 use) */
int failb;
 
+   int scrubp;
/*
 * number of pages needed to represent the full
 * stripe
@@ -171,6 +174,11 @@ struct btrfs_raid_bio {
 * here for faster lookup
 */
struct page **bio_pages;
+
+   /*
+* bitmap to record which horizontal stripe has data
+*/
+   unsigned long *dbitmap;
 };
 
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
@@ -185,6 +193,10 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
 static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
+static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
+int need_check);
+static void async_scrub_parity(struct btrfs_raid_bio *rbio);
+
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
@@ -586,10 +598,20 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
cur-raid_map[0])
return 0;
 
-   /* reads can't merge with writes */
-   if (last-operation != cur-operation) {
+   /* we can't merge with different operations */
+   if (last-operation != cur-operation)
+   return 0;
+   /*
+* We've need read the full stripe from the drive.
+* check and repair the parity and write the new results.
+*
+* We're not allowed to add any new bios to the
+* bio list here, anyone else that wants to
+* change this stripe needs to do their own rmw.
+*/
+   if (last-operation == BTRFS_RBIO_PARITY_SCRUB ||
+   cur-operation == BTRFS_RBIO_PARITY_SCRUB)
return 0;
-   }
 
return 1;
 }
@@ -782,9 +804,12 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
 
if (next-operation == BTRFS_RBIO_READ_REBUILD)
async_read_rebuild(next);
-   else if (next-operation == BTRFS_RBIO_WRITE){
+   else if (next-operation == BTRFS_RBIO_WRITE) {
steal_rbio(rbio, next);
async_rmw_stripe(next);
+   } else if (next-operation == BTRFS_RBIO_PARITY_SCRUB) {
+  

[PATCH v3 11/11] Btrfs, replace: enable dev-replace for raid56

2014-11-26 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/dev-replace.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 894796a..9f6a464 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
struct btrfs_device *tgt_device = NULL;
struct btrfs_device *src_device = NULL;
 
-   if (btrfs_fs_incompat(fs_info, RAID56)) {
-   btrfs_warn(fs_info, dev_replace cannot yet handle 
RAID5/RAID6);
-   return -EOPNOTSUPP;
-   }
-
switch (args-start.cont_reading_from_srcdev_mode) {
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-11-26 Thread Miao Xie
The increase/decrease of bio counter is on the I/O path, so we should
use io_schedule() instead of schedule(), or the deadlock might be
triggered by the pending I/O in the plug list. io_schedule() can help
us because it will flush all the pending I/O before the task is going
to sleep.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v2 - v3:
- New patch to fix possible deadlock caused by the pending bios in the
  plug list when the io submitters were going to sleep.

Changelog v1 - v2:
- None.
---
 fs/btrfs/dev-replace.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index fa27b4e..894796a 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -928,16 +928,23 @@ void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, 
s64 amount)
wake_up(fs_info-replace_wait);
 }
 
+#define btrfs_wait_event_io(wq, condition) \
+do {   \
+   if (condition)  \
+   break;  \
+   (void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,  \
+   io_schedule()); \
+} while (0)
+
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
 {
-   DEFINE_WAIT(wait);
 again:
percpu_counter_inc(fs_info-bio_counter);
if (test_bit(BTRFS_FS_STATE_DEV_REPLACING, fs_info-fs_state)) {
btrfs_bio_counter_dec(fs_info);
-   wait_event(fs_info-replace_wait,
-  !test_bit(BTRFS_FS_STATE_DEV_REPLACING,
-fs_info-fs_state));
+   btrfs_wait_event_io(fs_info-replace_wait,
+   !test_bit(BTRFS_FS_STATE_DEV_REPLACING,
+ fs_info-fs_state));
goto again;
}
 
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 07/11] Btrfs, replace: write dirty pages into the replace target device

2014-11-26 Thread Miao Xie
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
  take the target device stripes into account.
- When we do write operation, we read the data from the common devices
  and calculate the parity. Then write the dirty data and new parity
  out, at this time, we will find the relative replace target stripes
  and wirte the relative data into it.

Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/raid56.c  | 104 +
 fs/btrfs/volumes.c |  26 --
 fs/btrfs/volumes.h |  10 --
 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 3b99cbc..6f82c1b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -124,6 +124,8 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int real_stripes;
+
int stripe_npages;
/*
 * set if we're doing a parity rebuild
@@ -631,7 +633,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio 
*rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-   if (rbio-nr_data + 1 == rbio-bbio-num_stripes)
+   if (rbio-nr_data + 1 == rbio-real_stripes)
return NULL;
 
index += ((rbio-nr_data + 1) * rbio-stripe_len) 
@@ -974,7 +976,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 {
struct btrfs_raid_bio *rbio;
int nr_data = 0;
-   int num_pages = rbio_nr_pages(stripe_len, bbio-num_stripes);
+   int real_stripes = bbio-num_stripes - bbio-num_tgtdevs;
+   int num_pages = rbio_nr_pages(stripe_len, real_stripes);
int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
void *p;
 
@@ -994,6 +997,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
rbio-fs_info = root-fs_info;
rbio-stripe_len = stripe_len;
rbio-nr_pages = num_pages;
+   rbio-real_stripes = real_stripes;
rbio-stripe_npages = stripe_npages;
rbio-faila = -1;
rbio-failb = -1;
@@ -1010,10 +1014,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
btrfs_root *root,
rbio-bio_pages = p + sizeof(struct page *) * num_pages;
rbio-dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
-   if (raid_map[bbio-num_stripes - 1] == RAID6_Q_STRIPE)
-   nr_data = bbio-num_stripes - 2;
+   if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+   nr_data = real_stripes - 2;
else
-   nr_data = bbio-num_stripes - 1;
+   nr_data = real_stripes - 1;
 
rbio-nr_data = nr_data;
return rbio;
@@ -1125,7 +1129,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
if (rbio-faila = 0 || rbio-failb = 0) {
-   BUG_ON(rbio-faila == rbio-bbio-num_stripes - 1);
+   BUG_ON(rbio-faila == rbio-real_stripes - 1);
__raid56_parity_recover(rbio);
} else {
finish_rmw(rbio);
@@ -1186,7 +1190,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
struct btrfs_bio *bbio = rbio-bbio;
-   void *pointers[bbio-num_stripes];
+   void *pointers[rbio-real_stripes];
int stripe_len = rbio-stripe_len;
int nr_data = rbio-nr_data;
int stripe;
@@ -1200,11 +1204,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
 
bio_list_init(bio_list);
 
-   if (bbio-num_stripes - rbio-nr_data == 1) {
-   p_stripe = bbio-num_stripes - 1;
-   } else if (bbio-num_stripes - rbio-nr_data == 2) {
-   p_stripe = bbio-num_stripes - 2;
-   q_stripe = bbio-num_stripes - 1;
+   if (rbio-real_stripes - rbio-nr_data == 1) {
+   p_stripe = rbio-real_stripes - 1;
+   } else if (rbio-real_stripes - rbio-nr_data == 2) {
+   p_stripe = rbio-real_stripes - 2;
+   q_stripe = rbio-real_stripes - 1;
} else {
BUG();
}
@@ -1261,7 +1265,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
SetPageUptodate(p);
pointers[stripe++] = kmap(p);
 
-   raid6_call.gen_syndrome(bbio-num_stripes, PAGE_SIZE,
+   

[PATCH v3 04/11] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

2014-11-26 Thread Miao Xie
This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/raid56.c  |  42 +---
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 -
 fs/btrfs/volumes.c |  16 -
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 226 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c54b0e6..6013d88 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,8 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
+#define RBIO_HOLD_BBIO_MAP_BIT 4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +801,21 @@ done_nolock:
remove_rbio_from_cache(rbio);
 }
 
+static inline void
+__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+   if (need) {
+   kfree(raid_map);
+   kfree(bbio);
+   }
+}
+
+static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+   __free_bbio_and_raid_map(rbio-bbio, rbio-raid_map,
+   !test_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
int i;
@@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
rbio-stripe_pages[i] = NULL;
}
}
-   kfree(rbio-raid_map);
-   kfree(rbio-bbio);
+
+   free_bbio_and_raid_map(rbio);
+
kfree(rbio);
 }
 
@@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 
rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
GFP_NOFS);
-   if (!rbio) {
-   kfree(raid_map);
-   kfree(bbio);
+   if (!rbio)
return ERR_PTR(-ENOMEM);
-   }
 
bio_list_init(rbio-bio_list);
INIT_LIST_HEAD(rbio-plug_list);
@@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
struct blk_plug_cb *cb;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, 1);
return PTR_ERR(rbio);
+   }
bio_list_add(rbio-bio_list, bio);
rbio-bio_list_bytes = bio-bi_iter.bi_size;
 
@@ -2038,15 +2055,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
  struct btrfs_bio *bbio, u64 *raid_map,
- u64 stripe_len, int mirror_num)
+ u64 stripe_len, int mirror_num, int hold_bbio)
 {
struct btrfs_raid_bio *rbio;
int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
return PTR_ERR(rbio);
+   }
 
+   if (hold_bbio)
+   set_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags);
rbio-read_rebuild = 1;
bio_list_add(rbio-bio_list, bio);
rbio-bio_list_bytes = bio-bi_iter.bi_size;
@@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct 
bio *bio,
rbio-faila = find_logical_bio_stripe(rbio, bio);
if (rbio-faila == -1) {
BUG();
-   kfree(raid_map);
-   kfree(bbio);
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
kfree(rbio);
return -EIO;
}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 struct btrfs_bio *bbio, u64 *raid_map,
-u64 stripe_len, int mirror_num);
+u64 stripe_len, int mirror_num, int hold_bbio);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
   struct btrfs_bio *bbio, u64 *raid_map,
   u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index efa0831..ca4b9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,13 @@ struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK  16  /* 64k per node/leaf/sector */
 
+struct scrub_recover {
+   atomic_trefs;
+   struct btrfs_bio*bbio;
+   u64 *raid_map;
+   u64 map_length;
+};
+
 struct scrub_page {
struct scrub_block  *sblock;

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-26 Thread Brendan Hide

On 2014/11/25 13:30, Liu Bo wrote:

This is actually inspired by ZFS, who offers checksum functions ranging
from the simple-and-fast fletcher2 to the slower-but-secure sha256.

Back to btrfs, crc32c is the only choice.

And also for the slowness of sha256, Intel has a set of instructions for
it, Intel SHA Extensions, that may help a lot.


I think the advantage will be in giving a choice with some strong 
suggestions:


An example of suggestions - if using sha256 on an old or low-power 
CPU, detect that the CPU doesn't support the appropriate acceleration 
functions and print a warning at mount or a warning-and-prompt at mkfs-time.


The default could even be changed based on the architecture - though I 
suspect crc32c is already a good default on most architectures.


The choice allowance gives flexibility where admins know it optimally 
could be used - and David's suggestion (separate thread) would be able 
to take advantage of that.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-26 Thread Brendan Hide

On 2014/11/25 18:47, David Sterba wrote:

We could provide an interface for external applications that would make
use of the strong checksums. Eg. external dedup, integrity db. The
benefit here is that the checksum is always up to date, so there's no
need to compute the checksums again. At the obvious cost.


I can imagine some use-cases where you might even want more than one 
algorithm to be used and stored. Not sure if that makes me a madman, 
though. ;)


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-26 Thread Austin S Hemmelgarn

On 2014-11-26 08:38, Brendan Hide wrote:

On 2014/11/25 18:47, David Sterba wrote:

We could provide an interface for external applications that would make
use of the strong checksums. Eg. external dedup, integrity db. The
benefit here is that the checksum is always up to date, so there's no
need to compute the checksums again. At the obvious cost.


I can imagine some use-cases where you might even want more than one
algorithm to be used and stored. Not sure if that makes me a madman,
though. ;)

Not crazy at all, I would love to have the ability to store multiple 
different weak but fast hash values.  For example, on my laptop, it is 
actually faster to compute crc32c, adler32, and md5 hashes together than 
it is to compute pretty much any 256-bit hash I've tried.


This then brings up the issue of what to do when we try to mount such a 
fs on a system that doesn't support some or all of the hashes used.




smime.p7s
Description: S/MIME Cryptographic Signature


Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-11-26 Thread Chris Mason

On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote:

The increase/decrease of bio counter is on the I/O path, so we should
use io_schedule() instead of schedule(), or the deadlock might be
triggered by the pending I/O in the plug list. io_schedule() can help
us because it will flush all the pending I/O before the task is going
to sleep.


Can you please describe this deadlock in more detail?  schedule() also 
triggers a flush of the plug list, and if that's no longer sufficient 
we can run into other problems (especially with preemption on).


-chris

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups

2014-11-26 Thread Filipe Manana
If the transaction handle doesn't have used blocks but has created new block
groups make sure we turn the fs into readonly mode too. This is because the
new block groups didn't get all their metadata persisted into the chunk and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.

Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:

[25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 
__btrfs_abort_transaction+0x50/0xfc [btrfs]()
[25230.404301] BTRFS: Transaction aborted (error -28)
[25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq 
ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd 
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse 
i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw 
thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg 
sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 
ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
[25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 
3.17.0-rc5-btrfs-next-1+ #1
[25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[25230.404328]   88004581bb08 813e7a13 
88004581bb50
[25230.404330]  88004581bb40 810423aa a049386a 
ffe4
[25230.404332]  a05214c0 240c 88010fc8f800 
88004581bba8
[25230.404334] Call Trace:
[25230.404338]  [813e7a13] dump_stack+0x45/0x56
[25230.404342]  [810423aa] warn_slowpath_common+0x7f/0x98
[25230.404351]  [a049386a] ? __btrfs_abort_transaction+0x50/0xfc 
[btrfs]
[25230.404353]  [8104240b] warn_slowpath_fmt+0x48/0x50
[25230.404362]  [a049386a] __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404374]  [a04a8c43] 
btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
[25230.404387]  [a04b77fd] __btrfs_end_transaction+0x7e/0x2de [btrfs]
[25230.404398]  [a04b7a6d] btrfs_end_transaction+0x10/0x12 [btrfs]
[25230.404408]  [a04a3d64] btrfs_check_data_free_space+0x111/0x1f0 
[btrfs]
[25230.404421]  [a04c53bd] __btrfs_buffered_write+0x160/0x48d [btrfs]
[25230.404425]  [811a9268] ? cap_inode_need_killpriv+0x2d/0x37
[25230.404429]  [810f6501] ? get_page+0x1a/0x2b
[25230.404441]  [a04c7c95] btrfs_file_write_iter+0x321/0x42f [btrfs]
[25230.404443]  [8110f5d9] ? handle_mm_fault+0x7f3/0x846
[25230.404446]  [813e98c5] ? mutex_unlock+0x16/0x18
[25230.404449]  [81138d68] new_sync_write+0x7c/0xa0
[25230.404450]  [81139401] vfs_write+0xb0/0x112
[25230.404452]  [81139c9d] SyS_pwrite64+0x66/0x84
[25230.404454]  [813ebf52] system_call_fastpath+0x16/0x1b
[25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
[25230.404458] BTRFS warning (device sdc): 
btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space 
left).
[25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No 
such entry (Failed lookup while freeing chunk.)

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent-tree.c | 5 +++--
 fs/btrfs/super.c   | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4bf8f02..0a5e770 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct 
btrfs_trans_handle *trans,
int ret = 0;
 
list_for_each_entry_safe(block_group, tmp, trans-new_bgs, bg_list) {
-   list_del_init(block_group-bg_list);
if (ret)
-   continue;
+   goto next;
 
spin_lock(block_group-lock);
memcpy(item, block_group-item, sizeof(item));
@@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct 
btrfs_trans_handle *trans,
   key.objectid, key.offset);
if (ret)
btrfs_abort_transaction(trans, extent_root, ret);
+next:
+   list_del_init(block_group-bg_list);
}
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a2b97ef..c9ab88c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -262,7 +262,7 @@ void __btrfs_abort_transaction(struct btrfs_trans_handle 
*trans,
   

[PATCH 5/6] Btrfs: fix race between writing free space cache and trimming

2014-11-26 Thread Filipe Manana
Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, perform the trim/discard
and then make the free space entries visible again.
Therefore while free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that an unmount (or crash/reboot)
after that transaction commit and mount again before another transaction
starts/commits, we will have some free space that won't be used again unless
the free space cache is rebuilt. After the unmount, fsck (btrfsck, btrfs check)
reports the issue like the following example:

*** fsck.btrfs output ***
checking extents
checking free space cache
There is no free space entry for 521764864-521781248
There is no free space entry for 521764864-1103101952
cache appears valid but isnt 29360128
Checking filesystem on /dev/sdc
UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
found 1235681286 bytes used err is -22
(...)

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/free-space-cache.c | 59 ++---
 fs/btrfs/free-space-cache.h |  2 ++
 fs/btrfs/inode-map.c|  2 ++
 3 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 16c2d39..6380863 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -31,6 +31,12 @@
 #define BITS_PER_BITMAP(PAGE_CACHE_SIZE * 8)
 #define MAX_CACHE_BYTES_PER_GIG(32 * 1024)
 
+struct btrfs_trim_range {
+   u64 start;
+   u64 bytes;
+   struct list_head list;
+};
+
 static int link_free_space(struct btrfs_free_space_ctl *ctl,
   struct btrfs_free_space *info);
 static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
@@ -881,6 +887,7 @@ int write_cache_extent_entries(struct io_ctl *io_ctl,
int ret;
struct btrfs_free_cluster *cluster = NULL;
struct rb_node *node = rb_first(ctl-free_space_offset);
+   struct btrfs_trim_range *trim_entry;
 
/* Get the cluster for this block_group if it exists */
if (block_group  !list_empty(block_group-cluster_list)) {
@@ -916,6 +923,21 @@ int write_cache_extent_entries(struct io_ctl *io_ctl,
cluster = NULL;
}
}
+
+   /*
+* Make sure we don't miss any range that was removed from our rbtree
+* because trimming is running. Otherwise after a umount+mount (or crash
+* after committing the transaction) we would leak free space and get
+* an inconsistent free space cache report from fsck.
+*/
+   list_for_each_entry(trim_entry, ctl-trimming_ranges, list) {
+   ret = io_ctl_add_entry(io_ctl, trim_entry-start,
+  trim_entry-bytes, NULL);
+   if (ret)
+   goto fail;
+   *entries += 1;
+   }
+
return 0;
 fail:
return -ENOSPC;
@@ -1135,10 +1157,12 @@ static int __btrfs_write_out_cache(struct btrfs_root 
*root, struct inode *inode,
 
io_ctl_set_generation(io_ctl, trans-transid);
 
+   mutex_lock(ctl-cache_writeout_mutex);
/* Write out the extent entries in the free space cache */
ret = write_cache_extent_entries(io_ctl, ctl,
 block_group, entries, bitmaps,
 bitmap_list);
+   mutex_unlock(ctl-cache_writeout_mutex);
if (ret)
goto out_nospc;
 
@@ -2295,6 +2319,8 @@ void btrfs_init_free_space_ctl(struct 
btrfs_block_group_cache *block_group)
ctl-start = block_group-key.objectid;
ctl-private = block_group;
ctl-op = free_space_op;
+   INIT_LIST_HEAD(ctl-trimming_ranges);
+   mutex_init(ctl-cache_writeout_mutex);
 
/*
 * we only want to have 32k of ram per block group for keeping
@@ -2911,10 +2937,12 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster 
*cluster)
 
 static int do_trimming(struct btrfs_block_group_cache *block_group,
   u64 *total_trimmed, u64 start, u64 bytes,
-  u64 reserved_start, u64 reserved_bytes)
+  u64 reserved_start, u64 reserved_bytes,
+  struct btrfs_trim_range *trim_entry)
 {
struct btrfs_space_info *space_info = block_group-space_info;
struct btrfs_fs_info *fs_info = block_group-fs_info;
+   struct btrfs_free_space_ctl *ctl = block_group-free_space_ctl;
int ret;
int update = 0;
u64 trimmed = 0;
@@ -2934,7 +2962,10 @@ static int do_trimming(struct btrfs_block_group_cache 
*block_group,
if (!ret)
*total_trimmed += trimmed;
 
+   mutex_lock(ctl-cache_writeout_mutex);
  

[PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard

2014-11-26 Thread Filipe Manana
This patchset fixes several issues exposed by block group removal/allocation and
trim/discard running in parallel.

The first 3 patches and the last one (6) are independent and don't depend on 
each
other. Patches 3 and 6 are not really related to trim/discard at all.

I bundled all these patches into the same patchset because all these issues were
triggered by the same test for xfstests that I prepared and will send out soon.
The issues were triggered on a qemu/kvm guest using scsi-hd drives with discard
support enabled (which makes the host do hole punching on the virtual disks'
image files), 4 virtual CPUs and 4Gb of ram. Some of these issues were hard to
trigger and happened like once for every 10+ runs of the test (each test run
takes nearly 30 minutes with my hardware and a debug kernel).

Filipe Manana (6):
  Btrfs: fix invalid block group rbtree access after bg is removed
  Btrfs: fix crash caused by block group removal
  Btrfs: fix freeing used extents after removing empty block group
  Btrfs: fix race between fs trimming and block group remove/allocation
  Btrfs: fix race between writing free space cache and trimming
  Btrfs: make btrfs_abort_transaction consider existence of new block
groups

 fs/btrfs/ctree.h| 14 ++-
 fs/btrfs/disk-io.c  | 14 +++
 fs/btrfs/extent-tree.c  | 90 ++---
 fs/btrfs/free-space-cache.c | 85 --
 fs/btrfs/free-space-cache.h |  2 +
 fs/btrfs/inode-map.c|  2 +
 fs/btrfs/super.c|  2 +-
 fs/btrfs/volumes.c  | 33 +
 8 files changed, 215 insertions(+), 27 deletions(-)

-- 
2.1.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group

2014-11-26 Thread Filipe Manana
There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commmit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info-unused_bgs and clear the range representing the whole
group from fs_info-freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info-freed_extents (pinned_extents).

That is:

 CPU 1CPU 2

  btrfs_delete_unused_bgs()
update_block_group()
   add block group to
   fs_info-unused_bgs
got block group from the list
clear_extent_bits for the whole
block group range in freed_extents[]
   set_extent_dirty for the
   range covering the freed
   extent in freed_extents[]
   (fs_info-pinned_extents)

  block group deleted, and a new block
  group with the same logical address is
  created

  reserve space from the new block group
  for new data or metadata - the reserved
  space overlaps the range specified by
  CPU 1 for set_extent_dirty()

  commit transaction
find all ranges marked as dirty in
fs_info-pinned_extents, clear them
and add them to the free space cache

Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpining
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:

[ 2163.426462] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC
[ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool 
dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd 
auth_rpc
gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc 
parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode 
serio_raw ext4 crc16 jbd2 mbcache
 sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common 
ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci 
virtio_ring virtio
[ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: GW  
3.17.0-rc5-btrfs-next-1+ #1
[ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2163.428875] task: 88009f2c0650 ti: 8801356bc000 task.ti: 
8801356bc000
[ 2163.429157] RIP: 0010:[a037728e]  [a037728e] 
unpin_extent_range.isra.58+0x62/0x192 [btrfs]
[ 2163.429562] RSP: 0018:8801356bfda8  EFLAGS: 00010246
[ 2163.429802] RAX:  RBX:  RCX: 
[ 2163.429990] RDX: 41bf RSI: 01c0 RDI: 880024307080
[ 2163.430042] RBP: 8801356bfde8 R08: 0068 R09: 88003734f118
[ 2163.430042] R10: 8801356bfcb8 R11: fb69 R12: 8800243070d0
[ 2163.430042] R13: 83c04000 R14: 8800751b0f00 R15: 880024307000
[ 2163.430042] FS:  () GS:88013f40() 
knlGS:
[ 2163.430042] CS:  0010 DS:  ES:  CR0: 8005003b
[ 2163.430042] CR2: 7ff10eb43fc0 CR3: 04cb8000 CR4: 06f0
[ 2163.430042] Stack:
[ 2163.430042]  8800243070d0 83c08000 83c07fff 
88012d6bc800
[ 2163.430042]  8800243070d0 8800751b0f18 8800751b0f00 

[ 2163.430042]  8801356bfe18 a037a481 83c04000 
83c07fff
[ 2163.430042] Call Trace:
[ 2163.430042]  [a037a481] btrfs_finish_extent_commit+0xac/0xbf 
[btrfs]
[ 2163.430042]  [a038c06d] btrfs_commit_transaction+0x6ee/0x882 
[btrfs]
[ 2163.430042]  [a03881f1] transaction_kthread+0xf2/0x1a4 [btrfs]
[ 2163.430042]  [a03880ff] ? btrfs_cleanup_transaction+0x3d8/0x3d8 
[btrfs]
[ 2163.430042]  [8105966b] kthread+0xb7/0xbf
[ 2163.430042]  [810595b4] ? __kthread_parkme+0x67/0x67
[ 2163.430042]  [813ebeac] ret_from_fork+0x7c/0xb0
[ 2163.430042]  [810595b4] ? __kthread_parkme+0x67/0x67

So fix this by making update_block_group() first set the range as dirty
in pinned_extents before adding the block group to the unused_bgs list.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent-tree.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)


[PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed

2014-11-26 Thread Filipe Manana
If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.

However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which can be no longer valid.

Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent-tree.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 744b580..3ba65d9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3162,7 +3162,19 @@ next_block_group(struct btrfs_root *root,
 struct btrfs_block_group_cache *cache)
 {
struct rb_node *node;
+
spin_lock(root-fs_info-block_group_cache_lock);
+
+   /* If our block group was removed, we need a full search. */
+   if (RB_EMPTY_NODE(cache-cache_node)) {
+   const u64 next_bytenr = cache-key.objectid + cache-key.offset;
+
+   spin_unlock(root-fs_info-block_group_cache_lock);
+   btrfs_put_block_group(cache);
+   cache = btrfs_lookup_first_block_group(root-fs_info,
+  next_bytenr);
+   return cache;
+   }
node = rb_next(cache-cache_node);
btrfs_put_block_group(cache);
if (node) {
@@ -9400,6 +9412,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
*trans,
spin_lock(root-fs_info-block_group_cache_lock);
rb_erase(block_group-cache_node,
 root-fs_info-block_group_cache_tree);
+   RB_CLEAR_NODE(block_group-cache_node);
 
if (root-fs_info-first_logical_byte == block_group-key.objectid)
root-fs_info-first_logical_byte = (u64)-1;
-- 
2.1.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation

2014-11-26 Thread Filipe Manana
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 
1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 
1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/ctree.h| 13 -
 fs/btrfs/disk-io.c  | 14 ++
 fs/btrfs/extent-tree.c  | 24 +++-
 fs/btrfs/free-space-cache.c | 26 +-
 fs/btrfs/volumes.c  | 33 ++---
 5 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f40a65..51056c7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
unsigned int dirty:1;
unsigned int iref:1;
unsigned int has_caching_ctl:1;
+   unsigned int removed:1;
 
int disk_cache_state;
 
@@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
 
/* For delayed block group creation or deletion of empty block groups */
struct list_head bg_list;
+
+   atomic_t trimming;
 };
 
 /* delayed seq elem */
@@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
 
/* For btrfs to record security options */
struct security_mnt_opts security_opts;
+
+   /*
+* Chunks that can't be freed yet (under a trim/discard operation)
+* and will be latter freed.
+*/
+   rwlock_t pinned_chunks_lock;
+   struct list_head pinned_chunks;
 };
 
 struct btrfs_subvolume_writers {
@@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
   u64 type, u64 chunk_objectid, u64 chunk_offset,
   u64 size);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-struct btrfs_root *root, u64 group_start);
+struct btrfs_root *root, u64 group_start,
+bool *remove_em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
   struct btrfs_root *root);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 

[PATCH 2/6] Btrfs: fix crash caused by block group removal

2014-11-26 Thread Filipe Manana
If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info-caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch Btrfs: remove empty block groups automatically.

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at 8801446eaeb8
[58380.439707] IP: [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c 
[btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 8001446ea060
[58380.441220] Oops:  [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd 
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse 
processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore 
thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic 
sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy 
ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: 
btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: GW  
3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti: 
88013896c000
[58380.443238] RIP: 0010:[a03f6d05]  [a03f6d05] 
block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:88013896fdd8  EFLAGS: 00010283
[58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX: 
[58380.443238] RDX:  RSI: 880185e16800 RDI: 8801446eaeb8
[58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09: 88013896fc60
[58380.443238] R10: 88013896fd28 R11:  R12: 880222cae000
[58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15: 8801446eae00
[58380.443238] FS:  () GS:88023ed8() 
knlGS:
[58380.443238] CS:  0010 DS:  ES:  CR0: 8005003b
[58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4: 06e0
[58380.443238] Stack:
[58380.443238]  88013896fe18 a03fe2d5 880222cae850 
880185e16800
[58380.443238]  88000dc41c20  8801a9ca9f00 

[58380.443238]  88013896fe80 a040fbcf 88018b0dcdb0 
88013ac82090
[58380.443238] Call Trace:
[58380.443238]  [a03fe2d5] btrfs_prepare_extent_commit+0x5a/0xd7 
[btrfs]
[58380.443238]  [a040fbcf] btrfs_commit_transaction+0x45c/0x882 
[btrfs]
[58380.443238]  [a040c058] transaction_kthread+0xf2/0x1a4 [btrfs]
[58380.443238]  [a040bf66] ? btrfs_cleanup_transaction+0x3d8/0x3d8 
[btrfs]
[58380.443238]  [8105966b] kthread+0xb7/0xbf
[58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67
[58380.443238]  [813ebeac] ret_from_fork+0x7c/0xb0
[58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/extent-tree.c | 27 +++
 2 files changed, 28 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d3ccd09..7f40a65 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
unsigned int ro:1;
unsigned int dirty:1;
unsigned int iref:1;
+   unsigned int has_caching_ctl:1;
 
int disk_cache_state;
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3ba65d9..b7e40ef 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -607,6 +607,7 @@ static int cache_block_group(struct btrfs_block_group_cache 
*cache,
cache-cached = BTRFS_CACHE_NO;
} else {
cache-cached = BTRFS_CACHE_STARTED;
+   cache-has_caching_ctl = 1;
}
}
spin_unlock(cache-lock);
@@ -627,6 +628,7 @@ static int cache_block_group(struct btrfs_block_group_cache 
*cache,
cache-cached = BTRFS_CACHE_NO;
} else {
cache-cached = BTRFS_CACHE_STARTED;
+   cache-has_caching_ctl = 1;
}
spin_unlock(cache-lock);
wake_up(caching_ctl-wait);
@@ -9328,6 +9330,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
*trans,
int ret;
int index;
int factor;
+   struct btrfs_caching_control *caching_ctl = NULL;
 
root = root-fs_info-extent_root;
 
@@ -9435,8 +9438,32 @@ int 

[PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

2014-11-26 Thread Filipe Manana
Stress btrfs' block group allocation and deallocation while running
fstrim in parallel. Part of the goal is also to get data block groups
deallocated so that new metadata block groups, using the same physical
device space ranges, get allocated while fstrim is running. This caused
several issues ranging from invalid memory accesses, kernel crashes,
metadata or data corruption, free space cache inconsistencies and free
space leaks.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 tests/btrfs/082 | 148 
 tests/btrfs/082.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 151 insertions(+)
 create mode 100755 tests/btrfs/082
 create mode 100644 tests/btrfs/082.out

diff --git a/tests/btrfs/082 b/tests/btrfs/082
new file mode 100755
index 000..8ac9f06
--- /dev/null
+++ b/tests/btrfs/082
@@ -0,0 +1,148 @@
+#! /bin/bash
+# FSQA Test No. 082
+#
+# Stress btrfs' block group allocation and deallocation while running fstrim in
+# parallel. Part of the goal is also to get data block groups deallocated so
+# that new metadata block groups, using the same physical device space ranges,
+# get allocated while fstrim is running. This caused several issues ranging
+# from invalid memory accesses, kernel crashes, metadata or data corruption,
+# free space cache inconsistencies and free space leaks.
+#
+# These issues were fixed by the following btrfs linux kernel patches:
+#
+#   Btrfs: fix invalid block group rbtree access after bg is removed
+#   Btrfs: fix crash caused by block group removal
+#   Btrfs: fix freeing used extents after removing empty block group
+#   Btrfs: fix race between fs trimming and block group remove/allocation
+#   Btrfs: fix race between writing free space cache and trimming
+#   Btrfs: make btrfs_abort_transaction consider existence of new block groups
+#
+# The issues were found on a qemu/kvm guest with 4 virtual CPUs, 4Gb of ram and
+# scsi-hd devices with discard support enabled (that means hole punching in the
+# disk's image file is performed by the host).
+#
+#---
+#
+# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana fdman...@suse.com
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo QA output created by $seq
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap _cleanup; exit \$status 0 1 2 3 15
+
+_cleanup()
+{
+   rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_nocheck
+_require_fstrim
+
+rm -f $seqres.full
+
+# Keep allocating and deallocating 2G of data space with the goal of creating
+# and deleting 2 block groups constantly. The intention is to race with the
+# fstrim loop below.
+fallocate_loop()
+{
+   local name=$1
+   while true; do
+   $XFS_IO_PROG -f -c falloc -k 0 2G \
+   $SCRATCH_MNT/$name  /dev/null
+   sleep 3
+   $XFS_IO_PROG -c truncate 0 \
+   $SCRATCH_MNT/$name  /dev/null
+   sleep 3
+   done
+}
+
+trim_loop()
+{
+   while true; do
+   $FSTRIM_PROG $SCRATCH_MNT
+   done
+}
+
+# Create a bunch of small files that get their single extent inlined in the
+# btree, so that we consume a lot of metadata space and get a chance of a
+# data block group getting deleted and reused for metadata later. Sometimes
+# the creation of all these files succeeds other times we get ENOSPC failures
+# at some point - this depends on how fast the btrfs' cleaner kthread is
+# notified about empty block groups, how fast it deletes them and how fast
+# the fallocate calls happen. So we don't really care if they all succeed or
+# not, the goal is just to keep metadata space usage growing while data block
+# groups are deleted.
+create_files()
+{
+   local prefix=$1
+
+   for ((i = 1; i = 40; i++)); do
+   echo Creating file ${prefix}_$i $seqres.full 21
+   $XFS_IO_PROG -f -c pwrite -S 0xaa 0 3900 \
+   $SCRATCH_MNT/${prefix}_$i $seqres.full 21
+   ret=$?
+  

Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal

2014-11-26 Thread Josef Bacik

On 11/26/2014 10:28 AM, Filipe Manana wrote:

If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info-caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch Btrfs: remove empty block groups automatically.

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at 8801446eaeb8
[58380.439707] IP: [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c 
[btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 8001446ea060
[58380.441220] Oops:  [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd 
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse 
processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore 
thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic 
sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy 
ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: 
btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: GW  
3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti: 
88013896c000
[58380.443238] RIP: 0010:[a03f6d05]  [a03f6d05] 
block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:88013896fdd8  EFLAGS: 00010283
[58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX: 
[58380.443238] RDX:  RSI: 880185e16800 RDI: 8801446eaeb8
[58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09: 88013896fc60
[58380.443238] R10: 88013896fd28 R11:  R12: 880222cae000
[58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15: 8801446eae00
[58380.443238] FS:  () GS:88023ed8() 
knlGS:
[58380.443238] CS:  0010 DS:  ES:  CR0: 8005003b
[58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4: 06e0
[58380.443238] Stack:
[58380.443238]  88013896fe18 a03fe2d5 880222cae850 
880185e16800
[58380.443238]  88000dc41c20  8801a9ca9f00 

[58380.443238]  88013896fe80 a040fbcf 88018b0dcdb0 
88013ac82090
[58380.443238] Call Trace:
[58380.443238]  [a03fe2d5] btrfs_prepare_extent_commit+0x5a/0xd7 
[btrfs]
[58380.443238]  [a040fbcf] btrfs_commit_transaction+0x45c/0x882 
[btrfs]
[58380.443238]  [a040c058] transaction_kthread+0xf2/0x1a4 [btrfs]
[58380.443238]  [a040bf66] ? btrfs_cleanup_transaction+0x3d8/0x3d8 
[btrfs]
[58380.443238]  [8105966b] kthread+0xb7/0xbf
[58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67
[58380.443238]  [813ebeac] ret_from_fork+0x7c/0xb0
[58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana fdman...@suse.com
---
  fs/btrfs/ctree.h   |  1 +
  fs/btrfs/extent-tree.c | 27 +++
  2 files changed, 28 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d3ccd09..7f40a65 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
unsigned int ro:1;
unsigned int dirty:1;
unsigned int iref:1;
+   unsigned int has_caching_ctl:1;



Don't do this, just unconditionally call get_caching_control in 
btrfs_remove_block_group, then if we get something we can do stuff, 
otherwise we can just continue.  Thanks,


Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed

2014-11-26 Thread Josef Bacik

On 11/26/2014 10:28 AM, Filipe Manana wrote:

If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.

However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which can be no longer valid.

Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.

Signed-off-by: Filipe Manana fdman...@suse.com


Reviewed-by: Josef Bacik jba...@fb.com

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group

2014-11-26 Thread Josef Bacik

On 11/26/2014 10:28 AM, Filipe Manana wrote:

There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commmit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info-unused_bgs and clear the range representing the whole
group from fs_info-freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info-freed_extents (pinned_extents).

That is:

  CPU 1CPU 2

   btrfs_delete_unused_bgs()
update_block_group()
add block group to
fs_info-unused_bgs
 got block group from the list
 clear_extent_bits for the whole
 block group range in freed_extents[]
set_extent_dirty for the
range covering the freed
extent in freed_extents[]
(fs_info-pinned_extents)

   block group deleted, and a new block
   group with the same logical address is
   created

   reserve space from the new block group
   for new data or metadata - the reserved
   space overlaps the range specified by
   CPU 1 for set_extent_dirty()

   commit transaction
 find all ranges marked as dirty in
 fs_info-pinned_extents, clear them
 and add them to the free space cache

Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpining
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:

[ 2163.426462] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC
[ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool 
dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd 
auth_rpc
gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc 
parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode 
serio_raw ext4 crc16 jbd2 mbcache
  sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common 
ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci 
virtio_ring virtio
[ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: GW  
3.17.0-rc5-btrfs-next-1+ #1
[ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2163.428875] task: 88009f2c0650 ti: 8801356bc000 task.ti: 
8801356bc000
[ 2163.429157] RIP: 0010:[a037728e]  [a037728e] 
unpin_extent_range.isra.58+0x62/0x192 [btrfs]
[ 2163.429562] RSP: 0018:8801356bfda8  EFLAGS: 00010246
[ 2163.429802] RAX:  RBX:  RCX: 
[ 2163.429990] RDX: 41bf RSI: 01c0 RDI: 880024307080
[ 2163.430042] RBP: 8801356bfde8 R08: 0068 R09: 88003734f118
[ 2163.430042] R10: 8801356bfcb8 R11: fb69 R12: 8800243070d0
[ 2163.430042] R13: 83c04000 R14: 8800751b0f00 R15: 880024307000
[ 2163.430042] FS:  () GS:88013f40() 
knlGS:
[ 2163.430042] CS:  0010 DS:  ES:  CR0: 8005003b
[ 2163.430042] CR2: 7ff10eb43fc0 CR3: 04cb8000 CR4: 06f0
[ 2163.430042] Stack:
[ 2163.430042]  8800243070d0 83c08000 83c07fff 
88012d6bc800
[ 2163.430042]  8800243070d0 8800751b0f18 8800751b0f00 

[ 2163.430042]  8801356bfe18 a037a481 83c04000 
83c07fff
[ 2163.430042] Call Trace:
[ 2163.430042]  [a037a481] btrfs_finish_extent_commit+0xac/0xbf 
[btrfs]
[ 2163.430042]  [a038c06d] btrfs_commit_transaction+0x6ee/0x882 
[btrfs]
[ 2163.430042]  [a03881f1] transaction_kthread+0xf2/0x1a4 [btrfs]
[ 2163.430042]  [a03880ff] ? btrfs_cleanup_transaction+0x3d8/0x3d8 
[btrfs]
[ 2163.430042]  [8105966b] kthread+0xb7/0xbf
[ 2163.430042]  [810595b4] ? __kthread_parkme+0x67/0x67
[ 2163.430042]  [813ebeac] ret_from_fork+0x7c/0xb0
[ 2163.430042]  [810595b4] ? __kthread_parkme+0x67/0x67

So fix this by making update_block_group() first set the range as dirty
in pinned_extents before adding the block group to the unused_bgs list.

Signed-off-by: Filipe Manana fdman...@suse.com
---
  fs/btrfs/extent-tree.c | 21 

Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups

2014-11-26 Thread Josef Bacik

On 11/26/2014 10:28 AM, Filipe Manana wrote:

If the transaction handle doesn't have used blocks but has created new block
groups make sure we turn the fs into readonly mode too. This is because the
new block groups didn't get all their metadata persisted into the chunk and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.

Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:

[25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 
__btrfs_abort_transaction+0x50/0xfc [btrfs]()
[25230.404301] BTRFS: Transaction aborted (error -28)
[25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq 
ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd 
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse 
i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw 
thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg 
sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 
ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
[25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 
3.17.0-rc5-btrfs-next-1+ #1
[25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[25230.404328]   88004581bb08 813e7a13 
88004581bb50
[25230.404330]  88004581bb40 810423aa a049386a 
ffe4
[25230.404332]  a05214c0 240c 88010fc8f800 
88004581bba8
[25230.404334] Call Trace:
[25230.404338]  [813e7a13] dump_stack+0x45/0x56
[25230.404342]  [810423aa] warn_slowpath_common+0x7f/0x98
[25230.404351]  [a049386a] ? __btrfs_abort_transaction+0x50/0xfc 
[btrfs]
[25230.404353]  [8104240b] warn_slowpath_fmt+0x48/0x50
[25230.404362]  [a049386a] __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404374]  [a04a8c43] 
btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
[25230.404387]  [a04b77fd] __btrfs_end_transaction+0x7e/0x2de [btrfs]
[25230.404398]  [a04b7a6d] btrfs_end_transaction+0x10/0x12 [btrfs]
[25230.404408]  [a04a3d64] btrfs_check_data_free_space+0x111/0x1f0 
[btrfs]
[25230.404421]  [a04c53bd] __btrfs_buffered_write+0x160/0x48d [btrfs]
[25230.404425]  [811a9268] ? cap_inode_need_killpriv+0x2d/0x37
[25230.404429]  [810f6501] ? get_page+0x1a/0x2b
[25230.404441]  [a04c7c95] btrfs_file_write_iter+0x321/0x42f [btrfs]
[25230.404443]  [8110f5d9] ? handle_mm_fault+0x7f3/0x846
[25230.404446]  [813e98c5] ? mutex_unlock+0x16/0x18
[25230.404449]  [81138d68] new_sync_write+0x7c/0xa0
[25230.404450]  [81139401] vfs_write+0xb0/0x112
[25230.404452]  [81139c9d] SyS_pwrite64+0x66/0x84
[25230.404454]  [813ebf52] system_call_fastpath+0x16/0x1b
[25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
[25230.404458] BTRFS warning (device sdc): 
btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space 
left).
[25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No 
such entry (Failed lookup while freeing chunk.)

Signed-off-by: Filipe Manana fdman...@suse.com
---
  fs/btrfs/extent-tree.c | 5 +++--
  fs/btrfs/super.c   | 2 +-
  2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4bf8f02..0a5e770 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct 
btrfs_trans_handle *trans,
int ret = 0;

list_for_each_entry_safe(block_group, tmp, trans-new_bgs, bg_list) {
-   list_del_init(block_group-bg_list);
if (ret)
-   continue;
+   goto next;

spin_lock(block_group-lock);
memcpy(item, block_group-item, sizeof(item));
@@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct 
btrfs_trans_handle *trans,
   key.objectid, key.offset);
if (ret)
btrfs_abort_transaction(trans, extent_root, ret);
+next:
+   list_del_init(block_group-bg_list);
}
  }



I don't understand this change, logically it seems the same as what we 
had before.  Thanks,


Josef
--
To unsubscribe from this list: send the line unsubscribe 

Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal

2014-11-26 Thread Filipe David Manana
On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik jba...@fb.com wrote:
 On 11/26/2014 10:28 AM, Filipe Manana wrote:

 If we remove a block group (because it became empty), we might have left
 a caching_ctl structure in fs_info-caching_block_groups that points to
 the block group and is accessed at transaction commit time. This results
 in accessing an invalid or incorrect block group. This issue became
 visible
 after Josef's patch Btrfs: remove empty block groups automatically.

 So if the block group is removed make sure we don't leave a dangling
 caching_ctl in caching_block_groups.

 Sample crash trace:

 [58380.439449] BUG: unable to handle kernel paging request at
 8801446eaeb8
 [58380.439707] IP: [a03f6d05]
 block_group_cache_done.isra.21+0xc/0x1c [btrfs]
 [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE
 8001446ea060
 [58380.441220] Oops:  [#1] SMP DEBUG_PAGEALLOC
 [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd
 auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse
 processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore
 thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom
 ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
 virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring
 virtio [last unloaded: btrfs]
 [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: GW
 3.17.0-rc5-btrfs-next-1+ #1
 [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
 rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
 [58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti:
 88013896c000
 [58380.443238] RIP: 0010:[a03f6d05]  [a03f6d05]
 block_group_cache_done.isra.21+0xc/0x1c [btrfs]
 [58380.443238] RSP: 0018:88013896fdd8  EFLAGS: 00010283
 [58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX:
 
 [58380.443238] RDX:  RSI: 880185e16800 RDI:
 8801446eaeb8
 [58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09:
 88013896fc60
 [58380.443238] R10: 88013896fd28 R11:  R12:
 880222cae000
 [58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15:
 8801446eae00
 [58380.443238] FS:  () GS:88023ed8()
 knlGS:
 [58380.443238] CS:  0010 DS:  ES:  CR0: 8005003b
 [58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4:
 06e0
 [58380.443238] Stack:
 [58380.443238]  88013896fe18 a03fe2d5 880222cae850
 880185e16800
 [58380.443238]  88000dc41c20  8801a9ca9f00
 
 [58380.443238]  88013896fe80 a040fbcf 88018b0dcdb0
 88013ac82090
 [58380.443238] Call Trace:
 [58380.443238]  [a03fe2d5] btrfs_prepare_extent_commit+0x5a/0xd7
 [btrfs]
 [58380.443238]  [a040fbcf] btrfs_commit_transaction+0x45c/0x882
 [btrfs]
 [58380.443238]  [a040c058] transaction_kthread+0xf2/0x1a4
 [btrfs]
 [58380.443238]  [a040bf66] ?
 btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
 [58380.443238]  [8105966b] kthread+0xb7/0xbf
 [58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67
 [58380.443238]  [813ebeac] ret_from_fork+0x7c/0xb0
 [58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67

 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
   fs/btrfs/ctree.h   |  1 +
   fs/btrfs/extent-tree.c | 27 +++
   2 files changed, 28 insertions(+)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index d3ccd09..7f40a65 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
 unsigned int ro:1;
 unsigned int dirty:1;
 unsigned int iref:1;
 +   unsigned int has_caching_ctl:1;


 Don't do this, just unconditionally call get_caching_control in
 btrfs_remove_block_group, then if we get something we can do stuff,
 otherwise we can just continue.  Thanks,

That's what I initially thought too. However, get_caching_control only
returns us the caching_ctl if block_group-cached ==
BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The
has_caching_ctl flag is just to avoid holding the semaphore and search
through the list (since block_group-caching_ctl can be NULL but a
caching_ctl point to the block group can be in the list).


 Josef

 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org

Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation

2014-11-26 Thread Josef Bacik

On 11/26/2014 10:28 AM, Filipe Manana wrote:

Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

 checking extents
 Check tree block failed, want=833912832, have=0
 Check tree block failed, want=833912832, have=0
 Check tree block failed, want=833912832, have=0
 Check tree block failed, want=833912832, have=0
 Check tree block failed, want=833912832, have=0
 read block failed check_tree_block
 owner ref check failed [833912832 16384]
 Errors found in extent allocation tree or chunk allocation
 checking free space cache
 checking fs roots
 Check tree block failed, want=833912832, have=0
 Check tree block failed, want=833912832, have=0
 Check tree block failed, want=833912832, have=0
 Check tree block failed, want=833912832, have=0
 Check tree block failed, want=833912832, have=0
 read block failed check_tree_block
 root 5 root dir 256 error
 root 5 inode 260 errors 2001, no inode item, link count wrong
 unresolved ref dir 256 index 0 namelen 8 name foobar_3 
filetype 1 errors 6, no dir index, no inode ref
 root 5 inode 262 errors 2001, no inode item, link count wrong
 unresolved ref dir 256 index 0 namelen 8 name foobar_5 
filetype 1 errors 6, no dir index, no inode ref
 root 5 inode 263 errors 2001, no inode item, link count wrong
 (...)

Signed-off-by: Filipe Manana fdman...@suse.com
---
  fs/btrfs/ctree.h| 13 -
  fs/btrfs/disk-io.c  | 14 ++
  fs/btrfs/extent-tree.c  | 24 +++-
  fs/btrfs/free-space-cache.c | 26 +-
  fs/btrfs/volumes.c  | 33 ++---
  5 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f40a65..51056c7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
unsigned int dirty:1;
unsigned int iref:1;
unsigned int has_caching_ctl:1;
+   unsigned int removed:1;

int disk_cache_state;

@@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {

/* For delayed block group creation or deletion of empty block groups */
struct list_head bg_list;
+
+   atomic_t trimming;
  };

  /* delayed seq elem */
@@ -1731,6 +1734,13 @@ struct btrfs_fs_info {

/* For btrfs to record security options */
struct security_mnt_opts security_opts;
+
+   /*
+* Chunks that can't be freed yet (under a trim/discard operation)
+* and will be latter freed.
+*/
+   rwlock_t pinned_chunks_lock;
+   struct list_head pinned_chunks;
  };

  struct btrfs_subvolume_writers {
@@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
   u64 type, u64 chunk_objectid, u64 chunk_offset,
   u64 size);
  int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-struct btrfs_root *root, u64 group_start);
+struct btrfs_root *root, u64 group_start,
+bool *remove_em);
  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
   struct 

Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups

2014-11-26 Thread Filipe David Manana
On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik jba...@fb.com wrote:
 On 11/26/2014 10:28 AM, Filipe Manana wrote:

 If the transaction handle doesn't have used blocks but has created new
 block
 groups make sure we turn the fs into readonly mode too. This is because
 the
 new block groups didn't get all their metadata persisted into the chunk
 and
 device trees, and therefore if a subsequent transaction starts, allocates
 space from the new block groups, writes data or metadata into that space,
 commits successfully and then after we unmount and mount the filesystem
 again, the same space can be allocated again for a new block group,
 resulting in file data or metadata corruption.

 Example where we don't abort the transaction when we fail to finish the
 chunk allocation (add items to the chunk and device trees) and later a
 future transaction where the block group is removed fails because it can't
 find the chunk item in the chunk tree:

 [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260
 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
 [25230.404301] BTRFS: Transaction aborted (error -28)
 [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor
 raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2
 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop
 psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr
 serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom
 ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
 virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod
 virtio [last unloaded: btrfs]
 [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted
 3.17.0-rc5-btrfs-next-1+ #1
 [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
 rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
 [25230.404328]   88004581bb08 813e7a13
 88004581bb50
 [25230.404330]  88004581bb40 810423aa a049386a
 ffe4
 [25230.404332]  a05214c0 240c 88010fc8f800
 88004581bba8
 [25230.404334] Call Trace:
 [25230.404338]  [813e7a13] dump_stack+0x45/0x56
 [25230.404342]  [810423aa] warn_slowpath_common+0x7f/0x98
 [25230.404351]  [a049386a] ? __btrfs_abort_transaction+0x50/0xfc
 [btrfs]
 [25230.404353]  [8104240b] warn_slowpath_fmt+0x48/0x50
 [25230.404362]  [a049386a] __btrfs_abort_transaction+0x50/0xfc
 [btrfs]
 [25230.404374]  [a04a8c43]
 btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
 [25230.404387]  [a04b77fd] __btrfs_end_transaction+0x7e/0x2de
 [btrfs]
 [25230.404398]  [a04b7a6d] btrfs_end_transaction+0x10/0x12
 [btrfs]
 [25230.404408]  [a04a3d64]
 btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
 [25230.404421]  [a04c53bd] __btrfs_buffered_write+0x160/0x48d
 [btrfs]
 [25230.404425]  [811a9268] ? cap_inode_need_killpriv+0x2d/0x37
 [25230.404429]  [810f6501] ? get_page+0x1a/0x2b
 [25230.404441]  [a04c7c95] btrfs_file_write_iter+0x321/0x42f
 [btrfs]
 [25230.404443]  [8110f5d9] ? handle_mm_fault+0x7f3/0x846
 [25230.404446]  [813e98c5] ? mutex_unlock+0x16/0x18
 [25230.404449]  [81138d68] new_sync_write+0x7c/0xa0
 [25230.404450]  [81139401] vfs_write+0xb0/0x112
 [25230.404452]  [81139c9d] SyS_pwrite64+0x66/0x84
 [25230.404454]  [813ebf52] system_call_fastpath+0x16/0x1b
 [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
 [25230.404458] BTRFS warning (device sdc):
 btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space
 left).
 [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509:
 errno=-2 No such entry (Failed lookup while freeing chunk.)

 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
   fs/btrfs/extent-tree.c | 5 +++--
   fs/btrfs/super.c   | 2 +-
   2 files changed, 4 insertions(+), 3 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 4bf8f02..0a5e770 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct
 btrfs_trans_handle *trans,
 int ret = 0;

 list_for_each_entry_safe(block_group, tmp, trans-new_bgs,
 bg_list) {
 -   list_del_init(block_group-bg_list);
 if (ret)
 -   continue;
 +   goto next;

 spin_lock(block_group-lock);
 memcpy(item, block_group-item, sizeof(item));
 @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct
 btrfs_trans_handle *trans,
key.objectid, key.offset);
 if (ret)
 btrfs_abort_transaction(trans, extent_root, ret);
 +next:
 +   list_del_init(block_group-bg_list);
 }
   }


 I don't understand 

Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups

2014-11-26 Thread Josef Bacik

On 11/26/2014 11:15 AM, Filipe David Manana wrote:

On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik jba...@fb.com wrote:

On 11/26/2014 10:28 AM, Filipe Manana wrote:


If the transaction handle doesn't have used blocks but has created new
block
groups make sure we turn the fs into readonly mode too. This is because
the
new block groups didn't get all their metadata persisted into the chunk
and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.

Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:

[25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260
__btrfs_abort_transaction+0x50/0xfc [btrfs]()
[25230.404301] BTRFS: Transaction aborted (error -28)
[25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor
raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2
dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop
psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr
serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom
ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod
virtio [last unloaded: btrfs]
[25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted
3.17.0-rc5-btrfs-next-1+ #1
[25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[25230.404328]   88004581bb08 813e7a13
88004581bb50
[25230.404330]  88004581bb40 810423aa a049386a
ffe4
[25230.404332]  a05214c0 240c 88010fc8f800
88004581bba8
[25230.404334] Call Trace:
[25230.404338]  [813e7a13] dump_stack+0x45/0x56
[25230.404342]  [810423aa] warn_slowpath_common+0x7f/0x98
[25230.404351]  [a049386a] ? __btrfs_abort_transaction+0x50/0xfc
[btrfs]
[25230.404353]  [8104240b] warn_slowpath_fmt+0x48/0x50
[25230.404362]  [a049386a] __btrfs_abort_transaction+0x50/0xfc
[btrfs]
[25230.404374]  [a04a8c43]
btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
[25230.404387]  [a04b77fd] __btrfs_end_transaction+0x7e/0x2de
[btrfs]
[25230.404398]  [a04b7a6d] btrfs_end_transaction+0x10/0x12
[btrfs]
[25230.404408]  [a04a3d64]
btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
[25230.404421]  [a04c53bd] __btrfs_buffered_write+0x160/0x48d
[btrfs]
[25230.404425]  [811a9268] ? cap_inode_need_killpriv+0x2d/0x37
[25230.404429]  [810f6501] ? get_page+0x1a/0x2b
[25230.404441]  [a04c7c95] btrfs_file_write_iter+0x321/0x42f
[btrfs]
[25230.404443]  [8110f5d9] ? handle_mm_fault+0x7f3/0x846
[25230.404446]  [813e98c5] ? mutex_unlock+0x16/0x18
[25230.404449]  [81138d68] new_sync_write+0x7c/0xa0
[25230.404450]  [81139401] vfs_write+0xb0/0x112
[25230.404452]  [81139c9d] SyS_pwrite64+0x66/0x84
[25230.404454]  [813ebf52] system_call_fastpath+0x16/0x1b
[25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
[25230.404458] BTRFS warning (device sdc):
btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space
left).
[25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509:
errno=-2 No such entry (Failed lookup while freeing chunk.)

Signed-off-by: Filipe Manana fdman...@suse.com
---
   fs/btrfs/extent-tree.c | 5 +++--
   fs/btrfs/super.c   | 2 +-
   2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4bf8f02..0a5e770 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct
btrfs_trans_handle *trans,
 int ret = 0;

 list_for_each_entry_safe(block_group, tmp, trans-new_bgs,
bg_list) {
-   list_del_init(block_group-bg_list);
 if (ret)
-   continue;
+   goto next;

 spin_lock(block_group-lock);
 memcpy(item, block_group-item, sizeof(item));
@@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct
btrfs_trans_handle *trans,
key.objectid, key.offset);
 if (ret)
 btrfs_abort_transaction(trans, extent_root, ret);
+next:
+   list_del_init(block_group-bg_list);
 }
   }



I don't understand this change, logically it seems the 

Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation

2014-11-26 Thread Filipe David Manana
On Wed, Nov 26, 2014 at 4:15 PM, Josef Bacik jba...@fb.com wrote:
 On 11/26/2014 10:28 AM, Filipe Manana wrote:

 Our fs trim operation, which is completely transactionless (doesn't start
 or joins an existing transaction) consists of visiting all block groups
 and then for each one to iterate its free space entries and perform a
 discard operation against the space range represented by the free space
 entries. However before performing a discard, the corresponding free space
 entry is removed from the free space rbtree, and when the discard
 completes
 it is added back to the free space rbtree.

 If a block group remove operation happens while the discard is ongoing (or
 before it starts and after a free space entry is hidden), we end up not
 waiting for the discard to complete, remove the extent map that maps
 logical address to physical addresses and the corresponding chunk metadata
 from the the chunk and device trees. After that and before the discard
 completes, the current running transaction can finish and a new one start,
 allowing for new block groups that map to the same physical addresses to
 be allocated and written to.

 So fix this by keeping the extent map in memory until the discard
 completes
 so that the same physical addresses aren't reused before it completes.

 If the physical locations that are under a discard operation end up being
 used for a new metadata block group for example, and dirty metadata
 extents
 are written before the discard finishes (the VM might call writepages() of
 our btree inode's i_mapping for example, or an fsync log commit happens)
 we
 end up overwriting metadata with zeroes, which leads to errors from fsck
 like the following:

  checking extents
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  read block failed check_tree_block
  owner ref check failed [833912832 16384]
  Errors found in extent allocation tree or chunk allocation
  checking free space cache
  checking fs roots
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  read block failed check_tree_block
  root 5 root dir 256 error
  root 5 inode 260 errors 2001, no inode item, link count wrong
  unresolved ref dir 256 index 0 namelen 8 name foobar_3
 filetype 1 errors 6, no dir index, no inode ref
  root 5 inode 262 errors 2001, no inode item, link count wrong
  unresolved ref dir 256 index 0 namelen 8 name foobar_5
 filetype 1 errors 6, no dir index, no inode ref
  root 5 inode 263 errors 2001, no inode item, link count wrong
  (...)

 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
   fs/btrfs/ctree.h| 13 -
   fs/btrfs/disk-io.c  | 14 ++
   fs/btrfs/extent-tree.c  | 24 +++-
   fs/btrfs/free-space-cache.c | 26 +-
   fs/btrfs/volumes.c  | 33 ++---
   5 files changed, 100 insertions(+), 10 deletions(-)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 7f40a65..51056c7 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
 unsigned int dirty:1;
 unsigned int iref:1;
 unsigned int has_caching_ctl:1;
 +   unsigned int removed:1;

 int disk_cache_state;

 @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {

 /* For delayed block group creation or deletion of empty block
 groups */
 struct list_head bg_list;
 +
 +   atomic_t trimming;
   };

   /* delayed seq elem */
 @@ -1731,6 +1734,13 @@ struct btrfs_fs_info {

 /* For btrfs to record security options */
 struct security_mnt_opts security_opts;
 +
 +   /*
 +* Chunks that can't be freed yet (under a trim/discard operation)
 +* and will be latter freed.
 +*/
 +   rwlock_t pinned_chunks_lock;
 +   struct list_head pinned_chunks;
   };

   struct btrfs_subvolume_writers {
 @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle
 *trans,
u64 type, u64 chunk_objectid, u64 chunk_offset,
u64 size);
   int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 -struct btrfs_root *root, u64 group_start);
 +struct btrfs_root *root, u64 group_start,
 +bool *remove_em);
   void 

Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal

2014-11-26 Thread Josef Bacik

On 11/26/2014 11:09 AM, Filipe David Manana wrote:

On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik jba...@fb.com wrote:

On 11/26/2014 10:28 AM, Filipe Manana wrote:


If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info-caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became
visible
after Josef's patch Btrfs: remove empty block groups automatically.

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at
8801446eaeb8
[58380.439707] IP: [a03f6d05]
block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE
8001446ea060
[58380.441220] Oops:  [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse
processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore
thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom
ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring
virtio [last unloaded: btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: GW
3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti:
88013896c000
[58380.443238] RIP: 0010:[a03f6d05]  [a03f6d05]
block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:88013896fdd8  EFLAGS: 00010283
[58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX:

[58380.443238] RDX:  RSI: 880185e16800 RDI:
8801446eaeb8
[58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09:
88013896fc60
[58380.443238] R10: 88013896fd28 R11:  R12:
880222cae000
[58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15:
8801446eae00
[58380.443238] FS:  () GS:88023ed8()
knlGS:
[58380.443238] CS:  0010 DS:  ES:  CR0: 8005003b
[58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4:
06e0
[58380.443238] Stack:
[58380.443238]  88013896fe18 a03fe2d5 880222cae850
880185e16800
[58380.443238]  88000dc41c20  8801a9ca9f00

[58380.443238]  88013896fe80 a040fbcf 88018b0dcdb0
88013ac82090
[58380.443238] Call Trace:
[58380.443238]  [a03fe2d5] btrfs_prepare_extent_commit+0x5a/0xd7
[btrfs]
[58380.443238]  [a040fbcf] btrfs_commit_transaction+0x45c/0x882
[btrfs]
[58380.443238]  [a040c058] transaction_kthread+0xf2/0x1a4
[btrfs]
[58380.443238]  [a040bf66] ?
btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[58380.443238]  [8105966b] kthread+0xb7/0xbf
[58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67
[58380.443238]  [813ebeac] ret_from_fork+0x7c/0xb0
[58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana fdman...@suse.com
---
   fs/btrfs/ctree.h   |  1 +
   fs/btrfs/extent-tree.c | 27 +++
   2 files changed, 28 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d3ccd09..7f40a65 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
 unsigned int ro:1;
 unsigned int dirty:1;
 unsigned int iref:1;
+   unsigned int has_caching_ctl:1;



Don't do this, just unconditionally call get_caching_control in
btrfs_remove_block_group, then if we get something we can do stuff,
otherwise we can just continue.  Thanks,


That's what I initially thought too. However, get_caching_control only
returns us the caching_ctl if block_group-cached ==
BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The
has_caching_ctl flag is just to avoid holding the semaphore and search
through the list (since block_group-caching_ctl can be NULL but a
caching_ctl point to the block group can be in the list).



Oh God that's not good, we need to change get_caching_control to return 
if there is a caching control at all, since the other users want to wait 
for the fast caching to finish too.  So change that and then use it 
unconditionally.  I bet this has been causing us the random early ENOSPC 
problems.  Thanks,


Josef

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups

2014-11-26 Thread Filipe David Manana
On Wed, Nov 26, 2014 at 4:19 PM, Josef Bacik jba...@fb.com wrote:
 On 11/26/2014 11:15 AM, Filipe David Manana wrote:

 On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik jba...@fb.com wrote:

 On 11/26/2014 10:28 AM, Filipe Manana wrote:


 If the transaction handle doesn't have used blocks but has created new
 block
 groups make sure we turn the fs into readonly mode too. This is because
 the
 new block groups didn't get all their metadata persisted into the chunk
 and
 device trees, and therefore if a subsequent transaction starts,
 allocates
 space from the new block groups, writes data or metadata into that
 space,
 commits successfully and then after we unmount and mount the filesystem
 again, the same space can be allocated again for a new block group,
 resulting in file data or metadata corruption.

 Example where we don't abort the transaction when we fail to finish the
 chunk allocation (add items to the chunk and device trees) and later a
 future transaction where the block group is removed fails because it
 can't
 find the chunk item in the chunk tree:

 [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260
 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
 [25230.404301] BTRFS: Transaction aborted (error -28)
 [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor
 raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2
 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc
 loop
 psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr
 serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod
 cdrom
 ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
 virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod
 virtio [last unloaded: btrfs]
 [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted
 3.17.0-rc5-btrfs-next-1+ #1
 [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
 BIOS
 rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
 [25230.404328]   88004581bb08 813e7a13
 88004581bb50
 [25230.404330]  88004581bb40 810423aa a049386a
 ffe4
 [25230.404332]  a05214c0 240c 88010fc8f800
 88004581bba8
 [25230.404334] Call Trace:
 [25230.404338]  [813e7a13] dump_stack+0x45/0x56
 [25230.404342]  [810423aa] warn_slowpath_common+0x7f/0x98
 [25230.404351]  [a049386a] ?
 __btrfs_abort_transaction+0x50/0xfc
 [btrfs]
 [25230.404353]  [8104240b] warn_slowpath_fmt+0x48/0x50
 [25230.404362]  [a049386a] __btrfs_abort_transaction+0x50/0xfc
 [btrfs]
 [25230.404374]  [a04a8c43]
 btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
 [25230.404387]  [a04b77fd] __btrfs_end_transaction+0x7e/0x2de
 [btrfs]
 [25230.404398]  [a04b7a6d] btrfs_end_transaction+0x10/0x12
 [btrfs]
 [25230.404408]  [a04a3d64]
 btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
 [25230.404421]  [a04c53bd] __btrfs_buffered_write+0x160/0x48d
 [btrfs]
 [25230.404425]  [811a9268] ? cap_inode_need_killpriv+0x2d/0x37
 [25230.404429]  [810f6501] ? get_page+0x1a/0x2b
 [25230.404441]  [a04c7c95] btrfs_file_write_iter+0x321/0x42f
 [btrfs]
 [25230.404443]  [8110f5d9] ? handle_mm_fault+0x7f3/0x846
 [25230.404446]  [813e98c5] ? mutex_unlock+0x16/0x18
 [25230.404449]  [81138d68] new_sync_write+0x7c/0xa0
 [25230.404450]  [81139401] vfs_write+0xb0/0x112
 [25230.404452]  [81139c9d] SyS_pwrite64+0x66/0x84
 [25230.404454]  [813ebf52] system_call_fastpath+0x16/0x1b
 [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
 [25230.404458] BTRFS warning (device sdc):
 btrfs_create_pending_block_groups:9228: Aborting unused transaction(No
 space
 left).
 [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509:
 errno=-2 No such entry (Failed lookup while freeing chunk.)

 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
fs/btrfs/extent-tree.c | 5 +++--
fs/btrfs/super.c   | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 4bf8f02..0a5e770 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct
 btrfs_trans_handle *trans,
  int ret = 0;

  list_for_each_entry_safe(block_group, tmp, trans-new_bgs,
 bg_list) {
 -   list_del_init(block_group-bg_list);
  if (ret)
 -   continue;
 +   goto next;

  spin_lock(block_group-lock);
  memcpy(item, block_group-item, sizeof(item));
 @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct
 btrfs_trans_handle *trans,
 key.objectid,
 key.offset);
  if (ret)
  

Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation

2014-11-26 Thread Josef Bacik

On 11/26/2014 11:25 AM, Filipe David Manana wrote:

On Wed, Nov 26, 2014 at 4:15 PM, Josef Bacik jba...@fb.com wrote:

On 11/26/2014 10:28 AM, Filipe Manana wrote:


Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard
completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard
completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata
extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens)
we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

  checking extents
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  read block failed check_tree_block
  owner ref check failed [833912832 16384]
  Errors found in extent allocation tree or chunk allocation
  checking free space cache
  checking fs roots
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  Check tree block failed, want=833912832, have=0
  read block failed check_tree_block
  root 5 root dir 256 error
  root 5 inode 260 errors 2001, no inode item, link count wrong
  unresolved ref dir 256 index 0 namelen 8 name foobar_3
filetype 1 errors 6, no dir index, no inode ref
  root 5 inode 262 errors 2001, no inode item, link count wrong
  unresolved ref dir 256 index 0 namelen 8 name foobar_5
filetype 1 errors 6, no dir index, no inode ref
  root 5 inode 263 errors 2001, no inode item, link count wrong
  (...)

Signed-off-by: Filipe Manana fdman...@suse.com
---
   fs/btrfs/ctree.h| 13 -
   fs/btrfs/disk-io.c  | 14 ++
   fs/btrfs/extent-tree.c  | 24 +++-
   fs/btrfs/free-space-cache.c | 26 +-
   fs/btrfs/volumes.c  | 33 ++---
   5 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f40a65..51056c7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
 unsigned int dirty:1;
 unsigned int iref:1;
 unsigned int has_caching_ctl:1;
+   unsigned int removed:1;

 int disk_cache_state;

@@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {

 /* For delayed block group creation or deletion of empty block
groups */
 struct list_head bg_list;
+
+   atomic_t trimming;
   };

   /* delayed seq elem */
@@ -1731,6 +1734,13 @@ struct btrfs_fs_info {

 /* For btrfs to record security options */
 struct security_mnt_opts security_opts;
+
+   /*
+* Chunks that can't be freed yet (under a trim/discard operation)
+* and will be latter freed.
+*/
+   rwlock_t pinned_chunks_lock;
+   struct list_head pinned_chunks;
   };

   struct btrfs_subvolume_writers {
@@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle
*trans,
u64 type, u64 chunk_objectid, u64 chunk_offset,
u64 size);
   int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-struct btrfs_root *root, u64 group_start);
+struct btrfs_root *root, u64 group_start,
+bool *remove_em);
   void 

Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups

2014-11-26 Thread Josef Bacik

On 11/26/2014 11:29 AM, Filipe David Manana wrote:

On Wed, Nov 26, 2014 at 4:19 PM, Josef Bacik jba...@fb.com wrote:

On 11/26/2014 11:15 AM, Filipe David Manana wrote:


On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik jba...@fb.com wrote:


On 11/26/2014 10:28 AM, Filipe Manana wrote:



If the transaction handle doesn't have used blocks but has created new
block
groups make sure we turn the fs into readonly mode too. This is because
the
new block groups didn't get all their metadata persisted into the chunk
and
device trees, and therefore if a subsequent transaction starts,
allocates
space from the new block groups, writes data or metadata into that
space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.

Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it
can't
find the chunk item in the chunk tree:

[25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260
__btrfs_abort_transaction+0x50/0xfc [btrfs]()
[25230.404301] BTRFS: Transaction aborted (error -28)
[25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor
raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2
dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc
loop
psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr
serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod
cdrom
ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod
virtio [last unloaded: btrfs]
[25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted
3.17.0-rc5-btrfs-next-1+ #1
[25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[25230.404328]   88004581bb08 813e7a13
88004581bb50
[25230.404330]  88004581bb40 810423aa a049386a
ffe4
[25230.404332]  a05214c0 240c 88010fc8f800
88004581bba8
[25230.404334] Call Trace:
[25230.404338]  [813e7a13] dump_stack+0x45/0x56
[25230.404342]  [810423aa] warn_slowpath_common+0x7f/0x98
[25230.404351]  [a049386a] ?
__btrfs_abort_transaction+0x50/0xfc
[btrfs]
[25230.404353]  [8104240b] warn_slowpath_fmt+0x48/0x50
[25230.404362]  [a049386a] __btrfs_abort_transaction+0x50/0xfc
[btrfs]
[25230.404374]  [a04a8c43]
btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
[25230.404387]  [a04b77fd] __btrfs_end_transaction+0x7e/0x2de
[btrfs]
[25230.404398]  [a04b7a6d] btrfs_end_transaction+0x10/0x12
[btrfs]
[25230.404408]  [a04a3d64]
btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
[25230.404421]  [a04c53bd] __btrfs_buffered_write+0x160/0x48d
[btrfs]
[25230.404425]  [811a9268] ? cap_inode_need_killpriv+0x2d/0x37
[25230.404429]  [810f6501] ? get_page+0x1a/0x2b
[25230.404441]  [a04c7c95] btrfs_file_write_iter+0x321/0x42f
[btrfs]
[25230.404443]  [8110f5d9] ? handle_mm_fault+0x7f3/0x846
[25230.404446]  [813e98c5] ? mutex_unlock+0x16/0x18
[25230.404449]  [81138d68] new_sync_write+0x7c/0xa0
[25230.404450]  [81139401] vfs_write+0xb0/0x112
[25230.404452]  [81139c9d] SyS_pwrite64+0x66/0x84
[25230.404454]  [813ebf52] system_call_fastpath+0x16/0x1b
[25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
[25230.404458] BTRFS warning (device sdc):
btrfs_create_pending_block_groups:9228: Aborting unused transaction(No
space
left).
[25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509:
errno=-2 No such entry (Failed lookup while freeing chunk.)

Signed-off-by: Filipe Manana fdman...@suse.com
---
fs/btrfs/extent-tree.c | 5 +++--
fs/btrfs/super.c   | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4bf8f02..0a5e770 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct
btrfs_trans_handle *trans,
  int ret = 0;

  list_for_each_entry_safe(block_group, tmp, trans-new_bgs,
bg_list) {
-   list_del_init(block_group-bg_list);
  if (ret)
-   continue;
+   goto next;

  spin_lock(block_group-lock);
  memcpy(item, block_group-item, sizeof(item));
@@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct
btrfs_trans_handle *trans,
 key.objectid,
key.offset);
  if (ret)
  btrfs_abort_transaction(trans, extent_root,
ret);

Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal

2014-11-26 Thread Filipe David Manana
On Wed, Nov 26, 2014 at 4:24 PM, Josef Bacik jba...@fb.com wrote:
 On 11/26/2014 11:09 AM, Filipe David Manana wrote:

 On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik jba...@fb.com wrote:

 On 11/26/2014 10:28 AM, Filipe Manana wrote:


 If we remove a block group (because it became empty), we might have left
 a caching_ctl structure in fs_info-caching_block_groups that points to
 the block group and is accessed at transaction commit time. This results
 in accessing an invalid or incorrect block group. This issue became
 visible
 after Josef's patch Btrfs: remove empty block groups automatically.

 So if the block group is removed make sure we don't leave a dangling
 caching_ctl in caching_block_groups.

 Sample crash trace:

 [58380.439449] BUG: unable to handle kernel paging request at
 8801446eaeb8
 [58380.439707] IP: [a03f6d05]
 block_group_cache_done.isra.21+0xc/0x1c [btrfs]
 [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE
 8001446ea060
 [58380.441220] Oops:  [#1] SMP DEBUG_PAGEALLOC
 [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd
 auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse
 processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore
 thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom
 ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
 virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring
 virtio [last unloaded: btrfs]
 [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G
 W
 3.17.0-rc5-btrfs-next-1+ #1
 [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
 BIOS
 rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
 [58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti:
 88013896c000
 [58380.443238] RIP: 0010:[a03f6d05]  [a03f6d05]
 block_group_cache_done.isra.21+0xc/0x1c [btrfs]
 [58380.443238] RSP: 0018:88013896fdd8  EFLAGS: 00010283
 [58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX:
 
 [58380.443238] RDX:  RSI: 880185e16800 RDI:
 8801446eaeb8
 [58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09:
 88013896fc60
 [58380.443238] R10: 88013896fd28 R11:  R12:
 880222cae000
 [58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15:
 8801446eae00
 [58380.443238] FS:  () GS:88023ed8()
 knlGS:
 [58380.443238] CS:  0010 DS:  ES:  CR0: 8005003b
 [58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4:
 06e0
 [58380.443238] Stack:
 [58380.443238]  88013896fe18 a03fe2d5 880222cae850
 880185e16800
 [58380.443238]  88000dc41c20  8801a9ca9f00
 
 [58380.443238]  88013896fe80 a040fbcf 88018b0dcdb0
 88013ac82090
 [58380.443238] Call Trace:
 [58380.443238]  [a03fe2d5]
 btrfs_prepare_extent_commit+0x5a/0xd7
 [btrfs]
 [58380.443238]  [a040fbcf]
 btrfs_commit_transaction+0x45c/0x882
 [btrfs]
 [58380.443238]  [a040c058] transaction_kthread+0xf2/0x1a4
 [btrfs]
 [58380.443238]  [a040bf66] ?
 btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
 [58380.443238]  [8105966b] kthread+0xb7/0xbf
 [58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67
 [58380.443238]  [813ebeac] ret_from_fork+0x7c/0xb0
 [58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67

 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
fs/btrfs/ctree.h   |  1 +
fs/btrfs/extent-tree.c | 27 +++
2 files changed, 28 insertions(+)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index d3ccd09..7f40a65 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
  unsigned int ro:1;
  unsigned int dirty:1;
  unsigned int iref:1;
 +   unsigned int has_caching_ctl:1;


 Don't do this, just unconditionally call get_caching_control in
 btrfs_remove_block_group, then if we get something we can do stuff,
 otherwise we can just continue.  Thanks,


 That's what I initially thought too. However, get_caching_control only
 returns us the caching_ctl if block_group-cached ==
 BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The
 has_caching_ctl flag is just to avoid holding the semaphore and search
 through the list (since block_group-caching_ctl can be NULL but a
 caching_ctl point to the block group can be in the list).


 Oh God that's not good, we need to change get_caching_control to return if
 there is a caching control at all, since the other users want to wait for
 the fast caching to finish too.  So change that and then use it
 unconditionally.  I bet this has been causing us the random early ENOSPC
 problems.  Thanks,

Right, I think that's a separate change different from what I'm 

Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal

2014-11-26 Thread Josef Bacik

On 11/26/2014 11:34 AM, Filipe David Manana wrote:

On Wed, Nov 26, 2014 at 4:24 PM, Josef Bacik jba...@fb.com wrote:

On 11/26/2014 11:09 AM, Filipe David Manana wrote:


On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik jba...@fb.com wrote:


On 11/26/2014 10:28 AM, Filipe Manana wrote:



If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info-caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became
visible
after Josef's patch Btrfs: remove empty block groups automatically.

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at
8801446eaeb8
[58380.439707] IP: [a03f6d05]
block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE
8001446ea060
[58380.441220] Oops:  [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse
processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore
thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom
ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring
virtio [last unloaded: btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G
W
3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti:
88013896c000
[58380.443238] RIP: 0010:[a03f6d05]  [a03f6d05]
block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:88013896fdd8  EFLAGS: 00010283
[58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX:

[58380.443238] RDX:  RSI: 880185e16800 RDI:
8801446eaeb8
[58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09:
88013896fc60
[58380.443238] R10: 88013896fd28 R11:  R12:
880222cae000
[58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15:
8801446eae00
[58380.443238] FS:  () GS:88023ed8()
knlGS:
[58380.443238] CS:  0010 DS:  ES:  CR0: 8005003b
[58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4:
06e0
[58380.443238] Stack:
[58380.443238]  88013896fe18 a03fe2d5 880222cae850
880185e16800
[58380.443238]  88000dc41c20  8801a9ca9f00

[58380.443238]  88013896fe80 a040fbcf 88018b0dcdb0
88013ac82090
[58380.443238] Call Trace:
[58380.443238]  [a03fe2d5]
btrfs_prepare_extent_commit+0x5a/0xd7
[btrfs]
[58380.443238]  [a040fbcf]
btrfs_commit_transaction+0x45c/0x882
[btrfs]
[58380.443238]  [a040c058] transaction_kthread+0xf2/0x1a4
[btrfs]
[58380.443238]  [a040bf66] ?
btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[58380.443238]  [8105966b] kthread+0xb7/0xbf
[58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67
[58380.443238]  [813ebeac] ret_from_fork+0x7c/0xb0
[58380.443238]  [810595b4] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana fdman...@suse.com
---
fs/btrfs/ctree.h   |  1 +
fs/btrfs/extent-tree.c | 27 +++
2 files changed, 28 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d3ccd09..7f40a65 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
  unsigned int ro:1;
  unsigned int dirty:1;
  unsigned int iref:1;
+   unsigned int has_caching_ctl:1;



Don't do this, just unconditionally call get_caching_control in
btrfs_remove_block_group, then if we get something we can do stuff,
otherwise we can just continue.  Thanks,



That's what I initially thought too. However, get_caching_control only
returns us the caching_ctl if block_group-cached ==
BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The
has_caching_ctl flag is just to avoid holding the semaphore and search
through the list (since block_group-caching_ctl can be NULL but a
caching_ctl point to the block group can be in the list).



Oh God that's not good, we need to change get_caching_control to return if
there is a caching control at all, since the other users want to wait for
the fast caching to finish too.  So change that and then use it
unconditionally.  I bet this has been causing us the random early ENOSPC
problems.  Thanks,


Right, I think that's a separate change different from what I'm trying to fix.

When caching_thread 

[PATCH] Btrfs: make get_caching_control unconditionally return the ctl

2014-11-26 Thread Josef Bacik
This was written when we didn't do a caching control for the fast free space
cache loading.  However we started doing that a long time ago, and there is
still a small window of time that we could be caching the block group the fast
way, so if there is a caching_ctl at all on the block group just return it, the
callers all wait properly for what they want.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 fs/btrfs/extent-tree.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 031dafb..74eb29d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -315,12 +315,6 @@ get_caching_control(struct btrfs_block_group_cache *cache)
struct btrfs_caching_control *ctl;
 
spin_lock(cache-lock);
-   if (cache-cached != BTRFS_CACHE_STARTED) {
-   spin_unlock(cache-lock);
-   return NULL;
-   }
-
-   /* We're loading it the fast way, so we don't have a caching_ctl. */
if (!cache-caching_ctl) {
spin_unlock(cache-lock);
return NULL;
@@ -594,6 +588,7 @@ static int cache_block_group(struct btrfs_block_group_cache 
*cache,
spin_unlock(cache-lock);
 
if (fs_info-mount_opt  BTRFS_MOUNT_SPACE_CACHE) {
+   mutex_lock(caching_ctl-mutex);
ret = load_free_space_cache(fs_info, cache);
 
spin_lock(cache-lock);
@@ -601,6 +596,7 @@ static int cache_block_group(struct btrfs_block_group_cache 
*cache,
cache-caching_ctl = NULL;
cache-cached = BTRFS_CACHE_FINISHED;
cache-last_byte_to_unpin = (u64)-1;
+   caching_ctl-progress = (u64)-1;
} else {
if (load_cache_only) {
cache-caching_ctl = NULL;
@@ -610,6 +606,8 @@ static int cache_block_group(struct btrfs_block_group_cache 
*cache,
}
}
spin_unlock(cache-lock);
+   mutex_unlock(caching_ctl-mutex);
+
wake_up(caching_ctl-wait);
if (ret == 1) {
put_caching_control(caching_ctl);
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS messes up snapshot LV with origin

2014-11-26 Thread Goffredo Baroncelli
On 11/25/2014 11:21 PM, Zygo Blaxell wrote:
  However I still doesn't understood why you want btrfs-w/multiple disk over 
  LVM ?
 I want to split a few disks into partitions, but I want to create,
 move, and resize the partitions from time to time.  Only LVM can do
 that without taking the machine down, reducing RAID integrity levels,
 hotplugging drives, or leaving installed drives idle most of the time.
 
 I want btrfs-raid1 because of its ability to replace corrupted or lost
 data from one disk using the other.  If I run a single-volume btrfs
 on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage
 stack), I can detect lost data, but not replace it automatically from
 the other mirror.
OK, now I have understood.

Anyway as workaround, take in account that you can pass explicitly the
devices as:

mount -o device=/dev/sda,device=/dev/sdb,device=/dev/sdc /dev/sdd /mnt

(supposing that the filesystem is on /dev/sda.../dev/sdd)

I am working to a mount.btrfs helper. The aim of this helper is to manage
the assembling of multiple devices; the main points will be:
- wait until all the devices appeared
- allow (if required) to mount in degraded mode after a timeout
- at this point it could/should also skip the lvm-snapshotted devices (but 
before 
I have to know how recognize these) 

I hope to issue the patches in the next week

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 4/6] Btrfs: fix race between fs trimming and block group remove/allocation

2014-11-26 Thread Filipe Manana
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 
1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 
1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)

Signed-off-by: Filipe Manana fdman...@suse.com
---

V2: Added a comment to better explain the synchronization between block
group removal and triming, as requested by Josef.

 fs/btrfs/ctree.h| 13 -
 fs/btrfs/disk-io.c  | 14 ++
 fs/btrfs/extent-tree.c  | 28 +++-
 fs/btrfs/free-space-cache.c | 26 +-
 fs/btrfs/volumes.c  | 33 ++---
 5 files changed, 104 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f40a65..51056c7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
unsigned int dirty:1;
unsigned int iref:1;
unsigned int has_caching_ctl:1;
+   unsigned int removed:1;
 
int disk_cache_state;
 
@@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
 
/* For delayed block group creation or deletion of empty block groups */
struct list_head bg_list;
+
+   atomic_t trimming;
 };
 
 /* delayed seq elem */
@@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
 
/* For btrfs to record security options */
struct security_mnt_opts security_opts;
+
+   /*
+* Chunks that can't be freed yet (under a trim/discard operation)
+* and will be latter freed.
+*/
+   rwlock_t pinned_chunks_lock;
+   struct list_head pinned_chunks;
 };
 
 struct btrfs_subvolume_writers {
@@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
   u64 type, u64 chunk_objectid, u64 chunk_offset,
   u64 size);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-struct btrfs_root *root, u64 group_start);
+struct btrfs_root *root, u64 group_start,
+bool *remove_em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-26 Thread John Williams
On Wed, Nov 26, 2014 at 4:50 AM, Holger Hoffstätte
holger.hoffstae...@googlemail.com wrote:
 On Tue, 25 Nov 2014 15:17:58 -0800, John Williams wrote:

 2) CityHash : for 256-bit hashes on all systems
 https://code.google.com/p/cityhash/

 Btw this is now superseded by Farmhash:
 https://code.google.com/p/farmhash/


It seems FarmHash is not a complete replacement for CityHash.
Specifically, I don't see a Fingerprint256() function in FarmHash, so
no 256-bit fingerprints (unless I am missing something?). Also, it
seems that FarmHash's Fingerprint128() hash is just CityHash128().

Unless I am misreading it, I think FarmHash is mostly useful for
32-bit and 64-bit hashes.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Two persistent problems

2014-11-26 Thread Marc Joliet
Am Fri, 14 Nov 2014 17:00:26 -0500
schrieb Josef Bacik jba...@fb.com:

 On 11/14/2014 04:51 PM, Hugo Mills wrote:
  Chris, Josef, anyone else who's interested,
 
  On IRC, I've been seeing reports of two persistent unsolved
  problems. Neither is showing up very often, but both have turned up
  often enough to indicate that there's something specific going on
  worthy of investigation.
 
  One of them is definitely a btrfs problem. The other may be btrfs,
  or something in the block layer, or just broken hardware; it's hard to
  tell from where I sit.
 
  Problem 1: ENOSPC on balance
 
  This has been going on since about March this year. I can
  reasonably certainly recall 8-10 cases, possibly a number more. When
  running a balance, the operation fails with ENOSPC when there's plenty
  of space remaining unallocated. This happens on full balance, filtered
  balance, and device delete. Other than the ENOSPC on balance, the FS
  seems to work OK. It seems to be more prevalent on filesystems
  converted from ext*. The first few or more reports of this didn't make
  it to bugzilla, but a few of them since then have gone in.
 
  Problem 2: Unexplained zeroes
 
  Failure to mount. Transid failure, expected xyz, have 0. Chris
  looked at an early one of these (for Ke, on IRC) back in September
  (the 27th -- sadly, the public IRC logs aren't there for it, but I can
  supply a copy of the private log). He rapidly came to the conclusion
  that it was something bad going on with TRIM, replacing some blocks
  with zeroes. Since then, I've seen a bunch of these coming past on
  IRC. It seems to be a 3.17 thing. I can successfully predict the
  presence of an SSD and -odiscard from the have 0. I've successfully
  persuaded several people to put this into bugzilla and capture
  btrfs-images.  btrfs recover doesn't generally seem to be helpful in
  recovering data.
 
 
  I think Josef had problem 1 in his sights, but I don't know if
  additional images or reports are helpful at this point. For problem 2,
  there's obviously something bad going on, but there's not much else to
  go on -- and the inability to recover data isn't good.
 
  For each of these, what more information should I be trying to
  collect from any future reporters?
 
 
 
 So for #2 I've been looking at that the last two weeks.  I'm always 
 paranoid we're screwing up one of our data integrity sort of things, 
 either not waiting on IO to complete properly or something like that. 
 I've built a dm target to be as evil as possible and have been running 
 it trying to make bad things happen.  I got slightly side tracked since 
 my stress test exposed a bug in the tree log stuff an csums which I just 
 fixed.  Now that I've fixed that I'm going back to try and make the 
 expected blah, have 0 type errors happen.

Just a quick question from a user: does Filipe's patch Btrfs: fix race between
fs trimming and block group remove/allocation fix this?  Judging by the commit
message, it looks like it.  If so, can you say whether it will make it into
3.17.x?

Maybe I'm being overly paranoid, but I stuck with 3.16.7 because of this.  (I
mean, I have backups, but there's no need to provoke a situation where I will
need them ;-) .)

-- 
Marc Joliet
--
People who think they know everything really annoy those of us who know we
don't - Bjarne Stroustrup


signature.asc
Description: PGP signature


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-26 Thread Christoph Hellwig
As mentioned last round please move the addition of the is_readonly
operation to the first thing in the series, so that the ordering makes
more sense.

Second I think this patch is incorrect for XFS - XFS uses -update_time
to set the time stampst in the dinode.  These two need to be coherent
as we can write out a dirty inode any time, so it needs to have the
timestamp uptodate.

Third update_time now calls mark_inode_dirty unconditionally, while
previously it wasn't called when -update_time was set.  At least
for XFS that's a major change in behavior as XFS never used VFS dirty
tracking for metadata updates.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-26 Thread Christoph Hellwig
The subject line seems incorrect, this seems to implement some form
of dirty inode writeback clustering.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Roman Mamedov
Hello,

I used btrfs-convert to switch my FS from Ext4 to Btrfs. As it was a rather
large 10 TB filesystem, to save on the conversion time, I used the -d,
disable data checksum option of btrfs-convert.

Turns out now I can't cp --reflink any files that were already on the FS
prior to conversion. The error message from cp is failed to clone [...]
Invalid argument.

I assume this is because of the lack of checksums; the only way to make old
files cloneable is to plain copy them to a different place and then delete the
originals, but that's what I was trying to avoid in the first place.

Also I thought maybe defragmenting will help, but nope, doesn't seem to be the
case, even ordering it to recompress data to a different method doesn't fix
the problem. (Even if it did, it's still a lot of unnecessary rewriting).

Is there really a good reason to stop these files without checksums from being
cloneable? It's not like they have the noCoW attribute, so I'd assume any new
write to these files would cause a CoW and proper checksums for all new blocks
anyways.

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS messes up snapshot LV with origin

2014-11-26 Thread Robert White

On 11/25/2014 07:22 PM, Duncan wrote:

From my perspective, however, btrfs is simply incompatible with lvm

snapshots, because the basic assumptions are incompatible.  Btrfs assumes
UUIDs will be exactly what they say on the label, /unique/, while lvm's
snapshot feature directly breaks that uniqueness by copying the (former)
UUID, thus making the former UUID no longer unique and thus no longer
truly UUID.  Thus, part of the lvm /feature/ of snapshots is in direct
contradiction to a basic assumption of btrfs, that UUIDs are exactly
that, unique, making that feature directly incompatible with btrfs on a
very basic level.


A finer point here. LVM doesn't copy the UUID. AN LVM snapshot is a 
copy-on-write entity so it _exposes_ the single sector(s) of the 
superblock(s) in both views of the underlying storage. This is universal 
to the idea of a snapshot. Just as a btrfs subvol snap /old /new 
exposes all the unique elements of /old under the name /new (in 
preparation for the user to implement subsequent divergence); lvmcreate 
--snapshot Old New causes every block-N of Old to be identically 
available as block-N of New (in preparation for the user to implement 
subsequent divergence).


In point of fact the LVM snapshot operation is a zero-copy operation at 
its heart. After the snapshot is established, when a block in modified 
in Old, it's original content is saved in New. When blocks are written 
in New, they are written in place and the reference to the block content 
in Old is overwritten.


This is the reason that fsfreeze is unnecessary for things above LVM 
snapshots as the instant-in-time divergence is _instant_. It's not that 
LVM goes out and does an fsfreeze equivalent action, its that the switch 
to write-divergence is essentially atomic. A bunch of metatdata is setup 
and then all-at-once one write behavior is switched with another by 
re-mapping the device access routines.


So while you may have a point about btrfs being unprepared for LVM, 
neither party is particularly at fault in any way.


The damn you photocopier for making photocopies so identically nature 
of your problem with LVM seems to be leading you to misplaced conclusions.


If you need to harmonize these sorts of things, you need to be able to 
re-write blocks in question with disambiguating information (like new 
UUIDS) or restrict your accesses in some other manner.


If you are waiting for someone to code it up perhaps you should do so. 
But it will _never_ be automatic because the use cases that don't match 
your expectations may need the founding assumptions to be as they are today.


In other words, your belief that your position is entirely logical may 
be a little off, particularly if you think LVM is Copying things when 
it does a snapshot.


As previously stated XFS solved this problem by providing a tool that 
would change the UUID of a file system. This tool cold then be pointed 
at either (or both) the original and/or snapshot volumes as needed.


I don't see a re-make the btrfs option for changing UUIDs and LVM 
doesn't care _at_ _all_ about what is actually in its volumes (okay, 
lvresize has some fsck nonsense, but that's just messy).


It might even be wrong to try to harmonize those features, like trying 
to put a manual clutch into a car with an automatic transmission... it 
may just not fit.


Given that BTRFS want's to play in the same level of abstraction as LVM, 
its kind of a given that they'll butt heads over things like conflicting 
definitions of what it means to take a snapshot.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-26 Thread Dave Chinner
On Wed, Nov 26, 2014 at 05:20:17AM -0500, Theodore Ts'o wrote:
 On Wed, Nov 26, 2014 at 10:48:51AM +1100, Dave Chinner wrote:
  No abuse necessary at all. Just a different inode_dirtied_after()
  check is requires if the inode is on the time dirty list in
  move_expired_inodes().
 
 I'm still not sure what you have in mind here.  When would this be
 checked? 

Have you looked at where move_expired_inodes() gets called from?
It's called periodically from background writeback by queue_io(),
and sync uses the same infrastructure to expire all inodes on the
dirty list

 It sounds like you want to set a timeout such that when an
 inode which had its timestamps updated lazily 24 hours earlier, the
 inode would get written out.  Yes?  But that implies something is
 going to have to scan the list of inodes on the dirty time list
 periodically.  When are you proposing that this take place?

The writeback code already does this for dirty inodes. it does it in
move_expired_inodes() to move the inodes with i_dirtied_when is
older than 30s. It's *trivial* to add a time dirty inode list and
scan that at the same time to pull off inodes that are older than
24hrs.

 The various approaches that come to mind all seem more complex than
 what I have in this patch 3 of 4, and I'm not sure it's worth the
 complexity.

the once a day stuff you've added is a horrible, nasty hack. I
wasn't going to say anything about it (i.e. if you can't say
anything nice...). The existing dirty inode writeback expiry code
does *everything* we need already, we just need to plumb in a new
list and add an expiry check of that list to move inodes to the b_io
list when they have been timestamp dirty for more than 24 hours...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-26 Thread Dave Chinner
On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote:
 Add an optimization for the MS_LAZYTIME mount option so that we will
 opportunistically write out any inodes with the I_DIRTY_TIME flag set
 in a particular inode table block when we need to update some inode in
 that inode table block anyway.
 
 Also add some temporary code so that we can set the lazytime mount
 option without needing a modified /sbin/mount program which can set
 MS_LAZYTIME.  We can eventually make this go away once util-linux has
 added support.
 
 Google-Bug-Id: 18297052
 
 Signed-off-by: Theodore Ts'o ty...@mit.edu
 ---
  fs/ext4/inode.c | 49 
 ++---
  fs/ext4/super.c |  9 +
  include/trace/events/ext4.h | 30 +++
  3 files changed, 85 insertions(+), 3 deletions(-)
 
 diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
 index 5653fa4..8308c82 100644
 --- a/fs/ext4/inode.c
 +++ b/fs/ext4/inode.c
 @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
  }
  
  /*
 + * Opportunistically update the other time fields for other inodes in
 + * the same inode table block.
 + */
 +static void ext4_update_other_inodes_time(struct super_block *sb,
 +   unsigned long orig_ino, char *buf)
 +{
 + struct ext4_inode_info  *ei;
 + struct ext4_inode   *raw_inode;
 + unsigned long   ino;
 + struct inode*inode;
 + int i, inodes_per_block = EXT4_SB(sb)-s_inodes_per_block;
 + int inode_size = EXT4_INODE_SIZE(sb);
 +
 + ino = orig_ino  ~(inodes_per_block - 1);
 + for (i = 0; i  inodes_per_block; i++, ino++, buf += inode_size) {
 + if (ino == orig_ino)
 + continue;
 + inode = find_active_inode_nowait(sb, ino);
 + if (!inode ||
 + (inode-i_state  I_DIRTY_TIME) == 0 ||
 + !spin_trylock(inode-i_lock)) {
 + iput(inode);
 + continue;
 + }
 + inode-i_state = ~I_DIRTY_TIME;
 + inode-i_ts_dirty_day = 0;
 + spin_unlock(inode-i_lock);
 + inode_requeue_dirtytime(inode);
 +
 + ei = EXT4_I(inode);
 + raw_inode = (struct ext4_inode *) buf;
 +
 + spin_lock(ei-i_raw_lock);
 + EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
 + EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
 + EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
 + ext4_inode_csum_set(inode, raw_inode, ei);
 + spin_unlock(ei-i_raw_lock);
 + trace_ext4_other_inode_update_time(inode, orig_ino);
 + iput(inode);
 + }
 +}

Am I right in that this now does unlogged timestamp updates of
inodes? What happens when that buffer gets overwritten by log
recover after a crash? The timestamp updates get lost?

FYI, XFS has had all sorts of nasty log recovery corner cases
caused by log recovery overwriting non-logged inode updates like
this. In the past few years we've removed every single non-logged
inode update optimisation so that all changes (including timestamps)
are transactional so inode state on disk not matching what log
recovery wrote to disk for all the other inode metadata...

Optimistic unlogged inode updates are a slippery slope, and history
tells me that it doesn't lead to a nice place

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-26 Thread Andreas Dilger
On Nov 26, 2014, at 3:48 PM, Dave Chinner da...@fromorbit.com wrote:
 
 On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote:
 Add an optimization for the MS_LAZYTIME mount option so that we will
 opportunistically write out any inodes with the I_DIRTY_TIME flag set
 in a particular inode table block when we need to update some inode
 in that inode table block anyway.
 
 Also add some temporary code so that we can set the lazytime mount
 option without needing a modified /sbin/mount program which can set
 MS_LAZYTIME.  We can eventually make this go away once util-linux has
 added support.
 
 Google-Bug-Id: 18297052
 
 Signed-off-by: Theodore Ts'o ty...@mit.edu
 ---
 fs/ext4/inode.c | 49 
 ++---
 fs/ext4/super.c |  9 +
 include/trace/events/ext4.h | 30 +++
 3 files changed, 85 insertions(+), 3 deletions(-)
 
 diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
 index 5653fa4..8308c82 100644
 --- a/fs/ext4/inode.c
 +++ b/fs/ext4/inode.c
 @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
 + * Opportunistically update the other time fields for other inodes in
 + * the same inode table block.
 + */
 +static void ext4_update_other_inodes_time(struct super_block *sb,
 +  unsigned long orig_ino, char *buf)
 +{
 +struct ext4_inode_info  *ei;
 +struct ext4_inode   *raw_inode;
 +unsigned long   ino;
 +struct inode*inode;
 +int i, inodes_per_block = EXT4_SB(sb)-s_inodes_per_block;
 +int inode_size = EXT4_INODE_SIZE(sb);
 +
 +ino = orig_ino  ~(inodes_per_block - 1);
 +for (i = 0; i  inodes_per_block; i++, ino++, buf += inode_size) {
 +if (ino == orig_ino)
 +continue;
 +inode = find_active_inode_nowait(sb, ino);
 +if (!inode ||
 +(inode-i_state  I_DIRTY_TIME) == 0 ||
 +!spin_trylock(inode-i_lock)) {
 +iput(inode);
 +continue;
 +}
 +inode-i_state = ~I_DIRTY_TIME;
 +inode-i_ts_dirty_day = 0;
 +spin_unlock(inode-i_lock);
 +inode_requeue_dirtytime(inode);
 +
 +ei = EXT4_I(inode);
 +raw_inode = (struct ext4_inode *) buf;
 +
 +spin_lock(ei-i_raw_lock);
 +EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
 +EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
 +EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
 +ext4_inode_csum_set(inode, raw_inode, ei);
 +spin_unlock(ei-i_raw_lock);
 +trace_ext4_other_inode_update_time(inode, orig_ino);
 +iput(inode);
 +}
 +}
 
 Am I right in that this now does unlogged timestamp updates of
 inodes? What happens when that buffer gets overwritten by log
 recover after a crash? The timestamp updates get lost?
 
 FYI, XFS has had all sorts of nasty log recovery corner cases
 caused by log recovery overwriting non-logged inode updates like
 this. In the past few years we've removed every single non-logged
 inode update optimisation so that all changes (including timestamps)
 are transactional so inode state on disk not matching what log
 recovery wrote to disk for all the other inode metadata...
 
 Optimistic unlogged inode updates are a slippery slope, and history
 tells me that it doesn't lead to a nice place

Since ext4/jbd2 is logging the whole block, unlike XFS which is doing
logical journaling, this isn't an unlogged update.  It is just taking
advantage of the fact that the whole block is going to be logged and
written to the disk anyway.  If the only update needed for other inodes
in the block is the timestamp then they may as well be flushed to disk
at the same time and avoid the need for another update later on.

Cheers, Andreas





--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Robert White

On 11/26/2014 11:55 AM, Roman Mamedov wrote:

Is there really a good reason to stop these files without checksums from being
cloneable? It's not like they have the noCoW attribute, so I'd assume any new
write to these files would cause a CoW and proper checksums for all new blocks
anyways.


The problem seems to be that you are trying to clone a NODATASUM file to 
a file that would have data checsums. The resulting file _might_ then 
end up with some extents with, and others without, checksums if that 
target file were modified.


So you _could_ reflink the file but you'd have to do it to another file 
with no data checksums -- which basically means a NOCOW file, or 
mounting with nodatasum while you do the reflink, but now you have more 
problem files.


linux/fs/btrfs/ioctl.c @ ~ line 2930
/* don't make the dst file partly checksummed */
if ((BTRFS_I(src)-flags  BTRFS_INODE_NODATASUM) !=
(BTRFS_I(dst)-flags  BTRFS_INODE_NODATASUM)) {
ret = -EINVAL;
goto out_unlock;
}

Basically if the system allowed you to reflink from a no-data-sum to a 
data-sum file the results would be instantly unreadable for failing 
every single data checksum test. That or the entire checksum system 
would have to be advisory instead of functional.


So yes, there is a good reason.


David Foster Wallace famously said act in haste, repent in leasure. 
You kind-of shot yourself in the foot while attempting to save time. You 
promised yourself that you didn't need the checksums.


Now at this point I'm going to make a few guesses...

I don't see _anywhere_ in the kernel source or btrfs-progs where 
BTRFS_INODE_NODATASUM is explicitly cleared from any inode ever. It 
might be implicitly cleared somewhere but I couldn't find it. So all 
those unsummed files are probably unsummed for life.


I also don't see anything in the code that says this ioctl will create 
the checksums for the selected file so you may have to do the copy you 
tried to avoid.


Sorry man...

If you haven't done much with the file system yet, you might want to 
reverse the conversion and do it again.


Otherwise, you will want to copy the files long-hand, possibly in 
batches if you are more than 50% full on disk space.


On the bright side...

This would be the perfect moment to think about your backup/snapshot 
scheme. I always have a /whatever (e.g. /__System for a root partition) 
as my default subvolume that I normally mount. When I do my backups I 
mount -o subvol=/ /dev/whathaveyou /mnt/maintenance and then do my 
snapshots in /mnt/maintenance. That way every file in my N snapshots 
doesn't show up in the output of locate N+1 times. (e.g. this lets me 
hide my local snapshots/backups from normal operations)


Also, now would be the perfect time to add compression to your default 
regime. Compressing the files only happens on write and so using 
compression involves a copy anyway.


-- Rob.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Roman Mamedov
On Wed, 26 Nov 2014 15:18:26 -0800
Robert White rwh...@pobox.com wrote:

 So you _could_ reflink the file but you'd have to do it to another file 
 with no data checksums -- which basically means a NOCOW file, or 
 mounting with nodatasum while you do the reflink, but now you have more 
 problem files.

Bingo!!! A cp --reflink to a destination that's been made chattr +C prior to
that, works perfectly.

My goal was to convert regular top-level directories into subvolumes (for
further snapshotting). With that trick, I've been able to do that now w/o
issues.

$ mv Music Music.orig
$ sudo btrfs sub create Music
Create subvolume './Music'
$ sudo chattr +C Music
$ sudo cp -a --reflink Music.orig/* Music/
$ 

Finished with no rewriting necessary. After that I recursively-removed the +C
attribute from all newly reflinked files, and cp --reflink as well as
snapshotting of those works fine.

Thanks for the idea. :)

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-26 Thread Dave Chinner
On Wed, Nov 26, 2014 at 04:10:44PM -0700, Andreas Dilger wrote:
 On Nov 26, 2014, at 3:48 PM, Dave Chinner da...@fromorbit.com wrote:
  
  On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote:
  Add an optimization for the MS_LAZYTIME mount option so that we will
  opportunistically write out any inodes with the I_DIRTY_TIME flag set
  in a particular inode table block when we need to update some inode
  in that inode table block anyway.
  
  Also add some temporary code so that we can set the lazytime mount
  option without needing a modified /sbin/mount program which can set
  MS_LAZYTIME.  We can eventually make this go away once util-linux has
  added support.
  
  Google-Bug-Id: 18297052
  
  Signed-off-by: Theodore Ts'o ty...@mit.edu
  ---
  fs/ext4/inode.c | 49 
  ++---
  fs/ext4/super.c |  9 +
  include/trace/events/ext4.h | 30 +++
  3 files changed, 85 insertions(+), 3 deletions(-)
  
  diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
  index 5653fa4..8308c82 100644
  --- a/fs/ext4/inode.c
  +++ b/fs/ext4/inode.c
  @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
  }
  
  /*
  + * Opportunistically update the other time fields for other inodes in
  + * the same inode table block.
  + */
  +static void ext4_update_other_inodes_time(struct super_block *sb,
  +unsigned long orig_ino, char *buf)
  +{
  +  struct ext4_inode_info  *ei;
  +  struct ext4_inode   *raw_inode;
  +  unsigned long   ino;
  +  struct inode*inode;
  +  int i, inodes_per_block = EXT4_SB(sb)-s_inodes_per_block;
  +  int inode_size = EXT4_INODE_SIZE(sb);
  +
  +  ino = orig_ino  ~(inodes_per_block - 1);
  +  for (i = 0; i  inodes_per_block; i++, ino++, buf += inode_size) {
  +  if (ino == orig_ino)
  +  continue;
  +  inode = find_active_inode_nowait(sb, ino);
  +  if (!inode ||
  +  (inode-i_state  I_DIRTY_TIME) == 0 ||
  +  !spin_trylock(inode-i_lock)) {
  +  iput(inode);
  +  continue;
  +  }
  +  inode-i_state = ~I_DIRTY_TIME;
  +  inode-i_ts_dirty_day = 0;
  +  spin_unlock(inode-i_lock);
  +  inode_requeue_dirtytime(inode);
  +
  +  ei = EXT4_I(inode);
  +  raw_inode = (struct ext4_inode *) buf;
  +
  +  spin_lock(ei-i_raw_lock);
  +  EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
  +  EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
  +  EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
  +  ext4_inode_csum_set(inode, raw_inode, ei);
  +  spin_unlock(ei-i_raw_lock);
  +  trace_ext4_other_inode_update_time(inode, orig_ino);
  +  iput(inode);
  +  }
  +}
  
  Am I right in that this now does unlogged timestamp updates of
  inodes? What happens when that buffer gets overwritten by log
  recover after a crash? The timestamp updates get lost?
  
  FYI, XFS has had all sorts of nasty log recovery corner cases
  caused by log recovery overwriting non-logged inode updates like
  this. In the past few years we've removed every single non-logged
  inode update optimisation so that all changes (including timestamps)
  are transactional so inode state on disk not matching what log
  recovery wrote to disk for all the other inode metadata...
  
  Optimistic unlogged inode updates are a slippery slope, and history
  tells me that it doesn't lead to a nice place
 
 Since ext4/jbd2 is logging the whole block, unlike XFS which is doing
 logical journaling, this isn't an unlogged update.  It is just taking
 advantage of the fact that the whole block is going to be logged and
 written to the disk anyway.

Urk - that's worse, isn't it? i.e the code above calls iput() from
within a current transaction context?  What happens if that drops
the last reference to the inode and it gets evicted due to racing
with an unlink? Won't that try to start another transaction to free
the inode (i.e. through ext4_evict_inode())?




 If the only update needed for other inodes
 in the block is the timestamp then they may as well be flushed to disk
 at the same time and avoid the need for another update later on.
 
 Cheers, Andreas
 
 
 
 
 
 

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Robert White

On 11/26/2014 03:33 PM, Roman Mamedov wrote:

On Wed, 26 Nov 2014 15:18:26 -0800
Robert White rwh...@pobox.com wrote:


So you _could_ reflink the file but you'd have to do it to another file
with no data checksums -- which basically means a NOCOW file, or
mounting with nodatasum while you do the reflink, but now you have more
problem files.


Bingo!!! A cp --reflink to a destination that's been made chattr +C prior to
that, works perfectly.

My goal was to convert regular top-level directories into subvolumes (for
further snapshotting). With that trick, I've been able to do that now w/o
issues.

$ mv Music Music.orig
$ sudo btrfs sub create Music
Create subvolume './Music'
$ sudo chattr +C Music
$ sudo cp -a --reflink Music.orig/* Music/
$

Finished with no rewriting necessary. After that I recursively-removed the +C
attribute from all newly reflinked files, and cp --reflink as well as
snapshotting of those works fine.

Thanks for the idea. :)



Uh... you may _still_ have no checksums on any of those data extents. 
They are not going to come back until you write them to a normal file 
with a normal copy. So you may be lacking most of the data validation 
features of this filesystem. For instance you can, and always could, 
snapshot a NODATACOW/NODATASUM file just fine (I call it 1COW mode).


Setting NODATACOW sets NODATASUM...

Clearing NODATACOW does _not_ clear NODATASUM (at least not on a 
non-empty file) as near as I can tell, so that directory hierarchy and 
its subsequent snapshots is likely less safe than you think.


You might want to go experiment. Make another new subvol (or at least a 
directory in a directory/root/subvol that never had the +C attribute 
set) and see if you can cp --reflink any of these files into that 
subdirectory without repeating the +C trick.


Basically scrubbing and mirroring and sending your Music folder is an 
unprotected and unverified operation the way you've done this (if my 
reading of the code is correct).


You _really_ might be better off spending the time and doing the copy to 
a normal directory.


For instance without checksums if you btrfs scrub your volume it will 
read the file but it can't know if the file got corrupted, it can only 
tell you if you if the disk read completed without error. (there's a 
whole degenerate/simplified path in the code for scrubbing un-summed files).


So seriously, you might be Saving yourself time right into a future 
data loss.


Take this as a you've been warned.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Roman Mamedov
On Wed, 26 Nov 2014 16:00:23 -0800
Robert White rwh...@pobox.com wrote:

 Uh... you may _still_ have no checksums on any of those data extents. 
 They are not going to come back until you write them to a normal file 
 with a normal copy. So you may be lacking most of the data validation 
 features of this filesystem.

Well, this FS is coming from being Ext4 for years, so it's not worse off now
than it was before. And anyways the main feature that I wanted were snapshots.

 You might want to go experiment. Make another new subvol (or at least a 
 directory in a directory/root/subvol that never had the +C attribute 
 set) and see if you can cp --reflink any of these files into that 
 subdirectory without repeating the +C trick.

Ha, indeed I can't. Maybe there should be a way to generate checksums without
rewriting files, just via reading them, then calculating and writing checksum
to metadata.

 Clearing NODATACOW does _not_ clear NODATASUM (at least not on a 
 non-empty file) as near as I can tell, so that directory hierarchy and 
 its subsequent snapshots is likely less safe than you think.

The nodatasum flag also isn't accessible via chattr, is it?

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Robert White

On 11/26/2014 03:33 PM, Roman Mamedov wrote:

Finished with no rewriting necessary. After that I recursively-removed the +C
attribute from all newly reflinked files, and cp --reflink as well as
snapshotting of those works fine.


I did some double checking and I think you'll find that if you lsattr 
those files they still have the C (NoCOW) attribute, which also means 
they are still unsummed.


Which also means that when you cp --reflink them the target files you 
create are also NoCOW.


So you harmonized the lack of checksums with the linux-level C 
attribute. This has hidden your problem but not fixed it.


(Trying to clear the NOCOW attribute on a file in BTRFS is _silently_ 
ignored as invalid. That recursive removal only changed the directories.)

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Roman Mamedov
On Wed, 26 Nov 2014 16:20:44 -0800
Robert White rwh...@pobox.com wrote:

 I did some double checking and I think you'll find that if you lsattr 
 those files they still have the C (NoCOW) attribute, which also means 
 they are still unsummed.

Indeed, I looked at the top level only, which had just directories.

 (Trying to clear the NOCOW attribute on a file in BTRFS is _silently_ 
 ignored as invalid. That recursive removal only changed the directories.)

And the chattr command even completes with a zero exit code, this is rather
unexpected.

$ lsattr \ MP3.m3u; chattr -C \ MP3.m3u  echo Yay; lsattr \ 
MP3.m3u 
---C  MP3.m3u
Yay
---C  MP3.m3u


-- 
With respect,
Roman
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Robert White

On 11/26/2014 04:20 PM, Roman Mamedov wrote:

On Wed, 26 Nov 2014 16:00:23 -0800
Robert White rwh...@pobox.com wrote:


Uh... you may _still_ have no checksums on any of those data extents.
They are not going to come back until you write them to a normal file
with a normal copy. So you may be lacking most of the data validation
features of this filesystem.


Well, this FS is coming from being Ext4 for years, so it's not worse off now
than it was before. And anyways the main feature that I wanted were snapshots.


Given that you wont be able to scrub the data and BTRFS is usable but 
still a little brittle, it might be a little worse off than you think.


Plus if you ever find yourself in need of balancing you've tossed out 
one level of data integrity right here at the start.




You might want to go experiment. Make another new subvol (or at least a
directory in a directory/root/subvol that never had the +C attribute
set) and see if you can cp --reflink any of these files into that
subdirectory without repeating the +C trick.


Ha, indeed I can't. Maybe there should be a way to generate checksums without
rewriting files, just via reading them, then calculating and writing checksum
to metadata.


That problem would be computationally hard because you'd have to 
verify that no other file was using that extent before you put that 
extent under control of the csum machinery, otherwise you might break 
later break the COW promise when the file that knows those blocks by 
their checksum changes the contents out from underneath the other 
references.




Clearing NODATACOW does _not_ clear NODATASUM (at least not on a
non-empty file) as near as I can tell, so that directory hierarchy and
its subsequent snapshots is likely less safe than you think.


The nodatasum flag also isn't accessible via chattr, is it?


It is not. NODATASUM and NODATACOW are private features. The linux 
kernel has no general concept of data checksums. The BTRFS guys mapped 
the NODATACOW attribute onto the existing lsattr/chattr bit because it 
was defined for another file system long ago.


You'll also find that you can not set or clear the C attr on a file in a 
BTRFS unless its size is zero. So all your files now have the C 
attribute more-or-less forever. Only a normal copy operation (e.g. 
allocating new extents and writing the data into them wiht the checksum 
feature in force) will change that.



--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Robert White

On 11/26/2014 04:28 PM, Roman Mamedov wrote:

On Wed, 26 Nov 2014 16:20:44 -0800
Robert White rwh...@pobox.com wrote:

(Trying to clear the NOCOW attribute on a file in BTRFS is _silently_
ignored as invalid. That recursive removal only changed the directories.)


And the chattr command even completes with a zero exit code, this is rather
unexpected.


That's what silently means in this context. I didn't pick the result, 
and it's not what I would have done. I've got no idea if this was ever 
discussed at any length for pros-and-cons. I could make an argument for 
the silent result, or against it. Since the attribute is immutable there 
really isn't a nope, that's just not possible dave errno value to 
return that isn't as confusing as just skipping it.


The closest result code would be ENOSUP (operation not supported) but 
changing attributes _is_ supported, just not that particular attribute 
in that particular circumstance.


Also, the set attributes call sets all the attributes at once so there 
is no way to say which attribute was rejected. As such, a do what you 
can and let the people check the result behavior is not at all 
unreasonable.


Life is full of flaws. 8-)

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Robert White

On 11/26/2014 04:31 PM, Robert White wrote:

On 11/26/2014 04:20 PM, Roman Mamedov wrote:

On Wed, 26 Nov 2014 16:00:23 -0800
Robert White rwh...@pobox.com wrote:

You might want to go experiment. Make another new subvol (or at least a
directory in a directory/root/subvol that never had the +C attribute
set) and see if you can cp --reflink any of these files into that
subdirectory without repeating the +C trick.


Ha, indeed I can't. Maybe there should be a way to generate checksums
without
rewriting files, just via reading them, then calculating and writing
checksum
to metadata.


That problem would be computationally hard because you'd have to
verify that no other file was using that extent before you put that
extent under control of the csum machinery, otherwise you might break
later break the COW promise when the file that knows those blocks by
their checksum changes the contents out from underneath the other
references.


I explained this poorly/incorrectly...

So some guy like yourself converts a file system, or mounts an existing 
file system with nodatasum and creates some file. As a result there is a 
file called /One that has no checksums on its extents.


Then the guy creates a directory and sets +C on it, and copies the file 
into that directory with cp --reflink /One /d/Two. File /d/Two is a 
no-cow file.


Now the guy somes back and decides to put the data checksums onto the 
extents of /One.


At this moment everything _looks_ fine.

Then the guy alters the first byte of /d/Two, which modifies the no-cow 
file in place.


Now the guy tires to read /One ... what happens? The checksum doesn't 
match and the data has changed because of the NOCOW.


(So I think, in practice, the 1COW mechanism prevents this just like it 
works for snapshots)


But thats a _lot_ of corner cases that can be solved by telling someone 
to copy the file if they want the checksums to be recreated.


e.g. the set of all files and all possibilities gets well into the 
computationally hard end of the swimming pool.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 4/6] Btrfs: fix race between fs trimming and block group remove/allocation

2014-11-26 Thread Filipe Manana
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 
1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 
1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)

Signed-off-by: Filipe Manana fdman...@suse.com
---

V2: Added a comment to better explain the synchronization between block
group removal and triming, as requested by Josef.

V3: Changed logic to move extent map to pinned chunks list while holding
the block group's spinlock, and similarly to make trimming check for
the need to remove the em from the pinned chunks list while holding
that spinlock. This is to avoid the (unlikely) case where the trimmer
attempts to remove the em before chunk removal adds it to the pinned
chunks list, which would leave the em alive (but outside the mapping
tree) until umount time.

 fs/btrfs/ctree.h| 13 -
 fs/btrfs/disk-io.c  | 14 ++
 fs/btrfs/extent-tree.c  | 43 ++-
 fs/btrfs/free-space-cache.c | 37 -
 fs/btrfs/volumes.c  | 21 +
 5 files changed, 117 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f40a65..4a167a0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
unsigned int dirty:1;
unsigned int iref:1;
unsigned int has_caching_ctl:1;
+   unsigned int removed:1;
 
int disk_cache_state;
 
@@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
 
/* For delayed block group creation or deletion of empty block groups */
struct list_head bg_list;
+
+   atomic_t trimming;
 };
 
 /* delayed seq elem */
@@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
 
/* For btrfs to record security options */
struct security_mnt_opts security_opts;
+
+   /*
+* Chunks that can't be freed yet (under a trim/discard operation)
+* and will be latter freed.
+*/
+   rwlock_t pinned_chunks_lock;
+   struct list_head pinned_chunks;
 };
 
 struct btrfs_subvolume_writers {
@@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
  

[PATCH] btrfs-progs: fix return value problem for btrfs sub show

2014-11-26 Thread Gui Hecheng
If you exec:
# btrfs sub show dir  == non-subvolume dir
The cmd print error messages as expected, but returns 0.
By convetion, it should return non-zero and we should explicitly
set it before it goto out.

With other pieces adopted:
1) removed a unnecessary return value set -EINVAL
2) fixed another code branch which may return 0 upon error.
3) with 2) applied, the ret = 0 follows can be removed

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 cmds-subvolume.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index fa58a24..53eec46 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -906,6 +906,7 @@ static int cmd_subvol_show(int argc, char **argv)
}
if (!ret) {
fprintf(stderr, ERROR: '%s' is not a subvolume\n, fullpath);
+   ret = 1;
goto out;
}
 
@@ -919,7 +920,6 @@ static int cmd_subvol_show(int argc, char **argv)
fprintf(stderr,
ERROR: %s doesn't belong to btrfs mount point\n,
fullpath);
-   ret = -EINVAL;
goto out;
}
ret = 1;
@@ -958,13 +958,13 @@ static int cmd_subvol_show(int argc, char **argv)
memset(get_ri, 0, sizeof(get_ri));
get_ri.root_id = sv_id;
 
-   if (btrfs_get_subvol(mntfd, get_ri)) {
+   ret = btrfs_get_subvol(mntfd, get_ri);
+   if (ret) {
fprintf(stderr, ERROR: can't find '%s'\n,
svpath);
goto out;
}
 
-   ret = 0;
/* print the info */
printf(%s\n, fullpath);
printf(\tName: \t\t\t%s\n, get_ri.name);
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: apply realpath for btrfs fi show when mount point is given

2014-11-26 Thread Gui Hecheng
For now,
# btrfs fi show /mnt/btrfs
gives info correctly, while
# btrfs fi show /mnt/btrfs/
gives nothing.

This implies that the @realpath() function should be applied to
unify the behavior.

Made a more clear comment right above the call as well.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 cmds-filesystem.c | 60 +++
 1 file changed, 34 insertions(+), 26 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index cd6b3c6..4e6d2f6 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -901,39 +901,47 @@ static int cmd_show(int argc, char **argv)
if (strlen(search) == 0)
usage(cmd_show_usage);
type = check_arg_type(search);
+
+   /*
+* For search is a device:
+* realpath do /dev/mapper/XX = /dev/dm-X
+* which is required by BTRFS_SCAN_DEV
+* For search is a mountpoint:
+* realpath do  /mnt/btrfs/  = /mnt/btrfs
+* which shall be recognized by btrfs_scan_kernel()
+*/
+   if (!realpath(search, path)) {
+   fprintf(stderr, ERROR: Could not show %s: %s\n,
+   search, strerror(errno));
+   return 1;
+   }
+
+   search = path;
+
/*
 * needs spl handling if input arg is block dev
 * And if input arg is mount-point just print it
 * right away
 */
-   if (type == BTRFS_ARG_BLKDEV) {
-   if (where == BTRFS_SCAN_LBLKID) {
-   /* we need to do this because
-* legacy BTRFS_SCAN_DEV
-* provides /dev/dm-x paths
-*/
-   if (realpath(search, path))
-   search = path;
+   if (type == BTRFS_ARG_BLKDEV  where != BTRFS_SCAN_LBLKID) {
+   ret = get_btrfs_mount(search,
+   mp, sizeof(mp));
+   if (!ret) {
+   /* given block dev is mounted*/
+   search = mp;
+   type = BTRFS_ARG_MNTPOINT;
} else {
-   ret = get_btrfs_mount(search,
-   mp, sizeof(mp));
-   if (!ret) {
-   /* given block dev is mounted*/
-   search = mp;
-   type = BTRFS_ARG_MNTPOINT;
-   } else {
-   ret = dev_to_fsid(search, fsid);
-   if (ret) {
-   fprintf(stderr,
-   ERROR: No btrfs on 
%s\n,
-   search);
-   return 1;
-   }
-   uuid_unparse(fsid, uuid_buf);
-   search = uuid_buf;
-   type = BTRFS_ARG_UUID;
-   goto devs_only;
+   ret = dev_to_fsid(search, fsid);
+   if (ret) {
+   fprintf(stderr,
+   ERROR: No btrfs on %s\n,
+   search);
+   return 1;
}
+   uuid_unparse(fsid, uuid_buf);
+   search = uuid_buf;
+   type = BTRFS_ARG_UUID;
+   goto devs_only;
}
}
}
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-11-26 Thread Miao Xie
On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote:
 On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
 On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 The increase/decrease of bio counter is on the I/O path, so we should
 use io_schedule() instead of schedule(), or the deadlock might be
 triggered by the pending I/O in the plug list. io_schedule() can help
 us because it will flush all the pending I/O before the task is going
 to sleep.

 Can you please describe this deadlock in more detail?  schedule() also 
 triggers
 a flush of the plug list, and if that's no longer sufficient we can run into 
 other
 problems (especially with preemption on).
 
 Sorry for my miss. I forgot to check the current implementation of 
 schedule(), which flushes the plug list unconditionally. Please ignore this 
 patch.

I have updated my raid56-scrub-replace branch, please re-pull the branch.

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

 
 Thanks
 Miao
 

 -chris


 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] btrfs-progs: fix page align issue for lzo compress in restore

2014-11-26 Thread Gui Hecheng
On Tue, 2014-10-14 at 11:32 +0200, David Sterba wrote:
 On Tue, Oct 14, 2014 at 10:06:16AM +0200, Marc Dietrich wrote:
  This hasn't landed in an btrfs-progs branch I found. Any update?
 
 I had it tagged for review and found something that needs fixing. The
 PAGE_CACHE_SIZE is hardcoded to 4k, this will break on filesystems with
 larger sectors (eg. the powerpc machines). I'll scheudule the patch post
 3.17, with a fix.

Hi David,

I note that this patch is not yet in the latest integration, how's the
fix going?

-Gui

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums

2014-11-26 Thread Liu Bo
On Thu, Nov 27, 2014 at 12:55:27AM +0500, Roman Mamedov wrote:
 Hello,
 
 I used btrfs-convert to switch my FS from Ext4 to Btrfs. As it was a rather
 large 10 TB filesystem, to save on the conversion time, I used the -d,
 disable data checksum option of btrfs-convert.
 
 Turns out now I can't cp --reflink any files that were already on the FS
 prior to conversion. The error message from cp is failed to clone [...]
 Invalid argument.
 
 I assume this is because of the lack of checksums; the only way to make old
 files cloneable is to plain copy them to a different place and then delete the
 originals, but that's what I was trying to avoid in the first place.
 
 Also I thought maybe defragmenting will help, but nope, doesn't seem to be the
 case, even ordering it to recompress data to a different method doesn't fix
 the problem. (Even if it did, it's still a lot of unnecessary rewriting).
 
 Is there really a good reason to stop these files without checksums from being
 cloneable? It's not like they have the noCoW attribute, so I'd assume any new
 write to these files would cause a CoW and proper checksums for all new blocks
 anyways.

Just FYI, I recently send a patch[1] to fix btrfs-convert's checksum
problem, it'll produce checksum for empty extents, which makes slow
btrrfs-convert even slower.

With this patch, you may try to convert your ext4 without disabling checksum 
and see if time is improved. 

[1]: Btrfs-progs: fix a bug of converting sparse ext2/3/4
https://patchwork.kernel.org/patch/5374741/

thanks,
-liubo
 
 -- 
 With respect,
 Roman
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-26 Thread Liu Bo
On Tue, Nov 25, 2014 at 05:39:05PM +0100, David Sterba wrote:
 On Mon, Nov 24, 2014 at 01:23:05PM +0800, Liu Bo wrote:
  This brings a strong-but-slow checksum algorithm, sha256.
  
  Actually btrfs used sha256 at the early time, but then moved to crc32c for
  performance purposes.
  
  As crc32c is sort of weak due to its hash collision issue, we need a 
  stronger
  algorithm as an alternative.
  
  Users can choose sha256 from mkfs.btrfs via
  
  $ mkfs.btrfs -C 256 /device
 
 There's already some good feedback so I'll try to cover what hasn't been
 mentioned yet.
 
 I think it's better to separate the preparatory works from adding the
 algorithm itself. The former can be merged (and tested) independently.

That's a good point.

 
 There are several checksum algorithms that trade off speed and strength
 so we may want to support more than just sha256. Easy to add but I'd
 rather see them added in all at once than one by one.
 
 Another question is if we'd like to use different checksum for data and
 metadata. This would not cost any format change if we use the 2 bytes in
 super block csum_type.

Yes, but breaking it into meta_csum_type and data_csum_type will need a
imcompat flag.

 
 
 Optional/crazy/format change stuff:
 
 * per-file checksum algorithm - unlike compression, the whole file would
   have to use the same csum algo
   reflink would work iff the algos match
   snapshotting is unaffected
 
 * per-subvolume checksum algorithm - specify the csum type at creation
   time, or afterwards unless it's modified

I thought about this before, if we enable this, a few cases need to be dealt
with(at least),
1. convert file data's csum from one algorithm to another
2. to make different checksum length co-exist, we can either use different
   key.type for different algorithms, or pack checksum into a new structure that
   has algorithm types(and length).

thanks,
-liubo
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS messes up snapshot LV with origin

2014-11-26 Thread Zygo Blaxell
On Wed, Nov 26, 2014 at 06:19:05PM +0100, Goffredo Baroncelli wrote:
 On 11/25/2014 11:21 PM, Zygo Blaxell wrote:
   However I still doesn't understood why you want btrfs-w/multiple disk 
   over LVM ?
  I want to split a few disks into partitions, but I want to create,
  move, and resize the partitions from time to time.  Only LVM can do
  that without taking the machine down, reducing RAID integrity levels,
  hotplugging drives, or leaving installed drives idle most of the time.
  
  I want btrfs-raid1 because of its ability to replace corrupted or lost
  data from one disk using the other.  If I run a single-volume btrfs
  on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage
  stack), I can detect lost data, but not replace it automatically from
  the other mirror.
 OK, now I have understood.
 
 Anyway as workaround, take in account that you can pass explicitly the
 devices as:
 
 mount -o device=/dev/sda,device=/dev/sdb,device=/dev/sdc /dev/sdd /mnt
 
 (supposing that the filesystem is on /dev/sda.../dev/sdd)
 
 I am working to a mount.btrfs helper. The aim of this helper is to manage
 the assembling of multiple devices; the main points will be:
 - wait until all the devices appeared

...and make sure there are no duplicate UUIDs.

 - allow (if required) to mount in degraded mode after a timeout

This is a terrible idea with current btrfs, at least for read-write
degraded mounting (fallback to read-only degraded would be OK).
Mounting a filesystem read-write and degraded is something you only want
to do immediately before you replace all the missing disks and bring the
filesystem up to a non-degraded space and after you've ensured that the
missing disks can never, ever come back; otherwise, btrfs eats your data
in a slightly different way than we have discussed so far...

 - at this point it could/should also skip the lvm-snapshotted devices (but 
 before 
 I have to know how recognize these) 

You don't have to recognize them as snapshots (and it's probably better
not to treat snapshots specially anyway--how do you know whether the
snapshot or the origin LVs are wanted for mounting?).  You just have to
detect duplicate UUIDs at the btrfs subdevice level, and if any are found,
stop immediately (or get a hint from the admin).

This is a weakness of the current udev and asynchronous device hotplug
concept:  there is no notion of bus enumeration in progress, so we can be
trying to assemble multi-device storage before we have all the devices
visible.  Assembly of aggregate storage (whatever it is--btrfs, md,
lvm2...) has to wait until all known storage buses are fully enumerated
in order to detect if there are duplicates.

 I hope to issue the patches in the next week
 
 BR
 G.Baroncelli
 
 -- 
 gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
 Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


signature.asc
Description: Digital signature


[PATCH] Btrfs-progs: add_refs_for_implied: Use root_id instead of root_bytenr.

2014-11-26 Thread Chandan Rajendra
resolve_one_root() returns the objectid of a tree rather than the logical
address of the root node. Hence using root_bytenr is misleading. Fix this.

Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com
---
 qgroup-verify.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/qgroup-verify.c b/qgroup-verify.c
index c0c61d0..c98c751 100644
--- a/qgroup-verify.c
+++ b/qgroup-verify.c
@@ -553,11 +553,11 @@ static int add_refs_for_implied(struct btrfs_fs_info 
*info, u64 bytenr,
struct tree_block *block)
 {
int ret;
-   u64 root_bytenr = resolve_one_root(bytenr);
+   u64 root_id = resolve_one_root(bytenr);
struct btrfs_root *root;
struct btrfs_key key;
 
-   key.objectid = root_bytenr;
+   key.objectid = root_id;
key.type = BTRFS_ROOT_ITEM_KEY;
key.offset = (u64)-1;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html