Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale
On Wed, Nov 26, 2014 at 10:48:51AM +1100, Dave Chinner wrote: No abuse necessary at all. Just a different inode_dirtied_after() check is requires if the inode is on the time dirty list in move_expired_inodes(). I'm still not sure what you have in mind here. When would this be checked? It sounds like you want to set a timeout such that when an inode which had its timestamps updated lazily 24 hours earlier, the inode would get written out. Yes? But that implies something is going to have to scan the list of inodes on the dirty time list periodically. When are you proposing that this take place? The various approaches that come to mind all seem more complex than what I have in this patch 3 of 4, and I'm not sure it's worth the complexity. - Ted -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH-v4 4/7] vfs: add lazytime tracepoints for better debugging
Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/fs-writeback.c | 1 + fs/inode.c| 5 + 2 files changed, 6 insertions(+) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 529480a..3d87174 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -27,6 +27,7 @@ #include linux/backing-dev.h #include linux/tracepoint.h #include linux/device.h +#include trace/events/fs.h #include internal.h /* diff --git a/fs/inode.c b/fs/inode.c index 37c0429..b2fea60 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -20,6 +20,9 @@ #include linux/list_lru.h #include internal.h +#define CREATE_TRACE_POINTS +#include trace/events/fs.h + /* * Inode locking rules: * @@ -1443,6 +1446,7 @@ retry: inode-i_op-write_time(inode); else if (inode-i_sb-s_op-write_inode) mark_inode_dirty_sync(inode); + trace_fs_lazytime_iput(inode); goto retry; } iput_final(inode); @@ -1561,6 +1565,7 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) inode-i_ts_dirty_day = days_since_boot; spin_unlock(inode-i_lock); inode_requeue_dirtytime(inode); + trace_fs_lazytime_defer(inode); return 0; } force_dirty: -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH-v4 3/7] vfs: don't let the dirty time inodes get more than a day stale
Guarantee that the on-disk timestamps will be no more than 24 hours stale. Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/fs-writeback.c | 1 + fs/inode.c | 18 ++ include/linux/fs.h | 1 + 3 files changed, 20 insertions(+) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index ef8c5d8..529480a 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1143,6 +1143,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) if (flags (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) { trace_writeback_dirty_inode_start(inode, flags); + inode-i_ts_dirty_day = 0; if (sb-s_op-dirty_inode) sb-s_op-dirty_inode(inode, flags); diff --git a/fs/inode.c b/fs/inode.c index 9e464cc..37c0429 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1510,6 +1510,8 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode, */ static int update_time(struct inode *inode, struct timespec *time, int flags) { + struct timespec uptime; + unsigned short days_since_boot; int ret; if (inode-i_op-update_time) { @@ -1526,11 +1528,26 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) if (flags S_MTIME) inode-i_mtime = *time; } + /* +* If i_ts_dirty_day is zero, then either we have not deferred +* timestamp updates, or the system has been up for less than +* a day (so days_since_boot is zero), so we defer timestamp +* updates in that case and set the I_DIRTY_TIME flag. If a +* day or more has passed, then i_ts_dirty_day will be +* different from days_since_boot, and then we should update +* the on-disk inode and then we can clear i_ts_dirty_day. +*/ if ((inode-i_sb-s_flags MS_LAZYTIME) !(flags S_VERSION) !(inode-i_state (I_DIRTY_SYNC | I_DIRTY_DATASYNC))) { if (inode-i_state I_DIRTY_TIME) return 0; + get_monotonic_boottime(uptime); + days_since_boot = div_u64(uptime.tv_sec, 86400); + if (inode-i_ts_dirty_day + (inode-i_ts_dirty_day != days_since_boot)) + goto force_dirty; + spin_lock(inode-i_lock); if (inode-i_state (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) { spin_unlock(inode-i_lock); @@ -1541,6 +1558,7 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) return 0; } inode-i_state |= I_DIRTY_TIME; + inode-i_ts_dirty_day = days_since_boot; spin_unlock(inode-i_lock); inode_requeue_dirtytime(inode); return 0; diff --git a/include/linux/fs.h b/include/linux/fs.h index 55cf34d..d0a2181 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -575,6 +575,7 @@ struct inode { struct timespec i_ctime; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ unsigned short i_bytes; + unsigned short i_ts_dirty_day; unsigned inti_blkbits; blkcnt_ti_blocks; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH-v4 5/7] vfs: add find_active_inode_nowait() function
Add a new function find_active_inode_nowait() which will never block. If there is an inode being freed or is still being initialized, this function will return NULL instead of blocking waiting for an inode to be freed or to finish initializing. Hence, a negative return from this function does not mean that inode number is free for use. It is useful for callers that want to opportunistically do some work on an inode only if it is present and available in the cache, and where blocking is not an option. Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/inode.c | 36 include/linux/fs.h | 2 ++ 2 files changed, 38 insertions(+) diff --git a/fs/inode.c b/fs/inode.c index b2fea60..0b4c6ae 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1283,6 +1283,42 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino) } EXPORT_SYMBOL(ilookup); +/** + * find_active_inode_nowait - find an active inode in the inode cache + * @sb:super block of file system to search + * @ino: inode number to search for + * + * Search for an active inode @ino in the inode cache, and if the + * inode is in the cache, the inode is returned with an incremented + * reference count. If the inode is being freed or is newly + * initialized, return nothing instead of trying to wait for the inode + * initialization or destruction to be complete. + */ +struct inode *find_active_inode_nowait(struct super_block *sb, + unsigned long ino) +{ + struct hlist_head *head = inode_hashtable + hash(sb, ino); + struct inode *inode, *ret_inode = NULL; + + spin_lock(inode_hash_lock); + hlist_for_each_entry(inode, head, i_hash) { + if ((inode-i_ino != ino) || + (inode-i_sb != sb)) + continue; + spin_lock(inode-i_lock); + if ((inode-i_state (I_FREEING | I_WILL_FREE | I_NEW)) == 0) { + __iget(inode); + ret_inode = inode; + } + spin_unlock(inode-i_lock); + goto out; + } +out: + spin_unlock(inode_hash_lock); + return ret_inode; +} +EXPORT_SYMBOL(find_active_inode_nowait); + int insert_inode_locked(struct inode *inode) { struct super_block *sb = inode-i_sb; diff --git a/include/linux/fs.h b/include/linux/fs.h index d0a2181..dc615ec 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2419,6 +2419,8 @@ extern struct inode *ilookup(struct super_block *sb, unsigned long ino); extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *); extern struct inode * iget_locked(struct super_block *, unsigned long); +extern struct inode *find_active_inode_nowait(struct super_block *, + unsigned long); extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *); extern int insert_inode_locked(struct inode *); #ifdef CONFIG_DEBUG_LOCK_ALLOC -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH-v4 7/7] btrfs: add an is_readonly() so btrfs can use common code for update_time()
The only reason btrfs cloned code from the VFS layer was so it could add a check to see if a subvolume is read-ony. Instead of doing that, let's add a new inode operation which allows a file system to return an error if the inode is read-only, and use that in update_time(). There may be other places where the VFS layer may want to know that btrfs would want to treat an inode is read-only. With this commit, there are no remaining users of update_time() in the inode operations structure, so we can remove it and simply things further. Signed-off-by: Theodore Ts'o ty...@mit.edu Cc: linux-btrfs@vger.kernel.org Reviewed-by: David Sterba dste...@suse.cz --- fs/btrfs/inode.c | 26 ++ fs/inode.c | 22 +++--- include/linux/fs.h | 2 +- 3 files changed, 18 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index a5e0d0d..0bfe3a4 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5554,26 +5554,12 @@ static int btrfs_dirty_inode(struct inode *inode) return ret; } -/* - * This is a copy of file_update_time. We need this so we can return error on - * ENOSPC for updating the inode in the case of file write and mmap writes. - */ -static int btrfs_update_time(struct inode *inode, struct timespec *now, -int flags) +static int btrfs_is_readonly(struct inode *inode) { struct btrfs_root *root = BTRFS_I(inode)-root; if (btrfs_root_readonly(root)) return -EROFS; - - if (flags S_VERSION) - inode_inc_iversion(inode); - if (flags S_CTIME) - inode-i_ctime = *now; - if (flags S_MTIME) - inode-i_mtime = *now; - if (flags S_ATIME) - inode-i_atime = *now; return 0; } @@ -9466,8 +9452,8 @@ static const struct inode_operations btrfs_dir_inode_operations = { .permission = btrfs_permission, .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, - .update_time= btrfs_update_time, .write_time = btrfs_write_time, + .is_readonly= btrfs_is_readonly, .tmpfile= btrfs_tmpfile, }; static const struct inode_operations btrfs_dir_ro_inode_operations = { @@ -9475,8 +9461,8 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = { .permission = btrfs_permission, .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, - .update_time= btrfs_update_time, .write_time = btrfs_write_time, + .is_readonly= btrfs_is_readonly, }; static const struct file_operations btrfs_dir_file_operations = { @@ -9546,8 +9532,8 @@ static const struct inode_operations btrfs_file_inode_operations = { .fiemap = btrfs_fiemap, .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, - .update_time= btrfs_update_time, .write_time = btrfs_write_time, + .is_readonly= btrfs_is_readonly, }; static const struct inode_operations btrfs_special_inode_operations = { .getattr= btrfs_getattr, @@ -9559,8 +9545,8 @@ static const struct inode_operations btrfs_special_inode_operations = { .removexattr= btrfs_removexattr, .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, - .update_time= btrfs_update_time, .write_time = btrfs_write_time, + .is_readonly= btrfs_is_readonly, }; static const struct inode_operations btrfs_symlink_inode_operations = { .readlink = generic_readlink, @@ -9573,8 +9559,8 @@ static const struct inode_operations btrfs_symlink_inode_operations = { .getxattr = btrfs_getxattr, .listxattr = btrfs_listxattr, .removexattr= btrfs_removexattr, - .update_time= btrfs_update_time, .write_time = btrfs_write_time, + .is_readonly= btrfs_is_readonly, }; const struct dentry_operations btrfs_dentry_operations = { diff --git a/fs/inode.c b/fs/inode.c index 0b4c6ae..e29bd2d 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1554,20 +1554,20 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) unsigned short days_since_boot; int ret; - if (inode-i_op-update_time) { - ret = inode-i_op-update_time(inode, time, flags); + if (inode-i_op-is_readonly) { + ret = inode-i_op-is_readonly(inode); if (ret) return ret; - } else { - if (flags S_ATIME) - inode-i_atime = *time; - if (flags S_VERSION) - inode_inc_iversion(inode); - if (flags S_CTIME) - inode-i_ctime = *time; - if (flags S_MTIME) - inode-i_mtime = *time; } + if (flags S_ATIME)
[PATCH-v4 0/7] add support for a lazytime mount option
This is an updated version of what had originally been an ext4-specific patch which significantly improves performance by lazily writing timestamp updates (and in particular, mtime updates) to disk. The in-memory timestamps are always correct, but they are only written to disk when required for correctness. This provides a huge performance boost for ext4 due to how it handles journalling, but it's valuable for all file systems running on flash storage or drive-managed SMR disks by reducing the metadata write load. So upon request, I've moved the functionality to the VFS layer. Once the /sbin/mount program adds support for MS_LAZYTIME, all file systems should be able to benefit from this optimization. There is still an ext4-specific optimization, which may be applicable for other file systems which store more than one inode in a block, but it will require file system specific code. It is purely optional, however. Please note the changes to update_time() and the new write_time() inode operations functions, which impact btrfs and xfs. The changes are fairly simple, but I would appreciate confirmation from the btrfs and xfs teams that I got things right. Thanks!! Any objections to my carrying these patches in the ext4 git tree? Changes since -v3: - inodes with I_DIRTY_TIME set are placed on a new bdi list, b_dirty_time. This allows filesystem-level syncs to more easily iterate over those inodes that need to have their timestamps written to disk. - dirty timestamps will be written out asynchronously on the final iput, instead of when the inode gets evicted. - separate the definition of the new function find_active_inode_nowait() to a separate patch - create separate flag masks: I_DIRTY_WB and I_DIRTY_INODE, which indicate whether the inode needs to be on the write back lists, or whether the inode itself is dirty, while I_DIRTY means any one of the inode dirty flags are set. This simplifies the fs writeback logic which needs to test for different combinations of the inode dirty flags in different places. Changes since -v2: - If update_time() updates i_version, it will not use lazytime (i..e, the inode will be marked dirty so the change will be persisted on to disk sooner rather than later). Yes, this eliminates the benefits of lazytime if the user is experting the file system via NFSv4. Sad, but NFS's requirements seem to mandate this. - Fix time wrapping bug 49 days after the system boots (on a system with a 32-bit jiffies). Use get_monotonic_boottime() instead. - Clean up type warning in include/tracing/ext4.h - Added explicit parenthesis for stylistic reasons - Added an is_readonly() inode operations method so btrfs doesn't have to duplicate code in update_time(). Changes since -v1: - Added explanatory comments in update_time() regarding i_ts_dirty_days - Fix type used for days_since_boot - Improve SMP scalability in update_time and ext4_update_other_inodes_time - Added tracepoints to help test and characterize how often and under what circumstances inodes have their timestamps lazily updated Theodore Ts'o (7): vfs: split update_time() into update_time() and write_time() vfs: add support for a lazytime mount option vfs: don't let the dirty time inodes get more than a day stale vfs: add lazytime tracepoints for better debugging vfs: add find_active_inode_nowait() function ext4: add support for a lazytime mount option btrfs: add an is_readonly() so btrfs can use common code for update_time() Documentation/filesystems/Locking | 2 + fs/btrfs/inode.c | 34 +--- fs/ext4/inode.c | 49 - fs/ext4/super.c | 9 +++ fs/fs-writeback.c | 57 +-- fs/inode.c| 113 +++--- fs/proc_namespace.c | 1 + fs/sync.c | 8 +++ fs/xfs/xfs_iops.c | 39 ++--- include/linux/backing-dev.h | 1 + include/linux/fs.h| 17 +- include/trace/events/ext4.h | 30 ++ include/uapi/linux/fs.h | 1 + mm/backing-dev.c | 9 ++- 14 files changed, 306 insertions(+), 64 deletions(-) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
In preparation for adding support for the lazytime mount option, we need to be able to separate out the update_time() and write_time() inode operations. Currently, only btrfs and xfs uses update_time(). We needed to preserve update_time() because btrfs wants to have a special btrfs_root_readonly() check; otherwise we could drop the update_time() inode operation entirely. Signed-off-by: Theodore Ts'o ty...@mit.edu Cc: x...@oss.sgi.com Cc: linux-btrfs@vger.kernel.org Acked-by: David Sterba dste...@suse.cz --- Documentation/filesystems/Locking | 2 ++ fs/btrfs/inode.c | 10 ++ fs/inode.c| 29 ++--- fs/xfs/xfs_iops.c | 39 --- include/linux/fs.h| 1 + 5 files changed, 47 insertions(+), 34 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index b30753c..e49861d 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -63,6 +63,7 @@ prototypes: int (*removexattr) (struct dentry *, const char *); int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); void (*update_time)(struct inode *, struct timespec *, int); + void (*write_time)(struct inode *); int (*atomic_open)(struct inode *, struct dentry *, struct file *, unsigned open_flag, umode_t create_mode, int *opened); @@ -95,6 +96,7 @@ listxattr:no removexattr: yes fiemap:no update_time: no +write_time:no atomic_open: yes tmpfile: no dentry_open: no diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d23362f..a5e0d0d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now, inode-i_mtime = *now; if (flags S_ATIME) inode-i_atime = *now; + return 0; +} + +static int btrfs_write_time(struct inode *inode) +{ return btrfs_dirty_inode(inode); } @@ -9462,6 +9467,7 @@ static const struct inode_operations btrfs_dir_inode_operations = { .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, .update_time= btrfs_update_time, + .write_time = btrfs_write_time, .tmpfile= btrfs_tmpfile, }; static const struct inode_operations btrfs_dir_ro_inode_operations = { @@ -9470,6 +9476,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = { .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, .update_time= btrfs_update_time, + .write_time = btrfs_write_time, }; static const struct file_operations btrfs_dir_file_operations = { @@ -9540,6 +9547,7 @@ static const struct inode_operations btrfs_file_inode_operations = { .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, .update_time= btrfs_update_time, + .write_time = btrfs_write_time, }; static const struct inode_operations btrfs_special_inode_operations = { .getattr= btrfs_getattr, @@ -9552,6 +9560,7 @@ static const struct inode_operations btrfs_special_inode_operations = { .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, .update_time= btrfs_update_time, + .write_time = btrfs_write_time, }; static const struct inode_operations btrfs_symlink_inode_operations = { .readlink = generic_readlink, @@ -9565,6 +9574,7 @@ static const struct inode_operations btrfs_symlink_inode_operations = { .listxattr = btrfs_listxattr, .removexattr= btrfs_removexattr, .update_time= btrfs_update_time, + .write_time = btrfs_write_time, }; const struct dentry_operations btrfs_dentry_operations = { diff --git a/fs/inode.c b/fs/inode.c index 26753ba..8f5c4b5 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode, */ static int update_time(struct inode *inode, struct timespec *time, int flags) { - if (inode-i_op-update_time) - return inode-i_op-update_time(inode, time, flags); - - if (flags S_ATIME) - inode-i_atime = *time; - if (flags S_VERSION) - inode_inc_iversion(inode); - if (flags S_CTIME) - inode-i_ctime = *time; - if (flags S_MTIME) - inode-i_mtime = *time; + int ret; + + if (inode-i_op-update_time) { + ret = inode-i_op-update_time(inode, time, flags); + if (ret) + return ret; + } else { + if (flags S_ATIME) + inode-i_atime = *time; + if (flags S_VERSION) +
[PATCH-v4 6/7] ext4: add support for a lazytime mount option
Add an optimization for the MS_LAZYTIME mount option so that we will opportunistically write out any inodes with the I_DIRTY_TIME flag set in a particular inode table block when we need to update some inode in that inode table block anyway. Also add some temporary code so that we can set the lazytime mount option without needing a modified /sbin/mount program which can set MS_LAZYTIME. We can eventually make this go away once util-linux has added support. Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/ext4/inode.c | 49 ++--- fs/ext4/super.c | 9 + include/trace/events/ext4.h | 30 +++ 3 files changed, 85 insertions(+), 3 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5653fa4..8308c82 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle, } /* + * Opportunistically update the other time fields for other inodes in + * the same inode table block. + */ +static void ext4_update_other_inodes_time(struct super_block *sb, + unsigned long orig_ino, char *buf) +{ + struct ext4_inode_info *ei; + struct ext4_inode *raw_inode; + unsigned long ino; + struct inode*inode; + int i, inodes_per_block = EXT4_SB(sb)-s_inodes_per_block; + int inode_size = EXT4_INODE_SIZE(sb); + + ino = orig_ino ~(inodes_per_block - 1); + for (i = 0; i inodes_per_block; i++, ino++, buf += inode_size) { + if (ino == orig_ino) + continue; + inode = find_active_inode_nowait(sb, ino); + if (!inode || + (inode-i_state I_DIRTY_TIME) == 0 || + !spin_trylock(inode-i_lock)) { + iput(inode); + continue; + } + inode-i_state = ~I_DIRTY_TIME; + inode-i_ts_dirty_day = 0; + spin_unlock(inode-i_lock); + inode_requeue_dirtytime(inode); + + ei = EXT4_I(inode); + raw_inode = (struct ext4_inode *) buf; + + spin_lock(ei-i_raw_lock); + EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); + ext4_inode_csum_set(inode, raw_inode, ei); + spin_unlock(ei-i_raw_lock); + trace_ext4_other_inode_update_time(inode, orig_ino); + iput(inode); + } +} + + +/* * Post the struct inode info into an on-disk inode location in the * buffer-cache. This gobbles the caller's reference to the * buffer_head in the inode location struct. @@ -4237,7 +4282,6 @@ static int ext4_do_update_inode(handle_t *handle, for (block = 0; block EXT4_N_BLOCKS; block++) raw_inode-i_block[block] = ei-i_data[block]; } - if (likely(!test_opt2(inode-i_sb, HURD_COMPAT))) { raw_inode-i_disk_version = cpu_to_le32(inode-i_version); if (ei-i_extra_isize) { @@ -4248,10 +4292,9 @@ static int ext4_do_update_inode(handle_t *handle, cpu_to_le16(ei-i_extra_isize); } } - ext4_inode_csum_set(inode, raw_inode, ei); - spin_unlock(ei-i_raw_lock); + ext4_update_other_inodes_time(inode-i_sb, inode-i_ino, bh-b_data); BUFFER_TRACE(bh, call ext4_handle_dirty_metadata); rc = ext4_handle_dirty_metadata(handle, NULL, bh); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 58859bc..93a2b7a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1132,6 +1132,7 @@ enum { Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err, Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit, + Opt_lazytime, Opt_nolazytime, Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity, Opt_inode_readahead_blks, Opt_journal_ioprio, Opt_dioread_nolock, Opt_dioread_lock, @@ -1195,6 +1196,8 @@ static const match_table_t tokens = { {Opt_i_version, i_version}, {Opt_stripe, stripe=%u}, {Opt_delalloc, delalloc}, + {Opt_lazytime, lazytime}, + {Opt_nolazytime, nolazytime}, {Opt_nodelalloc, nodelalloc}, {Opt_removed, mblk_io_submit}, {Opt_removed, nomblk_io_submit}, @@ -1452,6 +1455,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token, case Opt_i_version: sb-s_flags |= MS_I_VERSION; return 1; + case Opt_lazytime: + sb-s_flags |= MS_LAZYTIME; + return 1; + case Opt_nolazytime: + sb-s_flags
[PATCH-v4 2/7] vfs: add support for a lazytime mount option
Add a new mount option which enables a new lazytime mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) if the inode needs to be updated for some non-time related change, (b) if userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory. This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests via a fsync(2) call. For workloads which feature a large number of random write to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDD's, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact 99.9 percentile latencies --- which is a very big deal for web serving tiers (for example). Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/fs-writeback.c | 55 - fs/inode.c | 43 ++- fs/proc_namespace.c | 1 + fs/sync.c | 8 +++ include/linux/backing-dev.h | 1 + include/linux/fs.h | 11 +++-- include/uapi/linux/fs.h | 1 + mm/backing-dev.c| 9 ++-- 8 files changed, 113 insertions(+), 16 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index ef9bef1..ef8c5d8 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -397,7 +397,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb, * shot. If still dirty, it will be redirty_tail()'ed below. Update * the dirty time to prevent enqueue and sync it again. */ - if ((inode-i_state I_DIRTY) + if ((inode-i_state I_DIRTY_WB) (wbc-sync_mode == WB_SYNC_ALL || wbc-tagged_writepages)) inode-dirtied_when = jiffies; @@ -428,13 +428,15 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb, */ redirty_tail(inode, wb); } - } else if (inode-i_state I_DIRTY) { + } else if (inode-i_state I_DIRTY_WB) { /* * Filesystems can dirty the inode during writeback operations, * such as delayed allocation during submission or metadata * updates after data IO completion. */ redirty_tail(inode, wb); + } else if (inode-i_state I_DIRTY_TIME) { + list_move(inode-i_wb_list, wb-b_dirty_time); } else { /* The inode is clean. Remove from writeback lists. */ list_del_init(inode-i_wb_list); @@ -482,11 +484,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) /* Clear I_DIRTY_PAGES if we've written out all dirty pages */ if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) inode-i_state = ~I_DIRTY_PAGES; - dirty = inode-i_state I_DIRTY; - inode-i_state = ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC); + dirty = inode-i_state I_DIRTY_INODE; + inode-i_state = ~I_DIRTY_INODE; spin_unlock(inode-i_lock); /* Don't write the inode if only I_DIRTY_PAGES was set */ - if (dirty (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) { + if (dirty) { int err = write_inode(inode, wbc); if (ret == 0) ret = err; @@ -1162,7 +1164,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) spin_lock(inode-i_lock); if ((inode-i_state flags) != flags) { - const int was_dirty = inode-i_state I_DIRTY; + const int was_dirty = inode-i_state I_DIRTY_WB; inode-i_state |= flags; @@ -1224,6 +1226,24 @@ out_unlock_inode: } EXPORT_SYMBOL(__mark_inode_dirty); +void inode_requeue_dirtytime(struct inode *inode) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + + spin_lock(bdi-wb.list_lock); + spin_lock(inode-i_lock); + if ((inode-i_state I_DIRTY_WB) == 0) { + if (inode-i_state I_DIRTY_TIME) + list_move(inode-i_wb_list, bdi-wb.b_dirty_time); + else + list_del_init(inode-i_wb_list); + } + spin_unlock(inode-i_lock); + spin_unlock(bdi-wb.list_lock); + +} +EXPORT_SYMBOL(inode_requeue_dirtytime); + static void wait_sb_inodes(struct super_block *sb) { struct inode *inode, *old_inode = NULL; @@ -1277,6 +1297,28 @@ static void wait_sb_inodes(struct super_block *sb) iput(old_inode); } +/* + * Take all of the indoes on the dirty_time list, and mark them as + * dirty,
resetting device stats
When running Debian kernel version 3.16.0-4-amd64 and btrfs-tools version 3.17-1.1 I ran a btrfs replace operation to replace a 3TB disk that was giving read errors with a new 4TB disk. After the replace the btrfs device stats command reported that the 4TB disk had 16 read errors. It appears that the device stats are copied in a replace operation. I believe that this is a bug. There is no reason to claim that the new 4TB disk had read errors when it was the old 3TB disk that had the errors. -- My Main Blog http://etbe.coker.com.au/ My Documents Bloghttp://doc.coker.com.au/ -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] Btrfs: add sha256 checksum option
On Tue, 25 Nov 2014 15:17:58 -0800, John Williams wrote: 2) CityHash : for 256-bit hashes on all systems https://code.google.com/p/cityhash/ Btw this is now superseded by Farmhash: https://code.google.com/p/farmhash/ -h -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 05/11] Btrfs, raid56: use a variant to record the operation type
We will introduce new operation type later, if we still use integer variant as bool variant to record the operation type, we would add new variant and increase the size of raid bio structure. It is not good, by this patch, we define different number for different operation, and we can just use a variant to record the operation type. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v3: - None. --- fs/btrfs/raid56.c | 30 +- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 6013d88..bfc406d 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -62,6 +62,11 @@ #define RBIO_CACHE_SIZE 1024 +enum btrfs_rbio_ops { + BTRFS_RBIO_WRITE= 0, + BTRFS_RBIO_READ_REBUILD = 1, +}; + struct btrfs_raid_bio { struct btrfs_fs_info *fs_info; struct btrfs_bio *bbio; @@ -124,7 +129,7 @@ struct btrfs_raid_bio { * differently from a parity rebuild as part of * rmw */ - int read_rebuild; + enum btrfs_rbio_ops operation; /* first bad stripe */ int faila; @@ -147,7 +152,6 @@ struct btrfs_raid_bio { atomic_t refs; - atomic_t stripes_pending; atomic_t error; @@ -583,8 +587,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last, return 0; /* reads can't merge with writes */ - if (last-read_rebuild != - cur-read_rebuild) { + if (last-operation != cur-operation) { return 0; } @@ -777,9 +780,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio) spin_unlock(rbio-bio_list_lock); spin_unlock_irqrestore(h-lock, flags); - if (next-read_rebuild) + if (next-operation == BTRFS_RBIO_READ_REBUILD) async_read_rebuild(next); - else { + else if (next-operation == BTRFS_RBIO_WRITE){ steal_rbio(rbio, next); async_rmw_stripe(next); } @@ -1713,6 +1716,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio, } bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; + rbio-operation = BTRFS_RBIO_WRITE; /* * don't plug on full rbios, just get them out the door @@ -1761,7 +1765,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) faila = rbio-faila; failb = rbio-failb; - if (rbio-read_rebuild) { + if (rbio-operation == BTRFS_RBIO_READ_REBUILD) { spin_lock_irq(rbio-bio_list_lock); set_bit(RBIO_RMW_LOCKED_BIT, rbio-flags); spin_unlock_irq(rbio-bio_list_lock); @@ -1778,7 +1782,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) * if we're rebuilding a read, we have to use * pages from the bio list */ - if (rbio-read_rebuild + if (rbio-operation == BTRFS_RBIO_READ_REBUILD (stripe == faila || stripe == failb)) { page = page_in_rbio(rbio, stripe, pagenr, 0); } else { @@ -1871,7 +1875,7 @@ pstripe: * know they can be trusted. If this was a read reconstruction, * other endio functions will fiddle the uptodate bits */ - if (!rbio-read_rebuild) { + if (rbio-operation == BTRFS_RBIO_WRITE) { for (i = 0; i nr_pages; i++) { if (faila != -1) { page = rbio_stripe_page(rbio, faila, i); @@ -1888,7 +1892,7 @@ pstripe: * if we're rebuilding a read, we have to use * pages from the bio list */ - if (rbio-read_rebuild + if (rbio-operation == BTRFS_RBIO_READ_REBUILD (stripe == faila || stripe == failb)) { page = page_in_rbio(rbio, stripe, pagenr, 0); } else { @@ -1904,7 +1908,7 @@ cleanup: cleanup_io: - if (rbio-read_rebuild) { + if (rbio-operation == BTRFS_RBIO_READ_REBUILD) { if (err == 0) cache_rbio_pages(rbio); else @@ -2042,7 +2046,7 @@ out: return 0; cleanup: - if (rbio-read_rebuild) + if (rbio-operation == BTRFS_RBIO_READ_REBUILD) rbio_orig_end_io(rbio, -EIO, 0); return -EIO; } @@ -2068,7 +2072,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, if (hold_bbio)
[PATCH v3 00/11] Implement device scrub/replace for RAID56
This patchset implement the device scrub/replace function for RAID56, the most implementation of the common data is similar to the other RAID type. The differentia or difficulty is the parity process. The basic idea is reading and check the data which has checksum out of the raid56 stripe lock, if the data is right, then lock the raid56 stripe, read out the other data in the same stripe, if no IO error happens, calculate the parity and check the original one, if the original parity is right, the scrub parity passes. or write out the new one. But if the common data(not parity) that we read out is wrong, we will try to recover it, and then check and repair the parity. And in order to avoid making the code more and more complex, we copy some code of common data process for the parity, the cleanup work is in my TODO list. We have done some test, the patchset worked well. Of course, more tests are welcome. If you are interesting to use it or test it, you can pull the patchset from https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace Changelog v2 - v3: - Fix wrong stripe start logical address calculation which was reported by Chris. - Fix unhandled raid bios for parity scrub, which are added into the plug list of the head raid bio. - Fix possible deadlock caused by the pending bios in the plug list when the io submitters were going to sleep. - Fix undealt use-after-free problem of the source device in the final device replace procedure. - Modify the code that is used to avoid the rbio merge. Changelog v1 - v2: - Change some function names in raid56.c to make them fit the code style of the raid56. Thanks Miao Miao Xie (8): Btrfs, raid56: don't change bbio and raid_map Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted Btrfs, raid56: use a variant to record the operation type Btrfs, raid56: support parity scrub on raid56 Btrfs, replace: write dirty pages into the replace target device Btrfs, replace: write raid56 parity into the replace target device Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 Btrfs: fix possible deadlock caused by pending I/O in plug list Zhao Lei (3): Btrfs: remove noused bbio_ret in __btrfs_map_block in condition Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block Btrfs, replace: enable dev-replace for raid56 fs/btrfs/dev-replace.c | 20 +- fs/btrfs/raid56.c | 746 - fs/btrfs/raid56.h | 16 +- fs/btrfs/scrub.c | 803 +++-- fs/btrfs/volumes.c | 52 +++- fs/btrfs/volumes.h | 14 +- 6 files changed, 1521 insertions(+), 130 deletions(-) -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 08/11] Btrfs, replace: write raid56 parity into the replace target device
This function reused the code of parity scrub, and we just write the right parity or corrected parity into the target device before the parity scrub end. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v3: - None. --- fs/btrfs/raid56.c | 23 +++ fs/btrfs/scrub.c | 2 +- 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 6f82c1b..cfa449f 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -2311,7 +2311,9 @@ static void raid_write_parity_end_io(struct bio *bio, int err) static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check) { + struct btrfs_bio *bbio = rbio-bbio; void *pointers[rbio-real_stripes]; + DECLARE_BITMAP(pbitmap, rbio-stripe_npages); int nr_data = rbio-nr_data; int stripe; int pagenr; @@ -2321,6 +2323,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, struct page *q_page = NULL; struct bio_list bio_list; struct bio *bio; + int is_replace = 0; int ret; bio_list_init(bio_list); @@ -2334,6 +2337,11 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, BUG(); } + if (bbio-num_tgtdevs bbio-tgtdev_map[rbio-scrubp]) { + is_replace = 1; + bitmap_copy(pbitmap, rbio-dbitmap, rbio-stripe_npages); + } + /* * Because the higher layers(scrubber) are unlikely to * use this area of the disk again soon, so don't cache @@ -2422,6 +2430,21 @@ writeback: goto cleanup; } + if (!is_replace) + goto submit_write; + + for_each_set_bit(pagenr, pbitmap, rbio-stripe_npages) { + struct page *page; + + page = rbio_stripe_page(rbio, rbio-scrubp, pagenr); + ret = rbio_add_io_page(rbio, bio_list, page, + bbio-tgtdev_map[rbio-scrubp], + pagenr, rbio-stripe_len); + if (ret) + goto cleanup; + } + +submit_write: nr_data = bio_list_size(bio_list); if (!nr_data) { /* Every parity is right */ diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 7f95afc..0ae837f 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -2714,7 +2714,7 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity) goto out; length = sparity-logic_end - sparity-logic_start + 1; - ret = btrfs_map_sblock(sctx-dev_root-fs_info, REQ_GET_READ_MIRRORS, + ret = btrfs_map_sblock(sctx-dev_root-fs_info, WRITE, sparity-logic_start, length, bbio, 0, raid_map); if (ret || !bbio || !raid_map) -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 01/11] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
From: Zhao Lei zhao...@cn.fujitsu.com bbio_ret in this condition is always !NULL because previous code already have a check-and-skip: 4908 if (!bbio_ret) 4909 goto out; Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com Signed-off-by: Miao Xie mi...@cn.fujitsu.com Reviewed-by: David Sterba dste...@suse.cz --- Changelog v1 - v3: - None. --- fs/btrfs/volumes.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f61278f..41b0dff 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5162,8 +5162,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw, BTRFS_BLOCK_GROUP_RAID6)) { u64 tmp; - if (bbio_ret ((rw REQ_WRITE) || mirror_num 1) -raid_map_ret) { + if (raid_map_ret ((rw REQ_WRITE) || mirror_num 1)) { int i, rot; /* push stripe_nr back to the start of the full stripe */ -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 02/11] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
From: Zhao Lei zhao...@cn.fujitsu.com stripe_index's value was set again in latter line: stripe_index = 0; Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com Signed-off-by: Miao Xie mi...@cn.fujitsu.com Reviewed-by: David Sterba dste...@suse.cz --- Changelog v1 - v3: - None. --- fs/btrfs/volumes.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 41b0dff..66d5035 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5167,9 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw, /* push stripe_nr back to the start of the full stripe */ stripe_nr = raid56_full_stripe_start; - do_div(stripe_nr, stripe_len); - - stripe_index = do_div(stripe_nr, nr_data_stripes(map)); + do_div(stripe_nr, stripe_len * nr_data_stripes(map)); /* RAID[56] write or recovery. Return all stripes */ num_stripes = map-num_stripes; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 03/11] Btrfs, raid56: don't change bbio and raid_map
Because we will reuse bbio and raid_map during the scrub later, it is better that we don't change any variant of bbio and don't free it at the end of IO request. So we introduced similar variants into the raid bio, and don't access those bbio's variants any more. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v3: - None. --- fs/btrfs/raid56.c | 42 +++--- 1 file changed, 23 insertions(+), 19 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 6a41631..c54b0e6 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -58,7 +58,6 @@ */ #define RBIO_CACHE_READY_BIT 3 - #define RBIO_CACHE_SIZE 1024 struct btrfs_raid_bio { @@ -146,6 +145,10 @@ struct btrfs_raid_bio { atomic_t refs; + + atomic_t stripes_pending; + + atomic_t error; /* * these are two arrays of pointers. We allocate the * rbio big enough to hold them both and setup their @@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err) bio_put(bio); - if (!atomic_dec_and_test(rbio-bbio-stripes_pending)) + if (!atomic_dec_and_test(rbio-stripes_pending)) return; err = 0; /* OK, we have read all the stripes we need to. */ - if (atomic_read(rbio-bbio-error) rbio-bbio-max_errors) + if (atomic_read(rbio-error) rbio-bbio-max_errors) err = -EIO; rbio_orig_end_io(rbio, err, 0); @@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio-faila = -1; rbio-failb = -1; atomic_set(rbio-refs, 1); + atomic_set(rbio-error, 0); + atomic_set(rbio-stripes_pending, 0); /* * the stripe_pages and bio_pages array point to the extra @@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) set_bit(RBIO_RMW_LOCKED_BIT, rbio-flags); spin_unlock_irq(rbio-bio_list_lock); - atomic_set(rbio-bbio-error, 0); + atomic_set(rbio-error, 0); /* * now that we've set rmw_locked, run through the @@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) } } - atomic_set(bbio-stripes_pending, bio_list_size(bio_list)); - BUG_ON(atomic_read(bbio-stripes_pending) == 0); + atomic_set(rbio-stripes_pending, bio_list_size(bio_list)); + BUG_ON(atomic_read(rbio-stripes_pending) == 0); while (1) { bio = bio_list_pop(bio_list); @@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed) if (rbio-faila == -1) { /* first failure on this rbio */ rbio-faila = failed; - atomic_inc(rbio-bbio-error); + atomic_inc(rbio-error); } else if (rbio-failb == -1) { /* second failure on this rbio */ rbio-failb = failed; - atomic_inc(rbio-bbio-error); + atomic_inc(rbio-error); } else { ret = -EIO; } @@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err) bio_put(bio); - if (!atomic_dec_and_test(rbio-bbio-stripes_pending)) + if (!atomic_dec_and_test(rbio-stripes_pending)) return; err = 0; - if (atomic_read(rbio-bbio-error) rbio-bbio-max_errors) + if (atomic_read(rbio-error) rbio-bbio-max_errors) goto cleanup; /* @@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio *rbio) static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio) { int bios_to_read = 0; - struct btrfs_bio *bbio = rbio-bbio; struct bio_list bio_list; int ret; int nr_pages = DIV_ROUND_UP(rbio-stripe_len, PAGE_CACHE_SIZE); @@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio) index_rbio_pages(rbio); - atomic_set(rbio-bbio-error, 0); + atomic_set(rbio-error, 0); /* * build a list of bios to read all the missing parts of this * stripe @@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio) * the bbio may be freed once we submit the last bio. Make sure * not to touch it after that */ - atomic_set(bbio-stripes_pending, bios_to_read); + atomic_set(rbio-stripes_pending, bios_to_read); while (1) { bio = bio_list_pop(bio_list); if (!bio) @@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int err) set_bio_pages_uptodate(bio); bio_put(bio); - if (!atomic_dec_and_test(rbio-bbio-stripes_pending)) + if (!atomic_dec_and_test(rbio-stripes_pending)) return; - if (atomic_read(rbio-bbio-error) rbio-bbio-max_errors) + if
[PATCH v3 06/11] Btrfs, raid56: support parity scrub on raid56
The implementation is: - Read and check all the data with checksum in the same stripe. All the data which has checksum is COW data, and we are sure that it is not changed though we don't lock the stripe. because the space of that data just can be reclaimed after the current transction is committed, and then the fs can use it to store the other data, but when doing scrub, we hold the current transaction, that is that data can not be recovered, it is safe that read and check it out of the stripe lock. - Lock the stripe - Read out all the data without checksum and parity The data without checksum and the parity may be changed if we don't lock the stripe, so we need read it in the stripe lock context. - Check the parity - Re-calculate the new parity and write back it if the old parity is not right - Unlock the stripe If we can not read out the data or the data we read is corrupted, we will try to repair it. If the repair fails. we will mark the horizontal sub-stripe(pages on the same horizontal) as corrupted sub-stripe, and we will skip the parity check and repair of that horizontal sub-stripe. And in order to skip the horizontal sub-stripe that has no data, we introduce a bitmap. If there is some data on the horizontal sub-stripe, we will the relative bit to 1, and when we check and repair the parity, we will skip those horizontal sub-stripes that the relative bits is 0. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v2 - v3: - Fix wrong stripe start logical address calculation which was reported by Chris. - Fix unhandled raid bios for parity scrub, which are added into the plug list of the head raid bio. - Modify the code that is used to avoid the rbio merge. Changelog v1 - v2: - None. --- fs/btrfs/raid56.c | 514 - fs/btrfs/raid56.h | 12 ++ fs/btrfs/scrub.c | 609 -- 3 files changed, 1115 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index bfc406d..3b99cbc 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -65,6 +65,7 @@ enum btrfs_rbio_ops { BTRFS_RBIO_WRITE= 0, BTRFS_RBIO_READ_REBUILD = 1, + BTRFS_RBIO_PARITY_SCRUB = 2, }; struct btrfs_raid_bio { @@ -123,6 +124,7 @@ struct btrfs_raid_bio { /* number of data stripes (no p/q) */ int nr_data; + int stripe_npages; /* * set if we're doing a parity rebuild * for a read from higher up, which is handled @@ -137,6 +139,7 @@ struct btrfs_raid_bio { /* second bad stripe (for raid6 use) */ int failb; + int scrubp; /* * number of pages needed to represent the full * stripe @@ -171,6 +174,11 @@ struct btrfs_raid_bio { * here for faster lookup */ struct page **bio_pages; + + /* +* bitmap to record which horizontal stripe has data +*/ + unsigned long *dbitmap; }; static int __raid56_parity_recover(struct btrfs_raid_bio *rbio); @@ -185,6 +193,10 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio); static void index_rbio_pages(struct btrfs_raid_bio *rbio); static int alloc_rbio_pages(struct btrfs_raid_bio *rbio); +static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, +int need_check); +static void async_scrub_parity(struct btrfs_raid_bio *rbio); + /* * the stripe hash table is used for locking, and to collect * bios in hopes of making a full stripe @@ -586,10 +598,20 @@ static int rbio_can_merge(struct btrfs_raid_bio *last, cur-raid_map[0]) return 0; - /* reads can't merge with writes */ - if (last-operation != cur-operation) { + /* we can't merge with different operations */ + if (last-operation != cur-operation) + return 0; + /* +* We've need read the full stripe from the drive. +* check and repair the parity and write the new results. +* +* We're not allowed to add any new bios to the +* bio list here, anyone else that wants to +* change this stripe needs to do their own rmw. +*/ + if (last-operation == BTRFS_RBIO_PARITY_SCRUB || + cur-operation == BTRFS_RBIO_PARITY_SCRUB) return 0; - } return 1; } @@ -782,9 +804,12 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio) if (next-operation == BTRFS_RBIO_READ_REBUILD) async_read_rebuild(next); - else if (next-operation == BTRFS_RBIO_WRITE){ + else if (next-operation == BTRFS_RBIO_WRITE) { steal_rbio(rbio, next); async_rmw_stripe(next); + } else if (next-operation == BTRFS_RBIO_PARITY_SCRUB) { +
[PATCH v3 11/11] Btrfs, replace: enable dev-replace for raid56
From: Zhao Lei zhao...@cn.fujitsu.com Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v3: - None. --- fs/btrfs/dev-replace.c | 5 - 1 file changed, 5 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 894796a..9f6a464 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root, struct btrfs_device *tgt_device = NULL; struct btrfs_device *src_device = NULL; - if (btrfs_fs_incompat(fs_info, RAID56)) { - btrfs_warn(fs_info, dev_replace cannot yet handle RAID5/RAID6); - return -EOPNOTSUPP; - } - switch (args-start.cont_reading_from_srcdev_mode) { case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS: case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID: -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
The increase/decrease of bio counter is on the I/O path, so we should use io_schedule() instead of schedule(), or the deadlock might be triggered by the pending I/O in the plug list. io_schedule() can help us because it will flush all the pending I/O before the task is going to sleep. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v2 - v3: - New patch to fix possible deadlock caused by the pending bios in the plug list when the io submitters were going to sleep. Changelog v1 - v2: - None. --- fs/btrfs/dev-replace.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index fa27b4e..894796a 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -928,16 +928,23 @@ void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount) wake_up(fs_info-replace_wait); } +#define btrfs_wait_event_io(wq, condition) \ +do { \ + if (condition) \ + break; \ + (void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0, \ + io_schedule()); \ +} while (0) + void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info) { - DEFINE_WAIT(wait); again: percpu_counter_inc(fs_info-bio_counter); if (test_bit(BTRFS_FS_STATE_DEV_REPLACING, fs_info-fs_state)) { btrfs_bio_counter_dec(fs_info); - wait_event(fs_info-replace_wait, - !test_bit(BTRFS_FS_STATE_DEV_REPLACING, -fs_info-fs_state)); + btrfs_wait_event_io(fs_info-replace_wait, + !test_bit(BTRFS_FS_STATE_DEV_REPLACING, + fs_info-fs_state)); goto again; } -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 07/11] Btrfs, replace: write dirty pages into the replace target device
The implementation is simple: - In order to avoid changing the code logic of btrfs_map_bio and RAID56, we add the stripes of the replace target devices at the end of the stripe array in btrfs bio, and we sort those target device stripes in the array. And we keep the number of the target device stripes in the btrfs bio. - Except write operation on RAID56, all the other operation don't take the target device stripes into account. - When we do write operation, we read the data from the common devices and calculate the parity. Then write the dirty data and new parity out, at this time, we will find the relative replace target stripes and wirte the relative data into it. Note: The function that copying old data on the source device to the target device was implemented in the past, it is similar to the other RAID type. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v3: - None. --- fs/btrfs/raid56.c | 104 + fs/btrfs/volumes.c | 26 -- fs/btrfs/volumes.h | 10 -- 3 files changed, 97 insertions(+), 43 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 3b99cbc..6f82c1b 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -124,6 +124,8 @@ struct btrfs_raid_bio { /* number of data stripes (no p/q) */ int nr_data; + int real_stripes; + int stripe_npages; /* * set if we're doing a parity rebuild @@ -631,7 +633,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio *rbio, int index) */ static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index) { - if (rbio-nr_data + 1 == rbio-bbio-num_stripes) + if (rbio-nr_data + 1 == rbio-real_stripes) return NULL; index += ((rbio-nr_data + 1) * rbio-stripe_len) @@ -974,7 +976,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, { struct btrfs_raid_bio *rbio; int nr_data = 0; - int num_pages = rbio_nr_pages(stripe_len, bbio-num_stripes); + int real_stripes = bbio-num_stripes - bbio-num_tgtdevs; + int num_pages = rbio_nr_pages(stripe_len, real_stripes); int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE); void *p; @@ -994,6 +997,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio-fs_info = root-fs_info; rbio-stripe_len = stripe_len; rbio-nr_pages = num_pages; + rbio-real_stripes = real_stripes; rbio-stripe_npages = stripe_npages; rbio-faila = -1; rbio-failb = -1; @@ -1010,10 +1014,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio-bio_pages = p + sizeof(struct page *) * num_pages; rbio-dbitmap = p + sizeof(struct page *) * num_pages * 2; - if (raid_map[bbio-num_stripes - 1] == RAID6_Q_STRIPE) - nr_data = bbio-num_stripes - 2; + if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE) + nr_data = real_stripes - 2; else - nr_data = bbio-num_stripes - 1; + nr_data = real_stripes - 1; rbio-nr_data = nr_data; return rbio; @@ -1125,7 +1129,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio, static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio) { if (rbio-faila = 0 || rbio-failb = 0) { - BUG_ON(rbio-faila == rbio-bbio-num_stripes - 1); + BUG_ON(rbio-faila == rbio-real_stripes - 1); __raid56_parity_recover(rbio); } else { finish_rmw(rbio); @@ -1186,7 +1190,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio) static noinline void finish_rmw(struct btrfs_raid_bio *rbio) { struct btrfs_bio *bbio = rbio-bbio; - void *pointers[bbio-num_stripes]; + void *pointers[rbio-real_stripes]; int stripe_len = rbio-stripe_len; int nr_data = rbio-nr_data; int stripe; @@ -1200,11 +1204,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) bio_list_init(bio_list); - if (bbio-num_stripes - rbio-nr_data == 1) { - p_stripe = bbio-num_stripes - 1; - } else if (bbio-num_stripes - rbio-nr_data == 2) { - p_stripe = bbio-num_stripes - 2; - q_stripe = bbio-num_stripes - 1; + if (rbio-real_stripes - rbio-nr_data == 1) { + p_stripe = rbio-real_stripes - 1; + } else if (rbio-real_stripes - rbio-nr_data == 2) { + p_stripe = rbio-real_stripes - 2; + q_stripe = rbio-real_stripes - 1; } else { BUG(); } @@ -1261,7 +1265,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) SetPageUptodate(p); pointers[stripe++] = kmap(p); - raid6_call.gen_syndrome(bbio-num_stripes, PAGE_SIZE, +
[PATCH v3 04/11] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the implementation is similar to the scrub on the other RAID such as RAID1, the differentia is that we don't read the data from the mirror, we use the data repair function of RAID5/6. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v3: - None. --- fs/btrfs/raid56.c | 42 +--- fs/btrfs/raid56.h | 2 +- fs/btrfs/scrub.c | 194 - fs/btrfs/volumes.c | 16 - fs/btrfs/volumes.h | 4 ++ 5 files changed, 226 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index c54b0e6..6013d88 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -58,6 +58,8 @@ */ #define RBIO_CACHE_READY_BIT 3 +#define RBIO_HOLD_BBIO_MAP_BIT 4 + #define RBIO_CACHE_SIZE 1024 struct btrfs_raid_bio { @@ -799,6 +801,21 @@ done_nolock: remove_rbio_from_cache(rbio); } +static inline void +__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need) +{ + if (need) { + kfree(raid_map); + kfree(bbio); + } +} + +static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio) +{ + __free_bbio_and_raid_map(rbio-bbio, rbio-raid_map, + !test_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags)); +} + static void __free_raid_bio(struct btrfs_raid_bio *rbio) { int i; @@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio) rbio-stripe_pages[i] = NULL; } } - kfree(rbio-raid_map); - kfree(rbio-bbio); + + free_bbio_and_raid_map(rbio); + kfree(rbio); } @@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2, GFP_NOFS); - if (!rbio) { - kfree(raid_map); - kfree(bbio); + if (!rbio) return ERR_PTR(-ENOMEM); - } bio_list_init(rbio-bio_list); INIT_LIST_HEAD(rbio-plug_list); @@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio, struct blk_plug_cb *cb; rbio = alloc_rbio(root, bbio, raid_map, stripe_len); - if (IS_ERR(rbio)) + if (IS_ERR(rbio)) { + __free_bbio_and_raid_map(bbio, raid_map, 1); return PTR_ERR(rbio); + } bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; @@ -2038,15 +2055,19 @@ cleanup: */ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, - u64 stripe_len, int mirror_num) + u64 stripe_len, int mirror_num, int hold_bbio) { struct btrfs_raid_bio *rbio; int ret; rbio = alloc_rbio(root, bbio, raid_map, stripe_len); - if (IS_ERR(rbio)) + if (IS_ERR(rbio)) { + __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio); return PTR_ERR(rbio); + } + if (hold_bbio) + set_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags); rbio-read_rebuild = 1; bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; @@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, rbio-faila = find_logical_bio_stripe(rbio, bio); if (rbio-faila == -1) { BUG(); - kfree(raid_map); - kfree(bbio); + __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio); kfree(rbio); return -EIO; } diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index ea5d73b..b310e8c 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map) int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, -u64 stripe_len, int mirror_num); +u64 stripe_len, int mirror_num, int hold_bbio); int raid56_parity_write(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, u64 stripe_len); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index efa0831..ca4b9eb 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -63,6 +63,13 @@ struct scrub_ctx; */ #define SCRUB_MAX_PAGES_PER_BLOCK 16 /* 64k per node/leaf/sector */ +struct scrub_recover { + atomic_trefs; + struct btrfs_bio*bbio; + u64 *raid_map; + u64 map_length; +}; + struct scrub_page { struct scrub_block *sblock;
Re: [RFC PATCH] Btrfs: add sha256 checksum option
On 2014/11/25 13:30, Liu Bo wrote: This is actually inspired by ZFS, who offers checksum functions ranging from the simple-and-fast fletcher2 to the slower-but-secure sha256. Back to btrfs, crc32c is the only choice. And also for the slowness of sha256, Intel has a set of instructions for it, Intel SHA Extensions, that may help a lot. I think the advantage will be in giving a choice with some strong suggestions: An example of suggestions - if using sha256 on an old or low-power CPU, detect that the CPU doesn't support the appropriate acceleration functions and print a warning at mount or a warning-and-prompt at mkfs-time. The default could even be changed based on the architecture - though I suspect crc32c is already a good default on most architectures. The choice allowance gives flexibility where admins know it optimally could be used - and David's suggestion (separate thread) would be able to take advantage of that. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] Btrfs: add sha256 checksum option
On 2014/11/25 18:47, David Sterba wrote: We could provide an interface for external applications that would make use of the strong checksums. Eg. external dedup, integrity db. The benefit here is that the checksum is always up to date, so there's no need to compute the checksums again. At the obvious cost. I can imagine some use-cases where you might even want more than one algorithm to be used and stored. Not sure if that makes me a madman, though. ;) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] Btrfs: add sha256 checksum option
On 2014-11-26 08:38, Brendan Hide wrote: On 2014/11/25 18:47, David Sterba wrote: We could provide an interface for external applications that would make use of the strong checksums. Eg. external dedup, integrity db. The benefit here is that the checksum is always up to date, so there's no need to compute the checksums again. At the obvious cost. I can imagine some use-cases where you might even want more than one algorithm to be used and stored. Not sure if that makes me a madman, though. ;) Not crazy at all, I would love to have the ability to store multiple different weak but fast hash values. For example, on my laptop, it is actually faster to compute crc32c, adler32, and md5 hashes together than it is to compute pretty much any 256-bit hash I've tried. This then brings up the issue of what to do when we try to mount such a fs on a system that doesn't support some or all of the hashes used. smime.p7s Description: S/MIME Cryptographic Signature
Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote: The increase/decrease of bio counter is on the I/O path, so we should use io_schedule() instead of schedule(), or the deadlock might be triggered by the pending I/O in the plug list. io_schedule() can help us because it will flush all the pending I/O before the task is going to sleep. Can you please describe this deadlock in more detail? schedule() also triggers a flush of the plug list, and if that's no longer sufficient we can run into other problems (especially with preemption on). -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
If the transaction handle doesn't have used blocks but has created new block groups make sure we turn the fs into readonly mode too. This is because the new block groups didn't get all their metadata persisted into the chunk and device trees, and therefore if a subsequent transaction starts, allocates space from the new block groups, writes data or metadata into that space, commits successfully and then after we unmount and mount the filesystem again, the same space can be allocated again for a new block group, resulting in file data or metadata corruption. Example where we don't abort the transaction when we fail to finish the chunk allocation (add items to the chunk and device trees) and later a future transaction where the block group is removed fails because it can't find the chunk item in the chunk tree: [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]() [25230.404301] BTRFS: Transaction aborted (error -28) [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs] [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1 [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25230.404328] 88004581bb08 813e7a13 88004581bb50 [25230.404330] 88004581bb40 810423aa a049386a ffe4 [25230.404332] a05214c0 240c 88010fc8f800 88004581bba8 [25230.404334] Call Trace: [25230.404338] [813e7a13] dump_stack+0x45/0x56 [25230.404342] [810423aa] warn_slowpath_common+0x7f/0x98 [25230.404351] [a049386a] ? __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404353] [8104240b] warn_slowpath_fmt+0x48/0x50 [25230.404362] [a049386a] __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404374] [a04a8c43] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs] [25230.404387] [a04b77fd] __btrfs_end_transaction+0x7e/0x2de [btrfs] [25230.404398] [a04b7a6d] btrfs_end_transaction+0x10/0x12 [btrfs] [25230.404408] [a04a3d64] btrfs_check_data_free_space+0x111/0x1f0 [btrfs] [25230.404421] [a04c53bd] __btrfs_buffered_write+0x160/0x48d [btrfs] [25230.404425] [811a9268] ? cap_inode_need_killpriv+0x2d/0x37 [25230.404429] [810f6501] ? get_page+0x1a/0x2b [25230.404441] [a04c7c95] btrfs_file_write_iter+0x321/0x42f [btrfs] [25230.404443] [8110f5d9] ? handle_mm_fault+0x7f3/0x846 [25230.404446] [813e98c5] ? mutex_unlock+0x16/0x18 [25230.404449] [81138d68] new_sync_write+0x7c/0xa0 [25230.404450] [81139401] vfs_write+0xb0/0x112 [25230.404452] [81139c9d] SyS_pwrite64+0x66/0x84 [25230.404454] [813ebf52] system_call_fastpath+0x16/0x1b [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]--- [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left). [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/extent-tree.c | 5 +++-- fs/btrfs/super.c | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4bf8f02..0a5e770 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, int ret = 0; list_for_each_entry_safe(block_group, tmp, trans-new_bgs, bg_list) { - list_del_init(block_group-bg_list); if (ret) - continue; + goto next; spin_lock(block_group-lock); memcpy(item, block_group-item, sizeof(item)); @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, key.objectid, key.offset); if (ret) btrfs_abort_transaction(trans, extent_root, ret); +next: + list_del_init(block_group-bg_list); } } diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index a2b97ef..c9ab88c 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -262,7 +262,7 @@ void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
[PATCH 5/6] Btrfs: fix race between writing free space cache and trimming
Trimming is completely transactionless, and the way it operates consists of hiding free space entries from a block group, perform the trim/discard and then make the free space entries visible again. Therefore while free space entry is being trimmed, we can have free space cache writing running in parallel (as part of a transaction commit) which will miss the free space entry. This means that an unmount (or crash/reboot) after that transaction commit and mount again before another transaction starts/commits, we will have some free space that won't be used again unless the free space cache is rebuilt. After the unmount, fsck (btrfsck, btrfs check) reports the issue like the following example: *** fsck.btrfs output *** checking extents checking free space cache There is no free space entry for 521764864-521781248 There is no free space entry for 521764864-1103101952 cache appears valid but isnt 29360128 Checking filesystem on /dev/sdc UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5 found 1235681286 bytes used err is -22 (...) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/free-space-cache.c | 59 ++--- fs/btrfs/free-space-cache.h | 2 ++ fs/btrfs/inode-map.c| 2 ++ 3 files changed, 60 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 16c2d39..6380863 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -31,6 +31,12 @@ #define BITS_PER_BITMAP(PAGE_CACHE_SIZE * 8) #define MAX_CACHE_BYTES_PER_GIG(32 * 1024) +struct btrfs_trim_range { + u64 start; + u64 bytes; + struct list_head list; +}; + static int link_free_space(struct btrfs_free_space_ctl *ctl, struct btrfs_free_space *info); static void unlink_free_space(struct btrfs_free_space_ctl *ctl, @@ -881,6 +887,7 @@ int write_cache_extent_entries(struct io_ctl *io_ctl, int ret; struct btrfs_free_cluster *cluster = NULL; struct rb_node *node = rb_first(ctl-free_space_offset); + struct btrfs_trim_range *trim_entry; /* Get the cluster for this block_group if it exists */ if (block_group !list_empty(block_group-cluster_list)) { @@ -916,6 +923,21 @@ int write_cache_extent_entries(struct io_ctl *io_ctl, cluster = NULL; } } + + /* +* Make sure we don't miss any range that was removed from our rbtree +* because trimming is running. Otherwise after a umount+mount (or crash +* after committing the transaction) we would leak free space and get +* an inconsistent free space cache report from fsck. +*/ + list_for_each_entry(trim_entry, ctl-trimming_ranges, list) { + ret = io_ctl_add_entry(io_ctl, trim_entry-start, + trim_entry-bytes, NULL); + if (ret) + goto fail; + *entries += 1; + } + return 0; fail: return -ENOSPC; @@ -1135,10 +1157,12 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode, io_ctl_set_generation(io_ctl, trans-transid); + mutex_lock(ctl-cache_writeout_mutex); /* Write out the extent entries in the free space cache */ ret = write_cache_extent_entries(io_ctl, ctl, block_group, entries, bitmaps, bitmap_list); + mutex_unlock(ctl-cache_writeout_mutex); if (ret) goto out_nospc; @@ -2295,6 +2319,8 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group) ctl-start = block_group-key.objectid; ctl-private = block_group; ctl-op = free_space_op; + INIT_LIST_HEAD(ctl-trimming_ranges); + mutex_init(ctl-cache_writeout_mutex); /* * we only want to have 32k of ram per block group for keeping @@ -2911,10 +2937,12 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster) static int do_trimming(struct btrfs_block_group_cache *block_group, u64 *total_trimmed, u64 start, u64 bytes, - u64 reserved_start, u64 reserved_bytes) + u64 reserved_start, u64 reserved_bytes, + struct btrfs_trim_range *trim_entry) { struct btrfs_space_info *space_info = block_group-space_info; struct btrfs_fs_info *fs_info = block_group-fs_info; + struct btrfs_free_space_ctl *ctl = block_group-free_space_ctl; int ret; int update = 0; u64 trimmed = 0; @@ -2934,7 +2962,10 @@ static int do_trimming(struct btrfs_block_group_cache *block_group, if (!ret) *total_trimmed += trimmed; + mutex_lock(ctl-cache_writeout_mutex);
[PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard
This patchset fixes several issues exposed by block group removal/allocation and trim/discard running in parallel. The first 3 patches and the last one (6) are independent and don't depend on each other. Patches 3 and 6 are not really related to trim/discard at all. I bundled all these patches into the same patchset because all these issues were triggered by the same test for xfstests that I prepared and will send out soon. The issues were triggered on a qemu/kvm guest using scsi-hd drives with discard support enabled (which makes the host do hole punching on the virtual disks' image files), 4 virtual CPUs and 4Gb of ram. Some of these issues were hard to trigger and happened like once for every 10+ runs of the test (each test run takes nearly 30 minutes with my hardware and a debug kernel). Filipe Manana (6): Btrfs: fix invalid block group rbtree access after bg is removed Btrfs: fix crash caused by block group removal Btrfs: fix freeing used extents after removing empty block group Btrfs: fix race between fs trimming and block group remove/allocation Btrfs: fix race between writing free space cache and trimming Btrfs: make btrfs_abort_transaction consider existence of new block groups fs/btrfs/ctree.h| 14 ++- fs/btrfs/disk-io.c | 14 +++ fs/btrfs/extent-tree.c | 90 ++--- fs/btrfs/free-space-cache.c | 85 -- fs/btrfs/free-space-cache.h | 2 + fs/btrfs/inode-map.c| 2 + fs/btrfs/super.c| 2 +- fs/btrfs/volumes.c | 33 + 8 files changed, 215 insertions(+), 27 deletions(-) -- 2.1.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group
There's a race between adding a block group to the list of the unused block groups and removing an unused block group (cleaner kthread) that leads to freeing extents that are in use or a crash during transaction commmit. Basically the cleaner kthread, when executing btrfs_delete_unused_bgs(), might catch the newly added block group to the list fs_info-unused_bgs and clear the range representing the whole group from fs_info-freed_extents[] before the task that added the block group to the list (running update_block_group()) marked the last freed extent as dirty in fs_info-freed_extents (pinned_extents). That is: CPU 1CPU 2 btrfs_delete_unused_bgs() update_block_group() add block group to fs_info-unused_bgs got block group from the list clear_extent_bits for the whole block group range in freed_extents[] set_extent_dirty for the range covering the freed extent in freed_extents[] (fs_info-pinned_extents) block group deleted, and a new block group with the same logical address is created reserve space from the new block group for new data or metadata - the reserved space overlaps the range specified by CPU 1 for set_extent_dirty() commit transaction find all ranges marked as dirty in fs_info-pinned_extents, clear them and add them to the free space cache Alternatively, if CPU 2 doesn't create a new block group with the same logical address, we get a crash/BUG_ON at transaction commit when unpining extent ranges because we can't find a block group for the range marked as dirty by CPU 1. Sample trace: [ 2163.426462] invalid opcode: [#1] SMP DEBUG_PAGEALLOC [ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio [ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: GW 3.17.0-rc5-btrfs-next-1+ #1 [ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [ 2163.428875] task: 88009f2c0650 ti: 8801356bc000 task.ti: 8801356bc000 [ 2163.429157] RIP: 0010:[a037728e] [a037728e] unpin_extent_range.isra.58+0x62/0x192 [btrfs] [ 2163.429562] RSP: 0018:8801356bfda8 EFLAGS: 00010246 [ 2163.429802] RAX: RBX: RCX: [ 2163.429990] RDX: 41bf RSI: 01c0 RDI: 880024307080 [ 2163.430042] RBP: 8801356bfde8 R08: 0068 R09: 88003734f118 [ 2163.430042] R10: 8801356bfcb8 R11: fb69 R12: 8800243070d0 [ 2163.430042] R13: 83c04000 R14: 8800751b0f00 R15: 880024307000 [ 2163.430042] FS: () GS:88013f40() knlGS: [ 2163.430042] CS: 0010 DS: ES: CR0: 8005003b [ 2163.430042] CR2: 7ff10eb43fc0 CR3: 04cb8000 CR4: 06f0 [ 2163.430042] Stack: [ 2163.430042] 8800243070d0 83c08000 83c07fff 88012d6bc800 [ 2163.430042] 8800243070d0 8800751b0f18 8800751b0f00 [ 2163.430042] 8801356bfe18 a037a481 83c04000 83c07fff [ 2163.430042] Call Trace: [ 2163.430042] [a037a481] btrfs_finish_extent_commit+0xac/0xbf [btrfs] [ 2163.430042] [a038c06d] btrfs_commit_transaction+0x6ee/0x882 [btrfs] [ 2163.430042] [a03881f1] transaction_kthread+0xf2/0x1a4 [btrfs] [ 2163.430042] [a03880ff] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs] [ 2163.430042] [8105966b] kthread+0xb7/0xbf [ 2163.430042] [810595b4] ? __kthread_parkme+0x67/0x67 [ 2163.430042] [813ebeac] ret_from_fork+0x7c/0xb0 [ 2163.430042] [810595b4] ? __kthread_parkme+0x67/0x67 So fix this by making update_block_group() first set the range as dirty in pinned_extents before adding the block group to the unused_bgs list. Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/extent-tree.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-)
[PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed
If we grab a block group, for example in btrfs_trim_fs(), we will be holding a reference on it but the block group can be removed after we got it (via btrfs_remove_block_group), which means it will no longer be part of the rbtree. However, btrfs_remove_block_group() was only calling rb_erase() which leaves the block group's rb_node left and right child pointers with the same content they had before calling rb_erase. This was dangerous because a call to next_block_group() would access the node's left and right child pointers (via rb_next), which can be no longer valid. Fix this by clearing a block group's node after removing it from the tree, and have next_block_group() do a tree search to get the next block group instead of using rb_next() if our block group was removed. Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/extent-tree.c | 13 + 1 file changed, 13 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 744b580..3ba65d9 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3162,7 +3162,19 @@ next_block_group(struct btrfs_root *root, struct btrfs_block_group_cache *cache) { struct rb_node *node; + spin_lock(root-fs_info-block_group_cache_lock); + + /* If our block group was removed, we need a full search. */ + if (RB_EMPTY_NODE(cache-cache_node)) { + const u64 next_bytenr = cache-key.objectid + cache-key.offset; + + spin_unlock(root-fs_info-block_group_cache_lock); + btrfs_put_block_group(cache); + cache = btrfs_lookup_first_block_group(root-fs_info, + next_bytenr); + return cache; + } node = rb_next(cache-cache_node); btrfs_put_block_group(cache); if (node) { @@ -9400,6 +9412,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, spin_lock(root-fs_info-block_group_cache_lock); rb_erase(block_group-cache_node, root-fs_info-block_group_cache_tree); + RB_CLEAR_NODE(block_group-cache_node); if (root-fs_info-first_logical_byte == block_group-key.objectid) root-fs_info-first_logical_byte = (u64)-1; -- 2.1.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h| 13 - fs/btrfs/disk-io.c | 14 ++ fs/btrfs/extent-tree.c | 24 +++- fs/btrfs/free-space-cache.c | 26 +- fs/btrfs/volumes.c | 33 ++--- 5 files changed, 100 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 7f40a65..51056c7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache { unsigned int dirty:1; unsigned int iref:1; unsigned int has_caching_ctl:1; + unsigned int removed:1; int disk_cache_state; @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache { /* For delayed block group creation or deletion of empty block groups */ struct list_head bg_list; + + atomic_t trimming; }; /* delayed seq elem */ @@ -1731,6 +1734,13 @@ struct btrfs_fs_info { /* For btrfs to record security options */ struct security_mnt_opts security_opts; + + /* +* Chunks that can't be freed yet (under a trim/discard operation) +* and will be latter freed. +*/ + rwlock_t pinned_chunks_lock; + struct list_head pinned_chunks; }; struct btrfs_subvolume_writers { @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 type, u64 chunk_objectid, u64 chunk_offset, u64 size); int btrfs_remove_block_group(struct btrfs_trans_handle *trans, -struct btrfs_root *root, u64 group_start); +struct btrfs_root *root, u64 group_start, +bool *remove_em); void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info); void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, struct btrfs_root *root); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index
[PATCH 2/6] Btrfs: fix crash caused by block group removal
If we remove a block group (because it became empty), we might have left a caching_ctl structure in fs_info-caching_block_groups that points to the block group and is accessed at transaction commit time. This results in accessing an invalid or incorrect block group. This issue became visible after Josef's patch Btrfs: remove empty block groups automatically. So if the block group is removed make sure we don't leave a dangling caching_ctl in caching_block_groups. Sample crash trace: [58380.439449] BUG: unable to handle kernel paging request at 8801446eaeb8 [58380.439707] IP: [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 8001446ea060 [58380.441220] Oops: [#1] SMP DEBUG_PAGEALLOC [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs] [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: GW 3.17.0-rc5-btrfs-next-1+ #1 [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti: 88013896c000 [58380.443238] RIP: 0010:[a03f6d05] [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.443238] RSP: 0018:88013896fdd8 EFLAGS: 00010283 [58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX: [58380.443238] RDX: RSI: 880185e16800 RDI: 8801446eaeb8 [58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09: 88013896fc60 [58380.443238] R10: 88013896fd28 R11: R12: 880222cae000 [58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15: 8801446eae00 [58380.443238] FS: () GS:88023ed8() knlGS: [58380.443238] CS: 0010 DS: ES: CR0: 8005003b [58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4: 06e0 [58380.443238] Stack: [58380.443238] 88013896fe18 a03fe2d5 880222cae850 880185e16800 [58380.443238] 88000dc41c20 8801a9ca9f00 [58380.443238] 88013896fe80 a040fbcf 88018b0dcdb0 88013ac82090 [58380.443238] Call Trace: [58380.443238] [a03fe2d5] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs] [58380.443238] [a040fbcf] btrfs_commit_transaction+0x45c/0x882 [btrfs] [58380.443238] [a040c058] transaction_kthread+0xf2/0x1a4 [btrfs] [58380.443238] [a040bf66] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs] [58380.443238] [8105966b] kthread+0xb7/0xbf [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 [58380.443238] [813ebeac] ret_from_fork+0x7c/0xb0 [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 27 +++ 2 files changed, 28 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d3ccd09..7f40a65 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache { unsigned int ro:1; unsigned int dirty:1; unsigned int iref:1; + unsigned int has_caching_ctl:1; int disk_cache_state; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3ba65d9..b7e40ef 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -607,6 +607,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache, cache-cached = BTRFS_CACHE_NO; } else { cache-cached = BTRFS_CACHE_STARTED; + cache-has_caching_ctl = 1; } } spin_unlock(cache-lock); @@ -627,6 +628,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache, cache-cached = BTRFS_CACHE_NO; } else { cache-cached = BTRFS_CACHE_STARTED; + cache-has_caching_ctl = 1; } spin_unlock(cache-lock); wake_up(caching_ctl-wait); @@ -9328,6 +9330,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, int ret; int index; int factor; + struct btrfs_caching_control *caching_ctl = NULL; root = root-fs_info-extent_root; @@ -9435,8 +9438,32 @@ int
[PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
Stress btrfs' block group allocation and deallocation while running fstrim in parallel. Part of the goal is also to get data block groups deallocated so that new metadata block groups, using the same physical device space ranges, get allocated while fstrim is running. This caused several issues ranging from invalid memory accesses, kernel crashes, metadata or data corruption, free space cache inconsistencies and free space leaks. Signed-off-by: Filipe Manana fdman...@suse.com --- tests/btrfs/082 | 148 tests/btrfs/082.out | 2 + tests/btrfs/group | 1 + 3 files changed, 151 insertions(+) create mode 100755 tests/btrfs/082 create mode 100644 tests/btrfs/082.out diff --git a/tests/btrfs/082 b/tests/btrfs/082 new file mode 100755 index 000..8ac9f06 --- /dev/null +++ b/tests/btrfs/082 @@ -0,0 +1,148 @@ +#! /bin/bash +# FSQA Test No. 082 +# +# Stress btrfs' block group allocation and deallocation while running fstrim in +# parallel. Part of the goal is also to get data block groups deallocated so +# that new metadata block groups, using the same physical device space ranges, +# get allocated while fstrim is running. This caused several issues ranging +# from invalid memory accesses, kernel crashes, metadata or data corruption, +# free space cache inconsistencies and free space leaks. +# +# These issues were fixed by the following btrfs linux kernel patches: +# +# Btrfs: fix invalid block group rbtree access after bg is removed +# Btrfs: fix crash caused by block group removal +# Btrfs: fix freeing used extents after removing empty block group +# Btrfs: fix race between fs trimming and block group remove/allocation +# Btrfs: fix race between writing free space cache and trimming +# Btrfs: make btrfs_abort_transaction consider existence of new block groups +# +# The issues were found on a qemu/kvm guest with 4 virtual CPUs, 4Gb of ram and +# scsi-hd devices with discard support enabled (that means hole punching in the +# disk's image file is performed by the host). +# +#--- +# +# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved. +# Author: Filipe Manana fdman...@suse.com +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo QA output created by $seq + +tmp=/tmp/$$ +status=1 # failure is the default! +trap _cleanup; exit \$status 0 1 2 3 15 + +_cleanup() +{ + rm -fr $tmp +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here +_need_to_be_root +_supported_fs btrfs +_supported_os Linux +_require_scratch_nocheck +_require_fstrim + +rm -f $seqres.full + +# Keep allocating and deallocating 2G of data space with the goal of creating +# and deleting 2 block groups constantly. The intention is to race with the +# fstrim loop below. +fallocate_loop() +{ + local name=$1 + while true; do + $XFS_IO_PROG -f -c falloc -k 0 2G \ + $SCRATCH_MNT/$name /dev/null + sleep 3 + $XFS_IO_PROG -c truncate 0 \ + $SCRATCH_MNT/$name /dev/null + sleep 3 + done +} + +trim_loop() +{ + while true; do + $FSTRIM_PROG $SCRATCH_MNT + done +} + +# Create a bunch of small files that get their single extent inlined in the +# btree, so that we consume a lot of metadata space and get a chance of a +# data block group getting deleted and reused for metadata later. Sometimes +# the creation of all these files succeeds other times we get ENOSPC failures +# at some point - this depends on how fast the btrfs' cleaner kthread is +# notified about empty block groups, how fast it deletes them and how fast +# the fallocate calls happen. So we don't really care if they all succeed or +# not, the goal is just to keep metadata space usage growing while data block +# groups are deleted. +create_files() +{ + local prefix=$1 + + for ((i = 1; i = 40; i++)); do + echo Creating file ${prefix}_$i $seqres.full 21 + $XFS_IO_PROG -f -c pwrite -S 0xaa 0 3900 \ + $SCRATCH_MNT/${prefix}_$i $seqres.full 21 + ret=$? +
Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal
On 11/26/2014 10:28 AM, Filipe Manana wrote: If we remove a block group (because it became empty), we might have left a caching_ctl structure in fs_info-caching_block_groups that points to the block group and is accessed at transaction commit time. This results in accessing an invalid or incorrect block group. This issue became visible after Josef's patch Btrfs: remove empty block groups automatically. So if the block group is removed make sure we don't leave a dangling caching_ctl in caching_block_groups. Sample crash trace: [58380.439449] BUG: unable to handle kernel paging request at 8801446eaeb8 [58380.439707] IP: [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 8001446ea060 [58380.441220] Oops: [#1] SMP DEBUG_PAGEALLOC [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs] [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: GW 3.17.0-rc5-btrfs-next-1+ #1 [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti: 88013896c000 [58380.443238] RIP: 0010:[a03f6d05] [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.443238] RSP: 0018:88013896fdd8 EFLAGS: 00010283 [58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX: [58380.443238] RDX: RSI: 880185e16800 RDI: 8801446eaeb8 [58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09: 88013896fc60 [58380.443238] R10: 88013896fd28 R11: R12: 880222cae000 [58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15: 8801446eae00 [58380.443238] FS: () GS:88023ed8() knlGS: [58380.443238] CS: 0010 DS: ES: CR0: 8005003b [58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4: 06e0 [58380.443238] Stack: [58380.443238] 88013896fe18 a03fe2d5 880222cae850 880185e16800 [58380.443238] 88000dc41c20 8801a9ca9f00 [58380.443238] 88013896fe80 a040fbcf 88018b0dcdb0 88013ac82090 [58380.443238] Call Trace: [58380.443238] [a03fe2d5] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs] [58380.443238] [a040fbcf] btrfs_commit_transaction+0x45c/0x882 [btrfs] [58380.443238] [a040c058] transaction_kthread+0xf2/0x1a4 [btrfs] [58380.443238] [a040bf66] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs] [58380.443238] [8105966b] kthread+0xb7/0xbf [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 [58380.443238] [813ebeac] ret_from_fork+0x7c/0xb0 [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 27 +++ 2 files changed, 28 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d3ccd09..7f40a65 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache { unsigned int ro:1; unsigned int dirty:1; unsigned int iref:1; + unsigned int has_caching_ctl:1; Don't do this, just unconditionally call get_caching_control in btrfs_remove_block_group, then if we get something we can do stuff, otherwise we can just continue. Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed
On 11/26/2014 10:28 AM, Filipe Manana wrote: If we grab a block group, for example in btrfs_trim_fs(), we will be holding a reference on it but the block group can be removed after we got it (via btrfs_remove_block_group), which means it will no longer be part of the rbtree. However, btrfs_remove_block_group() was only calling rb_erase() which leaves the block group's rb_node left and right child pointers with the same content they had before calling rb_erase. This was dangerous because a call to next_block_group() would access the node's left and right child pointers (via rb_next), which can be no longer valid. Fix this by clearing a block group's node after removing it from the tree, and have next_block_group() do a tree search to get the next block group instead of using rb_next() if our block group was removed. Signed-off-by: Filipe Manana fdman...@suse.com Reviewed-by: Josef Bacik jba...@fb.com Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group
On 11/26/2014 10:28 AM, Filipe Manana wrote: There's a race between adding a block group to the list of the unused block groups and removing an unused block group (cleaner kthread) that leads to freeing extents that are in use or a crash during transaction commmit. Basically the cleaner kthread, when executing btrfs_delete_unused_bgs(), might catch the newly added block group to the list fs_info-unused_bgs and clear the range representing the whole group from fs_info-freed_extents[] before the task that added the block group to the list (running update_block_group()) marked the last freed extent as dirty in fs_info-freed_extents (pinned_extents). That is: CPU 1CPU 2 btrfs_delete_unused_bgs() update_block_group() add block group to fs_info-unused_bgs got block group from the list clear_extent_bits for the whole block group range in freed_extents[] set_extent_dirty for the range covering the freed extent in freed_extents[] (fs_info-pinned_extents) block group deleted, and a new block group with the same logical address is created reserve space from the new block group for new data or metadata - the reserved space overlaps the range specified by CPU 1 for set_extent_dirty() commit transaction find all ranges marked as dirty in fs_info-pinned_extents, clear them and add them to the free space cache Alternatively, if CPU 2 doesn't create a new block group with the same logical address, we get a crash/BUG_ON at transaction commit when unpining extent ranges because we can't find a block group for the range marked as dirty by CPU 1. Sample trace: [ 2163.426462] invalid opcode: [#1] SMP DEBUG_PAGEALLOC [ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio [ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: GW 3.17.0-rc5-btrfs-next-1+ #1 [ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [ 2163.428875] task: 88009f2c0650 ti: 8801356bc000 task.ti: 8801356bc000 [ 2163.429157] RIP: 0010:[a037728e] [a037728e] unpin_extent_range.isra.58+0x62/0x192 [btrfs] [ 2163.429562] RSP: 0018:8801356bfda8 EFLAGS: 00010246 [ 2163.429802] RAX: RBX: RCX: [ 2163.429990] RDX: 41bf RSI: 01c0 RDI: 880024307080 [ 2163.430042] RBP: 8801356bfde8 R08: 0068 R09: 88003734f118 [ 2163.430042] R10: 8801356bfcb8 R11: fb69 R12: 8800243070d0 [ 2163.430042] R13: 83c04000 R14: 8800751b0f00 R15: 880024307000 [ 2163.430042] FS: () GS:88013f40() knlGS: [ 2163.430042] CS: 0010 DS: ES: CR0: 8005003b [ 2163.430042] CR2: 7ff10eb43fc0 CR3: 04cb8000 CR4: 06f0 [ 2163.430042] Stack: [ 2163.430042] 8800243070d0 83c08000 83c07fff 88012d6bc800 [ 2163.430042] 8800243070d0 8800751b0f18 8800751b0f00 [ 2163.430042] 8801356bfe18 a037a481 83c04000 83c07fff [ 2163.430042] Call Trace: [ 2163.430042] [a037a481] btrfs_finish_extent_commit+0xac/0xbf [btrfs] [ 2163.430042] [a038c06d] btrfs_commit_transaction+0x6ee/0x882 [btrfs] [ 2163.430042] [a03881f1] transaction_kthread+0xf2/0x1a4 [btrfs] [ 2163.430042] [a03880ff] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs] [ 2163.430042] [8105966b] kthread+0xb7/0xbf [ 2163.430042] [810595b4] ? __kthread_parkme+0x67/0x67 [ 2163.430042] [813ebeac] ret_from_fork+0x7c/0xb0 [ 2163.430042] [810595b4] ? __kthread_parkme+0x67/0x67 So fix this by making update_block_group() first set the range as dirty in pinned_extents before adding the block group to the unused_bgs list. Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/extent-tree.c | 21
Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
On 11/26/2014 10:28 AM, Filipe Manana wrote: If the transaction handle doesn't have used blocks but has created new block groups make sure we turn the fs into readonly mode too. This is because the new block groups didn't get all their metadata persisted into the chunk and device trees, and therefore if a subsequent transaction starts, allocates space from the new block groups, writes data or metadata into that space, commits successfully and then after we unmount and mount the filesystem again, the same space can be allocated again for a new block group, resulting in file data or metadata corruption. Example where we don't abort the transaction when we fail to finish the chunk allocation (add items to the chunk and device trees) and later a future transaction where the block group is removed fails because it can't find the chunk item in the chunk tree: [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]() [25230.404301] BTRFS: Transaction aborted (error -28) [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs] [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1 [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25230.404328] 88004581bb08 813e7a13 88004581bb50 [25230.404330] 88004581bb40 810423aa a049386a ffe4 [25230.404332] a05214c0 240c 88010fc8f800 88004581bba8 [25230.404334] Call Trace: [25230.404338] [813e7a13] dump_stack+0x45/0x56 [25230.404342] [810423aa] warn_slowpath_common+0x7f/0x98 [25230.404351] [a049386a] ? __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404353] [8104240b] warn_slowpath_fmt+0x48/0x50 [25230.404362] [a049386a] __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404374] [a04a8c43] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs] [25230.404387] [a04b77fd] __btrfs_end_transaction+0x7e/0x2de [btrfs] [25230.404398] [a04b7a6d] btrfs_end_transaction+0x10/0x12 [btrfs] [25230.404408] [a04a3d64] btrfs_check_data_free_space+0x111/0x1f0 [btrfs] [25230.404421] [a04c53bd] __btrfs_buffered_write+0x160/0x48d [btrfs] [25230.404425] [811a9268] ? cap_inode_need_killpriv+0x2d/0x37 [25230.404429] [810f6501] ? get_page+0x1a/0x2b [25230.404441] [a04c7c95] btrfs_file_write_iter+0x321/0x42f [btrfs] [25230.404443] [8110f5d9] ? handle_mm_fault+0x7f3/0x846 [25230.404446] [813e98c5] ? mutex_unlock+0x16/0x18 [25230.404449] [81138d68] new_sync_write+0x7c/0xa0 [25230.404450] [81139401] vfs_write+0xb0/0x112 [25230.404452] [81139c9d] SyS_pwrite64+0x66/0x84 [25230.404454] [813ebf52] system_call_fastpath+0x16/0x1b [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]--- [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left). [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/extent-tree.c | 5 +++-- fs/btrfs/super.c | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4bf8f02..0a5e770 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, int ret = 0; list_for_each_entry_safe(block_group, tmp, trans-new_bgs, bg_list) { - list_del_init(block_group-bg_list); if (ret) - continue; + goto next; spin_lock(block_group-lock); memcpy(item, block_group-item, sizeof(item)); @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, key.objectid, key.offset); if (ret) btrfs_abort_transaction(trans, extent_root, ret); +next: + list_del_init(block_group-bg_list); } } I don't understand this change, logically it seems the same as what we had before. Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe
Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal
On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 10:28 AM, Filipe Manana wrote: If we remove a block group (because it became empty), we might have left a caching_ctl structure in fs_info-caching_block_groups that points to the block group and is accessed at transaction commit time. This results in accessing an invalid or incorrect block group. This issue became visible after Josef's patch Btrfs: remove empty block groups automatically. So if the block group is removed make sure we don't leave a dangling caching_ctl in caching_block_groups. Sample crash trace: [58380.439449] BUG: unable to handle kernel paging request at 8801446eaeb8 [58380.439707] IP: [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 8001446ea060 [58380.441220] Oops: [#1] SMP DEBUG_PAGEALLOC [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs] [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: GW 3.17.0-rc5-btrfs-next-1+ #1 [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti: 88013896c000 [58380.443238] RIP: 0010:[a03f6d05] [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.443238] RSP: 0018:88013896fdd8 EFLAGS: 00010283 [58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX: [58380.443238] RDX: RSI: 880185e16800 RDI: 8801446eaeb8 [58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09: 88013896fc60 [58380.443238] R10: 88013896fd28 R11: R12: 880222cae000 [58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15: 8801446eae00 [58380.443238] FS: () GS:88023ed8() knlGS: [58380.443238] CS: 0010 DS: ES: CR0: 8005003b [58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4: 06e0 [58380.443238] Stack: [58380.443238] 88013896fe18 a03fe2d5 880222cae850 880185e16800 [58380.443238] 88000dc41c20 8801a9ca9f00 [58380.443238] 88013896fe80 a040fbcf 88018b0dcdb0 88013ac82090 [58380.443238] Call Trace: [58380.443238] [a03fe2d5] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs] [58380.443238] [a040fbcf] btrfs_commit_transaction+0x45c/0x882 [btrfs] [58380.443238] [a040c058] transaction_kthread+0xf2/0x1a4 [btrfs] [58380.443238] [a040bf66] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs] [58380.443238] [8105966b] kthread+0xb7/0xbf [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 [58380.443238] [813ebeac] ret_from_fork+0x7c/0xb0 [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 27 +++ 2 files changed, 28 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d3ccd09..7f40a65 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache { unsigned int ro:1; unsigned int dirty:1; unsigned int iref:1; + unsigned int has_caching_ctl:1; Don't do this, just unconditionally call get_caching_control in btrfs_remove_block_group, then if we get something we can do stuff, otherwise we can just continue. Thanks, That's what I initially thought too. However, get_caching_control only returns us the caching_ctl if block_group-cached == BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The has_caching_ctl flag is just to avoid holding the semaphore and search through the list (since block_group-caching_ctl can be NULL but a caching_ctl point to the block group can be in the list). Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org
Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
On 11/26/2014 10:28 AM, Filipe Manana wrote: Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h| 13 - fs/btrfs/disk-io.c | 14 ++ fs/btrfs/extent-tree.c | 24 +++- fs/btrfs/free-space-cache.c | 26 +- fs/btrfs/volumes.c | 33 ++--- 5 files changed, 100 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 7f40a65..51056c7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache { unsigned int dirty:1; unsigned int iref:1; unsigned int has_caching_ctl:1; + unsigned int removed:1; int disk_cache_state; @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache { /* For delayed block group creation or deletion of empty block groups */ struct list_head bg_list; + + atomic_t trimming; }; /* delayed seq elem */ @@ -1731,6 +1734,13 @@ struct btrfs_fs_info { /* For btrfs to record security options */ struct security_mnt_opts security_opts; + + /* +* Chunks that can't be freed yet (under a trim/discard operation) +* and will be latter freed. +*/ + rwlock_t pinned_chunks_lock; + struct list_head pinned_chunks; }; struct btrfs_subvolume_writers { @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 type, u64 chunk_objectid, u64 chunk_offset, u64 size); int btrfs_remove_block_group(struct btrfs_trans_handle *trans, -struct btrfs_root *root, u64 group_start); +struct btrfs_root *root, u64 group_start, +bool *remove_em); void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info); void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, struct
Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 10:28 AM, Filipe Manana wrote: If the transaction handle doesn't have used blocks but has created new block groups make sure we turn the fs into readonly mode too. This is because the new block groups didn't get all their metadata persisted into the chunk and device trees, and therefore if a subsequent transaction starts, allocates space from the new block groups, writes data or metadata into that space, commits successfully and then after we unmount and mount the filesystem again, the same space can be allocated again for a new block group, resulting in file data or metadata corruption. Example where we don't abort the transaction when we fail to finish the chunk allocation (add items to the chunk and device trees) and later a future transaction where the block group is removed fails because it can't find the chunk item in the chunk tree: [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]() [25230.404301] BTRFS: Transaction aborted (error -28) [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs] [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1 [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25230.404328] 88004581bb08 813e7a13 88004581bb50 [25230.404330] 88004581bb40 810423aa a049386a ffe4 [25230.404332] a05214c0 240c 88010fc8f800 88004581bba8 [25230.404334] Call Trace: [25230.404338] [813e7a13] dump_stack+0x45/0x56 [25230.404342] [810423aa] warn_slowpath_common+0x7f/0x98 [25230.404351] [a049386a] ? __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404353] [8104240b] warn_slowpath_fmt+0x48/0x50 [25230.404362] [a049386a] __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404374] [a04a8c43] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs] [25230.404387] [a04b77fd] __btrfs_end_transaction+0x7e/0x2de [btrfs] [25230.404398] [a04b7a6d] btrfs_end_transaction+0x10/0x12 [btrfs] [25230.404408] [a04a3d64] btrfs_check_data_free_space+0x111/0x1f0 [btrfs] [25230.404421] [a04c53bd] __btrfs_buffered_write+0x160/0x48d [btrfs] [25230.404425] [811a9268] ? cap_inode_need_killpriv+0x2d/0x37 [25230.404429] [810f6501] ? get_page+0x1a/0x2b [25230.404441] [a04c7c95] btrfs_file_write_iter+0x321/0x42f [btrfs] [25230.404443] [8110f5d9] ? handle_mm_fault+0x7f3/0x846 [25230.404446] [813e98c5] ? mutex_unlock+0x16/0x18 [25230.404449] [81138d68] new_sync_write+0x7c/0xa0 [25230.404450] [81139401] vfs_write+0xb0/0x112 [25230.404452] [81139c9d] SyS_pwrite64+0x66/0x84 [25230.404454] [813ebf52] system_call_fastpath+0x16/0x1b [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]--- [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left). [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/extent-tree.c | 5 +++-- fs/btrfs/super.c | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4bf8f02..0a5e770 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, int ret = 0; list_for_each_entry_safe(block_group, tmp, trans-new_bgs, bg_list) { - list_del_init(block_group-bg_list); if (ret) - continue; + goto next; spin_lock(block_group-lock); memcpy(item, block_group-item, sizeof(item)); @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, key.objectid, key.offset); if (ret) btrfs_abort_transaction(trans, extent_root, ret); +next: + list_del_init(block_group-bg_list); } } I don't understand
Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
On 11/26/2014 11:15 AM, Filipe David Manana wrote: On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 10:28 AM, Filipe Manana wrote: If the transaction handle doesn't have used blocks but has created new block groups make sure we turn the fs into readonly mode too. This is because the new block groups didn't get all their metadata persisted into the chunk and device trees, and therefore if a subsequent transaction starts, allocates space from the new block groups, writes data or metadata into that space, commits successfully and then after we unmount and mount the filesystem again, the same space can be allocated again for a new block group, resulting in file data or metadata corruption. Example where we don't abort the transaction when we fail to finish the chunk allocation (add items to the chunk and device trees) and later a future transaction where the block group is removed fails because it can't find the chunk item in the chunk tree: [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]() [25230.404301] BTRFS: Transaction aborted (error -28) [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs] [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1 [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25230.404328] 88004581bb08 813e7a13 88004581bb50 [25230.404330] 88004581bb40 810423aa a049386a ffe4 [25230.404332] a05214c0 240c 88010fc8f800 88004581bba8 [25230.404334] Call Trace: [25230.404338] [813e7a13] dump_stack+0x45/0x56 [25230.404342] [810423aa] warn_slowpath_common+0x7f/0x98 [25230.404351] [a049386a] ? __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404353] [8104240b] warn_slowpath_fmt+0x48/0x50 [25230.404362] [a049386a] __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404374] [a04a8c43] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs] [25230.404387] [a04b77fd] __btrfs_end_transaction+0x7e/0x2de [btrfs] [25230.404398] [a04b7a6d] btrfs_end_transaction+0x10/0x12 [btrfs] [25230.404408] [a04a3d64] btrfs_check_data_free_space+0x111/0x1f0 [btrfs] [25230.404421] [a04c53bd] __btrfs_buffered_write+0x160/0x48d [btrfs] [25230.404425] [811a9268] ? cap_inode_need_killpriv+0x2d/0x37 [25230.404429] [810f6501] ? get_page+0x1a/0x2b [25230.404441] [a04c7c95] btrfs_file_write_iter+0x321/0x42f [btrfs] [25230.404443] [8110f5d9] ? handle_mm_fault+0x7f3/0x846 [25230.404446] [813e98c5] ? mutex_unlock+0x16/0x18 [25230.404449] [81138d68] new_sync_write+0x7c/0xa0 [25230.404450] [81139401] vfs_write+0xb0/0x112 [25230.404452] [81139c9d] SyS_pwrite64+0x66/0x84 [25230.404454] [813ebf52] system_call_fastpath+0x16/0x1b [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]--- [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left). [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/extent-tree.c | 5 +++-- fs/btrfs/super.c | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4bf8f02..0a5e770 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, int ret = 0; list_for_each_entry_safe(block_group, tmp, trans-new_bgs, bg_list) { - list_del_init(block_group-bg_list); if (ret) - continue; + goto next; spin_lock(block_group-lock); memcpy(item, block_group-item, sizeof(item)); @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, key.objectid, key.offset); if (ret) btrfs_abort_transaction(trans, extent_root, ret); +next: + list_del_init(block_group-bg_list); } } I don't understand this change, logically it seems the
Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
On Wed, Nov 26, 2014 at 4:15 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 10:28 AM, Filipe Manana wrote: Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h| 13 - fs/btrfs/disk-io.c | 14 ++ fs/btrfs/extent-tree.c | 24 +++- fs/btrfs/free-space-cache.c | 26 +- fs/btrfs/volumes.c | 33 ++--- 5 files changed, 100 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 7f40a65..51056c7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache { unsigned int dirty:1; unsigned int iref:1; unsigned int has_caching_ctl:1; + unsigned int removed:1; int disk_cache_state; @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache { /* For delayed block group creation or deletion of empty block groups */ struct list_head bg_list; + + atomic_t trimming; }; /* delayed seq elem */ @@ -1731,6 +1734,13 @@ struct btrfs_fs_info { /* For btrfs to record security options */ struct security_mnt_opts security_opts; + + /* +* Chunks that can't be freed yet (under a trim/discard operation) +* and will be latter freed. +*/ + rwlock_t pinned_chunks_lock; + struct list_head pinned_chunks; }; struct btrfs_subvolume_writers { @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 type, u64 chunk_objectid, u64 chunk_offset, u64 size); int btrfs_remove_block_group(struct btrfs_trans_handle *trans, -struct btrfs_root *root, u64 group_start); +struct btrfs_root *root, u64 group_start, +bool *remove_em); void
Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal
On 11/26/2014 11:09 AM, Filipe David Manana wrote: On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 10:28 AM, Filipe Manana wrote: If we remove a block group (because it became empty), we might have left a caching_ctl structure in fs_info-caching_block_groups that points to the block group and is accessed at transaction commit time. This results in accessing an invalid or incorrect block group. This issue became visible after Josef's patch Btrfs: remove empty block groups automatically. So if the block group is removed make sure we don't leave a dangling caching_ctl in caching_block_groups. Sample crash trace: [58380.439449] BUG: unable to handle kernel paging request at 8801446eaeb8 [58380.439707] IP: [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 8001446ea060 [58380.441220] Oops: [#1] SMP DEBUG_PAGEALLOC [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs] [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: GW 3.17.0-rc5-btrfs-next-1+ #1 [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti: 88013896c000 [58380.443238] RIP: 0010:[a03f6d05] [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.443238] RSP: 0018:88013896fdd8 EFLAGS: 00010283 [58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX: [58380.443238] RDX: RSI: 880185e16800 RDI: 8801446eaeb8 [58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09: 88013896fc60 [58380.443238] R10: 88013896fd28 R11: R12: 880222cae000 [58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15: 8801446eae00 [58380.443238] FS: () GS:88023ed8() knlGS: [58380.443238] CS: 0010 DS: ES: CR0: 8005003b [58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4: 06e0 [58380.443238] Stack: [58380.443238] 88013896fe18 a03fe2d5 880222cae850 880185e16800 [58380.443238] 88000dc41c20 8801a9ca9f00 [58380.443238] 88013896fe80 a040fbcf 88018b0dcdb0 88013ac82090 [58380.443238] Call Trace: [58380.443238] [a03fe2d5] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs] [58380.443238] [a040fbcf] btrfs_commit_transaction+0x45c/0x882 [btrfs] [58380.443238] [a040c058] transaction_kthread+0xf2/0x1a4 [btrfs] [58380.443238] [a040bf66] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs] [58380.443238] [8105966b] kthread+0xb7/0xbf [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 [58380.443238] [813ebeac] ret_from_fork+0x7c/0xb0 [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 27 +++ 2 files changed, 28 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d3ccd09..7f40a65 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache { unsigned int ro:1; unsigned int dirty:1; unsigned int iref:1; + unsigned int has_caching_ctl:1; Don't do this, just unconditionally call get_caching_control in btrfs_remove_block_group, then if we get something we can do stuff, otherwise we can just continue. Thanks, That's what I initially thought too. However, get_caching_control only returns us the caching_ctl if block_group-cached == BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The has_caching_ctl flag is just to avoid holding the semaphore and search through the list (since block_group-caching_ctl can be NULL but a caching_ctl point to the block group can be in the list). Oh God that's not good, we need to change get_caching_control to return if there is a caching control at all, since the other users want to wait for the fast caching to finish too. So change that and then use it unconditionally. I bet this has been causing us the random early ENOSPC problems. Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
On Wed, Nov 26, 2014 at 4:19 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 11:15 AM, Filipe David Manana wrote: On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 10:28 AM, Filipe Manana wrote: If the transaction handle doesn't have used blocks but has created new block groups make sure we turn the fs into readonly mode too. This is because the new block groups didn't get all their metadata persisted into the chunk and device trees, and therefore if a subsequent transaction starts, allocates space from the new block groups, writes data or metadata into that space, commits successfully and then after we unmount and mount the filesystem again, the same space can be allocated again for a new block group, resulting in file data or metadata corruption. Example where we don't abort the transaction when we fail to finish the chunk allocation (add items to the chunk and device trees) and later a future transaction where the block group is removed fails because it can't find the chunk item in the chunk tree: [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]() [25230.404301] BTRFS: Transaction aborted (error -28) [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs] [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1 [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25230.404328] 88004581bb08 813e7a13 88004581bb50 [25230.404330] 88004581bb40 810423aa a049386a ffe4 [25230.404332] a05214c0 240c 88010fc8f800 88004581bba8 [25230.404334] Call Trace: [25230.404338] [813e7a13] dump_stack+0x45/0x56 [25230.404342] [810423aa] warn_slowpath_common+0x7f/0x98 [25230.404351] [a049386a] ? __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404353] [8104240b] warn_slowpath_fmt+0x48/0x50 [25230.404362] [a049386a] __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404374] [a04a8c43] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs] [25230.404387] [a04b77fd] __btrfs_end_transaction+0x7e/0x2de [btrfs] [25230.404398] [a04b7a6d] btrfs_end_transaction+0x10/0x12 [btrfs] [25230.404408] [a04a3d64] btrfs_check_data_free_space+0x111/0x1f0 [btrfs] [25230.404421] [a04c53bd] __btrfs_buffered_write+0x160/0x48d [btrfs] [25230.404425] [811a9268] ? cap_inode_need_killpriv+0x2d/0x37 [25230.404429] [810f6501] ? get_page+0x1a/0x2b [25230.404441] [a04c7c95] btrfs_file_write_iter+0x321/0x42f [btrfs] [25230.404443] [8110f5d9] ? handle_mm_fault+0x7f3/0x846 [25230.404446] [813e98c5] ? mutex_unlock+0x16/0x18 [25230.404449] [81138d68] new_sync_write+0x7c/0xa0 [25230.404450] [81139401] vfs_write+0xb0/0x112 [25230.404452] [81139c9d] SyS_pwrite64+0x66/0x84 [25230.404454] [813ebf52] system_call_fastpath+0x16/0x1b [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]--- [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left). [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/extent-tree.c | 5 +++-- fs/btrfs/super.c | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4bf8f02..0a5e770 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, int ret = 0; list_for_each_entry_safe(block_group, tmp, trans-new_bgs, bg_list) { - list_del_init(block_group-bg_list); if (ret) - continue; + goto next; spin_lock(block_group-lock); memcpy(item, block_group-item, sizeof(item)); @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, key.objectid, key.offset); if (ret)
Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
On 11/26/2014 11:25 AM, Filipe David Manana wrote: On Wed, Nov 26, 2014 at 4:15 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 10:28 AM, Filipe Manana wrote: Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h| 13 - fs/btrfs/disk-io.c | 14 ++ fs/btrfs/extent-tree.c | 24 +++- fs/btrfs/free-space-cache.c | 26 +- fs/btrfs/volumes.c | 33 ++--- 5 files changed, 100 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 7f40a65..51056c7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache { unsigned int dirty:1; unsigned int iref:1; unsigned int has_caching_ctl:1; + unsigned int removed:1; int disk_cache_state; @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache { /* For delayed block group creation or deletion of empty block groups */ struct list_head bg_list; + + atomic_t trimming; }; /* delayed seq elem */ @@ -1731,6 +1734,13 @@ struct btrfs_fs_info { /* For btrfs to record security options */ struct security_mnt_opts security_opts; + + /* +* Chunks that can't be freed yet (under a trim/discard operation) +* and will be latter freed. +*/ + rwlock_t pinned_chunks_lock; + struct list_head pinned_chunks; }; struct btrfs_subvolume_writers { @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 type, u64 chunk_objectid, u64 chunk_offset, u64 size); int btrfs_remove_block_group(struct btrfs_trans_handle *trans, -struct btrfs_root *root, u64 group_start); +struct btrfs_root *root, u64 group_start, +bool *remove_em); void
Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
On 11/26/2014 11:29 AM, Filipe David Manana wrote: On Wed, Nov 26, 2014 at 4:19 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 11:15 AM, Filipe David Manana wrote: On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 10:28 AM, Filipe Manana wrote: If the transaction handle doesn't have used blocks but has created new block groups make sure we turn the fs into readonly mode too. This is because the new block groups didn't get all their metadata persisted into the chunk and device trees, and therefore if a subsequent transaction starts, allocates space from the new block groups, writes data or metadata into that space, commits successfully and then after we unmount and mount the filesystem again, the same space can be allocated again for a new block group, resulting in file data or metadata corruption. Example where we don't abort the transaction when we fail to finish the chunk allocation (add items to the chunk and device trees) and later a future transaction where the block group is removed fails because it can't find the chunk item in the chunk tree: [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]() [25230.404301] BTRFS: Transaction aborted (error -28) [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs] [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1 [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25230.404328] 88004581bb08 813e7a13 88004581bb50 [25230.404330] 88004581bb40 810423aa a049386a ffe4 [25230.404332] a05214c0 240c 88010fc8f800 88004581bba8 [25230.404334] Call Trace: [25230.404338] [813e7a13] dump_stack+0x45/0x56 [25230.404342] [810423aa] warn_slowpath_common+0x7f/0x98 [25230.404351] [a049386a] ? __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404353] [8104240b] warn_slowpath_fmt+0x48/0x50 [25230.404362] [a049386a] __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404374] [a04a8c43] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs] [25230.404387] [a04b77fd] __btrfs_end_transaction+0x7e/0x2de [btrfs] [25230.404398] [a04b7a6d] btrfs_end_transaction+0x10/0x12 [btrfs] [25230.404408] [a04a3d64] btrfs_check_data_free_space+0x111/0x1f0 [btrfs] [25230.404421] [a04c53bd] __btrfs_buffered_write+0x160/0x48d [btrfs] [25230.404425] [811a9268] ? cap_inode_need_killpriv+0x2d/0x37 [25230.404429] [810f6501] ? get_page+0x1a/0x2b [25230.404441] [a04c7c95] btrfs_file_write_iter+0x321/0x42f [btrfs] [25230.404443] [8110f5d9] ? handle_mm_fault+0x7f3/0x846 [25230.404446] [813e98c5] ? mutex_unlock+0x16/0x18 [25230.404449] [81138d68] new_sync_write+0x7c/0xa0 [25230.404450] [81139401] vfs_write+0xb0/0x112 [25230.404452] [81139c9d] SyS_pwrite64+0x66/0x84 [25230.404454] [813ebf52] system_call_fastpath+0x16/0x1b [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]--- [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left). [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.) Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/extent-tree.c | 5 +++-- fs/btrfs/super.c | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4bf8f02..0a5e770 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, int ret = 0; list_for_each_entry_safe(block_group, tmp, trans-new_bgs, bg_list) { - list_del_init(block_group-bg_list); if (ret) - continue; + goto next; spin_lock(block_group-lock); memcpy(item, block_group-item, sizeof(item)); @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, key.objectid, key.offset); if (ret) btrfs_abort_transaction(trans, extent_root, ret);
Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal
On Wed, Nov 26, 2014 at 4:24 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 11:09 AM, Filipe David Manana wrote: On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 10:28 AM, Filipe Manana wrote: If we remove a block group (because it became empty), we might have left a caching_ctl structure in fs_info-caching_block_groups that points to the block group and is accessed at transaction commit time. This results in accessing an invalid or incorrect block group. This issue became visible after Josef's patch Btrfs: remove empty block groups automatically. So if the block group is removed make sure we don't leave a dangling caching_ctl in caching_block_groups. Sample crash trace: [58380.439449] BUG: unable to handle kernel paging request at 8801446eaeb8 [58380.439707] IP: [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 8001446ea060 [58380.441220] Oops: [#1] SMP DEBUG_PAGEALLOC [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs] [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1 [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti: 88013896c000 [58380.443238] RIP: 0010:[a03f6d05] [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.443238] RSP: 0018:88013896fdd8 EFLAGS: 00010283 [58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX: [58380.443238] RDX: RSI: 880185e16800 RDI: 8801446eaeb8 [58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09: 88013896fc60 [58380.443238] R10: 88013896fd28 R11: R12: 880222cae000 [58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15: 8801446eae00 [58380.443238] FS: () GS:88023ed8() knlGS: [58380.443238] CS: 0010 DS: ES: CR0: 8005003b [58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4: 06e0 [58380.443238] Stack: [58380.443238] 88013896fe18 a03fe2d5 880222cae850 880185e16800 [58380.443238] 88000dc41c20 8801a9ca9f00 [58380.443238] 88013896fe80 a040fbcf 88018b0dcdb0 88013ac82090 [58380.443238] Call Trace: [58380.443238] [a03fe2d5] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs] [58380.443238] [a040fbcf] btrfs_commit_transaction+0x45c/0x882 [btrfs] [58380.443238] [a040c058] transaction_kthread+0xf2/0x1a4 [btrfs] [58380.443238] [a040bf66] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs] [58380.443238] [8105966b] kthread+0xb7/0xbf [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 [58380.443238] [813ebeac] ret_from_fork+0x7c/0xb0 [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 27 +++ 2 files changed, 28 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d3ccd09..7f40a65 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache { unsigned int ro:1; unsigned int dirty:1; unsigned int iref:1; + unsigned int has_caching_ctl:1; Don't do this, just unconditionally call get_caching_control in btrfs_remove_block_group, then if we get something we can do stuff, otherwise we can just continue. Thanks, That's what I initially thought too. However, get_caching_control only returns us the caching_ctl if block_group-cached == BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The has_caching_ctl flag is just to avoid holding the semaphore and search through the list (since block_group-caching_ctl can be NULL but a caching_ctl point to the block group can be in the list). Oh God that's not good, we need to change get_caching_control to return if there is a caching control at all, since the other users want to wait for the fast caching to finish too. So change that and then use it unconditionally. I bet this has been causing us the random early ENOSPC problems. Thanks, Right, I think that's a separate change different from what I'm
Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal
On 11/26/2014 11:34 AM, Filipe David Manana wrote: On Wed, Nov 26, 2014 at 4:24 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 11:09 AM, Filipe David Manana wrote: On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik jba...@fb.com wrote: On 11/26/2014 10:28 AM, Filipe Manana wrote: If we remove a block group (because it became empty), we might have left a caching_ctl structure in fs_info-caching_block_groups that points to the block group and is accessed at transaction commit time. This results in accessing an invalid or incorrect block group. This issue became visible after Josef's patch Btrfs: remove empty block groups automatically. So if the block group is removed make sure we don't leave a dangling caching_ctl in caching_block_groups. Sample crash trace: [58380.439449] BUG: unable to handle kernel paging request at 8801446eaeb8 [58380.439707] IP: [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 8001446ea060 [58380.441220] Oops: [#1] SMP DEBUG_PAGEALLOC [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs] [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1 [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [58380.443238] task: 88013ac82090 ti: 88013896c000 task.ti: 88013896c000 [58380.443238] RIP: 0010:[a03f6d05] [a03f6d05] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.443238] RSP: 0018:88013896fdd8 EFLAGS: 00010283 [58380.443238] RAX: 880222cae850 RBX: 880119ba74c0 RCX: [58380.443238] RDX: RSI: 880185e16800 RDI: 8801446eaeb8 [58380.443238] RBP: 88013896fdd8 R08: 8801a9ca9fa8 R09: 88013896fc60 [58380.443238] R10: 88013896fd28 R11: R12: 880222cae000 [58380.443238] R13: 880222cae850 R14: 880222cae6b0 R15: 8801446eae00 [58380.443238] FS: () GS:88023ed8() knlGS: [58380.443238] CS: 0010 DS: ES: CR0: 8005003b [58380.443238] CR2: 8801446eaeb8 CR3: 01811000 CR4: 06e0 [58380.443238] Stack: [58380.443238] 88013896fe18 a03fe2d5 880222cae850 880185e16800 [58380.443238] 88000dc41c20 8801a9ca9f00 [58380.443238] 88013896fe80 a040fbcf 88018b0dcdb0 88013ac82090 [58380.443238] Call Trace: [58380.443238] [a03fe2d5] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs] [58380.443238] [a040fbcf] btrfs_commit_transaction+0x45c/0x882 [btrfs] [58380.443238] [a040c058] transaction_kthread+0xf2/0x1a4 [btrfs] [58380.443238] [a040bf66] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs] [58380.443238] [8105966b] kthread+0xb7/0xbf [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 [58380.443238] [813ebeac] ret_from_fork+0x7c/0xb0 [58380.443238] [810595b4] ? __kthread_parkme+0x67/0x67 Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 27 +++ 2 files changed, 28 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d3ccd09..7f40a65 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache { unsigned int ro:1; unsigned int dirty:1; unsigned int iref:1; + unsigned int has_caching_ctl:1; Don't do this, just unconditionally call get_caching_control in btrfs_remove_block_group, then if we get something we can do stuff, otherwise we can just continue. Thanks, That's what I initially thought too. However, get_caching_control only returns us the caching_ctl if block_group-cached == BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The has_caching_ctl flag is just to avoid holding the semaphore and search through the list (since block_group-caching_ctl can be NULL but a caching_ctl point to the block group can be in the list). Oh God that's not good, we need to change get_caching_control to return if there is a caching control at all, since the other users want to wait for the fast caching to finish too. So change that and then use it unconditionally. I bet this has been causing us the random early ENOSPC problems. Thanks, Right, I think that's a separate change different from what I'm trying to fix. When caching_thread
[PATCH] Btrfs: make get_caching_control unconditionally return the ctl
This was written when we didn't do a caching control for the fast free space cache loading. However we started doing that a long time ago, and there is still a small window of time that we could be caching the block group the fast way, so if there is a caching_ctl at all on the block group just return it, the callers all wait properly for what they want. Thanks, Signed-off-by: Josef Bacik jba...@fb.com --- fs/btrfs/extent-tree.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 031dafb..74eb29d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -315,12 +315,6 @@ get_caching_control(struct btrfs_block_group_cache *cache) struct btrfs_caching_control *ctl; spin_lock(cache-lock); - if (cache-cached != BTRFS_CACHE_STARTED) { - spin_unlock(cache-lock); - return NULL; - } - - /* We're loading it the fast way, so we don't have a caching_ctl. */ if (!cache-caching_ctl) { spin_unlock(cache-lock); return NULL; @@ -594,6 +588,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache, spin_unlock(cache-lock); if (fs_info-mount_opt BTRFS_MOUNT_SPACE_CACHE) { + mutex_lock(caching_ctl-mutex); ret = load_free_space_cache(fs_info, cache); spin_lock(cache-lock); @@ -601,6 +596,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache, cache-caching_ctl = NULL; cache-cached = BTRFS_CACHE_FINISHED; cache-last_byte_to_unpin = (u64)-1; + caching_ctl-progress = (u64)-1; } else { if (load_cache_only) { cache-caching_ctl = NULL; @@ -610,6 +606,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache, } } spin_unlock(cache-lock); + mutex_unlock(caching_ctl-mutex); + wake_up(caching_ctl-wait); if (ret == 1) { put_caching_control(caching_ctl); -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS messes up snapshot LV with origin
On 11/25/2014 11:21 PM, Zygo Blaxell wrote: However I still doesn't understood why you want btrfs-w/multiple disk over LVM ? I want to split a few disks into partitions, but I want to create, move, and resize the partitions from time to time. Only LVM can do that without taking the machine down, reducing RAID integrity levels, hotplugging drives, or leaving installed drives idle most of the time. I want btrfs-raid1 because of its ability to replace corrupted or lost data from one disk using the other. If I run a single-volume btrfs on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage stack), I can detect lost data, but not replace it automatically from the other mirror. OK, now I have understood. Anyway as workaround, take in account that you can pass explicitly the devices as: mount -o device=/dev/sda,device=/dev/sdb,device=/dev/sdc /dev/sdd /mnt (supposing that the filesystem is on /dev/sda.../dev/sdd) I am working to a mount.btrfs helper. The aim of this helper is to manage the assembling of multiple devices; the main points will be: - wait until all the devices appeared - allow (if required) to mount in degraded mode after a timeout - at this point it could/should also skip the lvm-snapshotted devices (but before I have to know how recognize these) I hope to issue the patches in the next week BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: Filipe Manana fdman...@suse.com --- V2: Added a comment to better explain the synchronization between block group removal and triming, as requested by Josef. fs/btrfs/ctree.h| 13 - fs/btrfs/disk-io.c | 14 ++ fs/btrfs/extent-tree.c | 28 +++- fs/btrfs/free-space-cache.c | 26 +- fs/btrfs/volumes.c | 33 ++--- 5 files changed, 104 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 7f40a65..51056c7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache { unsigned int dirty:1; unsigned int iref:1; unsigned int has_caching_ctl:1; + unsigned int removed:1; int disk_cache_state; @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache { /* For delayed block group creation or deletion of empty block groups */ struct list_head bg_list; + + atomic_t trimming; }; /* delayed seq elem */ @@ -1731,6 +1734,13 @@ struct btrfs_fs_info { /* For btrfs to record security options */ struct security_mnt_opts security_opts; + + /* +* Chunks that can't be freed yet (under a trim/discard operation) +* and will be latter freed. +*/ + rwlock_t pinned_chunks_lock; + struct list_head pinned_chunks; }; struct btrfs_subvolume_writers { @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 type, u64 chunk_objectid, u64 chunk_offset, u64 size); int btrfs_remove_block_group(struct btrfs_trans_handle *trans, -struct btrfs_root *root, u64 group_start); +struct btrfs_root *root, u64 group_start, +bool *remove_em); void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info); void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
Re: [RFC PATCH] Btrfs: add sha256 checksum option
On Wed, Nov 26, 2014 at 4:50 AM, Holger Hoffstätte holger.hoffstae...@googlemail.com wrote: On Tue, 25 Nov 2014 15:17:58 -0800, John Williams wrote: 2) CityHash : for 256-bit hashes on all systems https://code.google.com/p/cityhash/ Btw this is now superseded by Farmhash: https://code.google.com/p/farmhash/ It seems FarmHash is not a complete replacement for CityHash. Specifically, I don't see a Fingerprint256() function in FarmHash, so no 256-bit fingerprints (unless I am missing something?). Also, it seems that FarmHash's Fingerprint128() hash is just CityHash128(). Unless I am misreading it, I think FarmHash is mostly useful for 32-bit and 64-bit hashes. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Two persistent problems
Am Fri, 14 Nov 2014 17:00:26 -0500 schrieb Josef Bacik jba...@fb.com: On 11/14/2014 04:51 PM, Hugo Mills wrote: Chris, Josef, anyone else who's interested, On IRC, I've been seeing reports of two persistent unsolved problems. Neither is showing up very often, but both have turned up often enough to indicate that there's something specific going on worthy of investigation. One of them is definitely a btrfs problem. The other may be btrfs, or something in the block layer, or just broken hardware; it's hard to tell from where I sit. Problem 1: ENOSPC on balance This has been going on since about March this year. I can reasonably certainly recall 8-10 cases, possibly a number more. When running a balance, the operation fails with ENOSPC when there's plenty of space remaining unallocated. This happens on full balance, filtered balance, and device delete. Other than the ENOSPC on balance, the FS seems to work OK. It seems to be more prevalent on filesystems converted from ext*. The first few or more reports of this didn't make it to bugzilla, but a few of them since then have gone in. Problem 2: Unexplained zeroes Failure to mount. Transid failure, expected xyz, have 0. Chris looked at an early one of these (for Ke, on IRC) back in September (the 27th -- sadly, the public IRC logs aren't there for it, but I can supply a copy of the private log). He rapidly came to the conclusion that it was something bad going on with TRIM, replacing some blocks with zeroes. Since then, I've seen a bunch of these coming past on IRC. It seems to be a 3.17 thing. I can successfully predict the presence of an SSD and -odiscard from the have 0. I've successfully persuaded several people to put this into bugzilla and capture btrfs-images. btrfs recover doesn't generally seem to be helpful in recovering data. I think Josef had problem 1 in his sights, but I don't know if additional images or reports are helpful at this point. For problem 2, there's obviously something bad going on, but there's not much else to go on -- and the inability to recover data isn't good. For each of these, what more information should I be trying to collect from any future reporters? So for #2 I've been looking at that the last two weeks. I'm always paranoid we're screwing up one of our data integrity sort of things, either not waiting on IO to complete properly or something like that. I've built a dm target to be as evil as possible and have been running it trying to make bad things happen. I got slightly side tracked since my stress test exposed a bug in the tree log stuff an csums which I just fixed. Now that I've fixed that I'm going back to try and make the expected blah, have 0 type errors happen. Just a quick question from a user: does Filipe's patch Btrfs: fix race between fs trimming and block group remove/allocation fix this? Judging by the commit message, it looks like it. If so, can you say whether it will make it into 3.17.x? Maybe I'm being overly paranoid, but I stuck with 3.16.7 because of this. (I mean, I have backups, but there's no need to provoke a situation where I will need them ;-) .) -- Marc Joliet -- People who think they know everything really annoy those of us who know we don't - Bjarne Stroustrup signature.asc Description: PGP signature
Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
As mentioned last round please move the addition of the is_readonly operation to the first thing in the series, so that the ordering makes more sense. Second I think this patch is incorrect for XFS - XFS uses -update_time to set the time stampst in the dinode. These two need to be coherent as we can write out a dirty inode any time, so it needs to have the timestamp uptodate. Third update_time now calls mark_inode_dirty unconditionally, while previously it wasn't called when -update_time was set. At least for XFS that's a major change in behavior as XFS never used VFS dirty tracking for metadata updates. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
The subject line seems incorrect, this seems to implement some form of dirty inode writeback clustering. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Can't cp --reflink files on a Ext4-converted FS w/o checksums
Hello, I used btrfs-convert to switch my FS from Ext4 to Btrfs. As it was a rather large 10 TB filesystem, to save on the conversion time, I used the -d, disable data checksum option of btrfs-convert. Turns out now I can't cp --reflink any files that were already on the FS prior to conversion. The error message from cp is failed to clone [...] Invalid argument. I assume this is because of the lack of checksums; the only way to make old files cloneable is to plain copy them to a different place and then delete the originals, but that's what I was trying to avoid in the first place. Also I thought maybe defragmenting will help, but nope, doesn't seem to be the case, even ordering it to recompress data to a different method doesn't fix the problem. (Even if it did, it's still a lot of unnecessary rewriting). Is there really a good reason to stop these files without checksums from being cloneable? It's not like they have the noCoW attribute, so I'd assume any new write to these files would cause a CoW and proper checksums for all new blocks anyways. -- With respect, Roman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS messes up snapshot LV with origin
On 11/25/2014 07:22 PM, Duncan wrote: From my perspective, however, btrfs is simply incompatible with lvm snapshots, because the basic assumptions are incompatible. Btrfs assumes UUIDs will be exactly what they say on the label, /unique/, while lvm's snapshot feature directly breaks that uniqueness by copying the (former) UUID, thus making the former UUID no longer unique and thus no longer truly UUID. Thus, part of the lvm /feature/ of snapshots is in direct contradiction to a basic assumption of btrfs, that UUIDs are exactly that, unique, making that feature directly incompatible with btrfs on a very basic level. A finer point here. LVM doesn't copy the UUID. AN LVM snapshot is a copy-on-write entity so it _exposes_ the single sector(s) of the superblock(s) in both views of the underlying storage. This is universal to the idea of a snapshot. Just as a btrfs subvol snap /old /new exposes all the unique elements of /old under the name /new (in preparation for the user to implement subsequent divergence); lvmcreate --snapshot Old New causes every block-N of Old to be identically available as block-N of New (in preparation for the user to implement subsequent divergence). In point of fact the LVM snapshot operation is a zero-copy operation at its heart. After the snapshot is established, when a block in modified in Old, it's original content is saved in New. When blocks are written in New, they are written in place and the reference to the block content in Old is overwritten. This is the reason that fsfreeze is unnecessary for things above LVM snapshots as the instant-in-time divergence is _instant_. It's not that LVM goes out and does an fsfreeze equivalent action, its that the switch to write-divergence is essentially atomic. A bunch of metatdata is setup and then all-at-once one write behavior is switched with another by re-mapping the device access routines. So while you may have a point about btrfs being unprepared for LVM, neither party is particularly at fault in any way. The damn you photocopier for making photocopies so identically nature of your problem with LVM seems to be leading you to misplaced conclusions. If you need to harmonize these sorts of things, you need to be able to re-write blocks in question with disambiguating information (like new UUIDS) or restrict your accesses in some other manner. If you are waiting for someone to code it up perhaps you should do so. But it will _never_ be automatic because the use cases that don't match your expectations may need the founding assumptions to be as they are today. In other words, your belief that your position is entirely logical may be a little off, particularly if you think LVM is Copying things when it does a snapshot. As previously stated XFS solved this problem by providing a tool that would change the UUID of a file system. This tool cold then be pointed at either (or both) the original and/or snapshot volumes as needed. I don't see a re-make the btrfs option for changing UUIDs and LVM doesn't care _at_ _all_ about what is actually in its volumes (okay, lvresize has some fsck nonsense, but that's just messy). It might even be wrong to try to harmonize those features, like trying to put a manual clutch into a car with an automatic transmission... it may just not fit. Given that BTRFS want's to play in the same level of abstraction as LVM, its kind of a given that they'll butt heads over things like conflicting definitions of what it means to take a snapshot. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale
On Wed, Nov 26, 2014 at 05:20:17AM -0500, Theodore Ts'o wrote: On Wed, Nov 26, 2014 at 10:48:51AM +1100, Dave Chinner wrote: No abuse necessary at all. Just a different inode_dirtied_after() check is requires if the inode is on the time dirty list in move_expired_inodes(). I'm still not sure what you have in mind here. When would this be checked? Have you looked at where move_expired_inodes() gets called from? It's called periodically from background writeback by queue_io(), and sync uses the same infrastructure to expire all inodes on the dirty list It sounds like you want to set a timeout such that when an inode which had its timestamps updated lazily 24 hours earlier, the inode would get written out. Yes? But that implies something is going to have to scan the list of inodes on the dirty time list periodically. When are you proposing that this take place? The writeback code already does this for dirty inodes. it does it in move_expired_inodes() to move the inodes with i_dirtied_when is older than 30s. It's *trivial* to add a time dirty inode list and scan that at the same time to pull off inodes that are older than 24hrs. The various approaches that come to mind all seem more complex than what I have in this patch 3 of 4, and I'm not sure it's worth the complexity. the once a day stuff you've added is a horrible, nasty hack. I wasn't going to say anything about it (i.e. if you can't say anything nice...). The existing dirty inode writeback expiry code does *everything* we need already, we just need to plumb in a new list and add an expiry check of that list to move inodes to the b_io list when they have been timestamp dirty for more than 24 hours... Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote: Add an optimization for the MS_LAZYTIME mount option so that we will opportunistically write out any inodes with the I_DIRTY_TIME flag set in a particular inode table block when we need to update some inode in that inode table block anyway. Also add some temporary code so that we can set the lazytime mount option without needing a modified /sbin/mount program which can set MS_LAZYTIME. We can eventually make this go away once util-linux has added support. Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/ext4/inode.c | 49 ++--- fs/ext4/super.c | 9 + include/trace/events/ext4.h | 30 +++ 3 files changed, 85 insertions(+), 3 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5653fa4..8308c82 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle, } /* + * Opportunistically update the other time fields for other inodes in + * the same inode table block. + */ +static void ext4_update_other_inodes_time(struct super_block *sb, + unsigned long orig_ino, char *buf) +{ + struct ext4_inode_info *ei; + struct ext4_inode *raw_inode; + unsigned long ino; + struct inode*inode; + int i, inodes_per_block = EXT4_SB(sb)-s_inodes_per_block; + int inode_size = EXT4_INODE_SIZE(sb); + + ino = orig_ino ~(inodes_per_block - 1); + for (i = 0; i inodes_per_block; i++, ino++, buf += inode_size) { + if (ino == orig_ino) + continue; + inode = find_active_inode_nowait(sb, ino); + if (!inode || + (inode-i_state I_DIRTY_TIME) == 0 || + !spin_trylock(inode-i_lock)) { + iput(inode); + continue; + } + inode-i_state = ~I_DIRTY_TIME; + inode-i_ts_dirty_day = 0; + spin_unlock(inode-i_lock); + inode_requeue_dirtytime(inode); + + ei = EXT4_I(inode); + raw_inode = (struct ext4_inode *) buf; + + spin_lock(ei-i_raw_lock); + EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); + ext4_inode_csum_set(inode, raw_inode, ei); + spin_unlock(ei-i_raw_lock); + trace_ext4_other_inode_update_time(inode, orig_ino); + iput(inode); + } +} Am I right in that this now does unlogged timestamp updates of inodes? What happens when that buffer gets overwritten by log recover after a crash? The timestamp updates get lost? FYI, XFS has had all sorts of nasty log recovery corner cases caused by log recovery overwriting non-logged inode updates like this. In the past few years we've removed every single non-logged inode update optimisation so that all changes (including timestamps) are transactional so inode state on disk not matching what log recovery wrote to disk for all the other inode metadata... Optimistic unlogged inode updates are a slippery slope, and history tells me that it doesn't lead to a nice place Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
On Nov 26, 2014, at 3:48 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote: Add an optimization for the MS_LAZYTIME mount option so that we will opportunistically write out any inodes with the I_DIRTY_TIME flag set in a particular inode table block when we need to update some inode in that inode table block anyway. Also add some temporary code so that we can set the lazytime mount option without needing a modified /sbin/mount program which can set MS_LAZYTIME. We can eventually make this go away once util-linux has added support. Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/ext4/inode.c | 49 ++--- fs/ext4/super.c | 9 + include/trace/events/ext4.h | 30 +++ 3 files changed, 85 insertions(+), 3 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5653fa4..8308c82 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle, } /* + * Opportunistically update the other time fields for other inodes in + * the same inode table block. + */ +static void ext4_update_other_inodes_time(struct super_block *sb, + unsigned long orig_ino, char *buf) +{ +struct ext4_inode_info *ei; +struct ext4_inode *raw_inode; +unsigned long ino; +struct inode*inode; +int i, inodes_per_block = EXT4_SB(sb)-s_inodes_per_block; +int inode_size = EXT4_INODE_SIZE(sb); + +ino = orig_ino ~(inodes_per_block - 1); +for (i = 0; i inodes_per_block; i++, ino++, buf += inode_size) { +if (ino == orig_ino) +continue; +inode = find_active_inode_nowait(sb, ino); +if (!inode || +(inode-i_state I_DIRTY_TIME) == 0 || +!spin_trylock(inode-i_lock)) { +iput(inode); +continue; +} +inode-i_state = ~I_DIRTY_TIME; +inode-i_ts_dirty_day = 0; +spin_unlock(inode-i_lock); +inode_requeue_dirtytime(inode); + +ei = EXT4_I(inode); +raw_inode = (struct ext4_inode *) buf; + +spin_lock(ei-i_raw_lock); +EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode); +EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode); +EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); +ext4_inode_csum_set(inode, raw_inode, ei); +spin_unlock(ei-i_raw_lock); +trace_ext4_other_inode_update_time(inode, orig_ino); +iput(inode); +} +} Am I right in that this now does unlogged timestamp updates of inodes? What happens when that buffer gets overwritten by log recover after a crash? The timestamp updates get lost? FYI, XFS has had all sorts of nasty log recovery corner cases caused by log recovery overwriting non-logged inode updates like this. In the past few years we've removed every single non-logged inode update optimisation so that all changes (including timestamps) are transactional so inode state on disk not matching what log recovery wrote to disk for all the other inode metadata... Optimistic unlogged inode updates are a slippery slope, and history tells me that it doesn't lead to a nice place Since ext4/jbd2 is logging the whole block, unlike XFS which is doing logical journaling, this isn't an unlogged update. It is just taking advantage of the fact that the whole block is going to be logged and written to the disk anyway. If the only update needed for other inodes in the block is the timestamp then they may as well be flushed to disk at the same time and avoid the need for another update later on. Cheers, Andreas -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums
On 11/26/2014 11:55 AM, Roman Mamedov wrote: Is there really a good reason to stop these files without checksums from being cloneable? It's not like they have the noCoW attribute, so I'd assume any new write to these files would cause a CoW and proper checksums for all new blocks anyways. The problem seems to be that you are trying to clone a NODATASUM file to a file that would have data checsums. The resulting file _might_ then end up with some extents with, and others without, checksums if that target file were modified. So you _could_ reflink the file but you'd have to do it to another file with no data checksums -- which basically means a NOCOW file, or mounting with nodatasum while you do the reflink, but now you have more problem files. linux/fs/btrfs/ioctl.c @ ~ line 2930 /* don't make the dst file partly checksummed */ if ((BTRFS_I(src)-flags BTRFS_INODE_NODATASUM) != (BTRFS_I(dst)-flags BTRFS_INODE_NODATASUM)) { ret = -EINVAL; goto out_unlock; } Basically if the system allowed you to reflink from a no-data-sum to a data-sum file the results would be instantly unreadable for failing every single data checksum test. That or the entire checksum system would have to be advisory instead of functional. So yes, there is a good reason. David Foster Wallace famously said act in haste, repent in leasure. You kind-of shot yourself in the foot while attempting to save time. You promised yourself that you didn't need the checksums. Now at this point I'm going to make a few guesses... I don't see _anywhere_ in the kernel source or btrfs-progs where BTRFS_INODE_NODATASUM is explicitly cleared from any inode ever. It might be implicitly cleared somewhere but I couldn't find it. So all those unsummed files are probably unsummed for life. I also don't see anything in the code that says this ioctl will create the checksums for the selected file so you may have to do the copy you tried to avoid. Sorry man... If you haven't done much with the file system yet, you might want to reverse the conversion and do it again. Otherwise, you will want to copy the files long-hand, possibly in batches if you are more than 50% full on disk space. On the bright side... This would be the perfect moment to think about your backup/snapshot scheme. I always have a /whatever (e.g. /__System for a root partition) as my default subvolume that I normally mount. When I do my backups I mount -o subvol=/ /dev/whathaveyou /mnt/maintenance and then do my snapshots in /mnt/maintenance. That way every file in my N snapshots doesn't show up in the output of locate N+1 times. (e.g. this lets me hide my local snapshots/backups from normal operations) Also, now would be the perfect time to add compression to your default regime. Compressing the files only happens on write and so using compression involves a copy anyway. -- Rob. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums
On Wed, 26 Nov 2014 15:18:26 -0800 Robert White rwh...@pobox.com wrote: So you _could_ reflink the file but you'd have to do it to another file with no data checksums -- which basically means a NOCOW file, or mounting with nodatasum while you do the reflink, but now you have more problem files. Bingo!!! A cp --reflink to a destination that's been made chattr +C prior to that, works perfectly. My goal was to convert regular top-level directories into subvolumes (for further snapshotting). With that trick, I've been able to do that now w/o issues. $ mv Music Music.orig $ sudo btrfs sub create Music Create subvolume './Music' $ sudo chattr +C Music $ sudo cp -a --reflink Music.orig/* Music/ $ Finished with no rewriting necessary. After that I recursively-removed the +C attribute from all newly reflinked files, and cp --reflink as well as snapshotting of those works fine. Thanks for the idea. :) -- With respect, Roman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
On Wed, Nov 26, 2014 at 04:10:44PM -0700, Andreas Dilger wrote: On Nov 26, 2014, at 3:48 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote: Add an optimization for the MS_LAZYTIME mount option so that we will opportunistically write out any inodes with the I_DIRTY_TIME flag set in a particular inode table block when we need to update some inode in that inode table block anyway. Also add some temporary code so that we can set the lazytime mount option without needing a modified /sbin/mount program which can set MS_LAZYTIME. We can eventually make this go away once util-linux has added support. Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/ext4/inode.c | 49 ++--- fs/ext4/super.c | 9 + include/trace/events/ext4.h | 30 +++ 3 files changed, 85 insertions(+), 3 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5653fa4..8308c82 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle, } /* + * Opportunistically update the other time fields for other inodes in + * the same inode table block. + */ +static void ext4_update_other_inodes_time(struct super_block *sb, +unsigned long orig_ino, char *buf) +{ + struct ext4_inode_info *ei; + struct ext4_inode *raw_inode; + unsigned long ino; + struct inode*inode; + int i, inodes_per_block = EXT4_SB(sb)-s_inodes_per_block; + int inode_size = EXT4_INODE_SIZE(sb); + + ino = orig_ino ~(inodes_per_block - 1); + for (i = 0; i inodes_per_block; i++, ino++, buf += inode_size) { + if (ino == orig_ino) + continue; + inode = find_active_inode_nowait(sb, ino); + if (!inode || + (inode-i_state I_DIRTY_TIME) == 0 || + !spin_trylock(inode-i_lock)) { + iput(inode); + continue; + } + inode-i_state = ~I_DIRTY_TIME; + inode-i_ts_dirty_day = 0; + spin_unlock(inode-i_lock); + inode_requeue_dirtytime(inode); + + ei = EXT4_I(inode); + raw_inode = (struct ext4_inode *) buf; + + spin_lock(ei-i_raw_lock); + EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); + ext4_inode_csum_set(inode, raw_inode, ei); + spin_unlock(ei-i_raw_lock); + trace_ext4_other_inode_update_time(inode, orig_ino); + iput(inode); + } +} Am I right in that this now does unlogged timestamp updates of inodes? What happens when that buffer gets overwritten by log recover after a crash? The timestamp updates get lost? FYI, XFS has had all sorts of nasty log recovery corner cases caused by log recovery overwriting non-logged inode updates like this. In the past few years we've removed every single non-logged inode update optimisation so that all changes (including timestamps) are transactional so inode state on disk not matching what log recovery wrote to disk for all the other inode metadata... Optimistic unlogged inode updates are a slippery slope, and history tells me that it doesn't lead to a nice place Since ext4/jbd2 is logging the whole block, unlike XFS which is doing logical journaling, this isn't an unlogged update. It is just taking advantage of the fact that the whole block is going to be logged and written to the disk anyway. Urk - that's worse, isn't it? i.e the code above calls iput() from within a current transaction context? What happens if that drops the last reference to the inode and it gets evicted due to racing with an unlink? Won't that try to start another transaction to free the inode (i.e. through ext4_evict_inode())? If the only update needed for other inodes in the block is the timestamp then they may as well be flushed to disk at the same time and avoid the need for another update later on. Cheers, Andreas -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums
On 11/26/2014 03:33 PM, Roman Mamedov wrote: On Wed, 26 Nov 2014 15:18:26 -0800 Robert White rwh...@pobox.com wrote: So you _could_ reflink the file but you'd have to do it to another file with no data checksums -- which basically means a NOCOW file, or mounting with nodatasum while you do the reflink, but now you have more problem files. Bingo!!! A cp --reflink to a destination that's been made chattr +C prior to that, works perfectly. My goal was to convert regular top-level directories into subvolumes (for further snapshotting). With that trick, I've been able to do that now w/o issues. $ mv Music Music.orig $ sudo btrfs sub create Music Create subvolume './Music' $ sudo chattr +C Music $ sudo cp -a --reflink Music.orig/* Music/ $ Finished with no rewriting necessary. After that I recursively-removed the +C attribute from all newly reflinked files, and cp --reflink as well as snapshotting of those works fine. Thanks for the idea. :) Uh... you may _still_ have no checksums on any of those data extents. They are not going to come back until you write them to a normal file with a normal copy. So you may be lacking most of the data validation features of this filesystem. For instance you can, and always could, snapshot a NODATACOW/NODATASUM file just fine (I call it 1COW mode). Setting NODATACOW sets NODATASUM... Clearing NODATACOW does _not_ clear NODATASUM (at least not on a non-empty file) as near as I can tell, so that directory hierarchy and its subsequent snapshots is likely less safe than you think. You might want to go experiment. Make another new subvol (or at least a directory in a directory/root/subvol that never had the +C attribute set) and see if you can cp --reflink any of these files into that subdirectory without repeating the +C trick. Basically scrubbing and mirroring and sending your Music folder is an unprotected and unverified operation the way you've done this (if my reading of the code is correct). You _really_ might be better off spending the time and doing the copy to a normal directory. For instance without checksums if you btrfs scrub your volume it will read the file but it can't know if the file got corrupted, it can only tell you if you if the disk read completed without error. (there's a whole degenerate/simplified path in the code for scrubbing un-summed files). So seriously, you might be Saving yourself time right into a future data loss. Take this as a you've been warned. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums
On Wed, 26 Nov 2014 16:00:23 -0800 Robert White rwh...@pobox.com wrote: Uh... you may _still_ have no checksums on any of those data extents. They are not going to come back until you write them to a normal file with a normal copy. So you may be lacking most of the data validation features of this filesystem. Well, this FS is coming from being Ext4 for years, so it's not worse off now than it was before. And anyways the main feature that I wanted were snapshots. You might want to go experiment. Make another new subvol (or at least a directory in a directory/root/subvol that never had the +C attribute set) and see if you can cp --reflink any of these files into that subdirectory without repeating the +C trick. Ha, indeed I can't. Maybe there should be a way to generate checksums without rewriting files, just via reading them, then calculating and writing checksum to metadata. Clearing NODATACOW does _not_ clear NODATASUM (at least not on a non-empty file) as near as I can tell, so that directory hierarchy and its subsequent snapshots is likely less safe than you think. The nodatasum flag also isn't accessible via chattr, is it? -- With respect, Roman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums
On 11/26/2014 03:33 PM, Roman Mamedov wrote: Finished with no rewriting necessary. After that I recursively-removed the +C attribute from all newly reflinked files, and cp --reflink as well as snapshotting of those works fine. I did some double checking and I think you'll find that if you lsattr those files they still have the C (NoCOW) attribute, which also means they are still unsummed. Which also means that when you cp --reflink them the target files you create are also NoCOW. So you harmonized the lack of checksums with the linux-level C attribute. This has hidden your problem but not fixed it. (Trying to clear the NOCOW attribute on a file in BTRFS is _silently_ ignored as invalid. That recursive removal only changed the directories.) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums
On Wed, 26 Nov 2014 16:20:44 -0800 Robert White rwh...@pobox.com wrote: I did some double checking and I think you'll find that if you lsattr those files they still have the C (NoCOW) attribute, which also means they are still unsummed. Indeed, I looked at the top level only, which had just directories. (Trying to clear the NOCOW attribute on a file in BTRFS is _silently_ ignored as invalid. That recursive removal only changed the directories.) And the chattr command even completes with a zero exit code, this is rather unexpected. $ lsattr \ MP3.m3u; chattr -C \ MP3.m3u echo Yay; lsattr \ MP3.m3u ---C MP3.m3u Yay ---C MP3.m3u -- With respect, Roman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums
On 11/26/2014 04:20 PM, Roman Mamedov wrote: On Wed, 26 Nov 2014 16:00:23 -0800 Robert White rwh...@pobox.com wrote: Uh... you may _still_ have no checksums on any of those data extents. They are not going to come back until you write them to a normal file with a normal copy. So you may be lacking most of the data validation features of this filesystem. Well, this FS is coming from being Ext4 for years, so it's not worse off now than it was before. And anyways the main feature that I wanted were snapshots. Given that you wont be able to scrub the data and BTRFS is usable but still a little brittle, it might be a little worse off than you think. Plus if you ever find yourself in need of balancing you've tossed out one level of data integrity right here at the start. You might want to go experiment. Make another new subvol (or at least a directory in a directory/root/subvol that never had the +C attribute set) and see if you can cp --reflink any of these files into that subdirectory without repeating the +C trick. Ha, indeed I can't. Maybe there should be a way to generate checksums without rewriting files, just via reading them, then calculating and writing checksum to metadata. That problem would be computationally hard because you'd have to verify that no other file was using that extent before you put that extent under control of the csum machinery, otherwise you might break later break the COW promise when the file that knows those blocks by their checksum changes the contents out from underneath the other references. Clearing NODATACOW does _not_ clear NODATASUM (at least not on a non-empty file) as near as I can tell, so that directory hierarchy and its subsequent snapshots is likely less safe than you think. The nodatasum flag also isn't accessible via chattr, is it? It is not. NODATASUM and NODATACOW are private features. The linux kernel has no general concept of data checksums. The BTRFS guys mapped the NODATACOW attribute onto the existing lsattr/chattr bit because it was defined for another file system long ago. You'll also find that you can not set or clear the C attr on a file in a BTRFS unless its size is zero. So all your files now have the C attribute more-or-less forever. Only a normal copy operation (e.g. allocating new extents and writing the data into them wiht the checksum feature in force) will change that. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums
On 11/26/2014 04:28 PM, Roman Mamedov wrote: On Wed, 26 Nov 2014 16:20:44 -0800 Robert White rwh...@pobox.com wrote: (Trying to clear the NOCOW attribute on a file in BTRFS is _silently_ ignored as invalid. That recursive removal only changed the directories.) And the chattr command even completes with a zero exit code, this is rather unexpected. That's what silently means in this context. I didn't pick the result, and it's not what I would have done. I've got no idea if this was ever discussed at any length for pros-and-cons. I could make an argument for the silent result, or against it. Since the attribute is immutable there really isn't a nope, that's just not possible dave errno value to return that isn't as confusing as just skipping it. The closest result code would be ENOSUP (operation not supported) but changing attributes _is_ supported, just not that particular attribute in that particular circumstance. Also, the set attributes call sets all the attributes at once so there is no way to say which attribute was rejected. As such, a do what you can and let the people check the result behavior is not at all unreasonable. Life is full of flaws. 8-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums
On 11/26/2014 04:31 PM, Robert White wrote: On 11/26/2014 04:20 PM, Roman Mamedov wrote: On Wed, 26 Nov 2014 16:00:23 -0800 Robert White rwh...@pobox.com wrote: You might want to go experiment. Make another new subvol (or at least a directory in a directory/root/subvol that never had the +C attribute set) and see if you can cp --reflink any of these files into that subdirectory without repeating the +C trick. Ha, indeed I can't. Maybe there should be a way to generate checksums without rewriting files, just via reading them, then calculating and writing checksum to metadata. That problem would be computationally hard because you'd have to verify that no other file was using that extent before you put that extent under control of the csum machinery, otherwise you might break later break the COW promise when the file that knows those blocks by their checksum changes the contents out from underneath the other references. I explained this poorly/incorrectly... So some guy like yourself converts a file system, or mounts an existing file system with nodatasum and creates some file. As a result there is a file called /One that has no checksums on its extents. Then the guy creates a directory and sets +C on it, and copies the file into that directory with cp --reflink /One /d/Two. File /d/Two is a no-cow file. Now the guy somes back and decides to put the data checksums onto the extents of /One. At this moment everything _looks_ fine. Then the guy alters the first byte of /d/Two, which modifies the no-cow file in place. Now the guy tires to read /One ... what happens? The checksum doesn't match and the data has changed because of the NOCOW. (So I think, in practice, the 1COW mechanism prevents this just like it works for snapshots) But thats a _lot_ of corner cases that can be solved by telling someone to copy the file if they want the checksums to be recreated. e.g. the set of all files and all possibilities gets well into the computationally hard end of the swimming pool. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: Filipe Manana fdman...@suse.com --- V2: Added a comment to better explain the synchronization between block group removal and triming, as requested by Josef. V3: Changed logic to move extent map to pinned chunks list while holding the block group's spinlock, and similarly to make trimming check for the need to remove the em from the pinned chunks list while holding that spinlock. This is to avoid the (unlikely) case where the trimmer attempts to remove the em before chunk removal adds it to the pinned chunks list, which would leave the em alive (but outside the mapping tree) until umount time. fs/btrfs/ctree.h| 13 - fs/btrfs/disk-io.c | 14 ++ fs/btrfs/extent-tree.c | 43 ++- fs/btrfs/free-space-cache.c | 37 - fs/btrfs/volumes.c | 21 + 5 files changed, 117 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 7f40a65..4a167a0 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache { unsigned int dirty:1; unsigned int iref:1; unsigned int has_caching_ctl:1; + unsigned int removed:1; int disk_cache_state; @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache { /* For delayed block group creation or deletion of empty block groups */ struct list_head bg_list; + + atomic_t trimming; }; /* delayed seq elem */ @@ -1731,6 +1734,13 @@ struct btrfs_fs_info { /* For btrfs to record security options */ struct security_mnt_opts security_opts; + + /* +* Chunks that can't be freed yet (under a trim/discard operation) +* and will be latter freed. +*/ + rwlock_t pinned_chunks_lock; + struct list_head pinned_chunks; }; struct btrfs_subvolume_writers { @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
[PATCH] btrfs-progs: fix return value problem for btrfs sub show
If you exec: # btrfs sub show dir == non-subvolume dir The cmd print error messages as expected, but returns 0. By convetion, it should return non-zero and we should explicitly set it before it goto out. With other pieces adopted: 1) removed a unnecessary return value set -EINVAL 2) fixed another code branch which may return 0 upon error. 3) with 2) applied, the ret = 0 follows can be removed Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com --- cmds-subvolume.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/cmds-subvolume.c b/cmds-subvolume.c index fa58a24..53eec46 100644 --- a/cmds-subvolume.c +++ b/cmds-subvolume.c @@ -906,6 +906,7 @@ static int cmd_subvol_show(int argc, char **argv) } if (!ret) { fprintf(stderr, ERROR: '%s' is not a subvolume\n, fullpath); + ret = 1; goto out; } @@ -919,7 +920,6 @@ static int cmd_subvol_show(int argc, char **argv) fprintf(stderr, ERROR: %s doesn't belong to btrfs mount point\n, fullpath); - ret = -EINVAL; goto out; } ret = 1; @@ -958,13 +958,13 @@ static int cmd_subvol_show(int argc, char **argv) memset(get_ri, 0, sizeof(get_ri)); get_ri.root_id = sv_id; - if (btrfs_get_subvol(mntfd, get_ri)) { + ret = btrfs_get_subvol(mntfd, get_ri); + if (ret) { fprintf(stderr, ERROR: can't find '%s'\n, svpath); goto out; } - ret = 0; /* print the info */ printf(%s\n, fullpath); printf(\tName: \t\t\t%s\n, get_ri.name); -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: apply realpath for btrfs fi show when mount point is given
For now, # btrfs fi show /mnt/btrfs gives info correctly, while # btrfs fi show /mnt/btrfs/ gives nothing. This implies that the @realpath() function should be applied to unify the behavior. Made a more clear comment right above the call as well. Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com --- cmds-filesystem.c | 60 +++ 1 file changed, 34 insertions(+), 26 deletions(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index cd6b3c6..4e6d2f6 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -901,39 +901,47 @@ static int cmd_show(int argc, char **argv) if (strlen(search) == 0) usage(cmd_show_usage); type = check_arg_type(search); + + /* +* For search is a device: +* realpath do /dev/mapper/XX = /dev/dm-X +* which is required by BTRFS_SCAN_DEV +* For search is a mountpoint: +* realpath do /mnt/btrfs/ = /mnt/btrfs +* which shall be recognized by btrfs_scan_kernel() +*/ + if (!realpath(search, path)) { + fprintf(stderr, ERROR: Could not show %s: %s\n, + search, strerror(errno)); + return 1; + } + + search = path; + /* * needs spl handling if input arg is block dev * And if input arg is mount-point just print it * right away */ - if (type == BTRFS_ARG_BLKDEV) { - if (where == BTRFS_SCAN_LBLKID) { - /* we need to do this because -* legacy BTRFS_SCAN_DEV -* provides /dev/dm-x paths -*/ - if (realpath(search, path)) - search = path; + if (type == BTRFS_ARG_BLKDEV where != BTRFS_SCAN_LBLKID) { + ret = get_btrfs_mount(search, + mp, sizeof(mp)); + if (!ret) { + /* given block dev is mounted*/ + search = mp; + type = BTRFS_ARG_MNTPOINT; } else { - ret = get_btrfs_mount(search, - mp, sizeof(mp)); - if (!ret) { - /* given block dev is mounted*/ - search = mp; - type = BTRFS_ARG_MNTPOINT; - } else { - ret = dev_to_fsid(search, fsid); - if (ret) { - fprintf(stderr, - ERROR: No btrfs on %s\n, - search); - return 1; - } - uuid_unparse(fsid, uuid_buf); - search = uuid_buf; - type = BTRFS_ARG_UUID; - goto devs_only; + ret = dev_to_fsid(search, fsid); + if (ret) { + fprintf(stderr, + ERROR: No btrfs on %s\n, + search); + return 1; } + uuid_unparse(fsid, uuid_buf); + search = uuid_buf; + type = BTRFS_ARG_UUID; + goto devs_only; } } } -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote: On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote: On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote: The increase/decrease of bio counter is on the I/O path, so we should use io_schedule() instead of schedule(), or the deadlock might be triggered by the pending I/O in the plug list. io_schedule() can help us because it will flush all the pending I/O before the task is going to sleep. Can you please describe this deadlock in more detail? schedule() also triggers a flush of the plug list, and if that's no longer sufficient we can run into other problems (especially with preemption on). Sorry for my miss. I forgot to check the current implementation of schedule(), which flushes the plug list unconditionally. Please ignore this patch. I have updated my raid56-scrub-replace branch, please re-pull the branch. https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace Thanks Miao Thanks Miao -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] btrfs-progs: fix page align issue for lzo compress in restore
On Tue, 2014-10-14 at 11:32 +0200, David Sterba wrote: On Tue, Oct 14, 2014 at 10:06:16AM +0200, Marc Dietrich wrote: This hasn't landed in an btrfs-progs branch I found. Any update? I had it tagged for review and found something that needs fixing. The PAGE_CACHE_SIZE is hardcoded to 4k, this will break on filesystems with larger sectors (eg. the powerpc machines). I'll scheudule the patch post 3.17, with a fix. Hi David, I note that this patch is not yet in the latest integration, how's the fix going? -Gui -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't cp --reflink files on a Ext4-converted FS w/o checksums
On Thu, Nov 27, 2014 at 12:55:27AM +0500, Roman Mamedov wrote: Hello, I used btrfs-convert to switch my FS from Ext4 to Btrfs. As it was a rather large 10 TB filesystem, to save on the conversion time, I used the -d, disable data checksum option of btrfs-convert. Turns out now I can't cp --reflink any files that were already on the FS prior to conversion. The error message from cp is failed to clone [...] Invalid argument. I assume this is because of the lack of checksums; the only way to make old files cloneable is to plain copy them to a different place and then delete the originals, but that's what I was trying to avoid in the first place. Also I thought maybe defragmenting will help, but nope, doesn't seem to be the case, even ordering it to recompress data to a different method doesn't fix the problem. (Even if it did, it's still a lot of unnecessary rewriting). Is there really a good reason to stop these files without checksums from being cloneable? It's not like they have the noCoW attribute, so I'd assume any new write to these files would cause a CoW and proper checksums for all new blocks anyways. Just FYI, I recently send a patch[1] to fix btrfs-convert's checksum problem, it'll produce checksum for empty extents, which makes slow btrrfs-convert even slower. With this patch, you may try to convert your ext4 without disabling checksum and see if time is improved. [1]: Btrfs-progs: fix a bug of converting sparse ext2/3/4 https://patchwork.kernel.org/patch/5374741/ thanks, -liubo -- With respect, Roman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] Btrfs: add sha256 checksum option
On Tue, Nov 25, 2014 at 05:39:05PM +0100, David Sterba wrote: On Mon, Nov 24, 2014 at 01:23:05PM +0800, Liu Bo wrote: This brings a strong-but-slow checksum algorithm, sha256. Actually btrfs used sha256 at the early time, but then moved to crc32c for performance purposes. As crc32c is sort of weak due to its hash collision issue, we need a stronger algorithm as an alternative. Users can choose sha256 from mkfs.btrfs via $ mkfs.btrfs -C 256 /device There's already some good feedback so I'll try to cover what hasn't been mentioned yet. I think it's better to separate the preparatory works from adding the algorithm itself. The former can be merged (and tested) independently. That's a good point. There are several checksum algorithms that trade off speed and strength so we may want to support more than just sha256. Easy to add but I'd rather see them added in all at once than one by one. Another question is if we'd like to use different checksum for data and metadata. This would not cost any format change if we use the 2 bytes in super block csum_type. Yes, but breaking it into meta_csum_type and data_csum_type will need a imcompat flag. Optional/crazy/format change stuff: * per-file checksum algorithm - unlike compression, the whole file would have to use the same csum algo reflink would work iff the algos match snapshotting is unaffected * per-subvolume checksum algorithm - specify the csum type at creation time, or afterwards unless it's modified I thought about this before, if we enable this, a few cases need to be dealt with(at least), 1. convert file data's csum from one algorithm to another 2. to make different checksum length co-exist, we can either use different key.type for different algorithms, or pack checksum into a new structure that has algorithm types(and length). thanks, -liubo -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS messes up snapshot LV with origin
On Wed, Nov 26, 2014 at 06:19:05PM +0100, Goffredo Baroncelli wrote: On 11/25/2014 11:21 PM, Zygo Blaxell wrote: However I still doesn't understood why you want btrfs-w/multiple disk over LVM ? I want to split a few disks into partitions, but I want to create, move, and resize the partitions from time to time. Only LVM can do that without taking the machine down, reducing RAID integrity levels, hotplugging drives, or leaving installed drives idle most of the time. I want btrfs-raid1 because of its ability to replace corrupted or lost data from one disk using the other. If I run a single-volume btrfs on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage stack), I can detect lost data, but not replace it automatically from the other mirror. OK, now I have understood. Anyway as workaround, take in account that you can pass explicitly the devices as: mount -o device=/dev/sda,device=/dev/sdb,device=/dev/sdc /dev/sdd /mnt (supposing that the filesystem is on /dev/sda.../dev/sdd) I am working to a mount.btrfs helper. The aim of this helper is to manage the assembling of multiple devices; the main points will be: - wait until all the devices appeared ...and make sure there are no duplicate UUIDs. - allow (if required) to mount in degraded mode after a timeout This is a terrible idea with current btrfs, at least for read-write degraded mounting (fallback to read-only degraded would be OK). Mounting a filesystem read-write and degraded is something you only want to do immediately before you replace all the missing disks and bring the filesystem up to a non-degraded space and after you've ensured that the missing disks can never, ever come back; otherwise, btrfs eats your data in a slightly different way than we have discussed so far... - at this point it could/should also skip the lvm-snapshotted devices (but before I have to know how recognize these) You don't have to recognize them as snapshots (and it's probably better not to treat snapshots specially anyway--how do you know whether the snapshot or the origin LVs are wanted for mounting?). You just have to detect duplicate UUIDs at the btrfs subdevice level, and if any are found, stop immediately (or get a hint from the admin). This is a weakness of the current udev and asynchronous device hotplug concept: there is no notion of bus enumeration in progress, so we can be trying to assemble multi-device storage before we have all the devices visible. Assembly of aggregate storage (whatever it is--btrfs, md, lvm2...) has to wait until all known storage buses are fully enumerated in order to detect if there are duplicates. I hope to issue the patches in the next week BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 signature.asc Description: Digital signature
[PATCH] Btrfs-progs: add_refs_for_implied: Use root_id instead of root_bytenr.
resolve_one_root() returns the objectid of a tree rather than the logical address of the root node. Hence using root_bytenr is misleading. Fix this. Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com --- qgroup-verify.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/qgroup-verify.c b/qgroup-verify.c index c0c61d0..c98c751 100644 --- a/qgroup-verify.c +++ b/qgroup-verify.c @@ -553,11 +553,11 @@ static int add_refs_for_implied(struct btrfs_fs_info *info, u64 bytenr, struct tree_block *block) { int ret; - u64 root_bytenr = resolve_one_root(bytenr); + u64 root_id = resolve_one_root(bytenr); struct btrfs_root *root; struct btrfs_key key; - key.objectid = root_bytenr; + key.objectid = root_id; key.type = BTRFS_ROOT_ITEM_KEY; key.offset = (u64)-1; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html