[PATCH 24/27] btrfs: Convert to new freezing mechanism
We convert btrfs_file_aio_write() to use new freeze check. We also add proper freeze protection to btrfs_page_mkwrite(). We also add freeze protection to the transaction mechanism to avoid starting transactions on frozen filesystem. At minimum this is necessary to stop iput() of unlinked file to change frozen filesystem during truncation. Checks in cleaner_kthread() and transaction_kthread() can be safely removed since btrfs_freeze() will lock the mutexes and thus block the threads (and they shouldn't have anything to do anyway). CC: linux-btrfs@vger.kernel.org CC: Chris Mason chris.ma...@oracle.com Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/disk-io.c |3 --- fs/btrfs/file.c|3 ++- fs/btrfs/inode.c |6 +- fs/btrfs/transaction.c |7 +++ 4 files changed, 14 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7ae51de..663f3a0 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1532,8 +1532,6 @@ static int cleaner_kthread(void *arg) struct btrfs_root *root = arg; do { - vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE); - if (!(root-fs_info-sb-s_flags MS_RDONLY) mutex_trylock(root-fs_info-cleaner_mutex)) { btrfs_run_delayed_iputs(root); @@ -1565,7 +1563,6 @@ static int transaction_kthread(void *arg) do { cannot_commit = false; delay = HZ * 30; - vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE); mutex_lock(root-fs_info-transaction_kthread_mutex); spin_lock(root-fs_info-trans_lock); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 876cddd..7f3d7fe 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1392,7 +1392,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, ssize_t err = 0; size_t count, ocount; - vfs_check_frozen(inode-i_sb, SB_FREEZE_WRITE); + sb_start_write(inode-i_sb); mutex_lock(inode-i_mutex); @@ -1482,6 +1482,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, num_written = err; } out: + sb_end_write(inode-i_sb); current-backing_dev_info = NULL; return num_written ? num_written : err; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e9991ad..54e5378 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6563,6 +6563,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) u64 page_start; u64 page_end; + sb_start_pagefault(inode-i_sb); ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE); if (!ret) { ret = btrfs_update_time(vma-vm_file); @@ -6652,12 +6653,15 @@ again: unlock_extent_cached(io_tree, page_start, page_end, cached_state, GFP_NOFS); out_unlock: - if (!ret) + if (!ret) { + sb_end_pagefault(inode-i_sb); return VM_FAULT_LOCKED; + } unlock_page(page); out: btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE); out_noreserve: + sb_end_pagefault(inode-i_sb); return ret; } diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 1791c6e..05d8c29 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -325,6 +325,8 @@ again: if (!h) return ERR_PTR(-ENOMEM); + sb_start_intwrite(root-fs_info-sb); + if (may_wait_transaction(root, type)) wait_current_trans(root); @@ -335,6 +337,7 @@ again: } while (ret == -EBUSY); if (ret 0) { + sb_end_intwrite(root-fs_info-sb); kmem_cache_free(btrfs_trans_handle_cachep, h); return ERR_PTR(ret); } @@ -524,6 +527,8 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, count++; } + sb_end_intwrite(root-fs_info-sb); + if (lock !atomic_read(root-fs_info-open_ioctl_trans) should_end_transaction(trans, root)) { trans-transaction-blocked = 1; @@ -1507,6 +1512,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, put_transaction(cur_trans); put_transaction(cur_trans); + sb_end_intwrite(root-fs_info-sb); + trace_btrfs_transaction_commit(root); btrfs_scrub_continue(root); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/27 v6] Fix filesystem freezing deadlocks
Hello, here is the sixth iteration of my patches to improve filesystem freezing. The change since last iteration is that filesystem can be frozen with open but unlinked files. After some thinking, I've decided that the best way to handle this is to block removal inside -evict_inode() of each filesystem and use fs-internal level of freeze protection for that (usually I've instrumented filesystem's transaction system to use freeze protection). Handling inside VFS would be less work but the only level of freeze protection that has a chance of not causing deadlocks is the one used for page faults and even there it's not clear lock ordering would be correct wrt some fs-specific locks. I've converted ext2, ext4, btrfs, xfs, nilfs2, ocfs2, gfs2 and also checked that ext3, reiserfs, jfs should work as well (they have their internal freeze protection mechanisms, possibly they could be replaced by a generic one but given these are mostly aging filesystems, it's not a real priority IHMO). So finally I'm not aware of any pending issue with this patch set so if you have some concern, please speak up! Introductory text to first time readers: Filesystem freezing is currently racy and thus we can end up with dirty data on frozen filesystem (see changelog patch 13 for detailed race description). This patch series aims at fixing this. To be able to block all places where inodes get dirtied, I've moved filesystem file_update_time() call to -page_mkwrite callback (patches 01-07) and put freeze handling in mnt_want_write() / mnt_drop_write(). That however required some code shuffling and changes to kern_path_create() (see patches 09-12). I think the result is OK but opinions may differ ;). The advantage of this change also is that all filesystems get freeze protection almost for free - even ext2 can handle freezing well now. Another potential contention point might be patch 19. In that patch we make freeze_super() refuse to freeze the filesystem when there are open but unlinked files which may be impractical in some cases. The main reason for this is the problem with handling of file deletion from fput() called with mmap_sem held (e.g. from munmap(2)), and then there's the fact that we cannot really force such filesystem into a consistent state... But if people think that freezing with open but unlinked files should happen, then I have some possible solutions in mind (maybe as a separate patchset since this is large enough). I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen filesystem despite beating it with fsstress and bash-shared-mapping while freezing and unfreezing for several hours (using ext4 and xfs) so I'm reasonably confident this could finally be the right solution. Changes since v5: * handle unlinked open files on frozen filesystem * lockdep keys for freeze protection are now per filesystem type * taught lockdep that freeze protection at lower level does not create dependency when we already hold freeze protection at higher level * rebased on 3.5-rc1-ish Changes since v4: * added a couple of Acked-by's * added some comments doc update * added patches from series Push file_update_time() into .page_mkwrite since it doesn't make much sense to keep them separate anymore * rebased on top of 3.4-rc2 Changes since v3: * added third level of freezing for fs internal purposes - hooked some filesystems to use it (XFS, nilfs2) * removed racy i_size check from filemap_mkwrite() Changes since v2: * completely rewritten * freezing is now blocked at VFS entry points * two stage freezing to handle both mmapped writes and other IO The biggest changes since v1: * have two counters to provide safe state transitions for SB_FREEZE_WRITE and SB_FREEZE_TRANS states * use percpu counters instead of own percpu structure * added documentation fixes from the old fs freezing series * converted XFS to use SB_FREEZE_TRANS counter instead of its private m_active_trans counter Honza CC: Alex Elder el...@kernel.org CC: Anton Altaparmakov an...@tuxera.com CC: Ben Myers b...@sgi.com CC: Chris Mason chris.ma...@oracle.com CC: cluster-de...@redhat.com CC: David S. Miller da...@davemloft.net CC: fuse-de...@lists.sourceforge.net CC: J. Bruce Fields bfie...@fieldses.org CC: Joel Becker jl...@evilplan.org CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp CC: linux-btrfs@vger.kernel.org CC: linux-e...@vger.kernel.org CC: linux-...@vger.kernel.org CC: linux-ni...@vger.kernel.org CC: linux-ntfs-...@lists.sourceforge.net CC: Mark Fasheh mfas...@suse.com CC: Miklos Szeredi mik...@szeredi.hu CC: ocfs2-de...@oss.oracle.com CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp CC: Steven Whitehouse swhit...@redhat.com CC: Theodore Ts'o ty...@mit.edu CC: x...@oss.sgi.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More
[PATCH 00/19 v5] Fix filesystem freezing deadlocks
Hello, here is the fifth iteration of my patches to improve filesystem freezing. No serious changes since last time. Mostly I rebased patches and merged this series with series moving file_update_time() to -page_mkwrite() to simplify testing and merging. Filesystem freezing is currently racy and thus we can end up with dirty data on frozen filesystem (see changelog patch 13 for detailed race description). This patch series aims at fixing this. To be able to block all places where inodes get dirtied, I've moved filesystem file_update_time() call to -page_mkwrite callback (patches 01-07) and put freeze handling in mnt_want_write() / mnt_drop_write(). That however required some code shuffling and changes to kern_path_create() (see patches 09-12). I think the result is OK but opinions may differ ;). The advantage of this change also is that all filesystems get freeze protection almost for free - even ext2 can handle freezing well now. Another potential contention point might be patch 19. In that patch we make freeze_super() refuse to freeze the filesystem when there are open but unlinked files which may be impractical in some cases. The main reason for this is the problem with handling of file deletion from fput() called with mmap_sem held (e.g. from munmap(2)), and then there's the fact that we cannot really force such filesystem into a consistent state... But if people think that freezing with open but unlinked files should happen, then I have some possible solutions in mind (maybe as a separate patchset since this is large enough). I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen filesystem despite beating it with fsstress and bash-shared-mapping while freezing and unfreezing for several hours (using ext4 and xfs) so I'm reasonably confident this could finally be the right solution. Changes since v4: * added a couple of Acked-by's * added some comments doc update * added patches from series Push file_update_time() into .page_mkwrite since it doesn't make much sense to keep them separate anymore * rebased on top of 3.4-rc2 Changes since v3: * added third level of freezing for fs internal purposes - hooked some filesystems to use it (XFS, nilfs2) * removed racy i_size check from filemap_mkwrite() Changes since v2: * completely rewritten * freezing is now blocked at VFS entry points * two stage freezing to handle both mmapped writes and other IO The biggest changes since v1: * have two counters to provide safe state transitions for SB_FREEZE_WRITE and SB_FREEZE_TRANS states * use percpu counters instead of own percpu structure * added documentation fixes from the old fs freezing series * converted XFS to use SB_FREEZE_TRANS counter instead of its private m_active_trans counter Honza CC: Alex Elder el...@kernel.org CC: Anton Altaparmakov an...@tuxera.com CC: Ben Myers b...@sgi.com CC: Chris Mason chris.ma...@oracle.com CC: cluster-de...@redhat.com CC: David S. Miller da...@davemloft.net CC: fuse-de...@lists.sourceforge.net CC: J. Bruce Fields bfie...@fieldses.org CC: Joel Becker jl...@evilplan.org CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp CC: linux-btrfs@vger.kernel.org CC: linux-e...@vger.kernel.org CC: linux-...@vger.kernel.org CC: linux-ni...@vger.kernel.org CC: linux-ntfs-...@lists.sourceforge.net CC: Mark Fasheh mfas...@suse.com CC: Miklos Szeredi mik...@szeredi.hu CC: ocfs2-de...@oss.oracle.com CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp CC: Steven Whitehouse swhit...@redhat.com CC: Theodore Ts'o ty...@mit.edu CC: x...@oss.sgi.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks
The subject should have been [PATCH 00/27]... Sorry for the mistake. Honza On Mon 16-04-12 18:13:38, Jan Kara wrote: Hello, here is the fifth iteration of my patches to improve filesystem freezing. No serious changes since last time. Mostly I rebased patches and merged this series with series moving file_update_time() to -page_mkwrite() to simplify testing and merging. Filesystem freezing is currently racy and thus we can end up with dirty data on frozen filesystem (see changelog patch 13 for detailed race description). This patch series aims at fixing this. To be able to block all places where inodes get dirtied, I've moved filesystem file_update_time() call to -page_mkwrite callback (patches 01-07) and put freeze handling in mnt_want_write() / mnt_drop_write(). That however required some code shuffling and changes to kern_path_create() (see patches 09-12). I think the result is OK but opinions may differ ;). The advantage of this change also is that all filesystems get freeze protection almost for free - even ext2 can handle freezing well now. Another potential contention point might be patch 19. In that patch we make freeze_super() refuse to freeze the filesystem when there are open but unlinked files which may be impractical in some cases. The main reason for this is the problem with handling of file deletion from fput() called with mmap_sem held (e.g. from munmap(2)), and then there's the fact that we cannot really force such filesystem into a consistent state... But if people think that freezing with open but unlinked files should happen, then I have some possible solutions in mind (maybe as a separate patchset since this is large enough). I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen filesystem despite beating it with fsstress and bash-shared-mapping while freezing and unfreezing for several hours (using ext4 and xfs) so I'm reasonably confident this could finally be the right solution. Changes since v4: * added a couple of Acked-by's * added some comments doc update * added patches from series Push file_update_time() into .page_mkwrite since it doesn't make much sense to keep them separate anymore * rebased on top of 3.4-rc2 Changes since v3: * added third level of freezing for fs internal purposes - hooked some filesystems to use it (XFS, nilfs2) * removed racy i_size check from filemap_mkwrite() Changes since v2: * completely rewritten * freezing is now blocked at VFS entry points * two stage freezing to handle both mmapped writes and other IO The biggest changes since v1: * have two counters to provide safe state transitions for SB_FREEZE_WRITE and SB_FREEZE_TRANS states * use percpu counters instead of own percpu structure * added documentation fixes from the old fs freezing series * converted XFS to use SB_FREEZE_TRANS counter instead of its private m_active_trans counter Honza CC: Alex Elder el...@kernel.org CC: Anton Altaparmakov an...@tuxera.com CC: Ben Myers b...@sgi.com CC: Chris Mason chris.ma...@oracle.com CC: cluster-de...@redhat.com CC: David S. Miller da...@davemloft.net CC: fuse-de...@lists.sourceforge.net CC: J. Bruce Fields bfie...@fieldses.org CC: Joel Becker jl...@evilplan.org CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp CC: linux-btrfs@vger.kernel.org CC: linux-e...@vger.kernel.org CC: linux-...@vger.kernel.org CC: linux-ni...@vger.kernel.org CC: linux-ntfs-...@lists.sourceforge.net CC: Mark Fasheh mfas...@suse.com CC: Miklos Szeredi mik...@szeredi.hu CC: ocfs2-de...@oss.oracle.com CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp CC: Steven Whitehouse swhit...@redhat.com CC: Theodore Ts'o ty...@mit.edu CC: x...@oss.sgi.com -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 24/27] btrfs: Convert to new freezing mechanism
We convert btrfs_file_aio_write() to use new freeze check. We also add proper freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and transaction_kthread() can be safely removed since btrfs_freeze() will lock the mutexes and thus block the threads (and they shouldn't have anything to do anyway). CC: linux-btrfs@vger.kernel.org CC: Chris Mason chris.ma...@oracle.com Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/disk-io.c |3 --- fs/btrfs/file.c|3 ++- fs/btrfs/inode.c |6 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 20196f4..555a57a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1527,8 +1527,6 @@ static int cleaner_kthread(void *arg) struct btrfs_root *root = arg; do { - vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE); - if (!(root-fs_info-sb-s_flags MS_RDONLY) mutex_trylock(root-fs_info-cleaner_mutex)) { btrfs_run_delayed_iputs(root); @@ -1560,7 +1558,6 @@ static int transaction_kthread(void *arg) do { cannot_commit = false; delay = HZ * 30; - vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE); mutex_lock(root-fs_info-transaction_kthread_mutex); spin_lock(root-fs_info-trans_lock); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index d83260d..a48251e 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1358,7 +1358,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, ssize_t err = 0; size_t count, ocount; - vfs_check_frozen(inode-i_sb, SB_FREEZE_WRITE); + sb_start_write(inode-i_sb); mutex_lock(inode-i_mutex); @@ -1449,6 +1449,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, num_written = err; } out: + sb_end_write(inode-i_sb); current-backing_dev_info = NULL; return num_written ? num_written : err; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 115bc05..db4fc01 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6592,6 +6592,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) u64 page_start; u64 page_end; + sb_start_pagefault(inode-i_sb); ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE); if (!ret) { ret = btrfs_update_time(vma-vm_file); @@ -6681,12 +6682,15 @@ again: unlock_extent_cached(io_tree, page_start, page_end, cached_state, GFP_NOFS); out_unlock: - if (!ret) + if (!ret) { + sb_end_pagefault(inode-i_sb); return VM_FAULT_LOCKED; + } unlock_page(page); out: btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE); out_noreserve: + sb_end_pagefault(inode-i_sb); return ret; } -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/27] btrfs: Push mnt_want_write() outside of i_mutex
When mnt_want_write() starts to handle freezing it will get a full lock semantics requiring proper lock ordering. So push mnt_want_write() call consistently outside of i_mutex. CC: Chris Mason chris.ma...@oracle.com CC: linux-btrfs@vger.kernel.org Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/ioctl.c | 23 +++ 1 files changed, 11 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 18cc23d..869d913 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) if (!inode_owner_or_capable(inode)) return -EACCES; + ret = mnt_want_write_file(file); + if (ret) + return ret; + mutex_lock(inode-i_mutex); ip_oldflags = ip-flags; @@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) } } - ret = mnt_want_write_file(file); - if (ret) - goto out_unlock; - if (flags FS_SYNC_FL) ip-flags |= BTRFS_INODE_SYNC; else @@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) inode-i_flags = i_oldflags; } - mnt_drop_write_file(file); out_unlock: mutex_unlock(inode-i_mutex); + mnt_drop_write_file(file); return ret; } @@ -639,6 +639,10 @@ static noinline int btrfs_mksubvol(struct path *parent, struct dentry *dentry; int error; + error = mnt_want_write(parent-mnt); + if (error) + return error; + mutex_lock_nested(dir-i_mutex, I_MUTEX_PARENT); dentry = lookup_one_len(name, parent-dentry, namelen); @@ -650,13 +654,9 @@ static noinline int btrfs_mksubvol(struct path *parent, if (dentry-d_inode) goto out_dput; - error = mnt_want_write(parent-mnt); - if (error) - goto out_dput; - error = btrfs_may_create(dir, dentry); if (error) - goto out_drop_write; + goto out_dput; down_read(BTRFS_I(dir)-root-fs_info-subvol_sem); @@ -674,12 +674,11 @@ static noinline int btrfs_mksubvol(struct path *parent, fsnotify_mkdir(dir, dentry); out_up_read: up_read(BTRFS_I(dir)-root-fs_info-subvol_sem); -out_drop_write: - mnt_drop_write(parent-mnt); out_dput: dput(dentry); out_unlock: mutex_unlock(dir-i_mutex); + mnt_drop_write(parent-mnt); return error; } -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/19] btrfs: Push mnt_want_write() outside of i_mutex
When mnt_want_write() starts to handle freezing it will get a full lock semantics requiring proper lock ordering. So push mnt_want_write() call consistently outside of i_mutex. CC: Chris Mason chris.ma...@oracle.com CC: linux-btrfs@vger.kernel.org Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/ioctl.c | 23 +++ 1 files changed, 11 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 03bb62a..c855e55 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) if (!inode_owner_or_capable(inode)) return -EACCES; + ret = mnt_want_write_file(file); + if (ret) + return ret; + mutex_lock(inode-i_mutex); ip_oldflags = ip-flags; @@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) } } - ret = mnt_want_write_file(file); - if (ret) - goto out_unlock; - if (flags FS_SYNC_FL) ip-flags |= BTRFS_INODE_SYNC; else @@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) inode-i_flags = i_oldflags; } - mnt_drop_write_file(file); out_unlock: mutex_unlock(inode-i_mutex); + mnt_drop_write_file(file); return ret; } @@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent, struct dentry *dentry; int error; + error = mnt_want_write(parent-mnt); + if (error) + return error; + mutex_lock_nested(dir-i_mutex, I_MUTEX_PARENT); dentry = lookup_one_len(name, parent-dentry, namelen); @@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent, if (dentry-d_inode) goto out_dput; - error = mnt_want_write(parent-mnt); - if (error) - goto out_dput; - error = btrfs_may_create(dir, dentry); if (error) - goto out_drop_write; + goto out_dput; down_read(BTRFS_I(dir)-root-fs_info-subvol_sem); @@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent, fsnotify_mkdir(dir, dentry); out_up_read: up_read(BTRFS_I(dir)-root-fs_info-subvol_sem); -out_drop_write: - mnt_drop_write(parent-mnt); out_dput: dput(dentry); out_unlock: mutex_unlock(dir-i_mutex); + mnt_drop_write(parent-mnt); return error; } -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/19 v4] Fix filesystem freezing deadlocks
Hello, here is the fourth iteration of my patches to improve filesystem freezing. Filesystem freezing is currently racy and thus we can end up with dirty data on frozen filesystem (see changelog patch 06 for detailed race description). This patch series aims at fixing this. To be able to block all places where inodes get dirtied, I've moved filesystem freeze handling in mnt_want_write() / mnt_drop_write(). This however required some code shuffling and changes to kern_path_create() (see patches 02-05). I think the result is OK but opinions may differ ;). The advantage of this change also is that all filesystems get freeze protection almost for free - even ext2 can handle freezing well now. Another potential contention point might be patch 19. In that patch we make freeze_super() refuse to freeze the filesystem when there are open but unlinked files which may be impractical in some cases. The main reason for this is the problem with handling of file deletion from fput() called with mmap_sem held (e.g. from munmap(2)), and then there's the fact that we cannot really force such filesystem into a consistent state... But if people think that freezing with open but unlinked files should happen, then I have some possible solutions in mind (maybe as a separate patchset since this is large enough). I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen filesystem despite beating it with fsstress and bash-shared-mapping while freezing and unfreezing for several hours (using ext4 and xfs) so I'm reasonably confident this could finally be the right solution. And for people wanting to test - this patchset is based on patch series Push file_update_time() into .page_mkwrite so you'll need to pull that one in as well. Changes since v3: * added third level of freezing for fs internal purposes - hooked some filesystems to use it (XFS, nilfs2) * removed racy i_size check from filemap_mkwrite() Changes since v2: * completely rewritten * freezing is now blocked at VFS entry points * two stage freezing to handle both mmapped writes and other IO The biggest changes since v1: * have two counters to provide safe state transitions for SB_FREEZE_WRITE and SB_FREEZE_TRANS states * use percpu counters instead of own percpu structure * added documentation fixes from the old fs freezing series * converted XFS to use SB_FREEZE_TRANS counter instead of its private m_active_trans counter Honza CC: Alex Elder el...@kernel.org CC: Anton Altaparmakov an...@tuxera.com CC: Ben Myers b...@sgi.com CC: Chris Mason chris.ma...@oracle.com CC: cluster-de...@redhat.com CC: David S. Miller da...@davemloft.net CC: fuse-de...@lists.sourceforge.net CC: J. Bruce Fields bfie...@fieldses.org CC: Joel Becker jl...@evilplan.org CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp CC: linux-btrfs@vger.kernel.org CC: linux-e...@vger.kernel.org CC: linux-...@vger.kernel.org CC: linux-ni...@vger.kernel.org CC: linux-ntfs-...@lists.sourceforge.net CC: Mark Fasheh mfas...@suse.com CC: Miklos Szeredi mik...@szeredi.hu CC: ocfs2-de...@oss.oracle.com CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp CC: Steven Whitehouse swhit...@redhat.com CC: Theodore Ts'o ty...@mit.edu CC: x...@oss.sgi.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 17/19] btrfs: Convert to new freezing mechanism
We convert btrfs_file_aio_write() to use new freeze check. We also add proper freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and transaction_kthread() can be safely removed since btrfs_freeze() will lock the mutexes and thus block the threads (and they shouldn't have anything to do anyway). CC: linux-btrfs@vger.kernel.org CC: Chris Mason chris.ma...@oracle.com Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/disk-io.c |3 --- fs/btrfs/file.c|3 ++- fs/btrfs/inode.c |6 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 811d9f9..fc0f74c 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1586,8 +1586,6 @@ static int cleaner_kthread(void *arg) struct btrfs_root *root = arg; do { - vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE); - if (!(root-fs_info-sb-s_flags MS_RDONLY) mutex_trylock(root-fs_info-cleaner_mutex)) { btrfs_run_delayed_iputs(root); @@ -1618,7 +1616,6 @@ static int transaction_kthread(void *arg) do { delay = HZ * 30; - vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE); mutex_lock(root-fs_info-transaction_kthread_mutex); spin_lock(root-fs_info-trans_lock); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 859ba2d..1aac7ca 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1348,7 +1348,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, ssize_t err = 0; size_t count, ocount; - vfs_check_frozen(inode-i_sb, SB_FREEZE_WRITE); + sb_start_write(inode-i_sb); mutex_lock(inode-i_mutex); @@ -1439,6 +1439,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, num_written = err; } out: + sb_end_write(inode-i_sb); current-backing_dev_info = NULL; return num_written ? num_written : err; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 32214fe..63c9006 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6405,6 +6405,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) u64 page_start; u64 page_end; + sb_start_pagefault(inode-i_sb); ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE); if (!ret) { ret = btrfs_update_time(vma-vm_file); @@ -6495,12 +6496,15 @@ again: unlock_extent_cached(io_tree, page_start, page_end, cached_state, GFP_NOFS); out_unlock: - if (!ret) + if (!ret) { + sb_end_pagefault(inode-i_sb); return VM_FAULT_LOCKED; + } unlock_page(page); out: btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE); out_noreserve: + sb_end_pagefault(inode-i_sb); return ret; } -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3/4, btrfs, ocfs2: How to assure that cleancache_invalidate_fs is called on every superblock free
Hello, On Fri 09-03-12 14:40:22, Andor Daam wrote: Is it ever possible for a superblock for a mounted filesystem to be free'd without a previous call to unmount the filesystem? No, I don't think so (well, except for cases where we do not manage to fully setup the superblock). But be aware that mount/umount need not be really the entry points you are looking for since filesystem can be mounted several times. Rather deactivate_locked_supers() is the place you are looking for... I need to be certain that the function cleancache_invalidate_fs, which is at the moment called by deactivate_locked_super (fs/super.c) [1], is called before every free on a superblock of cleancache-enabled filesystems. Is this already the case or are there situations in which this does not happen? It would be interesting to know this, as we are planning to have cleancache save pointers to superblocks of every mounted cleancache-enabled filesystem [2] and it would be fatal if a superblock is free'd without cleancache being notified. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: getdents - ext4 vs btrfs performance
On Fri 02-03-12 14:32:15, Ted Tso wrote: On Fri, Mar 02, 2012 at 09:26:51AM -0500, Chris Mason wrote: It would be interesting to have a project where someone added fallocate() support into libelf, and then added some hueristics into ext4 so that if a file is fallocated to a precise size, or if the file is fully written and closed before writeback begins, that we use this to more efficiently pack the space used by the files by the block allocator. This is a place where I would not be surprised that XFS has some better code to avoid accelerated file system aging, and where we could do better with ext4 with some development effort. AFAIK XFS people actually prefer that applications let them do their work using delayed allocation and do not interfere with fallocate(2) calls. The problem they have with fallocate(2) is that it forces you to allocate blocks while with delayed allocation you can make the decision about allocation later. So for small files which completely fit into pagecache before they get pushed out by writeback, they can make better decisions from delayed allocation. Just dumping my memory from some other thread... Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 17/19] btrfs: Convert to new freezing mechanism
We convert btrfs_file_aio_write() to use new freeze check. We also add proper freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and transaction_kthread() can be safely removed since btrfs_freeze() will lock the mutexes and thus block the threads (and they shouldn't have anything to do anyway). CC: linux-btrfs@vger.kernel.org CC: Chris Mason chris.ma...@oracle.com Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/disk-io.c |3 --- fs/btrfs/file.c|3 ++- fs/btrfs/inode.c |6 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 811d9f9..fc0f74c 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1586,8 +1586,6 @@ static int cleaner_kthread(void *arg) struct btrfs_root *root = arg; do { - vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE); - if (!(root-fs_info-sb-s_flags MS_RDONLY) mutex_trylock(root-fs_info-cleaner_mutex)) { btrfs_run_delayed_iputs(root); @@ -1618,7 +1616,6 @@ static int transaction_kthread(void *arg) do { delay = HZ * 30; - vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE); mutex_lock(root-fs_info-transaction_kthread_mutex); spin_lock(root-fs_info-trans_lock); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 859ba2d..1aac7ca 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1348,7 +1348,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, ssize_t err = 0; size_t count, ocount; - vfs_check_frozen(inode-i_sb, SB_FREEZE_WRITE); + sb_start_write(inode-i_sb); mutex_lock(inode-i_mutex); @@ -1439,6 +1439,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, num_written = err; } out: + sb_end_write(inode-i_sb); current-backing_dev_info = NULL; return num_written ? num_written : err; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 32214fe..63c9006 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6405,6 +6405,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) u64 page_start; u64 page_end; + sb_start_pagefault(inode-i_sb); ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE); if (!ret) { ret = btrfs_update_time(vma-vm_file); @@ -6495,12 +6496,15 @@ again: unlock_extent_cached(io_tree, page_start, page_end, cached_state, GFP_NOFS); out_unlock: - if (!ret) + if (!ret) { + sb_end_pagefault(inode-i_sb); return VM_FAULT_LOCKED; + } unlock_page(page); out: btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE); out_noreserve: + sb_end_pagefault(inode-i_sb); return ret; } -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/19] Fix filesystem freezing deadlocks
Hallelujah, after a couple of weeks and several rewrites, here comes the third iteration of my patches to improve filesystem freezing. Filesystem freezing is currently racy and thus we can end up with dirty data on frozen filesystem (see changelog patch 06 for detailed race description). This patch series aims at fixing this. To be able to block all places where inodes get dirtied, I've moved filesystem freeze handling in mnt_want_write() / mnt_drop_write(). This however required some code shuffling and changes to kern_path_create() (see patches 02-05). I think the result is OK but opinions may differ ;). The advantage of this change also is that all filesystems get freeze protection almost for free - even ext2 can handle freezing well now. Another potential contention point might be patch 19. In that patch we make freeze_super() refuse to freeze the filesystem when there are open but unlinked files which may be impractical in some cases. The main reason for this is the problem with handling of file deletion from fput() called with mmap_sem held (e.g. from munmap(2)), and then there's the fact that we cannot really force such filesystem into a consistent state... But if people think that freezing with open but unlinked files should happen, then I have some possible solutions in mind (maybe as a separate patchset since this is large enough). I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen filesystem despite beating it with fsstress and bash-shared-mapping while freezing and unfreezing for several hours (using ext4 and xfs) so I'm reasonably confident this could finally be the right solution. And for people wanting to test - this patchset is based on patch series Push file_update_time() into .page_mkwrite so you'll need to pull that one in as well. Changes since v2: * completely rewritten * freezing is now blocked at VFS entry points * two stage freezing to handle both mmapped writes and other IO The biggest changes since v1: * have two counters to provide safe state transitions for SB_FREEZE_WRITE and SB_FREEZE_TRANS states * use percpu counters instead of own percpu structure * added documentation fixes from the old fs freezing series * converted XFS to use SB_FREEZE_TRANS counter instead of its private m_active_trans counter Honza CC: Alex Elder el...@kernel.org CC: Anton Altaparmakov an...@tuxera.com CC: Ben Myers b...@sgi.com CC: Chris Mason chris.ma...@oracle.com CC: cluster-de...@redhat.com CC: David S. Miller da...@davemloft.net CC: fuse-de...@lists.sourceforge.net CC: J. Bruce Fields bfie...@fieldses.org CC: Joel Becker jl...@evilplan.org CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp CC: linux-btrfs@vger.kernel.org CC: linux-e...@vger.kernel.org CC: linux-...@vger.kernel.org CC: linux-ni...@vger.kernel.org CC: linux-ntfs-...@lists.sourceforge.net CC: Mark Fasheh mfas...@suse.com CC: Miklos Szeredi mik...@szeredi.hu CC: ocfs2-de...@oss.oracle.com CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp CC: Steven Whitehouse swhit...@redhat.com CC: Theodore Ts'o ty...@mit.edu CC: x...@oss.sgi.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/19] btrfs: Push mnt_want_write() outside of i_mutex
When mnt_want_write() starts to handle freezing it will get a full lock semantics requiring proper lock ordering. So push mnt_want_write() call consistently outside of i_mutex. CC: Chris Mason chris.ma...@oracle.com CC: linux-btrfs@vger.kernel.org Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/ioctl.c | 23 +++ 1 files changed, 11 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 03bb62a..c855e55 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) if (!inode_owner_or_capable(inode)) return -EACCES; + ret = mnt_want_write_file(file); + if (ret) + return ret; + mutex_lock(inode-i_mutex); ip_oldflags = ip-flags; @@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) } } - ret = mnt_want_write_file(file); - if (ret) - goto out_unlock; - if (flags FS_SYNC_FL) ip-flags |= BTRFS_INODE_SYNC; else @@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) inode-i_flags = i_oldflags; } - mnt_drop_write_file(file); out_unlock: mutex_unlock(inode-i_mutex); + mnt_drop_write_file(file); return ret; } @@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent, struct dentry *dentry; int error; + error = mnt_want_write(parent-mnt); + if (error) + return error; + mutex_lock_nested(dir-i_mutex, I_MUTEX_PARENT); dentry = lookup_one_len(name, parent-dentry, namelen); @@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent, if (dentry-d_inode) goto out_dput; - error = mnt_want_write(parent-mnt); - if (error) - goto out_dput; - error = btrfs_may_create(dir, dentry); if (error) - goto out_drop_write; + goto out_dput; down_read(BTRFS_I(dir)-root-fs_info-subvol_sem); @@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent, fsnotify_mkdir(dir, dentry); out_up_read: up_read(BTRFS_I(dir)-root-fs_info-subvol_sem); -out_drop_write: - mnt_drop_write(parent-mnt); out_dput: dput(dentry); out_unlock: mutex_unlock(dir-i_mutex); + mnt_drop_write(parent-mnt); return error; } -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mkfs: Handle creation of filesystem larger than the first device
On Wed 08-02-12 22:05:26, Phillip Susi wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 02/08/2012 06:20 PM, Jan Kara wrote: Thanks for your reply. I admit I was not sure what exactly size argument should be. So after looking into the code for a while I figured it should be a total size of the filesystem - or differently it should be size of virtual block address space in the filesystem. Thus when filesystem has more devices (or admin wants to add more devices later), it can be larger than the first device. But I'm not really a btrfs developper so I might be wrong and of course feel free to fix the issue as you deem fit. The size of the fs is the total size of the individual disks. When you limit the size, you limit the size of a disk, not the whole fs. IIRC, mkfs initializes the fs on the first disk, which is why it was using that size as the size of the whole fs, and then adds the other disks after ( which then add their size to the total fs size ). OK, I missed that btrfs_add_to_fsid() increases total size of the filesystem. So now I agree with you. New patch is attached. Thanks for your review. It might be nice if mkfs could take sizes for each disk, but it only seems to take one size for the initial disk. Yes, but I don't see a realistic usecase so I don't think it's really worth the work. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR From e5f46872232520310c56327593c02ef6a7f5ea33 Mon Sep 17 00:00:00 2001 From: Jan Kara j...@suse.cz Date: Fri, 10 Feb 2012 11:44:44 +0100 Subject: [PATCH] mkfs: Handle creation of filesystem larger than the first device mkfs does not properly check requested size of the filesystem. Thus if the requested size is larger than the first device, it happily creates larger filesystem than a device it resides on which results in 'attemp to access beyond end of device' messages from the kernel. So verify specified filesystem size against the size of the first device. CC: David Sterba dste...@suse.cz Signed-off-by: Jan Kara j...@suse.cz --- mkfs.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/mkfs.c b/mkfs.c index e3ced19..3afe9eb 100644 --- a/mkfs.c +++ b/mkfs.c @@ -1282,6 +1282,10 @@ int main(int ac, char **av) ret = btrfs_prepare_device(fd, file, zero_end, dev_block_count, mixed); if (block_count == 0) block_count = dev_block_count; + else if (block_count dev_block_count) { + fprintf(stderr, %s is smaller than requested size\n, file); + exit(1); + } } else { ac = 0; file = av[optind++]; -- 1.7.1
[PATCH 4/4] btrfs: Use generic handlers of O_SYNC AIO DIO
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC file. Although we use our own bio-end_io function, we call dio_end_io() from it and thus, because we don't set any specific dio-end_io function, generic code ends up calling generic_dio_end_io() which is all what we need for proper O_SYNC AIO DIO handling. Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/inode.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 32214fe..68add6e 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6221,7 +6221,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, ret = __blockdev_direct_IO(rw, iocb, inode, BTRFS_I(inode)-root-fs_info-fs_devices-latest_bdev, iov, offset, nr_segs, btrfs_get_blocks_direct, NULL, - btrfs_submit_direct, 0); + btrfs_submit_direct, DIO_SYNC_WRITES); if (ret 0 ret != -EIOCBQUEUED) { clear_extent_bit(BTRFS_I(inode)-io_tree, offset, -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/4] Generic O_SYNC AIO DIO handling
Hi Jeff, these patches implement generic way of handling O_SYNC AIO DIO. They work for all filesystems except for ext4 and xfs. Thus together with your patches, all filesystems should handle O_SYNC AIO DIO correctly. I've tested ext3, btrfs, and xfs (to check that I didn't break anything when the generic code is unused) and things seem to work fine. Will you add these patches to your series please? Thanks. Honza -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] ocfs2: Use generic handlers of O_SYNC AIO DIO
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC file. Signed-off-by: Jan Kara j...@suse.cz --- fs/ocfs2/aops.c |6 ++ 1 files changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 78b68af..3d14c2b 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -593,9 +593,7 @@ static void ocfs2_dio_end_io(struct kiocb *iocb, level = ocfs2_iocb_rw_locked_level(iocb); ocfs2_rw_unlock(inode, level); - if (is_async) - aio_complete(iocb, ret, 0); - inode_dio_done(inode); + generic_dio_end_io(iocb, offset, bytes, private, ret, is_async); } /* @@ -642,7 +640,7 @@ static ssize_t ocfs2_direct_IO(int rw, return __blockdev_direct_IO(rw, iocb, inode, inode-i_sb-s_bdev, iov, offset, nr_segs, ocfs2_direct_IO_get_blocks, - ocfs2_dio_end_io, NULL, 0); + ocfs2_dio_end_io, NULL, DIO_SYNC_WRITES); } static void ocfs2_figure_cluster_boundaries(struct ocfs2_super *osb, -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] vfs: Handle O_SYNC AIO DIO in generic code properly
Provide VFS helpers for handling O_SYNC AIO DIO writes. Filesystem wanting to use the helpers has to pass DIO_SYNC_WRITES to __blockdev_direct_IO. Then if they don't use direct IO end_io handler, generic code takes care of everything else. Otherwise their end_io handler is passed struct dio_sync_io_work pointer as 'private' argument and they have to call generic_dio_end_io() to finish their AIO DIO. Generic code then takes care to call generic_write_sync() from a workqueue context when AIO DIO is completed. Since all filesystems using blockdev_direct_IO() need O_SYNC aio dio handling and the generic one is enough for them, make blockdev_direct_IO() pass DIO_SYNC_WRITES flag. Signed-off-by: Jan Kara j...@suse.cz --- fs/direct-io.c | 128 ++-- fs/super.c |2 + include/linux/fs.h | 13 +- 3 files changed, 138 insertions(+), 5 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index 4a588db..79aa531 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -38,6 +38,8 @@ #include linux/atomic.h #include linux/prefetch.h +#include asm/cmpxchg.h + /* * How many user pages to map in one call to get_user_pages(). This determines * the size of a structure in the slab cache @@ -112,6 +114,15 @@ struct dio_submit { unsigned tail; /* last valid page + 1 */ }; +/* state needed for final sync and completion of O_SYNC AIO DIO */ +struct dio_sync_io_work { + struct kiocb *iocb; + loff_t offset; + ssize_t len; + int ret; + struct work_struct work; +}; + /* dio_state communicated between submission path and end_io */ struct dio { int flags; /* doesn't change */ @@ -134,6 +145,7 @@ struct dio { /* AIO related stuff */ struct kiocb *iocb; /* kiocb */ ssize_t result; /* IO result */ + struct dio_sync_io_work *sync_work; /* work used for O_SYNC AIO */ /* * pages[] (and any fields placed after it) are not zeroed out at @@ -261,6 +273,45 @@ static inline struct page *dio_get_page(struct dio *dio, } /** + * generic_dio_end_io() - generic dio -end_io handler + * @iocb: iocb of finishing DIO + * @offset: the byte offset in the file of the completed operation + * @bytes: length of the completed operation + * @work: work to queue for O_SYNC AIO DIO, NULL otherwise + * @ret: error code if IO failed + * @is_async: is this AIO? + * + * This is generic callback to be called when direct IO is finished. It + * handles update of number of outstanding DIOs for an inode, completion + * of async iocb and queueing of work if we need to call fsync() because + * io was O_SYNC. + */ +void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes, + struct dio_sync_io_work *work, int ret, bool is_async) +{ + struct inode *inode = iocb-ki_filp-f_dentry-d_inode; + + if (!is_async) { + inode_dio_done(inode); + return; + } + + /* +* If we need to sync file, we offload completion to workqueue +*/ + if (work) { + work-ret = ret; + work-offset = offset; + work-len = bytes; + queue_work(inode-i_sb-s_dio_flush_wq, work-work); + } else { + aio_complete(iocb, ret, 0); + inode_dio_done(inode); + } +} +EXPORT_SYMBOL(generic_dio_end_io); + +/** * dio_complete() - called when all DIO BIO I/O has been completed * @offset: the byte offset in the file of the completed operation * @@ -302,12 +353,22 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is ret = transferred; if (dio-end_io dio-result) { + void *private; + + if (dio-sync_work) + private = dio-sync_work; + else + private = dio-private; dio-end_io(dio-iocb, offset, transferred, - dio-private, ret, is_async); + private, ret, is_async); } else { - if (is_async) - aio_complete(dio-iocb, ret, 0); - inode_dio_done(dio-inode); + /* No IO submitted? Skip syncing... */ + if (!dio-result dio-sync_work) { + kfree(dio-sync_work); + dio-sync_work = NULL; + } + generic_dio_end_io(dio-iocb, offset, transferred, + dio-sync_work, ret, is_async); } return ret; @@ -1064,6 +1125,41 @@ static inline int drop_refcount(struct dio *dio) } /* + * Work performed from workqueue when AIO DIO is finished. + */ +static void dio_aio_sync_work(struct work_struct *work) +{ + struct dio_sync_io_work *sync_work = + container_of(work
[PATCH 3/4] gfs2: Use generic handlers of O_SYNC AIO DIO
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC file. Signed-off-by: Jan Kara j...@suse.cz --- fs/gfs2/aops.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c index 501e5cb..9c381ff 100644 --- a/fs/gfs2/aops.c +++ b/fs/gfs2/aops.c @@ -1034,7 +1034,7 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb, rv = __blockdev_direct_IO(rw, iocb, inode, inode-i_sb-s_bdev, iov, offset, nr_segs, gfs2_get_block_direct, - NULL, NULL, 0); + NULL, NULL, DIO_SYNC_WRITES); out: gfs2_glock_dq_m(1, gh); gfs2_holder_uninit(gh); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mkfs: Handle creation of filesystem larger than the first device
On Wed 08-02-12 17:01:15, Phillip Susi wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 1/26/2012 11:03 AM, Jan Kara wrote: make_btrfs() function takes a size of filesystem as an argument. It uses this value to set the size of the first device as well which is wrong for filesystems larger than this device. It results in 'attemp to access beyond end of device' messages from the kernel. So add size of the first device as an argument to make_btrfs(). I don't think this patch is correct. Yes, the size switch only applies to the first device, so it doesn't make any sense to try to use a value larger than that device. Attempting to do so probably should be trapped and it should error out, and the man page should probably clarify the fact that the size is only for the first device. It looks like you think the size should somehow apply to multiple devices or to the total fs size when creating on multiple devices, and that just doesn't make sense. Thanks for your reply. I admit I was not sure what exactly size argument should be. So after looking into the code for a while I figured it should be a total size of the filesystem - or differently it should be size of virtual block address space in the filesystem. Thus when filesystem has more devices (or admin wants to add more devices later), it can be larger than the first device. But I'm not really a btrfs developper so I might be wrong and of course feel free to fix the issue as you deem fit. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] mkfs: Handle creation of filesystem larger than the first device
make_btrfs() function takes a size of filesystem as an argument. It uses this value to set the size of the first device as well which is wrong for filesystems larger than this device. It results in 'attemp to access beyond end of device' messages from the kernel. So add size of the first device as an argument to make_btrfs(). CC: David Sterba dste...@suse.cz Signed-off-by: Jan Kara j...@suse.cz --- convert.c |2 +- mkfs.c|6 -- utils.c |4 ++-- utils.h |2 +- 4 files changed, 8 insertions(+), 6 deletions(-) As a side note, I'd guess that creating filesystem larger than all given devices (especially in case of single device) is usually not what sysadmin wants (we've spotted this bug when xfstest were happily creating 1 GB filesystem on 500 MB device and it took us a while to notice the problem). Thus maybe it should require some --force switch? diff --git a/convert.c b/convert.c index 291dc27..7f1932c 100644 --- a/convert.c +++ b/convert.c @@ -2374,7 +2374,7 @@ int do_convert(const char *devname, int datacsum, int packing, int noxattr) goto fail; } ret = make_btrfs(fd, devname, ext2_fs-super-s_volume_name, -blocks, total_bytes, blocksize, blocksize, +blocks, total_bytes, total_bytes, blocksize, blocksize, blocksize, blocksize); if (ret) { fprintf(stderr, unable to create initial ctree\n); diff --git a/mkfs.c b/mkfs.c index e3ced19..f0d29bb 100644 --- a/mkfs.c +++ b/mkfs.c @@ -1294,8 +1294,10 @@ int main(int ac, char **av) first_file = file; source_dir_size = size_sourcedir(source_dir, sectorsize, num_of_meta_chunks, size_of_data); - if(block_count source_dir_size) + if (block_count source_dir_size) block_count = source_dir_size; + dev_block_count = block_count; + ret = zero_output_file(fd, block_count, sectorsize); if (ret) { fprintf(stderr, unable to zero the output file\n); @@ -1321,7 +1323,7 @@ int main(int ac, char **av) leafsize * i; } - ret = make_btrfs(fd, file, label, blocks, block_count, + ret = make_btrfs(fd, file, label, blocks, block_count, dev_block_count, nodesize, leafsize, sectorsize, stripesize); if (ret) { diff --git a/utils.c b/utils.c index 178d1b9..f34da51 100644 --- a/utils.c +++ b/utils.c @@ -74,7 +74,7 @@ static u64 reference_root_table[] = { }; int make_btrfs(int fd, const char *device, const char *label, - u64 blocks[7], u64 num_bytes, u32 nodesize, + u64 blocks[7], u64 num_bytes, u64 dev_num_bytes, u32 nodesize, u32 leafsize, u32 sectorsize, u32 stripesize) { struct btrfs_super_block super; @@ -276,7 +276,7 @@ int make_btrfs(int fd, const char *device, const char *label, dev_item = btrfs_item_ptr(buf, nritems, struct btrfs_dev_item); btrfs_set_device_id(buf, dev_item, 1); btrfs_set_device_generation(buf, dev_item, 0); - btrfs_set_device_total_bytes(buf, dev_item, num_bytes); + btrfs_set_device_total_bytes(buf, dev_item, dev_num_bytes); btrfs_set_device_bytes_used(buf, dev_item, BTRFS_MKFS_SYSTEM_GROUP_SIZE); btrfs_set_device_io_align(buf, dev_item, sectorsize); diff --git a/utils.h b/utils.h index c5f55e1..bf2d5a4 100644 --- a/utils.h +++ b/utils.h @@ -22,7 +22,7 @@ #define BTRFS_MKFS_SYSTEM_GROUP_SIZE (4 * 1024 * 1024) int make_btrfs(int fd, const char *device, const char *label, - u64 blocks[6], u64 num_bytes, u32 nodesize, + u64 blocks[6], u64 num_bytes, u64 dev_num_bytes, u32 nodesize, u32 leafsize, u32 sectorsize, u32 stripesize); int btrfs_make_root_dir(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 objectid); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: Fix busyloops in transaction waiting code
wait_log_commit() and wait_for_writer() were using slightly different conditions for deciding whether they should call schedule() and whether they should continue in the wait loop. Thus it could happen that we busylooped when the first condition was not true while the second one was. That is burning CPU cycles needlessly and is deadly on UP machines... Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/tree-log.c |6 -- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index cb877e0..966cc74 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -1957,7 +1957,8 @@ static int wait_log_commit(struct btrfs_trans_handle *trans, finish_wait(root-log_commit_wait[index], wait); mutex_lock(root-log_mutex); - } while (root-log_transid transid + 2 + } while (root-fs_info-last_trans_log_full_commit != +trans-transid root-log_transid transid + 2 atomic_read(root-log_commit[index])); return 0; } @@ -1966,7 +1967,8 @@ static int wait_for_writer(struct btrfs_trans_handle *trans, struct btrfs_root *root) { DEFINE_WAIT(wait); - while (atomic_read(root-log_writers)) { + while (root-fs_info-last_trans_log_full_commit != + trans-transid atomic_read(root-log_writers)) { prepare_to_wait(root-log_writer_wait, wait, TASK_UNINTERRUPTIBLE); mutex_unlock(root-log_mutex); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: Fix busyloops in transaction waiting code
wait_log_commit() and wait_for_writer() were using slightly different conditions for deciding whether they should call schedule() and whether they should continue in the wait loop. Thus it could happen that we busylooped when the first condition was not true while the second one was. That is burning CPU cycles needlessly and is deadly on UP machines... CC: sta...@kernel.org Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/tree-log.c |6 -- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index cb877e0..966cc74 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -1957,7 +1957,8 @@ static int wait_log_commit(struct btrfs_trans_handle *trans, finish_wait(root-log_commit_wait[index], wait); mutex_lock(root-log_mutex); - } while (root-log_transid transid + 2 + } while (root-fs_info-last_trans_log_full_commit != +trans-transid root-log_transid transid + 2 atomic_read(root-log_commit[index])); return 0; } @@ -1966,7 +1967,8 @@ static int wait_for_writer(struct btrfs_trans_handle *trans, struct btrfs_root *root) { DEFINE_WAIT(wait); - while (atomic_read(root-log_writers)) { + while (root-fs_info-last_trans_log_full_commit != + trans-transid atomic_read(root-log_writers)) { prepare_to_wait(root-log_writer_wait, wait, TASK_UNINTERRUPTIBLE); mutex_unlock(root-log_mutex); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs: push file_update_time into -page_mkwrite
@@ -571,6 +571,7 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) filp-f_mapping-host-i_ino, (long long)page_offset(page)); + file_update_time(filp); /* make sure the cache has finished storing the page */ nfs_fscache_wait_on_page_write(NFS_I(dentry-d_inode), page); diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c index 2660152..d1ab350 100644 --- a/fs/nilfs2/file.c +++ b/fs/nilfs2/file.c @@ -70,6 +70,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (unlikely(nilfs_near_disk_full(inode-i_sb-s_fs_info))) return VM_FAULT_SIGBUS; /* -ENOSPC */ + file_update_time(vma-vm_file); lock_page(page); if (page-mapping != inode-i_mapping || page_offset(page) = i_size_read(inode) || !PageUptodate(page)) { diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c index 3e9393c..27c59e5 100644 --- a/fs/ocfs2/mmap.c +++ b/fs/ocfs2/mmap.c @@ -141,6 +141,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) ocfs2_block_signals(oldset); + file_update_time(vma-vm_file); /* * The cluster locks taken will block a truncate from another * node. Taking the data lock will also ensure that we don't diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c index a475983..4879bfb 100644 --- a/fs/sysfs/bin.c +++ b/fs/sysfs/bin.c @@ -225,6 +225,7 @@ static int bin_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (!sysfs_get_active(attr_sd)) return VM_FAULT_SIGBUS; + file_update_time(file); ret = 0; if (bb-vm_ops-page_mkwrite) ret = bb-vm_ops-page_mkwrite(vma, vmf); diff --git a/kernel/events/core.c b/kernel/events/core.c index 0f85778..abbaee4 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3466,6 +3466,7 @@ static int perf_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct ring_buffer *rb; int ret = VM_FAULT_SIGBUS; + file_update_time(vma-vm_file); if (vmf-flags FAULT_FLAG_MKWRITE) { if (vmf-pgoff == 0) ret = 0; diff --git a/mm/memory.c b/mm/memory.c index a56e3ba..8e4686f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2619,10 +2619,6 @@ reuse: } } - /* file_update_time outside page_lock */ - if (vma-vm_file) - file_update_time(vma-vm_file); - return ret; } @@ -3303,10 +3299,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, */ balance_dirty_pages_ratelimited(mapping); } - - /* file_update_time outside page_lock */ - if (vma-vm_file) - file_update_time(vma-vm_file); } else { unlock_page(vmf.page); if (anon) diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c index 55d92cb..e5f4be1 100644 --- a/security/selinux/selinuxfs.c +++ b/security/selinux/selinuxfs.c @@ -458,6 +458,7 @@ static int sel_mmap_policy_fault(struct vm_area_struct *vma, unsigned long offset; struct page *page; + file_update_time(vma-vm_file); if (vmf-flags (FAULT_FLAG_MKWRITE | FAULT_FLAG_WRITE)) return VM_FAULT_SIGBUS; -- 1.7.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Cluster-devel] fallocate vs O_(D)SYNC
On Wed 16-11-11 07:45:50, Christoph Hellwig wrote: On Wed, Nov 16, 2011 at 11:54:13AM +0100, Jan Kara wrote: Yeah, only that nobody calls that fsync() automatically if the fd is O_SYNC if I'm right. But maybe calling fdatasync() on the range which was fallocated from sys_fallocate() if the fd is O_SYNC would do the trick for most filesystems? That would match how we treat O_SYNC for other operations as well. I'm just not sure whether XFS wouldn't take unnecessarily big hit with this. This would work fine with XFS and be equivalent to what it does for O_DSYNC now. But I'd rather see every filesystem do the right thing and make sure the update actually is on disk when doing O_(D)SYNC operations. OK, I don't really have a strong opinion here. Are you afraid that just calling fsync() need not be enough to push all updates fallocate did to disk? Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/8] remove i_alloc_sem
On Mon 20-06-11 16:15:33, Christoph Hellwig wrote: i_alloc_sem has always been a bit of an odd lock. It's the only remaining rw_semaphore that can be released by a different thread than the one that locked it, and it's use case in the core direct I/O code is more like a counter given that the writers already have external serialization. This series removes it in favour of a simpler counter scheme, thus getting rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the same time shrinking the size of struct inode by 160 bytes on 64-bit systems. The only nasty bit is that two filesystems (fat and ext4) have started abusing the lock for their own purposes. I've added a new rw_semaphore ext4 abuse should be gone when Ted merges my rewrite of ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend it? Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fs: fix deadlocks in writeback_if_idle
On Wed 24-11-10 12:03:43, Nick Piggin wrote: For the _nr variant that btrfs uses, it's worse for the filesystems that don't have a 1:1 bdi-sb mapping. It might not actually write any of the pages from the SB that is out of space. That's true, but it might not write anything anyway (and after we check whether writeout is in progress, the writeout thread could go to sleep and not do anything anyway). So it's a pretty hacky interface anyway. If you want to do anything deterministic, you obviously need real coupling between producer and consumer. This should only be a performance tweak (or a workaround hack in worst case). Yes, the current interface is a band aid for the problem and better interface is welcome. But it's not trivial to do better... It makes no further guarantees, and anyway the sb has to compete for normal writeback within this bdi. I think Christoph is right because filesystems should not really know about how bdi writeback queueing works. But I don't know if it's worth doing anything more complex for this functionality? I think we should make a writeback_inodes_sb_unlocked() that doesn't warn when the semaphore isn't held. That should be enough given where btrfs and ext4 are calling it from. It doesn't solve the bugs -- calling and waiting for writeback is still broken because completion requires i_mutex and it is called from under i_mutex. Well, as I wrote in my previous email, only ext4 has the problem with i_mutex and I personally view it as a bug. But ultimately it's Ted's call to decide. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fs: fix deadlocks in writeback_if_idle
On Tue 23-11-10 21:11:49, Nick Piggin wrote: The issue of writeback_inodes_sb being synchronous so far as it has to wait until the work has been dequeued is another subtlety. That is a funny interface though, really. It has 3 callers, sync, quota, and ubifs. From a quick look, quota and ubifs seem to think it is some kind of synchronous writeout API. Yes, they expect it and it used to be the case before per-bdi flusher threads existed (because the function submitted IO on its own). Then it was changed to an async interface in per-bdi flusher thread patches and then back again to a synchronous one... Sad history... It also really sucks that it can get deadlocked behind an unrelated item in a workqueue. I think it should just get converted over to the async-submission style as well. Here I don't quite agree. After my patches (currently in -mm tree) all work items have reasonably well defined lifetime so no livelocks should occur. After all writeback thread is doing its best to do as much IO as possible (and hopefully saturates the storage) so given all the IO work items we cannot do much better. Where I see a room for improvement is that work items usually try to achieve a common goal - for example when we get two items write all dirty pages, we have mostly fulfilled the second one after finishing the first one but we start from the beginning when processing the second request. But it seems non-trivial to do this request merging especially for different types of requests... Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix up lock order reversal in writeback
On Tue 23-11-10 19:07:58, Nick Piggin wrote: On Mon, Nov 22, 2010 at 07:16:55PM +0100, Jan Kara wrote: On Fri 19-11-10 16:16:19, Nick Piggin wrote: On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote: On Wed 17-11-10 22:28:34, Andrew Morton wrote: The fact that a call to -write_begin can randomly return with s_umount held, to be randomly released at some random time in the future is a bit ugly, isn't it? write_begin is a pretty low-level, per-inode thing. I guess you missed that writeback_inodes_sb_nr() (called from _if_idle variants) does: bdi_queue_work(sb-s_bdi, work); wait_for_completion(done); So we return only after all the IO has been submitted and unlock s_umount in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves because we are holding i_mutex and we need to get and put references to other inodes while doing writeback (those would be really horrible lock dependencies - writeback thread can put the last reference to an unlinked inode...). But if we're waiting for it, with the lock held, then surely it can deadlock just the same as if we submit it ourself? Yes, that's what I realized as well and what needs fixing. It was an unintentional consequence of locking changes Christoph did to the writeback code to fix other races. BTW. are you taking i_mutex inside writeback? I mutex can be held while entering page reclaim, and thus writepage... so it could be a bug too. No, writeback does not need i_mutex. I did in fact see i_mutex from writeback, which is how the lock order reversal was noticed in the first place. I haven't had much luck in reproducing it yet. It did come from end_io_work, I believe. There is another deadlock in here, by the looks (although this is not the one which triggers lockdep -- the workqueue coupling means it defeats lockdep detection). Buffered write holds i_lock, and waits for inode writeback submission, but the work queue seems to be blocked on i_mutex (ext4_end_io_work), and so it deadlocks. Ah, I see. That's a new thing introduced by Ted's rewrite of ext4_writepages() - bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc. And indeed it introduced a deadlock as you describe... Honza Note that this is an AA deadlock, different from ABBA one relating to s_umount lock (but very similar call chains IIRC). [ 748.406457] SysRq : Show Blocked State [ 748.406685] taskPC stack pid father [ 748.406868] kworker/9:1 D 6296 118 2 0x [ 748.407173] 88012acb3c90 0046 88012b1cec80 [ 748.407686] 0002 88012acb2000 88012acb2000 d6c8 [ 748.408200] 88012acb2010 88012acb3fd8 880129c7dd00 88012b1cec80 [ 748.408711] Call Trace: [ 748.408866] [81039431] ? get_parent_ip+0x11/0x50 [ 748.409033] [8160145c] __mutex_lock_common+0x18c/0x480 [ 748.409205] [a0042147] ? ext4_end_io_work+0x37/0xb0 [ext4] [ 748.409379] [a0042147] ? ext4_end_io_work+0x37/0xb0 [ext4] [ 748.409546] [8160182e] mutex_lock_nested+0x3e/0x50 [ 748.409717] [a0042147] ext4_end_io_work+0x37/0xb0 [ext4] [ 748.409883] [81068378] process_one_work+0x1b8/0x5a0 [ 748.410046] [8106830e] ? process_one_work+0x14e/0x5a0 [ 748.410219] [a0042110] ? ext4_end_io_work+0x0/0xb0 [ext4] [ 748.410386] [81069675] worker_thread+0x175/0x3a0 [ 748.410549] [81069500] ? worker_thread+0x0/0x3a0 [ 748.410712] [8106e246] kthread+0x96/0xa0 [ 748.410874] [81003ed4] kernel_thread_helper+0x4/0x10 [ 748.411038] [81039878] ? finish_task_switch+0x78/0x110 [ 748.411202] [816036c0] ? restore_args+0x0/0x30 [ 748.411364] [8106e1b0] ? kthread+0x0/0xa0 [ 748.411524] [81003ed0] ? kernel_thread_helper+0x0/0x10 [ 748.606853] dbenchD 2872 2916 1 0x0004 [ 748.607154] 880129ec58b8 0046 880129ec5838 810806fd [ 748.607665] 880129ec5898 880129ec4000 880129ec4000 d6c8 [ 748.608176] 880129ec4010 880129ec5fd8 88010a2d 88011ff69f00 [ 748.608686] Call Trace: [ 748.608837] [810806fd] ? trace_hardirqs_off+0xd/0x10 [ 748.609001] [81600bb5] schedule_timeout+0x1f5/0x350 [ 748.609164] [8108380b] ? mark_held_locks+0x6b/0xa0 [ 748.609328] [8160314b] ? _raw_spin_unlock_irq+0x2b/0x60 [ 748.609493] [81039431] ? get_parent_ip+0x11/0x50 [ 748.609656] [81606c3d] ? sub_preempt_count+0x9d/0xd0 [ 748.609820] [815ffacd] wait_for_common+0x10d/0x190 [ 748.609984] [810426e0] ? default_wake_function
Re: [patch] fix up lock order reversal in writeback
On Fri 19-11-10 16:16:19, Nick Piggin wrote: On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote: On Wed 17-11-10 22:28:34, Andrew Morton wrote: The fact that a call to -write_begin can randomly return with s_umount held, to be randomly released at some random time in the future is a bit ugly, isn't it? write_begin is a pretty low-level, per-inode thing. I guess you missed that writeback_inodes_sb_nr() (called from _if_idle variants) does: bdi_queue_work(sb-s_bdi, work); wait_for_completion(done); So we return only after all the IO has been submitted and unlock s_umount in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves because we are holding i_mutex and we need to get and put references to other inodes while doing writeback (those would be really horrible lock dependencies - writeback thread can put the last reference to an unlinked inode...). But if we're waiting for it, with the lock held, then surely it can deadlock just the same as if we submit it ourself? Yes, that's what I realized as well and what needs fixing. It was an unintentional consequence of locking changes Christoph did to the writeback code to fix other races. BTW. are you taking i_mutex inside writeback? I mutex can be held while entering page reclaim, and thus writepage... so it could be a bug too. No, writeback does not need i_mutex. In fact, as I'm speaking about it, pushing things to writeback thread and waiting on the work does not help a bit with the locking issues (we didn't wait for the work previously but that had other issues). Bug, sigh. What might be better interface for usecases like above is to allow filesystem to kick flusher thread to start doing background writeback (regardless of dirty limits). Then the caller can wait for some delayed allocation reservations to get freed (easy enough to check in -writepage() and wake the waiters) - possibly with a reasonable timeout so that we don't stall forever. We really need to throttle the producer without any locks held, no? So the filesystem would like to to hook into dirtying path somewhere without i_mutex held (and without implementing your own .write). Eg. around the dirty throttling calls somewhere I suppose. Well, the particular case where we need to do delayed allocation but the filesystem has already all blocks reserved is hard to detect without any locks. At least we need some locks to find out how many blocks do we need to reserve for that write(). So for filesystems it would solve the locking issues if we could bail out of write() path up to the point where we hold no locks, wait there / do necessary writeback and then restart the write... Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] fs: add hole punching to fallocate
On Wed 17-11-10 20:46:15, Josef Bacik wrote: Hole punching has already been implemented by XFS and OCFS2, and has the potential to be implemented on both BTRFS and EXT4 so we need a generic way to get to this feature. The simplest way in my mind is to add FALLOC_FL_PUNCH_HOLE to fallocate() since it already looks like the normal fallocate() operation. I've tested this patch with XFS and BTRFS to make sure XFS did what it's supposed to do and that BTRFS failed like it was supposed to. Thank you, Looks nice now. Acked-by: Jan Kara j...@suse.cz Honza Signed-off-by: Josef Bacik jo...@redhat.com --- fs/open.c |7 ++- include/linux/falloc.h |1 + 2 files changed, 7 insertions(+), 1 deletions(-) diff --git a/fs/open.c b/fs/open.c index 4197b9e..5b6ef7e 100644 --- a/fs/open.c +++ b/fs/open.c @@ -223,7 +223,12 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len) return -EINVAL; /* Return error if mode is not supported */ - if (mode !(mode FALLOC_FL_KEEP_SIZE)) + if (mode ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) + return -EOPNOTSUPP; + + /* Punch hole must have keep size set */ + if ((mode FALLOC_FL_PUNCH_HOLE) + !(mode FALLOC_FL_KEEP_SIZE)) return -EOPNOTSUPP; if (!(file-f_mode FMODE_WRITE)) diff --git a/include/linux/falloc.h b/include/linux/falloc.h index 3c15510..73e0b62 100644 --- a/include/linux/falloc.h +++ b/include/linux/falloc.h @@ -2,6 +2,7 @@ #define _FALLOC_H_ #define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ +#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */ #ifdef __KERNEL__ -- 1.6.6.1 -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix up lock order reversal in writeback
On Wed 17-11-10 22:28:34, Andrew Morton wrote: I'm not sure that s_umount versus i_mutex has come up before. Logically I'd expect i_mutex to nest inside s_umount. Because s_umount is a per-superblock thing, and i_mutex is a per-file thing, and files live under superblocks. Nesting s_umount outside i_mutex creates complex deadlock graphs between the various i_mutexes, I think. Someone tell me if btrfs has the same bug, via its call to writeback_inodes_sb_nr_if_idle()? I don't see why these functions need s_umount at all, if they're called from within -write_begin against an inode on that superblock. If the superblock can get itself disappeared while we're running -write_begin on it, we have problems, no? As I wrote to Chris, the function just needs exclusion from umount / remount happening (and want to stop umount from returning EBUSY when writeback thread is writing something out). When the function is called from -write_begin this is no issue as you properly noted so s_umount is not needed in that particular case. In which case I'd suggest just removing the down_read(s_umount) and specifying that the caller must pin the superblock via some means. Possibly, but currently the advantage is that we can have WARN_ON in the writeback code that complains if someone starts writeback without properly pinned superblock and we cannot easily do that with your general rule. I'm not saying that should stop us from changing the rule but it was kind of nice. Only we can't do that because we need to hold s_umount until the bdi_queue_work() worker has done its work. The fact that a call to -write_begin can randomly return with s_umount held, to be randomly released at some random time in the future is a bit ugly, isn't it? write_begin is a pretty low-level, per-inode thing. I guess you missed that writeback_inodes_sb_nr() (called from _if_idle variants) does: bdi_queue_work(sb-s_bdi, work); wait_for_completion(done); So we return only after all the IO has been submitted and unlock s_umount in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves because we are holding i_mutex and we need to get and put references to other inodes while doing writeback (those would be really horrible lock dependencies - writeback thread can put the last reference to an unlinked inode...). In fact, as I'm speaking about it, pushing things to writeback thread and waiting on the work does not help a bit with the locking issues (we didn't wait for the work previously but that had other issues). Bug, sigh. What might be better interface for usecases like above is to allow filesystem to kick flusher thread to start doing background writeback (regardless of dirty limits). Then the caller can wait for some delayed allocation reservations to get freed (easy enough to check in -writepage() and wake the waiters) - possibly with a reasonable timeout so that we don't stall forever. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] fs: add hole punching to fallocate
On Tue 16-11-10 12:16:11, Jan Kara wrote: On Mon 15-11-10 12:05:18, Josef Bacik wrote: diff --git a/fs/open.c b/fs/open.c index 4197b9e..ab8dedf 100644 --- a/fs/open.c +++ b/fs/open.c @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len) return -EINVAL; /* Return error if mode is not supported */ - if (mode !(mode FALLOC_FL_KEEP_SIZE)) + if (mode (mode ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))) Why not just: if (mode ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ? And BTW, since FALLOC_FL_PUNCH_HOLE does not change the file size, should not we enforce that FALLOC_FL_KEEP_SIZE is / is not set? I don't mind too much which way but keeping it ambiguous (ignored) in the interface usually proves as a bad idea in future when we want to further extend the interface... Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] Ocfs2: handle hole punching via fallocate properly
On Mon 15-11-10 12:05:20, Josef Bacik wrote: This patch just makes ocfs2 use its UNRESERVP ioctl when we get the hole punch flag in fallocate. I didn't test it, but it seems simple enough. Thanks, Signed-off-by: Josef Bacik jo...@redhat.com You might want to directly CC Joel Becker joel.bec...@oracle.com who maintains OCFS2. Otherwise the patch looks OK so you can add Acked-by: Jan Kara j...@suse.cz for what it's worth ;). Honza --- fs/ocfs2/file.c | 10 -- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 77b4c04..181ae52 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -1992,6 +1992,7 @@ static long ocfs2_fallocate(struct inode *inode, int mode, loff_t offset, struct ocfs2_super *osb = OCFS2_SB(inode-i_sb); struct ocfs2_space_resv sr; int change_size = 1; + int cmd = OCFS2_IOC_RESVSP64; if (!ocfs2_writes_unwritten_extents(osb)) return -EOPNOTSUPP; @@ -2002,12 +2003,17 @@ static long ocfs2_fallocate(struct inode *inode, int mode, loff_t offset, if (mode FALLOC_FL_KEEP_SIZE) change_size = 0; + if (mode FALLOC_FL_PUNCH_HOLE) { + cmd = OCFS2_IOC_UNRESVSP64; + change_size = 0; + } + sr.l_whence = 0; sr.l_start = (s64)offset; sr.l_len = (s64)len; - return __ocfs2_change_file_space(NULL, inode, offset, - OCFS2_IOC_RESVSP64, sr, change_size); + return __ocfs2_change_file_space(NULL, inode, offset, cmd, sr, + change_size); } int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos, -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] Ext4: fail if we try to use hole punch
On Mon 15-11-10 12:05:21, Josef Bacik wrote: Ext4 doesn't have the ability to punch holes yet, so make sure we return EOPNOTSUPP if we try to use hole punching through fallocate. This support can be added later. Thanks, Signed-off-by: Josef Bacik jo...@redhat.com Acked-by: Jan Kara j...@suse.cz Honza --- fs/ext4/extents.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 0554c48..35bca73 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -3622,6 +3622,10 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) struct ext4_map_blocks map; unsigned int credits, blkbits = inode-i_blkbits; + /* We only support the FALLOC_FL_KEEP_SIZE mode */ + if (mode (mode != FALLOC_FL_KEEP_SIZE)) + return -EOPNOTSUPP; + /* * currently supporting (pre)allocate mode for extent-based * files _only_ -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix up lock order reversal in writeback
On Tue 16-11-10 22:00:58, Nick Piggin wrote: I saw a lock order warning on ext4 trigger. This should solve it. Raciness shouldn't matter much, because writeback can stop just after we make the test and return anyway (so the API is racy anyway). Hmm, for now the fix is OK. Ultimately, we probably want to call writeback_inodes_sb() directly from all the callers. They all just want to reduce uncertainty of delayed allocation reservations by writing delayed data and actually wait for some of the writeback to happen before they retry again the allocation. Although the callers generally cannot get umount_sem because they hold other locks, they have the superblock well pinned so grabbing umount_sem makes sense mostly to make assertions happy. But as I'm thinking about it, trylock *is* maybe the right answer to this anyway... So Acked-by: Jan Kara j...@suse.cz Honza Signed-off-by: Nick Piggin npig...@kernel.dk Index: linux-2.6/fs/fs-writeback.c === --- linux-2.6.orig/fs/fs-writeback.c 2010-11-16 21:44:32.0 +1100 +++ linux-2.6/fs/fs-writeback.c 2010-11-16 21:49:37.0 +1100 @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb); * * Invoke writeback_inodes_sb if no writeback is currently underway. * Returns 1 if writeback was started, 0 if not. + * + * May be called inside i_lock. May not start writeback if locks cannot + * be acquired. */ int writeback_inodes_sb_if_idle(struct super_block *sb) { if (!writeback_in_progress(sb-s_bdi)) { - down_read(sb-s_umount); - writeback_inodes_sb(sb); - up_read(sb-s_umount); - return 1; - } else - return 0; + if (down_read_trylock(sb-s_umount)) { + writeback_inodes_sb(sb); + up_read(sb-s_umount); + return 1; + } + } + return 0; } EXPORT_SYMBOL(writeback_inodes_sb_if_idle); @@ -1145,17 +1149,21 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl * * Invoke writeback_inodes_sb if no writeback is currently underway. * Returns 1 if writeback was started, 0 if not. + * + * May be called inside i_lock. May not start writeback if locks cannot + * be acquired. */ int writeback_inodes_sb_nr_if_idle(struct super_block *sb, unsigned long nr) { if (!writeback_in_progress(sb-s_bdi)) { - down_read(sb-s_umount); - writeback_inodes_sb_nr(sb, nr); - up_read(sb-s_umount); - return 1; - } else - return 0; + if (down_read_trylock(sb-s_umount)) { + writeback_inodes_sb_nr(sb, nr); + up_read(sb-s_umount); + return 1; + } + } + return 0; } EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle); -- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] fs: add hole punching to fallocate
On Tue 09-11-10 16:41:47, Ted Ts'o wrote: On Tue, Nov 09, 2010 at 03:42:42PM +1100, Dave Chinner wrote: Implementation is up to the filesystem. However, XFS does (b) because: 1) it was extremely simple to implement (one of the advantages of having an exceedingly complex allocation interface to begin with :P) 2) conversion is atomic, fast and reliable 3) it is independent of the underlying storage; and 4) reads of unwritten extents operate at memory speed, not disk speed. Yeah, I was thinking that using a device-style TRIM might be better since future attempts to write to it won't require a separate seek to modify the extent tree. But yeah, there are a bunch of advantages of simply mutating the extent tree. While we're on the subject of changes to fallocate, what do people think of FALLOC_FL_EXPOSE_OLD_DATA, which requires either root privileges or (if capabilities are in use) CAP_DAC_OVERRIDE CAP_MAC_OVERRIDE CAP_SYS_ADMIN. This would allow a trusted process to fallocate blocks with the extent already marked initialized. I've had two requests for such functionality for ext4 already. (Take for example a trusted cluster filesystem backend that checks the object checksum before returning any data to the user; and if the check fails the cluster file system will try to use some other replica stored on some other server.) Hum, could you elaborate a bit? I fail to see how above fallocate() flag could be used to help solving this problem... Just curious... Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dirtiable inode bdi default != sb bdi btrfs
On Wed 29-09-10 10:19:36, Christoph Hellwig wrote: --- From: Christoph Hellwig h...@lst.de Subject: [PATCH] writeback: always use sb-s_bdi for writeback purposes ... The one exception for now is the block device filesystem which really wants different writeback contexts for it's different (internal) inodes to handle the writeout more efficiently. For now we do this with a hack in fs-writeback.c because we're so late in the cycle, but in the future I plan to replace this with a superblock method that allows for multiple writeback contexts per filesystem. Another exception I know about is mtd_inodefs filesystem (drivers/mtd/mtdchar.c). Signed-off-by: Christoph Hellwig h...@lst.de Index: linux-2.6/fs/fs-writeback.c === --- linux-2.6.orig/fs/fs-writeback.c 2010-09-29 16:58:41.750557721 +0900 +++ linux-2.6/fs/fs-writeback.c 2010-09-29 17:11:35.040557719 +0900 @@ -72,22 +72,10 @@ int writeback_in_progress(struct backing static inline struct backing_dev_info *inode_to_bdi(struct inode *inode) { struct super_block *sb = inode-i_sb; - struct backing_dev_info *bdi = inode-i_mapping-backing_dev_info; - /* - * For inodes on standard filesystems, we use superblock's bdi. For - * inodes on virtual filesystems, we want to use inode mapping's bdi - * because they can possibly point to something useful (think about - * block_dev filesystem). - */ - if (sb-s_bdi sb-s_bdi != noop_backing_dev_info) { - /* Some device inodes could play dirty tricks. Catch them... */ - WARN(bdi != sb-s_bdi bdi_cap_writeback_dirty(bdi), - Dirtiable inode bdi %s != sb bdi %s\n, - bdi-name, sb-s_bdi-name); - return sb-s_bdi; - } - return bdi; + if (strcmp(sb-s_type-name, bdev) == 0) + return inode-i_mapping-backing_dev_info; + return sb-s_bdi; So at least here you'd need also add a similar exception for mtd_inodefs. Because of these exeptions I've chosen the (sb-s_bdi sb-s_bdi != noop_backing_dev_info) check rather than your exception based check. All in all I don't care much what ends up in the kernel as it's just a temporary solution... Also I've added the warning to catch situations where inodes would get filed to a different backing device after the patch. So far the reported warnings were harmless but still I'm more comfortable when it's there because otherwise we can so easily miss some device-driver-invented filesystem like mtd_inodefs which would break silently after the change... Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dirtiable inode bdi default != sb bdi btrfs
On Mon 27-09-10 20:55:45, Cesar Eduardo Barros wrote: Em 27-09-2010 19:25, Jan Kara escreveu: [Added CCs for similar ecryptfs warning] On Thu 23-09-10 12:38:49, Andrew Morton wrote: [...] device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 /dev/mapper/vg_cesarbinspiro-lv_home SELinux: initialized (dev dm-3, type btrfs), uses xattr [ cut here ] WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d() Hardware name: Inspiron N4010 Dirtiable inode bdi default != sb bdi btrfs That suggests that we should probably handle such cases in a more generic way by changing the code in inode_init_always(). The patch below makes at least btrfs happy for me... Could you maybe test it? Thanks. Applied on top of v2.6.36-rc5-151-g32163f4, running it right now. The warning messages no longer happen, and everything seems to be working fine. Tested-by: Cesar Eduardo Barros ces...@cesarb.net Great, thanks for testing. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dirtiable inode bdi default != sb bdi btrfs
On Wed 29-09-10 16:10:06, Christoph Hellwig wrote: On Wed, Sep 29, 2010 at 02:18:08PM +0200, Jan Kara wrote: On Wed 29-09-10 10:19:36, Christoph Hellwig wrote: --- From: Christoph Hellwig h...@lst.de Subject: [PATCH] writeback: always use sb-s_bdi for writeback purposes ... The one exception for now is the block device filesystem which really wants different writeback contexts for it's different (internal) inodes to handle the writeout more efficiently. For now we do this with a hack in fs-writeback.c because we're so late in the cycle, but in the future I plan to replace this with a superblock method that allows for multiple writeback contexts per filesystem. Another exception I know about is mtd_inodefs filesystem (drivers/mtd/mtdchar.c). No, it's not. MTD only has three different backing_dev_info instances which have different flags in the mapping-relevant portion of the backing_dev. In the end I agree I was probably wrong but it's not that simple ;) So at least here you'd need also add a similar exception for mtd_inodefs. No. For one thing we don't need any exception for correctnes alone - even the block device variant would work fine with the default case. Here I don't agree. If you don't have some kind of exception, sb-s_bdi for both block and mtd_inodefs filesystems points to noop_backing_dev_info and you get no writeback for that one. So it isn't just a performance issue but also a correctness one. Regarding mtd_inodefs I now looked in more detail what MTD actually does and it seems to me that MTD device inodes do not seem to carry any cached state that flusher threads could write back. So returning noop_backing_dev_info might be the right thing for them after all... (added David Woodhouse and MTD list to CC so that they can shout if it's not the case). Coming to this conclusion, I'm happy with your patch going in as is... Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dirtiable inode bdi default != sb bdi btrfs
[Added CCs for similar ecryptfs warning] On Thu 23-09-10 12:38:49, Andrew Morton wrote: This started appearing for me on v2.6.36-rc5-49-gc79bd89; it did not happen on v2.6.36-rc5-33-g1ce1e41, probably because it does not have commit 692ebd17c2905313fff3c504c249c6a0faad16ec which introduces the warning. [...] device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 /dev/mapper/vg_cesarbinspiro-lv_home SELinux: initialized (dev dm-3, type btrfs), uses xattr [ cut here ] WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d() Hardware name: Inspiron N4010 Dirtiable inode bdi default != sb bdi btrfs Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop snd pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore snd_page_alloc rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan] Pid: 1073, comm: find Not tainted 2.6.36-rc5+ #8 Call Trace: [8104d0e4] warn_slowpath_common+0x85/0x9d [8104d19f] warn_slowpath_fmt+0x46/0x48 [811308b7] inode_to_bdi+0x62/0x6d [81131b48] __mark_inode_dirty+0xd0/0x177 [81127168] touch_atime+0x107/0x12a [81122384] ? filldir+0x0/0xd0 [8112259b] vfs_readdir+0x8d/0xb4 [8112270b] sys_getdents+0x81/0xd1 [81009c72] system_call_fastpath+0x16/0x1b Thanks for the report. These bdi pointers are a mess. As Chris pointed out, btrfs forgets to properly initialize inode-i_mapping.backing_dev_info for directories and special inodes and thus these were previously attached to default_backing_dev_info which probably isn't what Chris would like to see. I've also got a similar report for ecryptfs which also does not initialize inode-i_mapping.backing_dev_info although it sets sb-s_bdi and thus again its inodes get filed to default_backing_dev_info lists. Quick search seems to reveal that other filesystems using handcrafted bdi's get it wrong as well and thus their inodes end up in the default_backing_dev_info lists which is generally undesirable (this was happening already before my patch, my code just started complaining about that). That suggests that we should probably handle such cases in a more generic way by changing the code in inode_init_always(). The patch below makes at least btrfs happy for me... Could you maybe test it? Thanks. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR --- From 29f60c2b08ff9637a10439d1513805835ddcc746 Mon Sep 17 00:00:00 2001 From: Jan Kara j...@suse.cz Date: Mon, 27 Sep 2010 23:56:48 +0200 Subject: [PATCH] bdi: Initialize inode-i_mapping.backing_dev_info to sb-s_bdi Currently, we initialize inode-i_mapping.backing_dev_info to the bdi of device sb-s_bdev points to. However there is quite a big number of filesystems that do not set sb-s_bdev (because they do not have one) but do set sb-s_bdi. These filesystems would generally benefit from setting inode-i_mapping.backing_dev_info to their s_bdi because otherwise their inodes would point to default_backing_dev_info and thus dirty inode tracking would happen there. So change inode initialization code to use sb-s_bdi if it is available. Signed-off-by: Jan Kara j...@suse.cz --- fs/inode.c | 22 ++ 1 files changed, 14 insertions(+), 8 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 8646433..e415be4 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -172,15 +172,21 @@ int inode_init_always(struct super_block *sb, struct inode *inode) mapping-writeback_index = 0; /* -* If the block_device provides a backing_dev_info for client -* inodes then use that. Otherwise the inode share the bdev's -* backing_dev_info. +* If the filesystem provides a backing_dev_info for client inodes +* then use that. Otherwise inodes share default_backing_dev_info. */ - if (sb-s_bdev) { - struct backing_dev_info *bdi; - - bdi = sb-s_bdev-bd_inode-i_mapping-backing_dev_info; - mapping-backing_dev_info = bdi; + if (sb-s_bdi sb-s_bdi != noop_backing_dev_info) { + /* +* Catch cases where filesystem might be bitten by using s_bdi +* instead of sb-s_bdev. Can be removed in 2.6.38. +*/ + if (sb-s_bdev) { + struct backing_dev_info *bdi = + sb-s_bdev-bd_inode-i_mapping-backing_dev_info; + WARN
Re: Dirtiable inode bdi default != sb bdi btrfs
On Mon 27-09-10 18:54:52, Chris Mason wrote: On Tue, Sep 28, 2010 at 12:25:48AM +0200, Jan Kara wrote: [Added CCs for similar ecryptfs warning] On Thu 23-09-10 12:38:49, Andrew Morton wrote: This started appearing for me on v2.6.36-rc5-49-gc79bd89; it did not happen on v2.6.36-rc5-33-g1ce1e41, probably because it does not have commit 692ebd17c2905313fff3c504c249c6a0faad16ec which introduces the warning. [...] device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 /dev/mapper/vg_cesarbinspiro-lv_home SELinux: initialized (dev dm-3, type btrfs), uses xattr [ cut here ] WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d() Hardware name: Inspiron N4010 Dirtiable inode bdi default != sb bdi btrfs Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop snd pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore snd_page_alloc rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan] Pid: 1073, comm: find Not tainted 2.6.36-rc5+ #8 Call Trace: [8104d0e4] warn_slowpath_common+0x85/0x9d [8104d19f] warn_slowpath_fmt+0x46/0x48 [811308b7] inode_to_bdi+0x62/0x6d [81131b48] __mark_inode_dirty+0xd0/0x177 [81127168] touch_atime+0x107/0x12a [81122384] ? filldir+0x0/0xd0 [8112259b] vfs_readdir+0x8d/0xb4 [8112270b] sys_getdents+0x81/0xd1 [81009c72] system_call_fastpath+0x16/0x1b Thanks for the report. These bdi pointers are a mess. As Chris pointed out, btrfs forgets to properly initialize inode-i_mapping.backing_dev_info for directories and special inodes and thus these were previously attached to default_backing_dev_info which probably isn't what Chris would like to see. There's no actual writeback for these, so it works fine for btrfs either way. Ah, OK. I see you do all the writing already at inode dirtying time so you don't really care what happens after the dirtying for inodes without page cache. So it's really just a cosmetic issue for btrfs. I've also got a similar report for ecryptfs which also does not initialize inode-i_mapping.backing_dev_info although it sets sb-s_bdi and thus again its inodes get filed to default_backing_dev_info lists. Quick search seems to reveal that other filesystems using handcrafted bdi's get it wrong as well and thus their inodes end up in the default_backing_dev_info lists which is generally undesirable (this was happening already before my patch, my code just started complaining about that). That suggests that we should probably handle such cases in a more generic way by changing the code in inode_init_always(). The patch below makes at least btrfs happy for me... Could you maybe test it? Thanks. Christoph had a slightly different plan for this, I've cc'd him and kept the patch below for comment. OK, thanks. Honza --- From 29f60c2b08ff9637a10439d1513805835ddcc746 Mon Sep 17 00:00:00 2001 From: Jan Kara j...@suse.cz Date: Mon, 27 Sep 2010 23:56:48 +0200 Subject: [PATCH] bdi: Initialize inode-i_mapping.backing_dev_info to sb-s_bdi Currently, we initialize inode-i_mapping.backing_dev_info to the bdi of device sb-s_bdev points to. However there is quite a big number of filesystems that do not set sb-s_bdev (because they do not have one) but do set sb-s_bdi. These filesystems would generally benefit from setting inode-i_mapping.backing_dev_info to their s_bdi because otherwise their inodes would point to default_backing_dev_info and thus dirty inode tracking would happen there. So change inode initialization code to use sb-s_bdi if it is available. Signed-off-by: Jan Kara j...@suse.cz --- fs/inode.c | 22 ++ 1 files changed, 14 insertions(+), 8 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 8646433..e415be4 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -172,15 +172,21 @@ int inode_init_always(struct super_block *sb, struct inode *inode) mapping-writeback_index = 0; /* -* If the block_device provides a backing_dev_info for client -* inodes then use that. Otherwise the inode share the bdev's -* backing_dev_info. +* If the filesystem provides a backing_dev_info for client inodes +* then use that. Otherwise inodes share default_backing_dev_info
Re: [patch v2 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc
On Thu 02-09-10 09:59:13, Jiri Slaby wrote: On 09/02/2010 03:02 AM, David Rientjes wrote: --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -334,6 +334,57 @@ static inline void *kzalloc_node(size_t size, gfp_t flags, int node) return kmalloc_node(size, flags | __GFP_ZERO, node); } +/** + * kmalloc_nofail - infinitely loop until kmalloc() succeeds. + * @size: how many bytes of memory are required. + * @flags: the type of memory to allocate (see kmalloc). + * + * NOTE: no new callers of this function should be implemented! + * All memory allocations should be failable whenever possible. + */ +static inline void *kmalloc_nofail(size_t size, gfp_t flags) +{ + void *ret; + + for (;;) { +ret = kmalloc(size, flags); + if (ret) + return ret; + WARN_ON_ONCE(get_order(size) PAGE_ALLOC_COSTLY_ORDER); This doesn't work as you expect. kmalloc will warn every time it fails. __GFP_NOFAIL used to disable the warning. Actually what's wrong with __GFP_NOFAIL? I cannot find a reason in the changelogs why the patches are needed. David should probably add the reasoning to the changelogs so that he doesn't have to explain again and again ;). But if I understood it correctly, the concern is that the looping checks slightly impact fast path of the callers which do not need it. Generally, also looping for a long time inside allocator isn't a nice thing but some callers aren't able to do better for now to the patch is imperfect in this sence... Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Usefulness of max_transid in the tree search?
Hi, I just had a look at how btrfs ioctl searches the tree for new items and while looking at the code I've noticed that maximum transaction id as specified in the search is rather useless. If I understand right, transaction id is stored in a header of each node. Thus when any item of the node gets modified, new transaction id gets recorded. So when you give the search some maximum transaction id, you will not receive items in nodes where some item has been modified in a later transaction. That seems like a rather useless set of items to me... Or am I missing something? Honza -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: Do not return more items that user asked from from search ioctl
While searching a tree we didn't properly check number of items we really stored in user's buffer thus possibly exceeding number of items requested by user. This was mostly harmless since actual buffer overflow is checked correctly in a different place. Anyway, let's fix the check. Signed-off-by: Jan Kara j...@suse.cz --- fs/btrfs/ioctl.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9254b3d..94e7ab5 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -977,7 +977,7 @@ static noinline int copy_to_sk(struct btrfs_root *root, } found++; - if (*num_found = sk-nr_items) + if (*num_found + found = sk-nr_items) break; } advance_key: -- 1.6.4.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html