[PATCH 24/27] btrfs: Convert to new freezing mechanism

2012-06-01 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.

Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason chris.ma...@oracle.com
Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/disk-io.c |3 ---
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |6 +-
 fs/btrfs/transaction.c |7 +++
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7ae51de..663f3a0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1532,8 +1532,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE);
-
if (!(root-fs_info-sb-s_flags  MS_RDONLY) 
mutex_trylock(root-fs_info-cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
@@ -1565,7 +1563,6 @@ static int transaction_kthread(void *arg)
do {
cannot_commit = false;
delay = HZ * 30;
-   vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE);
mutex_lock(root-fs_info-transaction_kthread_mutex);
 
spin_lock(root-fs_info-trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 876cddd..7f3d7fe 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1392,7 +1392,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;
 
-   vfs_check_frozen(inode-i_sb, SB_FREEZE_WRITE);
+   sb_start_write(inode-i_sb);
 
mutex_lock(inode-i_mutex);
 
@@ -1482,6 +1482,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
num_written = err;
}
 out:
+   sb_end_write(inode-i_sb);
current-backing_dev_info = NULL;
return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e9991ad..54e5378 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6563,6 +6563,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
u64 page_start;
u64 page_end;
 
+   sb_start_pagefault(inode-i_sb);
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
ret = btrfs_update_time(vma-vm_file);
@@ -6652,12 +6653,15 @@ again:
unlock_extent_cached(io_tree, page_start, page_end, cached_state, 
GFP_NOFS);
 
 out_unlock:
-   if (!ret)
+   if (!ret) {
+   sb_end_pagefault(inode-i_sb);
return VM_FAULT_LOCKED;
+   }
unlock_page(page);
 out:
btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+   sb_end_pagefault(inode-i_sb);
return ret;
 }
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1791c6e..05d8c29 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -325,6 +325,8 @@ again:
if (!h)
return ERR_PTR(-ENOMEM);
 
+   sb_start_intwrite(root-fs_info-sb);
+
if (may_wait_transaction(root, type))
wait_current_trans(root);
 
@@ -335,6 +337,7 @@ again:
} while (ret == -EBUSY);
 
if (ret  0) {
+   sb_end_intwrite(root-fs_info-sb);
kmem_cache_free(btrfs_trans_handle_cachep, h);
return ERR_PTR(ret);
}
@@ -524,6 +527,8 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
count++;
}
 
+   sb_end_intwrite(root-fs_info-sb);
+
if (lock  !atomic_read(root-fs_info-open_ioctl_trans) 
should_end_transaction(trans, root)) {
trans-transaction-blocked = 1;
@@ -1507,6 +1512,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
put_transaction(cur_trans);
put_transaction(cur_trans);
 
+   sb_end_intwrite(root-fs_info-sb);
+
trace_btrfs_transaction_commit(root);
 
btrfs_scrub_continue(root);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/27 v6] Fix filesystem freezing deadlocks

2012-06-01 Thread Jan Kara
  Hello,

  here is the sixth iteration of my patches to improve filesystem freezing.
The change since last iteration is that filesystem can be frozen with open but
unlinked files. After some thinking, I've decided that the best way to handle
this is to block removal inside -evict_inode() of each filesystem and use
fs-internal level of freeze protection for that (usually I've instrumented
filesystem's transaction system to use freeze protection). Handling
inside VFS would be less work but the only level of freeze protection that
has a chance of not causing deadlocks is the one used for page faults and even
there it's not clear lock ordering would be correct wrt some fs-specific locks.
I've converted ext2, ext4, btrfs, xfs, nilfs2, ocfs2, gfs2 and also checked
that ext3, reiserfs, jfs should work as well (they have their internal freeze
protection mechanisms, possibly they could be replaced by a generic one but
given these are mostly aging filesystems, it's not a real priority IHMO).
So finally I'm not aware of any pending issue with this patch set so if you
have some concern, please speak up!

Introductory text to first time readers:

Filesystem freezing is currently racy and thus we can end up with dirty data on
frozen filesystem (see changelog patch 13 for detailed race description). This
patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
file_update_time() call to -page_mkwrite callback (patches 01-07) and put
freeze handling in mnt_want_write() / mnt_drop_write(). That however required
some code shuffling and changes to kern_path_create() (see patches 09-12). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

Changes since v5:
  * handle unlinked  open files on frozen filesystem
  * lockdep keys for freeze protection are now per filesystem type
  * taught lockdep that freeze protection at lower level does not create
dependency when we already hold freeze protection at higher level 
  * rebased on 3.5-rc1-ish

Changes since v4:
  * added a couple of Acked-by's
  * added some comments  doc update
  * added patches from series Push file_update_time() into .page_mkwrite
since it doesn't make much sense to keep them separate anymore
  * rebased on top of 3.4-rc2

Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder el...@kernel.org
CC: Anton Altaparmakov an...@tuxera.com
CC: Ben Myers b...@sgi.com
CC: Chris Mason chris.ma...@oracle.com
CC: cluster-de...@redhat.com
CC: David S. Miller da...@davemloft.net
CC: fuse-de...@lists.sourceforge.net
CC: J. Bruce Fields bfie...@fieldses.org
CC: Joel Becker jl...@evilplan.org
CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh mfas...@suse.com
CC: Miklos Szeredi mik...@szeredi.hu
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp
CC: Steven Whitehouse swhit...@redhat.com
CC: Theodore Ts'o ty...@mit.edu
CC: x...@oss.sgi.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More 

[PATCH 00/19 v5] Fix filesystem freezing deadlocks

2012-04-16 Thread Jan Kara
  Hello,

  here is the fifth iteration of my patches to improve filesystem freezing.
No serious changes since last time. Mostly I rebased patches and merged this
series with series moving file_update_time() to -page_mkwrite() to simplify
testing and merging.

Filesystem freezing is currently racy and thus we can end up with dirty data on
frozen filesystem (see changelog patch 13 for detailed race description). This
patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
file_update_time() call to -page_mkwrite callback (patches 01-07) and put
freeze handling in mnt_want_write() / mnt_drop_write(). That however required
some code shuffling and changes to kern_path_create() (see patches 09-12). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

Changes since v4:
  * added a couple of Acked-by's
  * added some comments  doc update
  * added patches from series Push file_update_time() into .page_mkwrite
since it doesn't make much sense to keep them separate anymore
  * rebased on top of 3.4-rc2

Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder el...@kernel.org
CC: Anton Altaparmakov an...@tuxera.com
CC: Ben Myers b...@sgi.com
CC: Chris Mason chris.ma...@oracle.com
CC: cluster-de...@redhat.com
CC: David S. Miller da...@davemloft.net
CC: fuse-de...@lists.sourceforge.net
CC: J. Bruce Fields bfie...@fieldses.org
CC: Joel Becker jl...@evilplan.org
CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh mfas...@suse.com
CC: Miklos Szeredi mik...@szeredi.hu
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp
CC: Steven Whitehouse swhit...@redhat.com
CC: Theodore Ts'o ty...@mit.edu
CC: x...@oss.sgi.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

2012-04-16 Thread Jan Kara
  The subject should have been [PATCH 00/27]... Sorry for the mistake.

Honza

On Mon 16-04-12 18:13:38, Jan Kara wrote:
   Hello,
 
   here is the fifth iteration of my patches to improve filesystem freezing.
 No serious changes since last time. Mostly I rebased patches and merged this
 series with series moving file_update_time() to -page_mkwrite() to simplify
 testing and merging.
 
 Filesystem freezing is currently racy and thus we can end up with dirty data 
 on
 frozen filesystem (see changelog patch 13 for detailed race description). This
 patch series aims at fixing this.
 
 To be able to block all places where inodes get dirtied, I've moved filesystem
 file_update_time() call to -page_mkwrite callback (patches 01-07) and put
 freeze handling in mnt_want_write() / mnt_drop_write(). That however required
 some code shuffling and changes to kern_path_create() (see patches 09-12). I
 think the result is OK but opinions may differ ;). The advantage of this 
 change
 also is that all filesystems get freeze protection almost for free - even ext2
 can handle freezing well now.
 
 Another potential contention point might be patch 19. In that patch we make
 freeze_super() refuse to freeze the filesystem when there are open but 
 unlinked
 files which may be impractical in some cases. The main reason for this is the
 problem with handling of file deletion from fput() called with mmap_sem held
 (e.g. from munmap(2)), and then there's the fact that we cannot really force
 such filesystem into a consistent state... But if people think that freezing
 with open but unlinked files should happen, then I have some possible
 solutions in mind (maybe as a separate patchset since this is large enough).
 
 I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
 filesystem despite beating it with fsstress and bash-shared-mapping while
 freezing and unfreezing for several hours (using ext4 and xfs) so I'm
 reasonably confident this could finally be the right solution.
 
 Changes since v4:
   * added a couple of Acked-by's
   * added some comments  doc update
   * added patches from series Push file_update_time() into .page_mkwrite
 since it doesn't make much sense to keep them separate anymore
   * rebased on top of 3.4-rc2
 
 Changes since v3:
   * added third level of freezing for fs internal purposes - hooked some
 filesystems to use it (XFS, nilfs2)
   * removed racy i_size check from filemap_mkwrite()
 
 Changes since v2:
   * completely rewritten
   * freezing is now blocked at VFS entry points
   * two stage freezing to handle both mmapped writes and other IO
 
 The biggest changes since v1:
   * have two counters to provide safe state transitions for SB_FREEZE_WRITE
 and SB_FREEZE_TRANS states
   * use percpu counters instead of own percpu structure
   * added documentation fixes from the old fs freezing series
   * converted XFS to use SB_FREEZE_TRANS counter instead of its private
 m_active_trans counter
 
   Honza
 
 CC: Alex Elder el...@kernel.org
 CC: Anton Altaparmakov an...@tuxera.com
 CC: Ben Myers b...@sgi.com
 CC: Chris Mason chris.ma...@oracle.com
 CC: cluster-de...@redhat.com
 CC: David S. Miller da...@davemloft.net
 CC: fuse-de...@lists.sourceforge.net
 CC: J. Bruce Fields bfie...@fieldses.org
 CC: Joel Becker jl...@evilplan.org
 CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp
 CC: linux-btrfs@vger.kernel.org
 CC: linux-e...@vger.kernel.org
 CC: linux-...@vger.kernel.org
 CC: linux-ni...@vger.kernel.org
 CC: linux-ntfs-...@lists.sourceforge.net
 CC: Mark Fasheh mfas...@suse.com
 CC: Miklos Szeredi mik...@szeredi.hu
 CC: ocfs2-de...@oss.oracle.com
 CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp
 CC: Steven Whitehouse swhit...@redhat.com
 CC: Theodore Ts'o ty...@mit.edu
 CC: x...@oss.sgi.com
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 24/27] btrfs: Convert to new freezing mechanism

2012-04-16 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and
transaction_kthread() can be safely removed since btrfs_freeze() will lock
the mutexes and thus block the threads (and they shouldn't have anything to
do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason chris.ma...@oracle.com
Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/disk-io.c |3 ---
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |6 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 20196f4..555a57a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1527,8 +1527,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE);
-
if (!(root-fs_info-sb-s_flags  MS_RDONLY) 
mutex_trylock(root-fs_info-cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
@@ -1560,7 +1558,6 @@ static int transaction_kthread(void *arg)
do {
cannot_commit = false;
delay = HZ * 30;
-   vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE);
mutex_lock(root-fs_info-transaction_kthread_mutex);
 
spin_lock(root-fs_info-trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d83260d..a48251e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1358,7 +1358,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;
 
-   vfs_check_frozen(inode-i_sb, SB_FREEZE_WRITE);
+   sb_start_write(inode-i_sb);
 
mutex_lock(inode-i_mutex);
 
@@ -1449,6 +1449,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
num_written = err;
}
 out:
+   sb_end_write(inode-i_sb);
current-backing_dev_info = NULL;
return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 115bc05..db4fc01 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6592,6 +6592,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
u64 page_start;
u64 page_end;
 
+   sb_start_pagefault(inode-i_sb);
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
ret = btrfs_update_time(vma-vm_file);
@@ -6681,12 +6682,15 @@ again:
unlock_extent_cached(io_tree, page_start, page_end, cached_state, 
GFP_NOFS);
 
 out_unlock:
-   if (!ret)
+   if (!ret) {
+   sb_end_pagefault(inode-i_sb);
return VM_FAULT_LOCKED;
+   }
unlock_page(page);
 out:
btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+   sb_end_pagefault(inode-i_sb);
return ret;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/27] btrfs: Push mnt_want_write() outside of i_mutex

2012-04-16 Thread Jan Kara
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason chris.ma...@oracle.com
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/ioctl.c |   23 +++
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 18cc23d..869d913 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (!inode_owner_or_capable(inode))
return -EACCES;
 
+   ret = mnt_want_write_file(file);
+   if (ret)
+   return ret;
+
mutex_lock(inode-i_mutex);
 
ip_oldflags = ip-flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
}
}
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   goto out_unlock;
-
if (flags  FS_SYNC_FL)
ip-flags |= BTRFS_INODE_SYNC;
else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode-i_flags = i_oldflags;
}
 
-   mnt_drop_write_file(file);
  out_unlock:
mutex_unlock(inode-i_mutex);
+   mnt_drop_write_file(file);
return ret;
 }
 
@@ -639,6 +639,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
+   error = mnt_want_write(parent-mnt);
+   if (error)
+   return error;
+
mutex_lock_nested(dir-i_mutex, I_MUTEX_PARENT);
 
dentry = lookup_one_len(name, parent-dentry, namelen);
@@ -650,13 +654,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
if (dentry-d_inode)
goto out_dput;
 
-   error = mnt_want_write(parent-mnt);
-   if (error)
-   goto out_dput;
-
error = btrfs_may_create(dir, dentry);
if (error)
-   goto out_drop_write;
+   goto out_dput;
 
down_read(BTRFS_I(dir)-root-fs_info-subvol_sem);
 
@@ -674,12 +674,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
fsnotify_mkdir(dir, dentry);
 out_up_read:
up_read(BTRFS_I(dir)-root-fs_info-subvol_sem);
-out_drop_write:
-   mnt_drop_write(parent-mnt);
 out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(dir-i_mutex);
+   mnt_drop_write(parent-mnt);
return error;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/19] btrfs: Push mnt_want_write() outside of i_mutex

2012-03-28 Thread Jan Kara
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason chris.ma...@oracle.com
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/ioctl.c |   23 +++
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 03bb62a..c855e55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (!inode_owner_or_capable(inode))
return -EACCES;
 
+   ret = mnt_want_write_file(file);
+   if (ret)
+   return ret;
+
mutex_lock(inode-i_mutex);
 
ip_oldflags = ip-flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
}
}
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   goto out_unlock;
-
if (flags  FS_SYNC_FL)
ip-flags |= BTRFS_INODE_SYNC;
else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode-i_flags = i_oldflags;
}
 
-   mnt_drop_write_file(file);
  out_unlock:
mutex_unlock(inode-i_mutex);
+   mnt_drop_write_file(file);
return ret;
 }
 
@@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
+   error = mnt_want_write(parent-mnt);
+   if (error)
+   return error;
+
mutex_lock_nested(dir-i_mutex, I_MUTEX_PARENT);
 
dentry = lookup_one_len(name, parent-dentry, namelen);
@@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
if (dentry-d_inode)
goto out_dput;
 
-   error = mnt_want_write(parent-mnt);
-   if (error)
-   goto out_dput;
-
error = btrfs_may_create(dir, dentry);
if (error)
-   goto out_drop_write;
+   goto out_dput;
 
down_read(BTRFS_I(dir)-root-fs_info-subvol_sem);
 
@@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
fsnotify_mkdir(dir, dentry);
 out_up_read:
up_read(BTRFS_I(dir)-root-fs_info-subvol_sem);
-out_drop_write:
-   mnt_drop_write(parent-mnt);
 out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(dir-i_mutex);
+   mnt_drop_write(parent-mnt);
return error;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/19 v4] Fix filesystem freezing deadlocks

2012-03-28 Thread Jan Kara
  Hello,

  here is the fourth iteration of my patches to improve filesystem freezing.
Filesystem freezing is currently racy and thus we can end up with dirty data on
frozen filesystem (see changelog patch 06 for detailed race description). This
patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
Push file_update_time() into .page_mkwrite so you'll need to pull that one
in as well.

Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder el...@kernel.org
CC: Anton Altaparmakov an...@tuxera.com
CC: Ben Myers b...@sgi.com
CC: Chris Mason chris.ma...@oracle.com
CC: cluster-de...@redhat.com
CC: David S. Miller da...@davemloft.net
CC: fuse-de...@lists.sourceforge.net
CC: J. Bruce Fields bfie...@fieldses.org
CC: Joel Becker jl...@evilplan.org
CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh mfas...@suse.com
CC: Miklos Szeredi mik...@szeredi.hu
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp
CC: Steven Whitehouse swhit...@redhat.com
CC: Theodore Ts'o ty...@mit.edu
CC: x...@oss.sgi.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 17/19] btrfs: Convert to new freezing mechanism

2012-03-28 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and
transaction_kthread() can be safely removed since btrfs_freeze() will lock
the mutexes and thus block the threads (and they shouldn't have anything to
do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason chris.ma...@oracle.com
Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/disk-io.c |3 ---
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |6 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 811d9f9..fc0f74c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1586,8 +1586,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE);
-
if (!(root-fs_info-sb-s_flags  MS_RDONLY) 
mutex_trylock(root-fs_info-cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
@@ -1618,7 +1616,6 @@ static int transaction_kthread(void *arg)
 
do {
delay = HZ * 30;
-   vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE);
mutex_lock(root-fs_info-transaction_kthread_mutex);
 
spin_lock(root-fs_info-trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 859ba2d..1aac7ca 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1348,7 +1348,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;
 
-   vfs_check_frozen(inode-i_sb, SB_FREEZE_WRITE);
+   sb_start_write(inode-i_sb);
 
mutex_lock(inode-i_mutex);
 
@@ -1439,6 +1439,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
num_written = err;
}
 out:
+   sb_end_write(inode-i_sb);
current-backing_dev_info = NULL;
return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..63c9006 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6405,6 +6405,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
u64 page_start;
u64 page_end;
 
+   sb_start_pagefault(inode-i_sb);
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
ret = btrfs_update_time(vma-vm_file);
@@ -6495,12 +6496,15 @@ again:
unlock_extent_cached(io_tree, page_start, page_end, cached_state, 
GFP_NOFS);
 
 out_unlock:
-   if (!ret)
+   if (!ret) {
+   sb_end_pagefault(inode-i_sb);
return VM_FAULT_LOCKED;
+   }
unlock_page(page);
 out:
btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+   sb_end_pagefault(inode-i_sb);
return ret;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3/4, btrfs, ocfs2: How to assure that cleancache_invalidate_fs is called on every superblock free

2012-03-12 Thread Jan Kara
  Hello,

On Fri 09-03-12 14:40:22, Andor Daam wrote:
 Is it ever possible for a superblock for a mounted filesystem to be
 free'd without a previous call to unmount the filesystem?
  No, I don't think so (well, except for cases where we do not manage to
fully setup the superblock). But be aware that mount/umount need not be
really the entry points you are looking for since filesystem can be mounted
several times. Rather deactivate_locked_supers() is the place you are
looking for...

 I need to be certain that the function cleancache_invalidate_fs, which is
 at the moment called by deactivate_locked_super (fs/super.c) [1], is
 called before every free on a superblock of cleancache-enabled
 filesystems.  Is this already the case or are there situations in which
 this does not happen?
 
 It would be interesting to know this, as we are planning to have
 cleancache save pointers to superblocks of every mounted
 cleancache-enabled filesystem [2] and it would be fatal if a
 superblock is free'd without cleancache being notified.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: getdents - ext4 vs btrfs performance

2012-03-05 Thread Jan Kara
On Fri 02-03-12 14:32:15, Ted Tso wrote:
 On Fri, Mar 02, 2012 at 09:26:51AM -0500, Chris Mason wrote:
 It would be interesting to have a project where someone added
 fallocate() support into libelf, and then added some hueristics into
 ext4 so that if a file is fallocated to a precise size, or if the file
 is fully written and closed before writeback begins, that we use this
 to more efficiently pack the space used by the files by the block
 allocator.  This is a place where I would not be surprised that XFS
 has some better code to avoid accelerated file system aging, and where
 we could do better with ext4 with some development effort.
  AFAIK XFS people actually prefer that applications let them do their work
using delayed allocation and do not interfere with fallocate(2) calls. The
problem they have with fallocate(2) is that it forces you to allocate
blocks while with delayed allocation you can make the decision about
allocation later. So for small files which completely fit into pagecache
before they get pushed out by writeback, they can make better decisions
from delayed allocation. Just dumping my memory from some other thread...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 17/19] btrfs: Convert to new freezing mechanism

2012-03-05 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and
transaction_kthread() can be safely removed since btrfs_freeze() will lock
the mutexes and thus block the threads (and they shouldn't have anything to
do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason chris.ma...@oracle.com
Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/disk-io.c |3 ---
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |6 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 811d9f9..fc0f74c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1586,8 +1586,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE);
-
if (!(root-fs_info-sb-s_flags  MS_RDONLY) 
mutex_trylock(root-fs_info-cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
@@ -1618,7 +1616,6 @@ static int transaction_kthread(void *arg)
 
do {
delay = HZ * 30;
-   vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE);
mutex_lock(root-fs_info-transaction_kthread_mutex);
 
spin_lock(root-fs_info-trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 859ba2d..1aac7ca 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1348,7 +1348,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;
 
-   vfs_check_frozen(inode-i_sb, SB_FREEZE_WRITE);
+   sb_start_write(inode-i_sb);
 
mutex_lock(inode-i_mutex);
 
@@ -1439,6 +1439,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
num_written = err;
}
 out:
+   sb_end_write(inode-i_sb);
current-backing_dev_info = NULL;
return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..63c9006 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6405,6 +6405,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
u64 page_start;
u64 page_end;
 
+   sb_start_pagefault(inode-i_sb);
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
ret = btrfs_update_time(vma-vm_file);
@@ -6495,12 +6496,15 @@ again:
unlock_extent_cached(io_tree, page_start, page_end, cached_state, 
GFP_NOFS);
 
 out_unlock:
-   if (!ret)
+   if (!ret) {
+   sb_end_pagefault(inode-i_sb);
return VM_FAULT_LOCKED;
+   }
unlock_page(page);
 out:
btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+   sb_end_pagefault(inode-i_sb);
return ret;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/19] Fix filesystem freezing deadlocks

2012-03-05 Thread Jan Kara
  Hallelujah,

  after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing.  Filesystem freezing is currently
racy and thus we can end up with dirty data on frozen filesystem (see changelog
patch 06 for detailed race description). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
Push file_update_time() into .page_mkwrite so you'll need to pull that one
in as well.

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder el...@kernel.org
CC: Anton Altaparmakov an...@tuxera.com
CC: Ben Myers b...@sgi.com
CC: Chris Mason chris.ma...@oracle.com
CC: cluster-de...@redhat.com
CC: David S. Miller da...@davemloft.net
CC: fuse-de...@lists.sourceforge.net
CC: J. Bruce Fields bfie...@fieldses.org
CC: Joel Becker jl...@evilplan.org
CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh mfas...@suse.com
CC: Miklos Szeredi mik...@szeredi.hu
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp
CC: Steven Whitehouse swhit...@redhat.com
CC: Theodore Ts'o ty...@mit.edu
CC: x...@oss.sgi.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/19] btrfs: Push mnt_want_write() outside of i_mutex

2012-03-05 Thread Jan Kara
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason chris.ma...@oracle.com
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/ioctl.c |   23 +++
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 03bb62a..c855e55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (!inode_owner_or_capable(inode))
return -EACCES;
 
+   ret = mnt_want_write_file(file);
+   if (ret)
+   return ret;
+
mutex_lock(inode-i_mutex);
 
ip_oldflags = ip-flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
}
}
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   goto out_unlock;
-
if (flags  FS_SYNC_FL)
ip-flags |= BTRFS_INODE_SYNC;
else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode-i_flags = i_oldflags;
}
 
-   mnt_drop_write_file(file);
  out_unlock:
mutex_unlock(inode-i_mutex);
+   mnt_drop_write_file(file);
return ret;
 }
 
@@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
+   error = mnt_want_write(parent-mnt);
+   if (error)
+   return error;
+
mutex_lock_nested(dir-i_mutex, I_MUTEX_PARENT);
 
dentry = lookup_one_len(name, parent-dentry, namelen);
@@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
if (dentry-d_inode)
goto out_dput;
 
-   error = mnt_want_write(parent-mnt);
-   if (error)
-   goto out_dput;
-
error = btrfs_may_create(dir, dentry);
if (error)
-   goto out_drop_write;
+   goto out_dput;
 
down_read(BTRFS_I(dir)-root-fs_info-subvol_sem);
 
@@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
fsnotify_mkdir(dir, dentry);
 out_up_read:
up_read(BTRFS_I(dir)-root-fs_info-subvol_sem);
-out_drop_write:
-   mnt_drop_write(parent-mnt);
 out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(dir-i_mutex);
+   mnt_drop_write(parent-mnt);
return error;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mkfs: Handle creation of filesystem larger than the first device

2012-02-10 Thread Jan Kara
On Wed 08-02-12 22:05:26, Phillip Susi wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 On 02/08/2012 06:20 PM, Jan Kara wrote:
Thanks for your reply. I admit I was not sure what exactly size argument
  should be. So after looking into the code for a while I figured it should
  be a total size of the filesystem - or differently it should be size of
  virtual block address space in the filesystem. Thus when filesystem has
  more devices (or admin wants to add more devices later), it can be larger
  than the first device. But I'm not really a btrfs developper so I might be
  wrong and of course feel free to fix the issue as you deem fit.
 
 The size of the fs is the total size of the individual disks.  When you
 limit the size, you limit the size of a disk, not the whole fs.  IIRC,
 mkfs initializes the fs on the first disk, which is why it was using that
 size as the size of the whole fs, and then adds the other disks after (
 which then add their size to the total fs size ).
  OK, I missed that btrfs_add_to_fsid() increases total size of the
filesystem. So now I agree with you. New patch is attached. Thanks for your
review.

 It might be nice if
 mkfs could take sizes for each disk, but it only seems to take one size
 for the initial disk.
  Yes, but I don't see a realistic usecase so I don't think it's really
worth the work.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
From e5f46872232520310c56327593c02ef6a7f5ea33 Mon Sep 17 00:00:00 2001
From: Jan Kara j...@suse.cz
Date: Fri, 10 Feb 2012 11:44:44 +0100
Subject: [PATCH] mkfs: Handle creation of filesystem larger than the first device

mkfs does not properly check requested size of the filesystem. Thus if the
requested size is larger than the first device, it happily creates larger
filesystem than a device it resides on which results in 'attemp to access
beyond end of device' messages from the kernel. So verify specified filesystem
size against the size of the first device.

CC: David Sterba dste...@suse.cz
Signed-off-by: Jan Kara j...@suse.cz
---
 mkfs.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index e3ced19..3afe9eb 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1282,6 +1282,10 @@ int main(int ac, char **av)
 		ret = btrfs_prepare_device(fd, file, zero_end, dev_block_count, mixed);
 		if (block_count == 0)
 			block_count = dev_block_count;
+		else if (block_count  dev_block_count) {
+			fprintf(stderr, %s is smaller than requested size\n, file);
+			exit(1);
+		}
 	} else {
 		ac = 0;
 		file = av[optind++];
-- 
1.7.1



[PATCH 4/4] btrfs: Use generic handlers of O_SYNC AIO DIO

2012-02-10 Thread Jan Kara
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
file. Although we use our own bio-end_io function, we call dio_end_io()
from it and thus, because we don't set any specific dio-end_io function,
generic code ends up calling generic_dio_end_io() which is all what we need
for proper O_SYNC AIO DIO handling.

Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/inode.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..68add6e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6221,7 +6221,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
ret = __blockdev_direct_IO(rw, iocb, inode,
   BTRFS_I(inode)-root-fs_info-fs_devices-latest_bdev,
   iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
-  btrfs_submit_direct, 0);
+  btrfs_submit_direct, DIO_SYNC_WRITES);
 
if (ret  0  ret != -EIOCBQUEUED) {
clear_extent_bit(BTRFS_I(inode)-io_tree, offset,
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] Generic O_SYNC AIO DIO handling

2012-02-10 Thread Jan Kara

  Hi Jeff,

  these patches implement generic way of handling O_SYNC AIO DIO. They work
for all filesystems except for ext4 and xfs. Thus together with your patches,
all filesystems should handle O_SYNC AIO DIO correctly. I've tested ext3,
btrfs, and xfs (to check that I didn't break anything when the generic code
is unused) and things seem to work fine. Will you add these patches to your
series please? Thanks.

Honza
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] ocfs2: Use generic handlers of O_SYNC AIO DIO

2012-02-10 Thread Jan Kara
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
file.

Signed-off-by: Jan Kara j...@suse.cz
---
 fs/ocfs2/aops.c |6 ++
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 78b68af..3d14c2b 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -593,9 +593,7 @@ static void ocfs2_dio_end_io(struct kiocb *iocb,
level = ocfs2_iocb_rw_locked_level(iocb);
ocfs2_rw_unlock(inode, level);
 
-   if (is_async)
-   aio_complete(iocb, ret, 0);
-   inode_dio_done(inode);
+   generic_dio_end_io(iocb, offset, bytes, private, ret, is_async);
 }
 
 /*
@@ -642,7 +640,7 @@ static ssize_t ocfs2_direct_IO(int rw,
return __blockdev_direct_IO(rw, iocb, inode, inode-i_sb-s_bdev,
iov, offset, nr_segs,
ocfs2_direct_IO_get_blocks,
-   ocfs2_dio_end_io, NULL, 0);
+   ocfs2_dio_end_io, NULL, DIO_SYNC_WRITES);
 }
 
 static void ocfs2_figure_cluster_boundaries(struct ocfs2_super *osb,
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] vfs: Handle O_SYNC AIO DIO in generic code properly

2012-02-10 Thread Jan Kara
Provide VFS helpers for handling O_SYNC AIO DIO writes. Filesystem wanting to
use the helpers has to pass DIO_SYNC_WRITES to __blockdev_direct_IO. Then if
they don't use direct IO end_io handler, generic code takes care of everything
else. Otherwise their end_io handler is passed struct dio_sync_io_work pointer
as 'private' argument and they have to call generic_dio_end_io() to finish
their AIO DIO. Generic code then takes care to call generic_write_sync() from
a workqueue context when AIO DIO is completed.

Since all filesystems using blockdev_direct_IO() need O_SYNC aio dio handling
and the generic one is enough for them, make blockdev_direct_IO() pass
DIO_SYNC_WRITES flag.

Signed-off-by: Jan Kara j...@suse.cz
---
 fs/direct-io.c |  128 ++--
 fs/super.c |2 +
 include/linux/fs.h |   13 +-
 3 files changed, 138 insertions(+), 5 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4a588db..79aa531 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,8 @@
 #include linux/atomic.h
 #include linux/prefetch.h
 
+#include asm/cmpxchg.h
+
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
  * the size of a structure in the slab cache
@@ -112,6 +114,15 @@ struct dio_submit {
unsigned tail;  /* last valid page + 1 */
 };
 
+/* state needed for final sync and completion of O_SYNC AIO DIO */
+struct dio_sync_io_work {
+   struct kiocb *iocb;
+   loff_t offset;
+   ssize_t len;
+   int ret;
+   struct work_struct work;
+};
+
 /* dio_state communicated between submission path and end_io */
 struct dio {
int flags;  /* doesn't change */
@@ -134,6 +145,7 @@ struct dio {
/* AIO related stuff */
struct kiocb *iocb; /* kiocb */
ssize_t result; /* IO result */
+   struct dio_sync_io_work *sync_work; /* work used for O_SYNC AIO */
 
/*
 * pages[] (and any fields placed after it) are not zeroed out at
@@ -261,6 +273,45 @@ static inline struct page *dio_get_page(struct dio *dio,
 }
 
 /**
+ * generic_dio_end_io() - generic dio -end_io handler
+ * @iocb: iocb of finishing DIO
+ * @offset: the byte offset in the file of the completed operation
+ * @bytes: length of the completed operation
+ * @work: work to queue for O_SYNC AIO DIO, NULL otherwise
+ * @ret: error code if IO failed
+ * @is_async: is this AIO?
+ *
+ * This is generic callback to be called when direct IO is finished. It
+ * handles update of number of outstanding DIOs for an inode, completion
+ * of async iocb and queueing of work if we need to call fsync() because
+ * io was O_SYNC.
+ */
+void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
+   struct dio_sync_io_work *work, int ret, bool is_async)
+{
+   struct inode *inode = iocb-ki_filp-f_dentry-d_inode;
+
+   if (!is_async) {
+   inode_dio_done(inode);
+   return;
+   }
+
+   /*
+* If we need to sync file, we offload completion to workqueue
+*/
+   if (work) {
+   work-ret = ret;
+   work-offset = offset;
+   work-len = bytes;
+   queue_work(inode-i_sb-s_dio_flush_wq, work-work);
+   } else {
+   aio_complete(iocb, ret, 0);
+   inode_dio_done(inode);
+   }
+}
+EXPORT_SYMBOL(generic_dio_end_io);
+
+/**
  * dio_complete() - called when all DIO BIO I/O has been completed
  * @offset: the byte offset in the file of the completed operation
  *
@@ -302,12 +353,22 @@ static ssize_t dio_complete(struct dio *dio, loff_t 
offset, ssize_t ret, bool is
ret = transferred;
 
if (dio-end_io  dio-result) {
+   void *private;
+
+   if (dio-sync_work)
+   private = dio-sync_work;
+   else
+   private = dio-private;
dio-end_io(dio-iocb, offset, transferred,
-   dio-private, ret, is_async);
+   private, ret, is_async);
} else {
-   if (is_async)
-   aio_complete(dio-iocb, ret, 0);
-   inode_dio_done(dio-inode);
+   /* No IO submitted? Skip syncing... */
+   if (!dio-result  dio-sync_work) {
+   kfree(dio-sync_work);
+   dio-sync_work = NULL;
+   }
+   generic_dio_end_io(dio-iocb, offset, transferred,
+  dio-sync_work, ret, is_async);
}
 
return ret;
@@ -1064,6 +1125,41 @@ static inline int drop_refcount(struct dio *dio)
 }
 
 /*
+ * Work performed from workqueue when AIO DIO is finished.
+ */
+static void dio_aio_sync_work(struct work_struct *work)
+{
+   struct dio_sync_io_work *sync_work =
+   container_of(work

[PATCH 3/4] gfs2: Use generic handlers of O_SYNC AIO DIO

2012-02-10 Thread Jan Kara
Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
file.

Signed-off-by: Jan Kara j...@suse.cz
---
 fs/gfs2/aops.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 501e5cb..9c381ff 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1034,7 +1034,7 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
 
rv = __blockdev_direct_IO(rw, iocb, inode, inode-i_sb-s_bdev, iov,
  offset, nr_segs, gfs2_get_block_direct,
- NULL, NULL, 0);
+ NULL, NULL, DIO_SYNC_WRITES);
 out:
gfs2_glock_dq_m(1, gh);
gfs2_holder_uninit(gh);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mkfs: Handle creation of filesystem larger than the first device

2012-02-08 Thread Jan Kara
On Wed 08-02-12 17:01:15, Phillip Susi wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 On 1/26/2012 11:03 AM, Jan Kara wrote:
  make_btrfs() function takes a size of filesystem as an argument. It
  uses this value to set the size of the first device as well which
  is wrong for filesystems larger than this device. It results in
  'attemp to access beyond end of device' messages from the kernel.
  So add size of the first device as an argument to make_btrfs().
 
 I don't think this patch is correct.  Yes, the size switch only
 applies to the first device, so it doesn't make any sense to try to
 use a value larger than that device.  Attempting to do so probably
 should be trapped and it should error out, and the man page should
 probably clarify the fact that the size is only for the first device.
 
 It looks like you think the size should somehow apply to multiple
 devices or to the total fs size when creating on multiple devices, and
 that just doesn't make sense.
  Thanks for your reply. I admit I was not sure what exactly size argument
should be. So after looking into the code for a while I figured it should
be a total size of the filesystem - or differently it should be size of
virtual block address space in the filesystem. Thus when filesystem has
more devices (or admin wants to add more devices later), it can be larger
than the first device. But I'm not really a btrfs developper so I might be
wrong and of course feel free to fix the issue as you deem fit.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] mkfs: Handle creation of filesystem larger than the first device

2012-01-26 Thread Jan Kara
make_btrfs() function takes a size of filesystem as an argument. It uses this
value to set the size of the first device as well which is wrong for
filesystems larger than this device. It results in 'attemp to access beyond end
of device' messages from the kernel. So add size of the first device as an
argument to make_btrfs().

CC: David Sterba dste...@suse.cz
Signed-off-by: Jan Kara j...@suse.cz
---
 convert.c |2 +-
 mkfs.c|6 --
 utils.c   |4 ++--
 utils.h   |2 +-
 4 files changed, 8 insertions(+), 6 deletions(-)

  As a side note, I'd guess that creating filesystem larger than all given
devices (especially in case of single device) is usually not what sysadmin
wants (we've spotted this bug when xfstest were happily creating 1 GB
filesystem on 500 MB device and it took us a while to notice the problem).
Thus maybe it should require some --force switch?

diff --git a/convert.c b/convert.c
index 291dc27..7f1932c 100644
--- a/convert.c
+++ b/convert.c
@@ -2374,7 +2374,7 @@ int do_convert(const char *devname, int datacsum, int 
packing, int noxattr)
goto fail;
}
ret = make_btrfs(fd, devname, ext2_fs-super-s_volume_name,
-blocks, total_bytes, blocksize, blocksize,
+blocks, total_bytes, total_bytes, blocksize, blocksize,
 blocksize, blocksize);
if (ret) {
fprintf(stderr, unable to create initial ctree\n);
diff --git a/mkfs.c b/mkfs.c
index e3ced19..f0d29bb 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1294,8 +1294,10 @@ int main(int ac, char **av)
first_file = file;
source_dir_size = size_sourcedir(source_dir, sectorsize,
 num_of_meta_chunks, 
size_of_data);
-   if(block_count  source_dir_size)
+   if (block_count  source_dir_size)
block_count = source_dir_size;
+   dev_block_count = block_count;
+
ret = zero_output_file(fd, block_count, sectorsize);
if (ret) {
fprintf(stderr, unable to zero the output file\n);
@@ -1321,7 +1323,7 @@ int main(int ac, char **av)
leafsize * i;
}
 
-   ret = make_btrfs(fd, file, label, blocks, block_count,
+   ret = make_btrfs(fd, file, label, blocks, block_count, dev_block_count,
 nodesize, leafsize,
 sectorsize, stripesize);
if (ret) {
diff --git a/utils.c b/utils.c
index 178d1b9..f34da51 100644
--- a/utils.c
+++ b/utils.c
@@ -74,7 +74,7 @@ static u64 reference_root_table[] = {
 };
 
 int make_btrfs(int fd, const char *device, const char *label,
-  u64 blocks[7], u64 num_bytes, u32 nodesize,
+  u64 blocks[7], u64 num_bytes, u64 dev_num_bytes, u32 nodesize,
   u32 leafsize, u32 sectorsize, u32 stripesize)
 {
struct btrfs_super_block super;
@@ -276,7 +276,7 @@ int make_btrfs(int fd, const char *device, const char 
*label,
dev_item = btrfs_item_ptr(buf, nritems, struct btrfs_dev_item);
btrfs_set_device_id(buf, dev_item, 1);
btrfs_set_device_generation(buf, dev_item, 0);
-   btrfs_set_device_total_bytes(buf, dev_item, num_bytes);
+   btrfs_set_device_total_bytes(buf, dev_item, dev_num_bytes);
btrfs_set_device_bytes_used(buf, dev_item,
BTRFS_MKFS_SYSTEM_GROUP_SIZE);
btrfs_set_device_io_align(buf, dev_item, sectorsize);
diff --git a/utils.h b/utils.h
index c5f55e1..bf2d5a4 100644
--- a/utils.h
+++ b/utils.h
@@ -22,7 +22,7 @@
 #define BTRFS_MKFS_SYSTEM_GROUP_SIZE (4 * 1024 * 1024)
 
 int make_btrfs(int fd, const char *device, const char *label,
-  u64 blocks[6], u64 num_bytes, u32 nodesize,
+  u64 blocks[6], u64 num_bytes, u64 dev_num_bytes, u32 nodesize,
   u32 leafsize, u32 sectorsize, u32 stripesize);
 int btrfs_make_root_dir(struct btrfs_trans_handle *trans,
struct btrfs_root *root, u64 objectid);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Fix busyloops in transaction waiting code

2012-01-26 Thread Jan Kara
wait_log_commit() and wait_for_writer() were using slightly different
conditions for deciding whether they should call schedule() and whether they
should continue in the wait loop. Thus it could happen that we busylooped when
the first condition was not true while the second one was. That is burning CPU
cycles needlessly and is deadly on UP machines...

Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/tree-log.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index cb877e0..966cc74 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1957,7 +1957,8 @@ static int wait_log_commit(struct btrfs_trans_handle 
*trans,
 
finish_wait(root-log_commit_wait[index], wait);
mutex_lock(root-log_mutex);
-   } while (root-log_transid  transid + 2 
+   } while (root-fs_info-last_trans_log_full_commit !=
+trans-transid  root-log_transid  transid + 2 
 atomic_read(root-log_commit[index]));
return 0;
 }
@@ -1966,7 +1967,8 @@ static int wait_for_writer(struct btrfs_trans_handle 
*trans,
   struct btrfs_root *root)
 {
DEFINE_WAIT(wait);
-   while (atomic_read(root-log_writers)) {
+   while (root-fs_info-last_trans_log_full_commit !=
+  trans-transid  atomic_read(root-log_writers)) {
prepare_to_wait(root-log_writer_wait,
wait, TASK_UNINTERRUPTIBLE);
mutex_unlock(root-log_mutex);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Fix busyloops in transaction waiting code

2012-01-25 Thread Jan Kara
wait_log_commit() and wait_for_writer() were using slightly different
conditions for deciding whether they should call schedule() and whether they
should continue in the wait loop. Thus it could happen that we busylooped when
the first condition was not true while the second one was. That is burning CPU
cycles needlessly and is deadly on UP machines...

CC: sta...@kernel.org
Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/tree-log.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index cb877e0..966cc74 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1957,7 +1957,8 @@ static int wait_log_commit(struct btrfs_trans_handle 
*trans,
 
finish_wait(root-log_commit_wait[index], wait);
mutex_lock(root-log_mutex);
-   } while (root-log_transid  transid + 2 
+   } while (root-fs_info-last_trans_log_full_commit !=
+trans-transid  root-log_transid  transid + 2 
 atomic_read(root-log_commit[index]));
return 0;
 }
@@ -1966,7 +1967,8 @@ static int wait_for_writer(struct btrfs_trans_handle 
*trans,
   struct btrfs_root *root)
 {
DEFINE_WAIT(wait);
-   while (atomic_read(root-log_writers)) {
+   while (root-fs_info-last_trans_log_full_commit !=
+  trans-transid  atomic_read(root-log_writers)) {
prepare_to_wait(root-log_writer_wait,
wait, TASK_UNINTERRUPTIBLE);
mutex_unlock(root-log_mutex);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs: push file_update_time into -page_mkwrite

2011-11-29 Thread Jan Kara
 @@ -571,6 +571,7 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct 
 *vma, struct vm_fault *vmf)
   filp-f_mapping-host-i_ino,
   (long long)page_offset(page));
  
 + file_update_time(filp);
   /* make sure the cache has finished storing the page */
   nfs_fscache_wait_on_page_write(NFS_I(dentry-d_inode), page);
  
 diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
 index 2660152..d1ab350 100644
 --- a/fs/nilfs2/file.c
 +++ b/fs/nilfs2/file.c
 @@ -70,6 +70,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, 
 struct vm_fault *vmf)
   if (unlikely(nilfs_near_disk_full(inode-i_sb-s_fs_info)))
   return VM_FAULT_SIGBUS; /* -ENOSPC */
  
 + file_update_time(vma-vm_file);
   lock_page(page);
   if (page-mapping != inode-i_mapping ||
   page_offset(page) = i_size_read(inode) || !PageUptodate(page)) {
 diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c
 index 3e9393c..27c59e5 100644
 --- a/fs/ocfs2/mmap.c
 +++ b/fs/ocfs2/mmap.c
 @@ -141,6 +141,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, 
 struct vm_fault *vmf)
  
   ocfs2_block_signals(oldset);
  
 + file_update_time(vma-vm_file);
   /*
* The cluster locks taken will block a truncate from another
* node. Taking the data lock will also ensure that we don't
 diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
 index a475983..4879bfb 100644
 --- a/fs/sysfs/bin.c
 +++ b/fs/sysfs/bin.c
 @@ -225,6 +225,7 @@ static int bin_page_mkwrite(struct vm_area_struct *vma, 
 struct vm_fault *vmf)
   if (!sysfs_get_active(attr_sd))
   return VM_FAULT_SIGBUS;
  
 + file_update_time(file);
   ret = 0;
   if (bb-vm_ops-page_mkwrite)
   ret = bb-vm_ops-page_mkwrite(vma, vmf);
 diff --git a/kernel/events/core.c b/kernel/events/core.c
 index 0f85778..abbaee4 100644
 --- a/kernel/events/core.c
 +++ b/kernel/events/core.c
 @@ -3466,6 +3466,7 @@ static int perf_mmap_fault(struct vm_area_struct *vma, 
 struct vm_fault *vmf)
   struct ring_buffer *rb;
   int ret = VM_FAULT_SIGBUS;
  
 + file_update_time(vma-vm_file);
   if (vmf-flags  FAULT_FLAG_MKWRITE) {
   if (vmf-pgoff == 0)
   ret = 0;
 diff --git a/mm/memory.c b/mm/memory.c
 index a56e3ba..8e4686f 100644
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -2619,10 +2619,6 @@ reuse:
   }
   }
  
 - /* file_update_time outside page_lock */
 - if (vma-vm_file)
 - file_update_time(vma-vm_file);
 -
   return ret;
   }
  
 @@ -3303,10 +3299,6 @@ static int __do_fault(struct mm_struct *mm, struct 
 vm_area_struct *vma,
*/
   balance_dirty_pages_ratelimited(mapping);
   }
 -
 - /* file_update_time outside page_lock */
 - if (vma-vm_file)
 - file_update_time(vma-vm_file);
   } else {
   unlock_page(vmf.page);
   if (anon)
 diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
 index 55d92cb..e5f4be1 100644
 --- a/security/selinux/selinuxfs.c
 +++ b/security/selinux/selinuxfs.c
 @@ -458,6 +458,7 @@ static int sel_mmap_policy_fault(struct vm_area_struct 
 *vma,
   unsigned long offset;
   struct page *page;
  
 + file_update_time(vma-vm_file);
   if (vmf-flags  (FAULT_FLAG_MKWRITE | FAULT_FLAG_WRITE))
   return VM_FAULT_SIGBUS;
  
 -- 
 1.7.5.2
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] fallocate vs O_(D)SYNC

2011-11-16 Thread Jan Kara
On Wed 16-11-11 07:45:50, Christoph Hellwig wrote:
 On Wed, Nov 16, 2011 at 11:54:13AM +0100, Jan Kara wrote:
Yeah, only that nobody calls that fsync() automatically if the fd is
  O_SYNC if I'm right. But maybe calling fdatasync() on the range which was
  fallocated from sys_fallocate() if the fd is O_SYNC would do the trick for
  most filesystems? That would match how we treat O_SYNC for other operations
  as well. I'm just not sure whether XFS wouldn't take unnecessarily big hit
  with this.
 
 This would work fine with XFS and be equivalent to what it does for
 O_DSYNC now.  But I'd rather see every filesystem do the right thing
 and make sure the update actually is on disk when doing O_(D)SYNC
 operations.
  OK, I don't really have a strong opinion here. Are you afraid that just
calling fsync() need not be enough to push all updates fallocate did to
disk?

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] remove i_alloc_sem

2011-06-21 Thread Jan Kara
On Mon 20-06-11 16:15:33, Christoph Hellwig wrote:
 i_alloc_sem has always been a bit of an odd lock.  It's the only remaining
 rw_semaphore that can be released by a different thread than the one that
 locked it, and it's use case in the core direct I/O code is more like a
 counter given that the writers already have external serialization.
 
 This series removes it in favour of a simpler counter scheme, thus getting
 rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
 same time shrinking the size of struct inode by 160 bytes on 64-bit systems.
 
 The only nasty bit is that two filesystems (fat and ext4) have started
 abusing the lock for their own purposes.  I've added a new rw_semaphore
  ext4 abuse should be gone when Ted merges my rewrite of
ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
it?

Honza

-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fs: fix deadlocks in writeback_if_idle

2010-11-24 Thread Jan Kara
On Wed 24-11-10 12:03:43, Nick Piggin wrote:
  For the _nr variant that btrfs uses, it's worse for the filesystems
  that don't have a 1:1 bdi-sb mapping.  It might not actually write any
  of the pages from the SB that is out of space.
 
 That's true, but it might not write anything anyway (and after we
 check whether writeout is in progress, the writeout thread could go
 to sleep and not do anything anyway).
 
 So it's a pretty hacky interface anyway. If you want to do anything
 deterministic, you obviously need real coupling between producer and
 consumer. This should only be a performance tweak (or a workaround
 hack in worst case).
  Yes, the current interface is a band aid for the problem and better
interface is welcome. But it's not trivial to do better...

   It makes no further guarantees, and anyway
   the sb has to compete for normal writeback within this bdi.
  
   
   I think Christoph is right because filesystems should not really
   know about how bdi writeback queueing works. But I don't know if it's
   worth doing anything more complex for this functionality?
  
  I think we should make a writeback_inodes_sb_unlocked() that doesn't
  warn when the semaphore isn't held.  That should be enough given where
  btrfs and ext4 are calling it from.
 
 It doesn't solve the bugs -- calling and waiting for writeback is
 still broken because completion requires i_mutex and it is called
 from under i_mutex.
  Well, as I wrote in my previous email, only ext4 has the problem with
i_mutex and I personally view it as a bug. But ultimately it's Ted's call
to decide.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fs: fix deadlocks in writeback_if_idle

2010-11-23 Thread Jan Kara
On Tue 23-11-10 21:11:49, Nick Piggin wrote:
 The issue of writeback_inodes_sb being synchronous so far as it has to
 wait until the work has been dequeued is another subtlety. That is a
 funny interface though, really. It has 3 callers, sync, quota, and
 ubifs. From a quick look, quota and ubifs seem to think it is some kind
 of synchronous writeout API.
  Yes, they expect it and it used to be the case before per-bdi flusher
threads existed (because the function submitted IO on its own). Then it
was changed to an async interface in per-bdi flusher thread patches and
then back again to a synchronous one... Sad history...

 It also really sucks that it can get deadlocked behind an unrelated item
 in a workqueue. I think it should just get converted over to the
 async-submission style as well.
  Here I don't quite agree. After my patches (currently in -mm tree) all
work items have reasonably well defined lifetime so no livelocks should
occur. After all writeback thread is doing its best to do as much IO as
possible (and hopefully saturates the storage) so given all the IO work
items we cannot do much better. Where I see a room for improvement is
that work items usually try to achieve a common goal - for example when we
get two items write all dirty pages, we have mostly fulfilled the second
one after finishing the first one but we start from the beginning when
processing the second request. But it seems non-trivial to do this request
merging especially for different types of requests...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-23 Thread Jan Kara
On Tue 23-11-10 19:07:58, Nick Piggin wrote:
 On Mon, Nov 22, 2010 at 07:16:55PM +0100, Jan Kara wrote:
  On Fri 19-11-10 16:16:19, Nick Piggin wrote:
   On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote:
On Wed 17-11-10 22:28:34, Andrew Morton wrote:
 The fact that a call to -write_begin can randomly return with 
 s_umount
 held, to be randomly released at some random time in the future is a
 bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
 thing.
  I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
variants) does:
bdi_queue_work(sb-s_bdi, work);
wait_for_completion(done);
  So we return only after all the IO has been submitted and unlock 
s_umount
in writeback_inodes_sb_if_idle(). And we cannot really submit the IO 
ourselves
because we are holding i_mutex and we need to get and put references
to other inodes while doing writeback (those would be really horrible 
lock
dependencies - writeback thread can put the last reference to an 
unlinked
inode...).
   
   But if we're waiting for it, with the lock held, then surely it can
   deadlock just the same as if we submit it ourself?
Yes, that's what I realized as well and what needs fixing. It was an
  unintentional consequence of locking changes Christoph did to the writeback
  code to fix other races.
  
   BTW. are you taking i_mutex inside writeback? I mutex can be held
   while entering page reclaim, and thus writepage... so it could be a
   bug too.
No, writeback does not need i_mutex.
 
 I did in fact see i_mutex from writeback, which is how the lock order
 reversal was noticed in the first place. I haven't had much luck in
 reproducing it yet. It did come from end_io_work, I believe.
 
 There is another deadlock in here, by the looks (although this is not
 the one which triggers lockdep -- the workqueue coupling means it
 defeats lockdep detection).
 
 Buffered write holds i_lock, and waits for inode writeback submission,
 but the work queue seems to be blocked on i_mutex (ext4_end_io_work),
 and so it deadlocks.
  Ah, I see. That's a new thing introduced by Ted's rewrite of
ext4_writepages() - bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc. And indeed it
introduced a deadlock as you describe...

Honza

 Note that this is an AA deadlock, different from ABBA one relating to
 s_umount lock (but very similar call chains IIRC).
 
 
 [  748.406457] SysRq : Show Blocked State
 [  748.406685]   taskPC stack   pid father
 [  748.406868] kworker/9:1   D   6296   118  2
 0x
 [  748.407173]  88012acb3c90 0046 
 88012b1cec80
 [  748.407686]  0002 88012acb2000 88012acb2000
 d6c8
 [  748.408200]  88012acb2010 88012acb3fd8 880129c7dd00
 88012b1cec80
 [  748.408711] Call Trace:
 [  748.408866]  [81039431] ? get_parent_ip+0x11/0x50
 [  748.409033]  [8160145c] __mutex_lock_common+0x18c/0x480
 [  748.409205]  [a0042147] ? ext4_end_io_work+0x37/0xb0 [ext4]
 [  748.409379]  [a0042147] ? ext4_end_io_work+0x37/0xb0 [ext4]
 [  748.409546]  [8160182e] mutex_lock_nested+0x3e/0x50
 [  748.409717]  [a0042147] ext4_end_io_work+0x37/0xb0 [ext4]
 [  748.409883]  [81068378] process_one_work+0x1b8/0x5a0
 [  748.410046]  [8106830e] ? process_one_work+0x14e/0x5a0
 [  748.410219]  [a0042110] ? ext4_end_io_work+0x0/0xb0 [ext4]
 [  748.410386]  [81069675] worker_thread+0x175/0x3a0
 [  748.410549]  [81069500] ? worker_thread+0x0/0x3a0
 [  748.410712]  [8106e246] kthread+0x96/0xa0
 [  748.410874]  [81003ed4] kernel_thread_helper+0x4/0x10
 [  748.411038]  [81039878] ? finish_task_switch+0x78/0x110
 [  748.411202]  [816036c0] ? restore_args+0x0/0x30
 [  748.411364]  [8106e1b0] ? kthread+0x0/0xa0
 [  748.411524]  [81003ed0] ? kernel_thread_helper+0x0/0x10
 [  748.606853] dbenchD   2872  2916  1
 0x0004
 [  748.607154]  880129ec58b8 0046 880129ec5838
 810806fd
 [  748.607665]  880129ec5898 880129ec4000 880129ec4000
 d6c8
 [  748.608176]  880129ec4010 880129ec5fd8 88010a2d
 88011ff69f00
 [  748.608686] Call Trace:
 [  748.608837]  [810806fd] ? trace_hardirqs_off+0xd/0x10
 [  748.609001]  [81600bb5] schedule_timeout+0x1f5/0x350
 [  748.609164]  [8108380b] ? mark_held_locks+0x6b/0xa0
 [  748.609328]  [8160314b] ? _raw_spin_unlock_irq+0x2b/0x60
 [  748.609493]  [81039431] ? get_parent_ip+0x11/0x50
 [  748.609656]  [81606c3d] ? sub_preempt_count+0x9d/0xd0
 [  748.609820]  [815ffacd] wait_for_common+0x10d/0x190
 [  748.609984]  [810426e0] ? default_wake_function

Re: [patch] fix up lock order reversal in writeback

2010-11-22 Thread Jan Kara
On Fri 19-11-10 16:16:19, Nick Piggin wrote:
 On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote:
  On Wed 17-11-10 22:28:34, Andrew Morton wrote:
   The fact that a call to -write_begin can randomly return with s_umount
   held, to be randomly released at some random time in the future is a
   bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
   thing.
I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
  variants) does:
  bdi_queue_work(sb-s_bdi, work);
  wait_for_completion(done);
So we return only after all the IO has been submitted and unlock s_umount
  in writeback_inodes_sb_if_idle(). And we cannot really submit the IO 
  ourselves
  because we are holding i_mutex and we need to get and put references
  to other inodes while doing writeback (those would be really horrible lock
  dependencies - writeback thread can put the last reference to an unlinked
  inode...).
 
 But if we're waiting for it, with the lock held, then surely it can
 deadlock just the same as if we submit it ourself?
  Yes, that's what I realized as well and what needs fixing. It was an
unintentional consequence of locking changes Christoph did to the writeback
code to fix other races.

 BTW. are you taking i_mutex inside writeback? I mutex can be held
 while entering page reclaim, and thus writepage... so it could be a
 bug too.
  No, writeback does not need i_mutex.

  In fact, as I'm speaking about it, pushing things to writeback thread and
  waiting on the work does not help a bit with the locking issues (we didn't
  wait for the work previously but that had other issues). Bug, sigh.
  
  What might be better interface for usecases like above is to allow
  filesystem to kick flusher thread to start doing background writeback
  (regardless of dirty limits). Then the caller can wait for some delayed
  allocation reservations to get freed (easy enough to check in
  -writepage() and wake the waiters) - possibly with a reasonable timeout
  so that we don't stall forever.
 
 We really need to throttle the producer without any locks held, no?
 So the filesystem would like to to hook into dirtying path somewhere
 without i_mutex held (and without implementing your own .write). Eg.
 around the dirty throttling calls somewhere I suppose.
  Well, the particular case where we need to do delayed allocation but the
filesystem has already all blocks reserved is hard to detect without any
locks. At least we need some locks to find out how many blocks do we need
to reserve for that write(). So for filesystems it would solve the locking
issues if we could bail out of write() path up to the point where we hold
no locks, wait there / do necessary writeback and then restart the write...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-18 Thread Jan Kara
On Wed 17-11-10 20:46:15, Josef Bacik wrote:
 Hole punching has already been implemented by XFS and OCFS2, and has the
 potential to be implemented on both BTRFS and EXT4 so we need a generic way to
 get to this feature.  The simplest way in my mind is to add 
 FALLOC_FL_PUNCH_HOLE
 to fallocate() since it already looks like the normal fallocate() operation.
 I've tested this patch with XFS and BTRFS to make sure XFS did what it's
 supposed to do and that BTRFS failed like it was supposed to.  Thank you,
  Looks nice now. Acked-by: Jan Kara j...@suse.cz

Honza
 
 Signed-off-by: Josef Bacik jo...@redhat.com
 ---
  fs/open.c  |7 ++-
  include/linux/falloc.h |1 +
  2 files changed, 7 insertions(+), 1 deletions(-)
 
 diff --git a/fs/open.c b/fs/open.c
 index 4197b9e..5b6ef7e 100644
 --- a/fs/open.c
 +++ b/fs/open.c
 @@ -223,7 +223,12 @@ int do_fallocate(struct file *file, int mode, loff_t 
 offset, loff_t len)
   return -EINVAL;
  
   /* Return error if mode is not supported */
 - if (mode  !(mode  FALLOC_FL_KEEP_SIZE))
 + if (mode  ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
 + return -EOPNOTSUPP;
 +
 + /* Punch hole must have keep size set */
 + if ((mode  FALLOC_FL_PUNCH_HOLE) 
 + !(mode  FALLOC_FL_KEEP_SIZE))
   return -EOPNOTSUPP;
  
   if (!(file-f_mode  FMODE_WRITE))
 diff --git a/include/linux/falloc.h b/include/linux/falloc.h
 index 3c15510..73e0b62 100644
 --- a/include/linux/falloc.h
 +++ b/include/linux/falloc.h
 @@ -2,6 +2,7 @@
  #define _FALLOC_H_
  
  #define FALLOC_FL_KEEP_SIZE  0x01 /* default is extend size */
 +#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
  
  #ifdef __KERNEL__
  
 -- 
 1.6.6.1
 
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-18 Thread Jan Kara
On Wed 17-11-10 22:28:34, Andrew Morton wrote:
 I'm not sure that s_umount versus i_mutex has come up before.
 
 Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
 is a per-superblock thing, and i_mutex is a per-file thing, and files
 live under superblocks.  Nesting s_umount outside i_mutex creates
 complex deadlock graphs between the various i_mutexes, I think.
 
 Someone tell me if btrfs has the same bug, via its call to
 writeback_inodes_sb_nr_if_idle()?
 
 I don't see why these functions need s_umount at all, if they're called
 from within -write_begin against an inode on that superblock.  If the
 superblock can get itself disappeared while we're running -write_begin
 on it, we have problems, no?
  As I wrote to Chris, the function just needs exclusion from umount /
remount happening (and want to stop umount from returning EBUSY when
writeback thread is writing something out). When the function is called
from -write_begin this is no issue as you properly noted so s_umount is
not needed in that particular case.

 In which case I'd suggest just removing the down_read(s_umount) and
 specifying that the caller must pin the superblock via some means.
  Possibly, but currently the advantage is that we can have WARN_ON in the
writeback code that complains if someone starts writeback without properly
pinned superblock and we cannot easily do that with your general rule. I'm
not saying that should stop us from changing the rule but it was kind of
nice.

 Only we can't do that because we need to hold s_umount until the
 bdi_queue_work() worker has done its work.
 
 The fact that a call to -write_begin can randomly return with s_umount
 held, to be randomly released at some random time in the future is a
 bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
 thing.
  I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
variants) does:
bdi_queue_work(sb-s_bdi, work);
wait_for_completion(done);
  So we return only after all the IO has been submitted and unlock s_umount
in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves
because we are holding i_mutex and we need to get and put references
to other inodes while doing writeback (those would be really horrible lock
dependencies - writeback thread can put the last reference to an unlinked
inode...).

In fact, as I'm speaking about it, pushing things to writeback thread and
waiting on the work does not help a bit with the locking issues (we didn't
wait for the work previously but that had other issues). Bug, sigh.

What might be better interface for usecases like above is to allow
filesystem to kick flusher thread to start doing background writeback
(regardless of dirty limits). Then the caller can wait for some delayed
allocation reservations to get freed (easy enough to check in
-writepage() and wake the waiters) - possibly with a reasonable timeout
so that we don't stall forever.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-16 Thread Jan Kara
On Tue 16-11-10 12:16:11, Jan Kara wrote:
 On Mon 15-11-10 12:05:18, Josef Bacik wrote:
  diff --git a/fs/open.c b/fs/open.c
  index 4197b9e..ab8dedf 100644
  --- a/fs/open.c
  +++ b/fs/open.c
  @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t 
  offset, loff_t len)
  return -EINVAL;
   
  /* Return error if mode is not supported */
  -   if (mode  !(mode  FALLOC_FL_KEEP_SIZE))
  +   if (mode  (mode  ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
   Why not just:
 if (mode  ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ?
  And BTW, since FALLOC_FL_PUNCH_HOLE does not change the file size, should
not we enforce that FALLOC_FL_KEEP_SIZE is / is not set? I don't mind too
much which way but keeping it ambiguous (ignored) in the interface usually
proves as a bad idea in future when we want to further extend the interface...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/6] Ocfs2: handle hole punching via fallocate properly

2010-11-16 Thread Jan Kara
On Mon 15-11-10 12:05:20, Josef Bacik wrote:
 This patch just makes ocfs2 use its UNRESERVP ioctl when we get the hole punch
 flag in fallocate.  I didn't test it, but it seems simple enough.  Thanks,
 
 Signed-off-by: Josef Bacik jo...@redhat.com
  You might want to directly CC Joel Becker joel.bec...@oracle.com who
maintains OCFS2. Otherwise the patch looks OK so you can add
  Acked-by: Jan Kara j...@suse.cz
for what it's worth ;).

Honza
 ---
  fs/ocfs2/file.c |   10 --
  1 files changed, 8 insertions(+), 2 deletions(-)
 
 diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
 index 77b4c04..181ae52 100644
 --- a/fs/ocfs2/file.c
 +++ b/fs/ocfs2/file.c
 @@ -1992,6 +1992,7 @@ static long ocfs2_fallocate(struct inode *inode, int 
 mode, loff_t offset,
   struct ocfs2_super *osb = OCFS2_SB(inode-i_sb);
   struct ocfs2_space_resv sr;
   int change_size = 1;
 + int cmd = OCFS2_IOC_RESVSP64;
  
   if (!ocfs2_writes_unwritten_extents(osb))
   return -EOPNOTSUPP;
 @@ -2002,12 +2003,17 @@ static long ocfs2_fallocate(struct inode *inode, int 
 mode, loff_t offset,
   if (mode  FALLOC_FL_KEEP_SIZE)
   change_size = 0;
  
 + if (mode  FALLOC_FL_PUNCH_HOLE) {
 + cmd = OCFS2_IOC_UNRESVSP64;
 + change_size = 0;
 + }
 +
   sr.l_whence = 0;
   sr.l_start = (s64)offset;
   sr.l_len = (s64)len;
  
 - return __ocfs2_change_file_space(NULL, inode, offset,
 -  OCFS2_IOC_RESVSP64, sr, change_size);
 + return __ocfs2_change_file_space(NULL, inode, offset, cmd, sr,
 +  change_size);
  }
  
  int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
 -- 
 1.6.6.1
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] Ext4: fail if we try to use hole punch

2010-11-16 Thread Jan Kara
On Mon 15-11-10 12:05:21, Josef Bacik wrote:
 Ext4 doesn't have the ability to punch holes yet, so make sure we return
 EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
 be added later.  Thanks,
 
 Signed-off-by: Josef Bacik jo...@redhat.com
  Acked-by: Jan Kara j...@suse.cz

Honza
 ---
  fs/ext4/extents.c |4 
  1 files changed, 4 insertions(+), 0 deletions(-)
 
 diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
 index 0554c48..35bca73 100644
 --- a/fs/ext4/extents.c
 +++ b/fs/ext4/extents.c
 @@ -3622,6 +3622,10 @@ long ext4_fallocate(struct inode *inode, int mode, 
 loff_t offset, loff_t len)
   struct ext4_map_blocks map;
   unsigned int credits, blkbits = inode-i_blkbits;
  
 + /* We only support the FALLOC_FL_KEEP_SIZE mode */
 + if (mode  (mode != FALLOC_FL_KEEP_SIZE))
 + return -EOPNOTSUPP;
 +
   /*
* currently supporting (pre)allocate mode for extent-based
* files _only_
 -- 
 1.6.6.1
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-16 Thread Jan Kara
On Tue 16-11-10 22:00:58, Nick Piggin wrote:
 I saw a lock order warning on ext4 trigger. This should solve it.
 Raciness shouldn't matter much, because writeback can stop just
 after we make the test and return anyway (so the API is racy anyway).
  Hmm, for now the fix is OK. Ultimately, we probably want to call
writeback_inodes_sb() directly from all the callers. They all just want to
reduce uncertainty of delayed allocation reservations by writing delayed
data and actually wait for some of the writeback to happen before they
retry again the allocation.

Although the callers generally cannot get umount_sem because they hold
other locks, they have the superblock well pinned so grabbing umount_sem
makes sense mostly to make assertions happy. But as I'm thinking about it,
trylock *is* maybe the right answer to this anyway...

So
Acked-by: Jan Kara j...@suse.cz

Honza
 
 Signed-off-by: Nick Piggin npig...@kernel.dk
 
 Index: linux-2.6/fs/fs-writeback.c
 ===
 --- linux-2.6.orig/fs/fs-writeback.c  2010-11-16 21:44:32.0 +1100
 +++ linux-2.6/fs/fs-writeback.c   2010-11-16 21:49:37.0 +1100
 @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb);
   *
   * Invoke writeback_inodes_sb if no writeback is currently underway.
   * Returns 1 if writeback was started, 0 if not.
 + *
 + * May be called inside i_lock. May not start writeback if locks cannot
 + * be acquired.
   */
  int writeback_inodes_sb_if_idle(struct super_block *sb)
  {
   if (!writeback_in_progress(sb-s_bdi)) {
 - down_read(sb-s_umount);
 - writeback_inodes_sb(sb);
 - up_read(sb-s_umount);
 - return 1;
 - } else
 - return 0;
 + if (down_read_trylock(sb-s_umount)) {
 + writeback_inodes_sb(sb);
 + up_read(sb-s_umount);
 + return 1;
 + }
 + }
 + return 0;
  }
  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
  
 @@ -1145,17 +1149,21 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
   *
   * Invoke writeback_inodes_sb if no writeback is currently underway.
   * Returns 1 if writeback was started, 0 if not.
 + *
 + * May be called inside i_lock. May not start writeback if locks cannot
 + * be acquired.
   */
  int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
  unsigned long nr)
  {
   if (!writeback_in_progress(sb-s_bdi)) {
 - down_read(sb-s_umount);
 - writeback_inodes_sb_nr(sb, nr);
 - up_read(sb-s_umount);
 - return 1;
 - } else
 - return 0;
 + if (down_read_trylock(sb-s_umount)) {
 + writeback_inodes_sb_nr(sb, nr);
 + up_read(sb-s_umount);
 + return 1;
 + }
 + }
 + return 0;
  }
  EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
  
 --
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-09 Thread Jan Kara
On Tue 09-11-10 16:41:47, Ted Ts'o wrote:
 On Tue, Nov 09, 2010 at 03:42:42PM +1100, Dave Chinner wrote:
  Implementation is up to the filesystem. However, XFS does (b)
  because:
  
  1) it was extremely simple to implement (one of the
 advantages of having an exceedingly complex allocation
 interface to begin with :P)
  2) conversion is atomic, fast and reliable
  3) it is independent of the underlying storage; and
  4) reads of unwritten extents operate at memory speed,
 not disk speed.
 
 Yeah, I was thinking that using a device-style TRIM might be better
 since future attempts to write to it won't require a separate seek to
 modify the extent tree.  But yeah, there are a bunch of advantages of
 simply mutating the extent tree.
 
 While we're on the subject of changes to fallocate, what do people
 think of FALLOC_FL_EXPOSE_OLD_DATA, which requires either root
 privileges or (if capabilities are in use) CAP_DAC_OVERRIDE 
 CAP_MAC_OVERRIDE  CAP_SYS_ADMIN.  This would allow a trusted process
 to fallocate blocks with the extent already marked initialized.  I've
 had two requests for such functionality for ext4 already.  
 
 (Take for example a trusted cluster filesystem backend that checks the
 object checksum before returning any data to the user; and if the
 check fails the cluster file system will try to use some other replica
 stored on some other server.)
  Hum, could you elaborate a bit? I fail to see how above fallocate() flag
could be used to help solving this problem... Just curious...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Wed 29-09-10 10:19:36, Christoph Hellwig wrote:
 ---
 From: Christoph Hellwig h...@lst.de
 Subject: [PATCH] writeback: always use sb-s_bdi for writeback purposes
 
...
 The one exception for now is the block device filesystem which really
 wants different writeback contexts for it's different (internal) inodes
 to handle the writeout more efficiently.  For now we do this with
 a hack in fs-writeback.c because we're so late in the cycle, but in
 the future I plan to replace this with a superblock method that allows
 for multiple writeback contexts per filesystem.
  Another exception I know about is mtd_inodefs filesystem
(drivers/mtd/mtdchar.c).

 Signed-off-by: Christoph Hellwig h...@lst.de
 
 Index: linux-2.6/fs/fs-writeback.c
 ===
 --- linux-2.6.orig/fs/fs-writeback.c  2010-09-29 16:58:41.750557721 +0900
 +++ linux-2.6/fs/fs-writeback.c   2010-09-29 17:11:35.040557719 +0900
 @@ -72,22 +72,10 @@ int writeback_in_progress(struct backing
  static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
  {
   struct super_block *sb = inode-i_sb;
 - struct backing_dev_info *bdi = inode-i_mapping-backing_dev_info;
  
 - /*
 -  * For inodes on standard filesystems, we use superblock's bdi. For
 -  * inodes on virtual filesystems, we want to use inode mapping's bdi
 -  * because they can possibly point to something useful (think about
 -  * block_dev filesystem).
 -  */
 - if (sb-s_bdi  sb-s_bdi != noop_backing_dev_info) {
 - /* Some device inodes could play dirty tricks. Catch them... */
 - WARN(bdi != sb-s_bdi  bdi_cap_writeback_dirty(bdi),
 - Dirtiable inode bdi %s != sb bdi %s\n,
 - bdi-name, sb-s_bdi-name);
 - return sb-s_bdi;
 - }
 - return bdi;
 + if (strcmp(sb-s_type-name, bdev) == 0)
 + return inode-i_mapping-backing_dev_info;
 + return sb-s_bdi;
  So at least here you'd need also add a similar exception for
mtd_inodefs. Because of these exeptions I've chosen the
(sb-s_bdi  sb-s_bdi != noop_backing_dev_info) check rather than your
exception based check. All in all I don't care much what ends up in the
kernel as it's just a temporary solution...
  Also I've added the warning to catch situations where inodes would get
filed to a different backing device after the patch. So far the reported
warnings were harmless but still I'm more comfortable when it's there
because otherwise we can so easily miss some device-driver-invented
filesystem like mtd_inodefs which would break silently after the change...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Mon 27-09-10 20:55:45, Cesar Eduardo Barros wrote:
 Em 27-09-2010 19:25, Jan Kara escreveu:
 [Added CCs for similar ecryptfs warning]
 On Thu 23-09-10 12:38:49, Andrew Morton wrote:
 [...]
 device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342
 /dev/mapper/vg_cesarbinspiro-lv_home
 SELinux: initialized (dev dm-3, type btrfs), uses xattr
 [ cut here ]
 WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
 Hardware name: Inspiron N4010
 Dirtiable inode bdi default != sb bdi btrfs
That suggests that we should probably handle such cases in a more generic
 way by changing the code in inode_init_always(). The patch below makes at
 least btrfs happy for me... Could you maybe test it? Thanks.
 
 Applied on top of v2.6.36-rc5-151-g32163f4, running it right now.
 The warning messages no longer happen, and everything seems to be
 working fine.
 
 Tested-by: Cesar Eduardo Barros ces...@cesarb.net
  Great, thanks for testing.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Wed 29-09-10 16:10:06, Christoph Hellwig wrote:
 On Wed, Sep 29, 2010 at 02:18:08PM +0200, Jan Kara wrote:
  On Wed 29-09-10 10:19:36, Christoph Hellwig wrote:
   ---
   From: Christoph Hellwig h...@lst.de
   Subject: [PATCH] writeback: always use sb-s_bdi for writeback purposes
   
  ...
   The one exception for now is the block device filesystem which really
   wants different writeback contexts for it's different (internal) inodes
   to handle the writeout more efficiently.  For now we do this with
   a hack in fs-writeback.c because we're so late in the cycle, but in
   the future I plan to replace this with a superblock method that allows
   for multiple writeback contexts per filesystem.
Another exception I know about is mtd_inodefs filesystem
  (drivers/mtd/mtdchar.c).
 
 No, it's not.  MTD only has three different backing_dev_info instances
 which have different flags in the mapping-relevant portion of the
 backing_dev. 
  In the end I agree I was probably wrong but it's not that simple ;)

So at least here you'd need also add a similar exception for
  mtd_inodefs.
 
 No.  For one thing we don't need any exception for correctnes alone -
 even the block device variant would work fine with the default case.
Here I don't agree. If you don't have some kind of exception, sb-s_bdi
for both block and mtd_inodefs filesystems points to
noop_backing_dev_info and you get no writeback for that one. So it isn't
just a performance issue but also a correctness one.

Regarding mtd_inodefs I now looked in more detail what MTD actually does
and it seems to me that MTD device inodes do not seem to carry any
cached state that flusher threads could write back. So returning
noop_backing_dev_info might be the right thing for them after all...
(added David Woodhouse and MTD list to CC so that they can shout if it's
not the case). Coming to this conclusion, I'm happy with your patch going
in as is...
Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-27 Thread Jan Kara
[Added CCs for similar ecryptfs warning]
On Thu 23-09-10 12:38:49, Andrew Morton wrote:
  This started appearing for me on v2.6.36-rc5-49-gc79bd89; it did not 
  happen on v2.6.36-rc5-33-g1ce1e41, probably because it does not have 
  commit 692ebd17c2905313fff3c504c249c6a0faad16ec which introduces the 
  warning.
  [...]
  device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 
  /dev/mapper/vg_cesarbinspiro-lv_home
  SELinux: initialized (dev dm-3, type btrfs), uses xattr
  [ cut here ]
  WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
  Hardware name: Inspiron N4010
  Dirtiable inode bdi default != sb bdi btrfs
  Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb 
  snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel 
  iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq 
  snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb 
  cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop snd 
  pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore snd_page_alloc 
  rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd aes_x86_64 
  aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper drm 
  i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
  Pid: 1073, comm: find Not tainted 2.6.36-rc5+ #8
  Call Trace:
[8104d0e4] warn_slowpath_common+0x85/0x9d
[8104d19f] warn_slowpath_fmt+0x46/0x48
[811308b7] inode_to_bdi+0x62/0x6d
[81131b48] __mark_inode_dirty+0xd0/0x177
[81127168] touch_atime+0x107/0x12a
[81122384] ? filldir+0x0/0xd0
[8112259b] vfs_readdir+0x8d/0xb4
[8112270b] sys_getdents+0x81/0xd1
[81009c72] system_call_fastpath+0x16/0x1b
  Thanks for the report. These bdi pointers are a mess. As Chris pointed
out, btrfs forgets to properly initialize inode-i_mapping.backing_dev_info
for directories and special inodes and thus these were previously attached
to default_backing_dev_info which probably isn't what Chris would like to
see.
  I've also got a similar report for ecryptfs which also does not
initialize inode-i_mapping.backing_dev_info although it sets sb-s_bdi and
thus again its inodes get filed to default_backing_dev_info lists.  Quick
search seems to reveal that other filesystems using handcrafted bdi's get
it wrong as well and thus their inodes end up in the default_backing_dev_info
lists which is generally undesirable (this was happening already before my
patch, my code just started complaining about that).
  That suggests that we should probably handle such cases in a more generic
way by changing the code in inode_init_always(). The patch below makes at
least btrfs happy for me... Could you maybe test it? Thanks.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
---

From 29f60c2b08ff9637a10439d1513805835ddcc746 Mon Sep 17 00:00:00 2001
From: Jan Kara j...@suse.cz
Date: Mon, 27 Sep 2010 23:56:48 +0200
Subject: [PATCH] bdi: Initialize inode-i_mapping.backing_dev_info to sb-s_bdi

Currently, we initialize inode-i_mapping.backing_dev_info to the bdi of device
sb-s_bdev points to. However there is quite a big number of filesystems that
do not set sb-s_bdev (because they do not have one) but do set sb-s_bdi.
These filesystems would generally benefit from setting
inode-i_mapping.backing_dev_info to their s_bdi because otherwise their inodes
would point to default_backing_dev_info and thus dirty inode tracking would
happen there. So change inode initialization code to use sb-s_bdi if it
is available.

Signed-off-by: Jan Kara j...@suse.cz
---
 fs/inode.c |   22 ++
 1 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 8646433..e415be4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -172,15 +172,21 @@ int inode_init_always(struct super_block *sb, struct 
inode *inode)
mapping-writeback_index = 0;
 
/*
-* If the block_device provides a backing_dev_info for client
-* inodes then use that.  Otherwise the inode share the bdev's
-* backing_dev_info.
+* If the filesystem provides a backing_dev_info for client inodes
+* then use that. Otherwise inodes share default_backing_dev_info.
 */
-   if (sb-s_bdev) {
-   struct backing_dev_info *bdi;
-
-   bdi = sb-s_bdev-bd_inode-i_mapping-backing_dev_info;
-   mapping-backing_dev_info = bdi;
+   if (sb-s_bdi  sb-s_bdi != noop_backing_dev_info) {
+   /*
+* Catch cases where filesystem might be bitten by using s_bdi
+* instead of sb-s_bdev. Can be removed in 2.6.38.
+*/
+   if (sb-s_bdev) {
+   struct backing_dev_info *bdi =
+ sb-s_bdev-bd_inode-i_mapping-backing_dev_info;
+   WARN

Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-27 Thread Jan Kara
On Mon 27-09-10 18:54:52, Chris Mason wrote:
 On Tue, Sep 28, 2010 at 12:25:48AM +0200, Jan Kara wrote:
  [Added CCs for similar ecryptfs warning]
  On Thu 23-09-10 12:38:49, Andrew Morton wrote:
This started appearing for me on v2.6.36-rc5-49-gc79bd89; it did not 
happen on v2.6.36-rc5-33-g1ce1e41, probably because it does not have 
commit 692ebd17c2905313fff3c504c249c6a0faad16ec which introduces the 
warning.
[...]
device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 
/dev/mapper/vg_cesarbinspiro-lv_home
SELinux: initialized (dev dm-3, type btrfs), uses xattr
[ cut here ]
WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
Hardware name: Inspiron N4010
Dirtiable inode bdi default != sb bdi btrfs
Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb 
snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel 
iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq 
snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb 
cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop snd 
pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore snd_page_alloc 
rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd aes_x86_64 
aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper drm 
i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
Pid: 1073, comm: find Not tainted 2.6.36-rc5+ #8
Call Trace:
  [8104d0e4] warn_slowpath_common+0x85/0x9d
  [8104d19f] warn_slowpath_fmt+0x46/0x48
  [811308b7] inode_to_bdi+0x62/0x6d
  [81131b48] __mark_inode_dirty+0xd0/0x177
  [81127168] touch_atime+0x107/0x12a
  [81122384] ? filldir+0x0/0xd0
  [8112259b] vfs_readdir+0x8d/0xb4
  [8112270b] sys_getdents+0x81/0xd1
  [81009c72] system_call_fastpath+0x16/0x1b
Thanks for the report. These bdi pointers are a mess. As Chris pointed
  out, btrfs forgets to properly initialize inode-i_mapping.backing_dev_info
  for directories and special inodes and thus these were previously attached
  to default_backing_dev_info which probably isn't what Chris would like to
  see.
 
 There's no actual writeback for these, so it works fine for btrfs either
 way.
  Ah, OK. I see you do all the writing already at inode dirtying time so
you don't really care what happens after the dirtying for inodes without
page cache. So it's really just a cosmetic issue for btrfs.

I've also got a similar report for ecryptfs which also does not
  initialize inode-i_mapping.backing_dev_info although it sets sb-s_bdi and
  thus again its inodes get filed to default_backing_dev_info lists.  Quick
  search seems to reveal that other filesystems using handcrafted bdi's get
  it wrong as well and thus their inodes end up in the 
  default_backing_dev_info
  lists which is generally undesirable (this was happening already before my
  patch, my code just started complaining about that).
That suggests that we should probably handle such cases in a more generic
  way by changing the code in inode_init_always(). The patch below makes at
  least btrfs happy for me... Could you maybe test it? Thanks.
 
 Christoph had a slightly different plan for this, I've cc'd him and kept
 the patch below for comment.
  OK, thanks.

Honza
  ---
  
  From 29f60c2b08ff9637a10439d1513805835ddcc746 Mon Sep 17 00:00:00 2001
  From: Jan Kara j...@suse.cz
  Date: Mon, 27 Sep 2010 23:56:48 +0200
  Subject: [PATCH] bdi: Initialize inode-i_mapping.backing_dev_info to 
  sb-s_bdi
  
  Currently, we initialize inode-i_mapping.backing_dev_info to the bdi of 
  device
  sb-s_bdev points to. However there is quite a big number of filesystems 
  that
  do not set sb-s_bdev (because they do not have one) but do set sb-s_bdi.
  These filesystems would generally benefit from setting
  inode-i_mapping.backing_dev_info to their s_bdi because otherwise their 
  inodes
  would point to default_backing_dev_info and thus dirty inode tracking would
  happen there. So change inode initialization code to use sb-s_bdi if it
  is available.
  
  Signed-off-by: Jan Kara j...@suse.cz
  ---
   fs/inode.c |   22 ++
   1 files changed, 14 insertions(+), 8 deletions(-)
  
  diff --git a/fs/inode.c b/fs/inode.c
  index 8646433..e415be4 100644
  --- a/fs/inode.c
  +++ b/fs/inode.c
  @@ -172,15 +172,21 @@ int inode_init_always(struct super_block *sb, struct 
  inode *inode)
  mapping-writeback_index = 0;
   
  /*
  -* If the block_device provides a backing_dev_info for client
  -* inodes then use that.  Otherwise the inode share the bdev's
  -* backing_dev_info.
  +* If the filesystem provides a backing_dev_info for client inodes
  +* then use that. Otherwise inodes share default_backing_dev_info

Re: [patch v2 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-09-02 Thread Jan Kara
On Thu 02-09-10 09:59:13, Jiri Slaby wrote:
 On 09/02/2010 03:02 AM, David Rientjes wrote:
  --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -334,6 +334,57
  @@ static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
  return kmalloc_node(size, flags | __GFP_ZERO, node); }
   
  +/** + * kmalloc_nofail - infinitely loop until kmalloc() succeeds.  +
  * @size: how many bytes of memory are required.  + * @flags: the type
  of memory to allocate (see kmalloc).  + * + * NOTE: no new callers of
  this function should be implemented!  + * All memory allocations should
  be failable whenever possible.  + */ +static inline void
  *kmalloc_nofail(size_t size, gfp_t flags) +{ +  void *ret; + +  for
  (;;) { +ret = kmalloc(size, flags); +   if (ret) +
  return ret; +   WARN_ON_ONCE(get_order(size) 
  PAGE_ALLOC_COSTLY_ORDER);
 
 This doesn't work as you expect. kmalloc will warn every time it fails.
 __GFP_NOFAIL used to disable the warning. Actually what's wrong with
 __GFP_NOFAIL? I cannot find a reason in the changelogs why the patches
 are needed.
  David should probably add the reasoning to the changelogs so that he
doesn't have to explain again and again ;). But if I understood it
correctly, the concern is that the looping checks slightly impact fast path
of the callers which do not need it. Generally, also looping for a long
time inside allocator isn't a nice thing but some callers aren't able to do
better for now to the patch is imperfect in this sence...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Usefulness of max_transid in the tree search?

2010-08-20 Thread Jan Kara
  Hi,

  I just had a look at how btrfs ioctl searches the tree for new items and
while looking at the code I've noticed that maximum transaction id as
specified in the search is rather useless.
  If I understand right, transaction id is stored in a header of each node.
Thus when any item of the node gets modified, new transaction id gets
recorded. So when you give the search some maximum transaction id, you will
not receive items in nodes where some item has been modified in a later
transaction. That seems like a rather useless set of items to me... Or am I
missing something?

Honza
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Do not return more items that user asked from from search ioctl

2010-08-20 Thread Jan Kara
While searching a tree we didn't properly check number of items we really
stored in user's buffer thus possibly exceeding number of items requested
by user. This was mostly harmless since actual buffer overflow is checked
correctly in a different place. Anyway, let's fix the check.

Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/ioctl.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9254b3d..94e7ab5 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -977,7 +977,7 @@ static noinline int copy_to_sk(struct btrfs_root *root,
}
found++;
 
-   if (*num_found = sk-nr_items)
+   if (*num_found + found = sk-nr_items)
break;
}
 advance_key:
-- 
1.6.4.2

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


<    1   2   3