ioctl #21 kernel patch that (1) actually works, to wait for dead_roots cleaning and (2) provides framework for adding facility for user-space waiting for anything else deferred

2011-01-16 Thread David Nicol
Here is the latest ioctl #21 kernel patch;
it  (1) actually works, unlike the last one, which only waited until
the to-do list was emptied at the beginning of the cleaning function,
which wasn't actually helpful
and (2) provides framework for adding facility for user-space waiting
for anything else deferred, with deferred iputs penciled in as the
next thing to allow waiting for -- which probably currently suffers
from a similar problem with the to-do list getting emptied before the
work is done -- but which is out of the scope of the sponsor's
requirements.  Another pair of atomics might be required to make that
work right too. I don't know if waiting for deferred iputs is holding
anyone or anything back.

Going forward, I imagine ioctl #21, with it's flags field, to be the
ioctl that newly deferred and backgrounded activities use to allow
waiting for their completion. This means that any move to background
something would include extending the BICW_TEST macro to include
testing for completion of the deferred activity, adjusting the maximum
flag value in strict mode, and adding more wakeup calls around the
code that handles the new deferred thing.

A flags field of 0xFFFA should always mean wait for completion of
everything you know how to wait for.

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29c2009..250f2f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -895,6 +895,8 @@ struct btrfs_fs_info {
struct list_head trans_list;
struct list_head hashers;
struct list_head dead_roots;
+   atomic_t dead_roots_cleaners ;
+   wait_queue_head_t cleaner_notification_registration;
struct list_head caching_block_groups;

spinlock_t delayed_iput_lock;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 34f7c37..d2741c5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1450,7 +1450,9 @@ static int cleaner_kthread(void *arg)
if (!(root-fs_info-sb-s_flags  MS_RDONLY) 
mutex_trylock(root-fs_info-cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
+   
wake_up_all(root-fs_info-cleaner_notification_registration);
btrfs_clean_old_snapshots(root);
+   
wake_up_all(root-fs_info-cleaner_notification_registration);
mutex_unlock(root-fs_info-cleaner_mutex);
}

@@ -1581,6 +1583,8 @@ struct btrfs_root *open_ctree(struct super_block *sb,
INIT_RADIX_TREE(fs_info-fs_roots_radix, GFP_ATOMIC);
INIT_LIST_HEAD(fs_info-trans_list);
INIT_LIST_HEAD(fs_info-dead_roots);
+   atomic_set(fs_info-dead_roots_cleaners,0);
+   init_waitqueue_head(fs_info-cleaner_notification_registration);
INIT_LIST_HEAD(fs_info-delayed_iputs);
INIT_LIST_HEAD(fs_info-hashers);
INIT_LIST_HEAD(fs_info-delalloc_inodes);
@@ -2393,7 +2397,9 @@ int btrfs_commit_super(struct btrfs_root *root)

mutex_lock(root-fs_info-cleaner_mutex);
btrfs_run_delayed_iputs(root);
+   wake_up_all(root-fs_info-cleaner_notification_registration);
btrfs_clean_old_snapshots(root);
+   wake_up_all(root-fs_info-cleaner_notification_registration);
mutex_unlock(root-fs_info-cleaner_mutex);

/* wait until ongoing cleanup work done */
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9254b3d..b669f19 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1212,6 +1212,66 @@ static noinline int
btrfs_ioctl_ino_lookup(struct file *file,
return ret;
 }

+static int btrfs_ioctl_cleaner_wait(struct btrfs_root *root, void __user *arg)
+{
+struct btrfs_ioctl_cleaner_wait_args *bicwa;
+long remainingjiffies;
+int err;
+
+   bicwa = memdup_user(arg, sizeof(*bicwa));
+   if (IS_ERR(bicwa))
+   return PTR_ERR(bicwa);
+
+err = -EINVAL;
+
+/* flag bit 0x04 means, error on unknown flag.
+   the highest possible valid flag value at this rev is 0x07. */
+if (((bicwa-flags  4) == 4) ( bicwa-flags  7) ){
+   kfree(bicwa);
+return err;
+};
+
+#define BICW_TEST ( \
+   /* flag bit 0x01: suppress wait for dead_roots */\
+   (bicwa-flags  1 == 1 || ( list_empty(root-fs_info-dead_roots))  \
+   ( atomic_read(root-fs_info-dead_roots_cleaners) == 0 ))  \
+\
+   /* flag bit 0x02: wait for delayed iputs */  \
+ (bicwa-flags  2 == 0 || list_empty(root-fs_info-delayed_iputs)) \
+\
+   /* 0x04 is consumed earlier */   \
+   /* add the next one at 0x08 */   \
+)
+
+if (bicwa-ms  0)
+   {
+unsigned long millisecs = bicwa-ms;
+

Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs

2011-01-16 Thread Shaohua Li
On Sun, 2011-01-16 at 11:38 +0800, Wu, Fengguang wrote:
 On Wed, Jan 12, 2011 at 10:55:16AM +0800, Li, Shaohua wrote:
  On Tue, Jan 11, 2011 at 05:13:53PM +0800, Wu, Fengguang wrote:
   On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
 On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
  On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
   On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
 Shaohua,

 On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
  Hi,
We have file readahead to do asyn file read, but has no 
  metadata
  readahead. For a list of files, their metadata is stored in 
  fragmented
  disk space and metadata read is a sync operation, which 
  impacts the
  efficiency of readahead much. The patches try to add 
  meatadata readahead
  for btrfs.
In btrfs, metadata is stored in btree_inode. Ideally, if 
  we could hook
  the inode to a fd so we could use existing syscalls 
  (readahead, mincore
  or upcoming fincore) to do readahead, but the inode is 
  hidden, there is
  no easy way for this from my understanding. So we add two 
  ioctls for

 If that is the main obstacle, why not do straightforward 
 fincore()/
 fadvise(), and add ioctls to btrfs to export/grab the hidden
 btree_inode in any form?  This will address btrfs' specific 
 issue, and
 have the benefit of making the VFS part general enough. You 
 know
 ext2/3/4 already have block_dev ready for metadata readahead.
I forgot to update this comment. Please see patch 2 and patch 
4, both
incore and readahead need btrfs specific staff involved, so we 
can't use
generic fincore or something.
  
   You can if you like :)
  
   - fincore() can return the referenced bit, which is generally
 useful information
  metadata page in ext2/3 doesn't have reference bit set, while btrfs 
  has.
  we can't blindly filter out such pages with the bit.

 block_dev inodes have the accessed bits. Look at the below output.

 /dev/sda5 is a mounted ext4 partition.  The 'A'/'R' in the
 dump_page_cache lines stand for Active/Referenced.
ext4 already does readahead? please check other filesystems.
   
   ext3/4 does readahead on accessing large directories. However that's
   orthogonal feature to the user space metadata readahead. The latter is
   still important for fast boot on ext3/4.
   
filesystem sues bread like API to read metadata, which definitely
doesn't set referenced bit.
   
   __find_get_block() will call touch_buffer() which is a synonymous for
   mark_page_accessed().
  yes, but only when the buffer is accessed at the second time.
 
 Not likely. Otherwise it would be a performance bug.
 
 __getblk() has two code paths, both will call touch_buffer().
 
 a)
 __find_get_block()
 touch_buffer()
 b)
 __getblk_slow
 __find_get_block()
 touch_buffer()
I missed this, sorry.
  fincore can takes a parameter or it returns a bit to distinguish
  referenced pages, but I don't think it's a good API. This should be
  transparent to userspace.

 Users care about the cached status may well be interested in the
 active/referenced status. They are co-related information. fincore()
 won't be a simple replication of mincore() anyway. fincore() has to
 deal with huge sparsely accessed files. The accessed bits of a file
 page are normally more meaningful than the accessed bits of mapped
 (anonymous) pages.
if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't 
generic enough.
   
   It's a reasonable thing to set the accessed bits. So I believe the
   various filesystems are calling mark_page_accessed() on their metadata
   inode, or can be changed to do it.
  yes, we can, with a lot of pain. And filesystems must be smart to avoid 
  marking the bit
  for pages which are readahead in but actually are invalid. The second patch 
  in the series
 
 invalid means !PG_uptodate? I wonder why there is a need to test
 that bit at all. !PG_uptodate seems an unrelated transitional state.
not PG_update, it's referenced bit. A readahead metadata page will have update 
bit set,
but it might not have referenced bit if it's an obsolete page. btrfs
doesn't use the buffer_head

  has more detailed infomation about this issue. The problem is if this is 
  really worthy
  for metadata readahead. Some filesystems might don't care about metadata 
  readahead. If
  we make fincore check the bit, then fincore syscall will 

atime updates stuck under btrfs_dirty_inode

2011-01-16 Thread Andy Isaacson
Hi all,

I'm seeing processes block (for minutes or longer) in directory access
on btrfs.  I'm running

commit ccda756d13f13a9276edc8d53d05aaecbf3c4baa
Merge: 0c21e3a 65e5341
Author: Andy Isaacson a...@hexapodia.org
Date:   Sun Jan 9 12:37:23 2011 -0800

Merge branch 'master' of
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable

on a quad core Core i7.

There seem to be a few failure cases.  My most recent ls has been stuck
like this for several hours, on a directory that should have somewhere
between 6 and 50 entries:

% ps u 15028
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
adi  15028  0.0  0.0  27032  1204 pts/8S+   22:18   0:00 ls -l /btr/foo
% cat /proc/15028/stack
[a013af78] shrink_delalloc+0x112/0x147 [btrfs]
[a013b09f] reserve_metadata_bytes+0xf2/0x1a8 [btrfs]
[a013b2d0] btrfs_block_rsv_add+0x28/0x49 [btrfs]
[a013b33e] btrfs_trans_reserve_metadata+0x4d/0x75 [btrfs]
[a014b4d4] start_transaction+0x13e/0x1d8 [btrfs]
[a014b5c6] btrfs_start_transaction+0x13/0x15 [btrfs]
[a014f9db] btrfs_dirty_inode+0x6c/0xfd [btrfs]
[8111d82f] __mark_inode_dirty+0x2b/0x1ab
[8111297a] touch_atime+0x10e/0x131
[8110d49d] vfs_readdir+0x79/0xa0
[8110d60d] sys_getdents+0x81/0xd1
[81009e97] tracesys+0xd9/0xde
[] 0x

Sometimes, if I poll fast enough I see this stack instead:

% cat /proc/15028/stack
[a0149b2d] wait_for_commit+0x99/0xe2 [btrfs]
[a014ac20] btrfs_commit_transaction+0xfe/0x612 [btrfs]
[a014b4e7] start_transaction+0x151/0x1d8 [btrfs]
[a014b5c6] btrfs_start_transaction+0x13/0x15 [btrfs]
[a014f9db] btrfs_dirty_inode+0x6c/0xfd [btrfs]
[8111d82f] __mark_inode_dirty+0x2b/0x1ab
[8111297a] touch_atime+0x10e/0x131
[8110d49d] vfs_readdir+0x79/0xa0
[8110d60d] sys_getdents+0x81/0xd1
[81009e97] tracesys+0xd9/0xde
[] 0x

Another ls has been accumulating runtime for several days:

 7406 pts/3R+   2973:56  |   \_ ls -F

the directory it's reading has around 200 files in it, I think.

When I catch it in kernel mode its stack looks like this:

% cat /proc/7406/stack
[8111d293] writeback_inodes_sb_nr+0x76/0x7d
[8111d6b0] writeback_inodes_sb_nr_if_idle+0x41/0x55
[a013af15] shrink_delalloc+0xaf/0x147 [btrfs]
[a013b09f] reserve_metadata_bytes+0xf2/0x1a8 [btrfs]
[a013b2d0] btrfs_block_rsv_add+0x28/0x49 [btrfs]
[a013b33e] btrfs_trans_reserve_metadata+0x4d/0x75 [btrfs]
[a014b4d4] start_transaction+0x13e/0x1d8 [btrfs]
[a014b5c6] btrfs_start_transaction+0x13/0x15 [btrfs]
[a014f9db] btrfs_dirty_inode+0x6c/0xfd [btrfs]
[8111d82f] __mark_inode_dirty+0x2b/0x1ab
[8111297a] touch_atime+0x10e/0x131
[8110d49d] vfs_readdir+0x79/0xa0
[8110d60d] sys_getdents+0x81/0xd1
[81009c82] system_call_fastpath+0x16/0x1b
[] 0x

vmstat is rather odd:

% vmstat 1
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  0  0 124348 177388 1127622800   1572430  2 10 87  1
 1  0  0 122108 177388 1127854000   536   112 2870 813309  2 13 83  
2
 2  0  0 119628 177388 1128085200   704 0 2938 887219  2 11 87  0
 2  0  0 117768 177388 1128268400   800 0 2936 883427  2 15 82  
1
 2  0  0 114872 177388 1128546400  1104 0 2819 883498  1  8 90  0
 1  0  0 112704 177388 1128738800   612 0 2961 880957  2 14 84  0
^C

One of the three block devices on this btrfs is reading a few hundred
blocks per second, but is nowhere near 100% busy; the CS count is
extremely high; and we have less than a full CPU spent in system time
(even though there are two ls processes blocked).

If anyone can point me to a good place to start tracking down the root
cause, I'd appreciate it.

Thanks,
-andy
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html