ioctl #21 kernel patch that (1) actually works, to wait for dead_roots cleaning and (2) provides framework for adding facility for user-space waiting for anything else deferred
Here is the latest ioctl #21 kernel patch; it (1) actually works, unlike the last one, which only waited until the to-do list was emptied at the beginning of the cleaning function, which wasn't actually helpful and (2) provides framework for adding facility for user-space waiting for anything else deferred, with deferred iputs penciled in as the next thing to allow waiting for -- which probably currently suffers from a similar problem with the to-do list getting emptied before the work is done -- but which is out of the scope of the sponsor's requirements. Another pair of atomics might be required to make that work right too. I don't know if waiting for deferred iputs is holding anyone or anything back. Going forward, I imagine ioctl #21, with it's flags field, to be the ioctl that newly deferred and backgrounded activities use to allow waiting for their completion. This means that any move to background something would include extending the BICW_TEST macro to include testing for completion of the deferred activity, adjusting the maximum flag value in strict mode, and adding more wakeup calls around the code that handles the new deferred thing. A flags field of 0xFFFA should always mean wait for completion of everything you know how to wait for. diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 29c2009..250f2f1 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -895,6 +895,8 @@ struct btrfs_fs_info { struct list_head trans_list; struct list_head hashers; struct list_head dead_roots; + atomic_t dead_roots_cleaners ; + wait_queue_head_t cleaner_notification_registration; struct list_head caching_block_groups; spinlock_t delayed_iput_lock; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 34f7c37..d2741c5 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1450,7 +1450,9 @@ static int cleaner_kthread(void *arg) if (!(root-fs_info-sb-s_flags MS_RDONLY) mutex_trylock(root-fs_info-cleaner_mutex)) { btrfs_run_delayed_iputs(root); + wake_up_all(root-fs_info-cleaner_notification_registration); btrfs_clean_old_snapshots(root); + wake_up_all(root-fs_info-cleaner_notification_registration); mutex_unlock(root-fs_info-cleaner_mutex); } @@ -1581,6 +1583,8 @@ struct btrfs_root *open_ctree(struct super_block *sb, INIT_RADIX_TREE(fs_info-fs_roots_radix, GFP_ATOMIC); INIT_LIST_HEAD(fs_info-trans_list); INIT_LIST_HEAD(fs_info-dead_roots); + atomic_set(fs_info-dead_roots_cleaners,0); + init_waitqueue_head(fs_info-cleaner_notification_registration); INIT_LIST_HEAD(fs_info-delayed_iputs); INIT_LIST_HEAD(fs_info-hashers); INIT_LIST_HEAD(fs_info-delalloc_inodes); @@ -2393,7 +2397,9 @@ int btrfs_commit_super(struct btrfs_root *root) mutex_lock(root-fs_info-cleaner_mutex); btrfs_run_delayed_iputs(root); + wake_up_all(root-fs_info-cleaner_notification_registration); btrfs_clean_old_snapshots(root); + wake_up_all(root-fs_info-cleaner_notification_registration); mutex_unlock(root-fs_info-cleaner_mutex); /* wait until ongoing cleanup work done */ diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9254b3d..b669f19 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1212,6 +1212,66 @@ static noinline int btrfs_ioctl_ino_lookup(struct file *file, return ret; } +static int btrfs_ioctl_cleaner_wait(struct btrfs_root *root, void __user *arg) +{ +struct btrfs_ioctl_cleaner_wait_args *bicwa; +long remainingjiffies; +int err; + + bicwa = memdup_user(arg, sizeof(*bicwa)); + if (IS_ERR(bicwa)) + return PTR_ERR(bicwa); + +err = -EINVAL; + +/* flag bit 0x04 means, error on unknown flag. + the highest possible valid flag value at this rev is 0x07. */ +if (((bicwa-flags 4) == 4) ( bicwa-flags 7) ){ + kfree(bicwa); +return err; +}; + +#define BICW_TEST ( \ + /* flag bit 0x01: suppress wait for dead_roots */\ + (bicwa-flags 1 == 1 || ( list_empty(root-fs_info-dead_roots)) \ + ( atomic_read(root-fs_info-dead_roots_cleaners) == 0 )) \ +\ + /* flag bit 0x02: wait for delayed iputs */ \ + (bicwa-flags 2 == 0 || list_empty(root-fs_info-delayed_iputs)) \ +\ + /* 0x04 is consumed earlier */ \ + /* add the next one at 0x08 */ \ +) + +if (bicwa-ms 0) + { +unsigned long millisecs = bicwa-ms; +
Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
On Sun, 2011-01-16 at 11:38 +0800, Wu, Fengguang wrote: On Wed, Jan 12, 2011 at 10:55:16AM +0800, Li, Shaohua wrote: On Tue, Jan 11, 2011 at 05:13:53PM +0800, Wu, Fengguang wrote: On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote: On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote: On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote: On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote: On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote: On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote: Shaohua, On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote: Hi, We have file readahead to do asyn file read, but has no metadata readahead. For a list of files, their metadata is stored in fragmented disk space and metadata read is a sync operation, which impacts the efficiency of readahead much. The patches try to add meatadata readahead for btrfs. In btrfs, metadata is stored in btree_inode. Ideally, if we could hook the inode to a fd so we could use existing syscalls (readahead, mincore or upcoming fincore) to do readahead, but the inode is hidden, there is no easy way for this from my understanding. So we add two ioctls for If that is the main obstacle, why not do straightforward fincore()/ fadvise(), and add ioctls to btrfs to export/grab the hidden btree_inode in any form? This will address btrfs' specific issue, and have the benefit of making the VFS part general enough. You know ext2/3/4 already have block_dev ready for metadata readahead. I forgot to update this comment. Please see patch 2 and patch 4, both incore and readahead need btrfs specific staff involved, so we can't use generic fincore or something. You can if you like :) - fincore() can return the referenced bit, which is generally useful information metadata page in ext2/3 doesn't have reference bit set, while btrfs has. we can't blindly filter out such pages with the bit. block_dev inodes have the accessed bits. Look at the below output. /dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the dump_page_cache lines stand for Active/Referenced. ext4 already does readahead? please check other filesystems. ext3/4 does readahead on accessing large directories. However that's orthogonal feature to the user space metadata readahead. The latter is still important for fast boot on ext3/4. filesystem sues bread like API to read metadata, which definitely doesn't set referenced bit. __find_get_block() will call touch_buffer() which is a synonymous for mark_page_accessed(). yes, but only when the buffer is accessed at the second time. Not likely. Otherwise it would be a performance bug. __getblk() has two code paths, both will call touch_buffer(). a) __find_get_block() touch_buffer() b) __getblk_slow __find_get_block() touch_buffer() I missed this, sorry. fincore can takes a parameter or it returns a bit to distinguish referenced pages, but I don't think it's a good API. This should be transparent to userspace. Users care about the cached status may well be interested in the active/referenced status. They are co-related information. fincore() won't be a simple replication of mincore() anyway. fincore() has to deal with huge sparsely accessed files. The accessed bits of a file page are normally more meaningful than the accessed bits of mapped (anonymous) pages. if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't generic enough. It's a reasonable thing to set the accessed bits. So I believe the various filesystems are calling mark_page_accessed() on their metadata inode, or can be changed to do it. yes, we can, with a lot of pain. And filesystems must be smart to avoid marking the bit for pages which are readahead in but actually are invalid. The second patch in the series invalid means !PG_uptodate? I wonder why there is a need to test that bit at all. !PG_uptodate seems an unrelated transitional state. not PG_update, it's referenced bit. A readahead metadata page will have update bit set, but it might not have referenced bit if it's an obsolete page. btrfs doesn't use the buffer_head has more detailed infomation about this issue. The problem is if this is really worthy for metadata readahead. Some filesystems might don't care about metadata readahead. If we make fincore check the bit, then fincore syscall will
atime updates stuck under btrfs_dirty_inode
Hi all, I'm seeing processes block (for minutes or longer) in directory access on btrfs. I'm running commit ccda756d13f13a9276edc8d53d05aaecbf3c4baa Merge: 0c21e3a 65e5341 Author: Andy Isaacson a...@hexapodia.org Date: Sun Jan 9 12:37:23 2011 -0800 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable on a quad core Core i7. There seem to be a few failure cases. My most recent ls has been stuck like this for several hours, on a directory that should have somewhere between 6 and 50 entries: % ps u 15028 USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND adi 15028 0.0 0.0 27032 1204 pts/8S+ 22:18 0:00 ls -l /btr/foo % cat /proc/15028/stack [a013af78] shrink_delalloc+0x112/0x147 [btrfs] [a013b09f] reserve_metadata_bytes+0xf2/0x1a8 [btrfs] [a013b2d0] btrfs_block_rsv_add+0x28/0x49 [btrfs] [a013b33e] btrfs_trans_reserve_metadata+0x4d/0x75 [btrfs] [a014b4d4] start_transaction+0x13e/0x1d8 [btrfs] [a014b5c6] btrfs_start_transaction+0x13/0x15 [btrfs] [a014f9db] btrfs_dirty_inode+0x6c/0xfd [btrfs] [8111d82f] __mark_inode_dirty+0x2b/0x1ab [8111297a] touch_atime+0x10e/0x131 [8110d49d] vfs_readdir+0x79/0xa0 [8110d60d] sys_getdents+0x81/0xd1 [81009e97] tracesys+0xd9/0xde [] 0x Sometimes, if I poll fast enough I see this stack instead: % cat /proc/15028/stack [a0149b2d] wait_for_commit+0x99/0xe2 [btrfs] [a014ac20] btrfs_commit_transaction+0xfe/0x612 [btrfs] [a014b4e7] start_transaction+0x151/0x1d8 [btrfs] [a014b5c6] btrfs_start_transaction+0x13/0x15 [btrfs] [a014f9db] btrfs_dirty_inode+0x6c/0xfd [btrfs] [8111d82f] __mark_inode_dirty+0x2b/0x1ab [8111297a] touch_atime+0x10e/0x131 [8110d49d] vfs_readdir+0x79/0xa0 [8110d60d] sys_getdents+0x81/0xd1 [81009e97] tracesys+0xd9/0xde [] 0x Another ls has been accumulating runtime for several days: 7406 pts/3R+ 2973:56 | \_ ls -F the directory it's reading has around 200 files in it, I think. When I catch it in kernel mode its stack looks like this: % cat /proc/7406/stack [8111d293] writeback_inodes_sb_nr+0x76/0x7d [8111d6b0] writeback_inodes_sb_nr_if_idle+0x41/0x55 [a013af15] shrink_delalloc+0xaf/0x147 [btrfs] [a013b09f] reserve_metadata_bytes+0xf2/0x1a8 [btrfs] [a013b2d0] btrfs_block_rsv_add+0x28/0x49 [btrfs] [a013b33e] btrfs_trans_reserve_metadata+0x4d/0x75 [btrfs] [a014b4d4] start_transaction+0x13e/0x1d8 [btrfs] [a014b5c6] btrfs_start_transaction+0x13/0x15 [btrfs] [a014f9db] btrfs_dirty_inode+0x6c/0xfd [btrfs] [8111d82f] __mark_inode_dirty+0x2b/0x1ab [8111297a] touch_atime+0x10e/0x131 [8110d49d] vfs_readdir+0x79/0xa0 [8110d60d] sys_getdents+0x81/0xd1 [81009c82] system_call_fastpath+0x16/0x1b [] 0x vmstat is rather odd: % vmstat 1 procs ---memory-- ---swap-- -io -system-- cpu r b swpd free buff cache si sobibo in cs us sy id wa 2 0 0 124348 177388 1127622800 1572430 2 10 87 1 1 0 0 122108 177388 1127854000 536 112 2870 813309 2 13 83 2 2 0 0 119628 177388 1128085200 704 0 2938 887219 2 11 87 0 2 0 0 117768 177388 1128268400 800 0 2936 883427 2 15 82 1 2 0 0 114872 177388 1128546400 1104 0 2819 883498 1 8 90 0 1 0 0 112704 177388 1128738800 612 0 2961 880957 2 14 84 0 ^C One of the three block devices on this btrfs is reading a few hundred blocks per second, but is nowhere near 100% busy; the CS count is extremely high; and we have less than a full CPU spent in system time (even though there are two ls processes blocked). If anyone can point me to a good place to start tracking down the root cause, I'd appreciate it. Thanks, -andy -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html