Re: Hang in XFS reclaim on 3.7.0-rc3
On Mon, Oct 29, 2012 at 09:03:15PM +0100, Torsten Kaiser wrote:
> After experiencing a hang of all IO yesterday
> ( http://marc.info/?l=linux-kernel&m=135142236520624&w=2 ), I turned
> on LOCKDEP after upgrading to -rc3. I then tried to replicate the
> load that hung yesterday and got the following lockdep report,
> implicating XFS instead of my stacking of swap onto dm-crypt and md.
>
> [ 2844.971913]
> [ 2844.971920] =================================
> [ 2844.971921] [ INFO: inconsistent lock state ]
> [ 2844.971924] 3.7.0-rc3 #1 Not tainted
> [ 2844.971925] ---------------------------------
> [ 2844.971927] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [ 2844.971929] kswapd0/725 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 2844.971931] (&(&ip->i_lock)->mr_lock){?.}, at: [811e7ef4] xfs_ilock+0x84/0xb0
> [ 2844.971941] {RECLAIM_FS-ON-W} state was registered at:
> [ 2844.971942] [8108137e] mark_held_locks+0x7e/0x130
> [ 2844.971947] [81081a63] lockdep_trace_alloc+0x63/0xc0
> [ 2844.971949] [810e9dd5] kmem_cache_alloc+0x35/0xe0
> [ 2844.971952] [810dba31] vm_map_ram+0x271/0x770
> [ 2844.971955] [811e10a6] _xfs_buf_map_pages+0x46/0xe0
> [ 2844.971959] [811e1fba] xfs_buf_get_map+0x8a/0x130
> [ 2844.971961] [81233849] xfs_trans_get_buf_map+0xa9/0xd0
> [ 2844.971964] [8121e339] xfs_ifree_cluster+0x129/0x670
> [ 2844.971967] [8121f959] xfs_ifree+0xe9/0xf0
> [ 2844.971969] [811f4abf] xfs_inactive+0x2af/0x480
> [ 2844.971972] [811efb90] xfs_fs_evict_inode+0x70/0x80
> [ 2844.971974] [8110cb8f] evict+0xaf/0x1b0
> [ 2844.971977] [8110cd95] iput+0x105/0x210
> [ 2844.971979] [811070d0] dentry_iput+0xa0/0xe0
> [ 2844.971981] [81108310] dput+0x150/0x280
> [ 2844.971983] [811020fb] sys_renameat+0x21b/0x290
> [ 2844.971986] [81102186] sys_rename+0x16/0x20
> [ 2844.971988] [816b2292] system_call_fastpath+0x16/0x1b

We shouldn't be mapping pages there. See if the patch below fixes it.
Fundamentally, though, the lockdep warning has come about because
vm_map_ram is doing a GFP_KERNEL allocation when we need it to be
doing GFP_NOFS - we are within a transaction here, so memory reclaim
is not allowed to recurse back into the filesystem.

mm-folk: can we please get this vmalloc/gfp_flags passing API fixed
once and for all? This is the fourth time in the last month or so
that I've seen XFS bug reports with silent hangs and associated
lockdep output that implicate GFP_KERNEL allocations from vm_map_ram
in GFP_NOFS conditions as the potential cause.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com

xfs: don't vmap inode cluster buffers during free

From: Dave Chinner <dchin...@redhat.com>

Signed-off-by: Dave Chinner <dchin...@redhat.com>
---
 fs/xfs/xfs_inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c4add46..82f6e5d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1781,7 +1781,8 @@ xfs_ifree_cluster(
 	 * to mark all the active inodes on the buffer stale.
 	 */
 	bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, blkno,
-				mp->m_bsize * blks_per_cluster, 0);
+				mp->m_bsize * blks_per_cluster,
+				XBF_UNMAPPED);
 	if (!bp)
 		return ENOMEM;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Hang in XFS reclaim on 3.7.0-rc3
[add the linux-mm cc I forgot to add before sending]

On Tue, Oct 30, 2012 at 09:26:13AM +1100, Dave Chinner wrote:
> On Mon, Oct 29, 2012 at 09:03:15PM +0100, Torsten Kaiser wrote:
> > After experiencing a hang of all IO yesterday
> > ( http://marc.info/?l=linux-kernel&m=135142236520624&w=2 ), I turned
> > on LOCKDEP after upgrading to -rc3. I then tried to replicate the
> > load that hung yesterday and got the following lockdep report,
> > implicating XFS instead of my stacking of swap onto dm-crypt and md.
> >
> > [lockdep report snipped - unchanged from the previous message]
>
> We shouldn't be mapping pages there. See if the patch below fixes it.
> Fundamentally, though, the lockdep warning has come about because
> vm_map_ram is doing a GFP_KERNEL allocation when we need it to be
> doing GFP_NOFS - we are within a transaction here, so memory reclaim
> is not allowed to recurse back into the filesystem.
>
> mm-folk: can we please get this vmalloc/gfp_flags passing API fixed
> once and for all? This is the fourth time in the last month or so
> that I've seen XFS bug reports with silent hangs and associated
> lockdep output that implicate GFP_KERNEL allocations from vm_map_ram
> in GFP_NOFS conditions as the potential cause.
>
> xfs: don't vmap inode cluster buffers during free
>
> [patch snipped - unchanged from the previous message]

-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page
On Mon, Oct 29, 2012 at 06:11:58PM +0000, Luck, Tony wrote:
> > What I would recommend is adding a
> >
> > #define FS_CORRUPTED_FL 0x0100 /* File is corrupted */
> >
> > ... and which could be accessed and cleared via the lsattr and
> > chattr programs.
>
> Good - but we need some space to save the corrupted range
> information too. These errors should be quite rare, so one range per
> file should be enough. New file systems should plan to add space in
> their on-disk format. The corruption isn't going to go away across a
> reboot.

No, not at all. If you want to store something in the filesystem
permanently, then use xattrs. You cannot rely on the filesystem being
able to store random application-specific data in their on-disk
format.

That's the *exact purpose* that xattrs were invented for - they are
an extensible, user-defined, per-file metadata storage mechanism that
is not tied to the filesystem's on-disk format. The kernel already
makes extensive use of xattrs for such metadata - just look at all
the security and integrity code that uses xattrs to store their
application-specific metadata.

Hence *anything* that the kernel wants to store on permanent storage
should be using xattrs, because then the application has complete
control of what is stored without caring about what filesystem it is
storing it on.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: Hang in XFS reclaim on 3.7.0-rc3
On Thu, Nov 01, 2012 at 04:30:10PM -0500, Ben Myers wrote:
> Hi Dave,
>
> On Tue, Oct 30, 2012 at 09:26:13AM +1100, Dave Chinner wrote:
> > On Mon, Oct 29, 2012 at 09:03:15PM +0100, Torsten Kaiser wrote:
> > > After experiencing a hang of all IO yesterday
> > > ( http://marc.info/?l=linux-kernel&m=135142236520624&w=2 ), I
> > > turned on LOCKDEP after upgrading to -rc3. I then tried to
> > > replicate the load that hung yesterday and got the following
> > > lockdep report, implicating XFS instead of my stacking of swap
> > > onto dm-crypt and md.
> > >
> > > [ 2844.971913]
> > > [ 2844.971920] =================================
> > > [ 2844.971921] [ INFO: inconsistent lock state ]
> > > [ 2844.971924] 3.7.0-rc3 #1 Not tainted
> > > [ 2844.971925] ---------------------------------
> > > [ 2844.971927] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> > > [ 2844.971929] kswapd0/725 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > > [ 2844.971931] (&(&ip->i_lock)->mr_lock){?.}, at: [811e7ef4] xfs_ilock+0x84/0xb0
> > > [ 2844.971941] {RECLAIM_FS-ON-W} state was registered at:
> > > [ 2844.971942] [8108137e] mark_held_locks+0x7e/0x130
> > > [ 2844.971947] [81081a63] lockdep_trace_alloc+0x63/0xc0
> > > [ 2844.971949] [810e9dd5] kmem_cache_alloc+0x35/0xe0
> > > [ 2844.971952] [810dba31] vm_map_ram+0x271/0x770
> > > [ 2844.971955] [811e10a6] _xfs_buf_map_pages+0x46/0xe0
> > > [...]
> >
> > We shouldn't be mapping pages there. See if the patch below fixes
> > it.
> >
> > Fundamentally, though, the lockdep warning has come about because
> > vm_map_ram is doing a GFP_KERNEL allocation when we need it to be
> > doing GFP_NOFS - we are within a transaction here, so memory
> > reclaim is not allowed to recurse back into the filesystem.
> >
> > mm-folk: can we please get this vmalloc/gfp_flags passing API
> > fixed once and for all? This is the fourth time in the last month
> > or so that I've seen XFS bug reports with silent hangs and
> > associated lockdep output that implicate GFP_KERNEL allocations
> > from vm_map_ram in GFP_NOFS conditions as the potential cause.
> >
> > xfs: don't vmap inode cluster buffers during free
>
> Could you write up a little more background for the commit message?
Sure, that was just a test patch, and often I don't bother putting a
detailed description in them until I know they fix the problem. My
current tree has:

xfs: don't vmap inode cluster buffers during free

Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
default behaviour").

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage
On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
> On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
> > On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > > This patchset optimizes several places which take the per inode
> > > spin lock. They have not been fully tested yet, thus they are
> > > marked as RFC.
> >
> > Inodes are RCU freed. The i_lock spinlock on the i_state field
> > forms part of the memory barrier that allows the RCU read side to
> > correctly detect a freed inode during a RCU protected cache lookup
> > (hash list traversal for the VFS, or a radix tree traversal for
> > XFS). The i_lock usage around the hash list operations ensures the
> > hash list operations are atomic with state changes so that such
> > changes are correctly detected during RCU-protected traversals...
> >
> > IOWs, removing the i_lock from around the i_state transitions and
> > inode hash insert/remove/traversal operations will cause races in
> > the RCU lookups and result in incorrectly using freed inodes
> > instead of failing the lookup and creating a new one. So I don't
> > think this is a good idea at all...
>
> Hello, Dave:
>
> Thanks for your explanation. Though I can't fully understand it,
> your concern seems to be that RCU inode lookup will be bothered by
> this change. But we do not have RCU inode lookup in the VFS: inode
> lookup is done in a rather traditional way.

Ah, I'd forgotten that neither of these RCU-based lookups ever got
merged:

https://lkml.org/lkml/2010/6/23/397
http://thread.gmane.org/gmane.linux.kernel/1056494

That, however, is the approach that the inode caches should be moving
towards - RCU lookups to reduce locking, not changing i_lock/i_state
atomicity that has been designed to facilitate RCU safe lookups...

> XFS gives me the impression that it implements its own inode cache.
> There may be such thing there. I have little knowledge of XFS, but I
> guess it's unlikely impacted by the change of code implementing the
> VFS inode cache.
Yeah, I dropped the generic inode hash RCU conversion - the
SLAB_DESTROY_BY_RCU variant was proving to be rather complex, and I
didn't have any motivation to see it through because I'd already
converted XFS to avoid the global inode_hash_lock and use RCU lookups
on its internal inode cache...

> As far as I can see, RCU inode free is for RCU dentry lookup, which
> seems to have nothing to do with 'detect a freed inode'.

If you know nothing of the history of this, then it might seem that
way.

> Taking i_lock in these places looks to me like a result of following
> the old lock scheme blindly when breaking the big global inode lock.

The i_state/i_hash_list/i_lock relationship was created specifically
during the inode_lock breakup to allow us to guarantee that certain
fields of the inode are unchanging without needing to take multiple
nested locks:

$ gl -n 1 250df6e
commit 250df6ed274d767da844a5d9f05720b804240197
Author: Dave Chinner <dchin...@redhat.com>
Date:   Tue Mar 22 22:23:36 2011 +1100

    fs: protect inode->i_state with inode->i_lock

    Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state
    checks during list traversals so that we don't race with another
    thread marking the inode I_FREEING between the state check and
    grabbing the reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup
    where necessary.

    Signed-off-by: Dave Chinner <dchin...@redhat.com>
    Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>

The inode hash lookup needs to check i_state atomically during the
traversal so inodes being freed are skipped (e.g.
I_FREEING, I_WILL_FREE). Those i_state flags are set only with the
i_lock held, and so inode hash lookups need to take the i_lock to
guarantee the i_state field is correct. This inode data field
synchronisation is separate from the cache hash list traversal
protection.

The only way to do this is to have an inner lock (inode->i_lock) that
protects both the inode->i_hash_list and inode->i_state fields, and a
lock order that provides outer list traversal protection
(inode_hash_lock). Whether the outer lock is the inode_hash_lock or
rcu_read_lock(), the lock order and the data fields the locks are
protecting are the same.

> Of course, maybe they are there for something. Could you speak more
> about the race this change (patch 1,2?) brings up? Thank you.

When you drop the lock from the i_state initialisation, you end up
dropping the implicit unlock-lock memory barrier that the
inode->i_lock provides.
Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage
On Mon, Sep 24, 2012 at 02:12:05PM +0800, Guo Chao wrote:
> On Mon, Sep 24, 2012 at 02:23:43PM +1000, Dave Chinner wrote:
> > On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
> > > On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
> > > > On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > > > > This patchset optimizes several places which take the per
> > > > > inode spin lock. They have not been fully tested yet, thus
> > > > > they are marked as RFC.
> > > >
> > > > Inodes are RCU freed. The i_lock spinlock on the i_state field
> > > > forms part of the memory barrier that allows the RCU read side
> > > > to correctly detect a freed inode during a RCU protected cache
> > > > lookup (hash list traversal for the VFS, or a radix tree
> > > > traversal for XFS). The i_lock usage around the hash list
> > > > operations ensures the hash list operations are atomic with
> > > > state changes so that such changes are correctly detected
> > > > during RCU-protected traversals...
> > > >
> > > > IOWs, removing the i_lock from around the i_state transitions
> > > > and inode hash insert/remove/traversal operations will cause
> > > > races in the RCU lookups and result in incorrectly using freed
> > > > inodes instead of failing the lookup and creating a new one.
> > > > So I don't think this is a good idea at all...
> >
> > [...]
> >
> > The inode hash lookup needs to check i_state atomically during the
> > traversal so inodes being freed are skipped (e.g. I_FREEING,
> > I_WILL_FREE). Those i_state flags are set only with the i_lock
> > held, and so inode hash lookups need to take the i_lock to
> > guarantee the i_state field is correct. This inode data field
> > synchronisation is separate from the cache hash list traversal
> > protection.
> >
> > The only way to do this is to have an inner lock (inode->i_lock)
> > that protects both the inode->i_hash_list and inode->i_state
> > fields, and a lock order that provides outer list traversal
> > protection (inode_hash_lock). Whether the outer lock is the
> > inode_hash_lock or rcu_read_lock(), the lock order and the data
> > fields the locks are protecting are the same.
>
> Of course, maybe they are there for something.
> Could you speak more about the race this change (patch 1,2?) brings
> up? Thank you.

When you drop the lock from the i_state initialisation, you end up
dropping the implicit unlock-lock memory barrier that the
inode->i_lock provides. i.e. you get this in iget_locked():

	thread 1			thread 2

	lock(inode_hash_lock)
	for_each_hash_item()

					inode->i_state = I_NEW
					hash_list_insert

	finds newly inserted inode
	lock(inode->i_lock)
	unlock(inode->i_lock)
	unlock(inode_hash_lock)

	wait_on_inode()
	  sees inode->i_state = 0
	  uses inode before
	  initialisation is complete

IOWs, there is no unlock-lock transition occurring on any lock, so
there are no implicit memory barriers in this code, and so other CPUs
are not guaranteed to see the inode->i_state = I_NEW write that
thread 2 did. The lock/unlock pair around this I_NEW assignment
guarantees that thread 1 will see the change to i_state correctly.

So even without RCU, dropping the i_lock from these i_state/hash
insert/remove operations will result in races occurring...

> This interleave can never happen because of inode_hash_lock.

Ah, sorry, I'm context switching too much right now.
s/lock(inode_hash_lock)/rcu_read_lock/. And that's the race condition
the locking order is *intended* to avoid. It's just that we haven't
done the last piece of the work, which is replacing the read side
inode_hash_lock usage with rcu_read_lock.

Seriously, if you want to improve the locking of this code, go back
and resurrect the basic RCU hash traversal patches (i.e. Nick's
original patch rather than my SLAB_DESTROY_BY_RCU based ones). That
has much more benefit to many more workloads than just removing
non-global, uncontended locks like this patch set does.

> Ah, this is intended to be a code cleanup patchset, actually. I
> thought these locks were redundant in an obvious and trivial manner.
> If, on the contrary, they are so tricky, then never mind :) Thanks
> for your patience.
The RCU conversion is actually trivial - everything is already set up
for it to be done, and it is simpler than this patch set. It pretty
much is simply replacing all the read side inode_hash_lock pairs with
rcu_read_lock()/rcu_read_unlock() pairs. Like I said, if you want to
clean up this code, then RCU traversals are the conversion to make.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage
On Mon, Sep 24, 2012 at 03:08:52PM +0800, Guo Chao wrote:
> On Mon, Sep 24, 2012 at 04:28:12PM +1000, Dave Chinner wrote:
> > > Ah, this is intended to be a code cleanup patchset, actually. I
> > > thought these locks were redundant in an obvious and trivial
> > > manner. If, on the contrary, they are so tricky, then never
> > > mind :) Thanks for your patience.
> >
> > The RCU conversion is actually trivial - everything is already set
> > up for it to be done, and it is simpler than this patch set. It
> > pretty much is simply replacing all the read side inode_hash_lock
> > pairs with rcu_read_lock()/rcu_read_unlock() pairs. Like I said,
> > if you want to clean up this code, then RCU traversals are the
> > conversion to make.
>
> Thanks for your suggestion. Though I doubt it's that trivial, I will
> try this after a little investigation.

Probably best to start with the patch below - it's run under heavy
concurrent load on ext4 with a working set of inodes about 20x larger
than can fit in memory for the past hour.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com

fs: Use RCU lookups for inode cache

From: Dave Chinner <dchin...@redhat.com>

Convert inode cache lookups to be protected by RCU locking rather
than the global inode_hash_lock. This will improve scalability of
inode lookup intensive workloads.

Smoke tested w/ ext4 on concurrent fsmark/lookup/unlink workloads
over 50 million or so inodes...
Signed-off-by: Dave Chinner <dchin...@redhat.com>
---
 fs/inode.c | 74 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ac8d904..2e92674 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -464,7 +464,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_add_head(&inode->i_hash, b);
+	hlist_add_head_rcu(&inode->i_hash, b);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -480,7 +480,7 @@ void __remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_del_init(&inode->i_hash);
+	hlist_del_init_rcu(&inode->i_hash);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_head *head,
 				int (*test)(struct inode *, void *),
-				void *data)
+				void *data, bool locked)
 {
 	struct hlist_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_sb != sb) {
 			spin_unlock(&inode->i_lock);
 			continue;
@@ -800,13 +805,20 @@ repeat:
 			continue;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			rcu_read_unlock();
+			if (locked)
+				spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
+			if (locked)
+				spin_lock(&inode_hash_lock);
 			goto repeat;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
 	return NULL;
 }
 
@@ -815,14 +827,20 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				     struct hlist_head *head,
+				     unsigned long ino, bool locked)
 {
 	struct hlist_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_ino != ino) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
@@ -832,13 +850,20 @@ repeat:
 			continue;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			rcu_read_unlock
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
[...], the questions I'd be seeking to answer before proposing it for
inclusion are as follows:

1. What's the comparison in performance to typical NFS server
   writeback parameter tuning? i.e. dirty_background_ratio=5,
   dirty_ratio=10, dirty_expire_centisecs=1000,
   dirty_writeback_centisecs=1? i.e. does this change give any
   benefit over the current common practice for configuring NFS
   servers?

2. What happens when you have 10 clients all writing to the server at
   once? Or a 100? NFS servers rarely have a single writer to a
   single file at a time, so what impact does this change have on
   multiple concurrent file write performance from multiple clients?

3. Following on from the multiple client test, what difference does
   it make to file fragmentation rates? Writing more frequently means
   smaller allocations and writes, and that tends to lead to higher
   fragmentation rates, especially when multiple files are being
   written concurrently. Higher fragmentation also means lower
   performance over time, as fragmentation accelerates filesystem
   aging effects on performance. IOWs, it may be faster when new, but
   it will be slower 3 months down the track, and that's a bad
   tradeoff to make.

4. What happens for higher bandwidth network links? e.g. gigE or
   10gigE? Are the improvements still there? Or does it cause
   regressions at higher speeds? I'm especially interested in what
   happens to multiple writers at higher network speeds, because
   that's a key performance metric used to measure enterprise level
   NFS servers.

5. Are the improvements consistent across different filesystem types?
   We've had writeback changes in the past cause improvements on one
   filesystem but significant regressions on others. I'd suggest that
   you need to present results for ext4, XFS and btrfs so that we
   have a decent idea of what we can expect from the change to the
   generic code.

Yeah, I'm asking a lot of questions.
That's because the generic writeback code is extremely important to
performance, and the impact of a change to it cannot be evaluated
from a single test.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
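[Editor's note: the "typical NFS server writeback parameter tuning"
referenced in question 1 corresponds to sysctl settings along these
lines. This is a sketch of the values Dave quotes, not a tuning
recommendation; the file path is just the usual sysctl.d convention.]

```
# /etc/sysctl.d/90-nfs-writeback.conf
# Start background writeback early and expire dirty data quickly,
# per the values in question 1 above.
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 1
```

These take effect via `sysctl --system` or can be applied one-off
with `sysctl -w` as root.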
Re: [RFC v2 01/10] vfs: introduce private rb structures
> +	last_write_time;
> +	u32 nr_reads;
> +	u32 nr_writes;
> +	u64 avg_delta_reads;
> +	u64 avg_delta_writes;
> +	u8 flags;
> +	u32 last_temperature;

May as well make the flags a u32 - the compiler will use that much
space anyway, as it aligns the u32 last_temperature variable after
it.

> +};
> +
> +/* An item representing an inode and its access frequency */
> +struct hot_inode_item {
> +	/* node for hot_inode_tree rb_tree */
> +	struct rb_node rb_node;
> +	/* tree of ranges in this inode */
> +	struct hot_range_tree hot_range_tree;
> +	/* frequency data for this inode */
> +	struct hot_freq_data hot_freq_data;
> +	/* inode number, copied from inode */
> +	unsigned long i_ino;
> +	/* used to check for errors in ref counting */
> +	u8 in_tree;
> +	/* protects hot_freq_data, i_no, in_tree */
> +	spinlock_t lock;
> +	/* prevents kfree */
> +	struct kref refs;

It's hard to see the code in the comments, and some of the comments
are redundant. It's easier to read if you do this:

struct hot_inode_item {
	struct rb_node node;			/* hot_inode_tree index */
	struct hot_range_tree hot_range_tree;	/* tree of ranges */
	struct hot_freq_data hot_freq_data;	/* frequency data */
	unsigned long i_ino;			/* inode number from inode */
	u8 in_tree;				/* ref counting check */
	spinlock_t lock;			/* protects object data */
	struct kref refs;			/* prevents kfree */
};

Also:

- i_ino really needs to be a 64 bit quantity, as some filesystems can
  use 64 bit inode numbers even on 32 bit systems (e.g. XFS).

- in_tree can be a u32 or a flags field if it is boolean. If it is
  just debug, then maybe it can be removed when the code is ready for
  commit.
> +};
> +
> +/*
> + * An item representing a range inside of an inode whose frequency
> + * is being tracked
> + */
> +struct hot_range_item {
> +	/* node for hot_range_tree rb_tree */
> +	struct rb_node rb_node;
> +	/* frequency data for this range */
> +	struct hot_freq_data hot_freq_data;
> +	/* the hot_inode_item associated with this hot_range_item */
> +	struct hot_inode_item *hot_inode;
> +	/* starting offset of this range */
> +	u64 start;
> +	/* length of this range */
> +	u64 len;

What units?

	u64 start;	/* start offset in bytes */
	u64 len;	/* length in bytes */

> +	/* used to check for errors in ref counting */
> +	u8 in_tree;
> +	/* protects hot_freq_data, start, len, and in_tree */
> +	spinlock_t lock;
> +	/* prevents kfree */
> +	struct kref refs;
> +};
> +
> +struct hot_info {
> +	/* red-black tree that keeps track of fs-wide hot data */
> +	struct hot_inode_tree hot_inode_tree;
> +};

The comment is redundant...

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 02/10] vfs: add support for updating access frequency
> [...] RANGE_SIZE_MASK;
> +		hr->len = RANGE_SIZE;
> +		hr->hot_inode = he;
> +		hot_rb_add_hot_range_item(hrtree, hr);
> +		write_unlock(&hrtree->lock);
> +	}
> +
> +	spin_lock(&hr->lock);
> +	hot_rb_update_freq(&hr->hot_freq_data, rw);
> +	spin_unlock(&hr->lock);
> +	hot_rb_free_hot_range_item(hr);
> +	}

Same comments again about locking.

I note that the code will always insert range items of length
RANGE_SIZE. This means you have a fixed object granularity and hence
you have no need for a range based search. That is, you could use a
radix tree where each entry in the radix tree points directly to the
range object, similar to how the page cache uses a radix tree for
indexing pages. That brings the possibility of lockless range item
lookups...

> +
> +out:
> +	return ret;
> +}
> +
> +/* main function to update access frequency from read/writepage(s) hooks */
> +void hot_rb_update_freqs(struct inode *inode, u64 start,
> +			u64 len, int rw)
> +{
> +	struct hot_inode_item *he;
> +
> +	he = hot_rb_update_inode_freq(inode, rw);
> +
> +	WARN_ON(!he);
> +
> +	if (he) {
> +		hot_rb_update_range_freq(he, start, len,
> +				rw, &(inode->i_sb->s_hotinfo));
> +
> +		hot_rb_free_hot_inode_item(he);
> +	}

This is very asymmetric.
It would be much better to collapse all the abstraction down to
something much simpler, say:

int hot_inode_update_freqs()
{
	he = hot_inode_item_find(tree, inum, null)
	if (!he) {
		new_he = allocate()
		if (!new_he)
			return -ENOMEM;
		init new_he
		he = hot_inode_item_find(tree, inum, new_he)
		if (he != new_he)
			free new_he
	}
	hot_inode_freq_update(&he->hot_freq_data, write)

	for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) {
		hr = hot_range_item_find(tree, cur, NULL)
		if (!hr) {
			new_hr = allocate()
			if (!new_hr)
				return -ENOMEM;
			init new_hr
			hr = hot_inode_item_find(tree, cur, new_hr)
			if (hr != new_hr)
				free new_hr
		}
		hot_inode_freq_update(&hr->hot_freq_data, write)
		hot_range_item_put(hr);
	}
	hot_inode_item_put(he);
}

> +}
> +
> +/*
>   * Initialize kmem cache for hot_inode_item
>   * and hot_range_item
>   */
> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
> index 269b67a..2ba29e4 100644
> --- a/fs/hot_tracking.h
> +++ b/fs/hot_tracking.h
> @@ -21,6 +21,21 @@
>  #define FREQ_DATA_TYPE_INODE (1 << 0)
>  /* freq data struct is for a range */
>  #define FREQ_DATA_TYPE_RANGE (1 << 1)
> +/* size of sub-file ranges */
> +#define RANGE_SIZE (1 << 20)
> +#define RANGE_SIZE_MASK (~((u64)(RANGE_SIZE - 1)))

You might want to state what units these ranges are in, and that
there is one tracking object per range per inode.

> +#define FREQ_POWER 4
> +
> +struct hot_info;
> +struct inode;
> +
> +struct hot_inode_item

You've already included include/linux/hot_tracking.h, so you
shouldn't need these forward declarations...

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'
On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Introduce one new mount option '-o hottrack', and add its parsing support. Its usage looks like:

	mount -o hottrack
	mount -o nouser,hottrack
	mount -o nouser,hottrack,loop
	mount -o hottrack,nouser

I think that this option parsing should be done by the filesystem, even though the tracking functionality is in the VFS. That way only the filesystems that can use the tracking information will turn it on, rather than being able to turn it on for everything regardless of whether it is useful or not.

Along those lines, just using a normal superblock flag to indicate it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you don't need to allocate the sb->s_hot_info structure just to be able to check whether we are tracking hot inodes or not.

This then means the hot inode tracking for the superblock can be initialised by the filesystem as part of its fill_super method, along with the filesystem specific code that will use the hot tracking information the VFS gathers.

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage
On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote: On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:

@@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_head *head,
 				int (*test)(struct inode *, void *),
-				void *data)
+				void *data, bool locked)
 {
 	struct hlist_node *node;
 	struct inode *inode = NULL;

 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}

Is this check too early? If the unhashed inode happened to be the target inode, we are wasting our time to continue the traversal and we do not wait on it.

If the inode is unhashed, then it is already passing through evict() or has already passed through. If it has already passed through evict() then it is too late to call __wait_on_freeing_inode() as the wakeup occurs in evict() immediately after the inode is removed from the hash. i.e:

	remove_inode_hash(inode);
	spin_lock(&inode->i_lock);
	wake_up_bit(&inode->i_state, __I_NEW);
	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
	spin_unlock(&inode->i_lock);

i.e. if we get the case:

Thread 1, RCU hash traversal		Thread 2, evicting foo

rcu_read_lock()
found inode foo
					remove_inode_hash(inode)
					spin_lock(&foo->i_lock)
					wake_up(I_NEW)
					spin_unlock(&foo->i_lock)
					destroy_inode()
					.....
spin_lock(&foo->i_lock)
match sb, ino
I_FREEING
rcu_read_unlock()

(rcu grace period can expire at any time now,
 so use after free is guaranteed at some point)

wait_on_freeing_inode
  wait_on_bit(I_NEW) will never get woken

Hence if the inode is unhashed, it doesn't matter what inode it is, it is never valid to use it any further because it may have already been freed. The only reason we can safely access it here is that the RCU grace period will not expire until we call rcu_read_unlock().
@@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 		struct inode *old;

 		spin_lock(&inode_hash_lock);
-		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, head, ino, true);
 		if (!old) {
 			inode->i_ino = ino;
 			spin_lock(&inode->i_lock);

... couldn't we use the memory barrier API instead of an irrelevant spin lock on a newly allocated inode to publish I_NEW?

Yes, we could. However, having multiple synchronisation methods for a single variable that should only be used in certain circumstances is something that is easy to misunderstand and get wrong. Memory barriers are much more subtle and harder to understand than spin locks, and every memory barrier needs to be commented to explain what the barrier is actually protecting against.

In the case where a spin lock is guaranteed to be uncontended and the cache line is hot in the CPU cache, it makes no sense to replace the spin lock with a memory barrier, especially when every other place we modify the i_state/i_hash fields we have to wrap them with i_lock. Simple code is good code - save the complexity for something that needs it.

I went through many mails of the last round of VFS scaling work. Many patches seem quite natural, say RCU inode lookup

Sure, but the implementation in those RCU lookup patches sucked.

or per-bucket inode hash lock or

It was a bad idea. At minimum, you can't use lockdep on it. Worse for the realtime guys is the fact it can't be converted to a sleeping lock. Worst was the refusal to change it in any way to address concerns.

And realistically, the fundamental problem is not with the inode_hash_lock, it's with the fact that the cache is based on a hash table rather than a more scalable structure like a radix tree or btree.
This is a primary reason for XFS having its own inode cache - hashes can only hold a certain number of entries before performance collapses catastrophically, and so they don't scale well to tens or hundreds of millions of entries.

per-superblock inode list lock,

Because it isn't a particularly hot lock, and given that most workloads hit on a single filesystem, scalability is not improved by making this change. As such, as long as there is a single linked list used to iterate all inodes
Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'
On Wed, Sep 26, 2012 at 10:56:08AM +0800, Zhi Yong Wu wrote: On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Introduce one new mount option '-o hottrack', and add its parsing support. Its usage looks like:

	mount -o hottrack
	mount -o nouser,hottrack
	mount -o nouser,hottrack,loop
	mount -o hottrack,nouser

I think that this option parsing should be done by the filesystem, even though the tracking functionality is in the VFS. That way only the filesystems that can use the tracking information will turn it on, rather than being able to turn it on for everything regardless of whether it is useful or not. Along those lines, just using a normal superblock flag to indicate it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you don't need to allocate the sb->s_hot_info structure just to be able to check whether we are tracking hot inodes or not. This then means the hot inode tracking for the superblock can be initialised by the filesystem as part of its fill_super method, along with the filesystem specific code that will use the hot tracking information the VFS gathers.

I can see what you mean, but I don't know if other guys also agree with this. If so, all FS specific code which uses the hot tracking feature will have to add the same chunk of code in its fill_super method. Is that good?

Most filesystems will only need to add 3-4 lines of code to their existing parser, so it's not a big deal I think...

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 02/10] vfs: add support for updating access frequency
On Wed, Sep 26, 2012 at 10:53:07AM +0800, Zhi Yong Wu wrote: On Tue, Sep 25, 2012 at 5:17 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.ker...@gmail.com wrote:

I note that the code will always insert range items of a length RANGE_SIZE. This means you have a fixed object granularity and hence you have no need for a range based search. That is, you could use a radix tree where each entry in the radix tree points directly to the range object, similar to how the page cache uses a radix tree for indexing pages. That brings the possibility of lockless range item lookups.

Great suggestion, but can we temporarily put it in the TODO list? It will bring one big code change.

Sure. I just wanted to point out that there are better choices for indexing fixed size elements than rb-trees, and why it might make sense to use a different type of tree.

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 04/10] vfs: add init and exit support
On Sun, Sep 23, 2012 at 08:56:29PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Add initialization function to create some key data structures when hot tracking is enabled; clean them up when hot tracking is disabled.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h |  2 +
 fs/namespace.c    |  4 +++
 fs/super.c        |  6 +++++
 4 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index f97e8a6..fa89f70 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -135,6 +135,51 @@ static void hot_rb_free_hot_range_item(struct hot_range_item *hr) }

+static int hot_rb_remove_hot_inode_item(struct hot_inode_tree *tree,
+				struct hot_inode_item *he)

hot_inode_item_remove()

+{
+	int ret = 0;
+	rb_erase(&he->rb_node, &tree->map);
+	he->in_tree = 0;
+	return ret;
+}
+
+static int hot_rb_remove_hot_range_item(struct hot_range_tree *tree,
+				struct hot_range_item *hr)

hot_range_item_remove()

(repeat for other function names ;)

+{
+	int ret = 0;
+	rb_erase(&hr->rb_node, &tree->map);
+	hr->in_tree = 0;
+	return ret;
+}

These can probably be void functions, as they don't use the return value at all.

+
+/* Frees the entire hot_inode_tree. */
+static void hot_rb_inode_tree_free(struct hot_info *root)
+{
+	struct rb_node *node, *node2;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+
+	/* Free hot inode and range trees on fs root */
+	node = rb_first(&root->hot_inode_tree.map);
+
+	while (node) {

	while ((node = rb_first(&root->hot_inode_tree.map))) {

+		he = rb_entry(node, struct hot_inode_item, rb_node);
+
+		node2 = rb_first(&he->hot_range_tree.map);
+		while (node2) {

	while ((node2 = rb_first(&he->hot_range_tree.map))) {

.....
+	if (sb->s_hotinfo.mount_opt & HOT_MOUNT_HOT_TRACK)
+		hot_track_exit(sb);
+

Let the filesystems call this in .put_super().

 	down_write(&namespace_sem);
 	br_write_lock(&vfsmount_lock);
 	event++;

diff --git a/fs/super.c b/fs/super.c index 7eb3b0c..0999d5c 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1153,6 +1153,9 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)

 	WARN_ON(sb->s_bdi == &default_backing_dev_info);
 	sb->s_flags |= MS_BORN;

+	if (hottrack)
+		hot_track_init(sb, name);

And call this in .fill_super() after parsing the hottrack argument.

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 05/10] vfs: introduce one hash table
On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Adds a hash table structure which contains a lot of hash list and is used to efficiently look up the data temperature of a file or its ranges. In each hash list of hash table, the hash node will keep track of temperature info.

So, let me see if I've got the relationship straight:

- sb->s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode
- hot_inode_item contains access frequency data for that inode
- hot_inode_item holds a heat hash node to index the access frequency data for that inode
- hot_inode_item.hot_range_tree indexes hot_range_items for that inode
- hot_range_item contains access frequency data for that range
- hot_range_item holds a heat hash node to index the access frequency data for that range
- sb->s_hot_info.heat_inode_hl indexes per-inode heat hash nodes
- sb->s_hot_info.heat_range_hl indexes per-range heat hash nodes

How about some ascii art? :) Just looking at the hot inode item case (the range item case is the same pattern, though), we have:

 heat_inode_hl                  hot_inode_tree
      |                               |
      |                               V
      |                   ...---hot_inode_item---...
      |                       |  frequency data  |
      |                       +--------^---------+
      |                                |
      +--hot_hash_node--hot_hash_node--+--hot_hash_node--...

There's no actual data stored in the hot_hash_node, just a pointer back to the frequency data, a hlist_node and a pointer to the hashlist head. IOWs, I agree with Ram that this does not need to exist and just embedding a hlist_node inside the hot_inode_item is all that is needed. i.e:

 heat_inode_hl                  hot_inode_tree
      |                               |
      |                               V
      |                   ...---hot_inode_item---...
      |                       |  frequency data  |
      |                       |    hlist_node    |
      |                       +---------+--------+
      |                                 |
      +---...---hlist_node--------------+---...
There's no need for separate allocations, initialisations, locks and reference counting - all that is already in the hot_inode_item. The items have the same lifecycle limitations - a hot_hash_node must be torn down before the frequency data it points to is freed. Finally, there's no difference in how you move it between lists.

Indeed, calling it a hash is wrong - there's no hashing at all - it's keeping an array of lists where each entry corresponds to a specific temperature. It is a *heat map*, not a hash list. i.e. inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE.

As it is, there aren't any users of the heat maps that are generated in this patch set - it's not even exported to userspace or to debugfs, so I'm not sure how it will be used yet. How are these heat maps going to be used by filesystems, Zhi?

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 06/10] vfs: enable hot data tracking
On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Miscellaneous features that implement hot data tracking and generally make the hot data functions a bit more friendly.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/direct-io.c               | 10 ++++++
 include/linux/hot_tracking.h | 11 +++++++
 mm/filemap.c                 |  8 ++++++
 mm/page-writeback.c          | 21 +++++++++++++
 mm/readahead.c               |  9 +++++
 5 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c index f86c720..3773f44 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -37,6 +37,7 @@
 #include <linux/uio.h>
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
+#include "hot_tracking.h"

 /*
  * How many user pages to map in one call to get_user_pages(). This determines

@@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	prefetch(bdev->bd_queue);
 	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);

+	/* Hot data tracking */
+	if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host) &&
+		iov_length(iov, nr_segs) > 0) {
+		hot_rb_update_freqs(iocb->ki_filp->f_mapping->host,
+				(u64)offset,
+				(u64)iov_length(iov, nr_segs),
+				rw & WRITE);
+	}

That's a bit messy. I'd prefer a static inline function that hides all this. e.g.
track_hot_inode_ranges(inode, offset, length, rw)
{
	if (inode->i_sb->s_flags & MS_HOT_TRACKING)
		hot_inode_freq_update(inode, offset, length, rw);
}

diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 5ad5ce2..552c861 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -35,6 +35,7 @@
 #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
 #include <linux/pagevec.h>
 #include <linux/timer.h>
+#include <linux/hot_tracking.h>
 #include <trace/events/writeback.h>

 /*
@@ -1895,13 +1896,33 @@ EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	int ret;
+	pgoff_t start = 0;
+	u64 prev_count = 0, count = 0;

 	if (wbc->nr_to_write <= 0)
 		return 0;
+
+	/* Hot data tracking */
+	if (TRACK_THIS_INODE(mapping->host) &&
+		wbc->range_cyclic) {
+		start = mapping->writeback_index << PAGE_CACHE_SHIFT;
+		prev_count = (u64)wbc->nr_to_write;
+	}

Why only wbc->range_cyclic? This won't record things like synchronous writes or fsync-triggered writes, which are far more likely to be to hot ranges in a file...

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature
On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Fork and run one kernel kthread to calculate that temperature based on some metrics kept in custom frequency data structs, and store the info in the hash table.

No new kthreads, please. Use a per-superblock workqueue and a struct delayed_work to run periodic work on each superblock. That will also remove all the nasty, nasty !hot_track_temperature_update_kthread checks from the code, too.

Also, I'd separate the work that the workqueue does from the patch that introduces the work queue. That way there is only one new thing to comment on in the patch. Further, I'd separate the aging code from the code that updates the temperature map into its own patch as well.

Finally, you're going to need a shrinker to control the amount of memory that is used in tracking hot regions - if we are throwing inodes out of memory due to memory pressure, we most definitely are going to need to reduce the amount of memory the tracking code is using, even if it means losing useful information (i.e. the shrinker accelerates the aging process).

Given the above, and the other comments earlier in the series, there's not a lot of point in me spending time commenting on the code in detail here as it will change significantly as a result of all the earlier comments.

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 05/10] vfs: introduce one hash table
On Thu, Sep 27, 2012 at 02:23:16PM +0800, Zhi Yong Wu wrote: On Thu, Sep 27, 2012 at 11:43 AM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Adds a hash table structure which contains a lot of hash list and is used to efficiently look up the data temperature of a file or its ranges. In each hash list of hash table, the hash node will keep track of temperature info.

So, let me see if I've got the relationship straight:

- sb->s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode
- hot_inode_item contains access frequency data for that inode
- hot_inode_item holds a heat hash node to index the access frequency data for that inode
- hot_inode_item.hot_range_tree indexes hot_range_items for that inode
- hot_range_item contains access frequency data for that range
- hot_range_item holds a heat hash node to index the access frequency data for that range
- sb->s_hot_info.heat_inode_hl indexes per-inode heat hash nodes
- sb->s_hot_info.heat_range_hl indexes per-range heat hash nodes

Correct.

How about some ascii art? :) Just looking at the hot inode item case (the range item case is the same pattern, though), we have:

 heat_inode_hl                  hot_inode_tree
      |                               |
      |                               V
      |                   ...---hot_inode_item---...
      |                       |  frequency data  |
      |                       +--------^---------+
      |                                |
      +--hot_hash_node--hot_hash_node--+--hot_hash_node--...

Great, can we put them in hot_tracking.txt in Documentation?

There's no actual data stored in the hot_hash_node, just a pointer back to the frequency data, a hlist_node and a pointer to the hashlist head. IOWs, I agree with Ram that this does not need to exist and just embedding a hlist_node inside the hot_inode_item is all that is needed. i.e:

 heat_inode_hl                  hot_inode_tree
      |                               |
      |                               V
      |                   ...---hot_inode_item---...
      |                       |  frequency data  |
      |                       |    hlist_node    |
      |                       +---------+--------+
      |                                 |
      +---...---hlist_node--------------+---...

There's no need for separate allocations, initialisations, locks and reference counting - all that is already in the hot_inode_item. The items have the same lifecycle limitations - a hot_hash_node must be torn down before the frequency data it points to is freed. Finally, there's no difference in how you move it between lists.

How will you know if one hot_inode_item should be moved between lists when its freq data is changed?

Record the current temperature in the frequency data, and if it changes, change the list it is on.

Indeed, calling it a hash is wrong - there's no hashing at all - it's keeping an array of lists where each entry corresponds to a specific temperature. It is a *heat map*, not a hash list. i.e. inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE.

OK.

As it is, there aren't any users of the heat maps that are generated in this patch set - it's not even exported to userspace or to debugfs, so I'm not sure how it will be used yet. How are these heat maps going to be used by filesystems, Zhi?

In hot_hash_calc_temperature(), you can see that one hot_inode or hot_range's freq data will be distilled into one temperature value, then it will be inserted to the heat map based on its temperature. When the file corresponding to the inode or range gets hotter or colder, its location will be changed in the heat map based on its new temperature in hot_hash_update_hash_table().

Yes, but a hot_inode_item or hot_range_item can only have one location in the heat map, right? So it doesn't need an external structure to point to the frequency data to track this.

And the user will retrieve those freq data and temperature info via debugfs or ioctl interfaces.

Right - but that data is only extracted after an initial hot_inode_tree lookup - the heat map itself is never directly used for lookups. If it's not used for lookups based on temperature, why is it needed?
Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 06/10] vfs: enable hot data tracking
On Thu, Sep 27, 2012 at 02:28:12PM +0800, Zhi Yong Wu wrote: On Thu, Sep 27, 2012 at 11:54 AM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Miscellaneous features that implement hot data tracking and generally make the hot data functions a bit more friendly.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/direct-io.c               | 10 ++++++
 include/linux/hot_tracking.h | 11 +++++++
 mm/filemap.c                 |  8 ++++++
 mm/page-writeback.c          | 21 +++++++++++++
 mm/readahead.c               |  9 +++++
 5 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c index f86c720..3773f44 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -37,6 +37,7 @@
 #include <linux/uio.h>
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
+#include "hot_tracking.h"

 /*
  * How many user pages to map in one call to get_user_pages(). This determines

@@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	prefetch(bdev->bd_queue);
 	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);

+	/* Hot data tracking */
+	if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host) &&
+		iov_length(iov, nr_segs) > 0) {
+		hot_rb_update_freqs(iocb->ki_filp->f_mapping->host,
+				(u64)offset,
+				(u64)iov_length(iov, nr_segs),
+				rw & WRITE);
+	}

That's a bit messy. I'd prefer a static inline function that hides all this. e.g.

Do you think of moving the condition into hot_inode_update_freqs(), not adding another new function?

Moving it into hot_inode_update_freqs() will add a function call overhead even when tracking is not enabled. A static inline function will just result in no extra overhead other than the if statement.

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature
On Thu, Sep 27, 2012 at 02:54:22PM +0800, Zhi Yong Wu wrote: On Thu, Sep 27, 2012 at 12:03 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Fork and run one kernel kthread to calculate that temperature based on some metrics kept in custom frequency data structs, and store the info in the hash table.

No new kthreads, please. Use a per-superblock workqueue and a struct delayed_work to run periodic work on each superblock.

If no new kthread is created, which kthread will work on these delayed_work tasks?

One of the kworker threads that service the workqueue infrastructure.

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'
On Thu, Sep 27, 2012 at 01:25:34PM +0800, Zhi Yong Wu wrote: On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Introduce one new mount option '-o hottrack', and add its parsing support. Its usage looks like:

	mount -o hottrack
	mount -o nouser,hottrack
	mount -o nouser,hottrack,loop
	mount -o hottrack,nouser

I think that this option parsing should be done by the filesystem, even though the tracking functionality is in the VFS. That way only the filesystems that can use the tracking information will turn it on, rather than being able to turn it on for everything regardless of whether it is useful or not. Along those lines, just using a normal superblock flag to indicate it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you don't need to allocate the sb->s_hot_info structure just to be able

If we don't allocate one sb->s_hot_info, where will those hash list heads and btree roots live?

I wrote that thinking (mistakenly) that sb->s_hot_info was dynamically allocated rather than being embedded in the struct super_block. Indeed, if the mount option is held in s_flags, then it could be dynamically allocated, but I don't think that's really necessary...

Cheers, Dave.
-- 
Dave Chinner
da...@fromorbit.com
[PATCH] loop: Make explicit loop device destruction lazy
From: Dave Chinner dchin...@redhat.com

xfstests has always had random failures of tests due to loop devices failing to be torn down and hence leaving filesystems that cannot be unmounted. This causes test runs to immediately stop.

Over the past 6 or 7 years we've added hacks like explicit unmount -d commands for loop mounts, losetup -d after unmount -d fails, etc, but still the problems persist. Recently, the frequency of loop related failures increased again to the point that xfstests 259 will reliably fail with a stray loop device that was not torn down. That is despite the fact the test is about as simple as it gets - loop 5 or 6 times running mkfs.xfs with different parameters:

	lofile=$(losetup -f)
	losetup $lofile $testfile
	$MKFS_XFS_PROG -b size=512 $lofile > /dev/null || echo mkfs failed!
	sync
	losetup -d $lofile

And losetup -d $lofile is failing with EBUSY on 1-3 of these loops every time the test is run.

Turns out that blkid is running simultaneously with losetup -d, and so it sees an elevated reference count and returns EBUSY. But why is blkid running? It's obvious, isn't it? udev has decided to try and find out what is on the block device as a result of a creation notification. And it is racing with mkfs, so it might still be scanning the device when mkfs finishes and we try to tear it down.

So, make losetup -d force autoremove behaviour. That is, when the last reference goes away, tear down the device. xfstests wants it *gone*, not causing random teardown failures when we know that all the operations the tests have specifically run on the device have completed and are no longer referencing the loop device.
Signed-off-by: Dave Chinner dchin...@redhat.com
---
 drivers/block/loop.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 3bba655..187b573 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -976,8 +976,21 @@ static int loop_clr_fd(struct loop_device *lo)
 	if (lo->lo_state != Lo_bound)
 		return -ENXIO;

-	if (lo->lo_refcnt > 1)	/* we needed one fd for the ioctl */
-		return -EBUSY;
+	/*
+	 * If we've explicitly asked to tear down the loop device,
+	 * and it has an elevated reference count, set it for auto-teardown when
+	 * the last reference goes away. This stops $!~#$@ udev from
+	 * preventing teardown because it decided that it needs to run blkid on
+	 * the loopback device whenever they appear. xfstests is notorious for
+	 * failing tests because blkid via udev races with a "losetup
+	 * <dev>"/"do something like mkfs"/"losetup -d <dev>" causing the
+	 * losetup -d command to fail with EBUSY.
+	 */
+	if (lo->lo_refcnt > 1) {
+		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
+		mutex_unlock(&lo->lo_ctl_mutex);
+		return 0;
+	}

 	if (filp == NULL)
 		return -EINVAL;
--
1.7.10
Re: [RFC, PATCH] Extensible AIO interface
On Tue, Oct 02, 2012 at 07:41:10PM -0700, Kent Overstreet wrote: On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote: On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote: On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: Kent Overstreet koverstr...@google.com writes:

So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). A few examples:

* IO scheduler hints. Some userspace program wants to, per IO, specify either priorities or a cgroup - by specifying a cgroup you can have a fileserver in userspace that makes use of cfq's per cgroup bandwidth quotas.

You can do this today by splitting I/O between processes and placing those processes in different cgroups. For io priority, there is ioprio_set, which incurs an extra system call, but can be used. Not elegant, but possible.

Yes - those are things I'm trying to replace. Doing it that way is a real pain, both as it's a lousy interface for this and it does impact performance (ioprio_set doesn't really work too well with aio, too).

* Cache hints. For bcache and other things, userspace may want to specify "this data should be cached", "this data should bypass the cache", etc.

Please explain how you will differentiate this from posix_fadvise.

Oh sorry, I think about SSD caching so much I forget to say that's what I'm talking about. posix_fadvise is for the page cache; we want something different for an SSD cache (IMO it'd be really ugly to use it for both, and posix_fadvise() can't really specify everything we'd want to for an SSD cache).

Similar discussions about posix_fadvise() are being had for marking ranges of files as volatile (i.e. useful for determining what can be evicted from a cache when space reclaim is required).
https://lkml.org/lkml/2012/10/2/501 Hmm, interesting. Speaking as an implementor, though, hints that aren't associated with any specific IO are harder to make use of - the data is already in the cache. What you really want is to know, for a given IO, whether to cache it or not, and possibly where in the LRU to stick it. I can see how it might be useful, but it needs to have a defined set of attributes that a file IO is allowed to have. If you don't define the set, then what you really have is an arbitrary set of storage-device-specific interfaces. Of course, once we have a defined set of per-file IO policy attributes, we don't really need per-IO attributes - you can just set them through a range interface like fadvise() or fallocate(). Well, it's quite possible that different implementations would have no trouble making use of those kinds of hints; I'm no doubt biased by having implemented bcache. With bcache, though, cache replacement is done in terms of the physical address space, not the logical one (i.e. the address space of the device being cached). So to handle posix_fadvise, we'd have to walk the btree, chase pointers to buckets, and modify the bucket priorities up or down... but what about the other data in those buckets? It's not clear what should happen, and there isn't any good way to take that into account. (The exception is dropping data from the cache entirely; we can just invalidate that part of the keyspace and garbage collection will reclaim the buckets it pointed to. Bcache does that for discard requests, currently.) It sounds to me like you are saying that the design of bcache is unsuited to file-level management of caching policy, and that is why you want to pass attributes directly to bcache with specific IOs. Is that a fair summary of the problem you are describing here?
My problem with this approach has nothing to do with its per-IO nature - it's the layering violations and the amount of storage-specific knowledge needed to make effective use of it. i.e. it seems like an interface that can only be used by people intimately familiar with the underlying storage implementation. You happen to be one of those people, so I figure that you don't see a problem with that. ;) However, it also implies that an application must understand and use a specific storage configuration that matches the attributes the application sends. I understand how this model is appealing to Google, because they control the whole application and storage stack (hardware and software) from top to bottom. However, I just don't think it is a solution that the rest of the world can use effectively. The scope of data locality, aging and IO priority policy control is much larger than just controlling SSD caching. SSD caching is just another implementation of automatic tiering, and HSMs have been doing this for years and years. It's the same problem - space management and deciding what to leave in the frequently accessed pool.
Re: VFS hot tracking: How to calculate data temperature?
On Mon, Nov 05, 2012 at 10:35:50AM +0800, Zhi Yong Wu wrote: On Sat, Nov 3, 2012 at 5:27 AM, Mingming.cao c...@us.ibm.com wrote: On Fri, 2012-11-02 at 14:38 +0800, Zhi Yong Wu wrote: There is also another question here: how do we save the file temperature across an umount, so that it is preserved after a reboot? This is a requirement from a DB product. I thought that we could save the file temperature in its inode struct, that is, add one new field to struct inode; this info would then be written to disk with the inode. Any comments or ideas are appreciated, thanks. Maybe we could save the last file temperature with extended attributes. It seems that only ext4 has the concept of extended attributes. All major filesystems have xattr support. They are used extensively by the security and integrity subsystems, for example. Saving the information might be something that is useful to certain applications, but let's have the people that need that functionality spell out their requirements before discussing how or what to implement. Indeed, discussion should really focus on getting the core, in-memory infrastructure sorted out first before trying to expand the functionality further... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Problem with DISCARD and RAID5
On Fri, Nov 02, 2012 at 09:40:58AM +0800, Shaohua Li wrote: On Thu, Nov 01, 2012 at 05:38:54PM +1100, NeilBrown wrote: Hi Shaohua, I've been doing some testing and discovered a problem with your discard support for RAID5. The code in blkdev_issue_discard assumes that the 'granularity' is a power of 2, and for example subtracts 1 to get a mask. However RAID5 sets the granularity to be the stripe size, which often is not a power of two. When this happens you can easily get into an infinite loop. I suspect that to make this work properly, blkdev_issue_discard will need to be changed to allow 'granularity' to be an arbitrary value. When it is a power of two, the current masking can be used. When it is anything else, it will need to use sector_div(). Yep, it looks like we need to use sector_div. And this isn't the only problem. Discard requests can be merged, and the merge check only checks max_discard_sectors. That means the split requests in blkdev_issue_discard can be merged again. The split never works. I'm wondering what the purpose of discard_alignment and discard_granularity is. Are there devices with a discard_granularity that is not 1 sector? Most certainly. Thin provisioned storage often has granularity in the order of megabytes. If a bio isn't discard aligned, what will the device do? That is up to the device. Further, why should the driver handle alignment/granularity if the device will ignore misaligned requests? When you send a series of sequential unaligned requests, the device may ignore them all. Hence you end up with nothing being discarded, even though the entire range being discarded is much, much larger than the discard granularity. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] xfs: add hot tracking support.
On Wed, Nov 07, 2012 at 04:38:23PM +0800, Zhi Yong Wu wrote: Hi, Dave, I guess that you should add some hot tracking stuff in some xfs_show_xxx function, right? Yes, it should - I thought I did that. I recall seeing it in /proc/mounts, but maybe I was just hallucinating. I'll send an updated version when I get to fixing it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
load that causes write latency to occur. Testing with lots of clients introduces all of these things, and that will greatly impact server behaviour. Aggregation in memory isolates a lot of this variation from writeback and hence smooths out a lot of the variability that leads to fragmentation, seeks, latency spikes and premature filesystem aging. That is, if you set a 100MB dirty_bytes limit on a bdi, it will give really good buffering for a single client doing a streaming write. If you've got 10 clients, then assuming fair distribution of server resources, that is 10MB per client per writeback trigger. That's a line-ball call as to whether it will cause fragmentation severe enough to impact server throughput. If you've got 100 clients, then that's only 1MB per client per writeback trigger, and that's definitely too low to maintain decent writeback behaviour. i.e. you're now writing 100 files 1MB at a time, and that tends towards random IO patterns rather than sequential IO patterns. Seek time determines throughput, not IO bandwidth limits. IOWs, as the client count goes up, the writeback patterns will tend more towards random IO than sequential IO unless the amount of buffering allowed before writeback triggers also grows. That's important, because random IO is much slower than sequential IO. What I'd like to have is some insight into whether this patch changes that inflection point, for better or for worse. The only way to find that is to run multi-client testing. 5. Are the improvements consistent across different filesystem types? We've had writeback changes in the past cause improvements on one filesystem but significant regressions on others. I'd suggest that you need to present results for ext4, XFS and btrfs so that we have a decent idea of what we can expect from the change to the generic code. As mentioned in Tables 1 & 2 above, the performance gain in WRITE speed is different on different file systems, i.e. different on NFS clients over XFS & EXT4.
We also tried BTRFS over NFS, but we could not see any WRITE speed performance gain/degradation on BTRFS over NFS, so we are not posting the BTRFS results here. You should post btrfs numbers even if they show no change. It wasn't until I got this far that I even realised that you'd even tested BTRFS. I don't know what to make of this, because I don't know what the throughput rates compared to XFS and EXT4 are. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 00/22] loop: Issue O_DIRECT aio using bio_vec
On Mon, Oct 22, 2012 at 10:15:00AM -0500, Dave Kleikamp wrote: This is the current version of the patchset I presented at the LSF-MM Summit in San Francisco in April. I apologize for letting it go so long before re-submitting. This patchset was begun by Zach Brown and was originally submitted for review in October, 2009. Feedback was positive, and I have picked up where he left off, porting his patches to the latest mainline kernel and adding support for more file systems. This patch series adds a kernel interface to fs/aio.c so that kernel code can issue concurrent asynchronous IO to file systems. It adds an aio command and file system methods which specify IO memory with pages instead of userspace addresses. This series was written to reduce the overhead loop currently imposes by performing synchronous buffered file system IO from a kernel thread. These patches turn loop into a lightweight layer that translates bios into iocbs. I note that there is no support for XFS in this patch set. Is there a particular problem that prevents XFS from being converted, or has it just not been done? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 00/22] loop: Issue O_DIRECT aio using bio_vec
On Mon, Oct 22, 2012 at 07:53:40PM -0500, Dave Kleikamp wrote: On 10/22/2012 07:07 PM, Dave Chinner wrote: On Mon, Oct 22, 2012 at 10:15:00AM -0500, Dave Kleikamp wrote: This is the current version of the patchset I presented at the LSF-MM Summit in San Francisco in April. I apologize for letting it go so long before re-submitting. This patchset was begun by Zach Brown and was originally submitted for review in October, 2009. Feedback was positive, and I have picked up where he left off, porting his patches to the latest mainline kernel and adding support for more file systems. This patch series adds a kernel interface to fs/aio.c so that kernel code can issue concurrent asynchronous IO to file systems. It adds an aio command and file system methods which specify IO memory with pages instead of userspace addresses. This series was written to reduce the overhead loop currently imposes by performing synchronous buffered file system IO from a kernel thread. These patches turn loop into a lightweight layer that translates bios into iocbs. I note that there is no support for XFS in this patch set. Is there a particular problem that prevents XFS from being converted, or has it just not been done? It just hasn't been done. It wasn't one of the trivial ones, so I put it off at first, and after that, it was an oversight. I'll follow up with an xfs patch for your review. Thanks Shaggy, that's all I wanted to know. No extra work for me (apart from review and testing) is fine by me. ;) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: Hi, Recently we ran into the bug that an opened file's ra_pages does not synchronize with its backing device's when the latter is changed with blockdev --setra; the application needs to reopen the file to know about the change, or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readahead window to the (new) bdi default, which is inappropriate under our circumstances. Which are? We don't know your circumstances, so you need to tell us why you need this and why the existing methods of handling such changes are insufficient... Optimal readahead windows tend to be a physical property of the storage, and that does not tend to change dynamically. Hence block device readahead should only need to be set up once, and generally that can be done before the filesystem is mounted and files are opened (e.g. via udev rules). Hence you need to explain why you need to change the default block device readahead on the fly, and why fadvise(POSIX_FADV_NORMAL) is inappropriate to set readahead windows to the new defaults. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 11/16] f2fs: add inode operations for special inodes
On Wed, Oct 17, 2012 at 12:50:11PM +0000, Arnd Bergmann wrote: On Tuesday 16 October 2012, Jaegeuk Kim wrote: IIRC, f2fs uses 4k inodes, so IMO per-inode xattr trees with internal storage before spilling to an external block is probably the best approach to take... Yes, indeed this is the best approach for f2fs's xattrs. Apart from giving fs hints, it is worthwhile optimizing later. I've thought a bit more about how this could be represented efficiently in 4KB nodes. This would require a significant change to the way you represent inodes, but can improve a number of things at the same time. The idea is to replace the fixed area in the inode that contains block pointers with an extensible TLV (type/length/value) list that can contain multiple variable-length fields, like this. You've just re-invented inode forks... ;) All TLVs together with the fixed-length inode data can fill a 4KB block. The obvious types would be: * Direct file contents if the file is less than a block * List of block pointers, as before, minimum 1, maximum until the end of the block * List of indirect pointers, now also of variable length, similar to the list of block pointers * List of double-indirect block pointers * Direct xattr: zero-terminated attribute name followed by contents * Indirect xattr: zero-terminated attribute name followed by up to 16 block pointers to store a maximum of 64KB sized xattrs. This could be extended later to cover additional types, e.g. a list of erase block pointers, triple-indirect blocks or extents. An inode fork doesn't care about the data in it - it's just an independent block mapping index, i.e. inline, direct, indirect, double indirect. The data in the fork is managed externally to the format of the fork. e.g. XFS has two forks - one for storing data (file data, directory contents, etc) and the other for storing attributes. The main issue with supporting an arbitrary number of forks is space management of the inode literal area. e.g.
one fork is in inline format (e.g. direct file contents) and then we add an attribute. The attribute won't fit inline, nor will an extent format fork header, so the inline data fork has to be converted to extent format before the xattr can be added. Now scale that problem up to an arbitrary number of forks... As a variation of this, it would also be nice to turn around the order in which the pointers are walked, to optimize for space and for growing files, rather than for reading the beginning of a file. With this, you can represent a 9KB file using a list of two block pointers and 1KB of direct data, all in the inode. When the user adds another byte, you only need to rewrite the inode. Similarly, a 5MB file would have a single indirect node (covering block pointers for 4MB), plus 256 separate block pointers (covering the last megabyte), and a 5GB file can be represented using 1 double-indirect node and 256 indirect nodes, and each of them can still be followed by direct tail data and extended attributes. I'm not sure that the resultant code complexity is worth saving an extra block here and there. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote: On Fri, 19 Oct 2012, Peter Zijlstra wrote: Yes, I tried this approach - it involves doing a LOCK instruction on read lock, remembering the cpu and doing another LOCK instruction on read unlock (which will hopefully be on the same CPU, so no cacheline bouncing happens in the common case). It was slower than the approach without any LOCK instructions (43.3 seconds for the implementation with per-cpu LOCKed access, 42.7 seconds for this implementation without atomic instructions; the benchmark involved doing 512-byte direct-io reads and writes on a ramdisk with 8 processes on an 8-core machine). So why is that a problem? Surely that's already tons better than what you've currently got. Percpu rw-semaphores do not improve performance at all. I put them there to avoid a performance regression, not to improve performance. All Linux kernels have a race condition - when you change the block size of a block device and you read or write the device at the same time, a crash may happen. This bug has been there forever. Recently, this bug started to cause major trouble - multiple high-profile business sites report crashes because of this race condition. You can fix this race by using a read lock around the I/O paths and a write lock around block size changing, but a normal rw semaphore causes cache line bouncing when taken for read by multiple processors, and the I/O performance degradation because of it is measurable. This doesn't sound like a new problem. Hasn't this global access, single modifier exclusion problem been solved before in the VFS? e.g. mnt_want_write()/mnt_make_readonly() Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
[ Folks, can you trim your responses down to just quote the part you are responding to? Having to repeatedly scroll through 500 lines of irrelevant text just to find the 5 lines being commented on is exceedingly painful. ] On Tue, Oct 09, 2012 at 09:01:18PM +0900, Jaegeuk Kim wrote: From: Lukáš Czerner [mailto:lczer...@redhat.com] I am sorry, but this reply makes me smile. How can you design a fs relying on time-attack heuristics to figure out what the proper layout should be? Or even endorse such heuristics to be used in mkfs? What we should be focusing on is pushing vendors to actually give us such information so we can properly propagate that throughout the kernel - that's something everyone will benefit from. After that the optimization can be done in every file system. Frankly speaking, I agree that it would be the right direction eventually. But, as you know, it's very difficult for all flash vendors to promote and standardize that, because each vendor has a different strategy about what internal information to open, and they all try to protect their secrets, whatever they are. IMO, we don't need to wait for them now. Instead, from the start, I suggest that f2fs use that information in the file system design. In addition, I suggest using heuristics right now as a best effort. And in response, other people are suggesting that this is the wrong approach. Maybe in future, if vendors give us something, f2fs would be more feasible. In the meantime, I strongly hope to validate and stabilize f2fs with the community. Do not get me wrong, I do not think it is worth waiting for vendors to come to their senses, but it is worth constantly reminding them that we *need* this kind of information, and those heuristics are not feasible in the long run anyway.
I believe that this conversation has happened several times already, but what about having an independent public database of all the internal information about hw from different vendors, where users can add information gathered by the time-attack heuristic so others do not have to run it again and again? I am not sure if Linaro or someone else has something like that; someone can maybe post a link to it. Linaro already have one, which is another reason why using heuristics is the wrong approach: https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey?action=show&redirect=WorkingGroups%2FKernelConsolidation%2FProjects%2FFlashCardSurvey As I mentioned, I agree with pushing vendors to open that information, all the time. And I absolutely didn't mean that it is worth waiting for vendors. I meant that, until that information is opened by vendors, something like proposing f2fs or gathering heuristics is also needed simultaneously. Anyway, it's very interesting to build a database gathering products' information. May I access the database? It's public information. If you want to support different types of flash, then either add your timing-attack derived information on specific hardware to the above table, or force vendors to update it themselves if they want their flash memory supported by this filesystem. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote: On Tue, 9 Oct 2012, Jan Kara wrote: On Mon 08-10-12 21:24:40, Hugh Dickins wrote: On Mon, 1 Oct 2012, Jan Kara wrote: On s390 any write to a page (even from the kernel itself) sets the architecture-specific page dirty bit. Thus when a page is written to via a standard write, the HW dirty bit gets set, and when we later map and unmap the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty(). Dirtying of a page which shouldn't be dirty can cause all sorts of problems for filesystems. The bug we observed in practice is that buffers from the page get freed, so when the page later gets marked dirty and writeback writes it, XFS crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called from xfs_count_page_state(). What changed recently? Was XFS hardly used on s390 until now? The problem was originally hit on SLE11-SP2, which is 3.0 based, after migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think XFS just started to be more peevish about what pages it gets between these two releases ;) (e.g. ext3 or ext4 just says oh, well and fixes things up). Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case, whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the ASSERT usually compiled out, stumbling later in page_buffers() as you say. What that says is that no-one is running xfstests-based QA on s390 with CONFIG_XFS_DEBUG enabled, otherwise this would have been found. I've never tested XFS on s390 before, and I doubt any of the upstream developers have either, because not many people have s390 machines in their basement. So this is probably just an oversight in the distro QA environment more than anything. Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: Hi Dave, On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner da...@fromorbit.com wrote: On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: Hi, Recently we ran into the bug that an opened file's ra_pages does not synchronize with its backing device's when the latter is changed with blockdev --setra; the application needs to reopen the file to know about the change, or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readahead window to the (new) bdi default, which is inappropriate under our circumstances. Which are? We don't know your circumstances, so you need to tell us why you need this and why the existing methods of handling such changes are insufficient... Optimal readahead windows tend to be a physical property of the storage, and that does not tend to change dynamically. Hence block device readahead should only need to be set up once, and generally that can be done before the filesystem is mounted and files are opened (e.g. via udev rules). Hence you need to explain why you need to change the default block device readahead on the fly, and why fadvise(POSIX_FADV_NORMAL) is inappropriate to set readahead windows to the new defaults. Our system is a fuse-based file system; fuse creates a pseudo backing device for the user space file systems. The default readahead size is 128KB and it can't fully utilize the backing storage's read ability, so we should tune it. Sure, but that doesn't tell me anything about why you can't do this at mount time before the application opens any files. i.e. you've simply stated the reason why readahead is tunable, not why you need it to be fully dynamic. The third-party application using our file system maintains some long-opened files, and we do not have any chance to force them to call fadvise(POSIX_FADV_NORMAL). :( So raise a bug/feature request with the third party.
Modifying kernel code because you can't directly modify the application isn't the best solution for anyone. This really is an application problem - the kernel already provides the mechanisms to solve this problem... :/ Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote: On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: Hi Dave, On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner da...@fromorbit.com wrote: On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: Hi, Recently we ran into the bug that an opened file's ra_pages does not synchronize with its backing device's when the latter is changed with blockdev --setra; the application needs to reopen the file to know about the change, or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readahead window to the (new) bdi default, which is inappropriate under our circumstances. Which are? We don't know your circumstances, so you need to tell us why you need this and why the existing methods of handling such changes are insufficient... Optimal readahead windows tend to be a physical property of the storage, and that does not tend to change dynamically. Hence block device readahead should only need to be set up once, and generally that can be done before the filesystem is mounted and files are opened (e.g. via udev rules). Hence you need to explain why you need to change the default block device readahead on the fly, and why fadvise(POSIX_FADV_NORMAL) is inappropriate to set readahead windows to the new defaults. Our system is a fuse-based file system; fuse creates a pseudo backing device for the user space file systems. The default readahead size is 128KB and it can't fully utilize the backing storage's read ability, so we should tune it. Sure, but that doesn't tell me anything about why you can't do this at mount time before the application opens any files. i.e. you've simply stated the reason why readahead is tunable, not why you need it to be fully dynamic. We store our file system's data on different disks, so we need to change ra_pages dynamically according to where the data resides; it can't be fixed at mount time or when we open files.
That doesn't make a whole lot of sense to me. Let me try to get this straight. There is data that resides on two devices (A + B), and a fuse filesystem to access that data. A single file in the fuse fs has data on both devices. An app has the file open, and when the data it is accessing is on device A you need to set the readahead to what is best for device A? And when the app tries to access data for that file that is on device B, you need to set the readahead to what is best for device B? And you are changing the fuse BDI readahead settings according to where the data in the back end lies? It seems to me that you should be setting the fuse readahead to the maximum of the readahead windows the data devices have configured at mount time and leaving it at that. The abstract bdi of fuse and btrfs provides some dynamically changing bdi.ra_pages based on the real backing device. IMHO this should not be ignored. btrfs simply takes into account the number of disks it has for a given storage pool when setting up the default bdi ra_pages during mount. This is basically doing what I suggested above. Same with the generic fuse code - it's simply setting a sensible default value for the given fuse configuration. Neither is dynamic in the sense you are talking about, though. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote: On Wed, 24 Oct 2012, Dave Chinner wrote: On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote: On Fri, 19 Oct 2012, Peter Zijlstra wrote: Yes, I tried this approach - it involves doing a LOCK instruction on read lock, remembering the cpu and doing another LOCK instruction on read unlock (which will hopefully be on the same CPU, so no cacheline bouncing happens in the common case). It was slower than the approach without any LOCK instructions (43.3 seconds for the implementation with per-cpu LOCKed access, 42.7 seconds for this implementation without atomic instructions; the benchmark involved doing 512-byte direct-io reads and writes on a ramdisk with 8 processes on an 8-core machine). So why is that a problem? Surely that's already tons better than what you've currently got. Percpu rw-semaphores do not improve performance at all. I put them there to avoid a performance regression, not to improve performance. All Linux kernels have a race condition - when you change the block size of a block device and you read or write the device at the same time, a crash may happen. This bug has been there forever. Recently, this bug started to cause major trouble - multiple high-profile business sites report crashes because of this race condition. You can fix this race by using a read lock around the I/O paths and a write lock around block size changing, but a normal rw semaphore causes cache line bouncing when taken for read by multiple processors, and the I/O performance degradation because of it is measurable. This doesn't sound like a new problem. Hasn't this global access, single modifier exclusion problem been solved before in the VFS? e.g. mnt_want_write()/mnt_make_readonly() Cheers, Dave. Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw semaphores. I think you could convert mnt_want_write()/mnt_make_readonly() to use percpu rw semaphores and remove the duplicated code.
I think you misunderstood my point - that rather than re-inventing the wheel, why didn't you just copy something that is known to work? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: Hi Chen, But how can the bdi related ra_pages reflect different files' readahead windows? Maybe these different files are sequential reads, random reads and so on. It's simple: sequential reads will get the ra_pages readahead size while random reads will not get readahead at all. Talking about the below chunk, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than a general solution to me: if the user needs POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi->ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi->ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency.

--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_SEQUENTIAL:
-		file->f_ra.ra_pages = bdi->ra_pages * 2;

I think we really have to reset file->f_ra.ra_pages here as it is not a set-and-forget value. e.g. shrink_readahead_size_eio() can reduce ra_pages as a result of IO errors. Hence if you have had IO errors, telling the kernel that you are now going to do sequential IO should reset the readahead to the maximum ra_pages value supported. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page
On Sat, Oct 27, 2012 at 06:16:26PM -0400, Theodore Ts'o wrote: On Fri, Oct 26, 2012 at 10:24:23PM +, Luck, Tony wrote: Well, we could set a new attribute bit on the file which indicates that the file has been corrupted, and this could cause any attempts to open the file to return some error until the bit has been cleared. That sounds a lot better than renaming/moving the file. What I would recommend is adding a #define FS_CORRUPTED_FL 0x0100 /* File is corrupted */ ... which could be accessed and cleared via the lsattr and chattr programs. Except that there are filesystems that cannot implement such flags, or require on-disk format changes to add more of those flags. This is most definitely not filesystem specific behaviour, so any sort of VFS level per-file state needs to be kept in xattrs, not special flags. Filesystems are welcome to optimise the storage of such special xattrs (e.g. down to a single boolean flag in an inode), but using an xattr that could, in fact, store the exact offset and length of the corruption is far better than just storing a "something is corrupted in this file" bit. Application programs could also get very confused when any attempt to open or read from a file suddenly returned some new error code (EIO, or should we designate a new errno code for this purpose, so there is a better indication of what the heck was going on?) EIO sounds wrong ... but it is perhaps the best of the existing codes. Adding a new one is also challenging. I think we really need a different error code from EIO; it's horribly overloaded already, and if customers get confused by this new behavior and call up the distribution help desk, they won't thank us if we further overload EIO. This is abusing one of the System V stream errnos, but no one else is using it: #define EADV 68 /* Advertise error */ I note that we've already added a new error code: #define EHWPOISON 133 /* Memory page has hardware error */ ...
although the glibc shipping with Debian testing hasn't been taught what it is, so strerror(EHWPOISON) returns "Unknown error 133". We could simply allow open(2) and stat(2) to return this error, although I wonder if we're just better off defining a new error code. If we are going to add special new file corrupted errors, we should add EFSCORRUPTED (i.e. filesystem corrupted) at the same time. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC v4 03/15] vfs,hot_track: add the function for collecting I/O frequency
On Sun, Oct 28, 2012 at 09:51:48PM +0800, Zhi Yong Wu wrote: On Sun, Oct 28, 2012 at 3:55 PM, Zheng Liu gnehzuil@gmail.com wrote: Hi Zhiyong, On Thu, Oct 25, 2012 at 11:08:55PM +0800, zwu.ker...@gmail.com wrote: [snip] @@ -199,6 +342,54 @@ err: } /* + * Main function to update access frequency from read/writepage(s) hooks + */ +inline void hot_update_freqs(struct inode *inode, u64 start, + u64 len, int rw) This function seems too big. Do we really need to inline this function? As Dave said in his comments, it will add a function call overhead even when tracking is not enabled. A static inline function will just result in no extra overhead other than the if statement. I don't think I said that with respect to this code. I think I said it w.r.t. a define or a small wrapper that decides to call hot_update_freqs(). A static inline function should only be a couple of lines of code at most. A static function, OTOH, can be inlined by the compiler if the compiler thinks that is a win. But... +EXPORT_SYMBOL_GPL(hot_update_freqs); ... it's an exported function, so it can't be inline or static, so using inline is wrong whatever way you look at it. ;) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
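The pattern Dave is describing - a tiny static inline wrapper that keeps the fast path to a single branch, calling the out-of-line (exported) function only when tracking is actually enabled - looks roughly like this. The names are illustrative, not the actual patchset code:

```c
#include <stdbool.h>

static int update_calls; /* stands in for the real bookkeeping */

/* out-of-line implementation: the big function, only reached when
 * tracking is enabled (in the kernel this would be the exported symbol) */
void __hot_update_freqs(void)
{
    update_calls++;
    /* ... heavy frequency-tracking work ... */
}

/* small static inline wrapper: callers on the read/write paths pay
 * only for one predictable branch when tracking is disabled */
static inline void hot_update_freqs(bool tracking_enabled)
{
    if (tracking_enabled)
        __hot_update_freqs();
}
```

This gets the best of both: zero call overhead on the disabled path, while the compiler remains free to inline the out-of-line body where it judges that a win.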
Re: xfs: Use bool type rather than a custom boolean_t type.
On Mon, Nov 12, 2012 at 09:36:17PM -0200, Thiago Farina wrote: Hi, Please, take a look. Patch attached. It's a good start for a cleanup, but there's no point in removing the boolean_t from one file and then not removing it from the rest of the XFS code. It's only used in a handful of places, so just remove it completely. Also, can you please place patches in line rather than attaching them. Attaching them means they cannot be quoted in reply. See Documentation/SubmittingPatches and Documentation/email-clients.txt for guidelines. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: xfs: Remove boolean_t typedef completely.
On Mon, Nov 12, 2012 at 10:36:47PM -0200, Thiago Farina wrote: On Mon, Nov 12, 2012 at 10:24 PM, Dave Chinner da...@fromorbit.com wrote: On Mon, Nov 12, 2012 at 09:36:17PM -0200, Thiago Farina wrote: Hi, Please, take a look. Patch attached. It's a good start for a cleanup, but there's no point in removing the boolean_t from one file and then not removing it from the rest of the XFS code. It's only used in a handful of places, so just remove it completely. Done. Please, take another look. Also, can you please place patches in line rather than attaching them. Attaching them means they cannot be quoted in reply. See Documentation/SubmittingPatches and Documentation/email-clients.txt for guidelines. Sorry, patch attached. It's still attached as a base64 encoded attachment, not as inline text. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Local DoS through write heavy I/O on CFQ Deadline
On Thu, Oct 11, 2012 at 01:23:32PM +0100, Alex Bligh wrote: We have noticed significant I/O scheduling issues on both the CFQ and the deadline scheduler where a non-root user can starve any other process of any I/O for minutes at a time. The problem is more serious using CFQ but is still an effective local DoS vector using Deadline. A simple way to generate the problem is:

dd if=/dev/zero of=- bs=1M count=5 | dd if=- of=myfile bs=1M count=5

(note the use of 2 dd's is to avoid an alleged optimisation of the writing dd from /dev/zero). zcat-ing a large file with stdout redirected to a file produces a similar effect. Using ionice to set idle priority makes no difference. To instrument the problem we produced a python script which does a MySQL select and update every 10 seconds, and times the execution of the update. This is normally milliseconds, but under user generated load conditions, we can take this to indefinite (on CFQ) and over a minute (on deadline). Postgres is affected in a similar manner (i.e. it is not MySQL specific). Simultaneously we have captured the output of 'vmstat 1 2' and /proc/meminfo, with appropriate timestamps. Well, mysql is stuck in fsync(), so of course it's going to have problems with write latency:

[ 3840.268303] [812650d5] jbd2_log_wait_commit+0xb5/0x130
[ 3840.268308] [8108aa50] ? add_wait_queue+0x60/0x60
[ 3840.268313] [81211248] ext4_sync_file+0x208/0x2d0

And postgres gets stuck there too. So what you are seeing is likely an ext4 problem, not an IO scheduler problem. Suggestion: try the same test with XFS. If the problem still exists, then it *might* be an ioscheduler problem. If it goes away, then it's an ext4 problem. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com FS_IOC_GET_HEAT_INFO: return a struct containing the various metrics collected in btrfs_freq_data structs, and also return a calculated data temperature based on those metrics. Optionally, retrieve the temperature from the hot data hash list instead of recalculating it. FS_IOC_GET_HEAT_OPTS: return an integer representing the current state of hot data tracking and migration: 0 = do nothing, 1 = track frequency of access. FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and migration, as described above. ...

+struct hot_heat_info {
+	__u64 avg_delta_reads;
+	__u64 avg_delta_writes;
+	__u64 last_read_time;
+	__u64 last_write_time;
+	__u32 num_reads;
+	__u32 num_writes;
+	__u32 temperature;
+	__u8 live;
+	char filename[PATH_MAX];

Don't put the filename in the ioctl and open the file in the kernel. Have userspace open the file directly and issue the ioctl on the fd that is returned. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
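The interface shape Dave is suggesting - userspace opens the file and issues the ioctl on the resulting fd, so no pathname travels through the ioctl struct - looks like this from the application side. The request number and struct here are hypothetical (the real FS_IOC_GET_HEAT_INFO is still being designed in this thread), so on an ordinary file the call simply fails with ENOTTY:

```c
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdint.h>

/* hypothetical heat-info payload: note there is no filename member -
 * the fd the ioctl is issued on already identifies the file */
struct heat_info {
    uint64_t avg_delta_reads;
    uint64_t avg_delta_writes;
    uint32_t temperature;
};

/* made-up request number, for illustration only */
#define FS_IOC_GET_HEAT_INFO _IOR('f', 0x42, struct heat_info)

int get_heat_info(const char *path, struct heat_info *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ret = ioctl(fd, FS_IOC_GET_HEAT_INFO, out); /* ioctl on the fd */
    close(fd);
    return ret;
}
```

Because the kernel resolves the fd, this pattern also works when the caller has only a file handle and no pathname at all - which is the xfs_fsr-style use case mentioned later in the thread.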
Re: [RFC v3 12/13] vfs: add debugfs support
On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add a /sys/kernel/debug/hot_track/device_name/ directory for each volume that contains two files. The first, `inode_data', contains the heat information for inodes that have been brought into the hot data map structures. The second, `range_data', contains similar information for subfile ranges.

+	/* create debugfs range_data file */
+	debugfs_range_entry = debugfs_create_file("range_data",
+			S_IFREG | S_IRUSR | S_IWUSR | S_IRUGO,
+			debugfs_volume_entry,
+			(void *) range_data,
+			&hot_debugfs_range_fops);

These should not be world readable. 0600 is probably the correct permissions for them as we do not want random users to be able to infer what files users are accessing from this information. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC v3 12/13] vfs: add debugfs support
On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add a /sys/kernel/debug/hot_track/device_name/ directory for each volume that contains two files. The first, `inode_data', contains the heat information for inodes that have been brought into the hot data map structures. The second, `range_data', contains similar information for subfile ranges. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c | 462 + fs/hot_tracking.h | 43 + 2 files changed, 505 insertions(+), 0 deletions(-) ...

+static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int len)
+{
+	struct lstring *debugfs_log = data->debugfs_log;
+	uint new_log_alloc_size;
+	char *new_log;
+	static char err_msg[] = "No more memory!\n";
+
+	if (len >= data->log_alloc_size - debugfs_log->len) {
...
+	}
+
+	memcpy(debugfs_log->str + debugfs_log->len, data->log_work_buff, len);
+	debugfs_log->len += (unsigned long) len;
+
+	return len;
+}
+
+/* Returns the number of bytes written to the log. */
+static int hot_debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...)
+{
+	struct lstring *debugfs_log = data->debugfs_log;
+	va_list args;
+	int len;
+	static char trunc_msg[] =
+		"The next message has been truncated.\n";
+
+	if (debugfs_log->str == NULL)
+		return -1;
+
+	spin_lock(&data->log_lock);
+
+	va_start(args, fmt);
+	len = vsnprintf(data->log_work_buff,
+			sizeof(data->log_work_buff), fmt, args);
+	va_end(args);
+
+	if (len >= sizeof(data->log_work_buff)) {
+		hot_debugfs_copy(data, trunc_msg, sizeof(trunc_msg));
+	}
+
+	len = hot_debugfs_copy(data, data->log_work_buff, len);
+	spin_unlock(&data->log_lock);
+
+	return len;
+}

Aren't you just recreating seq_printf() here? i.e. can't you replace all this complexity with generic seq_file/seq_operations constructs? Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [RFC v3 00/13] vfs: hot data tracking
On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com NOTE: The patchset is currently posted out mainly to make sure it is going in the correct direction and hope to get some helpful comments from other guys. For more information, please check hot_tracking.txt in Documentation TODO List: 1) Fix OOM issues - the hot inode tracking caches grow very large and don't get trimmed under memory pressure. From slabtop, after creating roughly 24 million single byte files(*) on a machine with 8GB RAM:

    OBJS   ACTIVE  USE OBJ SIZE   SLABS OBJ/SLAB CACHE SIZE NAME
23859510 23859476  99%    0.12K  795317       30   3181268K hot_range_item
23859441 23859439  99%    0.16K 1037367       23   4149468K hot_inode_item
  572530   572530 100%    0.55K   81790        7    327160K radix_tree_node
  241706   241406  99%    0.22K   14218       17     56872K xfs_ili
  241206   241204  99%    1.06K   80402        3    321608K xfs_inode

The inode tracking is trying to track all 24 million inodes even though they have been written only once, and there are only 240,000 inodes in the cache at this point in time. That was the last update that slabtop got, so it is indicative of the impending OOM situation that occurred. Changelog from v2: 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner] 2.) Added memory shrinker [Dave Chinner] I haven't looked at the shrinker, but clearly it is not working, otherwise the above OOM situation would not be occurring. Cheers, Dave.
(*) Tested on an empty 17TB XFS filesystem with:

$ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
meta-data=/dev/vdc               isize=256    agcount=17, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=4563402735, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
$ sudo chmod 777 /mnt/scratch
$ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \
	/mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \
	/mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \
	/mnt/scratch/6 -d /mnt/scratch/7
.
	0 21601 16679.3 12552262
	0 22401 15412.4 12588587
	0 23201 16367.6 14199322
	0 24001 15680.4 15741205
	hangs here w/ OOM

-- Dave Chinner da...@fromorbit.com
Re: [PATCH 11/16] f2fs: add inode operations for special inodes
On Sun, Oct 14, 2012 at 03:19:37PM +, Arnd Bergmann wrote: On Sunday 14 October 2012, Vyacheslav Dubeyko wrote: On Oct 14, 2012, at 11:09 AM, Jaegeuk Kim wrote: 2012-10-14 (일), 02:21 +0400, Vyacheslav Dubeyko: Extended attributes are a more flexible way, from my point of view. The xattr gives the possibility to make a hint to the filesystem at any time and without any dependencies on the application's functional opportunities. A documented way of using such extended attributes gives the user a flexible way of manipulating filesystem behavior (but I remember that you don't believe in a user :-)). So, I think that fadvise() and extended attributes can be complementary solutions. Right. Another option is to have ext4 style attributes, see http://linux.die.net/man/1/chattr Xattrs are much preferred over more ext4 style flags because xattrs are filesystem independent. Indeed, some filesystems can't store any new ext4 style flags without a change of disk format or internally mapping them to an xattr. So really, xattrs are the best way forward for such hints. Unlike extended attributes, there is a limited number of those, and they can only be boolean flags, but that might be enough for this particular use case. A boolean is not sufficient for access policy hints. An extensible xattr format is probably the best approach to take here, so that we can easily introduce new access policy hints as functionality is required. Indeed, an extensible xattr could start with just a hot/cold boolean, and grow from there. The main reason I can see against extended attributes is that they are not stored very efficiently in f2fs, unless a lot of work is put into coming up with a good implementation. A single flags bit can trivially be added to the inode in comparison (if it's not there already). That's a deficiency that should be corrected, then, because xattrs are very common these days.
And given that stuff like access frequency tracking is being implemented at the VFS level, access policy hints should also be VFS functionality. A bad filesystem implementation should not dictate the interface for generically useful functionality. Anyway, hardcoding or saving a list of file extensions in the filesystem is a nasty way. It can be unsafe, or hardly understandable by users, to reconfigure the filesystem by means of tunefs or debugfs to add file extensions in such a black box as a TV or smartphone, from my point of view. It is only a performance hint though, so it is not a correctness issue if the file system gets it wrong. In order to do efficient garbage collection, a log structured file system should take all the information it can get about the expected life of the data it writes. I agree that the list, even in the form of mkfs time settings, is not a clean abstraction, but in the place of an Android phone manufacturer I would still enable it if it promises a significant performance advantage over not using it. I guess it would be nice if this could be overridden in some form, e.g. using an ioctl on the file as ext4 does. An xattr on the root inode that holds a list like this is something that could be set at mkfs time, but then also updated easily by new software packages that are installed... We should also take the kinds of access we have seen on a file into account. Yes, but it should be done at the VFS level, not in the filesystem itself. Integrated into the current hot inode/range tracking that is being worked on right now, I'd suggest. IOWs, these access policy issues are not unique to F2FS or its use case. Anything to do with access hints, policy, tracking, file classification, etc that can influence data locality, reclaim, migration, etc needs to be dealt with at the VFS, independently of a specific filesystem.
Filesystems can make use of that information how they please (whether in the kernel or via userspace tools), but having filesystem specific interfaces and implementations of the same functionality is extremely wasteful. Let's do it once, and do it right the first time. ;) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
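The filesystem-independent xattr approach being advocated is easy to drive from userspace with setxattr(2)/getxattr(2). A sketch, with a made-up attribute name ("user.hint.hot") standing in for whatever namespace such hints would actually use - note the calls fail with ENOTSUP on filesystems without user.* xattr support:

```c
#include <sys/xattr.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

/* attach a hypothetical access-policy hint to a file as a user xattr;
 * returns 0 on success, -1 with errno set otherwise */
int set_hot_hint(const char *path, const char *value)
{
    return setxattr(path, "user.hint.hot", value, strlen(value), 0);
}

/* read the hint back; returns the number of bytes read, or -1 */
ssize_t get_hot_hint(const char *path, char *buf, size_t buflen)
{
    return getxattr(path, "user.hint.hot", buf, buflen);
}
```

Because the value is an arbitrary byte string rather than a single bit, the same interface can grow from a hot/cold boolean to structured hints (expected lifetime, access pattern) without any on-disk format change - which is exactly the extensibility argument made above.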
Re: Write is not atomic?
On Mon, Oct 15, 2012 at 11:36:15PM +0200, Juliusz Chroboczek wrote: Hi, The Linux manual page for write(2) says: The adjustment of the file offset and the write operation are performed as an atomic step. That's wrong. The file offset update is not synchronised at all with the write, and for a shared fd the update will race. This is apparently an extension to POSIX, which says: This volume of IEEE Std 1003.1-2001 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control. This is how Linux behaves. The following fragment of code:

    int fd;
    fd = open("exemple", O_CREAT | O_WRONLY | O_TRUNC, 0666);
    fork();
    write(fd, "Ouille", 6);
    close(fd);

produces "OuilleOuille", as expected, on ext4 on two machines running Linux 3.2 AMD64. However, over XFS on an old Pentium III at 500 MHz running 2.6.32, it produces just "Ouille" roughly once in three times. ext4, on 3.6:

$ for i in `seq 0 1`; do ./a.out ; cat /mnt/scratch/foo ; echo ; done | sort | uniq -c
     39 Ouille
   9962 OuilleOuille
$

XFS, on the same kernel, hardware and block device:

$ for i in `seq 0 1`; do ./a.out ; cat /mnt/scratch/foo ; echo ; done | sort | uniq -c
     40 Ouille
   9961 OuilleOuille
$

So both filesystems behave according to the POSIX definition of concurrent writes. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: xfs: fix buffer lookup race on allocation failure
On Mon, Oct 15, 2012 at 11:27:58AM +0200, William Dauchy wrote: Hello, I believe the commit fe2429b fixes the attached kernel trace. I tested it both on top of the 3.2 and 3.4 stable trees. Could we consider adding this patch to the stable tree at least for 3.2 and 3.4? commit fe2429b0966a7ec42b5fe3bf96f0f10de0a3b536 Author: Dave Chinner dchin...@redhat.com Date: Mon Apr 23 15:58:45 2012 +1000 xfs: fix buffer lookup race on allocation failure When memory allocation fails to add the page array or the pages to a buffer during xfs_buf_get(), the buffer is left in the cache in a partially initialised state. There is enough state left for the next lookup on that buffer to find the buffer, and for the buffer to then be used without finishing the initialisation. As a result, when an attempt to do IO on the buffer occurs, it fails with EIO because there are no pages attached to the buffer. We cannot remove the buffer from the cache immediately and free it, because there may already be a racing lookup that is blocked on the buffer lock. Hence the moment we unlock the buffer to then free it, the other user is woken and we have a use-after-free situation. To avoid this race condition altogether, allocate the pages for the buffer before we insert it into the cache. This then means that we don't have an allocation failure case to deal with after the buffer is already present in the cache, and hence avoid the problem altogether. In most cases we won't have racing inserts for the same buffer, and so won't increase the memory pressure allocation before insertion may entail. Signed-off-by: Dave Chinner dchin...@redhat.com Reviewed-by: Mark Tinguely tingu...@sgi.com Signed-off-by: Ben Myers b...@sgi.com XFS: Assertion failed: bp-b_bn != XFS_BUF_DADDR_NULL, file: fs/xfs/xfs_buf.c, line: 598 You're running a CONFIG_XFS_DEBUG kernel. If you can reproduce the problem without CONFIG_XFS_DEBUG, then it probably should be backported.
If you are using CONFIG_XFS_DEBUG on production systems, then you shouldn't be, because it does nasty things to allocation patterns, not to mention a 25-30% CPU overhead, and will panic in places where errors are recoverable but where, as developers, we want to try to find out what went wrong. In this case, you'll get a transient EIO error when the I/O is issued on the malformed buffer, but other than that the system can continue along just fine and the next read of the buffer will work perfectly... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
[PATCH] xfs: add hot tracking support.
From: Dave Chinner dchin...@redhat.com

Connect up the VFS hot tracking support so XFS filesystems can make use of it.

Signed-off-by: Dave Chinner dchin...@redhat.com
---
 fs/xfs/xfs_mount.h |    1 +
 fs/xfs/xfs_super.c |    9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a631ca3..d5e7277 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -215,6 +215,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_WSYNC		(1ULL << 0)	/* for nfs - all metadata ops
 						   must be synchronous except
 						   for space allocations */
+#define XFS_MOUNT_HOTTRACK	(1ULL << 1)	/* hot inode tracking */
 #define XFS_MOUNT_WAS_CLEAN	(1ULL << 3)
 #define XFS_MOUNT_FS_SHUTDOWN	(1ULL << 4)	/* atomic stop of all filesystem
 						   operations, typically for
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 56c2537..17786ff 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -61,6 +61,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/parser.h>
+#include <linux/hot_tracking.h>

 static const struct super_operations xfs_super_operations;
 static kmem_zone_t *xfs_ioend_zone;
@@ -114,6 +115,7 @@ mempool_t *xfs_ioend_pool;
 #define MNTOPT_NODELAYLOG  "nodelaylog"	/* Delayed logging disabled */
 #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
 #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
+#define MNTOPT_HOTTRACK    "hot_track"	/* hot inode tracking */

 /*
  * Table driven mount option parser.
@@ -371,6 +373,8 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
+		} else if (!strcmp(this_char, MNTOPT_HOTTRACK)) {
+			mp->m_flags |= XFS_MOUNT_HOTTRACK;
 		} else if (!strcmp(this_char, "ihashsize")) {
 			xfs_warn(mp,
 	"ihashsize no longer used, option is deprecated.");
@@ -1040,6 +1044,9 @@ xfs_fs_put_super(
 {
 	struct xfs_mount	*mp = XFS_M(sb);

+	if (mp->m_flags & XFS_MOUNT_HOTTRACK)
+		hot_track_exit(sb);
+
 	xfs_filestream_unmount(mp);
 	xfs_unmountfs(mp);
@@ -1470,6 +1477,8 @@ xfs_fs_fill_super(
 		error = ENOMEM;
 		goto out_unmount;
 	}
+	if (mp->m_flags & XFS_MOUNT_HOTTRACK)
+		hot_track_init(sb);

 	return 0;
Re: [RFC v3 00/13] vfs: hot data tracking
On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com NOTE: The patchset is currently post out mainly to make sure it is going in the correct direction and hope to get some helpful comments from other guys. For more infomation, please check hot_tracking.txt in Documentation # mount -o hot_track /dev/vdc /mnt/scratch # umount /mnt/scratch hangs here on XFS: # echo w /proc/sysrq-trigger [ 44.229252] SysRq : Show Blocked State [ 44.230044] taskPC stack pid father [ 44.231187] umount D 88021fd52dc0 3632 4107 4106 0x [ 44.231946] 880212a91bd8 0086 8802153da300 880212a91fd8 [ 44.231946] 880212a91fd8 880212a91fd8 880216cf6340 8802153da300 [ 44.231946] 7fff 880212a91d78 880212a91d80 [ 44.231946] Call Trace: [ 44.231946] [81b6c1b9] schedule+0x29/0x70 [ 44.231946] [81b6a079] schedule_timeout+0x159/0x220 [ 44.231946] [8171a2f4] ? do_raw_spin_lock+0x54/0x120 [ 44.231946] [8171a45d] ? do_raw_spin_unlock+0x5d/0xb0 [ 44.231946] [81b6bfee] wait_for_common+0xee/0x190 [ 44.231946] [810b6a10] ? 
try_to_wake_up+0x2f0/0x2f0 [ 44.231946] [81b6c18d] wait_for_completion+0x1d/0x20 [ 44.231946] [8109efec] flush_workqueue+0x14c/0x3f0 [ 44.231946] [811aad89] hot_track_exit+0x39/0x180 [ 44.231946] [81454e83] xfs_fs_put_super+0x23/0x70 [ 44.231946] [8117a991] generic_shutdown_super+0x61/0xf0 [ 44.231946] [8117aa50] kill_block_super+0x30/0x80 [ 44.231946] [8117ae45] deactivate_locked_super+0x45/0x70 [ 44.231946] [8117ba0e] deactivate_super+0x4e/0x70 [ 44.231946] [81197541] mntput_no_expire+0x101/0x160 [ 44.231946] [811985b6] sys_umount+0x76/0x3a0 [ 44.231946] [81b755a9] system_call_fastpath+0x16/0x1b because this is stuck: [ 200.064574] kworker/u:2 S 88021fc12dc0 5208 669 2 0x [ 200.064574] 88021532fc60 0046 88021532c7c0 88021532ffd8 [ 200.064574] 88021532ffd8 88021532ffd8 81fc3420 88021532c7c0 [ 200.064574] 88021532fc50 88021532fc98 8221e700 8221e700 [ 200.064574] Call Trace: [ 200.064574] [81b6c1b9] schedule+0x29/0x70 [ 200.064574] [81b6a03b] schedule_timeout+0x11b/0x220 [ 200.064574] [810908d0] ? usleep_range+0x50/0x50 [ 200.064574] [811aa78a] hot_temperature_update_work+0x16a/0x1c0 [ 200.064574] [8171a2f4] ? do_raw_spin_lock+0x54/0x120 [ 200.064574] [81b6d3ce] ? _raw_spin_unlock_irq+0xe/0x30 [ 200.064574] [810b0b3c] ? finish_task_switch+0x5c/0x100 [ 200.064574] [8109d959] process_one_work+0x139/0x500 [ 200.064574] [811aa620] ? hot_range_update+0x1f0/0x1f0 [ 200.064574] [8109e63e] worker_thread+0x15e/0x460 [ 200.064574] [8109e4e0] ? manage_workers+0x2f0/0x2f0 [ 200.064574] [810a4b33] kthread+0x93/0xa0 [ 200.064574] [81b765c4] kernel_thread_helper+0x4/0x10 [ 200.064574] [810a4aa0] ? __init_kthread_worker+0x40/0x40 [ 200.064574] [81b765c0] ? gs_change+0x13/0x13 Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 09/13] vfs: add one wq to update map info periodically
(). + destroy_workqueue(wq); +} And there's no need for a separate function for this - put it inline. FWIW, it also leaks the hot_work structure, but you're going to remove that anyway. ;) diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h index d19e64a..7a79a6d 100644 --- a/fs/hot_tracking.h +++ b/fs/hot_tracking.h @@ -36,6 +36,9 @@ */ #define TIME_TO_KICK 400 +/* set how often to update temperatures (seconds) */ +#define HEAT_UPDATE_DELAY 400 FWIW, 400 seconds is an unusual time period. It's expected that periodic work might take place at intervals of 5 minutes, 10 minutes, etc, not 6m40s. It's much easier to predict and understand behaviour if it's at an interval of whole units like minutes, especially when looking at timestamped event traces. Hence 300s (5 minutes) makes a lot more sense as a period for updates... /* * The following comments explain what exactly comprises a unit of heat. * diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h index 7114179..b37e0f8 100644 --- a/include/linux/hot_tracking.h +++ b/include/linux/hot_tracking.h @@ -84,6 +84,8 @@ struct hot_info { /* map of range temperature */ struct hot_map_head heat_range_map[HEAT_MAP_SIZE]; + + struct workqueue_struct *update_wq; Add the struct delayed_work here, too. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com FS_IOC_GET_HEAT_INFO: return a struct containing the various metrics collected in btrfs_freq_data structs, and also return a I think you mean hot_freq_data :P calculated data temperature based on those metrics. Optionally, retrieve the temperature from the hot data hash list instead of recalculating it. To get the heat info for a specific file you have to know what file you want to get that info for, right? I can see the usefulness of asking for the heat data on a specific file, but how do you find the hot files in the first place? i.e. the big question the user interface needs to answer is "what files are hot?". Once userspace knows what the hottest files are, it can open them and query the data via the above ioctl, but expecting userspace to iterate millions of inodes in a filesystem to find hot files is very inefficient. FWIW, if you were to return file handles to the hottest files, then the application could open and query them without even needing to know the path name to them. This would be exceedingly useful for defragmentation programs, especially as that is the way xfs_fsr already operates on candidate files.(*) IOWs, sometimes the pathname is irrelevant to the operations that applications want to perform - all they care about is having an efficient method of finding the inode they want and getting a file descriptor that points to the file. Given the heat map info fits right in to the sort of operations defrag and data mover tools already do, it kind of makes sense to optimise the interface towards those uses. (*) i.e.
finds them via bulkstat, which returns handle information along with all the other inode data, then opens the file by handle to do the defrag work. FS_IOC_GET_HEAT_OPTS: return an integer representing the current state of hot data tracking and migration: 0 = do nothing 1 = track frequency of access FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and migration, as described above. I can't see how this is a manageable interface. It is not persistent, so after every filesystem mount you'd have to set the flag on all your inodes again. Hence, for the moment, I'd suggest dropping per-inode tracking control until all the core issues are sorted out. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 11/16] f2fs: add inode operations for special inodes
On Tue, Oct 16, 2012 at 11:38:35AM +, Arnd Bergmann wrote: On Tuesday 16 October 2012, Jaegeuk Kim wrote: On Monday 15 October 2012, Dave Chinner wrote: On Sun, Oct 14, 2012 at 03:19:37PM +, Arnd Bergmann wrote: On Sunday 14 October 2012, Vyacheslav Dubeyko wrote: On Oct 14, 2012, at 11:09 AM, Jaegeuk Kim wrote: 2012-10-14 (일), 02:21 +0400, Vyacheslav Dubeyko: The main reason I can see against extended attributes is that they are not stored very efficiently in f2fs, unless a lot of work is put into coming up with a good implementation. A single flags bit can trivially be added to the inode in comparison (if it's not there already). That's a deficiency that should be corrected, then, because xattrs are very common these days. IMO, most file systems including f2fs have some inefficiency to store and retrieve xattrs, since they have to allocate an additional block. The only distinct problem in f2fs is that there is a cleaning overhead. So, that's why xattr is not an efficient way in f2fs. I would hope that there is a better way to encode extended attributes if the current method is not efficient enough. Maybe Dave or someone else who is experienced with this can make suggestions. What is the expected size of the attributes normally? Most attributes are small. Even large user attributes don't generally get to be more than a couple of hundred bytes, though the maximum size for a single xattr is 64K. Does it make sense to put attributes for multiple files into a single block? There are two main ways of dealing with attributes. The first is a tree-like structure to index and store unique xattrs, and have the inode simply keep pointers to the main xattr tree. This is great for keeping space down when there are lots of identical xattrs, but is a serialisation point for modification, and modification can be complex (i.e. shared entries in the tree need COW semantics.) This is the approach ZFS takes, IIRC, and is the most space efficient way of dealing with xattrs.
It's not the most performance efficient way, however, and the reference counting means frequent tree rewrites. The second is the XFS/ext4 approach, where xattrs are stored in a per-inode tree, with no sharing. The attribute tree holds the attributes in its leaves, and the tree grows and shrinks as you add or remove xattrs. There are optimisations on top of this - e.g. for XFS if the xattrs fit in the spare space in the inode, they are packed into the inode (shortform) and don't require an external block. IIUC, there are patches to implement this optimisation for ext4 floating around at the moment. This is a worthwhile optimisation, because with a 512 byte inode size on XFS there is enough spare space (roughly 380 bytes) for most systems to store all their xattrs in the inode itself. XFS also has remote attr storage for large xattrs (i.e. larger than a leaf block), where the tree just keeps a pointer to an external extent that holds the xattr. IIRC, f2fs uses 4k inodes, so IMO per-inode xattr trees with internal storage before spilling to an external block is probably the best approach to take... OTOH, I think xattr itself is for users, not for communicating between file system and users. No, you are mistaken in that point, as Dave explained. e.g. selinux, IMA, ACLs, capabilities, etc all communicate information that the kernel uses for access control. That's why xattrs have different namespaces like system, security and user. Only user attributes are truly for user data - the rest are for communicating information to the kernel. A file usage policy xattr would definitely exist under the system namespace - it's not a user xattr at all. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: xfs: fix buffer lookup race on allocation failure
On Tue, Oct 16, 2012 at 11:27:48AM +0200, William Dauchy wrote: Hello Dave, Thanks for your reply. On Tue, Oct 16, 2012 at 1:21 AM, Dave Chinner da...@fromorbit.com wrote: You're running a CONFIG_XFS_DEBUG kernel. If you can reproduce the problem with CONFIG_XFS_DEBUG, then it probably should be backported. Yes indeed. Sorry, I had a typo in the above - I meant: If you can reproduce the problem *without* CONFIG_XFS_DEBUG, then it probably should be backported. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 11/16] f2fs: add inode operations for special inodes
On Wed, Oct 17, 2012 at 07:30:21AM +0900, Jaegeuk Kim wrote: OTOH, I think xattr itself is for users, not for communicating between file system and users. No, you are mistaken in that point, as Dave explained. e.g. selinux, IMA, ACLs, capabilities, etc all communicate information that the kernel uses for access control. That's why xattrs have different namespaces like system, security and user. Only user attributes are truly for user data - the rest are for communicating information to the kernel. I agree that system is used by kernel. How about the file system view? Not sure what you mean - the filesystem would simply read the xattrs in the system namespace as it needs, just like the other subsystems like selinux or IMA do. Would you explain what file systems retrieve xattrs and use them with their own purpose? I think cachefs uses a CacheFiles.cache namespace for storing information it needs in xattrs. ecryptfs stores crypto metadata in xattrs in the lower filesystem. NFSv4 servers store junction mount information in xattrs. So there are examples where filesystems use xattrs for special information. However, in most cases filesystems don't need xattrs for their own metadata primarily because that gets added to their own on-disk formats. The above are all overlay style filesystems that don't have their own on-disk formats, so need to use xattrs to store their per-inode metadata. The case of access hints and allocation policies is not something that is native to any filesystem on-disk format. They are abstract concepts that really only the software generating/using that information knows about. Given we want the software that uses this information to be in VFS, it is separate from every filesystem and this is exactly the use case that system xattrs were intended for. :) Sorry, I'm not familiar with xattrs in depth. Unfortunately, system is not implemented in f2fs yet.
:( If you've already implemented the user.* namespace, then it's trivial to support the other namespaces - it's just prefixing the xattrs with the appropriate string instead of "user". Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC v3 00/13] vfs: hot data tracking
On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote: On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com (*) Tested on an empty 17TB XFS filesystem with: $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc meta-data=/dev/vdc isize=256 agcount=17, agsize=268435455 blks = sectsz=512 attr=2, projid32bit=0 data = bsize=4096 blocks=4563402735, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal log bsize=4096 blocks=131072, version=2 = sectsz=512 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch $ sudo chmod 777 /mnt/scratch $ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \ /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \ /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \ /mnt/scratch/6 -d /mnt/scratch/7 . 0 21601 16679.3 12552262 0 22401 15412.4 12588587 0 23201 16367.6 14199322 0 24001 15680.4 15741205 hangs here w/ OOM In this test, I haven't seen you enable the hot_track function via mount, so why did it meet OOM? I copied the wrong mount command. It was definitely enabled. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC v3 00/13] vfs: hot data tracking
On Thu, Oct 18, 2012 at 12:44:47PM +0800, Zhi Yong Wu wrote: On Thu, Oct 18, 2012 at 12:29 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote: On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com (*) Tested on an empty 17TB XFS filesystem with: $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc meta-data=/dev/vdc isize=256 agcount=17, agsize=268435455 blks = sectsz=512 attr=2, projid32bit=0 data = bsize=4096 blocks=4563402735, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal log bsize=4096 blocks=131072, version=2 = sectsz=512 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch $ sudo chmod 777 /mnt/scratch $ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \ /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \ /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \ /mnt/scratch/6 -d /mnt/scratch/7 . 0 21601 16679.3 12552262 0 22401 15412.4 12588587 0 23201 16367.6 14199322 0 24001 15680.4 15741205 hangs here w/ OOM In this test, I haven't seen you enable the hot_track function via mount, so why did it meet OOM? I copied the wrong mount command. It was definitely enabled. OK, BTW: is fs_mark a script written by you? After xfsprogs is installed, I haven't found this command. # apt-get install fsmark Or get the source here: http://sourceforge.net/projects/fsmark/ Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: RT and XFS
On Thu, Jul 14, 2005 at 10:22:46AM +1000, Nathan Scott wrote: Hi there, On Wed, Jul 13, 2005 at 09:45:58AM -0700, Daniel Walker wrote: On Wed, 2005-07-13 at 08:47 +0200, Ingo Molnar wrote: downgrade_write() wasnt the main problem - the main problem was that for PREEMPT_RT i implemented 'strict' semaphores, which are not identical to vanilla kernel semaphores. The thing that seemed to impact XFS the most is the 'acquirer thread has to release the lock' rule of strict semaphores. Both the XFS logging code and the XFS IO completion code seems to release locks in a different context from where the acquire happened. It's of course valid upstream behavior, but without these extra rules it's hard to do sane priority inheritance. (who do you boost if you dont really know who 'owns' the lock?) It might make sense to introduce some sort of sem_pass_to(new_owner) interface? For now i introduced a compat type, which lets those semaphores fall back to the vanilla implementation. Hmm, I'm not aware of anywhere in XFS where we do that. From talking to some colleagues here, they're claiming that we can't be doing that since it'd trip an assert in the IRIX mrlock code. Now that I've read the thread, I see it's not mrlocks that is the issue with unlocking in a different context - it's semaphores. All the pagebuf synchronisation is done with a semaphore because it's held across the I/O and it's _most definitely_ released in a different context when doing async I/O. Just about all metadata I/O is async because once the transaction has been logged to disk we don't need to write these buffers out synchronously. Not to mention the log I/O completion unlocks the buffers in a transaction in a different context as well. The whole point of using a semaphore in the pagebuf is because there is no tracking of who owns the lock so we can actually release it in a different context. Semaphores were invented for this purpose, and we use them in the way they were intended. 
;) Realistically, I seriously doubt the need for any sort of rt changes to these semaphores. They can be held for indeterminate periods of time, potentially across multiple disk I/Os (e.g. when held locked in a transaction that requires more metadata to be read in from disk to make progress). Hence there is really no point in making them RT aware, because if you end up waiting on one of them you can forget about pretty much any RT guarantee that you've ever given. Cheers, Dave. -- Dave Chinner R&D Software Engineer SGI Australian Software Group
Re: [PATCH] xfs: fix comment typo of struct xfs_da_blkinfo.
On Tue, Jul 17, 2012 at 10:40:21PM -0300, Paulo Alcantara wrote: From: Christoph Hellwig h...@infradead.org Date: Tue, 17 Jul 2012 03:06:43 -0400 Btw, if you need more reviewers for the syslinux support feel free to pass it by me (or the list). This is the branch I'm maintaining for the XFS readonly driver: git://zytor.com/users/pcacjr/syslinux.git (branch: xfs) The current status is: - We are able to find the root inode by reading the rootino field from the superblock (xfs_iget_root() function). - We are able to find inodes that have core's format set to local only at the moment, which is by reading leaf entries from inode btrees. The function that looks up an inode is the xfs_iget() one. We're looking forward to implementing the handling of keys in inode btrees (extents) also. Baoszi is currently working on the inode btree's key part (extents), and I'm working on the data part of the file inode, which is the xfs_next_extent() function. The xfs_next_extent() function must be able to read the inode (cast to the data fork) and populate a structure that stores the starting physical sector number and the number of consecutive sectors that contain the inode's data, so that Syslinux will know where to read from. As we discussed on #xfs, I'm still not convinced that parsing the filesystem metadata in the bootloader is the way to go. Far better, IMO, is simply to supply the bootloader with a list of extents for the blocks it needs to read to get the files it needs. You can get them via fiemap(), and it'll work for XFS, btrfs, ext3, ext4, etc without needing to write code to parse the on-disk format of every filesystem. Especially as we are in the process of making major changes to the XFS on-disk format, which means you'll have to make significant changes to support a second XFS disk format in the not-too-distant future... Thanks for the interest in helping us! We want it to work! ;) Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: Loopback device hung [was Re: xfs deadlock on 3.9-rc5 running xfstests case #78]
On Thu, Apr 04, 2013 at 04:30:54PM -0400, Phillip Susi wrote: I have not tested it yet, but I am pretty sure it won't work. It looks like the patch changes the BLKRRPART path to go ahead and remove existing partitions when GENHD_FL_NO_PARTSCAN is set. loop doesn't issue the BLKRRPART ioctl when !LO_FLAGS_PARTSCAN so this won't help. I think loop needs to set GENHD_FL_NO_PARTSCAN and then issue the ioctl regardless of the LO_FLAGS_PARTSCAN flag to get the partitions to be removed. I will try to test tonight. After testing, my initial thoughts appeared to have been correct. I had to modify the patch as follows. To test, simply do: truncate -s 10m img losetup /dev/loop0 img parted /dev/loop0 mklabel msdos mkpart primary ext2 1m 2m quit ls /dev/loop0* Note the /dev/loop0p1 node. Run losetup -d /dev/loop0 and see if it is still there. Jens, can we get one of these fixes merged quickly? xfstests is unusable on any kernel more recent than 3.9-rc4 because of these problems. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Loopback device hung [was Re: xfs deadlock on 3.9-rc5 running xfstests case #78]
On Tue, Apr 09, 2013 at 09:01:39AM +0200, Jens Axboe wrote: On Tue, Apr 09 2013, Dave Chinner wrote: On Thu, Apr 04, 2013 at 04:30:54PM -0400, Phillip Susi wrote: I have not tested it yet, but I am pretty sure it won't work. It looks like the patch changes the BLKRRPART path to go ahead and remove existing partitions when GENHD_FL_NO_PARTSCAN is set. loop doesn't issue the BLKRRPART ioctl when !LO_FLAGS_PARTSCAN so this won't help. I think loop needs to set GENHD_FL_NO_PARTSCAN and then issue the ioctl regardless of the LO_FLAGS_PARTSCAN flag to get the partitions to be removed. I will try to test tonight. After testing, my initial thoughts appeared to have been correct. I had to modify the patch as follows. To test, simply do: truncate -s 10m img losetup /dev/loop0 img parted /dev/loop0 mklabel msdos mkpart primary ext2 1m 2m quit ls /dev/loop0* Note the /dev/loop0p1 node. Run losetup -d /dev/loop0 and see if it is still there. Jens, can we get one of these fixes merged quickly? xfstests is unusable on any kernel more recent than 3.9-rc4 because of these problems. Yep, did the revert yesterday and it's going out today. Great, I didn't see an update in the latest -rc, so I wanted to make sure it hadn't fallen through the cracks. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority
On Tue, Apr 09, 2013 at 12:13:59PM +0100, Mel Gorman wrote: On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote: I think that outside of the zone loop is a better place to run shrink_slab(), because shrink_slab() is not directly related to a specific zone. This is true and has been the case for a long time. The slab shrinkers are not zone aware and it is complicated by the fact that slab usage can indirectly pin memory on other zones. .. And this is a question not related to this patch. Why is nr_slab used here to decide zone->all_unreclaimable? Slab is not directly associated with a zone, but as reclaiming slab can free memory from unpredictable zones we do not consider a zone to be fully unreclaimable until we cannot shrink slab any more. This is something the numa aware shrinkers will greatly help with - instead of being a global shrink it becomes a node-the-zone-belongs-to shrink, and so... You may be thinking that this is extremely heavy handed and you're right, it is. ... it is much less heavy handed than the current code... nr_slab is not directly related to whether a specific zone is reclaimable or not, and, moreover, nr_slab is not directly related to the number of reclaimed pages. It just says some objects in the system are freed. All true, it's the indirect relation between slab objects and the memory that is freed when slab objects are reclaimed that has to be taken into account. Node awareness within the shrinker infrastructure and LRUs make the relationship much more direct ;) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Excessive stall times on ext4 in 3.9-rc2
On Thu, Apr 11, 2013 at 10:57:08PM -0400, Theodore Ts'o wrote: On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote: I think it might be more enlightening if Mel traced which process in which function is holding the buffer lock. I suspect we'll find out that the flusher thread has submitted the buffer for IO as an async write and thus it takes a long time to complete in presence of reads which have higher priority. That's an interesting theory. If the workload is one which is very heavy on reads and writes, that could explain the high latency. That would explain why those of us who are using primarily SSD's are seeing the problems, because would-be reads are nice and fast. If that is the case, one possible solution that comes to mind would be to mark buffer_heads that contain metadata with a flag, so that the flusher thread can write them back at the same priority as reads. Ext4 is already using REQ_META for this purpose. I'm surprised that no-one has suggested changing the IO elevator yet. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Excessive stall times on ext4 in 3.9-rc2
On Fri, Apr 12, 2013 at 11:19:52AM -0400, Theodore Ts'o wrote: On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote: If that is the case, one possible solution that comes to mind would be to mark buffer_heads that contain metadata with a flag, so that the flusher thread can write them back at the same priority as reads. Ext4 is already using REQ_META for this purpose. We're using REQ_META | REQ_PRIO for reads, not writes. I'm surprised that no-one has suggested changing the IO elevator yet. Well, testing to see if the stalls go away with the noop scheduler is a good thing to try just to validate the theory. Exactly. The thing is, we do want to make ext4 work well with cfq, and prioritizing non-readahead read requests ahead of data writeback does make sense. The issue is with is that metadata writes going through the block device could in some cases effectively cause a priority inversion when what had previously been an asynchronous writeback starts blocking a foreground, user-visible process. Here's the historic problem with CFQ: its scheduling algorithms change from release to release, and so what you tune the filesystem to for this release is likely to cause different behaviour in a few releases time. We've had this problem time and time again with CFQ+XFS, so we stopped trying to tune to a particular elevator long ago. The best you can do is tag the IO as appropriately as possible (e.g. metadata with REQ_META, sync IO with ?_SYNC, etc), and then hope CFQ hasn't been broken since the last release. At least, that's the theory; we should confirm that this is indeed what is causing the data stalls which Mel is reporting on HDD's before we start figuring out how to fix this problem. *nod*. Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Loopback device hung [was Re: xfs deadlock on 3.9-rc5 running xfstests case #78]
[Added jens Axboe to CC] On Tue, Apr 02, 2013 at 02:08:49AM -0400, CAI Qian wrote: Saw on almost all the servers range from x64, ppc64 and s390x with kernel 3.9-rc5 and xfsprogs-3.1.10. Never caught this in 3.9-rc4, so looks like something new broke this. Log is here with sysrq debug info. http://people.redhat.com/qcai/stable/log Has nothing to do with XFS: [34762.105676] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [34762.152871] systemd-udevd D 88030bc94440 0 4391412 0x0084 [34762.196660] 8801134d9a98 0082 880303e09ac0 8801134d9fd8 [34762.240512] 8801134d9fd8 8801134d9fd8 880304f98000 880303e09ac0 [34762.287047] 8802936c1ac0 88060521f998 88060521f99c 880303e09ac0 [34762.331446] Call Trace: [34762.345444] [816274a9] schedule+0x29/0x70 [34762.373908] [8162777e] schedule_preempt_disabled+0xe/0x10 [34762.411394] [81625ae3] __mutex_lock_slowpath+0xc3/0x140 [34762.447597] [8162561a] mutex_lock+0x2a/0x50 [34762.476785] [811daddb] __blkdev_get+0x6b/0x4b0 [34762.508839] [812a619a] ? selinux_file_alloc_security+0x4a/0x80 [34762.546436] [811db3bd] blkdev_get+0x19d/0x2e0 [34762.580977] [812a18fa] ? inode_has_perm.isra.31.constprop.61+0x2a/0x30 [34762.624346] [811db55f] blkdev_open+0x5f/0x90 [34762.656318] [811a00bf] do_dentry_open+0x20f/0x2c0 [34762.689762] [811db500] ? blkdev_get+0x2e0/0x2e0 [34762.71] [811a01a5] finish_open+0x35/0x50 [34762.754363] [811b082e] do_last+0x6de/0xde0 [34762.783808] [811ad378] ? inode_permission+0x18/0x50 [34762.819265] [811ad428] ? link_path_walk+0x78/0x880 [34762.853700] [812a619a] ? selinux_file_alloc_security+0x4a/0x80 [34762.892881] [811b0fe7] path_openat+0xb7/0x4b0 [34762.923621] [811ad0e0] ? getname_flags.part.33+0x30/0x150 [34762.960839] [811b16a1] do_filp_open+0x41/0xa0 [34762.992114] [811bddc2] ? 
__alloc_fd+0x42/0x110 [34763.023342] [811a14f4] do_sys_open+0xf4/0x1e0 [34763.054129] [811a1601] sys_open+0x21/0x30 [34763.082134] [81630d59] system_call_fastpath+0x16/0x1b And: [34763.116218] INFO: task umount:4421 blocked for more than 120 seconds. [34763.153670] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message. [34763.200321] umount D 88030bcb4440 0 4421452 0x0080 [34763.242026] 8801134fbcb8 0082 8802936c1ac0 8801134fbfd8 [34763.287320] 8801134fbfd8 8801134fbfd8 8803069d 8802936c1ac0 [34763.330722] 8802936c1ac0 88060521f998 88060521f99c 8802936c1ac0 [34763.374428] Call Trace: [34763.388977] [816274a9] schedule+0x29/0x70 [34763.419164] [8162777e] schedule_preempt_disabled+0xe/0x10 [34763.456279] [81625ae3] __mutex_lock_slowpath+0xc3/0x140 [34763.492474] [8162561a] mutex_lock+0x2a/0x50 [34763.520840] [81417da5] loop_clr_fd+0x285/0x480 [34763.552050] [81418800] lo_release+0x70/0x80 [34763.580787] [811dabac] __blkdev_put+0x17c/0x1d0 [34763.612479] [811dac57] blkdev_put+0x57/0x140 [34763.641557] [811a409d] kill_block_super+0x4d/0x80 [34763.674257] [811a4467] deactivate_locked_super+0x57/0x80 [34763.708755] [811a4fee] deactivate_super+0x4e/0x70 [34763.742906] [811bfbb7] mntput_no_expire+0xd7/0x130 [34763.776872] [811c0a9c] sys_umount+0x9c/0x3c0 [34763.811819] [81630d59] system_call_fastpath+0x16/0x1b It's clearly a loopback device problem, stuck on the bdev->bd_inode->i_mutex. And there are changes in the loop device teardown since 3.9-rc4: $ glo v3.9-rc4..HEAD -- drivers/block/loop.c c1681bf loop: prevent bdev freeing while device in use 8761a3d loop: cleanup partitions when detaching loop device 183cfb5 loop: fix error return code in loop_add() $ So this looks like someone hasn't been testing their loopback driver changes properly... Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH 2/9] vfs: export do_splice_direct() to modules
On Sun, Mar 17, 2013 at 01:06:59PM +0000, David Howells wrote: Miklos Szeredi mik...@szeredi.hu wrote: Export do_splice_direct() to modules. Needed by overlay filesystem. Apparently you cannot call this from any function that is holding an i_mutex if the target of the splice uses generic_file_splice_write(). The problem is a potential deadlock situation: we already have places that do:

	mnt_want_write()
	  mutex_lock()

This can be found in do_last(), for example. However, mnt_want_write() calls sb_start_write(), as does generic_file_splice_write(). So now in ovl_copy_up_locked() you're adding:

	mutex_lock()
	  sb_start_write()

which lockdep reports as a potential ABBA deadlock. Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock might operate, so it's possible that this is a false alarm. Maybe Jan Kara can illuminate further, so I've added him to the cc list. I've attached the report I got with unionmount. There are plenty of problems with splice locking that can lead to deadlocks. Here's another that's been known for ages: http://oss.sgi.com/archives/xfs/2011-08/msg00168.html Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Wed, Mar 13, 2013 at 10:00:51PM -0400, Peter Hurley wrote: On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote: We don't care about the ordering between multiple concurrent metadata modifications - what matters is whether the ongoing data IO around them is ordered correctly. Dave, The point that Michel is making is that there never was any ordering guarantee by rwsem. It's an illusion. Weasel words. The reason is simple: to even get to the lock the cpu has to be sleep-able. So every submission that you believe is ordered is by its very nature __not ordered__, even when used by kernel code. Why? Because any thread on its way to claim the lock (reader or writer) could be pre-empted for some other task, thus delaying the submission of whatever i/o you believed to be ordered. You think I don't know this? You're arguing fine grained, low level behaviour between tasks is unpredictable. I get that. I understand that. But I'm not arguing about fine-grained, low level, microsecond semantics of the locking order. What you (and Michel) appear to be failing to see is what happens on a macro level when you have read locks being held for periods measured in *seconds* (e.g. direct IO gets queued behind a few thousand other IOs in the elevator waiting for a request slot), and the subsequent effect of inserting an operation that requires a write lock into that IO stream. IOWs, it simply doesn't matter if there's a micro-level race between the write lock and a couple of the readers. That's the level you guys are arguing at, but it simply does not matter in the cases I'm describing. I'm talking about high level serialisation behaviours that might take *seconds* to play out and the ordering behaviours observed at that scale.
That is, I don't care if a couple of threads out of a few thousand race with the write lock over a few tens to hundreds of microseconds, but I most definitely care if a few thousand IOs issued seconds after the write lock is queued jump over the write lock. That is a gross behavioural change at the macro-level. So just to reiterate: there is no 'queue' and no 'barrier'. The guarantees that rwsem makes are:

1. Multiple readers can own the lock.
2. Only a single writer can own the lock.
3. Readers will not starve writers.

You've conveniently ignored the fact that the current implementation also provides the following guarantee:

4. New readers will block behind existing writers.

And that's the behaviour we currently depend on, whether you like it or not. Where lock policy can have a significant impact is on performance. But predicting that impact is difficult -- it's better just to measure. Predicting the impact in this case is trivial - it's obvious that ordering of operations will change and break high level assumptions that userspace currently makes about various IO operations on XFS filesystems. It's not my intention to convince you (or anyone else) that there should only be One True Rwsem, because I don't believe that. But I didn't want the impression to persist that rwsem does anything more than implement a fair reader/writer semaphore. I'm sorry, but redefining fair to suit your own needs doesn't convince me of anything. rwsem behaviour has been unchanged for at least 10 years, and hence the current implementation defines what is fair, not what you say is fair. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
On Tue, Mar 19, 2013 at 10:28:24AM -0700, Tejun Heo wrote: Hello, Jan. On Tue, Mar 19, 2013 at 8:46 AM, Jan Kara j...@suse.cz wrote: Well, but what you often get is just the output of sysrq-w, or sysrq-t, or a splat from the scheduler about a stuck task. You often don't have the comfort of tracing... Can't we somehow change 'comm' of the task when it starts processing work of some bdi? You sure can but I'd prefer not to do that. If you wanna do it properly, you have to grab task lock every time a work item starts execution. I'm not sure how beneficial having the block device identifier would be. Backtrace would be there the same. Is identifying the block device that important? When you have a system that has 50+ active filesystems (pretty common in distributed storage environments where every disk has its own filesystem), knowing which filesystem(s) are getting stuck in writeback from the sysrq-w or hangcheck output is pretty damn important. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Tux3 Report: Initial fsck has landed
On Wed, Mar 20, 2013 at 06:49:49PM -0700, Daniel Phillips wrote: On Tue, Mar 19, 2013 at 11:54 PM, Rob Landley r...@landley.net wrote: I'm confused, http://tux3.org/ lists a bunch of dates from 5 years ago, then nothing. Is this project dead or not? Not. We haven't done much about updating tux3.org lately, however you will find plenty of activity here: https://github.com/OGAWAHirofumi/tux3/tree/master/user You will also find fairly comprehensive updates on where we are and where this is going, here: http://phunq.net/pipermail/tux3/ At the moment we're being pretty quiet because of being in the middle of developing the next-gen directory index. Not such a small task, as you might imagine. Hi Daniel, The next-gen directory index comment made me curious. I wanted to know if there's anything I could learn from what you are doing and whether anything of your new algorithms could be applied to, say, the XFS directory structure to improve it. I went looking for design docs and found this: http://phunq.net/pipermail/tux3/2013-January/001938.html In a word: Disappointment. Compared to the XFS directory structure, the most striking architectural similarity that I see is this: the file bteee[sic] effectively is a second directory index that imposes a stable ordering on directory blocks. That was the key architectural innovation in the XFS directory structure that allowed it to provide the correct seekdir/telldir/NFS readdir semantics and still scale. i.e. virtually mapped directory entries. I explained this layout recently here: http://marc.info/?l=linux-ext4&m=136081996316453&w=2 http://marc.info/?l=linux-ext4&m=136082221117399&w=2 http://marc.info/?l=linux-ext4&m=136089526928538&w=2 We could swap the relevant portions of your PHTree design doc with my comments (and vice versa) and both sets of references would still make perfect sense.
:P Further, the PHTree description of tag based freespace tracking is rather close to how XFS uses tags to track free space regions, including the fact that XFS can be lazy at updating global free space indexes. The global freespace tree indexing is slightly different to the XFS method - it's closer to the original V1 dir code in XFS (that didn't scale at all well) than the current code. However, that's really a fine detail compared to all the major structural and algorithmic similarities. Hence it appears to me that at a fundamental level PHTree is just a re-implementation of the XFS directory architecture. It's definitely a *major* step forward from HTree, but it can hardly be considered revolutionary or next-gen. It's not even state of the art. Hence: disappointment. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] vfs: dcache: cond_resched in shrink_dentry_list
On Mon, Mar 25, 2013 at 10:22:31AM -0700, Greg Thelen wrote: Call cond_resched() from shrink_dentry_list() to preserve shrink_dcache_parent() interactivity.

	void shrink_dcache_parent(struct dentry * parent)
	{
		while ((found = select_parent(parent, &dispose)) != 0)
			shrink_dentry_list(&dispose);
	}

select_parent() populates the dispose list with dentries which shrink_dentry_list() then deletes. select_parent() carefully uses need_resched() to avoid doing too much work at once. But neither shrink_dcache_parent() nor its called functions call cond_resched(). So once need_resched() is set, select_parent() will return a single-dentry dispose list which is then deleted by shrink_dentry_list(). This is inefficient when there are a lot of dentries to process. This can cause softlockups and hurts interactivity on non-preemptible kernels. Hi Greg, I can see how this could cause problems, but isn't the problem then that shrink_dcache_parent()/select_parent() itself is mishandling the need for rescheduling rather than being a problem with the shrink_dentry_list() implementation? i.e. select_parent() is aborting batching based on a need for rescheduling, but then not doing that itself and assuming that someone else will do the reschedule for it? Perhaps this is a better approach:

	-	while ((found = select_parent(parent, &dispose)) != 0)
	+	while ((found = select_parent(parent, &dispose)) != 0) {
	 		shrink_dentry_list(&dispose);
	+		cond_resched();
	+	}

With this, select_parent() stops batching when a resched is needed, we dispose of the list as a single batch and only then resched if it was needed before we go and grab the next batch. That should fix the small batch problem without the potential for changing the shrink_dentry_list() behaviour adversely for other users. Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH] vfs: dcache: cond_resched in shrink_dentry_list
On Mon, Mar 25, 2013 at 05:39:13PM -0700, Greg Thelen wrote: On Mon, Mar 25 2013, Dave Chinner wrote: On Mon, Mar 25, 2013 at 10:22:31AM -0700, Greg Thelen wrote: Call cond_resched() from shrink_dentry_list() to preserve shrink_dcache_parent() interactivity.

	void shrink_dcache_parent(struct dentry * parent)
	{
		while ((found = select_parent(parent, &dispose)) != 0)
			shrink_dentry_list(&dispose);
	}

select_parent() populates the dispose list with dentries which shrink_dentry_list() then deletes. select_parent() carefully uses need_resched() to avoid doing too much work at once. But neither shrink_dcache_parent() nor its called functions call cond_resched(). So once need_resched() is set, select_parent() will return a single-dentry dispose list which is then deleted by shrink_dentry_list(). This is inefficient when there are a lot of dentries to process. This can cause softlockups and hurts interactivity on non-preemptible kernels. Hi Greg, I can see how this could cause problems, but isn't the problem then that shrink_dcache_parent()/select_parent() itself is mishandling the need for rescheduling rather than being a problem with the shrink_dentry_list() implementation? i.e. select_parent() is aborting batching based on a need for rescheduling, but then not doing that itself and assuming that someone else will do the reschedule for it? Perhaps this is a better approach:

	-	while ((found = select_parent(parent, &dispose)) != 0)
	+	while ((found = select_parent(parent, &dispose)) != 0) {
	 		shrink_dentry_list(&dispose);
	+		cond_resched();
	+	}

With this, select_parent() stops batching when a resched is needed, we dispose of the list as a single batch and only then resched if it was needed before we go and grab the next batch. That should fix the small batch problem without the potential for changing the shrink_dentry_list() behaviour adversely for other users. I considered only modifying shrink_dcache_parent() as you show above. Either approach fixes the problem I've seen.
My initial approach adds cond_resched() deeper into shrink_dentry_list() because I thought that there might be a secondary benefit: shrink_dentry_list() would be willing to give up the processor when working on a huge number of dentries. This could improve interactivity during shrinker and umount. I don't feel strongly on this and would be willing to test and post the add-cond_resched-to-shrink_dcache_parent approach. The shrinker has interactivity problems because of the global dcache_lru_lock, not because of the size of the list passed to shrink_dentry_list(). The amount of work that shrink_dentry_list() does here is already bound by the shrinker batch size. Hence in the absence of the RT folk complaining about significant holdoffs, I don't think there is an interactivity problem through the shrinker path. As for the unmount path - shrink_dcache_for_umount_subtree() - that doesn't use shrink_dentry_list() and so would need its own internal calls to cond_resched(). Perhaps it's shrink_dcache_sb() that you are concerned about? Either way, there are lots more similar issues in the unmount path, such as evict_inodes(), so unless you are going to give every possible path through unmount/remount/bdev invalidation the same treatment, changing shrink_dentry_list() won't significantly improve the interactivity of the system in these paths... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote: When the first queued waiter is a reader, wake all readers instead of just those that are at the front of the queue. There are really two motivations for this change: Isn't this a significant change of semantics for the rwsem? i.e. that read lock requests that come after a write lock request now jump ahead of the write lock request? i.e. the write lock request is no longer a barrier in the queue? XFS has long assumed that a rwsem write lock is a barrier that stops new read locks from being taken, and this change will break that assumption. Given that this barrier assumption is used as the basis for serialisation of operations like IO vs truncate, there's a bit more at stake than just improving parallelism here. i.e. IO issued after truncate/preallocate/hole punch could now be issued ahead of the pending metadata operation, whereas currently the IO issued while the pending metadata operation is waiting for the write lock will only be processed -after- the metadata modification operation completes... That is a recipe for weird data corruption problems because applications are likely to have implicit dependencies on the barrier effect of metadata operations on data IO... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote: On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote: When the first queued waiter is a reader, wake all readers instead of just those that are at the front of the queue. There are really two motivations for this change: Isn't this a significant change of semantics for the rwsem? i.e. that read lock requests that come after a write lock request now jump ahead of the write lock request? i.e. the write lock request is no longer a barrier in the queue? Yes, I am allowing readers to skip ahead of writers in the queue (but only if they can run with another reader that was already ahead). I don't see that this is a change of observable semantics for correct programs. If a reader and a writer both block on the rwsem, how do you know for sure which one got queued first? rwsem API doesn't give you any easy way to know whether a thread is currently queued on the rwsem (it could also be descheduled before it gets onto the rwsem queue). There are algorithms that rely on write locks to act as read-side barriers to prevent write side starvation. i.e. if you keep queuing up read locks, the writer never gets the lock, thereby starving the writer. But yes, if you're making assumptions about queuing order the change makes it more likely that they'll be observably wrong. XFS has long assumed that a rwsem write lock is a barrier that stops new read locks from being taken, and this change will break that assumption. Given that this barrier assumption is used as the basis for serialisation of operations like IO vs truncate, there's a bit more at stake than just improving parallelism here. i.e.
IO issued after truncate/preallocate/hole punch could now be issued ahead of the pending metadata operation, whereas currently the IO issued while the pending metadata operation is waiting for the write lock will only be processed -after- the metadata modification operation completes... That is a recipe for weird data corruption problems because applications are likely to have implicit dependencies on the barrier effect of metadata operations on data IO... I am confused as to exactly what XFS is doing, could you point me to the code / indicate a scenario where this would go wrong ? If you really rely on this for correctness you'd have to do something already to guarantee that your original queueing order is as desired, and I just don't see how it'd be done... My point isn't that XFS gets the serialisation wrong - my point is that the change of queuing order will change the userspace visible behaviour of the filesystem *significantly*. For example: - direct IO only takes read locks, while truncate takes a write lock. Right now we can drop a truncate, preallocation or hole punch operation into a stream of concurrent direct IO and have it execute almost immediately - the barrier mechanism in the rwsem ensures that it will be executed ahead of any future IO that is issued by the application. With your proposed change, that operation won't take place until all current *and all future* IOs stop and the read lock is no longer held by anyone. To put this in context, here's the Irix XFS iolock initialisation code from mid 1997, where the barrier semantics are first explicitly documented:

	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", (long)vp->v_number);

It was only coded like this in 1997, because the Irix multi-reader lock only grew read-side queue jumping and priority inheritance at this time. This changed the default behaviour from write lock barrier semantics to something similar to what you are proposing for rwsems right now.
All the locks that relied on barrier semantics of write locks had to be configured explicitly to have that behaviour rather than implicitly relying on the fact that mrlocks had provided write barrier semantics. So, that's my concern - we've got 20 years of algorithms in XFS designed around rwsem write locks providing read-side barriers, and we've been down this road once before (over 15 years ago). Changing the semantics of rwsems is going to break stuff in subtle and unpredictable ways. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote: On Sun, Mar 10, 2013 at 5:16 PM, Dave Chinner da...@fromorbit.com wrote: On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote: On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner da...@fromorbit.com wrote: Isn't this a significant change of semantics for the rwsem? i.e. that read lock requests that come after a write lock request now jump ahead of the write lock request? i.e. the write lock request is no longer a barrier in the queue? Yes, I am allowing readers to skip ahead of writers in the queue (but only if they can run with another reader that was already ahead). I don't see that this is a change of observable semantics for correct programs. If a reader and a writer both block on the rwsem, how do you know for sure which one got queued first? rwsem API doesn't give you any easy way to know whether a thread is currently queued on the rwsem (it could also be descheduled before it gets onto the rwsem queue). There are algorithms that rely on write locks to act as read-side barriers to prevent write side starvation. i.e. if you keep queuing up read locks, the writer never gets the lock, thereby starving the writer. What I propose actually isn't as bad as a reader preference rwlock like rwlock_t. When a reader makes it to the head of the queue, all readers get woken. At that point there are only writers left in the queue, and new readers will get queued after them and won't be able to skip over them (since they have no earlier reader to join up with). So, I am not throwing reader/writer fairness out the window here. OK, but... My point isn't that XFS gets the serialisation wrong - my point is that the change of queuing order will change the userspace visible behaviour of the filesystem *significantly*. For example: - direct IO only takes read locks, while truncate takes a write lock.
Right now we can drop a truncate, preallocation or hole punch operation into a stream of concurrent direct IO and have it execute almost immediately - the barrier mechanism in the rwsem ensures that it will be executed ahead of any future IO that is issued by the application. With your proposed change, that operation won't take place until all current *and all future* IOs stop and the read lock is no longer held by anyone. You actually wouldn't get such starvation with my proposal. What you might see would be: - Assuming you have up to N concurrent reads in progress, a writer might see up to 2N-1 readers proceed in front of it. This limit is reached if you first had N-1 readers grabbing the rwsem with an empty queue, then another writer W1 got queued (so something like R W1), then a reader RN (note that it will get queued after W1 and not run concurrently with the existing N-1 readers), then our writer W2 gets queued. So: R W1 W2. As our N-1 readers complete their IO operations, they might get queued again after W2 (so: W1 W2 R), and then skip ahead of W2 when RN reaches the front of the queue. So, W2 is still guaranteed to run eventually, but with a worst case latency that is ~2x worse than before. So, when W1 is released, the queue is treated as though it was: R W2, yes? Ok, so I can see where your latency figure comes from, but it's still a change of ordering in that W2 is no longer a barrier to the reads queued after it. And if we extend that to WN, we have an interleaved queue similar to this: W1 R1 W2 R2 W3 R3 ... Wn Rm. By my reading of the algorithm you are proposing, after W1 is released, we end up with the queue being treated like this: R1 R2 R3 ... Rm W2 W3 ... Wn. Right? If so, that is definitely a change of lock ordering semantics - writes do not have barrier properties anymore. - Since all readers are woken at once, you might see bursts of direct IO operations followed by bursts of truncate operations, instead of having them interleaved in smaller groups as before.
And this will result in the same application IO pattern giving vastly different results. e.g. a read IO promoted ahead of a truncate will now return data instead of -1 (beyond EOF). I'm not sure if these characteristics are good enough for the XFS use case. It seems tougher than many rwsem use cases, since the benefits of increased reader parallelism are not as clear here (i.e. the IO Well defined, predictable, deterministic ordering of operations takes far more precedence over parallelism when it comes to filesystem behaviour. The rwsems already have enough parallelism to allow us to drive more than 2 million 4k IOPS to a single file via multi-threaded direct IO(*), so we aren't limited by parallelism and concurrency in rwsems. So if this explanation still didn't make you comfortable with the proposal, I will rework it to avoid the queue
Re: torrent hash failures since 3.9.0-rc1
On Tue, Mar 12, 2013 at 09:28:54AM +0100, Sander wrote: Markus Trippelsdorf wrote (ao): On 2013.03.11 at 16:37 -0400, Theodore Ts'o wrote: We actually run fsx in a number of different configurations as part of our regression testing before we send Linus a pull request, and haven't found any issues. So unless it's a hardware problem, it seems unlikely to me that your running fsx would turn up anything. Yes, I let it run for a while anyway and it didn't report any failure. Please note that my local rtorrent version was configured with --with-posix-fallocate. Would it be possible to enhance fsx to detect such an issue? fsx in xfstests already uses fallocate() for preallocation and hole punching, so problems related to these operations can be found using fsx. The issue here, however, involves memory reclaim interactions and so is not something fsx can reproduce in isolation. :/ Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote: Hi Dave, On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote: - since all readers are woken at once, you might see bursts of direct IO operations followed by bursts of truncate operations, instead of having them interleaved in smaller groups as before. And this will result in the same application IO pattern giving vastly different results. e.g. a read IO promoted ahead of a truncate will now return data instead of -1 (beyond EOF). I will reply to this part first, as I think this gives a more concrete anchor to the discussion. The crucial point is that the truncate system call hasn't completed yet - that thread is still queued in the rwsem. You still have the guarantee that the truncate will complete eventually (even if there is a never-ending stream of IOs), and that all IO system calls that start after the truncate system call completes will see the file as truncated. Sure, but the problem is not about when the syscall completes - the problem is that you are changing where in the pipeline of IO the truncate is initially executed. i.e. ordering is not defined by when an operation completes, but by the order in which the queue is processed after a blocking operation completes. That is not when the syscall completes, but when the filesystem drops the exclusive lock. From a single threaded userspace application perspective, or for multithreaded apps with their own userspace locking, truncates complete when either the syscall completes or some time after when the application drops its lock. Either way, we're looking at completion time serialisation, and in that case what XFS or the kernel does simply doesn't matter.
However, if we are talking about userspace applications that use lockless IO algorithms or kernel side applications (like knfsd) that are exposed directly to the filesystem locking semantics, serialisation behaviour is actually defined by the filesystem's submission side locking behaviour. There is no external serialisation of IO completion, so serialisation and ordering is defined (for XFS) solely by the rwsem behaviour. You don't have guarantees about which system call will appear to run before the other if the execution times of the two system calls overlap each other, but you actually never had such a guarantee from a correctness perspective (i.e. you could never guarantee which of the two would get queued on the rwsem ahead of the other). Sure, but as long as the submission side ordering is deterministic, it doesn't matter. Ok, so I can see where your latency figure comes from, but it's still a change of ordering in that W2 is no longer a barrier to the reads queued after it. My claim is that it's not a visible change from a correctness perspective. I am not arguing that it is incorrect. I'm arguing that the change of ordering semantics breaks assumptions a lot of code makes about how these locks work. ...similar to this: W1 R1 W2 R2 W3 R3 ... Wn Rm. By my reading of the algorithm you are proposing, after W1 is released, we end up with the queue being treated like this: R1 R2 R3 ... Rm W2 W3 ... Wn. Right? Yes, if what you are representing is the state of the queue at a given point of time (which implies R1..Rm must be distinct threads, not just the same thread doing repeated requests). Yup, that's very typical. As new requests come in over time, one important point to remember is that each writer will see at most one batch of readers wake up ahead of it (though this batch may include readers that were originally queued both before and after it). And that's *exactly* the problem with the changes you are proposing - rwsems will no longer provide strongly ordered write vs read barrier semantics.
I find the name 'barrier' actually confusing when used to describe synchronous operations. To me a barrier is usually between asynchronous operations, and it is well defined which operations are ahead of or behind the barrier (either because they were queued by the same thread, or they were queued by different threads which may have synchronized together some other way). When you have hundreds or thousands of threads doing IO to the one file, it doesn't matter if the IO is issued synchronously or asynchronously by the threads - you simply have a huge amount of data IO concurrency and very, very deep pipelines. Inserting a metadata modification (truncate, preallocation, holepunch, etc) into that pipeline currently causes all new submissions to queue behind the metadata modification, waits for everything submitted before the metadata modification to complete, and then runs the metadata modification. Once it completes, it then allows everything queued up to the next metadata modification to run. That matches your definition of a barrier pretty well, I think.
Re: [PATCH] xfs: Fix possible truncation of log data in xlog_bread_noalign()
On Mon, Mar 04, 2013 at 08:32:45AM +, Tony Lu wrote: Thanks for your following up. My apologies - I just found that it is a change I made before that causes this problem. This change forces mkfs.xfs to format xfs partitions whose sector size was not smaller than 4096 bytes, which was due to a bug that earlier versions of xfs used (struct page *)->private (a long) as a bitmap to represent each block's state within a page (the size of a page could be 64K or larger, so it needs 128 bits or more to represent each block's state within a page).

You do realise that bug doesn't affect x86-64 platforms as they don't support 64k pages?

This is reproducible on a 2.6.38.6 kernel on x86. But I do not get why this change makes the xfs log inconsistent during mount/cp/umount operations.

Neither do I, and I don't care to look any further because the problem is of your own making. In future, please check first that the bug you are reporting is reproducible on a current upstream kernel and userspace.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
(0x7f629fcfd000, 134225920) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 134217728) = 134217728
mmap(NULL, 536879104, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f626fcf9000
munmap(0x7f628fcfb000, 268443648) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 268435456) = 268435456
mmap(NULL, 1073750016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f622fcf7000
munmap(0x7f626fcf9000, 536879104) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 536870912) = 536870912
mmap(NULL, 2147491840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f61afcf5000
munmap(0x7f622fcf7000, 1073750016) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1073741824) = 1073741824
mmap(NULL, 4294975488, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f60afcf3000
munmap(0x7f61afcf5000, 2147491840) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2147483648) = 2147479552
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
mmap(NULL, 8589942784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API
On Thu, Nov 29, 2012 at 02:29:33PM +0400, Glauber Costa wrote: On 11/29/2012 01:28 AM, Dave Chinner wrote: On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote: On 11/28/2012 07:17 AM, Dave Chinner wrote: On Wed, Nov 28, 2012 at 01:13:11AM +, Chris Wilson wrote: On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner da...@fromorbit.com wrote: The shrinker doesn't work on bytes - it works on + * *objects*.

And I thought you were reviewing the shrinker API to be useful where a single object may range between 4K and 4G.

Which requires rewriting all the algorithms to not be dependent on the subsystems using a fixed size object. The shrinker control function is called shrink_slab() for a reason - it was expected to be used to shrink caches of fixed sized objects allocated from slab memory. It has no concept of the amount of memory that each object consumes, just an idea of how much *IO* it takes to replace the object in memory once it's been reclaimed. The DEFAULT_SEEKS is designed to encode the fact it generally takes 2 IOs to replace either a LRU page or a filesystem slab object, and so balances the scanning based on that value. i.e. the shrinker algorithms are solidly based around fixed sized objects that have some relationship to the cost of physical IO operations to replace them in the cache.

One nit: it shouldn't take 2 IOs to replace a slab object, right?

A random dentry in a small directory will take one IO to read the inode, then another to read the block the dirent sits in. To read an inode from a cached dentry will generally take one IO to read the inode, and another to read related, out of inode information (e.g. attributes or extent/block maps). Sometimes it will only take one IO, sometimes it might take 3 or, in the case of dirents, could take hundreds of IOs if the directory structure is large enough. So a default of 2 seeks to replace any single dentry/inode in the cache is a pretty good default to use.
This should be the cost of allocating a new page, that can contain multiple objects. Once the page is in, a new object should be quite cheap to come up with.

Indeed. More on this in the next paragraph...

I'm not sure what you are trying to say here. Are you saying that you think that the IO cost for replacing a slab cache object doesn't matter? It's not the cost of allocating the page (a couple of microseconds) that is being considered - it is the 3-4 orders of magnitude worse cost of reading the object from disk (could be 20ms). The slab/page allocation is lost in the noise compared to the time it takes to fill the page cache page with data or a single slab object. Essentially, slab pages with multiple objects in them are much more expensive to replace in the cache than a page cache page.

This is a very wild thought, but now that I am diving deep in the shrinker API, and seeing things like this:

	if (reclaim_state) {
		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
		reclaim_state->reclaimed_slab = 0;
	}

That's not part of the shrinker - that's part of the vmscan code, external to the shrinker infrastructure. It's getting information back from the slab caches behind the shrinkers, and it's not the full picture because many shrinkers are not backed by slab caches. It's a work around for not having accurate feedback from the shrink_slab() code about how many pages were freed.

I know it is not part of the shrinkers, and that is precisely my point. vmscan needs to go through these kinds of hacks because our API is not strong enough to just give it back the answer that matters to the caller.

What matters is that the slab caches are shrunk in proportion to the page cache. i.e. balanced reclaim. For dentry and inode caches, what matters is the number of objects reclaimed because the shrinker algorithm balances based on the relative cost of object replacement in the cache. e.g.
if you have 1000 pages in the page LRUs, and 1000 objects in the dentry cache, and each takes 1 IO to replace, then if you reclaim 2 pages from the page LRUs and 2 pages from the dentry cache, it will take 2 IOs to replace the pages in the LRU, but 36 IOs to replace the objects in the dentry cache that were reclaimed. This is why the shrinker balances objects scanned vs LRU pages scanned - it treats each page as an object and the shrinker relates that to the relative cost of objects in the slab cache being reclaimed. i.e. the focus is on keeping a balance between caches, not reclaiming an absolute number of pages.

Note: I'm not saying this is perfect, what I'm trying to do is let you know the why behind the current algorithm. i.e. why it mostly works and why ignoring the principles behind why it works is fraught with danger...

Essentially, the problem is an impedance mismatch between the way the LRUs are scanned/balanced (in pages) and slab caches are managed.
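The proportionality being described - scan a slab cache in the same ratio as the page LRUs were scanned, weighted by the relative IO cost of object replacement - can be sketched as a simplified model. This is not the actual shrink_slab() arithmetic; the function name and the exact formula are illustrative only:

```c
#include <assert.h>

/* Simplified model of proportional slab reclaim: if we scanned
 * lru_scanned out of lru_pages on the page LRUs, scan the same
 * fraction of the slab cache's objects, scaled down by "seeks" -
 * the relative IO cost of replacing one object in the cache
 * (DEFAULT_SEEKS encodes roughly 2 IOs per object). */
static unsigned long objects_to_scan(unsigned long cache_objects,
                                     unsigned long lru_scanned,
                                     unsigned long lru_pages,
                                     unsigned long seeks)
{
    if (lru_pages == 0 || seeks == 0)
        return 0;       /* nothing to balance against */
    return (cache_objects * lru_scanned) / (lru_pages * seeks);
}
```

With 1000 objects in the cache, 100 of 1000 LRU pages scanned, and a seek weight of 2, the model scans 50 objects - the balance point, not an absolute reclaim target, which is the property the text says the current algorithm is built around.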
Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
On Thu, Nov 29, 2012 at 11:02:24AM -0800, Andi Kleen wrote: Dave Chinner da...@fromorbit.com writes: Comments, thoughts and flames all welcome.

Doing the reclaim per CPU sounds like a big change in the VM balance.

It's per node, not per CPU. And AFAICT, it hasn't changed the balance of page cache vs inode/dentry caches under general, global workloads at all.

Doesn't this invalidate some zone reclaim mode settings?

No, because zone reclaim is per-node and the shrinkers now can reclaim just from a single node. i.e. the behaviour is now better suited to the aims of zone reclaim, which is to free memory from a single, targeted node. Indeed, I removed a hack in the zone reclaim code that sprayed slab reclaim across the entire machine until sufficient objects had been freed from the target node.

How did you validate all this?

fakenuma setups, various workloads that generate even dentry/slab cache loadings across all nodes, adding page cache pressure on a single node, watching slab reclaim from a single node. That sort of thing. I haven't really done any performance testing other than "not obviously slower". There's no point optimising anything before there's any sort of agreement as to whether this is the right approach to take or not.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote: On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason chris.ma...@fusionio.com wrote: Just reading the new blkdev_get_blocks, it looks like we're mixing shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and it has no relation at all to the actual block size of the device. The interface is abusing b_size to ask for as large a mapping as possible.

Ugh. That's a big violation of how buffer-heads are supposed to work: the block number is very much defined to be in multiples of b_size (see for example submit_bh() that turns it into a sector number). But you're right. The direct-IO code really *is* violating that, and knows that get_block() ends up being defined in i_blkbits regardless of b_size.

Same with mpage_readpages(), so it's not just direct IO that has this problem.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Fri, Nov 30, 2012 at 11:36:01AM -0500, Christoph Hellwig wrote: On Fri, Nov 30, 2012 at 01:49:10PM +1100, Dave Chinner wrote: Ugh. That's a big violation of how buffer-heads are supposed to work: the block number is very much defined to be in multiples of b_size (see for example submit_bh() that turns it into a sector number). But you're right. The direct-IO code really *is* violating that, and knows that get_block() ends up being defined in i_blkbits regardless of b_size. Same with mpage_readpages(), so it's not just direct IO that has this problem The mpage code may actually fall back to BHs. I have a version of the direct I/O code that uses the iomap_ops from the multi-page write code that you originally started. It uses the new op as primary interface for direct I/O and provides a helper for filesystems that still use buffer heads internally. I'll try to dust it off and send out a version for the current kernel. So it was based on this interface? (I went looking for this code on google a couple of days ago so I could point at it and say we should be using an iomap structure, not buffer heads, but it looks like I never posted it to fsdevel or the xfs lists...) 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 090f0ea..e247d62 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -522,6 +522,7 @@ enum positive_aop_returns {
 struct page;
 struct address_space;
 struct writeback_control;
+struct iomap;

 struct iov_iter {
 	const struct iovec *iov;
@@ -614,6 +615,9 @@ struct address_space_operations {
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
+
+	int (*iomap)(struct address_space *mapping, loff_t pos,
+			ssize_t length, struct iomap *iomap, int cmd);
 };

 /*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
new file mode 100644
index 000..7708614
--- /dev/null
+++ b/include/linux/iomap.h
@@ -0,0 +1,45 @@
+#ifndef _IOMAP_H
+#define _IOMAP_H
+
+/* ->iomap a_op command types */
+#define IOMAP_READ	0x01	/* read the current mapping starting at the
+				   given position, trimmed to a maximum length.
+				   FS's should use this to obtain and lock
+				   resources within this range */
+#define IOMAP_RESERVE	0x02	/* reserve space for an allocation that spans
+				   the given iomap */
+#define IOMAP_ALLOCATE	0x03	/* allocate space in a given iomap - must have
+				   first been reserved */
+#define IOMAP_UNRESERVE	0x04	/* return unused reserved space for the given
+				   iomap and used space. This will always be
+				   called after a IOMAP_READ so as to allow the
+				   FS to release held resources. */
+
+/* types of block ranges for multipage write mappings.
+ */
+#define IOMAP_HOLE	0x01	/* no blocks allocated, need allocation */
+#define IOMAP_DELALLOC	0x02	/* delayed allocation blocks */
+#define IOMAP_MAPPED	0x03	/* blocks allocated @blkno */
+#define IOMAP_UNWRITTEN	0x04	/* blocks allocated @blkno in unwritten state */
+
+struct iomap {
+	sector_t	blkno;		/* first sector of mapping */
+	loff_t		offset;		/* file offset of mapping, bytes */
+	ssize_t		length;		/* length of mapping, bytes */
+	int		type;		/* type of mapping */
+	void		*priv;		/* fs private data associated with map */
+};
+
+static inline bool
+iomap_needs_allocation(struct iomap *iomap)
+{
+	return iomap->type == IOMAP_HOLE;
+}
+
+/* multipage write interfaces use iomaps */
+typedef int (*mpw_actor_t)(struct address_space *mapping, void *src,
+			loff_t pos, ssize_t len, struct iomap *iomap);
+
+ssize_t multipage_write_segment(struct address_space *mapping, void *src,
+			loff_t pos, ssize_t length, mpw_actor_t actor);
+
+#endif /* _IOMAP_H */

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
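The logic in the proposed header is small enough to exercise in userspace. The following is a sketch only - it substitutes stdint typedefs for the kernel's sector_t/loff_t so the mapping-type helper can be compiled and checked outside the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types the header uses. */
typedef uint64_t sector_t;
typedef int64_t  loff_t;

#define IOMAP_HOLE      0x01    /* no blocks allocated, need allocation */
#define IOMAP_DELALLOC  0x02    /* delayed allocation blocks */
#define IOMAP_MAPPED    0x03    /* blocks allocated @blkno */
#define IOMAP_UNWRITTEN 0x04    /* blocks allocated @blkno, unwritten */

struct iomap {
    sector_t blkno;    /* first sector of mapping */
    loff_t   offset;   /* file offset of mapping, bytes */
    int64_t  length;   /* length of mapping, bytes */
    int      type;     /* type of mapping */
    void    *priv;     /* fs private data associated with map */
};

/* Only a hole needs blocks allocated before it can be written;
 * delalloc, mapped and unwritten extents already have space. */
static inline bool iomap_needs_allocation(struct iomap *iomap)
{
    return iomap->type == IOMAP_HOLE;
}
```

The key design point the struct captures is that a mapping is described in file offset/length terms plus a starting sector, rather than as a chain of per-block buffer heads - which is exactly the replacement for buffer heads being argued for in the thread.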
Re: [patch,v2] bdi: add a user-tunable cpu_list for the bdi flusher threads
On Mon, Dec 03, 2012 at 01:53:39PM -0500, Jeff Moyer wrote: Hi, In realtime environments, it may be desirable to keep the per-bdi flusher threads from running on certain cpus. This patch adds a cpu_list file to /sys/class/bdi/* to enable this. The default is to tie the flusher threads to the same numa node as the backing device (though I could be convinced to make it a mask of all cpus to avoid a change in behaviour).

The default seems reasonable to me.

Comments, as always, are appreciated.

+static ssize_t cpu_list_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	struct bdi_writeback *wb = &bdi->wb;
+	cpumask_var_t newmask;
+	ssize_t ret;
+	struct task_struct *task;
+
+	if (!alloc_cpumask_var(&newmask, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = cpulist_parse(buf, newmask);
+	if (!ret) {
+		spin_lock(&bdi->wb_lock);
+		task = wb->task;
+		if (task)
+			get_task_struct(task);
+		spin_unlock(&bdi->wb_lock);
+		if (task) {
+			ret = set_cpus_allowed_ptr(task, newmask);
+			put_task_struct(task);
+		}

Why is this set here outside the bdi->flusher_cpumask_mutex?

Also, I'd prefer it named ..._lock as that is the normal convention for such variables. You can tell the type of lock from the declaration or the use...

@@ -437,6 +488,14 @@ static int bdi_forker_thread(void *ptr)
 	spin_lock_bh(&bdi->wb_lock);
 	bdi->wb.task = task;
 	spin_unlock_bh(&bdi->wb_lock);
+	mutex_lock(&bdi->flusher_cpumask_mutex);
+	ret = set_cpus_allowed_ptr(task,
+				   bdi->flusher_cpumask);
+	mutex_unlock(&bdi->flusher_cpumask_mutex);

As it is set under the lock here.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 16/31] loop: use aio to perform io on the underlying file
On Mon, Dec 03, 2012 at 10:59:39AM -0600, Dave Kleikamp wrote: On 11/22/2012 05:06 PM, Dave Chinner wrote: On Wed, Nov 21, 2012 at 04:40:56PM -0600, Dave Kleikamp wrote: From: Zach Brown z...@zabbo.net This uses the new kernel aio interface to process loopback IO by submitting concurrent direct aio. Previously loop's IO was serialized by synchronous processing in a thread. The aio operations specify the memory for the IO with the bio_vec arrays directly instead of mappings of the pages. The use of aio operations is enabled when the backing file supports the read_iter and write_iter methods. These methods must only be added when O_DIRECT on bio_vecs is supported. Signed-off-by: Dave Kleikamp dave.kleik...@oracle.com Cc: Zach Brown z...@zabbo.net

I suspect aio io completion here doesn't work for FUA write IO.

Since the underlying file system is doing direct IO, we at least know that the IO has been performed to the underlying device. It could still be in the device's write cache, so maybe an fsync is still needed. It wouldn't be hard to fix if that's the right thing to do.

FUA means the write is all the way to stable storage when completion is signalled by the underlying device. That means the loop device needs to (at least) fdatasync after the write completes...

+static void lo_rw_aio_complete(u64 data, long res)
+{
+	struct bio *bio = (struct bio *)(uintptr_t)data;
+
+	if (res > 0)
+		res = 0;
+	else if (res < 0)
+		res = -EIO;
+
+	bio_endio(bio, res);
+}

This effectively does nothing...

It passes the IO completion from the underlying device to the caller.

Yes, I know. What I'm saying is that it does nothing on completion of a FUA write when it should be doing a fsync.
-		lo->lo_encrypt_key_size) {
-			ret = -EOPNOTSUPP;
-			goto out;
-		}
-		ret = file->f_op->fallocate(file, mode, pos,
-					    bio->bi_size);
-		if (unlikely(ret && ret != -EINVAL &&
-			     ret != -EOPNOTSUPP))
-			ret = -EIO;
-		goto out;
-	}
-
 	ret = lo_send(lo, bio, pos);
 	if ((bio->bi_rw & REQ_FUA) && !ret) {

And as you can see here, after writing the data in the file-backed path, there's special handling for REQ_FUA (i.e. another fsync).

In this path, the data may still be in the underlying filesystem's page cache.

Sure, but fsync means that it ends up on stable storage, as per the requirement of an FUA write. Direct IO does not mean the data is on stable storage, just that it bypasses the page cache. And this extra fsync is now not done in the aio path. I.e. the AIO completion path needs to issue the fsync to maintain correct REQ_FUA semantics...

If this is really necessary, I can fix it.

Absolutely. If we don't implement FUA properly, we'll end up with corrupted filesystems and/or data loss when kernel crashes or powerloss occurs. That's not an acceptable outcome, so we need FUA to be implemented properly.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
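The requirement being argued for boils down to: a FUA write must not be reported complete until the data is on stable storage, so the completion path must flush the backing file first. In userspace terms this can be sketched as follows - a hypothetical illustration, not the loop-driver code; the function name and flag are mine:

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Complete a write on behalf of a caller that may have requested
 * FUA semantics. Writing (even via direct IO) only gets the data to
 * the device or its cache; for FUA we must also flush the backing
 * file (fdatasync) before reporting success, mirroring the fsync
 * the loop driver's file-backed REQ_FUA path performs. */
static int complete_write(int fd, const void *buf, size_t len, int fua)
{
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    if (fua && fdatasync(fd) != 0)  /* data must reach stable storage */
        return -1;
    return 0;
}
```

The design point matches the thread: dropping the flush from the aio path silently weakens REQ_FUA from "on stable storage" to "handed to the device", which is exactly the corruption window being objected to.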
Re: [PATCH] Update atime from future.
On Tue, Dec 04, 2012 at 01:56:39AM +0800, yangsheng wrote: Relatime should update the inode atime if it is more than a day in the future. The original problem seen was a tarball that had a bad atime, but could also happen if someone fat-fingers a touch. The future atime will never be fixed. Before the relatime patch, the future atime would be updated back to the current time on the next access.

So if someone accidentally changes time back a few days, access times go backwards for everything? That doesn't sound right to me - it's going to seriously screw up backups and other scanners that use atime to determine newly accessed files.

IMO, if you fat-finger a manual atime update or use atimes direct from tarballs, then that's your problem as a user and not the responsibility of the kernel to magically fix for you.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [patch,v2] bdi: add a user-tunable cpu_list for the bdi flusher threads
On Tue, Dec 04, 2012 at 09:42:55AM -0500, Jeff Moyer wrote: Dave Chinner da...@fromorbit.com writes: On Mon, Dec 03, 2012 at 01:53:39PM -0500, Jeff Moyer wrote:

+static ssize_t cpu_list_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	struct bdi_writeback *wb = &bdi->wb;
+	cpumask_var_t newmask;
+	ssize_t ret;
+	struct task_struct *task;
+
+	if (!alloc_cpumask_var(&newmask, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = cpulist_parse(buf, newmask);
+	if (!ret) {
+		spin_lock(&bdi->wb_lock);
+		task = wb->task;
+		if (task)
+			get_task_struct(task);
+		spin_unlock(&bdi->wb_lock);
+		if (task) {
+			ret = set_cpus_allowed_ptr(task, newmask);
+			put_task_struct(task);
+		}

Why is this set here outside the bdi->flusher_cpumask_mutex?

The cpumask mutex protects updates to bdi->flusher_cpumask, it has nothing to do with the call to set_cpus_allowed. We are protected from concurrent calls to cpu_list_store by the sysfs mutex that is taken on entry. I understand that this is non-obvious, and it wouldn't be wrong to hold the mutex here. If you'd like me to do that for clarity, that would be ok with me.

At minimum it needs a comment like this, otherwise someone is going to come along and ask "why is that safe?" like I just did. I'd prefer the code to be obviously consistent to avoid the need for commenting about the special case, especially when the obviously correct code is simpler ;)

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type
On Mon, Feb 11, 2013 at 05:25:58PM +0900, Namjae Jeon wrote: From: Namjae Jeon namjae.j...@samsung.com This patch is a follow-up to the patch below: [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63

diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
index a836118..3391800 100644
--- a/fs/xfs/xfs_export.c
+++ b/fs/xfs/xfs_export.c
@@ -48,7 +48,7 @@ static int xfs_fileid_length(int fileid_type)
 	case FILEID_INO32_GEN_PARENT | XFS_FILEID_TYPE_64FLAG:
 		return 6;
 	}
-	return 255; /* invalid */
+	return FILEID_INVALID; /* invalid */
 }

I think you can drop the /* invalid */ comment from there now as it is redundant with this change.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 0/6 RFC] Mapping range lock
On Mon, Feb 04, 2013 at 01:38:31PM +0100, Jan Kara wrote: On Thu 31-01-13 16:07:57, Andrew Morton wrote: c) i_mutex doesn't allow any parallelism of operations using it and some filesystems work around this for specific cases (e.g. DIO reads). Using range locking allows for concurrent operations (e.g. writes, DIO) on different parts of the file. Of course, range locking itself isn't enough to make the parallelism possible. Filesystems still have to somehow deal with the concurrency when manipulating inode allocation data. But the range locking at least provides a common VFS mechanism for the serialization the VFS itself needs and it's up to each filesystem to serialize more if it needs to.

That would be useful to end-users, but I'm having trouble predicting *how* useful.

As Zheng said, there are people interested in this for DIO. Currently filesystems each invent their own tweaks to avoid the serialization at least for the easiest cases.

The thing is, this won't replace the locking those filesystems use to parallelise DIO - it just adds another layer of locking they'll need to use. The locks filesystems like XFS use to serialise IO against hole punch also serialise against many more internal functions, and so if these range locks don't have the same capability we're going to have to retain those locks even after the range locks are introduced. It basically means we're going to have two layers of range locks - one for IO sanity and atomicity, and then this layer just for hole punch vs mmap.

As I've said before, what we really need in XFS is IO range locks, because we need to be able to serialise operations against IO in progress, not page cache operations in progress. IOWs, locking at the mapping tree level does not provide the right exclusion semantics we need to get rid of the existing filesystem locking that allows concurrent IO to be managed.
Hence the XFS IO path locking suddenly becomes 4 locks deep:

	i_mutex
	  XFS_IOLOCK_{SHARED,EXCL}
	    mapping range lock
	      XFS_ILOCK_{SHARED,EXCL}

That's because the buffered IO path uses per-page lock ranges and to provide atomicity of read vs write, read vs truncate, etc we still need to use the XFS_IOLOCK_EXCL to provide this functionality. Hence I really think we need to be driving this lock outwards to where the i_mutex currently sits, turning it into an *IO range lock*, and not an inner-level mapping range lock. i.e. flattening the locking to:

	io_range_lock(off, len)
	  fs internal inode metadata modification lock

Yes, I know this causes problems with mmap and locking orders, but perhaps we should be trying to get that fixed first because it simplifies the whole locking schema we need for filesystems to behave sanely. i.e. shouldn't we be aiming to simplify things as we rework locking rather than make them more complex?

IOWs, I think the "mapping range lock" approach is not the right level to be providing IO exclusion semantics. After all, it's entire IO ranges that we need to provide -atomic- exclusion for...

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 0/6 RFC] Mapping range lock
On Wed, Feb 06, 2013 at 08:25:34PM +0100, Jan Kara wrote: On Wed 06-02-13 10:25:12, Dave Chinner wrote: On Mon, Feb 04, 2013 at 01:38:31PM +0100, Jan Kara wrote: On Thu 31-01-13 16:07:57, Andrew Morton wrote: c) i_mutex doesn't allow any parallelism of operations using it and some filesystems work around this for specific cases (e.g. DIO reads). Using range locking allows for concurrent operations (e.g. writes, DIO) on different parts of the file. Of course, range locking itself isn't enough to make the parallelism possible. Filesystems still have to somehow deal with the concurrency when manipulating inode allocation data. But the range locking at least provides a common VFS mechanism for the serialization the VFS itself needs and it's up to each filesystem to serialize more if it needs to.

That would be useful to end-users, but I'm having trouble predicting *how* useful.

As Zheng said, there are people interested in this for DIO. Currently filesystems each invent their own tweaks to avoid the serialization at least for the easiest cases.

The thing is, this won't replace the locking those filesystems use to parallelise DIO - it just adds another layer of locking they'll need to use. The locks filesystems like XFS use to serialise IO against hole punch also serialise against many more internal functions and so if these range locks don't have the same capability we're going to have to retain those locks even after the range locks are introduced. It basically means we're going to have two layers of range locks - one for IO sanity and atomicity, and then this layer just for hole punch vs mmap.

As I've said before, what we really need in XFS is IO range locks because we need to be able to serialise operations against IO in progress, not page cache operations in progress.

Hum, I'm not sure I follow you here. So mapping tree lock + PageLocked + PageWriteback serialize all IO for the part of the file underlying the page. I.e.
at most one of truncate (punch hole), DIO, writeback, buffered write, buffered read, page fault can run on that part of the file.

Right, it serialises page cache operations sufficiently to avoid page cache coherence problems, but it does not serialise operations sufficiently to provide atomicity between operations that should be atomic w.r.t. each other.

So how come it doesn't provide enough serialization for XFS? Ah, is it the problem that if two threads do overlapping buffered writes to a file then we can end up with data mixed from the two writes (if we didn't have something like i_mutex)?

That's one case of specific concern - the POSIX write() atomicity guarantee - but it indicates the cause of many of my other concerns, too. e.g. write vs prealloc, write vs punch, read vs truncate, write vs truncate, buffered vs direct write, etc. Basically, page-cache granularity locking for buffered IO means that it cannot be wholly serialised against any other operation in progress. That means we can't use the range lock to provide a barrier to guarantee that no IO is currently in progress at all, and hence it doesn't provide the IO barrier semantics we need for various operations within XFS.

An example of this is that the online defrag ioctl requires that copyin + mtime updates in the write path are atomic w.r.t. the swap extents ioctl, so that it can detect concurrent modification of the file being defragged and abort. The page cache based range locks simply don't provide this coverage, and so we'd need to maintain the IO operation locking we currently have to provide this exclusion.

Truncate is something I also see as particularly troublesome, because the i_mutex currently provides mutual exclusion against the operational range of a buffered write (i.e. at the .aio_write level) and not page granularity like this patch set would result in. Hence the behaviour of write vs truncate races could change quite significantly. e.g.
instead of "write completes then truncate" or "truncate completes then write", we could have partial write, truncate, write continues and completes, resulting in a bunch of zeros inside the range the write call wrote to. The application won't even realise that the data it wrote was corrupted by the racing truncate.

IOWs, I think that the fundamental unit of atomicity we need here is the operational range of the syscall, i.e. that each of the protected operations needs to execute atomically as a whole with respect to each other, not in a piecemeal fashion where some use whole range locking and others use fine-grained page-range locking...

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: 3.8-rc5 xfs corruption
On Wed, Jan 30, 2013 at 10:16:47PM -0500, CAI Qian wrote: Hello, (Sorry to post to xfs mailing lists but unsure about which one is the best for this.) Trimmed to just x...@oss.sgi.com. I have seen something like this once during testing on a system with an EMC VNX FC/multipath back-end. This is a trace from the verifier code that was added in 3.8-rc1, so I doubt it has anything to do with any problem you've seen in the past. Can you tell us what workload you were running and what hardware you are using, as per: http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F As it is, if you mounted the filesystem after this problem was detected, log recovery probably propagated it to disk. I'd suggest that you run xfs_repair -n on the device and post the output so we can see if any corruption has actually made it to disk. If no corruption made it to disk, it's possible that we've got the incorrect verifier attached to the buffer. [ 3025.063024] 8801a0d5: 2e 2e 2f 2e 2e 2f 75 73 72 2f 6c 69 62 2f 6d 6f ../../usr/lib/mo The start of a block contains a path, and the only type of block that can contain this format of metadata is a remote symlink block. Remote symlink blocks don't have a verifier attached to them, as there is nothing that can currently be used to verify them as correct. I can't see exactly how this can occur, as stale buffers have the verifier ops cleared before being returned to the new user, and newly allocated xfs_bufs are zeroed before being initialised. I really need to know what you are doing to be able to get to the bottom of it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] xfs: Fix possible truncation of log data in xlog_bread_noalign()
On Tue, Feb 26, 2013 at 07:28:19AM +, Tony Lu wrote: I get a reliable way to reproduce this bug. The logprint and metadump are attached. Kernel version: 2.6.38.8 This is important, because this: 4 umount /dev/sda1 /mnt 5 mount /dev/sda1 /mnt XFS mounting filesystem sda1 Starting XFS recovery on filesystem: sda1 (logdev: internal) Ending XFS recovery on filesystem: sda1 (logdev: internal) indicates that the unmount record is either not being written, is being written when the log has not been fully flushed, or log recovery is not finding it. You need to copy out the log first to determine what the state of the log is before you mount the filesystem - that way, if log recovery is run, you can see whether it was supposed to run. (i.e. a clean log should never run recovery, and unmount should always leave a clean log). Either way, I'm more than 10,000 iterations into a run of 100k iterations of this script on 3.8.0, and I have not seen a single log recovery attempt occur. That implies you are seeing a bug in 2.6.38 that has since been fixed. It would be a good idea for you to upgrade the system to a 3.8 kernel and determine if you can still reproduce the problem on your system - that way we'll know if the bug really has been fixed or not. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [GIT PULL] ext4 updates for 3.9
On Wed, Feb 27, 2013 at 02:29:07PM -0500, Theodore Ts'o wrote: On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote: Looks like it's fixed here too. How did this make it through -next without anyone hitting it? Is anyone running xfstests or similar on linux-next regularly? I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip before I submitted a pull request. The problem is that XFSTESTS is S-L-O-W if you use large partitions, so typically I use 5GB partition sizes for my test runs. This isn't the case for XFS. I typically see 1TB scratch devices only being ~10-20% slower than 10GB scratch devices, and 10TB only being a little slower than 1TB scratch devices. I have to use sparse devices and --large-fs for 100TB filesystem testing, so I can't directly compare the speeds to those that I run on physical devices. However, I can say that it isn't significantly slower than using small scratch devices... So what we probably need to do is to have a separate set of tests using a loopback mount, and perhaps an artificially created file system which has a large percentage of the blocks in the middle of the file system busied out, to make efficient testing of these sorts of ... That's exactly what the --large-fs patch set I posted months ago does for ext4 - it uses fallocate() to fill all but 50GB of the large filesystem without actually writing any data, and runs the standard tests in the remaining unused space. However, last time I tested ext4 with this patchset (when I posted the patches months ago), multi-TB preallocation on ext4 was still too slow to make it practical for testing on devices larger than 2-3TB. Perhaps it would make testing 1-2TB ext4 filesystems fast enough that you could do it regularly... Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
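[Editorial aside: the sparse-device trick behind the --large-fs testing described above can be shown with a minimal sketch. Python here purely for illustration; xfstests itself builds its scratch devices from truncated/fallocated image files, and the 1 GiB size below is arbitrary - the same mechanism works at 100TB scale.]

```python
import os
import tempfile

# A "large" scratch device is just a sparse file: a huge apparent size,
# but almost no blocks allocated until something actually writes to it.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 1 << 30)          # apparent size: 1 GiB
st = os.fstat(fd)
apparent = st.st_size              # what the filesystem under test sees
allocated = st.st_blocks * 512     # st_blocks counts 512-byte units
os.close(fd)
os.unlink(path)
```

The apparent size is 1 GiB while the allocated space stays near zero, which is why testing a multi-terabyte filesystem image need not cost terabytes of disk.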
Re: [PATCH V5 00/30] loop: Issue O_DIRECT aio using bio_vec
On Fri, Jan 18, 2013 at 12:16:35PM -0500, Jeff Moyer wrote: Dave Kleikamp dave.kleik...@oracle.com writes: This patchset was begun by Zach Brown and was originally submitted for review in October, 2009. Feedback was positive, and I have picked up where he left off, porting his patches to the latest mainline kernel and adding support for more file systems. [snip] My hope is that this patchset is finally ready for linux-next. Hi, Shaggy, I'm finally getting around to testing this using xfstests. This is my setup: dd if=/dev/zero of=testdev.img bs=1M count=1024 dd if=/dev/zero of=scratchdev.img bs=1M count=1024 losetup -f ./testdev.img losetup -f ./scratchdev.img mkfs -t xfs /dev/loop0 mount /dev/loop0 /mnt/test export TEST_DIR=/mnt/test export TEST_DEV=/dev/loop0 export SCRATCH_MNT=/mnt/scratch export SCRATCH_DEV=/dev/loop1 I then ran: ./check -g aio and here is the summary: Ran: 112 113 198 207 208 209 210 211 212 239 240 Failures: 112 198 207 239 240 Failed 5 of 11 tests There's no need to limit xfstests to the aio group here. If the loop device is what you are actually testing, then you probably want to run the auto/quick groups, as they will do a whole lot more than just data IO to the loop devices... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
On Mon, Jan 21, 2013 at 08:08:53PM +0400, Glauber Costa wrote: On 11/28/2012 03:14 AM, Dave Chinner wrote: [PATCH 09/19] list_lru: per-node list infrastructure This makes the generic LRU list much more scalable by changing it to a {list,lock,count} tuple per node. There are no external API changes in this changeover, so it is transparent to current users. [PATCH 10/19] shrinker: add node awareness [PATCH 11/19] fs: convert inode and dentry shrinking to be node Adds a nodemask to the struct shrink_control for callers of shrink_slab to set appropriately for their reclaim context. This nodemask is then passed by the inode and dentry cache reclaim code to the generic LRU list code to implement node aware shrinking. I have a follow-up question that popped up from a discussion between me and my very American friend Johnny Wheeler, also known as Johannes Weiner (CC'd). I actually remember us discussing this, but I don't fully remember the outcome. And since I can't find it anywhere, it must have been in a medium other than e-mail. So I thought it would do no harm to at least document it... Why are we doing this per-node, instead of per-zone? It seems to me that the goal is to collapse all zones of a node into a single list, but since the number of zones is not terribly larger than the number of nodes, and zones are where the pressure comes from, what do we really gain from this? The number is quite a bit higher - there are platforms with 5 zones to a node. The reality is, though, that for most platforms slab allocations come from a single zone - they never come from ZONE_DMA, ZONE_HIGHMEM or ZONE_MOVEABLE, so there is no good reason for having cache LRUs for these zones. So, two zones at most. And then there's the complexity issue - it's simple/trivial to use per-node lists, node masks, etc.
It's an obvious abstraction that everyone understands, is simple to understand, achieves exactly the purpose that is needed and is not tied to the /current/ implementation of the VM memory management code. I don't see any good reason for tying LRUs to MM zones. The original implementation of the per-node shrinkers by Nick Piggin did this: the LRUs for the dentry and inode caches were embedded in the struct zone, and it wasn't generically extensible because of that. i.e. node-aware shrinkers were directly influenced by the zone infrastructure, and so the internal implementation of the mm subsystem started leaking out and determining how completely unrelated subsystems needed to implement their own cache management. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
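[Editorial aside: a rough userspace model may make the per-node {list,lock,count} design concrete. This is a Python sketch, not the kernel code - threading.Lock stands in for the per-node spinlock and a set stands in for the active nodemask - but the names mirror the structures in the patch.]

```python
import threading
from collections import deque

MAX_NUMNODES = 4        # fixed at compile time in the kernel

class ListLRUNode:
    """One {lock, list, count} tuple, kept per node."""
    def __init__(self):
        self.lock = threading.Lock()
        self.items = deque()
        self.nr_items = 0

class ListLRU:
    def __init__(self):
        self.node = [ListLRUNode() for _ in range(MAX_NUMNODES)]
        self.active_nodes = set()   # stand-in for nodemask_t

    def add(self, item, nid):
        # Items land on the list of the node their memory belongs to.
        n = self.node[nid]
        with n.lock:
            n.items.append(item)
            n.nr_items += 1
            self.active_nodes.add(nid)

    def shrink_node(self, nid, nr_to_scan):
        # A node-aware shrinker only touches the node it was asked about,
        # so reclaim on one node never contends with another node's lock.
        n = self.node[nid]
        freed = 0
        with n.lock:
            while n.items and freed < nr_to_scan:
                n.items.popleft()
                n.nr_items -= 1
                freed += 1
            if n.nr_items == 0:
                self.active_nodes.discard(nid)
        return freed
```

The active_nodes mask is what makes scanning cheap: a shrinker walking all nodes can skip any node whose list is empty without taking its lock.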
Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers
On Wed, Jan 23, 2013 at 06:36:33PM +0400, Glauber Costa wrote: On 01/22/2013 03:21 AM, Dave Chinner wrote: On Mon, Jan 21, 2013 at 08:08:53PM +0400, Glauber Costa wrote: On 11/28/2012 03:14 AM, Dave Chinner wrote: [PATCH 09/19] list_lru: per-node list infrastructure This makes the generic LRU list much more scalable by changing it to a {list,lock,count} tuple per node. There are no external API changes in this changeover, so it is transparent to current users. [PATCH 10/19] shrinker: add node awareness [PATCH 11/19] fs: convert inode and dentry shrinking to be node Adds a nodemask to the struct shrink_control for callers of shrink_slab to set appropriately for their reclaim context. This nodemask is then passed by the inode and dentry cache reclaim code to the generic LRU list code to implement node aware shrinking. I have a follow-up question that popped up from a discussion between me and my very American friend Johnny Wheeler, also known as Johannes Weiner (CC'd). I actually remember us discussing this, but I don't fully remember the outcome. And since I can't find it anywhere, it must have been in a medium other than e-mail. So I thought it would do no harm to at least document it... Why are we doing this per-node, instead of per-zone? It seems to me that the goal is to collapse all zones of a node into a single list, but since the number of zones is not terribly larger than the number of nodes, and zones are where the pressure comes from, what do we really gain from this? The number is quite a bit higher - there are platforms with 5 zones to a node. The reality is, though, that for most platforms slab allocations come from a single zone - they never come from ZONE_DMA, ZONE_HIGHMEM or ZONE_MOVEABLE, so there is no good reason for having cache LRUs for these zones. So, two zones at most. Yes, but one would expect that most of those special zones would be present only in the first node, no? (correct me if I am wrong here).
As I understand it, every node has an identical zone setup (i.e. a flat array of MAX_NR_ZONES zones in the struct pglist_data), and pages are simply placed in the appropriate zones on each node... Also, IIUC, the behaviour of the zones on each node is architecture dependent, so we can't make assumptions that certain zones are only ever used on the first node... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 09/19] list_lru: per-node list infrastructure
On Wed, Jan 16, 2013 at 11:21:44AM -0800, Glauber Costa wrote: On 11/27/2012 03:14 PM, Dave Chinner wrote: From: Dave Chinner dchin...@redhat.com Now that we have an LRU list API, we can start to enhance the implementation. This splits the single LRU list into per-node lists and locks to enhance scalability. Items are placed on lists according to the node the memory belongs to. To make scanning the lists efficient, also track whether the per-node lists have entries in them in an active nodemask. Signed-off-by: Dave Chinner dchin...@redhat.com
---
 include/linux/list_lru.h |  14 ++--
 lib/list_lru.c           | 160 +++---
 2 files changed, 129 insertions(+), 45 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3423949..b0e3ba2 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -8,21 +8,23 @@
 #define _LRU_LIST_H 0

 #include <linux/list.h>
+#include <linux/nodemask.h>

-struct list_lru {
+struct list_lru_node {
 	spinlock_t		lock;
 	struct list_head	list;
 	long			nr_items;
+} ____cacheline_aligned_in_smp;
+
+struct list_lru {
+	struct list_lru_node	node[MAX_NUMNODES];
+	nodemask_t		active_nodes;
 };

MAX_NUMNODES will default to 1 << 9, if I'm not mistaken. Your list_lru_node seems to be around 32 bytes on 64-bit systems (128 with debug). So we're talking about 16k per lru. *nod* It is relatively little compared to the number of inodes typically on an LRU. The superblocks alone are present by the dozens even in a small system, and I believe the whole goal of this API is to get more users to switch to it. This can easily use up a respectable bunch of megs. Isn't it a bit too much? Maybe, but for active superblocks it only takes a handful of cached inodes to make this 16k look like noise, so I didn't care. Indeed, a typical active filesystem could be consuming gigabytes of memory in the slab, so 16k is a tiny amount of overhead to track this amount of memory more efficiently.
Most other LRU/shrinkers are tracking large objects and only have a single LRU instance machine-wide. Hence the numbers argument doesn't play out well in favour of a more complex, dynamic solution for them, either. Sometimes dumb and simple is the best approach ;) I am wondering if we can't do better here and at least allocate+grow according to the actual number of nodes. We could add hotplug notifiers and grow/shrink the node array as nodes get hot plugged, but that seems unnecessarily complex given how rare such operations are. If superblock proliferation is the main concern here, then doing something as simple as allowing filesystems to specify that they want numa aware LRU lists via a mount_bdev() flag would solve this problem. If the flag is set, then full numa lists are created. Otherwise the LRU list simply has a single node, collapses all node IDs down to 0 and ignores all NUMA optimisations... That way the low item count virtual filesystems like proc, sys, hugetlbfs, etc. won't use up memory, but filesystems that actually make use of NUMA awareness still get the more expensive, scalable implementation. Indeed, any subsystem that is not performance or location sensitive can use the simple single list version, so we can avoid overhead in that manner system wide... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
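[Editorial aside: the opt-in scheme proposed above - full per-node lists for filesystems that ask for them, a single list with all node IDs collapsed to 0 for everyone else - could look something like this hypothetical sketch. Python for illustration only; the numa_aware flag and all names here are illustrative, not an existing kernel API.]

```python
class LRU:
    def __init__(self, numa_aware, nr_nodes=4):
        # Non-NUMA-aware users pay for exactly one list; NUMA-aware
        # users get one list per node.
        self.nr_nodes = nr_nodes if numa_aware else 1
        self.lists = [[] for _ in range(self.nr_nodes)]

    def node_index(self, nid):
        # Collapse every node ID to 0 when NUMA awareness is off.
        return nid if self.nr_nodes > 1 else 0

    def add(self, item, nid):
        self.lists[self.node_index(nid)].append(item)
```

A virtual filesystem like proc would construct the single-list variant and never touch the per-node machinery, while an XFS-style filesystem would opt in to the full array.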
Re: [PATCH 09/19] list_lru: per-node list infrastructure
into every memcg that contains that type of object for that cache context /me stops rambling Your current suggestion of going per-node only in the performance critical filesystems could also possibly work, provided this count is expected to be small. The problem is deciding on a per-filesystem basis. I was thinking that all filesystems of a specific type would use a particular type of structure, not that specific instances of a filesystem could use different types. Cheers, Dave. -- Dave Chinner da...@fromorbit.com