Re: Hang in XFS reclaim on 3.7.0-rc3

2012-10-29 Thread Dave Chinner
On Mon, Oct 29, 2012 at 09:03:15PM +0100, Torsten Kaiser wrote:
 After experiencing a hang of all IO yesterday (
 http://marc.info/?l=linux-kernelm=135142236520624w=2 ), I turned on
 LOCKDEP after upgrading to -rc3.
 
 I then tried to replicate the load that hung yesterday and got the
 following lockdep report, implicating XFS instead of by stacking swap
 onto dm-crypt and md.
 
 [ 2844.971913]
 [ 2844.971920] =
 [ 2844.971921] [ INFO: inconsistent lock state ]
 [ 2844.971924] 3.7.0-rc3 #1 Not tainted
 [ 2844.971925] -
 [ 2844.971927] inconsistent {RECLAIM_FS-ON-W} - {IN-RECLAIM_FS-W} usage.
 [ 2844.971929] kswapd0/725 [HC0[0]:SC0[0]:HE1:SE1] takes:
 [ 2844.971931] ((ip-i_lock)-mr_lock){?.}, at: [811e7ef4] 
 xfs_ilock+0x84/0xb0
 [ 2844.971941] {RECLAIM_FS-ON-W} state was registered at:
 [ 2844.971942]   [8108137e] mark_held_locks+0x7e/0x130
 [ 2844.971947]   [81081a63] lockdep_trace_alloc+0x63/0xc0
 [ 2844.971949]   [810e9dd5] kmem_cache_alloc+0x35/0xe0
 [ 2844.971952]   [810dba31] vm_map_ram+0x271/0x770
 [ 2844.971955]   [811e10a6] _xfs_buf_map_pages+0x46/0xe0
 [ 2844.971959]   [811e1fba] xfs_buf_get_map+0x8a/0x130
 [ 2844.971961]   [81233849] xfs_trans_get_buf_map+0xa9/0xd0
 [ 2844.971964]   [8121e339] xfs_ifree_cluster+0x129/0x670
 [ 2844.971967]   [8121f959] xfs_ifree+0xe9/0xf0
 [ 2844.971969]   [811f4abf] xfs_inactive+0x2af/0x480
 [ 2844.971972]   [811efb90] xfs_fs_evict_inode+0x70/0x80
 [ 2844.971974]   [8110cb8f] evict+0xaf/0x1b0
 [ 2844.971977]   [8110cd95] iput+0x105/0x210
 [ 2844.971979]   [811070d0] dentry_iput+0xa0/0xe0
 [ 2844.971981]   [81108310] dput+0x150/0x280
 [ 2844.971983]   [811020fb] sys_renameat+0x21b/0x290
 [ 2844.971986]   [81102186] sys_rename+0x16/0x20
 [ 2844.971988]   [816b2292] system_call_fastpath+0x16/0x1b

We shouldn't be mapping pages there. See if the patch below fixes
it.

Fundamentally, though, the lockdep warning has come about because
vm_map_ram is doing a GFP_KERNEL allocation when we need it to be
doing GFP_NOFS - we are within a transaction here, so memory reclaim
is not allowed to recurse back into the filesystem.

mm-folk: can we please get this vmalloc/gfp_flags passing API
fixed once and for all? This is the fourth time in the last month or
so that I've seen XFS bug reports with silent hangs and associated
lockdep output that implicate GFP_KERNEL allocations from vm_map_ram
in GFP_NOFS conditions as the potential cause

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

xfs: don't vmap inode cluster buffers during free

From: Dave Chinner dchin...@redhat.com

Signed-off-by: Dave Chinner dchin...@redhat.com
---
 fs/xfs/xfs_inode.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c4add46..82f6e5d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1781,7 +1781,8 @@ xfs_ifree_cluster(
 * to mark all the active inodes on the buffer stale.
 */
bp = xfs_trans_get_buf(tp, mp-m_ddev_targp, blkno,
-   mp-m_bsize * blks_per_cluster, 0);
+   mp-m_bsize * blks_per_cluster,
+   XBF_UNMAPPED);
 
if (!bp)
return ENOMEM;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Hang in XFS reclaim on 3.7.0-rc3

2012-10-29 Thread Dave Chinner
[add the linux-mm cc I forgot to add before sending]

On Tue, Oct 30, 2012 at 09:26:13AM +1100, Dave Chinner wrote:
 On Mon, Oct 29, 2012 at 09:03:15PM +0100, Torsten Kaiser wrote:
  After experiencing a hang of all IO yesterday (
  http://marc.info/?l=linux-kernelm=135142236520624w=2 ), I turned on
  LOCKDEP after upgrading to -rc3.
  
  I then tried to replicate the load that hung yesterday and got the
  following lockdep report, implicating XFS instead of by stacking swap
  onto dm-crypt and md.
  
  [ 2844.971913]
  [ 2844.971920] =
  [ 2844.971921] [ INFO: inconsistent lock state ]
  [ 2844.971924] 3.7.0-rc3 #1 Not tainted
  [ 2844.971925] -
  [ 2844.971927] inconsistent {RECLAIM_FS-ON-W} - {IN-RECLAIM_FS-W} usage.
  [ 2844.971929] kswapd0/725 [HC0[0]:SC0[0]:HE1:SE1] takes:
  [ 2844.971931] ((ip-i_lock)-mr_lock){?.}, at: [811e7ef4] 
  xfs_ilock+0x84/0xb0
  [ 2844.971941] {RECLAIM_FS-ON-W} state was registered at:
  [ 2844.971942]   [8108137e] mark_held_locks+0x7e/0x130
  [ 2844.971947]   [81081a63] lockdep_trace_alloc+0x63/0xc0
  [ 2844.971949]   [810e9dd5] kmem_cache_alloc+0x35/0xe0
  [ 2844.971952]   [810dba31] vm_map_ram+0x271/0x770
  [ 2844.971955]   [811e10a6] _xfs_buf_map_pages+0x46/0xe0
  [ 2844.971959]   [811e1fba] xfs_buf_get_map+0x8a/0x130
  [ 2844.971961]   [81233849] xfs_trans_get_buf_map+0xa9/0xd0
  [ 2844.971964]   [8121e339] xfs_ifree_cluster+0x129/0x670
  [ 2844.971967]   [8121f959] xfs_ifree+0xe9/0xf0
  [ 2844.971969]   [811f4abf] xfs_inactive+0x2af/0x480
  [ 2844.971972]   [811efb90] xfs_fs_evict_inode+0x70/0x80
  [ 2844.971974]   [8110cb8f] evict+0xaf/0x1b0
  [ 2844.971977]   [8110cd95] iput+0x105/0x210
  [ 2844.971979]   [811070d0] dentry_iput+0xa0/0xe0
  [ 2844.971981]   [81108310] dput+0x150/0x280
  [ 2844.971983]   [811020fb] sys_renameat+0x21b/0x290
  [ 2844.971986]   [81102186] sys_rename+0x16/0x20
  [ 2844.971988]   [816b2292] system_call_fastpath+0x16/0x1b
 
 We shouldn't be mapping pages there. See if the patch below fixes
 it.
 
 Fundamentally, though, the lockdep warning has come about because
 vm_map_ram is doing a GFP_KERNEL allocation when we need it to be
 doing GFP_NOFS - we are within a transaction here, so memory reclaim
 is not allowed to recurse back into the filesystem.
 
 mm-folk: can we please get this vmalloc/gfp_flags passing API
 fixed once and for all? This is the fourth time in the last month or
 so that I've seen XFS bug reports with silent hangs and associated
 lockdep output that implicate GFP_KERNEL allocations from vm_map_ram
 in GFP_NOFS conditions as the potential cause
 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 da...@fromorbit.com
 
 xfs: don't vmap inode cluster buffers during free
 
 From: Dave Chinner dchin...@redhat.com
 
 Signed-off-by: Dave Chinner dchin...@redhat.com
 ---
  fs/xfs/xfs_inode.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
 index c4add46..82f6e5d 100644
 --- a/fs/xfs/xfs_inode.c
 +++ b/fs/xfs/xfs_inode.c
 @@ -1781,7 +1781,8 @@ xfs_ifree_cluster(
* to mark all the active inodes on the buffer stale.
*/
   bp = xfs_trans_get_buf(tp, mp-m_ddev_targp, blkno,
 - mp-m_bsize * blks_per_cluster, 0);
 + mp-m_bsize * blks_per_cluster,
 + XBF_UNMAPPED);
  
   if (!bp)
   return ENOMEM;
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 
 -- 
 This message has been scanned for viruses and
 dangerous content by MailScanner, and is
 believed to be clean.
 
 

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page

2012-10-30 Thread Dave Chinner
On Mon, Oct 29, 2012 at 06:11:58PM +, Luck, Tony wrote:
  What I would recommend is adding a 
 
  #define FS_CORRUPTED_FL 0x0100 /* File is corrupted */
 
  ... and which could be accessed and cleared via the lsattr and chattr
  programs.
 
 Good - but we need some space to save the corrupted range information
 too. These errors should be quite rare, so one range per file should be
 enough.
 
 New file systems should plan to add space in their on-disk format. The
 corruption isn't going to go away across a reboot.

No, not at all. if you want to store something in the filesystem
permanently, then use xattrs. You cannot rely on the filesystem
being able to store random application specific data in their
on-disk format. That's the *exact purpose* that xattrs were
invented for - they are an extensible, user-defined, per-file
metadata storage mechanism that is not tied to the filesystem
on-disk format.

The kernel already makes extensive use of xattrs for such metadata -
just look at all the security and integrity code that uses xattrs to
store their application-specific metadata.  Hence *anything* that
the kernel wants to store on permanent storage should be using
xattrs because then the application has complete control of what is
stored without caring about what filesystem it is storing it on.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Hang in XFS reclaim on 3.7.0-rc3

2012-11-01 Thread Dave Chinner
On Thu, Nov 01, 2012 at 04:30:10PM -0500, Ben Myers wrote:
 Hi Dave,
 
 On Tue, Oct 30, 2012 at 09:26:13AM +1100, Dave Chinner wrote:
  On Mon, Oct 29, 2012 at 09:03:15PM +0100, Torsten Kaiser wrote:
   After experiencing a hang of all IO yesterday (
   http://marc.info/?l=linux-kernelm=135142236520624w=2 ), I turned on
   LOCKDEP after upgrading to -rc3.
   
   I then tried to replicate the load that hung yesterday and got the
   following lockdep report, implicating XFS instead of by stacking swap
   onto dm-crypt and md.
   
   [ 2844.971913]
   [ 2844.971920] =
   [ 2844.971921] [ INFO: inconsistent lock state ]
   [ 2844.971924] 3.7.0-rc3 #1 Not tainted
   [ 2844.971925] -
   [ 2844.971927] inconsistent {RECLAIM_FS-ON-W} - {IN-RECLAIM_FS-W} usage.
   [ 2844.971929] kswapd0/725 [HC0[0]:SC0[0]:HE1:SE1] takes:
   [ 2844.971931] ((ip-i_lock)-mr_lock){?.}, at: 
   [811e7ef4] xfs_ilock+0x84/0xb0
   [ 2844.971941] {RECLAIM_FS-ON-W} state was registered at:
   [ 2844.971942]   [8108137e] mark_held_locks+0x7e/0x130
   [ 2844.971947]   [81081a63] lockdep_trace_alloc+0x63/0xc0
   [ 2844.971949]   [810e9dd5] kmem_cache_alloc+0x35/0xe0
   [ 2844.971952]   [810dba31] vm_map_ram+0x271/0x770
   [ 2844.971955]   [811e10a6] _xfs_buf_map_pages+0x46/0xe0
.
  We shouldn't be mapping pages there. See if the patch below fixes
  it.
  
  Fundamentally, though, the lockdep warning has come about because
  vm_map_ram is doing a GFP_KERNEL allocation when we need it to be
  doing GFP_NOFS - we are within a transaction here, so memory reclaim
  is not allowed to recurse back into the filesystem.
  
  mm-folk: can we please get this vmalloc/gfp_flags passing API
  fixed once and for all? This is the fourth time in the last month or
  so that I've seen XFS bug reports with silent hangs and associated
  lockdep output that implicate GFP_KERNEL allocations from vm_map_ram
  in GFP_NOFS conditions as the potential cause
  
  xfs: don't vmap inode cluster buffers during free
 
 Could you write up a little more background for the commit message?

Sure, that was just a test patch and often I don't bother putting a
detailed description in them until I know they fix the problem. My
current tree has:

xfs: don't vmap inode cluster buffers during free

Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c994 (xfs: make XBF_MAPPED the
default behaviour).

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-23 Thread Dave Chinner
On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
 On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
 
  On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
   This patchset optimizes several places which take the per inode spin lock.
   They have not been fully tested yet, thus they are marked as RFC. 
  
  Inodes are RCU freed. The i_lock spinlock on the i_state field forms
  part of the memory barrier that allows the RCU read side to
  correctly detect a freed inode during a RCU protected cache lookup
  (hash list traversal for the VFS, or a radix tree traversal for XFS).
  The i_lock usage around the hahs list operations ensures the hash
  list operations are atomic with state changes so that such changes
  are correctly detected during RCU-protected traversals...
  
  IOWs, removing the i_lock from around the i_state transitions and
  inode hash insert/remove/traversal operations will cause races in
  the RCU lookups and result in incorrectly using freed inodes instead
  of failing the lookup and creating a new one.
  
  So I don't think this is a good idea at all...
 
 
 Hello, Dave:
 
   Thanks for your explanation.
  
   Though I can't fully understand it, your concern seems to be that
 RCU inode lookup will be bothered by this change. But we do not have 
 RCU inode lookup in VFS: inode lookup is done by rather a tranditional
 way. 

Ah, I'd forgotten that neither of these RCU-based lookups ever got
merged:

https://lkml.org/lkml/2010/6/23/397
http://thread.gmane.org/gmane.linux.kernel/1056494

That, however, is the approach that the inode caches shoul dbe
moving towards - RCU lookups to reduce locking, not changing
i_lock/i_state atomicity that has been designed to facilitate RCU
safe lookups...

   XFS gives me the impression that it implements its own inode cache.
 There may be such thing there. I have little knowledge on XFS, but I
 guess it's unlikely impacted by the change of code implementing VFS 
 inode cache.

Yeah, I dropped the generic inode hash RCU conversion - the
SLAB_DESTROY_BY_RCU was proving to be rather complex, and I didn't
have any motiviation to see it through because I'd already converted
XFs to avoid the global inode_hash_lock and use RCU lookups on it's
internal inode cache...

   As far as I can see, RCU inode free is for RCU dentry lookup, which
 seems have nothing to do with 'detect a freed inode'.

If you know nothing of the history of this, then it might seem that
way

 Taking i_lock in these
 places looks like to me a result of following old lock scheme blindly when 
 breaking the big global inode lock.

The i_state/i_hash_list/i_lock relationship was created specifically
during the inode_lock breakup to allow us to guarantee that certain
fields of the inode are unchanging without needing to take multiple
nested locks:

$ gl -n 1 250df6e
commit 250df6ed274d767da844a5d9f05720b804240197
Author: Dave Chinner dchin...@redhat.com
Date:   Tue Mar 22 22:23:36 2011 +1100

fs: protect inode-i_state with inode-i_lock

Protect inode state transitions and validity checks with the
inode-i_lock. This enables us to make inode state transitions
independently of the inode_lock and is the first step to peeling
away the inode_lock from the code.

This requires that __iget() is done atomically with i_state checks
during list traversals so that we don't race with another thread
marking the inode I_FREEING between the state check and grabbing the
reference.

Also remove the unlock_new_inode() memory barrier optimisation
required to avoid taking the inode_lock when clearing I_NEW.
Simplify the code by simply taking the inode-i_lock around the
state change and wakeup. Because the wakeup is no longer tricky,
remove the wake_up_inode() function and open code the wakeup where
necessary.

Signed-off-by: Dave Chinner dchin...@redhat.com
Signed-off-by: Al Viro v...@zeniv.linux.org.uk

The inode hash lookup needs to check i_state atomically during the
traversal so inodes being freed are skipped (e.g. I_FREEING,
I_WILL_FREE). those i_state flags are set only with the i_lock held,
and so inode hash lookups need to take the i_lock to guarantee the
i_state field is correct. This inode data field synchronisation is
separate to the cache hash list traversal protection.

The only way to do this is to have an inner lock (inode-i_lock)
that protects both the inode-i_hash_list and inode-i_state fields,
and a lock order that provides outer list traversal protections
(inode_hash_lock). Whether the outer lock is the inode_hash_lock or
rcu_read_lock(), the lock order and the data fields the locks are
protecting are the same

 Of course, maybe they are there for something. Could you speak
 more about the race this change (patch 1,2?) brings up? Thank you.

When you drop the lock from the i_state initialisation, you end up
dropping the implicit unlock-lock memory barrier that the
inode-i_lock provides

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-24 Thread Dave Chinner
On Mon, Sep 24, 2012 at 02:12:05PM +0800, Guo Chao wrote:
 On Mon, Sep 24, 2012 at 02:23:43PM +1000, Dave Chinner wrote:
  On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
   On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
   
On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
 This patchset optimizes several places which take the per inode spin 
 lock.
 They have not been fully tested yet, thus they are marked as RFC. 

Inodes are RCU freed. The i_lock spinlock on the i_state field forms
part of the memory barrier that allows the RCU read side to
correctly detect a freed inode during a RCU protected cache lookup
(hash list traversal for the VFS, or a radix tree traversal for XFS).
The i_lock usage around the hahs list operations ensures the hash
list operations are atomic with state changes so that such changes
are correctly detected during RCU-protected traversals...

IOWs, removing the i_lock from around the i_state transitions and
inode hash insert/remove/traversal operations will cause races in
the RCU lookups and result in incorrectly using freed inodes instead
of failing the lookup and creating a new one.

So I don't think this is a good idea at all...
.
  The inode hash lookup needs to check i_state atomically during the
  traversal so inodes being freed are skipped (e.g. I_FREEING,
  I_WILL_FREE). those i_state flags are set only with the i_lock held,
  and so inode hash lookups need to take the i_lock to guarantee the
  i_state field is correct. This inode data field synchronisation is
  separate to the cache hash list traversal protection.
  
  The only way to do this is to have an inner lock (inode-i_lock)
  that protects both the inode-i_hash_list and inode-i_state fields,
  and a lock order that provides outer list traversal protections
  (inode_hash_lock). Whether the outer lock is the inode_hash_lock or
  rcu_read_lock(), the lock order and the data fields the locks are
  protecting are the same
  
   Of course, maybe they are there for something. Could you speak
   more about the race this change (patch 1,2?) brings up? Thank you.
  
  When you drop the lock from the i_state initialisation, you end up
  dropping the implicit unlock-lock memory barrier that the
  inode-i_lock provides. i.e. you get this in iget_locked():
  
  
  thread 1thread 2
  
  lock(inode_hash_lock)
  for_each_hash_item()
  
  inode-i_state = I_NEW
  hash_list_insert
  
  finds newly inserted inode
  lock(inode-i_lock)
  unlock(inode-i_lock)
  unlock(inode_hash_lock)
  
  wait_on_inode()
  see inode-i_state = 0 
  uses inode before initialisation
   is complete
  
  IOWs, there is no unlock-lock transition occurring on any lock, so
  there are no implicit memory barriers in this code, and so other
  CPUs are not guaranteed to see the inode-i_state = I_NEW write
  that thread 2 did. The lock/unlock pair around this I_NEW assignment
  guarantees that thread 1 will see the change to i_state correctly.
  
  So even without RCU, dropping the i_lock from these
  i_state/hash insert/remove operations will result in races
  occurring...

 This interleave can never happen because of inode_hash_lock.

Ah, sorry, I'm context switching too much right now.

s/lock(inode_hash_lock)/rcu_read_lock.

And that's the race condition the the locking order is *intended* to
avoid. It's just that we haven't done the last piece of the work,
which is replacing the read side inode_hash_lock usage with
rcu_read_lock.

  Seriously, if you want to improve the locking of this code, go back
  an resurrect the basic RCU hash traversal patches (i.e. Nick's
  original patch rather than my SLAB_DESTROY_BY_RCU based ones). That
  has much more benefit to many more workloads than just removing a
  non-global, uncontended locks like this patch set does.
  
 
 Ah, this is intended to be a code clean patchset actually. I thought these
 locks are redundant in an obvious and trivial manner. If, on the contrary, 
 they are such tricky, then never mind :) Thanks for your patient.

The RCU conversion is actually trivial - everything is already set
up for it to be done, and is simpler than this patch set. It pretty
much is simply replacing all the read side inode_hash_lock pairs
with rcu_read_lock()/rcu_read_unlock() pairs. Like I said, if you
want to clean up this code, then RCU traversals are the conversion
to make.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-24 Thread Dave Chinner
On Mon, Sep 24, 2012 at 03:08:52PM +0800, Guo Chao wrote:
 On Mon, Sep 24, 2012 at 04:28:12PM +1000, Dave Chinner wrote:
   Ah, this is intended to be a code clean patchset actually. I thought these
   locks are redundant in an obvious and trivial manner. If, on the 
   contrary, 
   they are such tricky, then never mind :) Thanks for your patient.
  
  The RCU conversion is actually trivial - everything is already set
  up for it to be done, and is simpler than this patch set. It pretty
  much is simply replacing all the read side inode_hash_lock pairs
  with rcu_read_lock()/rcu_read_unlock() pairs. Like I said, if you
  want to clean up this code, then RCU traversals are the conversion
  to make.
 
 
 Thanks for your suggestion. Though I doubt it's such trivial, I will try this
 after a little investigation.

Probably best to start with the patch below - it's run under heavy
concurrent load on ext4 with a working set of inodes about 20x
larger than can fit in memory for the past hour

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

fs: Use RCU lookups for inode cache

From: Dave Chinner dchin...@redhat.com

Convert inode cache lookups to be protected by RCU locking rather
than the global inode_hash_lock. This will improve scalability of
inode lookup intensive workloads.

Smoke tested w/ ext4 on concurrent fsmark/lookup/unlink workloads
over 50 million or so inodes...

Signed-off-by: Dave Chinner dchin...@redhat.com
---
 fs/inode.c |   74 ++--
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ac8d904..2e92674 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -464,7 +464,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long 
hashval)
 
spin_lock(inode_hash_lock);
spin_lock(inode-i_lock);
-   hlist_add_head(inode-i_hash, b);
+   hlist_add_head_rcu(inode-i_hash, b);
spin_unlock(inode-i_lock);
spin_unlock(inode_hash_lock);
 }
@@ -480,7 +480,7 @@ void __remove_inode_hash(struct inode *inode)
 {
spin_lock(inode_hash_lock);
spin_lock(inode-i_lock);
-   hlist_del_init(inode-i_hash);
+   hlist_del_init_rcu(inode-i_hash);
spin_unlock(inode-i_lock);
spin_unlock(inode_hash_lock);
 }
@@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
 static struct inode *find_inode(struct super_block *sb,
struct hlist_head *head,
int (*test)(struct inode *, void *),
-   void *data)
+   void *data, bool locked)
 {
struct hlist_node *node;
struct inode *inode = NULL;
 
 repeat:
-   hlist_for_each_entry(inode, node, head, i_hash) {
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(inode, node, head, i_hash) {
spin_lock(inode-i_lock);
+   if (inode_unhashed(inode)) {
+   spin_unlock(inode-i_lock);
+   continue;
+   }
if (inode-i_sb != sb) {
spin_unlock(inode-i_lock);
continue;
@@ -800,13 +805,20 @@ repeat:
continue;
}
if (inode-i_state  (I_FREEING|I_WILL_FREE)) {
+   rcu_read_unlock();
+   if (locked)
+   spin_unlock(inode_hash_lock);
__wait_on_freeing_inode(inode);
+   if (locked)
+   spin_lock(inode_hash_lock);
goto repeat;
}
__iget(inode);
spin_unlock(inode-i_lock);
+   rcu_read_unlock();
return inode;
}
+   rcu_read_unlock();
return NULL;
 }
 
@@ -815,14 +827,20 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-   struct hlist_head *head, unsigned long ino)
+struct hlist_head *head,
+unsigned long ino, bool locked)
 {
struct hlist_node *node;
struct inode *inode = NULL;
 
 repeat:
-   hlist_for_each_entry(inode, node, head, i_hash) {
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(inode, node, head, i_hash) {
spin_lock(inode-i_lock);
+   if (inode_unhashed(inode)) {
+   spin_unlock(inode-i_lock);
+   continue;
+   }
if (inode-i_ino != ino) {
spin_unlock(inode-i_lock);
continue;
@@ -832,13 +850,20 @@ repeat:
continue;
}
if (inode-i_state  (I_FREEING|I_WILL_FREE)) {
+   rcu_read_unlock

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-24 Thread Dave Chinner
, the questions I'd be seeking to
answer before proposing it for inclusion are as follows

1. what's the comparison in performance to typical NFS
server writeback parameter tuning? i.e. dirty_background_ratio=5,
dirty_ratio=10, dirty_expire_centiseconds=1000,
dirty_writeback_centisecs=1? i.e. does this give change give any
benefit over the current common practice for configuring NFS
servers?

2. what happens when you have 10 clients all writing to the server
at once? Or a 100? NFS servers rarely have a single writer to a
single file at a time, so what impact does this change have on
multiple concurrent file write performance from multiple clients?

3. Following on from the multiple client test, what difference does it
make to file fragmentation rates? Writing more frequently means
smaller allocations and writes, and that tends to lead to higher
fragmentation rates, especially when multiple files are being
written concurrently. Higher fragmentation also means lower
performance over time as fragmentation accelerates filesystem aging
effects on performance.  IOWs, it may be faster when new, but it
will be slower 3 months down the track and that's a bad tradeoff to
make.

4. What happens for higher bandwidth network links? e.g. gigE or
10gigE? Are the improvements still there? Or does it cause
regressions at higher speeds? I'm especially interested in what
happens to multiple writers at higher network speeds, because that's
a key performance metric used to measure enterprise level NFS
servers.

5. Are the improvements consistent across different filesystem
types?  We've had writeback changes in the past cause improvements
on one filesystem but significant regressions on others.  I'd
suggest that you need to present results for ext4, XFS and btrfs so
that we have a decent idea of what we can expect from the change to
the generic code.

Yeah, I'm asking a lot of questions. That's because the generic
writeback code is extremely important to performance and the impact
of a change cannot be evaluated from a single test.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 01/10] vfs: introduce private rb structures

2012-09-25 Thread Dave Chinner
 last_write_time;
 + u32 nr_reads;
 + u32 nr_writes;
 + u64 avg_delta_reads;
 + u64 avg_delta_writes;
 + u8 flags;
 + u32 last_temperature;

may as well make the flags a u32 - the compiler will ues that much
space anyway as it aligned the u32 last_temperature variable after
it.

 +};
 +
 +/* An item representing an inode and its access frequency */
 +struct hot_inode_item {
 + /* node for hot_inode_tree rb_tree */
 + struct rb_node rb_node;
 + /* tree of ranges in this inode */
 + struct hot_range_tree hot_range_tree;
 + /* frequency data for this inode */
 + struct hot_freq_data hot_freq_data;
 + /* inode number, copied from inode */
 + unsigned long i_ino;
 + /* used to check for errors in ref counting */
 + u8 in_tree;
 + /* protects hot_freq_data, i_no, in_tree */
 + spinlock_t lock;
 + /* prevents kfree */
 + struct kref refs;

It's hard to see the code in the commentsi, and some of comments are
redundant.. It's easier to read if you do this:

struct hot_inode_item {
struct rb_node node;/* hot_inode_tree index */
struct hot_range_tree hot_range_tree;   /* tree of ranges */
struct hot_freq_data hot_freq_data; /* frequency data */
unsigned long i_ino;/* inode number from inode */
u8 in_tree; /* ref counting check */
spinlock_t lock;/* protects object data */
struct kref refs;   /* prevents kfree */
}

Also: 
- i_ino really needs to be a 64 bit quantity as some
  filesystems can use 64 bit inode numbers even on 32
  bit systems (e.g. XFS).
- in_tree can be u32 or a flags field if it is boolean. if
  it is just debug, then maybe it can be removed whenteh
  code is ready for commit.

 +};
 +
 +/*
 + * An item representing a range inside of an inode whose frequency
 + * is being tracked
 + */
 +struct hot_range_item {
 + /* node for hot_range_tree rb_tree */
 + struct rb_node rb_node;
 + /* frequency data for this range */
 + struct hot_freq_data hot_freq_data;
 + /* the hot_inode_item associated with this hot_range_item */
 + struct hot_inode_item *hot_inode;
 + /* starting offset of this range */
 + u64 start;
 + /* length of this range */
 + u64 len;

What units?
u64 start;  /* start offset in bytes */
u64 len /* length in bytes */

 + /* used to check for errors in ref counting */
 + u8 in_tree;
 + /* protects hot_freq_data, start, len, and in_tree */
 + spinlock_t lock;
 + /* prevents kfree */
 + struct kref refs;
 +};
 +
 +struct hot_info {
 + /* red-black tree that keeps track of fs-wide hot data */
 + struct hot_inode_tree hot_inode_tree;
 +};

The comment is redundant...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 02/10] vfs: add support for updating access frequency

2012-09-25 Thread Dave Chinner
  RANGE_SIZE_MASK;
 + hr-len = RANGE_SIZE;
 + hr-hot_inode = he;
 + hot_rb_add_hot_range_item(hrtree, hr);
 + write_unlock(hrtree-lock);
 + }
 +
 + spin_lock(hr-lock);
 + hot_rb_update_freq(hr-hot_freq_data, rw);
 + spin_unlock(hr-lock);
 + hot_rb_free_hot_range_item(hr);
 + }

Same comments again about locking.

I note that the code will always insert range items of a length
RANGE_SIZE. This means you have a fixed object granularity and hence
you have no need for a range based search. That is, you could use a
radix tree where each entry in the radix tree points directly to the
range object similar to how the page cache uses a radix tree for
indexing pages. That brings the possibility of lockless range item
lookups


 +
 +out:
 + return ret;
 +}
 +
 +/* main function to update access frequency from read/writepage(s) hooks */
 +void hot_rb_update_freqs(struct inode *inode, u64 start,
 + u64 len, int rw)
 +{
 + struct hot_inode_item *he;
 +
 + he = hot_rb_update_inode_freq(inode, rw);
 +
 + WARN_ON(!he);
 +
 + if (he) {
 + hot_rb_update_range_freq(he, start, len,
 + rw, (inode-i_sb-s_hotinfo));
 +
 + hot_rb_free_hot_inode_item(he);
 + }

This is very assymetric. it would be much better to collapse all
the abstraction down to something much simpler, say:


int hot_inode_update_freqs()
{
he = hot_inode_item_find(tree, inum, null)
if (!he) {
new_he = allocate()
if (!new_he)
return -ENOMEM;

init new_he
he = hot_inode_item_find(tree, inum, new_he)
if (he != new_he)
free new_he
}
hot_inode_freq_update(he-hot_freq_data, write)

for (cur = start_off; cur = end_off; cur += RANGE_SIZE) {
hr = hot_range_item_find(tree, cur, NULL)
if (!hr) {
new_hr = allocate()
if (!new_hr)
return -ENOMEM;

init new_hr
hr = hot_inode_item_find(tree, cur, new_hr)
if (hr != new_hr)
free new_hr
}
hot_inode_freq_update(hr-hot_freq_data, write)
hot_range_item_put(hr);
{
hot_inode_item_put(he);
}






 +}
 +
 +/*
   * Initialize kmem cache for hot_inode_item
   * and hot_range_item
   */
 diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
 index 269b67a..2ba29e4 100644
 --- a/fs/hot_tracking.h
 +++ b/fs/hot_tracking.h
 @@ -21,6 +21,21 @@
  #define FREQ_DATA_TYPE_INODE (1  0)
  /* freq data struct is for a range */
  #define FREQ_DATA_TYPE_RANGE (1  1)
 +/* size of sub-file ranges */
 +#define RANGE_SIZE (1  20)
 +#define RANGE_SIZE_MASK (~((u64)(RANGE_SIZE - 1)))

You might want to state what units these ranges are in, and that
there is one tracking object per range per inode

 +#define FREQ_POWER 4
 +
 +struct hot_info;
 +struct inode;
 +
 +struct hot_inode_item

You've already included include/linux/hot_tracking.h, so you
shouldn't need these forward declarations...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'

2012-09-25 Thread Dave Chinner
On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Introduce one new mount option '-o hottrack',
 and add its parsing support.
   Its usage looks like:
mount -o hottrack
mount -o nouser,hottrack
mount -o nouser,hottrack,loop
mount -o hottrack,nouser

I think that this option parsing should be done by the filesystem,
even though the tracking functionality is in the VFS. That way ony
the filesystems that can use the tracking information will turn it
on, rather than being able to turn it on for everything regardless
of whether it is useful or not.

Along those lines, just using a normal superblock flag to indicate
it is active (e.g. MS_HOT_INODE_TRACKING in sb-s_flags) means you
don't need to allocate the sb-s_hot_info structure just to be able
to check whether we are tracking hot inodes or not.

This then means the hot inode tracking for the superblock can be
initialised by the filesystem as part of it's fill_super method,
along with the filesystem specific code that will use the hot
tracking information the VFS gathers

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-25 Thread Dave Chinner
On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
 On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
  @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode 
  *inode);
   static struct inode *find_inode(struct super_block *sb,
  struct hlist_head *head,
  int (*test)(struct inode *, void *),
  -   void *data)
  +   void *data, bool locked)
   {
  struct hlist_node *node;
  struct inode *inode = NULL;
  
   repeat:
  -   hlist_for_each_entry(inode, node, head, i_hash) {
  +   rcu_read_lock();
  +   hlist_for_each_entry_rcu(inode, node, head, i_hash) {
  spin_lock(inode-i_lock);
  +   if (inode_unhashed(inode)) {
  +   spin_unlock(inode-i_lock);
  +   continue;
  +   }
 
 Is this check too early? If the unhashed inode happened to be the target
 inode, we are wasting our time to continue the traversal and we do not wait 
 on it.

If the inode is unhashed, then it is already passing through evict()
or has already passed through. If it has already passed through
evict() then it is too late to call __wait_on_freeing_inode() as the
wakeup occurs in evict() immediately after the inode is removed
from the hash. i.e:

remove_inode_hash(inode);

spin_lock(inode-i_lock);
wake_up_bit(inode-i_state, __I_NEW);
BUG_ON(inode-i_state != (I_FREEING | I_CLEAR));
spin_unlock(inode-i_lock);

i.e. if we get the case:

Thread 1, RCU hash traversalThread 2, evicting foo

rcu_read_lock()
found inode foo
remove_inode_hash(inode);
spin_lock(foo-i_lock);
wake_up(I_NEW)
spin_unlock(foo-i_lock);
destroy_inode()
..
spin_lock(foo-i_lock)
match sb, ino
I_FREEING
  rcu_read_unlock()

rcu grace period can expire at any time now,
 so use after free is guaranteed at some point

  wait_on_freeing_inode
wait_on_bit(I_NEW)

will never get woken

Hence if the inode is unhashed, it doesn't matter what inode it is,
it is never valid to use it any further because it may have already
been freed and the only reason we can safely access here it is that
the RCU grace period will not expire until we call
rcu_read_unlock().

  @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, 
  unsigned long ino)
  struct inode *old;
  
  spin_lock(inode_hash_lock);
  -   /* We released the lock, so.. */
  -   old = find_inode_fast(sb, head, ino);
  +   old = find_inode_fast(sb, head, ino, true);
  if (!old) {
  inode-i_ino = ino;
  spin_lock(inode-i_lock);
 
 E ... couldn't we use memory barrier API instead of irrelevant spin
 lock on newly allocated inode to publish I_NEW?

Yes, we could.

However, having multiple synchronisation methods for a single
variable that should only be used in certain circumstances is
something that is easy to misunderstand and get wrong. Memory
barriers are much more subtle and harder to understand than spin
locks, and every memory barrier needs to be commented to explain
what the barrier is actually protecting against.

In the case where a spin lock is guaranteed to be uncontended and
the cache line hot in the CPU cache, it makes no sense to replace
the spin lock with a memory barrier, especially when every other
place we modify the i_state/i_hash fields we have to wrap them
with i_lock

Simple code is good code - save the complexity for something that
needs it.

 I go through many mails of the last trend of scaling VFS. Many patches
 seem quite natural, say RCU inode lookup

Sure, but the implementation in those RCU lookup patches sucked.

 or per-bucket inode hash lock or 

It was a bad idea. At minimum, you can't use lockdep on it. Worse
for the realtime guys is the fact it can't be converted to a
sleeping lock. Worst was the refusal to change it in any way to
address concerns.

And realistically, the fundamental problem is not with the
inode_hash_lock, it's with the fact that the cache is based on a
hash table rather than a more scalable structure like a radix tree
or btree. This is a primary reason for XFS having it's own inode
cache - hashes can only hold a certain number of entries before
performance collapses catastrophically and so don't scale well to
tens or hundreds of millions of entries.

 per-superblock inode list lock,

Because it isn't a particularly hot lock, and given that
most workloads hit on a single filesystem, scalability is not
improved by making this change. As such, as long as there is a
single linked list used to iterate all inodes

Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'

2012-09-26 Thread Dave Chinner
On Wed, Sep 26, 2012 at 10:56:08AM +0800, Zhi Yong Wu wrote:
 On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Introduce one new mount option '-o hottrack',
  and add its parsing support.
Its usage looks like:
 mount -o hottrack
 mount -o nouser,hottrack
 mount -o nouser,hottrack,loop
 mount -o hottrack,nouser
 
  I think that this option parsing should be done by the filesystem,
  even though the tracking functionality is in the VFS. That way ony
  the filesystems that can use the tracking information will turn it
  on, rather than being able to turn it on for everything regardless
  of whether it is useful or not.
 
  Along those lines, just using a normal superblock flag to indicate
  it is active (e.g. MS_HOT_INODE_TRACKING in sb-s_flags) means you
  don't need to allocate the sb-s_hot_info structure just to be able
  to check whether we are tracking hot inodes or not.
 
  This then means the hot inode tracking for the superblock can be
  initialised by the filesystem as part of it's fill_super method,
  along with the filesystem specific code that will use the hot
  tracking information the VFS gathers
 I can see what you mean, but don't know if other guys also agree with this.
 If so, all FS specific code which use hot tracking feature wll have to add
 the same chunk of code in it fill_super method. Is it good?

Most filesystems will only need to add 3-4 lines of code to their
existing parser, so it's not a big deal I think...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 02/10] vfs: add support for updating access frequency

2012-09-26 Thread Dave Chinner
On Wed, Sep 26, 2012 at 10:53:07AM +0800, Zhi Yong Wu wrote:
 On Tue, Sep 25, 2012 at 5:17 PM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.ker...@gmail.com wrote:
  I note that the code will always insert range items of a length
  RANGE_SIZE. This means you have a fixed object granularity and hence
  you have no need for a range based search. That is, you could use a
  radix tree where each entry in the radix tree points directly to the
  range object similar to how the page cache uses a radix tree for
  indexing pages. That brings the possibility of lockless range item
  lookups
 Great suggestion, but can we temporarily put it in TODO list? because
 it will bring one big code change.

Sure. I just wanted to point out that there are better choices for
indexing fixed size elements than rb-trees and why it might make
sense to use a different type of tree.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 04/10] vfs: add init and exit support

2012-09-26 Thread Dave Chinner
On Sun, Sep 23, 2012 at 08:56:29PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Add initialization function to create some
 key data structures when hot tracking is enabled;
 Clean up them when hot tracking is disabled
 
 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
 ---
  fs/hot_tracking.c |   60 
 +
  fs/hot_tracking.h |2 +
  fs/namespace.c|4 +++
  fs/super.c|6 +
  4 files changed, 72 insertions(+), 0 deletions(-)
 
 diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
 index f97e8a6..fa89f70 100644
 --- a/fs/hot_tracking.c
 +++ b/fs/hot_tracking.c
 @@ -135,6 +135,51 @@ static void hot_rb_free_hot_range_item(struct 
 hot_range_item *hr)
   }
  }
  
 +static int hot_rb_remove_hot_inode_item(struct hot_inode_tree *tree,
 +struct hot_inode_item *he)

hot_inode_item_remove()

 +{
 +int ret = 0;
 +rb_erase(he-rb_node, tree-map);
 +he-in_tree = 0;
 +return ret;
 +}
 +
 +static int hot_rb_remove_hot_range_item(struct hot_range_tree *tree,
 +struct hot_range_item *hr)

hot_range_item_remove()

(repeat for other function names ;)

 +{
 +int ret = 0;
 +rb_erase(hr-rb_node, tree-map);
 +hr-in_tree = 0;
 +return ret;
 +}

these can probably be void functions are they don't use the return
value at all.

 +
 +/* Frees the entire hot_inode_tree. */
 +static void hot_rb_inode_tree_free(struct hot_info *root)
 +{
 + struct rb_node *node, *node2;
 + struct hot_inode_item *he;
 + struct hot_range_item *hr;
 +
 + /* Free hot inode and range trees on fs root */
 + node = rb_first(root-hot_inode_tree.map);
 +
 + while (node) {

while ((node = rb_first(root-hot_inode_tree.map))) {

 + he = rb_entry(node, struct hot_inode_item, rb_node);
 +
 + node2 = rb_first(he-hot_range_tree.map);
 + while (node2) {

while ((node2 = rb_first(he-hot_range_tree.map))) {

.
  
 + if (sb-s_hotinfo.mount_opt  HOT_MOUNT_HOT_TRACK)
 + hot_track_exit(sb);
 +

Let the filesystems call in .put_super()

   down_write(namespace_sem);
   br_write_lock(vfsmount_lock);
   event++;
 diff --git a/fs/super.c b/fs/super.c
 index 7eb3b0c..0999d5c 100644
 --- a/fs/super.c
 +++ b/fs/super.c
 @@ -1153,6 +1153,9 @@ mount_fs(struct file_system_type *type, int flags, 
 const char *name, void *data)
   WARN_ON(sb-s_bdi == default_backing_dev_info);
   sb-s_flags |= MS_BORN;
  
 + if (hottrack)
 + hot_track_init(sb, name);

And call this in .fill_super() after parsing the hottrack argument.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 05/10] vfs: introduce one hash table

2012-09-26 Thread Dave Chinner
On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Adds a hash table structure which contains
 a lot of hash list and is used to efficiently
 look up the data temperature of a file or its
 ranges.
   In each hash list of hash table, the hash node
 will keep track of temperature info.

So, let me see if I've got the relationship straight:

- sb-s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode

- hot_inode_item contains access frequency data for that inode

- hot_inode_item holds a heat hash node to index the access
  frequency data for that inode

- hot_inode_item.hot_range_tree indexes hot_range_items for that inode

- hot_range_item contains access frequency data for that range

- hot_range_item holds a heat hash node to index the access
  frequency data for that range

- sb-s_hot_info.heat_inode_hl indexes per-inode heat hash nodes

- sb-s_hot_info.heat_range_hl indexes per-range heat hash nodes

How about some ascii art? :) Just looking at the hot inode item case
(the range item case is the same pattern, though), we have:


heat_inode_hl   hot_inode_tree
| |
| V
|   +---hot_inode_item---+
+---+   |   frequency data   |
|   V^   V
| ...--hot_inode_item--... |...--hot_inode_item--
|   frequency data   |  frequency data
|   ^|   ^
|   ||   |
|   ||   |
+--hot_hash_node--hot_hash_node--hot_hash_node--


There's no actual data stored in the hot_hash_node, just pointer
back to the frequency data, a hlist_node and a pointer to the
hashlist head. IOWs, I agree with Ram that this does not need to
exist and just embedding a hlist_node inside the hot_inode_item is
all that is needed. i.e:

heat_inode_hl   hot_inode_tree
| |
| V
|   +---hot_inode_item---+
|   |   frequency data   |
+---+   |   hlist_node   |
|   V^ | V
| ...--hot_inode_item--... | |  ...--hot_inode_item--
|   frequency data   | |frequency data
+--hlist_node---+ +---hlist_node---.

There's no need for separate allocations, initialisations, locks and
reference counting - all that is already in the hot_inode_item. The
items have the same lifecycle limitations - a hot_hash_node must be
torn down before the frequency data it points to is freed. Finally,
there's no difference in how you move it between lists.

Indeed, calling it a hash is wrong - there's not hashing at all
- it keeping an array of list where each entry corresponds to a
specific temperature. It is a *heat map*, not a hash list. i.e.
inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE.

As it is, there aren't any users of the heat maps that are generated
in this patch set - it's not even exported to userspace or to
debugfs, so I'm not sure how it will be used yet. How are these heat
maps going to be used by filesystems, Zhi?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 06/10] vfs: enable hot data tracking

2012-09-26 Thread Dave Chinner
On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Miscellaneous features that implement hot data tracking
 and generally make the hot data functions a bit more friendly.
 
 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
 ---
  fs/direct-io.c   |   10 ++
  include/linux/hot_tracking.h |   11 +++
  mm/filemap.c |8 
  mm/page-writeback.c  |   21 +
  mm/readahead.c   |9 +
  5 files changed, 59 insertions(+), 0 deletions(-)
 
 diff --git a/fs/direct-io.c b/fs/direct-io.c
 index f86c720..3773f44 100644
 --- a/fs/direct-io.c
 +++ b/fs/direct-io.c
 @@ -37,6 +37,7 @@
  #include linux/uio.h
  #include linux/atomic.h
  #include linux/prefetch.h
 +#include hot_tracking.h
  
  /*
   * How many user pages to map in one call to get_user_pages().  This 
 determines
 @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, 
 struct inode *inode,
   prefetch(bdev-bd_queue);
   prefetch((char *)bdev-bd_queue + SMP_CACHE_BYTES);
  
 + /* Hot data tracking */
 + if (TRACK_THIS_INODE(iocb-ki_filp-f_mapping-host)
 +  iov_length(iov, nr_segs)  0) {
 + hot_rb_update_freqs(iocb-ki_filp-f_mapping-host,
 + (u64)offset,
 + (u64)iov_length(iov, nr_segs),
 + rw  WRITE);
 + }

That's a bit messy. I'd prefer a static inline function that hides
all this. e.g.

track_hot_inode_ranges(inode, offset, length, rw)
{
if (inode-i_sb-s_flags  MS_HOT_TRACKING)
hot_inode_freq_update(inode, offset, length, rw);
}

 diff --git a/mm/page-writeback.c b/mm/page-writeback.c
 index 5ad5ce2..552c861 100644
 --- a/mm/page-writeback.c
 +++ b/mm/page-writeback.c
 @@ -35,6 +35,7 @@
  #include linux/buffer_head.h /* __set_page_dirty_buffers */
  #include linux/pagevec.h
  #include linux/timer.h
 +#include linux/hot_tracking.h
  #include trace/events/writeback.h
  
  /*
 @@ -1895,13 +1896,33 @@ EXPORT_SYMBOL(generic_writepages);
  int do_writepages(struct address_space *mapping, struct writeback_control 
 *wbc)
  {
   int ret;
 + pgoff_t start = 0;
 + u64 prev_count = 0, count = 0;
  
   if (wbc-nr_to_write = 0)
   return 0;
 +
 + /* Hot data tracking */
 + if (TRACK_THIS_INODE(mapping-host)
 +  wbc-range_cyclic) {
 + start = mapping-writeback_index  PAGE_CACHE_SHIFT;
 + prev_count = (u64)wbc-nr_to_write;
 + }

Why only wbc-range_cyclic? This won't record things like
synchronous writes or fsync-triggered writes, are are far more
likely to be to hot ranges in a file...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature

2012-09-26 Thread Dave Chinner
On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Fork and run one kernel kthread to calculate
 that temperature based on some metrics kept
 in custom frequency data structs, and store
 the info in the hash table.

No new kthreads, please. Use a per-superblock workqueue and a struct
delayed_work to run periodic work on each superblock.

That will also remove all the nasty, nasty
!hot_track_temperature_update_kthread checks from the code, too.

Also, I'd separate the work that the workqueue does from the patch
that introduces the work queue. That way there is only one new thing
to comment on in the patch. Further, I'd separate the aging code
from the code that updates the temperature map into it's own patch
as well..

Finally, you're going to need a shrinker to control the amount of
memory that is used in tracking hot regions - if we are throwing
inodes out of memory due to memory pressure, we most definitely are
going to need to reduce the amount of memory the tracking code is
using, even if it means losing useful information (i.e. the shrinker
accelerates the aging process).

Given the above, and the other comments earlier in the series,
there's not a lot of point in me spending time commenting on ethe
code in detail here as it will change significantly as a result of
all the earlier comments

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 05/10] vfs: introduce one hash table

2012-09-27 Thread Dave Chinner
On Thu, Sep 27, 2012 at 02:23:16PM +0800, Zhi Yong Wu wrote:
 On Thu, Sep 27, 2012 at 11:43 AM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Adds a hash table structure which contains
  a lot of hash list and is used to efficiently
  look up the data temperature of a file or its
  ranges.
In each hash list of hash table, the hash node
  will keep track of temperature info.
 
  So, let me see if I've got the relationship straight:
 
  - sb-s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode
 
  - hot_inode_item contains access frequency data for that inode
 
  - hot_inode_item holds a heat hash node to index the access
frequency data for that inode
 
  - hot_inode_item.hot_range_tree indexes hot_range_items for that inode
 
  - hot_range_item contains access frequency data for that range
 
  - hot_range_item holds a heat hash node to index the access
frequency data for that range
 
  - sb-s_hot_info.heat_inode_hl indexes per-inode heat hash nodes
 
  - sb-s_hot_info.heat_range_hl indexes per-range heat hash nodes
 Correct.
 
  How about some ascii art? :) Just looking at the hot inode item case
  (the range item case is the same pattern, though), we have:
 
 
  heat_inode_hl   hot_inode_tree
  | |
  | V
  |   +---hot_inode_item---+
  +---+   |   frequency data   |
  |   V^   V
  | ...--hot_inode_item--... |...--hot_inode_item--
  |   frequency data   |  frequency data
  |   ^|   ^
  |   ||   |
  |   ||   |
  +--hot_hash_node--hot_hash_node--hot_hash_node--
 Great, can we put them in hot_tracking.txt in Documentation?
 
 
  There's no actual data stored in the hot_hash_node, just pointer
  back to the frequency data, a hlist_node and a pointer to the
  hashlist head. IOWs, I agree with Ram that this does not need to
  exist and just embedding a hlist_node inside the hot_inode_item is
  all that is needed. i.e:
 
  heat_inode_hl   hot_inode_tree
  | |
  | V
  |   +---hot_inode_item---+
  |   |   frequency data   |
  +---+   |   hlist_node   |
  |   V^ | V
  | ...--hot_inode_item--... | |  ...--hot_inode_item--
  |   frequency data   | |frequency data
  +--hlist_node---+ +---hlist_node---.
 
  There's no need for separate allocations, initialisations, locks and
  reference counting - all that is already in the hot_inode_item. The
  items have the same lifecycle limitations - a hot_hash_node must be
  torn down before the frequency data it points to is freed. Finally,
  there's no difference in how you move it between lists.
 How will you know if one hot_inode_item should be moved between lists
 when its freq data is changed?

Record the current temperature in the frequency data, and if it
changes, change the list it is on.

  Indeed, calling it a hash is wrong - there's not hashing at all
  - it keeping an array of list where each entry corresponds to a
  specific temperature. It is a *heat map*, not a hash list. i.e.
  inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE.
 OK.
 
  As it is, there aren't any users of the heat maps that are generated
  in this patch set - it's not even exported to userspace or to
  debugfs, so I'm not sure how it will be used yet. How are these heat
  maps going to be used by filesystems, Zhi?
 In hot_hash_calc_temperature(), you can see that one hot_inode or
 hot_range's freq data will be distilled into one temperature value,
 then it will be inserted to the heat map based on its temperature.
 When the file corresponding to the inode or range got hotter or cold,
 its location will be changed in the heat map based on its new
 temperature in hot_hash_update_hash_table().

Yes, but a hot_inode_item or hot_range_item can only have one
location in the heat map, right? So it doesn't need external
structure to point to the frequency data to track this

 And the user will retrieve those freq data and temperature info via
 debugfs or ioctl interfaces.

Right - but that data is only extracted after an initial
hot_inode_tree lookup - The heat map itself is never directly used
for lookups. If it's not used for lookups based on temperature, why
is it needed?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 06/10] vfs: enable hot data tracking

2012-09-27 Thread Dave Chinner
On Thu, Sep 27, 2012 at 02:28:12PM +0800, Zhi Yong Wu wrote:
 On Thu, Sep 27, 2012 at 11:54 AM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Miscellaneous features that implement hot data tracking
  and generally make the hot data functions a bit more friendly.
 
  Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
  ---
   fs/direct-io.c   |   10 ++
   include/linux/hot_tracking.h |   11 +++
   mm/filemap.c |8 
   mm/page-writeback.c  |   21 +
   mm/readahead.c   |9 +
   5 files changed, 59 insertions(+), 0 deletions(-)
 
  diff --git a/fs/direct-io.c b/fs/direct-io.c
  index f86c720..3773f44 100644
  --- a/fs/direct-io.c
  +++ b/fs/direct-io.c
  @@ -37,6 +37,7 @@
   #include linux/uio.h
   #include linux/atomic.h
   #include linux/prefetch.h
  +#include hot_tracking.h
 
   /*
* How many user pages to map in one call to get_user_pages().  This 
  determines
  @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, 
  struct inode *inode,
prefetch(bdev-bd_queue);
prefetch((char *)bdev-bd_queue + SMP_CACHE_BYTES);
 
  + /* Hot data tracking */
  + if (TRACK_THIS_INODE(iocb-ki_filp-f_mapping-host)
  +  iov_length(iov, nr_segs)  0) {
  + hot_rb_update_freqs(iocb-ki_filp-f_mapping-host,
  + (u64)offset,
  + (u64)iov_length(iov, nr_segs),
  + rw  WRITE);
  + }
 
  That's a bit messy. I'd prefer a static inline function that hides
  all this. e.g.
 Do you think of moving the condition into hot_inode_udate_freqs(), not
 adding another new function?

Moving it into hot_inode_udate_freqs() will add a function call
overhead even when tracking is not enabled. a static inline function
will just result in no extra overhead other than the if
statement

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature

2012-09-27 Thread Dave Chinner
On Thu, Sep 27, 2012 at 02:54:22PM +0800, Zhi Yong Wu wrote:
 On Thu, Sep 27, 2012 at 12:03 PM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Fork and run one kernel kthread to calculate
  that temperature based on some metrics kept
  in custom frequency data structs, and store
  the info in the hash table.
 
  No new kthreads, please. Use a per-superblock workqueue and a struct
  delayed_work to run periodic work on each superblock.
 If no new kthread is created, which kthread will work on these
 delayed_work tasks?

One of the kworker threads that service the workqueue
infrastructure.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'

2012-09-27 Thread Dave Chinner
On Thu, Sep 27, 2012 at 01:25:34PM +0800, Zhi Yong Wu wrote:
 On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Introduce one new mount option '-o hottrack',
  and add its parsing support.
Its usage looks like:
 mount -o hottrack
 mount -o nouser,hottrack
 mount -o nouser,hottrack,loop
 mount -o hottrack,nouser
 
  I think that this option parsing should be done by the filesystem,
  even though the tracking functionality is in the VFS. That way ony
  the filesystems that can use the tracking information will turn it
  on, rather than being able to turn it on for everything regardless
  of whether it is useful or not.
 
  Along those lines, just using a normal superblock flag to indicate
  it is active (e.g. MS_HOT_INODE_TRACKING in sb-s_flags) means you
  don't need to allocate the sb-s_hot_info structure just to be able
 If we don't allocate one sb-s_hot_info, where will those hash list
 head and btree roots locate?

I wrote that thinking (mistakenly) that s-hot)info was dynamically
allocated rather than being embedded in the struct super_block.

Indeed, if the mount option is held in s_flags, then it could be
dynamically allocated, but I don't think that's really necessary...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] loop: Make explicit loop device destruction lazy

2012-09-28 Thread Dave Chinner
From: Dave Chinner dchin...@redhat.com

xfstests has always had random failures of tests due to loop devices
failing to be torn down and hence leaving filesytems that cannot be
unmounted. This causes test runs to immediately stop.

Over the past 6 or 7 years we've added hacks like explicit unmount
-d commands for loop mounts, losetup -d after unmount -d fails, etc,
but still the problems persist.  Recently, the frequency of loop
related failures increased again to the point that xfstests 259 will
reliably fail with a stray loop device that was not torn down.

That is despite the fact the test is above as simple as it gets -
loop 5 or 6 times running mkfs.xfs with different paramters:

lofile=$(losetup -f)
losetup $lofile $testfile
$MKFS_XFS_PROG -b size=512 $lofile /dev/null || echo mkfs failed!
sync
losetup -d $lofile

And losteup -d $lofile is failing with EBUSY on 1-3 of these loops
every time the test is run.

Turns out that blkid is running simultaneously with losetup -d, and
so it sees an elevated reference count and returns EBUSY.  But why
is blkid running? It's obvious, isn't it? udev has decided to try
and find out what is on the block device as a result of a creation
notification. And it is racing with mkfs, so might still be scanning
the device when mkfs finishes and we try to tear it down.

So, make losetup -d force autoremove behaviour. That is, when the
last reference goes away, tear down the device. xfstests wants it
*gone*, not causing random teardown failures when we know that all
the operations the tests have specifically run on the device have
completed and are no longer referencing the loop device.

Signed-off-by: Dave Chinner dchin...@redhat.com
---
 drivers/block/loop.c |   17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3bba655..187b573 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -976,8 +976,21 @@ static int loop_clr_fd(struct loop_device *lo)
if (lo-lo_state != Lo_bound)
return -ENXIO;
 
-   if (lo-lo_refcnt  1)  /* we needed one fd for the ioctl */
-   return -EBUSY;
+   /*
+* If we've explicitly asked to tear down the loop device,
+* and it has an elevated reference count, set it for auto-teardown when
+* the last reference goes away. This stops $!~#$@ udev from
+* preventing teardown because it decided that it needs to run blkid on
+* the loopback device whenever they appear. xfstests is notorious for
+* failing tests because blkid via udev races with a losetup
+* dev/do something like mkfs/losetup -d dev causing the losetup -d
+* command to fail with EBUSY.
+*/
+   if (lo-lo_refcnt  1) {
+   lo-lo_flags |= LO_FLAGS_AUTOCLEAR;
+   mutex_unlock(lo-lo_ctl_mutex);
+   return 0;
+   }
 
if (filp == NULL)
return -EINVAL;
-- 
1.7.10

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-03 Thread Dave Chinner
On Tue, Oct 02, 2012 at 07:41:10PM -0700, Kent Overstreet wrote:
 On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote:
  On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
   On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
Kent Overstreet koverstr...@google.com writes:

 So, I and other people keep running into things where we really need 
 to
 add an interface to pass some auxiliary... stuff along with a pread() 
 or
 pwrite().

 A few examples:

 * IO scheduler hints. Some userspace program wants to, per IO, specify
 either priorities or a cgroup - by specifying a cgroup you can have a
 fileserver in userspace that makes use of cfq's per cgroup bandwidth
 quotas.

You can do this today by splitting I/O between processes and placing
those processes in different cgroups.  For io priority, there is
ioprio_set, which incurs an extra system call, but can be used.  Not
elegant, but possible.
   
   Yes - those are things I'm trying to replace. Doing it that way is a
   real pain, both as it's a lousy interface for this and it does impact
   performance (ioprio_set doesn't really work too well with aio, too).
   
 * Cache hints. For bcache and other things, userspace may want to 
 specify
 this data should be cached, this data should bypass the cache, 
 etc.

Please explain how you will differentiate this from posix_fadvise.
   
   Oh sorry, I think about SSD caching so much I forget to say that's what
   I'm talking about. posix_fadvise is for the page cache, we want
   something different for an SSD cache (IMO it'd be really ugly to use it
   for both, and posix_fadvise() can't really specifify everything we'd
   want to for an SSD cache).
  
  Similar discussions about posix_fadvise() are being had for marking
  ranges of files as volatile (i.e. useful for determining what can be
  evicted from a cache when space reclaim is required).
  
  https://lkml.org/lkml/2012/10/2/501
 
 Hmm, interesting
 
 Speaking as an implementor though, hints that aren't associated with any
 specific IO are harder to make use of - stuff is in the cache. What you
 really want is to know, for a given IO, whether to cache it or not, and
 possibly where in the LRU to stick it.

I can see how it might be useful, but it needs to have a defined set
of attributes that a file IO is allowed to have. If you don't define
the set, then what really have is an arbitrary set of storage-device
specific interfaces.

Of course, once we have a defined set of per-file IO policy
attributes, we don't really need per-IO attributes - you can just
set them through a range interface like fadvise() or fallocate().

 Well, it's quite possible that different implementations would have no
 trouble making use of those kinds of hints, I'm no doubt biased by
 having implemented bcache. With bcache though, cache replacement is done
 in terms of physical address space, not logical (i.e. the address space
 of the device being cached). 
 
 So to handle posix_fadvise, we'd have to walk the btree and chase
 pointers to buckets, and modify the bucket priorities up or down... but
 what about the other data in those buckets? It's not clear what should
 happen, but there isn't any good way to take that into account.
 
 (The exception is dropping data from the cache entirely, we can just
 invalidate that part of the keyspace and garbage collection will reclaim
 the buckets they pointed to. Bcache does that for discard requests,
 currently).

It sounds to me like you are saying is that the design of bcache is
unsuited to file-level management of caching policy, and that is why
you want to pass attributes directly to bcache with specific IOs. Is
that a fair summary of the problem you are describing here?

My problem with this approach has nothing to do with the per-IO
nature of it - it's to do with the layering violations and the
amount of storage specific knowledge needed to make effective use of
it. i.e. it seems like an interface that can only be used by people
intimately familiar with underlying storage implementation. You
happen to be one of those people, so I figure that you don't see a
problem with that. ;)

However, it also implies that an application must understand and use
a specific storage configuration that matches the attributes an
application sends. I understand how this model is appealling to
Google because they control the whole application and storage stack
(hardware and software) from top to bottom. However, I just don't
think that it is a solution that the rest of the world can use
effectively.

The scope of data locality, aging and IO priority policy
control is much larger than just controlling SSD caching.
SSD caching is just a another implementation of automatic tiering,
and HSMs have been doing this for years and years. It's the same
problem - space management and deciding what to leave in the
frequently accessed pool

Re: VFS hot tracking: How to calculate data temperature?

2012-11-05 Thread Dave Chinner
On Mon, Nov 05, 2012 at 10:35:50AM +0800, Zhi Yong Wu wrote:
 On Sat, Nov 3, 2012 at 5:27 AM, Mingming.cao c...@us.ibm.com wrote:
  On Fri, 2012-11-02 at 14:38 +0800, Zhi Yong Wu wrote:
  Here also has another question.
 
  How to save the file temperature among the umount to be able to
  preserve the file tempreture after reboot?
 
  This above is the requirement from DB product.
  I thought that we can save file temperature in its inode struct, that
  is, add one new field in struct inode, then this info will be written
  to disk with inode.
 
  Any comments or ideas are appreciated, thanks.
 
 
 
  Maybe could save the last file temperature with extended attributes.
 It seems that only ext4 has the concept of extended attributes.

All major filesystems have xattr support. They are used extensively
by the security and integrity subsystems, for example.

Saving the information might be something that is useful to certian
applications, but lets have the people that need that functionality
spell out their requirements before discussing how or what to
implement.  Indeed, discussion shoul dreally focus on getting the
core, in-memory infrastructure sorted out first before trying to
expand the functionality further...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Problem with DISCARD and RAID5

2012-11-05 Thread Dave Chinner
On Fri, Nov 02, 2012 at 09:40:58AM +0800, Shaohua Li wrote:
 On Thu, Nov 01, 2012 at 05:38:54PM +1100, NeilBrown wrote:
  
  Hi Shaohua,
   I've been doing some testing and discovered a problem with your discard
   support for RAID5.
  
   The code in blkdev_issue_discard assumes that the 'granularity' is a power
   of 2, and for example subtracts 1 to get a mask.
  
   However RAID5 sets the granularity to be the stripe size which often is not
   a power of two.  When this happens you can easily get into an infinite 
  loop.
  
   I suspect that to make this work properly, blkdev_issue_discard will need 
  to
   be changed to allow 'granularity' to be an arbitrary value.
   When it is a power of two, the current masking can be used.
   When it is anything else, it will need to use sector_div().
 
 Yep, looks we need use sector_div. And this isn't the only problem. discard
 request can be merged, and the merge check only checks max_discard_sectors.
 That means the split requests in blkdev_issue_discard can be merged again. The
 split nerver works.
 
 I'm wondering what's purpose of discard_alignment and discard_granularity. Are
 there devices with discard_granularity not 1 sector?

Most certainly. Thin provisioned storage often has granularity in the
order of megabytes

 If bio isn't discard
 aligned, what device will do?

Up to the device.

 Further, why driver handles alignment/granularity
 if device will ignore misaligned request.

When you send a series of sequential unaligned requests, the device
may ignore them all. Hence you end up with nothing being discarded,
even though the entire range being discarded is much, much larger
than the discard granularity

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] xfs: add hot tracking support.

2012-11-07 Thread Dave Chinner
On Wed, Nov 07, 2012 at 04:38:23PM +0800, Zhi Yong Wu wrote:
 HI, Dave,
 
 I guess that you should add some hot tracking stuff in some
 xfs_show_xxx function, right?

Yes, it should - I thought I did that. I recall seeing int
/proc/mounts, but maybe I was just hallucinating. I'll send an
updated version when I get to fixing it.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-10-21 Thread Dave Chinner
 load that causes write
latency occurring.  Testing with lots of clients introduces all of
these things, and that will greatly impact server behaviour.
Aggregation in memory isolates a lot of this variation from
writeback and hence smooths out a lot of the variability that leads
to fragmentation, seeks, latency spikes and preamture filesystem
aging.

That is, if you set a 100MB dirty_bytes limit on a bdi it will give
really good buffering for a single client doing a streaming write.
If you've got 10 clients, then assuming fair distribution of server
resources, then that is 10MB per client per writeback trigger.
That's line ball as to whether it will cause fragmentation severe
enough to impact server throughput. If you've got 100 clients,then
that's only 1MB per client per writeback trigger, and that's
definitely too low to maintain decent writeback behaviour.  i.e.
you're now writing 100 files 1MB at a time, and that tends towards
random IO patterns rather than sequential IO patterns. Seek time
dertermines throughput, not IO bandwidth limits.

IOWs, as the client count goes up, the writeback patterns will tends
more towards random IO than sequential IO unless the amount of
buffering allowed before writeback triggers also grows. That's
important, because random IO is much slower than sequential IO.
What I'd like to have is some insight into whether this patch
changes that inflection point, for better or for worse. The only way
to find that is to run multi-client testing

  5. Are the improvements consistent across different filesystem
  types?  We've had writeback changes in the past cause improvements
  on one filesystem but significant regressions on others.  I'd
  suggest that you need to present results for ext4, XFS and btrfs so
  that we have a decent idea of what we can expect from the change to
  the generic code.
 
 As mentioned in the above Table 1  2, performance gain in WRITE speed
 is different on different file systems i.e. different on NFS client
 over XFS  EXT4.
 We also tried BTRFS over NFS, but we could not see any WRITE speed
 performance gain/degrade on BTRFS over NFS, so we are not posting
 BTRFS results here.

You should post btrfs numbers even if they show no change. It wasn't
until I got this far that I even realised that you'd even tested
BTRFS. I don't know what to make of this, because I don't know what
the throughput rates compared to XFS and EXT4 are

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 00/22] loop: Issue O_DIRECT aio using bio_vec

2012-10-22 Thread Dave Chinner
On Mon, Oct 22, 2012 at 10:15:00AM -0500, Dave Kleikamp wrote:
 This is the current version of the patchset I presented at the LSF-MM
 Summit in San Francisco in April. I apologize for letting it go so
 long before re-submitting.
 
 This patchset was begun by Zach Brown and was originally submitted for
 review in October, 2009. Feedback was positive, and I have picked up
 where he left off, porting his patches to the latest mainline kernel
 and adding support more file systems.
 
 This patch series adds a kernel interface to fs/aio.c so that kernel code can
 issue concurrent asynchronous IO to file systems.  It adds an aio command and
 file system methods which specify io memory with pages instead of userspace
 addresses.
 
 This series was written to reduce the current overhead loop imposes by
 performing synchronus buffered file system IO from a kernel thread.  These
 patches turn loop into a light weight layer that translates bios into iocbs.

I note that there is no support for XFS in this patch set. Is there
a particular problem that prevents XFS from being converted, or it
just hasn't been done?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 00/22] loop: Issue O_DIRECT aio using bio_vec

2012-10-22 Thread Dave Chinner
On Mon, Oct 22, 2012 at 07:53:40PM -0500, Dave Kleikamp wrote:
 On 10/22/2012 07:07 PM, Dave Chinner wrote:
  On Mon, Oct 22, 2012 at 10:15:00AM -0500, Dave Kleikamp wrote:
  This is the current version of the patchset I presented at the LSF-MM
  Summit in San Francisco in April. I apologize for letting it go so
  long before re-submitting.
 
  This patchset was begun by Zach Brown and was originally submitted for
  review in October, 2009. Feedback was positive, and I have picked up
  where he left off, porting his patches to the latest mainline kernel
  and adding support more file systems.
 
  This patch series adds a kernel interface to fs/aio.c so that kernel code 
  can
  issue concurrent asynchronous IO to file systems.  It adds an aio command 
  and
  file system methods which specify io memory with pages instead of userspace
  addresses.
 
  This series was written to reduce the current overhead loop imposes by
  performing synchronus buffered file system IO from a kernel thread.  These
  patches turn loop into a light weight layer that translates bios into 
  iocbs.
  
  I note that there is no support for XFS in this patch set. Is there
  a particular problem that prevents XFS from being converted, or it
  just hasn't been done?
 
 It just hasn't been done. It wasn't one of the trivial ones so I put it
 off at first, and after that, it's an oversight. I'll follow up with an
 xfs patch for your review.

Thanks Shaggy, that's all I wanted to know. No extra work for me
(apart from review and testing) is fine by me. ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-23 Thread Dave Chinner
On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
 Hi,
   Recently we ran into the bug that an opened file's ra_pages does not
 synchronize with it's backing device's when the latter is changed
 with blockdev --setra, the application needs to reopen the file
 to know the change,

or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
window to the (new) bdi default.

 which is inappropriate under our circumstances.

Which are? We don't know your circumstances, so you need to tell us
why you need this and why existing methods of handling such changes
are insufficient...

Optimal readahead windows tend to be a physical property of the
storage and that does not tend to change dynamically. Hence block
device readahead should only need to be set up once, and generally
that can be done before the filesystem is mounted and files are
opened (e.g. via udev rules). Hence you need to explain why you need
to change the default block device readahead on the fly, and why
fadvise(POSIX_FADV_NORMAL) is inappropriate to set readahead
windows to the new defaults.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/16] f2fs: add inode operations for special inodes

2012-10-23 Thread Dave Chinner
On Wed, Oct 17, 2012 at 12:50:11PM +, Arnd Bergmann wrote:
 On Tuesday 16 October 2012, Jaegeuk Kim wrote:
   IIRC, fs2fs uses 4k inodes, so IMO per-inode xattr tress with
   internal storage before spilling to an external block is probably
   the best approach to take...
  
  Yes, indeed this is the best approach to f2fs's xattr.
  Apart from giving fs hints, it is worth enough to optimize later.
 
 I've thought a bit more about how this could be represented efficiently
 in 4KB nodes. This would require a significant change of the way you
 represent inodes, but can improve a number of things at the same time.
 
 The idea is to replace the fixed area in the inode that contains block
 pointers with an extensible TLV (type/length/value) list that can contain
 multiple variable-length fields, like this.

You've just re-invented inode forks... ;)

 All TLVs together with the
 fixed-length inode data can fill a 4KB block.
 
 The obvious types would be:
 
 * Direct file contents if the file is less than a block
 * List of block pointers, as before, minimum 1, maximum until the end
   of the block
 * List of indirect pointers, now also a variable length, similar to the
   list of block pointers
 * List of double-indirect block pointers
 * direct xattr: zero-terminated attribute name followed by contents
 * indirect xattr: zero-terminated attribute name followed by up to
   16 block pointers to store a maximum of 64KB sized xattrs
 
 This could be extended later to cover additional types, e.g. a list
 of erase block pointers, triple-indirect blocks or extents.

An inode fork doesn't care about the data in it - it's just an
independent block mapping index. i.e. inline, direct,
indirect, double indirect. The data in the fork is managed
externally to the format of the fork. e.g. XFS has two forks - one
for storing data (file data, directory contents, etc) and the other
for storing attributes.

The main issue with supporting an arbitrary number of forks is space
management of the inode literal area.  e.g. one fork is in inline
format (e.g.  direct file contents) and then we add an attribute.
The attribute won't fit inline, nor will an extent form fork header,
so the inline data fork has to be converted to extent format before
the xattr can be added. Now scale that problem up to an arbitrary
number of forks

 As a variation of this, it would also be nice to turn around the order
 in which the pointers are walked, to optimize for space and for growing
 files, rather than for reading the beginning of a file. With this, you
 can represent a 9 KB file using a list of two block pointers, and 1KB
 of direct data, all in the inode. When the user adds another byte, you
 only need to rewrite the inode. Similarly, a 5 MB file would have a
 single indirect node (covering block pointers for 4 MB), plus 256
 separate block pointers (covering the last megabyte), and a 5 GB file
 can be represented using 1 double-indirect node and 256 indirect nodes,
 and each of them can still be followed by direct tail data and
 extended attributes.

I'm not sure that the resultant code complexity is worth saving an
extra block here and there.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-23 Thread Dave Chinner
On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
 
 
 On Fri, 19 Oct 2012, Peter Zijlstra wrote:
 
   Yes, I tried this approach - it involves doing LOCK instruction on read 
   lock, remembering the cpu and doing another LOCK instruction on read 
   unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
   happens in the common case). It was slower than the approach without any 
   LOCK instructions (43.3 seconds seconds for the implementation with 
   per-cpu LOCKed access, 42.7 seconds for this implementation without 
   atomic 
   instruction; the benchmark involved doing 512-byte direct-io reads and 
   writes on a ramdisk with 8 processes on 8-core machine).
  
  So why is that a problem? Surely that's already tons better then what
  you've currently got.
 
 Percpu rw-semaphores do not improve performance at all. I put them there 
 to avoid performance regression, not to improve performance.
 
 All Linux kernels have a race condition - when you change block size of a 
 block device and you read or write the device at the same time, a crash 
 may happen. This bug is there since ever. Recently, this bug started to 
 cause major trouble - multiple high profile business sites report crashes 
 because of this race condition.

 You can fix this race by using a read lock around I/O paths and write lock 
 around block size changing, but normal rw semaphore cause cache line 
 bouncing when taken for read by multiple processors and I/O performance 
 degradation because of it is measurable.

This doesn't sound like a new problem.  Hasn't this global access,
single modifier exclusion problem been solved before in the VFS?
e.g. mnt_want_write()/mnt_make_readonly()

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/16] f2fs: introduce flash-friendly file system

2012-10-09 Thread Dave Chinner
[ Folks, can you trim your responses down to just quote the part you
are responding to? Having to repeatedly scroll through 500 lines of
irrelevant text just to find the 5 lines that is being commented on
is exceedingly painful.  ]

On Tue, Oct 09, 2012 at 09:01:18PM +0900, Jaegeuk Kim wrote:
  From: Lukáš Czerner [mailto:lczer...@redhat.com]
I am sorry but this reply makes me smile. How can you design a fs
relying on time attack heuristics to figure out what the proper
layout should be ? Or even endorse such heuristics to be used in
mkfs ? What we should be focusing on is to push vendors to actually
give us such information so we can properly propagate that
throughout the kernel - that's something everyone will benefit from.
After that the optimization can be done in every file system.
   
  
   Frankly speaking, I agree that it would be the right direction eventually.
   But, as you know, it's very difficult for all flash vendors to promote 
   and standardize that.
   Because each vendors have different strategies to open their internal 
   information and also try
   to protect their secrets whatever they are.
  
   IMO, we don't need to wait them now.
   Instead, from the start, I suggest f2fs that uses those information to 
   the file system design.
   In addition, I suggest using heuristics right now as best efforts.

And in response, other people are suggesting that this is the
wrong approach.

   Maybe in future, if vendors give something, f2fs would be more feasible.
   In the mean time, I strongly hope to validate and stabilize f2fs with 
   community.
  
  Do not get me wrong, I do not think it is worth to wait for vendors
  to come to their senses, but it is worth constantly reminding that
  we *need* this kind of information and those heuristics are not
  feasible in the long run anyway.
  
  I believe that this conversation happened several times already, but
  what about having independent public database of all the internal
  information about hw from different vendors where users can add
  information gathered by the time attack heuristic so other does not
  have to run this again and again. I am not sure if Linaro or someone
  else have something like that, someone can maybe post a link to that.

Linaro already have one, which is another reason why using
heuristics is the wrong approach:

https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey?action=showredirect=WorkingGroups%2FKernelConsolidation%2FProjects%2FFlashCardSurvey

 As I mentioned, I agree to push vendors to open those information all the 
 time.
 And, I absolutely didn't mean that it is worth to wait vendors.
 I meant, until opening those information by vendors, something like
 proposing f2fs or gathering heuristics are also needed simultaneously.
 
 Anyway, it's very interesting to build a database gathering products' 
 information.
 May I access the database?

It's public information.

If you want to support different types of flash, then either add
your timing attack derived information on specific hardware to the
above table, or force vendors to update it themselves if they want
their flash memory supported by this filesystem.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

2012-10-10 Thread Dave Chinner
On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
 On Tue, 9 Oct 2012, Jan Kara wrote:
  On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
   On Mon, 1 Oct 2012, Jan Kara wrote:
   
On s390 any write to a page (even from kernel itself) sets architecture
specific page dirty bit. Thus when a page is written to via standard 
write, HW
dirty bit gets set and when we later map and unmap the page, 
page_remove_rmap()
finds the dirty bit and calls set_page_dirty().

Dirtying of a page which shouldn't be dirty can cause all sorts of 
problems to
filesystems. The bug we observed in practice is that buffers from the 
page get
freed, so when the page gets later marked as dirty and writeback writes 
it, XFS
crashes due to an assertion BUG_ON(!PagePrivate(page)) in 
page_buffers() called
from xfs_count_page_state().
   
   What changed recently?  Was XFS hardly used on s390 until now?
The problem was originally hit on SLE11-SP2 which is 3.0 based after
  migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
  XFS just started to be more peevish about what pages it gets between these
  two releases ;) (e.g. ext3 or ext4 just says oh, well and fixes things
  up).
 
 Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
 whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
 ASSERT usually compiled out, stumbling later in page_buffers() as you say.

What that says is that no-one is running xfstests-based QA on s390
with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
I've never tested XFS on s390 before, and I doubt any of the
upstream developers have, either, because not many peopl ehave s390
machines in their basement. So this is probably just an oversight
in the distro QA environment more than anything

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-24 Thread Dave Chinner
On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
 Hi Dave,
 On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner da...@fromorbit.com wrote:
  On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
  Hi,
Recently we ran into the bug that an opened file's ra_pages does not
  synchronize with it's backing device's when the latter is changed
  with blockdev --setra, the application needs to reopen the file
  to know the change,
 
  or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
  window to the (new) bdi default.
 
  which is inappropriate under our circumstances.
 
  Which are? We don't know your circumstances, so you need to tell us
  why you need this and why existing methods of handling such changes
  are insufficient...
 
  Optimal readahead windows tend to be a physical property of the
  storage and that does not tend to change dynamically. Hence block
  device readahead should only need to be set up once, and generally
  that can be done before the filesystem is mounted and files are
  opened (e.g. via udev rules). Hence you need to explain why you need
  to change the default block device readahead on the fly, and why
  fadvise(POSIX_FADV_NORMAL) is inappropriate to set readahead
  windows to the new defaults.
 Our system is a fuse-based file system, fuse creates a
 pseudo backing device for the user space file systems, the default readahead
 size is 128KB and it can't fully utilize the backing storage's read ability,
 so we should tune it.

Sure, but that doesn't tell me anything about why you can't do this
at mount time before the application opens any files. i.e.  you've
simply stated the reason why readahead is tunable, not why you need
to be fully dynamic.

 The above third-party application using our file system maintains
 some long-opened files, we does not have any chances
 to force them to call fadvise(POSIX_FADV_NORMAL). :(

So raise a bug/feature request with the third party.  Modifying
kernel code because you can't directly modify the application isn't
the best solution for anyone. This really is an application problem
- the kernel already provides the mechanisms to solve this
problem...  :/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-24 Thread Dave Chinner
On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote:
 On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner da...@fromorbit.com wrote:
  On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
  Hi Dave,
  On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner da...@fromorbit.com wrote:
   On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
   Hi,
 Recently we ran into the bug that an opened file's ra_pages does not
   synchronize with it's backing device's when the latter is changed
   with blockdev --setra, the application needs to reopen the file
   to know the change,
  
   or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
   window to the (new) bdi default.
  
   which is inappropriate under our circumstances.
  
   Which are? We don't know your circumstances, so you need to tell us
   why you need this and why existing methods of handling such changes
   are insufficient...
  
   Optimal readahead windows tend to be a physical property of the
   storage and that does not tend to change dynamically. Hence block
   device readahead should only need to be set up once, and generally
   that can be done before the filesystem is mounted and files are
   opened (e.g. via udev rules). Hence you need to explain why you need
   to change the default block device readahead on the fly, and why
   fadvise(POSIX_FADV_NORMAL) is inappropriate to set readahead
   windows to the new defaults.
  Our system is a fuse-based file system, fuse creates a
  pseudo backing device for the user space file systems, the default 
  readahead
  size is 128KB and it can't fully utilize the backing storage's read 
  ability,
  so we should tune it.
 
  Sure, but that doesn't tell me anything about why you can't do this
  at mount time before the application opens any files. i.e.  you've
  simply stated the reason why readahead is tunable, not why you need
  to be fully dynamic.
 We store our file system's data on different disks so we need to change 
 ra_pages
 dynamically according to where the data resides, it can't be fixed at mount 
 time
 or when we open files.

That doesn't make a whole lot of sense to me. let me try to get this
straight.

There is data that resides on two devices (A + B), and a fuse
filesystem to access that data. There is a single file in the fuse
fs has data on both devices. An app has the file open, and when the
data it is accessing is on device A you need to set the readahead to
what is best for device A? And when the app tries to access data for
that file that is on device B, you need to set the readahead to what
is best for device B? And you are changing the fuse BDI readahead
settings according to where the data in the back end lies?

It seems to me that you should be setting the fuse readahead to the
maximum of the readahead windows the data devices have configured at
mount time and leaving it at that

 The abstract bdi of fuse and btrfs provides some dynamically changing
 bdi.ra_pages
 based on the real backing device. IMHO this should not be ignored.

btrfs simply takes into account the number of disks it has for a
given storage pool when setting up the default bdi ra_pages during
mount.  This is basically doing what I suggested above.  Same with
the generic fuse code - it's simply setting a sensible default value
for the given fuse configuration.

Neither are dynamic in the sense you are talking about, though.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-25 Thread Dave Chinner
On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
 
 
 On Wed, 24 Oct 2012, Dave Chinner wrote:
 
  On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
   
   
   On Fri, 19 Oct 2012, Peter Zijlstra wrote:
   
 Yes, I tried this approach - it involves doing LOCK instruction on 
 read 
 lock, remembering the cpu and doing another LOCK instruction on read 
 unlock (which will hopefully be on the same CPU, so no cacheline 
 bouncing 
 happens in the common case). It was slower than the approach without 
 any 
 LOCK instructions (43.3 seconds seconds for the implementation with 
 per-cpu LOCKed access, 42.7 seconds for this implementation without 
 atomic 
 instruction; the benchmark involved doing 512-byte direct-io reads 
 and 
 writes on a ramdisk with 8 processes on 8-core machine).

So why is that a problem? Surely that's already tons better then what
you've currently got.
   
   Percpu rw-semaphores do not improve performance at all. I put them there 
   to avoid performance regression, not to improve performance.
   
   All Linux kernels have a race condition - when you change block size of a 
   block device and you read or write the device at the same time, a crash 
   may happen. This bug is there since ever. Recently, this bug started to 
   cause major trouble - multiple high profile business sites report crashes 
   because of this race condition.
  
   You can fix this race by using a read lock around I/O paths and write 
   lock 
   around block size changing, but normal rw semaphore cause cache line 
   bouncing when taken for read by multiple processors and I/O performance 
   degradation because of it is measurable.
  
  This doesn't sound like a new problem.  Hasn't this global access,
  single modifier exclusion problem been solved before in the VFS?
  e.g. mnt_want_write()/mnt_make_readonly()
  
  Cheers,
  
  Dave.
 
 Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw 
 semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() 
 to use percpu rw semaphores and remove the duplicated code.

I think you misunderstood my point - that rather than re-inventing
the wheel, why didn't you just copy something that is known to
work?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-25 Thread Dave Chinner
On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
 Hi Chen,
 
  But how can bdi related ra_pages reflect different files' readahead
  window? Maybe these different files are sequential read, random read
  and so on.
 
 It's simple: sequential reads will get ra_pages readahead size while
 random reads will not get readahead at all.
 
 Talking about the below chunk, it might hurt someone that explicitly
 takes advantage of the behavior, however the ra_pages*2 seems more
 like a hack than general solution to me: if the user will need
 POSIX_FADV_SEQUENTIAL to double the max readahead window size for
 improving IO performance, then why not just increase bdi-ra_pages and
 benefit all reads? One may argue that it offers some differential
 behavior to specific applications, however it may also present as a
 counter-optimization: if the root already tuned bdi-ra_pages to the
 optimal size, the doubled readahead size will only cost more memory
 and perhaps IO latency.
 
 --- a/mm/fadvise.c
 +++ b/mm/fadvise.c
 @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t 
 len, int advice)
 spin_unlock(file-f_lock);
 break;
 case POSIX_FADV_SEQUENTIAL:
 -   file-f_ra.ra_pages = bdi-ra_pages * 2;

I think we really have to reset file-f_ra.ra_pages here as it is
not a set-and-forget value. e.g.  shrink_readahead_size_eio() can
reduce ra_pages as a result of IO errors. Hence if you have had io
errors, telling the kernel that you are now going to do  sequential
IO should reset the readahead to the maximum ra_pages value
supported

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page

2012-10-28 Thread Dave Chinner
On Sat, Oct 27, 2012 at 06:16:26PM -0400, Theodore Ts'o wrote:
 On Fri, Oct 26, 2012 at 10:24:23PM +, Luck, Tony wrote:
   Well, we could set a new attribute bit on the file which indicates
   that the file has been corrupted, and this could cause any attempts to
   open the file to return some error until the bit has been cleared.
  
  That sounds a lot better than renaming/moving the file.
 
 What I would recommend is adding a 
 
 #define FS_CORRUPTED_FL   0x0100 /* File is corrupted */
 
 ... and which could be accessed and cleared via the lsattr and chattr
 programs.

Except that there are filesystems that cannot implement such flags,
or require on-disk format changes to add more of those flags. This
is most definitely not a filesystem specific behaviour, so any sort
of VFS level per-file state needs to be kept in xattrs, not special
flags. Filesystems are welcome to optimise the storage of such
special xattrs (e.g. down to a single boolean flag in an inode), but
using a flag for something that dould, in fact, storage the exactly
offset and length of the corruption is far better than just storing
a something is corrupted in this file bit

   Application programs could also get very confused when any attempt to
   open or read from a file suddenly returned some new error code (EIO,
   or should we designate a new errno code for this purpose, so there is
   a better indication of what the heck was going on?)
  
  EIO sounds wrong ... but it is perhaps the best of the existing codes. 
  Adding
  a new one is also challenging too.
 
 I think we really need a different error code from EIO; it's already
 horribly overloaded already, and if this is new behavior when the
 customers get confused and call up the distribution help desk, they
 won't thank us if we further overload EIO.  This is abusing one of the
 System V stream errno's, but no one else is using it:
 
 #define EADV   68  /* Advertise error */
 
 I note that we've already added a new error code:
 
 #define EHWPOISON 133   /* Memory page has hardware error */
 
 ... although the glibc shipping with Debian testing hasn't been taught
 what it is, so strerror(EHWPOISON) returns Unknown error 133.  We
 could simply allow open(2) and stat(2) return this error, although I
 wonder if we're just better off defining a new error code.

If we are going to add special new file corrupted errors, we
should add EFSCORRUPTED (i.e. filesystem corrupted) at the same
time

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v4 03/15] vfs,hot_track: add the function for collecting I/O frequency

2012-10-28 Thread Dave Chinner
On Sun, Oct 28, 2012 at 09:51:48PM +0800, Zhi Yong Wu wrote:
 On Sun, Oct 28, 2012 at 3:55 PM, Zheng Liu gnehzuil@gmail.com wrote:
  Hi Zhiyong,
 
  On Thu, Oct 25, 2012 at 11:08:55PM +0800, zwu.ker...@gmail.com wrote:
  [snip]
  @@ -199,6 +342,54 @@ err:
   }
 
   /*
  + * Main function to update access frequency from read/writepage(s) hooks
  + */
  +inline void hot_update_freqs(struct inode *inode, u64 start,
  + u64 len, int rw)
 
  This function seems too big.  So we really need to inline this function?
 As Dave said in his comments, it will add a function call
 overhead even when tracking is not enabled. a static inline function
 will just result in no extra overhead other than the if
 statement

I don't think I said that with respect to this code. I think I said
it w.r.t. a define or a small wrapper that decides to call
hot_update_freqs().  A static inline fucntion should only be a
couple of lines of code at most.

A static function, OTOH, can be inlined by the compiler if the
compiler thinks that is a win. But

.

  +EXPORT_SYMBOL_GPL(hot_update_freqs);

... it's an exported function, so it can't be inline or static, so
using inline is wrong whatever way you look at it. ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: Use bool type rather than a custom boolean_t type.

2012-11-12 Thread Dave Chinner
On Mon, Nov 12, 2012 at 09:36:17PM -0200, Thiago Farina wrote:
 Hi,
 
 Please, take a look.
 
 Patch attached.

It's a good start for a cleanup, but there's no point in removing
the boolean_t from one file and then not removing it from the rest
of the XFS code.  It's only used in a handful of places, so just
remove it completely.

Also, can you please place patches in line rather than attaching
them. Attaching them means they cannot be quoted in reply. See
Documentation/SubmittingPatches and Documentation/email-clients.txt
for guidelines.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: Remove boolean_t typedef completely.

2012-11-12 Thread Dave Chinner
On Mon, Nov 12, 2012 at 10:36:47PM -0200, Thiago Farina wrote:
 On Mon, Nov 12, 2012 at 10:24 PM, Dave Chinner da...@fromorbit.com wrote:
  On Mon, Nov 12, 2012 at 09:36:17PM -0200, Thiago Farina wrote:
  Hi,
 
  Please, take a look.
 
  Patch attached.
 
  It's a good start for a cleanup, but there's no point in removing
  the boolean_t from one file and then not removing it from the rest
  of the XFS code.  It's only used in a handful of places, so just
  remove it completely.
 
 Done.
 
 Please, take another look.
 
  Also, can you please place patches in line rather than attaching
  them. Attaching them means they cannot be quoted in reply. See
  Documentation/SubmittingPatches and Documentation/email-clients.txt
  for guidelines.
 
 Sorry, patch attached.

It's still attached as a base64 encoded attachment, not as inline
text

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Local DoS through write heavy I/O on CFQ Deadline

2012-10-14 Thread Dave Chinner
On Thu, Oct 11, 2012 at 01:23:32PM +0100, Alex Bligh wrote:
 We have noticed significant I/O scheduling issues on both the CFQ and the
 deadline scheduler where a non-root user can starve any other process of
 any I/O for minutes at a time. The problem is more serious using CFQ but is
 still an effective local DoS vector using Deadline.
 
 A simple way to generate the problem is:
 
   dd if=/dev/zero of=- bs=1M count=5 | dd if=- of=myfile bs=1M count=5
 
 (note use of 2 dd's is to avoid alleged optimisation of the writing dd
 from /dev/zero). zcat-ing a large file with stout redirected to a file
 produces a similar error. Using ionice to set idle priority makes no
 difference.
 
 To instrument the problem we produced a python script which does a MySQL
 select and update every 10 seconds, and time the execution of the update.
 This is normally milliseconds, but under user generated load conditions, we
 can take this to indefinite (on CFQ) and over a minute (on deadline).
 Postgres is affected in a similar manner (i.e. it is not MySQL specific).
 Simultaneously we have captured the output of 'vmstat 1 2' and
 /proc/meminfo, with appropriate timestamps.

Well, mysql is stuck in fsync(), so of course it's going to have
problems with write latency:

[ 3840.268303] [812650d5] jbd2_log_wait_commit+0xb5/0x130
[ 3840.268308] [8108aa50] ? add_wait_queue+0x60/0x60
[ 3840.268313] [81211248] ext4_sync_file+0x208/0x2d0

And postgres gets stuck there too. So what you are seeing is likely
an ext4 problem, not an IO scheduler problem.

Suggestion: try the same test with XFS. If the problem still exists,
then it *might* be an ioscheduler problem. If it goes away, then
it's an ext4 problem.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces

2012-10-15 Thread Dave Chinner
On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   FS_IOC_GET_HEAT_INFO: return a struct containing the various
 metrics collected in btrfs_freq_data structs, and also return a
 calculated data temperature based on those metrics. Optionally, retrieve
 the temperature from the hot data hash list instead of recalculating it.
 
   FS_IOC_GET_HEAT_OPTS: return an integer representing the current
 state of hot data tracking and migration:
 
 0 = do nothing
 1 = track frequency of access
 
   FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
 migration, as described above.
.
 +struct hot_heat_info {
 + __u64 avg_delta_reads;
 + __u64 avg_delta_writes;
 + __u64 last_read_time;
 + __u64 last_write_time;
 + __u32 num_reads;
 + __u32 num_writes;
 + __u32 temperature;
 + __u8 live;
 + char filename[PATH_MAX];

Don't put the filename in the ioctl and open the file in the kernel.
Have userspace open the file directly and issue the ioctl on the fd
that is returned.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 12/13] vfs: add debugfs support

2012-10-15 Thread Dave Chinner
On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Add a /sys/kernel/debug/hot_track/device_name/ directory for each
 volume that contains two files. The first, `inode_data', contains the
 heat information for inodes that have been brought into the hot data map
 structures. The second, `range_data', contains similar information for
 subfile ranges.

 + /* create debugfs range_data file */
 + debugfs_range_entry = debugfs_create_file(range_data,
 + S_IFREG | S_IRUSR | S_IWUSR | S_IRUGO,
 + debugfs_volume_entry,
 + (void *) range_data,
 + hot_debugfs_range_fops);

These should not be world readable. 0600 is probably the correct
permissions for them as we do not want random users to be able to
infer what files users are accessing from this information.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 12/13] vfs: add debugfs support

2012-10-15 Thread Dave Chinner
On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Add a /sys/kernel/debug/hot_track/device_name/ directory for each
 volume that contains two files. The first, `inode_data', contains the
 heat information for inodes that have been brought into the hot data map
 structures. The second, `range_data', contains similar information for
 subfile ranges.
 
 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
 ---
  fs/hot_tracking.c |  462 
 +
  fs/hot_tracking.h |   43 +
  2 files changed, 505 insertions(+), 0 deletions(-)
.
 +static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int 
 len)
 +{
 + struct lstring *debugfs_log = data-debugfs_log;
 + uint new_log_alloc_size;
 + char *new_log;
 + static char err_msg[] = No more memory!\n;
 +
 + if (len = data-log_alloc_size - debugfs_log-len) {
..
 + }
 +
 + memcpy(debugfs_log-str + debugfs_log-len, data-log_work_buff, len);
 + debugfs_log-len += (unsigned long) len;
 +
 + return len;
 +}
 +
 +/* Returns the number of bytes written to the log. */
 +static int hot_debugfs_log(struct debugfs_vol_data *data, const char *fmt, 
 ...)
 +{
 + struct lstring *debugfs_log = data-debugfs_log;
 + va_list args;
 + int len;
 + static char trunc_msg[] =
 + The next message has been truncated.\n;
 +
 + if (debugfs_log-str == NULL)
 + return -1;
 +
 + spin_lock(data-log_lock);
 +
 + va_start(args, fmt);
 + len = vsnprintf(data-log_work_buff,
 + sizeof(data-log_work_buff), fmt, args);
 + va_end(args);
 +
 + if (len = sizeof(data-log_work_buff)) {
 + hot_debugfs_copy(data, trunc_msg, sizeof(trunc_msg));
 + }
 +
 + len = hot_debugfs_copy(data, data-log_work_buff, len);
 + spin_unlock(data-log_lock);
 +
 + return len;
 +}

Aren't you just recreating seq_printf() here? i.e. can't you replace
all this complexity with generic seq_file/seq_operations constructs?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 00/13] vfs: hot data tracking

2012-10-15 Thread Dave Chinner
On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
 NOTE:
 
   The patchset is currently post out mainly to make sure
 it is going in the correct direction and hope to get some
 helpful comments from other guys.
   For more infomation, please check hot_tracking.txt in Documentation
 
 TODO List:

1) Fix OOM issues - the hot inode tracking caches grow very large
and don't get trimmed under memory pressure. From slabtop, after
creating roughly 24 million single byte files(*) on a machine with
8GB RAM:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME   
23859510 23859476  99%0.12K 795317   30   3181268K hot_range_item
23859441 23859439  99%0.16K 1037367   23   4149468K hot_inode_item
572530 572530 100%0.55K  817907327160K radix_tree_node
241706 241406  99%0.22K  14218   17 56872K xfs_ili
241206 241204  99%1.06K  804023321608K xfs_inode

The inode tracking is trying to track all 24 million inodes even
though they have been written only once, and there are only 240,000
inodes in the cache at this point in time. That was the last update
that slabtop got, so it is indicative of the impending OOM situation
that occurred.

 Changelog from v2:
  1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
  2.) Added memory shrinker [Dave Chinner]

I haven't looked at the shrinker, but clearly it is not working,
otherwise the above OOM situation would not be occurring.

Cheers,

Dave.

(*) Tested on an empty 17TB XFS filesystem with:

$ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
meta-data=/dev/vdc   isize=256agcount=17, agsize=268435455 blks
 =   sectsz=512   attr=2, projid32bit=0
data =   bsize=4096   blocks=4563402735, imaxpct=5
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0
log  =internal log   bsize=4096   blocks=131072, version=2
 =   sectsz=512   sunit=1 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
$ sudo chmod 777 /mnt/scratch
$ fs_mark  -D  1  -S0  -n  10  -s  1  -L  63  -d \
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d \
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d \
/mnt/scratch/6  -d  /mnt/scratch/7
.
 0 21601  16679.3 12552262
 0 22401  15412.4 12588587
 0 23201  16367.6 14199322
 0 24001  15680.4 15741205
hangs here w/ OOM

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/16] f2fs: add inode operations for special inodes

2012-10-15 Thread Dave Chinner
On Sun, Oct 14, 2012 at 03:19:37PM +, Arnd Bergmann wrote:
 On Sunday 14 October 2012, Vyacheslav Dubeyko wrote: 
  On Oct 14, 2012, at 11:09 AM, Jaegeuk Kim wrote:
   2012-10-14 (일), 02:21 +0400, Vyacheslav Dubeyko:
  Extended attributes are more flexible way, from my point of view. The xattr 
  gives
  possibility to make hint to filesystem at any time and without any 
  dependencies with
  application's functional opportunities. Documented way of using such 
  extended attributes
  gives to user flexible way of manipulation of filesystem behavior (but I 
  remember that
  you don't believe in an user :-)).
  
  So, I think that fadvise() and extended attributes can be complementary 
  solutions.
 
 Right. Another option is to have ext4 style attributes, see 
 http://linux.die.net/man/1/chattr

Xattrs are much prefered to more ext4 style flags because xattrs
are filesystem independent. Indeed, some filesystems can't store any
new ext4 style flags without a change of disk format or
internally mapping them to an xattr. So really, xattrs are the best
way forward for such hints.

 Unlike extended attributes, there is a limited number of those,
 and they can only be boolean flags, but that might be enough for
 this particular use case.

A boolean is not sufficient for access policy hints. An extensible
xattr format is probably the best approach to take here, so that we
can easily introduce new access policy hints as functionality is
required. Indeed, an extensible xattr could start with just a
hot/cold boolean, and grow from there

 The main reason I can see against extended attributes is that they are not 
 stored
 very efficiently in f2fs, unless a lot of work is put into coming up with a 
 good
 implementation. A single flags bit can trivially be added to the inode in
 comparison (if it's not there already).

That's a deficiency that should be corrected, then, because xattrs
are very common these days.

And given that stuff like access frequency tracking is being
implemented at the VFS level, access policy hints should also be VFS
functionality. A bad filesystem implementation should not dictate
the interface for generically useful functionality

  Anyway, hardcoding or saving in filesystem list of file extensions is a 
  nasty way. It
  can be not safe or hardly understandable by users the way of 
  reconfiguration filesystem 
  by means of tunefs or debugfs with the purpose of file extensions addition 
  in such 
  black-box as TV or smartphones, from my point of view.
 
 It is only a performance hint though, so it is not a correctness issue the
 file system gets it wrong. In order to do efficient garbage collection, a log
 structured file system should take all the information it can get about the
 expected life of data it writes. I agree that the list, even in the form of
 mkfs time settings, is not a clean abstraction, but in the place of an Android
 phone manufacturer I would still enable it if it promises a significant
 performance advantage over not using it. I guess it would be nice if this
 could be overridden in some form, e.g. using an ioctl on the file as ext4 
 does.

An xattr on the root inode that holds a list like this is something
that could be set at mkfs time, but then also updated easily by new
software packages that are installed...

 We should also take the kinds of access we have seen on a file into account.

Yes, but it should be done at the VFS level, not in the filesystem
itself. Integrated into the current hot inode/range tracking that is
being worked on right now, I'd suggest.

IOWs, these access policy issues are not unique to F2FS or it's use
case. Anything to do with access hints, policy, tracking, file
classification, etc that can influence data locality, reclaim,
migration, etc need to be dealt with at the VFS, independently of a
specific filesystem. Filesystems can make use of that information
how they please (whether in the kernel or via userspace tools), but
having filesystem specific interfaces and implementations of the
same functionality is extremely wasteful. Let's do it once, and do
it right the first time. ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Write is not atomic?

2012-10-15 Thread Dave Chinner
On Mon, Oct 15, 2012 at 11:36:15PM +0200, Juliusz Chroboczek wrote:
 Hi,
 
 The Linux manual page for write(2) says:
 
 The adjustment of the file offset and the write operation are
 performed as an atomic step.

That's wrong. The file offset update is not synchronised at all with
the write, and for a shared fd the update will race.


 This is apparently an extension to POSIX, which says
 
 This volume of IEEE Std 1003.1-2001 does not specify behavior of
 concurrent writes to a file from multiple processes. Applications
 should use some form of concurrency control.

This is how Linux behaves.

 The following fragment of code
 
 int fd;
 fd = open(exemple, O_CREAT | O_WRONLY | O_TRUNC, 0666);
 fork();
 write(fd, Ouille, 6);
 close(fd);
 
 produces OuilleOuille, as expected, on ext4 on two machines running
 Linux 3.2 AMD64.  However, over XFS on an old Pentium III at 500 MHz
 running 2.6.32, it produces just Ouille roughly once in three times.

ext4, on 3.6:

$ for i in `seq 0 1`; do ./a.out ; cat /mnt/scratch/foo ; echo ; done | 
sort |uniq -c
 39 Ouille
   9962 OuilleOuille
$

XFS, on the same kernel, hardware and block device:

$ for i in `seq 0 1`; do ./a.out ; cat /mnt/scratch/foo ; echo ; done | 
sort |uniq -c
 40 Ouille
   9961 OuilleOuille
$

So both filesystems behave according to the POSIX definition of
concurrent writes

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: fix buffer lookup race on allocation failure

2012-10-15 Thread Dave Chinner
On Mon, Oct 15, 2012 at 11:27:58AM +0200, William Dauchy wrote:
 Hello,
 
 I believe, the commit fe2429b fixes the attached kernel trace.
 I tested it both on top of 3.2 and 3.4 stable tree.
 Could we consider adding this patch in stable tree at least for 3.2 and 3.4?
 
 commit fe2429b0966a7ec42b5fe3bf96f0f10de0a3b536
 Author: Dave Chinner dchin...@redhat.com
 Date:   Mon Apr 23 15:58:45 2012 +1000
 
 xfs: fix buffer lookup race on allocation failure
 
 When memory allocation fails to add the page array or tht epages to
 a buffer during xfs_buf_get(), the buffer is left in the cache in a
 partially initialised state. There is enough state left for the next
 lookup on that buffer to find the buffer, and for the buffer to then
 be used without finishing the initialisation.  As a result, when an
 attempt to do IO on the buffer occurs, it fails with EIO because
 there are no pages attached to the buffer.
 
 We cannot remove the buffer from the cache immediately and free it,
 because there may already be a racing lookup that is blocked on the
 buffer lock. Hence the moment we unlock the buffer to then free it,
 the other user is woken and we have a use-after-free situation.
 
 To avoid this race condition altogether, allocate the pages for the
 buffer before we insert it into the cache.  This then means that we
 don't have an allocation  failure case to deal after the buffer is
 already present in the cache, and hence avoid the problem
 altogether.  In most cases we won't have racing inserts for the same
 buffer, and so won't increase the memory pressure allocation before
 insertion may entail.
 
 Signed-off-by: Dave Chinner dchin...@redhat.com
 Reviewed-by: Mark Tinguely tingu...@sgi.com
 Signed-off-by: Ben Myers b...@sgi.com
 
 
 XFS: Assertion failed: bp-b_bn != XFS_BUF_DADDR_NULL, file:
 fs/xfs/xfs_buf.c, line: 598

You're running a CONFIG_XFS_DEBUG kernel. If you can reproduce the
problem with CONFIG_XFS_DEBUG, then it probably should be
backported.

If you are using CONFIG_XFS_DEBUG on production systems, then you
shouldn't be because it does nasty things to allocation patterns,
not to mention a 25-30% CPU overhead and will panic in places where
errors are recoverable but as a developer we want to try to find out
what went wrong.

In this case, you'll get a transient EIO error when the I/O is
issued on the malformed buffer, but other than that the system can
continue alon just fine and the next read ofthe buffer will work
prefectly...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] xfs: add hot tracking support.

2012-10-15 Thread Dave Chinner

From: Dave Chinner dchin...@redhat.com

Connect up the VFS hot tracking support so XFS filesystems can make
use of it.

Signed-off-by: Dave Chinner dchin...@redhat.com
---
 fs/xfs/xfs_mount.h |1 +
 fs/xfs/xfs_super.c |9 +
 2 files changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a631ca3..d5e7277 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -215,6 +215,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_WSYNC(1ULL  0) /* for nfs - all 
metadata ops
   must be synchronous except
   for space allocations */
+#define XFS_MOUNT_HOTTRACK (1ULL  1) /* hot inode tracking */
 #define XFS_MOUNT_WAS_CLEAN(1ULL  3)
 #define XFS_MOUNT_FS_SHUTDOWN  (1ULL  4) /* atomic stop of all filesystem
   operations, typically for
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 56c2537..17786ff 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -61,6 +61,7 @@
 #include linux/kthread.h
 #include linux/freezer.h
 #include linux/parser.h
+#include linux/hot_tracking.h
 
 static const struct super_operations xfs_super_operations;
 static kmem_zone_t *xfs_ioend_zone;
@@ -114,6 +115,7 @@ mempool_t *xfs_ioend_pool;
 #define MNTOPT_NODELAYLOG  nodelaylog/* Delayed logging disabled */
 #define MNTOPT_DISCARDdiscard/* Discard unused blocks */
 #define MNTOPT_NODISCARD   nodiscard /* Do not discard unused blocks */
+#define MNTOPT_HOTTRACKhot_track /* hot inode tracking */
 
 /*
  * Table driven mount option parser.
@@ -371,6 +373,8 @@ xfs_parseargs(
mp-m_flags |= XFS_MOUNT_DISCARD;
} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
mp-m_flags = ~XFS_MOUNT_DISCARD;
+   } else if (!strcmp(this_char, MNTOPT_HOTTRACK)) {
+   mp-m_flags |= XFS_MOUNT_HOTTRACK;
} else if (!strcmp(this_char, ihashsize)) {
xfs_warn(mp,
ihashsize no longer used, option is deprecated.);
@@ -1040,6 +1044,9 @@ xfs_fs_put_super(
 {
struct xfs_mount*mp = XFS_M(sb);
 
+   if (mp-m_flags  XFS_MOUNT_HOTTRACK)
+   hot_track_exit(sb);
+
xfs_filestream_unmount(mp);
xfs_unmountfs(mp);
 
@@ -1470,6 +1477,8 @@ xfs_fs_fill_super(
error = ENOMEM;
goto out_unmount;
}
+   if (mp-m_flags  XFS_MOUNT_HOTTRACK)
+   hot_track_init(sb);
 
return 0;
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 00/13] vfs: hot data tracking

2012-10-15 Thread Dave Chinner
On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
 NOTE:
 
   The patchset is currently post out mainly to make sure
 it is going in the correct direction and hope to get some
 helpful comments from other guys.
   For more infomation, please check hot_tracking.txt in Documentation
 

# mount -o hot_track /dev/vdc /mnt/scratch
# umount /mnt/scratch

hangs here on XFS:

#  echo w  /proc/sysrq-trigger 
[   44.229252] SysRq : Show Blocked State
[   44.230044]   taskPC stack   pid father
[   44.231187] umount  D 88021fd52dc0  3632  4107   4106 0x
[   44.231946]  880212a91bd8 0086 8802153da300 
880212a91fd8
[   44.231946]  880212a91fd8 880212a91fd8 880216cf6340 
8802153da300
[   44.231946]   7fff 880212a91d78 
880212a91d80
[   44.231946] Call Trace:
[   44.231946]  [81b6c1b9] schedule+0x29/0x70
[   44.231946]  [81b6a079] schedule_timeout+0x159/0x220
[   44.231946]  [8171a2f4] ? do_raw_spin_lock+0x54/0x120
[   44.231946]  [8171a45d] ? do_raw_spin_unlock+0x5d/0xb0
[   44.231946]  [81b6bfee] wait_for_common+0xee/0x190
[   44.231946]  [810b6a10] ? try_to_wake_up+0x2f0/0x2f0
[   44.231946]  [81b6c18d] wait_for_completion+0x1d/0x20
[   44.231946]  [8109efec] flush_workqueue+0x14c/0x3f0
[   44.231946]  [811aad89] hot_track_exit+0x39/0x180
[   44.231946]  [81454e83] xfs_fs_put_super+0x23/0x70
[   44.231946]  [8117a991] generic_shutdown_super+0x61/0xf0
[   44.231946]  [8117aa50] kill_block_super+0x30/0x80
[   44.231946]  [8117ae45] deactivate_locked_super+0x45/0x70
[   44.231946]  [8117ba0e] deactivate_super+0x4e/0x70
[   44.231946]  [81197541] mntput_no_expire+0x101/0x160
[   44.231946]  [811985b6] sys_umount+0x76/0x3a0
[   44.231946]  [81b755a9] system_call_fastpath+0x16/0x1b

because this is stuck:

[  200.064574] kworker/u:2 S 88021fc12dc0  5208   669  2 0x
[  200.064574]  88021532fc60 0046 88021532c7c0 
88021532ffd8
[  200.064574]  88021532ffd8 88021532ffd8 81fc3420 
88021532c7c0
[  200.064574]  88021532fc50 88021532fc98 8221e700 
8221e700
[  200.064574] Call Trace:
[  200.064574]  [81b6c1b9] schedule+0x29/0x70
[  200.064574]  [81b6a03b] schedule_timeout+0x11b/0x220
[  200.064574]  [810908d0] ? usleep_range+0x50/0x50
[  200.064574]  [811aa78a] hot_temperature_update_work+0x16a/0x1c0
[  200.064574]  [8171a2f4] ? do_raw_spin_lock+0x54/0x120
[  200.064574]  [81b6d3ce] ? _raw_spin_unlock_irq+0xe/0x30
[  200.064574]  [810b0b3c] ? finish_task_switch+0x5c/0x100
[  200.064574]  [8109d959] process_one_work+0x139/0x500
[  200.064574]  [811aa620] ? hot_range_update+0x1f0/0x1f0
[  200.064574]  [8109e63e] worker_thread+0x15e/0x460
[  200.064574]  [8109e4e0] ? manage_workers+0x2f0/0x2f0
[  200.064574]  [810a4b33] kthread+0x93/0xa0
[  200.064574]  [81b765c4] kernel_thread_helper+0x4/0x10
[  200.064574]  [810a4aa0] ? __init_kthread_worker+0x40/0x40
[  200.064574]  [81b765c0] ? gs_change+0x13/0x13

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 09/13] vfs: add one wq to update map info periodically

2012-10-15 Thread Dave Chinner
().

 + destroy_workqueue(wq);
 +}

And there's not need for separate function for this - put it in
line.

FWIW, it also leaks the hot_work structure, but you're going to
remove that anyway. ;)

 diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
 index d19e64a..7a79a6d 100644
 --- a/fs/hot_tracking.h
 +++ b/fs/hot_tracking.h
 @@ -36,6 +36,9 @@
   */
  #define TIME_TO_KICK 400
  
 +/* set how often to update temperatures (seconds) */
 +#define HEAT_UPDATE_DELAY 400

FWIW, 400 seconds is an unusual time period. It's expected that
periodic work might take place at intervals of 5 minutes, 10
minutes, etc, not 6m40s. It's much easier to predict and understand
behaviour if it's at a interval of whole units like minutes,
especially when looking at timestamped event traces. Hence 300s (5
minutes) makes a lot more sense as a period for updates...

  /*
   * The following comments explain what exactly comprises a unit of heat.
   *
 diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
 index 7114179..b37e0f8 100644
 --- a/include/linux/hot_tracking.h
 +++ b/include/linux/hot_tracking.h
 @@ -84,6 +84,8 @@ struct hot_info {
  
   /* map of range temperature */
   struct hot_map_head heat_range_map[HEAT_MAP_SIZE];
 +
 + struct workqueue_struct *update_wq;

Add the struct delayed_work here, too.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces

2012-10-15 Thread Dave Chinner
On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   FS_IOC_GET_HEAT_INFO: return a struct containing the various
 metrics collected in btrfs_freq_data structs, and also return a

I think you mean hot_freq_data :P

 calculated data temperature based on those metrics. Optionally, retrieve
 the temperature from the hot data hash list instead of recalculating it.

To get the heat info for a specific file you have to know what file
you want to get that info for, right?  I can see the usefulness of
asking for the heat data on a specific file, but how do you find the
hot files in the first place? i.e. the big question the user
interface needs to answer is what files are hot?.

Once userspace knows what the hottest files are, it can open them
and query the data via the above ioctl, but expecting userspace to
iterate millions of inodes in a filesystem to find hot files is very
inefficient.

FWIW, if you were to return file handles to the hottest files, then
the application could open and query them without even needing to
know the path name to them. This woul dbe exceedingly useful for
defragmentation programs, especially as that is the way xfs_fsr
already operates on candidate files.(*)

IOWs, sometimes the pathname is irrelevant to the operations that
applications want to perform - all they care about having an
efficient method of finding the inode they want and getting a file
descriptor that points to the file. Given the heat map info fits
right in to the sort of operations defrag and data mover tools
already do, it kind of makes sense to optimise the interface towards
those uses

(*) i.e. finds them via bulkstat which returns handle information
along with all the other inode data, then opens the file by handle
to do the defrag work

   FS_IOC_GET_HEAT_OPTS: return an integer representing the current
 state of hot data tracking and migration:
 
 0 = do nothing
 1 = track frequency of access
 
   FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
 migration, as described above.

I can't see how this is a manageable interface. It is not
persistent, so after every filesystem mount you'd have to set the
flag on all your inodes again. Hence, for the moment, I'd suggest
that dropping per-inode tracking control until all the core issues
are sorted out

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/16] f2fs: add inode operations for special inodes

2012-10-16 Thread Dave Chinner
On Tue, Oct 16, 2012 at 11:38:35AM +, Arnd Bergmann wrote:
 On Tuesday 16 October 2012, Jaegeuk Kim wrote:
  On Monday 15 October 2012, Dave Chinner wrote:
   On Sun, Oct 14, 2012 at 03:19:37PM +, Arnd Bergmann wrote:
On Sunday 14 October 2012, Vyacheslav Dubeyko wrote:
 On Oct 14, 2012, at 11:09 AM, Jaegeuk Kim wrote:
  2012-10-14 (일), 02:21 +0400, Vyacheslav Dubeyko:
   
The main reason I can see against extended attributes is that they are 
not stored
very efficiently in f2fs, unless a lot of work is put into coming up 
with a good
implementation. A single flags bit can trivially be added to the inode 
in
comparison (if it's not there already).
   
   That's a deficiency that should be corrected, then, because xattrs
   are very common these days.
  
  IMO, most file systems including f2fs have some inefficiency to store
  and retrieve xattrs, since they have to allocate an additional block.
  The only distinct problem in f2fs is that there is a cleaning overhead.
  So, that's the why xattr is not an efficient way in f2fs.
 
 I would hope that there is a better way to encode extented attributes
 if the current method is not efficient enough. Maybe Dave or someone
 else who is experienced with this can make suggestions.
 
 What is the expected size of the attributes normally?

Most attributes are small. Even large user attributes don't
generally get to be more than a couple of hundred bytes, though the
maximum size for a single xattr is 64K.

 Does it
 make sense to put attributes for multiple files into a single block?

There are two main ways of dealing with attributes. The first is a
tree-like structure to index and store unique xattrs, and have the
inode siimply keep pointers to the main xattr tree. This is great
for keeping space down when there are lots of identical xattrs, but
is a serialisation point for modification an modification can be
complex (i.e. shared entries in the tree need COW semantics.) This
is the approach ZFS takes, IIRC, and is the most space efficient way
of dealing with xattrs. It's not the most performance efficient way,
however, and the reference counting means frequent tree rewrites.

The second is the XFS/ext4 approach, where xattrs are stored in a
per-inode tree, with no sharing. The attribute tree holds the
attributes in it's leaves, and the tree grows and shrinks as you add
or remove xattrs. There are optimisations on top of this - e.g. for
XFS if the xattrs fit in the spare space in the inode, they are
packed into the inode (shortform) and don't require an external
block. IIUC, there are patches to implement this optimisation for
ext4 floating around at the moment. This is a worthwhile
optimisation, because with a 512 byte inode size on XFS there is
enough spare space (roughly 380 bytes) for most systems to store all
their xattrs in the inode itself. XFS also has remote attr storage
for large xattrs (i.e. larger than a leaf block), where the tree
just keeps a pointer to an external extent that holds the xattr.

IIRC, fs2fs uses 4k inodes, so IMO per-inode xattr tress with
internal storage before spilling to an external block is probably
the best approach to take...

  OTOH, I think xattr itself is for users, not for communicating
  between file system and users.
 
 No, you are mistaken in that point, as Dave explained.

e.g. selinux, IMA, ACLs, capabilities, etc all communicate
information that the kernel uses for access control. That's why
xattrs have different namespaces like system, security and
user. Only user attributes are truly for user data
- the rest are for communicating information to the kernel

A file usage policy xattr would definitely exist under the system
namespace - it's not a user xattr at all.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: fix buffer lookup race on allocation failure

2012-10-16 Thread Dave Chinner
On Tue, Oct 16, 2012 at 11:27:48AM +0200, William Dauchy wrote:
 Hello Dave,
 
 Thanks for your reply.
 
 On Tue, Oct 16, 2012 at 1:21 AM, Dave Chinner da...@fromorbit.com wrote:
  You're running a CONFIG_XFS_DEBUG kernel. If you can reproduce the
  problem with CONFIG_XFS_DEBUG, then it probably should be
  backported.
 
 Yes indeed.

Sorry, I had a typo in the above - I meant:

If you can reproduce the problem *without* CONFIG_XFS_DEBUG, then it
probably should be backported.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/16] f2fs: add inode operations for special inodes

2012-10-16 Thread Dave Chinner
On Wed, Oct 17, 2012 at 07:30:21AM +0900, Jaegeuk Kim wrote:
OTOH, I think xattr itself is for users, not for communicating
between file system and users.
   
   No, you are mistaken in that point, as Dave explained.
  
  e.g. selinux, IMA, ACLs, capabilities, etc all communicate
  information that the kernel uses for access control. That's why
  xattrs have different namespaces like system, security and
  user. Only user attributes are truly for user data
  - the rest are for communicating information to the kernel
  
 
 I agree that system is used by kernel.
 How about the file system view?

Not sure what you mean - the filesystem woul dsimply read the xattrs
in the system namespace as it needs, just like the other subsystems
like selinux or IMA do.

 Would you explain what file systems retrieve xattrs and use
 them with their own purpose?

I think cachefs users a CacheFiles.cache namespace for storing
information it needs in xattrs. ecryptfs stores crypto metadata in
xattrs in the lower filesytem. NFSv4 servers store junction mount
information in xattrs.

So there are examples where filesystems use xattrs for special
information. However, in most cases filesystems don't need xattrs
for their own metadata primarily because that gets added to their
own on-disk formats. IThe above are all overlay style filesystems
that don't have their own on-disk formats, so need to use xattrs to
store their per-inode metadata.

The case of access hints and allocation policies are not somethign
that are native to any filesystem on-disk format. They are abstract
concepts that really only the software generating/using that
information knows about. Given we want the software that uses this
information to be in VFS, it is separate from every filesystem and
this is exactly the use case that system xattrs were intended for.
:)

 Sorry, I'm not familiar with xattrs in depth.
 
 Unfortunately, system is not implemented in f2fs yet. :(

If you've already implemented the user.* namespace, then it's
trivial to support the other namespaces - it's just prefixing the
xattrs with the appropriate string instead of user

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 00/13] vfs: hot data tracking

2012-10-17 Thread Dave Chinner
On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote:
 On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner da...@fromorbit.com wrote:
  On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  (*) Tested on an empty 17TB XFS filesystem with:
 
  $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
  meta-data=/dev/vdc   isize=256agcount=17, agsize=268435455 
  blks
   =   sectsz=512   attr=2, projid32bit=0
  data =   bsize=4096   blocks=4563402735, imaxpct=5
   =   sunit=0  swidth=0 blks
  naming   =version 2  bsize=4096   ascii-ci=0
  log  =internal log   bsize=4096   blocks=131072, version=2
   =   sectsz=512   sunit=1 blks, lazy-count=1
  realtime =none   extsz=4096   blocks=0, rtextents=0
  $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
  $ sudo chmod 777 /mnt/scratch
  $ fs_mark  -D  1  -S0  -n  10  -s  1  -L  63  -d \
  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d \
  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d \
  /mnt/scratch/6  -d  /mnt/scratch/7
  .
   0 21601  16679.3 12552262
   0 22401  15412.4 12588587
   0 23201  16367.6 14199322
   0 24001  15680.4 15741205
  hangs here w/ OOM
 In this test, i haven't see you enable hot_track function via
 mount, why did it meet OOM?

I copied the wrong mount command. It was definitely enabled.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 00/13] vfs: hot data tracking

2012-10-17 Thread Dave Chinner
On Thu, Oct 18, 2012 at 12:44:47PM +0800, Zhi Yong Wu wrote:
 On Thu, Oct 18, 2012 at 12:29 PM, Dave Chinner da...@fromorbit.com wrote:
  On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote:
  On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner da...@fromorbit.com wrote:
   On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote:
   From: Zhi Yong Wu wu...@linux.vnet.ibm.com
  
   (*) Tested on an empty 17TB XFS filesystem with:
  
   $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc
   meta-data=/dev/vdc   isize=256agcount=17, 
   agsize=268435455 blks
=   sectsz=512   attr=2, projid32bit=0
   data =   bsize=4096   blocks=4563402735, 
   imaxpct=5
=   sunit=0  swidth=0 blks
   naming   =version 2  bsize=4096   ascii-ci=0
   log  =internal log   bsize=4096   blocks=131072, version=2
=   sectsz=512   sunit=1 blks, lazy-count=1
   realtime =none   extsz=4096   blocks=0, rtextents=0
   $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch
   $ sudo chmod 777 /mnt/scratch
   $ fs_mark  -D  1  -S0  -n  10  -s  1  -L  63  -d \
   /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d \
   /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d \
   /mnt/scratch/6  -d  /mnt/scratch/7
   .
0 21601  16679.3 12552262
0 22401  15412.4 12588587
0 23201  16367.6 14199322
0 24001  15680.4 15741205
   hangs here w/ OOM
  In this test, i haven't see you enable hot_track function via
  mount, why did it meet OOM?
 
  I copied the wrong mount command. It was definitely enabled.
 OK, BTW: fs_mark is the script written by you? After xfsprogs is
 installed, i haven't found this command.

# apt-get install fsmark

Or get the source here:

http://sourceforge.net/projects/fsmark/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RT and XFS

2005-07-13 Thread Dave Chinner
On Thu, Jul 14, 2005 at 10:22:46AM +1000, Nathan Scott wrote:
 Hi there,
 
 On Wed, Jul 13, 2005 at 09:45:58AM -0700, Daniel Walker wrote:
  On Wed, 2005-07-13 at 08:47 +0200, Ingo Molnar wrote:
   
   downgrade_write() wasnt the main problem - the main problem was that for 
   PREEMPT_RT i implemented 'strict' semaphores, which are not identical to 
   vanilla kernel semaphores. The thing that seemed to impact XFS the most 
   is the 'acquirer thread has to release the lock' rule of strict 
   semaphores. Both the XFS logging code and the XFS IO completion code 
   seems to release locks in a different context from where the acquire 
   happened. It's of course valid upstream behavior, but without these 
   extra rules it's hard to do sane priority inheritance. (who do you boost 
   if you dont really know who 'owns' the lock?) It might make sense to 
   introduce some sort of sem_pass_to(new_owner) interface? For now i 
   introduced a compat type, which lets those semaphores fall back to the 
   vanilla implementation.
 
 Hmm, I'm not aware of anywhere in XFS where we do that.  From talking
 to some colleagues here, they're claiming that we can't be doing that
 since it'd trip an assert in the IRIX mrlock code.

Now that I've read the thread, I see it's not mrlocks that is the
issue with unlocking in a different context - it's semaphores.

All the pagebuf synchronisation is done with a semaphore because
it's held across the I/O and it's _most definitely_ released in a
different context when doing async I/O. Just about all metadata I/O
is async because once the transaction has been logged to disk we
don't need to write these buffers out synchronously. Not to mention
the log I/O completion unlocks the buffers in a transaction in a
different context as well.

The whole point of using a semaphore in the pagebuf is because there
is no tracking of who owns the lock so we can actually release it
in a different context. Semaphores were invented for this purpose,
and we use them in the way they were intended. ;)

Realistically, I seriously doubt the need for any sort of rt changes
to these semaphores. They can be held for indeterminant periods of
time potentially across multiple disk I/Os (e.g. when held locked in
a transaction that requires more metadata to be read in from disk to
make progress).  Hence there is no really no point in making them RT
aware because if you end up waiting on one of them you can forget
about pretty much any RT guarantee that you've ever given

Cheers,

Dave.
-- 
Dave Chinner
RD Software Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] xfs: fix comment typo of struct xfs_da_blkinfo.

2012-07-19 Thread Dave Chinner
On Tue, Jul 17, 2012 at 10:40:21PM -0300, Paulo Alcantara wrote:
 From: Christoph Hellwig h...@infradead.org
 Date: Tue, 17 Jul 2012 03:06:43 -0400
 
  Btw, if you need more reviers for the syslinus support feel free to pass
  it by me (or the list).
 
 This is the branch I'm maintaing for the XFS readonly driver:
 
 git://zytor.com/users/pcacjr/syslinux.git (branch: xfs)
 
 The current status is:
 
 - We are able to find the root inode by reading rootino field from
   superblock (xfs_iget_root() function).
 - We are able to find inodes that have core's format set to local only 
 at
   the moment, which is by reading leaf entries from inode btrees. The
   function that looks up for an inode is the xfs_iget() one. We're looking
   forward to implement the handling of keys in inode btrees (extents) 
 also.
 
 Baoszi is currently working on the inode btree's key part (extents), and I'm
 working on the data part of the file inode, which is the xfs_next_extent()
 function.
 
 The xfs_next_extent() function must be able to read the inode (cast to the 
 data
 fork) and populate a structure that stores physical starting number sector and
 number of consecutive sectors that contain the inode's data so that Syslinux
 will know where to read from.

As we discussed on #xfs, I'm still not convinced that parsing the
filesystem metadata in the bootloader is the way to go. Far better,
IMO, is simply to supply the bootloader with a list of extents for
the blocks it needs to read to get the files it needs. You can get
them via fiemap(), and it'll work for XFS, btrfs, ext3, ext4, etc
without needing to write code to parse the on-disk format of every
filesystem.

Especially as we are in the process of making major changes to the
XFS on-disk format, which means you'll have to make significant
changes to support a second XFS disk format in the not-to-distant
future...

 Thanks for the interest in helping us!

We want it to work! ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Loopback device hung [was Re: xfs deadlock on 3.9-rc5 running xfstests case #78]

2013-04-09 Thread Dave Chinner
On Thu, Apr 04, 2013 at 04:30:54PM -0400, Phillip Susi wrote:
  I have not tested it yet, but I am pretty sure it won't work.  It
  looks like the patch changes the BLKRRPART path to go ahead and remove
  existing partitions when GENHD_FL_NO_PARTSCAN is set.  loop doesn't
  issue the BLKRRPART ioctl when !LO_FLAGS_PARTSCAN so this won't help.
   I think loop needs to set GENHD_FL_NO_PARTSCAN and then issue the
  ioctl regardless of the LO_FLAGS_PARTSCAN flag to get the partitions
  to be removed.  I will try to test tonight.
 
 After testing, my initial thoughts appeared to have been correct.  I had
 to modify the patch as follows.  To test, simply do:
 
 truncate -s 10m img
 losetup /dev/loop0 img
 parted /dev/loop0
 mklabel msdos
 mkpart primary ext2 1m 2m
 quit
 ls /dev/loop0*
 
 Note the /dev/loop0p1 node.  Run losetup -d /dev/loop0 and see if it is
 still there.

Jens, can we get one of these fixes merged quickly? xfstests is
unusable on any kernel more recent than 3.9-rc4 because of these
problems

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Loopback device hung [was Re: xfs deadlock on 3.9-rc5 running xfstests case #78]

2013-04-09 Thread Dave Chinner
On Tue, Apr 09, 2013 at 09:01:39AM +0200, Jens Axboe wrote:
 On Tue, Apr 09 2013, Dave Chinner wrote:
  On Thu, Apr 04, 2013 at 04:30:54PM -0400, Phillip Susi wrote:
I have not tested it yet, but I am pretty sure it won't work.  It
looks like the patch changes the BLKRRPART path to go ahead and remove
existing partitions when GENHD_FL_NO_PARTSCAN is set.  loop doesn't
issue the BLKRRPART ioctl when !LO_FLAGS_PARTSCAN so this won't help.
 I think loop needs to set GENHD_FL_NO_PARTSCAN and then issue the
ioctl regardless of the LO_FLAGS_PARTSCAN flag to get the partitions
to be removed.  I will try to test tonight.
   
   After testing, my initial thoughts appeared to have been correct.  I had
   to modify the patch as follows.  To test, simply do:
   
   truncate -s 10m img
   losetup /dev/loop0 img
   parted /dev/loop0
   mklabel msdos
   mkpart primary ext2 1m 2m
   quit
   ls /dev/loop0*
   
   Note the /dev/loop0p1 node.  Run losetup -d /dev/loop0 and see if it is
   still there.
  
  Jens, can we get one of these fixes merged quickly? xfstests is
  unusable on any kernel more recent than 3.9-rc4 because of these
  problems
 
 Yep, did the revert yesterday and it's going out today.

Great, I didn't see an update in the latest -rc, so I wanted to make
sure it hadn't fallen through he cracks

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

2013-04-09 Thread Dave Chinner
On Tue, Apr 09, 2013 at 12:13:59PM +0100, Mel Gorman wrote:
 On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote:
 
  I think that outside of zone loop is better place to run shrink_slab(),
  because shrink_slab() is not directly related to a specific zone.
  
 
 This is true and has been the case for a long time. The slab shrinkers
 are not zone aware and it is complicated by the fact that slab usage can
 indirectly pin memory on other zones.
..
  And this is a question not related to this patch.
  Why nr_slab is used here to decide zone-all_unreclaimable?
 
 Slab is not directly associated with a slab but as reclaiming slab can
 free memory from unpredictable zones we do not consider a zone to be
 fully unreclaimable until we cannot shrink slab any more.

This is something the numa aware shrinkers will greatly help with -
instead of being a global shrink it becomes a
node-the-zone-belongs-to shrink, and so

 You may be thinking that this is extremely heavy handed and you're
 right, it is.

... it is much less heavy handed than the current code...

  nr_slab is not directly related whether a specific zone is reclaimable
  or not, and, moreover, nr_slab is not directly related to number of
  reclaimed pages. It just say some objects in the system are freed.
  
 
 All true, it's the indirect relation between slab objects and the memory
 that is freed when slab objects are reclaimed that has to be taken into
 account.

Node awareness within the shrinker infrastructure and LRUs make the
relationship much more direct ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Dave Chinner
On Thu, Apr 11, 2013 at 10:57:08PM -0400, Theodore Ts'o wrote:
 On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
I think it might be more enlightening if Mel traced which process in
  which funclion is holding the buffer lock. I suspect we'll find out that
  the flusher thread has submitted the buffer for IO as an async write and
  thus it takes a long time to complete in presence of reads which have
  higher priority.
 
 That's an interesting theory.  If the workload is one which is very
 heavy on reads and writes, that could explain the high latency.  That
 would explain why those of us who are using primarily SSD's are seeing
 the problems, because would be reads are nice and fast.
 
 If that is the case, one possible solution that comes to mind would be
 to mark buffer_heads that contain metadata with a flag, so that the
 flusher thread can write them back at the same priority as reads.

Ext4 is already using REQ_META for this purpose.

I'm surprised that no-one has suggested change the IO elevator
yet.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-12 Thread Dave Chinner
On Fri, Apr 12, 2013 at 11:19:52AM -0400, Theodore Ts'o wrote:
 On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote:
   If that is the case, one possible solution that comes to mind would be
   to mark buffer_heads that contain metadata with a flag, so that the
   flusher thread can write them back at the same priority as reads.
  
  Ext4 is already using REQ_META for this purpose.
 
 We're using REQ_META | REQ_PRIO for reads, not writes.
 
  I'm surprised that no-one has suggested change the IO elevator
  yet.
 
 Well, testing to see if the stalls go away with the noop schedule is a
 good thing to try just to validate the theory.

Exactly.

 The thing is, we do want to make ext4 work well with cfq, and
 prioritizing non-readahead read requests ahead of data writeback does
 make sense.  The issue is with is that metadata writes going through
 the block device could in some cases effectively cause a priority
 inversion when what had previously been an asynchronous writeback
 starts blocking a foreground, user-visible process.

Here's the historic problem with CFQ: it's scheduling algorithms
change from release to release, and so what you tune the filesystem
to for this release is likely to cause different behaviour
in a few releases time.

We've had this problem time and time again with CFQ+XFS, so we
stopped trying to tune to a particular elevator long ago.  The
best you can do it tag the Io as appropriately as possible (e.g.
metadata with REQ_META, sync IO with ?_SYNC, etc), and then hope CFQ
hasn't been broken since the last release

 At least, that's the theory; we should confirm that this is indeed
 what is causing the data stalls which Mel is reporting on HDD's before
 we start figuring out how to fix this problem.

*nod*.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Loopback device hung [was Re: xfs deadlock on 3.9-rc5 running xfstests case #78]

2013-04-02 Thread Dave Chinner
[Added jens Axboe to CC]

On Tue, Apr 02, 2013 at 02:08:49AM -0400, CAI Qian wrote:
 Saw on almost all the servers range from x64, ppc64 and s390x with kernel
 3.9-rc5 and xfsprogs-3.1.10. Never caught this in 3.9-rc4, so looks like
 something new broke this. Log is here with sysrq debug info.
 http://people.redhat.com/qcai/stable/log

Has nothing to do with XFS:

[34762.105676] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
message. 
[34762.152871] systemd-udevd   D 88030bc94440 0  4391412 0x0084 
[34762.196660]  8801134d9a98 0082 880303e09ac0 
8801134d9fd8 
[34762.240512]  8801134d9fd8 8801134d9fd8 880304f98000 
880303e09ac0 
[34762.287047]  8802936c1ac0 88060521f998 88060521f99c 
880303e09ac0 
[34762.331446] Call Trace: 
[34762.345444]  [816274a9] schedule+0x29/0x70 
[34762.373908]  [8162777e] schedule_preempt_disabled+0xe/0x10 
[34762.411394]  [81625ae3] __mutex_lock_slowpath+0xc3/0x140 
[34762.447597]  [8162561a] mutex_lock+0x2a/0x50 
[34762.476785]  [811daddb] __blkdev_get+0x6b/0x4b0 
[34762.508839]  [812a619a] ? selinux_file_alloc_security+0x4a/0x80 
[34762.546436]  [811db3bd] blkdev_get+0x19d/0x2e0 
[34762.580977]  [812a18fa] ? 
inode_has_perm.isra.31.constprop.61+0x2a/0x30 
[34762.624346]  [811db55f] blkdev_open+0x5f/0x90 
[34762.656318]  [811a00bf] do_dentry_open+0x20f/0x2c0 
[34762.689762]  [811db500] ? blkdev_get+0x2e0/0x2e0 
[34762.71]  [811a01a5] finish_open+0x35/0x50 
[34762.754363]  [811b082e] do_last+0x6de/0xde0 
[34762.783808]  [811ad378] ? inode_permission+0x18/0x50 
[34762.819265]  [811ad428] ? link_path_walk+0x78/0x880 
[34762.853700]  [812a619a] ? selinux_file_alloc_security+0x4a/0x80 
[34762.892881]  [811b0fe7] path_openat+0xb7/0x4b0 
[34762.923621]  [811ad0e0] ? getname_flags.part.33+0x30/0x150 
[34762.960839]  [811b16a1] do_filp_open+0x41/0xa0 
[34762.992114]  [811bddc2] ? __alloc_fd+0x42/0x110 
[34763.023342]  [811a14f4] do_sys_open+0xf4/0x1e0 
[34763.054129]  [811a1601] sys_open+0x21/0x30 
[34763.082134]  [81630d59] system_call_fastpath+0x16/0x1b 


And:

[34763.116218] INFO: task umount:4421 blocked for more than 120 seconds. 
[34763.153670] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
message. 
[34763.200321] umount  D 88030bcb4440 0  4421452 0x0080 
[34763.242026]  8801134fbcb8 0082 8802936c1ac0 
8801134fbfd8 
[34763.287320]  8801134fbfd8 8801134fbfd8 8803069d 
8802936c1ac0 
[34763.330722]  8802936c1ac0 88060521f998 88060521f99c 
8802936c1ac0 
[34763.374428] Call Trace: 
[34763.388977]  [816274a9] schedule+0x29/0x70 
[34763.419164]  [8162777e] schedule_preempt_disabled+0xe/0x10 
[34763.456279]  [81625ae3] __mutex_lock_slowpath+0xc3/0x140 
[34763.492474]  [8162561a] mutex_lock+0x2a/0x50 
[34763.520840]  [81417da5] loop_clr_fd+0x285/0x480 
[34763.552050]  [81418800] lo_release+0x70/0x80 
[34763.580787]  [811dabac] __blkdev_put+0x17c/0x1d0 
[34763.612479]  [811dac57] blkdev_put+0x57/0x140 
[34763.641557]  [811a409d] kill_block_super+0x4d/0x80 
[34763.674257]  [811a4467] deactivate_locked_super+0x57/0x80 
[34763.708755]  [811a4fee] deactivate_super+0x4e/0x70 
[34763.742906]  [811bfbb7] mntput_no_expire+0xd7/0x130 
[34763.776872]  [811c0a9c] sys_umount+0x9c/0x3c0 
[34763.811819]  [81630d59] system_call_fastpath+0x16/0x1b 


It's clearly a loopback device problem, stuck on the
bdev-bd_inode-i_mutex. And there's changesin the loop device
teardown since 3.9-rc4:

$ glo v3.9-rc4..HEAD -- drivers/block/loop.c  
c1681bf loop: prevent bdev freeing while device in use
8761a3d loop: cleanup partitions when detaching loop device
183cfb5 loop: fix error return code in loop_add()
$

So this looks like someone hasn't been testing their loopback
driver changes properly...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-17 Thread Dave Chinner
On Sun, Mar 17, 2013 at 01:06:59PM +, David Howells wrote:
 Miklos Szeredi mik...@szeredi.hu wrote:
 
  Export do_splice_direct() to modules.  Needed by overlay filesystem.
 
 Apparently you cannot call this from any function that is holding an i_mutex
 if the target of the splice uses generic_file_splice_write().
 
 The problem is a potential deadlock situation:
 
 We have places already that do:
 
   mnt_want_write()
   mutex_lock()
 
 This can be found in do_last() for example.
 
 However, mnt_want_write() calls sb_start_write() as does
 generic_file_splice_write().  So now in ovl_copy_up_locked() you're adding:
 
   mutex_lock()
   sb_start_write()
 
 which lockdep reports as a potential ABBA deadlock.
 
 Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
 might operate, so it's possible that this is a false alarm.  Maybe Jan Kara 
 can
 illuminate further, so I've added him to the cc list.
 
 I've attached the report I got with unionmount.

There's plenty of problems with splice locking that can lead to
deadlocks. Here's another that's been known for ages:

http://oss.sgi.com/archives/xfs/2011-08/msg00168.html

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-18 Thread Dave Chinner
On Wed, Mar 13, 2013 at 10:00:51PM -0400, Peter Hurley wrote:
 On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
  We don't care about the ordering between multiple concurrent
  metadata modifications - what matters is whether the ongoing data IO
  around them is ordered correctly.
 
 Dave,
 
 The point that Michel is making is that there never was any ordering
 guarantee by rwsem. It's an illusion.

Weasel words.

 The reason is simple: to even get to the lock the cpu has to be
 sleep-able. So for every submission that you believe is ordered, is by
 its very nature __not ordered__, even when used by kernel code.

 Why? Because any thread on its way to claim the lock (reader or writer)
 could be pre-empted for some other task, thus delaying the submission of
 whatever i/o you believed to be ordered.

You think I don't know this?  You're arguing fine grained, low level
behaviour between tasks is unpredictable. I get that. I understand
that. But I'm not arguing about fine-grained, low level, microsecond
semantics of the locking order

What you (and Michael) appear to be failing to see is what happens
on a macro level when you have read locks being held for periods
measured in *seconds* (e.g. direct IO gets queued behind a few
thousand other IOs in the elevator waiting for a request slot),
and the subsequent effect of inserting an operation that requires a
write lock into that IO stream.

IOWs, it simply doesn't matter if there's a micro-level race between
the write lock and a couple of the readers. That's the level you
guys are arguing at but it simply does not matter in the cases I'm
describing. I'm talking about high level serialisation behaviours
that might take of *seconds* to play out and the ordering behaviours
observed at that scale.

That is, I don't care if a couple of threads out of a few thousand
race with the write lock over few tens to hundreds of microseconds,
but I most definitely care if a few thousand IOs issued seconds
after the write lock is queued jump over the write lock. That is a
gross behavioural change at the macro-level.

 So just to reiterate: there is no 'queue' and no 'barrier'. The
 guarantees that rwsem makes are;
 1. Multiple readers can own the lock.
 2. Only a single writer can own the lock.
 3. Readers will not starve writers.

You've conveniently ignored the fact that the current implementation
also provides following guarantee:

4. new readers will block behind existing writers

And that's the behaviour we currently depend on, whether you like it
or not.

 Where lock policy can have a significant impact is on performance. But
 predicting that impact is difficult -- it's better just to measure.

Predicting the impact in this case is trivial - it's obvious that
ordering of operations will change and break high level assumptions
that userspace currently makes about various IO operations on XFS
filesystems

 It's not my intention to convince you (or anyone else) that there should
 only be One True Rwsem, because I don't believe that. But I didn't want
 the impression to persist that rwsem does anything more than implement a
 fair reader/writer semaphore.

I'm sorry, but redefining fair to suit your own needs doesn't
convince me of anything. rwsem behaviour has been unchanged for at
least 10 years and hence the current implementation defines what is
fair, not what you say is fair

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue

2013-03-20 Thread Dave Chinner
On Tue, Mar 19, 2013 at 10:28:24AM -0700, Tejun Heo wrote:
 Hello, Jan.
 
 On Tue, Mar 19, 2013 at 8:46 AM, Jan Kara j...@suse.cz wrote:
Well, but what you often get is just output of sysrq-w, or sysrq-t, or
  splat from scheduler about stuck task. You often don't have the comfort of
  tracing... Can't we somehow change 'comm' of the task when it starts
  processing work of some bdi?
 
 You sure can but I'd prefer not to do that. If you wanna do it
 properly, you have to grab task lock every time a work item starts
 execution. I'm not sure how beneficial having the block device
 identifier would be. Backtrace would be there the same. Is identifying
 the block device that important?

When you have a system that has 50+ active filesystems (pretty
common in the distributed storage environments were every disk has
it's own filesystem), knowing which filesystem(s) are getting stuck
in writeback from the sysrq-w or hangcheck output is pretty damn
important

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Tux3 Report: Initial fsck has landed

2013-03-21 Thread Dave Chinner
On Wed, Mar 20, 2013 at 06:49:49PM -0700, Daniel Phillips wrote:
 On Tue, Mar 19, 2013 at 11:54 PM, Rob Landley r...@landley.net wrote:
  I'm confused, http://tux3.org/ lists a bunch of dates from 5 years ago, then
  nothing. Is this project dead or not?
 
 Not. We haven't done much about updating tux3.org lately, however you
 will find plenty of activity here:
 
  https://github.com/OGAWAHirofumi/tux3/tree/master/user
 
 You will also find fairly comprehensive updates on where we are and
 where this is going, here:
 
  http://phunq.net/pipermail/tux3/
 
 At the moment we're being pretty quiet because of being in the middle
 of developing the next-gen directory index. Not such a small task, as
 you might imagine.

Hi Daniel,

The next-gen directory index comment made me curious. I wanted to
know if there's anything I could learn from what you are doing and
whether anything of your new algorithms could be applied to, say,
the XFS directory structure to improve it.

I went looking for design docs and found this:

http://phunq.net/pipermail/tux3/2013-January/001938.html

In a word: Disappointment.

Compared to the XFS directory structure, the most striking
architectural similarity that I see is this:

the file bteee[sic] effectively is a second directory index
that imposes a stable ordering on directory blocks.

That was the key architectural innovation in the XFS directory
structure that allowed it to provide the correct seekdir/telldir/NFS
readdir semantics and still scale. i.e. virtually mapped directory
entries. I explained this layout recently here:

http://marc.info/?l=linux-ext4m=136081996316453w=2
http://marc.info/?l=linux-ext4m=136082221117399w=2
http://marc.info/?l=linux-ext4m=136089526928538w=2

We could swap the relevant portions of your PHTree design doc with
my comments (and vice versa) and both sets of references would still
make perfect sense. :P

Further, the PHTree description of tag based freespace tracking is
rather close to how XFS uses tags to track free space regions,
including the fact that XFS can be lazy at updating global free
space indexes.  The global freespace tree indexing is slightly
different to the XFS method - it's closer to the original V1 dir
code in XFS (that didn't scale at all well) than the current code.
However, that's really a fine detail compared to all the major
structural and algorithmic similarities.

Hence it appears to me that at a fundamental level PHTree is just a
re-implementation of the XFS directory architecture. It's definitely
a *major* step forward from HTree, but it can hardly be considered
revolutionary or next-gen. It's not even state of the art. Hence:
disappointment.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vfs: dcache: cond_resched in shrink_dentry_list

2013-03-25 Thread Dave Chinner
On Mon, Mar 25, 2013 at 10:22:31AM -0700, Greg Thelen wrote:
 Call cond_resched() from shrink_dentry_list() to preserve
 shrink_dcache_parent() interactivity.
 
 void shrink_dcache_parent(struct dentry * parent)
 {
   while ((found = select_parent(parent, dispose)) != 0)
   shrink_dentry_list(dispose);
 }
 
 select_parent() populates the dispose list with dentries which
 shrink_dentry_list() then deletes.  select_parent() carefully uses
 need_resched() to avoid doing too much work at once.  But neither
 shrink_dcache_parent() nor its called functions call cond_resched().
 So once need_resched() is set select_parent() will return single
 dentry dispose list which is then deleted by shrink_dentry_list().
 This is inefficient when there are a lot of dentry to process.  This
 can cause softlockup and hurts interactivity on non preemptable
 kernels.

Hi Greg,

I can see how this coul dcause problems, but isn't the problem then
that shrink_dcache_parent()/select_parent() itself is mishandling
the need for rescheduling rather than being a problem with
the shrink_dentry_list() implementation?  i.e. select_parent() is
aborting batching based on a need for rescheduling, but then not
doing that itself and assuming that someone else will do the
reschedule for it?

Perhaps this is a better approach:

-   while ((found = select_parent(parent, dispose)) != 0)
+   while ((found = select_parent(parent, dispose)) != 0) {
shrink_dentry_list(dispose);
+   cond_resched();
+   }

With this, select_parent() stops batching when a resched is needed,
we dispose of the list as a single batch and only then resched if it
was needed before we go and grab the next batch. That should fix the
small batch problem without the potential for changing the
shrink_dentry_list() behaviour adversely for other users

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vfs: dcache: cond_resched in shrink_dentry_list

2013-03-25 Thread Dave Chinner
On Mon, Mar 25, 2013 at 05:39:13PM -0700, Greg Thelen wrote:
 On Mon, Mar 25 2013, Dave Chinner wrote:
  On Mon, Mar 25, 2013 at 10:22:31AM -0700, Greg Thelen wrote:
  Call cond_resched() from shrink_dentry_list() to preserve
  shrink_dcache_parent() interactivity.
  
  void shrink_dcache_parent(struct dentry * parent)
  {
 while ((found = select_parent(parent, dispose)) != 0)
 shrink_dentry_list(dispose);
  }
  
  select_parent() populates the dispose list with dentries which
  shrink_dentry_list() then deletes.  select_parent() carefully uses
  need_resched() to avoid doing too much work at once.  But neither
  shrink_dcache_parent() nor its called functions call cond_resched().
  So once need_resched() is set select_parent() will return single
  dentry dispose list which is then deleted by shrink_dentry_list().
  This is inefficient when there are a lot of dentry to process.  This
  can cause softlockup and hurts interactivity on non preemptable
  kernels.
 
  Hi Greg,
 
  I can see how this coul dcause problems, but isn't the problem then
  that shrink_dcache_parent()/select_parent() itself is mishandling
  the need for rescheduling rather than being a problem with
  the shrink_dentry_list() implementation?  i.e. select_parent() is
  aborting batching based on a need for rescheduling, but then not
  doing that itself and assuming that someone else will do the
  reschedule for it?
 
  Perhaps this is a better approach:
 
  -   while ((found = select_parent(parent, dispose)) != 0)
  +   while ((found = select_parent(parent, dispose)) != 0) {
  shrink_dentry_list(dispose);
  +   cond_resched();
  +   }
 
  With this, select_parent() stops batching when a resched is needed,
  we dispose of the list as a single batch and only then resched if it
  was needed before we go and grab the next batch. That should fix the
  small batch problem without the potential for changing the
  shrink_dentry_list() behaviour adversely for other users
 
 I considered only modifying shrink_dcache_parent() as you show above.
 Either approach fixes the problem I've seen.  My initial approach adds
 cond_resched() deeper into shrink_dentry_list() because I thought that
 there might a secondary benefit: shrink_dentry_list() would be willing
 to give up the processor when working on a huge number of dentry.  This
 could improve interactivity during shrinker and umount.  I don't feel
 strongly on this and would be willing to test and post the
 add-cond_resched-to-shrink_dcache_parent approach.

The shrinker has interactivity problems because of the global
dcache_lru_lock, not because of ithe size of the list passed to
shrink_dentry_list(). The amount of work that shrink_dentry_list()
does here is already bound by the shrinker batch size. Hence in the
absence of the RT folk complaining about significant holdoffs I
don't think there is an interactivity problem through the shrinker
path.

As for the unmount path - shrink_dcache_for_umount_subtree() - that
doesn't use shrink_dentry_list() and so would need it's own internal
calls to cond_resched().  Perhaps it's shrink_dcache_sb() that you
are concerned about?  Either way, And there are lots more similar
issues in the unmount path such as evict_inodes(), so unless you are
going to give every possible path through unmount/remount/bdev
invalidation the same treatment then changing shrink_dentry_list()
won't significantly improve the interactivity of the system
situation in these paths...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-08 Thread Dave Chinner
On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
 When the first queued waiter is a reader, wake all readers instead of
 just those that are at the front of the queue. There are really two
 motivations for this change:

Isn't this a significant change of semantics for the rwsem? i.e.
that read lock requests that come after a write lock request now
jump ahead of the write lock request? i.e.the write lock request is
no longer a barrier in the queue?

XFS has long assumed that a rwsem write lock is a barrier that
stops new read locks from being taken, and this change will break
that assumption. Given that this barrier assumption is used as the
basis for serialisation of operations like IO vs truncate, there's a
bit more at stake than just improving parallelism here.  i.e. IO
issued after truncate/preallocate/hole punch could now be issued
ahead of the pending metadata operation, whereas currently the IO
issued after the pending metadata operation is waiting for the write
lock will be only be processed -after- the metadata modification
operation completes...

That is a recipe for weird data corruption problems because
applications are likely to have implicit dependencies on the barrier
effect of metadata operations on data IO...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-10 Thread Dave Chinner
On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
 On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner da...@fromorbit.com wrote:
  On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
  When the first queued waiter is a reader, wake all readers instead of
  just those that are at the front of the queue. There are really two
  motivations for this change:
 
  Isn't this a significant change of semantics for the rwsem? i.e.
  that read lock requests that come after a write lock request now
  jump ahead of the write lock request? i.e.the write lock request is
  no longer a barrier in the queue?
 
 Yes, I am allowing readers to skip ahead of writers in the queue (but
 only if they can run with another reader that was already ahead).
 
 I don't see that this is a change of observable semantics for correct
 programs. If a reader and a writer both block on the rwsem, how do you
 known for sure which one got queued first ? rwsem API doesn't give you
 any easy way to know whether a thread is currently queued on the rwsem
 (it could also be descheduled before it gets onto the rwsem queue).

There are algorithms that rely on write locks to act as read-side
barriers to prevent write side starvation. i.e. if you keep queuing
up read locks, the writer never gets the lock, thereby starving the
writer.

 But yes, if you're making assumptions about queuing order the change
 makes it more likely that they'll be observably wrong.
 
  XFS has long assumed that a rwsem write lock is a barrier that
  stops new read locks from being taken, and this change will break
  that assumption. Given that this barrier assumption is used as the
  basis for serialisation of operations like IO vs truncate, there's a
  bit more at stake than just improving parallelism here.  i.e. IO
  issued after truncate/preallocate/hole punch could now be issued
  ahead of the pending metadata operation, whereas currently the IO
  issued after the pending metadata operation is waiting for the write
  lock will be only be processed -after- the metadata modification
  operation completes...
 
  That is a recipe for weird data corruption problems because
  applications are likely to have implicit dependencies on the barrier
  effect of metadata operations on data IO...
 
 I am confused as to exactly what XFS is doing, could you point me to
 the code / indicate a scenario where this would go wrong ? If you
 really rely on this for correctness you'd have to do something already
 to guarantee that your original queueing order is as desired, and I
 just don't see how it'd be done...

My point isn't that XFS gets the serialisation wrong - my point is
that the change of queuing order will change the userspace visible
behaviour of the filesystem *significantly*.

For example: - direct IO only takes read locks, while truncate takes
a write lock.  Right now we can drop a truncate, preallocation or
hole punch operation into a stream of concurrent direct IO and have
it execute almost immediately - the barrier mechanism in the rwsem
ensures that it will be executed ahead of any future IO that is
issued by the application. With your proposed change, that operation
won't take place until all current *and all future* IOs stop and the
read lock is no longer held by anyone.

To put this in context, here's the Irix XFS iolock initialisation
code from mid 1997, where the barrier semantics are first explicitly
documented:

mrlock_init(ip-i_iolock, MRLOCK_BARRIER, xfsio, (long)vp-v_number);

It was only coded like this in 1997, because the Irix multi-reader
lock only grew read-side queue jumping and priority inheritence at
this time. This changed the default behaviour from write lock
barrier semantics to something similar to what you are proposing for
rwsems right now. All the locks that relied on barrier semantics of
write locks had to be configured explicitly to have that behaviour
rather than implicitly relying on the fact that mrlocks had provided
write barrier semantics.

So, that's my concern - we've got 20 years of algorithms in XFS
designed around rwsem write locks providing read-side barriers, and
it's been down this road once before (over 15 years ago).  Changing
the semantics of rwsems is going to break stuff in subtle and
unpredictable ways

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-11 Thread Dave Chinner
On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
 On Sun, Mar 10, 2013 at 5:16 PM, Dave Chinner da...@fromorbit.com wrote:
  On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
  On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner da...@fromorbit.com wrote:
   Isn't this a significant change of semantics for the rwsem? i.e.
   that read lock requests that come after a write lock request now
   jump ahead of the write lock request? i.e.the write lock request is
   no longer a barrier in the queue?
 
  Yes, I am allowing readers to skip ahead of writers in the queue (but
  only if they can run with another reader that was already ahead).
 
  I don't see that this is a change of observable semantics for correct
  programs. If a reader and a writer both block on the rwsem, how do you
  known for sure which one got queued first ? rwsem API doesn't give you
  any easy way to know whether a thread is currently queued on the rwsem
  (it could also be descheduled before it gets onto the rwsem queue).
 
  There are algorithms that rely on write locks to act as read-side
  barriers to prevent write side starvation. i.e. if you keep queuing
  up read locks, the writer never gets the lock, thereby starving the
  writer.
 
 What I propose actually isn't as bad as a reader preference rwlock
 like rwlock_t. When a reader makes it to the head of the queue, all
 readers gets woken. At that point there are only writers left in the
 queue, and new readers will get queued after them and won't be able to
 skip over them (since they have no earlier reader to join up with).
 So, I am not throwing reader/writer fairness out the window here.

OK, but

  My point isn't that XFS gets the serialisation wrong - my point is
  that the change of queuing order will change the userspace visible
  behaviour of the filesystem *significantly*.
 
  For example: - direct IO only takes read locks, while truncate takes
  a write lock.  Right now we can drop a truncate, preallocation or
  hole punch operation into a stream of concurrent direct IO and have
  it execute almost immediately - the barrier mechanism in the rwsem
  ensures that it will be executed ahead of any future IO that is
  issued by the application. With your proposed change, that operation
  won't take place until all current *and all future* IOs stop and the
  read lock is no longer held by anyone.
 
 You actually wouldn't get such starvation with my proposal.
 
 What you might see would be:
 
 - Assuming you have up to N concurrent reads in progress, a writer
 might see up to 2N-1 readers proceed in front of it. This limit is
 reached if you first had N-1 readers grabbing the rwsem with an empty
 queue, then another writer W1 got queued,

So something like

RW1

 then a reader RN (note that
 it will get queued after W1 and not run concurrently with the existing
 N-1 readers), then our writer W2 gets queued.

So:

RW1W2

 As our N-1 readers
 complete their IO operations, they might get queued again after W2,

So:
W1W2R

 and then skip ahead of W2 when RN reaches the front of the queue. So,
 W2 is still guaranteed to run eventually, but with a worst case
 latency that is ~2x worse than before

So, when W1 is released, the queue is treated as though it was:

RW2

yes? Ok, so I can see where your latency figure comes from, but it's
still a change of ordering in that W2 is no longer a barrier to the
reads queued after it.

And if we extend that to WN, we have an interleaved queue
similar to this:

W1R1W2R2W3R3.WnRm

By my reading of the algorithm you are proposing, after W1 is
released, we end up with the queue being treated like this:

R1R2R3RmW2W3Wn

Right?

If so, that is definitely a change of lock ordering semantics -
writes do not have barrier properties anymore.

 - since all readers are woken at once, you might see burst of direct
 IO operations followed by bursts of truncate operations, instead of
 having them interleaved in smaller groups as before.

And this will result in the same application IO pattern giving
vastly different results. e.g. a read IO promoted ahead of a
truncate will now return data instead of -1 (beyond EOF).

 I'm not sure if these characteristics are good enough for the XFS use
 case. It seems tougher than many rwsem use cases, since the benefits
 of increased reader parallelism are not as clear here (i.e. the IO

Well defined, predictable, deterministc ordering of operations takes
far more precedence over parallelism when it comes to filesystem
behaviour. The rwsems already have enough parallelism to allow us to
drive more than 2 million 4k IOPS to a single file via
multi-threaded direct IO(*), so we aren't limited by parallelism and
concurrency in rwsems.

 So if
 this explanation still didn't made you comfortable with the proposal,
 I will rework it to avoid the queue

Re: torrent hash failures since 3.9.0-rc1

2013-03-12 Thread Dave Chinner
On Tue, Mar 12, 2013 at 09:28:54AM +0100, Sander wrote:
 Markus Trippelsdorf wrote (ao):
  On 2013.03.11 at 16:37 -0400, Theodore Ts'o wrote:
   We actually run fsx in a number of different configruations as part of
   our regression testing before we send Linus a pull request, and
   haven't found any issues.  So unless it's a hardware problem, it seems
   unlikely to me that your running fsx would turn up anything.
  
  Yes, I let it run for a while anyway and it didn't report any failure.
 
  Please note that my local rtorrent version was configured with
  --with-posix-fallocate.
 
 Would it be possible to enhance fsx to detect such an issue?

fsx in xfstests already uses fallocate() for preallocation and hole
punching, so such problems related to these operations can be found
using fsx. The issue here, however, involves memory reclaim
interactions and so is not something fsx can reproduce in isolation.  :/


Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-12 Thread Dave Chinner
On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
 Hi Dave,
 
 On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
  - since all readers are woken at once, you might see burst of direct
  IO operations followed by bursts of truncate operations, instead of
  having them interleaved in smaller groups as before.
 
  And this will result in the same application IO pattern giving
  vastly different results. e.g. a read IO promoted ahead of a
  truncate will now return data instead of -1 (beyond EOF).
 
 I will reply to this part first, as I think this gives a more concrete
 anchor to the discussion.
 
 The crucial point is that the truncate system call hasn't completed
 yet - that thread is still queued in the rwsem.
 
 You still have the guarantee that the truncate will complete
 eventually (even if there is a never-ending stream of IOs), and that
 all IO system calls that start after the truncate system call
 completes will see the file as truncated.

Sure, but the problem is not about when the syscall completes - the
problem is that you are changing where in the pipeline of IO the
truncate is initially executed.  i.e. ordering is not defined by
when an operation completes, but by the order ing which the queue is
processed after a blocking operation completes. That is not when the
syscall completes, but when the filesystem drops the exclusive lock.

From a single threaded userspace application perspective or
multithreaded apps with their own userspace locking, truncates
complete when either the syscall completes or some time after when
the application drops it's lock. Either way, we're looking at
completion time serialisation, and in that case what XFS or the
kernel does simply doesn't matter.

However, if we are talking about userspace applications that use
lockless IO algorithms or kernel side applications (like knfsd) that
are exposed directly to the filesystem locking semantics,
serialisation behaviour is actually defined by filesystem's
submission side locking behaviour. There is no external
serialisation of IO completion, so serialisation and ordering is
defined (for XFS) solely by the rwsem behaviour.

 You don't have guarantees about which system call will appear to run
 before the other if the execution times of the two system calls
 overlap each other, but you actually never had such a guarantee from a
 correctness perspective (i.e. you could never guarantee which of the
 two would get queued on the rwsem ahead of the other).

Sure, but as long as the submission side ordering is deterministic,
it doesn't matter.

  Ok, so I can see where your latency figure comes from, but it's
  still a change of ordering in that W2 is no longer a barrier to the
  reads queued after it.
 
 My claim is that it's not a visible change from a correctness
 perspective

I am not arguing that it is incorrect. I'm arguing that the change
of ordering semantics breaks assumptions a lot of code makes about
how these locks work.

  similar to this:
 
  W1R1W2R2W3R3.WnRm
 
  By my reading of the algorithm you are proposing, after W1 is
  released, we end up with the queue being treated like this:
 
  R1R2R3RmW2W3Wn
 
  Right?
 
 Yes, if what you are representing is the state of the queue at a given
 point of time (which implies R1..Rm must be distinct threads, not just
 the same thread doing repeated requests).

Yup, that's very typical.

 As new requests come in over time, one important point to remember is
 that each writer will see at most one batch of readers wake up ahead
 of it (though this batch may include readers that were originally
 queued both before and after it).

And that's *exactly* the problem with the changes you are proposing
- rwsems will no longer provide strongly ordered write vs read
barrier semantics.

 I find the name 'barrier' actually confusing when used to describe
 synchronous operations.  To me a barrier is usualy between
 asynchronous operations, and it is well defined which operations
 are ahead or behind of the barrier (either because they were
 queued by the same thread, or they were queued by different
 threads which may have synchronized together some other way).

When you have hundreds or thousands of threads doing IO to the one
file, it doesn't matter if the IO is issued synchronously or
asynchronously by the threads - you simply have a huge amount of
data IO concurrency and very, very deep pipelines.

Inserting a metadata modification (truncate, preallocation,
holepunch, etc) into that pipeline currently causes all new
submissions to queue behind the metadata modification, waits for
everything submitted before the metadata modification to complete
and then runs the metadata modification. Once it completes, it then
allows everything queued up to the next metadata modification to
run

That matches your definition of a barrier pretty well, I think

Re: [PATCH] xfs: Fix possible truncation of log data in xlog_bread_noalign()

2013-03-04 Thread Dave Chinner
On Mon, Mar 04, 2013 at 08:32:45AM +, Tony Lu wrote:
 Thanks for you following up.
 
 My apologize that I just found that it is one change I made before
 that causes this problem. This change forces mkfs.xfs to format
 xfs partitions whose sectorsize were not smaller than 4096 bytes,
 which was due to a bug that earlier versions of xfs used (struct
 *page)-private(long) as a bitmap to represent each block's state
 within a page (the size of a page could be 64K or larger, then it
 needs 128 bit or more to represent each block's state within a
 page).

You do realise that bug doesn't affect x86-64 platforms as they
don't support 64k pages?

 This is reproducible on 2.6.38.6 kernel on X86. But I do not get
 why this change makes the xfs log inconsistent during
 mount/cp/umount operations.

Neither do I, and I don't care to look any further because the
problem is of your own making. In future, please check first that
the bug you are reporting is reproducable on a current upstream
kernel and userspace.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)

2012-11-29 Thread Dave Chinner
(0x7f629fcfd000, 134225920)   = 0
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 
134217728) = 134217728
mmap(NULL, 536879104, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f626fcf9000
munmap(0x7f628fcfb000, 268443648)   = 0
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 
268435456) = 268435456
mmap(NULL, 1073750016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f622fcf7000
munmap(0x7f626fcf9000, 536879104)   = 0
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 
536870912) = 536870912
mmap(NULL, 2147491840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f61afcf5000
munmap(0x7f622fcf7000, 1073750016)  = 0
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 
1073741824) = 1073741824
mmap(NULL, 4294975488, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f60afcf3000
munmap(0x7f61afcf5000, 2147491840)  = 0
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 
2147483648) = 2147479552
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 
4096) = 4096
mmap(NULL, 8589942784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= -1 ENOMEM (Cannot allocate memory)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-29 Thread Dave Chinner
On Thu, Nov 29, 2012 at 02:29:33PM +0400, Glauber Costa wrote:
 On 11/29/2012 01:28 AM, Dave Chinner wrote:
  On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
  On 11/28/2012 07:17 AM, Dave Chinner wrote:
  On Wed, Nov 28, 2012 at 01:13:11AM +, Chris Wilson wrote:
  On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner da...@fromorbit.com 
  wrote:
  The shrinker doesn't work on bytes - it works on
  + * *objects*.
 
  And I thought you were reviewing the shrinker API to be useful where a
  single object may range between 4K and 4G.
 
  Which requires rewriting all the algorithms to not be dependent on
  the subsystems using a fixed size object. The shrinker control
  function is called shrink_slab() for a reason - it was expected to
  be used to shrink caches of fixed sized objects allocated from slab
  memory.
 
  It has no concept of the amount of memory that each object consumes,
  just an idea of how much *IO* it takes to replace the object in
  memory once it's been reclaimed. The DEFAULT_SEEKS is design to
  encode the fact it generally takes 2 IOs to replace either a LRU
  page or a filesystem slab object, and so balances the scanning based
  on that value. i.e. the shrinker algorithms are solidly based around
  fixed sized objects that have some relationship to the cost of
  physical IO operations to replace them in the cache.
 
  One nit: It shouldn't take 2IOs to replace a slab object, right?
  objects.
  
  A random dentry in a small directory will take on IO to read the
  inode, then another to read the block the dirent sits in. TO read an
  inode froma cached dentry will generally take one IO to read the
  inode, and another to read related, out of inode information (e.g.
  attributes or extent/block maps). Sometimes it will only take on IO,
  sometimes it might take 3 or, in the case of dirents, coult take
  hundreds of IOs if the directory structure is large enough.
  
  So a default of 2 seeks to replace any single dentry/inode in the
  cache is a pretty good default to use.
  
  This
  should be the cost of allocating a new page, that can contain, multiple
  Once the page is in, a new object should be quite cheap to come up with.
  
 
 Indeed. More on this in the next paragraph...

I'm not sure what you are trying to say here. Are you saying that
you think that the IO cost for replacing a slab cache object doesn't
matter?

  It's not the cost of allocating the page (a couple of microseconds)
  that is being considered - it the 3-4 orders of magnitude worse cost
  of reading the object from disk (could be 20ms). The slab/page
  allocation is lost in the noise compared to the time it takes to
  fill the page cache page with data or a single slab object.
  Essentially, slab pages with multiple objects in them are much more
  expensive to replace in the cache than a page cache page
  
  This is a very wild thought, but now that I am diving deep in the
  shrinker API, and seeing things like this:
 
  if (reclaim_state) {
  sc-nr_reclaimed += reclaim_state-reclaimed_slab;
  reclaim_state-reclaimed_slab = 0;
  }
  
  That's not part of the shrinker - that's part of the vmscan
  code, external to the shrinker infrastructure. It's getting
  information back from the slab caches behind the shrinkers, and it's
  not the full picture because many shrinkers are not backed by slab
  caches. It's a work around for not not having accurate feedback from
  the shrink_slab() code about how many pages were freed.
  
 I know it is not part of the shrinkers, and that is precisely my point.
 vmscan needs to go through this kinds of hacks because our API is not
 strong enough to just give it back the answer that matters to the caller.

What matters is that the slab caches are shrunk in proportion to the
page cache. i.e. balanced reclaim. For dentry and inode caches, what
matters is the number of objects reclaimed because the shrinker
algorithm balances based on the relative cost of object replacement
in the cache.

e.g. if you have 1000 pages in the page LRUs, and 1000 objects in
the dentry cache, each takes 1 IO to replace, then if you reclaim 2
pages from the page LRUs and 2 pages from the dentry cache, it will
take 2 IOs to replace the pages in the LRU, but 36 IOs to replace
the objects in the dentry cache that were reclaimed.

This is why the shrinker balances objects scanned vs LRU pages
scanned - it treats each page as an object and the shrinker relates
that to the relative cost of objects in the slab cache being
reclaimed. i.e. the focus is on keeping a balance between caches,
not reclaiming an absolute number of pages.

Note: I'm not saying this is perfect, what I'm trying to do is let
you know the why behind the current algorithm. i.e. why it mostly
works and why ignoring the principles behind why it works is going
fraught with danger...

  Essentially, the problem is an impedance mismatch between the way
  the LRUs are scanned/balanced (in pages) and slab caches are managed

Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers

2012-11-29 Thread Dave Chinner
On Thu, Nov 29, 2012 at 11:02:24AM -0800, Andi Kleen wrote:
 Dave Chinner da...@fromorbit.com writes:
 
  Comments, thoughts and flames all welcome.
 
 Doing the reclaim per CPU sounds like a big change in the VM balance. 

It's per node, not per CPU. And AFAICT, it hasn't changed the
balance of page cache vs inode/dentry caches under general, global
workloads at all.

 Doesn't this invalidate some zone reclaim mode settings?

No, because zone reclaim is per-node and the shrinkers now can
reclaim just from a single node. i.e. the behaviour is now better
suited to the aims of zone reclaim which is to free memory from a
single, targetted node. Indeed, I removed a hack in the zone reclaim
code that sprayed slab reclaim across the entire machine until
sufficient objects had been freed from the target node

 How did you validate all this?

fakenuma setups, various workloads that generate even dentry/slab
cache loadings across all nodes, adding page cache pressure on a
single node, watching slab reclaim from a single node. that sort of
thing.

I haven't really done any performance testing other than not
obviously slower. There's no point optimising anything before
there's any sort of agreement as to whether this is the right
approach to take or not

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Dave Chinner
On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote:
 On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason chris.ma...@fusionio.com wrote:
 
  Just reading the new blkdev_get_blocks, it looks like we're mixing
  shifts.  In direct-io.c map_bh-b_size is how much we'd like to map, and
  it has no relation at all to the actual block size of the device.  The
  interface is abusing b_size to ask for as large a mapping as possible.
 
 Ugh. That's a big violation of how buffer-heads are supposed to work:
 the block number is very much defined to be in multiples of b_size
 (see for example submit_bh() that turns it into a sector number).
 
 But you're right. The direct-IO code really *is* violating that, and
 knows that get_block() ends up being defined in i_blkbits regardless
 of b_size.

Same with mpage_readpages(), so it's not just direct IO that has
this problem

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-30 Thread Dave Chinner
On Fri, Nov 30, 2012 at 11:36:01AM -0500, Christoph Hellwig wrote:
 On Fri, Nov 30, 2012 at 01:49:10PM +1100, Dave Chinner wrote:
   Ugh. That's a big violation of how buffer-heads are supposed to work:
   the block number is very much defined to be in multiples of b_size
   (see for example submit_bh() that turns it into a sector number).
   
   But you're right. The direct-IO code really *is* violating that, and
   knows that get_block() ends up being defined in i_blkbits regardless
   of b_size.
  
  Same with mpage_readpages(), so it's not just direct IO that has
  this problem
 
 The mpage code may actually fall back to BHs.
 
 I have a version of the direct I/O code that uses the iomap_ops from the
 multi-page write code that you originally started.  It uses the new op
 as primary interface for direct I/O and provides a helper for
 filesystems that still use buffer heads internally.  I'll try to dust it
 off and send out a version for the current kernel.

So it was based on this interface?

(I went looking for this code on google a couple of days ago so I
could point at it and say we should be using an iomap structure,
not buffer heads, but it looks like I never posted it to fsdevel or
the xfs lists...)


diff --git a/include/linux/fs.h b/include/linux/fs.h
index 090f0ea..e247d62 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -522,6 +522,7 @@ enum positive_aop_returns {
 struct page;
 struct address_space;
 struct writeback_control;
+struct iomap;
 
 struct iov_iter {
const struct iovec *iov;
@@ -614,6 +615,9 @@ struct address_space_operations {
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
int (*error_remove_page)(struct address_space *, struct page *);
+
+   int (*iomap)(struct address_space *mapping, loff_t pos,
+   ssize_t length, struct iomap *iomap, int cmd);
 };
 
 /*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
new file mode 100644
index 000..7708614
--- /dev/null
+++ b/include/linux/iomap.h
@@ -0,0 +1,45 @@
+#ifndef _IOMAP_H
+#define _IOMAP_H
+
+/* -iomap a_op command types */
+#define IOMAP_READ 0x01/* read the current mapping starting at the
+  given position, trimmed to a maximum length.
+  FS's should use this to obtain and lock
+  resources within this range */
+#define IOMAP_RESERVE  0x02/* reserve space for an allocation that spans
+  the given iomap */
+#define IOMAP_ALLOCATE 0x03/* allocate space in a given iomap - must have
+  first been reserved */
+#define IOMAP_UNRESERVE0x04/* return unused reserved space for the 
given
+  iomap and used space. This will always be
+  called after a IOMAP_READ so as to allow the
+  FS to release held resources. */
+
+/* types of block ranges for multipage write mappings. */
+#define IOMAP_HOLE 0x01/* no blocks allocated, need allocation */
+#define IOMAP_DELALLOC 0x02/* delayed allocation blocks */
+#define IOMAP_MAPPED   0x03/* blocks allocated @blkno */
+#define IOMAP_UNWRITTEN0x04/* blocks allocated @blkno in unwritten 
state */
+
+struct iomap {
+   sector_tblkno;  /* first sector of mapping */
+   loff_t  offset; /* file offset of mapping, bytes */
+   ssize_t length; /* length of mapping, bytes */
+   int type;   /* type of mapping */
+   void*priv;  /* fs private data associated with map */
+};
+
+static inline bool
+iomap_needs_allocation(struct iomap *iomap)
+{
+   return iomap-type == IOMAP_HOLE;
+}
+
+/* multipage write interfaces use iomaps */
+typedef int (*mpw_actor_t)(struct address_space *mapping, void *src,
+   loff_t pos, ssize_t len, struct iomap *iomap);
+
+ssize_t multipage_write_segment(struct address_space *mapping, void *src,
+   loff_t pos, ssize_t length, mpw_actor_t actor);
+
+#endif /* _IOMAP_H */

Cheers,

Dave.
 
 
 -- 
 This message has been scanned for viruses and
 dangerous content by MailScanner, and is
 believed to be clean.
 
 

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch,v2] bdi: add a user-tunable cpu_list for the bdi flusher threads

2012-12-03 Thread Dave Chinner
On Mon, Dec 03, 2012 at 01:53:39PM -0500, Jeff Moyer wrote:
 Hi,
 
 In realtime environments, it may be desirable to keep the per-bdi
 flusher threads from running on certain cpus.  This patch adds a
 cpu_list file to /sys/class/bdi/* to enable this.  The default is to tie
 the flusher threads to the same numa node as the backing device (though
 I could be convinced to make it a mask of all cpus to avoid a change in
 behaviour).

The default seems reasonable to me.

 Comments, as always, are appreciated.
.

 +static ssize_t cpu_list_store(struct device *dev,
 + struct device_attribute *attr, const char *buf, size_t count)
 +{
 + struct backing_dev_info *bdi = dev_get_drvdata(dev);
 + struct bdi_writeback *wb = bdi-wb;
 + cpumask_var_t newmask;
 + ssize_t ret;
 + struct task_struct *task;
 +
 + if (!alloc_cpumask_var(newmask, GFP_KERNEL))
 + return -ENOMEM;
 +
 + ret = cpulist_parse(buf, newmask);
 + if (!ret) {
 + spin_lock(bdi-wb_lock);
 + task = wb-task;
 + if (task)
 + get_task_struct(task);
 + spin_unlock(bdi-wb_lock);
 + if (task) {
 + ret = set_cpus_allowed_ptr(task, newmask);
 + put_task_struct(task);
 + }

Why is this set here outside the bdi-flusher_cpumask_mutex?

Also, I'd prefer it named ..._lock as that is the normal
convention for such variables. You can tell the type of lock from
the declaration or the use...



 @@ -437,6 +488,14 @@ static int bdi_forker_thread(void *ptr)
   spin_lock_bh(bdi-wb_lock);
   bdi-wb.task = task;
   spin_unlock_bh(bdi-wb_lock);
 + mutex_lock(bdi-flusher_cpumask_mutex);
 + ret = set_cpus_allowed_ptr(task,
 + bdi-flusher_cpumask);
 + mutex_unlock(bdi-flusher_cpumask_mutex);

As it is set under the lock here

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4 16/31] loop: use aio to perform io on the underlying file

2012-12-03 Thread Dave Chinner
On Mon, Dec 03, 2012 at 10:59:39AM -0600, Dave Kleikamp wrote:
 On 11/22/2012 05:06 PM, Dave Chinner wrote:
  On Wed, Nov 21, 2012 at 04:40:56PM -0600, Dave Kleikamp wrote:
  From: Zach Brown z...@zabbo.net
 
  This uses the new kernel aio interface to process loopback IO by
  submitting concurrent direct aio.  Previously loop's IO was serialized
  by synchronous processing in a thread.
 
  The aio operations specify the memory for the IO with the bio_vec arrays
  directly instead of mappings of the pages.
 
  The use of aio operations is enabled when the backing file supports the
  read_iter and write_iter methods.  These methods must only be added when
  O_DIRECT on bio_vecs is supported.
 
  Signed-off-by: Dave Kleikamp dave.kleik...@oracle.com
  Cc: Zach Brown z...@zabbo.net
  
  I suspect aio iocompetion here doesn't work for FUA write IO.
 
 Since the underlying file system is doing direct IO, we at least know
 that the IO has been performed to the underlying device. It could still
 be in the devices write cache, so maybe an fsync is still needed. It
 wouldn't be hard to fix if that's the right thing to do.

FUA means the write is all the way to stable storage when completion
is signalled by the underlying device. That means the loop device
needs to (at least) fdatasync after the write completes...

  +static void lo_rw_aio_complete(u64 data, long res)
  +{
  +  struct bio *bio = (struct bio *)(uintptr_t)data;
  +
  +  if (res  0)
  +  res = 0;
  +  else if (res  0)
  +  res = -EIO;
  +
  +  bio_endio(bio, res);
  +}
  
  This effectively does nothing...
 
 It passes the IO completion from the underlying device to the caller.

Yes, I know. What I'm saying is that it does nothing on completion
of a FUA write when it should be doing a fsync.

  -  lo-lo_encrypt_key_size) {
  -  ret = -EOPNOTSUPP;
  -  goto out;
  -  }
  -  ret = file-f_op-fallocate(file, mode, pos,
  -  bio-bi_size);
  -  if (unlikely(ret  ret != -EINVAL 
  -   ret != -EOPNOTSUPP))
  -  ret = -EIO;
  -  goto out;
  -  }
  -
 ret = lo_send(lo, bio, pos);
   
 if ((bio-bi_rw  REQ_FUA)  !ret) {
  
  And as you can see here that after writing the data in the filebacked
  path, there's special handling for REQ_FUA (i.e. another fsync).
  
 
 In this path, the data may still be in the underlying filesystem's page
 cache.

Sure, but fsync means that it ends up on stable storage, as per the
requirement of an FUA write. Direct IO does not mean the data is on
stable storage, just that it bypasses the page cache.

  And this extra fsync is now not done in the aio path. I.e. the AIO
  completion path needs to issue the fsync to maintain correct REQ_FUA
  semantics...
 
 If this is really necessary, I can fix it.

Absolutely. If we don't implement FUA properly, we'll end up with
corrupted filesystems and/or data loss when kernel crashes or
powerloss occurs. That's not an acceptable outcome, so we need FUA
to be implemented properly.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Update atime from future.

2012-12-04 Thread Dave Chinner
On Tue, Dec 04, 2012 at 01:56:39AM +0800, yangsheng wrote:
 Relatime should update the inode atime if it is more than a day in the
 future.  The original problem seen was a tarball that had a bad atime,
 but could also happen if someone fat-fingers a touch.  The future
 atime will never be fixed.  Before the relatime patch, the future atime
 would be updated back to the current time on the next access.

So if someone accidentally changes time back a few days, access
times go backwards for everything? That doesn't sound right to me -
it's going to seriously screw up backups and other scanners that use
atime to determine newly accessed files

IMO, if you fat-finger a manual atime update or use atimes direct
from tarballs, then that's your problem as a user and not the
responsibility of the kernel to magically fix for you

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch,v2] bdi: add a user-tunable cpu_list for the bdi flusher threads

2012-12-04 Thread Dave Chinner
On Tue, Dec 04, 2012 at 09:42:55AM -0500, Jeff Moyer wrote:
 Dave Chinner da...@fromorbit.com writes:
 
  On Mon, Dec 03, 2012 at 01:53:39PM -0500, Jeff Moyer wrote:
  +static ssize_t cpu_list_store(struct device *dev,
  +  struct device_attribute *attr, const char *buf, size_t count)
  +{
  +  struct backing_dev_info *bdi = dev_get_drvdata(dev);
  +  struct bdi_writeback *wb = bdi-wb;
  +  cpumask_var_t newmask;
  +  ssize_t ret;
  +  struct task_struct *task;
  +
  +  if (!alloc_cpumask_var(newmask, GFP_KERNEL))
  +  return -ENOMEM;
  +
  +  ret = cpulist_parse(buf, newmask);
  +  if (!ret) {
  +  spin_lock(bdi-wb_lock);
  +  task = wb-task;
  +  if (task)
  +  get_task_struct(task);
  +  spin_unlock(bdi-wb_lock);
  +  if (task) {
  +  ret = set_cpus_allowed_ptr(task, newmask);
  +  put_task_struct(task);
  +  }
 
  Why is this set here outside the bdi-flusher_cpumask_mutex?
 
 The cpumask mutex protects updates to bdi-flusher_cpumask, it has
 nothing to do with the call to set_cpus_allowed.  We are protected from
 concurrent calls to cpu_list_store by the sysfs mutex that is taken on
 entry.  I understand that this is non-obvious, and it wouldn't be wrong
 to hold the mutex here.  If you'd like me to do that for clarity, that
 would be ok with me.

At minimum it needs a comment like this otherwise someone is going
to come along and ask why is that safe? like I just did. I'd
prefer the code to be obviously consistent to avoid the need for
commenting about the special case, especially when the obviously
correct code is simpler ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type

2013-02-11 Thread Dave Chinner
On Mon, Feb 11, 2013 at 05:25:58PM +0900, Namjae Jeon wrote:
 From: Namjae Jeon namjae.j...@samsung.com
 
 This patch is a follow up on below patch:
 
 [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
 commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 

 diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
 index a836118..3391800 100644
 --- a/fs/xfs/xfs_export.c
 +++ b/fs/xfs/xfs_export.c
 @@ -48,7 +48,7 @@ static int xfs_fileid_length(int fileid_type)
   case FILEID_INO32_GEN_PARENT | XFS_FILEID_TYPE_64FLAG:
   return 6;
   }
 - return 255; /* invalid */
 + return FILEID_INVALID; /* invalid */
  }

I think you can drop the /* invalid */ comment from there now as
it is redundant with this change.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6 RFC] Mapping range lock

2013-02-05 Thread Dave Chinner
On Mon, Feb 04, 2013 at 01:38:31PM +0100, Jan Kara wrote:
 On Thu 31-01-13 16:07:57, Andrew Morton wrote:
   c) i_mutex doesn't allow any paralellism of operations using it and some
  filesystems workaround this for specific cases (e.g. DIO reads). Using
  range locking allows for concurrent operations (e.g. writes, DIO) on
  different parts of the file. Of course, range locking itself isn't
  enough to make the parallelism possible. Filesystems still have to
  somehow deal with the concurrency when manipulating inode allocation
  data. But the range locking at least provides a common VFS mechanism 
   for
  serialization VFS itself needs and it's upto each filesystem to
  serialize more if it needs to.
  
  That would be useful to end-users, but I'm having trouble predicting
  *how* useful.
   As Zheng said, there are people interested in this for DIO. Currently
 filesystems each invent their own tweaks to avoid the serialization at
 least for the easiest cases.

The thing is, this won't replace the locking those filesystems use
to parallelise DIO - it just adds another layer of locking they'll
need to use. The locks filesystems like XFS use to serialise IO
against hole punch also serialise against many more internal
functions and so if these range locks don't have the same capability
we're going to have to retain those locks even after the range locks
are introduced. It basically means we're going to have two layers
of range locks - one for IO sanity and atomicity, and then this
layer just for hole punch vs mmap.

As i've said before, what we really need in XFS is IO range locks
because we need to be able to serialise operations against IO in
progress, not page cache operations in progress.  IOWs, locking at
the mapping tree level does not provide the right exclusion
semantics we need to get rid of the existing filesystem locking that
allows concurrent IO to be managed.  Hence the XFS IO path locking
suddenly because 4 locks deep:

i_mutex
  XFS_IOLOCK_{SHARED,EXCL}
mapping range lock
  XFS_ILOCK_{SHARED,EXCL}

That's because the buffered IO path uses per-page lock ranges and to
provide atomicity of read vs write, read vs truncate, etc we still
need to use the XFS_IOLOCK_EXCL to provide this functionality.

Hence I really think we need to be driving this lock outwards to
where the i_mutex currently sits, turning it into an *IO range
lock*, and not an inner-level mapping range lock. i.e flattening the
locking to:

io_range_lock(off, len)
  fs internal inode metadata modification lock

Yes, I know this causes problems with mmap and locking orders, but
perhaps we should be trying to get that fixed first because it
simplifies the whole locking schema we need for filesystems to
behave sanely. i.e. shouldn't we be aiming to simplify things
as we rework locking rather than make the more complex?

IOWs, I think the it's a mapping range lock approach is not the
right level to be providing IO exclusion semantics. After all, it's
entire IO ranges that we need to provide -atomic- exclusion for...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6 RFC] Mapping range lock

2013-02-06 Thread Dave Chinner
On Wed, Feb 06, 2013 at 08:25:34PM +0100, Jan Kara wrote:
 On Wed 06-02-13 10:25:12, Dave Chinner wrote:
  On Mon, Feb 04, 2013 at 01:38:31PM +0100, Jan Kara wrote:
   On Thu 31-01-13 16:07:57, Andrew Morton wrote:
 c) i_mutex doesn't allow any paralellism of operations using it and 
 some
filesystems workaround this for specific cases (e.g. DIO reads). 
 Using
range locking allows for concurrent operations (e.g. writes, DIO) 
 on
different parts of the file. Of course, range locking itself isn't
enough to make the parallelism possible. Filesystems still have to
somehow deal with the concurrency when manipulating inode 
 allocation
data. But the range locking at least provides a common VFS 
 mechanism for
serialization VFS itself needs and it's upto each filesystem to
serialize more if it needs to.

That would be useful to end-users, but I'm having trouble predicting
*how* useful.
 As Zheng said, there are people interested in this for DIO. Currently
   filesystems each invent their own tweaks to avoid the serialization at
   least for the easiest cases.
  
  The thing is, this won't replace the locking those filesystems use
  to parallelise DIO - it just adds another layer of locking they'll
  need to use. The locks filesystems like XFS use to serialise IO
  against hole punch also serialise against many more internal
  functions and so if these range locks don't have the same capability
  we're going to have to retain those locks even after the range locks
  are introduced. It basically means we're going to have two layers
  of range locks - one for IO sanity and atomicity, and then this
  layer just for hole punch vs mmap.
  
  As i've said before, what we really need in XFS is IO range locks
  because we need to be able to serialise operations against IO in
  progress, not page cache operations in progress.
   Hum, I'm not sure I follow you here. So mapping tree lock + PageLocked +
 PageWriteback serialize all IO for part of the file underlying the page.
 I.e. at most one of truncate (punch hole), DIO, writeback, buffered write,
 buffered read, page fault can run on that part of file.

Right, it serialises page cache operations sufficient to avoid
page cache coherence problems, but it does not serialise operations
sufficiently to provide atomicity between operations that should be
atomic w.r.t. each other.

 So how come it
 doesn't provide enough serialization for XFS?
 
 Ah, is it the problem that if two threads do overlapping buffered writes
 to a file then we can end up with data mixed from the two writes (if we
 didn't have something like i_mutex)?

That's one case of specific concern - the POSIX write() atomicity
guarantee - but it indicates the cause of many of my other concerns,
too. e.g. write vs prealloc, write vs punch, read vs truncate, write
vs truncate, buffered vs direct write, etc.

Basically, page-cache granularity locking for buffered IO means that
it cannot be wholly serialised against any other operation in
progress. That means we can't use the range lock to provide a
barrier to guarantee that no IO is currently in progress at all, and
hence it doesn't provide the IO barrier semantics we need for
various operations within XFS.

An example of this is that the online defrag ioctl requires copyin +
mtime updates in the write path are atomic w.r.t the swap extents
ioctl so that it can detect concurrent modification of the file being
defragged and abort. The page cache based range locks simply don't
provide this coverage, and so we'd need to maintain the IO operation
locking we currently have to provide this exclusion..

Truncate is something I also see as particularly troublesome,
because the i_mutex current provides mutual exclusion against the
operational range of a buffered write (i.e. at the .aio_write level)
and not page granularity like this patch set would result in. Hence
the behaviour of write vs truncate races could change quite
significantly. e.g.  instead of write completes then truncate or
truncate completes then write, we could have partial write,
truncate, write continues and completes resulting in a bunch of
zeros inside the range the write call wrote to. The application
won't even realise that the data it wrote was corrupted by the
racing truncate.

IOWs, I think that the fundamental unit of atomicity we need here is
the operational range of the syscall i.e. that each of the protected
operations needs to execute atomically as a whole with respect to
each other, not in a piecemeal fashion where some use whole range
locking and others use fine-grained page-range locking...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.8-rc5 xfs corruption

2013-01-30 Thread Dave Chinner
On Wed, Jan 30, 2013 at 10:16:47PM -0500, CAI Qian wrote:
 Hello,
 
 (Sorry to post to xfs mailing lists but unsure about which one is the
 best for this.)

Trimmed to just x...@oss.sgi.com.

 I have seen something like this once during testing on a system with a
 EMC VNX FC/multipath back-end.

This is a trace from the verifier code that was added in 3.8-rc1 so
I doubt it has anything to do with any problem you've seen in the
past

Can you tell us what workload you were running and what hardware you
are using as per:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

As it is, if you mounted the filesystem after this problem was
detected, log recovery probably propagated it to disk. I'd suggest
that you run xfs_repair -n on the device and post the output so we
can see if any corruption has actaully made it to disk. If no
corruption made it to disk, it's possible that we've got the
incorrect verifier attached to the buffer.

 [ 3025.063024] 8801a0d5: 2e 2e 2f 2e 2e 2f 75 73 72 2f 6c 69 62 2f 6d 
 6f  ../../usr/lib/mo 

The start of a block contains a path and the only
type of block that can contain this format of metadata is remote
symlink block. Remote symlink blocks don't have a verifier attached
to them as there is nothing that can currently be used to verify
them as correct.

I can't see exactly how this can occur as stale buffers have the
verifier ops cleared before being returned to the new user, and
newly allocated xfs_bufs are zeroed before being initialised. I
really need to know what you are doing to be able to get to the
bottom of it

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] xfs: Fix possible truncation of log data in xlog_bread_noalign()

2013-02-26 Thread Dave Chinner
On Tue, Feb 26, 2013 at 07:28:19AM +, Tony Lu wrote:
 I get a reliable way to reproduce this bug. The logprint and metadump are 
 attached.
 
 Kernel version: 2.6.38.8

This is important



 because this:

 4 umount /dev/sda1 /mnt
 5 mount /dev/sda1 /mnt
 XFS mounting filesystem sda1
 Starting XFS recovery on filesystem: sda1 (logdev: internal)
 Ending XFS recovery on filesystem: sda1 (logdev: internal)

Indicates that the unmount record is either not being written, it is
being written when there log has not been fully flushed or log
recovery is not finding it. You need to copy out the log
first to determine what the state of the log is before you mount the
filesystem - that way if log recovery is run you can see whether it
was supposed to run. (i.e. a clean log should never run recovery,
and unmount should always leave a clean log).

Either way, I'm more than 10,000 iterations into a run of 100k
iterations of this script on 3.8.0, and I have not seen a single log
recovery attempt occur. That implies you are seeing a bug in 2.6.38
that has since been fixed. It would be a good idea for you to
upgrade the system to a 3.8 kernel and determine if you can still
reproduce the problem on your system - that way we'll know if the
bug really has been fixed or not

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] ext4 updates for 3.9

2013-02-28 Thread Dave Chinner
On Wed, Feb 27, 2013 at 02:29:07PM -0500, Theodore Ts'o wrote:
 On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
  
  Looks like it's fixed here too.
  
  How did this make it through -next without anyone hitting it ?
 
  Is anyone running xfstests or similar on linux-next regularly ?
 
 I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
 before I submitted a pull request.  The problem is that XFSTESTS is
 S-L-O-W if you use large partitions, so typically I use a 5GB
 partition sizes for my test runs.

This isn't the case for XFS. I typically see 1TB scratch devices
only being ~10-20% slower than 10GB scratch devices, and 10TB only
being a little slower than 1TB scratch devices. I have to use sparse
devices and --large-fs for 100TB filesystem testing, so I can't
directly compare the speeds to those that I run on physical devices.
However I can say that it isn't significantly slower than using
small scratch devices...

 So what we probably need to do is to have a separate set of tests
 using a loopback mount, and perhaps an artificially created file
 system which has a large percentage of the blocks in the middle of the
 file system busied out, to make efficient testing of these sorts of

That's exactly what the --large-fs patch set I posted months ago does
for ext4 - it uses fallocate() to fill all but 50GB of the large
filesystem without actually writing any data and runs the standard
tests in the remaining unused space.

However, last time I tested ext4 with this patchset (when I posted
the patches months ago), multi-TB preallocation on ext4 was still too
slow to make it practical for testing on devices larger than 2-3TB.
Perhaps it would make testing 1-2TB ext4 filesystems fast enough
that you could do it regularly...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V5 00/30] loop: Issue O_DIRECT aio using bio_vec

2013-01-20 Thread Dave Chinner
On Fri, Jan 18, 2013 at 12:16:35PM -0500, Jeff Moyer wrote:
 Dave Kleikamp dave.kleik...@oracle.com writes:
 
  This patchset was begun by Zach Brown and was originally submitted for
  review in October, 2009. Feedback was positive, and I have picked up
  where he left off, porting his patches to the latest mainline kernel
  and adding support more file systems.
 
 [snip]
 
  My hopes are that this patchset is finally ready for linux-next.
 
 Hi, Shaggy,
 
 I'm finally getting around to testing this using xfstests.  This is my
 setup:
 
 dd if=/dev/zero of=testdev.img bs=1M count=1024
 dd if=/dev/zero of=scratchdev.img bs=1M count=1024
 losetup -f ./testdev.img
 losetup -f ./scratchdev.img
 mkfs -t xfs /dev/loop0
 mount /dev/loop0 /mnt/test
 export TEST_DIR=/mnt/test
 export TEST_DEV=/dev/loop0
 export SCRATCH_MNT=/mnt/scratch
 export SCRATCH_DEV=/dev/loop1
 
 I then ran:
 
 ./check -g aio
 
 and here is the summary:
 
 Ran: 112 113 198 207 208 209 210 211 212 239 240
 Failures: 112 198 207 239 240
 Failed 5 of 11 tests

There's no need to limit xfstests to the aio group here. If the loop
device is what you are actually testing, then you probably want to
run the auto/quick groups as they will do a whole lot more than just
data IO to the loop devices...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers

2013-01-21 Thread Dave Chinner
On Mon, Jan 21, 2013 at 08:08:53PM +0400, Glauber Costa wrote:
 On 11/28/2012 03:14 AM, Dave Chinner wrote:
  [PATCH 09/19] list_lru: per-node list infrastructure
  
  This makes the generic LRU list much more scalable by changing it to
  a {list,lock,count} tuple per node. There are no external API
  changes to this changeover, so is transparent to current users.
  
  [PATCH 10/19] shrinker: add node awareness
  [PATCH 11/19] fs: convert inode and dentry shrinking to be node
  
  Adds a nodemask to the struct shrink_control for callers of
  shrink_slab to set appropriately for their reclaim context. This
  nodemask is then passed by the inode and dentry cache reclaim code
  to the generic LRU list code to implement node aware shrinking.
 
 I have a follow up question that popped up from a discussion between me
 and my very American friend Johnny Wheeler, also known as Johannes
 Weiner (CC'd). I actually remember we discussing this, but don't fully
 remember the outcome. And since I can't find it anywhere, it must have
 been in a media other than e-mail. So I thought it would do no harm in
 at least documenting it...
 
 Why are we doing this per-node, instead of per-zone?
 
 It seems to me that the goal is to collapse all zones of a node into a
 single list, but since the number of zones is not terribly larger than
 the number of nodes, and zones is where the pressure comes from, what do
 we really gain from this?

The number is quite a bit higher - there are platforms with 5 zones
to a node. The reality is, though, for most platforms slab
allocations come from a single zone - they never come from ZONE_DMA,
ZONE_HIGHMEM or ZONE_MOVEABLE, so there is there is no good reason
for having cache LRUs for these zones. So, two zones at most.

And then there's the complexity issue - it's simple/trivial to user
per node lists, node masks, etc. It's an obvious abstraction that
everyone understands, is simle to understand, acheives exactly the
purpose that is needed and is not tied to the /current/
implementation of the current VM memory management code.

I don't see any good reason for tying LRUs to MM zones. the
original implementation of the per-node shrinkers by Nick Piggin did
this: the LRUs for the dentry and inode caches were embedded in the
struct zone, and it wasn't generically extensible because of that.
i.e. node-aware shrinkers were directly influenced by the zone
infrastructure and so the internal implementation of the mm
subsystem started leaking out and determining how completely
unrelated subsystems need to implement their own cache
management.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers

2013-01-23 Thread Dave Chinner
On Wed, Jan 23, 2013 at 06:36:33PM +0400, Glauber Costa wrote:
 On 01/22/2013 03:21 AM, Dave Chinner wrote:
  On Mon, Jan 21, 2013 at 08:08:53PM +0400, Glauber Costa wrote:
  On 11/28/2012 03:14 AM, Dave Chinner wrote:
  [PATCH 09/19] list_lru: per-node list infrastructure
 
  This makes the generic LRU list much more scalable by changing it to
  a {list,lock,count} tuple per node. There are no external API
  changes to this changeover, so is transparent to current users.
 
  [PATCH 10/19] shrinker: add node awareness
  [PATCH 11/19] fs: convert inode and dentry shrinking to be node
 
  Adds a nodemask to the struct shrink_control for callers of
  shrink_slab to set appropriately for their reclaim context. This
  nodemask is then passed by the inode and dentry cache reclaim code
  to the generic LRU list code to implement node aware shrinking.
 
  I have a follow up question that popped up from a discussion between me
  and my very American friend Johnny Wheeler, also known as Johannes
  Weiner (CC'd). I actually remember we discussing this, but don't fully
  remember the outcome. And since I can't find it anywhere, it must have
  been in a media other than e-mail. So I thought it would do no harm in
  at least documenting it...
 
  Why are we doing this per-node, instead of per-zone?
 
  It seems to me that the goal is to collapse all zones of a node into a
  single list, but since the number of zones is not terribly larger than
  the number of nodes, and zones is where the pressure comes from, what do
  we really gain from this?
  
  The number is quite a bit higher - there are platforms with 5 zones
  to a node. The reality is, though, for most platforms slab
  allocations come from a single zone - they never come from ZONE_DMA,
  ZONE_HIGHMEM or ZONE_MOVEABLE, so there is there is no good reason
  for having cache LRUs for these zones. So, two zones at most.
  
 Yes, but one would expect that most of those special zones would be
 present only in the first node, no? (correct me if I am wrong here).

As I understand it, every node has an identical zone setup (i.e. a
flat array of MAX_NR_ZONES zones in the struct pglist_data), and
pages are simply places the in the appropriate zones on each node...

Also, IIUC, the behaviour of the zones one each node is architecture
dependent, we can't make assumptions that certain zones are only
ever used on the first node...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] list_lru: per-node list infrastructure

2013-01-16 Thread Dave Chinner
On Wed, Jan 16, 2013 at 11:21:44AM -0800, Glauber Costa wrote:
 On 11/27/2012 03:14 PM, Dave Chinner wrote:
  From: Dave Chinner dchin...@redhat.com
  
  Now that we have an LRU list API, we can start to enhance the
  implementation.  This splits the single LRU list into per-node lists
  and locks to enhance scalability. Items are placed on lists
  according to the node the memory belongs to. To make scanning the
  lists efficient, also track whether the per-node lists have entries
  in them in a active nodemask.
  
  Signed-off-by: Dave Chinner dchin...@redhat.com
  ---
   include/linux/list_lru.h |   14 ++--
   lib/list_lru.c   |  160 
  +++---
   2 files changed, 129 insertions(+), 45 deletions(-)
  
  diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
  index 3423949..b0e3ba2 100644
  --- a/include/linux/list_lru.h
  +++ b/include/linux/list_lru.h
  @@ -8,21 +8,23 @@
   #define _LRU_LIST_H 0
   
   #include linux/list.h
  +#include linux/nodemask.h
   
  -struct list_lru {
  +struct list_lru_node {
  spinlock_t  lock;
  struct list_headlist;
  longnr_items;
  +} cacheline_aligned_in_smp;
  +
  +struct list_lru {
  +   struct list_lru_nodenode[MAX_NUMNODES];
  +   nodemask_t  active_nodes;
   };
   
 MAX_NUMNODES will default to 1  9, if I'm not mistaken. Your
 list_lru_node seems to be around 32 bytes on 64-bit systems (128 with
 debug). So we're talking about 16k per lru.

*nod*

It is relatively little compared to the number of inodes typically
on a LRU.

 The superblocks only, are present by the dozens even in a small system,
 and I believe the whole goal of this API is to get more users to switch
 to it. This can easily use up a respectable bunch of megs.
 
 Isn't it a bit too much ?

Maybe, but for active superblocks it only takes a handful of cached
inodes to make this 16k look like noise, so I didn't care. Indeed, a
typical active filesystem could be consuming gigabytes of memory in
the slab, so 16k is a tiny amount of overhead to track this amount
of memory more efficiently.

Most other LRU/shrinkers are tracking large objects and only have a
single LRU instance machine wide. Hence the numbers arguments don't
play out well in favour of a more complex, dynamic solution for
them, either. Sometimes dumb and simple is the best approach ;)

 I am wondering if we can't do better in here and at least allocate+grow
 according to the actual number of nodes.

We could add hotplug notifiers and grow/shrink the node array as
they get hot plugged, but that seems unnecessarily complex given
how rare such operations are.

If superblock proliferation is the main concern here, then doing
somethign as simple as allowing filesystems to specify they want
numa aware LRU lists via a mount_bdev() flag would solve this
problem. If the flag is set, then full numa lists are created.
Otherwise the LRU list simply has a single node and collapses all node
IDs down to 0 and ignores all NUMA optimisations...

That way the low item count virtual filesystems like proc, sys,
hugetlbfs, etc won't use up memory, but filesytems that actually
make use of NUMA awareness still get the more expensive, scalable
implementation. Indeed, any subsystem that is not performance or
location sensitive can use the simple single list version, so we can
avoid overhead in that manner system wide...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] list_lru: per-node list infrastructure

2013-01-16 Thread Dave Chinner
 into every memcg that
contains that type of object for that cache context

/me stops rambling

 Your current suggestion of going per-node only in the performance
 critical filesystems could also possibly work, provided this count is
 expected to be small.

The problem is deciding on a per filesystem basis. I was thinking
that all filesytsems of a specific type would use a particular type
of structure, not that specific instances of a filesystem could use
different types

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   5   6   7   8   9   10   >