Re: [Cluster-devel] [PATCH v1 06/11] locks: convert to i_lock to protect i_flock list

2013-06-04 Thread Jeff Layton
On Tue, 4 Jun 2013 17:22:08 -0400
"J. Bruce Fields"  wrote:

> On Fri, May 31, 2013 at 11:07:29PM -0400, Jeff Layton wrote:
> > Having a global lock that protects all of this code is a clear
> > scalability problem. Instead of doing that, move most of the code to be
> > protected by the i_lock instead.
> > 
> > The exceptions are the global lists that file_lock->fl_link sits on.
> > Those still need a global lock of some sort, so wrap just those accesses
> > in the file_lock_lock().
> > 
> > Note that this requires a small change to the /proc/locks code. Instead
> > of walking the fl_block list to look for locks blocked on the current
> > lock, we must instead walk the global blocker list and skip any that
> > aren't blocked on the current lock. Otherwise, we'd need to take the
> > i_lock on each inode as we go and that would create a lock inversion
> > problem.
> 
> locks_insert_lock sets fl_nspid after adding to the global list, so in
> theory I think that could leave /proc/locks seeing a garbage fl_nspid
> or fl_pid field.
> 

Well spotted. I think we can simply fix that by ensuring that we do the
insert onto the global list as the last thing in locks_insert_lock and
remove it from the global list first thing in locks_delete_lock. I'll
do that and test it out before I send the next iteration.
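
For illustration, the reordering being described might look roughly like
this (a sketch against the global-list helpers introduced earlier in the
series, not the actual follow-up patch):

static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl)
{
    fl->fl_nspid = get_pid(task_tgid(current));

    /* insert into the inode's i_flock list */
    fl->fl_next = *pos;
    *pos = fl;

    /*
     * Only now make the lock visible on the global list, so that
     * /proc/locks can never observe a half-initialized entry.
     */
    locks_insert_global_locks(fl);
}

static void locks_delete_lock(struct file_lock **thisfl_p)
{
    struct file_lock *fl = *thisfl_p;

    /* drop off the global list before any teardown */
    locks_delete_global_locks(fl);

    *thisfl_p = fl->fl_next;
    fl->fl_next = NULL;
    /* ... remainder unchanged ... */
}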

> I'm similarly worried about the deadlock detection code using the
> fl_next value before it's set but I'm not sure what the consequences
> are.
> 

This I'm less clear on. We already do the insert onto the global
blocked list after locks_insert_block. Am I missing something here?

> But I think both of those are fixable, and this looks OK otherwise?
> 
> Patch-sequencing nit: it'd be nice to have the big-but-boring
> lock_flocks->spin_lock(&i_lock) search-and-replace in a separate patch
> from the (briefer, but more subtle) deadlock and /proc/locks changes.
> Off the top of my head all I can think of is bulk-replace of
> lock_flocks() by lock_flocks(inode) (which ignores inode argument), then
> change definition of the lock_flocks(inode) in the patch which does the
> other stuff, then bulk replace of lock_flocks(inode) by
> spin_lock(&inode->i_lock), but maybe there's something simpler.
> 

Ok, maybe I can do some of these other changes before we switch to the
i_lock, and then do the big boring change after that. I'll see if I can
cook up a reasonable patch for that for the next iteration.
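
For reference, the transitional helper Bruce describes could be as trivial
as the following (a sketch only; the posted series may sequence this
differently):

/*
 * Step 1: mechanically replace lock_flocks() with lock_flocks(inode).
 * The inode argument is ignored for now and the global lock is still
 * what actually gets taken.
 */
void lock_flocks(struct inode *inode)
{
    spin_lock(&file_lock_lock);
}

void unlock_flocks(struct inode *inode)
{
    spin_unlock(&file_lock_lock);
}

/*
 * Step 2 (a later, small patch): change only these two definitions to
 * take inode->i_lock. A final mechanical pass can then replace the
 * calls with spin_lock(&inode->i_lock)/spin_unlock(&inode->i_lock)
 * and delete the helpers.
 */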

> --b.
> 
> 
> > 
> > Signed-off-by: Jeff Layton 
> > Signed-off-by: J. Bruce Fields 
> > ---
> >  Documentation/filesystems/Locking |   23 +--
> >  fs/afs/flock.c|5 +-
> >  fs/ceph/locks.c   |2 +-
> >  fs/ceph/mds_client.c  |8 +-
> >  fs/cifs/cifsfs.c  |2 +-
> >  fs/cifs/file.c|   13 ++--
> >  fs/gfs2/file.c|2 +-
> >  fs/lockd/svcsubs.c|   12 ++--
> >  fs/locks.c|  121 
> > -
> >  fs/nfs/delegation.c   |   11 ++--
> >  fs/nfs/nfs4state.c|8 +-
> >  fs/nfsd/nfs4state.c   |8 +-
> >  include/linux/fs.h|   11 
> >  13 files changed, 119 insertions(+), 107 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/Locking 
> > b/Documentation/filesystems/Locking
> > index 0706d32..13f91ab 100644
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -344,7 +344,7 @@ prototypes:
> >  
> >  
> >  locking rules:
> > -   file_lock_lock  may block
> > +   inode->i_lock   may block
> >  fl_copy_lock:  yes no
> >  fl_release_private:maybe   no
> >  
> > @@ -357,12 +357,21 @@ prototypes:
> > int (*lm_change)(struct file_lock **, int);
> >  
> >  locking rules:
> > -   file_lock_lock  may block
> > -lm_compare_owner:  yes no
> > -lm_notify: yes no
> > -lm_grant:  no  no
> > -lm_break:  yes no
> > -lm_change  yes no
> > +
> > +   inode->i_lock   file_lock_lock  may block
> > +lm_compare_owner:  yes maybe   no
> > +lm_notify: yes no  no
> > +lm_grant:  no  no  no
> > +lm_break:  yes no  no
> > +lm_change  yes no  no
> > +
> > +   ->lm_compare_owner is generally called with *an* inode->i_lock
> > +held. It may not be the i_lock of the inode for either file_lock being
> > +compared! This is the case with deadlock detection, since the code has
> > +to chase down the owners of locks that may be entirely unrelated to the
> > +one on which the lock is being acquired. For deadlock detection however,
> > +the file_lock_lock is also held. The locks primarily ensure that neither
> > +file_lock disappear out from under you while doing the comparison.

Re: [Cluster-devel] [PATCH v1 08/11] locks: convert fl_link to a hlist_node

2013-06-04 Thread J. Bruce Fields
On Fri, May 31, 2013 at 11:07:31PM -0400, Jeff Layton wrote:
> Testing has shown that iterating over the blocked_list for deadlock
> detection turns out to be a bottleneck. In order to alleviate that,
> begin the process of turning it into a hashtable. We start by turning
> the fl_link into a hlist_node and the global lists into hlists. A later
> patch will do the conversion of the blocked_list to a hashtable.

Even simpler would be if we could add a pointer to the (well, a) lock
that a lockowner is blocking on, and then we'd just have to follow a
pointer.  I haven't thought that through, though, perhaps that's hard to
make work.
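
A rough illustration of what deadlock detection could look like if an
owner did carry such a pointer (owner_blocked_on() is hypothetical --
nothing provides it today, which is the hard part):

static int posix_owner_deadlock(struct file_lock *caller_fl,
                                struct file_lock *block_fl)
{
    int i = 0;
    struct file_lock *fl = owner_blocked_on(block_fl->fl_owner);

    /* follow owner -> lock it waits on -> that lock's owner -> ... */
    while (fl) {
        if (i++ > MAX_DEADLK_ITERATIONS)
            return 0;       /* bail out, as the current code does */
        if (posix_same_owner(caller_fl, fl))
            return 1;       /* the caller would be waiting on itself */
        fl = owner_blocked_on(fl->fl_owner);
    }
    return 0;
}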

--b.

> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/locks.c |   32 
>  include/linux/fs.h |2 +-
>  2 files changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index fc35b9e..5ed056b 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -163,8 +163,8 @@ int lease_break_time = 45;
>  #define for_each_lock(inode, lockp) \
>   for (lockp = &inode->i_flock; *lockp != NULL; lockp = 
> &(*lockp)->fl_next)
>  
> -static LIST_HEAD(file_lock_list);
> -static LIST_HEAD(blocked_list);
> +static HLIST_HEAD(file_lock_list);
> +static HLIST_HEAD(blocked_list);
>  
>  /* Protects the two list heads above */
>  static DEFINE_SPINLOCK(file_lock_lock);
> @@ -173,7 +173,7 @@ static struct kmem_cache *filelock_cache __read_mostly;
>  
>  static void locks_init_lock_heads(struct file_lock *fl)
>  {
> - INIT_LIST_HEAD(&fl->fl_link);
> + INIT_HLIST_NODE(&fl->fl_link);
>   INIT_LIST_HEAD(&fl->fl_block);
>   init_waitqueue_head(&fl->fl_wait);
>  }
> @@ -207,7 +207,7 @@ void locks_free_lock(struct file_lock *fl)
>  {
>   BUG_ON(waitqueue_active(&fl->fl_wait));
>   BUG_ON(!list_empty(&fl->fl_block));
> - BUG_ON(!list_empty(&fl->fl_link));
> + BUG_ON(!hlist_unhashed(&fl->fl_link));
>  
>   locks_release_private(fl);
>   kmem_cache_free(filelock_cache, fl);
> @@ -486,7 +486,7 @@ static inline void
>  locks_insert_global_blocked(struct file_lock *waiter)
>  {
>   spin_lock(&file_lock_lock);
> - list_add(&waiter->fl_link, &blocked_list);
> + hlist_add_head(&waiter->fl_link, &blocked_list);
>   spin_unlock(&file_lock_lock);
>  }
>  
> @@ -494,7 +494,7 @@ static inline void
>  locks_delete_global_blocked(struct file_lock *waiter)
>  {
>   spin_lock(&file_lock_lock);
> - list_del_init(&waiter->fl_link);
> + hlist_del_init(&waiter->fl_link);
>   spin_unlock(&file_lock_lock);
>  }
>  
> @@ -502,7 +502,7 @@ static inline void
>  locks_insert_global_locks(struct file_lock *waiter)
>  {
>   spin_lock(&file_lock_lock);
> - list_add_tail(&waiter->fl_link, &file_lock_list);
> + hlist_add_head(&waiter->fl_link, &file_lock_list);
>   spin_unlock(&file_lock_lock);
>  }
>  
> @@ -510,7 +510,7 @@ static inline void
>  locks_delete_global_locks(struct file_lock *waiter)
>  {
>   spin_lock(&file_lock_lock);
> - list_del_init(&waiter->fl_link);
> + hlist_del_init(&waiter->fl_link);
>   spin_unlock(&file_lock_lock);
>  }
>  
> @@ -705,7 +705,7 @@ static struct file_lock *what_owner_is_waiting_for(struct 
> file_lock *block_fl)
>  {
>   struct file_lock *fl, *ret = NULL;
>  
> - list_for_each_entry(fl, &blocked_list, fl_link) {
> + hlist_for_each_entry(fl, &blocked_list, fl_link) {
>   if (posix_same_owner(fl, block_fl)) {
>   ret = fl->fl_next;
>   if (likely(ret))
> @@ -867,7 +867,7 @@ static int __posix_lock_file(struct inode *inode, struct 
> file_lock *request, str
>   goto out;
>   error = FILE_LOCK_DEFERRED;
>   locks_insert_block(fl, request);
> - if (list_empty(&request->fl_link))
> + if (hlist_unhashed(&request->fl_link))
>   locks_insert_global_blocked(request);
>   goto out;
>   }
> @@ -882,10 +882,10 @@ static int __posix_lock_file(struct inode *inode, 
> struct file_lock *request, str
>* Now that we know the request is no longer blocked, we can take it
>* off the global list. Some callers send down partially initialized
>* requests, so we only do this if FL_SLEEP is set. Also, avoid taking
> -  * the lock if the list is empty, as that indicates a request that
> +  * the lock if the hlist is unhashed, as that indicates a request that
>* never blocked.
>*/
> - if ((request->fl_flags & FL_SLEEP) && !list_empty(&request->fl_link))
> + if ((request->fl_flags & FL_SLEEP) && 
> !hlist_unhashed(&request->fl_link))
>   locks_delete_global_blocked(request);
>  
>   /*
> @@ -2277,11 +2277,11 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>   struct file_lock *fl, *bfl;
>  
> - fl = list_entry(v, struct file_lock, fl_link);
> +   

Re: [Cluster-devel] [PATCH v1 07/11] locks: only pull entries off of blocked_list when they are really unblocked

2013-06-04 Thread J. Bruce Fields
On Fri, May 31, 2013 at 11:07:30PM -0400, Jeff Layton wrote:
> Currently, when there is a lot of lock contention the kernel spends an
> inordinate amount of time taking blocked locks off of the global
> blocked_list and then putting them right back on again. When all of this
> code was protected by a single lock, then it didn't matter much, but now
> it means a lot of file_lock_lock thrashing.
> 
> Optimize this a bit by deferring the removal from the blocked_list until
> we're either applying or cancelling the lock. By doing this, and using a
> lockless list_empty check, we can avoid taking the file_lock_lock in
> many cases.
> 
> Because the fl_link check is lockless, we must ensure that only the task
> that "owns" the request manipulates the fl_link. Also, with this change,
> it's possible that we'll see an entry on the blocked_list that has a
> NULL fl_next pointer. In that event, just ignore it and continue walking
> the list.

OK, that sounds safe as in it shouldn't crash, but does the deadlock
detection still work, or can it miss loops?

Those locks that are temporarily NULL would previously not have been on
the list at all, OK, but...  I'm having trouble reasoning about how this
works now.

Previously a single lock was held uninterrupted across
posix_locks_deadlock and locks_insert_block(), which guaranteed we
shouldn't be adding a loop. Is that still true?

--b.

> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/locks.c |   29 +++--
>  1 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 055c06c..fc35b9e 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -520,7 +520,6 @@ locks_delete_global_locks(struct file_lock *waiter)
>  static void __locks_delete_block(struct file_lock *waiter)
>  {
>   list_del_init(&waiter->fl_block);
> - locks_delete_global_blocked(waiter);
>   waiter->fl_next = NULL;
>  }
>  
> @@ -704,13 +703,16 @@ EXPORT_SYMBOL(posix_test_lock);
>  /* Find a lock that the owner of the given block_fl is blocking on. */
>  static struct file_lock *what_owner_is_waiting_for(struct file_lock 
> *block_fl)
>  {
> - struct file_lock *fl;
> + struct file_lock *fl, *ret = NULL;
>  
>   list_for_each_entry(fl, &blocked_list, fl_link) {
> - if (posix_same_owner(fl, block_fl))
> - return fl->fl_next;
> + if (posix_same_owner(fl, block_fl)) {
> + ret = fl->fl_next;
> + if (likely(ret))
> + break;
> + }
>   }
> - return NULL;
> + return ret;
>  }
>  
>  static int posix_locks_deadlock(struct file_lock *caller_fl,
> @@ -865,7 +867,8 @@ static int __posix_lock_file(struct inode *inode, struct 
> file_lock *request, str
>   goto out;
>   error = FILE_LOCK_DEFERRED;
>   locks_insert_block(fl, request);
> - locks_insert_global_blocked(request);
> + if (list_empty(&request->fl_link))
> + locks_insert_global_blocked(request);
>   goto out;
>   }
>   }
> @@ -876,6 +879,16 @@ static int __posix_lock_file(struct inode *inode, struct 
> file_lock *request, str
>   goto out;
>  
>   /*
> +  * Now that we know the request is no longer blocked, we can take it
> +  * off the global list. Some callers send down partially initialized
> +  * requests, so we only do this if FL_SLEEP is set. Also, avoid taking
> +  * the lock if the list is empty, as that indicates a request that
> +  * never blocked.
> +  */
> + if ((request->fl_flags & FL_SLEEP) && !list_empty(&request->fl_link))
> + locks_delete_global_blocked(request);
> +
> + /*
>* Find the first old lock with the same owner as the new lock.
>*/
>   
> @@ -1069,6 +1082,7 @@ int posix_lock_file_wait(struct file *filp, struct 
> file_lock *fl)
>   continue;
>  
>   locks_delete_block(fl);
> + locks_delete_global_blocked(fl);
>   break;
>   }
>   return error;
> @@ -1147,6 +1161,7 @@ int locks_mandatory_area(int read_write, struct inode 
> *inode,
>   }
>  
>   locks_delete_block(&fl);
> + locks_delete_global_blocked(&fl);
>   break;
>   }
>  
> @@ -1859,6 +1874,7 @@ static int do_lock_file_wait(struct file *filp, 
> unsigned int cmd,
>   continue;
>  
>   locks_delete_block(fl);
> + locks_delete_global_blocked(fl);
>   break;
>   }
>  
> @@ -2160,6 +2176,7 @@ posix_unblock_lock(struct file *filp, struct file_lock 
> *waiter)
>   else
>   status = -ENOENT;
>   spin_unlock(&inode->i_lock);
> + locks_delete_global_blocked(waiter);
>   return status;
>  }
>  
> -- 
> 1.7.1
> 



Re: [Cluster-devel] [PATCH v1 06/11] locks: convert to i_lock to protect i_flock list

2013-06-04 Thread J. Bruce Fields
On Fri, May 31, 2013 at 11:07:29PM -0400, Jeff Layton wrote:
> Having a global lock that protects all of this code is a clear
> scalability problem. Instead of doing that, move most of the code to be
> protected by the i_lock instead.
> 
> The exceptions are the global lists that file_lock->fl_link sits on.
> Those still need a global lock of some sort, so wrap just those accesses
> in the file_lock_lock().
> 
> Note that this requires a small change to the /proc/locks code. Instead
> of walking the fl_block list to look for locks blocked on the current
> lock, we must instead walk the global blocker list and skip any that
> aren't blocked on the current lock. Otherwise, we'd need to take the
> i_lock on each inode as we go and that would create a lock inversion
> problem.

locks_insert_lock sets fl_nspid after adding to the global list, so in
theory I think that could leave /proc/locks seeing a garbage fl_nspid
or fl_pid field.

I'm similarly worried about the deadlock detection code using the
fl_next value before it's set but I'm not sure what the consequences
are.

But I think both of those are fixable, and this looks OK otherwise?

Patch-sequencing nit: it'd be nice to have the big-but-boring
lock_flocks->spin_lock(&i_lock) search-and-replace in a separate patch
from the (briefer, but more subtle) deadlock and /proc/locks changes.
Off the top of my head all I can think of is bulk-replace of
lock_flocks() by lock_flocks(inode) (which ignores inode argument), then
change definition of the lock_flocks(inode) in the patch which does the
other stuff, then bulk replace of lock_flocks(inode) by
spin_lock(&inode->i_lock), but maybe there's something simpler.

--b.


> 
> Signed-off-by: Jeff Layton 
> Signed-off-by: J. Bruce Fields 
> ---
>  Documentation/filesystems/Locking |   23 +--
>  fs/afs/flock.c|5 +-
>  fs/ceph/locks.c   |2 +-
>  fs/ceph/mds_client.c  |8 +-
>  fs/cifs/cifsfs.c  |2 +-
>  fs/cifs/file.c|   13 ++--
>  fs/gfs2/file.c|2 +-
>  fs/lockd/svcsubs.c|   12 ++--
>  fs/locks.c|  121 
> -
>  fs/nfs/delegation.c   |   11 ++--
>  fs/nfs/nfs4state.c|8 +-
>  fs/nfsd/nfs4state.c   |8 +-
>  include/linux/fs.h|   11 
>  13 files changed, 119 insertions(+), 107 deletions(-)
> 
> diff --git a/Documentation/filesystems/Locking 
> b/Documentation/filesystems/Locking
> index 0706d32..13f91ab 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -344,7 +344,7 @@ prototypes:
>  
>  
>  locking rules:
> - file_lock_lock  may block
> + inode->i_lock   may block
>  fl_copy_lock:yes no
>  fl_release_private:  maybe   no
>  
> @@ -357,12 +357,21 @@ prototypes:
>   int (*lm_change)(struct file_lock **, int);
>  
>  locking rules:
> - file_lock_lock  may block
> -lm_compare_owner:yes no
> -lm_notify:   yes no
> -lm_grant:no  no
> -lm_break:yes no
> -lm_changeyes no
> +
> + inode->i_lock   file_lock_lock  may block
> +lm_compare_owner:yes maybe   no
> +lm_notify:   yes no  no
> +lm_grant:no  no  no
> +lm_break:yes no  no
> +lm_changeyes no  no
> +
> + ->lm_compare_owner is generally called with *an* inode->i_lock
> +held. It may not be the i_lock of the inode for either file_lock being
> +compared! This is the case with deadlock detection, since the code has
> +to chase down the owners of locks that may be entirely unrelated to the
> +one on which the lock is being acquired. For deadlock detection however,
> +the file_lock_lock is also held. The locks primarily ensure that neither
> +file_lock disappear out from under you while doing the comparison.
>  
>  --- buffer_head ---
>  prototypes:
> diff --git a/fs/afs/flock.c b/fs/afs/flock.c
> index 2497bf3..03fc0d1 100644
> --- a/fs/afs/flock.c
> +++ b/fs/afs/flock.c
> @@ -252,6 +252,7 @@ static void afs_defer_unlock(struct afs_vnode *vnode, 
> struct key *key)
>   */
>  static int afs_do_setlk(struct file *file, struct file_lock *fl)
>  {
> + struct inode = file_inode(file);
>   struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host);
>   afs_lock_type_t type;
>   struct key *key = file->private_data;
> @@ -273,7 +274,7 @@ static int afs_do_setlk(struct file *file, struct 
> file_lock *fl)
>  
>   type = (fl->fl_type == F_RDLCK) ? AFS_LOCK_READ : AFS_LOCK_WRITE;
>  
> - lock_flocks();
> + spin_

Re: [Cluster-devel] [PATCH v1 05/11] locks: encapsulate the fl_link list handling

2013-06-04 Thread J. Bruce Fields
On Fri, May 31, 2013 at 11:07:28PM -0400, Jeff Layton wrote:
> Move the fl_link list handling routines into a separate set of helpers.
> Also move the global list handling out of locks_insert_block, and into
> the caller that ends up triggering it as that allows us to eliminate the
> IS_POSIX check there.

ACK.

--b.

> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/locks.c |   34 +-
>  1 files changed, 29 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index cef0e04..caca466 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -494,13 +494,38 @@ static int posix_same_owner(struct file_lock *fl1, 
> struct file_lock *fl2)
>   return fl1->fl_owner == fl2->fl_owner;
>  }
>  
> +/* Remove a blocker or lock from one of the global lists */
> +static inline void
> +locks_insert_global_blocked(struct file_lock *waiter)
> +{
> + list_add(&waiter->fl_link, &blocked_list);
> +}
> +
> +static inline void
> +locks_delete_global_blocked(struct file_lock *waiter)
> +{
> + list_del_init(&waiter->fl_link);
> +}
> +
> +static inline void
> +locks_insert_global_locks(struct file_lock *waiter)
> +{
> + list_add_tail(&waiter->fl_link, &file_lock_list);
> +}
> +
> +static inline void
> +locks_delete_global_locks(struct file_lock *waiter)
> +{
> + list_del_init(&waiter->fl_link);
> +}
> +
>  /* Remove waiter from blocker's block list.
>   * When blocker ends up pointing to itself then the list is empty.
>   */
>  static void __locks_delete_block(struct file_lock *waiter)
>  {
>   list_del_init(&waiter->fl_block);
> - list_del_init(&waiter->fl_link);
> + locks_delete_global_blocked(waiter);
>   waiter->fl_next = NULL;
>  }
>  
> @@ -524,8 +549,6 @@ static void locks_insert_block(struct file_lock *blocker,
>   BUG_ON(!list_empty(&waiter->fl_block));
>   list_add_tail(&waiter->fl_block, &blocker->fl_block);
>   waiter->fl_next = blocker;
> - if (IS_POSIX(blocker))
> - list_add(&waiter->fl_link, &blocked_list);
>  }
>  
>  /* Wake up processes blocked waiting for blocker.
> @@ -552,7 +575,7 @@ static void locks_wake_up_blocks(struct file_lock 
> *blocker)
>   */
>  static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl)
>  {
> - list_add(&fl->fl_link, &file_lock_list);
> + locks_insert_global_locks(fl);
>  
>   fl->fl_nspid = get_pid(task_tgid(current));
>  
> @@ -573,7 +596,7 @@ static void locks_delete_lock(struct file_lock **thisfl_p)
>  
>   *thisfl_p = fl->fl_next;
>   fl->fl_next = NULL;
> - list_del_init(&fl->fl_link);
> + locks_delete_global_locks(fl);
>  
>   if (fl->fl_nspid) {
>   put_pid(fl->fl_nspid);
> @@ -839,6 +862,7 @@ static int __posix_lock_file(struct inode *inode, struct 
> file_lock *request, str
>   goto out;
>   error = FILE_LOCK_DEFERRED;
>   locks_insert_block(fl, request);
> + locks_insert_global_blocked(request);
>   goto out;
>   }
>   }
> -- 
> 1.7.1
> 



Re: [Cluster-devel] [PATCH v1 04/11] locks: make "added" in __posix_lock_file a bool

2013-06-04 Thread J. Bruce Fields
On Fri, May 31, 2013 at 11:07:27PM -0400, Jeff Layton wrote:
> ...save 3 bytes of stack space.
> 
> Signed-off-by: Jeff Layton 

ACK.

--b.

> ---
>  fs/locks.c |9 +
>  1 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index a7d2253..cef0e04 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -800,7 +800,8 @@ static int __posix_lock_file(struct inode *inode, struct 
> file_lock *request, str
>   struct file_lock *left = NULL;
>   struct file_lock *right = NULL;
>   struct file_lock **before;
> - int error, added = 0;
> + int error;
> + bool added = false;
>  
>   /*
>* We may need two file_lock structures for this operation,
> @@ -894,7 +895,7 @@ static int __posix_lock_file(struct inode *inode, struct 
> file_lock *request, str
>   continue;
>   }
>   request = fl;
> - added = 1;
> + added = true;
>   }
>   else {
>   /* Processing for different lock types is a bit
> @@ -905,7 +906,7 @@ static int __posix_lock_file(struct inode *inode, struct 
> file_lock *request, str
>   if (fl->fl_start > request->fl_end)
>   break;
>   if (request->fl_type == F_UNLCK)
> - added = 1;
> + added = true;
>   if (fl->fl_start < request->fl_start)
>   left = fl;
>   /* If the next lock in the list has a higher end
> @@ -935,7 +936,7 @@ static int __posix_lock_file(struct inode *inode, struct 
> file_lock *request, str
>   locks_release_private(fl);
>   locks_copy_private(fl, request);
>   request = fl;
> - added = 1;
> + added = true;
>   }
>   }
>   /* Go on to next lock.
> -- 
> 1.7.1
> 



[Cluster-devel] GFS2: Pull request (fixes)

2013-06-04 Thread Steven Whitehouse
Hi,

Please consider pulling the following fixes for GFS2,

Steve.

--

There are four patches this time. The first fixes a problem where the
wrong descriptor type was being written into the log for journaled data
blocks. The second fixes a race relating to the deallocation of allocator
data. The third provides a fallback if kmalloc is unable to satisfy a
request to allocate a directory hash table. The fourth fixes the iopen
glock caching so that inodes are deleted in a more timely manner after
rmdir/unlink.

-
The following changes since commit aa4f608478acb7ed69dfcff4f3c404100b78ac49:

  Merge branch 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k (2013-06-03 
18:09:42 +0900)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes.git master

Bob Peterson (4):
  GFS2: Set log descriptor type for jdata blocks
  GFS2: Increase i_writecount during gfs2_setattr_size
  GFS2: Fall back to vmalloc if kmalloc fails for dir hash tables
  GFS2: Don't cache iopen glocks

 fs/gfs2/bmap.c  |   17 +
 fs/gfs2/dir.c   |   43 +--
 fs/gfs2/file.c  |   19 +--
 fs/gfs2/inode.c |1 +
 fs/gfs2/lops.c  |4 +++-
 fs/gfs2/rgrp.c  |4 +++-
 fs/gfs2/super.c |6 +-
 7 files changed, 71 insertions(+), 23 deletions(-)





Re: [Cluster-devel] [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

2013-06-04 Thread Jeff Layton
On Tue, 4 Jun 2013 10:53:22 -0400
"J. Bruce Fields"  wrote:

> On Tue, Jun 04, 2013 at 07:46:40AM -0700, Christoph Hellwig wrote:
> > Having RCU for modification mostly workloads never is a good idea, so
> > I don't think it makes sense to mention it here.
> > 
> > If you care about the overhead it's worth trying to use per-cpu lists,
> > though.
> 
> Yes.  The lock and unlock could happen on different CPU's--but I think
> you can make the rule that the lock stays associated with the list it
> was first put on, and then it's correct in general and hopefully quick
> in the common case.
> 

It's important to distinguish between the blocked_list/hash and the
file_lock_list. Yes, they use the same embedded list_head or hlist_node
in the file_lock, but they have very different characteristics and use
cases.

In the testing I did, having a hashtable for the blocked locks helped a
lot more than a percpu list. The trick with deadlock detection is to
ensure that you don't spend a lot of time walking the lists. Since we
do deadlock detection frequently, we need to optimize for that case.
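
To make that concrete, the blocked-locks hashtable amounts to keying the
blocked requests by lock owner, so that the deadlock-detection walk only
touches one short hash chain (a sketch; the bucket count and helper names
are illustrative, and callers are still assumed to hold the global
spinlock):

#include <linux/hashtable.h>

#define BLOCKED_HASH_BITS 7
static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);

static void locks_insert_global_blocked(struct file_lock *waiter)
{
    hash_add(blocked_hash, &waiter->fl_link,
             (unsigned long)waiter->fl_owner);
}

static void locks_delete_global_blocked(struct file_lock *waiter)
{
    hash_del(&waiter->fl_link);
}

/* Find a lock that the owner of the given block_fl is blocking on. */
static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
{
    struct file_lock *fl;

    hash_for_each_possible(blocked_hash, fl, fl_link,
                           (unsigned long)block_fl->fl_owner) {
        if (posix_same_owner(fl, block_fl))
            return fl->fl_next;
    }
    return NULL;
}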

For the file_lock_list, it might make sense to have percpu hlists with
an lglock however. The thing to note here is that the file_lock_list is
almost never read. Only /proc/locks uses it, so anything we can do to
optimize the list manipulation is probably worth it.

All that said, I'm leery about changing too much of this code too fast.
It's pretty old and poorly understood, so I think we need to be
cautious and take an incremental approach to changing it.

-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

2013-06-04 Thread Jeff Layton
On Tue, 4 Jun 2013 07:46:40 -0700
Christoph Hellwig  wrote:

> Having RCU for modification mostly workloads never is a good idea, so
> I don't think it makes sense to mention it here.
> 
> If you care about the overhead it's worth trying to use per-cpu lists,
> though.
> 

Yeah, I looked at those too in an earlier set and it did help some.
Moving to a hashtable for the blocked_list really seemed to help the
most there, but percpu lists with lglocks or something might help a lot
on the file_lock_list.

-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

2013-06-04 Thread J. Bruce Fields
On Tue, Jun 04, 2013 at 07:46:40AM -0700, Christoph Hellwig wrote:
> Having RCU for modification mostly workloads never is a good idea, so
> I don't think it makes sense to mention it here.
> 
> If you care about the overhead it's worth trying to use per-cpu lists,
> though.

Yes.  The lock and unlock could happen on different CPU's--but I think
you can make the rule that the lock stays associated with the list it
was first put on, and then it's correct in general and hopefully quick
in the common case.
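
A rough sketch of that rule (the fl_link_cpu field and the bucket naming
are hypothetical, the per-CPU spinlocks would need initializing at boot,
e.g. from filelock_init(), and /proc/locks would then have to walk every
CPU's list):

struct file_lock_bucket {
    spinlock_t        lock;
    struct hlist_head list;
};
static DEFINE_PER_CPU(struct file_lock_bucket, file_lock_buckets);

static void locks_insert_global_locks(struct file_lock *fl)
{
    struct file_lock_bucket *b;

    /* pin to a CPU while we pick and lock its bucket */
    b = &get_cpu_var(file_lock_buckets);
    fl->fl_link_cpu = smp_processor_id();   /* remember which list */
    spin_lock(&b->lock);
    hlist_add_head(&fl->fl_link, &b->list);
    spin_unlock(&b->lock);
    put_cpu_var(file_lock_buckets);
}

static void locks_delete_global_locks(struct file_lock *fl)
{
    /* go back to the list the lock was first put on */
    struct file_lock_bucket *b = per_cpu_ptr(&file_lock_buckets,
                                             fl->fl_link_cpu);

    spin_lock(&b->lock);
    hlist_del_init(&fl->fl_link);
    spin_unlock(&b->lock);
}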

--b.



Re: [Cluster-devel] [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

2013-06-04 Thread Christoph Hellwig
Having RCU for modification mostly workloads never is a good idea, so
I don't think it makes sense to mention it here.

If you care about the overhead it's worth trying to use per-cpu lists,
though.



Re: [Cluster-devel] [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

2013-06-04 Thread Jeff Layton
On Tue, 04 Jun 2013 16:19:53 +0200
"Stefan (metze) Metzmacher"  wrote:

> Hi Jeff,
> 
> > There's no reason we have to protect the blocked_hash and file_lock_list
> > with the same spinlock. With the tests I have, breaking it in two gives
> > a barely measurable performance benefit, but it seems reasonable to make
> > this locking as granular as possible.
> 
> as file_lock_{list,lock} is only used for debugging (/proc/locks) after this
> change, I guess it would be possible to use RCU instead of a spinlock.
> 
> @others: this was the related discussion on IRC
> (http://irclog.samba.org/) about this:
> 
> 16:02 < metze> jlayton: do you have time to discuss your file_lock_lock
> changes?
> 16:02 < jlayton> metze: sure, what's up?
> 16:03 < jlayton> metze: note that it won't help vl's thundering herd
> problems...
> 16:03 < metze> is it correct that after your last patch file_lock_lock
> is only used for /proc/locks?
> 16:03 < jlayton> well, it's only used to protect the list that is used
> for /proc/locks
> 16:04 < jlayton> it still gets taken whenever a lock is acquired or
> released in order to manipulate that list
> 16:04 < metze> would it be a good idea to use rcu instead of a spin lock?
> 16:04 < jlayton> I tried using RCU, but it turned out to slow everything
> down
> 16:04 < jlayton> this is not a read-mostly workload unfortunately
> 16:04 < jlayton> so doing it with mutual exclusion turns out to be faster
> 16:04 < metze> ok
> 16:05 < jlayton> I might play around with it again sometime, but I don't
> think it really helps. What we need to ensure is
>  that we optimize the code that manipulates that list,
> and RCU list manipulations have larger overhead
> 16:06 < jlayton> metze: that's a good question though so if you want to
> ask it on the list, please do
> 16:06 < jlayton> others will probably be wondering the same thing
> 16:08 < metze> maybe it's worth a comment in commit message and the code


I'm not sure it's worth commenting about RCU in the code. In the future
it may be possible to do this -- who knows. It does seem to have been
consistently slower in my testing here though.

> 16:08 < metze> btw, why don't you remove the ' /* Protects the
> file_lock_list and the blocked_hash */' comment?
> 

Removed in my git tree -- thanks for pointing that out.

-- 
Jeff Layton 




Re: [Cluster-devel] [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

2013-06-04 Thread Stefan (metze) Metzmacher
Hi Jeff,

> There's no reason we have to protect the blocked_hash and file_lock_list
> with the same spinlock. With the tests I have, breaking it in two gives
> a barely measurable performance benefit, but it seems reasonable to make
> this locking as granular as possible.

as file_lock_{list,lock} is only used for debugging (/proc/locks) after this
change, I guess it would be possible to use RCU instead of a spinlock.

@others: this was the related discussion on IRC
(http://irclog.samba.org/) about this:

16:02 < metze> jlayton: do you have time to discuss your file_lock_lock
changes?
16:02 < jlayton> metze: sure, what's up?
16:03 < jlayton> metze: note that it won't help vl's thundering herd
problems...
16:03 < metze> is it correct that after your last patch file_lock_lock
is only used for /proc/locks?
16:03 < jlayton> well, it's only used to protect the list that is used
for /proc/locks
16:04 < jlayton> it still gets taken whenever a lock is acquired or
released in order to manipulate that list
16:04 < metze> would it be a good idea to use rcu instead of a spin lock?
16:04 < jlayton> I tried using RCU, but it turned out to slow everything
down
16:04 < jlayton> this is not a read-mostly workload unfortunately
16:04 < jlayton> so doing it with mutual exclusion turns out to be faster
16:04 < metze> ok
16:05 < jlayton> I might play around with it again sometime, but I don't
think it really helps. What we need to ensure is
 that we optimize the code that manipulates that list,
and RCU list manipulations have larger overhead
16:06 < jlayton> metze: that's a good question though so if you want to
ask it on the list, please do
16:06 < jlayton> others will probably be wondering the same thing
16:08 < metze> maybe it's worth a comment in commit message and the code
16:08 < metze> btw, why don't you remove the ' /* Protects the
file_lock_list and the blocked_hash */' comment?

metze





[Cluster-devel] [PATCH 4/4] GFS2: Don't cache iopen glocks

2013-06-04 Thread Steven Whitehouse
From: Bob Peterson 

This patch makes GFS2 immediately reclaim/delete all iopen glocks
as soon as they're dequeued. This allows deleters to get an
EXclusive lock on iopen so files are deleted properly instead of
being set as unlinked.

Signed-off-by: Bob Peterson 
Signed-off-by: Steven Whitehouse 

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 8833a4f..62b484e 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -189,6 +189,7 @@ struct inode *gfs2_inode_lookup(struct super_block *sb, 
unsigned int type,
return inode;
 
 fail_refresh:
+   ip->i_iopen_gh.gh_flags |= GL_NOCACHE;
ip->i_iopen_gh.gh_gl->gl_object = NULL;
gfs2_glock_dq_uninit(&ip->i_iopen_gh);
 fail_iopen:
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 917c8e1..e5639de 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1444,6 +1444,7 @@ static void gfs2_evict_inode(struct inode *inode)
/* Must not read inode block until block type has been verified */
error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_SKIP, &gh);
if (unlikely(error)) {
+   ip->i_iopen_gh.gh_flags |= GL_NOCACHE;
gfs2_glock_dq_uninit(&ip->i_iopen_gh);
goto out;
}
@@ -1514,8 +1515,10 @@ out_unlock:
if (gfs2_rs_active(ip->i_res))
gfs2_rs_deltree(ip->i_res);
 
-   if (test_bit(HIF_HOLDER, &ip->i_iopen_gh.gh_iflags))
+   if (test_bit(HIF_HOLDER, &ip->i_iopen_gh.gh_iflags)) {
+   ip->i_iopen_gh.gh_flags |= GL_NOCACHE;
gfs2_glock_dq(&ip->i_iopen_gh);
+   }
gfs2_holder_uninit(&ip->i_iopen_gh);
gfs2_glock_dq_uninit(&gh);
if (error && error != GLR_TRYFAILED && error != -EROFS)
@@ -1534,6 +1537,7 @@ out:
ip->i_gl = NULL;
if (ip->i_iopen_gh.gh_gl) {
ip->i_iopen_gh.gh_gl->gl_object = NULL;
+   ip->i_iopen_gh.gh_flags |= GL_NOCACHE;
gfs2_glock_dq_uninit(&ip->i_iopen_gh);
}
 }
-- 
1.7.4



[Cluster-devel] [PATCH 2/4] GFS2: Increase i_writecount during gfs2_setattr_size

2013-06-04 Thread Steven Whitehouse
From: Bob Peterson 

This patch calls get_write_access in a few functions. This
merely increases inode->i_writecount for the duration of the function.
That will ensure that any file closes won't delete the inode's
multi-block reservation while the function is running.

Signed-off-by: Bob Peterson 
Signed-off-by: Steven Whitehouse 

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 1dc9a13..93b5809 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1286,17 +1286,26 @@ int gfs2_setattr_size(struct inode *inode, u64 newsize)
if (ret)
return ret;
 
+   ret = get_write_access(inode);
+   if (ret)
+   return ret;
+
inode_dio_wait(inode);
 
ret = gfs2_rs_alloc(GFS2_I(inode));
if (ret)
-   return ret;
+   goto out;
 
oldsize = inode->i_size;
-   if (newsize >= oldsize)
-   return do_grow(inode, newsize);
+   if (newsize >= oldsize) {
+   ret = do_grow(inode, newsize);
+   goto out;
+   }
 
-   return do_shrink(inode, oldsize, newsize);
+   ret = do_shrink(inode, oldsize, newsize);
+out:
+   put_write_access(inode);
+   return ret;
 }
 
 int gfs2_truncatei_resume(struct gfs2_inode *ip)
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index acd1676..ad0dc38 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -402,16 +402,20 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, 
struct vm_fault *vmf)
/* Update file times before taking page lock */
file_update_time(vma->vm_file);
 
+   ret = get_write_access(inode);
+   if (ret)
+   goto out;
+
ret = gfs2_rs_alloc(ip);
if (ret)
-   return ret;
+   goto out_write_access;
 
gfs2_size_hint(vma->vm_file, pos, PAGE_CACHE_SIZE);
 
gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
ret = gfs2_glock_nq(&gh);
if (ret)
-   goto out;
+   goto out_uninit;
 
set_bit(GLF_DIRTY, &ip->i_gl->gl_flags);
set_bit(GIF_SW_PAGED, &ip->i_flags);
@@ -480,12 +484,15 @@ out_quota_unlock:
gfs2_quota_unlock(ip);
 out_unlock:
gfs2_glock_dq(&gh);
-out:
+out_uninit:
gfs2_holder_uninit(&gh);
if (ret == 0) {
set_page_dirty(page);
wait_for_stable_page(page);
}
+out_write_access:
+   put_write_access(inode);
+out:
sb_end_pagefault(inode->i_sb);
return block_page_mkwrite_return(ret);
 }
@@ -594,10 +601,10 @@ static int gfs2_release(struct inode *inode, struct file 
*file)
kfree(file->private_data);
file->private_data = NULL;
 
-   if ((file->f_mode & FMODE_WRITE) &&
-   (atomic_read(&inode->i_writecount) == 1))
-   gfs2_rs_delete(ip);
+   if (!(file->f_mode & FMODE_WRITE))
+   return 0;
 
+   gfs2_rs_delete(ip);
return 0;
 }
 
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 5232525..9809156 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -638,8 +638,10 @@ void gfs2_rs_deltree(struct gfs2_blkreserv *rs)
  */
 void gfs2_rs_delete(struct gfs2_inode *ip)
 {
+   struct inode *inode = &ip->i_inode;
+
down_write(&ip->i_rw_mutex);
-   if (ip->i_res) {
+   if (ip->i_res && atomic_read(&inode->i_writecount) <= 1) {
gfs2_rs_deltree(ip->i_res);
BUG_ON(ip->i_res->rs_free);
kmem_cache_free(gfs2_rsrv_cachep, ip->i_res);
-- 
1.7.4



[Cluster-devel] [PATCH 1/4] GFS2: Set log descriptor type for jdata blocks

2013-06-04 Thread Steven Whitehouse
From: Bob Peterson 

This patch sets the log descriptor type according to whether the
journal commit is for (journaled) data or metadata. This was
recently broken when the functions to process data and metadata
log ops were combined.

Signed-off-by: Bob Peterson 
Signed-off-by: Steven Whitehouse 

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 68b4c8f..6c33d7b 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -419,7 +419,9 @@ static void gfs2_before_commit(struct gfs2_sbd *sdp, 
unsigned int limit,
if (total > limit)
num = limit;
gfs2_log_unlock(sdp);
-   page = gfs2_get_log_desc(sdp, GFS2_LOG_DESC_METADATA, num + 1, 
num);
+   page = gfs2_get_log_desc(sdp,
+is_databuf ? GFS2_LOG_DESC_JDATA :
+GFS2_LOG_DESC_METADATA, num + 1, num);
ld = page_address(page);
gfs2_log_lock(sdp);
ptr = (__be64 *)(ld + 1);
-- 
1.7.4



[Cluster-devel] [PATCH 3/4] GFS2: Fall back to vmalloc if kmalloc fails for dir hash tables

2013-06-04 Thread Steven Whitehouse
From: Bob Peterson 

This version has one more correction: the vmalloc calls are replaced
by __vmalloc calls to preserve the GFP_NOFS flag.

When GFS2's directory management code allocates buffers for a
directory hash table, if it can't get the memory it needs, it
currently gives a bad return code. Rather than giving an error,
this patch allows it to use virtual memory rather than kernel
memory for the hash table. This should make it possible for
directories to function properly, even when kernel memory becomes
very fragmented.

Signed-off-by: Bob Peterson 
Signed-off-by: Steven Whitehouse 

diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index c3e82bd..b631c90 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -354,22 +354,31 @@ static __be64 *gfs2_dir_get_hash_table(struct gfs2_inode 
*ip)
return ERR_PTR(-EIO);
}
 
-   hc = kmalloc(hsize, GFP_NOFS);
-   ret = -ENOMEM;
+   hc = kmalloc(hsize, GFP_NOFS | __GFP_NOWARN);
+   if (hc == NULL)
+   hc = __vmalloc(hsize, GFP_NOFS, PAGE_KERNEL);
+
if (hc == NULL)
return ERR_PTR(-ENOMEM);
 
ret = gfs2_dir_read_data(ip, hc, hsize);
if (ret < 0) {
-   kfree(hc);
+   if (is_vmalloc_addr(hc))
+   vfree(hc);
+   else
+   kfree(hc);
return ERR_PTR(ret);
}
 
spin_lock(&inode->i_lock);
-   if (ip->i_hash_cache)
-   kfree(hc);
-   else
+   if (ip->i_hash_cache) {
+   if (is_vmalloc_addr(hc))
+   vfree(hc);
+   else
+   kfree(hc);
+   } else {
ip->i_hash_cache = hc;
+   }
spin_unlock(&inode->i_lock);
 
return ip->i_hash_cache;
@@ -385,7 +394,10 @@ void gfs2_dir_hash_inval(struct gfs2_inode *ip)
 {
__be64 *hc = ip->i_hash_cache;
ip->i_hash_cache = NULL;
-   kfree(hc);
+   if (is_vmalloc_addr(hc))
+   vfree(hc);
+   else
+   kfree(hc);
 }
 
 static inline int gfs2_dirent_sentinel(const struct gfs2_dirent *dent)
@@ -1113,7 +1125,10 @@ static int dir_double_exhash(struct gfs2_inode *dip)
if (IS_ERR(hc))
return PTR_ERR(hc);
 
-   h = hc2 = kmalloc(hsize_bytes * 2, GFP_NOFS);
+   h = hc2 = kmalloc(hsize_bytes * 2, GFP_NOFS | __GFP_NOWARN);
+   if (hc2 == NULL)
+   hc2 = __vmalloc(hsize_bytes * 2, GFP_NOFS, PAGE_KERNEL);
+
if (!hc2)
return -ENOMEM;
 
@@ -1145,7 +1160,10 @@ fail:
gfs2_dinode_out(dip, dibh->b_data);
brelse(dibh);
 out_kfree:
-   kfree(hc2);
+   if (is_vmalloc_addr(hc2))
+   vfree(hc2);
+   else
+   kfree(hc2);
return error;
 }
 
@@ -1846,6 +1864,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 
index, u32 len,
memset(&rlist, 0, sizeof(struct gfs2_rgrp_list));
 
ht = kzalloc(size, GFP_NOFS);
+   if (ht == NULL)
+   ht = vzalloc(size);
if (!ht)
return -ENOMEM;
 
@@ -1933,7 +1953,10 @@ out_rlist:
gfs2_rlist_free(&rlist);
gfs2_quota_unhold(dip);
 out:
-   kfree(ht);
+   if (is_vmalloc_addr(ht))
+   vfree(ht);
+   else
+   kfree(ht);
return error;
 }
 
-- 
1.7.4



[Cluster-devel] GFS2: Pre-pull patch posting (fixes)

2013-06-04 Thread Steven Whitehouse
There are four patches this time. The first fixes a problem where the
wrong descriptor type was being written into the log for journaled data
blocks. The second fixes a race relating to the deallocation of allocator
data. The third provides a fallback if kmalloc is unable to satisfy a
request to allocate a directory hash table. The fourth fixes the iopen
glock caching so that inodes are deleted in a more timely manner after
rmdir/unlink.

Steve.




Re: [Cluster-devel] [PATCH v1 00/11] locks: scalability improvements for file locking

2013-06-04 Thread Jeff Layton
On Tue, 4 Jun 2013 07:56:44 -0400
Jim Rees  wrote:

> Jeff Layton wrote:
> 
>   > Might be nice to look at some profiles to confirm all of that.  I'd also
>   > be curious how much variation there was in the results above, as they're
>   > pretty close.
>   > 
>   
>   The above is just a random representative sample. The results are
>   pretty close when running this test, but I can average up several runs
>   and present the numbers. I plan to get a bare-metal test box on which
>   to run some more detailed testing and maybe some profiling this week.
> 
> Just contributing more runs into the mean doesn't tell us anything about the
> variance. With numbers that close you need the variance to tell whether it's
> a significant change.

Thanks. I'll see if I can get some standard deviation numbers here, and
I'll do it on some bare metal to ensure that virtualization doesn't
skew any results.

FWIW, they were all consistently very close to one another when I ran
these tests, and the times were all consistently shorter than the
unpatched kernel.

That said, this test is pretty rough. Doing this with "time" measures
other things that aren't related to locking. So I'll also see if I
can come up with a way to measure the actual locking performance more
accurately too.

-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v1 00/11] locks: scalability improvements for file locking

2013-06-04 Thread Davidlohr Bueso
On Fri, 2013-05-31 at 23:07 -0400, Jeff Layton wrote:
> This is not the first attempt at doing this. The conversion to the
> i_lock was originally attempted by Bruce Fields a few years ago. His
> approach was NAK'ed since it involved ripping out the deadlock
> detection. People also really seem to like /proc/locks for debugging, so
> keeping that in is probably worthwhile.

Yep, we need to keep this. FWIW, lslocks(8) relies on /proc/locks.

Thanks,
Davidlohr



Re: [Cluster-devel] [PATCH v1 00/11] locks: scalability improvements for file locking

2013-06-04 Thread Jim Rees
Jeff Layton wrote:

  > Might be nice to look at some profiles to confirm all of that.  I'd also
  > be curious how much variation there was in the results above, as they're
  > pretty close.
  > 
  
  The above is just a random representative sample. The results are
  pretty close when running this test, but I can average up several runs
  and present the numbers. I plan to get a bare-metal test box on which
  to run some more detailed testing and maybe some profiling this week.

Just contributing more runs into the mean doesn't tell us anything about the
variance. With numbers that close you need the variance to tell whether it's
a significant change.



Re: [Cluster-devel] [PATCH v1 03/11] locks: comment cleanups and clarifications

2013-06-04 Thread Jeff Layton
On Mon, 3 Jun 2013 18:00:24 -0400
"J. Bruce Fields"  wrote:

> On Fri, May 31, 2013 at 11:07:26PM -0400, Jeff Layton wrote:
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/locks.c |   24 +++-
> >  include/linux/fs.h |6 ++
> >  2 files changed, 25 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/locks.c b/fs/locks.c
> > index e3140b8..a7d2253 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -150,6 +150,16 @@ static int target_leasetype(struct file_lock *fl)
> >  int leases_enable = 1;
> >  int lease_break_time = 45;
> >  
> > +/*
> > + * The i_flock list is ordered by:
> > + *
> > + * 1) lock type -- FL_LEASEs first, then FL_FLOCK, and finally FL_POSIX
> > + * 2) lock owner
> > + * 3) lock range start
> > + * 4) lock range end
> > + *
> > + * Obviously, the last two criteria only matter for POSIX locks.
> > + */
> 
> Thanks, yes, that needs documenting!  Though I wonder if this is the
> place people will look for it.
> 

Agreed. If you can think of a better spot to put this, I'd be happy to
move it in the next iteration of this series.

> >  #define for_each_lock(inode, lockp) \
> > for (lockp = &inode->i_flock; *lockp != NULL; lockp = 
> > &(*lockp)->fl_next)
> >  
> > @@ -806,6 +816,11 @@ static int __posix_lock_file(struct inode *inode, 
> > struct file_lock *request, str
> > }
> >  
> > lock_flocks();
> > +   /*
> > +* New lock request. Walk all POSIX locks and look for conflicts. If
> > +* there are any, either return -EAGAIN or put the request on the
> > +* blocker's list of waiters.
> > +*/
> 
> This though, seems a) not 100% accurate (it could also return EDEADLCK,
> for example), b) mostly redundant with respect to the following code.
> 

Good point -- I'll fix that up. That one was mostly for my own benefit
while trying to figure out how this code works. It's still probably worth
some more verbose comments here since this certainly wasn't 100%
obvious to me when I first dove into this code.

> > if (request->fl_type != F_UNLCK) {
> > for_each_lock(inode, before) {
> > fl = *before;
> > @@ -844,7 +859,7 @@ static int __posix_lock_file(struct inode *inode, 
> > struct file_lock *request, str
> > before = &fl->fl_next;
> > }
> >  
> > -   /* Process locks with this owner.  */
> > +   /* Process locks with this owner. */
> > while ((fl = *before) && posix_same_owner(request, fl)) {
> > /* Detect adjacent or overlapping regions (if same lock type)
> >  */
> > @@ -930,10 +945,9 @@ static int __posix_lock_file(struct inode *inode, 
> > struct file_lock *request, str
> > }
> >  
> > /*
> > -* The above code only modifies existing locks in case of
> > -* merging or replacing.  If new lock(s) need to be inserted
> > -* all modifications are done bellow this, so it's safe yet to
> > -* bail out.
> > +* The above code only modifies existing locks in case of merging or
> > +* replacing. If new lock(s) need to be inserted all modifications are
> > +* done below this, so it's safe yet to bail out.
> >  */
> > error = -ENOLCK; /* "no luck" */
> > if (right && left == right && !new_fl2)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index b9d7816..ae377e9 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -926,6 +926,12 @@ int locks_in_grace(struct net *);
> >  /* that will die - we need it for nfs_lock_info */
> >  #include 
> >  
> > +/*
> > + * struct file_lock represents a generic "file lock". It's used to 
> > represent
> > + * POSIX byte range locks, BSD (flock) locks, and leases. It's important to
> > + * note that the same struct is used to represent both a request for a 
> > lock and
> > + * the lock itself, but the same object is never used for both.
> 
> Yes, and I do find that confusing.  I wonder if there's a sensible way
> to use separate structs for the different uses.
> 
> --b.
> 

Yeah. I think that latter point accounts for a lot of the difficulty for
the uninitiated to understand this code.

I considered creating a "struct lock_request" too. If I were designing
this code from scratch I certainly would do so, but it's harder to
justify such a change to the existing code.

It would mean a lot of churn in fs-specific lock routines, and we'd
have to completely revamp things like locks_copy_lock. I just don't see
that giving us a lot of benefit in the near term and I really want this
set to focus on concrete benefits.

We might consider it in the future though. I'll add a FIXME comment
here so we can record the idea for posterity.
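
For the record, the idea is roughly to separate the fields a pending
request actually needs from the bookkeeping an installed lock carries,
something like the following (purely hypothetical, not proposed code):

struct lock_request {
    struct list_head  lr_block;     /* queued on the blocker's list */
    struct file_lock *lr_blocker;   /* lock we are waiting behind */
    wait_queue_head_t lr_wait;
    fl_owner_t        lr_owner;
    unsigned int      lr_pid;
    unsigned int      lr_flags;
    unsigned char     lr_type;
    struct file      *lr_file;
    loff_t            lr_start;
    loff_t            lr_end;
};

The churn Jeff mentions comes from the fact that every fs-specific lock
routine and helpers like locks_copy_lock() currently traffic in struct
file_lock for both roles.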

Thanks for the comments!

> > + */
> >  struct file_lock {
> > struct file_lock *fl_next;  /* singly linked list for this inode  */
> > struct list_head fl_link;   /* doubly linked list of all locks */
> > -- 
> > 1.7.1
> > 


-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v1 00/11] locks: scalability improvements for file locking

2013-06-04 Thread Jeff Layton
On Mon, 3 Jun 2013 17:31:01 -0400
"J. Bruce Fields"  wrote:

> On Fri, May 31, 2013 at 11:07:23PM -0400, Jeff Layton wrote:
> > Executive summary (tl;dr version): This patchset represents an overhaul
> > of the file locking code with an aim toward improving its scalability
> > and making the code a bit easier to understand.
> 
> Thanks for working on this, that code could use some love!
> 
> > Longer version:
> > 
> > When the BKL was finally ripped out of the kernel in 2010, the strategy
> > taken for the file locking code was to simply turn it into a new
> > file_lock_locks spinlock. It was an expedient way to deal with the file
> > locking code at the time, but having a giant spinlock around all of this
> > code is clearly not great for scalability. Red Hat has bug reports that
> > go back into the 2.6.18 era that point to BKL scalability problems in
> > the file locking code and the file_lock_lock suffers from the same
> > issues.
> > 
> > This patchset is my first attempt to make this code less dependent on
> > global locking. The main change is to switch most of the file locking
> > code to be protected by the inode->i_lock instead of the file_lock_lock.
> > 
> > While that works for most things, there are a couple of global data
> > structures (lists in the current code) that need a global lock to
> > protect them. So we still need a global lock in order to deal with
> > those. The remaining patches are intended to make that global locking
> > less painful. The big gain is made by turning the blocked_list into a
> > hashtable, which greatly speeds up the deadlock detection code.
> > 
> > I rolled a couple of small programs in order to test this code. The
> > first one just forks off 128 children and has them lock and unlock the
> > same file 10k times. Running this under "time" against a file on tmpfs
> > gives typical values like this:
> 
> What kind of hardware was this?
> 

Mostly a KVM guest on Intel hardware. I've done some testing on bare
metal too, and the results are pretty similar.
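
For anyone wanting to reproduce the numbers, the contended test described
above is roughly of this shape (a sketch, not Jeff's actual program; no
error checking, and the file should live on tmpfs as noted):

#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 128
#define ITERS 10000

int main(void)
{
    int i, j;
    int fd = open("/tmp/lockfile", O_RDWR | O_CREAT, 0644);
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

    for (i = 0; i < NPROC; i++) {
        if (fork() == 0) {
            for (j = 0; j < ITERS; j++) {
                fl.l_type = F_WRLCK;
                fcntl(fd, F_SETLKW, &fl);   /* wait for the lock */
                fl.l_type = F_UNLCK;
                fcntl(fd, F_SETLK, &fl);
            }
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}

Run under "time" as in the description; the second (uncontended) test is
the same idea with each child creating and locking its own file.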
 
> > 
> > Unpatched (3.10-rc3-ish):
> > real0m5.283s
> > user0m0.380s
> > sys 0m20.469s
> > 
> > Patched (same base kernel):
> > real0m5.099s
> > user0m0.478s
> > sys 0m19.662s
> > 
> > ...so there seems to be some modest performance gain in this test. I
> > think that's almost entirely due to the change to a hashtable and to
> > optimize removing and readding blocked locks to the global lists. Note
> > that with this code we have to take two spinlocks instead of just one,
> > and that has some performance impact too. So the real performance gain
> > from that hashtable conversion is eaten up to some degree by this.
> 
> Might be nice to look at some profiles to confirm all of that.  I'd also
> be curious how much variation there was in the results above, as they're
> pretty close.
> 

The above is just a random representative sample. The results are
pretty close when running this test, but I can average up several runs
and present the numbers. I plan to get a bare-metal test box on which
to run some more detailed testing and maybe some profiling this week.

> > The next test just forks off a bunch of children that each create their
> > own file and then lock and unlock it 20k times. Obviously, the locks in
> > this case are uncontended. Running that under "time" typically gives
> > these rough numbers.
> > 
> > Unpatched (3.10-rc3-ish):
> > real0m8.836s
> > user0m1.018s
> > sys 0m34.094s
> > 
> > Patched (same base kernel):
> > real0m4.965s
> > user0m1.043s
> > sys 0m18.651s
> > 
> > In this test, we see the real benefit of moving to the i_lock for most
> > of this code. The run time is almost cut in half in this test. With
> > these changes locking different inodes needs very little serialization.
> > 
> > If people know of other file locking performance tests, then I'd be
> > happy to try them out too. It's possible that this might make some
> > workloads slower, and it would be helpful to know what they are (and
> > address them) if so.
> > 
> > This is not the first attempt at doing this. The conversion to the
> > i_lock was originally attempted by Bruce Fields a few years ago. His
> > approach was NAK'ed since it involved ripping out the deadlock
> > detection. People also really seem to like /proc/locks for debugging, so
> > keeping that in is probably worthwhile.
> 
> Yes, there's already code that depends on it.
> 
> The deadlock detection, though--I still wonder if we could get away with
> ripping it out.  Might be worth at least giving an option to configure
> it out as a first step.
> 
> --b.
> 


I considered that, and have patches that add such a config option. Some
of the later patches in this set optimize the code that is necessary to
support deadlock detection. In particular, turning the blocked_list
into a hashtable really speeds up the list walking. So much so that I
think the case for compiling it out is less obvious.

Should we a