from:"Matthew Wilcox"

Re: how to show propagation state for mounts

2008-02-20 Thread Matthew Wilcox

On Wed, Feb 20, 2008 at 04:04:22PM +, Al Viro wrote:
> It's less about the form of representation (after all, we generate poll
> events when contents of that sucker changes, so one *can* get a consistent
> snapshot of the entire thing) and more about having it self-contained
> when we have namespaces in the play.
> 
> IOW, the data in there should give answers to questions that make sense.
> "Do events get propagated from this vfsmount I have to that vfsmount I have?"
> is a meaningful one; ditto for "are events here propagated to somewhere I
> don't see?" or "are events getting propagated here from somewhere I don't
> see?".

Why do those last two questions deserve an answer?  How will a person's
or application's behaviour be affected by whether a change will
propagate to something they don't know about and can't see?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] enhanced ESTALE error handling

2008-01-18 Thread Matthew Wilcox

On Fri, Jan 18, 2008 at 10:36:01AM -0500, Peter Staubach wrote:
> @@ -1025,12 +1027,27 @@ static int fastcall link_path_walk(const
>   mntget(save.mnt);
>  
>   result = __link_path_walk(name, nd);
> - if (result == -ESTALE) {
> + while (result == -ESTALE) {
> + /*
> +  * If no progress was made looking up the pathname,
> +  * then stop and return ENOENT instead of ESTALE.
> +  */
> + if (nd->dentry == save.dentry) {
> + result = -ENOENT;
> + break;
> + }
>   *nd = save;
>   dget(nd->dentry);
>   mntget(nd->mnt);
>   nd->flags |= LOOKUP_REVAL;
>   result = __link_path_walk(name, nd);
> + /*
> +  * If no progress was made this time, then return
> +  * ENOENT instead of ESTALE because no recovery
> +  * is possible to recover the stale file handle.
> +  */
> + if (result == -ESTALE && nd->dentry == save.dentry)
> + result = -ENOENT;
>   }
>  
>   dput(save.dentry);

Why do you need both of these tests?  The first one should be enough,
surely?

> @@ -1268,8 +1285,8 @@ int path_lookup_open(int dfd, const char
>   * @create_mode: create intent flags
>   */
>  static int path_lookup_create(int dfd, const char *name,
> -   unsigned int lookup_flags, struct nameidata *nd,
> -   int open_flags, int create_mode)
> + unsigned int lookup_flags, struct nameidata *nd,
> + int open_flags, int create_mode)

Gratuitous reformatting?

> @@ -1712,7 +1729,10 @@ int open_namei(int dfd, const char *path
>   int acc_mode, error;
>   struct path path;
>   struct dentry *dir;
> - int count = 0;
> + int count;
> +
> +top:
> + count = 0;
>  
>   acc_mode = ACC_MODE(flag);
>  
> @@ -1739,7 +1759,8 @@ int open_namei(int dfd, const char *path
>   /*
>* Create - we need to know the parent.
>*/
> - error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
> + error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, nd,
> + flag, mode);
>   if (error)
>   return error;
>  
> @@ -1812,10 +1833,17 @@ ok:
>   return 0;
>  
>  exit_dput:
> + if (error == -ESTALE)
> + d_drop(path.dentry);
>   dput_path(&path, nd);
>  exit:
>   if (!IS_ERR(nd->intent.open.file))
>   release_open_intent(nd);
> + if (error == -ESTALE) {
> + d_drop(nd->dentry);
> + path_release(nd);
> + goto top;
> + }

I wonder if a tail-call might not work better here.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: file locks: Use wait_event_interruptible_timeout()

2008-01-15 Thread Matthew Wilcox

On Tue, Jan 15, 2008 at 09:48:51AM -0500, J. Bruce Fields wrote:
> On Mon, Jan 14, 2008 at 09:28:30PM -0700, Matthew Wilcox wrote:
> > interruptible_sleep_on_locked() is just an open-coded
> > wait_event_interruptible_timeout() with a few assumptions since we know
> > we hold the BKL.  locks_block_on_timeout() is only used in one place, so
> > it's actually simpler to inline it into its caller.
> 
> Makes sense, thanks.  So the assumption we were depending on the BKL for
> was that we could count on the wake-up not coming till after we block,
> so we could skip a check ->fl_next that's normally needed to resolve the
> usual sleeping-on-some-condition race?

That's right.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

file locks: Split flock_find_conflict out of flock_lock_file

2008-01-14 Thread Matthew Wilcox


Reduce the spaghetti-like nature of flock_lock_file by making the chunk
of code labelled find_conflict into its own function.  Also allocate
memory before taking the kernel lock in preparation for switching to a
normal spinlock.

Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>

diff --git a/fs/locks.c b/fs/locks.c
index b681459..bc691e5 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -699,6 +699,33 @@ next_task:
return 0;
 }
 
+/*
+ * A helper function for flock_lock_file().  It searches the list of locks
+ * for @inode looking for one which would conflict with @request.
+ * If it does not find one, it returns the address where the requested lock
+ * should be inserted.  If it does find a conflicting lock, it returns NULL.
+ */
+static struct file_lock **
+flock_find_conflict(struct inode *inode, struct file_lock *request)
+{
+   struct file_lock **before;
+
+   for_each_lock(inode, before) {
+   struct file_lock *fl = *before;
+   if (IS_POSIX(fl))
+   break;
+   if (IS_LEASE(fl))
+   continue;
+   if (!flock_locks_conflict(request, fl))
+   continue;
+
+   if (request->fl_flags & FL_SLEEP)
+   locks_insert_block(fl, request);
+   return NULL;
+   }
+   return before;
+}
+
 /* Try to create a FLOCK lock on filp. We always insert new FLOCK locks
  * after any leases, but before any posix locks.
  *
@@ -714,18 +741,21 @@ static int flock_lock_file(struct file *filp, struct 
file_lock *request)
int error = 0;
int found = 0;
 
-   lock_kernel();
-   if (request->fl_flags & FL_ACCESS)
-   goto find_conflict;
+   if (request->fl_flags & FL_ACCESS) {
+   lock_kernel();
+   before = flock_find_conflict(inode, request);
+   unlock_kernel();
+   return before ? 0 : -EAGAIN;
+   }
 
if (request->fl_type != F_UNLCK) {
-   error = -ENOMEM;
new_fl = locks_alloc_lock();
-   if (new_fl == NULL)
-   goto out;
-   error = 0;
+   if (!new_fl)
+   return -ENOMEM;
}
 
+   lock_kernel();
+
for_each_lock(inode, before) {
struct file_lock *fl = *before;
if (IS_POSIX(fl))
@@ -754,22 +784,12 @@ static int flock_lock_file(struct file *filp, struct 
file_lock *request)
if (found)
cond_resched();
 
-find_conflict:
-   for_each_lock(inode, before) {
-   struct file_lock *fl = *before;
-   if (IS_POSIX(fl))
-   break;
-   if (IS_LEASE(fl))
-   continue;
-   if (!flock_locks_conflict(request, fl))
-   continue;
+   before = flock_find_conflict(inode, request);
+   if (!before) {
error = -EAGAIN;
-   if (request->fl_flags & FL_SLEEP)
-   locks_insert_block(fl, request);
goto out;
}
-   if (request->fl_flags & FL_ACCESS)
-   goto out;
+
locks_copy_lock(new_fl, request);
locks_insert_lock(before, new_fl);
new_fl = NULL;
-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

file locks: Use wait_event_interruptible_timeout()

2008-01-14 Thread Matthew Wilcox


interruptible_sleep_on_locked() is just an open-coded
wait_event_interruptible_timeout() with a few assumptions since we know
we hold the BKL.  locks_block_on_timeout() is only used in one place, so
it's actually simpler to inline it into its caller.

Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>

 locks.c |   33 -
 1 file changed, 4 insertions(+), 29 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 8b8388e..b681459 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -634,33 +634,6 @@ static int flock_locks_conflict(struct file_lock 
*caller_fl, struct file_lock *s
return (locks_conflict(caller_fl, sys_fl));
 }
 
-static int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, int 
timeout)
-{
-   int result = 0;
-   DECLARE_WAITQUEUE(wait, current);
-
-   __set_current_state(TASK_INTERRUPTIBLE);
-   add_wait_queue(fl_wait, &wait);
-   if (timeout == 0)
-   schedule();
-   else
-   result = schedule_timeout(timeout);
-   if (signal_pending(current))
-   result = -ERESTARTSYS;
-   remove_wait_queue(fl_wait, &wait);
-   __set_current_state(TASK_RUNNING);
-   return result;
-}
-
-static int locks_block_on_timeout(struct file_lock *blocker, struct file_lock 
*waiter, int time)
-{
-   int result;
-   locks_insert_block(blocker, waiter);
-   result = interruptible_sleep_on_locked(&waiter->fl_wait, time);
-   __locks_delete_block(waiter);
-   return result;
-}
-
 void
 posix_test_lock(struct file *filp, struct file_lock *fl)
 {
@@ -1256,7 +1229,10 @@ restart:
if (break_time == 0)
break_time++;
}
-   error = locks_block_on_timeout(flock, new_fl, break_time);
+   locks_insert_block(flock, new_fl);
+   error = wait_event_interruptible_timeout(new_fl->fl_wait,
+   !new_fl->fl_next, break_time);
+   __locks_delete_block(new_fl);
if (error >= 0) {
if (error == 0)
time_out_leases(inode);

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Leak in nlmsvc_testlock for async GETFL case

2008-01-14 Thread Matthew Wilcox

On Mon, Jan 14, 2008 at 03:44:19PM -0500, J. Bruce Fields wrote:
> Thanks!  I've queued it up for 2.6.25.

Hi Bruce,

I haven't had as much time to play with de-BKL-ising fs/locks.c as I
would like, so fixing that for 2.6.25 is probably out of the question,
but here are two janitorial patches that hopefully can be applied and
will make the next steps easier.  They make sense all by themselves,
even if I don't get back to this project for a few months.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] rewrite rd

2008-01-14 Thread Matthew Wilcox

On Tue, Dec 04, 2007 at 05:26:28AM +0100, Nick Piggin wrote:
> +static void copy_to_brd(struct brd_device *brd, const void *src,
> + sector_t sector, size_t n)
> +{
> + struct page *page;
> + void *dst;
> + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> + size_t copy;
> +
> + copy = min((unsigned long)n, PAGE_SIZE - offset);
> + page = brd_lookup_page(brd, sector);
> + BUG_ON(!page);
> +
> + dst = kmap_atomic(page, KM_USER1);
> + memcpy(dst + offset, src, copy);
> + kunmap_atomic(dst, KM_USER1);

You're using kmap_atomic, but I see no reason you can't be preempted.
Don't you need to at least disable preemption while you have stuff
atomically kmapped?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: On setting a lease across a cluster

2008-01-04 Thread Matthew Wilcox

On Fri, Jan 04, 2008 at 03:53:04PM -0500, J. Bruce Fields wrote:
> So, the problem is that fcntl_setlease() does
> 
>   vfs_setlease()
>   fasync_helper()
> 
> which the bkl held over both, and you want to preserve that?
> 
> But what that BKL is doing is a mystery to me--the very first thing that
> fasync_helper() does is kmem_cache_alloc(., GFP_KERNEL).  So you won't
> be introducing any new problem if you lock those two operations
> separately.  Unless I'm totally missing something.

A very good point.

So yet another race caused by using the BKL rather than thinking ... but
maybe it's an inconsequential race.  The consequences are that (if the
kmalloc in fasync_helper sleeps) a lease appears that isn't fully set-up
yet (and may be removed if the kmalloc fails).  Actually, it seems bad
if the kmalloc eventually succeeds -- there's a window while kmalloc is
sleeping where another process could open the file, break the lease,
fl_fasync will be NULL, so no signal is sent.  Then 30 seconds later the 
lease is removed without the leaseholder being sent a signal.  Bad.

How can we fix this situation?  I think we need a better interface than
fasync_helper() -- fasync_alloc() and fasync_setup() would seem to do
the trick.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: On setting a lease across a cluster

2008-01-04 Thread Matthew Wilcox

On Fri, Jan 04, 2008 at 02:47:18PM -0500, J. Bruce Fields wrote:
> > > Then I started to wonder about the current split of functionality between
> > > fcntl_setlease, vfs_setlease and generic_setlease.  The check for no
> > > other process having this file open should be common to all filesystems.
> > > And it should be honoured by NFS (it shouldn't hand out a delegation if
> > > a local user has the file open), so it needs to be in vfs_setlease().
> 
> Of course, in the case of a (still unfortunately theoretical) cluster
> filesystem implementation, those checks will be insufficient on their
> own.

Oh yes, definitely insufficient, but also necessary.  Right now,
filesystems bypass them entirely.  And the last thing we want is
individual filesystems duplicating these tests ...

> Though the write-lease check especially:
> 
>   if ((arg == F_WRLCK)
> && ((atomic_read(&dentry->d_count) > 1)
>   || (atomic_read(&inode->i_count) > 1)))
>   goto out;
> 
> seems like a terrible hack, and gives some annoying false positives:
> 
>   http://www.ussg.iu.edu/hypermail/linux/kernel/0711.1/2255.html

It is a terrible hack.  It was sufficient for the time, but it wasn't
supposed to be merged in that state ... I haven't had the courage to
figure out how to solve it properly yet.

> > > Now I'm worrying about the mechanics of calling out to a filesystem to
> > > perform a lock.  Obviously, a network filesystem is going to have to
> > > sleep waiting for a response from the fileserver, so that precludes the
> > > use of a spinlock held over the call to f_op->setlease.  I'm not keen on
> > > the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite
> > > hard enough without worrying what a filesystem might be doing with it.
> 
> I'm confused as to what the locks are actually protecting.
> 
> If they're mainly just protecting the inode's lock list, for example,
> then we should let the filesystem call back to helpers in locks.c for
> any manipulations of that list and do the locking in those helpers.

But we need to ensure the lock list is always consistent for, eg,
/proc/locks.  We don't want locks to temporarily appear on the list when
they never get granted (due to an error elsewhere in the chain).

> > > vfs_setlease()
> > >if (f_op->setlease())
> > >   res = f_op->setlease()
> > >   if (res)
> > >  return res;
> > >lock_kernel()
> > >generic_setlease()
> > >unlock_kernel()
> 
> Why can't the filesystem call into generic_setlease() on its own?

Because (assuming we're rid of the BKL), fcntl_setlease() needs to
acquire the spinlock and hold it while generic_setlease() runs, so
generic_setlease() can't acquire the lock.

> Christoph Hellwig has suggested removing the second case and just doing
> 
>   .setlease = generic_setlease
> 
> for all filesystems that should support leases; see the final patch in
> 
>   git://linux-nfs.org/~bfields/linux.git server-cluster-lease-api

Yeah, but that doesn't work if we repurpose the setlease f_op.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: On setting a lease across a cluster

2008-01-04 Thread Matthew Wilcox

On Fri, Jan 04, 2008 at 01:55:36PM -0500, david m. richter wrote:
>   fwiw, i've done some work on extending the lease subsystem to help 
> support the full range of requirements for NFSv4 file and directory 
> delegations (e.g., breaking a lease when unlinking a file) and we ended up 
> actually doing most of what you've just suggested here, which i take to be 
> a good sign.

As long as it's great minds thinking alike and not fools seldom
differing ;-)

>   most of my refactoring came out of trying to simplify locking and 
> avoid holding locks too long (rather than specifically focusing on 
> cluster-oriented stuff, but the goals dovetail) and your work on getting 
> the BKL out of locks.c entirely is something i really like and look 
> forward to.

Excellent.  Shall I make the patch myself, or did you want to post a
patch based on working code?  ;-)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

On setting a lease across a cluster

2008-01-04 Thread Matthew Wilcox


Hi Bruce,

The current implementation of vfs_setlease/generic_setlease/etc is a
bit quirky.  I've been thinking it over for the past couple of days,
and I think we need to refactor it to work sensibly.

As you may have noticed, I've been mulling over getting rid of the
BKL in fs/locks.c and the lease interface is particularly problematic.
Here's one reason why:

fcntl_setlease
   lock_kernel
   vfs_setlease
  lock_kernel
  generic_setlease
  unlock_kernel
   fasync_helper
   unlock_kernel

This is perfectly legitimate for the BKL of course, but other spinlocks
can't be acquired recursively in this way.  At first I thought I'd just
push the call to generic_setlease into __vfs_setlease and have two ways
of calling it:

fcntl_setlease
   lock_kernel
   __vfs_setlease
   fasync_helper
   unlock_kernel

vfs_setlease
  lock_kernel
  __vfs_setlease
  unlock_kernel

But generic_setlease allocates memory (GFP_KERNEL) under the lock, so
that's bad for converting to spnlocks later.  Then I thought I'd move
the spinlock acquisition into generic_setlease().  But I can't do that
either as fcntl_setlease() needs to hold the spinlock around the call
to generic_setlease() and fasync_helper().

Then I started to wonder about the current split of functionality between
fcntl_setlease, vfs_setlease and generic_setlease.  The check for no
other process having this file open should be common to all filesystems.
And it should be honoured by NFS (it shouldn't hand out a delegation if
a local user has the file open), so it needs to be in vfs_setlease().

Now I'm worrying about the mechanics of calling out to a filesystem to
perform a lock.  Obviously, a network filesystem is going to have to
sleep waiting for a response from the fileserver, so that precludes the
use of a spinlock held over the call to f_op->setlease.  I'm not keen on
the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite
hard enough without worrying what a filesystem might be doing with it.

So I think we need to refactor the interface, and I'd like to hear your
thoughts on my ideas of how to handle this.

First, have clients of vfs_setlease call lease_alloc() and pass it in,
rather than allocate it on the stack and have this horrible interface
where we may pass back an allocated lock.  This allows us to not allocate
memory (hence sleep) during generic_setlease().

Second, move some of the functionality of generic_setlease() to
vfs_setlease(), as mentioned above.

Third, change the purpose of f_op->setlease.  We can rename it if you
like to catch any out of tree users.  I'd propose using it like this:

vfs_setlease()
   if (f_op->setlease())
  res = f_op->setlease()
  if (res)
 return res;
   lock_kernel()
   generic_setlease()
   unlock_kernel()

fcntl_setlease
   if (f_op->setlease())
  res = f_op->setlease()
  if (res)
 return res;
   lock_kernel
   generic_setlease()
   ... fasync ...
   unlock_kernel

So 'f_op->setlease' now means "Try to get a lease from the fileserver".
We can optimise this a bit to not even call setlease if, say, we already
have a read lease and we're trying to get a second read lease.  But we
have to record our local leases (that way they'll show up in /proc/locks).

I think the combination of these three ideas gives us a sane interface
to the various setlease functions, and let us convert from lock_kernel()
to spin_lock() later.  Have I missed something?  I don't often think
about the needs of cluster filesystems, so I may misunderstand how they
need this operation to work.

At some point, we need to revisit the logic for 'is this file open
by another process' -- it's clearly buggy since it doesn't hold the
inode_lock while checking i_count, so it could race against an __iget().

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Remove BKL from fs/locks.c

2007-12-30 Thread Matthew Wilcox

On Sun, Dec 30, 2007 at 01:05:10PM +, Christoph Hellwig wrote:
> If people are really having any kind of scalability problems with this
> still it should be quite trivial to make the file_lock_list and
> blocked_list aswel as the new file_lock_lock per-superblock as file
> and thus locks never move between superblocks.  In fact I'd probably
> do this even without scalability concerns just to make our fs data
> structures nice per-superblock.

Hrm.  file_lock_list is used for /proc/locks, so that's OK to convert,
we just need to iterate over each super block in the system, acquire the
sb_file_lock_lock, then iterate over each sb_file_lock_list.

The blocked_list is a bit more complex since we need to check every lock
on the blocked list, and would need to acquire all the sb_file_lock_locks
to check this list consistently.  I don't see a nice way to do this --
particularly when you consider that we need to run this check every time
someone takes out a POSIX lock that blocks on another lock.

> > I had to move one memory allocation out from under the file_lock_lock.
> > I hope I got that logic right.  I'm rather tempted to split out the
> > find_conflict algorithm from that function into something that can be
> > called separately for the FL_ACCESS case.
> 
> Yes, splitting that out makes a lot of sense.  Should be a separate
> patch, though.

Indeed.  What you see here is just me hacking until stuff works.

> > +static inline void lock_flocks(void)
> > +{
> > +   spin_lock(&file_lock_lock);
> > +}
> > +
> > +static inline void unlock_flocks(void)
> > +{
> > +   spin_unlock(&file_lock_lock);
> > +}
> 
> I'd rather not introduce this wrappers, they only obsfucated what's
> really going on.

Fair enough.  file_lock_lock is a crappy name though, and I was
embarrassed to use it everywhere.

> > +   if (found) {
> > +   unlock_flocks();
> > cond_resched();
> > +   lock_flocks();
> > +   }
> 
> There's a cond_resched_lock that only drops the lock in case we really
> need to block.

Ooh, thanks, I didn't know about that.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Remove BKL from fs/locks.c

2007-12-30 Thread Matthew Wilcox

On Sun, Dec 30, 2007 at 08:36:44PM +1100, Stephen Rothwell wrote:
> We should probably do some performance testing on this because the last
> time we tried the impact was quite noticeable.  You should ping Tridge as
> he has some good lock testing setups.  And he cares if we slow him down :-)

Last time I did this, I switched to a semaphore instead of a spinlock.
That was what slowed us down.  I doubt we can see a performance loss
with this patch since it's a 1-1 substitution of the BKL spinlock with a
private spinlock.

Good idea about asking tridge for an evaluation of the patch though.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC] Remove BKL from fs/locks.c

2007-12-29 Thread Matthew Wilcox


I've been promising to do this for about seven years now.

It seems to work well enough, but I haven't run any serious stress
tests on it.  This implementation uses one spinlock to protect both lock
lists and all the i_flock chains.  It doesn't seem worth splitting up
the locking any further.

I had to move one memory allocation out from under the file_lock_lock.
I hope I got that logic right.  I'm rather tempted to split out the
find_conflict algorithm from that function into something that can be
called separately for the FL_ACCESS case.

I also have to drop and reacquire the file_lock_lock around the call
to cond_resched().  This was done automatically for us before by the
special BKL semantics.

I had to change vfs_setlease() as it relied on the special BKL ability
to recursively acquire the same lock.  The internal caller now calls
__vfs_setlease and the exported interface acquires and releases the
file_lock_lock around calling __vfs_setlease.

I should probably split out the removal of interruptible_sleep_on_locked()
as it's basically unrelated to all this.

diff --git a/fs/locks.c b/fs/locks.c
index 8b8388e..68de569 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -139,9 +139,23 @@ int lease_break_time = 45;
 #define for_each_lock(inode, lockp) \
for (lockp = &inode->i_flock; *lockp != NULL; lockp = 
&(*lockp)->fl_next)
 
+/*
+ * Protects the two list heads below, plus the inode->i_flock list
+ */
+static DEFINE_SPINLOCK(file_lock_lock);
 static LIST_HEAD(file_lock_list);
 static LIST_HEAD(blocked_list);
 
+static inline void lock_flocks(void)
+{
+   spin_lock(&file_lock_lock);
+}
+
+static inline void unlock_flocks(void)
+{
+   spin_unlock(&file_lock_lock);
+}
+
 static struct kmem_cache *filelock_cache __read_mostly;
 
 /* Allocate an empty lock structure. */
@@ -507,9 +521,9 @@ static void __locks_delete_block(struct file_lock *waiter)
  */
 static void locks_delete_block(struct file_lock *waiter)
 {
-   lock_kernel();
+   lock_flocks();
__locks_delete_block(waiter);
-   unlock_kernel();
+   unlock_flocks();
 }
 
 /* Insert waiter into blocker's block list.
@@ -634,29 +648,15 @@ static int flock_locks_conflict(struct file_lock 
*caller_fl, struct file_lock *s
return (locks_conflict(caller_fl, sys_fl));
 }
 
-static int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, int 
timeout)
-{
-   int result = 0;
-   DECLARE_WAITQUEUE(wait, current);
-
-   __set_current_state(TASK_INTERRUPTIBLE);
-   add_wait_queue(fl_wait, &wait);
-   if (timeout == 0)
-   schedule();
-   else
-   result = schedule_timeout(timeout);
-   if (signal_pending(current))
-   result = -ERESTARTSYS;
-   remove_wait_queue(fl_wait, &wait);
-   __set_current_state(TASK_RUNNING);
-   return result;
-}
-
-static int locks_block_on_timeout(struct file_lock *blocker, struct file_lock 
*waiter, int time)
+static int locks_block_on_timeout(struct file_lock *blocker,
+   struct file_lock *waiter, int time)
 {
int result;
locks_insert_block(blocker, waiter);
-   result = interruptible_sleep_on_locked(&waiter->fl_wait, time);
+   unlock_flocks();
+   result = wait_event_interruptible_timeout(waiter->fl_wait,
+   !waiter->fl_next, time);
+   lock_flocks();
__locks_delete_block(waiter);
return result;
 }
@@ -666,7 +666,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
 {
struct file_lock *cfl;
 
-   lock_kernel();
+   lock_flocks();
for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = 
cfl->fl_next) {
if (!IS_POSIX(cfl))
continue;
@@ -677,7 +677,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
__locks_copy_lock(fl, cfl);
else
fl->fl_type = F_UNLCK;
-   unlock_kernel();
+   unlock_flocks();
return;
 }
 
@@ -741,18 +741,16 @@ static int flock_lock_file(struct file *filp, struct 
file_lock *request)
int error = 0;
int found = 0;
 
-   lock_kernel();
-   if (request->fl_flags & FL_ACCESS)
-   goto find_conflict;
-
-   if (request->fl_type != F_UNLCK) {
-   error = -ENOMEM;
+   if (!(request->fl_flags & FL_ACCESS) && (request->fl_type != F_UNLCK)) {
new_fl = locks_alloc_lock();
-   if (new_fl == NULL)
-   goto out;
-   error = 0;
+   if (!new_fl)
+   return -ENOMEM;
}
 
+   lock_flocks();
+   if (request->fl_flags & FL_ACCESS)
+   goto find_conflict;
+
for_each_lock(inode, before) {
struct file_lock *fl = *before;
if (IS_POSIX(fl))
@@ -778,8 +776,11 @@ static int flock_lock_file(struct file *filp, struct 
file_lock *request)
 * If a h

Re: NFS Killable tasks request comments on patch

2007-12-06 Thread Matthew Wilcox

On Thu, Dec 06, 2007 at 10:34:26AM -0700, Matthew Wilcox wrote:
> I've done a more thorough review of Liam's work and come up with a few
> more fixes (which I'll publish later today)

I've put up a git tree for this work; see
http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=task_killable

I've split up some of the earlier patches, hopefully in a way which will
make akpm less grumpy.  Andrew, let me know if I've misunderstood how
you want to see it.

Here's the commit for the NFS people to chew on a bit.

commit f05b88f294044cdf56fbe637a246ba6c7a14d6f1
Author: Matthew Wilcox <[EMAIL PROTECTED]>
Date:   Thu Dec 6 16:24:39 2007 -0500

NFS: Switch from intr mount option to TASK_KILLABLE

By using the TASK_KILLABLE infrastructure, we can get rid of the 'intr'
mount option.  We have to use _killable everywhere instead of _interruptible
as we get rid of rpc_clnt_sigmask/sigunmask.

Signed-off-by: Liam R. Howlett <[EMAIL PROTECTED]>
Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 70587f3..310fa2f 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -302,7 +302,7 @@ found_client:
if (new)
nfs_free_client(new);
 
-   error = wait_event_interruptible(nfs_client_active_wq,
+   error = wait_event_killable(nfs_client_active_wq,
clp->cl_cons_state != NFS_CS_INITING);
if (error < 0) {
nfs_put_client(clp);
@@ -494,10 +494,6 @@ static int nfs_init_server_rpcclient(struct nfs_server 
*server, rpc_authflavor_t
if (server->flags & NFS_MOUNT_SOFT)
server->client->cl_softrtry = 1;
 
-   server->client->cl_intr = 0;
-   if (server->flags & NFS4_MOUNT_INTR)
-   server->client->cl_intr = 1;
-
return 0;
 }
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 5e8d82f..7b994b2 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -193,7 +193,7 @@ static ssize_t nfs_direct_wait(struct nfs_direct_req *dreq)
if (dreq->iocb)
goto out;
 
-   result = wait_for_completion_interruptible(&dreq->completion);
+   result = wait_for_completion_killable(&dreq->completion);
 
if (!result)
result = dreq->error;
@@ -391,9 +391,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const 
struct iovec *iov,
   unsigned long nr_segs, loff_t pos)
 {
ssize_t result = 0;
-   sigset_t oldset;
struct inode *inode = iocb->ki_filp->f_mapping->host;
-   struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_direct_req *dreq;
 
dreq = nfs_direct_req_alloc();
@@ -405,11 +403,9 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const 
struct iovec *iov,
if (!is_sync_kiocb(iocb))
dreq->iocb = iocb;
 
-   rpc_clnt_sigmask(clnt, &oldset);
result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos);
if (!result)
result = nfs_direct_wait(dreq);
-   rpc_clnt_sigunmask(clnt, &oldset);
nfs_direct_req_release(dreq);
 
return result;
@@ -767,9 +763,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const 
struct iovec *iov,
size_t count)
 {
ssize_t result = 0;
-   sigset_t oldset;
struct inode *inode = iocb->ki_filp->f_mapping->host;
-   struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_direct_req *dreq;
size_t wsize = NFS_SERVER(inode)->wsize;
int sync = 0;
@@ -787,11 +781,9 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const 
struct iovec *iov,
if (!is_sync_kiocb(iocb))
dreq->iocb = iocb;
 
-   rpc_clnt_sigmask(clnt, &oldset);
result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, sync);
if (!result)
result = nfs_direct_wait(dreq);
-   rpc_clnt_sigunmask(clnt, &oldset);
nfs_direct_req_release(dreq);
 
return result;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index db5d96d..f68c222 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -433,15 +433,11 @@ static int nfs_wait_schedule(void *word)
  */
 static int nfs_wait_on_inode(struct inode *inode)
 {
-   struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_inode *nfsi = NFS_I(inode);
-   sigset_t oldmask;
int error;
 
-   rpc_clnt_sigmask(clnt, &oldmask);
error = wait_on_bit_lock(&nfsi->flags, NFS_INO_REVALIDATING,
-   nfs_wait_schedule, TASK_INTERRUPTIBLE);
-   rpc_clnt_sigunmask(clnt, &oldmask);
+   nfs_wait_schedule, TASK_KILLABLE);
 
return error;
 }
diff --git a/fs/nfs/mou

Re: Massive slowdown when re-querying large nfs dir

2007-11-04 Thread Matthew Wilcox

On Mon, Nov 05, 2007 at 07:58:38AM +0300, Al Boldi wrote:
> Any ideas?

How about tcpdumping and seeing what requests are flowing across the
wire?  You might be able to figure out what's being done differently.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox

On Sun, Oct 28, 2007 at 09:38:55PM +, Alan Cox wrote:
> > It doesn't require the system to detect it, only mandate what to return
> > if it does detect it.
> 
> We should be detecting at least the obvious case.

What is the obvious case?  A task that has never called clone()?

> If SYSV only spots simple AB - BA deadlocks or taking the same lock twice
> yourself then that ought to be sufficient for us too.

You can't deadlock against yourself -- either with POSIX locks or BSD
locks (POSIX simple upgrades/downgrades the lock; each byte in a file
can only be in either read-locked state or write-locked state for a
given process.  BSD locks release the lock and wake all waiters before
attempting to acquire the lock in its new state).  In my other post, I
showed a simple AB-BA deadlock that can't be easily detected.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox

On Sun, Oct 28, 2007 at 11:55:52PM +0100, Jiri Kosina wrote:
> On Sun, 28 Oct 2007, Matthew Wilcox wrote:
> 
> > Bzzt.  You get a false deadlock with multiple threads like so:
> > Thread A of task B takes lock 1
> > Thread C of task D takes lock 2
> > Thread C of task D blocks on lock 1
> > Thread E of task B blocks on lock 2
> 
>   A potential for deadlock occurs if a process controlling a locked 
>   region is put to sleep by attempting to lock another process' 
>   locked region. If the system detects that sleeping until a locked 
>   region is unlocked would cause a deadlock, fcntl() shall fail with 
>   an [EDEADLK] error.
> 
> This is what POSIX says [1], even after being modified with respect to 
> POSIX Threads Extension, right?
> 
> So it doesn't deal with threads at all, just processess are taken into 
> account. Probably for a reason :)

Did you have a concrete suggestion, or are you just quoting the spec?

The problem is that it's nonsense -- processes don't sleep, threads do.
I think the key is "would deadlock", not "might deadlock".  Our current
behaviour is clearly in violation of SuSv3.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox

On Sun, Oct 28, 2007 at 10:48:33PM +, Alan Cox wrote:
> > Bzzt.  You get a false deadlock with multiple threads like so:
> > 
> > Thread A of task B takes lock 1
> > Thread C of task D takes lock 2
> > Thread C of task D blocks on lock 1
> > Thread E of task B blocks on lock 2
> 
> The spec and SYSV certainly ignore threading in this situation and you
> know that perfectly well (or did in 2004)

The discussion petered out (or that mailing list archive lost articles
from the thread) without any kind of resolution, or indeed interest.
What is your suggestion for handling this problem?  As it is now, the
kernel 'detects' deadlock where there is none, which doesn't seem
allowed by SuSv3 either.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox

On Sun, Oct 28, 2007 at 05:50:30PM -0400, Trond Myklebust wrote:
> > You can't fix the false EDEADLK detection without solving the halting
> > problem.  Best of luck with that.
> 
> I can see that it would be difficult to do efficiently, but basically,
> this boils down to finding a circular path in a graph. That is hardly an
> unsolvable issue...

Bzzt.  You get a false deadlock with multiple threads like so:

Thread A of task B takes lock 1
Thread C of task D takes lock 2
Thread C of task D blocks on lock 1
Thread E of task B blocks on lock 2

We currently declare deadlock at this point (unless the deadlock detection
code has changed since I last looked at it), despite thread A being about
to release lock 1.  Oh, and by the way, thread E is capable of releasing
lock 1, so you can't just say "well, detect by thread instead of by task".

So the only way I can see to accurately detect deadlock is to simulate
the future execution of all threads in task B to see if any of them
will release lock 1 without first gaining lock 2.  Which, I believe,
is halting-equivalent.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox

On Sun, Oct 28, 2007 at 06:40:52PM +, Alan Cox wrote:
> NAK. This is an ABI change. It was also comprehensively rejected before
> because
> 
> - EDEADLK behaviour is ABI

Not in any meaningful way.

> - EDEADLK behaviour is required by SuSv3

What SuSv3 actually says is:

If the system detects that sleeping until a locked region is
unlocked would cause a deadlock, fcntl() shall fail with an
[EDEADLK] error.

It doesn't require the system to detect it, only mandate what to return
if it does detect it.

> - We have no idea what applications may rely on this behaviour.

I've never heard of one that does.

> so we need to fix the bugs - the lock usage and the looping. At that
> point it merely becomes a performance concern to those who use it, which
> is the proper behaviour. If you want a faster non-checking one use
> flock(), or add another flag that is a Linux "don't check for deadlock"

You can't fix the false EDEADLK detection without solving the halting
problem.  Best of luck with that.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox

On Sun, Oct 28, 2007 at 01:43:21PM -0400, J. Bruce Fields wrote:
> We currently attempt to return -EDEALK to blocking fcntl() file locking
> requests that would create a cycle in the graph of tasks waiting on
> locks.
> 
> This is inefficient: in the general case it requires us determining
> whether we're adding a cycle to an arbitrary directed acyclic graph.
> And this calculation has to be performed while holding a lock (currently
> the BKL) that prevents that graph from changing.
> 
> It has historically been a source of bugs; most recently it was noticed
> that it could loop indefinitely while holding the BKL.

It can also return -EDEADLK spuriously.  So yeah, just kill it.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SLUB performance regression vs SLAB

2007-10-05 Thread Matthew Wilcox

On Fri, Oct 05, 2007 at 08:48:53AM +0200, Jens Axboe wrote:
> I'd like to second Davids emails here, this is a serious problem. Having
> a reproducible test case lowers the barrier for getting the problem
> fixed by orders of magnitude. It's the difference between the problem
> getting fixed in a day or two and it potentially lingering for months,
> because email ping-pong takes forever and "the test team has moved on to
> other tests, we'll let you know the results of test foo in 3 weeks time
> when we have a new slot on the box" just removing any developer
> motivation to work on the issue.

I vaguely remembered something called orasim, so I went looking for it.
I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
seems to be a stillborn project.  Is there anything else I should know
about orasim?  ;-)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] isofs: add +w bit for non-RR discs

2007-10-04 Thread Matthew Wilcox

On Tue, Oct 02, 2007 at 08:00:26PM +0200, Jan Engelhardt wrote:
> Add %S_IWUGO bit for files on ISO-9660 filesystems without RockRidge

Looks to me like you've added S_IWUSR, not S_IWUGO.

> - popt->mode = S_IRUGO | S_IXUGO; /*
> + popt->mode = S_IRUGO | S_IWUSR | S_IXUGO;
> - inode->i_mode = S_IRUGO | S_IXUGO | S_IFDIR;
> + inode->i_mode = S_IRUGO | S_IWUSR | S_IXUGO | S_IFDIR;

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SLUB performance regression vs SLAB

2007-10-04 Thread Matthew Wilcox

On Thu, Oct 04, 2007 at 01:55:37PM -0700, David Miller wrote:
> Anything, I do mean anything, can be simulated using small test
> programs.  Pointing at a big fancy machine with lots of storage
> and disk is a passive aggressive way to avoid the real issues,
> in that nobody is putting forth the effort to try and come up
> with an at least publishable test case that Christoph can use to
> help you guys.
> 
> If coming up with a reproducable and publishable test case is
> the difference between this getting fixed and it not getting
> fixed, are you going to invest the time to do that?

If that's what it takes, then yes.  But I'm far from convinced that
it's as easy to come up with a TPC benchmark simulator as you think.
There have been efforts in the past (orasim, for example), but
presumably Christoph has already tried these benchmarks.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SLUB performance regression vs SLAB

2007-10-04 Thread Matthew Wilcox

On Thu, Oct 04, 2007 at 01:48:34PM -0700, David Miller wrote:
> There comes a point where it is the reporter's responsibility to help
> the developer come up with a publishable test case the developer can
> use to work on fixing the problem and help ensure it stays fixed.

That's a lot of effort.  Is it more effort than doing some remote
debugging with Christoph?  I don't know.

> Using an unpublishable benchmark, whose results even cannot be
> published, really stretches the limits of "reasonable" don't you
> think?

Yet here we stand.  Christoph is aggressively trying to get slab removed
from the tree.  There is a testcase which shows slub performing worse
than slab.  It's not my fault I can't publish it.  And just because I
can't publish it doesn't mean it doesn't exist.

Slab needs to not get removed until slub is as good a performer on this
benchmark.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SLUB performance regression vs SLAB

2007-10-04 Thread Matthew Wilcox

On Thu, Oct 04, 2007 at 10:49:52AM -0700, Christoph Lameter wrote:
> I was not aware of that. Would it be possible for you to summarize all the 
> test data that you have right now about SLUB vs. SLAB with the patches 
> listed?  Exactly what kernel version and what version of the per cpu 
> patches were tested?

We have three runs, all with 2.6.23-rc3 plus the patches that Suresh
applied from 20070922.  The first run is with slab.  The second run is
with SLUB and the third run is SLUB plus the tuning parameters you
recommended.

I have a spreadsheet with Vtune data in it that was collected during
each of these test runs, so we can see which functions are the hottest.
I can grab that data and send it to you, if that's interesting.

> Was the page allocator pass through patchset 
> separately applied as I requested?

I don't believe so.  Suresh?

I think for future tests, it would be easiest if you send me a git
reference.  That way we will all know precisely what is being tested.

> Finally: Is there some way that I can reproduce the tests on my machines?

As usual for these kinds of setups ... take a two-CPU machine, 64GB
of memory, half a dozen fibre channel adapters, about 3000 discs,
a commercial database, a team of experts for three months worth of
tuning ...

I don't know if anyone's tried to replicate a benchmark like this using
Postgres.  Would be nice if they have ...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SLUB performance regression vs SLAB

2007-10-04 Thread Matthew Wilcox

On Thu, Oct 04, 2007 at 10:38:15AM -0700, Christoph Lameter wrote:
> On Thu, 4 Oct 2007, Matthew Wilcox wrote:
> 
> > So, on "a well-known OLTP benchmark which prohibits publishing absolute
> > numbers" and on an x86-64 system (I don't think exactly which model
> > is important), we're seeing *6.51%* performance loss on slub vs slab.
> > This is with a 2.6.23-rc3 kernel.  Tuning the boot parameters, as you've
> > asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8)
> > gets back 0.38% of that.  It's still down 6.13% over slab.
> 
> Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded 
> earlier. Seems that we are mainly seeing cacheline bouncing due to two 
> cpus accessing meta data in the same page struct. The patches in 
> MM that are scheduled to be merged for .24 address that issue. I 
> have repeatedly asked that these patches be tested. The patches were 
> posted months ago.

I just checked with the guys who did the test.  When I said -rc3, I
mis-spoke; this is 2.6.23-rc3 *plus* the patches which Suresh agreed to
test for you.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

SLUB performance regression vs SLAB

2007-10-04 Thread Matthew Wilcox

On Mon, Oct 01, 2007 at 01:50:44PM -0700, Christoph Lameter wrote:
> The problem is with the weird way of Intel testing and communication. 
> Every 3-6 month or so they will tell you the system is X% up or down on 
> arch Y (and they wont give you details because its somehow secret). And 
> then there are conflicting statements by the two or so performance test 
> departments. One of them repeatedly assured me that they do not see any 
> regressions.

Could you cut out the snarky remarks?  It takes a long time to run a
test, and testing every one of the patches you send really isn't high
on anyone's priority list.  The performance team have also been having
problems getting stable results with recent kernels, adding to the delay.
The good news is that we do now have committment to testing upstream
kernels, so you should see results more frequently than you have been.

I'm taking over from Suresh as liason for the performance team, so
if you hear *anything* from *anyone* else at Intel about performance,
I want you to cc me about it.  OK?  And I don't want to hear any more
whining about hearing different things from different people.

So, on "a well-known OLTP benchmark which prohibits publishing absolute
numbers" and on an x86-64 system (I don't think exactly which model
is important), we're seeing *6.51%* performance loss on slub vs slab.
This is with a 2.6.23-rc3 kernel.  Tuning the boot parameters, as you've
asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8)
gets back 0.38% of that.  It's still down 6.13% over slab.

For what it's worth, 2.6.23-rc3 already has a 1.19% regression versus
RHEL 4.5, so the performance guys are really unhappy about going up to
almost 8% regression.

In the detailed profiles, __slab_free is the third most expensive
function, behind only spin locks.  get_partial_node is right behind it
in fourth place, and kmem_cache_alloc is sixth.  __slab_alloc is eight
and kmem_cache_free is tenth.  These positions don't change with the
slub boot parameters.

Now, where do we go next?  I suspect that 2.6.23-rc9 has significant
changes since -rc3, but I'd like to confirm that before kicking off
another (expensive) run.  Please, tell me what useful kernels are to test.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Matthew Wilcox

On Thu, Sep 27, 2007 at 07:19:27PM -0400, Theodore Tso wrote:
> Would you accept a patch which causes the deprecated sysfs
> files/directories to disappear, even if CONFIG_SYS_DEPRECATED is
> defined, via a boot-time parameter?

How about a mount option?  That way people can test without a reboot:

mount -o remount,deprecated={yes,no} /sys

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Matthew Wilcox

On Thu, Sep 27, 2007 at 02:37:42PM -0400, Theodore Tso wrote:
> I'm reminded of Rusty's 2003 OLS Keynote, where he points out that
> what's important is not making an interface easy to use, but _hard_
> _to_ _misuse_.  That fact that sysfs is all laid out in a directory,
> but for which some directories/symlinks are OK to use, and some are
> NOT OK to use --- is why I call the sysfs interface "an open pit".
> Sure, if you have the map to the minefield, a minefield is perfectly
> safe when you know what to avoid.  But is that the best way to
> construct a path/interface for an application programmer to get from
> point A to point B?  Maybe, maybe not.

I agree with your criticism of sysfs (indeed, I think I've made many of
them before in somewhat stronger language), but do you have a suggestion
as to an interface that would be harder to misuse?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/2] VFS: new fgetattr() file operation

2007-09-24 Thread Matthew Wilcox

On Mon, Sep 24, 2007 at 03:06:06PM +0200, Miklos Szeredi wrote:
> A file isn't deleted while there are still links or open files
> refering to it.  So getting the attributes for a file with nlink==0 is
> perfectly valid while the file is still open.

Is it?  Why not just pretend that the attributes are wiped when the file
is deleted.  Effectively, they are, since they can't affect anything.

> If a network filesystem protocol can't handle operations (be it data
> or metadata) on an unlinked file, we must do sillirenaming, so that
> the file is not actually unlinked.

Or you could call getattr right before you unlink and cache the result
in the client.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/2] VFS: new fgetattr() file operation

2007-09-24 Thread Matthew Wilcox

On Mon, Sep 24, 2007 at 02:48:08PM +0200, Miklos Szeredi wrote:
> > and if that means adding silly rename support so be it.
> 
> That's what is done currently.
> 
> But it's has various dawbacks, like rmdir doesn't work if there are
> open files within an otherwise empty directory.
> 
> I'd happily accept suggestions on how to deal with this differenty.

Only sillyrename files with nlink > 1?  I don't see how attributes can
change anything for a deleted file.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/2] avoid clobbering registers with J_ASSERT macro

2007-08-20 Thread Matthew Wilcox

On Mon, Aug 20, 2007 at 04:22:04PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, 2007-08-20 at 09:18 -0400, Chris Snook wrote:
> 
> > > How's about we just remove that printk?  Do
> > > 
> > > #define J_ASSERT(e) BUG_ON(e)?

ITYM #define J_ASSERT(e) BUG_ON(!e)

> It did.  The original J_ASSERT predates BUG() entirely, and was added so
> that we got the file/line-no information.  But with the current BUG()
> macro, I can't see any reason for J_ASSERT still to try to gather that
> information itself.

Do you still want to keep J_ASSERT, or should all uses of it be replaced
with BUG_ON?

(to put it another way; if you were writing JBD now, would you add your
own J_ASSERT, or would you just use BUG_ON directly?)

-- 
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] VFS: data=ordered (was: [Advocacy] Re: 3ware 9650 tips)

2007-07-16 Thread Matthew Wilcox

On Mon, Jul 16, 2007 at 09:28:08PM +0300, Al Boldi wrote:
> Well, conceptually it sounds like a piece of cake, technically your guess is 
> as good as mine.  IIRC, akpm once mentioned something like this.

How much have you looked at the VFS?  There's nothing journalling-related
in the VFS right now.  ext3 and XFS share no common journalling code,
nor do I think that would be possible, due to the very different concepts
they have of journalling.

Here's a good hint:

$ find fs -type f |xargs grep -l journal_start
fs/ext3/acl.c
fs/ext3/inode.c
fs/ext3/ioctl.c
fs/ext3/namei.c
fs/ext3/resize.c
fs/ext3/super.c
fs/ext3/xattr.c
fs/ext4/acl.c
fs/ext4/extents.c
fs/ext4/inode.c
fs/ext4/ioctl.c
fs/ext4/namei.c
fs/ext4/resize.c
fs/ext4/super.c
fs/ext4/xattr.c
fs/jbd/journal.c
fs/jbd/transaction.c
fs/jbd2/journal.c
fs/jbd2/transaction.c
fs/ocfs2/journal.c
fs/ocfs2/super.c

JBD and JBD2 provide a journalling implementation that ext3, ext4 and
ocfs2 use.  Note that XFS doesn't, it has its own journalling code.

If you want XFS to support data=ordered, talk to the XFS folks.  Or
start picking through XFS yourself, of course -- you do have the source
code.

-- 
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Advocacy] Re: 3ware 9650 tips

2007-07-16 Thread Matthew Wilcox

On Mon, Jul 16, 2007 at 08:40:00PM +0300, Al Boldi wrote:
> XFS surely rocks, but it's missing one critical component: data=ordered
> And that's one component that's just too critical to overlook for an 
> enterprise environment that is built on data-integrity over performance.
> 
> So that's the secret why people still use ext3, and XFS' reliance on external 
> hardware to ensure integrity is really misplaced.
> 
> Now, maybe when we get the data=ordered onto the VFS level, then maybe XFS 
> may become viable for the enterprise, and ext3 may cease to be KING.

Wow, thanks for bringing an advocacy thread onto linux-fsdevel.  Just what
we wanted.  Do you have any insight into how to "get the data=ordered
onto the VFS level"?  Because to me, that sounds like pure nonsense.

-- 
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [ANNOUNCE] util-linux-ng 2.13-rc1

2007-07-05 Thread Matthew Wilcox

On Thu, Jul 05, 2007 at 11:30:20PM +0200, Karel Zak wrote:
> > > >  The package build system is now based on autotools. The build system
> > > >  supports  separate CFLAGS and LDFLAGS for suid programs (SUID_CFLAGS,
> > > >  SUID_LDFLAGS). For more details see the README file
> > >
> > > And this is really dumb.  autotools is a completely pain in the ass and
> 
>  Well, Adrian Bunk added autotools stuff to util-linux during his work
>  on v2.13. This stuff has been fixed and stabilized in util-linux-ng
>  v2.13.
> 
>  I'm not fanatical autotools protagonist, but it seems useful in
>  util-linux. We will see...
> 
>  I'm ready to change or fix arbitrary thing in util-linux-ng, but I
>  always need a real reason -- bug report, new feature, or so. This
>  discussion is about impressions and feelings only.

No, it's based on long, hard and painful experiences attempting to debug
the fuckups that autoconf creates when it goes wrong.

-- 
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Versioning file system

2007-06-19 Thread Matthew Wilcox

On Tue, Jun 19, 2007 at 04:34:42PM -0400, John Stoffel wrote:
> > "Jack" == Jack Stone <[EMAIL PROTECTED]> writes:
> 
> Jack> The whole idea of the file system is that it wouldn't return the
> Jack> file in the file listing. The user would have to know that the
> Jack> file system was versioning to access the older versions as they
> Jack> would explicitly have to request them.
> 
> So tell me what happens when I do:
> 
>   touch foo:121212121212

You create a new file called 'foo:121212121212'.  If you then modify it,
you could access the old version as foo:121212121212:0.

(The .snapshot directory makes more sense than magic names, since you
can see what versions of a file are available without a special tool).
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Read/write counts

2007-06-04 Thread Matthew Wilcox

On Mon, Jun 04, 2007 at 09:56:07AM -0700, Bryan Henderson wrote:
> Programs that assume a full transfer are fairly common, but are 
> universally regarded as either broken or just lazy, and when it does cause 
> a problem, it is far more common to fix the application than the kernel.

Linus has explicitly forbidden short reads from being returned.  The
original poster may get away with it for a specialised case, but for
example, signals may not cause a return to userspace with a short read
for exactly this reason.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [NFS] [PATCH] locks: provide a file lease method enabling cluster-coherent leases

2007-06-01 Thread Matthew Wilcox

On Fri, Jun 01, 2007 at 12:44:16PM -0400, J. Bruce Fields wrote:
> The only problem I'm aware of is that leases aren't broken on rename,
> link, and unlink.  This is kind of tricky to fix.  David Richter (cc'd)
> and I sketched out a few different approaches, and I think he has some
> patches implementing at least one of them.
> 
> This may require defining some new type of lease, to avoid changing the
> documented behavior of the fcntl lease operations (which only break on
> open).  Although actually I believe Samba needs the same behavior we do,
> and they're probably the most important user of leases

Samba internally prohibits renaming or deleting an open file, to match
Windows semantics.  So it won't notice the difference.  At least, that's
what I remember from a discussion with Tridge when we were implementing
leases back in 2000.

I think it's an acceptable change in Linux semantics to break leases on
rename/delete/link.  Though I'm not quite sure why you need to -- nobody
else is touching the contents of the file, so it's not like you need to
write the data back to it, or discard your cached copy of it in the case
of a read-only lease.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create

2007-04-16 Thread Matthew Wilcox

On Mon, Apr 16, 2007 at 06:11:30PM +0200, Andreas Gruenbacher wrote:
> +static inline int
> +nfsd_do_create(struct inode *dir, struct dentry *child, struct vfsmount *mnt,
> +int mode)
> +{
> + struct nameidata nd = {
> + .dentry = child,
> + .mnt = mnt,
> + };
> +
> + return vfs_create(dir, child, mode, &nd);
> +}
> +

Wouldn't it normally result in fewer instructions (on most architectures
... maybe not x86) to keep the same argument order as vfs_create?  ie:

static inline int nfsd_do_create(struct inode *dir, struct dentry *child,
 int mode, struct vfsmount *mnt)

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: REISER4 FOR INCLUSION IN THE LINUX KERNEL.

2007-04-09 Thread Matthew Wilcox

On Sun, Apr 08, 2007 at 09:16:59PM -0700, [EMAIL PROTECTED] wrote:
> REISER4 FOR INCLUSION IN THE LINUX KERNEL.

Fuck off.  Cheerleading like this only hurts your cause.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: forced umount?

2007-03-18 Thread Matthew Wilcox

On Sun, Mar 18, 2007 at 08:16:19PM +0100, Arjan van de Ven wrote:
> the problem with the people who say they want forced umount is.. that
> most of the time they either want
> 1) get rid of the namespace entry
> or
> 2) want to stop any and all IO to a certain device/partition 

There is a third component - they want to deliver a fatal signal to
all processes which are waiting on IO to that sb.  My scenario here is a
machine with an NFS mount of a server which has gone down.  Cronjobs which
scan the whole filesystem (eg updatedb) soon pile up sleeping on access.
Equally, if one has one's ogg collection stored on said NFS server, the
ogg player will be in uninterruptible sleep while holding the sound device
open, preventing other applications from making sounds.  It's desirable
to be able to kill these apps dead, and the usual suggestion of 'mount
it soft,intr' isn't the greatest idea (and somewhat hard to change after
the fact).

I remember Linus suggesting a sleeping state between UNINTERRUPTIBLE and
INTERRUPTIBLE which would be FATAL_SIGNALS_ONLY.  The usual problem (of
short reads) isn't a problem if the task is only going to die when it
receives them.  Has anyone investigated this in any detail?  Perhaps
I'll take a look at doing it next week.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] sys_fallocate() system call

2007-03-17 Thread Matthew Wilcox

On Fri, Mar 16, 2007 at 05:17:04PM +0100, Heiko Carstens wrote:
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> e.g.
> 
> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len, int mode)
> 
> would work even on s390 ;)

How about:

asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
u32 len_low, u32 len_high);

That way we all suffer equally ...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] sys_fallocate() system call

2007-03-17 Thread Matthew Wilcox

On Sat, Mar 17, 2007 at 08:59:05PM +1100, Paul Mackerras wrote:
> ... but wouldn't work on 32-bit powerpc. :(  We would end up with a
> pad argument between fd and offset, giving 7 arguments in all
> (counting the loff_t's as 2), but we only support 6.

Ditto mips and parisc.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Why should we teach students Linux??

2007-03-06 Thread Matthew Wilcox

On Tue, Mar 06, 2007 at 06:43:34PM +0100, Roel Bindels wrote:
> Ok, but don't you guys have some fact I can build a case with?? I don't
> think that my college's will take me serious if I come with this store,
> even it is true ;)

You are crossposting far too widely to some wildly inappropriate lists
for this subject.  Try to find an advocacy group, or a local user group.
They can be of more help in this.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Rename file over another with the same inode number fails silently.

2007-02-08 Thread Matthew Wilcox

On Thu, Feb 08, 2007 at 11:48:16AM -0500, John Muir wrote:
> The attached test program creates a file, and then some hard links to
> that file (file0 - fileN).
> The test program then attempts to rename(fileN, file) for every hard
> link created.
> 
> My expectation is that the hard links file0 - fileN would simply
> disappear, or that rename would respond with an error result and an
> appropriate errno value indicating the problem.
> 
> My observation is that the hard links file0 to fileN do not in fact
> disappear and rename returns 0.
> 
> Do I have the wrong expectations? If so, why should I have to stat the
> files to determine if they are the same inode before I rename?

This is a POSIX requirement.  See

http://www.opengroup.org/onlinepubs/009695399/functions/rename.html

where it says,

"If the old argument and the new argument resolve to the same existing
file, rename() shall return successfully and perform no other action."

There's not really much point in discussing whether your expectations
are wrong or whether the extra work you have to do is silly ... POSIX
says so, so we have to behave this way.

Personally, I think this is an unexpected oversight on POSIX's part.
The rationale refers to 'rename("x", "x");', and doesn't discuss hard
links.  Possibly something to raise with the POSIX ctte, but I doubt
they'd be interested in changing this now ... perhaps if you can find
some other unices which behave differently, you might have a shot.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Symbolic links vs hard links

2007-01-10 Thread Matthew Wilcox

On Wed, Jan 10, 2007 at 09:38:11AM -0800, Bryan Henderson wrote:
> >Other people are of the opinion that the invention of the symbolic link
> >was a huge mistake.
> 
> I guess I haven't heard that one.  What is the argument that we were 
> better off without symbolic links?

I suppose http://www.cs.bell-labs.com/sys/doc/lexnames.html is as good
a presentation of that argument as any, though there's also some good
stuff in the Unix Haters Handbook (pages 164-165 in particular).

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Finding hardlinks

2007-01-09 Thread Matthew Wilcox

On Tue, Jan 09, 2007 at 03:43:14PM -0800, Bryan Henderson wrote:
> >but you can get a large number of >1 linked
> >files, when you copy full directories with "cp -rl".  Which I do a lot
> >when developing. I've done that a few times with the Linux tree.
> 
> Can you shed some light on how you use this technique?  (I.e. what does it 
> do for you?)

A couple of weeks ago, a colleague asked me to generate a patch against
his tree in a hurry.  I did cp -rl his-tree my-tree (which completed
quickly), edited the two files that needed to be patched, then did
diff -urp his-tree my-tree, which also completed quickly, as diff knows
that if two files have the same inode, they don't need to be opened.

> Many people are of the opinion that since the invention of symbolic links, 
> multiple hard links to files have been more trouble than they're worth.  I 

Other people are of the opinion that the invention of the symbolic link
was a huge mistake.  Po-tato, pot-ato ...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ext2: conditional removal of NFSD code

2007-01-06 Thread Matthew Wilcox

On Sat, Jan 06, 2007 at 10:58:31PM +0300, Alexey Dobriyan wrote:
> Nor me nor my box is going to act as NFS server, so ifdef all
> exporting code.

> @@ -916,7 +918,9 @@ static int ext2_fill_super(struct super_
>* set up enough so that it can read an inode
>*/
>   sb->s_op = &ext2_sops;
> +#if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE)
>   sb->s_export_op = &ext2_export_ops;
> +#endif

To avoid putting ifdefs within a function, how about adding:

#if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE)
#define set_export_ops(sb, ops) sb->s_export_op = ops
#else
#define set_export_ops(sb, ops) 0
#endif

That way you can get rid of the function pointer from the struct
superblock too.

But Dave Woodhouse is going to kill you for adding another
CONFIG_*_MODULE dependency.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Is a NULL check missing in nfs_lookup?

2007-01-05 Thread Matthew Wilcox

On Fri, Jan 05, 2007 at 02:10:06PM -0500, Erez Zadok wrote:
> Ah, ok.  So why not put an ASSERT in there, or at least a comment, to make
> the code clearer.  As it stands, anyone looking at the code in the future
> can easily rediscover this "bug" that dereferences a null ptr.

Because anyone poking in the VFS should take the time to understand
how it works?  Adding crap like BUG_ON(!nd) is pointless -- you don't
get a clearer backtrace from that than you do from a null pointer.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Is a NULL check missing in nfs_lookup?

2007-01-05 Thread Matthew Wilcox

On Fri, Jan 05, 2007 at 12:11:17PM -0500, Shaya Potter wrote:
> I guess the question is, why shouldn't a dentry object know what
> vfsmount it belongs to is?  Can it belong to multiple vfsmount objects?
> (Again, probably showing my igornance here).

If you have the same tree mounted in two places, they'll have
different vfsmounts.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Finding hardlinks

2007-01-03 Thread Matthew Wilcox

On Wed, Jan 03, 2007 at 01:33:31PM +0100, Miklos Szeredi wrote:
> High probability is all you have.  Cosmic radiation hitting your
> computer will more likly cause problems, than colliding 64bit inode
> numbers ;)

Some of us have machines designed to cope with cosmic rays, and would be
unimpressed with a decrease in reliability.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: inode i_blksize problem

2006-12-20 Thread Matthew Wilcox

On Wed, Dec 20, 2006 at 01:26:56PM +0100, Sergio Paracuellos wrote:
> I am trying to compile a module for kernel 2.6.18-1 that uses the 'inode
> struct' but the compiler tell me inode struct hasn't a member called
> "i_blksize". I don't have that problem in kernel 2.6.16. What happend
> with i_blksize?

git-log reveals:

commit ba52de123d454b57369f291348266d86f4b35070
Author: Theodore Ts'o <[EMAIL PROTECTED]>
Date:   Wed Sep 27 01:50:49 2006 -0700

[PATCH] inode-diet: Eliminate i_blksize from the inode structure

This eliminates the i_blksize field from struct inode.  Filesystems that wan
t
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.

Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.

[EMAIL PROTECTED]: cleanup]
[EMAIL PROTECTED]: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o" <[EMAIL PROTECTED]>
Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

So, it's gone.  I'd just delete that line from your sources if I were you.
By the way, what kind of module is this?  Whatever it's doing looks
pretty dodgy to me.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-17 Thread Matthew Wilcox

On Sun, Dec 17, 2006 at 11:07:27AM -0800, Ulrich Drepper wrote:
> And how often do the scripts which are in everyday use require such a 
> command?  And the same for the other programs.

I know that the rsync load is a major factor on kernel.org right now.
With all the git trees (particularly the ones that people haven't packed
recently), there's a lot of files in a lot of directories.  If
readdirplus would help this situation, it would definitely have a real
world benefit.  Obviously, I haven't done any measurements or attempted
to quantify what the improvement would be.

For those not familiar with a git repo, it has an 'objects' directory with
256 directories named 00 to ff.  Each of those directories can contain
many files (with names like '8cd5bbfb4763322837cd1f7c621f02ebe22fef') Once
a file is written, it is never modified, so all rsync needs to do is be
able to compare the timestamps and sizes and notice they haven't changed.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: openg and path_to_handle

2006-12-14 Thread Matthew Wilcox

On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote:
> I don't think that I understand what you're saying here. The openg() 
> call does not perform file open (not that that is necessarily even a 
> first-class FS operation), it simply does the lookup.
> 
> When we were naming these calls, from a POSIX consistency perspective it 
> seemed best to keep the "open" nomenclature. That seems to be confusing 
> to some. Perhaps we should rename the function "lookup" or something 
> similar, to help keep from giving the wrong idea?
> 
> There is a difference between the openg() and path_to_handle() approach 
> in that we do permission checking at openg(), and that does have 
> implications on how the handle might be stored and such. That's being 
> discussed in a separate thread.

I was just thinking about how one might implement this, when it struck
me ... how much more efficient is a kernel implementation compared to:

int openg(const char *path)
{
char *s;
do {
s = tempnam(FSROOT, ".sutoc");
link(path, s);
} while (errno == EEXIST);

mpi_broadcast(s);
sleep(10);
unlink(s);
}

and sutoc() becomes simply open().  Now you have a name that's quick to
open (if a client has the filesystem mounted, it has a handle for the
root already), has a defined lifespan, has minimal permission checking,
and doesn't require standardisation.

I suppose some cluster fs' might not support cross-directory links
(AFS is one, I think), but then, no cluster fs's support openg/sutoc.
If a filesystem's willing to add support for these handles, it shouldn't
be too hard for them to treat files starting ".sutoc" specially, and as
efficiently as adding the openg/sutoc concept.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: openg and path_to_handle

2006-12-06 Thread Matthew Wilcox

On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:
> Considering that filesystems like GFS and OCFS allow clients DIRECT
> ACCESS to the block device itself (which no amount of authentication
> will fix, unless it is in the disks themselves), the risk of passing a
> file handle around is pretty minimal.

That's either disingenuous, or missing the point.  OCFS/GFS allow the
kernel direct access to the block device.  openg()&sutoc() are about
passing around file handles to untrusted users.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: openg and path_to_handle

2006-12-06 Thread Matthew Wilcox

On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote:
> Permission checks are done on the path_to_handle(), so in reality
> only root or CAP_SYS_ADMIN users can currently use the
> open_by_handle interface because of this lack of checking. Given
> that our current users of this interface need root permissions to do
> other things (data migration), this has never been an issue.
> 
> This is an implementation detail - it is possible that file handle,
> being opaque, could encode a UID/GID of the user that constructed
> the handle and then allow any process with the same UID/GID to use
> open_by_handle() on that handle. (I think hch has already pointed
> this out.)

While it could do that, I'd be interested to see how you'd construct
the handle such that it's immune to a malicious user tampering with it,
or saving it across a reboot, or constructing one from scratch.

I suspect any real answer to this would have to involve cryptographical
techniques (say, creating a secure hash of the information plus a
boot-time generated nonce).  Now you're starting to use a lot of bits,
and compute time, and you'll need to be sure to keep the nonce secret.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: openg and path_to_handle

2006-12-06 Thread Matthew Wilcox

On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote:
> David Chinner wrote:
> >Does anyone here know about the XFS libhandle API? This has been
> >around for years and it does _exactly_ what these proposed syscalls
> >are supposed to do (and more).
> 
> Thanks for pointing these out Dave. These are indeed along the same 
> lines as the openg()/openfh() approach.
> 
> One difference is that they appear to perform permission checking on the 
> open_by_handle(), which means that the entire path needs to be encoded 
> in the handle, and makes it difficult to eliminate the path traversal 
> overhead on N open_by_handle() operations.

Another (and highly important) difference is that usage is restricted to
root:

xfs_open_by_handle(...)
...
if (!capable(CAP_SYS_ADMIN))
return -XFS_ERROR(EPERM);

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-06 Thread Matthew Wilcox

On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote:
> The openg() solution has the following advantages to what you propose. 
> First, it places the burden of the communication of the file handle on 
> the application process, not the file system. That means less work for 
> the file system. Second, it does not require that clients respond to 
> unexpected network traffic. Third, the network traffic is deterministic 
> -- one client interacts with the file system and then explicitly 
> performs the broadcast. Fourth, it does not require that the file system 
> store additional state on clients.

You didn't address the disadvantages I pointed out on December 1st in a
mail to the posix mailing list:

: I now understand this not so much as a replacement for dup() but in
: terms of being able to open by NFS filehandle, or inode number.  The
: fh_t is presumably generated by the underlying cluster filesystem, and
: is a handle that has meaning on all nodes that are members of the
: cluster.
:
: I think we need to consider security issues (that have also come up
: when open-by-inode-number was proposed).  For example, how long is the
: fh_t intended to be valid for?  Forever?  Until the cluster is rebooted?
: Could the fh_t be used by any user, or only those with credentials to
: access the file?  What happens if we revoke() the original fd?
:
: I'm a little concerned about the generation of a suitable fh_t.
: In the implementation of sutoc(), how does the kernel know which
: filesystem to ask to translate it?  It's not impossible (though it is
: implausible) that an fh_t could be meaningful to more than one
: filesystem.
:
: One possibility of fixing this could be to use a magic number at the
: beginning of the fh_t to distinguish which filesystem this belongs
: to (a list of currently-used magic numbers in Linux can be found at
: http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h)

Christoph has also touched on some of these points, and added some I
missed.

> In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing 
> the flag) would likely cause a storm of network traffic if clients were 
> closely synchronized (which they are likely to be).

I think you're referring to a naive application, rather than a naive
cluster filesystem, right?  There's several ways to fix that problem,
including throttling broadcasts of information, having nodes ask their
immediate neighbours if they have a cache of the information, and having
the server not respond (wait for a retransmit) if it's recently sent out
a broadcast.

> However, the application change issue is actually moot; we will make 
> whatever changes inside our MPI-IO implementation, and many users will 
> get the benefits for free.

That's good.

> The readdirplus(), readx()/writex(), and openg()/openfh() were all 
> designed to allow our applications to explain exactly what they wanted 
> and to allow for explicit communication. I understand that there is a 
> tendency toward solutions where the FS guesses what the app is going to 
> do or is passed a hint (e.g. fadvise) about what is going to happen, 
> because these things don't require interface changes. But these 
> solutions just aren't as effective as actually spelling out what the 
> application wants.

Sure, but I think you're emphasising "these interfaces let us get our
job done" over the legitimate concerns that we have.  I haven't really
looked at the readdirplus() or readx()/writex() interfaces, but the
security problems with openg() makes me think you haven't really looked
at it from the "what could go wrong" perspective.  I'd be interested in
reviewing the readx()/writex() interfaces, but still don't see a document
for them anywhere.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Re: Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-05 Thread Matthew Wilcox

On Tue, Dec 05, 2006 at 06:09:03PM +0100, Latchesar Ionkov wrote:
> It could be wasteful, but it could (most likely) also be useful. Name
> resolution is not that expensive on either side of the network. The
> latency introduced by the single-name lookups is :)

*is* latency the problem here?  Last I heard, it was the intolerable
load placed on the DLM by having clients bounce the read locks for each
directory element all over the cluster.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-05 Thread Matthew Wilcox

On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote:
> I think that the main problem is that all these file systems resove a
> path name, one directory at a time bringing the server to its knees by
> the huge amount of requests. I would like to see what the performance
> is if you a) cache the last few hundred lookups on the server side,
> and b) modify VFS and the file systems to support multi-name lookups.
> Just assume for a moment that there is no any way to get these new
> operations in (which is probaly going to be true anyway :). What other
> solutions can you think of? :)

How exactly would you want a multi-name lookup to work?  Are you saying
that open("/usr/share/misc/pci.ids") should ask the server "Find usr, if
you find it, find share, if you find it, find misc, if you find it, find
pci.ids"?  That would be potentially very wasteful; consider mount
points, symlinks and other such effects on the namespace.  You could ask
the server to do a lot of work which you then discard ... and that's not
efficient.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-05 Thread Matthew Wilcox

On Tue, Dec 05, 2006 at 10:07:48AM +, Christoph Hellwig wrote:
> The filehandle idiocy on the other hand is way of into crackpipe land.

Right, and it needs to be discarded.  Of course, there was a real
problem that it addressed, so we need to come up with an acceptable
alternative.

The scenario is a cluster-wide application doing simultaneous opens of
the same file.  So thousands of nodes all hitting the same DLM locks
(for read) all at once.  The openg() non-solution implies that all
nodes in the cluster share the same filehandle space, so I think a
reasonable solution can be implemented entirely within the clusterfs,
with an extra flag to open(), say O_CLUSTER_WIDE.  When the clusterfs
sees this flag set (in ->lookup), it can treat it as a hint that this
pathname component is likely to be opened again on other nodes and
broadcast that fact to the other nodes within the cluster.  Other nodes
on seeing that hint (which could be structured as "The child "bin"
of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has filehandle
ff51a98799931256b555446b2f5675db08de6229") can keep a record of that fact.
When they see their own open, they can populate the path to that file
without asking the server for extra metadata.

There's obviously security issues there (why I say 'hint' rather than
'command'), but there's also security problems with open-by-filehandle.
Note that this solution requires no syscall changes, no application
changes, and also helps a scenario where each node opens a different
file in the same directory.

I've never worked on a clusterfs, so there may be some gotchas (eg, how
do you invalidate the caches of nodes when you do a rename).  But this
has to be preferable to open-by-fh.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: NFSv4/pNFS possible POSIX I/O API standards

2006-11-29 Thread Matthew Wilcox

On Wed, Nov 29, 2006 at 05:23:13AM -0700, Matthew Wilcox wrote:
> On Wed, Nov 29, 2006 at 09:04:50AM +, Christoph Hellwig wrote:
> >  - openg/sutoc
> > 
> > No way.  We already have a very nice file descriptor abstraction.
> > You can pass file descriptors over unix sockets just fine.
> 
> Yes, but it behaves like dup().  Gary replied to me off-list (which I
> didn't notice and continued replying to him off-list).  I wrote:
> 
> Is this for people who don't know about dup(), or do they need
> independent file offsets?  If the latter, I think an xdup() would be
> preferable (would there be a security issue for OSes with revoke()?)
> Either that, or make the key be useful for something else.

I further wonder if these people would see appreciable gains from doing
sutoc rather than doing openat(dirfd, "basename", flags, mode);

If they did, I could also see openat being extended to allow dirfd to
be a file fd, as long as pathname were NULL or a pointer to NUL.

But with all the readx stuff being proposed, I bet they don't really
need independent file offsets.  That's, like, so *1970*s.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: NFSv4/pNFS possible POSIX I/O API standards

2006-11-29 Thread Matthew Wilcox

On Wed, Nov 29, 2006 at 09:04:50AM +, Christoph Hellwig wrote:
>  - openg/sutoc
> 
>   No way.  We already have a very nice file descriptor abstraction.
>   You can pass file descriptors over unix sockets just fine.

Yes, but it behaves like dup().  Gary replied to me off-list (which I
didn't notice and continued replying to him off-list).  I wrote:

Is this for people who don't know about dup(), or do they need
independent file offsets?  If the latter, I think an xdup() would be
preferable (would there be a security issue for OSes with revoke()?)
Either that, or make the key be useful for something else.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: NFSv4/pNFS possible POSIX I/O API standards

2006-11-28 Thread Matthew Wilcox

On Mon, Nov 27, 2006 at 09:34:05PM -0700, Gary Grider wrote:
> >Things like
> >openg()  - on process opens a file and gets a key that is passed to 
> >lots of processes which
> >use the key to get a handle (great for thousands of processes opening a 
> >file)

I don't understand how this leads to a more efficient implementation.
It seem to just add complexity.  What does 'sutoc' mean anyway?

> >readx/writex - scattergather readwrite - more appropriate and 
> >complete than the real time extended read/write

These don't seem to be documented on the website.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via pointer conversion)

2006-11-17 Thread Matthew Wilcox

On Fri, Nov 17, 2006 at 08:43:00AM -0500, Jeff Layton wrote:
> 2) this scheme would effectively leak inode addresses into userspace.
> I'm not sure if that would be exploitable, but it's probably best not to
> do it. The patch adds a static unsigned int that is initialized to a
> random value at boot time. We'll xor the inode offset with this value.
> That should allow for a unique i_ino value, but since the xor mask would
> be secret, it shouldn't be possible to turn it back into an address.
> There may be a more secure way to do this. I'm definitely open to
> suggestions here.

I *think* the xor mask is mere obfuscation.  It looks likely that you can
recover it with a little bit of trial and error.  If you can force the
filesystem to hand you back new inodes quickly such that there is a high
probability you get consecutive allocations, you'll get a sequence which
would be spaced 700-odd bytes apart, except that it's been xored.  Since
you know it's incrementing, if you see the sequence decrease, you'll
know that was a 1 in that bit.

It'd be a bit more complex than that, and cryptanalysis was never my
forte, but I suspect we should either use a folded hash like md5, or
give up and just divide the address by sizeof(struct inode).  Sure,
divides are slow, but this is a divide by a constant, so it shouldn't be
that bad.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via pointer conversion)

2006-11-17 Thread Matthew Wilcox

On Fri, Nov 17, 2006 at 08:43:00AM -0500, Jeff Layton wrote:
> +/* convert an inode address into an unsigned int and xor it with a random 
> value
> + * determined at boot time */
> +static inline unsigned int inode_to_uint (struct inode *inode)
> +{
> + return unsigned long) (inode - (struct inode *) 0))
> +  ^ inode_xor_mask) & 0x);
> +}

Seems a little obfuscated.  Why not simply:

return ((unsigned long)inode ^ inode_xor_mask) & 0x;

> @@ -125,7 +135,6 @@ static struct inode *alloc_inode(struct 
>   inode->i_size = 0;
>   inode->i_blocks = 0;
>   inode->i_bytes = 0;
> - inode->i_generation = 0;
>  #ifdef CONFIG_QUOTA
>   memset(&inode->i_dquot, 0, sizeof(inode->i_dquot));
>  #endif

It seems to me that filesystems with fake inodes could instead
initialise i_generation to, say, jiffies.  What do you think to that?

I like this idea, very creative.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: CVE-2006-3468

2006-11-16 Thread Matthew Wilcox

On Thu, Nov 16, 2006 at 10:21:32PM +0100, Majkls wrote:
> Hello,
> I would like to solve this bug. Have anybody some backtrace or some
> working exploit?

Sure you have the right CVE?  That bug seems to have been fixed long
ago:

http://git.parisc-linux.org/?p=linux-2.6.git;a=commit;h=2ccb48ebb4de139eef4fcefd5f2bb823cb0d81b9
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Make journal_commit_transaction() more understandable

2005-08-29 Thread Matthew Wilcox


journal_commit_transaction() is still 650+ lines long and contains 16
local variables.  By moving phase 3 into its own function, we reduce
its length by 150+ lines and reduce it to 5 local variables.

Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>

 commit.c |  367 +--
 1 files changed, 194 insertions(+), 173 deletions(-)

Index: linux-2.6/fs/jbd/commit.c
===
RCS file: /var/lib/cvs/linux-2.6/fs/jbd/commit.c,v
retrieving revision 1.14
diff -u -p -r1.14 commit.c
--- linux-2.6/fs/jbd/commit.c   4 Apr 2005 20:47:15 -   1.14
+++ linux-2.6/fs/jbd/commit.c   29 Aug 2005 15:59:02 -
@@ -93,6 +93,189 @@ static int inverted_lock(journal_t *jour
return 1;
 }
 
+/*
+ * Way to go: we have now written out all of the data for a transaction!
+ * Now comes the tricky part: we need to write out metadata.  Loop over the
+ * transaction's entire buffer list:
+ */
+static int journal_write_metadata(journal_t *journal,
+ transaction_t *commit_transaction)
+{
+   char *tagp = NULL;
+   struct journal_head *descriptor = NULL;
+   int flags;
+   unsigned long blocknr;
+   journal_header_t *header;
+   journal_block_tag_t *tag = NULL;
+   int space_left = 0;
+   int first_tag = 0;
+   int tag_flag;
+   int i;
+   int err = 0;
+   int bufs = 0;
+
+   struct buffer_head **wbuf = journal->j_wbuf;
+
+   commit_transaction->t_state = T_COMMIT;
+
+   while (commit_transaction->t_buffers) {
+
+   /* Find the next buffer to be journaled... */
+
+   struct journal_head *jh, *new_jh;
+   jh = commit_transaction->t_buffers;
+
+   /* If we're in abort mode, we just un-journal the buffer and
+  release it for background writing. */
+
+   if (is_journal_aborted(journal)) {
+   JBUFFER_TRACE(jh, "journal is aborting: refile");
+   journal_refile_buffer(journal, jh);
+   /* If that was the last one, we need to clean up
+* any descriptor buffers which may have been
+* already allocated, even if we are now
+* aborting. */
+   if (!commit_transaction->t_buffers)
+   goto start_journal_io;
+   continue;
+   }
+
+   /* Make sure we have a descriptor block in which to
+  record the metadata buffer. */
+
+   if (!descriptor) {
+   struct buffer_head *bh;
+
+   J_ASSERT (bufs == 0);
+
+   jbd_debug(4, "JBD: get descriptor\n");
+
+   descriptor = journal_get_descriptor_buffer(journal);
+   if (!descriptor) {
+   __journal_abort_hard(journal);
+   continue;
+   }
+
+   bh = jh2bh(descriptor);
+   jbd_debug(4, "JBD: got buffer %llu (%p)\n",
+   (unsigned long long)bh->b_blocknr, bh->b_data);
+   header = (journal_header_t *)&bh->b_data[0];
+   header->h_magic = cpu_to_be32(JFS_MAGIC_NUMBER);
+   header->h_blocktype = cpu_to_be32(JFS_DESCRIPTOR_BLOCK);
+   header->h_sequence  = 
cpu_to_be32(commit_transaction->t_tid);
+
+   tagp = &bh->b_data[sizeof(journal_header_t)];
+   space_left = bh->b_size - sizeof(journal_header_t);
+   first_tag = 1;
+   set_buffer_jwrite(bh);
+   set_buffer_dirty(bh);
+   wbuf[bufs++] = bh;
+
+   /* Record it so that we can wait for IO
+   completion later */
+   BUFFER_TRACE(bh, "ph3: file as descriptor");
+   journal_file_buffer(descriptor, commit_transaction,
+   BJ_LogCtl);
+   }
+
+   /* Where is the buffer to be written? */
+
+   err = journal_next_log_block(journal, &blocknr);
+   /* If the block mapping failed, just abandon the buffer
+  and repeat this loop: we'll fall into the
+  refile-on-abort condition above. */
+   if (err) {
+   __journal_abort_hard(journal);
+   continue;
+   }
+
+   /*
+* start_this_handle() uses t_outstanding_credits to determine
+* the free space in the log, but this counter is

Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Matthew Wilcox

On Fri, Aug 19, 2005 at 08:38:34PM +0100, Al Viro wrote:
> FWIW, I'd rather take page_symlink(), page_symlink_inode_operations,
> page_put_link(), page_follow_link_light(), page_readlink(), page_getlink(),
> generic_readlink() and vfs_readlink() to the same place where these guys
> would live.  They all belong together and none of them has any business
> in fs/namei.c.  Options: fs/libfs.c or separate library since fs/libfs.c
> is getting crowded.  Linus, do you have any objections to that or suggestions
> on filename here?

fs/symlink.c?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero

2005-04-15 Thread Matthew Wilcox

On Fri, Apr 15, 2005 at 10:03:05PM +1000, Herbert Xu wrote:
> I suppose it could be smart and stay quiet about
> 
> val < 0 || val > BOUND
> 
> However, gcc is slow enough as it is without adding unnecessary
> smarts like this.

It only warns with -W on, not with -Wall, so I see no compelling
reason to fix this.  I think the real problem here is that 'arg'
is declared 2 pages earlier in the function prototype (aka the
function-growth-hormone-imbalance syndrome).

There's two good ways of fixing this, adding a f_setsig() function:

static inline int f_setsig(struct file *filp, unsigned long arg)
{
if (arg > _NSIG)
return -EINVAL;

filp->f_owner.signum = arg;
return 0;
}
...
case F_SETSIG:
err = f_setsig(filp, arg);
break;

or add a function that checks a variable to see if it's a valid signal number:

#define valid_signal(arg)   ((unsigned long)arg <= _NSIG)
...
case F_SETSIG:
if (!valid_signal(arg))
break;
err = 0;
filp->f_owner.signum = arg;
break;

Looks like futex.c, ptrace.c, signal.c, sys.c and almost every
architecture's ptrace code could easily make use of the latter, but not
the former.  It also looks like we have a few off-by-one errors.  For
example, in h8300's ptrace code:

case PTRACE_SYSCALL:
case PTRACE_CONT: {
ret = -EIO;
if ((unsigned long) data >= _NSIG)
break ;

but

case PTRACE_SINGLESTEP: {
ret = -EIO;
if ((unsigned long) data > _NSIG)
break;

so I'd recommend the second solution.  But be careful not to "fix up"
cases like:

./kernel/exit.c:if (sig < 1 || sig > _NSIG)

where we really don't want to allow zero.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero

2005-04-15 Thread Matthew Wilcox

On Fri, Apr 15, 2005 at 09:21:50AM +0100, Christoph Hellwig wrote:
> On Fri, Apr 15, 2005 at 02:31:00AM +0100, Matthew Wilcox wrote:
> > On Fri, Apr 15, 2005 at 03:07:42AM +0200, Jesper Juhl wrote:
> > > 'arg' is unsigned so it can never be less than zero, so testing for that 
> > > is pointless and also generates a warning when building with gcc -W. This 
> > > patch eliminates the pointless check.
> > 
> > Didn't Linus already reject this one 6 months ago?
> 
> I think Linux only complained if we're using some typedef that actually
> may be signed.  For fcntl that 'arg' argument is unsigned and that's hardcoded
> in the ABI.  So the check doesn't make sense at all.

No, it was exactly this patch:
http://www.ussg.iu.edu/hypermail/linux/kernel/0401.0/1816.html

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero

2005-04-14 Thread Matthew Wilcox

On Fri, Apr 15, 2005 at 03:07:42AM +0200, Jesper Juhl wrote:
> 'arg' is unsigned so it can never be less than zero, so testing for that 
> is pointless and also generates a warning when building with gcc -W. This 
> patch eliminates the pointless check.

Didn't Linus already reject this one 6 months ago?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: mmap question

2005-03-21 Thread Matthew Wilcox

On Mon, Mar 21, 2005 at 01:33:55PM -0800, Bryan Henderson wrote:
> It looks to me like you're running into the fundamental limitation that 
> the CPU doesn't notify Linux every time you store into a memory location. 
> It does, though, set the dirty flag in the page table, and Linux 
> eventually inspects that flag and finds out that you have stored in the 
> past.  At that time, it can call set_page_dirty.

well, we *could* know ... never map this page writable.  have a per-vma
flag that says "emulate writes", and call the filesystem to update
backing storage before returning to the application.

the price of doing this would be exorbitant.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Efficient handling of sparse files

2005-02-28 Thread Matthew Wilcox


This problem came up with the systemimager program which uses rsync to
install files from a master server to many clients.  Red Hat has a system
user with uid 2^32-1 which causes lastlog to grow to 1.2GB in size.
rsync does understand the concept of sparse files (with the -S flag), but
it has to read every block to discover that it is indeed empty.  This sucks.

I was wondering if we could introduce a new system call (or ioctl?) that,
given an fd would find the next block with data in it.  We could use the
->bmap method ... except that has dire warnings about adding new callers
and viro may soon be in testicle-gouging range.

One system interface hack would be to introduce lseek(fd, 0, SEEK_DATA)
... but without permission to reuse ->bmap for this purpose, it's
pointless to discuss user interfaces.

Suggestions?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Help!

2005-02-04 Thread Matthew Wilcox

On Fri, Feb 04, 2005 at 03:58:04PM +0530, Anirban Mukherjee wrote:
> I am doing a project on Ext2fs and Ext3fs and I require some materials on
> the physical structure of Ext2fs.It would be very helpful if someone sends
> me some information(or links where to find it) along with some general
> stuff on Ext2fs and Ext3fs.

"Please do my homework for me"

Did you even look in Documentation/filesystems/ext{2,3}.txt?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

NFSD needs EXPORTFS

2005-02-03 Thread Matthew Wilcox


Got this report about 2.6.11-rc3.  Is this the correct solution?

- Forwarded message from Joel Soete <[EMAIL PROTECTED]> -

A short analyse, it seems that's because NFSD was builtin while EXPORTFS
was a module in my previous config file. Imho EXPORTFS would be build as
NFSD?

Is the following hunk would do the trick:
--- fs/Kconfig.Orig 2005-02-03 16:45:13.562275206 +0100
+++ fs/Kconfig  2005-02-03 16:46:36.496469111 +0100
@@ -1400,6 +1400,7 @@
tristate "NFS server support"
depends on INET
select LOCKD
+   select EXPORTFS
select SUNRPC
help
  If you want your Linux box to act as an NFS *server*, so that other
><

Thanks in advance,
Joel

- End forwarded message -

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-02 Thread Matthew Wilcox

On Wed, Feb 02, 2005 at 03:12:50PM +, Anton Altaparmakov wrote:

I think the below loop would be clearer as a for loop ...

err = 0;
for (nr = 0; nr < nr_pages; nr++, start++) {
if (start == lp_idx) {
pages[nr] = locked_page;
if (!nr)
continue;
lock_page(locked_page);
if (!wbc)
continue;
if (wbc->for_reclaim) {
up(&inode->i_sem);
up_read(&inode->i_sb->s_umount);
}
/* Was the page truncated under us? */
if (page_mapping(locked_page) != mapping) {
err = -ESTALE;
goto err_out_locked;
}
} else {
pages[nr] = find_lock_page(mapping, start);
if (pages[nr])
continue;
if (!cached_page) {
cached_page = alloc_page(gfp_mask);
if (unlikely(!cached_page))
goto err_out;
}
err = add_to_page_cache_lru(cached_page,
mapping, start, gfp_mask);
if (unlikely(err)) {
if (err == -EEXIST)
continue;
goto err_out;
}
pages[nr] = cached_page;
cached_page = NULL;
}
}

The above fixes two bugs in the below:
 - if (!unlikely(cached_page)) should be if (unlikely(!cached_page))
 - The -EEXIST case after add_to_page_cache_lru() would result in
   an infinite loop in the original as nr wasn't being incremented.

> + err = nr = 0;
> + while (nr < nr_pages) {
> + if (start == lp_idx) {
> + pages[nr] = locked_page;
> + if (nr) {
> + lock_page(locked_page);
> + if (wbc) {
> + if (wbc->for_reclaim) {
> + up(&inode->i_sem);
> + up_read(&inode->i_sb->s_umount);
> + }
> + /* Was the page truncated under us? */
> + if (page_mapping(locked_page) !=
> + mapping) {
> + err = -ESTALE;
> + goto err_out_locked;
> + }
> + }
> + }
> + } else {
> + pages[nr] = find_lock_page(mapping, start);
> + if (!pages[nr]) {
> + if (!cached_page) {
> + cached_page = alloc_page(gfp_mask);
> + if (!unlikely(cached_page))
> + goto err_out;
> + }
> + err = add_to_page_cache_lru(cached_page,
> + mapping, start, gfp_mask);
> + if (unlikely(err)) {
> + if (err == -EEXIST)
> + continue;
> + goto err_out;
> + }
> + pages[nr] = cached_page;
> + cached_page = NULL;
> + }
> + }
> + nr++;
> + start++;
> + }

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage

2005-01-31 Thread Matthew Wilcox

On Fri, Jan 28, 2005 at 02:53:41PM -0800, Bryan Henderson wrote:
> >Just putting up my hand to say "yeah, us too" - we could also make
> >use of that functionality, so we can grok existing XFS filesystems
> >that have blocksizes larger than the page size.
> 
> IBM Storage Tank has block size > page size and has the same problem. This 
> is one of several ways that Storage Tank isn't generic enough to use 
> generic_file_write() and generic_file_read(), so it doesn't.  That's not a 
> terrible way to go, by the way.  At some point, making the generic 
> interface complex enough to handle every possible filesystem becomes worse 
> than every filesystem driver having its own code.

That's certainly true; but it might make sense to write a
multipage_file_read() and multipage_file_write() that can be shared
between the filesystems that do need this complexity.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Minor ext2 speedup

2005-01-24 Thread Matthew Wilcox


Port Andreas Dilger's and Jan Kara's patch for ext3 to ext2.
Also some whitespace changes to get ext2/ext3 closer in sync.

Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]>

Index: fs/ext2/balloc.c
===
RCS file: /var/cvs/linux-2.6/fs/ext2/balloc.c,v
retrieving revision 1.4
diff -u -p -r1.4 balloc.c
--- fs/ext2/balloc.c13 Sep 2004 15:23:44 -  1.4
+++ fs/ext2/balloc.c24 Jan 2005 22:10:01 -
@@ -6,7 +6,7 @@
  * Laboratoire MASI - Institut Blaise Pascal
  * Universite Pierre et Marie Curie (Paris VI)
  *
- *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
+ *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
  *  Big-endian to little-endian byte-swapping/bitmaps by
  *David S. Miller ([EMAIL PROTECTED]), 1995
  */
@@ -52,9 +52,9 @@ struct ext2_group_desc * ext2_get_group_
 
return NULL;
}
-   
-   group_desc = block_group / EXT2_DESC_PER_BLOCK(sb);
-   offset = block_group % EXT2_DESC_PER_BLOCK(sb);
+
+   group_desc = block_group >> EXT2_DESC_PER_BLOCK_BITS(sb);
+   offset = block_group & (EXT2_DESC_PER_BLOCK(sb) - 1);
if (!sbi->s_group_desc[group_desc]) {
ext2_error (sb, "ext2_get_group_desc",
"Group descriptor not loaded - "
@@ -62,7 +62,7 @@ struct ext2_group_desc * ext2_get_group_
 block_group, group_desc, offset);
return NULL;
}
-   
+
desc = (struct ext2_group_desc *) sbi->s_group_desc[group_desc]->b_data;
if (bh)
*bh = sbi->s_group_desc[group_desc];
@@ -236,12 +236,12 @@ do_more:
 
for (i = 0, group_freed = 0; i < count; i++) {
if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
-   bit + i, (void *) bitmap_bh->b_data))
-   ext2_error (sb, "ext2_free_blocks",
- "bit already cleared for block %lu",
- block + i);
-   else
+   bit + i, bitmap_bh->b_data)) {
+   ext2_error(sb, __FUNCTION__,
+   "bit already cleared for block %lu", block + i);
+   } else {
group_freed++;
+   }
}
 
mark_buffer_dirty(bitmap_bh);
@@ -569,25 +569,24 @@ unsigned long ext2_count_free_blocks (st
 static inline int
 block_in_use(unsigned long block, struct super_block *sb, unsigned char *map)
 {
-   return ext2_test_bit ((block - 
le32_to_cpu(EXT2_SB(sb)->s_es->s_first_data_block)) %
+   return ext2_test_bit ((block -
+   le32_to_cpu(EXT2_SB(sb)->s_es->s_first_data_block)) %
 EXT2_BLOCKS_PER_GROUP(sb), map);
 }
 
 static inline int test_root(int a, int b)
 {
-   if (a == 0)
-   return 1;
-   while (1) {
-   if (a == 1)
-   return 1;
-   if (a % b)
-   return 0;
-   a = a / b;
-   }
+   int num = b;
+
+   while (a > num)
+   num *= b;
+   return num == a;
 }
 
 static int ext2_group_sparse(int group)
 {
+   if (group <= 1)
+   return 1;
return (test_root(group, 3) || test_root(group, 5) ||
test_root(group, 7));
 }

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Minor ext3 speedup

2005-01-15 Thread Matthew Wilcox

On Thu, Jan 13, 2005 at 01:15:06PM +0100, Jan Kara wrote:
>   Attached patch removes unnecessary division and modulo from ext3 code
> paths. It reduces (according to oprofile) the CPU usage measurably under
> a dbench load (see description of the patch for the numbers).

I thought I'd apply Jan & Andreas's patch to ext2 too.  While doing so,
I noticed several places where patches had been applied to ext2 and not to
ext3.  One of them was even a bugfix.  This patch brings ext2/balloc.c
and ext3/balloc.c closer together and fixes the bug.  It includes the
aforementioned patch for both ext2 and ext3.

For the curious, the bug occurs when ext3_free_blocks_sb() decides to
"do_more".  *pdquot_freed_blocks was updated each time around the loop,
so bg_free_blocks_count was getting over-incremented.  I fixed it the
same way it had been fixed in ext2 -- by introducing a group_freed
variable that is reset each time around the loop.

Index: linux-2.6/fs/ext2/balloc.c
===
RCS file: /var/cvs/linux-2.6/fs/ext2/balloc.c,v
retrieving revision 1.4
diff -u -p -r1.4 balloc.c
--- linux-2.6/fs/ext2/balloc.c  13 Sep 2004 15:23:44 -  1.4
+++ linux-2.6/fs/ext2/balloc.c  15 Jan 2005 16:45:10 -
@@ -6,7 +6,7 @@
  * Laboratoire MASI - Institut Blaise Pascal
  * Universite Pierre et Marie Curie (Paris VI)
  *
- *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
+ *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
  *  Big-endian to little-endian byte-swapping/bitmaps by
  *David S. Miller ([EMAIL PROTECTED]), 1995
  */
@@ -52,9 +52,9 @@ struct ext2_group_desc * ext2_get_group_
 
return NULL;
}
-   
-   group_desc = block_group / EXT2_DESC_PER_BLOCK(sb);
-   offset = block_group % EXT2_DESC_PER_BLOCK(sb);
+
+   group_desc = block_group >> EXT2_DESC_PER_BLOCK_BITS(sb);
+   offset = block_group & (EXT2_DESC_PER_BLOCK(sb) - 1);
if (!sbi->s_group_desc[group_desc]) {
ext2_error (sb, "ext2_get_group_desc",
"Group descriptor not loaded - "
@@ -62,7 +62,7 @@ struct ext2_group_desc * ext2_get_group_
 block_group, group_desc, offset);
return NULL;
}
-   
+
desc = (struct ext2_group_desc *) sbi->s_group_desc[group_desc]->b_data;
if (bh)
*bh = sbi->s_group_desc[group_desc];
@@ -236,12 +236,12 @@ do_more:
 
for (i = 0, group_freed = 0; i < count; i++) {
if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
-   bit + i, (void *) bitmap_bh->b_data))
-   ext2_error (sb, "ext2_free_blocks",
- "bit already cleared for block %lu",
- block + i);
-   else
+   bit + i, bitmap_bh->b_data)) {
+   ext2_error(sb, __FUNCTION__,
+   "bit already cleared for block %lu", block + i);
+   } else {
group_freed++;
+   }
}
 
mark_buffer_dirty(bitmap_bh);
@@ -569,25 +569,24 @@ unsigned long ext2_count_free_blocks (st
 static inline int
 block_in_use(unsigned long block, struct super_block *sb, unsigned char *map)
 {
-   return ext2_test_bit ((block - 
le32_to_cpu(EXT2_SB(sb)->s_es->s_first_data_block)) %
+   return ext2_test_bit ((block -
+   le32_to_cpu(EXT2_SB(sb)->s_es->s_first_data_block)) %
 EXT2_BLOCKS_PER_GROUP(sb), map);
 }
 
 static inline int test_root(int a, int b)
 {
-   if (a == 0)
-   return 1;
-   while (1) {
-   if (a == 1)
-   return 1;
-   if (a % b)
-   return 0;
-   a = a / b;
-   }
+   int num = b;
+
+   while (a > num)
+   num *= b;
+   return num == a;
 }
 
 static int ext2_group_sparse(int group)
 {
+   if (group <= 1)
+   return 1;
return (test_root(group, 3) || test_root(group, 5) ||
test_root(group, 7));
 }
Index: linux-2.6/fs/ext3/balloc.c
===
RCS file: /var/cvs/linux-2.6/fs/ext3/balloc.c,v
retrieving revision 1.9
diff -u -p -r1.9 balloc.c
--- linux-2.6/fs/ext3/balloc.c  12 Jan 2005 20:17:23 -  1.9
+++ linux-2.6/fs/ext3/balloc.c  15 Jan 2005 16:45:10 -
@@ -43,34 +43,34 @@ struct ext3_group_desc * ext3_get_group_
 struct buffer_head ** bh)
 {
unsigned long group_desc;
-   unsigned long desc;
-   struct ext3_group_desc * gdp;
+   unsigned long offset;
+   struct ext3_group_desc * desc;
+   struct ext3_sb_info *sbi = EXT3_SB(sb);
 
-   if (block_group >= EXT3_SB(sb)->s

Re: [Final call for testers][PATCH] superblock handling changes (2.4.6-pre3)

2001-06-15 Thread Matthew Wilcox


On Fri, Jun 15, 2001 at 01:10:00AM -0400, Alexander Viro wrote:
> +static inline void write_super(struct super_block *sb)
> +{
> + lock_super(sb);
> + if (sb->s_root && sb->s_dirt)

When I first looked at this, I thought it was a typo.  I don't think we
should have s_dirty and s_dirt as fields of the superblock.  how about
s_dirty_inodes and s_isdirty, respectively?

> +restart:
> + spin_lock(&sb_lock);
> + sb = sb_entry(super_blocks.next);
> + while (sb != sb_entry(&super_blocks))
> + if (sb->s_dirt) {
> + sb->s_count++;
> + spin_unlock(&sb_lock);
> + down_read(&sb->s_umount);
> + write_super(sb);
> + drop_super(sb);
> + goto restart;
> + } else
> + sb = sb_entry(sb->s_list.next);
> + spin_unlock(&sb_lock);

I think this could be clearer.

struct list_head *tmp;
restart:
spin_lock(&sb_lock);
list_for_each(tmp, super_blocks) {
struct super_block *sb = sb_entry(tmp);
if (!sb->s_dirt)
continue;
spin_unlock(&sb_lock);
down_read(&sb->s_umount);
write_super(sb);
drop_super(sb);
goto restart;
}
spin_unlock(&sb_lock);

> @@ -773,16 +810,16 @@
>  void *data, int silent)
>  {
>   struct super_block * s;
> - s = get_empty_super();
> + s = alloc_super();
>   if (!s)
>   goto out;
>   s->s_dev = dev;
>   s->s_bdev = bdev;
>   s->s_flags = flags;
> - s->s_dirt = 0;
>   s->s_type = type;
> - s->s_dquot.flags = 0;
> - s->s_maxbytes = MAX_NON_LFS;
> + spin_lock(&sb_lock);
> + list_add (&s->s_list, super_blocks.prev);

I'd use list_add_tail(&s->s_list, super_blocks);

> - if (mnt->mnt_instances.next != mnt->mnt_instances.prev) {
> + if (atomic_read(&sb->s_active) > 1) {

I'm happy to see that line disappear.  It was mightily confusing.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: about BKL in VFS

2001-06-08 Thread Matthew Wilcox

On Fri, Jun 08, 2001 at 02:04:16PM -0400, Alexander Viro wrote:
> the only
> areas in VFS that still rely on BKL are locks.c, dquot.c and super.c. The
> latter is fixed in the patches I'm feeding to Linus (OK, the pieces I'm
> feeding to Linus make shifting the BKL down to method calls trivial). Other
> two...  Well, ask the poor guys who maintain them ;-)

the file locking code always takes the BKL itself, and doesn't rely on
the rest of the VFS to take the lock for it.  it's not too much trouble
to remove it, and this will happen as part of the 2.5 rewrite.  which is
in progress.

annoyingly, there's very little common semantics between POSIX locks and
FLOCK locks, which means that fs/locks.c is basically implementing two
completely independent things through one interface.  and then there's
the other 3-4 different types of lock bleuch.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-22 Thread Matthew Wilcox


On Tue, May 22, 2001 at 04:31:37PM +0100, Alan Cox wrote:
> > `the class of devices in question' was cryptographic devices, and possibly
> > other transactional DSPs.  I don't consider audio to be transactional.
> > in any case, you can do transactional things with two threads, as long
> > as they each have their own fd on the device.  Think of the fd as your
> > transaction handle.
> 
> Thats a bit pathetic. So I have to fill my app with expensive pthread locks
> or hack all the drivers and totally change the multi-open sematics in the ABI

huh?

void thread_init(void) {
int fd = open("/dev/crypto");
real_thread_init(fd);
}

where was that lock again?

and notice this idea is only for transactional things -- what
transactional things do sound drivers do?

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-22 Thread Matthew Wilcox

On Tue, May 22, 2001 at 08:49:04AM +0100, Alan Cox wrote:
> > For _devices_, though?  I don't expect my mouse to work if gpm and xfree
> > both try to consume device events from the same filp.  Heck, it doesn't
> > even work when they try to consume events from the same inode :-)  I think
> > this is a reasonable restriction for the class of devices in question.
> 
> Not really. Think about basic things like full duplex audio with two threads

`the class of devices in question' was cryptographic devices, and possibly
other transactional DSPs.  I don't consider audio to be transactional.
in any case, you can do transactional things with two threads, as long
as they each have their own fd on the device.  Think of the fd as your
transaction handle.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-21 Thread Matthew Wilcox

On Tue, May 22, 2001 at 02:22:34AM +0200, Ingo Oeser wrote:
> ioctl has actually 4 semantics:
> 
> command only
> command + read
> command + write
> command + rw-transaction
> 
> Separating these would be a first step. And yes, I consider each
> of them useful.
> 
> command only: reset drive

echo 'reset' >/dev/sg0ctl

> command + rw-transaction: "dear device please mangle this data"
>(crypto processors come to mind...)

I can't think of a reasonable tool-based approach to this, but I can
definitely see that a program could use this well.  It simply requires
that you use the filp to store your state.

fd = open(/dev/crypto) -> creates filp
write(fd, "Death to all fanatics!\n"); -> calls crypto device, stores result in
private data structure
sleep(100);
read(fd, "Qrngu gb nyy snangvpf!\n"); -> frees data structure

[You'll note the advanced design of my crypto processor.]

Clearly, this is open to abuse by persons never calling read() and passing in
far too much to write().  I think this can be alleviated by refusing to accept more 
than (say) 4k at a time, or bean-counter.

A sick way would be to allow the ->write() call to have its buffer
modified.  But I don't think we want to go down that path.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-21 Thread Matthew Wilcox

On Mon, May 21, 2001 at 06:13:18PM -0700, Linus Torvalds wrote:
> Nope. You can (and people do, quite often) share filps. So you can't
> associate state with it.

For _devices_, though?  I don't expect my mouse to work if gpm and xfree
both try to consume device events from the same filp.  Heck, it doesn't
even work when they try to consume events from the same inode :-)  I think
this is a reasonable restriction for the class of devices in question.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-20 Thread Matthew Wilcox


On Sun, May 20, 2001 at 03:11:53PM -0400, Alexander Viro wrote:
> Pheeew... Could you spell "about megabyte of stuff in ioctl.c"?

No.

$ ls -l arch/*/kernel/ioctl32*.c
-rw-r--r--1 willywilly   22479 Jan 24 16:59 arch/mips64/kernel/ioctl32.c
-rw-r--r--1 willywilly  109475 May 18 16:39 arch/parisc/kernel/ioctl32.c
-rw-r--r--1 willywilly  117605 Feb  1 20:35 arch/sparc64/kernel/ioctl32.c

only about 100k.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-19 Thread Matthew Wilcox


On Sat, May 19, 2001 at 10:22:55PM -0400, Richard Gooch wrote:
> The transaction(2) syscall can be just as easily abused as ioctl(2) in
> this respect.

But read() and write() cannot.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-19 Thread Matthew Wilcox


On Sat, May 19, 2001 at 12:51:23PM -0600, Richard Gooch wrote:
> Al, if you really want to kill ioctl(2), then perhaps you should
> implement a transaction(2) syscall. Something like:
> int transaction (int fd, void *rbuf, size_t rlen,
>void *wbuf, size_t wlen);
> 
> Of course, there wouldn't be any practical gain, since we already have
> ioctl(2). Any gain would be aesthetic.

I can tell you haven't had to write any 32-bit ioctl emulation code for
a 64-bit kernel recently.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)

2001-05-19 Thread Matthew Wilcox


On Sat, May 19, 2001 at 12:51:07PM -0400, Alexander Viro wrote:
> clone(), walk(), clunk(), stat() and open() ;-) Basically, we can add
> unopened descriptors. I.e. no IO until you open it (turning the thing into
> opened one), but we can do lookups (move to child), we can clone and
> kill them and we can stat them.

Those who would like a more detailed explanation can find one at
http://plan9.bell-labs.com/sys/man/5/INDEX.html

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-19 Thread Matthew Wilcox

On Sat, May 19, 2001 at 05:25:22PM +0100, Alan Cox wrote:
> Only to an English speaker. I suspect Quebec City canadians would prefer a
> different command set.

Should we support `pas387' as well as `no387' as a kernel boot parameter
then?  Face it, a sysadmin has to know the limited subset of english
which is used to configure a kernel.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

F_SETLK_NP

2001-05-06 Thread Matthew Wilcox



Here's an initial implementation of F_SETLK_NP as previously discussed
on linux-fsdevel.  The _ONLY_ difference between these locks and POSIX
locks is that they are not removed on a close(dup(fd)).

I have a question about the interaction between POSIX and RANGE locks.
If they overlap, should they be merged?  And if so, what type should
the resulting lock have?  My initial thought is that they should _not_
be merged but _should_ conflict, which is going to require a slightly
bigger patch.  This patch merges them and gives the merged lock the type
of the lock which arrived first.

Anyway, here's the test program (I checked what was going on via
/proc/locks, you can too).

/* lock-test.c Copyright (c) Matthew Wilcox, Hewlett-Packard.  GPL License */

#define _GNU_SOURCE

#include 
#include 
#include 

#define F_SETLK_NP 15

int main(int argc, char **argv) {
struct flock64 lock;
int fd;

lock.l_type = F_RDLCK;
lock.l_whence = 0;
lock.l_start = 50;
lock.l_len = 100;

fd = open(argv[0], O_RDONLY);
fcntl(fd, F_SETLK_NP, &lock);
printf("Applied lock to file -- check and press the `Enter' key\n");
getchar();
close(dup(fd));
printf("Closed dup -- check and press the `Enter' key\n");
getchar();
close(fd);
printf("Closed original -- check and press the `Enter' key\n");
getchar();
return 0;
}

And now for the patch itself.  It was generated against 2.4.5-pre1 but
will probably apply against any 2.4.x kernel.  I've taken the liberty
of getting rid of some old cruft which wasn't used any more.  This patch is
i386 only -- just add constants to other arch include files as required.

diff -urNX dontdiff linux-245p1/fs/fcntl.c linux-245p1-vfs/fs/fcntl.c
--- linux-245p1/fs/fcntl.c  Sun May  6 09:16:20 2001
+++ linux-245p1-vfs/fs/fcntl.c  Sun May  6 05:52:35 2001
@@ -260,6 +260,10 @@
case F_SETLKW:
err = fcntl_setlk(fd, cmd, (struct flock *) arg);
break;
+   case F_SETLK_NP:
+   case F_SETLKW_NP:
+   err = fcntl_setlk_np(fd, cmd, (struct flock64 *) arg);
+   break;
case F_GETOWN:
/*
 * XXX If f_owner is a process group, the
diff -urNX dontdiff linux-245p1/fs/locks.c linux-245p1-vfs/fs/locks.c
--- linux-245p1/fs/locks.c  Sun May  6 09:14:08 2001
+++ linux-245p1-vfs/fs/locks.c  Sun May  6 05:38:12 2001
@@ -547,7 +547,7 @@
/* POSIX locks owned by the same process do not conflict with
 * each other.
 */
-   if (!(sys_fl->fl_flags & FL_POSIX) ||
+   if (!(sys_fl->fl_flags & (FL_POSIX | FL_RANGE)) ||
locks_same_owner(caller_fl, sys_fl))
return (0);
 
@@ -620,7 +620,7 @@
 
lock_kernel();
for (cfl = filp->f_dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
-   if (!(cfl->fl_flags & FL_POSIX))
+   if (!(cfl->fl_flags & (FL_POSIX | FL_RANGE)))
continue;
if (posix_locks_conflict(cfl, fl))
break;
@@ -863,7 +863,7 @@
if (caller->fl_type != F_UNLCK) {
   repeat:
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
-   if (!(fl->fl_flags & FL_POSIX))
+   if (!(fl->fl_flags & (FL_POSIX | FL_RANGE)))
continue;
if (!posix_locks_conflict(caller, fl))
continue;
@@ -892,7 +892,7 @@
 
/* First skip locks owned by other processes.
 */
-   while ((fl = *before) && (!(fl->fl_flags & FL_POSIX) ||
+   while ((fl = *before) && (!(fl->fl_flags & (FL_POSIX | FL_RANGE)) ||
  !locks_same_owner(caller, fl))) {
before = &fl->fl_next;
}
@@ -1458,23 +1458,72 @@
break;
case F_UNLCK:
break;
-   case F_SHLCK:
-   case F_EXLCK:
-#ifdef __sparc__
-/* warn a bit for now, but don't overdo it */
-{
-   static int count = 0;
-   if (!count) {
-   count=1;
-   printk(KERN_WARNING
-  "fcntl_setlk() called by process %d (%s) with broken flock() 
emulation\n",
-  current->pid, current->comm);
+   default:
+   error = -EINVAL;
+   goto out_putf;
+   }
+
+   if (filp->f_op && filp->f_op->lock != NULL) {
+   error = filp->f_op->lock(filp, cmd, file_lock);
+   if (error < 0)
+   goto out_putf;
}
+   error = posix_lock_file(filp,

Re: File Locking in Linux 2.5

2001-05-04 Thread Matthew Wilcox


On Fri, May 04, 2001 at 09:35:33AM -0700, Peter J. Braam wrote:
> For cluster and distributed file systems we actually need a somewhat
> different approach to locking.  The best thing to have would be for the FS
> to know that a lock will be needed and for what operation at lookup time.

For locking-used-to-guarantee-cluster-operation, yes.  I think that's
a completely different thing to applications-locking-files and it's
a mistake to try to treat the two as one.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: File Locking in Linux 2.5

2001-05-04 Thread Matthew Wilcox

On Fri, May 04, 2001 at 01:33:44PM +0200, Trond Myklebust wrote:
> ??? It'll be a completely useless form of lock if programs can end up
> blocking forever on a set of (from their perspective) valid
> operations.

Umm.. that's your opinion.  I don't see how `making them mandatory' would
make this better or worse than if they check for deadlock.  If they're
mandatory, a process doing a read() which conflicts will block forever
whether they're checked for deadlock or not.

> Ideally the VFS locking manager wants to provide a service to samba,
> NFS and their ilk, whereby they ask for a lock. If waiting would
> deadlock, send back an immediate answer to the caller. If not, then
> the locking manager should be responsible for scheduling a lock
> immediately after the blocking lock is released, and notifying the
> caller that it can proceed.

did you read my proposal?  was that not exactly what i suggested?

> Anything less than that would be an unacceptable step backward
> wrt. the existing code. We don't want to cram more responsabilities
> down on the filesystem-level.

Actually, we _do_ want to give the filesystems the opportunity to handle
their own locks without poking around in the guts of locks.c as lockd
currently does.  Did my proposal seem unacceptable to you?

> How about allowing the process to use a cookie to define ownership at
> the kernel level?
> 
> For normal processes this may or may not include some combination
> involving the pid, but kernel level server processes could use this to
> save the metadata required to identify the client. I know the NFSv4
> people are keen to implement this sort of thing...

I like this.  That may be why the current code checks both owner and
pid fields.  Definitely worthy of a comment if so...

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: File Locking in Linux 2.5

2001-05-03 Thread Matthew Wilcox

On Thu, May 03, 2001 at 02:10:48PM +0200, Trond Myklebust wrote:
> >>>>> " " == Matthew Wilcox <[EMAIL PROTECTED]> writes:
> 
>  > I'll get to it this weekend then.  Should be a relatively
>  > simple patch.  Are there any other semantics you want changing
>  > from the POSIX lock?  I'm thinking of removing deadlock
>  > detection, for one...
> 
> What do you mean by `removing' it? Eliminating deadlock detection
> entirely will get you in deep trouble when you add mandatory locking.

Huh?  (1) I don't intend to have mandatory locking on these new locks.
(2) how exactly would it get me into trouble?

Let's just go over the semantics we currently have for posix locks,
and see if we can come up with `better' lock semantics.

POSIX locks are advisory (unless you do the mandatory lock dance),
and apply to a byte-range.  They are checked for deadlock, merged with other
compatible locks (same PID, same `owner').  (this seems redundant and there's
a question mark by it in the source... never mind.)

The POSIX locking rules mean that you can do funky things like:

lock bytes 0-100
lock bytes 50-150
unlock bytes 50-150

and you have bytes 0-49 left locked.  of course, you can then unlock
bytes 0-100 and it will remove the lock from 0-49, but this behaviour
seems unintuitive.

You can also (and this causes a fair bit of ugliness in the kernel code)
do:

lock 0-100
unlock 25-75

or, even worse

lock_read 0-100
lock_write 25-75

which leaves you with 0-24 read-locked, 25-75 write-locked and 76-100
read-locked.

i can't imagine this is behaviour people actually use/desire.  So for
a new lock type, i propose the following semantics:

byte-range, advisory, non-merging, non-deadlock-checking

Another semantic issue I want to tie down.  Is it useful for different
threadsd to be able to share a lock?  Or should locks from different
threads conflict?  I'm currently leaning towards the latter, but you
may be able to convince me I'm wrong.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: File Locking in Linux 2.5

2001-05-03 Thread Matthew Wilcox


On Thu, May 03, 2001 at 01:28:33PM +0200, Andi Kleen wrote:
> Just don't forget to add a per user ulimit for it and probably an admin
> tool like ipcs.

It'll use the same limit as the other locks do.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: File Locking in Linux 2.5

2001-05-03 Thread Matthew Wilcox


On Thu, May 03, 2001 at 12:19:31AM -0700, Jeremy Allison wrote:
> > I want to add another byte-range lock, which looks and smells like a
> > POSIX fcntl lock except that it is not removed by closing any fd which
> > happens to be open on this file.
> 
> If you do this I'll be eternally in your debt ! I had to
> write that code for Samba 2.2 and it was a *horrible* 
> experience. With non-broken locking we can #ifdef out
> via autoconf a large chunk of code from the Linux Samba
> implementation, and free up fd's immediately on close.
> 
> As soon as you have any prototypes of this please let
> us know on the [EMAIL PROTECTED] list - I'd
> like to try it asap !

I'll get to it this weekend then.  Should be a relatively simple patch.
Are there any other semantics you want changing from the POSIX lock?
I'm thinking of removing deadlock detection, for one...

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: File Locking in 2.5

2001-05-02 Thread Matthew Wilcox

On Wed, May 02, 2001 at 04:49:30PM +0200, Andi Kleen wrote:
> On Mon, Apr 30, 2001 at 12:39:23PM -0600, Matthew Wilcox wrote:
> >  * All filesystems will fill in their ->lock method.
> 
> Why when a common stub should work for 90% of them?  Please keep
> global search-and-edit operation low when not absolutely possible.

I think this will be one of the more minor changes that filesystems will
see during 2.5 :-)

> >  * Local filesystems should all use local_lock() for this method,
> >unless they have a good reason to provide their own facilities.
> 
> 
> So this should be default for ->lock

Perhaps we misunderstand one another?  I propose most filesystems should do

struct file_operations {
...
lock: local_lock
}

and I think you're proposing

if (i_fop->lock)
i_fop->lock();
else
local_lock();

please correct me if I'm wrong.

> >A clustered filesystem might call out to the network and say `I want
> >to put this lock on this file'. Either some other node in the cluster
> >says `Denied', `Blocked' or `Granted' (ie handles the request), OR no
> >other node accepts responsibility for the lock, in which case we lock
> >it locally by calling local_lock().
> 
> It would be nice if it was possible to interface it to the distributed lock
> manager IBM has recently GPLed...

It would.  I hadn't heard that got released, I shall investigate.

> It's still linear lists I guess? Any evidence that a hash table or a ternary
> search tree makes sense?

I'm not going to change it from a linear list until I've made these other
changes.  In the meantime, if anyone knows of a data structure which lets me
record ranges and check for overlaps, let me know...

> I would add a printk warning that fires 10 times, then you can get rid
> of it.

They're only supported for the benefit of really old applications (ie
stuff that runs under iBCS).  We could probably drop them or make them a
config option.  I've never met anyone who mounts their filesystems with
-o mand :-)

> I never understood why you can't do it completely inode local protected
> by the inode lock; but then I don't claim that I understand fs/locks.c

inode_lock claims it's for protecting the list manipulations, not for
protecting any element in an individual inode.  Though in practice it
seems to be used for protecting i_state as well.  I think I'd rather use
a different lock to protect i_flock -- inode_lock is probably already
highly contended.

> Or just walk the inode hash ? 

yes, could do that.  but /proc/locks really ought to die.  i don't see why
we need to expose kernel addresses to userspace.  the extreme security
people have also voiced concerns about the covert channel involved and
the way this can be used to spy on other processes.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

1 2 >

1 - 100 of 127 matches

Mail list logo