Re: how to show propagation state for mounts
On Wed, Feb 20, 2008 at 04:04:22PM +, Al Viro wrote: > It's less about the form of representation (after all, we generate poll > events when contents of that sucker changes, so one *can* get a consistent > snapshot of the entire thing) and more about having it self-contained > when we have namespaces in the play. > > IOW, the data in there should give answers to questions that make sense. > "Do events get propagated from this vfsmount I have to that vfsmount I have?" > is a meaningful one; ditto for "are events here propagated to somewhere I > don't see?" or "are events getting propagated here from somewhere I don't > see?". Why do those last two questions deserve an answer? How will a person's or application's behaviour be affected by whether a change will propagate to something they don't know about and can't see? -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] enhanced ESTALE error handling
On Fri, Jan 18, 2008 at 10:36:01AM -0500, Peter Staubach wrote: > @@ -1025,12 +1027,27 @@ static int fastcall link_path_walk(const > mntget(save.mnt); > > result = __link_path_walk(name, nd); > - if (result == -ESTALE) { > + while (result == -ESTALE) { > + /* > + * If no progress was made looking up the pathname, > + * then stop and return ENOENT instead of ESTALE. > + */ > + if (nd->dentry == save.dentry) { > + result = -ENOENT; > + break; > + } > *nd = save; > dget(nd->dentry); > mntget(nd->mnt); > nd->flags |= LOOKUP_REVAL; > result = __link_path_walk(name, nd); > + /* > + * If no progress was made this time, then return > + * ENOENT instead of ESTALE because no recovery > + * is possible to recover the stale file handle. > + */ > + if (result == -ESTALE && nd->dentry == save.dentry) > + result = -ENOENT; > } > > dput(save.dentry); Why do you need both of these tests? The first one should be enough, surely? > @@ -1268,8 +1285,8 @@ int path_lookup_open(int dfd, const char > * @create_mode: create intent flags > */ > static int path_lookup_create(int dfd, const char *name, > - unsigned int lookup_flags, struct nameidata *nd, > - int open_flags, int create_mode) > + unsigned int lookup_flags, struct nameidata *nd, > + int open_flags, int create_mode) Gratuitous reformatting? > @@ -1712,7 +1729,10 @@ int open_namei(int dfd, const char *path > int acc_mode, error; > struct path path; > struct dentry *dir; > - int count = 0; > + int count; > + > +top: > + count = 0; > > acc_mode = ACC_MODE(flag); > > @@ -1739,7 +1759,8 @@ int open_namei(int dfd, const char *path > /* >* Create - we need to know the parent. >*/ > - error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode); > + error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, nd, > + flag, mode); > if (error) > return error; > > @@ -1812,10 +1833,17 @@ ok: > return 0; > > exit_dput: > + if (error == -ESTALE) > + d_drop(path.dentry); > dput_path(&path, nd); > exit: > if (!IS_ERR(nd->intent.open.file)) > release_open_intent(nd); > + if (error == -ESTALE) { > + d_drop(nd->dentry); > + path_release(nd); > + goto top; > + } I wonder if a tail-call might not work better here. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: file locks: Use wait_event_interruptible_timeout()
On Tue, Jan 15, 2008 at 09:48:51AM -0500, J. Bruce Fields wrote: > On Mon, Jan 14, 2008 at 09:28:30PM -0700, Matthew Wilcox wrote: > > interruptible_sleep_on_locked() is just an open-coded > > wait_event_interruptible_timeout() with a few assumptions since we know > > we hold the BKL. locks_block_on_timeout() is only used in one place, so > > it's actually simpler to inline it into its caller. > > Makes sense, thanks. So the assumption we were depending on the BKL for > was that we could count on the wake-up not coming till after we block, > so we could skip a check ->fl_next that's normally needed to resolve the > usual sleeping-on-some-condition race? That's right. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
file locks: Split flock_find_conflict out of flock_lock_file
Reduce the spaghetti-like nature of flock_lock_file by making the chunk of code labelled find_conflict into its own function. Also allocate memory before taking the kernel lock in preparation for switching to a normal spinlock. Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]> diff --git a/fs/locks.c b/fs/locks.c index b681459..bc691e5 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -699,6 +699,33 @@ next_task: return 0; } +/* + * A helper function for flock_lock_file(). It searches the list of locks + * for @inode looking for one which would conflict with @request. + * If it does not find one, it returns the address where the requested lock + * should be inserted. If it does find a conflicting lock, it returns NULL. + */ +static struct file_lock ** +flock_find_conflict(struct inode *inode, struct file_lock *request) +{ + struct file_lock **before; + + for_each_lock(inode, before) { + struct file_lock *fl = *before; + if (IS_POSIX(fl)) + break; + if (IS_LEASE(fl)) + continue; + if (!flock_locks_conflict(request, fl)) + continue; + + if (request->fl_flags & FL_SLEEP) + locks_insert_block(fl, request); + return NULL; + } + return before; +} + /* Try to create a FLOCK lock on filp. We always insert new FLOCK locks * after any leases, but before any posix locks. * @@ -714,18 +741,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) int error = 0; int found = 0; - lock_kernel(); - if (request->fl_flags & FL_ACCESS) - goto find_conflict; + if (request->fl_flags & FL_ACCESS) { + lock_kernel(); + before = flock_find_conflict(inode, request); + unlock_kernel(); + return before ? 0 : -EAGAIN; + } if (request->fl_type != F_UNLCK) { - error = -ENOMEM; new_fl = locks_alloc_lock(); - if (new_fl == NULL) - goto out; - error = 0; + if (!new_fl) + return -ENOMEM; } + lock_kernel(); + for_each_lock(inode, before) { struct file_lock *fl = *before; if (IS_POSIX(fl)) @@ -754,22 +784,12 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) if (found) cond_resched(); -find_conflict: - for_each_lock(inode, before) { - struct file_lock *fl = *before; - if (IS_POSIX(fl)) - break; - if (IS_LEASE(fl)) - continue; - if (!flock_locks_conflict(request, fl)) - continue; + before = flock_find_conflict(inode, request); + if (!before) { error = -EAGAIN; - if (request->fl_flags & FL_SLEEP) - locks_insert_block(fl, request); goto out; } - if (request->fl_flags & FL_ACCESS) - goto out; + locks_copy_lock(new_fl, request); locks_insert_lock(before, new_fl); new_fl = NULL; -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
file locks: Use wait_event_interruptible_timeout()
interruptible_sleep_on_locked() is just an open-coded wait_event_interruptible_timeout() with a few assumptions since we know we hold the BKL. locks_block_on_timeout() is only used in one place, so it's actually simpler to inline it into its caller. Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]> locks.c | 33 - 1 file changed, 4 insertions(+), 29 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index 8b8388e..b681459 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -634,33 +634,6 @@ static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *s return (locks_conflict(caller_fl, sys_fl)); } -static int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, int timeout) -{ - int result = 0; - DECLARE_WAITQUEUE(wait, current); - - __set_current_state(TASK_INTERRUPTIBLE); - add_wait_queue(fl_wait, &wait); - if (timeout == 0) - schedule(); - else - result = schedule_timeout(timeout); - if (signal_pending(current)) - result = -ERESTARTSYS; - remove_wait_queue(fl_wait, &wait); - __set_current_state(TASK_RUNNING); - return result; -} - -static int locks_block_on_timeout(struct file_lock *blocker, struct file_lock *waiter, int time) -{ - int result; - locks_insert_block(blocker, waiter); - result = interruptible_sleep_on_locked(&waiter->fl_wait, time); - __locks_delete_block(waiter); - return result; -} - void posix_test_lock(struct file *filp, struct file_lock *fl) { @@ -1256,7 +1229,10 @@ restart: if (break_time == 0) break_time++; } - error = locks_block_on_timeout(flock, new_fl, break_time); + locks_insert_block(flock, new_fl); + error = wait_event_interruptible_timeout(new_fl->fl_wait, + !new_fl->fl_next, break_time); + __locks_delete_block(new_fl); if (error >= 0) { if (error == 0) time_out_leases(inode); -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Leak in nlmsvc_testlock for async GETFL case
On Mon, Jan 14, 2008 at 03:44:19PM -0500, J. Bruce Fields wrote: > Thanks! I've queued it up for 2.6.25. Hi Bruce, I haven't had as much time to play with de-BKL-ising fs/locks.c as I would like, so fixing that for 2.6.25 is probably out of the question, but here are two janitorial patches that hopefully can be applied and will make the next steps easier. They make sense all by themselves, even if I don't get back to this project for a few months. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] rewrite rd
On Tue, Dec 04, 2007 at 05:26:28AM +0100, Nick Piggin wrote: > +static void copy_to_brd(struct brd_device *brd, const void *src, > + sector_t sector, size_t n) > +{ > + struct page *page; > + void *dst; > + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; > + size_t copy; > + > + copy = min((unsigned long)n, PAGE_SIZE - offset); > + page = brd_lookup_page(brd, sector); > + BUG_ON(!page); > + > + dst = kmap_atomic(page, KM_USER1); > + memcpy(dst + offset, src, copy); > + kunmap_atomic(dst, KM_USER1); You're using kmap_atomic, but I see no reason you can't be preempted. Don't you need to at least disable preemption while you have stuff atomically kmapped? -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: On setting a lease across a cluster
On Fri, Jan 04, 2008 at 03:53:04PM -0500, J. Bruce Fields wrote: > So, the problem is that fcntl_setlease() does > > vfs_setlease() > fasync_helper() > > which the bkl held over both, and you want to preserve that? > > But what that BKL is doing is a mystery to me--the very first thing that > fasync_helper() does is kmem_cache_alloc(., GFP_KERNEL). So you won't > be introducing any new problem if you lock those two operations > separately. Unless I'm totally missing something. A very good point. So yet another race caused by using the BKL rather than thinking ... but maybe it's an inconsequential race. The consequences are that (if the kmalloc in fasync_helper sleeps) a lease appears that isn't fully set-up yet (and may be removed if the kmalloc fails). Actually, it seems bad if the kmalloc eventually succeeds -- there's a window while kmalloc is sleeping where another process could open the file, break the lease, fl_fasync will be NULL, so no signal is sent. Then 30 seconds later the lease is removed without the leaseholder being sent a signal. Bad. How can we fix this situation? I think we need a better interface than fasync_helper() -- fasync_alloc() and fasync_setup() would seem to do the trick. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: On setting a lease across a cluster
On Fri, Jan 04, 2008 at 02:47:18PM -0500, J. Bruce Fields wrote: > > > Then I started to wonder about the current split of functionality between > > > fcntl_setlease, vfs_setlease and generic_setlease. The check for no > > > other process having this file open should be common to all filesystems. > > > And it should be honoured by NFS (it shouldn't hand out a delegation if > > > a local user has the file open), so it needs to be in vfs_setlease(). > > Of course, in the case of a (still unfortunately theoretical) cluster > filesystem implementation, those checks will be insufficient on their > own. Oh yes, definitely insufficient, but also necessary. Right now, filesystems bypass them entirely. And the last thing we want is individual filesystems duplicating these tests ... > Though the write-lease check especially: > > if ((arg == F_WRLCK) > && ((atomic_read(&dentry->d_count) > 1) > || (atomic_read(&inode->i_count) > 1))) > goto out; > > seems like a terrible hack, and gives some annoying false positives: > > http://www.ussg.iu.edu/hypermail/linux/kernel/0711.1/2255.html It is a terrible hack. It was sufficient for the time, but it wasn't supposed to be merged in that state ... I haven't had the courage to figure out how to solve it properly yet. > > > Now I'm worrying about the mechanics of calling out to a filesystem to > > > perform a lock. Obviously, a network filesystem is going to have to > > > sleep waiting for a response from the fileserver, so that precludes the > > > use of a spinlock held over the call to f_op->setlease. I'm not keen on > > > the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite > > > hard enough without worrying what a filesystem might be doing with it. > > I'm confused as to what the locks are actually protecting. > > If they're mainly just protecting the inode's lock list, for example, > then we should let the filesystem call back to helpers in locks.c for > any manipulations of that list and do the locking in those helpers. But we need to ensure the lock list is always consistent for, eg, /proc/locks. We don't want locks to temporarily appear on the list when they never get granted (due to an error elsewhere in the chain). > > > vfs_setlease() > > >if (f_op->setlease()) > > > res = f_op->setlease() > > > if (res) > > > return res; > > >lock_kernel() > > >generic_setlease() > > >unlock_kernel() > > Why can't the filesystem call into generic_setlease() on its own? Because (assuming we're rid of the BKL), fcntl_setlease() needs to acquire the spinlock and hold it while generic_setlease() runs, so generic_setlease() can't acquire the lock. > Christoph Hellwig has suggested removing the second case and just doing > > .setlease = generic_setlease > > for all filesystems that should support leases; see the final patch in > > git://linux-nfs.org/~bfields/linux.git server-cluster-lease-api Yeah, but that doesn't work if we repurpose the setlease f_op. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: On setting a lease across a cluster
On Fri, Jan 04, 2008 at 01:55:36PM -0500, david m. richter wrote: > fwiw, i've done some work on extending the lease subsystem to help > support the full range of requirements for NFSv4 file and directory > delegations (e.g., breaking a lease when unlinking a file) and we ended up > actually doing most of what you've just suggested here, which i take to be > a good sign. As long as it's great minds thinking alike and not fools seldom differing ;-) > most of my refactoring came out of trying to simplify locking and > avoid holding locks too long (rather than specifically focusing on > cluster-oriented stuff, but the goals dovetail) and your work on getting > the BKL out of locks.c entirely is something i really like and look > forward to. Excellent. Shall I make the patch myself, or did you want to post a patch based on working code? ;-) -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
On setting a lease across a cluster
Hi Bruce, The current implementation of vfs_setlease/generic_setlease/etc is a bit quirky. I've been thinking it over for the past couple of days, and I think we need to refactor it to work sensibly. As you may have noticed, I've been mulling over getting rid of the BKL in fs/locks.c and the lease interface is particularly problematic. Here's one reason why: fcntl_setlease lock_kernel vfs_setlease lock_kernel generic_setlease unlock_kernel fasync_helper unlock_kernel This is perfectly legitimate for the BKL of course, but other spinlocks can't be acquired recursively in this way. At first I thought I'd just push the call to generic_setlease into __vfs_setlease and have two ways of calling it: fcntl_setlease lock_kernel __vfs_setlease fasync_helper unlock_kernel vfs_setlease lock_kernel __vfs_setlease unlock_kernel But generic_setlease allocates memory (GFP_KERNEL) under the lock, so that's bad for converting to spnlocks later. Then I thought I'd move the spinlock acquisition into generic_setlease(). But I can't do that either as fcntl_setlease() needs to hold the spinlock around the call to generic_setlease() and fasync_helper(). Then I started to wonder about the current split of functionality between fcntl_setlease, vfs_setlease and generic_setlease. The check for no other process having this file open should be common to all filesystems. And it should be honoured by NFS (it shouldn't hand out a delegation if a local user has the file open), so it needs to be in vfs_setlease(). Now I'm worrying about the mechanics of calling out to a filesystem to perform a lock. Obviously, a network filesystem is going to have to sleep waiting for a response from the fileserver, so that precludes the use of a spinlock held over the call to f_op->setlease. I'm not keen on the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite hard enough without worrying what a filesystem might be doing with it. So I think we need to refactor the interface, and I'd like to hear your thoughts on my ideas of how to handle this. First, have clients of vfs_setlease call lease_alloc() and pass it in, rather than allocate it on the stack and have this horrible interface where we may pass back an allocated lock. This allows us to not allocate memory (hence sleep) during generic_setlease(). Second, move some of the functionality of generic_setlease() to vfs_setlease(), as mentioned above. Third, change the purpose of f_op->setlease. We can rename it if you like to catch any out of tree users. I'd propose using it like this: vfs_setlease() if (f_op->setlease()) res = f_op->setlease() if (res) return res; lock_kernel() generic_setlease() unlock_kernel() fcntl_setlease if (f_op->setlease()) res = f_op->setlease() if (res) return res; lock_kernel generic_setlease() ... fasync ... unlock_kernel So 'f_op->setlease' now means "Try to get a lease from the fileserver". We can optimise this a bit to not even call setlease if, say, we already have a read lease and we're trying to get a second read lease. But we have to record our local leases (that way they'll show up in /proc/locks). I think the combination of these three ideas gives us a sane interface to the various setlease functions, and let us convert from lock_kernel() to spin_lock() later. Have I missed something? I don't often think about the needs of cluster filesystems, so I may misunderstand how they need this operation to work. At some point, we need to revisit the logic for 'is this file open by another process' -- it's clearly buggy since it doesn't hold the inode_lock while checking i_count, so it could race against an __iget(). -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Remove BKL from fs/locks.c
On Sun, Dec 30, 2007 at 01:05:10PM +, Christoph Hellwig wrote: > If people are really having any kind of scalability problems with this > still it should be quite trivial to make the file_lock_list and > blocked_list aswel as the new file_lock_lock per-superblock as file > and thus locks never move between superblocks. In fact I'd probably > do this even without scalability concerns just to make our fs data > structures nice per-superblock. Hrm. file_lock_list is used for /proc/locks, so that's OK to convert, we just need to iterate over each super block in the system, acquire the sb_file_lock_lock, then iterate over each sb_file_lock_list. The blocked_list is a bit more complex since we need to check every lock on the blocked list, and would need to acquire all the sb_file_lock_locks to check this list consistently. I don't see a nice way to do this -- particularly when you consider that we need to run this check every time someone takes out a POSIX lock that blocks on another lock. > > I had to move one memory allocation out from under the file_lock_lock. > > I hope I got that logic right. I'm rather tempted to split out the > > find_conflict algorithm from that function into something that can be > > called separately for the FL_ACCESS case. > > Yes, splitting that out makes a lot of sense. Should be a separate > patch, though. Indeed. What you see here is just me hacking until stuff works. > > +static inline void lock_flocks(void) > > +{ > > + spin_lock(&file_lock_lock); > > +} > > + > > +static inline void unlock_flocks(void) > > +{ > > + spin_unlock(&file_lock_lock); > > +} > > I'd rather not introduce this wrappers, they only obsfucated what's > really going on. Fair enough. file_lock_lock is a crappy name though, and I was embarrassed to use it everywhere. > > + if (found) { > > + unlock_flocks(); > > cond_resched(); > > + lock_flocks(); > > + } > > There's a cond_resched_lock that only drops the lock in case we really > need to block. Ooh, thanks, I didn't know about that. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Remove BKL from fs/locks.c
On Sun, Dec 30, 2007 at 08:36:44PM +1100, Stephen Rothwell wrote: > We should probably do some performance testing on this because the last > time we tried the impact was quite noticeable. You should ping Tridge as > he has some good lock testing setups. And he cares if we slow him down :-) Last time I did this, I switched to a semaphore instead of a spinlock. That was what slowed us down. I doubt we can see a performance loss with this patch since it's a 1-1 substitution of the BKL spinlock with a private spinlock. Good idea about asking tridge for an evaluation of the patch though. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] Remove BKL from fs/locks.c
I've been promising to do this for about seven years now. It seems to work well enough, but I haven't run any serious stress tests on it. This implementation uses one spinlock to protect both lock lists and all the i_flock chains. It doesn't seem worth splitting up the locking any further. I had to move one memory allocation out from under the file_lock_lock. I hope I got that logic right. I'm rather tempted to split out the find_conflict algorithm from that function into something that can be called separately for the FL_ACCESS case. I also have to drop and reacquire the file_lock_lock around the call to cond_resched(). This was done automatically for us before by the special BKL semantics. I had to change vfs_setlease() as it relied on the special BKL ability to recursively acquire the same lock. The internal caller now calls __vfs_setlease and the exported interface acquires and releases the file_lock_lock around calling __vfs_setlease. I should probably split out the removal of interruptible_sleep_on_locked() as it's basically unrelated to all this. diff --git a/fs/locks.c b/fs/locks.c index 8b8388e..68de569 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -139,9 +139,23 @@ int lease_break_time = 45; #define for_each_lock(inode, lockp) \ for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next) +/* + * Protects the two list heads below, plus the inode->i_flock list + */ +static DEFINE_SPINLOCK(file_lock_lock); static LIST_HEAD(file_lock_list); static LIST_HEAD(blocked_list); +static inline void lock_flocks(void) +{ + spin_lock(&file_lock_lock); +} + +static inline void unlock_flocks(void) +{ + spin_unlock(&file_lock_lock); +} + static struct kmem_cache *filelock_cache __read_mostly; /* Allocate an empty lock structure. */ @@ -507,9 +521,9 @@ static void __locks_delete_block(struct file_lock *waiter) */ static void locks_delete_block(struct file_lock *waiter) { - lock_kernel(); + lock_flocks(); __locks_delete_block(waiter); - unlock_kernel(); + unlock_flocks(); } /* Insert waiter into blocker's block list. @@ -634,29 +648,15 @@ static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *s return (locks_conflict(caller_fl, sys_fl)); } -static int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, int timeout) -{ - int result = 0; - DECLARE_WAITQUEUE(wait, current); - - __set_current_state(TASK_INTERRUPTIBLE); - add_wait_queue(fl_wait, &wait); - if (timeout == 0) - schedule(); - else - result = schedule_timeout(timeout); - if (signal_pending(current)) - result = -ERESTARTSYS; - remove_wait_queue(fl_wait, &wait); - __set_current_state(TASK_RUNNING); - return result; -} - -static int locks_block_on_timeout(struct file_lock *blocker, struct file_lock *waiter, int time) +static int locks_block_on_timeout(struct file_lock *blocker, + struct file_lock *waiter, int time) { int result; locks_insert_block(blocker, waiter); - result = interruptible_sleep_on_locked(&waiter->fl_wait, time); + unlock_flocks(); + result = wait_event_interruptible_timeout(waiter->fl_wait, + !waiter->fl_next, time); + lock_flocks(); __locks_delete_block(waiter); return result; } @@ -666,7 +666,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl) { struct file_lock *cfl; - lock_kernel(); + lock_flocks(); for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) { if (!IS_POSIX(cfl)) continue; @@ -677,7 +677,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl) __locks_copy_lock(fl, cfl); else fl->fl_type = F_UNLCK; - unlock_kernel(); + unlock_flocks(); return; } @@ -741,18 +741,16 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) int error = 0; int found = 0; - lock_kernel(); - if (request->fl_flags & FL_ACCESS) - goto find_conflict; - - if (request->fl_type != F_UNLCK) { - error = -ENOMEM; + if (!(request->fl_flags & FL_ACCESS) && (request->fl_type != F_UNLCK)) { new_fl = locks_alloc_lock(); - if (new_fl == NULL) - goto out; - error = 0; + if (!new_fl) + return -ENOMEM; } + lock_flocks(); + if (request->fl_flags & FL_ACCESS) + goto find_conflict; + for_each_lock(inode, before) { struct file_lock *fl = *before; if (IS_POSIX(fl)) @@ -778,8 +776,11 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) * If a h
Re: NFS Killable tasks request comments on patch
On Thu, Dec 06, 2007 at 10:34:26AM -0700, Matthew Wilcox wrote: > I've done a more thorough review of Liam's work and come up with a few > more fixes (which I'll publish later today) I've put up a git tree for this work; see http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=task_killable I've split up some of the earlier patches, hopefully in a way which will make akpm less grumpy. Andrew, let me know if I've misunderstood how you want to see it. Here's the commit for the NFS people to chew on a bit. commit f05b88f294044cdf56fbe637a246ba6c7a14d6f1 Author: Matthew Wilcox <[EMAIL PROTECTED]> Date: Thu Dec 6 16:24:39 2007 -0500 NFS: Switch from intr mount option to TASK_KILLABLE By using the TASK_KILLABLE infrastructure, we can get rid of the 'intr' mount option. We have to use _killable everywhere instead of _interruptible as we get rid of rpc_clnt_sigmask/sigunmask. Signed-off-by: Liam R. Howlett <[EMAIL PROTECTED]> Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]> diff --git a/fs/nfs/client.c b/fs/nfs/client.c index 70587f3..310fa2f 100644 --- a/fs/nfs/client.c +++ b/fs/nfs/client.c @@ -302,7 +302,7 @@ found_client: if (new) nfs_free_client(new); - error = wait_event_interruptible(nfs_client_active_wq, + error = wait_event_killable(nfs_client_active_wq, clp->cl_cons_state != NFS_CS_INITING); if (error < 0) { nfs_put_client(clp); @@ -494,10 +494,6 @@ static int nfs_init_server_rpcclient(struct nfs_server *server, rpc_authflavor_t if (server->flags & NFS_MOUNT_SOFT) server->client->cl_softrtry = 1; - server->client->cl_intr = 0; - if (server->flags & NFS4_MOUNT_INTR) - server->client->cl_intr = 1; - return 0; } diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index 5e8d82f..7b994b2 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -193,7 +193,7 @@ static ssize_t nfs_direct_wait(struct nfs_direct_req *dreq) if (dreq->iocb) goto out; - result = wait_for_completion_interruptible(&dreq->completion); + result = wait_for_completion_killable(&dreq->completion); if (!result) result = dreq->error; @@ -391,9 +391,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { ssize_t result = 0; - sigset_t oldset; struct inode *inode = iocb->ki_filp->f_mapping->host; - struct rpc_clnt *clnt = NFS_CLIENT(inode); struct nfs_direct_req *dreq; dreq = nfs_direct_req_alloc(); @@ -405,11 +403,9 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov, if (!is_sync_kiocb(iocb)) dreq->iocb = iocb; - rpc_clnt_sigmask(clnt, &oldset); result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos); if (!result) result = nfs_direct_wait(dreq); - rpc_clnt_sigunmask(clnt, &oldset); nfs_direct_req_release(dreq); return result; @@ -767,9 +763,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov, size_t count) { ssize_t result = 0; - sigset_t oldset; struct inode *inode = iocb->ki_filp->f_mapping->host; - struct rpc_clnt *clnt = NFS_CLIENT(inode); struct nfs_direct_req *dreq; size_t wsize = NFS_SERVER(inode)->wsize; int sync = 0; @@ -787,11 +781,9 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov, if (!is_sync_kiocb(iocb)) dreq->iocb = iocb; - rpc_clnt_sigmask(clnt, &oldset); result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, sync); if (!result) result = nfs_direct_wait(dreq); - rpc_clnt_sigunmask(clnt, &oldset); nfs_direct_req_release(dreq); return result; diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index db5d96d..f68c222 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -433,15 +433,11 @@ static int nfs_wait_schedule(void *word) */ static int nfs_wait_on_inode(struct inode *inode) { - struct rpc_clnt *clnt = NFS_CLIENT(inode); struct nfs_inode *nfsi = NFS_I(inode); - sigset_t oldmask; int error; - rpc_clnt_sigmask(clnt, &oldmask); error = wait_on_bit_lock(&nfsi->flags, NFS_INO_REVALIDATING, - nfs_wait_schedule, TASK_INTERRUPTIBLE); - rpc_clnt_sigunmask(clnt, &oldmask); + nfs_wait_schedule, TASK_KILLABLE); return error; } diff --git a/fs/nfs/mou
Re: Massive slowdown when re-querying large nfs dir
On Mon, Nov 05, 2007 at 07:58:38AM +0300, Al Boldi wrote: > Any ideas? How about tcpdumping and seeing what requests are flowing across the wire? You might be able to figure out what's being done differently. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 09:38:55PM +, Alan Cox wrote: > > It doesn't require the system to detect it, only mandate what to return > > if it does detect it. > > We should be detecting at least the obvious case. What is the obvious case? A task that has never called clone()? > If SYSV only spots simple AB - BA deadlocks or taking the same lock twice > yourself then that ought to be sufficient for us too. You can't deadlock against yourself -- either with POSIX locks or BSD locks (POSIX simple upgrades/downgrades the lock; each byte in a file can only be in either read-locked state or write-locked state for a given process. BSD locks release the lock and wake all waiters before attempting to acquire the lock in its new state). In my other post, I showed a simple AB-BA deadlock that can't be easily detected. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 11:55:52PM +0100, Jiri Kosina wrote: > On Sun, 28 Oct 2007, Matthew Wilcox wrote: > > > Bzzt. You get a false deadlock with multiple threads like so: > > Thread A of task B takes lock 1 > > Thread C of task D takes lock 2 > > Thread C of task D blocks on lock 1 > > Thread E of task B blocks on lock 2 > > A potential for deadlock occurs if a process controlling a locked > region is put to sleep by attempting to lock another process' > locked region. If the system detects that sleeping until a locked > region is unlocked would cause a deadlock, fcntl() shall fail with > an [EDEADLK] error. > > This is what POSIX says [1], even after being modified with respect to > POSIX Threads Extension, right? > > So it doesn't deal with threads at all, just processess are taken into > account. Probably for a reason :) Did you have a concrete suggestion, or are you just quoting the spec? The problem is that it's nonsense -- processes don't sleep, threads do. I think the key is "would deadlock", not "might deadlock". Our current behaviour is clearly in violation of SuSv3. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 10:48:33PM +, Alan Cox wrote: > > Bzzt. You get a false deadlock with multiple threads like so: > > > > Thread A of task B takes lock 1 > > Thread C of task D takes lock 2 > > Thread C of task D blocks on lock 1 > > Thread E of task B blocks on lock 2 > > The spec and SYSV certainly ignore threading in this situation and you > know that perfectly well (or did in 2004) The discussion petered out (or that mailing list archive lost articles from the thread) without any kind of resolution, or indeed interest. What is your suggestion for handling this problem? As it is now, the kernel 'detects' deadlock where there is none, which doesn't seem allowed by SuSv3 either. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 05:50:30PM -0400, Trond Myklebust wrote: > > You can't fix the false EDEADLK detection without solving the halting > > problem. Best of luck with that. > > I can see that it would be difficult to do efficiently, but basically, > this boils down to finding a circular path in a graph. That is hardly an > unsolvable issue... Bzzt. You get a false deadlock with multiple threads like so: Thread A of task B takes lock 1 Thread C of task D takes lock 2 Thread C of task D blocks on lock 1 Thread E of task B blocks on lock 2 We currently declare deadlock at this point (unless the deadlock detection code has changed since I last looked at it), despite thread A being about to release lock 1. Oh, and by the way, thread E is capable of releasing lock 1, so you can't just say "well, detect by thread instead of by task". So the only way I can see to accurately detect deadlock is to simulate the future execution of all threads in task B to see if any of them will release lock 1 without first gaining lock 2. Which, I believe, is halting-equivalent. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 06:40:52PM +, Alan Cox wrote: > NAK. This is an ABI change. It was also comprehensively rejected before > because > > - EDEADLK behaviour is ABI Not in any meaningful way. > - EDEADLK behaviour is required by SuSv3 What SuSv3 actually says is: If the system detects that sleeping until a locked region is unlocked would cause a deadlock, fcntl() shall fail with an [EDEADLK] error. It doesn't require the system to detect it, only mandate what to return if it does detect it. > - We have no idea what applications may rely on this behaviour. I've never heard of one that does. > so we need to fix the bugs - the lock usage and the looping. At that > point it merely becomes a performance concern to those who use it, which > is the proper behaviour. If you want a faster non-checking one use > flock(), or add another flag that is a Linux "don't check for deadlock" You can't fix the false EDEADLK detection without solving the halting problem. Best of luck with that. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 01:43:21PM -0400, J. Bruce Fields wrote: > We currently attempt to return -EDEALK to blocking fcntl() file locking > requests that would create a cycle in the graph of tasks waiting on > locks. > > This is inefficient: in the general case it requires us determining > whether we're adding a cycle to an arbitrary directed acyclic graph. > And this calculation has to be performed while holding a lock (currently > the BKL) that prevents that graph from changing. > > It has historically been a source of bugs; most recently it was noticed > that it could loop indefinitely while holding the BKL. It can also return -EDEADLK spuriously. So yeah, just kill it. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SLUB performance regression vs SLAB
On Fri, Oct 05, 2007 at 08:48:53AM +0200, Jens Axboe wrote: > I'd like to second Davids emails here, this is a serious problem. Having > a reproducible test case lowers the barrier for getting the problem > fixed by orders of magnitude. It's the difference between the problem > getting fixed in a day or two and it potentially lingering for months, > because email ping-pong takes forever and "the test team has moved on to > other tests, we'll let you know the results of test foo in 3 weeks time > when we have a new slot on the box" just removing any developer > motivation to work on the issue. I vaguely remembered something called orasim, so I went looking for it. I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from 2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which seems to be a stillborn project. Is there anything else I should know about orasim? ;-) -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] isofs: add +w bit for non-RR discs
On Tue, Oct 02, 2007 at 08:00:26PM +0200, Jan Engelhardt wrote: > Add %S_IWUGO bit for files on ISO-9660 filesystems without RockRidge Looks to me like you've added S_IWUSR, not S_IWUGO. > - popt->mode = S_IRUGO | S_IXUGO; /* > + popt->mode = S_IRUGO | S_IWUSR | S_IXUGO; > - inode->i_mode = S_IRUGO | S_IXUGO | S_IFDIR; > + inode->i_mode = S_IRUGO | S_IWUSR | S_IXUGO | S_IFDIR; -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 01:55:37PM -0700, David Miller wrote: > Anything, I do mean anything, can be simulated using small test > programs. Pointing at a big fancy machine with lots of storage > and disk is a passive aggressive way to avoid the real issues, > in that nobody is putting forth the effort to try and come up > with an at least publishable test case that Christoph can use to > help you guys. > > If coming up with a reproducable and publishable test case is > the difference between this getting fixed and it not getting > fixed, are you going to invest the time to do that? If that's what it takes, then yes. But I'm far from convinced that it's as easy to come up with a TPC benchmark simulator as you think. There have been efforts in the past (orasim, for example), but presumably Christoph has already tried these benchmarks. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 01:48:34PM -0700, David Miller wrote: > There comes a point where it is the reporter's responsibility to help > the developer come up with a publishable test case the developer can > use to work on fixing the problem and help ensure it stays fixed. That's a lot of effort. Is it more effort than doing some remote debugging with Christoph? I don't know. > Using an unpublishable benchmark, whose results even cannot be > published, really stretches the limits of "reasonable" don't you > think? Yet here we stand. Christoph is aggressively trying to get slab removed from the tree. There is a testcase which shows slub performing worse than slab. It's not my fault I can't publish it. And just because I can't publish it doesn't mean it doesn't exist. Slab needs to not get removed until slub is as good a performer on this benchmark. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 10:49:52AM -0700, Christoph Lameter wrote: > I was not aware of that. Would it be possible for you to summarize all the > test data that you have right now about SLUB vs. SLAB with the patches > listed? Exactly what kernel version and what version of the per cpu > patches were tested? We have three runs, all with 2.6.23-rc3 plus the patches that Suresh applied from 20070922. The first run is with slab. The second run is with SLUB and the third run is SLUB plus the tuning parameters you recommended. I have a spreadsheet with Vtune data in it that was collected during each of these test runs, so we can see which functions are the hottest. I can grab that data and send it to you, if that's interesting. > Was the page allocator pass through patchset > separately applied as I requested? I don't believe so. Suresh? I think for future tests, it would be easiest if you send me a git reference. That way we will all know precisely what is being tested. > Finally: Is there some way that I can reproduce the tests on my machines? As usual for these kinds of setups ... take a two-CPU machine, 64GB of memory, half a dozen fibre channel adapters, about 3000 discs, a commercial database, a team of experts for three months worth of tuning ... I don't know if anyone's tried to replicate a benchmark like this using Postgres. Would be nice if they have ... - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 10:38:15AM -0700, Christoph Lameter wrote: > On Thu, 4 Oct 2007, Matthew Wilcox wrote: > > > So, on "a well-known OLTP benchmark which prohibits publishing absolute > > numbers" and on an x86-64 system (I don't think exactly which model > > is important), we're seeing *6.51%* performance loss on slub vs slab. > > This is with a 2.6.23-rc3 kernel. Tuning the boot parameters, as you've > > asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8) > > gets back 0.38% of that. It's still down 6.13% over slab. > > Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded > earlier. Seems that we are mainly seeing cacheline bouncing due to two > cpus accessing meta data in the same page struct. The patches in > MM that are scheduled to be merged for .24 address that issue. I > have repeatedly asked that these patches be tested. The patches were > posted months ago. I just checked with the guys who did the test. When I said -rc3, I mis-spoke; this is 2.6.23-rc3 *plus* the patches which Suresh agreed to test for you. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
SLUB performance regression vs SLAB
On Mon, Oct 01, 2007 at 01:50:44PM -0700, Christoph Lameter wrote: > The problem is with the weird way of Intel testing and communication. > Every 3-6 month or so they will tell you the system is X% up or down on > arch Y (and they wont give you details because its somehow secret). And > then there are conflicting statements by the two or so performance test > departments. One of them repeatedly assured me that they do not see any > regressions. Could you cut out the snarky remarks? It takes a long time to run a test, and testing every one of the patches you send really isn't high on anyone's priority list. The performance team have also been having problems getting stable results with recent kernels, adding to the delay. The good news is that we do now have committment to testing upstream kernels, so you should see results more frequently than you have been. I'm taking over from Suresh as liason for the performance team, so if you hear *anything* from *anyone* else at Intel about performance, I want you to cc me about it. OK? And I don't want to hear any more whining about hearing different things from different people. So, on "a well-known OLTP benchmark which prohibits publishing absolute numbers" and on an x86-64 system (I don't think exactly which model is important), we're seeing *6.51%* performance loss on slub vs slab. This is with a 2.6.23-rc3 kernel. Tuning the boot parameters, as you've asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8) gets back 0.38% of that. It's still down 6.13% over slab. For what it's worth, 2.6.23-rc3 already has a 1.19% regression versus RHEL 4.5, so the performance guys are really unhappy about going up to almost 8% regression. In the detailed profiles, __slab_free is the third most expensive function, behind only spin locks. get_partial_node is right behind it in fourth place, and kmem_cache_alloc is sixth. __slab_alloc is eight and kmem_cache_free is tenth. These positions don't change with the slub boot parameters. Now, where do we go next? I suspect that 2.6.23-rc9 has significant changes since -rc3, but I'd like to confirm that before kicking off another (expensive) run. Please, tell me what useful kernels are to test. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
On Thu, Sep 27, 2007 at 07:19:27PM -0400, Theodore Tso wrote: > Would you accept a patch which causes the deprecated sysfs > files/directories to disappear, even if CONFIG_SYS_DEPRECATED is > defined, via a boot-time parameter? How about a mount option? That way people can test without a reboot: mount -o remount,deprecated={yes,no} /sys -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
On Thu, Sep 27, 2007 at 02:37:42PM -0400, Theodore Tso wrote: > I'm reminded of Rusty's 2003 OLS Keynote, where he points out that > what's important is not making an interface easy to use, but _hard_ > _to_ _misuse_. That fact that sysfs is all laid out in a directory, > but for which some directories/symlinks are OK to use, and some are > NOT OK to use --- is why I call the sysfs interface "an open pit". > Sure, if you have the map to the minefield, a minefield is perfectly > safe when you know what to avoid. But is that the best way to > construct a path/interface for an application programmer to get from > point A to point B? Maybe, maybe not. I agree with your criticism of sysfs (indeed, I think I've made many of them before in somewhat stronger language), but do you have a suggestion as to an interface that would be harder to misuse? -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] VFS: new fgetattr() file operation
On Mon, Sep 24, 2007 at 03:06:06PM +0200, Miklos Szeredi wrote: > A file isn't deleted while there are still links or open files > refering to it. So getting the attributes for a file with nlink==0 is > perfectly valid while the file is still open. Is it? Why not just pretend that the attributes are wiped when the file is deleted. Effectively, they are, since they can't affect anything. > If a network filesystem protocol can't handle operations (be it data > or metadata) on an unlinked file, we must do sillirenaming, so that > the file is not actually unlinked. Or you could call getattr right before you unlink and cache the result in the client. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] VFS: new fgetattr() file operation
On Mon, Sep 24, 2007 at 02:48:08PM +0200, Miklos Szeredi wrote: > > and if that means adding silly rename support so be it. > > That's what is done currently. > > But it's has various dawbacks, like rmdir doesn't work if there are > open files within an otherwise empty directory. > > I'd happily accept suggestions on how to deal with this differenty. Only sillyrename files with nlink > 1? I don't see how attributes can change anything for a deleted file. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] avoid clobbering registers with J_ASSERT macro
On Mon, Aug 20, 2007 at 04:22:04PM +0100, Stephen C. Tweedie wrote: > Hi, > > On Mon, 2007-08-20 at 09:18 -0400, Chris Snook wrote: > > > > How's about we just remove that printk? Do > > > > > > #define J_ASSERT(e) BUG_ON(e)? ITYM #define J_ASSERT(e) BUG_ON(!e) > It did. The original J_ASSERT predates BUG() entirely, and was added so > that we got the file/line-no information. But with the current BUG() > macro, I can't see any reason for J_ASSERT still to try to gather that > information itself. Do you still want to keep J_ASSERT, or should all uses of it be replaced with BUG_ON? (to put it another way; if you were writing JBD now, would you add your own J_ASSERT, or would you just use BUG_ON directly?) -- "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] VFS: data=ordered (was: [Advocacy] Re: 3ware 9650 tips)
On Mon, Jul 16, 2007 at 09:28:08PM +0300, Al Boldi wrote: > Well, conceptually it sounds like a piece of cake, technically your guess is > as good as mine. IIRC, akpm once mentioned something like this. How much have you looked at the VFS? There's nothing journalling-related in the VFS right now. ext3 and XFS share no common journalling code, nor do I think that would be possible, due to the very different concepts they have of journalling. Here's a good hint: $ find fs -type f |xargs grep -l journal_start fs/ext3/acl.c fs/ext3/inode.c fs/ext3/ioctl.c fs/ext3/namei.c fs/ext3/resize.c fs/ext3/super.c fs/ext3/xattr.c fs/ext4/acl.c fs/ext4/extents.c fs/ext4/inode.c fs/ext4/ioctl.c fs/ext4/namei.c fs/ext4/resize.c fs/ext4/super.c fs/ext4/xattr.c fs/jbd/journal.c fs/jbd/transaction.c fs/jbd2/journal.c fs/jbd2/transaction.c fs/ocfs2/journal.c fs/ocfs2/super.c JBD and JBD2 provide a journalling implementation that ext3, ext4 and ocfs2 use. Note that XFS doesn't, it has its own journalling code. If you want XFS to support data=ordered, talk to the XFS folks. Or start picking through XFS yourself, of course -- you do have the source code. -- "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Advocacy] Re: 3ware 9650 tips
On Mon, Jul 16, 2007 at 08:40:00PM +0300, Al Boldi wrote: > XFS surely rocks, but it's missing one critical component: data=ordered > And that's one component that's just too critical to overlook for an > enterprise environment that is built on data-integrity over performance. > > So that's the secret why people still use ext3, and XFS' reliance on external > hardware to ensure integrity is really misplaced. > > Now, maybe when we get the data=ordered onto the VFS level, then maybe XFS > may become viable for the enterprise, and ext3 may cease to be KING. Wow, thanks for bringing an advocacy thread onto linux-fsdevel. Just what we wanted. Do you have any insight into how to "get the data=ordered onto the VFS level"? Because to me, that sounds like pure nonsense. -- "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] util-linux-ng 2.13-rc1
On Thu, Jul 05, 2007 at 11:30:20PM +0200, Karel Zak wrote: > > > > The package build system is now based on autotools. The build system > > > > supports separate CFLAGS and LDFLAGS for suid programs (SUID_CFLAGS, > > > > SUID_LDFLAGS). For more details see the README file > > > > > > And this is really dumb. autotools is a completely pain in the ass and > > Well, Adrian Bunk added autotools stuff to util-linux during his work > on v2.13. This stuff has been fixed and stabilized in util-linux-ng > v2.13. > > I'm not fanatical autotools protagonist, but it seems useful in > util-linux. We will see... > > I'm ready to change or fix arbitrary thing in util-linux-ng, but I > always need a real reason -- bug report, new feature, or so. This > discussion is about impressions and feelings only. No, it's based on long, hard and painful experiences attempting to debug the fuckups that autoconf creates when it goes wrong. -- "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Versioning file system
On Tue, Jun 19, 2007 at 04:34:42PM -0400, John Stoffel wrote: > > "Jack" == Jack Stone <[EMAIL PROTECTED]> writes: > > Jack> The whole idea of the file system is that it wouldn't return the > Jack> file in the file listing. The user would have to know that the > Jack> file system was versioning to access the older versions as they > Jack> would explicitly have to request them. > > So tell me what happens when I do: > > touch foo:121212121212 You create a new file called 'foo:121212121212'. If you then modify it, you could access the old version as foo:121212121212:0. (The .snapshot directory makes more sense than magic names, since you can see what versions of a file are available without a special tool). - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Read/write counts
On Mon, Jun 04, 2007 at 09:56:07AM -0700, Bryan Henderson wrote: > Programs that assume a full transfer are fairly common, but are > universally regarded as either broken or just lazy, and when it does cause > a problem, it is far more common to fix the application than the kernel. Linus has explicitly forbidden short reads from being returned. The original poster may get away with it for a specialised case, but for example, signals may not cause a return to userspace with a short read for exactly this reason. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NFS] [PATCH] locks: provide a file lease method enabling cluster-coherent leases
On Fri, Jun 01, 2007 at 12:44:16PM -0400, J. Bruce Fields wrote: > The only problem I'm aware of is that leases aren't broken on rename, > link, and unlink. This is kind of tricky to fix. David Richter (cc'd) > and I sketched out a few different approaches, and I think he has some > patches implementing at least one of them. > > This may require defining some new type of lease, to avoid changing the > documented behavior of the fcntl lease operations (which only break on > open). Although actually I believe Samba needs the same behavior we do, > and they're probably the most important user of leases Samba internally prohibits renaming or deleting an open file, to match Windows semantics. So it won't notice the difference. At least, that's what I remember from a discussion with Tridge when we were implementing leases back in 2000. I think it's an acceptable change in Linux semantics to break leases on rename/delete/link. Though I'm not quite sure why you need to -- nobody else is touching the contents of the file, so it's not like you need to write the data back to it, or discard your cached copy of it in the case of a read-only lease. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create
On Mon, Apr 16, 2007 at 06:11:30PM +0200, Andreas Gruenbacher wrote: > +static inline int > +nfsd_do_create(struct inode *dir, struct dentry *child, struct vfsmount *mnt, > +int mode) > +{ > + struct nameidata nd = { > + .dentry = child, > + .mnt = mnt, > + }; > + > + return vfs_create(dir, child, mode, &nd); > +} > + Wouldn't it normally result in fewer instructions (on most architectures ... maybe not x86) to keep the same argument order as vfs_create? ie: static inline int nfsd_do_create(struct inode *dir, struct dentry *child, int mode, struct vfsmount *mnt) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: REISER4 FOR INCLUSION IN THE LINUX KERNEL.
On Sun, Apr 08, 2007 at 09:16:59PM -0700, [EMAIL PROTECTED] wrote: > REISER4 FOR INCLUSION IN THE LINUX KERNEL. Fuck off. Cheerleading like this only hurts your cause. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: forced umount?
On Sun, Mar 18, 2007 at 08:16:19PM +0100, Arjan van de Ven wrote: > the problem with the people who say they want forced umount is.. that > most of the time they either want > 1) get rid of the namespace entry > or > 2) want to stop any and all IO to a certain device/partition There is a third component - they want to deliver a fatal signal to all processes which are waiting on IO to that sb. My scenario here is a machine with an NFS mount of a server which has gone down. Cronjobs which scan the whole filesystem (eg updatedb) soon pile up sleeping on access. Equally, if one has one's ogg collection stored on said NFS server, the ogg player will be in uninterruptible sleep while holding the sound device open, preventing other applications from making sounds. It's desirable to be able to kill these apps dead, and the usual suggestion of 'mount it soft,intr' isn't the greatest idea (and somewhat hard to change after the fact). I remember Linus suggesting a sleeping state between UNINTERRUPTIBLE and INTERRUPTIBLE which would be FATAL_SIGNALS_ONLY. The usual problem (of short reads) isn't a problem if the task is only going to die when it receives them. Has anyone investigated this in any detail? Perhaps I'll take a look at doing it next week. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] sys_fallocate() system call
On Fri, Mar 16, 2007 at 05:17:04PM +0100, Heiko Carstens wrote: > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > e.g. > > asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len, int mode) > > would work even on s390 ;) How about: asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high, u32 len_low, u32 len_high); That way we all suffer equally ... - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] sys_fallocate() system call
On Sat, Mar 17, 2007 at 08:59:05PM +1100, Paul Mackerras wrote: > ... but wouldn't work on 32-bit powerpc. :( We would end up with a > pad argument between fd and offset, giving 7 arguments in all > (counting the loff_t's as 2), but we only support 6. Ditto mips and parisc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Why should we teach students Linux??
On Tue, Mar 06, 2007 at 06:43:34PM +0100, Roel Bindels wrote: > Ok, but don't you guys have some fact I can build a case with?? I don't > think that my college's will take me serious if I come with this store, > even it is true ;) You are crossposting far too widely to some wildly inappropriate lists for this subject. Try to find an advocacy group, or a local user group. They can be of more help in this. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Rename file over another with the same inode number fails silently.
On Thu, Feb 08, 2007 at 11:48:16AM -0500, John Muir wrote: > The attached test program creates a file, and then some hard links to > that file (file0 - fileN). > The test program then attempts to rename(fileN, file) for every hard > link created. > > My expectation is that the hard links file0 - fileN would simply > disappear, or that rename would respond with an error result and an > appropriate errno value indicating the problem. > > My observation is that the hard links file0 to fileN do not in fact > disappear and rename returns 0. > > Do I have the wrong expectations? If so, why should I have to stat the > files to determine if they are the same inode before I rename? This is a POSIX requirement. See http://www.opengroup.org/onlinepubs/009695399/functions/rename.html where it says, "If the old argument and the new argument resolve to the same existing file, rename() shall return successfully and perform no other action." There's not really much point in discussing whether your expectations are wrong or whether the extra work you have to do is silly ... POSIX says so, so we have to behave this way. Personally, I think this is an unexpected oversight on POSIX's part. The rationale refers to 'rename("x", "x");', and doesn't discuss hard links. Possibly something to raise with the POSIX ctte, but I doubt they'd be interested in changing this now ... perhaps if you can find some other unices which behave differently, you might have a shot. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Symbolic links vs hard links
On Wed, Jan 10, 2007 at 09:38:11AM -0800, Bryan Henderson wrote: > >Other people are of the opinion that the invention of the symbolic link > >was a huge mistake. > > I guess I haven't heard that one. What is the argument that we were > better off without symbolic links? I suppose http://www.cs.bell-labs.com/sys/doc/lexnames.html is as good a presentation of that argument as any, though there's also some good stuff in the Unix Haters Handbook (pages 164-165 in particular). - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Finding hardlinks
On Tue, Jan 09, 2007 at 03:43:14PM -0800, Bryan Henderson wrote: > >but you can get a large number of >1 linked > >files, when you copy full directories with "cp -rl". Which I do a lot > >when developing. I've done that a few times with the Linux tree. > > Can you shed some light on how you use this technique? (I.e. what does it > do for you?) A couple of weeks ago, a colleague asked me to generate a patch against his tree in a hurry. I did cp -rl his-tree my-tree (which completed quickly), edited the two files that needed to be patched, then did diff -urp his-tree my-tree, which also completed quickly, as diff knows that if two files have the same inode, they don't need to be opened. > Many people are of the opinion that since the invention of symbolic links, > multiple hard links to files have been more trouble than they're worth. I Other people are of the opinion that the invention of the symbolic link was a huge mistake. Po-tato, pot-ato ... - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext2: conditional removal of NFSD code
On Sat, Jan 06, 2007 at 10:58:31PM +0300, Alexey Dobriyan wrote: > Nor me nor my box is going to act as NFS server, so ifdef all > exporting code. > @@ -916,7 +918,9 @@ static int ext2_fill_super(struct super_ >* set up enough so that it can read an inode >*/ > sb->s_op = &ext2_sops; > +#if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE) > sb->s_export_op = &ext2_export_ops; > +#endif To avoid putting ifdefs within a function, how about adding: #if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE) #define set_export_ops(sb, ops) sb->s_export_op = ops #else #define set_export_ops(sb, ops) 0 #endif That way you can get rid of the function pointer from the struct superblock too. But Dave Woodhouse is going to kill you for adding another CONFIG_*_MODULE dependency. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is a NULL check missing in nfs_lookup?
On Fri, Jan 05, 2007 at 02:10:06PM -0500, Erez Zadok wrote: > Ah, ok. So why not put an ASSERT in there, or at least a comment, to make > the code clearer. As it stands, anyone looking at the code in the future > can easily rediscover this "bug" that dereferences a null ptr. Because anyone poking in the VFS should take the time to understand how it works? Adding crap like BUG_ON(!nd) is pointless -- you don't get a clearer backtrace from that than you do from a null pointer. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is a NULL check missing in nfs_lookup?
On Fri, Jan 05, 2007 at 12:11:17PM -0500, Shaya Potter wrote: > I guess the question is, why shouldn't a dentry object know what > vfsmount it belongs to is? Can it belong to multiple vfsmount objects? > (Again, probably showing my igornance here). If you have the same tree mounted in two places, they'll have different vfsmounts. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Finding hardlinks
On Wed, Jan 03, 2007 at 01:33:31PM +0100, Miklos Szeredi wrote: > High probability is all you have. Cosmic radiation hitting your > computer will more likly cause problems, than colliding 64bit inode > numbers ;) Some of us have machines designed to cope with cosmic rays, and would be unimpressed with a decrease in reliability. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: inode i_blksize problem
On Wed, Dec 20, 2006 at 01:26:56PM +0100, Sergio Paracuellos wrote: > I am trying to compile a module for kernel 2.6.18-1 that uses the 'inode > struct' but the compiler tell me inode struct hasn't a member called > "i_blksize". I don't have that problem in kernel 2.6.16. What happend > with i_blksize? git-log reveals: commit ba52de123d454b57369f291348266d86f4b35070 Author: Theodore Ts'o <[EMAIL PROTECTED]> Date: Wed Sep 27 01:50:49 2006 -0700 [PATCH] inode-diet: Eliminate i_blksize from the inode structure This eliminates the i_blksize field from struct inode. Filesystems that wan t to provide a per-inode st_blksize can do so by providing their own getattr routine instead of using the generic_fillattr() function. Note that some filesystems were providing pretty much random (and incorrect) values for i_blksize. [EMAIL PROTECTED]: cleanup] [EMAIL PROTECTED]: generic_fillattr() fix] Signed-off-by: "Theodore Ts'o" <[EMAIL PROTECTED]> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]> So, it's gone. I'd just delete that line from your sources if I were you. By the way, what kind of module is this? Whatever it's doing looks pretty dodgy to me. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Sun, Dec 17, 2006 at 11:07:27AM -0800, Ulrich Drepper wrote: > And how often do the scripts which are in everyday use require such a > command? And the same for the other programs. I know that the rsync load is a major factor on kernel.org right now. With all the git trees (particularly the ones that people haven't packed recently), there's a lot of files in a lot of directories. If readdirplus would help this situation, it would definitely have a real world benefit. Obviously, I haven't done any measurements or attempted to quantify what the improvement would be. For those not familiar with a git repo, it has an 'objects' directory with 256 directories named 00 to ff. Each of those directories can contain many files (with names like '8cd5bbfb4763322837cd1f7c621f02ebe22fef') Once a file is written, it is never modified, so all rsync needs to do is be able to compare the timestamps and sizes and notice they haven't changed. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: openg and path_to_handle
On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote: > I don't think that I understand what you're saying here. The openg() > call does not perform file open (not that that is necessarily even a > first-class FS operation), it simply does the lookup. > > When we were naming these calls, from a POSIX consistency perspective it > seemed best to keep the "open" nomenclature. That seems to be confusing > to some. Perhaps we should rename the function "lookup" or something > similar, to help keep from giving the wrong idea? > > There is a difference between the openg() and path_to_handle() approach > in that we do permission checking at openg(), and that does have > implications on how the handle might be stored and such. That's being > discussed in a separate thread. I was just thinking about how one might implement this, when it struck me ... how much more efficient is a kernel implementation compared to: int openg(const char *path) { char *s; do { s = tempnam(FSROOT, ".sutoc"); link(path, s); } while (errno == EEXIST); mpi_broadcast(s); sleep(10); unlink(s); } and sutoc() becomes simply open(). Now you have a name that's quick to open (if a client has the filesystem mounted, it has a handle for the root already), has a defined lifespan, has minimal permission checking, and doesn't require standardisation. I suppose some cluster fs' might not support cross-directory links (AFS is one, I think), but then, no cluster fs's support openg/sutoc. If a filesystem's willing to add support for these handles, it shouldn't be too hard for them to treat files starting ".sutoc" specially, and as efficiently as adding the openg/sutoc concept. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote: > Considering that filesystems like GFS and OCFS allow clients DIRECT > ACCESS to the block device itself (which no amount of authentication > will fix, unless it is in the disks themselves), the risk of passing a > file handle around is pretty minimal. That's either disingenuous, or missing the point. OCFS/GFS allow the kernel direct access to the block device. openg()&sutoc() are about passing around file handles to untrusted users. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: openg and path_to_handle
On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: > Permission checks are done on the path_to_handle(), so in reality > only root or CAP_SYS_ADMIN users can currently use the > open_by_handle interface because of this lack of checking. Given > that our current users of this interface need root permissions to do > other things (data migration), this has never been an issue. > > This is an implementation detail - it is possible that file handle, > being opaque, could encode a UID/GID of the user that constructed > the handle and then allow any process with the same UID/GID to use > open_by_handle() on that handle. (I think hch has already pointed > this out.) While it could do that, I'd be interested to see how you'd construct the handle such that it's immune to a malicious user tampering with it, or saving it across a reboot, or constructing one from scratch. I suspect any real answer to this would have to involve cryptographical techniques (say, creating a secure hash of the information plus a boot-time generated nonce). Now you're starting to use a lot of bits, and compute time, and you'll need to be sure to keep the nonce secret. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > David Chinner wrote: > >Does anyone here know about the XFS libhandle API? This has been > >around for years and it does _exactly_ what these proposed syscalls > >are supposed to do (and more). > > Thanks for pointing these out Dave. These are indeed along the same > lines as the openg()/openfh() approach. > > One difference is that they appear to perform permission checking on the > open_by_handle(), which means that the entire path needs to be encoded > in the handle, and makes it difficult to eliminate the path traversal > overhead on N open_by_handle() operations. Another (and highly important) difference is that usage is restricted to root: xfs_open_by_handle(...) ... if (!capable(CAP_SYS_ADMIN)) return -XFS_ERROR(EPERM); - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote: > The openg() solution has the following advantages to what you propose. > First, it places the burden of the communication of the file handle on > the application process, not the file system. That means less work for > the file system. Second, it does not require that clients respond to > unexpected network traffic. Third, the network traffic is deterministic > -- one client interacts with the file system and then explicitly > performs the broadcast. Fourth, it does not require that the file system > store additional state on clients. You didn't address the disadvantages I pointed out on December 1st in a mail to the posix mailing list: : I now understand this not so much as a replacement for dup() but in : terms of being able to open by NFS filehandle, or inode number. The : fh_t is presumably generated by the underlying cluster filesystem, and : is a handle that has meaning on all nodes that are members of the : cluster. : : I think we need to consider security issues (that have also come up : when open-by-inode-number was proposed). For example, how long is the : fh_t intended to be valid for? Forever? Until the cluster is rebooted? : Could the fh_t be used by any user, or only those with credentials to : access the file? What happens if we revoke() the original fd? : : I'm a little concerned about the generation of a suitable fh_t. : In the implementation of sutoc(), how does the kernel know which : filesystem to ask to translate it? It's not impossible (though it is : implausible) that an fh_t could be meaningful to more than one : filesystem. : : One possibility of fixing this could be to use a magic number at the : beginning of the fh_t to distinguish which filesystem this belongs : to (a list of currently-used magic numbers in Linux can be found at : http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h) Christoph has also touched on some of these points, and added some I missed. > In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing > the flag) would likely cause a storm of network traffic if clients were > closely synchronized (which they are likely to be). I think you're referring to a naive application, rather than a naive cluster filesystem, right? There's several ways to fix that problem, including throttling broadcasts of information, having nodes ask their immediate neighbours if they have a cache of the information, and having the server not respond (wait for a retransmit) if it's recently sent out a broadcast. > However, the application change issue is actually moot; we will make > whatever changes inside our MPI-IO implementation, and many users will > get the benefits for free. That's good. > The readdirplus(), readx()/writex(), and openg()/openfh() were all > designed to allow our applications to explain exactly what they wanted > and to allow for explicit communication. I understand that there is a > tendency toward solutions where the FS guesses what the app is going to > do or is passed a hint (e.g. fadvise) about what is going to happen, > because these things don't require interface changes. But these > solutions just aren't as effective as actually spelling out what the > application wants. Sure, but I think you're emphasising "these interfaces let us get our job done" over the legitimate concerns that we have. I haven't really looked at the readdirplus() or readx()/writex() interfaces, but the security problems with openg() makes me think you haven't really looked at it from the "what could go wrong" perspective. I'd be interested in reviewing the readx()/writex() interfaces, but still don't see a document for them anywhere. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Re: Re: NFSv4/pNFS possible POSIX I/O API standards
On Tue, Dec 05, 2006 at 06:09:03PM +0100, Latchesar Ionkov wrote: > It could be wasteful, but it could (most likely) also be useful. Name > resolution is not that expensive on either side of the network. The > latency introduced by the single-name lookups is :) *is* latency the problem here? Last I heard, it was the intolerable load placed on the DLM by having clients bounce the read locks for each directory element all over the cluster. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Re: NFSv4/pNFS possible POSIX I/O API standards
On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: > I think that the main problem is that all these file systems resove a > path name, one directory at a time bringing the server to its knees by > the huge amount of requests. I would like to see what the performance > is if you a) cache the last few hundred lookups on the server side, > and b) modify VFS and the file systems to support multi-name lookups. > Just assume for a moment that there is no any way to get these new > operations in (which is probaly going to be true anyway :). What other > solutions can you think of? :) How exactly would you want a multi-name lookup to work? Are you saying that open("/usr/share/misc/pci.ids") should ask the server "Find usr, if you find it, find share, if you find it, find misc, if you find it, find pci.ids"? That would be potentially very wasteful; consider mount points, symlinks and other such effects on the namespace. You could ask the server to do a lot of work which you then discard ... and that's not efficient. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Tue, Dec 05, 2006 at 10:07:48AM +, Christoph Hellwig wrote: > The filehandle idiocy on the other hand is way of into crackpipe land. Right, and it needs to be discarded. Of course, there was a real problem that it addressed, so we need to come up with an acceptable alternative. The scenario is a cluster-wide application doing simultaneous opens of the same file. So thousands of nodes all hitting the same DLM locks (for read) all at once. The openg() non-solution implies that all nodes in the cluster share the same filehandle space, so I think a reasonable solution can be implemented entirely within the clusterfs, with an extra flag to open(), say O_CLUSTER_WIDE. When the clusterfs sees this flag set (in ->lookup), it can treat it as a hint that this pathname component is likely to be opened again on other nodes and broadcast that fact to the other nodes within the cluster. Other nodes on seeing that hint (which could be structured as "The child "bin" of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has filehandle ff51a98799931256b555446b2f5675db08de6229") can keep a record of that fact. When they see their own open, they can populate the path to that file without asking the server for extra metadata. There's obviously security issues there (why I say 'hint' rather than 'command'), but there's also security problems with open-by-filehandle. Note that this solution requires no syscall changes, no application changes, and also helps a scenario where each node opens a different file in the same directory. I've never worked on a clusterfs, so there may be some gotchas (eg, how do you invalidate the caches of nodes when you do a rename). But this has to be preferable to open-by-fh. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, Nov 29, 2006 at 05:23:13AM -0700, Matthew Wilcox wrote: > On Wed, Nov 29, 2006 at 09:04:50AM +, Christoph Hellwig wrote: > > - openg/sutoc > > > > No way. We already have a very nice file descriptor abstraction. > > You can pass file descriptors over unix sockets just fine. > > Yes, but it behaves like dup(). Gary replied to me off-list (which I > didn't notice and continued replying to him off-list). I wrote: > > Is this for people who don't know about dup(), or do they need > independent file offsets? If the latter, I think an xdup() would be > preferable (would there be a security issue for OSes with revoke()?) > Either that, or make the key be useful for something else. I further wonder if these people would see appreciable gains from doing sutoc rather than doing openat(dirfd, "basename", flags, mode); If they did, I could also see openat being extended to allow dirfd to be a file fd, as long as pathname were NULL or a pointer to NUL. But with all the readx stuff being proposed, I bet they don't really need independent file offsets. That's, like, so *1970*s. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, Nov 29, 2006 at 09:04:50AM +, Christoph Hellwig wrote: > - openg/sutoc > > No way. We already have a very nice file descriptor abstraction. > You can pass file descriptors over unix sockets just fine. Yes, but it behaves like dup(). Gary replied to me off-list (which I didn't notice and continued replying to him off-list). I wrote: Is this for people who don't know about dup(), or do they need independent file offsets? If the latter, I think an xdup() would be preferable (would there be a security issue for OSes with revoke()?) Either that, or make the key be useful for something else. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Mon, Nov 27, 2006 at 09:34:05PM -0700, Gary Grider wrote: > >Things like > >openg() - on process opens a file and gets a key that is passed to > >lots of processes which > >use the key to get a handle (great for thousands of processes opening a > >file) I don't understand how this leads to a more efficient implementation. It seem to just add complexity. What does 'sutoc' mean anyway? > >readx/writex - scattergather readwrite - more appropriate and > >complete than the real time extended read/write These don't seem to be documented on the website. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via pointer conversion)
On Fri, Nov 17, 2006 at 08:43:00AM -0500, Jeff Layton wrote: > 2) this scheme would effectively leak inode addresses into userspace. > I'm not sure if that would be exploitable, but it's probably best not to > do it. The patch adds a static unsigned int that is initialized to a > random value at boot time. We'll xor the inode offset with this value. > That should allow for a unique i_ino value, but since the xor mask would > be secret, it shouldn't be possible to turn it back into an address. > There may be a more secure way to do this. I'm definitely open to > suggestions here. I *think* the xor mask is mere obfuscation. It looks likely that you can recover it with a little bit of trial and error. If you can force the filesystem to hand you back new inodes quickly such that there is a high probability you get consecutive allocations, you'll get a sequence which would be spaced 700-odd bytes apart, except that it's been xored. Since you know it's incrementing, if you see the sequence decrease, you'll know that was a 1 in that bit. It'd be a bit more complex than that, and cryptanalysis was never my forte, but I suspect we should either use a folded hash like md5, or give up and just divide the address by sizeof(struct inode). Sure, divides are slow, but this is a divide by a constant, so it shouldn't be that bad. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via pointer conversion)
On Fri, Nov 17, 2006 at 08:43:00AM -0500, Jeff Layton wrote: > +/* convert an inode address into an unsigned int and xor it with a random > value > + * determined at boot time */ > +static inline unsigned int inode_to_uint (struct inode *inode) > +{ > + return unsigned long) (inode - (struct inode *) 0)) > + ^ inode_xor_mask) & 0x); > +} Seems a little obfuscated. Why not simply: return ((unsigned long)inode ^ inode_xor_mask) & 0x; > @@ -125,7 +135,6 @@ static struct inode *alloc_inode(struct > inode->i_size = 0; > inode->i_blocks = 0; > inode->i_bytes = 0; > - inode->i_generation = 0; > #ifdef CONFIG_QUOTA > memset(&inode->i_dquot, 0, sizeof(inode->i_dquot)); > #endif It seems to me that filesystems with fake inodes could instead initialise i_generation to, say, jiffies. What do you think to that? I like this idea, very creative. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CVE-2006-3468
On Thu, Nov 16, 2006 at 10:21:32PM +0100, Majkls wrote: > Hello, > I would like to solve this bug. Have anybody some backtrace or some > working exploit? Sure you have the right CVE? That bug seems to have been fixed long ago: http://git.parisc-linux.org/?p=linux-2.6.git;a=commit;h=2ccb48ebb4de139eef4fcefd5f2bb823cb0d81b9 - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Make journal_commit_transaction() more understandable
journal_commit_transaction() is still 650+ lines long and contains 16 local variables. By moving phase 3 into its own function, we reduce its length by 150+ lines and reduce it to 5 local variables. Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]> commit.c | 367 +-- 1 files changed, 194 insertions(+), 173 deletions(-) Index: linux-2.6/fs/jbd/commit.c === RCS file: /var/lib/cvs/linux-2.6/fs/jbd/commit.c,v retrieving revision 1.14 diff -u -p -r1.14 commit.c --- linux-2.6/fs/jbd/commit.c 4 Apr 2005 20:47:15 - 1.14 +++ linux-2.6/fs/jbd/commit.c 29 Aug 2005 15:59:02 - @@ -93,6 +93,189 @@ static int inverted_lock(journal_t *jour return 1; } +/* + * Way to go: we have now written out all of the data for a transaction! + * Now comes the tricky part: we need to write out metadata. Loop over the + * transaction's entire buffer list: + */ +static int journal_write_metadata(journal_t *journal, + transaction_t *commit_transaction) +{ + char *tagp = NULL; + struct journal_head *descriptor = NULL; + int flags; + unsigned long blocknr; + journal_header_t *header; + journal_block_tag_t *tag = NULL; + int space_left = 0; + int first_tag = 0; + int tag_flag; + int i; + int err = 0; + int bufs = 0; + + struct buffer_head **wbuf = journal->j_wbuf; + + commit_transaction->t_state = T_COMMIT; + + while (commit_transaction->t_buffers) { + + /* Find the next buffer to be journaled... */ + + struct journal_head *jh, *new_jh; + jh = commit_transaction->t_buffers; + + /* If we're in abort mode, we just un-journal the buffer and + release it for background writing. */ + + if (is_journal_aborted(journal)) { + JBUFFER_TRACE(jh, "journal is aborting: refile"); + journal_refile_buffer(journal, jh); + /* If that was the last one, we need to clean up +* any descriptor buffers which may have been +* already allocated, even if we are now +* aborting. */ + if (!commit_transaction->t_buffers) + goto start_journal_io; + continue; + } + + /* Make sure we have a descriptor block in which to + record the metadata buffer. */ + + if (!descriptor) { + struct buffer_head *bh; + + J_ASSERT (bufs == 0); + + jbd_debug(4, "JBD: get descriptor\n"); + + descriptor = journal_get_descriptor_buffer(journal); + if (!descriptor) { + __journal_abort_hard(journal); + continue; + } + + bh = jh2bh(descriptor); + jbd_debug(4, "JBD: got buffer %llu (%p)\n", + (unsigned long long)bh->b_blocknr, bh->b_data); + header = (journal_header_t *)&bh->b_data[0]; + header->h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); + header->h_blocktype = cpu_to_be32(JFS_DESCRIPTOR_BLOCK); + header->h_sequence = cpu_to_be32(commit_transaction->t_tid); + + tagp = &bh->b_data[sizeof(journal_header_t)]; + space_left = bh->b_size - sizeof(journal_header_t); + first_tag = 1; + set_buffer_jwrite(bh); + set_buffer_dirty(bh); + wbuf[bufs++] = bh; + + /* Record it so that we can wait for IO + completion later */ + BUFFER_TRACE(bh, "ph3: file as descriptor"); + journal_file_buffer(descriptor, commit_transaction, + BJ_LogCtl); + } + + /* Where is the buffer to be written? */ + + err = journal_next_log_block(journal, &blocknr); + /* If the block mapping failed, just abandon the buffer + and repeat this loop: we'll fall into the + refile-on-abort condition above. */ + if (err) { + __journal_abort_hard(journal); + continue; + } + + /* +* start_this_handle() uses t_outstanding_credits to determine +* the free space in the log, but this counter is
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, Aug 19, 2005 at 08:38:34PM +0100, Al Viro wrote: > FWIW, I'd rather take page_symlink(), page_symlink_inode_operations, > page_put_link(), page_follow_link_light(), page_readlink(), page_getlink(), > generic_readlink() and vfs_readlink() to the same place where these guys > would live. They all belong together and none of them has any business > in fs/namei.c. Options: fs/libfs.c or separate library since fs/libfs.c > is getting crowded. Linus, do you have any objections to that or suggestions > on filename here? fs/symlink.c? -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero
On Fri, Apr 15, 2005 at 10:03:05PM +1000, Herbert Xu wrote: > I suppose it could be smart and stay quiet about > > val < 0 || val > BOUND > > However, gcc is slow enough as it is without adding unnecessary > smarts like this. It only warns with -W on, not with -Wall, so I see no compelling reason to fix this. I think the real problem here is that 'arg' is declared 2 pages earlier in the function prototype (aka the function-growth-hormone-imbalance syndrome). There's two good ways of fixing this, adding a f_setsig() function: static inline int f_setsig(struct file *filp, unsigned long arg) { if (arg > _NSIG) return -EINVAL; filp->f_owner.signum = arg; return 0; } ... case F_SETSIG: err = f_setsig(filp, arg); break; or add a function that checks a variable to see if it's a valid signal number: #define valid_signal(arg) ((unsigned long)arg <= _NSIG) ... case F_SETSIG: if (!valid_signal(arg)) break; err = 0; filp->f_owner.signum = arg; break; Looks like futex.c, ptrace.c, signal.c, sys.c and almost every architecture's ptrace code could easily make use of the latter, but not the former. It also looks like we have a few off-by-one errors. For example, in h8300's ptrace code: case PTRACE_SYSCALL: case PTRACE_CONT: { ret = -EIO; if ((unsigned long) data >= _NSIG) break ; but case PTRACE_SINGLESTEP: { ret = -EIO; if ((unsigned long) data > _NSIG) break; so I'd recommend the second solution. But be careful not to "fix up" cases like: ./kernel/exit.c:if (sig < 1 || sig > _NSIG) where we really don't want to allow zero. -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero
On Fri, Apr 15, 2005 at 09:21:50AM +0100, Christoph Hellwig wrote: > On Fri, Apr 15, 2005 at 02:31:00AM +0100, Matthew Wilcox wrote: > > On Fri, Apr 15, 2005 at 03:07:42AM +0200, Jesper Juhl wrote: > > > 'arg' is unsigned so it can never be less than zero, so testing for that > > > is pointless and also generates a warning when building with gcc -W. This > > > patch eliminates the pointless check. > > > > Didn't Linus already reject this one 6 months ago? > > I think Linux only complained if we're using some typedef that actually > may be signed. For fcntl that 'arg' argument is unsigned and that's hardcoded > in the ABI. So the check doesn't make sense at all. No, it was exactly this patch: http://www.ussg.iu.edu/hypermail/linux/kernel/0401.0/1816.html -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero
On Fri, Apr 15, 2005 at 03:07:42AM +0200, Jesper Juhl wrote: > 'arg' is unsigned so it can never be less than zero, so testing for that > is pointless and also generates a warning when building with gcc -W. This > patch eliminates the pointless check. Didn't Linus already reject this one 6 months ago? -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mmap question
On Mon, Mar 21, 2005 at 01:33:55PM -0800, Bryan Henderson wrote: > It looks to me like you're running into the fundamental limitation that > the CPU doesn't notify Linux every time you store into a memory location. > It does, though, set the dirty flag in the page table, and Linux > eventually inspects that flag and finds out that you have stored in the > past. At that time, it can call set_page_dirty. well, we *could* know ... never map this page writable. have a per-vma flag that says "emulate writes", and call the filesystem to update backing storage before returning to the application. the price of doing this would be exorbitant. -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Efficient handling of sparse files
This problem came up with the systemimager program which uses rsync to install files from a master server to many clients. Red Hat has a system user with uid 2^32-1 which causes lastlog to grow to 1.2GB in size. rsync does understand the concept of sparse files (with the -S flag), but it has to read every block to discover that it is indeed empty. This sucks. I was wondering if we could introduce a new system call (or ioctl?) that, given an fd would find the next block with data in it. We could use the ->bmap method ... except that has dire warnings about adding new callers and viro may soon be in testicle-gouging range. One system interface hack would be to introduce lseek(fd, 0, SEEK_DATA) ... but without permission to reuse ->bmap for this purpose, it's pointless to discuss user interfaces. Suggestions? -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Help!
On Fri, Feb 04, 2005 at 03:58:04PM +0530, Anirban Mukherjee wrote: > I am doing a project on Ext2fs and Ext3fs and I require some materials on > the physical structure of Ext2fs.It would be very helpful if someone sends > me some information(or links where to find it) along with some general > stuff on Ext2fs and Ext3fs. "Please do my homework for me" Did you even look in Documentation/filesystems/ext{2,3}.txt? -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
NFSD needs EXPORTFS
Got this report about 2.6.11-rc3. Is this the correct solution? - Forwarded message from Joel Soete <[EMAIL PROTECTED]> - A short analyse, it seems that's because NFSD was builtin while EXPORTFS was a module in my previous config file. Imho EXPORTFS would be build as NFSD? Is the following hunk would do the trick: --- fs/Kconfig.Orig 2005-02-03 16:45:13.562275206 +0100 +++ fs/Kconfig 2005-02-03 16:46:36.496469111 +0100 @@ -1400,6 +1400,7 @@ tristate "NFS server support" depends on INET select LOCKD + select EXPORTFS select SUNRPC help If you want your Linux box to act as an NFS *server*, so that other >< Thanks in advance, Joel - End forwarded message - -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.
On Wed, Feb 02, 2005 at 03:12:50PM +, Anton Altaparmakov wrote: I think the below loop would be clearer as a for loop ... err = 0; for (nr = 0; nr < nr_pages; nr++, start++) { if (start == lp_idx) { pages[nr] = locked_page; if (!nr) continue; lock_page(locked_page); if (!wbc) continue; if (wbc->for_reclaim) { up(&inode->i_sem); up_read(&inode->i_sb->s_umount); } /* Was the page truncated under us? */ if (page_mapping(locked_page) != mapping) { err = -ESTALE; goto err_out_locked; } } else { pages[nr] = find_lock_page(mapping, start); if (pages[nr]) continue; if (!cached_page) { cached_page = alloc_page(gfp_mask); if (unlikely(!cached_page)) goto err_out; } err = add_to_page_cache_lru(cached_page, mapping, start, gfp_mask); if (unlikely(err)) { if (err == -EEXIST) continue; goto err_out; } pages[nr] = cached_page; cached_page = NULL; } } The above fixes two bugs in the below: - if (!unlikely(cached_page)) should be if (unlikely(!cached_page)) - The -EEXIST case after add_to_page_cache_lru() would result in an infinite loop in the original as nr wasn't being incremented. > + err = nr = 0; > + while (nr < nr_pages) { > + if (start == lp_idx) { > + pages[nr] = locked_page; > + if (nr) { > + lock_page(locked_page); > + if (wbc) { > + if (wbc->for_reclaim) { > + up(&inode->i_sem); > + up_read(&inode->i_sb->s_umount); > + } > + /* Was the page truncated under us? */ > + if (page_mapping(locked_page) != > + mapping) { > + err = -ESTALE; > + goto err_out_locked; > + } > + } > + } > + } else { > + pages[nr] = find_lock_page(mapping, start); > + if (!pages[nr]) { > + if (!cached_page) { > + cached_page = alloc_page(gfp_mask); > + if (!unlikely(cached_page)) > + goto err_out; > + } > + err = add_to_page_cache_lru(cached_page, > + mapping, start, gfp_mask); > + if (unlikely(err)) { > + if (err == -EEXIST) > + continue; > + goto err_out; > + } > + pages[nr] = cached_page; > + cached_page = NULL; > + } > + } > + nr++; > + start++; > + } -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage
On Fri, Jan 28, 2005 at 02:53:41PM -0800, Bryan Henderson wrote: > >Just putting up my hand to say "yeah, us too" - we could also make > >use of that functionality, so we can grok existing XFS filesystems > >that have blocksizes larger than the page size. > > IBM Storage Tank has block size > page size and has the same problem. This > is one of several ways that Storage Tank isn't generic enough to use > generic_file_write() and generic_file_read(), so it doesn't. That's not a > terrible way to go, by the way. At some point, making the generic > interface complex enough to handle every possible filesystem becomes worse > than every filesystem driver having its own code. That's certainly true; but it might make sense to write a multipage_file_read() and multipage_file_write() that can be shared between the filesystems that do need this complexity. -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Minor ext2 speedup
Port Andreas Dilger's and Jan Kara's patch for ext3 to ext2. Also some whitespace changes to get ext2/ext3 closer in sync. Signed-off-by: Matthew Wilcox <[EMAIL PROTECTED]> Index: fs/ext2/balloc.c === RCS file: /var/cvs/linux-2.6/fs/ext2/balloc.c,v retrieving revision 1.4 diff -u -p -r1.4 balloc.c --- fs/ext2/balloc.c13 Sep 2004 15:23:44 - 1.4 +++ fs/ext2/balloc.c24 Jan 2005 22:10:01 - @@ -6,7 +6,7 @@ * Laboratoire MASI - Institut Blaise Pascal * Universite Pierre et Marie Curie (Paris VI) * - * Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993 + * Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993 * Big-endian to little-endian byte-swapping/bitmaps by *David S. Miller ([EMAIL PROTECTED]), 1995 */ @@ -52,9 +52,9 @@ struct ext2_group_desc * ext2_get_group_ return NULL; } - - group_desc = block_group / EXT2_DESC_PER_BLOCK(sb); - offset = block_group % EXT2_DESC_PER_BLOCK(sb); + + group_desc = block_group >> EXT2_DESC_PER_BLOCK_BITS(sb); + offset = block_group & (EXT2_DESC_PER_BLOCK(sb) - 1); if (!sbi->s_group_desc[group_desc]) { ext2_error (sb, "ext2_get_group_desc", "Group descriptor not loaded - " @@ -62,7 +62,7 @@ struct ext2_group_desc * ext2_get_group_ block_group, group_desc, offset); return NULL; } - + desc = (struct ext2_group_desc *) sbi->s_group_desc[group_desc]->b_data; if (bh) *bh = sbi->s_group_desc[group_desc]; @@ -236,12 +236,12 @@ do_more: for (i = 0, group_freed = 0; i < count; i++) { if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group), - bit + i, (void *) bitmap_bh->b_data)) - ext2_error (sb, "ext2_free_blocks", - "bit already cleared for block %lu", - block + i); - else + bit + i, bitmap_bh->b_data)) { + ext2_error(sb, __FUNCTION__, + "bit already cleared for block %lu", block + i); + } else { group_freed++; + } } mark_buffer_dirty(bitmap_bh); @@ -569,25 +569,24 @@ unsigned long ext2_count_free_blocks (st static inline int block_in_use(unsigned long block, struct super_block *sb, unsigned char *map) { - return ext2_test_bit ((block - le32_to_cpu(EXT2_SB(sb)->s_es->s_first_data_block)) % + return ext2_test_bit ((block - + le32_to_cpu(EXT2_SB(sb)->s_es->s_first_data_block)) % EXT2_BLOCKS_PER_GROUP(sb), map); } static inline int test_root(int a, int b) { - if (a == 0) - return 1; - while (1) { - if (a == 1) - return 1; - if (a % b) - return 0; - a = a / b; - } + int num = b; + + while (a > num) + num *= b; + return num == a; } static int ext2_group_sparse(int group) { + if (group <= 1) + return 1; return (test_root(group, 3) || test_root(group, 5) || test_root(group, 7)); } -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Minor ext3 speedup
On Thu, Jan 13, 2005 at 01:15:06PM +0100, Jan Kara wrote: > Attached patch removes unnecessary division and modulo from ext3 code > paths. It reduces (according to oprofile) the CPU usage measurably under > a dbench load (see description of the patch for the numbers). I thought I'd apply Jan & Andreas's patch to ext2 too. While doing so, I noticed several places where patches had been applied to ext2 and not to ext3. One of them was even a bugfix. This patch brings ext2/balloc.c and ext3/balloc.c closer together and fixes the bug. It includes the aforementioned patch for both ext2 and ext3. For the curious, the bug occurs when ext3_free_blocks_sb() decides to "do_more". *pdquot_freed_blocks was updated each time around the loop, so bg_free_blocks_count was getting over-incremented. I fixed it the same way it had been fixed in ext2 -- by introducing a group_freed variable that is reset each time around the loop. Index: linux-2.6/fs/ext2/balloc.c === RCS file: /var/cvs/linux-2.6/fs/ext2/balloc.c,v retrieving revision 1.4 diff -u -p -r1.4 balloc.c --- linux-2.6/fs/ext2/balloc.c 13 Sep 2004 15:23:44 - 1.4 +++ linux-2.6/fs/ext2/balloc.c 15 Jan 2005 16:45:10 - @@ -6,7 +6,7 @@ * Laboratoire MASI - Institut Blaise Pascal * Universite Pierre et Marie Curie (Paris VI) * - * Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993 + * Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993 * Big-endian to little-endian byte-swapping/bitmaps by *David S. Miller ([EMAIL PROTECTED]), 1995 */ @@ -52,9 +52,9 @@ struct ext2_group_desc * ext2_get_group_ return NULL; } - - group_desc = block_group / EXT2_DESC_PER_BLOCK(sb); - offset = block_group % EXT2_DESC_PER_BLOCK(sb); + + group_desc = block_group >> EXT2_DESC_PER_BLOCK_BITS(sb); + offset = block_group & (EXT2_DESC_PER_BLOCK(sb) - 1); if (!sbi->s_group_desc[group_desc]) { ext2_error (sb, "ext2_get_group_desc", "Group descriptor not loaded - " @@ -62,7 +62,7 @@ struct ext2_group_desc * ext2_get_group_ block_group, group_desc, offset); return NULL; } - + desc = (struct ext2_group_desc *) sbi->s_group_desc[group_desc]->b_data; if (bh) *bh = sbi->s_group_desc[group_desc]; @@ -236,12 +236,12 @@ do_more: for (i = 0, group_freed = 0; i < count; i++) { if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group), - bit + i, (void *) bitmap_bh->b_data)) - ext2_error (sb, "ext2_free_blocks", - "bit already cleared for block %lu", - block + i); - else + bit + i, bitmap_bh->b_data)) { + ext2_error(sb, __FUNCTION__, + "bit already cleared for block %lu", block + i); + } else { group_freed++; + } } mark_buffer_dirty(bitmap_bh); @@ -569,25 +569,24 @@ unsigned long ext2_count_free_blocks (st static inline int block_in_use(unsigned long block, struct super_block *sb, unsigned char *map) { - return ext2_test_bit ((block - le32_to_cpu(EXT2_SB(sb)->s_es->s_first_data_block)) % + return ext2_test_bit ((block - + le32_to_cpu(EXT2_SB(sb)->s_es->s_first_data_block)) % EXT2_BLOCKS_PER_GROUP(sb), map); } static inline int test_root(int a, int b) { - if (a == 0) - return 1; - while (1) { - if (a == 1) - return 1; - if (a % b) - return 0; - a = a / b; - } + int num = b; + + while (a > num) + num *= b; + return num == a; } static int ext2_group_sparse(int group) { + if (group <= 1) + return 1; return (test_root(group, 3) || test_root(group, 5) || test_root(group, 7)); } Index: linux-2.6/fs/ext3/balloc.c === RCS file: /var/cvs/linux-2.6/fs/ext3/balloc.c,v retrieving revision 1.9 diff -u -p -r1.9 balloc.c --- linux-2.6/fs/ext3/balloc.c 12 Jan 2005 20:17:23 - 1.9 +++ linux-2.6/fs/ext3/balloc.c 15 Jan 2005 16:45:10 - @@ -43,34 +43,34 @@ struct ext3_group_desc * ext3_get_group_ struct buffer_head ** bh) { unsigned long group_desc; - unsigned long desc; - struct ext3_group_desc * gdp; + unsigned long offset; + struct ext3_group_desc * desc; + struct ext3_sb_info *sbi = EXT3_SB(sb); - if (block_group >= EXT3_SB(sb)->s
Re: [Final call for testers][PATCH] superblock handling changes (2.4.6-pre3)
On Fri, Jun 15, 2001 at 01:10:00AM -0400, Alexander Viro wrote: > +static inline void write_super(struct super_block *sb) > +{ > + lock_super(sb); > + if (sb->s_root && sb->s_dirt) When I first looked at this, I thought it was a typo. I don't think we should have s_dirty and s_dirt as fields of the superblock. how about s_dirty_inodes and s_isdirty, respectively? > +restart: > + spin_lock(&sb_lock); > + sb = sb_entry(super_blocks.next); > + while (sb != sb_entry(&super_blocks)) > + if (sb->s_dirt) { > + sb->s_count++; > + spin_unlock(&sb_lock); > + down_read(&sb->s_umount); > + write_super(sb); > + drop_super(sb); > + goto restart; > + } else > + sb = sb_entry(sb->s_list.next); > + spin_unlock(&sb_lock); I think this could be clearer. struct list_head *tmp; restart: spin_lock(&sb_lock); list_for_each(tmp, super_blocks) { struct super_block *sb = sb_entry(tmp); if (!sb->s_dirt) continue; spin_unlock(&sb_lock); down_read(&sb->s_umount); write_super(sb); drop_super(sb); goto restart; } spin_unlock(&sb_lock); > @@ -773,16 +810,16 @@ > void *data, int silent) > { > struct super_block * s; > - s = get_empty_super(); > + s = alloc_super(); > if (!s) > goto out; > s->s_dev = dev; > s->s_bdev = bdev; > s->s_flags = flags; > - s->s_dirt = 0; > s->s_type = type; > - s->s_dquot.flags = 0; > - s->s_maxbytes = MAX_NON_LFS; > + spin_lock(&sb_lock); > + list_add (&s->s_list, super_blocks.prev); I'd use list_add_tail(&s->s_list, super_blocks); > - if (mnt->mnt_instances.next != mnt->mnt_instances.prev) { > + if (atomic_read(&sb->s_active) > 1) { I'm happy to see that line disappear. It was mightily confusing. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: about BKL in VFS
On Fri, Jun 08, 2001 at 02:04:16PM -0400, Alexander Viro wrote: > the only > areas in VFS that still rely on BKL are locks.c, dquot.c and super.c. The > latter is fixed in the patches I'm feeding to Linus (OK, the pieces I'm > feeding to Linus make shifting the BKL down to method calls trivial). Other > two... Well, ask the poor guys who maintain them ;-) the file locking code always takes the BKL itself, and doesn't rely on the rest of the VFS to take the lock for it. it's not too much trouble to remove it, and this will happen as part of the 2.5 rewrite. which is in progress. annoyingly, there's very little common semantics between POSIX locks and FLOCK locks, which means that fs/locks.c is basically implementing two completely independent things through one interface. and then there's the other 3-4 different types of lock bleuch. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Tue, May 22, 2001 at 04:31:37PM +0100, Alan Cox wrote: > > `the class of devices in question' was cryptographic devices, and possibly > > other transactional DSPs. I don't consider audio to be transactional. > > in any case, you can do transactional things with two threads, as long > > as they each have their own fd on the device. Think of the fd as your > > transaction handle. > > Thats a bit pathetic. So I have to fill my app with expensive pthread locks > or hack all the drivers and totally change the multi-open sematics in the ABI huh? void thread_init(void) { int fd = open("/dev/crypto"); real_thread_init(fd); } where was that lock again? and notice this idea is only for transactional things -- what transactional things do sound drivers do? -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Tue, May 22, 2001 at 08:49:04AM +0100, Alan Cox wrote: > > For _devices_, though? I don't expect my mouse to work if gpm and xfree > > both try to consume device events from the same filp. Heck, it doesn't > > even work when they try to consume events from the same inode :-) I think > > this is a reasonable restriction for the class of devices in question. > > Not really. Think about basic things like full duplex audio with two threads `the class of devices in question' was cryptographic devices, and possibly other transactional DSPs. I don't consider audio to be transactional. in any case, you can do transactional things with two threads, as long as they each have their own fd on the device. Think of the fd as your transaction handle. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Tue, May 22, 2001 at 02:22:34AM +0200, Ingo Oeser wrote: > ioctl has actually 4 semantics: > > command only > command + read > command + write > command + rw-transaction > > Separating these would be a first step. And yes, I consider each > of them useful. > > command only: reset drive echo 'reset' >/dev/sg0ctl > command + rw-transaction: "dear device please mangle this data" >(crypto processors come to mind...) I can't think of a reasonable tool-based approach to this, but I can definitely see that a program could use this well. It simply requires that you use the filp to store your state. fd = open(/dev/crypto) -> creates filp write(fd, "Death to all fanatics!\n"); -> calls crypto device, stores result in private data structure sleep(100); read(fd, "Qrngu gb nyy snangvpf!\n"); -> frees data structure [You'll note the advanced design of my crypto processor.] Clearly, this is open to abuse by persons never calling read() and passing in far too much to write(). I think this can be alleviated by refusing to accept more than (say) 4k at a time, or bean-counter. A sick way would be to allow the ->write() call to have its buffer modified. But I don't think we want to go down that path. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Mon, May 21, 2001 at 06:13:18PM -0700, Linus Torvalds wrote: > Nope. You can (and people do, quite often) share filps. So you can't > associate state with it. For _devices_, though? I don't expect my mouse to work if gpm and xfree both try to consume device events from the same filp. Heck, it doesn't even work when they try to consume events from the same inode :-) I think this is a reasonable restriction for the class of devices in question. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sun, May 20, 2001 at 03:11:53PM -0400, Alexander Viro wrote: > Pheeew... Could you spell "about megabyte of stuff in ioctl.c"? No. $ ls -l arch/*/kernel/ioctl32*.c -rw-r--r--1 willywilly 22479 Jan 24 16:59 arch/mips64/kernel/ioctl32.c -rw-r--r--1 willywilly 109475 May 18 16:39 arch/parisc/kernel/ioctl32.c -rw-r--r--1 willywilly 117605 Feb 1 20:35 arch/sparc64/kernel/ioctl32.c only about 100k. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sat, May 19, 2001 at 10:22:55PM -0400, Richard Gooch wrote: > The transaction(2) syscall can be just as easily abused as ioctl(2) in > this respect. But read() and write() cannot. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sat, May 19, 2001 at 12:51:23PM -0600, Richard Gooch wrote: > Al, if you really want to kill ioctl(2), then perhaps you should > implement a transaction(2) syscall. Something like: > int transaction (int fd, void *rbuf, size_t rlen, >void *wbuf, size_t wlen); > > Of course, there wouldn't be any practical gain, since we already have > ioctl(2). Any gain would be aesthetic. I can tell you haven't had to write any 32-bit ioctl emulation code for a 64-bit kernel recently. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
On Sat, May 19, 2001 at 12:51:07PM -0400, Alexander Viro wrote: > clone(), walk(), clunk(), stat() and open() ;-) Basically, we can add > unopened descriptors. I.e. no IO until you open it (turning the thing into > opened one), but we can do lookups (move to child), we can clone and > kill them and we can stat them. Those who would like a more detailed explanation can find one at http://plan9.bell-labs.com/sys/man/5/INDEX.html -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sat, May 19, 2001 at 05:25:22PM +0100, Alan Cox wrote: > Only to an English speaker. I suspect Quebec City canadians would prefer a > different command set. Should we support `pas387' as well as `no387' as a kernel boot parameter then? Face it, a sysadmin has to know the limited subset of english which is used to configure a kernel. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
F_SETLK_NP
Here's an initial implementation of F_SETLK_NP as previously discussed on linux-fsdevel. The _ONLY_ difference between these locks and POSIX locks is that they are not removed on a close(dup(fd)). I have a question about the interaction between POSIX and RANGE locks. If they overlap, should they be merged? And if so, what type should the resulting lock have? My initial thought is that they should _not_ be merged but _should_ conflict, which is going to require a slightly bigger patch. This patch merges them and gives the merged lock the type of the lock which arrived first. Anyway, here's the test program (I checked what was going on via /proc/locks, you can too). /* lock-test.c Copyright (c) Matthew Wilcox, Hewlett-Packard. GPL License */ #define _GNU_SOURCE #include #include #include #define F_SETLK_NP 15 int main(int argc, char **argv) { struct flock64 lock; int fd; lock.l_type = F_RDLCK; lock.l_whence = 0; lock.l_start = 50; lock.l_len = 100; fd = open(argv[0], O_RDONLY); fcntl(fd, F_SETLK_NP, &lock); printf("Applied lock to file -- check and press the `Enter' key\n"); getchar(); close(dup(fd)); printf("Closed dup -- check and press the `Enter' key\n"); getchar(); close(fd); printf("Closed original -- check and press the `Enter' key\n"); getchar(); return 0; } And now for the patch itself. It was generated against 2.4.5-pre1 but will probably apply against any 2.4.x kernel. I've taken the liberty of getting rid of some old cruft which wasn't used any more. This patch is i386 only -- just add constants to other arch include files as required. diff -urNX dontdiff linux-245p1/fs/fcntl.c linux-245p1-vfs/fs/fcntl.c --- linux-245p1/fs/fcntl.c Sun May 6 09:16:20 2001 +++ linux-245p1-vfs/fs/fcntl.c Sun May 6 05:52:35 2001 @@ -260,6 +260,10 @@ case F_SETLKW: err = fcntl_setlk(fd, cmd, (struct flock *) arg); break; + case F_SETLK_NP: + case F_SETLKW_NP: + err = fcntl_setlk_np(fd, cmd, (struct flock64 *) arg); + break; case F_GETOWN: /* * XXX If f_owner is a process group, the diff -urNX dontdiff linux-245p1/fs/locks.c linux-245p1-vfs/fs/locks.c --- linux-245p1/fs/locks.c Sun May 6 09:14:08 2001 +++ linux-245p1-vfs/fs/locks.c Sun May 6 05:38:12 2001 @@ -547,7 +547,7 @@ /* POSIX locks owned by the same process do not conflict with * each other. */ - if (!(sys_fl->fl_flags & FL_POSIX) || + if (!(sys_fl->fl_flags & (FL_POSIX | FL_RANGE)) || locks_same_owner(caller_fl, sys_fl)) return (0); @@ -620,7 +620,7 @@ lock_kernel(); for (cfl = filp->f_dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) { - if (!(cfl->fl_flags & FL_POSIX)) + if (!(cfl->fl_flags & (FL_POSIX | FL_RANGE))) continue; if (posix_locks_conflict(cfl, fl)) break; @@ -863,7 +863,7 @@ if (caller->fl_type != F_UNLCK) { repeat: for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { - if (!(fl->fl_flags & FL_POSIX)) + if (!(fl->fl_flags & (FL_POSIX | FL_RANGE))) continue; if (!posix_locks_conflict(caller, fl)) continue; @@ -892,7 +892,7 @@ /* First skip locks owned by other processes. */ - while ((fl = *before) && (!(fl->fl_flags & FL_POSIX) || + while ((fl = *before) && (!(fl->fl_flags & (FL_POSIX | FL_RANGE)) || !locks_same_owner(caller, fl))) { before = &fl->fl_next; } @@ -1458,23 +1458,72 @@ break; case F_UNLCK: break; - case F_SHLCK: - case F_EXLCK: -#ifdef __sparc__ -/* warn a bit for now, but don't overdo it */ -{ - static int count = 0; - if (!count) { - count=1; - printk(KERN_WARNING - "fcntl_setlk() called by process %d (%s) with broken flock() emulation\n", - current->pid, current->comm); + default: + error = -EINVAL; + goto out_putf; + } + + if (filp->f_op && filp->f_op->lock != NULL) { + error = filp->f_op->lock(filp, cmd, file_lock); + if (error < 0) + goto out_putf; } + error = posix_lock_file(filp,
Re: File Locking in Linux 2.5
On Fri, May 04, 2001 at 09:35:33AM -0700, Peter J. Braam wrote: > For cluster and distributed file systems we actually need a somewhat > different approach to locking. The best thing to have would be for the FS > to know that a lock will be needed and for what operation at lookup time. For locking-used-to-guarantee-cluster-operation, yes. I think that's a completely different thing to applications-locking-files and it's a mistake to try to treat the two as one. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: File Locking in Linux 2.5
On Fri, May 04, 2001 at 01:33:44PM +0200, Trond Myklebust wrote: > ??? It'll be a completely useless form of lock if programs can end up > blocking forever on a set of (from their perspective) valid > operations. Umm.. that's your opinion. I don't see how `making them mandatory' would make this better or worse than if they check for deadlock. If they're mandatory, a process doing a read() which conflicts will block forever whether they're checked for deadlock or not. > Ideally the VFS locking manager wants to provide a service to samba, > NFS and their ilk, whereby they ask for a lock. If waiting would > deadlock, send back an immediate answer to the caller. If not, then > the locking manager should be responsible for scheduling a lock > immediately after the blocking lock is released, and notifying the > caller that it can proceed. did you read my proposal? was that not exactly what i suggested? > Anything less than that would be an unacceptable step backward > wrt. the existing code. We don't want to cram more responsabilities > down on the filesystem-level. Actually, we _do_ want to give the filesystems the opportunity to handle their own locks without poking around in the guts of locks.c as lockd currently does. Did my proposal seem unacceptable to you? > How about allowing the process to use a cookie to define ownership at > the kernel level? > > For normal processes this may or may not include some combination > involving the pid, but kernel level server processes could use this to > save the metadata required to identify the client. I know the NFSv4 > people are keen to implement this sort of thing... I like this. That may be why the current code checks both owner and pid fields. Definitely worthy of a comment if so... -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: File Locking in Linux 2.5
On Thu, May 03, 2001 at 02:10:48PM +0200, Trond Myklebust wrote: > >>>>> " " == Matthew Wilcox <[EMAIL PROTECTED]> writes: > > > I'll get to it this weekend then. Should be a relatively > > simple patch. Are there any other semantics you want changing > > from the POSIX lock? I'm thinking of removing deadlock > > detection, for one... > > What do you mean by `removing' it? Eliminating deadlock detection > entirely will get you in deep trouble when you add mandatory locking. Huh? (1) I don't intend to have mandatory locking on these new locks. (2) how exactly would it get me into trouble? Let's just go over the semantics we currently have for posix locks, and see if we can come up with `better' lock semantics. POSIX locks are advisory (unless you do the mandatory lock dance), and apply to a byte-range. They are checked for deadlock, merged with other compatible locks (same PID, same `owner'). (this seems redundant and there's a question mark by it in the source... never mind.) The POSIX locking rules mean that you can do funky things like: lock bytes 0-100 lock bytes 50-150 unlock bytes 50-150 and you have bytes 0-49 left locked. of course, you can then unlock bytes 0-100 and it will remove the lock from 0-49, but this behaviour seems unintuitive. You can also (and this causes a fair bit of ugliness in the kernel code) do: lock 0-100 unlock 25-75 or, even worse lock_read 0-100 lock_write 25-75 which leaves you with 0-24 read-locked, 25-75 write-locked and 76-100 read-locked. i can't imagine this is behaviour people actually use/desire. So for a new lock type, i propose the following semantics: byte-range, advisory, non-merging, non-deadlock-checking Another semantic issue I want to tie down. Is it useful for different threadsd to be able to share a lock? Or should locks from different threads conflict? I'm currently leaning towards the latter, but you may be able to convince me I'm wrong. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: File Locking in Linux 2.5
On Thu, May 03, 2001 at 01:28:33PM +0200, Andi Kleen wrote: > Just don't forget to add a per user ulimit for it and probably an admin > tool like ipcs. It'll use the same limit as the other locks do. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: File Locking in Linux 2.5
On Thu, May 03, 2001 at 12:19:31AM -0700, Jeremy Allison wrote: > > I want to add another byte-range lock, which looks and smells like a > > POSIX fcntl lock except that it is not removed by closing any fd which > > happens to be open on this file. > > If you do this I'll be eternally in your debt ! I had to > write that code for Samba 2.2 and it was a *horrible* > experience. With non-broken locking we can #ifdef out > via autoconf a large chunk of code from the Linux Samba > implementation, and free up fd's immediately on close. > > As soon as you have any prototypes of this please let > us know on the [EMAIL PROTECTED] list - I'd > like to try it asap ! I'll get to it this weekend then. Should be a relatively simple patch. Are there any other semantics you want changing from the POSIX lock? I'm thinking of removing deadlock detection, for one... -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: File Locking in 2.5
On Wed, May 02, 2001 at 04:49:30PM +0200, Andi Kleen wrote: > On Mon, Apr 30, 2001 at 12:39:23PM -0600, Matthew Wilcox wrote: > > * All filesystems will fill in their ->lock method. > > Why when a common stub should work for 90% of them? Please keep > global search-and-edit operation low when not absolutely possible. I think this will be one of the more minor changes that filesystems will see during 2.5 :-) > > * Local filesystems should all use local_lock() for this method, > >unless they have a good reason to provide their own facilities. > > > So this should be default for ->lock Perhaps we misunderstand one another? I propose most filesystems should do struct file_operations { ... lock: local_lock } and I think you're proposing if (i_fop->lock) i_fop->lock(); else local_lock(); please correct me if I'm wrong. > >A clustered filesystem might call out to the network and say `I want > >to put this lock on this file'. Either some other node in the cluster > >says `Denied', `Blocked' or `Granted' (ie handles the request), OR no > >other node accepts responsibility for the lock, in which case we lock > >it locally by calling local_lock(). > > It would be nice if it was possible to interface it to the distributed lock > manager IBM has recently GPLed... It would. I hadn't heard that got released, I shall investigate. > It's still linear lists I guess? Any evidence that a hash table or a ternary > search tree makes sense? I'm not going to change it from a linear list until I've made these other changes. In the meantime, if anyone knows of a data structure which lets me record ranges and check for overlaps, let me know... > I would add a printk warning that fires 10 times, then you can get rid > of it. They're only supported for the benefit of really old applications (ie stuff that runs under iBCS). We could probably drop them or make them a config option. I've never met anyone who mounts their filesystems with -o mand :-) > I never understood why you can't do it completely inode local protected > by the inode lock; but then I don't claim that I understand fs/locks.c inode_lock claims it's for protecting the list manipulations, not for protecting any element in an individual inode. Though in practice it seems to be used for protecting i_state as well. I think I'd rather use a different lock to protect i_flock -- inode_lock is probably already highly contended. > Or just walk the inode hash ? yes, could do that. but /proc/locks really ought to die. i don't see why we need to expose kernel addresses to userspace. the extreme security people have also voiced concerns about the covert channel involved and the way this can be used to spy on other processes. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]