lock_rename for cluster filesystems? (was: Re: [PATCH] prune_icache_sb)
On 12/4/06, Wendy Cheng <[EMAIL PROTECTED]> wrote:

Russell Cattelan wrote:
> Wendy Cheng wrote:
>
>> Linux kernel, particularly the VFS layer, is starting to show signs
>> of inadequacy as the software components built upon it keep growing.
>> I have doubts that it can keep up and handle this complexity with a
>> development policy like you just described (filesystem is a dumb
>> layer ?). Aren't these DIO_xxx_LOCKING flags inside
>> __blockdev_direct_IO() a perfect example why trying to do too many
>> things inside vfs layer for so many filesystems is a bad idea ? By
>> the way, since we're on this subject, could we discuss a little bit
>> about vfs rename call (or I can start another new discussion thread) ?
>>
>> Note that linux do_rename() starts with the usual lookup logic,
>> followed by "lock_rename", then a final round of dentry lookup, and
>> finally comes to filesystem's i_op->rename call. Since lock_rename()
>> only calls for vfs layer locks that are local to this particular
>> machine, for a cluster filesystem, there exists a huge window between
>> the final lookup and filesystem's i_op->rename calls such that the
>> file could get deleted from another node before fs can do anything
>> about it. Is it possible that we could get a new function pointer
>> (lock_rename) in inode_operations structure so a cluster filesystem
>> can do proper locking ?
>
> It looks like the ocfs2 guys have the similar problem?
>
> http://ftp.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/ocfs2_git_patches/ocfs2-upstream-linus-20060924/0009-PATCH-Allow-file-systems-to-manually-d_move-inside-of-rename.txt
>
>

Thanks for the pointer. Same as ocfs2, under current VFS code, both
GFS1/2 also need FS_ODD_RENAME flag for the rename problem - got an ugly
~200 line draft patch ready for GFS1 (and am looking into GFS2 at this
moment). The issue here is, for GFS, if vfs lock_rename() can call us,
this complication can be greatly reduced. Will start another thread to
see whether the wish can be granted.

Hi Wendy,

Have you (or others) made any progress on a possible solution to
simplify handling cluster fs do_rename() races (e.g. your proposed
lock_rename in inode_operations)?

I couldn't find a newer thread that continued this discussion... please
advise, thanks.

Mike
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
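For readers following the proposal, here is a minimal userspace sketch of what an optional per-filesystem lock_rename hook could look like. Everything in it is hypothetical: the struct names, the unlock_rename counterpart, and the cluster_locked field are illustrative stand-ins, not existing kernel interfaces. The point is only the ordering: the filesystem gets to take its cluster-wide locks before the final lookup, so the window Wendy describes is covered.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct inode_sketch;

/* Hypothetical operations table; only the hooks under discussion are shown. */
struct inode_ops_sketch {
    int  (*lock_rename)(struct inode_sketch *old_dir, struct inode_sketch *new_dir);
    void (*unlock_rename)(struct inode_sketch *old_dir, struct inode_sketch *new_dir);
    int  (*rename)(struct inode_sketch *old_dir, struct inode_sketch *new_dir);
};

struct inode_sketch {
    const struct inode_ops_sketch *i_op;
    int cluster_locked;   /* stands in for a cluster-wide lock on this directory */
};

/* Simplified rename path: node-local locks, optional fs hook, lookup, rename. */
int do_rename_sketch(struct inode_sketch *old_dir, struct inode_sketch *new_dir)
{
    const struct inode_ops_sketch *ops = old_dir->i_op;
    int err;

    assert(ops != NULL && ops->rename != NULL);

    /* Step 1: the existing node-local lock_rename() would run here. */

    /* Step 2: let a cluster filesystem take its own locks, if it wants to. */
    if (ops->lock_rename) {
        err = ops->lock_rename(old_dir, new_dir);
        if (err)
            return err;
    }

    /* Step 3: the final dentry lookup would run here, under cluster locks. */
    err = ops->rename(old_dir, new_dir);

    if (ops->unlock_rename)
        ops->unlock_rename(old_dir, new_dir);
    return err;
}

/* Mock cluster filesystem: rename only succeeds while both locks are held. */
int cfs_lock(struct inode_sketch *a, struct inode_sketch *b)
{
    a->cluster_locked = 1;
    b->cluster_locked = 1;
    return 0;
}

void cfs_unlock(struct inode_sketch *a, struct inode_sketch *b)
{
    a->cluster_locked = 0;
    b->cluster_locked = 0;
}

int cfs_rename(struct inode_sketch *a, struct inode_sketch *b)
{
    return (a->cluster_locked && b->cluster_locked) ? 0 : -ESTALE;
}

/* Mock local filesystem: no hook, so the rename path is unchanged for it. */
int local_rename(struct inode_sketch *a, struct inode_sketch *b)
{
    (void)a;
    (void)b;
    return 0;
}

const struct inode_ops_sketch cfs_ops = { cfs_lock, cfs_unlock, cfs_rename };
const struct inode_ops_sketch local_ops = { NULL, NULL, local_rename };
```

A filesystem that leaves the hook NULL sees no behaviour change, which is presumably what would make such a pointer cheap to add.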
Re: [RFC] Heads up on sys_fallocate()
Andrew Morton wrote:

On Fri, 02 Mar 2007 09:40:54 +1100 Nathan Scott <[EMAIL PROTECTED]> wrote:

On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:

On Fri, 2 Mar 2007 00:04:45 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

This is to give a heads up on few patches that we will be soon coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation "fallocate", for persistent preallocation. The new
system call, as Andrew suggested, will look like:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

...

I'd agree with Eric on the "command" flag extension.

Seems like a separate syscall would be better, "command" sounds a bit
ioctl like, especially if that command is passed into the filesystems..

madvise, fadvise, lseek, etc seem to work OK.

I get repeatedly traumatised by patch rejects whenever a new syscall gets
added, so I'm biased. The advantage of a command flag is that we can add
new modes in the future without causing lots of churn, waiting for arch
maintainers to catch up, potentially adding new compat code, etc.

Rename it to "mode"? ;)

I am wondering whether it would be useful to add another mode to advise
block allocation policy: something like indicating which physical
block/block group to allocate from (the goal), and whether to ask for
strictly contiguous blocks. This would help preallocation or reservation
choose the right blocks for the file. Right now neither the ext4
preallocation implementation nor reservation is guaranteed to
allocate/reserve contiguous extents. If the application asked for it,
the allocator could do more searching to satisfy the requirement. Or is
fadvise the right interface?

Mingming

I'm inclined to merge this patch nice and early, so the syscall number is
stabilised. Otherwise the people who are working on out-of-tree code (ie:
ext4) will have to keep playing catchup.
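To make the "goal plus strict contiguity" idea concrete, here is a small sketch over an in-memory block bitmap. The flag name HYP_FA_STRICT_CONTIG and both function names are made up for illustration and do not correspond to any proposed or existing interface. The allocator first looks for a contiguous run starting at the goal, then anywhere, and only falls back to scattered blocks when the caller did not ask for strict contiguity.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mode bit: fail rather than hand back a fragmented range. */
#define HYP_FA_STRICT_CONTIG 0x1

/* Find `count` consecutive free blocks in [from, to); return the first
 * block of the run, or -ENOSPC if there is no such run. */
long scan_for_run(const bool *used, size_t from, size_t to, size_t count)
{
    size_t run = 0;

    for (size_t i = from; i < to; i++) {
        run = used[i] ? 0 : run + 1;
        if (run == count)
            return (long)(i + 1 - count);
    }
    return -ENOSPC;
}

/* Preallocate `count` blocks, preferring a contiguous run at or after `goal`.
 * Returns 0 on success or -ENOSPC. */
long prealloc_with_hint(bool *used, size_t nblocks, size_t goal,
                        size_t count, int mode)
{
    size_t nfree = 0;
    long start;

    assert(goal < nblocks && count > 0);

    start = scan_for_run(used, goal, nblocks, count);
    if (start < 0)
        start = scan_for_run(used, 0, nblocks, count);  /* retry from block 0 */
    if (start >= 0) {
        for (size_t i = 0; i < count; i++)
            used[(size_t)start + i] = true;
        return 0;
    }

    if (mode & HYP_FA_STRICT_CONTIG)
        return -ENOSPC;             /* caller insisted on one extent */

    /* Non-strict: accept any free blocks, even if scattered. */
    for (size_t i = 0; i < nblocks; i++)
        nfree += !used[i];
    if (nfree < count)
        return -ENOSPC;
    for (size_t i = 0; i < nblocks && count > 0; i++) {
        if (!used[i]) {
            used[i] = true;
            count--;
        }
    }
    return 0;
}
```

Whether such a hint belongs in a fallocate mode or in fadvise is exactly the open question in the message above; the sketch takes no position on that.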
Re: [RFC] Heads up on sys_fallocate()
Dave Kleikamp wrote:

On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote:

On Thu, 01 Mar 2007 22:44:16 + Dave Kleikamp <[EMAIL PROTECTED]> wrote:

On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:

On Fri, 2 Mar 2007 00:04:45 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	inode = file->f_path.dentry->d_inode;
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, offset, len);
+	else
+		ret = -ENOTTY;
+	fput(file);
+out:
+	return ret;
+}

ENOTTY is a bit unconventional - we often use EINVAL for this sort of
thing. But EINVAL has other meanings for posix_fallocate() and isn't
really appropriate here anyway. So I'm not sure what would be better...

Would EINVAL (or whatever) make it back to the caller of
posix_fallocate(), or would glibc fall back to its current
implementation?

Forgive me if I haven't put enough thought into it, but would it be
useful to create a generic_fallocate() that writes zeroed pages for any
non-existent pages in the range? I don't know how glibc currently
implements posix_fallocate(), but maybe the kernel could do it more
efficiently, even in generic code. Maybe we don't care, since the major
file systems can probably do something better in their own code.

Given that glibc already implements fallocate for all filesystems, it will
need to continue to do so for filesystems which don't implement this
syscall - otherwise applications would start breaking.

I didn't make it clear, but my point was to call generic_fallocate if
the file system did not define i_op->fallocate().

if (inode->i_op && inode->i_op->fallocate)
	ret = inode->i_op->fallocate(inode, offset, len);
else
	ret = generic_fallocate(inode, offset, len);

I'm not sure it's worth the effort, but I thought I'd throw the idea out
there.

I think this is useful.

Mingming
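The fallback Dave describes can be sketched in userspace as follows. This is an illustration under simplifying assumptions, not kernel code: the "file" is a small in-memory array where each slot stands for one page, file_sketch and its fields are invented names, and generic_fallocate() here only models the behaviour "zero-fill pages that do not exist yet, leave existing data alone".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NPAGES 64

/* Toy file: one array slot per page. All names here are illustrative. */
struct file_sketch {
    /* Filesystem-specific preallocation; NULL when the fs has none. */
    long (*fallocate)(struct file_sketch *f, long off, long len);
    unsigned char data[NPAGES];
    unsigned char present[NPAGES];  /* 1 if the page is backed by storage */
    int zero_writes;                /* pages the generic path had to write */
};

/* Generic fallback: write zeroes into pages that are not there yet. */
long generic_fallocate(struct file_sketch *f, long off, long len)
{
    if (off < 0 || len <= 0 || off > NPAGES - len)
        return -EINVAL;

    for (long i = off; i < off + len; i++) {
        if (f->present[i])
            continue;               /* never overwrite existing data */
        f->data[i] = 0;
        f->present[i] = 1;
        f->zero_writes++;
    }
    return 0;
}

/* Dispatch as suggested in the thread: fs method if defined, else generic. */
long do_fallocate(struct file_sketch *f, long off, long len)
{
    assert(f != NULL);
    if (f->fallocate)
        return f->fallocate(f, off, len);
    return generic_fallocate(f, off, len);
}
```

The zero_writes counter makes the cost of the generic path visible: a filesystem with real preallocation support would reserve the blocks without writing anything, which is the efficiency argument made in the message.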
Re: [RFC] Heads up on sys_fallocate()
Badari Pulavarty wrote:

BTW, what is the interface for finding out what is the size of the
pre-allocated file ?

With XFS at least, "du," "stat," etc tell you a little:

[EMAIL PROTECTED] test]# touch resvsp
[EMAIL PROTECTED] test]# xfs_io resvsp
xfs_io> resvsp 0 10g

The file is 0 length, but is using 10g of blocks:
(with posix_fallocate this would move the size out to 10g as well)

[EMAIL PROTECTED] test]# ls -lh resvsp
-rw-r--r-- 1 root root 0 Nov 28 14:11 resvsp
[EMAIL PROTECTED] test]# du -hc resvsp
10G     resvsp
10G     total
[EMAIL PROTECTED] test]# stat resvsp
  File: `resvsp'
  Size: 0   Blocks: 20971520   IO Block: 4096   regular empty file
Device: 81eh/2078d   Inode: 186   Links: 1
Access: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/root)

xfs also has an interface to find out what allocations are where:
if you reserve some ranges not starting at 0...

[EMAIL PROTECTED] test]# xfs_io resvsp
xfs_io> resvsp 1g 1g
xfs_io> resvsp 3g 1g
xfs_io> resvsp 5g 1g
xfs_io> quit
[EMAIL PROTECTED] test]# xfs_bmap -v resvsp
resvsp:
 EXT: FILE-OFFSET             BLOCK-RANGE         AG AG-OFFSET             TOTAL   FLAGS
   0: [0..2097151]:           hole                                         2097152
   1: [2097152..4194303]:     42392..2139543       0 (42392..2139543)      2097152 1
   2: [4194304..6291455]:     hole                                         2097152
   3: [6291456..8388607]:     4236696..6333847     0 (4236696..6333847)    2097152 1
   4: [8388608..10485759]:    hole                                         2097152
   5: [10485760..12582911]:   8431000..10528151    0 (8431000..10528151)   2097152 1

A flags value of 1 means that the extent is preallocated/unwritten.

I suppose outside of XFS, FIBMAP is your best bet, but that won't tell
you what is preallocated vs. allocated/written.

-Eric
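The numbers in the stat output above can be checked with a couple of lines of C. This sketch assumes st_blocks is counted in 512-byte units, which holds on Linux but is not guaranteed by POSIX; the two helper names are invented for the example. With Blocks: 20971520 and Size: 0, the file occupies 20971520 * 512 = 10737418240 bytes (10 GiB), all of it beyond end of file.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Bytes of storage the file occupies, assuming 512-byte st_blocks units
 * (the Linux convention; POSIX leaves the unit unspecified). */
long long allocated_bytes(const struct stat *st)
{
    assert(st != NULL);
    return (long long)st->st_blocks * 512;
}

/* Storage held beyond the file size. A positive value hints at
 * preallocation past EOF; a negative value means the file is sparse. */
long long allocated_beyond_size(const struct stat *st)
{
    return allocated_bytes(st) - (long long)st->st_size;
}
```

In a real program the struct would be filled in by stat() or fstat(). As Eric notes, this only gives a total: it cannot distinguish preallocated blocks from written ones inside the file size.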
Re: [RFC] Heads up on sys_fallocate()
On Fri, 02 Mar 2007 08:13:00 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> > > What about if the blocks already exists ? What would be return
> > > values in those cases ?
> >
> > 0 on success, other normal errors otherwise..
> >
> > If asked for a range that includes already-allocated blocks, you just
> > allocate any non-allocated blocks in the range, I think.
>
> Yes. What I was trying to figure out is, if there is a requirement that
> interface need to return exact number of bytes it *really* allocated
> (like write() or read()). I can't think of any, but just wanted to
> throw it out..

Hopefully not, because posix didn't anticipate that. We could of course
return a positive number on success, but it'd get tricky on 32-bit machines.

> BTW, what is the interface for finding out what is the size of the
> pre-allocated file ?

stat.st_blocks?
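The 32-bit concern can be made concrete with a small check. The length argument is a 64-bit loff_t, but the syscall returns a long, which is only 32 bits wide on 32-bit machines, so a byte count for a large preallocation would not fit. The helper below is purely illustrative (its name and the explicit width parameter are not from the thread).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Would a byte count survive being returned through a signed `long`
 * that is `long_bits` wide (32 or 64)? */
bool fits_in_long(int64_t nbytes, unsigned long_bits)
{
    int64_t max;

    assert(long_bits == 32 || long_bits == 64);
    max = (long_bits == 32) ? INT32_MAX : INT64_MAX;
    return nbytes >= 0 && nbytes <= max;
}
```

For example, a successful 10 GiB preallocation could report its size on a 64-bit kernel but not on a 32-bit one, which is one reason a plain 0-on-success return is simpler.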
Re: [RFC] Heads up on sys_fallocate()
On Fri, 2007-03-02 at 09:16 -0600, Eric Sandeen wrote:
> Badari Pulavarty wrote:
> >
> > Amit K. Arora wrote:
> >
> >> This is to give a heads up on few patches that we will be soon coming up
> >> with. These patches implement a new system call sys_fallocate() and a
> >> new inode operation "fallocate", for persistent preallocation. The new
> >> system call, as Andrew suggested, will look like:
> >>
> >> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> >>
> > I am wondering about return values from this syscall ? Is it supposed to
> > return the number of bytes allocated ? What about partial allocations ?
>
> If you don't have enough blocks to cover the request, you should
> probably just return -ENOSPC, not a partial allocation.

That could be challenging when multiple writers are working in parallel.
You may not be able to return -ENOSPC until an allocation actually fails
(for filesystems which allocate a block at a time).

> > What about if the blocks already exists ? What would be return
> > values in those cases ?
>
> 0 on success, other normal errors otherwise..
>
> If asked for a range that includes already-allocated blocks, you just
> allocate any non-allocated blocks in the range, I think.

Yes. What I was trying to figure out is, if there is a requirement that
interface need to return exact number of bytes it *really* allocated
(like write() or read()). I can't think of any, but just wanted to
throw it out..

BTW, what is the interface for finding out what is the size of the
pre-allocated file ?

Thanks,
Badari
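One way to reconcile "no partial allocation" with a filesystem that allocates one block at a time is to undo the blocks taken so far when an allocation fails midway. The sketch below models that with a toy free-block counter; fs_sketch, file_sketch and prealloc_range are made-up names, and a real filesystem would of course need locking against the parallel writers Badari mentions. It also shows the agreed semantics for blocks that already exist: they are skipped, and the call still returns 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 32

struct fs_sketch   { size_t free_blocks; };
struct file_sketch { bool mapped[MAX_BLOCKS]; };

/* Take one block from the filesystem; fails once it is full. */
int alloc_one(struct fs_sketch *fs)
{
    if (fs->free_blocks == 0)
        return -ENOSPC;
    fs->free_blocks--;
    return 0;
}

/* All-or-nothing preallocation of blocks [first, first + count). */
long prealloc_range(struct fs_sketch *fs, struct file_sketch *f,
                    size_t first, size_t count)
{
    bool newly[MAX_BLOCKS] = { false };

    assert(first <= MAX_BLOCKS && count <= MAX_BLOCKS - first);

    for (size_t b = first; b < first + count; b++) {
        if (f->mapped[b])
            continue;               /* already allocated: nothing to do */
        if (alloc_one(fs) < 0) {
            /* Out of space: give back only what this call allocated. */
            for (size_t u = first; u < b; u++) {
                if (newly[u]) {
                    f->mapped[u] = false;
                    fs->free_blocks++;
                }
            }
            return -ENOSPC;
        }
        f->mapped[b] = true;
        newly[b] = true;
    }
    return 0;
}
```

The newly[] array is what keeps the rollback from freeing blocks that were allocated before the call, so a failed request leaves the file exactly as it found it.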
Re: [RFC] Heads up on sys_fallocate()
Badari Pulavarty wrote:

Amit K. Arora wrote:

This is to give a heads up on few patches that we will be soon coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation "fallocate", for persistent preallocation. The new
system call, as Andrew suggested, will look like:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

I am wondering about return values from this syscall ? Is it supposed to
return the number of bytes allocated ? What about partial allocations ?

If you don't have enough blocks to cover the request, you should
probably just return -ENOSPC, not a partial allocation.

What about if the blocks already exists ? What would be return values in
those cases ?

0 on success, other normal errors otherwise..

If asked for a range that includes already-allocated blocks, you just
allocate any non-allocated blocks in the range, I think.

-Eric
Re: [RFC] Heads up on sys_fallocate()
On 3/2/07, Dave Kleikamp <[EMAIL PROTECTED]> wrote:

Then there's no need for sys_fallocate to return a long.

Every syscall must return a long. Otherwise you can have problems on
64-bit archs.
Re: [RFC] Heads up on sys_fallocate()
On Mar 1 2007 23:09, Dave Kleikamp wrote:
>>
>> Given that glibc already implements fallocate for all filesystems, it will
>> need to continue to do so for filesystems which don't implement this
>> syscall - otherwise applications would start breaking.
>
>I didn't make it clear, but my point was to call generic_fallocate if
>the file system did not define i_op->fallocate().
>
>if (inode->i_op && inode->i_op->fallocate)
>	ret = inode->i_op->fallocate(inode, offset, len);
>else
>	ret = generic_fallocate(inode, offset, len);
>
>I'm not sure it's worth the effort, but I thought I'd throw the idea out
>there.

Writing zeroes using glibc emu most likely means write() -- so
generic_fallocate should be preferable (think splice). Or does glibc
use mmap() and it's all different?

Jan
--
Re: [RFC] Heads up on sys_fallocate()
Amit wrote:
> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

On Thu, 2007-03-01 at 22:16 -0800, Andrew Morton wrote:
> On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:
>
> > Just curious .. What does posix_fallocate() return ?
>
> bookmark this:
>
> http://www.opengroup.org/onlinepubs/009695399/nfindex.html
>
> Upon successful completion, posix_fallocate() shall return zero;
> otherwise, an error number shall be returned to indicate the error.

Then there's no need for sys_fallocate to return a long.

--
David Kleikamp
IBM Linux Technology Center
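The return conventions being compared here differ in a way that is easy to get wrong. Inside the kernel a syscall returns 0 or a negative errno value; posix_fallocate() returns 0 or a positive error number and does not use errno; most other libc wrappers return -1 and set errno. The two helpers below are illustrative names for the conversions a libc wrapper would have to perform, not actual glibc functions.

```c
#include <assert.h>
#include <errno.h>

/* Kernel convention (0 or -errno) to posix_fallocate() convention
 * (0 or a positive error number, errno left untouched). */
int kernel_ret_to_posix(long kret)
{
    assert(kret <= 0);
    return kret == 0 ? 0 : (int)-kret;
}

/* Kernel convention to the usual libc wrapper convention
 * (0 on success, -1 with errno set on failure). */
int kernel_ret_to_libc(long kret)
{
    assert(kret <= 0);
    if (kret < 0) {
        errno = (int)-kret;
        return -1;
    }
    return 0;
}
```

Either way the information carried is just "success or which error", which is the observation behind Dave's remark about the return type.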
Re: [RFC] Heads up on sys_fallocate()
On Fri, 2007-03-02 at 18:45 +0800, Andreas Dilger wrote:
> On Mar 01, 2007 13:15 -0600, Eric Sandeen wrote:
> > One thing I'd like to see is a cmd argument as well, to allow for
> > example allocation vs. reservation (i.e. allocating blocks vs. simply
> > reserving a number), as well as the inverse of those functions
> > (un-reservation, de-allocation)?
> >
> > If the allocation interface allows allocation/reservation within
> > arbitrary ranges, if the only way to un-allocate is via a truncate,
> > that's pretty asymmetric.
>
> I'd rather we just get the oft-discussed punch() syscall instead.
> This is really what "unallocate" would do for persistent allocations
> and it would be useful for files that were not preallocated.

I can see a difference though. punch() would throw away written data as
well as pre-allocated-but-never-written-to data. I can see where a user
might preallocate a large file and do a lot of random writes. At some
point, he decides the file isn't going to grow much more, so let's free
up the remaining pre-allocated blocks. This makes even more sense with
reservation.

The alternative would be to have punch() take a flag to specify if only
preallocated or reserved blocks should be freed.

> For filesystems that don't implement punch(), glibc would do zero-filling
> of the punched area I guess (to make it equivalent to reading from a
> hole in the file).

Or it could just fail. Writing zeroes may be really slow and not give
the caller any benefit. (The intention was to free blocks back to the
file system.)

Shaggy
--
David Kleikamp
IBM Linux Technology Center
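The distinction between a plain punch() and "free only what was preallocated but never written" can be sketched with a per-block state array. The enum, the flag HYP_PUNCH_UNWRITTEN_ONLY and punch_sketch() are hypothetical names for illustration; no such flag is defined in the thread.

```c
#include <assert.h>
#include <stddef.h>

enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

/* Hypothetical flag: leave written data alone, free only preallocation. */
#define HYP_PUNCH_UNWRITTEN_ONLY 0x1

/* Free blocks in [first, first + count); returns how many were freed. */
size_t punch_sketch(enum blk_state *blocks, size_t nblocks,
                    size_t first, size_t count, int flags)
{
    size_t freed = 0;

    assert(first <= nblocks && count <= nblocks - first);

    for (size_t b = first; b < first + count; b++) {
        if (blocks[b] == BLK_HOLE)
            continue;               /* nothing allocated here */
        if ((flags & HYP_PUNCH_UNWRITTEN_ONLY) && blocks[b] == BLK_WRITTEN)
            continue;               /* keep the user's data */
        blocks[b] = BLK_HOLE;
        freed++;
    }
    return freed;
}
```

With the flag set, the random-write scenario from the message works as described: the written blocks survive and only the unused preallocation goes back to the filesystem.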
Re: [RFC] Heads up on sys_fallocate()
On Mar 01, 2007 13:15 -0600, Eric Sandeen wrote:
> One thing I'd like to see is a cmd argument as well, to allow for
> example allocation vs. reservation (i.e. allocating blocks vs. simply
> reserving a number), as well as the inverse of those functions
> (un-reservation, de-allocation)?
>
> If the allocation interface allows allocation/reservation within
> arbitrary ranges, if the only way to un-allocate is via a truncate,
> that's pretty asymmetric.

I'd rather we just get the oft-discussed punch() syscall instead.
This is really what "unallocate" would do for persistent allocations
and it would be useful for files that were not preallocated.

For filesystems that don't implement punch(), glibc would do zero-filling
of the punched area I guess (to make it equivalent to reading from a
hole in the file).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.