[RFC] Heads up on sys_fallocate()
This is to give a heads up on a few patches that we will be posting soon. These patches implement a new system call, sys_fallocate(), and a new inode operation, fallocate, for persistent preallocation. The new system call, as Andrew suggested, will look like:

  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

As we are still developing and testing the required patches, we decided to post a preliminary patch and get input from the community to give it the right direction and shape.

First, a short description of the feature. Persistent preallocation is a file system feature with which an application (say, a relational database server) can explicitly preallocate blocks to a particular file. Reserving space for a file this way has two main benefits:

1. contiguity: less fragmentation and thus faster access, and
2. a guarantee of minimum space availability (depending on how many blocks were preallocated) for the file, even if the filesystem becomes full.

XFS already has an implementation of this, using an ioctl interface, and ext4 is now gaining the feature. In time we may see a few more file systems implementing it. Thus, it makes sense to have a more standard interface, like this new system call.

Here is the initial and incomplete version of the patch, which can be used for discussion until we come up with a set of more complete patches.
---
 arch/i386/kernel/syscall_table.S |    1 +
 fs/ext4/file.c                   |    1 +
 fs/open.c                        |   18 ++
 include/asm-i386/unistd.h        |    3 ++-
 include/linux/fs.h               |    1 +
 include/linux/syscalls.h         |    1 +
 6 files changed, 24 insertions(+), 1 deletion(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_fallocate		/* 320 */
Index: linux-2.6.20.1/fs/ext4/file.c
===================================================================
--- linux-2.6.20.1.orig/fs/ext4/file.c
+++ linux-2.6.20.1/fs/ext4/file.c
@@ -135,5 +135,6 @@ struct inode_operations ext4_file_inode_
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext4_permission,
+	.fallocate	= ext4_fallocate,
 };

Index: linux-2.6.20.1/fs/open.c
===================================================================
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,24 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif

+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+        struct file *file;
+        struct inode *inode;
+        long ret = -EINVAL;
+        file = fget(fd);
+        if (!file)
+                goto out;
+        inode = file->f_path.dentry->d_inode;
+        if (inode->i_op && inode->i_op->fallocate)
+                ret = inode->i_op->fallocate(inode, offset, len);
+        else
+                ret = -ENOTTY;
+        fput(file);
+out:
+        return ret;
+}
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_fallocate		320

 #ifdef __KERNEL__

-#define NR_syscalls 320
+#define NR_syscalls 321
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -1124,6 +1124,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	long (*fallocate)(struct inode *, loff_t, loff_t);
 };

 struct seq_file;
Index: linux-2.6.20.1/include/linux/syscalls.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/syscalls.h
+++ linux-2.6.20.1/include/linux/syscalls.h
@@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long
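For readers who want to poke at the dispatch logic outside a kernel tree, here is a small userspace mock that mirrors the control flow of the proposed sys_fallocate(): call the object's fallocate operation if one is registered, otherwise report an error. Every mock_* name below is invented for this sketch and is not a kernel API; the real code operates on struct file and struct inode obtained through fget().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's loff_t, inode_operations and inode. */
typedef long long mock_loff_t;

struct mock_inode;

struct mock_inode_operations {
	long (*fallocate)(struct mock_inode *, mock_loff_t, mock_loff_t);
};

struct mock_inode {
	const struct mock_inode_operations *i_op;
	mock_loff_t reserved;	/* end of the range this toy filesystem set aside */
};

/* A toy filesystem hook: remember the end of the reserved range. */
long mock_fs_fallocate(struct mock_inode *inode,
		       mock_loff_t offset, mock_loff_t len)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;
	if (offset + len > inode->reserved)
		inode->reserved = offset + len;
	return 0;
}

/* Same shape as the patch: dispatch if the op exists, else -ENOTTY. */
long mock_sys_fallocate(struct mock_inode *inode,
			mock_loff_t offset, mock_loff_t len)
{
	if (!inode)
		return -EINVAL;	/* the patch returns -EINVAL when fget() fails */
	if (inode->i_op && inode->i_op->fallocate)
		return inode->i_op->fallocate(inode, offset, len);
	return -ENOTTY;
}
```

An inode whose operations table has no fallocate entry gets -ENOTTY back, which is the behaviour the rest of the thread debates.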
Ext4 devel interlock meeting minutes (Feb. 28, 2007)
Ext4 Developer Interlock Call: 02/28/2007 Meeting Minutes

Attendees: Mingming Cao, Suparna Bhattacharya, Dave Kleikamp, Eric Sandeen, Takashi Sato, Avantika Mathur

Minutes can be accessed at: http://ext4.wiki.kernel.org/index.php/Ext4_Developer%27s_Conference_Call

Mingming sent out minutes from the Ext4 filesystem and storage workshop which took place two weeks ago, and will be posting these on the ext4 wiki as well. Mingming gave a talk and led a BOF on ext4 at the summit - feel free to update or add comments to these minutes.

- One thing that was not discussed at the conference is the overall future plan for the Ext4 filesystem. Many people believe that Ext4 is a new filesystem that will include many of the features that new filesystems have, including greater scalability. But such additions may need massive changes and rewrites. Our question is: how long do we plan to continue to support backwards compatibility?

_PATCH STATUS_

Inode Versioning:
- Need to implement the high 32 bits for the i_version field. Andreas is looking at adding the new field in i_extra_isize.
- The 64 bit i_version would therefore only be available in ext4, and we would add the 32 bit patch to ext3. Need to verify with NFS that this would be ok for them.

Nanosecond Timestamps:
- Kalpak has resent the patches.
- CPU usage is a concern. Ted had suggested masking off different levels of granularity and testing performance at each level.

Preallocation:
- akpm suggested that we create and implement a system call for fallocate. Amit Arora is working on a simple patch which implements the system call for the i386 architecture.
- The main concern is the need to add an inode operation at the VFS layer. There are mixed responses about whether we should add a system call for preallocation. hch suggested we add a cmd parameter to the fallocate system call to do preallocate, unprealloc, reserve, unreserve etc.
-- Mingming thinks it would be good to use this syscall for reservation as well. The current interface to reservation is ioctl.
- Before continuing development on the system call, it is a good idea to discuss implementation details on lkml and linux-fsdevel.
-- Eric will send an email to linux-ext4 before extending the discussion to other lists.
-- Mingming will ask Amit to resend patches and follow up with this discussion.

Online Defragmentation:
- Takashi tested his online defrag patches and found a problem that he is currently looking into.
- After fixing the problem he will upgrade and repost his patches.
- Need Alex's update on his mballoc patch, as this online defrag patch currently depends on it.
-- Could we try to use preallocation in online defragmentation?
- In the filesystem workshop there was discussion on how locking works if the file being defragmented is in use.
-- There were suggestions to do defragmentation at directory level as well.
-- Use page cache rather than O_DIRECT to avoid complexity.

e2fsprogs Changes:
- Ted has planned to support 64 bit block numbers and extents in e2fsprogs.
- This will require many changes and rewrites. We will ask Ted about current status and distributing work items.

Migration Tool:
- Suparna and Mingming are working with Aneesh Veetil to create a tool to migrate from regular files to extent files, and from 128 to 256 byte inodes.

- Andrew Morton had posted asking for help in testing positive return values from prepare_write. Shaggy and Suparna will look into this.
- Mapped I/O with preallocation
-- David Chinner has discussed an issue with performing mapped I/O with unwritten extents in XFS.
-- Mapped I/O can read/write and initialize unwritten extents without notifying the underlying filesystem. So an unwritten extent is never re-flagged as an initialized extent, and after the data is written to disk the extent is still flagged as unwritten. If the filesystem is remounted, reading would return zeros.
-- This problem should only apply to a cold cache. If the cache is in use, the data would be retrieved from cache.
- Mingming and Eric discussed a different method of implementing preallocation proposed by Arjan.
-- When you want to reserve or preallocate 1000 blocks, reduce the superblock counter by 1000 and add 1000 to the inode counter. As more writes are performed, the inode would decrement from the inode allocated blocks counter.
-- This could possibly be integrated with the current ext4 reservation. The reservation window would know that there are allocated but unwritten blocks in memory, only accessible when blocks have been written.
-- But using the current reservation, contiguous preallocated blocks would not be guaranteed. Having contiguous blocks is one of the requirements of the feature.
- Eric has benchmark data comparing ext3 and ext4; he will retest and post results on the mailing list.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message
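The counter-based scheme attributed to Arjan in the minutes can be modelled in a few lines. This is only a toy sketch under assumed names: struct toy_superblock, struct toy_inode and both functions are made up for illustration and do not correspond to ext4 fields or code.

```c
#include <errno.h>

/* Hypothetical counters; the names are illustrative, not ext4 fields. */
struct toy_superblock {
	long free_blocks;	/* blocks still available filesystem-wide */
};

struct toy_inode {
	long reserved_blocks;	/* blocks promised to this file, not yet written */
};

/* Move 'count' blocks from the filesystem-wide pool to the inode. */
int toy_reserve(struct toy_superblock *sb, struct toy_inode *inode, long count)
{
	if (count < 0)
		return -EINVAL;
	if (count > sb->free_blocks)
		return -ENOSPC;
	sb->free_blocks -= count;
	inode->reserved_blocks += count;
	return 0;
}

/* Account for a write of 'count' blocks: consume the reservation first. */
int toy_write(struct toy_superblock *sb, struct toy_inode *inode, long count)
{
	long from_reservation;
	long from_pool;

	if (count < 0)
		return -EINVAL;
	from_reservation = count < inode->reserved_blocks ?
			   count : inode->reserved_blocks;
	from_pool = count - from_reservation;
	if (from_pool > sb->free_blocks)
		return -ENOSPC;
	inode->reserved_blocks -= from_reservation;
	sb->free_blocks -= from_pool;
	return 0;
}
```

The model guarantees that reserved space stays available to the file even when the shared pool runs dry, but, as the minutes point out, pure counters say nothing about where the blocks end up on disk, so contiguity is not covered.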
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote:
> This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation fallocate, for persistent preallocation. The new system call, as Andrew suggested, will look like:
>
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>
> As we are developing and testing the required patches, we decided to post a preliminary patch and get inputs from the community to give it a right direction and shape. First, a little description on the feature.
>
> Persistent preallocation is a file system feature using which an application (say, relational database servers) can explicitly preallocate blocks to a particular file. This feature can be used to reserve space for a file to get mainly the following benefits:
> 1 contiguity - less defragmentation and thus faster access speed, and
> 2 guarantee for a minimum space availability (depending on how many blocks were preallocated) for the file, even if the filesystem becomes full.
>
> XFS already has an implementation for this, using an ioctl interface. And, ext4 is now coming up with this feature. In coming time we may see a few more file systems implementing this. Thus, it makes sense to have a more standard interface for this, like this new system call.
>
> Here is the initial and incomplete version of the patch, which can be used for the discussion, till we come up with a set of more complete patches.
> ---
>  arch/i386/kernel/syscall_table.S |    1 +
>  fs/ext4/file.c                   |    1 +
>  fs/open.c                        |   18 ++
>  include/asm-i386/unistd.h        |    3 ++-
>  include/linux/fs.h               |    1 +
>  include/linux/syscalls.h         |    1 +
>  6 files changed, 24 insertions(+), 1 deletion(-)

I certainly agree that we want something like this. posix_fallocate() is the glibc interface we want to be compatible with (which your definition is, AFAICS).

	Jeff

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
On Thu, Mar 01, 2007 at 03:23:19PM -0500, Jeff Garzik wrote:
> I certainly agree that we want something like this. posix_fallocate() is the glibc interface we want to be compatible with (which your definition is, AFAICS).

This would be great for Samba. Windows clients do this a lot.

Jeremy.
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote:
> +        if (inode->i_op && inode->i_op->fallocate)
> +                ret = inode->i_op->fallocate(inode, offset, len);
> +        else
> +                ret = -ENOTTY;

You can only allocate space on typewriters? ;)

	J
Re: [RFC] Heads up on sys_fallocate()
On Thu, 01 Mar 2007 13:14:32 -0800 Jeremy Fitzhardinge wrote:
> Amit K. Arora wrote:
> > +        if (inode->i_op && inode->i_op->fallocate)
> > +                ret = inode->i_op->fallocate(inode, offset, len);
> > +        else
> > +                ret = -ENOTTY;
>
> You can only allocate space on typewriters? ;)

A lot of people get confused about -ENOTTY, but it is the return for attempting to use an ioctl on the wrong type of object, so this appears to be quite correct.
Re: [RFC] Heads up on sys_fallocate()
Alan wrote:
> A lot of people get confused about -ENOTTY, but it is the return for attempting to use an ioctl on the wrong type of object, so this appears to be quite correct.

This is a syscall though; ENOSYS is probably a better match.

	J
Re: [RFC] Heads up on sys_fallocate()
Alan wrote:
> ENOSYS indicates quite different things and ENOTTY is also used for syscalls. I still think ENOTTY is correct.

Yes, ENOSYS tends to mean "operation flat out not supported" rather than "not on this object". I think we can do better than ENOTTY though - ENOTSUP for example (modulo the confusion over EOPNOTSUPP).

(You can tell the patch has very little real substance if we're arguing over errnos at this point :)

	J
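To keep the candidates straight, the following helper summarises how each errno value was characterised in the thread so far. The function name and the wording of the strings are my own; they paraphrase the posters' positions and are not taken from any standard.

```c
#include <errno.h>
#include <string.h>

/*
 * Summaries follow the positions taken in the thread, not any standard text.
 * Note: on Linux with glibc, ENOTSUP is defined as EOPNOTSUPP, so only one
 * of the two can appear as a case label here.
 */
const char *fallocate_errno_note(int err)
{
	switch (err) {
	case ENOTTY:
		return "ioctl-style: operation does not apply to this object";
	case ENOSYS:
		return "system call not implemented at all";
	case EOPNOTSUPP:
		return "operation not supported on this object";
	case EINVAL:
		return "invalid argument (already has meanings in posix_fallocate)";
	default:
		return strerror(err);
	}
}
```

Whichever value is chosen, the point made later in the thread is that userspace must be able to tell "not supported here" apart from real failures such as ENOSPC or EIO.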
Re: [RFC] Heads up on sys_fallocate()
On Fri, 2 Mar 2007 00:04:45 +0530 Amit K. Arora wrote:
> This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation fallocate, for persistent preallocation. The new system call, as Andrew suggested, will look like:
>
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

It is intended that glibc use this same syscall for both posix_fallocate() and posix_fallocate64().

I'd agree with Eric on the command flag extension.

That new argument might need to come after fd - ARM has funny requirements on syscall arg padding and layout.

> +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> +{
> +        struct file *file;
> +        struct inode *inode;
> +        long ret = -EINVAL;
> +        file = fget(fd);
> +        if (!file)
> +                goto out;
> +        inode = file->f_path.dentry->d_inode;
> +        if (inode->i_op && inode->i_op->fallocate)
> +                ret = inode->i_op->fallocate(inode, offset, len);
> +        else
> +                ret = -ENOTTY;
> +        fput(file);
> +out:
> +        return ret;
> +}

Please always put a blank line between the variable definitions and the first statement.

Please always use hard tabs, not bunch-of-spaces. This seems to be happening rather a lot in the ext4 patches. It's a trivial thing, but also trivial to fix. A grep across the diffs is needed.

ENOTTY is a bit unconventional - we often use EINVAL for this sort of thing. But EINVAL has other meanings for posix_fallocate() and isn't really appropriate here anyway. So I'm not sure what would be better...
Re: [RFC] Heads up on sys_fallocate()
> That new argument might need to come after fd - ARM has funny requirements on syscall arg padding and layout.

FYI the 32bit ppc ABI does too, from arch/powerpc/kernel/sys_ppc32.c:

/*
 * long long munging:
 * The 32 bit ABI passes long longs in an odd even register pair.
 */

and the first argument in a function call is in r3.

Anton
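The padding concern can be illustrated with a deliberately simplified slot-counting model: 4-byte arguments take one slot, 8-byte arguments take two and must start on an even slot (slot 0 being r3 on 32-bit ppc, per the comment quoted above). The function name and the model itself are assumptions made for this sketch; real ABIs also cap the number of argument registers and differ in other details.

```c
#include <stddef.h>

/*
 * Count argument slots used when 8-byte arguments must start on an even
 * slot. Simplified model only: real ABIs also limit how many registers
 * are available and differ in which register is slot 0.
 */
size_t arg_slots_used(const size_t *arg_sizes, size_t nargs)
{
	size_t slot = 0;

	for (size_t i = 0; i < nargs; i++) {
		if (arg_sizes[i] == 8) {
			if (slot % 2 != 0)
				slot++;		/* padding slot, wasted */
			slot += 2;
		} else {
			slot += 1;
		}
	}
	return slot;
}
```

Under this model (fd, offset, len) with sizes {4, 8, 8} occupies six slots, one of them padding, and (fd, cmd, offset, len) with sizes {4, 4, 8, 8} also occupies six. That is why a 32-bit command or flag argument placed right after fd costs nothing extra.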
Re: [RFC] Heads up on sys_fallocate()
On Thu, 01 Mar 2007 22:44:16 + Dave Kleikamp wrote:
> On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> > On Fri, 2 Mar 2007 00:04:45 +0530 Amit K. Arora wrote:
> > > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> > > +{
> > > +        struct file *file;
> > > +        struct inode *inode;
> > > +        long ret = -EINVAL;
> > > +        file = fget(fd);
> > > +        if (!file)
> > > +                goto out;
> > > +        inode = file->f_path.dentry->d_inode;
> > > +        if (inode->i_op && inode->i_op->fallocate)
> > > +                ret = inode->i_op->fallocate(inode, offset, len);
> > > +        else
> > > +                ret = -ENOTTY;
> > > +        fput(file);
> > > +out:
> > > +        return ret;
> > > +}
> >
> > ENOTTY is a bit unconventional - we often use EINVAL for this sort of thing. But EINVAL has other meanings for posix_fallocate() and isn't really appropriate here anyway. So I'm not sure what would be better...
>
> Would EINVAL (or whatever) make it back to the caller of posix_fallocate(), or would glibc fall back to its current implementation?
>
> Forgive me if I haven't put enough thought into it, but would it be useful to create a generic_fallocate() that writes zeroed pages for any non-existent pages in the range? I don't know how glibc currently implements posix_fallocate(), but maybe the kernel could do it more efficiently, even in generic code. Maybe we don't care, since the major file systems can probably do something better in their own code.

Given that glibc already implements fallocate for all filesystems, it will need to continue to do so for filesystems which don't implement this syscall - otherwise applications would start breaking.

However with this kernel change, glibc will need to look at the errno, so that it can correctly propagate EIO, ENOSPC and whatever. So we will need to return a reliable and stable and sensible value so that glibc knows when it should emulate and when it should propagate.

Perhaps Ulrich can comment.
Re: [RFC] Heads up on sys_fallocate()
On Fri, Mar 02, 2007 at 12:04:45AM +0530, Amit K. Arora wrote:
> This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation fallocate, for persistent preallocation. The new system call, as Andrew suggested, will look like:
>
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>
> As we are developing and testing the required patches, we decided to post a preliminary patch and get inputs from the community to give it a right direction and shape. First, a little description on the feature.

Thanks a lot, this has been long overdue. Please don't forget to Cc the XFS list to keep the developers of the only Linux filesystem supporting persistent allocations for a long time in the loop :)

Various people will beat you up for the above syscall, as lots of architectures really want 64bit arguments aligned in a proper way, e.g. you at least need a pad after 'int fd'. Then again I already have suggestions for filling up that slot with useful information:

 - you really want a whence argument, as in lseek, as it makes a lot of sense for applications to allocate from the end of the file or the current file position. The existing XFS ioctl already has this, and it's trivial to support this in any preallocation implementation I could imagine.
 - we should think about having a flag value for which kind of preallocation we want. XFS currently has two:

	ALLOCSP	which updates the inode size and physically zeroes blocks
	RESVSP	which does not update the inode size but creates an unwritten extent

   The current posix_fallocate semantics are somewhere in the middle, as it requires an update to the inode size, but does not specify at all what happens if you read from the newly allocated space.

And yes, as a heads up to developers implementing this feature on new filesystems: don't just return new blocks, that's a gaping security hole :)

> +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> +{
> +        struct file *file;
> +        struct inode *inode;
> +        long ret = -EINVAL;
> +        file = fget(fd);
> +        if (!file)
> +                goto out;
> +        inode = file->f_path.dentry->d_inode;
> +        if (inode->i_op && inode->i_op->fallocate)
> +                ret = inode->i_op->fallocate(inode, offset, len);
> +        else
> +                ret = -ENOTTY;
> +        fput(file);
> +out:
> +        return ret;
> +}

This should use fget_light, and I'm sure the code could be written in a slightly more readable way:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
{
	struct file *file = fget(fd);
	long ret = -EINVAL;

	if (file) {
		struct inode *inode = file->f_path.dentry->d_inode;

		if (inode->i_op && inode->i_op->fallocate)
			ret = inode->i_op->fallocate(inode, offset, len);
		else
			ret = -ENOTTY;
		fput(file);
	}

	return ret;
}

p.s. you reference ext4_fallocate in the patch but don't actually introduce it, so it definitely won't compile as-is :)
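The whence suggestion amounts to resolving a relative start position before allocating, exactly as lseek does. A minimal userspace sketch of that resolution step follows; resolve_prealloc_start() is a hypothetical helper written for this illustration, it ignores overflow, and it is not part of the patch or of the XFS ioctl.

```c
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */
#include <sys/types.h>	/* off_t */

/*
 * Turn a (whence, offset) pair into an absolute file offset.
 * Returns -1 for an unknown whence or a negative result.
 */
off_t resolve_prealloc_start(int whence, off_t offset,
			     off_t current_pos, off_t file_size)
{
	off_t base;

	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = current_pos;
		break;
	case SEEK_END:
		base = file_size;
		break;
	default:
		return -1;
	}
	if (base + offset < 0)
		return -1;
	return base + offset;
}
```

With SEEK_END and an offset of 0, an application can ask for space "after whatever is already there" without first having to stat the file, which is the use case described above.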
Re: [RFC] Heads up on sys_fallocate()
On Thu, Mar 01, 2007 at 10:44:16PM +, Dave Kleikamp wrote:
> Would EINVAL (or whatever) make it back to the caller of posix_fallocate(), or would glibc fall back to its current implementation?
>
> Forgive me if I haven't put enough thought into it, but would it be useful to create a generic_fallocate() that writes zeroed pages for any non-existent pages in the range? I don't know how glibc currently implements posix_fallocate(), but maybe the kernel could do it more efficiently, even in generic code. Maybe we don't care, since the major file systems can probably do something better in their own code.

I'd be happier to have the write-out-zeroes loop in glibc. And glibc needs to have it anyway, for older kernels.
Re: [RFC] Heads up on sys_fallocate()
On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty wrote:
> Just curious .. What does posix_fallocate() return ?

bookmark this: http://www.opengroup.org/onlinepubs/009695399/nfindex.html

  Upon successful completion, posix_fallocate() shall return zero; otherwise, an error number shall be returned to indicate the error.
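In practice the quoted wording means callers check the return value directly instead of testing for -1 and reading errno. A short sketch, assuming a Linux/glibc build environment; prealloc_or_report() is a name made up for this example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Call posix_fallocate() and print a diagnostic on failure.
 * The error comes back as the return value; errno is not consulted.
 */
int prealloc_or_report(int fd, off_t offset, off_t len)
{
	int err = posix_fallocate(fd, offset, len);

	if (err != 0)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
	return err;
}
```

Passing an invalid descriptor, for instance, yields a non-zero error number as the return value rather than -1.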
Re: [RFC] Heads up on sys_fallocate()
Andrew Morton wrote:
> Perhaps Ulrich can comment.

I was out of town, hence the delay.

I think that if there is no support for the syscall the correct answer is to return ENOSYS. In this case the current userlevel code would be used and ENOSYS is also used to trigger the use of the compat code in glibc in case the syscall does not exist at all.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖