[RFC] Heads up on sys_fallocate()

2007-03-01 Thread Amit K. Arora
This is a heads-up on a few patches that we will be posting soon.
These patches implement a new system call, sys_fallocate(), and a
new inode operation, fallocate, for persistent preallocation. The new
system call, as Andrew suggested, will look like:

  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

While we are still developing and testing the required patches, we decided
to post a preliminary patch and get input from the community to point it
in the right direction. First, a short description of the feature.
 
Persistent preallocation is a file system feature that lets an
application (say, a relational database server) explicitly
preallocate blocks to a particular file. Reserving space for a file
this way gives two main benefits:
1. contiguity - less fragmentation and thus faster access, and
2. a guarantee of minimum space availability (depending on how many
blocks were preallocated) for the file, even if the filesystem becomes
full.

XFS already implements this through an ioctl interface, and
ext4 is now adding the feature. In time we may see a few
more file systems implementing it. Thus, it makes sense to have a more
standard interface for this, such as this new system call.
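
For readers who want to see what preallocation looks like from an
application, here is a minimal user-space sketch built on the existing
posix_fallocate() library call. The helper name reserve_space() is
hypothetical and chosen only for illustration; it says nothing about how
the proposed kernel interface will finally look.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* for posix_fallocate() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Reserve `len` bytes, starting at offset 0, for the file at `path`
 * (created if it does not exist).  Returns 0 on success or a positive
 * error number; posix_fallocate() itself reports errors through its
 * return value rather than through errno.
 */
static int reserve_space(const char *path, off_t len)
{
	int fd, err;

	assert(path != NULL);

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return errno;

	err = posix_fallocate(fd, 0, len);
	close(fd);
	return err;
}
```

A database could call something like reserve_space("table.dat", 1 << 30)
before loading data, so that a later write into that range does not fail
for lack of space.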

Here is the initial, incomplete version of the patch, which can be
used for discussion until we come up with a more complete set of
patches.

---
 arch/i386/kernel/syscall_table.S |1 +
 fs/ext4/file.c   |1 +
 fs/open.c|   18 ++
 include/asm-i386/unistd.h|3 ++-
 include/linux/fs.h   |1 +
 include/linux/syscalls.h |1 +
 6 files changed, 24 insertions(+), 1 deletion(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_fallocate		/* 320 */
Index: linux-2.6.20.1/fs/ext4/file.c
===
--- linux-2.6.20.1.orig/fs/ext4/file.c
+++ linux-2.6.20.1/fs/ext4/file.c
@@ -135,5 +135,6 @@ struct inode_operations ext4_file_inode_
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext4_permission,
+	.fallocate	= ext4_fallocate,
 };
 
Index: linux-2.6.20.1/fs/open.c
===
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,24 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	inode = file->f_path.dentry->d_inode;
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, offset, len);
+	else
+		ret = -ENOTTY;
+	fput(file);
+out:
+        return ret;
+}
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages317
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
+#define __NR_fallocate 320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -1124,6 +1124,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	long (*fallocate)(struct inode *, loff_t, loff_t);
 };
 
 struct seq_file;
Index: linux-2.6.20.1/include/linux/syscalls.h
===
--- linux-2.6.20.1.orig/include/linux/syscalls.h
+++ linux-2.6.20.1/include/linux/syscalls.h
@@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct 
getcpu_cache __user *cache);
+asmlinkage long 

Ext4 devel interlock meeting minutes (Feb. 28, 2007)

2007-03-01 Thread Avantika Mathur

Ext4 Developer Interlock Call: 02/28/2007 Meeting Minutes

Attendees: Mingming Cao, Suparna Bhattacharya, Dave Kleikamp, Eric 
Sandeen, Takashi Sato, Avantika Mathur
Minutes can be accessed at: 
http://ext4.wiki.kernel.org/index.php/Ext4_Developer%27s_Conference_Call


Mingming sent out minutes from the Ext4 filesystem and storage workshop 
which took place two weeks ago, and will be posting these on the ext4 
wiki as well.  Mingming gave a talk and led a BOF on ext4 at the summit.

- Feel free to update or add comments to these minutes.

- One thing that was not discussed at the conference is the overall 
future plan for the Ext4 filesystem.  Many people believe that Ext4 is 
a new filesystem that will include many of the features that newer 
filesystems have, including greater scalability.  But such additions may 
need massive changes and rewrites.  Our question is: how long do we plan 
to continue to support backwards compatibility?


_PATCH STATUS_

Inode Versioning:
- Need to implement the high 32 bits for the i_version field. Andreas is 
looking at adding the new field in i_extra_isize.
- The 64 bit i_version would therefore only be available in ext4; and we 
would add the 32 bit patch to ext3.  Need to verify with NFS that this 
would be ok for them.


Nanosecond Timestamps:
- Kalpak has resent the patches
- CPU usage is a concern. Ted had suggested masking off different levels 
of granularity and testing performance at each level.


Preallocation:
- akpm suggested that we create and implement a system call for 
fallocate. Amit Arora is working on a simple patch which implements the 
system call for the i386 architecture.
- The main concern is the need to add an inode operation at the VFS layer. 
There are mixed responses about whether we should add a system call for 
preallocation. hch suggested we add a cmd parameter to the fallocate 
system call to do preallocate, unprealloc, reserve, unreserve etc.
 -- Mingming thinks it would be good to use this syscall 
for reservation as well. The current interface to reservation is an ioctl.
- Before continuing development on the system call, it is a good idea to 
discuss implementation details on lkml and linux-fsdevel.  
 -- Eric will send an email to linux-ext4 before extending the 
discussion to other lists.
 -- Mingming will ask Amit to resend patches and follow up with this 
discussion.


Online Defragmentation:
- Takashi tested his online defrag patches and found a problem that he 
is currently looking into.
- After fixing the problem he will upgrade and repost his patches.
- Need Alex's update on his mballoc patch, as this online defrag patch 
currently depends on it.
 -- Could we try to use preallocation in online defragmentation?
- In the filesystem workshop there was discussion on how locking works 
if the file being defragmented is in use.
 -- There were suggestions to do defragmentation at the directory level as 
well.  
 -- Use page cache rather than O_DIRECT to avoid complexity.


e2fsprogs Changes:
- Ted plans to support 64-bit block numbers and extents in e2fsprogs.
- This will require many changes and rewrites. We will ask Ted about the 
current status and about distributing work items.


Migration Tool:
- Suparna and Mingming are working with Aneesh Veetil to create a tool 
to migrate from regular files to extent files, and from 128-byte to 
256-byte inodes.


- Andrew Morton had posted asking for help in testing positive return 
value from prepare_write.  Shaggy and Suparna will look into this.


- Mapped I/O with preallocation
 -- David Chinner has discussed an issue with performing mapped I/O on 
unwritten extents in XFS.
 -- Mapped I/O can read, write and initialize unwritten extents without 
notifying the underlying filesystem.  So an unwritten extent is not 
converted to an initialized extent, and after the data is written to 
disk the extent is still flagged as unwritten.  If the filesystem is 
remounted, reads would return zeros.
 -- This problem should only apply to a cold cache.  If the data is still 
in the cache, it would be retrieved from there.


- Mingming and Eric discussed a different method of implementing 
preallocation proposed by Arjan
 -- When you want to reserve or preallocate 1000 blocks, reduce the 
superblock counter by 1000 and add 1000 to the inode counter.  As more 
writes are performed, the inode would decrement its allocated-blocks 
counter.  
 -- This could possibly be integrated with the current ext4 
reservation.  The reservation window would know that there are allocated 
but unwritten blocks in memory, only accessible when blocks have been 
written.  
 -- But using the current reservation, contiguous preallocated blocks 
would not be guaranteed.  Having contiguous blocks is one of the 
requirements of the feature.  

- Eric has benchmark data between ext3 and ext4; he will retest and post 
results on the mailing list.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message 

Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Jeff Garzik

Amit K. Arora wrote:

> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation fallocate, for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
>
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>
> [...]


I certainly agree that we want something like this.

posix_fallocate() is the glibc interface we want to be compatible with 
(which your definition is, AFAICS).


Jeff



-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Jeremy Allison
On Thu, Mar 01, 2007 at 03:23:19PM -0500, Jeff Garzik wrote:
> I certainly agree that we want something like this.
>
> posix_fallocate() is the glibc interface we want to be compatible with 
> (which your definition is, AFAICS).

This would be great for Samba. Windows clients do this a lot

Jeremy.


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Jeremy Fitzhardinge
Amit K. Arora wrote:
> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, offset, len);
> +	else
> +		ret = -ENOTTY;

You can only allocate space on typewriters? ;)

J


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Alan
On Thu, 01 Mar 2007 13:14:32 -0800
Jeremy Fitzhardinge [EMAIL PROTECTED] wrote:

> Amit K. Arora wrote:
> > +	if (inode->i_op && inode->i_op->fallocate)
> > +		ret = inode->i_op->fallocate(inode, offset, len);
> > +	else
> > +		ret = -ENOTTY;
>
> You can only allocate space on typewriters? ;)

A lot of people get confused about -ENOTTY, but it is the return for
attempting to use an ioctl on the wrong type of object, so this appears
to be quite correct.


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Jeremy Fitzhardinge
Alan wrote:
> A lot of people get confused about -ENOTTY, but it is the return for
> attempting to use an ioctl on the wrong type of object, so this appears
> to be quite correct.

This is a syscall though; ENOSYS is probably a better match.

J


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Jeremy Fitzhardinge
Alan wrote:
> ENOSYS indicates quite different things and ENOTTY is also used for
> syscalls. I still think ENOTTY is correct.

Yes, ENOSYS tends to mean "operation flat out not supported" rather than
"not on this object".  I think we can do better than ENOTTY though -
ENOTSUP for example (modulo the confusion over EOPNOTSUPP).

(You can tell the patch has very little real substance if we're arguing
over errnos at this point :)

J


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Andrew Morton
On Fri, 2 Mar 2007 00:04:45 +0530
Amit K. Arora [EMAIL PROTECTED] wrote:

> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation fallocate, for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
>
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

It is intended that glibc use this same syscall for both posix_fallocate()
and posix_fallocate64().

I'd agree with Eric on the command flag extension.

That new argument might need to come after fd - ARM has funny requirements on
syscall arg padding and layout.

> +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> +{
> +	struct file *file;
> +	struct inode *inode;
> +	long ret = -EINVAL;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +	inode = file->f_path.dentry->d_inode;
> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, offset, len);
> +	else
> +		ret = -ENOTTY;
> +	fput(file);
> +out:
> +        return ret;
> +}

Please always put a blank line between the variable definitions and the
first statement.

Please always use hard tabs, not bunch-of-spaces.  This seems to be happening
rather a lot in the ext4 patches.  It's a trivial thing, but also trivial
to fix.  A grep across the diffs is needed.

ENOTTY is a bit unconventional - we often use EINVAL for this sort of
thing.  But EINVAL has other meanings for posix_fallocate() and isn't
really appropriate here anyway.  So I'm not sure what would be better...



Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Anton Blanchard

> That new argument might need to come after fd - ARM has funny
> requirements on syscall arg padding and layout.

FYI the 32bit ppc ABI does too, from arch/powerpc/kernel/sys_ppc32.c:

/*
 * long long munging:
 * The 32 bit ABI passes long longs in an odd even register pair.
 */

and the first argument in a function call is in r3.
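
To make the padding cost concrete, here is a small sketch that assigns
argument registers under the rule quoted above: arguments start at r3 and
a 64-bit value must begin in an odd-numbered register. The helper name
and the upper limit of r10 are assumptions made for illustration; this is
not kernel code.

```c
#include <assert.h>
#include <stddef.h>

enum arg_kind { ARG_32, ARG_64 };

#define FIRST_ARG_REG	3	/* r3, per the note above */
#define LAST_ARG_REG	10	/* assumed upper limit for register args */

/*
 * Assign registers to each argument in order.  first_reg[i] receives
 * the first register used by argument i.  Returns the number of
 * registers skipped as padding, or -1 if the arguments do not fit.
 */
static int assign_regs(const enum arg_kind *args, size_t nargs,
		       int *first_reg)
{
	int reg = FIRST_ARG_REG;
	int pads = 0;
	size_t i;

	assert(args != NULL && first_reg != NULL);

	for (i = 0; i < nargs; i++) {
		int width = (args[i] == ARG_64) ? 2 : 1;

		/* A 64-bit value must start in an odd register. */
		if (args[i] == ARG_64 && (reg % 2) == 0) {
			reg++;
			pads++;
		}
		if (reg + width - 1 > LAST_ARG_REG)
			return -1;
		first_reg[i] = reg;
		reg += width;
	}
	return pads;
}
```

With (int fd, loff_t offset, loff_t len), fd lands in r3, r4 is skipped,
and the two 64-bit values take r5/r6 and r7/r8. Putting a second 32-bit
argument right after fd fills r4, so nothing is wasted.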

Anton


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Andrew Morton
On Thu, 01 Mar 2007 22:44:16 +
Dave Kleikamp [EMAIL PROTECTED] wrote:

> On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> > On Fri, 2 Mar 2007 00:04:45 +0530
> > Amit K. Arora [EMAIL PROTECTED] wrote:
>
> > > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> > > +{
> > > +	struct file *file;
> > > +	struct inode *inode;
> > > +	long ret = -EINVAL;
> > > +	file = fget(fd);
> > > +	if (!file)
> > > +		goto out;
> > > +	inode = file->f_path.dentry->d_inode;
> > > +	if (inode->i_op && inode->i_op->fallocate)
> > > +		ret = inode->i_op->fallocate(inode, offset, len);
> > > +	else
> > > +		ret = -ENOTTY;
> > > +	fput(file);
> > > +out:
> > > +        return ret;
> > > +}
> >
>
> > ENOTTY is a bit unconventional - we often use EINVAL for this sort of
> > thing.  But EINVAL has other meanings for posix_fallocate() and isn't
> > really appropriate here anyway.  So I'm not sure what would be better...
>
> Would EINVAL (or whatever) make it back to the caller of
> posix_fallocate(), or would glibc fall back to its current
> implementation?
>
> Forgive me if I haven't put enough thought into it, but would it be
> useful to create a generic_fallocate() that writes zeroed pages for any
> non-existent pages in the range?  I don't know how glibc currently
> implements posix_fallocate(), but maybe the kernel could do it more
> efficiently, even in generic code.  Maybe we don't care, since the major
> file systems can probably do something better in their own code.

Given that glibc already implements fallocate for all filesystems, it will
need to continue to do so for filesystems which don't implement this
syscall - otherwise applications would start breaking.

However with this kernel change, glibc will need to look at the errno,
so that it can correctly propagate EIO, ENOSPC and whatever.  So we will
need to return a reliable and stable and sensible value so that glibc knows
when it should emulate and when it should propagate.

Perhaps Ulrich can comment.


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Christoph Hellwig
On Fri, Mar 02, 2007 at 12:04:45AM +0530, Amit K. Arora wrote:
> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation fallocate, for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
>
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>
> As we are developing and testing the required patches, we decided to
> post a preliminary patch and get inputs from the community to give it
> a right direction and shape. First, a little description on the feature.

Thanks a lot, this has been long overdue.

Please don't forget to Cc the XFS list to keep the developers of what has
long been the only Linux filesystem supporting persistent allocations in
the loop :)

Various people will beat you up for the above syscall as lots of
architectures really want 64bit arguments aligned in a proper way,
e.g. you at least need a pad after 'int fd'.  Then again I already
have suggestions for filling up that slot with useful information:

 - you really want a whence argument like lseek's, as it makes a lot
   of sense for applications to allocate from the end of the file
   or the current file position.  The existing XFS ioctl already
   has this, and it's trivial to support this in any preallocation
   implementation I could imagine.
 - we should think about having a flag value for which kind of preallocation
   we want.  XFS currently has two:

	ALLOCSP which updates the inode size and physically zeroes blocks
	RESVSP which does not update inode size but creates an unwritten
	       extent

   the current posix_fallocate semantics are somewhere in the middle, as
   it requires an update to the inode size, but does not specify at
   all what happens if you read from the newly allocated space.
   And yes, as a heads up to developers implementing this feature
   on new filesystems: don't just return new blocks, that's a gaping
   security hole :)
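
As a rough illustration of the whence suggestion, the sketch below
resolves an lseek-style (whence, offset) pair into an absolute start
offset. The function name and the idea that the caller supplies the
current position and file size are assumptions for illustration only;
nothing here reflects the XFS ioctl's actual layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/*
 * Turn (whence, offset) into an absolute byte offset, the way lseek()
 * interprets them.  Returns the offset, or a negative errno value.
 */
static int64_t resolve_start(int whence, int64_t offset,
			     int64_t cur_pos, int64_t file_size)
{
	int64_t base;

	assert(cur_pos >= 0 && file_size >= 0);

	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = cur_pos;
		break;
	case SEEK_END:
		base = file_size;
		break;
	default:
		return -EINVAL;
	}

	/* base is non-negative, so only a positive offset can overflow. */
	if (offset > 0 && base > INT64_MAX - offset)
		return -EOVERFLOW;
	if (base + offset < 0)
		return -EINVAL;
	return base + offset;
}
```

An application appending to a log could then ask for space starting at
resolve_start(SEEK_END, 0, pos, size) without first querying the file
size itself.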

> +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> +{
> +	struct file *file;
> +	struct inode *inode;
> +	long ret = -EINVAL;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +	inode = file->f_path.dentry->d_inode;
> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, offset, len);
> +	else
> +		ret = -ENOTTY;
> +	fput(file);
> +out:
> +        return ret;
> +}

This should use fget_light, and I'm sure the code could be written
in a slightly more readable way:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
{
	struct file *file = fget(fd);
	long ret = -EINVAL;

	if (file) {
		struct inode *inode = file->f_path.dentry->d_inode;
		if (inode->i_op && inode->i_op->fallocate)
			ret = inode->i_op->fallocate(inode, offset, len);
		else
			ret = -ENOTTY;
		fput(file);
	}

	return ret;
}

p.s. you reference ext4_fallocate in the patch but don't actually
introduce it, so it definitely won't compile as-is :)


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Christoph Hellwig
On Thu, Mar 01, 2007 at 10:44:16PM +, Dave Kleikamp wrote:
> Would EINVAL (or whatever) make it back to the caller of
> posix_fallocate(), or would glibc fall back to its current
> implementation?
>
> Forgive me if I haven't put enough thought into it, but would it be
> useful to create a generic_fallocate() that writes zeroed pages for any
> non-existent pages in the range?  I don't know how glibc currently
> implements posix_fallocate(), but maybe the kernel could do it more
> efficiently, even in generic code.  Maybe we don't care, since the major
> file systems can probably do something better in their own code.

I'd be happier to have the write-out-zeroes loop in glibc.  And
glibc needs to have it anyway, for older kernels.
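
For illustration, a user-space write-out-zeroes loop could look roughly
like the sketch below. It is not glibc's implementation: the function
name is hypothetical, and to stay short it only extends the file past its
current end and does not fill holes inside existing data or guard against
offset + len overflowing off_t.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for pwrite() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Make sure the byte range [offset, offset + len) is backed by data by
 * writing zeros past the current end of file.  Existing bytes are never
 * overwritten.  Returns 0 or a positive error number.
 */
static int emulate_fallocate(int fd, off_t offset, off_t len)
{
	static const char zeros[4096];
	struct stat st;
	off_t pos, end;

	assert(fd >= 0);

	if (offset < 0 || len <= 0)
		return EINVAL;
	if (fstat(fd, &st) != 0)
		return errno;

	end = offset + len;
	pos = offset > st.st_size ? offset : st.st_size;

	while (pos < end) {
		size_t chunk = sizeof(zeros);
		ssize_t n;

		if ((off_t)chunk > end - pos)
			chunk = (size_t)(end - pos);

		n = pwrite(fd, zeros, chunk, pos);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return errno;	/* e.g. ENOSPC or EIO */
		}
		if (n == 0)
			return EIO;	/* no progress; avoid looping forever */
		pos += n;
	}
	return 0;
}
```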


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Andrew Morton
On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote:

> Just curious .. What does posix_fallocate() return ?

bookmark this:

http://www.opengroup.org/onlinepubs/009695399/nfindex.html

Upon successful completion, posix_fallocate() shall return zero;
otherwise, an error number shall be returned to indicate the error.



Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Ulrich Drepper
Andrew Morton wrote:
> Perhaps Ulrich can comment.

I was out of town, hence the delay.

I think that if there is no support for the syscall the correct answer
is to return ENOSYS.  In this case the current userlevel code would be
used and ENOSYS is also used to trigger the use of the compat code in
glibc in case the syscall does not exist at all.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


