Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote: I suppose it might be a bit late in the game to add a goal parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make the API more suitable for XFS? The goal could be a single __u64, or a struct with e.g. __u64 byte offset (possibly also __u32 lun like in FIEMAP). I guess the one potential limitation here is the number of function parameters on some architectures. This isn't really about more suitable for XFS but more about more suitable for sophisticated layout decisions. But I'm still not confident this should be shohorned into this syscall. In fact I'm already rather unhappy about the feature churn in the current patch series. The more I think about it the more I'd prefer we would just put a simple syscall in that implements nothing but the posix_fallocate(3) semantics as defined in SuS, and then go on to brainstorm about advanced preallocation / layout hint semantics. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote: On Jun 14, 2007 09:52 +1000, David Chinner wrote: B FA_PREALLOCATE provides the same functionality as B FA_ALLOCATE except it does not ever change the file size. This allows allocation of zero blocks beyond the end of file and is useful for optimising append workloads. TP B FA_DEALLOCATE removes the underlying disk space with the given range. The disk space shall be removed regardless of it's contents so both allocated space from B FA_ALLOCATE and B FA_PREALLOCATE as well as from B write(3) will be removed. B FA_DEALLOCATE shall never remove disk blocks outside the range specified. So this is essentially the same as punch. Depends on your definition of punch. There doesn't seem to be a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the end. ftruncate() B FA_DEALLOCATE shall never change the file size. If changing the file size is required when deallocating blocks from an offset to end of file (or beyond end of file) is required, B ftuncate64(3) should be used. This also seems to be a bit of a wart, since it isn't a natural converse of either of the above functions. How about having two modes, similar to FA_ALLOCATE and FA_PREALLOCATE? shrug whatever. Say, FA_PUNCH (which would be as you describe here - deletes all data in the specified range changing the file size if it overlaps EOF, Punch means different things to different people. To me (and probably most XFS aware ppl) punch implies no change to the file size. i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching holes to leave the file size unchanged. This is the behaviour I described for FA_DEALLOCATE. and FA_DEALLOCATE, which only deallocates unused FA_{PRE,}ALLOCATE space? That's an unwritten-to-hole extent conversion. Is that really useful for anything? That's easily implemented with FIEMAP and FA_DEALLOCATE. Anyway, because we can't agree on a single pair of flags: FA_ALLOCATE== posix_fallocate() FA_DEALLOCATE == unwritten-to-hole ??? FA_RESV_SPACE == XFS_IOC_RESVSP64 FA_UNRESV_SPACE== XFS_IOC_UNRESVSP64 We might also consider making @mode be a mask instead of an enumeration: FA_FL_DEALLOC 0x01 (default allocate) FA_FL_KEEP_SIZE 0x02 (default extend/shrink size) FA_FL_DEL_DATA0x04 (default keep written data on DEALLOC) i.e: #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE #define FA_UNRESV_SPACE FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA I suppose it might be a bit late in the game to add a goal parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make the API more suitable for XFS? It would suffice for the simpler operations, I think, but we'll rapidly run out of flags and we'll still need another interface for doing complex stuff. The goal could be a single __u64, or a struct with e.g. __u64 byte offset (possibly also __u32 lun like in FIEMAP). I guess the one potential limitation here is the number of function parameters on some architectures. To be useful it needs to __u64. B ENOSPC There is not enough space left on the device containing the file referred to by IR fd. Should probably say whether space is removed on failure or not. In Right. I'd say on error you need to FA_DEALLOCATE to ensure any space allocated was freed back up. That way the error handling in the allocate functions is much simpler (i.e. no need to undo there). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote: Did you get time to write the above man page ? It will help to push further patches in time (eg. for FA_PREALLOCATE mode). First pass is attached. `nroff -man fallocate.2 | less` to view. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group .TH fallocate 2 .SH NAME fallocate \- allocate or remove file space .SH SYNOPSIS .nf .B #include sys/syscall.h .PP .BI int syscall(int, int fd, int mode, loff_t offset, loff_t len); .Op .SH DESCRIPTION The .BR fallocate syscall allows a user to directly manipulate the allocated disk space for the file referred to by .I fd for the byte range starting at .IR offset and continuing for .IR len bytes. The .I mode parameter determines the operation to be performed on the given range. Currently there are three modes: .TP .B FA_ALLOCATE allocates and initialises to zero the disk space within the given range. After a successful call, subsequent writes are guaranteed not to fail because of lack of disk space. If the size of the file is less than IR offset + len , then the file is increased to this size; otherwise the file size is left unchanged. B FA_ALLOCATE closely resembles B posix_fallocate(3) and is intended as a method of optimally implementing this function. B FA_ALLOCATE may allocate a larger range that was specified. TP B FA_PREALLOCATE provides the same functionality as B FA_ALLOCATE except it does not ever change the file size. This allows allocation of zero blocks beyond the end of file and is useful for optimising append workloads. TP B FA_DEALLOCATE removes the underlying disk space with the given range. The disk space shall be removed regardless of it's contents so both allocated space from B FA_ALLOCATE and B FA_PREALLOCATE as well as from B write(3) will be removed. B FA_DEALLOCATE shall never remove disk blocks outside the range specified. B FA_DEALLOCATE shall never change the file size. If changing the file size is required when deallocating blocks from an offset to end of file (or beyond end of file) is required, B ftuncate64(3) should be used. SH RETURN VALUE BR fallocate() returns zero on success, or an error number on failure. Note that IR errno is not set. SH ERRORS TP B EBADF I fd is not a valid file descriptor, or is not opened for writing. TP B EFBIG I offset+len exceeds the maximum file size. TP B EINVAL I offset or I len was less than 0. TP B ENODEV I fd does not refer to a regular file or a directory. TP B ENOSPC There is not enough space left on the device containing the file referred to by IR fd. TP B ESPIPE I fd refers to a pipe of file descriptor. B ENOSYS The filesystem underlying the file descriptor does not support this operation. SH AVAILABILITY The BR fallocate () system call is available since 2.6.XX SH SEE ALSO BR syscall (2), BR posix_fadvise (3) BR ftruncate (3)
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote: On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote: On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: All I'm really interested in right now is that the fallocate _interface_ can be used as a *complete replacement* for the pre-existing XFS-specific ioctls that are already used by applications. What ext4 can or can't do right now is irrelevant to this discussion - the interface definition needs to take priority over implementation Would you like to write up an interface definition description (likely man page) and post it for review, possibly with a mention of apps using it today ? Yeah, I started doing that yesterday as i figured it was the only way to cut the discussion short One reason for introducing the mode parameter was to allow the interface to evolve incrementally as more options / semantic questions are proposed, so that we don't have to make all the decisions right now. So it would be good to start with a *minimal* definition, even just one mode. The rest could follow as subsequent patches, each being reviewed and debated separately. Otherwise this discussion can drag on for a long time. Minimal definition to replace what applicaitons use on XFS and to support poasix_fallocate are the thre that have been mentioned so far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them all in a man page... Hi Dave, Did you get time to write the above man page ? It will help to push further patches in time (eg. for FA_PREALLOCATE mode). The idea I had was to push the patch with bare minimum functionality (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other new mode(s) based on the man page you planned to provide. Thanks! -- Regards, Amit Arora Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote: On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote: Minimal definition to replace what applicaitons use on XFS and to support poasix_fallocate are the thre that have been mentioned so far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them all in a man page... Hi Dave, Did you get time to write the above man page ? It will help to push further patches in time (eg. for FA_PREALLOCATE mode). No, I didn't. Instead of working on new preallocation stuff, I've been spending all my time fixing bugs found by new and interesting (ab)uses of preallocation and hole punching. The idea I had was to push the patch with bare minimum functionality (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other new mode(s) based on the man page you planned to provide. Push them. I'll just make XFS work with whatever is provided. Is there a test harness for the syscall yet? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote: On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: All I'm really interested in right now is that the fallocate _interface_ can be used as a *complete replacement* for the pre-existing XFS-specific ioctls that are already used by applications. What ext4 can or can't do right now is irrelevant to this discussion - the interface definition needs to take priority over implementation Would you like to write up an interface definition description (likely man page) and post it for review, possibly with a mention of apps using it today ? Yeah, I started doing that yesterday as i figured it was the only way to cut the discussion short One reason for introducing the mode parameter was to allow the interface to evolve incrementally as more options / semantic questions are proposed, so that we don't have to make all the decisions right now. So it would be good to start with a *minimal* definition, even just one mode. The rest could follow as subsequent patches, each being reviewed and debated separately. Otherwise this discussion can drag on for a long time. Minimal definition to replace what applicaitons use on XFS and to support poasix_fallocate are the thre that have been mentioned so far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them all in a man page... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote: On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: I have the updated patches ready which take care of Andrew's comments. Will run some tests and post them soon. But, before submitting these patches, I think it will be better to finalize on certain things which might be worth some discussion here: 1) Should the file size change when preallocation is done beyond EOF ? - Andreas and Chris Wedgwood are in favor of not changing the file size in this case. I also tend to agree with them. Does anyone has an argument in favor of changing the filesize ? If not, I will remove the code which changes the filesize, before I resubmit the concerned ext4 patch. I think there needs to be both. If we don't have a mechanism to atomically change the file size with the preallocation, then applications that use stat() to work out if they need to preallocate more space will end up racing. By both above, do you mean we should give user the flexibility if it wants the filesize changed or not ? It can be done by having *two* modes for preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() will change the filesize if required (i.e. when allocation is beyond EOF) and also update [cm]time. This way, the application can decide what it wants. Yes, that's right. This will be helpfull for the partial allocation scenario also. Think of the case when we do not change the filesize in fallocate() and expect applications/posix_fallocate() to do ftruncate() after fallocate() for this. Now if fallocate() results in a partial allocation with -ENOSPC error returned, applications/posix_fallocate() will not know for what length ftruncate() has to be called. :( Well, posix_fallocate() either gets all the space or it fails. If you truncate to extend the file size after an ENOSPC, then that is a buggy implementation. The same could be said for any application, or even the fallocate() call itself if it changes the filesize without having completely preallocated the space asked Hence it may be a good idea to give user the flexibility if it wants to atomically change the file size with preallocation or not. But, with more flexibility there comes inconsistency in behavior, which is worth considering. We've got different modes to specify different behaviour. That's what the mode field was put there for in the first place - the interface is *designed* to support different preallocation behaviours 2) For FA_UNALLOCATE mode, should the file system allow unallocation of normal (non-preallocated) blocks (blocks allocated via regular write/truncate operations) also (i.e. work as punch()) ? Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what i did for FA_UNALLOCATE as well. Ok. But, some people may not expect/like this. I think, we can keep it on the backburner for a while, till other issues are sorted out. How can it be a backburner issue when it defines the implementation? I've already implemented some thing in XFS that sort of does what I think that the interface is supposed to do, but I need that interface to be nailed down before proceeding any further. All I'm really interested in right now is that the fallocate _interface_ can be used as a *complete replacement* for the pre-existing XFS-specific ioctls that are already used by applications. What ext4 can or can't do right now is irrelevant to this discussion - the interface definition needs to take priority over implementation Would you like to write up an interface definition description (likely man page) and post it for review, possibly with a mention of apps using it today ? One reason for introducing the mode parameter was to allow the interface to evolve incrementally as more options / semantic questions are proposed, so that we don't have to make all the decisions right now. So it would be good to start with a *minimal* definition, even just one mode. The rest could follow as subsequent patches, each being reviewed and debated separately. Otherwise this discussion can drag on for a long time. Regards Suparna Cheers, Dave, -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: I have the updated patches ready which take care of Andrew's comments. Will run some tests and post them soon. But, before submitting these patches, I think it will be better to finalize on certain things which might be worth some discussion here: 1) Should the file size change when preallocation is done beyond EOF ? - Andreas and Chris Wedgwood are in favor of not changing the file size in this case. I also tend to agree with them. Does anyone has an argument in favor of changing the filesize ? If not, I will remove the code which changes the filesize, before I resubmit the concerned ext4 patch. I think there needs to be both. If we don't have a mechanism to atomically change the file size with the preallocation, then applications that use stat() to work out if they need to preallocate more space will end up racing. By both above, do you mean we should give user the flexibility if it wants the filesize changed or not ? It can be done by having *two* modes for preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() will change the filesize if required (i.e. when allocation is beyond EOF) and also update [cm]time. This way, the application can decide what it wants. This will be helpfull for the partial allocation scenario also. Think of the case when we do not change the filesize in fallocate() and expect applications/posix_fallocate() to do ftruncate() after fallocate() for this. Now if fallocate() results in a partial allocation with -ENOSPC error returned, applications/posix_fallocate() will not know for what length ftruncate() has to be called. :( Hence it may be a good idea to give user the flexibility if it wants to atomically change the file size with preallocation or not. But, with more flexibility there comes inconsistency in behavior, which is worth considering. 2) For FA_UNALLOCATE mode, should the file system allow unallocation of normal (non-preallocated) blocks (blocks allocated via regular write/truncate operations) also (i.e. work as punch()) ? Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what i did for FA_UNALLOCATE as well. Ok. But, some people may not expect/like this. I think, we can keep it on the backburner for a while, till other issues are sorted out. - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still we need to finalize on the convention here as a general guideline to all the filesystems that implement fallocate. 3) If above is true, the file size will need to be changed for unallocation when block holding the EOF gets unallocated. No - we punch a hole. If you want the filesize to change, then you use ftruncate() to remove the blocks at EOF and change the file size atomically. Ok. 4) Should we update mtime ctime on a successfull allocation/ unallocation ? - David Chinner raised this question in following post: http://lkml.org/lkml/2007/4/29/407 I think it makes sense to update the [mc]time for a successfull preallocation/unallocation. Does anyone feel otherwise ? It will be interesting to know how XFS behaves currently. Does XFS update [mc]time for preallocation ? No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size changes. If the filesize changes, it behaves exactly the same way that ftruncate() behaves. Having additional mode (of FA_PREALLOCATE) might help here too. Please see above. -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote: Andrew Morton writes: On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This patch implements the fallocate() system call and adds support for i386, x86_64 and powerpc. ... +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Please add a comment over this function which specifies its behaviour. Really it should be enough material from which a full manpage can be written. This looks like it will have the same problem on s390 as sys_sync_file_range. Maybe the prototype should be: asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Yes, but the trouble is that there was a contrary viewpoint preferring that fd first be maintained as a convention like other syscalls (see the following posts) http://marc.info/?l=linux-fsdevelm=117585330016809w=2 (Andreas) http://marc.info/?l=linux-fsdevelm=117690157917378w=2 (Andreas) http://marc.info/?l=linux-fsdevelm=117578821827323w=2 (Randy) So we are kind of deadlocked, aren't we ? The debates on the proposed solution for s390 http://marc.info/?l=linux-fsdevelm=117760995610639w=2 http://marc.info/?l=linux-fsdevelm=117708124913098w=2 http://marc.info/?l=linux-fsdevelm=117767607229807w=2 Are there any better ideas ? Regards Suparna Paul. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Suparna Bhattacharya writes: This looks like it will have the same problem on s390 as sys_sync_file_range. Maybe the prototype should be: asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Yes, but the trouble is that there was a contrary viewpoint preferring that fd first be maintained as a convention like other syscalls (see the following posts) Of course the interface used by an application program would have the fd first. Glibc can do the translation. Paul. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote: Suparna Bhattacharya writes: This looks like it will have the same problem on s390 as sys_sync_file_range. Maybe the prototype should be: asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Yes, but the trouble is that there was a contrary viewpoint preferring that fd first be maintained as a convention like other syscalls (see the following posts) Of course the interface used by an application program would have the fd first. Glibc can do the translation. I think that was understood. Regards Suparna Paul. -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote: Suparna Bhattacharya writes: Of course the interface used by an application program would have the fd first. Glibc can do the translation. I think that was understood. OK, then what does it matter what the glibc/kernel interface is, as long as it works? It's only a minor point; the order of arguments can vary between architectures if necessary, but it's nicer if they don't have to. 32-bit powerpc will need to have the two int arguments adjacent in order to avoid using more than 6 argument registers at the user/kernel boundary, and s390 will need to avoid having a 64-bit argument last (if I understand it correctly). You are right to say that. But, it may not be _that_ a minor point, especially for the arch which is getting affected. It has other implications like what Heiko noticed in his post below: http://lkml.org/lkml/2007/4/27/377 - implications like modifying glibc and *trace utilities for a particular arch. -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
I have the updated patches ready which take care of Andrew's comments. Will run some tests and post them soon. But, before submitting these patches, I think it will be better to finalize on certain things which might be worth some discussion here: 1) Should the file size change when preallocation is done beyond EOF ? - Andreas and Chris Wedgwood are in favor of not changing the file size in this case. I also tend to agree with them. Does anyone has an argument in favor of changing the filesize ? If not, I will remove the code which changes the filesize, before I resubmit the concerned ext4 patch. 2) For FA_UNALLOCATE mode, should the file system allow unallocation of normal (non-preallocated) blocks (blocks allocated via regular write/truncate operations) also (i.e. work as punch()) ? - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still we need to finalize on the convention here as a general guideline to all the filesystems that implement fallocate. 3) If above is true, the file size will need to be changed for unallocation when block holding the EOF gets unallocated. - If we do not unallocate normal (non-preallocated) blocks and we do not change the file size on preallocation, then this is a non-issue. 4) Should we update mtime ctime on a successfull allocation/ unallocation ? - David Chinner raised this question in following post: http://lkml.org/lkml/2007/4/29/407 I think it makes sense to update the [mc]time for a successfull preallocation/unallocation. Does anyone feel otherwise ? It will be interesting to know how XFS behaves currently. Does XFS update [mc]time for preallocation ? -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On May 09, 2007 21:31 +0530, Amit K. Arora wrote: 2) For FA_UNALLOCATE mode, should the file system allow unallocation of normal (non-preallocated) blocks (blocks allocated via regular write/truncate operations) also (i.e. work as punch()) ? - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still we need to finalize on the convention here as a general guideline to all the filesystems that implement fallocate. I would only allow this on FA_ALLOCATE extents. That means it won't be possible to do this for filesystems that don't understand unwritten extents unless there are blocks allocated beyond EOF. 3) If above is true, the file size will need to be changed for unallocation when block holding the EOF gets unallocated. - If we do not unallocate normal (non-preallocated) blocks and we do not change the file size on preallocation, then this is a non-issue. Not necessarily. That will just make the file sparse. If FA_ALLOCATE does not change the file size, why should FA_UNALLOCATE. 4) Should we update mtime ctime on a successfull allocation/ unallocation ? I would say yes. If glibc does the fallback fallocate via write() the mtime/ctime will be updated, so it makes sense to be consistent for both methods. Also, it just makes sense from the this file was modified point of view. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote: I have the updated patches ready which take care of Andrew's comments. Will run some tests and post them soon. But, before submitting these patches, I think it will be better to finalize on certain things which might be worth some discussion here: 1) Should the file size change when preallocation is done beyond EOF ? - Andreas and Chris Wedgwood are in favor of not changing the file size in this case. I also tend to agree with them. Does anyone has an argument in favor of changing the filesize ? If not, I will remove the code which changes the filesize, before I resubmit the concerned ext4 patch. If we chose not to update the file size beyong EOF, then for filesystem without fallocate() support (ext2,3 currently), posix_fallocate() will follow the hard way(zero-out) to do preallocation. Then we will get different behavior on filesystems w/o fallocate() support. It make sense to be consistent, IMO. My point of view, preallocation is just a efficient way to allocating blocks for files without zero-out, other than this, the new behavior should be consistent with the old way: file size update,mtime/ctime, ENOSPC etc. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: I have the updated patches ready which take care of Andrew's comments. Will run some tests and post them soon. But, before submitting these patches, I think it will be better to finalize on certain things which might be worth some discussion here: 1) Should the file size change when preallocation is done beyond EOF ? - Andreas and Chris Wedgwood are in favor of not changing the file size in this case. I also tend to agree with them. Does anyone has an argument in favor of changing the filesize ? If not, I will remove the code which changes the filesize, before I resubmit the concerned ext4 patch. I think there needs to be both. If we don't have a mechanism to atomically change the file size with the preallocation, then applications that use stat() to work out if they need to preallocate more space will end up racing. 2) For FA_UNALLOCATE mode, should the file system allow unallocation of normal (non-preallocated) blocks (blocks allocated via regular write/truncate operations) also (i.e. work as punch()) ? Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what i did for FA_UNALLOCATE as well. - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still we need to finalize on the convention here as a general guideline to all the filesystems that implement fallocate. 3) If above is true, the file size will need to be changed for unallocation when block holding the EOF gets unallocated. No - we punch a hole. If you want the filesize to change, then you use ftruncate() to remove the blocks at EOF and change the file size atomically. 4) Should we update mtime ctime on a successfull allocation/ unallocation ? - David Chinner raised this question in following post: http://lkml.org/lkml/2007/4/29/407 I think it makes sense to update the [mc]time for a successfull preallocation/unallocation. Does anyone feel otherwise ? It will be interesting to know how XFS behaves currently. Does XFS update [mc]time for preallocation ? No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size changes. If the filesize changes, it behaves exactly the same way that ftruncate() behaves. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Andrew, Thanks for the review comments! On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This patch implements the fallocate() system call and adds support for i386, x86_64 and powerpc. ... +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Please add a comment over this function which specifies its behaviour. Really it should be enough material from which a full manpage can be written. If that's all too much, this material should at least be spelled out in the changelog. Because there's no way in which this change can be fully reviewed unless someone (ie: you) tells us what it is setting out to achieve. If we 100% implement some standard then a URL for what we claim to implement would suffice. Given that we're at least using different types from posix I doubt if such a thing would be sufficient. And given the complexity and potential variability within the filesystem implementations of this, I'd expect that _something_ additional needs to be said? Ok. I will add a detailed comment here. +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (len == 0 || offset 0) + goto out; The posix spec implies that negative `len' is permitted - presumably allocate ahead of `offset'. How peculiar. I think we should go ahead with current glibc implementation (which Jakub poited at) of not allowing a negative 'len', since posix also doesn't explicitly say anything about allowing negative 'len'. + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file-f_mode FMODE_WRITE)) + goto out_fput; + + inode = file-f_path.dentry-d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode-i_mode)) + goto out_fput; + + ret = -ENODEV; + if (!S_ISREG(inode-i_mode)) + goto out_fput; So we return ENODEV against an S_ISBLK fd, as per the posix spec. That seems a bit silly of them. True. + ret = -EFBIG; + if (offset + len inode-i_sb-s_maxbytes) + goto out_fput; This code does handle offset+len going negative, but only by accident, I suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment here would settle the reader's mind. Ok. I will add a check here for wrap though zero. + if (inode-i_op inode-i_op-fallocate) + ret = inode-i_op-fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; If we _are_ going to support negative `len', as posix suggests, I think we should perform the appropriate sanity conversions to `offset' and `len' right here, rather than expecting each filesystem to do it. If we're not going to handle negative `len' then we should check for it. Will add a check for negative 'len' and return -EINVAL. This will be done where currently we check for negative offset (i.e. at the start of the function). +out_fput: + fput(file); +out: + return ret; +} +EXPORT_SYMBOL(sys_fallocate); I don't believe this needs to be exported to modules? Ok. Will remove it. +/* + * fallocate() modes + */ +#define FA_ALLOCATE0x1 +#define FA_DEALLOCATE 0x2 Now those aren't in posix. They should be documented, along with their expected semantics. Will add a comment describing the role of these modes. #ifdef __KERNEL__ #include linux/linkage.h @@ -1125,6 +1131,7 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + long (*fallocate)(struct inode *, int, loff_t, loff_t); I really do think it's better to put the variable names in definitions such as this. Especially when we have two identically-typed variables next to each other like that. Quick: which one is the offset and which is the length? Ok. Will add the variable names here. -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: The above opengroup page only permits S_ISREG. Preallocating directories sounds quite useful to me, although it's something which would be pretty hard to emulate if the FS doesn't support it. And there's a decent case to be made for emulating it - run-anywhere reasons. Does glibc emulation support directories? Quite unlikely. But yes, sounds like a desirable thing. Would XFS support it easily if the above check was relaxed? I think we may relax the check here and let the individual file system decide if they support preallocation for directories or not. What do you think ? One thing to be thought in this case is the error code which should be returned by the file system implementation, incase it doesn't support preallocation for directories. Should it be -ENODEV (to match with what posix says) , or something else (which might make more sense in this case) ? -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This patch implements the fallocate() system call and adds support for i386, x86_64 and powerpc. ... +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (len == 0 || offset 0) + goto out; The posix spec implies that negative `len' is permitted - presumably allocate ahead of `offset'. How peculiar. I just checked the man page for posix_fallocate() and it says: EINVAL offset or len was less than zero. We should probably follow this lead. + + ret = -ENODEV; + if (!S_ISREG(inode-i_mode)) + goto out_fput; So we return ENODEV against an S_ISBLK fd, as per the posix spec. That seems a bit silly of them. H - I thought that the intention of sys_fallocate() was to be generic enough to eventually allow preallocation on directories. If that is the case, then this check will prevent that Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, 4 May 2007 16:07:31 +1000 David Chinner [EMAIL PROTECTED] wrote: On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This patch implements the fallocate() system call and adds support for i386, x86_64 and powerpc. ... +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (len == 0 || offset 0) + goto out; The posix spec implies that negative `len' is permitted - presumably allocate ahead of `offset'. How peculiar. I just checked the man page for posix_fallocate() and it says: EINVAL offset or len was less than zero. We should probably follow this lead. Yes, I think so. I'm suspecting that http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html is just buggy. Or I can't read. I mean, if we're going to support negative `len' then is the byte at `offset' inside or outside the segment? Head spins. However it would be neat if someone could test $OTHER_OS and, perhaps more importantly, the present glibc emulation (which I assume your manpage is referring to, so this would be a manpage test ;)). + + ret = -ENODEV; + if (!S_ISREG(inode-i_mode)) + goto out_fput; So we return ENODEV against an S_ISBLK fd, as per the posix spec. That seems a bit silly of them. H - I thought that the intention of sys_fallocate() was to be generic enough to eventually allow preallocation on directories. If that is the case, then this check will prevent that The above opengroup page only permits S_ISREG. Preallocating directories sounds quite useful to me, although it's something which would be pretty hard to emulate if the FS doesn't support it. And there's a decent case to be made for emulating it - run-anywhere reasons. Does glibc emulation support directories? Quite unlikely. But yes, sounds like a desirable thing. Would XFS support it easily if the above check was relaxed? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: The posix spec implies that negative `len' is permitted - presumably allocate ahead of `offset'. How peculiar. I just checked the man page for posix_fallocate() and it says: EINVAL offset or len was less than zero. That describes the current glibc implementation. We should probably follow this lead. Yes, I think so. I'm suspecting that http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html is just buggy. Or I can't read. I mean, if we're going to support negative `len' then is the byte at `offset' inside or outside the segment? Head spins. However it would be neat if someone could test $OTHER_OS and, perhaps more importantly, the present glibc emulation (which I assume your manpage is referring to, so this would be a manpage test ;)). int posix_fallocate (int fd, __off_t offset, __off_t len) { struct stat64 st; struct statfs f; /* `off_t' is a signed type. Therefore we can determine whether OFFSET + LEN is too large if it is a negative value. */ if (offset 0 || len 0) return EINVAL; if (offset + len 0) return EFBIG; /* First thing we have to make sure is that this is really a regular file. */ if (__fxstat64 (_STAT_VER, fd, st) != 0) return EBADF; if (S_ISFIFO (st.st_mode)) return ESPIPE; if (! S_ISREG (st.st_mode)) return ENODEV; if (len == 0) { if (st.st_size offset) { int ret = __ftruncate (fd, offset); if (ret != 0) ret = errno; return ret; } return 0; } ... is what glibc does ATM. Seems we violate the case where len == 0, as EINVAL in that case is shall fail. But reading the standard to imply negative len is ok is too much guessing, there is no word what it means when len is negative and required storage for regular file data starting at offset and continuing for len bytes doesn't make sense for negative size. And given the general Implementations may support additional errors not included in this list, may generate errors included in this list under circumstances other than those described here, or may contain extensions or limitations that prevent some errors from occurring. I believe returning EINVAL for len 0 is not a POSIX violation. That doesn't mean the standard shouldn't be clarified, whether by saying EINVAL must be returned for non-positive len or saying that using negative len has undefined or implementation defined behavior. The above opengroup page only permits S_ISREG. Preallocating directories sounds quite useful to me, although it's something which would be pretty hard to emulate if the FS doesn't support it. And there's a decent case to be made for emulating it - run-anywhere reasons. Does glibc emulation support directories? Quite unlikely. No, see above. Jakub - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: On Fri, 4 May 2007 16:07:31 +1000 David Chinner [EMAIL PROTECTED] wrote: On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This patch implements the fallocate() system call and adds support for i386, x86_64 and powerpc. ... +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (len == 0 || offset 0) + goto out; The posix spec implies that negative `len' is permitted - presumably allocate ahead of `offset'. How peculiar. I just checked the man page for posix_fallocate() and it says: EINVAL offset or len was less than zero. We should probably follow this lead. Yes, I think so. I'm suspecting that http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html is just buggy. Or I can't read. I mean, if we're going to support negative `len' then is the byte at `offset' inside or outside the segment? Head spins. I don't think we should care. If we provide a syscall with the semantics of allocate from offset to offset+len then glibc's implementation can turn negative length into two separate fallocate syscalls + ret = -ENODEV; + if (!S_ISREG(inode-i_mode)) + goto out_fput; So we return ENODEV against an S_ISBLK fd, as per the posix spec. That seems a bit silly of them. H - I thought that the intention of sys_fallocate() was to be generic enough to eventually allow preallocation on directories. If that is the case, then this check will prevent that The above opengroup page only permits S_ISREG. Preallocating directories sounds quite useful to me, although it's something which would be pretty hard to emulate if the FS doesn't support it. And there's a decent case to be made for emulating it - run-anywhere reasons. Does glibc emulation support directories? Quite unlikely. But yes, sounds like a desirable thing. Would XFS support it easily if the above check was relaxed? No - right now empty blocks are pruned from the directory immediately so I don't think we really have a concept of empty blocks in the btree structure. dir2 is bloody complex, so adding preallocation is probably not going to be simple to do. It's not high on my list to add, either, because we can typically avoid the worst case directory fragmentation by using larger directory block sizes (e.g. 16k instead of the default 4k on a 4k block size fs). IIRC directory preallocation has been talked about more for ext3/4 Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This patch implements the fallocate() system call and adds support for i386, x86_64 and powerpc. ... +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Please add a comment over this function which specifies its behaviour. Really it should be enough material from which a full manpage can be written. If that's all too much, this material should at least be spelled out in the changelog. Because there's no way in which this change can be fully reviewed unless someone (ie: you) tells us what it is setting out to achieve. If we 100% implement some standard then a URL for what we claim to implement would suffice. Given that we're at least using different types from posix I doubt if such a thing would be sufficient. And given the complexity and potential variability within the filesystem implementations of this, I'd expect that _something_ additional needs to be said? +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (len == 0 || offset 0) + goto out; The posix spec implies that negative `len' is permitted - presumably allocate ahead of `offset'. How peculiar. + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file-f_mode FMODE_WRITE)) + goto out_fput; + + inode = file-f_path.dentry-d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode-i_mode)) + goto out_fput; + + ret = -ENODEV; + if (!S_ISREG(inode-i_mode)) + goto out_fput; So we return ENODEV against an S_ISBLK fd, as per the posix spec. That seems a bit silly of them. + ret = -EFBIG; + if (offset + len inode-i_sb-s_maxbytes) + goto out_fput; This code does handle offset+len going negative, but only by accident, I suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment here would settle the reader's mind. + if (inode-i_op inode-i_op-fallocate) + ret = inode-i_op-fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; If we _are_ going to support negative `len', as posix suggests, I think we should perform the appropriate sanity conversions to `offset' and `len' right here, rather than expecting each filesystem to do it. If we're not going to handle negative `len' then we should check for it. +out_fput: + fput(file); +out: + return ret; +} +EXPORT_SYMBOL(sys_fallocate); I don't believe this needs to be exported to modules? +/* + * fallocate() modes + */ +#define FA_ALLOCATE 0x1 +#define FA_DEALLOCATE0x2 Now those aren't in posix. They should be documented, along with their expected semantics. #ifdef __KERNEL__ #include linux/linkage.h @@ -1125,6 +1131,7 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + long (*fallocate)(struct inode *, int, loff_t, loff_t); I really do think it's better to put the variable names in definitions such as this. Especially when we have two identically-typed variables next to each other like that. Quick: which one is the offset and which is the length? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton [EMAIL PROTECTED] wrote: + ret = -EFBIG; + if (offset + len inode-i_sb-s_maxbytes) + goto out_fput; This code does handle offset+len going negative, but only by accident, I suspect. But it doesn't handle offset+len wrapping through zero. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Andrew Morton writes: On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This patch implements the fallocate() system call and adds support for i386, x86_64 and powerpc. ... +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Please add a comment over this function which specifies its behaviour. Really it should be enough material from which a full manpage can be written. This looks like it will have the same problem on s390 as sys_sync_file_range. Maybe the prototype should be: asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Paul. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html