Re: [RFC] Heads up on sys_fallocate()
On Tue, Mar 06, 2007 at 10:46:56AM -0600, Eric Sandeen wrote:
> Ulrich Drepper wrote:
> > Christoph Hellwig wrote:
> >> fallocate with the whence argument and flags is already quite complicated,
> >> I'd rather have another call for placement decisions, that would
> >> be called on an fd to do placement decisions for any further allocations
> >> (prealloc, write, etc)
> >
> > Yes, posix_fallocate shouldn't be made more complicated. But I don't
> > understand why requesting linear layout of the blocks should be an
> > option. It's always an advantage if the blocks requested this way are
> > linear on disk. So, the kernel should always do its best to make this
> > happen, without needing an additional option.
>
> Agreed on both points. The hints would be for things like start block,
> or speculative EOF preallocation, not contiguity, which I think should
> always be the goal.

ISTR having had this discussion before ;)

About guided preallocation for defrag:

http://marc.info/?t=11624785951&r=1&w=2

e.g.: The sorts of policies we need for effective use of preallocation:

http://marc.info/?l=linux-fsdevel&m=116184475308164&w=2
http://marc.info/?l=linux-fsdevel&m=116278169519095&w=2

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
On Wed, 7 March 2007 09:51:35 +0100, Jan Kara wrote:
>
> I'll probably first write some userspace fs-reorganizer to find out how
> much these changes in layout are able to give you in performance (i.e.
> whether it's worth the effort of more complicated kernel online
> defragmenter).

Have you tried profiling the read accesses and prereading them
asynchronously on startup? That appears to have improved E17 a lot. See
http://lca2007.linux.org.au/talk/101 (and watch the video).

Jörn

--
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra
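One way to sketch the prereading idea is with posix_fadvise() and POSIX_FADV_WILLNEED, which asks the kernel to start pulling a file into the page cache without waiting for the data. This is only an illustration: the helper name `preread_files` and the idea of feeding it a list of paths gathered from a profiling run are assumptions, not something described in the thread or the linked talk.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: hint the kernel to read each listed file into the
 * page cache. Returns how many files accepted the hint. */
size_t preread_files(const char *const paths[], size_t count)
{
    size_t hinted = 0;

    assert(paths != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        int fd = open(paths[i], O_RDONLY);
        if (fd < 0)
            continue;               /* missing or unreadable: skip it */

        /* offset 0 with len 0 covers the whole file; the call queues
         * readahead and returns without waiting for the I/O. */
        if (posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) == 0)
            hinted++;

        close(fd);
    }
    return hinted;
}
```

An application would call this early in startup with the files it is known to touch, then carry on with its normal initialization while the reads proceed in the background.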
Re: [RFC] Heads up on sys_fallocate()
On Tue 06-03-07 12:23:22, Eric Sandeen wrote:
> Jan Kara wrote:
> > On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
> >> Christoph Hellwig wrote:
> >>> fallocate with the whence argument and flags is already quite complicated,
> >>> I'd rather have another call for placement decisions, that would
> >>> be called on an fd to do placement decisions for any further allocations
> >>> (prealloc, write, etc)
> >> Yes, posix_fallocate shouldn't be made more complicated. But I don't
> >> understand why requesting linear layout of the blocks should be an
> >> option. It's always an advantage if the blocks requested this way are
> >> linear on disk. So, the kernel should always do its best to make this
> >> happen, without needing an additional option.
> > Actually, it's not that simple. You want linear layout of blocks you are
> > going to read. That is not necessarily a linear layout of blocks in a single
> > file - trace the startup of some complicated app like KDE sometime. You'll
> > find it's seeking like hell because it needs a few blocks from a ton of
> > distinct files (shared libs, config files, etc). As these files are mostly
> > read only, it's advantageous to interleave them on disk or at least keep
> > them close.
>
> At some point shouldn't the apps be fixed, rather than do crazy things
> with the filesystem? :)

Yes :) That's basically what we told KDE developers when they were
complaining ;) But it's hard to fix it for them too (because of some
desktop specs requiring lots of different text config files which can
change anytime - don't ask me who designed it). Moreover, for loading
shared libraries, from which you need just a few blocks scattered all
over the place, the problem is in ELF itself.

I'll probably first write some userspace fs-reorganizer to find out how
much these changes in layout are able to give you in performance (i.e.
whether it's worth the effort of a more complicated kernel online
defragmenter).

Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [RFC] Heads up on sys_fallocate()
Jan Kara wrote:
> On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
>> Christoph Hellwig wrote:
>>> fallocate with the whence argument and flags is already quite complicated,
>>> I'd rather have another call for placement decisions, that would
>>> be called on an fd to do placement decisions for any further allocations
>>> (prealloc, write, etc)
>> Yes, posix_fallocate shouldn't be made more complicated. But I don't
>> understand why requesting linear layout of the blocks should be an
>> option. It's always an advantage if the blocks requested this way are
>> linear on disk. So, the kernel should always do its best to make this
>> happen, without needing an additional option.
> Actually, it's not that simple. You want linear layout of blocks you are
> going to read. That is not necessarily a linear layout of blocks in a single
> file - trace the startup of some complicated app like KDE sometime. You'll
> find it's seeking like hell because it needs a few blocks from a ton of
> distinct files (shared libs, config files, etc). As these files are mostly
> read only, it's advantageous to interleave them on disk or at least keep
> them close.

At some point shouldn't the apps be fixed, rather than do crazy things
with the filesystem? :)

-Eric
Re: [RFC] Heads up on sys_fallocate()
Ulrich Drepper wrote:
> Christoph Hellwig wrote:
>> fallocate with the whence argument and flags is already quite complicated,
>> I'd rather have another call for placement decisions, that would
>> be called on an fd to do placement decisions for any further allocations
>> (prealloc, write, etc)
>
> Yes, posix_fallocate shouldn't be made more complicated. But I don't
> understand why requesting linear layout of the blocks should be an
> option. It's always an advantage if the blocks requested this way are
> linear on disk. So, the kernel should always do its best to make this
> happen, without needing an additional option.

Agreed on both points. The hints would be for things like start block,
or speculative EOF preallocation, not contiguity, which I think should
always be the goal.

-Eric
Re: [RFC] Heads up on sys_fallocate()
On Tue, Mar 06, 2007 at 06:36:09AM -0800, Ulrich Drepper wrote:
> Christoph Hellwig wrote:
> > fallocate with the whence argument and flags is already quite complicated,
> > I'd rather have another call for placement decisions, that would
> > be called on an fd to do placement decisions for any further allocations
> > (prealloc, write, etc)
>
> Yes, posix_fallocate shouldn't be made more complicated. But I don't
> understand why requesting linear layout of the blocks should be an
> option. It's always an advantage if the blocks requested this way are
> linear on disk. So, the kernel should always do its best to make this
> happen, without needing an additional option.

There are HPC workloads where you have multiple writers on multiple
machines that write to different parts of a file. You preferably want
each of those regions in separate allocation groups. (Or tell the
customers to use separate files for the regions..)
Re: [RFC] Heads up on sys_fallocate()
On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
> Christoph Hellwig wrote:
> > fallocate with the whence argument and flags is already quite complicated,
> > I'd rather have another call for placement decisions, that would
> > be called on an fd to do placement decisions for any further allocations
> > (prealloc, write, etc)
>
> Yes, posix_fallocate shouldn't be made more complicated. But I don't
> understand why requesting linear layout of the blocks should be an
> option. It's always an advantage if the blocks requested this way are
> linear on disk. So, the kernel should always do its best to make this
> happen, without needing an additional option.

Actually, it's not that simple. You want linear layout of blocks you are
going to read. That is not necessarily a linear layout of blocks in a
single file - trace the startup of some complicated app like KDE sometime.
You'll find it's seeking like hell because it needs a few blocks from a
ton of distinct files (shared libs, config files, etc). As these files are
mostly read only, it's advantageous to interleave them on disk or at least
keep them close.

Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [RFC] Heads up on sys_fallocate()
Christoph Hellwig wrote:
> fallocate with the whence argument and flags is already quite complicated,
> I'd rather have another call for placement decisions, that would
> be called on an fd to do placement decisions for any further allocations
> (prealloc, write, etc)

Yes, posix_fallocate shouldn't be made more complicated. But I don't
understand why requesting linear layout of the blocks should be an
option. It's always an advantage if the blocks requested this way are
linear on disk. So, the kernel should always do its best to make this
happen, without needing an additional option.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC] Heads up on sys_fallocate()
On Mon, Mar 05, 2007 at 12:02:59PM -0800, Mingming Cao wrote:
> Yep, I think it makes sense to use preallocation for defragmentation.
> After all, both preallocation and defragmentation call the underlying
> filesystem's multiple block allocator to try to allocate a chunk of
> contiguous blocks on disk. The ext4 online defrag implementation by Takashi
> already supports choosing a "goal" allocation block to guide the ext4
> block allocator to place the defragmented file in a specific location.
>
> Passing a little bit more of a hint to sys_fallocate() (i.e., goal block,
> and/or whether the goal block is more important than the size of the
> prealloc extent) might make it more useful both for the original goal
> (getting contiguous and guaranteed blocks) and for defragmentation.

fallocate with the whence argument and flags is already quite complicated,
I'd rather have another call for placement decisions, that would
be called on an fd to do placement decisions for any further allocations
(prealloc, write, etc)
Re: [RFC] Heads up on sys_fallocate()
Jörn Engel wrote:
> Does the allocation have to be persistent beyond lifetime of the file
> descriptor? It would be fairly simple to support the write guarantee
> while the file is open (or rather the inode remains cached) and drop it
> afterwards.

"The posix_fallocate() function shall ensure that any required storage
for regular file data starting at offset and continuing for len bytes is
allocated on the file system storage media."

I interpret "on the storage media" to mean that it is persistent.

-Eric
Re: [RFC] Heads up on sys_fallocate()
Jan Kara wrote:
>> I am wondering if it is useful to add another mode to advise block
>> allocation policy? Something like indicating which physical block/block
>> group to allocate from (goal), and whether to ask for strict contiguous
>> blocks. This will help preallocation or reservation to choose the right
>> blocks for the file.
> Yes, I also think this would be useful so you can "guide"
> preallocation for things like defragmentation (e.g. preallocate space
> for the file being defragmented and move the file to it).

Hints & policies for allocation would certainly be useful, but I think
they belong outside this interface. i.e. you could flag an inode for
whatever allocation you choose, and -then- call posix_fallocate so that
the allocator will take the hints you've given it. See also this blurb
from the posix_fallocate definition:

"It is implementation-defined whether a previous posix_fadvise() call
influences allocation strategy."

FWIW I don't see a lot of point in asking for "strict contiguous blocks" -
the allocator will presumably try to do this in any case, and I'm not
sure when you would want to fail if you get more than one extent...?

-Eric
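The "hint first, then allocate" ordering described above can be sketched with the two POSIX calls named in the quoted blurb. This is a minimal illustration only: the helper name `hinted_prealloc` is hypothetical, POSIX_FADV_SEQUENTIAL is just one possible hint, and (per the POSIX text quoted above) whether the hint affects block placement at all is implementation-defined.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) path, give the kernel an
 * access-pattern hint, then preallocate len bytes from offset 0.
 * Returns 0 on success or an errno value. */
int hinted_prealloc(const char *path, off_t len)
{
    assert(path != NULL);

    int fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return errno;

    /* Both calls return an error number directly rather than setting
     * errno, so the results can be chained. */
    int err = posix_fadvise(fd, 0, len, POSIX_FADV_SEQUENTIAL);
    if (err == 0)
        err = posix_fallocate(fd, 0, len);

    close(fd);
    return err;
}
```

Keeping the hint in a separate call leaves the preallocation call itself unchanged, which is the separation of concerns argued for in the message.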
Re: [RFC] Heads up on sys_fallocate()
Jan Kara wrote:
>>> On Fri, 02 Mar 2007 09:40:54 +1100
>>> Nathan Scott <[EMAIL PROTECTED]> wrote:
>>>> On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
>>>>> On Fri, 2 Mar 2007 00:04:45 +0530
>>>>> "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
>>>>>> This is to give a heads up on a few patches that we will soon be coming
>>>>>> up with. These patches implement a new system call sys_fallocate() and a
>>>>>> new inode operation "fallocate", for persistent preallocation. The new
>>>>>> system call, as Andrew suggested, will look like:
>>>>>>
>>>>>> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>>>>>
>>>>> ...
>>>>>
>>>>> I'd agree with Eric on the "command" flag extension.
>>>>
>>>> Seems like a separate syscall would be better, "command" sounds
>>>> a bit ioctl like, especially if that command is passed into the
>>>> filesystems..
>>>
>>> madvise, fadvise, lseek, etc seem to work OK.
>>>
>>> I get repeatedly traumatised by patch rejects whenever a new syscall gets
>>> added, so I'm biased.
>>>
>>> The advantage of a command flag is that we can add new modes in the future
>>> without causing lots of churn, waiting for arch maintainers to catch up,
>>> potentially adding new compat code, etc.
>>>
>>> Rename it to "mode"? ;)
>>
>> I am wondering if it is useful to add another mode to advise block
>> allocation policy? Something like indicating which physical block/block
>> group to allocate from (goal), and whether to ask for strict contiguous
>> blocks. This will help preallocation or reservation to choose the right
>> blocks for the file.
> Yes, I also think this would be useful so you can "guide"
> preallocation for things like defragmentation (e.g. preallocate space
> for the file being defragmented and move the file to it).
>
> Honza

Yep, I think it makes sense to use preallocation for defragmentation.
After all, both preallocation and defragmentation call the underlying
filesystem's multiple block allocator to try to allocate a chunk of
contiguous blocks on disk. The ext4 online defrag implementation by
Takashi already supports choosing a "goal" allocation block to guide the
ext4 block allocator to place the defragmented file in a specific location.

Passing a little bit more of a hint to sys_fallocate() (i.e., goal block,
and/or whether the goal block is more important than the size of the
prealloc extent) might make it more useful both for the original goal
(getting contiguous and guaranteed blocks) and for defragmentation.

Regards,
Mingming
Re: [RFC] Heads up on sys_fallocate()
Theodore Tso wrote:
> [...] although the libc
> implementation still wouldn't be able to go away for a long time due to
> the need to be backwards compatible with older kernels that didn't
> have this support.

It's better than that. If somebody compiles glibc to not run on older
kernels at all (tested at runtime) then the code is dropped. E.g., the
current Fedora glibc does not support 2.6.8 or earlier. So, don't let
the compat code be a factor in the decision making.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC] Heads up on sys_fallocate()
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Well, I'm sure the kernel can do better than the code we have in libc
> now. The kernel has access to the bitmasks which say which blocks have
> already been allocated. The libc code does not and we have to be very
> simple-minded and simply touch every block. And this means reading it
> and then writing it back. The kernel would know when the reading part
> is not necessary. Add to that the block granularity (we use f_bsize as
> returned from fstatfs but that's not the best value in some cases) and
> you have compelling data to have generic code in the kernel. The libc
> implementation can then go away completely, which is a good thing.

You have a very good point; indeed since we don't export an interface
which allows userspace to determine whether or not a block is in use,
that does mean a huge amount of churn in the page cache. So maybe it
would be worth doing in the kernel as a result, although the libc
implementation still wouldn't be able to go away for a long time due to
the need to be backwards compatible with older kernels that didn't
have this support.

Regards,

- Ted
Re: [RFC] Heads up on sys_fallocate()
Jörn Engel wrote:
>> Of course. You call posix_fallocate once for the lifetime of the file
>> when it is created to ensure that all future uses will work.
>
> That part is not quite clear from the manpage but I trust most people
> would assume the same.

Not only that, it is what this function is for. In the POSIX committee
we've looked at the functions in detail before adding them, even if some
information is not in the man page but instead in the Rationale.

> Still, it is quite obvious that no one designing this interface has given
> much thought to compressing filesystems.

You already have problems with supporting the functionality
posix_fallocate is supporting. You cannot reliably support MAP_SHARED
files if all of a sudden the compression causes an expansion of a block
and that causes an ENOSPC error. So, don't expect pity. This is a function
in support of a real and reliable implementation of memory mapped files.
You don't use MAP_SHARED on such filesystems, it'll eat your kittens
sooner or later anyway.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC] Heads up on sys_fallocate()
On Mon, 5 March 2007 07:08:03 -0800, Ulrich Drepper wrote:
> Jörn Engel wrote:
> > Does the allocation have to be persistent beyond lifetime of the file
> > descriptor?
>
> Of course. You call posix_fallocate once for the lifetime of the file
> when it is created to ensure that all future uses will work.

That part is not quite clear from the manpage but I trust most people
would assume the same.

> It seems your filesystem will not be able to support this unless
> compression is turned off.

Correct. Compression needs to be turned off for a file if
posix_fallocate(3) is to succeed. What I could do is disable compression
(meaning that no data written in the future will be compressed) and
rewrite all blocks within the given range.

Still, it is quite obvious that no one designing this interface has given
much thought to compressing filesystems. Whatever I can come up with
will either be incompatible or some sort of hack. :(

Jörn

--
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon
Re: [RFC] Heads up on sys_fallocate()
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Theodore Tso wrote:
> > Given that glibc already has to support this for older kernels, I
> > would argue that there's no point putting in generic support for
> > filesystem that can't support a more advanced way of doing things.
>
> Well, I'm sure the kernel can do better than the code we have in libc
> now. The kernel has access to the bitmasks which say which blocks have
> already been allocated.

The layer of the kernel where a totally generic fallback would be
implemented does not have access to this information. We could do a
mostly generic helper for block filesystems that allows them to implement
fallocate this way without a lot of their own code.
Re: [RFC] Heads up on sys_fallocate()
Jörn Engel wrote:
> The bad news for posix_fallocate() is that even if libc is smart enough
> to write random data, mmap() can still cause problems.

This is not smart, quite to the contrary. The standard guarantees that
all not-yet-written-to places in the file are zero. And if a block has
already been written posix_fallocate cannot change it.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
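The two guarantees stated above (existing data is left alone, everything else in the range reads back as zero) can be checked from userspace. The following is a rough sketch, not a conformance test: the function name is hypothetical, the marker string is arbitrary, and the file behind fd is assumed to be empty and opened read-write.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical check: write a short marker, preallocate a larger range
 * over it, then confirm the marker survived and the rest reads as zero.
 * Returns 1 if both hold, 0 if either is violated, -1 on I/O error. */
int prealloc_keeps_data_and_zero_fills(int fd, off_t range)
{
    static const char marker[] = "data";
    const off_t mlen = (off_t)(sizeof marker - 1);
    char buf[512];

    assert(range > mlen);

    if (pwrite(fd, marker, (size_t)mlen, 0) != mlen)
        return -1;
    if (posix_fallocate(fd, 0, range) != 0)
        return -1;

    /* Already-written bytes must be unchanged. */
    if (pread(fd, buf, (size_t)mlen, 0) != mlen)
        return -1;
    if (memcmp(buf, marker, (size_t)mlen) != 0)
        return 0;

    /* Every byte that was never written must read back as zero. */
    for (off_t pos = mlen; pos < range; ) {
        off_t want = range - pos;
        if (want > (off_t)sizeof buf)
            want = (off_t)sizeof buf;

        ssize_t got = pread(fd, buf, (size_t)want, pos);
        if (got <= 0)
            return -1;
        for (ssize_t i = 0; i < got; i++)
            if (buf[i] != 0)
                return 0;
        pos += got;
    }
    return 1;
}
```

A fallback that filled blocks with random data would fail the second check, which is the objection raised in the reply.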
Re: [RFC] Heads up on sys_fallocate()
Jörn Engel wrote:
> Does the allocation have to be persistent beyond lifetime of the file
> descriptor?

Of course. You call posix_fallocate once for the lifetime of the file
when it is created to ensure that all future uses will work.

It seems your filesystem will not be able to support this unless
compression is turned off.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC] Heads up on sys_fallocate()
Theodore Tso wrote:
> Given that glibc already has to support this for older kernels, I
> would argue that there's no point putting in generic support for
> filesystem that can't support a more advanced way of doing things.

Well, I'm sure the kernel can do better than the code we have in libc
now. The kernel has access to the bitmasks which say which blocks have
already been allocated. The libc code does not and we have to be very
simple-minded and simply touch every block. And this means reading it
and then writing it back. The kernel would know when the reading part
is not necessary. Add to that the block granularity (we use f_bsize as
returned from fstatfs but that's not the best value in some cases) and
you have compelling data to have generic code in the kernel. The libc
implementation can then go away completely, which is a good thing.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
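The "touch every block" approach described above can be sketched in a few lines of userspace C. This is an illustration of the technique as the message describes it, not glibc's actual code: the names `touch_byte` and `touch_every_block` are hypothetical, overflow of offset + len is not checked, and the read followed by write-back is not safe against concurrent writers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/vfs.h>    /* fstatfs(), Linux-specific */
#include <unistd.h>

/* Read one byte and write the same value back, forcing the filesystem
 * to allocate the block that holds it. Returns 0 or an errno value. */
static int touch_byte(int fd, off_t pos)
{
    char byte = 0;                  /* stays 0 when pos is past EOF */

    if (pread(fd, &byte, 1, pos) < 0)
        return errno;

    ssize_t put = pwrite(fd, &byte, 1, pos);
    if (put != 1)
        return put < 0 ? errno : EIO;
    return 0;
}

/* Hypothetical fallback: make sure [offset, offset + len) is backed by
 * allocated blocks by touching one byte per f_bsize-sized step. */
int touch_every_block(int fd, off_t offset, off_t len)
{
    struct statfs fs;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstatfs(fd, &fs) != 0)
        return errno;
    assert(fs.f_bsize > 0);

    const off_t step = (off_t)fs.f_bsize;
    const off_t end = offset + len;

    /* Positions one block apart hit every block in the range except
     * possibly the last one, which the final touch below covers. */
    for (off_t pos = offset; pos < end; pos += step) {
        int err = touch_byte(fd, pos);
        if (err != 0)
            return err;
    }

    /* Touching the last byte also extends the file to offset + len. */
    return touch_byte(fd, end - 1);
}
```

Every step costs a read and a write through the page cache even when the block was already allocated, which is the overhead the message says a kernel implementation could avoid.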
Re: [RFC] Heads up on sys_fallocate()
On 5 Mar 2007, at 14:37, Theodore Tso wrote:
> On Sun, Mar 04, 2007 at 11:22:06PM +, Anton Altaparmakov wrote:
>> And I specifically did NOT update the initialized size in the inode
>> thus it will remain at its old value thus all new allocated blocks
>> will be considered as present but not initialized thus a read will
>> always return zero whilst a write will do the right thing and pad
>> with zeroes as necessary (if the write is smaller than the block
>> size, etc).
>
> You're describing a method of doing in-advance preallocation where the
> filesystem format explicitly has support for this kind of feature in a
> way that doesn't require pre-zeroing the data blocks in question.

Indeed.

> The question which this subthread was concerned about was whether the
> kernel should get involved in initializing datablocks in the case
> where the filesystem format does not have this support, or whether
> this functionality should continue to be done in userspace. Given
> that glibc already has to support this for older kernels, I would
> argue that there's no point putting in generic support for filesystem
> that can't support a more advanced way of doing things.

Yes, I understood that after I had sent my post... And yes, I would
agree. If glibc already does this there does not appear to be any value
in just moving existing functionality into the kernel. Simply let "dumb"
file systems return ENOSYS and let glibc do it... And any FS which can
do it better can implement the function and then glibc should not go
anywhere near it.

Best regards,

Anton
--
Anton Altaparmakov (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
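The division of labour suggested here (try the filesystem's native preallocation, and only emulate when it reports that it cannot help) can be sketched as below. Note the assumptions: this uses the Linux fallocate(2) wrapper as it exists in current glibc, which takes an extra mode argument and so differs from the three-argument prototype proposed in this thread; the helper name and the `used_fallback` out-parameter are hypothetical; and posix_fallocate() stands in for "let glibc do it".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical dispatcher: prefer the native call, fall back to libc
 * emulation only when the kernel or filesystem lacks support.
 * Returns 0 on success or an errno value; *used_fallback reports
 * which path was taken. */
int prealloc_native_or_libc(int fd, off_t offset, off_t len,
                            int *used_fallback)
{
    assert(used_fallback != NULL);
    *used_fallback = 0;

    /* mode 0: allocate the range and extend the file size if needed. */
    if (fallocate(fd, 0, offset, len) == 0)
        return 0;

    /* Anything other than "not supported" is a real failure (ENOSPC,
     * EBADF, ...) that emulation must not paper over. */
    if (errno != ENOSYS && errno != EOPNOTSUPP)
        return errno;

    *used_fallback = 1;
    return posix_fallocate(fd, offset, len);
}
```

The important design point is the errno filter: only "not supported" triggers the slow path, so a filesystem that can do better is never bypassed.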
Re: [RFC] Heads up on sys_fallocate()
On Sun, Mar 04, 2007 at 11:22:06PM +, Anton Altaparmakov wrote:
> And I specifically did NOT update the initialized size in the inode
> thus it will remain at its old value thus all new allocated blocks
> will be considered as present but not initialized thus a read will
> always return zero whilst a write will do the right thing and pad
> with zeroes as necessary (if the write is smaller than the block
> size, etc).

Anton,

You're describing a method of doing in-advance preallocation where the
filesystem format explicitly has support for this kind of feature in a
way that doesn't require pre-zeroing the data blocks in question.

The question which this subthread was concerned about was whether the
kernel should get involved in initializing datablocks in the case
where the filesystem format does not have this support, or whether
this functionality should continue to be done in userspace. Given
that glibc already has to support this for older kernels, I would
argue that there's no point putting in generic support for filesystem
that can't support a more advanced way of doing things.

Regards,

- Ted
Re: [RFC] Heads up on sys_fallocate()
On Mon, 5 March 2007 00:32:14 +, Anton Altaparmakov wrote:
>
> I don't know how your compression algorithm works [...]

LogFS is designed for flash media, so it does not have to worry much
about reducing disk seeks. It is log-structured, which simplifies
compression further.

When writing a block, it basically compresses it and appends it to the
log. Writes only have to be byte-aligned, so no space is lost for
padding.

The bad news for posix_fallocate() is that even if libc is smart enough
to write random data, mmap() can still cause problems. If the VM
decides to write a given page twice, the second write compresses better
and the medium has filled up between the two writes, the users will have
fun.

Jörn

--
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html
Re: [RFC] Heads up on sys_fallocate()
On Sat, Mar 03, 2007 at 11:45:32PM +0100, Arnd Bergmann wrote:
> > I'd be more happy to have the write out zeroes loop in glibc. And
> > glibc needs to have it anyway, for older kernels.
>
> A generic_fallocate makes sense to me iff we can do it in the kernel
> significantly more efficiently than in glibc, e.g. by using only
> a single page in page cache instead of one for each page to be preallocated.

We can't do that with the current page cache interfaces. But what might
make sense is to have a block_dump_prealloc that takes a get_block callback
to do what you propose. It still wouldn't be entirely generic, but would
allow block based filesystems to do a not entirely dumb implementation.
Re: [RFC] Heads up on sys_fallocate()
On Mon, 5 March 2007 01:36:36 +0100, Arnd Bergmann wrote:
>
> Using the current glibc implementation on a compressed file system ideally
> should be a very expensive no-op because you won't actually allocate much
> space for a file when writing zeroes to it. You also don't benefit from a
> contiguous allocation in logfs, since flash has uniform seek times over
> all the medium.
>
> I'd suggest you implement posix_fallocate as a real nop and just return
> success without doing anything. You could also return ENOSPC in case
> the blocks requested by posix_fallocate don't fit on the medium without
> compression, but that is more or less just guesswork (like statfs is).

Quoting POSIX_FALLOCATE(3):

    The function posix_fallocate() ensures that disk space is allocated
    for the file referred to by the descriptor fd for the bytes in the
    range starting at offset and continuing for len bytes. After a
    successful call to posix_fallocate(), subsequent writes to bytes in
    the specified range are guaranteed not to fail because of lack of
    disk space.

    If the size of the file is less than offset+len, then the file is
    increased to this size; otherwise the file size is left unchanged.

Afaics, the (main) purpose of this function is not to decrease
fragmentation but to ensure mmap() won't cause any problems because the
medium fills up. That problem exists for LogFS as well, once rw mmap()
is supported.

Simply returning success without doing anything would be a bug. -ENOSPC
is a better choice, but still a lame implementation. And falling back
on libc to write zeroes in a loop is an exercise in futility.

Does the allocation have to be persistent beyond lifetime of the file
descriptor? It would be fairly simple to support the write guarantee
while the file is open (or rather the inode remains cached) and drop it
afterwards.

Jörn

--
"[One] doesn't need to know [...] how to cause a headache in order
to take an aspirin."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001
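The mmap() use case mentioned above can be illustrated as follows: reserve the space first, so that running out of room shows up as an error from posix_fallocate() rather than as a failure when dirty mapped pages are written back later. This is a minimal sketch under stated assumptions: the helper name `reserve_then_map_write` is hypothetical, fd is assumed to be a regular file opened read-write, and the guarantee only holds on filesystems that actually honour posix_fallocate().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reserve len bytes for fd, map them MAP_SHARED,
 * and copy msg to the start of the mapping. Returns 0 on success,
 * -1 on any failure. */
int reserve_then_map_write(int fd, size_t len, const char *msg)
{
    assert(msg != NULL);

    size_t n = strlen(msg);
    if (n > len)
        return -1;

    /* Lack of space is reported here, before any page is dirtied. */
    if (posix_fallocate(fd, 0, (off_t)len) != 0)
        return -1;

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    memcpy(map, msg, n);

    /* Flush the dirty pages; the blocks they land in already exist. */
    int rc = msync(map, len, MS_SYNC);
    if (munmap(map, len) != 0)
        rc = -1;
    return rc;
}
```

Without the reservation step, a store into a hole in the mapping can only fail asynchronously, at writeback time, where the application has no convenient way to handle it.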
Re: [RFC] Heads up on sys_fallocate()
> >On Fri, 02 Mar 2007 09:40:54 +1100
> >Nathan Scott <[EMAIL PROTECTED]> wrote:
> >
> >
> >>On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> >>
> >>>On Fri, 2 Mar 2007 00:04:45 +0530
> >>>"Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> >>>
> >>>
> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
>
> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> >>>
> >>>...
> >>>
> >>>I'd agree with Eric on the "command" flag extension.
> >>
> >>Seems like a separate syscall would be better, "command" sounds
> >>a bit ioctl like, especially if that command is passed into the
> >>filesystems..
> >>
> >
> >
> >madvise, fadvise, lseek, etc seem to work OK.
> >
> >I get repeatedly traumatised by patch rejects whenever a new syscall gets
> >added, so I'm biased.
> >
> >The advantage of a command flag is that we can add new modes in the future
> >without causing lots of churn, waiting for arch maintainers to catch up,
> >potentially adding new compat code, etc.
> >
> >Rename it to "mode"? ;)
> >
> I am wondering if it is useful to add another mode to advise block
> allocation policy? Something like indicating which physical block/block
> group to allocate from (goal), and whether to ask for strictly contiguous
> blocks. This will help preallocation or reservation to choose the right
> blocks for the file.

Yes, I also think this would be useful so you can "guide" preallocation
for things like defragmentation (e.g. preallocate space for the file being
defragmented and move the file to it).

Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [RFC] Heads up on sys_fallocate()
On Sun, Mar 04, 2007 at 08:11:17PM +, Anton Altaparmakov wrote:
> glibc cannot ever be smart enough because a file system driver will
> always know better and be able to do things in a much more optimized
> way.

Please read the thread again. That is not what anyone proposed. The
issue we're discussing is whether the fallback for a filesystem that does
not support preallocation natively should be done in kernelspace or in
userspace.
Re: [RFC] Heads up on sys_fallocate()
On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
>
> When you do it like this, how can the kernel/filesystem *guarantee* that
> when the data is written there actually is room on the harddrive?
>
> What you described seems like using truncate/ftruncate to increase the
> file's size. That is not at all what posix_fallocate is for.
> posix_fallocate must make sure that the requested blocks on the disk are
> reserved (allocated) for the file's use and that at no point in the
> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
> been written to.

That actually causes an interesting problem for compressing filesystems.
The space consumed by blocks depends on their contents and how well it
compresses. At the moment, the only option I see to support
posix_fallocate for LogFS is to set an inode flag disabling compression,
then allocate the blocks.

But if the file already contains large amounts of compressed data, I
have a problem. Disabling compression for a range within a file is not
supported, so I can only return an error. But which one?

Jörn

--
A surrounded army must be given a way out.
-- Sun Tzu
Re: [RFC] Heads up on sys_fallocate()
On Monday 05 March 2007, Anton Altaparmakov wrote:
> An alternative would be to allocate blocks and then when the data is
> written perform the compression and free any blocks you do not need
> any more because the data has shrunk sufficiently. Depending on the
> implementation details this could potentially create horrible
> fragmentation as you would allocate a large consecutive region and
> then go and drop random blocks from that region thus making the file
> fragmented.

Unfortunately, this is not as easy on logfs, because there is no point in
allocating a block when there is no data to write into it. Fragmentation
on flash media is free, but you can never modify a block in place without
erasing it first. This means it will always be written to a new location
on the next write access.

One option that might work (similar to what you describe in your other
mail) is to have a per-inode count of reserved blocks, without allocating
specific blocks for them. The journal then needs to maintain the number
of total reserved blocks for all files and keep that in sync with blocks
that were reserved for specific inodes.

	Arnd <><
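The counting scheme in the last paragraph can be sketched as plain bookkeeping. This is a toy model under loud assumptions: struct fs_space, struct inode_resv and both functions are invented names, every write is charged one full uncompressed block (the worst case), and no journalling is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical filesystem-wide counters. */
struct fs_space {
    uint64_t free_blocks;    /* blocks not yet written */
    uint64_t reserved_total; /* sum of all per-inode reservations */
};

/* Hypothetical per-inode reservation: a count, not specific blocks. */
struct inode_resv {
    uint64_t reserved;
};

/* Reserve nblocks for one inode, or fail without partial effect. */
int resv_claim(struct fs_space *fs, struct inode_resv *ino, uint64_t nblocks)
{
    assert(fs->reserved_total <= fs->free_blocks);
    if (nblocks > fs->free_blocks - fs->reserved_total)
        return -ENOSPC;
    ino->reserved += nblocks;
    fs->reserved_total += nblocks;
    return 0;
}

/* Account for one block written on behalf of an inode. */
int resv_write_block(struct fs_space *fs, struct inode_resv *ino)
{
    if (ino->reserved > 0) {
        /* Covered by the reservation: cannot fail for lack of space. */
        ino->reserved--;
        fs->reserved_total--;
    } else if (fs->free_blocks == fs->reserved_total) {
        /* Only blocks promised to other inodes are left. */
        return -ENOSPC;
    }
    fs->free_blocks--;
    return 0;
}
```

The invariant reserved_total <= free_blocks is what carries the guarantee; keeping the two counters in sync across a crash is the journal's job in the proposal above and is not shown here.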
Re: [RFC] Heads up on sys_fallocate()
On Monday 05 March 2007, Jörn Engel wrote:
> That actually causes an interesting problem for compressing filesystems.
> The space consumed by blocks depends on their contents and how well it
> compresses. At the moment, the only option I see to support
> posix_fallocate for LogFS is to set an inode flag disabling compression,
> then allocate the blocks.
>
> But if the file already contains large amounts of compressed data, I
> have a problem. Disabling compression for a range within a file is not
> supported, so I can only return an error. But which one?

Using the current glibc implementation on a compressed file system ideally
should be a very expensive no-op because you won't actually allocate much
space for a file when writing zeroes to it. You also don't benefit from a
contiguous allocation in logfs, since flash has uniform seek times over
all the medium.

I'd suggest you implement posix_fallocate as a real nop and just return
success without doing anything. You could also return ENOSPC in case
the blocks requested by posix_fallocate don't fit on the medium without
compression, but that is more or less just guesswork (like statfs is).

	Arnd <><
Re: [RFC] Heads up on sys_fallocate()
On 5 Mar 2007, at 00:32, Anton Altaparmakov wrote: On 5 Mar 2007, at 00:16, Jörn Engel wrote: On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote: When you do it like this, who can the kernel/filesystem *guarantee* that when the data is written there actually is room on the harddrive? What you described seems like using truncate/ftruncate to increase the file's size. That is not at all what posix_fallocate is for. posix_fallocate must make sure that the requested blocks on the disk are reserved (allocated) for the file's use and that at no point in the future will, say, a msync() fail because a mmap(MAP_SHARED) page has been written to. That actually causes an interesting problem for compressing filesystems. The space consumed by blocks depends on their contents and how well it compresses. At the moment, the only option I see to support posix_fallocate for LogFS is to set an inode flag disabling compression, then allocate the blocks. But if the file already contains large amounts of compressed data, I have a problem. Disabling compression for a range within a file is not supported, so I can only return an error. But which one? I don't know how your compression algorithm works but at least on NTFS that bit is easy: you allocate the blocks and mark them as allocated then the compression engine will write non-compressed data to those blocks. Basically it works like this "does compression block X have any sparse blocks?". If the answer is "yes" the block is treated as compressed data and if the answer is "no" the block is treated as uncompressed data. This means that if the data cannot be compressed (and in some cases if the data compressed is bigger than the data uncompressed) the data is stored non-compressed. That is the most space efficient method to do things. An alternative would be to allocate blocks and then when the data is written perform the compression and free any blocks you do not need any more because the data has shrunk sufficiently. 
Depending on the implementation details this could potentially create horrible fragmentation as you would allocate a large consecutive region and then go and drop random blocks from that region thus making the file fragmented. And another thing you could do (best if you support journalling) would be to do the allocation and hang the details off the inode on a "preallocation list" of some kind and then as the data gets written use blocks from the preallocation list as you go along. This would avoid the fragmentation issue for example. You could then free the surplus blocks when the whole range of the file being covered by the preallocation list has been written to and/or when the file is closed for the last time (drop_inode/delete_inode). Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/
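The "preallocation list" idea can be sketched as a list of extents that writes consume in order, with whatever is left over handed back at last close. All names here (struct extent, struct prealloc_list, prealloc_take, prealloc_release) are hypothetical and not taken from any NTFS or LogFS code; journalling and locking are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A run of preallocated physical blocks. */
struct extent {
    uint64_t start;
    uint64_t count;
};

/* Hypothetical list hung off the inode. */
struct prealloc_list {
    struct extent *ext;
    size_t n;    /* number of extents */
    size_t cur;  /* first extent that may still hold blocks */
};

/* Take the next preallocated block; returns -1 when none are left. */
int prealloc_take(struct prealloc_list *pl, uint64_t *block)
{
    while (pl->cur < pl->n && pl->ext[pl->cur].count == 0)
        pl->cur++;
    if (pl->cur == pl->n)
        return -1;
    *block = pl->ext[pl->cur].start++;
    pl->ext[pl->cur].count--;
    return 0;
}

/*
 * Empty the list and report how many surplus blocks would go back to
 * the free pool, e.g. when the file is closed for the last time.
 */
uint64_t prealloc_release(struct prealloc_list *pl)
{
    uint64_t surplus = 0;
    for (size_t i = pl->cur; i < pl->n; i++) {
        surplus += pl->ext[i].count;
        pl->ext[i].count = 0;
    }
    pl->cur = pl->n;
    return surplus;
}
```

Because blocks are handed out front to back, data that compresses well simply leaves the tail of the list unused instead of punching holes in the middle of a contiguous run.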
Re: [RFC] Heads up on sys_fallocate()
On 5 Mar 2007, at 00:16, Jörn Engel wrote: On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote: When you do it like this, who can the kernel/filesystem *guarantee* that when the data is written there actually is room on the harddrive? What you described seems like using truncate/ftruncate to increase the file's size. That is not at all what posix_fallocate is for. posix_fallocate must make sure that the requested blocks on the disk are reserved (allocated) for the file's use and that at no point in the future will, say, a msync() fail because a mmap(MAP_SHARED) page has been written to. That actually causes an interesting problem for compressing filesystems. The space consumed by blocks depends on their contents and how well it compresses. At the moment, the only option I see to support posix_fallocate for LogFS is to set an inode flag disabling compression, then allocate the blocks. But if the file already contains large amounts of compressed data, I have a problem. Disabling compression for a range within a file is not supported, so I can only return an error. But which one? I don't know how your compression algorithm works but at least on NTFS that bit is easy: you allocate the blocks and mark them as allocated then the compression engine will write non-compressed data to those blocks. Basically it works like this "does compression block X have any sparse blocks?". If the answer is "yes" the block is treated as compressed data and if the answer is "no" the block is treated as uncompressed data. This means that if the data cannot be compressed (and in some cases if the data compressed is bigger than the data uncompressed) the data is stored non-compressed. That is the most space efficient method to do things. An alternative would be to allocate blocks and then when the data is written perform the compression and free any blocks you do not need any more because the data has shrunk sufficiently. 
Depending on the implementation details this could potentially create horrible fragmentation as you would allocate a large consecutive region and then go and drop random blocks from that region thus making the file fragmented. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/
Re: [RFC] Heads up on sys_fallocate()
Hi, On 4 Mar 2007, at 22:38, Ulrich Drepper wrote: Anton Altaparmakov wrote: And that is it. No zeroing needs to happen at all because we have not updated the initialized size of the inode! When you do it like this, who can the kernel/filesystem *guarantee* that when the data is written there actually is room on the harddrive? The blocks are allocated so of course it is guaranteed. Subsequent writes to this file will not generate any allocations thus allocations cannot fail. (-: What you described seems like using truncate/ftruncate to increase the file's size. That is not at all what posix_fallocate is for. posix_fallocate must make sure that the requested blocks on the disk are reserved (allocated) for the file's use and that at no point in the future will, say, a msync() fail because a mmap(MAP_SHARED) page has been written to. No that is different. I described performing the allocations in the volume bitmap, i.e. for each allocated block the corresponding "in use" bit is set in the bitmap (NTFS uses a linear bitmap where byte 0, bit 0 == physical block 0 of volume, byte 0, bit 1 == physical block 1 of volume, ... byte 1, bit 0 == block 8 of volume, ...). Also I described updating the extent map of the inode such that it describes the physical blocks as belonging to the file, thus you would have "logical file block X corresponds to physical block Y on volume" entries entered into the extent map of the inode and they would describe the just allocated blocks. Finally I described updating the allocated size in the inode which basically says "there are that many bytes worth of blocks allocated to this inode". And optionally I described updating the data size in the inode which basically says "this file has size Z bytes". 
And I specifically did NOT update the initialized size in the inode thus it will remain at its old value thus all new allocated blocks will be considered as present but not initialized thus a read will always return zero whilst a write will do the right thing and pad with zeroes as necessary (if the write is smaller than the block size, etc). Note that you are right that this is like truncate in NTFS for non-sparse enabled inodes/volumes. But for sparse ones, instead of doing any allocations in the bitmap and entering them in the extent map, you would simply add a single entry to the extent map that says "X blocks allocated starting at logical block Y corresponding to no physical blocks, i.e. they are sparse". You would then also update the allocated size and data size as above and now you can even (but do not have to) update the initialized size to be equal to the data size as the file can be considered fully initialized because it is sparse. As an implementation detail this truncate operation would not modify the compressed size of the inode (i.e. the really used on-disk space, i.e. what you get from running "du" as that does not change when you add sparse blocks) whilst the fallocate described above would update the compressed size (if the file is sparse or compressed - there is no compressed size in the inode if the inode is not sparse/compressed) because the file now occupies more blocks on disk even if they are actually not initialized. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/
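The bitmap layout described in this message (byte 0, bit 0 is block 0; byte 1, bit 0 is block 8) can be sketched as follows. The function names are invented for illustration and this is not the NTFS driver's code; it only shows the indexing and an all-or-nothing allocation of a run of blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Block n lives in byte n / 8, bit n % 8 (least significant bit first). */
static int bitmap_test(const uint8_t *bm, uint64_t block)
{
    return (bm[block / 8] >> (block % 8)) & 1;
}

static void bitmap_set(uint8_t *bm, uint64_t block)
{
    bm[block / 8] |= (uint8_t)(1u << (block % 8));
}

/*
 * Mark count blocks starting at start as in use. Returns 0 on success,
 * or -1 without changing anything if the run is out of range or any
 * block in it is already taken.
 */
int bitmap_alloc_run(uint8_t *bm, uint64_t nblocks,
                     uint64_t start, uint64_t count)
{
    if (start > nblocks || count > nblocks - start)
        return -1;
    for (uint64_t b = start; b < start + count; b++)
        if (bitmap_test(bm, b))
            return -1;
    for (uint64_t b = start; b < start + count; b++)
        bitmap_set(bm, b);
    return 0;
}
```

A real fallocate would follow this with the extent map and allocated-size updates listed above; leaving the initialized size alone is what makes zeroing unnecessary.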
Re: [RFC] Heads up on sys_fallocate()
Anton Altaparmakov wrote:
> And that is it. No zeroing needs to happen at all because we
> have not updated the initialized size of the inode!

When you do it like this, how can the kernel/filesystem *guarantee* that
when the data is written there actually is room on the harddrive?

What you described seems like using truncate/ftruncate to increase the
file's size. That is not at all what posix_fallocate is for.
posix_fallocate must make sure that the requested blocks on the disk are
reserved (allocated) for the file's use and that at no point in the
future will, say, a msync() fail because a mmap(MAP_SHARED) page has
been written to.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC] Heads up on sys_fallocate()
On Sunday 04 March 2007, Anton Altaparmakov wrote:
> > A generic_fallocate makes sense to me iff we can do it in the kernel
> > significantly more efficiently than in glibc, e.g. by using only
> > a single page in page cache instead of one for each page to be
> > preallocated.
> >
> > If glibc is smart enough to do an optimal implementation, I fully
> > agree with you.
>
> glibc cannot ever be smart enough because a file system driver will
> always know better and be able to do things in a much more optimized
> way.

Ok, that's not what I meant. It's obvious that the file system itself
can do better than both VFS and glibc. The question is whether VFS can
be better than glibc on file systems that don't offer their own
implementation of the fallocate operation.

	Arnd <><
Re: [RFC] Heads up on sys_fallocate()
On 3 Mar 2007, at 22:45, Arnd Bergmann wrote: On Friday 02 March 2007 00:38:19 Christoph Hellwig wrote: Forgive me if I haven't put enough thought into it, but would it be useful to create a generic_fallocate() that writes zeroed pages for any non-existent pages in the range? I don't know how glibc currently implements posix_fallocate(), but maybe the kernel could do it more efficiently, even in generic code. Maybe we don't care, since the major file systems can probably do something better in their own code. I'd be more happy to have the write out zeroes loop in glibc. And glibc needs to have it anyway, for older kernels. A generic_fallocate makes sense to me iff we can do it in the kernel more significantly more efficiently than in glibc, e.g. by using only a single page in page cache instead of one for each page to be preallocated. If glibc is smart enough to do an optimal implementation, I fully agree with you. glibc cannot ever be smart enough because a file system driver will always know better and be able to do things in a much more optimized way. For example on NTFS fallocate() only needs to involve the setting of a few bits in the volume block allocation bitmap (one bit for each logical block being allocated) and update the extent map in the on- disk inode to reflect that those blocks are now allocated to the inode. Then it just needs to update the allocated size and optionally the data size (if fallocate wants to increase the file size rather than just the allocated size). And that is it. No zeroing needs to happen at all because we have not updated the initialized size of the inode! glibc can only dream of an implementation like this. 
(-; Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/
Re: [RFC] Heads up on sys_fallocate()
On Friday 02 March 2007 00:38:19 Christoph Hellwig wrote:
> > Forgive me if I haven't put enough thought into it, but would it be
> > useful to create a generic_fallocate() that writes zeroed pages for any
> > non-existent pages in the range? I don't know how glibc currently
> > implements posix_fallocate(), but maybe the kernel could do it more
> > efficiently, even in generic code. Maybe we don't care, since the major
> > file systems can probably do something better in their own code.
>
> I'd be more happy to have the write out zeroes loop in glibc. And
> glibc needs to have it anyway, for older kernels.

A generic_fallocate makes sense to me iff we can do it in the kernel
significantly more efficiently than in glibc, e.g. by using only
a single page in page cache instead of one for each page to be
preallocated.

If glibc is smart enough to do an optimal implementation, I fully agree
with you.

	Arnd <><
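For reference, a userspace "write out zeroes" fallback of the kind being compared here might look roughly like the sketch below. It is not glibc's actual code. It assumes st_blksize is the allocation unit, touches one byte per block, rewrites a byte only when it already reads back as zero so existing data is never changed, and ignores races with concurrent writers. The name fallocate_by_writing() is made up.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>   /* open() flags for callers */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 or an error number, mirroring posix_fallocate(). */
int fallocate_by_writing(int fd, off_t offset, off_t len)
{
    struct stat st;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstat(fd, &st) != 0)
        return errno;

    off_t blksz = st.st_blksize > 0 ? st.st_blksize : 512;
    off_t end = offset + len;   /* overflow checks omitted in this sketch */

    /* Visit offset, then the start of every following block. */
    for (off_t pos = offset; pos < end; pos += blksz - (pos % blksz)) {
        unsigned char c = 0;

        if (pos < st.st_size) {
            ssize_t n = pread(fd, &c, 1, pos);
            if (n < 0)
                return errno;
            if (n == 1 && c != 0)
                continue;   /* non-zero data: block is already backed */
        }
        /* c is zero here, so file contents stay the same. */
        ssize_t w = pwrite(fd, &c, 1, pos);
        if (w != 1)
            return w < 0 ? errno : EIO;
    }

    /* Grow the file to offset+len if it was shorter. */
    if (end > st.st_size) {
        unsigned char zero = 0;
        ssize_t w = pwrite(fd, &zero, 1, end - 1);
        if (w != 1)
            return w < 0 ? errno : EIO;
    }
    return 0;
}
```

Every touched block passes through the page cache as its own dirty page, which is the cost a kernel-side generic implementation would have to beat. On a compressing filesystem the zeroes may reserve next to nothing, as noted elsewhere in the thread.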
Re: [RFC] Heads up on sys_fallocate()
Andrew Morton wrote:

On Fri, 02 Mar 2007 09:40:54 +1100 Nathan Scott <[EMAIL PROTECTED]> wrote:

On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:

On Fri, 2 Mar 2007 00:04:45 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

This is to give a heads up on few patches that we will be soon coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation "fallocate", for persistent preallocation. The new
system call, as Andrew suggested, will look like:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

...

I'd agree with Eric on the "command" flag extension.

Seems like a separate syscall would be better, "command" sounds
a bit ioctl like, especially if that command is passed into the
filesystems..

madvise, fadvise, lseek, etc seem to work OK.

I get repeatedly traumatised by patch rejects whenever a new syscall gets
added, so I'm biased.

The advantage of a command flag is that we can add new modes in the future
without causing lots of churn, waiting for arch maintainers to catch up,
potentially adding new compat code, etc.

Rename it to "mode"? ;)

I am wondering if it is useful to add another mode to advise block
allocation policy? Something like indicating which physical block/block
group to allocate from (goal), and whether to ask for strictly contiguous
blocks. This will help preallocation or reservation to choose the right
blocks for the file.

Right now neither the ext4 preallocation implementation nor reservation is
guaranteed to allocate/reserve contiguous extents. If the application
told it so, it could do more searching to satisfy the requirement.

Or is fadvise the right interface?

Mingming

I'm inclined to merge this patch nice and early, so the syscall number is
stabilised. Otherwise the people who are working on out-of-tree code (ie:
ext4) will have to keep playing catchup.
Re: [RFC] Heads up on sys_fallocate()
Dave Kleikamp wrote:

On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote:

On Thu, 01 Mar 2007 22:44:16 + Dave Kleikamp <[EMAIL PROTECTED]> wrote:

On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:

On Fri, 2 Mar 2007 00:04:45 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+        struct file *file;
+        struct inode *inode;
+        long ret = -EINVAL;
+        file = fget(fd);
+        if (!file)
+                goto out;
+        inode = file->f_path.dentry->d_inode;
+        if (inode->i_op && inode->i_op->fallocate)
+                ret = inode->i_op->fallocate(inode, offset, len);
+        else
+                ret = -ENOTTY;
+        fput(file);
+out:
+        return ret;
+}

ENOTTY is a bit unconventional - we often use EINVAL for this sort of
thing. But EINVAL has other meanings for posix_fallocate() and isn't
really appropriate here anyway. So I'm not sure what would be better...

Would EINVAL (or whatever) make it back to the caller of
posix_fallocate(), or would glibc fall back to its current
implementation?

Forgive me if I haven't put enough thought into it, but would it be
useful to create a generic_fallocate() that writes zeroed pages for any
non-existent pages in the range? I don't know how glibc currently
implements posix_fallocate(), but maybe the kernel could do it more
efficiently, even in generic code. Maybe we don't care, since the major
file systems can probably do something better in their own code.

Given that glibc already implements fallocate for all filesystems, it will
need to continue to do so for filesystems which don't implement this
syscall - otherwise applications would start breaking.

I didn't make it clear, but my point was to call generic_fallocate if
the file system did not define i_op->fallocate().

if (inode->i_op && inode->i_op->fallocate)
        ret = inode->i_op->fallocate(inode, offset, len);
else
        ret = generic_fallocate(inode, offset, len);

I'm not sure it's worth the effort, but I thought I'd throw the idea out
there.

I think this is useful.
Mingming
Re: [RFC] Heads up on sys_fallocate()
Badari Pulavarty wrote:
> BTW, what is the interface for finding out what is the size of the
> pre-allocated file ?

With XFS at least, "du," "stat," etc tell you a little:

[EMAIL PROTECTED] test]# touch resvsp
[EMAIL PROTECTED] test]# xfs_io resvsp
xfs_io> resvsp 0 10g

The file is 0 length, but is using 10g of blocks:
(with posix_fallocate this would move the size out to 10g as well)

[EMAIL PROTECTED] test]# ls -lh resvsp
-rw-r--r-- 1 root root 0 Nov 28 14:11 resvsp
[EMAIL PROTECTED] test]# du -hc resvsp
10G     resvsp
10G     total
[EMAIL PROTECTED] test]# stat resvsp
  File: `resvsp'
  Size: 0               Blocks: 20971520   IO Block: 4096   regular empty file
Device: 81eh/2078d      Inode: 186         Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)

xfs also has an interface to find out what allocations are where:
if you reserve some ranges not starting at 0...

[EMAIL PROTECTED] test]# xfs_io resvsp
xfs_io> resvsp 1g 1g
xfs_io> resvsp 3g 1g
xfs_io> resvsp 5g 1g
xfs_io> quit
[EMAIL PROTECTED] test]# xfs_bmap -v resvsp
resvsp:
 EXT: FILE-OFFSET            BLOCK-RANGE         AG AG-OFFSET             TOTAL   FLAGS
   0: [0..2097151]:          hole                                         2097152
   1: [2097152..4194303]:    42392..2139543       0 (42392..2139543)      2097152 1
   2: [4194304..6291455]:    hole                                         2097152
   3: [6291456..8388607]:    4236696..6333847     0 (4236696..6333847)    2097152 1
   4: [8388608..10485759]:   hole                                         2097152
   5: [10485760..12582911]:  8431000..10528151    0 (8431000..10528151)   2097152 1

The flags value of 1 means that these extents are preallocated/unwritten.

I suppose outside of XFS, FIBMAP is your best bet, but that won't tell
you what is preallocated vs. allocated/written

-Eric
Re: [RFC] Heads up on sys_fallocate()
On Fri, 02 Mar 2007 08:13:00 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> >
> > > What about if the blocks already exists ? What would be return values
> > > in those cases ?
> >
> > 0 on success, other normal errors otherwise..
> >
> > If asked for a range that includes already-allocated blocks, you just
> > allocate any non-allocated blocks in the range, I think.
>
> Yes. What I was trying to figure out is, if there is a requirement that
> interface need to return exact number of bytes it *really* allocated
> (like write() or read()). I can't think of any, but just wanted to
> throw it out..

Hopefully not, because posix didn't anticipate that. We could of course
return a positive number on success, but it'd get tricky on 32-bit
machines.

> BTW, what is the interface for finding out what is the size of the
> pre-allocated file ?

stat.st_blocks?
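A small sketch of the st_blocks suggestion: compare the logical size with the space actually charged to the file. It assumes the Linux convention that st_blocks counts 512-byte units, which matches the stat output quoted elsewhere in the thread (20971520 blocks for 10g). struct space_usage and file_space_usage() are invented names.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

struct space_usage {
    long long size_bytes;      /* logical length (st_size) */
    long long allocated_bytes; /* st_blocks * 512, Linux convention */
};

/* Returns 0 on success or an errno value if fstat() fails. */
int file_space_usage(int fd, struct space_usage *out)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return errno;
    out->size_bytes = (long long)st.st_size;
    out->allocated_bytes = (long long)st.st_blocks * 512;
    return 0;
}
```

A preallocation that does not move the file size, like the xfs_io resvsp case, would show up as allocated_bytes larger than size_bytes. As already pointed out, this cannot tell preallocated blocks from written ones.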
Re: [RFC] Heads up on sys_fallocate()
On Fri, 2007-03-02 at 09:16 -0600, Eric Sandeen wrote:
> Badari Pulavarty wrote:
> >
> > Amit K. Arora wrote:
> >
> >> This is to give a heads up on few patches that we will be soon coming up
> >> with. These patches implement a new system call sys_fallocate() and a
> >> new inode operation "fallocate", for persistent preallocation. The new
> >> system call, as Andrew suggested, will look like:
> >>
> >> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> >>
> > I am wondering about return values from this syscall ? Is it supposed to
> > return the number of bytes allocated ? What about partial allocations ?
>
> If you don't have enough blocks to cover the request, you should
> probably just return -ENOSPC, not a partial allocation.

That could be challenging, when multiple writers are working in
parallel. You may not be able to return -ENOSPC, till you fail the
allocation (for filesystems which allocate a block at a time).

>
> > What about if the blocks already exists ? What would be return values
> > in those cases ?
>
> 0 on success, other normal errors otherwise..
>
> If asked for a range that includes already-allocated blocks, you just
> allocate any non-allocated blocks in the range, I think.

Yes. What I was trying to figure out is, if there is a requirement that
interface need to return exact number of bytes it *really* allocated
(like write() or read()). I can't think of any, but just wanted to
throw it out..

BTW, what is the interface for finding out what is the size of the
pre-allocated file ?

Thanks,
Badari
Re: [RFC] Heads up on sys_fallocate()
Badari Pulavarty wrote:
>
> Amit K. Arora wrote:
>
>> This is to give a heads up on few patches that we will be soon coming up
>> with. These patches implement a new system call sys_fallocate() and a
>> new inode operation "fallocate", for persistent preallocation. The new
>> system call, as Andrew suggested, will look like:
>>
>> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>>
> I am wondering about return values from this syscall ? Is it supposed to
> return the number of bytes allocated ? What about partial allocations ?

If you don't have enough blocks to cover the request, you should
probably just return -ENOSPC, not a partial allocation.

> What about if the blocks already exists ? What would be return values
> in those cases ?

0 on success, other normal errors otherwise..

If asked for a range that includes already-allocated blocks, you just
allocate any non-allocated blocks in the range, I think.

-Eric
Re: [RFC] Heads up on sys_fallocate()
On 3/2/07, Dave Kleikamp <[EMAIL PROTECTED]> wrote:
> Then there's no need for sys_fallocate to return a long.

Every syscall must return a long. Otherwise you can have problems on
64-bit archs.
Re: [RFC] Heads up on sys_fallocate()
On Mar 1 2007 23:09, Dave Kleikamp wrote: >> >> Given that glibc already implements fallocate for all filesystems, it will >> need to continue to do so for filesystems which don't implement this >> syscall - otherwise applications would start breaking. > >I didn't make it clear, but my point was to call generic_fallocate if >the file system did not define i_op->allocate(). > >if (inode->i_op && inode->i_op->fallocate) > ret = inode->i_op->fallocate(inode, offset, len); >else > ret = generic_fallocate(inode, offset, len); > >I'm not sure it's worth the effort, but I thought I'd throw the idea out >there. Writing zeroes using glibc emu most likely means write() -- so generic_fallocate should be preferable (think splice). Or does glibc use mmap() and it's all different? Jan -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
Amit wrote:
> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

On Thu, 2007-03-01 at 22:16 -0800, Andrew Morton wrote:
> On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:
>
> > Just curious .. What does posix_fallocate() return ?
>
> bookmark this:
>
> http://www.opengroup.org/onlinepubs/009695399/nfindex.html
>
> Upon successful completion, posix_fallocate() shall return zero;
> otherwise, an error number shall be returned to indicate the error.

Then there's no need for sys_fallocate to return a long.

--
David Kleikamp
IBM Linux Technology Center
Re: [RFC] Heads up on sys_fallocate()
On Fri, 2007-03-02 at 18:45 +0800, Andreas Dilger wrote: > On Mar 01, 2007 13:15 -0600, Eric Sandeen wrote: > > One thing I'd like to see is a cmd argument as well, to allow for > > example allocation vs. reservation (i.e. allocating blocks vs. simply > > reserving a number), as well as the inverse of those functions > > (un-reservation, de-allocation)? > > > > If the allocation interface allows allocation/reservation within > > arbitrary ranges, if the only way to un-allocate is via a truncate, > > that's pretty asymmetric. > > I'd rather we just get the oft-discussed punch() syscall instead. > This is really what "unallocate" would do for persistent allocations > and it would be useful for files that were not preallocated. I can see a difference though. punch() would throw away written data as well as pre-allocated-but-never-written-to data. I can see where a user might preallocate a large file and do a lot of random writes. At some point, he decides the file isn't going to grow much more, so let's free up the remaining pre-allocated blocks. This makes even more sense with reservation. The alternative would be to have punch() take a flag to specify if only preallocated or reserved blocks should be freed. > > For filesystems that don't implement punch glibc() would do zero-filling > of the punched area I guess (to make it equivalent to reading from a > hole in the file). Or it could just fail. Writing zeroes may be really slow and not give the caller any benefit. (The intention was to free blocks back to the file system.) Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
On Mar 01, 2007 13:15 -0600, Eric Sandeen wrote: > One thing I'd like to see is a cmd argument as well, to allow for > example allocation vs. reservation (i.e. allocating blocks vs. simply > reserving a number), as well as the inverse of those functions > (un-reservation, de-allocation)? > > If the allocation interface allows allocation/reservation within > arbitrary ranges, if the only way to un-allocate is via a truncate, > that's pretty asymmetric. I'd rather we just get the oft-discussed punch() syscall instead. This is really what "unallocate" would do for persistent allocations and it would be useful for files that were not preallocated. For filesystems that don't implement punch glibc() would do zero-filling of the punched area I guess (to make it equivalent to reading from a hole in the file). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
Andrew Morton wrote:
> Perhaps Ulrich can comment.

I was out of town, hence the delay.

I think that if there is no support for the syscall the correct answer is
to return ENOSYS. In this case the current userlevel code would be used,
and ENOSYS is also used to trigger the use of the compat code in glibc in
case the syscall does not exist at all.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
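The dispatch rule described in this message could be sketched roughly as follows. This is only an illustration, not glibc's actual code: the `prealloc_fn` type, `prealloc_with_fallback()` and the three stub functions are hypothetical names made up for the example, and errors are modelled as positive error numbers in the way posix_fallocate() returns them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical backend signature: 0 on success, positive errno on failure. */
typedef int (*prealloc_fn)(int fd, long long offset, long long len);

/*
 * Try the kernel path first and emulate only when it reports ENOSYS.
 * Any other error (ENOSPC, EIO, ...) is propagated to the caller unchanged.
 */
static int prealloc_with_fallback(prealloc_fn kernel_path,
                                  prealloc_fn emulation,
                                  int fd, long long offset, long long len)
{
    int err;

    assert(kernel_path != NULL && emulation != NULL);

    err = kernel_path(fd, offset, len);
    if (err == ENOSYS)
        return emulation(fd, offset, len);
    return err;
}

/* Stubs standing in for the real code paths, for illustration only. */
static int stub_no_syscall(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    return ENOSYS;      /* kernel or filesystem has no support */
}

static int stub_disk_full(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    return ENOSPC;      /* a real failure that must not be emulated away */
}

static int stub_emulation_ok(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    return 0;           /* pretend the zero-writing loop succeeded */
}
```

With these stubs, a backend that answers ENOSYS ends up in the emulation, while ENOSPC reaches the caller as-is, which is the "emulate versus propagate" distinction asked about earlier in the thread.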
Re: [RFC] Heads up on sys_fallocate()
On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> Just curious .. What does posix_fallocate() return ?

bookmark this:

	http://www.opengroup.org/onlinepubs/009695399/nfindex.html

	Upon successful completion, posix_fallocate() shall return zero;
	otherwise, an error number shall be returned to indicate the error.
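As a small illustration of the return convention quoted above: posix_fallocate() hands back the error number directly rather than returning -1 and setting errno. The wrapper name `reserve_space()` is made up for this sketch.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Preallocate len bytes starting at offset.
 * Returns 0 on success or the error number reported by posix_fallocate().
 */
static int reserve_space(int fd, off_t offset, off_t len)
{
    int err;

    assert(len > 0);

    err = posix_fallocate(fd, offset, len);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    return err;     /* the error number itself, not -1 */
}
```

A caller therefore tests the return value, for example `if (reserve_space(fd, 0, 1 << 20) == ENOSPC)`, instead of inspecting errno.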
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote: This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation "fallocate", for persistent preallocation. The new system call, as Andrew suggested, will look like: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); I am wondering about return values from this syscall ? Is it supposed to return the number of bytes allocated ? What about partial allocations ? What about if the blocks already exists ? What would be return values in those cases ? Just curious .. What does posix_fallocate() return ? Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
On Thu, Mar 01, 2007 at 05:29:15PM -0600, Eric Sandeen wrote:
> Amit K. Arora wrote:
>
> Might want more error checking in there, something like (rough cut)...
> (or is some of this glibc's job?)

Yeah, we need to have these checks. We can't rely on userspace not passing
arguments that might corrupt your filesystem or let you escalate
privileges.

> which would keep things in line with posix_fallocate's specified errors,
> too?

Yes, very good idea.
Re: [RFC] Heads up on sys_fallocate()
On Thu, Mar 01, 2007 at 10:44:16PM +, Dave Kleikamp wrote: > Would EINVAL (or whatever) make it back to the caller of > posix_fallocate(), or would glibc fall back to its current > implementation? > > Forgive me if I haven't put enough thought into it, but would it be > useful to create a generic_fallocate() that writes zeroed pages for any > non-existent pages in the range? I don't know how glibc currently > implements posix_fallocate(), but maybe the kernel could do it more > efficiently, even in generic code. Maybe we don't care, since the major > file systems can probably do something better in their own code. I'd be more happy to have the write out zeroes loop in glibc. And glibc needs to have it anyway, for older kernels. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
On Fri, Mar 02, 2007 at 12:04:45AM +0530, Amit K. Arora wrote:
> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
>
> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>
> As we are developing and testing the required patches, we decided to
> post a preliminary patch and get inputs from the community to give it
> a right direction and shape. First, a little description on the feature.

Thanks a lot, this has been long overdue. Please don't forget to Cc the
XFS list to keep the developers of the only Linux filesystem that has
supported persistent allocations for a long time in the loop :)

Various people will beat you up for the above syscall as lots of
architectures really want 64bit arguments aligned in a proper way, e.g.
you at least need a pad after 'int fd'. Then again I already have
suggestions for filling up that slot with useful information:

 - you really want a whence argument as for lseek, as it makes a lot of
   sense for applications to allocate from the end of the file or the
   current file position. The existing XFS ioctl already has this, and
   it's trivial to support this in any preallocation implementation I
   could imagine.
 - we should think about having a flag value for which kind of
   preallocation we want. XFS currently has two:

	ALLOCSP which updates the inode size and physically
		zeroes blocks
	RESVSP  which does not update inode size but creates
		an unwritten extent

   the current posix_fallocate semantics are somewhere in the middle,
   as it requires an update to the inode size, but does not specify at
   all what happens if you read from the newly allocated space.

And yes, as a heads up to developers implementing this feature on new
filesystems: don't just return new blocks, that's a gaping security
hole :)

> +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> +{
> +	struct file *file;
> +	struct inode *inode;
> +	long ret = -EINVAL;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +	inode = file->f_path.dentry->d_inode;
> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, offset, len);
> +	else
> +		ret = -ENOTTY;
> +	fput(file);
> +out:
> +return ret;
> +}

This should use fget_light, and I'm sure the code could be written in a
slightly more readable way:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
{
	struct file *file = fget(fd);
	long ret = -EINVAL;

	if (file) {
		struct inode *inode = file->f_path.dentry->d_inode;

		if (inode->i_op && inode->i_op->fallocate)
			ret = inode->i_op->fallocate(inode, offset, len);
		else
			ret = -ENOTTY;
		fput(file);
	}
	return ret;
}

p.s. you reference ext4_fallocate in the patch but don't actually
introduce it, it definitely won't compile as-is :)
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote:

Might want more error checking in there, something like (rough cut)...
(or is some of this glibc's job?)

+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret;
> +
> +	ret = -EINVAL;
> +	if (len == 0 || offset < 0)
> +		goto out;
> +	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
> +	if (!(file->f_mode & FMODE_WRITE))
> +		goto out_fput;
+	inode = file->f_path.dentry->d_inode;
> +	ret = -ESPIPE;
> +	if (S_ISFIFO(inode->i_mode))
> +		goto out_fput;
> +	ret = -ENODEV;
> +	if (!S_ISREG(inode->i_mode))
> +		goto out_fput;
> +	ret = -EFBIG;
> +	if (offset + len > inode->i_sb->s_maxbytes)
> +		goto out_fput;
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, offset, len);
+	else
+		ret = -ENOTTY;
> +out_fput:
+	fput(file);
+out:
+	return ret;
+}

which would keep things in line with posix_fallocate's specified errors,
too?

-Eric
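The range checks in the rough cut above can be tried out in userspace with a sketch like the one below. It is an assumption-laden model rather than kernel code: `check_range()` is a hypothetical name, errors are returned as positive error numbers, a negative len is rejected as well as a zero one, and the size comparison is written as a subtraction so that `offset + len` cannot overflow a signed 64-bit value.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Validate the byte range [offset, offset + len) against a maximum file size.
 * Returns 0 if the range is acceptable, otherwise a positive errno value.
 */
static int check_range(int64_t offset, int64_t len, int64_t maxbytes)
{
    assert(maxbytes >= 0);

    /* Assumption: negative lengths are treated like a zero length. */
    if (offset < 0 || len <= 0)
        return EINVAL;

    /* Same meaning as offset + len > maxbytes, without the overflow risk. */
    if (offset > maxbytes || len > maxbytes - offset)
        return EFBIG;

    return 0;
}
```

For example, `check_range(INT64_MAX - 1, 10, INT64_MAX)` yields EFBIG, whereas a plain `offset + len` comparison would overflow before the check could reject the request.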
Re: [RFC] Heads up on sys_fallocate()
On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote: > On Thu, 01 Mar 2007 22:44:16 + > Dave Kleikamp <[EMAIL PROTECTED]> wrote: > > > On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: > > > On Fri, 2 Mar 2007 00:04:45 +0530 > > > "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > > > > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len) > > > > +{ > > > > + struct file *file; > > > > + struct inode *inode; > > > > + long ret = -EINVAL; > > > > + file = fget(fd); > > > > + if (!file) > > > > + goto out; > > > > + inode = file->f_path.dentry->d_inode; > > > > + if (inode->i_op && inode->i_op->fallocate) > > > > + ret = inode->i_op->fallocate(inode, offset, len); > > > > + else > > > > + ret = -ENOTTY; > > > > + fput(file); > > > > +out: > > > > +return ret; > > > > +} > > > > > > > > ENOTTY is a bit unconventional - we often use EINVAL for this sort of > > > thing. But EINVAL has other meanings for posix_fallocate() and isn't > > > really appropriate here anyway. So I'm not sure what would be better... > > > > Would EINVAL (or whatever) make it back to the caller of > > posix_fallocate(), or would glibc fall back to its current > > implementation? > > > > Forgive me if I haven't put enough thought into it, but would it be > > useful to create a generic_fallocate() that writes zeroed pages for any > > non-existent pages in the range? I don't know how glibc currently > > implements posix_fallocate(), but maybe the kernel could do it more > > efficiently, even in generic code. Maybe we don't care, since the major > > file systems can probably do something better in their own code. > > Given that glibc already implements fallocate for all filesystems, it will > need to continue to do so for filesystems which don't implement this > syscall - otherwise applications would start breaking. I didn't make it clear, but my point was to call generic_fallocate if the file system did not define i_op->allocate(). 

	if (inode->i_op && inode->i_op->fallocate)
		ret = inode->i_op->fallocate(inode, offset, len);
	else
		ret = generic_fallocate(inode, offset, len);

I'm not sure it's worth the effort, but I thought I'd throw the idea out
there.

--
David Kleikamp
IBM Linux Technology Center
Re: [RFC] Heads up on sys_fallocate()
On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: > On Fri, 2 Mar 2007 00:04:45 +0530 > "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > This is to give a heads up on few patches that we will be soon coming up > > with. These patches implement a new system call sys_fallocate() and a > > new inode operation "fallocate", for persistent preallocation. The new > > system call, as Andrew suggested, will look like: > > > > asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); > ... > > I'd agree with Eric on the "command" flag extension. Seems like a separate syscall would be better, "command" sounds a bit ioctl like, especially if that command is passed into the filesystems.. cheers. -- Nathan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
On Thu, 01 Mar 2007 22:44:16 + Dave Kleikamp <[EMAIL PROTECTED]> wrote: > On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: > > On Fri, 2 Mar 2007 00:04:45 +0530 > > "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len) > > > +{ > > > + struct file *file; > > > + struct inode *inode; > > > + long ret = -EINVAL; > > > + file = fget(fd); > > > + if (!file) > > > + goto out; > > > + inode = file->f_path.dentry->d_inode; > > > + if (inode->i_op && inode->i_op->fallocate) > > > + ret = inode->i_op->fallocate(inode, offset, len); > > > + else > > > + ret = -ENOTTY; > > > + fput(file); > > > +out: > > > +return ret; > > > +} > > > > > ENOTTY is a bit unconventional - we often use EINVAL for this sort of > > thing. But EINVAL has other meanings for posix_fallocate() and isn't > > really appropriate here anyway. So I'm not sure what would be better... > > Would EINVAL (or whatever) make it back to the caller of > posix_fallocate(), or would glibc fall back to its current > implementation? > > Forgive me if I haven't put enough thought into it, but would it be > useful to create a generic_fallocate() that writes zeroed pages for any > non-existent pages in the range? I don't know how glibc currently > implements posix_fallocate(), but maybe the kernel could do it more > efficiently, even in generic code. Maybe we don't care, since the major > file systems can probably do something better in their own code. Given that glibc already implements fallocate for all filesystems, it will need to continue to do so for filesystems which don't implement this syscall - otherwise applications would start breaking. However with this kernel change, glibc will need to look at the errno, so that it can correctly propagate EIO, ENOSPC and whatever. So we will need to return a reliable and stable and sensible value so that glibc knows when it should emulate and when it should propagate. Perhaps Ulrich can comment. 
Re: [RFC] Heads up on sys_fallocate()
On Fri, 02 Mar 2007 09:40:54 +1100 Nathan Scott <[EMAIL PROTECTED]> wrote: > On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: > > On Fri, 2 Mar 2007 00:04:45 +0530 > > "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > > > This is to give a heads up on few patches that we will be soon coming up > > > with. These patches implement a new system call sys_fallocate() and a > > > new inode operation "fallocate", for persistent preallocation. The new > > > system call, as Andrew suggested, will look like: > > > > > > asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); > > ... > > > > I'd agree with Eric on the "command" flag extension. > > Seems like a separate syscall would be better, "command" sounds > a bit ioctl like, especially if that command is passed into the > filesystems.. > madvise, fadvise, lseek, etc seem to work OK. I get repeatedly traumatised by patch rejects whenever a new syscall gets added, so I'm biased. The advantage of a command flag is that we can add new modes in the future without causing lots of churn, waiting for arch maintainers to catch up, potentially adding new compat code, etc. Rename it to "mode"? ;) I'm inclined to merge this patch nice and early, so the syscall number is stabilised. Otherwise the people who are working on out-of-tree code (ie: ext4) will have to keep playing catchup. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: > On Fri, 2 Mar 2007 00:04:45 +0530 > "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len) > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + file = fget(fd); > > + if (!file) > > + goto out; > > + inode = file->f_path.dentry->d_inode; > > + if (inode->i_op && inode->i_op->fallocate) > > + ret = inode->i_op->fallocate(inode, offset, len); > > + else > > + ret = -ENOTTY; > > + fput(file); > > +out: > > +return ret; > > +} > > ENOTTY is a bit unconventional - we often use EINVAL for this sort of > thing. But EINVAL has other meanings for posix_fallocate() and isn't > really appropriate here anyway. So I'm not sure what would be better... Would EINVAL (or whatever) make it back to the caller of posix_fallocate(), or would glibc fall back to its current implementation? Forgive me if I haven't put enough thought into it, but would it be useful to create a generic_fallocate() that writes zeroed pages for any non-existent pages in the range? I don't know how glibc currently implements posix_fallocate(), but maybe the kernel could do it more efficiently, even in generic code. Maybe we don't care, since the major file systems can probably do something better in their own code. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
> That new argument might need to come after "fd" - ARM has funny
> requirements on syscall arg padding and layout.

FYI the 32bit ppc ABI does too, from arch/powerpc/kernel/sys_ppc32.c:

/*
 * long long munging:
 * The 32 bit ABI passes long longs in an odd even register pair.
 */

and the first argument in a function call is in r3.

Anton
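To make the padding concern concrete, here is a simplified model of the rule quoted above. It is not taken from any ABI document or kernel source: `assign_slots()` is a hypothetical helper, slot i stands for register r(3+i), and a 64-bit argument is assumed to need an even-numbered slot (that is, an odd-numbered register such as r3, r5 or r7).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assign argument slots under the simplified pairing rule.
 * is64[i] is nonzero when argument i is 64 bits wide.
 * start[i] receives the first slot of argument i.
 * Returns the total number of slots consumed, padding included.
 */
static size_t assign_slots(const int *is64, size_t nargs, size_t *start)
{
    size_t slot = 0;
    size_t i;

    assert(is64 != NULL && start != NULL);

    for (i = 0; i < nargs; i++) {
        if (is64[i] && (slot % 2) != 0)
            slot++;                 /* skip one register as padding */
        start[i] = slot;
        slot += is64[i] ? 2 : 1;
    }
    return slot;
}
```

Under this model, (int fd, loff_t offset, loff_t len) uses six slots with slot 1 left empty, and (int fd, int mode, loff_t offset, loff_t len) also uses six, so a 32-bit argument placed right after fd fills the gap at no extra cost.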
Re: [RFC] Heads up on sys_fallocate()
Nathan Scott wrote: On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: On Fri, 2 Mar 2007 00:04:45 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation "fallocate", for persistent preallocation. The new system call, as Andrew suggested, will look like: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); ... I'd agree with Eric on the "command" flag extension. Seems like a separate syscall would be better, "command" sounds a bit ioctl like, especially if that command is passed into the filesystems.. cheers. I'm fine with that too, I'd just like the functionality :) -Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
On Fri, 2 Mar 2007 00:04:45 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > This is to give a heads up on few patches that we will be soon coming up > with. These patches implement a new system call sys_fallocate() and a > new inode operation "fallocate", for persistent preallocation. The new > system call, as Andrew suggested, will look like: > > asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); It is intended that glibc use this same syscall for both posix_fallocate() and posix_fallocate64(). I'd agree with Eric on the "command" flag extension. That new argument might need to come after "fd" - ARM has funny requirements on syscall arg padding and layout. > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len) > +{ > + struct file *file; > + struct inode *inode; > + long ret = -EINVAL; > + file = fget(fd); > + if (!file) > + goto out; > + inode = file->f_path.dentry->d_inode; > + if (inode->i_op && inode->i_op->fallocate) > + ret = inode->i_op->fallocate(inode, offset, len); > + else > + ret = -ENOTTY; > + fput(file); > +out: > +return ret; > +} Please always put a blank line between the variable definitions and the first statement. Please always use hard tabs, not bunch-of-spaces. This seems to happening rather a lot in the ext4 patches. It's a trivial thing, but also trivial to fix. A grep across the diffs is needed. ENOTTY is a bit unconventional - we often use EINVAL for this sort of thing. But EINVAL has other meanings for posix_fallocate() and isn't really appropriate here anyway. So I'm not sure what would be better... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
Alan wrote:
> ENOSYS indicates quite different things and ENOTTY is also used for
> syscalls. I still think ENOTTY is correct.

Yes, ENOSYS tends to mean "operation flat out not supported" rather than
"not on this object". I think we can do better than ENOTTY though -
ENOTSUP for example (modulo the confusion over EOPNOTSUPP).

(You can tell the patch has very little real substance if we're arguing
over errnos at this point :)

    J
Re: [RFC] Heads up on sys_fallocate()
On Thu, 01 Mar 2007 14:05:36 -0800 Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote: > Alan wrote: > > A lot of people get confused about -ENOTTY, but it is the return for > > attempting to use an ioctl on the wrong type of object, so this appears > > to be quite correct. > > This is a syscall though; ENOSYS is probably a better match. ENOSYS indicates quite different things and ENOTTY is also used for syscalls. I still think ENOTTY is correct. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
Alan wrote: > A lot of people get confused about -ENOTTY, but it is the return for > attempting to use an ioctl on the wrong type of object, so this appears > to be quite correct. This is a syscall though; ENOSYS is probably a better match. J - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
On Thu, 01 Mar 2007 13:14:32 -0800 Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote: > Amit K. Arora wrote: > > + if (inode->i_op && inode->i_op->fallocate) > > + ret = inode->i_op->fallocate(inode, offset, len); > > + else > > + ret = -ENOTTY; > > You can only allocate space on typewriters? ;) A lot of people get confused about -ENOTTY, but it is the return for attempting to use an ioctl on the wrong type of object, so this appears to be quite correct. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote: > + if (inode->i_op && inode->i_op->fallocate) > + ret = inode->i_op->fallocate(inode, offset, len); > + else > + ret = -ENOTTY; You can only allocate space on typewriters? ;) J - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
On Thu, Mar 01, 2007 at 03:23:19PM -0500, Jeff Garzik wrote: > I certainly agree that we want something like this. > > posix_fallocate() is the glibc interface we want to be compatible with > (which your definition is, AFAICS). This would be great for Samba. Windows clients do this a lot Jeremy. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote: This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation "fallocate", for persistent preallocation. The new system call, as Andrew suggested, will look like: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); As we are developing and testing the required patches, we decided to post a preliminary patch and get inputs from the community to give it a right direction and shape. First, a little description on the feature. Persistent preallocation is a file system feature using which an application (say, relational database servers) can explicitly preallocate blocks to a particular file. This feature can be used to reserve space for a file to get mainly the following benefits: 1> contiguity - less defragmentation and thus faster access speed, and 2> guarantee for a minimum space availibility (depending on how many blocks were preallocated) for the file, even if the filesystem becomes full. XFS already has an implementation for this, using an ioctl interface. And, ext4 is now coming up with this feature. In coming time we may see a few more file systems implementing this. Thus, it makes sense to have a more standard interface for this, like this new system call. Here is the initial and incomplete version of the patch, which can be used for the discussion, till we come up with a set of more complete patches. --- arch/i386/kernel/syscall_table.S |1 + fs/ext4/file.c |1 + fs/open.c| 18 ++ include/asm-i386/unistd.h|3 ++- include/linux/fs.h |1 + include/linux/syscalls.h |1 + 6 files changed, 24 insertions(+), 1 deletion(-) I certainly agree that we want something like this. posix_fallocate() is the glibc interface we want to be compatible with (which your definition is, AFAICS). 
Jeff - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote: This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation "fallocate", for persistent preallocation. The new system call, as Andrew suggested, will look like: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); One thing I'd like to see is a cmd argument as well, to allow for example allocation vs. reservation (i.e. allocating blocks vs. simply reserving a number), as well as the inverse of those functions (un-reservation, de-allocation)? If the allocation interface allows allocation/reservation within arbitrary ranges, if the only way to un-allocate is via a truncate, that's pretty asymmetric. -Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] Heads up on sys_fallocate()
This is to give a heads up on a few patches that we will soon be coming
up with. These patches implement a new system call sys_fallocate() and
a new inode operation "fallocate", for persistent preallocation. The
new system call, as Andrew suggested, will look like:

  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

As we are developing and testing the required patches, we decided to
post a preliminary patch and get input from the community to give it
the right direction and shape. First, a little description of the
feature.

Persistent preallocation is a file system feature with which an
application (say, a relational database server) can explicitly
preallocate blocks to a particular file. This feature can be used to
reserve space for a file, mainly to get the following benefits:
1> contiguity - less fragmentation and thus faster access speed, and
2> a guarantee of minimum space availability (depending on how many
   blocks were preallocated) for the file, even if the filesystem
   becomes full.

XFS already has an implementation of this, using an ioctl interface.
And ext4 is now coming up with this feature. In time we may see a few
more file systems implementing this. Thus, it makes sense to have a
more standard interface for this, like this new system call.

Here is the initial and incomplete version of the patch, which can be
used for the discussion until we come up with a set of more complete
patches.
---
 arch/i386/kernel/syscall_table.S |    1 +
 fs/ext4/file.c                   |    1 +
 fs/open.c                        |   18 ++++++++++++++++++
 include/asm-i386/unistd.h        |    3 ++-
 include/linux/fs.h               |    1 +
 include/linux/syscalls.h         |    1 +
 6 files changed, 24 insertions(+), 1 deletion(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_fallocate		/* 320 */
Index: linux-2.6.20.1/fs/ext4/file.c
===================================================================
--- linux-2.6.20.1.orig/fs/ext4/file.c
+++ linux-2.6.20.1/fs/ext4/file.c
@@ -135,5 +135,6 @@ struct inode_operations ext4_file_inode_
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext4_permission,
+	.fallocate	= ext4_fallocate,
 };
Index: linux-2.6.20.1/fs/open.c
===================================================================
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,24 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	inode = file->f_path.dentry->d_inode;
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, offset, len);
+	else
+		ret = -ENOTTY;
+	fput(file);
+out:
+	return ret;
+}
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_fallocate		320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -1124,6 +1124,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	long (*fallocate)(struct inode *, loff_t, loff_t);
 };
 
 struct seq_file;
Index: linux-2.6.20.1/include/linux/syscalls.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/syscalls.h
+++ linux-2.6.20.1/include/linux/syscalls.h
@@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage