Re: [RFC] Heads up on sys_fallocate()
On Mon, 5 March 2007 01:36:36 +0100, Arnd Bergmann wrote:
Using the current glibc implementation on a compressed file system ideally should be a very expensive no-op because you won't actually allocate much space for a file when writing zeroes to it. You also don't benefit from a contiguous allocation in logfs, since flash has uniform seek times over all the medium.
I'd suggest you implement posix_fallocate as a real nop and just return success without doing anything. You could also return ENOSPC in case the blocks requested by posix_fallocate don't fit on the medium without compression, but that is more or less just guesswork (like statfs is).
Quoting POSIX_FALLOCATE(3):
The function posix_fallocate() ensures that disk space is allocated for the file referred to by the descriptor fd for the bytes in the range starting at offset and continuing for len bytes. After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space. If the size of the file is less than offset+len, then the file is increased to this size; otherwise the file size is left unchanged.
Afaics, the (main) purpose of this function is not to decrease fragmentation but to ensure mmap() won't cause any problems because the medium fills up. That problem exists for LogFS as well, once rw mmap() is supported. Simply returning success without doing anything would be a bug. -ENOSPC is a better choice, but still a lame implementation. And falling back on libc to write zeroes in a loop is an exercise in futility.
Does the allocation have to be persistent beyond lifetime of the file descriptor? It would be fairly simple to support the write guarantee while the file is open (or rather the inode remains cached) and drop it afterwards.
Jörn
--
[One] doesn't need to know [...] how to cause a headache in order to take an aspirin.
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001
- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
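For reference, the contract quoted from POSIX_FALLOCATE(3) above can be exercised with a few lines of C. This is only a minimal sketch: the helper names reserve_range() and file_size() are made up for illustration, and it shows the size rule (the file grows to offset+len if shorter, otherwise stays unchanged), not how any particular filesystem reserves the space.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve space for [offset, offset + len) in the file behind fd.
 * Returns 0 on success or an errno value such as ENOSPC. Note that
 * posix_fallocate() reports errors through its return value, not errno.
 * (Hypothetical helper name, for illustration only.)
 */
static int reserve_range(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    return posix_fallocate(fd, offset, len);
}

/* Current file size in bytes, or -1 if fstat() fails. */
static off_t file_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}
```

A caller would open a file read-write, call reserve_range(fd, 0, 65536) once, and treat a non-zero result (typically ENOSPC) as "do not rely on later writes succeeding". A second call with a smaller length leaves the file size as it was.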
Re: [RFC] Heads up on sys_fallocate()
On Sat, Mar 03, 2007 at 11:45:32PM +0100, Arnd Bergmann wrote:
I'd be more happy to have the write out zeroes loop in glibc. And glibc needs to have it anyway, for older kernels.
A generic_fallocate makes sense to me iff we can do it in the kernel significantly more efficiently than in glibc, e.g. by using only a single page in page cache instead of one for each page to be preallocated.
We can't do that with the current page cache interfaces. But what might make sense is to have a block_dump_prealloc that takes a get_block callback to do what you propose. It still wouldn't be entirely generic, but would allow block based filesystems to do a not entirely dumb implementation.
Re: [RFC] Heads up on sys_fallocate()
On Mon, 5 March 2007 00:32:14 +, Anton Altaparmakov wrote:
I don't know how your compression algorithm works [...]
LogFS is designed for flash media, so it does not have to worry much about reducing disk seeks. It is log-structured, which simplifies compression further. When writing a block, it basically compresses it and appends it to the log. Writes only have to be byte-aligned, so no space is lost for padding.
The bad news for posix_fallocate() is that even if libc is smart enough to write random data, mmap() can still cause problems. If the VM decides to write a given page twice, the second write compresses better and the medium has filled up between the two writes, the users will have fun.
Jörn
--
Joern's library part 9: http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html
Re: [RFC] Heads up on sys_fallocate()
On 5 Mar 2007, at 14:37, Theodore Tso wrote:
On Sun, Mar 04, 2007 at 11:22:06PM +, Anton Altaparmakov wrote:
And I specifically did NOT update the initialized size in the inode, thus it will remain at its old value, thus all newly allocated blocks will be considered as present but not initialized, thus a read will always return zero whilst a write will do the right thing and pad with zeroes as necessary (if the write is smaller than the block size, etc).
You're describing a method of doing in-advance preallocation where the filesystem format explicitly has support for this kind of feature in a way that doesn't require pre-zeroing the data blocks in question.
Indeed.
The question which this subthread was concerned about was whether the kernel should get involved in initializing datablocks in the case where the filesystem format does not have this support, or whether this functionality should continue to be done in userspace. Given that glibc already has to support this for older kernels, I would argue that there's no point putting in generic support for filesystems that can't support a more advanced way of doing things.
Yes, I understood that after I had sent my post... And yes, I would agree. If glibc already does this there does not appear to be any value in just moving existing functionality into the kernel. Simply let dumb file systems return ENOSYS and let glibc do it... And any FS which can do it better can implement the function and then glibc should not go anywhere near it.
Best regards,
Anton
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
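The "allocated but not initialized" scheme described in the message above can be modelled in a few lines of C. This is a toy in-memory sketch, not NTFS code: struct toy_inode, its field names and the three helpers are all hypothetical, and a single byte array stands in for the on-disk blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CAPACITY = 64 };

/* Hypothetical stand-in for an inode that tracks two sizes. */
struct toy_inode {
    unsigned char blocks[CAPACITY]; /* "on-disk" bytes, possibly stale */
    size_t allocated;               /* bytes with space reserved */
    size_t initialized;             /* bytes below this hold valid data */
};

/* Preallocate: reserve space, leave `initialized` and the data untouched. */
static int toy_prealloc(struct toy_inode *ino, size_t len)
{
    if (len > CAPACITY)
        return -1;                  /* stands in for ENOSPC */
    if (len > ino->allocated)
        ino->allocated = len;
    return 0;
}

/* Read: anything at or beyond `initialized` is reported as zero. */
static unsigned char toy_read(const struct toy_inode *ino, size_t pos)
{
    assert(pos < ino->allocated);
    return pos < ino->initialized ? ino->blocks[pos] : 0;
}

/* Write: zero-fill the gap up to pos, then store and advance. */
static void toy_write(struct toy_inode *ino, size_t pos, unsigned char v)
{
    assert(pos < ino->allocated);
    if (pos >= ino->initialized) {
        memset(ino->blocks + ino->initialized, 0, pos - ino->initialized);
        ino->initialized = pos + 1;
    }
    ino->blocks[pos] = v;
}
```

The point of the design is visible in toy_prealloc(): it never touches the data, so preallocation costs nothing per block, and stale bytes stay hidden because toy_read() consults the initialized size first.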
Re: [RFC] Heads up on sys_fallocate()
Theodore Tso wrote:
Given that glibc already has to support this for older kernels, I would argue that there's no point putting in generic support for filesystems that can't support a more advanced way of doing things.
Well, I'm sure the kernel can do better than the code we have in libc now. The kernel has access to the bitmasks which say which blocks have already been allocated. The libc code does not and we have to be very simple-minded and simply touch every block. And this means reading it and then writing it back. The kernel would know when the reading part is not necessary.
Add to that the block granularity (we use f_bsize as returned from fstatfs but that's not the best value in some cases) and you have compelling data to have generic code in the kernel. The libc implementation can then go away completely, which is a good thing.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
signature.asc Description: OpenPGP digital signature
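To make the cost of the userspace approach concrete, here is a rough sketch of a "touch every block" loop of the kind described above. It is not glibc's actual code: emulate_fallocate() is a hypothetical name, the 512-byte fallback block size is an assumption, and the loop ignores concurrent writers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/vfs.h>
#include <unistd.h>

/*
 * Userspace emulation of preallocation: touch one byte in every block of
 * [offset, offset + len) so the filesystem has to allocate that block.
 * Returns 0 or an errno value. Not safe against concurrent writers.
 */
static int emulate_fallocate(int fd, off_t offset, off_t len)
{
    struct stat st;
    struct statfs fs;

    assert(offset >= 0 && len > 0);
    if (fstat(fd, &st) != 0 || fstatfs(fd, &fs) != 0)
        return errno;

    /* f_bsize is only a guess at the allocation unit (see above). */
    off_t bsize = fs.f_bsize > 0 ? (off_t)fs.f_bsize : 512;
    off_t end = offset + len;

    /* Visit offset first, then the start of every following block. */
    for (off_t pos = offset; pos < end; pos += bsize - pos % bsize) {
        unsigned char c = 0;

        /* Existing data must survive: read the byte, then write it back. */
        if (pos < st.st_size && pread(fd, &c, 1, pos) < 0)
            return errno;
        if (pwrite(fd, &c, 1, pos) != 1)
            return errno;
    }

    /* Extend the file to offset + len if it was shorter. */
    if (end > st.st_size && pwrite(fd, "", 1, end - 1) != 1)
        return errno;
    return 0;
}
```

Every block inside the old file size costs a read plus a write here, even when the block was already allocated. That is the overhead a kernel-side implementation with access to the allocation bitmaps could skip.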
Re: [RFC] Heads up on sys_fallocate()
Jörn Engel wrote:
Does the allocation have to be persistent beyond lifetime of the file descriptor?
Of course. You call posix_fallocate once for the lifetime of the file when it is created to ensure that all future uses will work. It seems your filesystem will not be able to support this unless compression is turned off.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC] Heads up on sys_fallocate()
Jörn Engel wrote:
The bad news for posix_fallocate() is that even if libc is smart enough to write random data, mmap() can still cause problems.
This is not smart, quite to the contrary. The standard guarantees that all not-yet-written-to places in the file are zero. And if a block has already been written posix_fallocate cannot change it.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC] Heads up on sys_fallocate()
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
Theodore Tso wrote:
Given that glibc already has to support this for older kernels, I would argue that there's no point putting in generic support for filesystems that can't support a more advanced way of doing things.
Well, I'm sure the kernel can do better than the code we have in libc now. The kernel has access to the bitmasks which say which blocks have already been allocated.
The layer of the kernel where a totally generic fallback would be implemented does not have access to this information. We could do a mostly generic helper for block filesystems that allows them to implement fallocate this way without a lot of their own code.
Re: [RFC] Heads up on sys_fallocate()
Jörn Engel wrote:
Of course. You call posix_fallocate once for the lifetime of the file when it is created to ensure that all future uses will work.
That part is not quite clear from the manpage but I trust most people would assume the same.
Not only that, it is what this function is for. In the POSIX committee we've looked at the functions in detail before adding them, even if some information is not in the man page but instead in the Rationale.
Still, it is quite obvious that no one designing this interface has given much thought to compressing filesystems.
You already have problems with supporting the functionality posix_fallocate is supporting. You cannot reliably support MAP_SHARED files if all of a sudden the compression causes an expansion of a block and that causes an ENOSPC error. So, don't expect pity. This is a function in support of a real and reliable implementation of memory mapped files. You don't use MAP_SHARED on such filesystems, it'll eat your kittens sooner or later anyway.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
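The memory-mapped use case referred to above looks roughly like this in C. It is a sketch under the assumption that the filesystem honours the reservation, and map_preallocated() is a hypothetical helper name. Without the reservation step, a store into an unallocated part of a shared mapping on a full medium has no error return to report through, so the failure typically shows up later as a signal or a failed writeback.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve len bytes for fd, then map them shared and writable.
 * Returns the mapping, or NULL with *err set to an errno value.
 */
static void *map_preallocated(int fd, size_t len, int *err)
{
    assert(len > 0);

    /* Fail up front (e.g. ENOSPC) instead of on a later page write. */
    *err = posix_fallocate(fd, 0, (off_t)len);
    if (*err != 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        *err = errno;
        return NULL;
    }
    return p;
}
```

The caller stores through the returned pointer, then uses msync() and munmap() as usual. The only difference from a plain mmap() is that the out-of-space condition is reported once, at setup time.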
Re: [RFC] Heads up on sys_fallocate()
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
Well, I'm sure the kernel can do better than the code we have in libc now. The kernel has access to the bitmasks which say which blocks have already been allocated. The libc code does not and we have to be very simple-minded and simply touch every block. And this means reading it and then writing it back. The kernel would know when the reading part is not necessary.
Add to that the block granularity (we use f_bsize as returned from fstatfs but that's not the best value in some cases) and you have compelling data to have generic code in the kernel. The libc implementation can then go away completely, which is a good thing.
You have a very good point; indeed since we don't export an interface which allows userspace to determine whether or not a block is in use, that does mean a huge amount of churn in the page cache. So maybe it would be worth doing in the kernel as a result, although the libc implementation still wouldn't be able to go away for a long time due to the need to be backwards compatible with older kernels that didn't have this support.
Regards,
- Ted
Re: [RFC] Heads up on sys_fallocate()
Theodore Tso wrote:
[...] although the libc implementation still wouldn't be able to go away for a long time due to the need to be backwards compatible with older kernels that didn't have this support.
It's better than that. If somebody compiles glibc to not run on older kernels at all (tested at runtime) then the code is dropped. E.g., the current Fedora glibc does not support 2.6.8 or earlier. So, don't let the compat code be a factor in the decision making.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC] Heads up on sys_fallocate()
Jan Kara wrote:
On Fri, 02 Mar 2007 09:40:54 +1100 Nathan Scott [EMAIL PROTECTED] wrote:
On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
On Fri, 2 Mar 2007 00:04:45 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
This is to give a heads up on a few patches that we will soon be coming up with. These patches implement a new system call sys_fallocate() and a new inode operation fallocate, for persistent preallocation. The new system call, as Andrew suggested, will look like:
asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
...
I'd agree with Eric on the command flag extension.
Seems like a separate syscall would be better, command sounds a bit ioctl like, especially if that command is passed into the filesystems..
madvise, fadvise, lseek, etc seem to work OK. I get repeatedly traumatised by patch rejects whenever a new syscall gets added, so I'm biased.
The advantage of a command flag is that we can add new modes in the future without causing lots of churn, waiting for arch maintainers to catch up, potentially adding new compat code, etc. Rename it to mode? ;)
I am wondering if it is useful to add another mode to advise block allocation policy? Something like indicating which physical block/block group to allocate from (goal), and whether to ask for strictly contiguous blocks. This will help preallocation or reservation to choose the right blocks for the file.
Yes, I also think this would be useful so you can guide preallocation for things like defragmentation (e.g. preallocate space for the file being defragmented and move the file to it).
Honza
Yep, I think it makes sense to use preallocation for defragmentation. After all, both preallocation and defragmentation call the underlying filesystem's multiple-block allocator to try to allocate a chunk of contiguous blocks on disk. The ext4 online defrag implementation by Takashi already supports choosing a goal allocation block to guide the ext4 block allocator to place the defragmented file in a specific location.
Passing a little more of a hint to sys_fallocate() (i.e., a goal block, and/or whether the goal block is more important than the size of the prealloc extent) might make it more useful for the original goal (getting contiguous and guaranteed blocks) and for defragmentation.
Regards,
Mingming
Re: [RFC] Heads up on sys_fallocate()
Jan Kara wrote:
I am wondering if it is useful to add another mode to advise block allocation policy? Something like indicating which physical block/block group to allocate from (goal), and whether to ask for strictly contiguous blocks. This will help preallocation or reservation to choose the right blocks for the file.
Yes, I also think this would be useful so you can guide preallocation for things like defragmentation (e.g. preallocate space for the file being defragmented and move the file to it).
Hint policies for allocation would certainly be useful, but I think they belong outside this interface. I.e. you could flag an inode for whatever allocation you choose, and -then- call posix_fallocate so that the allocator will take the hints you've given it.
See also this blurb from the posix_fallocate definition:
It is implementation-defined whether a previous posix_fadvise() call influences allocation strategy.
FWIW I don't see a lot of point in asking for strictly contiguous blocks - the allocator will presumably try to do this in any case, and I'm not sure when you would want to fail if you get more than one extent...?
-Eric
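The "hint first, then allocate" ordering suggested above can be written with existing POSIX calls. This is only a sketch: hint_then_reserve() is a hypothetical name, POSIX_FADV_SEQUENTIAL is just one example of advice, and as the quoted blurb says, whether the advice affects allocation at all is implementation-defined.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Give the allocator a hint about the first len bytes of fd, then
 * reserve them. Returns 0 or an errno value from posix_fallocate().
 */
static int hint_then_reserve(int fd, off_t len)
{
    assert(len > 0);

    /* Advisory only: the filesystem may ignore it, so a failure here
     * is not treated as an error. */
    (void)posix_fadvise(fd, 0, len, POSIX_FADV_SEQUENTIAL);

    return posix_fallocate(fd, 0, len);
}
```

Keeping the hint in a separate call means the allocation interface itself needs no extra arguments. The trade-off is that a caller cannot tell whether the hint had any effect.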
Re: [RFC] Heads up on sys_fallocate()
Jörn Engel wrote:
Does the allocation have to be persistent beyond lifetime of the file descriptor? It would be fairly simple to support the write guarantee while the file is open (or rather the inode remains cached) and drop it afterwards.
The posix_fallocate() function shall ensure that any required storage for regular file data starting at offset and continuing for len bytes is allocated on the file system storage media.
I interpret "on the storage media" to mean that it is persistent.
-Eric
Why should we teach students Linux??
Hello listers,
I'm a tutor at the Faculty ICT, department NID. This is a bachelor degree and we are preparing our students to become something more than just System Administrators (such as manager, consulting, etc). Since this department is part of the Microsoft camp, the students are educated mostly in this direction, which I think is not a bad thing. A better thing would be if we could give our students the opportunity to meet both systems on the same level, at least, that is my opinion.
To change a curriculum of a study, I need a solid case. So if somebody knows a link/document about why we should educate our students in the Linux OS, please send it. Or an article about the usage of Linux in companies. I hope you will all take some time to send me your best links/documents.
with best regards
Roel Bindels
Re: [RFC] Heads up on sys_fallocate()
On Mon, Mar 05, 2007 at 12:02:59PM -0800, Mingming Cao wrote:
Yep, I think it makes sense to use preallocation for defragmentation. After all, both preallocation and defragmentation call the underlying filesystem's multiple-block allocator to try to allocate a chunk of contiguous blocks on disk. The ext4 online defrag implementation by Takashi already supports choosing a goal allocation block to guide the ext4 block allocator to place the defragmented file in a specific location.
Passing a little more of a hint to sys_fallocate() (i.e., a goal block, and/or whether the goal block is more important than the size of the prealloc extent) might make it more useful for the original goal (getting contiguous and guaranteed blocks) and for defragmentation.
fallocate with the whence argument and flags is already quite complicated. I'd rather have another call for placement decisions, one that would be called on an fd to set placement decisions for any further allocations (prealloc, write, etc).