Re: Ext2 / VFS projects
Erez Zadok wrote:
> [...]
> (2) Inline functions moved from linux/{fs,mm}/*.c to header files so they
>     can be included in the same original source code as well as stackable
>     file systems:
>
>     check_parent macro (fs/namei.c -> include/linux/dcache_func.h)
>     lock_parent        (fs/namei.c -> include/linux/dcache_func.h)
>     get_parent         (fs/namei.c -> include/linux/dcache_func.h)
>     unlock_dir         (fs/namei.c -> include/linux/dcache_func.h)
>     double_lock        (fs/namei.c -> include/linux/dcache_func.h)
>     double_unlock      (fs/namei.c -> include/linux/dcache_func.h)

That sounds like a good idea: fs/nfsd/vfs.c currently contains copies of
most of these functions...

-- Manfred
Re: Ext2 / VFS projects
In message <[EMAIL PROTECTED]>, Manfred Spraul writes:
> Erez Zadok wrote:
> > [...]
> > (2) Inline functions moved from linux/{fs,mm}/*.c to header files so they
> >     can be included in the same original source code as well as stackable
> >     file systems:
> >
> >     check_parent macro (fs/namei.c -> include/linux/dcache_func.h)
> >     lock_parent        (fs/namei.c -> include/linux/dcache_func.h)
> >     get_parent         (fs/namei.c -> include/linux/dcache_func.h)
> >     unlock_dir         (fs/namei.c -> include/linux/dcache_func.h)
> >     double_lock        (fs/namei.c -> include/linux/dcache_func.h)
> >     double_unlock      (fs/namei.c -> include/linux/dcache_func.h)
>
> That sounds like a good idea: fs/nfsd/vfs.c currently contains copies of
> most of these functions...

I agree.  I didn't want to make copies of those b/c I got burnt in the past
when they changed subtly and I didn't notice the change.

> --
> Manfred

Erez.
Re: Ext2 / VFS projects
In message <[EMAIL PROTECTED]>, Tigran Aivazian writes:
> I noticed the stackable fs item on Alan's list ages ago but there was no
> pointer to the patch (I noticed FIST stuff but surely that is not a "small
> passive patch" you are referring to?)

Yes, the patches are small and passive.  No new vfs/mm code is added or
changed!  The most important part of my patches had already been included
since 2.3.17; that was an addition/renaming of a private field in struct
vm_area_struct.  What's left are things that are necessary to support
stacking for the first time in linux: exposing some functions/symbols from
{mm,fs}/*.c, adding externs to headers, additions to ksyms.c, and moving
some macros and inline functions from private .c files to a header, so they
can be included in any file system.

I've used these patches on dozens of linux machines for the past 2+ years,
and have had no problems.  I constantly get people asking me when my
patches will become part of the main kernel.  I have about 9 active
developers who write file systems using my templates.  I've had more than
21,000 downloads of my templates in the past two years.

> So, my point is - if you point everyone to those patches, someone might
> help Alan out if one feels like it (and has time).

http://www.cs.columbia.edu/~ezk/research/software/fist-patches/

The latest 2.3 patches at that URL include two things: my small main kernel
patches, and a fully working lofs.  The lofs of course is several thousand
lines of code, but it is not strictly necessary to include it with the main
kernel; it can be distributed and built separately, just as my other f/s
modules are.  However, I do think that lofs is a useful enough f/s that it
should be part of the main kernel.

If you go to the 2.3 directory under the above URL, there's a README
describing the latest 2.3 patches.  I've included it below, so everyone can
read it and see what my patches do, and how harmless they are.
BTW, I've got a prototype unionfs for linux if anyone is interested.

> Regards,
> Tigran.

As always, I'll be delighted to help *anyone* use my work, and would love
to help the linux maintainers incorporate my patches, answer any concerns
they might have, etc.

Cheers,
Erez.

==

Summary of changes for 2.3.25 to support stackable file systems and lofs.
(Note: some of my previous patches had been incorporated in 2.3.17.)

(1) Created a new header file include/linux/dcache_func.h.  This header
file contains dcache-related definitions (mostly static inlines) used by my
stacking code and by fs/namei.c.  Ion and I tried to put these definitions
in fs.h and dcache.h to no avail.  We would have to make lots of changes to
fs.h or dcache.h and other .c files just to get these few definitions in.
In the interest of simplicity and minimizing kernel changes, we opted for a
new, small header file.  This header file is included in fs/namei.c because
everything in dcache_func.h was taken from fs/namei.c.  And of course,
these static inlines are useful for my stacking code.

If you don't like the name dcache_func.h, maybe you can suggest a better
name.  Maybe namei.h?  If you don't like having a new header file, let me
know what you'd prefer instead and I'll work on it, even if it means making
more changes to fs.h, namei.c, and dcache.h...

(2) Inline functions moved from linux/{fs,mm}/*.c to header files so they
can be included in the same original source code as well as stackable file
systems:

    check_parent macro (fs/namei.c -> include/linux/dcache_func.h)
    lock_parent        (fs/namei.c -> include/linux/dcache_func.h)
    get_parent         (fs/namei.c -> include/linux/dcache_func.h)
    unlock_dir         (fs/namei.c -> include/linux/dcache_func.h)
    double_lock        (fs/namei.c -> include/linux/dcache_func.h)
    double_unlock      (fs/namei.c -> include/linux/dcache_func.h)

(3) Added to include/linux/fs.h an extern definition for default_llseek.
(4) include/linux/mm.h: also added extern definitions for

    filemap_swapout
    filemap_swapin
    filemap_sync
    filemap_nopage

so they can be included in other code (esp. stackable f/s modules).

(5) Added EXPORT_SYMBOL declarations in kernel/ksyms.c for functions which
I now exposed to (stackable f/s) modules:

    EXPORT_SYMBOL(___wait_on_page);
    EXPORT_SYMBOL(add_to_page_cache);
    EXPORT_SYMBOL(default_llseek);
    EXPORT_SYMBOL(filemap_nopage);
    EXPORT_SYMBOL(filemap_swapout);
    EXPORT_SYMBOL(filemap_sync);
    EXPORT_SYMBOL(remove_inode_page);
    EXPORT_SYMBOL(swap_free);
    EXPORT_SYMBOL(nr_lru_pages);
    EXPORT_SYMBOL(console_loglevel);

(6) mm/filemap.c: made the function filemap_nopage non-static, so it can be
called from other places.  This was not an inline function so there's no
performance impact.  ditto
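For readers who haven't met these helpers: the essential idea behind
double_lock/double_unlock can be sketched in plain userspace C (a
hypothetical illustration using pthread mutexes, not the kernel code
itself) - always take the two locks in a fixed address order, so two tasks
locking the same pair of directories can never deadlock against each other:

```c
#include <pthread.h>

/* Hypothetical userspace sketch of the double_lock/double_unlock idea:
 * acquire two locks in a fixed (address) order so two threads locking
 * the same pair can never deadlock.  The kernel helpers do the same
 * with the inode semaphores of the two parent directories of a rename. */
static void double_lock(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a == b) {                       /* same directory: lock once */
        pthread_mutex_lock(a);
        return;
    }
    if ((unsigned long) a > (unsigned long) b) {
        pthread_mutex_t *t = a;         /* swap into address order */
        a = b;
        b = t;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

static void double_unlock(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (a != b)
        pthread_mutex_unlock(b);
}
```

Note that callers may pass the two directories in either order; the
address comparison normalizes it, which is exactly why every caller must
go through the helper rather than locking by hand.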
Re: Ext2 / VFS projects
Hi,

On Thu, 10 Feb 2000 10:27:29 -0500 (EST), Alexander Viro
<[EMAIL PROTECTED]> said:

> Correct, but that's going to make the design much more complex - you
> really don't want to do it for anything other than sub-page stuff
> (probably even sub-sector).  Which leads to 3 levels - allocation
> block/IO block/sub-sector fragment.  Not to mention the fact that for
> cases when you have 1K fragments and really large blocks you don't want
> all this mess around...  It's doable, indeed, but...

Sure, but to me the main question is this --- can we do this sort of
fragment support in ext3 without having to add complexity to the rest of
the VM/VFS?  I think the answer is yes.

--Stephen
Re: Ext2 / VFS projects
I noticed the stackable fs item on Alan's list ages ago but there was no
pointer to the patch (I noticed FIST stuff but surely that is not a "small
passive patch" you are referring to?)

So, my point is - if you point everyone to those patches, someone might
help Alan out if one feels like it (and has time).

Regards,
Tigran.

On Thu, 10 Feb 2000, Erez Zadok wrote:
> Also, I really hope that my remaining (small, passive) patches to the VFS
> to support stackable file systems will be incorporated soon.
Re: Ext2 / VFS projects
In message <[EMAIL PROTECTED]>, Matthew Wilcox writes:
> Greetings.  Ted Ts'o recently hosted an ext2 puffinfest where we
> discussed the future of the VFS and ext2.  Ben LaHaise, Phil Schwan,
[...]

Also, I really hope that my remaining (small, passive) patches to the VFS
to support stackable file systems will be incorporated soon.

Cheers,
Erez.
Re: Ext2 / VFS projects
Hi,

On Wed, 09 Feb 2000 11:31:03 -0500, Matthew Wilcox <[EMAIL PROTECTED]> said:

> fine-grained locking
> [remove test_and_set_bit()]

The critical one here is the superblock lock.

--Stephen
Re: Ext2 / VFS projects
On Thu, 10 Feb 2000, Stephen C. Tweedie wrote:
> That shouldn't matter.  In the new VM it would be pretty trivial for the
> filesystem to reserve a separate address_space against which to cache
> fragment blocks.  Populating that address_space when we want to read a
> fragment block doesn't have to be any more complex than populating the
> page cache already is.  IO itself shouldn't be hard.

Correct, but that's going to make the design much more complex - you really
don't want to do it for anything other than sub-page stuff (probably even
sub-sector).  Which leads to 3 levels - allocation block/IO block/sub-sector
fragment.  Not to mention the fact that for cases when you have 1K fragments
and really large blocks you don't want all this mess around...  It's doable,
indeed, but...
Re: Ext2 / VFS projects
Hi,

On Wed, 9 Feb 2000 14:30:13 -0500 (EST), Alexander Viro
<[EMAIL PROTECTED]> said:

> On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:
> > with 2k blocks and 128 byte fragments, we get to really reduce wasted
> > space below any other system i've ever experienced.
>
> Erm...  I'm afraid that you are missing the point.  You will get the
> hardware sectors shared between the files.  And you can't pass requests
> smaller than that.  _And_ you have to lock the bh when you do IO.  Now,
> estimate the fun with deadlocks...

That shouldn't matter.  In the new VM it would be pretty trivial for the
filesystem to reserve a separate address_space against which to cache
fragment blocks.  Populating that address_space when we want to read a
fragment block doesn't have to be any more complex than populating the page
cache already is.  IO itself shouldn't be hard.

Yes, this will end up double-caching fragmented files to some extent, since
we'll have to reserve a separate, non-physically-mapped page for the tail
of a fragmented file.

Allocation/deallocation of fragments themselves obviously has to be done
very carefully, but we already have to deal with that sort of race in the
filesystem for normal allocations --- this isn't really any different in
principle.

--Stephen
Re: Ext2 / VFS projects
On Thu, Feb 10, 2000 at 03:04:53AM +0100, Jeremy Fitzhardinge wrote:
> On 09-Feb-00 Andi Kleen wrote:
> > On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:
> > > [...]
> > > How about secure deletion?
> >
> > 1.3 used to have some simple-minded overwriting of deleted data when
> > the 's' attribute was set.  That got lost with 2.0+.
> >
> > Secure disk overwriting that is immune to manual surface probing seems
> > to take a lot more effort (Colin Plumb's sterilize does 25 overwrites
> > with varying bit patterns).  Such a complicated procedure is probably
> > better kept in user space.  What I would like is some way to have a
> > sterilize daemon running, and when a file with the 's' attribute gets
> > deleted the VFS would open a new file descriptor for it, pass it to
> > sterilized (sterild?) using a unix control message and let it do its
> > job.
> >
> > What does the audience think?  Should such a facility have kernel
> > support or not?  I think secure deletion is an interesting topic and it
> > would be nice if Linux supported it better.
>
> You have to be careful that you don't leak the file you're trying to
> eradicate into the swap via the sterilize daemon.  I guess simply never
> reading the file is a good start.

sterilize does that.  You have, of course, to be careful that you didn't
leak its content to swap before (one way around that is encrypted swapping).

> The other question is whether you're talking about an ext2-specific
> thing, or whether it's a general service all filesystems provide.  Many
> filesystem

I was actually only thinking about ext2 (because only it has a 's' bit and
the thread is about ext2's future).

> designs, including ext3 w/ journalling, reiserfs(?) and the NetApp Wafl
> filesystem, don't let a process overwrite an existing block on disk.
> Well, ext3 does, but only via the journal; wafl never does.  There's also
> the question of what happens when you have a RAID device under the
> filesystem, especially with hot-swappable disks.
reiserfs lets you when you don't change the file size (if you do, it is
possible that the file is migrated from a formatted node to an unformatted
node).  sterilize does not change file sizes.  ext3 only doesn't let you
when you do data journaling (good point, I forgot that).

RAID0/RAID1 are no problem I think, because you always have well defined
block(s) to write to.  The wipe data does not depend on the old data on the
disk, so e.g. on a simple mirrored configuration both blocks would be
sterilized in parallel.  RAID5 devices could be a problem, especially when
they do data journaling (I think most only journal some metadata).  It is
not clear how the sterilize algorithms interact with the XORed blocks.  If
you swap your disks in between, you lose.

> Perhaps a better approach, since we're talking about a privileged
> process, is to get a list of raw blocks and go directly to the disk.
> You'd have to be very careful to synchronize with the filesystem...

Not too much.  The file still exists, but there are no references to it
outside sterild.  No other process can access it.  Assuming the file system
does not have fragments, the raw io has block granularity, and the file was
fdatasync'ed before, you could directly access it without worrying about
any file system interference.  If the fs has fragments you need the
infrastructure needed for O_DIRECT (I think that is planned anyways).

With an "invalidate all dirty buffers for file X" call you could optimize
part of the fdatasync writes away, but a good sterilize needs so many
writes anyways (25+) that it probably does not make much difference.  The
data would only be really deleted when the system is turned off, because it
could partly still exist in some not yet reused buffers.

-Andi
Re: Ext2 / VFS projects
On 09-Feb-00 Andi Kleen wrote:
> On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:
> > [...]
> > How about secure deletion?
>
> 1.3 used to have some simple-minded overwriting of deleted data when the
> 's' attribute was set.  That got lost with 2.0+.
>
> Secure disk overwriting that is immune to manual surface probing seems to
> take a lot more effort (Colin Plumb's sterilize does 25 overwrites with
> varying bit patterns).  Such a complicated procedure is probably better
> kept in user space.  What I would like is some way to have a sterilize
> daemon running, and when a file with the 's' attribute gets deleted the
> VFS would open a new file descriptor for it, pass it to sterilized
> (sterild?) using a unix control message and let it do its job.
>
> What does the audience think?  Should such a facility have kernel support
> or not?  I think secure deletion is an interesting topic and it would be
> nice if Linux supported it better.

You have to be careful that you don't leak the file you're trying to
eradicate into the swap via the sterilize daemon.  I guess simply never
reading the file is a good start.

The other question is whether you're talking about an ext2-specific thing,
or whether it's a general service all filesystems provide.  Many filesystem
designs, including ext3 w/ journalling, reiserfs(?) and the NetApp Wafl
filesystem, don't let a process overwrite an existing block on disk.  Well,
ext3 does, but only via the journal; wafl never does.  There's also the
question of what happens when you have a RAID device under the filesystem,
especially with hot-swappable disks.

Perhaps a better approach, since we're talking about a privileged process,
is to get a list of raw blocks and go directly to the disk.  You'd have to
be very careful to synchronize with the filesystem...

	J
Re: Ext2 / VFS projects
On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:
> This requires Ben's work to decouple the ext2 allocation size from the
> hardware page size.  We would _always_ want to write out a fragment block
> as one to ensure that the fragment descriptor wasn't at odds with the
> contents of the block.  Imagine the descriptor not being written out
> after the block was compacted.

My initial plan is to decouple the allocation size from the hardware page
size only for the cases where the allocation size is larger than the
physical block size of the disk.  Going beyond that is non-trivial, but
doable.  It may only be interesting for e2compr, since large blocks with
512 byte fragments will rock.

		-ben
Re: Ext2 / VFS projects
On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:
[...]

How about secure deletion?

1.3 used to have some simple-minded overwriting of deleted data when the
's' attribute was set.  That got lost with 2.0+.

Secure disk overwriting that is immune to manual surface probing seems to
take a lot more effort (Colin Plumb's sterilize does 25 overwrites with
varying bit patterns).  Such a complicated procedure is probably better
kept in user space.  What I would like is some way to have a sterilize
daemon running, and when a file with the 's' attribute gets deleted the VFS
would open a new file descriptor for it, pass it to sterilized (sterild?)
using a unix control message and let it do its job.

What does the audience think?  Should such a facility have kernel support
or not?  I think secure deletion is an interesting topic and it would be
nice if Linux supported it better.

sterilize also does some tricks to overwrite entries in directories, but I
see no easy way to make that fit into the kernel.

Comments?

-Andi
--
This is like TV.  I don't like TV.
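For concreteness, the core of such a user-space overwrite pass might look
like the following minimal sketch.  This is not Colin Plumb's actual code:
real sterilize uses 25 carefully chosen bit patterns, while this
illustration uses four, and a production tool would also worry about short
writes, write ordering, and filesystems that don't overwrite in place.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Overwrite every byte of an open file several times, forcing each
 * pattern to disk before starting the next one.  Returns 0 on success,
 * -1 on error.  (Illustrative pattern set; not sterilize's real one.) */
static int wipe_fd(int fd)
{
    static const unsigned char patterns[] = { 0x00, 0xFF, 0x55, 0xAA };
    unsigned char buf[4096];
    struct stat st;
    size_t p;

    if (fstat(fd, &st) < 0)
        return -1;

    for (p = 0; p < sizeof(patterns); p++) {
        off_t left = st.st_size;

        memset(buf, patterns[p], sizeof(buf));
        if (lseek(fd, 0, SEEK_SET) < 0)
            return -1;
        while (left > 0) {
            size_t n = left > (off_t) sizeof(buf)
                     ? sizeof(buf) : (size_t) left;
            if (write(fd, buf, n) != (ssize_t) n)
                return -1;
            left -= (off_t) n;
        }
        if (fdatasync(fd) < 0)   /* force this pattern to disk */
            return -1;
    }
    return 0;
}
```

Note the fdatasync() between passes: without it the kernel would happily
collapse all the passes into one write of the final pattern in the buffer
cache, which is exactly the swap/cache leakage problem discussed above.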
Re: Ext2 / VFS projects
On Wed, Feb 09, 2000 at 02:30:13PM -0500, Alexander Viro wrote:
> On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:
> > with 2k blocks and 128 byte fragments, we get to really reduce wasted
> > space below any other system i've ever experienced.
>
> Erm...  I'm afraid that you are missing the point.  You will get the
> hardware sectors shared between the files.  And you can't pass requests
> smaller than that.  _And_ you have to lock the bh when you do IO.  Now,
> estimate the fun with deadlocks...

This requires Ben's work to decouple the ext2 allocation size from the
hardware page size.  We would _always_ want to write out a fragment block
as one to ensure that the fragment descriptor wasn't at odds with the
contents of the block.  Imagine the descriptor not being written out after
the block was compacted.
Re: Ext2 / VFS projects
   Date: Wed, 9 Feb 2000 13:25:23 -0500
   From: [EMAIL PROTECTED]

   Yes.  Based on my misunderstanding of how BSD did fragments, Ted and I
   came up with an interestingly different way of doing fragments.  Here's
   the basic idea: Keep the bitmaps of _blocks_ instead of fragments and
   use a second bitmap to identify which blocks are non-full fragment
   blocks.  Use the last fragment in a fragment block as a fragment block
   descriptor to indicate which inode each fragment belongs to (exact
   format still to be determined).  Fragment blocks are kept compact as
   otherwise we would have to deal with internal fragmentation.

The further refinement of this plan is to always keep the fragments
compacted, and then use an indirection table.  So instead of storing the
fragment address in the inode, we store an index into the fragment location
table to find the fragment.  This makes it trivial to pack the fragments in
the block to avoid internal fragmentation.  Then you don't need a bitmap to
keep track of the fragment allocation; you just need a single entry in the
administrative fragment block to point at the next free fragment.

Note that the idea here is to set the block size to the maximum ideal
transfer size for disks.  For modern disks, that's probably something like
64k or 128k.  (i.e., it doesn't take much more time to read 64k compared
to 1k).

The one downside of this plan is that when you delete a file with a tail,
you have to do an extra block read/write to update the allocation
information in the fragment block.  In the BSD scheme, you just have to
update the allocation bitmap.  This does slow deletions by a small amount,
but that might not be that big of an issue.

The reason why we were considering this sort of thing is because when the
difference between the fragment and block size grows, the potential problem
of internal fragmentation is a real issue.

						- Ted
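Sketched as a purely hypothetical on-disk structure (every name, field
width, and constant below is invented for illustration; the thread itself
says the exact format is still to be determined), the
descriptor-plus-indirection-table idea might look like:

```c
#include <stdint.h>

/* Hypothetical layout for the scheme above: the last fragment of each
 * fragment block holds this descriptor.  An inode stores an index into
 * slot[] rather than a raw fragment address, so fragments can be packed
 * and compacted without touching any inode. */
#define BLOCK_SIZE     (256 * 1024)            /* "maximum ideal transfer size" */
#define FRAG_SIZE      1024
#define FRAGS_PER_BLK  (BLOCK_SIZE / FRAG_SIZE)

struct frag_slot {
    uint32_t ino;        /* owning inode; 0 marks a free slot */
    uint16_t frag_pos;   /* where the fragment currently sits in the block */
    uint16_t len;        /* bytes in use within the fragment */
};

struct frag_blk_desc {
    uint16_t next_free;  /* the "single entry ... to point at the next
                          * free fragment" */
    uint16_t nr_used;    /* live fragments, for compaction bookkeeping */
    struct frag_slot slot[FRAGS_PER_BLK - 1];  /* the indirection table */
};
```

Deleting a tail then means clearing one slot and updating
next_free/nr_used, which is exactly the extra read/write of the fragment
block that Ted identifies as the downside versus BSD's bitmap-only update.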
Re: Ext2 / VFS projects
Lately I have been engaged in other activities, and haven't had time to
check upon VFS layer happenings.

On Wed, Feb 09, 2000 at 11:31:03AM -0500, Matthew Wilcox wrote:
> Greetings.  Ted Ts'o recently hosted an ext2 puffinfest where we
> discussed the future of the VFS and ext2.  Ben LaHaise, Phil Schwan,
...

Add pathconf() to the VFS.  Right now the peeks I have had at the 2.3
series do show that people do WRONG things with the O_LARGEFILE flag bit
per what the LFS semantics are telling.

The filesystem must be able to pass to the VFS what capabilities a given
file/directory has -- like can file sizes exceeding 2G be used at all...
(EXT2, UFS, NFSv3 can, MINIX et.al. can't..)  (And filename sizes supported
at directories, and...)

These don't look right even at egrep terseness: (2.3.42)

[root@mea linux]# egrep O_LARGEFILE $cc
./fs/open.c:	flags |= O_LARGEFILE;
./fs/ext2/file.c: * the caller didn't specify O_LARGEFILE.  On 64bit systems we force
./fs/ext2/file.c:	if (inode->u.ext2_i.i_high_size && !(filp->f_flags & O_LARGEFILE))
./fs/udf/file.c: * On 64 bit systems we force on O_LARGEFILE in sys_open.
./fs/udf/file.c:	if ((inode->i_size & 0xUL) && !(filp->f_flags & O_LARGEFILE))
./arch/sparc64/kernel/sys_sparc32.c: * not force O_LARGEFILE on.
./arch/sparc64/solaris/fs.c:	if (flags & 0x2000) fl |= O_LARGEFILE;

The limit at 32-bit systems is 2G, not 4G, and NO kernel space system shall
(aside of the sys_open64() syscall) set that flag.  (Which I think the
sparc64/solaris thing does.)

The tests of file open at EXT2 and UDF (?!) should, I think, be
conditionalized under a wrapper of:

	#if BITS_PER_LONG == 32
	...
	#endif

Sigh, so much to do, so little time for kernel hacking...

/Matti Aarnio <[EMAIL PROTECTED]>
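The intent of that wrapper can be sketched in plain C.  Everything here is
hypothetical for illustration (the helper name, the fake flag value -- the
real O_LARGEFILE value is per-architecture -- and taking the word size as a
parameter so the rule is visible on any build host):

```c
#include <errno.h>
#include <stdint.h>

#define FAKE_O_LARGEFILE 0100000   /* illustrative value only */

/* Sketch of the LFS rule: on a 32-bit kernel, opening a file whose size
 * exceeds 2G-1 without O_LARGEFILE must fail with EOVERFLOW.  On a
 * 64-bit kernel the flag is forced on at sys_open time, so the check
 * never rejects anything. */
static int check_largefile(int bits_per_long, int64_t i_size, int f_flags)
{
    if (bits_per_long == 32 &&
        i_size > 0x7fffffffLL &&          /* 2G-1: the limit is 2G, not 4G */
        !(f_flags & FAKE_O_LARGEFILE))
        return -EOVERFLOW;
    return 0;
}
```

In the kernel the bits_per_long parameter would of course be the
compile-time BITS_PER_LONG, so the whole body compiles away on 64-bit
architectures, which is Matti's point about the #if wrapper.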
Re: Ext2 / VFS projects
On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:
> with 2k blocks and 128 byte fragments, we get to really reduce wasted
> space below any other system i've ever experienced.

Erm...  I'm afraid that you are missing the point.  You will get the
hardware sectors shared between the files.  And you can't pass requests
smaller than that.  _And_ you have to lock the bh when you do IO.  Now,
estimate the fun with deadlocks...
Re: Ext2 / VFS projects
On Wed, Feb 09, 2000 at 12:02:46PM -0500, Alexander Viro wrote:
> > 256k blocks with 1k fragments (viro)
> > [also 8k blocks with 256-byte fragments!]
>                       ^^^
> HUH???  You want to deal with sharing sectors between the different files?

Yes.  Based on my misunderstanding of how BSD did fragments, Ted and I came
up with an interestingly different way of doing fragments.  Here's the
basic idea: Keep the bitmaps of _blocks_ instead of fragments and use a
second bitmap to identify which blocks are non-full fragment blocks.  Use
the last fragment in a fragment block as a fragment block descriptor to
indicate which inode each fragment belongs to (exact format still to be
determined).  Fragment blocks are kept compact as otherwise we would have
to deal with internal fragmentation.

with 256k blocks, we get to grow a block group to 16GB (probably excessive
with today's discs) and then we don't have any further problems with not
enough free space.

with 2k blocks and 128 byte fragments, we get to really reduce wasted space
below any other system i've ever experienced.
Re: Ext2 / VFS projects
On Wed, 9 Feb 2000, Jeff Garzik wrote:
> al viro was kind enough to e-mail me some of his thoughts on banishing
> the big kernel lock from the VFS.  Though my time with the VFS has been
> nil in the past few months, I'd still like to work on this if no one
> beats me to it.
>
> IIRC the two big items are dcache/dentry and inode threading.

s/inode/friggin' POSIX locks shit/  And it's a bit different story -
ext2fs needs some serialization of its own; after all it got some internal
structures ;-)

> Any and all ideas for online defrag, please post.  I'm very interested.

See below.

> > delayed allocation
>
> this needs to be in the VFS desperately.  every new & advanced
> filesystem is winding up implementing their own logic for this...
>
> > Address spaces (viro)
>
> can someone elaborate?

Urgh.  It's a long(ish) story.  Basically, we are getting address_space
methods.  It removes ->readpage/->writepage/->get_block out of
inode_operations, BTW.  What it means: we are getting rid of a lot of code
duplication (data semantics; as in normal block fs vs. no-holes fs vs.
extent-based with holes vs. fragments-handling a-la FFS vs. fs with small
files embeddable into inodes vs. ...).  Address_space is an MMU.  This way
they become separated from filesystems proper (i.e. layout, etc.).  If you
want a more coherent description - ask and I'll write it.  The latest
version of my patch sits on ftp.math.psu.edu/pub/viro/as-patch-26z2
(warning: needs testing).
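To make "address_space methods" concrete: the per-mapping operations table
ends up looking roughly like the sketch below.  The exact field set is
reconstructed from memory of the 2.3/2.4-era code and need not match Al's
patch; the point is only that these hooks move out of inode_operations and
attach to the mapping instead.

```c
/* Forward declarations so the sketch stands alone outside the kernel. */
struct file;
struct page;
struct address_space;

/* Roughly the shape of the per-mapping operations table: what used to
 * be readpage/writepage (and get_block-style mapping via bmap) in
 * inode_operations becomes per-address_space, so the "MMU" part is
 * separated from filesystem layout proper. */
struct address_space_operations {
    int (*writepage)(struct file *, struct page *);
    int (*readpage)(struct file *, struct page *);
    int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
    int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
    int (*bmap)(struct address_space *, long);
};
```

This is also what makes Stephen's "reserve a separate address_space for
fragment blocks" suggestion cheap: a second mapping is just another
instance of this table plus its page list, not new VFS machinery.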
Re: Ext2 / VFS projects
On Wed, 9 Feb 2000, Tigran Aivazian wrote:
> On Wed, 9 Feb 2000, Matthew Wilcox wrote:
> > fix posix fcntl locks (willy)
>
> Very interesting.  I haven't checked recently but which part of POSIX
> fcntl locks is broken?

Take a _large_[1] barf-bag and read posix_locks_deadlock(), for one.  Could
you spell "totally inadequate data structures"?

[1] You'll need it.  Don't complain about the ruined keyboard - you've been
warned.
Re: Ext2 / VFS projects
On Wed, 9 Feb 2000, Matthew Wilcox wrote:
> fix posix fcntl locks (willy)

Very interesting.  I haven't checked recently but which part of POSIX fcntl
locks is broken?

Tigran.
Re: Ext2 / VFS projects
On Wed, 9 Feb 2000, Matthew Wilcox wrote:
> 2.4:
>   Collapsed indirect blocks [readonly] (willy)
>
> 2.5:
>   Journalling (sct)
>   Access to lownumbered inodes
>   Dynamic inode tables (phil)
>   Btree directories (phil)
>   ext2 allocation page size greater than cpu page size (bcrl)
>   256k blocks with 1k fragments (viro)
>   [also 8k blocks with 256-byte fragments!]
                      ^^^
HUH???  You want to deal with sharing sectors between the different files?

>   fine-grained locking
>   [remove test_and_set_bit()]

e2compr requires VM modifications (and basically nothing else).

nounlink attribute flag may go into 2.4.early, FWIC.

> Some interesting mm/fs projects:
>
> 2.4:
>   Move directories to page cache (bcrl)
>   Address spaces (viro)

Variant submitted to Linus.

>   Move fhandle <-> dentry conversion functions to VFS (viro)
>   move silly_rename to VFS (viro)
>   Investigate buddy allocator algorithms with more interesting properties (bcrl)
>   bdflush may need tuning (bcrl)
>
> 2.5:
>   Removal of buffer heads (bcrl)
>   fix posix fcntl locks (willy)
>   sort out interface to block devices (viro)

I still hope to get at least the interface parts into 2.4...

Other 2.4.early stuff:
  * caching ext2_find_entry() results in dentry (patch exists, obviously
    correct and well-tested).
  * caching the position of last lookup and doing cyclic lookups from that
    place (literal copying from VFAT, where it worked since the early
    Summer; it's an old BSD optimization).
  * pre-alloc for directories (yup, right now it's _off_).

Mandatory 2.3.late stuff:
  * serialization between truncate and write.  There are races...
Re: Ext2 / VFS projects
Caveat reader: With the exception of procfs stuff in 2.3.x, most of my VFS
participation thus far has been of the "I want to work on this when I get
time" sort of participation. ;-)

First, my request: Add an fcntl flag, O_NOCACHE (or O_DIRECT,
unimplemented) which allows an app to hint that it does not want the OS to
cache this data.  This will be a _big_ win for servers and desktops where
large multimedia or database files are slung around.

Matthew Wilcox wrote:
> Btree directories (phil)

I hope these are _not_ pure binary search trees but rather a smarter ADT...

> Backup inode table

interesting idea

> fine-grained locking

al viro was kind enough to e-mail me some of his thoughts on banishing the
big kernel lock from the VFS.  Though my time with the VFS has been nil in
the past few months, I'd still like to work on this if no one beats me to
it.

IIRC the two big items are dcache/dentry and inode threading.

> Online defragmentation & size

Has there been any substantive discussion about online defragmentation?  I
think it is a wholly separate, and more interesting, issue than resize
(which will be solved in the future with LVMs, IMHO...)

For online defrag, there are tons of different scenarios and heuristics
which can be employed to optimize for various situations:

* move frequently-accessed files to the middle of the disk (requires
  knowledge of physical disk organization, below the partition layer)
* group files together on disk in directories, with gaps of free space in
  between for placement of files "near" other files in the same directory
* options to pack files into inodes (if possible and supported by fs) or to
  fragment small files, to conserve space
* dozens of heuristics.  if online defrag is in userspace, admin can even
  craft their own disk optimization rules.

Kernel changes: Short term, the easiest implementation will be in-kernel.
Long term, I would like to see (if possible) a set of generalized ioctls
which allow a userspace program to contain the bulk of the
defragmenting/disk optimization logic.

Any and all ideas for online defrag, please post.  I'm very interested.

> delayed allocation

this needs to be in the VFS desperately.  every new & advanced filesystem
is winding up implementing their own logic for this...

> Address spaces (viro)

can someone elaborate?

> sort out interface to block devices (viro)

mostly done?

--
Jeff Garzik              | Only so many songs can be sung
Building 1024            | with two lips, two lungs, and
MandrakeSoft, Inc.       | one tongue.