Re: Userfs homepage (again)
In message <[EMAIL PROTECTED]>, "Andrew Morton" writes:

> Hi, J.
>
> I'd be interested in seeing some expansion on why shared mem was
> undesirable and why the NFS interface was not the way to go.
> Particularly the latter, as people have been suggesting this
> occasionally.
>
> This was thrashed through a number of months back on the gnome list.
> The debate concerned whether the exercise of unifying FTP/HTTP/etc with
> filesystems should be done in user space or the OS. User space won out
> (a VFS library, but limited to gnome apps only) because gnome is not
> just for linux.
>
> But the consensus was that if this were to be done in the OS, the NFS
> hook should be the way because that gives the best shot at cross-Unix
> portability.

I have some comments on this. My wrapfs is a stackable wrapper f/s with hooks for a higher-level language. Wrapfs is ported to linux (2.[012]), solaris, and freebsd, and offers the same functionality via its API. Ion and I wrote several file systems using wrapfs, and their in-kernel performance beats user-level NFS servers by anywhere from 50% to 10x. Most of the savings come from reduced context switches, of course.

For more info, see my research Web page, esp. the paper I'll be presenting at Usenix in June: "Extending File Systems Using Stackable Templates." I also have another paper, which I'll be presenting next month at LinuxExpo (.org), detailing the implementation of wrapfs for linux 2.1: "A Stackable File System Interface For Linux."

As to the question of NFS vs. something else, I'd agree that the vnode f/s interface is not the right API for f/s developers to use, esp. if you'd like portability. My wrapfs uses a simplified API to modify file data and file names, but that's still not enough. My FiST (f/s translator) language can describe a file system at a higher level, using an API that looks more like system calls and NFS: read, write, lookup, unlink, mkdir, etc. Using that common f/s description, it generates the right vnode-level code for whatever OS wrapfs was ported to. At a later date I intend to use FiST to generate user-level NFS server code. (I've had my share of user-level file servers: I maintain amd/am-utils, and wrote and maintain hlfsd.)

My research Web page is: http://www.cs.columbia.edu/~ezk/research/

> Another comment on the web page: s/lession/lesson/ :-)
>
> - Andrew.

PS. I'll be releasing a new wrapfs and kernel patches for 2.2.6 (new rename code) as soon as Ion has had a chance to verify my fixes.

Erez.
lookup() return val in 2.2.7
2.2.7 changed the return value of the ->lookup() that's called from fs/namei.c:real_lookup(). It used to return an int, and the f/s was to fill in the dentry in the preallocated out-arg "dentry". 2.2.7 changed the return value to a struct dentry pointer. Why? The new code is semantically the same. It used to be:

	int error = dir->i_op->lookup(dir, dentry);
	result = dentry;
	if (error) {
		dput(dentry);
		result = ERR_PTR(error);
	}

and now it is:

	result = dir->i_op->lookup(dir, dentry);
	if (result)
		dput(dentry);
	else
		result = dentry;

I was hoping that the 2.2 series wouldn't change f/s APIs. Each time something like this changes, I have to update my stackable f/s for linux. Will the prototype for lookup remain this way or not?

It seems to me that having the result available both as a return value and filled into the dentry out-arg is confusing. IMHO it may confuse programmers who don't know how to pass their result back to the VFS. If we want the return value to double as both a valid pointer and an ERR_PTR, then why not change lookup so it takes only the dir inode, not a second "dentry" argument, and is expected to return either the valid dentry looked up or an ERR_PTR? Of course, then the VFS won't be able to d_alloc the space for the new dentry, and each f/s will have to do so instead. So we're back where we were: the old prototype for lookup, which took an allocated dentry and returned a plain integer errno, was better, no?

Maybe someone can explain this to me? Are there historical reasons why this change was made now? Perhaps it is in preparation for a different lookup API? Either way, I think I can get my stackable f/s to work with little change.

Thanks, Erez.
Re: VFS question...
In message <[EMAIL PROTECTED]>, "John P. Looney" writes:

> I'm trying to write a program that would gather statistics on filesystem
> usage, over a long period of time.
>
> I *think* the best way of doing this is to write a kernel module that
> could replace some VFS functions, perhaps sys_write and the like, with
> a function that writes a "what happened" message to a userspace program,
> and then calls the original function.

This is called stacking... :-)

> Has something like this been done already ? If not, is it clean enough
> to work ? If I wished to log just:
>
> When a file is created/deleted/modified/read from
>
> what's are the VFS functions of most use ?

Off the top of my head: create, open, unlink, read, write, putpage, getpage. But there's more.

> Can a module override functions
> in the kernel proper ?

No, but it can intercept f/s functions right below the VFS, before the VFS calls the lower-level f/s (ext2fs, nfs, etc.).

> Kate
>
> --
> Microsoft - the best reason in the world to drink beer

You can use my wrapfs/lofs templates for linux. They are stackable f/s modules. They wrap on top of any directory and can modify f/s behavior, or, in your case, just observe it. That way you can measure any f/s, and you don't need to modify the VFS or other file systems.

In fact, I did something similar to what you want a while back, when I needed to count the number of lookups, unlinks, and creates on a news spool. I used my template and put simple integer counters at the entry points of several VFS functions. Then I added an ioctl that returned the values of those counters to a user process that polled them every few minutes. Unfortunately I didn't save that code b/c it was so simple; I didn't think someone else might find it useful... :-( Anyway, the idea is simple, and you can extend it to any number of integer counters and VFS ops. The performance overhead of lofs on 2.2 is only 2--4%, and adding the counters should not add more than 1--2%.
I would not recommend doing a printk from the kernel and counting it from syslog or some such; that will harm performance significantly and will fill up your logs quickly. My templates come with a lot of debugging messages that you can turn on/off or set to a given level. (Yes, I have printfs on entry/exit to every VFS function.)

You can get source code for these at http://www.cs.columbia.edu/~ezk/research/software/ The code there is for earlier 2.2.x kernels. I've updated the code for all kernels up to 2.2.10 and am testing it now; I'll release it within a few days. I've also been working on porting my templates to 2.3 kernels. So far 2.3.8 is giving me trouble, but I suspect the kernel itself has problems; 2.3.9-pre8 seems more stable. Once I get wrapfs/lofs to work on 2.3 kernels, I'll release that too.

My code requires small kernel patches. I've been working on getting those cleaned up and submitted to Linus, who agreed in principle to incorporate them. I will make all such announcements on linux-fsdevel. Stay tuned.

If you use my code, let me know if you have any questions. I'd like to help.

Cheers, Erez.
Updated Stackable f/s support for 2.2/2.3
I've released updates to my stackable file systems for linux. The updates work for up to 2.2.10 and 2.3.10. You can find all software in http://www.cs.columbia.edu/~ezk/research/software/. The packages include (small) kernel patches, and sources for several stackable file system modules: lofs, wrapfs, rot13fs, and cryptfs. Let me know if you have any questions. Happy stacking. :-) Erez.
Updated Stackable f/s support for 2.3.11-12
I've released updates to my stackable file systems for linux. The updates work for up to 2.3.12. You can find all software in http://www.cs.columbia.edu/~ezk/research/software:

  fist-linux-2.3-fs-0.1.1.tar.gz
  fist-linux-2.3-cryptfs-0.1.1.tar.gz (under "export controlled")

The packages include (small) kernel patches, and sources for several stackable file system modules: lofs, wrapfs, rot13fs, and cryptfs. Let me know if you have any questions. Erez.
2.3 NFS client expects monotonically increasing cookies
I found out that 2.3 kernels (see linux/fs/nfs/dir.c) expect NFS cookies passed from the server to be monotonically increasing; 2.2 kernels do not make that assumption, it seems. The cookies I'm talking about are the 'cookie' field in 'struct entry' (rpcsvc/nfs_prot.h). The NFS (v2) spec does not say whether the cookies should or should not be increasing. I quote from RFC 1094, section 2.2.17:

	``... and a "cookie" which is an opaque pointer to the next entry
	in the directory. The cookie is used in the next READDIR call to
	get more entries starting at a given point in the directory. The
	special cookie zero (all bits zero) can be used to get the entries
	starting at the beginning of the directory.''

The cookies are opaque and must not be interpreted by the client! Linux should not assume anything about them, only make sure it sends back to the server the last cookie in the direntry chain, so that the *server* can restart directory reading from the last entry just read.

I discovered this problem b/c directory reading in amd stopped working under 2.3 kernels. It turned out that my "browsable directories" code didn't generate monotonically increasing cookies. I rewrote amd's code so the cookies are monotonically increasing, and directory listing under amd works again. Nevertheless, I think this assumption in 2.3 kernels can cause problems when interacting with non-linux NFS servers that do not generate monotonically increasing cookies.

Erez.
Updated Stackable f/s support for 2.2.11 and 2.3.13
I've released updates to my stackable file systems for linux. The updates work for up to 2.2.11 and 2.3.13. There were no functional changes since the previous versions. You can find all software in http://www.cs.columbia.edu/~ezk/research/software:

For 2.2 kernels:
  fist-linux-2.2-fs-0.4.1.tar.gz
  fist-linux-2.2-cryptfs-0.4.1.tar.gz (under "export controlled")

For 2.3 kernels:
  fist-linux-2.3-fs-0.1.2.tar.gz
  fist-linux-2.3-cryptfs-0.1.2.tar.gz (under "export controlled")

The packages include (small) kernel patches, and sources for several stackable file system modules: lofs, wrapfs, rot13fs, and cryptfs. Let me know if you have any questions. Erez.
2.3.14 text for CONFIG_MSDOS_PARTITION
The text for the CONFIG_MSDOS_PARTITION option in 2.3.14 is a bit misleading:

	This option enables support for using hard disks that were
	partitioned on an MS-DOS system. This may be useful if you are
	sharing a hard disk between i386 and m68k Linux boxes, for
	example. Say Y if you need this feature; users who are only
	using their system-native partitioning scheme can say N here.

I think many PC-based systems will want that option, so the default for PC systems should be changed to 'Y'. If you've partitioned your disk using MSDOS fdisk, or linux fdisk/xxx, you probably want this option on. The text as it stands does not make it clear when to pick this option: it implies that the option is only necessary for cross-platform compatibility. Should I pick this option if I partitioned using some flavor of MS Windows? FreeBSD? Solaris? If someone will give me somewhat more accurate details of when to say Y/N, I'm willing to rewrite the text and produce a small patch.

Cheers, Erez.
Updated Stackable f/s support for 2.2.12 and 2.3.1[45]
I've released updates to my stackable file systems for linux. The updates work for up to 2.2.12 and 2.3.15. There were no functional changes since the previous versions. One small bug was fixed in my stackable file system templates, and better documentation was included. The kernel patches remain essentially the same. You can find all software in http://www.cs.columbia.edu/~ezk/research/software/

For 2.2 kernels:
  fist-linux-2.2-fs-0.4.2.tar.gz
  fist-linux-2.2-cryptfs-0.4.2.tar.gz (under "export controlled")

For 2.3 kernels:
  fist-linux-2.3-fs-0.1.3.tar.gz
  fist-linux-2.3-cryptfs-0.1.3.tar.gz (under "export controlled")

The packages include (small) kernel patches, and sources for several stackable file system modules: lofs, wrapfs, rot13fs, and cryptfs. Let me know if you have any questions.

Cheers, Erez Zadok.
---
Columbia University Department of Computer Science.
EMail: [EMAIL PROTECTED]
Web: http://www.cs.columbia.edu/~ezk
Re: Testing Filesystems
In message <[EMAIL PROTECTED]>, Steve Dodd writes:

> On Thu, Sep 09, 1999 at 03:17:06PM +0100, Angelo Masci wrote:
>
> > Is there a set of regression tests for filesystems
> > available?
>
> Not that I know of, but I didn't look very hard..
>
> > I'd like to start testing a filesystem and was wondering
> > if there's a set of IO operation tests lurking out there
> > somewhere.
>
> If you run into one, I'd be very interested
>
> > I'm looking for obvious stuff. Boundary tests for all
> > IO related operations.
>
> --
> "I decry the current tendency to seek patents on algorithms. There are
> better ways to earn a living than to prevent other people from making use of
> one's contributions to computer science." -- Donald E. Knuth, TAoCP vol 3

It would be nice if SPEC SFS '97 could test non-NFS file systems.

For my testing of stackable file systems, I unpack, configure, and build several large packages: am-utils, egcs-1.1.2, binutils, and emacs-20.4. I run a loop of "configure; make; make clean" about 10-20 times. I watch for any obvious bugs/oopses, odd kernel messages, creeping memory utilization, and such. I also check for any leftover locked vnodes/inodes, or ones with incorrect reference counts; to detect some of that, I try to unmount and remount my file system, and also unmount and remount the underlying file system. Any bad locks or refcounts will cause trouble when you try to unmount, remount, or fsck. I also use bonnie.

Over the past few years, I've written over two dozen small C programs intended to test specific features. For example, a lot of the complexity in my stackable f/s has to do with mmap'ed pages. So I have a program that opens and mmaps files, then reads/writes specific pages, page ranges, and bytes just around page boundaries. I used that to detect and fix boundary conditions. I'm pretty sure that everyone who has ever written a file system has written a few test programs. My small test programs are intended for testing stackable f/s.
I'm sure developers of other file systems have written tests specific to theirs. Maybe the fsdevel community should join forces and put together a real f/s regression test package. That would be useful for every filesystem on Unix systems, not just Linux. I'd be happy to contribute to such an effort if one were organized.

Erez.
Updated Stackable f/s support (+lofs) for 2.3.17/2.3.18
First, some good news: kernel 2.3.17 finally contains some of my stackable f/s patches, about 50% of them.

I've released updates to my stackable file systems for linux. The updates work up to 2.3.18. There were no functional changes since the previous versions, aside from a fix for one LRU-related bug in lofs/wrapfs/cryptfs. This time, I've included more in my kernel patches:

(1) what other patches are necessary to support stackable f/s
(2) an updated vfs.txt file
(3) complete lofs sources, integrated with the rest of the kernel

You can find all software in http://www.cs.columbia.edu/~ezk/research/software/

For 2.3 kernels:
  fist-linux-2.3-fs-0.1.4.tar.gz
  fist-linux-2.3-cryptfs-0.1.4.tar.gz (under "export controlled")

The packages include kernel patches, and sources for several stackable file system modules: lofs, wrapfs, rot13fs, and cryptfs.

If all you want is lofs for 2.3.18, then simply apply this patch to a 2.3.18 kernel: http://www.cs.columbia.edu/~ezk/research/software/fist-2.3.18.diff

Let me know if you have any questions.

Cheers, Erez Zadok.
---
Columbia University Department of Computer Science.
EMail: [EMAIL PROTECTED]
Web: http://www.cs.columbia.edu/~ezk
Re: page_cache: how does generic_file_read really work?
Heh, heh. Funny you should mention this, Peter. I struggle with this question every time a new kernel release is made, b/c I have to update my stackable f/s modules. If you tell me which kernel you're using (2.2 or 2.3), I'll take a stab at explaining this. Erez.
Re: page_cache: how does generic_file_read really work?
In message <[EMAIL PROTECTED]>, "Peter J. Braam" writes:

> Hi
>
> I wondered if someone could explain what is happening in
> generic_file_read:
>
> More generically, I'd like to understand how in a file system I can
> "get" a page and use it to copy date into/out of it. How do I then put
> the page away?

I think my example from Cryptfs (or Wrapfs, or any other of my stackable f/s) might help you. My file systems use generic_file_read as their read routine. Let's take the simple case of getting a page for the first time, meaning it's not in memory or in any cache. In the VFS, generic_file_read essentially calls do_generic_file_read, which does, in a loop:

- find the hash of the page

- try to find the page w/o locking it (__find_page_nolock)

- initially there won't be a page or cached page, so it allocates a page (page_cache_alloc()) and puts it in the cache (__add_to_page_cache()). The page is allocated already locked.

- call *your* file system's readpage routine (which must exist, b/c you've defined your f/s to use generic_file_read instead of your own read routine). This means that your readpage can assume that the page is allocated and locked. No matter how your readpage is called, you'll get a locked page.

- after returning from your readpage function, the VFS calls page_cache_release, which frees the page but does not remove it from LRU caches. (I find the name 'page_cache_release' a bit confusing.) This means that your readpage routine should have done all the necessary actions prior to this very last freeing of the page: that may include setting uptodate/locked/whatever bits, removing the page from LRU caches, and more.

Now let's go into my cryptfs_readpage function. Remember that my situation may be (slightly) more complicated than yours. My stackable file systems must emulate both a VFS and a lower-level f/s: they act as a VFS toward the lower-level f/s (say, ext2fs), and at the same time they look like a lower-level f/s to the real VFS.
This is what I do in cryptfs_readpage():

- find a page_hash() of the lower-level inode, for the same offset. This is part of how I emulate a VFS to the lower-level f/s: the VFS looked for a page hash at a given offset, so I repeat the same operation on the lower-level inode/filesystem (which I sometimes call the "hidden" inode or filesystem).

- find and lock a page at the lower level, for the same offset. Remember that the VFS called me w/ a locked page, so here I'm preparing a lower-level f/s page and locking it, before calling the lower-level file system's readpage().

- if I cannot find such a page, I allocate it in kernel space and add it to the page cache (add_to_page_cache).

- I call the readpage() routine of the lower-level f/s, and make sure I have valid data (wait_on_page). At this point, I have two pages: the hidden_page, which is the one I retrieved from the lower-level f/s, and the 'page' which was passed to me by the VFS. In cryptfs, the hidden_page is encrypted, so now I decrypt the data from the hidden_page into the page which was passed to me.

- I use page_address() to "map" a page's data into kernel memory, so I can copy and manipulate it as any other "char *" buffer. I map both pages, then I call my decryption routine to decode the hidden_page into the current page. This is done, of course, with the locks held on both pages.

- now I have valid, decrypted data in the page that I got from the VFS. I unlock it, set the uptodate flag, and wake up anyone who might be waiting on it.

- finally, right before I return, I call __free_page() (which is the same as page_cache_release) on the hidden_page. Since the VFS will do the same on my page, I must free the hidden page which I allocated.

In all this fun, sooner or later, your flushpage routine may be called (via truncate_inode_pages) from multiple places, such as iput(), vmtruncate(), and more.
Some are invoked [in]directly by your f/s code or the VFS, while at other times it's the result of a kernel thread that cleans up old unused pages (LRU). All this means that your flushpage() function must do a few more things, and emulate on the lower level what truncate_inode_pages does to cryptfs's pages:

- find the corresponding hidden page and lock it. The f/s's flushpage() routine gets a locked page, but must not unlock it, b/c the VFS will unlock the page.

- call the flushpage routine of the lower-level f/s.

- clear the uptodate flag of the hidden_page, remove it from the LRU cache (lru_cache_del), call remove_inode_page on it as well, unlock the hidden_page, and free it. These actions are mostly what truncate_inode_pages does to your page; therefore cryptfs must do the same to the lower-level f/s.

The above explanation is a simplified version of what really goes on, and of what my stackable f/s modules do. I didn't explain the other cases, nor the interaction with other parts of the same file system.
Re: d_path or way to get full pathname
Marc, regarding your dentry full-pathname function (and Serge's): I've not yet looked at either in detail, but what I think is needed (assuming it's not there already) is this:

- a flag to pass to the function: if true, return full pathnames starting w/ a '/' and crossing mount points. There are cases where you want one behavior and cases where you want the other.

- if the flag is false, return the pathname relative to this super_block.

- a faster method than constant shifting of the bytes. This is a serious one. If you keep shifting bytes for each component, your complexity is O(n^2). You can make it two linear passes, O(n), as follows: (1) first, scan the dentries and their parents in reverse, crossing mount points as needed; (2) sum up the total number of bytes needed, from the qstr structures; (3) allocate the correct number of bytes (or verify that the user passed enough space); (4) repeat the reverse traversal, but this time copy the bytes into the output buffer directly at their offsets into the buffer (don't copy any terminating nulls, so you won't trash the beginning of the component that follows).

I'll be happy to help anyone write or test such a version (I started something similar a while back). I think it would be a useful small addition to the kernel.

Erez.
Re: Web FS Q
In message <[EMAIL PROTECTED]>, "David Bialac" writes:

> For fun (and because I think it might be a useful feature), I'm working
> on a filesystem that allows a website to be mounted as a local
> filesystem. I'm starting to dive in, and successfully have the kernel
> recognizing that webfs exists, so it's now time to write some socket
> code. Amongst the thing I want to put into this system is caching of
> server data locally, specifically on the local filesystem. The
> question I have is, can one filesystem ask to write to another? I
> don't see anythinng in there that seems to attempt to do this, so I
> need to be sure said is possible.

As others have already said, this isn't a new idea, and there are several alternatives you should look at first. There are also issues wrt mapping the HTTP protocol to a file-system interface that you should be aware of. I believe this was discussed again on linux-fsdevel and the freebsd-fs mailing lists just in the past 4-6 weeks. However, there's nothing wrong with doing such a project for fun. But if you can find something that hasn't been done before, that would be even better. If you think that stackable file systems could help your project, see my stackable f/s templates work: http://www.cs.columbia.edu/~ezk/research/software/

> Why this is not as stupid as it sounds: Imagine the internet-enabled
> appliance scenario: today, if say a DVD manufacturer has a glitch in
> their DVD player, the only fix is to take it in for repair. If the
> device was internet-enabled, and further read its software off the web,
> it could conceivably update software on the fly without the inconvience
> of the user going without his player. Nother scenario: you could save
> your files to a website run anywhere, then download them anywhere.
This idea somewhat matches some of the ideas discussed in the Usenix '94 "unionfs" paper: a way to merge a read-only f/s with a writable f/s, where the latter includes patches and updates to the read-only stuff (which may come from a cdrom).

> David Bialac
> [EMAIL PROTECTED]

PS. I don't see a problem with writing *loadable* kernel code. It doesn't make the core kernel bigger; only run-time kernel memory consumption increases. Kernel modules aren't a solution for every application, though. If speed is not a concern, user-level file servers are easier to write and debug. Otherwise, I personally think that all file systems should be in the kernel (loadable or statically compiled) for performance reasons.

Erez.
Re: Minimal fs module won't unmount
In message <[EMAIL PROTECTED]>, Malcolm Beattie writes:

> I sent this to linux-kernel 10 days ago but got zero responses so I'll
> try here in case I get luckier.

I don't recall seeing your message from 10 days ago. Maybe it didn't get to others as well.

> I'm writing a little "fake" filesystem module which, when mounted on a
> mountpoint, makes the root of the new filesystem be a "magic" symlink.
> It all works fine except that the filesystem won't unmount. strace
> shows that oldumount is returning EINVAL. What the "magic symlink"
> does isn't important here and the following cut-down version displays
> the same problem. Is it something to do with the fact that the core fs
> code expects the root of the new filesystem to be an ordinary
> directory or am I missing something else? Here's the cut-down module
> which simply makes follow_link appear to be your cwd. You can compile
> (under 2.2 or 2.3, though 2.3 is untested) by
>
> cc -c -D__KERNEL__ -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -pipe
> -fno-strength-reduce -m486 -DCPU=486 -DMODULE -DMODVERSIONS -include
> /usr/src/linux/include/linux/modversions.h -I/usr/src/linux/include nullfs.c
>
> (modifying arch-specific options as necessary) and then doing
>
> # insmod ./nullfs.o
> # mkdir /tmp/nullcwd
> # mount -t null none /tmp/nullcwd
> # ls -l /tmp/nullcwd
> lrwxrwxrwx 1 root root 0 Nov 30 12:21 /tmp/nullcwd -> foo
> # ls -l /tmp/nullcwd/
> ...listing of your current working directory...
> # umount /tmp/nullcwd
> umount: none: not found
> umount: /tmp/foo: not mounted
>
> How can I get it to umount properly?

I've not looked at your code, but you might want to see what I do in my wrapfs/lofs during mount and unmount. Usually the main reason why something won't unmount is that you're holding some resources (inodes, dentries, etc.), in which case you get EBUSY. If you're getting EINVAL, the question is where: is your code being invoked at all, or is the VFS giving you the EINVAL?
If your code isn't called, then search the VFS (starting w/ do_umount) to find which code path could return you an EINVAL. I personally found it faster (and more fun :-) to debug VFS code myself by sticking printf's at certain places and building a test kernel with that. BTW, just to avoid any potential problems, mount w/ the real mount-point name instead of 'none'.

Erez.
Re: Oops with ext3 journaling
In message <[EMAIL PROTECTED]>, Pavel Machek writes:

> Hi!
>
> > No, and I'm pretty much convinced now that I'll move to having a
> > private, hidden inode for the journal in the future.
>
> Please don't do that. Current way of switching ext2/ext3 is very
> nice. If someone wants to shoot in their foot...
> Pavel

IMHO, as a long-term solution, ext3 should have as few ways to shoot oneself in the foot as possible. Hackers usually won't do "stupid" things (at least not unintentionally :-), but hordes of Joe-users will.

> I'm really [EMAIL PROTECTED] Look at http://195.113.31.123/~pavel. Pavel
> Hi! I'm a .signature virus! Copy me into your ~/.signature, please!

Erez.
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
In message <[EMAIL PROTECTED]>, Jeff Garzik writes:

> On Thu, 23 Dec 1999, Hans Reiser wrote:
>
> > All I'm going to ask is that if mark_buffer_dirty gets changed again,
> > whoever changes it please let us know this time. The last two times
> > it was changed we weren't informed, and the first time it happened it
> > took a long time to figure it out.
>
> Can't you figure this sort of thing out on your own? Generally if you
> want to stay updated on something, you are the one who needs to do the
> legwork. And grep'ing patches ain't that hard
>
> Jeff

Jeff, Hans is absolutely right. We can all figure it out on our own, and waste many hours re-discovering what others have discovered independently. It's a royal pain and a time sink. I'd rather write new code than try to figure out what changed b/t kernel versions. In my case (stackable f/s), every time there's a change to anything under linux/fs, linux/mm, or the headers, I've got to find out what changed and how it affects my code. It's NOT enough to grep the patches: unified diffs don't give you enough context to understand the overall changes that were made. I have to use emacs's ediff or other methods to find out the meaning and motivation behind a change.

There is no NEWS file for each release. There is no ChangeLog for each release. Actually, there are a few ChangeLog files sprinkled around the sources, but the last time linux-2.3.25/fs/ChangeLog was updated was in 1998. There is no one who summarizes kernel changes. A long time ago, someone used to. I don't remember his name. Is he still doing that?

I maintain a much smaller package (am-utils), and there's no way I could remember what changes I've made throughout the years. That's why I keep detailed ChangeLog and NEWS files w/ my releases. I realize the linux kernel is a much bigger and more complex beast, but shouldn't that be all the more motivation for everyone to keep ChangeLogs?
IMHO, if we want to speed linux development along, we should help document linux. Hans and linux-fsdevel folks: I have a proposal. How would you all feel about forming an informal group that would report changes relevant to f/s developers on this list? (Maybe even on a separate mailing list?) I'm willing to take the time to report whatever VFS changes I find each time I update my stackable f/s code for a new kernel, including when no relevant changes were made (which IMHO is just as important). This effort would help all of us f/s developers, but only if we each take the time to report our findings to this list. The few minutes each person takes to report the findings that relate to their f/s will save numerous other people many hours; overall this would help everyone. We can also make it easy to find these messages in the archives by giving the Subject of such messages a grep-able format, say:

	CHANGE 2.3.17-2.3.18: vm_area_struct->vm_pte renamed vm_private_data

Comments? Erez.
Re: kernel change logs (was Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?)
In message <[EMAIL PROTECTED]>, Jeff Garzik writes:

> [...]
> To sum, documenting changes is a very good idea, notifying specific
> hackers of specific kernel changes is a waste of time [unless they
> are the maintainers of the code being changed, of course].

I agree that notifying individuals doesn't scale. Notifying the list as a whole does.

> Jeff

Erez.
Re: Ext2 / VFS projects
In message <[EMAIL PROTECTED]>, Matthew Wilcox writes:

> Greetings. Ted Ts'o recently hosted an ext2 puffinfest where we
> discussed the future of the VFS and ext2. Ben LaHaise, Phil Schwan,
[...]

Also, I really hope that my remaining (small, passive) patches to the VFS to support stackable file systems will be incorporated soon.

Cheers, Erez.
Re: Ext2 / VFS projects
In message <[EMAIL PROTECTED]>, Tigran Aivazian writes: > I noticed the stackable fs item on Alan's list ages ago but there was no > pointer to the patch (I noticed FIST stuff but surely that is not a "small > passive patch" you are referring to?)

Yes, the patches are small and passive. No new vfs/mm code is added or changed! The most important part of my patches has already been included since 2.3.17; that was an addition/renaming of a private field in struct vm_area_struct. What's left are things that are necessary to support stacking for the first time in linux: exposing some functions/symbols from {mm,fs}/*.c, adding externs to headers, additions to ksyms.c, and moving some macros and inline functions from private .c files to a header so they can be included in any file system.

I've used these patches on dozens of linux machines for the past 2+ years and have had no problems. I constantly get people asking me when my patches will become part of the main kernel. I have about 9 active developers who write file systems using my templates. I've had more than 21,000 downloads of my templates in the past two years.

> So, my point is - if you point everyone to those patches, someone might > help Alan out if one feels like it (and has time).

http://www.cs.columbia.edu/~ezk/research/software/fist-patches/

The latest 2.3 patches at that URL include two things: my small main-kernel patches, and a fully working lofs. The lofs, of course, is several thousand lines of code, but it is not strictly necessary to include it with the main kernel; it can be distributed and built separately, just as my other f/s modules are. However, I do think that lofs is a useful enough f/s that it should be part of the main kernel. If you go to the 2.3 directory under the above URL, there's a README describing the latest 2.3 patches. I've included it below, so everyone can read it and see what my patches do, and how harmless they are.
BTW, I've got a prototype unionfs for linux if anyone is interested. > Regards, > Tigran. As always, I'll be delighted to help *anyone* use my work, and would love to help the linux maintainers incorporate my patches, answer any concerns they might have, etc. Cheers, Erez.

==

Summary of changes for 2.3.25 to support stackable file systems and lofs. (Note: some of my previous patches had been incorporated in 2.3.17.)

(1) Created a new header file include/linux/dcache_func.h. This header file contains dcache-related definitions (mostly static inlines) used by my stacking code and by fs/namei.c. Ion and I tried to put these definitions in fs.h and dcache.h to no avail. We would have to make lots of changes to fs.h or dcache.h and other .c files just to get these few definitions in. In the interest of simplicity and minimizing kernel changes, we opted for a new, small header file. This header file is included in fs/namei.c because everything in dcache_func.h was taken from fs/namei.c. And of course, these static inlines are useful for my stacking code.

If you don't like the name dcache_func.h, maybe you can suggest a better name. Maybe namei.h? If you don't like having a new header file, let me know what you'd prefer instead and I'll work on it, even if it means making more changes to fs.h, namei.c, and dcache.h...

(2) Inline functions moved from linux/{fs,mm}/*.c to header files so they can be included in the same original source code as well as stackable file systems:

    check_parent macro (fs/namei.c -> include/linux/dcache_func.h)
    lock_parent        (fs/namei.c -> include/linux/dcache_func.h)
    get_parent         (fs/namei.c -> include/linux/dcache_func.h)
    unlock_dir         (fs/namei.c -> include/linux/dcache_func.h)
    double_lock        (fs/namei.c -> include/linux/dcache_func.h)
    double_unlock      (fs/namei.c -> include/linux/dcache_func.h)

(3) Added to include/linux/fs.h an extern definition for default_llseek.
(4) include/linux/mm.h: also added extern definitions for

    filemap_swapout
    filemap_swapin
    filemap_sync
    filemap_nopage

so they can be included in other code (esp. stackable f/s modules).

(5) Added EXPORT_SYMBOL declarations in kernel/ksyms.c for functions which I now exposed to (stackable f/s) modules:

    EXPORT_SYMBOL(___wait_on_page);
    EXPORT_SYMBOL(add_to_page_cache);
    EXPORT_SYMBOL(default_llseek);
    EXPORT_SYMBOL(filemap_nopage);
    EXPORT_SYMBOL(filemap_swapout);
    EXPORT_SYMBOL(filemap_sync);
    EXPORT_SYMBOL(remove_inode_page);
    EXPORT_SYMBOL(swap_free);
    EXPORT_SYMBOL(nr_lru_pages);
    EXPORT_SYMBOL(console_loglevel);

(6) mm/filemap.c: made the function filemap_nopage non-static, so it can be called from other places. This was not an inline function, so there's no performance impact.
Re: Ext2 / VFS projects
In message <[EMAIL PROTECTED]>, Manfred Spraul writes: > Erez Zadok wrote: > > [...] > > (2) Inline functions moved from linux/{fs,mm}/*.c to header files so they > > can be included in the same original source code as well as stackable > > file systems: > > > > check_parent macro (fs/namei.c -> include/linux/dcache_func.h) > > lock_parent (fs/namei.c -> include/linux/dcache_func.h) > > get_parent (fs/namei.c -> include/linux/dcache_func.h) > > unlock_dir (fs/namei.c -> include/linux/dcache_func.h) > > double_lock (fs/namei.c -> include/linux/dcache_func.h) > > double_unlock (fs/namei.c -> include/linux/dcache_func.h) > > > That sounds like a good idea: fs/nfsd/vfs.c currently contains copies of > most of these functions... I agree. I didn't want to make copies of those b/c I got burnt in the past when they changed subtly and I didn't notice the change. > -- > Manfred Erez.
Re: [Announcement] inode_operations/super_operations changes
In message <[EMAIL PROTECTED]>, Alexander Viro writes: > Summary: > > 1) s_op->notify_change() - went into inode_operations (called > ->setattr(), otherwise the same). Thanks for this info Alexander. I also noticed that i_op->getattr was added, but it is not called from anywhere yet, right? So I left it NULL in my inode_operations structure for my stackable templates. Could you inform -fs when the VFS starts to use it? Thanks, Erez.
Re: [Announce] VFS changes (2.3.51-1)
Thanks Al. VFS changes are important to any F/S developer, but even more important to me since my stackable templates must behave like both a lower-level F/S and a VFS. Ion and I updated our templates to 2.3.49 just a couple of days ago, taking into account the previous set of VFS changes. I was under the impression that this late into 2.3, no such major changes were going to happen, so that we get a 2.4 soon, not another long series like 2.1. Do you know if there are more (VFS) changes planned in 2.3, and if so, which ones? I would prefer to wait until all changes are in, rather than spend time on my stacking templates for each change; it would be a smaller effort doing it all at once. BTW, the new vfs_* things are very nice. They are "stacking friendly." But Ion and I noticed other problems that make it hard to do clean stacking. For example, there are asymmetries b/t the creation and deletion of inodes and dentries; a file system can get notified when a refcnt of an object is decreased, but not when it is increased, and more. Ion will send a separate detailed mail about that a little later. If you're doing all this VFS work, are you open to suggestions that would make stacking cleaner and more flexible? We were going to hold off submitting such changes until 2.5, but if 2.3 is going to stretch further, we might as well do it now. Thanks, Erez.
the last remaining patches to support stacking modules (for 2.3.49)
Linus, Here are the last remaining kernel patches to support stacking, at least up to 2.3.49. As you can see, it's very small, passive stuff. Hopefully you can include it soon. I didn't include a full lofs with this patch, b/c more VFS changes are coming up soon, which will definitely require changes to the lofs templates code (but hopefully nothing to the kernel itself). Erez.

==
diff -ruN linux-2.3.49-vanilla/include/linux/fs.h linux-2.3.49-fist/include/linux/fs.h
--- linux-2.3.49-vanilla/include/linux/fs.h	Thu Mar  2 17:01:26 2000
+++ linux-2.3.49-fist/include/linux/fs.h	Sun Mar  5 03:26:34 2000
@@ -949,6 +949,8 @@
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
 
+/* needed for stackable file system support */
+extern loff_t default_llseek(struct file *file, loff_t offset, int origin);
 extern struct dentry * lookup_dentry(const char *, struct dentry *, unsigned int);
 extern struct dentry * __namei(const char *, unsigned int);
 
diff -ruN linux-2.3.49-vanilla/kernel/ksyms.c linux-2.3.49-fist/kernel/ksyms.c
--- linux-2.3.49-vanilla/kernel/ksyms.c	Sun Feb 27 01:34:27 2000
+++ linux-2.3.49-fist/kernel/ksyms.c	Tue Mar  7 04:22:28 2000
@@ -234,12 +234,12 @@
 EXPORT_SYMBOL(page_symlink_inode_operations);
 EXPORT_SYMBOL(block_symlink);
 
-/* for stackable file systems (lofs, wrapfs, etc.) */
-EXPORT_SYMBOL(add_to_page_cache);
+/* for stackable file systems (lofs, wrapfs, cryptfs, etc.) */
+EXPORT_SYMBOL(default_llseek);
 EXPORT_SYMBOL(filemap_nopage);
 EXPORT_SYMBOL(filemap_swapout);
 EXPORT_SYMBOL(filemap_sync);
-EXPORT_SYMBOL(remove_inode_page);
+EXPORT_SYMBOL(lock_page);
 
 #if !defined(CONFIG_NFSD) && defined(CONFIG_NFSD_MODULE)
 EXPORT_SYMBOL(do_nfsservctl);
==
2.3.99-pre1 VFS comments
Hi Al, Ion and I worked on updating our stackable templates for 2.3.99-pre1 for the past few days. We have found various oddities and other possible problems. We promised to report on anything interesting we find wrt the VFS, so here it is. We are willing to test and submit patches for anything below that you think is worth it.

(1) Asymmetry b/t double_lock and double_unlock: Only double_unlock does dput() on the two dentries. The only place where double_lock is called is in do_rename, and do_rename already calls get_parent(), which increments the reference counts. We can simplify the code and make it symmetric by moving the two get_parent() calls into double_lock().

(2) vfs_readlink: It would be nice if all vfs_* routines were essentially wrappers that did some checking and then called the file-system-specific method. This isn't the case for vfs_readlink. (BTW, we like the vfs_* routines very much!)

(3) "__" routines: In fs/namei.c, vfs_follow_link simply calls __vfs_follow_link with the same, unchanged args. Can't we simplify and get rid of the __vfs_follow_link routine? Then at least page_follow_link could call vfs_follow_link directly.

(4) permission: fs/namei.c:permission() should probably be renamed vfs_permission, b/c it is a generic VFS routine (and we make direct use of it in lofs). BTW, with stacking, "permission" gets called O(n^2) times in total. I'm not sure there's anything that can be done about it now, but it's something to keep in mind. Here's the recursive call sequence when we have lofs mounted on, say, ext2 (just one stack level):

    vfs_create
      may_create
        permission
          lofs_permission
            permission
              ext2_permission
      lofs_create
        vfs_create
          may_create
            permission
              ext2_permission

This happens b/c we use the nicer/newer vfs_* routines. However, since permission() is also called from places other than the vfs_* routines, we must define permission in lofs, and thus it gets called recursively.
We thought we could solve the problem by not defining our own permission method, b/c the real routines (mkdir, create, etc.) will call permission on the lower f/s via the vfs_* routines, but we couldn't do it b/c permission() is called explicitly in open_namei(). One possible solution is creating a vfs_open() routine which will do most of the checks in filp_open, including permission(), but will take a dentry and not a filename. Then filp_open could call vfs_open, and so could we; right now we have to duplicate most of the filp_open code in our ->open function. This would also nicely solve the recursive permission problem, as well as clean up filp_open().

(5) llseek: In fs/read_write.c, llseek should probably be renamed vfs_llseek, and the un/lock_kernel that it calls should be moved to sys_llseek. Then vfs_llseek should be exported so we can use it.

(6) vfs_readdir: vfs_readdir doesn't take the same argument list as ->readdir, which can be *very* confusing since all the other vfs_* routines use the same prototypes as the methods they wrap. I suggest you make the two the same: swap the "dirent" and "filldir" args in vfs_readdir() so they're the same everywhere. We've had some amusing (read: nasty :) kernel panics b/c of that.

Cheers, Erez & Ion.
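To make the O(n^2) claim concrete, here is a toy userspace model of the call pattern above. The function names echo the kernel's, but none of this is kernel code, and the counting logic is my own: each stacked layer's ->create re-enters vfs_create, and each entry into permission() cascades down to the real f/s.

```c
#include <assert.h>

/* Toy model: layers 0..depth-1 are stacked file systems (e.g. lofs);
 * layer `depth` is the real one (e.g. ext2). */
static int perm_calls;

static void permission(int layer, int depth)
{
	perm_calls++;			/* one entry into permission() */
	if (layer < depth)
		permission(layer + 1, depth);	/* lofs_permission -> lower */
}

static void vfs_create(int layer, int depth)
{
	permission(layer, depth);	/* may_create -> permission */
	if (layer < depth)
		vfs_create(layer + 1, depth);	/* lofs_create -> vfs_create */
}

static int count_permission_calls(int depth)
{
	perm_calls = 0;
	vfs_create(0, depth);
	return perm_calls;
}
```

With one stack level this yields the three permission() entries visible in the trace above; in general it is (d+1)(d+2)/2 entries for d stacked layers, i.e. quadratic growth.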
cleaning up 2.3.99-pre3 fs/exec.c:open_exec()
Al, this is the current (and new) open_exec():

    struct file *open_exec(const char *name)
    {
    	struct dentry *dentry;
    	struct file *file;

    	lock_kernel();
    	dentry = lookup_dentry(name, NULL, LOOKUP_FOLLOW);
    	file = (struct file*) dentry;
    	if (!IS_ERR(dentry)) {
    		file = ERR_PTR(-EACCES);
    		if (dentry->d_inode && S_ISREG(dentry->d_inode->i_mode)) {
    			int err = permission(dentry->d_inode, MAY_EXEC);
    			file = ERR_PTR(err);
    			if (!err) {
    				file = dentry_open(dentry, O_RDONLY);
    out:
    				unlock_kernel();
    				return file;
    			}
    		}
    		dput(dentry);
    	}
    	goto out;
    }

The exit conditions from it are rather odd. It ends with a "goto out" to the middle of the code, just so it can return an arg and unlock the kernel. Also, it has a few too many nested if's. Ion and I rewrote it more cleanly and clearly. Here's a small patch. Erez.

*** linux-2.3-vanilla/fs/exec.c	Fri Mar 24 12:34:59 2000
--- linux-2.3.bad/fs/exec.c	Fri Mar 24 22:13:59 2000
***************
*** 319,343 ****
  {
  	struct dentry *dentry;
  	struct file *file;
  
  	lock_kernel();
  	dentry = lookup_dentry(name, NULL, LOOKUP_FOLLOW);
! 	file = (struct file*) dentry;
! 	if (!IS_ERR(dentry)) {
  		file = ERR_PTR(-EACCES);
! 		if (dentry->d_inode && S_ISREG(dentry->d_inode->i_mode)) {
! 			int err = permission(dentry->d_inode, MAY_EXEC);
! 			file = ERR_PTR(err);
! 			if (!err) {
! 				file = dentry_open(dentry, O_RDONLY);
! out:
! 				unlock_kernel();
! 				return file;
! 			}
! 		}
! 		dput(dentry);
  	}
  	goto out;
  }
  
  int kernel_read(struct file *file, unsigned long offset,
--- 319,351 ----
  {
  	struct dentry *dentry;
  	struct file *file;
+ 	int err;
  
  	lock_kernel();
  	dentry = lookup_dentry(name, NULL, LOOKUP_FOLLOW);
! 	if (IS_ERR(dentry)) {
! 		file = (struct file*) dentry;
! 		goto out;
! 	}
! 	if (!dentry->d_inode || !S_ISREG(dentry->d_inode->i_mode)) {
  		file = ERR_PTR(-EACCES);
! 		goto out_dput;
! 	}
! 
! 	err = permission(dentry->d_inode, MAY_EXEC);
! 	if (err) {
! 		file = ERR_PTR(err);
! 		goto out_dput;
  	}
+ 
+ 	file = dentry_open(dentry, O_RDONLY);
  	goto out;
+ 
+ out_dput:
+ 	dput(dentry);
+ out:
+ 	unlock_kernel();
+ 	return file;
  }
  
  int kernel_read(struct file *file, unsigned long offset,
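For readers unfamiliar with the idiom the rewrite leans on: the kernel encodes an errno value in the returned pointer itself, so open_exec() can hand back either a valid struct file or an error through one return value. A minimal userspace re-implementation for illustration (the -1000 window is a simplification of the kernel's actual error-range check):

```c
#include <assert.h>
#include <errno.h>

/* Simplified copies of the kernel's ERR_PTR/PTR_ERR/IS_ERR helpers:
 * small negative errno values are cast into the very top of the address
 * space, where no valid object can live. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)(-1000L);
}

static int some_object;		/* stands in for a real struct file */
```

This is what makes the "one return variable, straight-line gotos" structure of the rewritten open_exec() possible.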
stacking patches and other cleanups for 2.3.99-pre3
Hi Al, Ion and I looked at -pre3 and found you began doing what we suggested earlier: splitting filp_open() into a generic part and an open(2)-specific part. Thanks! We've tried to use it, but we had to change the code to pass a "mode" variable to dentry_open(); otherwise, dentry_open munges the mode/flags in a way that is undesirable for stacking. Also, with our changes, the open(2)-specific stuff was moved back to where it belongs, and dentry_open() became more generic.

Next, we cleaned up the logic in dentry_open, filp_open, and open_exec. They all had hard-to-follow nested "if" statements. With our restructuring, it's easier to follow the execution flow of this relatively new code. Also, we were able to eliminate a couple of cases where variables were computed unnecessarily or more than once.

Finally, I'm including in our patch some very small stuff that's needed for stacking: exporting a couple more symbols to modules, and one extern for default_llseek() in fs.h. Could you please apply those? They are pretty harmless, very small, but necessary.

We tested the patch below with and without our stacking. We now have an lofs/wrapfs/cryptfs which work with 2.3.99-pre3, using all the latest VFS code changes, and we also fixed all known reported bugs in the templates. We'd love to [re]submit lofs for inclusion in the kernel, as soon as the stuff below is included. Enjoy. Erez.
==
diff -ruN linux-2.3.99-pre3-vanilla/fs/exec.c linux-2.3.99-pre3-fist/fs/exec.c
--- linux-2.3.99-pre3-vanilla/fs/exec.c	Fri Mar 24 01:38:50 2000
+++ linux-2.3.99-pre3-fist/fs/exec.c	Sat Mar 25 01:34:17 2000
@@ -319,25 +319,33 @@
 {
 	struct dentry *dentry;
 	struct file *file;
+	int err;
 
 	lock_kernel();
 	dentry = lookup_dentry(name, NULL, LOOKUP_FOLLOW);
-	file = (struct file*) dentry;
-	if (!IS_ERR(dentry)) {
+	if (IS_ERR(dentry)) {
+		file = (struct file*) dentry;
+		goto out;
+	}
+	if (!dentry->d_inode || !S_ISREG(dentry->d_inode->i_mode)) {
 		file = ERR_PTR(-EACCES);
-		if (dentry->d_inode && S_ISREG(dentry->d_inode->i_mode)) {
-			int err = permission(dentry->d_inode, MAY_EXEC);
-			file = ERR_PTR(err);
-			if (!err) {
-				file = dentry_open(dentry, O_RDONLY);
-out:
-				unlock_kernel();
-				return file;
-			}
-		}
-		dput(dentry);
+		goto out_dput;
+	}
+
+	err = permission(dentry->d_inode, MAY_EXEC);
+	if (err) {
+		file = ERR_PTR(err);
+		goto out_dput;
 	}
+
+	file = dentry_open(dentry, FMODE_READ, O_RDONLY);
 	goto out;
+
+out_dput:
+	dput(dentry);
+out:
+	unlock_kernel();
+	return file;
 }
 
 int kernel_read(struct file *file, unsigned long offset,
diff -ruN linux-2.3.99-pre3-vanilla/fs/open.c linux-2.3.99-pre3-fist/fs/open.c
--- linux-2.3.99-pre3-vanilla/fs/open.c	Thu Mar 23 16:11:49 2000
+++ linux-2.3.99-pre3-fist/fs/open.c	Sat Mar 25 07:30:52 2000
@@ -631,7 +631,7 @@
 }
 
 /*
- * Note that while the flag value (low two bits) for sys_open means:
+ * Note that while the open_mode value (low two bits) for sys_open means:
 * 00 - read-only
 * 01 - write-only
 * 10 - read-write
@@ -647,23 +647,23 @@
 struct file *filp_open(const char * filename, int flags, int mode, struct dentry * base)
 {
 	struct dentry * dentry;
-	int flag,error;
+	int open_mode, open_namei_mode;
 
-	flag = flags;
-	if ((flag+1) & O_ACCMODE)
-		flag++;
-	if (flag & O_TRUNC)
-		flag |= 2;
+	open_namei_mode = flags;
+	open_mode = ((flags + 1) & O_ACCMODE);
+	if (open_mode)
+		open_namei_mode++;
+	if (open_namei_mode & O_TRUNC)
+		open_namei_mode |= 2;
 
-	dentry = __open_namei(filename, flag, mode, base);
-	error = PTR_ERR(dentry);
-	if (!IS_ERR(dentry))
-		return dentry_open(dentry, flags);
+	dentry = __open_namei(filename, open_namei_mode, mode, base);
+	if (IS_ERR(dentry))
+		return (struct file *)dentry;
 
-	return ERR_PTR(error);
+	return dentry_open(dentry, open_mode, flags);
 }
 
-struct file *dentry_open(struct dentry *dentry, int flags)
+struct file *dentry_open(struct dentry *dentry, int mode, int flags)
 {
 	struct file * f;
 	struct inode *inode;
@@ -674,7 +674,7 @@
 	if (!f)
 		goto cleanup_dentry;
 	f->f_flags = flags;
-	f->f_mode = (flags+1) & O_ACCMODE;
+	f->f_mode = mode;
 	inode = dentry->d_inode;
 	if (f->f_mode & FMODE_WRITE) {
 		error = get_write_access(inode);
diff -ruN linux-2.3.99-p
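The (flags+1) & O_ACCMODE trick this patch hoists out of dentry_open() is worth spelling out: sys_open's low two flag bits are 00/01/10 for read-only/write-only/read-write, and adding one turns them into a read/write permission bitmask. A standalone sketch, assuming the usual Linux flag values (the FMODE_* constants are copied from the kernel; the helper name is mine):

```c
#include <assert.h>
#include <fcntl.h>

#define FMODE_READ  1
#define FMODE_WRITE 2

/* Map open(2) access flags (O_RDONLY=0, O_WRONLY=1, O_RDWR=2) to the
 * f_mode bitmask: adding 1 yields 01, 10, and 11 respectively. */
static int open_mode_from_flags(int flags)
{
	return (flags + 1) & O_ACCMODE;
}
```

Our patch makes the caller compute this once and hand the result to dentry_open(), instead of dentry_open() deriving it from flags it may not have been given faithfully.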
Re: __block_prepare_write(): bug?
In message <[EMAIL PROTECTED]>, Ion Badulescu writes: [...] > The current implementation will also populate the page cache with pages > that are not Uptodate, but are not Locked either, which is clearly a bug. > It will always happen if there is a partial write to a page, e.g. if a > program creates a file and then writes 1.5k worth of data, on a 1k-block > filesystem. > > It should be fixed either by getting all the buffers within the page > Uptodate, or by throwing away the page at the end of the write operation. > > > Ion

Right. This messed up our stacking code a bit. The VFS essentially does this in generic_file_read:

    read_cache_page()
    wait_on_page()
    if (!Page_uptodate()) {
        report an error
    }

Since our stacking behaves like a VFS, we have to reproduce the above code in our readpage(). For some file systems, such as cryptfs, there are two pages in memory for each normal page: one ciphertext and one cleartext. But now we have a problem in the following scenario:

(1) you copy a file through the lower-level file system (ext2) which has a 1k block size for a 4k page size (intel)
(2) the file you copy isn't an exact multiple of PAGE_CACHE_SIZE
(3) the caching at the ext2 level will put the last page in the cache, with only some of the buffers being BH_Uptodate. This is fs/buffer.c:__block_commit_write(), which ext2 uses. The code in that function will not set the page uptodate flag on partial pages that only have a few buffers uptodate.

So now we have a page in the cache that is not up-to-date.
Now see what happens when our stacking layer executes the code similar to generic_file_read() in our own readpage():

    read_cache_page(lower_page)       -> we find it
    wait_on_page(lower_page)          -> page not locked, no more wait
    if (!Page_uptodate(lower_page)) { -> page is NOT uptodate
        report an error               -> we flag an error
    }

There is no way for us to fix the problem in our stacking code b/c we cannot distinguish b/t a page that is truly not up-to-date and a partial page such as the last page of a file just written. Note also that there is no problem if the file is written *through* our stacking layer, b/c then we can force the up-to-date flag on the cached pages. There is also no problem if we read a file which was not cached at the ext2 level and we read it through our stacked layer; in that case, the page comes back up-to-date and we're ok. It's only when __block_commit_write() runs that we have a problem.

Summary: a page should not be in the cache and not be up-to-date. If it is not up-to-date, then it should also probably be locked, but only b/c it is probably in transit from the disk to the cache. We have tested the patch included here, which tries to ensure that no pages are left in the cache while not up-to-date; it fixes __block_prepare_write(). Can someone who knows the buffer.c code well comment on this? I can't see a way out of this situation without one of the following:

- no partial pages are left in the cache (our patch)
- partial pages are put in the cache if they are the last page of a file, but they are then marked up-to-date, and the rest of the code is changed to handle this special situation
- a new flag is added to pages that indicates that the page is partial (last page of a file) *and* up-to-date. That way, everyone can write code that handles this situation.

Thanks, Erez.
diff -ruN linux-2.3.99-pre3-vanilla/fs/buffer.c linux-2.3.99-pre3-fist/fs/buffer.c
--- linux-2.3.99-pre3-vanilla/fs/buffer.c	Tue Mar 21 14:30:08 2000
+++ linux-2.3.99-pre3-fist/fs/buffer.c	Mon Apr  3 08:59:11 2000
@@ -1448,10 +1448,6 @@
 		if (!bh)
 			BUG();
 		block_end = block_start+blocksize;
-		if (block_end <= from)
-			continue;
-		if (block_start >= to)
-			break;
 		bh->b_end_io = end_buffer_io_sync;
 		if (!buffer_mapped(bh)) {
 			err = get_block(inode, block, bh, 1);
@@ -1459,10 +1455,15 @@
 				goto out;
 			if (buffer_new(bh)) {
 				unmap_underlying_metadata(bh);
-				if (block_end > to)
-					memset(kaddr+to, 0, block_end-to);
-				if (block_start < from)
-					memset(kaddr+block_start, 0, from-block_start);
+				if (block_end <= from || block_start >= to)
+					memset(kaddr+block_start, 0, block_end);
+				else {
+					if (block_end > to)
+						memset(kaddr+to, 0, block_end-to);
+					if
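The failure mode described above is easy to model in userspace. In this toy model (all names and the fixed four-buffers-per-page layout are mine, not kernel code), a commit that covers only some of a page's buffers leaves the page unlocked in the cache yet not uptodate, which is exactly the state a stacked readpage must misread as an error:

```c
#include <assert.h>

#define BUFS_PER_PAGE 4		/* e.g. a 4k page over 1k blocks */

struct model_page {
	int locked;
	int uptodate;
	int buf_uptodate[BUFS_PER_PAGE];
};

/* Model of __block_commit_write(): mark buffers [0, nbufs) uptodate and
 * set the page uptodate only if *all* of its buffers are. */
static void model_commit_write(struct model_page *p, int nbufs)
{
	int i, all = 1;

	for (i = 0; i < nbufs; i++)
		p->buf_uptodate[i] = 1;
	for (i = 0; i < BUFS_PER_PAGE; i++)
		if (!p->buf_uptodate[i])
			all = 0;
	p->uptodate = all;
	p->locked = 0;		/* unlocked, but possibly still !uptodate */
}

/* What a stacked readpage sees: unlocked and !uptodate looks like a
 * genuine read error, even for a perfectly valid partial last page. */
static int stacked_read_sees_error(const struct model_page *p)
{
	return !p->locked && !p->uptodate;
}
```

The model shows why no amount of logic in the stacking layer can resolve the ambiguity: both the error case and the partial-last-page case present the same two flags.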
Re: __block_prepare_write(): bug?
In message <[EMAIL PROTECTED]>, Alexander Viro writes: > > > On Wed, 5 Apr 2000, Erez Zadok wrote: > > > - if (block_start >= to) > > - break; > > bh->b_end_io = end_buffer_io_sync; > > if (!buffer_mapped(bh)) { > > err = get_block(inode, block, bh, 1); > > And there you go: bloody thing bumps the size of every file to 4k > boundary. Which is _not_ going to make fsck[1] happy, since ->i_size is > not consistent with the block pointers in inode. get_block() has side > effects, damnit. > > [1] (8), that is. We were not sure our patch was right, and now we are certain it isn't. Thanks to you and Erik for pointing out these problems. Maybe all that's needed is more documentation in the code? Either way, we'll have to change our stacking code so that it'll probably do a readpage after a wait_on_page that isn't uptodate. Thanks, Erez.
new VFS method sync_page and stacking
Background: my stacking code for linux is minimal. I only stack on things I absolutely have to. By "stack on" I mean that I save a link/pointer to a lower-level object in the private data field of an upper-level object. I do so for struct file, inode, dentry, etc. But I do NOT stack on pages. Doing so would complicate stacking considerably. So far I was able to avoid this b/c every function that deals with pages is also passed a struct file/dentry, so I can find the correct lower page.

The new method, sync_page(), is only passed a struct page. So I cannot stack on it! If I have to stack on it, I'll have to either

(1) complicate my stacking code considerably by stacking on pages. This is impossible for my stackable compression file system, b/c the mapping of upper and lower pages is not 1-to-1.

(2) change the kernel so that every instance of sync_page is passed the corresponding struct file. This isn't pretty either.

Luckily, sync_page isn't used much. Only nfs seems to use it at the moment. All other file systems which define ->sync_page use block_sync_page(), which is defined as:

    int block_sync_page(struct page *page)
    {
        run_task_queue(&tq_disk);
        return 0;
    }

This is confusing. Why would block_sync_page ignore the page argument and call something else? The name "block_sync_page" might be misleading. The only thing I can think of is that block_sync_page is a placeholder for a time when it would actually do something with the page.

Anyway, since sync_page appears to be an optional method, I've tried my stacking without defining my own ->sync_page. Preliminary results show it seems to work. However, if at any point I have to define ->sync_page and call the lower file system's ->sync_page, I'd urge a change in the prototype of this method that would make it possible for me to stack this operation. Also, I don't understand what ->sync_page is for in the first place. The name of the function implies it might be something like a commit_write.
Thanks, Erez.
Re: new VFS method sync_page and stacking
In message <[EMAIL PROTECTED]>, "Roman V. Shaposhnick" writes: > On Sun, Apr 30, 2000 at 04:46:37AM -0400, Erez Zadok wrote: > > Background: my stacking code for linux is minimal. I only stack on > > things I absolutely have to. By "stack on" I mean that I save a > > link/pointer to a lower-level object in the private data field of an > > upper-level object. I do so for struct file, inode, dentry, etc. But I > > do NOT stack on pages. Doing so would complicate stacking considerably. > > So far I was able to avoid this b/c every function that deals with pages > > also passes a struct file/dentry to it so I can find the correct lower > > page. > > > > The new method, sync_page() is only passed a struct page. So I cannot > > stack on it! If I have to stack on it, I'll have to either > > If inode will be enough for you than ( as it is implemented in > nfs_sync_page ) you can do something like: >struct inode*inode = (struct inode *)page->mapping->host; Yes I can probably do that. I can get the inode, from it I can get the lower level inode since I stack on inodes. Then I can call grab_cache_page on the i_mapping of the lower inode and given this page's index. I'll give this idea a try. Thanks. > > (2) change the kernel so that every instance of sync_page is passed the > > corresponding struct file. This isn't pretty either. > > >Did you see my letter about readpage ? Nevertheless, I think that first > argument of every function from address_space_operations should be "struct > file *" and AFAIK this is 1) possible with the current kernel 2) will > simplify things a lot since it lets one to see the whole picture: > file->dentry->inode->pages, not the particular spot. Yes, I saw your post. I agree. I'm all for common-looking APIs. > Roman. Erez.
Re: new VFS method sync_page and stacking
In message <[EMAIL PROTECTED]>, Steve Dodd writes: > On Sun, Apr 30, 2000 at 01:44:50PM +0400, Roman V. Shaposhnick wrote: > > >Did you see my letter about readpage ? Nevertheless, I think that first > > argument of every function from address_space_operations should be > > "struct file *" and AFAIK this is 1) possible with the current kernel 2) will > > simplify things a lot since it lets one to see the whole picture: > > file->dentry->inode->pages, not the particular spot. > > But an address_space is (or could be) a completely generic cache. It might > never be associated with an inode, let alone a dentry or file structure. > > For example, I've got some experimental NTFS code which caches all metadata > in the page cache using the address_space stuff. (This /mostly/ works really > well, and makes the code a lot simpler. The only problem is > block_read_full_page() and friends, which do: > > struct inode *inode = (struct inode*)page->mapping->host; > > At the moment I have an evil hack in place -- I'm kludging up an inode > structure and temporarily changing mapping->host before I call > block_read_full_page. I'd really like to see this cleaned up, though I accept > it may not happen before 2.5.) It sounds like different people have possibly conflicting needs. I think any major changes should wait for 2.5. I would also suggest that such significant VFS changes be discussed on this list so we can ensure that we can all get what we need out of the VFS. Thanks. Erez.
Re: new VFS method sync_page and stacking
In message <[EMAIL PROTECTED]>, "Roman V. Shaposhnick" writes: > On Sun, Apr 30, 2000 at 03:28:18PM +0100, Steve Dodd wrote: [...] > > But an address_space is (or could be) a completely generic cache. It > > might never be associated with an inode, let alone a dentry or file > > structure. [...] > Thus my opinion is that address_space_operations should remain > file-oriented ( and if there are no good contras take the first argument > of "struct file *" type ). At the same time it is possible to have completely > different set of methods around the same address_space stuff, but from my > point of view this story has nothing in common with how an *existing* > file-oriented interface should work. > > Thanks, > Roman.

If you look at how various address_space ops are called, you'll see enough evidence of an attempt to make this interface both a file-based interface and a generic cache one (well, at least as far as I understood the code):

(1) generic_file_write (mm/filemap.c) can call ->commit_write with a normal non-NULL file.

(2) block_symlink (fs/buffer.c) calls ->commit_write with NULL for the file arg.

So perhaps, to satisfy the various needs, all address_space ops should be passed a struct file which may be NULL; the individual f/s will have to check for it being NULL and deal with it. (My stacking code already treats commit_write this way.) Erez.
Re: new VFS method sync_page and stacking
In message <[EMAIL PROTECTED]>, Steve Dodd writes: > On Sun, Apr 30, 2000 at 04:46:37AM -0400, Erez Zadok wrote: > > > Background: my stacking code for linux is minimal. I only stack on > > things I absolutely have to. By "stack on" I mean that I save a > > link/pointer to a lower-level object in the private data field of an > > upper-level object. I do so for struct file, inode, dentry, etc. But I > > do NOT stack on pages. Doing so would complicate stacking considerably. > > So far I was able to avoid this b/c every function that deals with pages > > also passes a struct file/dentry to it so I can find the correct lower > > page. > > You shouldn't need to "stack on" pages anyway, I wouldn't have thought. > For each page you can reference mapping->host, which should point to the > hosting structure (at the moment always an inode, but this may change). > > > The new method, sync_page() is only passed a struct page. So I cannot > > stack on it! If I have to stack on it, I'll have to either > > > > (1) complicate my stacking code considerably by stacking on pages. This is > > impossible for my stackable compression file system, b/c the mapping of > > upper and lower pages is not 1-to-1. > > Why can your sync_page implementation not grab the inode from mapping->host > and then call sync_page on the underlying fs' page(s) that hold the data? I can, and I do (at least now :-) I tried it last night and so far it seems to work just fine. > > (2) change the kernel so that every instance of sync_page is passed the > > corresponding struct file. This isn't pretty either. > > I'd like to the see the address_space methods /lose/ the struct file / > struct dentry pointer, but it may be there are situations which require > it. I took a closer look at my address_space ops for stacking. We don't do anything special with the struct file/dentry that we get. We just pass those along (or their lower unstacked counterparts) to other address_space ops which require them. 
We get the corresponding lower pages using the mapping->host inode. I also agree that pages should be associated with the inode, not the file/dentry. So I'm now leaning more towards losing the struct file/dentry from the address_space ops. Furthermore, since the address_space structure showed up relatively recently, we might consider cleaning up this API before 2.4. I believe my stacking code would work fine w/o the struct file/dentry being passed around (Ion, can you verify this please?) Thanks for the info, Steve. Erez.
Re: fs changes in 2.3
In message <[EMAIL PROTECTED]>, [EMAIL PROTECTED] writes:

> On Mon, May 01, 2000 at 06:25:43PM +0200, Peter Schneider-Kamp wrote:
> > I second that. I had to stop maintaining the steganographic file
> > system around 2.3.7 because I did not have that much time to
> > find out where my fs is "broken" and needs to be "fixed".
>
> FYI, the changes which broke filesystems in 2.3.8 were page cache /
> buffer cache changes and as such were VM changes, not VFS. They were
> a major change that was required to make Linux more scalable.

Ideally, developing file systems would involve only the VFS. In practice, it involves the VM as well. I've worked on stacking interfaces for several different OSs, and as much as they all want the VFS and VM to be two completely separate entities, in practice they are not. About half of the effort I spent on my stacking templates was related to the VM and changes to the VM (linux/mm/*.c). IOW, most people who maintain file systems must track changes in both the VFS and the VM.

That said, I'm quite pleased with the changes that happened in the late 2.3.40s: breaking some operations out into address_space ops, and more. IMHO the separation b/t the VFS and the VM became clearer then, and it allowed me to clean up my stacking code quite nicely, as well as make easy use of vfs_ calls, generic_file_{read,write}, and more.

Erez.
stackable f/s patches for 2.3.99-pre6
Hi Al.

First, thanks for applying my last set of patches. Here are a few more that we have, some of which are the result of recent VFS changes. The patches are passive. They are intended to make the Linux kernel more "stacking-friendly" so that my stacking templates can call VFS code directly rather than reproducing it. As a result of my patches, I hope the VFS is gradually becoming cleaner.

Here's what the patch below does:

- dentry_open cannot be called as is from stackable templates because it computes the 'mode' variable internally. So we changed dentry_open such that it accepts a specific mode variable, separate from "flags". Then we assign the mode which was passed to dentry_open to f->f_mode.

- we changed all 3 invocations of dentry_open to pass the correct mode information:

  o fs/exec.c:open_exec(): we pass FMODE_READ to the dentry_open call.

  o fs/open.c:filp_open(): we compute and pass the correct mode to dentry_open based on the flags passed to filp_open. This makes the code in filp_open clearer, rather than distributing mode computations inside and outside various functions.

  o ipc/shm.c:sys_shmat(): we pass "prot", which is already computed in this function, further showing that computing the mode flags inside dentry_open isn't correct; it should be done by the caller of dentry_open().

- we clarified filp_open() by adding a new variable called open_flags, which is computed from the flags passed to filp_open. This is passed to the dentry_open call as the "mode" argument.

- we moved the static inline function sync_page from filemap.c to mm.h so that it can be called from stackable file systems. In many ways this sync_page function is a "VFS" callable function that could be called by other file systems.

Comments are welcome. Let me know if you have any concerns about this patch, and we can work on it some more.

Thanks,
Erez.
## diff -ruN linux-2.3.99-pre6-vanilla/fs/exec.c linux-2.3.99-pre6-fist/fs/exec.c
--- linux-2.3.99-pre6-vanilla/fs/exec.c	Fri Apr 21 16:36:39 2000
+++ linux-2.3.99-pre6-fist/fs/exec.c	Fri Apr 28 22:36:32 2000
@@ -331,7 +331,7 @@
 	int err = permission(nd.dentry->d_inode, MAY_EXEC);
 	file = ERR_PTR(err);
 	if (!err) {
-		file = dentry_open(nd.dentry, nd.mnt, O_RDONLY);
+		file = dentry_open(nd.dentry, nd.mnt, FMODE_READ, O_RDONLY);
 out:
 	unlock_kernel();
 	return file;
diff -ruN linux-2.3.99-pre6-vanilla/fs/open.c linux-2.3.99-pre6-fist/fs/open.c
--- linux-2.3.99-pre6-vanilla/fs/open.c	Mon Apr 24 19:10:27 2000
+++ linux-2.3.99-pre6-fist/fs/open.c	Fri Apr 28 22:36:32 2000
@@ -615,12 +615,12 @@
 }

 /*
- * Note that while the flag value (low two bits) for sys_open means:
+ * Note that while the flags value (low two bits) for sys_open means:
  *	00 - read-only
  *	01 - write-only
  *	10 - read-write
  *	11 - special
- * it is changed into
+ * when it is copied into open_flags, it is changed into
  *	00 - no permissions needed
  *	01 - read-permission
  *	10 - write-permission
@@ -630,23 +630,24 @@
  */
 struct file *filp_open(const char * filename, int flags, int mode)
 {
-	int namei_flags, error;
+	int namei_flags, open_flags, error;
 	struct nameidata nd;

 	namei_flags = flags;
-	if ((namei_flags+1) & O_ACCMODE)
+	open_flags = ((flags + 1) & O_ACCMODE);
+	if (open_flags)
 		namei_flags++;
 	if (namei_flags & O_TRUNC)
 		namei_flags |= 2;

 	error = open_namei(filename, namei_flags, mode, &nd);
 	if (!error)
-		return dentry_open(nd.dentry, nd.mnt, flags);
+		return dentry_open(nd.dentry, nd.mnt, open_flags, flags);
 	return ERR_PTR(error);
 }

-struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags)
+struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int mode, int flags)
 {
 	struct file * f;
 	struct inode *inode;
@@ -657,7 +658,7 @@
 	if (!f)
 		goto cleanup_dentry;
 	f->f_flags = flags;
-	f->f_mode = (flags+1) & O_ACCMODE;
+	f->f_mode = mode;

 	inode = dentry->d_inode;
 	if (f->f_mode & FMODE_WRITE) {
 		error = get_write_access(inode);
diff -ruN linux-2.3.99-pre6-vanilla/include/linux/fs.h linux-2.3.99-pre6-fist/include/linux/fs.h
--- linux-2.3.99-pre6-vanilla/include/linux/fs.h	Wed Apr 26 18:29:07 2000
+++ linux-2.3.99-pre6-fist/include/linux/fs.h	Sun Apr 30 06:20:24 2000
@@ -855,7 +855,7 @@
 extern void put_unused_fd(unsigned int);	/* locked inside */
 extern struct file *
announcing stackable file system templates and code generator
It is my pleasure to announce fistgen-0.0.1, the first release of the FiST code generator, used to create stackable file systems out of templates and a high-level language. This package comes with stackable file system templates for Linux, Solaris, and FreeBSD. It also contains several sample file systems built using the FiST language: an encryption file system, a compression file system, and more --- all of which are written as portable stackable file systems. Linux 2.3 folks: my stackable templates now support Size Changing Algorithms (SCAs) such as compression, uuencoding, etc. See specific papers and sample file systems for more details. For more information, software, and papers, see the FiST home page: http://www.cs.columbia.edu/~ezk/research/fist/ Happy stacking. Erez Zadok. --- Columbia University Department of Computer Science. EMail: [EMAIL PROTECTED] Web: http://www.cs.columbia.edu/~ezk
Re: file checksums
In message <[EMAIL PROTECTED]>, Thomas Pornin writes:

> On Tue, May 09, 2000 at 03:13:40PM -0400, Theodore Y. Ts'o wrote:
> > ... and what prevents the attacker from simply updating the checksum
> > when he's modifying the blocks?
>
> As you may have not noticed, I am talking about a block device where
> every data is enciphered. To be more specific, each 64 bit (or 128 bit)
> block is enciphered with a different key. The attacker has not access to
> the data, neither to the checksum. However, he knows where these items
> are, and may perform modifications (although they would be essentially
> random). Hence the checksum.
>
> > Clearly you don't understand about cryptographic checksum.
>
> Sarcasm ignored. I have been studying cryptography for the last 5 years.

Thomas, I must agree with Ted. I've written a cryptographic f/s myself, mostly as an exercise in stacking, and have also written several papers on related topics. What you've described in this thread didn't convince me that it is a strong and secure design. Perhaps it wasn't explained in enough detail. Either way, you should follow Ted's advice and pass your detailed design by some of the security-oriented mailing lists. (Warning: security buffs aren't as polite in their criticism as the people on this list... :-)

The reasons you've given for separate checksums aren't very compelling. You can get most of what you want by using a different cipher, perhaps in CBC mode. This would allow you to detect corruptions or attacks in the middle of a file, if that's what you're concerned about. If you haven't already, I suggest you also read some of the more prominent papers in the area of secure file systems:

- Matt Blaze's CFS
- Mazieres's SFS (from the last SOSP)
- others you can follow from those two

> --Thomas Pornin

Since I'm partial to stacking anyway, let me suggest an alternate design using stacking. If you can pull that off, you'll have the advantage of a f/s that works on top of any other f/s.
This allows safe and fast backups of ciphertext, for example (assuming you're backing up via the f/s, not dump-ing the raw device).

- use overlay stacking, so you hide the mount point. This also helps in hiding certain files.

- for each file foo, you create a file called foo.ckm, which will contain your checksum information in whatever way you choose. You'll have to come up with a fast, reliable, incremental checksum algorithm. It may exist; I don't know. If you use the xor you've suggested, you're not that secure. If you use MD5, you waste CPU b/c you'll have to re-compute the checksum on every tiny change to the file. The .ckm file may contain checksum info for the whole file, per block, or whatever you choose. Your f/s will manage that file any way it wants.

- make sure that the .ckm files aren't listed by default: this means hacking readdir and lookup, possibly others.

- make sure that only authenticated, authorized users can view/access/modify the .ckm files (if you want that). You can do that using special-purpose ioctls that pass (one-time?) keys to the f/s. You can use new ioctls in general to create a whole API for updating the .ckm files.

- limit root attacks as much as possible. One of my cryptfs versions stored keys in memory only, hashed by the uid and the session-ID of the process. The SID was added to make it harder for root users to decrypt other users' files. It's not totally safe, but every bit helps.

You can do much of the above with my stacking templates. I've distributed f/s examples showing how to achieve various features. You can get more info from http://www.cs.columbia.edu/~ezk/research/fist/. Either way, note that this stackable method still doesn't give you all the security you want: attackers can still get at the raw device of the lower f/s; there is a window of opportunity b/t the lower mount and the stackable f/s mount; anything that's in memory or can swap/page over the network is vulnerable; and more.
You will find that there's a tremendous amount of effort and many details that must be addressed to build a secure file system. I for one would love to see ultra-secure, fast cryptographic file systems become a standard component of operating systems. Good luck. Erez.
Re: Multiple devfs mounts
In message <[EMAIL PROTECTED]>, Chris Wedgwood writes:

> On Tue, May 02, 2000 at 12:15:20AM -0400, Theodore Y. Ts'o wrote:
> > Date: Mon, 1 May 2000 11:27:04 -0400 (EDT)
> > From: Alexander Viro <[EMAIL PROTECTED]>
> >
> > Keep in mind that userland may need to be taught how to deal with getdents()
> > returning duplicates - there is no reasonable way to serve that in
> > the kernel.
>
> *BSD does this in libc, for the exactly same reason; there's no good way
> to do this in the kernel.

[...]

> I'm not sure how efficient and fast the code would be to make this
> work quickly, for large numbers of file systems it might prove
> horribly slow.

IMHO the BSD hacks to libc to support unionfs were ugly. To write unionfs, they used the existing nullfs "template", but then they had to modify the VFS *and* other user-land stuff.

It depends on what you mean by "reasonable way" and "good way". I've done it in my prototype implementation of unionfs, which uses a fan-out stackable f/s:

(1) you read directory 1, and store the names you see in a hash table.

(2) as you read each entry from subsequent directories, you check if it's in the hash table. If it is, skip it; if it's not, add it to the getdents output buf, and add the entry to the hash table.

This was a simple design and easy to implement. Yes, it added overhead to readdir(2), but not as much as you'd think. It was certainly not "horribly slow", nor did it chew up lots of ram. I tried it on several directories with several dozen entries each (i.e., typical directory sizes), not on directories with thousands or more entries.

I think that if we're adding directory unification features to the linux kernel, then we should add uniquification of names to the kernel as well. One possible way would be to take advantage of the fact that most readdir()'s are followed by lstat()s of each entry (hence NFSv3's READDIRPLUS): so when you do a readdir, maybe it's best to pre-create a mini-dentry for each such entry, in anticipation of its probable use.
The advantage there is that the dentry already has the name, and we already have code to do dentry lookups based on their name. > --cw Erez.
Re: [RFC] Possible design for "mount traps"
In message <[EMAIL PROTECTED]>, Alexander Viro writes: [...] > So what about the following trick: let's allow vfsmounts without > associated superblock and allow to "mount" them even on the negative > dentries? Notice that the latter will not break walk_name() - checks for > dentry being negative are done after we try to follow mounts. > Notice also that once we mount something atop of such vfsmount it > becomes completely invisible - it's wedged between two real objects and > following mounts will walk through it without stopping. > So the only case when these beasts count is when they are > "mounted", but nothing is mounted atop of them. But that's precisely the > class of situations we are interested in. In case of autofs we want > follow_down() into such animal to trigger mounting, in case of portalfs - > passing the rest of pathname to daemon, in case of devfs-with-automount > we want to kick devfsd. So let them have a method that would be called > upon such follow_down() (i.e. one when we have nothing mounted atop of > us). And that's it. > These objects are not filesystems - they rather look like a traps > set in the unified tree. Notice that they do not waste anon device like > "one node autofs" would do. > That way if autofs daemon mounted /mnt/net/foo it would not follow > up with /mnt/net/foo/bar - it would just set the trap in /mnt/net/foo/bar > and let the actual lookups trigger further mounts. [...] This sounds almost identical to what Sun did to solve similar problems in their first version of autofs. There's a paper in LISA '99 describing their enhancements to the original autofs. Your proposal, however, is better b/c it generalizes to more than autofs. Erez.
Re: file checksums
In message <[EMAIL PROTECTED]>, Thomas Pornin writes:

[...]
> To answer to the second question, I need the experience from the
> filesystem-guys, who know how a filesystem is typically used, and
> who are supposed to at least lurk in this mailing-list. Hence this
> discussion.
>
> It is my understanding that typical filesystem write usage is either
> creating new files and filling them, truncating files to zero length and
> filling them, and appending to files. For these operations, the checksum
> cost is zero (in terms of disk accesses). This is not true for mmaped()
> files (databases, production of executables...) but I think (and I beg
> for comments) that this may be handled with a per-file exception (for
> instance, producing the executable file in /tmp and then copying it
> into place -- since /tmp may be emptied at boottime, it is pointless to
> ensure data integrity in /tmp when the machine is not powered).
>
> Any comment is welcome, of course. I thank you for sharing part of your
> time reading my prose.
>
> --Thomas Pornin

Thomas, f/s usage patterns vary of course. There are at least three papers you should take a look at concerning this topic:

- L. Mummert and M. Satyanarayanan. "Long Term Distributed File Reference Tracing: Implementation and Experience." Technical Report CMU-CS-94-213, Carnegie Mellon University, Pittsburgh, PA, 1994.

- W. Vogels. "File System Usage in Windows NT 4.0." SOSP '99.

- D. Roselli, J. R. Lorch, and T. E. Anderson. "A Comparison of File System Workloads." To appear in USENIX Conf. Proc., June 2000. (You're going to have to wait to read that one, unless you can get an advance copy from the authors. It's an interesting paper.)

Also, I've had to visit this issue recently when I added size-changing support to my stackable templates (for compression, CBC encryption, all kinds of encoding, etc.)
You can get a copy of a paper (entitled "Performance of Size-Changing Algorithms in Stackable File Systems") in http://www.cs.columbia.edu/~ezk/research/fist/. Cheers, Erez.
Re: [prepatch] Directory Notification
In message <8ghn4m$965$[EMAIL PROTECTED]>, Ton Hospel writes:

> In article <[EMAIL PROTECTED]>,
> "Theodore Y. Ts'o" <[EMAIL PROTECTED]> writes:
> > This was discussed on IRC, but for those who weren't there it
> > should be clear that the current implementation uses dentries, so if you
> > have a file which is hard-linked to appear in two different directories,
> > only the parent directory which was used as an access path when the file
> > was changed would get notified.
> >
> > That is, if /usr/foo/changed_file and /usr/bar/changed_file are hard
> > links, and a user-program modifies /usr/foo/changed_file via that
> > pathname, a server who had asked for directory notification on /usr/bar
> > would not get notified that /usr/bar/changed_file had changed.
> >
> > This is a pretty fundamental limitation, and can't really be fixed
> > without using inode numbers as the notification path; but that requires
> > a very different architecture, and that design wouldn't work for those
> > filesystems that don't use inode numbers. Life is full of tradeoffs.
>
> Still, that's pretty yucky. Inode based notification should be the default
> behaviour, with the others the exception (for the others the filename path
> is usually the ONLY path).

I think there are uses for notification of a change on inodes and dentries, maybe even files. Some information is available in one and not in the others. Nevertheless, it's clear that inode notification is the most obvious first choice.

> I don't care HOW my files got changed, just if they got changed.
>
> Good thing directory hardlinking is very rare. on the other hand,
> multiple mounting is coming

AFAIK, directory hard-linking isn't available in Linux, and that's a good thing. Not only was it a seldom-used feature, it wreaked havoc on recursive programs like find, rm, and even backup tools like Legato Networker. The reason was that you cd to a directory one way, but when you cd out of it (cd ..) you find yourself ...
"transported through a wormhole to another dimension..." You bring an interesting point, however. With the new multiple mounting and vfsmount stuff, I hope that we're not re-introducing the same problems that directory hard-linking caused. Erez.
superblock refcount
Al, since we are now allowing multiple struct vfsmount objects to point to the same super_block, shouldn't struct super_block have a refcount variable? Erez.
->truncate may need struct file
Ion and I ported our stacking templates to the last few releases, including 2.3.99-pre9. We had to create a fake struct file inside our ->truncate method so we can pass it to some as-ops methods. We believe that sooner or later, truncate (and notify_change and friends) may need to have a struct file passed to them. This has become more evident since the recent cleanup of the as-ops.

Some background: our stacking templates support file systems that change data size, and we used that in our compression f/s, gzipfs. When data is written in the middle of a compressed file, the new sequence may be shorter or longer than the previous sequence, so we handle it by shifting the data in the compressed file inward/outward as needed. Truncate() can be used to shrink a file or extend it. When truncate is called to shrink a file, we first check if the truncation occurs in the middle of a compressed sequence, and we re-compress the remaining bytes to ensure data validity. When truncate is used to enlarge a file, we often have to compress a "chunk" of data that contains the bytes at the previous end of the file followed by the new zeros. A truncation that increases the file's size may also result in multiple holes being created. Anyway, there are lots more details to our stacking support for size-changing file systems, all of which we've outlined in a paper, if anyone is interested.

The key point for this message is that our code has to perform data and page operations on a file from inside the ->truncate method. But truncate doesn't have a struct file. It has a dentry (and hence an inode), but that's not enough, because the new address ops require passing a struct file to them. We've managed to hack around this by creating a fake struct file, stuffing in the dentry that we get in ->truncate, and passing that file around. It seems to work so far, but it's an ugly solution that may not work in the long run.
Having to create fake objects is a possible indication that something is missing from an API. For now we're OK, but I think we should consider passing a struct file to ->truncate, and maybe even moving ->truncate from the inode ops to the as-ops. Here is a brief list of reasons:

(1) truncate is fundamentally an operation that is asked to operate on a file without being given a file to operate on. Truncating or enlarging a file implies munging data pages.

(2) ftruncate(2) takes an open fd, meaning that a struct file must already exist in the kernel b/c a user process opened it. truncate(2) takes a pathname, which the kernel must translate somehow into a file and/or a dentry. This further suggests that the two syscalls are probably implemented in the kernel using a struct file, so it should be possible to pass that struct file further down.

(3) if we pass a struct file to ->truncate instead of a dentry, very little change will be required to actual file systems. The changes to the VFS may not be simple, however. My quick inspection showed that notify_change and friends don't have easy access to a struct file.

(4) our stacking code can certainly use a struct file passed to ->truncate, and that would help us get rid of the ugly fake file we have to create now. If at any point in the future someone (NFS?) needs to put something inside the private data of struct file, then it would be necessary to pass a struct file to ->truncate.

Comments? Thanks, Erez.
stackable f/s patches for 2.3.99-pre9
Al, Linus,

Ion and I ported our stacking templates to all versions up to 2.3.99-pre9. We've cut a new release, fistgen-0.0.3, available in http://www.cs.columbia.edu/~ezk/research/fist/. The following kernel patches are passive and are functionally identical to the patches we've submitted for -pre6. They are intended to make the Linux kernel more "stacking-friendly" so that my stacking templates can call VFS code directly rather than reproducing it.

Here's what the patch below does:

- dentry_open cannot be called as is from stackable templates because it computes the 'mode' variable internally. So we changed dentry_open such that it accepts a specific mode variable, separate from "flags". Then we assign the mode which was passed to dentry_open to f->f_mode.

- we changed all 3 invocations of dentry_open to pass the correct mode information:

  o fs/exec.c:open_exec(): we pass FMODE_READ to the dentry_open call.

  o fs/open.c:filp_open(): we compute and pass the correct mode to dentry_open based on the flags passed to filp_open. This makes the code in filp_open clearer, rather than distributing mode computations inside and outside various functions.

  o ipc/shm.c:sys_shmat(): we pass "prot", which is already computed in this function, further showing that computing the mode flags inside dentry_open isn't correct; it should be done by the caller of dentry_open().

- we clarified filp_open() by adding a new variable called open_flags, which is computed from the flags passed to filp_open. This is passed to the dentry_open call as the "mode" argument.

- we moved the static inline function sync_page from filemap.c to mm.h so that it can be called from stackable file systems. In many ways this sync_page function is a "VFS" callable function that could be called by other file systems.

Comments are welcome. Let me know if you have any concerns about this patch, and we can work on it some more.

Thanks,
Erez.
## diff -ruN linux-2.3.99-pre9-vanilla/fs/exec.c linux-2.3.99-pre9-fist/fs/exec.c
--- linux-2.3.99-pre9-vanilla/fs/exec.c	Sun May 21 14:38:47 2000
+++ linux-2.3.99-pre9-fist/fs/exec.c	Tue May 23 19:30:47 2000
@@ -334,7 +334,7 @@
 	file = ERR_PTR(err);
 	if (!err) {
 		lock_kernel();
-		file = dentry_open(nd.dentry, nd.mnt, O_RDONLY);
+		file = dentry_open(nd.dentry, nd.mnt, FMODE_READ, O_RDONLY);
 		unlock_kernel();
 out:
 	return file;
diff -ruN linux-2.3.99-pre9-vanilla/fs/open.c linux-2.3.99-pre9-fist/fs/open.c
--- linux-2.3.99-pre9-vanilla/fs/open.c	Mon May  8 16:31:40 2000
+++ linux-2.3.99-pre9-fist/fs/open.c	Tue May 23 19:30:47 2000
@@ -599,12 +599,12 @@
 }

 /*
- * Note that while the flag value (low two bits) for sys_open means:
+ * Note that while the flags value (low two bits) for sys_open means:
  *	00 - read-only
  *	01 - write-only
  *	10 - read-write
  *	11 - special
- * it is changed into
+ * when it is copied into open_flags, it is changed into
  *	00 - no permissions needed
  *	01 - read-permission
  *	10 - write-permission
@@ -614,23 +614,24 @@
  */
 struct file *filp_open(const char * filename, int flags, int mode)
 {
-	int namei_flags, error;
+	int namei_flags, open_flags, error;
 	struct nameidata nd;

 	namei_flags = flags;
-	if ((namei_flags+1) & O_ACCMODE)
+	open_flags = ((flags + 1) & O_ACCMODE);
+	if (open_flags)
 		namei_flags++;
 	if (namei_flags & O_TRUNC)
 		namei_flags |= 2;

 	error = open_namei(filename, namei_flags, mode, &nd);
 	if (!error)
-		return dentry_open(nd.dentry, nd.mnt, flags);
+		return dentry_open(nd.dentry, nd.mnt, open_flags, flags);
 	return ERR_PTR(error);
 }

-struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags)
+struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int mode, int flags)
 {
 	struct file * f;
 	struct inode *inode;
@@ -641,7 +642,7 @@
 	if (!f)
 		goto cleanup_dentry;
 	f->f_flags = flags;
-	f->f_mode = (flags+1) & O_ACCMODE;
+	f->f_mode = mode;

 	inode = dentry->d_inode;
 	if (f->f_mode & FMODE_WRITE) {
 		error = get_write_access(inode);
diff -ruN linux-2.3.99-pre9-vanilla/include/linux/fs.h linux-2.3.99-pre9-fist/include/linux/fs.h
--- linux-2.3.99-pre9-vanilla/include/linux/fs.h	Tue May 23 17:18:48 2000
+++ linux-2.3.99-pre9-fist/include/linux/fs.h	Tue May 23 19:48:52 2000
@@ -858,7 +858,7 @@
 extern void put_unused_fd(unsigned int);	/* locked inside
Re: ->truncate may need struct file
In message <[EMAIL PROTECTED]>, Alexander Viro writes:

> [...]
> I suspect that keeping parallel stacks in different levels will turn out
> to be a design mistake. IOW, I'm less than sure that struct file of
> underlying fs should be obtained from struct file of covering one. I think
> that when the case of device nodes will be finally sorted out we'll see
> what the real situation is.

If by "parallel stacks" you mean that each layer keeps its own state, I don't see how you can avoid it altogether. All stacking work in the past (Rosenthal, Skinner, Heidemann, etc.) has argued for layer independence, meaning that each layer keeps its own stuff. Remember that you have to support multiple layers, fan-in, fan-out, and all combinations of those. This makes stacking more modular.

Those guys argued for a massive overhaul of the VM, VFS, and cache system (centralized), and a rewrite of all file systems. That is out of the question for any OS vendor nowadays, and it is the main reason why no OS vendor has seriously adopted their work; but even their improved approaches had each layer maintain its own state.

My stacking work took the approach of not changing anything in the VM, VFS, or other file systems (I got away with very few changes on Linux, and none for Solaris/FreeBSD). This was important to me b/c no one would ever accept or use a piece of work that requires massive kernel changes.

I have every known stacking paper on my Web site, the old papers as well as mine, at http://www.cs.columbia.edu/~ezk/research/fist/. Anyone is welcome to look at them, as well as my stacking sources, and suggest alternatives. I doubt anyone would be able to come up w/ a cleaner stacking interface that does not require a major kernel overhaul. Nevertheless, I would be very happy if I could achieve the same level of flexibility and functionality in stacking that I have today with less code in the templates.
For example, Ion and I were able to get rid of stacking on the vm_area_struct structures some time ago, while keeping the same functionality. If I can avoid stacking on, say, struct file, and keep the same functionality, that sure would make the Wrapfs templates simpler. Erez.
Re: FS_SINGLE queries
In message <[EMAIL PROTECTED]>, Alexander Viro writes: > > > On Sat, 10 Jun 2000, Richard Gooch wrote: > > > I see your point. However, that suggests that the naming of > > /proc/mounts is wrong. Perhaps we should have a /proc/namespace that > > shows all these VFS bindings, and separately a list of real mounts. > > What's "real"? /proc/mounts would better left as it was (funny replacement > for /etc/mtab) and there should be something along the lines of > /proc/namespace (hell knows, we might make it compatible with /proc/ns > from new Plan 9). That something most definitely doesn't need to share the > format with /proc/mounts... On a related note, since we do have /proc/mounts, and assuming that procfs is pretty much necessary nowadays, are we going to get rid of /etc/mtab and completely move all getmntent info into the kernel? I never liked the fact that people doing mounts (such as automounters) have to ensure that they correctly maintain a separate text file in /etc. If we want to go crazy, we can implement mntfs ala Solaris 8, which moved the mnt info into the kernel, but allowed for "editing" /etc/mnttab which is now a special f/s mounted on top of a single file. Hmmm, maybe that's a question to the glibc folks. I guess as long as all the necessary tools and libraries will use /proc/mounts if available, and avoid using /etc/mtab, that'd be ok. Erez.
Re: FS_SINGLE queries
In message <[EMAIL PROTECTED]>, Alexander Viro writes:

> On Fri, 16 Jun 2000, Erez Zadok wrote:
>
> > On a related note, since we do have /proc/mounts, and assuming that procfs
> > is pretty much necessary nowadays, are we going to get rid of /etc/mtab and
> > completely move all getmntent info into the kernel? I never liked the fact
> > that people doing mounts (such as automounters) have to ensure that they
> > correctly maintain a separate text file in /etc.
>
> I'm not sure that we need to keep it on procfs - especially with the
> union-mounts coming into the game.

Procfs or not, I'm advocating keeping it in the kernel only, where it belongs, and removing the kludgy need (ala Sun and many others) to maintain a separate /etc/mtab file.

> > Hmmm, maybe that's a question to the glibc folks. I guess as long as all
> > the necessary tools and libraries will use /proc/mounts if available, and
> > avoid using /etc/mtab, that'd be ok.
>
> How many programs actually need this getmntent(), in the first
> place?

Programs like df(1) need to read mtab. Automounters (such as amd, which I maintain) and /bin/mount need to write it. The problem with a separate mtab file is that there's no way to guarantee that the file in /etc is in sync w/ the actual mounts in the kernel; there are many reasons why you can end up with an mtab file that's out of sync w/ the actual in-kernel mounts. AIX, Ultrix, and BSD44 did the right thing by moving the mtab list into the kernel and rewriting "[gs]etmntent" (they also renamed them) so they query the kernel via a syscall. Solaris 8 moved that way too, but kept backwards compatibility using their special mntfs.

Anyway, I'd like to see a new syscall that returns a list of mounts and associated info in linux. Currently that can be done by reading /proc/mounts, but not if procfs isn't available or if we're going to take /proc/mounts away.
It would make programs like df more reliable, and programs like /bin/mount
wouldn't have to rewrite the mtab file each time a mount(2) is made.  It
would also make amd a little faster (I already auto-detect in-kernel vs.
in-/etc mount tables and handle both in amd).  Anyway, it's not a big thing
or something we need to do right now.

Erez.
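[Editor's sketch, not part of the original post.  This illustrates the
df(1) pattern the thread is discussing: walking a mount table with
getmntent(3) and stat'ing each mount point.  Reading /proc/mounts gives the
in-kernel view directly; list_mounts() is an illustrative name, not a real
API.]

```c
/* Sketch of a df-like loop that reads the in-kernel mount table
 * (/proc/mounts) with getmntent(3) instead of trusting /etc/mtab.
 * Returns the number of entries read, or -1 on error. */
#include <stdio.h>
#include <mntent.h>
#include <sys/statvfs.h>

static int list_mounts(const char *tab)
{
    FILE *fp = setmntent(tab, "r");
    if (!fp)
        return -1;

    struct mntent *mnt;
    int count = 0;
    while ((mnt = getmntent(fp)) != NULL) {
        struct statvfs sv;
        count++;
        /* stat the mount point; an entry served by a hung user-level
         * file server could block here, hence the "ignore" option
         * discussed elsewhere in the thread */
        if (statvfs(mnt->mnt_dir, &sv) == 0)
            printf("%-24s %-24s %10lu blocks free\n",
                   mnt->mnt_fsname, mnt->mnt_dir,
                   (unsigned long) sv.f_bfree);
    }
    endmntent(fp);
    return count;
}
```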
Re: FS_SINGLE queries
In message <[EMAIL PROTECTED]>, [EMAIL PROTECTED] writes:

[...]
> so mount could keep a /etc/mtab2 to record this information, but that's
> freaking ugly.  or we could pass a new mount option down into the kernel
> which causes it to display `loop' in that entry, but this seems like a
> waste of a bit.  other alternatives gladly sought.

Not necessarily.  Several OSs use an "ignore" bit as a mount flag telling
programs like df(1) not to stat certain entries by default.  This is often
used for automounted/autofs entries, where normally no reasonable info can
be returned to statvfs(2); besides, it's a good idea not to slow down df(1)
by stat'ing file systems that may be served by slow user-level file servers
(amd w/o autofs support).  There are cases where such file servers can
return useful info back to statvfs(2) (as amd can).

BTW, the usual reason you don't see such automounted entries is that GNU df
automatically omits entries whose statfs values are all 0, but it still
statfs(2)'s them, which is slow (and will hang if the automounter is hung).
It's much better if the kernel records that certain entries were mounted
with the "ignore" option, so that df(1) simply doesn't statfs them at all.

Erez.
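[Editor's sketch, not part of the original post.  The "ignore" convention
described above can be checked from a mount-table entry with glibc's
hasmntopt(3); some systems instead mark the filesystem *type* as "ignore".
The helper name should_ignore() is hypothetical.]

```c
/* Sketch: decide whether a df-like tool should skip an entry
 * before calling statvfs(2) on it. */
#include <string.h>
#include <mntent.h>

/* Return 1 if the entry carries the "ignore" convention. */
static int should_ignore(const struct mntent *mnt)
{
    return hasmntopt(mnt, "ignore") != NULL ||      /* mount option */
           strcmp(mnt->mnt_type, "ignore") == 0;    /* MNTTYPE_IGNORE */
}
```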
Re: FS_SINGLE queries
In message <[EMAIL PROTECTED]>, Alexander Viro writes:

> On Fri, 16 Jun 2000, Erez Zadok wrote:
> [...]
> > Anyway, I'd like to see a new syscall that returns a list of mounts and
>
> Sigh...  We already have a crapload of syscalls that should not be there.
> If it can be done by open()/read()/write()/lseek()/close() it should be
> done that way.

Hey, we could make it yet another ioctl(2).  Then we'd trade a crapload of
syscalls for a crapload of ioctls --- a time-honored Unix tradition...
:-) :-)

Seriously, an open/read/.../close interface would work fine, but on what
file?  If it's something inside /proc, fine, but has the Linux community as
a whole accepted that procfs is a *must* for any working system "or else"?
If the file to open/read/close won't be in /proc, what type of file would
it be and how would it get created?

Erez.
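[Editor's sketch, not part of the original post.  Viro's point is that no
new syscall is needed if the mount list is exposed as a file: plain
open(2)/read(2)/close(2) suffice.  Assuming /proc is mounted, the whole
table can be slurped like this; slurp_mounts() is an illustrative name.]

```c
/* Sketch: read the mount list with nothing but open/read/close,
 * no stdio and no dedicated syscall.  Returns bytes read or -1. */
#include <unistd.h>
#include <fcntl.h>

static ssize_t slurp_mounts(char *buf, size_t len)
{
    int fd = open("/proc/mounts", O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t total = 0, n;
    /* /proc files must be read in a loop; short reads are normal */
    while ((n = read(fd, buf + total, len - 1 - (size_t)total)) > 0)
        total += n;
    close(fd);
    buf[total] = '\0';
    return total;
}
```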
Re: Are there any FS related docs on the net?
Vitaly, you may also find some of my papers and stackable templates useful
as a form of documentation:

http://www.cs.columbia.edu/~ezk/research/fist/

(I also have a collection of older VFS papers in there.)

Erez.