Re: Read/write counts
> It is not strictly an error to read/write less than the requested amount,
> but you will find that a lot of applications don't handle this correctly.

I'd give it a slightly different nuance. It's not an error, and it's a reasonable thing to do, but there is value in not doing it.

POSIX and its predecessors back to the beginning of Unix say read()/write() don't have to transfer the full count (they must transfer at least one byte). The main reason for this choice is that it may require more resources (e.g. a memory buffer) than the system can allocate to do the whole request at once.

Programs that assume a full transfer are fairly common, but are universally regarded as either broken or just lazy, and when it does cause a problem, it is far more common to fix the application than the kernel. Most application programs access files via libc's fread/fwrite, which don't have partial transfers. GNU libc does handle partial (kernel) reads and writes correctly. I'd be surprised if someone can name a major application that doesn't.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA                          Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
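The defensive pattern for callers is a retry loop. A minimal sketch at the raw syscall level (the function names write_all/read_all are my own, not any standard API): keep issuing the call on the untransferred remainder until it is all done.

```python
import os

def write_all(fd, data):
    """Keep calling write() until every byte is transferred.

    POSIX allows each write() to transfer fewer bytes than requested
    (but at least one), so a robust caller loops over the remainder.
    """
    view = memoryview(data)
    while view:
        n = os.write(fd, view)
        view = view[n:]

def read_all(fd, count):
    """Read up to count bytes, tolerating short reads; EOF ends early."""
    chunks = []
    remaining = count
    while remaining:
        chunk = os.read(fd, remaining)
        if not chunk:            # empty result means end of file
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

This is exactly the loop that fread/fwrite perform internally, which is why programs going through stdio never see partial transfers from the kernel.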
Re: Versioning file system
> Part of the problem is that "whenever you modify a file" is ill-defined,
> or rather, if you were to take the literal meaning of it you'd end up
> with an unmanageable number of revisions.

Let me expand on that. Do you want to save a revision every time the user types in an editor? Every time he runs a "save" command? Every time a program does a write() system call? Every time a program closes a modified file? If you're adding to a C program, is every draft you compile a revision, or just the final modification after the bugs are worked out?

When I was very new to coding, I used VMS and thought the automatic revisioning would be a great thing because it would save me when I modified a program and later regretted it. The system made a revision every time I exited the editor. But I soon found that the "previous revision" to which I wanted to revert was always many editings back, since I spent a lot of time trying to make the regrettable code work before giving up. VMS kept a fixed number of revisions per file. That was fine for program files, but keeping 20 versions of other files would have been wasteful of disk space, directory listing space, etc.

Later, I discovered what I think are superior alternatives: RCS-style version management on top of the filesystem, and automatic versioning based on time instead of a count of "modifications." For example, make a copy of every changed file every hour; keep each copy for a day, keep one of those for a week, keep one of those for a month, and so on. This works even without snapshot technology and even without sub-file deltas. But of course, it's better with those.
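The hour/day/week/month scheme can be expressed as a small retention policy. This sketch is my own illustration (the function name and the exact bucket boundaries are invented): keep every copy from the last day, the newest copy per day for the last week, and the newest per week for roughly a month.

```python
from datetime import timedelta

def snapshots_to_keep(snapshots, now):
    """Pick which timestamped copies to retain: all hourlies for a day,
    one per day for a week, one per week for ~a month, nothing older."""
    newest = {}                      # bucket -> newest timestamp in it
    for ts in snapshots:
        age = now - ts
        if age <= timedelta(days=1):
            bucket = ("hour", ts.date(), ts.hour)
        elif age <= timedelta(weeks=1):
            bucket = ("day", ts.date())
        elif age <= timedelta(days=30):
            year, week, _ = ts.isocalendar()
            bucket = ("week", year, week)
        else:
            continue                 # expired: not retained at all
        if bucket not in newest or ts > newest[bucket]:
            newest[bucket] = ts
    return set(newest.values())
```

Run hourly from cron against a directory of dated copies, something like this decides what to delete; no kernel involvement is needed.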
Re: Versioning file system
> The question remains is where to implement versioning: directly in
> individual filesystems or in the vfs code so all filesystems can use it?

Or not in the kernel at all. I've been doing versioning of the types I described for years with user space code, and I don't remember feeling that I compromised in order not to involve the kernel.

Of course, if you want to do it with snapshots and COW, you'll have to ask where in the kernel to put that, but that's not a file versioning question; it's the larger snapshot question.
Re: Versioning file system
> We don't need a new special character for every new feature. We've got
> one, and it's flexible enough to do what you want, as proven by NetApp's
> extremely successful implementation.

I don't know NetApp's implementation, but I assume it is more than just a choice of special character. If you merely start the directory name with a dot, you don't fool anyone but 'ls' and shell wildcard expansion. (And for some enlightened people like me, you don't even fool ls, because we use the --almost-all option to show the dot files by default, having been burned too many times by invisible files.)

I assume NetApp flags the directory specially so that a POSIX directory read doesn't get it. I've seen that done elsewhere. The same thing, by the way, is possible with Jack's filename:version idea, and I assumed that's what he had in mind. Not that that makes it all OK.
Re: Versioning file system
> The directory is quite visible with a standard 'ls -a'. Instead,
> they simply mark it as a separate volume/filesystem: i.e. the fsid
> differs when you call stat(). The whole thing ends up acting rather like
> our bind mounts.

Hmm. So it breaks user space quite a bit. By break, I mean that uses which work with more conventional filesystems stop working if you switch to NetApp. Most programs that operate on directory trees willingly cross filesystems, right? Even the ones that give you an option to stay within one filesystem, such as GNU cp, cross by default.

But if the implementation is, as described, wildly successful, that means users are willing to tolerate this level of breakage, so it could be used for versioning too. But I think I'd rather see a truly hidden directory for this (visible only when looked up explicitly).
Re: Patent or not patent a new idea
> If your only purpose is to try generate a defensive patent, then just
> dumping the idea in the public domain serves the same purpose, probably
> better.
>
> I have a few patents, some of which are defensive. That has not prevented
> the USPTO issuing quite a few patents that are in clear violation of mine.

That's not what a defensive patent is. Indeed, patenting something just so someone else can't patent it is ridiculous, because publishing is so much easier. A defensive patent is one you file so that you can trade rights to it for rights to other patents that you need.
Re: Patent or not patent a new idea
> md/raid already works happily with different sized drives from
> different manufacturers ...
> So I still cannot see anything particularly new.

As compared to md of conventional disk partitions, it brings the ability to create and delete arrays without shutting down all use of the physical disks (to update the partition tables). (LVM gives you that too.) It also makes managing space much easier, because the component devices don't have to be carved from contiguous space on the physical disks.

Neither of those benefits is specific to RAID, but you could probably say that RAID multiplies the problems they address.
Re: how do versioning filesystems take snapshot of opened files?
> Consistent state means many different things.

And, significantly, open/close has nothing to do with any of them (assuming we're talking about the system calls). open/close does not identify a transaction; a program may open and close a file multiple times in the course of making a "single" update. Also, data and metadata updates remain buffered at the kernel level after a close. And don't forget that a single update may span multiple files.
Re: how do versioning filesystems take snapshot of opened files?
> But you look around, you may find that many systems claim that they can
> take snapshot without shutdown the application.

The claim is true, because you can just pause the application and not shut it down. While this means you can't simply add snapshot capability and solve your copy consistency problem (you need new applications too), this is a huge advance over what there was before.

Without snapshots, you do have to shut down the application. Often for hours, and during that time any service request to the application fails. With snapshots, you simply pause the application for a few seconds. During that time it delays processing of service requests, but every request ultimately goes through, with the requester probably not noticing any difference.

If a system claims that snapshot function in the filesystem alone gets you consistent backups, it's wrong.
Re: how do versioning filesystems take snapshot of opened files?
>> we want a open/close consistency in snapshots.
>
> This depends on the transaction engine in your filesystem. None of the
> existing linux filesystems have a way to start a transaction when the
> file opens and finish it when the file closes, or a way to roll back
> individual operations that have happened inside a given transaction.
>
> It certainly could be done, but it would also introduce a great deal of
> complexity to the FS.

And I would be opposed as a matter of architecture to making open/close transactional. People often read more into open/close than is there, but open is just about gaining access and close is just about releasing resources. It isn't appropriate for close to _mean_ anything.

There are filesystems that have transactions. They use separate start transaction / end transaction system calls (not POSIX).

>> Pausing apps itself does not solve this problem, because a file could
>> be already opened and in the middle of write.

Just to be clear: we're saying "pause," but we mean "quiesce." I.e., tell the application to reach a point where it's not in the middle of anything and then tell you it's there. Indeed, whether you use open/close or some other kind of transaction, just pausing the application doesn't help. If you were to implement open/close transactions, the filesystem driver would just wait for the application to close and in the meantime block all new opens.
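To make "quiesce" concrete, here is a minimal user-space sketch (the Quiescer class and its method names are invented for illustration; this is the cooperation pattern, not any real snapshot API): workers call checkpoint() between transactions, and the snapshot taker's pause() returns only once every worker is parked there, so nothing is mid-update.

```python
import threading

class Quiescer:
    """Cooperative quiesce point for a fixed set of worker threads."""

    def __init__(self, nworkers):
        self.gate = threading.Event()
        self.gate.set()                 # gate open: workers run freely
        self.parked = threading.Barrier(nworkers + 1)

    def checkpoint(self):
        """Called by each worker between transactions."""
        if not self.gate.is_set():
            self.parked.wait()          # report "I'm between transactions"
            self.gate.wait()            # stay parked until resume()

    def pause(self):
        """Returns once all workers are idle at a checkpoint."""
        self.gate.clear()
        self.parked.wait()

    def resume(self):
        self.gate.set()
```

Between pause() returning and resume() being called, the snapshot can be taken with no transaction in flight; a worker already inside a transaction is simply waited for, which is why pause can take a few seconds rather than being instantaneous.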
Re: [ANNOUNCE] util-linux-ng 2.13-rc1
> i dont see how blaming autotools for other people's misuse is relevant

Here's how other people's misuse of the tool can be relevant to the choice of the tool: some tools are easier to use right than others. Probably the easiest thing to use right is the system you designed and built yourself. I've considered distributing code with an Autotools-based build system before and determined quickly that I am not up to that challenge. (The bigger part of the challenge isn't writing the original input files; it's debugging when a user says his build doesn't work.) But as far as I know, my hand-rolled build system is used correctly by me.

>> checks the width of integers on i386 for projects not caring about that and
>> fails to find installed libraries without telling how it was supposed to
>> find them or how to make it find that library.
>
> no idea what this rant is about.

The second part sounds like my number 1 complaint as a user of Autotools-based packages: 'configure' often can't find my libraries. I know exactly where they are, and even what compiler and linker options are needed to use them, but it often takes a half hour of tracing 'configure' or generated make files to figure out how to force the build to understand the same thing. And that's with lots of experience. The first five times it was much more frustrating.

>> Configuring the build of an autotools program is harder than nescensary;
>> if it used a config file, you could easily save it somewhere while adding
>> comments on how and why you did *that* choice, and you could possibly
>> use a set of default configs which you'd just include.
>
> history shows this is a pita to maintain. every package has its own build
> system and configuration file ...

It's my understanding that autotools _does_ provide that ability (as stated, though I think "config file" may have been meant here as "config.make"). The config file is a shell script that contains a 'configure' command with a pile of options on it, and as many comments as you want, to tailor the build to your requirements.
Re: [ANNOUNCE] util-linux-ng 2.13-rc1
> the maintainers of util-linux have well versed autotool people at their
> disposal, so i really dont see this as being worrisome.

As long as that is true, I agree that the fact that so many autotool packages don't work well is irrelevant. However, I think the difficulty of using autotools (I mean use by packagers), as evidenced by all the people who get it wrong, justifies people being skeptical that util-linux really has that expertise available.

Also, many open source projects are developed by a large diverse group of people, so even if there exist people who can do the autotools right, it doesn't mean they'll be done right. One reason I try to minimize the number of tools/skills used in maintaining packages I distribute is to enable a larger group of people to help me maintain them.
Re: e2fsprogs-1.19 for v. old ext2 ?
> All changes made to ext2 (even journalling) will work with the
> same filesystem.

I can't figure out what this says. Can you elaborate or reword?

I was dismayed just last week to find that Linux 2.0.36 (from a rescue disk) could not mount an ext2 filesystem created recently. The error message complained that the filesystem had a feature that Linux didn't understand. So apparently all is not strictly backward and forward compatible in ext2.
Re: Partition IDs in the New World TM
> OK. s/Linux/Well behaved operating systems that look for
> file system signatures, rather than relying on stupid Partition IDs/

Well then you're still assigning partition IDs to operating systems, and my point was that partition types are not strictly tied to operating systems. Allow me to reword to what you probably meant: have a partition ID that means "generic partition - check signatures within for details." (And then get people who develop file systems for use with Linux, at least, to have a policy of always using that.)

Incidentally, I just realized that the common name "partition ID" for this value is quite a misnomer. As far as I know, it has never identified the partition, but rather described its contents.
Re: rename ops and ctime/mtime
> I quite like the mtime change that XFS does as, finally, the ".." entry
> is rewritten when a directory is moved (if the new parent != old parent).

I don't understand this. Can you explain?
Re: [RFC] sane access to per-fs metadata (was Re: [PATCH] Documentation/ioctl-number.txt)
> How it can be used? Well, say you've mounted JFS on /usr/local
> % mount -t jfsmeta none /mnt -o jfsroot=/usr/local
> % ls /mnt
> stats control bootcode whatever_I_bloody_want
> % cat /mnt/stats
> master is on /usr/local
> fragmentation = 5%
> 696942 reads, yodda, yodda
> % echo "defrag 69 whatever 42 13" > /mnt/control
> % umount /mnt

There's a lot of cool simplicity in this, both in implementation and application, but it leaves something to be desired in functionality. This is partly because the price you pay for being able to use existing, well-worn Unix interfaces is the ancient limitations of those interfaces, like the inability to return adequate error information.

Specifically, transactional stuff looks really hard in this method. If I want the user to know why his "defrag" command failed, how would I pass that information back to him? What if I want to warn him of a filesystem inconsistency I found along the way? Or inform him of how effective the defrag was? And bear in mind that multiple processes may be issuing commands to /mnt/control simultaneously.

With ioctl, I can easily match a response of any kind to a request. I can even return an English text message if I want to be friendly.
Re: File Locking in Linux 2.5
> Solaris man-page of dup() says:

If you read this with the proper lexicon, it does in fact specify the broken-as-designed behavior people are complaining about.

> dup() returns a new file descriptor having the following in
> common with the original open file descriptor fildes:
>
>    Same open file (or pipe).

Locks are associated with files, and consequently with open files, so dup() creates a new file descriptor that is associated with the same locks as the original.

> All locks associated with a file for a given process are
> removed when a file descriptor for that file is closed by
> that process

So when you close one file descriptor for a file, the file's locks are removed, and therefore the locks associated with all of that file's file descriptors are as well.

> I would have thought that dup() creates new file object which
> does not share file state with the original one

I guess it depends on what a "file object" is. The quoted man page doesn't use that term, and I can see it applying to a file descriptor or an open instance, or even to a file. I don't think it's a well defined term.

I don't really believe the man page here. I'll bet you can mount a filesystem twice, and then Solaris sees a single file in it as two different files, and closing the file doesn't cause all of the file's locks to get dropped. I usually use the term "file image" in a case like this instead of "file."
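The underlying model: dup() copies the descriptor, not the open file description behind it, so the two descriptors share one file offset and status flags, while POSIX record locks are owned by the (process, file) pair rather than by either descriptor. The lock-dropping surprise needs two processes to demonstrate, but the shared-offset half is easy to show:

```python
import os, tempfile

fd1, path = tempfile.mkstemp()
fd2 = os.dup(fd1)                  # new descriptor, same open file description

os.write(fd1, b"hello")            # advances the single shared offset
pos = os.lseek(fd2, 0, os.SEEK_CUR)
assert pos == 5                    # fd2's position moved with fd1's write

os.close(fd1)
os.close(fd2)
os.unlink(path)
```

This is why "same open file" in the man page is stronger than "same file": state lives at the open file description, and locks live at an even coarser granularity, the process and the file.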
Re: read_super()
> all ->read_super() should do is read the superblock!

You're taking it kind of literally, aren't you? These days, superblock is a metaphor. Lots of filesystems don't have real superblocks. I think if we were naming ->read_super() today, we'd name it ->mount().

> Ideally, we should let VFS do exclusion between mount/umount and remove
> lock_super() from there. Then it becomes fs-private thing. It's not too
> hard now - most of the stuff it depends on is already in the tree.

I don't think the filesystem driver can do mount/unmount exclusion. What you're trying to serialize is the very existence of the filesystem driver. Once ->write_super() is running, it's too late to lock out another process which might be in the middle of unregistering the filesystem type and eliminating write_super().

But since we're on the topic of filesystem driver coordination of mount and unmount, I think a useful enhancement in this area would be for ->put_super() to be able to refuse the unmount, with a bad return code. Then the concept of the filesystem being busy could be pushed into the VFS layer. In complex filesystems, there may be more to it than just files in use.

> This deadlocks ext3, which wants to call ext3_truncate()
> inside ext3_read_super().

This is the part I don't get. Does ext3_truncate() acquire the superblock lock? If so, that would seem to be the problem: it's a layering violation.

On the other hand, along the same lines of complex stuff being done in ->read_super(), I've run into grief because ->read_super() is expected to go all the way to accessing the root directory and creating an inode and dentry for it. That's a big job on the network filesystem I'm working on, and it's all done under the mount semaphore, so everyone else has to wait for it. If it crashes, that's it for mounting and unmounting until a hard reboot.

On AIX, the same filesystem type doesn't present a problem because the ->read_super() function is split in two. The filesystem can be fully mounted and registered and everything without the root directory having been accessed. The root directory gets accessed the first time someone actually needs to get to a file. In Linux, if we could just have separate ->read_super() and ->read_root(), with the mount semaphore dropped in between, it would probably solve the problem.
read-only mounts
I have discovered, looking at Linux 2.4.2, that the read-only status of a mount is considered in some places to be a matter of file permissions, and in others as something separate from file permissions. So in some cases it is the responsibility of a filesystem object's ->permission routine to check the MS_RDONLY superblock flag and deny write permission, but in other cases FS code checks MS_RDONLY itself. This seems to me inconsistent to the point of surely causing mistakes. Is there a consistent philosophy here that I'm missing?

I noticed the problem when my filesystem driver had the following quirky behavior: I have an easy ->permission that grants write access in spite of the MS_RDONLY flag. When I open(O_RDWR | O_CREAT) a new file on a read-only mount, the open() fails, but the file gets created anyway. open_namei() defers to the filesystem driver for the creation part, but fails the open on its own authority later on.
Re: inode->i_blksize and inode->i_blocks
> Are there any deeper reasons, why
> a) inode->i_blksize is set to PAGESIZE eg. 4096 independent of the actual
>    block size of the file system?

Well, why not? The field tells what is a good chunk size to read or write for maximum performance. If the I/O is done in PAGESIZE cache page units, then that's the best number to use. I suppose in the very first Unix filesystems the field may have meant filesystem block size, which was identical to the highest performing read/write size, and that may account for its name.

> b) the number of blocks is counted in 512 Bytes and not in the actual
>    blocksize of the filesystem?

I can't see how the number of actual blocks would be helpful, especially since, as described above, we don't even know how big they are. We don't even know that they're fixed size or that a concept of a block even exists.

> (is this for historical reasons??)

That would be my guess. Though I can't think of any particular programs that would measure a file by multiplying this number by 512.

In any case, the inode fields are defined as they are because they implement a standard stat() interface that includes these same numbers.
Re: inode->i_blksize and inode->i_blocks
>>> a) inode->i_blksize is set to PAGESIZE eg. 4096 independent of the
>>>    actual block size of the file system?
>
>> If the I/O is done in PAGESIZE(-size) cache page units, then that's
>> the best number to use.
>
> But we already know that from PAGE_SIZE, this seems like a complete
> waste.

OK, but who's "we"? Individual filesystem drivers (some of them, that is) know that the optimum read/write size is PAGESIZE. But does the FS layer? And if the FS layer doesn't, the user space program certainly can't. The ext2 driver sets i_blksize to PAGESIZE, but another driver might set it to something else. FS blindly passes i_blksize up to user space (via stat()).

>> In any case, the inode fields are defined as they are because they
>> implement a standard stat() interface that includes these same
>> numbers.
>
> We can fix things up in cp_old/new_stat if we want.

Just as long as all the information is in the inode. Which I think is what the issue is.

I'm more confused than ever about the i_blocks (filesize divided by 512) field. I note that in both the inode and the stat() result, the filesize in bytes exists, and presumably always has. So why would anyone ever want to know separately how many 512 byte units of data are in the file? FS code appears to explicitly allow for a filesystem driver to denominate i_blocks in other units, but any other unit would appear to break the stat() interface.
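For reference, the stat() contract as it reaches user space: st_size is the byte count, st_blocks is defined in 512-byte units regardless of the filesystem's block size, and st_blksize is only a preferred-I/O-size hint. A quick check (the temp file is illustrative; exact st_blocks values vary with sparseness and allocation policy, so the sketch deliberately doesn't assert one):

```python
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * 8192)
os.close(fd)

st = os.stat(path)
assert st.st_size == 8192          # byte length of the file
assert st.st_blksize >= 512        # "optimal" I/O size hint, often PAGE_SIZE
# st.st_blocks counts 512-byte units of allocated storage; for 8 KiB of
# written data it is typically 16 or more, but sparse files and delayed
# allocation can make it smaller, so no assertion on it here.

os.unlink(path)
```

The fixed 512-byte denomination is exactly why a driver putting anything else in i_blocks would break this interface: user space has no way to learn the unit.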
Re: about BKL in VFS
I think what may have gotten lost in Alexander's detailed reply is the big picture on the BKL in VFS. The issue of the BKL protecting ->lookup is the same as for every other VFS call: a whole bunch of filesystem drivers were designed in a time when there could be only one CPU, and coupled with a non-preemptive kernel, that meant these filesystem drivers could depend on uninterrupted access to data structures and filesystems. When the multiple CPU case was introduced, it was not practical to update every filesystem driver, so the Big Kernel Lock (BKL) was added to give those drivers the uninterrupted access they (may) expect. You may surmise that a "lookup" routine doesn't need such uninterrupted access, but you can never really assume that.

I think an individual filesystem driver that is specifically designed to do the fine-grained locking necessary to tolerate multiple CPUs can just release the BKL and avoid any bottleneck.
Re: about BKL in VFS
Bryan:
>> introduced, it was not practical to update every filesystem driver, so the
>> Big Kernel Lock (BKL) was added to give those drivers the uninterrupted
>> access they (may) expect. You may surmise that a "lookup" routine doesn't
>> need such uninterrupted access, but you can never really assume that.

Al:
> Now, now. BKL _is_ worth the removal. The thing being, "oh, we just take
> BKL, so we don't have to worry about SMP races" is wrong attitude.

Yeah, I agree. I took the question to be "why is the BKL there?" and "can we just remove the lock_kernel() from FS?", not "is the BKL shield the best possible design for Linux?"

I'd like to see the single threaded guarantee to VFS routines revoked: not only the UP side of it, but the non-preemption as well. I have always been taught to assume that anything can happen between any two instructions, or even in the middle of one, unless I explicitly lock against it. An "MP-safe" attribute that a filesystem driver can register for its VFS routines would be a good tool to get there.
Re: about BKL in VFS
> IMO preemptive kernel patches are an
> exercise in masturbation (bad algorithm that can be preempted at any point
> is still a bad algorithm and should be fixed, not hidden)

What does this mean? What is a preemptive kernel patch, what kind of bad algorithm are you contemplating, and what does it mean to hide one? You're apparently referring back to some well known argument, but I'm not familiar with it myself.
Re: about BKL in VFS
> What we ought to do in 2.5.early (possibly - in 2.4) is to
> add ->max_page to address_space. I.e. ->i_size in pages

I don't get it. What would address_space.max_page mean and how would you use it? Obviously, you don't really mean for it to be defined as inode.i_size in pages, since then it would have to be updated in lockstep with i_size and wouldn't buy you anything.

Although I've been all over this code and am actually in the midst of writing a network filesystem driver, I don't understand most of the language in your post, so you may have to use extra detail.
read-only mounts
I posted this earlier, but it was right at the time that linux-fsdevel got swamped with a linux-kernel discussion, so I don't think anyone saw it.

I have discovered, looking at Linux 2.4.2, that the read-only status of a mount is considered in some places to be a matter of file permissions, and in others as something separate from file permissions. So in some cases it is the responsibility of a filesystem object's ->permission routine to check the MS_RDONLY superblock flag and deny write permission, but in other cases FS code checks MS_RDONLY itself. This seems to me inconsistent to the point of surely causing mistakes. Is there a consistent philosophy here that I'm missing?

I noticed the problem when my filesystem driver had the following quirky behavior: I have an easy ->permission that grants write access in spite of the MS_RDONLY flag. When I open(O_RDWR | O_CREAT) a new file on a read-only mount, the open() fails, but the file gets created anyway. open_namei() defers to the filesystem driver for the creation part, but fails the open on its own authority later on.
How to use page cache
A filesystem driver is supposed to be able to use the page cache for file caching without involving the buffer cache, isn't it? I can't find any examples of it, but I heard that was the case. I had a filesystem driver doing just that with Linux 2.4.2, but now the interface it used is no longer available to a loadable kernel module.

How would a driver, having put file write data into a page cache page, get the page onto the dirty page list so that it might be written out to the file later? I am able to do this easily by modifying the base kernel to export __set_page_dirty(), but there must be a better way.

Incidentally, what changed since 2.4.2 is that the lock that protects the page lists was moved from the address_space struct to the global pagecache_lock. My code previously duplicated the function of __set_page_dirty() internally, but since pagecache_lock is not an exported symbol, it can't do so anymore.
Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage
>Just putting up my hand to say "yeah, us too" - we could also make >use of that functionality, so we can grok existing XFS filesystems >that have blocksizes larger than the page size. IBM Storage Tank has block size > page size and has the same problem. This is one of several ways that Storage Tank isn't generic enough to use generic_file_write() and generic_file_read(), so it doesn't. That's not a terrible way to go, by the way. At some point, making the generic interface complex enough to handle every possible filesystem becomes worse than every filesystem driver having its own code. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage
>OOC, have you folks measured any performance improvements at all >using larger IOs (doing multi-page bios?) with larger blocksizes? First, let me clarify that by "larger I/O" you mean a larger unit of I/O submitted to the block layer (doing multi-page bios), because people often say "larger I/O" to mean larger units of I/O from Linux to the device, and the two are only barely coupled. Blocksize > page size doesn't mean multi-page bios as long as VM is still managing the file cache. VM pages in and out one page at a time. To get multi-page bios (in any natural way), you need to throw out not only the generic file read/write routines, but the page cache as well. Every time I've looked at multi-page bios, I've been unable to see any reason that they would be faster than multiple single-page bios. But I haven't seen any experiments. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage
Thanks for the numbers, though there are enough variables here that it's hard to draw any hard conclusions.

When I've seen these comparisons in the past, it turned out to be one of two things:

1) The system with the smaller I/Os (I/O = unit seen by the device) had more CPU time per megabyte in the code path to start I/O, so that it started less I/O. The small I/Os are a consequence of the lower throughput, not a cause. You can often rule this out just by looking at CPU utilization.

2) The system with the smaller I/Os had a window tuning problem in which it was waiting for previous I/O to complete before starting more, with queues not full, and thus starting less I/O. Some devices, with good intentions, suck the Linux queue dry, one tiny I/O at a time, and then perform miserably processing those tiny I/Os. Properly tuned, the device would buffer fewer I/Os and thus let the queues build inside Linux and thus cause Linux to send larger I/Os. People have done ugly queue plugging algorithms to try to defeat this queue sucking by withholding I/O from a device willing to take it. Others defeat it by withholding I/O from a willing Linux block layer, instead saving up I/O and submitting it in large bios.

>Ext3 (writeback mode)
>
>Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s  wsec/s     rkB/s  wkB/s     avgrq-sz  avgqu-sz  await  svctm  %util
>sdc      0.00    21095.60  21.00  244.40 168.00  170723.20  84.00  85361.60  643.90    11.15     42.15  3.45   91.60
>
>We see 21k merges per second going on, and an average request size of
>only 643 sectors where the device can handle up to 1Mb (2048 sectors).
>
>Here is iostat from the same test w/ JFS instead:
>
>Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s  wsec/s     rkB/s  wkB/s      avgrq-sz  avgqu-sz  await    svctm  %util
>sdc      0.00    1110.58   0.00   97.80  0.00    201821.96  0.00   100910.98  2063.53   117.09    1054.11  10.21  99.84
>
>So, in this case I think it is making a difference: 1k merges and a big
>difference in throughput, though there could be other issues.
-- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
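For reference, the avgrq-sz column in the iostat output quoted above is just total sectors per second divided by total requests per second, so the quoted figures can be cross-checked against the other columns. A quick arithmetic sketch (Python used purely for the calculation; the numbers come from the quoted output):

```python
# avgrq-sz (average request size, in sectors) is derived from the other
# iostat columns: (rsec/s + wsec/s) / (r/s + w/s).

def avgrq_sz(rps, wps, rsecps, wsecps):
    """Average request size in sectors, as iostat computes it."""
    return (rsecps + wsecps) / (rps + wps)

# ext3 (writeback) numbers from the quoted output:
ext3 = avgrq_sz(rps=21.00, wps=244.40, rsecps=168.00, wsecps=170723.20)
# JFS numbers from the quoted output:
jfs = avgrq_sz(rps=0.00, wps=97.80, rsecps=0.00, wsecps=201821.96)

print(round(ext3, 1))  # ~643.9 sectors, matching the avgrq-sz column
print(round(jfs, 1))   # ~2063.6 sectors, matching to within rounding
```

So the columns are internally consistent: the JFS test really is issuing requests roughly three times larger, not just merging differently.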
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative
>Or actually we wouldn't even care if stale pages are added as they
>would still be cleared in readpage(). And pages found and uptodate and
>locked simply need to be marked dirty and released again and if not
>uptodate they need to be cleared first.

You do need some form of locking to make sure someone doesn't add a page, update it, and clean it while you're independently initializing the block under it. The cache locks are usually what coordinate this kind of activity, but I think we've established those locks aren't available at this level (inside a pageout). Maybe a block lock or file lock could serve.

I believe modifying a page and its status while it's locked by another process is a violation of page management ethics. I wouldn't dare.

If you do figure something out with direct clearing of the block upon pageout of the first page in it, remember to have some reserved memory for the I/O buffer and bio/bh, because you can't wait for memory inside a pageout.

>Is your driver's source available to look at?

Not easily. An old version is available for download (under GPL) at http://www-1.ibm.com/servers/storage/software/virtualization/sfs/implementation.html . You have to register. I could email a copy (1.2M) to you, but I don't know if it would be worth your time to plow through it.

This (Storage Tank) is a multi-disk shared filesystem (multiple computers access the same disks, but the block maps are kept on a separate metadata server) with multiple block size and page size and copy-on-write snapshots. And the same code works in a wide variety of 2.4 and 2.6 Linux kernels. These things all complicate this area of initializing a new block. And it's 20,000 lines of code not counting the metadata server.

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - loop device
>I did a patch which switched loop to use the file_operations.read/write >about a year ago. Forget what happened to it. It always seemed the right >thing to do.. This is unquestionably the right thing to do (at least compared to what we have now). The loop device driver has no business assuming that the underlying filesystem uses the generic routines. I always assumed it was a simple design error that it did. (Such errors are easy to make because prepare_write and commit_write are declared as address space operations, when they're really private to the buffer cache and generic writer). -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative
>> > > And for the vmscan->writepage() side of things I wonder if it would be
>> > > possible to overload the mapping's ->nopage handler. If the target page
>> > > lies in a hole, go off and allocate all the necessary pagecache pages, zero
>> > > them, mark them dirty?
>> >
>> > I guess it would be possible but ->nopage is used for the read case and
>> > why would we want to then cause writes/allocations?
>>
>> yup, we'd need to create a new handler for writes, or pass `write_access'
>> into ->nopage. I think others (dwdm2?) have seen a need for that.
>
>That would work as long as all writable mappings are actually written to
>everywhere. Otherwise you still get that reading the whole mmap()ped
>area but writing a small part of it would still instantiate all of it on
>disk. As far as I understand this there is no way to hook into the mmap
>system such that we have a hook whenever a mmap()ped page gets written
>to for the first time. (I may well be wrong on that one so please
>correct me if that is the case.)

I think the point is that we can't have a "handler for writes," because the writes are being done by simple CPU store instructions in a user program. The handler we're talking about is just for page faults.

Other operating systems approach this by actually _having_ a handler for a CPU store instruction, in the form of a page protection fault handler -- the nopage routine adds the page to the user's address space, but write-protects it. The first time the user tries to store into it, the filesystem driver gets a chance to do what's necessary to support a dirty cache page -- allocate a block, add additional dirty pages to the cache, etc. It would be wonderful to have that in Linux. I saw hints of such code in a Linux kernel once (a "write_protect" address space operation or something like that); I don't know what happened to it.

Short of that, I don't see any way to avoid sometimes filling in holes due to reads.
It's not a huge problem, though -- it requires someone to do a shared writable mmap and then read lots of holes and not write to them, which is a pretty rare situation for a normal file.

I didn't follow how the helper function solves this problem. If it's something involving adding the required extra pages to the cache at pageout time, then that's not going to work -- you can't make adding pages to the cache a prerequisite for cleaning a page -- that would be Deadlock City.

My large-block filesystem driver does the nopage thing, and does in fact fill in files unnecessarily in this scenario. :-( The driver for the same filesystems on AIX does not, though. It has the write protection thing.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3 writepages ?
>I see much larger IO chunks and better throughput. So, I guess its >worth doing it I hate to see something like this go ahead based on empirical results without theory. It might make things worse somewhere else. Do you have an explanation for why the IO chunks are larger? Is the I/O scheduler not building large I/Os out of small requests? Is the queue running dry while the device is actually busy? -- Bryan Henderson San Jose California IBM Almaden Research Center Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3 writepages ?
>I am inferring this using iostat which shows that average device
>utilization fluctuates between 83 and 99 percent and the average
>request size is around 650 sectors (going to the device) without
>writepages.
>
>With writepages, device utilization never drops below 95 percent and
>is usually about 98 percent utilized, and the average request size to
>the device is around 1000 sectors.

Well, that blows away the only two ways I know that this effect can happen. The first has to do with certain code being more efficient than other code at assembling I/Os, but the fact that the CPU utilization is the same in both cases pretty much eliminates that. The other is where the interactivity of the I/O generator doesn't match the buffering in the device, so that the device ends up 100% busy processing small I/Os that were sent to it because it said all the while that it needed more work. But in the small-I/O case, we don't see a 100% busy device.

So why would the device be up to 17% idle, given that the writepages case makes it apparent that the I/O generator is capable of generating much more work? Is there some queue plugging (I/O scheduler delays sending I/O to the device even though the device is idle) going on?

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3 writepages ?
>Don't you think, filesystems submitting biggest chunks of IO
>possible is better than submitting 1k-4k chunks and hoping that
>IO schedulers do the perfect job ?

No, I don't see why it would be better. In fact, intuitively, I think the I/O scheduler, being closer to the device, should do a better job of deciding in what packages I/O should go to the device. After all, there exist block devices that don't process big chunks faster than small ones.

This starts to look like something where you withhold data from the I/O scheduler in order to prevent it from scheduling the I/O wrongly, because you (the pager/filesystem driver) know better. That shouldn't be the architecture. So I'd still like to see a theory that explains why submitting the I/O a little at a time (i.e. including the bio_submit() in the loop that assembles the I/O) causes the device to be idle more.

>We all learnt thro 2.4 RAW code about the overhead of doing 512bytes
>IO and making the elevator merge all the pieces together.

That was CPU time, right? In the present case, the numbers say it takes the same amount of CPU time to assemble the I/O above the I/O scheduler as inside it.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3 writepages ?
>Its possible that by doing larger
>IOs we save CPU and use that CPU to push more data ?

This is absolutely right; my mistake -- the relevant number is CPU seconds per megabyte moved, not CPU seconds per elapsed second. But I don't think we're close enough to 100% CPU utilization that this explains much.

In fact, the curious thing here is that neither the disk nor the CPU seems to be a bottleneck in the slow case. Maybe there's some serialization I'm not seeing that allows less parallelism between I/O and execution. Is this a single thread doing writes and syncs to a single file?

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3 writepages ?
I went back and looked more closely and see that you did more than add a ->writepages method. You replaced the ->prepare_write with one that doesn't involve the buffer cache, right? And from your answer to Badari's question about that, I believe you said this is not an integral part of having ->writepages, but an additional enhancement.

Well, that could explain a lot. First of all, there's a significant amount of CPU time involved in managing buffer heads. In the profile you posted, it's one of the differences in CPU time between the writepages and non-writepages case. But it also changes the whole way the file cache is managed, doesn't it? That might account for the fact that in one case you see cache cleaning happening via balance_dirty_pages() (i.e. memory fills up), but in the other it happens via pdflush. I'm not really up on the buffer cache; I haven't used it in my own studies for years.

I also saw that while you originally said CPU utilization was 73% in both cases, in one of the profiles I add up at least 77% for the writepages case, so I'm not sure we're really comparing straight across.

To investigate these effects further, I think you should monitor /proc/meminfo, and/or make more isolated changes to the code.

>So yes, there could be better parallelism in the writepages case, but
>again this behavior could be a symptom and not a cause,

I'm not really suggesting that there's better parallelism in the writepages case. I'm suggesting that there's poor parallelism (compared to what I expect) in both cases, which means that adding CPU time directly affects throughput. If the CPU time were in parallel with the I/O time, adding an extra 1.8ms per megabyte to the CPU time (which is what one of my calculations from your data gave) wouldn't affect throughput.
But I believe we've at least established doubt that submitting an entire file cache in one bio is faster than submitting a bio for each page and that smaller I/Os (to the device) cause lower throughput in the non-writepages case (it seems more likely that the lower throughput causes the smaller I/Os). -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Efficient handling of sparse files
>This is very similar to the Windows ability to do a query to
>get the block map of a sparse file. Might be worth looking at
>that interface to see what we can learn.

XDSM (better but incorrectly known by the generic term DMAPI) also has one of those, for use in migrating or backing up sparse files and restoring them to their original sparseness.

I'd resist any interface that exposes implementation details like that. The user program shouldn't know anything about block allocations. On the other hand, I can see the value in exposing the concept of a clear section of file (a hole), as distinct from one filled with zeroes.

I once had to deal with this in a system that would have to transfer mass quantities of zero bytes over a network for sparse files. I found then that the most convenient interface was a new form of the read call. It returned an indicator of whether the offsets being read were clear or filled plus, if filled, the values. If clear, the values are by definition zero. At boundaries between clear and filled sections of the file, it would do a short read. Otherwise, the semantics were pretty much the same as the classic Unix character stream read. My interface didn't have the ability to tell you how far the hole extends without you having to allocate a buffer that big (because you don't know until you do the read whether you're reading a hole or not), but that seems like a reasonable addition.

If someone's expending development effort on exploiting file sparseness, I'd rather see it spent implementing a clear (aka punch) system call first. Or has that been done when I wasn't looking?

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
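Linux did later grow an interface in this spirit: lseek() with SEEK_HOLE/SEEK_DATA, which reports the boundaries between clear and filled regions without exposing block maps, much like the short-read scheme described above. A minimal Python sketch on Linux (whether real holes are reported depends on the filesystem; ones without hole support simply report everything before EOF as data):

```python
import os
import tempfile

# Create a sparse file: 4 bytes of data, then a hole out to 1 MiB.
fd, path = tempfile.mkstemp()
os.write(fd, b"data")
os.ftruncate(fd, 1 << 20)

size = os.fstat(fd).st_size                  # holes count toward st_size
data_start = os.lseek(fd, 0, os.SEEK_DATA)   # first filled byte at/after 0
hole_start = os.lseek(fd, 0, os.SEEK_HOLE)   # first clear byte at/after 0

# On a hole-aware filesystem hole_start lands at the block boundary after
# the written bytes; without hole support it is simply the file size
# (there is always an implicit hole at EOF).
print(size, data_start, hole_start)

os.close(fd)
os.remove(path)
```

Note that the caller learns the extent of a hole without allocating a buffer to read it, which is exactly the ability the read-style interface above lacked.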
Re: Efficient handling of sparse files
>A database or file scanner that must read a lot of data can benefit >from having even a rough idea of the layout of the data on disk. True. There's always room for interfaces that dive into the lower layers for those users who want to be there. (Of course, you end up crossing a line fairly quickly where you shouldn't be pretending to use a filesystem at all and should just use a block disk). But I first want to see an abstract interface where an application can recognize cleared regions of file without actually knowing anything about how the filesystem represents them or what the filesystem does with them. In particular, there's no reason to give up the character stream notion of a file and start talking about blocks just to have visible cleared regions (holes). -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Max mounted filesystems ?
>I cannot seem to increase the maximum number of >filesystems on my Red hat system... What is your evidence of the maximum that you can't increase? (E.g. does something fail? How?) -- Bryan Henderson San Jose California IBM Almaden Research Center Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Max mounted filesystems ?
>>>118 total. When I attempt to mount the 57th one, I
>>>get "Too many mounted Filesystems"
>>
>> Sorry I don't know what the limitations are for non-anonymous filesystems.
>> 57 seems a bit unusual though.
>
>Is that the exact error message?
>Can you post the kernel message log with that message in it?

I don't think he said it was a kernel message. And since the kernel has not traditionally issued a message when failing a mount for any reason, I suspect it's a message from the program doing the mount. But I don't know how a mount program would recognize such a condition. util-linux 'mount' has a message that includes "too many mounted filesystems" in a list of possible reasons a mount may have failed.

By the way, it's always a good idea to use the -t (type) option on util-linux 'mount'. You get better diagnostic information and less arbitrary behavior that way. Without -t, 'mount' tries types until one works and can't tell the difference between a bad guess at the type and a legitimate mount failure.

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
>Hmm, it's a bit confusing that we call both things "reservation".

I think "reservation" is wrong for one of them, and anyone using it that way should stop. I believe the common terminology is:

- choosing the blocks is "placement";
- committing the required number of blocks from the resource pool for the instant use is "reservation";
- the combination of reservation and placement is "allocation."

Obviously, traditional filesystem drivers haven't split placement from reservation, so they don't bother to use those terms. Most delaying schemes delay the placement but not the reservation, because they don't want to accept the possibility that a write would fail for lack of space after the write() system call succeeded.

Even in non-filesystem areas, "allocate" usually means to assign particular resources, while "reserve" just means to make arrangements so that a future allocate will succeed. For example, if you know you need up to 10 blocks of memory to complete a task without deadlocking, but you don't know yet exactly how many, you would reserve 10 blocks and later, if necessary, allocate the actual blocks.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
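The split can be made concrete with a toy block pool in which reserve() only commits a count against free space and a later allocate() does the placement -- the shape delayed-allocation schemes follow. All names here are invented for illustration, not taken from any filesystem's code:

```python
class BlockPool:
    """Toy model separating reservation (counting) from placement (choosing)."""

    def __init__(self, nblocks):
        self.free = set(range(nblocks))  # blocks not yet placed
        self.reserved = 0                # blocks promised but not yet placed

    def reserve(self, n):
        # Reservation: commit n blocks from the pool without choosing which.
        # This is where "no space" is reported, so a later write can't fail.
        if n > len(self.free) - self.reserved:
            raise MemoryError("not enough free blocks to reserve")
        self.reserved += n

    def allocate(self, n):
        # Allocation = a reservation already made + placement done now.
        assert n <= self.reserved, "must reserve before allocating"
        placed = sorted(self.free)[:n]   # trivial placement policy
        self.free -= set(placed)
        self.reserved -= n
        return placed

pool = BlockPool(10)
pool.reserve(4)            # delayed allocation: write() succeeds here
blocks = pool.allocate(2)  # placement happens later, at writeback time
print(blocks, pool.reserved, len(pool.free))  # → [0, 1] 2 8
```

The point of the split is visible in reserve(): space accounting happens at write() time, while the placement policy runs later with more information (e.g. how big the file turned out to be).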
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
>Sounds reasonable. The thing with "reservation" is that people use
>it in daily life with all kinds of meanings,

That's the way it is all over. Normal people are very sloppy in their language. Engineers have to try to narrow the meanings of the common words to avoid totally confusing each other in these complex discussions.

But I think "reserve" in common usage is a lot less ambiguous than you say. I believe when you reserve a seat on an airplane, most of the time it isn't a particular seat. When it is, the airline will call it a "seat assignment," and you get it only after you turn your reservation into a purchased ticket. I've never worked in a restaurant, but I've always assumed that when I make a reservation, even the restaurant doesn't know which table it is until I show up. That way, it can load balance and give people choices when they come in.

>E.g. if we "reserve" the next hundred blocks, so that allocation is
>contiguous, we may want to be able to take them away if some other
>file needs them.

I would not call that a reservation. I did, incidentally, design such a system once, and I called it "pencilled in." I might also call it preliminary placement.

But I agree that reservations can be more or less firm, owing to the fact that sometimes they can be broken, with more or less ease. E.g. you might reserve a megabyte of space for a file, and under pathological conditions still be told when you go to write that there's no space for you and you're screwed. Just like you can get to the restaurant and be told there's no table for you.

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: files of size larger than fs size
>But anyway it's interesting why the resulting sparse
>files have different size on different fs?

That looks like a bug. Assuming you didn't see any seeks or writes fail, the file size on all filesystems should be 2^56 + 4. I suspect this is beyond the maximum file size allowed by the filesystem in some cases, so the write isn't happening, which means you should get a failure return code. In the results you showed, the filesize ends up being a little less than 2^48, which is not an offset you ever wrote to.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
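The expected semantics are easy to demonstrate at the system-call level: a seek past EOF followed by a write leaves the file size at the end offset of the write, with the hole consuming no data blocks. A sketch (using a 2^30 offset rather than 2^56 so it runs anywhere; the offsets are otherwise arbitrary):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
off = 1 << 30                 # seek far past EOF; 2**56 in the original test
os.lseek(fd, off, os.SEEK_SET)
os.write(fd, b"end!")         # 4 bytes at the far offset

st = os.fstat(fd)
print(st.st_size)             # → 1073741828, i.e. off + 4
# st.st_blocks * 512 is the space actually consumed; for a sparse file
# it is far smaller than st.st_size.

os.close(fd)
os.remove(path)
```

Any filesystem where st_size comes out different from off + 4, without the seek or the write having failed, is misbehaving, which is the point being made above.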
Re: files of size larger than fs size
>I found
>that for larger values, your test program is returning -1, but unsigned
>it appears as 18446744073709551615.

You mean you ran it? Then what about the more interesting question of what your filesize ends up being? You say JFS allows files up to 2**52 bytes, so I expect the test case would succeed up through the write at 2**48 and leave the filesize 2**48 + 8. But Max reports seeing 2**48 - 4080.

It's conceivable that the reporting of the filesize is wrong, by the way.

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: files of size larger than fs size
>The problem appears to be mixing calls to lseek64 with calls to fread
>and fwrite.

Oh, of course. I didn't see that. You can't seek the underlying file descriptor of a file that is opened as a stream. This test case uses the fileno() function to mess with the internals of the stream. fseeko64() is the proper function to position a stream.

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
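The same hazard exists in any buffered I/O layer: seeking the descriptor underneath a buffered stream leaves the stream's idea of the file position out of sync with the kernel's. A sketch of the correct pattern, where the stream's own seek (the analog of fseeko64() for C stdio) does the positioning:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "wb") as f:   # buffered stream, like a C stdio FILE
    f.write(b"12345678")      # may sit in the stream buffer for now

    # Wrong: os.lseek(f.fileno(), 4096, os.SEEK_SET) would move the
    # descriptor behind the buffer's back, so the buffered bytes would
    # later be flushed at whatever offset the descriptor happened to be.

    f.seek(4096)              # right: the stream's own seek, which
    f.write(b"tail")          # flushes the buffer before repositioning

size = os.path.getsize(path)
print(size)                   # → 4100: 8 bytes at 0, 4 bytes at 4096
os.remove(path)
```

With the descriptor-level seek instead, the 8 buffered bytes could land at offset 4096 and the size would come out wrong, which is the class of confusion seen in the reported test results.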
Re: mmap question
>I want all my file system's operations to be complete uncached and >synchronous, but I also want to support mmap. >... >What am I doing wrong? Is what I'm trying to do impossible, and if >so, how can I get as close as possible? It looks to me like you're running into the fundamental limitation that the CPU doesn't notify Linux every time you store into a memory location. It does, though, set the dirty flag in the page table, and Linux eventually inspects that flag and finds out that you have stored in the past. At that time, it can call set_page_dirty. Without knowing what properties of not having a cache you were hoping for, I couldn't say what alternative would be closest to this. Hypothetically, if you had a backing storage device that could do memory mapped I/O, you could have mmapped direct I/O. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mmap question
>well, we *could* know ... never map this page writable. have a per-vma >flag that says "emulate writes", and call the filesystem to update >backing storage before returning to the application. Ah yes, you mean, I take it, that the page fault handler would look at the user's program and emulate the faulting store instruction and return to the instruction after it. Very clever. And as long as we're going down that path, we should also consider changing exec() so instead of branching into the program, it just interprets it! - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mmap question
I forgot you were talking about code inside the kernel. In that case, filemap_sync(). - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mmap question
>Is there an existing interface to force it to check if the page is
>dirty

The msync() system call and libc function does that. And then it does the same thing as fsync().

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
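The user-space view of this sequence is visible through Python's mmap module, which wraps the same mmap()/msync() calls: stores into the mapping are plain memory writes the kernel only sees via the page-table dirty bit, and flush() (msync()) forces the dirty pages back to the file. A minimal sketch:

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * mmap.PAGESIZE)   # file must cover the mapped range

with mmap.mmap(fd, mmap.PAGESIZE) as mm:  # shared, writable mapping
    mm[0:5] = b"dirty"   # plain memory stores; no syscall happens here,
                         # the CPU just sets the page-table dirty bit
    mm.flush()           # msync(): check dirty pages and write them back

os.lseek(fd, 0, os.SEEK_SET)
contents = os.read(fd, 5)    # read back through the descriptor
print(contents)              # → b'dirty'

os.close(fd)
os.remove(path)
```

This is the point made above: nothing tells the filesystem about each store, so the application (or eventually the kernel's writeback) has to trigger the dirty-page check explicitly.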
Re: Address space operations questions
>So, semantics of ->sync_page() are roughly "kick underlying storage >driver to actually perform all IO queued for this page, and, maybe, for >other pages on this device too". I prefer to think of it in a more modular sense. To preserve modularity, the caller of sync_page() can't know anything about I/O scheduling. So I think the semantics of ->sync_page() are "Someone is about to wait for the results of the previously requested write_page on this page." It's completely up to the owner of the address space to figure out what would be appropriate to do given that information. I agree that for the conventional filesystem and device types for which this interface was designed, the appropriate response would be to start any queued I/O. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Address space operations questions
>what it >*really* means to be called in sync_page() is that you're being told >that some process is about to block on that page. For what reason, you >can't know from the call alone. Ugh. IOW it barely means anything. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Address space operations questions
>It reflects the fact that the page lock can be held for a variety of >reasons, some of which require you to kick the filesystem and some which >don't. So then what I don't understand is why you would make a call that tells you someone is trying to hold the page lock? Why not a call that tells you something meaningful like, "someone is trying to read this page"? Or "someone is waiting for this page to get clean?" >I introduced the sync_page() call in 2.4.x partly in order to get rid of >all those pathetic hard-coded calls to "run_task_queue(&tq_disk)" That was pathetic all right, and sync_page() would be a clear improvement if it just replaced those modularity-busting I/O scheduling calls. But did it? Were there run_task_queue's every time the kernel waited for page status to change? I thought they were in more eclectic places. >the NFS client itself had to defer actually >putting reads on the wire until someone requested the lock But really, you mean the client had to defer putting reads on the wire until someone was ready to use the data. That suggests a call to ->sync_page in file read or page fault code rather than deep in page management. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Access content of file via inodes
>How do I access/read the content of the files via using inodes
>or blocks that belong to the inode, at sys_link and vfs_link layer?

This is tricky because many interfaces that one would expect to use an inode as a file handle use a dentry instead. To read the contents of a file via the VFS interface, you need a file pointer (struct file), and the file pointer identifies the file by dentry. So you need to create a dummy dentry, which you can do with d_alloc_root(), then create the file pointer with dentry_open(), and finally read the file with vfs_read().

That's for "via inodes." I don't know what "via blocks" means.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
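A rough sketch of that sequence, using 2.6-era interfaces with most error handling omitted (mnt is whatever struct vfsmount you already have in hand; this is an illustration, not a complete kernel function):

```c
/* Sketch only.  Reads 'count' bytes from the start of the file behind
 * 'inode' through the VFS, using a dummy dentry because struct file
 * identifies the file by dentry rather than by inode. */
static ssize_t read_via_inode(struct inode *inode, struct vfsmount *mnt,
			      char __user *buf, size_t count)
{
	struct dentry *dentry;
	struct file *filp;
	loff_t pos = 0;
	ssize_t ret;

	dentry = d_alloc_root(inode);		/* dummy dentry for the inode */
	if (!dentry)
		return -ENOMEM;

	filp = dentry_open(dentry, mnt, O_RDONLY);  /* build the file pointer */
	if (IS_ERR(filp))
		return PTR_ERR(filp);

	ret = vfs_read(filp, buf, count, &pos);	/* ordinary VFS read */
	fput(filp);				/* releases the dentry too */
	return ret;
}
```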
Re: Access content of file via inodes
>What I meant by
>> via blocks is to gain knowledge of the physical blocks used by the inodes
>> and retrieve the content from it directly, by accessing b_data.
>
>The problem with that approach is that some filesystems may store part
>of the file outside of a complete block.

There's an even more basic problem with this approach: the question is specifically about the filesystem-type-independent layer above the VFS interface. At this layer, you don't even know that there is a block device involved. And if you do, you don't know that the filesystem driver uses the buffer cache to access it. And if you do know that it uses the buffer cache, you don't know that the file data you're looking for is presently in the buffer cache, or how to get it there if it isn't.

If you believe in the layering at all, the only interface you can consider at this layer for getting at file data is VFS ->read.

--
Bryan Henderson
San Jose California
IBM Almaden Research Center
Filesystems
Re: Address space operations - >bmap
>We are about to start implementing a fs where data can move around the
>device and so a physical block address is not really useful. I have
>understood from other postings to this list that reiserfs and ntfs
>don't implement this method so I suppose we'll do the same. I'll just
>find some nice error to return.

It's appropriate only for the most classic of filesystems, really. It was always a layering violation, but it is handy for hackish things.

Interfaces that expose block addresses are in the same boat as all those fsstat fields: block size, blocks used, blocks free, inodes used, inodes free. They make sense for the original Unix File System, but they become harder to give meaning to with every new generation.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: NFS4 mount problem
>We already have compat_sys_mount that treats the mount data for smbfs and
>ncpfs specially, so you could you add an nsfv4 specific bit there?

Do we really want to pile filesystem-type-specific stuff into fs/compat.c? It's bad enough that it's there for smbfs and ncpfs (and similar stuff for the NFS server). It's only going to get worse. fs/compat.c is fine for interfaces implemented by fs/ code, but the 32/64-bit translations for other interfaces ought to be done by the modules that know those interfaces.

A mount option structure that contains addresses should contain information as to whether it's in 32-bit-address format or 64-bit-address format. The nfsv4 read_super method can use that to translate its own mount options.

Another option would be for Linux to pass that information (essentially, whether the mount() system call is being handled by sys_mount() or compat_sys_mount()) as another argument to read_super. This would allow better backward compatibility with user space binaries, if there are already 32-bit and 64-bit binaries using indistinguishable mount option structures.

The same issue, by the way, applies to ioctls, some of which have an argument which is the address of a block of memory that contains other addresses. fs/compat.c approaches these in a more filesystem-type-independent way than it does mount(), but still not independently enough.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: NFS4 mount problem
>Make a ->compat_read_super() just like we have a ->compat_ioctl()
>method for files, if you want to suggest a solution like what
>you describe.

Even better.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: NFS4 mount problem
>On Fri, Apr 15, 2005 at 01:22:59PM -0700, David S. Miller wrote:
>>
>> Make a ->compat_read_super() just like we have a ->compat_ioctl()
>> method for files, if you want to suggest a solution like what
>> you describe.
>
>I don't think we should encourage filesystem writers to do such stupid
>things as ncfps/smbfs do. In fact I'm totally unhappy thay nfs4 went
>down that road.

Which road is that?

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: NFS4 mount problem
>mount() is not a documented syscall. The binary formats for filesystems
>like NFS are only documented inside the kernels to which they apply.

What _is_ a documented system call? Linux is famous for not having documented interfaces (or, put another way, for not distinguishing between an interface you can read in an official document and one you discover by reading kernel source code). But of all the interfaces in Linux, the system call interface is probably the most accepted as one a user of the kernel can rely on.

I don't think a filesystem driver designer should expect mount options to be private to one particular user space program. Especially one that isn't even packaged with the driver.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: Lilo requirements (Was: Re: Address space operations questions)
>- unit of disk space allocation for the kernel image file is
>  block. That is, optimizations like UFS fragments or reiserfs tails are
>  not applied, and
>
>- blocks that kernel image is stored into are real disk blocks (i.e.,
>  there is a way to disable "delayed allocation"), and
>
>- kernel image file is not relocated, i.e., data are not moved into
>  another blocks on the fly.

The filesystem also has to implement the ioctl that tells you what blocks a file is in (which kind of implies much of the above). That is, unless the LILO installer makes special provisions, as it does for Reiserfs.

To be really exact, it's OK for the blocks to move, as long as they don't move so subtly that the user doesn't know to rerun the LILO installer. E.g. you can move the blocks of the kernel file if someone overwrites it.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: NFS4 mount problem
>> Architecture-dependent blob passed to mount(2) (aka nfs4_mount_data).
>> If you want it to be a blob, at least have a decency to use encoding
>> that would not depend on alignment rules and word size. Hell, you
>> could use XDR - it's not that nfs would need something new to handle
>> it. Or, better yet, use a normal string.
>
>Mount doesn't appear to permit a big enough blob though. It has a hard limit
>of PAGE_SIZE.

That seems to me to be orthogonal to Al's point. You could make an architecture-independent format for that page that still contains addresses in user space of additional information, which would presumably also have an architecture-independent format.

But why is mount() special here? It's ancient tradition for Linux system calls to take as parameters, and return as results, in-memory structures that depend on local word size and endianness. Lots of them do.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: NFS4 mount problem
>(1) The kernel is returning EFAULT to the 32-bit userspace; this implies that
>    userspace is handing over a bad address. It isn't, the kernel is
>    malfunctioning as it stands.
>...
>Either the kernel should return ENOSYS for any 32-bit mount on a 64-bit kernel
>or it must support it fully.

So this point is just about the error code? If so, where do you get ENOSYS? A more usual errno for when a particular filesystem type can't be mounted is ENODEV. Choosing errnos is a pretty whimsical thing anyway, since there are so many more kinds of errors than the authors of the errno space contemplated, but EFAULT and ENOSYS are two that have a pretty solid definition. ENOSYS is for when an entire system call type is missing.

I'm not sure we can complain about EFAULT, though, because you really are supplying an invalid address. You're doing it because you're using the wrong mount option format, so what you think of as 4 bytes of flags followed by 4 bytes of address is really 8 bytes of address.

I do understand the more important issue of there being a kernel that understands both mount option formats; but since you enumerated the errno issue, I wanted to comment on that one independently.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: NFS4 mount problem
>My concern is that we are slowly but surely building up a bigger
>in-kernel library for parsing the binary structure than it would take to
>parse the naked mount option string.
>
>...
>If people really do need a fully documented NFS mount interface, then
>the only one that makes sense is a string interface. Looking back at the
>manpages, the string mount options are the only thing that have remained
>constant over the last 10 years.
>
>We're already up to version 6 of the binary interfaces for v2/v3, and if
>you count NFSv4 too, then that makes 7.

I don't know the NFS mount option format, but I'm having a hard time imagining how a string-based format can take less code to parse and be more forward compatible than a binary one. People don't even use the term "parse" for binary structures, because parsing typically means turning strings into binary structures.

Having 6 separate formats isn't the only way to have an evolving binary interface. People do make extensible binary formats.

>There are only 2 reasons for doing
>that parsing in userland:
>
> 1) DNS lookups
> 2) Keeping the kernel parsing code small

I personally almost never worry about the number of bytes of code, but I worry a lot about its simplicity. User space code is less costly to develop and less risky to make a mistake in. I would add:

 3) Keeping the kernel parsing code simple

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: Lazy block allocation and block_prepare_write?
>> routines will fail - since they assume that page->private represents
>> bufferheads. So we need a better way to do this.
>
>They are not generic then. Some file systems store things completely
>different from buffer head ring in page->private.

I've seen these instances (and worked around them, because I maintain filesystem code that does in fact use private pages but does not use the buffer cache to manage them). I've always assumed they're just errors: corners that were cut in the original project to abstract out the buffer cache. Anyone who has a problem with them should just fix them.

>I think that one reasonable way to add generic support for journalling
>is to split struct address_space into two objects: lower layer that
>represents "file" (say, struct vm_file), in which pages are linearly
>ordered, and on top of this vm_cache (representing transaction) that
>keeps track of pages from various vm_file's. vm_file is embedded into
>inode, and vm_cache has a pointer to (the analog of) struct
>address_space_operations.
>
>vm_cache's are created by file system back-end as necessary (can be
>embedded into inode for non-journalled file systems). VM scanner and
>balance_dirty_pages() call vm_cache operations to do write-out.

That looks entirely reasonable to me, but it should be combined with divorcing address spaces from files. An address space (or the "lower layer" above) should be a simple virtual memory object, managed by the virtual memory manager. It can be used for a file data cache, but also for anything else you want to participate in system memory management / page replacement. We're already practically there. Address spaces are tied to files only in these ways:

1) The code is in the fs/ directory. It needs to be in mm/.

2) The "host" field is a struct inode *. It needs to be void *.

3) In a handful of places (and they keep moving), memory manager code dereferences 'host' and looks in the inode.
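Schematically, the proposed split might look like this (names taken from the quoted proposal, field types purely illustrative; none of these declarations exist in the kernel):

```c
/* Illustrative sketch only -- not kernel code. */
struct vm_file {                        /* lower layer: a linear page space */
	struct radix_tree_root pages;   /* pages ordered by file offset */
	struct vm_cache *cache;         /* who is responsible for write-out */
};

struct vm_cache {                       /* upper layer: e.g. a transaction */
	const struct vm_cache_operations *ops;  /* analog of a_ops */
	struct list_head files;         /* vm_files this cache spans */
};
```

With the lower layer's "host" reduced to a void *, the same object could back a file cache or any other pageable memory the VM should manage.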
I know these are trivial connections, because I work around them by supplying a dummy inode (and sometimes a dummy superblock) with a few fields filled in. (Incidentally, _I_ am actually using address spaces for file caches; I just can't tie them to the files in the traditional way; the cache exists even when there are no inodes for the file).

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: [RFC] User CLONE_NEWNS permission and rlimits
>In essense, I was
>thinking of splitting up the concepts of 1) accessing the filesystem on
>the HDD/device and 2) setting up a namespace for accessing the files
>into two separate concepts

I've been crusading for years to get people to understand that a classic Unix mount is composed of these two parts, and that they don't have to be married together. (1) is called creating a filesystem image and (2) is called mounting a filesystem image.

(2) isn't actually "setting up" a namespace. There's one namespace. Mounting is adding the names in a filesystem to that namespace, and thereby making the named filesystem objects accessible.

The two pieces have been slowly divorcing over the years. We now have a little-used ability to have a filesystem image exist without being mounted at all (you get that by forcibly unmounting a filesystem image that has open files; the unmount happens right away, but the filesystem image continues to exist until the last file is closed). We also have bind mounts, which add to the namespace without creating a new filesystem image. I would like someday to see the ability to create a filesystem image without ever mounting it, and to access a file in it without ever adding it to the master file namespace.

>bringing up 2) completely in the userspace.

That part's another issue. The user-controls-his-namespace aspect of it has been commented on at length in this and another current thread.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
>> But that shouldn't be the only option - because it would be horrible
>> to use. If I login on multiple terminals, I normally want to mount
>> filesystems in /home/jamie/mnt on one terminal, and use them on another.
>
>And when you log in on several terminals you usually want same $PATH.
>You don't do that by sharing VM between shell processes, do you?

I share Al's view, and would expand on it: you'd _like_ to be able to add something to your namespace once and have it show up in multiple processes' namespaces, but you wouldn't expect it, because Unix has been horrible to use in that way forever. I am frequently frustrated when I decide to change my environment by setting an environment variable, shell variable, or alias, and I have to do it separately in every existing shell. And forget about the background jobs. But at least it's consistent. And there are other times when I exploit the fact that I can set something differently in different shells of the same user.

We do have a few areas where a group of processes can share the same kernel state, but it's always based on common ancestry. It would take a major new concept to have a different kind of group of processes for namespace purposes, and then we probably wouldn't want to base it on uid, because uid means other things already. Why tie them together?

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
>How about making namespace's as first class objects with some associated
>name or device in the device tree having owner/permissions etc. any
>process which forks off a namespace shall create the
>device node for the namespace. If some other process wants to use
>the same namespace, it can do so by attaching itself to the namespace
>dynamically? Offcourse children processes inherit the same namespace.

For the issues being discussed here, I don't think that's materially different from what we started with; it has the same issue concerning whether a user should be allowed to change his namespace and whether a process's namespace should change automatically when another process does something.

Here's one more proposal, kind of a compromise among various previous ones:

- When you mount(), you say whether the names should be visible by default or not. It takes system privilege to make them visible by default, but an ordinary user can unconditionally mount a willing filesystem, invisible by default, over a directory he's permitted to modify.

- A process can explicitly request to see an invisible-by-default mounted filesystem. Anyone can do this, but permissions on the root directory of the mount determine whether he can actually see anything.

- A process inherits the parent's namespace (i.e., sees the mounts the parent does).

This accomplishes:

- not much of a philosophical break from where we are now;
- users can mount their own stuff without system privilege;
- no one, not even a fully permitted administrative process, sees user junk by default;
- setuid programs see standard files where the system administrator put them;
- setuid programs see user files where the user put them;
- multiple processes, with or without the same uid, can see user-mounted files if they want;
- a process can opt not to see user-mounted files, even if it has the same uid as processes that do.
I'm not saying how I would implement this; there's enough discussion over the desired result that I thought we should start there.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
>How would you request to make the mountpoint visible from _any_
>program. It's not acceptable to expect every program to include a
>menu, command, etc. to be able to modify the visibility of
>mountpoints.

OK, I overlooked the problem of having to add commands to the shell and everything else. While there's plenty of precedent for this style (current directory, ulimits, umask), I wouldn't like to extend it, even to adding a command to Bash. But it could follow the 'nice' and 'renice' model.

>Would it not be better if you could specify the visibility policy when
>mounting? Something simple like the user-group-other permission
>model would do nicely. That would also have the advantage of being
>bound to the mountpoint, not the process.

I just don't think that gives you enough policy flexibility. If processes can control visibility on a per-process basis, independent of the mount action, they can use a much greater variety of policy, and do it in user space.

As for user-group-other, let me first point out that this whole namespace discussion started when a design based on actual file permission bits, but not a true implementation of Unix security (root didn't get carte blanche), was found unpalatable by some. So, as you say, it would be something _like_ the permission model, not a part of it. We've been straining against the limitations of user/group/other for decades; sophisticated systems don't even use them for file permissions. So I hesitate to tie anything else to them.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
>That assumes that everyone has the same stuff in the same places. I.e.
>that there is a universal tree with different subset hidden from different
>processes. But that is obviously a wrong approach - e.g. it loses ability
>to bind different stuff on the same place in different namespaces.

Aren't you trying to boil another egg in my pot? In Linux today, everyone (every process on the same Linux system, that is) has the same stuff in the same place. I'm trying to propose an incremental improvement, and relaxing that restriction isn't part of it. The only change would be that some processes wouldn't have some stuff in _any_ place (either because they didn't ask to see a particular mount, or because they did and it covered up something else).

>IOW, notion that every directory has its "real" absolute pathname
>(and that's what your approach boils down to) won't match the reality
>anyway.

I'm not sure which reality you're talking about. I don't think a directory has a real absolute pathname, because I think the person who mounts the filesystem that contains it chooses part of its absolute pathname for the lifetime of the mount. But as between multiple processes on the same system at the same time, yes, the directory has one name. (The statements above have to be modified for chroot, by the way.)

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
>Well I am not aware of issues that can arise if a user is allowed to
>change to some namespace for which it has permission to switch.

I think I misunderstood your proposal.

>A user 'ram' creates a namespace 'n1' with a device node /dev/n1 having
>permission 700 owned by the user 'ram'. The user than tailors his
>namespace with a bunch of mount/umount/binds etc to meet his
>requirement.

How does that address the setuid problem: that a setuid program is installed with the expectation that when it runs, certain names will identify certain files (e.g. /etc/shadow), but also that certain other names will identify a file of the invoker's choosing?

>Trying to understand your proposal to see how it could be used to solve
>the problem faced by the FUSE project. Are you trying to use a single
>namespace with invisible mounts capability?

Essentially. It's a compromise: a user can customize his namespace, but only within limits that preserve the integrity of the system.

Technically, we have to admit it's not one namespace today, nor with invisible mounts. Because of the way mounts cover up mountpoints, it's technically possible for two processes to see different files under the same name, if one opened the directory before a mount and the other after. "Mounting over" is a curse.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call
>It would still not work for ftp-server style programs,

True. Users might want the mounts to show up to an ftp server or not, and this handles only "not."

>If used in conjuction with CLONE_NEWNS it would have all the needed
>flexibility.

I don't see how. What if my policy is that processes with a certain process name (command) see the mount? What if my policy is that users in a certain filesystem ACL can see it? That's the kind of flexibility you can't get if the policy is set up via the mount() system call.

>But the non-sophisticated case is by far the most abundant. And for
>that the traditional UNIX permission modell is not only good enough,
>it is in fact _better_ than any sophisticated access control mechanism
>because of it's _simplicity_.

Absolutely. And that's why I speak of flexibility: let the simple users have their simple U-G-O and the more creative ones do something more complex.

I'm not opposed, by the way, to an implementation that just does U-G-O (or even just U) if it's done in a way amenable to future extension.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
Re: share/private/slave a subtree - define vs enum
I wasn't aware anyone preferred defines to enums for declaring enumerated data types. The practical advantages of enums are slight, but as far as I know, the practical advantages of defines are zero. Isn't the only argument for defines "that's what I'm used to"?

Two advantages of the enum declaration that haven't been mentioned yet, and that help me significantly:

- If you have a typo in a define, it can be really hard to interpret the compiler error messages. The same typo in an enum gets a pointed error message referring to the line that has the typo.

- Gcc warns you if a switch statement doesn't handle every case. I often add an enumeration and Gcc lets me know where I forgot to consider it.

The macro language is one of the most hated parts of the C language; it makes sense to try to avoid it as a general rule.
Re: share/private/slave a subtree - define vs enum
>If it's really enumerated data types, that's fine, but this example was
>about bitfield masks.

Ah. In that case, enum is a pretty tortured way to declare it, though it does have the practical advantages over define that have been mentioned, because the syntax is more rigorous. The proper way to do bitfield masks is usually C bit field declarations, but I understand that tradition works even more strongly against using those than against using enum to declare enumerated types.

>there is _nothing_ wrong with using defines for constants.

I disagree with that; I find practical and, more importantly, philosophical reasons not to use defines for constants. I'm sure you've heard the arguments; I just didn't want to let that statement go uncontested.
Re: share/private/slave a subtree - define vs enum
>I don't see how the following is tortured:
>
>enum {
>	PNODE_MEMBER_VFS = 0x01,
>	PNODE_SLAVE_VFS = 0x02
>};

Only because it's using a facility that's supposed to be for enumerated types for something that isn't. If it were a true enumerated type, the codes for the enumerations (0x01, 0x02) would be quite arbitrary, whereas here they must fundamentally be integers whose binary representation has exactly one 1 bit (because, as I understand it, these are used as bitmasks somewhere).

I can see that this paradigm has practical advantages over using macros (or a middle ground: integer constants), but only as a byproduct of what the construct is really for.
Re: What happens to pages that failed to be written to disk?
>On Thu, 28 Jul 2005, Andrew Morton wrote:
>> Martin Jambor <[EMAIL PROTECTED]> wrote:
>> >
>> > Do filesystems try to relocate the data from bad blocks of the
>> > device?
>
>Only Windows NTFS, not others AFAIK (most filesytems can mark them during
>mkfs, that's all).
>
>> Nope. Disks will do that internally. If a disk gets a write I/O error
>> it's generally dead.
>
>That's what I thought also for over a decade (that they are basically dead
>soon) so originally I disabled NTFS resizing support for such disks (the
>tool is quite widely used since it's the only free, open source NTFS
>resizer).
>
>However over the last three years users convinced me that it's quite ok
>having a few bad sectors

There's a common misunderstanding in this area. First of all, Andrew and Szakacsits are talking about different things. Szakacsits is saying that you don't have to throw away your whole disk because of one media error (a spot on the disk that won't hold data). Andrew is saying that if you get an error when writing, the disk is dead, and the reasoning goes that if it were just a media error, the write wouldn't have failed: the disk would have relocated the sector somewhere else and succeeded.

Szakacsits is right. Andrew is too, but for a different reason. A normal disk doesn't give you a write error when a media error prevents writing the data. The disk doesn't know that the data it wrote did not actually get stored. It's not going to wait for the disk to come around again and try to read the data back to verify it. And even if it did, a lot of media errors cause the data to disappear after a short while, so that wouldn't help much.

So if a write fails, it isn't because of a media error; i.e., it can't be fixed by relocation. The write fails because the whole drive is broken: the disk won't turn, a wire is broken, etc. (The drive relocates a bad sector when you write to it after a previously failed read, i.e., after data has already been lost.)
As Andrew pointed out, write errors are becoming much more common these
days because of network storage. The write fails because the disk isn't
plugged in, the network switch isn't properly configured, the storage
server isn't up and running yet, and a bunch of other fairly common
problems.

What makes this really interesting in relation to the question of what
to do with these failed writes is not just that they're so common, but
that they're all easily repairable. If you had a few megabytes stuck in
your cache because the storage server isn't up yet, it would be nice if
the system could just write them out a few seconds later when the
problem is resolved. Or if they're stuck because the drive isn't
properly plugged in, it would be nice if you could tell an operator to
either plug it in or explicitly delete the file. But the memory
management issue is a major stumbling block.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: mount behavior question.
I don't know enough about shared subtrees to have an opinion on what
should happen with those, but you fundamentally asked about a perceived
weirdness in existing Linux code, and I do have an opinion on that
(which is that there's no weirdness).

>On analysis it turns out the culprit is the current rule which says
>'expose the most-recent-mount and not the topmost mount'

I don't think the current rule is "expose the most-recent-mount." I see
it as "expose the topmost mount." I think the issue is what "mount F
over directory D" means. Does it mean to mount F immediately over D, in
spite of anything that might be stacked above D right now? Or does it
mean to throw F onto the stack which is currently sitting over D? Your
analysis assumes it's the former, whereas what Linux does is consistent
with the latter.

Neither of them actually makes sense. Mounting over "." simply doesn't
make sense. Mount is a namespace operation. "Mount over D" says, "when
someone looks up name D, ignore what's really in the directory and
instead give him this other filesystem object." "Mount over /mnt/cdrom"
doesn't mean mount over the directory /mnt/cdrom. It means mount under
the name "cdrom" in the directory /mnt.

So "mount over '.'" means any future lookup of "." in that directory
should hyperjump to the other mount. That's clearly not what anyone
wants, so mount ought to recognize the special nature of the "."
directory entry and not allow mounts over it.

If you did that, and made mount into the namespace operation it's meant
to be, there would be no such thing as inserting a mount into the
stack, since you have no way to refer to the covered directory -- it's
no longer in the namespace.

I have no idea if that clarifies the shared subtree dilemma, but you
ask if there's any pressing need for the current behavior, and I would
have to say no, because a) neither behavior has any business existing;
and b) I have a hard time imagining anyone depending on it.
--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: mount behavior question.
>> Does it mean to mount F immediately over D, in spite of anything that
>> might be stacked above D right now? Or does it mean to throw F onto the
>> stack which is currently sitting over D? Your analysis assumes it's the
>> former, whereas what Linux does is consistent with the latter.
>
>In fact those two are indistinguishable. What linux does is an
>internal implementation detail.

Then you must have misunderstood what I meant to say, because I didn't
touch on Linux implementation at all; I'm talking only about what a
user sees (distinguishes).

I say a user perceives a stack of mounts over a directory entry D. A
lookup sees the mount which is on top of the stack. One could
conceivably 1) add a mount to the middle of that stack -- above D but
below everything else, such that it isn't visible until everything
above it gets removed, or 2) add the mount to the top of the stack so
it's visible now.

>The semantics are simple: if you
>mount over a directory, that mount will be visible (no matter what was
>previously visible) on lookup of that directory.

So in my terms, Linux adds to the top of the stack, not to the middle.
Note that this is a stronger statement than what you say above, because
it tells you not only that the mount is visible now, but when it will
be visible in the future as people do mounts and unmounts.

>Well, mounting over '.' may not be perfect in the mathematical sense
>of namespace operations, but it does make some practical sense. I bet
>you anything that some script/tool/person out there depends on it.

It wouldn't surprise me if someone is depending on mount over ".". But
I'd be surprised if someone is doing it to a directory that's already
been mounted over (such that the stacking behavior is relevant). That
seems really eccentric.
--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: mount behavior question.
>Bryan, what would you expect the behavior to be when somebody mounts on
>a directory that is already mounted over?

Well, I've tried to duck that question. I said I don't think it's
meaningful to mount over a directory; that one actually mounts at a
name. And that Linux's peculiar "mount over '.'" (which is in fact
mounting over a directory and not at a name) is weird enough that there
is no natural expectation of it except that it should fail.

But if I had to merge "mount over '.'" into as consistent a model as
possible with one of the two behaviors we've been discussing, I'd say
that "." stands for the name by which you looked up that directory in
the first place (so in this case, it's equivalent to "mount ... /mnt").
And that means I would expect the new mount to obscure the already
existing mount.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: mount behavior question.
>One problem with 1) [mounting into the middle of a mount stack]
>is that it breaks the assumption that a 'mount X;
>umount X' pair is a no-op.

A very good point. Since unmounts are always from the top of the stack,
for symmetry mounts should be there too.

Here's another tidbit of information I just verified: umount of "."
unmounts from the top of the stack, as opposed to unmounting the stuff
you would see if you did "ls .". So this is all consistent.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
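The symmetry argument can be made concrete with a toy model of a mount
stack (purely illustrative, not kernel code): if mount pushes onto the
top and umount pops from the top, then 'mount X; umount X' is always a
no-op, which is exactly the property that inserting into the middle of
the stack would break.

```c
#include <assert.h>
#include <string.h>

#define MAX_STACK 8

/* Toy model: a stack of mounts covering one directory entry. */
struct mount_stack {
    const char *fs[MAX_STACK];
    int depth;
};

/* "mount F over D": push onto the top of the stack. */
static void do_mount(struct mount_stack *s, const char *fs)
{
    assert(s->depth < MAX_STACK);
    s->fs[s->depth++] = fs;
}

/* "umount D": always removes the top of the stack. */
static void do_umount(struct mount_stack *s)
{
    assert(s->depth > 0);
    s->depth--;
}

/* Lookup of the name sees whatever is on top. */
static const char *lookup(const struct mount_stack *s)
{
    return s->depth ? s->fs[s->depth - 1] : "underlying-dir";
}
```

With these rules, whatever was visible before a mount is visible again
after the matching umount; a mid-stack insert would have no such
guarantee.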
Re: [RFC] atomic open(..., O_CREAT | ...)
>Intents are meant as optimisations, not replacements for existing
>operations. I'm therefore not really comfortable about having them
>return errors at all.

That's true of normal intents, but not of what are called intents here.
A normal intent merely expresses an intent, and it can be totally
ignored without harm to correctness. But these "intents" were designed
to be responded to by actually performing the foreshadowed operation
now -- irreversibly.

Linux needs an atomic lookup/open/create in order to participate in a
shared filesystem and provide a POSIX interface (where "shared
filesystem" means a filesystem that is simultaneously accessed by
something besides the Linux system in question). Some operating systems
do this simply with a VFS lookup/open/create function. Linux does it
with this intents interface. It's hard to merge the concepts in code or
in one's mind, which is why we're here now. A filesystem driver that
needs to do atomic lookup/open/create has to bend over backwards to
split the operation across the three filesystem driver calls that Linux
wants to make.

I've always preferred just to have a new inode operation for
lookup/open/create (mirroring the POSIX open operation, used for all
opens if available), but if enough arguments to lookup can do it,
that's practically as good. But that means returning final status from
lookup, and not under any circumstance proceeding to create or open
when the filesystem driver has said the entire operation is complete.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [RFC] atomic open(..., O_CREAT | ...)
>Have you looked at how we're dealing with this in NFSv4?

No.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: GFS, what's remaining
I have to correct an error in perspective, or at least in the wording
of it, in the following, because it affects how people see the big
picture in trying to decide how the filesystem types in question fit
into the world:

>Shared storage can be more efficient than network file
>systems like NFS because the storage access is often more efficient
>than network access

The shared storage access _is_ network access. In most cases, it's a
Fibre Channel/FCP network. Nowadays, it's more and more common for it
to be a TCP/IP network just like the one folks use for NFS (but
carrying iSCSI instead of NFS). It's also been done with a handful of
other TCP/IP-based block storage protocols.

The reason the storage access is expected to be more efficient than the
NFS access is that the block access network protocols are supposed to
be more efficient than the file access network protocols. In reality,
I'm not sure there really is such a difference in efficiency between
the protocols. The demonstrated differences in efficiency, or at least
in speed, are due to other things that differ between a given new
shared block implementation and a given old shared file implementation.

But there's another advantage of shared block over shared file that
hasn't been mentioned yet: some people find it easier to manage a pool
of blocks than a pool of filesystems.

>it is more reliable because it doesn't have a
>single point of failure in form of the NFS server.

This advantage comes not from its being shared (block) storage, but
from its being a distributed filesystem. There are shared storage
filesystems (e.g. IBM SANFS, ADIC StorNext) that have a centralized
metadata or locking server that makes them unreliable (or unscalable)
in the same ways as an NFS server.
--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems
>The idea behind the cloneset is that most of the blocks (or files)
>do not change in either source or target. This being the case, it's
>only necessary to update the changed elements. This means updates are
>incremental. Once the system has figured out what it needs to update,
>it's usable, and if you access an element that should be updated you
>will see the correctly updated version -- even though background
>resyncing is still in progress.

I still can't tell what you're describing. With RAID1 as well, only
changed elements ever get updated. I have two identical filesystems,
members of a RAIF set. I change one file. One file in each member
filesystem gets updated, and I again have two identical filesystems.
How would a cloneset work differently, and how would it be better?

>This type of logic is great for backups.

Can you give an example of using it for backup?

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems
>Although, it is not possible with the current code, it should be
>possible to do via failing the branches. First, you fail the branch
>intended for backups and it becomes a backup copy. Later you can
>"unfail" the same branch and fail the newer branch to start the
>on-line recovery. If you enable atime updates on these lower file
>systems, incremental (delta) updates should not be a problem.

So I guess you're saying that what you have now doesn't have the
ability to recover from a temporary absence of a member by updating
just the areas that changed while it was absent. Given how complex the
path to one of these member filesystems might be, and how big a
filesystem can be, I would think that's pretty important for making
RAIF practical.

Actually getting to the cloneset-like thing is a step further, though,
because it doesn't have the instantaneous resync property -- if you
fail a branch while it's being resynced, you can't then access that
branch and expect to get current data.

But I didn't actually understand, "Later you can 'unfail' the same
branch and fail the newer branch to start the on-line recovery," so
maybe you're talking about something different. I would think that if
you fail the only branch that has current data on it (the "newer
branch"?), recovery would be pretty much over.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems
>A cloneset is only synchronized at the point in time that you tell it
>to resync. The source and target fs are usable independently. When you
>resync, the target is reset to be identical to the source at the point
>in time of the sync. It's also immediately usable -- the sync and
>access to the source and target are coordinated so users of the target
>see the correct data, even if the sync is still running in background.
>
>This allows things like:
>
>...

These applications sure seem like a better fit for ordinary snapshots.
It looks like with the cloneset, there's a whole superfluous copy of
the filesystem, whereas with a snapshot, you have to have storage space
and I/O time only for data that changes after the snapshot.

I'm sure I could dream up an application for this -- maybe you want
that second copy as a backup, or it gives you additional data transfer
capacity. I just don't see the panacea so far.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: stacked filesystem cache waste
>Every stackable file system caches the
>data at its own level and copies it from/to the lower file system's
>cached pages when necessary. ... this effectively
>reduces the system's cache memory size by two or more times.

It should not be that bad with a decent cache replacement policy; I
wonder if, in observing the problem (which you corrected in the various
ways you've described), you got some insight into what exactly was
happening.

In the classic case of multiple caches, where each cache has a fixed
size (example: cache in the disk drive + cache in the operating
system), the caches tend to contain different data. The most frequently
accessed data is in the near cache and the less frequently accessed in
the far cache (that's because frequent accesses to a piece of data are
always near cache hits, so the far cache never sees them and considers
that data once-only). In the stacked filesystem case, it should be even
better because it's all one pool of memory. The far cache should shrink
down to nothing, since anything that might have been a hit in that
cache is a hit in the near cache first.

There are certainly simplistic cache replacement algorithms, and
specific workloads, that defeat that. Straight LRU with lots of
once-only accesses would tend to generate twice as much cache waste.
But the reduction in useful cache space would be less than half,
because at least some of the pages are frequently accessed, and
therefore stored only once.

I lost track of the Linux cache replacement policy years ago, but it
used to have a second-chance element that should measure frequency well
enough to stop this cache duplication -- a page read from a file was on
the inactive list until it got referenced again, so it could not stay
in memory long when there was contention for memory. I believe this
would make the far cache pages always inactive, so essentially not
consuming resource.
So I'd be interested to know by what mechanism stacked filesystems have
drastically reduced cache efficiency in your experiments, and whether a
simple policy change might solve the problem as well as the more
complex approach of getting an individual filesystem driver more
involved in memory management.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
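The second-chance behavior described above can be sketched with a
generic clock algorithm (an illustration, not the actual Linux policy):
pages that get referenced a second time survive an eviction sweep,
while once-touched pages, like the hypothesized far-cache duplicates,
are reclaimed first.

```c
#include <string.h>

#define NFRAMES 4

/* One cache frame with a reference ("second chance") bit. */
struct frame { int page; int referenced; };

struct clock_cache {
    struct frame f[NFRAMES];
    int used;
    int hand;                           /* eviction pointer */
};

/* Access a page; returns the frame index it ends up in. */
static int access_page(struct clock_cache *c, int page)
{
    for (int i = 0; i < c->used; i++) {
        if (c->f[i].page == page) {     /* hit: mark referenced */
            c->f[i].referenced = 1;
            return i;
        }
    }
    if (c->used < NFRAMES) {            /* free frame available */
        c->f[c->used] = (struct frame){ page, 0 };
        return c->used++;
    }
    /* Evict: sweep the hand, clearing reference bits; the first
     * frame found with the bit already clear is the victim. */
    while (c->f[c->hand].referenced) {
        c->f[c->hand].referenced = 0;
        c->hand = (c->hand + 1) % NFRAMES;
    }
    int victim = c->hand;
    c->f[victim] = (struct frame){ page, 0 };
    c->hand = (victim + 1) % NFRAMES;
    return victim;
}

static int resident(const struct clock_cache *c, int page)
{
    for (int i = 0; i < c->used; i++)
        if (c->f[i].page == page)
            return 1;
    return 0;
}
```

Under memory pressure, the once-only pages are the ones evicted, so
the duplicated far-cache copies cost little steady-state memory --
which is the intuition behind doubting the "two or more times" figure.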
Re: Finding hardlinks
>Adding a vfs call to check for file equivalence seems like a good idea
>to me.

That would be only barely useful. It would let 'diff' say, "those are
both the same file," but wouldn't be useful for something trying to
duplicate a filesystem (e.g. a backup program). Such a program can't do
the comparison between every possible pairing of file names.

I'd rather just see a unique file identifier that's as big as it needs
to be. And the more unique the better. (There are lots of degrees of
uniqueness: unique as long as the files exist, as long as the
filesystems are mounted, etc.)

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Finding hardlinks
>> Well, the NFS protocol allows that [see rfc1813, p. 21: "If two file
>> handles from the same server are equal, they must refer to the same
>> file, but if they are not equal, no conclusions can be drawn."]
>>
>Interesting. That does seem to break the method of st_dev/st_ino for
>finding hardlinks. For Linux fileservers I think we generally do have
>1:1 correspondence so that's not generally an issue.
>
>If we're getting into changing specs, though, I think it would be
>better to change it to enforce a 1:1 filehandle to inode
>correspondence rather than making new NFS ops. That does mean you
>can't use the filehandle for carrying other info, but it seems like
>there ought to be better mechanisms for that.

The filehandle is very much the appropriate mechanism for that. A
handle is opaque. The client has no business doing anything with it
besides sending it back to the server. Though you seem to want to avoid
adding new NFS operations, what you're proposing is changing the nature
of existing ones so that new operations would have to be added to get
back what the existing ones do!

If it's important to know that two names refer to the same file in a
remote filesystem, I don't see any way around adding a new concept of
file identifier to the protocol.

BTW, a primary characteristic of an "identifier" is that it can be used
to tell whether you've got the object you're looking for, but often
can't be used to _find_ that object. For the latter, you need an
address. There are lots of examples where you can't practically use the
same value for both an identifier and an address.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Finding hardlinks
>Statement 1:
>If two files have identical st_dev and st_ino, they MUST be hardlinks of
>each other/the same file.
>
>Statement 2:
>If two "files" are a hardlink of each other, they MUST be detectable
>(for example by having the same st_dev/st_ino)
>
>I personally consider statement 1 a mandatory requirement in terms of
>quality of implementation if not Posix compliance.
>
>Statement 2 for me is "nice but optional"

Statement 1 without Statement 2 provides one of those facilities where
the computer tells you something is "maybe" or "almost certainly" true.
While it's useful in plenty of practical cases, in my experience it
leaves computer engineers uncomfortable. Recently, there was a
discussion on this list of a proposed case in which stat() results are
"maybe correct, but maybe garbage" that covered some of that
philosophy.

>it's an optimization for a program like tar to not have to
>back a file up twice,

I think it's a stronger need than just making a tarball smaller. When
you restore the tarball in which 'foo' and 'bar' are different files,
you get a fundamentally different tree of files from the one you
started with, in which 'foo' and 'bar' were two different names for the
same file. If, in the restored tree, you write to 'foo', you won't see
the result in 'bar'. If you remove read permission from 'foo', the
world can still see the information in 'bar'.

Plus, in some cases optimization is a matter of life or death -- the
extra resources (storage space, cache space, access time, etc.) for the
duplicated files might be enough to move you from practical to
impractical.

People tend to demand that restore programs faithfully restore what was
backed up. (I've even seen requirements that the inode numbers upon
restore be the same.)
Given the difficulty of dealing with multi-linked files, not to mention
the various nonstandard file attributes fancy filesystem types have, I
suppose people probably don't have really high expectations of that
nowadays, but it's still a worthy goal not to turn one file into two.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
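The check a backup program applies is the (st_dev, st_ino) comparison
from Statement 1: two names with equal device and inode numbers denote
the same file, so the second name can be archived as a link rather
than a second copy. A minimal sketch of that comparison (the helper
name and the file paths in the test are made up):

```c
#include <sys/stat.h>

/* Returns 1 if the two paths name the same file (same device and
 * inode), 0 if not, -1 on error.  This is the test a backup program
 * applies against every file it has already seen. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Note that this only works when Statement 1 holds; on a filesystem that
reports colliding inode numbers, it would merge distinct files.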
Re: Finding hardlinks
>The chance of an accidental
>collision is infinitesimally small. For a set of
>
>        100 files: 0.03%
>  1,000,000 files: 0.03%

Hey, if you're going to use a mathematical term, use it right. :-)
.03% isn't infinitesimal. It's just insignificant.

And I think infinitesimally small must mean infinite.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Finding hardlinks
>On Thu, 2006-12-28 at 16:44 -0800, Bryan Henderson wrote:
>> >Statement 1:
>> >If two files have identical st_dev and st_ino, they MUST be hardlinks
>> >of each other/the same file.
>> >
>> >Statement 2:
>> >If two "files" are a hardlink of each other, they MUST be detectable
>> >(for example by having the same st_dev/st_ino)
>> >
>> >I personally consider statement 1 a mandatory requirement in terms of
>> >quality of implementation if not Posix compliance.
>> >
>> >Statement 2 for me is "nice but optional"
>>
>> Statement 1 without Statement 2 provides one of those facilities where
>> the computer tells you something is "maybe" or "almost certainly" true.
>
>No it's not a "almost certainly". It's a "these ARE".

There are various "these AREs" here, but the "almost certainly" I'm
talking about is where Statement 1 is true and Statement 2 is false and
the inode numbers you read through two links are different. (For
example, consider a filesystem in which the reported inode number is
the internal inode number truncated to 32 bits.) The links are almost
certainly to different files.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Finding hardlinks
>On Fri, 2006-12-29 at 10:08 -0800, Bryan Henderson wrote:
>> There are various "these AREs" here, but the "almost certainly" I'm
>> talking about is where Statement 1 is true and Statement 2 is false and
>> the inode numbers you read through two links are different. (For
>> example, consider a filesystem in which the reported inode number is
>> the internal inode number truncated to 32 bits.) The links are almost
>> certainly to different files.
>
>but then statement 1 is false and violated.

Whoops; wrong example. It doesn't matter, though, since clearly there
exist correct examples: where Statement 1 is true and Statement 2 is
false, and in that case when the inode numbers are different, the links
are "almost certainly" to different files.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
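For concreteness, the retracted truncation example fails Statement 1
because two distinct internal inode numbers can report the same
st_ino. A toy sketch (the values are arbitrary):

```c
#include <stdint.h>

/* Model a filesystem that reports only the low 32 bits of its
 * internal 64-bit inode number.  Two files whose internal numbers
 * differ only above bit 31 then report identical st_ino values,
 * which is exactly the Statement 1 violation. */
static uint32_t reported_ino(uint64_t internal_ino)
{
    return (uint32_t)internal_ino;   /* truncate to 32 bits */
}
```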
Re: Finding hardlinks
>On any decent filesystem st_ino should uniquely identify an object and
>reliably provide hardlink information. The UNIX world has relied upon
>this for decades. A filesystem with st_ino collisions without being
>hardlinked (or the other way around) needs a fix.

But for at least the last of those decades, filesystems that could not
do that were not uncommon. They had to present 32-bit inode numbers and
either allowed more than 4G files or just didn't have the means of
assigning inode numbers with the proper uniqueness to files. And the
sky did not fall. I don't have an explanation why, but it makes it look
to me like there are worse things than not having total one-to-one
correspondence between inode numbers and files. Having a stat or mount
fail because inodes are too big, having fewer than 4G files, and
waiting for the filesystem to generate a suitable inode number might
fall in that category.

I fully agree that much effort should be put into making inode numbers
work the way POSIX demands, but I also know that that sometimes
requires more than just writing some code.

--
Bryan Henderson                          San Jose California
IBM Almaden Research Center              Filesystems
Re: [nfsv4] RE: Finding hardlinks
>>> "Clients MUST use filehandle comparisons only to improve
>>> performance, not for correct behavior. All clients need to
>>> be prepared for situations in which it cannot be determined
>>> whether two filehandles denote the same object and in such
>>> cases, avoid making invalid assumptions which might cause
>>> incorrect behavior."
>>
>> Don't you consider data corruption due to cache inconsistency an
>> incorrect behavior?
>
>Exactly where do you see us violating the close-to-open cache
>consistency guarantees?

Let me add the information that Trond is implying: his answer is yes,
he doesn't consider data corruption due to cache inconsistency to be
incorrect behavior. And the reason is that, contrary to what one would
expect, NFS allows that (for reasons of implementation practicality).
It says when you open a file via an NFS client and read it via that
open instance, you can legally see data as old as the moment you opened
it. Ergo, you can't use NFS in cases where that would cause
unacceptable data corruption.

We normally think of this happening when a different client updates the
file, in which case there's no practical way for the reading client to
know his cache is stale. When the updater and reader use the same
client, we can do better, but if I'm not mistaken, the NFS protocol
does not require us to do so. And probably more relevant: the user
wouldn't expect cache consistency.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Finding hardlinks
>but you can get a large number of >1 linked
>files, when you copy full directories with "cp -rl". Which I do a lot
>when developing. I've done that a few times with the Linux tree.

Can you shed some light on how you use this technique? (I.e. what does
it do for you?)

Many people are of the opinion that since the invention of symbolic
links, multiple hard links to files have been more trouble than they're
worth. I purged the last of them from my personal system years ago.
This thread has been a good overview of the negative side of
hardlinking; it would be good to see what the positives are.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Finding hardlinks
>I did cp -rl his-tree my-tree (which completed
>quickly), edited the two files that needed to be patched, then did
>diff -urp his-tree my-tree, which also completed quickly, as diff knows
>that if two files have the same inode, they don't need to be opened.
>... download one tree from kernel.org, do a bunch of cp -lr for
>each arch you plan to play with, and then go and work on each of the
>trees separately.

Cool. It's like a poor man's directory overlay (same basic concept as
union mount, Make's VPATH, and Subversion branching). And I guess this
explains why the diff optimization is so important.

--
Bryan Henderson                          San Jose California
IBM Almaden Research Center              Filesystems
Re: Symbolic links vs hard links
>Other people are of the opinion that the invention of the symbolic link
>was a huge mistake.

I guess I haven't heard that one. What is the argument that we were
better off without symbolic links?

--
Bryan Henderson                          San Jose California
IBM Almaden Research Center              Filesystems
Re: Symbolic links vs hard links
>On Wed, Jan 10, 2007 at 09:38:11AM -0800, Bryan Henderson wrote:
>> >Other people are of the opinion that the invention of the symbolic
>> >link was a huge mistake.
>>
>> I guess I haven't heard that one. What is the argument that we were
>> better off without symbolic links?
>
>I suppose http://www.cs.bell-labs.com/sys/doc/lexnames.html is as good
>a presentation of that argument as any ...

Thanks. For those who didn't read it: this refers to the problem of
".." being ambiguous when there are many paths to a directory. I.e. it's
about the ability of a symbolic link to link to a directory, not just
to a file (which is all a hard link can do).

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems