Re: [FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page
On Thu, Dec 28, 2006 at 08:17:17PM +0530, Suparna Bhattacharya wrote:
> I am really bad with names :( I tried using the _wq suffixes earlier and
> that seemed confusing to some, but if no one else objects I'm happy to
> use that. I thought aio_lock_page() might be misleading because it is
> synchronous if a regular wait queue entry is passed in, but again it may
> not be too bad. What's your preference? Does anything more intuitive
> come to mind?

Being bad about naming seems to be a disease; at least I suffer from it as
well. I wouldn't mind either the _wq or the aio_ naming - _wq describes
the way it's called, and aio_ describes that it's a special case for aio,
similar to how ->aio_read/->aio_write can be used for synchronous I/O as
well.
Re: [FSAIO][PATCH 7/8] Filesystem AIO read
On Thu, Dec 28, 2006 at 08:48:30PM +0530, Suparna Bhattacharya wrote:
> Yes, we can do that -- how about aio_restarted() as an alternate name?

Sounds fine to me. Plus the possible naming updates discussed in the last
mail.

> > Also do we really need to pass current->io_wait here? Isn't the
> > waitqueue in the kiocb always guaranteed to be the same? Now that all
> > pagecache
>
> We don't have the kiocb available to this routine. Using
> current->io_wait avoids the need to pass the iocb down to deeper levels
> just for the sync vs async checks, and also allows such routines to be
> shared by other code which does not use iocbs (e.g. generic_file_sendfile
> -> do_generic_file_read -> do_generic_mapping_read) without having to
> set up dummy iocbs.

We really want to switch sendfile to kiocbs, btw - for one thing to allow
an aio_sendfile implementation, and second to make it more consistent with
all the other I/O path code so we can avoid special cases in the fs code.
So I'm not convinced by that argument. But again, we don't need to put the
io_wait removal into your patchkit. I'll try to hack on it once I get a
little spare time.
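[Editorial note: for readers not following the patchset, the sync-vs-async
split being discussed above works roughly as sketched below. This is an
illustrative sketch only, not the actual FSAIO patch code; the names
aio_lock_page() and add_page_wait_queue() are hypothetical, and the
is_sync_wait()-style test mirrors the wait.h helper of that era.]

```c
/*
 * Illustrative sketch (NOT the actual FSAIO patch): a lock-page helper
 * that behaves synchronously when handed a regular (on-stack) wait queue
 * entry, and returns -EIOCBRETRY when handed the iocb's wait queue entry
 * so the caller can retry asynchronously later.
 */
int aio_lock_page(struct page *page, wait_queue_t *wait)
{
	if (!TestSetPageLocked(page))
		return 0;			/* got the lock immediately */

	if (is_sync_wait(wait)) {
		lock_page(page);		/* ordinary blocking behaviour */
		return 0;
	}

	/* async caller: queue the iocb's wait entry and bail out */
	add_page_wait_queue(page, wait);	/* hypothetical helper */
	return -EIOCBRETRY;
}
```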
Re: Finding hardlinks
Hi!

> > > > It seems like the posix idea of unique st_dev, st_ino doesn't hold
> > > > water for modern file systems
> > >
> > > are you really sure?
> >
> > Well Jan's example was of Coda that uses 128-bit internal file ids.
> >
> > > and if so, why don't we fix *THAT* instead
> >
> > Hmm, sometimes you can't fix the world, especially if the filesystem
> > is exported over NFS and has a problem with fitting its file IDs
> > uniquely into a 64-bit identifier.
>
> Note, it's pretty easy to fit _anything_ into a 64-bit identifier with
> the use of a good hash function. The chance of an accidental collision
> is infinitesimally small. For a set of 100 files: 0.03%; 1,000,000
> files: 0.03%.

I do not think we want to play with probability like this. I mean...
imagine 4G files, 1KB each. That's 4TB disk space, not _completely_
unreasonable, and collision probability is going to be ~100% due to
birthday paradox.

You'll still want to back up your 4TB server...

									Pavel
--
Thanks for all the (sleeping) penguins.
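[Editorial note: for concreteness, the standard birthday approximation
behind the estimate above, assuming the file ids are hashed uniformly into
a 64-bit space (this arithmetic is an illustration, not from the original
mail):]

\[
P_{\mathrm{coll}}(n) \;\approx\; 1 - e^{-n^{2}/2^{65}},
\qquad
n = 2^{32}\ (\text{4G files}) \;\Rightarrow\;
P_{\mathrm{coll}} \approx 1 - e^{-1/2} \approx 0.39 .
\]

So with 4G files the chance of at least one collision is roughly 40%
rather than literally ~100%, but the point stands: it is nowhere near
"infinitesimally small".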
Re: Finding hardlinks
> > > > > It seems like the posix idea of unique st_dev, st_ino doesn't
> > > > > hold water for modern file systems
> > > >
> > > > are you really sure?
> > >
> > > Well Jan's example was of Coda that uses 128-bit internal file ids.
> > >
> > > > and if so, why don't we fix *THAT* instead
> > >
> > > Hmm, sometimes you can't fix the world, especially if the filesystem
> > > is exported over NFS and has a problem with fitting its file IDs
> > > uniquely into a 64-bit identifier.
> >
> > Note, it's pretty easy to fit _anything_ into a 64-bit identifier with
> > the use of a good hash function. The chance of an accidental collision
> > is infinitesimally small. For a set of 100 files: 0.03%; 1,000,000
> > files: 0.03%.
>
> I do not think we want to play with probability like this. I mean...
> imagine 4G files, 1KB each. That's 4TB disk space, not _completely_
> unreasonable, and collision probability is going to be ~100% due to
> birthday paradox.
>
> You'll still want to back up your 4TB server...

Certainly, but tar isn't going to remember all the inode numbers. Even if
you solve the storage requirements (not impossible) it would have to do
(4e9^2)/2 = 8e18 comparisons, which computers don't have enough CPU power
for just yet.

It doesn't matter if there are collisions within the filesystem, as long
as there are no collisions between the set of files an application is
working on at the same time.

Miklos
Re: Finding hardlinks
On Tue, 2 Jan 2007, Miklos Szeredi wrote:
> > > > > > It seems like the posix idea of unique st_dev, st_ino doesn't
> > > > > > hold water for modern file systems
> > > > >
> > > > > are you really sure?
> > > >
> > > > Well Jan's example was of Coda that uses 128-bit internal file ids.
> > > >
> > > > > and if so, why don't we fix *THAT* instead
> > > >
> > > > Hmm, sometimes you can't fix the world, especially if the
> > > > filesystem is exported over NFS and has a problem with fitting its
> > > > file IDs uniquely into a 64-bit identifier.
> > >
> > > Note, it's pretty easy to fit _anything_ into a 64-bit identifier
> > > with the use of a good hash function. The chance of an accidental
> > > collision is infinitesimally small. For a set of 100 files: 0.03%;
> > > 1,000,000 files: 0.03%.
> >
> > I do not think we want to play with probability like this. I mean...
> > imagine 4G files, 1KB each. That's 4TB disk space, not _completely_
> > unreasonable, and collision probability is going to be ~100% due to
> > birthday paradox.
> >
> > You'll still want to back up your 4TB server...
>
> Certainly, but tar isn't going to remember all the inode numbers. Even
> if you solve the storage requirements (not impossible) it would have to
> do (4e9^2)/2 = 8e18 comparisons, which computers don't have enough CPU
> power for just yet.

It is remembering all inode numbers with nlink > 1, and many other tools
remember all directory inode numbers (see my other post on this topic).
Of course it doesn't compare each number with all the others; it uses
hashing.

> It doesn't matter if there are collisions within the filesystem, as long
> as there are no collisions between the set of files an application is
> working on at the same time.

--- and those are all the files, in the case of a backup.

> Miklos

Mikulas
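[Editorial note: a minimal userspace sketch of the bookkeeping being
described above - not tar's or cp's actual code. A tool like tar calls
this only for entries with st_nlink > 1; a backup tool effectively
remembers every such inode it visits, which is why an st_ino collision
anywhere in the set matters.]

```c
#include <string.h>
#include <sys/stat.h>

struct seen { dev_t dev; ino_t ino; char *path; };

#define TABLE_SIZE 65536
static struct seen table[TABLE_SIZE];	/* toy open-addressed hash table */

static unsigned slot(dev_t dev, ino_t ino)
{
	return (unsigned)((ino ^ (ino >> 17) ^ dev) % TABLE_SIZE);
}

/* Returns the previously recorded path if (st_dev, st_ino) has been seen,
 * otherwise records this path and returns NULL.  No resizing/deletion. */
const char *remember_link(const struct stat *st, const char *path)
{
	unsigned i = slot(st->st_dev, st->st_ino);

	while (table[i].path) {
		if (table[i].dev == st->st_dev && table[i].ino == st->st_ino)
			return table[i].path;	/* hardlink (or an ino collision!) */
		i = (i + 1) % TABLE_SIZE;
	}
	table[i].dev = st->st_dev;
	table[i].ino = st->st_ino;
	table[i].path = strdup(path);
	return NULL;
}
```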
RE: [nfsv4] RE: Finding hardlinks
On Sun, 2006-12-31 at 16:25 -0500, Halevy, Benny wrote:
> Trond Myklebust wrote:
> > On Thu, 2006-12-28 at 15:07 -0500, Halevy, Benny wrote:
> > > Mikulas Patocka wrote:
> > > > BTW. how does (or how should?) NFS client deal with cache
> > > > coherency if filehandles for the same file differ?
> > >
> > > Trond can probably answer this better than me... As I read it,
> > > currently the nfs client matches both the fileid and the filehandle
> > > (in nfs_find_actor). This means that different filehandles for the
> > > same file would result in different inodes :(. Strictly following
> > > the nfs protocol, comparing only the fileid should be enough IF
> > > fileids are indeed unique within the filesystem. Comparing the
> > > filehandle works as a workaround when the exported filesystem (or
> > > the nfs server) violates that. From a user standpoint I think that
> > > this should be configurable, probably per mount point.
> >
> > Matching files by fileid instead of filehandle is a lot more trouble
> > since fileids may be reused after a file has been deleted. Every time
> > you look up a file, and get a new filehandle for the same fileid, you
> > would at the very least have to do another GETATTR using one of the
> > 'old' filehandles in order to ensure that the file is the same object
> > as the one you have cached. Then there is the issue of what to do when
> > you open(), read() or write() to the file: which filehandle do you
> > use, are the access permissions the same for all filehandles, ...
> > All in all, much pain for little or no gain.
>
> See my answer to your previous reply. It seems like the current
> implementation is in violation of the nfs protocol and the extra pain is
> required.

...and we should care because...?

Trond
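[Editorial note: as context for the nfs_find_actor remark above, the
inode-cache match the client performs is roughly the following. This is a
simplified illustration of the idea, not the kernel's exact code; the
function name nfs_inode_matches is made up, while NFS_FILEID, NFS_FH and
nfs_compare_fh mirror the client helpers of that era.]

```c
/*
 * Simplified sketch of the matching described above: an NFS inode found
 * in the inode cache is reused only if BOTH the fileid and the filehandle
 * match.  A second, different filehandle for the same fileid therefore
 * instantiates a second inode.
 */
static int nfs_inode_matches(struct inode *inode,
			     const struct nfs_fh *fh,
			     const struct nfs_fattr *fattr)
{
	if (NFS_FILEID(inode) != fattr->fileid)
		return 0;			/* different file */
	if (nfs_compare_fh(NFS_FH(inode), fh) != 0)
		return 0;			/* same fileid, different fh */
	return 1;				/* reuse this inode */
}
```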
RE: Finding hardlinks
On Sun, 2006-12-31 at 16:19 -0500, Halevy, Benny wrote:
> Even for NFSv3 (which doesn't have the unique_handles attribute) I think
> that the linux nfs client can do a better job. If you had a filehandle
> cache that points at inodes, you could maintain a many-to-one
> relationship from multiple filehandles onto one inode. When you discover
> a new filehandle you can look up the inode cache for the same fileid,
> and if one is found you can do a getattr on the old filehandle (without
> loss of generality you should always use the latest filehandle that was
> returned for that filesystem object, although any filehandle that refers
> to it can be used). If the getattr succeeds then the filehandles refer
> to the same fs object and you can create a new entry in the filehandle
> cache pointing at that inode. Otherwise, if getattr says that the old
> filehandle is stale, I think you should mark the inode as stale and keep
> it around so that applications get an appropriate error until last
> close, before you clean the stale filehandles out of the fh cache. A new
> inode structure should be created for the new filehandle.

There are, BTW, other reasons why the above is a bad idea: it breaks on a
bunch of well-known servers. Look back at the 2.2.x kernels and the kind
of hacks they had in order to deal with crap like the Netapp '.snapshot'
directories, which contain files with duplicate fileids that do not
represent hard links, but rather previous revisions of the same file.
That kind of hackery was part of the reason why I ripped out that code.
The other reasons were:

 - you end up playing unnecessary getattr games like the above for little
   gain.
 - the only servers that implemented the above were borken pieces of crap
   that encoded parent directories in the filehandle, and which end up
   breaking anyway under cross-directory renames.
 - the world is filled with non-posix filesystems that frequently don't
   have real fileids. They are often just generated on the fly and can
   change at the drop of a hat.

Trond
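[Editorial note: the lookup being proposed (and argued against) above
amounts to roughly the following pseudo-C. All helper names here
(fhcache_lookup, nfs_ilookup_by_fileid, getattr_ok, ...) are hypothetical;
this is an illustration of the idea, not an actual or proposed patch.]

```c
/* Editorial sketch of the proposed filehandle-cache lookup. */
struct inode *fh_to_inode(struct nfs_server *server,
			  const struct nfs_fh *fh, u64 fileid)
{
	struct inode *inode = fhcache_lookup(server, fh);
	if (inode)
		return inode;			/* known filehandle */

	inode = nfs_ilookup_by_fileid(server, fileid);
	if (inode && getattr_ok(server, nfs_latest_fh(inode))) {
		/* old fh still valid: new fh is an alias for the same object */
		fhcache_insert(server, fh, inode);
		return inode;
	}
	if (inode)
		mark_inode_stale(inode);	/* keep around until last close */

	return nfs_new_inode(server, fh, fileid);
}
```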
Re: Finding hardlinks
On Wed, 3 Jan 2007, Trond Myklebust wrote:
> On Sat, 2006-12-30 at 02:04 +0100, Mikulas Patocka wrote:
> > On Fri, 29 Dec 2006, Trond Myklebust wrote:
> > > On Thu, 2006-12-28 at 19:14 +0100, Mikulas Patocka wrote:
> > > > Why don't you rip out the support for colliding inode numbers from
> > > > the kernel altogether (i.e. remove iget5_locked)? It's reasonable
> > > > to have either no support for colliding ino_t or full support for
> > > > it (including syscalls that userspace can use to work with such a
> > > > filesystem) --- but I don't see any point in having the half-way
> > > > support in the kernel as it is right now.
> > >
> > > What would ino_t have to do with inode numbers? It is only used as a
> > > hash table lookup. The inode number is set in the ->getattr()
> > > callback.
> >
> > The question is: why does the kernel contain an iget5 function that
> > looks up according to a callback, if the filesystem cannot have more
> > than a 64-bit inode identifier?
>
> Huh? The filesystem can have as large a damned identifier as it likes.
> NFSv4 uses 128-byte filehandles, for instance.

But then it needs some other syscall to let applications determine
hardlinks --- which was the initial topic in this thread.

> POSIX filesystems are another matter. They can only have 64-bit
> identifiers thanks to the requirement that inode numbers be 64-bit
> unique and permanently stored. However, Linux caters for a whole
> truckload of filesystems which will never fit that label: look at all
> those users of iunique(), for one...

I see them. The bad thing is that many programmers read POSIX, write
programs as if the POSIX specification were true, and those programs
break randomly on non-POSIX filesystems. Each non-POSIX filesystem
invents st_ino on its own, trying to minimize hash collisions, which makes
the failure even less probable and harder to find.

The current situation is (for example) that cp does stat(), open(),
fstat() and compares st_ino/st_dev --- if they mismatch, it reports an
error and doesn't copy the file --- so if the kernel drops the inode from
its cache between stat() and open() and the filesystem uses iunique(), cp
will fail.

What utilities should the user use on those non-POSIX filesystems, if not
cp? Probably some file-handling guidelines should be specified and written
down in Documentation/ as a form of standard that application programmers
can use.

Mikulas

> Trond
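[Editorial note: to make the failure mode concrete, here is a minimal
sketch of the stat()/open()/fstat() identity check described above (an
illustration, not coreutils code). If the filesystem regenerates st_ino
via iunique() between the two calls, the pair no longer matches and the
copy is refused.]

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch of the "same file?" check cp performs between its initial
 * stat() and the fstat() on the descriptor it actually opened.  A
 * filesystem whose st_ino is not stable across inode-cache eviction
 * can make this check fail spuriously. */
int open_verified(const char *path, struct stat *out)
{
	struct stat before, after;
	int fd;

	if (stat(path, &before) != 0)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (fstat(fd, &after) != 0 ||
	    before.st_ino != after.st_ino ||
	    before.st_dev != after.st_dev) {
		fprintf(stderr, "%s: file changed between stat() and open()\n",
			path);
		close(fd);
		return -1;
	}
	*out = after;
	return fd;
}
```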
Re: [RFC] Heads up on a series of AIO patchsets
Sorry for the delay, I'm finally back from the holiday break :)

> (1) The filesystem AIO patchset attempts to address one part of the
> problem, which is to make regular file IO (without O_DIRECT)
> asynchronous (mainly the case of reads of uncached or partially cached
> files, and O_SYNC writes).

One of the properties of the currently implemented EIOCBRETRY aio path is
that ->mm is the only field in current which matches the submitting
task_struct while inside the retry path.

It looks like a retry-based aio write path would be broken because of
this. generic_write_checks() could run in the aio thread and get its
task_struct instead of that of the submitter: the wrong rlimit will be
tested and SIGXFSZ won't be raised. remove_suid() could check the
capabilities of the aio thread instead of those of the submitter.

I don't think EIOCBRETRY is the way to go because of this increased (and
subtle!) complexity. What are the chances that we would have ever found
those bugs outside code review? How do we make sure that current
references don't sneak back in after having initially audited the paths?

Take the io_cmd_epoll_wait patch..

> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> modifications from Jeff Moyer and me) addresses this problem for native
> linux aio in a simple manner.

It's simple looking, sure. This current flipping didn't even occur to me
while throwing the patch together!

But that patch ends up calling ->poll (and poll_table->qproc) and writing
to userspace (so potentially calling ->nopage) from the aio threads. Are
we sure that none of them will behave surprisingly because current changed
under them?

It might be safe now, but that isn't really the point. I'd rather we
didn't have yet one more subtle invariant to audit and maintain.

At the risk of making myself vulnerable to the charge of mentioning
vapourware, I will admit that I've been working on a (slightly mad)
implementation of async syscalls. I've been quiet about it because I don't
want to whip up complicated discussion without being able to show code
that works, even if barely. I mention it now only to make it clear that I
want to be constructive, not just critical :).

- z
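[Editorial note: the rlimit/SIGXFSZ concern above comes from the way the
write path consults "current". The following is a condensed paraphrase of
the file-size check inside generic_write_checks() in kernels of that era,
not a verbatim excerpt, and the wrapper name write_size_check is made up.]

```c
/*
 * Condensed paraphrase of generic_write_checks()'s RLIMIT_FSIZE handling:
 * both the rlimit that is consulted and the task that receives SIGXFSZ
 * come from "current".  If this ran inside an aio worker thread during a
 * retry, "current" would be the worker, not the submitter of the write.
 */
static int write_size_check(loff_t *pos, size_t *count)
{
	unsigned long limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;

	if (limit != RLIM_INFINITY) {
		if (*pos >= limit) {
			send_sig(SIGXFSZ, current, 0);	/* wrong task in a retry */
			return -EFBIG;
		}
		if (*count > limit - (unsigned long)*pos)
			*count = limit - (unsigned long)*pos;
	}
	return 0;
}
```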
Re: [RFC] Heads up on a series of AIO patchsets
> > Any details?
>
> Well, one path I tried I couldn't help but post a blog entry about for
> my friends. I'm not sure it's the direction I'll take with linux-kernel,
> but the fundamentals are there: the api should be the syscall interface,
> and there should be no difference between sync and async behaviour.
>
> http://www.zabbo.net/?p=72

Any code you're willing to let people play with? I could at least have
real test cases, and a library to go along with it as it gets finished.

Another pie-in-the-sky idea: one thing that's been bugging me lately
(working on a 9p server) is that sendfile is hard to use in practice,
because you need packet headers and such, and they need to go out at the
same time. Sendfile listio support would fix this, but it's not a general
solution. What would be really useful is a way to say that a certain batch
of async ops either all succeed or all fail, and happen atomically; i.e.,
transactions for syscalls. Probably even harder to do than general async
syscalls, but it'd be the best thing since sliced bread... and hey, it
seems the logical next step.
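[Editorial note: the header-plus-file problem described above is commonly
worked around on Linux with TCP_CORK, as in the sketch below (an
illustration, not from the original mail). It keeps the header and the
start of the payload in the same packets, but it is synchronous and still
cannot be batched atomically with other async operations, which is the
gap the "transactions for syscalls" idea is pointing at.]

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send a small protocol header followed by file contents: cork the
 * socket, write the header, sendfile() the payload, then uncork so the
 * kernel flushes coalesced packets. */
int send_with_header(int sock, const char *hdr, size_t hdrlen,
		     int filefd, off_t offset, size_t count)
{
	int on = 1, off = 0;

	setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));

	if (write(sock, hdr, hdrlen) != (ssize_t)hdrlen)
		return -1;
	if (sendfile(sock, filefd, &offset, count) < 0)
		return -1;

	setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
	return 0;
}
```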
Re: [RFC] Heads up on a series of AIO patchsets
On Tue, Jan 02, 2007 at 03:56:09PM -0800, Zach Brown wrote:
> Sorry for the delay, I'm finally back from the holiday break :)

Welcome back !

> > (1) The filesystem AIO patchset attempts to address one part of the
> > problem, which is to make regular file IO (without O_DIRECT)
> > asynchronous (mainly the case of reads of uncached or partially cached
> > files, and O_SYNC writes).
>
> One of the properties of the currently implemented EIOCBRETRY aio path
> is that ->mm is the only field in current which matches the submitting
> task_struct while inside the retry path.

Yes, and that, as I guess you know, is to enable the aio worker thread to
operate on the caller's address space for copy_from/to_user (see the
use_mm() sketch after this message). The actual io setup and associated
checks are expected to have been handled at submission time.

> It looks like a retry-based aio write path would be broken because of
> this. generic_write_checks() could run in the aio thread and get its
> task_struct instead of that of the submitter. The wrong rlimit will be
> tested and SIGXFSZ won't be raised. remove_suid() could check the
> capabilities of the aio thread instead of those of the submitter.

generic_write_checks() is done in the submission path, not repeated during
retries, so such checks are not intended to run in the aio thread. Did I
miss something here ?

> I don't think EIOCBRETRY is the way to go because of this increased (and
> subtle!) complexity. What are the chances that we would have ever found
> those bugs outside code review? How do we make sure that current
> references don't sneak back in after having initially audited the paths?

The EIOCBRETRY route is not something that is intended to be used blindly.
It is just one alternative for implementing an aio operation by splitting
up responsibility between the submitter and the aio threads, where the aio
threads can run in the caller's address space.

> Take the io_cmd_epoll_wait patch..
>
> > issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> > modifications from Jeff Moyer and me) addresses this problem for
> > native linux aio in a simple manner.
>
> It's simple looking, sure. This current flipping didn't even occur to me
> while throwing the patch together!
>
> But that patch ends up calling ->poll (and poll_table->qproc) and
> writing to userspace (so potentially calling ->nopage) from the aio

Yes of course, but why is that a problem ? The copy_from/to_user/put_user
constructs are designed to handle soft failures, and we are already using
the caller's ->mm. Do you see a need for any additional asserts() ?

If there is something needed by ->nopage etc. which is not abstracted out
within the ->mm, then we would need to fix that instead, for correctness
anyway, isn't that so ?

Now it is possible that there are minor blocking points in the code, and
the effect of these would be to hold up / delay subsequent queued aio
operations; that is an efficiency issue, but not a correctness concern.

> threads. Are we sure that none of them will behave surprisingly because
> current changed under them?

My take is that we should fix the problems that we see. It is likely that
what manifests relatively more easily with AIO is also a subtle problem in
other cases.

> It might be safe now, but that isn't really the point. I'd rather we
> didn't have yet one more subtle invariant to audit and maintain.
>
> At the risk of making myself vulnerable to the charge of mentioning
> vapourware, I will admit that I've been working on a (slightly mad)
> implementation of async syscalls.
> I've been quiet about it because I don't want to whip up complicated
> discussion without being able to show code that works, even if barely.
> I mention it now only to make it clear that I want to be constructive,
> not just critical :).

That is great and I look forward to it :)

I am, however, assuming that whatever implementation you come up with will
have a different interface from current linux aio -- i.e. a
next-generation aio model that will be easily integratable with kevents
etc. Which takes me back to Ingo's point - let's have the new evolve in
parallel with the old, if we can, and not hold up the patches for POSIX
AIO to start using kernel AIO, or for epoll to integrate with AIO.

OK, I just took a quick look at your blog and I see that you are basically
implementing Linus' microthreads scheduling approach - one year since we
had that discussion. Glad to see that you found a way to make it workable
... (I'm guessing that you are copying over the part of the stack in use
at the time of every switch, is that correct ? At what point do you do the
allocation of the saved stacks ? Sorry, I should hold off all these
questions till your patch comes out.)

Regards
Suparna

> - z

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
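[Editorial note: the "aio worker thread operates on the caller's address
space" mechanism referred to earlier in this message works roughly as
below. This is a condensed sketch modeled on the use_mm()/unuse_mm()
helpers used by fs/aio.c in kernels of that era, not verbatim kernel code;
the wrapper name aio_run_iocb_in_callers_mm is made up. Note that only
->mm is borrowed: rlimits, credentials and signal delivery still belong to
the worker thread, which is exactly the subtle invariant being debated
above.]

```c
/*
 * Condensed sketch: an aio worker thread temporarily adopts the
 * submitter's address space before running a retry, so that
 * copy_to_user()/copy_from_user() in the retried path operate on the
 * caller's memory.
 */
static void aio_run_iocb_in_callers_mm(struct kiocb *iocb)
{
	struct mm_struct *mm = iocb->ki_ctx->mm;	/* submitter's mm */

	use_mm(mm);		/* current->mm now points at the caller's mm */
	iocb->ki_retry(iocb);	/* retried code may copy_to_user() safely   */
	unuse_mm(mm);		/* detach; everything else in current is     */
				/* still the worker thread's own state       */
}
```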