Re: [RFC] atomic open(..., O_CREAT | ...)
We've already got a patch that does this, and that I'm queueing up for inclusion. Cool! http://client.linux-nfs.org/Linux-2.6.x/2.6.12/linux-2.6.12-63-open_file_intents.dif Comments: /* + * Open intents have to release any file pointer that was allocated + * but not used by the VFS. + */ +void path_release_open_intent(struct nameidata *nd) +{ + if ((nd-flags LOOKUP_OPEN) nd-intent.open.file != NULL) { + fput(nd-intent.open.file); I think you should consider adding this: + if (!IS_ERR(nd-intent.open.file)) + fput(nd-intent.open.file); so the filesystem can delay returning the error from the open operation until the other errors have been sorted out by the lookup code. + nd-intent.open.file = NULL; Why is this NULL assignment needed? nd will not be used after this. + } + path_release(nd); +} + As for the orig flags thing. What is the point of that? dentry_open() needs the original open flags, not the transformed ones stored in intent.open.flags. The behavior is slightly strange, since filp_open() calculates namei_flags (which gets stored in intent.open.flags) so that an O_ACCMODE of 3 is transformed into FMODE_READ | FMODE_WRITE. But dentry_open() calculates filp-f_mode, so that O_ACCMODE of 3 is transformed into zero. This means that the (undocumented) access mode of 3 will require read-write permission, but will allow neither read() nor write() on the opened file. If we want to keep this behavior, then the orig_flags field is needed. Thanks, Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
ty den 09.08.2005 Klokka 09:46 (+0200) skreiv Miklos Szeredi: We've already got a patch that does this, and that I'm queueing up for inclusion. Cool! http://client.linux-nfs.org/Linux-2.6.x/2.6.12/linux-2.6.12-63-open_file_intents.dif Comments: /* + * Open intents have to release any file pointer that was allocated + * but not used by the VFS. + */ +void path_release_open_intent(struct nameidata *nd) +{ + if ((nd-flags LOOKUP_OPEN) nd-intent.open.file != NULL) { + fput(nd-intent.open.file); I think you should consider adding this: + if (!IS_ERR(nd-intent.open.file)) + fput(nd-intent.open.file); so the filesystem can delay returning the error from the open operation until the other errors have been sorted out by the lookup code. Intents are meant as optimisations, not replacements for existing operations. I'm therefore not really comfortable about having them return errors at all. + nd-intent.open.file = NULL; Why is this NULL assignment needed? nd will not be used after this. + } + path_release(nd); +} + It could be dropped. The reason for putting it in is that some parts of the VFS may restart a path walk operation if it fails (see for instance the ESTALE handling). As for the orig flags thing. What is the point of that? dentry_open() needs the original open flags, not the transformed ones stored in intent.open.flags. The behavior is slightly strange, since filp_open() calculates namei_flags (which gets stored in intent.open.flags) so that an O_ACCMODE of 3 is transformed into FMODE_READ | FMODE_WRITE. But dentry_open() calculates filp-f_mode, so that O_ACCMODE of 3 is transformed into zero. This means that the (undocumented) access mode of 3 will require read-write permission, but will allow neither read() nor write() on the opened file. If we want to keep this behavior, then the orig_flags field is needed. Why do we want to keep this behaviour? It is undocumented, it is non-posix, and it appears to do nothing you cannot do with the existing access() call. Are there any applications using it? If so, which ones, and why? Cheers, Trond - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
Intents are meant as optimisations, not replacements for existing operations. I'm therefore not really comfortable about having them return errors at all. That's true of normal intents, but not what are called intents here. A normal intent merely expresses an intent, and it can be totally ignored without harm to correctness. But these intents were designed to be responded to by actually performing the foreshadowed operation now - irreversibly. Linux needs an atomic lookup/open/create in order to participate in a shared filesystem and provide a POSIX interface (where shared filesystem means a filesystem that is simultaneously accessed by something besides the Linux system in question). Some operating systems do this simply with a VFS lookup/open/create function. Linux does it with this intents interface. It's hard to merge the concepts in code or in one's mind, which is why we're here now. A filesystem driver that needs to do atomic lookup/open/create has to bend over backwards to split the operation across the three filesystem driver calls that Linux wants to make. I've always preferred just to have a new inode operation for lookup/open/create (mirroring the POSIX open operation, used for all opens if available), but if enough arguments to lookup can do it, that's practically as good. But that means returning final status from lookup, and not under any circumstance proceeding to create or open when the filesystem driver has said the entire operation is complete. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
ty den 09.08.2005 Klokka 08:47 (-0700) skreiv Bryan Henderson: Intents are meant as optimisations, not replacements for existing operations. I'm therefore not really comfortable about having them return errors at all. That's true of normal intents, but not what are called intents here. A normal intent merely expresses an intent, and it can be totally ignored without harm to correctness. But these intents were designed to be responded to by actually performing the foreshadowed operation now - irreversibly. Linux needs an atomic lookup/open/create in order to participate in a shared filesystem and provide a POSIX interface (where shared filesystem means a filesystem that is simultaneously accessed by something besides the Linux system in question). Some operating systems do this simply with a VFS lookup/open/create function. Linux does it with this intents interface. It's hard to merge the concepts in code or in one's mind, which is why we're here now. A filesystem driver that needs to do atomic lookup/open/create has to bend over backwards to split the operation across the three filesystem driver calls that Linux wants to make. I've always preferred just to have a new inode operation for lookup/open/create (mirroring the POSIX open operation, used for all opens if available), but if enough arguments to lookup can do it, that's practically as good. But that means returning final status from lookup, and not under any circumstance proceeding to create or open when the filesystem driver has said the entire operation is complete. Have you looked at how we're dealing with this in NFSv4? Cheers, Trond - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
Have you looked at how we're dealing with this in NFSv4? No. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
Intents are meant as optimisations, not replacements for existing operations. I'm therefore not really comfortable about having them return errors at all. In my case they are not an optimization, rather the only way to correctly perform an open with O_CREAT. + nd-intent.open.file = NULL; Why is this NULL assignment needed? nd will not be used after this. + } + path_release(nd); +} + It could be dropped. The reason for putting it in is that some parts of the VFS may restart a path walk operation if it fails (see for instance the ESTALE handling). If you use the nameidata after path_release_open_intent(), you're screwed anyway, since nd-mnt and nd-dentry have already been released. If there's any chance that the path walk restart thing will invoke the filesystems open code twice (I doubt it), then the filesystem must make sure to check intent.open.file, whether it has already been set, and fput() it before setting it another time. Why do we want to keep this behaviour? It is undocumented, it is non-posix, and it appears to do nothing you cannot do with the existing access() call. Are there any applications using it? If so, which ones, and why? I have absolutely no idea. Looking closer, there's a problem with O_TRUNC as well: namei_flags = flags; if ((namei_flags+1) O_ACCMODE) namei_flags++; if (namei_flags O_TRUNC) namei_flags |= 2; So if flags is O_RDONLY|O_TRUNC, intent.open.flags will be FMODE_WRITE|FMODE_READ|O_TRUNC, but filp-f_mode will be FMODE_READ. This is also undocumented (or rather documented to be undefined), but the behavior is perfectly logical, and I can imagine some application relying on it. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
ty den 09.08.2005 Klokka 18:40 (+0200) skreiv Miklos Szeredi: Intents are meant as optimisations, not replacements for existing operations. I'm therefore not really comfortable about having them return errors at all. In my case they are not an optimization, rather the only way to correctly perform an open with O_CREAT. How so? In NFS the only case that can be tricky is O_EXCL, and that simply _cannot_ be done inside nfs_lookup without confusing the VFS. Instead, we always create a negative dentry, and then do the actual atomic exclusive create inside vfs_create(). We can do the ordinary atomic open(O_CREAT) inside nfs_lookup though. If that fails, then we simply return the error as an ERR_PTR in the dentry (i.e. the lookup won't succeed). The only case which we have that can be iffy is the case of open() on an existing dentry. If that fails, then we drop the dentry and force an uncached lookup in order to get it all right. + nd-intent.open.file = NULL; Why is this NULL assignment needed? nd will not be used after this. + } + path_release(nd); +} + It could be dropped. The reason for putting it in is that some parts of the VFS may restart a path walk operation if it fails (see for instance the ESTALE handling). If you use the nameidata after path_release_open_intent(), you're screwed anyway, since nd-mnt and nd-dentry have already been released. There is quite a bit of code out there that assumes it is free to stuff things into nd-mnt and nd-dentry. Some of it is Al Viro's code, some of it is from other people. For instance, the ESTALE handling will just save nd-mnt/nd-dentry before calling __link_path_walk(), then restore + retry if the ESTALE error comes up. If there's any chance that the path walk restart thing will invoke the filesystems open code twice (I doubt it), then the filesystem must make sure to check intent.open.file, whether it has already been set, and fput() it before setting it another time. The only user of that code is NFSv4, and we will never even try to allocate a file if the OPEN call returned an ESTALE. Why do we want to keep this behaviour? It is undocumented, it is non-posix, and it appears to do nothing you cannot do with the existing access() call. Are there any applications using it? If so, which ones, and why? I have absolutely no idea. Looking closer, there's a problem with O_TRUNC as well: namei_flags = flags; if ((namei_flags+1) O_ACCMODE) namei_flags++; if (namei_flags O_TRUNC) namei_flags |= 2; So if flags is O_RDONLY|O_TRUNC, intent.open.flags will be FMODE_WRITE|FMODE_READ|O_TRUNC, but filp-f_mode will be FMODE_READ. That is a bug that needs to be fixed in the intent.open.flags. We don't ever want to be opening the file for writing at the filesystem level when the user specified open for read. Cheers, Trond - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
+ nd-intent.open.file = NULL; Why is this NULL assignment needed? nd will not be used after this. + } + path_release(nd); +} + It could be dropped. The reason for putting it in is that some parts of the VFS may restart a path walk operation if it fails (see for instance the ESTALE handling). If you use the nameidata after path_release_open_intent(), you're screwed anyway, since nd-mnt and nd-dentry have already been released. There is quite a bit of code out there that assumes it is free to stuff things into nd-mnt and nd-dentry. Some of it is Al Viro's code, some of it is from other people. For instance, the ESTALE handling will just save nd-mnt/nd-dentry before calling __link_path_walk(), then restore + retry if the ESTALE error comes up. Yeah, but how is that relevant to the fact, that after path_release_*() _nothing_ will be valid in the nameidata, not nd-mnt, not nd-dentry, and not nd-intent.open.file. So what's the point in setting it to NULL if it must never be used anyway? If there's any chance that the path walk restart thing will invoke the filesystems open code twice (I doubt it), then the filesystem must make sure to check intent.open.file, whether it has already been set, and fput() it before setting it another time. The only user of that code is NFSv4, and we will never even try to allocate a file if the OPEN call returned an ESTALE. That's why I doubted that this is an issue. Why do we want to keep this behaviour? It is undocumented, it is non-posix, and it appears to do nothing you cannot do with the existing access() call. Are there any applications using it? If so, which ones, and why? I have absolutely no idea. Looking closer, there's a problem with O_TRUNC as well: namei_flags = flags; if ((namei_flags+1) O_ACCMODE) namei_flags++; if (namei_flags O_TRUNC) namei_flags |= 2; So if flags is O_RDONLY|O_TRUNC, intent.open.flags will be FMODE_WRITE|FMODE_READ|O_TRUNC, but filp-f_mode will be FMODE_READ. That is a bug that needs to be fixed in the intent.open.flags. We don't ever want to be opening the file for writing at the filesystem level when the user specified open for read. No, it's being opened for read. The namei_flags (and hence intent.open.flags) will have FMODE_WRITE, so that the permission is checked for write. So I thing the bug is not in the calculation of namei_flags, rather the fact that intent.open.flags is set to namei_flags, rather than the original open flags. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
ty den 09.08.2005 Klokka 19:44 (+0200) skreiv Miklos Szeredi: There is quite a bit of code out there that assumes it is free to stuff things into nd-mnt and nd-dentry. Some of it is Al Viro's code, some of it is from other people. For instance, the ESTALE handling will just save nd-mnt/nd-dentry before calling __link_path_walk(), then restore + retry if the ESTALE error comes up. Yeah, but how is that relevant to the fact, that after path_release_*() _nothing_ will be valid in the nameidata, not nd-mnt, not nd-dentry, and not nd-intent.open.file. So what's the point in setting it to NULL if it must never be used anyway? path_release() does _not_ invalidate the nameidata. Look for instance at __emul_lookup_dentry(), which clearly makes use of that fact. If there's any chance that the path walk restart thing will invoke the filesystems open code twice (I doubt it), then the filesystem must make sure to check intent.open.file, whether it has already been set, and fput() it before setting it another time. The only user of that code is NFSv4, and we will never even try to allocate a file if the OPEN call returned an ESTALE. That's why I doubted that this is an issue. Why do we want to keep this behaviour? It is undocumented, it is non-posix, and it appears to do nothing you cannot do with the existing access() call. Are there any applications using it? If so, which ones, and why? I have absolutely no idea. Looking closer, there's a problem with O_TRUNC as well: namei_flags = flags; if ((namei_flags+1) O_ACCMODE) namei_flags++; if (namei_flags O_TRUNC) namei_flags |= 2; So if flags is O_RDONLY|O_TRUNC, intent.open.flags will be FMODE_WRITE|FMODE_READ|O_TRUNC, but filp-f_mode will be FMODE_READ. That is a bug that needs to be fixed in the intent.open.flags. We don't ever want to be opening the file for writing at the filesystem level when the user specified open for read. No, it's being opened for read. The namei_flags (and hence intent.open.flags) will have FMODE_WRITE, so that the permission is checked for write. Firstly, the open_namei() flags field is not a permissions field. It contains open mode information. The calculation of the open permissions flags is done by open_namei() itself. Secondly, what advantage is there in allowing callers of open_namei() to be able to override the MAY_WRITE check when doing open(O_TRUNC)? This is a calculation that should be done _once_ in order to always get it right, and it should therefore be done in open_namei() together with the rest of the permissions calculation. Cheers, Trond - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
There is quite a bit of code out there that assumes it is free to stuff things into nd-mnt and nd-dentry. Some of it is Al Viro's code, some of it is from other people. For instance, the ESTALE handling will just save nd-mnt/nd-dentry before calling __link_path_walk(), then restore + retry if the ESTALE error comes up. Yeah, but how is that relevant to the fact, that after path_release_*() _nothing_ will be valid in the nameidata, not nd-mnt, not nd-dentry, and not nd-intent.open.file. So what's the point in setting it to NULL if it must never be used anyway? path_release() does _not_ invalidate the nameidata. Look for instance at __emul_lookup_dentry(), which clearly makes use of that fact. Trond, wake up! __emul_lookup_dentry() does nothing of the sort. Neither does anything else. In theory it could, but that's not a reason to do a confusing thing like that. Firstly, the open_namei() flags field is not a permissions field. It contains open mode information. The calculation of the open permissions flags is done by open_namei() itself. Based on flags. It's just a FMODE_* - MAY_* transformation Secondly, what advantage is there in allowing callers of open_namei() to be able to override the MAY_WRITE check when doing open(O_TRUNC)? This is a calculation that should be done _once_ in order to always get it right, and it should therefore be done in open_namei() together with the rest of the permissions calculation. I think the _only_ caller of open_namei() is filp_open(), so this is not much of an issue, but yeah, you could do it that way too. Or you could initialize nameidata from filp_open(). Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
ty den 09.08.2005 Klokka 22:42 (+0200) skreiv Miklos Szeredi: Trond, wake up! __emul_lookup_dentry() does nothing of the sort. Neither does anything else. In theory it could, but that's not a reason to do a confusing thing like that. Really? static int __emul_lookup_dentry(const char *name, struct nameidata *nd) { . if (path_walk(name, nd) == 0) { if (nd-dentry-d_inode) { dput(old_dentry); mntput(old_mnt); return 1; } path_release(nd); } nd-dentry = old_dentry; nd-mnt = old_mnt; nd-last = last; nd-last_type = last_type; } return 1; } Which is called by path_lookup(), which again returns success, and expects the user to call path_release() later. Firstly, the open_namei() flags field is not a permissions field. It contains open mode information. The calculation of the open permissions flags is done by open_namei() itself. Based on flags. It's just a FMODE_* - MAY_* transformation Secondly, what advantage is there in allowing callers of open_namei() to be able to override the MAY_WRITE check when doing open(O_TRUNC)? This is a calculation that should be done _once_ in order to always get it right, and it should therefore be done in open_namei() together with the rest of the permissions calculation. I think the _only_ caller of open_namei() is filp_open(), so this is not much of an issue, but yeah, you could do it that way too. Currently, yes. The only caller of open_namei() is filp_open(). That was not always the case previously. If we think it will never be the case in the future, then there is an argument for merging the two and/or making open_namei() and inlined function. Cheers, Trond - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
Really? static int __emul_lookup_dentry(const char *name, struct nameidata *nd) { . if (path_walk(name, nd) == 0) { if (nd-dentry-d_inode) { dput(old_dentry); mntput(old_mnt); return 1; } path_release(nd); } nd-dentry = old_dentry; nd-mnt = old_mnt; nd-last = last; nd-last_type = last_type; } return 1; } I see what you are getting at. But notice, that every (relevant) field of nameidata is reinitialized, except nd-flags, which is _obviously_ not changed by either path_walk() or path_release(). So your argument doesn't hold. You basically argue, that intent.open.file must be zeroed, because someone might call path_release_open_intent() twice, which very obviously does not make any sense, unless it does some magic like the above (which it should not), in which case it might as well be aware, that it has to save/restore the intent.open.file field as well. Currently, yes. The only caller of open_namei() is filp_open(). That was not always the case previously. If we think it will never be the case in the future, then there is an argument for merging the two and/or making open_namei() and inlined function. Yes, that would make sense. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] atomic open(..., O_CREAT | ...)
ty den 09.08.2005 Klokka 00:27 (+0200) skreiv Miklos Szeredi: I'd like to make my filesystem be able to do file creation and opening atomically. This is needed for filesystems which cannot separate checking open permission from the actual open operation. Usually any filesystem served from userspace by an unprivileged (no CAP_DAC_OVERRIDE) process will be such (ftp, sftp, etc.). With nameidata-intent.open.* it is possible to do the actual open from -lookup() or -create(). However there's no easy way to associate the 'struct file *' returned by dentry_open() with the filesystem's private file object. Also if there's some error after the file has been opened but before a successful return of the file pointer, the filesystem has no way to know that it should destroy the private file object. We've already got a patch that does this, and that I'm queueing up for inclusion. See http://client.linux-nfs.org/Linux-2.6.x/2.6.12/linux-2.6.12-63-open_file_intents.dif As for the orig flags thing. What is the point of that? Cheers, Trond - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html