Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On 8/14/2020 1:05 PM, Linus Torvalds (torva...@linux-foundation.org) wrote:
> Honestly, I really think you may want an extended [f]statfs(), not
> some mount tracking.
>
> Linus

Linus,

Thank you for the reply. Perhaps some of the communication disconnect is
due to which thread this discussion is taking place on. My understanding
is that there were two separate pull requests. One for mount
notifications and the other for filesystem information.

This thread is derived from the pull request entitled "Filesystem
Information" and my response was a request for use cases. The assumption
being that the request was related to the subject.

I apologize for creating unnecessary noise due to my misinterpretation
of your intended question. The use cases I described and the types of
filesystem information required to satisfy them do not require mount
tracking.

Jeffrey Altman
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 11:30 PM Al Viro wrote:
>
> On Wed, Aug 12, 2020 at 07:33:26PM +0100, Al Viro wrote:
>
> > BTW, what would such opened files look like from /proc/*/fd/* POV? And
> > what would happen if you walk _through_ that symlink, with e.g. ".."
> > following it? Or with names of those attributes, for that matter...
> > What about a normal open() of such a sucker? It won't know where to
> > look for your ->private_data...
> >
> > FWIW, you keep referring to regularity of this stuff from the syscall
> > POV, but it looks like you have no real idea of what subset of the
> > things available for normal descriptors will be available for those.
>
> Another question: what should happen with that sucker on umount of
> the filesystem holding the underlying object? Should it be counted
> as pinning that fs?

Obviously yes.

> Who controls what's in that tree?

It could be several entities:

- global (like mount info)
- per inode (like xattr)
- per fs (fs specific inode attributes)
- etc..

> If we plan to have xattrs there,
> will they be in a flat tree, or should it mirror the hierarchy of
> xattrs? When is it populated? open() time? What happens if we
> add/remove an xattr after that point?

From the interface perspective it would be dynamic (i.e. would get
updated on open or read). From an implementation POV it could have
caching, but that's not how I'd start out.

> If we open the same file several times, what should we get? A full
> copy of the tree every time, with all coherency being up to whatever's
> putting attributes there?
>
> What are the permissions needed to do lookups in that thing?

That would depend on what would need to be looked up. Top level would
be world readable, otherwise it would be up to the attribute/group.

>
> All of that is about semantics and the answers are needed before we
> start looking into implementations. "Whatever my implementation
> does" is _not_ a good way to go, especially since that'll be cast
> in stone as soon as API becomes exposed to userland...

Fine.

Thanks,
Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 8:33 PM Al Viro wrote: > > On Wed, Aug 12, 2020 at 06:39:11PM +0100, Al Viro wrote: > > On Wed, Aug 12, 2020 at 07:16:37PM +0200, Miklos Szeredi wrote: > > > On Wed, Aug 12, 2020 at 6:33 PM Al Viro wrote: > > > > > > > > On Wed, Aug 12, 2020 at 05:13:14PM +0200, Miklos Szeredi wrote: > > > > > > > > Why does it have to have a struct mount? It does not have to use > > > > > dentry/mount based path lookup. > > > > > > > > What the fuck? So we suddenly get an additional class of objects > > > > serving as kinda-sorta analogues of dentries *AND* now struct file > > > > might refer to that instead of a dentry/mount pair - all on the VFS > > > > level? And so do all the syscalls you want to allow for such > > > > "pathnames"? > > > > > > The only syscall I'd want to allow is open, everything else would be > > > on the open files themselves. > > > > > > file->f_path can refer to an anon mount/inode, the real object is > > > referred to by file->private_data. > > > > > > The change to namei.c would be on the order of ~10 lines. No other > > > parts of the VFS would be affected. > > > > If some of the things you open are directories (and you *have* said that > > directories will be among those just upthread, and used references to > > readdir() as argument in favour of your approach elsewhere in the thread), > > you will have to do something about fchdir(). And that's the least of > > the issues. > > BTW, what would such opened files look like from /proc/*/fd/* POV? And > what would happen if you walk _through_ that symlink, with e.g. ".." > following it? Or with names of those attributes, for that matter... > What about a normal open() of such a sucker? It won't know where to > look for your ->private_data... > > FWIW, you keep refering to regularity of this stuff from the syscall > POV, but it looks like you have no real idea of what subset of the > things available for normal descriptors will be available for those. 
I have said that IMO using a non-seekable anon-file would be okay for those. All the answers fall out of that: nothing works on those fd's except read/write/getdents. No fchdir(), no /proc/*/fd deref, etc... Starting with a very limited functionality and expanding on that if necessary is I think a good way to not get bogged down with the details. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 8:53 PM Jeffrey E Altman wrote:
>
> For the AFS community, fsinfo offers a method of exposing some server
> and volume properties that are obtained via "path ioctls" in OpenAFS and
> AuriStorFS. Some examples of properties that might be exposed include
> answers to questions such as:

Note that several of the questions you ask aren't necessarily
mount-related at all.

Doing it by mount ends up being completely the wrong thing.

For example, at a minimum, these guys may well be per-directory (or
even possibly per-file):

> * where is a mounted volume hosted? which fileservers, named by uuid
> * what is the block size? 1K, 4K, ...
> * are directories just-send-8, case-sensitive, case-preserving, or
>   case-insensitive?
> * if not just-send-8, what character set is used?
> * if Unicode, what normalization rules? etc.
> * what volume security policy (authn, integ, priv) is assigned, if any?
> * what is the replication policy, if any?
> * what is the volume encryption policy, if any?

and trying to solve this with some kind of "mount info" is pure garbage.

Honestly, I really think you may want an extended [f]statfs(), not
some mount tracking.

Linus
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, 12.08.20 11:18, Linus Torvalds (torva...@linux-foundation.org) wrote:

> On Tue, Aug 11, 2020 at 5:05 PM David Howells wrote:
> >
> > Well, the start of it was my proposal of an fsinfo() system call.
>
> Ugh. Ok, it's that thing.
>
> This all seems *WAY* over-designed - both your fsinfo and Miklos' version.
>
> What's wrong with fstatfs()? All the extra magic metadata seems to not
> really be anything people really care about.
>
> What people are actually asking for seems to be some unique mount ID,
> and we have 16 bytes of spare information in 'struct statfs64'.

statx() exposes a `stx_mnt_id` field nowadays, so that's easy and quick
to get. It's just so inefficient matching that up with
/proc/self/mountinfo then.

And it still won't give you any of the fs capability bits (time
granularity, max file size, features, …), because the kernel doesn't
expose that at all right now.

OTOH I'd already be quite happy if struct statfs64 would expose
f_features, f_max_fsize, f_time_granularity, f_charset_case_handling
fields or so.

Lennart

--
Lennart Poettering, Berlin
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On 8/12/2020 2:18 PM, Linus Torvalds (torva...@linux-foundation.org) wrote: > What's wrong with fstatfs()? All the extra magic metadata seems to not > really be anything people really care about. > > What people are actually asking for seems to be some unique mount ID, > and we have 16 bytes of spare information in 'struct statfs64'. > > All the other fancy fsinfo stuff seems to be "just because", and like > complete overdesign. Hi Linus, Is there any existing method by which userland applications can determine the properties of the filesystem in which a directory or file is stored in a filesystem agnostic manner? Over the past year I've observed the opendev/openstack community struggle with performance issues caused by rsync's inability to determine if the source and destination object's last update time have the same resolution and valid time range. If the source file system supports 100 nanosecond granularity and the destination file system supports one second granularity, any source file with a non-zero fractional seconds timestamp will appear to have changed compared to the copy in the destination filesystem which discarded the fractional seconds during the last sync. Sure, the end user could use the --modify-window=1 option to inform rsync to add fuzz to the comparisons, but that introduces the possibility that a file updated a fraction of a second after an rsync execution would not synchronize the file on the next run when both source and target have fine grained timestamps. If the userland sync processes have access to the source and destination filesystem time capabilities, they can make more intelligent decisions without explicit user input. At a minimum, the timestamp properties that are important to know include the range of valid timestamps and the resolution. Some filesystems support unsigned 32-bit time starting with UNIX epoch. Others signed 32-bit time with UNIX epoch. Still others FAT, NTFS, etc use alternative epochs and range and resolutions. 
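To make the rsync timestamp problem concrete, here is a minimal sketch of the comparison a sync tool could do if it knew both filesystems' timestamp granularity. The function names and the idea of passing the granularity in nanoseconds are assumptions for illustration; no existing kernel interface supplies these values, which is the point of the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>   /* struct timespec */

/*
 * Round a timestamp down to a multiple of gran_ns nanoseconds.
 * Assumes 1 <= gran_ns <= 1000000000, that gran_ns divides one second
 * evenly, and that ts.tv_nsec is normalized (0 <= tv_nsec < 1e9).
 */
static struct timespec truncate_to(struct timespec ts, long gran_ns)
{
    assert(gran_ns > 0 && gran_ns <= 1000000000L);
    ts.tv_nsec -= ts.tv_nsec % gran_ns;
    return ts;
}

/*
 * Compare two modification times at the coarser of the two granularities,
 * so a destination that dropped fractional seconds still compares equal.
 */
static bool mtime_unchanged(struct timespec src, long src_gran_ns,
                            struct timespec dst, long dst_gran_ns)
{
    long coarse = src_gran_ns > dst_gran_ns ? src_gran_ns : dst_gran_ns;
    struct timespec a = truncate_to(src, coarse);
    struct timespec b = truncate_to(dst, coarse);

    return a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec;
}
```

Unlike a fixed --modify-window=1, this only adds fuzz when one side is actually coarse: when both sides report 100 ns granularity, a 100 ns difference still counts as a change.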
Another case where lack of filesystem properties is problematic is
"df --local" which currently relies upon string comparisons of file
system name strings to determine if the underlying file system is local
or remote. This requires that the gnulib maintainers have knowledge of
all file systems implementations, their published names, and which
category they belong to. Patches have been accepted in the past year to
add "smb3", "afs", and "gpfs" to the list of remote file systems. There
are many more remote filesystems that have yet to be added including
"cephfs", "lustre", "gluster", etc.

In many cases, the filesystem properties cannot be inferred from the
filesystem name. For network file systems, these properties might depend
upon the remote server capabilities or even the properties associated
with a particular volume or share. Consider the case of a remote file
server that supports 64-bit 100ns time but which for backward
compatibility exports certain volumes or shares with more restrictive
capabilities. Or the case of a network file system protocol that has
evolved over time and gained new capabilities.

For the AFS community, fsinfo offers a method of exposing some server
and volume properties that are obtained via "path ioctls" in OpenAFS and
AuriStorFS. Some examples of properties that might be exposed include
answers to questions such as:

 * what is the volume cell id? perhaps a uuid.
 * what is the volume id in the cell? unsigned 64-bit integer
 * where is a mounted volume hosted? which fileservers, named by uuid
 * what is the block size? 1K, 4K, ...
 * how many blocks are in use or available?
 * what is the quota (thin provisioning), if any?
 * what is the reserved space (fat provisioning), if any?
 * how many vnodes are present?
 * what is the vnode count limit, if any?
 * when was the volume created and last updated?
 * what is the file size limit?
 * are byte range locks supported?
 * are mandatory locks supported?
 * how many entries can be created within a directory?
 * are cross-directory hard links supported?
 * are directories just-send-8, case-sensitive, case-preserving, or
   case-insensitive?
 * if not just-send-8, what character set is used?
 * if Unicode, what normalization rules? etc.
 * are per-object acls supported?
 * what volume maximum acl is assigned, if any?
 * what volume security policy (authn, integ, priv) is assigned, if any?
 * what is the replication policy, if any?
 * what is the volume encryption policy, if any?
 * what is the volume compression policy, if any?
 * are server-to-server copies supported?
 * which of atime, ctime and mtime does the volume support?
 * what is the permitted timestamp range and resolution?
 * are xattrs supported?
 * what is the xattr maximum name length?
 * what is the xattr maximum object size?
 * is the volume currently reachable?
 * is the volume immutable?
 * etc ...

It's true that there isn't widespread use of these filesystem properties by today's userland
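The "df --local" problem described earlier in this message can be sketched as a name lookup. The table below is purely illustrative: it is not gnulib's actual list, the function name is made up, and "nfs" and "cifs" were added only to make the example look plausible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative table only; a real tool has to keep such a list up to date. */
static const char *const remote_fs_names[] = {
    "nfs", "cifs", "smb3", "afs", "gpfs",
};

/* Classify a filesystem as remote purely by its type name. */
static bool is_remote_by_name(const char *fstype)
{
    size_t count = sizeof remote_fs_names / sizeof remote_fs_names[0];

    assert(fstype != NULL);

    for (size_t i = 0; i < count; i++)
        if (strcmp(fstype, remote_fs_names[i]) == 0)
            return true;
    return false;   /* unknown names silently count as local */
}
```

With this table, "afs" is reported as remote while "cephfs" falls through to the local default, which is the kind of misclassification a filesystem-reported property would avoid.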
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 07:33:26PM +0100, Al Viro wrote: > BTW, what would such opened files look like from /proc/*/fd/* POV? And > what would happen if you walk _through_ that symlink, with e.g. ".." > following it? Or with names of those attributes, for that matter... > What about a normal open() of such a sucker? It won't know where to > look for your ->private_data... > > FWIW, you keep refering to regularity of this stuff from the syscall > POV, but it looks like you have no real idea of what subset of the > things available for normal descriptors will be available for those. Another question: what should happen with that sucker on umount of the filesystem holding the underlying object? Should it be counted as pinning that fs? Who controls what's in that tree? If we plan to have xattrs there, will they be in a flat tree, or should it mirror the hierarchy of xattrs? When is it populated? open() time? What happens if we add/remove an xattr after that point? If we open the same file several times, what should we get? A full copy of the tree every time, with all coherency being up to whatever's putting attributes there? What are the permissions needed to do lookups in that thing? All of that is about semantics and the answers are needed before we start looking into implementations. "Whatever my implementation does" is _not_ a good way to go, especially since that'll be cast in stone as soon as API becomes exposed to userland...
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 06:39:11PM +0100, Al Viro wrote: > On Wed, Aug 12, 2020 at 07:16:37PM +0200, Miklos Szeredi wrote: > > On Wed, Aug 12, 2020 at 6:33 PM Al Viro wrote: > > > > > > On Wed, Aug 12, 2020 at 05:13:14PM +0200, Miklos Szeredi wrote: > > > > > > Why does it have to have a struct mount? It does not have to use > > > > dentry/mount based path lookup. > > > > > > What the fuck? So we suddenly get an additional class of objects > > > serving as kinda-sorta analogues of dentries *AND* now struct file > > > might refer to that instead of a dentry/mount pair - all on the VFS > > > level? And so do all the syscalls you want to allow for such "pathnames"? > > > > The only syscall I'd want to allow is open, everything else would be > > on the open files themselves. > > > > file->f_path can refer to an anon mount/inode, the real object is > > referred to by file->private_data. > > > > The change to namei.c would be on the order of ~10 lines. No other > > parts of the VFS would be affected. > > If some of the things you open are directories (and you *have* said that > directories will be among those just upthread, and used references to > readdir() as argument in favour of your approach elsewhere in the thread), > you will have to do something about fchdir(). And that's the least of > the issues. BTW, what would such opened files look like from /proc/*/fd/* POV? And what would happen if you walk _through_ that symlink, with e.g. ".." following it? Or with names of those attributes, for that matter... What about a normal open() of such a sucker? It won't know where to look for your ->private_data... FWIW, you keep refering to regularity of this stuff from the syscall POV, but it looks like you have no real idea of what subset of the things available for normal descriptors will be available for those.
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 5:05 PM David Howells wrote:
>
> Well, the start of it was my proposal of an fsinfo() system call.

Ugh. Ok, it's that thing.

This all seems *WAY* over-designed - both your fsinfo and Miklos' version.

What's wrong with fstatfs()? All the extra magic metadata seems to not
really be anything people really care about.

What people are actually asking for seems to be some unique mount ID,
and we have 16 bytes of spare information in 'struct statfs64'.

All the other fancy fsinfo stuff seems to be "just because", and like
complete overdesign.

Let's not add system calls just because we can.

Linus
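For reference, this is roughly what fstatfs() reports today. The sketch below is only an illustration of the existing call; the helper name print_fs_summary() and the choice of fields to print are assumptions made for this example.

```c
#include <assert.h>
#include <fcntl.h>      /* open() */
#include <stdio.h>
#include <sys/vfs.h>    /* fstatfs(), struct statfs */
#include <unistd.h>     /* close() */

/*
 * Hypothetical helper: print the basic statfs fields for the filesystem
 * containing `path`. Returns 0 on success, -1 on failure.
 */
static int print_fs_summary(const char *path)
{
    struct statfs sfs;
    int fd, rc;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    rc = fstatfs(fd, &sfs);
    close(fd);
    if (rc != 0)
        return -1;

    printf("%s: type=0x%lx bsize=%ld namelen=%ld blocks=%llu bfree=%llu\n",
           path,
           (unsigned long)sfs.f_type,
           (long)sfs.f_bsize,
           (long)sfs.f_namelen,
           (unsigned long long)sfs.f_blocks,
           (unsigned long long)sfs.f_bfree);
    return 0;
}
```

The structure gives the filesystem type magic, block size, block and inode counts, and maximum name length, but no mount ID and none of the capability information requested elsewhere in the thread.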
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 07:16:37PM +0200, Miklos Szeredi wrote: > On Wed, Aug 12, 2020 at 6:33 PM Al Viro wrote: > > > > On Wed, Aug 12, 2020 at 05:13:14PM +0200, Miklos Szeredi wrote: > > > > Why does it have to have a struct mount? It does not have to use > > > dentry/mount based path lookup. > > > > What the fuck? So we suddenly get an additional class of objects > > serving as kinda-sorta analogues of dentries *AND* now struct file > > might refer to that instead of a dentry/mount pair - all on the VFS > > level? And so do all the syscalls you want to allow for such "pathnames"? > > The only syscall I'd want to allow is open, everything else would be > on the open files themselves. > > file->f_path can refer to an anon mount/inode, the real object is > referred to by file->private_data. > > The change to namei.c would be on the order of ~10 lines. No other > parts of the VFS would be affected. If some of the things you open are directories (and you *have* said that directories will be among those just upthread, and used references to readdir() as argument in favour of your approach elsewhere in the thread), you will have to do something about fchdir(). And that's the least of the issues. > Maybe I'm optimistic; we'll > see... > Now off to something completely different. Back on Tuesday. ... after the window closes. You know, it's really starting to look like rather nasty tactical games...
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 6:33 PM Al Viro wrote: > > On Wed, Aug 12, 2020 at 05:13:14PM +0200, Miklos Szeredi wrote: > > Why does it have to have a struct mount? It does not have to use > > dentry/mount based path lookup. > > What the fuck? So we suddenly get an additional class of objects > serving as kinda-sorta analogues of dentries *AND* now struct file > might refer to that instead of a dentry/mount pair - all on the VFS > level? And so do all the syscalls you want to allow for such "pathnames"? The only syscall I'd want to allow is open, everything else would be on the open files themselves. file->f_path can refer to an anon mount/inode, the real object is referred to by file->private_data. The change to namei.c would be on the order of ~10 lines. No other parts of the VFS would be affected. Maybe I'm optimistic; we'll see... Now off to something completely different. Back on Tuesday. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 05:13:14PM +0200, Miklos Szeredi wrote: > > Lovely. And what of fchdir() to those? > > Not allowed. Not allowed _how_? Existing check is "is it a directory"; what do you propose? IIRC, you've mentioned using readdir() in that context, so it's not that you only allow to open the leaves there. > > > > Is that a flat space, or can they be directories?" > > > > > > Yes it has a directory tree. But you can't mkdir, rename, link, > > > symlink, etc on anything in there. > > > > That kills the "shared inode" part - you'll get deadlocks from > > hell that way. > > No. The shared inode is not for lookup, just for the open file. Bloody hell... So what inodes are you using for lookups? And that thing you would be passing to readdir() - what inode will _that_ have? > > Next: what will that tree be attached to? As in, "what's the parent > > of its root"? And while we are at it, what will be the struct mount > > used with those - same as the original file, something different > > attached to it, something created on the fly for each pathwalk and > > lazy-umounted? And see above re fchdir() - if they can be directories, > > it's very much in the game. > > Why does it have to have a struct mount? It does not have to use > dentry/mount based path lookup. What the fuck? So we suddenly get an additional class of objects serving as kinda-sorta analogues of dentries *AND* now struct file might refer to that instead of a dentry/mount pair - all on the VFS level? And so do all the syscalls you want to allow for such "pathnames"? Sure, that avoids all questions about dcache interactions - by growing a replacement layer and making just about everything in fs/namei.c, fs/open.c, etc. special-case the handling of that crap. But yes, the syscall-level interface will be simple. Wonderful. I really hope that's not what you have in mind, though.
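The existing fchdir() check referred to above ("is it a directory") can be observed from userspace. This is a minimal sketch using only ordinary descriptors; try_fchdir() is a name invented for the example, and it says nothing about how the proposed attribute descriptors would behave.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* open() */
#include <unistd.h>     /* fchdir(), close() */

/*
 * Hypothetical helper: open `path` and try to fchdir() to it.
 * Returns 0 if fchdir() accepted the descriptor, otherwise the errno
 * reported by open() or fchdir(). Note that a successful call changes
 * the working directory of the calling process.
 */
static int try_fchdir(const char *path)
{
    int fd, err = 0;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;
    if (fchdir(fd) != 0)
        err = errno;
    close(fd);
    return err;
}
```

For a directory such as "/" this returns 0; for a non-directory such as /dev/null it returns ENOTDIR. A descriptor that supports readdir() but must refuse fchdir() would need a different rejection rule, which is the question being asked.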
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
Miklos Szeredi wrote:

> Why does it have to have a struct mount? It does not have to use
> dentry/mount based path lookup.

file->f_path.mnt

David
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 5:08 PM Al Viro wrote: > > On Wed, Aug 12, 2020 at 04:46:20PM +0200, Miklos Szeredi wrote: > > > > "Can those suckers be passed to > > > ...at() as starting points? > > > > No. > > Lovely. And what of fchdir() to those? Not allowed. > Are they all non-directories? > Because the starting point of ...at() can be simulated that way... > > > > Can they be bound in namespace? > > > > No. > > > > > Can something be bound *on* them? > > > > No. > > > > > What do they have for inodes > > > and what maintains their inumbers (and st_dev, while we are at > > > it)? > > > > Irrelevant. Can be some anon dev + shared inode. > > > > The only attribute of an attribute that I can think of that makes > > sense would be st_size, but even that is probably unimportant. > > > > > Can _they_ have secondaries like that (sensu Swift)? > > > > Reference? > > http://www.online-literature.com/swift/3515/ > So, naturalists observe, a flea > Has smaller fleas that on him prey; > And these have smaller still to bite 'em, > And so proceed ad infinitum. > of course ;-) > IOW, can the things in those trees have secondary trees on them, etc.? > Not "will they have it in your originally intended use?" - "do we need > the architecture of the entire thing to be capable to deal with that?" No. > > > > Is that a flat space, or can they be directories?" > > > > Yes it has a directory tree. But you can't mkdir, rename, link, > > symlink, etc on anything in there. > > That kills the "shared inode" part - you'll get deadlocks from > hell that way. No. The shared inode is not for lookup, just for the open file. > "Can't mkdir" doesn't save you from that. BTW, > what of unlink()? If the tree shape is not a hardwired constant, > you get to decide how it's initially populated... > > Next: what will that tree be attached to? As in, "what's the parent > of its root"? 
And while we are at it, what will be the struct mount > used with those - same as the original file, something different > attached to it, something created on the fly for each pathwalk and > lazy-umounted? And see above re fchdir() - if they can be directories, > it's very much in the game. Why does it have to have a struct mount? It does not have to use dentry/mount based path lookup. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 04:46:20PM +0200, Miklos Szeredi wrote: > > "Can those suckers be passed to > > ...at() as starting points? > > No. Lovely. And what of fchdir() to those? Are they all non-directories? Because the starting point of ...at() can be simulated that way... > > Can they be bound in namespace? > > No. > > > Can something be bound *on* them? > > No. > > > What do they have for inodes > > and what maintains their inumbers (and st_dev, while we are at > > it)? > > Irrelevant. Can be some anon dev + shared inode. > > The only attribute of an attribute that I can think of that makes > sense would be st_size, but even that is probably unimportant. > > > Can _they_ have secondaries like that (sensu Swift)? > > Reference? http://www.online-literature.com/swift/3515/ So, naturalists observe, a flea Has smaller fleas that on him prey; And these have smaller still to bite 'em, And so proceed ad infinitum. of course ;-) IOW, can the things in those trees have secondary trees on them, etc.? Not "will they have it in your originally intended use?" - "do we need the architecture of the entire thing to be capable to deal with that?" > > Is that a flat space, or can they be directories?" > > Yes it has a directory tree. But you can't mkdir, rename, link, > symlink, etc on anything in there. That kills the "shared inode" part - you'll get deadlocks from hell that way. "Can't mkdir" doesn't save you from that. BTW, what of unlink()? If the tree shape is not a hardwired constant, you get to decide how it's initially populated... Next: what will that tree be attached to? As in, "what's the parent of its root"? And while we are at it, what will be the struct mount used with those - same as the original file, something different attached to it, something created on the fly for each pathwalk and lazy-umounted? And see above re fchdir() - if they can be directories, it's very much in the game.
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 4:40 PM Al Viro wrote:
>
> On Wed, Aug 12, 2020 at 09:23:23AM +0200, Miklos Szeredi wrote:
>
> > Anyway, starting with just introducing the alt namespace without
> > unification seems to be a good first step. If that turns out to be
> > workable, we can revisit unification later.
>
> Start with coming up with answers to the questions on semantics
> upthread. To spare you the joy of digging through the branches
> of that thread, how's that for starters?
>
> "Can those suckers be passed to
> ...at() as starting points?

No.

> Can they be bound in namespace?

No.

> Can something be bound *on* them?

No.

> What do they have for inodes
> and what maintains their inumbers (and st_dev, while we are at
> it)?

Irrelevant. Can be some anon dev + shared inode.

The only attribute of an attribute that I can think of that makes
sense would be st_size, but even that is probably unimportant.

> Can _they_ have secondaries like that (sensu Swift)?

Reference?

> Is that a flat space, or can they be directories?"

Yes it has a directory tree. But you can't mkdir, rename, link,
symlink, etc on anything in there.

Thanks,
Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 09:23:23AM +0200, Miklos Szeredi wrote: > Anyway, starting with just introducing the alt namespace without > unification seems to be a good first step. If that turns out to be > workable, we can revisit unification later. Start with coming up with answers to the questions on semantics upthread. To spare you the joy of digging through the branches of that thread, how's that for starters? "Can those suckers be passed to ...at() as starting points? Can they be bound in namespace? Can something be bound *on* them? What do they have for inodes and what maintains their inumbers (and st_dev, while we are at it)? Can _they_ have secondaries like that (sensu Swift)? Is that a flat space, or can they be directories?"
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
Miklos Szeredi wrote: > The point is that generic operations already exist and no need to add > new, specialized ones to access metadata. open and read already exist, yes, but the metadata isn't currently in convenient inodes and dentries that you can just walk through. So you're going to end up with a specialised filesystem instead, I suspect. Basically, it's the same as your do-everything-through-/proc/self/fds/ approach. And it's going to be heavier. I don't know if you're planning on creating a superblock each time you do an O_ALT open, but you will end up creating some inodes, dentries and a file - even before you get to the reading bit. David
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 3:54 PM David Howells wrote:
>
> Linus Torvalds wrote:
>
> > IOW, if you do something more along the lines of
> >
> >        fd = open("foo/bar", O_PATH);
> >        metadatafd = openat(fd, "metadataname", O_ALT);
> >
> > it might be workable.
>
> What is it going to walk through? You need to end up with an inode and dentry
> from somewhere.
>
> It sounds like this would have to open up a procfs-like magic filesystem, and
> walk into it. But how would that actually work? Would you create a new
> superblock each time you do this, labelled with the starting object (say the
> dentry for "foo/bar" in this case), and then walk from the root?
>
> An alternative, maybe, could be to make a new dentry type, say, and include it
> in the superblock of the object being queried - and let the filesystems deal
> with it. That would mean that non-dir dentries would then have virtual
> children. You could then even use this to implement resource forks...
>
> Another alternative would be to note O_ALT and then skip pathwalk entirely,
> but just use the name as a key to the attribute, creating an anonfd to read
> it. But then why use openat() at all? You could instead do:
>
>        metadatafd = openmeta(fd, "metadataname");
>
> and save the page flag. You could even merge the two opens and do:
>
>        metadatafd = openmeta("foo/bar", "metadataname");
>
> Why not even combine this with Miklos's readfile() idea:
>
>        readmeta(AT_FDCWD, "foo/bar", "metadataname", buf, sizeof(buf));

And writemeta() and createmeta() and readdirmeta() and ...

The point is that generic operations already exist and no need to add
new, specialized ones to access metadata.

Thanks,
Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 3:33 PM David Howells wrote: > > Miklos Szeredi wrote: > > > You said yourself, that what's really needed is e.g. consistent > > snapshot of a complete mount tree topology. And to get the complete > > topology FSINFO_ATTR_MOUNT_TOPOLOGY and FSINFO_ATTR_MOUNT_CHILDREN are > > needed for *each* individual mount. > > That's not entirely true. > > FSINFO_ATTR_MOUNT_ALL can be used instead of FSINFO_ATTR_MOUNT_CHILDREN if you > want to scan an entire subtree in one go. It returns the same record type. > > The result from ALL/CHILDREN includes sufficient information to build the > tree. That only requires the parent ID. All the rest of the information > TOPOLOGY exposes is to do with propagation. > > Now, granted, I didn't include all of the topology info in the records > returned by ALL/CHILDREN because I don't expect it to change very often. But > you can check the event counter supplied with each record to see if it might > have changed - and then call TOPOLOGY on the ones that changed. IDGI, you have all these interfaces but how will they be used? E.g. one wants to build a consistent topology together with propagation and attributes. That would start with FSINFO_ATTR_MOUNT_ALL, then iterate the given mounts calling FSINFO_ATTR_MOUNT_INFO and FSINFO_ATTR_MOUNT_TOPOLOGY for each. Then when done, check the subtree notification counter with FSINFO_ATTR_MOUNT_INFO on the top one to see if anything has changed in the meantime. If it has, the whole process needs to be restarted to see which has been changed (unless notification is also enabled). How does the atomicity of FSINFO_ATTR_MOUNT_ALL help with that? The same could be done with just FSINFO_ATTR_MOUNT_CHILDREN. And more importantly does level of consistency matter at all? There's no such thing for directory trees, why are mount trees different in this respect? 
> Text interfaces are also a PITA, especially when you may get multiple pieces > of information returned in one buffer and especially when you throw in > character escaping. Of course, we can do it - and we do do it all over - but > that doesn't make it efficient. Agreed. The format of text interfaces matters very much. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
Linus Torvalds wrote: > IOW, if you do something more along the lines of > >fd = open(""foo/bar", O_PATH); >metadatafd = openat(fd, "metadataname", O_ALT); > > it might be workable. What is it going to walk through? You need to end up with an inode and dentry from somewhere. It sounds like this would have to open up a procfs-like magic filesystem, and walk into it. But how would that actually work? Would you create a new superblock each time you do this, labelled with the starting object (say the dentry for "foo/bar" in this case), and then walk from the root? An alternative, maybe, could be to make a new dentry type, say, and include it in the superblock of the object being queried - and let the filesystems deal with it. That would mean that non-dir dentries would then have virtual children. You could then even use this to implement resource forks... Another alternative would be to note O_ALT and then skip pathwalk entirely, but just use the name as a key to the attribute, creating an anonfd to read it. But then why use openat() at all? You could instead do: metadatafd = openmeta(fd, "metadataname"); and save the page flag. You could even merge the two opens and do: metadatafd = openmeta("foo/bar", "metadataname"); Why not even combine this with Miklos's readfile() idea: readmeta(AT_FDCWD, "foo/bar", "metadataname", buf, sizeof(buf)); and we're now down to one syscall and no fds and you don't even need a magic filesystem to make it work. There's another consideration too: Paths are not unique handles to mounts. It's entirely possible to have colocated mounts. We need to be able to query all the mounts on a mountpoint. David
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
Miklos Szeredi wrote: > You said yourself, that what's really needed is e.g. consistent > snapshot of a complete mount tree topology. And to get the complete > topology FSINFO_ATTR_MOUNT_TOPOLOGY and FSINFO_ATTR_MOUNT_CHILDREN are > needed for *each* individual mount. That's not entirely true. FSINFO_ATTR_MOUNT_ALL can be used instead of FSINFO_ATTR_MOUNT_CHILDREN if you want to scan an entire subtree in one go. It returns the same record type. The result from ALL/CHILDREN includes sufficient information to build the tree. That only requires the parent ID. All the rest of the information TOPOLOGY exposes is to do with propagation. Now, granted, I didn't include all of the topology info in the records returned by ALL/CHILDREN because I don't expect it to change very often. But you can check the event counter supplied with each record to see if it might have changed - and then call TOPOLOGY on the ones that changed. If it simplifies life, I could add the propagation info into ALL/CHILDREN so that you only need to call ALL to scan everything. It requires larger buffers, however. > Adding a few generic binary interfaces is okay. Adding many > specialized binary interfaces is a PITA. Text interfaces are also a PITA, especially when you may get multiple pieces of information returned in one buffer and especially when you throw in character escaping. Of course, we can do it - and we do do it all over - but that doesn't make it efficient. David
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 12:14 PM Karel Zak wrote: > For example, by fsinfo(FSINFO_ATTR_MOUNT_TOPOLOGY) you get all > mountpoint propagation setting and relations by one syscall, That's just an arbitrary grouping of attributes. You said yourself, that what's really needed is e.g. consistent snapshot of a complete mount tree topology. And to get the complete topology FSINFO_ATTR_MOUNT_TOPOLOGY and FSINFO_ATTR_MOUNT_CHILDREN are needed for *each* individual mount. The topology can obviously change between those calls. So there's no fundamental difference between getting individual attributes or getting attribute groups in this respect. > It would be also nice to avoid some strings formatting and separators > like we use in the current mountinfo. I think quoting non-printable is okay. > I can imagine multiple values separated by binary header (like we already > have for watch_notification, inotify, etc): Adding a few generic binary interfaces is okay. Adding many specialized binary interfaces is a PITA. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 08:20:24AM -0700, Linus Torvalds wrote: > IOW, if you do something more along the lines of > >fd = open("foo/bar", O_PATH); >metadatafd = openat(fd, "metadataname", O_ALT); > > it might be workable. I had thought we wanted to replace mountinfo to reduce overhead. If I understand your idea, then we will need openat()+read()+close() for each attribute? Sounds like a pretty expensive interface. The question is also how consistent the results will be if you read information about the same mountpoint with multiple openat()+read()+close() calls. For example, with fsinfo(FSINFO_ATTR_MOUNT_TOPOLOGY) you get all mountpoint propagation settings and relations in one syscall; with your idea you will read parent, slave and flags with multiple read() calls and without any lock. Sounds like you can get a mess if someone moves or reconfigures the mountpoint in the meantime. openat(O_ALT) seems elegant at first glance, but it will be necessary to provide richer (more complex) answers from read() to reduce overhead and to make it more consistent for userspace. It would also be nice to avoid some of the string formatting and separators we use in the current mountinfo. I can imagine multiple values separated by a binary header (like we already have for watch_notification, inotify, etc):

    fd = openat(fd, "mountinfo", O_ALT);
    sz = read(fd, buf, BUFSZ);
    p = buf;

    while (sz) {
        struct alt_metadata *alt = (struct alt_metadata *) p;
        char *varname = alt->name;
        char *data = alt->data;
        int len = alt->len;

        sz -= len;
        p += len;
    }

Karel -- Karel Zak http://karelzak.blogspot.com
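A bounds-checked version of that record walk might look like the sketch below. The header layout (total length plus name length, with name and value bytes following) is purely an assumption made for illustration; struct alt_record_hdr, alt_record_put() and alt_record_find() are hypothetical names, not a proposed ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: the layout is an assumption for this
 * sketch. Name bytes, then value bytes, follow the header. */
struct alt_record_hdr {
    uint32_t len;       /* total record length, header included */
    uint32_t name_len;  /* bytes of attribute name, no terminator */
};

/* Append one record at offset off. Returns the new offset, or 0 if the
 * record does not fit (0 is never a valid new offset). */
size_t alt_record_put(unsigned char *buf, size_t cap, size_t off,
                      const char *name, const void *val, size_t val_len)
{
    struct alt_record_hdr hdr;
    size_t name_len = strlen(name);
    size_t total = sizeof(hdr) + name_len + val_len;

    assert(buf != NULL && val != NULL);
    if (off > cap || total > cap - off)
        return 0;
    hdr.len = (uint32_t)total;
    hdr.name_len = (uint32_t)name_len;
    memcpy(buf + off, &hdr, sizeof(hdr));
    memcpy(buf + off + sizeof(hdr), name, name_len);
    memcpy(buf + off + sizeof(hdr) + name_len, val, val_len);
    return off + total;
}

/* Walk the buffer and find the value stored under name. Returns 0 on
 * success, -1 if the name is missing or a record is malformed. */
int alt_record_find(const unsigned char *buf, size_t sz, const char *name,
                    const unsigned char **val, size_t *val_len)
{
    size_t want = strlen(name);

    while (sz >= sizeof(struct alt_record_hdr)) {
        struct alt_record_hdr hdr;

        memcpy(&hdr, buf, sizeof(hdr));   /* copy out: avoids unaligned access */
        if (hdr.len < sizeof(hdr) || hdr.len > sz ||
            hdr.name_len > hdr.len - sizeof(hdr))
            return -1;                    /* truncated or corrupt record */
        if (hdr.name_len == want &&
            memcmp(buf + sizeof(hdr), name, want) == 0) {
            *val = buf + sizeof(hdr) + hdr.name_len;
            *val_len = hdr.len - sizeof(hdr) - hdr.name_len;
            return 0;
        }
        buf += hdr.len;                   /* advance to the next record */
        sz -= hdr.len;
    }
    return -1;
}
```

The extra checks matter because a short read() can cut the last record in half; a loop that only tests the remaining size against zero would then run past the end of the buffer.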
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 10:29 AM David Howells wrote: > > Miklos Szeredi wrote: > > > Worried about performance? Io-uring will allow you to do all those > > five syscalls (or many more) with just one I/O submission. > > io_uring isn't going to help here. We're talking about synchronous reads. > AIUI, you're adding a couple more syscalls to the list and running stuff in a > side thread to save the effort of going in and out of the kernel five times. > But you still have to pay the set up/tear down costs on the fds and do the > pathwalks. io_uring doesn't magically make that cost disappear. > > io_uring also requires resources such as a kernel accessible ring buffer to > make it work. > > You're proposing making everything else more messy just to avoid a dedicated > syscall. Could you please set out your reasoning for that? a) A dedicated syscall with a complex binary API is a non-trivial maintenance burden. b) The awarded performance boost is not warranted for the use cases it is designed for. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
Miklos Szeredi wrote: > Worried about performance? Io-uring will allow you to do all those > five syscalls (or many more) with just one I/O submission. io_uring isn't going to help here. We're talking about synchronous reads. AIUI, you're adding a couple more syscalls to the list and running stuff in a side thread to save the effort of going in and out of the kernel five times. But you still have to pay the set up/tear down costs on the fds and do the pathwalks. io_uring doesn't magically make that cost disappear. io_uring also requires resources such as a kernel accessible ring buffer to make it work. You're proposing making everything else more messy just to avoid a dedicated syscall. Could you please set out your reasoning for that? David
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 12, 2020 at 2:05 AM David Howells wrote: > > { > > int fd, attrfd; > > > > fd = open(path, O_PATH); > > attrfd = openat(fd, name, O_ALT); > > close(fd); > > read(attrfd, value, size); > > close(attrfd); > > } > > Please don't go down this path. You're proposing five syscalls - including > creating two file descriptors - to do what fsinfo() does in one. So what? People argued against readfile() for exactly the opposite reasons, even though that's a lot less specialized than fsinfo(). Worried about performance? Io-uring will allow you to do all those five syscalls (or many more) with just one I/O submission. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 11:19 PM Linus Torvalds wrote: > > On Tue, Aug 11, 2020 at 1:56 PM Miklos Szeredi wrote: > > > > So that's where O_ALT comes in. If the application is consenting, > > then that should prevent exploits. Or? > > If the application is consenting AND GETS IT RIGHT it should prevent exploits. > > But that's a big deal. > > Why not just do it the way I suggested? Then you don't have any of these > issues. Will do. I just want to understand the reasons why a unified namespace is completely out of the question. And I won't accept "it's just fugly" or "it's the way it's always been done, so don't change it". Those are not good reasons. Oh, I'm used to these "fights", had them all along. In hindsight I should have accepted others' advice in some of the cases, but in others that big argument turned out to be a complete non-issue. One such being inode and dentry duplication in the overlayfs case vs. in-built stacking in the union-mount case. There were a lot of issues with overlayfs, that's true, but dcache/icache size has NEVER actually been reported as a problem. While Al has a lot of experience, it's hard to accept all that anecdotal evidence just because he says it. Your worries are also just those: worries. They may turn out to be an issue or they may not. Anyway, starting with just introducing the alt namespace without unification seems to be a good first step. If that turns out to be workable, we can revisit unification later. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, 2020-08-11 at 21:39 +0200, Christian Brauner wrote: > On Tue, Aug 11, 2020 at 09:05:22AM -0700, Linus Torvalds wrote: > > On Tue, Aug 11, 2020 at 8:30 AM Miklos Szeredi > > wrote: > > > What's the disadvantage of doing it with a single lookup WITH an > > > enabling flag? > > > > > > It's definitely not going to break anything, so no backward > > > compatibility issues whatsoever. > > > > No backwards compatibility issues for existing programs, no. > > > > But your suggestion is fundamentally ambiguous, and you most > > definitely *can* hit that if people start using this in new > > programs. > > > > Where does that "unified" pathname come from? It will be generated > > from "base filename + metadata name" in user space, and > > > > (a) the base filename might have double or triple slashes in it > > for > > whatever reasons. > > > > This is not some "made-up gotcha" thing - I see double slashes > > *all* > > the time when we have things like Makefiles doing > > > > srctree=../../src/ > > > > and then people do "$(srctree)/". If you haven't seen that kind of > > pattern where the pathname has two (or sometimes more!) slashes in > > the > > middle, you've led a very sheltered life. > > > > (b) even if the new user space were to think about that, and > > remove > > those (hah! when have you ever seen user space do that?), as Al > > mentioned, the user *filesystem* might have pathnames with double > > slashes as part of symlinks. > > > > So now we'd have to make sure that when we traverse symlinks, that > > O_ALT gets cleared. Which means that it's not a unified namespace > > after all, because you can't make symlinks point to metadata. > > > > Or we'd retroactively change the semantics of a symlink, and that > > _is_ > > a backwards compatibility issue. Not with old software, no, but it > > changes the meaning of old symlinks! > > > > So no, I don't think a unified namespace ends up working. > > > > And I say that as somebody who actually loves the concept. 
Ask Al: > > I > > have a few times pushed for "let's allow directory behavior on > > regular > > files", so that you could do things like a tar-filesystem, and > > access > > the contents of a tar-file by just doing > > > > cat my-file.tar/inside/the/archive.c > > > > or similar. > > > > Al has convinced me it's a horrible idea (and there you have a > > non-ambiguous marker: the slash at the end of a pathname that > > otherwise looks and acts as a non-directory) > > > > Putting my kernel hat down, putting my userspace hat on. > > I'm looking at this from a potential user of this interface. > I'm not a huge fan of the metadata fd approach I'd much rather have a > dedicated system call rather than opening a side-channel metadata fd > that I can read binary data from. Maybe I'm alone in this but I was > under the impression that other users including Ian, Lennart, and > Karel > have said on-list in some form that they would prefer this approach. > There are even patches for systemd and libmount, I thought? Not quite sure what you mean here. Karel (with some contributions by me) has implemented the interfaces for David's mount notifications and fsinfo() call in libmount. We still have a little more to do on that. I also have a systemd implementation that uses these libmount features for mount table handling that works quite well, with a couple more things to do to complete it, that Lennart has done an initial review for. It's no secret that I don't like the proc file system in general but it is really useful for many things, that's just the way it is. Ian
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
Linus Torvalds wrote: > [ I missed the beginning of this discussion, so maybe this was already > suggested ] Well, the start of it was my proposal of an fsinfo() system call. That at its simplest takes an object reference (eg. a path) and an integer attribute ID (it could use a string instead, I suppose, but it would mean a bunch of strcmps instead of integer comparisons) and returns the value of the attribute. But I allow you to do slightly more interesting things than that too. Miklós seems dead-set against adding a system call specifically for this - though he's proposed extending open in various ways and also proposed an additional syscall, readfile(), that does the open+read+close all in one step. I think also at some point, he (or maybe James?) proposed adding a new magic filesystem mounted somewhere on proc (reflecting an open fd) that then had a bunch of symlinks to somewhere in sysfs (reflecting a mount). The idea being that you did something like: fd = open("/path/to/object", O_PATH); sprintf(name, "/proc/self/fds/%u/attr1", fd); attrfd = open(name, O_RDONLY); read(attrfd, buf1, sizeof(buf1)); close(attrfd); sprintf(name, "/proc/self/fds/%u/attr2", fd); attrfd = open(name, O_RDONLY); read(attrfd, buf2, sizeof(buf2)); close(attrfd); or: sprintf(name, "/proc/self/fds/%u/attr1", fd); readfile(name, buf1, sizeof(buf1)); sprintf(name, "/proc/self/fds/%u/attr2", fd); readfile(name, buf2, sizeof(buf2)); and then "/proc/self/fds/12/attr2" might then be a symlink to, say, "/sys/mounts/615/mount_attr". Miklós's justification for this was that it could then be operated from a shell script without the need for a utility - except that bash, at least, can't do O_PATH opens. James has proposed making fsconfig() able to retrieve attributes (though I'd prefer to give it a sibling syscall that does the retrieval rather than making fsconfig() do that too). 
> {
>         int fd, attrfd;
>
>         fd = open(path, O_PATH);
>         attrfd = openat(fd, name, O_ALT);
>         close(fd);
>         read(attrfd, value, size);
>         close(attrfd);
> }

Please don't go down this path. You're proposing five syscalls - including creating two file descriptors - to do what fsinfo() does in one. Do you have a particular objection to adding a syscall specifically for retrieving filesystem/VFS information?

-~-

Anyway, in case you're interested in what I want to get out of this - which is the reason for it being posted in the first place:

 (*) The ability to retrieve various attributes of a filesystem/superblock, including information on:

     - Filesystem features: Does it support things like hard links, user quotas, direct I/O.

     - Filesystem limits: What's the maximum size of a file, an xattr, a directory; how many files can it support.

     - Supported API features: What FS_IOC_GETFLAGS does it support? Which can be set? Does it have Windows file attributes available? What statx attributes are supported? What do the timestamps support? What sort of case handling is done on filenames?

     Note that for a lot of cases, this stuff is fixed and can just be memcpy'd from rodata. Some of this is variable, however, in things like ext4 and xfs, depending on, say, mkfs configuration. The situation is even more complex with network filesystems as this may depend on the server they're talking to.

     But note also that some of this stuff might change file-to-file, even within a superblock.

 (*) The ability to retrieve attributes of a mount point, including information on the flags, propagation settings and child lists.

 (*) The ability to quickly retrieve a list of accessible mount point IDs, with change event counters to permit userspace (eg. systemd) to quickly determine if anything changed in the event of an overrun.

 (*) The ability to find mounts/superblocks by mount ID. Paths are not unique identifiers for mountpoints. You can stack multiple mounts on the same directory, but a path only sees the top one.

 (*) The ability to look inside a different mount namespace - one to which you have a reference fd. This would allow a container manager to look inside the container it is managing.

 (*) The ability to expose filesystem-specific attributes. Network filesystems can expose lists of servers and server addresses, for instance.

 (*) The ability to use the object referenced to determine the namespace (particularly the network namespace) to look in. The problem with looking in, say, /proc/net/... is that it looks at current's net namespace - whether or not the object of interest is in the same one.

 (*) The ability to query the context attached to the fd obtained from fsopen(). Such a context may not have a superblock attached to it yet or may no
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On 8/11/2020 1:28 PM, Miklos Szeredi wrote: > On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler > wrote: > >> Since ab has known meaning, and lots of applications >> play loose with '/', its really dangerous to treat the string as >> special. We only get away with '.' and '..' because their behavior >> was defined before many of y'all were born. > So the founding fathers have set things in stone and now we can't > change it. Right? The founders did lots of things that, in retrospect, weren't such great ideas, but that we have to live with. > Well that's how it looks... but let's think a little; we have '/' and > '\0' that can't be used in filenames. Also '.' and '..' are > prohibited names. It's not a trivial limitation, so applications are > probably not used to dumping binary data into file names. Hee Hee. Back in the early days of UNIX (the 1970s) there was a command, dsw(1) ("delete from switches"), because files with untypeable names were unfortunately common. I would question the assertion that "applications are not used to dumping binary data into file names", based on how often I've wished we still had dsw(1). > And that > means it's probably possible to find a fairly short combination that > is never used in practice (probably containing the "/." sequence). You'd think, but you'd be wrong. In the UNIX days we tried everything from "..." to ".NO_HID." and there always arose a problem or two. Not the least of which is that a "magic" pathname generated on an old system, then mounted on a new system will never give you the results you want. > Why couldn't we reserve such a combination now? > > I have no idea how to find such it, but other than that, I see no > theoretical problem with extending the list of reserved filenames. You need a sequence that is never used in any language, and that has never been used as a magic shell sequence.
If you want a fun story to tell over beers, look up how using the "@" as the erase character on a TTY33 led to it being used in email addresses. > Thanks, > Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 10:28:31PM +0200, Miklos Szeredi wrote: > On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler > wrote: > > > Since ab has known meaning, and lots of applications > > play loose with '/', its really dangerous to treat the string as > > special. We only get away with '.' and '..' because their behavior > > was defined before many of y'all were born. > > So the founding fathers have set things in stone and now we can't > change it. Right? Right. > Well that's how it looks... but let's think a little; we have '/' and > '\0' that can't be used in filenames. Also '.' and '..' are > prohibited names. It's not a trivial limitation, so applications are > probably not used to dumping binary data into file names. And that > means it's probably possible to find a fairly short combination that > is never used in practice (probably containing the "/." sequence). No, it is not. Miklos, get real - you will end up with obscure pathname produced once in a while by a script fragment from hell spewed out by crusty piece of awk buried in a piece of shit makefile from hell (and you are lucky if it won't be an automake output, while we are at it). Exercised only when some shipped turd needs to be regenerated. Have you _ever_ tried to debug e.g. gcc build problems? I have, and it's extremely unpleasant. Failures tend to be obscure as hell, backtracking them through the makefiles is a massive PITA and figuring out why said piece of awk produces what it does... I know what I would've done if the likely 5 hours of cursing everything would have ended up with discovery that some luser had assumed that surely, no sane software would ever generate this sequence of characters in anything used as a pathname, and that for this reason I'm looking forward to several more hours of playing with utterly revolting crap to convince it to stay away from that sequence... > Why couldn't we reserve such a combination now? 
> > I have no idea how to find such it, but other than that, I see no > theoretical problem with extending the list of reserved filenames. "not breaking userland", for one.
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 1:56 PM Miklos Szeredi wrote: > > So that's where O_ALT comes in. If the application is consenting, > then that should prevent exploits. Or? If the application is consenting AND GETS IT RIGHT it should prevent exploits. But that's a big deal. Why not just do it the way I suggested? Then you don't have any of these issues. Linus
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 1:56 PM Miklos Szeredi wrote: > > On Tue, Aug 11, 2020 at 10:37 PM Jann Horn wrote: > > If you change the semantics of path strings, you'd have to be > > confident that the new semantics fit nicely with all the path > > validation routines that exist scattered across userspace, and don't > > expose new interfaces through file server software and setuid binaries > > and so on. > > So that's where O_ALT comes in. If the application is consenting, > then that should prevent exploits. Or? We're going to be at risk from libraries that want to use the new O_ALT mechanism but are invoked by old code that passes traditional Linux paths. Each library will have to sanitize paths, and some will screw it up. I much prefer Linus' variant where the final part of the extended path is passed as a separate parameter.
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 10:37 PM Jann Horn wrote: > If you change the semantics of path strings, you'd have to be > confident that the new semantics fit nicely with all the path > validation routines that exist scattered across userspace, and don't > expose new interfaces through file server software and setuid binaries > and so on. So that's where O_ALT comes in. If the application is consenting, then that should prevent exploits. Or? Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 10:29 PM Miklos Szeredi wrote: > On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler > wrote: > > Since ab has known meaning, and lots of applications > > play loose with '/', its really dangerous to treat the string as > > special. We only get away with '.' and '..' because their behavior > > was defined before many of y'all were born. > > So the founding fathers have set things in stone and now we can't > change it. Right? > > Well that's how it looks... but let's think a little; we have '/' and > '\0' that can't be used in filenames. Also '.' and '..' are > prohibited names. It's not a trivial limitation, so applications are > probably not used to dumping binary data into file names. And that > means it's probably possible to find a fairly short combination that > is never used in practice (probably containing the "/." sequence). > Why couldn't we reserve such a combination now? This isn't just about finding a string that "is never used in practice". There is userspace software that performs security checks based on the precise semantics that paths have nowadays, and those security checks will sometimes happily let you use arbitrary binary garbage in path components as long as there's no '\0' or '/' in there and the name isn't "." or "..", because that's just how paths work on Linux. If you change the semantics of path strings, you'd have to be confident that the new semantics fit nicely with all the path validation routines that exist scattered across userspace, and don't expose new interfaces through file server software and setuid binaries and so on. I really don't like this idea.
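The rule such validation routines encode is the one stated above: any byte sequence is an acceptable component as long as it contains no '/' or '\0' and is not "." or "..". A minimal sketch of such a check follows; is_valid_component() is a hypothetical name, and the sketch ignores length limits such as NAME_MAX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if the len bytes at name form a usable single path
 * component under today's rules. Hypothetical helper for illustration;
 * it deliberately accepts arbitrary binary bytes. */
bool is_valid_component(const char *name, size_t len)
{
    assert(name != NULL);

    if (len == 0)
        return false;                        /* empty component */
    if (memchr(name, '/', len) || memchr(name, '\0', len))
        return false;                        /* separator or embedded NUL */
    if (len == 1 && name[0] == '.')
        return false;                        /* "." */
    if (len == 2 && name[0] == '.' && name[1] == '.')
        return false;                        /* ".." */
    return true;
}
```

A checker like this would keep accepting whatever sequence the kernel later decided to treat as magic, which is why reserving a new sequence is a security concern and not only a compatibility one.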
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler wrote: > Since ab has known meaning, and lots of applications > play loose with '/', its really dangerous to treat the string as > special. We only get away with '.' and '..' because their behavior > was defined before many of y'all were born. So the founding fathers have set things in stone and now we can't change it. Right? Well that's how it looks... but let's think a little; we have '/' and '\0' that can't be used in filenames. Also '.' and '..' are prohibited names. It's not a trivial limitation, so applications are probably not used to dumping binary data into file names. And that means it's probably possible to find a fairly short combination that is never used in practice (probably containing the "/." sequence). Why couldn't we reserve such a combination now? I have no idea how to find such it, but other than that, I see no theoretical problem with extending the list of reserved filenames. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 09:31:05PM +0200, Lennart Poettering wrote: > On Di, 11.08.20 20:49, Miklos Szeredi (mik...@szeredi.hu) wrote: > > > On Tue, Aug 11, 2020 at 6:05 PM Linus Torvalds > > wrote: > > > > > and then people do "$(srctree)/". If you haven't seen that kind of > > > pattern where the pathname has two (or sometimes more!) slashes in the > > > middle, you've led a very sheltered life. > > > > Oh, I have. That's why I opted for triple slashes, since that should > > work most of the time even in those concatenated cases. And yes, I > > know, most is not always, and this might just be hiding bugs, etc... > > I think the pragmatic approach would be to try this and see how many > > triple slash hits a normal workload gets and if it's reasonably low, > > then hopefully that together with warnings for O_ALT would be enough. > > There's no point. Userspace relies on the current meaning of triple > slashes. It really does. > > I know many places in systemd where we might end up with a triple > slash. Here's a real-life example: some code wants to access the > cgroup attribute 'cgroup.controllers' of the root cgroup. It thus > generates the right path in the fs for it, which is the concatenation of > "/sys/fs/cgroup/" (because that's where cgroupfs is mounted), of "/" > (i.e. for the root cgroup) and of "/cgroup.controllers" (as that's the > file the attribute is exposed under). > > And there you go: > >"/sys/fs/cgroup/" + "/" + "/cgroup.controllers" → > "/sys/fs/cgroup///cgroup.controllers" > > This is a real-life thing. Don't break this please. 
Taken from a log from a container:

lxc f4 20200810105815.742 TRACE cgfsng - cgroups/cgfsng.c:cg_legacy_handle_cpuset_hierarchy:552 - "cgroup.clone_children" was already set to "1"
lxc f4 20200810105815.742 WARN cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset///lxc.monitor.f4"
lxc f4 20200810105815.743 INFO cgfsng - cgroups/cgfsng.c:cgfsng_monitor_create:1366 - The monitor process uses "lxc.monitor.f4" as cgroup
lxc f4 20200810105815.743 DEBUG storage - storage/storage.c:get_storage_by_name:211 - Detected rootfs type "dir"
lxc f4 20200810105815.743 TRACE cgfsng - cgroups/cgfsng.c:cg_legacy_handle_cpuset_hierarchy:552 - "cgroup.clone_children" was already set to "1"
lxc f4 20200810105815.743 WARN cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset///lxc.payload.f4"
lxc f4 20200810105815.743 INFO cgfsng - cgroups/cgfsng.c:cgfsng_payload_create:1469 - The container process uses "lxc.payload.f4" as cgroup
lxc f4 20200810105815.744 TRACE start - start.c:lxc_spawn:1731 - Spawned container directly into target cgroup via cgroup2 fd 17

Christian
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 09:05:22AM -0700, Linus Torvalds wrote:
> On Tue, Aug 11, 2020 at 8:30 AM Miklos Szeredi wrote:
> >
> > What's the disadvantage of doing it with a single lookup WITH an enabling
> > flag?
> >
> > It's definitely not going to break anything, so no backward
> > compatibility issues whatsoever.
>
> No backwards compatibility issues for existing programs, no.
>
> But your suggestion is fundamentally ambiguous, and you most
> definitely *can* hit that if people start using this in new programs.
>
> Where does that "unified" pathname come from? It will be generated
> from "base filename + metadata name" in user space, and
>
> (a) the base filename might have double or triple slashes in it for
> whatever reasons.
>
> This is not some "made-up gotcha" thing - I see double slashes *all*
> the time when we have things like Makefiles doing
>
> srctree=../../src/
>
> and then people do "$(srctree)/". If you haven't seen that kind of
> pattern where the pathname has two (or sometimes more!) slashes in the
> middle, you've led a very sheltered life.
>
> (b) even if the new user space were to think about that, and remove
> those (hah! when have you ever seen user space do that?), as Al
> mentioned, the user *filesystem* might have pathnames with double
> slashes as part of symlinks.
>
> So now we'd have to make sure that when we traverse symlinks, that
> O_ALT gets cleared. Which means that it's not a unified namespace
> after all, because you can't make symlinks point to metadata.
>
> Or we'd retroactively change the semantics of a symlink, and that _is_
> a backwards compatibility issue. Not with old software, no, but it
> changes the meaning of old symlinks!
>
> So no, I don't think a unified namespace ends up working.
>
> And I say that as somebody who actually loves the concept. Ask Al: I
> have a few times pushed for "let's allow directory behavior on regular
> files", so that you could do things like a tar-filesystem, and access
> the contents of a tar-file by just doing
>
> cat my-file.tar/inside/the/archive.c
>
> or similar.
>
> Al has convinced me it's a horrible idea (and there you have a
> non-ambiguous marker: the slash at the end of a pathname that
> otherwise looks and acts as a non-directory)
>

Putting my kernel hat down, putting my userspace hat on.

I'm looking at this as a potential user of this interface. I'm not a huge fan of the metadata fd approach. I'd much rather have a dedicated system call than open a side-channel metadata fd that I can read binary data from. Maybe I'm alone in this, but I was under the impression that other users, including Ian, Lennart, and Karel, have said on-list in some form that they would prefer this approach. There are even patches for systemd and libmount, I thought?

But if we want to go down a completely different route, then I'd prefer that this metadata fd with "special semantics" not alter the meaning of regular paths in any way. This has the potential to cause a lot of churn for userspace. I think having to play concatenation games in shared libraries for mount information is a bad plan, in addition to all the issues you raised here.

Christian
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Di, 11.08.20 20:49, Miklos Szeredi (mik...@szeredi.hu) wrote: > On Tue, Aug 11, 2020 at 6:05 PM Linus Torvalds > wrote: > > > and then people do "$(srctree)/". If you haven't seen that kind of > > pattern where the pathname has two (or sometimes more!) slashes in the > > middle, you've led a very sheltered life. > > Oh, I have. That's why I opted for triple slashes, since that should > work most of the time even in those concatenated cases. And yes, I > know, most is not always, and this might just be hiding bugs, etc... > I think the pragmatic approach would be to try this and see how many > triple slash hits a normal workload gets and if it's reasonably low, > then hopefully that together with warnings for O_ALT would be enough. There's no point. Userspace relies on the current meaning of triple slashes. It really does. I know many places in systemd where we might end up with a triple slash. Here's a real-life example: some code wants to access the cgroup attribute 'cgroup.controllers' of the root cgroup. It thus generates the right path in the fs for it, which is the concatenation of "/sys/fs/cgroup/" (because that's where cgroupfs is mounted), of "/" (i.e. for the root cgroup) and of "/cgroup.controllers" (as that's the file the attribute is exposed under). And there you go: "/sys/fs/cgroup/" + "/" + "/cgroup.controllers" → "/sys/fs/cgroup///cgroup.controllers" This is a real-life thing. Don't break this please. Lennart -- Lennart Poettering, Berlin
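The concatenation described above can be reproduced with a few lines of C. This is only an illustrative sketch: join3() and same_file() are hypothetical helper names, not systemd code, and the check relies on nothing more than the current rule that runs of slashes in the middle of a pathname behave like a single slash.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Naively join three path pieces without normalizing slashes. */
int join3(char *out, size_t outsz, const char *a, const char *b, const char *c)
{
	int n;

	assert(out != NULL && outsz > 0);
	n = snprintf(out, outsz, "%s%s%s", a, b, c);
	return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}

/* Return 1 if both pathnames resolve to the same inode on the same device. */
int same_file(const char *p, const char *q)
{
	struct stat sp, sq;

	if (stat(p, &sp) != 0 || stat(q, &sq) != 0)
		return 0;
	return sp.st_dev == sq.st_dev && sp.st_ino == sq.st_ino;
}
```

Joining "/sys/fs/cgroup/", "/" and "/cgroup.controllers" with join3() gives exactly the triple-slash path from the mail, and same_file(".", ".///.") reports the same object today. Any reinterpretation of "///" would change what such generated paths mean.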
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 6:05 PM Linus Torvalds wrote: > and then people do "$(srctree)/". If you haven't seen that kind of > pattern where the pathname has two (or sometimes more!) slashes in the > middle, you've led a very sheltered life. Oh, I have. That's why I opted for triple slashes, since that should work most of the time even in those concatenated cases. And yes, I know, most is not always, and this might just be hiding bugs, etc... I think the pragmatic approach would be to try this and see how many triple slash hits a normal workload gets and if it's reasonably low, then hopefully that together with warnings for O_ALT would be enough. > (b) even if the new user space were to think about that, and remove > those (hah! when have you ever seen user space do that?), as Al > mentioned, the user *filesystem* might have pathnames with double > slashes as part of symlinks. > > So now we'd have to make sure that when we traverse symlinks, that > O_ALT gets cleared. That's exactly what I implemented in the proof of concept patch. > Which means that it's not a unified namespace > after all, because you can't make symlinks point to metadata. I don't think that's a great deal. Also I think other limitations would make sense: - no mounts allowed under /// - no ./.. resolution after /// - no hardlinks - no special files, just regular and directory - no seeking (regular or dir) > cat my-file.tar/inside/the/archive.c > > or similar. > > Al has convinced me it's a horrible idea (and there you have a > non-ambiguous marker: the slash at the end of a pathname that > otherwise looks and acts as a non-directory) Umm, can you remind me what's so horrible about that? Yeah, hard linked directories are a no-no. But it doesn't have to be implemented in a way to actually be a problem with hard links. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 09:09:36AM -0700, Linus Torvalds wrote: > On Tue, Aug 11, 2020 at 9:05 AM Al Viro wrote: > > > > Except that you suddenly see non-directory dentries get children. > > And a lot of dcache-related logics needs to be changed if that > > becomes possible. > > Yeah, I think you'd basically need to associate a (dynamic) > mount-point to that path when you start doing O_ALT. Or something. Whee... That's going to be non-workable for xattrs - fgetxattr() needs to work after unlink(). And you'd obviously need to prevent crossing into that sucker on normal lookups, which would add quite a few interesting twists around the automount points. I'm not saying it's not doable, but it won't be anywhere near straightforward. And API semantics questions are still there...
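The fgetxattr()-after-unlink() requirement can be observed with the existing API. The following is a minimal sketch, assuming a writable /tmp; the function name xattr_errno_after_unlink() and the attribute name "user.demo" are made up for illustration, and the attribute is never set, so the call is expected to fail with something like ENODATA or ENOTSUP depending on the filesystem rather than with a "bad descriptor" error.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/*
 * Create a temporary file, unlink it, then query an xattr through the
 * still-open descriptor. Returns 0 if the attribute was read, a positive
 * errno value if fgetxattr() failed, or -1 if the setup itself failed.
 */
int xattr_errno_after_unlink(void)
{
	char path[] = "/tmp/xattr-demo-XXXXXX";
	char value[64];
	ssize_t n;
	int fd, result;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	if (unlink(path) != 0) {
		close(fd);
		return -1;
	}

	/* The name is gone, but the descriptor still refers to the inode. */
	n = fgetxattr(fd, "user.demo", value, sizeof(value));
	result = (n < 0) ? errno : 0;

	close(fd);
	return result;
}
```

The point of the sketch is that the lookup goes through the descriptor, not through a path, which is why a design that hangs metadata off a mount-point-like object at a pathname has trouble matching it.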
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 9:17 AM Casey Schaufler wrote: > > This doesn't work so well for setxattr(), which we want to be atomic. Well, it's not like the old interfaces could go away. But yes, doing metadatafd = openat(fd, "metadataname", O_ALT | O_CREAT | O_EXCL) to create a new xattr (and then write to it) would not act like setxattr(). Even if you do it as one atomic write, a reader would see that zero-sized xattr between the O_CREAT and the write. Of course, we could just hide zero-sized xattrs from the legacy interfaces and avoid things like that, but another option is to say that only the legacy interfaces give that particular atomicity guarantee. > Since ab has known meaning, and lots of applications > play loose with '/', its really dangerous to treat the string as > special. We only get away with '.' and '..' because their behavior > was defined before many of y'all were born. Yeah, I really don't think it's a good idea to play with "//". POSIX does allow special semantics for a pathname with "//" at the *beginning*, but even that has been very questionable (and Linux has never supported it). Linus
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On 8/11/2020 8:39 AM, Andy Lutomirski wrote:
>
>> On Aug 11, 2020, at 8:20 AM, Linus Torvalds wrote:
>>
>> [ I missed the beginning of this discussion, so maybe this was already
>> suggested ]
>>
>>> On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi wrote:
>>> E.g. openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
>>> Proof of concept patch and test program below.
>> I don't think this works for the reasons Al says, but a slight
>> modification might.
>>
>> IOW, if you do something more along the lines of
>>
>>     fd = open("foo/bar", O_PATH);
>>     metadatafd = openat(fd, "metadataname", O_ALT);
>>
>> it might be workable.
>>
>> So you couldn't do it with _one_ pathname, because that is always
>> fundamentally going to hit pathname lookup rules.
>>
>> But if you start a new path lookup with new rules, that's fine.
>>
>> This is what I think xattrs should always have done, because they are
>> broken garbage.
>>
>> In fact, if we do it right, I think we could have "getxattr()" be 100%
>> equivalent to (modulo all the error handling that this doesn't do, of
>> course):
>>
>>     ssize_t getxattr(const char *path, const char *name,
>>                      void *value, size_t size)
>>     {
>>         int fd, attrfd;
>>
>>         fd = open(path, O_PATH);
>>         attrfd = openat(fd, name, O_ALT);
>>         close(fd);
>>         read(attrfd, value, size);
>>         close(attrfd);
>>     }
>>
>> and you'd still use getxattr() and friends as a shorthand (and for
>> POSIX compatibility), but internally in the kernel we'd have a
>> interface around that "xattrs are just file handles" model.

This doesn't work so well for setxattr(), which we want to be atomic.

> This is a lot like a less nutty version of NTFS streams, whereas the /// idea
> is kind of like an extra-nutty version of NTFS streams.
>
> I am personally not a fan of the in-band signaling implications of
> overloading /. For example, there is plenty of code out there that thinks
> that (a + "/" + b) concatenates paths. With /// overloaded, this stops being
> true.

Since ab has known meaning, and lots of applications play loose with '/', it's really dangerous to treat the string as special. We only get away with '.' and '..' because their behavior was defined before many of y'all were born.
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 9:05 AM Al Viro wrote: > > Except that you suddenly see non-directory dentries get children. > And a lot of dcache-related logics needs to be changed if that > becomes possible. Yeah, I think you'd basically need to associate a (dynamic) mount-point to that path when you start doing O_ALT. Or something. And it might not be reasonably implementable. I just think that as _interface_ it's unambiguous and fairly clean, and if Miklos can implement something like that, I think it would be maintainable. No? Linus
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 8:30 AM Miklos Szeredi wrote: > > What's the disadvantage of doing it with a single lookup WITH an enabling > flag? > > It's definitely not going to break anything, so no backward > compatibility issues whatsoever. No backwards compatibility issues for existing programs, no. But your suggestion is fundamentally ambiguous, and you most definitely *can* hit that if people start using this in new programs. Where does that "unified" pathname come from? It will be generated from "base filename + metadata name" in user space, and (a) the base filename might have double or triple slashes in it for whatever reasons. This is not some "made-up gotcha" thing - I see double slashes *all* the time when we have things like Makefiles doing srctree=../../src/ and then people do "$(srctree)/". If you haven't seen that kind of pattern where the pathname has two (or sometimes more!) slashes in the middle, you've led a very sheltered life. (b) even if the new user space were to think about that, and remove those (hah! when have you ever seen user space do that?), as Al mentioned, the user *filesystem* might have pathnames with double slashes as part of symlinks. So now we'd have to make sure that when we traverse symlinks, that O_ALT gets cleared. Which means that it's not a unified namespace after all, because you can't make symlinks point to metadata. Or we'd retroactively change the semantics of a symlink, and that _is_ a backwards compatibility issue. Not with old software, no, but it changes the meaning of old symlinks! So no, I don't think a unified namespace ends up working. And I say that as somebody who actually loves the concept. Ask Al: I have a few times pushed for "let's allow directory behavior on regular files", so that you could do things like a tar-filesystem, and access the contents of a tar-file by just doing cat my-file.tar/inside/the/archive.c or similar. 
Al has convinced me it's a horrible idea (and there you have a non-ambiguous marker: the slash at the end of a pathname that otherwise looks and acts as a non-directory) Linus
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 08:20:24AM -0700, Linus Torvalds wrote: > I don't think this works for the reasons Al says, but a slight > modification might. > > IOW, if you do something more along the lines of > >fd = open(""foo/bar", O_PATH); >metadatafd = openat(fd, "metadataname", O_ALT); > > it might be workable. > > So you couldn't do it with _one_ pathname, because that is always > fundamentally going to hit pathname lookup rules. > > But if you start a new path lookup with new rules, that's fine. Except that you suddenly see non-directory dentries get children. And a lot of dcache-related logics needs to be changed if that becomes possible. I agree that xattrs are garbage, but this approach won't be a straightforward solution. Can those suckers be passed to ...at() as starting points? Can they be bound in namespace? Can something be bound *on* them? What do they have for inodes and what maintains their inumbers (and st_dev, while we are at it)? Can _they_ have secondaries like that (sensu Swift)? Is that a flat space, or can they be directories? Only a part of the problems is implementation-related (and those are not trivial at all); most the fun comes from semantics of those things. And answers to the implementation questions are seriously dependent upon that...
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
> On Aug 11, 2020, at 8:20 AM, Linus Torvalds > wrote: > > [ I missed the beginning of this discussion, so maybe this was already > suggested ] > >> On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi wrote: >> >>> >>> E.g. >>> openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT); >> >> Proof of concept patch and test program below. > > I don't think this works for the reasons Al says, but a slight > modification might. > > IOW, if you do something more along the lines of > > fd = open(""foo/bar", O_PATH); > metadatafd = openat(fd, "metadataname", O_ALT); > > it might be workable. > > So you couldn't do it with _one_ pathname, because that is always > fundamentally going to hit pathname lookup rules. > > But if you start a new path lookup with new rules, that's fine. > > This is what I think xattrs should always have done, because they are > broken garbage. > > In fact, if we do it right, I think we could have "getxattr()" be 100% > equivalent to (modulo all the error handling that this doesn't do, of > course): > > ssize_t getxattr(const char *path, const char *name, >void *value, size_t size) > { > int fd, attrfd; > > fd = open(path, O_PATH); > attrfd = openat(fd, name, O_ALT); > close(fd); > read(attrfd, value, size); > close(attrfd); > } > > and you'd still use getxattr() and friends as a shorthand (and for > POSIX compatibility), but internally in the kernel we'd have a > interface around that "xattrs are just file handles" model. > > This is a lot like a less nutty version of NTFS streams, whereas the /// idea is kind of like an extra-nutty version of NTFS streams. I am personally not a fan of the in-band signaling implications of overloading /. For example, there is plenty of code out there that thinks that (a + “/“ + b) concatenates paths. With /// overloaded, this stops being true.
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 5:20 PM Linus Torvalds wrote: > > [ I missed the beginning of this discussion, so maybe this was already > suggested ] > > On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi wrote: > > > > > > > > E.g. > > > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT); > > > > Proof of concept patch and test program below. > > I don't think this works for the reasons Al says, but a slight > modification might. > > IOW, if you do something more along the lines of > >fd = open(""foo/bar", O_PATH); >metadatafd = openat(fd, "metadataname", O_ALT); > > it might be workable. That would have been my backup suggestion, in case the unified namespace doesn't work out. I wouldn't think the normal lookup rules really get in the way if we explicitly enable alternative path lookup with a flag. The rules just need to be documented. What's the disadvantage of doing it with a single lookup WITH an enabling flag? It's definitely not going to break anything, so no backward compatibility issues whatsoever. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
[ I missed the beginning of this discussion, so maybe this was already suggested ]

On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi wrote:
> >
> > E.g.
> > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
>
> Proof of concept patch and test program below.

I don't think this works for the reasons Al says, but a slight modification might.

IOW, if you do something more along the lines of

    fd = open("foo/bar", O_PATH);
    metadatafd = openat(fd, "metadataname", O_ALT);

it might be workable.

So you couldn't do it with _one_ pathname, because that is always fundamentally going to hit pathname lookup rules.

But if you start a new path lookup with new rules, that's fine.

This is what I think xattrs should always have done, because they are broken garbage.

In fact, if we do it right, I think we could have "getxattr()" be 100% equivalent to (modulo all the error handling that this doesn't do, of course):

    ssize_t getxattr(const char *path, const char *name,
                     void *value, size_t size)
    {
        int fd, attrfd;

        fd = open(path, O_PATH);
        attrfd = openat(fd, name, O_ALT);
        close(fd);
        read(attrfd, value, size);
        close(attrfd);
    }

and you'd still use getxattr() and friends as a shorthand (and for POSIX compatibility), but internally in the kernel we'd have an interface around that "xattrs are just file handles" model.

Linus
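The two-step shape suggested here (pin the base first, then start a second lookup relative to that descriptor) already exists for ordinary directory bases. The sketch below uses only the real open()/openat() calls. O_ALT is a proposal from this thread and is deliberately not used, and read_relative() is a hypothetical helper name, so this shows the calling pattern rather than any metadata access.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * If the libc headers do not expose O_PATH, fall back to a plain
 * read-only directory descriptor (O_RDONLY is 0 on Linux), which
 * openat() accepts as well.
 */
#ifndef O_PATH
#define O_PATH 0
#endif

/*
 * Two-step lookup: pin the base directory, then start a second,
 * independent lookup of 'name' relative to that descriptor.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_relative(const char *base, const char *name,
		      void *buf, size_t size)
{
	int basefd, fd;
	ssize_t n;

	assert(base != NULL && name != NULL && buf != NULL);

	basefd = open(base, O_PATH | O_DIRECTORY);
	if (basefd < 0)
		return -1;

	fd = openat(basefd, name, O_RDONLY);
	close(basefd);
	if (fd < 0)
		return -1;

	n = read(fd, buf, size);
	close(fd);
	return n;
}
```

For example, read_relative("/dev", "zero", buf, 8) opens /dev once and then resolves "zero" from there. Because the second lookup is a separate call, it could in principle follow different rules from the first without any marker inside a single pathname.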
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 4:42 PM Al Viro wrote: > > On Tue, Aug 11, 2020 at 04:36:32PM +0200, Miklos Szeredi wrote: > > > > > - strip off trailing part after first instance of /// > > > > - perform path lookup as normal > > > > - resolve meta path after /// on result of normal lookup > > > > > > ... and interpolation of relative symlink body into the pathname does > > > change > > > behaviour now, *including* the cases when said symlink body does not > > > contain > > > that triple-X^Hslash garbage. Wonderful... > > > > Can you please explain? > > Currently substituting the body of a relative symlink in place of its name > results in equivalent pathname. Except proc symlinks, that is. > With your patch that is not just no longer > true, it's no longer true even when the symlink body does not contain that > /// kludge - it can come in part from the symlink body and in part from the > rest of pathname. I.e. you can't even tell if substitution is an equivalent > replacement by looking at the symlink body alone. Yes, that's true not just for symlink bodies but any concatenation of two path segments. That's why it's enabled with RESOLVE_ALT. I've said that I plan to experiment with turning this on globally, but that doesn't mean it's necessarily a good idea. The posted patch contains nothing of that sort. Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 04:36:32PM +0200, Miklos Szeredi wrote: > > > - strip off trailing part after first instance of /// > > > - perform path lookup as normal > > > - resolve meta path after /// on result of normal lookup > > > > ... and interpolation of relative symlink body into the pathname does change > > behaviour now, *including* the cases when said symlink body does not contain > > that triple-X^Hslash garbage. Wonderful... > > Can you please explain? Currently substituting the body of a relative symlink in place of its name results in equivalent pathname. With your patch that is not just no longer true, it's no longer true even when the symlink body does not contain that /// kludge - it can come in part from the symlink body and in part from the rest of pathname. I.e. you can't even tell if substitution is an equivalent replacement by looking at the symlink body alone.
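The substitution problem can be shown at the string level. This is a rough sketch with made-up names (has_alt_marker(), substitute_symlink()) and a made-up example: a relative symlink "link" whose body is "dir/", used in the pathname "link//info". It does no real lookup; it only shows where the "///" comes from.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the pathname contains the proposed "///" marker. */
int has_alt_marker(const char *path)
{
	return strstr(path, "///") != NULL;
}

/*
 * Textually replace a leading symlink name with the symlink body,
 * keeping the rest of the pathname ('rest') unchanged.
 */
int substitute_symlink(char *out, size_t outsz,
		       const char *body, const char *rest)
{
	int n;

	assert(out != NULL && outsz > 0);
	n = snprintf(out, outsz, "%s%s", body, rest);
	return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

With body "dir/" and rest "//info" the result is "dir///info". Neither the original pathname nor the symlink body contains the marker, but the substituted pathname does: one slash comes from the body and two from the remainder, so the body alone does not tell you whether substitution is safe.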
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 4:31 PM Al Viro wrote: > > On Tue, Aug 11, 2020 at 04:22:19PM +0200, Miklos Szeredi wrote: > > On Tue, Aug 11, 2020 at 4:08 PM Al Viro wrote: > > > > > > On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote: > > > > On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote: > > > > > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi > > > > > wrote: > > > > > > > > > > > I think we already lost that with the xattr API, that should have > > > > > > been > > > > > > done in a way that fits this philosophy. But given that we have > > > > > > "/" > > > > > > as the only special purpose char in filenames, and even repetitions > > > > > > are allowed, it's hard to think of a good way to do that. Pity. > > > > > > > > > > One way this could be solved is to allow opting into an alternative > > > > > path resolution mode. > > > > > > > > > > E.g. > > > > > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT); > > > > > > > > Proof of concept patch and test program below. > > > > > > > > Opted for triple slash in the hope that just maybe we could add a global > > > > /proc/sys/fs/resolve_alt knob to optionally turn on alternative > > > > (non-POSIX) path > > > > resolution without breaking too many things. Will try that later... > > > > > > > > Comments? > > > > > > Hell, NO. This is unspeakably tasteless. And full of lovely corner > > > cases wrt > > > symlink bodies, etc. > > > > It's disabled inside symlink body resolution. > > > > Rules are simple: > > > > - strip off trailing part after first instance of /// > > - perform path lookup as normal > > - resolve meta path after /// on result of normal lookup > > ... and interpolation of relative symlink body into the pathname does change > behaviour now, *including* the cases when said symlink body does not contain > that triple-X^Hslash garbage. Wonderful... Can you please explain? Thanks, Miklos
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 10:33:59AM -0400, Tang Jiye wrote:
> anyone knows how to post a question?

Generally the way you just have, except that you generally put it *after* the relevant parts of the quoted text (and remove the irrelevant ones).
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 04:22:19PM +0200, Miklos Szeredi wrote: > On Tue, Aug 11, 2020 at 4:08 PM Al Viro wrote: > > > > On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote: > > > On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote: > > > > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi wrote: > > > > > > > > > I think we already lost that with the xattr API, that should have been > > > > > done in a way that fits this philosophy. But given that we have "/" > > > > > as the only special purpose char in filenames, and even repetitions > > > > > are allowed, it's hard to think of a good way to do that. Pity. > > > > > > > > One way this could be solved is to allow opting into an alternative > > > > path resolution mode. > > > > > > > > E.g. > > > > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT); > > > > > > Proof of concept patch and test program below. > > > > > > Opted for triple slash in the hope that just maybe we could add a global > > > /proc/sys/fs/resolve_alt knob to optionally turn on alternative > > > (non-POSIX) path > > > resolution without breaking too many things. Will try that later... > > > > > > Comments? > > > > Hell, NO. This is unspeakably tasteless. And full of lovely corner cases > > wrt > > symlink bodies, etc. > > It's disabled inside symlink body resolution. > > Rules are simple: > > - strip off trailing part after first instance of /// > - perform path lookup as normal > - resolve meta path after /// on result of normal lookup ... and interpolation of relative symlink body into the pathname does change behaviour now, *including* the cases when said symlink body does not contain that triple-X^Hslash garbage. Wonderful...
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 4:08 PM Al Viro wrote: > > On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote: > > On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote: > > > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi wrote: > > > > > > > I think we already lost that with the xattr API, that should have been > > > > done in a way that fits this philosophy. But given that we have "/" > > > > as the only special purpose char in filenames, and even repetitions > > > > are allowed, it's hard to think of a good way to do that. Pity. > > > > > > One way this could be solved is to allow opting into an alternative > > > path resolution mode. > > > > > > E.g. > > > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT); > > > > Proof of concept patch and test program below. > > > > Opted for triple slash in the hope that just maybe we could add a global > > /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) > > path > > resolution without breaking too many things. Will try that later... > > > > Comments? > > Hell, NO. This is unspeakably tasteless. And full of lovely corner cases wrt > symlink bodies, etc. It's disabled inside symlink body resolution. Rules are simple: - strip off trailing part after first instance of /// - perform path lookup as normal - resolve meta path after /// on result of normal lookup Thanks, Miklos
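The first of the three rules above (cut the pathname at the first "///") can be sketched in user space as follows. This is not the code from the proof-of-concept patch; split_alt() is a hypothetical helper written only to make the rule concrete, and it performs no lookup of either part.

```c
#include <assert.h>
#include <string.h>

/*
 * Split 'path' at the first "///".
 * With a marker: copy the part before it into 'base', point *meta at
 * the part after it, and return 1.
 * Without a marker: copy the whole path into 'base', set *meta to NULL,
 * and return 0.
 * Return -1 if 'base' is too small.
 */
int split_alt(const char *path, char *base, size_t basesz, const char **meta)
{
	const char *mark;
	size_t len;

	assert(path != NULL && base != NULL && meta != NULL);

	mark = strstr(path, "///");
	len = mark ? (size_t)(mark - path) : strlen(path);
	if (len >= basesz)
		return -1;

	memcpy(base, path, len);
	base[len] = '\0';
	*meta = mark ? mark + 3 : NULL;
	return mark != NULL;
}
```

Applied to "foo/bar///mnt/info" this yields base "foo/bar" and meta path "mnt/info". Applied to a path such as "/sys/fs/cgroup///cgroup.controllers", which elsewhere in this thread is reported to be generated by plain concatenation, it yields base "/sys/fs/cgroup" and meta path "cgroup.controllers", which is the ambiguity the objections are about.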
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote: > On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote: > > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi wrote: > > > > > I think we already lost that with the xattr API, that should have been > > > done in a way that fits this philosophy. But given that we have "/" > > > as the only special purpose char in filenames, and even repetitions > > > are allowed, it's hard to think of a good way to do that. Pity. > > > > One way this could be solved is to allow opting into an alternative > > path resolution mode. > > > > E.g. > > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT); > > Proof of concept patch and test program below. > > Opted for triple slash in the hope that just maybe we could add a global > /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) > path > resolution without breaking too many things. Will try that later... > > Comments? Hell, NO. This is unspeakably tasteless. And full of lovely corner cases wrt symlink bodies, etc. Consider that one NAKed. I'm seriously unhappy with the entire fsinfo thing in general, but this one is really over the top.
Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi wrote:
>
> > I think we already lost that with the xattr API, that should have been
> > done in a way that fits this philosophy. But given that we have "/"
> > as the only special purpose char in filenames, and even repetitions
> > are allowed, it's hard to think of a good way to do that. Pity.
>
> One way this could be solved is to allow opting into an alternative
> path resolution mode.
>
> E.g.
> openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);

Proof of concept patch and test program below.

Opted for triple slash in the hope that just maybe we could add a global /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path resolution without breaking too many things. Will try that later...

Comments?

Thanks,
Miklos

cat_alt.c:
>8
#define _GNU_SOURCE
/* Header names did not survive archiving; this is the set the code requires. */
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/openat2.h>

#define RESOLVE_ALT 0x20 /* Alternative path walk mode where multiple slashes have special meaning */

int main(int argc, char *argv[])
{
	struct open_how how = {
		.flags = O_RDONLY,
		.resolve = RESOLVE_ALT,
	};
	int fd, res, i;
	char buf[65536], *end;
	const char *path = argv[1];
	int dfd = AT_FDCWD;

	if (argc < 2 || argc > 4)
		errx(1, "usage: %s path [dirfd] [--nofollow]", argv[0]);

	for (i = 2; i < argc; i++) {
		if (strcmp(argv[i], "--nofollow") == 0) {
			how.flags |= O_NOFOLLOW;
		} else {
			dfd = strtoul(argv[i], &end, 0);
			if (end == argv[i] || *end)
				errx(1, "invalid dirfd: %s", argv[i]);
		}
	}

	fd = syscall(__NR_openat2, dfd, path, &how, sizeof(how));
	if (fd == -1)
		err(1, "failed to open %s", argv[1]);

	while (1) {
		res = read(fd, buf, sizeof(buf));
		if (res == -1)
			err(1, "failed to read file");
		if (res == 0)
			break;

		write(1, buf, res);
	}
	close(fd);
	return 0;
}
>8

---
 fs/Makefile                  |    2
 fs/file_table.c              |   70 ++
 fs/fsmeta.c                  |  135 +++
 fs/internal.h                |    9 ++
 fs/mount.h                   |    4 +
 fs/namei.c                   |   77 +---
 fs/namespace.c               |   12 +++
 fs/open.c                    |    2
 fs/proc_namespace.c          |    2
 include/linux/fcntl.h        |    2
 include/linux/namei.h        |    3
 include/uapi/linux/magic.h   |    1
 include/uapi/linux/openat2.h |    2
 13 files changed, 282 insertions(+), 39 deletions(-)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y := open.o read_write.o file_table.
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
-		fs_types.o fs_context.o fs_parser.o fsopen.o
+		fs_types.o fs_context.o fs_parser.o fsopen.o fsmeta.o \
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o block_dev.o direct-io.o mpage.o
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -178,22 +178,9 @@ struct file *alloc_empty_file_noaccount(
 	return f;
 }
 
-/**
- * alloc_file - allocate and initialize a 'struct file'
- *
- * @path: the (dentry, vfsmount) pair for the new file
- * @flags: O_... flags with which the new file will be opened
- * @fop: the 'struct file_operations' for the new file
- */
-static struct file *alloc_file(const struct path *path, int flags,
-		const struct file_operations *fop)
+static void init_file(struct file *file, const struct path *path, int flags,
+		      const struct file_operations *fop)
 {
-	struct file *file;
-
-	file = alloc_empty_file(flags, current_cred());
-	if (IS_ERR(file))
-		return file;
-
 	file->f_path = *path;
 	file->f_inode = path->dentry->d_inode;
 	file->f_mapping = path->dentry->d_inode->i_mapping;
@@ -209,31 +196,66 @@ static struct file *alloc_file(const str
 	file->f_op = fop;
 	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
 		i_readcount_inc(path->dentry->d_inode);
+}
+
+/**
+ * alloc_file - allocate and initialize a 'struct file'
+ *
+ * @path: the (dentry, vfsmount) pair for the new file
+ * @flags: O_... flags with which the new file will be opened
+ * @fop: the 'struct file_operations' for the new file
+ */
+static struct file *alloc_file(const struct path *path, int flags,
+		c
file metadata via fs API (was: [GIT PULL] Filesystem Information)
On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi wrote: > I think we already lost that with the xattr API, that should have been > done in a way that fits this philosophy. But given that we have "/" > as the only special purpose char in filenames, and even repetitions > are allowed, it's hard to think of a good way to do that. Pity. One way this could be solved is to allow opting into an alternative path resolution mode. E.g. openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT); Yes, the implementation might be somewhat tricky, but that's another question. Also I'm pretty sure that we should be reducing the POSIX-ness of anything below "//" to the bare minimum. No seeking, etc I think this would open up some nice possibilities beyond the fsinfo thing. Thanks, Miklos