RE: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
From: Christian Brauner > Sent: 10 May 2024 11:55 > > > For the uapi issue you describe below my take would be that we should just > > try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe > > I'm biased from the gpu world, where we've been hammering it in that > > "O_CLOEXEC or bust" mantra since well over a decade. Really the only valid > > Oh, we're very much on the same page. All new file descriptor types that > I've added over the years are O_CLOEXEC by default. IOW, you need to > remove O_CLOEXEC explicitly (see pidfd as an example). And imho, any new > fd type that's added should just be O_CLOEXEC by default. For fd a shell redirect creates you may want so be able to say 'this fd will have O_CLOEXEC set after the next exec'. Also (possibly) a flag that can't be cleared once set and that gets kept by dup() etc. But maybe that is excessive? I've certainly used: # ip netns exec ns command 3
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
> For the uapi issue you describe below my take would be that we should just > try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe > I'm biased from the gpu world, where we've been hammering it in that > "O_CLOEXEC or bust" mantra since well over a decade. Really the only valid Oh, we're very much on the same page. All new file descriptor types that I've added over the years are O_CLOEXEC by default. IOW, you need to remove O_CLOEXEC explicitly (see pidfd as an example). And imho, any new fd type that's added should just be O_CLOEXEC by default.
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Thu, May 09, 2024 at 08:48:20AM -0700, Linus Torvalds wrote: > On Thu, 9 May 2024 at 04:39, Christian Brauner wrote: > > > > Not worth it without someone explaining in detail why imho. First pass > > should be to try and replace kcmp() in scenarios where it's obviously > > not needed or overkill. > > Ack. > > > I've added a CLASS(fd_raw) in a preliminary patch since we'll need that > > anyway which means that your comparison patch becomes even simpler imho. > > I've also added a selftest patch: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc > > LGTM. > > Maybe worth adding an explicit test for "open same file, but two > separate opens, F_DUPFD_QUERY returns 0? Just to clarify the "it's not > testing the file on the filesystem for equality, but the file pointer > itself". Yep, good point. Added now.
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Thu, 9 May 2024 at 04:39, Christian Brauner wrote: > > Not worth it without someone explaining in detail why imho. First pass > should be to try and replace kcmp() in scenarios where it's obviously > not needed or overkill. Ack. > I've added a CLASS(fd_raw) in a preliminary patch since we'll need that > anyway which means that your comparison patch becomes even simpler imho. > I've also added a selftest patch: > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc LGTM. Maybe worth adding an explicit test for "open same file, but two separate opens, F_DUPFD_QUERY returns 0? Just to clarify the "it's not testing the file on the filesystem for equality, but the file pointer itself". Linus
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Wed, May 08, 2024 at 10:14:44AM -0700, Linus Torvalds wrote: > On Wed, 8 May 2024 at 09:19, Linus Torvalds > wrote: > > > > So since we already have two versions of F_DUPFD (the other being > > F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend > > on that existing naming pattern, and called it F_DUPFD_QUERY instead. > > > > I'm not married to the name, so if somebody hates it, feel free to > > argue otherwise. > > Side note: with this patch, doing > >ret = fcntl(fd1, F_DUPFD_QUERY, fd2); > > will result in: > > -1 (EBADF): 'fd1' is not a valid file descriptor > -1 (EINVAL): old kernel that doesn't support F_DUPFD_QUERY > 0: fd2 does not refer to the same file as fd1 > 1: fd2 is the same 'struct file' as fd1 > > and it might be worth noting a couple of things here: > > (a) fd2 being an invalid file descriptor does not cause EBADF, it > just causes "does not match". > > (b) we *could* use more bits for more equality > > IOW, it would possibly make sense to extend the 0/1 result to be > > - bit #0: same file pointer > - bit #1: same path > - bit #2: same dentry > - bit #3: same inode > > which are all different levels of "sameness". Not worth it without someone explaining in detail why imho. First pass should be to try and replace kcmp() in scenarios where it's obviously not needed or overkill. I've added a CLASS(fd_raw) in a preliminary patch since we'll need that anyway which means that your comparison patch becomes even simpler imho. I've also added a selftest patch: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc ?
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Wed, 8 May 2024 at 09:19, Linus Torvalds wrote: > > So since we already have two versions of F_DUPFD (the other being > F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend > on that existing naming pattern, and called it F_DUPFD_QUERY instead. > > I'm not married to the name, so if somebody hates it, feel free to > argue otherwise. Side note: with this patch, doing ret = fcntl(fd1, F_DUPFD_QUERY, fd2); will result in: -1 (EBADF): 'fd1' is not a valid file descriptor -1 (EINVAL): old kernel that doesn't support F_DUPFD_QUERY 0: fd2 does not refer to the same file as fd1 1: fd2 is the same 'struct file' as fd1 and it might be worth noting a couple of things here: (a) fd2 being an invalid file descriptor does not cause EBADF, it just causes "does not match". (b) we *could* use more bits for more equality IOW, it would possibly make sense to extend the 0/1 result to be - bit #0: same file pointer - bit #1: same path - bit #2: same dentry - bit #3: same inode which are all different levels of "sameness". Does anybody care? Do we want to extend on this "sameness"? I'm not convinced, but it might be a good idea to document this as a possibly future extension, ie *if* what you care about is "same file pointer", maybe you should make sure to only look at bit #0. Linus
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Tue, 7 May 2024 at 12:07, Linus Torvalds wrote: > > That example thing shows that we shouldn't make it a FISAME ioctl - we > should make it a fcntl() instead, and it would just be a companion to > F_DUPFD. > > Doesn't that strike everybody as a *much* cleaner interface? I think > F_ISDUP would work very naturally indeed with F_DUPFD. So since we already have two versions of F_DUPFD (the other being F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend on that existing naming pattern, and called it F_DUPFD_QUERY instead. I'm not married to the name, so if somebody hates it, feel free to argue otherwise. But with that, the suggested patch would end up looking something like the attached (I also re-ordered the existing "F_LINUX_SPECIFIC_BASE" users, since one of them was out of numerical order). This really feels like a very natural thing, and yes, the 'same_fd()' function in systemd that Christian also pointed at could use this very naturally. Also note that I obviously haven't tested this. Because obviously this is trivially correct and cannot possibly have any bugs. Right? RIGHT? And yes, I did check - despite the odd jump in numbers, we've never had anything between F_NOTIFY (+2) and F_CANCELLK (+5). We added F_SETLEASE (+0) , F_GETLEASE (+1) and F_NOTIFY (+2) in 2.4.0-test9 (roughly October 2000, I didn't dig deeper). And then back in 2007 we suddenly jumped to F_CANCELLK (+5) in commit 9b9d2ab4154a ("locks: add lock cancel command"). I don't know why 3/4 were shunned. After that we had 22d2b35b200f ("F_DUPFD_CLOEXEC implementation") add F_DUPFD_CLOEXEC (+6). I'd have loved to put F_DUPFD_QUERY next to it, but +5 and +7 are both used. Linus fs/fcntl.c | 23 +++ include/uapi/linux/fcntl.h | 14 -- 2 files changed, 31 insertions(+), 6 deletions(-) diff --git a/fs/fcntl.c b/fs/fcntl.c index 54cc85d3338e..1ddb63f70445 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -327,6 +327,25 @@ static long fcntl_set_rw_hint(struct file *file, unsigned int cmd, return 0; } +/* + * Is the file descriptor a dup of the file? + */ +static long f_dupfd_query(int fd, struct file *filp) +{ + struct fd f = fdget_raw(fd); + + /* + * We can do the 'fdput()' immediately, as the only thing that + * matters is the pointer value which isn't changed by the fdput. + * + * Technically we didn't need a ref at all, and 'fdget()' was + * overkill, but given our lockless file pointer lookup, the + * alternatives are complicated. + */ + fdput(f); + return f.file == filp; +} + static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp) { @@ -342,6 +361,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, case F_DUPFD_CLOEXEC: err = f_dupfd(argi, filp, O_CLOEXEC); break; + case F_DUPFD_QUERY: + err = f_dupfd_query(argi, filp); + break; case F_GETFD: err = get_close_on_exec(fd) ? FD_CLOEXEC : 0; break; @@ -446,6 +468,7 @@ static int check_fcntl_cmd(unsigned cmd) switch (cmd) { case F_DUPFD: case F_DUPFD_CLOEXEC: + case F_DUPFD_QUERY: case F_GETFD: case F_SETFD: case F_GETFL: diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 282e90aeb163..c0bcc185fa48 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -8,6 +8,14 @@ #define F_SETLEASE (F_LINUX_SPECIFIC_BASE + 0) #define F_GETLEASE (F_LINUX_SPECIFIC_BASE + 1) +/* + * Request nofications on a directory. + * See below for events that may be notified. + */ +#define F_NOTIFY (F_LINUX_SPECIFIC_BASE + 2) + +#define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) + /* * Cancel a blocking posix lock; internal use only until we expose an * asynchronous lock api to userspace: @@ -17,12 +25,6 @@ /* Create a file descriptor with FD_CLOEXEC set. */ #define F_DUPFD_CLOEXEC (F_LINUX_SPECIFIC_BASE + 6) -/* - * Request nofications on a directory. - * See below for events that may be notified. - */ -#define F_NOTIFY (F_LINUX_SPECIFIC_BASE+2) - /* * Set and get of pipe page size array */
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Wed, May 08, 2024 at 12:08:57PM +0200, Christian Brauner wrote: > On Mon, May 06, 2024 at 04:29:44PM +0200, Christian König wrote: > > Am 04.05.24 um 20:20 schrieb Linus Torvalds: > > > On Sat, 4 May 2024 at 08:32, Linus Torvalds > > > wrote: > > > > Lookie here, the fundamental issue is that epoll can call '->poll()' > > > > on a file descriptor that is being closed concurrently. > > > Thinking some more about this, and replying to myself... > > > > > > Actually, I wonder if we could *really* fix this by simply moving the > > > eventpoll_release() to where it really belongs. > > > > > > If we did it in file_close_fd_locked(), it would actually make a > > > *lot* more sense. Particularly since eventpoll actually uses this: > > > > > > struct epoll_filefd { > > > struct file *file; > > > int fd; > > > } __packed; > > > > > > ie it doesn't just use the 'struct file *', it uses the 'fd' itself > > > (for ep_find()). > > > > > > (Strictly speaking, it should also have a pointer to the 'struct > > > files_struct' to make the 'int fd' be meaningful). > > > > While I completely agree on this I unfortunately have to ruin the idea. > > > > Before we had KCMP some people relied on the strange behavior of eventpoll > > to compare struct files when the fd is the same. > > > > I just recently suggested that solution to somebody at AMD as a workaround > > when KCMP is disabled because of security hardening and I'm pretty sure I've > > seen it somewhere else as well. > > > > So when we change that it would break (undocumented?) UAPI behavior. > > I've worked on that a bit yesterday and I learned new things about epoll > and ran into some limitations. > > Like, what happens if process P1 has a file descriptor registered in an > epoll instance and now P1 forks and creates P2. So every file that P1 > maintains gets copied into a new file descriptor table for P2. And the > same file descriptors refer to the same files for both P1 and P2. So this is pretty similar to any other struct file that has resources hanging off the struct file instead of the underlying inode. Like drm chardev files, where all the buffers, gpu contexts and everything else hangs off the file and there's no other way to get at them (except when exporting to some explicitly meant-for-sharing file like dma-buf). If you fork() that it's utter hilarity, which is why absolutely everyone should set O_CLOEXEC on these. Or EPOLL_CLOEXEC for epoll_create. For the uapi issue you describe below my take would be that we should just try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe I'm biased from the gpu world, where we've been hammering it in that "O_CLOEXEC or bust" mantra since well over a decade. Really the only valid use-case is something like systemd handing open files to a service, where it drops priviledges even well before the exec() call. But we can't switch around the defaults for any of these special open files with anything more than just a current seek position as state, since that breaks uapi. -Sima > > So there's two interesting cases here: > > (1) P2 explicitly removes the file descriptor from the epoll instance > via epoll_ctl(EPOLL_CTL_DEL). That removal affects both P1 and P2 > since the pair is only registered once and it isn't > marked whether it belongs to P1 and P2 fdtable. > > So effectively fork()ing with epoll creates a weird shared state > where removal of file descriptors that were registered before the > fork() affects both child and parent. > > I found that surprising even though I've worked with epoll quite > extensively in low-level userspace. > > (2) P2 doesn't close it's file descriptors. It just exits. Since removal > of the file descriptor from the epoll instance isn't done during > close() but during last fput() P1's epoll state remains unaffected > by P2's sloppy exit because P1 still holds references to all files > in its fdtable. > > (Sidenote, if one ends up adding every more duped-fds into epoll > instance that one doesn't explicitly close and all of them refer to > the same file wouldn't one just be allocating new epitems that > are kept around for a really long time?) > > So if the removal of the fd would now be done during close() or during > exit_files() when we call close_files() and since there's currently no > way of differentiating whether P1 or P2 own that fd it would mean that > (2) collapses into (1) and we'd always alter (1)'s epoll state. That > would be a UAPI break. > > So say we record the fdtable to get ownership of that file descriptor so > P2 doesn't close anything in (2) that really belongs to P1 to fix that > problem. > > But afaict, that would break another possible use-case. Namely, where P1 > creates an epoll instance and registeres fds and then fork()s to create > P2. Now P1 can exit and P2 takes over the epoll loop of P1. This > wouldn't work anymore
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Wed, May 08, 2024 at 10:32:08AM +0200, Daniel Vetter wrote: > On Wed, May 08, 2024 at 07:55:08AM +0200, Christian König wrote: > > Am 07.05.24 um 21:07 schrieb Linus Torvalds: > > > On Tue, 7 May 2024 at 11:04, Daniel Vetter wrote: > > > > On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote: > > > > > > > > > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl > > > > > too, if this is possibly a more common thing. and not just DRM wants > > > > > it. > > > > > > > > > > Would something like that work for you? > > > > Yes. > > > > > > > > Adding Simon and Pekka as two of the usual suspects for this kind of > > > > stuff. Also example code (the int return value is just so that callers > > > > know > > > > when kcmp isn't available, they all only care about equality): > > > > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239 > > > That example thing shows that we shouldn't make it a FISAME ioctl - we > > > should make it a fcntl() instead, and it would just be a companion to > > > F_DUPFD. > > > > > > Doesn't that strike everybody as a *much* cleaner interface? I think > > > F_ISDUP would work very naturally indeed with F_DUPFD. > > > > > > Yes? No? > > > > Sounds absolutely sane to me. > > Yeah fcntl(fd1, F_ISDUP, fd2); sounds extremely reasonable to me too. > > Aside, after some irc discussions I paged a few more of the relevant info > back in, and at least for dma-buf we kinda sorted this out by going away > from the singleton inode in this patch: ed63bb1d1f84 ("dma-buf: give each > buffer a full-fledged inode") > > It's uapi now so we can't ever undo that, but with hindsight just the > F_ISDUP is really what we wanted. Because we have no need for that inode > aside from the unique inode number that's only used to compare dma-buf fd > for sameness, e.g. > > https://gitlab.freedesktop.org/wlroots/wlroots/-/blob/master/render/vulkan/texture.c#L490 > > The one question I have is whether this could lead to some exploit tools, > because at least the android conformance test suite verifies that kcmp > isn't available to apps (which is where we need it, because even with all > the binder-based isolation gpu userspace still all run in the application > process due to performance reasons, any ipc at all is just too much). > > Otoh if we just add this to drm fd as an ioctl somewhere, then it will > also be available to every android app because they all do need the gpu > for rendering. So going with the full generic fcntl is probably best. > -Sima fcntl() will call security_file_fcntl(). IIRC, Android uses selinux and I'm pretty certain they'd disallow any fcntl() operations they deems unsafe. So a kernel update for them would likely require allow-listing the new fcntl(). Or if they do allow all new fnctl()s by default they'd have to disallow it if they thought that's an issue but really I don't even think there's any issue in that. I think kcmp() is a different problem because you can use it to compare objects from different tasks. The generic fcntl() wouldn't allow that.
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Mon, May 06, 2024 at 04:29:44PM +0200, Christian König wrote: > Am 04.05.24 um 20:20 schrieb Linus Torvalds: > > On Sat, 4 May 2024 at 08:32, Linus Torvalds > > wrote: > > > Lookie here, the fundamental issue is that epoll can call '->poll()' > > > on a file descriptor that is being closed concurrently. > > Thinking some more about this, and replying to myself... > > > > Actually, I wonder if we could *really* fix this by simply moving the > > eventpoll_release() to where it really belongs. > > > > If we did it in file_close_fd_locked(), it would actually make a > > *lot* more sense. Particularly since eventpoll actually uses this: > > > > struct epoll_filefd { > > struct file *file; > > int fd; > > } __packed; > > > > ie it doesn't just use the 'struct file *', it uses the 'fd' itself > > (for ep_find()). > > > > (Strictly speaking, it should also have a pointer to the 'struct > > files_struct' to make the 'int fd' be meaningful). > > While I completely agree on this I unfortunately have to ruin the idea. > > Before we had KCMP some people relied on the strange behavior of eventpoll > to compare struct files when the fd is the same. > > I just recently suggested that solution to somebody at AMD as a workaround > when KCMP is disabled because of security hardening and I'm pretty sure I've > seen it somewhere else as well. > > So when we change that it would break (undocumented?) UAPI behavior. I've worked on that a bit yesterday and I learned new things about epoll and ran into some limitations. Like, what happens if process P1 has a file descriptor registered in an epoll instance and now P1 forks and creates P2. So every file that P1 maintains gets copied into a new file descriptor table for P2. And the same file descriptors refer to the same files for both P1 and P2. So there's two interesting cases here: (1) P2 explicitly removes the file descriptor from the epoll instance via epoll_ctl(EPOLL_CTL_DEL). That removal affects both P1 and P2 since the pair is only registered once and it isn't marked whether it belongs to P1 and P2 fdtable. So effectively fork()ing with epoll creates a weird shared state where removal of file descriptors that were registered before the fork() affects both child and parent. I found that surprising even though I've worked with epoll quite extensively in low-level userspace. (2) P2 doesn't close it's file descriptors. It just exits. Since removal of the file descriptor from the epoll instance isn't done during close() but during last fput() P1's epoll state remains unaffected by P2's sloppy exit because P1 still holds references to all files in its fdtable. (Sidenote, if one ends up adding every more duped-fds into epoll instance that one doesn't explicitly close and all of them refer to the same file wouldn't one just be allocating new epitems that are kept around for a really long time?) So if the removal of the fd would now be done during close() or during exit_files() when we call close_files() and since there's currently no way of differentiating whether P1 or P2 own that fd it would mean that (2) collapses into (1) and we'd always alter (1)'s epoll state. That would be a UAPI break. So say we record the fdtable to get ownership of that file descriptor so P2 doesn't close anything in (2) that really belongs to P1 to fix that problem. But afaict, that would break another possible use-case. Namely, where P1 creates an epoll instance and registeres fds and then fork()s to create P2. Now P1 can exit and P2 takes over the epoll loop of P1. This wouldn't work anymore because P1 would deregister all fds it owns in that epoll instance during exit. I didn't see an immediate nice way of fixing that issue. But note that taking over an epoll loop from the parent doesn't work reliably for some file descriptors. Consider man signalfd(2): epoll(7) semantics If a process adds (via epoll_ctl(2)) a signalfd file descriptor to an epoll(7) instance, then epoll_wait(2) returns events only for signals sent to that process. In particular, if the process then uses fork(2) to create a child process, then the child will be able to read(2) signals that are sent to it using the signalfd file descriptor, but epoll_wait(2) will not indicate that the signalfd file descriptor is ready. In this scenario, a possible workaround is that after the fork(2), the child process can close the signalfd file descriptor that it inherited from the parent process and then create another signalfd file descriptor and add it to the epoll instance. Alternatively, the parent and the child could delay creating their (separate) signalfd file descriptors and adding them to the epoll instance until after the call to fork(2). So effectively P1 opens a signalfd and registers it in an epoll instance. Then
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
Am 08.05.24 um 10:23 schrieb Christian Brauner: On Tue, May 07, 2024 at 07:45:02PM +0200, Christian König wrote: Am 07.05.24 um 18:46 schrieb Linus Torvalds: On Tue, 7 May 2024 at 04:03, Daniel Vetter wrote: It's really annoying that on some distros/builds we don't have that, and for gpu driver stack reasons we _really_ need to know whether a fd is the same as another, due to some messy uniqueness requirements on buffer objects various drivers have. It's sad that such a simple thing would require two other horrid models (EPOLL or KCMP). There'[s a reason that KCMP is a config option - *some* of that is horrible code - but the "compare file descriptors for equality" is not that reason. Note that KCMP really is a broken mess. It's also a potential security hole, even for the simple things, because of how it ends up comparing kernel pointers (ie it doesn't just say "same file descriptor", it gives an ordering of them, so you can use KCMP to sort things in kernel space). And yes, it orders them after obfuscating the pointer, but it's still not something I would consider sane as a baseline interface. It was designed for checkpoint-restore, it's the wrong thing to use for some "are these file descriptors the same". The same argument goes for using EPOLL for that. Disgusting hack. Just what are the requirements for the GPU stack? Is one of the file descriptors "trusted", IOW, you know what kind it is? Because dammit, it's *so* easy to do. You could just add a core DRM ioctl for it. Literally just struct fd f1 = fdget(fd1); struct fd f2 = fdget(fd2); int same; same = f1.file && f1.file == f2.file; fdput(fd1); fdput(fd2); return same; where the only question is if you also woudl want to deal with O_PATH fd's, in which case the "fdget()" would be "fdget_raw()". Honestly, adding some DRM ioctl for this sounds hacky, but it sounds less hacky than relying on EPOLL or KCMP. I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it. Would something like that work for you? Well the generic approach yes, the DRM specific one maybe. IIRC we need to be able to compare both DRM as well as DMA-buf file descriptors. The basic problem userspace tries to solve is that drivers might get the same fd through two different code paths. For example application using OpenGL/Vulkan for rendering and VA-API for video decoding/encoding at the same time. Both APIs get a fd which identifies the device to use. It can be the same, but it doesn't have to. If it's the same device driver connection (or in kernel speak underlying struct file) then you can optimize away importing and exporting of buffers for example. Additional to that it makes cgroup accounting much easier because you don't count things twice because they are shared etc... One thing to keep in mind is that a generic VFS level comparing function will only catch the obvious case where you have dup() equivalency as outlined above by Linus. That's what most people are interested in and that could easily replace most kcmp() use-cases for comparing fds. But, of course there's the case where you have two file descriptors referring to two different files that reference the same underlying object (usually stashed in file->private_data). For most cases that problem can ofc be solved by comparing the underlying inode. But that doesn't work for drivers using the generic anonymous inode infrastructure because it uses the same inode for everything or for cases where the same underlying object can even be represented by different inodes. So for such cases a driver specific ioctl() to compare two fds will be needed in addition to the generic helper. At least for the DRM we already have some solution for that, see drmGetPrimaryDeviceNameFromFd() for an example. Basically the file->private_data is still something different, but we use this to figure out if we have two file descriptors (with individual struct files underneath) pointing to the same hw driver. This is important if you need to know if just importing/exporting of DMA-buf handles between the two file descriptors is enough or if you deal with two different hw devices and need to do stuff like format conversion etc... Regards, Christian.
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Wed, May 08, 2024 at 07:55:08AM +0200, Christian König wrote: > Am 07.05.24 um 21:07 schrieb Linus Torvalds: > > On Tue, 7 May 2024 at 11:04, Daniel Vetter wrote: > > > On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote: > > > > > > > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl > > > > too, if this is possibly a more common thing. and not just DRM wants > > > > it. > > > > > > > > Would something like that work for you? > > > Yes. > > > > > > Adding Simon and Pekka as two of the usual suspects for this kind of > > > stuff. Also example code (the int return value is just so that callers > > > know > > > when kcmp isn't available, they all only care about equality): > > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239 > > That example thing shows that we shouldn't make it a FISAME ioctl - we > > should make it a fcntl() instead, and it would just be a companion to > > F_DUPFD. > > > > Doesn't that strike everybody as a *much* cleaner interface? I think > > F_ISDUP would work very naturally indeed with F_DUPFD. > > > > Yes? No? > > Sounds absolutely sane to me. Yeah fcntl(fd1, F_ISDUP, fd2); sounds extremely reasonable to me too. Aside, after some irc discussions I paged a few more of the relevant info back in, and at least for dma-buf we kinda sorted this out by going away from the singleton inode in this patch: ed63bb1d1f84 ("dma-buf: give each buffer a full-fledged inode") It's uapi now so we can't ever undo that, but with hindsight just the F_ISDUP is really what we wanted. Because we have no need for that inode aside from the unique inode number that's only used to compare dma-buf fd for sameness, e.g. https://gitlab.freedesktop.org/wlroots/wlroots/-/blob/master/render/vulkan/texture.c#L490 The one question I have is whether this could lead to some exploit tools, because at least the android conformance test suite verifies that kcmp isn't available to apps (which is where we need it, because even with all the binder-based isolation gpu userspace still all run in the application process due to performance reasons, any ipc at all is just too much). Otoh if we just add this to drm fd as an ioctl somewhere, then it will also be available to every android app because they all do need the gpu for rendering. So going with the full generic fcntl is probably best. -Sima -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Tue, May 07, 2024 at 07:45:02PM +0200, Christian König wrote: > Am 07.05.24 um 18:46 schrieb Linus Torvalds: > > On Tue, 7 May 2024 at 04:03, Daniel Vetter wrote: > > > It's really annoying that on some distros/builds we don't have that, and > > > for gpu driver stack reasons we _really_ need to know whether a fd is the > > > same as another, due to some messy uniqueness requirements on buffer > > > objects various drivers have. > > It's sad that such a simple thing would require two other horrid > > models (EPOLL or KCMP). > > > > There'[s a reason that KCMP is a config option - *some* of that is > > horrible code - but the "compare file descriptors for equality" is not > > that reason. > > > > Note that KCMP really is a broken mess. It's also a potential security > > hole, even for the simple things, because of how it ends up comparing > > kernel pointers (ie it doesn't just say "same file descriptor", it > > gives an ordering of them, so you can use KCMP to sort things in > > kernel space). > > > > And yes, it orders them after obfuscating the pointer, but it's still > > not something I would consider sane as a baseline interface. It was > > designed for checkpoint-restore, it's the wrong thing to use for some > > "are these file descriptors the same". > > > > The same argument goes for using EPOLL for that. Disgusting hack. > > > > Just what are the requirements for the GPU stack? Is one of the file > > descriptors "trusted", IOW, you know what kind it is? > > > > Because dammit, it's *so* easy to do. You could just add a core DRM > > ioctl for it. Literally just > > > > struct fd f1 = fdget(fd1); > > struct fd f2 = fdget(fd2); > > int same; > > > > same = f1.file && f1.file == f2.file; > > fdput(fd1); > > fdput(fd2); > > return same; > > > > where the only question is if you also woudl want to deal with O_PATH > > fd's, in which case the "fdget()" would be "fdget_raw()". > > > > Honestly, adding some DRM ioctl for this sounds hacky, but it sounds > > less hacky than relying on EPOLL or KCMP. > > > > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl > > too, if this is possibly a more common thing. and not just DRM wants > > it. > > > > Would something like that work for you? > > Well the generic approach yes, the DRM specific one maybe. IIRC we need to > be able to compare both DRM as well as DMA-buf file descriptors. > > The basic problem userspace tries to solve is that drivers might get the > same fd through two different code paths. > > For example application using OpenGL/Vulkan for rendering and VA-API for > video decoding/encoding at the same time. > > Both APIs get a fd which identifies the device to use. It can be the same, > but it doesn't have to. > > If it's the same device driver connection (or in kernel speak underlying > struct file) then you can optimize away importing and exporting of buffers > for example. > > Additional to that it makes cgroup accounting much easier because you don't > count things twice because they are shared etc... One thing to keep in mind is that a generic VFS level comparing function will only catch the obvious case where you have dup() equivalency as outlined above by Linus. That's what most people are interested in and that could easily replace most kcmp() use-cases for comparing fds. But, of course there's the case where you have two file descriptors referring to two different files that reference the same underlying object (usually stashed in file->private_data). For most cases that problem can ofc be solved by comparing the underlying inode. But that doesn't work for drivers using the generic anonymous inode infrastructure because it uses the same inode for everything or for cases where the same underlying object can even be represented by different inodes. So for such cases a driver specific ioctl() to compare two fds will be needed in addition to the generic helper.
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Tue, May 07, 2024 at 12:07:10PM -0700, Linus Torvalds wrote: > On Tue, 7 May 2024 at 11:04, Daniel Vetter wrote: > > > > On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote: > > > > > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl > > > too, if this is possibly a more common thing. and not just DRM wants > > > it. > > > > > > Would something like that work for you? > > > > Yes. > > > > Adding Simon and Pekka as two of the usual suspects for this kind of > > stuff. Also example code (the int return value is just so that callers know > > when kcmp isn't available, they all only care about equality): > > > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239 > > That example thing shows that we shouldn't make it a FISAME ioctl - we > should make it a fcntl() instead, and it would just be a companion to > F_DUPFD. > > Doesn't that strike everybody as a *much* cleaner interface? I think +1 See https://github.com/systemd/systemd/blob/a4f0e0da3573a10bc5404142be8799418760b1d1/src/basic/fd-util.c#L517 that's another heavy user of this kind of functionality.
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
Am 08.05.24 um 09:51 schrieb Michel Dänzer: On 2024-05-07 19:45, Christian König wrote: Am 07.05.24 um 18:46 schrieb Linus Torvalds: Just what are the requirements for the GPU stack? Is one of the file descriptors "trusted", IOW, you know what kind it is? Because dammit, it's *so* easy to do. You could just add a core DRM ioctl for it. Literally just struct fd f1 = fdget(fd1); struct fd f2 = fdget(fd2); int same; same = f1.file && f1.file == f2.file; fdput(fd1); fdput(fd2); return same; where the only question is if you also woudl want to deal with O_PATH fd's, in which case the "fdget()" would be "fdget_raw()". Honestly, adding some DRM ioctl for this sounds hacky, but it sounds less hacky than relying on EPOLL or KCMP. I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it. Would something like that work for you? Well the generic approach yes, the DRM specific one maybe. IIRC we need to be able to compare both DRM as well as DMA-buf file descriptors. The basic problem userspace tries to solve is that drivers might get the same fd through two different code paths. For example application using OpenGL/Vulkan for rendering and VA-API for video decoding/encoding at the same time. Both APIs get a fd which identifies the device to use. It can be the same, but it doesn't have to. If it's the same device driver connection (or in kernel speak underlying struct file) then you can optimize away importing and exporting of buffers for example. It's not just about optimization. Mesa needs to know this for correct tracking of GEM handles. If it guesses incorrectly, there can be misbehaviour. Oh, yeah good point as well. I think we can say in general that if two userspace driver libraries would mess with the state of an fd at the same time without knowing of each other bad things would happen. Regards, Christian.
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On 2024-05-07 19:45, Christian König wrote: > Am 07.05.24 um 18:46 schrieb Linus Torvalds: >> >> Just what are the requirements for the GPU stack? Is one of the file >> descriptors "trusted", IOW, you know what kind it is? >> >> Because dammit, it's *so* easy to do. You could just add a core DRM >> ioctl for it. Literally just >> >> struct fd f1 = fdget(fd1); >> struct fd f2 = fdget(fd2); >> int same; >> >> same = f1.file && f1.file == f2.file; >> fdput(fd1); >> fdput(fd2); >> return same; >> >> where the only question is if you also woudl want to deal with O_PATH >> fd's, in which case the "fdget()" would be "fdget_raw()". >> >> Honestly, adding some DRM ioctl for this sounds hacky, but it sounds >> less hacky than relying on EPOLL or KCMP. >> >> I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl >> too, if this is possibly a more common thing. and not just DRM wants >> it. >> >> Would something like that work for you? > > Well the generic approach yes, the DRM specific one maybe. IIRC we need to be > able to compare both DRM as well as DMA-buf file descriptors. > > The basic problem userspace tries to solve is that drivers might get the same > fd through two different code paths. > > For example application using OpenGL/Vulkan for rendering and VA-API for > video decoding/encoding at the same time. > > Both APIs get a fd which identifies the device to use. It can be the same, > but it doesn't have to. > > If it's the same device driver connection (or in kernel speak underlying > struct file) then you can optimize away importing and exporting of buffers > for example. It's not just about optimization. Mesa needs to know this for correct tracking of GEM handles. If it guesses incorrectly, there can be misbehaviour. -- Earthling Michel Dänzer| https://redhat.com Libre software enthusiast | Mesa and Xwayland developer
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
Am 07.05.24 um 21:07 schrieb Linus Torvalds: On Tue, 7 May 2024 at 11:04, Daniel Vetter wrote: On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote: I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it. Would something like that work for you? Yes. Adding Simon and Pekka as two of the usual suspects for this kind of stuff. Also example code (the int return value is just so that callers know when kcmp isn't available, they all only care about equality): https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239 That example thing shows that we shouldn't make it a FISAME ioctl - we should make it a fcntl() instead, and it would just be a companion to F_DUPFD. Doesn't that strike everybody as a *much* cleaner interface? I think F_ISDUP would work very naturally indeed with F_DUPFD. Yes? No? Sounds absolutely sane to me. Christian. Linus
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Tue, 7 May 2024 at 11:04, Daniel Vetter wrote: > > On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote: > > > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl > > too, if this is possibly a more common thing. and not just DRM wants > > it. > > > > Would something like that work for you? > > Yes. > > Adding Simon and Pekka as two of the usual suspects for this kind of > stuff. Also example code (the int return value is just so that callers know > when kcmp isn't available, they all only care about equality): > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239 That example thing shows that we shouldn't make it a FISAME ioctl - we should make it a fcntl() instead, and it would just be a companion to F_DUPFD. Doesn't that strike everybody as a *much* cleaner interface? I think F_ISDUP would work very naturally indeed with F_DUPFD. Yes? No? Linus
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote: > On Tue, 7 May 2024 at 04:03, Daniel Vetter wrote: > > > > It's really annoying that on some distros/builds we don't have that, and > > for gpu driver stack reasons we _really_ need to know whether a fd is the > > same as another, due to some messy uniqueness requirements on buffer > > objects various drivers have. > > It's sad that such a simple thing would require two other horrid > models (EPOLL or KCMP). > > There'[s a reason that KCMP is a config option - *some* of that is > horrible code - but the "compare file descriptors for equality" is not > that reason. > > Note that KCMP really is a broken mess. It's also a potential security > hole, even for the simple things, because of how it ends up comparing > kernel pointers (ie it doesn't just say "same file descriptor", it > gives an ordering of them, so you can use KCMP to sort things in > kernel space). > > And yes, it orders them after obfuscating the pointer, but it's still > not something I would consider sane as a baseline interface. It was > designed for checkpoint-restore, it's the wrong thing to use for some > "are these file descriptors the same". > > The same argument goes for using EPOLL for that. Disgusting hack. > > Just what are the requirements for the GPU stack? Is one of the file > descriptors "trusted", IOW, you know what kind it is? > > Because dammit, it's *so* easy to do. You could just add a core DRM > ioctl for it. Literally just > > struct fd f1 = fdget(fd1); > struct fd f2 = fdget(fd2); > int same; > > same = f1.file && f1.file == f2.file; > fdput(fd1); > fdput(fd2); > return same; > > where the only question is if you also woudl want to deal with O_PATH > fd's, in which case the "fdget()" would be "fdget_raw()". > > Honestly, adding some DRM ioctl for this sounds hacky, but it sounds > less hacky than relying on EPOLL or KCMP. Well, in slightly more code (because it's part of the "import this dma-buf/dma-fence/whatever fd into a driver object" ioctl) this is what we do. The issue is that there's generic userspace (like compositors) that sees these things fly by and would also like to know whether the other side they receive them from is doing nasty stuff/buggy/evil. And they don't have access to the device drm fd (since those are a handful of layers away behind the opengl/vulkan userspace drivers even if the compositor could get at them, and in some cases not even that). So if we do this in drm we'd essentially have to create a special drm_compare_files chardev, put the ioctl there and then tell everyone to make that thing world-accessible. Which is just too close to a real syscall that it's offensive, and hey kcmp does what we want already (but unfortunately also way more). So we rejected adding that to drm. But we did think about it. > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl > too, if this is possibly a more common thing. and not just DRM wants > it. > > Would something like that work for you? Yes. Adding Simon and Pekka as two of the usual suspects for this kind of stuff. Also example code (the int return value is just so that callers know when kcmp isn't available, they all only care about equality): https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239 -Sima -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
Am 07.05.24 um 18:46 schrieb Linus Torvalds: On Tue, 7 May 2024 at 04:03, Daniel Vetter wrote: It's really annoying that on some distros/builds we don't have that, and for gpu driver stack reasons we _really_ need to know whether a fd is the same as another, due to some messy uniqueness requirements on buffer objects various drivers have. It's sad that such a simple thing would require two other horrid models (EPOLL or KCMP). There'[s a reason that KCMP is a config option - *some* of that is horrible code - but the "compare file descriptors for equality" is not that reason. Note that KCMP really is a broken mess. It's also a potential security hole, even for the simple things, because of how it ends up comparing kernel pointers (ie it doesn't just say "same file descriptor", it gives an ordering of them, so you can use KCMP to sort things in kernel space). And yes, it orders them after obfuscating the pointer, but it's still not something I would consider sane as a baseline interface. It was designed for checkpoint-restore, it's the wrong thing to use for some "are these file descriptors the same". The same argument goes for using EPOLL for that. Disgusting hack. Just what are the requirements for the GPU stack? Is one of the file descriptors "trusted", IOW, you know what kind it is? Because dammit, it's *so* easy to do. You could just add a core DRM ioctl for it. Literally just struct fd f1 = fdget(fd1); struct fd f2 = fdget(fd2); int same; same = f1.file && f1.file == f2.file; fdput(fd1); fdput(fd2); return same; where the only question is if you also woudl want to deal with O_PATH fd's, in which case the "fdget()" would be "fdget_raw()". Honestly, adding some DRM ioctl for this sounds hacky, but it sounds less hacky than relying on EPOLL or KCMP. I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it. Would something like that work for you? Well the generic approach yes, the DRM specific one maybe. IIRC we need to be able to compare both DRM as well as DMA-buf file descriptors. The basic problem userspace tries to solve is that drivers might get the same fd through two different code paths. For example application using OpenGL/Vulkan for rendering and VA-API for video decoding/encoding at the same time. Both APIs get a fd which identifies the device to use. It can be the same, but it doesn't have to. If it's the same device driver connection (or in kernel speak underlying struct file) then you can optimize away importing and exporting of buffers for example. Additional to that it makes cgroup accounting much easier because you don't count things twice because they are shared etc... Regards, Christian. Linus
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Tue, 7 May 2024 at 04:03, Daniel Vetter wrote: > > It's really annoying that on some distros/builds we don't have that, and > for gpu driver stack reasons we _really_ need to know whether a fd is the > same as another, due to some messy uniqueness requirements on buffer > objects various drivers have. It's sad that such a simple thing would require two other horrid models (EPOLL or KCMP). There'[s a reason that KCMP is a config option - *some* of that is horrible code - but the "compare file descriptors for equality" is not that reason. Note that KCMP really is a broken mess. It's also a potential security hole, even for the simple things, because of how it ends up comparing kernel pointers (ie it doesn't just say "same file descriptor", it gives an ordering of them, so you can use KCMP to sort things in kernel space). And yes, it orders them after obfuscating the pointer, but it's still not something I would consider sane as a baseline interface. It was designed for checkpoint-restore, it's the wrong thing to use for some "are these file descriptors the same". The same argument goes for using EPOLL for that. Disgusting hack. Just what are the requirements for the GPU stack? Is one of the file descriptors "trusted", IOW, you know what kind it is? Because dammit, it's *so* easy to do. You could just add a core DRM ioctl for it. Literally just struct fd f1 = fdget(fd1); struct fd f2 = fdget(fd2); int same; same = f1.file && f1.file == f2.file; fdput(fd1); fdput(fd2); return same; where the only question is if you also woudl want to deal with O_PATH fd's, in which case the "fdget()" would be "fdget_raw()". Honestly, adding some DRM ioctl for this sounds hacky, but it sounds less hacky than relying on EPOLL or KCMP. I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it. Would something like that work for you? Linus
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
On Mon, May 06, 2024 at 04:29:44PM +0200, Christian König wrote: > Am 04.05.24 um 20:20 schrieb Linus Torvalds: > > On Sat, 4 May 2024 at 08:32, Linus Torvalds > > wrote: > > > Lookie here, the fundamental issue is that epoll can call '->poll()' > > > on a file descriptor that is being closed concurrently. > > Thinking some more about this, and replying to myself... > > > > Actually, I wonder if we could *really* fix this by simply moving the > > eventpoll_release() to where it really belongs. > > > > If we did it in file_close_fd_locked(), it would actually make a > > *lot* more sense. Particularly since eventpoll actually uses this: > > > > struct epoll_filefd { > > struct file *file; > > int fd; > > } __packed; > > > > ie it doesn't just use the 'struct file *', it uses the 'fd' itself > > (for ep_find()). > > > > (Strictly speaking, it should also have a pointer to the 'struct > > files_struct' to make the 'int fd' be meaningful). > > While I completely agree on this I unfortunately have to ruin the idea. > > Before we had KCMP some people relied on the strange behavior of eventpoll > to compare struct files when the fd is the same. > > I just recently suggested that solution to somebody at AMD as a workaround > when KCMP is disabled because of security hardening and I'm pretty sure I've > seen it somewhere else as well. > > So when we change that it would break (undocumented?) UAPI behavior. Uh extremely aside, but doesn't this mean we should just enable kcmp on files unconditionally, since there's an alternative? Or a least everywhere CONFIG_EPOLL is enabled? It's really annoying that on some distros/builds we don't have that, and for gpu driver stack reasons we _really_ need to know whether a fd is the same as another, due to some messy uniqueness requirements on buffer objects various drivers have. -Sima > > Regards, > Christian. > > > > > IOW, eventpoll already considers the file _descriptor_ relevant, not > > just the file pointer, and that's destroyed at *close* time, not at > > 'fput()' time. > > > > Yeah, yeah, the locking situation in file_close_fd_locked() is a bit > > inconvenient, but if we can solve that, it would solve the problem in > > a fundamentally different way: remove the ep iterm before the > > file->f_count has actually been decremented, so the whole "race with > > fput()" would just go away entirely. > > > > I dunno. I think that would be the right thing to do, but I wouldn't > > be surprised if some disgusting eventpoll user then might depend on > > the current situation where the eventpoll thing stays around even > > after the close() if you have another copy of the file open. > > > > Linus > > ___ > > Linaro-mm-sig mailing list -- linaro-mm-...@lists.linaro.org > > To unsubscribe send an email to linaro-mm-sig-le...@lists.linaro.org > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
Am 04.05.24 um 20:20 schrieb Linus Torvalds: On Sat, 4 May 2024 at 08:32, Linus Torvalds wrote: Lookie here, the fundamental issue is that epoll can call '->poll()' on a file descriptor that is being closed concurrently. Thinking some more about this, and replying to myself... Actually, I wonder if we could *really* fix this by simply moving the eventpoll_release() to where it really belongs. If we did it in file_close_fd_locked(), it would actually make a *lot* more sense. Particularly since eventpoll actually uses this: struct epoll_filefd { struct file *file; int fd; } __packed; ie it doesn't just use the 'struct file *', it uses the 'fd' itself (for ep_find()). (Strictly speaking, it should also have a pointer to the 'struct files_struct' to make the 'int fd' be meaningful). While I completely agree on this I unfortunately have to ruin the idea. Before we had KCMP some people relied on the strange behavior of eventpoll to compare struct files when the fd is the same. I just recently suggested that solution to somebody at AMD as a workaround when KCMP is disabled because of security hardening and I'm pretty sure I've seen it somewhere else as well. So when we change that it would break (undocumented?) UAPI behavior. Regards, Christian. IOW, eventpoll already considers the file _descriptor_ relevant, not just the file pointer, and that's destroyed at *close* time, not at 'fput()' time. Yeah, yeah, the locking situation in file_close_fd_locked() is a bit inconvenient, but if we can solve that, it would solve the problem in a fundamentally different way: remove the ep iterm before the file->f_count has actually been decremented, so the whole "race with fput()" would just go away entirely. I dunno. I think that would be the right thing to do, but I wouldn't be surprised if some disgusting eventpoll user then might depend on the current situation where the eventpoll thing stays around even after the close() if you have another copy of the file open. Linus ___ Linaro-mm-sig mailing list -- linaro-mm-...@lists.linaro.org To unsubscribe send an email to linaro-mm-sig-le...@lists.linaro.org