Re: [PATCH 2/4] pid: add pidfd_open()
On Sat, Mar 30, 2019 at 7:30 AM Jonathan Kowalski wrote: > > On Sat, Mar 30, 2019 at 7:39 AM Daniel Colascione wrote: > > > > [SNIP] > > > > Thanks again. > > > > I agree that the operation we're discussing has a simple signature, > > but signature flexibility isn't the only reason to prefer a system > > call over an ioctl. There are other reasons for preferring system > > calls to ioctls (safety, tracing, etc.) that apply even if the > > operation we're discussing has a relatively simple signature: for > > example, every system call has a distinct and convenient ftrace event, > > but ioctls don't; strace filtering Just Works on a > > system-call-by-system-call basis, but it doesn't for ioctls; and > > It does for those with a unique number. There's no such thing as a unique ioctl number though. Anyone is free to create an ioctl with a conflicting number. Sure, you can guarantee ioctl request number uniqueness within things that ship with the kernel, but you can also guarantee system call number uniqueness, so what do we gain by using an ioctl? And yes, you can do *some* filtering based on ioctl command, but it's frequently more awkward than the system call equivalent and sometimes doesn't work at all. What's the strace -e pidfd_open equivalent for an ioctl? Grep isn't quite equivalent. > > documentation for system calls is much more discoverable (e.g., man > > -k) than documentation for ioctls. Even if the distinction doesn't > > matter much, IMHO, it still matters a little, enough to favor a system > > call without an offsetting advantage for the ioctl option. > > It will be alongside other I/O operations in the pidfd man page. > > > > > > If you start adding a system call for every specific operation on file > > > descriptors, it *will* become a problem. > > > > I'm not sure what you mean. Do you mean that adding a top-level system > > call for every operation that might apply to one specific kind of file > > descriptor would lead, as the overall result, to the kernel having > > enough system calls to cause negative consequences? I'm not sure I > > agree, but accepting this idea for the sake of discussion: shouldn't > > we be more okay with system calls for features present on almost all > > systems --- like procfs --- even if we punt to ioctls very rarely-used > > functionality, e.g., some hypothetical special squeak noise that you > > could get some specific 1995 adlib clone to make? > > Consider if we were to do: > > int procpidfd_open(int pidfd, int procrootfd); > > This system call is useless besides a very specific operation for a > very specific usecase on a file descriptor. I don't understand this argument based on an operation being "very specific". Are you suggesting that epoll_wait should have been an ioctl? I don't see how an operation being specific to a type of file descriptor is an argument for making that operation an ioctl instead of a system call. > Then, you figure you might > want to support procpidfd back to pidfd (although YAGNI), so you will > add a CMD or flags argument now, Why would we need a system call at all? And why are we invoking hypothetical argument B in an argument against adding operation A as a system call? B doesn't exist yet. As Christian mentioned, we could create a named /proc/pid/handle file which, when opened, would yield a pidfd. More generally, though: if we expose new functionality, we can make a new system call. Why is that worse than adding a new ioctl code? > int procpidfd_open(int pidfd, int procrootfd/procpidfd, unsigned int flags); > > int procpidfd = procpidfd_open(fd, procrootfd, PIDFD_TO_PROCPIDFD); > int pidfd = procpidfd_open(-1, procpidfd, PROCPIDFD_TO_PIDFD); > > vs procpidfd2pidfd(int procpidfd); if you did not foresee the need. > Then, you want pidfd2pid, and so on. > > If you don't, you will have to add a command interface in the new > system call, which changes parameters depending on the flags. This is > already starting to get ugly. > In the end, it's a matter of taste, I don't think it's quite a matter of taste. That idiom suggests that the two options are roughly functionally equivalent. In this case, there are clear and specific technical reasons to prefer system calls. We're not talking about two equivalent approaches. Making the operation a system call gives users more power, and we shouldn't deprive them of this power without a good reason. So far, I haven't seen such a reason. > this > pattern if exploited leads to endless proliferation of the system call > interface, often ridden with short-sighted APIs, because you cannot > know if you want a focused call or a command style call. Bad interfaces proliferate no matter what, and there's no difference in support burden between an ioctl and a system call: both need to work indefinitely. Are you saying that it's easier to remove a bad interface if it's an ioctl instead of a system call? I don't recall any precedent, but maybe I'm wrong. > ioctl is > already a command
Re: [PATCH 2/4] pid: add pidfd_open()
On Sat, Mar 30, 2019 at 7:39 AM Daniel Colascione wrote: > > [SNIP] > > Thanks again. > > I agree that the operation we're discussing has a simple signature, > but signature flexibility isn't the only reason to prefer a system > call over an ioctl. There are other reasons for preferring system > calls to ioctls (safety, tracing, etc.) that apply even if the > operation we're discussing has a relatively simple signature: for > example, every system call has a distinct and convenient ftrace event, > but ioctls don't; strace filtering Just Works on a > system-call-by-system-call basis, but it doesn't for ioctls; and It does for those with a unique number. > documentation for system calls is much more discoverable (e.g., man > -k) than documentation for ioctls. Even if the distinction doesn't > matter much, IMHO, it still matters a little, enough to favor a system > call without an offsetting advantage for the ioctl option. It will be alongside other I/O operations in the pidfd man page. > > > If you start adding a system call for every specific operation on file > > descriptors, it *will* become a problem. > > I'm not sure what you mean. Do you mean that adding a top-level system > call for every operation that might apply to one specific kind of file > descriptor would lead, as the overall result, to the kernel having > enough system calls to cause negative consequences? I'm not sure I > agree, but accepting this idea for the sake of discussion: shouldn't > we be more okay with system calls for features present on almost all > systems --- like procfs --- even if we punt to ioctls very rarely-used > functionality, e.g., some hypothetical special squeak noise that you > could get some specific 1995 adlib clone to make? Consider if we were to do: int procpidfd_open(int pidfd, int procrootfd); This system call is useless besides a very specific operation for a very specific usecase on a file descriptor. Then, you figure you might want to support procpidfd back to pidfd (although YAGNI), so you will add a CMD or flags argument now, int procpidfd_open(int pidfd, int procrootfd/procpidfd, unsigned int flags); int procpidfd = procpidfd_open(fd, procrootfd, PIDFD_TO_PROCPIDFD); int pidfd = procpidfd_open(-1, procpidfd, PROCPIDFD_TO_PIDFD); vs procpidfd2pidfd(int procpidfd); if you did not foresee the need. Then, you want pidfd2pid, and so on. If you don't, you will have to add a command interface in the new system call, which changes parameters depending on the flags. This is already starting to get ugly. In the end, it's a matter of taste, this pattern if exploited leads to endless proliferation of the system call interface, often ridden with short-sighted APIs, because you cannot know if you want a focused call or a command style call. ioctl is already a command interface, and 10 system calls for each command is _NOT_ nice, from a user perspective. > > > Besides, the translation is > > just there because it is racy to do in userspace, it is not some well > > defined core kernel functionality. > > Therefore, it is just a way to > > enter the kernel to do the openat in a race free and safe manner. > > I agree that the translation has to be done in the kernel, not > userspace, and that the kernel must provide to userspace some > interface for requesting that the translation happen: we're just > discussing the shape of this interface. Shouldn't all interfaces > provided by the kernel to userspace be equally well defined? I'm not > sure that the internal simplicity of the operation matters much > either. There are already explicit system calls for some > simple-to-implement things, e.g., timerfd_gettime. It's worth noting > that timerfd is (IIRC), like procfs, a feature that's both ubiquitous > and optional. It is well defined, it has a well defined signature, and it will error out if you don't use it properly. Again, what it does is very limited and niche. I am not sure it warrants a system call of its own. timerfd_gettime was an afterthought anyway, that's probably not a good example (it was more to just match the POSIX timers interface as the original timerfd never had support for querying, so the split into two steps, create and initialize, you could argue one could do it without a syscall, but it still has a well defined argument list, and accepting elaborate data structures into an ioctl is not a good interface to plumb, so that's easily justified). It's much like socket and setsockopt/getsockopt in nature. I would even say APIs separating creation and configuration age well and are better, but a process doesn't fit such a model cleanly. > > > As is, the facility being provided through an ioctl on the pidfd is > > not something I'd consider a problem. > > You're right that from a signature perspective, using an ioctl isn't a > problem. I just want to make sure we take into account the other, > non-signature advantages that system calls have over ioctls. > > > I think the translation stuff >
Re: [PATCH 2/4] pid: add pidfd_open()
On Fri, Mar 29, 2019 at 11:25 PM Jonathan Kowalski wrote: > > On Sat, Mar 30, 2019 at 5:35 AM Daniel Colascione wrote: > > > > On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner > > wrote: > > > > > > > All that said, thanks for the work on this once again. My intention is > > > > just that we don't end up with an API that could have been done better > > > > and be cleaner to use for potential users in the coming years. > > > > > > Thanks for your input on all of this. I still don't find multiplexers in > > > the style of seccomp()/fsconfig()/keyctl() to be a problem since they > > > deal with a specific task. They are very much different from ioctl()s in > > > that regard. But since Joel, you, and Daniel found the pidctl() approach > > > not very nice I dropped it. The interface needs to be satisfactory for > > > all of us especially since Android and other system managers will be the > > > main consumers. > > > > Thanks. > > > > > So let's split this into pidfd_open(pid_t pid, unsigned int flags) which > > > allows to cleanly get pidfds independent procfs and do the translation > > > to procpidfds in an ioctl() as we've discussed in prior threads. This > > > > I sustain my objection to adding an ioctl. Compared to a system call, > > an ioctl has a more rigid interface, greater susceptibility to > > programmer error (due to the same ioctl control code potentially doing > > different things for different file types), longer path length, and > > more awkward filtering/monitoring/auditing/tracing. We've discussed > > this issue at length before, and I thought we all agreed to use system > > calls, not ioctl, for core kernel functionality. So why is an ioctl > > suddenly back on the table? The way I see it, an ioctl has no > > advantages except for 1) conserving system call numbers, which are not > > scarce, and 2) avoiding the system call number coordination problem > > (and the coordination problem isn't a factor for core kernel code). I > > don't understand everyone's reluctance to add new system calls. What > > am I missing? Why would we give up all the advantages that a system > > call gives us? > > > > I agree in general, but in this particular case a system call or an > ioctl doesn't matter much, all it does is take the pidfd, the command, > and /proc's dir fd. Thanks again. I agree that the operation we're discussing has a simple signature, but signature flexibility isn't the only reason to prefer a system call over an ioctl. There are other reasons for preferring system calls to ioctls (safety, tracing, etc.) that apply even if the operation we're discussing has a relatively simple signature: for example, every system call has a distinct and convenient ftrace event, but ioctls don't; strace filtering Just Works on a system-call-by-system-call basis, but it doesn't for ioctls; and documentation for system calls is much more discoverable (e.g., man -k) than documentation for ioctls. Even if the distinction doesn't matter much, IMHO, it still matters a little, enough to favor a system call without an offsetting advantage for the ioctl option. > If you start adding a system call for every specific operation on file > descriptors, it *will* become a problem. I'm not sure what you mean. Do you mean that adding a top-level system call for every operation that might apply to one specific kind of file descriptor would lead, as the overall result, to the kernel having enough system calls to cause negative consequences? I'm not sure I agree, but accepting this idea for the sake of discussion: shouldn't we be more okay with system calls for features present on almost all systems --- like procfs --- even if we punt to ioctls very rarely-used functionality, e.g., some hypothetical special squeak noise that you could get some specific 1995 adlib clone to make? > Besides, the translation is > just there because it is racy to do in userspace, it is not some well > defined core kernel functionality. > Therefore, it is just a way to > enter the kernel to do the openat in a race free and safe manner. I agree that the translation has to be done in the kernel, not userspace, and that the kernel must provide to userspace some interface for requesting that the translation happen: we're just discussing the shape of this interface. Shouldn't all interfaces provided by the kernel to userspace be equally well defined? I'm not sure that the internal simplicity of the operation matters much either. There are already explicit system calls for some simple-to-implement things, e.g., timerfd_gettime. It's worth noting that timerfd is (IIRC), like procfs, a feature that's both ubiquitous and optional. > As is, the facility being provided through an ioctl on the pidfd is > not something I'd consider a problem. You're right that from a signature perspective, using an ioctl isn't a problem. I just want to make sure we take into account the other, non-signature advantages that system calls have over ioctls. > I think the tran
Re: [PATCH 2/4] pid: add pidfd_open()
On Sat, Mar 30, 2019 at 5:35 AM Daniel Colascione wrote: > > On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner > wrote: > > > > > All that said, thanks for the work on this once again. My intention is > > > just that we don't end up with an API that could have been done better > > > and be cleaner to use for potential users in the coming years. > > > > Thanks for your input on all of this. I still don't find multiplexers in > > the style of seccomp()/fsconfig()/keyctl() to be a problem since they > > deal with a specific task. They are very much different from ioctl()s in > > that regard. But since Joel, you, and Daniel found the pidctl() approach > > not very nice I dropped it. The interface needs to be satisfactory for > > all of us especially since Android and other system managers will be the > > main consumers. > > Thanks. > > > So let's split this into pidfd_open(pid_t pid, unsigned int flags) which > > allows to cleanly get pidfds independent procfs and do the translation > > to procpidfds in an ioctl() as we've discussed in prior threads. This > > I sustain my objection to adding an ioctl. Compared to a system call, > an ioctl has a more rigid interface, greater susceptibility to > programmer error (due to the same ioctl control code potentially doing > different things for different file types), longer path length, and > more awkward filtering/monitoring/auditing/tracing. We've discussed > this issue at length before, and I thought we all agreed to use system > calls, not ioctl, for core kernel functionality. So why is an ioctl > suddenly back on the table? The way I see it, an ioctl has no > advantages except for 1) conserving system call numbers, which are not > scarce, and 2) avoiding the system call number coordination problem > (and the coordination problem isn't a factor for core kernel code). I > don't understand everyone's reluctance to add new system calls. What > am I missing? Why would we give up all the advantages that a system > call gives us? > I agree in general, but in this particular case a system call or an ioctl doesn't matter much, all it does is take the pidfd, the command, and /proc's dir fd. If you start adding a system call for every specific operation on file descriptors, it *will* become a problem. Besides, the translation is just there because it is racy to do in userspace, it is not some well defined core kernel functionality. Therefore, it is just a way to enter the kernel to do the openat in a race free and safe manner. As is, the facility being provided through an ioctl on the pidfd is not something I'd consider a problem. I think the translation stuff should also probably be an extension of ioctl_ns(2) (but I wouldn't be opposed if translate_pid is resurrected as is). For anything more involved than ioctl(pidfd, PIDFD_TO_PROCFD, procrootfd), I'd agree that a system call would be a cleaner interface, otherwise, if you cannot generalise it, using ioctls as a command interface is probably the better tradeoff here. > I also don't understand Andy's argument on the other thread that an > ioctl is okay if it's an "operation on an FD" --- *most* system calls > are operations on FDs. We don't have an ioctl for sendmsg(2) and it's > an "operation on an FD".
Re: [PATCH 2/4] pid: add pidfd_open()
On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner wrote: > > > All that said, thanks for the work on this once again. My intention is > > just that we don't end up with an API that could have been done better > > and be cleaner to use for potential users in the coming years. > > Thanks for your input on all of this. I still don't find multiplexers in > the style of seccomp()/fsconfig()/keyctl() to be a problem since they > deal with a specific task. They are very much different from ioctl()s in > that regard. But since Joel, you, and Daniel found the pidctl() approach > not very nice I dropped it. The interface needs to be satisfactory for > all of us especially since Android and other system managers will be the > main consumers. Thanks. > So let's split this into pidfd_open(pid_t pid, unsigned int flags) which > allows to cleanly get pidfds independent procfs and do the translation > to procpidfds in an ioctl() as we've discussed in prior threads. This I sustain my objection to adding an ioctl. Compared to a system call, an ioctl has a more rigid interface, greater susceptibility to programmer error (due to the same ioctl control code potentially doing different things for different file types), longer path length, and more awkward filtering/monitoring/auditing/tracing. We've discussed this issue at length before, and I thought we all agreed to use system calls, not ioctl, for core kernel functionality. So why is an ioctl suddenly back on the table? The way I see it, an ioctl has no advantages except for 1) conserving system call numbers, which are not scarce, and 2) avoiding the system call number coordination problem (and the coordination problem isn't a factor for core kernel code). I don't understand everyone's reluctance to add new system calls. What am I missing? Why would we give up all the advantages that a system call gives us? I also don't understand Andy's argument on the other thread that an ioctl is okay if it's an "operation on an FD" --- *most* system calls are operations on FDs. We don't have an ioctl for sendmsg(2) and it's an "operation on an FD".
Re: [PATCH 2/4] pid: add pidfd_open()
On Thu, Mar 28, 2019 at 12:59:46PM -0400, Joel Fernandes wrote: > > > On March 28, 2019 6:38:15 AM EDT, Christian Brauner > wrote: > >> All that said, thanks for the work on this once again. My intention > >is > >> just that we don't end up with an API that could have been done > >better > >> and be cleaner to use for potential users in the coming years. > > > >Thanks for your input on all of this. I still don't find multiplexers > >in > >the style of seccomp()/fsconfig()/keyctl() to be a problem since they > >deal with a specific task. They are very much different from ioctl()s > >in > >that regard. But since Joel, you, and Daniel found the pidctl() > >approach > >not very nice I dropped it. The interface needs to be satisfactory for > >all of us especially since Android and other system managers will be > >the > >main consumers. > >So let's split this into pidfd_open(pid_t pid, unsigned int flags) > >which > >allows to cleanly get pidfds independent procfs and do the translation > >to procpidfds in an ioctl() as we've discussed in prior threads. This > >should also accommodate comments and ideas from Andy and Jann. > >I'm coding this up now. > > This sounds quite sensible to me. Thanks! Thanks! I have it ready and hope to send it out later today. Christian
Re: [PATCH 2/4] pid: add pidfd_open()
On March 28, 2019 6:38:15 AM EDT, Christian Brauner wrote: >> All that said, thanks for the work on this once again. My intention >is >> just that we don't end up with an API that could have been done >better >> and be cleaner to use for potential users in the coming years. > >Thanks for your input on all of this. I still don't find multiplexers >in >the style of seccomp()/fsconfig()/keyctl() to be a problem since they >deal with a specific task. They are very much different from ioctl()s >in >that regard. But since Joel, you, and Daniel found the pidctl() >approach >not very nice I dropped it. The interface needs to be satisfactory for >all of us especially since Android and other system managers will be >the >main consumers. >So let's split this into pidfd_open(pid_t pid, unsigned int flags) >which >allows to cleanly get pidfds independent procfs and do the translation >to procpidfds in an ioctl() as we've discussed in prior threads. This >should also accommodate comments and ideas from Andy and Jann. >I'm coding this up now. This sounds quite sensible to me. Thanks! Joel Fernandes, Android kernel team Sent from k9-mail on Android
Re: [PATCH 2/4] pid: add pidfd_open()
> All that said, thanks for the work on this once again. My intention is > just that we don't end up with an API that could have been done better > and be cleaner to use for potential users in the coming years. Thanks for your input on all of this. I still don't find multiplexers in the style of seccomp()/fsconfig()/keyctl() to be a problem since they deal with a specific task. They are very much different from ioctl()s in that regard. But since Joel, you, and Daniel found the pidctl() approach not very nice I dropped it. The interface needs to be satisfactory for all of us especially since Android and other system managers will be the main consumers. So let's split this into pidfd_open(pid_t pid, unsigned int flags) which allows to cleanly get pidfds independent procfs and do the translation to procpidfds in an ioctl() as we've discussed in prior threads. This should also accommodate comments and ideas from Andy and Jann. I'm coding this up now. Christian
Re: [PATCH 2/4] pid: add pidfd_open()
pidfd_open is open pidfd for pid relative to pidns, so a better analogy is that it is like openat for a relative pathname wrt dirfd. O_DIRECTORY is analogous to what type of object, so a TIDFD flag in the future which interprets pid (pathname) as thread id only and pins that specific struct pid. That has now limited you to the specific object and operations you can perform on said object with other pidfd APIs. procfd_open in my proposed signature is just a fancy race free openat on the corresponding pid of the pidfd relative to the procrootfd (which becomes the dirfd if i were doing it in userspace), which might as well been implemented in userspace if things were not racy. Andy suggested something similar. My point is, when I am talking of the pidfd API, procfs is irrelevant. You are thinking of it as a process directory and a process file, I am thinking of it in terms of a process object and the proc dir fd as an file system based interface to query process state (through read) which is why I object to them being usable with pidfd_send_signal as well, in principle. In the future, people may use task_diag for the same purpose or a different interface, to work around its limitations. This would just be another interface of the kernel to query process state, not representative of the process object itself. Hence, keeping the pidfd to procfd translation entirely separate (as already suggested) sounds much, much better to me. The pidfd API and related calls are untouched and unaffected by presence, absence of procfs or not (they are, but you do unrelated stuff in the same system call). To me atleast, munging opening (and then changing what the procfd means to support the flag use case), having flags like PIDFD_TO_PROCFD that will work only without CLONE_NEWPID, then having eg. GET_TIDFID that may work with CLONE_NEWPID, etc. I find this interface confusing. I have a few steps when starting to work with a pidfd: 1. Acquire pidfd_open(pid, ns, flags) or pidfd_clone(...) 2. Operate pidfd_send_signal(...) For those who need a race free way to open the correct /proc/ for a pidfd relative to a /proc dir fd, for the purposes of metadata access, you will have procfd_open, which is in the kernel because the same thing is racy to do in userspace. Otherwise, pidfd_open in this patchset is this and also a polymorphic system call that can become procfd_open in my example when passed a flag. It is doing vastly different things given the presence and absence of options. This is similar to a multiplexor again, but it looks more confusing. You have to mask options. pidfd_open currently: pidfd_open(pid, -1, -1, 0); gets pidfd in current active ns pidfd_open(-1, procrootfd, pidfd, PIDFD_TO_PROCFD); returns dir fd of /proc/ it maps to rel. to proc rootfd pidfd_open(-1, nsfd, pidfd, CLONE_NEWPID); as you propose this searches pid in pidns pinned by nsfd, and returns a pidfd file descriptor. Extend this to threads in the future, and the combination and permutation starts getting confusing. Based on the flag, it is entirely changing what will it work upon, and what it will do. I can reasonably summarise my pidfd_open and procfd_open in their man page in one line: pidfd_open(pid, ns, flags) returns pidfd for pid, searching it in the pidns pinned by ns fd, and flags will determine further if this is thread local or process local (i.e. tid or tgid, and tgid == tid for single threaded) (in the future) (so you could do thread directed signals by passing a flag to pidfd_send_signal and this pidfd). Your call without CONFIG_PROC_FS will be literally this, but a few options will have to be set as -1. procfd_open(procrootfd, pidfd, flags), returns the proc dir fd for the pid/tid depending on if the pidfd is thread local, process local, hint it in flags, etc. It is just a race free wrapper around an openat in userspace, undergoing the same access control checks. Yes, pidfd_open as it is now works *just fine*, but it is more confusing to use and discuss. The conclusion from the previous discussion also seemed to be to split pidctl's PIDCMD_GET_PIDFD into its own thing, and provide a translation from pidfd to its proc dir fd on its own. Then, translate_pid can be its own thing, or you could extend ioctl_ns(2) if you want. All that said, thanks for the work on this once again. My intention is just that we don't end up with an API that could have been done better and be cleaner to use for potential users in the coming years.
Re: [PATCH 2/4] pid: add pidfd_open()
> procfd_open(int procrootfd, int pidfd, unsigned int flags); > pidfd_open(pid_t pid, int nsfd, unsigned int flags); That honestly just feels like splitting openat into: openat_dir() and opentat_file()
Re: [PATCH 2/4] pid: add pidfd_open()
On Wed, Mar 27, 2019 at 9:34 PM Christian Brauner wrote: > > On Wed, Mar 27, 2019 at 07:38:13PM +, Jonathan Kowalski wrote: > > Christian, > > > > Giving this some thought, it looks fine to me, but I am not convinced > > this should take a pid argument. I much prefer if obtaining a pidfd > > The signatures for pidfd_open() you're suggesting are conflicting. Even > taking into account that you're referring to Andy's version My version pidfd_open takes a pid, the pid namespace to search it in, and has an optional flags value to extend it further in the future to an immediate need. This is a simpler version of the pidctl's GET_PIDFD command that you yourself proposed in the last patch, and you don't need two namespace arguments, just one here (in that case you were passing -1 in either of those anyway). so pidctl(PIDCMD_GET_PIDFD, pid, source, -1, 0) becomes pidfd_open(pid, source, 0); pidctl(PIDCMD_GET_PIDFD, pid, -1, target or -1, 0) would be pidfd_open(pid, -1, 0); The target was just there due to it having to support comparison/translation commands, but it did not serve much purpose except the pid was reachable in both. I am not sure that is needed if you just want a pidfd for pid relative to a namespace. What I prefer is dropping the pid argument from *your* pidfd_open, rename it to procfd/procpidfd_open to give it a clearer name, and have the signature the same as what Andy suggested. This system call is tied to CONFIG_PROC_FS, all it does is take a pidfd, a proc rootfd, and return the /proc/ dir fd. It is up to you if you want to support procpidfd to pidfd conversion, I don't think there was any usecase mentioned regarding that far, so you may just ignore it. > > int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name > > the "not convinced this should take a pid argument" confuses me wrt to > the proposed second signature: > > pidfd_open(pid_t pid, int ns, unsigned int flags); > > > goes in the translation stuff, that has very real usecases, and you > > would end up duplicating the same thing over in two places. > > > > If the /proc/ dir fd of a pidfd is relative to its procrootfd, a > > pid's pidfd is also relative to its namespace. Currently, you change > > Pass a /proc/ file descriptor that you have been given access to to > pidfd_open() and retrieve a pidfd to that process. It would be nice if it would be free of /proc entirely. The usecase this system call is after is helping race free translation from pidfd to procfd. pidfd_open as a separate system call as mentioned above sounds much cleaner to me, and it works regardless of what /proc is (compiled in, mounted, not compiled in, no access, etc). > > > things which would mean I now have to setns and spawn a child that > > passes me back the pidfd for a pid in the said namespace. > > There are two scenarios: > - you're in a cousin pid namespace > - you're in an ancestor pid namespace > > If you're in an ancestor pid namespace you can just get a pidfd based on > the pid of the process in your namespace. If you're in a cousin > pid namespace it seems reasonable that you have to do a setns to get a > pidfd. After all, you're not able to see any pids in there by default. > If you want pidfd to a cousin pid namespace prove that you can access it > by attaching to the owning user namespace of the pid namespace and get > your pidfd. Yes, but you break the part that you had working in the pidctl patchset, I learn the pid of a process in a child pidns, and I am able to get a pidfd for it with the pid relative to the passed in namespace descriptor. You could even get a pidfd for an ancestor process, given it explicitly gives you its namespace descriptor (or a cousin pidns, but that still means userspace will pass the namespace descriptor explicitly), but it is up to you if you want to limit its scope to down the hierarchy strictly (and it would be nice to know now so that you don't end up wasting more time later, and the next patchset is hopefully the last one before merge modulo nits). > > > > > I prefer Andy's version of > > > > int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY); > > int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name > > this procfd_open, maybe > > > > This is just a nicer race free openat. > > > > But, Joel and you agreed that it makes sense to split out the > > translation stuff out of pidfds. > > I don't quite remember that part. > > I honestly think that the version proposed here covers for the most part > what we want and provides a decent compromise by avoiding ioctl()s for > the translation bits, allows us to not split this into multiple I was not wanting to suggest an ioctl to burden you further, but it occured to me that you would have to do less, as it already has support to determine ownership and is meant for such tasks in ioctl_ns(2), but I don't care if it's a translate_pid system call. > syscalls, and also allows retrieving pidfds from other pid namespaces > provided that
Re: [PATCH 2/4] pid: add pidfd_open()
On Wed, Mar 27, 2019 at 07:38:13PM +, Jonathan Kowalski wrote: > Christian, > > Giving this some thought, it looks fine to me, but I am not convinced > this should take a pid argument. I much prefer if obtaining a pidfd The signatures for pidfd_open() you're suggesting are conflicting. Even taking into account that you're referring to Andy's version: int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name the "not convinced this should take a pid argument" confuses me wrt to the proposed second signature: pidfd_open(pid_t pid, int ns, unsigned int flags); > goes in the translation stuff, that has very real usecases, and you > would end up duplicating the same thing over in two places. > > If the /proc/ dir fd of a pidfd is relative to its procrootfd, a > pid's pidfd is also relative to its namespace. Currently, you change Pass a /proc/ file descriptor that you have been given access to to pidfd_open() and retrieve a pidfd to that process. > things which would mean I now have to setns and spawn a child that > passes me back the pidfd for a pid in the said namespace. There are two scenarios: - you're in a cousin pid namespace - you're in an ancestor pid namespace If you're in an ancestor pid namespace you can just get a pidfd based on the pid of the process in your namespace. If you're in a cousin pid namespace it seems reasonable that you have to do a setns to get a pidfd. After all, you're not able to see any pids in there by default. If you want pidfd to a cousin pid namespace prove that you can access it by attaching to the owning user namespace of the pid namespace and get your pidfd. > > I prefer Andy's version of > > int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY); > int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name > this procfd_open, maybe > > This is just a nicer race free openat. > > But, Joel and you agreed that it makes sense to split out the > translation stuff out of pidfds. I don't quite remember that part. I honestly think that the version proposed here covers for the most part what we want and provides a decent compromise by avoiding ioctl()s for the translation bits, allows us to not split this into multiple syscalls, and also allows retrieving pidfds from other pid namespaces provided that one has gotten access to a file descriptor for /proc/. If really needed at some point - I doubt we will - you can extend pidfd_open() to take the flag CLONE_NEWPID: int pidfd = pidfd_open(pid, pid_ns_fd, -1, CLONE_NEWPID); and get pidfd for another pid namespace back. Christian > > My suggestion would be to extend ioctl_ns(2) then. It already provides > a (rather clumsy) mechanism to determine the relationship by comparing > st_dev and st_ino, which userspace can do itself for two namespace > descriptors. For translation, you can extend this namespace ioctl > (there is one to get a owner UID, you could add one to get a relative > PID). > > Then, your pidfd call will be: > > pidfd_open(pid_t pid, int ns, unsigned int flags); > > You would also be able to compile out anything procfs related (include > the new API to procfs dir fd translation), otherwise, the way to open > pidfds is in this call, and without CONFIG_PROC_FS=Y, this is as good > as pidfd_open(pid_t pid) (of which a better version I propose above). > The new API should be its own thing, and the procfs translation tuple > its own thing, tied to the proc fs option. pidfds need not have any > relation to /proc. They don't have a relation to procfs right now in this syscall. Works fine without procfs. That's the point. > > For this procfd conversion system call, I would also lift any > namespace related restrictions if you are planning to, given the > constraint that I can only open a pidfd from an ancestor namespace > with the new pidfd_open system call, or acquire it through a > cooperative userspace process by fd passing otherwise, and I need the > /proc root fd, having both only permits me to open the said process's > /proc/ dir fd subject to normal access restrictions. This means > the simplified procfd_open can be used to do metadata access without > even talking of PIDs at all, your process could live in its own world > and, given it has the authority to open the /proc directory, be able > to purely collect metadata based on the two file descriptors it is > given. > > Once you have the restriction in the same call that allows you to open > a pidfd for an addressable PID from the given namespace fd, you can > finally remove the restriction to signal across namespaces once the > siginfo stuff is sorted out, as that will only work when you > explicitly push the fd into a sandbox, the process cannot get it out > of thin air on its own (and you already mentioned it has nothing to do > with security). What I do worry about is one can use NS_GET_PARENT > ioctl to get the parent pidns if the owning userns is the same, and > just passing that gives me back a pidfd for the ta
Re: [PATCH 2/4] pid: add pidfd_open()
On Wed, Mar 27, 2019 at 06:07:54PM +0100, Jann Horn wrote: > On Wed, Mar 27, 2019 at 5:22 PM Christian Brauner > wrote: > > pidfd_open() allows to retrieve pidfds for processes and removes the > > dependency of pidfd on procfs. Multiple people have expressed a desire to > > do this even when pidfd_send_signal() was merged. It is even recorded in > [...] > > IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a > > /proc mount in a given pid namespace and a pidfd pidfd_open() will return a > > file descriptor to the corresponding /proc/ directory in procfs > > mounts' pid namespace. pidfd_open() is very careful to verify that the pid > > nit: s/mounts'/mount's/ Thanks. > > > hasn't been recycled in between. > > IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor > > referencing a /proc/ directory a pidfd referencing the struct pid > > stashed in /proc/ of the process will be returned. > > nit: s/of the process //? Yes. > > > The pidfd_open() syscalls in that manner resembles openat() as it uses a > > nit: s/syscalls/syscall/ Thanks. > > [...] > > diff --git a/kernel/pid.c b/kernel/pid.c > > index 20881598bdfa..c9e24e726aba 100644 > > --- a/kernel/pid.c > > +++ b/kernel/pid.c > [...] > > +static struct file *pidfd_open_proc_pid(const struct file *procf, pid_t > > pid, > > + const struct pid *pidfd_pid) > > +{ > > + char name[11]; /* int to strlen + \0 */ > > nit: The comment is a bit off; an unconstrained int needs 1+10+1 > bytes, I think? minus sign, 10 digits, nullbyte? But of course that > can't actually happen here. Yes, the comment is misleading. > > > + struct file *file; > > + struct pid *proc_pid; > > + > > + snprintf(name, sizeof(name), "%d", pid); > > + file = file_open_root(procf->f_path.dentry, procf->f_path.mnt, name, > > + O_DIRECTORY | O_NOFOLLOW, 0); > > Maybe explicitly write the implied O_RDONLY (which is 0) here for clarity? Yes. > > [...] > > +static int pidfd_to_procfd(pid_t pid, int procfd, int pidfd) > > +{ > > + long fd; > > + pid_t ns_pid; > > + struct fd fdproc, fdpid; > > + struct file *file = NULL; > > + struct pid *pidfd_pid = NULL; > > + struct pid_namespace *proc_pid_ns = NULL; > > + > > + fdproc = fdget(procfd); > > + if (!fdproc.file) > > + return -EBADF; > > + > > + fdpid = fdget(pidfd); > > + if (!fdpid.file) { > > + fdput(fdpid); > > Typo: s/fdput(fdpid)/fdput(fdproc)/ Good catch! > > [...] > > +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd, unsigned > > int, > > + flags) > [...] > > + if (!flags) { > [...] > > + rcu_read_lock(); > > + pidfd_pid = get_pid(find_pid_ns(pid, > > task_active_pid_ns(current))); > > + rcu_read_unlock(); > > The previous three lines are equivalent to `pidfd_pid = find_get_pid(pid)`. Perfect, will replace. > > > + fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC); > > Nit: You could hardcode O_CLOEXEC in pidfd_create_fd() and get rid of > the second function argument if you want to. Hm, let me rename this to pidfd_create_cloexec(pidfd_pid) then. > > > + put_pid(pidfd_pid); > > + } else if (flags & PIDFD_TO_PROCFD) { > [...] > > + fd = pidfd_to_procfd(pid, procfd, pidfd); > > The `pid` argument of pidfd_to_procfd() looks unused, maybe it makes > sense to get rid of that? Yes.
Re: [PATCH 2/4] pid: add pidfd_open()
On Wed, Mar 27, 2019 at 06:21:24PM +0100, Yann Droneaud wrote: > Le mercredi 27 mars 2019 à 17:21 +0100, Christian Brauner a écrit : > > > diff --git a/kernel/pid.c b/kernel/pid.c > > index 20881598bdfa..c9e24e726aba 100644 > > --- a/kernel/pid.c > > +++ b/kernel/pid.c > > @@ -26,8 +26,10 @@ > > +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd, > > unsigned int, > > + flags) > > +{ > > + long fd = -EINVAL; > > + > > + if (flags & ~(PIDFD_TO_PROCFD | PROCFD_TO_PIDFD)) > > + return -EINVAL; > > + > > + if (!flags) { > > + struct pid *pidfd_pid; > > + > > + if (pid <= 0) > > + return -EINVAL; > > + > > + if (procfd != -1 || pidfd != -1) > > + return -EINVAL; > > + > > + rcu_read_lock(); > > + pidfd_pid = get_pid(find_pid_ns(pid, > > task_active_pid_ns(current))); > > + rcu_read_unlock(); > > + > > + fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC); > > + put_pid(pidfd_pid); > > + } else if (flags & PIDFD_TO_PROCFD) { > > [...] > > > + } else if (flags & PROCFD_TO_PIDFD) { > > + if (flags & ~PROCFD_TO_PIDFD) > > + return -EINVAL; > > + > > + if (pid != -1) > > + return -EINVAL; > > + > > + if (pidfd >= 0) > > > > I think it can be stricter with: > > if (pidfd != -1) Yes. > > (and match the check done for flag == 0). > > Regards. > > -- > Yann Droneaud > OPTEYA > >
Re: [PATCH 2/4] pid: add pidfd_open()
On Wed, Mar 27, 2019 at 7:38 PM Jonathan Kowalski wrote: > ... > ... the process cannot get it out > of thin air on its own (and you already mentioned it has nothing to do > with security). What I do worry about is one can use NS_GET_PARENT disregard this, it works as it should. > ioctl to get the parent pidns if the owning userns is the same, and > just passing that gives me back a pidfd for the task. **So, you might > want to add the constraint that the PID is actually reachable by the > current task as well, apart from being reachable in the passed in > namespace.** > > Lastly, I also see no need of /proc/ dir fd to pidfd conversion, > I would even recommend getting rid of that, so we only have one type > of pidfd, the anon inode one. What is the usecase behind that? It > would only be needed if you did not have a way to be able to metadata > access through a pidfd, which would be the case only prior to this > patch. > > I think this would simplify a lot of things, and ioctl_ns(2) is > probably already the place to do comparison operations and query > operations on hierarichal namespaces, just adding the relative PID bit > will make it gain feature parity with translate_pid.
Re: [PATCH 2/4] pid: add pidfd_open()
Christian, Giving this some thought, it looks fine to me, but I am not convinced this should take a pid argument. I much prefer if obtaining a pidfd goes in the translation stuff, that has very real usecases, and you would end up duplicating the same thing over in two places. If the /proc/ dir fd of a pidfd is relative to its procrootfd, a pid's pidfd is also relative to its namespace. Currently, you change things which would mean I now have to setns and spawn a child that passes me back the pidfd for a pid in the said namespace. I prefer Andy's version of int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY); int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name this procfd_open, maybe This is just a nicer race free openat. But, Joel and you agreed that it makes sense to split out the translation stuff out of pidfds. My suggestion would be to extend ioctl_ns(2) then. It already provides a (rather clumsy) mechanism to determine the relationship by comparing st_dev and st_ino, which userspace can do itself for two namespace descriptors. For translation, you can extend this namespace ioctl (there is one to get a owner UID, you could add one to get a relative PID). Then, your pidfd call will be: pidfd_open(pid_t pid, int ns, unsigned int flags); You would also be able to compile out anything procfs related (include the new API to procfs dir fd translation), otherwise, the way to open pidfds is in this call, and without CONFIG_PROC_FS=Y, this is as good as pidfd_open(pid_t pid) (of which a better version I propose above). The new API should be its own thing, and the procfs translation tuple its own thing, tied to the proc fs option. pidfds need not have any relation to /proc. For this procfd conversion system call, I would also lift any namespace related restrictions if you are planning to, given the constraint that I can only open a pidfd from an ancestor namespace with the new pidfd_open system call, or acquire it through a cooperative userspace process by fd passing otherwise, and I need the /proc root fd, having both only permits me to open the said process's /proc/ dir fd subject to normal access restrictions. This means the simplified procfd_open can be used to do metadata access without even talking of PIDs at all, your process could live in its own world and, given it has the authority to open the /proc directory, be able to purely collect metadata based on the two file descriptors it is given. Once you have the restriction in the same call that allows you to open a pidfd for an addressable PID from the given namespace fd, you can finally remove the restriction to signal across namespaces once the siginfo stuff is sorted out, as that will only work when you explicitly push the fd into a sandbox, the process cannot get it out of thin air on its own (and you already mentioned it has nothing to do with security). What I do worry about is one can use NS_GET_PARENT ioctl to get the parent pidns if the owning userns is the same, and just passing that gives me back a pidfd for the task. **So, you might want to add the constraint that the PID is actually reachable by the current task as well, apart from being reachable in the passed in namespace.** Lastly, I also see no need of /proc/ dir fd to pidfd conversion, I would even recommend getting rid of that, so we only have one type of pidfd, the anon inode one. What is the usecase behind that? It would only be needed if you did not have a way to be able to metadata access through a pidfd, which would be the case only prior to this patch. I think this would simplify a lot of things, and ioctl_ns(2) is probably already the place to do comparison operations and query operations on hierarichal namespaces, just adding the relative PID bit will make it gain feature parity with translate_pid.
Re: [PATCH 2/4] pid: add pidfd_open()
Le mercredi 27 mars 2019 à 17:21 +0100, Christian Brauner a écrit : > diff --git a/kernel/pid.c b/kernel/pid.c > index 20881598bdfa..c9e24e726aba 100644 > --- a/kernel/pid.c > +++ b/kernel/pid.c > @@ -26,8 +26,10 @@ > +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd, > unsigned int, > + flags) > +{ > + long fd = -EINVAL; > + > + if (flags & ~(PIDFD_TO_PROCFD | PROCFD_TO_PIDFD)) > + return -EINVAL; > + > + if (!flags) { > + struct pid *pidfd_pid; > + > + if (pid <= 0) > + return -EINVAL; > + > + if (procfd != -1 || pidfd != -1) > + return -EINVAL; > + > + rcu_read_lock(); > + pidfd_pid = get_pid(find_pid_ns(pid, > task_active_pid_ns(current))); > + rcu_read_unlock(); > + > + fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC); > + put_pid(pidfd_pid); > + } else if (flags & PIDFD_TO_PROCFD) { [...] > + } else if (flags & PROCFD_TO_PIDFD) { > + if (flags & ~PROCFD_TO_PIDFD) > + return -EINVAL; > + > + if (pid != -1) > + return -EINVAL; > + > + if (pidfd >= 0) > I think it can be stricter with: if (pidfd != -1) (and match the check done for flag == 0). Regards. -- Yann Droneaud OPTEYA
Re: [PATCH 2/4] pid: add pidfd_open()
On Wed, Mar 27, 2019 at 5:22 PM Christian Brauner wrote: > pidfd_open() allows to retrieve pidfds for processes and removes the > dependency of pidfd on procfs. Multiple people have expressed a desire to > do this even when pidfd_send_signal() was merged. It is even recorded in [...] > IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a > /proc mount in a given pid namespace and a pidfd pidfd_open() will return a > file descriptor to the corresponding /proc/ directory in procfs > mounts' pid namespace. pidfd_open() is very careful to verify that the pid nit: s/mounts'/mount's/ > hasn't been recycled in between. > IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor > referencing a /proc/ directory a pidfd referencing the struct pid > stashed in /proc/ of the process will be returned. nit: s/of the process //? > The pidfd_open() syscalls in that manner resembles openat() as it uses a nit: s/syscalls/syscall/ [...] > diff --git a/kernel/pid.c b/kernel/pid.c > index 20881598bdfa..c9e24e726aba 100644 > --- a/kernel/pid.c > +++ b/kernel/pid.c [...] > +static struct file *pidfd_open_proc_pid(const struct file *procf, pid_t pid, > + const struct pid *pidfd_pid) > +{ > + char name[11]; /* int to strlen + \0 */ nit: The comment is a bit off; an unconstrained int needs 1+10+1 bytes, I think? minus sign, 10 digits, nullbyte? But of course that can't actually happen here. > + struct file *file; > + struct pid *proc_pid; > + > + snprintf(name, sizeof(name), "%d", pid); > + file = file_open_root(procf->f_path.dentry, procf->f_path.mnt, name, > + O_DIRECTORY | O_NOFOLLOW, 0); Maybe explicitly write the implied O_RDONLY (which is 0) here for clarity? [...] > +static int pidfd_to_procfd(pid_t pid, int procfd, int pidfd) > +{ > + long fd; > + pid_t ns_pid; > + struct fd fdproc, fdpid; > + struct file *file = NULL; > + struct pid *pidfd_pid = NULL; > + struct pid_namespace *proc_pid_ns = NULL; > + > + fdproc = fdget(procfd); > + if (!fdproc.file) > + return -EBADF; > + > + fdpid = fdget(pidfd); > + if (!fdpid.file) { > + fdput(fdpid); Typo: s/fdput(fdpid)/fdput(fdproc)/ [...] > +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd, unsigned > int, > + flags) [...] > + if (!flags) { [...] > + rcu_read_lock(); > + pidfd_pid = get_pid(find_pid_ns(pid, > task_active_pid_ns(current))); > + rcu_read_unlock(); The previous three lines are equivalent to `pidfd_pid = find_get_pid(pid)`. > + fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC); Nit: You could hardcode O_CLOEXEC in pidfd_create_fd() and get rid of the second function argument if you want to. > + put_pid(pidfd_pid); > + } else if (flags & PIDFD_TO_PROCFD) { [...] > + fd = pidfd_to_procfd(pid, procfd, pidfd); The `pid` argument of pidfd_to_procfd() looks unused, maybe it makes sense to get rid of that?
[PATCH 2/4] pid: add pidfd_open()
pidfd_open() allows to retrieve pidfds for processes and removes the dependency of pidfd on procfs. Multiple people have expressed a desire to do this even when pidfd_send_signal() was merged. It is even recorded in the commit message for pidfd_send_signal() itself (cf. commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad): Q-06: (Andrew Morton [1]) Is there a cleaner way of obtaining the fd? Another syscall perhaps. A-06: Userspace can already trivially retrieve file descriptors from procfs so this is something that we will need to support anyway. Hence, there's no immediate need to add another syscalls just to make pidfd_send_signal() not dependent on the presence of procfs. However, adding a syscalls to get such file descriptors is planned for a future patchset (cf. [1]). Alexey made a similar request (cf. [2]). Additionally, Andy made an argument that we should go forward with non-proc-dirfd file descriptors for the sake of security and extensibility (cf. [3]). This will unblock or help move along work on pidfd_wait which is currently ongoing. /* pidfds are anon inode file descriptors */ These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by default and can be used with the pidfd_send_signal() syscall. They are not dirfds and as such have the advantage that we can make them pollable or readable in the future if we see a need to do so. Currently they do not support any advanced operations. The pidfds are not associated with a specific pid namespaces but rather only reference struct pid of a given process in their private_data member. /* Process Metadata Access */ One of the oustanding issues has been how to get information about a given process if pidfds are regular file descriptors and do not provide access to the process /proc/ directory. Various solutions have been proposed. The one that most people prefer is to be able to retrieve a file descriptor to /proc/ based on a pidfd (and the other way around). IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a /proc mount in a given pid namespace and a pidfd pidfd_open() will return a file descriptor to the corresponding /proc/ directory in procfs mounts' pid namespace. pidfd_open() is very careful to verify that the pid hasn't been recycled in between. IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor referencing a /proc/ directory a pidfd referencing the struct pid stashed in /proc/ of the process will be returned. The pidfd_open() syscalls in that manner resembles openat() as it uses a flag argument to modify what type of file descriptor will be returned. The pidfd_open() implementation together with the flags argument strikes me as an elegant compromise between splitting this into multiple syscalls and avoiding ioctls(). /* Examples */ // Retrieve pidfd int pidfd = pidfd_open(1234, -1, -1, 0); // Retrieve /proc/ handle for pidfd int procfd = open("/proc", O_DIRECTORY | O_RDONLY | O_CLOEXEC); int procpidfd = pidfd_open(-1, procfd, pidfd, PIDFD_TO_PROCFD); // Retrieve pidfd for /proc/ int procpidfd = open("/proc/1234", O_DIRECTORY | O_RDONLY | O_CLOEXEC); int pidfd = pidfd_open(-1, procpidfd, -1, PROCFD_TO_PIDFD); /* References */ [1]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcss...@brauner.io/ [2]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/ [3]: https://lore.kernel.org/lkml/CALCETrXO=V=+qedldvpf8ecglzib9botrufe0v-u-tuzoeo...@mail.gmail.com Signed-off-by: Christian Brauner Cc: Arnd Bergmann Cc: "Eric W. Biederman" Cc: Kees Cook Cc: Alexey Dobriyan Cc: Thomas Gleixner Cc: Serge Hallyn Cc: Jann Horn Cc: "Michael Kerrisk (man-pages)" Cc: Konstantin Khlebnikov Cc: Jonathan Kowalski Cc: "Dmitry V. Levin" Cc: Andy Lutomirsky Cc: Andrew Morton Cc: Oleg Nesterov Cc: Nagarathnam Muthusamy Cc: Aleksa Sarai Cc: Al Viro --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/pid.h| 2 + include/linux/syscalls.h | 2 + include/uapi/linux/wait.h | 3 + kernel/pid.c | 247 + 6 files changed, 256 insertions(+) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 1f9607ed087c..c8046f261bee 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -433,3 +433,4 @@ 425i386io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup 426i386io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter 427i386io_uring_register sys_io_uring_register __ia32_sys_io_uring_register +428i386pidfd_open sys_pidfd_open __ia32_sys_pidfd_open diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 92ee0b4378d4..f714a3d57b88 100644