Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 09:54:58PM +, Jonathan Kowalski wrote: > On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes wrote: > > > > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote: > > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes > > > wrote: > > > > > > But often you don't just want to wait for a single thing to happen; > > > you want to wait for many things at once, and react as soon as any one > > > of them happens. This is why the kernel has epoll and all the other > > > "wait for event on FD" APIs. If waiting for a process isn't possible > > > with fd-based APIs like epoll, users of this API have to spin up > > > useless helper threads. > > > > This is true. I almost forgot about the polling requirement, sorry. So then > > a > > new syscall it is.. About what to wait for, that can be a separate parameter > > to pidfd_wait then. > > > > This introduces a time window where the process changes state between > "get pidfd" and "setup waitfd", it would be simpler if the pidfd > itself supported .poll and on termination the exit status was made > readable from the file descriptor. It is much cleaner to add a new pidfd_wait syscall for this as discussed in the other thread. Adding .poll directly to the pidfd would seem to complicate the blocking configuration. Do we block until the task is a zombie or until it is dead? That is not possible to specify easily. Also if we need to return other types of information from the pidfd, not just exit state, then it is not clear whether blocking on a pidfd just purely on exit status makes sense. It is much cleaner to add a new pidfd_wait syscall giving it a pidfd, specify what to block on (EXIT_DEAD or EXIT_ZOMBIE or both) and then return a wait fd that can be read/blocked and returning all the needed information on unblock. > Further, in the clone4 patchset, there was a mechanism to autoreap > such a process so that it does not interfere with waiting a process > does normally. How do you intend to handle this case if anyone except > the parent is wanting to *wait* on the process (a second process, > without reaping, so as to not disrupt any waiting in the parent), and > for things like libraries that finally can manage their own set of > process when pidfd_clone is added, by excluding this process from the > process's normal wait handling logic. The pidfd_wait logic being discussed does not depend on or affects the autoreap behavior. This wait is not the traditional wait family of calls. It is for just getting notified about reading all the exit state of a task. Once I post the code, it will be clear. And I'll CC you.. thanks, - Joel
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 9:56 AM David Howells wrote: > > Daniel Colascione wrote: > > > System calls are cheap. > > Only to a point. x86_64 will have an issue when we hit syscall 512. We're > currently at 427. > I don't consider this to be a problem. I have patches to make this problem go away forever. I just need to tidy them up a bit and upstream them.
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 3:37 PM Jonathan Kowalski wrote: > > On Mon, Mar 25, 2019 at 10:07 PM Daniel Colascione wrote: > > > > On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski > > wrote: > > > > > > On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes > > > wrote: > > > > > > > > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote: > > > > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes > > > > > wrote: > > > > > > > > > > But often you don't just want to wait for a single thing to happen; > > > > > you want to wait for many things at once, and react as soon as any one > > > > > of them happens. This is why the kernel has epoll and all the other > > > > > "wait for event on FD" APIs. If waiting for a process isn't possible > > > > > with fd-based APIs like epoll, users of this API have to spin up > > > > > useless helper threads. > > > > > > > > This is true. I almost forgot about the polling requirement, sorry. So > > > > then a > > > > new syscall it is.. About what to wait for, that can be a separate > > > > parameter > > > > to pidfd_wait then. > > > > > > > > > > This introduces a time window where the process changes state between > > > "get pidfd" and "setup waitfd", it would be simpler if the pidfd > > > itself supported .poll and on termination the exit status was made > > > readable from the file descriptor. > > > > I don't see a race here. Process state moves in one direction, so > > there's no race. If you want the poll to end when a process exits and > > the process exits before you poll, the poll should finish immediately. > > If the process exits before you even create the polling FD, whatever > > creates the polling FD can fail with ESRCH, which is what usually > > happens when you try to do anything with a process that's gone. Either > > way, whatever's trying to set up the poll knows the state of the > > process and there's no possibility of a missed wakeup. > > > > poll will possibly work and return immediately, but you won't be able > to read back anything, because for the kernel there will be no waiter > before you open one. If you make pidfd pollable and readable (for > parents, as an example), the poll ends immediately but there will > something to read from the fd. You can lose exit status this way. That's not a problem, though, because *without* synchronization between the exiting process and the waiting process, the waiting process can lose the race no matter what, and *with* synchronization, there's no change of losing the information. (It's just like a condition variable.) Today, synchronization between a parent and child is automatic (because zombies), but direct children work reliably with pidfd too: the descriptor that pidfd_clone (or whatever it ends up being called) returns will *always* be created before the child process and so will *always* have exit status information available, even with some non-SIGCHLD option applied. > > > Further, in the clone4 patchset, there was a mechanism to autoreap > > > such a process so that it does not interfere with waiting a process > > > does normally. How do you intend to handle this case if anyone except > > > the parent is wanting to *wait* on the process (a second process, > > > without reaping, so as to not disrupt any waiting in the parent), and > > > for things like libraries that finally can manage their own set of > > > process when pidfd_clone is added, by excluding this process from the > > > process's normal wait handling logic. > > > > I think that discussion is best had around pidfd_clone. pidfd waiting > > functionality shouldn't affect wait* in any way nor affect a zombie > > transition or reaping. > > So this is more akin to waitfd and traditional wait* and friends, and > a single notification fd for possibly multiple children? I don't know what gave you this idea. I'm talking about a model in which you wait for one child with one open file description. > I just wanted > to be sure one would (should) be able to use a pidfd obtained from > pidfd_clone without affecting the existing waitfd, and other > primitives. I don't think the use of pidfd *waiting*, by itself, should affect the current fork/wait/zombie model of process exit delivery. That is, by default, a process should transition to the zombie state, send SIGCHLD to its parent, and then get reaped at exactly the same times it does today, with exactly the same semantics we have today. Pidfd waiting should be an "add on". > The same application's components should be able to host > their own set of children using such an API (think libraries) and > without affecting the process. Agreed. To be clear, you're talking about something slightly different from pure pidfd waiting. There should be some way to *create* a process in such a way that it's not visible to wait(-1), doesn't generate SIGCHLD, and so on. It would be great for libraries to create and manage worker processes transparently, even in the presence of some other thread sitting on
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 10:07 PM Daniel Colascione wrote: > > On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski wrote: > > > > On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes > > wrote: > > > > > > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote: > > > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes > > > > wrote: > > > > > > > > But often you don't just want to wait for a single thing to happen; > > > > you want to wait for many things at once, and react as soon as any one > > > > of them happens. This is why the kernel has epoll and all the other > > > > "wait for event on FD" APIs. If waiting for a process isn't possible > > > > with fd-based APIs like epoll, users of this API have to spin up > > > > useless helper threads. > > > > > > This is true. I almost forgot about the polling requirement, sorry. So > > > then a > > > new syscall it is.. About what to wait for, that can be a separate > > > parameter > > > to pidfd_wait then. > > > > > > > This introduces a time window where the process changes state between > > "get pidfd" and "setup waitfd", it would be simpler if the pidfd > > itself supported .poll and on termination the exit status was made > > readable from the file descriptor. > > I don't see a race here. Process state moves in one direction, so > there's no race. If you want the poll to end when a process exits and > the process exits before you poll, the poll should finish immediately. > If the process exits before you even create the polling FD, whatever > creates the polling FD can fail with ESRCH, which is what usually > happens when you try to do anything with a process that's gone. Either > way, whatever's trying to set up the poll knows the state of the > process and there's no possibility of a missed wakeup. > poll will possibly work and return immediately, but you won't be able to read back anything, because for the kernel there will be no waiter before you open one. If you make pidfd pollable and readable (for parents, as an example), the poll ends immediately but there will something to read from the fd. > > Further, in the clone4 patchset, there was a mechanism to autoreap > > such a process so that it does not interfere with waiting a process > > does normally. How do you intend to handle this case if anyone except > > the parent is wanting to *wait* on the process (a second process, > > without reaping, so as to not disrupt any waiting in the parent), and > > for things like libraries that finally can manage their own set of > > process when pidfd_clone is added, by excluding this process from the > > process's normal wait handling logic. > > I think that discussion is best had around pidfd_clone. pidfd waiting > functionality shouldn't affect wait* in any way nor affect a zombie > transition or reaping. So this is more akin to waitfd and traditional wait* and friends, and a single notification fd for possibly multiple children? I just wanted to be sure one would (should) be able to use a pidfd obtained from pidfd_clone without affecting the existing waitfd, and other primitives. The same application's components should be able to host their own set of children using such an API (think libraries) and without affecting the process. > > > Moreover, should anyone except the parent process be allowed to open a > > readable pidfd (or waitfd), without additional capabilities? > > That's a separate discussion. See > https://lore.kernel.org/lkml/cakozueussvwabqac+o9fw+mzayccvttkqzfwg0hh-cz+1yk...@mail.gmail.com/, > where we talked about permissions extensively. Anyone can hammer on > /proc today or hammer on kill(pid, 0) to learn about a process > exiting, so anyone should be able to wait for a process to die --- it > just automates what anyone can do manually. What's left is the > question of who should be able to learn a process's exit code. As I > mentioned in the linked email, process exit codes are public on > FreeBSD, and the simplest option is to make them public in Linux too. People might be using them in ways that convey information between the parent and child something else shouldn't be able to know. They might be relying on this assumption that it is private. I don't think opening it up without requiring *some* privileges is safe. Also, taking this further, if the structure being returned from a read isn't just the exit code but something more elaborate (you have mentioned siginfo previously), or even statistics like getrusage, I would be concerned if anyone except those with CAP_SYS_PTRACE or so should be able to obtain such a readable pidfd/waitfd.
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski wrote: > > On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes wrote: > > > > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote: > > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes > > > wrote: > > > > > > But often you don't just want to wait for a single thing to happen; > > > you want to wait for many things at once, and react as soon as any one > > > of them happens. This is why the kernel has epoll and all the other > > > "wait for event on FD" APIs. If waiting for a process isn't possible > > > with fd-based APIs like epoll, users of this API have to spin up > > > useless helper threads. > > > > This is true. I almost forgot about the polling requirement, sorry. So then > > a > > new syscall it is.. About what to wait for, that can be a separate parameter > > to pidfd_wait then. > > > > This introduces a time window where the process changes state between > "get pidfd" and "setup waitfd", it would be simpler if the pidfd > itself supported .poll and on termination the exit status was made > readable from the file descriptor. I don't see a race here. Process state moves in one direction, so there's no race. If you want the poll to end when a process exits and the process exits before you poll, the poll should finish immediately. If the process exits before you even create the polling FD, whatever creates the polling FD can fail with ESRCH, which is what usually happens when you try to do anything with a process that's gone. Either way, whatever's trying to set up the poll knows the state of the process and there's no possibility of a missed wakeup. > Further, in the clone4 patchset, there was a mechanism to autoreap > such a process so that it does not interfere with waiting a process > does normally. How do you intend to handle this case if anyone except > the parent is wanting to *wait* on the process (a second process, > without reaping, so as to not disrupt any waiting in the parent), and > for things like libraries that finally can manage their own set of > process when pidfd_clone is added, by excluding this process from the > process's normal wait handling logic. I think that discussion is best had around pidfd_clone. pidfd waiting functionality shouldn't affect wait* in any way nor affect a zombie transition or reaping. > Moreover, should anyone except the parent process be allowed to open a > readable pidfd (or waitfd), without additional capabilities? That's a separate discussion. See https://lore.kernel.org/lkml/cakozueussvwabqac+o9fw+mzayccvttkqzfwg0hh-cz+1yk...@mail.gmail.com/, where we talked about permissions extensively. Anyone can hammer on /proc today or hammer on kill(pid, 0) to learn about a process exiting, so anyone should be able to wait for a process to die --- it just automates what anyone can do manually. What's left is the question of who should be able to learn a process's exit code. As I mentioned in the linked email, process exit codes are public on FreeBSD, and the simplest option is to make them public in Linux too.
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes wrote: > > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote: > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes > > wrote: > > > > But often you don't just want to wait for a single thing to happen; > > you want to wait for many things at once, and react as soon as any one > > of them happens. This is why the kernel has epoll and all the other > > "wait for event on FD" APIs. If waiting for a process isn't possible > > with fd-based APIs like epoll, users of this API have to spin up > > useless helper threads. > > This is true. I almost forgot about the polling requirement, sorry. So then a > new syscall it is.. About what to wait for, that can be a separate parameter > to pidfd_wait then. > This introduces a time window where the process changes state between "get pidfd" and "setup waitfd", it would be simpler if the pidfd itself supported .poll and on termination the exit status was made readable from the file descriptor. Further, in the clone4 patchset, there was a mechanism to autoreap such a process so that it does not interfere with waiting a process does normally. How do you intend to handle this case if anyone except the parent is wanting to *wait* on the process (a second process, without reaping, so as to not disrupt any waiting in the parent), and for things like libraries that finally can manage their own set of process when pidfd_clone is added, by excluding this process from the process's normal wait handling logic. Moreover, should anyone except the parent process be allowed to open a readable pidfd (or waitfd), without additional capabilities? > Thanks. > > - Joel >
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote: > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes > wrote: > > On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote: > > > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote: > > > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote: > > > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner > > > > > wrote: > > > > > > The pidctl() syscalls builds on, extends, and improves > > > > > > translate_pid() [4]. > > > > > > I quote Konstantins original patchset first that has already been > > > > > > acked and > > > > > > picked up by Eric before and whose functionality is preserved in > > > > > > this > > > > > > syscall. Multiple people have asked when this patchset will be sent > > > > > > in > > > > > > for merging (cf. [1], [2]). It has recently been revived by > > > > > > Nagarathnam > > > > > > Muthusamy from Oracle [3]. > > > > > > > > > > > > The intention of the original translate_pid() syscall was twofold: > > > > > > 1. Provide translation of pids between pid namespaces > > > > > > 2. Provide implicit pid namespace introspection > > > > > > > > > > > > Both functionalities are preserved. The latter task has been > > > > > > improved > > > > > > upon though. In the original version of the pachset passing pid as 1 > > > > > > would allow to deterimine the relationship between the pid > > > > > > namespaces. > > > > > > This is inherhently racy. If pid 1 inside a pid namespace has died > > > > > > it > > > > > > would report false negatives. For example, if pid 1 inside of the > > > > > > target > > > > > > pid namespace already died, it would report that the target pid > > > > > > namespace cannot be reached from the source pid namespace because it > > > > > > couldn't find the pid inside of the target pid namespace and thus > > > > > > falsely report to the user that the two pid namespaces are not > > > > > > related. > > > > > > This problem is simple to avoid. In the new version we simply walk > > > > > > the > > > > > > list of ancestors and check whether the namespace are related to > > > > > > each > > > > > > other. By doing it this way we can reliably report what the > > > > > > relationship > > > > > > between two pid namespace file descriptors looks like. > > > > > > > > > > > > Additionally, this syscall has been extended to allow the retrieval > > > > > > of > > > > > > pidfds independent of procfs. These pidfds can e.g. be used with > > > > > > the new > > > > > > pidfd_send_signal() syscall we recently merged. The ability to > > > > > > retrieve > > > > > > pidfds independent of procfs had already been requested in the > > > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by > > > > > > Alexey > > > > > > [5]. A use-case where a kernel is compiled without procfs but where > > > > > > pidfds are still useful has been outlined by Andy in [6]. Regular > > > > > > anon-inode based file descriptors are used that stash a reference to > > > > > > struct pid in file->private_data and drop that reference on close. > > > > > > > > > > > > With this translate_pid() has three closely related but still > > > > > > distinct > > > > > > functionalities. To clarify the semantics and to make it easier for > > > > > > userspace to use the syscall it has: > > > > > > - gained a command argument and three commands clearly reflecting > > > > > > the > > > > > > distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, > > > > > > PIDCMD_GET_PIDFD). > > > > > > - been renamed to pidctl() > > > > > > > > > [snip] > > > > > Also, I'm still confused about how metadata access is supposed to work > > > > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process, > > > > > You snipped out a portion of a previous email in which I asked about > > > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in > > > > > place, we have two different kinds of file descriptors for processes, > > > > > one derived from procfs and one that's independent. The former works > > > > > with openat(2). The latter does not. To be very specific; if I'm > > > > > writing a function that accepts a pidfd and I get a pidfd that comes > > > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of > > > > > smaps or oom_score_adj or statm for the named process in a race-free > > > > > manner? > > > > > > > > This is true, that such usecase will not be supportable. But the > > > > advantage > > > > on the other hand, is that suchs "pidfd" can be made pollable or > > > > readable in > > > > the future. Potentially allowing us to return exit status without a new > > > > syscall (?). And we can add IOCTLs to the pidfd descriptor which we > > > > cannot do > > > > with proc. > > > > > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid > > > > directory fd > > > > can be translated into a "pidfd" using another syscall or even a
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes wrote: > On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote: > > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote: > > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote: > > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner > > > > wrote: > > > > > The pidctl() syscalls builds on, extends, and improves > > > > > translate_pid() [4]. > > > > > I quote Konstantins original patchset first that has already been > > > > > acked and > > > > > picked up by Eric before and whose functionality is preserved in this > > > > > syscall. Multiple people have asked when this patchset will be sent in > > > > > for merging (cf. [1], [2]). It has recently been revived by > > > > > Nagarathnam > > > > > Muthusamy from Oracle [3]. > > > > > > > > > > The intention of the original translate_pid() syscall was twofold: > > > > > 1. Provide translation of pids between pid namespaces > > > > > 2. Provide implicit pid namespace introspection > > > > > > > > > > Both functionalities are preserved. The latter task has been improved > > > > > upon though. In the original version of the pachset passing pid as 1 > > > > > would allow to deterimine the relationship between the pid namespaces. > > > > > This is inherhently racy. If pid 1 inside a pid namespace has died it > > > > > would report false negatives. For example, if pid 1 inside of the > > > > > target > > > > > pid namespace already died, it would report that the target pid > > > > > namespace cannot be reached from the source pid namespace because it > > > > > couldn't find the pid inside of the target pid namespace and thus > > > > > falsely report to the user that the two pid namespaces are not > > > > > related. > > > > > This problem is simple to avoid. In the new version we simply walk the > > > > > list of ancestors and check whether the namespace are related to each > > > > > other. By doing it this way we can reliably report what the > > > > > relationship > > > > > between two pid namespace file descriptors looks like. > > > > > > > > > > Additionally, this syscall has been extended to allow the retrieval of > > > > > pidfds independent of procfs. These pidfds can e.g. be used with the > > > > > new > > > > > pidfd_send_signal() syscall we recently merged. The ability to > > > > > retrieve > > > > > pidfds independent of procfs had already been requested in the > > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by > > > > > Alexey > > > > > [5]. A use-case where a kernel is compiled without procfs but where > > > > > pidfds are still useful has been outlined by Andy in [6]. Regular > > > > > anon-inode based file descriptors are used that stash a reference to > > > > > struct pid in file->private_data and drop that reference on close. > > > > > > > > > > With this translate_pid() has three closely related but still distinct > > > > > functionalities. To clarify the semantics and to make it easier for > > > > > userspace to use the syscall it has: > > > > > - gained a command argument and three commands clearly reflecting the > > > > > distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, > > > > > PIDCMD_GET_PIDFD). > > > > > - been renamed to pidctl() > > > > > > > [snip] > > > > Also, I'm still confused about how metadata access is supposed to work > > > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process, > > > > You snipped out a portion of a previous email in which I asked about > > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in > > > > place, we have two different kinds of file descriptors for processes, > > > > one derived from procfs and one that's independent. The former works > > > > with openat(2). The latter does not. To be very specific; if I'm > > > > writing a function that accepts a pidfd and I get a pidfd that comes > > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of > > > > smaps or oom_score_adj or statm for the named process in a race-free > > > > manner? > > > > > > This is true, that such usecase will not be supportable. But the > > > advantage > > > on the other hand, is that suchs "pidfd" can be made pollable or readable > > > in > > > the future. Potentially allowing us to return exit status without a new > > > syscall (?). And we can add IOCTLs to the pidfd descriptor which we > > > cannot do > > > with proc. > > > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid > > > directory fd > > > can be translated into a "pidfd" using another syscall or even a node, > > > like > > > /proc/pid/handle or something. I think this is what Christian suggested in > > > the previous threads. > > > > Andy - and Jann who I just talked to - have proposed solutions for this. > > Jann's idea is similar to what you suggested, Joel. You could e.g. do an > > ioctl() handler for /proc that would give you a dirfd back for a given >
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 9:40 PM Jonathan Kowalski wrote: > On Mon, Mar 25, 2019 at 8:34 PM Jann Horn wrote: > > > > [...SNIP...] > > > > Please don't do that. /proc/$pid/fd refers to the set of file > > descriptors the process has open, and semantically doesn't have much > > to do with the identity of the process. If you want to have a procfs > > directory entry for getting a pidfd, please add a new entry. (Although > > I don't see the point in adding a new procfs entry for this when you > > could instead have an ioctl or syscall operating on the procfs > > directory fd.) > > There is no new entry. What I was saying (and I should have been > clearer) is that the existing entry for the fd when open'd with > O_DIRECTORY makes the kernel resolve the symlink to /proc/ of the > process it maps to, so it would become: > > int dirfd = open("/proc/self/fd/3", O_DIRECTORY|O_CLOEXEC); That still seems really weird. This magically overloads O_DIRECTORY, which means "fail if the thing is not a directory", to suddenly have an entirely different meaning for one magical special type of file. On top of that, unlike an ioctl or a new syscall, it doesn't convey explicit intent and increases the risk of confused deputy issues. > This also means you cannot cross the filesystem boundry, the said > process needs to have a visible entry (which would mean hidepid= and > gid= based access controls are honored), and you can only open the > dirfd of a process in the current ns (as the PID will not map to an > existent process if the pidfd maps to a process not in the same or > children pid ns, in fdinfo it lists -1 in the pid field (we might not > even need fdinfo anymore)). AFAICS that doesn't have anything to do with whether you do this as a syscall, as an ioctl, or as a jumped symlink. The kernel would have to do the same security checks in any of those cases - only a classic, non-jumped symlink would implicitly go through the existing permission checks. And if you implement this with a non-jumped symlink, you get races.
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 2:11 PM Joel Fernandes wrote: > > On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote: > > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote: > > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote: > > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner > > > > wrote: > > > > > The pidctl() syscalls builds on, extends, and improves > > > > > translate_pid() [4]. > > > > > I quote Konstantins original patchset first that has already been > > > > > acked and > > > > > picked up by Eric before and whose functionality is preserved in this > > > > > syscall. Multiple people have asked when this patchset will be sent in > > > > > for merging (cf. [1], [2]). It has recently been revived by > > > > > Nagarathnam > > > > > Muthusamy from Oracle [3]. > > > > > > > > > > The intention of the original translate_pid() syscall was twofold: > > > > > 1. Provide translation of pids between pid namespaces > > > > > 2. Provide implicit pid namespace introspection > > > > > > > > > > Both functionalities are preserved. The latter task has been improved > > > > > upon though. In the original version of the pachset passing pid as 1 > > > > > would allow to deterimine the relationship between the pid namespaces. > > > > > This is inherhently racy. If pid 1 inside a pid namespace has died it > > > > > would report false negatives. For example, if pid 1 inside of the > > > > > target > > > > > pid namespace already died, it would report that the target pid > > > > > namespace cannot be reached from the source pid namespace because it > > > > > couldn't find the pid inside of the target pid namespace and thus > > > > > falsely report to the user that the two pid namespaces are not > > > > > related. > > > > > This problem is simple to avoid. In the new version we simply walk the > > > > > list of ancestors and check whether the namespace are related to each > > > > > other. By doing it this way we can reliably report what the > > > > > relationship > > > > > between two pid namespace file descriptors looks like. > > > > > > > > > > Additionally, this syscall has been extended to allow the retrieval of > > > > > pidfds independent of procfs. These pidfds can e.g. be used with the > > > > > new > > > > > pidfd_send_signal() syscall we recently merged. The ability to > > > > > retrieve > > > > > pidfds independent of procfs had already been requested in the > > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by > > > > > Alexey > > > > > [5]. A use-case where a kernel is compiled without procfs but where > > > > > pidfds are still useful has been outlined by Andy in [6]. Regular > > > > > anon-inode based file descriptors are used that stash a reference to > > > > > struct pid in file->private_data and drop that reference on close. > > > > > > > > > > With this translate_pid() has three closely related but still distinct > > > > > functionalities. To clarify the semantics and to make it easier for > > > > > userspace to use the syscall it has: > > > > > - gained a command argument and three commands clearly reflecting the > > > > > distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, > > > > > PIDCMD_GET_PIDFD). > > > > > - been renamed to pidctl() > > > > > > > [snip] > > > > Also, I'm still confused about how metadata access is supposed to work > > > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process, > > > > You snipped out a portion of a previous email in which I asked about > > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in > > > > place, we have two different kinds of file descriptors for processes, > > > > one derived from procfs and one that's independent. The former works > > > > with openat(2). The latter does not. To be very specific; if I'm > > > > writing a function that accepts a pidfd and I get a pidfd that comes > > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of > > > > smaps or oom_score_adj or statm for the named process in a race-free > > > > manner? > > > > > > This is true, that such usecase will not be supportable. But the > > > advantage > > > on the other hand, is that suchs "pidfd" can be made pollable or readable > > > in > > > the future. Potentially allowing us to return exit status without a new > > > syscall (?). And we can add IOCTLs to the pidfd descriptor which we > > > cannot do > > > with proc. > > > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid > > > directory fd > > > can be translated into a "pidfd" using another syscall or even a node, > > > like > > > /proc/pid/handle or something. I think this is what Christian suggested in > > > the previous threads. > > > > Andy - and Jann who I just talked to - have proposed solutions for this. > > Jann's idea is similar to what you suggested, Joel. You could e.g. do an > > ioctl() handler for /proc that would give you a dirfd back for a given
Re: [PATCH 0/4] pid: add pidctl()
Also, extending this further, instead of new ioctl flags over to translate a tidfd one might introduce later for thread targetted signals (which would still be a pidfd in the struct pid terms, but with a bit set in its reference to target the selected TID in particular), you could resolve this neatly to the proc entry of the task itself, which would be subject to restrictions similar to a regular open call, minus all the races involved. This also means you can get rid of having to support the /proc/ dir fd in pidfd_send_signal, because there is no incentive to, any longer. The kernel now has just one pidfd object, well scoped in its purpose, and this "feature" is tied to procfs itself, disabling which takes away the feature as well. Otherwise, the ioctl will be conditionally available and/or work only when procfs is present, and you'd tie procfs to pidfds eternally as ABI.
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote: > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote: > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote: > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner > > > wrote: > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() > > > > [4]. > > > > I quote Konstantins original patchset first that has already been acked > > > > and > > > > picked up by Eric before and whose functionality is preserved in this > > > > syscall. Multiple people have asked when this patchset will be sent in > > > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam > > > > Muthusamy from Oracle [3]. > > > > > > > > The intention of the original translate_pid() syscall was twofold: > > > > 1. Provide translation of pids between pid namespaces > > > > 2. Provide implicit pid namespace introspection > > > > > > > > Both functionalities are preserved. The latter task has been improved > > > > upon though. In the original version of the pachset passing pid as 1 > > > > would allow to deterimine the relationship between the pid namespaces. > > > > This is inherhently racy. If pid 1 inside a pid namespace has died it > > > > would report false negatives. For example, if pid 1 inside of the target > > > > pid namespace already died, it would report that the target pid > > > > namespace cannot be reached from the source pid namespace because it > > > > couldn't find the pid inside of the target pid namespace and thus > > > > falsely report to the user that the two pid namespaces are not related. > > > > This problem is simple to avoid. In the new version we simply walk the > > > > list of ancestors and check whether the namespace are related to each > > > > other. By doing it this way we can reliably report what the relationship > > > > between two pid namespace file descriptors looks like. > > > > > > > > Additionally, this syscall has been extended to allow the retrieval of > > > > pidfds independent of procfs. These pidfds can e.g. be used with the new > > > > pidfd_send_signal() syscall we recently merged. The ability to retrieve > > > > pidfds independent of procfs had already been requested in the > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey > > > > [5]. A use-case where a kernel is compiled without procfs but where > > > > pidfds are still useful has been outlined by Andy in [6]. Regular > > > > anon-inode based file descriptors are used that stash a reference to > > > > struct pid in file->private_data and drop that reference on close. > > > > > > > > With this translate_pid() has three closely related but still distinct > > > > functionalities. To clarify the semantics and to make it easier for > > > > userspace to use the syscall it has: > > > > - gained a command argument and three commands clearly reflecting the > > > > distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, > > > > PIDCMD_GET_PIDFD). > > > > - been renamed to pidctl() > > > > > [snip] > > > Also, I'm still confused about how metadata access is supposed to work > > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process, > > > You snipped out a portion of a previous email in which I asked about > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in > > > place, we have two different kinds of file descriptors for processes, > > > one derived from procfs and one that's independent. The former works > > > with openat(2). The latter does not. To be very specific; if I'm > > > writing a function that accepts a pidfd and I get a pidfd that comes > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of > > > smaps or oom_score_adj or statm for the named process in a race-free > > > manner? > > > > This is true, that such usecase will not be supportable. But the advantage > > on the other hand, is that suchs "pidfd" can be made pollable or readable in > > the future. Potentially allowing us to return exit status without a new > > syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot > > do > > with proc. > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid directory > > fd > > can be translated into a "pidfd" using another syscall or even a node, like > > /proc/pid/handle or something. I think this is what Christian suggested in > > the previous threads. > > Andy - and Jann who I just talked to - have proposed solutions for this. > Jann's idea is similar to what you suggested, Joel. You could e.g. do an > ioctl() handler for /proc that would give you a dirfd back for a given > pidfd. The advantage is that pidfd_clone() can then give back pidfds > without having to care in what procfs the process is supposed to live. > That makes things a lot easier. But pidfds for the general case should > be anon inodes. It's clean, it's simple and it is way more secure. That makes sense to
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 8:34 PM Jann Horn wrote: > > [...SNIP...] > > Please don't do that. /proc/$pid/fd refers to the set of file > descriptors the process has open, and semantically doesn't have much > to do with the identity of the process. If you want to have a procfs > directory entry for getting a pidfd, please add a new entry. (Although > I don't see the point in adding a new procfs entry for this when you > could instead have an ioctl or syscall operating on the procfs > directory fd.) There is no new entry. What I was saying (and I should have been clearer) is that the existing entry for the fd when open'd with O_DIRECTORY makes the kernel resolve the symlink to /proc/ of the process it maps to, so it would become: int dirfd = open("/proc/self/fd/3", O_DIRECTORY|O_CLOEXEC); This also means you cannot cross the filesystem boundry, the said process needs to have a visible entry (which would mean hidepid= and gid= based access controls are honored), and you can only open the dirfd of a process in the current ns (as the PID will not map to an existent process if the pidfd maps to a process not in the same or children pid ns, in fdinfo it lists -1 in the pid field (we might not even need fdinfo anymore)).
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 09:34:00PM +0100, Jann Horn wrote: > On Mon, Mar 25, 2019 at 9:15 PM Daniel Colascione wrote: > > On Mon, Mar 25, 2019 at 12:42 PM Jonathan Kowalski > > wrote: > > > On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione > > > wrote: > [...] > > > Yes, but everything in /proc is not equivalent to an attribute, or an > > > option, and depending on its configuration, you may not want to allow > > > processes to even be able to see /proc for any PIDs other than those > > > running as their own user (hidepid). This means, even if this new > > > system call is added, to respect hidepid, it must, depending on if > > > /proc is mounted (and what hidepid is set to, and what gid= is set > > > to), return EPERM, because then there is a discrepancy between how the > > > two entrypoints to acquire a process handle do access control. > > > > That's why I proposed that this translation mechanism accept a procfs > > root directory --- so you'd specify *which* procfs you want and let > > the kernel apply whatever hidepid access restrictions it wants. > [...] > > > > and 2) it's > > > > "fail unsafe": IMHO, most users in practice will skip the line marked > > > > "LIVENESS CHECK", and as a result, their code will appear to work but > > > > contain subtle race conditions. An explicit interface to translate > > > > from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file > > > > descriptor would be both more efficient and fail-safe. > > > > > > > > [1] as a separate matter, it'd be nice to have a batch version of > > > > close(2). > > > > > > Since /proc is full of gunk, > > > > People keep saying /proc is bad, but I haven't seen any serious > > proposals for a clean replacement. :-) > > > > > how about adding more to it and making > > > the magic symlink of /proc/self/fd for the pidfd to lead to the dirfd > > > of the /proc entry of the process it maps to, when one uses > > > O_DIRECTORY while opening it? Otherwise, it behaves as it does today. > > > It would be equivalent to opening the proc entry with usual access > > > restrictions (and hidepid made to work) but without the races, and > > > because for processes outside your and children pid ns, it shouldn't > > > work anyway, and since they wouldn't have their entry on this procfs > > > instance, it would all just fit in nicely? > > > > Thanks. That'll work. It's a bit magical, but /proc/self/fd is magical > > anyway, so that's okay. > > Please don't do that. /proc/$pid/fd refers to the set of file > descriptors the process has open, and semantically doesn't have much > to do with the identity of the process. If you want to have a procfs > directory entry for getting a pidfd, please add a new entry. (Although > I don't see the point in adding a new procfs entry for this when you > could instead have an ioctl or syscall operating on the procfs > directory fd.) Very much agreed!
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 9:15 PM Daniel Colascione wrote: > On Mon, Mar 25, 2019 at 12:42 PM Jonathan Kowalski > wrote: > > On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione wrote: [...] > > Yes, but everything in /proc is not equivalent to an attribute, or an > > option, and depending on its configuration, you may not want to allow > > processes to even be able to see /proc for any PIDs other than those > > running as their own user (hidepid). This means, even if this new > > system call is added, to respect hidepid, it must, depending on if > > /proc is mounted (and what hidepid is set to, and what gid= is set > > to), return EPERM, because then there is a discrepancy between how the > > two entrypoints to acquire a process handle do access control. > > That's why I proposed that this translation mechanism accept a procfs > root directory --- so you'd specify *which* procfs you want and let > the kernel apply whatever hidepid access restrictions it wants. [...] > > > and 2) it's > > > "fail unsafe": IMHO, most users in practice will skip the line marked > > > "LIVENESS CHECK", and as a result, their code will appear to work but > > > contain subtle race conditions. An explicit interface to translate > > > from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file > > > descriptor would be both more efficient and fail-safe. > > > > > > [1] as a separate matter, it'd be nice to have a batch version of > > > close(2). > > > > Since /proc is full of gunk, > > People keep saying /proc is bad, but I haven't seen any serious > proposals for a clean replacement. :-) > > > how about adding more to it and making > > the magic symlink of /proc/self/fd for the pidfd to lead to the dirfd > > of the /proc entry of the process it maps to, when one uses > > O_DIRECTORY while opening it? Otherwise, it behaves as it does today. > > It would be equivalent to opening the proc entry with usual access > > restrictions (and hidepid made to work) but without the races, and > > because for processes outside your and children pid ns, it shouldn't > > work anyway, and since they wouldn't have their entry on this procfs > > instance, it would all just fit in nicely? > > Thanks. That'll work. It's a bit magical, but /proc/self/fd is magical > anyway, so that's okay. Please don't do that. /proc/$pid/fd refers to the set of file descriptors the process has open, and semantically doesn't have much to do with the identity of the process. If you want to have a procfs directory entry for getting a pidfd, please add a new entry. (Although I don't see the point in adding a new procfs entry for this when you could instead have an ioctl or syscall operating on the procfs directory fd.)
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 12:42 PM Jonathan Kowalski wrote: > > On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione wrote: > > > > On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski > > wrote: > > > > > > On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione > > > wrote: > > > > > > > > [..snip..] > > > > > > > > I don't like the idea of having one kind of pollfd be pollable and > > > > another not. Such an interface would be confusing for users. If, as > > > > you suggest below, we instead make the procfs-less FD the only thing > > > > we call a "pidfd", then this proposal becomes less confusing and more > > > > viable. > > > > > > That's certainly on the table, we remove the ability to open > > > /proc/ and use the dir fd with pidfd_send_signal. I'm in favor of > > > this. > > > > > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid > > > > > directory fd > > > > > can be translated into a "pidfd" using another syscall or even a > > > > > node, like > > > > > /proc/pid/handle or something. I think this is what Christian > > > > > suggested in > > > > > the previous threads. > > > > > > > > /proc/pid/handle, if I understand correctly, "translates" a > > > > procfs-based pidfd to a non-pidfd one. The needed interface would have > > > > to perform the opposite translation, providing a procfs directory for > > > > a given pidfd. > > > > > > > > > And also for the translation the other way, add a syscall or modify > > > > > translate_fd or something, to covert a anon_inode pidfd into a > > > > > /proc/pid > > > > > directory fd. Then the user is welcomed to do openat(2) on _that_ > > > > > directory fd. > > > > > Then we modify pidfd_send_signal to only send signals to pure pidfd > > > > > fds, not > > > > > to /proc/pid directory fds. > > > > > > > > This approach would work, but there's one subtlety to take into > > > > account: which procfs? You'd want to take, as inputs, 1) the procfs > > > > root you want, and 2) the pidfd, this way you could return to the user > > > > a directory FD in the right procfs. > > > > > > > > > > I don't think metadata access is in the scope of such a pidfd itself. > > > > It should be possible to write a race-free pkill(1) using pidfds. > > Without metadata access tied to the pidfd somehow, that can't be done. > > > > > > > > > Should we work on patches for these? Please let us know if this idea > > > > > makes > > > > > sense and thanks a lot for adding us to the review as well. > > > > > > > > I would strongly prefer that we not merge pidctl (or whatever it is) > > > > without a story for metadata access, be it your suggestion or > > > > something else. > > > > > > IMO, this is out of scope for a process handle. Hence, the need for > > > metadata access musn't stall it as is. > > > > On the contrary, rather than metadata being out of scope, metadata > > access is essential. Given a file handle (an FD), you can learn about > > the file backing that handle by using fstat(2). Given a socket handle > > (a socket FD), you can learn about the socket with getsockopt(2) and > > ioctl(2). It would be strange not to be able, similarly, to learn > > things about a process given a handle (a pidfd) to that process. I > > want process handles to be "boring" in that they let users query and > > manipulate processes mostly like they manipulate other resources. > > > > Yes, but everything in /proc is not equivalent to an attribute, or an > option, and depending on its configuration, you may not want to allow > processes to even be able to see /proc for any PIDs other than those > running as their own user (hidepid). This means, even if this new > system call is added, to respect hidepid, it must, depending on if > /proc is mounted (and what hidepid is set to, and what gid= is set > to), return EPERM, because then there is a discrepancy between how the > two entrypoints to acquire a process handle do access control. That's why I proposed that this translation mechanism accept a procfs root directory --- so you'd specify *which* procfs you want and let the kernel apply whatever hidepid access restrictions it wants. [snip] > PIDs however are addressable with kill(2) even with hidepid enabled. > For good reason, the capabilities you can exercise using a PID is > limited in that case. The same logic applies to a process reference > like pidfd. Sure. I'm not proposing a mechanism that would grant anyone additional access to anything --- I'm just suggesting that we provide a way to "directly" open a procfs dirfd instead of having people parse an fdinfo text file. > I agree there should be some way (if you can take care of the hidepid > gotcha) to return a dir fd of /proc entry, but I don't think it should > be the pidfd itself. It's fine that pidfds aren't procfs directory FDs so long as there's a race-free way to go from the former to the latter. > You can get it to work using the fdinfo thing for > now. > > > > If you really need this, the pid this pidfd is
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote: > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote: > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner > > wrote: > > > The pidctl() syscalls builds on, extends, and improves translate_pid() > > > [4]. > > > I quote Konstantins original patchset first that has already been acked > > > and > > > picked up by Eric before and whose functionality is preserved in this > > > syscall. Multiple people have asked when this patchset will be sent in > > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam > > > Muthusamy from Oracle [3]. > > > > > > The intention of the original translate_pid() syscall was twofold: > > > 1. Provide translation of pids between pid namespaces > > > 2. Provide implicit pid namespace introspection > > > > > > Both functionalities are preserved. The latter task has been improved > > > upon though. In the original version of the pachset passing pid as 1 > > > would allow to deterimine the relationship between the pid namespaces. > > > This is inherhently racy. If pid 1 inside a pid namespace has died it > > > would report false negatives. For example, if pid 1 inside of the target > > > pid namespace already died, it would report that the target pid > > > namespace cannot be reached from the source pid namespace because it > > > couldn't find the pid inside of the target pid namespace and thus > > > falsely report to the user that the two pid namespaces are not related. > > > This problem is simple to avoid. In the new version we simply walk the > > > list of ancestors and check whether the namespace are related to each > > > other. By doing it this way we can reliably report what the relationship > > > between two pid namespace file descriptors looks like. > > > > > > Additionally, this syscall has been extended to allow the retrieval of > > > pidfds independent of procfs. These pidfds can e.g. be used with the new > > > pidfd_send_signal() syscall we recently merged. The ability to retrieve > > > pidfds independent of procfs had already been requested in the > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey > > > [5]. A use-case where a kernel is compiled without procfs but where > > > pidfds are still useful has been outlined by Andy in [6]. Regular > > > anon-inode based file descriptors are used that stash a reference to > > > struct pid in file->private_data and drop that reference on close. > > > > > > With this translate_pid() has three closely related but still distinct > > > functionalities. To clarify the semantics and to make it easier for > > > userspace to use the syscall it has: > > > - gained a command argument and three commands clearly reflecting the > > > distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, > > > PIDCMD_GET_PIDFD). > > > - been renamed to pidctl() > > > [snip] > > Also, I'm still confused about how metadata access is supposed to work > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process, > > You snipped out a portion of a previous email in which I asked about > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in > > place, we have two different kinds of file descriptors for processes, > > one derived from procfs and one that's independent. The former works > > with openat(2). The latter does not. To be very specific; if I'm > > writing a function that accepts a pidfd and I get a pidfd that comes > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of > > smaps or oom_score_adj or statm for the named process in a race-free > > manner? > > This is true, that such usecase will not be supportable. But the advantage > on the other hand, is that suchs "pidfd" can be made pollable or readable in > the future. Potentially allowing us to return exit status without a new > syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do > with proc. > > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd > can be translated into a "pidfd" using another syscall or even a node, like > /proc/pid/handle or something. I think this is what Christian suggested in > the previous threads. Andy - and Jann who I just talked to - have proposed solutions for this. Jann's idea is similar to what you suggested, Joel. You could e.g. do an ioctl() handler for /proc that would give you a dirfd back for a given pidfd. The advantage is that pidfd_clone() can then give back pidfds without having to care in what procfs the process is supposed to live. That makes things a lot easier. But pidfds for the general case should be anon inodes. It's clean, it's simple and it is way more secure. > > And also for the translation the other way, add a syscall or modify > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid > directory fd. Then the user is welcomed to do openat(2) on _that_ directory > fd. > Then we modify
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione wrote: > > On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski > wrote: > > > > On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione wrote: > > > > > > [..snip..] > > > > > > I don't like the idea of having one kind of pollfd be pollable and > > > another not. Such an interface would be confusing for users. If, as > > > you suggest below, we instead make the procfs-less FD the only thing > > > we call a "pidfd", then this proposal becomes less confusing and more > > > viable. > > > > That's certainly on the table, we remove the ability to open > > /proc/ and use the dir fd with pidfd_send_signal. I'm in favor of > > this. > > > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid > > > > directory fd > > > > can be translated into a "pidfd" using another syscall or even a node, > > > > like > > > > /proc/pid/handle or something. I think this is what Christian suggested > > > > in > > > > the previous threads. > > > > > > /proc/pid/handle, if I understand correctly, "translates" a > > > procfs-based pidfd to a non-pidfd one. The needed interface would have > > > to perform the opposite translation, providing a procfs directory for > > > a given pidfd. > > > > > > > And also for the translation the other way, add a syscall or modify > > > > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid > > > > directory fd. Then the user is welcomed to do openat(2) on _that_ > > > > directory fd. > > > > Then we modify pidfd_send_signal to only send signals to pure pidfd > > > > fds, not > > > > to /proc/pid directory fds. > > > > > > This approach would work, but there's one subtlety to take into > > > account: which procfs? You'd want to take, as inputs, 1) the procfs > > > root you want, and 2) the pidfd, this way you could return to the user > > > a directory FD in the right procfs. > > > > > > > I don't think metadata access is in the scope of such a pidfd itself. > > It should be possible to write a race-free pkill(1) using pidfds. > Without metadata access tied to the pidfd somehow, that can't be done. > > > > > > Should we work on patches for these? Please let us know if this idea > > > > makes > > > > sense and thanks a lot for adding us to the review as well. > > > > > > I would strongly prefer that we not merge pidctl (or whatever it is) > > > without a story for metadata access, be it your suggestion or > > > something else. > > > > IMO, this is out of scope for a process handle. Hence, the need for > > metadata access musn't stall it as is. > > On the contrary, rather than metadata being out of scope, metadata > access is essential. Given a file handle (an FD), you can learn about > the file backing that handle by using fstat(2). Given a socket handle > (a socket FD), you can learn about the socket with getsockopt(2) and > ioctl(2). It would be strange not to be able, similarly, to learn > things about a process given a handle (a pidfd) to that process. I > want process handles to be "boring" in that they let users query and > manipulate processes mostly like they manipulate other resources. > Yes, but everything in /proc is not equivalent to an attribute, or an option, and depending on its configuration, you may not want to allow processes to even be able to see /proc for any PIDs other than those running as their own user (hidepid). This means, even if this new system call is added, to respect hidepid, it must, depending on if /proc is mounted (and what hidepid is set to, and what gid= is set to), return EPERM, because then there is a discrepancy between how the two entrypoints to acquire a process handle do access control. This is ugly, but ideally should work the way it is described above. Otherwise, things would be able to bypass the restrictions made therein, because we also want a system call. PIDs however are addressable with kill(2) even with hidepid enabled. For good reason, the capabilities you can exercise using a PID is limited in that case. The same logic applies to a process reference like pidfd. I agree there should be some way (if you can take care of the hidepid gotcha) to return a dir fd of /proc entry, but I don't think it should be the pidfd itself. You can get it to work using the fdinfo thing for now. > > If you really need this, the pid this pidfd is mapped to can be > > exposed through fdinfo (which is perfectly fine for your case as you > > use /proc anyway). This means you can reliably determine if you're > > opening it for the same pid (and it hasn't been recycled) because this > > pidfd will be pollable for state change (in the future). Exposing it > > through fdinfo isn't a problem, pid ns is bound to the process during > > its lifetime, it's a process creation time property, so the value > > isn't going to change anyway. So you can do all metadata access you > > want subsequently. > > Thanks for the proposal. I'm a bit confused. Are you suggesting that > we report the numeric
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski wrote: > > On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione wrote: > > > > [..snip..] > > > > I don't like the idea of having one kind of pollfd be pollable and > > another not. Such an interface would be confusing for users. If, as > > you suggest below, we instead make the procfs-less FD the only thing > > we call a "pidfd", then this proposal becomes less confusing and more > > viable. > > That's certainly on the table, we remove the ability to open > /proc/ and use the dir fd with pidfd_send_signal. I'm in favor of > this. > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid > > > directory fd > > > can be translated into a "pidfd" using another syscall or even a node, > > > like > > > /proc/pid/handle or something. I think this is what Christian suggested in > > > the previous threads. > > > > /proc/pid/handle, if I understand correctly, "translates" a > > procfs-based pidfd to a non-pidfd one. The needed interface would have > > to perform the opposite translation, providing a procfs directory for > > a given pidfd. > > > > > And also for the translation the other way, add a syscall or modify > > > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid > > > directory fd. Then the user is welcomed to do openat(2) on _that_ > > > directory fd. > > > Then we modify pidfd_send_signal to only send signals to pure pidfd fds, > > > not > > > to /proc/pid directory fds. > > > > This approach would work, but there's one subtlety to take into > > account: which procfs? You'd want to take, as inputs, 1) the procfs > > root you want, and 2) the pidfd, this way you could return to the user > > a directory FD in the right procfs. > > > > I don't think metadata access is in the scope of such a pidfd itself. It should be possible to write a race-free pkill(1) using pidfds. Without metadata access tied to the pidfd somehow, that can't be done. > > > Should we work on patches for these? Please let us know if this idea makes > > > sense and thanks a lot for adding us to the review as well. > > > > I would strongly prefer that we not merge pidctl (or whatever it is) > > without a story for metadata access, be it your suggestion or > > something else. > > IMO, this is out of scope for a process handle. Hence, the need for > metadata access musn't stall it as is. On the contrary, rather than metadata being out of scope, metadata access is essential. Given a file handle (an FD), you can learn about the file backing that handle by using fstat(2). Given a socket handle (a socket FD), you can learn about the socket with getsockopt(2) and ioctl(2). It would be strange not to be able, similarly, to learn things about a process given a handle (a pidfd) to that process. I want process handles to be "boring" in that they let users query and manipulate processes mostly like they manipulate other resources. > If you really need this, the pid this pidfd is mapped to can be > exposed through fdinfo (which is perfectly fine for your case as you > use /proc anyway). This means you can reliably determine if you're > opening it for the same pid (and it hasn't been recycled) because this > pidfd will be pollable for state change (in the future). Exposing it > through fdinfo isn't a problem, pid ns is bound to the process during > its lifetime, it's a process creation time property, so the value > isn't going to change anyway. So you can do all metadata access you > want subsequently. Thanks for the proposal. I'm a bit confused. Are you suggesting that we report the numeric FD in fdinfo? Are you saying it should work basically like this? int get_oom_score_adj(int pidfd) { unique_fd fdinfo_fd = open(fmt("/proc/self/fdinfo/%d", pidfd)); string fdinfo = read_all(fdinfo_fd); numeric_pid = parse_fdinfo_for_line("pid"); unique_fd procdir_fd = open(fmt("/proc/%d", numeric_pid), O_DIRECTORY); if(!is_pidfd_alive(pidfd)) { return -1; /* process died */ } /* LIVENESS CHECK */ // We know the process named by pidfd was called NUMERIC_PID // in our namespace (because fdinfo told us) and we know that the named process // was alive after we successfully opened its /proc/pid directory, therefore, // PROCDIR_FD and PIDFD must refer to the same underlying process. unique_fd oom_adj_score_fd = openat(procdir_fd, "oom_score_adj"); return parse_int(read_all(oom_adj_score_fd)); } While this interface functions correctly, I have two concerns: 1) the number of system calls necessary is very large -- by my count, starting with the pifd, we need six, not counting the follow-on close(2) calls (which would bring the total to nine[1]), and 2) it's "fail unsafe": IMHO, most users in practice will skip the line marked "LIVENESS CHECK", and as a result, their code will appear to work but contain subtle race conditions. An explicit interface to translate from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file descriptor would be both more
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione wrote: > > [..snip..] > > I don't like the idea of having one kind of pollfd be pollable and > another not. Such an interface would be confusing for users. If, as > you suggest below, we instead make the procfs-less FD the only thing > we call a "pidfd", then this proposal becomes less confusing and more > viable. That's certainly on the table, we remove the ability to open /proc/ and use the dir fd with pidfd_send_signal. I'm in favor of this. > > But.. one thing we could do for Daniel usecase is if a /proc/pid directory > > fd > > can be translated into a "pidfd" using another syscall or even a node, like > > /proc/pid/handle or something. I think this is what Christian suggested in > > the previous threads. > > /proc/pid/handle, if I understand correctly, "translates" a > procfs-based pidfd to a non-pidfd one. The needed interface would have > to perform the opposite translation, providing a procfs directory for > a given pidfd. > > > And also for the translation the other way, add a syscall or modify > > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid > > directory fd. Then the user is welcomed to do openat(2) on _that_ directory > > fd. > > Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not > > to /proc/pid directory fds. > > This approach would work, but there's one subtlety to take into > account: which procfs? You'd want to take, as inputs, 1) the procfs > root you want, and 2) the pidfd, this way you could return to the user > a directory FD in the right procfs. > I don't think metadata access is in the scope of such a pidfd itself. > > Should we work on patches for these? Please let us know if this idea makes > > sense and thanks a lot for adding us to the review as well. > > I would strongly prefer that we not merge pidctl (or whatever it is) > without a story for metadata access, be it your suggestion or > something else. IMO, this is out of scope for a process handle. Hence, the need for metadata access musn't stall it as is. If you really need this, the pid this pidfd is mapped to can be exposed through fdinfo (which is perfectly fine for your case as you use /proc anyway). This means you can reliably determine if you're opening it for the same pid (and it hasn't been recycled) because this pidfd will be pollable for state change (in the future). Exposing it through fdinfo isn't a problem, pid ns is bound to the process during its lifetime, it's a process creation time property, so the value isn't going to change anyway. So you can do all metadata access you want subsequently. The upshot is that this same pidfd can then be returned from a future extension of clone(2) Christian is planning to work on. It is still undecided whether this fd should be only readable by the parent process or everyone (which would just be a matter of changing the access mode depending on the caller), but that we can discuss all that when the patchset is posted.
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 10:36 AM Joel Fernandes wrote: > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote: > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner > > wrote: > > > The pidctl() syscalls builds on, extends, and improves translate_pid() > > > [4]. > > > I quote Konstantins original patchset first that has already been acked > > > and > > > picked up by Eric before and whose functionality is preserved in this > > > syscall. Multiple people have asked when this patchset will be sent in > > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam > > > Muthusamy from Oracle [3]. > > > > > > The intention of the original translate_pid() syscall was twofold: > > > 1. Provide translation of pids between pid namespaces > > > 2. Provide implicit pid namespace introspection > > > > > > Both functionalities are preserved. The latter task has been improved > > > upon though. In the original version of the pachset passing pid as 1 > > > would allow to deterimine the relationship between the pid namespaces. > > > This is inherhently racy. If pid 1 inside a pid namespace has died it > > > would report false negatives. For example, if pid 1 inside of the target > > > pid namespace already died, it would report that the target pid > > > namespace cannot be reached from the source pid namespace because it > > > couldn't find the pid inside of the target pid namespace and thus > > > falsely report to the user that the two pid namespaces are not related. > > > This problem is simple to avoid. In the new version we simply walk the > > > list of ancestors and check whether the namespace are related to each > > > other. By doing it this way we can reliably report what the relationship > > > between two pid namespace file descriptors looks like. > > > > > > Additionally, this syscall has been extended to allow the retrieval of > > > pidfds independent of procfs. These pidfds can e.g. be used with the new > > > pidfd_send_signal() syscall we recently merged. The ability to retrieve > > > pidfds independent of procfs had already been requested in the > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey > > > [5]. A use-case where a kernel is compiled without procfs but where > > > pidfds are still useful has been outlined by Andy in [6]. Regular > > > anon-inode based file descriptors are used that stash a reference to > > > struct pid in file->private_data and drop that reference on close. > > > > > > With this translate_pid() has three closely related but still distinct > > > functionalities. To clarify the semantics and to make it easier for > > > userspace to use the syscall it has: > > > - gained a command argument and three commands clearly reflecting the > > > distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, > > > PIDCMD_GET_PIDFD). > > > - been renamed to pidctl() > > > [snip] > > Also, I'm still confused about how metadata access is supposed to work > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process, > > You snipped out a portion of a previous email in which I asked about > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in > > place, we have two different kinds of file descriptors for processes, > > one derived from procfs and one that's independent. The former works > > with openat(2). The latter does not. To be very specific; if I'm > > writing a function that accepts a pidfd and I get a pidfd that comes > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of > > smaps or oom_score_adj or statm for the named process in a race-free > > manner? > > This is true, that such usecase will not be supportable. But the advantage > on the other hand, is that suchs "pidfd" can be made pollable or readable in > the future. Potentially allowing us to return exit status without a new > syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do > with proc. I don't like the idea of having one kind of pollfd be pollable and another not. Such an interface would be confusing for users. If, as you suggest below, we instead make the procfs-less FD the only thing we call a "pidfd", then this proposal becomes less confusing and more viable. > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd > can be translated into a "pidfd" using another syscall or even a node, like > /proc/pid/handle or something. I think this is what Christian suggested in > the previous threads. /proc/pid/handle, if I understand correctly, "translates" a procfs-based pidfd to a non-pidfd one. The needed interface would have to perform the opposite translation, providing a procfs directory for a given pidfd. > And also for the translation the other way, add a syscall or modify > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid > directory fd. Then the user is welcomed to do openat(2) on _that_ directory > fd. > Then we modify pidfd_send_signal
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote: > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner > wrote: > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4]. > > I quote Konstantins original patchset first that has already been acked and > > picked up by Eric before and whose functionality is preserved in this > > syscall. Multiple people have asked when this patchset will be sent in > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam > > Muthusamy from Oracle [3]. > > > > The intention of the original translate_pid() syscall was twofold: > > 1. Provide translation of pids between pid namespaces > > 2. Provide implicit pid namespace introspection > > > > Both functionalities are preserved. The latter task has been improved > > upon though. In the original version of the pachset passing pid as 1 > > would allow to deterimine the relationship between the pid namespaces. > > This is inherhently racy. If pid 1 inside a pid namespace has died it > > would report false negatives. For example, if pid 1 inside of the target > > pid namespace already died, it would report that the target pid > > namespace cannot be reached from the source pid namespace because it > > couldn't find the pid inside of the target pid namespace and thus > > falsely report to the user that the two pid namespaces are not related. > > This problem is simple to avoid. In the new version we simply walk the > > list of ancestors and check whether the namespace are related to each > > other. By doing it this way we can reliably report what the relationship > > between two pid namespace file descriptors looks like. > > > > Additionally, this syscall has been extended to allow the retrieval of > > pidfds independent of procfs. These pidfds can e.g. be used with the new > > pidfd_send_signal() syscall we recently merged. The ability to retrieve > > pidfds independent of procfs had already been requested in the > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey > > [5]. A use-case where a kernel is compiled without procfs but where > > pidfds are still useful has been outlined by Andy in [6]. Regular > > anon-inode based file descriptors are used that stash a reference to > > struct pid in file->private_data and drop that reference on close. > > > > With this translate_pid() has three closely related but still distinct > > functionalities. To clarify the semantics and to make it easier for > > userspace to use the syscall it has: > > - gained a command argument and three commands clearly reflecting the > > distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, > > PIDCMD_GET_PIDFD). > > - been renamed to pidctl() > [snip] > Also, I'm still confused about how metadata access is supposed to work > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process, > You snipped out a portion of a previous email in which I asked about > your thoughts on this question. With the PIDCMD_GET_PIDFD command in > place, we have two different kinds of file descriptors for processes, > one derived from procfs and one that's independent. The former works > with openat(2). The latter does not. To be very specific; if I'm > writing a function that accepts a pidfd and I get a pidfd that comes > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of > smaps or oom_score_adj or statm for the named process in a race-free > manner? This is true, that such usecase will not be supportable. But the advantage on the other hand, is that suchs "pidfd" can be made pollable or readable in the future. Potentially allowing us to return exit status without a new syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do with proc. But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd can be translated into a "pidfd" using another syscall or even a node, like /proc/pid/handle or something. I think this is what Christian suggested in the previous threads. And also for the translation the other way, add a syscall or modify translate_fd or something, to covert a anon_inode pidfd into a /proc/pid directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd. Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not to /proc/pid directory fds. Should we work on patches for these? Please let us know if this idea makes sense and thanks a lot for adding us to the review as well. Best, - Joel
Re: [PATCH 0/4] pid: add pidctl()
On 25.03.2019 19:48, Daniel Colascione wrote: On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner wrote: The pidctl() syscalls builds on, extends, and improves translate_pid() [4]. I quote Konstantins original patchset first that has already been acked and picked up by Eric before and whose functionality is preserved in this syscall. Multiple people have asked when this patchset will be sent in for merging (cf. [1], [2]). It has recently been revived by Nagarathnam Muthusamy from Oracle [3]. The intention of the original translate_pid() syscall was twofold: 1. Provide translation of pids between pid namespaces 2. Provide implicit pid namespace introspection Both functionalities are preserved. The latter task has been improved upon though. In the original version of the pachset passing pid as 1 would allow to deterimine the relationship between the pid namespaces. This is inherhently racy. If pid 1 inside a pid namespace has died it would report false negatives. For example, if pid 1 inside of the target pid namespace already died, it would report that the target pid namespace cannot be reached from the source pid namespace because it couldn't find the pid inside of the target pid namespace and thus falsely report to the user that the two pid namespaces are not related. This problem is simple to avoid. In the new version we simply walk the list of ancestors and check whether the namespace are related to each other. By doing it this way we can reliably report what the relationship between two pid namespace file descriptors looks like. Additionally, this syscall has been extended to allow the retrieval of pidfds independent of procfs. These pidfds can e.g. be used with the new pidfd_send_signal() syscall we recently merged. The ability to retrieve pidfds independent of procfs had already been requested in the pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey [5]. A use-case where a kernel is compiled without procfs but where pidfds are still useful has been outlined by Andy in [6]. Regular anon-inode based file descriptors are used that stash a reference to struct pid in file->private_data and drop that reference on close. With this translate_pid() has three closely related but still distinct functionalities. To clarify the semantics and to make it easier for userspace to use the syscall it has: - gained a command argument and three commands clearly reflecting the distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, PIDCMD_GET_PIDFD). - been renamed to pidctl() Having made these changes, you've built a general-purpose command command multiplexer, not one operation that happens to be flexible. The general-purpose command multiplexer is a common antipattern: multiplexers make it hard to talk about different kernel-provided operations using the common vocabulary we use to distinguish kernel-related operations, the system call number. socketcall, for example, turned out to be cumbersome for users like SELinux policy writers. People had to do work work later to split socketcall into fine-grained system calls. Please split the pidctl system call so that the design is clean from the start and we avoid work later. System calls are cheap. Also, I'm still confused about how metadata access is supposed to work for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process, You snipped out a portion of a previous email in which I asked about your thoughts on this question. With the PIDCMD_GET_PIDFD command in place, we have two different kinds of file descriptors for processes, one derived from procfs and one that's independent. The former works with openat(2). The latter does not. To be very specific; if I'm writing a function that accepts a pidfd and I get a pidfd that comes from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of smaps or oom_score_adj or statm for the named process in a race-free manner? Task metadata could be exposed via "pages" identified by offset: struct pidfd_stats stats; pread(pidfd, , sizeof(stats), PIDFD_STATS_OFFSET); I'm not sure that we need yet another binary procfs. But it will be faster than current text-based for sure.
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 10:05 AM Konstantin Khlebnikov wrote: > On 25.03.2019 19:48, Daniel Colascione wrote: > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner > > wrote: > >> The pidctl() syscalls builds on, extends, and improves translate_pid() [4]. > >> I quote Konstantins original patchset first that has already been acked and > >> picked up by Eric before and whose functionality is preserved in this > >> syscall. Multiple people have asked when this patchset will be sent in > >> for merging (cf. [1], [2]). It has recently been revived by Nagarathnam > >> Muthusamy from Oracle [3]. > >> > >> The intention of the original translate_pid() syscall was twofold: > >> 1. Provide translation of pids between pid namespaces > >> 2. Provide implicit pid namespace introspection > >> > >> Both functionalities are preserved. The latter task has been improved > >> upon though. In the original version of the pachset passing pid as 1 > >> would allow to deterimine the relationship between the pid namespaces. > >> This is inherhently racy. If pid 1 inside a pid namespace has died it > >> would report false negatives. For example, if pid 1 inside of the target > >> pid namespace already died, it would report that the target pid > >> namespace cannot be reached from the source pid namespace because it > >> couldn't find the pid inside of the target pid namespace and thus > >> falsely report to the user that the two pid namespaces are not related. > >> This problem is simple to avoid. In the new version we simply walk the > >> list of ancestors and check whether the namespace are related to each > >> other. By doing it this way we can reliably report what the relationship > >> between two pid namespace file descriptors looks like. > >> > >> Additionally, this syscall has been extended to allow the retrieval of > >> pidfds independent of procfs. These pidfds can e.g. be used with the new > >> pidfd_send_signal() syscall we recently merged. The ability to retrieve > >> pidfds independent of procfs had already been requested in the > >> pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey > >> [5]. A use-case where a kernel is compiled without procfs but where > >> pidfds are still useful has been outlined by Andy in [6]. Regular > >> anon-inode based file descriptors are used that stash a reference to > >> struct pid in file->private_data and drop that reference on close. > >> > >> With this translate_pid() has three closely related but still distinct > >> functionalities. To clarify the semantics and to make it easier for > >> userspace to use the syscall it has: > >> - gained a command argument and three commands clearly reflecting the > >>distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, > >>PIDCMD_GET_PIDFD). > >> - been renamed to pidctl() > > > > Having made these changes, you've built a general-purpose command > > command multiplexer, not one operation that happens to be flexible. > > The general-purpose command multiplexer is a common antipattern: > > multiplexers make it hard to talk about different kernel-provided > > operations using the common vocabulary we use to distinguish > > kernel-related operations, the system call number. socketcall, for > > example, turned out to be cumbersome for users like SELinux policy > > writers. People had to do work work later to split socketcall into > > fine-grained system calls. Please split the pidctl system call so that > > the design is clean from the start and we avoid work later. System > > calls are cheap. > > > > Also, I'm still confused about how metadata access is supposed to work > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process, > > You snipped out a portion of a previous email in which I asked about > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in > > place, we have two different kinds of file descriptors for processes, > > one derived from procfs and one that's independent. The former works > > with openat(2). The latter does not. To be very specific; if I'm > > writing a function that accepts a pidfd and I get a pidfd that comes > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of > > smaps or oom_score_adj or statm for the named process in a race-free > > manner? > > > > Task metadata could be exposed via "pages" identified by offset: > > struct pidfd_stats stats; > > pread(pidfd, , sizeof(stats), PIDFD_STATS_OFFSET); > > I'm not sure that we need yet another binary procfs. > But it will be faster than current text-based for sure. There are many options. I have some detailed thoughts on several of them, but long design emails seem rather unwelcome. Right now, I'd like to hear how Christian intends his patch set to address this use case.
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 9:56 AM David Howells wrote: > Daniel Colascione wrote: > > > System calls are cheap. > > Only to a point. x86_64 will have an issue when we hit syscall 512. We're > currently at 427. IIRC, a while ago, someone proposed restarting system call numbering above (again, IIRC) 1024 for both 32- and 64-bit varieties to establish a clean slate and sidestep this problem.
Re: [PATCH 0/4] pid: add pidctl()
Daniel Colascione wrote: > System calls are cheap. Only to a point. x86_64 will have an issue when we hit syscall 512. We're currently at 427. David
Re: [PATCH 0/4] pid: add pidctl()
On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner wrote: > The pidctl() syscalls builds on, extends, and improves translate_pid() [4]. > I quote Konstantins original patchset first that has already been acked and > picked up by Eric before and whose functionality is preserved in this > syscall. Multiple people have asked when this patchset will be sent in > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam > Muthusamy from Oracle [3]. > > The intention of the original translate_pid() syscall was twofold: > 1. Provide translation of pids between pid namespaces > 2. Provide implicit pid namespace introspection > > Both functionalities are preserved. The latter task has been improved > upon though. In the original version of the pachset passing pid as 1 > would allow to deterimine the relationship between the pid namespaces. > This is inherhently racy. If pid 1 inside a pid namespace has died it > would report false negatives. For example, if pid 1 inside of the target > pid namespace already died, it would report that the target pid > namespace cannot be reached from the source pid namespace because it > couldn't find the pid inside of the target pid namespace and thus > falsely report to the user that the two pid namespaces are not related. > This problem is simple to avoid. In the new version we simply walk the > list of ancestors and check whether the namespace are related to each > other. By doing it this way we can reliably report what the relationship > between two pid namespace file descriptors looks like. > > Additionally, this syscall has been extended to allow the retrieval of > pidfds independent of procfs. These pidfds can e.g. be used with the new > pidfd_send_signal() syscall we recently merged. The ability to retrieve > pidfds independent of procfs had already been requested in the > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey > [5]. A use-case where a kernel is compiled without procfs but where > pidfds are still useful has been outlined by Andy in [6]. Regular > anon-inode based file descriptors are used that stash a reference to > struct pid in file->private_data and drop that reference on close. > > With this translate_pid() has three closely related but still distinct > functionalities. To clarify the semantics and to make it easier for > userspace to use the syscall it has: > - gained a command argument and three commands clearly reflecting the > distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, > PIDCMD_GET_PIDFD). > - been renamed to pidctl() Having made these changes, you've built a general-purpose command command multiplexer, not one operation that happens to be flexible. The general-purpose command multiplexer is a common antipattern: multiplexers make it hard to talk about different kernel-provided operations using the common vocabulary we use to distinguish kernel-related operations, the system call number. socketcall, for example, turned out to be cumbersome for users like SELinux policy writers. People had to do work work later to split socketcall into fine-grained system calls. Please split the pidctl system call so that the design is clean from the start and we avoid work later. System calls are cheap. Also, I'm still confused about how metadata access is supposed to work for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process, You snipped out a portion of a previous email in which I asked about your thoughts on this question. With the PIDCMD_GET_PIDFD command in place, we have two different kinds of file descriptors for processes, one derived from procfs and one that's independent. The former works with openat(2). The latter does not. To be very specific; if I'm writing a function that accepts a pidfd and I get a pidfd that comes from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of smaps or oom_score_adj or statm for the named process in a race-free manner?