Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Joel Fernandes
On Mon, Mar 25, 2019 at 09:54:58PM +, Jonathan Kowalski wrote:
> On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes  wrote:
> >
> > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes  
> > > wrote:
> > >
> > > But often you don't just want to wait for a single thing to happen;
> > > you want to wait for many things at once, and react as soon as any one
> > > of them happens. This is why the kernel has epoll and all the other
> > > "wait for event on FD" APIs. If waiting for a process isn't possible
> > > with fd-based APIs like epoll, users of this API have to spin up
> > > useless helper threads.
> >
> > This is true. I almost forgot about the polling requirement, sorry. So then 
> > a
> > new syscall it is.. About what to wait for, that can be a separate parameter
> > to pidfd_wait then.
> >
> 
> This introduces a time window where the process changes state between
> "get pidfd" and "setup waitfd", it would be simpler if the pidfd
> itself supported .poll and on termination the exit status was made
> readable from the file descriptor.

It is much cleaner to add a new pidfd_wait syscall for this as discussed in
the other thread. Adding .poll directly to the pidfd would seem to complicate
the blocking configuration. Do we block until the task is a zombie or until
it is dead?  That is not possible to specify easily. Also if we need to
return other types of information from the pidfd, not just exit state, then
it is not clear whether blocking on a pidfd just purely on exit status makes
sense. It is much cleaner to add a new pidfd_wait syscall giving it a pidfd,
specify what to block on (EXIT_DEAD or EXIT_ZOMBIE or both) and then return a
wait fd that can be read/blocked and returning all the needed information on
unblock.

> Further, in the clone4 patchset, there was a mechanism to autoreap
> such a process so that it does not interfere with waiting a process
> does normally. How do you intend to handle this case if anyone except
> the parent is wanting to *wait* on the process (a second process,
> without reaping, so as to not disrupt any waiting in the parent), and
> for things like libraries that finally can manage their own set of
> process when pidfd_clone is added, by excluding this process from the
> process's normal wait handling logic.

The pidfd_wait logic being discussed does not depend on or affects the
autoreap behavior. This wait is not the traditional wait family of calls.
It is for just getting notified about reading all the exit state of a task.
Once I post the code, it will be clear. And I'll CC you..

thanks,

 - Joel



Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Andy Lutomirski
On Mon, Mar 25, 2019 at 9:56 AM David Howells  wrote:
>
> Daniel Colascione  wrote:
>
> > System calls are cheap.
>
> Only to a point.  x86_64 will have an issue when we hit syscall 512.  We're
> currently at 427.
>

I don't consider this to be a problem.  I have patches to make this
problem go away forever.  I just need to tidy them up a bit and
upstream them.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 3:37 PM Jonathan Kowalski  wrote:
>
> On Mon, Mar 25, 2019 at 10:07 PM Daniel Colascione  wrote:
> >
> > On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski  
> > wrote:
> > >
> > > On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes  
> > > wrote:
> > > >
> > > > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > > > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes 
> > > > >  wrote:
> > > > >
> > > > > But often you don't just want to wait for a single thing to happen;
> > > > > you want to wait for many things at once, and react as soon as any one
> > > > > of them happens. This is why the kernel has epoll and all the other
> > > > > "wait for event on FD" APIs. If waiting for a process isn't possible
> > > > > with fd-based APIs like epoll, users of this API have to spin up
> > > > > useless helper threads.
> > > >
> > > > This is true. I almost forgot about the polling requirement, sorry. So 
> > > > then a
> > > > new syscall it is.. About what to wait for, that can be a separate 
> > > > parameter
> > > > to pidfd_wait then.
> > > >
> > >
> > > This introduces a time window where the process changes state between
> > > "get pidfd" and "setup waitfd", it would be simpler if the pidfd
> > > itself supported .poll and on termination the exit status was made
> > > readable from the file descriptor.
> >
> > I don't see a race here. Process state moves in one direction, so
> > there's no race. If you want the poll to end when a process exits and
> > the process exits before you poll, the poll should finish immediately.
> > If the process exits before you even create the polling FD, whatever
> > creates the polling FD can fail with ESRCH, which is what usually
> > happens when you try to do anything with a process that's gone. Either
> > way, whatever's trying to set up the poll knows the state of the
> > process and there's no possibility of a missed wakeup.
> >
>
> poll will possibly work and return immediately, but you won't be able
> to read back anything, because for the kernel there will be no waiter
> before you open one. If you make pidfd pollable and readable (for
> parents, as an example), the poll ends immediately but there will
> something to read from the fd.

You can lose exit status this way. That's not a problem, though,
because *without* synchronization between the exiting process and the
waiting process, the waiting process can lose the race no matter what,
and *with* synchronization, there's no change of losing the
information. (It's just like a condition variable.) Today,
synchronization between a parent and child is automatic (because
zombies), but direct children work reliably with pidfd too: the
descriptor that pidfd_clone (or whatever it ends up being called)
returns will *always* be created before the child process and so will
*always* have exit status information available, even with some
non-SIGCHLD option applied.

> > > Further, in the clone4 patchset, there was a mechanism to autoreap
> > > such a process so that it does not interfere with waiting a process
> > > does normally. How do you intend to handle this case if anyone except
> > > the parent is wanting to *wait* on the process (a second process,
> > > without reaping, so as to not disrupt any waiting in the parent), and
> > > for things like libraries that finally can manage their own set of
> > > process when pidfd_clone is added, by excluding this process from the
> > > process's normal wait handling logic.
> >
> > I think that discussion is best had around pidfd_clone. pidfd waiting
> > functionality shouldn't affect wait* in any way nor affect a zombie
> > transition or reaping.
>
> So this is more akin to waitfd and traditional wait* and friends, and
> a single notification fd for possibly multiple children?

I don't know what gave you this idea. I'm talking about a model in
which you wait for one child with one open file description.

> I just wanted
> to be sure one would (should) be able to use a pidfd obtained from
> pidfd_clone without affecting the existing waitfd, and other
> primitives.

I don't think the use of pidfd *waiting*, by itself, should affect the
current fork/wait/zombie model of process exit delivery. That is, by
default, a process should transition to the zombie state, send SIGCHLD
to its parent, and then get reaped at exactly the same times it does
today, with exactly the same semantics we have today. Pidfd waiting
should be an "add on".

> The same application's components should be able to host
> their own set of children using such an API (think libraries) and
> without affecting the process.

Agreed. To be clear, you're talking about something slightly different
from pure pidfd waiting. There should be some way to *create* a
process in such a way that it's not visible to wait(-1), doesn't
generate SIGCHLD, and so on. It would be great for libraries to create
and manage worker processes transparently, even in the presence of
some other thread sitting on 

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Jonathan Kowalski
On Mon, Mar 25, 2019 at 10:07 PM Daniel Colascione  wrote:
>
> On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski  wrote:
> >
> > On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes  
> > wrote:
> > >
> > > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes 
> > > >  wrote:
> > > >
> > > > But often you don't just want to wait for a single thing to happen;
> > > > you want to wait for many things at once, and react as soon as any one
> > > > of them happens. This is why the kernel has epoll and all the other
> > > > "wait for event on FD" APIs. If waiting for a process isn't possible
> > > > with fd-based APIs like epoll, users of this API have to spin up
> > > > useless helper threads.
> > >
> > > This is true. I almost forgot about the polling requirement, sorry. So 
> > > then a
> > > new syscall it is.. About what to wait for, that can be a separate 
> > > parameter
> > > to pidfd_wait then.
> > >
> >
> > This introduces a time window where the process changes state between
> > "get pidfd" and "setup waitfd", it would be simpler if the pidfd
> > itself supported .poll and on termination the exit status was made
> > readable from the file descriptor.
>
> I don't see a race here. Process state moves in one direction, so
> there's no race. If you want the poll to end when a process exits and
> the process exits before you poll, the poll should finish immediately.
> If the process exits before you even create the polling FD, whatever
> creates the polling FD can fail with ESRCH, which is what usually
> happens when you try to do anything with a process that's gone. Either
> way, whatever's trying to set up the poll knows the state of the
> process and there's no possibility of a missed wakeup.
>

poll will possibly work and return immediately, but you won't be able
to read back anything, because for the kernel there will be no waiter
before you open one. If you make pidfd pollable and readable (for
parents, as an example), the poll ends immediately but there will
something to read from the fd.

> > Further, in the clone4 patchset, there was a mechanism to autoreap
> > such a process so that it does not interfere with waiting a process
> > does normally. How do you intend to handle this case if anyone except
> > the parent is wanting to *wait* on the process (a second process,
> > without reaping, so as to not disrupt any waiting in the parent), and
> > for things like libraries that finally can manage their own set of
> > process when pidfd_clone is added, by excluding this process from the
> > process's normal wait handling logic.
>
> I think that discussion is best had around pidfd_clone. pidfd waiting
> functionality shouldn't affect wait* in any way nor affect a zombie
> transition or reaping.

So this is more akin to waitfd and traditional wait* and friends, and
a single notification fd for possibly multiple children? I just wanted
to be sure one would (should) be able to use a pidfd obtained from
pidfd_clone without affecting the existing waitfd, and other
primitives. The same application's components should be able to host
their own set of children using such an API (think libraries) and
without affecting the process.

>
> > Moreover, should anyone except the parent process be allowed to open a
> > readable pidfd (or waitfd), without additional capabilities?
>
> That's a separate discussion. See
> https://lore.kernel.org/lkml/cakozueussvwabqac+o9fw+mzayccvttkqzfwg0hh-cz+1yk...@mail.gmail.com/,
>  where we talked about permissions extensively. Anyone can hammer on
> /proc today or hammer on kill(pid, 0) to learn about a process
> exiting, so anyone should be able to wait for a process to die --- it
> just automates what anyone can do manually. What's left is the
> question of who should be able to learn a process's exit code. As I
> mentioned in the linked email, process exit codes are public on
> FreeBSD, and the simplest option is to make them public in Linux too.

People might be using them in ways that convey information between the
parent and child something else shouldn't be able to know. They might
be relying on this assumption that it is private. I don't think
opening it up without requiring *some* privileges is safe.

Also, taking this further, if the structure being returned from a read
isn't just the exit code but something more elaborate (you have
mentioned siginfo previously), or even statistics like getrusage, I
would be concerned if anyone except those with CAP_SYS_PTRACE or so
should be able to obtain such a readable pidfd/waitfd.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski  wrote:
>
> On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes  wrote:
> >
> > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes  
> > > wrote:
> > >
> > > But often you don't just want to wait for a single thing to happen;
> > > you want to wait for many things at once, and react as soon as any one
> > > of them happens. This is why the kernel has epoll and all the other
> > > "wait for event on FD" APIs. If waiting for a process isn't possible
> > > with fd-based APIs like epoll, users of this API have to spin up
> > > useless helper threads.
> >
> > This is true. I almost forgot about the polling requirement, sorry. So then 
> > a
> > new syscall it is.. About what to wait for, that can be a separate parameter
> > to pidfd_wait then.
> >
>
> This introduces a time window where the process changes state between
> "get pidfd" and "setup waitfd", it would be simpler if the pidfd
> itself supported .poll and on termination the exit status was made
> readable from the file descriptor.

I don't see a race here. Process state moves in one direction, so
there's no race. If you want the poll to end when a process exits and
the process exits before you poll, the poll should finish immediately.
If the process exits before you even create the polling FD, whatever
creates the polling FD can fail with ESRCH, which is what usually
happens when you try to do anything with a process that's gone. Either
way, whatever's trying to set up the poll knows the state of the
process and there's no possibility of a missed wakeup.

> Further, in the clone4 patchset, there was a mechanism to autoreap
> such a process so that it does not interfere with waiting a process
> does normally. How do you intend to handle this case if anyone except
> the parent is wanting to *wait* on the process (a second process,
> without reaping, so as to not disrupt any waiting in the parent), and
> for things like libraries that finally can manage their own set of
> process when pidfd_clone is added, by excluding this process from the
> process's normal wait handling logic.

I think that discussion is best had around pidfd_clone. pidfd waiting
functionality shouldn't affect wait* in any way nor affect a zombie
transition or reaping.

> Moreover, should anyone except the parent process be allowed to open a
> readable pidfd (or waitfd), without additional capabilities?

That's a separate discussion. See
https://lore.kernel.org/lkml/cakozueussvwabqac+o9fw+mzayccvttkqzfwg0hh-cz+1yk...@mail.gmail.com/,
where we talked about permissions extensively. Anyone can hammer on
/proc today or hammer on kill(pid, 0) to learn about a process
exiting, so anyone should be able to wait for a process to die --- it
just automates what anyone can do manually. What's left is the
question of who should be able to learn a process's exit code. As I
mentioned in the linked email, process exit codes are public on
FreeBSD, and the simplest option is to make them public in Linux too.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Jonathan Kowalski
On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes  wrote:
>
> On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes  
> > wrote:
> >
> > But often you don't just want to wait for a single thing to happen;
> > you want to wait for many things at once, and react as soon as any one
> > of them happens. This is why the kernel has epoll and all the other
> > "wait for event on FD" APIs. If waiting for a process isn't possible
> > with fd-based APIs like epoll, users of this API have to spin up
> > useless helper threads.
>
> This is true. I almost forgot about the polling requirement, sorry. So then a
> new syscall it is.. About what to wait for, that can be a separate parameter
> to pidfd_wait then.
>

This introduces a time window where the process changes state between
"get pidfd" and "setup waitfd", it would be simpler if the pidfd
itself supported .poll and on termination the exit status was made
readable from the file descriptor.

Further, in the clone4 patchset, there was a mechanism to autoreap
such a process so that it does not interfere with waiting a process
does normally. How do you intend to handle this case if anyone except
the parent is wanting to *wait* on the process (a second process,
without reaping, so as to not disrupt any waiting in the parent), and
for things like libraries that finally can manage their own set of
process when pidfd_clone is added, by excluding this process from the
process's normal wait handling logic.

Moreover, should anyone except the parent process be allowed to open a
readable pidfd (or waitfd), without additional capabilities?

> Thanks.
>
>  - Joel
>


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Joel Fernandes
On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes  
> wrote:
> > On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote:
> > > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> > > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner 
> > > > >  wrote:
> > > > > > The pidctl() syscalls builds on, extends, and improves 
> > > > > > translate_pid() [4].
> > > > > > I quote Konstantins original patchset first that has already been 
> > > > > > acked and
> > > > > > picked up by Eric before and whose functionality is preserved in 
> > > > > > this
> > > > > > syscall. Multiple people have asked when this patchset will be sent 
> > > > > > in
> > > > > > for merging (cf. [1], [2]). It has recently been revived by 
> > > > > > Nagarathnam
> > > > > > Muthusamy from Oracle [3].
> > > > > >
> > > > > > The intention of the original translate_pid() syscall was twofold:
> > > > > > 1. Provide translation of pids between pid namespaces
> > > > > > 2. Provide implicit pid namespace introspection
> > > > > >
> > > > > > Both functionalities are preserved. The latter task has been 
> > > > > > improved
> > > > > > upon though. In the original version of the pachset passing pid as 1
> > > > > > would allow to deterimine the relationship between the pid 
> > > > > > namespaces.
> > > > > > This is inherhently racy. If pid 1 inside a pid namespace has died 
> > > > > > it
> > > > > > would report false negatives. For example, if pid 1 inside of the 
> > > > > > target
> > > > > > pid namespace already died, it would report that the target pid
> > > > > > namespace cannot be reached from the source pid namespace because it
> > > > > > couldn't find the pid inside of the target pid namespace and thus
> > > > > > falsely report to the user that the two pid namespaces are not 
> > > > > > related.
> > > > > > This problem is simple to avoid. In the new version we simply walk 
> > > > > > the
> > > > > > list of ancestors and check whether the namespace are related to 
> > > > > > each
> > > > > > other. By doing it this way we can reliably report what the 
> > > > > > relationship
> > > > > > between two pid namespace file descriptors looks like.
> > > > > >
> > > > > > Additionally, this syscall has been extended to allow the retrieval 
> > > > > > of
> > > > > > pidfds independent of procfs. These pidfds can e.g. be used with 
> > > > > > the new
> > > > > > pidfd_send_signal() syscall we recently merged. The ability to 
> > > > > > retrieve
> > > > > > pidfds independent of procfs had already been requested in the
> > > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by 
> > > > > > Alexey
> > > > > > [5]. A use-case where a kernel is compiled without procfs but where
> > > > > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > > > > anon-inode based file descriptors are used that stash a reference to
> > > > > > struct pid in file->private_data and drop that reference on close.
> > > > > >
> > > > > > With this translate_pid() has three closely related but still 
> > > > > > distinct
> > > > > > functionalities. To clarify the semantics and to make it easier for
> > > > > > userspace to use the syscall it has:
> > > > > > - gained a command argument and three commands clearly reflecting 
> > > > > > the
> > > > > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > > > > >   PIDCMD_GET_PIDFD).
> > > > > > - been renamed to pidctl()
> > > > >
> > > > [snip]
> > > > > Also, I'm still confused about how metadata access is supposed to work
> > > > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > > > > You snipped out a portion of a previous email in which I asked about
> > > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > > > > place, we have two different kinds of file descriptors for processes,
> > > > > one derived from procfs and one that's independent. The former works
> > > > > with openat(2). The latter does not. To be very specific; if I'm
> > > > > writing a function that accepts a pidfd and I get a pidfd that comes
> > > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > > > > smaps or oom_score_adj or statm for the named process in a race-free
> > > > > manner?
> > > >
> > > > This is true, that such usecase will not be supportable.  But the 
> > > > advantage
> > > > on the other hand, is that suchs "pidfd" can be made pollable or 
> > > > readable in
> > > > the future. Potentially allowing us to return exit status without a new
> > > > syscall (?). And we can add IOCTLs to the pidfd descriptor which we 
> > > > cannot do
> > > > with proc.
> > > >
> > > > But.. one thing we could do for Daniel usecase is if a /proc/pid 
> > > > directory fd
> > > > can be translated into a "pidfd" using another syscall or even a 

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Jann Horn
On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes  wrote:
> On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote:
> > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner 
> > > >  wrote:
> > > > > The pidctl() syscalls builds on, extends, and improves 
> > > > > translate_pid() [4].
> > > > > I quote Konstantins original patchset first that has already been 
> > > > > acked and
> > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > syscall. Multiple people have asked when this patchset will be sent in
> > > > > for merging (cf. [1], [2]). It has recently been revived by 
> > > > > Nagarathnam
> > > > > Muthusamy from Oracle [3].
> > > > >
> > > > > The intention of the original translate_pid() syscall was twofold:
> > > > > 1. Provide translation of pids between pid namespaces
> > > > > 2. Provide implicit pid namespace introspection
> > > > >
> > > > > Both functionalities are preserved. The latter task has been improved
> > > > > upon though. In the original version of the pachset passing pid as 1
> > > > > would allow to deterimine the relationship between the pid namespaces.
> > > > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > > > would report false negatives. For example, if pid 1 inside of the 
> > > > > target
> > > > > pid namespace already died, it would report that the target pid
> > > > > namespace cannot be reached from the source pid namespace because it
> > > > > couldn't find the pid inside of the target pid namespace and thus
> > > > > falsely report to the user that the two pid namespaces are not 
> > > > > related.
> > > > > This problem is simple to avoid. In the new version we simply walk the
> > > > > list of ancestors and check whether the namespace are related to each
> > > > > other. By doing it this way we can reliably report what the 
> > > > > relationship
> > > > > between two pid namespace file descriptors looks like.
> > > > >
> > > > > Additionally, this syscall has been extended to allow the retrieval of
> > > > > pidfds independent of procfs. These pidfds can e.g. be used with the 
> > > > > new
> > > > > pidfd_send_signal() syscall we recently merged. The ability to 
> > > > > retrieve
> > > > > pidfds independent of procfs had already been requested in the
> > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by 
> > > > > Alexey
> > > > > [5]. A use-case where a kernel is compiled without procfs but where
> > > > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > > > anon-inode based file descriptors are used that stash a reference to
> > > > > struct pid in file->private_data and drop that reference on close.
> > > > >
> > > > > With this translate_pid() has three closely related but still distinct
> > > > > functionalities. To clarify the semantics and to make it easier for
> > > > > userspace to use the syscall it has:
> > > > > - gained a command argument and three commands clearly reflecting the
> > > > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > > > >   PIDCMD_GET_PIDFD).
> > > > > - been renamed to pidctl()
> > > >
> > > [snip]
> > > > Also, I'm still confused about how metadata access is supposed to work
> > > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > > > You snipped out a portion of a previous email in which I asked about
> > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > > > place, we have two different kinds of file descriptors for processes,
> > > > one derived from procfs and one that's independent. The former works
> > > > with openat(2). The latter does not. To be very specific; if I'm
> > > > writing a function that accepts a pidfd and I get a pidfd that comes
> > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > > > smaps or oom_score_adj or statm for the named process in a race-free
> > > > manner?
> > >
> > > This is true, that such usecase will not be supportable.  But the 
> > > advantage
> > > on the other hand, is that suchs "pidfd" can be made pollable or readable 
> > > in
> > > the future. Potentially allowing us to return exit status without a new
> > > syscall (?). And we can add IOCTLs to the pidfd descriptor which we 
> > > cannot do
> > > with proc.
> > >
> > > But.. one thing we could do for Daniel usecase is if a /proc/pid 
> > > directory fd
> > > can be translated into a "pidfd" using another syscall or even a node, 
> > > like
> > > /proc/pid/handle or something. I think this is what Christian suggested in
> > > the previous threads.
> >
> > Andy - and Jann who I just talked to - have proposed solutions for this.
> > Jann's idea is similar to what you suggested, Joel. You could e.g. do an
> > ioctl() handler for /proc that would give you a dirfd back for a given
> 

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Jann Horn
On Mon, Mar 25, 2019 at 9:40 PM Jonathan Kowalski  wrote:
> On Mon, Mar 25, 2019 at 8:34 PM Jann Horn  wrote:
> >
> >  [...SNIP...]
> >
> > Please don't do that. /proc/$pid/fd refers to the set of file
> > descriptors the process has open, and semantically doesn't have much
> > to do with the identity of the process. If you want to have a procfs
> > directory entry for getting a pidfd, please add a new entry. (Although
> > I don't see the point in adding a new procfs entry for this when you
> > could instead have an ioctl or syscall operating on the procfs
> > directory fd.)
>
> There is no new entry. What I was saying (and I should have been
> clearer) is that the existing entry for the fd when open'd with
> O_DIRECTORY makes the kernel resolve the symlink to /proc/ of the
> process it maps to, so it would become:
>
>   int dirfd = open("/proc/self/fd/3", O_DIRECTORY|O_CLOEXEC);

That still seems really weird. This magically overloads O_DIRECTORY,
which means "fail if the thing is not a directory", to suddenly have
an entirely different meaning for one magical special type of file. On
top of that, unlike an ioctl or a new syscall, it doesn't convey
explicit intent and increases the risk of confused deputy issues.

> This also means you cannot cross the filesystem boundry, the said
> process needs to have a visible entry (which would mean hidepid= and
> gid= based access controls are honored), and you can only open the
> dirfd of a process in the current ns (as the PID will not map to an
> existent process if the pidfd maps to a process not in the same or
> children pid ns, in fdinfo it lists -1 in the pid field (we might not
> even need fdinfo anymore)).

AFAICS that doesn't have anything to do with whether you do this as a
syscall, as an ioctl, or as a jumped symlink. The kernel would have to
do the same security checks in any of those cases - only a classic,
non-jumped symlink would implicitly go through the existing permission
checks. And if you implement this with a non-jumped symlink, you get
races.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 2:11 PM Joel Fernandes  wrote:
>
> On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote:
> > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner 
> > > >  wrote:
> > > > > The pidctl() syscalls builds on, extends, and improves 
> > > > > translate_pid() [4].
> > > > > I quote Konstantins original patchset first that has already been 
> > > > > acked and
> > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > syscall. Multiple people have asked when this patchset will be sent in
> > > > > for merging (cf. [1], [2]). It has recently been revived by 
> > > > > Nagarathnam
> > > > > Muthusamy from Oracle [3].
> > > > >
> > > > > The intention of the original translate_pid() syscall was twofold:
> > > > > 1. Provide translation of pids between pid namespaces
> > > > > 2. Provide implicit pid namespace introspection
> > > > >
> > > > > Both functionalities are preserved. The latter task has been improved
> > > > > upon though. In the original version of the pachset passing pid as 1
> > > > > would allow to deterimine the relationship between the pid namespaces.
> > > > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > > > would report false negatives. For example, if pid 1 inside of the 
> > > > > target
> > > > > pid namespace already died, it would report that the target pid
> > > > > namespace cannot be reached from the source pid namespace because it
> > > > > couldn't find the pid inside of the target pid namespace and thus
> > > > > falsely report to the user that the two pid namespaces are not 
> > > > > related.
> > > > > This problem is simple to avoid. In the new version we simply walk the
> > > > > list of ancestors and check whether the namespace are related to each
> > > > > other. By doing it this way we can reliably report what the 
> > > > > relationship
> > > > > between two pid namespace file descriptors looks like.
> > > > >
> > > > > Additionally, this syscall has been extended to allow the retrieval of
> > > > > pidfds independent of procfs. These pidfds can e.g. be used with the 
> > > > > new
> > > > > pidfd_send_signal() syscall we recently merged. The ability to 
> > > > > retrieve
> > > > > pidfds independent of procfs had already been requested in the
> > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by 
> > > > > Alexey
> > > > > [5]. A use-case where a kernel is compiled without procfs but where
> > > > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > > > anon-inode based file descriptors are used that stash a reference to
> > > > > struct pid in file->private_data and drop that reference on close.
> > > > >
> > > > > With this translate_pid() has three closely related but still distinct
> > > > > functionalities. To clarify the semantics and to make it easier for
> > > > > userspace to use the syscall it has:
> > > > > - gained a command argument and three commands clearly reflecting the
> > > > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > > > >   PIDCMD_GET_PIDFD).
> > > > > - been renamed to pidctl()
> > > >
> > > [snip]
> > > > Also, I'm still confused about how metadata access is supposed to work
> > > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > > > You snipped out a portion of a previous email in which I asked about
> > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > > > place, we have two different kinds of file descriptors for processes,
> > > > one derived from procfs and one that's independent. The former works
> > > > with openat(2). The latter does not. To be very specific; if I'm
> > > > writing a function that accepts a pidfd and I get a pidfd that comes
> > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > > > smaps or oom_score_adj or statm for the named process in a race-free
> > > > manner?
> > >
> > > This is true, that such usecase will not be supportable.  But the 
> > > advantage
> > > on the other hand, is that suchs "pidfd" can be made pollable or readable 
> > > in
> > > the future. Potentially allowing us to return exit status without a new
> > > syscall (?). And we can add IOCTLs to the pidfd descriptor which we 
> > > cannot do
> > > with proc.
> > >
> > > But.. one thing we could do for Daniel usecase is if a /proc/pid 
> > > directory fd
> > > can be translated into a "pidfd" using another syscall or even a node, 
> > > like
> > > /proc/pid/handle or something. I think this is what Christian suggested in
> > > the previous threads.
> >
> > Andy - and Jann who I just talked to - have proposed solutions for this.
> > Jann's idea is similar to what you suggested, Joel. You could e.g. do an
> > ioctl() handler for /proc that would give you a dirfd back for a given

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Jonathan Kowalski
Also,  extending this further, instead of new ioctl flags over to
translate a tidfd one might introduce later for thread targetted
signals (which would still be a pidfd in the struct pid terms, but
with a bit set in its reference to target the selected TID in
particular), you could resolve this neatly to the proc entry of the
task itself, which would be subject to restrictions similar to a
regular open call, minus all the races involved.

This also means you can get rid of having to support the /proc/
dir fd in pidfd_send_signal, because there is no incentive to, any
longer. The kernel now has just one pidfd object, well scoped in its
purpose, and this "feature" is tied to procfs itself, disabling which
takes away the feature as well.

Otherwise, the ioctl will be conditionally available and/or work only
when procfs is present, and you'd tie procfs to pidfds eternally as
ABI.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Joel Fernandes
On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote:
> On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner  
> > > wrote:
> > > > The pidctl() syscalls builds on, extends, and improves translate_pid() 
> > > > [4].
> > > > I quote Konstantins original patchset first that has already been acked 
> > > > and
> > > > picked up by Eric before and whose functionality is preserved in this
> > > > syscall. Multiple people have asked when this patchset will be sent in
> > > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > > > Muthusamy from Oracle [3].
> > > >
> > > > The intention of the original translate_pid() syscall was twofold:
> > > > 1. Provide translation of pids between pid namespaces
> > > > 2. Provide implicit pid namespace introspection
> > > >
> > > > Both functionalities are preserved. The latter task has been improved
> > > > upon though. In the original version of the pachset passing pid as 1
> > > > would allow to deterimine the relationship between the pid namespaces.
> > > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > > would report false negatives. For example, if pid 1 inside of the target
> > > > pid namespace already died, it would report that the target pid
> > > > namespace cannot be reached from the source pid namespace because it
> > > > couldn't find the pid inside of the target pid namespace and thus
> > > > falsely report to the user that the two pid namespaces are not related.
> > > > This problem is simple to avoid. In the new version we simply walk the
> > > > list of ancestors and check whether the namespace are related to each
> > > > other. By doing it this way we can reliably report what the relationship
> > > > between two pid namespace file descriptors looks like.
> > > >
> > > > Additionally, this syscall has been extended to allow the retrieval of
> > > > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > > > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > > > pidfds independent of procfs had already been requested in the
> > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > > > [5]. A use-case where a kernel is compiled without procfs but where
> > > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > > anon-inode based file descriptors are used that stash a reference to
> > > > struct pid in file->private_data and drop that reference on close.
> > > >
> > > > With this translate_pid() has three closely related but still distinct
> > > > functionalities. To clarify the semantics and to make it easier for
> > > > userspace to use the syscall it has:
> > > > - gained a command argument and three commands clearly reflecting the
> > > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > > >   PIDCMD_GET_PIDFD).
> > > > - been renamed to pidctl()
> > > 
> > [snip]
> > > Also, I'm still confused about how metadata access is supposed to work
> > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > > You snipped out a portion of a previous email in which I asked about
> > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > > place, we have two different kinds of file descriptors for processes,
> > > one derived from procfs and one that's independent. The former works
> > > with openat(2). The latter does not. To be very specific; if I'm
> > > writing a function that accepts a pidfd and I get a pidfd that comes
> > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > > smaps or oom_score_adj or statm for the named process in a race-free
> > > manner?
> > 
> > This is true, that such usecase will not be supportable.  But the advantage
> > on the other hand, is that suchs "pidfd" can be made pollable or readable in
> > the future. Potentially allowing us to return exit status without a new
> > syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot 
> > do
> > with proc.
> > 
> > But.. one thing we could do for Daniel usecase is if a /proc/pid directory 
> > fd
> > can be translated into a "pidfd" using another syscall or even a node, like
> > /proc/pid/handle or something. I think this is what Christian suggested in
> > the previous threads.
> 
> Andy - and Jann who I just talked to - have proposed solutions for this.
> Jann's idea is similar to what you suggested, Joel. You could e.g. do an
> ioctl() handler for /proc that would give you a dirfd back for a given
> pidfd. The advantage is that pidfd_clone() can then give back pidfds
> without having to care in what procfs the process is supposed to live.
> That makes things a lot easier. But pidfds for the general case should
> be anon inodes. It's clean, it's simple and it is way more secure.

That makes sense to 

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Jonathan Kowalski
On Mon, Mar 25, 2019 at 8:34 PM Jann Horn  wrote:
>
>  [...SNIP...]
>
> Please don't do that. /proc/$pid/fd refers to the set of file
> descriptors the process has open, and semantically doesn't have much
> to do with the identity of the process. If you want to have a procfs
> directory entry for getting a pidfd, please add a new entry. (Although
> I don't see the point in adding a new procfs entry for this when you
> could instead have an ioctl or syscall operating on the procfs
> directory fd.)

There is no new entry. What I was saying (and I should have been
clearer) is that the existing entry for the fd when open'd with
O_DIRECTORY makes the kernel resolve the symlink to /proc/ of the
process it maps to, so it would become:

  int dirfd = open("/proc/self/fd/3", O_DIRECTORY|O_CLOEXEC);

This also means you cannot cross the filesystem boundry, the said
process needs to have a visible entry (which would mean hidepid= and
gid= based access controls are honored), and you can only open the
dirfd of a process in the current ns (as the PID will not map to an
existent process if the pidfd maps to a process not in the same or
children pid ns, in fdinfo it lists -1 in the pid field (we might not
even need fdinfo anymore)).


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Christian Brauner
On Mon, Mar 25, 2019 at 09:34:00PM +0100, Jann Horn wrote:
> On Mon, Mar 25, 2019 at 9:15 PM Daniel Colascione  wrote:
> > On Mon, Mar 25, 2019 at 12:42 PM Jonathan Kowalski  
> > wrote:
> > > On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione  
> > > wrote:
> [...]
> > > Yes, but everything in /proc is not equivalent to an attribute, or an
> > > option, and depending on its configuration, you may not want to allow
> > > processes to even be able to see /proc for any PIDs other than those
> > > running as their own user (hidepid). This means, even if this new
> > > system call is added, to respect hidepid, it must, depending on if
> > > /proc is mounted (and what hidepid is set to, and what gid= is set
> > > to), return EPERM, because then there is a discrepancy between how the
> > > two entrypoints to acquire a process handle do access control.
> >
> > That's why I proposed that this translation mechanism accept a procfs
> > root directory --- so you'd specify *which* procfs you want and let
> > the kernel apply whatever hidepid access restrictions it wants.
> [...]
> > > > and 2) it's
> > > > "fail unsafe": IMHO, most users in practice will skip the line marked
> > > > "LIVENESS CHECK", and as a result, their code will appear to work but
> > > > contain subtle race conditions. An explicit interface to translate
> > > > from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file
> > > > descriptor would be both more efficient and fail-safe.
> > > >
> > > > [1] as a separate matter, it'd be nice to have a batch version of 
> > > > close(2).
> > >
> > > Since /proc is full of gunk,
> >
> > People keep saying /proc is bad, but I haven't seen any serious
> > proposals for a clean replacement. :-)
> >
> > > how about adding more to it and making
> > > the magic symlink of /proc/self/fd for the pidfd to lead to the dirfd
> > > of the /proc entry of the process it maps to, when one uses
> > > O_DIRECTORY while opening it? Otherwise, it behaves as it does today.
> > > It would be equivalent to opening the proc entry with usual access
> > > restrictions (and hidepid made to work) but without the races, and
> > > because for processes outside your and children pid ns, it shouldn't
> > > work anyway, and since they wouldn't have their entry on this procfs
> > > instance, it would all just fit in nicely?
> >
> > Thanks. That'll work. It's a bit magical, but /proc/self/fd is magical
> > anyway, so that's okay.
> 
> Please don't do that. /proc/$pid/fd refers to the set of file
> descriptors the process has open, and semantically doesn't have much
> to do with the identity of the process. If you want to have a procfs
> directory entry for getting a pidfd, please add a new entry. (Although
> I don't see the point in adding a new procfs entry for this when you
> could instead have an ioctl or syscall operating on the procfs
> directory fd.)

Very much agreed!


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Jann Horn
On Mon, Mar 25, 2019 at 9:15 PM Daniel Colascione  wrote:
> On Mon, Mar 25, 2019 at 12:42 PM Jonathan Kowalski  
> wrote:
> > On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione  wrote:
[...]
> > Yes, but everything in /proc is not equivalent to an attribute, or an
> > option, and depending on its configuration, you may not want to allow
> > processes to even be able to see /proc for any PIDs other than those
> > running as their own user (hidepid). This means, even if this new
> > system call is added, to respect hidepid, it must, depending on if
> > /proc is mounted (and what hidepid is set to, and what gid= is set
> > to), return EPERM, because then there is a discrepancy between how the
> > two entrypoints to acquire a process handle do access control.
>
> That's why I proposed that this translation mechanism accept a procfs
> root directory --- so you'd specify *which* procfs you want and let
> the kernel apply whatever hidepid access restrictions it wants.
[...]
> > > and 2) it's
> > > "fail unsafe": IMHO, most users in practice will skip the line marked
> > > "LIVENESS CHECK", and as a result, their code will appear to work but
> > > contain subtle race conditions. An explicit interface to translate
> > > from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file
> > > descriptor would be both more efficient and fail-safe.
> > >
> > > [1] as a separate matter, it'd be nice to have a batch version of 
> > > close(2).
> >
> > Since /proc is full of gunk,
>
> People keep saying /proc is bad, but I haven't seen any serious
> proposals for a clean replacement. :-)
>
> > how about adding more to it and making
> > the magic symlink of /proc/self/fd for the pidfd to lead to the dirfd
> > of the /proc entry of the process it maps to, when one uses
> > O_DIRECTORY while opening it? Otherwise, it behaves as it does today.
> > It would be equivalent to opening the proc entry with usual access
> > restrictions (and hidepid made to work) but without the races, and
> > because for processes outside your and children pid ns, it shouldn't
> > work anyway, and since they wouldn't have their entry on this procfs
> > instance, it would all just fit in nicely?
>
> Thanks. That'll work. It's a bit magical, but /proc/self/fd is magical
> anyway, so that's okay.

Please don't do that. /proc/$pid/fd refers to the set of file
descriptors the process has open, and semantically doesn't have much
to do with the identity of the process. If you want to have a procfs
directory entry for getting a pidfd, please add a new entry. (Although
I don't see the point in adding a new procfs entry for this when you
could instead have an ioctl or syscall operating on the procfs
directory fd.)


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 12:42 PM Jonathan Kowalski  wrote:
>
> On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione  wrote:
> >
> > On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski  
> > wrote:
> > >
> > > On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione  
> > > wrote:
> > > >
> > > > [..snip..]
> > > >
> > > > I don't like the idea of having one kind of pollfd be pollable and
> > > > another not. Such an interface would be confusing for users. If, as
> > > > you suggest below, we instead make the procfs-less FD the only thing
> > > > we call a "pidfd", then this proposal becomes less confusing and more
> > > > viable.
> > >
> > > That's certainly on the table, we remove the ability to open
> > > /proc/ and use the dir fd with pidfd_send_signal. I'm in favor of
> > > this.
> > >
> > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid 
> > > > > directory fd
> > > > > can be translated into a "pidfd" using another syscall or even a 
> > > > > node, like
> > > > > /proc/pid/handle or something. I think this is what Christian 
> > > > > suggested in
> > > > > the previous threads.
> > > >
> > > > /proc/pid/handle, if I understand correctly, "translates" a
> > > > procfs-based pidfd to a non-pidfd one. The needed interface would have
> > > > to perform the opposite translation, providing a procfs directory for
> > > > a given pidfd.
> > > >
> > > > > And also for the translation the other way, add a syscall or modify
> > > > > translate_fd or something, to covert a anon_inode pidfd into a 
> > > > > /proc/pid
> > > > > directory fd. Then the user is welcomed to do openat(2) on _that_ 
> > > > > directory fd.
> > > > > Then we modify pidfd_send_signal to only send signals to pure pidfd 
> > > > > fds, not
> > > > > to /proc/pid directory fds.
> > > >
> > > > This approach would work, but there's one subtlety to take into
> > > > account: which procfs? You'd want to take, as inputs, 1) the procfs
> > > > root you want, and 2) the pidfd, this way you could return to the user
> > > > a directory FD in the right procfs.
> > > >
> > >
> > > I don't think metadata access is in the scope of such a pidfd itself.
> >
> > It should be possible to write a race-free pkill(1) using pidfds.
> > Without metadata access tied to the pidfd somehow, that can't be done.
> >
> >
> > > > > Should we work on patches for these? Please let us know if this idea 
> > > > > makes
> > > > > sense and thanks a lot for adding us to the review as well.
> > > >
> > > > I would strongly prefer that we not merge pidctl (or whatever it is)
> > > > without a story for metadata access, be it your suggestion or
> > > > something else.
> > >
> > > IMO, this is out of scope for a process handle. Hence, the need for
> > > metadata access musn't stall it as is.
> >
> > On the contrary, rather than metadata being out of scope, metadata
> > access is essential. Given a file handle (an FD), you can learn about
> > the file backing that handle by using fstat(2). Given a socket handle
> > (a socket FD), you can learn about the socket with getsockopt(2) and
> > ioctl(2). It would be strange not to be able, similarly, to learn
> > things about a process given a handle (a pidfd) to that process. I
> > want process handles to be "boring" in that they let users query and
> > manipulate processes mostly like they manipulate other resources.
> >
>
> Yes, but everything in /proc is not equivalent to an attribute, or an
> option, and depending on its configuration, you may not want to allow
> processes to even be able to see /proc for any PIDs other than those
> running as their own user (hidepid). This means, even if this new
> system call is added, to respect hidepid, it must, depending on if
> /proc is mounted (and what hidepid is set to, and what gid= is set
> to), return EPERM, because then there is a discrepancy between how the
> two entrypoints to acquire a process handle do access control.

That's why I proposed that this translation mechanism accept a procfs
root directory --- so you'd specify *which* procfs you want and let
the kernel apply whatever hidepid access restrictions it wants.

[snip]

> PIDs however are addressable with kill(2) even with hidepid enabled.
> For good reason, the capabilities you can exercise using a PID is
> limited in that case. The same logic applies to a process reference
> like pidfd.

Sure. I'm not proposing a mechanism that would grant anyone additional
access to anything --- I'm just suggesting that we provide a way to
"directly" open a procfs dirfd instead of having people parse an
fdinfo text file.

> I agree there should be some way (if you can take care of the hidepid
> gotcha) to return a dir fd of /proc entry, but I don't think it should
> be the pidfd itself.

It's fine that pidfds aren't procfs directory FDs so long as there's a
race-free way to go from the former to the latter.

> You can get it to work using the fdinfo thing for
> now.
>
> > > If you really need this, the pid this pidfd is 

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Christian Brauner
On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner  
> > wrote:
> > > The pidctl() syscalls builds on, extends, and improves translate_pid() 
> > > [4].
> > > I quote Konstantins original patchset first that has already been acked 
> > > and
> > > picked up by Eric before and whose functionality is preserved in this
> > > syscall. Multiple people have asked when this patchset will be sent in
> > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > > Muthusamy from Oracle [3].
> > >
> > > The intention of the original translate_pid() syscall was twofold:
> > > 1. Provide translation of pids between pid namespaces
> > > 2. Provide implicit pid namespace introspection
> > >
> > > Both functionalities are preserved. The latter task has been improved
> > > upon though. In the original version of the pachset passing pid as 1
> > > would allow to deterimine the relationship between the pid namespaces.
> > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > would report false negatives. For example, if pid 1 inside of the target
> > > pid namespace already died, it would report that the target pid
> > > namespace cannot be reached from the source pid namespace because it
> > > couldn't find the pid inside of the target pid namespace and thus
> > > falsely report to the user that the two pid namespaces are not related.
> > > This problem is simple to avoid. In the new version we simply walk the
> > > list of ancestors and check whether the namespace are related to each
> > > other. By doing it this way we can reliably report what the relationship
> > > between two pid namespace file descriptors looks like.
> > >
> > > Additionally, this syscall has been extended to allow the retrieval of
> > > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > > pidfds independent of procfs had already been requested in the
> > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > > [5]. A use-case where a kernel is compiled without procfs but where
> > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > anon-inode based file descriptors are used that stash a reference to
> > > struct pid in file->private_data and drop that reference on close.
> > >
> > > With this translate_pid() has three closely related but still distinct
> > > functionalities. To clarify the semantics and to make it easier for
> > > userspace to use the syscall it has:
> > > - gained a command argument and three commands clearly reflecting the
> > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > >   PIDCMD_GET_PIDFD).
> > > - been renamed to pidctl()
> > 
> [snip]
> > Also, I'm still confused about how metadata access is supposed to work
> > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > You snipped out a portion of a previous email in which I asked about
> > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > place, we have two different kinds of file descriptors for processes,
> > one derived from procfs and one that's independent. The former works
> > with openat(2). The latter does not. To be very specific; if I'm
> > writing a function that accepts a pidfd and I get a pidfd that comes
> > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > smaps or oom_score_adj or statm for the named process in a race-free
> > manner?
> 
> This is true, that such usecase will not be supportable.  But the advantage
> on the other hand, is that suchs "pidfd" can be made pollable or readable in
> the future. Potentially allowing us to return exit status without a new
> syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
> with proc.
> 
> But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> can be translated into a "pidfd" using another syscall or even a node, like
> /proc/pid/handle or something. I think this is what Christian suggested in
> the previous threads.

Andy - and Jann who I just talked to - have proposed solutions for this.
Jann's idea is similar to what you suggested, Joel. You could e.g. do an
ioctl() handler for /proc that would give you a dirfd back for a given
pidfd. The advantage is that pidfd_clone() can then give back pidfds
without having to care in what procfs the process is supposed to live.
That makes things a lot easier. But pidfds for the general case should
be anon inodes. It's clean, it's simple and it is way more secure.

> 
> And also for the translation the other way, add a syscall or modify
> translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> directory fd. Then the user is welcomed to do openat(2) on _that_ directory 
> fd.
> Then we modify 

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Jonathan Kowalski
On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione  wrote:
>
> On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski  
> wrote:
> >
> > On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione  wrote:
> > >
> > > [..snip..]
> > >
> > > I don't like the idea of having one kind of pollfd be pollable and
> > > another not. Such an interface would be confusing for users. If, as
> > > you suggest below, we instead make the procfs-less FD the only thing
> > > we call a "pidfd", then this proposal becomes less confusing and more
> > > viable.
> >
> > That's certainly on the table, we remove the ability to open
> > /proc/ and use the dir fd with pidfd_send_signal. I'm in favor of
> > this.
> >
> > > > But.. one thing we could do for Daniel usecase is if a /proc/pid 
> > > > directory fd
> > > > can be translated into a "pidfd" using another syscall or even a node, 
> > > > like
> > > > /proc/pid/handle or something. I think this is what Christian suggested 
> > > > in
> > > > the previous threads.
> > >
> > > /proc/pid/handle, if I understand correctly, "translates" a
> > > procfs-based pidfd to a non-pidfd one. The needed interface would have
> > > to perform the opposite translation, providing a procfs directory for
> > > a given pidfd.
> > >
> > > > And also for the translation the other way, add a syscall or modify
> > > > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> > > > directory fd. Then the user is welcomed to do openat(2) on _that_ 
> > > > directory fd.
> > > > Then we modify pidfd_send_signal to only send signals to pure pidfd 
> > > > fds, not
> > > > to /proc/pid directory fds.
> > >
> > > This approach would work, but there's one subtlety to take into
> > > account: which procfs? You'd want to take, as inputs, 1) the procfs
> > > root you want, and 2) the pidfd, this way you could return to the user
> > > a directory FD in the right procfs.
> > >
> >
> > I don't think metadata access is in the scope of such a pidfd itself.
>
> It should be possible to write a race-free pkill(1) using pidfds.
> Without metadata access tied to the pidfd somehow, that can't be done.
>
>
> > > > Should we work on patches for these? Please let us know if this idea 
> > > > makes
> > > > sense and thanks a lot for adding us to the review as well.
> > >
> > > I would strongly prefer that we not merge pidctl (or whatever it is)
> > > without a story for metadata access, be it your suggestion or
> > > something else.
> >
> > IMO, this is out of scope for a process handle. Hence, the need for
> > metadata access musn't stall it as is.
>
> On the contrary, rather than metadata being out of scope, metadata
> access is essential. Given a file handle (an FD), you can learn about
> the file backing that handle by using fstat(2). Given a socket handle
> (a socket FD), you can learn about the socket with getsockopt(2) and
> ioctl(2). It would be strange not to be able, similarly, to learn
> things about a process given a handle (a pidfd) to that process. I
> want process handles to be "boring" in that they let users query and
> manipulate processes mostly like they manipulate other resources.
>

Yes, but everything in /proc is not equivalent to an attribute, or an
option, and depending on its configuration, you may not want to allow
processes to even be able to see /proc for any PIDs other than those
running as their own user (hidepid). This means, even if this new
system call is added, to respect hidepid, it must, depending on if
/proc is mounted (and what hidepid is set to, and what gid= is set
to), return EPERM, because then there is a discrepancy between how the
two entrypoints to acquire a process handle do access control. This is
ugly, but ideally should work the way it is described above.
Otherwise, things would be able to bypass the restrictions made
therein, because we also want a system call.

PIDs however are addressable with kill(2) even with hidepid enabled.
For good reason, the capabilities you can exercise using a PID is
limited in that case. The same logic applies to a process reference
like pidfd.

I agree there should be some way (if you can take care of the hidepid
gotcha) to return a dir fd of /proc entry, but I don't think it should
be the pidfd itself. You can get it to work using the fdinfo thing for
now.

> > If you really need this, the pid this pidfd is mapped to can be
> > exposed through fdinfo (which is perfectly fine for your case as you
> > use /proc anyway). This means you can reliably determine if you're
> > opening it for the same pid (and it hasn't been recycled) because this
> > pidfd will be pollable for state change (in the future). Exposing it
> > through fdinfo isn't a problem, pid ns is bound to the process during
> > its lifetime, it's a process creation time property, so the value
> > isn't going to change anyway. So you can do all metadata access you
> > want subsequently.
>
> Thanks for the proposal. I'm a bit confused. Are you suggesting that
> we report the numeric 

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski  wrote:
>
> On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione  wrote:
> >
> > [..snip..]
> >
> > I don't like the idea of having one kind of pollfd be pollable and
> > another not. Such an interface would be confusing for users. If, as
> > you suggest below, we instead make the procfs-less FD the only thing
> > we call a "pidfd", then this proposal becomes less confusing and more
> > viable.
>
> That's certainly on the table, we remove the ability to open
> /proc/ and use the dir fd with pidfd_send_signal. I'm in favor of
> this.
>
> > > But.. one thing we could do for Daniel usecase is if a /proc/pid 
> > > directory fd
> > > can be translated into a "pidfd" using another syscall or even a node, 
> > > like
> > > /proc/pid/handle or something. I think this is what Christian suggested in
> > > the previous threads.
> >
> > /proc/pid/handle, if I understand correctly, "translates" a
> > procfs-based pidfd to a non-pidfd one. The needed interface would have
> > to perform the opposite translation, providing a procfs directory for
> > a given pidfd.
> >
> > > And also for the translation the other way, add a syscall or modify
> > > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> > > directory fd. Then the user is welcomed to do openat(2) on _that_ 
> > > directory fd.
> > > Then we modify pidfd_send_signal to only send signals to pure pidfd fds, 
> > > not
> > > to /proc/pid directory fds.
> >
> > This approach would work, but there's one subtlety to take into
> > account: which procfs? You'd want to take, as inputs, 1) the procfs
> > root you want, and 2) the pidfd, this way you could return to the user
> > a directory FD in the right procfs.
> >
>
> I don't think metadata access is in the scope of such a pidfd itself.

It should be possible to write a race-free pkill(1) using pidfds.
Without metadata access tied to the pidfd somehow, that can't be done.

> > > Should we work on patches for these? Please let us know if this idea makes
> > > sense and thanks a lot for adding us to the review as well.
> >
> > I would strongly prefer that we not merge pidctl (or whatever it is)
> > without a story for metadata access, be it your suggestion or
> > something else.
>
> IMO, this is out of scope for a process handle. Hence, the need for
> metadata access musn't stall it as is.

On the contrary, rather than metadata being out of scope, metadata
access is essential. Given a file handle (an FD), you can learn about
the file backing that handle by using fstat(2). Given a socket handle
(a socket FD), you can learn about the socket with getsockopt(2) and
ioctl(2). It would be strange not to be able, similarly, to learn
things about a process given a handle (a pidfd) to that process. I
want process handles to be "boring" in that they let users query and
manipulate processes mostly like they manipulate other resources.

> If you really need this, the pid this pidfd is mapped to can be
> exposed through fdinfo (which is perfectly fine for your case as you
> use /proc anyway). This means you can reliably determine if you're
> opening it for the same pid (and it hasn't been recycled) because this
> pidfd will be pollable for state change (in the future). Exposing it
> through fdinfo isn't a problem, pid ns is bound to the process during
> its lifetime, it's a process creation time property, so the value
> isn't going to change anyway. So you can do all metadata access you
> want subsequently.

Thanks for the proposal. I'm a bit confused. Are you suggesting that
we report the numeric FD in fdinfo? Are you saying it should work
basically like this?

int get_oom_score_adj(int pidfd) {
  unique_fd fdinfo_fd = open(fmt("/proc/self/fdinfo/%d", pidfd));
  string fdinfo = read_all(fdinfo_fd);
  numeric_pid = parse_fdinfo_for_line("pid");
  unique_fd procdir_fd = open(fmt("/proc/%d", numeric_pid), O_DIRECTORY);
  if(!is_pidfd_alive(pidfd)) { return -1; /* process died */ } /*
LIVENESS CHECK */
  // We know the process named by pidfd was called NUMERIC_PID
  // in our namespace (because fdinfo told us) and we know that the
named process
  // was alive after we successfully opened its /proc/pid directory, therefore,
  // PROCDIR_FD and PIDFD must refer to the same underlying process.
  unique_fd oom_adj_score_fd = openat(procdir_fd, "oom_score_adj");
  return parse_int(read_all(oom_adj_score_fd));
}

While this interface functions correctly, I have two concerns: 1) the
number of system calls necessary is very large -- by my count,
starting with the pifd, we need six, not counting the follow-on
close(2) calls (which would bring the total to nine[1]), and 2) it's
"fail unsafe": IMHO, most users in practice will skip the line marked
"LIVENESS CHECK", and as a result, their code will appear to work but
contain subtle race conditions. An explicit interface to translate
from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file
descriptor would be both more 

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Jonathan Kowalski
On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione  wrote:
>
> [..snip..]
>
> I don't like the idea of having one kind of pollfd be pollable and
> another not. Such an interface would be confusing for users. If, as
> you suggest below, we instead make the procfs-less FD the only thing
> we call a "pidfd", then this proposal becomes less confusing and more
> viable.

That's certainly on the table, we remove the ability to open
/proc/ and use the dir fd with pidfd_send_signal. I'm in favor of
this.

> > But.. one thing we could do for Daniel usecase is if a /proc/pid directory 
> > fd
> > can be translated into a "pidfd" using another syscall or even a node, like
> > /proc/pid/handle or something. I think this is what Christian suggested in
> > the previous threads.
>
> /proc/pid/handle, if I understand correctly, "translates" a
> procfs-based pidfd to a non-pidfd one. The needed interface would have
> to perform the opposite translation, providing a procfs directory for
> a given pidfd.
>
> > And also for the translation the other way, add a syscall or modify
> > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> > directory fd. Then the user is welcomed to do openat(2) on _that_ directory 
> > fd.
> > Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not
> > to /proc/pid directory fds.
>
> This approach would work, but there's one subtlety to take into
> account: which procfs? You'd want to take, as inputs, 1) the procfs
> root you want, and 2) the pidfd, this way you could return to the user
> a directory FD in the right procfs.
>

I don't think metadata access is in the scope of such a pidfd itself.

> > Should we work on patches for these? Please let us know if this idea makes
> > sense and thanks a lot for adding us to the review as well.
>
> I would strongly prefer that we not merge pidctl (or whatever it is)
> without a story for metadata access, be it your suggestion or
> something else.

IMO, this is out of scope for a process handle. Hence, the need for
metadata access musn't stall it as is.

If you really need this, the pid this pidfd is mapped to can be
exposed through fdinfo (which is perfectly fine for your case as you
use /proc anyway). This means you can reliably determine if you're
opening it for the same pid (and it hasn't been recycled) because this
pidfd will be pollable for state change (in the future). Exposing it
through fdinfo isn't a problem, pid ns is bound to the process during
its lifetime, it's a process creation time property, so the value
isn't going to change anyway. So you can do all metadata access you
want subsequently.

The upshot is that this same pidfd can then be returned from a future
extension of clone(2) Christian is planning to work on. It is still
undecided whether this fd should be only readable by the parent
process or everyone (which would just be a matter of changing the
access mode depending on the caller), but that we can discuss all that
when the patchset is posted.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 10:36 AM Joel Fernandes  wrote:
>
> On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner  
> > wrote:
> > > The pidctl() syscalls builds on, extends, and improves translate_pid() 
> > > [4].
> > > I quote Konstantins original patchset first that has already been acked 
> > > and
> > > picked up by Eric before and whose functionality is preserved in this
> > > syscall. Multiple people have asked when this patchset will be sent in
> > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > > Muthusamy from Oracle [3].
> > >
> > > The intention of the original translate_pid() syscall was twofold:
> > > 1. Provide translation of pids between pid namespaces
> > > 2. Provide implicit pid namespace introspection
> > >
> > > Both functionalities are preserved. The latter task has been improved
> > > upon though. In the original version of the pachset passing pid as 1
> > > would allow to deterimine the relationship between the pid namespaces.
> > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > would report false negatives. For example, if pid 1 inside of the target
> > > pid namespace already died, it would report that the target pid
> > > namespace cannot be reached from the source pid namespace because it
> > > couldn't find the pid inside of the target pid namespace and thus
> > > falsely report to the user that the two pid namespaces are not related.
> > > This problem is simple to avoid. In the new version we simply walk the
> > > list of ancestors and check whether the namespace are related to each
> > > other. By doing it this way we can reliably report what the relationship
> > > between two pid namespace file descriptors looks like.
> > >
> > > Additionally, this syscall has been extended to allow the retrieval of
> > > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > > pidfds independent of procfs had already been requested in the
> > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > > [5]. A use-case where a kernel is compiled without procfs but where
> > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > anon-inode based file descriptors are used that stash a reference to
> > > struct pid in file->private_data and drop that reference on close.
> > >
> > > With this translate_pid() has three closely related but still distinct
> > > functionalities. To clarify the semantics and to make it easier for
> > > userspace to use the syscall it has:
> > > - gained a command argument and three commands clearly reflecting the
> > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > >   PIDCMD_GET_PIDFD).
> > > - been renamed to pidctl()
> >
> [snip]
> > Also, I'm still confused about how metadata access is supposed to work
> > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > You snipped out a portion of a previous email in which I asked about
> > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > place, we have two different kinds of file descriptors for processes,
> > one derived from procfs and one that's independent. The former works
> > with openat(2). The latter does not. To be very specific; if I'm
> > writing a function that accepts a pidfd and I get a pidfd that comes
> > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > smaps or oom_score_adj or statm for the named process in a race-free
> > manner?
>
> This is true, that such usecase will not be supportable.  But the advantage
> on the other hand, is that suchs "pidfd" can be made pollable or readable in
> the future. Potentially allowing us to return exit status without a new
> syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
> with proc.

I don't like the idea of having one kind of pollfd be pollable and
another not. Such an interface would be confusing for users. If, as
you suggest below, we instead make the procfs-less FD the only thing
we call a "pidfd", then this proposal becomes less confusing and more
viable.

> But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> can be translated into a "pidfd" using another syscall or even a node, like
> /proc/pid/handle or something. I think this is what Christian suggested in
> the previous threads.

/proc/pid/handle, if I understand correctly, "translates" a
procfs-based pidfd to a non-pidfd one. The needed interface would have
to perform the opposite translation, providing a procfs directory for
a given pidfd.

> And also for the translation the other way, add a syscall or modify
> translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> directory fd. Then the user is welcomed to do openat(2) on _that_ directory 
> fd.
> Then we modify pidfd_send_signal 

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Joel Fernandes
On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner  
> wrote:
> > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > I quote Konstantins original patchset first that has already been acked and
> > picked up by Eric before and whose functionality is preserved in this
> > syscall. Multiple people have asked when this patchset will be sent in
> > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > Muthusamy from Oracle [3].
> >
> > The intention of the original translate_pid() syscall was twofold:
> > 1. Provide translation of pids between pid namespaces
> > 2. Provide implicit pid namespace introspection
> >
> > Both functionalities are preserved. The latter task has been improved
> > upon though. In the original version of the pachset passing pid as 1
> > would allow to deterimine the relationship between the pid namespaces.
> > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > would report false negatives. For example, if pid 1 inside of the target
> > pid namespace already died, it would report that the target pid
> > namespace cannot be reached from the source pid namespace because it
> > couldn't find the pid inside of the target pid namespace and thus
> > falsely report to the user that the two pid namespaces are not related.
> > This problem is simple to avoid. In the new version we simply walk the
> > list of ancestors and check whether the namespace are related to each
> > other. By doing it this way we can reliably report what the relationship
> > between two pid namespace file descriptors looks like.
> >
> > Additionally, this syscall has been extended to allow the retrieval of
> > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > pidfds independent of procfs had already been requested in the
> > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > [5]. A use-case where a kernel is compiled without procfs but where
> > pidfds are still useful has been outlined by Andy in [6]. Regular
> > anon-inode based file descriptors are used that stash a reference to
> > struct pid in file->private_data and drop that reference on close.
> >
> > With this translate_pid() has three closely related but still distinct
> > functionalities. To clarify the semantics and to make it easier for
> > userspace to use the syscall it has:
> > - gained a command argument and three commands clearly reflecting the
> >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> >   PIDCMD_GET_PIDFD).
> > - been renamed to pidctl()
> 
[snip]
> Also, I'm still confused about how metadata access is supposed to work
> for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> You snipped out a portion of a previous email in which I asked about
> your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> place, we have two different kinds of file descriptors for processes,
> one derived from procfs and one that's independent. The former works
> with openat(2). The latter does not. To be very specific; if I'm
> writing a function that accepts a pidfd and I get a pidfd that comes
> from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> smaps or oom_score_adj or statm for the named process in a race-free
> manner?

This is true, that such usecase will not be supportable.  But the advantage
on the other hand, is that suchs "pidfd" can be made pollable or readable in
the future. Potentially allowing us to return exit status without a new
syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
with proc.

But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
can be translated into a "pidfd" using another syscall or even a node, like
/proc/pid/handle or something. I think this is what Christian suggested in
the previous threads.

And also for the translation the other way, add a syscall or modify
translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd.
Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not
to /proc/pid directory fds.

Should we work on patches for these? Please let us know if this idea makes
sense and thanks a lot for adding us to the review as well.

Best,

 - Joel


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Konstantin Khlebnikov




On 25.03.2019 19:48, Daniel Colascione wrote:

On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner  wrote:

The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
I quote Konstantins original patchset first that has already been acked and
picked up by Eric before and whose functionality is preserved in this
syscall. Multiple people have asked when this patchset will be sent in
for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
Muthusamy from Oracle [3].

The intention of the original translate_pid() syscall was twofold:
1. Provide translation of pids between pid namespaces
2. Provide implicit pid namespace introspection

Both functionalities are preserved. The latter task has been improved
upon though. In the original version of the pachset passing pid as 1
would allow to deterimine the relationship between the pid namespaces.
This is inherhently racy. If pid 1 inside a pid namespace has died it
would report false negatives. For example, if pid 1 inside of the target
pid namespace already died, it would report that the target pid
namespace cannot be reached from the source pid namespace because it
couldn't find the pid inside of the target pid namespace and thus
falsely report to the user that the two pid namespaces are not related.
This problem is simple to avoid. In the new version we simply walk the
list of ancestors and check whether the namespace are related to each
other. By doing it this way we can reliably report what the relationship
between two pid namespace file descriptors looks like.

Additionally, this syscall has been extended to allow the retrieval of
pidfds independent of procfs. These pidfds can e.g. be used with the new
pidfd_send_signal() syscall we recently merged. The ability to retrieve
pidfds independent of procfs had already been requested in the
pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
[5]. A use-case where a kernel is compiled without procfs but where
pidfds are still useful has been outlined by Andy in [6]. Regular
anon-inode based file descriptors are used that stash a reference to
struct pid in file->private_data and drop that reference on close.

With this translate_pid() has three closely related but still distinct
functionalities. To clarify the semantics and to make it easier for
userspace to use the syscall it has:
- gained a command argument and three commands clearly reflecting the
   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
   PIDCMD_GET_PIDFD).
- been renamed to pidctl()


Having made these changes, you've built a general-purpose command
command multiplexer, not one operation that happens to be flexible.
The general-purpose command multiplexer is a common antipattern:
multiplexers make it hard to talk about different kernel-provided
operations using the common vocabulary we use to distinguish
kernel-related operations, the system call number. socketcall, for
example, turned out to be cumbersome for users like SELinux policy
writers. People had to do work work later to split socketcall into
fine-grained system calls. Please split the pidctl system call so that
the design is clean from the start and we avoid work later. System
calls are cheap.

Also, I'm still confused about how metadata access is supposed to work
for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
You snipped out a portion of a previous email in which I asked about
your thoughts on this question. With the PIDCMD_GET_PIDFD command in
place, we have two different kinds of file descriptors for processes,
one derived from procfs and one that's independent. The former works
with openat(2). The latter does not. To be very specific; if I'm
writing a function that accepts a pidfd and I get a pidfd that comes
from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
smaps or oom_score_adj or statm for the named process in a race-free
manner?



Task metadata could be exposed via "pages" identified by offset:

struct pidfd_stats stats;

pread(pidfd, , sizeof(stats), PIDFD_STATS_OFFSET);

I'm not sure that we need yet another binary procfs.
But it will be faster than current text-based for sure.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 10:05 AM Konstantin Khlebnikov
 wrote:
> On 25.03.2019 19:48, Daniel Colascione wrote:
> > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner  
> > wrote:
> >> The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> >> I quote Konstantins original patchset first that has already been acked and
> >> picked up by Eric before and whose functionality is preserved in this
> >> syscall. Multiple people have asked when this patchset will be sent in
> >> for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> >> Muthusamy from Oracle [3].
> >>
> >> The intention of the original translate_pid() syscall was twofold:
> >> 1. Provide translation of pids between pid namespaces
> >> 2. Provide implicit pid namespace introspection
> >>
> >> Both functionalities are preserved. The latter task has been improved
> >> upon though. In the original version of the pachset passing pid as 1
> >> would allow to deterimine the relationship between the pid namespaces.
> >> This is inherhently racy. If pid 1 inside a pid namespace has died it
> >> would report false negatives. For example, if pid 1 inside of the target
> >> pid namespace already died, it would report that the target pid
> >> namespace cannot be reached from the source pid namespace because it
> >> couldn't find the pid inside of the target pid namespace and thus
> >> falsely report to the user that the two pid namespaces are not related.
> >> This problem is simple to avoid. In the new version we simply walk the
> >> list of ancestors and check whether the namespace are related to each
> >> other. By doing it this way we can reliably report what the relationship
> >> between two pid namespace file descriptors looks like.
> >>
> >> Additionally, this syscall has been extended to allow the retrieval of
> >> pidfds independent of procfs. These pidfds can e.g. be used with the new
> >> pidfd_send_signal() syscall we recently merged. The ability to retrieve
> >> pidfds independent of procfs had already been requested in the
> >> pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> >> [5]. A use-case where a kernel is compiled without procfs but where
> >> pidfds are still useful has been outlined by Andy in [6]. Regular
> >> anon-inode based file descriptors are used that stash a reference to
> >> struct pid in file->private_data and drop that reference on close.
> >>
> >> With this translate_pid() has three closely related but still distinct
> >> functionalities. To clarify the semantics and to make it easier for
> >> userspace to use the syscall it has:
> >> - gained a command argument and three commands clearly reflecting the
> >>distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> >>PIDCMD_GET_PIDFD).
> >> - been renamed to pidctl()
> >
> > Having made these changes, you've built a general-purpose command
> > command multiplexer, not one operation that happens to be flexible.
> > The general-purpose command multiplexer is a common antipattern:
> > multiplexers make it hard to talk about different kernel-provided
> > operations using the common vocabulary we use to distinguish
> > kernel-related operations, the system call number. socketcall, for
> > example, turned out to be cumbersome for users like SELinux policy
> > writers. People had to do work work later to split socketcall into
> > fine-grained system calls. Please split the pidctl system call so that
> > the design is clean from the start and we avoid work later. System
> > calls are cheap.
> >
> > Also, I'm still confused about how metadata access is supposed to work
> > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > You snipped out a portion of a previous email in which I asked about
> > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > place, we have two different kinds of file descriptors for processes,
> > one derived from procfs and one that's independent. The former works
> > with openat(2). The latter does not. To be very specific; if I'm
> > writing a function that accepts a pidfd and I get a pidfd that comes
> > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > smaps or oom_score_adj or statm for the named process in a race-free
> > manner?
> >
>
> Task metadata could be exposed via "pages" identified by offset:
>
> struct pidfd_stats stats;
>
> pread(pidfd, , sizeof(stats), PIDFD_STATS_OFFSET);
>
> I'm not sure that we need yet another binary procfs.
> But it will be faster than current text-based for sure.

There are many options. I have some detailed thoughts on several of
them, but long design emails seem rather unwelcome. Right now, I'd
like to hear how Christian intends his patch set to address this use
case.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 9:56 AM David Howells  wrote:
> Daniel Colascione  wrote:
>
> > System calls are cheap.
>
> Only to a point.  x86_64 will have an issue when we hit syscall 512.  We're
> currently at 427.

IIRC, a while ago, someone proposed restarting system call numbering
above (again, IIRC) 1024 for both 32- and 64-bit varieties to
establish a clean slate and sidestep this problem.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread David Howells
Daniel Colascione  wrote:

> System calls are cheap.

Only to a point.  x86_64 will have an issue when we hit syscall 512.  We're
currently at 427.

David


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner  wrote:
> The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> I quote Konstantins original patchset first that has already been acked and
> picked up by Eric before and whose functionality is preserved in this
> syscall. Multiple people have asked when this patchset will be sent in
> for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> Muthusamy from Oracle [3].
>
> The intention of the original translate_pid() syscall was twofold:
> 1. Provide translation of pids between pid namespaces
> 2. Provide implicit pid namespace introspection
>
> Both functionalities are preserved. The latter task has been improved
> upon though. In the original version of the pachset passing pid as 1
> would allow to deterimine the relationship between the pid namespaces.
> This is inherhently racy. If pid 1 inside a pid namespace has died it
> would report false negatives. For example, if pid 1 inside of the target
> pid namespace already died, it would report that the target pid
> namespace cannot be reached from the source pid namespace because it
> couldn't find the pid inside of the target pid namespace and thus
> falsely report to the user that the two pid namespaces are not related.
> This problem is simple to avoid. In the new version we simply walk the
> list of ancestors and check whether the namespace are related to each
> other. By doing it this way we can reliably report what the relationship
> between two pid namespace file descriptors looks like.
>
> Additionally, this syscall has been extended to allow the retrieval of
> pidfds independent of procfs. These pidfds can e.g. be used with the new
> pidfd_send_signal() syscall we recently merged. The ability to retrieve
> pidfds independent of procfs had already been requested in the
> pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> [5]. A use-case where a kernel is compiled without procfs but where
> pidfds are still useful has been outlined by Andy in [6]. Regular
> anon-inode based file descriptors are used that stash a reference to
> struct pid in file->private_data and drop that reference on close.
>
> With this translate_pid() has three closely related but still distinct
> functionalities. To clarify the semantics and to make it easier for
> userspace to use the syscall it has:
> - gained a command argument and three commands clearly reflecting the
>   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
>   PIDCMD_GET_PIDFD).
> - been renamed to pidctl()

Having made these changes, you've built a general-purpose command
command multiplexer, not one operation that happens to be flexible.
The general-purpose command multiplexer is a common antipattern:
multiplexers make it hard to talk about different kernel-provided
operations using the common vocabulary we use to distinguish
kernel-related operations, the system call number. socketcall, for
example, turned out to be cumbersome for users like SELinux policy
writers. People had to do work work later to split socketcall into
fine-grained system calls. Please split the pidctl system call so that
the design is clean from the start and we avoid work later. System
calls are cheap.

Also, I'm still confused about how metadata access is supposed to work
for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
You snipped out a portion of a previous email in which I asked about
your thoughts on this question. With the PIDCMD_GET_PIDFD command in
place, we have two different kinds of file descriptors for processes,
one derived from procfs and one that's independent. The former works
with openat(2). The latter does not. To be very specific; if I'm
writing a function that accepts a pidfd and I get a pidfd that comes
from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
smaps or oom_score_adj or statm for the named process in a race-free
manner?