Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-30 Thread Daniel Colascione
On Sat, Mar 30, 2019 at 7:30 AM Jonathan Kowalski  wrote:
>
> On Sat, Mar 30, 2019 at 7:39 AM Daniel Colascione  wrote:
> >
> >  [SNIP]
> >
> > Thanks again.
> >
> > I agree that the operation we're discussing has a simple signature,
> > but signature flexibility isn't the only reason to prefer a system
> > call over an ioctl. There are other reasons for preferring system
> > calls to ioctls (safety, tracing, etc.) that apply even if the
> > operation we're discussing has a relatively simple signature: for
> > example, every system call has a distinct and convenient ftrace event,
> > but ioctls don't; strace filtering Just Works on a
> > system-call-by-system-call basis, but it doesn't for ioctls; and
>
> It does for those with a unique number.

There's no such thing as a unique ioctl number though. Anyone is free
to create an ioctl with a conflicting number. Sure, you can guarantee
ioctl request number uniqueness within things that ship with the
kernel, but you can also guarantee system call number uniqueness, so
what do we gain by using an ioctl? And yes, you can do *some*
filtering based on ioctl command, but it's frequently more awkward
than the system call equivalent and sometimes doesn't work at all.
What's the strace -e pidfd_open equivalent for an ioctl? Grep isn't
quite equivalent.

> > documentation for system calls is much more discoverable (e.g., man
> > -k) than documentation for ioctls. Even if the distinction doesn't
> > matter much, IMHO, it still matters a little, enough to favor a system
> > call without an offsetting advantage for the ioctl option.
>
> It will be alongside other I/O operations in the pidfd man page.
>
> >
> > > If you start adding a system call for every specific operation on file
> > > descriptors, it *will* become a problem.
> >
> > I'm not sure what you mean. Do you mean that adding a top-level system
> > call for every operation that might apply to one specific kind of file
> > descriptor would lead, as the overall result, to the kernel having
> > enough system calls to cause negative consequences? I'm not sure I
> > agree, but accepting this idea for the sake of discussion: shouldn't
> > we be more okay with system calls for features present on almost all
> > systems --- like procfs --- even if we punt to ioctls very rarely-used
> > functionality, e.g., some hypothetical special squeak noise that you
> > could get some specific 1995 adlib clone to make?
>
> Consider if we were to do:
>
> int procpidfd_open(int pidfd, int procrootfd);
>
> This system call is useless besides a very specific operation for a
> very specific usecase on a file descriptor.

I don't understand this argument based on an operation being "very
specific". Are you suggesting that epoll_wait should have been an
ioctl? I don't see how an operation being specific to a type of file
descriptor is an argument for making that operation an ioctl instead
of a system call.

> Then, you figure you might
> want to support procpidfd back to pidfd (although YAGNI), so you will
> add a CMD or flags argument now,

Why would we need a system call at all? And why are we invoking
hypothetical argument B in an argument against adding operation A as a
system call? B doesn't exist yet. As Christian mentioned, we could
create a named /proc/pid/handle file which, when opened, would yield a
pidfd. More generally, though: if we expose new functionality, we can
make a new system call. Why is that worse than adding a new ioctl
code?

> int procpidfd_open(int pidfd, int procrootfd/procpidfd, unsigned int flags);
>
>   int procpidfd = procpidfd_open(fd, procrootfd, PIDFD_TO_PROCPIDFD);
>   int pidfd = procpidfd_open(-1, procpidfd, PROCPIDFD_TO_PIDFD);
>
> vs procpidfd2pidfd(int procpidfd); if you did not foresee the need.
> Then, you want pidfd2pid, and so on.
>
> If you don't, you will have to add a command interface in the new
> system call, which changes parameters depending on the flags. This is
> already starting to get ugly.

> In the end, it's a matter of taste,

I don't think it's quite a matter of taste. That idiom suggests that
the two options are roughly functionally equivalent. In this case,
there are clear and specific technical reasons to prefer system calls.
We're not talking about two equivalent approaches. Making the
operation a system call gives users more power, and we shouldn't
deprive them of this power without a good reason. So far, I haven't
seen such a reason.

>  this
> pattern if exploited leads to endless proliferation of the system call
> interface, often ridden with short-sighted APIs, because you cannot
> know if you want a focused call or a command style call.

Bad interfaces proliferate no matter what, and there's no difference
in support burden between an ioctl and a system call: both need to
work indefinitely. Are you saying that it's easier to remove a bad
interface if it's an ioctl instead of a system call? I don't recall
any precedent, but maybe I'm wrong.

> ioctl is
> already a command 

Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-30 Thread Jonathan Kowalski
On Sat, Mar 30, 2019 at 7:39 AM Daniel Colascione  wrote:
>
>  [SNIP]
>
> Thanks again.
>
> I agree that the operation we're discussing has a simple signature,
> but signature flexibility isn't the only reason to prefer a system
> call over an ioctl. There are other reasons for preferring system
> calls to ioctls (safety, tracing, etc.) that apply even if the
> operation we're discussing has a relatively simple signature: for
> example, every system call has a distinct and convenient ftrace event,
> but ioctls don't; strace filtering Just Works on a
> system-call-by-system-call basis, but it doesn't for ioctls; and

It does for those with a unique number.

> documentation for system calls is much more discoverable (e.g., man
> -k) than documentation for ioctls. Even if the distinction doesn't
> matter much, IMHO, it still matters a little, enough to favor a system
> call without an offsetting advantage for the ioctl option.

It will be alongside other I/O operations in the pidfd man page.

>
> > If you start adding a system call for every specific operation on file
> > descriptors, it *will* become a problem.
>
> I'm not sure what you mean. Do you mean that adding a top-level system
> call for every operation that might apply to one specific kind of file
> descriptor would lead, as the overall result, to the kernel having
> enough system calls to cause negative consequences? I'm not sure I
> agree, but accepting this idea for the sake of discussion: shouldn't
> we be more okay with system calls for features present on almost all
> systems --- like procfs --- even if we punt to ioctls very rarely-used
> functionality, e.g., some hypothetical special squeak noise that you
> could get some specific 1995 adlib clone to make?

Consider if we were to do:

int procpidfd_open(int pidfd, int procrootfd);

This system call is useless besides a very specific operation for a
very specific usecase on a file descriptor. Then, you figure you might
want to support procpidfd back to pidfd (although YAGNI), so you will
add a CMD or flags argument now,

int procpidfd_open(int pidfd, int procrootfd/procpidfd, unsigned int flags);

  int procpidfd = procpidfd_open(fd, procrootfd, PIDFD_TO_PROCPIDFD);
  int pidfd = procpidfd_open(-1, procpidfd, PROCPIDFD_TO_PIDFD);

vs procpidfd2pidfd(int procpidfd); if you did not foresee the need.
Then, you want pidfd2pid, and so on.

If you don't, you will have to add a command interface in the new
system call, which changes parameters depending on the flags. This is
already starting to get ugly. In the end, it's a matter of taste, this
pattern if exploited leads to endless proliferation of the system call
interface, often ridden with short-sighted APIs, because you cannot
know if you want a focused call or a command style call. ioctl is
already a command interface, and 10 system calls for each command is
_NOT_ nice, from a user perspective.

>
> > Besides, the translation is
> > just there because it is racy to do in userspace,  it is not some well
> > defined core kernel functionality.
> > Therefore, it is just a way to
> > enter the kernel to do the openat in a race free and safe manner.
>
> I agree that the translation has to be done in the kernel, not
> userspace, and that the kernel must provide to userspace some
> interface for requesting that the translation happen: we're just
> discussing the shape of this interface. Shouldn't all interfaces
> provided by the kernel to userspace be equally well defined? I'm not
> sure that the internal simplicity of the operation matters much
> either. There are already explicit system calls for some
> simple-to-implement things, e.g., timerfd_gettime. It's worth noting
> that timerfd is (IIRC), like procfs, a feature that's both ubiquitous
> and optional.

It is well defined, it has a well defined signature, and it will error
out if you don't use it properly. Again, what it does is very limited
and niche. I am not sure it warrants a system call of its own.

timerfd_gettime was an afterthought anyway, that's probably not a good
example (it was more to just match the POSIX timers interface as the
original timerfd never had support for querying, so the split into two
steps, create and initialize, you could argue one could do it without
a syscall, but it still has a well defined argument list, and
accepting elaborate data structures into an ioctl is not a good
interface to plumb, so that's easily justified).

It's much like socket and setsockopt/getsockopt in nature. I would
even say APIs separating creation and configuration age well and are
better, but a process doesn't fit such a model cleanly.

>
> > As is, the facility being provided through an ioctl on the pidfd is
> > not something I'd consider a problem.
>
> You're right that from a signature perspective, using an ioctl isn't a
> problem. I just want to make sure we take into account the other,
> non-signature advantages that system calls have over ioctls.
>
> > I think the translation stuff
> 

Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-30 Thread Daniel Colascione
On Fri, Mar 29, 2019 at 11:25 PM Jonathan Kowalski  wrote:
>
> On Sat, Mar 30, 2019 at 5:35 AM Daniel Colascione  wrote:
> >
> > On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner  
> > wrote:
> > >
> > > > All that said, thanks for the work on this once again. My intention is
> > > > just that we don't end up with an API that could have been done better
> > > > and be cleaner to use for potential users in the coming years.
> > >
> > > Thanks for your input on all of this. I still don't find multiplexers in
> > > the style of seccomp()/fsconfig()/keyctl() to be a problem since they
> > > deal with a specific task. They are very much different from ioctl()s in
> > > that regard. But since Joel, you, and Daniel found the pidctl() approach
> > > not very nice I dropped it. The interface needs to be satisfactory for
> > > all of us especially since Android and other system managers will be the
> > > main consumers.
> >
> > Thanks.
> >
> > > So let's split this into pidfd_open(pid_t pid, unsigned int flags) which
> > > allows to cleanly get pidfds independent procfs and do the translation
> > > to procpidfds in an ioctl() as we've discussed in prior threads. This
> >
> > I sustain my objection to adding an ioctl. Compared to a system call,
> > an ioctl has a more rigid interface, greater susceptibility to
> > programmer error (due to the same ioctl control code potentially doing
> > different things for different file types), longer path length, and
> > more awkward filtering/monitoring/auditing/tracing. We've discussed
> > this issue at length before, and I thought we all agreed to use system
> > calls, not ioctl, for core kernel functionality. So why is an ioctl
> > suddenly back on the table? The way I see it, an ioctl has no
> > advantages except for 1) conserving system call numbers, which are not
> > scarce, and 2) avoiding the system call number coordination problem
> > (and the coordination problem isn't a factor for core kernel code). I
> > don't understand everyone's reluctance to add new system calls. What
> > am I missing? Why would we give up all the advantages that a system
> > call gives us?
> >
>
> I agree in general, but in this particular case a system call or an
> ioctl doesn't matter much, all it does is take the pidfd, the command,
> and /proc's dir fd.

Thanks again.

I agree that the operation we're discussing has a simple signature,
but signature flexibility isn't the only reason to prefer a system
call over an ioctl. There are other reasons for preferring system
calls to ioctls (safety, tracing, etc.) that apply even if the
operation we're discussing has a relatively simple signature: for
example, every system call has a distinct and convenient ftrace event,
but ioctls don't; strace filtering Just Works on a
system-call-by-system-call basis, but it doesn't for ioctls; and
documentation for system calls is much more discoverable (e.g., man
-k) than documentation for ioctls. Even if the distinction doesn't
matter much, IMHO, it still matters a little, enough to favor a system
call without an offsetting advantage for the ioctl option.

> If you start adding a system call for every specific operation on file
> descriptors, it *will* become a problem.

I'm not sure what you mean. Do you mean that adding a top-level system
call for every operation that might apply to one specific kind of file
descriptor would lead, as the overall result, to the kernel having
enough system calls to cause negative consequences? I'm not sure I
agree, but accepting this idea for the sake of discussion: shouldn't
we be more okay with system calls for features present on almost all
systems --- like procfs --- even if we punt to ioctls very rarely-used
functionality, e.g., some hypothetical special squeak noise that you
could get some specific 1995 adlib clone to make?

> Besides, the translation is
> just there because it is racy to do in userspace,  it is not some well
> defined core kernel functionality.
> Therefore, it is just a way to
> enter the kernel to do the openat in a race free and safe manner.

I agree that the translation has to be done in the kernel, not
userspace, and that the kernel must provide to userspace some
interface for requesting that the translation happen: we're just
discussing the shape of this interface. Shouldn't all interfaces
provided by the kernel to userspace be equally well defined? I'm not
sure that the internal simplicity of the operation matters much
either. There are already explicit system calls for some
simple-to-implement things, e.g., timerfd_gettime. It's worth noting
that timerfd is (IIRC), like procfs, a feature that's both ubiquitous
and optional.

> As is, the facility being provided through an ioctl on the pidfd is
> not something I'd consider a problem.

You're right that from a signature perspective, using an ioctl isn't a
problem. I just want to make sure we take into account the other,
non-signature advantages that system calls have over ioctls.

> I think the tran

Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-29 Thread Jonathan Kowalski
On Sat, Mar 30, 2019 at 5:35 AM Daniel Colascione  wrote:
>
> On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner  
> wrote:
> >
> > > All that said, thanks for the work on this once again. My intention is
> > > just that we don't end up with an API that could have been done better
> > > and be cleaner to use for potential users in the coming years.
> >
> > Thanks for your input on all of this. I still don't find multiplexers in
> > the style of seccomp()/fsconfig()/keyctl() to be a problem since they
> > deal with a specific task. They are very much different from ioctl()s in
> > that regard. But since Joel, you, and Daniel found the pidctl() approach
> > not very nice I dropped it. The interface needs to be satisfactory for
> > all of us especially since Android and other system managers will be the
> > main consumers.
>
> Thanks.
>
> > So let's split this into pidfd_open(pid_t pid, unsigned int flags) which
> > allows to cleanly get pidfds independent procfs and do the translation
> > to procpidfds in an ioctl() as we've discussed in prior threads. This
>
> I sustain my objection to adding an ioctl. Compared to a system call,
> an ioctl has a more rigid interface, greater susceptibility to
> programmer error (due to the same ioctl control code potentially doing
> different things for different file types), longer path length, and
> more awkward filtering/monitoring/auditing/tracing. We've discussed
> this issue at length before, and I thought we all agreed to use system
> calls, not ioctl, for core kernel functionality. So why is an ioctl
> suddenly back on the table? The way I see it, an ioctl has no
> advantages except for 1) conserving system call numbers, which are not
> scarce, and 2) avoiding the system call number coordination problem
> (and the coordination problem isn't a factor for core kernel code). I
> don't understand everyone's reluctance to add new system calls. What
> am I missing? Why would we give up all the advantages that a system
> call gives us?
>

I agree in general, but in this particular case a system call or an
ioctl doesn't matter much, all it does is take the pidfd, the command,
and /proc's dir fd.

If you start adding a system call for every specific operation on file
descriptors, it *will* become a problem. Besides, the translation is
just there because it is racy to do in userspace,  it is not some well
defined core kernel functionality. Therefore, it is just a way to
enter the kernel to do the openat in a race free and safe manner.

As is, the facility being provided through an ioctl on the pidfd is
not something I'd consider a problem. I think the translation stuff
should also probably be an extension of ioctl_ns(2) (but I wouldn't be
opposed if translate_pid is resurrected as is).

For anything more involved than ioctl(pidfd, PIDFD_TO_PROCFD,
procrootfd), I'd agree that a system call would be a cleaner
interface, otherwise, if you cannot generalise it, using ioctls as a
command interface is probably the better tradeoff here.

> I also don't understand Andy's argument on the other thread that an
> ioctl is okay if it's an "operation on an FD" --- *most* system calls
> are operations on FDs. We don't have an ioctl for sendmsg(2) and it's
> an "operation on an FD".


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-29 Thread Daniel Colascione
On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner  wrote:
>
> > All that said, thanks for the work on this once again. My intention is
> > just that we don't end up with an API that could have been done better
> > and be cleaner to use for potential users in the coming years.
>
> Thanks for your input on all of this. I still don't find multiplexers in
> the style of seccomp()/fsconfig()/keyctl() to be a problem since they
> deal with a specific task. They are very much different from ioctl()s in
> that regard. But since Joel, you, and Daniel found the pidctl() approach
> not very nice I dropped it. The interface needs to be satisfactory for
> all of us especially since Android and other system managers will be the
> main consumers.

Thanks.

> So let's split this into pidfd_open(pid_t pid, unsigned int flags) which
> allows to cleanly get pidfds independent procfs and do the translation
> to procpidfds in an ioctl() as we've discussed in prior threads. This

I sustain my objection to adding an ioctl. Compared to a system call,
an ioctl has a more rigid interface, greater susceptibility to
programmer error (due to the same ioctl control code potentially doing
different things for different file types), longer path length, and
more awkward filtering/monitoring/auditing/tracing. We've discussed
this issue at length before, and I thought we all agreed to use system
calls, not ioctl, for core kernel functionality. So why is an ioctl
suddenly back on the table? The way I see it, an ioctl has no
advantages except for 1) conserving system call numbers, which are not
scarce, and 2) avoiding the system call number coordination problem
(and the coordination problem isn't a factor for core kernel code). I
don't understand everyone's reluctance to add new system calls. What
am I missing? Why would we give up all the advantages that a system
call gives us?

I also don't understand Andy's argument on the other thread that an
ioctl is okay if it's an "operation on an FD" --- *most* system calls
are operations on FDs. We don't have an ioctl for sendmsg(2) and it's
an "operation on an FD".


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-28 Thread Christian Brauner
On Thu, Mar 28, 2019 at 12:59:46PM -0400, Joel Fernandes wrote:
> 
> 
> On March 28, 2019 6:38:15 AM EDT, Christian Brauner  
> wrote:
> >> All that said, thanks for the work on this once again. My intention
> >is
> >> just that we don't end up with an API that could have been done
> >better
> >> and be cleaner to use for potential users in the coming years.
> >
> >Thanks for your input on all of this. I still don't find multiplexers
> >in
> >the style of seccomp()/fsconfig()/keyctl() to be a problem since they
> >deal with a specific task. They are very much different from ioctl()s
> >in
> >that regard. But since Joel, you, and Daniel found the pidctl()
> >approach
> >not very nice I dropped it. The interface needs to be satisfactory for
> >all of us especially since Android and other system managers will be
> >the
> >main consumers.
> >So let's split this into pidfd_open(pid_t pid, unsigned int flags)
> >which
> >allows to cleanly get pidfds independent procfs and do the translation
> >to procpidfds in an ioctl() as we've discussed in prior threads. This
> >should also accommodate comments and ideas from Andy and Jann.
> >I'm coding this up now.
> 
> This sounds quite sensible to me. Thanks!

Thanks! I have it ready and hope to send it out later today.

Christian


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-28 Thread Joel Fernandes



On March 28, 2019 6:38:15 AM EDT, Christian Brauner  
wrote:
>> All that said, thanks for the work on this once again. My intention
>is
>> just that we don't end up with an API that could have been done
>better
>> and be cleaner to use for potential users in the coming years.
>
>Thanks for your input on all of this. I still don't find multiplexers
>in
>the style of seccomp()/fsconfig()/keyctl() to be a problem since they
>deal with a specific task. They are very much different from ioctl()s
>in
>that regard. But since Joel, you, and Daniel found the pidctl()
>approach
>not very nice I dropped it. The interface needs to be satisfactory for
>all of us especially since Android and other system managers will be
>the
>main consumers.
>So let's split this into pidfd_open(pid_t pid, unsigned int flags)
>which
>allows to cleanly get pidfds independent procfs and do the translation
>to procpidfds in an ioctl() as we've discussed in prior threads. This
>should also accommodate comments and ideas from Andy and Jann.
>I'm coding this up now.

This sounds quite sensible to me. Thanks!


Joel Fernandes, Android kernel team
Sent from k9-mail on Android


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-28 Thread Christian Brauner
> All that said, thanks for the work on this once again. My intention is
> just that we don't end up with an API that could have been done better
> and be cleaner to use for potential users in the coming years.

Thanks for your input on all of this. I still don't find multiplexers in
the style of seccomp()/fsconfig()/keyctl() to be a problem since they
deal with a specific task. They are very much different from ioctl()s in
that regard. But since Joel, you, and Daniel found the pidctl() approach
not very nice I dropped it. The interface needs to be satisfactory for
all of us especially since Android and other system managers will be the
main consumers.
So let's split this into pidfd_open(pid_t pid, unsigned int flags) which
allows to cleanly get pidfds independent procfs and do the translation
to procpidfds in an ioctl() as we've discussed in prior threads. This
should also accommodate comments and ideas from Andy and Jann.
I'm coding this up now.

Christian


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Jonathan Kowalski
pidfd_open is open pidfd for pid relative to pidns, so a better
analogy is that it is like openat for a relative pathname wrt dirfd.
O_DIRECTORY is analogous to what type of object, so a TIDFD flag in
the future which interprets pid (pathname) as thread id only and pins
that specific struct pid. That has now limited you to the specific
object and operations you can perform on said object with other pidfd
APIs.

procfd_open in my proposed signature is just a fancy race free openat
on the corresponding pid of the pidfd relative to the procrootfd
(which becomes the dirfd if i were doing it in userspace), which might
as well been implemented in userspace if things were not racy. Andy
suggested something similar.

My point is, when I am talking of the pidfd API, procfs is irrelevant.
You are thinking of it as a process directory and a process file, I am
thinking of it in terms of a process object and the proc dir fd as an
file system based interface to query process state (through read)
which is why I object to them being usable with pidfd_send_signal as
well, in principle. In the future, people may use task_diag for the
same purpose or a different interface, to work around its limitations.
This would just be another interface of the kernel to query process
state, not representative of the process object itself. Hence, keeping
the pidfd to procfd translation entirely separate (as already
suggested) sounds much, much better to me.

The pidfd API and related calls are untouched and unaffected by
presence, absence of procfs or not (they are, but you do unrelated
stuff in the same system call). To me atleast, munging opening (and
then changing what the procfd means to support the flag use case),
having flags like PIDFD_TO_PROCFD that will work only without
CLONE_NEWPID, then having eg. GET_TIDFID that may work with
CLONE_NEWPID, etc.

I find this interface confusing.

I have a few steps when starting to work with a pidfd:

1. Acquire

pidfd_open(pid, ns, flags) or pidfd_clone(...)

2. Operate

pidfd_send_signal(...)

For those who need a race free way to open the correct /proc/ for
a pidfd relative to a /proc dir fd, for the purposes of metadata
access, you will have procfd_open, which is in the kernel because the
same thing is racy to do in userspace.

Otherwise, pidfd_open in this patchset is this and also a polymorphic
system call that can become procfd_open in my example when passed a
flag. It is doing vastly different things given the presence and
absence of options. This is similar to a multiplexor again, but it
looks more confusing. You have to mask options.

pidfd_open currently:

pidfd_open(pid, -1, -1, 0); gets pidfd in current active ns
pidfd_open(-1, procrootfd, pidfd, PIDFD_TO_PROCFD); returns dir fd of
/proc/ it maps to rel. to proc rootfd
pidfd_open(-1, nsfd, pidfd, CLONE_NEWPID); as you propose this
searches pid in pidns pinned by nsfd, and returns a pidfd file
descriptor.

Extend this to threads in the future, and the combination and
permutation starts getting confusing. Based on the flag, it is
entirely changing what will it work upon, and what it will do.

I can reasonably summarise my pidfd_open and procfd_open in their man
page in one line:

pidfd_open(pid, ns, flags) returns pidfd for pid, searching it in the
pidns pinned by ns fd, and flags will determine further if this is
thread local or process local (i.e. tid or tgid, and tgid == tid for
single threaded) (in the future) (so you could do thread directed
signals by passing a flag to pidfd_send_signal and this pidfd).

Your call without CONFIG_PROC_FS will be literally this, but a few
options will have to be set as -1.

procfd_open(procrootfd, pidfd, flags), returns the proc dir fd for the
pid/tid depending on if the pidfd is thread local, process local, hint
it in flags, etc. It is just a race free wrapper around an openat in
userspace, undergoing the same access control checks.

Yes, pidfd_open as it is now works *just fine*, but it is more
confusing to use and discuss. The conclusion from the previous
discussion also seemed to be to split pidctl's PIDCMD_GET_PIDFD into
its own thing, and provide a translation from pidfd to its proc dir fd
on its own. Then, translate_pid can be its own thing, or you could
extend ioctl_ns(2) if you want.

All that said, thanks for the work on this once again. My intention is
just that we don't end up with an API that could have been done better
and be cleaner to use for potential users in the coming years.


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Christian Brauner


> procfd_open(int procrootfd, int pidfd, unsigned int flags);
> pidfd_open(pid_t pid, int nsfd, unsigned int flags);

That honestly just feels like splitting openat into:
openat_dir()
and
opentat_file()


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Jonathan Kowalski
On Wed, Mar 27, 2019 at 9:34 PM Christian Brauner  wrote:
>
> On Wed, Mar 27, 2019 at 07:38:13PM +, Jonathan Kowalski wrote:
> > Christian,
> >
> > Giving this some thought, it looks fine to me, but I am not convinced
> > this should take a pid argument. I much prefer if obtaining a pidfd
>
> The signatures for pidfd_open() you're suggesting are conflicting. Even
> taking into account that you're referring to Andy's version

My version pidfd_open takes a pid, the pid namespace to search it in,
and has an optional flags value to extend it further in the future to
an immediate need. This is a simpler version of the pidctl's GET_PIDFD
command that you yourself proposed in the last patch, and you don't
need two namespace arguments, just one here (in that case you were
passing -1 in either of those anyway).

so pidctl(PIDCMD_GET_PIDFD, pid, source, -1, 0) becomes
pidfd_open(pid, source, 0);
pidctl(PIDCMD_GET_PIDFD, pid, -1, target or -1, 0) would be
pidfd_open(pid, -1, 0);

The target was just there due to it having to support
comparison/translation commands, but it did not serve much purpose
except the pid was reachable in both. I am not sure that is needed if
you just want a pidfd for pid relative to a namespace.

What I prefer is dropping the pid argument from *your* pidfd_open,
rename it to procfd/procpidfd_open to give it a clearer name, and have
the signature the same as what Andy suggested. This system call is
tied to CONFIG_PROC_FS, all it does is take a pidfd, a proc rootfd,
and return the /proc/ dir fd. It is up to you if you want to
support procpidfd to pidfd conversion, I don't think there was any
usecase mentioned regarding that far, so you may just ignore it.

>
> int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name
>
> the "not convinced this should take a pid argument" confuses me wrt to
> the proposed second signature:
>
> pidfd_open(pid_t pid, int ns, unsigned int flags);
>
> > goes in the translation stuff, that has very real usecases, and you
> > would end up duplicating the same thing over in two places.
> >
> > If the /proc/ dir fd of a pidfd is relative to its procrootfd, a
> > pid's pidfd is also relative to its namespace. Currently, you change
>
> Pass a /proc/ file descriptor that you have been given access to to
> pidfd_open() and retrieve a pidfd to that process.

It would be nice if it would be free of /proc entirely. The usecase
this system call is after is helping race free translation from pidfd
to procfd. pidfd_open as a separate system call as mentioned above
sounds much cleaner to me, and it works regardless of what /proc is
(compiled in, mounted, not compiled in, no access, etc).

>
> > things which would mean I now have to setns and spawn a child that
> > passes me back the pidfd for a pid in the said namespace.
>
> There are two scenarios:
> - you're in a cousin pid namespace
> - you're in an ancestor pid namespace
>
> If you're in an ancestor pid namespace you can just get a pidfd based on
> the pid of the process in your namespace. If you're in a cousin
> pid namespace it seems reasonable that you have to do a setns to get a
> pidfd. After all, you're not able to see any pids in there by default.
> If you want pidfd to a cousin pid namespace prove that you can access it
> by attaching to the owning user namespace of the pid namespace and get
> your pidfd.

Yes, but you break the part that you had working in the pidctl
patchset, I learn the pid of a process in a child pidns, and I am able
to get a pidfd for it with the pid relative to the passed in namespace
descriptor. You could even get a pidfd for an ancestor process, given
it explicitly gives you its namespace descriptor (or a cousin pidns,
but that still means userspace will pass the namespace descriptor
explicitly), but it is up to you if you want to limit its scope to
down the hierarchy strictly (and it would be nice to know now so that
you don't end up wasting more time later, and the next patchset is
hopefully the last one before merge modulo nits).

>
> >
> > I prefer Andy's version of
> >
> >   int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
> >   int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name
> > this procfd_open, maybe
> >
> > This is just a nicer race free openat.
> >
> > But, Joel and you agreed that it makes sense to split out the
> > translation stuff out of pidfds.
>
> I don't quite remember that part.
>
> I honestly think that the version proposed here covers for the most part
> what we want and provides a decent compromise by avoiding ioctl()s for
> the translation bits, allows us to not split this into multiple

I was not wanting to suggest an ioctl to burden you further, but it
occured to me that you would have to do less, as it already has
support to determine ownership and is meant for such tasks in
ioctl_ns(2), but I don't care if it's a translate_pid system call.

> syscalls, and also allows retrieving pidfds from other pid namespaces
> provided that

Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Christian Brauner
On Wed, Mar 27, 2019 at 07:38:13PM +, Jonathan Kowalski wrote:
> Christian,
> 
> Giving this some thought, it looks fine to me, but I am not convinced
> this should take a pid argument. I much prefer if obtaining a pidfd

The signatures for pidfd_open() you're suggesting are conflicting. Even
taking into account that you're referring to Andy's version:

int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name

the "not convinced this should take a pid argument" confuses me wrt to
the proposed second signature:

pidfd_open(pid_t pid, int ns, unsigned int flags);

> goes in the translation stuff, that has very real usecases, and you
> would end up duplicating the same thing over in two places.
> 
> If the /proc/ dir fd of a pidfd is relative to its procrootfd, a
> pid's pidfd is also relative to its namespace. Currently, you change

Pass a /proc/ file descriptor that you have been given access to to
pidfd_open() and retrieve a pidfd to that process.

> things which would mean I now have to setns and spawn a child that
> passes me back the pidfd for a pid in the said namespace.

There are two scenarios:
- you're in a cousin pid namespace
- you're in an ancestor pid namespace

If you're in an ancestor pid namespace you can just get a pidfd based on
the pid of the process in your namespace. If you're in a cousin
pid namespace it seems reasonable that you have to do a setns to get a
pidfd. After all, you're not able to see any pids in there by default.
If you want pidfd to a cousin pid namespace prove that you can access it
by attaching to the owning user namespace of the pid namespace and get
your pidfd.

> 
> I prefer Andy's version of
> 
>   int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
>   int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name
> this procfd_open, maybe
> 
> This is just a nicer race free openat.
> 
> But, Joel and you agreed that it makes sense to split out the
> translation stuff out of pidfds.

I don't quite remember that part.

I honestly think that the version proposed here covers for the most part
what we want and provides a decent compromise by avoiding ioctl()s for
the translation bits, allows us to not split this into multiple
syscalls, and also allows retrieving pidfds from other pid namespaces
provided that one has gotten access to a file descriptor for
/proc/. If really needed at some point - I doubt we will - you
can extend pidfd_open() to take the flag CLONE_NEWPID:

int pidfd = pidfd_open(pid, pid_ns_fd, -1, CLONE_NEWPID);

and get pidfd for another pid namespace back.

Christian

> 
> My suggestion would be to extend ioctl_ns(2) then. It already provides
> a (rather clumsy) mechanism to determine the relationship by comparing
> st_dev and st_ino, which userspace can do itself for two namespace
> descriptors. For translation, you can extend this namespace ioctl
> (there is one to get a owner UID, you could add one to get a relative
> PID).
> 
> Then, your pidfd call will be:
> 
> pidfd_open(pid_t pid, int ns, unsigned int flags);
> 
> You would also be able to compile out anything procfs related (include
> the new API to procfs dir fd translation), otherwise, the way to open
> pidfds is in this call, and without CONFIG_PROC_FS=Y, this is as good
> as pidfd_open(pid_t pid) (of which a better version I propose above).
> The new API should be its own thing, and the procfs translation tuple
> its own thing, tied to the proc fs option. pidfds need not have any
> relation to /proc.

They don't have a relation to procfs right now in this syscall. Works
fine without procfs. That's the point.

> 
> For this procfd conversion system call, I would also lift any
> namespace related restrictions if you are planning to, given the
> constraint that I can only open a pidfd from an ancestor namespace
> with the new pidfd_open system call, or acquire it through a
> cooperative userspace process by fd passing otherwise, and I need the
> /proc root fd, having both only permits me to open the said process's
> /proc/ dir fd subject to normal access restrictions. This means
> the simplified procfd_open can be used to do metadata access without
> even talking of PIDs at all, your process could live in its own world
> and, given it has the authority to open the /proc directory, be able
> to purely collect metadata based on the two file descriptors it is
> given.
> 
> Once you have the restriction in the same call that allows you to open
> a pidfd for an addressable PID from the given namespace fd, you can
> finally remove the restriction to signal across namespaces once the
> siginfo stuff is sorted out, as that will only work when you
> explicitly push the fd into a sandbox, the process cannot get it out
> of thin air on its own (and you already mentioned it has nothing to do
> with security). What I do worry about is one can use NS_GET_PARENT
> ioctl to get the parent pidns if the owning userns is the same, and
> just passing that gives me back a pidfd for the ta

Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Christian Brauner
On Wed, Mar 27, 2019 at 06:07:54PM +0100, Jann Horn wrote:
> On Wed, Mar 27, 2019 at 5:22 PM Christian Brauner  
> wrote:
> > pidfd_open() allows to retrieve pidfds for processes and removes the
> > dependency of pidfd on procfs. Multiple people have expressed a desire to
> > do this even when pidfd_send_signal() was merged. It is even recorded in
> [...]
> > IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a
> > /proc mount in a given pid namespace and a pidfd pidfd_open() will return a
> > file descriptor to the corresponding /proc/ directory in procfs
> > mounts' pid namespace. pidfd_open() is very careful to verify that the pid
> 
> nit: s/mounts'/mount's/

Thanks.

> 
> > hasn't been recycled in between.
> > IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor
> > referencing a /proc/ directory a pidfd referencing the struct pid
> > stashed in /proc/ of the process will be returned.
> 
> nit: s/of the process //?

Yes.

> 
> > The pidfd_open() syscalls in that manner resembles openat() as it uses a
> 
> nit: s/syscalls/syscall/

Thanks.

> 
> [...]
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..c9e24e726aba 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> [...]
> > +static struct file *pidfd_open_proc_pid(const struct file *procf, pid_t 
> > pid,
> > +   const struct pid *pidfd_pid)
> > +{
> > +   char name[11]; /* int to strlen + \0 */
> 
> nit: The comment is a bit off; an unconstrained int needs 1+10+1
> bytes, I think? minus sign, 10 digits, nullbyte? But of course that
> can't actually happen here.

Yes, the comment is misleading.

> 
> > +   struct file *file;
> > +   struct pid *proc_pid;
> > +
> > +   snprintf(name, sizeof(name), "%d", pid);
> > +   file = file_open_root(procf->f_path.dentry, procf->f_path.mnt, name,
> > + O_DIRECTORY | O_NOFOLLOW, 0);
> 
> Maybe explicitly write the implied O_RDONLY (which is 0) here for clarity?

Yes.

> 
> [...]
> > +static int pidfd_to_procfd(pid_t pid, int procfd, int pidfd)
> > +{
> > +   long fd;
> > +   pid_t ns_pid;
> > +   struct fd fdproc, fdpid;
> > +   struct file *file = NULL;
> > +   struct pid *pidfd_pid = NULL;
> > +   struct pid_namespace *proc_pid_ns = NULL;
> > +
> > +   fdproc = fdget(procfd);
> > +   if (!fdproc.file)
> > +   return -EBADF;
> > +
> > +   fdpid = fdget(pidfd);
> > +   if (!fdpid.file) {
> > +   fdput(fdpid);
> 
> Typo: s/fdput(fdpid)/fdput(fdproc)/

Good catch!

> 
> [...]
> > +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd, unsigned 
> > int,
> > +   flags)
> [...]
> > +   if (!flags) {
> [...]
> > +   rcu_read_lock();
> > +   pidfd_pid = get_pid(find_pid_ns(pid, 
> > task_active_pid_ns(current)));
> > +   rcu_read_unlock();
> 
> The previous three lines are equivalent to `pidfd_pid = find_get_pid(pid)`.

Perfect, will replace.

> 
> > +   fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC);
> 
> Nit: You could hardcode O_CLOEXEC in pidfd_create_fd() and get rid of
> the second function argument if you want to.

Hm, let me rename this to pidfd_create_cloexec(pidfd_pid) then.

> 
> > +   put_pid(pidfd_pid);
> > +   } else if (flags & PIDFD_TO_PROCFD) {
> [...]
> > +   fd = pidfd_to_procfd(pid, procfd, pidfd);
> 
> The `pid` argument of pidfd_to_procfd() looks unused, maybe it makes
> sense to get rid of that?

Yes.


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Christian Brauner
On Wed, Mar 27, 2019 at 06:21:24PM +0100, Yann Droneaud wrote:
> Le mercredi 27 mars 2019 à 17:21 +0100, Christian Brauner a écrit :
> 
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..c9e24e726aba 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -26,8 +26,10 @@
> > +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd,
> > unsigned int,
> > +   flags)
> > +{
> > +   long fd = -EINVAL;
> > +
> > +   if (flags & ~(PIDFD_TO_PROCFD | PROCFD_TO_PIDFD))
> > +   return -EINVAL;
> > +
> > +   if (!flags) {
> > +   struct pid *pidfd_pid;
> > +
> > +   if (pid <= 0)
> > +   return -EINVAL;
> > +
> > +   if (procfd != -1 || pidfd != -1)
> > +   return -EINVAL;
> > +
> > +   rcu_read_lock();
> > +   pidfd_pid = get_pid(find_pid_ns(pid, 
> > task_active_pid_ns(current)));
> > +   rcu_read_unlock();
> > +
> > +   fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC);
> > +   put_pid(pidfd_pid);
> > +   } else if (flags & PIDFD_TO_PROCFD) {
> 
> [...]
> 
> > +   } else if (flags & PROCFD_TO_PIDFD) {
> > +   if (flags & ~PROCFD_TO_PIDFD)
> > +   return -EINVAL;
> > +
> > +   if (pid != -1)
> > +   return -EINVAL;
> > +
> > +   if (pidfd >= 0)
> > 
> 
> I think it can be stricter with:
> 
> if (pidfd != -1)

Yes.

> 
> (and match the check done for flag == 0).
> 
> Regards.
> 
> -- 
> Yann Droneaud
> OPTEYA
> 
> 


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Jonathan Kowalski
On Wed, Mar 27, 2019 at 7:38 PM Jonathan Kowalski  wrote:
>  ...
> ... the process cannot get it out
> of thin air on its own (and you already mentioned it has nothing to do
> with security). What I do worry about is one can use NS_GET_PARENT

disregard this, it works as it should.

> ioctl to get the parent pidns if the owning userns is the same, and
> just passing that gives me back a pidfd for the task. **So, you might
> want to add the constraint that the PID is actually reachable by the
> current task as well, apart from being reachable in the passed in
> namespace.**
>
> Lastly, I also see no need of /proc/ dir fd to pidfd conversion,
> I would even recommend getting rid of that, so we only have one type
> of pidfd, the anon inode one. What is the usecase behind that? It
> would only be needed if you did not have a way to be able to metadata
> access through a pidfd, which would be the case only prior to this
> patch.
>
> I think this would simplify a lot of things, and ioctl_ns(2) is
> probably already the place to do comparison operations and query
> operations on hierarichal namespaces, just adding the relative PID bit
> will make it gain feature parity with translate_pid.


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Jonathan Kowalski
Christian,

Giving this some thought, it looks fine to me, but I am not convinced
this should take a pid argument. I much prefer if obtaining a pidfd
goes in the translation stuff, that has very real usecases, and you
would end up duplicating the same thing over in two places.

If the /proc/ dir fd of a pidfd is relative to its procrootfd, a
pid's pidfd is also relative to its namespace. Currently, you change
things which would mean I now have to setns and spawn a child that
passes me back the pidfd for a pid in the said namespace.

I prefer Andy's version of

  int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
  int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name
this procfd_open, maybe

This is just a nicer race free openat.

But, Joel and you agreed that it makes sense to split out the
translation stuff out of pidfds.

My suggestion would be to extend ioctl_ns(2) then. It already provides
a (rather clumsy) mechanism to determine the relationship by comparing
st_dev and st_ino, which userspace can do itself for two namespace
descriptors. For translation, you can extend this namespace ioctl
(there is one to get a owner UID, you could add one to get a relative
PID).

Then, your pidfd call will be:

pidfd_open(pid_t pid, int ns, unsigned int flags);

You would also be able to compile out anything procfs related (include
the new API to procfs dir fd translation), otherwise, the way to open
pidfds is in this call, and without CONFIG_PROC_FS=Y, this is as good
as pidfd_open(pid_t pid) (of which a better version I propose above).
The new API should be its own thing, and the procfs translation tuple
its own thing, tied to the proc fs option. pidfds need not have any
relation to /proc.

For this procfd conversion system call, I would also lift any
namespace related restrictions if you are planning to, given the
constraint that I can only open a pidfd from an ancestor namespace
with the new pidfd_open system call, or acquire it through a
cooperative userspace process by fd passing otherwise, and I need the
/proc root fd, having both only permits me to open the said process's
/proc/ dir fd subject to normal access restrictions. This means
the simplified procfd_open can be used to do metadata access without
even talking of PIDs at all, your process could live in its own world
and, given it has the authority to open the /proc directory, be able
to purely collect metadata based on the two file descriptors it is
given.

Once you have the restriction in the same call that allows you to open
a pidfd for an addressable PID from the given namespace fd, you can
finally remove the restriction to signal across namespaces once the
siginfo stuff is sorted out, as that will only work when you
explicitly push the fd into a sandbox, the process cannot get it out
of thin air on its own (and you already mentioned it has nothing to do
with security). What I do worry about is one can use NS_GET_PARENT
ioctl to get the parent pidns if the owning userns is the same, and
just passing that gives me back a pidfd for the task. **So, you might
want to add the constraint that the PID is actually reachable by the
current task as well, apart from being reachable in the passed in
namespace.**

Lastly, I also see no need of /proc/ dir fd to pidfd conversion,
I would even recommend getting rid of that, so we only have one type
of pidfd, the anon inode one. What is the usecase behind that? It
would only be needed if you did not have a way to be able to metadata
access through a pidfd, which would be the case only prior to this
patch.

I think this would simplify a lot of things, and ioctl_ns(2) is
probably already the place to do comparison operations and query
operations on hierarichal namespaces, just adding the relative PID bit
will make it gain feature parity with translate_pid.


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Yann Droneaud
Le mercredi 27 mars 2019 à 17:21 +0100, Christian Brauner a écrit :

> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..c9e24e726aba 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -26,8 +26,10 @@
> +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd,
> unsigned int,
> + flags)
> +{
> + long fd = -EINVAL;
> +
> + if (flags & ~(PIDFD_TO_PROCFD | PROCFD_TO_PIDFD))
> + return -EINVAL;
> +
> + if (!flags) {
> + struct pid *pidfd_pid;
> +
> + if (pid <= 0)
> + return -EINVAL;
> +
> + if (procfd != -1 || pidfd != -1)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + pidfd_pid = get_pid(find_pid_ns(pid, 
> task_active_pid_ns(current)));
> + rcu_read_unlock();
> +
> + fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC);
> + put_pid(pidfd_pid);
> + } else if (flags & PIDFD_TO_PROCFD) {

[...]

> + } else if (flags & PROCFD_TO_PIDFD) {
> + if (flags & ~PROCFD_TO_PIDFD)
> + return -EINVAL;
> +
> + if (pid != -1)
> + return -EINVAL;
> +
> + if (pidfd >= 0)
> 

I think it can be stricter with:

if (pidfd != -1)

(and match the check done for flag == 0).

Regards.

-- 
Yann Droneaud
OPTEYA




Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Jann Horn
On Wed, Mar 27, 2019 at 5:22 PM Christian Brauner  wrote:
> pidfd_open() allows to retrieve pidfds for processes and removes the
> dependency of pidfd on procfs. Multiple people have expressed a desire to
> do this even when pidfd_send_signal() was merged. It is even recorded in
[...]
> IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a
> /proc mount in a given pid namespace and a pidfd pidfd_open() will return a
> file descriptor to the corresponding /proc/ directory in procfs
> mounts' pid namespace. pidfd_open() is very careful to verify that the pid

nit: s/mounts'/mount's/

> hasn't been recycled in between.
> IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor
> referencing a /proc/ directory a pidfd referencing the struct pid
> stashed in /proc/ of the process will be returned.

nit: s/of the process //?

> The pidfd_open() syscalls in that manner resembles openat() as it uses a

nit: s/syscalls/syscall/

[...]
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..c9e24e726aba 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
[...]
> +static struct file *pidfd_open_proc_pid(const struct file *procf, pid_t pid,
> +   const struct pid *pidfd_pid)
> +{
> +   char name[11]; /* int to strlen + \0 */

nit: The comment is a bit off; an unconstrained int needs 1+10+1
bytes, I think? minus sign, 10 digits, nullbyte? But of course that
can't actually happen here.

> +   struct file *file;
> +   struct pid *proc_pid;
> +
> +   snprintf(name, sizeof(name), "%d", pid);
> +   file = file_open_root(procf->f_path.dentry, procf->f_path.mnt, name,
> + O_DIRECTORY | O_NOFOLLOW, 0);

Maybe explicitly write the implied O_RDONLY (which is 0) here for clarity?

[...]
> +static int pidfd_to_procfd(pid_t pid, int procfd, int pidfd)
> +{
> +   long fd;
> +   pid_t ns_pid;
> +   struct fd fdproc, fdpid;
> +   struct file *file = NULL;
> +   struct pid *pidfd_pid = NULL;
> +   struct pid_namespace *proc_pid_ns = NULL;
> +
> +   fdproc = fdget(procfd);
> +   if (!fdproc.file)
> +   return -EBADF;
> +
> +   fdpid = fdget(pidfd);
> +   if (!fdpid.file) {
> +   fdput(fdpid);

Typo: s/fdput(fdpid)/fdput(fdproc)/

[...]
> +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd, unsigned 
> int,
> +   flags)
[...]
> +   if (!flags) {
[...]
> +   rcu_read_lock();
> +   pidfd_pid = get_pid(find_pid_ns(pid, 
> task_active_pid_ns(current)));
> +   rcu_read_unlock();

The previous three lines are equivalent to `pidfd_pid = find_get_pid(pid)`.

> +   fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC);

Nit: You could hardcode O_CLOEXEC in pidfd_create_fd() and get rid of
the second function argument if you want to.

> +   put_pid(pidfd_pid);
> +   } else if (flags & PIDFD_TO_PROCFD) {
[...]
> +   fd = pidfd_to_procfd(pid, procfd, pidfd);

The `pid` argument of pidfd_to_procfd() looks unused, maybe it makes
sense to get rid of that?


[PATCH 2/4] pid: add pidfd_open()

2019-03-27 Thread Christian Brauner
pidfd_open() allows to retrieve pidfds for processes and removes the
dependency of pidfd on procfs. Multiple people have expressed a desire to
do this even when pidfd_send_signal() was merged. It is even recorded in
the commit message for pidfd_send_signal() itself
(cf. commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad):
Q-06: (Andrew Morton [1])
  Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
  so this is something that we will need to support anyway. Hence,
  there's no immediate need to add another syscalls just to make
  pidfd_send_signal() not dependent on the presence of procfs. However,
  adding a syscalls to get such file descriptors is planned for a
  future patchset (cf. [1]).
Alexey made a similar request (cf. [2]). Additionally, Andy made an
argument that we should go forward with non-proc-dirfd file descriptors for
the sake of security and extensibility (cf. [3]).
This will unblock or help move along work on pidfd_wait which is currently
ongoing.

/* pidfds are anon inode file descriptors */
These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by
default and can be used with the pidfd_send_signal() syscall. They are not
dirfds and as such have the advantage that we can make them pollable or
readable in the future if we see a need to do so. Currently they do not
support any advanced operations. The pidfds are not associated with a
specific pid namespaces but rather only reference struct pid of a given
process in their private_data member.

/* Process Metadata Access */
One of the oustanding issues has been how to get information about a given
process if pidfds are regular file descriptors and do not provide access to
the process /proc/ directory.
Various solutions have been proposed. The one that most people prefer is to
be able to retrieve a file descriptor to /proc/ based on a pidfd (and
the other way around).
IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a
/proc mount in a given pid namespace and a pidfd pidfd_open() will return a
file descriptor to the corresponding /proc/ directory in procfs
mounts' pid namespace. pidfd_open() is very careful to verify that the pid
hasn't been recycled in between.
IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor
referencing a /proc/ directory a pidfd referencing the struct pid
stashed in /proc/ of the process will be returned.
The pidfd_open() syscalls in that manner resembles openat() as it uses a
flag argument to modify what type of file descriptor will be returned.

The pidfd_open() implementation together with the flags argument strikes me
as an elegant compromise between splitting this into multiple syscalls and
avoiding ioctls().

/* Examples */
// Retrieve pidfd
int pidfd = pidfd_open(1234, -1, -1, 0);

// Retrieve /proc/ handle for pidfd
int procfd = open("/proc", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
int procpidfd = pidfd_open(-1, procfd, pidfd, PIDFD_TO_PROCFD);

// Retrieve pidfd for /proc/
int procpidfd = open("/proc/1234", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
int pidfd = pidfd_open(-1, procpidfd, -1, PROCFD_TO_PIDFD);

/* References */
[1]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcss...@brauner.io/
[2]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
[3]: 
https://lore.kernel.org/lkml/CALCETrXO=V=+qedldvpf8ecglzib9botrufe0v-u-tuzoeo...@mail.gmail.com

Signed-off-by: Christian Brauner 
Cc: Arnd Bergmann 
Cc: "Eric W. Biederman" 
Cc: Kees Cook 
Cc: Alexey Dobriyan 
Cc: Thomas Gleixner 
Cc: Serge Hallyn 
Cc: Jann Horn 
Cc: "Michael Kerrisk (man-pages)" 
Cc: Konstantin Khlebnikov 
Cc: Jonathan Kowalski 
Cc: "Dmitry V. Levin" 
Cc: Andy Lutomirsky 
Cc: Andrew Morton 
Cc: Oleg Nesterov 
Cc: Nagarathnam Muthusamy 
Cc: Aleksa Sarai 
Cc: Al Viro 
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/pid.h|   2 +
 include/linux/syscalls.h   |   2 +
 include/uapi/linux/wait.h  |   3 +
 kernel/pid.c   | 247 +
 6 files changed, 256 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 1f9607ed087c..c8046f261bee 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -433,3 +433,4 @@
 425i386io_uring_setup  sys_io_uring_setup  
__ia32_sys_io_uring_setup
 426i386io_uring_enter  sys_io_uring_enter  
__ia32_sys_io_uring_enter
 427i386io_uring_register   sys_io_uring_register   
__ia32_sys_io_uring_register
+428i386pidfd_open  sys_pidfd_open  
__ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 92ee0b4378d4..f714a3d57b88 100644