Re: [PATCH v4 4/4] seccomp: add support for passing fds via USER_NOTIF

2018-06-22 Thread Andy Lutomirski



> On Jun 22, 2018, at 9:23 AM, Jann Horn wrote:
> 
>> On Fri, Jun 22, 2018 at 12:05 AM Tycho Andersen wrote:
>> 
>> The idea here is that the userspace handler should be able to pass an fd
>> back to the trapped task, for example so it can be returned from socket().
>> 
>> I've proposed one API here, but I'm open to other options. In particular,
>> this only lets you return an fd from a syscall, which may not be enough in
>> all cases. For example, if an fd is written to an output parameter instead
>> of returned, the current API can't handle this. Another case is that
>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
>> ever decides to install an fd and output it, we wouldn't be able to handle
>> this either.
>> 
>> Still, the vast majority of interesting cases are covered by this API, so
>> perhaps it is Enough.
>> 
>> I've left it as a separate commit for two reasons:
>>  * It illustrates the way in which we would grow struct seccomp_notif and
>>    struct seccomp_notif_resp without using netlink
>>  * It shows just how little code is needed to accomplish this :)
>> 
> [...]
>> @@ -1669,10 +1706,20 @@ static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
>>goto out;
>>}
>> 
>> +   if (resp.return_fd) {
>> +   knotif->flags = resp.fd_flags;
>> +   knotif->file = fget(resp.fd);
>> +   if (!knotif->file) {
>> +   ret = -EBADF;
>> +   goto out;
>> +   }
>> +   }
>> +
> 
> I think this is a security bug. Imagine the following scenario:
> 
> - attacker creates processes A and B
> - process A installs a seccomp filter and sends the notification fd
> to process B
> - process A starts a syscall for which the filter returns
> SECCOMP_RET_USER_NOTIF
> - process B reads the notification from the notification fd
> - process B uses dup2() to copy the notification fd to file
> descriptor 1 (stdout)
> - process B executes a setuid root binary
> - the setuid root binary opens some privileged file descriptor
> (something like open("/etc/shadow", O_RDWR))
> - the setuid root binary tries to write some attacker-controlled data to 
> stdout
> - seccomp_notify_write() interprets the start of the written data as
> a struct seccomp_notif_resp
> - seccomp_notify_write() grabs the privileged file descriptor and
> installs a copy in process A
> - process A now has access to the privileged file (e.g. /etc/shadow)
> 
> It isn't clear whether it would actually be exploitable - you'd need a
> setuid binary that performs the right actions - but it's still bad.

Jann is right. ->read and ->write must not reference any of the calling task’s 
state except the literal memory passed in.

> 
> Unless I'm missing something, can you please turn the ->read and
> ->write handlers into an ->unlocked_ioctl handler? Something like
> this:
> 
> struct seccomp_user_notif_args {
>         u64 buf;
>         u64 size;
> };
> 
> static long unlocked_ioctl(struct file *file, unsigned int cmd,
>                            unsigned long arg)
> {
>         struct seccomp_user_notif_args args;
>         struct seccomp_user_notif_args __user *uargs = (void __user *)arg;
> 
>         if (cmd != SECCOMP_USER_NOTIF_READ && cmd != SECCOMP_USER_NOTIF_WRITE)
>                 return -EINVAL;
> 
>         /* Fetch the (buf, size) pair the caller passed explicitly. */
>         if (copy_from_user(&args, uargs, sizeof(args)))
>                 return -EFAULT;
> 
>         switch (cmd) {
>         case SECCOMP_USER_NOTIF_READ:
>                 return seccomp_notify_read(file, (char __user *)args.buf,
>                                            (size_t)args.size);
>         case SECCOMP_USER_NOTIF_WRITE:
>                 return seccomp_notify_write(file, (char __user *)args.buf,
>                                             (size_t)args.size);
>         default:
>                 return -EINVAL;
>         }
> }


Re: [PATCH v4 4/4] seccomp: add support for passing fds via USER_NOTIF

2018-06-22 Thread Jann Horn
On Fri, Jun 22, 2018 at 12:05 AM Tycho Andersen wrote:
>
> The idea here is that the userspace handler should be able to pass an fd
> back to the trapped task, for example so it can be returned from socket().
>
> I've proposed one API here, but I'm open to other options. In particular,
> this only lets you return an fd from a syscall, which may not be enough in
> all cases. For example, if an fd is written to an output parameter instead
> of returned, the current API can't handle this. Another case is that
> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> ever decides to install an fd and output it, we wouldn't be able to handle
> this either.
>
> Still, the vast majority of interesting cases are covered by this API, so
> perhaps it is Enough.
>
> I've left it as a separate commit for two reasons:
>   * It illustrates the way in which we would grow struct seccomp_notif and
> struct seccomp_notif_resp without using netlink
>   * It shows just how little code is needed to accomplish this :)
>
[...]
> @@ -1669,10 +1706,20 @@ static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> goto out;
> }
>
> +   if (resp.return_fd) {
> +   knotif->flags = resp.fd_flags;
> +   knotif->file = fget(resp.fd);
> +   if (!knotif->file) {
> +   ret = -EBADF;
> +   goto out;
> +   }
> +   }
> +

I think this is a security bug. Imagine the following scenario:

 - attacker creates processes A and B
 - process A installs a seccomp filter and sends the notification fd
   to process B
 - process A starts a syscall for which the filter returns
   SECCOMP_RET_USER_NOTIF
 - process B reads the notification from the notification fd
 - process B uses dup2() to copy the notification fd to file
   descriptor 1 (stdout)
 - process B executes a setuid root binary
 - the setuid root binary opens some privileged file descriptor
   (something like open("/etc/shadow", O_RDWR))
 - the setuid root binary tries to write some attacker-controlled data
   to stdout
 - seccomp_notify_write() interprets the start of the written data as
   a struct seccomp_notif_resp
 - seccomp_notify_write() grabs the privileged file descriptor and
   installs a copy in process A
 - process A now has access to the privileged file (e.g. /etc/shadow)

It isn't clear whether it would actually be exploitable - you'd need a
setuid binary that performs the right actions - but it's still bad.
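
The dup2() step in the scenario above works because a program that writes to
fd 1 has no say over which file fd 1 refers to. A minimal userspace sketch of
just that step follows; the helper name write_through_replaced_stdout() is
made up for illustration, and an ordinary pipe stands in for the notification
fd.

```c
#include <string.h>
#include <unistd.h>

/*
 * Make `target_fd` the new stdout, perform a "victim" write to fd 1, then
 * restore the original stdout. Returns the number of bytes written, or -1.
 * (Hypothetical helper, for illustration only.)
 */
ssize_t write_through_replaced_stdout(int target_fd, const char *msg)
{
	int saved_stdout = dup(STDOUT_FILENO);
	ssize_t written;

	if (saved_stdout < 0)
		return -1;
	if (dup2(target_fd, STDOUT_FILENO) < 0) {
		close(saved_stdout);
		return -1;
	}

	/* The "victim": it only knows that it is writing to stdout. */
	written = write(STDOUT_FILENO, msg, strlen(msg));

	/* Put the real stdout back. */
	dup2(saved_stdout, STDOUT_FILENO);
	close(saved_stdout);
	return written;
}
```

Whatever sits behind fd 1 receives the bytes. If its ->write handler then
looks up fds in the writer's table, the writer's privileges leak to whoever
set up fd 1.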

Unless I'm missing something, can you please turn the ->read and
->write handlers into an ->unlocked_ioctl handler? Something like
this:

struct seccomp_user_notif_args {
        u64 buf;
        u64 size;
};

static long unlocked_ioctl(struct file *file, unsigned int cmd,
                           unsigned long arg)
{
        struct seccomp_user_notif_args args;
        struct seccomp_user_notif_args __user *uargs = (void __user *)arg;

        if (cmd != SECCOMP_USER_NOTIF_READ && cmd != SECCOMP_USER_NOTIF_WRITE)
                return -EINVAL;

        /* Fetch the (buf, size) pair the caller passed explicitly. */
        if (copy_from_user(&args, uargs, sizeof(args)))
                return -EFAULT;

        switch (cmd) {
        case SECCOMP_USER_NOTIF_READ:
                return seccomp_notify_read(file, (char __user *)args.buf,
                                           (size_t)args.size);
        case SECCOMP_USER_NOTIF_WRITE:
                return seccomp_notify_write(file, (char __user *)args.buf,
                                            (size_t)args.size);
        default:
                return -EINVAL;
        }
}
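
For the caller's side, a rough sketch of how the argument struct might be
filled in. The struct mirrors the one sketched above; make_notif_args() is a
made-up helper, and the command names SECCOMP_USER_NOTIF_READ/WRITE are only
the placeholders from this mail. Carrying the pointer as a u64 keeps the
layout the same for 32-bit and 64-bit callers.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the struct sketched above (hypothetical). */
struct seccomp_user_notif_args {
	uint64_t buf;
	uint64_t size;
};

/* Pack a buffer pointer and its length into the fixed-width struct. */
struct seccomp_user_notif_args make_notif_args(void *buf, size_t size)
{
	struct seccomp_user_notif_args args = {
		/* Go through uintptr_t so 32-bit pointers zero-extend. */
		.buf = (uint64_t)(uintptr_t)buf,
		.size = (uint64_t)size,
	};

	return args;
}
```

The result would then be passed by address as the third ioctl() argument,
assuming an interface shaped like the sketch.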


Re: [PATCH v4 4/4] seccomp: add support for passing fds via USER_NOTIF

2018-06-21 Thread Tycho Andersen
On Fri, Jun 22, 2018 at 01:34:18AM +0200, Jann Horn wrote:
> On Fri, Jun 22, 2018 at 12:05 AM Tycho Andersen wrote:
> >
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> [...]
> > +Userspace can also return file descriptors. For example, one may decide to
> > +intercept ``socket()`` syscalls, and return some file descriptor from those
> > +based on some policy. To return a file descriptor, the ``return_fd`` member
> > +should be non-zero, the ``fd`` argument should be the fd in the listener's
> > +table to send to the tracee (similar to how ``SCM_RIGHTS`` works), and
> > +``fd_flags`` should be the flags that the fd in the tracee's table is opened
> > +with (e.g. ``O_EXCL`` or similar).
> 
> fd_flags only contains file descriptor flags (meaning only O_CLOEXEC).
> O_EXCL is a file creation flag, so setting it here wouldn't make sense.
> Setting file status flags like O_APPEND does make sense, but those are
> stored in the `struct file` and don't need to be passed separately;
> the caller can e.g. set them via fcntl(fd, F_SETFL, flags) or on
> open().
> (The fcntl.2 manpage explains these.)

Ugh, yes, O_CLOEXEC is what I meant. Thanks, I'll clarify.

Tycho


Re: [PATCH v4 4/4] seccomp: add support for passing fds via USER_NOTIF

2018-06-21 Thread Jann Horn
On Fri, Jun 22, 2018 at 12:05 AM Tycho Andersen wrote:
>
> The idea here is that the userspace handler should be able to pass an fd
> back to the trapped task, for example so it can be returned from socket().
[...]
> +Userspace can also return file descriptors. For example, one may decide to
> +intercept ``socket()`` syscalls, and return some file descriptor from those
> +based on some policy. To return a file descriptor, the ``return_fd`` member
> +should be non-zero, the ``fd`` argument should be the fd in the listener's
> +table to send to the tracee (similar to how ``SCM_RIGHTS`` works), and
> +``fd_flags`` should be the flags that the fd in the tracee's table is opened
> +with (e.g. ``O_EXCL`` or similar).

fd_flags only contains file descriptor flags (meaning only O_CLOEXEC).
O_EXCL is a file creation flag, so setting it here wouldn't make sense.
Setting file status flags like O_APPEND does make sense, but those are
stored in the `struct file` and don't need to be passed separately;
the caller can e.g. set them via fcntl(fd, F_SETFL, flags) or on
open().
(The fcntl.2 manpage explains these.)
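
A small userspace sketch of the distinction, using only standard fcntl()
commands; the function name flags_split_demo() is made up for illustration.
A descriptor flag (FD_CLOEXEC) belongs to one slot in the fd table, while a
status flag (O_NONBLOCK here) belongs to the open file description and is
therefore seen through every fd that refers to it.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Set FD_CLOEXEC and O_NONBLOCK on `fd`, then inspect a dup() of it.
 * Returns 1 if the copy shares O_NONBLOCK but not FD_CLOEXEC,
 * 0 if not, and -1 on error.
 */
int flags_split_demo(int fd)
{
	int copy, fl, copy_fd_flags, copy_status;

	/* Descriptor flag: lives in this one fd-table slot. */
	if (fcntl(fd, F_SETFD, FD_CLOEXEC) < 0)
		return -1;

	/* Status flag: lives in the open file description (struct file). */
	fl = fcntl(fd, F_GETFL);
	if (fl < 0 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0)
		return -1;

	copy = dup(fd);
	if (copy < 0)
		return -1;

	copy_fd_flags = fcntl(copy, F_GETFD);
	copy_status = fcntl(copy, F_GETFL);
	close(copy);
	if (copy_fd_flags < 0 || copy_status < 0)
		return -1;

	return (copy_status & O_NONBLOCK) && !(copy_fd_flags & FD_CLOEXEC);
}
```

This is why only the descriptor flags need to travel alongside the fd in the
reply: everything else already comes with the struct file.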


[PATCH v4 4/4] seccomp: add support for passing fds via USER_NOTIF

2018-06-21 Thread Tycho Andersen
The idea here is that the userspace handler should be able to pass an fd
back to the trapped task, for example so it can be returned from socket().
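
For comparison, the existing way to hand an fd to another task from userspace
is SCM_RIGHTS over a Unix socket, which the documentation change below uses
as its analogy. A minimal sketch of that mechanism follows; the helper names
send_fd() and recv_fd() are made up for illustration and are not part of this
patch.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one fd. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send `fd` over the Unix socket `sock`; returns 0 or -1. */
int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from `sock`; returns the new fd number or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new fd number in its own table that refers to the same
open file, which is the effect the reply path in this patch aims for in the
trapped task.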

I've proposed one API here, but I'm open to other options. In particular,
this only lets you return an fd from a syscall, which may not be enough in
all cases. For example, if an fd is written to an output parameter instead
of returned, the current API can't handle this. Another case is that
netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
ever decides to install an fd and output it, we wouldn't be able to handle
this either.

Still, the vast majority of interesting cases are covered by this API, so
perhaps it is Enough.

I've left it as a separate commit for two reasons:
  * It illustrates the way in which we would grow struct seccomp_notif and
    struct seccomp_notif_resp without using netlink
  * It shows just how little code is needed to accomplish this :)

v2: new in v2
v3: no changes
v4: * pass fd flags back from userspace as well (Jann)
    * update same cgroup data on fd pass as SCM_RIGHTS (Alban)
    * only set the REPLIED state /after/ successful fdget (Alban)
    * reflect GET_LISTENER -> NEW_LISTENER changes
    * add to the new Documentation/ on user notifications about fd replies

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Andy Lutomirski 
CC: Oleg Nesterov 
CC: Eric W. Biederman 
CC: "Serge E. Hallyn" 
CC: Christian Brauner 
CC: Tyler Hicks 
CC: Akihiro Suda 
---
 .../userspace-api/seccomp_filter.rst  |  11 ++
 include/uapi/linux/seccomp.h  |   3 +
 kernel/seccomp.c  |  51 +++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 114 ++
 4 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index e51422559dd6..3db93df254fb 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -232,6 +232,9 @@ The interface for a seccomp notification fd consists of two structures:
 __u64 id;
 __s32 error;
 __s64 val;
+__u8 return_fd;
+__u32 fd;
+__u32 fd_flags;
 };
 
 Users can ``read()`` or ``poll()`` on a seccomp notification fd to receive a
@@ -251,6 +254,14 @@ mentioned above in this document: all arguments being read from the tracee's
 memory should be read into the tracer's memory before any policy decisions are
 made. This allows for an atomic decision on syscall arguments.
 
+Userspace can also return file descriptors. For example, one may decide to
+intercept ``socket()`` syscalls, and return some file descriptor from those
+based on some policy. To return a file descriptor, the ``return_fd`` member
+should be non-zero, the ``fd`` argument should be the fd in the listener's
+table to send to the tracee (similar to how ``SCM_RIGHTS`` works), and
+``fd_flags`` should be the flags that the fd in the tracee's table is opened
+with (e.g. ``O_EXCL`` or similar).
+
 Sysctls
 =======
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 8836a3b25500..ed2a475e0fe6 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -72,6 +72,9 @@ struct seccomp_notif_resp {
__u64 id;
__s32 error;
__s64 val;
+   __u8 return_fd;
+   __u32 fd;
+   __u32 fd_flags;
 };
 
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b68a5d4a15cd..abd6e8c7e64e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -43,6 +43,7 @@
 
 #ifdef CONFIG_SECCOMP_USER_NOTIFICATION
 #include 
+#include 
 
 enum notify_state {
SECCOMP_NOTIFY_INIT,
@@ -77,6 +78,8 @@ struct seccomp_knotif {
/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
int error;
long val;
+   struct file *file;
+   unsigned int flags;
 
/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
struct completion ready;
@@ -796,10 +799,44 @@ static void seccomp_do_user_notification(int this_syscall,
goto remove_list;
}
 
-   ret = n.val;
-   err = n.error;
+   if (n.file) {
+   int fd;
+   struct socket *sock;
+
+   fd = get_unused_fd_flags(n.flags);
+   if (fd < 0) {
+   err = fd;
+   ret = -1;
+   goto remove_list;
+   }
+
+   /*
+* Similar to what SCM_RIGHTS does, let's re-set the cgroup
+* data to point to the tracee's cgroups instead of the
+* listener's.
+*/
+   sock = sock_from_file(n.file, &err);
+   if (sock) {
+   sock_update_netprioidx(&sock->sk->sk_cgrp_data);
+   sock_update_classid(&sock->sk->sk_cgrp_data);
+   }
