Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Jann Horn
On Mon, Nov 2, 2020 at 8:50 PM Sargun Dhillon  wrote:
> On Mon, Nov 2, 2020 at 11:45 AM Michael Kerrisk (man-pages)
>  wrote:
> >Caveats regarding blocking system calls
> >Suppose that the target performs a blocking system call (e.g.,
> >accept(2)) that the supervisor should handle.  The supervisor
> >might then in turn execute the same blocking system call.
> >
> >In this scenario, it is important to note that if the target's
> >system call is now interrupted by a signal, the supervisor is not
> >informed of this.  If the supervisor does not take suitable steps
> >to actively discover that the target's system call has been
> >canceled, various difficulties can occur.  Taking the example of
> >accept(2), the supervisor might remain blocked in its accept(2)
> >holding a port number that the target (which, after the
> >interruption by the signal handler, perhaps closed  its listening
> >socket) might expect to be able to reuse in a bind(2) call.
> >
> >Therefore, when the supervisor wishes to emulate a blocking system
> >call, it must do so in such a way that it gets informed if the
> >target's system call is interrupted by a signal handler.  For
> >example, if the supervisor itself executes the same blocking
> >system call, then it could employ a separate thread that uses the
> >SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
> >still blocked in its system call.  Alternatively, in the accept(2)
> >example, the supervisor might use poll(2) to monitor both the
> >notification file descriptor (so as as to discover when the
> >target's accept(2) call has been interrupted) and the listening
> >file descriptor (so as to know when a connection is available).
> >
> >If the target's system call is interrupted, the supervisor must
> >take care to release resources (e.g., file descriptors) that it
> >acquired on behalf of the target.
> >
> > Does that seem okay?
> >
> This is far clearer than my explanation. The one thing is that *just*
> poll is not good enough, you would poll, with some timeout, and when
> that timeout is hit, check if all the current notifications are valid,
> as poll isn't woken up when an in progress notification goes off
> AFAIK.

Arguably that's so terrible that it qualifies for being in the BUGS
section of the manpage.

If you want this to be fixed properly, I recommend that someone
implements my proposal from
,
unless you can come up with something better.


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Sargun Dhillon
On Mon, Nov 2, 2020 at 11:45 AM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Sargun,
>
> Thanks for your reply!
>
> On 11/2/20 9:07 AM, Sargun Dhillon wrote:
> > On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
> >  wrote:
> >>
> >> Hello Sargun,
> >>
> >> Thanks for your reply.
> >>
> >> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> >>> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> >>> wrote:
> >>
> >> [...]
> >>
> > I think I commented in another thread somewhere that the
> > supervisor is not notified if the syscall is preempted. Therefore
> > if it is performing a preemptible, long-running syscall, you need
> > to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
> > you can end up in a bad situation -- like leaking resources, or
> > holding on to file descriptors after the program under
> > supervision has intended to release them.
> 
>  It's been a long day, and I'm not sure I reallu understand this.
>  Could you outline the scnario in more detail?
> 
> >>> S: Sets up filter + interception for accept T: socket(AF_INET,
> >>> SOCK_STREAM, 0) = 7 T: bind(7, {127.0.0.1, }, ..) T: listen(7,
> >>> 10) T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
> >>
> >> Presumably, the preceding line should have been:
> >>
> >> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
> >> (s/T:/S:/)
> >>
> >> right?
> >
> > Right.
> >>
> >>
> >>> T: accept(7, ...) S: Intercepts accept S: Does accept in background
> >>> T: Receives signal, and accept(...) responds in EINTR T: close(7) S:
> >>> Still running accept(7, ), holding port , so if now T
> >>> retries to bind to port , things fail.
> >>
> >> Okay -- I understand. Presumably the solution here is not to
> >> block in accept(), but rather to use poll() to monitor both the
> >> notification FD and the listening socket FD?
> >>
> > You need to have some kind of mechanism to periodically check
> > if the notification is still alive, and preempt the accept. It doesn't
> > matter how exactly you "background" the accept (threads, or
> > O_NONBLOCK + epoll).
> >
> > The thing is you need to make sure that when the process
> > cancels a syscall, you need to release the resources you
> > may have acquired on its behalf or bad things can happen.
> >
>
> Got it. I added the following text:
>
>Caveats regarding blocking system calls
>Suppose that the target performs a blocking system call (e.g.,
>accept(2)) that the supervisor should handle.  The supervisor
>might then in turn execute the same blocking system call.
>
>In this scenario, it is important to note that if the target's
>system call is now interrupted by a signal, the supervisor is not
>informed of this.  If the supervisor does not take suitable steps
>to actively discover that the target's system call has been
>canceled, various difficulties can occur.  Taking the example of
>accept(2), the supervisor might remain blocked in its accept(2)
>holding a port number that the target (which, after the
>interruption by the signal handler, perhaps closed  its listening
>socket) might expect to be able to reuse in a bind(2) call.
>
>Therefore, when the supervisor wishes to emulate a blocking system
>call, it must do so in such a way that it gets informed if the
>target's system call is interrupted by a signal handler.  For
>example, if the supervisor itself executes the same blocking
>system call, then it could employ a separate thread that uses the
>SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
>still blocked in its system call.  Alternatively, in the accept(2)
>example, the supervisor might use poll(2) to monitor both the
>notification file descriptor (so as as to discover when the
>target's accept(2) call has been interrupted) and the listening
>file descriptor (so as to know when a connection is available).
>
>If the target's system call is interrupted, the supervisor must
>take care to release resources (e.g., file descriptors) that it
>acquired on behalf of the target.
>
> Does that seem okay?
>
This is far clearer than my explanation. The one thing is that *just*
poll is not good enough, you would poll, with some timeout, and when
that timeout is hit, check if all the current notifications are valid,
as poll isn't woken up when an in progress notification goes off
AFAIK.

> > ENOENT The cookie number is not valid. This can happen if a
> > response has already been sent, or if the syscall was
> > interrupted
> >
> > EBADF If the file descriptor specified in srcfd is invalid, or if
> > the fd is out of range of the destination program.
> 
>  The piece "or if the fd is out of range of the destination program"
>  is not clear to me. Can you say 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Michael Kerrisk (man-pages)
Hello Sargun,

Thanks for your reply!

On 11/2/20 9:07 AM, Sargun Dhillon wrote:
> On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
>  wrote:
>>
>> Hello Sargun,
>>
>> Thanks for your reply.
>>
>> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
>>> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
>>> wrote:
>>
>> [...]
>>
> I think I commented in another thread somewhere that the
> supervisor is not notified if the syscall is preempted. Therefore
> if it is performing a preemptible, long-running syscall, you need
> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
> you can end up in a bad situation -- like leaking resources, or
> holding on to file descriptors after the program under
> supervision has intended to release them.

 It's been a long day, and I'm not sure I reallu understand this.
 Could you outline the scnario in more detail?

>>> S: Sets up filter + interception for accept T: socket(AF_INET,
>>> SOCK_STREAM, 0) = 7 T: bind(7, {127.0.0.1, }, ..) T: listen(7,
>>> 10) T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>>
>> Presumably, the preceding line should have been:
>>
>> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>> (s/T:/S:/)
>>
>> right?
> 
> Right.
>>
>>
>>> T: accept(7, ...) S: Intercepts accept S: Does accept in background
>>> T: Receives signal, and accept(...) responds in EINTR T: close(7) S:
>>> Still running accept(7, ), holding port , so if now T
>>> retries to bind to port , things fail.
>>
>> Okay -- I understand. Presumably the solution here is not to
>> block in accept(), but rather to use poll() to monitor both the
>> notification FD and the listening socket FD?
>>
> You need to have some kind of mechanism to periodically check
> if the notification is still alive, and preempt the accept. It doesn't
> matter how exactly you "background" the accept (threads, or
> O_NONBLOCK + epoll).
> 
> The thing is you need to make sure that when the process
> cancels a syscall, you need to release the resources you
> may have acquired on its behalf or bad things can happen.
> 

Got it. I added the following text:

   Caveats regarding blocking system calls
   Suppose that the target performs a blocking system call (e.g.,
   accept(2)) that the supervisor should handle.  The supervisor
   might then in turn execute the same blocking system call.

   In this scenario, it is important to note that if the target's
   system call is now interrupted by a signal, the supervisor is not
   informed of this.  If the supervisor does not take suitable steps
   to actively discover that the target's system call has been
   canceled, various difficulties can occur.  Taking the example of
   accept(2), the supervisor might remain blocked in its accept(2)
   holding a port number that the target (which, after the
   interruption by the signal handler, perhaps closed  its listening
   socket) might expect to be able to reuse in a bind(2) call.

   Therefore, when the supervisor wishes to emulate a blocking system
   call, it must do so in such a way that it gets informed if the
   target's system call is interrupted by a signal handler.  For
   example, if the supervisor itself executes the same blocking
   system call, then it could employ a separate thread that uses the
   SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
   still blocked in its system call.  Alternatively, in the accept(2)
   example, the supervisor might use poll(2) to monitor both the
   notification file descriptor (so as as to discover when the
   target's accept(2) call has been interrupted) and the listening
   file descriptor (so as to know when a connection is available).

   If the target's system call is interrupted, the supervisor must
   take care to release resources (e.g., file descriptors) that it
   acquired on behalf of the target.

Does that seem okay?

> ENOENT The cookie number is not valid. This can happen if a
> response has already been sent, or if the syscall was
> interrupted
>
> EBADF If the file descriptor specified in srcfd is invalid, or if
> the fd is out of range of the destination program.

 The piece "or if the fd is out of range of the destination program"
 is not clear to me. Can you say some more please.

>>>
>>> IIRC the maximum fd range is specific in proc by some sysctl named
>>> nr_open. It's also evaluated against RLIMITs, and nr_max.
>>>
>>> If nr-open (maximum fds open per process, iiirc) is 1000, even if 10
>>> FDs are open, it wont work if newfd is 1001.
>>
>> Actually, the relevant limit seems to be just the RLIMIT_NOFILE
>> resource limit at least in my reading of fs/file.c::replace_fd().
>> So I made the text
>>
>>   EBADF  Allocating the file descriptor in the target would
>>  caus

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Jann Horn
On Sat, Oct 31, 2020 at 9:51 AM Michael Kerrisk (man-pages)
 wrote:
> On 10/30/20 8:20 PM, Jann Horn wrote:
> > On Thu, Oct 29, 2020 at 8:14 PM Michael Kerrisk (man-pages)
> >  wrote:
> >> On 10/29/20 2:42 AM, Jann Horn wrote:
> >>> As discussed at
> >>> ,
> >>> we need to re-check checkNotificationIdIsValid() after reading remote
> >>> memory but before using the read value in any way. Otherwise, the
> >>> syscall could in the meantime get interrupted by a signal handler, the
> >>> signal handler could return, and then the function that performed the
> >>> syscall could free() allocations or return (thereby freeing buffers on
> >>> the stack).
> >>>
> >>> In essence, this pread() is (unavoidably) a potential use-after-free
> >>> read; and to make that not have any security impact, we need to check
> >>> whether UAF read occurred before using the read value. This should
> >>> probably be called out elsewhere in the manpage, too...
> >>>
> >>> Now, of course, **reading** is the easy case. The difficult case is if
> >>> we have to **write** to the remote process... because then we can't
> >>> play games like that. If we write data to a freed pointer, we're
> >>> screwed, that's it. (And for somewhat unrelated bonus fun, consider
> >>> that /proc/$pid/mem is originally intended for process debugging,
> >>> including installing breakpoints, and will therefore happily write
> >>> over "readonly" private mappings, such as typical mappings of
> >>> executable code.)
> >>>
> >>> So, h... I guess if anyone wants to actually write memory back to
> >>> the target process, we'd better come up with some dedicated API for
> >>> that, using an ioctl on the seccomp fd that magically freezes the
> >>> target process inside the syscall while writing to its memory, or
> >>> something like that? And until then, the manpage should have a big fat
> >>> warning that writing to the target's memory is simply not possible
> >>> (safely).
> >>
> >> Thank you for your very clear explanation! It turned out to be
> >> trivially easy to demonstrate this issue with a slightly modified
> >> version of my program.
> >>
> >> As well as the change to the code example that I already mentioned
> >> my reply of a few hours ago, I've added the following text to the
> >> page:
> >>
> >>Caveats regarding the use of /proc/[tid]/mem
> >>The discussion above noted the need to use the
> >>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
> >>/proc/[tid]/mem file of the target to avoid the possibility of
> >>accessing the memory of the wrong process in the event that the
> >>target terminates and its ID is recycled by another (unrelated)
> >>thread.  However, the use of this ioctl(2) operation is also
> >>necessary in other situations, as explained in the following
> >>pargraphs.
> >
> > (nit: paragraphs)
>
> I spotted that one also already. But thanks for reading carefully!
>
> >>Consider the following scenario, where the supervisor tries to
> >>read the pathname argument of a target's blocked mount(2) system
> >>call:
> > [...]
> >> Seem okay?
> >
> > Yeah, sounds good.
> >
> >> By the way, is there any analogous kind of issue concerning
> >> pidfd_getfd()? I'm thinking not, but I wonder if I've missed
> >> something.
> >
> > When it is used by a seccomp supervisor, you mean? I think basically
> > the same thing applies - when resource identifiers (such as memory
> > addresses or file descriptors) are passed to a syscall, it generally
> > has to be assumed that those identifiers may become invalid and be
> > reused as soon as the syscall has returned.
>
> I probably needed to be more explicit. Would the following (i.e., a
> single cookie check) not be sufficient to handle the above scenario.
> Here, the target is making a syscall a system call that employs the
> file descriptor 'tfd':
>
> T: makes syscall that triggers notification
> S: Get notification
> S: pidfd = pidfd_open(T, 0);
> S: sfd = pifd_getfd(pidfd, tfd, 0)
> S: check that the cookie is still valid
> S: do operation with sfd [*]
>
> By contrast, I can see that we might want to do multiple cookie
> checks in the /proc/PID/mem case, since the supervisor might do
> multiple reads.

Aaah, okay. I didn't really understand the question at first.

> Or, do you mean: there really needs to be another cookie check after
> the point [*], since, if the the target's syscall was interrupted
> and 'tfd' was closed/resused, then the supervisor would be operating
> with a file descriptor that refers to an open file description
> (a "struct file") that is no longer meaningful in the target?
> (Thinking about it, I think this probably is what you mean, but
> I want to confirm.)

I wasn't thinking about your actual question when I wrote that. :P

I think you could argue that leaving out the first cookie check doe

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Jann Horn
On Sat, Oct 31, 2020 at 9:31 AM Michael Kerrisk (man-pages)
 wrote:
> On 10/30/20 8:14 PM, Jann Horn wrote:
> > With the caveat that a cancelled syscall
> > could've also led to the memory being munmap()ed, so the nread==0 case
> > could also happen legitimately - so you might want to move this check
> > up above the nread==0 (mm went away) and nread==-1 (mm still exists,
> > but read from address failed, errno EIO) checks if the error message
> > shouldn't appear spuriously.
>
> In any case, I've been refactoring (simplifying) that code a little.
> I haven't so far rearranged the order of the checks, but I already
> log message for the nread==0 case. (Instead, there will eventually
> be an error when the response is sent.)
>
> I also haven't exactly tested the scenario you describe in the
> seccomp unotify scenario, but I think the above is not correct. Here
> are two scenarios I did test, simply with mmap() and /proc/PID/mem
> (no seccomp involved):
>
> Scenario 1:
> A creates a mapping at address X
> B opens /proc/A/mem and and lseeks on resulting FD to offset X
> A terminates
> B reads from FD ==> read() returns 0 (EOF)
>
> Scenario 2:
> A creates a mapping at address X
> B opens /proc/A/mem and and lseeks on resulting FD to offset X
> A unmaps mapping at address X
> B reads from FD ==> read() returns -1 / EIO.
>
> That last scenario seems to contradict what you say, since I
> think you meant that in this case read() should return 0 in
> that case. Have I misunderstood you?

Sorry, I messed up the description when I wrote that. Yes, this looks
as expected - EIO if the VMA is gone, 0 if the mm_users of the
mm_struct have dropped to zero because all tasks that use the mm have
exited.


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Sargun Dhillon
On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Sargun,
>
> Thanks for your reply.
>
> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> > On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> > wrote:
>
> [...]
>
> >>> I think I commented in another thread somewhere that the
> >>> supervisor is not notified if the syscall is preempted. Therefore
> >>> if it is performing a preemptible, long-running syscall, you need
> >>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
> >>> you can end up in a bad situation -- like leaking resources, or
> >>> holding on to file descriptors after the program under
> >>> supervision has intended to release them.
> >>
> >> It's been a long day, and I'm not sure I reallu understand this.
> >> Could you outline the scnario in more detail?
> >>
> > S: Sets up filter + interception for accept T: socket(AF_INET,
> > SOCK_STREAM, 0) = 7 T: bind(7, {127.0.0.1, }, ..) T: listen(7,
> > 10) T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>
> Presumably, the preceding line should have been:
>
> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
> (s/T:/S:/)
>
> right?

Right.
>
>
> > T: accept(7, ...) S: Intercepts accept S: Does accept in background
> > T: Receives signal, and accept(...) responds in EINTR T: close(7) S:
> > Still running accept(7, ), holding port , so if now T
> > retries to bind to port , things fail.
>
> Okay -- I understand. Presumably the solution here is not to
> block in accept(), but rather to use poll() to monitor both the
> notification FD and the listening socket FD?
>
You need to have some kind of mechanism to periodically check
if the notification is still alive, and preempt the accept. It doesn't
matter how exactly you "background" the accept (threads, or
O_NONBLOCK + epoll).

The thing is you need to make sure that when the process
cancels a syscall, you need to release the resources you
may have acquired on its behalf or bad things can happen.

> >>> A very specific example is if you're performing an accept on
> >>> behalf of the program generating the notification, and the
> >>> program intends to reuse the port. You can get into all sorts of
> >>> awkward situations there.
> >>
> >> [...]
> >>
> > See above
>
> [...]
>
> >>> In addition, if it is a socket, it inherits the cgroup v1 classid
> >>> and netprioidx of the receiving process.
> >>>
> >>> The argument of this is as follows:
> >>>
> >>> struct seccomp_notif_addfd { __u64 id; __u32 flags; __u32 srcfd;
> >>> __u32 newfd; __u32 newfd_flags; };
> >>>
> >>> id This is the cookie value that was obtained using
> >>> SECCOMP_IOCTL_NOTIF_RECV.
> >>>
> >>> flags A bitmask that includes zero or more of the
> >>> SECCOMP_ADDFD_FLAG_* bits set
> >>>
> >>> SECCOMP_ADDFD_FLAG_SETFD - Use dup2 (or dup3?) like semantics
> >>> when copying the file descriptor.
> >>>
> >>> srcfd The file descriptor number to copy in the supervisor
> >>> process.
> >>>
> >>> newfd If the SECCOMP_ADDFD_FLAG_SETFD flag is specified this will
> >>> be the file descriptor that is used in the dup2 semantics. If
> >>> this file descriptor exists in the receiving process, it is
> >>> closed and replaced by this file descriptor in an atomic fashion.
> >>> If the copy process fails due to a MAC failure, or if srcfd is
> >>> invalid, the newfd will not be closed in the receiving process.
> >>
> >> Great description!
> >>
> >>> If SECCOMP_ADDFD_FLAG_SETFD it not set, then this value must be
> >>> 0.
> >>>
> >>> newfd_flags The file descriptor flags to set on the file
> >>> descriptor after it has been received by the process. The only
> >>> flag that can currently be specified is O_CLOEXEC.
> >>>
> >>> On success, this operation returns the file descriptor number in
> >>> the receiving process. On failure, -1 is returned.
> >>>
> >>> It can fail with the following error codes:
> >>>
> >>> EINPROGRESS The cookie number specified hasn't been received by
> >>> the listener
> >>
> >> I don't understand this. Can you say more about the scenario?
> >>
> >
> > This should not really happen. But if you do a ADDFD(...), on a
> > notification *before* you've received it, you will get this error. So
> > for example,
> > --> epoll() -> returns
> > --> RECV(...) cookie id is 777
> > --> epoll(...) -> returns
> > <-- ioctl(ADDFD, id = 778) # Notice how we haven't done a receive yet
> > where we've received a notification for 778.
>
> Got it. Looking also at the source code, I came up with the
> following:
>
>   EINPROGRESS
>  The user-space notification specified in the id
>  field exists but has not yet been fetched (by a
>  SECCOMP_IOCTL_NOTIF_RECV) or has already been
>  responded to (by a SECCOMP_IOCTL_NOTIF_SEND).
>
> Does that seem okay?
>
Looks good to me.

> >>> ENOENT The cookie number is not valid. This can happen if a
> >>> response has already been sent, or if

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-31 Thread Michael Kerrisk (man-pages)
Hello Sargun,

Thanks for your reply.

On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> wrote:

[...]

>>> I think I commented in another thread somewhere that the
>>> supervisor is not notified if the syscall is preempted. Therefore
>>> if it is performing a preemptible, long-running syscall, you need
>>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
>>> you can end up in a bad situation -- like leaking resources, or
>>> holding on to file descriptors after the program under
>>> supervision has intended to release them.
>> 
>> It's been a long day, and I'm not sure I reallu understand this. 
>> Could you outline the scnario in more detail?
>> 
> S: Sets up filter + interception for accept T: socket(AF_INET,
> SOCK_STREAM, 0) = 7 T: bind(7, {127.0.0.1, }, ..) T: listen(7,
> 10) T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.

Presumably, the preceding line should have been:

S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
(s/T:/S:/)

right?

> T: accept(7, ...) S: Intercepts accept S: Does accept in background 
> T: Receives signal, and accept(...) responds in EINTR T: close(7) S:
> Still running accept(7, ), holding port , so if now T
> retries to bind to port , things fail.

Okay -- I understand. Presumably the solution here is not to 
block in accept(), but rather to use poll() to monitor both the
notification FD and the listening socket FD?

>>> A very specific example is if you're performing an accept on
>>> behalf of the program generating the notification, and the
>>> program intends to reuse the port. You can get into all sorts of
>>> awkward situations there.
>> 
>> [...]
>> 
> See above

[...]

>>> In addition, if it is a socket, it inherits the cgroup v1 classid
>>> and netprioidx of the receiving process.
>>> 
>>> The argument of this is as follows:
>>> 
>>> struct seccomp_notif_addfd { __u64 id; __u32 flags; __u32 srcfd; 
>>> __u32 newfd; __u32 newfd_flags; };
>>> 
>>> id This is the cookie value that was obtained using 
>>> SECCOMP_IOCTL_NOTIF_RECV.
>>> 
>>> flags A bitmask that includes zero or more of the 
>>> SECCOMP_ADDFD_FLAG_* bits set
>>> 
>>> SECCOMP_ADDFD_FLAG_SETFD - Use dup2 (or dup3?) like semantics
>>> when copying the file descriptor.
>>> 
>>> srcfd The file descriptor number to copy in the supervisor
>>> process.
>>> 
>>> newfd If the SECCOMP_ADDFD_FLAG_SETFD flag is specified this will
>>> be the file descriptor that is used in the dup2 semantics. If
>>> this file descriptor exists in the receiving process, it is
>>> closed and replaced by this file descriptor in an atomic fashion.
>>> If the copy process fails due to a MAC failure, or if srcfd is
>>> invalid, the newfd will not be closed in the receiving process.
>> 
>> Great description!
>> 
>>> If SECCOMP_ADDFD_FLAG_SETFD it not set, then this value must be
>>> 0.
>>> 
>>> newfd_flags The file descriptor flags to set on the file
>>> descriptor after it has been received by the process. The only
>>> flag that can currently be specified is O_CLOEXEC.
>>> 
>>> On success, this operation returns the file descriptor number in
>>> the receiving process. On failure, -1 is returned.
>>> 
>>> It can fail with the following error codes:
>>> 
>>> EINPROGRESS The cookie number specified hasn't been received by
>>> the listener
>> 
>> I don't understand this. Can you say more about the scenario?
>> 
> 
> This should not really happen. But if you do a ADDFD(...), on a
> notification *before* you've received it, you will get this error. So
> for example, 
> --> epoll() -> returns 
> --> RECV(...) cookie id is 777
> --> epoll(...) -> returns
> <-- ioctl(ADDFD, id = 778) # Notice how we haven't done a receive yet
> where we've received a notification for 778.

Got it. Looking also at the source code, I came up with the 
following:

  EINPROGRESS
 The user-space notification specified in the id
 field exists but has not yet been fetched (by a
 SECCOMP_IOCTL_NOTIF_RECV) or has already been
 responded to (by a SECCOMP_IOCTL_NOTIF_SEND).

Does that seem okay?

>>> ENOENT The cookie number is not valid. This can happen if a
>>> response has already been sent, or if the syscall was
>>> interrupted
>>> 
>>> EBADF If the file descriptor specified in srcfd is invalid, or if
>>> the fd is out of range of the destination program.
>> 
>> The piece "or if the fd is out of range of the destination program"
>> is not clear to me. Can you say some more please.
>> 
> 
> IIRC the maximum fd range is specific in proc by some sysctl named 
> nr_open. It's also evaluated against RLIMITs, and nr_max.
> 
> If nr-open (maximum fds open per process, iiirc) is 1000, even if 10
> FDs are open, it wont work if newfd is 1001.

Actually, the relevant limit seems to be just the RLIMIT_NOFILE
resource limit at least in my reading of fs/file.c::replace_fd().
So I

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-31 Thread Michael Kerrisk (man-pages)
On 10/30/20 8:20 PM, Jann Horn wrote:
> On Thu, Oct 29, 2020 at 8:14 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/29/20 2:42 AM, Jann Horn wrote:
>>> As discussed at
>>> ,
>>> we need to re-check checkNotificationIdIsValid() after reading remote
>>> memory but before using the read value in any way. Otherwise, the
>>> syscall could in the meantime get interrupted by a signal handler, the
>>> signal handler could return, and then the function that performed the
>>> syscall could free() allocations or return (thereby freeing buffers on
>>> the stack).
>>>
>>> In essence, this pread() is (unavoidably) a potential use-after-free
>>> read; and to make that not have any security impact, we need to check
>>> whether UAF read occurred before using the read value. This should
>>> probably be called out elsewhere in the manpage, too...
>>>
>>> Now, of course, **reading** is the easy case. The difficult case is if
>>> we have to **write** to the remote process... because then we can't
>>> play games like that. If we write data to a freed pointer, we're
>>> screwed, that's it. (And for somewhat unrelated bonus fun, consider
>>> that /proc/$pid/mem is originally intended for process debugging,
>>> including installing breakpoints, and will therefore happily write
>>> over "readonly" private mappings, such as typical mappings of
>>> executable code.)
>>>
>>> So, h... I guess if anyone wants to actually write memory back to
>>> the target process, we'd better come up with some dedicated API for
>>> that, using an ioctl on the seccomp fd that magically freezes the
>>> target process inside the syscall while writing to its memory, or
>>> something like that? And until then, the manpage should have a big fat
>>> warning that writing to the target's memory is simply not possible
>>> (safely).
>>
>> Thank you for your very clear explanation! It turned out to be
>> trivially easy to demonstrate this issue with a slightly modified
>> version of my program.
>>
>> As well as the change to the code example that I already mentioned
>> my reply of a few hours ago, I've added the following text to the
>> page:
>>
>>Caveats regarding the use of /proc/[tid]/mem
>>The discussion above noted the need to use the
>>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
>>/proc/[tid]/mem file of the target to avoid the possibility of
>>accessing the memory of the wrong process in the event that the
>>target terminates and its ID is recycled by another (unrelated)
>>thread.  However, the use of this ioctl(2) operation is also
>>necessary in other situations, as explained in the following
>>pargraphs.
> 
> (nit: paragraphs)

I spotted that one also already. But thanks for reading carefully!

>>Consider the following scenario, where the supervisor tries to
>>read the pathname argument of a target's blocked mount(2) system
>>call:
> [...]
>> Seem okay?
> 
> Yeah, sounds good.
> 
>> By the way, is there any analogous kind of issue concerning
>> pidfd_getfd()? I'm thinking not, but I wonder if I've missed
>> something.
> 
> When it is used by a seccomp supervisor, you mean? I think basically
> the same thing applies - when resource identifiers (such as memory
> addresses or file descriptors) are passed to a syscall, it generally
> has to be assumed that those identifiers may become invalid and be
> reused as soon as the syscall has returned.

I probably needed to be more explicit. Would the following (i.e., a
single cookie check) not be sufficient to handle the above scenario.
Here, the target is making a syscall a system call that employs the
file descriptor 'tfd':

T: makes syscall that triggers notification
S: Get notification
S: pidfd = pidfd_open(T, 0);
S: sfd = pifd_getfd(pidfd, tfd, 0)
S: check that the cookie is still valid
S: do operation with sfd [*]

By contrast, I can see that we might want to do multiple cookie
checks in the /proc/PID/mem case, since the supervisor might do
multiple reads.

Or, do you mean: there really needs to be another cookie check after
the point [*], since, if the the target's syscall was interrupted
and 'tfd' was closed/resused, then the supervisor would be operating
with a file descriptor that refers to an open file description
(a "struct file") that is no longer meaningful in the target?
(Thinking about it, I think this probably is what you mean, but 
I want to confirm.)

Thanks,

Michael
-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-31 Thread Michael Kerrisk (man-pages)
On 10/30/20 8:14 PM, Jann Horn wrote:
> On Thu, Oct 29, 2020 at 3:19 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/29/20 2:42 AM, Jann Horn wrote:
>>> On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
>>>  wrote:
static bool
getTargetPathname(struct seccomp_notif *req, int notifyFd,
  char *path, size_t len)
{
char procMemPath[PATH_MAX];

snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
 req->pid);

int procMemFd = open(procMemPath, O_RDONLY);
if (procMemFd == -1)
errExit("\tS: open");

/* Check that the process whose info we are accessing is still 
 alive.
   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
   in checkNotificationIdIsValid()) succeeds, we know that the
   /proc/PID/mem file descriptor that we opened corresponds to 
 the
   process for which we received a notification. If that process
   subsequently terminates, then read() on that file descriptor
   will return 0 (EOF). */

checkNotificationIdIsValid(notifyFd, req->id);

/* Read bytes at the location containing the pathname argument
   (i.e., the first argument) of the mkdir(2) call */

ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
if (nread == -1)
errExit("pread");
>>>
>>> As discussed at
>>> ,
>>> we need to re-check checkNotificationIdIsValid() after reading remote
>>> memory but before using the read value in any way. Otherwise, the
>>> syscall could in the meantime get interrupted by a signal handler, the
>>> signal handler could return, and then the function that performed the
>>> syscall could free() allocations or return (thereby freeing buffers on
>>> the stack).
>>>
>>> In essence, this pread() is (unavoidably) a potential use-after-free
>>> read; and to make that not have any security impact, we need to check
>>> whether UAF read occurred before using the read value. This should
>>> probably be called out elsewhere in the manpage, too...
>>
>> Thanks very much for pointing me at this!
>>
>> So, I want to conform that the fix to the code is as simple as
>> adding a check following the pread() call, something like:
>>
>> [[
>>  ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
>>  if (nread == -1)
>> errExit("Supervisor: pread");
>>
>>  if (nread == 0) {
>> fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>> "returned 0 (EOF)\n");
>> exit(EXIT_FAILURE);
>>  }
>>
>>  if (close(procMemFd) == -1)
>> errExit("Supervisor: close-/proc/PID/mem");
>>
>> +/* Once again check that the notification ID is still valid. The
>> +   case we are particularly concerned about here is that just
>> +   before we fetched the pathname, the target's blocked system
>> +   call was interrupted by a signal handler, and after the handler
>> +   returned, the target carried on execution (past the interrupted
>> +   system call). In that case, we have no guarantees about what we
>> +   are reading, since the target's memory may have been arbitrarily
>> +   changed by subsequent operations. */
>> +
>> +if (!notificationIdIsValid(notifyFd, req->id, "post-open"))
>> +return false;
>> +
>>  /* We have no guarantees about what was in the memory of the target
>> process. We therefore treat the buffer returned by pread() as
>> untrusted input. The buffer should be terminated by a null byte;
>> if not, then we will trigger an error for the target process. */
>>
>>  if (strnlen(path, nread) < nread)
>>  return true;
>> ]]
> 
> Yeah, that should do the job. 

Thanks.

> With the caveat that a cancelled syscall
> could've also led to the memory being munmap()ed, so the nread==0 case
> could also happen legitimately - so you might want to move this check
> up above the nread==0 (mm went away) and nread==-1 (mm still exists,
> but read from address failed, errno EIO) checks if the error message
> shouldn't appear spuriously.

In any case, I've been refactoring (simplifying) that code a little.
I haven't so far rearranged the order of the checks, but I already
log message for the nread==0 case. (Instead, there will eventually
be an error when the response is sent.)

I also haven't exactly tested the scenario you describe in the
seccomp unotify scenario, but I think the above is not correct. Here 
are two scenarios I did test, simply with mmap() and /proc/PID/mem
(no seccomp involved): 

Scenario 1:
A creates a mapping at address X
B opens /proc/A/mem and and lseeks on re

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Sargun Dhillon
On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Sargun,,
> 
> On 10/29/20 9:53 AM, Sargun Dhillon wrote:
> > On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> 
> [...]
> 
> >>ioctl(2) operations
> >>The following ioctl(2) operations are provided to support seccomp
> >>user-space notification.  For each of these operations, the first
> >>(file descriptor) argument of ioctl(2) is the listening file
> >>descriptor returned by a call to seccomp(2) with the
> >>SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
> >>
> >>SECCOMP_IOCTL_NOTIF_RECV
> >>   This operation is used to obtain a user-space notification
> >>   event.  If no such event is currently pending, the
> >>   operation blocks until an event occurs.  The third
> >>   ioctl(2) argument is a pointer to a structure of the
> >>   following form which contains information about the event.
> >>   This structure must be zeroed out before the call.
> >>
> >>   struct seccomp_notif {
> >>   __u64  id;  /* Cookie */
> >>   __u32  pid; /* TID of target thread */
> >>   __u32  flags;   /* Currently unused (0) */
> >>   struct seccomp_data data;   /* See seccomp(2) */
> >>   };
> >>
> >>   The fields in this structure are as follows:
> >>
> >>   id This is a cookie for the notification.  Each such
> >>  cookie is guaranteed to be unique for the
> >>  corresponding seccomp filter.
> >>
> >>  · It can be used with the
> >>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation
> >>to verify that the target is still alive.
> >>
> >>  · When returning a notification response to the
> >>kernel, the supervisor must include the cookie
> >>value in the seccomp_notif_resp structure that is
> >>specified as the argument of the
> >>SECCOMP_IOCTL_NOTIF_SEND operation.
> >>
> >>   pidThis is the thread ID of the target thread that
> >>  triggered the notification event.
> >>
> >>   flags  This is a bit mask of flags providing further
> >>  information on the event.  In the current
> >>  implementation, this field is always zero.
> >>
> >>   data   This is a seccomp_data structure containing
> >>  information about the system call that triggered
> >>  the notification.  This is the same structure that
> >>  is passed to the seccomp filter.  See seccomp(2)
> >>  for details of this structure.
> >>
> >>   On success, this operation returns 0; on failure, -1 is
> >>   returned, and errno is set to indicate the cause of the
> >>   error.  This operation can fail with the following errors:
> >>
> >>   EINVAL (since Linux 5.5)
> >>  The seccomp_notif structure that was passed to the
> >>  call contained nonzero fields.
> >>
> >>   ENOENT The target thread was killed by a signal as the
> >>  notification information was being generated, or
> >>  the target's (blocked) system call was interrupted
> >>  by a signal handler.
> >>
> > 
> > I think I commented in another thread somewhere that the supervisor is not 
> > notified if the syscall is preempted. Therefore if it is performing a 
> > preemptible, long-running syscall, you need to poll
> > SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise you can
> > end up in a bad situation -- like leaking resources, or holding on to
> > file descriptors after the program under supervision has intended to
> > release them.
> 
> It's been a long day, and I'm not sure I reallu understand this.
> Could you outline the scnario in more detail?
> 
S: Sets up filter + interception for accept
T: socket(AF_INET, SOCK_STREAM, 0) = 7
T: bind(7, {127.0.0.1, }, ..)
T: listen(7, 10)
T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
T: accept(7, ...)
S: Intercepts accept
S: Does accept in background
T: Receives signal, and accept(...) responds in EINTR
T: close(7)
S: Still running accept(7, ), holding port , so if now T retries
   to bind to port , things fail.

> > A very specific example is if you're performing an accept on behalf
> > of the program generating the notification, and the program intends
> > to reuse the port. You can get into all sorts of awkward situations
> > there.
> 
> [...]
> 
See above

> > SECCOMP_IOCTL_NOTIF_ADDF

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Michael Kerrisk (man-pages)
On 10/30/20 8:24 PM, Jann Horn wrote:
> On Thu, Oct 29, 2020 at 8:53 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/29/20 4:26 PM, Christian Brauner wrote:
>>> I like this manpage. I think this is the most comprehensive explanation
>>> of any seccomp feature
>>
>> Thanks (at least, I think so...)
>>
>>> and somewhat understandable.
>>   
>>
>> (... but I'm not sure ;-).)
> 
> Relevant: http://tinefetz.net/files/gimgs/78_78_17.jpg

Perfekt :-).


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Jann Horn
On Thu, Oct 29, 2020 at 8:14 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/29/20 2:42 AM, Jann Horn wrote:
> > As discussed at
> > ,
> > we need to re-check checkNotificationIdIsValid() after reading remote
> > memory but before using the read value in any way. Otherwise, the
> > syscall could in the meantime get interrupted by a signal handler, the
> > signal handler could return, and then the function that performed the
> > syscall could free() allocations or return (thereby freeing buffers on
> > the stack).
> >
> > In essence, this pread() is (unavoidably) a potential use-after-free
> > read; and to make that not have any security impact, we need to check
> > whether UAF read occurred before using the read value. This should
> > probably be called out elsewhere in the manpage, too...
> >
> > Now, of course, **reading** is the easy case. The difficult case is if
> > we have to **write** to the remote process... because then we can't
> > play games like that. If we write data to a freed pointer, we're
> > screwed, that's it. (And for somewhat unrelated bonus fun, consider
> > that /proc/$pid/mem is originally intended for process debugging,
> > including installing breakpoints, and will therefore happily write
> > over "readonly" private mappings, such as typical mappings of
> > executable code.)
> >
> > So, h... I guess if anyone wants to actually write memory back to
> > the target process, we'd better come up with some dedicated API for
> > that, using an ioctl on the seccomp fd that magically freezes the
> > target process inside the syscall while writing to its memory, or
> > something like that? And until then, the manpage should have a big fat
> > warning that writing to the target's memory is simply not possible
> > (safely).
>
> Thank you for your very clear explanation! It turned out to be
> trivially easy to demonstrate this issue with a slightly modified
> version of my program.
>
> As well as the change to the code example that I already mentioned
> my reply of a few hours ago, I've added the following text to the
> page:
>
>Caveats regarding the use of /proc/[tid]/mem
>The discussion above noted the need to use the
>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
>/proc/[tid]/mem file of the target to avoid the possibility of
>accessing the memory of the wrong process in the event that the
>target terminates and its ID is recycled by another (unrelated)
>thread.  However, the use of this ioctl(2) operation is also
>necessary in other situations, as explained in the following
>pargraphs.

(nit: paragraphs)

>Consider the following scenario, where the supervisor tries to
>read the pathname argument of a target's blocked mount(2) system
>call:
[...]
> Seem okay?

Yeah, sounds good.

> By the way, is there any analogous kind of issue concerning
> pidfd_getfd()? I'm thinking not, but I wonder if I've missed
> something.

When it is used by a seccomp supervisor, you mean? I think basically
the same thing applies - when resource identifiers (such as memory
addresses or file descriptors) are passed to a syscall, it generally
has to be assumed that those identifiers may become invalid and be
reused as soon as the syscall has returned.


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Jann Horn
On Thu, Oct 29, 2020 at 8:53 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/29/20 4:26 PM, Christian Brauner wrote:
> > I like this manpage. I think this is the most comprehensive explanation
> > of any seccomp feature
>
> Thanks (at least, I think so...)
>
> > and somewhat understandable.
>   
>
> (... but I'm not sure ;-).)

Relevant: http://tinefetz.net/files/gimgs/78_78_17.jpg


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Jann Horn
On Thu, Oct 29, 2020 at 3:19 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/29/20 2:42 AM, Jann Horn wrote:
> > On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
> >  wrote:
> >>static bool
> >>getTargetPathname(struct seccomp_notif *req, int notifyFd,
> >>  char *path, size_t len)
> >>{
> >>char procMemPath[PATH_MAX];
> >>
> >>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
> >> req->pid);
> >>
> >>int procMemFd = open(procMemPath, O_RDONLY);
> >>if (procMemFd == -1)
> >>errExit("\tS: open");
> >>
> >>/* Check that the process whose info we are accessing is still 
> >> alive.
> >>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> >>   in checkNotificationIdIsValid()) succeeds, we know that the
> >>   /proc/PID/mem file descriptor that we opened corresponds to 
> >> the
> >>   process for which we received a notification. If that process
> >>   subsequently terminates, then read() on that file descriptor
> >>   will return 0 (EOF). */
> >>
> >>checkNotificationIdIsValid(notifyFd, req->id);
> >>
> >>/* Read bytes at the location containing the pathname argument
> >>   (i.e., the first argument) of the mkdir(2) call */
> >>
> >>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
> >>if (nread == -1)
> >>errExit("pread");
> >
> > As discussed at
> > ,
> > we need to re-check checkNotificationIdIsValid() after reading remote
> > memory but before using the read value in any way. Otherwise, the
> > syscall could in the meantime get interrupted by a signal handler, the
> > signal handler could return, and then the function that performed the
> > syscall could free() allocations or return (thereby freeing buffers on
> > the stack).
> >
> > In essence, this pread() is (unavoidably) a potential use-after-free
> > read; and to make that not have any security impact, we need to check
> > whether UAF read occurred before using the read value. This should
> > probably be called out elsewhere in the manpage, too...
>
> Thanks very much for pointing me at this!
>
> So, I want to conform that the fix to the code is as simple as
> adding a check following the pread() call, something like:
>
> [[
>  ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
>  if (nread == -1)
> errExit("Supervisor: pread");
>
>  if (nread == 0) {
> fprintf(stderr, "\tS: pread() of /proc/PID/mem "
> "returned 0 (EOF)\n");
> exit(EXIT_FAILURE);
>  }
>
>  if (close(procMemFd) == -1)
> errExit("Supervisor: close-/proc/PID/mem");
>
> +/* Once again check that the notification ID is still valid. The
> +   case we are particularly concerned about here is that just
> +   before we fetched the pathname, the target's blocked system
> +   call was interrupted by a signal handler, and after the handler
> +   returned, the target carried on execution (past the interrupted
> +   system call). In that case, we have no guarantees about what we
> +   are reading, since the target's memory may have been arbitrarily
> +   changed by subsequent operations. */
> +
> +if (!notificationIdIsValid(notifyFd, req->id, "post-open"))
> +return false;
> +
>  /* We have no guarantees about what was in the memory of the target
> process. We therefore treat the buffer returned by pread() as
> untrusted input. The buffer should be terminated by a null byte;
> if not, then we will trigger an error for the target process. */
>
>  if (strnlen(path, nread) < nread)
>  return true;
> ]]

Yeah, that should do the job. With the caveat that a cancelled syscall
could've also led to the memory being munmap()ed, so the nread==0 case
could also happen legitimately - so you might want to move this check
up above the nread==0 (mm went away) and nread==-1 (mm still exists,
but read from address failed, errno EIO) checks if the error message
shouldn't appear spuriously.


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Sargun,,

On 10/29/20 9:53 AM, Sargun Dhillon wrote:
> On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:

[...]

>>ioctl(2) operations
>>The following ioctl(2) operations are provided to support seccomp
>>user-space notification.  For each of these operations, the first
>>(file descriptor) argument of ioctl(2) is the listening file
>>descriptor returned by a call to seccomp(2) with the
>>SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
>>
>>SECCOMP_IOCTL_NOTIF_RECV
>>   This operation is used to obtain a user-space notification
>>   event.  If no such event is currently pending, the
>>   operation blocks until an event occurs.  The third
>>   ioctl(2) argument is a pointer to a structure of the
>>   following form which contains information about the event.
>>   This structure must be zeroed out before the call.
>>
>>   struct seccomp_notif {
>>   __u64  id;  /* Cookie */
>>   __u32  pid; /* TID of target thread */
>>   __u32  flags;   /* Currently unused (0) */
>>   struct seccomp_data data;   /* See seccomp(2) */
>>   };
>>
>>   The fields in this structure are as follows:
>>
>>   id This is a cookie for the notification.  Each such
>>  cookie is guaranteed to be unique for the
>>  corresponding seccomp filter.
>>
>>  · It can be used with the
>>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation
>>to verify that the target is still alive.
>>
>>  · When returning a notification response to the
>>kernel, the supervisor must include the cookie
>>value in the seccomp_notif_resp structure that is
>>specified as the argument of the
>>SECCOMP_IOCTL_NOTIF_SEND operation.
>>
>>   pidThis is the thread ID of the target thread that
>>  triggered the notification event.
>>
>>   flags  This is a bit mask of flags providing further
>>  information on the event.  In the current
>>  implementation, this field is always zero.
>>
>>   data   This is a seccomp_data structure containing
>>  information about the system call that triggered
>>  the notification.  This is the same structure that
>>  is passed to the seccomp filter.  See seccomp(2)
>>  for details of this structure.
>>
>>   On success, this operation returns 0; on failure, -1 is
>>   returned, and errno is set to indicate the cause of the
>>   error.  This operation can fail with the following errors:
>>
>>   EINVAL (since Linux 5.5)
>>  The seccomp_notif structure that was passed to the
>>  call contained nonzero fields.
>>
>>   ENOENT The target thread was killed by a signal as the
>>  notification information was being generated, or
>>  the target's (blocked) system call was interrupted
>>  by a signal handler.
>>
>>┌─┐
>>│FIXME│
>>├─┤
>>│From my experiments, it appears that if a│
>>│SECCOMP_IOCTL_NOTIF_RECV is done after the target│
>>│thread terminates, then the ioctl() simply blocks│
>>│(rather than returning an error to indicate that the │
>>│target no longer exists).│
>>│ │
>>│I found that surprising, and it required some│
>>│contortions in the example program.  It was not  │
>>│possible to code my SIGCHLD handler (which reaps the │
>>│zombie when the worker/target terminates) to simply  │
>>│set a flag checked in the main handleNotifications() │
>>│loop, since this created an unavoidable race where   │
>>│the child might terminate just after I had checked   │
>>│the flag, but before I blocked (forever!) in the │
>>│SECCOMP_IOCTL_NOTIF_RECV operation. Instead, I had   │
>>│to code the signal handler to simply call _exit(2)   │
>>│in order to terminate the parent process (the│
>>│supervisor). │
>>│ │
>>│Is this expected behavior? It seems to me rathe

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Christian

Thanks for taking a look at the page.

On 10/29/20 4:26 PM, Christian Brauner wrote:
> On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
>> Hi all (and especially Tycho and Sargun),
>>
>> Following review comments on the first draft (thanks to Jann, Kees,
>> Christian and Tycho), I've made a lot of changes to this page.
>> I've also added a few FIXMEs relating to outstanding API issues.
>> I'd like a second pass review of the page before I release it.
>> But also, this mail serves as a way of noting the outstanding API
>> issues.
>>
>> Tycho: I still have an outstanding question for you at [2].
>>
>> Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
>> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>>
>> I've shown the rendered version of the page below. The page source
>> currently sits in a branch at
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>>
>> At this point, I'm mainly interested in feedback about the FIXMEs,
>> some of which relate to the text of the page itself, while the
>> others relate to the various outstanding API issues. The first 
>> FIXME provides a small opportunity for some bikeshedding :-);
> 
> I like this manpage. I think this is the most comprehensive explanation
> of any seccomp feature

Thanks (at least, I think so...)

> and somewhat understandable.
  

(... but I'm not sure ;-).)

> Just tiny comments below, hopefully no bike-shedding though. :)

Most relevant point for bikeshedding is the page name. I plan 
to change it to seccomp_unotify(2) (shorter, reads better out loud).

>> Thanks,
>>
>> Michael
>>
>> [1] 
>> https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
>> [2] 
>> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
>>
>> =
>>
>> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)

[...]

>>An overview of the steps performed by the target and the
>>supervisor is as follows:
>>
>>1. The target establishes a seccomp filter in the usual manner,
>>   but with two differences:
>>
>>   · The seccomp(2) flags argument includes the flag
>> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
>> value of the (successful) seccomp(2) call is a new
>> "listening" file descriptor that can be used to receive
>> notifications.  Only one "listening" seccomp filter can be
>> installed for a thread.
>>
>> ┌─┐
>> │FIXME│
>> ├─┤
>> │Is the last sentence above correct?  │
>> │ │
>> │Kees Cook (25 Oct 2020) notes:   │
>> │ │
>> │I like this limitation, but I expect that it'll need │
>> │to change in the future. Even with LSMs, we see the  │
>> │need for arbitrary stacking, and the idea of there   │
>> │being only 1 supervisor will eventually break down.  │
>> │Right now there is only 1 because only container │
>> │managers are using this feature. But if some daemon  │
>> │starts using it to isolate some thread, suddenly it  │
>> │might break if a container manager is trying to  │
>> │listen to it too, etc. I expect it won't be needed   │
>> │soon, but I do think it'll change.   │
>> │ │
>> └─┘
>>
>>   · In cases where it is appropriate, the seccomp filter returns
>> the action value SECCOMP_RET_USER_NOTIF.  This return value
>> will trigger a notification event.
>>
>>2. In order that the supervisor can obtain notifications using
>>   the listening file descriptor, (a duplicate of) that file
>>   descriptor must be passed from the target to the supervisor.
>>   One way in which this could be done is by passing the file
>>   descriptor over a UNIX domain socket connection between the
>>   target and the supervisor (using the SCM_RIGHTS ancillary
>>   message type described in unix(7)).
> 
> Fwiw, on newer kernels you could also use pidfd_getfd() for that.

Thanks. I added that to the text.

>>3. The supervisor will receive notification events on the
>>   listening file descriptor.  These events are returned as
>>   structures of type seccomp_notif.  Because this structure and
>>   its size may evolve over kernel versions, 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Jann,

On 10/29/20 2:42 AM, Jann Horn wrote:
> On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
>  wrote:
>>static bool
>>getTargetPathname(struct seccomp_notif *req, int notifyFd,
>>  char *path, size_t len)
>>{
>>char procMemPath[PATH_MAX];
>>
>>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
>> req->pid);
>>
>>int procMemFd = open(procMemPath, O_RDONLY);
>>if (procMemFd == -1)
>>errExit("\tS: open");
>>
>>/* Check that the process whose info we are accessing is still 
>> alive.
>>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>>   in checkNotificationIdIsValid()) succeeds, we know that the
>>   /proc/PID/mem file descriptor that we opened corresponds to the
>>   process for which we received a notification. If that process
>>   subsequently terminates, then read() on that file descriptor
>>   will return 0 (EOF). */
>>
>>checkNotificationIdIsValid(notifyFd, req->id);
>>
>>/* Read bytes at the location containing the pathname argument
>>   (i.e., the first argument) of the mkdir(2) call */
>>
>>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>>if (nread == -1)
>>errExit("pread");
> 
> As discussed at
> ,
> we need to re-check checkNotificationIdIsValid() after reading remote
> memory but before using the read value in any way. Otherwise, the
> syscall could in the meantime get interrupted by a signal handler, the
> signal handler could return, and then the function that performed the
> syscall could free() allocations or return (thereby freeing buffers on
> the stack).
> 
> In essence, this pread() is (unavoidably) a potential use-after-free
> read; and to make that not have any security impact, we need to check
> whether UAF read occurred before using the read value. This should
> probably be called out elsewhere in the manpage, too...
> 
> Now, of course, **reading** is the easy case. The difficult case is if
> we have to **write** to the remote process... because then we can't
> play games like that. If we write data to a freed pointer, we're
> screwed, that's it. (And for somewhat unrelated bonus fun, consider
> that /proc/$pid/mem is originally intended for process debugging,
> including installing breakpoints, and will therefore happily write
> over "readonly" private mappings, such as typical mappings of
> executable code.)
> 
> So, h... I guess if anyone wants to actually write memory back to
> the target process, we'd better come up with some dedicated API for
> that, using an ioctl on the seccomp fd that magically freezes the
> target process inside the syscall while writing to its memory, or
> something like that? And until then, the manpage should have a big fat
> warning that writing to the target's memory is simply not possible
> (safely).

Thank you for your very clear explanation! It turned out to be 
trivially easy to demonstrate this issue with a slightly modified
version of my program.

As well as the change to the code example that I already mentioned
my reply of a few hours ago, I've added the following text to the 
page:

   Caveats regarding the use of /proc/[tid]/mem
   The discussion above noted the need to use the
   SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
   /proc/[tid]/mem file of the target to avoid the possibility of
   accessing the memory of the wrong process in the event that the
   target terminates and its ID is recycled by another (unrelated)
   thread.  However, the use of this ioctl(2) operation is also
   necessary in other situations, as explained in the following
   pargraphs.

   Consider the following scenario, where the supervisor tries to
   read the pathname argument of a target's blocked mount(2) system
   call:

   • From one of its functions (func()), the target calls mount(2),
 which triggers a user-space notification and causes the target
 to block.

   • The supervisor receives the notification, opens
 /proc/[tid]/mem, and (successfully) performs the
 SECCOMP_IOCTL_NOTIF_ID_VALID check.

   • The target receives a signal, which causes the mount(2) to
 abort.

   • The signal handler executes in the target, and returns.

   • Upon return from the handler, the execution of func() resumes,
 and it returns (and perhaps other functions are called,
 overwriting the memory that had been used for the stack frame
 of func()).

   • Using the address provided in the notification information, the
 supervisor reads from the target's memory location that used to
 contain the pathnam

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Christian Brauner
On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> Hi all (and especially Tycho and Sargun),
> 
> Following review comments on the first draft (thanks to Jann, Kees,
> Christian and Tycho), I've made a lot of changes to this page.
> I've also added a few FIXMEs relating to outstanding API issues.
> I'd like a second pass review of the page before I release it.
> But also, this mail serves as a way of noting the outstanding API
> issues.
> 
> Tycho: I still have an outstanding question for you at [2].
> 
> Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> 
> I've shown the rendered version of the page below. The page source
> currently sits in a branch at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> 
> At this point, I'm mainly interested in feedback about the FIXMEs,
> some of which relate to the text of the page itself, while the
> others relate to the various outstanding API issues. The first 
> FIXME provides a small opportunity for some bikeshedding :-);

I like this manpage. I think this is the most comprehensive explanation
of any seccomp feature and somewhat understandable.

Just tiny comments below, hopefully no bike-shedding though. :)

> 
> 
> Thanks,
> 
> Michael
> 
> [1] 
> https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
> [2] 
> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
> 
> =
> 
> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)
> 
> NAME
>seccomp_user_notif - Seccomp user-space notification mechanism
> 
>┌─┐
>│FIXME│
>├─┤
>│Might "seccomp_unotify(2)" be a better name for this │
>│page?  It's slightly shorter to type, and perhaps│
>│reads better when spoken.│
>└─┘
> 
> SYNOPSIS
>#include 
>#include 
>#include 
> 
>int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
>#include 
> 
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>  struct seccomp_notif *req);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>  struct seccomp_notif_resp *resp);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
> 
> DESCRIPTION
>This page describes the user-space notification mechanism
>provided by the Secure Computing (seccomp) facility.  As well as
>the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>SECCOMP_RET_USER_NOTIF action value, and the
>SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>mechanism involves the use of a number of related ioctl(2)
>operations (described below).
> 
>Overview
>In conventional usage of a seccomp filter, the decision about how
>to treat a system call is made by the filter itself.  By
>contrast, the user-space notification mechanism allows the
>seccomp filter to delegate the handling of the system call to
>another user-space process.  Note that this mechanism is
>explicitly not intended as a method implementing security policy;
>see NOTES.
> 
>In the discussion that follows, the thread(s) on which the
>seccomp filter is installed is (are) referred to as the target,
>and the process that is notified by the user-space notification
>mechanism is referred to as the supervisor.
> 
>A suitably privileged supervisor can use the user-space
>notification mechanism to perform actions on behalf of the
>target.  The advantage of the user-space notification mechanism
>is that the supervisor will usually be able to retrieve
>information about the target and the performed system call that
>the seccomp filter itself cannot.  (A seccomp filter is limited
>in the information it can obtain and the actions that it can
>perform because it is running on a virtual machine inside the
>kernel.)
> 
>An overview of the steps performed by the target and the
>supervisor is as follows:
> 
>1. The target establishes a seccomp filter in the usual manner,
>   but with two differences:
> 
>   · The seccomp(2) flags argument includes the flag
> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
> value of the (successful) seccomp(2) call is a new
> "listening" file descriptor that can be used to receive
> notifications.  Only one "listening" seccomp filter can be
> installed for a thread.
> 
> ┌─

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Michael Kerrisk (man-pages)
Hello Jann,

On 10/29/20 2:42 AM, Jann Horn wrote:
> On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
>  wrote:
>>static bool
>>getTargetPathname(struct seccomp_notif *req, int notifyFd,
>>  char *path, size_t len)
>>{
>>char procMemPath[PATH_MAX];
>>
>>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
>> req->pid);
>>
>>int procMemFd = open(procMemPath, O_RDONLY);
>>if (procMemFd == -1)
>>errExit("\tS: open");
>>
>>/* Check that the process whose info we are accessing is still 
>> alive.
>>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>>   in checkNotificationIdIsValid()) succeeds, we know that the
>>   /proc/PID/mem file descriptor that we opened corresponds to the
>>   process for which we received a notification. If that process
>>   subsequently terminates, then read() on that file descriptor
>>   will return 0 (EOF). */
>>
>>checkNotificationIdIsValid(notifyFd, req->id);
>>
>>/* Read bytes at the location containing the pathname argument
>>   (i.e., the first argument) of the mkdir(2) call */
>>
>>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>>if (nread == -1)
>>errExit("pread");
> 
> As discussed at
> ,
> we need to re-check checkNotificationIdIsValid() after reading remote
> memory but before using the read value in any way. Otherwise, the
> syscall could in the meantime get interrupted by a signal handler, the
> signal handler could return, and then the function that performed the
> syscall could free() allocations or return (thereby freeing buffers on
> the stack).
> 
> In essence, this pread() is (unavoidably) a potential use-after-free
> read; and to make that not have any security impact, we need to check
> whether UAF read occurred before using the read value. This should
> probably be called out elsewhere in the manpage, too...

Thanks very much for pointing me at this!

So, I want to conform that the fix to the code is as simple as
adding a check following the pread() call, something like:

[[
 ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
 if (nread == -1)
errExit("Supervisor: pread");
 
 if (nread == 0) {
fprintf(stderr, "\tS: pread() of /proc/PID/mem "
"returned 0 (EOF)\n");
exit(EXIT_FAILURE);
 }
 
 if (close(procMemFd) == -1)
errExit("Supervisor: close-/proc/PID/mem");
 
+/* Once again check that the notification ID is still valid. The
+   case we are particularly concerned about here is that just
+   before we fetched the pathname, the target's blocked system
+   call was interrupted by a signal handler, and after the handler
+   returned, the target carried on execution (past the interrupted
+   system call). In that case, we have no guarantees about what we
+   are reading, since the target's memory may have been arbitrarily
+   changed by subsequent operations. */
+
+if (!notificationIdIsValid(notifyFd, req->id, "post-open"))
+return false;
+
 /* We have no guarantees about what was in the memory of the target
process. We therefore treat the buffer returned by pread() as
untrusted input. The buffer should be terminated by a null byte;
if not, then we will trigger an error for the target process. */
 
 if (strnlen(path, nread) < nread)
 return true;
]]

> Now, of course, **reading** is the easy case. The difficult case is if
> we have to **write** to the remote process... because then we can't
> play games like that. If we write data to a freed pointer, we're
> screwed, that's it. (And for somewhat unrelated bonus fun, consider
> that /proc/$pid/mem is originally intended for process debugging,
> including installing breakpoints, and will therefore happily write
> over "readonly" private mappings, such as typical mappings of
> executable code.)
> 
> So, h... I guess if anyone wants to actually write memory back to
> the target process, we'd better come up with some dedicated API for
> that, using an ioctl on the seccomp fd that magically freezes the
> target process inside the syscall while writing to its memory, or
> something like that? And until then, the manpage should have a big fat
> warning that writing to the target's memory is simply not possible
> (safely).
> 
>>if (nread == 0) {
>>fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>>"returned 0 (EOF)\n");
>>exit(EXIT_FAILURE);
>>}
> .

I'll think over some changes to the text of the manual page.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages mai

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Sargun Dhillon
On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> Hi all (and especially Tycho and Sargun),
> 
> Following review comments on the first draft (thanks to Jann, Kees,
> Christian and Tycho), I've made a lot of changes to this page.
> I've also added a few FIXMEs relating to outstanding API issues.
> I'd like a second pass review of the page before I release it.
> But also, this mail serves as a way of noting the outstanding API
> issues.
> 
> Tycho: I still have an outstanding question for you at [2].
> 
> Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> 
> I've shown the rendered version of the page below. The page source
> currently sits in a branch at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> 
> At this point, I'm mainly interested in feedback about the FIXMEs,
> some of which relate to the text of the page itself, while the
> others relate to the various outstanding API issues. The first 
> FIXME provides a small opportunity for some bikeshedding :-);
> 
> 
> Thanks,
> 
> Michael
> 
> [1] 
> https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
> [2] 
> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
> 
> =
> 
> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)
> 
> NAME
>seccomp_user_notif - Seccomp user-space notification mechanism
> 
>┌─┐
>│FIXME│
>├─┤
>│Might "seccomp_unotify(2)" be a better name for this │
>│page?  It's slightly shorter to type, and perhaps│
>│reads better when spoken.│
>└─┘
> 
> SYNOPSIS
>#include 
>#include 
>#include 
> 
>int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
>#include 
> 
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>  struct seccomp_notif *req);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>  struct seccomp_notif_resp *resp);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
> 
> DESCRIPTION
>This page describes the user-space notification mechanism
>provided by the Secure Computing (seccomp) facility.  As well as
>the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>SECCOMP_RET_USER_NOTIF action value, and the
>SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>mechanism involves the use of a number of related ioctl(2)
>operations (described below).
> 
>Overview
>In conventional usage of a seccomp filter, the decision about how
>to treat a system call is made by the filter itself.  By
>contrast, the user-space notification mechanism allows the
>seccomp filter to delegate the handling of the system call to
>another user-space process.  Note that this mechanism is
>explicitly not intended as a method implementing security policy;
>see NOTES.
> 
>In the discussion that follows, the thread(s) on which the
>seccomp filter is installed is (are) referred to as the target,
>and the process that is notified by the user-space notification
>mechanism is referred to as the supervisor.
> 
>A suitably privileged supervisor can use the user-space
>notification mechanism to perform actions on behalf of the
>target.  The advantage of the user-space notification mechanism
>is that the supervisor will usually be able to retrieve
>information about the target and the performed system call that
>the seccomp filter itself cannot.  (A seccomp filter is limited
>in the information it can obtain and the actions that it can
>perform because it is running on a virtual machine inside the
>kernel.)
> 
>An overview of the steps performed by the target and the
>supervisor is as follows:
> 
>1. The target establishes a seccomp filter in the usual manner,
>   but with two differences:
> 
>   · The seccomp(2) flags argument includes the flag
> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
> value of the (successful) seccomp(2) call is a new
> "listening" file descriptor that can be used to receive
> notifications.  Only one "listening" seccomp filter can be
> installed for a thread.
> 
> ┌─┐
> │FIXME│
> ├

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Jann Horn
On Thu, Oct 29, 2020 at 3:04 AM Tycho Andersen  wrote:
> On Thu, Oct 29, 2020 at 02:42:58AM +0100, Jann Horn wrote:
> > On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
> >  wrote:
> > >static bool
> > >getTargetPathname(struct seccomp_notif *req, int notifyFd,
> > >  char *path, size_t len)
> > >{
> > >char procMemPath[PATH_MAX];
> > >
> > >snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
> > > req->pid);
> > >
> > >int procMemFd = open(procMemPath, O_RDONLY);
> > >if (procMemFd == -1)
> > >errExit("\tS: open");
> > >
> > >/* Check that the process whose info we are accessing is still 
> > > alive.
> > >   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> > >   in checkNotificationIdIsValid()) succeeds, we know that the
> > >   /proc/PID/mem file descriptor that we opened corresponds to 
> > > the
> > >   process for which we received a notification. If that 
> > > process
> > >   subsequently terminates, then read() on that file descriptor
> > >   will return 0 (EOF). */
> > >
> > >checkNotificationIdIsValid(notifyFd, req->id);
> > >
> > >/* Read bytes at the location containing the pathname argument
> > >   (i.e., the first argument) of the mkdir(2) call */
> > >
> > >ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
> > >if (nread == -1)
> > >errExit("pread");
> >
> > As discussed at
> > ,
> > we need to re-check checkNotificationIdIsValid() after reading remote
> > memory but before using the read value in any way. Otherwise, the
> > syscall could in the meantime get interrupted by a signal handler, the
> > signal handler could return, and then the function that performed the
> > syscall could free() allocations or return (thereby freeing buffers on
> > the stack).
> >
> > In essence, this pread() is (unavoidably) a potential use-after-free
> > read; and to make that not have any security impact, we need to check
> > whether UAF read occurred before using the read value. This should
> > probably be called out elsewhere in the manpage, too...
> >
> > Now, of course, **reading** is the easy case. The difficult case is if
> > we have to **write** to the remote process... because then we can't
> > play games like that. If we write data to a freed pointer, we're
> > screwed, that's it. (And for somewhat unrelated bonus fun, consider
> > that /proc/$pid/mem is originally intended for process debugging,
> > including installing breakpoints, and will therefore happily write
> > over "readonly" private mappings, such as typical mappings of
> > executable code.)
> >
> > So, h... I guess if anyone wants to actually write memory back to
> > the target process, we'd better come up with some dedicated API for
> > that, using an ioctl on the seccomp fd that magically freezes the
>
> By freeze here you mean a killable wait instead of an interruptible
> wait, right?

Nope, nonkillable.

Consider the case of vfork(), where a target process does something like this:

void spawn_executable(char **argv, char **envv) {
  pid_t child = vfork();
  if (child == 0) {
char path[1000];
sprintf(path, ...);
execve(path, argv, envv);
  }
}

and the seccomp notifier wants to look at the execve() path (as a
somewhat silly example). The child process is just borrowing the
parent's stack, and as soon as the child either gets far enough into
execve() or dies, the parent continues using that stack. So keeping
the child in killable sleep would not be enough to prevent reuse of
the child's stack.


But conceptually that's not really a big problem - we already have a
way to force the target task to stay inside the seccomp code no matter
if it gets SIGKILLed or whatever, and that is to take the notify_lock.
When the target task wakes up and wants to continue executing, it has
to first get through mutex_lock(&match->notify_lock) - and that will
always block until the lock is free. So we could e.g. do something
like:

 - Grab references to the source pages in the supervisor's address
space with get_user_pages_fast().
 - Take mmap_sem on the target.
 - Grab references to pages in the relevant range with pin_user_pages_remote().
 - Drop the mmap_sem.
 - Take the notify_lock.
 - Recheck whether the notification with the right ID is still there.
 - Copy data from the pinned source pages to the pinned target pages.
 - Drop the notify_lock.
 - Drop the page references.

and this way we would still guarantee that the target process would
only be blocked in noninterruptible sleep for a small amount of time
(and would not be indirectly blocked on sleeping operations through
the mutex). It'd be pretty straightforward, I think. But

Re: For review: seccomp_user_notif(2) manual page

2020-10-29 Thread Jann Horn
On Thu, Oct 29, 2020 at 3:13 AM Tycho Andersen  wrote:
> > > Consider the following scenario (with supervisor "S" and target "T"; S
> > > wants to wait for events on two file descriptors seccomp_fd and
> > > other_fd):
> > >
> > > S: starts poll() to wait for events on seccomp_fd and other_fd
> > > T: performs a syscall that's filtered with RET_USER_NOTIF
> > > S: poll() returns and signals readiness of seccomp_fd
> > > T: receives signal SIGUSR1
> > > T: syscall aborts, enters signal handler
> > > T: signal handler blocks on unfiltered syscall (e.g. write())
> > > S: starts SECCOMP_IOCTL_NOTIF_RECV
> > > S: blocks because no syscalls are pending
> > >
> > > Depending on what other_fd is, this could in a worst case even lead to
> > > a deadlock (if e.g. the signal handler wants to write to stdout, but
> > > the stdout fd is hooked up to other_fd in the supervisor, but the
> > > supervisor can't consume the data written because it's stuck in
> > > seccomp handling).
> > >
> > > So we have to ensure that when existing code (like that crun code you
> > > linked to) triggers this case, SECCOMP_IOCTL_NOTIF_RECV returns
> > > immediately instead of blocking.
> >
> > Or I guess we could also just set O_NONBLOCK on the fd by default?
> > Since the one existing user is eventloop-based...
>
> I feel like it's ok to return an error from the RECV ioctl() if
> there's never going to be any more events on the fd; was there
> something fundamentally wrong with your patch here:
> https://lore.kernel.org/bpf/cag48ez2xn+_kzneztj-evtstzkbf9cvgpqaak7tprnaqbda...@mail.gmail.com/
> ?

No, I have a new version of that about 80% done and hope to send it
out soonish. (There's some stuff around tests that I still need to
cobble together).


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-28 Thread Jann Horn
On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
 wrote:
>static bool
>getTargetPathname(struct seccomp_notif *req, int notifyFd,
>  char *path, size_t len)
>{
>char procMemPath[PATH_MAX];
>
>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
> req->pid);
>
>int procMemFd = open(procMemPath, O_RDONLY);
>if (procMemFd == -1)
>errExit("\tS: open");
>
>/* Check that the process whose info we are accessing is still 
> alive.
>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>   in checkNotificationIdIsValid()) succeeds, we know that the
>   /proc/PID/mem file descriptor that we opened corresponds to the
>   process for which we received a notification. If that process
>   subsequently terminates, then read() on that file descriptor
>   will return 0 (EOF). */
>
>checkNotificationIdIsValid(notifyFd, req->id);
>
>/* Read bytes at the location containing the pathname argument
>   (i.e., the first argument) of the mkdir(2) call */
>
>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>if (nread == -1)
>errExit("pread");

As discussed at
,
we need to re-check checkNotificationIdIsValid() after reading remote
memory but before using the read value in any way. Otherwise, the
syscall could in the meantime get interrupted by a signal handler, the
signal handler could return, and then the function that performed the
syscall could free() allocations or return (thereby freeing buffers on
the stack).

In essence, this pread() is (unavoidably) a potential use-after-free
read; and to make that not have any security impact, we need to check
whether UAF read occurred before using the read value. This should
probably be called out elsewhere in the manpage, too...

Now, of course, **reading** is the easy case. The difficult case is if
we have to **write** to the remote process... because then we can't
play games like that. If we write data to a freed pointer, we're
screwed, that's it. (And for somewhat unrelated bonus fun, consider
that /proc/$pid/mem is originally intended for process debugging,
including installing breakpoints, and will therefore happily write
over "readonly" private mappings, such as typical mappings of
executable code.)

So, h... I guess if anyone wants to actually write memory back to
the target process, we'd better come up with some dedicated API for
that, using an ioctl on the seccomp fd that magically freezes the
target process inside the syscall while writing to its memory, or
something like that? And until then, the manpage should have a big fat
warning that writing to the target's memory is simply not possible
(safely).

>if (nread == 0) {
>fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>"returned 0 (EOF)\n");
>exit(EXIT_FAILURE);
>}


Re: For review: seccomp_user_notif(2) manual page

2020-10-28 Thread Jann Horn
On Wed, Oct 28, 2020 at 11:53 PM Kees Cook  wrote:
> On Mon, Oct 26, 2020 at 10:51:02AM +0100, Jann Horn wrote:
> > The problem is the scenario where a process is interrupted while it's
> > waiting for the supervisor to reply.
> >
> > Consider the following scenario (with supervisor "S" and target "T"; S
> > wants to wait for events on two file descriptors seccomp_fd and
> > other_fd):
> >
> > S: starts poll() to wait for events on seccomp_fd and other_fd
> > T: performs a syscall that's filtered with RET_USER_NOTIF
> > S: poll() returns and signals readiness of seccomp_fd
> > T: receives signal SIGUSR1
> > T: syscall aborts, enters signal handler
> > T: signal handler blocks on unfiltered syscall (e.g. write())
> > S: starts SECCOMP_IOCTL_NOTIF_RECV
> > S: blocks because no syscalls are pending
>
> Oooh, yes, ew. Thanks for the illustration.
>
> Thinking about this from userspace's least-surprise view, I would expect
> the "recv" to stay "queued", in the sense we'd see this:
>
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: gets (stale) seccomp_notif from seccomp_fd
> S: sends seccomp_notif_resp, receives ENOENT (or some better errno?)
>
> This is not at all how things are designed internally right now, but
> that behavior would work, yes?

It would be really ugly, but it could theoretically be made to work,
to some degree.


The first bit of trouble is that currently the notification lives on
the stack of the target process. If we want to be able to show
userspace the stale notification, we'd have to store it elsewhere. And
since we really don't want to start randomly throwing -ENOMEM in any
of this stuff, we'd basically have to store it in pre-allocated memory
inside the filter.


The second bit of trouble is that if the supervisor is so oblivious
that it doesn't realize that syscalls can be interrupted, it'll run
into other problems. Let's say the target process does something like
this:

int func(void) {
  char pathbuf[4096];
  sprintf(pathbuf, "/tmp/blah.%d", some_number);
  mount("foo", pathbuf, ...);
}

and mount() is handled with a notification. If the supervisor just
reads the path string and immediately passes it into the real mount()
syscall, something like this can happen:

target: starts mount()
target: receives signal, aborts mount()
target: runs signal handler, returns from signal handler
target: returns out of func()
supervisor: receives notification
supervisor: reads path from remote buffer
supervisor: calls mount()

but because the stack allocation has already been freed by the time
the supervisor reads it, the supervisor just reads random garbage, and
beautiful fireworks ensue.

So the supervisor *fundamentally* has to be written to expect that at
*any* time, the target can abandon a syscall. And every read of remote
memory has to be separated from uses of that remote memory by a
notification ID recheck.

And at that point, I think it's reasonable to expect the supervisor to
also be able to handle that a syscall can be aborted before the
notification is delivered.


Re: For review: seccomp_user_notif(2) manual page

2020-10-28 Thread Jann Horn
On Wed, Oct 28, 2020 at 11:56 PM Kees Cook  wrote:
> On Mon, Oct 26, 2020 at 11:31:01AM +0100, Jann Horn wrote:
> > Or I guess we could also just set O_NONBLOCK on the fd by default?
> > Since the one existing user is eventloop-based...
>
> I thought about that initially, but it rubs me the wrong way: it
> violates least-surprise for me. File descriptors are expected to be
> default-blocking. It *is* a special fd, though, so maybe it could work.
> The only case I can think of it would break would be ioctl-loop case
> that is already buggy in that it didn't handle non-zero returns?

We don't have any actual users that use the API that way outside of
the kernel's selftest/sample code, right?


Re: For review: seccomp_user_notif(2) manual page

2020-10-28 Thread Jann Horn
On Wed, Oct 28, 2020 at 6:44 PM Sargun Dhillon  wrote:
> On Wed, Oct 28, 2020 at 2:43 AM Jann Horn  wrote:
> > On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon  wrote:
> > > On Tue, Oct 27, 2020 at 3:28 AM Jann Horn  wrote:
> > > > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> > > >  wrote:
> > > > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > > > > no more listeners, you'd have to disambiguate through the poll()
> > > > > > revents, which would be kinda ugly?
> > > > >
> > > > > I must confess, I'm not quite clear on which two cases you
> > > > > are trying to distinguish. Can you elaborate?
> > > >
> > > > Let's say someone writes a program whose responsibilities are just to
> > > > handle seccomp events and to listen on some other fd for commands. And
> > > > this is implemented with an event loop. Then once all the target
> > > > processes are gone (including zombie reaping), we'll start getting
> > > > EPOLLERR.
> > > >
> > > > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > > > can just call into the seccomp logic without any arguments; it can
> > > > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > > > The downside is that there's one more error code userspace has to
> > > > special-case.
> > > > This would be more consistent with what we'd be doing in the blocking 
> > > > case.
> > > >
> > > > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > > > the seccomp logic what the revents are.
> > > >
> > > > I guess it probably doesn't really matter much.
> > >
> > > So, in practice, if you're emulating a blocking syscall (such as open,
> > > perf_event_open, or any of a number of other syscalls), you probably
> > > have to do it on a separate thread in the supervisor because you want
> > > to continue to be able to receive new notifications if any other process
> > > generates a seccomp notification event that you need to handle.
> > >
> > > In addition to that, some of these syscalls are preemptible, so you need
> > > to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> > > under supervision hasn't left the syscall.
> > >
> > > If we're to implement a mechanism that makes the seccomp ioctl receive
> > > non-blocking, it would be valuable to address this problem as well 
> > > (getting
> > > a notification when the supervisor is processing a syscall and needs to
> > > preempt it). In the best case, this can be a minor inconvenience, and
> > > in the worst case this can result in weird errors where you're keeping
> > > resources open that the container expects to be closed.
> >
> > Does "a notification" mean signals? Or would you want to have a second
> > thread in userspace that poll()s for cancellation events on the
> > seccomp fd and then somehow takes care of interrupting the first
> > thread, or something like that?
>
> I would be reluctant to be prescriptive in that it be a signal. Right
> now, it's implemented
> as a second thread in userspace that does a ioctl(...) and checks if
> the notification
> is valid / alive, and does what's required if the notification has
> died (interrupting
> the first thread).
>
> >
> > Either way, I think your proposal goes beyond the scope of patching
> > the existing weirdness, and should be a separate patch.
>
> I agree it should be a separate patch, but I think that it'd be nice if there
> was a way to do something like:
> * opt-in to getting another message after receiving the notification
>   that indicates the program has left the syscall

I guess to do that cleanly, we'd want something like an array
associated with the seccomp filter that has a size N that's determined
when the filter is set up... and then when a received but unanswered
notification is cancelled, we'd insert its identifier into that array.
And if we enforce that the supervisor can never have more than N
pending messages (by just not delivering new ones if there are N old
ones pending), we'll know that any possible cancellation will always
fit, and we don't need to worry about dynamic memory allocation.

And we could raise EPOLLPRI on the file descriptor when the array is
non-empty, so that if userspace doesn't currently want to handle new
notifications (because it's already dealing with a bunch of them),
userspace can do that, too.

> * when you do the RECV, you can specify a flag or some such asking
>   that you get signaled / notified about the program leaving the syscall

I think filter setup time is easier to deal with than RECV time.

> * a multiplexed receive that can say if an existing notification in progress
>   has left the valid state.

Or alternatively a separate ioctl for receiving cancellation messages,
which you'd only call on EPOLLPRI.


Re: For review: seccomp_user_notif(2) manual page

2020-10-28 Thread Sargun Dhillon
On Wed, Oct 28, 2020 at 2:43 AM Jann Horn  wrote:
>
> On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon  wrote:
> > On Tue, Oct 27, 2020 at 3:28 AM Jann Horn  wrote:
> > > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> > >  wrote:
> > > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > > > no more listeners, you'd have to disambiguate through the poll()
> > > > > revents, which would be kinda ugly?
> > > >
> > > > I must confess, I'm not quite clear on which two cases you
> > > > are trying to distinguish. Can you elaborate?
> > >
> > > Let's say someone writes a program whose responsibilities are just to
> > > handle seccomp events and to listen on some other fd for commands. And
> > > this is implemented with an event loop. Then once all the target
> > > processes are gone (including zombie reaping), we'll start getting
> > > EPOLLERR.
> > >
> > > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > > can just call into the seccomp logic without any arguments; it can
> > > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > > The downside is that there's one more error code userspace has to
> > > special-case.
> > > This would be more consistent with what we'd be doing in the blocking 
> > > case.
> > >
> > > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > > the seccomp logic what the revents are.
> > >
> > > I guess it probably doesn't really matter much.
> >
> > So, in practice, if you're emulating a blocking syscall (such as open,
> > perf_event_open, or any of a number of other syscalls), you probably
> > have to do it on a separate thread in the supervisor because you want
> > to continue to be able to receive new notifications if any other process
> > generates a seccomp notification event that you need to handle.
> >
> > In addition to that, some of these syscalls are preemptible, so you need
> > to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> > under supervision hasn't left the syscall.
> >
> > If we're to implement a mechanism that makes the seccomp ioctl receive
> > non-blocking, it would be valuable to address this problem as well (getting
> > a notification when the supervisor is processing a syscall and needs to
> > preempt it). In the best case, this can be a minor inconvenience, and
> > in the worst case this can result in weird errors where you're keeping
> > resources open that the container expects to be closed.
>
> Does "a notification" mean signals? Or would you want to have a second
> thread in userspace that poll()s for cancellation events on the
> seccomp fd and then somehow takes care of interrupting the first
> thread, or something like that?

I would be reluctant to be prescriptive in that it be a signal. Right
now, it's implemented
as a second thread in userspace that does a ioctl(...) and checks if
the notification
is valid / alive, and does what's required if the notification has
died (interrupting
the first thread).

>
> Either way, I think your proposal goes beyond the scope of patching
> the existing weirdness, and should be a separate patch.

I agree it should be a separate patch, but I think that it'd be nice if there
was a way to do something like:
* opt-in to getting another message after receiving the notification
  that indicates the program has left the syscall
* when you do the RECV, you can specify a flag or some such asking
  that you get signaled / notified about the program leaving the syscall
* a multiplexed receive that can say if an existing notification in progress
  has left the valid state.

---
The reason I bring this up as part of this current thread / discussion is that
I think that they may be related in terms of how we want the behaviour to act.

I would love to hear how people think this should work, or better suggestions
than the second thread approach above, or the alternative approach of
polling all the notifications in progress on some interval [and relying on
epoll timeout to trigger that interval].


Re: For review: seccomp_user_notif(2) manual page

2020-10-28 Thread Sargun Dhillon
On Tue, Oct 27, 2020 at 3:28 AM Jann Horn  wrote:
>
> On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
>  wrote:
> > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > I'm a bit on the fence now on whether non-blocking mode should use
> > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > no more listeners, you'd have to disambiguate through the poll()
> > > revents, which would be kinda ugly?
> >
> > I must confess, I'm not quite clear on which two cases you
> > are trying to distinguish. Can you elaborate?
>
> Let's say someone writes a program whose responsibilities are just to
> handle seccomp events and to listen on some other fd for commands. And
> this is implemented with an event loop. Then once all the target
> processes are gone (including zombie reaping), we'll start getting
> EPOLLERR.
>
> If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> can just call into the seccomp logic without any arguments; it can
> just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> The downside is that there's one more error code userspace has to
> special-case.
> This would be more consistent with what we'd be doing in the blocking case.
>
> If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> the seccomp logic what the revents are.
>
> I guess it probably doesn't really matter much.

So, in practice, if you're emulating a blocking syscall (such as open,
perf_event_open, or any of a number of other syscalls), you probably
have to do it on a separate thread in the supervisor because you want
to continue to be able to receive new notifications if any other process
generates a seccomp notification event that you need to handle.

In addition to that, some of these syscalls are preemptible, so you need
to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
under supervision hasn't left the syscall.

If we're to implement a mechanism that makes the seccomp ioctl receive
non-blocking, it would be valuable to address this problem as well (getting
a notification when the supervisor is processing a syscall and needs to
preempt it). In the best case, this can be a minor inconvenience, and
in the worst case this can result in weird errors where you're keeping
resources open that the container expects to be closed.


Re: For review: seccomp_user_notif(2) manual page

2020-10-28 Thread Jann Horn
On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon  wrote:
> On Tue, Oct 27, 2020 at 3:28 AM Jann Horn  wrote:
> > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> >  wrote:
> > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > > no more listeners, you'd have to disambiguate through the poll()
> > > > revents, which would be kinda ugly?
> > >
> > > I must confess, I'm not quite clear on which two cases you
> > > are trying to distinguish. Can you elaborate?
> >
> > Let's say someone writes a program whose responsibilities are just to
> > handle seccomp events and to listen on some other fd for commands. And
> > this is implemented with an event loop. Then once all the target
> > processes are gone (including zombie reaping), we'll start getting
> > EPOLLERR.
> >
> > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > can just call into the seccomp logic without any arguments; it can
> > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > The downside is that there's one more error code userspace has to
> > special-case.
> > This would be more consistent with what we'd be doing in the blocking case.
> >
> > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > the seccomp logic what the revents are.
> >
> > I guess it probably doesn't really matter much.
>
> So, in practice, if you're emulating a blocking syscall (such as open,
> perf_event_open, or any of a number of other syscalls), you probably
> have to do it on a separate thread in the supervisor because you want
> to continue to be able to receive new notifications if any other process
> generates a seccomp notification event that you need to handle.
>
> In addition to that, some of these syscalls are preemptible, so you need
> to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> under supervision hasn't left the syscall.
>
> If we're to implement a mechanism that makes the seccomp ioctl receive
> non-blocking, it would be valuable to address this problem as well (getting
> a notification when the supervisor is processing a syscall and needs to
> preempt it). In the best case, this can be a minor inconvenience, and
> in the worst case this can result in weird errors where you're keeping
> resources open that the container expects to be closed.

Does "a notification" mean signals? Or would you want to have a second
thread in userspace that poll()s for cancellation events on the
seccomp fd and then somehow takes care of interrupting the first
thread, or something like that?

Either way, I think your proposal goes beyond the scope of patching
the existing weirdness, and should be a separate patch.


Re: For review: seccomp_user_notif(2) manual page

2020-10-28 Thread Kees Cook
On Mon, Oct 26, 2020 at 10:51:02AM +0100, Jann Horn wrote:
> The problem is the scenario where a process is interrupted while it's
> waiting for the supervisor to reply.
> 
> Consider the following scenario (with supervisor "S" and target "T"; S
> wants to wait for events on two file descriptors seccomp_fd and
> other_fd):
> 
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: blocks because no syscalls are pending

Oooh, yes, ew. Thanks for the illustration.

Thinking about this from userspace's least-surprise view, I would expect
the "recv" to stay "queued", in the sense we'd see this:

S: starts poll() to wait for events on seccomp_fd and other_fd
T: performs a syscall that's filtered with RET_USER_NOTIF
S: poll() returns and signals readiness of seccomp_fd
T: receives signal SIGUSR1
T: syscall aborts, enters signal handler
T: signal handler blocks on unfiltered syscall (e.g. write())
S: starts SECCOMP_IOCTL_NOTIF_RECV
S: gets (stale) seccomp_notif from seccomp_fd
S: sends seccomp_notif_resp, receives ENOENT (or some better errno?)

This is not at all how things are designed internally right now, but
that behavior would work, yes?

-- 
Kees Cook


Re: For review: seccomp_user_notif(2) manual page

2020-10-28 Thread Kees Cook
On Mon, Oct 26, 2020 at 11:31:01AM +0100, Jann Horn wrote:
> Or I guess we could also just set O_NONBLOCK on the fd by default?
> Since the one existing user is eventloop-based...

I thought about that initially, but it rubs me the wrong way: it
violates least-surprise for me. File descriptors are expected to be
default-blocking. It *is* a special fd, though, so maybe it could work.
The only case I can think of it would break would be ioctl-loop case
that is already buggy in that it didn't handle non-zero returns?

-- 
Kees Cook


Re: For review: seccomp_user_notif(2) manual page

2020-10-27 Thread Jann Horn
On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
 wrote:
> On 10/26/20 4:54 PM, Jann Horn wrote:
> > I'm a bit on the fence now on whether non-blocking mode should use
> > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > no more listeners, you'd have to disambiguate through the poll()
> > revents, which would be kinda ugly?
>
> I must confess, I'm not quite clear on which two cases you
> are trying to distinguish. Can you elaborate?

Let's say someone writes a program whose responsibilities are just to
handle seccomp events and to listen on some other fd for commands. And
this is implemented with an event loop. Then once all the target
processes are gone (including zombie reaping), we'll start getting
EPOLLERR.

If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
can just call into the seccomp logic without any arguments; it can
just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
The downside is that there's one more error code userspace has to
special-case.
This would be more consistent with what we'd be doing in the blocking case.

If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
the seccomp logic what the revents are.

I guess it probably doesn't really matter much.


Re: For review: seccomp_user_notif(2) manual page

2020-10-26 Thread Michael Kerrisk (man-pages)
On 10/26/20 4:54 PM, Jann Horn wrote:
> On Sun, Oct 25, 2020 at 5:32 PM Michael Kerrisk (man-pages)
>  wrote:
[...]
>> I tried applying the patch below to vanilla 5.9.0.
>> (There's one typo: s/ENOTCON/ENOTCONN).
>>
>> It seems not to work though; when I send a signal to my test
>> target process that is sleeping waiting for the notification
>> response, the process enters the uninterruptible D state.
>> Any thoughts?
> 
> Ah, yeah, I think I was completely misusing the wait API. I'll go change that.
> 
> (Btw, in general, for reports about hangs like that, it can be helpful
> to have the contents of /proc/$pid/stack. And for cases where CPUs are
> spinning, the relevant part from the output of the "L" sysrq, or
> something like that.)

Thanks for the tipcs!

> Also, I guess we can probably break this part of UAPI after all, since
> the only user of this interface seems to currently be completely
> broken in this case anyway? So I think we want the other
> implementation without the ->canceled_reqs logic after all.

Okay.

> I'm a bit on the fence now on whether non-blocking mode should use
> ENOTCONN or not... I guess if we returned ENOENT even when there are
> no more listeners, you'd have to disambiguate through the poll()
> revents, which would be kinda ugly?

I must confess, I'm not quite clear on which two cases you 
are trying to distinguish. Can you elaborate?

> I'll try to turn this into a proper patch submission...

Thank you!!

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Tycho Andersen
On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> Hi all (and especially Tycho and Sargun),
> 
> Following review comments on the first draft (thanks to Jann, Kees,
> Christian and Tycho), I've made a lot of changes to this page.
> I've also added a few FIXMEs relating to outstanding API issues.
> I'd like a second pass review of the page before I release it.
> But also, this mail serves as a way of noting the outstanding API
> issues.
> 
> Tycho: I still have an outstanding question for you at [2].
> [2] 
> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/

I don't have that thread in my inbox any more, but I can reply here:
no, I don't know any users of this info, but I also don't anticipate
knowing how people will all use this feature :)

Tycho


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Tycho Andersen
On Mon, Oct 26, 2020 at 03:30:29PM +0100, Michael Kerrisk (man-pages) wrote:
> Hi Tycho,
> 
> Thanks for getting back to me.
> 
> On Mon, 26 Oct 2020 at 14:54, Tycho Andersen  wrote:
> >
> > On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> > > Hi all (and especially Tycho and Sargun),
> > >
> > > Following review comments on the first draft (thanks to Jann, Kees,
> > > Christian and Tycho), I've made a lot of changes to this page.
> > > I've also added a few FIXMEs relating to outstanding API issues.
> > > I'd like a second pass review of the page before I release it.
> > > But also, this mail serves as a way of noting the outstanding API
> > > issues.
> > >
> > > Tycho: I still have an outstanding question for you at [2].
> > > [2] 
> > > https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
> >
> > I don't have that thread in my inbox any more, but I can reply here:
> > no, I don't know any users of this info, but I also don't anticipate
> > knowing how people will all use this feature :)
> 
> Yes, but my questions were:
> 
> [[
> [1] So, I think maybe I now understand what you intended with setting
> POLLOUT: the notification has been received ("read") and now the
> FD can be used to NOTIFY_SEND ("write") a response. Right?
> 
> [2] If that's correct, I don't have a problem with it. I just wonder:
> is it useful? IOW: are there situations where the process doing the
> NOTIFY_SEND might want to test for POLLOUT because the it doesn't
> know whether a NOTIFY_RECV has occurred?
> ]]
> 
> So, do I understand right in [1]? (The implication from your reply is
> yes, but I want to be sure...)

Yes.

> For [2], my question was not about users, but *use cases*. The
> question I asked myself is: why does the feature exist? Hence my
> question [2] reworded: "when you designed this, did you have in mind
> scenarios here the process doing the NOTIFY_SEND might need to test
> for POLLOUT because it doesn't know whether a NOTIFY_RECV has
> occurred?"

I did not.

Tycho


Re: For review: seccomp_user_notif(2) manual page

2020-10-26 Thread Jann Horn
On Sun, Oct 25, 2020 at 5:32 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/1/20 4:14 AM, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 3:52 AM Jann Horn  wrote:
> >> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen  wrote:
> >>> On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
>  On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  wrote:
> > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) 
> > wrote:
> >> On 9/30/20 5:03 PM, Tycho Andersen wrote:
> >>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) 
> >>> wrote:
> ┌─┐
> │FIXME│
> ├─┤
> │From my experiments,  it  appears  that  if  a  SEC‐ │
> │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> │process terminates, then the ioctl()  simply  blocks │
> │(rather than returning an error to indicate that the │
> │target process no longer exists).│
> >>>
> >>> Yeah, I think Christian wanted to fix this at some point,
> >>
> >> Do you have a pointer that discussion? I could not find it with a
> >> quick search.
> >>
> >>> but it's a
> >>> bit sticky to do.
> >>
> >> Can you say a few words about the nature of the problem?
> >
> > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > notify about unused filter"). So maybe there's a bug here?
> 
>  That thing only notifies on ->poll, it doesn't unblock ioctls; and
>  Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
>  commit doesn't have any effect on this kind of usage.
> >>>
> >>> Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> >>> we don't have a count of all of them, unfortunately.
> >>>
> >>> We could maybe look inside the wait_list, but that will probably make
> >>> people angry :)
> >>
> >> The easiest way would probably be to open-code the semaphore-ish part,
> >> and let the semaphore and poll share the waitqueue. The current code
> >> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> >> entire semaphore would IMO be cleaner than that. And it's not like
> >> semaphore semantics are even a good fit for this code anyway.
> >>
> >> Let's see... if we didn't have the existing UAPI to worry about, I'd
> >> do it as follows (*completely* untested). That way, the ioctl would
> >> block exactly until either there actually is a request to deliver or
> >> there are no more users of the filter. The problem is that if we just
> >> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> >> an event loop and don't set O_NONBLOCK will be screwed. So we'd
> >> probably also have to add some stupid counter in place of the
> >> semaphore's counter that we can use to preserve the old behavior of
> >> returning -ENOENT once for each cancelled request. :(
> >>
> >> I guess this is a nice point in favor of Michael's usual complaint
> >> that if there are no man pages for a feature by the time the feature
> >> lands upstream, there's a higher chance that the UAPI will suck
> >> forever...
> >
> > And I guess this would be the UAPI-compatible version - not actually
> > as terrible as I thought it might be. Do y'all want this? If so, feel
> > free to either turn this into a proper patch with Co-developed-by, or
> > tell me that I should do it and I'll try to get around to turning it
> > into something proper.
>
> Thanks for taking a shot at this.
>
> I tried applying the patch below to vanilla 5.9.0.
> (There's one typo: s/ENOTCON/ENOTCONN).
>
> It seems not to work though; when I send a signal to my test
> target process that is sleeping waiting for the notification
> response, the process enters the uninterruptible D state.
> Any thoughts?

Ah, yeah, I think I was completely misusing the wait API. I'll go change that.

(Btw, in general, for reports about hangs like that, it can be helpful
to have the contents of /proc/$pid/stack. And for cases where CPUs are
spinning, the relevant part from the output of the "L" sysrq, or
something like that.)

Also, I guess we can probably break this part of UAPI after all, since
the only user of this interface seems to currently be completely
broken in this case anyway? So I think we want the other
implementation without the ->canceled_reqs logic after all.

I'm a bit on the fence now on whether non-blocking mode should use
ENOTCONN or not... I guess if we returned ENOENT even when there are
no more listeners, you'd have to disambiguate through the poll()
revents, which would be kinda ugly?

I'll try to turn this into a proper patch submission...


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Michael Kerrisk (man-pages)
Hi Tycho,

Thanks for getting back to me.

On Mon, 26 Oct 2020 at 14:54, Tycho Andersen  wrote:
>
> On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> > Hi all (and especially Tycho and Sargun),
> >
> > Following review comments on the first draft (thanks to Jann, Kees,
> > Christian and Tycho), I've made a lot of changes to this page.
> > I've also added a few FIXMEs relating to outstanding API issues.
> > I'd like a second pass review of the page before I release it.
> > But also, this mail serves as a way of noting the outstanding API
> > issues.
> >
> > Tycho: I still have an outstanding question for you at [2].
> > [2] 
> > https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
>
> I don't have that thread in my inbox any more, but I can reply here:
> no, I don't know any users of this info, but I also don't anticipate
> knowing how people will all use this feature :)

Yes, but my questions were:

[[
[1] So, I think maybe I now understand what you intended with setting
POLLOUT: the notification has been received ("read") and now the
FD can be used to NOTIFY_SEND ("write") a response. Right?

[2] If that's correct, I don't have a problem with it. I just wonder:
is it useful? IOW: are there situations where the process doing the
NOTIFY_SEND might want to test for POLLOUT because the it doesn't
know whether a NOTIFY_RECV has occurred?
]]

So, do I understand right in [1]? (The implication from your reply is
yes, but I want to be sure...)

For [2], my question was not about users, but *use cases*. The
question I asked myself is: why does the feature exist? Hence my
question [2] reworded: "when you designed this, did you have in mind
scenarios here the process doing the NOTIFY_SEND might need to test
for POLLOUT because it doesn't know whether a NOTIFY_RECV has
occurred?"

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-10-26 Thread Jann Horn
On Mon, Oct 26, 2020 at 10:51 AM Jann Horn  wrote:
> On Mon, Oct 26, 2020 at 1:32 AM Kees Cook  wrote:
> > On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> > > On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen  wrote:
> > > > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  
> > > > > wrote:
> > > > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk 
> > > > > > (man-pages) wrote:
> > > > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk 
> > > > > > > > (man-pages) wrote:
> > > > > > > >>┌─┐
> > > > > > > >>│FIXME│
> > > > > > > >>├─┤
> > > > > > > >>│From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > > > > >>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > > > > >>│process terminates, then the ioctl()  simply  blocks │
> > > > > > > >>│(rather than returning an error to indicate that the │
> > > > > > > >>│target process no longer exists).│
> > > > > > > >
> > > > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > > > >
> > > > > > > Do you have a pointer that discussion? I could not find it with a
> > > > > > > quick search.
> > > > > > >
> > > > > > > > but it's a
> > > > > > > > bit sticky to do.
> > > > > > >
> > > > > > > Can you say a few words about the nature of the problem?
> > > > > >
> > > > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 
> > > > > > ("seccomp:
> > > > > > notify about unused filter"). So maybe there's a bug here?
> > > > >
> > > > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > > > commit doesn't have any effect on this kind of usage.
> > > >
> > > > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > > > we don't have a count of all of them, unfortunately.
> > > >
> > > > We could maybe look inside the wait_list, but that will probably make
> > > > people angry :)
> > >
> > > The easiest way would probably be to open-code the semaphore-ish part,
> > > and let the semaphore and poll share the waitqueue. The current code
> > > kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> > > entire semaphore would IMO be cleaner than that. And it's not like
> > > semaphore semantics are even a good fit for this code anyway.
> > >
> > > Let's see... if we didn't have the existing UAPI to worry about, I'd
> > > do it as follows (*completely* untested). That way, the ioctl would
> > > block exactly until either there actually is a request to deliver or
> > > there are no more users of the filter. The problem is that if we just
> > > apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> > > an event loop and don't set O_NONBLOCK will be screwed. So we'd
> >
> > Wait, why? Do you mean a ioctl calling loop (rather than a poll event
> > loop)?
>
> No, I'm talking about poll event loops.
>
> > I think poll would be fine, but a "try calling RECV and expect to
> > return ENOENT" loop would change. But I don't think anyone would do this
> > exactly because it _currently_ acts like O_NONBLOCK, yes?
> >
> > > probably also have to add some stupid counter in place of the
> > > semaphore's counter that we can use to preserve the old behavior of
> > > returning -ENOENT once for each cancelled request. :(
> >
> > I only see this in Debian Code Search:
> > https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
> > which is using epoll_wait():
> > https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326
> >
> > I expect LXC is using it. :)
>
> The problem is the scenario where a process is interrupted while it's
> waiting for the supervisor to reply.
>
> Consider the following scenario (with supervisor "S" and target "T"; S
> wants to wait for events on two file descriptors seccomp_fd and
> other_fd):
>
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: blocks because no syscalls are pending
>
> Depending on what other_fd is, this could in a worst case even lead to
> a deadlock (if e.g. the signal handler wants to write to stdout, but
> the stdout fd is hooked up to other_fd in the supervisor, but the
> supervisor can't consume the data written because it's stuck in
> seccomp handling).
>
> So we have to ensure that when ex

For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Michael Kerrisk (man-pages)
Hi all (and especially Tycho and Sargun),

Following review comments on the first draft (thanks to Jann, Kees,
Christian and Tycho), I've made a lot of changes to this page.
I've also added a few FIXMEs relating to outstanding API issues.
I'd like a second pass review of the page before I release it.
But also, this mail serves as a way of noting the outstanding API
issues.

Tycho: I still have an outstanding question for you at [2].

Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?

I've shown the rendered version of the page below. The page source
currently sits in a branch at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif

At this point, I'm mainly interested in feedback about the FIXMEs,
some of which relate to the text of the page itself, while the
others relate to the various outstanding API issues. The first 
FIXME provides a small opportunity for some bikeshedding :-);


Thanks,

Michael

[1] 
https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
[2] 
https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/

=

SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)

NAME
   seccomp_user_notif - Seccomp user-space notification mechanism

   ┌─┐
   │FIXME│
   ├─┤
   │Might "seccomp_unotify(2)" be a better name for this │
   │page?  It's slightly shorter to type, and perhaps│
   │reads better when spoken.│
   └─┘

SYNOPSIS
   #include 
   #include 
   #include 

   int seccomp(unsigned int operation, unsigned int flags, void *args);

   #include 

   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
 struct seccomp_notif *req);
   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
 struct seccomp_notif_resp *resp);
   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);

DESCRIPTION
   This page describes the user-space notification mechanism
   provided by the Secure Computing (seccomp) facility.  As well as
   the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
   SECCOMP_RET_USER_NOTIF action value, and the
   SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
   mechanism involves the use of a number of related ioctl(2)
   operations (described below).

   Overview
   In conventional usage of a seccomp filter, the decision about how
   to treat a system call is made by the filter itself.  By
   contrast, the user-space notification mechanism allows the
   seccomp filter to delegate the handling of the system call to
   another user-space process.  Note that this mechanism is
   explicitly not intended as a method implementing security policy;
   see NOTES.

   In the discussion that follows, the thread(s) on which the
   seccomp filter is installed is (are) referred to as the target,
   and the process that is notified by the user-space notification
   mechanism is referred to as the supervisor.

   A suitably privileged supervisor can use the user-space
   notification mechanism to perform actions on behalf of the
   target.  The advantage of the user-space notification mechanism
   is that the supervisor will usually be able to retrieve
   information about the target and the performed system call that
   the seccomp filter itself cannot.  (A seccomp filter is limited
   in the information it can obtain and the actions that it can
   perform because it is running on a virtual machine inside the
   kernel.)

   An overview of the steps performed by the target and the
   supervisor is as follows:

   1. The target establishes a seccomp filter in the usual manner,
  but with two differences:

  · The seccomp(2) flags argument includes the flag
SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
value of the (successful) seccomp(2) call is a new
"listening" file descriptor that can be used to receive
notifications.  Only one "listening" seccomp filter can be
installed for a thread.

┌─┐
│FIXME│
├─┤
│Is the last sentence above correct?  │
│ │
│Kees Cook (25 Oct 2020) notes:   │
│ │
│I like this l

Re: For review: seccomp_user_notif(2) manual page

2020-10-26 Thread Jann Horn
On Mon, Oct 26, 2020 at 1:32 AM Kees Cook  wrote:
> On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen  wrote:
> > > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  wrote:
> > > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) 
> > > > > wrote:
> > > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk 
> > > > > > > (man-pages) wrote:
> > > > > > >>┌─┐
> > > > > > >>│FIXME│
> > > > > > >>├─┤
> > > > > > >>│From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > > > >>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > > > >>│process terminates, then the ioctl()  simply  blocks │
> > > > > > >>│(rather than returning an error to indicate that the │
> > > > > > >>│target process no longer exists).│
> > > > > > >
> > > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > > >
> > > > > > Do you have a pointer that discussion? I could not find it with a
> > > > > > quick search.
> > > > > >
> > > > > > > but it's a
> > > > > > > bit sticky to do.
> > > > > >
> > > > > > Can you say a few words about the nature of the problem?
> > > > >
> > > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > > notify about unused filter"). So maybe there's a bug here?
> > > >
> > > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > > commit doesn't have any effect on this kind of usage.
> > >
> > > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > > we don't have a count of all of them, unfortunately.
> > >
> > > We could maybe look inside the wait_list, but that will probably make
> > > people angry :)
> >
> > The easiest way would probably be to open-code the semaphore-ish part,
> > and let the semaphore and poll share the waitqueue. The current code
> > kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> > entire semaphore would IMO be cleaner than that. And it's not like
> > semaphore semantics are even a good fit for this code anyway.
> >
> > Let's see... if we didn't have the existing UAPI to worry about, I'd
> > do it as follows (*completely* untested). That way, the ioctl would
> > block exactly until either there actually is a request to deliver or
> > there are no more users of the filter. The problem is that if we just
> > apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> > an event loop and don't set O_NONBLOCK will be screwed. So we'd
>
> Wait, why? Do you mean a ioctl calling loop (rather than a poll event
> loop)?

No, I'm talking about poll event loops.

> I think poll would be fine, but a "try calling RECV and expect to
> return ENOENT" loop would change. But I don't think anyone would do this
> exactly because it _currently_ acts like O_NONBLOCK, yes?
>
> > probably also have to add some stupid counter in place of the
> > semaphore's counter that we can use to preserve the old behavior of
> > returning -ENOENT once for each cancelled request. :(
>
> I only see this in Debian Code Search:
> https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
> which is using epoll_wait():
> https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326
>
> I expect LXC is using it. :)

The problem is the scenario where a process is interrupted while it's
waiting for the supervisor to reply.

Consider the following scenario (with supervisor "S" and target "T"; S
wants to wait for events on two file descriptors seccomp_fd and
other_fd):

S: starts poll() to wait for events on seccomp_fd and other_fd
T: performs a syscall that's filtered with RET_USER_NOTIF
S: poll() returns and signals readiness of seccomp_fd
T: receives signal SIGUSR1
T: syscall aborts, enters signal handler
T: signal handler blocks on unfiltered syscall (e.g. write())
S: starts SECCOMP_IOCTL_NOTIF_RECV
S: blocks because no syscalls are pending

Depending on what other_fd is, this could in a worst case even lead to
a deadlock (if e.g. the signal handler wants to write to stdout, but
the stdout fd is hooked up to other_fd in the supervisor, but the
supervisor can't consume the data written because it's stuck in
seccomp handling).

So we have to ensure that when existing code (like that crun code you
linked to) triggers this case, SECCOMP_IOCTL_NOTIF_RECV returns
immediately instead of blocking.

(Oh, but by the way, that crun code looks broken anyway, because
AFAICS it treats all error returns from SECCOMP_IOCTL_NOTIF_RECV

Re: For review: seccomp_user_notif(2) manual page

2020-10-26 Thread Michael Kerrisk (man-pages)
Hi Jann,

On 10/26/20 10:32 AM, Jann Horn wrote:
> On Sat, Oct 24, 2020 at 2:53 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/17/20 2:25 AM, Jann Horn wrote:
>>> On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
>>>  wrote:
> [...]
 I'm not sure if I should write anything about this small UAPI
 breakage in BUGS, or not. Your thoughts?
>>>
>>> Thinking about it a bit more: Any code that relies on pause() or
>>> epoll_wait() not restarting is buggy anyway, right? Because a signal
>>> could also arrive directly before entering the syscall, while
>>> userspace code is still executing? So one could argue that we're just
>>> enlarging a preexisting race. (Unless the signal handler checks the
>>> interrupted register state to figure out whether we already entered
>>> syscall handling?)
>>
>> Yes, that all makes sense.
>>
>>> If userspace relies on non-restarting behavior, it should be using
>>> something like epoll_pwait(). And that stuff only unblocks signals
>>> after we've already past the seccomp checks on entry.
>>
>> Thanks for elaborating that detail, since as soon as you talked
>> about "enlarging a preexisting race" above, I immediately wondered
>> sigsuspend(), pselect(), etc.
>>
>> (Mind you, I still wonder about the effect on system calls that
>> are normally nonrestartable because they have timeouts. My
>> understanding is that the kernel doesn't restart those system
>> calls because it's impossible for the kernel to restart the call
>> with the right timeout value. I wonder what happens when those
>> system calls are restarted in the scenario we're discussing.)
> 
> Ah, that's an interesting edge case...

I'm going to drop a FIXME into the page source so that
there's a reminder of this issue in the next draft of 
the page, which I'm about to send out.

[...]

Thanks for checking the other pieces, Jann.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-10-26 Thread Michael Kerrisk (man-pages)
Hello Kees,

On 10/26/20 1:19 AM, Kees Cook wrote:
> On Thu, Oct 15, 2020 at 01:24:03PM +0200, Michael Kerrisk (man-pages) wrote:
>> On 10/1/20 1:39 AM, Kees Cook wrote:
>>> I'll comment more later, but I've run out of time today and I didn't see
>>> anyone mention this detail yet in the existing threads... :)
>>
>> Later never came :-). But, I hope you may have comments for the 
>> next draft, which I will send out soon.
> 
> Later is now, and Soon approaches!
> 
> I finally caught up and read through this whole thread. Thank you all
> for the bug fix[1], and I'm looking forward to more[2]. :)


> For my reply I figured I'd base it on the current draft, so here's a
> simulated quote based on the seccomp_user_notif branch of
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
> through commit 71101158fe330af5a26552447a0bb433b69e15b7
> $ COLUMNS=75 man --nh --nj man2/seccomp_user_notif.2 | sed 's/^/> /'

Thanks for reviewing the latest version!

> On Sun, Oct 25, 2020 at 01:54:05PM +0100, Michael Kerrisk (man-pages) wrote:
>> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual   SECCOMP_USER_NOTIF(2)
>>
>> NAME
>>seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>>#include 
>>#include 
>>#include 
>>
>>int seccomp(unsigned int operation, unsigned int flags, void *args);
>>
>>#include 
>>
>>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>>  struct seccomp_notif *req);
>>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>>  struct seccomp_notif_resp *resp);
>>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
>>
>> DESCRIPTION
>>This page describes the user-space notification mechanism provided
>>by the Secure Computing (seccomp) facility.  As well as the use of
>>the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>>SECCOMP_RET_USER_NOTIF action value, and the
>>SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>>mechanism involves the use of a number of related ioctl(2)
>>operations (described below).
>>
>>Overview
>>In conventional usage of a seccomp filter, the decision about how
>>to treat a system call is made by the filter itself.  By contrast,
>>the user-space notification mechanism allows the seccomp filter to
>>delegate the handling of the system call to another user-space
>>process.  Note that this mechanism is explicitly not intended as a
>>method implementing security policy; see NOTES.
>>
>>In the discussion that follows, the thread(s) on which the seccomp
>>filter is installed is (are) referred to as the target, and the
>>process that is notified by the user-space notification mechanism
>>is referred to as the supervisor.
>>
>>A suitably privileged supervisor can use the user-space
>>notification mechanism to perform actions on behalf of the target.
>>The advantage of the user-space notification mechanism is that the
>>supervisor will usually be able to retrieve information about the
>>target and the performed system call that the seccomp filter
>>itself cannot.  (A seccomp filter is limited in the information it
>>can obtain and the actions that it can perform because it is
>>running on a virtual machine inside the kernel.)
>>
>>An overview of the steps performed by the target and the
>>supervisor is as follows:
>>
>>1. The target establishes a seccomp filter in the usual manner,
>>   but with two differences:
>>
>>   • The seccomp(2) flags argument includes the flag
>> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
>> value  of the (successful) seccomp(2) call is a new
> 
> nit: extra space

Thanks. Fixed.

>> "listening" file descriptor that can be used to receive
>> notifications.  Only one "listening" seccomp filter can be
>> installed for a thread.
> 
> I like this limitation, but I expect that it'll need to change in the
> future. Even with LSMs, we see the need for arbitrary stacking, and the
> idea of there being only 1 supervisor will eventually break down. Right
> now there is only 1 because only container managers are using this
> feature. But if some daemon starts using it to isolate some thread,
> suddenly it might break if a container manager is trying to listen to it
> too, etc. I expect it won't be needed soon, but I do think it'll change.

Thanks for the background. (I added your text in a comment in the
page, just for my own reference in the future.)

>>   • In cases where it is appropriate, the seccomp filter returns
>> the action value SECCOMP_RET_USER_NOTIF.  This return value
>> will trigger a notification event.
>>
>>2. In order that the supervisor can obtain notifications u

Re: For review: seccomp_user_notif(2) manual page

2020-10-26 Thread Jann Horn
On Sat, Oct 24, 2020 at 2:53 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/17/20 2:25 AM, Jann Horn wrote:
> > On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
> >  wrote:
[...]
> >> I'm not sure if I should write anything about this small UAPI
> >> breakage in BUGS, or not. Your thoughts?
> >
> > Thinking about it a bit more: Any code that relies on pause() or
> > epoll_wait() not restarting is buggy anyway, right? Because a signal
> > could also arrive directly before entering the syscall, while
> > userspace code is still executing? So one could argue that we're just
> > enlarging a preexisting race. (Unless the signal handler checks the
> > interrupted register state to figure out whether we already entered
> > syscall handling?)
>
> Yes, that all makes sense.
>
> > If userspace relies on non-restarting behavior, it should be using
> > something like epoll_pwait(). And that stuff only unblocks signals
> > after we've already past the seccomp checks on entry.
>
> Thanks for elaborating that detail, since as soon as you talked
> about "enlarging a preexisting race" above, I immediately wondered
> sigsuspend(), pselect(), etc.
>
> (Mind you, I still wonder about the effect on system calls that
> are normally nonrestartable because they have timeouts. My
> understanding is that the kernel doesn't restart those system
> calls because it's impossible for the kernel to restart the call
> with the right timeout value. I wonder what happens when those
> system calls are restarted in the scenario we're discussing.)

Ah, that's an interesting edge case...

> Anyway, returning to your point... So, to be clear (and to
> quickly remind myself in case I one day reread this thread),
> there is not a problem with sigsuspend(), pselect(), ppoll(),
> and epoll_pwait() since:
>
> * Before the syscall, signals are blocked in the target.
> * Inside the syscall, signals are still blocked at the time
>   the check is made for seccomp filters.
> * If a seccomp user-space notification  event kicks, the target
>   is put to sleep with the signals still blocked.
> * The signal will only get delivered after the supervisor either
>   triggers a spoofed success/failure return in the target or the
>   supervisor sends a CONTINUE response to the kernel telling it
>   to execute the target's system call. Either way, there won't be
>   any restarting of the target's system call (and the supervisor
>   thus won't see multiple notifications).
>
> (Right?)

Yeah.

[...]
> > So we should probably document the restarting behavior as something
> > the supervisor has to deal with in the manpage; but for the
> > "non-restarting syscalls can restart from the target's perspective"
> > aspect, it might be enough to document this as quirky behavior that
> > can't actually break correct code? (Or not document it at all. Dunno.)
>
> So, I've added the following to the page:
>
>Interaction with SA_RESTART signal handlers
>Consider the following scenario:
>
>· The target process has used sigaction(2)  to  install  a  signal
>  handler with the SA_RESTART flag.
>
>· The target has made a system call that triggered a seccomp user-
>  space notification and the target is currently blocked until the
>  supervisor sends a notification response.
>
>· A  signal  is  delivered to the target and the signal handler is
>  executed.
>
>· When  (if)  the  supervisor  attempts  to  send  a  notification
>  response,  the SECCOMP_IOCTL_NOTIF_SEND ioctl(2)) operation will
>  fail with the ENOENT error.
>
>In this scenario, the kernel  will  restart  the  target's  system
>call.   Consequently,  the  supervisor  will receive another user-
>space notification.  Thus, depending on how many times the blocked
>system call is interrupted by a signal handler, the supervisor may
>receive multiple notifications for the same  system  call  in  the
>target.
>
>One  oddity  is  that  system call restarting as described in this
>scenario will occur even for the blocking system calls  listed  in
>signal(7) that would never normally be restarted by the SA_RESTART
>flag.
>
> Does that seem okay?

Sounds good to me.

> In addition, I've queued a cross-reference in signal(7):
>
>In certain circumstances, the seccomp(2) user-space notifi‐
>cation  feature can lead to restarting of system calls that
>would otherwise  never  be  restarted  by  SA_RESTART;  for
>details, see seccomp_user_notif(2).


Re: For review: seccomp_user_notif(2) manual page

2020-10-25 Thread Kees Cook
On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen  wrote:
> > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  wrote:
> > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) 
> > > > wrote:
> > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk 
> > > > > > (man-pages) wrote:
> > > > > >>┌─┐
> > > > > >>│FIXME│
> > > > > >>├─┤
> > > > > >>│From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > > >>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > > >>│process terminates, then the ioctl()  simply  blocks │
> > > > > >>│(rather than returning an error to indicate that the │
> > > > > >>│target process no longer exists).│
> > > > > >
> > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > >
> > > > > Do you have a pointer that discussion? I could not find it with a
> > > > > quick search.
> > > > >
> > > > > > but it's a
> > > > > > bit sticky to do.
> > > > >
> > > > > Can you say a few words about the nature of the problem?
> > > >
> > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > notify about unused filter"). So maybe there's a bug here?
> > >
> > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > commit doesn't have any effect on this kind of usage.
> >
> > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > we don't have a count of all of them, unfortunately.
> >
> > We could maybe look inside the wait_list, but that will probably make
> > people angry :)
> 
> The easiest way would probably be to open-code the semaphore-ish part,
> and let the semaphore and poll share the waitqueue. The current code
> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> entire semaphore would IMO be cleaner than that. And it's not like
> semaphore semantics are even a good fit for this code anyway.
> 
> Let's see... if we didn't have the existing UAPI to worry about, I'd
> do it as follows (*completely* untested). That way, the ioctl would
> block exactly until either there actually is a request to deliver or
> there are no more users of the filter. The problem is that if we just
> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> an event loop and don't set O_NONBLOCK will be screwed. So we'd

Wait, why? Do you mean a ioctl calling loop (rather than a poll event
loop)? I think poll would be fine, but a "try calling RECV and expect to
return ENOENT" loop would change. But I don't think anyone would do this
exactly because it _currently_ acts like O_NONBLOCK, yes?

> probably also have to add some stupid counter in place of the
> semaphore's counter that we can use to preserve the old behavior of
> returning -ENOENT once for each cancelled request. :(

I only see this in Debian Code Search:
https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
which is using epoll_wait():
https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326

I expect LXC is using it. :)

Let's change it ASAP! ;)

-Kees

> 
> I guess this is a nice point in favor of Michael's usual complaint
> that if there are no man pages for a feature by the time the feature
> lands upstream, there's a higher chance that the UAPI will suck
> forever...
> 
> 
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 676d4af62103..f0f4c68e0bc6 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -138,7 +138,6 @@ struct seccomp_kaddfd {
>   * @notifications: A list of struct seccomp_knotif elements.
>   */
>  struct notification {
> -   struct semaphore request;
> u64 next_id;
> struct list_head notifications;
>  };
> @@ -859,7 +858,6 @@ static int seccomp_do_user_notification(int this_syscall,
> list_add(&n.list, &match->notif->notifications);
> INIT_LIST_HEAD(&n.addfd);
> 
> -   up(&match->notif->request);
> wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
> mutex_unlock(&match->notify_lock);
> 
> @@ -1175,9 +1173,10 @@ find_notification(struct seccomp_filter *filter, u64 
> id)
> 
> 
>  static long seccomp_notify_recv(struct seccomp_filter *filter,
> -   void __user *buf)
> +   void __user *buf, bool blocking)
>  {
> struct seccomp_knotif *knotif = NULL, *cur;
> +   DECLARE_WAITQUEUE(wait, current);
> struct seccomp_notif unotif;
> ssize_t ret;
> 
> @@ -1190,11 +1189,9 @@ st

Re: For review: seccomp_user_notif(2) manual page

2020-10-25 Thread Kees Cook
On Thu, Oct 15, 2020 at 01:24:03PM +0200, Michael Kerrisk (man-pages) wrote:
> On 10/1/20 1:39 AM, Kees Cook wrote:
> > I'll comment more later, but I've run out of time today and I didn't see
> > anyone mention this detail yet in the existing threads... :)
> 
> Later never came :-). But, I hope you may have comments for the 
> next draft, which I will send out soon.

Later is now, and Soon approaches!

I finally caught up and read through this whole thread. Thank you all
for the bug fix[1], and I'm looking forward to more[2]. :)

For my reply I figured I'd base it on the current draft, so here's a
simulated quote based on the seccomp_user_notif branch of
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
through commit 71101158fe330af5a26552447a0bb433b69e15b7
$ COLUMNS=75 man --nh --nj man2/seccomp_user_notif.2 | sed 's/^/> /'

On Sun, Oct 25, 2020 at 01:54:05PM +0100, Michael Kerrisk (man-pages) wrote:
> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual   SECCOMP_USER_NOTIF(2)
> 
> NAME
>seccomp_user_notif - Seccomp user-space notification mechanism
> 
> SYNOPSIS
>#include 
>#include 
>#include 
> 
>int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
>#include 
> 
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>  struct seccomp_notif *req);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>  struct seccomp_notif_resp *resp);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
> 
> DESCRIPTION
>This page describes the user-space notification mechanism provided
>by the Secure Computing (seccomp) facility.  As well as the use of
>the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>SECCOMP_RET_USER_NOTIF action value, and the
>SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>mechanism involves the use of a number of related ioctl(2)
>operations (described below).
> 
>Overview
>In conventional usage of a seccomp filter, the decision about how
>to treat a system call is made by the filter itself.  By contrast,
>the user-space notification mechanism allows the seccomp filter to
>delegate the handling of the system call to another user-space
>process.  Note that this mechanism is explicitly not intended as a
>method implementing security policy; see NOTES.
> 
>In the discussion that follows, the thread(s) on which the seccomp
>filter is installed is (are) referred to as the target, and the
>process that is notified by the user-space notification mechanism
>is referred to as the supervisor.
> 
>A suitably privileged supervisor can use the user-space
>notification mechanism to perform actions on behalf of the target.
>The advantage of the user-space notification mechanism is that the
>supervisor will usually be able to retrieve information about the
>target and the performed system call that the seccomp filter
>itself cannot.  (A seccomp filter is limited in the information it
>can obtain and the actions that it can perform because it is
>running on a virtual machine inside the kernel.)
> 
>An overview of the steps performed by the target and the
>supervisor is as follows:
> 
>1. The target establishes a seccomp filter in the usual manner,
>   but with two differences:
> 
>   • The seccomp(2) flags argument includes the flag
> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
> value  of the (successful) seccomp(2) call is a new

nit: extra space

> "listening" file descriptor that can be used to receive
> notifications.  Only one "listening" seccomp filter can be
> installed for a thread.

I like this limitation, but I expect that it'll need to change in the
future. Even with LSMs, we see the need for arbitrary stacking, and the
idea of there being only 1 supervisor will eventually break down. Right
now there is only 1 because only container managers are using this
feature. But if some daemon starts using it to isolate some thread,
suddenly it might break if a container manager is trying to listen to it
too, etc. I expect it won't be needed soon, but I do think it'll change.

> 
>   • In cases where it is appropriate, the seccomp filter returns
> the action value SECCOMP_RET_USER_NOTIF.  This return value
> will trigger a notification event.
> 
>2. In order that the supervisor can obtain notifications using the
>   listening file descriptor, (a duplicate of) that file
>   descriptor must be passed from the target to the supervisor.

Yet another reason to have an "activate on exec" mode for seccomp. With
no_new_privs _not_ being delayed in such a way, I think it'd be safe to
add. The supervisor would get t

Re: For review: seccomp_user_notif(2) manual page

2020-10-25 Thread Michael Kerrisk (man-pages)
Hi Jann,

On 10/1/20 4:14 AM, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 3:52 AM Jann Horn  wrote:
>> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen  wrote:
>>> On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
 On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  wrote:
> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) 
> wrote:
>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) 
>>> wrote:
┌─┐
│FIXME│
├─┤
│From my experiments,  it  appears  that  if  a  SEC‐ │
│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
│process terminates, then the ioctl()  simply  blocks │
│(rather than returning an error to indicate that the │
│target process no longer exists).│
>>>
>>> Yeah, I think Christian wanted to fix this at some point,
>>
>> Do you have a pointer that discussion? I could not find it with a
>> quick search.
>>
>>> but it's a
>>> bit sticky to do.
>>
>> Can you say a few words about the nature of the problem?
>
> I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> notify about unused filter"). So maybe there's a bug here?

 That thing only notifies on ->poll, it doesn't unblock ioctls; and
 Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
 commit doesn't have any effect on this kind of usage.
>>>
>>> Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
>>> we don't have a count of all of them, unfortunately.
>>>
>>> We could maybe look inside the wait_list, but that will probably make
>>> people angry :)
>>
>> The easiest way would probably be to open-code the semaphore-ish part,
>> and let the semaphore and poll share the waitqueue. The current code
>> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
>> entire semaphore would IMO be cleaner than that. And it's not like
>> semaphore semantics are even a good fit for this code anyway.
>>
>> Let's see... if we didn't have the existing UAPI to worry about, I'd
>> do it as follows (*completely* untested). That way, the ioctl would
>> block exactly until either there actually is a request to deliver or
>> there are no more users of the filter. The problem is that if we just
>> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
>> an event loop and don't set O_NONBLOCK will be screwed. So we'd
>> probably also have to add some stupid counter in place of the
>> semaphore's counter that we can use to preserve the old behavior of
>> returning -ENOENT once for each cancelled request. :(
>>
>> I guess this is a nice point in favor of Michael's usual complaint
>> that if there are no man pages for a feature by the time the feature
>> lands upstream, there's a higher chance that the UAPI will suck
>> forever...
> 
> And I guess this would be the UAPI-compatible version - not actually
> as terrible as I thought it might be. Do y'all want this? If so, feel
> free to either turn this into a proper patch with Co-developed-by, or
> tell me that I should do it and I'll try to get around to turning it
> into something proper.

Thanks for taking a shot at this.

I tried applying the patch below to vanilla 5.9.0.
(There's one typo: s/ENOTCON/ENOTCONN).

It seems not to work though; when I send a signal to my test
target process that is sleeping waiting for the notification
response, the process enters the uninterruptible D state.
Any thoughts?

Thanks,

Michael

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 676d4af62103..d08c453fcc2c 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -138,7 +138,7 @@ struct seccomp_kaddfd {
>   * @notifications: A list of struct seccomp_knotif elements.
>   */
>  struct notification {
> -   struct semaphore request;
> +   bool canceled_reqs;
> u64 next_id;
> struct list_head notifications;
>  };
> @@ -859,7 +859,6 @@ static int seccomp_do_user_notification(int this_syscall,
> list_add(&n.list, &match->notif->notifications);
> INIT_LIST_HEAD(&n.addfd);
> 
> -   up(&match->notif->request);
> wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
> mutex_unlock(&match->notify_lock);
> 
> @@ -901,8 +900,20 @@ static int seccomp_do_user_notification(int this_syscall,
>  * *reattach* to a notifier right now. If one is added, we'll need to
>  * keep track of the notif itself and make sure they match here.
>  */
> -   if (match->notif)
> +   if (match->notif) {
> list_del(&n.list);
> +
> +   /*
> +* We are stuck with a UAPI

Re: For review: seccomp_user_notif(2) manual page

2020-10-24 Thread Michael Kerrisk (man-pages)
Hello Jann,

On 10/17/20 2:25 AM, Jann Horn wrote:
> On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 10/15/20 10:32 PM, Jann Horn wrote:
>>> On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
>>>  wrote:
 On 9/30/20 5:53 PM, Jann Horn wrote:
> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>  wrote:
>> I knew it would be a big ask, but below is kind of the manual page
>> I was hoping you might write [1] for the seccomp user-space notification
>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>> that also will need documenting [2]), I did :-). But of course I may
>> have made mistakes...
>>> [...]
>>3. The supervisor process will receive notification events on the
>>   listening  file  descriptor.   These  events  are  returned as
>>   structures of type seccomp_notif.  Because this structure  and
>>   its  size may evolve over kernel versions, the supervisor must
>>   first determine the size of  this  structure  using  the  sec‐
>>   comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>>   structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>>   cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>>   to receive notification events.   In  addition,the  supervisor
>>   allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
>>   comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
>>   comp_notif_resp  structure) that it will provide to the kernel
>>   (and thus the target process).
>>
>>4. The target process then performs its workload, which  includes
>>   system  calls  that  will be controlled by the seccomp filter.
>>   Whenever one of these system calls causes the filter to return
>>   the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
>>   execute the system call;  instead,  execution  of  the  target
>>   process is temporarily blocked inside the kernel and a notifi‐
>
> where "blocked" refers to the interruptible, restartable kind - if the
> child receives a signal with an SA_RESTART signal handler in the
> meantime, it'll leave the syscall, go through the signal handler, then
> restart the syscall again and send the same request to the supervisor
> again. so the supervisor may see duplicate syscalls.

 So, I partially demonstrated what you describe here, for two example
 system calls (epoll_wait() and pause()). But I could not exactly
 demonstrate things as I understand you to be describing them. (So,
 I'm not sure whether I have not understood you correctly, or
 if things are not exactly as you describe them.)

 Here's a scenario (A) that I tested:

 1. Target installs seccomp filters for a blocking syscall
(epoll_wait() or pause(), both of which should never restart,
regardless of SA_RESTART)
 2. Target installs SIGINT handler with SA_RESTART
 3. Supervisor is sleeping (i.e., is not blocked in
SECCOMP_IOCTL_NOTIF_RECV operation).
 4. Target makes a blocking system call (epoll_wait() or pause()).
 5. SIGINT gets delivered to target; handler gets called;
***and syscall gets restarted by the kernel***

 That last should never happen, of course, and is a result of the
 combination of both the user-notify filter and the SA_RESTART flag.
 If one or other is not present, then the system call is not
 restarted.

 So, as you note below, the UAPI gets broken a little.

 However, from your description above I had understood that
 something like the following scenario (B) could occur:

 1. Target installs seccomp filters for a blocking syscall
(epoll_wait() or pause(), both of which should never restart,
regardless of SA_RESTART)
 2. Target installs SIGINT handler with SA_RESTART
 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
blocks).
 4. Target makes a blocking system call (epoll_wait() or pause()).
 5. Supervisor gets seccomp user-space notification (i.e.,
SECCOMP_IOCTL_NOTIF_RECV ioctl() returns
 6. SIGINT gets delivered to target; handler gets called;
and syscall gets restarted by the kernel
 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
which gets another notification for the restarted system call.

 However, I don't observe such behavior. In step 6, the syscall
 does not get restarted by the kernel, but instead returns -1/EINTR.
 Perhaps I have misconstructed my experiment in the second case, or
 perhaps I've misunderstood what you meant, or is it possibly 

Re: For review: seccomp_user_notif(2) manual page

2020-10-16 Thread Jann Horn
On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/15/20 10:32 PM, Jann Horn wrote:
> > On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
> >  wrote:
> >> On 9/30/20 5:53 PM, Jann Horn wrote:
> >>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> >>>  wrote:
>  I knew it would be a big ask, but below is kind of the manual page
>  I was hoping you might write [1] for the seccomp user-space notification
>  mechanism. Since you didn't (and because 5.9 adds various new pieces
>  such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>  that also will need documenting [2]), I did :-). But of course I may
>  have made mistakes...
> > [...]
> 3. The supervisor process will receive notification events on the
>    listening  file  descriptor.   These  events  are  returned as
>    structures of type seccomp_notif.  Because this structure  and
>    its  size may evolve over kernel versions, the supervisor must
>    first determine the size of  this  structure  using  the  sec‐
>    comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>    structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>    cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>    to receive notification events.   In  addition,the  supervisor
>    allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
>    comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
>    comp_notif_resp  structure) that it will provide to the kernel
>    (and thus the target process).
> 
> 4. The target process then performs its workload, which  includes
>    system  calls  that  will be controlled by the seccomp filter.
>    Whenever one of these system calls causes the filter to return
>    the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
>    execute the system call;  instead,  execution  of  the  target
>    process is temporarily blocked inside the kernel and a notifi‐
> >>>
> >>> where "blocked" refers to the interruptible, restartable kind - if the
> >>> child receives a signal with an SA_RESTART signal handler in the
> >>> meantime, it'll leave the syscall, go through the signal handler, then
> >>> restart the syscall again and send the same request to the supervisor
> >>> again. so the supervisor may see duplicate syscalls.
> >>
> >> So, I partially demonstrated what you describe here, for two example
> >> system calls (epoll_wait() and pause()). But I could not exactly
> >> demonstrate things as I understand you to be describing them. (So,
> >> I'm not sure whether I have not understood you correctly, or
> >> if things are not exactly as you describe them.)
> >>
> >> Here's a scenario (A) that I tested:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >>(epoll_wait() or pause(), both of which should never restart,
> >>regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor is sleeping (i.e., is not blocked in
> >>SECCOMP_IOCTL_NOTIF_RECV operation).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. SIGINT gets delivered to target; handler gets called;
> >>***and syscall gets restarted by the kernel***
> >>
> >> That last should never happen, of course, and is a result of the
> >> combination of both the user-notify filter and the SA_RESTART flag.
> >> If one or other is not present, then the system call is not
> >> restarted.
> >>
> >> So, as you note below, the UAPI gets broken a little.
> >>
> >> However, from your description above I had understood that
> >> something like the following scenario (B) could occur:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >>(epoll_wait() or pause(), both of which should never restart,
> >>regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
> >>blocks).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. Supervisor gets seccomp user-space notification (i.e.,
> >>SECCOMP_IOCTL_NOTIF_RECV ioctl() returns
> >> 6. SIGINT gets delivered to target; handler gets called;
> >>and syscall gets restarted by the kernel
> >> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
> >>which gets another notification for the restarted system call.
> >>
> >> However, I don't observe such behavior. In step 6, the syscall
> >> does not get restarted by the kernel, but instead returns -1/EINTR.
> >> Perhaps I have misconstructed my experiment in the second case, or
> >> perhaps I've misunderstood what you meant, or is it possibly the
> >> case that things are not quite as you said?
>
>

Re: For review: seccomp_user_notif(2) manual page

2020-10-16 Thread Michael Kerrisk (man-pages)
Hello Jann,

Thanks for your reply!

On 10/15/20 10:32 PM, Jann Horn wrote:
> On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
>  wrote:
>> On 9/30/20 5:53 PM, Jann Horn wrote:
>>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>>>  wrote:
 I knew it would be a big ask, but below is kind of the manual page
 I was hoping you might write [1] for the seccomp user-space notification
 mechanism. Since you didn't (and because 5.9 adds various new pieces
 such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
 that also will need documenting [2]), I did :-). But of course I may
 have made mistakes...
> [...]
3. The supervisor process will receive notification events on the
   listening  file  descriptor.   These  events  are  returned as
   structures of type seccomp_notif.  Because this structure  and
   its  size may evolve over kernel versions, the supervisor must
   first determine the size of  this  structure  using  the  sec‐
   comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
   structure of type seccomp_notif_sizes.  The  supervisor  allo‐
   cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
   to receive notification events.   In  addition,the  supervisor
   allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
   comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
   comp_notif_resp  structure) that it will provide to the kernel
   (and thus the target process).

4. The target process then performs its workload, which  includes
   system  calls  that  will be controlled by the seccomp filter.
   Whenever one of these system calls causes the filter to return
   the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
   execute the system call;  instead,  execution  of  the  target
   process is temporarily blocked inside the kernel and a notifi‐
>>>
>>> where "blocked" refers to the interruptible, restartable kind - if the
>>> child receives a signal with an SA_RESTART signal handler in the
>>> meantime, it'll leave the syscall, go through the signal handler, then
>>> restart the syscall again and send the same request to the supervisor
>>> again. so the supervisor may see duplicate syscalls.
>>
>> So, I partially demonstrated what you describe here, for two example
>> system calls (epoll_wait() and pause()). But I could not exactly
>> demonstrate things as I understand you to be describing them. (So,
>> I'm not sure whether I have not understood you correctly, or
>> if things are not exactly as you describe them.)
>>
>> Here's a scenario (A) that I tested:
>>
>> 1. Target installs seccomp filters for a blocking syscall
>>(epoll_wait() or pause(), both of which should never restart,
>>regardless of SA_RESTART)
>> 2. Target installs SIGINT handler with SA_RESTART
>> 3. Supervisor is sleeping (i.e., is not blocked in
>>SECCOMP_IOCTL_NOTIF_RECV operation).
>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>> 5. SIGINT gets delivered to target; handler gets called;
>>***and syscall gets restarted by the kernel***
>>
>> That last should never happen, of course, and is a result of the
>> combination of both the user-notify filter and the SA_RESTART flag.
>> If one or other is not present, then the system call is not
>> restarted.
>>
>> So, as you note below, the UAPI gets broken a little.
>>
>> However, from your description above I had understood that
>> something like the following scenario (B) could occur:
>>
>> 1. Target installs seccomp filters for a blocking syscall
>>(epoll_wait() or pause(), both of which should never restart,
>>regardless of SA_RESTART)
>> 2. Target installs SIGINT handler with SA_RESTART
>> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
>>blocks).
>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>> 5. Supervisor gets seccomp user-space notification (i.e.,
>>SECCOMP_IOCTL_NOTIF_RECV ioctl() returns
>> 6. SIGINT gets delivered to target; handler gets called;
>>and syscall gets restarted by the kernel
>> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
>>which gets another notification for the restarted system call.
>>
>> However, I don't observe such behavior. In step 6, the syscall
>> does not get restarted by the kernel, but instead returns -1/EINTR.
>> Perhaps I have misconstructed my experiment in the second case, or
>> perhaps I've misunderstood what you meant, or is it possibly the
>> case that things are not quite as you said?

Thanks for the code, Jann (including the demo of the CLONE_FILES
technique to pass the notification FD to the supervisor).

But I think your code just demonstrates what I described in
scenario A. So, it seem

Re: For review: seccomp_user_notif(2) manual page

2020-10-15 Thread Jann Horn
On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
 wrote:
> On 9/30/20 5:53 PM, Jann Horn wrote:
> > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> >  wrote:
> >> I knew it would be a big ask, but below is kind of the manual page
> >> I was hoping you might write [1] for the seccomp user-space notification
> >> mechanism. Since you didn't (and because 5.9 adds various new pieces
> >> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> >> that also will need documenting [2]), I did :-). But of course I may
> >> have made mistakes...
[...]
> >>3. The supervisor process will receive notification events on the
> >>   listening  file  descriptor.   These  events  are  returned as
> >>   structures of type seccomp_notif.  Because this structure  and
> >>   its  size may evolve over kernel versions, the supervisor must
> >>   first determine the size of  this  structure  using  the  sec‐
> >>   comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
> >>   structure of type seccomp_notif_sizes.  The  supervisor  allo‐
> >>   cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
> >>   to receive notification events.   In  addition,the  supervisor
> >>   allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
> >>   comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
> >>   comp_notif_resp  structure) that it will provide to the kernel
> >>   (and thus the target process).
> >>
> >>4. The target process then performs its workload, which  includes
> >>   system  calls  that  will be controlled by the seccomp filter.
> >>   Whenever one of these system calls causes the filter to return
> >>   the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
> >>   execute the system call;  instead,  execution  of  the  target
> >>   process is temporarily blocked inside the kernel and a notifi‐
> >
> > where "blocked" refers to the interruptible, restartable kind - if the
> > child receives a signal with an SA_RESTART signal handler in the
> > meantime, it'll leave the syscall, go through the signal handler, then
> > restart the syscall again and send the same request to the supervisor
> > again. so the supervisor may see duplicate syscalls.
>
> So, I partially demonstrated what you describe here, for two example
> system calls (epoll_wait() and pause()). But I could not exactly
> demonstrate things as I understand you to be describing them. (So,
> I'm not sure whether I have not understood you correctly, or
> if things are not exactly as you describe them.)
>
> Here's a scenario (A) that I tested:
>
> 1. Target installs seccomp filters for a blocking syscall
>(epoll_wait() or pause(), both of which should never restart,
>regardless of SA_RESTART)
> 2. Target installs SIGINT handler with SA_RESTART
> 3. Supervisor is sleeping (i.e., is not blocked in
>SECCOMP_IOCTL_NOTIF_RECV operation).
> 4. Target makes a blocking system call (epoll_wait() or pause()).
> 5. SIGINT gets delivered to target; handler gets called;
>***and syscall gets restarted by the kernel***
>
> That last should never happen, of course, and is a result of the
> combination of both the user-notify filter and the SA_RESTART flag.
> If one or other is not present, then the system call is not
> restarted.
>
> So, as you note below, the UAPI gets broken a little.
>
> However, from your description above I had understood that
> something like the following scenario (B) could occur:
>
> 1. Target installs seccomp filters for a blocking syscall
>(epoll_wait() or pause(), both of which should never restart,
>regardless of SA_RESTART)
> 2. Target installs SIGINT handler with SA_RESTART
> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
>blocks).
> 4. Target makes a blocking system call (epoll_wait() or pause()).
> 5. Supervisor gets seccomp user-space notification (i.e.,
>SECCOMP_IOCTL_NOTIF_RECV ioctl() returns
> 6. SIGINT gets delivered to target; handler gets called;
>and syscall gets restarted by the kernel
> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
>which gets another notification for the restarted system call.
>
> However, I don't observe such behavior. In step 6, the syscall
> does not get restarted by the kernel, but instead returns -1/EINTR.
> Perhaps I have misconstructed my experiment in the second case, or
> perhaps I've misunderstood what you meant, or is it possibly the
> case that things are not quite as you said?

user@vm:~/test/seccomp-notify-interrupt$ cat seccomp-notify-interrupt.c
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

struct {
  int seccomp_fd;
} *shared;

static void handle_signal(int sig, siginfo_t *info, void

Re: For review: seccomp_user_notif(2) manual page

2020-10-15 Thread Michael Kerrisk (man-pages)
Hello Christian,

On 10/1/20 2:36 PM, Christian Brauner wrote:
> [I'm on vacation so I'll just give this a quick glance for now.]
> 
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Tycho, Sargun (and all),
>>
>> I knew it would be a big ask, but below is kind of the manual page
>> I was hoping you might write [1] for the seccomp user-space notification
>> mechanism. Since you didn't (and because 5.9 adds various new pieces 
>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD 
>> that also will need documenting [2]), I did :-). But of course I may 
>> have made mistakes...
>>
>> I've shown the rendered version of the page below, and would love
>> to receive review comments from you and others, and acks, etc.
>>
>> There are a few FIXMEs sprinkled into the page, including one
>> that relates to what appears to me to be a misdesign (possibly 
>> fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV 
>> operation. I would be especially interested in feedback on that
>> FIXME, and also of course the other FIXMEs.
>>
>> The page includes an extensive (albeit slightly contrived)
>> example program, and I would be happy also to receive comments
>> on that program.
>>
>> The page source currently sits in a branch (along with the text
>> that you sent me for the seccomp(2) page) at
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>>
>> Thanks,
>>
>> Michael
>>
>> [1] 
>> https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd...@gmail.com/#t
>> [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
>> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>>
>> =
>>
>> NAME
>>seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>>#include 
>>#include 
>>#include 
>>
>>int seccomp(unsigned int operation, unsigned int flags, void *args);
>>
>> DESCRIPTION
>>This  page  describes  the user-space notification mechanism pro‐
>>vided by the Secure Computing (seccomp) facility.  As well as the
>>use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
>>COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>>operation  described  in  seccomp(2), this mechanism involves the
>>use of a number of related ioctl(2) operations (described below).
>>
>>Overview
>>In conventional usage of a seccomp filter, the decision about how
>>to  treat  a particular system call is made by the filter itself.
>>The user-space notification mechanism allows the handling of  the
>>system  call  to  instead  be handed off to a user-space process.
> 
> "In contrast, the user notification mechanism allows to delegate the
> handling of the system call of one process (target) to another
> user-space process (supervisor)."?

Thanks. I've reworded similarly to what you suggest.

>>The advantages of doing this are that, by contrast with the  sec‐
>>comp  filter,  which  is  running on a virtual machine inside the
>>kernel, the user-space process has access to information that  is
>>unavailable to the seccomp filter and it can perform actions that
>>can't be performed from the seccomp filter.
> 
> This section reads a bit difficult imho:
> "A suitably privileged supervisor can use the user notification
> mechanism to perform actions in lieu of the target. The supervisor will
> usually be able to retrieve information about the target and the
> performed system call that the seccomp filter itself cannot."

Thanks. Again I've done some rewording.

>>In the discussion that follows, the process  that  has  installed
>>the  seccomp filter is referred to as the target, and the process
>>that is notified by  the  user-space  notification  mechanism  is
>>referred  to  as  the  supervisor.  An overview of the steps per‐
>>formed by these two processes is as follows:

After the various rewordings, the opening paragraphs now read:

   In conventional usage of a seccomp filter, the decision about  how
   to treat a system call is made by the filter itself.  By contrast,
   the user-space notification mechanism allows the seccomp filter to
   delegate  the  handling  of  the system call to another user-space
   process.

   In the discussion that follows, the thread(s) on which the seccomp
   filter  is  installed  is (are) referred to as the target, and the
   process that is notified by the user-space notification  mechanism
   is referred to as the supervisor.

   A  suitably privileged supervisor can use the user-space notifica‐
   tion mechanism to perform actions on behalf of  the  target.   The
   advantage  of  the  user-space  notification mechanism is that the
   supervisor will usually be able to retrieve information about  the
   tar

Re: For review: seccomp_user_notif(2) manual page

2020-10-15 Thread Michael Kerrisk (man-pages)
Hi Jann,

So, first off, thank you for the detailed review. I really 
appreciate it! I've changed various pieces, and still have
a few questions below.

On 9/30/20 5:53 PM, Jann Horn wrote:
> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>  wrote:
>> I knew it would be a big ask, but below is kind of the manual page
>> I was hoping you might write [1] for the seccomp user-space notification
>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>> that also will need documenting [2]), I did :-). But of course I may
>> have made mistakes...
> [...]
>> NAME
>>seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>>#include 
>>#include 
>>#include 
>>
>>int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
> Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS
> of the ioctl_* manpages?

Yes, good idea. I added:

   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
 struct seccomp_notif *req);
   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
 struct seccomp_notif_resp *req);
   int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
> 
>> DESCRIPTION
>>This  page  describes  the user-space notification mechanism pro‐
>>vided by the Secure Computing (seccomp) facility.  As well as the
>>use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
>>COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>>operation  described  in  seccomp(2), this mechanism involves the
>>use of a number of related ioctl(2) operations (described below).
>>
>>Overview
>>In conventional usage of a seccomp filter, the decision about how
>>to  treat  a particular system call is made by the filter itself.
>>The user-space notification mechanism allows the handling of  the
>>system  call  to  instead  be handed off to a user-space process.
>>The advantages of doing this are that, by contrast with the  sec‐
>>comp  filter,  which  is  running on a virtual machine inside the
>>kernel, the user-space process has access to information that  is
>>unavailable to the seccomp filter and it can perform actions that
>>can't be performed from the seccomp filter.
>>
>>In the discussion that follows, the process  that  has  installed
>>the  seccomp filter is referred to as the target, and the process
> 
> Technically, this definition of "target" is a bit inaccurate because:
> 
>  - seccomp filters are inherited
>  - seccomp filters apply to threads, not processes
>  - seccomp filters can be semi-remotely installed via TSYNC

(Nice summary.)

> (I assume that in manpages, we should try to go for the "a task is a
> thread and a thread group is a process" definition, right?)

Exactly.

> Perhaps "the threads on which the seccomp filter is installed are
> referred to as the target", or something like that would be better?

Thanks. It's always hugely helpful to get a suggested wording, even
if I still feel the need to rework it (which I don't in this case).
The sentence now reads:

   In the discussion that follows, the thread(s) on which the seccomp
   filter is installed are referred to as the target, and the process
   that is notified  by  the  user-space  notification  mechanism  is
   referred to as the supervisor.

>>that is notified by  the  user-space  notification  mechanism  is
>>referred  to  as  the  supervisor.  An overview of the steps per‐
>>formed by these two processes is as follows:
>>
>>1. The target process establishes a seccomp filter in  the  usual
>>   manner, but with two differences:
>>
>>   · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
>> TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
>> the  (successful)  seccomp(2) call is a new "listening" file
>> descriptor that can be used to receive notifications.
>>
>>   · In cases where it is appropriate, the seccomp filter returns
>> the  action value SECCOMP_RET_USER_NOTIF.  This return value
>> will trigger a notification event.
>>
>>2. In order that the supervisor process can obtain  notifications
>>   using  the  listening  file  descriptor, (a duplicate of) that
>>   file descriptor must be passed from the target process to  the
>>   supervisor process.  One way in which this could be done is by
>>   passing the file descriptor over a UNIX domain socket  connec‐
>>   tion between the two processes (using the SCM_RIGHTS ancillary
>>   message type described in unix(7)).   Another  possibility  is
>>   that  the  supervisor  might  inherit  the file descriptor via
>>  

Re: For review: seccomp_user_notif(2) manual page

2020-10-15 Thread Michael Kerrisk (man-pages)
Hello Kees,

On 10/1/20 1:39 AM, Kees Cook wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>> [...] I did :-)
> 
> Yay! Thank you!

You're welcome :-)

>> [...]
>>Overview
>>In conventional usage of a seccomp filter, the decision about how
>>to  treat  a particular system call is made by the filter itself.
>>The user-space notification mechanism allows the handling of  the
>>system  call  to  instead  be handed off to a user-space process.
>>The advantages of doing this are that, by contrast with the  sec‐
>>comp  filter,  which  is  running on a virtual machine inside the
>>kernel, the user-space process has access to information that  is
>>unavailable to the seccomp filter and it can perform actions that
>>can't be performed from the seccomp filter.
> 
> I might clarify a bit with something like (though maybe the
> target/supervisor paragraph needs to be moved to the start):
> 
>   This is used for performing syscalls on behalf of the target,
>   rather than having the supervisor make security policy decisions
>   about the syscall, which would be inherently race-prone. The
>   target's syscall should either be handled by the supervisor or
>   allowed to continue normally in the kernel (where standard security
>   policies will be applied).

You, Christian, and Jann all pulled me up on this point. And thanks; 
I'm going to use some of your words above. See my reply to Jann, sent
at about the same time as this reply. Please take a look at the text
in my reply to Jann, and let me know what you think.

> I'll comment more later, but I've run out of time today and I didn't see
> anyone mention this detail yet in the existing threads... :)

Later never came :-). But, I hope you may have comments for the 
next draft, which I will send out soon.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-10-14 Thread Michael Kerrisk (man-pages)
On 10/1/20 7:12 PM, Christian Brauner wrote:
> On Thu, Oct 01, 2020 at 10:58:50AM -0600, Tycho Andersen wrote:
>> On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
>>> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
>>>  wrote:
 On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>  wrote:
>> NOTES
>>The file descriptor returned when seccomp(2) is employed with the
>>SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
>>poll(2), epoll(7), and select(2).  When a notification  is  pend‐
>>ing,  these interfaces indicate that the file descriptor is read‐
>>able.
>
> We should probably also point out somewhere that, as
> include/uapi/linux/seccomp.h says:
>
>  * Similar precautions should be applied when stacking 
> SECCOMP_RET_USER_NOTIF
>  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
>  * same syscall, the most recently added filter takes precedence. This 
> means
>  * that the new SECCOMP_RET_USER_NOTIF filter can override any
>  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
>  * such filtered syscalls to be executed by sending the response
>  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can 
> equally
>  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>
> In other words, from a security perspective, you must assume that the
> target process can bypass any SECCOMP_RET_USER_NOTIF (or
> SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> calling seccomp(). This should also be noted over in the main
> seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.

 So I was actually wondering about this when I skimmed this and a while
 ago but forgot about this again... Afaict, you can only ever load a
 single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
 already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
 in the tasks filter hierarchy then the kernel will refuse to load a new
 one?

 static struct file *init_listener(struct seccomp_filter *filter)
 {
 struct file *ret = ERR_PTR(-EBUSY);
 struct seccomp_filter *cur;

 for (cur = current->seccomp.filter; cur; cur = cur->prev) {
 if (cur->notif)
 goto out;
 }

 shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
 override each other for the same task simply because there can only ever
 be a single one?
>>>
>>> Good point. Excpt that that check seems ineffective because this
>>> happens before we take the locks that guard against TSYNC, and also
>>> before we decide to which existing filter we want to chain the new
>>> filter. So if two threads race with TSYNC, I think they'll be able to
>>> chain two filters with listeners together.
>>
>> Yep, seems the check needs to also be in seccomp_can_sync_threads() to
>> be totally effective,
>>
>>> I don't know whether we want to eternalize this "only one listener
>>> across all the filters" restriction in the manpage though, or whether
>>> the man page should just say that the kernel currently doesn't support
>>> it but that security-wise you should assume that it might at some
>>> point.
>>
>> This requirement originally came from Andy, arguing that the semantics
>> of this were/are confusing, which still makes sense to me. Perhaps we
>> should do something like the below?
> 
> I think we should either keep up this restriction and then cement it in
> the manpage or add a flag to indicate that the notifier is
> non-overridable.
> I don't care about the default too much, i.e. whether it's overridable
> by default and exclusive if opting in or the other way around doesn't
> matter too much. But from a supervisor's perspective it'd be quite nice
> to be able to be sure that a notifier can't be overriden by another
> notifier.
> 
> I think having a flag would provide the greatest flexibility but I agree
> that the semantics of multiple listeners are kinda odd.

So, for now, I have applied the patch at the foot of this mail
to the pages. Does this seem correct?

> Below looks sane to me though again, I'm not sitting in fron of source
> code.
[...]

Thanks,

Michael

PS Jann, if you see this, I'm still working through your (extensive
and very helpful) review comments. I will be sending a response.

==

diff --git a/man2/seccomp.2 b/man2/seccomp.2
index 9ab07f4ab..45a6984df 100644
--- a/man2/seccomp.2
+++ b/man2/seccomp.2
@@ -221,6 +221,11 @@ return a new user-space notification file descriptor.
 When the filter returns
 .BR SECCOMP_RET_USER_NOTIF
 a notification will be sent to this file descriptor.
+.IP
+At most one seccomp filter using the
+.

Re: For review: seccomp_user_notif(2) manual page

2020-10-14 Thread Michael Kerrisk (man-pages)
Hi Tycho,

Ping on the question below!

Thanks,

Michael

On 10/1/20 9:45 AM, Michael Kerrisk (man-pages) wrote:
> On 10/1/20 1:03 AM, Tycho Andersen wrote:
>> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
>>> Hi Tycho,
>>>
>>> Thanks for taking time to look at the page!
>>>
>>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
 On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) 
 wrote:
> 
> [...]
> 
>┌─┐
>│FIXME│
>├─┤
>│Interestingly, after the event  had  been  received, │
>│the  file descriptor indicates as writable (verified │
>│from the source code and by experiment). How is this │
>│useful?  │

 You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
 reasonable.
>>>
>>> No, I'm saying something more fundamental: why is the FD indicating as
>>> writable? Can you write something to it? If yes, what? If not, then
>>> why do these APIs want to say that the FD is writable?
>>
>> You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
>> NOTIFY_SEND are reading and writing events from the fd. I don't know
>> that much about the poll interface though -- is it possible to
>> indicate "here's a pseudo-read event"? It didn't look like it, so I
>> just (ab-)used POLLIN and POLLOUT, but probably that's wrong.
> 
> I think the POLLIN thing is fine.
> 
> So, I think maybe I now understand what you intended with setting
> POLLOUT: the notification has been received ("read") and now the
> FD can be used to NOTIFY_SEND ("write") a response. Right?
> 
> If that's correct, I don't have a problem with it. I just wonder:
> is it useful? IOW: are there situations where the process doing the
> NOTIFY_SEND might want to test for POLLOUT because the it doesn't
> know whether a NOTIFY_RECV has occurred? 
> 
> Thanks,
> 
> Michael
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Tycho Andersen
On Thu, Oct 01, 2020 at 02:06:10PM -0700, Sargun Dhillon wrote:
> On Wed, Sep 30, 2020 at 4:07 AM Michael Kerrisk (man-pages)
>  wrote:
> >
> > Hi Tycho, Sargun (and all),
> >
> > I knew it would be a big ask, but below is kind of the manual page
> > I was hoping you might write [1] for the seccomp user-space notification
> > mechanism. Since you didn't (and because 5.9 adds various new pieces
> > such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> > that also will need documenting [2]), I did :-). But of course I may
> > have made mistakes...
> >
> > I've shown the rendered version of the page below, and would love
> > to receive review comments from you and others, and acks, etc.
> >
> > There are a few FIXMEs sprinkled into the page, including one
> > that relates to what appears to me to be a misdesign (possibly
> > fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
> > operation. I would be especially interested in feedback on that
> > FIXME, and also of course the other FIXMEs.
> >
> > The page includes an extensive (albeit slightly contrived)
> > example program, and I would be happy also to receive comments
> > on that program.
> >
> > The page source currently sits in a branch (along with the text
> > that you sent me for the seccomp(2) page) at
> > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> >
> > Thanks,
> >
> > Michael
> >
> > [1] 
> > https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd...@gmail.com/#t
> > [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
> > and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> >
> > 
> >
> > --
> > Michael Kerrisk
> > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> > Linux/UNIX System Programming Training: http://man7.org/training/
> 
> Should we consider the SECCOMP_GET_NOTIF_SIZES dance to be "deprecated" at
> this point, given that the extensible ioctl mechanism works? If we add
> new fields to the
> seccomp datastructures, we would move them from fixed-size ioctls, to
> variable sized
> ioctls that encode the datastructure size / length?
> 
> -- This is mostly a question for Kees and Tycho.

It will tell you how big struct seccomp_data in the currently running
kernel is, so it still seems useful/necessary to me, unless there's
another way to figure that out.

But I agree, I don't think the intent is to add anything else to
struct seccomp_notif. (I don't know that it ever was.)

Tycho


Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Sargun Dhillon
On Wed, Sep 30, 2020 at 4:07 AM Michael Kerrisk (man-pages)
 wrote:
>
> Hi Tycho, Sargun (and all),
>
> I knew it would be a big ask, but below is kind of the manual page
> I was hoping you might write [1] for the seccomp user-space notification
> mechanism. Since you didn't (and because 5.9 adds various new pieces
> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> that also will need documenting [2]), I did :-). But of course I may
> have made mistakes...
>
> I've shown the rendered version of the page below, and would love
> to receive review comments from you and others, and acks, etc.
>
> There are a few FIXMEs sprinkled into the page, including one
> that relates to what appears to me to be a misdesign (possibly
> fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
> operation. I would be especially interested in feedback on that
> FIXME, and also of course the other FIXMEs.
>
> The page includes an extensive (albeit slightly contrived)
> example program, and I would be happy also to receive comments
> on that program.
>
> The page source currently sits in a branch (along with the text
> that you sent me for the seccomp(2) page) at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>
> Thanks,
>
> Michael
>
> [1] 
> https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd...@gmail.com/#t
> [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>
> 
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

Should we consider the SECCOMP_GET_NOTIF_SIZES dance to be "deprecated" at
this point, given that the extensible ioctl mechanism works? If we add
new fields to the
seccomp datastructures, we would move them from fixed-size ioctls, to
variable sized
ioctls that encode the datastructure size / length?

-- This is mostly a question for Kees and Tycho.


Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Tycho Andersen
On Thu, Oct 01, 2020 at 08:18:49PM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 6:58 PM Tycho Andersen  wrote:
> > On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> > >  wrote:
> > > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers 
> > > > wrote:
> > > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > > >  wrote:
> > > > > > NOTES
> > > > > >The file descriptor returned when seccomp(2) is employed 
> > > > > > with the
> > > > > >SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  
> > > > > > using
> > > > > >poll(2), epoll(7), and select(2).  When a notification  is  
> > > > > > pend‐
> > > > > >ing,  these interfaces indicate that the file descriptor is 
> > > > > > read‐
> > > > > >able.
> > > > >
> > > > > We should probably also point out somewhere that, as
> > > > > include/uapi/linux/seccomp.h says:
> > > > >
> > > > >  * Similar precautions should be applied when stacking 
> > > > > SECCOMP_RET_USER_NOTIF
> > > > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on 
> > > > > the
> > > > >  * same syscall, the most recently added filter takes precedence. 
> > > > > This means
> > > > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially 
> > > > > allowing all
> > > > >  * such filtered syscalls to be executed by sending the response
> > > > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can 
> > > > > equally
> > > > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > > >
> > > > > In other words, from a security perspective, you must assume that the
> > > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > > calling seccomp(). This should also be noted over in the main
> > > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > > >
> > > > So I was actually wondering about this when I skimmed this and a while
> > > > ago but forgot about this again... Afaict, you can only ever load a
> > > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > > in the tasks filter hierarchy then the kernel will refuse to load a new
> > > > one?
> > > >
> > > > static struct file *init_listener(struct seccomp_filter *filter)
> > > > {
> > > > struct file *ret = ERR_PTR(-EBUSY);
> > > > struct seccomp_filter *cur;
> > > >
> > > > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > > > if (cur->notif)
> > > > goto out;
> > > > }
> > > >
> > > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > > override each other for the same task simply because there can only ever
> > > > be a single one?
> > >
> > > Good point. Excpt that that check seems ineffective because this
> > > happens before we take the locks that guard against TSYNC, and also
> > > before we decide to which existing filter we want to chain the new
> > > filter. So if two threads race with TSYNC, I think they'll be able to
> > > chain two filters with listeners together.
> >
> > Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> > be totally effective,
> >
> > > I don't know whether we want to eternalize this "only one listener
> > > across all the filters" restriction in the manpage though, or whether
> > > the man page should just say that the kernel currently doesn't support
> > > it but that security-wise you should assume that it might at some
> > > point.
> >
> > This requirement originally came from Andy, arguing that the semantics
> > of this were/are confusing, which still makes sense to me. Perhaps we
> > should do something like the below?
> [...]
> > +static bool has_listener_parent(struct seccomp_filter *child)
> > +{
> > +   struct seccomp_filter *cur;
> > +
> > +   for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > +   if (cur->notif)
> > +   return true;
> > +   }
> > +
> > +   return false;
> > +}
> [...]
> > @@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
> [...]
> > +   /* don't allow TSYNC to install multiple listeners */
> > +   if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
> > +   !has_listener_parent(thread->seccomp.filter))
> > +   continue;
> [...]
> > @@ -1462,12 +1479,9 @@ static const struct file_operations 
> > seccomp_notify_ops = {
> >  static struct file *init_listener(struct seccomp_filter *filter)
> [...]
> > -   for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > -   if (cur->notif)
> > - 

Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Jann Horn
On Thu, Oct 1, 2020 at 6:58 PM Tycho Andersen  wrote:
> On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> >  wrote:
> > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > >  wrote:
> > > > > NOTES
> > > > >The file descriptor returned when seccomp(2) is employed with 
> > > > > the
> > > > >SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  
> > > > > using
> > > > >poll(2), epoll(7), and select(2).  When a notification  is  
> > > > > pend‐
> > > > >ing,  these interfaces indicate that the file descriptor is 
> > > > > read‐
> > > > >able.
> > > >
> > > > We should probably also point out somewhere that, as
> > > > include/uapi/linux/seccomp.h says:
> > > >
> > > >  * Similar precautions should be applied when stacking 
> > > > SECCOMP_RET_USER_NOTIF
> > > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on 
> > > > the
> > > >  * same syscall, the most recently added filter takes precedence. This 
> > > > means
> > > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing 
> > > > all
> > > >  * such filtered syscalls to be executed by sending the response
> > > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can 
> > > > equally
> > > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > >
> > > > In other words, from a security perspective, you must assume that the
> > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > calling seccomp(). This should also be noted over in the main
> > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > >
> > > So I was actually wondering about this when I skimmed this and a while
> > > ago but forgot about this again... Afaict, you can only ever load a
> > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > in the tasks filter hierarchy then the kernel will refuse to load a new
> > > one?
> > >
> > > static struct file *init_listener(struct seccomp_filter *filter)
> > > {
> > > struct file *ret = ERR_PTR(-EBUSY);
> > > struct seccomp_filter *cur;
> > >
> > > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > > if (cur->notif)
> > > goto out;
> > > }
> > >
> > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > override each other for the same task simply because there can only ever
> > > be a single one?
> >
> > Good point. Excpt that that check seems ineffective because this
> > happens before we take the locks that guard against TSYNC, and also
> > before we decide to which existing filter we want to chain the new
> > filter. So if two threads race with TSYNC, I think they'll be able to
> > chain two filters with listeners together.
>
> Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> be totally effective,
>
> > I don't know whether we want to eternalize this "only one listener
> > across all the filters" restriction in the manpage though, or whether
> > the man page should just say that the kernel currently doesn't support
> > it but that security-wise you should assume that it might at some
> > point.
>
> This requirement originally came from Andy, arguing that the semantics
> of this were/are confusing, which still makes sense to me. Perhaps we
> should do something like the below?
[...]
> +static bool has_listener_parent(struct seccomp_filter *child)
> +{
> +   struct seccomp_filter *cur;
> +
> +   for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> +   if (cur->notif)
> +   return true;
> +   }
> +
> +   return false;
> +}
[...]
> @@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
[...]
> +   /* don't allow TSYNC to install multiple listeners */
> +   if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
> +   !has_listener_parent(thread->seccomp.filter))
> +   continue;
[...]
> @@ -1462,12 +1479,9 @@ static const struct file_operations seccomp_notify_ops 
> = {
>  static struct file *init_listener(struct seccomp_filter *filter)
[...]
> -   for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> -   if (cur->notif)
> -   goto out;
> -   }
> +   if (has_listener_parent(current->seccomp.filter))
> +   goto out;

I dislike this because it combines a non-locked check and a locked
check. And I don't think this will work in the case where TSYNC and
non-TSYNC race - if the non-TSYNC 

Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Christian Brauner
On Thu, Oct 01, 2020 at 10:58:50AM -0600, Tycho Andersen wrote:
> On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> >  wrote:
> > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > >  wrote:
> > > > > NOTES
> > > > >The file descriptor returned when seccomp(2) is employed with 
> > > > > the
> > > > >SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  
> > > > > using
> > > > >poll(2), epoll(7), and select(2).  When a notification  is  
> > > > > pend‐
> > > > >ing,  these interfaces indicate that the file descriptor is 
> > > > > read‐
> > > > >able.
> > > >
> > > > We should probably also point out somewhere that, as
> > > > include/uapi/linux/seccomp.h says:
> > > >
> > > >  * Similar precautions should be applied when stacking 
> > > > SECCOMP_RET_USER_NOTIF
> > > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on 
> > > > the
> > > >  * same syscall, the most recently added filter takes precedence. This 
> > > > means
> > > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing 
> > > > all
> > > >  * such filtered syscalls to be executed by sending the response
> > > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can 
> > > > equally
> > > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > >
> > > > In other words, from a security perspective, you must assume that the
> > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > calling seccomp(). This should also be noted over in the main
> > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > >
> > > So I was actually wondering about this when I skimmed this and a while
> > > ago but forgot about this again... Afaict, you can only ever load a
> > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > in the tasks filter hierarchy then the kernel will refuse to load a new
> > > one?
> > >
> > > static struct file *init_listener(struct seccomp_filter *filter)
> > > {
> > > struct file *ret = ERR_PTR(-EBUSY);
> > > struct seccomp_filter *cur;
> > >
> > > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > > if (cur->notif)
> > > goto out;
> > > }
> > >
> > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > override each other for the same task simply because there can only ever
> > > be a single one?
> > 
> > Good point. Excpt that that check seems ineffective because this
> > happens before we take the locks that guard against TSYNC, and also
> > before we decide to which existing filter we want to chain the new
> > filter. So if two threads race with TSYNC, I think they'll be able to
> > chain two filters with listeners together.
> 
> Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> be totally effective,
> 
> > I don't know whether we want to eternalize this "only one listener
> > across all the filters" restriction in the manpage though, or whether
> > the man page should just say that the kernel currently doesn't support
> > it but that security-wise you should assume that it might at some
> > point.
> 
> This requirement originally came from Andy, arguing that the semantics
> of this were/are confusing, which still makes sense to me. Perhaps we
> should do something like the below?

I think we should either keep up this restriction and then cement it in
the manpage or add a flag to indicate that the notifier is
non-overridable.
I don't care about the default too much, i.e. whether it's overridable
by default and exclusive if opting in or the other way around doesn't
matter too much. But from a supervisor's perspective it'd be quite nice
to be able to be sure that a notifier can't be overriden by another
notifier.

I think having a flag would provide the greatest flexibility but I agree
that the semantics of multiple listeners are kinda odd.

Below looks sane to me though again, I'm not sitting in fron of source
code.

Christian

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 3ee59ce0a323..7b107207c2b0 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -376,6 +376,18 @@ static int is_ancestor(struct seccomp_filter *parent,
>   return 0;
>  }
>  
> +static bool has_listener_parent(struct seccomp_filter *child)
> +{
> + struct seccomp_filter *cur;
> +
> + for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> + if (cur->notif)
> + return true;
> + }
> +
> + return false;
> +}
> 

Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Christian Brauner
On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
>  wrote:
> > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > >  wrote:
> > > > NOTES
> > > >The file descriptor returned when seccomp(2) is employed with the
> > > >SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > > >poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > > >ing,  these interfaces indicate that the file descriptor is read‐
> > > >able.
> > >
> > > We should probably also point out somewhere that, as
> > > include/uapi/linux/seccomp.h says:
> > >
> > >  * Similar precautions should be applied when stacking 
> > > SECCOMP_RET_USER_NOTIF
> > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > >  * same syscall, the most recently added filter takes precedence. This 
> > > means
> > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > >  * such filtered syscalls to be executed by sending the response
> > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can 
> > > equally
> > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > >
> > > In other words, from a security perspective, you must assume that the
> > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > calling seccomp(). This should also be noted over in the main
> > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> >
> > So I was actually wondering about this when I skimmed this and a while
> > ago but forgot about this again... Afaict, you can only ever load a
> > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > in the tasks filter hierarchy then the kernel will refuse to load a new
> > one?
> >
> > static struct file *init_listener(struct seccomp_filter *filter)
> > {
> > struct file *ret = ERR_PTR(-EBUSY);
> > struct seccomp_filter *cur;
> >
> > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > if (cur->notif)
> > goto out;
> > }
> >
> > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > override each other for the same task simply because there can only ever
> > be a single one?
> 
> Good point. Excpt that that check seems ineffective because this
> happens before we take the locks that guard against TSYNC, and also
> before we decide to which existing filter we want to chain the new
> filter. So if two threads race with TSYNC, I think they'll be able to
> chain two filters with listeners together.

That's a bug, imho. I don't have source code in front of me right now
though.

> 
> I don't know whether we want to eternalize this "only one listener
> across all the filters" restriction in the manpage though, or whether
> the man page should just say that the kernel currently doesn't support
> it but that security-wise you should assume that it might at some
> point.

Maybe. I would argue that it might be worth having at least a new
flag/option to indicate either "This is a non-overridable filter." or at
least for the seccomp notifier have an option to indicate that no other
notifer can be installed.

Christian


Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Tycho Andersen
On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
>  wrote:
> > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > >  wrote:
> > > > NOTES
> > > >The file descriptor returned when seccomp(2) is employed with the
> > > >SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > > >poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > > >ing,  these interfaces indicate that the file descriptor is read‐
> > > >able.
> > >
> > > We should probably also point out somewhere that, as
> > > include/uapi/linux/seccomp.h says:
> > >
> > >  * Similar precautions should be applied when stacking 
> > > SECCOMP_RET_USER_NOTIF
> > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > >  * same syscall, the most recently added filter takes precedence. This 
> > > means
> > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > >  * such filtered syscalls to be executed by sending the response
> > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can 
> > > equally
> > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > >
> > > In other words, from a security perspective, you must assume that the
> > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > calling seccomp(). This should also be noted over in the main
> > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> >
> > So I was actually wondering about this when I skimmed this and a while
> > ago but forgot about this again... Afaict, you can only ever load a
> > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > in the tasks filter hierarchy then the kernel will refuse to load a new
> > one?
> >
> > static struct file *init_listener(struct seccomp_filter *filter)
> > {
> > struct file *ret = ERR_PTR(-EBUSY);
> > struct seccomp_filter *cur;
> >
> > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > if (cur->notif)
> > goto out;
> > }
> >
> > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > override each other for the same task simply because there can only ever
> > be a single one?
> 
> Good point. Excpt that that check seems ineffective because this
> happens before we take the locks that guard against TSYNC, and also
> before we decide to which existing filter we want to chain the new
> filter. So if two threads race with TSYNC, I think they'll be able to
> chain two filters with listeners together.

Yep, seems the check needs to also be in seccomp_can_sync_threads() to
be totally effective,

> I don't know whether we want to eternalize this "only one listener
> across all the filters" restriction in the manpage though, or whether
> the man page should just say that the kernel currently doesn't support
> it but that security-wise you should assume that it might at some
> point.

This requirement originally came from Andy, arguing that the semantics
of this were/are confusing, which still makes sense to me. Perhaps we
should do something like the below?

Tycho


diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3ee59ce0a323..7b107207c2b0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -376,6 +376,18 @@ static int is_ancestor(struct seccomp_filter *parent,
return 0;
 }
 
+static bool has_listener_parent(struct seccomp_filter *child)
+{
+   struct seccomp_filter *cur;
+
+   for (cur = current->seccomp.filter; cur; cur = cur->prev) {
+   if (cur->notif)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  * seccomp_can_sync_threads: checks if all threads can be synchronized
  *
@@ -385,7 +397,7 @@ static int is_ancestor(struct seccomp_filter *parent,
  * either not in the correct seccomp mode or did not have an ancestral
  * seccomp filter.
  */
-static inline pid_t seccomp_can_sync_threads(void)
+static inline pid_t seccomp_can_sync_threads(unsigned int flags)
 {
struct task_struct *thread, *caller;
 
@@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
 caller->seccomp.filter)))
continue;
 
+   /* don't allow TSYNC to install multiple listeners */
+   if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
+   !has_listener_parent(thread->seccomp.filter))
+   continue;
+
/* Return the first thread that cannot be synchronized. */
failed = task_pid_vnr(thread);
  

Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Jann Horn
On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
 wrote:
> On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> >  wrote:
> > > NOTES
> > >The file descriptor returned when seccomp(2) is employed with the
> > >SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > >poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > >ing,  these interfaces indicate that the file descriptor is read‐
> > >able.
> >
> > We should probably also point out somewhere that, as
> > include/uapi/linux/seccomp.h says:
> >
> >  * Similar precautions should be applied when stacking 
> > SECCOMP_RET_USER_NOTIF
> >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> >  * same syscall, the most recently added filter takes precedence. This means
> >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> >  * such filtered syscalls to be executed by sending the response
> >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> >
> > In other words, from a security perspective, you must assume that the
> > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > calling seccomp(). This should also be noted over in the main
> > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
>
> So I was actually wondering about this when I skimmed this and a while
> ago but forgot about this again... Afaict, you can only ever load a
> single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> in the tasks filter hierarchy then the kernel will refuse to load a new
> one?
>
> static struct file *init_listener(struct seccomp_filter *filter)
> {
> struct file *ret = ERR_PTR(-EBUSY);
> struct seccomp_filter *cur;
>
> for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> if (cur->notif)
> goto out;
> }
>
> shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> override each other for the same task simply because there can only ever
> be a single one?

Good point. Excpt that that check seems ineffective because this
happens before we take the locks that guard against TSYNC, and also
before we decide to which existing filter we want to chain the new
filter. So if two threads race with TSYNC, I think they'll be able to
chain two filters with listeners together.

I don't know whether we want to eternalize this "only one listener
across all the filters" restriction in the manpage though, or whether
the man page should just say that the kernel currently doesn't support
it but that security-wise you should assume that it might at some
point.

[...]
> > >if (procMemFd == -1)
> > >errExit("Supervisor: open");
> > >
> > >/* Check that the process whose info we are accessing is still 
> > > alive.
> > >   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> > >   in checkNotificationIdIsValid()) succeeds, we know that the
> > >   /proc/PID/mem file descriptor that we opened corresponds to 
> > > the
> > >   process for which we received a notification. If that 
> > > process
> > >   subsequently terminates, then read() on that file descriptor
> > >   will return 0 (EOF). */
> > >
> > >checkNotificationIdIsValid(notifyFd, req->id);
> > >
> > >/* Seek to the location containing the pathname argument 
> > > (i.e., the
> > >   first argument) of the mkdir(2) call and read that pathname 
> > > */
> > >
> > >if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
> > >errExit("Supervisor: lseek");
> > >
> > >ssize_t s = read(procMemFd, path, PATH_MAX);
> > >if (s == -1)
> > >errExit("read");
> >
> > Why not pread() instead of lseek()+read()?
>
> With multiple arguments to be read process_vm_readv() should also be
> considered.

process_vm_readv() can end up doing each read against a different
process, which is sort of weird semantically. You would end up taking
page faults at random addresses in unrelated processes, blocking on
their mmap locks, potentially triggering their userfaultfd notifiers,
and so on.

Whereas if you first open /proc/$tid/mem, then re-check
SECCOMP_IOCTL_NOTIF_ID_VALID, and then do the read, you know that
you're only taking page faults on the process where you intended to do
it.

So until there is a variant of process_vm_readv() that operates on
pidfds, I would not recommend using that here.


Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Christian Brauner
On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>  wrote:
> > I knew it would be a big ask, but below is kind of the manual page
> > I was hoping you might write [1] for the seccomp user-space notification
> > mechanism. Since you didn't (and because 5.9 adds various new pieces
> > such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> > that also will need documenting [2]), I did :-). But of course I may
> > have made mistakes...
> [...]
> > NAME
> >seccomp_user_notif - Seccomp user-space notification mechanism
> >
> > SYNOPSIS
> >#include 
> >#include 
> >#include 
> >
> >int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
> Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS
> of the ioctl_* manpages?
> 
> > DESCRIPTION
> >This  page  describes  the user-space notification mechanism pro‐
> >vided by the Secure Computing (seccomp) facility.  As well as the
> >use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
> >COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
> >operation  described  in  seccomp(2), this mechanism involves the
> >use of a number of related ioctl(2) operations (described below).
> >
> >Overview
> >In conventional usage of a seccomp filter, the decision about how
> >to  treat  a particular system call is made by the filter itself.
> >The user-space notification mechanism allows the handling of  the
> >system  call  to  instead  be handed off to a user-space process.
> >The advantages of doing this are that, by contrast with the  sec‐
> >comp  filter,  which  is  running on a virtual machine inside the
> >kernel, the user-space process has access to information that  is
> >unavailable to the seccomp filter and it can perform actions that
> >can't be performed from the seccomp filter.
> >
> >In the discussion that follows, the process  that  has  installed
> >the  seccomp filter is referred to as the target, and the process
> 
> Technically, this definition of "target" is a bit inaccurate because:
> 
>  - seccomp filters are inherited
>  - seccomp filters apply to threads, not processes
>  - seccomp filters can be semi-remotely installed via TSYNC
> 
> (I assume that in manpages, we should try to go for the "a task is a
> thread and a thread group is a process" definition, right?)
> 
> Perhaps "the threads on which the seccomp filter is installed are
> referred to as the target", or something like that would be better?
> 
> >that is notified by  the  user-space  notification  mechanism  is
> >referred  to  as  the  supervisor.  An overview of the steps per‐
> >formed by these two processes is as follows:
> >
> >1. The target process establishes a seccomp filter in  the  usual
> >   manner, but with two differences:
> >
> >   · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
> > TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
> > the  (successful)  seccomp(2) call is a new "listening" file
> > descriptor that can be used to receive notifications.
> >
> >   · In cases where it is appropriate, the seccomp filter returns
> > the  action value SECCOMP_RET_USER_NOTIF.  This return value
> > will trigger a notification event.
> >
> >2. In order that the supervisor process can obtain  notifications
> >   using  the  listening  file  descriptor, (a duplicate of) that
> >   file descriptor must be passed from the target process to  the
> >   supervisor process.  One way in which this could be done is by
> >   passing the file descriptor over a UNIX domain socket  connec‐
> >   tion between the two processes (using the SCM_RIGHTS ancillary
> >   message type described in unix(7)).   Another  possibility  is
> >   that  the  supervisor  might  inherit  the file descriptor via
> >   fork(2).
> 
> With the caveat that if the supervisor inherits the file descriptor
> via fork(), that (more or less) implies that the supervisor is subject
> to the same filter (although it could bypass the filter using a helper
> thread that responds SECCOMP_USER_NOTIF_FLAG_CONTINUE, but I don't
> expect any clean software to do that).
> 
> >3. The supervisor process will receive notification events on the
> >   listening  file  descriptor.   These  events  are  returned as
> >   structures of type seccomp_notif.  Because this structure  and
> >   its  size may evolve over kernel versions, the supervisor must
> >   first determine the size of  this  structure  using  the  sec‐
> >   comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,

Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Christian Brauner
[I'm on vacation so I'll just give this a quick glance for now.]

On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Tycho, Sargun (and all),
> 
> I knew it would be a big ask, but below is kind of the manual page
> I was hoping you might write [1] for the seccomp user-space notification
> mechanism. Since you didn't (and because 5.9 adds various new pieces 
> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD 
> that also will need documenting [2]), I did :-). But of course I may 
> have made mistakes...
> 
> I've shown the rendered version of the page below, and would love
> to receive review comments from you and others, and acks, etc.
> 
> There are a few FIXMEs sprinkled into the page, including one
> that relates to what appears to me to be a misdesign (possibly 
> fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV 
> operation. I would be especially interested in feedback on that
> FIXME, and also of course the other FIXMEs.
> 
> The page includes an extensive (albeit slightly contrived)
> example program, and I would be happy also to receive comments
> on that program.
> 
> The page source currently sits in a branch (along with the text
> that you sent me for the seccomp(2) page) at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> 
> Thanks,
> 
> Michael
> 
> [1] 
> https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd...@gmail.com/#t
> [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> 
> =
> 
> NAME
>seccomp_user_notif - Seccomp user-space notification mechanism
> 
> SYNOPSIS
>#include 
>#include 
>#include 
> 
>int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
> DESCRIPTION
>This  page  describes  the user-space notification mechanism pro‐
>vided by the Secure Computing (seccomp) facility.  As well as the
>use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
>COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>operation  described  in  seccomp(2), this mechanism involves the
>use of a number of related ioctl(2) operations (described below).
> 
>Overview
>In conventional usage of a seccomp filter, the decision about how
>to  treat  a particular system call is made by the filter itself.
>The user-space notification mechanism allows the handling of  the
>system  call  to  instead  be handed off to a user-space process.

"In contrast, the user notification mechanism allows to delegate the
handling of the system call of one process (target) to another
user-space process (supervisor)."?

>The advantages of doing this are that, by contrast with the  sec‐
>comp  filter,  which  is  running on a virtual machine inside the
>kernel, the user-space process has access to information that  is
>unavailable to the seccomp filter and it can perform actions that
>can't be performed from the seccomp filter.

This section reads a bit difficult imho:
"A suitably privileged supervisor can use the user notification
mechanism to perform actions in lieu of the target. The supervisor will
usually be able to retrieve information about the target and the
performed system call that the seccomp filter itself cannot."

> 
>In the discussion that follows, the process  that  has  installed
>the  seccomp filter is referred to as the target, and the process
>that is notified by  the  user-space  notification  mechanism  is
>referred  to  as  the  supervisor.  An overview of the steps per‐
>formed by these two processes is as follows:
> 
>1. The target process establishes a seccomp filter in  the  usual
>   manner, but with two differences:
> 
>   · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
> TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
> the  (successful)  seccomp(2) call is a new "listening" file
> descriptor that can be used to receive notifications.

I think it would be good to mention that seccomp notify fds are
O_CLOEXEC by default somewhere.

> 
>   · In cases where it is appropriate, the seccomp filter returns
> the  action value SECCOMP_RET_USER_NOTIF.  This return value
> will trigger a notification event.
> 
>2. In order that the supervisor process can obtain  notifications
>   using  the  listening  file  descriptor, (a duplicate of) that
>   file descriptor must be passed from the target process to  the
>   supervisor process.  One way in which this could be done is by
>   passing the file descriptor over a UNIX domain socket  connec‐
>   tion between the two processes (using the SCM_RIGHTS ancillary
>   

Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Michael Kerrisk (man-pages)
On 10/1/20 3:52 AM, Jann Horn wrote:

[...]

> I guess this is a nice point in favor of Michael's usual complaint
> that if there are no man pages for a feature by the time the feature
> lands upstream, there's a higher chance that the UAPI will suck
> forever...

Thanks for saving me the trouble of saying that (again).

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Michael Kerrisk (man-pages)
On 10/1/20 1:03 AM, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Tycho,
>>
>> Thanks for taking time to look at the page!
>>
>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:

[...]

┌─┐
│FIXME│
├─┤
│Interestingly, after the event  had  been  received, │
│the  file descriptor indicates as writable (verified │
│from the source code and by experiment). How is this │
│useful?  │
>>>
>>> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
>>> reasonable.
>>
>> No, I'm saying something more fundamental: why is the FD indicating as
>> writable? Can you write something to it? If yes, what? If not, then
>> why do these APIs want to say that the FD is writable?
> 
> You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
> NOTIFY_SEND are reading and writing events from the fd. I don't know
> that much about the poll interface though -- is it possible to
> indicate "here's a pseudo-read event"? It didn't look like it, so I
> just (ab-)used POLLIN and POLLOUT, but probably that's wrong.

I think the POLLIN thing is fine.

So, I think maybe I now understand what you intended with setting
POLLOUT: the notification has been received ("read") and now the
FD can be used to NOTIFY_SEND ("write") a response. Right?

If that's correct, I don't have a problem with it. I just wonder:
is it useful? IOW: are there situations where the process doing the
NOTIFY_SEND might want to test for POLLOUT because the it doesn't
know whether a NOTIFY_RECV has occurred? 

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Jann Horn
On Thu, Oct 1, 2020 at 3:52 AM Jann Horn  wrote:
> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen  wrote:
> > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  wrote:
> > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) 
> > > > wrote:
> > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk 
> > > > > > (man-pages) wrote:
> > > > > >>┌─┐
> > > > > >>│FIXME│
> > > > > >>├─┤
> > > > > >>│From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > > >>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > > >>│process terminates, then the ioctl()  simply  blocks │
> > > > > >>│(rather than returning an error to indicate that the │
> > > > > >>│target process no longer exists).│
> > > > > >
> > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > >
> > > > > Do you have a pointer that discussion? I could not find it with a
> > > > > quick search.
> > > > >
> > > > > > but it's a
> > > > > > bit sticky to do.
> > > > >
> > > > > Can you say a few words about the nature of the problem?
> > > >
> > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > notify about unused filter"). So maybe there's a bug here?
> > >
> > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > commit doesn't have any effect on this kind of usage.
> >
> > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > we don't have a count of all of them, unfortunately.
> >
> > We could maybe look inside the wait_list, but that will probably make
> > people angry :)
>
> The easiest way would probably be to open-code the semaphore-ish part,
> and let the semaphore and poll share the waitqueue. The current code
> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> entire semaphore would IMO be cleaner than that. And it's not like
> semaphore semantics are even a good fit for this code anyway.
>
> Let's see... if we didn't have the existing UAPI to worry about, I'd
> do it as follows (*completely* untested). That way, the ioctl would
> block exactly until either there actually is a request to deliver or
> there are no more users of the filter. The problem is that if we just
> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> an event loop and don't set O_NONBLOCK will be screwed. So we'd
> probably also have to add some stupid counter in place of the
> semaphore's counter that we can use to preserve the old behavior of
> returning -ENOENT once for each cancelled request. :(
>
> I guess this is a nice point in favor of Michael's usual complaint
> that if there are no man pages for a feature by the time the feature
> lands upstream, there's a higher chance that the UAPI will suck
> forever...

And I guess this would be the UAPI-compatible version - not actually
as terrible as I thought it might be. Do y'all want this? If so, feel
free to either turn this into a proper patch with Co-developed-by, or
tell me that I should do it and I'll try to get around to turning it
into something proper.

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 676d4af62103..d08c453fcc2c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -138,7 +138,7 @@ struct seccomp_kaddfd {
  * @notifications: A list of struct seccomp_knotif elements.
  */
 struct notification {
-   struct semaphore request;
+   bool canceled_reqs;
u64 next_id;
struct list_head notifications;
 };
@@ -859,7 +859,6 @@ static int seccomp_do_user_notification(int this_syscall,
list_add(&n.list, &match->notif->notifications);
INIT_LIST_HEAD(&n.addfd);

-   up(&match->notif->request);
wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
mutex_unlock(&match->notify_lock);

@@ -901,8 +900,20 @@ static int seccomp_do_user_notification(int this_syscall,
 * *reattach* to a notifier right now. If one is added, we'll need to
 * keep track of the notif itself and make sure they match here.
 */
-   if (match->notif)
+   if (match->notif) {
list_del(&n.list);
+
+   /*
+* We are stuck with a UAPI that requires that after a spurious
+* wakeup, SECCOMP_IOCTL_NOTIF_RECV must return immediately.
+* This is the tracking for that, keeping track of whether we
+* canceled a request after waking waiters, but before userspace
+* picked up the notification.
+*/
+   if (n.state == SECCOMP_NOTIFY_INIT)
+

Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Jann Horn
On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen  wrote:
> On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  wrote:
> > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) 
> > > wrote:
> > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) 
> > > > > wrote:
> > > > >>┌─┐
> > > > >>│FIXME│
> > > > >>├─┤
> > > > >>│From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > >>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > >>│process terminates, then the ioctl()  simply  blocks │
> > > > >>│(rather than returning an error to indicate that the │
> > > > >>│target process no longer exists).│
> > > > >
> > > > > Yeah, I think Christian wanted to fix this at some point,
> > > >
> > > > Do you have a pointer that discussion? I could not find it with a
> > > > quick search.
> > > >
> > > > > but it's a
> > > > > bit sticky to do.
> > > >
> > > > Can you say a few words about the nature of the problem?
> > >
> > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > notify about unused filter"). So maybe there's a bug here?
> >
> > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > commit doesn't have any effect on this kind of usage.
>
> Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> we don't have a count of all of them, unfortunately.
>
> We could maybe look inside the wait_list, but that will probably make
> people angry :)

The easiest way would probably be to open-code the semaphore-ish part,
and let the semaphore and poll share the waitqueue. The current code
kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
entire semaphore would IMO be cleaner than that. And it's not like
semaphore semantics are even a good fit for this code anyway.

Let's see... if we didn't have the existing UAPI to worry about, I'd
do it as follows (*completely* untested). That way, the ioctl would
block exactly until either there actually is a request to deliver or
there are no more users of the filter. The problem is that if we just
apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
an event loop and don't set O_NONBLOCK will be screwed. So we'd
probably also have to add some stupid counter in place of the
semaphore's counter that we can use to preserve the old behavior of
returning -ENOENT once for each cancelled request. :(

I guess this is a nice point in favor of Michael's usual complaint
that if there are no man pages for a feature by the time the feature
lands upstream, there's a higher chance that the UAPI will suck
forever...



diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 676d4af62103..f0f4c68e0bc6 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -138,7 +138,6 @@ struct seccomp_kaddfd {
  * @notifications: A list of struct seccomp_knotif elements.
  */
 struct notification {
-   struct semaphore request;
u64 next_id;
struct list_head notifications;
 };
@@ -859,7 +858,6 @@ static int seccomp_do_user_notification(int this_syscall,
list_add(&n.list, &match->notif->notifications);
INIT_LIST_HEAD(&n.addfd);

-   up(&match->notif->request);
wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
mutex_unlock(&match->notify_lock);

@@ -1175,9 +1173,10 @@ find_notification(struct seccomp_filter *filter, u64 id)


 static long seccomp_notify_recv(struct seccomp_filter *filter,
-   void __user *buf)
+   void __user *buf, bool blocking)
 {
struct seccomp_knotif *knotif = NULL, *cur;
+   DECLARE_WAITQUEUE(wait, current);
struct seccomp_notif unotif;
ssize_t ret;

@@ -1190,11 +1189,9 @@ static long seccomp_notify_recv(struct
seccomp_filter *filter,

memset(&unotif, 0, sizeof(unotif));

-   ret = down_interruptible(&filter->notif->request);
-   if (ret < 0)
-   return ret;
-
mutex_lock(&filter->notify_lock);
+
+retry:
list_for_each_entry(cur, &filter->notif->notifications, list) {
if (cur->state == SECCOMP_NOTIFY_INIT) {
knotif = cur;
@@ -1202,14 +1199,32 @@ static long seccomp_notify_recv(struct
seccomp_filter *filter,
}
}

-   /*
-* If we didn't find a notification, it could be that the task was
-* interrupted by a fatal signal between the time we were woken and
-* when we were able to acquire the rw lock.
-*/
if (!knotif) {
-   ret = -ENOENT;
-  

Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Kees Cook
On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> [...] I did :-)

Yay! Thank you!

> [...]
>Overview
>In conventional usage of a seccomp filter, the decision about how
>to  treat  a particular system call is made by the filter itself.
>The user-space notification mechanism allows the handling of  the
>system  call  to  instead  be handed off to a user-space process.
>The advantages of doing this are that, by contrast with the  sec‐
>comp  filter,  which  is  running on a virtual machine inside the
>kernel, the user-space process has access to information that  is
>unavailable to the seccomp filter and it can perform actions that
>can't be performed from the seccomp filter.

I might clarify a bit with something like (though maybe the
target/supervisor paragraph needs to be moved to the start):

This is used for performing syscalls on behalf of the target,
rather than having the supervisor make security policy decisions
about the syscall, which would be inherently race-prone. The
target's syscall should either be handled by the supervisor or
allowed to continue normally in the kernel (where standard security
policies will be applied).

I'll comment more later, but I've run out of time today and I didn't see
anyone mention this detail yet in the existing threads... :)

-- 
Kees Cook


Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Tycho Andersen
On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  wrote:
> > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) 
> > > > wrote:
> > > >>┌─┐
> > > >>│FIXME│
> > > >>├─┤
> > > >>│From my experiments,  it  appears  that  if  a  SEC‐ │
> > > >>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > >>│process terminates, then the ioctl()  simply  blocks │
> > > >>│(rather than returning an error to indicate that the │
> > > >>│target process no longer exists).│
> > > >
> > > > Yeah, I think Christian wanted to fix this at some point,
> > >
> > > Do you have a pointer that discussion? I could not find it with a
> > > quick search.
> > >
> > > > but it's a
> > > > bit sticky to do.
> > >
> > > Can you say a few words about the nature of the problem?
> >
> > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > notify about unused filter"). So maybe there's a bug here?
> 
> That thing only notifies on ->poll, it doesn't unblock ioctls; and
> Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> commit doesn't have any effect on this kind of usage.

Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
we don't have a count of all of them, unfortunately.

We could maybe look inside the wait_list, but that will probably make
people angry :)

Tycho


Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Jann Horn
On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  wrote:
> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) 
> > > wrote:
> > >>┌─┐
> > >>│FIXME│
> > >>├─┤
> > >>│From my experiments,  it  appears  that  if  a  SEC‐ │
> > >>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > >>│process terminates, then the ioctl()  simply  blocks │
> > >>│(rather than returning an error to indicate that the │
> > >>│target process no longer exists).│
> > >
> > > Yeah, I think Christian wanted to fix this at some point,
> >
> > Do you have a pointer that discussion? I could not find it with a
> > quick search.
> >
> > > but it's a
> > > bit sticky to do.
> >
> > Can you say a few words about the nature of the problem?
>
> I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> notify about unused filter"). So maybe there's a bug here?

That thing only notifies on ->poll, it doesn't unblock ioctls; and
Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
commit doesn't have any effect on this kind of usage.


Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Tycho Andersen
On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Tycho,
> 
> Thanks for taking time to look at the page!
> 
> On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> >>2. In order that the supervisor process can obtain  notifications
> >>   using  the  listening  file  descriptor, (a duplicate of) that
> >>   file descriptor must be passed from the target process to  the
> >>   supervisor process.  One way in which this could be done is by
> >>   passing the file descriptor over a UNIX domain socket  connec‐
> >>   tion between the two processes (using the SCM_RIGHTS ancillary
> >>   message type described in unix(7)).   Another  possibility  is
> >>   that  the  supervisor  might  inherit  the file descriptor via
> >>   fork(2).
> > 
> > It is technically possible to inherit the fd via fork, but is it
> > really that useful? The child process wouldn't be able to actually do
> > the syscall in question, since it would have the same filter.
> 
> D'oh! Yes, of course.
> 
> I think I was reaching because in an earlier conversation
> you replied:
> 
> [[
> > 3. The "target process" passes the "listening file descriptor"
> >to the "monitoring process" via the UNIX domain socket.
> 
> or some other means, it doesn't have to be with SCM_RIGHTS.
> ]]
> 
> So, what other means?
> 
> Anyway, I removed the sentence mentioning fork().

Whatever means people want :). fork() could work (it's how some of the
tests for this feature work, but it's not particularly useful I don't
think), clone(CLONE_FILES) is similar, seccomp_putfd, or maybe even
cloning it via some pidfd interface that might be invented for
re-opening files.

> >>┌─┐
> >>│FIXME│
> >>├─┤
> >>│From my experiments,  it  appears  that  if  a  SEC‐ │
> >>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> >>│process terminates, then the ioctl()  simply  blocks │
> >>│(rather than returning an error to indicate that the │
> >>│target process no longer exists).│
> > 
> > Yeah, I think Christian wanted to fix this at some point,
> 
> Do you have a pointer that discussion? I could not find it with a 
> quick search.
> 
> > but it's a
> > bit sticky to do.
> 
> Can you say a few words about the nature of the problem?

I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
notify about unused filter"). So maybe there's a bug here?

> >>┌─┐
> >>│FIXME│
> >>├─┤
> >>│Interestingly, after the event  had  been  received, │
> >>│the  file descriptor indicates as writable (verified │
> >>│from the source code and by experiment). How is this │
> >>│useful?  │
> > 
> > You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> > reasonable.
> 
> No, I'm saying something more fundamental: why is the FD indicating as
> writable? Can you write something to it? If yes, what? If not, then
> why do these APIs want to say that the FD is writable?

You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
NOTIFY_SEND are reading and writing events from the fd. I don't know
that much about the poll interface though -- is it possible to
indicate "here's a pseudo-read event"? It didn't look like it, so I
just (ab-)used POLLIN and POLLOUT, but probably that's wrong.

Tycho


Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Michael Kerrisk (man-pages)
Hi Tycho,

Thanks for taking time to look at the page!

On 9/30/20 5:03 PM, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>>2. In order that the supervisor process can obtain  notifications
>>   using  the  listening  file  descriptor, (a duplicate of) that
>>   file descriptor must be passed from the target process to  the
>>   supervisor process.  One way in which this could be done is by
>>   passing the file descriptor over a UNIX domain socket  connec‐
>>   tion between the two processes (using the SCM_RIGHTS ancillary
>>   message type described in unix(7)).   Another  possibility  is
>>   that  the  supervisor  might  inherit  the file descriptor via
>>   fork(2).
> 
> It is technically possible to inherit the fd via fork, but is it
> really that useful? The child process wouldn't be able to actually do
> the syscall in question, since it would have the same filter.

D'oh! Yes, of course.

I think I was reaching because in an earlier conversation
you replied:

[[
> 3. The "target process" passes the "listening file descriptor"
>to the "monitoring process" via the UNIX domain socket.

or some other means, it doesn't have to be with SCM_RIGHTS.
]]

So, what other means?

Anyway, I removed the sentence mentioning fork().

>>   The  information  in  the notification can be used to discover
>>   the values of pointer arguments for the target process's  sys‐
>>   tem call.  (This is something that can't be done from within a
>>   seccomp filter.)  To do this (and  assuming  it  has  suitable
> 
> s/To do this/One way to accomplish this/ perhaps, since there are
> others.

Yes, thanks, done.

>>   permissions),   the   supervisor   opens   the   corresponding
>>   /proc/[pid]/mem file, seeks to the memory location that corre‐
>>   sponds to one of the pointer arguments whose value is supplied
>>   in the notification event, and reads bytes from that location.
>>   (The supervisor must be careful to avoid a race condition that
>>   can occur when doing this; see the  description  of  the  SEC‐
>>   COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
>>   tion, the supervisor can access other system information  that
>>   is  visible  in  user space but which is not accessible from a
>>   seccomp filter.
>>
>>   ┌─┐
>>   │FIXME│
>>   ├─┤
>>   │Suppose we are reading a pathname from /proc/PID/mem │
>>   │for  a system call such as mkdir(). The pathname can │
>>   │be an arbitrary length. How do we know how much (how │
>>   │many pages) to read from /proc/PID/mem?  │
>>   └─┘
> 
> PATH_MAX, I suppose.

Yes, I misunderstood a fundamental detail here, as Jann 
also confirmed.

>>┌─┐
>>│FIXME│
>>├─┤
>>│From my experiments,  it  appears  that  if  a  SEC‐ │
>>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>>│process terminates, then the ioctl()  simply  blocks │
>>│(rather than returning an error to indicate that the │
>>│target process no longer exists).│
> 
> Yeah, I think Christian wanted to fix this at some point,

Do you have a pointer that discussion? I could not find it with a 
quick search.

> but it's a
> bit sticky to do.

Can you say a few words about the nature of the problem?

In the meantime. I think this merits a note under BUGS, and
I've added one.

> Note that if you e.g. rely on fork() above, the
> filter is shared with your current process, and this notification
> would never be possible. Perhaps another reason to omit that from the
> man page.

(Yes, as noted above, I removed that sentence.)

>>SECCOMP_IOCTL_NOTIF_ID_VALID
>>   This operation can be used to check that a notification ID
>>   returned by an earlier SECCOMP_IOCTL_NOTIF_RECV  operation
>>   is  still  valid  (i.e.,  that  the  target  process still
>>   exists).
>>
>>   The third ioctl(2) argument is a  pointer  to  the  cookie
>>   (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>>
>>   This  operation is necessary to avoid race conditions that
>>   can  occur   when   the   pid   returned   by   the   SEC‐
>>   COMP_IOCTL_NOTIF_RECV   operation   terminates,  and  that
>>   process ID is reused by another process.   An  example  of

Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Jann Horn
On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
 wrote:
> I knew it would be a big ask, but below is kind of the manual page
> I was hoping you might write [1] for the seccomp user-space notification
> mechanism. Since you didn't (and because 5.9 adds various new pieces
> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> that also will need documenting [2]), I did :-). But of course I may
> have made mistakes...
[...]
> NAME
>seccomp_user_notif - Seccomp user-space notification mechanism
>
> SYNOPSIS
>#include 
>#include 
>#include 
>
>int seccomp(unsigned int operation, unsigned int flags, void *args);

Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS
of the ioctl_* manpages?

> DESCRIPTION
>This  page  describes  the user-space notification mechanism pro‐
>vided by the Secure Computing (seccomp) facility.  As well as the
>use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
>COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>operation  described  in  seccomp(2), this mechanism involves the
>use of a number of related ioctl(2) operations (described below).
>
>Overview
>In conventional usage of a seccomp filter, the decision about how
>to  treat  a particular system call is made by the filter itself.
>The user-space notification mechanism allows the handling of  the
>system  call  to  instead  be handed off to a user-space process.
>The advantages of doing this are that, by contrast with the  sec‐
>comp  filter,  which  is  running on a virtual machine inside the
>kernel, the user-space process has access to information that  is
>unavailable to the seccomp filter and it can perform actions that
>can't be performed from the seccomp filter.
>
>In the discussion that follows, the process  that  has  installed
>the  seccomp filter is referred to as the target, and the process

Technically, this definition of "target" is a bit inaccurate because:

 - seccomp filters are inherited
 - seccomp filters apply to threads, not processes
 - seccomp filters can be semi-remotely installed via TSYNC

(I assume that in manpages, we should try to go for the "a task is a
thread and a thread group is a process" definition, right?)

Perhaps "the threads on which the seccomp filter is installed are
referred to as the target", or something like that would be better?

>that is notified by  the  user-space  notification  mechanism  is
>referred  to  as  the  supervisor.  An overview of the steps per‐
>formed by these two processes is as follows:
>
>1. The target process establishes a seccomp filter in  the  usual
>   manner, but with two differences:
>
>   · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
> TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
> the  (successful)  seccomp(2) call is a new "listening" file
> descriptor that can be used to receive notifications.
>
>   · In cases where it is appropriate, the seccomp filter returns
> the  action value SECCOMP_RET_USER_NOTIF.  This return value
> will trigger a notification event.
>
>2. In order that the supervisor process can obtain  notifications
>   using  the  listening  file  descriptor, (a duplicate of) that
>   file descriptor must be passed from the target process to  the
>   supervisor process.  One way in which this could be done is by
>   passing the file descriptor over a UNIX domain socket  connec‐
>   tion between the two processes (using the SCM_RIGHTS ancillary
>   message type described in unix(7)).   Another  possibility  is
>   that  the  supervisor  might  inherit  the file descriptor via
>   fork(2).

With the caveat that if the supervisor inherits the file descriptor
via fork(), that (more or less) implies that the supervisor is subject
to the same filter (although it could bypass the filter using a helper
thread that responds SECCOMP_USER_NOTIF_FLAG_CONTINUE, but I don't
expect any clean software to do that).

>3. The supervisor process will receive notification events on the
>   listening  file  descriptor.   These  events  are  returned as
>   structures of type seccomp_notif.  Because this structure  and
>   its  size may evolve over kernel versions, the supervisor must
>   first determine the size of  this  structure  using  the  sec‐
>   comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>   structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>   cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>   to receive notification events.   In  addition,the  supervisor
>   alloca

Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Tycho Andersen
On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>2. In order that the supervisor process can obtain  notifications
>   using  the  listening  file  descriptor, (a duplicate of) that
>   file descriptor must be passed from the target process to  the
>   supervisor process.  One way in which this could be done is by
>   passing the file descriptor over a UNIX domain socket  connec‐
>   tion between the two processes (using the SCM_RIGHTS ancillary
>   message type described in unix(7)).   Another  possibility  is
>   that  the  supervisor  might  inherit  the file descriptor via
>   fork(2).

It is technically possible to inherit the fd via fork, but is it
really that useful? The child process wouldn't be able to actually do
the syscall in question, since it would have the same filter.

>   The  information  in  the notification can be used to discover
>   the values of pointer arguments for the target process's  sys‐
>   tem call.  (This is something that can't be done from within a
>   seccomp filter.)  To do this (and  assuming  it  has  suitable

s/To do this/One way to accomplish this/ perhaps, since there are
others.

>   permissions),   the   supervisor   opens   the   corresponding
>   /proc/[pid]/mem file, seeks to the memory location that corre‐
>   sponds to one of the pointer arguments whose value is supplied
>   in the notification event, and reads bytes from that location.
>   (The supervisor must be careful to avoid a race condition that
>   can occur when doing this; see the  description  of  the  SEC‐
>   COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
>   tion, the supervisor can access other system information  that
>   is  visible  in  user space but which is not accessible from a
>   seccomp filter.
> 
>   ┌─┐
>   │FIXME│
>   ├─┤
>   │Suppose we are reading a pathname from /proc/PID/mem │
>   │for  a system call such as mkdir(). The pathname can │
>   │be an arbitrary length. How do we know how much (how │
>   │many pages) to read from /proc/PID/mem?  │
>   └─┘

PATH_MAX, I suppose.

>┌─┐
>│FIXME│
>├─┤
>│From my experiments,  it  appears  that  if  a  SEC‐ │
>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>│process terminates, then the ioctl()  simply  blocks │
>│(rather than returning an error to indicate that the │
>│target process no longer exists).│

Yeah, I think Christian wanted to fix this at some point, but it's a
bit sticky to do. Note that if you e.g. rely on fork() above, the
filter is shared with your current process, and this notification
would never be possible. Perhaps another reason to omit that from the
man page.

>SECCOMP_IOCTL_NOTIF_ID_VALID
>   This operation can be used to check that a notification ID
>   returned by an earlier SECCOMP_IOCTL_NOTIF_RECV  operation
>   is  still  valid  (i.e.,  that  the  target  process still
>   exists).
> 
>   The third ioctl(2) argument is a  pointer  to  the  cookie
>   (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
> 
>   This  operation is necessary to avoid race conditions that
>   can  occur   when   the   pid   returned   by   the   SEC‐
>   COMP_IOCTL_NOTIF_RECV   operation   terminates,  and  that
>   process ID is reused by another process.   An  example  of
>   this kind of race is the following
> 
>   1. A  notification  is  generated  on  the  listening file
>  descriptor.  The returned  seccomp_notif  contains  the
>  PID of the target process.
> 
>   2. The target process terminates.
> 
>   3. Another process is created on the system that by chance
>  reuses the PID that was freed when the  target  process
>  terminates.
> 
>   4. The  supervisor  open(2)s  the /proc/[pid]/mem file for
>  the PID obtained in step 1, with the intention of (say)
>  inspecting the memory locations that contains the argu‐
>  ments of the system call that triggered  the  notifica‐
>  tion in step 1.
> 
>   In the above scenario, the risk is that the supervisor may
>

Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Tycho Andersen
On Wed, Sep 30, 2020 at 09:03:36AM -0600, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> >┌─┐
> >│FIXME│
> >├─┤
> >│Interestingly, after the event  had  been  received, │
> >│the  file descriptor indicates as writable (verified │
> >│from the source code and by experiment). How is this │
> >│useful?  │
> 
> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> reasonable.

If we make this change, I suppose we should also drop EPOLLRDNORM from
things which have not been received yet, since they're not really
readable.

Tycho


For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Michael Kerrisk (man-pages)
Hi Tycho, Sargun (and all),

I knew it would be a big ask, but below is kind of the manual page
I was hoping you might write [1] for the seccomp user-space notification
mechanism. Since you didn't (and because 5.9 adds various new pieces 
such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD 
that also will need documenting [2]), I did :-). But of course I may 
have made mistakes...

I've shown the rendered version of the page below, and would love
to receive review comments from you and others, and acks, etc.

There are a few FIXMEs sprinkled into the page, including one
that relates to what appears to me to be a misdesign (possibly 
fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV 
operation. I would be especially interested in feedback on that
FIXME, and also of course the other FIXMEs.

The page includes an extensive (albeit slightly contrived)
example program, and I would be happy also to receive comments
on that program.

The page source currently sits in a branch (along with the text
that you sent me for the seccomp(2) page) at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif

Thanks,

Michael

[1] 
https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd...@gmail.com/#t
[2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?

=

NAME
   seccomp_user_notif - Seccomp user-space notification mechanism

SYNOPSIS
   #include 
   #include 
   #include 

   int seccomp(unsigned int operation, unsigned int flags, void *args);

DESCRIPTION
   This  page  describes  the user-space notification mechanism pro‐
   vided by the Secure Computing (seccomp) facility.  As well as the
   use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
   COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
   operation  described  in  seccomp(2), this mechanism involves the
   use of a number of related ioctl(2) operations (described below).

   Overview
   In conventional usage of a seccomp filter, the decision about how
   to  treat  a particular system call is made by the filter itself.
   The user-space notification mechanism allows the handling of  the
   system  call  to  instead  be handed off to a user-space process.
   The advantages of doing this are that, by contrast with the  sec‐
   comp  filter,  which  is  running on a virtual machine inside the
   kernel, the user-space process has access to information that  is
   unavailable to the seccomp filter and it can perform actions that
   can't be performed from the seccomp filter.

   In the discussion that follows, the process  that  has  installed
   the  seccomp filter is referred to as the target, and the process
   that is notified by  the  user-space  notification  mechanism  is
   referred  to  as  the  supervisor.  An overview of the steps per‐
   formed by these two processes is as follows:

   1. The target process establishes a seccomp filter in  the  usual
  manner, but with two differences:

  · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
the  (successful)  seccomp(2) call is a new "listening" file
descriptor that can be used to receive notifications.

  · In cases where it is appropriate, the seccomp filter returns
the  action value SECCOMP_RET_USER_NOTIF.  This return value
will trigger a notification event.

   2. In order that the supervisor process can obtain  notifications
  using  the  listening  file  descriptor, (a duplicate of) that
  file descriptor must be passed from the target process to  the
  supervisor process.  One way in which this could be done is by
  passing the file descriptor over a UNIX domain socket  connec‐
  tion between the two processes (using the SCM_RIGHTS ancillary
  message type described in unix(7)).   Another  possibility  is
  that  the  supervisor  might  inherit  the file descriptor via
  fork(2).

   3. The supervisor process will receive notification events on the
  listening  file  descriptor.   These  events  are  returned as
  structures of type seccomp_notif.  Because this structure  and
  its  size may evolve over kernel versions, the supervisor must
  first determine the size of  this  structure  using  the  sec‐
  comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
  structure of type seccomp_notif_sizes.  The  supervisor  allo‐
  cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
  to receive notification events.   In  addition,the  supervisor
  allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
  co