Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-06-01 Thread Kees Cook
On Fri, May 27, 2016 at 1:14 PM, Andy Lutomirski  wrote:
> On Fri, May 27, 2016 at 12:52 PM, Andy Lutomirski  wrote:
>>> Right, I know, it's aesthetically much nicer that way, but I really
>>> want to stay totally paranoid and keep seccomp absolutely first on the
>>> path.
>>>
>>> How about this: we'll use this patch as-is for now, since I'd like to
>>> be able to start getting feedback from the container-using folks ASAP,
>>> and then we can redesign the 2-phase system going forward from there.
>>>
>>
>> I think I'd rather change the ABI as few times as possible.  On the
>> other hand, it's still early, and I see nothing wrong with adding it
>> to -next.
>
> To get the ball rolling:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=seccomp
>
> It's incomplete, but it should be straightforward to finish it.  The
> only interesting bit is dealing with SECCOMP_RET_TRACE.

I did a bit more from there, though it needs further cleanup (I see my
"const" fixes landed in the wrong patch). This passes my tests on x86;
the other architectures still need reordering and testing:

http://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/reorder-ptrace

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-05-27 Thread Kees Cook
On Fri, May 27, 2016 at 4:20 PM, Andy Lutomirski  wrote:
> On May 27, 2016 3:38 PM, "Kees Cook"  wrote:
>>
>> On Fri, May 27, 2016 at 12:52 PM, Andy Lutomirski  
>> wrote:
>> > On May 27, 2016 11:42 AM, "Kees Cook"  wrote:
>> >>
>> >> On Thu, May 26, 2016 at 9:45 PM, Andy Lutomirski  
>> >> wrote:
>> >> > On Thu, May 26, 2016 at 7:41 PM, Kees Cook  
>> >> > wrote:
>> >> >> On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski  
>> >> >> wrote:
>> >> >>> On Thu, May 26, 2016 at 2:04 PM, Kees Cook  
>> >> >>> wrote:
>> >> >>>> One problem with seccomp was that ptrace could be used to change a
>> >> >>>> syscall after seccomp filtering had completed. This was a well documented
>> >> >>>> limitation, and it was recommended to block ptrace when defining a filter
>> >> >>>> to avoid this problem. This can be quite a limitation for containers or
>> >> >>>> other places where ptrace is desired even under seccomp filters.
>> >> >>>>
>> >> >>>> Since seccomp filtering has been split into pre-trace and trace phases
>> >> >>>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
>> >> >>>> after ptrace. This makes that change, and updates the test suite for
>> >> >>>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
>> >> >>>
>> >> >>> I like fixing the hole, but I don't like this fix.
>> >> >>>
>> >> >>> The two-phase seccomp mechanism is messy.  I wrote it because it was a
>> >> >>> huge speedup.  Since then, I've made a ton of changes to the way that
>> >> >>> x86 syscalls work, and there are two relevant effects: the slow path
>> >> >>> is quite fast, and the phase-1-only path isn't really a win any more.
>> >> >>>
>> >> >>> I suggest that we fix the hole by simplifying the code instead of making it
>> >> >>> even more complicated.  Let's back out the two-phase mechanism (but
>> >> >>> keep the ability for arch code to supply seccomp_data) and then just
>> >> >>> reorder it so that seccomp happens after ptrace.  The result should be
>> >> >>> considerably simpler.  (We'll still have to answer the question of
>> >> >>> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
>> >> >>> maybe the answer is to just let it through -- after all,
>> >> >>> SECCOMP_RET_TRACE might be a request by a tracer to do its own
>> >> >>> internal filtering.)
>> >> >>
>> >> >> I'm really against this. I think seccomp needs to stay first,
>> >> >
>> >> > Why?  What use case is improved with it going first?
>> >>
>> >> I feel that the critical purpose of seccomp is to minimize attack
>> >> surface. To that end, I am strongly against anything coming before it
>> >> in the syscall path. I really do not want ptrace going first, I think
>> >> it's just asking for bugs.
>> >
>> > I disagree in this case.  There's no actual code surface opened up.
>> > If seccomp allows even a single syscall through and there's a ptracer
>> > attached, then the ptrace code is exposed.  As far as ptrace is
>> > concerned, the syscall number is just a number, and ptrace has
>> > basically no awareness of the arguments.
>>
>> No, I completely disagree: there is a significant amount of surface
>> exposed. With a tracer attached there is significantly more happening
>> before the filter would be checked. Even less obvious things like
>> signal delivery, etc get exposed. Seccomp must be first -- this is
>> its basic design principle. Bugs creep in, unexpected combinations
>> creep in, etc. Seccomp must mitigate this and be first on the syscall
>> path. The paranoia of this design principle must remain in place, even
>> at the expense of some inelegant results.
>
> But this only works if the filter is literally "deny everything".  If
> there is even a single syscall allowed and a ptracer is attached, then
> the whole ptrace machinery is exposed anyway.

That's an excellent point. Thanks for persisting on this, I'm starting
to come around. :)

> Users who are this paranoid about attack surface need to disable
> ptrace, full stop.  If you can do the ptrace(2) syscall, then you can
> invoke all the nasty code paths by yourself, and there is nothing
> seccomp can do about it.  All seccomp can do is prevent ptrace from
> generating a syscall that would otherwise be filtered out.
>
> Let's look at the actual supposed attack surface:
>
> if (unlikely(work & _TIF_SYSCALL_EMU))
>         ret = -1L;
>
> if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
>     tracehook_report_syscall_entry(regs))
>         ret = -1L;
>
> That's all.

Yeah, looking through this, I see that audit isn't part of this, so
I'm much more relaxed.

> The only way that TIF_SYSCALL_EMU or TIF_SYSCALL_TRACE gets set is if
> a ptracer is attached and uses PTRACE_SYSEMU, PTRACE_SYSCALL or
> similar.  If that has happened, then there's very, very little that
> seccomp can possibly do to reduce attack surface.  First, literally
> any syscall that results in SECCOMP_RET_OK will cause all of the same
> 

Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-05-27 Thread Andy Lutomirski
On May 27, 2016 3:38 PM, "Kees Cook"  wrote:
>
> On Fri, May 27, 2016 at 12:52 PM, Andy Lutomirski  wrote:
> > On May 27, 2016 11:42 AM, "Kees Cook"  wrote:
> >>
> >> On Thu, May 26, 2016 at 9:45 PM, Andy Lutomirski  
> >> wrote:
> >> > On Thu, May 26, 2016 at 7:41 PM, Kees Cook  wrote:
> >> >> On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski  
> >> >> wrote:
> >> >>> On Thu, May 26, 2016 at 2:04 PM, Kees Cook  
> >> >>> wrote:
> >> >>>> One problem with seccomp was that ptrace could be used to change a
> >> >>>> syscall after seccomp filtering had completed. This was a well documented
> >> >>>> limitation, and it was recommended to block ptrace when defining a filter
> >> >>>> to avoid this problem. This can be quite a limitation for containers or
> >> >>>> other places where ptrace is desired even under seccomp filters.
> >> >>>>
> >> >>>> Since seccomp filtering has been split into pre-trace and trace phases
> >> >>>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
> >> >>>> after ptrace. This makes that change, and updates the test suite for
> >> >>>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
> >> >>>
> >> >>> I like fixing the hole, but I don't like this fix.
> >> >>>
> >> >>> The two-phase seccomp mechanism is messy.  I wrote it because it was a
> >> >>> huge speedup.  Since then, I've made a ton of changes to the way that
> >> >>> x86 syscalls work, and there are two relevant effects: the slow path
> >> >>> is quite fast, and the phase-1-only path isn't really a win any more.
> >> >>>
> >> >>> I suggest that we fix the hole by simplifying the code instead of making it
> >> >>> even more complicated.  Let's back out the two-phase mechanism (but
> >> >>> keep the ability for arch code to supply seccomp_data) and then just
> >> >>> reorder it so that seccomp happens after ptrace.  The result should be
> >> >>> considerably simpler.  (We'll still have to answer the question of
> >> >>> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
> >> >>> maybe the answer is to just let it through -- after all,
> >> >>> SECCOMP_RET_TRACE might be a request by a tracer to do its own
> >> >>> internal filtering.)
> >> >>
> >> >> I'm really against this. I think seccomp needs to stay first,
> >> >
> >> > Why?  What use case is improved with it going first?
> >>
> >> I feel that the critical purpose of seccomp is to minimize attack
> >> surface. To that end, I am strongly against anything coming before it
> >> in the syscall path. I really do not want ptrace going first, I think
> >> it's just asking for bugs.
> >
> > I disagree in this case.  There's no actual code surface opened up.
> > If seccomp allows even a single syscall through and there's a ptracer
> > attached, then the ptrace code is exposed.  As far as ptrace is
> > concerned, the syscall number is just a number, and ptrace has
> > basically no awareness of the arguments.
>
> No, I completely disagree: there is a significant amount of surface
> exposed. With a tracer attached there is significantly more happening
> before the filter would be checked. Even less obvious things like
> signal delivery, etc get exposed. Seccomp must be first -- this is
> its basic design principle. Bugs creep in, unexpected combinations
> creep in, etc. Seccomp must mitigate this and be first on the syscall
> path. The paranoia of this design principle must remain in place, even
> at the expense of some inelegant results.

But this only works if the filter is literally "deny everything".  If
there is even a single syscall allowed and a ptracer is attached, then
the whole ptrace machinery is exposed anyway.

Users who are this paranoid about attack surface need to disable
ptrace, full stop.  If you can do the ptrace(2) syscall, then you can
invoke all the nasty code paths by yourself, and there is nothing
seccomp can do about it.  All seccomp can do is prevent ptrace from
generating a syscall that would otherwise be filtered out.
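
To make that last point concrete, here is a toy userspace model of the two
orderings. It is only a sketch, not kernel code: every name in it
(toy_filter_allows, toy_tracer_rewrite, the TOY_* constants) is hypothetical,
and the "tracer" is assumed to rewrite every syscall into a forbidden one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical syscall numbers; they stand in for real ones. */
enum { TOY_SYS_ALLOWED = 1, TOY_SYS_FORBIDDEN = 2 };

/* Value returned when the toy entry path refuses the syscall. */
#define TOY_BLOCKED (-1)

/* Toy seccomp filter: only TOY_SYS_ALLOWED passes. */
static bool toy_filter_allows(int nr)
{
	return nr == TOY_SYS_ALLOWED;
}

/* Toy tracer: rewrites every syscall it sees to the forbidden one. */
static int toy_tracer_rewrite(int nr)
{
	(void)nr;
	return TOY_SYS_FORBIDDEN;
}

/* Filter first, tracer second: the rewritten number is never checked. */
static int toy_entry_seccomp_first(int nr, bool traced)
{
	if (!toy_filter_allows(nr))
		return TOY_BLOCKED;
	if (traced)
		nr = toy_tracer_rewrite(nr);
	return nr;
}

/* Tracer first, filter second: the filter sees the final number. */
static int toy_entry_ptrace_first(int nr, bool traced)
{
	if (traced)
		nr = toy_tracer_rewrite(nr);
	if (!toy_filter_allows(nr))
		return TOY_BLOCKED;
	return nr;
}

/* Usage: the hole exists in one ordering and is closed in the other. */
void toy_selfcheck(void)
{
	assert(toy_entry_seccomp_first(TOY_SYS_ALLOWED, true) == TOY_SYS_FORBIDDEN);
	assert(toy_entry_ptrace_first(TOY_SYS_ALLOWED, true) == TOY_BLOCKED);
}
```

In the model, the only thing the ordering changes is whether the filter
judges the syscall number before or after the tracer has touched it.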

Let's look at the actual supposed attack surface:


if (unlikely(work & _TIF_SYSCALL_EMU))
        ret = -1L;

if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
    tracehook_report_syscall_entry(regs))
        ret = -1L;

That's all.

The only way that TIF_SYSCALL_EMU or TIF_SYSCALL_TRACE gets set is if
a ptracer is attached and uses PTRACE_SYSEMU, PTRACE_SYSCALL or
similar.  If that has happened, then there's very, very little that
seccomp can possibly do to reduce attack surface.  First, literally
any syscall that results in SECCOMP_RET_OK will cause all of the same
kernel code paths to run.  Second, most of those code paths can be
triggered without any syscall at all by using PTRACE_SINGLESTEP
instead.

So I challenge you to find a realistic scenario (i.e. something that a
real program might actually program into seccomp) in which running
seccomp before ptrace avoids even a single line of code worth of
attack surface.

On the flip side, 

Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-05-27 Thread Kees Cook
On Fri, May 27, 2016 at 12:52 PM, Andy Lutomirski  wrote:
> On May 27, 2016 11:42 AM, "Kees Cook"  wrote:
>>
>> On Thu, May 26, 2016 at 9:45 PM, Andy Lutomirski  wrote:
>> > On Thu, May 26, 2016 at 7:41 PM, Kees Cook  wrote:
>> >> On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski  
>> >> wrote:
>> >>> On Thu, May 26, 2016 at 2:04 PM, Kees Cook  wrote:
>> >>>> One problem with seccomp was that ptrace could be used to change a
>> >>>> syscall after seccomp filtering had completed. This was a well documented
>> >>>> limitation, and it was recommended to block ptrace when defining a filter
>> >>>> to avoid this problem. This can be quite a limitation for containers or
>> >>>> other places where ptrace is desired even under seccomp filters.
>> >>>>
>> >>>> Since seccomp filtering has been split into pre-trace and trace phases
>> >>>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
>> >>>> after ptrace. This makes that change, and updates the test suite for
>> >>>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
>> >>>
>> >>> I like fixing the hole, but I don't like this fix.
>> >>>
>> >>> The two-phase seccomp mechanism is messy.  I wrote it because it was a
>> >>> huge speedup.  Since then, I've made a ton of changes to the way that
>> >>> x86 syscalls work, and there are two relevant effects: the slow path
>> >>> is quite fast, and the phase-1-only path isn't really a win any more.
>> >>>
>> >>> I suggest that we fix the hole by simplifying the code instead of making it
>> >>> even more complicated.  Let's back out the two-phase mechanism (but
>> >>> keep the ability for arch code to supply seccomp_data) and then just
>> >>> reorder it so that seccomp happens after ptrace.  The result should be
>> >>> considerably simpler.  (We'll still have to answer the question of
>> >>> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
>> >>> maybe the answer is to just let it through -- after all,
>> >>> SECCOMP_RET_TRACE might be a request by a tracer to do its own
>> >>> internal filtering.)
>> >>
>> >> I'm really against this. I think seccomp needs to stay first,
>> >
>> > Why?  What use case is improved with it going first?
>>
>> I feel that the critical purpose of seccomp is to minimize attack
>> surface. To that end, I am strongly against anything coming before it
>> in the syscall path. I really do not want ptrace going first, I think
>> it's just asking for bugs.
>
> I disagree in this case.  There's no actual code surface opened up.
> If seccomp allows even a single syscall through and there's a ptracer
> attached, then the ptrace code is exposed.  As far as ptrace is
> concerned, the syscall number is just a number, and ptrace has
> basically no awareness of the arguments.

No, I completely disagree: there is a significant amount of surface
exposed. With a tracer attached there is significantly more happening
before the filter would be checked. Even less obvious things like
signal delivery, etc get exposed. Seccomp must be first -- this is
its basic design principle. Bugs creep in, unexpected combinations
creep in, etc. Seccomp must mitigate this and be first on the syscall
path. The paranoia of this design principle must remain in place, even
at the expense of some inelegant results.

>> >> and I
>> >> like the two-phase split because it gives us a lot of flexibility on
>> >> other architectures.
>> >
>> > I thought so too when I wrote it, and I even tried a bit to evangelize
>> > it to other arch maintainers.  So far, it's used *only* in x86, and it
>> > would IMO be a cleanup to stop using it in x86 now.  Given my
>> > experience cleaning up the x86 syscall path, my current advice to other
>> > arch maintainers would be to try hard to avoid having a context in
>> > which syscall args are known but ptrace can't be invoked (as x86 had
>> > before Linux 4.5).
>>
>> Well, I've got most of the ARM 2-phase port done, but haven't gotten
>> it all the way finished. But I could be talked into removing the
>> 2-phase just from the perspective of reducing complexity.
>
> Does it help anything?  It's certainly more complex and harder to audit.
>
> On x86, it used to save hundreds of cycles.  Now it's probably five
> cycles or so at most.  It could even be a loss because of increased
> code size.

Yeah, on ARM I wasn't seeing much benefit, so I'm okay with dropping 2-phase.

>> >> And we can't just let through RET_TRACE because
>> >> we'll have exactly the same problem: a process can add a RET_TRACE
>> >> filter for some syscall and then change it arbitrarily to escape the
>> >> filtering. The non-trace returns of seccomp need to be checked first and
>> >> after ptrace manipulations. The patch seems like the best approach and
>> >> it covers all the corners.
>> >
>> > But RET_TRACE really is special.
>> >
>> > Suppose you have a tracer and you use SECCOMP_RET_TRACE.  If the
>> > tracer sees a syscall, approves, and calls PTRACE_CONT, then the
>> > syscall 

Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-05-27 Thread Andy Lutomirski
On Fri, May 27, 2016 at 12:52 PM, Andy Lutomirski  wrote:
>> Right, I know, it's aesthetically much nicer that way, but I really
>> want to stay totally paranoid and keep seccomp absolutely first on the
>> path.
>>
>> How about this: we'll use this patch as-is for now, since I'd like to
>> be able to start getting feedback from the container-using folks ASAP,
>> and then we can redesign the 2-phase system going forward from there.
>>
>
> I think I'd rather change the ABI as few times as possible.  On the
> other hand, it's still early, and I see nothing wrong with adding it
> to -next.

To get the ball rolling:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=seccomp

It's incomplete, but it should be straightforward to finish it.  The
only interesting bit is dealing with SECCOMP_RET_TRACE.

--Andy


Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-05-27 Thread Andy Lutomirski
On May 27, 2016 11:42 AM, "Kees Cook"  wrote:
>
> On Thu, May 26, 2016 at 9:45 PM, Andy Lutomirski  wrote:
> > On Thu, May 26, 2016 at 7:41 PM, Kees Cook  wrote:
> >> On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski  
> >> wrote:
> >>> On Thu, May 26, 2016 at 2:04 PM, Kees Cook  wrote:
> >>>> One problem with seccomp was that ptrace could be used to change a
> >>>> syscall after seccomp filtering had completed. This was a well documented
> >>>> limitation, and it was recommended to block ptrace when defining a filter
> >>>> to avoid this problem. This can be quite a limitation for containers or
> >>>> other places where ptrace is desired even under seccomp filters.
> >>>>
> >>>> Since seccomp filtering has been split into pre-trace and trace phases
> >>>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
> >>>> after ptrace. This makes that change, and updates the test suite for
> >>>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
> >>>
> >>> I like fixing the hole, but I don't like this fix.
> >>>
> >>> The two-phase seccomp mechanism is messy.  I wrote it because it was a
> >>> huge speedup.  Since then, I've made a ton of changes to the way that
> >>> x86 syscalls work, and there are two relevant effects: the slow path
> >>> is quite fast, and the phase-1-only path isn't really a win any more.
> >>>
> >>> I suggest that we fix the hole by simplifying the code instead of making it
> >>> even more complicated.  Let's back out the two-phase mechanism (but
> >>> keep the ability for arch code to supply seccomp_data) and then just
> >>> reorder it so that seccomp happens after ptrace.  The result should be
> >>> considerably simpler.  (We'll still have to answer the question of
> >>> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
> >>> maybe the answer is to just let it through -- after all,
> >>> SECCOMP_RET_TRACE might be a request by a tracer to do its own
> >>> internal filtering.)
> >>
> >> I'm really against this. I think seccomp needs to stay first,
> >
> > Why?  What use case is improved with it going first?
>
> I feel that the critical purpose of seccomp is to minimize attack
> surface. To that end, I am strongly against anything coming before it
> in the syscall path. I really do not want ptrace going first, I think
> it's just asking for bugs.

I disagree in this case.  There's no actual code surface opened up.
If seccomp allows even a single syscall through and there's a ptracer
attached, then the ptrace code is exposed.  As far as ptrace is
concerned, the syscall number is just a number, and ptrace has
basically no awareness of the arguments.

>
> >> and I
> >> like the two-phase split because it gives us a lot of flexibility on
> >> other architectures.
> >
> > I thought so too when I wrote it, and I even tried a bit to evangelize
> > it to other arch maintainers.  So far, it's used *only* in x86, and it
> > would IMO be a cleanup to stop using it in x86 now.  Given my
> > > experience cleaning up the x86 syscall path, my current advice to other
> > arch maintainers would be to try hard to avoid having a context in
> > which syscall args are known but ptrace can't be invoked (as x86 had
> > before Linux 4.5).
>
> Well, I've got most of the ARM 2-phase port done, but haven't gotten
> it all the way finished. But I could be talked into removing the
> 2-phase just from the perspective of reducing complexity.

Does it help anything?  It's certainly more complex and harder to audit.

On x86, it used to save hundreds of cycles.  Now it's probably five
cycles or so at most.  It could even be a loss because of increased
code size.

>
> >> And we can't just let through RET_TRACE because
> >> we'll have exactly the same problem: a process can add a RET_TRACE
> >> filter for some syscall and then change it arbitrarily to escape the
> >> filtering. The non-trace returns of seccomp need to be checked first and
> >> after ptrace manipulations. The patch seems like the best approach and
> >> it covers all the corners.
> >
> > But RET_TRACE really is special.
> >
> > Suppose you have a tracer and you use SECCOMP_RET_TRACE.  If the
> > tracer sees a syscall, approves, and calls PTRACE_CONT, then the
> > syscall will be allowed, whereas the effect of SECCOMP_RET_TRACE run
> > anew would be to either force -ENOSYS or to trap back to the tracer,
> > depending on whether there is a tracer.  Your patch has a
> > SECCOMP_RET_TRACE special case, whereas my approach wouldn't need a
> > special case.
>
> But after the seccomp re-trap, we'd still have to re-check the
> filters.

Why?  We don't do that now and, as far as I know, it's not a problem.

If we change that, I think it should be its own patch.

> So it's not cleaner, and we gain attack surface. Don't get me
> wrong, I totally see why you're suggesting doing ptrace first, but I
> still think that attack surface reduction must remain the primary
> principle of seccomp.
>
> > I think your patch also has a minor hole: if 

Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-05-27 Thread Kees Cook
On Thu, May 26, 2016 at 9:45 PM, Andy Lutomirski  wrote:
> On Thu, May 26, 2016 at 7:41 PM, Kees Cook  wrote:
>> On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski  wrote:
>>> On Thu, May 26, 2016 at 2:04 PM, Kees Cook  wrote:
>>>> One problem with seccomp was that ptrace could be used to change a
>>>> syscall after seccomp filtering had completed. This was a well documented
>>>> limitation, and it was recommended to block ptrace when defining a filter
>>>> to avoid this problem. This can be quite a limitation for containers or
>>>> other places where ptrace is desired even under seccomp filters.
>>>>
>>>> Since seccomp filtering has been split into pre-trace and trace phases
>>>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
>>>> after ptrace. This makes that change, and updates the test suite for
>>>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
>>>
>>> I like fixing the hole, but I don't like this fix.
>>>
>>> The two-phase seccomp mechanism is messy.  I wrote it because it was a
>>> huge speedup.  Since then, I've made a ton of changes to the way that
>>> x86 syscalls work, and there are two relevant effects: the slow path
>>> is quite fast, and the phase-1-only path isn't really a win any more.
>>>
>>> I suggest that we fix the hole by simplifying the code instead of making it
>>> even more complicated.  Let's back out the two-phase mechanism (but
>>> keep the ability for arch code to supply seccomp_data) and then just
>>> reorder it so that seccomp happens after ptrace.  The result should be
>>> considerably simpler.  (We'll still have to answer the question of
>>> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
>>> maybe the answer is to just let it through -- after all,
>>> SECCOMP_RET_TRACE might be a request by a tracer to do its own
>>> internal filtering.)
>>
>> I'm really against this. I think seccomp needs to stay first,
>
> Why?  What use case is improved with it going first?

I feel that the critical purpose of seccomp is to minimize attack
surface. To that end, I am strongly against anything coming before it
in the syscall path. I really do not want ptrace going first, I think
it's just asking for bugs.

>> and I
>> like the two-phase split because it gives us a lot of flexibility on
>> other architectures.
>
> I thought so too when I wrote it, and I even tried a bit to evangelize
> it to other arch maintainers.  So far, it's used *only* in x86, and it
> would IMO be a cleanup to stop using it in x86 now.  Given my
> experience cleaning up the x86 syscall path, my current advice to other
> arch maintainers would be to try hard to avoid having a context in
> which syscall args are known but ptrace can't be invoked (as x86 had
> before Linux 4.5).

Well, I've got most of the ARM 2-phase port done, but haven't gotten
it all the way finished. But I could be talked into removing the
2-phase just from the perspective of reducing complexity.

>> And we can't just let through RET_TRACE because
>> we'll have exactly the same problem: a process can add a RET_TRACE
>> filter for some syscall and then change it arbitrarily to escape the
>> filtering. The non-trace returns of seccomp need to be checked first and
>> after ptrace manipulations. The patch seems like the best approach and
>> it covers all the corners.
>
> But RET_TRACE really is special.
>
> Suppose you have a tracer and you use SECCOMP_RET_TRACE.  If the
> tracer sees a syscall, approves, and calls PTRACE_CONT, then the
> syscall will be allowed, whereas the effect of SECCOMP_RET_TRACE run
> anew would be to either force -ENOSYS or to trap back to the tracer,
> depending on whether there is a tracer.  Your patch has a
> SECCOMP_RET_TRACE special case, whereas my approach wouldn't need a
> special case.

But after the seccomp re-trap, we'd still have to re-check the
filters. So it's not cleaner, and we gain attack surface. Don't get me
wrong, I totally see why you're suggesting doing ptrace first, but I
still think that attack surface reduction must remain the primary
principle of seccomp.

> I think your patch also has a minor hole: if you have
> SECCOMP_RET_TRACE *and* a tracer that's catching syscalls directly
> (PTRACE_SYSCALL), then the PTRACE_SYSCALL action can modify a syscall
> after TRACE does its thing but before recheck, and can then redirect
> to another RET_TRACE action that would otherwise be denied.  This is
> minor because it could only happen if the tracer actively fights with
> itself.

So, a few thoughts went into the patch design, and here's why I don't
think this hole is a hole (which you already talk about a bit):
- no filtered syscall could be suddenly made to execute (this is core
goal, obviously)
- worst case, the tracer doesn't notice a syscall marked for
RET_TRACE, but that would be its own fault because it chose to put
itself in that state.
- RET_TRACE is just under RET_ALLOW in priority, so even if there were
some really weird unintended side-effects, we still 

Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-05-26 Thread Andy Lutomirski
On Thu, May 26, 2016 at 7:41 PM, Kees Cook  wrote:
> On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski  wrote:
>> On Thu, May 26, 2016 at 2:04 PM, Kees Cook  wrote:
>>> One problem with seccomp was that ptrace could be used to change a
>>> syscall after seccomp filtering had completed. This was a well documented
>>> limitation, and it was recommended to block ptrace when defining a filter
>>> to avoid this problem. This can be quite a limitation for containers or
>>> other places where ptrace is desired even under seccomp filters.
>>>
>>> Since seccomp filtering has been split into pre-trace and trace phases
>>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
>>> after ptrace. This makes that change, and updates the test suite for
>>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
>>
>> I like fixing the hole, but I don't like this fix.
>>
>> The two-phase seccomp mechanism is messy.  I wrote it because it was a
>> huge speedup.  Since then, I've made a ton of changes to the way that
>> x86 syscalls work, and there are two relevant effects: the slow path
>> is quite fast, and the phase-1-only path isn't really a win any more.
>>
>> I suggest that we fix the hole by simplifying the code instead of making it
>> even more complicated.  Let's back out the two-phase mechanism (but
>> keep the ability for arch code to supply seccomp_data) and then just
>> reorder it so that seccomp happens after ptrace.  The result should be
>> considerably simpler.  (We'll still have to answer the question of
>> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
>> maybe the answer is to just let it through -- after all,
>> SECCOMP_RET_TRACE might be a request by a tracer to do its own
>> internal filtering.)
>
> I'm really against this. I think seccomp needs to stay first,

Why?  What use case is improved with it going first?

> and I
> like the two-phase split because it gives us a lot of flexibility on
> other architectures.

I thought so too when I wrote it, and I even tried a bit to evangelize
it to other arch maintainers.  So far, it's used *only* in x86, and it
would IMO be a cleanup to stop using it in x86 now.  Given my
experience cleaning up the x86 syscall path, my current advice to other
arch maintainers would be to try hard to avoid having a context in
which syscall args are known but ptrace can't be invoked (as x86 had
before Linux 4.5).

> And we can't just let through RET_TRACE because
> we'll have exactly the same problem: a process can add a RET_TRACE
> filter for some syscall and then change it arbitrarily to escape the
> filtering. The non-trace returns of seccomp need to be checked first and
> after ptrace manipulations. The patch seems like the best approach and
> it covers all the corners.

But RET_TRACE really is special.

Suppose you have a tracer and you use SECCOMP_RET_TRACE.  If the
tracer sees a syscall, approves, and calls PTRACE_CONT, then the
syscall will be allowed, whereas the effect of SECCOMP_RET_TRACE run
anew would be to either force -ENOSYS or to trap back to the tracer,
depending on whether there is a tracer.  Your patch has a
SECCOMP_RET_TRACE special case, whereas my approach wouldn't need a
special case.
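
For reference, a filter that asks for SECCOMP_RET_TRACE on a single
syscall could be built roughly as below.  This is only a sketch and is
not taken from either patch: build_trace_filter and trace_nr are
made-up names, and the seccomp_data.arch check that a real filter
needs is left out to keep it short.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: fill insns/prog with a 4-instruction filter that
 * returns SECCOMP_RET_TRACE for syscall number trace_nr and
 * SECCOMP_RET_ALLOW for everything else.
 * NOTE: a production filter should validate seccomp_data.arch first.
 */
static void build_trace_filter(struct sock_filter insns[4],
                               struct sock_fprog *prog,
                               unsigned int trace_nr)
{
    /* Load the syscall number from struct seccomp_data. */
    insns[0] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                            offsetof(struct seccomp_data, nr));
    /* nr == trace_nr: fall through to RET_TRACE; otherwise skip one insn. */
    insns[1] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                            trace_nr, 0, 1);
    insns[2] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE);
    insns[3] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);

    prog->len = 4;
    prog->filter = insns;
}
```

The program would be installed with prctl(PR_SET_SECCOMP,
SECCOMP_MODE_FILTER, &prog) after setting PR_SET_NO_NEW_PRIVS, and the
tracer has to set PTRACE_O_TRACESECCOMP to get the PTRACE_EVENT_SECCOMP
stop at which it can approve or rewrite the syscall.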

I think your patch also has a minor hole: if you have
SECCOMP_RET_TRACE *and* a tracer that's catching syscalls directly
(PTRACE_SYSCALL), then the PTRACE_SYSCALL action can modify a syscall
after TRACE does its thing but before recheck, and can then redirect
to another RET_TRACE action that would otherwise be denied.  This is
minor because it could only happen if the tracer actively fights with
itself.

Finally, I think that your approach would break an existing valid
use case.  Suppose I have a tracer that wants to intercept some
syscall sys_foo (using SECCOMP_RET_TRACE) and, when it sees a sys_foo
attempt, it will implement it by redirecting it to some other syscall
that wouldn't be allowed if called directly (i.e. it would return
SECCOMP_RET_KILL or similar).  Currently, it'll work.  With your
patch, it will kill the tracee.  I think the former behavior is
better.  On the flip side, if you write a program that uses
SECCOMP_RET_TRACE, you more or less have to trust the tracer to begin
with.

One more reason to prefer my approach: currently, if you strace a
process that gets killed by SECCOMP_RET_KILL, you can't tell what
killed it.  For example:

prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len = 1, filter = 0x7ffe7b2b7d30}) = 0
+++ killed by SIGSYS +++
Bad system call (core dumped)

With my approach, strace will have the IMO much more sensible behavior
of showing the fatal syscall entry before showing the "killed by
SIGSYS".
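
A guess at a minimal reproducer for output like the above, assuming
the one-instruction filter (len = 1) simply returns SECCOMP_RET_KILL
for every syscall.  The names kill_all, kill_all_prog and
install_kill_all are invented for this sketch; the actual test program
is not shown in this thread.

```c
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* One instruction: return SECCOMP_RET_KILL regardless of the syscall. */
static struct sock_filter kill_all[] = {
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};

static struct sock_fprog kill_all_prog = {
    .len = sizeof(kill_all) / sizeof(kill_all[0]),
    .filter = kill_all,
};

/*
 * Install the filter.  Returns 0 on success, -1 on error.  After a
 * successful return, the next syscall the caller makes is fatal (SIGSYS).
 */
static int install_kill_all(void)
{
    /* Needed so an unprivileged task may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &kill_all_prog);
}
```

Running a program that calls install_kill_all() under strace should
show the prctl() succeeding and then the SIGSYS death, without the
entry of the syscall that triggered it.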

--Andy


Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-05-26 Thread Kees Cook
On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski  wrote:
> On Thu, May 26, 2016 at 2:04 PM, Kees Cook  wrote:
>> One problem with seccomp was that ptrace could be used to change a
>> syscall after seccomp filtering had completed. This was a well documented
>> limitation, and it was recommended to block ptrace when defining a filter
>> to avoid this problem. This can be quite a limitation for containers or
>> other places where ptrace is desired even under seccomp filters.
>>
>> Since seccomp filtering has been split into pre-trace and trace phases
>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
>> after ptrace. This makes that change, and updates the test suite for
>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
>
> I like fixing the hole, but I don't like this fix.
>
> The two-phase seccomp mechanism is messy.  I wrote it because it was a
> huge speedup.  Since then, I've made a ton of changes to the way that
> x86 syscalls work, and there are two relevant effects: the slow path
> is quite fast, and the phase-1-only path isn't really a win any more.
>
> I suggest that we fix the hole by simplifying the code instead of making it
> even more complicated.  Let's back out the two-phase mechanism (but
> keep the ability for arch code to supply seccomp_data) and then just
> reorder it so that seccomp happens after ptrace.  The result should be
> considerably simpler.  (We'll still have to answer the question of
> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
> maybe the answer is to just let it through -- after all,
> SECCOMP_RET_TRACE might be a request by a tracer to do its own
> internal filtering.)

I'm really against this. I think seccomp needs to stay first, and I
like the two-phase split because it gives us a lot of flexibility on
other architectures. And we can't just let through RET_TRACE because
we'll have exactly the same problem: a process can add a RET_TRACE
filter for some syscall and then change it arbitrarily to escape the
filtering. The non-trace returns of seccomp need to be checked first and
after ptrace manipulations. The patch seems like the best approach and
it covers all the corners.
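
For reference, a filter of the kind being discussed might look like
the sketch below. This is only an illustration: the choice of getpid
and getppid is arbitrary and not taken from the patch or its tests,
and the names trace_filter and trace_prog are made up. The
architecture check that a real filter should perform on
seccomp_data.arch is omitted to keep it short.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/*
 * Illustrative classic-BPF seccomp filter:
 *   getpid  -> SECCOMP_RET_TRACE (hand the decision to a tracer)
 *   getppid -> SECCOMP_RET_KILL
 *   other   -> SECCOMP_RET_ALLOW
 * Assumption: no check of seccomp_data.arch; a real filter needs one.
 */
static struct sock_filter trace_filter[] = {
	/* Load the syscall number from struct seccomp_data. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* If nr == getpid fall through to the TRACE return, else skip it. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
	/* If nr == getppid fall through to the KILL return, else skip it. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	/* Everything else is allowed. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog trace_prog = {
	.len = sizeof(trace_filter) / sizeof(trace_filter[0]),
	.filter = trace_filter,
};
```

A process would install this with prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
followed by prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &trace_prog).
The concern above is what a tracer may do at the getpid stop: if it
rewrites the syscall number to getppid and nothing rechecks, the KILL
rule is never consulted.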

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-05-26 Thread Andy Lutomirski
On Thu, May 26, 2016 at 2:04 PM, Kees Cook  wrote:
> One problem with seccomp was that ptrace could be used to change a
> syscall after seccomp filtering had completed. This was a well documented
> limitation, and it was recommended to block ptrace when defining a filter
> to avoid this problem. This can be quite a limitation for containers or
> other places where ptrace is desired even under seccomp filters.
>
> Since seccomp filtering has been split into pre-trace and trace phases
> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
> after ptrace. This makes that change, and updates the test suite for
> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.

I like fixing the hole, but I don't like this fix.

The two-phase seccomp mechanism is messy.  I wrote it because it was a
huge speedup.  Since then, I've made a ton of changes to the way that
x86 syscalls work, and there are two relevant effects: the slow path
is quite fast, and the phase-1-only path isn't really a win any more.

I suggest that we fix the hole by simplifying the code instead of making it
even more complicated.  Let's back out the two-phase mechanism (but
keep the ability for arch code to supply seccomp_data) and then just
reorder it so that seccomp happens after ptrace.  The result should be
considerably simpler.  (We'll still have to answer the question of
what happens when a SECCOMP_RET_TRACE event changes the syscall, but
maybe the answer is to just let it through -- after all,
SECCOMP_RET_TRACE might be a request by a tracer to do its own
internal filtering.)
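
The effect of the reordering can be sketched with a small user-space
model. Everything here is hypothetical and for illustration only: the
syscall numbers, toy_policy(), and the two entry_* functions stand in
for the real entry path and are not kernel interfaces. The tracer
callback models a PTRACE_SYSCALL stop that may rewrite the syscall
number.

```c
#include <stddef.h>

/* Hypothetical syscall numbers and verdicts for a toy model. */
enum { SYS_READ_LIKE = 10, SYS_FORBIDDEN = 20 };
enum verdict { VERDICT_RUN, VERDICT_KILL };

/* Models a PTRACE_SYSCALL stop: returns the (possibly rewritten) nr. */
typedef int (*tracer_fn)(int nr);

/* Toy policy: only SYS_FORBIDDEN is denied. */
static enum verdict toy_policy(int nr)
{
	return nr == SYS_FORBIDDEN ? VERDICT_KILL : VERDICT_RUN;
}

/* Old order: filter first, then the tracer may change the syscall unchecked. */
static enum verdict entry_seccomp_first(int nr, tracer_fn tracer, int *final_nr)
{
	if (toy_policy(nr) == VERDICT_KILL)
		return VERDICT_KILL;
	*final_nr = tracer ? tracer(nr) : nr;
	return VERDICT_RUN;
}

/* Proposed order: tracer first, so the filter sees what will actually run. */
static enum verdict entry_ptrace_first(int nr, tracer_fn tracer, int *final_nr)
{
	if (tracer)
		nr = tracer(nr);
	if (toy_policy(nr) == VERDICT_KILL)
		return VERDICT_KILL;
	*final_nr = nr;
	return VERDICT_RUN;
}

/* A tracer that always redirects to the forbidden syscall. */
static int evil_tracer(int nr)
{
	(void)nr;
	return SYS_FORBIDDEN;
}
```

With evil_tracer attached, entry_seccomp_first() lets SYS_FORBIDDEN
run, while entry_ptrace_first() returns VERDICT_KILL. The model does
not cover the open question about SECCOMP_RET_TRACE changing the
syscall.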

--Andy


Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

2016-05-26 Thread Jann Horn
On Thu, May 26, 2016 at 02:04:50PM -0700, Kees Cook wrote:
> One problem with seccomp was that ptrace could be used to change a
> syscall after seccomp filtering had completed. This was a well documented
> limitation, and it was recommended to block ptrace when defining a filter
> to avoid this problem. This can be quite a limitation for containers or
> other places where ptrace is desired even under seccomp filters.
> 
> Since seccomp filtering has been split into pre-trace and trace phases
> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
> after ptrace. This makes that change, and updates the test suite for
> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.

Looks good to me. As far as I can tell, there are no codepaths that allow
manipulation of syscall arguments via ptrace register modification without
going through tracehook_report_syscall_entry() or seccomp_phase2(), and
the checks look good, too.


> Signed-off-by: Kees Cook 
> ---
>  include/linux/seccomp.h   |   6 +
>  include/linux/tracehook.h |   8 +-
>  kernel/seccomp.c  |  42 ++
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 176 --
>  4 files changed, 220 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 2296e6b2f690..e2b72394c200 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -85,6 +85,7 @@ static inline int seccomp_mode(struct seccomp *s)
>  #ifdef CONFIG_SECCOMP_FILTER
>  extern void put_seccomp_filter(struct task_struct *tsk);
>  extern void get_seccomp_filter(struct task_struct *tsk);
> +extern int seccomp_phase1_recheck(void);
>  #else  /* CONFIG_SECCOMP_FILTER */
>  static inline void put_seccomp_filter(struct task_struct *tsk)
>  {
> @@ -94,6 +95,11 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
>  {
>   return;
>  }
> +
> +static inline int seccomp_phase1_recheck(void)
> +{
> + return 0;
> +}
>  #endif /* CONFIG_SECCOMP_FILTER */
>  
>  #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
> index 26c152122a42..69b584d88508 100644
> --- a/include/linux/tracehook.h
> +++ b/include/linux/tracehook.h
> @@ -48,6 +48,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -100,7 +101,12 @@ static inline int ptrace_report_syscall(struct pt_regs *regs)
>  static inline __must_check int tracehook_report_syscall_entry(
>   struct pt_regs *regs)
>  {
> - return ptrace_report_syscall(regs);
> + int skip;
> +
> + skip = ptrace_report_syscall(regs);
> + if (skip)
> + return skip;
> + return seccomp_phase1_recheck();
>  }
>  
>  /**
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 7002796f14a4..6eaa3a1c5edb 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -665,6 +665,46 @@ u32 seccomp_phase1(struct seccomp_data *sd)
>  }
>  
>  /**
> + * seccomp_phase1_recheck() - recheck phase1 in the context of ptrace
> + *
> + * This re-runs phase 1 seccomp checks in the case where ptrace may have
> + * just changed things out from under us.
> + *
> + * Returns 0 if the syscall should be processed or -1 to skip the syscall.
> + */
> +int seccomp_phase1_recheck(void)
> +{
> + u32 action;
> +
> + /* If we're not under seccomp, continue normally. */
> + if (!test_thread_flag(TIF_SECCOMP))
> + return 0;
> +
> + /* Pass NULL struct seccomp_data to force reload after ptrace. */
> + action = seccomp_phase1(NULL);
> + switch (action) {
> + case SECCOMP_PHASE1_OK:
> + /* Passes seccomp, continue normally. */
> + break;
> + case SECCOMP_PHASE1_SKIP:
> + /* Skip the syscall. */
> + return -1;
> + default:
> + if ((action & SECCOMP_RET_ACTION) != SECCOMP_RET_TRACE) {
> + /* Impossible return value: kill the process. */
> + do_exit(SIGSYS);
> + }
> + /*
> +  * We've hit a trace request, but ptrace already put us
> +  * into this state, so just continue.
> +  */
> + break;
> + }
> +
> + return 0;
> +}
> +
> +/**
>   * seccomp_phase2() - finish slow path seccomp work for the current syscall
>   * @phase1_result: The return value from seccomp_phase1()
>   *
> @@ -701,6 +741,8 @@ int seccomp_phase2(u32 phase1_result)
>   do_exit(SIGSYS);
>   if (syscall_get_nr(current, regs) < 0)
>   return -1;  /* Explicit request to skip. */
> + if (seccomp_phase1_recheck() < 0)
> + return -1;
>  
>   return 0;
>  }
[...]



