RE: [kernel-hardening] [RFC PATCH 1/1] seccomp: provide information about the previous syscall
Hi Jann, Andy, Alexei, Kees and Paul,

Thanks a lot for your comments on my RFC! There were a few important
points that I didn't mention but that are critical to understanding what
I was trying to do.

The focus of the patch was on protecting "real-time embedded IoT devices"
such as a PLC (programmable logic controller) inside a factory assembly
line. They have a few important properties that I took into
consideration:

- They often rely on firewall technology, and are not updated for many
  years (~20 years). For that reason, I think that a white-list approach
  (define the correct behaviour) seems suitable. Note also that the
  typical problem of white-list approaches, false positives, is unlikely
  to occur because they are very deterministic systems.

- No asynchronous signal handlers: real-time applications need
  deterministic response times. For that reason, signals are typically
  handled synchronously by using 'sigtimedwait' on a separate thread.

- Initialization vs cycle: real-time applications usually have an
  initialization phase where memory and stack are locked into RAM and
  threads are created. After the initialization phase, threads typically
  loop through periodic cycles and perform their tasks. The important
  point here is that once the initialization is done we can ban any
  further calls to 'clone', 'execve', 'mprotect' and the like. This can
  already be done by installing an extra filter. For the cyclic phase, my
  patch would allow enforcing the order of the system calls inside the
  cycles (e.g.: read sensor, send a message, and write to an actuator).
  Even though the attacker cannot call 'clone' anymore, he could still
  try to alter the control of an external actuator (e.g. a motor) by
  using the 'ioctl' system call, for example.

- Mimicry: as I mentioned in the cover letter (and Jann showed with his
  ROP attack), if the attacker is able to emulate the order of the system
  calls (plus their arguments and the address from which each call was
  made) this patch can be bypassed. However, note that this is not easy
  for several reasons:

  + The attacker may need a long stack to mimic all the system calls and
    their arguments.

  + A stealthy attacker must make sure the real-time application does not
    crash, miss any of its deadlines or cause deadline misses in other
    apps.
    [Note] Real-time application binaries are usually closed source, so
    this might require quite a bit of effort.

  + Randomized system calls: applications could randomly activate dummy
    system calls each time they are instantiated (and adjust their BPF
    filter, which should later be zeroed). In this case, the attacker (or
    virus) would need to figure out which dummy system calls have to be
    mimicked and prepare a stack accordingly. This seems challenging.
    [Note] Under a brute force attack, the application may just raise an
    alarm, activate a redundant node (not connected to the network) and
    commit digital suicide :).

About the ABI: I definitely don't want to break it. If putting the field
at the end does not break it, as Alexei mentioned, I can change it. Also,
I would be glad to review the SECCOMP_FILTER_FLAG_TSYNC flag mentioned by
Jann in case there is any interest.

However, I'll understand the NACK if you think that the maintenance is
not worth it, as Andy mentioned; that it can be bypassed under certain
conditions; or that it focuses on too particular a type of system. I
will keep reading the messages in the kernel-hardening list and see if I
find another topic to contribute to :).

Thanks a lot for your consideration and comments,
Daniel
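As a rough illustration of the ordering idea for the cyclic phase, the sketch below models a transition whitelist in plain userspace C. It is not the patch's kernel code and not a real seccomp-BPF program: the enum values, the `allowed` table and `first_violation()` are hypothetical names invented for this example, and the three-step cycle only mirrors the "read sensor, send a message, write to an actuator" example from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical identifiers for the model; these are NOT real syscall numbers. */
enum model_sys {
    M_READ_SENSOR,
    M_SEND_MSG,
    M_WRITE_ACTUATOR,
    M_IOCTL,          /* a call the cycle never makes */
    M_COUNT
};

/* allowed[prev][next] is true when `next` may directly follow `prev`. */
static const bool allowed[M_COUNT][M_COUNT] = {
    [M_READ_SENSOR][M_SEND_MSG]         = true,
    [M_SEND_MSG][M_WRITE_ACTUATOR]      = true,
    [M_WRITE_ACTUATOR][M_READ_SENSOR]   = true,
};

/*
 * Walk a trace of calls, starting from the call `prev` that preceded it.
 * Returns the index of the first call that is not an allowed successor of
 * its predecessor, or -1 if the whole trace respects the whitelist.
 */
static long first_violation(enum model_sys prev,
                            const enum model_sys *trace, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(prev < M_COUNT && trace[i] < M_COUNT);
        if (!allowed[prev][trace[i]])
            return (long)i;
        prev = trace[i];
    }
    return -1;
}
```

In this model an injected `M_IOCTL` is rejected wherever it appears, because no row of the table permits it. An attacker who replays the exact cycle (the mimicry case discussed in the message) still passes the check.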
Re: [kernel-hardening] [RFC PATCH 1/1] seccomp: provide information about the previous syscall
On Fri, Jan 22, 2016 at 2:48 AM, Jann Horn wrote:
> On Fri, Jan 22, 2016 at 03:30:00PM +0900, Daniel Sangorrin wrote:
>> This patch allows applications to restrict the order in which
>> its system calls may be requested. In order to do that, we
>> provide seccomp-BPF scripts with information about the
>> previous system call requested.
>>
>> An example use case consists of detecting (and stopping) return
>> oriented attacks that disturb the normal execution flow of
>> a user program.
>
> The intent here is to mitigate attacks in which an attacker has
> e.g. a function pointer overwrite without a high degree of stack
> control or the ability to perform a stack pivot, correct? So that
> e.g. a one-gadget system() call won't succeed?
>
> Do you have data on how effective this protection is using just
> the previous system call number?
>
> I think that for example, the "magic ROP gadget" in glibc that
> can be used given just a single pointer overwrite and stdin
> control (https://gist.github.com/zachriggle/ca24daf4e8be953a3f96),
> which (as far as I can tell) is in the middle of the system()
> implementation, could be used as long as a transition to one of
> the following syscalls is allowed:
>
> - rt_sigaction
> - rt_sigprocmask
> - clone
> - execve
>
> I'm not sure how many interesting syscalls typically transition
> to that, perhaps you can comment on that?

rt_sigaction is going to be a problem. It can legitimately follow
*anything* because of async signals.

In general, I think I don't like this idea. It seems like a hack that
we'll have to support forever that will allow semi-reliable IDS
signatures to break due to async signals and occasionally detect
intrusions that don't modify themselves slightly to evade detection.

--Andy
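To make the async-signal concern concrete, here is a small userspace sketch of how a transition whitelist widens once an asynchronous handler exists. Everything in it is an assumption made for illustration: `NSYS`, `allow_async_entry()` and `count_allowed()` are hypothetical names, and the table indices stand for abstract syscalls rather than real syscall numbers. The point it models is only that the first syscall issued by a handler has to be permitted after every other syscall, since the signal can arrive at any time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSYS 4  /* size of the toy syscall set; an assumption for this sketch */

/*
 * Permit `entry` (the first syscall a signal handler makes) as a successor
 * of every syscall in the set, because an async signal can interrupt the
 * program between any two calls.
 */
static void allow_async_entry(bool allowed[NSYS][NSYS], size_t entry)
{
    assert(entry < NSYS);
    for (size_t prev = 0; prev < NSYS; prev++)
        allowed[prev][entry] = true;
}

/* Count how many (prev, next) pairs the table currently permits. */
static size_t count_allowed(bool allowed[NSYS][NSYS])
{
    size_t n = 0;
    for (size_t prev = 0; prev < NSYS; prev++)
        for (size_t next = 0; next < NSYS; next++)
            n += allowed[prev][next] ? 1 : 0;
    return n;
}
```

Starting from a strict three-step cycle (three permitted pairs), a single handler entry point adds one permitted pair per row of the table. The sketch does not model the return path from the handler, which would widen the table further.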
Re: [kernel-hardening] [RFC PATCH 1/1] seccomp: provide information about the previous syscall
On Fri, Jan 22, 2016 at 03:30:00PM +0900, Daniel Sangorrin wrote:
> This patch allows applications to restrict the order in which
> its system calls may be requested. In order to do that, we
> provide seccomp-BPF scripts with information about the
> previous system call requested.
>
> An example use case consists of detecting (and stopping) return
> oriented attacks that disturb the normal execution flow of
> a user program.

The intent here is to mitigate attacks in which an attacker has
e.g. a function pointer overwrite without a high degree of stack
control or the ability to perform a stack pivot, correct? So that
e.g. a one-gadget system() call won't succeed?

Do you have data on how effective this protection is using just
the previous system call number?

I think that for example, the "magic ROP gadget" in glibc that
can be used given just a single pointer overwrite and stdin
control (https://gist.github.com/zachriggle/ca24daf4e8be953a3f96),
which (as far as I can tell) is in the middle of the system()
implementation, could be used as long as a transition to one of
the following syscalls is allowed:

 - rt_sigaction
 - rt_sigprocmask
 - clone
 - execve

I'm not sure how many interesting syscalls typically transition
to that, perhaps you can comment on that?

However, when exploiting network servers, this magic gadget won't
help much - an attacker would probably have to either call into an
interesting function in the application or use ROP. In the latter
case, this protection won't help much - especially considering that
most syscalls just return -EFAULT / -EINVAL when you supply nonsense
arguments, ROPping through a "pop rax;ret" gadget and a "syscall;ret"
gadget should make it fairly easy to bypass the protection. There are
a bunch of occurrences of both gadgets in Debian's libc (and these are
just the trivial ones):

$ hexdump -C /lib/x86_64-linux-gnu/libc-2.19.so | grep '58 c3'
000382e0  00 00 48 8b 00 5b 8b 40  58 c3 48 8d 05 4f 8a 36  |..H..[.@X.H..O.6|
000383b0  58 c3 48 8d 05 87 89 36  00 48 39 c3 74 0e 48 89  |X.H....6.H9.t.H.|
00038450  40 58 c3 48 8d 05 e6 88  36 00 48 39 c3 74 0e 48  |@X.H....6.H9.t.H|
000d9a00  48 89 44 24 18 e8 56 ff  ff ff 48 83 c4 58 c3 90  |H.D$..V...H..X..|
000e51d0  c3 0f 1f 80 00 00 00 00  48 8b 40 58 c3 0f 1f 00  |........H.@X....|
000ea2f0  48 83 3d 58 c3 2b 00 00  48 8b 1d 69 8b 2b 00 64  |H.=X.+..H..i.+.d|
00160520  48 c3 fa ff 58 c3 fa ff  68 c3 fa ff 80 c3 fa ff  |H...X...h.......|
00171470  58 c3 f8 ff 84 60 02 00  74 c3 f8 ff 94 62 02 00  |X....`..t....b..|
$ hexdump -C /lib/x86_64-linux-gnu/libc-2.19.so | grep '0f 05 c3'
000b85b0  b8 6e 00 00 00 0f 05 c3  0f 1f 84 00 00 00 00 00  |.n..............|
000b85c0  b8 66 00 00 00 0f 05 c3  0f 1f 84 00 00 00 00 00  |.f..............|
000b85d0  b8 6b 00 00 00 0f 05 c3  0f 1f 84 00 00 00 00 00  |.k..............|
000b85e0  b8 68 00 00 00 0f 05 c3  0f 1f 84 00 00 00 00 00  |.h..............|
000b85f0  b8 6c 00 00 00 0f 05 c3  0f 1f 84 00 00 00 00 00  |.l..............|
000b87f0  b8 6f 00 00 00 0f 05 c3  0f 1f 84 00 00 00 00 00  |.o..............|
000d9260  b8 5f 00 00 00 0f 05 c3  0f 1f 84 00 00 00 00 00  |._..............|
000e6400  b8 e4 00 00 00 0f 05 c3  0f 1f 84 00 00 00 00 00  |................|
000fff60  48 63 3f b8 03 00 00 00  0f 05 c3 0f 1f 44 00 00  |Hc?..........D..|

So an attacker would craft the stack like this:

[pop rax;ret address]
[first syscall for transition]
[syscall;ret address]
[pop rax;ret address]
[second syscall for transition]
[syscall;ret address]
[...]
[normal ROP for whatever the attacker wants to do]

Maybe someone who knows a bit more about binary exploitation can
comment on this, especially how likely it is that a manipulation
of a network service's program flow is successful in the presence
of full ASLR and so on without ROP.

Also, there is a potential functional issue: What about signal
handlers? Signal handlers will require transitions from all
syscalls to any syscall that occurs at the start of a signal
handler to be allowed as far as I can tell.

> @@ -443,6 +448,11 @@ static long seccomp_attach_filter(unsigned int flags,
>  			return ret;
>  	}
>  
> +	/* Initialize the prev_nr field only once */
> +	if (current->seccomp.filter == NULL)
> +		current->seccomp.prev_nr =
> +			syscall_get_nr(current, task_pt_regs(current));
> +
>  	/*
>  	 * If there is an existing filter, make it the prev and don't drop its
>  	 * task reference.

What about SECCOMP_FILTER_FLAG_TSYNC? When a thread is transitioned
from SECCOMP_MODE_DISABLED to SECCOMP_MODE_FILTER by another thread,
its initial prev_nr will be 0, which would e.g. appear to be the
read() syscall on x86_64, right?
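The `hexdump -C | grep` search in the message only sees byte sequences that fit within one half of a 16-byte output row, which is one reason the listed hits are "just the trivial ones". A more complete count can be sketched as a direct scan over the raw bytes. The helper below is a minimal illustration, not a tool from the thread: `count_pattern()` and the two pattern arrays are names chosen for this example, and the byte values are the x86_64 encodings already shown in the grep patterns (`58 c3` for "pop rax;ret", `0f 05 c3` for "syscall;ret").

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* x86_64 byte encodings of the two gadgets discussed in the message. */
static const unsigned char POP_RAX_RET[] = { 0x58, 0xc3 };
static const unsigned char SYSCALL_RET[] = { 0x0f, 0x05, 0xc3 };

/*
 * Count occurrences (overlapping ones included) of pat[0..pat_len) inside
 * buf[0..len). Unlike grepping hexdump output, this also finds matches that
 * straddle a 16-byte row boundary.
 */
static size_t count_pattern(const unsigned char *buf, size_t len,
                            const unsigned char *pat, size_t pat_len)
{
    size_t hits = 0;

    if (pat_len == 0 || pat_len > len)
        return 0;
    for (size_t i = 0; i + pat_len <= len; i++)
        if (memcmp(buf + i, pat, pat_len) == 0)
            hits++;
    return hits;
}
```

To use it on a library, one would read the file into a buffer and call `count_pattern(buf, len, POP_RAX_RET, sizeof POP_RAX_RET)`. A raw byte match does not by itself prove a usable gadget, since the bytes may lie in a non-executable section.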