Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
On Wed, Mar 01, 2017 at 11:38:56PM +0100, Eduardo Otubo wrote:
> On Thu, Feb 16, 2017 at 09:33:16AM +, Daniel P. Berrange wrote:
> > On Thu, Feb 16, 2017 at 12:36:51AM +0100, Eduardo Otubo wrote:
> > > On Wed, Feb 15, 2017 at 06:27:32PM +, Daniel P. Berrange wrote:
> > [...]
> > > >
> > > > There is a reasonably easily identifiable set of syscalls that
> > > > QEMU should never be permitted to use, no matter what
> > > > configuration it is in, what helpers it spawns, or what libraries
> > > > it links to. e.g. reboot, swapon, swapoff, syslog, mount, umount,
> > > > kexec_*, etc - any syscall that affects global system state,
> > > > rather than process-local state, should be forbidden.
> > > >
> > > > There are some syscalls that are simply hardcoded to return
> > > > ENOSYS which can be trivially blacklisted. afs_syscall, break,
> > > > fattach, ftime, etc (see the man page 'unimplemented(2)').
>
> I've been working on the blacklist, you can see here:
> https://github.com/otubo/qemu/commit/31e603180081474ff35c5897813cb635f8e9a786
>
> I didn't send it as an RFC to the list because it's still ongoing work,
> but if you have any comments, please feel free.
>
> > > >
> > > > There are some syscalls which are considered obsolete - they were
> > > > previously useful, but no modern code would call them, as they
> > > > have been superseded. For example, readdir replaced by getdents.
> > > > We could blacklist these by default but provide a way to allow
> > > > use of obsolete syscalls if running on older systems,
> > > > e.g. '-sandbox on,obsolete=allow'. They might be obsolete enough
> > > > that we decide to just block them permanently with no opt-in -
> > > > we would need to analyse when their replacements appeared in
> > > > widespread use.
>
> The obsolete part is also on my github (didn't send for the same
> reason):
> https://github.com/otubo/qemu/commit/54a57eb150ca3e5b67e9a81394c6cfa4ac82a6ff
>
> Also, I can't find a solid list of obsolete system calls anywhere. Can
> you elaborate a little more on how to determine this list?

Systemd has such a list in ./src/shared/seccomp-util.c

Look for the array containing SYSCALL_FILTER_SET_OBSOLETE

Regards,
Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
On Thu, Feb 16, 2017 at 09:33:16AM +, Daniel P. Berrange wrote:
> On Thu, Feb 16, 2017 at 12:36:51AM +0100, Eduardo Otubo wrote:
> > On Wed, Feb 15, 2017 at 06:27:32PM +, Daniel P. Berrange wrote:
[...]
> > >
> > > There is a reasonably easily identifiable set of syscalls that QEMU
> > > should never be permitted to use, no matter what configuration it
> > > is in, what helpers it spawns, or what libraries it links to.
> > > e.g. reboot, swapon, swapoff, syslog, mount, umount, kexec_*, etc -
> > > any syscall that affects global system state, rather than
> > > process-local state, should be forbidden.
> > >
> > > There are some syscalls that are simply hardcoded to return ENOSYS
> > > which can be trivially blacklisted. afs_syscall, break, fattach,
> > > ftime, etc (see the man page 'unimplemented(2)').

I've been working on the blacklist, you can see here:
https://github.com/otubo/qemu/commit/31e603180081474ff35c5897813cb635f8e9a786

I didn't send it as an RFC to the list because it's still ongoing work,
but if you have any comments, please feel free.

> > >
> > > There are some syscalls which are considered obsolete - they were
> > > previously useful, but no modern code would call them, as they have
> > > been superseded. For example, readdir replaced by getdents. We could
> > > blacklist these by default but provide a way to allow use of
> > > obsolete syscalls if running on older systems,
> > > e.g. '-sandbox on,obsolete=allow'. They might be obsolete enough
> > > that we decide to just block them permanently with no opt-in - we
> > > would need to analyse when their replacements appeared in
> > > widespread use.

The obsolete part is also on my github (didn't send for the same reason):
https://github.com/otubo/qemu/commit/54a57eb150ca3e5b67e9a81394c6cfa4ac82a6ff

Also, I can't find a solid list of obsolete system calls anywhere. Can you
elaborate a little more on how to determine this list?

> > >
> > > There might be a few more syscalls which we can determine are never
> > > valid to use in QEMU or any library or helper program it might run.
> > > I expect this list to be very small though, given the impossibility
> > > of auditing code paths through millions of lines of code QEMU links
> > > to.
> > >
> > > Everything else should be allowed.
> > >
> > > At this point we have a highly reliable "-sandbox on" which we're
> > > not having to constantly patch.
> > >
> > >
> > > From here we need a way to allow a user to opt in to more
> > > restrictive policies, accepting that it will block certain features.
> > > For example, there should be a way to disable any means to elevate
> > > privileges from QEMU or things it spawns,
> > > e.g. '-sandbox on,elevateprivileges=deny'.
> > >
> > > This would not only block the various set*uid|gid functions via
> > > seccomp, but should also call prctl(PR_SET_NO_NEW_PRIVS). This would
> > > allow the user to opt in to a restrictive world if they know they'll
> > > not require things like the setuid bridge helper.

Also, I was re-reading the documentation again: prctl(PR_SET_NO_NEW_PRIVS)
is enabled by default when using seccomp.

> > >
> > > Similarly there should be a '-sandbox on,spawn=deny' which prevents
> > > the ability to fork/exec processes at all, whether privileged or
> > > not. This would block features like the qemu bridge helper, SMB
> > > server, ifup/down scripts, migration exec: protocol. These are all
> > > rarely used features though, so an opt-in to block their use is
> > > reasonable & desirable.
> > >
> > > A '-sandbox on,resourcecontrol=deny', which prevents QEMU from
> > > setting stuff like process affinity, scheduler priority, etc. Some
> > > uses of QEMU might need them, but normally such controls are left to
> > > the mgmt app above QEMU to set prior to the exec() of QEMU.
> > >
> > >
> > > The key is that these are *not* low level knobs controlling system
> > > calls, but moderately high level knobs controlling general concepts.
> > > This is a high enough level of abstraction to enable libvirt to
> > > automatically turn them on/off based on guest config, without
> > > libvirt having to know anything detailed about QEMU code impl for
> > > the features.
> > >
> > >
> > > Finally, for avoidance of doubt, I'm *not* actually proposing to
> > > implement this myself any time in the foreseeable future. This mail
> > > came about from the fact that many people have questioned whether
> > > the current seccomp code is anything other than "security theatre".
> > > I tend to agree with such an assessment myself, and was initially
> > > intending to just send a patch to remove seccomp, to stimulate some
> > > discussion. Instead, however, I decided to write this mail to see if
> > > we can identify a way forward to make seccomp both
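To make the "high level knobs" idea concrete, here is a rough Python model
of how a '-sandbox' argument could be turned into a set of denied syscalls.
It is only a sketch: the option names (obsolete, elevateprivileges, spawn,
resourcecontrol) come from the proposal quoted above, but the function name
parse_sandbox, the table layout, and the exact syscalls in each group are
assumptions made for illustration, not QEMU's actual lists.

```python
# Illustrative model only. Option names follow the proposal in this thread;
# the syscall groupings below are assumptions, not QEMU's real policy.

# Syscalls touching global system state: denied whenever the sandbox is on.
ALWAYS_DENIED = {"reboot", "swapon", "swapoff", "syslog", "mount",
                 "umount2", "kexec_load"}

# knob name -> (default setting, syscalls blocked when the knob is "deny")
KNOBS = {
    # Obsolete syscalls are blocked unless the user opts back in.
    "obsolete": ("deny", {"readdir"}),
    # The remaining knobs are opt-in restrictions.
    "elevateprivileges": ("allow", {"setuid", "setgid", "setreuid",
                                    "setregid", "setresuid", "setresgid"}),
    # clone is left out on purpose: threads need it (assumption).
    "spawn": ("allow", {"fork", "vfork", "execve"}),
    "resourcecontrol": ("allow", {"sched_setaffinity", "sched_setscheduler",
                                  "sched_setparam", "setpriority"}),
}


def parse_sandbox(arg):
    """Return the set of syscall names to deny for a '-sandbox' argument."""
    enabled, *options = arg.split(",")
    if enabled not in ("on", "off"):
        raise ValueError("expected 'on' or 'off', got %r" % enabled)
    if enabled == "off":
        return set()

    # Start from the defaults, then apply the user's overrides.
    settings = {name: default for name, (default, _) in KNOBS.items()}
    for option in options:
        name, _, value = option.partition("=")
        if name not in KNOBS or value not in ("allow", "deny"):
            raise ValueError("bad sandbox option: %r" % option)
        settings[name] = value

    denied = set(ALWAYS_DENIED)
    for name, value in settings.items():
        if value == "deny":
            denied |= KNOBS[name][1]
    return denied


if __name__ == "__main__":
    print(sorted(parse_sandbox("on,obsolete=allow,spawn=deny")))
```

A management layer such as libvirt would only ever need to pick knob
values; it never has to know which syscalls sit behind each knob.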
Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
On 15.02.2017 19:27, Daniel P. Berrange wrote:
> The current impl of seccomp in QEMU intentionally allows a huge range
> of system calls to be executed. The goal was that running '-sandbox on'
> should never break any feature of QEMU, so naturally any syscall that
> can be executed on any codepath QEMU takes must be allowed.
>
> This is good for usability because users don't need to understand the
> technical details of the sandbox technology, they merely say "on" and it
> "just works". Conversely though, this is bad for security because QEMU
> has to allow a huge range of system calls to be used due to its broad
> functionality.
>
> During initial discussions for seccomp back in 2012 it was suggested
> that there might be alternate policies developed for QEMU which deny
> some features, but improve security overall. To the best of my
> knowledge, this has never been discussed again since then.
>
>
> In addition, since initially merging, there has been a steady stream of
> patches to whitelist further syscalls that were missing. Some of these
> were missing due to newly added functionality in QEMU since the original
> seccomp impl, while others have been missing since day 1. It is
> reasonable to expect that there are still many syscalls missing in the
> whitelist. In just a couple of minutes of comparing the whitelist vs
> global syscall list it was possible to identify two further missing
> syscalls. The '-netdev bridge,br=virbr0' network backend fails because
> setuid is blocked, preventing execution of the qemu-bridge-helper
> program. If built against glibc < 2.9, or running on kernel < 2.6.27 it
> will fail to call eventfd() because we only permit the eventfd2()
> syscall, not the older eventfd() syscall used on older Linux. Some ifup
> scripts used with the -netdev arg may also break due to lack of chmod,
> flock, getxattr permissions. This risk of missing syscalls is why
> -sandbox defaults to off, and we've never considered defaulting it to on.
>
>
> The fundamental problem is that building a whitelist of syscalls used by
> QEMU emulators is an intractable problem. QEMU on my system links to 183
> different shared libraries and there is no way in the world that anyone
> can figure out which code paths QEMU triggers in these libraries and
> thus identify which syscalls will be genuinely needed.
>
> Thus a whitelist based approach for QEMU is doomed to always be missing
> some syscalls, resulting in unnecessary aborts of QEMU when it tickles
> some edge case. If you are lucky the abort() happens at startup so you
> see it quickly and can address it. If you are unlucky the abort()
> happens after your VM has been running for days/weeks/months and you
> lose data.
>
> IOW, seccomp integration as it currently exists today in QEMU offers
> minimal security benefits, while at the same time causing spurious
> crashes which may cause user data loss from aborting a running VM,
> discouraging users from using even the minimal protection it offers.
>
> I think we need to rework our seccomp support so that we can have a high
> enough level of confidence in it that it could be enabled by default. At
> the same time we need to make it do something more tangibly useful from
> a security POV.
>
>
> First we need to admit that whitelisting is a failed approach, and
> switch to using blacklisting. Unless we do this, we'll never have high
> enough confidence to enable it by default - something that's never
> turned on might as well not exist at all.
>
>
> There is a reasonably easily identifiable set of syscalls that QEMU
> should never be permitted to use, no matter what configuration it is in,
> what helpers it spawns, or what libraries it links to. e.g. reboot,
> swapon, swapoff, syslog, mount, umount, kexec_*, etc - any syscall that
> affects global system state, rather than process-local state, should be
> forbidden.
>
> There are some syscalls that are simply hardcoded to return ENOSYS which
> can be trivially blacklisted.
> afs_syscall, break, fattach, ftime, etc (see the man page
> 'unimplemented(2)').
>
> There are some syscalls which are considered obsolete - they were
> previously useful, but no modern code would call them, as they have been
> superseded. For example, readdir replaced by getdents. We could
> blacklist these by default but provide a way to allow use of obsolete
> syscalls if running on older systems, e.g. '-sandbox on,obsolete=allow'.
> They might be obsolete enough that we decide to just block them
> permanently with no opt-in - we would need to analyse when their
> replacements appeared in widespread use.
>
> There might be a few more syscalls which we can determine are never
> valid to use in QEMU or any library or helper program it might run. I
> expect this list to be very small though, given the impossibility of
> auditing code paths through millions of lines of code QEMU links to.
>
> Everything else should be allowed.
>
> At this point we have a highly reliable
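The difference between the two approaches comes down to what happens to a
syscall nobody thought about. The toy Python model below (not real seccomp;
the helper names whitelist and blacklist and the deliberately tiny lists are
made up for illustration) shows the eventfd case from the mail: the older
eventfd() was never added to the whitelist, so a whitelist policy rejects
it, while a blacklist policy lets it through and still rejects reboot.

```python
# Toy model of the two policy styles, not real seccomp. The syscall lists
# are hypothetical and intentionally short.

def whitelist(allowed):
    """Policy that permits only the listed syscalls (default: deny)."""
    allowed = frozenset(allowed)
    return lambda syscall: syscall in allowed


def blacklist(denied):
    """Policy that permits everything except the listed syscalls."""
    denied = frozenset(denied)
    return lambda syscall: syscall not in denied


# A whitelist that knows eventfd2 but forgot the older eventfd.
wl = whitelist({"read", "write", "eventfd2"})
# A blacklist of calls that affect global system state.
bl = blacklist({"reboot", "swapon", "swapoff", "mount"})

if __name__ == "__main__":
    for name in ("eventfd2", "eventfd", "reboot"):
        print("%-8s whitelist=%s blacklist=%s" % (name, wl(name), bl(name)))
```

With the whitelist, every forgotten syscall is a potential abort of a
running VM; with the blacklist, a forgotten syscall only means slightly
less protection than intended.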