OpenWrt service hardening and jailing
=====================================

Current firmware builds have the problem that a lot of services run as root. This is especially critical for services exposed to the network. Once an attacker has managed to compromise such a service, they have full control over the system. Even if we change all processes to run as non-root, there is still the risk of root privilege escalation using, for example, a bug in a kernel syscall implementation.

Patching all services that we run (we have over 1000 packages) is not possible. The problem is far too diverse and distributed. Linux has many on-board features to reduce these attack vectors. Let's create a solution that allows us to use these features without touching all services.

Adding Seccomp support to procd
===============================

Seccomp is a Berkeley Packet Filter (BPF) based syscall filter. Whenever a service makes a syscall, a BPF filter runs and checks whether the syscall is allowed. Currently the filters are set up to allow whitelisted syscalls. All other syscalls are handled by the default policy, which can be one of

* kill the service (policy = 0)
* return -1 and set errno to n (policy = n)

In order to define a syscall filter for a service, the service must be managed by procd. Once this is the case, we simply add the following line to the init.d script

-> procd_set_param seccomp /etc/seccomp/<service_name>.json

This tells procd to use the filter rules defined inside the json file when the service is started.

As we do not want to patch every service to support seccomp natively, we use the following LD_PRELOAD trick. The service is started with /lib/libpreload-seccomp.so preloaded, which overrides the __libc_start_main() function of the libc. The preloaded function does the following things.

1) get a pointer to the real __libc_start_main()
2) get a pointer to the real main() function of the service
3) call the real __libc_start_main() but pass it a pointer to a dummy main() function
4) inside the dummy main(), read the seccomp filter from json and apply it
5) call the real main()

Doing this allows us to run the dynamic loader (ld.so) and libc_init code before we apply the seccomp filter. This reduces the number of whitelisted syscalls required to run the service.

We have extended the kernel's seccomp implementation and added an additional return action that works exactly like the errno action described above but also prints a line to the kernel log reporting which process triggered the exception and which syscall number is missing. This could also be achieved using the trap action, but that would require additional stack parsing inside user-land, which is ugly.

Enabling seccomp support
========================

As this procd feature is still new and under development, it is not enabled by default. In order to enable it you need to select the following option inside "make menuconfig"

-> Global build settings ---> [*] Enable seccomp support

The feature has so far been implemented for mips, i386 and x86_64. Other architectures will follow shortly.

Creating a seccomp filter json file
===================================

It would be very tedious to dig through all packages and figure out which syscalls are used, especially as many of them are hidden inside the libc functions that get used. To make things easier we have written a trivial strace-like tool called utrace that will automagically create the json file for you. Simply calling

-> /sbin/utrace /bin/echo foo

will create a file called /tmp/echo.json.$pid

The syscalls are ordered with the one called most often listed first, so that it is also the first one in the actual filter rule set inside the kernel.

To make things even easier, an extra init.d command called "trace" was added.

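Once a filter has been generated, it has to be referenced from the init script of the service. A minimal sketch of such a script follows. The service name "exampled", its binary path and the START value are hypothetical, and copying the json file from /tmp/ to /etc/seccomp/ is assumed to be a manual step.

```shell
#!/bin/sh /etc/rc.common
# Sketch only: "exampled" and all of its paths are hypothetical placeholders.

START=50
USE_PROCD=1

start_service() {
	procd_open_instance
	procd_set_param command /usr/sbin/exampled
	# Filter created by utrace (or the "trace" init.d command) and then
	# copied from /tmp/ to this location; the copy is an assumed manual step.
	procd_set_param seccomp /etc/seccomp/exampled.json
	procd_close_instance
}
```
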
By calling

-> /etc/init.d/<service_name> trace

you can make procd start the service in trace mode. Once the service is stopped, the json file is written to /tmp/.

Adding process jailing to procd
===============================

This feature uses Linux namespaces. If you do not know what namespaces are (there are currently 5 of them) please refer to this LWN article.

-> http://lwn.net/Articles/531114/

In a nutshell, a namespace is a container that separates various aspects of user-land from the rest of the system. Namespaces are the base feature that allows us to run virtual containers using projects such as LXC.

The first namespace that we use is the mount namespace. Once we have spawned our service inside a mount namespace, it cannot see any mounts outside the container. This in turn allows us to use the pivot_root() syscall and effectively create a separate rootfs for the service.

As we do not want the service to see all files on the system, we stage only the required ones into the container. The jailing tool automatically detects the libraries needed to run the service; all other files need to be defined inside the procd init script. The following 3 new commands are used for this.

* procd_add_jail <name> <features>
  this will tell procd to create a jail and call it <name>. It will also enable the requested <features>.
* procd_add_jail_mount <file1> <file2> ...
  procd will add the listed files to the container as read-only
* procd_add_jail_mount_rw <file1> <file2> ...
  procd will add the listed files to the container as writable

For lack of a better option we currently create a tmpfs and then do a rebind mount for each of the files. Read-only files get a "-o remount,ro" applied. Once all files are mounted we also run "-o remount,ro" on the whole tmpfs. This has the big disadvantage that the mount table is bloated with lots of single mounts.

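The three jail commands above could be combined in an init script roughly as sketched here. The service name "exampled", its config file and its pid file are hypothetical. It is also an assumption of this sketch that the jail commands are called inside the instance block, next to the other per-instance parameters.

```shell
#!/bin/sh /etc/rc.common
# Sketch only: "exampled" and all of its paths are hypothetical placeholders.

START=50
USE_PROCD=1

start_service() {
	procd_open_instance
	procd_set_param command /usr/sbin/exampled
	# Create a jail named "exampled". The optional <features> arguments are
	# left out on purpose, as the feature list is not spelled out here.
	procd_add_jail exampled
	# Files the service only needs to read.
	procd_add_jail_mount /etc/exampled.conf
	# Files the service must be able to write.
	procd_add_jail_mount_rw /var/run/exampled.pid
	procd_close_instance
}
```

The libraries needed by the binary are not listed, since the jailing tool is described as detecting them on its own.
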
In a future step we intend to create a new filesystem called jailfs that works similarly to overlayfs and does the magic inside a single mount, but more on that later.

Loading the jail
================

Instead of spawning the process directly, procd uses the /sbin/ujail tool. This will do the following things

1) create the read-only tmpfs with all the files inside
2) call the clone() syscall, which spawns the namespaces and executes the ujail stub running inside the container
3) the stub will then call pivot_root() and jump into the tmpfs
4) finally the actual service is spawned

It is possible to tell the stub to also apply a seccomp filter to the service using the same method mentioned above.

In addition to the mount namespace we also enable the process namespace. This makes sure that our stub runs inside the container with pid 1, the service with pid 2, and that neither is able to see any other pids running on the system. Enabling the UTS namespace allows us to set a separate hostname inside the container.

There are 2 more namespaces for which support will be added soon. The first is the user namespace, which will make the service think it is running as uid/gid 0 while in reality it is not root but a different user (the namespace uses an offset), such as nobody, with the according restrictions applied. The fifth namespace is the network namespace, which will add virtual network interfaces to the container. But more on that later.

Having this chain of processes allows procd to manage the jailed process just like any other process. All the features, like restarting the service when the config has changed, work as if the jail was not loaded.

Resource usage overhead and performance hit
===========================================

The actual code used to bring up the setup is minimal and currently weighs in at around 1.5k lines of code. The initial measurement showed that a container has a RAM footprint of around 150KB.

The overhead of seccomp is hard to measure but most likely linear. The more syscalls get called, the higher the performance hit is. However, as almost all architectures already have a BPF JIT inside the kernel, these filters get converted to native code that is executed. The utrace tool sorts the syscalls by the number of invocations, leading to the most used syscall being the first one in the filter chain.

With routers becoming faster, having more RAM and recently being multi-core, the tradeoff between a small performance hit and more advanced security is something that the user has to evaluate based on the scenario where the router is deployed.

Next steps
==========

With the core functionality now implemented, a whole wealth of things are still missing

* user namespace - this will allow us to separate the users inside the jail from the primary system. A privilege escalation will not grant the attacker the same rights as without the container. It also allows us to run services that require root without them actually being root.
* network namespace - this still needs a bit of brainstorming. The network namespace would require netifd to track and manage the various virtual interfaces of the containers and set up routing and firewalling for these. A far simpler solution might be to use an LD_PRELOAD library that hooks into the bind() and connect() syscalls and filters which IPs, ports and protocols the container has access to.
* jail delegation - currently each service has its own jail. Ideally we can group services together and have jails running more than one service based on some set of rules.
* eBPF - seccomp uses BPF but there is also eBPF, which to my understanding allows us to not only filter syscalls based on their number but also gives us the ability to look at the parameters passed to the syscalls. This would allow us to kill exploits such as path traversals. We could simply block the open() syscall from opening files that have a ".." inside the path.
  eBPF also requires a piece of C code to be written, which then gets passed to llvm, for which there is a backend that will convert the code to a BPF filter. This requires us to add support for the llvm backend to OpenWrt and also to write the filters. This is far more complex than using normal BPF filters.
* jailfs - as mentioned before, we are considering building a small overlayfs-like filesystem that will do the filtering of which files the container can see. This would obsolete the use of the rebind mounts and be far more lightweight. This still needs to be evaluated.
* ubus ACL - if the container has ubus support, it will also have access to all features provided by ubus, which is almost equivalent to being root. Adding support to ubus for permissions based on the caller's uid/gid is easy; the hard part is defining the format in which the ACL rules are written. We are still brainstorming on this one.
* cgroup support - we want to add support for cgroups to allow us to limit the amount of RAM and CPU cycles a container can use. This way an exploited daemon cannot DoS the CPU by running "while(true) {}" or trigger an OOM.
* ssl certificate pinning - there is an iptables module that allows us to pin SSL certificates. Adding the module is easy, but handling the tracking and allowing/blocking of the certificates still needs some brainstorming and coding.

If there are features that we are not aware of yet or that we forgot to list, then please let us know about them. Comments and ideas are welcome ...

_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/cgi-bin/mailman/listinfo/openwrt-devel