Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Fri, Feb 3, 2017 at 3:21 PM, Alexei Starovoitov wrote: > On Fri, Feb 03, 2017 at 01:07:39PM -0800, Andy Lutomirski wrote: >> >> Is there any plan to address this? If not, I'll try to write that >> patch this weekend. > > yes. I'm working on 'disallow program override' flag. > It got stalled, because netns discussion got stalled. > Later today will send a patch for dev_id+inode and > will continue on the flag patch. > Would it make sense to try to document what your proposal does before writing the code? I don't yet see how to get semantics that are both simple and sensible with a "disallow override" flag. I *do* see how to get simple, sensible semantics with an approach where all the programs in scope for the cgroup in question get called. If needed, I can imagine a special "overridable" program that would not be run if the socket in question is bound to a descendent cgroup that also has an "overridable" program but would still let all the normal hierarchical programs in scope get called.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Fri, Feb 03, 2017 at 01:07:39PM -0800, Andy Lutomirski wrote: > > Is there any plan to address this? If not, I'll try to write that > patch this weekend. yes. I'm working on 'disallow program override' flag. It got stalled, because netns discussion got stalled. Later today will send a patch for dev_id+inode and will continue on the flag patch.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Jan 23, 2017 at 12:20 PM, Andy Lutomirski wrote: > On Sun, Jan 22, 2017 at 8:31 PM, Alexei Starovoitov > wrote: >> On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote: >>> restricting the types of sockets that can be created, then you do want >>> the filter to work across namespaces, but seccomp can do that too and >>> the current code doesn't handle netns correctly. >> >> are you saying that seccomp supports netns filtering? please show the proof. > > It can trivially restruct the types of sockets that are created by > filtering on socket(2) syscall parameters, at least on sane > architectures that don't use socketcall(). I think this is actually wrong -- the socket creation filter appears to be called only on inet sockets. Is there a good reason for that? > >> To summarize, I think the 'prog override is not allowed' flag would be >> ok feature to have and I wouldn't mind making it the default when no 'flag' >> field is passed to bpf syscall, but it's not acceptable to remove current >> 'prog override is allowed' behavior. >> So if you insist on changing the default, please implement the flag asap. >> Though from security point of view and ABI point of view there is absolutely >> no difference whether this flag is added today or a year later while >> the default is kept as-is. > > It's too late and I have too little time. I'll try to write a patch > to change the default to just deny attempts to override. Better > behavior can be added later. > > IMO your suggestions about priorities are overcomplicated. For your > instrumentation needs, I can imagine that a simple "make this hook not > run if a descendent has a hook" would do it. For everything else, run > them all in tree order (depending on filter type) and let the eBPF > code do whatever it wants to do. Is there any plan to address this? If not, I'll try to write that patch this weekend.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Sun, Jan 22, 2017 at 8:31 PM, Alexei Starovoitov wrote: > On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote: >> On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov >> wrote: >> > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote: >> >> I think it could work by making a single socket cgroup controller that >> >> handles all cgroup things that are bound to a socket. Using >> > >> > Such 'socket cgroup controller' would limit usability of the feature >> > to sockets and force all other use cases like landlock to invent >> > their own wheel, which is undesirable. Everyone will be >> > inventing new 'foo cgroup controller', while all of them >> > are really bpf features. They are different bpf program >> > types that attach to different hooks and use cgroup for scoping. >> >> Can you elaborate on why that would be a problem? In a cgroup v1 >> world, users who want different hierarchies for different types of >> control could easily want one hierarchy for socket hooks and a >> different hierarchy for lsm hooks. In a cgroup v2 delegation world, I >> could easily imagine the decision to delegate socket hooks being >> different from the decision to delegate lsm hooks. Almost all of the >> code would be shared between different bpf-using cgroup controllers. > > how do you think it can be enforced when directory is chowned? A combination of delegation policy (a la subtree_control in the parent) and MAC policy. I'm quite confident that apparmor can apply policy to cgroupfs and I'm reasonably confident (although slightly less so) that selinux can as well. The docs suck :( But if there's only one file in there to apply MAC policy to, the ability to differentiate between subsystems is quite limited. In the current API, there are no files to apply policy to, so it won't work at all. > > ... and open() of the directory done by the current api will preserve > cgroup delegation when and only when bpf_prog_type_cgroup_* > becomes unprivileged. > I'm not proposing creating new api here. I don't know what you mean. open() of the directory can't check anything useful because it has to work for programs that just want to read the directory. >> Meanwhile, cgroup+bpf actually appears to be buggy in this regard even >> regardless of what semantics you think are better. sk_bound_dev_if is >> exposed as a u32 value, but sk_bound_dev_if only has meaning within a >> given netns. The "ip vrf" stuff will straight-up malfunction if a >> process affected by its hook runs in a different netns from the netns >> that "ip vrf" was run in. > > how is that any different from normal 'ip netns exec'? > that is expected user behavior. # ./ip link add dev vrf0 type vrf table 10 # ./ip vrf exec vrf0 ./show_bind Default binding is "vrf0" # ./ip vrf exec vrf0 unshare -n ./show_bind show_bind: getsockopt: No such device The expected behavior to me is that ip netns exec (or equivalently unshare -n) gives a clean slate. Actually malfunctioning (which this example using the latest iproute2 and linux just did) is not expected. I'm done with this part of this thread and I'm sending a patch. > >> restricting the types of sockets that can be created, then you do want >> the filter to work across namespaces, but seccomp can do that too and >> the current code doesn't handle netns correctly. > > are you saying that seccomp supports netns filtering? please show the proof. It can trivially restruct the types of sockets that are created by filtering on socket(2) syscall parameters, at least on sane architectures that don't use socketcall(). > To summarize, I think the 'prog override is not allowed' flag would be > ok feature to have and I wouldn't mind making it the default when no 'flag' > field is passed to bpf syscall, but it's not acceptable to remove current > 'prog override is allowed' behavior. > So if you insist on changing the default, please implement the flag asap. > Though from security point of view and ABI point of view there is absolutely > no difference whether this flag is added today or a year later while > the default is kept as-is. It's too late and I have too little time. I'll try to write a patch to change the default to just deny attempts to override. Better behavior can be added later. IMO your suggestions about priorities are overcomplicated. For your instrumentation needs, I can imagine that a simple "make this hook not run if a descendent has a hook" would do it. For everything else, run them all in tree order (depending on filter type) and let the eBPF code do whatever it wants to do.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote: > On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov > wrote: > > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote: > >> I think it could work by making a single socket cgroup controller that > >> handles all cgroup things that are bound to a socket. Using > > > > Such 'socket cgroup controller' would limit usability of the feature > > to sockets and force all other use cases like landlock to invent > > their own wheel, which is undesirable. Everyone will be > > inventing new 'foo cgroup controller', while all of them > > are really bpf features. They are different bpf program > > types that attach to different hooks and use cgroup for scoping. > > Can you elaborate on why that would be a problem? In a cgroup v1 > world, users who want different hierarchies for different types of > control could easily want one hierarchy for socket hooks and a > different hierarchy for lsm hooks. In a cgroup v2 delegation world, I > could easily imagine the decision to delegate socket hooks being > different from the decision to delegate lsm hooks. Almost all of the > code would be shared between different bpf-using cgroup controllers. how do you think it can be enforced when directory is chowned? > >> Having thought about this some more, I think that making it would > >> alleviate a bunch of my concerns, as it would make the semantics if > >> the capable() check were relaxed to ns_capable() be sane. Here's what > > > > here we're on the same page. For any meaningful discussion about > > 'bpf cgroup controller' to happen bpf itself needs to become > > delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP* > > program types need to become available for unprivileged users. > > The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER. > > To make it secure we severely limited its functionality. > > All bpf advances since then (like new map types and verifier extensions) > > were done for root only. If early on the priv vs unpriv bpf features > > were 80/20. Now it's close to 95/5. No work has been done to > > make socket filter type more powerful. It still has to use > > slow-ish ld_abs skb access while tc/xdp have direct packet access. > > Things like register value tracking is root only as well and so on > > and so forth. > > We cannot just flip the switch and allow type_cgroup* to unpriv > > and I don't see any volunteers willing to do this work. > > Until that happens there is no point coming up with designs > > for 'cgroup bpf controller'... whatever that means. > > Sure there is. If delegation can be turned on without changing the > API, then the result will be easier to work with and have fewer > compatibility issues. ... and open() of the directory done by the current api will preserve cgroup delegation when and only when bpf_prog_type_cgroup_* becomes unprivileged. I'm not proposing creating new api here. > > > >> I currently should happen before bpf+cgroup is enabled in a release: > >> > >> 1. Make it netns-aware. This could be as simple as making it only > >> work in the root netns because then real netns awareness can be added > >> later without breaking anything. The current situation is bad in that > >> network namespaces are just ignored and it's plausible that people > >> will start writing user code that depends on having network namespaces > >> be ignored. > > > > nothing in bpf today is netns-aware and frankly I don't see > > how cgroup+bpf has anything to do with netns. > > For regular sockets+bpf we don't check netns. > > When tcpdump opens raw socket and attaches bpf there are no netns > > checks, since socket itself gives a scope for the program to run. > > Same thing applies to cgroup+bpf. cgroup gives a scope for the program. > > But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_* > > hooks. > > > Here I completely disagree with you. tcpdump sees packets in its > network namespace. Regular sockets apply bpf filters to the packets > seen by that socket, and the socket itself is scoped to a netns. > > Meanwhile, cgroup+bpf actually appears to be buggy in this regard even > regardless of what semantics you think are better. sk_bound_dev_if is > exposed as a u32 value, but sk_bound_dev_if only has meaning within a > given netns. The "ip vrf" stuff will straight-up malfunction if a > process affected by its hook runs in a different netns from the netns > that "ip vrf" was run in. how is that any different from normal 'ip netns exec'? that is expected user behavior. > IOW, the current code is buggy. > > > Then if the hooks are used for security, the process > > only needs to do setns() to escape security sandbox. Obviously > > broken semantics. > > This could go both ways. If the goal is to filter packets, then it's > not really important to have the filter keep working if the sandboxed > task unshares netns -- in the new netns, there isn't any access to the > netwo
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov wrote: > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote: >> I think it could work by making a single socket cgroup controller that >> handles all cgroup things that are bound to a socket. Using > > Such 'socket cgroup controller' would limit usability of the feature > to sockets and force all other use cases like landlock to invent > their own wheel, which is undesirable. Everyone will be > inventing new 'foo cgroup controller', while all of them > are really bpf features. They are different bpf program > types that attach to different hooks and use cgroup for scoping. Can you elaborate on why that would be a problem? In a cgroup v1 world, users who want different hierarchies for different types of control could easily want one hierarchy for socket hooks and a different hierarchy for lsm hooks. In a cgroup v2 delegation world, I could easily imagine the decision to delegate socket hooks being different from the decision to delegate lsm hooks. Almost all of the code would be shared between different bpf-using cgroup controllers. > >> Having thought about this some more, I think that making it would >> alleviate a bunch of my concerns, as it would make the semantics if >> the capable() check were relaxed to ns_capable() be sane. Here's what > > here we're on the same page. For any meaningful discussion about > 'bpf cgroup controller' to happen bpf itself needs to become > delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP* > program types need to become available for unprivileged users. > The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER. > To make it secure we severely limited its functionality. > All bpf advances since then (like new map types and verifier extensions) > were done for root only. If early on the priv vs unpriv bpf features > were 80/20. Now it's close to 95/5. No work has been done to > make socket filter type more powerful. It still has to use > slow-ish ld_abs skb access while tc/xdp have direct packet access. > Things like register value tracking is root only as well and so on > and so forth. > We cannot just flip the switch and allow type_cgroup* to unpriv > and I don't see any volunteers willing to do this work. > Until that happens there is no point coming up with designs > for 'cgroup bpf controller'... whatever that means. Sure there is. If delegation can be turned on without changing the API, then the result will be easier to work with and have fewer compatibility issues. > >> I currently should happen before bpf+cgroup is enabled in a release: >> >> 1. Make it netns-aware. This could be as simple as making it only >> work in the root netns because then real netns awareness can be added >> later without breaking anything. The current situation is bad in that >> network namespaces are just ignored and it's plausible that people >> will start writing user code that depends on having network namespaces >> be ignored. > > nothing in bpf today is netns-aware and frankly I don't see > how cgroup+bpf has anything to do with netns. > For regular sockets+bpf we don't check netns. > When tcpdump opens raw socket and attaches bpf there are no netns > checks, since socket itself gives a scope for the program to run. > Same thing applies to cgroup+bpf. cgroup gives a scope for the program. > But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_* > hooks. Here I completely disagree with you. tcpdump sees packets in its network namespace. Regular sockets apply bpf filters to the packets seen by that socket, and the socket itself is scoped to a netns. Meanwhile, cgroup+bpf actually appears to be buggy in this regard even regardless of what semantics you think are better. sk_bound_dev_if is exposed as a u32 value, but sk_bound_dev_if only has meaning within a given netns. The "ip vrf" stuff will straight-up malfunction if a process affected by its hook runs in a different netns from the netns that "ip vrf" was run in. IOW, the current code is buggy. > Then if the hooks are used for security, the process > only needs to do setns() to escape security sandbox. Obviously > broken semantics. This could go both ways. If the goal is to filter packets, then it's not really important to have the filter keep working if the sandboxed task unshares netns -- in the new netns, there isn't any access to the network at all. If the goal is to reduce attack surface by restricting the types of sockets that can be created, then you do want the filter to work across namespaces, but seccomp can do that too and the current code doesn't handle netns correctly. > >> 2. Make it inherit properly. Inner cgroups should not override outer >> hooks. As in (1), this could be simplified by preventing the same >> hook from being configured in both an ancestor and a descendent >> cgroup. Then inheritance could be added for real later on. > > In general it sounds fine, but it seems the reasoning
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote: > I think it could work by making a single socket cgroup controller that > handles all cgroup things that are bound to a socket. Using Such 'socket cgroup controller' would limit usability of the feature to sockets and force all other use cases like landlock to invent their own wheel, which is undesirable. Everyone will be inventing new 'foo cgroup controller', while all of them are really bpf features. They are different bpf program types that attach to different hooks and use cgroup for scoping. > Having thought about this some more, I think that making it would > alleviate a bunch of my concerns, as it would make the semantics if > the capable() check were relaxed to ns_capable() be sane. Here's what here we're on the same page. For any meaningful discussion about 'bpf cgroup controller' to happen bpf itself needs to become delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP* program types need to become available for unprivileged users. The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER. To make it secure we severely limited its functionality. All bpf advances since then (like new map types and verifier extensions) were done for root only. If early on the priv vs unpriv bpf features were 80/20. Now it's close to 95/5. No work has been done to make socket filter type more powerful. It still has to use slow-ish ld_abs skb access while tc/xdp have direct packet access. Things like register value tracking is root only as well and so on and so forth. We cannot just flip the switch and allow type_cgroup* to unpriv and I don't see any volunteers willing to do this work. Until that happens there is no point coming up with designs for 'cgroup bpf controller'... whatever that means. > I currently should happen before bpf+cgroup is enabled in a release: > > 1. Make it netns-aware. This could be as simple as making it only > work in the root netns because then real netns awareness can be added > later without breaking anything. The current situation is bad in that > network namespaces are just ignored and it's plausible that people > will start writing user code that depends on having network namespaces > be ignored. nothing in bpf today is netns-aware and frankly I don't see how cgroup+bpf has anything to do with netns. For regular sockets+bpf we don't check netns. When tcpdump opens raw socket and attaches bpf there are no netns checks, since socket itself gives a scope for the program to run. Same thing applies to cgroup+bpf. cgroup gives a scope for the program. But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_* hooks. Then if the hooks are used for security, the process only needs to do setns() to escape security sandbox. Obviously broken semantics. > 2. Make it inherit properly. Inner cgroups should not override outer > hooks. As in (1), this could be simplified by preventing the same > hook from being configured in both an ancestor and a descendent > cgroup. Then inheritance could be added for real later on. In general it sounds fine, but it seems the reasoning to add such restriction now (instead of later), so that program chain can be added without breaking abi, since if we don't restrict it now there will be no way to add it without breaking abi?! That is incorrect assumption. We can add chaining and can add 'do not override' logic without breaking existing semantics. For example, we can add 'priority' field to struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */} which would indicate relative position of multiple chained programs applied to the same cgroup+hook pair. Multiple programs with the same priority will be executed in the order they were added. Programs with different priorities will execute in the priority order. Such scheme will be more generic and flexible than earlier proposals. Similarly we can add another flag that will say 'dissallow override of bpf program in descendent cgroup'. It's all trivial to do, since bpf syscall was designed for extensibility. Also until bpf_type_cgroup* becomes unprivileged there is no reason to add this 'priority/prog chaining' feature, since if it's used for security the root can always override it no matter cgroup hierarchy. > 3. Give cgroup delegation support some serious thought. In > particular, if delegation would be straightforward but the current API > wouldn't work well with delegation, then at least consider whether the > API should change before it becomes stable so that two APIs don't need > to be supported going forward. please see example above. Since we went with bpf syscall (instead of inextensible ioctl) we can add any new cgroup+bpf logic without breaking current abi. No matter how you twist it the cgroup+bpf is bpf specific feature.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Wed 18-01-17 14:18:50, Tejun Heo wrote: > Hello, Michal. > > On Tue, Jan 17, 2017 at 02:58:30PM +0100, Michal Hocko wrote: > > This would require using hierarchical cgroup iterators to iterate over > > It does behave hierarchically. > > > tasks. As per Andy's testing this doesn't seem to be the case. I haven't > > That's not what Andy's testing showed. What that showed was that > program in a child can override the one from its ancestor. My fault, I've misread Andy's test case. I thought that the child group simply disabled the bpf program and the one from the parent hasn't executed. -- Michal Hocko SUSE Labs
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Wed, Jan 18, 2017 at 4:59 PM, Tejun Heo wrote: > Hello, Andy. > > On Wed, Jan 18, 2017 at 04:18:04PM -0800, Andy Lutomirski wrote: >> To map cgroup -> hook, a simple array in each cgroup structure works. >> To map (cgroup, netns) -> hook function, the natural approach would be >> to have some kind of hash, and that would be slower. This code is >> intended to be *fast*. > > Updating these programs isn't a frequent operation. We can easily > calculate a perfect (or acceptable) hash per-cgroup and rcu swap the > new hashtable. > Fair enough. > >> I think it's currently a broken cgroup feature. I think it could be >> made into a properly working cgroup feature (which I favor) or a >> properly working net feature. Can you articulate why you prefer >> having it be a net feature instead of a cgroup feature? I tend to > > I thought that's what I was doing in the previous message. > >> favor making it be a working cgroup feature (by giving it a file or >> files in cgroupfs and maybe even a controller) because the result >> would have very obvious semantics. > > I'm fine with both directions but one seems far out. I thought you were saying why you thought it wasn't a cgroup feature, but I'm not sure I understand why you thought it shouldn't be a cgroup feature. When I chatted with Alexei, I had the impression that you and he had wanted it to be a cgroup controller but thought it wouldn't work well. I think it could work by making a single socket cgroup controller that handles all cgroup things that are bound to a socket. Using individual files would play nicer with cgroup delegation within a single netns. > >> But maybe this is just a false dichotomy. Could this feature be a >> per-netns configuration where you can give instructions like "run this >> hook if you're in such-and-such netns and in a given cgroup or one of >> its descendents"? This would prevent it from being a direct analogue >> to systemd's RestrictAddressFamilies= option, but that may be okay. >> This would side-step issues in the current code where a hook can't >> rely on ifindex meaning what the hook thinks it means. > > How is this different from making the current code netns aware? The descendents part is important. > >> Actually, I think I like that approach. What if it we had a "socket" >> controller and files like "socket.create_socket", "socket.ingress" and >> "socket.egress"? You could use the bpf() syscall to install a bpf >> filter into "socket.egress" and that filter would filter egress for >> the target cgroup and its descendents on the current netns. As a >> first pass, the netns issue could be sidestepped by making it only >> work in the root netns (i.e. the bpf() call would fail if you're not >> in the root netns and the hooks wouldn't run if you're not in the root >> netns.) It *might* even be possible to retrofit in the socket >> controller by saying that the default hierarchy is used if the socket >> controller isn't mounted. > > I don't know. You're throwing out too many ideas too fast and it's > difficult to keep up with what the differences are between those > ideas. But if we're doing cgroup controllers, shouldn't cgroup ns > support be sufficient? We can consider the product of cgroup and net > namespaces but that doesn't seem necessary given that people usually > set up these namespaces in conjunction. Fair enough. See way below. > >> What *isn't* possible to cleanly fix after the fact is the current >> semantics that cgroup hooks override the hooks in their ancestors. >> IMO that is simply broken. The example you gave (perf_event) is very >> careful not to have this problem. > > That's like saying installing an iptables rule for a more specific > target is broken. As a cgroup controller, it is not an acceptable > behavior given how delegation works. As something similar to > iptables, it is completely fine. Even the xt_cgroup code checks cgroup_is_descendent(). > >> > * My impression with bpf is that while delegation is something >> > theoretically possible it is not something which is gonna take place >> > in any immediate time frame. If I'm wrong on this, please feel free >> > to correct me. >> >> But the issue isn't *BPF* delegation. It's cgroup delegation or netns >> creation, both of which exist today. > > No, the issue is bpf delegation. If bpf were fully delegatable in > practical sense, we could just do straight forward cgroup bpf > controller. Well, we'll have to think about how to chain the programs > which would differ on program type but that shouldn't be too hard. What do you mean by "if bpf were fully delegatable"? I don't know what it would even mean to delegate bpf. I do know what it means to allow programs to create new network namespaces and for cgroup sub-hierarchies to be delegated. > >> > * There are a lot of use cases which can immediately take advantage of >> > cgroup-aware bpf programs operating as proposed. The model is >> > semantically equivalen
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On 19/01/2017 01:18, Andy Lutomirski wrote: >>> it explicitly respects the cgroup hierarchy. It shows up in >>> /proc/cgroups, and I had no problem mounting a cgroupfs instance with >>> perf_event enabled. So I'm not sure what you mean. >> >> That all it's doing is providing membership information. > > But it's doing it wrong! Even perf_event tests for membership in a > given cgroup *or one of its descendents*. This code does not. > > I think the moral of the story here is that there are lots of open > questions and design work to be done and that this feature really > isn't ready to be stable. For Landlock, I believe that it really > needs to be done right and I will put my foot down and NAK any effort > to have Landlock available in a released kernel without resolving > these types of issues first. Does anyone really want Landlock to work > differently than the net hooks simply because the net hooks were in a > rush? About Landlock, there is two use cases: The first is to allow unprivileged users to tie eBPF programs (rules) to processes. This is the (final) goal. In this case, a (cgroup) hierarchy is mandatory, otherwise it would be trivial to defeat any access rule. This is the same issue with namespaces. The second use case is to only allow privileged users to tie eBPF programs to processes. As discussed, this will be the next series, preceding the unprivileged series. In this privilege case, only root (global CAP_SYS_ADMIN, no namespaces) may be able to use Landlock. Not having a hierarchy is not a security issue (only a practical one). The first/next Landlock series (in February) will focus on process hierarchies (without cgroup), a la seccomp-bpf. It would be too confusing to not use an inheritable hierarchy like seccomp does, even in a privileged-only first approach. The inherited rules should then behave similarly to the seccomp-bpf filters. However, the following series focusing on cgroup could keep the current cgroup-bpf behavior, without hierarchy. I don't like the non-hierarchy approach very much because it add another (less flexible and more confusing) way to do things (for Landlock at least), but I'm willing to do it if needed. Regards, Mickaël signature.asc Description: OpenPGP digital signature
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
Hello, Andy. On Wed, Jan 18, 2017 at 04:18:04PM -0800, Andy Lutomirski wrote: > To map cgroup -> hook, a simple array in each cgroup structure works. > To map (cgroup, netns) -> hook function, the natural approach would be > to have some kind of hash, and that would be slower. This code is > intended to be *fast*. Updating these programs isn't a frequent operation. We can easily calculate a perfect (or acceptable) hash per-cgroup and rcu swap the new hashtable. > I suppose you could have each cgroup structure contain an array where > each element is (netns, hook function) and just skip the ones that are > the wrong netns. Perhaps it would be rare to have many of those. Yeah, or maybe even that. It just isn't a fundamentally difficult problem. > I think it's currently a broken cgroup feature. I think it could be > made into a properly working cgroup feature (which I favor) or a > properly working net feature. Can you articulate why you prefer > having it be a net feature instead of a cgroup feature? I tend to I thought that's what I was doing in the previous message. > favor making it be a working cgroup feature (by giving it a file or > files in cgroupfs and maybe even a controller) because the result > would have very obvious semantics. I'm fine with both directions but one seems far out. > But maybe this is just a false dichotomy. Could this feature be a > per-netns configuration where you can give instructions like "run this > hook if you're in such-and-such netns and in a given cgroup or one of > its descendents"? This would prevent it from being a direct analogue > to systemd's RestrictAddressFamilies= option, but that may be okay. > This would side-step issues in the current code where a hook can't > rely on ifindex meaning what the hook thinks it means. How is this different from making the current code netns aware? > Actually, I think I like that approach. What if it we had a "socket" > controller and files like "socket.create_socket", "socket.ingress" and > "socket.egress"? You could use the bpf() syscall to install a bpf > filter into "socket.egress" and that filter would filter egress for > the target cgroup and its descendents on the current netns. As a > first pass, the netns issue could be sidestepped by making it only > work in the root netns (i.e. the bpf() call would fail if you're not > in the root netns and the hooks wouldn't run if you're not in the root > netns.) It *might* even be possible to retrofit in the socket > controller by saying that the default hierarchy is used if the socket > controller isn't mounted. I don't know. You're throwing out too many ideas too fast and it's difficult to keep up with what the differences are between those ideas. But if we're doing cgroup controllers, shouldn't cgroup ns support be sufficient? We can consider the product of cgroup and net namespaces but that doesn't seem necessary given that people usually set up these namespaces in conjunction. > What *isn't* possible to cleanly fix after the fact is the current > semantics that cgroup hooks override the hooks in their ancestors. > IMO that is simply broken. The example you gave (perf_event) is very > careful not to have this problem. That's like saying installing an iptables rule for a more specific target is broken. As a cgroup controller, it is not an acceptable behavior given how delegation works. As something similar to iptables, it is completely fine. > > * My impression with bpf is that while delegation is something > > theoretically possible it is not something which is gonna take place > > in any immediate time frame. If I'm wrong on this, please feel free > > to correct me. > > But the issue isn't *BPF* delegation. It's cgroup delegation or netns > creation, both of which exist today. No, the issue is bpf delegation. If bpf were fully delegatable in practical sense, we could just do straight forward cgroup bpf controller. Well, we'll have to think about how to chain the programs which would differ on program type but that shouldn't be too hard. > > * There are a lot of use cases which can immediately take advantage of > > cgroup-aware bpf programs operating as proposed. The model is > > semantically equivalent to iptables (let's address the netns part if > > that's an issue) which net people are familiar with. > > That sounds like a great argument for those users to patch their > kernels to gain experience with this feature. I don't get why this would particularly point to out-of-tree usage. Why is that? > > * It isn't exclusive with adding cgroup bpf controller down the road > > if necessary and practical. Sure, this isn't the perfect situation > > but it seems like an acceptable trade-off to me. What ever is > > perfect? > > I think it more or less is exclusive. I can imagine davem accepting a > patch to make the sock_cgroup_data think point at a new "socket" > cgroup type. I can't imagine him accepting a patch to have
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Wed, Jan 18, 2017 at 2:41 PM, Tejun Heo wrote: > Helloo, Andy. > > On Mon, Jan 16, 2017 at 09:18:36PM -0800, Andy Lutomirski wrote: >> Perhaps this is a net feature, though, not a cgroup feature. This >> would IMO make a certain amount of sense. Iptables cgroup matches, >> for example, logically are an iptables (i.e., net) feature. The > > Yeap. > >> problem here is that network configuration (and this explicitly >> includes iptables) is done on a per-netns level, whereas these new >> hooks entirely ignore network namespaces. I've pointed out that >> trying to enable them in a namespaced world is problematic (e.g. >> switching to ns_capable() will cause serious problems), and Alexei >> seems to think that this will never happen. So I don't think we can >> really call this a net feature that works the way that other net >> features work. >> >> (Suppose that this actually tried to be netns-enabled. Then you'd >> have to map from (netns, cgroup) -> hook, and this would probably be >> slow and messy.) > > I'm not sure how important netns support for these bpf programs > actually is but I have no objection to making it behave in an > equivalent manner to iptables where appropriate. This is kinda > trivial to do too. For now, we can simply not run the programs if the > associated socket belongs to non-root netns. Later, if necessary, we > can add per-ns bpf programs. And I can't think of a reason why > (netns, cgroup) -> function mapping would be slow and messy. Why > would that be? To map cgroup -> hook, a simple array in each cgroup structure works. To map (cgroup, netns) -> hook function, the natural approach would be to have some kind of hash, and that would be slower. This code is intended to be *fast*. I suppose you could have each cgroup structure contain an array where each element is (netns, hook function) and just skip the ones that are the wrong netns. Perhaps it would be rare to have many of those. > >> So I think that leaves it being a cgroup feature. And it definitely >> does *not* play by the rules of cgroups right now. > > Because this in no way is a cgroup feature. As you wrote above, it is > something similar to iptables with lacking netns considerations. > Let's address that and make it a proper network thing. I think it's currently a broken cgroup feature. I think it could be made into a properly working cgroup feature (which I favor) or a properly working net feature. Can you articulate why you prefer having it be a net feature instead of a cgroup feature? I tend to favor making it be a working cgroup feature (by giving it a file or files in cgroupfs and maybe even a controller) because the result would have very obvious semantics. But maybe this is just a false dichotomy. Could this feature be a per-netns configuration where you can give instructions like "run this hook if you're in such-and-such netns and in a given cgroup or one of its descendents"? This would prevent it from being a direct analogue to systemd's RestrictAddressFamilies= option, but that may be okay. This would side-step issues in the current code where a hook can't rely on ifindex meaning what the hook thinks it means. Actually, I think I like that approach. What if it we had a "socket" controller and files like "socket.create_socket", "socket.ingress" and "socket.egress"? You could use the bpf() syscall to install a bpf filter into "socket.egress" and that filter would filter egress for the target cgroup and its descendents on the current netns. As a first pass, the netns issue could be sidestepped by making it only work in the root netns (i.e. the bpf() call would fail if you're not in the root netns and the hooks wouldn't run if you're not in the root netns.) It *might* even be possible to retrofit in the socket controller by saying that the default hierarchy is used if the socket controller isn't mounted. What *isn't* possible to cleanly fix after the fact is the current semantics that cgroup hooks override the hooks in their ancestors. IMO that is simply broken. The example you gave (perf_event) is very careful not to have this problem. > >> > I'm sure we'll >> > eventually get there but from what I hear it isn't something we can >> > pull off in a restricted timeframe. >> >> To me, this sounds like "the API isn't that great, we know how to do >> better, but we really really want this feature ASAP so we want to ship >> it with a sub-par API." I think that's a bad idea. > > I see your point but this isn't something which is black and white. > There are arguments for both directions. I'm leaning the other > direction because > > * My impression with bpf is that while delegation is something > theoretically possible it is not something which is gonna take place > in any immediate time frame. If I'm wrong on this, please feel free > to correct me. But the issue isn't *BPF* delegation. It's cgroup delegation or netns creation, both of which exist today. > > * There are a lo
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
Helloo, Andy. On Mon, Jan 16, 2017 at 09:18:36PM -0800, Andy Lutomirski wrote: > Perhaps this is a net feature, though, not a cgroup feature. This > would IMO make a certain amount of sense. Iptables cgroup matches, > for example, logically are an iptables (i.e., net) feature. The Yeap. > problem here is that network configuration (and this explicitly > includes iptables) is done on a per-netns level, whereas these new > hooks entirely ignore network namespaces. I've pointed out that > trying to enable them in a namespaced world is problematic (e.g. > switching to ns_capable() will cause serious problems), and Alexei > seems to think that this will never happen. So I don't think we can > really call this a net feature that works the way that other net > features work. > > (Suppose that this actually tried to be netns-enabled. Then you'd > have to map from (netns, cgroup) -> hook, and this would probably be > slow and messy.) I'm not sure how important netns support for these bpf programs actually is but I have no objection to making it behave in an equivalent manner to iptables where appropriate. This is kinda trivial to do too. For now, we can simply not run the programs if the associated socket belongs to non-root netns. Later, if necessary, we can add per-ns bpf programs. And I can't think of a reason why (netns, cgroup) -> function mapping would be slow and messy. Why would that be? > So I think that leaves it being a cgroup feature. And it definitely > does *not* play by the rules of cgroups right now. Because this in no way is a cgroup feature. As you wrote above, it is something similar to iptables with lacking netns considerations. Let's address that and make it a proper network thing. > > I'm sure we'll > > eventually get there but from what I hear it isn't something we can > > pull off in a restricted timeframe. > > To me, this sounds like "the API isn't that great, we know how to do > better, but we really really want this feature ASAP so we want to ship > it with a sub-par API." I think that's a bad idea. I see your point but this isn't something which is black and white. There are arguments for both directions. I'm leaning the other direction because * My impression with bpf is that while delegation is something theoretically possible it is not something which is gonna take place in any immediate time frame. If I'm wrong on this, please feel free to correct me. * There are a lot of use cases which can immediately take advantage of cgroup-aware bpf programs operating as proposed. The model is semantically equivalent to iptables (let's address the netns part if that's an issue) which net people are familiar with. * It isn't exclusive with adding cgroup bpf controller down the road if necessary and practical. Sure, this isn't the perfect situation but it seems like an acceptable trade-off to me. What ever is perfect? > > This also holds true for the perf controller. While it is implemented > > as a controller, it isn't visible to cgroup users in any way and the > > only function it serves is providing the membership test to perf > > subsystem. perf is the one which decides whether and how it is to be > > used. cgroup providing membership test to other subsystems is > > completely acceptable and established. > > Unless I'm mistaken, "perf_event" is an actual cgroup controller, and Yeah, it is implemented as a controller but in essence what it does is just tagging the hierarchy to tell perf to use that hierarchy for membership test purposes. > it explicitly respects the cgroup hierarchy. It shows up in > /proc/cgroups, and I had no problem mounting a cgroupfs instance with > perf_event enabled. So I'm not sure what you mean. That all it's doing is providing membership information. Thanks. -- tejun
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
Hello, Michal. On Tue, Jan 17, 2017 at 02:58:30PM +0100, Michal Hocko wrote: > This would require using hierarchical cgroup iterators to iterate over It does behave hierarchically. > tasks. As per Andy's testing this doesn't seem to be the case. I haven't That's not what Andy's testing showed. What that showed was that program in a child can override the one from its ancestor. > checked the implementation closely but my understanding was that using > only cgroup specific tasks was intentional. Maybe I'm misreading you but if you're saying that a bpf program would only execute for tasks on that specific cgroup, that is not true. > I do agree that using hierarchy aware cgroup iterators is the right > approach here and we wouldn't see any issue. But I am still not sure > I've wrapped my head around this feature completely. It's really simple. You can install bpf programs with a cgroup path associated with them. When that particular type of bpf program needs to be executed, the bpf program which is attached to the closest ancestor (including self) is executed. Thanks. -- tejun
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Tue, Jan 17, 2017 at 5:58 AM, Michal Hocko wrote: > On Tue 17-01-17 14:32:04, Peter Zijlstra wrote: >> On Tue, Jan 17, 2017 at 02:03:03PM +0100, Michal Hocko wrote: >> > On Sun 15-01-17 20:19:01, Tejun Heo wrote: >> > [...] >> > > So, what's proposed is a proper part of bpf. In terms of >> > > implementation, cgroup helps by hosting the pointers but that doesn't >> > > necessarily affect the conceptual structure of it. Given that, I >> > > don't think it'd be a good idea to add anything to cgroup interface >> > > for this feature. Introspection is great to have but this should be >> > > introspectable together with other bpf programs using the same >> > > mechanism. That's where it belongs. >> > >> > If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we >> > at least enforce that the cgroup has to be a leaf one and no further >> > children groups can be created once there is BPF program attached? >> >> Why (again) this stupid constraint? >> >> If you want to use cgroups for tagging (like perf does), _any_ parent >> cgroup will also tag you. >> >> So creating child cgroups, and placing tasks in it, should not be a >> problem, the BPF thing should apply to all of them. > > This would require using hierarchical cgroup iterators to iterate over > tasks. As per Andy's testing this doesn't seem to be the case. I haven't > checked the implementation closely but my understanding was that using > only cgroup specific tasks was intentional. The current semantics are AFAIK that only the innermost cgroup that has a hook installed is in effect. I think this is the wrong design. I think that the right semantics are probably to support both innermost-to-outermost and outermost-to-innermost and to select which is appropriate for each hook. Suppose we have a cgroup /a/b where a and b both have hooks installed. If the hook is a socket creation or egress hook, I think that b's hook should run first. If b's hook rejects, then a's hook is not run. If b's hook accepts, then a's hook is run. This way a gets the last word on any changes to the socket settings and a sees exactly what would happen if it were to accept. Conversely, for ingress hooks, I think that a's hook should run first. This way a sees the packet as it originally came in and can modify or reject it, and then b only sees whatever a chooses to let through. The guiding principle here is that, for actions that originate outside the machine, the outer hooks should IMO run first and, for actions that originate from a task in a cgroup, the innermost hooks should run first. --Andy
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Tue 17-01-17 14:32:04, Peter Zijlstra wrote: > On Tue, Jan 17, 2017 at 02:03:03PM +0100, Michal Hocko wrote: > > On Sun 15-01-17 20:19:01, Tejun Heo wrote: > > [...] > > > So, what's proposed is a proper part of bpf. In terms of > > > implementation, cgroup helps by hosting the pointers but that doesn't > > > necessarily affect the conceptual structure of it. Given that, I > > > don't think it'd be a good idea to add anything to cgroup interface > > > for this feature. Introspection is great to have but this should be > > > introspectable together with other bpf programs using the same > > > mechanism. That's where it belongs. > > > > If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we > > at least enforce that the cgroup has to be a leaf one and no further > > children groups can be created once there is BPF program attached? > > Why (again) this stupid constraint? > > If you want to use cgroups for tagging (like perf does), _any_ parent > cgroup will also tag you. > > So creating child cgroups, and placing tasks in it, should not be a > problem, the BPF thing should apply to all of them. This would require using hierarchical cgroup iterators to iterate over tasks. As per Andy's testing this doesn't seem to be the case. I haven't checked the implementation closely but my understanding was that using only cgroup specific tasks was intentional. I do agree that using hierarchy aware cgroup iterators is the right approach here and we wouldn't see any issue. But I am still not sure I've wrapped my head around this feature completely. -- Michal Hocko SUSE Labs
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Tue, Jan 17, 2017 at 02:03:03PM +0100, Michal Hocko wrote: > On Sun 15-01-17 20:19:01, Tejun Heo wrote: > [...] > > So, what's proposed is a proper part of bpf. In terms of > > implementation, cgroup helps by hosting the pointers but that doesn't > > necessarily affect the conceptual structure of it. Given that, I > > don't think it'd be a good idea to add anything to cgroup interface > > for this feature. Introspection is great to have but this should be > > introspectable together with other bpf programs using the same > > mechanism. That's where it belongs. > > If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we > at least enforce that the cgroup has to be a leaf one and no further > children groups can be created once there is BPF program attached? Why (again) this stupid constraint? If you want to use cgroups for tagging (like perf does), _any_ parent cgroup will also tag you. So creating child cgroups, and placing tasks in it, should not be a problem, the BPF thing should apply to all of them.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Sun 15-01-17 20:19:01, Tejun Heo wrote: [...] > So, what's proposed is a proper part of bpf. In terms of > implementation, cgroup helps by hosting the pointers but that doesn't > necessarily affect the conceptual structure of it. Given that, I > don't think it'd be a good idea to add anything to cgroup interface > for this feature. Introspection is great to have but this should be > introspectable together with other bpf programs using the same > mechanism. That's where it belongs. If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we at least enforce that the cgroup has to be a leaf one and no further children groups can be created once there is BPF program attached? This should break the existing usecases AFAIU and it would allow future changes without major API surprises. -- Michal Hocko SUSE Labs
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
Hello, Sorry about the delay. Some fire fighthing followed the holidays. On Tue, Jan 03, 2017 at 11:25:59AM +0100, Michal Hocko wrote: > > So from what I understand the proposed cgroup is not in fact > > hierarchical at all. > > > > @TJ, I thought you were enforcing all new cgroups to be properly > > hierarchical, that would very much include this one. > > I would be interested in that as well. We have made that mistake in > memcg v1 where hierarchy could be disabled for performance reasons and > that turned out to be major PITA in the end. Why do we want to repeat > the same mistake here? Across the different threads on this subject, there have been multiple explanations but I'll try to sum it up more clearly. The big issue here is whether this is a cgroup thing or a bpf thing. I don't think there's anything inherently wrong with one approach or the other. Forget about the proposed cgroup bpf extentions but thinkg about how iptables does cgroups. Whether it's the netcls/netprio in v1 or direct membership matching in v2, it is the network side testing for cgroup membership one way or the other. The only part where cgroup is involved in is answering that test. This also holds true for the perf controller. While it is implemented as a controller, it isn't visible to cgroup users in any way and the only function it serves is providing the membership test to perf subsystem. perf is the one which decides whether and how it is to be used. cgroup providing membership test to other subsystems is completely acceptable and established. Now coming back to bpf, the current implementation is just that. Sure, cgroup hosts the rules in its data structures but that isn't something conceptually relevant. We might as well implement it as a prefixed hash table from bpf side. Having pointers in struct cgroup is just a more efficient and easier way of achieving the same result. In fact, IIUC, this whole thing was born out of discussions around implementing scalable cgroup membership matching from bpf programs. So, what's proposed is a proper part of bpf. In terms of implementation, cgroup helps by hosting the pointers but that doesn't necessarily affect the conceptual structure of it. Given that, I don't think it'd be a good idea to add anything to cgroup interface for this feature. Introspection is great to have but this should be introspectable together with other bpf programs using the same mechanism. That's where it belongs. None of the issues that people have been raising here is actually an issue if one thinks of it as a part of bpf. Its security model is exactly the same as any other bpf programs. Recursive behavior is exactly the same as how other external cgroup descendant membership testing work. There is no issue here whatsoever. Now, I'm not claiming that a bpf mechanism which is a proper part of cgrou isn't attractive. It is, especially with delegation; however, that is also where we don't quite know how to proceed. This doesn't have much to do with cgroup. If something is delegatable to non-priv users and scoped, cgroup's fine with it and if that's not possible it simply isn't something which is delegatable and putting it on cgroup doesn't change that. I'm far from being a bpf expert, so I could be wrong here, but I don't think there's anything fundamental which prevents bpf from being delegatable but at the same time bpf is something which is extremely flexible and nobody really thought about or worked that much on delegating bpf. If there's enough need for it, I'm sure we'll eventually get there but from what I hear it isn't something we can pull off in a restricted timeframe. There's nothing which makes the currently implemented mechanism exclusive with a cgroup controller based one. The hooks are the expensive part but can be shared, the rest is just about which programs to execute in what order and how they should be chained. There are a lot of immediate use cases which can benefit from the proposed cgroup bpf mechanism and they're all fine with it being a part of bpf and behaving like any other network mechanism behaves in terms of configuration and delegation. I don't see a reason why we would hold back on merging this. All the raised issues are coming from confusing this as a part of cgroup. It isn't. It is a part of bpf. If we want a bpf cgroup controller, great, but that is a separate thing. Thanks. -- tejun
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Tue 20-12-16 10:11:50, Peter Zijlstra wrote: > On Mon, Dec 19, 2016 at 05:56:24PM -0800, Andy Lutomirski wrote: > > >> Huh? My example in the original email attaches a program in a > > >> sub-hierarchy. Are you saying that 4.11 could make that example stop > > >> working? > > > > > > Are you suggesting sub-cgroups should not be allowed to override the > > > filter of a parent cgroup? > > > > Yes, exactly. I think there are two sensible behaviors: > > > > a) sub-cgroups cannot have a filter at all of the parent has a filter. > > (This is the "punt" approach -- it lets different semantics be > > assigned later without breaking userspace.) > > > > b) sub-cgroups can have a filter if a parent does, too. The semantics > > are that the sub-cgroup filter runs first and all side-effects occur. > > If that filter says "reject" then ancestor filters are skipped. If > > that filter says "accept", then the ancestor filter is run and its > > side-effects happen as well. (And so on, all the way up to the root.) > > So from what I understand the proposed cgroup is not in fact > hierarchical at all. > > @TJ, I thought you were enforcing all new cgroups to be properly > hierarchical, that would very much include this one. I would be interested in that as well. We have made that mistake in memcg v1 where hierarchy could be disabled for performance reasons and that turned out to be major PITA in the end. Why do we want to repeat the same mistake here? -- Michal Hocko SUSE Labs
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Tue, Dec 20, 2016 at 10:49:25AM -0800, Andy Lutomirski wrote: > >> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too. > >> It doesn't make a semantic difference, except that I dislike > >> BPF_PROG_DETACH because that particular command isn't BPF-specific at > >> all. > > > > Well, I think it is; it pops the bpf program from a target and drops the > > reference on it. It's not much code, but it's certainly bpf-specific. > > I mean the interface isn't bpf-specific. If there was something that > wasn't bpf attached to the target, you'd still want an API to detach > it. This discussion won't go anywhere while you keep thinking that this api has to be generalized. As I explained several times earlier BPF_CGROUP_INET_SOCK_CREATE hook is bpf specific. There is nothing in the kernel that can take advantage of it today, so by definition the hook is bpf specific. Period. Saying that something in the future may come along that would want to use that is like saying I want to design the generic steering wheel for any car that will ever need it. Hence if you want to change 'target_fd' in BPF_PROG_ATTACH/DETACH cmds from being fd of open("cgroupdir") to fd of open("cgroupdir/cgroup.bpf") file inside it then I'm ok with that. All other proposals with non-extensible ioctls() and crazy text based per-hook permissions is nack.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Tue, Dec 20, 2016 at 10:36 AM, Daniel Mack wrote: > Hi, > > On 12/20/2016 06:23 PM, Andy Lutomirski wrote: >> On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack wrote: > >> To clarify, since this thread has gotten excessively long and twisted, >> I think it's important that, for hooks attached to a cgroup, you be >> able to tell in a generic way whether something is plugged into the >> hook. The natural way to see a cgroup's configuration is to read from >> cgroupfs, so I think that reading from cgroupfs should show you that a >> BPF program is attached and also give enough information that, once >> bpf programs become dumpable, you can dump the program (using the >> bpf() syscall or whatever). > > [...] > >> There isn't a big semantic difference between >> 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(..., >> CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file", >> O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'. There is, however, a semantic >> difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY) >> because the permission check is much weaker. > > Okay, if you have such a control file, you can of course do something > like that. When we discussed things back then with Tejun however, we > concluded that a controller that is not completely controllable through > control knobs that can be written and read via cat is meaningless. > That's why this has become a 'hidden' cgroup feature. > > With your proposed API, you'd first go to the bpf(2) syscall in order to > get a prog fd, and then come back to some sort of cgroup API to put the > fd in there. That's quite a mix and match, which is why we considered > the API cleaner in its current form, as everything that is related to > bpf is encapsulated behind a single syscall. You already have to do bpf() to get a prog fd, then open() to get a cgroup fd, then bpf() or ioctl() to attach, so this isn't much different, and its exactly the same number of syscalls. > >> My preference would be to do an ioctl on a new >> /cgroup/NAME/network_hooks.inet_ingress file. Reading that file tells >> you whether something is attached and hopefully also gives enough >> information (a hash of the BPF program, perhaps) to dump the actual >> program using future bpf() interfaces. write() and ioctl() can be >> used to configure it as appropriate. > > So am I reading this right? You're proposing to add ioctl() hooks to > kernfs/cgroupfs? That would open more possibilities of course, but I'm > not sure where that rabbit hole leads us eventually. Indeed. I already have a test patch to add ioctl() to kernfs. Adding it to cgroupfs shouldn't be much more complicated. > >> Another option that I like less would be to have a >> /cgroup/NAME/cgroup.bpf that lists all the active hooks along with >> their contents. You would do an ioctl() on that to program a hook and >> you could read it to see what's there. > > Yes, read() could, in theory, give you similar information than ioctl(), > but in human-readable form. > >> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too. >> It doesn't make a semantic difference, except that I dislike >> BPF_PROG_DETACH because that particular command isn't BPF-specific at >> all. > > Well, I think it is; it pops the bpf program from a target and drops the > reference on it. It's not much code, but it's certainly bpf-specific. I mean the interface isn't bpf-specific. If there was something that wasn't bpf attached to the target, you'd still want an API to detach it. > So if I set up a cgroup that's monitored and call it /cgroup/a and enable delegation and if the program running there wants to do its own monitoring in /cgroup/a/b (via delegation), then you really want the outer monitor to silently drop events coming from /cgroup/a/b? >>> >>> That's a fair point, and we've discussed it as well. The issue is, as >>> Alexei already pointed out, that we do not want to traverse the tree up >>> to the root for nested cgroups due to the runtime costs in the >>> networking fast-path. After all, we're running the bpf program for each >>> packet in flight. Hence, we opted for the approach to only look at the >>> leaf node for now, with the ability to open it up further in the future >>> using flags during attach etc. >> >> Careful here! You don't look only at the leaf node for now. You do a >> fancy traversal and choose the nearest node that has a hook set up. > > But we do the 'complex' operation at attach time or when a cgroup is > created, both of which are slow-path operations. In the fast-path, we > only look at the leaf, which may or may not have an effective program > installed. And that's of course much cheaper then doing the traversing > for each packet. You would never traverse the full hierarchy for each packet. You'd have a linked list of programs that are attached, kind of like how there's an "effective" array right now. I sent out pseudocode earlier in the thread. > >> mkdir
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
Hi, On 12/20/2016 06:23 PM, Andy Lutomirski wrote: > On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack wrote: > To clarify, since this thread has gotten excessively long and twisted, > I think it's important that, for hooks attached to a cgroup, you be > able to tell in a generic way whether something is plugged into the > hook. The natural way to see a cgroup's configuration is to read from > cgroupfs, so I think that reading from cgroupfs should show you that a > BPF program is attached and also give enough information that, once > bpf programs become dumpable, you can dump the program (using the > bpf() syscall or whatever). [...] > There isn't a big semantic difference between > 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(..., > CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file", > O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'. There is, however, a semantic > difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY) > because the permission check is much weaker. Okay, if you have such a control file, you can of course do something like that. When we discussed things back then with Tejun however, we concluded that a controller that is not completely controllable through control knobs that can be written and read via cat is meaningless. That's why this has become a 'hidden' cgroup feature. With your proposed API, you'd first go to the bpf(2) syscall in order to get a prog fd, and then come back to some sort of cgroup API to put the fd in there. That's quite a mix and match, which is why we considered the API cleaner in its current form, as everything that is related to bpf is encapsulated behind a single syscall. > My preference would be to do an ioctl on a new > /cgroup/NAME/network_hooks.inet_ingress file. Reading that file tells > you whether something is attached and hopefully also gives enough > information (a hash of the BPF program, perhaps) to dump the actual > program using future bpf() interfaces. write() and ioctl() can be > used to configure it as appropriate. So am I reading this right? You're proposing to add ioctl() hooks to kernfs/cgroupfs? That would open more possibilities of course, but I'm not sure where that rabbit hole leads us eventually. > Another option that I like less would be to have a > /cgroup/NAME/cgroup.bpf that lists all the active hooks along with > their contents. You would do an ioctl() on that to program a hook and > you could read it to see what's there. Yes, read() could, in theory, give you similar information than ioctl(), but in human-readable form. > FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too. > It doesn't make a semantic difference, except that I dislike > BPF_PROG_DETACH because that particular command isn't BPF-specific at > all. Well, I think it is; it pops the bpf program from a target and drops the reference on it. It's not much code, but it's certainly bpf-specific. >>> So if I set up a cgroup that's monitored and call it /cgroup/a and >>> enable delegation and if the program running there wants to do its own >>> monitoring in /cgroup/a/b (via delegation), then you really want the >>> outer monitor to silently drop events coming from /cgroup/a/b? >> >> That's a fair point, and we've discussed it as well. The issue is, as >> Alexei already pointed out, that we do not want to traverse the tree up >> to the root for nested cgroups due to the runtime costs in the >> networking fast-path. After all, we're running the bpf program for each >> packet in flight. Hence, we opted for the approach to only look at the >> leaf node for now, with the ability to open it up further in the future >> using flags during attach etc. > > Careful here! You don't look only at the leaf node for now. You do a > fancy traversal and choose the nearest node that has a hook set up. But we do the 'complex' operation at attach time or when a cgroup is created, both of which are slow-path operations. In the fast-path, we only look at the leaf, which may or may not have an effective program installed. And that's of course much cheaper then doing the traversing for each packet. > mkdir /cgroup/foo > BPF_PROG_ATTACH(some program to foo) > mkdir /cgroup/foo/bar > chown -R some_user /cgroup/foo/bar > > If the kernel only looked at the leaf, then the program that did the > above would not expect that the program would constrain > /cgroup/foo/bar's activity. But, as it stands, the program *would* > expect /cgroup/foo/bar to be constrained, except that, whenever the > capable() check changes to ns_capable() (which will happen eventually > one way or another), then the bad guy can create /cgroup/foo/bar/baz, > install a new no-op hook there, and break the security assumption. > > IOW, I think that totally non-recursive hooks are okay from a security > perspective, albeit rather strange, but the current design is not okay > from a security perspective. We locked down the ability to override any of these programs with
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack wrote: > Hi, > > On 12/20/2016 04:50 AM, Andy Lutomirski wrote: >> You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)? There is nothing bpf >> specfic about the hook except that the name of this macro has "BPF" in >> it. There is nothing whatsoever that's bpf-specific about the context >> -- sk is not bpf-specific at all. >> >> The only thing bpf-specific about it is that it currently only invokes >> bpf programs. That could easily change. > > I'm not sure if I follow. The code as it currently stands only supports > attaching bpf programs to cgroups which have been created using > BPF_PROG_LOAD. If cgroups would support other program types in the > future, then they would need to be stored in different data types > anyway, and the bpf syscall multiplexer would be the wrong entry point > to access them anyway. To clarify, since this thread has gotten excessively long and twisted, I think it's important that, for hooks attached to a cgroup, you be able to tell in a generic way whether something is plugged into the hook. The natural way to see a cgroup's configuration is to read from cgroupfs, so I think that reading from cgroupfs should show you that a BPF program is attached and also give enough information that, once bpf programs become dumpable, you can dump the program (using the bpf() syscall or whatever). Obviously the interface to *attach* a BPF program to a hook will need to be at least a little bit BPF-specific. But there's nothing particularly BPF-specific about detaching, and if a control file were to exist, writing "detach" or "none" to it seems natural. > > Whether we add bpf-specific code to the cgroup file parsers or > cgroup-specific code to the bpf layer does not make much of a semantic > difference, does it? As a matter of fact, my very first implementation > of this patch set implemented a cgroup controller that would allow > writing strings like "ing > > b) make it possible to extend the functionality in the future by adding > flags to the command struct etc. > > And I hoped we achieved that after discussing it for so long. > >> How about slowing down a wee bit and trying to come up with cgroup >> hook semantics that work for all of these use cases? > > I'm all for discussing things, but I don't this was done in a rush. > > I do agree though that adding functionality to cgroups that is not > limited to resource control is a delicate thing to do, which is why I > cc'ed cgroups@ in my patches. I should have also added linux-api@ I > guess, sorry I missed that. >ress 5" to its control file, where 5 is the fd > number that came out of BPF_PROG_LOAD. The main reason we decided to > ditch that was that echoing fd numbers into a text file seemed way worse > than going through a proper syscall layer with it, and ioctls are > unavailable on pseudo-fs. There isn't a big semantic difference between 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(..., CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file", O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'. There is, however, a semantic difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY) because the permission check is much weaker. The reason I suggest ioctl() and not write() is that write() MUST NOT make its behavior depend on the caller's credentials, file table, etc. Imagine what would happen if you did 'sudo -u eviltext >/cgroup/NAME/control.file'. (This particular mistake has been repeated many times in the kernel, in drivers, networking, namespaces, core code, etc, and it's resulted in a big pile of privilege escalation bugs.) So write("bpf:") is safe (but unusably awkward, I think), whereas write("bpf:fd 5") is unsafe. > > The idea was rather to allow attaching bpf programs to other things than > just cgroups as well, which is why we called the member of 'union > bpf_attr' 'target_fd', and a cgroup is just one type a target here. I would make that a separate operation. If someone adds the ability to attach an ebpf program to, say, seccomp (I'm quite sure this will happen eventually), it should be attached using seccomp(), not bpf(), for example). The people writing seccomp filters will thank you for making the syscall in question reflect what object (the cgroup, for example) is being modified. > >>> i'm assuming 'baadf00d' is bpf program fd expressed a text string? >>> and kernel needs to parse above? will you allow capital and lower >>> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not? >>> can program fd expressed as decimal or hex or both? >>> how do you return the error? as a text string for user space >>> to parse? >> >> No. The kernel does not parse it because you cannot write this to the >> file. You set a bpf filter with ioctl and pass an fd. > > An ioctl on what file, exactly? There are at least two plausible models. My preference would be to do an ioctl on a new /cgroup/NAME/network_hooks.inet_ingress file. Reading that file tell
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
Hi, On 12/20/2016 04:50 AM, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov > wrote: >> On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote: >>> I think we're still talking past each other. A big part of the point >>> of changing it is that none of this is specific to bpf. You could (in >> >> the hooks and context passed into the program is very much bpf specific. >> That's what I've been trying to convey all along. > > You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)? There is nothing bpf > specfic about the hook except that the name of this macro has "BPF" in > it. There is nothing whatsoever that's bpf-specific about the context > -- sk is not bpf-specific at all. > > The only thing bpf-specific about it is that it currently only invokes > bpf programs. That could easily change. I'm not sure if I follow. The code as it currently stands only supports attaching bpf programs to cgroups which have been created using BPF_PROG_LOAD. If cgroups would support other program types in the future, then they would need to be stored in different data types anyway, and the bpf syscall multiplexer would be the wrong entry point to access them anyway. Whether we add bpf-specific code to the cgroup file parsers or cgroup-specific code to the bpf layer does not make much of a semantic difference, does it? As a matter of fact, my very first implementation of this patch set implemented a cgroup controller that would allow writing strings like "ingress 5" to its control file, where 5 is the fd number that came out of BPF_PROG_LOAD. The main reason we decided to ditch that was that echoing fd numbers into a text file seemed way worse than going through a proper syscall layer with it, and ioctls are unavailable on pseudo-fs. The idea was rather to allow attaching bpf programs to other things than just cgroups as well, which is why we called the member of 'union bpf_attr' 'target_fd', and a cgroup is just one type a target here. >> i'm assuming 'baadf00d' is bpf program fd expressed a text string? >> and kernel needs to parse above? will you allow capital and lower >> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not? >> can program fd expressed as decimal or hex or both? >> how do you return the error? as a text string for user space >> to parse? > > No. The kernel does not parse it because you cannot write this to the > file. You set a bpf filter with ioctl and pass an fd. An ioctl on what file, exactly? > If you *read* > the file, you get the same bpf program hash that fdinfo on the bpf > object would show -- this is for debugging and (eventually) CRIU. We need a debugging facility at some point, I agree to that. As the code currently stands, that would rather need to go into the bpf(2) syscall though, as setting a program through bpf(2) and reading it through cgroupfs is really nasty. >> so you're proposing to add a bunch of hard coded logic to the kernel. >> First to parse such text into some sort of syntax tree or list/set >> and then have hard coded logic specifically for these two use cases? >> While above two can be implemented as trivial bpf programs already?! >> That goes 180% degree vs bpf philosophy. bpf is about moving >> the specific code out of the kernel and keeping kernel generic that >> it can solve as many use cases as possible by being programmable. > > I'm not seriously proposing implementing these. My point is that > *bpf*, while wonderful, is not the be-all-and-end-all of kernel > configurability, and other types of hooks might want to be hooked in > here. Sure, but nobody claimed it to be that be-all-and-end-all thing. It's just one thing that a cgroup is now able to accommodate, and because that new feature is specific to bpf, we decided to hook up the uapi to the bpf syscall. > So if I set up a cgroup that's monitored and call it /cgroup/a and > enable delegation and if the program running there wants to do its own > monitoring in /cgroup/a/b (via delegation), then you really want the > outer monitor to silently drop events coming from /cgroup/a/b? That's a fair point, and we've discussed it as well. The issue is, as Alexei already pointed out, that we do not want to traverse the tree up to the root for nested cgroups due to the runtime costs in the networking fast-path. After all, we're running the bpf program for each packet in flight. Hence, we opted for the approach to only look at the leaf node for now, with the ability to open it up further in the future using flags during attach etc. > The current approach to bpf hooks will bite you down the road. David > Ahern is already proposing using it for something that is not tracing > at all, and someone will want that in a container, and there will be a > problem. Hmm, I thought we've sorted out the concerns about that by making sure that we a) lock-down the API sufficiently so it doesn't cause any security issues in its current form, and b) make it possible to extend t
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 05:56:24PM -0800, Andy Lutomirski wrote: > >> Huh? My example in the original email attaches a program in a > >> sub-hierarchy. Are you saying that 4.11 could make that example stop > >> working? > > > > Are you suggesting sub-cgroups should not be allowed to override the filter > > of a parent cgroup? > > Yes, exactly. I think there are two sensible behaviors: > > a) sub-cgroups cannot have a filter at all of the parent has a filter. > (This is the "punt" approach -- it lets different semantics be > assigned later without breaking userspace.) > > b) sub-cgroups can have a filter if a parent does, too. The semantics > are that the sub-cgroup filter runs first and all side-effects occur. > If that filter says "reject" then ancestor filters are skipped. If > that filter says "accept", then the ancestor filter is run and its > side-effects happen as well. (And so on, all the way up to the root.) So from what I understand the proposed cgroup is not in fact hierarchical at all. @TJ, I thought you were enforcing all new cgroups to be properly hierarchical, that would very much include this one.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 09:27:18PM -0800, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov > wrote: > > On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote: > >> > >> struct cgroup_bpf { > >> /* > >> * Store two sets of bpf_prog pointers, one for programs that are > >> * pinned directly to this cgroup, and one for those that are > >> effective > >> * when this cgroup is accessed. > >> */ > >> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; > >> struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE]; > >> }; > >> > >> in struct cgroup, there's a 'struct cgroup_bpf bpf;'. > >> > >> This would change to something like: > >> > >> struct cgroup_filter_slot { > >> struct bpf_prog *effective; > >> struct cgroup_filter_slot *next; > >> struct bpf_prog *local; > >> } > >> > >> local is NULL unless *this* cgroup has a filter. effective points to > >> the bpf_prog that's active in this cgroup or the nearest ancestor that > >> has a filter. next is NULL if there are no filters higher in the > >> chain or points to the next slot that has a filter. struct cgroup > >> has: > >> > >> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE]; > >> > >> To evaluate it, you do: > >> > >> struct cgroup_filter_slot *slot = &cgroup->slot[the index]; > >> > >> if (!slot->effective) > >> return; > >> > >> do { > >> evaluate(slot->effective); > >> slot = slot->next; > >> } while (unlikely(slot)); > > > > yes. something like this can work as a future extension > > to support multiple programs for security use case. > > Please propose a patch. > > Again, it's not needed today and there is no rush to implement it. > > > > If this happens after 4.10 and 4.10 is released as is, then this > change would be an ABI break. it won't break existing apps. please study how bpf syscall was extended in the past without breaking anything. Same thing here. The default is one program per hook per cgroup. Everything else is future extensions.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov wrote: > On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote: >> >> struct cgroup_bpf { >> /* >> * Store two sets of bpf_prog pointers, one for programs that are >> * pinned directly to this cgroup, and one for those that are >> effective >> * when this cgroup is accessed. >> */ >> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; >> struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE]; >> }; >> >> in struct cgroup, there's a 'struct cgroup_bpf bpf;'. >> >> This would change to something like: >> >> struct cgroup_filter_slot { >> struct bpf_prog *effective; >> struct cgroup_filter_slot *next; >> struct bpf_prog *local; >> } >> >> local is NULL unless *this* cgroup has a filter. effective points to >> the bpf_prog that's active in this cgroup or the nearest ancestor that >> has a filter. next is NULL if there are no filters higher in the >> chain or points to the next slot that has a filter. struct cgroup >> has: >> >> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE]; >> >> To evaluate it, you do: >> >> struct cgroup_filter_slot *slot = &cgroup->slot[the index]; >> >> if (!slot->effective) >> return; >> >> do { >> evaluate(slot->effective); >> slot = slot->next; >> } while (unlikely(slot)); > > yes. something like this can work as a future extension > to support multiple programs for security use case. > Please propose a patch. > Again, it's not needed today and there is no rush to implement it. > If this happens after 4.10 and 4.10 is released as is, then this change would be an ABI break. --Andy
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 8:51 PM, Alexei Starovoitov wrote: > On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote: >> >> By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't >> even take a reference to a BPF program as an argument. What is it >> supposed to do if this mechanism ever gets extended? > > we just add another field to that anonymous union just like > we did for other commands and everything is backwards compatible. > It's the basics of bpf syscall that we've been relying on for some > time now and it worked just fine. And what happens if you don't specify that member and two programs are attached? --Andy
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote: > > By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't > even take a reference to a BPF program as an argument. What is it > supposed to do if this mechanism ever gets extended? we just add another field to that anonymous union just like we did for other commands and everything is backwards compatible. It's the basics of bpf syscall that we've been relying on for some time now and it worked just fine.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 07:50:01PM -0800, Andy Lutomirski wrote: > >> > >> net.socket_create_filter = "none": no filter > >> net.socket_create_filter = "bpf:baadf00d": bpf filter > > > > i'm assuming 'baadf00d' is bpf program fd expressed a text string? > > and kernel needs to parse above? will you allow capital and lower > > case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not? > > can program fd expressed as decimal or hex or both? > > how do you return the error? as a text string for user space > > to parse? > > No. The kernel does not parse it because you cannot write this to the > file. You set a bpf filter with ioctl and pass an fd. If you *read* > the file, you get the same bpf program hash that fdinfo on the bpf > object would show -- this is for debugging and (eventually) CRIU. my understanding that cgroup is based on kernfs and both don't support ioctl, so you'd need quite some hacking to introduce such concepts and buy-in from a bunch of people first. > >> net.socket_create_filter = "iptables:foobar": some iptables thingy > >> net.socket_create_filter = "nft:blahblahblah": some nft thingy > > > > iptables/nft are not applicable to 'bpf_socket_create' which > > looks into: > > struct bpf_sock { > > __u32 bound_dev_if; > > __u32 family; > > __u32 type; > > __u32 protocol; > > }; > > so don't fit as an example. > > The code that takes a 'struct sock' and sets up bpf_sock is > bpf-specific and would obviously not be used for non-bpf filter. But > if you had a filter that was just a like of address families, that > filter would look at struct sock and do its own thing. iptables > wouldn't make sense for a socket creation filter, but it would make > perfect sense for an ingress filter. Obviously the bpf-specific state > object wouldn't be used, but it would still be a hook, invoked from > the same network code, guarded by the same static key, looking at the > same skb. I strongly suggest to go back and read my first reply where I think I explained well enough that something like iptables will not able to reuse the ioctl mechanism you're proposing here, hook ids will be different, attachment mechanism will be different too. So your proposed cgroup ioctl is already dead as a reusable interface. > >> net.socket_create_filter = "disallow": no sockets created period > >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and > >> 3 > > > > so you're proposing to add a bunch of hard coded logic to the kernel. > > First to parse such text into some sort of syntax tree or list/set > > and then have hard coded logic specifically for these two use cases? > > While above two can be implemented as trivial bpf programs already?! > > That goes 180% degree vs bpf philosophy. bpf is about moving > > the specific code out of the kernel and keeping kernel generic that > > it can solve as many use cases as possible by being programmable. > > I'm not seriously proposing implementing these. My point is that > *bpf*, while wonderful, is not the be-all-and-end-all of kernel > configurability, and other types of hooks might want to be hooked in > here. Then please let's talk about real use cases. This daydreaming of some future llvm in the kernel that you were bringing up during LPC doesn't help the discussion. Just like these artificial examples. > > ... > >> What exactly isn't sensible about using cgroup_bpf for containers? > > > > my use case is close to zero overhead application network monitoring. > > So if I set up a cgroup that's monitored and call it /cgroup/a and > enable delegation and if the program running there wants to do its own > monitoring in /cgroup/a/b (via delegation), then you really want the > outer monitor to silently drop events coming from /cgroup/a/b? yes. both are root and must talk to each other if they want to co-exist. When root process is asking kernel to do X, this X has to happen. > Then disallow nesting. You're welcome to not rush the decision, but > that's not what you've done. If 4.10 is released as is, you've made > the decision and you're going to have a hard time changing it. Nothing needs to be changed. > > No. As was pointed out before only root can load > > BPF_PROG_TYPE_CGROUP_[SKB|SOCK] > > type programs and only root can attach them. > > Why? It really seems to me that you expect that future namespaceable > bpf hooks will use a totally different API. At KS, I sat in a room > full of people arguing about cgroup v2 and a lot of them pointed out > that there are valid, paying use cases that want to stick cgroup v1 in > a container, because the code that uses cgroup v1 in the container is > called systemd and the contained OS is called RHEL (or SuSE or Ubuntu > or Gentoo or whatever), and that code is *already written* and needs > to be contained. bpf in general is not namespace aware. It's global and this cgroup scoping of bpf programs is the first of this kind. Namespacing of bpf is comp
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote: > > struct cgroup_bpf { > /* > * Store two sets of bpf_prog pointers, one for programs that are > * pinned directly to this cgroup, and one for those that are > effective > * when this cgroup is accessed. > */ > struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; > struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE]; > }; > > in struct cgroup, there's a 'struct cgroup_bpf bpf;'. > > This would change to something like: > > struct cgroup_filter_slot { > struct bpf_prog *effective; > struct cgroup_filter_slot *next; > struct bpf_prog *local; > } > > local is NULL unless *this* cgroup has a filter. effective points to > the bpf_prog that's active in this cgroup or the nearest ancestor that > has a filter. next is NULL if there are no filters higher in the > chain or points to the next slot that has a filter. struct cgroup > has: > > struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE]; > > To evaluate it, you do: > > struct cgroup_filter_slot *slot = &cgroup->slot[the index]; > > if (!slot->effective) > return; > > do { > evaluate(slot->effective); > slot = slot->next; > } while (unlikely(slot)); yes. something like this can work as a future extension to support multiple programs for security use case. Please propose a patch. Again, it's not needed today and there is no rush to implement it.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov wrote: > On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote: >> I think we're still talking past each other. A big part of the point >> of changing it is that none of this is specific to bpf. You could (in > > the hooks and context passed into the program is very much bpf specific. > That's what I've been trying to convey all along. You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)? There is nothing bpf specfic about the hook except that the name of this macro has "BPF" in it. There is nothing whatsoever that's bpf-specific about the context -- sk is not bpf-specific at all. The only thing bpf-specific about it is that it currently only invokes bpf programs. That could easily change. > >> theory -- I'm not proposing implementing these until there's demand) >> have: >> >> net.socket_create_filter = "none": no filter >> net.socket_create_filter = "bpf:baadf00d": bpf filter > > i'm assuming 'baadf00d' is bpf program fd expressed a text string? > and kernel needs to parse above? will you allow capital and lower > case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not? > can program fd expressed as decimal or hex or both? > how do you return the error? as a text string for user space > to parse? No. The kernel does not parse it because you cannot write this to the file. You set a bpf filter with ioctl and pass an fd. If you *read* the file, you get the same bpf program hash that fdinfo on the bpf object would show -- this is for debugging and (eventually) CRIU. > >> net.socket_create_filter = "iptables:foobar": some iptables thingy >> net.socket_create_filter = "nft:blahblahblah": some nft thingy > > iptables/nft are not applicable to 'bpf_socket_create' which > looks into: > struct bpf_sock { > __u32 bound_dev_if; > __u32 family; > __u32 type; > __u32 protocol; > }; > so don't fit as an example. The code that takes a 'struct sock' and sets up bpf_sock is bpf-specific and would obviously not be used for non-bpf filter. But if you had a filter that was just a like of address families, that filter would look at struct sock and do its own thing. iptables wouldn't make sense for a socket creation filter, but it would make perfect sense for an ingress filter. Obviously the bpf-specific state object wouldn't be used, but it would still be a hook, invoked from the same network code, guarded by the same static key, looking at the same skb. > >> net.socket_create_filter = "disallow": no sockets created period >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 > > so you're proposing to add a bunch of hard coded logic to the kernel. > First to parse such text into some sort of syntax tree or list/set > and then have hard coded logic specifically for these two use cases? > While above two can be implemented as trivial bpf programs already?! > That goes 180% degree vs bpf philosophy. bpf is about moving > the specific code out of the kernel and keeping kernel generic that > it can solve as many use cases as possible by being programmable. I'm not seriously proposing implementing these. My point is that *bpf*, while wonderful, is not the be-all-and-end-all of kernel configurability, and other types of hooks might want to be hooked in here. > ... >> What exactly isn't sensible about using cgroup_bpf for containers? > > my use case is close to zero overhead application network monitoring. So if I set up a cgroup that's monitored and call it /cgroup/a and enable delegation and if the program running there wants to do its own monitoring in /cgroup/a/b (via delegation), then you really want the outer monitor to silently drop events coming from /cgroup/a/b? > >> >> > As you're pointing out, in case of security, we probably >> >> > want to preserve original bpf program that should always be >> >> > run first and only after it returned 'ok' (we'd need to define >> >> > what 'ok' means in secruity context) run program attached to >> >> > sub-hierarchy. >> >> >> >> It's already defined AFAICT. 1 means okay. 0 means not okay. >> > >> > sorry that doesn't make any sense. For seccomp we have a set of >> > ranges that mean different things. Here you're proposing to >> > hastily assign 1 and 0 ? How is that extensible? >> > We need to carefully think through what should be the semantics >> > of attaching multiple programs, consider performance implications, >> > return codes and so on. >> >> You already assigned it. The return value of the bpf program, loaded >> in Linus' tree today, tells the kernel whether to accept or reject. > > yes. that's what the program tells the hook. > I'm saying that whenever we have a link list of the programs > interaction between them may or may not be expressible with 1/0 > and I don't want to rush such decision. Then disallow nesting. You're welcome to not rush the decision, but that's not what you've done. If 4.10 is released as is, you've made
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov > wrote: > > On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote: > >> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov > >> wrote: > >> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: > >> >> Hi all- > >> >> > >> >> I apologize for being rather late with this. I didn't realize that > >> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > >> >> it on the linux-api list, so I missed the discussion. > >> >> > >> >> I think that the inet ingress, egress etc filters are a neat feature, > >> >> but I think the API has some issues that will bite us down the road > >> >> if it becomes stable in its current form. > >> >> > >> >> Most of the problems I see are summarized in this transcript: > >> >> > >> >> # mkdir cg2 > >> >> # mount -t cgroup2 none cg2 > >> >> # mkdir cg2/nosockets > >> >> # strace cgrp_socket_rule cg2/nosockets/ 0 > >> >> ... > >> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > >> >> > >> >> You can modify a cgroup after opening it O_RDONLY? > >> >> > >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > >> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > >> >> log_buf=0x6020c0, kern_version=0}, 48) = 4 > >> >> > >> >> This is fine. The bpf() syscall manipulates bpf objects. > >> >> > >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > >> >> > >> >> This is not so good: > >> >> > >> >> a) The bpf() syscall is supposed to manipulate bpf objects. This > >> >> is manipulating a cgroup. There's no reason that a socket > >> >> creation > >> >> filter couldn't be written in a different language (new iptables > >> >> table? Simple list of address families?), but if that happened, > >> >> then using bpf() to install it would be entirely nonsensical. > >> > > >> > I don't see why it's _modifing_ the cgroup. I'm looking at it as > >> > network stack is using cgroup as an application group that should > >> > invoke bpf program at the certain point in the stack. > >> > imo cgroup management is orthogonal. > >> > >> It is literally modifying the struct cgroup, and, as a practical > >> matter, it's causing membership in the cgroup to have a certain > >> effect. But rather than pointless arguing, let me propose an > >> alternative API that I think solves most of the problems here. > >> > >> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. > >> Instead, the cgroup gets three new control files: > >> "net.ingress_filter", "net.egress_filter", and > >> "net.socket_create_filter". Initially, if you read these files, you > >> see "none\n". > >> > >> To attach a bpf filter, you open the file for write and do an ioctl on > >> it. After doing the ioctl, if you read the file, you'll see > >> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for > >> the bpf program. > >> > >> To detach any type of filter, bpf or otherwise, you open the file for > >> write and write "none\n" (or just "none"). > >> > >> If you write anything else to the file, you get -EINVAL. But, if > >> someone writes a new type of filter (perhaps a simple list of address > >> families), maybe you can enable the filter by writing something > >> appropriate to the file. > > > > I see no difference in what you're proposing vs what is already implemented > > from feature set point of view, but the file approach is very ugly, since > > it's a mismatch to FD style access that bpf is using everywhere. > > In your proposal you'd also need to add bpf prefix everywhere. > > So the control file names should be bpf_inet_ingress, bpf_inet_egress > > and bpf_socket_create. > > I think we're still talking past each other. A big part of the point > of changing it is that none of this is specific to bpf. You could (in the hooks and context passed into the program is very much bpf specific. That's what I've been trying to convey all along. > theory -- I'm not proposing implementing these until there's demand) > have: > > net.socket_create_filter = "none": no filter > net.socket_create_filter = "bpf:baadf00d": bpf filter i'm assuming 'baadf00d' is bpf program fd expressed a text string? and kernel needs to parse above? will you allow capital and lower case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not? can program fd expressed as decimal or hex or both? how do you return the error? as a text string for user space to parse? > net.socket_create_filter = "iptables:foobar": some iptables thingy > net.socket_create_filter = "nft:blahblahblah": some nft thingy iptables/nft are not applicable to 'bpf_socket_create' which looks into: struct bpf_sock { __u32 bound_dev_if; __u32 family; __u32 type; __u32 protocol; }; so don't fit as an example. > net.socket_create_filter = "disallow": no so
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 6:52 PM, David Ahern wrote: > On 12/19/16 6:56 PM, Andy Lutomirski wrote: >> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern wrote: >>> On 12/19/16 5:25 PM, Andy Lutomirski wrote: net.socket_create_filter = "none": no filter net.socket_create_filter = "bpf:baadf00d": bpf filter net.socket_create_filter = "disallow": no sockets created period net.socket_create_filter = "iptables:foobar": some iptables thingy net.socket_create_filter = "nft:blahblahblah": some nft thingy net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 >>> >>> Such a scheme works for the socket create filter b/c it is a very simple >>> use case. It does not work for the ingress and egress which allow generic >>> bpf filters. >> >> Can you elaborate on what goes wrong? (Obviously the >> "address_family_list" example makes no sense in that context.) > > Being able to dump a filter or see that one exists would be a great add-on, > but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable > API for loading generic BPF filters. Simple cases like "disallow" are easy -- > just return 0 in the filter, no complicated BPF code needed. The rest are > specific cases of the moment which goes against the intent of ebpf and > generic programmability. Oh -- I'm not proposing that at all. Let me clarify. For the bpf case, if you *read* the file, you'd see "bpf:baadf00d". But writing "bpf:baadf00d" is nonsense and would give you -EINVAL. Instead you install a bpf filter by opening the file for write (O_RDWR or O_WRONLY) and doing ioctl(cgroup_socket_create_file_fd, CGROUP_IOCTL_SET_BPF_FILTER, bpf_fd); It's kind of like BPF_PROG_ATTACH except that it respects filesystem permissions, it isn't in the bpf() multiplexer, the filter being set is implied (by the fd in use), and everything is nicely seccompable. To remove the filter, you write "none" or "none\n" to the file. As a future extension, if someone wanted more than one filter to be able to coexist in the cgroup socket_create_filter slot, you could plausibly do 'echo disallow >>net.socket_create_filter' or use a new ioctl CGROUP_IOCTL_APPEND_BPF_FILTER, and then you'd read the file and see more than one line. But this would be a future extension and may never be needed. >> >> a) sub-cgroups cannot have a filter at all of the parent has a filter. >> (This is the "punt" approach -- it lets different semantics be >> assigned later without breaking userspace.) >> >> b) sub-cgroups can have a filter if a parent does, too. The semantics >> are that the sub-cgroup filter runs first and all side-effects occur. >> If that filter says "reject" then ancestor filters are skipped. If >> that filter says "accept", then the ancestor filter is run and its >> side-effects happen as well. (And so on, all the way up to the root.) > > That comes with a big performance hit for skb / data path cases. > > I'm riding my use case on Daniel's work, and as I understand it the nesting > case has been discussed. I'll defer to Daniel and Alexei on this part. I'm not sure I buy the performance hit. If you do it naively, then performance will indeed suck. But there's already a bunch of code in there to pre-populate a filter list for faster use. Currently, we have: struct cgroup_bpf { /* * Store two sets of bpf_prog pointers, one for programs that are * pinned directly to this cgroup, and one for those that are effective * when this cgroup is accessed. */ struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE]; }; in struct cgroup, there's a 'struct cgroup_bpf bpf;'. This would change to something like: struct cgroup_filter_slot { struct bpf_prog *effective; struct cgroup_filter_slot *next; struct bpf_prog *local; } local is NULL unless *this* cgroup has a filter. effective points to the bpf_prog that's active in this cgroup or the nearest ancestor that has a filter. next is NULL if there are no filters higher in the chain or points to the next slot that has a filter. struct cgroup has: struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE]; To evaluate it, you do: struct cgroup_filter_slot *slot = &cgroup->slot[the index]; if (!slot->effective) return; do { evaluate(slot->effective); slot = slot->next; } while (unlikely(slot)); The old code was one branch per evaluation. The new code is n+1 branches where n is the number of filters, so it's one extra branch in the worst case. But the extra branch is cache-hot (the variable is right next to slot->effective, which is needed regardless) and highly predictable (in the case where nesting isn't used, the branch is not taken, and it's a backward branch which most CPUs will predict as not taken). I expect that, on x86, this adds at most a cycle or two and quite possibly adds no measurable overhead at all.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On 12/19/16 6:56 PM, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 5:44 PM, David Ahern wrote: >> On 12/19/16 5:25 PM, Andy Lutomirski wrote: >>> net.socket_create_filter = "none": no filter >>> net.socket_create_filter = "bpf:baadf00d": bpf filter >>> net.socket_create_filter = "disallow": no sockets created period >>> net.socket_create_filter = "iptables:foobar": some iptables thingy >>> net.socket_create_filter = "nft:blahblahblah": some nft thingy >>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 >> >> Such a scheme works for the socket create filter b/c it is a very simple use >> case. It does not work for the ingress and egress which allow generic bpf >> filters. > > Can you elaborate on what goes wrong? (Obviously the > "address_family_list" example makes no sense in that context.) Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability. >> >> ... >> you're ignoring use cases I described earlier. In vrf case there is only one ifindex it needs to bind to. >>> >>> I'm totally lost. Can you explain what this has to do with the cgroup >>> hierarchy? >> >> I think the point is that a group hierarchy makes no sense for the VRF use >> case. What I put into iproute2 is >> >> cgrp2/vrf/NAME >> >> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 >> sockets to a specific device index. cgrp2/vrf is the "default" vrf and does >> not have a filter. A user can certainly add another layer >> cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not >> make sense. > > I tend to agree. I still think that the mechanism as it stands is > broken in other respects and should be fixed before it goes live. I > have no desire to cause problems for the vrf use case. > > But keep in mind that the vrf use case is, in Linus' tree, a bit > broken right now in its interactions with other users of the same > mechanism. Suppose I create a container and want to trace all of its > created sockets. I'll set up cgrp2/container and load my tracer as a > socket creation hook. Then a container sets up > cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding > filter. Now the tracing stops working -- oops. There are other ways to achieve socket tracing, but I get your point -- nested cases do not work as users may want. > I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You > have to do it now (or disable the feature for 4.10). This is why I'm > bringing this whole thing up now. We don't have to touch user visible api here, so extensions are fine. >>> >>> Huh? My example in the original email attaches a program in a >>> sub-hierarchy. Are you saying that 4.11 could make that example stop >>> working? >> >> Are you suggesting sub-cgroups should not be allowed to override the filter >> of a parent cgroup? > > Yes, exactly. I think there are two sensible behaviors: > > a) sub-cgroups cannot have a filter at all of the parent has a filter. > (This is the "punt" approach -- it lets different semantics be > assigned later without breaking userspace.) > > b) sub-cgroups can have a filter if a parent does, too. The semantics > are that the sub-cgroup filter runs first and all side-effects occur. > If that filter says "reject" then ancestor filters are skipped. If > that filter says "accept", then the ancestor filter is run and its > side-effects happen as well. (And so on, all the way up to the root.) That comes with a big performance hit for skb / data path cases. I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 5:44 PM, David Ahern wrote: > On 12/19/16 5:25 PM, Andy Lutomirski wrote: >> net.socket_create_filter = "none": no filter >> net.socket_create_filter = "bpf:baadf00d": bpf filter >> net.socket_create_filter = "disallow": no sockets created period >> net.socket_create_filter = "iptables:foobar": some iptables thingy >> net.socket_create_filter = "nft:blahblahblah": some nft thingy >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 > > Such a scheme works for the socket create filter b/c it is a very simple use > case. It does not work for the ingress and egress which allow generic bpf > filters. Can you elaborate on what goes wrong? (Obviously the "address_family_list" example makes no sense in that context.) > > ... > >>> you're ignoring use cases I described earlier. >>> In vrf case there is only one ifindex it needs to bind to. >> >> I'm totally lost. Can you explain what this has to do with the cgroup >> hierarchy? > > I think the point is that a group hierarchy makes no sense for the VRF use > case. What I put into iproute2 is > > cgrp2/vrf/NAME > > where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 > sockets to a specific device index. cgrp2/vrf is the "default" vrf and does > not have a filter. A user can certainly add another layer > cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not > make sense. I tend to agree. I still think that the mechanism as it stands is broken in other respects and should be fixed before it goes live. I have no desire to cause problems for the vrf use case. But keep in mind that the vrf use case is, in Linus' tree, a bit broken right now in its interactions with other users of the same mechanism. Suppose I create a container and want to trace all of its created sockets. I'll set up cgrp2/container and load my tracer as a socket creation hook. Then a container sets up cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding filter. Now the tracing stops working -- oops. > > ... > I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You have to do it now (or disable the feature for 4.10). This is why I'm bringing this whole thing up now. >>> >>> We don't have to touch user visible api here, so extensions are fine. >> >> Huh? My example in the original email attaches a program in a >> sub-hierarchy. Are you saying that 4.11 could make that example stop >> working? > > Are you suggesting sub-cgroups should not be allowed to override the filter > of a parent cgroup? Yes, exactly. I think there are two sensible behaviors: a) sub-cgroups cannot have a filter at all of the parent has a filter. (This is the "punt" approach -- it lets different semantics be assigned later without breaking userspace.) b) sub-cgroups can have a filter if a parent does, too. The semantics are that the sub-cgroup filter runs first and all side-effects occur. If that filter says "reject" then ancestor filters are skipped. If that filter says "accept", then the ancestor filter is run and its side-effects happen as well. (And so on, all the way up to the root.) --Andy
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 4:25 PM, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov > wrote: >> you're ignoring use cases I described earlier. >> In vrf case there is only one ifindex it needs to bind to. > > I'm totally lost. Can you explain what this has to do with the cgroup > hierarchy? > Okay, I figured out what you mean, I think. You have a handful of vrf devices. Let's say they have ifindexes 1 and 2 (and maybe more). The interesting case happens when you set up /cgroup/a with a bpf program that binds new sockets to ifindex 1 and /cgroup/a/b with a bpf program that binds new sockets to ifindex 2. The question is: what should happen if you're in /cgroup/a/b? Presumably, if you do this, you wanted to end up with ifindex 2. I think the way it should actually work is that the kernel evaluates /cgroup/a/b's hook and then /cgroup/a's hook. Then /cgroup/a (which is the more privileged hook) gets to make the choice. If it wants ifindex 2 to win, it can do (pseudocode): if (!sk->sk_bound_if) sk->sk_bound_if = 1;
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On 12/19/16 5:25 PM, Andy Lutomirski wrote: > net.socket_create_filter = "none": no filter > net.socket_create_filter = "bpf:baadf00d": bpf filter > net.socket_create_filter = "disallow": no sockets created period > net.socket_create_filter = "iptables:foobar": some iptables thingy > net.socket_create_filter = "nft:blahblahblah": some nft thingy > net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters. ... >> you're ignoring use cases I described earlier. >> In vrf case there is only one ifindex it needs to bind to. > > I'm totally lost. Can you explain what this has to do with the cgroup > hierarchy? I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is cgrp2/vrf/NAME where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense. ... >>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You >>> have to do it now (or disable the feature for 4.10). This is why I'm >>> bringing this whole thing up now. >> >> We don't have to touch user visible api here, so extensions are fine. > > Huh? My example in the original email attaches a program in a > sub-hierarchy. Are you saying that 4.11 could make that example stop > working? Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 5:34 PM, David Miller wrote: > From: Alexei Starovoitov > Date: Mon, 19 Dec 2016 16:02:56 -0800 > >> huh? 'not right api' because it's using bpf syscall instead >> of cgroup control-file? I think the opposite is the truth. > > I completely agree with Alexei on this. So what happens when someone adds another type of filter? Let's say there's a simple, no-privilege-required list of allowed address families that can hook up to the socket creation hook for a cgroup. Does BPF_PROG_DETACH still detach it? Or would both the bpf *and* the list of allowed address families be in force? If the latter, why wouldn't two BPF programs on the same hook be allowed? Concretely: # mkdir /cgroup/a # set_up_bpf_socket_rule /cgroup/a # set_up_list_of_address_families /cgroup/a # cat /cgroup/a/some_new_file [what gets displayed?] # BPF_PROG_DETACH: what happens By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't even take a reference to a BPF program as an argument. What is it supposed to do if this mechanism ever gets extended? --Andy
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov Date: Mon, 19 Dec 2016 16:02:56 -0800 > huh? 'not right api' because it's using bpf syscall instead > of cgroup control-file? I think the opposite is the truth. I completely agree with Alexei on this.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov wrote: > On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote: >> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov >> wrote: >> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: >> >> Hi all- >> >> >> >> I apologize for being rather late with this. I didn't realize that >> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see >> >> it on the linux-api list, so I missed the discussion. >> >> >> >> I think that the inet ingress, egress etc filters are a neat feature, >> >> but I think the API has some issues that will bite us down the road >> >> if it becomes stable in its current form. >> >> >> >> Most of the problems I see are summarized in this transcript: >> >> >> >> # mkdir cg2 >> >> # mount -t cgroup2 none cg2 >> >> # mkdir cg2/nosockets >> >> # strace cgrp_socket_rule cg2/nosockets/ 0 >> >> ... >> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 >> >> >> >> You can modify a cgroup after opening it O_RDONLY? >> >> >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, >> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, >> >> log_buf=0x6020c0, kern_version=0}, 48) = 4 >> >> >> >> This is fine. The bpf() syscall manipulates bpf objects. >> >> >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 >> >> >> >> This is not so good: >> >> >> >> a) The bpf() syscall is supposed to manipulate bpf objects. This >> >> is manipulating a cgroup. There's no reason that a socket >> >> creation >> >> filter couldn't be written in a different language (new iptables >> >> table? Simple list of address families?), but if that happened, >> >> then using bpf() to install it would be entirely nonsensical. >> > >> > I don't see why it's _modifing_ the cgroup. I'm looking at it as >> > network stack is using cgroup as an application group that should >> > invoke bpf program at the certain point in the stack. >> > imo cgroup management is orthogonal. >> >> It is literally modifying the struct cgroup, and, as a practical >> matter, it's causing membership in the cgroup to have a certain >> effect. But rather than pointless arguing, let me propose an >> alternative API that I think solves most of the problems here. >> >> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. >> Instead, the cgroup gets three new control files: >> "net.ingress_filter", "net.egress_filter", and >> "net.socket_create_filter". Initially, if you read these files, you >> see "none\n". >> >> To attach a bpf filter, you open the file for write and do an ioctl on >> it. After doing the ioctl, if you read the file, you'll see >> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for >> the bpf program. >> >> To detach any type of filter, bpf or otherwise, you open the file for >> write and write "none\n" (or just "none"). >> >> If you write anything else to the file, you get -EINVAL. But, if >> someone writes a new type of filter (perhaps a simple list of address >> families), maybe you can enable the filter by writing something >> appropriate to the file. > > I see no difference in what you're proposing vs what is already implemented > from feature set point of view, but the file approach is very ugly, since > it's a mismatch to FD style access that bpf is using everywhere. > In your proposal you'd also need to add bpf prefix everywhere. > So the control file names should be bpf_inet_ingress, bpf_inet_egress > and bpf_socket_create. I think we're still talking past each other. A big part of the point of changing it is that none of this is specific to bpf. You could (in theory -- I'm not proposing implementing these until there's demand) have: net.socket_create_filter = "none": no filter net.socket_create_filter = "bpf:baadf00d": bpf filter net.socket_create_filter = "disallow": no sockets created period net.socket_create_filter = "iptables:foobar": some iptables thingy net.socket_create_filter = "nft:blahblahblah": some nft thingy net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 See? This API is not bpf-specific. It's an API for filtering. The fact that struct cgroup currently contains a member called "bpf" is purely an artifact of the fact that it currently only supports bpf. Someone will want to rename it to "filters" some day, and BPF_PROG_DETACH makes no sense whatsoever as a generic API to detach a filter. > If you want to prepare such patch for them, I don't mind, > but you cannot kill syscall command, since it's more flexible > and your control-file approach _will_ be obsolete pretty quickly. BPF_PROG_ATTACH and BPF_PROC_DETACH should be removed regardless IMO. If you really really want a syscall, make it a new syscall. > >> Now the API matches the effect. A cgroup with something other than >> "none" in one of its net.*_filter files is a cgroup that filters >> networ
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov > wrote: > > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: > >> Hi all- > >> > >> I apologize for being rather late with this. I didn't realize that > >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > >> it on the linux-api list, so I missed the discussion. > >> > >> I think that the inet ingress, egress etc filters are a neat feature, > >> but I think the API has some issues that will bite us down the road > >> if it becomes stable in its current form. > >> > >> Most of the problems I see are summarized in this transcript: > >> > >> # mkdir cg2 > >> # mount -t cgroup2 none cg2 > >> # mkdir cg2/nosockets > >> # strace cgrp_socket_rule cg2/nosockets/ 0 > >> ... > >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > >> > >> You can modify a cgroup after opening it O_RDONLY? > >> > >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > >> log_buf=0x6020c0, kern_version=0}, 48) = 4 > >> > >> This is fine. The bpf() syscall manipulates bpf objects. > >> > >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > >> > >> This is not so good: > >> > >> a) The bpf() syscall is supposed to manipulate bpf objects. This > >> is manipulating a cgroup. There's no reason that a socket creation > >> filter couldn't be written in a different language (new iptables > >> table? Simple list of address families?), but if that happened, > >> then using bpf() to install it would be entirely nonsensical. > > > > I don't see why it's _modifing_ the cgroup. I'm looking at it as > > network stack is using cgroup as an application group that should > > invoke bpf program at the certain point in the stack. > > imo cgroup management is orthogonal. > > It is literally modifying the struct cgroup, and, as a practical > matter, it's causing membership in the cgroup to have a certain > effect. But rather than pointless arguing, let me propose an > alternative API that I think solves most of the problems here. > > In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. > Instead, the cgroup gets three new control files: > "net.ingress_filter", "net.egress_filter", and > "net.socket_create_filter". Initially, if you read these files, you > see "none\n". > > To attach a bpf filter, you open the file for write and do an ioctl on > it. After doing the ioctl, if you read the file, you'll see > "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for > the bpf program. > > To detach any type of filter, bpf or otherwise, you open the file for > write and write "none\n" (or just "none"). > > If you write anything else to the file, you get -EINVAL. But, if > someone writes a new type of filter (perhaps a simple list of address > families), maybe you can enable the filter by writing something > appropriate to the file. I see no difference in what you're proposing vs what is already implemented from feature set point of view, but the file approach is very ugly, since it's a mismatch to FD style access that bpf is using everywhere. In your proposal you'd also need to add bpf prefix everywhere. So the control file names should be bpf_inet_ingress, bpf_inet_egress and bpf_socket_create. If you want to prepare such patch for them, I don't mind, but you cannot kill syscall command, since it's more flexible and your control-file approach _will_ be obsolete pretty quickly. > Now the API matches the effect. A cgroup with something other than > "none" in one of its net.*_filter files is a cgroup that filters > network activity. And you get CRIU compatibility for free: CRIU can > read the control file and use whatever mechanism is uses for BPF in > general to restore the cgroup filter state. As an added bonus, you > get ACLs for free and the ugly multiplexer goes away. extended bpf is not supported by criu. only classic, so having control_file-style attachment doesn't buy us anything. > >> # mkdir cg2/nosockets/sockets > >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule > >> cg2/nosockets/sockets/ 1 > >> > >> This succeeded, which means that, if this feature is enabled in 4.10, > >> then we're stuck with its semantics. If it returned -EINVAL instead, > >> there would be a chance to refine it. > > > > i don't see the problem with this behavior. bpf and cgroup are indepedent. > > Imange that socket accounting program is attached to cg2/nosockets. > > The program is readonly and carry no security meaning. > > Why cgroup should prevent creation of cg2/nosockets/foo directory ? > > I think you're misunderstanding me. What I'm saying is that, if you > allow a cgroup and one of its descendents to both enable the same type > of filter, you have just committed to some particular semantics for
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov wrote: > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: >> Hi all- >> >> I apologize for being rather late with this. I didn't realize that >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see >> it on the linux-api list, so I missed the discussion. >> >> I think that the inet ingress, egress etc filters are a neat feature, >> but I think the API has some issues that will bite us down the road >> if it becomes stable in its current form. >> >> Most of the problems I see are summarized in this transcript: >> >> # mkdir cg2 >> # mount -t cgroup2 none cg2 >> # mkdir cg2/nosockets >> # strace cgrp_socket_rule cg2/nosockets/ 0 >> ... >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 >> >> You can modify a cgroup after opening it O_RDONLY? >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, >> log_buf=0x6020c0, kern_version=0}, 48) = 4 >> >> This is fine. The bpf() syscall manipulates bpf objects. >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 >> >> This is not so good: >> >> a) The bpf() syscall is supposed to manipulate bpf objects. This >> is manipulating a cgroup. There's no reason that a socket creation >> filter couldn't be written in a different language (new iptables >> table? Simple list of address families?), but if that happened, >> then using bpf() to install it would be entirely nonsensical. > > I don't see why it's _modifing_ the cgroup. I'm looking at it as > network stack is using cgroup as an application group that should > invoke bpf program at the certain point in the stack. > imo cgroup management is orthogonal. It is literally modifying the struct cgroup, and, as a practical matter, it's causing membership in the cgroup to have a certain effect. But rather than pointless arguing, let me propose an alternative API that I think solves most of the problems here. In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. Instead, the cgroup gets three new control files: "net.ingress_filter", "net.egress_filter", and "net.socket_create_filter". Initially, if you read these files, you see "none\n". To attach a bpf filter, you open the file for write and do an ioctl on it. After doing the ioctl, if you read the file, you'll see "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for the bpf program. To detach any type of filter, bpf or otherwise, you open the file for write and write "none\n" (or just "none"). If you write anything else to the file, you get -EINVAL. But, if someone writes a new type of filter (perhaps a simple list of address families), maybe you can enable the filter by writing something appropriate to the file. Now the API matches the effect. A cgroup with something other than "none" in one of its net.*_filter files is a cgroup that filters network activity. And you get CRIU compatibility for free: CRIU can read the control file and use whatever mechanism is uses for BPF in general to restore the cgroup filter state. As an added bonus, you get ACLs for free and the ugly multiplexer goes away. >> # mkdir cg2/nosockets/sockets >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 >> >> This succeeded, which means that, if this feature is enabled in 4.10, >> then we're stuck with its semantics. If it returned -EINVAL instead, >> there would be a chance to refine it. > > i don't see the problem with this behavior. bpf and cgroup are indepedent. > Imange that socket accounting program is attached to cg2/nosockets. > The program is readonly and carry no security meaning. > Why cgroup should prevent creation of cg2/nosockets/foo directory ? I think you're misunderstanding me. What I'm saying is that, if you allow a cgroup and one of its descendents to both enable the same type of filter, you have just committed to some particular semantics for what happens. And I think that the current semantics are the *wrong* semantics for default long-term use, so you should either fix the semantics or disable the problematic case. > >> # echo $$ >cg2/nosockets/sockets/cgroup.procs >> # ping 127.0.0.1 >> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. >> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms > > hmm. in this case the admin created a subgroup with a group that allows > ping, so? how's that a problem? I think you're forgetting about namespaces. There are two different admins here. There's the admin who created the outer container and said "no sockets here". Then there's the admin inside the container who said "muahaha, I can create a sub-container and allow sockets *there*". > In case of vrf I can imagine > a set of application auto-binding to one vrf and smaller > set of applications binding to a different vrf. > Or broader set of applicati
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: > Hi all- > > I apologize for being rather late with this. I didn't realize that > cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > it on the linux-api list, so I missed the discussion. > > I think that the inet ingress, egress etc filters are a neat feature, > but I think the API has some issues that will bite us down the road > if it becomes stable in its current form. > > Most of the problems I see are summarized in this transcript: > > # mkdir cg2 > # mount -t cgroup2 none cg2 > # mkdir cg2/nosockets > # strace cgrp_socket_rule cg2/nosockets/ 0 > ... > open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > > You can modify a cgroup after opening it O_RDONLY? > > bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > log_buf=0x6020c0, kern_version=0}, 48) = 4 > > This is fine. The bpf() syscall manipulates bpf objects. > > bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > > This is not so good: > > a) The bpf() syscall is supposed to manipulate bpf objects. This > is manipulating a cgroup. There's no reason that a socket creation > filter couldn't be written in a different language (new iptables > table? Simple list of address families?), but if that happened, > then using bpf() to install it would be entirely nonsensical. I don't see why it's _modifing_ the cgroup. I'm looking at it as network stack is using cgroup as an application group that should invoke bpf program at the certain point in the stack. imo cgroup management is orthogonal. I certainly don't mind adding 'cat cgroup.bpf' control file that will print the program digest for debugging, just like we do for fdinfo, but cgroup file interface shouldn't be used to control networking. If there was something like networking-wide ioctl, I think it could have been an alternative way to attach a program to a hook to a cgroup. Adding a control file to a cgroup for the purpose of attaching bpf, just doesn't make sense to me, since it's dragging bpf and networking details into cgroup land. Imagine in the future somebody would want to have nft to perform similar BPF_CGROUP_INET_INGRESS functionality. I'd think that the position of the hook within the networking stack would remain the same, but enum id will be different, since other ids in that enum (like BPF_CGROUP_INET_SOCK_CREATE) are not applicable to nft. Similarly the concept of program_fd doesn't exist there either. So it would be using its own netlink based mechanism and its own way of enumerating hook id. And cgroup could be passed as a string into netlink message instead of fd. So if somebody adds a control file to a cgroup api that takes bpf's hookid and bpf's program_fd, such control file will only be applicable to bpf and not usable for anything else. Hence I see no reason to go that route. > > b) This is starting to be an excessively ugly multiplexer. Among > other things, it's very unfriendly to seccomp. > > # echo $$ >cg2/nosockets/cgroup.procs > # ping 127.0.0.1 > ping: socket: Operation not permitted > # ls cg2/nosockets/ > cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control > # cat cg2/nosockets/cgroup.controllers > > Something in cgroupfs should give an indication that this cgroup > filters socket creation, but there's nothing there. You should also > be able to turn the filter off from cgroupfs. Doing 'cat cg2/nosockets/cgroup.bpf' here would be useful for debugging to see which program is attached to which hook in this cgroup. Detaching programs via cgroup control file is a lot less clear, since it will be confusing to a running application that attached the program in the first place, since it won't have any indication that external entity detached the program. One can argue that it could be useful for curious human/admin to detach all bpf progs from all hooks for this cgroup, but then it probably needs dmesg warning and only usable in extreme cases as debugging and not something control plane should ever use. > # mkdir cg2/nosockets/sockets > # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 > > This succeeded, which means that, if this feature is enabled in 4.10, > then we're stuck with its semantics. If it returned -EINVAL instead, > there would be a chance to refine it. i don't see the problem with this behavior. bpf and cgroup are indepedent. Imange that socket accounting program is attached to cg2/nosockets. The program is readonly and carry no security meaning. Why cgroup should prevent creation of cg2/nosockets/foo directory ? > # echo $$ >cg2/nosockets/sockets/cgroup.procs > # ping 127.0.0.1 > PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. > 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms hmm. in this case the admin cre
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Sat, Dec 17, 2016 at 11:26 AM, Mickaël Salaün wrote: > > On 17/12/2016 19:18, Andy Lutomirski wrote: >> Hi all- >> >> I apologize for being rather late with this. I didn't realize that >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see >> it on the linux-api list, so I missed the discussion. >> >> I think that the inet ingress, egress etc filters are a neat feature, >> but I think the API has some issues that will bite us down the road >> if it becomes stable in its current form. >> >> Most of the problems I see are summarized in this transcript: >> >> # mkdir cg2 >> # mount -t cgroup2 none cg2 >> # mkdir cg2/nosockets >> # strace cgrp_socket_rule cg2/nosockets/ 0 >> ... >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 >> >> You can modify a cgroup after opening it O_RDONLY? > > I sent a patch to check the cgroup.procs permission before attaching a > BPF program to it [1], but it was not merged because not part of the > current security model (which may not be crystal clear). The thing is > that the current socket/BPF/cgroup feature is only available to a > process with the *global CAP_NET_ADMIN* and such a process can already > modify the network for every processes, so it doesn't make much sense to > check if it can modify the network for a subset of this processes. > > [1] https://lkml.org/lkml/2016/9/19/854 I don't think that's quite the right approach. First, checking permissions on an fd that's not the fd passed to the syscall is weird and IMO shouldn't be done without a good reason. Second, cgroups.procs seems like the wrong file. Why not create "net.socket_create_filter" and require a writable fd to *that* to be passed to the syscall. > > However, needing a process to open a cgroup *directory* in write mode > may not make sense because the process does not modify the content of > the cgroup but only use it as a *reference* in the network stack. > Forcing an open with write mode may forbid to use this kind of > network-filtering feature in a read-only file-system but not necessarily > read-only *network configuration*. It does modify the cgroup. Logically it changes the cgroup behavior. As an implementation matter, it modifies the struct cgroup. > > Another point of view is that the CAP_NET_ADMIN may be an unneeded > privilege if the cgroup migration is using a no_new_privs-like feature > as I proposed with Landlock [2] (with an extra ptrace_may_access() check). > The new capability proposition for cgroup may be interesting too. > > [2] https://lkml.org/lkml/2016/9/14/82 > >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, >> log_buf=0x6020c0, kern_version=0}, 48) = 4 >> >> This is fine. The bpf() syscall manipulates bpf objects. >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 >> >> This is not so good: >> >> a) The bpf() syscall is supposed to manipulate bpf objects. This >> is manipulating a cgroup. There's no reason that a socket creation >> filter couldn't be written in a different language (new iptables >> table? Simple list of address families?), but if that happened, >> then using bpf() to install it would be entirely nonsensical. > > Another point of view is to say that the BPF program (called by the > network stack) is using a reference to a set of processes thanks to a > cgroup. I strongly disagree with this point of view. From a user's perspective, nothing about the bpf program changed -- the cgroup changes. Even in the code, the bpf program doesn't have a reference to anything. The cgroup references the bpf program. >> >> # echo $$ >cg2/nosockets/cgroup.procs >> # ping 127.0.0.1 >> ping: socket: Operation not permitted >> # ls cg2/nosockets/ >> cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control >> # cat cg2/nosockets/cgroup.controllers >> >> Something in cgroupfs should give an indication that this cgroup >> filters socket creation, but there's nothing there. You should also >> be able to turn the filter off from cgroupfs. > > Right. Everybody was OK at LPC to add such an information but it is not > there yet. Then let's do it before CONFIG_CGROUP_BPF=y becomes possible in a released kernel. > Right. Because this feature doesn't handle namespaces (only global > CAP_NET_ADMIN), nested containers should not be allowed to use it at all. > If we want this kind of feature to be usable by something other than a > process with a global capability, then we need an inheritance mechanism > similar to what I designed for Landlock. I think it could be added later. I think it's okay to have a new kernel feature that doesn't support namespacing yet. I don't think it's okay to have a new kernel feature that will become problematic when namespacing is added, and I think cgroup-bpf is in the latter class. Anyway, here's a half-baked proposal to clean this all u
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On 17/12/2016 19:18, Andy Lutomirski wrote: > Hi all- > > I apologize for being rather late with this. I didn't realize that > cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > it on the linux-api list, so I missed the discussion. > > I think that the inet ingress, egress etc filters are a neat feature, > but I think the API has some issues that will bite us down the road > if it becomes stable in its current form. > > Most of the problems I see are summarized in this transcript: > > # mkdir cg2 > # mount -t cgroup2 none cg2 > # mkdir cg2/nosockets > # strace cgrp_socket_rule cg2/nosockets/ 0 > ... > open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > > You can modify a cgroup after opening it O_RDONLY? I sent a patch to check the cgroup.procs permission before attaching a BPF program to it [1], but it was not merged because not part of the current security model (which may not be crystal clear). The thing is that the current socket/BPF/cgroup feature is only available to a process with the *global CAP_NET_ADMIN* and such a process can already modify the network for every processes, so it doesn't make much sense to check if it can modify the network for a subset of this processes. [1] https://lkml.org/lkml/2016/9/19/854 However, needing a process to open a cgroup *directory* in write mode may not make sense because the process does not modify the content of the cgroup but only use it as a *reference* in the network stack. Forcing an open with write mode may forbid to use this kind of network-filtering feature in a read-only file-system but not necessarily read-only *network configuration*. Another point of view is that the CAP_NET_ADMIN may be an unneeded privilege if the cgroup migration is using a no_new_privs-like feature as I proposed with Landlock [2] (with an extra ptrace_may_access() check). The new capability proposition for cgroup may be interesting too. [2] https://lkml.org/lkml/2016/9/14/82 > > bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > log_buf=0x6020c0, kern_version=0}, 48) = 4 > > This is fine. The bpf() syscall manipulates bpf objects. > > bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > > This is not so good: > > a) The bpf() syscall is supposed to manipulate bpf objects. This > is manipulating a cgroup. There's no reason that a socket creation > filter couldn't be written in a different language (new iptables > table? Simple list of address families?), but if that happened, > then using bpf() to install it would be entirely nonsensical. Another point of view is to say that the BPF program (called by the network stack) is using a reference to a set of processes thanks to a cgroup. > > b) This is starting to be an excessively ugly multiplexer. Among > other things, it's very unfriendly to seccomp. FWIW, Landlock will have the capability to filter this kind of action. > > # echo $$ >cg2/nosockets/cgroup.procs > # ping 127.0.0.1 > ping: socket: Operation not permitted > # ls cg2/nosockets/ > cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control > # cat cg2/nosockets/cgroup.controllers > > Something in cgroupfs should give an indication that this cgroup > filters socket creation, but there's nothing there. You should also > be able to turn the filter off from cgroupfs. Right. Everybody was OK at LPC to add such an information but it is not there yet. > > # mkdir cg2/nosockets/sockets > # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 > > This succeeded, which means that, if this feature is enabled in 4.10, > then we're stuck with its semantics. If it returned -EINVAL instead, > there would be a chance to refine it. This is indeed unfortunate. > > # echo $$ >cg2/nosockets/sockets/cgroup.procs > # ping 127.0.0.1 > PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. > 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms > ^C > --- 127.0.0.1 ping statistics --- > 1 packets transmitted, 1 received, 0% packet loss, time 0ms > rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms > > Bash was inside a cgroup that disallowed socket creation, but socket > creation wasn't disallowed. This means that the obvious use of socket > creation filters in nestable constainers fails insecurely. > > > There's also a subtle but nasty potential security problem here. > In 4.9 and before, cgroups has only one real effect in the kernel: > resource control. A process in a malicious cgroup could be DoSed, > but that was about the extent of the damage that a malicious cgroup > could do. > > In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf > programs attached that can do things if various events occur. (Right > now, this means socket operations, but there are plans in the works > to do this f
Potential issues (security and otherwise) with the current cgroup-bpf API
Hi all- I apologize for being rather late with this. I didn't realize that cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see it on the linux-api list, so I missed the discussion. I think that the inet ingress, egress etc filters are a neat feature, but I think the API has some issues that will bite us down the road if it becomes stable in its current form. Most of the problems I see are summarized in this transcript: # mkdir cg2 # mount -t cgroup2 none cg2 # mkdir cg2/nosockets # strace cgrp_socket_rule cg2/nosockets/ 0 ... open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 You can modify a cgroup after opening it O_RDONLY? bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, log_buf=0x6020c0, kern_version=0}, 48) = 4 This is fine. The bpf() syscall manipulates bpf objects. bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 This is not so good: a) The bpf() syscall is supposed to manipulate bpf objects. This is manipulating a cgroup. There's no reason that a socket creation filter couldn't be written in a different language (new iptables table? Simple list of address families?), but if that happened, then using bpf() to install it would be entirely nonsensical. b) This is starting to be an excessively ugly multiplexer. Among other things, it's very unfriendly to seccomp. # echo $$ >cg2/nosockets/cgroup.procs # ping 127.0.0.1 ping: socket: Operation not permitted # ls cg2/nosockets/ cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control # cat cg2/nosockets/cgroup.controllers Something in cgroupfs should give an indication that this cgroup filters socket creation, but there's nothing there. You should also be able to turn the filter off from cgroupfs. # mkdir cg2/nosockets/sockets # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 This succeeded, which means that, if this feature is enabled in 4.10, then we're stuck with its semantics. If it returned -EINVAL instead, there would be a chance to refine it. # echo $$ >cg2/nosockets/sockets/cgroup.procs # ping 127.0.0.1 PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms ^C --- 127.0.0.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms Bash was inside a cgroup that disallowed socket creation, but socket creation wasn't disallowed. This means that the obvious use of socket creation filters in nestable constainers fails insecurely. There's also a subtle but nasty potential security problem here. In 4.9 and before, cgroups has only one real effect in the kernel: resource control. A process in a malicious cgroup could be DoSed, but that was about the extent of the damage that a malicious cgroup could do. In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf programs attached that can do things if various events occur. (Right now, this means socket operations, but there are plans in the works to do this for LSM hooks too.) These bpf programs can say yes or no, but they can also read out various data (including socket payloads!) and save them away where an attacker can find them. This sounds a lot like seccomp with a narrower scope but a much stronger ability to exfiltrate private information. Unfortunately, while seccomp is very, very careful to prevent injection of a privileged victim into a malicious sandbox, the CGROUP_BPF mechanism appears to have no real security model. There is nothing to prevent a program that's in a malicious cgroup from running a setuid binary, and there is nothing to prevent a program that has the ability to move itself or another program into a malicious cgroup from doing so and then, if needed for exploitation, exec a setuid binary. This isn't much of a problem yet because you currently need CAP_NET_ADMIN to create a malicious sandbox in the first place. I'm sure that, in the near future, someone will want to make this stuff work in containers with delegated cgroup hierarchies, and then there may be a real problem here. I've included a few security people on this thread. The current API looks abusable, and it would be nice to find all the holes before 4.10 comes out. (The cgrp_socket_rule source is attached. You can build it by sticking it in samples/bpf and doing: $ make headers_install $ cd samples/bpf $ gcc -o cgrp_socket_rule cgrp_socket_rule.c libbpf.c -I../../usr/include ) --Andy /* eBPF example program: * * - Loads eBPF program * * The eBPF program sets the sk_bound_dev_if index in new AF_INET{6} * sockets opened by processes in the cgroup. * * - Attaches the new program to a cgroup using BPF_PROG_ATTACH */ #define _GNU_SOURCE #include #include #include #include #include #in