subject:"Potential issues \(security and otherwise\) with the current cgroup\-bpf API"

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-02-04 Thread Andy Lutomirski

On Fri, Feb 3, 2017 at 3:21 PM, Alexei Starovoitov
 wrote:
> On Fri, Feb 03, 2017 at 01:07:39PM -0800, Andy Lutomirski wrote:
>>
>> Is there any plan to address this?  If not, I'll try to write that
>> patch this weekend.
>
> yes. I'm working on 'disallow program override' flag.
> It got stalled, because netns discussion got stalled.
> Later today will send a patch for dev_id+inode and
> will continue on the flag patch.
>

Would it make sense to try to document what your proposal does before
writing the code?  I don't yet see how to get semantics that are both
simple and sensible with a "disallow override" flag.

I *do* see how to get simple, sensible semantics with an approach
where all the programs in scope for the cgroup in question get called.
If needed, I can imagine a special "overridable" program that would
not be run if the socket in question is bound to a descendent cgroup
that also has an "overridable" program but would still let all the
normal hierarchical programs in scope get called.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-02-03 Thread Alexei Starovoitov

On Fri, Feb 03, 2017 at 01:07:39PM -0800, Andy Lutomirski wrote:
> 
> Is there any plan to address this?  If not, I'll try to write that
> patch this weekend.

yes. I'm working on 'disallow program override' flag.
It got stalled, because netns discussion got stalled.
Later today will send a patch for dev_id+inode and
will continue on the flag patch.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-02-03 Thread Andy Lutomirski

On Mon, Jan 23, 2017 at 12:20 PM, Andy Lutomirski  wrote:
> On Sun, Jan 22, 2017 at 8:31 PM, Alexei Starovoitov
>  wrote:
>> On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote:
>>> restricting the types of sockets that can be created, then you do want
>>> the filter to work across namespaces, but seccomp can do that too and
>>> the current code doesn't handle netns correctly.
>>
>> are you saying that seccomp supports netns filtering? please show the proof.
>
> It can trivially restruct the types of sockets that are created by
> filtering on socket(2) syscall parameters, at least on sane
> architectures that don't use socketcall().

I think this is actually wrong -- the socket creation filter appears
to be called only on inet sockets.  Is there a good reason for that?

>
>> To summarize, I think the 'prog override is not allowed' flag would be
>> ok feature to have and I wouldn't mind making it the default when no 'flag'
>> field is passed to bpf syscall, but it's not acceptable to remove current
>> 'prog override is allowed' behavior.
>> So if you insist on changing the default, please implement the flag asap.
>> Though from security point of view and ABI point of view there is absolutely
>> no difference whether this flag is added today or a year later while
>> the default is kept as-is.
>
> It's too late and I have too little time.  I'll try to write a patch
> to change the default to just deny attempts to override.  Better
> behavior can be added later.
>
> IMO your suggestions about priorities are overcomplicated.  For your
> instrumentation needs, I can imagine that a simple "make this hook not
> run if a descendent has a hook" would do it.  For everything else, run
> them all in tree order (depending on filter type) and let the eBPF
> code do whatever it wants to do.

Is there any plan to address this?  If not, I'll try to write that
patch this weekend.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-23 Thread Andy Lutomirski

On Sun, Jan 22, 2017 at 8:31 PM, Alexei Starovoitov
 wrote:
> On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote:
>> On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov
>>  wrote:
>> > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
>> >> I think it could work by making a single socket cgroup controller that
>> >> handles all cgroup things that are bound to a socket.  Using
>> >
>> > Such 'socket cgroup controller' would limit usability of the feature
>> > to sockets and force all other use cases like landlock to invent
>> > their own wheel, which is undesirable. Everyone will be
>> > inventing new 'foo cgroup controller', while all of them
>> > are really bpf features. They are different bpf program
>> > types that attach to different hooks and use cgroup for scoping.
>>
>> Can you elaborate on why that would be a problem?  In a cgroup v1
>> world, users who want different hierarchies for different types of
>> control could easily want one hierarchy for socket hooks and a
>> different hierarchy for lsm hooks.  In a cgroup v2 delegation world, I
>> could easily imagine the decision to delegate socket hooks being
>> different from the decision to delegate lsm hooks.  Almost all of the
>> code would be shared between different bpf-using cgroup controllers.
>
> how do you think it can be enforced when directory is chowned?

A combination of delegation policy (a la subtree_control in the
parent) and MAC policy.  I'm quite confident that apparmor can apply
policy to cgroupfs and I'm reasonably confident (although slightly
less so) that selinux can as well.  The docs suck :(

But if there's only one file in there to apply MAC policy to, the
ability to differentiate between subsystems is quite limited.   In the
current API, there are no files to apply policy to, so it won't work
at all.

>
> ... and open() of the directory done by the current api will preserve
> cgroup delegation when and only when bpf_prog_type_cgroup_*
> becomes unprivileged.
> I'm not proposing creating new api here.

I don't know what you mean.  open() of the directory can't check
anything useful because it has to work for programs that just want to
read the directory.

>> Meanwhile, cgroup+bpf actually appears to be buggy in this regard even
>> regardless of what semantics you think are better.  sk_bound_dev_if is
>> exposed as a u32 value, but sk_bound_dev_if only has meaning within a
>> given netns.  The "ip vrf" stuff will straight-up malfunction if a
>> process affected by its hook runs in a different netns from the netns
>> that "ip vrf" was run in.
>
> how is that any different from normal 'ip netns exec'?
> that is expected user behavior.

# ./ip link add dev vrf0 type vrf table 10
# ./ip vrf exec vrf0 ./show_bind
Default binding is "vrf0"
# ./ip vrf exec vrf0 unshare -n ./show_bind
show_bind: getsockopt: No such device

The expected behavior to me is that ip netns exec (or equivalently
unshare -n) gives a clean slate.  Actually malfunctioning (which this
example using the latest iproute2 and linux just did) is not expected.

I'm done with this part of this thread and I'm sending a patch.

>
>> restricting the types of sockets that can be created, then you do want
>> the filter to work across namespaces, but seccomp can do that too and
>> the current code doesn't handle netns correctly.
>
> are you saying that seccomp supports netns filtering? please show the proof.

It can trivially restruct the types of sockets that are created by
filtering on socket(2) syscall parameters, at least on sane
architectures that don't use socketcall().

> To summarize, I think the 'prog override is not allowed' flag would be
> ok feature to have and I wouldn't mind making it the default when no 'flag'
> field is passed to bpf syscall, but it's not acceptable to remove current
> 'prog override is allowed' behavior.
> So if you insist on changing the default, please implement the flag asap.
> Though from security point of view and ABI point of view there is absolutely
> no difference whether this flag is added today or a year later while
> the default is kept as-is.

It's too late and I have too little time.  I'll try to write a patch
to change the default to just deny attempts to override.  Better
behavior can be added later.

IMO your suggestions about priorities are overcomplicated.  For your
instrumentation needs, I can imagine that a simple "make this hook not
run if a descendent has a hook" would do it.  For everything else, run
them all in tree order (depending on filter type) and let the eBPF
code do whatever it wants to do.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-22 Thread Alexei Starovoitov

On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote:
> On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov
>  wrote:
> > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
> >> I think it could work by making a single socket cgroup controller that
> >> handles all cgroup things that are bound to a socket.  Using
> >
> > Such 'socket cgroup controller' would limit usability of the feature
> > to sockets and force all other use cases like landlock to invent
> > their own wheel, which is undesirable. Everyone will be
> > inventing new 'foo cgroup controller', while all of them
> > are really bpf features. They are different bpf program
> > types that attach to different hooks and use cgroup for scoping.
> 
> Can you elaborate on why that would be a problem?  In a cgroup v1
> world, users who want different hierarchies for different types of
> control could easily want one hierarchy for socket hooks and a
> different hierarchy for lsm hooks.  In a cgroup v2 delegation world, I
> could easily imagine the decision to delegate socket hooks being
> different from the decision to delegate lsm hooks.  Almost all of the
> code would be shared between different bpf-using cgroup controllers.

how do you think it can be enforced when directory is chowned?

> >> Having thought about this some more, I think that making it would
> >> alleviate a bunch of my concerns, as it would make the semantics if
> >> the capable() check were relaxed to ns_capable() be sane.  Here's what
> >
> > here we're on the same page. For any meaningful discussion about
> > 'bpf cgroup controller' to happen bpf itself needs to become
> > delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP*
> > program types need to become available for unprivileged users.
> > The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER.
> > To make it secure we severely limited its functionality.
> > All bpf advances since then (like new map types and verifier extensions)
> > were done for root only. If early on the priv vs unpriv bpf features
> > were 80/20. Now it's close to 95/5. No work has been done to
> > make socket filter type more powerful. It still has to use
> > slow-ish ld_abs skb access while tc/xdp have direct packet access.
> > Things like register value tracking is root only as well and so on
> > and so forth.
> > We cannot just flip the switch and allow type_cgroup* to unpriv
> > and I don't see any volunteers willing to do this work.
> > Until that happens there is no point coming up with designs
> > for 'cgroup bpf controller'... whatever that means.
> 
> Sure there is.  If delegation can be turned on without changing the
> API, then the result will be easier to work with and have fewer
> compatibility issues.

... and open() of the directory done by the current api will preserve
cgroup delegation when and only when bpf_prog_type_cgroup_*
becomes unprivileged.
I'm not proposing creating new api here.

> >
> >> I currently should happen before bpf+cgroup is enabled in a release:
> >>
> >> 1. Make it netns-aware.  This could be as simple as making it only
> >> work in the root netns because then real netns awareness can be added
> >> later without breaking anything.  The current situation is bad in that
> >> network namespaces are just ignored and it's plausible that people
> >> will start writing user code that depends on having network namespaces
> >> be ignored.
> >
> > nothing in bpf today is netns-aware and frankly I don't see
> > how cgroup+bpf has anything to do with netns.
> > For regular sockets+bpf we don't check netns.
> > When tcpdump opens raw socket and attaches bpf there are no netns
> > checks, since socket itself gives a scope for the program to run.
> > Same thing applies to cgroup+bpf. cgroup gives a scope for the program.
> > But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_*
> > hooks.
> 
> 
> Here I completely disagree with you.  tcpdump sees packets in its
> network namespace.  Regular sockets apply bpf filters to the packets
> seen by that socket, and the socket itself is scoped to a netns.
> 
> Meanwhile, cgroup+bpf actually appears to be buggy in this regard even
> regardless of what semantics you think are better.  sk_bound_dev_if is
> exposed as a u32 value, but sk_bound_dev_if only has meaning within a
> given netns.  The "ip vrf" stuff will straight-up malfunction if a
> process affected by its hook runs in a different netns from the netns
> that "ip vrf" was run in.

how is that any different from normal 'ip netns exec'?
that is expected user behavior.

> IOW, the current code is buggy.
> 
> > Then if the hooks are used for security, the process
> > only needs to do setns() to escape security sandbox. Obviously
> > broken semantics.
> 
> This could go both ways.  If the goal is to filter packets, then it's
> not really important to have the filter keep working if the sandboxed
> task unshares netns -- in the new netns, there isn't any access to the
> netwo

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-19 Thread Andy Lutomirski

On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov
 wrote:
> On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
>> I think it could work by making a single socket cgroup controller that
>> handles all cgroup things that are bound to a socket.  Using
>
> Such 'socket cgroup controller' would limit usability of the feature
> to sockets and force all other use cases like landlock to invent
> their own wheel, which is undesirable. Everyone will be
> inventing new 'foo cgroup controller', while all of them
> are really bpf features. They are different bpf program
> types that attach to different hooks and use cgroup for scoping.

Can you elaborate on why that would be a problem?  In a cgroup v1
world, users who want different hierarchies for different types of
control could easily want one hierarchy for socket hooks and a
different hierarchy for lsm hooks.  In a cgroup v2 delegation world, I
could easily imagine the decision to delegate socket hooks being
different from the decision to delegate lsm hooks.  Almost all of the
code would be shared between different bpf-using cgroup controllers.

>
>> Having thought about this some more, I think that making it would
>> alleviate a bunch of my concerns, as it would make the semantics if
>> the capable() check were relaxed to ns_capable() be sane.  Here's what
>
> here we're on the same page. For any meaningful discussion about
> 'bpf cgroup controller' to happen bpf itself needs to become
> delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP*
> program types need to become available for unprivileged users.
> The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER.
> To make it secure we severely limited its functionality.
> All bpf advances since then (like new map types and verifier extensions)
> were done for root only. If early on the priv vs unpriv bpf features
> were 80/20. Now it's close to 95/5. No work has been done to
> make socket filter type more powerful. It still has to use
> slow-ish ld_abs skb access while tc/xdp have direct packet access.
> Things like register value tracking is root only as well and so on
> and so forth.
> We cannot just flip the switch and allow type_cgroup* to unpriv
> and I don't see any volunteers willing to do this work.
> Until that happens there is no point coming up with designs
> for 'cgroup bpf controller'... whatever that means.

Sure there is.  If delegation can be turned on without changing the
API, then the result will be easier to work with and have fewer
compatibility issues.

>
>> I currently should happen before bpf+cgroup is enabled in a release:
>>
>> 1. Make it netns-aware.  This could be as simple as making it only
>> work in the root netns because then real netns awareness can be added
>> later without breaking anything.  The current situation is bad in that
>> network namespaces are just ignored and it's plausible that people
>> will start writing user code that depends on having network namespaces
>> be ignored.
>
> nothing in bpf today is netns-aware and frankly I don't see
> how cgroup+bpf has anything to do with netns.
> For regular sockets+bpf we don't check netns.
> When tcpdump opens raw socket and attaches bpf there are no netns
> checks, since socket itself gives a scope for the program to run.
> Same thing applies to cgroup+bpf. cgroup gives a scope for the program.
> But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_*
> hooks.

Here I completely disagree with you.  tcpdump sees packets in its
network namespace.  Regular sockets apply bpf filters to the packets
seen by that socket, and the socket itself is scoped to a netns.

Meanwhile, cgroup+bpf actually appears to be buggy in this regard even
regardless of what semantics you think are better.  sk_bound_dev_if is
exposed as a u32 value, but sk_bound_dev_if only has meaning within a
given netns.  The "ip vrf" stuff will straight-up malfunction if a
process affected by its hook runs in a different netns from the netns
that "ip vrf" was run in.

IOW, the current code is buggy.

> Then if the hooks are used for security, the process
> only needs to do setns() to escape security sandbox. Obviously
> broken semantics.

This could go both ways.  If the goal is to filter packets, then it's
not really important to have the filter keep working if the sandboxed
task unshares netns -- in the new netns, there isn't any access to the
network at all.  If the goal is to reduce attack surface by
restricting the types of sockets that can be created, then you do want
the filter to work across namespaces, but seccomp can do that too and
the current code doesn't handle netns correctly.

>
>> 2. Make it inherit properly.  Inner cgroups should not override outer
>> hooks.  As in (1), this could be simplified by preventing the same
>> hook from being configured in both an ancestor and a descendent
>> cgroup.  Then inheritance could be added for real later on.
>
> In general it sounds fine, but it seems the reasoning

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-19 Thread Alexei Starovoitov

On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
> I think it could work by making a single socket cgroup controller that
> handles all cgroup things that are bound to a socket.  Using

Such 'socket cgroup controller' would limit usability of the feature
to sockets and force all other use cases like landlock to invent
their own wheel, which is undesirable. Everyone will be
inventing new 'foo cgroup controller', while all of them
are really bpf features. They are different bpf program
types that attach to different hooks and use cgroup for scoping.

> Having thought about this some more, I think that making it would
> alleviate a bunch of my concerns, as it would make the semantics if
> the capable() check were relaxed to ns_capable() be sane.  Here's what

here we're on the same page. For any meaningful discussion about
'bpf cgroup controller' to happen bpf itself needs to become
delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP*
program types need to become available for unprivileged users.
The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER.
To make it secure we severely limited its functionality.
All bpf advances since then (like new map types and verifier extensions)
were done for root only. If early on the priv vs unpriv bpf features
were 80/20. Now it's close to 95/5. No work has been done to
make socket filter type more powerful. It still has to use
slow-ish ld_abs skb access while tc/xdp have direct packet access.
Things like register value tracking is root only as well and so on
and so forth.
We cannot just flip the switch and allow type_cgroup* to unpriv
and I don't see any volunteers willing to do this work.
Until that happens there is no point coming up with designs
for 'cgroup bpf controller'... whatever that means.

> I currently should happen before bpf+cgroup is enabled in a release:
> 
> 1. Make it netns-aware.  This could be as simple as making it only
> work in the root netns because then real netns awareness can be added
> later without breaking anything.  The current situation is bad in that
> network namespaces are just ignored and it's plausible that people
> will start writing user code that depends on having network namespaces
> be ignored.

nothing in bpf today is netns-aware and frankly I don't see
how cgroup+bpf has anything to do with netns.
For regular sockets+bpf we don't check netns.
When tcpdump opens raw socket and attaches bpf there are no netns
checks, since socket itself gives a scope for the program to run.
Same thing applies to cgroup+bpf. cgroup gives a scope for the program.
But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_*
hooks. Then if the hooks are used for security, the process
only needs to do setns() to escape security sandbox. Obviously
broken semantics.

> 2. Make it inherit properly.  Inner cgroups should not override outer
> hooks.  As in (1), this could be simplified by preventing the same
> hook from being configured in both an ancestor and a descendent
> cgroup.  Then inheritance could be added for real later on.

In general it sounds fine, but it seems the reasoning to add
such restriction now (instead of later), so that program chain can
be added without breaking abi, since if we don't restrict it now
there will be no way to add it without breaking abi?!
That is incorrect assumption. We can add chaining and can add
'do not override' logic without breaking existing semantics.
For example, we can add 'priority' field to
struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */}
which would indicate relative position of multiple chained programs
applied to the same cgroup+hook pair. Multiple programs with
the same priority will be executed in the order they were added.
Programs with different priorities will execute in the priority order.
Such scheme will be more generic and flexible than earlier proposals.
Similarly we can add another flag that will say 'dissallow override
of bpf program in descendent cgroup'. It's all trivial to do,
since bpf syscall was designed for extensibility.

Also until bpf_type_cgroup* becomes unprivileged there is no reason
to add this 'priority/prog chaining' feature, since if it's
used for security the root can always override it no matter cgroup
hierarchy.

> 3. Give cgroup delegation support some serious thought.  In
> particular, if delegation would be straightforward but the current API
> wouldn't work well with delegation, then at least consider whether the
> API should change before it becomes stable so that two APIs don't need
> to be supported going forward.

please see example above. Since we went with bpf syscall (instead of
inextensible ioctl) we can add any new cgroup+bpf logic without
breaking current abi.
No matter how you twist it the cgroup+bpf is bpf specific feature.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-19 Thread Michal Hocko

On Wed 18-01-17 14:18:50, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Jan 17, 2017 at 02:58:30PM +0100, Michal Hocko wrote:
> > This would require using hierarchical cgroup iterators to iterate over
> 
> It does behave hierarchically.
> 
> > tasks. As per Andy's testing this doesn't seem to be the case. I haven't
> 
> That's not what Andy's testing showed.  What that showed was that
> program in a child can override the one from its ancestor.

My fault, I've misread Andy's test case. I thought that the child group
simply disabled the bpf program and the one from the parent hasn't
executed.
-- 
Michal Hocko
SUSE Labs

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-18 Thread Andy Lutomirski

On Wed, Jan 18, 2017 at 4:59 PM, Tejun Heo  wrote:
> Hello, Andy.
>
> On Wed, Jan 18, 2017 at 04:18:04PM -0800, Andy Lutomirski wrote:
>> To map cgroup -> hook, a simple array in each cgroup structure works.
>> To map (cgroup, netns) -> hook function, the natural approach would be
>> to have some kind of hash, and that would be slower.  This code is
>> intended to be *fast*.
>
> Updating these programs isn't a frequent operation.  We can easily
> calculate a perfect (or acceptable) hash per-cgroup and rcu swap the
> new hashtable.
>

Fair enough.

>
>> I think it's currently a broken cgroup feature.  I think it could be
>> made into a properly working cgroup feature (which I favor) or a
>> properly working net feature.  Can you articulate why you prefer
>> having it be a net feature instead of a cgroup feature?  I tend to
>
> I thought that's what I was doing in the previous message.
>
>> favor making it be a working cgroup feature (by giving it a file or
>> files in cgroupfs and maybe even a controller) because the result
>> would have very obvious semantics.
>
> I'm fine with both directions but one seems far out.

I thought you were saying why you thought it wasn't a cgroup feature,
but I'm not sure I understand why you thought it shouldn't be a cgroup
feature.

When I chatted with Alexei, I had the impression that you and he had
wanted it to be a cgroup controller but thought it wouldn't work well.
I think it could work by making a single socket cgroup controller that
handles all cgroup things that are bound to a socket.  Using
individual files would play nicer with cgroup delegation within a
single netns.

>
>> But maybe this is just a false dichotomy.  Could this feature be a
>> per-netns configuration where you can give instructions like "run this
>> hook if you're in such-and-such netns and in a given cgroup or one of
>> its descendents"?  This would prevent it from being a direct analogue
>> to systemd's RestrictAddressFamilies= option, but that may be okay.
>> This would side-step issues in the current code where a hook can't
>> rely on ifindex meaning what the hook thinks it means.
>
> How is this different from making the current code netns aware?

The descendents part is important.

>
>> Actually, I think I like that approach.  What if it we had a "socket"
>> controller and files like "socket.create_socket", "socket.ingress" and
>> "socket.egress"?  You could use the bpf() syscall to install a bpf
>> filter into "socket.egress" and that filter would filter egress for
>> the target cgroup and its descendents on the current netns.  As a
>> first pass, the netns issue could be sidestepped by making it only
>> work in the root netns (i.e. the bpf() call would fail if you're not
>> in the root netns and the hooks wouldn't run if you're not in the root
>> netns.)  It *might* even be possible to retrofit in the socket
>> controller by saying that the default hierarchy is used if the socket
>> controller isn't mounted.
>
> I don't know.  You're throwing out too many ideas too fast and it's
> difficult to keep up with what the differences are between those
> ideas.  But if we're doing cgroup controllers, shouldn't cgroup ns
> support be sufficient?  We can consider the product of cgroup and net
> namespaces but that doesn't seem necessary given that people usually
> set up these namespaces in conjunction.

Fair enough.  See way below.

>
>> What *isn't* possible to cleanly fix after the fact is the current
>> semantics that cgroup hooks override the hooks in their ancestors.
>> IMO that is simply broken.  The example you gave (perf_event) is very
>> careful not to have this problem.
>
> That's like saying installing an iptables rule for a more specific
> target is broken.  As a cgroup controller, it is not an acceptable
> behavior given how delegation works.  As something similar to
> iptables, it is completely fine.

Even the xt_cgroup code checks cgroup_is_descendent().

>
>> > * My impression with bpf is that while delegation is something
>> >   theoretically possible it is not something which is gonna take place
>> >   in any immediate time frame.  If I'm wrong on this, please feel free
>> >   to correct me.
>>
>> But the issue isn't *BPF* delegation.  It's cgroup delegation or netns
>> creation, both of which exist today.
>
> No, the issue is bpf delegation.  If bpf were fully delegatable in
> practical sense, we could just do straight forward cgroup bpf
> controller.  Well, we'll have to think about how to chain the programs
> which would differ on program type but that shouldn't be too hard.

What do you mean by "if bpf were fully delegatable"?  I don't know
what it would even mean to delegate bpf.  I do know what it means to
allow programs to create new network namespaces and for cgroup
sub-hierarchies to be delegated.

>
>> > * There are a lot of use cases which can immediately take advantage of
>> >   cgroup-aware bpf programs operating as proposed.  The model is
>> >   semantically equivalen

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-18 Thread Mickaël Salaün

On 19/01/2017 01:18, Andy Lutomirski wrote:
>>> it explicitly respects the cgroup hierarchy.  It shows up in
>>> /proc/cgroups, and I had no problem mounting a cgroupfs instance with
>>> perf_event enabled.  So I'm not sure what you mean.
>>
>> That all it's doing is providing membership information.
> 
> But it's doing it wrong!  Even perf_event tests for membership in a
> given cgroup *or one of its descendents*.  This code does not.
> 
> I think the moral of the story here is that there are lots of open
> questions and design work to be done and that this feature really
> isn't ready to be stable.  For Landlock, I believe that it really
> needs to be done right and I will put my foot down and NAK any effort
> to have Landlock available in a released kernel without resolving
> these types of issues first.  Does anyone really want Landlock to work
> differently than the net hooks simply because the net hooks were in a
> rush?

About Landlock, there is two use cases:

The first is to allow unprivileged users to tie eBPF programs (rules) to
processes. This is the (final) goal. In this case, a (cgroup) hierarchy
is mandatory, otherwise it would be trivial to defeat any access rule.
This is the same issue with namespaces.

The second use case is to only allow privileged users to tie eBPF
programs to processes. As discussed, this will be the next series,
preceding the unprivileged series. In this privilege case, only root
(global CAP_SYS_ADMIN, no namespaces) may be able to use Landlock. Not
having a hierarchy is not a security issue (only a practical one).

The first/next Landlock series (in February) will focus on process
hierarchies (without cgroup), a la seccomp-bpf. It would be too
confusing to not use an inheritable hierarchy like seccomp does, even in
a privileged-only first approach. The inherited rules should then behave
similarly to the seccomp-bpf filters.

However, the following series focusing on cgroup could keep the current
cgroup-bpf behavior, without hierarchy. I don't like the non-hierarchy
approach very much because it add another (less flexible and more
confusing) way to do things (for Landlock at least), but I'm willing to
do it if needed.

Regards,
 Mickaël

signature.asc
Description: OpenPGP digital signature

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-18 Thread Tejun Heo

Hello, Andy.

On Wed, Jan 18, 2017 at 04:18:04PM -0800, Andy Lutomirski wrote:
> To map cgroup -> hook, a simple array in each cgroup structure works.
> To map (cgroup, netns) -> hook function, the natural approach would be
> to have some kind of hash, and that would be slower.  This code is
> intended to be *fast*.

Updating these programs isn't a frequent operation.  We can easily
calculate a perfect (or acceptable) hash per-cgroup and rcu swap the
new hashtable.

> I suppose you could have each cgroup structure contain an array where
> each element is (netns, hook function) and just skip the ones that are
> the wrong netns.  Perhaps it would be rare to have many of those.

Yeah, or maybe even that.  It just isn't a fundamentally difficult
problem.

> I think it's currently a broken cgroup feature.  I think it could be
> made into a properly working cgroup feature (which I favor) or a
> properly working net feature.  Can you articulate why you prefer
> having it be a net feature instead of a cgroup feature?  I tend to

I thought that's what I was doing in the previous message.

> favor making it be a working cgroup feature (by giving it a file or
> files in cgroupfs and maybe even a controller) because the result
> would have very obvious semantics.

I'm fine with both directions but one seems far out.

> But maybe this is just a false dichotomy.  Could this feature be a
> per-netns configuration where you can give instructions like "run this
> hook if you're in such-and-such netns and in a given cgroup or one of
> its descendents"?  This would prevent it from being a direct analogue
> to systemd's RestrictAddressFamilies= option, but that may be okay.
> This would side-step issues in the current code where a hook can't
> rely on ifindex meaning what the hook thinks it means.

How is this different from making the current code netns aware?

> Actually, I think I like that approach.  What if it we had a "socket"
> controller and files like "socket.create_socket", "socket.ingress" and
> "socket.egress"?  You could use the bpf() syscall to install a bpf
> filter into "socket.egress" and that filter would filter egress for
> the target cgroup and its descendents on the current netns.  As a
> first pass, the netns issue could be sidestepped by making it only
> work in the root netns (i.e. the bpf() call would fail if you're not
> in the root netns and the hooks wouldn't run if you're not in the root
> netns.)  It *might* even be possible to retrofit in the socket
> controller by saying that the default hierarchy is used if the socket
> controller isn't mounted.

I don't know.  You're throwing out too many ideas too fast and it's
difficult to keep up with what the differences are between those
ideas.  But if we're doing cgroup controllers, shouldn't cgroup ns
support be sufficient?  We can consider the product of cgroup and net
namespaces but that doesn't seem necessary given that people usually
set up these namespaces in conjunction.

> What *isn't* possible to cleanly fix after the fact is the current
> semantics that cgroup hooks override the hooks in their ancestors.
> IMO that is simply broken.  The example you gave (perf_event) is very
> careful not to have this problem.

That's like saying installing an iptables rule for a more specific
target is broken.  As a cgroup controller, it is not an acceptable
behavior given how delegation works.  As something similar to
iptables, it is completely fine.

> > * My impression with bpf is that while delegation is something
> >   theoretically possible it is not something which is gonna take place
> >   in any immediate time frame.  If I'm wrong on this, please feel free
> >   to correct me.
> 
> But the issue isn't *BPF* delegation.  It's cgroup delegation or netns
> creation, both of which exist today.

No, the issue is bpf delegation.  If bpf were fully delegatable in
practical sense, we could just do straight forward cgroup bpf
controller.  Well, we'll have to think about how to chain the programs
which would differ on program type but that shouldn't be too hard.

> > * There are a lot of use cases which can immediately take advantage of
> >   cgroup-aware bpf programs operating as proposed.  The model is
> >   semantically equivalent to iptables (let's address the netns part if
> >   that's an issue) which net people are familiar with.
> 
> That sounds like a great argument for those users to patch their
> kernels to gain experience with this feature.

I don't get why this would particularly point to out-of-tree usage.
Why is that?

> > * It isn't exclusive with adding cgroup bpf controller down the road
> >   if necessary and practical.  Sure, this isn't the perfect situation
> >   but it seems like an acceptable trade-off to me.  What ever is
> >   perfect?
> 
> I think it more or less is exclusive.  I can imagine davem accepting a
> patch to make the sock_cgroup_data think point at a new "socket"
> cgroup type.  I can't imagine him accepting a patch to have

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-18 Thread Andy Lutomirski

On Wed, Jan 18, 2017 at 2:41 PM, Tejun Heo  wrote:
> Helloo, Andy.
>
> On Mon, Jan 16, 2017 at 09:18:36PM -0800, Andy Lutomirski wrote:
>> Perhaps this is a net feature, though, not a cgroup feature.  This
>> would IMO make a certain amount of sense.  Iptables cgroup matches,
>> for example, logically are an iptables (i.e., net) feature.  The
>
> Yeap.
>
>> problem here is that network configuration (and this explicitly
>> includes iptables) is done on a per-netns level, whereas these new
>> hooks entirely ignore network namespaces.  I've pointed out that
>> trying to enable them in a namespaced world is problematic (e.g.
>> switching to ns_capable() will cause serious problems), and Alexei
>> seems to think that this will never happen.  So I don't think we can
>> really call this a net feature that works the way that other net
>> features work.
>>
>> (Suppose that this actually tried to be netns-enabled.  Then you'd
>> have to map from (netns, cgroup) -> hook, and this would probably be
>> slow and messy.)
>
> I'm not sure how important netns support for these bpf programs
> actually is but I have no objection to making it behave in an
> equivalent manner to iptables where appropriate.  This is kinda
> trivial to do too.  For now, we can simply not run the programs if the
> associated socket belongs to non-root netns.  Later, if necessary, we
> can add per-ns bpf programs.  And I can't think of a reason why
> (netns, cgroup) -> function mapping would be slow and messy.  Why
> would that be?

To map cgroup -> hook, a simple array in each cgroup structure works.
To map (cgroup, netns) -> hook function, the natural approach would be
to have some kind of hash, and that would be slower.  This code is
intended to be *fast*.

I suppose you could have each cgroup structure contain an array where
each element is (netns, hook function) and just skip the ones that are
the wrong netns.  Perhaps it would be rare to have many of those.

>
>> So I think that leaves it being a cgroup feature.  And it definitely
>> does *not* play by the rules of cgroups right now.
>
> Because this in no way is a cgroup feature.  As you wrote above, it is
> something similar to iptables with lacking netns considerations.
> Let's address that and make it a proper network thing.

I think it's currently a broken cgroup feature.  I think it could be
made into a properly working cgroup feature (which I favor) or a
properly working net feature.  Can you articulate why you prefer
having it be a net feature instead of a cgroup feature?  I tend to
favor making it be a working cgroup feature (by giving it a file or
files in cgroupfs and maybe even a controller) because the result
would have very obvious semantics.

But maybe this is just a false dichotomy.  Could this feature be a
per-netns configuration where you can give instructions like "run this
hook if you're in such-and-such netns and in a given cgroup or one of
its descendents"?  This would prevent it from being a direct analogue
to systemd's RestrictAddressFamilies= option, but that may be okay.
This would side-step issues in the current code where a hook can't
rely on ifindex meaning what the hook thinks it means.

Actually, I think I like that approach.  What if it we had a "socket"
controller and files like "socket.create_socket", "socket.ingress" and
"socket.egress"?  You could use the bpf() syscall to install a bpf
filter into "socket.egress" and that filter would filter egress for
the target cgroup and its descendents on the current netns.  As a
first pass, the netns issue could be sidestepped by making it only
work in the root netns (i.e. the bpf() call would fail if you're not
in the root netns and the hooks wouldn't run if you're not in the root
netns.)  It *might* even be possible to retrofit in the socket
controller by saying that the default hierarchy is used if the socket
controller isn't mounted.

What *isn't* possible to cleanly fix after the fact is the current
semantics that cgroup hooks override the hooks in their ancestors.
IMO that is simply broken.  The example you gave (perf_event) is very
careful not to have this problem.

>
>> > I'm sure we'll
>> > eventually get there but from what I hear it isn't something we can
>> > pull off in a restricted timeframe.
>>
>> To me, this sounds like "the API isn't that great, we know how to do
>> better, but we really really want this feature ASAP so we want to ship
>> it with a sub-par API."  I think that's a bad idea.
>
> I see your point but this isn't something which is black and white.
> There are arguments for both directions.  I'm leaning the other
> direction because
>
> * My impression with bpf is that while delegation is something
>   theoretically possible it is not something which is gonna take place
>   in any immediate time frame.  If I'm wrong on this, please feel free
>   to correct me.

But the issue isn't *BPF* delegation.  It's cgroup delegation or netns
creation, both of which exist today.

>
> * There are a lo

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-18 Thread Tejun Heo

Helloo, Andy.

On Mon, Jan 16, 2017 at 09:18:36PM -0800, Andy Lutomirski wrote:
> Perhaps this is a net feature, though, not a cgroup feature.  This
> would IMO make a certain amount of sense.  Iptables cgroup matches,
> for example, logically are an iptables (i.e., net) feature.  The

Yeap.

> problem here is that network configuration (and this explicitly
> includes iptables) is done on a per-netns level, whereas these new
> hooks entirely ignore network namespaces.  I've pointed out that
> trying to enable them in a namespaced world is problematic (e.g.
> switching to ns_capable() will cause serious problems), and Alexei
> seems to think that this will never happen.  So I don't think we can
> really call this a net feature that works the way that other net
> features work.
> 
> (Suppose that this actually tried to be netns-enabled.  Then you'd
> have to map from (netns, cgroup) -> hook, and this would probably be
> slow and messy.)

I'm not sure how important netns support for these bpf programs
actually is but I have no objection to making it behave in an
equivalent manner to iptables where appropriate.  This is kinda
trivial to do too.  For now, we can simply not run the programs if the
associated socket belongs to non-root netns.  Later, if necessary, we
can add per-ns bpf programs.  And I can't think of a reason why
(netns, cgroup) -> function mapping would be slow and messy.  Why
would that be?

> So I think that leaves it being a cgroup feature.  And it definitely
> does *not* play by the rules of cgroups right now.

Because this in no way is a cgroup feature.  As you wrote above, it is
something similar to iptables with lacking netns considerations.
Let's address that and make it a proper network thing.

> > I'm sure we'll
> > eventually get there but from what I hear it isn't something we can
> > pull off in a restricted timeframe.
> 
> To me, this sounds like "the API isn't that great, we know how to do
> better, but we really really want this feature ASAP so we want to ship
> it with a sub-par API."  I think that's a bad idea.

I see your point but this isn't something which is black and white.
There are arguments for both directions.  I'm leaning the other
direction because

* My impression with bpf is that while delegation is something
  theoretically possible it is not something which is gonna take place
  in any immediate time frame.  If I'm wrong on this, please feel free
  to correct me.

* There are a lot of use cases which can immediately take advantage of
  cgroup-aware bpf programs operating as proposed.  The model is
  semantically equivalent to iptables (let's address the netns part if
  that's an issue) which net people are familiar with.

* It isn't exclusive with adding cgroup bpf controller down the road
  if necessary and practical.  Sure, this isn't the perfect situation
  but it seems like an acceptable trade-off to me.  What ever is
  perfect?

> > This also holds true for the perf controller.  While it is implemented
> > as a controller, it isn't visible to cgroup users in any way and the
> > only function it serves is providing the membership test to perf
> > subsystem.  perf is the one which decides whether and how it is to be
> > used.  cgroup providing membership test to other subsystems is
> > completely acceptable and established.
> 
> Unless I'm mistaken, "perf_event" is an actual cgroup controller, and

Yeah, it is implemented as a controller but in essence what it does is
just tagging the hierarchy to tell perf to use that hierarchy for
membership test purposes.

> it explicitly respects the cgroup hierarchy.  It shows up in
> /proc/cgroups, and I had no problem mounting a cgroupfs instance with
> perf_event enabled.  So I'm not sure what you mean.

That all it's doing is providing membership information.

Thanks.

-- 
tejun

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-18 Thread Tejun Heo

Hello, Michal.

On Tue, Jan 17, 2017 at 02:58:30PM +0100, Michal Hocko wrote:
> This would require using hierarchical cgroup iterators to iterate over

It does behave hierarchically.

> tasks. As per Andy's testing this doesn't seem to be the case. I haven't

That's not what Andy's testing showed.  What that showed was that
program in a child can override the one from its ancestor.

> checked the implementation closely but my understanding was that using
> only cgroup specific tasks was intentional.

Maybe I'm misreading you but if you're saying that a bpf program would
only execute for tasks on that specific cgroup, that is not true.

> I do agree that using hierarchy aware cgroup iterators is the right
> approach here and we wouldn't see any issue. But I am still not sure
> I've wrapped my head around this feature completely.

It's really simple.  You can install bpf programs with a cgroup path
associated with them.  When that particular type of bpf program needs
to be executed, the bpf program which is attached to the closest
ancestor (including self) is executed.

Thanks.

-- 
tejun

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-17 Thread Andy Lutomirski

On Tue, Jan 17, 2017 at 5:58 AM, Michal Hocko  wrote:
> On Tue 17-01-17 14:32:04, Peter Zijlstra wrote:
>> On Tue, Jan 17, 2017 at 02:03:03PM +0100, Michal Hocko wrote:
>> > On Sun 15-01-17 20:19:01, Tejun Heo wrote:
>> > [...]
>> > > So, what's proposed is a proper part of bpf.  In terms of
>> > > implementation, cgroup helps by hosting the pointers but that doesn't
>> > > necessarily affect the conceptual structure of it.  Given that, I
>> > > don't think it'd be a good idea to add anything to cgroup interface
>> > > for this feature.  Introspection is great to have but this should be
>> > > introspectable together with other bpf programs using the same
>> > > mechanism.  That's where it belongs.
>> >
>> > If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we
>> > at least enforce that the cgroup has to be a leaf one and no further
>> > children groups can be created once there is BPF program attached?
>>
>> Why (again) this stupid constraint?
>>
>> If you want to use cgroups for tagging (like perf does), _any_ parent
>> cgroup will also tag you.
>>
>> So creating child cgroups, and placing tasks in it, should not be a
>> problem, the BPF thing should apply to all of them.
>
> This would require using hierarchical cgroup iterators to iterate over
> tasks. As per Andy's testing this doesn't seem to be the case. I haven't
> checked the implementation closely but my understanding was that using
> only cgroup specific tasks was intentional.

The current semantics are AFAIK that only the innermost cgroup that
has a hook installed is in effect.  I think this is the wrong design.

I think that the right semantics are probably to support both
innermost-to-outermost and outermost-to-innermost and to select which
is appropriate for each hook.  Suppose we have a cgroup /a/b where a
and b both have hooks installed.  If the hook is a socket creation or
egress hook, I think that b's hook should run first.  If b's hook
rejects, then a's hook is not run.  If b's hook accepts, then a's hook
is run.  This way a gets the last word on any changes to the socket
settings and a sees exactly what would happen if it were to accept.

Conversely, for ingress hooks, I think that a's hook should run first.
This way a sees the packet as it originally came in and can modify or
reject it, and then b only sees whatever a chooses to let through.

The guiding principle here is that, for actions that originate outside
the machine, the outer hooks should IMO run first and, for actions
that originate from a task in a cgroup, the innermost hooks should run
first.

--Andy

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-17 Thread Michal Hocko

On Tue 17-01-17 14:32:04, Peter Zijlstra wrote:
> On Tue, Jan 17, 2017 at 02:03:03PM +0100, Michal Hocko wrote:
> > On Sun 15-01-17 20:19:01, Tejun Heo wrote:
> > [...]
> > > So, what's proposed is a proper part of bpf.  In terms of
> > > implementation, cgroup helps by hosting the pointers but that doesn't
> > > necessarily affect the conceptual structure of it.  Given that, I
> > > don't think it'd be a good idea to add anything to cgroup interface
> > > for this feature.  Introspection is great to have but this should be
> > > introspectable together with other bpf programs using the same
> > > mechanism.  That's where it belongs.
> > 
> > If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we
> > at least enforce that the cgroup has to be a leaf one and no further
> > children groups can be created once there is BPF program attached?
> 
> Why (again) this stupid constraint?
> 
> If you want to use cgroups for tagging (like perf does), _any_ parent
> cgroup will also tag you.
> 
> So creating child cgroups, and placing tasks in it, should not be a
> problem, the BPF thing should apply to all of them.

This would require using hierarchical cgroup iterators to iterate over
tasks. As per Andy's testing this doesn't seem to be the case. I haven't
checked the implementation closely but my understanding was that using
only cgroup specific tasks was intentional.

I do agree that using hierarchy aware cgroup iterators is the right
approach here and we wouldn't see any issue. But I am still not sure
I've wrapped my head around this feature completely.

-- 
Michal Hocko
SUSE Labs

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-17 Thread Peter Zijlstra

On Tue, Jan 17, 2017 at 02:03:03PM +0100, Michal Hocko wrote:
> On Sun 15-01-17 20:19:01, Tejun Heo wrote:
> [...]
> > So, what's proposed is a proper part of bpf.  In terms of
> > implementation, cgroup helps by hosting the pointers but that doesn't
> > necessarily affect the conceptual structure of it.  Given that, I
> > don't think it'd be a good idea to add anything to cgroup interface
> > for this feature.  Introspection is great to have but this should be
> > introspectable together with other bpf programs using the same
> > mechanism.  That's where it belongs.
> 
> If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we
> at least enforce that the cgroup has to be a leaf one and no further
> children groups can be created once there is BPF program attached?

Why (again) this stupid constraint?

If you want to use cgroups for tagging (like perf does), _any_ parent
cgroup will also tag you.

So creating child cgroups, and placing tasks in it, should not be a
problem, the BPF thing should apply to all of them.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-17 Thread Michal Hocko

On Sun 15-01-17 20:19:01, Tejun Heo wrote:
[...]
> So, what's proposed is a proper part of bpf.  In terms of
> implementation, cgroup helps by hosting the pointers but that doesn't
> necessarily affect the conceptual structure of it.  Given that, I
> don't think it'd be a good idea to add anything to cgroup interface
> for this feature.  Introspection is great to have but this should be
> introspectable together with other bpf programs using the same
> mechanism.  That's where it belongs.

If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we
at least enforce that the cgroup has to be a leaf one and no further
children groups can be created once there is BPF program attached?
This should break the existing usecases AFAIU and it would allow future
changes without major API surprises.

-- 
Michal Hocko
SUSE Labs

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-15 Thread Tejun Heo

Hello,

Sorry about the delay.  Some fire fighthing followed the holidays.

On Tue, Jan 03, 2017 at 11:25:59AM +0100, Michal Hocko wrote:
> > So from what I understand the proposed cgroup is not in fact
> > hierarchical at all.
> > 
> > @TJ, I thought you were enforcing all new cgroups to be properly
> > hierarchical, that would very much include this one.
> 
> I would be interested in that as well. We have made that mistake in
> memcg v1 where hierarchy could be disabled for performance reasons and
> that turned out to be major PITA in the end. Why do we want to repeat
> the same mistake here?

Across the different threads on this subject, there have been multiple
explanations but I'll try to sum it up more clearly.

The big issue here is whether this is a cgroup thing or a bpf thing.
I don't think there's anything inherently wrong with one approach or
the other.  Forget about the proposed cgroup bpf extentions but thinkg
about how iptables does cgroups.  Whether it's the netcls/netprio in
v1 or direct membership matching in v2, it is the network side testing
for cgroup membership one way or the other.  The only part where
cgroup is involved in is answering that test.

This also holds true for the perf controller.  While it is implemented
as a controller, it isn't visible to cgroup users in any way and the
only function it serves is providing the membership test to perf
subsystem.  perf is the one which decides whether and how it is to be
used.  cgroup providing membership test to other subsystems is
completely acceptable and established.

Now coming back to bpf, the current implementation is just that.
Sure, cgroup hosts the rules in its data structures but that isn't
something conceptually relevant.  We might as well implement it as a
prefixed hash table from bpf side.  Having pointers in struct cgroup
is just a more efficient and easier way of achieving the same result.
In fact, IIUC, this whole thing was born out of discussions around
implementing scalable cgroup membership matching from bpf programs.

So, what's proposed is a proper part of bpf.  In terms of
implementation, cgroup helps by hosting the pointers but that doesn't
necessarily affect the conceptual structure of it.  Given that, I
don't think it'd be a good idea to add anything to cgroup interface
for this feature.  Introspection is great to have but this should be
introspectable together with other bpf programs using the same
mechanism.  That's where it belongs.

None of the issues that people have been raising here is actually an
issue if one thinks of it as a part of bpf.  Its security model is
exactly the same as any other bpf programs.  Recursive behavior is
exactly the same as how other external cgroup descendant membership
testing work.  There is no issue here whatsoever.

Now, I'm not claiming that a bpf mechanism which is a proper part of
cgrou isn't attractive.  It is, especially with delegation; however,
that is also where we don't quite know how to proceed.  This doesn't
have much to do with cgroup.  If something is delegatable to non-priv
users and scoped, cgroup's fine with it and if that's not possible it
simply isn't something which is delegatable and putting it on cgroup
doesn't change that.

I'm far from being a bpf expert, so I could be wrong here, but I don't
think there's anything fundamental which prevents bpf from being
delegatable but at the same time bpf is something which is extremely
flexible and nobody really thought about or worked that much on
delegating bpf. If there's enough need for it, I'm sure we'll
eventually get there but from what I hear it isn't something we can
pull off in a restricted timeframe.

There's nothing which makes the currently implemented mechanism
exclusive with a cgroup controller based one.  The hooks are the
expensive part but can be shared, the rest is just about which
programs to execute in what order and how they should be chained.

There are a lot of immediate use cases which can benefit from the
proposed cgroup bpf mechanism and they're all fine with it being a
part of bpf and behaving like any other network mechanism behaves in
terms of configuration and delegation.  I don't see a reason why we
would hold back on merging this.  All the raised issues are coming
from confusing this as a part of cgroup.  It isn't.  It is a part of
bpf.  If we want a bpf cgroup controller, great, but that is a
separate thing.

Thanks.

-- 
tejun

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-03 Thread Michal Hocko

On Tue 20-12-16 10:11:50, Peter Zijlstra wrote:
> On Mon, Dec 19, 2016 at 05:56:24PM -0800, Andy Lutomirski wrote:
> > >> Huh?  My example in the original email attaches a program in a
> > >> sub-hierarchy.  Are you saying that 4.11 could make that example stop
> > >> working?
> > >
> > > Are you suggesting sub-cgroups should not be allowed to override the 
> > > filter of a parent cgroup?
> > 
> > Yes, exactly.  I think there are two sensible behaviors:
> > 
> > a) sub-cgroups cannot have a filter at all of the parent has a filter.
> > (This is the "punt" approach -- it lets different semantics be
> > assigned later without breaking userspace.)
> > 
> > b) sub-cgroups can have a filter if a parent does, too.  The semantics
> > are that the sub-cgroup filter runs first and all side-effects occur.
> > If that filter says "reject" then ancestor filters are skipped.  If
> > that filter says "accept", then the ancestor filter is run and its
> > side-effects happen as well.  (And so on, all the way up to the root.)
> 
> So from what I understand the proposed cgroup is not in fact
> hierarchical at all.
> 
> @TJ, I thought you were enforcing all new cgroups to be properly
> hierarchical, that would very much include this one.

I would be interested in that as well. We have made that mistake in
memcg v1 where hierarchy could be disabled for performance reasons and
that turned out to be major PITA in the end. Why do we want to repeat
the same mistake here?
-- 
Michal Hocko
SUSE Labs

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-20 Thread Alexei Starovoitov

On Tue, Dec 20, 2016 at 10:49:25AM -0800, Andy Lutomirski wrote:
> >> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
> >> It doesn't make a semantic difference, except that I dislike
> >> BPF_PROG_DETACH because that particular command isn't BPF-specific at
> >> all.
> >
> > Well, I think it is; it pops the bpf program from a target and drops the
> > reference on it. It's not much code, but it's certainly bpf-specific.
> 
> I mean the interface isn't bpf-specific.  If there was something that
> wasn't bpf attached to the target, you'd still want an API to detach
> it.

This discussion won't go anywhere while you keep thinking that this api
has to be generalized. As I explained several times earlier
BPF_CGROUP_INET_SOCK_CREATE hook is bpf specific. There is nothing
in the kernel that can take advantage of it today, so by definition
the hook is bpf specific. Period. Saying that something in the future
may come along that would want to use that is like saying I want
to design the generic steering wheel for any car that will ever need it.

Hence if you want to change 'target_fd' in BPF_PROG_ATTACH/DETACH cmds
from being fd of open("cgroupdir") to fd of open("cgroupdir/cgroup.bpf")
file inside it then I'm ok with that.
All other proposals with non-extensible ioctls() and crazy text based
per-hook permissions is nack.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-20 Thread Andy Lutomirski

On Tue, Dec 20, 2016 at 10:36 AM, Daniel Mack  wrote:
> Hi,
>
> On 12/20/2016 06:23 PM, Andy Lutomirski wrote:
>> On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack  wrote:
>
>> To clarify, since this thread has gotten excessively long and twisted,
>> I think it's important that, for hooks attached to a cgroup, you be
>> able to tell in a generic way whether something is plugged into the
>> hook.  The natural way to see a cgroup's configuration is to read from
>> cgroupfs, so I think that reading from cgroupfs should show you that a
>> BPF program is attached and also give enough information that, once
>> bpf programs become dumpable, you can dump the program (using the
>> bpf() syscall or whatever).
>
> [...]
>
>> There isn't a big semantic difference between
>> 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(...,
>> CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file",
>> O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'.  There is, however, a semantic
>> difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY)
>> because the permission check is much weaker.
>
> Okay, if you have such a control file, you can of course do something
> like that. When we discussed things back then with Tejun however, we
> concluded that a controller that is not completely controllable through
> control knobs that can be written and read via cat is meaningless.
> That's why this has become a 'hidden' cgroup feature.
>
> With your proposed API, you'd first go to the bpf(2) syscall in order to
> get a prog fd, and then come back to some sort of cgroup API to put the
> fd in there. That's quite a mix and match, which is why we considered
> the API cleaner in its current form, as everything that is related to
> bpf is encapsulated behind a single syscall.

You already have to do bpf() to get a prog fd, then open() to get a
cgroup fd, then bpf() or ioctl() to attach, so this isn't much
different, and its exactly the same number of syscalls.

>
>> My preference would be to do an ioctl on a new
>> /cgroup/NAME/network_hooks.inet_ingress file.  Reading that file tells
>> you whether something is attached and hopefully also gives enough
>> information (a hash of the BPF program, perhaps) to dump the actual
>> program using future bpf() interfaces.  write() and ioctl() can be
>> used to configure it as appropriate.
>
> So am I reading this right? You're proposing to add ioctl() hooks to
> kernfs/cgroupfs? That would open more possibilities of course, but I'm
> not sure where that rabbit hole leads us eventually.

Indeed.  I already have a test patch to add ioctl() to kernfs.  Adding
it to cgroupfs shouldn't be much more complicated.

>
>> Another option that I like less would be to have a
>> /cgroup/NAME/cgroup.bpf that lists all the active hooks along with
>> their contents.  You would do an ioctl() on that to program a hook and
>> you could read it to see what's there.
>
> Yes, read() could, in theory, give you similar information than ioctl(),
> but in human-readable form.
>
>> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
>> It doesn't make a semantic difference, except that I dislike
>> BPF_PROG_DETACH because that particular command isn't BPF-specific at
>> all.
>
> Well, I think it is; it pops the bpf program from a target and drops the
> reference on it. It's not much code, but it's certainly bpf-specific.

I mean the interface isn't bpf-specific.  If there was something that
wasn't bpf attached to the target, you'd still want an API to detach
it.

>
 So if I set up a cgroup that's monitored and call it /cgroup/a and
 enable delegation and if the program running there wants to do its own
 monitoring in /cgroup/a/b (via delegation), then you really want the
 outer monitor to silently drop events coming from /cgroup/a/b?
>>>
>>> That's a fair point, and we've discussed it as well. The issue is, as
>>> Alexei already pointed out, that we do not want to traverse the tree up
>>> to the root for nested cgroups due to the runtime costs in the
>>> networking fast-path. After all, we're running the bpf program for each
>>> packet in flight. Hence, we opted for the approach to only look at the
>>> leaf node for now, with the ability to open it up further in the future
>>> using flags during attach etc.
>>
>> Careful here!  You don't look only at the leaf node for now.  You do a
>> fancy traversal and choose the nearest node that has a hook set up.
>
> But we do the 'complex' operation at attach time or when a cgroup is
> created, both of which are slow-path operations. In the fast-path, we
> only look at the leaf, which may or may not have an effective program
> installed. And that's of course much cheaper then doing the traversing
> for each packet.

You would never traverse the full hierarchy for each packet.  You'd
have a linked list of programs that are attached, kind of like how
there's an "effective" array right now.  I sent out pseudocode earlier
in the thread.

>
>> mkdir

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-20 Thread Daniel Mack

Hi,

On 12/20/2016 06:23 PM, Andy Lutomirski wrote:
> On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack  wrote:

> To clarify, since this thread has gotten excessively long and twisted,
> I think it's important that, for hooks attached to a cgroup, you be
> able to tell in a generic way whether something is plugged into the
> hook.  The natural way to see a cgroup's configuration is to read from
> cgroupfs, so I think that reading from cgroupfs should show you that a
> BPF program is attached and also give enough information that, once
> bpf programs become dumpable, you can dump the program (using the
> bpf() syscall or whatever).

[...]

> There isn't a big semantic difference between
> 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(...,
> CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file",
> O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'.  There is, however, a semantic
> difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY)
> because the permission check is much weaker.

Okay, if you have such a control file, you can of course do something
like that. When we discussed things back then with Tejun however, we
concluded that a controller that is not completely controllable through
control knobs that can be written and read via cat is meaningless.
That's why this has become a 'hidden' cgroup feature.

With your proposed API, you'd first go to the bpf(2) syscall in order to
get a prog fd, and then come back to some sort of cgroup API to put the
fd in there. That's quite a mix and match, which is why we considered
the API cleaner in its current form, as everything that is related to
bpf is encapsulated behind a single syscall.

> My preference would be to do an ioctl on a new
> /cgroup/NAME/network_hooks.inet_ingress file.  Reading that file tells
> you whether something is attached and hopefully also gives enough
> information (a hash of the BPF program, perhaps) to dump the actual
> program using future bpf() interfaces.  write() and ioctl() can be
> used to configure it as appropriate.

So am I reading this right? You're proposing to add ioctl() hooks to
kernfs/cgroupfs? That would open more possibilities of course, but I'm
not sure where that rabbit hole leads us eventually.

> Another option that I like less would be to have a
> /cgroup/NAME/cgroup.bpf that lists all the active hooks along with
> their contents.  You would do an ioctl() on that to program a hook and
> you could read it to see what's there.

Yes, read() could, in theory, give you similar information than ioctl(),
but in human-readable form.

> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
> It doesn't make a semantic difference, except that I dislike
> BPF_PROG_DETACH because that particular command isn't BPF-specific at
> all.

Well, I think it is; it pops the bpf program from a target and drops the
reference on it. It's not much code, but it's certainly bpf-specific.

>>> So if I set up a cgroup that's monitored and call it /cgroup/a and
>>> enable delegation and if the program running there wants to do its own
>>> monitoring in /cgroup/a/b (via delegation), then you really want the
>>> outer monitor to silently drop events coming from /cgroup/a/b?
>>
>> That's a fair point, and we've discussed it as well. The issue is, as
>> Alexei already pointed out, that we do not want to traverse the tree up
>> to the root for nested cgroups due to the runtime costs in the
>> networking fast-path. After all, we're running the bpf program for each
>> packet in flight. Hence, we opted for the approach to only look at the
>> leaf node for now, with the ability to open it up further in the future
>> using flags during attach etc.
> 
> Careful here!  You don't look only at the leaf node for now.  You do a
> fancy traversal and choose the nearest node that has a hook set up.

But we do the 'complex' operation at attach time or when a cgroup is
created, both of which are slow-path operations. In the fast-path, we
only look at the leaf, which may or may not have an effective program
installed. And that's of course much cheaper then doing the traversing
for each packet.

> mkdir /cgroup/foo
> BPF_PROG_ATTACH(some program to foo)
> mkdir /cgroup/foo/bar
> chown -R some_user /cgroup/foo/bar
> 
> If the kernel only looked at the leaf, then the program that did the
> above would not expect that the program would constrain
> /cgroup/foo/bar's activity.  But, as it stands, the program *would*
> expect /cgroup/foo/bar to be constrained, except that, whenever the
> capable() check changes to ns_capable() (which will happen eventually
> one way or another), then the bad guy can create /cgroup/foo/bar/baz,
> install a new no-op hook there, and break the security assumption.
> 
> IOW, I think that totally non-recursive hooks are okay from a security
> perspective, albeit rather strange, but the current design is not okay
> from a security perspective.

We locked down the ability to override any of these programs with

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-20 Thread Andy Lutomirski

On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack  wrote:
> Hi,
>
> On 12/20/2016 04:50 AM, Andy Lutomirski wrote:
>> You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)?  There is nothing bpf
>> specfic about the hook except that the name of this macro has "BPF" in
>> it.  There is nothing whatsoever that's bpf-specific about the context
>> -- sk is not bpf-specific at all.
>>
>> The only thing bpf-specific about it is that it currently only invokes
>> bpf programs.  That could easily change.
>
> I'm not sure if I follow. The code as it currently stands only supports
> attaching bpf programs to cgroups which have been created using
> BPF_PROG_LOAD. If cgroups would support other program types in the
> future, then they would need to be stored in different data types
> anyway, and the bpf syscall multiplexer would be the wrong entry point
> to access them anyway.

To clarify, since this thread has gotten excessively long and twisted,
I think it's important that, for hooks attached to a cgroup, you be
able to tell in a generic way whether something is plugged into the
hook.  The natural way to see a cgroup's configuration is to read from
cgroupfs, so I think that reading from cgroupfs should show you that a
BPF program is attached and also give enough information that, once
bpf programs become dumpable, you can dump the program (using the
bpf() syscall or whatever).

Obviously the interface to *attach* a BPF program to a hook will need
to be at least a little bit BPF-specific.  But there's nothing
particularly BPF-specific about detaching, and if a control file were
to exist, writing "detach" or "none" to it seems natural.

>
> Whether we add bpf-specific code to the cgroup file parsers or
> cgroup-specific code to the bpf layer does not make much of a semantic
> difference, does it? As a matter of fact, my very first implementation
> of this patch set implemented a cgroup controller that would allow
> writing strings like "ing
>
> b) make it possible to extend the functionality in the future by adding
> flags to the command struct etc.
>
> And I hoped we achieved that after discussing it for so long.
>
>> How about slowing down a wee bit and trying to come up with cgroup
>> hook semantics that work for all of these use cases?
>
> I'm all for discussing things, but I don't this was done in a rush.
>
> I do agree though that adding functionality to cgroups that is not
> limited to resource control is a delicate thing to do, which is why I
> cc'ed cgroups@ in my patches. I should have also added linux-api@ I
> guess, sorry I missed that.
>ress 5" to its control file, where 5 is the fd
> number that came out of BPF_PROG_LOAD. The main reason we decided to
> ditch that was that echoing fd numbers into a text file seemed way worse
> than going through a proper syscall layer with it, and ioctls are
> unavailable on pseudo-fs.

There isn't a big semantic difference between
'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(...,
CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file",
O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'.  There is, however, a semantic
difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY)
because the permission check is much weaker.

The reason I suggest ioctl() and not write() is that write() MUST NOT
make its behavior depend on the caller's credentials, file table, etc.
Imagine what would happen if you did 'sudo -u eviltext
>/cgroup/NAME/control.file'.  (This particular mistake has been
repeated many times in the kernel, in drivers, networking, namespaces,
core code, etc, and it's resulted in a big pile of privilege
escalation bugs.)  So write("bpf:") is safe
(but unusably awkward, I think), whereas write("bpf:fd 5") is unsafe.

>
> The idea was rather to allow attaching bpf programs to other things than
> just cgroups as well, which is why we called the member of 'union
> bpf_attr' 'target_fd', and a cgroup is just one type a target here.

I would make that a separate operation.  If someone adds the ability
to attach an ebpf program to, say, seccomp (I'm quite sure this will
happen eventually), it should be attached using seccomp(), not bpf(),
for example).  The people writing seccomp filters will thank you for
making the syscall in question reflect what object (the cgroup, for
example) is being modified.

>
>>> i'm assuming 'baadf00d' is bpf program fd expressed a text string?
>>> and kernel needs to parse above? will you allow capital and lower
>>> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
>>> can program fd expressed as decimal or hex or both?
>>> how do you return the error? as a text string for user space
>>> to parse?
>>
>> No.  The kernel does not parse it because you cannot write this to the
>> file.  You set a bpf filter with ioctl and pass an fd.
>
> An ioctl on what file, exactly?

There are at least two plausible models.

My preference would be to do an ioctl on a new
/cgroup/NAME/network_hooks.inet_ingress file.  Reading that file tell

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-20 Thread Daniel Mack

Hi,

On 12/20/2016 04:50 AM, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov
>  wrote:
>> On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
>>> I think we're still talking past each other.  A big part of the point
>>> of changing it is that none of this is specific to bpf.  You could (in
>>
>> the hooks and context passed into the program is very much bpf specific.
>> That's what I've been trying to convey all along.
> 
> You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)?  There is nothing bpf
> specfic about the hook except that the name of this macro has "BPF" in
> it.  There is nothing whatsoever that's bpf-specific about the context
> -- sk is not bpf-specific at all.
> 
> The only thing bpf-specific about it is that it currently only invokes
> bpf programs.  That could easily change.

I'm not sure if I follow. The code as it currently stands only supports
attaching bpf programs to cgroups which have been created using
BPF_PROG_LOAD. If cgroups would support other program types in the
future, then they would need to be stored in different data types
anyway, and the bpf syscall multiplexer would be the wrong entry point
to access them anyway.

Whether we add bpf-specific code to the cgroup file parsers or
cgroup-specific code to the bpf layer does not make much of a semantic
difference, does it? As a matter of fact, my very first implementation
of this patch set implemented a cgroup controller that would allow
writing strings like "ingress 5" to its control file, where 5 is the fd
number that came out of BPF_PROG_LOAD. The main reason we decided to
ditch that was that echoing fd numbers into a text file seemed way worse
than going through a proper syscall layer with it, and ioctls are
unavailable on pseudo-fs.

The idea was rather to allow attaching bpf programs to other things than
just cgroups as well, which is why we called the member of 'union
bpf_attr' 'target_fd', and a cgroup is just one type a target here.

>> i'm assuming 'baadf00d' is bpf program fd expressed a text string?
>> and kernel needs to parse above? will you allow capital and lower
>> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
>> can program fd expressed as decimal or hex or both?
>> how do you return the error? as a text string for user space
>> to parse?
> 
> No.  The kernel does not parse it because you cannot write this to the
> file.  You set a bpf filter with ioctl and pass an fd.

An ioctl on what file, exactly?

> If you *read*
> the file, you get the same bpf program hash that fdinfo on the bpf
> object would show -- this is for debugging and (eventually) CRIU.

We need a debugging facility at some point, I agree to that. As the code
currently stands, that would rather need to go into the bpf(2) syscall
though, as setting a program through bpf(2) and reading it through
cgroupfs is really nasty.

>> so you're proposing to add a bunch of hard coded logic to the kernel.
>> First to parse such text into some sort of syntax tree or list/set
>> and then have hard coded logic specifically for these two use cases?
>> While above two can be implemented as trivial bpf programs already?!
>> That goes 180% degree vs bpf philosophy. bpf is about moving
>> the specific code out of the kernel and keeping kernel generic that
>> it can solve as many use cases as possible by being programmable.
> 
> I'm not seriously proposing implementing these.  My point is that
> *bpf*, while wonderful, is not the be-all-and-end-all of kernel
> configurability, and other types of hooks might want to be hooked in
> here.

Sure, but nobody claimed it to be that be-all-and-end-all thing. It's
just one thing that a cgroup is now able to accommodate, and because
that new feature is specific to bpf, we decided to hook up the uapi to
the bpf syscall.

> So if I set up a cgroup that's monitored and call it /cgroup/a and
> enable delegation and if the program running there wants to do its own
> monitoring in /cgroup/a/b (via delegation), then you really want the
> outer monitor to silently drop events coming from /cgroup/a/b?

That's a fair point, and we've discussed it as well. The issue is, as
Alexei already pointed out, that we do not want to traverse the tree up
to the root for nested cgroups due to the runtime costs in the
networking fast-path. After all, we're running the bpf program for each
packet in flight. Hence, we opted for the approach to only look at the
leaf node for now, with the ability to open it up further in the future
using flags during attach etc.

> The current approach to bpf hooks will bite you down the road.  David
> Ahern is already proposing using it for something that is not tracing
> at all, and someone will want that in a container, and there will be a
> problem.

Hmm, I thought we've sorted out the concerns about that by making sure
that we

a) lock-down the API sufficiently so it doesn't cause any security
issues in its current form, and

b) make it possible to extend t

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-20 Thread Peter Zijlstra

On Mon, Dec 19, 2016 at 05:56:24PM -0800, Andy Lutomirski wrote:
> >> Huh?  My example in the original email attaches a program in a
> >> sub-hierarchy.  Are you saying that 4.11 could make that example stop
> >> working?
> >
> > Are you suggesting sub-cgroups should not be allowed to override the filter 
> > of a parent cgroup?
> 
> Yes, exactly.  I think there are two sensible behaviors:
> 
> a) sub-cgroups cannot have a filter at all of the parent has a filter.
> (This is the "punt" approach -- it lets different semantics be
> assigned later without breaking userspace.)
> 
> b) sub-cgroups can have a filter if a parent does, too.  The semantics
> are that the sub-cgroup filter runs first and all side-effects occur.
> If that filter says "reject" then ancestor filters are skipped.  If
> that filter says "accept", then the ancestor filter is run and its
> side-effects happen as well.  (And so on, all the way up to the root.)

So from what I understand the proposed cgroup is not in fact
hierarchical at all.

@TJ, I thought you were enforcing all new cgroups to be properly
hierarchical, that would very much include this one.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Alexei Starovoitov

On Mon, Dec 19, 2016 at 09:27:18PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov
>  wrote:
> > On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
> >>
> >> struct cgroup_bpf {
> >> /*
> >>  * Store two sets of bpf_prog pointers, one for programs that are
> >>  * pinned directly to this cgroup, and one for those that are 
> >> effective
> >>  * when this cgroup is accessed.
> >>  */
> >> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
> >> struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
> >> };
> >>
> >> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
> >>
> >> This would change to something like:
> >>
> >> struct cgroup_filter_slot {
> >>   struct bpf_prog *effective;
> >>   struct cgroup_filter_slot *next;
> >>   struct bpf_prog *local;
> >> }
> >>
> >> local is NULL unless *this* cgroup has a filter.  effective points to
> >> the bpf_prog that's active in this cgroup or the nearest ancestor that
> >> has a filter.  next is NULL if there are no filters higher in the
> >> chain or points to the next slot that has a filter.  struct cgroup
> >> has:
> >>
> >> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
> >>
> >> To evaluate it, you do:
> >>
> >> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
> >>
> >> if (!slot->effective)
> >>   return;
> >>
> >> do {
> >>   evaluate(slot->effective);
> >>   slot = slot->next;
> >> } while (unlikely(slot));
> >
> > yes. something like this can work as a future extension
> > to support multiple programs for security use case.
> > Please propose a patch.
> > Again, it's not needed today and there is no rush to implement it.
> >
> 
> If this happens after 4.10 and 4.10 is released as is, then this
> change would be an ABI break.

it won't break existing apps.
please study how bpf syscall was extended in the past without
breaking anything.
Same thing here. The default is one program per hook per cgroup.
Everything else is future extensions.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski

On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov
 wrote:
> On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
>>
>> struct cgroup_bpf {
>> /*
>>  * Store two sets of bpf_prog pointers, one for programs that are
>>  * pinned directly to this cgroup, and one for those that are 
>> effective
>>  * when this cgroup is accessed.
>>  */
>> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
>> struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
>> };
>>
>> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
>>
>> This would change to something like:
>>
>> struct cgroup_filter_slot {
>>   struct bpf_prog *effective;
>>   struct cgroup_filter_slot *next;
>>   struct bpf_prog *local;
>> }
>>
>> local is NULL unless *this* cgroup has a filter.  effective points to
>> the bpf_prog that's active in this cgroup or the nearest ancestor that
>> has a filter.  next is NULL if there are no filters higher in the
>> chain or points to the next slot that has a filter.  struct cgroup
>> has:
>>
>> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
>>
>> To evaluate it, you do:
>>
>> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
>>
>> if (!slot->effective)
>>   return;
>>
>> do {
>>   evaluate(slot->effective);
>>   slot = slot->next;
>> } while (unlikely(slot));
>
> yes. something like this can work as a future extension
> to support multiple programs for security use case.
> Please propose a patch.
> Again, it's not needed today and there is no rush to implement it.
>

If this happens after 4.10 and 4.10 is released as is, then this
change would be an ABI break.

--Andy

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski

On Mon, Dec 19, 2016 at 8:51 PM, Alexei Starovoitov
 wrote:
> On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote:
>>
>> By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
>> even take a reference to a BPF program as an argument.  What is it
>> supposed to do if this mechanism ever gets extended?
>
> we just add another field to that anonymous union just like
> we did for other commands and everything is backwards compatible.
> It's the basics of bpf syscall that we've been relying on for some
> time now and it worked just fine.

And what happens if you don't specify that member and two programs are attached?

--Andy

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Alexei Starovoitov

On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote:
> 
> By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
> even take a reference to a BPF program as an argument.  What is it
> supposed to do if this mechanism ever gets extended?

we just add another field to that anonymous union just like
we did for other commands and everything is backwards compatible.
It's the basics of bpf syscall that we've been relying on for some
time now and it worked just fine.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Alexei Starovoitov

On Mon, Dec 19, 2016 at 07:50:01PM -0800, Andy Lutomirski wrote:
> >>
> >> net.socket_create_filter = "none": no filter
> >> net.socket_create_filter = "bpf:baadf00d": bpf filter
> >
> > i'm assuming 'baadf00d' is bpf program fd expressed a text string?
> > and kernel needs to parse above? will you allow capital and lower
> > case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
> > can program fd expressed as decimal or hex or both?
> > how do you return the error? as a text string for user space
> > to parse?
> 
> No.  The kernel does not parse it because you cannot write this to the
> file.  You set a bpf filter with ioctl and pass an fd.  If you *read*
> the file, you get the same bpf program hash that fdinfo on the bpf
> object would show -- this is for debugging and (eventually) CRIU.

my understanding that cgroup is based on kernfs and both don't support
ioctl, so you'd need quite some hacking to introduce such concepts
and buy-in from a bunch of people first.

> >> net.socket_create_filter = "iptables:foobar": some iptables thingy
> >> net.socket_create_filter = "nft:blahblahblah": some nft thingy
> >
> > iptables/nft are not applicable to 'bpf_socket_create' which
> > looks into:
> > struct bpf_sock {
> > __u32 bound_dev_if;
> > __u32 family;
> > __u32 type;
> > __u32 protocol;
> > };
> > so don't fit as an example.
> 
> The code that takes a 'struct sock' and sets up bpf_sock is
> bpf-specific and would obviously not be used for non-bpf filter.  But
> if you had a filter that was just a like of address families, that
> filter would look at struct sock and do its own thing.  iptables
> wouldn't make sense for a socket creation filter, but it would make
> perfect sense for an ingress filter.  Obviously the bpf-specific state
> object wouldn't be used, but it would still be a hook, invoked from
> the same network code, guarded by the same static key, looking at the
> same skb.

I strongly suggest to go back and read my first reply where
I think I explained well enough that something like iptables
will not able to reuse the ioctl mechanism you're proposing here,
hook ids will be different, attachment mechanism will be different too.
So your proposed cgroup ioctl is already dead as a reusable interface.

> >> net.socket_create_filter = "disallow": no sockets created period
> >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 
> >> 3
> >
> > so you're proposing to add a bunch of hard coded logic to the kernel.
> > First to parse such text into some sort of syntax tree or list/set
> > and then have hard coded logic specifically for these two use cases?
> > While above two can be implemented as trivial bpf programs already?!
> > That goes 180% degree vs bpf philosophy. bpf is about moving
> > the specific code out of the kernel and keeping kernel generic that
> > it can solve as many use cases as possible by being programmable.
> 
> I'm not seriously proposing implementing these.  My point is that
> *bpf*, while wonderful, is not the be-all-and-end-all of kernel
> configurability, and other types of hooks might want to be hooked in
> here.

Then please let's talk about real use cases.
This daydreaming of some future llvm in the kernel that you were
bringing up during LPC doesn't help the discussion.
Just like these artificial examples.

> > ...
> >> What exactly isn't sensible about using cgroup_bpf for containers?
> >
> > my use case is close to zero overhead application network monitoring.
> 
> So if I set up a cgroup that's monitored and call it /cgroup/a and
> enable delegation and if the program running there wants to do its own
> monitoring in /cgroup/a/b (via delegation), then you really want the
> outer monitor to silently drop events coming from /cgroup/a/b?

yes. both are root and must talk to each other if they want
to co-exist. When root process is asking kernel to do X, this X has
to happen.

> Then disallow nesting.  You're welcome to not rush the decision, but
> that's not what you've done.  If 4.10 is released as is, you've made
> the decision and you're going to have a hard time changing it.

Nothing needs to be changed.

> > No. As was pointed out before only root can load 
> > BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
> > type programs and only root can attach them.
> 
> Why?  It really seems to me that you expect that future namespaceable
> bpf hooks will use a totally different API.  At KS, I sat in a room
> full of people arguing about cgroup v2 and a lot of them pointed out
> that there are valid, paying use cases that want to stick cgroup v1 in
> a container, because the code that uses cgroup v1 in the container is
> called systemd and the contained OS is called RHEL (or SuSE or Ubuntu
> or Gentoo or whatever), and that code is *already written* and needs
> to be contained.

bpf in general is not namespace aware. It's global and this cgroup scoping
of bpf programs is the first of this kind. Namespacing of bpf is comp

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Alexei Starovoitov

On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
> 
> struct cgroup_bpf {
> /*
>  * Store two sets of bpf_prog pointers, one for programs that are
>  * pinned directly to this cgroup, and one for those that are 
> effective
>  * when this cgroup is accessed.
>  */
> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
> struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
> };
> 
> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
> 
> This would change to something like:
> 
> struct cgroup_filter_slot {
>   struct bpf_prog *effective;
>   struct cgroup_filter_slot *next;
>   struct bpf_prog *local;
> }
> 
> local is NULL unless *this* cgroup has a filter.  effective points to
> the bpf_prog that's active in this cgroup or the nearest ancestor that
> has a filter.  next is NULL if there are no filters higher in the
> chain or points to the next slot that has a filter.  struct cgroup
> has:
> 
> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
> 
> To evaluate it, you do:
> 
> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
> 
> if (!slot->effective)
>   return;
> 
> do {
>   evaluate(slot->effective);
>   slot = slot->next;
> } while (unlikely(slot));

yes. something like this can work as a future extension
to support multiple programs for security use case.
Please propose a patch.
Again, it's not needed today and there is no rush to implement it.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski

On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov
 wrote:
> On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
>> I think we're still talking past each other.  A big part of the point
>> of changing it is that none of this is specific to bpf.  You could (in
>
> the hooks and context passed into the program is very much bpf specific.
> That's what I've been trying to convey all along.

You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)?  There is nothing bpf
specfic about the hook except that the name of this macro has "BPF" in
it.  There is nothing whatsoever that's bpf-specific about the context
-- sk is not bpf-specific at all.

The only thing bpf-specific about it is that it currently only invokes
bpf programs.  That could easily change.

>
>> theory -- I'm not proposing implementing these until there's demand)
>> have:
>>
>> net.socket_create_filter = "none": no filter
>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>
> i'm assuming 'baadf00d' is bpf program fd expressed a text string?
> and kernel needs to parse above? will you allow capital and lower
> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
> can program fd expressed as decimal or hex or both?
> how do you return the error? as a text string for user space
> to parse?

No.  The kernel does not parse it because you cannot write this to the
file.  You set a bpf filter with ioctl and pass an fd.  If you *read*
the file, you get the same bpf program hash that fdinfo on the bpf
object would show -- this is for debugging and (eventually) CRIU.

>
>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>
> iptables/nft are not applicable to 'bpf_socket_create' which
> looks into:
> struct bpf_sock {
> __u32 bound_dev_if;
> __u32 family;
> __u32 type;
> __u32 protocol;
> };
> so don't fit as an example.

The code that takes a 'struct sock' and sets up bpf_sock is
bpf-specific and would obviously not be used for non-bpf filter.  But
if you had a filter that was just a like of address families, that
filter would look at struct sock and do its own thing.  iptables
wouldn't make sense for a socket creation filter, but it would make
perfect sense for an ingress filter.  Obviously the bpf-specific state
object wouldn't be used, but it would still be a hook, invoked from
the same network code, guarded by the same static key, looking at the
same skb.

>
>> net.socket_create_filter = "disallow": no sockets created period
>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>
> so you're proposing to add a bunch of hard coded logic to the kernel.
> First to parse such text into some sort of syntax tree or list/set
> and then have hard coded logic specifically for these two use cases?
> While above two can be implemented as trivial bpf programs already?!
> That goes 180% degree vs bpf philosophy. bpf is about moving
> the specific code out of the kernel and keeping kernel generic that
> it can solve as many use cases as possible by being programmable.

I'm not seriously proposing implementing these.  My point is that
*bpf*, while wonderful, is not the be-all-and-end-all of kernel
configurability, and other types of hooks might want to be hooked in
here.

> ...
>> What exactly isn't sensible about using cgroup_bpf for containers?
>
> my use case is close to zero overhead application network monitoring.

So if I set up a cgroup that's monitored and call it /cgroup/a and
enable delegation and if the program running there wants to do its own
monitoring in /cgroup/a/b (via delegation), then you really want the
outer monitor to silently drop events coming from /cgroup/a/b?

>
>> >> > As you're pointing out, in case of security, we probably
>> >> > want to preserve original bpf program that should always be
>> >> > run first and only after it returned 'ok' (we'd need to define
>> >> > what 'ok' means in secruity context) run program attached to 
>> >> > sub-hierarchy.
>> >>
>> >> It's already defined AFAICT.  1 means okay.  0 means not okay.
>> >
>> > sorry that doesn't make any sense. For seccomp we have a set of
>> > ranges that mean different things. Here you're proposing to
>> > hastily assign 1 and 0 ? How is that extensible?
>> > We need to carefully think through what should be the semantics
>> > of attaching multiple programs, consider performance implications,
>> > return codes and so on.
>>
>> You already assigned it.  The return value of the bpf program, loaded
>> in Linus' tree today, tells the kernel whether to accept or reject.
>
> yes. that's what the program tells the hook.
> I'm saying that whenever we have a link list of the programs
> interaction between them may or may not be expressible with 1/0
> and I don't want to rush such decision.

Then disallow nesting.  You're welcome to not rush the decision, but
that's not what you've done.  If 4.10 is released as is, you've made

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Alexei Starovoitov

On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
>  wrote:
> > On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
> >> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
> >>  wrote:
> >> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> >> >> Hi all-
> >> >>
> >> >> I apologize for being rather late with this.  I didn't realize that
> >> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> >> >> it on the linux-api list, so I missed the discussion.
> >> >>
> >> >> I think that the inet ingress, egress etc filters are a neat feature,
> >> >> but I think the API has some issues that will bite us down the road
> >> >> if it becomes stable in its current form.
> >> >>
> >> >> Most of the problems I see are summarized in this transcript:
> >> >>
> >> >> # mkdir cg2
> >> >> # mount -t cgroup2 none cg2
> >> >> # mkdir cg2/nosockets
> >> >> # strace cgrp_socket_rule cg2/nosockets/ 0
> >> >> ...
> >> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> >> >>
> >> >>  You can modify a cgroup after opening it O_RDONLY?
> >> >>
> >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> >> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> >> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
> >> >>
> >> >>  This is fine.  The bpf() syscall manipulates bpf objects.
> >> >>
> >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> >> >>
> >> >>  This is not so good:
> >> >> 
> >> >>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
> >> >> is manipulating a cgroup.  There's no reason that a socket 
> >> >> creation
> >> >> filter couldn't be written in a different language (new iptables
> >> >> table?  Simple list of address families?), but if that happened,
> >> >> then using bpf() to install it would be entirely nonsensical.
> >> >
> >> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
> >> > network stack is using cgroup as an application group that should
> >> > invoke bpf program at the certain point in the stack.
> >> > imo cgroup management is orthogonal.
> >>
> >> It is literally modifying the struct cgroup, and, as a practical
> >> matter, it's causing membership in the cgroup to have a certain
> >> effect.  But rather than pointless arguing, let me propose an
> >> alternative API that I think solves most of the problems here.
> >>
> >> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
> >> Instead, the cgroup gets three new control files:
> >> "net.ingress_filter", "net.egress_filter", and
> >> "net.socket_create_filter".  Initially, if you read these files, you
> >> see "none\n".
> >>
> >> To attach a bpf filter, you open the file for write and do an ioctl on
> >> it.  After doing the ioctl, if you read the file, you'll see
> >> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
> >> the bpf program.
> >>
> >> To detach any type of filter, bpf or otherwise, you open the file for
> >> write and write "none\n" (or just "none").
> >>
> >> If you write anything else to the file, you get -EINVAL.  But, if
> >> someone writes a new type of filter (perhaps a simple list of address
> >> families), maybe you can enable the filter by writing something
> >> appropriate to the file.
> >
> > I see no difference in what you're proposing vs what is already implemented
> > from feature set point of view, but the file approach is very ugly, since
> > it's a mismatch to FD style access that bpf is using everywhere.
> > In your proposal you'd also need to add bpf prefix everywhere.
> > So the control file names should be bpf_inet_ingress, bpf_inet_egress
> > and bpf_socket_create.
> 
> I think we're still talking past each other.  A big part of the point
> of changing it is that none of this is specific to bpf.  You could (in

the hooks and context passed into the program is very much bpf specific.
That's what I've been trying to convey all along.

> theory -- I'm not proposing implementing these until there's demand)
> have:
> 
> net.socket_create_filter = "none": no filter
> net.socket_create_filter = "bpf:baadf00d": bpf filter

i'm assuming 'baadf00d' is bpf program fd expressed a text string?
and kernel needs to parse above? will you allow capital and lower
case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
can program fd expressed as decimal or hex or both?
how do you return the error? as a text string for user space
to parse?

> net.socket_create_filter = "iptables:foobar": some iptables thingy
> net.socket_create_filter = "nft:blahblahblah": some nft thingy

iptables/nft are not applicable to 'bpf_socket_create' which
looks into:
struct bpf_sock {
__u32 bound_dev_if;
__u32 family;
__u32 type;
__u32 protocol;
};
so don't fit as an example.

> net.socket_create_filter = "disallow": no so

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski

On Mon, Dec 19, 2016 at 6:52 PM, David Ahern  wrote:
> On 12/19/16 6:56 PM, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern  wrote:
>>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
 net.socket_create_filter = "none": no filter
 net.socket_create_filter = "bpf:baadf00d": bpf filter
 net.socket_create_filter = "disallow": no sockets created period
 net.socket_create_filter = "iptables:foobar": some iptables thingy
 net.socket_create_filter = "nft:blahblahblah": some nft thingy
 net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 
 3
>>>
>>> Such a scheme works for the socket create filter b/c it is a very simple 
>>> use case. It does not work for the ingress and egress which allow generic 
>>> bpf filters.
>>
>> Can you elaborate on what goes wrong?  (Obviously the
>> "address_family_list" example makes no sense in that context.)
>
> Being able to dump a filter or see that one exists would be a great add-on, 
> but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable 
> API for loading generic BPF filters. Simple cases like "disallow" are easy -- 
> just return 0 in the filter, no complicated BPF code needed. The rest are 
> specific cases of the moment which goes against the intent of ebpf and 
> generic programmability.

Oh -- I'm not proposing that at all.  Let me clarify.  For the bpf
case, if you *read* the file, you'd see "bpf:baadf00d".  But writing
"bpf:baadf00d" is nonsense and would give you -EINVAL.  Instead you
install a bpf filter by opening the file for write (O_RDWR or
O_WRONLY) and doing ioctl(cgroup_socket_create_file_fd,
CGROUP_IOCTL_SET_BPF_FILTER, bpf_fd);  It's kind of like
BPF_PROG_ATTACH except that it respects filesystem permissions, it
isn't in the bpf() multiplexer, the filter being set is implied (by
the fd in use), and everything is nicely seccompable.

To remove the filter, you write "none" or "none\n" to the file.

As a future extension, if someone wanted more than one filter to be
able to coexist in the cgroup socket_create_filter slot, you could
plausibly do 'echo disallow >>net.socket_create_filter' or use a new
ioctl CGROUP_IOCTL_APPEND_BPF_FILTER, and then you'd read the file and
see more than one line.  But this would be a future extension and may
never be needed.

>>
>> a) sub-cgroups cannot have a filter at all of the parent has a filter.
>> (This is the "punt" approach -- it lets different semantics be
>> assigned later without breaking userspace.)
>>
>> b) sub-cgroups can have a filter if a parent does, too.  The semantics
>> are that the sub-cgroup filter runs first and all side-effects occur.
>> If that filter says "reject" then ancestor filters are skipped.  If
>> that filter says "accept", then the ancestor filter is run and its
>> side-effects happen as well.  (And so on, all the way up to the root.)
>
> That comes with a big performance hit for skb / data path cases.
>
> I'm riding my use case on Daniel's work, and as I understand it the nesting 
> case has been discussed. I'll defer to Daniel and Alexei on this part.

I'm not sure I buy the performance hit.  If you do it naively, then
performance will indeed suck.  But there's already a bunch of code in
there to pre-populate a filter list for faster use.  Currently, we
have:

struct cgroup_bpf {
/*
 * Store two sets of bpf_prog pointers, one for programs that are
 * pinned directly to this cgroup, and one for those that are effective
 * when this cgroup is accessed.
 */
struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
};

in struct cgroup, there's a 'struct cgroup_bpf bpf;'.

This would change to something like:

struct cgroup_filter_slot {
  struct bpf_prog *effective;
  struct cgroup_filter_slot *next;
  struct bpf_prog *local;
}

local is NULL unless *this* cgroup has a filter.  effective points to
the bpf_prog that's active in this cgroup or the nearest ancestor that
has a filter.  next is NULL if there are no filters higher in the
chain or points to the next slot that has a filter.  struct cgroup
has:

struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];

To evaluate it, you do:

struct cgroup_filter_slot *slot = &cgroup->slot[the index];

if (!slot->effective)
  return;

do {
  evaluate(slot->effective);
  slot = slot->next;
} while (unlikely(slot));

The old code was one branch per evaluation.  The new code is n+1
branches where n is the number of filters, so it's one extra branch in
the worst case.  But the extra branch is cache-hot (the variable is
right next to slot->effective, which is needed regardless) and highly
predictable (in the case where nesting isn't used, the branch is not
taken, and it's a backward branch which most CPUs will predict as not
taken).

I expect that, on x86, this adds at most a cycle or two and quite
possibly adds no measurable overhead at all.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread David Ahern

On 12/19/16 6:56 PM, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern  wrote:
>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>> net.socket_create_filter = "none": no filter
>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>> net.socket_create_filter = "disallow": no sockets created period
>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>
>> Such a scheme works for the socket create filter b/c it is a very simple use 
>> case. It does not work for the ingress and egress which allow generic bpf 
>> filters.
> 
> Can you elaborate on what goes wrong?  (Obviously the
> "address_family_list" example makes no sense in that context.)

Being able to dump a filter or see that one exists would be a great add-on, but 
I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for 
loading generic BPF filters. Simple cases like "disallow" are easy -- just 
return 0 in the filter, no complicated BPF code needed. The rest are specific 
cases of the moment which goes against the intent of ebpf and generic 
programmability. 
 
>>
>> ...
>>
 you're ignoring use cases I described earlier.
 In vrf case there is only one ifindex it needs to bind to.
>>>
>>> I'm totally lost.  Can you explain what this has to do with the cgroup
>>> hierarchy?
>>
>> I think the point is that a group hierarchy makes no sense for the VRF use 
>> case. What I put into iproute2 is
>>
>> cgrp2/vrf/NAME
>>
>> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 
>> sockets to a specific device index. cgrp2/vrf is the "default" vrf and does 
>> not have a filter. A user can certainly add another layer 
>> cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not 
>> make sense.
> 
> I tend to agree.  I still think that the mechanism as it stands is
> broken in other respects and should be fixed before it goes live.  I
> have no desire to cause problems for the vrf use case.
> 
> But keep in mind that the vrf use case is, in Linus' tree, a bit
> broken right now in its interactions with other users of the same
> mechanism.  Suppose I create a container and want to trace all of its
> created sockets.  I'll set up cgrp2/container and load my tracer as a
> socket creation hook.  Then a container sets up
> cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding
> filter.  Now the tracing stops working -- oops.

There are other ways to achieve socket tracing, but I get your point -- nested 
cases do not work as users may want.


> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
> have to do it now (or disable the feature for 4.10).  This is why I'm
> bringing this whole thing up now.

 We don't have to touch user visible api here, so extensions are fine.
>>>
>>> Huh?  My example in the original email attaches a program in a
>>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>>> working?
>>
>> Are you suggesting sub-cgroups should not be allowed to override the filter 
>> of a parent cgroup?
> 
> Yes, exactly.  I think there are two sensible behaviors:
> 
> a) sub-cgroups cannot have a filter at all of the parent has a filter.
> (This is the "punt" approach -- it lets different semantics be
> assigned later without breaking userspace.)
> 
> b) sub-cgroups can have a filter if a parent does, too.  The semantics
> are that the sub-cgroup filter runs first and all side-effects occur.
> If that filter says "reject" then ancestor filters are skipped.  If
> that filter says "accept", then the ancestor filter is run and its
> side-effects happen as well.  (And so on, all the way up to the root.)

That comes with a big performance hit for skb / data path cases.

I'm riding my use case on Daniel's work, and as I understand it the nesting 
case has been discussed. I'll defer to Daniel and Alexei on this part.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski

On Mon, Dec 19, 2016 at 5:44 PM, David Ahern  wrote:
> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>> net.socket_create_filter = "none": no filter
>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>> net.socket_create_filter = "disallow": no sockets created period
>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>
> Such a scheme works for the socket create filter b/c it is a very simple use 
> case. It does not work for the ingress and egress which allow generic bpf 
> filters.

Can you elaborate on what goes wrong?  (Obviously the
"address_family_list" example makes no sense in that context.)

>
> ...
>
>>> you're ignoring use cases I described earlier.
>>> In vrf case there is only one ifindex it needs to bind to.
>>
>> I'm totally lost.  Can you explain what this has to do with the cgroup
>> hierarchy?
>
> I think the point is that a group hierarchy makes no sense for the VRF use 
> case. What I put into iproute2 is
>
> cgrp2/vrf/NAME
>
> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 
> sockets to a specific device index. cgrp2/vrf is the "default" vrf and does 
> not have a filter. A user can certainly add another layer 
> cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not 
> make sense.

I tend to agree.  I still think that the mechanism as it stands is
broken in other respects and should be fixed before it goes live.  I
have no desire to cause problems for the vrf use case.

But keep in mind that the vrf use case is, in Linus' tree, a bit
broken right now in its interactions with other users of the same
mechanism.  Suppose I create a container and want to trace all of its
created sockets.  I'll set up cgrp2/container and load my tracer as a
socket creation hook.  Then a container sets up
cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding
filter.  Now the tracing stops working -- oops.

>
> ...
>
 I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
 have to do it now (or disable the feature for 4.10).  This is why I'm
 bringing this whole thing up now.
>>>
>>> We don't have to touch user visible api here, so extensions are fine.
>>
>> Huh?  My example in the original email attaches a program in a
>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>> working?
>
> Are you suggesting sub-cgroups should not be allowed to override the filter 
> of a parent cgroup?

Yes, exactly.  I think there are two sensible behaviors:

a) sub-cgroups cannot have a filter at all of the parent has a filter.
(This is the "punt" approach -- it lets different semantics be
assigned later without breaking userspace.)

b) sub-cgroups can have a filter if a parent does, too.  The semantics
are that the sub-cgroup filter runs first and all side-effects occur.
If that filter says "reject" then ancestor filters are skipped.  If
that filter says "accept", then the ancestor filter is run and its
side-effects happen as well.  (And so on, all the way up to the root.)

--Andy

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski

On Mon, Dec 19, 2016 at 4:25 PM, Andy Lutomirski  wrote:
> On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
>  wrote:
>> you're ignoring use cases I described earlier.
>> In vrf case there is only one ifindex it needs to bind to.
>
> I'm totally lost.  Can you explain what this has to do with the cgroup
> hierarchy?
>

Okay, I figured out what you mean, I think.  You have a handful of vrf
devices.  Let's say they have ifindexes 1 and 2 (and maybe more).

The interesting case happens when you set up /cgroup/a with a bpf
program that binds new sockets to ifindex 1 and /cgroup/a/b with a bpf
program that binds new sockets to ifindex 2.  The question is: what
should happen if you're in /cgroup/a/b?  Presumably, if you do this,
you wanted to end up with ifindex 2.

I think the way it should actually work is that the kernel evaluates
/cgroup/a/b's hook and then /cgroup/a's hook.  Then /cgroup/a (which
is the more privileged hook) gets to make the choice.  If it wants
ifindex 2 to win, it can do (pseudocode):

if (!sk->sk_bound_if)
  sk->sk_bound_if = 1;

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread David Ahern

On 12/19/16 5:25 PM, Andy Lutomirski wrote:
> net.socket_create_filter = "none": no filter
> net.socket_create_filter = "bpf:baadf00d": bpf filter
> net.socket_create_filter = "disallow": no sockets created period
> net.socket_create_filter = "iptables:foobar": some iptables thingy
> net.socket_create_filter = "nft:blahblahblah": some nft thingy
> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

Such a scheme works for the socket create filter b/c it is a very simple use 
case. It does not work for the ingress and egress which allow generic bpf 
filters.

...

>> you're ignoring use cases I described earlier.
>> In vrf case there is only one ifindex it needs to bind to.
> 
> I'm totally lost.  Can you explain what this has to do with the cgroup
> hierarchy?

I think the point is that a group hierarchy makes no sense for the VRF use 
case. What I put into iproute2 is

cgrp2/vrf/NAME

where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets 
to a specific device index. cgrp2/vrf is the "default" vrf and does not have a 
filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it 
provides no value since VRF in a VRF does not make sense. 

...

>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>>> have to do it now (or disable the feature for 4.10).  This is why I'm
>>> bringing this whole thing up now.
>>
>> We don't have to touch user visible api here, so extensions are fine.
> 
> Huh?  My example in the original email attaches a program in a
> sub-hierarchy.  Are you saying that 4.11 could make that example stop
> working?

Are you suggesting sub-cgroups should not be allowed to override the filter of 
a parent cgroup?

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski

On Mon, Dec 19, 2016 at 5:34 PM, David Miller  wrote:
> From: Alexei Starovoitov 
> Date: Mon, 19 Dec 2016 16:02:56 -0800
>
>> huh? 'not right api' because it's using bpf syscall instead
>> of cgroup control-file? I think the opposite is the truth.
>
> I completely agree with Alexei on this.

So what happens when someone adds another type of filter?  Let's say
there's a simple, no-privilege-required list of allowed address
families that can hook up to the socket creation hook for a cgroup.
Does BPF_PROG_DETACH still detach it?  Or would both the bpf *and* the
list of allowed address families be in force?  If the latter, why
wouldn't two BPF programs on the same hook be allowed?

Concretely:

# mkdir /cgroup/a
# set_up_bpf_socket_rule /cgroup/a
# set_up_list_of_address_families /cgroup/a
# cat /cgroup/a/some_new_file [what gets displayed?]
# BPF_PROG_DETACH: what happens

By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
even take a reference to a BPF program as an argument.  What is it
supposed to do if this mechanism ever gets extended?

--Andy

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread David Miller

From: Alexei Starovoitov 
Date: Mon, 19 Dec 2016 16:02:56 -0800

> huh? 'not right api' because it's using bpf syscall instead
> of cgroup control-file? I think the opposite is the truth.

I completely agree with Alexei on this.

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski

On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
 wrote:
> On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
>>  wrote:
>> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
>> >> Hi all-
>> >>
>> >> I apologize for being rather late with this.  I didn't realize that
>> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> >> it on the linux-api list, so I missed the discussion.
>> >>
>> >> I think that the inet ingress, egress etc filters are a neat feature,
>> >> but I think the API has some issues that will bite us down the road
>> >> if it becomes stable in its current form.
>> >>
>> >> Most of the problems I see are summarized in this transcript:
>> >>
>> >> # mkdir cg2
>> >> # mount -t cgroup2 none cg2
>> >> # mkdir cg2/nosockets
>> >> # strace cgrp_socket_rule cg2/nosockets/ 0
>> >> ...
>> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>> >>
>> >>  You can modify a cgroup after opening it O_RDONLY?
>> >>
>> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
>> >>
>> >>  This is fine.  The bpf() syscall manipulates bpf objects.
>> >>
>> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>> >>
>> >>  This is not so good:
>> >> 
>> >>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
>> >> is manipulating a cgroup.  There's no reason that a socket 
>> >> creation
>> >> filter couldn't be written in a different language (new iptables
>> >> table?  Simple list of address families?), but if that happened,
>> >> then using bpf() to install it would be entirely nonsensical.
>> >
>> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
>> > network stack is using cgroup as an application group that should
>> > invoke bpf program at the certain point in the stack.
>> > imo cgroup management is orthogonal.
>>
>> It is literally modifying the struct cgroup, and, as a practical
>> matter, it's causing membership in the cgroup to have a certain
>> effect.  But rather than pointless arguing, let me propose an
>> alternative API that I think solves most of the problems here.
>>
>> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
>> Instead, the cgroup gets three new control files:
>> "net.ingress_filter", "net.egress_filter", and
>> "net.socket_create_filter".  Initially, if you read these files, you
>> see "none\n".
>>
>> To attach a bpf filter, you open the file for write and do an ioctl on
>> it.  After doing the ioctl, if you read the file, you'll see
>> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
>> the bpf program.
>>
>> To detach any type of filter, bpf or otherwise, you open the file for
>> write and write "none\n" (or just "none").
>>
>> If you write anything else to the file, you get -EINVAL.  But, if
>> someone writes a new type of filter (perhaps a simple list of address
>> families), maybe you can enable the filter by writing something
>> appropriate to the file.
>
> I see no difference in what you're proposing vs what is already implemented
> from feature set point of view, but the file approach is very ugly, since
> it's a mismatch to FD style access that bpf is using everywhere.
> In your proposal you'd also need to add bpf prefix everywhere.
> So the control file names should be bpf_inet_ingress, bpf_inet_egress
> and bpf_socket_create.

I think we're still talking past each other.  A big part of the point
of changing it is that none of this is specific to bpf.  You could (in
theory -- I'm not proposing implementing these until there's demand)
have:

net.socket_create_filter = "none": no filter
net.socket_create_filter = "bpf:baadf00d": bpf filter
net.socket_create_filter = "disallow": no sockets created period
net.socket_create_filter = "iptables:foobar": some iptables thingy
net.socket_create_filter = "nft:blahblahblah": some nft thingy
net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

See?  This API is not bpf-specific.  It's an API for filtering.  The
fact that struct cgroup currently contains a member called "bpf" is
purely an artifact of the fact that it currently only supports bpf.
Someone will want to rename it to "filters" some day, and
BPF_PROG_DETACH makes no sense whatsoever as a generic API to detach a
filter.

> If you want to prepare such patch for them, I don't mind,
> but you cannot kill syscall command, since it's more flexible
> and your control-file approach _will_ be obsolete pretty quickly.

BPF_PROG_ATTACH and BPF_PROC_DETACH should be removed regardless IMO.
If you really really want a syscall, make it a new syscall.

>
>> Now the API matches the effect.  A cgroup with something other than
>> "none" in one of its net.*_filter files is a cgroup that filters
>> networ

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Alexei Starovoitov

On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
>  wrote:
> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> >> Hi all-
> >>
> >> I apologize for being rather late with this.  I didn't realize that
> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> >> it on the linux-api list, so I missed the discussion.
> >>
> >> I think that the inet ingress, egress etc filters are a neat feature,
> >> but I think the API has some issues that will bite us down the road
> >> if it becomes stable in its current form.
> >>
> >> Most of the problems I see are summarized in this transcript:
> >>
> >> # mkdir cg2
> >> # mount -t cgroup2 none cg2
> >> # mkdir cg2/nosockets
> >> # strace cgrp_socket_rule cg2/nosockets/ 0
> >> ...
> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> >>
> >>  You can modify a cgroup after opening it O_RDONLY?
> >>
> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
> >>
> >>  This is fine.  The bpf() syscall manipulates bpf objects.
> >>
> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> >>
> >>  This is not so good:
> >> 
> >>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
> >> is manipulating a cgroup.  There's no reason that a socket creation
> >> filter couldn't be written in a different language (new iptables
> >> table?  Simple list of address families?), but if that happened,
> >> then using bpf() to install it would be entirely nonsensical.
> >
> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
> > network stack is using cgroup as an application group that should
> > invoke bpf program at the certain point in the stack.
> > imo cgroup management is orthogonal.
> 
> It is literally modifying the struct cgroup, and, as a practical
> matter, it's causing membership in the cgroup to have a certain
> effect.  But rather than pointless arguing, let me propose an
> alternative API that I think solves most of the problems here.
> 
> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
> Instead, the cgroup gets three new control files:
> "net.ingress_filter", "net.egress_filter", and
> "net.socket_create_filter".  Initially, if you read these files, you
> see "none\n".
> 
> To attach a bpf filter, you open the file for write and do an ioctl on
> it.  After doing the ioctl, if you read the file, you'll see
> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
> the bpf program.
> 
> To detach any type of filter, bpf or otherwise, you open the file for
> write and write "none\n" (or just "none").
> 
> If you write anything else to the file, you get -EINVAL.  But, if
> someone writes a new type of filter (perhaps a simple list of address
> families), maybe you can enable the filter by writing something
> appropriate to the file.

I see no difference in what you're proposing vs what is already implemented
from feature set point of view, but the file approach is very ugly, since
it's a mismatch to FD style access that bpf is using everywhere.
In your proposal you'd also need to add bpf prefix everywhere.
So the control file names should be bpf_inet_ingress, bpf_inet_egress
and bpf_socket_create.
If you want to prepare such patch for them, I don't mind,
but you cannot kill syscall command, since it's more flexible
and your control-file approach _will_ be obsolete pretty quickly.

> Now the API matches the effect.  A cgroup with something other than
> "none" in one of its net.*_filter files is a cgroup that filters
> network activity.  And you get CRIU compatibility for free: CRIU can
> read the control file and use whatever mechanism is uses for BPF in
> general to restore the cgroup filter state.   As an added bonus, you
> get ACLs for free and the ugly multiplexer goes away.

extended bpf is not supported by criu. only classic, so having
control_file-style attachment doesn't buy us anything.

> >> # mkdir cg2/nosockets/sockets
> >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule 
> >> cg2/nosockets/sockets/ 1
> >>
> >>  This succeeded, which means that, if this feature is enabled in 4.10,
> >>  then we're stuck with its semantics.  If it returned -EINVAL instead,
> >>  there would be a chance to refine it.
> >
> > i don't see the problem with this behavior. bpf and cgroup are indepedent.
> > Imange that socket accounting program is attached to cg2/nosockets.
> > The program is readonly and carry no security meaning.
> > Why cgroup should prevent creation of cg2/nosockets/foo directory ?
> 
> I think you're misunderstanding me.  What I'm saying is that, if you
> allow a cgroup and one of its descendents to both enable the same type
> of filter, you have just committed to some particular semantics for

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski

On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
 wrote:
> On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
>> Hi all-
>>
>> I apologize for being rather late with this.  I didn't realize that
>> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> it on the linux-api list, so I missed the discussion.
>>
>> I think that the inet ingress, egress etc filters are a neat feature,
>> but I think the API has some issues that will bite us down the road
>> if it becomes stable in its current form.
>>
>> Most of the problems I see are summarized in this transcript:
>>
>> # mkdir cg2
>> # mount -t cgroup2 none cg2
>> # mkdir cg2/nosockets
>> # strace cgrp_socket_rule cg2/nosockets/ 0
>> ...
>> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>>
>>  You can modify a cgroup after opening it O_RDONLY?
>>
>> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> log_buf=0x6020c0, kern_version=0}, 48) = 4
>>
>>  This is fine.  The bpf() syscall manipulates bpf objects.
>>
>> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>>
>>  This is not so good:
>> 
>>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
>> is manipulating a cgroup.  There's no reason that a socket creation
>> filter couldn't be written in a different language (new iptables
>> table?  Simple list of address families?), but if that happened,
>> then using bpf() to install it would be entirely nonsensical.
>
> I don't see why it's _modifing_ the cgroup. I'm looking at it as
> network stack is using cgroup as an application group that should
> invoke bpf program at the certain point in the stack.
> imo cgroup management is orthogonal.

It is literally modifying the struct cgroup, and, as a practical
matter, it's causing membership in the cgroup to have a certain
effect.  But rather than pointless arguing, let me propose an
alternative API that I think solves most of the problems here.

In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
Instead, the cgroup gets three new control files:
"net.ingress_filter", "net.egress_filter", and
"net.socket_create_filter".  Initially, if you read these files, you
see "none\n".

To attach a bpf filter, you open the file for write and do an ioctl on
it.  After doing the ioctl, if you read the file, you'll see
"bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
the bpf program.

To detach any type of filter, bpf or otherwise, you open the file for
write and write "none\n" (or just "none").

If you write anything else to the file, you get -EINVAL.  But, if
someone writes a new type of filter (perhaps a simple list of address
families), maybe you can enable the filter by writing something
appropriate to the file.

Now the API matches the effect.  A cgroup with something other than
"none" in one of its net.*_filter files is a cgroup that filters
network activity.  And you get CRIU compatibility for free: CRIU can
read the control file and use whatever mechanism is uses for BPF in
general to restore the cgroup filter state.   As an added bonus, you
get ACLs for free and the ugly multiplexer goes away.

>> # mkdir cg2/nosockets/sockets
>> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
>>
>>  This succeeded, which means that, if this feature is enabled in 4.10,
>>  then we're stuck with its semantics.  If it returned -EINVAL instead,
>>  there would be a chance to refine it.
>
> i don't see the problem with this behavior. bpf and cgroup are indepedent.
> Imange that socket accounting program is attached to cg2/nosockets.
> The program is readonly and carry no security meaning.
> Why cgroup should prevent creation of cg2/nosockets/foo directory ?

I think you're misunderstanding me.  What I'm saying is that, if you
allow a cgroup and one of its descendents to both enable the same type
of filter, you have just committed to some particular semantics for
what happens.  And I think that the current semantics are the *wrong*
semantics for default long-term use, so you should either fix the
semantics or disable the problematic case.

>
>> # echo $$ >cg2/nosockets/sockets/cgroup.procs
>> # ping 127.0.0.1
>> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
>> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
>
> hmm. in this case the admin created a subgroup with a group that allows
> ping, so? how's that a problem?

I think you're forgetting about namespaces.  There are two different
admins here.  There's the admin who created the outer container and
said "no sockets here".  Then there's the admin inside the container
who said "muahaha, I can create a sub-container and allow sockets
*there*".

> In case of vrf I can imagine
> a set of application auto-binding to one vrf and smaller
> set of applications binding to a different vrf.
> Or broader set of applicati

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Alexei Starovoitov

On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> Hi all-
> 
> I apologize for being rather late with this.  I didn't realize that
> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> it on the linux-api list, so I missed the discussion.
> 
> I think that the inet ingress, egress etc filters are a neat feature,
> but I think the API has some issues that will bite us down the road
> if it becomes stable in its current form.
> 
> Most of the problems I see are summarized in this transcript:
> 
> # mkdir cg2
> # mount -t cgroup2 none cg2
> # mkdir cg2/nosockets
> # strace cgrp_socket_rule cg2/nosockets/ 0
> ...
> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> 
>  You can modify a cgroup after opening it O_RDONLY?
> 
> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> log_buf=0x6020c0, kern_version=0}, 48) = 4
> 
>  This is fine.  The bpf() syscall manipulates bpf objects.
> 
> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> 
>  This is not so good:
> 
>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
> is manipulating a cgroup.  There's no reason that a socket creation
> filter couldn't be written in a different language (new iptables
> table?  Simple list of address families?), but if that happened,
> then using bpf() to install it would be entirely nonsensical.

I don't see why it's _modifing_ the cgroup. I'm looking at it as
network stack is using cgroup as an application group that should
invoke bpf program at the certain point in the stack.
imo cgroup management is orthogonal.
I certainly don't mind adding 'cat cgroup.bpf' control file that will
print the program digest for debugging, just like we do for fdinfo,
but cgroup file interface shouldn't be used to control networking.
If there was something like networking-wide ioctl, I think it could
have been an alternative way to attach a program to a hook to a cgroup.
Adding a control file to a cgroup for the purpose of attaching bpf,
just doesn't make sense to me, since it's dragging bpf and networking
details into cgroup land. Imagine in the future somebody would want
to have nft to perform similar BPF_CGROUP_INET_INGRESS functionality.
I'd think that the position of the hook within the networking stack
would remain the same, but enum id will be different, since
other ids in that enum (like BPF_CGROUP_INET_SOCK_CREATE) are not
applicable to nft. Similarly the concept of program_fd doesn't exist
there either. So it would be using its own netlink based mechanism
and its own way of enumerating hook id. And cgroup could be passed
as a string into netlink message instead of fd.
So if somebody adds a control file to a cgroup api that takes bpf's hookid
and bpf's program_fd, such control file will only be applicable to bpf
and not usable for anything else. Hence I see no reason to go that route.

> 
>  b) This is starting to be an excessively ugly multiplexer.  Among
> other things, it's very unfriendly to seccomp.
> 
> # echo $$ >cg2/nosockets/cgroup.procs
> # ping 127.0.0.1
> ping: socket: Operation not permitted
> # ls cg2/nosockets/
> cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
> # cat cg2/nosockets/cgroup.controllers
> 
>  Something in cgroupfs should give an indication that this cgroup
>  filters socket creation, but there's nothing there.  You should also
>  be able to turn the filter off from cgroupfs.

Doing 'cat cg2/nosockets/cgroup.bpf' here would be useful for debugging
to see which program is attached to which hook in this cgroup.
Detaching programs via cgroup control file is a lot less clear,
since it will be confusing to a running application that attached the program
in the first place, since it won't have any indication that external entity
detached the program.
One can argue that it could be useful for curious human/admin
to detach all bpf progs from all hooks for this cgroup, but
then it probably needs dmesg warning and only usable in extreme cases
as debugging and not something control plane should ever use.

> # mkdir cg2/nosockets/sockets
> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
> 
>  This succeeded, which means that, if this feature is enabled in 4.10,
>  then we're stuck with its semantics.  If it returned -EINVAL instead,
>  there would be a chance to refine it.

i don't see the problem with this behavior. bpf and cgroup are indepedent.
Imange that socket accounting program is attached to cg2/nosockets.
The program is readonly and carry no security meaning.
Why cgroup should prevent creation of cg2/nosockets/foo directory ?

> # echo $$ >cg2/nosockets/sockets/cgroup.procs
> # ping 127.0.0.1
> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms

hmm. in this case the admin cre

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-17 Thread Andy Lutomirski

On Sat, Dec 17, 2016 at 11:26 AM, Mickaël Salaün  wrote:
>
> On 17/12/2016 19:18, Andy Lutomirski wrote:
>> Hi all-
>>
>> I apologize for being rather late with this.  I didn't realize that
>> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> it on the linux-api list, so I missed the discussion.
>>
>> I think that the inet ingress, egress etc filters are a neat feature,
>> but I think the API has some issues that will bite us down the road
>> if it becomes stable in its current form.
>>
>> Most of the problems I see are summarized in this transcript:
>>
>> # mkdir cg2
>> # mount -t cgroup2 none cg2
>> # mkdir cg2/nosockets
>> # strace cgrp_socket_rule cg2/nosockets/ 0
>> ...
>> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>>
>>  You can modify a cgroup after opening it O_RDONLY?
>
> I sent a patch to check the cgroup.procs permission before attaching a
> BPF program to it [1], but it was not merged because not part of the
> current security model (which may not be crystal clear). The thing is
> that the current socket/BPF/cgroup feature is only available to a
> process with the *global CAP_NET_ADMIN* and such a process can already
> modify the network for every processes, so it doesn't make much sense to
> check if it can modify the network for a subset of this processes.
>
> [1] https://lkml.org/lkml/2016/9/19/854

I don't think that's quite the right approach.  First, checking
permissions on an fd that's not the fd passed to the syscall is weird
and IMO shouldn't be done without a good reason.  Second,
cgroups.procs seems like the wrong file.

Why not create "net.socket_create_filter" and require a writable fd to
*that* to be passed to the syscall.

>
> However, needing a process to open a cgroup *directory* in write mode
> may not make sense because the process does not modify the content of
> the cgroup but only use it as a *reference* in the network stack.
> Forcing an open with write mode may forbid to use this kind of
> network-filtering feature in a read-only file-system but not necessarily
> read-only *network configuration*.

It does modify the cgroup.  Logically it changes the cgroup behavior.
As an implementation matter, it modifies the struct cgroup.

>
> Another point of view is that the CAP_NET_ADMIN may be an unneeded
> privilege if the cgroup migration is using a no_new_privs-like feature
> as I proposed with Landlock [2] (with an extra ptrace_may_access() check).
> The new capability proposition for cgroup may be interesting too.
>
> [2] https://lkml.org/lkml/2016/9/14/82
>
>>
>> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> log_buf=0x6020c0, kern_version=0}, 48) = 4
>>
>>  This is fine.  The bpf() syscall manipulates bpf objects.
>>
>> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>>
>>  This is not so good:
>> 
>>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
>> is manipulating a cgroup.  There's no reason that a socket creation
>> filter couldn't be written in a different language (new iptables
>> table?  Simple list of address families?), but if that happened,
>> then using bpf() to install it would be entirely nonsensical.
>
> Another point of view is to say that the BPF program (called by the
> network stack) is using a reference to a set of processes thanks to a
> cgroup.

I strongly disagree with this point of view.  From a user's
perspective, nothing about the bpf program changed -- the cgroup
changes.  Even in the code, the bpf program doesn't have a reference
to anything.  The cgroup references the bpf program.

>>
>> # echo $$ >cg2/nosockets/cgroup.procs
>> # ping 127.0.0.1
>> ping: socket: Operation not permitted
>> # ls cg2/nosockets/
>> cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
>> # cat cg2/nosockets/cgroup.controllers
>>
>>  Something in cgroupfs should give an indication that this cgroup
>>  filters socket creation, but there's nothing there.  You should also
>>  be able to turn the filter off from cgroupfs.
>
> Right. Everybody was OK at LPC to add such an information but it is not
> there yet.

Then let's do it before CONFIG_CGROUP_BPF=y becomes possible in a
released kernel.

> Right. Because this feature doesn't handle namespaces (only global
> CAP_NET_ADMIN), nested containers should not be allowed to use it at all.
> If we want this kind of feature to be usable by something other than a
> process with a global capability, then we need an inheritance mechanism
> similar to what I designed for Landlock. I think it could be added later.

I think it's okay to have a new kernel feature that doesn't support
namespacing yet.  I don't think it's okay to have a new kernel feature
that will become problematic when namespacing is added, and I think
cgroup-bpf is in the latter class.


Anyway, here's a half-baked proposal to clean this all u

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-17 Thread Mickaël Salaün


On 17/12/2016 19:18, Andy Lutomirski wrote:
> Hi all-
> 
> I apologize for being rather late with this.  I didn't realize that
> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> it on the linux-api list, so I missed the discussion.
> 
> I think that the inet ingress, egress etc filters are a neat feature,
> but I think the API has some issues that will bite us down the road
> if it becomes stable in its current form.
> 
> Most of the problems I see are summarized in this transcript:
> 
> # mkdir cg2
> # mount -t cgroup2 none cg2
> # mkdir cg2/nosockets
> # strace cgrp_socket_rule cg2/nosockets/ 0
> ...
> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> 
>  You can modify a cgroup after opening it O_RDONLY?

I sent a patch to check the cgroup.procs permission before attaching a
BPF program to it [1], but it was not merged because not part of the
current security model (which may not be crystal clear). The thing is
that the current socket/BPF/cgroup feature is only available to a
process with the *global CAP_NET_ADMIN* and such a process can already
modify the network for every processes, so it doesn't make much sense to
check if it can modify the network for a subset of this processes.

[1] https://lkml.org/lkml/2016/9/19/854

However, needing a process to open a cgroup *directory* in write mode
may not make sense because the process does not modify the content of
the cgroup but only use it as a *reference* in the network stack.
Forcing an open with write mode may forbid to use this kind of
network-filtering feature in a read-only file-system but not necessarily
read-only *network configuration*.

Another point of view is that the CAP_NET_ADMIN may be an unneeded
privilege if the cgroup migration is using a no_new_privs-like feature
as I proposed with Landlock [2] (with an extra ptrace_may_access() check).
The new capability proposition for cgroup may be interesting too.

[2] https://lkml.org/lkml/2016/9/14/82

> 
> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> log_buf=0x6020c0, kern_version=0}, 48) = 4
> 
>  This is fine.  The bpf() syscall manipulates bpf objects.
> 
> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> 
>  This is not so good:
> 
>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
> is manipulating a cgroup.  There's no reason that a socket creation
> filter couldn't be written in a different language (new iptables
> table?  Simple list of address families?), but if that happened,
> then using bpf() to install it would be entirely nonsensical.

Another point of view is to say that the BPF program (called by the
network stack) is using a reference to a set of processes thanks to a
cgroup.

> 
>  b) This is starting to be an excessively ugly multiplexer.  Among
> other things, it's very unfriendly to seccomp.

FWIW, Landlock will have the capability to filter this kind of action.

> 
> # echo $$ >cg2/nosockets/cgroup.procs
> # ping 127.0.0.1
> ping: socket: Operation not permitted
> # ls cg2/nosockets/
> cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
> # cat cg2/nosockets/cgroup.controllers
> 
>  Something in cgroupfs should give an indication that this cgroup
>  filters socket creation, but there's nothing there.  You should also
>  be able to turn the filter off from cgroupfs.

Right. Everybody was OK at LPC to add such an information but it is not
there yet.

> 
> # mkdir cg2/nosockets/sockets
> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
> 
>  This succeeded, which means that, if this feature is enabled in 4.10,
>  then we're stuck with its semantics.  If it returned -EINVAL instead,
>  there would be a chance to refine it.

This is indeed unfortunate.

> 
> # echo $$ >cg2/nosockets/sockets/cgroup.procs
> # ping 127.0.0.1
> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
> ^C
> --- 127.0.0.1 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms
> 
>  Bash was inside a cgroup that disallowed socket creation, but socket
>  creation wasn't disallowed.  This means that the obvious use of socket
>  creation filters in nestable constainers fails insecurely.
> 
> 
> There's also a subtle but nasty potential security problem here.
> In 4.9 and before, cgroups has only one real effect in the kernel:
> resource control. A process in a malicious cgroup could be DoSed,
> but that was about the extent of the damage that a malicious cgroup
> could do.
> 
> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
> programs attached that can do things if various events occur. (Right
> now, this means socket operations, but there are plans in the works
> to do this f

Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-17 Thread Andy Lutomirski

Hi all-

I apologize for being rather late with this.  I didn't realize that
cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
it on the linux-api list, so I missed the discussion.

I think that the inet ingress, egress etc filters are a neat feature,
but I think the API has some issues that will bite us down the road
if it becomes stable in its current form.

Most of the problems I see are summarized in this transcript:

# mkdir cg2
# mount -t cgroup2 none cg2
# mkdir cg2/nosockets
# strace cgrp_socket_rule cg2/nosockets/ 0
...
open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3

 You can modify a cgroup after opening it O_RDONLY?

bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
log_buf=0x6020c0, kern_version=0}, 48) = 4

 This is fine.  The bpf() syscall manipulates bpf objects.

bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0

 This is not so good:

 a) The bpf() syscall is supposed to manipulate bpf objects.  This
is manipulating a cgroup.  There's no reason that a socket creation
filter couldn't be written in a different language (new iptables
table?  Simple list of address families?), but if that happened,
then using bpf() to install it would be entirely nonsensical.

 b) This is starting to be an excessively ugly multiplexer.  Among
other things, it's very unfriendly to seccomp.

# echo $$ >cg2/nosockets/cgroup.procs
# ping 127.0.0.1
ping: socket: Operation not permitted
# ls cg2/nosockets/
cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
# cat cg2/nosockets/cgroup.controllers

 Something in cgroupfs should give an indication that this cgroup
 filters socket creation, but there's nothing there.  You should also
 be able to turn the filter off from cgroupfs.

# mkdir cg2/nosockets/sockets
# /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1

 This succeeded, which means that, if this feature is enabled in 4.10,
 then we're stuck with its semantics.  If it returned -EINVAL instead,
 there would be a chance to refine it.

# echo $$ >cg2/nosockets/sockets/cgroup.procs
# ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
^C
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms

 Bash was inside a cgroup that disallowed socket creation, but socket
 creation wasn't disallowed.  This means that the obvious use of socket
 creation filters in nestable constainers fails insecurely.


There's also a subtle but nasty potential security problem here.
In 4.9 and before, cgroups has only one real effect in the kernel:
resource control. A process in a malicious cgroup could be DoSed,
but that was about the extent of the damage that a malicious cgroup
could do.

In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
programs attached that can do things if various events occur. (Right
now, this means socket operations, but there are plans in the works
to do this for LSM hooks too.) These bpf programs can say yes or no,
but they can also read out various data (including socket payloads!)
and save them away where an attacker can find them. This sounds a
lot like seccomp with a narrower scope but a much stronger ability to
exfiltrate private information.

Unfortunately, while seccomp is very, very careful to prevent
injection of a privileged victim into a malicious sandbox, the
CGROUP_BPF mechanism appears to have no real security model. There
is nothing to prevent a program that's in a malicious cgroup from
running a setuid binary, and there is nothing to prevent a program
that has the ability to move itself or another program into a
malicious cgroup from doing so and then, if needed for exploitation,
exec a setuid binary.

This isn't much of a problem yet because you currently need
CAP_NET_ADMIN to create a malicious sandbox in the first place.  I'm
sure that, in the near future, someone will want to make this stuff
work in containers with delegated cgroup hierarchies, and then there
may be a real problem here.


I've included a few security people on this thread.  The current API
looks abusable, and it would be nice to find all the holes before
4.10 comes out.


(The cgrp_socket_rule source is attached.  You can build it by sticking it
 in samples/bpf and doing:

 $ make headers_install
 $ cd samples/bpf
 $ gcc -o cgrp_socket_rule cgrp_socket_rule.c libbpf.c -I../../usr/include
)

--Andy
/* eBPF example program:
 *
 * - Loads eBPF program
 *
 *   The eBPF program sets the sk_bound_dev_if index in new AF_INET{6}
 *   sockets opened by processes in the cgroup.
 *
 * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
 */

#define _GNU_SOURCE

#include 
#include 
#include 
#include 
#include 
#in

48 matches

Mail list logo