Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-23 Thread Daniel Mack
On 08/23/2016 11:54 AM, Sargun Dhillon wrote:
> On Tue, Aug 23, 2016 at 10:27:28AM +0200, Daniel Mack wrote:
>> On 08/22/2016 07:20 PM, Sargun Dhillon wrote:
>>> On Mon, Aug 22, 2016 at 06:22:20PM +0200, Daniel Mack wrote:
 On 08/22/2016 06:06 PM, Pablo Neira Ayuso wrote:
>>
> This patchset also needs an extra egress hook, not yet known where to
> be placed, so two hooks in the network stacks in the end, 

 That should be solvable, I'm sure. I can as well leave egress out for
 the next version so it can be added later on.

>>> Any idea where you might put that yet? Does dev_xmit seems like a 
>>> reasonable 
>>> place?
>>
>> Ah, yes. Thanks for the pointer, that seems to work fine.
>>
> Daniel pointed out to me that there's already a BPF program that's used there 
> for tc matches. So, it should work fine. I would just verify you can call 
> programs from IRQs, and rcu_bh plays well with it.

IRQs should not matter AFAICS, and for testing, I placed the hook even
outside of rcu_bh. All the program runner needs is rcu_read_lock() to
access the rcu protected pointers.

> Alternatively, if you want to filter only IP traffic, ip_output, and 
> ip6_output 
> are fairly good places. I'm planning on putting some LSM hooks there soon. 
> It's 
> a bit simpler.

If you do that, and that's simpler, we can as well move the hook over at
some point. For now, I think dev_xmit() is sufficient.

> I also suggest you use verdicts rather than trimming for simplicity sake.

That's how it works already. eBPF programs in that context are expected
to either return 0 (reject) or 1 (pass). The may, however cause side
effects such as shared map updates etc, which is what the example
program does for accounting.

> I think that we should just add another pointer to the end of 
> sock_cgroup_data 
> while we're in this state of transition, and nudge people to disable 
> CONFIG_CGROUP_NET_PRIO and CONFIG_CGROUP_NET_CLASSID over time.
> 
> Alternatively, we add these controllers for v2, and we have some kind of 
> marker 
> whether or not they're on v2 in the skcd. If they are, we can find the 
> cgroup, 
> and get the prioidx, and classid from the css. Although the comment in 
> cgroup-defs.h suggests that v2 and classid should never be used concurrently, 
> I 
> can't help but to disagree, given there's legacy infrastructure that 
> leverages 
> classid.

I'll leave that to Tejun to comment on :)


Thanks,
Daniel




Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-23 Thread Sargun Dhillon
On Tue, Aug 23, 2016 at 10:27:28AM +0200, Daniel Mack wrote:
> On 08/22/2016 07:20 PM, Sargun Dhillon wrote:
> > On Mon, Aug 22, 2016 at 06:22:20PM +0200, Daniel Mack wrote:
> >> On 08/22/2016 06:06 PM, Pablo Neira Ayuso wrote:
> 
> >>> This patchset also needs an extra egress hook, not yet known where to
> >>> be placed, so two hooks in the network stacks in the end, 
> >>
> >> That should be solvable, I'm sure. I can as well leave egress out for
> >> the next version so it can be added later on.
> >>
> > Any idea where you might put that yet? Does dev_xmit seems like a 
> > reasonable 
> > place?
> 
> Ah, yes. Thanks for the pointer, that seems to work fine.
> 
Daniel pointed out to me that there's already a BPF program that's used there 
for tc matches. So, it should work fine. I would just verify you can call 
programs from IRQs, and rcu_bh plays well with it.

Alternatively, if you want to filter only IP traffic, ip_output, and ip6_output 
are fairly good places. I'm planning on putting some LSM hooks there soon. It's 
a bit simpler.

I also suggest you use verdicts rather than trimming for simplicity sake.

> > If someone uses the netprio, or the net classid controllers, skcd matches
> > no longer work.
> 
> Yes, sock_cgroup_ptr() will fall back to the v2 root in this case.
> 
> > Ideally, we should fix up these controllers to make them
> > more v2 friendly.
> 
> These controllers do not exist for v2, that's why sock_cgroup_ptr()
> behaves that way. What's your idea to fix that up?
I think that we should just add another pointer to the end of sock_cgroup_data 
while we're in this state of transition, and nudge people to disable 
CONFIG_CGROUP_NET_PRIO and CONFIG_CGROUP_NET_CLASSID over time.

Alternatively, we add these controllers for v2, and we have some kind of marker 
whether or not they're on v2 in the skcd. If they are, we can find the cgroup, 
and get the prioidx, and classid from the css. Although the comment in 
cgroup-defs.h suggests that v2 and classid should never be used concurrently, I 
can't help but to disagree, given there's legacy infrastructure that leverages 
classid.

> 
> 
> Thanks,
> Daniel
> 

Looking forward to seeing these patches,
-Sargun


Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-23 Thread Daniel Mack
On 08/22/2016 07:20 PM, Sargun Dhillon wrote:
> On Mon, Aug 22, 2016 at 06:22:20PM +0200, Daniel Mack wrote:
>> On 08/22/2016 06:06 PM, Pablo Neira Ayuso wrote:

>>> This patchset also needs an extra egress hook, not yet known where to
>>> be placed, so two hooks in the network stacks in the end, 
>>
>> That should be solvable, I'm sure. I can as well leave egress out for
>> the next version so it can be added later on.
>>
> Any idea where you might put that yet? Does dev_xmit seems like a reasonable 
> place?

Ah, yes. Thanks for the pointer, that seems to work fine.

> If someone uses the netprio, or the net classid controllers, skcd matches
> no longer work.

Yes, sock_cgroup_ptr() will fall back to the v2 root in this case.

> Ideally, we should fix up these controllers to make them
> more v2 friendly.

These controllers do not exist for v2, that's why sock_cgroup_ptr()
behaves that way. What's your idea to fix that up?


Thanks,
Daniel



Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-22 Thread Sargun Dhillon
On Mon, Aug 22, 2016 at 06:22:20PM +0200, Daniel Mack wrote:
> On 08/22/2016 06:06 PM, Pablo Neira Ayuso wrote:
> > On Fri, Aug 19, 2016 at 07:07:39PM +0200, Thomas Graf wrote:
> 
> >> You brought up multiple tables which reflect the cumulative approach.
> >> This sometimes works but has its issues as well. Users must be aware
> >> of each other and anticipate what rules other users might inject
> >> before or after their own tables. The very existence of firewalld which
> >> aims at democratizing this collaboration proves this point.
> > 
> > Firewalld, was really required in the iptables predefined tables
> > model, in nft last time we talked about this during NFWS'15, future
> > plans for firewalld were not clear yet.
> > 
> > Moreover, in nft, different users can indeed dump the ruleset and it
> > would be possible to validate if one policy is being shadowed by
> > another coming later on. The bpf bytecode dump cannot be taken to the
> > original representation.
> 
> But as Thomas said - both things address different use-cases. For
> container setups, there is no administrator involved to use cli tools,
> so I don't think that's really much of an argument.
> 
> >> So in that sense I would very much like for both models to be made
> >> available to users. nftables+cgroups for a cumulative approach as
> >> well as BPF+cgroups for the delegation approach.  I don't see why the
> >> cgroups based filtering capability should not be made available to both.
> > 
> > This patchset also needs an extra egress hook, not yet known where to
> > be placed, so two hooks in the network stacks in the end, 
> 
> That should be solvable, I'm sure. I can as well leave egress out for
> the next version so it can be added later on.
> 
Any idea where you might put that yet? Does dev_xmit seems like a reasonable 
place?

> > and this only works for cgroups version 2.
> 
> I don't see a problem with that, as v1 and v2 hierarchies can peacefully
> coexist.
> 
If someone uses the netprio, or the net classid controllers, skcd matches
no longer work. Ideally, we should fix up these controllers to make them
more v2 friendly.

> > Last time we talked about this, main concerns were that this was too
> > specific, but this approach seems even more specific to me.
> 
> Hmm, I disagree - bpf programs that are associated with cgroups are
> rather something that can be extended a lot in the future, for instance
> for handling port binding permissions etc. Unlike the proposed network
> cgroup controller with all sorts of complicated knobs to control ranges
> of ports etc, a bpf program that take care of that in a much more
> versatile way.
> 
> I also strongly believe we can have both, a cgroup controller that has
> bpf programs for socket filtering and other things, _and_ a "post socket
> lookup netfilter" table type. Both will have their individual use-cases.
> 
> 
> Thanks,
> Daniel
> 


Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-22 Thread Daniel Mack
On 08/22/2016 06:06 PM, Pablo Neira Ayuso wrote:
> On Fri, Aug 19, 2016 at 07:07:39PM +0200, Thomas Graf wrote:

>> You brought up multiple tables which reflect the cumulative approach.
>> This sometimes works but has its issues as well. Users must be aware
>> of each other and anticipate what rules other users might inject
>> before or after their own tables. The very existence of firewalld which
>> aims at democratizing this collaboration proves this point.
> 
> Firewalld, was really required in the iptables predefined tables
> model, in nft last time we talked about this during NFWS'15, future
> plans for firewalld were not clear yet.
> 
> Moreover, in nft, different users can indeed dump the ruleset and it
> would be possible to validate if one policy is being shadowed by
> another coming later on. The bpf bytecode dump cannot be taken to the
> original representation.

But as Thomas said - both things address different use-cases. For
container setups, there is no administrator involved to use cli tools,
so I don't think that's really much of an argument.

>> So in that sense I would very much like for both models to be made
>> available to users. nftables+cgroups for a cumulative approach as
>> well as BPF+cgroups for the delegation approach.  I don't see why the
>> cgroups based filtering capability should not be made available to both.
> 
> This patchset also needs an extra egress hook, not yet known where to
> be placed, so two hooks in the network stacks in the end, 

That should be solvable, I'm sure. I can as well leave egress out for
the next version so it can be added later on.

> and this only works for cgroups version 2.

I don't see a problem with that, as v1 and v2 hierarchies can peacefully
coexist.

> Last time we talked about this, main concerns were that this was too
> specific, but this approach seems even more specific to me.

Hmm, I disagree - bpf programs that are associated with cgroups are
rather something that can be extended a lot in the future, for instance
for handling port binding permissions etc. Unlike the proposed network
cgroup controller with all sorts of complicated knobs to control ranges
of ports etc, a bpf program that take care of that in a much more
versatile way.

I also strongly believe we can have both, a cgroup controller that has
bpf programs for socket filtering and other things, _and_ a "post socket
lookup netfilter" table type. Both will have their individual use-cases.


Thanks,
Daniel



Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-22 Thread Pablo Neira Ayuso
Hi Thomas,

On Fri, Aug 19, 2016 at 07:07:39PM +0200, Thomas Graf wrote:
> On 08/19/16 at 06:21pm, Pablo Neira Ayuso wrote:
> > On Fri, Aug 19, 2016 at 12:35:14PM +0200, Daniel Mack wrote:
> > > Also true. A cgroup can currently only hold one bpf program for each
> > > direction, and they are supposed to be set from one controlling instance
> > > in the system. However, it is possible to create subcgroups, and install
> > > own programs in them, which will then be effective instead of the one in
> > > the parent. They will, however, replace each other in runtime behavior,
> > > and not be stacked. This is a fundamentally different approach than how
> > > nf_tables works of course.
> > 
> > I see, this looks problematic indeed, thanks for confirming this.
> 
> What exactly is problematic? I think the proposed mechanism is very
> clean in allowing sub groups to provide the entire program. This
> allows for delegation. Different orchestrators can manage different
> cgroups. It's different as Daniel stated. I don't see how this is
> problematic though.
> 
> You brought up multiple tables which reflect the cumulative approach.
> This sometimes works but has its issues as well. Users must be aware
> of each other and anticipate what rules other users might inject
> before or after their own tables. The very existence of firewalld which
> aims at democratizing this collaboration proves this point.

Firewalld, was really required in the iptables predefined tables
model, in nft last time we talked about this during NFWS'15, future
plans for firewalld were not clear yet.

Moreover, in nft, different users can indeed dump the ruleset and it
would be possible to validate if one policy is being shadowed by
another coming later on. The bpf bytecode dump cannot be taken to the
original representation.

> So in that sense I would very much like for both models to be made
> available to users. nftables+cgroups for a cumulative approach as
> well as BPF+cgroups for the delegation approach.  I don't see why the
> cgroups based filtering capability should not be made available to both.

This patchset also needs an extra egress hook, not yet known where to
be placed, so two hooks in the network stacks in the end, and this
only works for cgroups version 2.

Last time we talked about this, main concerns were that this was too
specific, but this approach seems even more specific to me.


Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Thomas Graf
On 08/19/16 at 06:21pm, Pablo Neira Ayuso wrote:
> On Fri, Aug 19, 2016 at 12:35:14PM +0200, Daniel Mack wrote:
> > Also true. A cgroup can currently only hold one bpf program for each
> > direction, and they are supposed to be set from one controlling instance
> > in the system. However, it is possible to create subcgroups, and install
> > own programs in them, which will then be effective instead of the one in
> > the parent. They will, however, replace each other in runtime behavior,
> > and not be stacked. This is a fundamentally different approach than how
> > nf_tables works of course.
> 
> I see, this looks problematic indeed, thanks for confirming this.

What exactly is problematic? I think the proposed mechanism is very
clean in allowing sub groups to provide the entire program. This
allows for delegation. Different orchestrators can manage different
cgroups. It's different as Daniel stated. I don't see how this is
problematic though.

You brought up multiple tables which reflect the cumulative approach.
This sometimes works but has its issues as well. Users must be aware
of each other and anticipate what rules other users might inject
before or after their own tables. The very existence of firewalld which
aims at democratizing this collaboration proves this point.

So in that sense I would very much like for both models to be made
available to users. nftables+cgroups for a cumulative approach as
well as BPF+cgroups for the delegation approach.  I don't see why the
cgroups based filtering capability should not be made available to both.



Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Thomas Graf
On 08/19/16 at 06:31pm, Pablo Neira Ayuso wrote:
> Why do you need global seccomp policies? The process knows better what
> he needs to place in his sandbox, so attaching this from the process
> itself makes more sense to me... Anyway, this reminds me to selinux.

Two different objectives. The point is to sandbox applications and
restrict their capabilities. It's not the application itself but the task
orchestration system around it that manages the policies.


Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Pablo Neira Ayuso
On Fri, Aug 19, 2016 at 01:20:25PM +0200, Daniel Borkmann wrote:
> On 08/19/2016 11:19 AM, Pablo Neira Ayuso wrote:
> [...]
> > * During the Netfilter Workshop, the main concern to add this new socket
> 
> Don't really know what was discussed exactly at NFWS, but ...

Slides are available here:

http://workshop.netfilter.org/2016/wiki/index.php/File_Cgroup-matches-NFWS2016.html

We missed you Daniel, I hope you can make it next year ;-)

> >ingress hook was that it is too specific. However this new hook in
> >the network stack looks way more specific more specific since *it only
> >works for cgroups*.
> 
> ... why would that be something so overly specific? I don't think that "it
> only works for cgroups" would be a narrow use case. While the current
> sk_filter() BPF policies are set from application level, it makes sense to
> me to have an option for an entity that manages the cgroups to apply an
> external policy for networking side as well for participating processes.
> It seems like a useful extension to the current sk_filter() infrastructure
> iff we carve out the details properly and generic enough, and besides ...

This forces anyone to filter socket traffic from cgroups, so this
makes the networking infrastructure dependent on cgroups for no
reason, instead of simply using the cgroupv1, cgroupv2, or whatever
other information as yet another selector.

> [...]
> On 08/19/2016 12:35 PM, Daniel Mack wrote:
> [...]
> >So - I don't know. The whole 'eBPF in cgroups' idea was born because
> >through the discussions over the past months we had on all this, it
> >became clear to me that netfilter is not the right place for filtering
> >on local tasks. I agree the solution I am proposing in my patch set has
> >its downsides, mostly when it comes to transparency to users, but I
> >considered that acceptable. After all, we have eBPF users all over the
> >place in the kernel already, and seccomp, for instance, isn't any better
> >in that regard.
> 
> ... since you mention seccomp here as well, it would be another good fit
> as a program subtype to apply syscall policies for those participants on
> a cgroup level too, f.e. to disallow certain syscalls. It would be quite
> similar conceptually. So, fwiw, if this is being designed generic enough,
> the use cases would go much broader than that.

Why do you need global seccomp policies? The process knows better what
he needs to place in his sandbox, so attaching this from the process
itself makes more sense to me... Anyway, this reminds me to selinux.

Back to my main point, I would not expect we have to request sysadmins
to dump BPF bytecode to understand what global policy is being
enforced.


Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Pablo Neira Ayuso
Hi Daniel,

On Fri, Aug 19, 2016 at 12:35:14PM +0200, Daniel Mack wrote:
> Hi Pablo,
> 
> On 08/19/2016 11:19 AM, Pablo Neira Ayuso wrote:
> > On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
> >> I'd appreciate some feedback on this. Pablo has some remaining concerns
> >> about this approach, and I'd like to continue the discussion we had
> >> off-list in the light of this patchset.
> > 
> > OK, I'm going to summarize them here below:
> > 
> > * This new hook
> 
> "This" refers to your alternative to my patch set, right?
>
> > allows us to enforce an *administrative filtering
> >   policy* that must be visible to anyone with CAP_NET_ADMIN. This is
> >   easy to display in nf_tables as you can list the ruleset via the nft
> >   userspace tool. Otherwise, in your approach if a misconfigured
> >   filtering policy causes connectivity problems, I don't see how the
> >   sysadmin is going to have an easy way to troubleshoot what is going on.
> 
> True. That's the downside of bpf.
> 
> > * Interaction with other software. As I could read from your patch,
> >   what you propose will detach any previous existing filter. So I
> >   don't see how you can attach multiple filtering policies from
> >   different processes that don't cooperate each other.
> 
> Also true. A cgroup can currently only hold one bpf program for each
> direction, and they are supposed to be set from one controlling instance
> in the system. However, it is possible to create subcgroups, and install
> own programs in them, which will then be effective instead of the one in
> the parent. They will, however, replace each other in runtime behavior,
> and not be stacked. This is a fundamentally different approach than how
> nf_tables works of course.

I see, this looks problematic indeed, thanks for confirming this.

> > In nf_tables
> >   this is easy since they can create their own tables so they keep their
> >   ruleset in separate spaces. If the interaction is not OK, again the
> >   sysadmin can very quickly debug this since the policies would be
> >   visible via nf_tables ruleset listing.
> 
> True. Debugging would be much easier that way.
> 
> > So what I'm proposing goes in the direction of using the nf_tables
> > infrastructure instead:
> > 
> > * Add a new socket family for nf_tables with an input hook at
> >   sk_filter(). This just requires the new netfilter hook there and
> >   the boiler plate code to allow creating tables for this new family.
> >   And then we get access to many of the existing features in
> >   nf_tables for free.
> 
> Yes. However, when I proposed more or less exactly that back in
> September last year ("NF_INET_LOCAL_SOCKET_IN"), the concern raised by
> you and Florian Westphal was that this type of decision making is out of
> scope for netfilter, mostly because
>
> a) whether a userspace process is running should not have any influence
> in the netfilter behavior (which it does, because the rules are not
> processed when the local socket is cannot be determined)

We always have a socket at sk_filter(). This new socket family retains
this specific semantics.

> b) it is asymmetric, as it only exists for the input path

Unless strictly necessary, I would not add another earlier socket
egress hook if ipv4 and ipv6 LOCAL_OUT hook is enough for this.

> c) it's a change in netfilter paradigm, because rules for multicast
> receivers are run multiple times (once for each receiving task)

This is part of the semantics of this new socket family, so the user
is aware of this particular expensive multicast behaviour in this new
family.

> d) it was considered a sledgehammer solution for a something that very
> few people really need

I don't see why the current RFC is less heavyweight. I guess this was
true with the "hooks spread all over the transport layer" approach was
discussed, but not with the single sk_filter() hook we're discussing.

> I still think such a hook would be a good thing to have. As far as
> implementation goes, my patch set back then patched each of the
> protocols individually (ipv4, ipv6, dccp, sctp), while your idea to hook
> in to sk_filter sound much more reasonable.

That's it. Instead of having hooks spread all over the place per
protocol, the idea is to add a new socket family with a hook like
NF_SOCKET_INGRESS at sk_filter(), this new family encloses this new
specific socket semantics, that differs from the ip, ip6 and inet
family semantics.

> If the opinions on the previously raised concerns have changed, I'm
> happy to revisit.
> 
> > * We can quickly find a verdict on the packet using using any combination
> >   of selectors through concatenations and maps in nf_tables. In
> >   nf_tables we can express the policy with a non-linear ruleset.
> 
> That's another interesting detail that was discussed on NFWS, yes. We
> need a way to dispatch incoming packets without walking a linear
> dispatcher list. In the eBPF approach, that's very easy because the
> cgroup is directly associa

Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Alexei Starovoitov
On Fri, Aug 19, 2016 at 11:19:41AM +0200, Pablo Neira Ayuso wrote:
> Hi Daniel,
> 
> On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
> > I'd appreciate some feedback on this. Pablo has some remaining concerns
> > about this approach, and I'd like to continue the discussion we had
> > off-list in the light of this patchset.
> 
> OK, I'm going to summarize them here below:
> 
> * This new hook allows us to enforce an *administrative filtering
>   policy* that must be visible to anyone with CAP_NET_ADMIN. This is
>   easy to display in nf_tables as you can list the ruleset via the nft
>   userspace tool. Otherwise, in your approach if a misconfigured
>   filtering policy causes connectivity problems, I don't see how the
>   sysadmin is going to have an easy way to troubleshoot what is going on.
> 
> * Interaction with other software. As I could read from your patch,
>   what you propose will detach any previous existing filter. So I
>   don't see how you can attach multiple filtering policies from
>   different processes that don't cooperate each other. In nf_tables
>   this is easy since they can create their own tables so they keep their
>   ruleset in separate spaces. If the interaction is not OK, again the
>   sysadmin can very quickly debug this since the policies would be
>   visible via nf_tables ruleset listing.
> 
> * During the Netfilter Workshop, the main concern to add this new socket
>   ingress hook was that it is too specific. However this new hook in
>   the network stack looks way more specific more specific since *it only
>   works for cgroups*.
> 
> So what I'm proposing goes in the direction of using the nf_tables
> infrastructure instead:

Pablo, if you were proposing to do cgroups+nft as well as cgroups+bpf
we could have had much more productive discussion.
You were not participating in cgroup+bpf design and now bringing up
bogus points that make no sense to me. That's not helpful.
Please start another cgroups+nft thread and there we can discuss the
ways to do it cleanly without slowdown the stack.
netfilter hooks bloat the stack enough that some people compile them out.
If I were you, I'd focus on improving iptables/nft performance instead
of arguing about their coolness.

> Thanks for your patience on debating this!

I don't think you're sincere.



Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Daniel Borkmann

On 08/19/2016 11:19 AM, Pablo Neira Ayuso wrote:
[...]
> * During the Netfilter Workshop, the main concern to add this new socket

Don't really know what was discussed exactly at NFWS, but ...

>ingress hook was that it is too specific. However this new hook in
>the network stack looks way more specific more specific since *it only
>works for cgroups*.

... why would that be something so overly specific? I don't think that "it
only works for cgroups" would be a narrow use case. While the current
sk_filter() BPF policies are set from application level, it makes sense to
me to have an option for an entity that manages the cgroups to apply an
external policy for networking side as well for participating processes.
It seems like a useful extension to the current sk_filter() infrastructure
iff we carve out the details properly and generic enough, and besides ...

[...]
On 08/19/2016 12:35 PM, Daniel Mack wrote:
[...]

So - I don't know. The whole 'eBPF in cgroups' idea was born because
through the discussions over the past months we had on all this, it
became clear to me that netfilter is not the right place for filtering
on local tasks. I agree the solution I am proposing in my patch set has
its downsides, mostly when it comes to transparency to users, but I
considered that acceptable. After all, we have eBPF users all over the
place in the kernel already, and seccomp, for instance, isn't any better
in that regard.


... since you mention seccomp here as well, it would be another good fit
as a program subtype to apply syscall policies for those participants on
a cgroup level too, f.e. to disallow certain syscalls. It would be quite
similar conceptually. So, fwiw, if this is being designed generic enough,
the use cases would go much broader than that.

Cheers,
Daniel


Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Daniel Mack
Hi Pablo,

On 08/19/2016 11:19 AM, Pablo Neira Ayuso wrote:
> On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
>> I'd appreciate some feedback on this. Pablo has some remaining concerns
>> about this approach, and I'd like to continue the discussion we had
>> off-list in the light of this patchset.
> 
> OK, I'm going to summarize them here below:
> 
> * This new hook

"This" refers to your alternative to my patch set, right?

> allows us to enforce an *administrative filtering
>   policy* that must be visible to anyone with CAP_NET_ADMIN. This is
>   easy to display in nf_tables as you can list the ruleset via the nft
>   userspace tool. Otherwise, in your approach if a misconfigured
>   filtering policy causes connectivity problems, I don't see how the
>   sysadmin is going to have an easy way to troubleshoot what is going on.

True. That's the downside of bpf.

> * Interaction with other software. As I could read from your patch,
>   what you propose will detach any previous existing filter. So I
>   don't see how you can attach multiple filtering policies from
>   different processes that don't cooperate each other.

Also true. A cgroup can currently only hold one bpf program for each
direction, and they are supposed to be set from one controlling instance
in the system. However, it is possible to create subcgroups, and install
own programs in them, which will then be effective instead of the one in
the parent. They will, however, replace each other in runtime behavior,
and not be stacked. This is a fundamentally different approach than how
nf_tables works of course.

> In nf_tables
>   this is easy since they can create their own tables so they keep their
>   ruleset in separate spaces. If the interaction is not OK, again the
>   sysadmin can very quickly debug this since the policies would be
>   visible via nf_tables ruleset listing.

True. Debugging would be much easier that way.

> So what I'm proposing goes in the direction of using the nf_tables
> infrastructure instead:
> 
> * Add a new socket family for nf_tables with an input hook at
>   sk_filter(). This just requires the new netfilter hook there and
>   the boiler plate code to allow creating tables for this new family.
>   And then we get access to many of the existing features in
>   nf_tables for free.

Yes. However, when I proposed more or less exactly that back in
September last year ("NF_INET_LOCAL_SOCKET_IN"), the concern raised by
you and Florian Westphal was that this type of decision making is out of
scope for netfilter, mostly because

a) whether a userspace process is running should not have any influence
in the netfilter behavior (which it does, because the rules are not
processed when the local socket is cannot be determined)

b) it is asymmetric, as it only exists for the input path

c) it's a change in netfilter paradigm, because rules for multicast
receivers are run multiple times (once for each receiving task)

d) it was considered a sledgehammer solution for a something that very
few people really need


I still think such a hook would be a good thing to have. As far as
implementation goes, my patch set back then patched each of the
protocols individually (ipv4, ipv6, dccp, sctp), while your idea to hook
in to sk_filter sound much more reasonable.

If the opinions on the previously raised concerns have changed, I'm
happy to revisit.

> * We can quickly find a verdict on the packet using using any combination
>   of selectors through concatenations and maps in nf_tables. In
>   nf_tables we can express the policy with a non-linear ruleset.

That's another interesting detail that was discussed on NFWS, yes. We
need a way to dispatch incoming packets without walking a linear
dispatcher list. In the eBPF approach, that's very easy because the
cgroup is directly associated with the receiving socket, so the lookup
of the effective eBPF programs is really fast.

If we can achieve similar things with nf_tables and maps, then that
should be applicable as well.

> On
>   top of this, by delaying the nf_reset() calls we can reach the
>   conntrack information from sk_filter(). That would be useful to skip
>   evaluating packets that belong to already established flows. Thus, we
>   incur the performance penalty in classifying only for the first
>   packet of the flow.

If that's possible, that's an interesting feature, but at least for
accounting, we need to run the rules for all packets, always.

> * We can skip the socket egress hook (that you don't know where to place
>   yet) since you can use the existing local output hook in netfilter that
>   is available for IPv4 and IPv6.

If asymmetry is not a no-go anymore, that sounds fine to me.

> * This new hook would fit into the existing netfilter set of hooks,
>   the sysadmin is already familiarized with the administrative
>   infrastructure to define filtering policies in our stack, so adding this
>   new hook to what we have looks natural to me.

At least for inspecti

Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Pablo Neira Ayuso
Hi Daniel,

On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
> I'd appreciate some feedback on this. Pablo has some remaining concerns
> about this approach, and I'd like to continue the discussion we had
> off-list in the light of this patchset.

OK, I'm going to summarize them here below:

* This new hook allows us to enforce an *administrative filtering
  policy* that must be visible to anyone with CAP_NET_ADMIN. This is
  easy to display in nf_tables as you can list the ruleset via the nft
  userspace tool. Otherwise, in your approach if a misconfigured
  filtering policy causes connectivity problems, I don't see how the
  sysadmin is going to have an easy way to troubleshoot what is going on.

* Interaction with other software. As I could read from your patch,
  what you propose will detach any previous existing filter. So I
  don't see how you can attach multiple filtering policies from
  different processes that don't cooperate each other. In nf_tables
  this is easy since they can create their own tables so they keep their
  ruleset in separate spaces. If the interaction is not OK, again the
  sysadmin can very quickly debug this since the policies would be
  visible via nf_tables ruleset listing.

* During the Netfilter Workshop, the main concern to add this new socket
  ingress hook was that it is too specific. However this new hook in
  the network stack looks way more specific more specific since *it only
  works for cgroups*.

So what I'm proposing goes in the direction of using the nf_tables
infrastructure instead:

* Add a new socket family for nf_tables with an input hook at
  sk_filter(). This just requires the new netfilter hook there and
  the boiler plate code to allow creating tables for this new family.
  And then we get access to many of the existing features in
  nf_tables for free.

* We can quickly find a verdict on the packet using using any combination
  of selectors through concatenations and maps in nf_tables. In
  nf_tables we can express the policy with a non-linear ruleset. On
  top of this, by delaying the nf_reset() calls we can reach the
  conntrack information from sk_filter(). That would be useful to skip
  evaluating packets that belong to already established flows. Thus, we
  incur the performance penalty in classifying only for the first
  packet of the flow.

* We can skip the socket egress hook (that you don't know where to place
  yet) since you can use the existing local output hook in netfilter that
  is available for IPv4 and IPv6.

* This new hook would fit into the existing netfilter set of hooks,
  the sysadmin is already familiarized with the administrative
  infrastructure to define filtering policies in our stack, so adding this
  new hook to what we have looks natural to me.

Thanks for your patience on debating this!


[RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-17 Thread Daniel Mack
This patch set allows eBPF programs for network filtering and
accounting to be attached to cgroups, so that they apply to all sockets
of all tasks placed in that cgroup. The logic also allows to be
extendeded for other cgroup-based eBPF logic.

In short, the patch set adds the following:

* A new eBPF program type BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  which is identical to BPF_PROG_TYPE_SOCKET_FILTER for now

* CONFIG_CGROUP_BPF, which guards every new detail affected by this
  patch set

* Two pointers to struct cgroup to store eBPF programs for socket
  filtering

* Two new bpf(2) syscall commands, BPF_PROG_ATTACH and BPF_PROG_DETACH,
  which are kept generic at the kernel API level in a sense that can
  be reused to attach programs to other entities than cgroups. All that
  is passed in to kernel is a file descriptor.

* A hook in the networking ingress path to run these programs

* A userspace program that demonstrates how to use that new feature


I'd appreciate some feedback on this. Pablo has some remaining concerns
about this approach, and I'd like to continue the discussion we had
off-list in the light of this patchset.

Also, I currently lack an idea for a good place to hook up the egress
filter. If anyone has a good idea, I'd be happy to hear about it.


Thanks,
Daniel


Daniel Mack (5):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add bpf_{e,in}gress pointers
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF programs
  samples: bpf: add userspace example for attaching eBPF programs to
cgroups

 include/linux/cgroup-defs.h |   6 ++
 include/uapi/linux/bpf.h|  15 +
 init/Kconfig|   8 +++
 kernel/bpf/syscall.c| 132 ++
 kernel/bpf/verifier.c   |   1 +
 kernel/cgroup.c |   9 +++
 net/core/filter.c   |  50 +++
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 +++
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 136 
 11 files changed, 383 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

-- 
2.5.5