On 2 November 2016 at 04:48, Hannes Frederic Sowa
<han...@stressinduktion.org> wrote:
> On Wed, Nov 2, 2016, at 00:07, Tom Herbert wrote:
>> On the other hand, I'm not really sure how to implement this at this
>> level of performance in LWT+BPF either. It seems like one way to do
>> that would be to create a program for each destination and set it for
>> each host. As you point out, that would create a million different
>> programs, which doesn't seem manageable. I don't think the BPF map
>> works either since
>> that implies we need a lookup (?). It seems like what we need is one
>> program but allow it to be parameterized with per destination
>> information saved in the route (LWT structure).
>
> Yes, that is my proposal. Just using the dst entry as meta-data (which
> can actually also be an ID for the network namespace the packet is
> coming from).

I have no objection to doing this on top of this series.

> My concern with using BPF is that the rest of the kernel doesn't really
> see the semantics and can't optimize or cache at specific points,
> because the kernel cannot introspect what the BPF program does (for
> metadata manipulation, one can e.g. specify that the program is "pure",
> and always provides the same output for some specified given input, thus
> things can be cached and memorized, but that framework seems very hard
> to build).

So you want to reintroduce a routing cache? Each packet needs to pass
through the BPF program anyway for accounting purposes. This is not
just about getting the packets out the right nexthop in the fastest
possible way.

> I also fear this becomes a kernel by-pass:
>
> It might be very hard e.g. to apply NFT/netfilter to such packets, if
> e.g. a redirect happens suddenly and packet flow is diverted from the
> one the user sees currently based on the interfaces and routing tables.

The LWT xmit hook is after the POST_ROUTING hook. The input and output
hooks cannot redirect, and output will become read-only just like input
already is. We are not bypassing anything. Please stop throwing the
word bypass around. This is just a false claim.

> That's why I am in favor of splitting this patchset down and allow the
> policies that should be expressed by BPF programs being applied to the
> specific subsystems (I am not totally against a generic BPF hook in
> input or output of the protocol engines). E.g. can we deal with static
> rewriting of L2 addresses in the neighbor cache? We already provide a
> fast header cache for L2 data which might be used here?

Split what? What policies?

I have two primary use cases for this:
1) Traffic into local containers: Containers are only supposed to do
L3, all L2 traffic is dropped for security reasons. The L2 header for
any packets in and out of the container is fixed and does not require
any sort of resolving. In order to feed packets from the local host
into the containers, a route with the container prefix is set up. It
points to a nexthop address which appears behind a veth pair. A BPF
program is listening at tc ingress on the veth pair and will enforce
policies and do accounting. It requires very ugly hacks because Linux
does not like to do forwarding to an address which is considered
local. It works but it is a hack.

What I want to do instead is to run the BPF program for the route
directly, apply the policies, do accounting, push the fixed dummy L2
header and redirect it to the container. If someone has netfilter
rules installed, they will still apply. Nothing is hidden.
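As a rough illustration of the "push the fixed dummy L2 header" step, here
is a minimal user-space sketch in plain C. It is not kernel or BPF code and
not taken from the patch series: `struct pkt`, `push_fixed_l2()`, the two
MAC addresses and the IPv6 EtherType are all hypothetical names and values
chosen for the example. It only shows that, because the header is fixed,
prepending it needs no neighbour resolution, just enough headroom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define L2_ALEN 6        /* bytes in a MAC address */
#define L2_HLEN 14       /* dst MAC + src MAC + EtherType */
#define L2_PROTO 0x86DD  /* assumption: the container traffic is IPv6 */

/* Hypothetical fixed addresses; nothing is ever resolved. */
static const uint8_t FIXED_DST[L2_ALEN] = {0x02, 0, 0, 0, 0, 0x01};
static const uint8_t FIXED_SRC[L2_ALEN] = {0x02, 0, 0, 0, 0, 0x02};

/* Toy packet buffer: data points at the L3 header, head at the buffer start. */
struct pkt {
    uint8_t *head;
    uint8_t *data;
    size_t len;
};

/* Prepend the fixed L2 header. Returns 0 on success, -1 if the buffer
 * does not have L2_HLEN bytes of headroom in front of the L3 header. */
static int push_fixed_l2(struct pkt *p)
{
    assert(p != NULL && p->data >= p->head);

    if ((size_t)(p->data - p->head) < L2_HLEN)
        return -1;

    p->data -= L2_HLEN;
    p->len += L2_HLEN;
    memcpy(p->data, FIXED_DST, L2_ALEN);
    memcpy(p->data + L2_ALEN, FIXED_SRC, L2_ALEN);
    p->data[12] = (uint8_t)(L2_PROTO >> 8);   /* EtherType, network order */
    p->data[13] = (uint8_t)(L2_PROTO & 0xff);
    return 0;
}
```

In the setup described above, the policy and accounting steps would run
before this and the redirect to the container after it; the sketch leaves
those out.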

2) External traffic that is coming in: We have a BPF program
listening on tc ingress which matches on the destination address on
all incoming traffic. If the packet is for a container, we perform
the same actions as above. In this case we are bypassing the routing
table. This is ugly. What I want to do instead is to have the
container prefix invoke the BPF program, so that all packets have a
route lookup and netfilter filtering performed. Only after that is
the BPF program invoked, and exclusively for the packets destined for
local containers. Yes, it would be possible to redirect into a
temporary veth again and listen on that, but that again requires
faking an L2 segment, which is just unnecessary and slow.

This is not hiding anything and it is not bypassing anything.
