[PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
{Open question:
 Tom brought up the question on whether it is safe to modify the packet
 in arbitrary ways before dst_output(). This is equivalent to a raw
 socket injecting illegal headers. This v2 currently assumes that
 dst_output() is ready to accept invalid header values. This needs to be
 verified and if not the case, then raw sockets or dst_output() handlers
 must be fixed as well. Another option is to mark lwtunnel_output() as
 read-only for now.}

This series implements BPF program invocation from dst entries via the
lightweight tunnels infrastructure. The BPF program can be attached to
lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
skb as context. The input hook is read-only; the output hook can
write; the xmit hook can write, push headers, and redirect.

Motivation for this work:
 - Restricting outgoing routes beyond what the route tuple supports
 - Per route accounting beyond realms
 - Fast attachment of L2 headers where the header does not require
   resolving L2 addresses
 - ILA-like use cases where L3 addresses are resolved and then routed
   in an async manner
 - Fast encapsulation + redirect. For now limited to use cases where not
   setting inner and outer offset/protocol is OK.

A couple of samples on how to use it can be found in patch 04.
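
For a flavour of the API, a minimal sketch of such a program in the
style of the patch 04 samples (section naming and the iproute2 encap
syntax shown here are illustrative and may differ from the final
tooling):

  /* lwt_ok.c - trace and pass every packet on the attached route */
  #include <stdint.h>
  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("do_out")
  int do_out(struct __sk_buff *skb)
  {
          char msg[] = "lwt out: %d bytes\n";

          bpf_trace_printk(msg, sizeof(msg), skb->len);
          return BPF_OK;  /* continue on the route; BPF_DROP discards */
  }

  char _license[] SEC("license") = "GPL";

attached to a route roughly like:

  ip route add 192.168.253.0/24 encap bpf out obj lwt_ok.o \
          section do_out dev veth0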

v1 -> v2:
 - Added new BPF_LWT_REROUTE return code for the program to indicate
   that a new route lookup should be performed. Suggested by Tom.
 - New sample to illustrate rerouting
 - New patch 05: Recursion limit for lwtunnel_output for the case
   when a user creates a circular dst redirection. Also resolves the
   issue for ILA.
 - Fix to ensure that headroom for a potential future L2 header is
   still guaranteed

Thomas Graf (5):
  route: Set orig_output when redirecting to lwt on locally generated
traffic
  route: Set lwtstate for local traffic and cached input dsts
  bpf: BPF for lightweight tunnel encapsulation
  bpf: Add samples for LWT-BPF
  lwtunnel: Limit number of recursions on output to 5

 include/linux/filter.h        |   2 +-
 include/uapi/linux/bpf.h      |  37 +++-
 include/uapi/linux/lwtunnel.h |  21 ++
 kernel/bpf/verifier.c         |  16 +-
 net/Kconfig                   |   1 +
 net/core/Makefile             |   2 +-
 net/core/filter.c             | 148 +++++++-
 net/core/lwt_bpf.c            | 504 +++++++++++++
 net/core/lwtunnel.c           |  15 +-
 net/ipv4/route.c              |  37 +++-
 samples/bpf/bpf_helpers.h     |   4 +
 samples/bpf/lwt_bpf.c         | 235 ++++++
 samples/bpf/test_lwt_bpf.sh   | 370 ++++++++++
 13 files changed, 1373 insertions(+), 19 deletions(-)
 create mode 100644 net/core/lwt_bpf.c
 create mode 100644 samples/bpf/lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

-- 
2.7.4



Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Hannes Frederic Sowa
Hi Thomas,

On 01.11.2016 01:37, Thomas Graf wrote:
> {Open question:
>  Tom brought up the question on whether it is safe to modify the packet
>  in arbitrary ways before dst_output(). This is equivalent to a raw
>  socket injecting illegal headers. This v2 currently assumes that
>  dst_output() is ready to accept invalid header values. This needs to be
>  verified and if not the case, then raw sockets or dst_output() handlers
>  must be fixed as well. Another option is to mark lwtunnel_output() as
>  read-only for now.}
> 
> This series implements BPF program invocation from dst entries via the
> lightweight tunnels infrastructure. The BPF program can be attached to
> lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
> skb as context. The input hook is read-only; the output hook can
> write; the xmit hook can write, push headers, and redirect.
> 
> Motivation for this work:
>  - Restricting outgoing routes beyond what the route tuple supports
>  - Per route accounting beyond realms
>  - Fast attachment of L2 headers where the header does not require
>    resolving L2 addresses
>  - ILA-like use cases where L3 addresses are resolved and then routed
>    in an async manner
>  - Fast encapsulation + redirect. For now limited to use cases where not
>    setting inner and outer offset/protocol is OK.
> 
> A couple of samples on how to use it can be found in patch 04.

I do fear the complexity and debuggability problems introduced by this
patch set quite a bit.

I wonder if architecturally it would be more feasible to add a generic
(BPF-)hook into dst_output(_sk) and allow arbitrary metadata to be
added into the dsts.

BPF could then access parts of the metadata in the attached
metadata dst entries and perform the matching this way?
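
A purely hypothetical sketch of that single-program model, using the
existing route realm (bpf_get_route_realm()) as a stand-in for richer
per-dst metadata and tc egress as a stand-in for the proposed
dst_output() hook:

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include "bpf_helpers.h"

  static unsigned int (*get_route_realm)(void *ctx) =
          (void *) BPF_FUNC_get_route_realm;

  SEC("classifier")
  int match_on_dst(struct __sk_buff *skb)
  {
          /* One program sees every packet and branches on per-dst
           * metadata instead of being selected by the route. */
          switch (get_route_realm(skb)) {
          case 1:  /* e.g. container prefixes */
                  return TC_ACT_OK;
          case 2:  /* e.g. VPN prefixes */
                  return TC_ACT_OK;
          default:
                  return TC_ACT_OK;
          }
  }

  char _license[] SEC("license") = "GPL";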

The reason why I would prefer an approach like this: regardless of the
routing lookup, we would process the skb with BPF or not. This gives a
single point to debug, whereas in your approach we first must
figure out the corresponding BPF program and then check for it specifically.

Thanks,
Hannes



Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Tom Herbert
On Mon, Oct 31, 2016 at 5:37 PM, Thomas Graf wrote:
> {Open question:
>  Tom brought up the question on whether it is safe to modify the packet
>  in arbitrary ways before dst_output(). This is equivalent to a raw
>  socket injecting illegal headers. This v2 currently assumes that
>  dst_output() is ready to accept invalid header values. This needs to be
>  verified and if not the case, then raw sockets or dst_output() handlers
>  must be fixed as well. Another option is to mark lwtunnel_output() as
>  read-only for now.}
>
The question might not be so much about illegal headers but whether
fields in the skbuff related to the packet contents are kept correct.
We have protocol, header offsets, offsets for inner protocols also,
encapsulation settings, checksum status, checksum offset, checksum
complete value, vlan information. Any or all of these, I believe, could
be made incorrect if we allow the packet to be
arbitrarily modified by BPF. This problem is different from raw
sockets because LWT operates in the middle of the stack, where the
skbuff has already been set up with such things.

> This series implements BPF program invocation from dst entries via the
> lightweight tunnels infrastructure. The BPF program can be attached to
> lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
> skb as context. The input hook is read-only; the output hook can
> write; the xmit hook can write, push headers, and redirect.
>
> Motivation for this work:
>  - Restricting outgoing routes beyond what the route tuple supports
>  - Per route accounting beyond realms
>  - Fast attachment of L2 headers where the header does not require
>    resolving L2 addresses
>  - ILA-like use cases where L3 addresses are resolved and then routed
>    in an async manner
>  - Fast encapsulation + redirect. For now limited to use cases where not
>    setting inner and outer offset/protocol is OK.
>
Is checksum offload supported? By default, at least for Linux, we
offload the outer UDP checksum in VXLAN and the other UDP
encapsulations for performance.

Tom

> A couple of samples on how to use it can be found in patch 04.
>
> v1 -> v2:
>  - Added new BPF_LWT_REROUTE return code for program to indicate
>that new route lookup should be performed. Suggested by Tom.
>  - New sample to illustrate rerouting
>  - New patch 05: Recursion limit for lwtunnel_output for the case
>when user creates circular dst redirection. Also resolves the
>issue for ILA.
>  - Fix to ensure headroom for potential future L2 header is still
>guaranteed
>
> Thomas Graf (5):
>   route: Set orig_output when redirecting to lwt on locally generated
> traffic
>   route: Set lwtstate for local traffic and cached input dsts
>   bpf: BPF for lightweight tunnel encapsulation
>   bpf: Add samples for LWT-BPF
>   lwtunnel: Limit number of recursions on output to 5
>
>  include/linux/filter.h        |   2 +-
>  include/uapi/linux/bpf.h      |  37 +++-
>  include/uapi/linux/lwtunnel.h |  21 ++
>  kernel/bpf/verifier.c         |  16 +-
>  net/Kconfig                   |   1 +
>  net/core/Makefile             |   2 +-
>  net/core/filter.c             | 148 +++++++-
>  net/core/lwt_bpf.c            | 504 +++++++++++++
>  net/core/lwtunnel.c           |  15 +-
>  net/ipv4/route.c              |  37 +++-
>  samples/bpf/bpf_helpers.h     |   4 +
>  samples/bpf/lwt_bpf.c         | 235 ++++++
>  samples/bpf/test_lwt_bpf.sh   | 370 ++++++++++
>  13 files changed, 1373 insertions(+), 19 deletions(-)
>  create mode 100644 net/core/lwt_bpf.c
>  create mode 100644 samples/bpf/lwt_bpf.c
>  create mode 100755 samples/bpf/test_lwt_bpf.sh
>
> --
> 2.7.4
>


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 09:17, Tom Herbert wrote:
> On Mon, Oct 31, 2016 at 5:37 PM, Thomas Graf wrote:
>> {Open question:
>>  Tom brought up the question on whether it is safe to modify the packet
>>  in arbitrary ways before dst_output(). This is equivalent to a raw
>>  socket injecting illegal headers. This v2 currently assumes that
>>  dst_output() is ready to accept invalid header values. This needs to be
>>  verified and if not the case, then raw sockets or dst_output() handlers
>>  must be fixed as well. Another option is to mark lwtunnel_output() as
>>  read-only for now.}
>>
> The question might not be so much about illegal headers but whether
> fields in the skbuff related to the packet contents are kept correct.
> We have protocol, header offsets, offsets for inner protocols also,
> encapsulation settings, checksum status, checksum offset, checksum

The headers cannot be extended or reduced so the offsets always remain
correct. What can happen is that the header contains invalid data.

> complete value, vlan information. Any or all of these, I believe, could
> be made incorrect if we allow the packet to be
> arbitrarily modified by BPF. This problem is different from raw
> sockets because LWT operates in the middle of the stack, where the
> skbuff has already been set up with such things.

You keep saying this "middle of the stack" but the point is exactly
the same as a raw socket with IPPROTO_RAW and hdrincl, see
rawv6_sendmsg() and rawv6_send_hdrincl(). An IPv6 raw socket can feed
arbitrary garbage into dst_output(). IPv4 does some minimal sanity
checks.
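
For reference, the userspace path being compared against is as small as
this sketch (addresses and sizes illustrative); as root, it hands 40
bytes of 0xff to dst_output() as an "IPv6 header":

  #include <string.h>
  #include <sys/socket.h>
  #include <netinet/in.h>

  int main(void)
  {
          struct sockaddr_in6 dst = {
                  .sin6_family = AF_INET6,
                  .sin6_addr = IN6ADDR_LOOPBACK_INIT,
          };
          char pkt[40];  /* one IPv6 header's worth of bytes */
          int fd = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);

          if (fd < 0)
                  return 1;
          /* IPPROTO_RAW implies header inclusion: these bytes reach
           * dst_output() with no validation of version or lengths. */
          memset(pkt, 0xff, sizeof(pkt));
          return sendto(fd, pkt, sizeof(pkt), 0,
                        (struct sockaddr *)&dst, sizeof(dst)) < 0;
  }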

If this is a concern I'm fine with making the dst_output path read-only for now.

>> This series implements BPF program invocation from dst entries via the
>> lightweight tunnels infrastructure. The BPF program can be attached to
>> lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
>> skb as context. The input hook is read-only; the output hook can
>> write; the xmit hook can write, push headers, and redirect.
>>
>> Motivation for this work:
>>  - Restricting outgoing routes beyond what the route tuple supports
>>  - Per route accounting beyond realms
>>  - Fast attachment of L2 headers where the header does not require
>>    resolving L2 addresses
>>  - ILA-like use cases where L3 addresses are resolved and then routed
>>    in an async manner
>>  - Fast encapsulation + redirect. For now limited to use cases where not
>>    setting inner and outer offset/protocol is OK.
>>
> Is checksum offload supported? By default, at least for Linux, we
> offload the outer UDP checksum in VXLAN and the other UDP
> encapsulations for performance.

No. UDP encap is done by setting a tunnel key through a helper and
letting the encapsulation device handle this. I don't currently see a
point in replicating all of that logic.
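
For illustration, that model looks roughly like the following sketch.
It assumes the tunnel key helper is usable at the xmit hook and that a
collect_md VXLAN device exists at a known ifindex; both are assumptions
of the example, not statements about this series:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  #define VXLAN_IFINDEX 8  /* assumed collect_md vxlan device */

  SEC("tunnel_xmit")
  int tunnel_xmit(struct __sk_buff *skb)
  {
          struct bpf_tunnel_key key = {};

          key.tunnel_id = 42;
          key.remote_ipv4 = 0xac100164;  /* 172.16.1.100, host order */
          if (bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0) < 0)
                  return BPF_DROP;
          /* The vxlan device performs the UDP encap and applies its
           * configured checksum offload. */
          return bpf_redirect(VXLAN_IFINDEX, 0);
  }

  char _license[] SEC("license") = "GPL";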


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Tom Herbert
On Tue, Nov 1, 2016 at 11:20 AM, Thomas Graf wrote:
> On 1 November 2016 at 09:17, Tom Herbert wrote:
>> On Mon, Oct 31, 2016 at 5:37 PM, Thomas Graf wrote:
>>> {Open question:
>>>  Tom brought up the question on whether it is safe to modify the packet
>>>  in arbitrary ways before dst_output(). This is equivalent to a raw
>>>  socket injecting illegal headers. This v2 currently assumes that
>>>  dst_output() is ready to accept invalid header values. This needs to be
>>>  verified and if not the case, then raw sockets or dst_output() handlers
>>>  must be fixed as well. Another option is to mark lwtunnel_output() as
>>>  read-only for now.}
>>>
>> The question might not be so much about illegal headers but whether
>> fields in the skbuff related to the packet contents are kept correct.
>> We have protocol, header offsets, offsets for inner protocols also,
>> encapsulation settings, checksum status, checksum offset, checksum
>
> The headers cannot be extended or reduced so the offsets always remain
> correct. What can happen is that the header contains invalid data.
>
If we can't add/remove headers then doesn't that really limit the
utility of these patches? My assumption was that BPF+LWT is needed to
allow users to define and implement their own encapsulations, EH
insertion, packet modification, etc.

>> complete value, vlan information. Any or all of these, I believe, could
>> be made incorrect if we allow the packet to be
>> arbitrarily modified by BPF. This problem is different from raw
>> sockets because LWT operates in the middle of the stack, where the
>> skbuff has already been set up with such things.
>
> You keep saying this "middle of the stack" but the point is exactly
> the same as a raw socket with IPPROTO_RAW and hdrincl, see
> rawv6_sendmsg() and rawv6_send_hdrincl(). An IPv6 raw socket can feed
> arbitrary garbage into dst_output(). IPv4 does some minimal sanity
> checks.
>
What I mean is that an admin can create a BPF program that runs on any
user packets (for instance, it could be set on the default route). This
would be in the path of TCP, UDP, and other protocols tightly integrated
with the stack. Packets being routed may be encapsulated, have VLAN
tags, checksum offload, GRO applied, etc. They might also be looped back,
in which case the settings in the skbuff become receive parameters.

> If this is a concern I'm fine with making the dst_output path read-only for 
> now.
>
That might be good. The ramifications of allowing an open-ended
method for users to modify L3/L2 packets need more consideration.

>>> This series implements BPF program invocation from dst entries via the
>>> lightweight tunnels infrastructure. The BPF program can be attached to
>>> lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
>>> skb as context. The input hook is read-only; the output hook can
>>> write; the xmit hook can write, push headers, and redirect.
>>>
>>> Motivation for this work:
>>>  - Restricting outgoing routes beyond what the route tuple supports
>>>  - Per route accounting beyond realms
>>>  - Fast attachment of L2 headers where the header does not require
>>>    resolving L2 addresses
>>>  - ILA-like use cases where L3 addresses are resolved and then routed
>>>    in an async manner
>>>  - Fast encapsulation + redirect. For now limited to use cases where not
>>>    setting inner and outer offset/protocol is OK.
>>>
>> Is checksum offload supported? By default, at least for Linux, we
>> offload the outer UDP checksum in VXLAN and the other UDP
>> encapsulations for performance.
>
> No. UDP encap is done by setting a tunnel key through a helper and
> letting the encapsulation device handle this. I don't currently see a
> point in replicating all of that logic.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 03:54, Hannes Frederic Sowa wrote:
> I do fear the complexity and debuggability problems introduced by this
> patch set quite a bit.

What is the complexity concern? This is pretty straightforward. I
agree on debuggability. This is being worked on separately, as Alexei
mentioned, to address this for all BPF integrations.

> I wonder if architecturally it would be more feasible to add a generic
> (BPF-)hook into dst_output(_sk) and allow arbitrary metadata to be
> added into the dsts.
>
> BPF could then access parts of the metadata in the attached
> metadata dst entries and perform the matching this way?

If I understand you correctly then a single BPF program would be
loaded which then applies to all dst_output() calls? This has a huge
drawback, instead of multiple small BPF programs which do exactly what
is required per dst, a large BPF program is needed which matches on
metadata. That's way slower and renders one of the biggest advantages
of BPF invalid, the ability to generate a small program tailored to
a particular use. See Cilium.

> The reason why I would prefer an approach like this: regardless of the
> routing lookup, we would process the skb with BPF or not. This gives a
> single point to debug, whereas in your approach we first must
> figure out the corresponding BPF program and then check for it specifically.

Not sure I see what kind of advantage this actually provides. You can
dump the routes and see which programs get invoked and which section.
If it's based on metadata then you need to know the program logic and
associate it with the metadata in the dst. It actually doesn't get
much easier than to debug one of the samples, they are completely
static once compiled and it's very simple to verify if they do what
they are supposed to do.

If you like the single program approach, feel free to load the same
program for every dst. Perfectly acceptable but I don't see why we
should force everybody to use that model.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 11:51, Tom Herbert wrote:
> On Tue, Nov 1, 2016 at 11:20 AM, Thomas Graf wrote:
>> The headers cannot be extended or reduced so the offsets always remain
>> correct. What can happen is that the header contains invalid data.
>>
> If we can't add/remove headers then doesn't that really limit the
> utility of these patches? My assumption was that BPF+LWT is needed to
> allow users to define and implement their own encapsulations, EH
> insertion, packet modification, etc.

I thought you were specifically referring to lwtunnel_output(). You
can extend headers at lwtunnel_xmit() but you can't for
lwtunnel_output() and lwtunnel_input(). The xmit hook is in
ip_finish_output2() where it is fine to do so. We have existing code
doing this there.
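
A sketch of what the xmit hook permits, assuming the header push helper
added by this series (the MAC addresses are placeholders):

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include "bpf_helpers.h"

  SEC("push_l2")
  int push_l2(struct __sk_buff *skb)
  {
          struct ethhdr eh = {
                  .h_dest   = { 0x02, 0, 0, 0, 0, 0x02 },
                  .h_source = { 0x02, 0, 0, 0, 0, 0x01 },
                  .h_proto  = __constant_htons(ETH_P_IP),
          };

          /* Grow headroom; only legal at lwtunnel_xmit(). */
          if (bpf_skb_change_head(skb, sizeof(eh), 0))
                  return BPF_DROP;
          if (bpf_skb_store_bytes(skb, 0, &eh, sizeof(eh), 0))
                  return BPF_DROP;
          return BPF_OK;
  }

  char _license[] SEC("license") = "GPL";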

>> You keep saying this "middle of the stack" but the point is exactly
>> the same as a raw socket with IPPROTO_RAW and hdrincl, see
>> rawv6_sendmsg() and rawv6_send_hdrincl(). An IPv6 raw socket can feed
>> arbitrary garbage into dst_output(). IPv4 does some minimal sanity
>> checks.
>>
> What I mean is that an admin can create a BPF program that runs on any
> user packets (for instance, it could be set on the default route). This
> would be in the path of TCP, UDP, and other protocols tightly integrated
> with the stack. Packets being routed may be encapsulated, have VLAN
> tags, checksum offload, GRO applied, etc. They might also be looped back,
> in which case the settings in the skbuff become receive parameters.

You can do the same with act_pedit or cls_bpf at dev_queue_xmit()
before or after you go into the encapsulation device. This is a tool
for root, if you install a drop all ssh rule then you won't be able to
log into the box anymore. If you modify the packet and render it
invalid, the packet will be dropped.
If you attach a VLAN header while VLAN offload is already set up, then
the hardware will add another VLAN header, this is what I would
expect. I truly hope that we don't have code which crashes if we
dev_queue_xmit() garbage into any device. That would be horrible.

Do you have a specific example that could be a problem?


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Tom Herbert
On Tue, Nov 1, 2016 at 12:11 PM, Thomas Graf wrote:
> On 1 November 2016 at 11:51, Tom Herbert wrote:
>> On Tue, Nov 1, 2016 at 11:20 AM, Thomas Graf wrote:
>>> The headers cannot be extended or reduced so the offsets always remain
>>> correct. What can happen is that the header contains invalid data.
>>>
>> If we can't add/remove headers then doesn't that really limit the
>> utility of these patches? My assumption was that BPF+LWT is needed to
>> allow users to define and implement their own encapsulations, EH
>> insertion, packet modification, etc.
>
> I thought you were specifically referring to lwtunnel_output(). You
> can extend headers at lwtunnel_xmit() but you can't for
> lwtunnel_output() and lwtunnel_input(). The xmit hook is in
> ip_finish_output2() where it is fine to do so. We have existing code
> doing this there.
>
>>> You keep saying this "middle of the stack" but the point is exactly
>>> the same as a raw socket with IPPROTO_RAW and hdrincl, see
>>> rawv6_sendmsg() and rawv6_send_hdrincl(). An IPv6 raw socket can feed
>>> arbitrary garbage into dst_output(). IPv4 does some minimal sanity
>>> checks.
>>>
>> What I mean is that an admin can create a BPF program that runs on any
>> user packets (for instance, it could be set on the default route). This
>> would be in the path of TCP, UDP, and other protocols tightly integrated
>> with the stack. Packets being routed may be encapsulated, have VLAN
>> tags, checksum offload, GRO applied, etc. They might also be looped back,
>> in which case the settings in the skbuff become receive parameters.
>
> You can do the same with act_pedit or cls_bpf at dev_queue_xmit()
> before or after you go into the encapsulation device. This is a tool
> for root, if you install a drop all ssh rule then you won't be able to
> log into the box anymore. If you modify the packet and render it
> invalid, the packet will be dropped.
> If you attach a VLAN header while VLAN offload is already set up, then
> the hardware will add another VLAN header, this is what I would
> expect. I truly hope that we don't have code which crashes if we
> dev_queue_xmit() garbage into any device. That would be horrible.
>
> Do you have a specific example that could be a problem?

I believe Alexander was dealing with problems where drivers were not
properly handling IP extension headers. We regularly get reports about
driver checksum failures on edge conditions. Making a fully functional
and efficient transmit data path is nontrivial; there are many
assumptions being made, some documented, some not. When drivers crash we
either fix the driver or the protocol stack that is doing something
bad.

Tom


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 12:27, Tom Herbert wrote:
> On Tue, Nov 1, 2016 at 12:11 PM, Thomas Graf wrote:
>> You can do the same with act_pedit or cls_bpf at dev_queue_xmit()
>> before or after you go into the encapsulation device. This is a tool
>> for root, if you install a drop all ssh rule then you won't be able to
>> log into the box anymore. If you modify the packet and render it
>> invalid, the packet will be dropped.
>> If you attach a VLAN header while VLAN offload is already set up, then
>> the hardware will add another VLAN header, this is what I would
>> expect. I truly hope that we don't have code which crashes if we
>> dev_queue_xmit() garbage into any device. That would be horrible.
>>
>> Do you have a specific example that could be a problem?
>
> I believe Alexander was dealing with problems where drivers were not
> properly handling IP extension headers. We regularly get reports about

There are many ways to cause IP extension headers to be inserted. How
is this specific to BPF or this series?

> driver checksum failures on edge conditions. Making a fully functional

Not sure what an "edge condition" is? Untrusted networks? How is
drivers crashing on receive related to this series?

> and efficient transmit data path is nontrivial; there are many
> assumptions being made, some documented, some not. When drivers crash we
> either fix the driver or the protocol stack that is doing something
> bad.

Tom, we have a dozen ways to modify packet content already and we have
multiple ways to take raw packet data from userspace and inject it
at dev_queue_xmit() level and some even above. What is different for
lwtunnel_xmit()?

This is entirely optional and nobody is forced to use any of this. If
you don't trust the BPF program then you'd better not run it. It's about
the same as trusting a random tc filter or iptables ruleset. The
important point is that integrity is maintained at all times. I would
love to address any specific concern.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Hannes Frederic Sowa
On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
> On 1 November 2016 at 03:54, Hannes Frederic Sowa wrote:
> > I do fear the complexity and debuggability problems introduced by this
> > patch set quite a bit.
> 
> What is the complexity concern? This is pretty straightforward. I
> agree on debuggability. This is being worked on separately, as Alexei
> mentioned, to address this for all BPF integrations.

We have a multi-layered policy engine which is actually hard to inspect
from user space already.

We first resolve the rules, which forwards us to the table_id, where we
do the fib lookup, which in the end returns the eBPF program to use.

> > I wonder if architecturally it would be more feasible to add a generic
> > (BPF-)hook into dst_output(_sk) and allow arbitrary metadata to be
> > added into the dsts.
> >
> > BPF could then access parts of the metadata in the attached
> > metadata dst entries and perform the matching this way?
> 
> If I understand you correctly then a single BPF program would be
> loaded which then applies to all dst_output() calls? This has a huge
> drawback, instead of multiple small BPF programs which do exactly what
> is required per dst, a large BPF program is needed which matches on
> metadata. That's way slower and renders one of the biggest advantages
> of BPF invalid, the ability to generate a small program tailored to
> a particular use. See Cilium.

I thought more of hooks in the actual output/input functions specific to
the protocol type (unfortunately again) protected by jump labels? Those
hooks get part of the dst_entry mapped so they can act on them.

Another idea would be to put the eBPF hooks into the fib rules
infrastructure. But I fear this wouldn't get you the hooks you were
looking for? There they would only end up in the runtime path if
actually activated.

> > The reason why I would prefer an approach like this: regardless of the
> > routing lookup, we would process the skb with BPF or not. This gives a
> > single point to debug, whereas in your approach we first must
> > figure out the corresponding BPF program and then check for it specifically.
> 
> Not sure I see what kind of advantage this actually provides. You can
> dump the routes and see which programs get invoked and which section.

Dumping and verifying which routes get used might actually already be
quite complex on its own. Thus my fear.

> If it's based on metadata then you need to know the program logic and
> associate it with the metadata in the dst. It actually doesn't get
> much easier than to debug one of the samples, they are completely
> static once compiled and it's very simple to verify if they do what
> they are supposed to do.

At the same time you can have lots of those programs and you e.g. would
also need to verify if they are acting on the same data structures or
have identical code.

It all reminds me a bit of grepping in source code which makes heavy use
of function pointers with very generic and short names.

> If you like the single program approach, feel free to load the same
> program for every dst. Perfectly acceptable but I don't see why we
> should force everybody to use that model.

I am concerned about having hundreds of BPF programs, all specialized on a
particular route, to debug. Looking at one code file and its associated
tables seems still easier to me.

E.g. imagine we have input routes and output routes with different BPF
programs. We somehow must make sure all nodes kind of behave according
to "sane" network semantics. If you end up with an input route doing bpf
processing and the corresponding output node, which e.g. might be needed
to reflect ICMP packets, doesn't behave accordingly, you at least have
two programs to debug already instead of a switch- or if-condition in one
single code location. I would like to "force" this kind of symmetry onto
developers using eBPF, thus I thought meta-data manipulation and
verification inside the kernel would be a better attack on this problem,
no?

Bye,
Hannes


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Tom Herbert
On Tue, Nov 1, 2016 at 12:59 PM, Thomas Graf wrote:
> On 1 November 2016 at 12:27, Tom Herbert wrote:
>> On Tue, Nov 1, 2016 at 12:11 PM, Thomas Graf wrote:
>>> You can do the same with act_pedit or cls_bpf at dev_queue_xmit()
>>> before or after you go into the encapsulation device. This is a tool
>>> for root, if you install a drop all ssh rule then you won't be able to
>>> log into the box anymore. If you modify the packet and render it
>>> invalid, the packet will be dropped.
>>> If you attach a VLAN header while VLAN offload is already set up, then
>>> the hardware will add another VLAN header, this is what I would
>>> expect. I truly hope that we don't have code which crashes if we
>>> dev_queue_xmit() garbage into any device. That would be horrible.
>>>
>>> Do you have a specific example that could be a problem?
>>
>> I believe Alexander was dealing with problems where drivers were not
>> properly handling IP extension headers. We regularly get reports about
>
> There are many ways to cause IP extension headers to be inserted. How
> is this specific to BPF or this series?
>
>> driver checksum failures on edge conditions. Making a fully functional
>
> Not sure what an "edge condition" is? Untrusted networks? How is
> drivers crashing on receive related to this series?
>
>> and efficient transmit data path is nontrivial; there are many
>> assumptions being made, some documented, some not. When drivers crash we
>> either fix the driver or the protocol stack that is doing something
>> bad.
>
> Tom, we have a dozen ways to modify packet content already and we have
> multiple ways to take raw packet data from userspace and inject it
> at dev_queue_xmit() level and some even above. What is different for
> lwtunnel_xmit()?
>
> This is entirely optional and nobody is forced to use any of this. If
> you don't trust the BPF program then you'd better not run it. It's about
> the same as trusting a random tc filter or iptables ruleset. The
> important point is that integrity is maintained at all times. I would
> love to address any specific concern.

You need to show that integrity is maintained with these patches.
Part of this can be done, but part of this needs to be established in
testing.

I've already given specifics for at least one potential source of
issues in routing. I would like to know what happens if someone
rewrites an IPv4 packet into an IPv6 packet or vice versa. AFAICT we
would be sending an IPv6 packet using an IPv4 route, or an IPv4 packet
using an IPv6 route. What is supposed to happen in these circumstances?
What actually happens?

Similarly, someone overwrites the whole packet with 0xff. What happens
when we try to send that?

Tom


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 13:08, Hannes Frederic Sowa wrote:
> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>> If I understand you correctly then a single BPF program would be
>> loaded which then applies to all dst_output() calls? This has a huge
>> drawback, instead of multiple small BPF programs which do exactly what
>> is required per dst, a large BPF program is needed which matches on
>> metadata. That's way slower and renders one of the biggest advantages
>> of BPF invalid, the ability to generate a small program tailored to
>> a particular use. See Cilium.
>
> I thought more of hooks in the actual output/input functions specific to
> the protocol type (unfortunately again) protected by jump labels? Those
> hooks get part of the dst_entry mapped so they can act on them.

This has no advantage over installing a BPF program at tc egress and
enabling it to store/access metadata per dst. The whole point is to
execute bpf for a specific route.

> Another idea would be to put the eBPF hooks into the fib rules
> infrastructure. But I fear this wouldn't get you the hooks you were
> looking for? There they would only end up in the runtime path if
> actually activated.

Use of fib rules kills performance so it's not an option. I'm not even
sure that would be any simpler.

> Dumping and verifying which routes get used might actually already be
> quite complex on its own. Thus my fear.

We even have an API to query which route is used for a tuple. What
else would you like to see?
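
For example, "ip route get" resolves a tuple to the route that will be
used, including its encap (output abbreviated and illustrative):

  # ip route get 192.168.253.2
  192.168.253.2 encap bpf out lwt_ok.o:[do_out] dev veth0 ...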

>> If it's based on metadata then you need to know the program logic and
>> associate it with the metadata in the dst. It actually doesn't get
>> much easier than to debug one of the samples, they are completely
>> static once compiled and it's very simple to verify if they do what
>> they are supposed to do.
>
> At the same time you can have lots of those programs and you e.g. would
> also need to verify if they are acting on the same data structures or
> have identical code.

This will be addressed with signing AFAIK.

> It all reminds me a bit of grepping in source code which makes heavy use
> of function pointers with very generic and short names.

Is this statement related to routing? I don't get the reference to
function pointers and generic short names.

>> If you like the single program approach, feel free to load the same
>> program for every dst. Perfectly acceptable but I don't see why we
>> should force everybody to use that model.
>
> I am concerned about having hundreds of BPF programs, all specialized on a
> particular route, to debug. Looking at one code file and its associated
> tables seems still easier to me.

100 programs != 100 source files. A lot more realistic is a single or
a handful of programs which get compiled for a particular route with
certain pieces enabled/disabled.

> E.g. imagine we have input routes and output routes with different BPF
> programs. We somehow must make sure all nodes kind of behave according
> to "sane" network semantics. If you end up with an input route doing bpf

As soon as we have signing, you can verify your programs in testing,
sign the programs and then quickly verify on all your nodes whether
you are running the correct programs.

Would it help if we allowed storing the original source used for
bytecode generation? That would make it clear which program was used.

> processing and the corresponding output node, which e.g. might be needed
> to reflect ICMP packets, doesn't behave accordingly, you at least have
> two programs to debug already instead of a switch- or if-condition in one
> single code location. I would like to "force" this kind of symmetry onto
> developers using eBPF, thus I thought meta-data manipulation and
> verification inside the kernel would be a better attack on this problem,
> no?

Are you saying you want a single gigantic program for both input and output?

That's not possible. The BPF program has different limitations
depending on where it runs. On input, any write action on the packet
is not allowed; extending the header is only allowed on xmit; and so
on.

I also don't see how this could possibly scale if all packets must go
through a single BPF program. The overhead will be tremendous if you
only want to filter a couple of prefixes.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 13:33, Tom Herbert wrote:
> You need to show that integrity is maintained with these patches.
> Part of this can be done, but part of this needs to be established in
> testing.
>
> I've already given specifics for at least one potential source of
> issues in routing. I would like to know what happens if someone
> rewrites an IPv4 packet into an IPv6 packet or vice versa. AFAICT we
> would be sending an IPv6 packet using an IPv4 route, or an IPv4 packet
> using an IPv6 route. What is supposed to happen in these circumstances?
> What actually happens?
>
> Similarly, someone overwrites the whole packet with 0xff. What happens
> when we try to send that?

OK, I will add these tests to the selftest in the next iteration.
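
For the 0xff case, the test can be as small as this sketch (assuming
write access at the xmit hook):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("clobber")
  int clobber(struct __sk_buff *skb)
  {
          __u8 ff[40];

          /* Overwrite one IPv6 header's worth of bytes with 0xff and
           * let the stack try to transmit the result. */
          __builtin_memset(ff, 0xff, sizeof(ff));
          if (bpf_skb_store_bytes(skb, 0, ff, sizeof(ff), 0))
                  return BPF_DROP;
          return BPF_OK;
  }

  char _license[] SEC("license") = "GPL";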


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Hannes Frederic Sowa
On 01.11.2016 21:59, Thomas Graf wrote:
> On 1 November 2016 at 13:08, Hannes Frederic Sowa wrote:
>> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>>> If I understand you correctly then a single BPF program would be
>>> loaded which then applies to all dst_output() calls? This has a huge
>>> drawback, instead of multiple small BPF programs which do exactly what
>>> is required per dst, a large BPF program is needed which matches on
>>> metadata. That's way slower and renders one of the biggest advantages
>>> of BPF invalid, the ability to generate a small program tailored to
>>> a particular use. See Cilium.
>>
>> I thought more of hooks in the actual output/input functions specific to
>> the protocol type (unfortunately again) protected by jump labels? Those
>> hooks get part of the dst_entry mapped so they can act on them.
> 
> This has no advantage over installing a BPF program at tc egress and
> enabling it to store/access metadata per dst. The whole point is to
> execute bpf for a specific route.

The advantage I saw here was that in your proposal the tc egress path
would have to be chosen by a route. Otherwise I would already have
proposed it. :)

>> Another idea would be to put the eBPF hooks into the fib rules
>> infrastructure. But I fear this wouldn't get you the hooks you were
>> looking for? There they would only end up in the runtime path if
>> actually activated.
> 
> Use of fib rules kills performance so it's not an option. I'm not even
> sure that would be any simpler.

It very much depends on the number of rules installed. If there are just
a few rules, it shouldn't hurt performance that much (but I
haven't verified).

>> Dumping and verifying which routes get used might actually already be
>> quite complex on its own. Thus my fear.
> 
> We even have an API to query which route is used for a tuple. What
> else would you like to see?

I am not sure here. Some ideas I had were to allow tcpdump (pf_packet)
sockets to sniff at interfaces and also gather and dump the metadata to
user space (this would depend on bpf programs only doing the
modifications in metadata and not in the actual packet).

Or maybe just tracing support (without depending on the eBPF program
developer to have added debugging in the BPF program).

>>> If it's based on metadata then you need to know the program logic and
>>> associate it with the metadata in the dst. It actually doesn't get
>>> much easier than to debug one of the samples, they are completely
>>> static once compiled and it's very simple to verify if they do what
>>> they are supposed to do.
>>
>> At the same time you can have lots of those programs and you e.g. would
>> also need to verify if they are acting on the same data structures or
>> have identical code.
> 
> This will be addressed with signing AFAIK.

This sounds a bit unrealistic. Signing lots of small programs can be a
huge burden to the entity doing the signing (if it is not on the same
computer). And as far as I understood the programs should be generated
dynamically?

>> It all reminds me a bit of grepping in source code which makes heavy use
>> of function pointers with very generic and short names.
> 
> Is this statement related to routing? I don't get the reference to
> function pointers and generic short names.

No, just an anecdotal side note about how I felt when I saw the patchset. ;)

>>> If you like the single program approach, feel free to load the same
>>> program for every dst. Perfectly acceptable but I don't see why we
>>> should force everybody to use that model.
>>
>> I am concerned about having hundreds of BPF programs, all specialized on a
>> particular route, to debug. Looking at one code file and its associated
>> tables seems still easier to me.
> 
> 100 programs != 100 source files. A lot more realistic is a single or
> a handful of programs which get compiled for a particular route with
> certain pieces enabled/disabled.
> 
>> E.g. imagine we have input routes and output routes with different BPF
>> programs. We somehow must make sure all nodes kind of behave according
>> to "sane" network semantics. If you end up with an input route doing bpf
> 
> As soon as we have signing, you can verify your programs in testing,
> sign the programs and then quickly verify on all your nodes whether
> you are running the correct programs.
> 
> Would it help if we allowed storing the original source used for
> bytecode generation? That would make it clear which program was used.

I would also be fine with just a strong hash of the bytecode, so the
program can be identified accurately. Maybe helps with deduplication
later on, too. ;)

>> processing and the corresponding output node, which e.g. might be needed
>> to reflect ICMP packets, doesn't behave accordingly, you at least have
>> two programs to debug already instead of a switch- or if-condition in one
>> single code location. I would like to "force" this kind of symmetry onto
>> developers using eBPF, thus I thought meta-data manipulation and
>> verification inside the kernel would be a better attack on this problem,
>> no?

Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Tom Herbert
On Tue, Nov 1, 2016 at 3:12 PM, Hannes Frederic Sowa wrote:
> On 01.11.2016 21:59, Thomas Graf wrote:
>> On 1 November 2016 at 13:08, Hannes Frederic Sowa wrote:
>>> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
 If I understand you correctly then a single BPF program would be
 loaded which then applies to all dst_output() calls? This has a huge
 drawback, instead of multiple small BPF programs which do exactly what
 is required per dst, a large BPF program is needed which matches on
 metadata. That's way slower and renders one of the biggest advantages
 of BPF invalid, the ability to generate a small program tailored to
 a particular use. See Cilium.
>>>
>>> I thought more of hooks in the actual output/input functions specific to
>>> the protocol type (unfortunately again) protected by jump labels? Those
>>> hooks get part of the dst_entry mapped so they can act on them.
>>
>> This has no advantage over installing a BPF program at tc egress and
>> enabling it to store/access metadata per dst. The whole point is to
>> execute bpf for a specific route.
>
> The advantage I saw here was that in your proposal the tc egress path
> would have to be chosen by a route. Otherwise I would already have
> proposed it. :)
>
>>> Another idea would be to put the eBPF hooks into the fib rules
>>> infrastructure. But I fear this wouldn't get you the hooks you were
>>> looking for? There they would only end up in the runtime path if
>>> actually activated.
>>
>> Use of fib rules kills performance so it's not an option. I'm not even
>> sure that would be any simpler.
>
> It very much depends on the number of rules installed. If there are just
> a few rules, it shouldn't hurt performance that much (but I
> haven't verified).
>
Hannes,

I can say that the primary value we get out of using ILA+LWT is that
we can essentially cache a policy decision in connected sockets. That
is we are able to create a host route for each destination (thousands
of them) that describes how to do the translation for each one. There
is no route lookup per packet, and actually no extra lookup otherwise.
The translation code doesn't do much at all, basically just copies the
new destination into the packet. We need a route lookup for the
rewritten destination, but that is easily cached in the LWT structure.
The net result is that the transmit path for ILA is _really_ fast. I'm
not sure how we can match this same performance with tc egress; it seems
like we would want to cache the matching rules in the socket to avoid
rule lookups.
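
For reference, the shape of that setup with the existing ILA encap is
one host route per destination, caching the translation decision
(addresses are illustrative):

  ip -6 route add 3333:0:0:1:5678::/128 encap ila 2001:db8:0:1 \
          dev eth0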

On the other hand, I'm not really sure how to implement this level
of performance in LWT+BPF either. It seems like one way to do
that would be to create a program for each destination and set it on
each host. As you point out, that would create a million different
programs, which doesn't seem manageable. I don't think the BPF map
works either since that implies we need a lookup (?). It seems like
what we need is one program but allow it to be parameterized with per
destination information saved in the route (LWT structure).

Tom

>>> Dumping and verifying which routes get used might actually already be
>>> quite complex on its own. Thus my fear.
>>
>> We even have an API to query which route is used for a tuple. What
>> else would you like to see?
>
> I am not sure here. Some ideas I had were to allow tcpdump (pf_packet)
> sockets to sniff at interfaces and also gather and dump the metadata to
> user space (this would depend on bpf programs only doing the
> modifications in metadata and not in the actual packet).
>
> Or maybe just tracing support (without depending on the eBPF program
> developer to have added debugging in the BPF program).
>
 If it's based on metadata then you need to know the program logic and
 associate it with the metadata in the dst. It actually doesn't get
 much easier than to debug one of the samples, they are completely
 static once compiled and it's very simple to verify if they do what
 they are supposed to do.
>>>
>>> At the same time you can have lots of those programs and you e.g. would
>>> also need to verify if they are acting on the same data structures or
>>> have identical code.
>>
>> This will be addressed with signing AFAIK.
>
> This sounds a bit unrealistic. Signing lots of small programs can be a
> huge burden to the entity doing the signing (if it is not on the same
> computer). And as far as I understood the programs should be generated
> dynamically?
>
>>> It all reminds me a bit of grepping in source code which makes heavy use
>>> of function pointers with very generic and short names.
>>
>> Is this statement related to routing? I don't get the reference to
>> function pointers and generic short names.
>
> No, just an anecdotal side note about how I felt when I saw the patchset. ;)
>
 If you like the single program approach, feel free to load the same
 program for every dst. Perfectly acceptable but I don't see why we
 should force everybody to use that model.

Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Hannes Frederic Sowa
Hi Tom,

On Wed, Nov 2, 2016, at 00:07, Tom Herbert wrote:
> On Tue, Nov 1, 2016 at 3:12 PM, Hannes Frederic Sowa wrote:
> > On 01.11.2016 21:59, Thomas Graf wrote:
> >> On 1 November 2016 at 13:08, Hannes Frederic Sowa wrote:
> >>> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>  If I understand you correctly then a single BPF program would be
>  loaded which then applies to all dst_output() calls? This has a huge
>  drawback, instead of multiple small BPF programs which do exactly what
>  is required per dst, a large BPF program is needed which matches on
>  metadata. That's way slower and renders one of the biggest advantages
>  of BPF invalid, the ability to generate a small program tailored to
>  a particular use. See Cilium.
> >>>
> >>> I thought more of hooks in the actual output/input functions specific to
> >>> the protocol type (unfortunately again) protected by jump labels? Those
> >>> hooks get part of the dst_entry mapped so they can act on them.
> >>
> >> This has no advantage over installing a BPF program at tc egress and
> >> enabling it to store/access metadata per dst. The whole point is to
> >> execute bpf for a specific route.
> >
> > The advantage I saw here was that in your proposal the tc egress path
> > would have to be chosen by a route. Otherwise I would already have
> > proposed it. :)
> >
> >>> Another idea would be to put the eBPF hooks into the fib rules
> >>> infrastructure. But I fear this wouldn't get you the hooks you were
> >>> looking for? There they would only end up in the runtime path if
> >>> actually activated.
> >>
> >> Use of fib rules kills performance so it's not an option. I'm not even
> >> sure that would be any simpler.
> >
> > It very much depends on the number of rules installed. If there are just
> > a few rules, it shouldn't hurt performance that much (but I
> > haven't verified).
> >
> Hannes,
> 
> I can say that the primary value we get out of using ILA+LWT is that
> we can essentially cache a policy decision in connected sockets. That
> is we are able to create a host route for each destination (thousands
> of them) that describes how to do the translation for each one. There
> is no route lookup per packet, and actually no extra lookup otherwise.

Exactly, that is why I do like LWT and the dst_entry socket caching
shows its benefits here. Also the dst_entries communicate enough vital
information up the stack so that allocation of sk_buffs is done
according to the headers that might need to be inserted later on.

(On the other hand, the looked-up BPF program can also be cached. This
becomes more difficult if we can't share the socket structs between
namespaces though.)

> The translation code doesn't do much at all, basically just copies the
> new destination into the packet. We need a route lookup for the
> rewritten destination, but that is easily cached in the LWT structure.
> The net result is that the transmit path for ILA is _really_ fast. I'm
> not sure how we can match this same performance with tc egress; it seems
> like we would want to cache the matching rules in the socket to avoid
> rule lookups.

In case of namespaces, do you allocate the host routes in the parent or
child (net-)namespaces? Or are we not talking about namespaces at all right now?

Why do we want to do the packet manipulation in tc egress and not using
LWT + interfaces? The dst_entries should be able to express all possible
allocation strategies etc. so that we don't need to shift/reallocate
packets around when inserting an additional header. We can't express
those semantics with tc egress.

> On the other hand, I'm not really sure how to implement this level
> of performance in LWT+BPF either. It seems like one way to do
> that would be to create a program for each destination and set it on
> each host. As you point out, that would create a million different
> programs, which doesn't seem manageable. I don't think the BPF map
> works either since that implies we need a lookup (?). It seems like
> what we need is one program but allow it to be parameterized with per
> destination information saved in the route (LWT structure).

Yes, that is my proposal. Just using the dst entry as meta-data (which
can actually also be an ID for the network namespace the packet is
coming from).

My concern with using BPF is that the rest of the kernel doesn't really
see the semantics and can't optimize or cache at specific points,
because the kernel cannot introspect what the BPF program does (for
metadata manipulation, one can e.g. specify that the program is "pure",
and always provides the same output for some specified given input, thus
things can be cached and memoized, but that framework seems very hard
to build).

That's why I am in favor of splitting this patchset up and allowing the
policies that should be expressed by BPF programs to be applied to the
specific subsystems (I am not totally against a generic BPF hook in
input or output of the protocol

Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Tom Herbert
On Wed, Nov 2, 2016 at 3:48 AM, Hannes Frederic Sowa wrote:
> Hi Tom,
>
> On Wed, Nov 2, 2016, at 00:07, Tom Herbert wrote:
>> On Tue, Nov 1, 2016 at 3:12 PM, Hannes Frederic Sowa wrote:
>> > On 01.11.2016 21:59, Thomas Graf wrote:
>> >> On 1 November 2016 at 13:08, Hannes Frederic Sowa wrote:
>> >>> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>>  If I understand you correctly then a single BPF program would be
>>  loaded which then applies to all dst_output() calls? This has a huge
>>  drawback, instead of multiple small BPF programs which do exactly what
>>  is required per dst, a large BPF program is needed which matches on
>>  metadata. That's way slower and renders one of the biggest advantages
>>  of BPF invalid, the ability to generate a a small program tailored to
>>  a particular use. See Cilium.
>> >>>
>> >>> I thought more of hooks in the actual output/input functions specific to
>> >>> the protocol type (unfortunately again) protected by jump labels? Those
>> >>> hooks get part of the dst_entry mapped so they can act on them.
>> >>
>> >> This has no advantage over installing a BPF program at tc egress and
>> >> enabling it to store/access metadata per dst. The whole point is to
>> >> execute bpf for a specific route.
>> >
>> > The advantage I saw here was that in your proposal the tc egress path
>> > would have to be chosen by a route. Otherwise I would already have
>> > proposed it. :)
>> >
>> >>> Another idea would be to put the eBPF hooks into the fib rules
>> >>> infrastructure. But I fear this wouldn't get you the hooks you were
>> >>> looking for? There they would only end up in the runtime path if
>> >>> actually activated.
>> >>
>> >> Use of fib rules kills performance so it's not an option. I'm not even
>> >> sure that would be any simpler.
>> >
>> > It very much depends on the number of rules installed. If there are just
>> > a few rules, it shouldn't hurt performance that much (but I
>> > haven't verified).
>> >
>> Hannes,
>>
>> I can say that the primary value we get out of using ILA+LWT is that
>> we can essentially cache a policy decision in connected sockets. That
>> is we are able to create a host route for each destination (thousands
>> of them) that describes how to do the translation for each one. There
>> is no route lookup per packet, and actually no extra lookup otherwise.
>
> Exactly, that is why I do like LWT and the dst_entry socket caching
> shows its benefits here. Also the dst_entries communicate enough vital
> information up the stack so that allocation of sk_buffs is done
> according to the headers that might need to be inserted later on.
>
> (On the other hand, the looked-up BPF program can also be cached. This
> becomes more difficult if we can't share the socket structs between
> namespaces though.)
>
>> The translation code doesn't do much at all, basically just copies the
>> new destination into the packet. We need a route lookup for the
>> rewritten destination, but that is easily cached in the LWT structure.
>> The net result is that the transmit path for ILA is _really_ fast. I'm
>> not sure how we can match this same performance with tc egress; it seems
>> like we would want to cache the matching rules in the socket to avoid
>> rule lookups.
>
> In case of namespaces, do you allocate the host routes in the parent or
> child (net-)namespaces? Or are we not talking about namespaces at all right now?
>
ILA is namespace aware, everything is set per namespace. I don't see
any issue with setting routes per namespace either, nor any namespace
related issues with this patch, except maybe that I wouldn't know what
the interaction between BPF maps and namespaces is. Do maps belong to
namespaces?

> Why do we want to do the packet manipulation in tc egress and not using
> LWT + interfaces? The dst_entries should be able to express all possible
> allocation strategies etc. so that we don't need to shift/reallocate
> packets around when inserting an additional header. We can't express
> those semantics with tc egress.
>
I don't think we do want to do this sort of packet manipulation (ILA
in particular) in tc egress, that was my point. It's not appropriate
for netfilter either, I think.

>> On the other hand, I'm not really sure how to implement this level
>> of performance in LWT+BPF either. It seems like one way to do
>> that would be to create a program for each destination and set it on
>> each host. As you point out, that would create a million different
>> programs, which doesn't seem manageable. I don't think the BPF map
>> works either since that implies we need a lookup (?). It seems like
>> what we need is one program but allow it to be parameterized with per
>> destination information saved in the route (LWT structure).
>
> Yes, that is my proposal. Just using the dst entry as meta-data (which
> can actually also be an ID for the network namespace the packet is
> coming from).
>
> My concern with using BPF is that the rest 

Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Thomas Graf
On 1 November 2016 at 16:12, Hannes Frederic Sowa wrote:
> On 01.11.2016 21:59, Thomas Graf wrote:
>>> Dumping and verifying which routes get used might actually already be
>>> quite complex on its own. Thus my fear.
>>
>> We even have an API to query which route is used for a tuple. What
>> else would you like to see?
>
> I am not sure here. Some ideas I had were to allow tcpdump (pf_packet)
> sockets to sniff at interfaces and also to gather and dump the metadata
> to user space (this would depend on BPF programs only making
> modifications in the metadata and not in the actual packet).

Not sure I understand. Why does this depend on BPF?

> Or maybe just tracing support (without depending on the eBPF program
> developer having added debugging in the BPF program).

Absolutely in favour of that.

>> This will be addressed with signing AFAIK.
>
> This sounds a bit unrealistic. Signing lots of small programs can be a
> huge burden to the entity doing the signing (if it is not on the same
> computer). And as far as I understood, the programs are supposed to be
> generated dynamically?

Right, for generated programs, a hash is a better fit and still sufficient.

>> Would it help if we allowed storing the original source used for
>> bytecode generation? That would make it clear which program was used.
>
> I would also be fine with just a strong hash of the bytecode, so the
> program can be identified accurately. Maybe helps with deduplication
> later on, too. ;)

OK, I think we all already agreed on doing this.

> Even though I read through the patchset, I am not absolutely sure which
> problem it really solves, especially because lots of things can already
> be done at the ingress or egress interface (I looked at patch 4, but I
> am not sure how realistic the examples are).

Filtering at egress requires attaching the BPF program to every
potential outgoing interface and then passing every single packet
through the program, whereas with LWT BPF, I'm only taking the cost
where it is actually needed.
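
To make that concrete, here is a minimal sketch of a per-route program
in the style of the samples in patch 4. The section name and map layout
are illustrative; the BPF_OK return code is taken from the series'
description.

#include <linux/bpf.h>
#include <linux/types.h>
#include "bpf_helpers.h"

/* One per-CPU byte counter for this one route; traffic that does not
 * use the route this program is attached to never runs the program.
 */
struct bpf_map_def SEC("maps") lwt_bytes = {
	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u64),
	.max_entries = 1,
};

SEC("lwt_out")
int do_account(struct __sk_buff *skb)
{
	__u32 key = 0;
	__u64 *bytes;

	bytes = bpf_map_lookup_elem(&lwt_bytes, &key);
	if (bytes)
		*bytes += skb->len;

	return BPF_OK; /* pass the packet on unmodified */
}

char _license[] SEC("license") = "GPL";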

>> I also don't see how this could possibly scale if all packets must go
>> through a single BPF program. The overhead will be tremendous if you
>> only want to filter a couple of prefixes.
>
> In the case of a hash table lookup it should be fast. llvm will
> probably also generate a jump table for a few hundred IP addresses, no?
> Additionally, the routing table lookup might not need to be done at
> all.

Why would I want to accept the overhead if I can simply avoid it? Just
parsing the header and doing the hash lookup will add cost, and that
cost is paid for each packet.
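
For contrast, the single-program approach sketched above would look
roughly like the following. This is hypothetical code; the point is
that the header parse and the hash lookup run for every packet, whether
or not a prefix matches.

#include <linux/bpf.h>
#include <linux/types.h>
#include <linux/ipv6.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") filtered_dsts = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(struct in6_addr),
	.value_size = sizeof(__u8),
	.max_entries = 1024,
};

SEC("filter_all")
int do_filter_all(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ipv6hdr *ip6 = data;
	struct in6_addr key;

	if (data + sizeof(*ip6) > data_end)
		return BPF_OK;

	/* Paid on every single packet: header parse plus hash lookup */
	key = ip6->daddr;
	if (bpf_map_lookup_elem(&filtered_dsts, &key))
		return BPF_DROP;

	return BPF_OK;
}

char _license[] SEC("license") = "GPL";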


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Thomas Graf
On 1 November 2016 at 17:07, Tom Herbert wrote:
> On the other hand, I'm not really sure how to implement this level of
> performance in LWT+BPF either. It seems like one way to do that would
> be to create a program for each destination and set it on each host.
> As you point out, that would create a million different programs, which
> doesn't seem manageable. I don't think the BPF map works either, since
> that implies we need a lookup (?). It seems like what we need is one
> program that can be parameterized with per-destination information
> saved in the route (LWT structure).

Attaching different BPF programs to millions of unique dsts doesn't
make any sense. That obviously will never scale and it's not supposed
to scale. This is meant to be used for prefixes which represent a
series of endpoints, f.e. all local containers, all non-internal
traffic, all vpn traffic, etc. I'm also not sure why we are talking
about ILA here; you have written a native implementation, so why would
you want to solve it with BPF again?

If you want to run a single program for all dsts, feel free to run the
same BPF program for each dst. Nobody is forcing you to attach
individual programs.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Tom Herbert
On Wed, Nov 2, 2016 at 3:57 PM, Thomas Graf wrote:
> On 1 November 2016 at 17:07, Tom Herbert wrote:
>> On the other hand, I'm not really sure how to implement this level of
>> performance in LWT+BPF either. It seems like one way to do that would
>> be to create a program for each destination and set it on each host.
>> As you point out, that would create a million different programs, which
>> doesn't seem manageable. I don't think the BPF map works either, since
>> that implies we need a lookup (?). It seems like what we need is one
>> program that can be parameterized with per-destination information
>> saved in the route (LWT structure).
>
> Attaching different BPF programs to millions of unique dsts doesn't
> make any sense. That obviously will never scale and it's not supposed
> to scale. This is meant to be used for prefixes which represent a
> series of endpoints, f.e. all local containers, all non-internal
> traffic, all vpn traffic, etc. I'm also not sure why we are talking
> about ILA here; you have written a native implementation, so why would
> you want to solve it with BPF again?
>
We are talking about ILA because you specifically mentioned it in the
overview log as a use case: "ILA like use cases where L3 addresses are
resolved and then routed".

Tom

> If you want to run a single program for all dsts, feel free to run the
> same BPF program for each dst. Nobody is forcing you to attach
> individual programs.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-03 Thread Thomas Graf
On 2 November 2016 at 04:48, Hannes Frederic Sowa wrote:
> On Wed, Nov 2, 2016, at 00:07, Tom Herbert wrote:
>> On the other hand, I'm not really sure how to implement this level of
>> performance in LWT+BPF either. It seems like one way to do that would
>> be to create a program for each destination and set it on each host.
>> As you point out, that would create a million different programs, which
>> doesn't seem manageable. I don't think the BPF map works either, since
>> that implies we need a lookup (?). It seems like what we need is one
>> program that can be parameterized with per-destination information
>> saved in the route (LWT structure).
>
> Yes, that is my proposal. Just using the dst entry as meta-data (which
> can actually also be an ID for the network namespace the packet is
> coming from).

I have no objection to doing this on top of this series.

> My concern with using BPF is that the rest of the kernel doesn't really
> see the semantics and can't optimize or cache at specific points,
> because the kernel cannot introspect what the BPF program does (for
> metadata manipulation, one can e.g. specify that the program is "pure"
> and always provides the same output for a given input, thus things can
> be cached and memoized, but that framework seems very hard to build).

So you want to reintroduce a routing cache? Each packet needs to pass
through the BPF program anyway for accounting purposes. This is not
just about getting the packets out via the right nexthop in the fastest
possible way.

> I also fear this becomes a kernel by-pass:
>
> It might be very hard, e.g., to apply NFT/netfilter to such packets if
> a redirect suddenly happens and the packet flow is diverted from the
> one the user currently sees based on the interfaces and routing tables.

The LWT xmit hook is after the POST_ROUTING hook. The input and output
hooks cannot redirect, and output will become read-only just like input
already is. We are not bypassing anything. Please stop throwing the
word bypass around; this is just a false claim.

> That's why I am in favor of splitting this patchset up and allowing the
> policies that should be expressed by BPF programs to be applied in the
> specific subsystems (I am not totally against a generic BPF hook in
> input or output of the protocol engines). E.g. can we deal with static
> rewriting of L2 addresses in the neighbor cache? We already provide a
> fast header cache for L2 data which might be usable here.

Split what? What policies?

I have two primary use cases for this:
1) Traffic into local containers: Containers are only supposed to do
L3, all L2 traffic is dropped for security reasons. The L2 header for
any packets in and out of the container is fixed and does not require
any sort of resolving. In order to feed packets from the local host
into the containers, a route with the container prefix is set up. It
points to a nexthop address which appears behind a veth pair. A BPF
program is listening at tc ingress on the veth pair and will enforce
policies and do accounting. It requires very ugly hacks because Linux
does not like to do forwarding to an address which is considered
local. It works but it is a hack.

What I want to do instead is to run the BPF program for the route
directly, apply the policies, do accounting, push the fixed dummy L2
header and redirect it to the container. If someone has netfilter
rules installed, they will still apply. Nothing is hidden.
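
A rough sketch of what that program could look like at the xmit hook
(helper names and availability are assumptions based on this series;
the section name, MAC addresses and ifindex are made up):

#include <linux/bpf.h>
#include <linux/types.h>
#include <linux/if_ether.h>
#include <asm/byteorder.h>
#include "bpf_helpers.h"

#define CONTAINER_IFINDEX	42	/* example veth ifindex */

SEC("xmit_to_container")
int do_push_redirect(struct __sk_buff *skb)
{
	/* Fixed dummy L2 header, no resolving required */
	struct ethhdr eth = {
		.h_dest   = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x02 },
		.h_source = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 },
		.h_proto  = __constant_htons(ETH_P_IPV6),
	};

	/* Make room in front of the L3 header, then write the header */
	if (bpf_skb_change_head(skb, sizeof(eth), 0))
		return BPF_DROP;
	if (bpf_skb_store_bytes(skb, 0, &eth, sizeof(eth), 0) < 0)
		return BPF_DROP;

	/* Hand the packet to the container's veth device */
	bpf_redirect(CONTAINER_IFINDEX, 0);
	return BPF_REDIRECT;
}

char _license[] SEC("license") = "GPL";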

2) For external traffic that is coming in. We have a BPF program
listening on tc ingress which matches on the destination address of
all incoming traffic. If the packet is for a container, we perform
the same actions as above. In this case we are bypassing the routing
table. This is ugly. What I want to do instead is to have the
container prefix invoke the BPF program, so that all packets have a
route lookup and netfilter filtering performed; only after that is the
BPF program invoked, exclusively for the packets destined for local
containers. Yes, it would be possible to redirect into a temporary
veth again and listen on that, but that again requires faking an L2
segment, which is just unnecessary and slow.

This is not hiding anything and it is not bypassing anything.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-03 Thread Hannes Frederic Sowa
On 02.11.2016 23:54, Thomas Graf wrote:
> On 1 November 2016 at 16:12, Hannes Frederic Sowa wrote:
>> On 01.11.2016 21:59, Thomas Graf wrote:
>>>> Dumping and verifying which routes get used might actually already be
>>>> quite complex on its own. Thus my fear.
>>>
>>> We even have an API to query which route is used for a tuple. What
>>> else would you like to see?
>>
>> I am not sure here. Some ideas I had were to allow tcpdump (pf_packet)
>> sockets to sniff at interfaces and also to gather and dump the metadata
>> to user space (this would depend on BPF programs only making
>> modifications in the metadata and not in the actual packet).
> 
> Not sure I understand. Why does this depend on BPF?

It doesn't. My hope was that if BPF merely modifies meta-data, we can
provide better debugging tools than if we mangle the packet directly.

>> Or maybe just tracing support (without depending on the eBPF program
>> developer having added debugging in the BPF program).
> 
> Absolutely in favour of that.
> 
>>> This will be addressed with signing AFAIK.
>>
>> This sounds a bit unrealistic. Signing lots of small programs can be a
>> huge burden to the entity doing the signing (if it is not on the same
>> computer). And as far as I understood, the programs are supposed to be
>> generated dynamically?
> 
> Right, for generated programs, a hash is a better fit and still sufficient.
> 
>>> Would it help if we allowed storing the original source used for
>>> bytecode generation? That would make it clear which program was used.
>>
>> I would also be fine with just a strong hash of the bytecode, so the
>> program can be identified accurately. Maybe helps with deduplication
>> later on, too. ;)
> 
> OK, I think we all already agreed on doing this.
> 
>> Even though I read through the patchset, I am not absolutely sure which
>> problem it really solves, especially because lots of things can already
>> be done at the ingress or egress interface (I looked at patch 4, but I
>> am not sure how realistic the examples are).
> 
> Filtering at egress requires attaching the BPF program to every
> potential outgoing interface and then passing every single packet
> through the program, whereas with LWT BPF, I'm only taking the cost
> where it is actually needed.

I certainly do see this point as a big plus. I definitely also thought
about this a lot when thinking about how flower can/should be used with
multiple interfaces and how to keep its flow tables synchronized.

>>> I also don't see how this could possibly scale if all packets must go
>>> through a single BPF program. The overhead will be tremendous if you
>>> only want to filter a couple of prefixes.
>>
>> In the case of a hash table lookup it should be fast. llvm will
>> probably also generate a jump table for a few hundred IP addresses, no?
>> Additionally, the routing table lookup might not need to be done at
>> all.
> 
> Why would I want to accept the overhead if I can simply avoid it? Just
> parsing the header and doing the hash lookup will add cost, and that
> cost is paid for each packet.

That is true, but in case you are outside of the namespace, you still
have to account for the cost of the FIB lookup that finds the BPF
program each time, too.

E.g. the lookup cost in a hash keyed by a network namespace pointer
vs. the cost of doing a FIB lookup to get a program that does a
specific transformation sounds, at least in big-O notation, like it is
in a better place. ;)

If you have to do both anyway, your patchset will probably perform
better, I agree.

Bye,
Hannes



Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-03 Thread Thomas Graf
On 3 November 2016 at 08:52, Hannes Frederic Sowa wrote:
> On 02.11.2016 23:54, Thomas Graf wrote:
>> Why would I want to accept the overhead if I can simply avoid it? Just
>> parsing the header and doing the hash lookup will add cost, and that
>> cost is paid for each packet.
>
> That is true, but in case you are outside of the namespace, you still
> have to account for the cost of the FIB lookup that finds the BPF
> program each time, too.
>
> E.g. the lookup cost in a hash keyed by a network namespace pointer
> vs. the cost of doing a FIB lookup to get a program that does a
> specific transformation sounds, at least in big-O notation, like it is
> in a better place. ;)
>
> If you have to do both anyway, your patchset will probably perform
> better, I agree.

Most containers are unprivileged; the route inside the container's
namespace is owned by the host, and we can attach the BPF program
directly to the default route inside the container so that all packets
egressing from the container pass through it. That FIB lookup is
needed anyway, so we can leverage its cost. We can drop hostile
packets early without ever touching the L2 level.
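
To make the early drop concrete, here is a sketch of such a program
attached to the container's default route. The allowed prefix, section
name and policy are purely illustrative.

#include <linux/bpf.h>
#include <linux/types.h>
#include <linux/ipv6.h>
#include "bpf_helpers.h"

SEC("container_egress")
int do_drop_hostile(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ipv6hdr *ip6 = data;

	if (data + sizeof(*ip6) > data_end)
		return BPF_DROP;

	/* Example policy: the container may only talk to 2001:db8::/32,
	 * everything else is dropped before it ever reaches L2.
	 */
	if (ip6->daddr.s6_addr[0] != 0x20 || ip6->daddr.s6_addr[1] != 0x01 ||
	    ip6->daddr.s6_addr[2] != 0x0d || ip6->daddr.s6_addr[3] != 0xb8)
		return BPF_DROP;

	return BPF_OK;
}

char _license[] SEC("license") = "GPL";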