[PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-30 Thread Thomas Graf
Register two new BPF prog types BPF_PROG_TYPE_LWT_IN and
BPF_PROG_TYPE_LWT_OUT which are invoked if a route contains a
LWT redirection of type LWTUNNEL_ENCAP_BPF.

The separate program types are required because manipulation of
packet data is only allowed on the output and transmit path as
the subsequent dst_input() call path assumes an IP header
validated by ip_rcv(). The BPF programs will be handed an skb
with the L3 header attached and may return one of the following
return codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return EPERM
 BPF_REDIRECT - Redirect skb to device as per redirect() helper.
(Only valid on lwtunnel_xmit() hook)

The return codes are binary compatible with their TC_ACT_
relatives to ease compatibility.

A new helper bpf_skb_push() is added which allows to preprend an
L2 header in front of the skb, extend the existing L3 header, or
both. This allows to address a wide range of issues:
 - Optimize L2 header construction when L2 information is always
   static to avoid ARP/NDisc lookup.
 - Extend IP header to add additional IP options.
 - Perform simple encapsulation where offload is of no concern.
   (The existing funtionality to attach a tunnel key to the skb
and redirect to a tunnel net_device to allow for offload
continues to work obviously).

Signed-off-by: Thomas Graf 
---
 include/linux/filter.h|   2 +-
 include/uapi/linux/bpf.h  |  31 +++-
 include/uapi/linux/lwtunnel.h |  21 +++
 kernel/bpf/verifier.c |  16 +-
 net/core/Makefile |   2 +-
 net/core/filter.c | 148 -
 net/core/lwt_bpf.c| 365 ++
 net/core/lwtunnel.c   |   1 +
 8 files changed, 579 insertions(+), 7 deletions(-)
 create mode 100644 net/core/lwt_bpf.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c52..aad7f81 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -438,7 +438,7 @@ struct xdp_buff {
 };
 
 /* compute the linear packet data range [data, data_end) which
- * will be accessed by cls_bpf and act_bpf programs
+ * will be accessed by cls_bpf, act_bpf and lwt programs
  */
 static inline void bpf_compute_data_end(struct sk_buff *skb)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e2f38e0..2ebaa3c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,6 +96,9 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+   BPF_PROG_TYPE_LWT_IN,
+   BPF_PROG_TYPE_LWT_OUT,
+   BPF_PROG_TYPE_LWT_XMIT,
 };
 
 #define BPF_PSEUDO_MAP_FD  1
@@ -383,6 +386,16 @@ union bpf_attr {
  *
  * int bpf_get_numa_node_id()
  * Return: Id of current NUMA node.
+ *
+ * int bpf_skb_push()
+ * Add room to beginning of skb and adjusts MAC header offset accordingly.
+ * Extends/reallocaes for needed skb headeroom automatically.
+ * May change skb data pointer and will thus invalidate any check done
+ * for direct packet access.
+ * @skb: pointer to skb
+ * @len: length of header to be pushed in front
+ * @flags: Flags (unused for now)
+ * Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -427,7 +440,8 @@ union bpf_attr {
FN(skb_pull_data),  \
FN(csum_update),\
FN(set_hash_invalid),   \
-   FN(get_numa_node_id),
+   FN(get_numa_node_id),   \
+   FN(skb_push),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -511,6 +525,21 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+/* Generic BPF return codes which all BPF program types may support.
+ * The values are binary compatible with their TC_ACT_* counter-part to
+ * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
+ * programs.
+ *
+ * XDP is handled seprately, see XDP_*.
+ */
+enum bpf_ret_code {
+   BPF_OK = 0,
+   /* 1 reserved */
+   BPF_DROP = 2,
+   /* 3-6 reserved */
+   BPF_REDIRECT = 7,
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index a478fe8..9354d997 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -9,6 +9,7 @@ enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_IP,
LWTUNNEL_ENCAP_ILA,
LWTUNNEL_ENCAP_IP6,
+   LWTUNNEL_ENCAP_BPF,
__LWTUNNEL_ENCAP_MAX,
 };
 
@@ -42,4 +43,24 @@ enum lwtunnel_ip6_t {
 
 #define LWTUNNEL_IP6_MAX (__LWTUNNEL_IP6_MAX - 1)
 
+enum {
+   LWT_BPF_PROG_UNSPEC,
+   LWT_BPF_PROG_FD,
+   LWT_BPF_PROG_NAME,
+ 

Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-30 Thread Tom Herbert
On Sun, Oct 30, 2016 at 4:58 AM, Thomas Graf  wrote:
> Register two new BPF prog types BPF_PROG_TYPE_LWT_IN and
> BPF_PROG_TYPE_LWT_OUT which are invoked if a route contains a
> LWT redirection of type LWTUNNEL_ENCAP_BPF.
>
> The separate program types are required because manipulation of
> packet data is only allowed on the output and transmit path as
> the subsequent dst_input() call path assumes an IP header
> validated by ip_rcv(). The BPF programs will be handed an skb
> with the L3 header attached and may return one of the following
> return codes:
>
>  BPF_OK - Continue routing as per nexthop
>  BPF_DROP - Drop skb and return EPERM
>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
> (Only valid on lwtunnel_xmit() hook)
>
> The return codes are binary compatible with their TC_ACT_
> relatives to ease compatibility.
>
> A new helper bpf_skb_push() is added which allows to preprend an
> L2 header in front of the skb, extend the existing L3 header, or
> both. This allows to address a wide range of issues:
>  - Optimize L2 header construction when L2 information is always
>static to avoid ARP/NDisc lookup.
>  - Extend IP header to add additional IP options.
>  - Perform simple encapsulation where offload is of no concern.
>(The existing funtionality to attach a tunnel key to the skb
> and redirect to a tunnel net_device to allow for offload
> continues to work obviously).
>
> Signed-off-by: Thomas Graf 
> ---
>  include/linux/filter.h|   2 +-
>  include/uapi/linux/bpf.h  |  31 +++-
>  include/uapi/linux/lwtunnel.h |  21 +++
>  kernel/bpf/verifier.c |  16 +-
>  net/core/Makefile |   2 +-
>  net/core/filter.c | 148 -
>  net/core/lwt_bpf.c| 365 
> ++
>  net/core/lwtunnel.c   |   1 +
>  8 files changed, 579 insertions(+), 7 deletions(-)
>  create mode 100644 net/core/lwt_bpf.c
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 1f09c52..aad7f81 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -438,7 +438,7 @@ struct xdp_buff {
>  };
>
>  /* compute the linear packet data range [data, data_end) which
> - * will be accessed by cls_bpf and act_bpf programs
> + * will be accessed by cls_bpf, act_bpf and lwt programs
>   */
>  static inline void bpf_compute_data_end(struct sk_buff *skb)
>  {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e2f38e0..2ebaa3c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -96,6 +96,9 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_TRACEPOINT,
> BPF_PROG_TYPE_XDP,
> BPF_PROG_TYPE_PERF_EVENT,
> +   BPF_PROG_TYPE_LWT_IN,
> +   BPF_PROG_TYPE_LWT_OUT,
> +   BPF_PROG_TYPE_LWT_XMIT,
>  };
>
>  #define BPF_PSEUDO_MAP_FD  1
> @@ -383,6 +386,16 @@ union bpf_attr {
>   *
>   * int bpf_get_numa_node_id()
>   * Return: Id of current NUMA node.
> + *
> + * int bpf_skb_push()
> + * Add room to beginning of skb and adjusts MAC header offset 
> accordingly.
> + * Extends/reallocaes for needed skb headeroom automatically.
> + * May change skb data pointer and will thus invalidate any check done
> + * for direct packet access.
> + * @skb: pointer to skb
> + * @len: length of header to be pushed in front
> + * @flags: Flags (unused for now)
> + * Return: 0 on success or negative error
>   */
>  #define __BPF_FUNC_MAPPER(FN)  \
> FN(unspec), \
> @@ -427,7 +440,8 @@ union bpf_attr {
> FN(skb_pull_data),  \
> FN(csum_update),\
> FN(set_hash_invalid),   \
> -   FN(get_numa_node_id),
> +   FN(get_numa_node_id),   \
> +   FN(skb_push),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -511,6 +525,21 @@ struct bpf_tunnel_key {
> __u32 tunnel_label;
>  };
>
> +/* Generic BPF return codes which all BPF program types may support.
> + * The values are binary compatible with their TC_ACT_* counter-part to
> + * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
> + * programs.
> + *
> + * XDP is handled seprately, see XDP_*.
> + */
> +enum bpf_ret_code {
> +   BPF_OK = 0,
> +   /* 1 reserved */
> +   BPF_DROP = 2,
> +   /* 3-6 reserved */
> +   BPF_REDIRECT = 7,
> +};
> +
>  /* User return codes for XDP prog type.
>   * A valid XDP program must return one of these defined values. All other
>   * return codes are reserved for future use. Unknown return codes will result
> diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
> index a478fe8..9354d997 100644
> --- a/include/uapi/linux/lwtunnel.h
> +++ b/include/uapi/linux/lwtunnel.h
> @@ -9,6 +9,7 @@ enum lwtunnel_encap_types {
> LWTUNNEL_ENCAP_IP,
> LWTUNN

Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-30 Thread Thomas Graf
On 10/30/16 at 01:34pm, Tom Herbert wrote:
> On Sun, Oct 30, 2016 at 4:58 AM, Thomas Graf  wrote:
> > +   if (unlikely(!dst->lwtstate->orig_output)) {
> > +   WARN_ONCE(1, "orig_output not set on dst for prog %s\n",
> > + bpf->out.name);
> > +   kfree_skb(skb);
> > +   return -EINVAL;
> > +   }
> > +
> > +   return dst->lwtstate->orig_output(net, sk, skb);
> 
> The BPF program may have changed the destination address so continuing
> with original route in skb may not be appropriate here. This was fixed
> in ila_lwt by calling ip6_route_output and we were able to dst cache
> facility to cache the route to avoid cost of looking it up on every
> packet. Since the kernel  has no insight into what the BPF program
> does to the packet I'd suggest 1) checking if destination address
> changed by BPF and if it did then call route_output to get new route
> 2) If the LWT destination is a host route then try to keep a dst
> cache. This would entail checking destination address on return that
> it is the same one as kept in the dst cache.

Instead of building complex logic, we can allow the program to return
a code to indicate when to perform another route lookup just as we do
for the redirect case. Just because the destination address has
changed may not require another lookup in all cases. A typical example
would be a program rewriting addresses for the default route to other
address which are always handled by the default route as well. An
unconditional lookup would hurt performance in many cases.


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-30 Thread Tom Herbert
On Sun, Oct 30, 2016 at 2:47 PM, Thomas Graf  wrote:
> On 10/30/16 at 01:34pm, Tom Herbert wrote:
>> On Sun, Oct 30, 2016 at 4:58 AM, Thomas Graf  wrote:
>> > +   if (unlikely(!dst->lwtstate->orig_output)) {
>> > +   WARN_ONCE(1, "orig_output not set on dst for prog %s\n",
>> > + bpf->out.name);
>> > +   kfree_skb(skb);
>> > +   return -EINVAL;
>> > +   }
>> > +
>> > +   return dst->lwtstate->orig_output(net, sk, skb);
>>
>> The BPF program may have changed the destination address so continuing
>> with original route in skb may not be appropriate here. This was fixed
>> in ila_lwt by calling ip6_route_output and we were able to dst cache
>> facility to cache the route to avoid cost of looking it up on every
>> packet. Since the kernel  has no insight into what the BPF program
>> does to the packet I'd suggest 1) checking if destination address
>> changed by BPF and if it did then call route_output to get new route
>> 2) If the LWT destination is a host route then try to keep a dst
>> cache. This would entail checking destination address on return that
>> it is the same one as kept in the dst cache.
>
> Instead of building complex logic, we can allow the program to return
> a code to indicate when to perform another route lookup just as we do
> for the redirect case. Just because the destination address has
> changed may not require another lookup in all cases. A typical example
> would be a program rewriting addresses for the default route to other
> address which are always handled by the default route as well. An
> unconditional lookup would hurt performance in many cases.

Right, that's why we rely on a dst cache. Any use of LWT that
encapsulates or tunnels to a fixed destination (ILA, VXLAN, IPIP,
etc.) would want to use the dst cache optimization to avoid the second
lookup. The ILA LWT code used to call orig output and that worked as
long as we could set the default router as the gateway "via". It was
something we were able to deploy, but not a general solution.
Integrating properly with routing gives a much better solution IMO.
Note that David Lebrun's latest LWT Segment Routing patch does the
second lookup with the dst cache to try to avoid it.

Thanks,
Tom


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
On 10/30/16 at 06:28pm, Tom Herbert wrote:
> On Sun, Oct 30, 2016 at 2:47 PM, Thomas Graf  wrote:
> > Instead of building complex logic, we can allow the program to return
> > a code to indicate when to perform another route lookup just as we do
> > for the redirect case. Just because the destination address has
> > changed may not require another lookup in all cases. A typical example
> > would be a program rewriting addresses for the default route to other
> > address which are always handled by the default route as well. An
> > unconditional lookup would hurt performance in many cases.
> 
> Right, that's why we rely on a dst cache. Any use of LWT that
> encapsulates or tunnels to a fixed destination (ILA, VXLAN, IPIP,
> etc.) would want to use the dst cache optimization to avoid the second
> lookup. The ILA LWT code used to call orig output and that worked as
> long as we could set the default router as the gateway "via". It was
> something we were able to deploy, but not a general solution.
> Integrating properly with routing gives a much better solution IMO.
> Note that David Lebrun's latest LWT Segment Routing patch does the
> second lookup with the dst cache to try to avoid it.

Yes, I saw both ILA and SR dst_cache. I was planning on addressing
the conditional reroute in a second step but it looks fairly simple
actually so I'm fine adding this in a v2 based on a return code. I
will limit lwt-bpf to AF_INET && AF_INET6 though.


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
On 10/30/16 at 06:28pm, Tom Herbert wrote:
> Right, that's why we rely on a dst cache. Any use of LWT that
> encapsulates or tunnels to a fixed destination (ILA, VXLAN, IPIP,
> etc.) would want to use the dst cache optimization to avoid the second
> lookup. The ILA LWT code used to call orig output and that worked as
> long as we could set the default router as the gateway "via". It was
> something we were able to deploy, but not a general solution.
> Integrating properly with routing gives a much better solution IMO.
> Note that David Lebrun's latest LWT Segment Routing patch does the
> second lookup with the dst cache to try to avoid it.

Noticed while implementing this: How does ILA ensure that dst_output()
is not invoked in a circular manner?

dstA->output() -> dstB->otuput() -> dstA->output() -> ...


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Tom Herbert
On Mon, Oct 31, 2016 at 5:59 AM, Thomas Graf  wrote:
> On 10/30/16 at 06:28pm, Tom Herbert wrote:
>> Right, that's why we rely on a dst cache. Any use of LWT that
>> encapsulates or tunnels to a fixed destination (ILA, VXLAN, IPIP,
>> etc.) would want to use the dst cache optimization to avoid the second
>> lookup. The ILA LWT code used to call orig output and that worked as
>> long as we could set the default router as the gateway "via". It was
>> something we were able to deploy, but not a general solution.
>> Integrating properly with routing gives a much better solution IMO.
>> Note that David Lebrun's latest LWT Segment Routing patch does the
>> second lookup with the dst cache to try to avoid it.
>
> Noticed while implementing this: How does ILA ensure that dst_output()
> is not invoked in a circular manner?
>
> dstA->output() -> dstB->otuput() -> dstA->output() -> ...

It doesn't. We'll need to add a check for that. Maybe the rule should
be that an skbuff is only allowed to hit one LWT route?

Another scenario to consider: Suppose someone is doing protocol
translation like in RFC7915. This is one of operations we'd need with
ILA or GRE to implement an IPv4 overlay network over IPv6. Would this
be allowed/supported in LWT BPF?

Tom


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
On 10/31/16 at 07:17am, Tom Herbert wrote:
> On Mon, Oct 31, 2016 at 5:59 AM, Thomas Graf  wrote:
> > Noticed while implementing this: How does ILA ensure that dst_output()
> > is not invoked in a circular manner?
> >
> > dstA->output() -> dstB->otuput() -> dstA->output() -> ...
> 
> It doesn't. We'll need to add a check for that. Maybe the rule should
> be that an skbuff is only allowed to hit one LWT route?

I'll add a per cpu variable to do a recursion limit for dst_output()
which callers to possibly recurse can use. ILA can use that as well.

> Another scenario to consider: Suppose someone is doing protocol
> translation like in RFC7915. This is one of operations we'd need with
> ILA or GRE to implement an IPv4 overlay network over IPv6. Would this
> be allowed/supported in LWT BPF?

In lwtunnel_xmit() yes, input and output would not support this right
now. It will need some logic as the orig_input and orig_output would
obviously be expecting the same protocol. This is the reason why the
xmit prog type currently has a wider set of allowed helpers.


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Tom Herbert
On Mon, Oct 31, 2016 at 8:06 AM, Thomas Graf  wrote:
> On 10/31/16 at 07:17am, Tom Herbert wrote:
>> On Mon, Oct 31, 2016 at 5:59 AM, Thomas Graf  wrote:
>> > Noticed while implementing this: How does ILA ensure that dst_output()
>> > is not invoked in a circular manner?
>> >
>> > dstA->output() -> dstB->otuput() -> dstA->output() -> ...
>>
>> It doesn't. We'll need to add a check for that. Maybe the rule should
>> be that an skbuff is only allowed to hit one LWT route?
>
> I'll add a per cpu variable to do a recursion limit for dst_output()
> which callers to possibly recurse can use. ILA can use that as well.
>
>> Another scenario to consider: Suppose someone is doing protocol
>> translation like in RFC7915. This is one of operations we'd need with
>> ILA or GRE to implement an IPv4 overlay network over IPv6. Would this
>> be allowed/supported in LWT BPF?
>
> In lwtunnel_xmit() yes, input and output would not support this right
> now. It will need some logic as the orig_input and orig_output would
> obviously be expecting the same protocol. This is the reason why the
> xmit prog type currently has a wider set of allowed helpers.

I guess this leads to a more general question I have about the effects
of allowing userspace to insert code in the kernel that modifies
packets. If we allow BPF programs to arbitrarily modify packets in
LWT, how do we ensure that there are no insidious effects later in the
path? For instance,  what someone uses BPF to convert an IPv6 packet
to IPv4, or maybe convert packet to something that isn't even IP, or
what if someone just decides to overwrite every byte in a packet with
0xff? Are these thing allowed, and if so what is the effect? I would
assume a policy that these can't cause any insidious effects to
unrelated traffic or the rest of the system, in particular such things
should not cause the  kernel to crash (based on the principle that
user space code should never cause kernel to crash). I think XDP might
be okay since the path is straightforward and only deals with raw
packets, but LWT is much higher in that stack.

Thanks,
Tom


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
On 10/31/16 at 09:07am, Tom Herbert wrote:
> I guess this leads to a more general question I have about the effects
> of allowing userspace to insert code in the kernel that modifies
> packets. If we allow BPF programs to arbitrarily modify packets in
> LWT, how do we ensure that there are no insidious effects later in the
> path? For instance,  what someone uses BPF to convert an IPv6 packet
> to IPv4, or maybe convert packet to something that isn't even IP, or
> what if someone just decides to overwrite every byte in a packet with
> 0xff?

This is why modifying packets is not allowed on input at all as it
would invalidate the IP parsing that has already been done.

Writing is allowed for dst_output() on the basis that it is the
equivalent of a raw socket with header inclusion. If you look at
rawv6_send_hdrinc(), it does not perform any validation and calls into
dst_output() directly. I agree though that this must be made water
proof.

Pushing additional headers is only allowed at xmit, this is the
equivalent LWT MPLS.

> Are these thing allowed, and if so what is the effect? I would
> assume a policy that these can't cause any insidious effects to
> unrelated traffic or the rest of the system, in particular such things
> should not cause the  kernel to crash (based on the principle that
> user space code should never cause kernel to crash). I think XDP might

Agreed. Although it's already possible to hook a kernel module at LWT
or Netfilter to do arbitrary packet modifications, BPF must be held
at a higher standard even in privileged mode.