Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread Peter Oskolkov
Thanks, David! This is for egress only, so I'll add an appropriate
check. I'll also address your other comments/concerns in a v2 version
of this patchset.
On Fri, Nov 30, 2018 at 12:08 PM David Ahern  wrote:
>
> On 11/28/18 6:34 PM, Peter Oskolkov wrote:
> > On Wed, Nov 28, 2018 at 4:47 PM David Ahern  wrote:
> >>
> >> On 11/28/18 5:22 PM, Peter Oskolkov wrote:
> >>> diff --git a/net/core/filter.c b/net/core/filter.c
> >>> index bd0df75dc7b6..17f3c37218e5 100644
> >>> --- a/net/core/filter.c
> >>> +++ b/net/core/filter.c
> >>> @@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
> >>>  }
> >>>  #endif /* CONFIG_IPV6_SEG6_BPF */
> >>>
> >>> +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
> >>> +{
> >>> + struct dst_entry *dst;
> >>> + struct rtable *rt;
> >>> + struct iphdr *iph;
> >>> + struct net *net;
> >>> + int err;
> >>> +
> >>> + if (skb->protocol != htons(ETH_P_IP))
> >>> + return -EINVAL;  /* ETH_P_IPV6 not yet supported */
> >>> +
> >>> + iph = (struct iphdr *)hdr;
> >>> +
> >>> + if (unlikely(len < sizeof(struct iphdr) || len > LWTUNNEL_MAX_ENCAP_HSIZE))
> >>> + return -EINVAL;
> >>> + if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
> >>> + return -EINVAL;
> >>> +
> >>> + if (skb->sk)
> >>> + net = sock_net(skb->sk);
> >>> + else {
> >>> + net = dev_net(skb_dst(skb)->dev);
> >>> + }
> >>> + rt = ip_route_output(net, iph->daddr, 0, 0, 0);
> >>
> >> That is a very limited use case. e.g., oif = 0 means you are not
> >> considering any kind of policy routing (e.g., VRF).
> >
> > Hi David! Could you be a bit more specific re: what you would like to
> > see here? Thanks!
> >
>
> Is the encap happening on ingress or egress? Seems like the current code
> does not assume either direction for lwt (BPF_PROG_TYPE_LWT_IN vs
> BPF_PROG_TYPE_LWT_OUT), yet your change does - output only. Basically,
> you should be filling in a flow struct and doing a proper lookup.
>
> When the potentially custom encap header is pushed on, seems to me skb
> marks should still be honored for the route lookup. If not, you should
> handle that in the API.
>
> From there skb->dev at a minimum should be used as either iif (ingress)
> or oif (egress).
>
> The iph is already set so you have quick access to the tos.
>
> Also, this should implement IPv6 as well before going in.
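
A minimal sketch of the fuller lookup being asked for here, assuming the
egress path: fill a struct flowi4 from the skb and the new outer header,
then do a real lookup so skb->mark, the header's tos, and the egress
device are all honored. Names below are illustrative, not code from the
patchset:

	static struct rtable *encap_route_lookup(struct net *net,
						 struct sk_buff *skb,
						 const struct iphdr *iph)
	{
		struct flowi4 fl4 = {};

		fl4.flowi4_oif = skb_dst(skb)->dev->ifindex;	/* egress dev as oif */
		fl4.flowi4_mark = skb->mark;			/* honor skb marks */
		fl4.flowi4_tos = RT_TOS(iph->tos);		/* tos from the outer header */
		fl4.flowi4_proto = iph->protocol;
		fl4.daddr = iph->daddr;
		fl4.saddr = iph->saddr;

		return ip_route_output_key(net, &fl4);
	}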


Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread Peter Oskolkov
On Fri, Nov 30, 2018 at 3:52 PM David Ahern  wrote:
>
> On 11/30/18 4:35 PM, Peter Oskolkov wrote:
> > Thanks, David! This is for egress only, so I'll add an appropriate
> > check. I'll also address your other comments/concerns in a v2 version
> > of this patchset.
>
> Why are you limiting this to egress only?

I'm focusing on egress because this is a use case that we have and
understand well, and would like to have a solution for sooner rather
than later.

Without understanding why anybody would want to do lwt-bpf encap on
ingress, I don't know, for example, what a good test would look like;
but I'd be happy to look into a specific ingress use case if you have
one.


Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-12-03 Thread Peter Oskolkov
Our case is a bit different - it is more like using the SRH header in
IPv6 to route packets via a non-default intermediate hop. But I see
your point - I'll expand the patchset to cover IPv6 and the ingress
path.
On Mon, Dec 3, 2018 at 8:04 AM David Ahern  wrote:
>
> On 11/30/18 5:14 PM, Peter Oskolkov wrote:
> > On Fri, Nov 30, 2018 at 3:52 PM David Ahern  wrote:
> >>
> >> On 11/30/18 4:35 PM, Peter Oskolkov wrote:
> >>> Thanks, David! This is for egress only, so I'll add an appropriate
> >>> check. I'll also address your other comments/concerns in a v2 version
> >>> of this patchset.
> >>
> >> Why are you limiting this to egress only?
> >
> > I'm focusing on egress because this is a use case that we have and
> > understand well, and would like to have a solution for sooner rather
> > than later.
> >
> > Without understanding why anybody would want to do lwt-bpf encap on
> > ingress, I don't know, for example, what a good test would look like;
> > but I'd be happy to look into a specific ingress use case if you have
> > one.
> >
>
> We can not have proliferation of helpers for a lot of one off use cases.
> A little thought now makes this helper useful for more than just your 1
> use case. And, IPv6 parity should be a minimal requirement for helpers.
>
> Based on your route lookup I presume your use case is capturing certain
> local traffic, pushing a custom header and sending that packet
> elsewhere. The same could be done on the ingress path.


[PATCH net-next] net: netem: use a list in addition to rbtree

2018-12-03 Thread Peter Oskolkov
When testing high-bandwidth TCP streams with large windows,
high latency, and low jitter, netem consumes a lot of CPU cycles
doing rbtree rebalancing.

This patch uses a linear list/queue in addition to the rbtree:
if an incoming packet is past the tail of the linear queue, it is
added there, otherwise it is inserted into the rbtree.

Without this patch, perf shows netem_enqueue, netem_dequeue,
and rb_* functions among the top offenders. With this patch,
only netem_enqueue is noticeable if jitter is low/absent.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
---
 net/sched/sch_netem.c | 86 ++-
 1 file changed, 68 insertions(+), 18 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2c38e3d07924..f1099486ecd3 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -77,6 +77,10 @@ struct netem_sched_data {
/* internal t(ime)fifo qdisc uses t_root and sch->limit */
struct rb_root t_root;
 
+   /* a linear queue; reduces rbtree rebalancing when jitter is low */
+   struct sk_buff  *t_head;
+   struct sk_buff  *t_tail;
+
/* optional qdisc for classful handling (NULL at netem init) */
 struct Qdisc *qdisc;
 
@@ -369,26 +373,39 @@ static void tfifo_reset(struct Qdisc *sch)
rb_erase(&skb->rbnode, &q->t_root);
rtnl_kfree_skbs(skb, skb);
}
+
+   rtnl_kfree_skbs(q->t_head, q->t_tail);
+   q->t_head = NULL;
+   q->t_tail = NULL;
 }
 
 static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
u64 tnext = netem_skb_cb(nskb)->time_to_send;
-   struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
 
-   while (*p) {
-   struct sk_buff *skb;
-
-   parent = *p;
-   skb = rb_to_skb(parent);
-   if (tnext >= netem_skb_cb(skb)->time_to_send)
-   p = &parent->rb_right;
+   if (!q->t_tail || tnext >= netem_skb_cb(q->t_tail)->time_to_send) {
+   if (q->t_tail)
+   q->t_tail->next = nskb;
else
-   p = &parent->rb_left;
+   q->t_head = nskb;
+   q->t_tail = nskb;
+   } else {
+   struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
+
+   while (*p) {
+   struct sk_buff *skb;
+
+   parent = *p;
+   skb = rb_to_skb(parent);
+   if (tnext >= netem_skb_cb(skb)->time_to_send)
+   p = &parent->rb_right;
+   else
+   p = &parent->rb_left;
+   }
+   rb_link_node(&nskb->rbnode, parent, p);
+   rb_insert_color(&nskb->rbnode, &q->t_root);
}
-   rb_link_node(&nskb->rbnode, parent, p);
-   rb_insert_color(&nskb->rbnode, &q->t_root);
sch->q.qlen++;
 }
 
@@ -534,6 +551,15 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
last = t_last;
}
}
+   if (q->t_tail) {
+   struct netem_skb_cb *t_last =
+   netem_skb_cb(q->t_tail);
+
+   if (!last ||
+   t_last->time_to_send > last->time_to_send) {
+   last = t_last;
+   }
+   }
 
if (last) {
/*
@@ -611,11 +637,37 @@ static void get_slot_next(struct netem_sched_data *q, u64 now)
q->slot.bytes_left = q->slot_config.max_bytes;
 }
 
+static struct sk_buff *netem_peek(struct netem_sched_data *q)
+{
+   struct sk_buff *skb = skb_rb_first(&q->t_root);
+   u64 t1, t2;
+
+   if (!skb)
+   return q->t_head;
+   if (!q->t_head)
+   return skb;
+
+   t1 = netem_skb_cb(skb)->time_to_send;
+   t2 = netem_skb_cb(q->t_head)->time_to_send;
+   if (t1 < t2)
+   return skb;
+   return q->t_head;
+}
+
+static void netem_erase_head(struct netem_sched_data *q, struct sk_buff *skb)
+{
+   if (skb == q->t_head) {
+   q->t_head = skb->next;
+   if (!q->t_head)
+   q->t_tail = NULL;
+   } else
+   rb_erase(&skb->rbnode, &q->t_root);
+}
+
 static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
struct sk_buff *skb;
-

Re: [PATCH net-next] net: netem: use a list in addition to rbtree

2018-12-04 Thread Peter Oskolkov
Thanks, Stephen!

I don't care much about braces either. David, do you want me to send a
new patch with braces moved around?

On Tue, Dec 4, 2018 at 9:56 AM Stephen Hemminger wrote:
>
> I like this, it makes a lot of sense since packets are almost
> always queued in order.
>
> Minor style stuff you might want to fix (but don't have to).
>
> > + if (!last ||
> > + t_last->time_to_send > last->time_to_send) {
> > + last = t_last;
> > + }
>
> I don't think you need braces here for single assignment.
>
> > +static void netem_erase_head(struct netem_sched_data *q, struct sk_buff *skb)
> > +{
> > + if (skb == q->t_head) {
> > + q->t_head = skb->next;
> > + if (!q->t_head)
> > + q->t_tail = NULL;
> > + } else
> > + rb_erase(&skb->rbnode, &q->t_root);
>
> Checkpatch wants both sides of if/else to have brackets.
> Personally, don't care.
>
> Reviewed-by: Stephen Hemminger 


[PATCH v2 net-next 0/1] net: netem: use a list _and_ rbtree

2018-12-04 Thread Peter Oskolkov
v2: address style suggestions by Stephen Hemminger.

All changes are noop vs v1.

Peter Oskolkov (1):
  net: netem: use a list in addition to rbtree

 net/sched/sch_netem.c | 89 +--
 1 file changed, 69 insertions(+), 20 deletions(-)



[PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree

2018-12-04 Thread Peter Oskolkov
When testing high-bandwidth TCP streams with large windows,
high latency, and low jitter, netem consumes a lot of CPU cycles
doing rbtree rebalancing.

This patch uses a linear list/queue in addition to the rbtree:
if an incoming packet is past the tail of the linear queue, it is
added there, otherwise it is inserted into the rbtree.

Without this patch, perf shows netem_enqueue, netem_dequeue,
and rb_* functions among the top offenders. With this patch,
only netem_enqueue is noticeable if jitter is low/absent.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
---
 net/sched/sch_netem.c | 89 +--
 1 file changed, 69 insertions(+), 20 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2c38e3d07924..84658f60a872 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -77,6 +77,10 @@ struct netem_sched_data {
/* internal t(ime)fifo qdisc uses t_root and sch->limit */
struct rb_root t_root;
 
+   /* a linear queue; reduces rbtree rebalancing when jitter is low */
+   struct sk_buff  *t_head;
+   struct sk_buff  *t_tail;
+
/* optional qdisc for classful handling (NULL at netem init) */
 struct Qdisc *qdisc;
 
@@ -369,26 +373,39 @@ static void tfifo_reset(struct Qdisc *sch)
rb_erase(&skb->rbnode, &q->t_root);
rtnl_kfree_skbs(skb, skb);
}
+
+   rtnl_kfree_skbs(q->t_head, q->t_tail);
+   q->t_head = NULL;
+   q->t_tail = NULL;
 }
 
 static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
u64 tnext = netem_skb_cb(nskb)->time_to_send;
-   struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
 
-   while (*p) {
-   struct sk_buff *skb;
-
-   parent = *p;
-   skb = rb_to_skb(parent);
-   if (tnext >= netem_skb_cb(skb)->time_to_send)
-   p = &parent->rb_right;
+   if (!q->t_tail || tnext >= netem_skb_cb(q->t_tail)->time_to_send) {
+   if (q->t_tail)
+   q->t_tail->next = nskb;
else
-   p = &parent->rb_left;
+   q->t_head = nskb;
+   q->t_tail = nskb;
+   } else {
+   struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
+
+   while (*p) {
+   struct sk_buff *skb;
+
+   parent = *p;
+   skb = rb_to_skb(parent);
+   if (tnext >= netem_skb_cb(skb)->time_to_send)
+   p = &parent->rb_right;
+   else
+   p = &parent->rb_left;
+   }
+   rb_link_node(&nskb->rbnode, parent, p);
+   rb_insert_color(&nskb->rbnode, &q->t_root);
}
-   rb_link_node(&nskb->rbnode, parent, p);
-   rb_insert_color(&nskb->rbnode, &q->t_root);
sch->q.qlen++;
 }
 
@@ -530,9 +547,16 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
t_skb = skb_rb_last(&q->t_root);
t_last = netem_skb_cb(t_skb);
if (!last ||
-   t_last->time_to_send > last->time_to_send) {
+   t_last->time_to_send > last->time_to_send)
+   last = t_last;
+   }
+   if (q->t_tail) {
+   struct netem_skb_cb *t_last =
+   netem_skb_cb(q->t_tail);
+
+   if (!last ||
+   t_last->time_to_send > last->time_to_send)
last = t_last;
-   }
}
 
if (last) {
@@ -611,11 +635,38 @@ static void get_slot_next(struct netem_sched_data *q, u64 now)
q->slot.bytes_left = q->slot_config.max_bytes;
 }
 
+static struct sk_buff *netem_peek(struct netem_sched_data *q)
+{
+   struct sk_buff *skb = skb_rb_first(&q->t_root);
+   u64 t1, t2;
+
+   if (!skb)
+   return q->t_head;
+   if (!q->t_head)
+   return skb;
+
+   t1 = netem_skb_cb(skb)->time_to_send;
+   t2 = netem_skb_cb(q->t_head)->time_to_send;
+   if (t1 < t2)
+   return skb;
+   return q->t_head;
+}
+
+static void netem_erase_head(struct netem_sched_data *q, struct sk_buff *skb)
+{
+   if (skb == q->t_head) {
+   q->t_head = skb->next;
+   if (!q->t_head)
+   q->t_tail = NULL;
+   } else {
+   rb_erase(&skb->rbnode, &q->t_root);
+   }
+}

Re: [PATCH net-next] net: netem: use a list in addition to rbtree

2018-12-04 Thread Peter Oskolkov
On Tue, Dec 4, 2018 at 11:11 AM Peter Oskolkov  wrote:
>
> Thanks, Stephen!
>
> I don't care much about braces either. David, do you want me to send a
> new patch with braces moved around?

Sent a v2 with style fixes, just in case.

>
> On Tue, Dec 4, 2018 at 9:56 AM Stephen Hemminger wrote:
> >
> > I like this, it makes a lot of sense since packets are almost
> > always queued in order.
> >
> > Minor style stuff you might want to fix (but don't have to).
> >
> > > + if (!last ||
> > > + t_last->time_to_send > last->time_to_send) {
> > > + last = t_last;
> > > + }
> >
> > I don't think you need braces here for single assignment.
> >
> > > +static void netem_erase_head(struct netem_sched_data *q, struct sk_buff *skb)
> > > +{
> > > + if (skb == q->t_head) {
> > > + q->t_head = skb->next;
> > > + if (!q->t_head)
> > > + q->t_tail = NULL;
> > > + } else
> > > + rb_erase(&skb->rbnode, &q->t_root);
> >
> > Checkpatch wants both sides of if/else to have brackets.
> > Personally, don't care.
> >
> > Reviewed-by: Stephen Hemminger 


Re: [PATCH net] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes

2018-12-05 Thread Peter Oskolkov
On Wed, Dec 5, 2018 at 7:57 AM Jiri Wiesner  wrote:
>
> The *_frag_reasm() functions are susceptible to miscalculating the byte
> count of packet fragments in case the truesize of a head buffer changes.
> The truesize member may be changed by the call to skb_unclone(), leaving
> the fragment memory limit counter unbalanced even if all fragments are
> processed. This miscalculation goes unnoticed as long as the network
> namespace which holds the counter is not destroyed.
>
> Should an attempt be made to destroy a network namespace that holds an
> unbalanced fragment memory limit counter the cleanup of the namespace
> never finishes. The thread handling the cleanup gets stuck in
> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
> thread is usually in running state with a stacktrace similar to:
>
>  PID: 1073   TASK: ffff880626711440  CPU: 1   COMMAND: "kworker/u48:4"
>   #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
>   #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
>   #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
>   #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
>   #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
>  #10 [ffff880621563e38] process_one_work at ffffffff81096f14
>
> It is not possible to create new network namespaces, and processes
> that call unshare() end up being stuck in uninterruptible sleep state
> waiting to acquire the net_mutex.
>
> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
> I thank him for his analysis of the problem. The parts of this patch
> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
>
> Signed-off-by: Jiri Wiesner 
> Reported-by: Per Sundstrom 
> ---

Acked-by: Peter Oskolkov 

>  net/ipv4/ip_fragment.c  | 7 +++
>  net/ipv6/netfilter/nf_conntrack_reasm.c | 8 +++-
>  net/ipv6/reassembly.c   | 8 +++-
>  3 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index d6ee343fdb86..aa0b22697998 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -515,6 +515,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
> struct rb_node *rbn;
> int len;
> int ihlen;
> +   int delta;
> int err;
> u8 ecn;
>
> @@ -556,10 +557,16 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
> if (len > 65535)
> goto out_oversize;
>
> +   delta = - head->truesize;
> +
> /* Head of list must not be cloned. */
> if (skb_unclone(head, GFP_ATOMIC))
> goto out_nomem;
>
> +   delta += head->truesize;
> +   if (delta)
> +   add_frag_mem_limit(qp->q.net, delta);
> +
> /* If the first fragment is fragmented itself, we split
>  * it to two chunks: the first with data and paged part
>  * and the second, holding only fragments. */
> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
> index d219979c3e52..181da2c40f9a 100644
> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -341,7 +341,7 @@ static bool
>  nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_device *dev)
>  {
> struct sk_buff *fp, *head = fq->q.fragments;
> -   int payload_len;
> +   int payload_len, delta;
> u8 ecn;
>
> inet_frag_kill(&fq->q);
> @@ -363,10 +363,16 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_device *dev)
> return false;
> }
>
> +   delta = - head->truesize;
> +
> /* Head of list must not be cloned. */
> if (skb_unclone(head, GFP_ATOMIC))
> return false;
>
> +   delta += head->truesize;
> +   if (delta)
> +   add_frag_mem_limit(fq->q.net, delta);
> +
> /* If the first fragment is fragmented itself, we split
>  * it to two chunks: the first with data and paged part
>  * and the second, holding only fragments. */
> diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
> index 5c3c92713096..aa26c45486d9 100644
> --- a/net/ipv6/reassembly.c
> +++ b/net/ipv6/reassembly.c
> @@ -281,7 +281,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
>  {
> struct net *net = container_of(fq->q.net, struct net, ipv6.frags);
> struct sk_buff *fp, *head = fq->q.fragments;
> -   int payload_len;
> +

Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-12-05 Thread Peter Oskolkov
FWIW, I find the patch really useful - I applied it to my local dev
repo (with minor changes) and use skb_dump() a lot now. It would be
great if it makes its way into net-next in some form.
On Fri, Nov 30, 2018 at 12:15 PM Saeed Mahameed  wrote:
>
> On Thu, 2018-11-22 at 17:45 -0800, Cong Wang wrote:
> > On Wed, Nov 21, 2018 at 11:33 AM Saeed Mahameed wrote:
> > > On Wed, 2018-11-21 at 10:26 -0800, Eric Dumazet wrote:
> > > > On Wed, Nov 21, 2018 at 10:17 AM Cong Wang <xiyou.wangc...@gmail.com> wrote:
> > > > > On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet <eric.duma...@gmail.com> wrote:
> > > > > >
> > > > > > On 11/20/2018 06:13 PM, Cong Wang wrote:
> > > > > > > Currently, we only dump a few selected skb fields in
> > > > > > > netdev_rx_csum_fault(). It is not sufficient for debugging
> > > > > > > checksum fault. This patch introduces skb_dump() which dumps
> > > > > > > skb mac header, network header and its whole skb->data too.
> > > > > > >
> > > > > > > Cc: Herbert Xu 
> > > > > > > Cc: Eric Dumazet 
> > > > > > > Cc: David Miller 
> > > > > > > Signed-off-by: Cong Wang 
> > > > > > > ---
> > > > > > > + print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET, 16, 1,
> > > > > > > +skb->data, skb->len, false);
> > > > > >
> > > > > > As I mentioned to David, we want all the bytes that were
> > > > > > maybe
> > > > > > already pulled
> > > > > >
> > > > > > (skb->head starting point, not skb->data)
> > > > >
> > > > > Hmm, with mac header and network header, it is effectively from
> > > > > skb->head, no?
> > > > > Is there anything between skb->head and mac header?
> > > >
> > > > Oh, I guess we wanted a single hex dump, or we need some user
> > > > program to be able to rebuild from different memory zones the
> > > > original CHECKSUM_COMPLETE value.
> > > >
> > >
> > > Normally the driver keeps some headroom @skb->head, so the actual
> > > mac header starts @ skb->head + driver_specific_headroom
> >
> > Good to know, but this headroom isn't covered by skb->csum, so
> > not useful here, right? The skb->csum for mlx5 only covers network
> > header and its payload.
>
> correct
>
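
For reference, a sketch of the head-inclusive dump Eric is describing,
using standard skb helpers (the prefix string and log level here are
illustrative):

	/* Dump from skb->head so bytes already pulled (mac/network headers
	 * plus any driver headroom) are included, not just the
	 * skb->data..skb->len window.
	 */
	print_hex_dump(KERN_ERR, "skb head: ", DUMP_PREFIX_OFFSET, 16, 1,
		       skb->head, skb_headroom(skb) + skb_headlen(skb), false);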


[PATCH net-next 0/5] net: prefer listeners bound to an address

2018-12-12 Thread Peter Oskolkov
A relatively common use case is to have several IPs configured
on a host, and have different listeners for each of them. We would
like to add a "catch all" listener on addr_any, to match incoming
connections not served by any of the listeners bound to a specific
address.

However, port-only lookups can match addr_any sockets when sockets
listening on specific addresses are present if so_reuseport flag
is set. This patchset eliminates lookups into port-only hashtable,
as lookups by (addr,port) tuple are easily available.

In a future patchset I plan to explore whether it is possible
to remove port-only hashtables completely: additional refactoring
will be required, as some non-lookup code uses the hashtables.
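
To illustrate the use case (a sketch only; the address, port, and helper
name are arbitrary, and error handling is omitted):

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <sys/socket.h>

	/* Two UDP listeners on the same port: one bound to a specific
	 * address, one bound to addr_any as the catch-all. With this
	 * series, datagrams sent to 127.0.0.1 consistently land on the
	 * specific listener, never on the catch-all.
	 */
	static int udp_listener(in_addr_t addr)
	{
		struct sockaddr_in sa = {
			.sin_family = AF_INET,
			.sin_port = htons(8000),	/* arbitrary port */
			.sin_addr.s_addr = addr,
		};
		int one = 1;
		int fd = socket(AF_INET, SOCK_DGRAM, 0);

		setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
		bind(fd, (struct sockaddr *)&sa, sizeof(sa));
		return fd;
	}

	/* int specific = udp_listener(inet_addr("127.0.0.1"));
	 * int catchall = udp_listener(htonl(INADDR_ANY));
	 */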

Peter Oskolkov (5):
  net: udp: prefer listeners bound to an address
  net: udp6: prefer listeners bound to an address
  net: tcp: prefer listeners bound to an address
  net: tcp6: prefer listeners bound to an address
  selftests: net: test that listening sockets match on address properly

 net/ipv4/inet_hashtables.c|  60 +---
 net/ipv4/udp.c|  76 ++---
 net/ipv6/inet6_hashtables.c   |  54 +---
 net/ipv6/udp.c|  79 ++
 tools/testing/selftests/net/.gitignore|   1 +
 tools/testing/selftests/net/Makefile  |   4 +-
 .../selftests/net/reuseport_addr_any.c| 264 ++
 .../selftests/net/reuseport_addr_any.sh   |   4 +
 8 files changed, 325 insertions(+), 217 deletions(-)
 create mode 100644 tools/testing/selftests/net/reuseport_addr_any.c
 create mode 100755 tools/testing/selftests/net/reuseport_addr_any.sh

-- 
2.20.0.rc2.403.gdbc3b29805-goog



[PATCH net-next 1/5] net: udp: prefer listeners bound to an address

2018-12-12 Thread Peter Oskolkov
A relatively common use case is to have several IPs configured
on a host, and have different listeners for each of them. We would
like to add a "catch all" listener on addr_any, to match incoming
connections not served by any of the listeners bound to a specific
address.

However, port-only lookups can match addr_any sockets when sockets
listening on specific addresses are present if so_reuseport flag
is set. This patch eliminates lookups into port-only hashtable,
as lookups by (addr,port) tuple are easily available.

In addition, compute_score() is tweaked to _not_ match
addr_any sockets to specific addresses, as hash collisions
could result in the unwanted behavior described above.

Tested: the patch compiles; full test in the last patch in this
patchset. Existing reuseport_* selftests also pass.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
---
 net/ipv4/udp.c | 76 +-
 1 file changed, 19 insertions(+), 57 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index aff2a8e99e014..3fb0ed5e4789e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -380,15 +380,12 @@ static int compute_score(struct sock *sk, struct net *net,
ipv6_only_sock(sk))
return -1;
 
-   score = (sk->sk_family == PF_INET) ? 2 : 1;
-   inet = inet_sk(sk);
+   if (sk->sk_rcv_saddr != daddr)
+   return -1;
 
-   if (inet->inet_rcv_saddr) {
-   if (inet->inet_rcv_saddr != daddr)
-   return -1;
-   score += 4;
-   }
+   score = (sk->sk_family == PF_INET) ? 2 : 1;
 
+   inet = inet_sk(sk);
if (inet->inet_daddr) {
if (inet->inet_daddr != saddr)
return -1;
@@ -464,65 +461,30 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
__be16 sport, __be32 daddr, __be16 dport, int dif,
int sdif, struct udp_table *udptable, struct sk_buff *skb)
 {
-   struct sock *sk, *result;
+   struct sock *result;
unsigned short hnum = ntohs(dport);
-   unsigned int hash2, slot2, slot = udp_hashfn(net, hnum, udptable->mask);
-   struct udp_hslot *hslot2, *hslot = &udptable->hash[slot];
+   unsigned int hash2, slot2;
+   struct udp_hslot *hslot2;
bool exact_dif = udp_lib_exact_dif_match(net, skb);
-   int score, badness;
-   u32 hash = 0;
 
-   if (hslot->count > 10) {
-   hash2 = ipv4_portaddr_hash(net, daddr, hnum);
+   hash2 = ipv4_portaddr_hash(net, daddr, hnum);
+   slot2 = hash2 & udptable->mask;
+   hslot2 = &udptable->hash2[slot2];
+
+   result = udp4_lib_lookup2(net, saddr, sport,
+ daddr, hnum, dif, sdif,
+ exact_dif, hslot2, skb);
+   if (!result) {
+   hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
slot2 = hash2 & udptable->mask;
hslot2 = &udptable->hash2[slot2];
-   if (hslot->count < hslot2->count)
-   goto begin;
 
result = udp4_lib_lookup2(net, saddr, sport,
- daddr, hnum, dif, sdif,
+ htonl(INADDR_ANY), hnum, dif, sdif,
  exact_dif, hslot2, skb);
-   if (!result) {
-   unsigned int old_slot2 = slot2;
-   hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
-   slot2 = hash2 & udptable->mask;
-   /* avoid searching the same slot again. */
-   if (unlikely(slot2 == old_slot2))
-   return result;
-
-   hslot2 = &udptable->hash2[slot2];
-   if (hslot->count < hslot2->count)
-   goto begin;
-
-   result = udp4_lib_lookup2(net, saddr, sport,
- daddr, hnum, dif, sdif,
- exact_dif, hslot2, skb);
-   }
-   if (unlikely(IS_ERR(result)))
-   return NULL;
-   return result;
-   }
-begin:
-   result = NULL;
-   badness = 0;
-   sk_for_each_rcu(sk, &hslot->head) {
-   score = compute_score(sk, net, saddr, sport,
- daddr, hnum, dif, sdif, exact_dif);
-   if (score > badness) {
-   if (sk->sk_reuseport) {
-   hash = udp_ehashfn(net, daddr, hnum,
-  saddr, sport);
-   result = reuseport_select_sock(sk, hash, skb,
- 

[PATCH net-next 5/5] selftests: net: test that listening sockets match on address properly

2018-12-12 Thread Peter Oskolkov
This patch adds a selftest that verifies that a socket listening
on a specific address is chosen in preference over sockets
that listen on any address. The test covers UDP/UDP6/TCP/TCP6.

It is based on, and similar to, reuseport_dualstack.c selftest.

Signed-off-by: Peter Oskolkov 
---
 tools/testing/selftests/net/.gitignore|   1 +
 tools/testing/selftests/net/Makefile  |   4 +-
 .../selftests/net/reuseport_addr_any.c| 264 ++
 .../selftests/net/reuseport_addr_any.sh   |   4 +
 4 files changed, 271 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/net/reuseport_addr_any.c
 create mode 100755 tools/testing/selftests/net/reuseport_addr_any.sh

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 7f57b916e6b22..6f81130605d7d 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -3,6 +3,7 @@ socket
 psock_fanout
 psock_snd
 psock_tpacket
+reuseport_addr_any
 reuseport_bpf
 reuseport_bpf_cpu
 reuseport_bpf_numa
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index ee2e27b1cd0d3..aeecc3ef53d02 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,10 +7,10 @@ CFLAGS += -I../../../../usr/include/
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetlink.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
-TEST_PROGS += udpgro_bench.sh udpgro.sh test_vxlan_under_vrf.sh
+TEST_PROGS += udpgro_bench.sh udpgro.sh test_vxlan_under_vrf.sh reuseport_addr_any.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
-TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
+TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
 TEST_GEN_FILES += tcp_mmap tcp_inq psock_snd txring_overwrite
 TEST_GEN_FILES += udpgso udpgso_bench_tx udpgso_bench_rx ip_defrag
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
diff --git a/tools/testing/selftests/net/reuseport_addr_any.c b/tools/testing/selftests/net/reuseport_addr_any.c
new file mode 100644
index 0..f5e01d989519d
--- /dev/null
+++ b/tools/testing/selftests/net/reuseport_addr_any.c
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Test that sockets listening on a specific address are preferred
+ * over sockets listening on addr_any.
+ */
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <error.h>
+#include <fcntl.h>
+#include <linux/in.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/epoll.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+static const char *IP4_ADDR = "127.0.0.1";
+static const char *IP6_ADDR = "::1";
+static const char *IP4_MAPPED6 = "::ffff:127.0.0.1";
+
+static const int PORT = ;
+
+static void build_rcv_fd(int family, int proto, int *rcv_fds, int count,
+const char *addr_str)
+{
+   struct sockaddr_in  addr4 = {0};
+   struct sockaddr_in6 addr6 = {0};
+   struct sockaddr *addr;
+   int opt, i, sz;
+
+   memset(&addr, 0, sizeof(addr));
+
+   switch (family) {
+   case AF_INET:
+   addr4.sin_family = family;
+   if (!addr_str)
+   addr4.sin_addr.s_addr = htonl(INADDR_ANY);
+   else if (!inet_pton(family, addr_str, &addr4.sin_addr.s_addr))
+   error(1, errno, "inet_pton failed: %s", addr_str);
+   addr4.sin_port = htons(PORT);
+   sz = sizeof(addr4);
+   addr = (struct sockaddr *)&addr4;
+   break;
+   case AF_INET6:
+   addr6.sin6_family = AF_INET6;
+   if (!addr_str)
+   addr6.sin6_addr = in6addr_any;
+   else if (!inet_pton(family, addr_str, &addr6.sin6_addr))
+   error(1, errno, "inet_pton failed: %s", addr_str);
+   addr6.sin6_port = htons(PORT);
+   sz = sizeof(addr6);
+   addr = (struct sockaddr *)&addr6;
+   break;
+   default:
+   error(1, 0, "Unsupported family %d", family);
+   }
+
+   for (i = 0; i < count; ++i) {
+   rcv_fds[i] = socket(family, proto, 0);
+   if (rcv_fds[i] < 0)
+   error(1, errno, "failed to create receive socket");
+
+   opt = 1;
+   if (setsockopt(rcv_fds[i], SOL_SOCKET, SO_REUSEPORT, &opt,
+  sizeof(opt)))
+   error(1, errno, "failed to set SO_REUSEPORT");
+
+   if (bind(rcv_fds[i], addr, sz))
+   error(1, errno, "failed to bind receive socket");
+
+   if (proto == SOCK_STREAM && listen(rcv_fds[i], 1

[PATCH net-next 3/5] net: tcp: prefer listeners bound to an address

2018-12-12 Thread Peter Oskolkov
A relatively common use case is to have several IPs configured
on a host, and have different listeners for each of them. We would
like to add a "catch all" listener on addr_any, to match incoming
connections not served by any of the listeners bound to a specific
address.

However, port-only lookups can match addr_any sockets when sockets
listening on specific addresses are present if so_reuseport flag
is set. This patch eliminates lookups into port-only hashtable,
as lookups by (addr,port) tuple are easily available.

In addition, compute_score() is tweaked to _not_ match
addr_any sockets to specific addresses, as hash collisions
could result in the unwanted behavior described above.

Tested: the patch compiles; full test in the last patch in this
patchset. Existing reuseport_* selftests also pass.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
---
 net/ipv4/inet_hashtables.c | 60 +-
 1 file changed, 8 insertions(+), 52 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 13890d5bfc340..cd03ab42705b4 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -234,24 +234,16 @@ static inline int compute_score(struct sock *sk, struct net *net,
const int dif, const int sdif, bool exact_dif)
 {
int score = -1;
-   struct inet_sock *inet = inet_sk(sk);
-   bool dev_match;
 
-   if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
+   if (net_eq(sock_net(sk), net) && sk->sk_num == hnum &&
!ipv6_only_sock(sk)) {
-   __be32 rcv_saddr = inet->inet_rcv_saddr;
-   score = sk->sk_family == PF_INET ? 2 : 1;
-   if (rcv_saddr) {
-   if (rcv_saddr != daddr)
-   return -1;
-   score += 4;
-   }
-   dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
-dif, sdif);
-   if (!dev_match)
+   if (sk->sk_rcv_saddr != daddr)
+   return -1;
+
+   if (!inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
return -1;
-   score += 4;
 
+   score = sk->sk_family == PF_INET ? 2 : 1;
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
}
@@ -307,26 +299,12 @@ struct sock *__inet_lookup_listener(struct net *net,
const __be32 daddr, const unsigned short hnum,
const int dif, const int sdif)
 {
-   unsigned int hash = inet_lhashfn(net, hnum);
-   struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
-   bool exact_dif = inet_exact_dif_match(net, skb);
struct inet_listen_hashbucket *ilb2;
-   struct sock *sk, *result = NULL;
-   int score, hiscore = 0;
+   struct sock *result = NULL;
unsigned int hash2;
-   u32 phash = 0;
-
-   if (ilb->count <= 10 || !hashinfo->lhash2)
-   goto port_lookup;
-
-   /* Too many sk in the ilb bucket (which is hashed by port alone).
-* Try lhash2 (which is hashed by port and addr) instead.
-*/
 
hash2 = ipv4_portaddr_hash(net, daddr, hnum);
ilb2 = inet_lhash2_bucket(hashinfo, hash2);
-   if (ilb2->count > ilb->count)
-   goto port_lookup;
 
result = inet_lhash2_lookup(net, ilb2, skb, doff,
saddr, sport, daddr, hnum,
@@ -335,34 +313,12 @@ struct sock *__inet_lookup_listener(struct net *net,
goto done;
 
/* Lookup lhash2 with INADDR_ANY */
-
hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
ilb2 = inet_lhash2_bucket(hashinfo, hash2);
-   if (ilb2->count > ilb->count)
-   goto port_lookup;
 
result = inet_lhash2_lookup(net, ilb2, skb, doff,
-   saddr, sport, daddr, hnum,
+   saddr, sport, htonl(INADDR_ANY), hnum,
dif, sdif);
-   goto done;
-
-port_lookup:
-   sk_for_each_rcu(sk, &ilb->head) {
-   score = compute_score(sk, net, hnum, daddr,
- dif, sdif, exact_dif);
-   if (score > hiscore) {
-   if (sk->sk_reuseport) {
-   phash = inet_ehashfn(net, daddr, hnum,
-saddr, sport);
-   result = reuseport_select_sock(sk, phash,
-  skb, doff);
-   if (result)
-

[PATCH net-next 2/5] net: udp6: prefer listeners bound to an address

2018-12-12 Thread Peter Oskolkov
A relatively common use case is to have several IPs configured
on a host, and have different listeners for each of them. We would
like to add a "catch all" listener on addr_any, to match incoming
connections not served by any of the listeners bound to a specific
address.

However, port-only lookups can match addr_any sockets when sockets
listening on specific addresses are present if so_reuseport flag
is set. This patch eliminates lookups into port-only hashtable,
as lookups by (addr,port) tuple are easily available.

In addition, compute_score() is tweaked to _not_ match
addr_any sockets to specific addresses, as hash collisions
could result in the unwanted behavior described above.

Tested: the patch compiles; full test in the last patch in this
patchset. Existing reuseport_* selftests also pass.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
---
 net/ipv6/udp.c | 79 ++
 1 file changed, 21 insertions(+), 58 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 09cba4cfe31ff..9cbf363172bdc 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -125,6 +125,9 @@ static int compute_score(struct sock *sk, struct net *net,
sk->sk_family != PF_INET6)
return -1;
 
+   if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
+   return -1;
+
score = 0;
inet = inet_sk(sk);
 
@@ -134,12 +137,6 @@ static int compute_score(struct sock *sk, struct net *net,
score++;
}
 
-   if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-   if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
-   return -1;
-   score++;
-   }
-
if (!ipv6_addr_any(&sk->sk_v6_daddr)) {
if (!ipv6_addr_equal(&sk->sk_v6_daddr, saddr))
return -1;
@@ -197,66 +194,32 @@ struct sock *__udp6_lib_lookup(struct net *net,
   int dif, int sdif, struct udp_table *udptable,
   struct sk_buff *skb)
 {
-   struct sock *sk, *result;
unsigned short hnum = ntohs(dport);
-   unsigned int hash2, slot2, slot = udp_hashfn(net, hnum, udptable->mask);
-   struct udp_hslot *hslot2, *hslot = &udptable->hash[slot];
+   unsigned int hash2, slot2;
+   struct udp_hslot *hslot2;
+   struct sock *result;
bool exact_dif = udp6_lib_exact_dif_match(net, skb);
-   int score, badness;
-   u32 hash = 0;
 
-   if (hslot->count > 10) {
-   hash2 = ipv6_portaddr_hash(net, daddr, hnum);
+   hash2 = ipv6_portaddr_hash(net, daddr, hnum);
+   slot2 = hash2 & udptable->mask;
+   hslot2 = &udptable->hash2[slot2];
+
+   result = udp6_lib_lookup2(net, saddr, sport,
+ daddr, hnum, dif, sdif, exact_dif,
+ hslot2, skb);
+   if (!result) {
+   hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
slot2 = hash2 & udptable->mask;
+
hslot2 = &udptable->hash2[slot2];
-   if (hslot->count < hslot2->count)
-   goto begin;
 
result = udp6_lib_lookup2(net, saddr, sport,
- daddr, hnum, dif, sdif, exact_dif,
- hslot2, skb);
-   if (!result) {
-   unsigned int old_slot2 = slot2;
-   hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
-   slot2 = hash2 & udptable->mask;
-   /* avoid searching the same slot again. */
-   if (unlikely(slot2 == old_slot2))
-   return result;
-
-   hslot2 = &udptable->hash2[slot2];
-   if (hslot->count < hslot2->count)
-   goto begin;
-
-   result = udp6_lib_lookup2(net, saddr, sport,
- daddr, hnum, dif, sdif,
- exact_dif, hslot2,
- skb);
-   }
-   if (unlikely(IS_ERR(result)))
-   return NULL;
-   return result;
-   }
-begin:
-   result = NULL;
-   badness = -1;
-   sk_for_each_rcu(sk, &hslot->head) {
-   score = compute_score(sk, net, saddr, sport, daddr, hnum, dif,
- sdif, exact_dif);
-   if (score > badness) {
-   if (sk->sk_reuseport) {
-   hash = udp6_ehashfn(net, daddr, hnum,
-   saddr, sport);
-   result = reusep

[PATCH net-next 4/5] net: tcp6: prefer listeners bound to an address

2018-12-12 Thread Peter Oskolkov
A relatively common use case is to have several IPs configured
on a host, and have different listeners for each of them. We would
like to add a "catch all" listener on addr_any, to match incoming
connections not served by any of the listeners bound to a specific
address.

However, port-only lookups can match addr_any sockets when sockets
listening on specific addresses are present if so_reuseport flag
is set. This patch eliminates lookups into port-only hashtable,
as lookups by (addr,port) tuple are easily available.

In addition, compute_score() is tweaked to _not_ match
addr_any sockets to specific addresses, as hash collisions
could result in the unwanted behavior described above.

Tested: the patch compiles; full test in the last patch in this
patchset. Existing reuseport_* selftests also pass.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
---
 net/ipv6/inet6_hashtables.c | 54 +
 1 file changed, 6 insertions(+), 48 deletions(-)

diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 5eeeba7181a1b..f3515ebe9b3a7 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -99,23 +99,16 @@ static inline int compute_score(struct sock *sk, struct net *net,
const int dif, const int sdif, bool exact_dif)
 {
int score = -1;
-   bool dev_match;
 
if (net_eq(sock_net(sk), net) && inet_sk(sk)->inet_num == hnum &&
sk->sk_family == PF_INET6) {
+   if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
+   return -1;
 
-   score = 1;
-   if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-   if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
-   return -1;
-   score++;
-   }
-   dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
-dif, sdif);
-   if (!dev_match)
+   if (!inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
return -1;
-   score++;
 
+   score = 1;
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
}
@@ -164,26 +157,12 @@ struct sock *inet6_lookup_listener(struct net *net,
const __be16 sport, const struct in6_addr *daddr,
const unsigned short hnum, const int dif, const int sdif)
 {
-   unsigned int hash = inet_lhashfn(net, hnum);
-   struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
-   bool exact_dif = inet6_exact_dif_match(net, skb);
struct inet_listen_hashbucket *ilb2;
-   struct sock *sk, *result = NULL;
-   int score, hiscore = 0;
+   struct sock *result = NULL;
unsigned int hash2;
-   u32 phash = 0;
-
-   if (ilb->count <= 10 || !hashinfo->lhash2)
-   goto port_lookup;
-
-   /* Too many sk in the ilb bucket (which is hashed by port alone).
-* Try lhash2 (which is hashed by port and addr) instead.
-*/
 
hash2 = ipv6_portaddr_hash(net, daddr, hnum);
ilb2 = inet_lhash2_bucket(hashinfo, hash2);
-   if (ilb2->count > ilb->count)
-   goto port_lookup;
 
result = inet6_lhash2_lookup(net, ilb2, skb, doff,
 saddr, sport, daddr, hnum,
@@ -192,33 +171,12 @@ struct sock *inet6_lookup_listener(struct net *net,
goto done;
 
/* Lookup lhash2 with in6addr_any */
-
hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
ilb2 = inet_lhash2_bucket(hashinfo, hash2);
-   if (ilb2->count > ilb->count)
-   goto port_lookup;
 
result = inet6_lhash2_lookup(net, ilb2, skb, doff,
-saddr, sport, daddr, hnum,
+saddr, sport, &in6addr_any, hnum,
 dif, sdif);
-   goto done;
-
-port_lookup:
-   sk_for_each(sk, &ilb->head) {
-   score = compute_score(sk, net, hnum, daddr, dif, sdif, exact_dif);
-   if (score > hiscore) {
-   if (sk->sk_reuseport) {
-   phash = inet6_ehashfn(net, daddr, hnum,
- saddr, sport);
-   result = reuseport_select_sock(sk, phash,
-  skb, doff);
-   if (result)
-   goto done;
-   }
-   result = sk;
-   hiscore = score;
-   }
-   }
 done:
if (unlikely(IS_ERR(result)))
return NULL;
-- 
2.20.0.rc2.403.gdbc3b29805-goog



[PATCH bpf-next 2/2] selftests/bpf: add test_lwt_ip_encap selftest

2018-11-28 Thread Peter Oskolkov
This patch adds a sample/selftest that covers BPF_LWT_ENCAP_IP option
added in the first patch in the series.

Signed-off-by: Peter Oskolkov 
---
 tools/testing/selftests/bpf/Makefile  |   5 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  65 ++
 .../selftests/bpf/test_lwt_ip_encap.sh| 114 ++
 3 files changed, 182 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 73aa6d8f4a2f..044fcdbc9864 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -39,7 +39,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o \
test_sk_lookup_kern.o test_xdp_vlan.o test_queue_map.o test_stack_map.o \
-   xdp_dummy.o test_map_in_map.o
+   xdp_dummy.o test_map_in_map.o test_lwt_ip_encap.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -53,7 +53,8 @@ TEST_PROGS := test_kmod.sh \
test_lirc_mode2.sh \
test_skb_cgroup_id.sh \
test_flow_dissector.sh \
-   test_xdp_vlan.sh
+   test_xdp_vlan.sh \
+   test_lwt_ip_encap.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh
 
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.c b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
new file mode 100644
index ..967db922dcc6
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <string.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define BPF_LWT_ENCAP_IP 2
+
+struct iphdr {
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+   __u8 ihl:4,
+   version:4;
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+   __u8 version:4,
+   ihl:4;
+#else
+#error "Fix your compiler's __BYTE_ORDER__?!"
+#endif
+   __u8 tos;
+   __be16  tot_len;
+   __be16  id;
+   __be16  frag_off;
+   __u8 ttl;
+   __u8 protocol;
+   __sum16 check;
+   __be32  saddr;
+   __be32  daddr;
+};
+
+struct grehdr {
+   __be16 flags;
+   __be16 protocol;
+};
+
+SEC("encap_gre")
+int bpf_lwt_encap_gre(struct __sk_buff *skb)
+{
+   char encap_header[24];
+   int err;
+   struct iphdr *iphdr = (struct iphdr *)encap_header;
+   struct grehdr *greh = (struct grehdr *)(encap_header + sizeof(struct iphdr));
+
+   memset(encap_header, 0, sizeof(encap_header));
+
+   iphdr->ihl = 5;
+   iphdr->version = 4;
+   iphdr->tos = 0;
+   iphdr->ttl = 0x40;
+   iphdr->protocol = 47;  /* IPPROTO_GRE */
+   iphdr->saddr = 0x640110ac;  /* 172.16.1.100 */
+   iphdr->daddr = 0x640310ac;  /* 172.16.3.100 */
+   iphdr->check = 0;
+   iphdr->tot_len = bpf_htons(skb->len + sizeof(encap_header));
+
+   greh->protocol = bpf_htons(0x800);
+
+   err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, (void *)encap_header,
+sizeof(encap_header));
+   if (err)
+   return BPF_DROP;
+
+   return BPF_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
new file mode 100755
index ..4c32b754bf96
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Setup:
+# - create VETH1/VETH2 veth
+# - VETH1 gets IP_SRC
+# - create netns NS
+# - move VETH2 to NS, add IP_DST
+# - in NS, create gre tunnel GREDEV, add IP_GRE
+# - in NS, configure GREDEV to route to IP_DST from IP_SRC
+# - configure route to IP_GRE via VETH1
+#   (note: there is no route to IP_DST from root/init ns)
+#
+# Test:
+# - listen on IP_DST
+# - send a packet to IP_DST: the listener does not get it
+# - add LWT_XMIT bpf to IP_DST that gre-encaps all packets to IP_GRE
+# - send a packet to IP_DST: the listener gets it
+
+
+# set -x  # debug ON
+set +x  # debug OFF
+set -e  # exit on error
+
+if [[ $EUID -ne 0 ]]; then
+   echo "This script must be run as root"
+   echo "FAIL"
+   exit 1
+fi
+
+readonly NS="ns-ip-encap-$(mktemp -u XX)"
+readonly OUT=$(mktemp /tmp/test_lwt_ip_incap.XX)
+
+readonly NET_SRC="172.16.1.0"
+
+readonly IP_SRC="172.16.1.100"
+readonly IP_DST="172.16.2.100"
+readonly IP_GRE="172.16.3.100"
+
+readonly PORT=
+readonly MSG="foo_bar"
+
+PID1=0
+PID2=0
+
+setup() {
+   ip link add veth1 

[PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-28 Thread Peter Oskolkov
This patch enables BPF programs (specifically, of LWT_XMIT type)
to add IP encapsulation headers to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with a different, dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

A future patch will enable IPv6 encapping (and IPv4/IPv6 cross-routing).

Tested: see the second patch in the series.

Signed-off-by: Peter Oskolkov 
---
 include/net/lwtunnel.h   |  2 ++
 include/uapi/linux/bpf.h |  7 -
 net/core/filter.c| 58 
 3 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 33fd9ba7e0e5..6a1c5c2f16d5 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -16,6 +16,8 @@
 #define LWTUNNEL_STATE_INPUT_REDIRECT  BIT(1)
 #define LWTUNNEL_STATE_XMIT_REDIRECT   BIT(2)
 
+#define LWTUNNEL_MAX_ENCAP_HSIZE   80
+
 enum {
LWTUNNEL_XMIT_DONE,
LWTUNNEL_XMIT_CONTINUE,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 597afdbc1ab9..6f2efe2dca9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1998,6 +1998,10 @@ union bpf_attr {
  * Only works if *skb* contains an IPv6 packet. Insert a
  * Segment Routing Header (**struct ipv6_sr_hdr**) inside
  * the IPv6 header.
+ * **BPF_LWT_ENCAP_IP**
+ * IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ * must be IPv4, followed by zero, one, or more additional
+ * headers.
  *
  * A call to this helper is susceptible to change the underlaying
  * packet buffer. Therefore, at load time, all checks on pointers
@@ -2444,7 +2448,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
BPF_LWT_ENCAP_SEG6,
-   BPF_LWT_ENCAP_SEG6_INLINE
+   BPF_LWT_ENCAP_SEG6_INLINE,
+   BPF_LWT_ENCAP_IP,
 };
 
 /* user accessible mirror of in-kernel sk_buff.
diff --git a/net/core/filter.c b/net/core/filter.c
index bd0df75dc7b6..17f3c37218e5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 }
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
+static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
+{
+   struct dst_entry *dst;
+   struct rtable *rt;
+   struct iphdr *iph;
+   struct net *net;
+   int err;
+
+   if (skb->protocol != htons(ETH_P_IP))
+   return -EINVAL;  /* ETH_P_IPV6 not yet supported */
+
+   iph = (struct iphdr *)hdr;
+
+   if (unlikely(len < sizeof(struct iphdr) || len > LWTUNNEL_MAX_ENCAP_HSIZE))
+   return -EINVAL;
+   if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
+   return -EINVAL;
+
+   if (skb->sk)
+   net = sock_net(skb->sk);
+   else {
+   net = dev_net(skb_dst(skb)->dev);
+   }
+   rt = ip_route_output(net, iph->daddr, 0, 0, 0);
+   if (IS_ERR(rt) || rt->dst.error)
+   return -EINVAL;
+   dst = &rt->dst;
+
+   skb_reset_inner_headers(skb);
+   skb->encapsulation = 1;
+
+   err = skb_cow_head(skb, len + LL_RESERVED_SPACE(dst->dev));
+   if (unlikely(err))
+   return err;
+
+   skb_push(skb, len);
+   skb_reset_network_header(skb);
+
+   iph = ip_hdr(skb);
+   memcpy(iph, hdr, len);
+
+   bpf_compute_data_pointers(skb);
+   if (iph->ihl * 4 < len)
+   skb_set_transport_header(skb, iph->ihl * 4);
+   skb->protocol = htons(ETH_P_IP);
+   if (!iph->check)
+   iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
+
+   skb_dst_drop(skb);
+   dst_hold(dst);
+   skb_dst_set(skb, dst);
+   return 0;
+}
+
 BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
   u32, len)
 {
@@ -4802,6 +4856,8 @@ BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
case BPF_LWT_ENCAP_SEG6_INLINE:
return bpf_push_seg6_encap(skb, type, hdr, len);
 #endif
+   case BPF_LWT_ENCAP_IP:
+   return bpf_push_ip_encap(skb, hdr, len);
default:
return -EINVAL;
}
@@ -5687,6 +5743,8 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_l4_csum_replace_proto;
case BPF_FUNC_set_hash_invalid:
return &bpf_set_hash_invalid_proto;
+   case BPF_FUNC_lwt_push_encap:

Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-28 Thread Peter Oskolkov
On Wed, Nov 28, 2018 at 4:47 PM David Ahern  wrote:
>
> On 11/28/18 5:22 PM, Peter Oskolkov wrote:
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index bd0df75dc7b6..17f3c37218e5 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
> >  }
> >  #endif /* CONFIG_IPV6_SEG6_BPF */
> >
> > +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
> > +{
> > + struct dst_entry *dst;
> > + struct rtable *rt;
> > + struct iphdr *iph;
> > + struct net *net;
> > + int err;
> > +
> > + if (skb->protocol != htons(ETH_P_IP))
> > + return -EINVAL;  /* ETH_P_IPV6 not yet supported */
> > +
> > + iph = (struct iphdr *)hdr;
> > +
> > + if (unlikely(len < sizeof(struct iphdr) || len > LWTUNNEL_MAX_ENCAP_HSIZE))
> > + return -EINVAL;
> > + if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
> > + return -EINVAL;
> > +
> > + if (skb->sk)
> > + net = sock_net(skb->sk);
> > + else {
> > + net = dev_net(skb_dst(skb)->dev);
> > + }
> > + rt = ip_route_output(net, iph->daddr, 0, 0, 0);
>
> That is a very limited use case. e.g., oif = 0 means you are not
> considering any kind of policy routing (e.g., VRF).

Hi David! Could you be a bit more specific re: what you would like to
see here? Thanks!


[PATCH net] net/ipv6: do not copy DST_NOCOUNT flag on rt init

2018-09-13 Thread Peter Oskolkov
DST_NOCOUNT in dst_entry::flags tracks whether the entry counts
toward route cache size (net->ipv6.sysctl.ip6_rt_max_size).

If the flag is NOT set, dst_ops::pcpuc_entries counter is incremented
in dst_init() and decremented in dst_destroy().

This flag is tied to the allocation/deallocation of a dst_entry and
should not be copied from another dst/route: an entry counted at init
(DST_NOCOUNT not set there) that later gets DST_NOCOUNT copied in is
never decremented on destroy, so dst_ops::pcpuc_entries grows until it
reaches ip6_rt_max_size and no new routes can be allocated.
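
For context, the accounting being described, condensed from the generic
dst code (a simplified sketch, not part of this patch):

	/* dst_init(): only entries allocated without DST_NOCOUNT
	 * are counted.
	 */
	if (!(flags & DST_NOCOUNT))
		dst_entries_add(ops, 1);

	/* dst_destroy(): the matching decrement is skipped when
	 * DST_NOCOUNT is set. Copying DST_NOCOUNT onto an rt that was
	 * counted at init means this decrement never runs, so
	 * pcpuc_entries only ever grows.
	 */
	if (!(dst->flags & DST_NOCOUNT))
		dst_entries_add(dst->ops, -1);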

Fixes: 3b6761d18bc1 ("net/ipv6: Move dst flags to booleans in fib entries")
Cc: David Ahern 
Acked-by: Wei Wang 
Signed-off-by: Peter Oskolkov 
---
 net/ipv6/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3eed045c65a5..a3902f805305 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -946,7 +946,7 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, struct fib6_info *ort)
 
 static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
 {
-   rt->dst.flags |= fib6_info_dst_flags(ort);
+   rt->dst.flags |= fib6_info_dst_flags(ort) & ~DST_NOCOUNT;
 
if (ort->fib6_flags & RTF_REJECT) {
ip6_rt_init_dst_reject(rt, ort);
-- 
2.19.0.397.gdd90340f6a-goog



Re: [PATCH net] net/ipv6: do not copy DST_NOCOUNT flag on rt init

2018-09-17 Thread Peter Oskolkov
On Thu, Sep 13, 2018 at 9:11 PM David Ahern  wrote:
>
> On 9/13/18 1:38 PM, Peter Oskolkov wrote:
>
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index 3eed045c65a5..a3902f805305 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -946,7 +946,7 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, struct fib6_info *ort)
> >
> >  static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
> >  {
> > - rt->dst.flags |= fib6_info_dst_flags(ort);
> > + rt->dst.flags |= fib6_info_dst_flags(ort) & ~DST_NOCOUNT;
>
> I think my mistake is setting dst.flags in ip6_rt_init_dst. Flags
> argument is passed to ip6_dst_alloc which is always invoked before
> ip6_rt_copy_init is called which is the only caller of ip6_rt_init_dst.

ip6_rt_cache_alloc calls ip6_dst_alloc with zero as flags; and only
one flag is copied later (DST_HOST) outside of ip6_rt_init_dst().
If the flag assignment is completely removed from ip6_rt_init_dst(),
then DST_NOPOLICY flag will be lost.

Which may be OK, but is more than what this patch tries to solve (do not
copy DST_NOCOUNT flag).

>
> >
> >   if (ort->fib6_flags & RTF_REJECT) {
> >   ip6_rt_init_dst_reject(rt, ort);
> >
>


[Patch net v2] net/ipv6: do not copy dst flags on rt init

2018-09-17 Thread Peter Oskolkov
DST_NOCOUNT in dst_entry::flags tracks whether the entry counts
toward route cache size (net->ipv6.sysctl.ip6_rt_max_size).

If the flag is NOT set, dst_ops::pcpuc_entries counter is incremented
in dst_init() and decremented in dst_destroy().

This flag is tied to the allocation/deallocation of a dst_entry and
should not be copied from another dst/route: an entry counted at init
(DST_NOCOUNT not set there) that later gets DST_NOCOUNT copied in is
never decremented on destroy, so dst_ops::pcpuc_entries grows until it
reaches ip6_rt_max_size and no new routes can be allocated.

Fixes: 3b6761d18bc1 ("net/ipv6: Move dst flags to booleans in fib entries")
Cc: David Ahern 
Acked-by: Wei Wang 
Signed-off-by: Peter Oskolkov 
---
 net/ipv6/route.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3eed045c65a5..480a79f47c52 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -946,8 +946,6 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, 
struct fib6_info *ort)
 
 static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
 {
-   rt->dst.flags |= fib6_info_dst_flags(ort);
-
if (ort->fib6_flags & RTF_REJECT) {
ip6_rt_init_dst_reject(rt, ort);
return;
-- 
2.19.0.397.gdd90340f6a-goog



Re: [PATCH net] net/ipv6: do not copy DST_NOCOUNT flag on rt init

2018-09-17 Thread Peter Oskolkov
On Mon, Sep 17, 2018 at 9:59 AM David Ahern  wrote:
>
> On 9/17/18 9:11 AM, Peter Oskolkov wrote:
> > On Thu, Sep 13, 2018 at 9:11 PM David Ahern  wrote:
> >>
> >> On 9/13/18 1:38 PM, Peter Oskolkov wrote:
> >>
> >>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> >>> index 3eed045c65a5..a3902f805305 100644
> >>> --- a/net/ipv6/route.c
> >>> +++ b/net/ipv6/route.c
> >>> @@ -946,7 +946,7 @@ static void ip6_rt_init_dst_reject(struct rt6_info 
> >>> *rt, struct fib6_info *ort)
> >>>
> >>>  static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
> >>>  {
> >>> - rt->dst.flags |= fib6_info_dst_flags(ort);
> >>> + rt->dst.flags |= fib6_info_dst_flags(ort) & ~DST_NOCOUNT;
> >>
> >> I think my mistake is setting dst.flags in ip6_rt_init_dst. Flags
> >> argument is passed to ip6_dst_alloc which is always invoked before
> >> ip6_rt_copy_init is called which is the only caller of ip6_rt_init_dst.
> >
> > ip6_rt_cache_alloc calls ip6_dst_alloc with zero as flags; and only
> > one flag is copied later (DST_HOST) outside of ip6_rt_init_dst().
> > If the flag assignment is completely removed from ip6_rt_init_dst(),
> > then DST_NOPOLICY flag will be lost.
> >
> > Which may be OK, but is more than what this patch tries to solve (do not
> > copy DST_NOCOUNT flag).
>
> In the 4.17 kernel (prior to the fib6_info change), ip6_rt_cache_alloc
> calls __ip6_dst_alloc with 0 for flags so this is correct. The mistake
> is ip6_rt_copy_init -> ip6_rt_init_dst -> fib6_info_dst_flags.
>
> I believe the right fix is to drop the 'rt->dst.flags |=
> fib6_info_dst_flags(ort);' from ip6_rt_init_dst.

OK, I sent a v2 with the assignment removed. Thanks for the review!


[PATCH net-next 2/3] net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net

2018-09-21 Thread Peter Oskolkov
Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
hard-limited to those of the root/init ns.

There are at least two use cases when it would be desirable to
set the high_thresh values higher in a child namespace vs the global hard
limit:

- a security/ddos protection policy may lower the thresholds in the
  root/init ns but allow for a special exception in a child namespace
- testing: a test running in a namespace may want to set these
  thresholds higher in its namespace than what is in the root/init ns

The new behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
 net.ipv6.ip6frag_high_thresh = 9000000

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 9000000

The old behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 4194304

 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
 net.ipv6.ip6frag_high_thresh = 9000000

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 4194304

Signed-off-by: Peter Oskolkov 
---
 net/ieee802154/6lowpan/reassembly.c | 1 -
 net/ipv4/ip_fragment.c  | 1 -
 net/ipv6/reassembly.c   | 1 -
 3 files changed, 3 deletions(-)

diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index 09ffbf5ce8fa..d14226ecfde4 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -463,7 +463,6 @@ static int __net_init 
lowpan_frags_ns_sysctl_register(struct net *net)
 
table[0].data = &ieee802154_lowpan->frags.high_thresh;
table[0].extra1 = &ieee802154_lowpan->frags.low_thresh;
-   table[0].extra2 = &init_net.ieee802154_lowpan.frags.high_thresh;
table[1].data = &ieee802154_lowpan->frags.low_thresh;
table[1].extra2 = &ieee802154_lowpan->frags.high_thresh;
table[2].data = &ieee802154_lowpan->frags.timeout;
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 13f4d189e12b..9b0158fa431f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -822,7 +822,6 @@ static int __net_init ip4_frags_ns_ctl_register(struct net 
*net)
 
table[0].data = &net->ipv4.frags.high_thresh;
table[0].extra1 = &net->ipv4.frags.low_thresh;
-   table[0].extra2 = &init_net.ipv4.frags.high_thresh;
table[1].data = &net->ipv4.frags.low_thresh;
table[1].extra2 = &net->ipv4.frags.high_thresh;
table[2].data = &net->ipv4.frags.timeout;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 536c1d172cba..5c3c92713096 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -554,7 +554,6 @@ static int __net_init ip6_frags_ns_sysctl_register(struct 
net *net)
 
table[0].data = &net->ipv6.frags.high_thresh;
table[0].extra1 = &net->ipv6.frags.low_thresh;
-   table[0].extra2 = &init_net.ipv6.frags.high_thresh;
table[1].data = &net->ipv6.frags.low_thresh;
table[1].extra2 = &net->ipv6.frags.high_thresh;
table[2].data = &net->ipv6.frags.timeout;
-- 
2.19.0.444.g18242da7ef-goog



[PATCH net-next 1/3] ipv6: discard IP frag queue on more errors

2018-09-21 Thread Peter Oskolkov
This is similar to how ipv4 now behaves:
commit 0ff89efb5246 ("ip: fail fast on IP defrag errors").

Signed-off-by: Peter Oskolkov 
---
 net/ipv6/reassembly.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index f1b1ff30fe5b..536c1d172cba 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -145,7 +145,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
 */
if (end < fq->q.len ||
((fq->q.flags & INET_FRAG_LAST_IN) && end != fq->q.len))
-   goto err;
+   goto discard_fq;
fq->q.flags |= INET_FRAG_LAST_IN;
fq->q.len = end;
} else {
@@ -162,20 +162,20 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
if (end > fq->q.len) {
/* Some bits beyond end -> corruption. */
if (fq->q.flags & INET_FRAG_LAST_IN)
-   goto err;
+   goto discard_fq;
fq->q.len = end;
}
}
 
if (end == offset)
-   goto err;
+   goto discard_fq;
 
/* Point into the IP datagram 'data' part. */
if (!pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data))
-   goto err;
+   goto discard_fq;
 
if (pskb_trim_rcsum(skb, end - offset))
-   goto err;
+   goto discard_fq;
 
/* Find out which fragments are in front and at the back of us
 * in the chain of fragments so far.  We must know where to put
@@ -418,6 +418,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct 
sk_buff *prev,
rcu_read_lock();
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMFAILS);
rcu_read_unlock();
+   inet_frag_kill(&fq->q);
return -1;
 }
 
-- 
2.19.0.444.g18242da7ef-goog



[PATCH net-next 3/3] selftests/net: add ipv6 tests to ip_defrag selftest

2018-09-21 Thread Peter Oskolkov
This patch adds ipv6 defragmentation tests to ip_defrag selftest,
to complement existing ipv4 tests.

Signed-off-by: Peter Oskolkov 
---
 tools/testing/selftests/net/ip_defrag.c  | 249 +++
 tools/testing/selftests/net/ip_defrag.sh |  39 ++--
 2 files changed, 190 insertions(+), 98 deletions(-)

diff --git a/tools/testing/selftests/net/ip_defrag.c 
b/tools/testing/selftests/net/ip_defrag.c
index 55fdcdc78eef..2366dc6bce71 100644
--- a/tools/testing/selftests/net/ip_defrag.c
+++ b/tools/testing/selftests/net/ip_defrag.c
@@ -23,21 +23,28 @@ static bool cfg_overlap;
 static unsigned short  cfg_port = 9000;
 
 const struct in_addr addr4 = { .s_addr = __constant_htonl(INADDR_LOOPBACK + 2) 
};
+const struct in6_addr addr6 = IN6ADDR_LOOPBACK_INIT;
 
 #define IP4_HLEN   (sizeof(struct iphdr))
 #define IP6_HLEN   (sizeof(struct ip6_hdr))
 #define UDP_HLEN   (sizeof(struct udphdr))
 
-static int msg_len;
+/* IPv6 fragment header length. */
+#define FRAG_HLEN  8
+
+static int payload_len;
 static int max_frag_len;
 
#define MSG_LEN_MAX 60000   /* Max UDP payload length. */
 
 #define IP4_MF (1u << 13)  /* IPv4 MF flag. */
+#define IP6_MF (1)  /* IPv6 MF flag. */
+
+#define CSUM_MANGLED_0 (0xffff)
 
 static uint8_t udp_payload[MSG_LEN_MAX];
 static uint8_t ip_frame[IP_MAXPACKET];
-static uint16_t ip_id = 0xabcd;
+static uint32_t ip_id = 0xabcd;
 static int msg_counter;
 static int frag_counter;
 static unsigned int seed;
@@ -48,25 +55,25 @@ static void recv_validate_udp(int fd_udp)
ssize_t ret;
static uint8_t recv_buff[MSG_LEN_MAX];
 
-   ret = recv(fd_udp, recv_buff, msg_len, 0);
+   ret = recv(fd_udp, recv_buff, payload_len, 0);
msg_counter++;
 
if (cfg_overlap) {
if (ret != -1)
-   error(1, 0, "recv: expected timeout; got %d; seed = %u",
-   (int)ret, seed);
+   error(1, 0, "recv: expected timeout; got %d",
+   (int)ret);
if (errno != ETIMEDOUT && errno != EAGAIN)
-   error(1, errno, "recv: expected timeout: %d; seed = %u",
-errno, seed);
+   error(1, errno, "recv: expected timeout: %d",
+errno);
return;  /* OK */
}
 
if (ret == -1)
-   error(1, errno, "recv: msg_len = %d max_frag_len = %d",
-   msg_len, max_frag_len);
-   if (ret != msg_len)
-   error(1, 0, "recv: wrong size: %d vs %d", (int)ret, msg_len);
-   if (memcmp(udp_payload, recv_buff, msg_len))
+   error(1, errno, "recv: payload_len = %d max_frag_len = %d",
+   payload_len, max_frag_len);
+   if (ret != payload_len)
+   error(1, 0, "recv: wrong size: %d vs %d", (int)ret, 
payload_len);
+   if (memcmp(udp_payload, recv_buff, payload_len))
error(1, 0, "recv: wrong data");
 }
 
@@ -92,31 +99,95 @@ static uint32_t raw_checksum(uint8_t *buf, int len, 
uint32_t sum)
 static uint16_t udp_checksum(struct ip *iphdr, struct udphdr *udphdr)
 {
uint32_t sum = 0;
+   uint16_t res;
 
sum = raw_checksum((uint8_t *)&iphdr->ip_src, 2 * sizeof(iphdr->ip_src),
-   IPPROTO_UDP + (uint32_t)(UDP_HLEN + msg_len));
-   sum = raw_checksum((uint8_t *)udp_payload, msg_len, sum);
+   IPPROTO_UDP + (uint32_t)(UDP_HLEN + 
payload_len));
+   sum = raw_checksum((uint8_t *)udphdr, UDP_HLEN, sum);
+   sum = raw_checksum((uint8_t *)udp_payload, payload_len, sum);
+   res = 0xffff & ~sum;
+   if (res)
+   return htons(res);
+   else
+   return CSUM_MANGLED_0;
+}
+
+static uint16_t udp6_checksum(struct ip6_hdr *iphdr, struct udphdr *udphdr)
+{
+   uint32_t sum = 0;
+   uint16_t res;
+
+   sum = raw_checksum((uint8_t *)&iphdr->ip6_src, 2 * 
sizeof(iphdr->ip6_src),
+   IPPROTO_UDP);
+   sum = raw_checksum((uint8_t *)&udphdr->len, sizeof(udphdr->len), sum);
sum = raw_checksum((uint8_t *)udphdr, UDP_HLEN, sum);
-   return htons(0xffff & ~sum);
+   sum = raw_checksum((uint8_t *)udp_payload, payload_len, sum);
+   res = 0xffff & ~sum;
+   if (res)
+   return htons(res);
+   else
+   return CSUM_MANGLED_0;
 }
 
 static void send_fragment(int fd_raw, struct sockaddr *addr, socklen_t alen,
-   struct ip *iphdr, int offset)
+   int offset, bool ipv6)
 {
int frag_len;
int res;
+   int payload_offset = offset > 0 ? offset - UDP_HLEN : 0;
+   uint8_t *frag_start = ipv6 ? ip_frame + 

[PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.

2018-08-02 Thread Peter Oskolkov
This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Suggested-by: David S. Miller 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c| 75 ++-
 net/ipv4/proc.c   |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index e5ebc83827ab..da1a144f1a51 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -40,6 +40,7 @@ enum
IPSTATS_MIB_REASMREQDS, /* ReasmReqds */
IPSTATS_MIB_REASMOKS,   /* ReasmOKs */
IPSTATS_MIB_REASMFAILS, /* ReasmFails */
+   IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
IPSTATS_MIB_FRAGOKS,/* FragOKs */
IPSTATS_MIB_FRAGFAILS,  /* FragFails */
IPSTATS_MIB_FRAGCREATES,/* FragCreates */
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d14d741fb05e..960bf5eab59f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -277,6 +277,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+   struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct sk_buff *prev, *next;
struct net_device *dev;
unsigned int fragsize;
@@ -357,65 +358,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
}
 
 found:
-   /* We found where to put this one.  Check for overlap with
-* preceding fragment, and, if needed, align things so that
-* any overlaps are eliminated.
+   /* RFC5722, Section 4, amended by Errata ID : 3089
+*  When reassembling an IPv6 datagram, if
+*   one or more its constituent fragments is determined to be an
+*   overlapping fragment, the entire datagram (and any constituent
+*   fragments) MUST be silently discarded.
+*
+* We do the same here for IPv4.
 */
-   if (prev) {
-   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-   if (i > 0) {
-   offset += i;
-   err = -EINVAL;
-   if (end <= offset)
-   goto err;
-   err = -ENOMEM;
-   if (!pskb_pull(skb, i))
-   goto err;
-   if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-   skb->ip_summed = CHECKSUM_NONE;
-   }
-   }
+   /* Is there an overlap with the previous fragment? */
+   if (prev &&
+   (prev->ip_defrag_offset + prev->len) > offset)
+   goto discard_qp;
 
-   err = -ENOMEM;
-
-   while (next && next->ip_defrag_offset < end) {
-   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-   if (i < next->len) {
-   int delta = -next->truesize;
-
-   /* Eat head of the next overlapped fragment
-* and leave the loop. The next ones cannot overlap.
-*/
-   if (!pskb_pull(next, i))
-   goto err;
-   delta += next->truesize;
-   if (delta)
-   add_frag_mem_limit(qp->q.net, delta);
-   next->ip_defrag_offset += i;
-   qp->q.meat -= i;
-   if (next->ip_summed != CHECKSUM_UNNECESSARY)
-   next->ip_summed = CHECKSUM_NONE;
-   break;
-   } else {
-   struct sk_buff *free_it = next;
-
-   /* Old fragment is completely overridden with
-* new one drop it.
-*/
-   next = next->next;
-
-   if (prev)
-   prev->next = next;
-   else
-   qp->q.fragments = next;
-
-   qp->q.meat -= free_it->len;
-   sub_frag_mem_limit(qp->q.net, free_it->truesize);
-   kfree_skb(free_it);
-   }
-   }
+   /* Is there an overlap with the next fragment? */
+   if (next && next->ip_defrag_offset < end)
+   goto discard_qp;
 
/* Note : skb->ip_defrag_offset and skb->dev share the same location */

[PATCH net-next 0/3] ip: Use rb trees for IP frag queue.

2018-08-02 Thread Peter Oskolkov
This patchset
 * changes IPv4 defrag behavior to match that of IPv6: overlapping
   fragments now cause the whole IP datagram to be discarded (suggested
   by David Miller): there are no legitimate use cases for overlapping
   fragments;
 * changes IPv4 defrag queue from a list to a rb tree (suggested
   by Eric Dumazet): this change removes a potential attack vector.

Upcoming patches will contain similar changes for IPv6 frag queue,
as well as a comprehensive IP defrag self-test (temporarily delayed).

Peter Oskolkov (3):
  ip: discard IPv4 datagrams with overlapping segments.
  net: modify skb_rbtree_purge to return the truesize of all purged
skbs.
  ip: use rb trees for IP frag queue.

 include/linux/skbuff.h  |  11 +-
 include/net/inet_frag.h |   3 +-
 include/uapi/linux/snmp.h   |   1 +
 net/core/skbuff.c   |   6 +-
 net/ipv4/inet_fragment.c|  16 +-
 net/ipv4/ip_fragment.c  | 239 +++-
 net/ipv4/proc.c |   1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 9 files changed, 139 insertions(+), 140 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 2/3] net: modify skb_rbtree_purge to return the truesize of all purged skbs.

2018-08-02 Thread Peter Oskolkov
Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c  | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..47848367c816 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2585,7 +2585,7 @@ static inline void __skb_queue_purge(struct sk_buff_head 
*list)
kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 51b0a9126e12..8d574a88125d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2858,23 +2858,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  * skb_rbtree_purge - empty a skb rbtree
  * @root: root of the rbtree to empty
+ * Return value: the sum of truesizes of all purged skbs.
  *
  * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  * the list and one reference dropped. This function does not take
  * any lock. Synchronization should be handled by the caller (e.g., TCP
  * out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
 
while (p) {
struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
p = rb_next(p);
rb_erase(&skb->rbnode, root);
+   sum += skb->truesize;
kfree_skb(skb);
}
+   return sum;
 }
 
 /**
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 3/3] ip: use rb trees for IP frag queue.

2018-08-02 Thread Peter Oskolkov
Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.
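
To make the complexity claim concrete, here is a toy user-space sketch
of offset-keyed tree insertion (a plain unbalanced BST stands in for the
kernel's rb-tree here, and overlap checks are omitted):

#include <stdio.h>
#include <stdlib.h>

struct frag_node {
        int offset;                     /* key: fragment byte offset */
        struct frag_node *left, *right;
};

static struct frag_node **find_slot(struct frag_node **link, int offset)
{
        while (*link)                   /* O(log n) when balanced */
                link = offset < (*link)->offset ?
                       &(*link)->left : &(*link)->right;
        return link;
}

static void insert_frag(struct frag_node **root, int offset)
{
        struct frag_node **slot = find_slot(root, offset);
        struct frag_node *n = calloc(1, sizeof(*n));

        n->offset = offset;
        *slot = n;
}

static void walk_in_order(const struct frag_node *n)
{
        if (!n)
                return;
        walk_in_order(n->left);
        printf("offset %d\n", n->offset);       /* reassembly order */
        walk_in_order(n->right);
}

int main(void)
{
        struct frag_node *root = NULL;
        const int arrivals[] = { 2960, 0, 1480, 4440 }; /* out of order */
        unsigned int i;

        for (i = 0; i < sizeof(arrivals) / sizeof(arrivals[0]); i++)
                insert_frag(&root, arrivals[i]);
        walk_in_order(root);            /* prints 0, 1480, 2960, 4440 */
        return 0;
}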

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H  -F 100 -l 300 -T 20`:
netstat --statistics
Ip:
282078937 total packets received
0 forwarded
0 incoming packets discarded
946760 incoming packets delivered
18743456 requests sent out
101 fragments dropped after timeout
282077129 reassemblies required
944952 packets reassembled ok
262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
reassemblies vs a kernel without this patchset. More
comprehensive performance testing TBD).

Reported-by: Jann Horn 
Reported-by: Juha-Matti Tilli 
Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
---
 include/linux/skbuff.h  |   9 +-
 include/net/inet_frag.h |   3 +-
 net/ipv4/inet_fragment.c|  16 ++-
 net/ipv4/ip_fragment.c  | 182 +---
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 6 files changed, 121 insertions(+), 91 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 47848367c816..7ebdf158a795 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -676,13 +676,16 @@ struct sk_buff {
 * UDP receive path is one user.
 */
unsigned long   dev_scratch;
-   int ip_defrag_offset;
};
};
-   struct rb_node  rbnode; /* used in netem & tcp stack */
+   struct rb_node  rbnode; /* used in netem, ip4 defrag, 
and tcp stack */
struct list_headlist;
};
-   struct sock *sk;
+
+   union {
+   struct sock *sk;
+   int ip_defrag_offset;
+   };
 
union {
ktime_t tstamp;
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f4272a29dc44..b86d14528188 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,7 +75,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;
+   struct sk_buff  *fragments;  /* Used in IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
ktime_t stamp;
int len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index ccd140e4082d..6d258a5669e7 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -137,12 +137,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = q->fragments;
nf = q->net;
f = nf->f;
-   while (fp) {
-   struct sk_buff *xp = fp->next;
-
-   sum_truesize += fp->truesize;
-   kfree_skb(fp);
-   fp = xp;
+   if (fp) {
+   do {
+   struct sk_buff *xp = fp->next;
+
+   sum_truesize += fp->truesize;
+   kfree_skb(fp);
+   fp = xp;
+   } while (fp);
+   } else {
+   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 960bf5eab59f..ffbf9135fd71 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
struct inet_frag_queue *frag = from_timer(frag, t, timer);
const struct iphdr *iph;
-   struct sk_buff *head;
+   struct sk_buff *head = NULL;
struct net *net;
struct ipq *qp;
int err;
@@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
 
ipq_kill(qp);
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-   head = qp->q.fragments;
-
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!qp->q.flags & INET_FRAG_FIRST_IN)
goto out;
 
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   if (qp->q.fragments) {
+   head = qp->q.fragments;
+   qp->q.fragments = head->next;
+   } else {
+

[PATCH v2 net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.

2018-08-02 Thread Peter Oskolkov
This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran ip_defrag selftest (not yet available uptream).

Suggested-by: David S. Miller 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 

---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c| 75 ++-
 net/ipv4/proc.c   |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index e5ebc83827ab..f80135e5feaa 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -56,6 +56,7 @@ enum
IPSTATS_MIB_ECT1PKTS,   /* InECT1Pkts */
IPSTATS_MIB_ECT0PKTS,   /* InECT0Pkts */
IPSTATS_MIB_CEPKTS, /* InCEPkts */
+   IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
__IPSTATS_MIB_MAX
 };
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d14d741fb05e..960bf5eab59f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -277,6 +277,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+   struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct sk_buff *prev, *next;
struct net_device *dev;
unsigned int fragsize;
@@ -357,65 +358,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
}
 
 found:
-   /* We found where to put this one.  Check for overlap with
-* preceding fragment, and, if needed, align things so that
-* any overlaps are eliminated.
+   /* RFC5722, Section 4, amended by Errata ID : 3089
+*  When reassembling an IPv6 datagram, if
+*   one or more its constituent fragments is determined to be an
+*   overlapping fragment, the entire datagram (and any constituent
+*   fragments) MUST be silently discarded.
+*
+* We do the same here for IPv4.
 */
-   if (prev) {
-   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-   if (i > 0) {
-   offset += i;
-   err = -EINVAL;
-   if (end <= offset)
-   goto err;
-   err = -ENOMEM;
-   if (!pskb_pull(skb, i))
-   goto err;
-   if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-   skb->ip_summed = CHECKSUM_NONE;
-   }
-   }
+   /* Is there an overlap with the previous fragment? */
+   if (prev &&
+   (prev->ip_defrag_offset + prev->len) > offset)
+   goto discard_qp;
 
-   err = -ENOMEM;
-
-   while (next && next->ip_defrag_offset < end) {
-   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-   if (i < next->len) {
-   int delta = -next->truesize;
-
-   /* Eat head of the next overlapped fragment
-* and leave the loop. The next ones cannot overlap.
-*/
-   if (!pskb_pull(next, i))
-   goto err;
-   delta += next->truesize;
-   if (delta)
-   add_frag_mem_limit(qp->q.net, delta);
-   next->ip_defrag_offset += i;
-   qp->q.meat -= i;
-   if (next->ip_summed != CHECKSUM_UNNECESSARY)
-   next->ip_summed = CHECKSUM_NONE;
-   break;
-   } else {
-   struct sk_buff *free_it = next;
-
-   /* Old fragment is completely overridden with
-* new one drop it.
-*/
-   next = next->next;
-
-   if (prev)
-   prev->next = next;
-   else
-   qp->q.fragments = next;
-
-   qp->q.meat -= free_it->len;
-   sub_frag_mem_limit(qp->q.net, free_it->truesize);
-   kfree_skb(free_it);
-   }
-   }
+   /* Is there an overlap with the next fragment? */
+   if (next && next->ip_defrag_offset < end)
+   goto discard_qp;
 
/* Note : skb->ip_defrag_offset and skb->dev share the same location */
dev = skb->dev;
@@ -463,6 +422,10 @@ static int ip_frag_queue(struct ipq *qp, 

[PATCH v2 net-next 0/3] ip: Use rb trees for IP frag queue

2018-08-02 Thread Peter Oskolkov
This patchset
 * changes IPv4 defrag behavior to match that of IPv6: overlapping
   fragments now cause the whole IP datagram to be discarded (suggested
   by David Miller): there are no legitimate use cases for overlapping
   fragments;
 * changes IPv4 defrag queue from a list to a rb tree (suggested
   by Eric Dumazet): this change removes a potential attack vector.

Upcoming patches will contain similar changes for IPv6 frag queue,
as well as a comprehensive IP defrag self-test (temporarily delayed).

Peter Oskolkov (3):
  ip: discard IPv4 datagrams with overlapping segments.
  net: modify skb_rbtree_purge to return the truesize of all purged
skbs.
  ip: use rb trees for IP frag queue.

 include/linux/skbuff.h  |  11 +-
 include/net/inet_frag.h |   3 +-
 include/uapi/linux/snmp.h   |   1 +
 net/core/skbuff.c   |   6 +-
 net/ipv4/inet_fragment.c|  16 +-
 net/ipv4/ip_fragment.c  | 239 +++-
 net/ipv4/proc.c |   1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 9 files changed, 139 insertions(+), 140 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH v2 net-next 3/3] ip: use rb trees for IP frag queue.

2018-08-02 Thread Peter Oskolkov
Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H  -F 100 -l 300 -T 20`:
netstat --statistics
Ip:
282078937 total packets received
0 forwarded
0 incoming packets discarded
946760 incoming packets delivered
18743456 requests sent out
101 fragments dropped after timeout
282077129 reassemblies required
944952 packets reassembled ok
262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
reassemblies vs a kernel without this patchset. More
comprehensive performance testing TBD).

Reported-by: Jann Horn 
Reported-by: Juha-Matti Tilli 
Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 

---
 include/linux/skbuff.h  |   9 +-
 include/net/inet_frag.h |   3 +-
 net/ipv4/inet_fragment.c|  16 ++-
 net/ipv4/ip_fragment.c  | 182 +---
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 6 files changed, 121 insertions(+), 91 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 47848367c816..7ebdf158a795 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -676,13 +676,16 @@ struct sk_buff {
 * UDP receive path is one user.
 */
unsigned long   dev_scratch;
-   int ip_defrag_offset;
};
};
-   struct rb_node  rbnode; /* used in netem & tcp stack */
+   struct rb_node  rbnode; /* used in netem, ip4 defrag, 
and tcp stack */
struct list_headlist;
};
-   struct sock *sk;
+
+   union {
+   struct sock *sk;
+   int ip_defrag_offset;
+   };
 
union {
ktime_t tstamp;
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f4272a29dc44..b86d14528188 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,7 +75,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;
+   struct sk_buff  *fragments;  /* Used in IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
ktime_t stamp;
int len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index ccd140e4082d..6d258a5669e7 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -137,12 +137,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = q->fragments;
nf = q->net;
f = nf->f;
-   while (fp) {
-   struct sk_buff *xp = fp->next;
-
-   sum_truesize += fp->truesize;
-   kfree_skb(fp);
-   fp = xp;
+   if (fp) {
+   do {
+   struct sk_buff *xp = fp->next;
+
+   sum_truesize += fp->truesize;
+   kfree_skb(fp);
+   fp = xp;
+   } while (fp);
+   } else {
+   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 960bf5eab59f..ffbf9135fd71 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
struct inet_frag_queue *frag = from_timer(frag, t, timer);
const struct iphdr *iph;
-   struct sk_buff *head;
+   struct sk_buff *head = NULL;
struct net *net;
struct ipq *qp;
int err;
@@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
 
ipq_kill(qp);
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-   head = qp->q.fragments;
-
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!qp->q.flags & INET_FRAG_FIRST_IN)
goto out;
 
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   if (qp->q.fragments) {
+   head = qp->q.fragments;
+   qp->q.fragments = head->next;
+   } else {
+

[PATCH v2 net-next 2/3] net: modify skb_rbtree_purge to return the truesize of all purged skbs.

2018-08-02 Thread Peter Oskolkov
Tested: see the next patch in the series.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 

---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c  | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..47848367c816 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2585,7 +2585,7 @@ static inline void __skb_queue_purge(struct sk_buff_head 
*list)
kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 51b0a9126e12..8d574a88125d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2858,23 +2858,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  * skb_rbtree_purge - empty a skb rbtree
  * @root: root of the rbtree to empty
+ * Return value: the sum of truesizes of all purged skbs.
  *
  * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  * the list and one reference dropped. This function does not take
  * any lock. Synchronization should be handled by the caller (e.g., TCP
  * out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
 
while (p) {
struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
p = rb_next(p);
rb_erase(&skb->rbnode, root);
+   sum += skb->truesize;
kfree_skb(skb);
}
+   return sum;
 }
 
 /**
-- 
2.18.0.597.ga71716f1ad-goog



Re: [PATCH v2 net-next 0/3] ip: Use rb trees for IP frag queue

2018-08-03 Thread Peter Oskolkov
On Fri, Aug 3, 2018 at 12:33 PM Josh Hunt  wrote:
>
> On Thu, Aug 2, 2018 at 4:34 PM, Peter Oskolkov  wrote:
>>
>> This patchset
>>  * changes IPv4 defrag behavior to match that of IPv6: overlapping
>>fragments now cause the whole IP datagram to be discarded (suggested
>>by David Miller): there are no legitimate use cases for overlapping
>>fragments;
>>  * changes IPv4 defrag queue from a list to a rb tree (suggested
>>by Eric Dumazet): this change removes a potential attack vector.
>>
>> Upcoming patches will contain similar changes for IPv6 frag queue,
>> as well as a comprehensive IP defrag self-test (temporarily delayed).
>>
>> Peter Oskolkov (3):
>>   ip: discard IPv4 datagrams with overlapping segments.
>>   net: modify skb_rbtree_purge to return the truesize of all purged
>> skbs.
>>   ip: use rb trees for IP frag queue.
>>
>>  include/linux/skbuff.h  |  11 +-
>>  include/net/inet_frag.h |   3 +-
>>  include/uapi/linux/snmp.h   |   1 +
>>  net/core/skbuff.c   |   6 +-
>>  net/ipv4/inet_fragment.c|  16 +-
>>  net/ipv4/ip_fragment.c  | 239 +++-
>>  net/ipv4/proc.c |   1 +
>>  net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
>>  net/ipv6/reassembly.c   |   1 +
>>  9 files changed, 139 insertions(+), 140 deletions(-)
>>
>> --
>> 2.18.0.597.ga71716f1ad-goog
>>
>
> Peter
>
> I just tested your patches along with Florian's on top of net-next. Things 
> look much better wrt this type of attack. Thanks for doing this. I'm 
> wondering if we want to put an optional mechanism in place to limit the size 
> of the tree in terms of skbs it can hold? Otherwise an attacker can send 
> ~1400 8 byte frags and consume all frag memory (default high thresh is 4M) 
> pretty easily and I believe also evict other frags which may have been 
> pending? I am guessing this is what Florian's min MTU patches are trying to 
> help with.
>
> --
> Josh

Hi Josh,

It will be really easy to limit the size of the queue/tree (e.g. based
on a sysctl parameter). I can send a follow-up patch if there is a
consensus that this behavior is needed/useful.
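
For illustration, such a cap could look roughly like the user-space
model below; every name in it (frag_queue_model, frag_queue_admit, the
max-fragments sysctl) is invented for the sketch and does not exist in
the tree:

#include <stdbool.h>
#include <stdio.h>

static int sysctl_ipfrag_max_frags = 64;        /* imagined new sysctl */

struct frag_queue_model {
        int nr_frags;                   /* fragments currently queued */
};

/* Returns false when the whole queue should be dropped, the same way
 * overlapping fragments already cause the queue to be dropped today.
 */
static bool frag_queue_admit(struct frag_queue_model *q)
{
        if (q->nr_frags >= sysctl_ipfrag_max_frags)
                return false;
        q->nr_frags++;
        return true;
}

int main(void)
{
        struct frag_queue_model q = { 0 };
        int admitted = 0;

        while (frag_queue_admit(&q))
                admitted++;
        printf("queue capped after %d fragments\n", admitted);
        return 0;
}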

Thanks,
Peter


Re: [PATCH net-next] ipv4: frags: precedence bug in ip_expire()

2018-08-06 Thread Peter Oskolkov
Ack. Thanks, Dan!
On Mon, Aug 6, 2018 at 12:17 PM Dan Carpenter  wrote:
>
> We accidentally removed the parentheses here, but they are required
> because '!' has higher precedence than '&'.
>
> Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
> Signed-off-by: Dan Carpenter 
>
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index 0e8f8de77e71..7cb7ed761d8c 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -154,7 +154,7 @@ static void ip_expire(struct timer_list *t)
> __IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
> __IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
>
> -   if (!qp->q.flags & INET_FRAG_FIRST_IN)
> +   if (!(qp->q.flags & INET_FRAG_FIRST_IN))
> goto out;
>
> /* sk_buff::dev and sk_buff::rbnode are unionized. So we


[PATCH net-next 1/2] ip: add helpers to process in-order fragments faster.

2018-08-10 Thread Peter Oskolkov
This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, and is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).
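
To make the rules above concrete, here is a toy user-space classifier
that applies them to a sequence of (offset, length) fragments; it is a
sketch that assumes no overlaps, which the real code rejects earlier:

#include <stdio.h>

enum frag_action { NEW_RUN_TAIL, APPEND_TAIL_RUN, OOO_SINGLE };

static int tail_end = -1;       /* end offset of the tail run; -1 = empty */

static enum frag_action classify(int offset, int len)
{
        if (tail_end < 0 || offset > tail_end) {
                tail_end = offset + len;        /* starts a new tail run */
                return NEW_RUN_TAIL;
        }
        if (offset == tail_end) {
                tail_end += len;        /* in-order: extend the tail run */
                return APPEND_TAIL_RUN;
        }
        return OOO_SINGLE;      /* before the tail run: single rb-tree node */
}

int main(void)
{
        static const int frags[][2] = {
                { 0, 1000 }, { 1000, 1000 }, { 3000, 1000 }, { 2000, 1000 },
        };
        static const char *const names[] = {
                "new tail run", "append to tail run", "ooo single node",
        };
        unsigned int i;

        /* prints: new tail run, append to tail run, new tail run,
         * ooo single node
         */
        for (i = 0; i < sizeof(frags) / sizeof(frags[0]); i++)
                printf("frag(off=%d, len=%d): %s\n", frags[i][0],
                       frags[i][1],
                       names[classify(frags[i][0], frags[i][1])]);
        return 0;
}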

Reported-by: Willem de Bruijn 
Cc: Eric Dumazet 
Cc: Florian Westphal 

---
 include/net/inet_frag.h |  6 
 net/ipv4/ip_fragment.c  | 73 +
 2 files changed, 79 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index b86d14528188..1662cbc0b46b 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -57,7 +57,9 @@ struct frag_v6_compare_key {
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
+ * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
+ * @last_run_head: the head of the last "run". see ip_fragment.c
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
@@ -78,6 +80,7 @@ struct inet_frag_queue {
struct sk_buff  *fragments;  /* Used in IPv6. */
struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
+   struct sk_buff  *last_run_head;
ktime_t stamp;
int len;
int meat;
@@ -113,6 +116,9 @@ void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
if (refcount_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7cb7ed761d8c..26ace9d2d976 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -57,6 +57,57 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   struct inet_skb_parmh;
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+   struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+&q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   ip4_frag_init_run(skb);
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
+
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
struct inet_frag_queue q;
@@ -654,6 +705,28 @@ struct sk_buff *ip_check_defrag(struct net *net, struct 
sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_check_defrag);
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+   struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
+
+   while (p) {
+  

[PATCH net-next 2/2] ip: process in-order fragments efficiently

2018-08-10 Thread Peter Oskolkov
This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H  -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn 
Cc: Eric Dumazet 
Cc: Florian Westphal 

---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 ---
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 6d258a5669e7..bcb11f3a27c0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -146,7 +146,7 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = xp;
} while (fp);
} else {
-   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 26ace9d2d976..88281fbce88c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -126,8 +126,8 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
 
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -219,7 +219,12 @@ static void ip_expire(struct timer_list *t)
head = skb_rb_first(&qp->q.rb_fragments);
if (!head)
goto out;
-   rb_erase(&head->rbnode, &qp->q.rb_fragments);
+   if (FRAG_CB(head)->next_frag)
+   rb_replace_node(&head->rbnode,
+   &FRAG_CB(head)->next_frag->rbnode,
+   &qp->q.rb_fragments);
+   else
+   rb_erase(&head->rbnode, &qp->q.rb_fragments);
memset(&head->rbnode, 0, sizeof(head->rbnode));
barrier();
}
@@ -320,7 +325,7 @@ static int ip_frag_reinit(struct ipq *qp)
return -ETIMEDOUT;
}
 
-   sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
sub_frag_mem_limit(qp->q.net, sum_truesize);
 
qp->q.flags = 0;
@@ -329,6 +334,7 @@ static int ip_frag_reinit(struct ipq *qp)
qp->q.fragments = NULL;
qp->q.rb_fragments = RB_ROOT;
qp->q.fragments_tail = NULL;
+   qp->q.last_run_head = NULL;
qp->iif = 0;
qp->ecn = 0;
 
@@ -340,7 +346,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 {
struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct rb_node **rbn, *parent;
-   struct sk_buff *skb1;
+   struct sk_buff *skb1, *prev_tail;
struct net_device *dev;
unsigned int fragsize;
int flags, offset;
@@ -418,38 +424,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 */
 
/* Find out where to put this fragment.  */
-   skb1 = qp->q.fragments_tail;
-   if (!skb1) {
-   /* This is the first fragment we've received. */
-   rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
-   qp->q.fragments_tail = skb;
-   } else if ((skb1->ip_defrag_offset + skb1->len) < end) {
-   /* This is the common/special case: skb goes to the end. */
+   prev_tail = qp->q.fragments_tail;
+   if (!prev_tail)
+   ip4_frag_create_run(&qp->q, skb);  /* First fragment. */
+   else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
+   /* This is the common case: skb goes to the end. */
/* Detect and discard overlaps. */
-   if (offset < (skb1->ip_defrag_offset + skb1->len))
+   if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
goto discard_qp;
-   /* Insert after skb1. */
-   rb_link_node(&skb->rbnode, &skb1->rbnode, 
&skb1->rbnode.rb_right);
-   qp->q.fragments_tail = skb;
+   if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
+   ip4_frag_append_to_last_run(&qp->q, skb);
+   else
+   ip4_frag_create_run(&qp->q, skb);
} else {
-   /* Binary search. Note that skb can become the first fragment, 
but
-* not the last (covered 

[PATCH net-next v2 1/2] ip: add helpers to process in-order fragments faster.

2018-08-11 Thread Peter Oskolkov
This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, and is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn 
Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 

---
 include/net/inet_frag.h |  6 
 net/ipv4/ip_fragment.c  | 73 +
 2 files changed, 79 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index b86d14528188..1662cbc0b46b 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -57,7 +57,9 @@ struct frag_v6_compare_key {
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
+ * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
+ * @last_run_head: the head of the last "run". see ip_fragment.c
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
@@ -78,6 +80,7 @@ struct inet_frag_queue {
struct sk_buff  *fragments;  /* Used in IPv6. */
struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
+   struct sk_buff  *last_run_head;
ktime_t stamp;
int len;
int meat;
@@ -113,6 +116,9 @@ void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
if (refcount_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7cb7ed761d8c..26ace9d2d976 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -57,6 +57,57 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   struct inet_skb_parmh;
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+   struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+&q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   ip4_frag_init_run(skb);
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
+
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
struct inet_frag_queue q;
@@ -654,6 +705,28 @@ struct sk_buff *

[PATCH net-next v2 2/2] ip: process in-order fragments efficiently

2018-08-11 Thread Peter Oskolkov
This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H  -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn 
Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 

---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 ---
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 6d258a5669e7..bcb11f3a27c0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -146,7 +146,7 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = xp;
} while (fp);
} else {
-   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 26ace9d2d976..88281fbce88c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -126,8 +126,8 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
 
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -219,7 +219,12 @@ static void ip_expire(struct timer_list *t)
head = skb_rb_first(&qp->q.rb_fragments);
if (!head)
goto out;
-   rb_erase(&head->rbnode, &qp->q.rb_fragments);
+   if (FRAG_CB(head)->next_frag)
+   rb_replace_node(&head->rbnode,
+   &FRAG_CB(head)->next_frag->rbnode,
+   &qp->q.rb_fragments);
+   else
+   rb_erase(&head->rbnode, &qp->q.rb_fragments);
memset(&head->rbnode, 0, sizeof(head->rbnode));
barrier();
}
@@ -320,7 +325,7 @@ static int ip_frag_reinit(struct ipq *qp)
return -ETIMEDOUT;
}
 
-   sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
sub_frag_mem_limit(qp->q.net, sum_truesize);
 
qp->q.flags = 0;
@@ -329,6 +334,7 @@ static int ip_frag_reinit(struct ipq *qp)
qp->q.fragments = NULL;
qp->q.rb_fragments = RB_ROOT;
qp->q.fragments_tail = NULL;
+   qp->q.last_run_head = NULL;
qp->iif = 0;
qp->ecn = 0;
 
@@ -340,7 +346,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 {
struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct rb_node **rbn, *parent;
-   struct sk_buff *skb1;
+   struct sk_buff *skb1, *prev_tail;
struct net_device *dev;
unsigned int fragsize;
int flags, offset;
@@ -418,38 +424,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 */
 
/* Find out where to put this fragment.  */
-   skb1 = qp->q.fragments_tail;
-   if (!skb1) {
-   /* This is the first fragment we've received. */
-   rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
-   qp->q.fragments_tail = skb;
-   } else if ((skb1->ip_defrag_offset + skb1->len) < end) {
-   /* This is the common/special case: skb goes to the end. */
+   prev_tail = qp->q.fragments_tail;
+   if (!prev_tail)
+   ip4_frag_create_run(&qp->q, skb);  /* First fragment. */
+   else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
+   /* This is the common case: skb goes to the end. */
/* Detect and discard overlaps. */
-   if (offset < (skb1->ip_defrag_offset + skb1->len))
+   if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
goto discard_qp;
-   /* Insert after skb1. */
-   rb_link_node(&skb->rbnode, &skb1->rbnode, 
&skb1->rbnode.rb_right);
-   qp->q.fragments_tail = skb;
+   if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
+   

[PATCH net-next v2 0/2] ip: faster in-order IP fragments

2018-08-11 Thread Peter Oskolkov
Added "Signed-off-by" in v2.

Peter Oskolkov (2):
  ip: add helpers to process in-order fragments faster.
  ip: process in-order fragments efficiently

 include/net/inet_frag.h  |   6 ++
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 183 ++-
 3 files changed, 149 insertions(+), 42 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 2/2] selftests/net: add ip_defrag selftest

2018-08-28 Thread Peter Oskolkov
This test creates a raw IPv4 socket, fragments a largish UDP
datagram and sends the fragments out of order.

Then repeats in a loop with different message and fragment lengths.

Then does the same with overlapping fragments, where the expectation
is that the recv times out.
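
For readers unfamiliar with hand-rolled fragmentation, the per-fragment
IPv4 header setup boils down to something like the function below (a
hedged sketch, not the test's actual helper; fill_frag_header() is an
invented name, and 'offset' counts bytes into the UDP header + payload):

#include <arpa/inet.h>
#include <netinet/ip.h>
#include <stdint.h>
#include <string.h>

#define IP4_MF (1u << 13)       /* MF bit inside the fragment-offset field */

void fill_frag_header(struct ip *iph, uint16_t id, int offset,
                      int frag_len, int total_len)
{
        memset(iph, 0, sizeof(*iph));
        iph->ip_v = 4;
        iph->ip_hl = 5;
        iph->ip_id = htons(id); /* identical for all fragments */
        iph->ip_len = htons(sizeof(*iph) + frag_len);
        iph->ip_p = IPPROTO_UDP;
        /* 13-bit offset in 8-byte units; MF on every non-last fragment */
        iph->ip_off = htons((offset / 8) |
                            (offset + frag_len < total_len ? IP4_MF : 0));
        /* ip_sum left zero: with IP_HDRINCL the kernel fills the checksum */
}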

Tested:

root@# time ./ip_defrag.sh
ipv4 defrag
PASS
ipv4 defrag with overlaps
PASS

real1m7.679s
user0m0.628s
sys 0m2.242s

A similar test for IPv6 is to follow.

Signed-off-by: Peter Oskolkov 
Reviewed-by: Willem de Bruijn 
---
 tools/testing/selftests/net/.gitignore   |   2 +
 tools/testing/selftests/net/Makefile |   4 +-
 tools/testing/selftests/net/ip_defrag.c  | 313 +++
 tools/testing/selftests/net/ip_defrag.sh |  29 +++
 4 files changed, 346 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/net/ip_defrag.c
 create mode 100755 tools/testing/selftests/net/ip_defrag.sh

diff --git a/tools/testing/selftests/net/.gitignore 
b/tools/testing/selftests/net/.gitignore
index 78b24cf76f40..2836e0cf2d81 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -14,3 +14,5 @@ udpgso_bench_rx
 udpgso_bench_tx
 tcp_inq
 tls
+ip_defrag
+
diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 9cca68e440a0..cccdb2295567 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,13 +5,13 @@ CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
-TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh
+TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
 TEST_GEN_FILES += tcp_mmap tcp_inq psock_snd
-TEST_GEN_FILES += udpgso udpgso_bench_tx udpgso_bench_rx
+TEST_GEN_FILES += udpgso udpgso_bench_tx udpgso_bench_rx ip_defrag
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
 
diff --git a/tools/testing/selftests/net/ip_defrag.c 
b/tools/testing/selftests/net/ip_defrag.c
new file mode 100644
index ..55fdcdc78eef
--- /dev/null
+++ b/tools/testing/selftests/net/ip_defrag.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <error.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+#include <netinet/udp.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+static boolcfg_do_ipv4;
+static boolcfg_do_ipv6;
+static boolcfg_verbose;
+static boolcfg_overlap;
+static unsigned short  cfg_port = 9000;
+
+const struct in_addr addr4 = { .s_addr = __constant_htonl(INADDR_LOOPBACK + 2) 
};
+
+#define IP4_HLEN   (sizeof(struct iphdr))
+#define IP6_HLEN   (sizeof(struct ip6_hdr))
+#define UDP_HLEN   (sizeof(struct udphdr))
+
+static int msg_len;
+static int max_frag_len;
+
+#define MSG_LEN_MAX 60000   /* Max UDP payload length. */
+
+#define IP4_MF (1u << 13)  /* IPv4 MF flag. */
+
+static uint8_t udp_payload[MSG_LEN_MAX];
+static uint8_t ip_frame[IP_MAXPACKET];
+static uint16_t ip_id = 0xabcd;
+static int msg_counter;
+static int frag_counter;
+static unsigned int seed;
+
+/* Receive a UDP packet. Validate it matches udp_payload. */
+static void recv_validate_udp(int fd_udp)
+{
+   ssize_t ret;
+   static uint8_t recv_buff[MSG_LEN_MAX];
+
+   ret = recv(fd_udp, recv_buff, msg_len, 0);
+   msg_counter++;
+
+   if (cfg_overlap) {
+   if (ret != -1)
+   error(1, 0, "recv: expected timeout; got %d; seed = %u",
+   (int)ret, seed);
+   if (errno != ETIMEDOUT && errno != EAGAIN)
+   error(1, errno, "recv: expected timeout: %d; seed = %u",
+errno, seed);
+   return;  /* OK */
+   }
+
+   if (ret == -1)
+   error(1, errno, "recv: msg_len = %d max_frag_len = %d",
+   msg_len, max_frag_len);
+   if (ret != msg_len)
+   error(1, 0, "recv: wrong size: %d vs %d", (int)ret, msg_len);
+   if (memcmp(udp_payload, recv_buff, msg_len))
+   error(1, 0, "recv: wrong data");
+}
+
+static uint32_t raw_checksum(uint8_t *buf, int len, uint32_t sum)
+{
+   int i;
+
+   for (i = 0; i < (len & ~1U); i += 2) {
+           sum += (u_int16_t)ntohs(*((u_int16_t *)(buf + i)));
+           if (sum > 0xffff)
+                   sum -= 0xffff;
+   }
+
+   if (i < len) {
+           sum += buf[i] << 8;
+           if (sum > 0xffff)
+                   sum -= 0xffff;
+   }
+
+   return sum;
+}

[PATCH net-next 1/2] ip: fail fast on IP defrag errors

2018-08-28 Thread Peter Oskolkov
The current behavior of IP defragmentation is inconsistent:
- some overlapping/wrong length fragments are dropped without
  affecting the queue;
- most overlapping fragments cause the whole frag queue to be dropped.

This patch brings consistency: if a bad fragment is detected,
the whole frag queue is dropped. Two major benefits:
- fail fast: corrupted frag queues are cleared immediately, instead of
  by timeout;
- testing of overlapping fragments is now much easier: any kind of
  random fragment length mutation now leads to the frag queue being
  discarded (IP packet dropped); before this patch, some overlaps were
  "corrected", with tests not seeing expected packet drops.

Note that in one case (see "if (end&7)" conditional) the current
behavior is preserved as there are concerns that this could be
legitimate padding.
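
For context, the preserved case reads roughly like this in
net/ipv4/ip_fragment.c (paraphrased; the comment is added here for
illustration): a non-final fragment whose end is not 8-byte aligned is
trimmed rather than treated as corruption, since fragment offsets are
expressed in 8-byte units and the excess may be padding:

	if (end & 7) {
		/* trim to an 8-byte boundary and force the checksum to
		 * be recomputed, instead of dropping the whole queue
		 */
		end &= ~7;
		if (skb->ip_summed != CHECKSUM_UNNECESSARY)
			skb->ip_summed = CHECKSUM_NONE;
	}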

Signed-off-by: Peter Oskolkov 
Reviewed-by: Eric Dumazet 
Reviewed-by: Willem de Bruijn 
---
 net/ipv4/ip_fragment.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 88281fbce88c..330f62353b11 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -382,7 +382,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 */
if (end < qp->q.len ||
((qp->q.flags & INET_FRAG_LAST_IN) && end != qp->q.len))
-   goto err;
+   goto discard_qp;
qp->q.flags |= INET_FRAG_LAST_IN;
qp->q.len = end;
} else {
@@ -394,20 +394,20 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
if (end > qp->q.len) {
/* Some bits beyond end -> corruption. */
if (qp->q.flags & INET_FRAG_LAST_IN)
-   goto err;
+   goto discard_qp;
qp->q.len = end;
}
}
if (end == offset)
-   goto err;
+   goto discard_qp;
 
err = -ENOMEM;
if (!pskb_pull(skb, skb_network_offset(skb) + ihl))
-   goto err;
+   goto discard_qp;
 
err = pskb_trim_rcsum(skb, end - offset);
if (err)
-   goto err;
+   goto discard_qp;
 
/* Note : skb->rbnode and skb->dev share the same location. */
dev = skb->dev;
@@ -423,6 +423,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 * We do the same here for IPv4 (and increment an snmp counter).
 */
 
+   err = -EINVAL;
/* Find out where to put this fragment.  */
prev_tail = qp->q.fragments_tail;
if (!prev_tail)
@@ -431,7 +432,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
/* This is the common case: skb goes to the end. */
/* Detect and discard overlaps. */
if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
-   goto discard_qp;
+   goto overlap;
if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
ip4_frag_append_to_last_run(&qp->q, skb);
else
@@ -450,7 +451,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
FRAG_CB(skb1)->frag_run_len)
rbn = &parent->rb_right;
else /* Found an overlap with skb1. */
-   goto discard_qp;
+   goto overlap;
} while (*rbn);
/* Here we have parent properly set, and rbn pointing to
 * one of its NULL left/right children. Insert skb.
@@ -487,16 +488,18 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
skb->_skb_refdst = 0UL;
err = ip_frag_reasm(qp, skb, prev_tail, dev);
skb->_skb_refdst = orefdst;
+   if (err)
+   inet_frag_kill(&qp->q);
return err;
}
 
skb_dst_drop(skb);
return -EINPROGRESS;
 
+overlap:
+   __IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
 discard_qp:
inet_frag_kill(&qp->q);
-   err = -EINVAL;
-   __IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
 err:
kfree_skb(skb);
return err;
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog



Re: [PATCH bpf] bpf: lwtunnel: fix reroute supplying invalid dst

2019-10-14 Thread Peter Oskolkov
On Sat, Oct 12, 2019 at 9:59 AM Alexei Starovoitov
 wrote:
>
> On Wed, Oct 9, 2019 at 1:31 AM Jiri Benc  wrote:
> >
> > The dst in bpf_input() has lwtstate field set. As it is of the
> > LWTUNNEL_ENCAP_BPF type, lwtstate->data is struct bpf_lwt. When the bpf
> > program returns BPF_LWT_REROUTE, ip_route_input_noref is directly called on
> > this skb. This causes invalid memory access, as ip_route_input_slow calls
> > skb_tunnel_info(skb) that expects the dst->lwstate->data to be
> > struct ip_tunnel_info. This results to struct bpf_lwt being accessed as
> > struct ip_tunnel_info.
> >
> > Drop the dst before calling the IP route input functions (both for IPv4 and
> > IPv6).
> >
> > Reported by KASAN.
> >
> > Fixes: 3bd0b15281af ("bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c")
> > Cc: Peter Oskolkov 
> > Signed-off-by: Jiri Benc 
>
> Peter and other google folks,
> please review.

selftests/bpf/test_lwt_ip_encap.sh passes. Seems OK.

Acked-by: Peter Oskolkov 


Re: [PATCH bpf] selftests/bpf: More compatible nc options in test_tc_edt

2019-10-18 Thread Peter Oskolkov
On Fri, Oct 18, 2019 at 5:00 AM Jiri Benc  wrote:
>
> Out of the three nc implementations widely in use, at least two (BSD netcat
> and nmap-ncat) do not support -l combined with -s. Modify the nc invocation
> to be accepted by all of them.
>
> Fixes: 7df5e3db8f63 ("selftests: bpf: tc-bpf flow shaping with EDT")
> Cc: Peter Oskolkov 
> Signed-off-by: Jiri Benc 
> ---
>  tools/testing/selftests/bpf/test_tc_edt.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/test_tc_edt.sh 
> b/tools/testing/selftests/bpf/test_tc_edt.sh
> index f38567ef694b..daa7d1b8d309 100755
> --- a/tools/testing/selftests/bpf/test_tc_edt.sh
> +++ b/tools/testing/selftests/bpf/test_tc_edt.sh
> @@ -59,7 +59,7 @@ ip netns exec ${NS_SRC} tc filter add dev veth_src egress \
>
>  # start the listener
>  ip netns exec ${NS_DST} bash -c \
> -   "nc -4 -l -s ${IP_DST} -p 9000 >/dev/null &"
> +   "nc -4 -l -p 9000 >/dev/null &"

The test passes with the regular linux/debian nc. If it passes with the rest,

Acked-by: Peter Oskolkov 

>  declare -i NC_PID=$!
>  sleep 1
>
> --
> 2.18.1
>


[PATCH bpf-next] selftests: bpf: add VRF test cases to lwt_ip_encap test.

2019-04-03 Thread Peter Oskolkov
This patch adds tests validating that VRF and BPF-LWT
encap work together well, as requested by David Ahern.

Signed-off-by: Peter Oskolkov 
---
 .../selftests/bpf/test_lwt_ip_encap.sh| 134 +++---
 1 file changed, 86 insertions(+), 48 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh 
b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
index d4d3391cc13af..acf7a74f97cd9 100755
--- a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -129,6 +129,24 @@ setup()
ip link set veth7 netns ${NS2}
ip link set veth8 netns ${NS3}
 
+   if [ ! -z "${VRF}" ] ; then
+   ip -netns ${NS1} link add red type vrf table 1001
+   ip -netns ${NS1} link set red up
+   ip -netns ${NS1} route add table 1001 unreachable default 
metric 8192
+   ip -netns ${NS1} -6 route add table 1001 unreachable default 
metric 8192
+   ip -netns ${NS1} link set veth1 vrf red
+   ip -netns ${NS1} link set veth5 vrf red
+
+   ip -netns ${NS2} link add red type vrf table 1001
+   ip -netns ${NS2} link set red up
+   ip -netns ${NS2} route add table 1001 unreachable default 
metric 8192
+   ip -netns ${NS2} -6 route add table 1001 unreachable default 
metric 8192
+   ip -netns ${NS2} link set veth2 vrf red
+   ip -netns ${NS2} link set veth3 vrf red
+   ip -netns ${NS2} link set veth6 vrf red
+   ip -netns ${NS2} link set veth7 vrf red
+   fi
+
# configure addesses: the top route (1-2-3-4)
ip -netns ${NS1}addr add ${IPv4_1}/24  dev veth1
ip -netns ${NS2}addr add ${IPv4_2}/24  dev veth2
@@ -163,29 +181,29 @@ setup()
 
# NS1
# top route
-   ip -netns ${NS1}route add ${IPv4_2}/32  dev veth1
-   ip -netns ${NS1}route add default dev veth1 via ${IPv4_2}  # go top 
by default
-   ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1
-   ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2}  # go top 
by default
+   ip -netns ${NS1}route add ${IPv4_2}/32  dev veth1 ${VRF}
+   ip -netns ${NS1}route add default dev veth1 via ${IPv4_2} ${VRF}  # 
go top by default
+   ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1 ${VRF}
+   ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2} ${VRF}  # 
go top by default
# bottom route
-   ip -netns ${NS1}route add ${IPv4_6}/32  dev veth5
-   ip -netns ${NS1}route add ${IPv4_7}/32  dev veth5 via ${IPv4_6}
-   ip -netns ${NS1}route add ${IPv4_8}/32  dev veth5 via ${IPv4_6}
-   ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5
-   ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6}
-   ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6}
+   ip -netns ${NS1}route add ${IPv4_6}/32  dev veth5 ${VRF}
+   ip -netns ${NS1}route add ${IPv4_7}/32  dev veth5 via ${IPv4_6} 
${VRF}
+   ip -netns ${NS1}route add ${IPv4_8}/32  dev veth5 via ${IPv4_6} 
${VRF}
+   ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5 ${VRF}
+   ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6} 
${VRF}
+   ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6} 
${VRF}
 
# NS2
# top route
-   ip -netns ${NS2}route add ${IPv4_1}/32  dev veth2
-   ip -netns ${NS2}route add ${IPv4_4}/32  dev veth3
-   ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2
-   ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3
+   ip -netns ${NS2}route add ${IPv4_1}/32  dev veth2 ${VRF}
+   ip -netns ${NS2}route add ${IPv4_4}/32  dev veth3 ${VRF}
+   ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2 ${VRF}
+   ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3 ${VRF}
# bottom route
-   ip -netns ${NS2}route add ${IPv4_5}/32  dev veth6
-   ip -netns ${NS2}route add ${IPv4_8}/32  dev veth7
-   ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6
-   ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7
+   ip -netns ${NS2}route add ${IPv4_5}/32  dev veth6 ${VRF}
+   ip -netns ${NS2}route add ${IPv4_8}/32  dev veth7 ${VRF}
+   ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6 ${VRF}
+   ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7 ${VRF}
 
# NS3
# top route
@@ -207,16 +225,16 @@ setup()
ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local 
${IPv4_GRE} ttl 255
ip -netns ${NS3} link set gre_dev up
ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
-   ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
-   ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
+   ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6} ${VRF}
+   ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8} ${VRF}

Re: [PATCH net] ipv6: un-do: defrag: drop non-last frags smaller than min mtu

2019-04-08 Thread Peter Oskolkov
On Mon, Apr 8, 2019 at 8:51 AM Sasha Levin  wrote:
>
> On Mon, Apr 08, 2019 at 08:49:52AM -0600, Captain Wiggum wrote:
> >Hi Sasha,
> >
> >This patch cannot be applied to upstream, the code is significantly 
> >different.
> >Therefore, this un-do patch would not be seen in the upstream git log.
> >It was solved there by coding a better solution, not by the un-do patch.
>
> Okay, so this is effectively a request to diverge the -stable tree from
> upstream in a non-trivial way, which is why I asked David Miller to ack
> this act explcitly (or to send me patches, or whatever else he thinks is
> appropriate here).

I believe that applying this patch series:
https://patchwork.ozlabs.org/cover/1029418/
from upstream will achieve the desired outcome (assuming it applies cleanly).

>
> >Please consider this:
> >Upstream passes the TAHI IPv6 protocol tests. All the LTS kernels do NOT.
> >This is the patch that causes the failure in 4.9, 4.14, 4.19 LTS kernels.
>
> I very much agree that this should get fixed. My concerns are not with
> the bug but are with the proposed fix as it applies to -stable trees.
>
> >And this patch has been in place with 4.9.134, a long time.
> >It is not right that "Linux" can not pass the IPv6 protocol test.
> >My executive are asking me why "Linux" is not fit for IPv6 deployments.
>
> Arguments such as this carry no weight in a more technical discussion
> such as this. Yes, some tests are currently broken, but we will not take
> shortcuts just because "executives are unhappy".
>
> --
> Thanks,
> Sasha


Re: [PATCH net] ipv6: un-do: defrag: drop non-last frags smaller than min mtu

2019-04-08 Thread Peter Oskolkov
On Mon, Apr 8, 2019 at 9:29 AM Captain Wiggum  wrote:
>
> Thank you Peter!
>
> I tried the patch on 4.9.167 & 4.19.32. It's out of sync with upstream.
> Looks like a little different work needed for each LTS kernel.
> Is someone is familiar with it, and is available to patch it?

I'll try to backport the patchset to 4.9 and send it out for review.
If it is accepted (this
will be my first stable backport attempt), I'll do the same for 4.14 and 4.19.

Thanks,
Peter

> If not, I'd be happy to do this and propose a patch for at least 4.9,
> 4.14, 4.19.
> I look forward to your feedback.
> -
> patch against 4.9.167:
> include/net/inet_frag.h: 1 out of 2 hunks FAILED
> net/ipv4/ip_fragment.c: 4 out of 9 hunks FAILED
> can't find file to patch at input line 796: Not found: include/net/ipv6_frag.h
> -
> patch against 4.19.32:
> net/ipv4/ip_fragment.c: 3 out of 9 hunks FAILED
> net/ipv6/reassembly.c: 2 out of 8 hunks FAILED
> can't find file to patch: tools/testing/selftests/net/ip_defrag.c
>
> --John Masinter
>
> On Mon, Apr 8, 2019 at 9:59 AM Peter Oskolkov  wrote:
> >
> > On Mon, Apr 8, 2019 at 8:51 AM Sasha Levin  wrote:
> > >
> > > On Mon, Apr 08, 2019 at 08:49:52AM -0600, Captain Wiggum wrote:
> > > >Hi Sasha,
> > > >
> > > >This patch cannot be applied to upstream, the code is significantly 
> > > >different.
> > > >Therefore, this un-do patch would not be seen in the upstream git log.
> > > >It was solved there by coding a better solution, not by the un-do patch.
> > >
> > > Okay, so this is effectively a request to diverge the -stable tree from
> > > upstream in a non-trivial way, which is why I asked David Miller to ack
> > > this act explcitly (or to send me patches, or whatever else he thinks is
> > > appropriate here).
> >
> > I believe that applying this patch series:
> > https://patchwork.ozlabs.org/cover/1029418/
> > from upstream will achieve the desired outcome (assuming it applies 
> > cleanly).
> >
> > >
> > > >Please consider this:
> > > >Upstream passes the TAHI IPv6 protocol tests. All the LTS kernels do NOT.
> > > >This is the patch that causes the failure in 4.9, 4.14, 4.19 LTS kernels.
> > >
> > > I very much agree that this should get fixed. My concerns are not with
> > > the bug but are with the proposed fix as it applies to -stable trees.
> > >
> > > >And this patch has been in place with 4.9.134, a long time.
> > > >It is not right that "Linux" can not pass the IPv6 protocol test.
> > > >My executive are asking me why "Linux" is not fit for IPv6 deployments.
> > >
> > > Arguments such as this carry no weight in a more technical discussion
> > > such as this. Yes, some tests are currently broken, but we will not take
> > > shortcuts just because "executives are unhappy".
> > >
> > > --
> > > Thanks,
> > > Sasha


Re: [PATCH net] ipv6: un-do: defrag: drop non-last frags smaller than min mtu

2019-04-08 Thread Peter Oskolkov
On Mon, Apr 8, 2019 at 4:15 PM Sasha Levin  wrote:
>
> On Mon, Apr 08, 2019 at 10:13:57AM -0700, Peter Oskolkov wrote:
> >On Mon, Apr 8, 2019 at 9:29 AM Captain Wiggum  wrote:
> >>
> >> Thank you Peter!
> >>
> >> I tried the patch on 4.9.167 & 4.19.32. It's out of sync with upstream.
> >> Looks like a little different work needed for each LTS kernel.
> >> Is someone is familiar with it, and is available to patch it?
> >
> >I'll try to backport the patchset to 4.9 and send it out for review.
> >If it is accepted (this
> >will be my first stable backport attempt), I'll do the same for 4.14 and 
> >4.19.
>
> Please start with 4.19 and work backwards for two reasons:
>
> 1. If we fix something in 4.9, and the user upgrades to a broken
> 4.14/4.19 he'll be very upset.
>
> 2. It'll make the backporting process much easier for you as the
> differences between 4.19 and 5.0 are much smaller than 4.9 and 5.0.
>

Thanks, Sasha - it is indeed much easier to backport to 4.19. I'll
send the patchset for review shortly.

>
> --
> Thanks,
> Sasha


[PATCH 4.19 stable 1/3] net: IP defrag: encapsulate rbtree defrag code into callable functions

2019-04-08 Thread Peter Oskolkov
[ Upstream commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f ]

This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.
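
To make the new entry points concrete, here is a schematic of how a
protocol's queueing function is expected to drive them. This is an
illustrative pseudo-flow only, with error handling simplified; the
real call sites are in the hunks below:

	err = inet_frag_queue_insert(&q->q, skb, offset, end);
	if (err == IPFRAG_DUP) {
		/* exact duplicate: drop only this skb */
		kfree_skb(skb);
		return -EINVAL;
	}
	if (err == IPFRAG_OVERLAP)
		goto discard_qp;	/* kill the whole queue */

	if (q->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
	    q->q.meat == q->q.len) {
		void *reasm_data = inet_frag_reasm_prepare(&q->q, skb,
							   prev_tail);
		if (!reasm_data)
			goto out_fail;
		inet_frag_reasm_finish(&q->q, skb, reasm_data);
	}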

Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Cc: Tom Herbert 
Signed-off-by: David S. Miller 
---
 include/net/inet_frag.h  |  16 ++-
 net/ipv4/inet_fragment.c | 293 +++
 net/ipv4/ip_fragment.c   | 290 --
 3 files changed, 335 insertions(+), 264 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 1662cbc0b46b4..b02bf737d019a 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -77,8 +77,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;  /* Used in IPv6. */
-   struct rb_root  rb_fragments; /* Used in IPv4. */
+   struct sk_buff  *fragments;  /* used in 6lopwpan IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4/IPv6. */
struct sk_buff  *fragments_tail;
struct sk_buff  *last_run_head;
ktime_t stamp;
@@ -153,4 +153,16 @@ static inline void add_frag_mem_limit(struct netns_frags 
*nf, long val)
 
 extern const u8 ip_frag_ecn_table[16];
 
+/* Return values of inet_frag_queue_insert() */
+#define IPFRAG_OK  0
+#define IPFRAG_DUP 1
+#define IPFRAG_OVERLAP 2
+int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
+  int offset, int end);
+void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
+ struct sk_buff *parent);
+void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
+   void *reasm_data);
+struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
+
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 760a9e52e02b9..9f69411251d03 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -25,6 +25,62 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   union {
+   struct inet_skb_parmh4;
+   struct inet6_skb_parm   h6;
+   };
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void fragcb_clear(struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void fragrun_append_to_last(struct inet_frag_queue *q,
+  struct sk_buff *skb)
+{
+   fragcb_clear(skb);
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void fragrun_create(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+   fragcb_clear(skb);
+
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+&q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
 
 /* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
  * Value : 0xff if frame should be dropped.
@@ -123,6 +179,28 @@ static void inet_frag_destroy_rcu(struct rcu_head *head)
kmem_cache_free(f->frags_cachep, q);
 }
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+   struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
+
+   while (p) {
+   struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+   p = rb_next(p);
+   rb_erase(&skb->rbnode, root);
+   while (skb) {
+   struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+   sum += skb->truesize;
+  

[PATCH 4.19 stable 2/3] net: IP6 defrag: use rbtrees for IPv6 defrag

2019-04-08 Thread Peter Oskolkov
[ Upstream commit d4289fcc9b16b89619ee1c54f829e05e56de8b9a ]

Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IPv6, removing the 1280 byte restriction.
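
For reference, the restriction being removed was (roughly) the
following check added by commit 0ed4229b08c1 to the IPv6 frag queueing
path, which dropped any non-last fragment carried in a packet smaller
than the IPv6 minimum MTU:

	if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
	    fhdr->frag_off & htons(IP6_MF))
		return -EINVAL;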

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 include/net/ipv6_frag.h |  11 +-
 net/ipv6/reassembly.c   | 233 +++-
 2 files changed, 71 insertions(+), 173 deletions(-)

diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 6ced1e6899b6e..28aa9b30aecea 100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -82,8 +82,15 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue 
*fq)
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
/* Don't send error if the first segment did not arrive. */
-   head = fq->q.fragments;
-   if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!(fq->q.flags & INET_FRAG_FIRST_IN))
+   goto out;
+
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   head = inet_frag_pull_head(&fq->q);
+   if (!head)
goto out;
 
head->dev = dev;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 7c943392c1287..642f9f53b01db 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -69,8 +69,8 @@ static u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 
 static struct inet_frags ip6_frags;
 
-static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
- struct net_device *dev);
+static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *skb,
+ struct sk_buff *prev_tail, struct net_device *dev);
 
 static void ip6_frag_expire(struct timer_list *t)
 {
@@ -111,21 +111,26 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
  struct frag_hdr *fhdr, int nhoff,
  u32 *prob_offset)
 {
-   struct sk_buff *prev, *next;
-   struct net_device *dev;
-   int offset, end, fragsize;
struct net *net = dev_net(skb_dst(skb)->dev);
+   int offset, end, fragsize;
+   struct sk_buff *prev_tail;
+   struct net_device *dev;
+   int err = -ENOENT;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE)
goto err;
 
+   err = -EINVAL;
offset = ntohs(fhdr->frag_off) & ~0x7;
end = offset + (ntohs(ipv6_hdr(skb)->payload_len) -
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
if ((unsigned int)end > IPV6_MAXPLEN) {
*prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
+   /* note that if prob_offset is set, the skb is freed elsewhere,
+* we do not free it here.
+*/
return -1;
}
 
@@ -170,62 +175,27 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
if (end == offset)
goto err;
 
+   err = -ENOMEM;
/* Point into the IP datagram 'data' part. */
if (!pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data))
goto err;
 
-   if (pskb_trim_rcsum(skb, end - offset))
-   goto err;
-
-   /* Find out which fragments are in front and at the back of us
-* in the chain of fragments so far.  We must know where to put
-* this fragment, right?
-*/
-   prev = fq->q.fragments_tail;
-   if (!prev || prev->ip_defrag_offset < offset) {
-   next = NULL;
-   goto found;
-   }
-   prev = NULL;
-   for (next = fq->q.fragments; next != NULL; next = next->next) {
-   if (next->ip_defrag_offset >= offset)
-   break;  /* bingo! */
-   prev = next;
-   }
-
-found:
-   /* RFC5722, Section 4, amended by Errata ID : 3089
-*  When reassembling an IPv6 datagram, if
-*   one or more its constituent fragments is determined to be an
-*   overlapping fragment, the entire datagram (and any constituent
-*   fragments) MUST be silently discarded.
-*/
-
-   /* Check for overlap with preceding fragment. */
-   if (prev &&
-   (prev->ip_defrag_offset + prev->len) > offset)
-   goto disca

[PATCH 4.19 stable 0/3] net: ip6 defrag: backport fixes

2019-04-08 Thread Peter Oskolkov
Currently, 4.19 and earlier stable kernels contain a security fix
that is not fully IPv6 standard compliant.

This patchset backports IPv6 defrag fixes from 5.1rc that restore
standard-compliance.

Original 5.1 patchset: https://patchwork.ozlabs.org/cover/1029418/


John Masinter (captwiggum), could you please confirm that this
patchset fixes the TAHI tests?


Peter Oskolkov (3):
  net: IP defrag: encapsulate rbtree defrag code into callable functions
  net: IP6 defrag: use rbtrees for IPv6 defrag
  net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

 include/net/inet_frag.h |  16 +-
 include/net/ipv6_frag.h |  11 +-
 net/ipv4/inet_fragment.c| 293 
 net/ipv4/ip_fragment.c  | 290 +++
 net/ipv6/netfilter/nf_conntrack_reasm.c | 260 ++---
 net/ipv6/reassembly.c   | 233 +--
 6 files changed, 477 insertions(+), 626 deletions(-)

-- 
2.21.0.392.gf8f6787159e-goog



[PATCH 4.19 stable 3/3] net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

2019-04-08 Thread Peter Oskolkov
[ Upstream commit 997dd96471641e147cb2c33ad54284000d0f5e35 ]

Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IP6 defragmentation in nf_conntrack, removing the 1280 byte
restriction.

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 260 +++-
 1 file changed, 71 insertions(+), 189 deletions(-)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 043ed8eb0ab98..cb1b4772dac0a 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -136,6 +136,9 @@ static void __net_exit 
nf_ct_frags6_sysctl_unregister(struct net *net)
 }
 #endif
 
+static int nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
+
 static inline u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 {
return 1 << (ipv6_get_dsfield(ipv6h) & INET_ECN_MASK);
@@ -177,9 +180,10 @@ static struct frag_queue *fq_find(struct net *net, __be32 
id, u32 user,
 static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
 const struct frag_hdr *fhdr, int nhoff)
 {
-   struct sk_buff *prev, *next;
unsigned int payload_len;
-   int offset, end;
+   struct net_device *dev;
+   struct sk_buff *prev;
+   int offset, end, err;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE) {
@@ -254,55 +258,18 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, 
struct sk_buff *skb,
goto err;
}
 
-   /* Find out which fragments are in front and at the back of us
-* in the chain of fragments so far.  We must know where to put
-* this fragment, right?
-*/
-   prev = fq->q.fragments_tail;
-   if (!prev || prev->ip_defrag_offset < offset) {
-   next = NULL;
-   goto found;
-   }
-   prev = NULL;
-   for (next = fq->q.fragments; next != NULL; next = next->next) {
-   if (next->ip_defrag_offset >= offset)
-   break;  /* bingo! */
-   prev = next;
-   }
-
-found:
-   /* RFC5722, Section 4:
-*  When reassembling an IPv6 datagram, 
if
-*   one or more its constituent fragments is determined to be an
-*   overlapping fragment, the entire datagram (and any constituent
-*   fragments, including those not yet received) MUST be silently
-*   discarded.
-*/
-
-   /* Check for overlap with preceding fragment. */
-   if (prev &&
-   (prev->ip_defrag_offset + prev->len) > offset)
-   goto discard_fq;
-
-   /* Look for overlap with succeeding segment. */
-   if (next && next->ip_defrag_offset < end)
-   goto discard_fq;
-
-   /* Note : skb->ip_defrag_offset and skb->dev share the same location */
-   if (skb->dev)
-   fq->iif = skb->dev->ifindex;
+   /* Note : skb->rbnode and skb->dev share the same location. */
+   dev = skb->dev;
/* Makes sure compiler wont do silly aliasing games */
barrier();
-   skb->ip_defrag_offset = offset;
 
-   /* Insert this fragment in the chain of fragments. */
-   skb->next = next;
-   if (!next)
-   fq->q.fragments_tail = skb;
-   if (prev)
-   prev->next = skb;
-   else
-   fq->q.fragments = skb;
+   prev = fq->q.fragments_tail;
+   err = inet_frag_queue_insert(&fq->q, skb, offset, end);
+   if (err)
+   goto insert_error;
+
+   if (dev)
+   fq->iif = dev->ifindex;
 
fq->q.stamp = skb->tstamp;
fq->q.meat += skb->len;
@@ -319,11 +286,25 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, 
struct sk_buff *skb,
fq->q.flags |= INET_FRAG_FIRST_IN;
}
 
-   return 0;
+   if (fq->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
+   fq->q.meat == fq->q.len) {
+   unsigned long orefdst = skb->_skb_refdst;
+
+   skb->_skb_refdst = 0UL;
+   err = nf_ct_frag6_reasm(fq, skb, prev, dev);
+   skb->_skb_refdst = orefdst;
+   return err;
+   }
+
+   skb_dst_drop(skb);

Re: [PATCH bpf-next] selftests: bpf: add VRF test cases to lwt_ip_encap test.

2019-04-10 Thread Peter Oskolkov
On Wed, Apr 10, 2019 at 6:19 PM David Ahern  wrote:
>
> On 4/3/19 8:43 AM, Peter Oskolkov wrote:
> > This patch adds tests validating that VRF and BPF-LWT
> > encap work together well, as requested by David Ahern.
> >
> > Signed-off-by: Peter Oskolkov 
> > ---
> >  .../selftests/bpf/test_lwt_ip_encap.sh| 134 +++---
> >  1 file changed, 86 insertions(+), 48 deletions(-)
> >
>
> Peter: What OS are you using to run this test script?

Debian Testing with a net-next kernel. What kind of errors do you see?


Re: [PATCH bpf-next] selftests: bpf: add VRF test cases to lwt_ip_encap test.

2019-04-10 Thread Peter Oskolkov
Your test output tells me that everything is OK - see below.

On Wed, Apr 10, 2019 at 9:17 PM David Ahern  wrote:
>
> On 4/10/19 6:26 PM, Peter Oskolkov wrote:
> > On Wed, Apr 10, 2019 at 6:19 PM David Ahern  wrote:
> >>
> >> On 4/3/19 8:43 AM, Peter Oskolkov wrote:
> >>> This patch adds tests validating that VRF and BPF-LWT
> >>> encap work together well, as requested by David Ahern.
> >>>
> >>> Signed-off-by: Peter Oskolkov 
> >>> ---
> >>>  .../selftests/bpf/test_lwt_ip_encap.sh| 134 +++---
> >>>  1 file changed, 86 insertions(+), 48 deletions(-)
> >>>
> >>
> >> Peter: What OS are you using to run this test script?
> >
> > Debian Testing with a net-next kernel. What kind of errors do you see?
> >
>
> This is on Debian Stretch.
>
> 1. nc is not installed
>
> ###
> $ ./test_lwt_ip_encap.sh
> starting egress IPv4 encap test
> nc is not available: skipping TSO tests
> nc is not available: skipping TSO tests
> ping: sendmsg: No route to host
> PASS
> starting egress IPv6 encap test
> nc is not available: skipping TSO tests
> nc is not available: skipping TSO tests
> ping: sendmsg: No route to host
> PASS
> starting ingress IPv4 encap test
> PASS
> starting ingress IPv6 encap test
> PASS
> starting egress IPv4 encap test vrf red
> ping: sendmsg: No route to host
> ping: sendmsg: No route to host
> PASS
> starting egress IPv6 encap test vrf red
> ping: sendmsg: No route to host
> ping: sendmsg: No route to host
> PASS
> starting ingress IPv4 encap test vrf red
> PASS
> starting ingress IPv6 encap test vrf red
> PASS
>
> ###
>
> Notice the "No route to host" errors.

"No route to host" is OK: there are negative tests, as you requested a
couple of months ago... :), and these tests correctly trigger "no
route to host".

This output basically tells me that the test passes, both with and without VRF.

>
>
> 2. install netcat
>
> $ apt-get install netcat
> ...
> ###
> $  ./test_lwt_ip_encap.sh
> starting egress IPv4 encap test
> nc: invalid option -- '4'
> nc -h for help
> bash: connect: Connection refused
> bash: /dev/tcp/172.16.4.100/9000: Connection refused
> test_gso failed: IPv4
> nc: invalid option -- '6'
> nc -h for help
> bash: connect: Connection refused
> bash: /dev/tcp/fb04::1/9000: Connection refused
> test_gso failed: IPv6
> ping: sendmsg: No route to host
> FAIL
> starting egress IPv6 encap test
> nc: invalid option -- '4'
> nc -h for help
> bash: connect: Connection refused
> bash: /dev/tcp/172.16.4.100/9000: Connection refused
> test_gso failed: IPv4
> nc: invalid option -- '6'
> nc -h for help
> bash: connect: Connection refused
> bash: /dev/tcp/fb04::1/9000: Connection refused
> test_gso failed: IPv6
> ping: sendmsg: No route to host
> FAIL
> starting ingress IPv4 encap test
> PASS
> starting ingress IPv6 encap test
> PASS
> starting egress IPv4 encap test vrf red
> ping: sendmsg: No route to host
> ping: sendmsg: No route to host
> PASS
> starting egress IPv6 encap test vrf red
> ping: sendmsg: No route to host
> ping: sendmsg: No route to host
> PASS
> starting ingress IPv4 encap test vrf red
> PASS
> starting ingress IPv6 encap test vrf red
> PASS
> passed tests: 6
> failed tests: 2
>
> ###
>
> so netcat is not the right package. 'apt-cache search netcat' shows
> another package, so try it.

I guess Debian Stretch ships a version of netcat too old to support
the flags used in the test.

>
>
> 3. remove netcat and install netcat-openbsd
>
> ###
>
> $  ./test_lwt_ip_encap.sh
> starting egress IPv4 encap test
> nc: cannot use -s and -l
> bash: connect: Connection refused
> bash: /dev/tcp/172.16.4.100/9000: Connection refused
> test_gso failed: IPv4
> nc: cannot use -s and -l
> bash: connect: Connection refused
> bash: /dev/tcp/fb04::1/9000: Connection refused
> test_gso failed: IPv6
> ping: sendmsg: No route to host
> FAIL
> starting egress IPv6 encap test
> nc: cannot use -s and -l
> bash: connect: Connection refused
> bash: /dev/tcp/172.16.4.100/9000: Connection refused
> test_gso failed: IPv4
> nc: cannot use -s and -l
> bash: connect: Connection refused
> bash: /dev/tcp/fb04::1/9000: Connection refused
> test_gso failed: IPv6
> ping: sendmsg: No route to host
> FAIL
> starting ingress IPv4 encap test
> PASS
> starting ingress IPv6 encap test
> PASS
> starting egress IPv4 encap test vrf red
> ...
>
> ###
>
> still not the right nc command.
>
> This is when I started instrumenting the script.
>
> So really we need the existing (pre-VRF version) to work without errors
> and then add the VRF tests. And the ability to see what is failing is
> important.
>
> Compare the above output to pmtu.sh and fib_tests.sh for example -- and
> the options fib_tests.sh has to help a user when a test fails (verbose
> mode and pause on fail).


[PATCH 4.14 stable 3/5] ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module

2019-04-22 Thread Peter Oskolkov
From: Florian Westphal 

[ Upstream commit 70b095c84326640eeacfd69a411db8fc36e8ab1a ]

IPV6=m
DEFRAG_IPV6=m
CONNTRACK=y yields:

net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get':
net/netfilter/nf_conntrack_proto.c:802: undefined reference to 
`nf_defrag_ipv6_enable'
net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to 
`nf_conntrack_l4proto_icmpv6'

Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params
ip6_frag_init and ip6_expire_frag_queue so it would be needed to force
IPV6=y too.

This patch gets rid of the 'followup linker error' by removing
netfilter ipv6 defrag's dependency on ipv6.ko symbols.

Shared code is placed into a header, then used from both.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/ipv6.h|  29 --
 include/net/ipv6_frag.h   | 104 ++
 net/ieee802154/6lowpan/reassembly.c   |   2 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c   |  17 ++--
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |   3 +-
 net/ipv6/reassembly.c |  92 ++-
 net/openvswitch/conntrack.c   |   1 +
 7 files changed, 126 insertions(+), 122 deletions(-)
 create mode 100644 include/net/ipv6_frag.h

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index fa87a62e9bd3..6294d20a5f0e 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -512,35 +512,6 @@ static inline bool ipv6_prefix_equal(const struct in6_addr 
*addr1,
 }
 #endif
 
-struct inet_frag_queue;
-
-enum ip6_defrag_users {
-   IP6_DEFRAG_LOCAL_DELIVER,
-   IP6_DEFRAG_CONNTRACK_IN,
-   __IP6_DEFRAG_CONNTRACK_IN   = IP6_DEFRAG_CONNTRACK_IN + USHRT_MAX,
-   IP6_DEFRAG_CONNTRACK_OUT,
-   __IP6_DEFRAG_CONNTRACK_OUT  = IP6_DEFRAG_CONNTRACK_OUT + USHRT_MAX,
-   IP6_DEFRAG_CONNTRACK_BRIDGE_IN,
-   __IP6_DEFRAG_CONNTRACK_BRIDGE_IN = IP6_DEFRAG_CONNTRACK_BRIDGE_IN + 
USHRT_MAX,
-};
-
-void ip6_frag_init(struct inet_frag_queue *q, const void *a);
-extern const struct rhashtable_params ip6_rhash_params;
-
-/*
- * Equivalent of ipv4 struct ip
- */
-struct frag_queue {
-   struct inet_frag_queue  q;
-
-   int iif;
-   unsigned intcsum;
-   __u16   nhoffset;
-   u8  ecn;
-};
-
-void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq);
-
 static inline bool ipv6_addr_any(const struct in6_addr *a)
 {
 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
new file mode 100644
index ..6ced1e6899b6
--- /dev/null
+++ b/include/net/ipv6_frag.h
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _IPV6_FRAG_H
+#define _IPV6_FRAG_H
+#include 
+#include 
+#include 
+#include 
+
+enum ip6_defrag_users {
+   IP6_DEFRAG_LOCAL_DELIVER,
+   IP6_DEFRAG_CONNTRACK_IN,
+   __IP6_DEFRAG_CONNTRACK_IN   = IP6_DEFRAG_CONNTRACK_IN + USHRT_MAX,
+   IP6_DEFRAG_CONNTRACK_OUT,
+   __IP6_DEFRAG_CONNTRACK_OUT  = IP6_DEFRAG_CONNTRACK_OUT + USHRT_MAX,
+   IP6_DEFRAG_CONNTRACK_BRIDGE_IN,
+   __IP6_DEFRAG_CONNTRACK_BRIDGE_IN = IP6_DEFRAG_CONNTRACK_BRIDGE_IN + 
USHRT_MAX,
+};
+
+/*
+ * Equivalent of ipv4 struct ip
+ */
+struct frag_queue {
+   struct inet_frag_queue  q;
+
+   int iif;
+   __u16   nhoffset;
+   u8  ecn;
+};
+
+#if IS_ENABLED(CONFIG_IPV6)
+static inline void ip6frag_init(struct inet_frag_queue *q, const void *a)
+{
+   struct frag_queue *fq = container_of(q, struct frag_queue, q);
+   const struct frag_v6_compare_key *key = a;
+
+   q->key.v6 = *key;
+   fq->ecn = 0;
+}
+
+static inline u32 ip6frag_key_hashfn(const void *data, u32 len, u32 seed)
+{
+   return jhash2(data,
+ sizeof(struct frag_v6_compare_key) / sizeof(u32), seed);
+}
+
+static inline u32 ip6frag_obj_hashfn(const void *data, u32 len, u32 seed)
+{
+   const struct inet_frag_queue *fq = data;
+
+   return jhash2((const u32 *)&fq->key.v6,
+ sizeof(struct frag_v6_compare_key) / sizeof(u32), seed);
+}
+
+static inline int
+ip6frag_obj_cmpfn(struct rhashtable_compare_arg *arg, const void *ptr)
+{
+   const struct frag_v6_compare_key *key = arg->key;
+   const struct inet_frag_queue *fq = ptr;
+
+   return !!memcmp(&fq->key, key, sizeof(*key));
+}
+
+static inline void
+ip6frag_expire_frag_queue(struct net *net, struct frag_queue *fq)
+{
+   struct net_device *dev = NULL;
+   struct sk_buff *head;
+
+   rcu_read_lock();
+   spin_lock(&fq->q.lock);
+
+   if (fq->q.flags & INET_FRAG_COMPLETE)
+   goto out;
+
+   inet_frag_kill(&fq->q);
+
+   dev = dev_get_by_index_rcu(net, fq->iif);
+   if (!dev)
+   goto out;
+
+   __IP6_INC_STATS(net, _

[PATCH 4.14 stable 2/5] net: IP defrag: encapsulate rbtree defrag code into callable functions

2019-04-22 Thread Peter Oskolkov
[ Upstream commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f ]

This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.

Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Cc: Tom Herbert 
Signed-off-by: David S. Miller 
---
 include/net/inet_frag.h  |  16 ++-
 net/ipv4/inet_fragment.c | 293 +++
 net/ipv4/ip_fragment.c   | 290 --
 3 files changed, 335 insertions(+), 264 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 335cf7851f12..008f64823c41 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -77,8 +77,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;  /* Used in IPv6. */
-   struct rb_root  rb_fragments; /* Used in IPv4. */
+   struct sk_buff  *fragments;  /* used in 6lopwpan IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4/IPv6. */
struct sk_buff  *fragments_tail;
struct sk_buff  *last_run_head;
ktime_t stamp;
@@ -153,4 +153,16 @@ static inline void add_frag_mem_limit(struct netns_frags 
*nf, long val)
 
 extern const u8 ip_frag_ecn_table[16];
 
+/* Return values of inet_frag_queue_insert() */
+#define IPFRAG_OK  0
+#define IPFRAG_DUP 1
+#define IPFRAG_OVERLAP 2
+int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
+  int offset, int end);
+void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
+ struct sk_buff *parent);
+void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
+   void *reasm_data);
+struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
+
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 6ffee9d2b0e5..481cded81b2d 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -24,6 +24,62 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   union {
+   struct inet_skb_parmh4;
+   struct inet6_skb_parm   h6;
+   };
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void fragcb_clear(struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void fragrun_append_to_last(struct inet_frag_queue *q,
+  struct sk_buff *skb)
+{
+   fragcb_clear(skb);
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void fragrun_create(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+   fragcb_clear(skb);
+
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+&q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
 
 /* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
  * Value : 0xff if frame should be dropped.
@@ -122,6 +178,28 @@ static void inet_frag_destroy_rcu(struct rcu_head *head)
kmem_cache_free(f->frags_cachep, q);
 }
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+   struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
+
+   while (p) {
+   struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+   p = rb_next(p);
+   rb_erase(&skb->rbnode, root);
+   while (skb) {
+   struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+   sum += skb->truesize;
+   kfree_skb(skb);

[PATCH 4.14 stable 0/5] net: ip6 defrag: backport fixes

2019-04-22 Thread Peter Oskolkov
This is a backport of a 5.1rc patchset:
  https://patchwork.ozlabs.org/cover/1029418/

Which was backported into 4.19:
  https://patchwork.ozlabs.org/cover/1081619/

I had to backport two additional patches into 4.14 to make it work.


John Masinter (captwiggum), could you please confirm that this
patchset fixes the TAHI tests? (I'm reasonably certain that it does, as
I ran the ip_defrag selftest, but given the amount of changes here,
another set of completed tests would be nice to have).


Eric Dumazet (1):
  ipv6: frags: fix a lockdep false positive

Florian Westphal (1):
  ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module

Peter Oskolkov (3):
  net: IP defrag: encapsulate rbtree defrag code into callable functions
  net: IP6 defrag: use rbtrees for IPv6 defrag
  net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

 include/net/inet_frag.h   |  16 +-
 include/net/ipv6.h|  29 --
 include/net/ipv6_frag.h   | 111 +++
 net/ieee802154/6lowpan/reassembly.c   |   2 +-
 net/ipv4/inet_fragment.c  | 293 ++
 net/ipv4/ip_fragment.c| 290 ++
 net/ipv6/netfilter/nf_conntrack_reasm.c   | 279 +
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |   3 +-
 net/ipv6/reassembly.c | 357 +-
 net/openvswitch/conntrack.c   |   1 +
 10 files changed, 616 insertions(+), 765 deletions(-)
 create mode 100644 include/net/ipv6_frag.h

-- 
2.21.0.593.g511ec345e18-goog



[PATCH 4.14 stable 4/5] net: IP6 defrag: use rbtrees for IPv6 defrag

2019-04-22 Thread Peter Oskolkov
[ Upstream commit d4289fcc9b16b89619ee1c54f829e05e56de8b9a ]

Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IPv6, removing the 1280 byte restriction.

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 include/net/ipv6_frag.h |  11 +-
 net/ipv6/reassembly.c   | 242 +++-
 2 files changed, 73 insertions(+), 180 deletions(-)

diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 6ced1e6899b6..28aa9b30aece 100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -82,8 +82,15 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue 
*fq)
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
/* Don't send error if the first segment did not arrive. */
-   head = fq->q.fragments;
-   if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!(fq->q.flags & INET_FRAG_FIRST_IN))
+   goto out;
+
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   head = inet_frag_pull_head(&fq->q);
+   if (!head)
goto out;
 
head->dev = dev;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e5ab3b7813d6..6e452356ed45 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -62,13 +62,6 @@
 
 static const char ip6_frag_cache_name[] = "ip6-frags";
 
-struct ip6frag_skb_cb {
-   struct inet6_skb_parm   h;
-   int offset;
-};
-
-#define FRAG6_CB(skb)  ((struct ip6frag_skb_cb *)((skb)->cb))
-
 static u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 {
return 1 << (ipv6_get_dsfield(ipv6h) & INET_ECN_MASK);
@@ -76,8 +69,8 @@ static u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 
 static struct inet_frags ip6_frags;
 
-static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
- struct net_device *dev);
+static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *skb,
+ struct sk_buff *prev_tail, struct net_device *dev);
 
 static void ip6_frag_expire(struct timer_list *t)
 {
@@ -118,21 +111,26 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
  struct frag_hdr *fhdr, int nhoff,
  u32 *prob_offset)
 {
-   struct sk_buff *prev, *next;
-   struct net_device *dev;
-   int offset, end, fragsize;
struct net *net = dev_net(skb_dst(skb)->dev);
+   int offset, end, fragsize;
+   struct sk_buff *prev_tail;
+   struct net_device *dev;
+   int err = -ENOENT;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE)
goto err;
 
+   err = -EINVAL;
offset = ntohs(fhdr->frag_off) & ~0x7;
end = offset + (ntohs(ipv6_hdr(skb)->payload_len) -
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
if ((unsigned int)end > IPV6_MAXPLEN) {
*prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
+   /* note that if prob_offset is set, the skb is freed elsewhere,
+* we do not free it here.
+*/
return -1;
}
 
@@ -177,62 +175,28 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
if (end == offset)
goto err;
 
+   err = -ENOMEM;
/* Point into the IP datagram 'data' part. */
if (!pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data))
goto err;
 
-   if (pskb_trim_rcsum(skb, end - offset))
-   goto err;
-
-   /* Find out which fragments are in front and at the back of us
-* in the chain of fragments so far.  We must know where to put
-* this fragment, right?
-*/
-   prev = fq->q.fragments_tail;
-   if (!prev || FRAG6_CB(prev)->offset < offset) {
-   next = NULL;
-   goto found;
-   }
-   prev = NULL;
-   for (next = fq->q.fragments; next != NULL; next = next->next) {
-   if (FRAG6_CB(next)->offset >= offset)
-   break;  /* bingo! */
-   prev = next;
-   }
-
-found:
-   /* RFC5722, Section 4, amended by Errata ID : 3089
-*  When reassembling an IPv6 datagram, if
-

[PATCH 4.14 stable 5/5] net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

2019-04-22 Thread Peter Oskolkov
[ Upstream commit 997dd96471641e147cb2c33ad54284000d0f5e35 ]

Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IP6 defragmentation in nf_conntrack, removing the 1280 byte
restriction.

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 262 +++-
 1 file changed, 72 insertions(+), 190 deletions(-)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0568d49b5da4..cb1b4772dac0 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -51,14 +51,6 @@
 
 static const char nf_frags_cache_name[] = "nf-frags";
 
-struct nf_ct_frag6_skb_cb
-{
-   struct inet6_skb_parm   h;
-   int offset;
-};
-
-#define NFCT_FRAG6_CB(skb) ((struct nf_ct_frag6_skb_cb *)((skb)->cb))
-
 static struct inet_frags nf_frags;
 
 #ifdef CONFIG_SYSCTL
@@ -144,6 +136,9 @@ static void __net_exit 
nf_ct_frags6_sysctl_unregister(struct net *net)
 }
 #endif
 
+static int nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
+
 static inline u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 {
return 1 << (ipv6_get_dsfield(ipv6h) & INET_ECN_MASK);
@@ -185,9 +180,10 @@ static struct frag_queue *fq_find(struct net *net, __be32 
id, u32 user,
 static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
 const struct frag_hdr *fhdr, int nhoff)
 {
-   struct sk_buff *prev, *next;
unsigned int payload_len;
-   int offset, end;
+   struct net_device *dev;
+   struct sk_buff *prev;
+   int offset, end, err;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE) {
@@ -262,55 +258,19 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, 
struct sk_buff *skb,
goto err;
}
 
-   /* Find out which fragments are in front and at the back of us
-* in the chain of fragments so far.  We must know where to put
-* this fragment, right?
-*/
+   /* Note : skb->rbnode and skb->dev share the same location. */
+   dev = skb->dev;
+   /* Makes sure compiler wont do silly aliasing games */
+   barrier();
+
prev = fq->q.fragments_tail;
-   if (!prev || NFCT_FRAG6_CB(prev)->offset < offset) {
-   next = NULL;
-   goto found;
-   }
-   prev = NULL;
-   for (next = fq->q.fragments; next != NULL; next = next->next) {
-   if (NFCT_FRAG6_CB(next)->offset >= offset)
-   break;  /* bingo! */
-   prev = next;
-   }
+   err = inet_frag_queue_insert(&fq->q, skb, offset, end);
+   if (err)
+   goto insert_error;
 
-found:
-   /* RFC5722, Section 4:
-*  When reassembling an IPv6 datagram, 
if
-*   one or more its constituent fragments is determined to be an
-*   overlapping fragment, the entire datagram (and any constituent
-*   fragments, including those not yet received) MUST be silently
-*   discarded.
-*/
+   if (dev)
+   fq->iif = dev->ifindex;
 
-   /* Check for overlap with preceding fragment. */
-   if (prev &&
-   (NFCT_FRAG6_CB(prev)->offset + prev->len) > offset)
-   goto discard_fq;
-
-   /* Look for overlap with succeeding segment. */
-   if (next && NFCT_FRAG6_CB(next)->offset < end)
-   goto discard_fq;
-
-   NFCT_FRAG6_CB(skb)->offset = offset;
-
-   /* Insert this fragment in the chain of fragments. */
-   skb->next = next;
-   if (!next)
-   fq->q.fragments_tail = skb;
-   if (prev)
-   prev->next = skb;
-   else
-   fq->q.fragments = skb;
-
-   if (skb->dev) {
-   fq->iif = skb->dev->ifindex;
-   skb->dev = NULL;
-   }
fq->q.stamp = skb->tstamp;
fq->q.meat += skb->len;
fq->ecn |= ecn;
@@ -326,11 +286,25 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, 
struct sk_buff *skb,
fq->q.flags |= INET_FRAG_FIRST_IN;
}
 
-   return 0;
+   if (fq->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&

[PATCH 4.14 stable 1/5] ipv6: frags: fix a lockdep false positive

2019-04-22 Thread Peter Oskolkov
From: Eric Dumazet 

[ Upstream commit 415787d7799f4fccbe8d49cb0b8e5811be6b0389 ]

lockdep does not know that the locks used by IPv4 defrag
and IPv6 reassembly units are of different classes.

It complains because of the following chains:

1) sch_direct_xmit()(lock txq->_xmit_lock)
dev_hard_start_xmit()
 xmit_one()
  dev_queue_xmit_nit()
   packet_rcv_fanout()
ip_check_defrag()
 ip_defrag()
  spin_lock() (lock frag queue spinlock)

2) ip6_input_finish()
ipv6_frag_rcv()   (lock frag queue spinlock)
 ip6_frag_queue()
  icmpv6_param_prob() (lock txq->_xmit_lock at some point)

We could add lockdep annotations, but we also can make sure IPv6
calls icmpv6_param_prob() only after the release of the frag queue spinlock,
since this naturally makes frag queue spinlock a leaf in lock hierarchy.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
---
 net/ipv6/reassembly.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 2a8c680b67cd..f75e9e711c31 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -170,7 +170,8 @@ fq_find(struct net *net, __be32 id, const struct ipv6hdr 
*hdr, int iif)
 }
 
 static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
-  struct frag_hdr *fhdr, int nhoff)
+ struct frag_hdr *fhdr, int nhoff,
+ u32 *prob_offset)
 {
struct sk_buff *prev, *next;
struct net_device *dev;
@@ -186,11 +187,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
if ((unsigned int)end > IPV6_MAXPLEN) {
-   __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-   IPSTATS_MIB_INHDRERRORS);
-   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
- ((u8 *)&fhdr->frag_off -
-  skb_network_header(skb)));
+   *prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
return -1;
}
 
@@ -221,10 +218,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
/* RFC2460 says always send parameter problem in
 * this case. -DaveM
 */
-   __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-   IPSTATS_MIB_INHDRERRORS);
-   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
- offsetof(struct ipv6hdr, payload_len));
+   *prob_offset = offsetof(struct ipv6hdr, payload_len);
return -1;
}
if (end > fq->q.len) {
@@ -536,15 +530,22 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
+   u32 prob_offset = 0;
int ret;
 
spin_lock(&fq->q.lock);
 
fq->iif = iif;
-   ret = ip6_frag_queue(fq, skb, fhdr, IP6CB(skb)->nhoff);
+   ret = ip6_frag_queue(fq, skb, fhdr, IP6CB(skb)->nhoff,
+&prob_offset);
 
spin_unlock(&fq->q.lock);
inet_frag_put(&fq->q);
+   if (prob_offset) {
+   __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+   IPSTATS_MIB_INHDRERRORS);
+   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, prob_offset);
+   }
return ret;
}
 
-- 
2.21.0.593.g511ec345e18-goog



Re: [PATCH 4.14 stable 2/5] net: IP defrag: encapsulate rbtree defrag code into callable functions

2019-04-23 Thread Peter Oskolkov
On Tue, Apr 23, 2019 at 5:07 AM Lars Persson  wrote:
>
> On Tue, Apr 23, 2019 at 12:29 AM Peter Oskolkov  wrote:
> >
> > [ Upstream commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f ]
> >
> > This is a refactoring patch: without changing runtime behavior,
> > it moves rbtree-related code from IPv4-specific files/functions
> > into .h/.c defrag files shared with IPv6 defragmentation code.
> >
> > Signed-off-by: Peter Oskolkov 
> > Cc: Eric Dumazet 
> > Cc: Florian Westphal 
> > Cc: Tom Herbert 
> > Signed-off-by: David S. Miller 
> > ---
> >  include/net/inet_frag.h  |  16 ++-
> >  net/ipv4/inet_fragment.c | 293 +++
> >  net/ipv4/ip_fragment.c   | 290 --
> >  3 files changed, 335 insertions(+), 264 deletions(-)
> >
> Hi
>
> We get a compile error with gcc 8.2 after applying this patch:
>  net/ipv4/ip_fragment.c: In function 'ip_frag_queue':
>  net/ipv4/ip_fragment.c:390:1: error: label 'discard_qp' defined but
> not used [-Werror=unused-label]
>   discard_qp:

Thanks for the report: I'll send a v2.


[PATCH 4.19 stable v2 0/3] net: ip6 defrag: backport fixes

2019-04-23 Thread Peter Oskolkov
Lars Persson  reported that a label was unused in
the 4.14 version of this patchset, and the issue was present in
the 4.19 patchset as well, so I'm sending a v2 that fixes it.

The original 4.19 patchset queued for stable is OK, and
can be used as is, but this v2 is a bit better: it fixes the
unused label issue and handles overlapping fragments better.

Sorry for the mess/v2.

===

Currently, 4.19 and earlier stable kernels contain a security fix
that is not fully IPv6 standard compliant.

This patchset backports IPv6 defrag fixes from 5.1rc that restore
standard-compliance.

Original 5.1 patchset: https://patchwork.ozlabs.org/cover/1029418/

v2 changes: handle overlapping fragments the way it is done upstream


Peter Oskolkov (3):
  net: IP defrag: encapsulate rbtree defrag code into callable functions
  net: IP6 defrag: use rbtrees for IPv6 defrag
  net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

 include/net/inet_frag.h |  16 +-
 include/net/ipv6_frag.h |  11 +-
 net/ipv4/inet_fragment.c| 293 +++
 net/ipv4/ip_fragment.c  | 302 +++-
 net/ipv6/netfilter/nf_conntrack_reasm.c | 260 ++--
 net/ipv6/reassembly.c   | 240 ++-
 6 files changed, 488 insertions(+), 634 deletions(-)

-- 
2.21.0.593.g511ec345e18-goog



[PATCH 4.19 stable v2 1/3] net: IP defrag: encapsulate rbtree defrag code into callable functions

2019-04-23 Thread Peter Oskolkov
[ Upstream commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f ]

This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.

v2: make handling of overlapping packets match upstream.
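
The net effect is a small shared API. A simplified sketch of how a
protocol's queueing path is expected to drive it (names follow the
declarations added to inet_frag.h below; error paths are abbreviated,
and all_fragments_arrived() is a hypothetical stand-in for the usual
FIRST_IN/LAST_IN/meat completeness test):

	static int proto_frag_queue(struct inet_frag_queue *q,
				    struct sk_buff *skb, int offset, int end)
	{
		struct sk_buff *prev_tail = q->fragments_tail;
		void *reasm_data;
		int err;

		err = inet_frag_queue_insert(q, skb, offset, end);
		if (err == IPFRAG_DUP) {
			/* exact duplicate: drop only this fragment */
			kfree_skb(skb);
			return -EINVAL;
		}
		if (err == IPFRAG_OVERLAP) {
			/* RFC 5722: an overlap poisons the whole datagram */
			inet_frag_kill(q);
			kfree_skb(skb);
			return -EINVAL;
		}

		if (all_fragments_arrived(q)) {	/* hypothetical predicate */
			reasm_data = inet_frag_reasm_prepare(q, skb, prev_tail);
			if (!reasm_data)
				return -ENOMEM;
			inet_frag_reasm_finish(q, skb, reasm_data);
		}
		return 0;
	}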

Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Cc: Tom Herbert 
Signed-off-by: David S. Miller 
---
 include/net/inet_frag.h  |  16 ++-
 net/ipv4/inet_fragment.c | 293 +
 net/ipv4/ip_fragment.c   | 302 +--
 3 files changed, 342 insertions(+), 269 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 1662cbc0b46b..b02bf737d019 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -77,8 +77,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;  /* Used in IPv6. */
-   struct rb_root  rb_fragments; /* Used in IPv4. */
+   struct sk_buff  *fragments;  /* used in 6lopwpan IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4/IPv6. */
struct sk_buff  *fragments_tail;
struct sk_buff  *last_run_head;
ktime_t stamp;
@@ -153,4 +153,16 @@ static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
 
 extern const u8 ip_frag_ecn_table[16];
 
+/* Return values of inet_frag_queue_insert() */
+#define IPFRAG_OK  0
+#define IPFRAG_DUP 1
+#define IPFRAG_OVERLAP 2
+int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
+  int offset, int end);
+void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
+ struct sk_buff *parent);
+void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
+   void *reasm_data);
+struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
+
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 760a9e52e02b..9f69411251d0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -25,6 +25,62 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   union {
+   struct inet_skb_parmh4;
+   struct inet6_skb_parm   h6;
+   };
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void fragcb_clear(struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void fragrun_append_to_last(struct inet_frag_queue *q,
+  struct sk_buff *skb)
+{
+   fragcb_clear(skb);
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void fragrun_create(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+   fragcb_clear(skb);
+
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+&q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
 
 /* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
  * Value : 0xff if frame should be dropped.
@@ -123,6 +179,28 @@ static void inet_frag_destroy_rcu(struct rcu_head *head)
kmem_cache_free(f->frags_cachep, q);
 }
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+   struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
+
+   while (p) {
+   struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+   p = rb_next(p);
+   rb_erase(&skb->rbnode, root);
+   while (skb) {
+   struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+ 

[PATCH 4.19 stable v2 2/3] net: IP6 defrag: use rbtrees for IPv6 defrag

2019-04-23 Thread Peter Oskolkov
[ Upstream commit d4289fcc9b16b89619ee1c54f829e05e56de8b9a ]

Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IPv6, removing the 1280 byte restriction.

v2: change handling of overlaps to match that of upstream.
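
For context, the restriction being lifted boils down to a receive-time
check of roughly this shape (paraphrased from commit 0ed4229b08c1, so
treat the exact expression as approximate; IP6_MF set means more
fragments follow):

	/* pre-patch policy: drop any non-last fragment shorter than the
	 * IPv6 minimum MTU of 1280 bytes -- a check no RFC requires */
	if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
	    fhdr->frag_off & htons(IP6_MF))
		goto fail_hdr;	/* fragment silently rejected */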

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 include/net/ipv6_frag.h |  11 +-
 net/ipv6/reassembly.c   | 240 +++-
 2 files changed, 75 insertions(+), 176 deletions(-)

diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 6ced1e6899b6..28aa9b30aece 100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -82,8 +82,15 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue *fq)
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
/* Don't send error if the first segment did not arrive. */
-   head = fq->q.fragments;
-   if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!(fq->q.flags & INET_FRAG_FIRST_IN))
+   goto out;
+
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   head = inet_frag_pull_head(&fq->q);
+   if (!head)
goto out;
 
head->dev = dev;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 7c943392c128..095825f964e2 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -69,8 +69,8 @@ static u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 
 static struct inet_frags ip6_frags;
 
-static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
- struct net_device *dev);
+static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *skb,
+ struct sk_buff *prev_tail, struct net_device *dev);
 
 static void ip6_frag_expire(struct timer_list *t)
 {
@@ -111,21 +111,26 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
  struct frag_hdr *fhdr, int nhoff,
  u32 *prob_offset)
 {
-   struct sk_buff *prev, *next;
-   struct net_device *dev;
-   int offset, end, fragsize;
struct net *net = dev_net(skb_dst(skb)->dev);
+   int offset, end, fragsize;
+   struct sk_buff *prev_tail;
+   struct net_device *dev;
+   int err = -ENOENT;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE)
goto err;
 
+   err = -EINVAL;
offset = ntohs(fhdr->frag_off) & ~0x7;
end = offset + (ntohs(ipv6_hdr(skb)->payload_len) -
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
if ((unsigned int)end > IPV6_MAXPLEN) {
*prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
+   /* note that if prob_offset is set, the skb is freed elsewhere,
+* we do not free it here.
+*/
return -1;
}
 
@@ -145,7 +150,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 */
if (end < fq->q.len ||
((fq->q.flags & INET_FRAG_LAST_IN) && end != fq->q.len))
-   goto err;
+   goto discard_fq;
fq->q.flags |= INET_FRAG_LAST_IN;
fq->q.len = end;
} else {
@@ -162,70 +167,35 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
if (end > fq->q.len) {
/* Some bits beyond end -> corruption. */
if (fq->q.flags & INET_FRAG_LAST_IN)
-   goto err;
+   goto discard_fq;
fq->q.len = end;
}
}
 
if (end == offset)
-   goto err;
+   goto discard_fq;
 
+   err = -ENOMEM;
/* Point into the IP datagram 'data' part. */
if (!pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data))
-   goto err;
-
-   if (pskb_trim_rcsum(skb, end - offset))
-   goto err;
-
-   /* Find out which fragments are in front and at the back of us
-* in the chain of fragments so far.  We must know where to put
-* this fragment, right?
-*/
-   prev = fq->q.fragments_tail;
-   if (!prev || prev->ip_defrag_offse

[PATCH 4.19 stable v2 3/3] net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

2019-04-23 Thread Peter Oskolkov
[ Upstream commit 997dd96471641e147cb2c33ad54284000d0f5e35 ]

Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IP6 defragmentation in nf_conntrack, removing the 1280 byte
restriction.
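
One detail worth spelling out: sk_buff::dev and sk_buff::rbnode occupy
the same storage, so linking the skb into the rb-tree clobbers skb->dev.
The queueing path therefore reads the device out first; a sketch of the
idiom used in the hunks below:

	struct net_device *dev;

	/* skb->rbnode and skb->dev are unionized: save dev before the
	 * rb-tree insertion overwrites it */
	dev = skb->dev;
	barrier();	/* keep the compiler from reordering the read */

	err = inet_frag_queue_insert(&fq->q, skb, offset, end);
	if (err)
		goto insert_error;

	if (dev)
		fq->iif = dev->ifindex;	/* the saved copy, not skb->dev */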

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 260 +++-
 1 file changed, 71 insertions(+), 189 deletions(-)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 043ed8eb0ab9..cb1b4772dac0 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -136,6 +136,9 @@ static void __net_exit nf_ct_frags6_sysctl_unregister(struct net *net)
 }
 #endif
 
+static int nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
+
 static inline u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 {
return 1 << (ipv6_get_dsfield(ipv6h) & INET_ECN_MASK);
@@ -177,9 +180,10 @@ static struct frag_queue *fq_find(struct net *net, __be32 id, u32 user,
 static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
 const struct frag_hdr *fhdr, int nhoff)
 {
-   struct sk_buff *prev, *next;
unsigned int payload_len;
-   int offset, end;
+   struct net_device *dev;
+   struct sk_buff *prev;
+   int offset, end, err;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE) {
@@ -254,55 +258,18 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
goto err;
}
 
-   /* Find out which fragments are in front and at the back of us
-* in the chain of fragments so far.  We must know where to put
-* this fragment, right?
-*/
-   prev = fq->q.fragments_tail;
-   if (!prev || prev->ip_defrag_offset < offset) {
-   next = NULL;
-   goto found;
-   }
-   prev = NULL;
-   for (next = fq->q.fragments; next != NULL; next = next->next) {
-   if (next->ip_defrag_offset >= offset)
-   break;  /* bingo! */
-   prev = next;
-   }
-
-found:
-   /* RFC5722, Section 4:
-*  When reassembling an IPv6 datagram, if
-*   one or more its constituent fragments is determined to be an
-*   overlapping fragment, the entire datagram (and any constituent
-*   fragments, including those not yet received) MUST be silently
-*   discarded.
-*/
-
-   /* Check for overlap with preceding fragment. */
-   if (prev &&
-   (prev->ip_defrag_offset + prev->len) > offset)
-   goto discard_fq;
-
-   /* Look for overlap with succeeding segment. */
-   if (next && next->ip_defrag_offset < end)
-   goto discard_fq;
-
-   /* Note : skb->ip_defrag_offset and skb->dev share the same location */
-   if (skb->dev)
-   fq->iif = skb->dev->ifindex;
+   /* Note : skb->rbnode and skb->dev share the same location. */
+   dev = skb->dev;
/* Makes sure compiler wont do silly aliasing games */
barrier();
-   skb->ip_defrag_offset = offset;
 
-   /* Insert this fragment in the chain of fragments. */
-   skb->next = next;
-   if (!next)
-   fq->q.fragments_tail = skb;
-   if (prev)
-   prev->next = skb;
-   else
-   fq->q.fragments = skb;
+   prev = fq->q.fragments_tail;
+   err = inet_frag_queue_insert(&fq->q, skb, offset, end);
+   if (err)
+   goto insert_error;
+
+   if (dev)
+   fq->iif = dev->ifindex;
 
fq->q.stamp = skb->tstamp;
fq->q.meat += skb->len;
@@ -319,11 +286,25 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
fq->q.flags |= INET_FRAG_FIRST_IN;
}
 
-   return 0;
+   if (fq->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
+   fq->q.meat == fq->q.len) {
+   unsigned long orefdst = skb->_skb_refdst;
+
+   skb->_skb_refdst = 0UL;
+   err = nf_ct_frag6_reasm(fq, skb, prev, dev);
+   skb->_skb_refdst = orefdst;
+   return err;
+   }
+
+   skb_dst_drop(skb);

[PATCH 4.14 stable v2 0/5] net: ip6 defrag: backport fixes

2019-04-23 Thread Peter Oskolkov
Lars Persson  reported that a label was unused in
the previous version of this patchset, so I'm sending a v2 that fixes it.

Sorry for the mess/v2.

v2 changes: handle overlapping fragments the way it is done upstream.

This is a backport of a 5.1rc patchset:
  https://patchwork.ozlabs.org/cover/1029418/

Which was backported into 4.19:
  https://patchwork.ozlabs.org/cover/1081619/

I had to backport two additional patches into 4.14 to make it work.


John Masinter (captwiggum), could you please confirm that this
patchset fixes the TAHI tests? (I'm reasonably certain that it does, as
I ran the ip_defrag selftest, but given the number of changes here,
another set of completed tests would be nice to have.)


Eric Dumazet (1):
  ipv6: frags: fix a lockdep false positive

Florian Westphal (1):
  ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module

Peter Oskolkov (3):
  net: IP defrag: encapsulate rbtree defrag code into callable functions
  net: IP6 defrag: use rbtrees for IPv6 defrag
  net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

 include/net/inet_frag.h   |  16 +-
 include/net/ipv6.h|  29 --
 include/net/ipv6_frag.h   | 111 +++
 net/ieee802154/6lowpan/reassembly.c   |   2 +-
 net/ipv4/inet_fragment.c  | 293 +
 net/ipv4/ip_fragment.c| 302 +++---
 net/ipv6/netfilter/nf_conntrack_reasm.c   | 279 +
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |   3 +-
 net/ipv6/reassembly.c | 364 ++
 net/openvswitch/conntrack.c   |   1 +
 10 files changed, 627 insertions(+), 773 deletions(-)
 create mode 100644 include/net/ipv6_frag.h

-- 
2.21.0.593.g511ec345e18-goog



[PATCH 4.14 stable v2 1/5] ipv6: frags: fix a lockdep false positive

2019-04-23 Thread Peter Oskolkov
From: Eric Dumazet 

[ Upstream commit 415787d7799f4fccbe8d49cb0b8e5811be6b0389 ]

lockdep does not know that the locks used by IPv4 defrag
and IPv6 reassembly units are of different classes.

It complains because of following chains :

1) sch_direct_xmit()(lock txq->_xmit_lock)
dev_hard_start_xmit()
 xmit_one()
  dev_queue_xmit_nit()
   packet_rcv_fanout()
ip_check_defrag()
 ip_defrag()
  spin_lock() (lock frag queue spinlock)

2) ip6_input_finish()
ipv6_frag_rcv()   (lock frag queue spinlock)
 ip6_frag_queue()
  icmpv6_param_prob() (lock txq->_xmit_lock at some point)

We could add lockdep annotations, but we also can make sure IPv6
calls icmpv6_param_prob() only after the release of the frag queue spinlock,
since this naturally makes frag queue spinlock a leaf in lock hierarchy.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
---
 net/ipv6/reassembly.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 2a8c680b67cd..f75e9e711c31 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -170,7 +170,8 @@ fq_find(struct net *net, __be32 id, const struct ipv6hdr *hdr, int iif)
 }
 
 static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
-  struct frag_hdr *fhdr, int nhoff)
+ struct frag_hdr *fhdr, int nhoff,
+ u32 *prob_offset)
 {
struct sk_buff *prev, *next;
struct net_device *dev;
@@ -186,11 +187,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
if ((unsigned int)end > IPV6_MAXPLEN) {
-   __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-   IPSTATS_MIB_INHDRERRORS);
-   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
- ((u8 *)&fhdr->frag_off -
-  skb_network_header(skb)));
+   *prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
return -1;
}
 
@@ -221,10 +218,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
/* RFC2460 says always send parameter problem in
 * this case. -DaveM
 */
-   __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-   IPSTATS_MIB_INHDRERRORS);
-   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
- offsetof(struct ipv6hdr, payload_len));
+   *prob_offset = offsetof(struct ipv6hdr, payload_len);
return -1;
}
if (end > fq->q.len) {
@@ -536,15 +530,22 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
+   u32 prob_offset = 0;
int ret;
 
spin_lock(&fq->q.lock);
 
fq->iif = iif;
-   ret = ip6_frag_queue(fq, skb, fhdr, IP6CB(skb)->nhoff);
+   ret = ip6_frag_queue(fq, skb, fhdr, IP6CB(skb)->nhoff,
+&prob_offset);
 
spin_unlock(&fq->q.lock);
inet_frag_put(&fq->q);
+   if (prob_offset) {
+   __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+   IPSTATS_MIB_INHDRERRORS);
+   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, prob_offset);
+   }
return ret;
}
 
-- 
2.21.0.593.g511ec345e18-goog



[PATCH 4.14 stable v2 2/5] net: IP defrag: encapsulate rbtree defrag code into callable functions

2019-04-23 Thread Peter Oskolkov
[ Upstream commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f ]

This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.

v2: make handling of overlapping packets match upstream.
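
A worked example of the rb-tree "runs" maintained below may help:
fragments that arrive exactly at the tail of the previous run are
chained behind a single tree node instead of getting nodes of their own
(offsets and sizes are illustrative):

	/* Fragments arrive at offsets 0, 1200 and 2400, 1200 bytes each.
	 * The tree then holds ONE node -- the skb at offset 0 -- with:
	 *
	 *   FRAG_CB(head)->frag_run_len == 3600
	 *   head ->next_frag-> skb@1200 ->next_frag-> skb@2400 -> NULL
	 *
	 * A 600-byte fragment at offset 4800 is NOT adjacent (bytes
	 * 3600..4799 are missing), so fragrun_create() links it as a
	 * second tree node and starts a new run:
	 *
	 *   FRAG_CB(skb@4800)->frag_run_len == 600
	 *   q->fragments_tail == q->last_run_head == skb@4800
	 */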

Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Cc: Tom Herbert 
Signed-off-by: David S. Miller 
---
 include/net/inet_frag.h  |  16 ++-
 net/ipv4/inet_fragment.c | 293 +
 net/ipv4/ip_fragment.c   | 302 +--
 3 files changed, 342 insertions(+), 269 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 335cf7851f12..008f64823c41 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -77,8 +77,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;  /* Used in IPv6. */
-   struct rb_root  rb_fragments; /* Used in IPv4. */
+   struct sk_buff  *fragments;  /* used in 6lopwpan IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4/IPv6. */
struct sk_buff  *fragments_tail;
struct sk_buff  *last_run_head;
ktime_t stamp;
@@ -153,4 +153,16 @@ static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
 
 extern const u8 ip_frag_ecn_table[16];
 
+/* Return values of inet_frag_queue_insert() */
+#define IPFRAG_OK  0
+#define IPFRAG_DUP 1
+#define IPFRAG_OVERLAP 2
+int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
+  int offset, int end);
+void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
+ struct sk_buff *parent);
+void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
+   void *reasm_data);
+struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
+
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 6ffee9d2b0e5..481cded81b2d 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -24,6 +24,62 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   union {
+   struct inet_skb_parmh4;
+   struct inet6_skb_parm   h6;
+   };
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void fragcb_clear(struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void fragrun_append_to_last(struct inet_frag_queue *q,
+  struct sk_buff *skb)
+{
+   fragcb_clear(skb);
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void fragrun_create(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+   fragcb_clear(skb);
+
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+&q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
 
 /* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
  * Value : 0xff if frame should be dropped.
@@ -122,6 +178,28 @@ static void inet_frag_destroy_rcu(struct rcu_head *head)
kmem_cache_free(f->frags_cachep, q);
 }
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+   struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
+
+   while (p) {
+   struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+   p = rb_next(p);
+   rb_erase(&skb->rbnode, root);
+   while (skb) {
+   struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+ 

[PATCH 4.14 stable v2 3/5] ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module

2019-04-23 Thread Peter Oskolkov
From: Florian Westphal 

[ Upstream commit 70b095c84326640eeacfd69a411db8fc36e8ab1a ]

IPV6=m
DEFRAG_IPV6=m
CONNTRACK=y yields:

net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get':
net/netfilter/nf_conntrack_proto.c:802: undefined reference to 
`nf_defrag_ipv6_enable'
net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to 
`nf_conntrack_l4proto_icmpv6'

Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params
ip6_frag_init and ip6_expire_frag_queue so it would be needed to force
IPV6=y too.

This patch gets rid of the 'followup linker error' by removing
the dependency of ipv6.ko symbols from netfilter ipv6 defrag.

Shared code is placed into a header, then used from both.
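
In miniature, the mechanism looks like this (a hypothetical example, not
the patch's code): every helper both modules need becomes a static
inline in the shared header, so each module compiles its own copy and no
cross-module symbol -- hence no module dependency -- is created:

	/* shared_frag.h (illustrative) */
	#ifndef _SHARED_FRAG_H
	#define _SHARED_FRAG_H

	#include <linux/jhash.h>

	static inline u32 frag_key_hashfn(const u32 *key, u32 words, u32 seed)
	{
		/* inlined into every caller; exports no symbol */
		return jhash2(key, words, seed);
	}

	#endif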

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/ipv6.h|  29 --
 include/net/ipv6_frag.h   | 104 ++
 net/ieee802154/6lowpan/reassembly.c   |   2 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c   |  17 ++--
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |   3 +-
 net/ipv6/reassembly.c |  92 ++-
 net/openvswitch/conntrack.c   |   1 +
 7 files changed, 126 insertions(+), 122 deletions(-)
 create mode 100644 include/net/ipv6_frag.h

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index fa87a62e9bd3..6294d20a5f0e 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -512,35 +512,6 @@ static inline bool ipv6_prefix_equal(const struct in6_addr *addr1,
 }
 #endif
 
-struct inet_frag_queue;
-
-enum ip6_defrag_users {
-   IP6_DEFRAG_LOCAL_DELIVER,
-   IP6_DEFRAG_CONNTRACK_IN,
-   __IP6_DEFRAG_CONNTRACK_IN   = IP6_DEFRAG_CONNTRACK_IN + USHRT_MAX,
-   IP6_DEFRAG_CONNTRACK_OUT,
-   __IP6_DEFRAG_CONNTRACK_OUT  = IP6_DEFRAG_CONNTRACK_OUT + USHRT_MAX,
-   IP6_DEFRAG_CONNTRACK_BRIDGE_IN,
-   __IP6_DEFRAG_CONNTRACK_BRIDGE_IN = IP6_DEFRAG_CONNTRACK_BRIDGE_IN + USHRT_MAX,
-};
-
-void ip6_frag_init(struct inet_frag_queue *q, const void *a);
-extern const struct rhashtable_params ip6_rhash_params;
-
-/*
- * Equivalent of ipv4 struct ip
- */
-struct frag_queue {
-   struct inet_frag_queue  q;
-
-   int iif;
-   unsigned intcsum;
-   __u16   nhoffset;
-   u8  ecn;
-};
-
-void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq);
-
 static inline bool ipv6_addr_any(const struct in6_addr *a)
 {
 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
new file mode 100644
index ..6ced1e6899b6
--- /dev/null
+++ b/include/net/ipv6_frag.h
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _IPV6_FRAG_H
+#define _IPV6_FRAG_H
+#include 
+#include 
+#include 
+#include 
+
+enum ip6_defrag_users {
+   IP6_DEFRAG_LOCAL_DELIVER,
+   IP6_DEFRAG_CONNTRACK_IN,
+   __IP6_DEFRAG_CONNTRACK_IN   = IP6_DEFRAG_CONNTRACK_IN + USHRT_MAX,
+   IP6_DEFRAG_CONNTRACK_OUT,
+   __IP6_DEFRAG_CONNTRACK_OUT  = IP6_DEFRAG_CONNTRACK_OUT + USHRT_MAX,
+   IP6_DEFRAG_CONNTRACK_BRIDGE_IN,
+   __IP6_DEFRAG_CONNTRACK_BRIDGE_IN = IP6_DEFRAG_CONNTRACK_BRIDGE_IN + USHRT_MAX,
+};
+
+/*
+ * Equivalent of ipv4 struct ip
+ */
+struct frag_queue {
+   struct inet_frag_queue  q;
+
+   int iif;
+   __u16   nhoffset;
+   u8  ecn;
+};
+
+#if IS_ENABLED(CONFIG_IPV6)
+static inline void ip6frag_init(struct inet_frag_queue *q, const void *a)
+{
+   struct frag_queue *fq = container_of(q, struct frag_queue, q);
+   const struct frag_v6_compare_key *key = a;
+
+   q->key.v6 = *key;
+   fq->ecn = 0;
+}
+
+static inline u32 ip6frag_key_hashfn(const void *data, u32 len, u32 seed)
+{
+   return jhash2(data,
+ sizeof(struct frag_v6_compare_key) / sizeof(u32), seed);
+}
+
+static inline u32 ip6frag_obj_hashfn(const void *data, u32 len, u32 seed)
+{
+   const struct inet_frag_queue *fq = data;
+
+   return jhash2((const u32 *)&fq->key.v6,
+ sizeof(struct frag_v6_compare_key) / sizeof(u32), seed);
+}
+
+static inline int
+ip6frag_obj_cmpfn(struct rhashtable_compare_arg *arg, const void *ptr)
+{
+   const struct frag_v6_compare_key *key = arg->key;
+   const struct inet_frag_queue *fq = ptr;
+
+   return !!memcmp(&fq->key, key, sizeof(*key));
+}
+
+static inline void
+ip6frag_expire_frag_queue(struct net *net, struct frag_queue *fq)
+{
+   struct net_device *dev = NULL;
+   struct sk_buff *head;
+
+   rcu_read_lock();
+   spin_lock(&fq->q.lock);
+
+   if (fq->q.flags & INET_FRAG_COMPLETE)
+   goto out;
+
+   inet_frag_kill(&fq->q);
+
+   dev = dev_get_by_index_rcu(net, fq->iif);
+   if (!dev)
+   goto out;
+
+   __IP6_INC_STATS(net, _

[PATCH 4.14 stable v2 4/5] net: IP6 defrag: use rbtrees for IPv6 defrag

2019-04-23 Thread Peter Oskolkov
[ Upstream commit d4289fcc9b16b89619ee1c54f829e05e56de8b9a ]

Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IPv6, removing the 1280 byte restriction.

v2: change handling of overlaps to match that of upstream.

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 include/net/ipv6_frag.h |  11 +-
 net/ipv6/reassembly.c   | 249 +++-
 2 files changed, 77 insertions(+), 183 deletions(-)

diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 6ced1e6899b6..28aa9b30aece 100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -82,8 +82,15 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue *fq)
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
/* Don't send error if the first segment did not arrive. */
-   head = fq->q.fragments;
-   if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!(fq->q.flags & INET_FRAG_FIRST_IN))
+   goto out;
+
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   head = inet_frag_pull_head(&fq->q);
+   if (!head)
goto out;
 
head->dev = dev;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e5ab3b7813d6..fe797b29ca89 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -62,13 +62,6 @@
 
 static const char ip6_frag_cache_name[] = "ip6-frags";
 
-struct ip6frag_skb_cb {
-   struct inet6_skb_parm   h;
-   int offset;
-};
-
-#define FRAG6_CB(skb)  ((struct ip6frag_skb_cb *)((skb)->cb))
-
 static u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 {
return 1 << (ipv6_get_dsfield(ipv6h) & INET_ECN_MASK);
@@ -76,8 +69,8 @@ static u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 
 static struct inet_frags ip6_frags;
 
-static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
- struct net_device *dev);
+static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *skb,
+ struct sk_buff *prev_tail, struct net_device *dev);
 
 static void ip6_frag_expire(struct timer_list *t)
 {
@@ -118,21 +111,26 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
  struct frag_hdr *fhdr, int nhoff,
  u32 *prob_offset)
 {
-   struct sk_buff *prev, *next;
-   struct net_device *dev;
-   int offset, end, fragsize;
struct net *net = dev_net(skb_dst(skb)->dev);
+   int offset, end, fragsize;
+   struct sk_buff *prev_tail;
+   struct net_device *dev;
+   int err = -ENOENT;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE)
goto err;
 
+   err = -EINVAL;
offset = ntohs(fhdr->frag_off) & ~0x7;
end = offset + (ntohs(ipv6_hdr(skb)->payload_len) -
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
if ((unsigned int)end > IPV6_MAXPLEN) {
*prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
+   /* note that if prob_offset is set, the skb is freed elsewhere,
+* we do not free it here.
+*/
return -1;
}
 
@@ -152,7 +150,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 */
if (end < fq->q.len ||
((fq->q.flags & INET_FRAG_LAST_IN) && end != fq->q.len))
-   goto err;
+   goto discard_fq;
fq->q.flags |= INET_FRAG_LAST_IN;
fq->q.len = end;
} else {
@@ -169,70 +167,36 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
if (end > fq->q.len) {
/* Some bits beyond end -> corruption. */
if (fq->q.flags & INET_FRAG_LAST_IN)
-   goto err;
+   goto discard_fq;
fq->q.len = end;
}
}
 
if (end == offset)
-   goto err;
+   goto discard_fq;
 
+   err = -ENOMEM;
/* Point into the IP datagram 'data' part. */
if (!pskb_pull(skb, (u8 *) (fhdr + 1) - s

[PATCH 4.14 stable v2 5/5] net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

2019-04-23 Thread Peter Oskolkov
[ Upstream commit 997dd96471641e147cb2c33ad54284000d0f5e35 ]

Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IP6 defragmentation in nf_conntrack, removing the 1280 byte
restriction.

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 262 +++-
 1 file changed, 72 insertions(+), 190 deletions(-)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0568d49b5da4..cb1b4772dac0 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -51,14 +51,6 @@
 
 static const char nf_frags_cache_name[] = "nf-frags";
 
-struct nf_ct_frag6_skb_cb
-{
-   struct inet6_skb_parm   h;
-   int offset;
-};
-
-#define NFCT_FRAG6_CB(skb) ((struct nf_ct_frag6_skb_cb *)((skb)->cb))
-
 static struct inet_frags nf_frags;
 
 #ifdef CONFIG_SYSCTL
@@ -144,6 +136,9 @@ static void __net_exit nf_ct_frags6_sysctl_unregister(struct net *net)
 }
 #endif
 
+static int nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
+
 static inline u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 {
return 1 << (ipv6_get_dsfield(ipv6h) & INET_ECN_MASK);
@@ -185,9 +180,10 @@ static struct frag_queue *fq_find(struct net *net, __be32 id, u32 user,
 static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
 const struct frag_hdr *fhdr, int nhoff)
 {
-   struct sk_buff *prev, *next;
unsigned int payload_len;
-   int offset, end;
+   struct net_device *dev;
+   struct sk_buff *prev;
+   int offset, end, err;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE) {
@@ -262,55 +258,19 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
goto err;
}
 
-   /* Find out which fragments are in front and at the back of us
-* in the chain of fragments so far.  We must know where to put
-* this fragment, right?
-*/
+   /* Note : skb->rbnode and skb->dev share the same location. */
+   dev = skb->dev;
+   /* Makes sure compiler wont do silly aliasing games */
+   barrier();
+
prev = fq->q.fragments_tail;
-   if (!prev || NFCT_FRAG6_CB(prev)->offset < offset) {
-   next = NULL;
-   goto found;
-   }
-   prev = NULL;
-   for (next = fq->q.fragments; next != NULL; next = next->next) {
-   if (NFCT_FRAG6_CB(next)->offset >= offset)
-   break;  /* bingo! */
-   prev = next;
-   }
+   err = inet_frag_queue_insert(&fq->q, skb, offset, end);
+   if (err)
+   goto insert_error;
 
-found:
-   /* RFC5722, Section 4:
-*  When reassembling an IPv6 datagram, if
-*   one or more its constituent fragments is determined to be an
-*   overlapping fragment, the entire datagram (and any constituent
-*   fragments, including those not yet received) MUST be silently
-*   discarded.
-*/
+   if (dev)
+   fq->iif = dev->ifindex;
 
-   /* Check for overlap with preceding fragment. */
-   if (prev &&
-   (NFCT_FRAG6_CB(prev)->offset + prev->len) > offset)
-   goto discard_fq;
-
-   /* Look for overlap with succeeding segment. */
-   if (next && NFCT_FRAG6_CB(next)->offset < end)
-   goto discard_fq;
-
-   NFCT_FRAG6_CB(skb)->offset = offset;
-
-   /* Insert this fragment in the chain of fragments. */
-   skb->next = next;
-   if (!next)
-   fq->q.fragments_tail = skb;
-   if (prev)
-   prev->next = skb;
-   else
-   fq->q.fragments = skb;
-
-   if (skb->dev) {
-   fq->iif = skb->dev->ifindex;
-   skb->dev = NULL;
-   }
fq->q.stamp = skb->tstamp;
fq->q.meat += skb->len;
fq->ecn |= ecn;
@@ -326,11 +286,25 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
fq->q.flags |= INET_FRAG_FIRST_IN;
}
 
-   return 0;
+   if (fq->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&

Re: [PATCH 4.14 stable v2 0/5] net: ip6 defrag: backport fixes

2019-04-25 Thread Peter Oskolkov
On Thu, Apr 25, 2019 at 1:13 AM Lars Persson  wrote:
>
> On 4/23/19 7:48 PM, Peter Oskolkov wrote:
> > Lars Persson  reported that a label was unused in
> > the previous version of this patchset, so I'm sending a v2 that fixes it.
> >
> > Sorry for the mess/v2.
> >
> > v2 changes: handle overlapping fragments the way it is done upstream.
> >
> > This is a backport of a 5.1rc patchset:
> >https://patchwork.ozlabs.org/cover/1029418/
> >
> > Which was backported into 4.19:
> >https://patchwork.ozlabs.org/cover/1081619/
> >
> > I had to backport two additional patches into 4.14 to make it work.
> >
> >
> > John Masinter (captwiggum), could you, please, confirm that this
> > patchset fixes TAHI tests? (I'm reasonably certain that it does, as
> > I ran ip_defrag selftest, but given the amount of changes here,
> > another set of completed tests would be nice to have).
> >
> >
> > Eric Dumazet (1):
> >    ipv6: frags: fix a lockdep false positive
> >
> > Florian Westphal (1):
> >ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module
> >
> > Peter Oskolkov (3):
> >net: IP defrag: encapsulate rbtree defrag code into callable functions
> >net: IP6 defrag: use rbtrees for IPv6 defrag
> >net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c
> >
> >   include/net/inet_frag.h   |  16 +-
> >   include/net/ipv6.h|  29 --
> >   include/net/ipv6_frag.h   | 111 +++
> >   net/ieee802154/6lowpan/reassembly.c   |   2 +-
> >   net/ipv4/inet_fragment.c  | 293 +
> >   net/ipv4/ip_fragment.c| 302 +++---
> >   net/ipv6/netfilter/nf_conntrack_reasm.c   | 279 +
> >   net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |   3 +-
> >   net/ipv6/reassembly.c | 364 ++
> >   net/openvswitch/conntrack.c   |   1 +
> >   10 files changed, 627 insertions(+), 773 deletions(-)
> >   create mode 100644 include/net/ipv6_frag.h
> >
>
> Hi
>
> Our QA ran this with the IOL INTACT test suite and they give thumbs up.
>
> Tested-by: Lars Persson 
>
> - Lars

Thanks, Lars! I'll prepare a patchset for 4.4 then - should be
identical, or very similar, to the 4.14 one.


[PATCH 4.9 stable 5/5] net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

2019-04-26 Thread Peter Oskolkov
[ Upstream commit 997dd96471641e147cb2c33ad54284000d0f5e35 ]

Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IP6 defragmentation in nf_conntrack, removing the 1280 byte
restriction.

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 256 +++-
 1 file changed, 72 insertions(+), 184 deletions(-)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 033f44493a10..1e1fa99b3243 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -51,14 +51,6 @@
 
 static const char nf_frags_cache_name[] = "nf-frags";
 
-struct nf_ct_frag6_skb_cb
-{
-   struct inet6_skb_parm   h;
-   int offset;
-};
-
-#define NFCT_FRAG6_CB(skb) ((struct nf_ct_frag6_skb_cb *)((skb)->cb))
-
 static struct inet_frags nf_frags;
 
 #ifdef CONFIG_SYSCTL
@@ -144,6 +136,9 @@ static void __net_exit nf_ct_frags6_sysctl_unregister(struct net *net)
 }
 #endif
 
+static int nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
+
 static inline u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 {
return 1 << (ipv6_get_dsfield(ipv6h) & INET_ECN_MASK);
@@ -184,9 +179,10 @@ static struct frag_queue *fq_find(struct net *net, __be32 id, u32 user,
 static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
 const struct frag_hdr *fhdr, int nhoff)
 {
-   struct sk_buff *prev, *next;
unsigned int payload_len;
-   int offset, end;
+   struct net_device *dev;
+   struct sk_buff *prev;
+   int offset, end, err;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE) {
@@ -261,55 +257,19 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
goto err;
}
 
-   /* Find out which fragments are in front and at the back of us
-* in the chain of fragments so far.  We must know where to put
-* this fragment, right?
-*/
+   /* Note : skb->rbnode and skb->dev share the same location. */
+   dev = skb->dev;
+   /* Makes sure compiler wont do silly aliasing games */
+   barrier();
+
prev = fq->q.fragments_tail;
-   if (!prev || NFCT_FRAG6_CB(prev)->offset < offset) {
-   next = NULL;
-   goto found;
-   }
-   prev = NULL;
-   for (next = fq->q.fragments; next != NULL; next = next->next) {
-   if (NFCT_FRAG6_CB(next)->offset >= offset)
-   break;  /* bingo! */
-   prev = next;
-   }
+   err = inet_frag_queue_insert(&fq->q, skb, offset, end);
+   if (err)
+   goto insert_error;
 
-found:
-   /* RFC5722, Section 4:
-*  When reassembling an IPv6 datagram, if
-*   one or more its constituent fragments is determined to be an
-*   overlapping fragment, the entire datagram (and any constituent
-*   fragments, including those not yet received) MUST be silently
-*   discarded.
-*/
+   if (dev)
+   fq->iif = dev->ifindex;
 
-   /* Check for overlap with preceding fragment. */
-   if (prev &&
-   (NFCT_FRAG6_CB(prev)->offset + prev->len) > offset)
-   goto discard_fq;
-
-   /* Look for overlap with succeeding segment. */
-   if (next && NFCT_FRAG6_CB(next)->offset < end)
-   goto discard_fq;
-
-   NFCT_FRAG6_CB(skb)->offset = offset;
-
-   /* Insert this fragment in the chain of fragments. */
-   skb->next = next;
-   if (!next)
-   fq->q.fragments_tail = skb;
-   if (prev)
-   prev->next = skb;
-   else
-   fq->q.fragments = skb;
-
-   if (skb->dev) {
-   fq->iif = skb->dev->ifindex;
-   skb->dev = NULL;
-   }
fq->q.stamp = skb->tstamp;
fq->q.meat += skb->len;
fq->ecn |= ecn;
@@ -325,11 +285,25 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
fq->q.flags |= INET_FRAG_FIRST_IN;
}
 
-   return 0;
+   if (fq->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&

[PATCH 4.9 stable 2/5] net: IP defrag: encapsulate rbtree defrag code into callable functions

2019-04-26 Thread Peter Oskolkov
[ Upstream commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f ]

This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.

Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Cc: Tom Herbert 
Signed-off-by: David S. Miller 
---
 include/net/inet_frag.h  |  16 ++-
 net/ipv4/inet_fragment.c | 293 ++
 net/ipv4/ip_fragment.c   | 295 +--
 3 files changed, 342 insertions(+), 262 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index a3812e9c8fee..c2c724abde57 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -76,8 +76,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
atomic_trefcnt;
-   struct sk_buff  *fragments;  /* Used in IPv6. */
-   struct rb_root  rb_fragments; /* Used in IPv4. */
+   struct sk_buff  *fragments;  /* used in 6lopwpan IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4/IPv6. */
struct sk_buff  *fragments_tail;
struct sk_buff  *last_run_head;
ktime_t stamp;
@@ -152,4 +152,16 @@ static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
 
 extern const u8 ip_frag_ecn_table[16];
 
+/* Return values of inet_frag_queue_insert() */
+#define IPFRAG_OK  0
+#define IPFRAG_DUP 1
+#define IPFRAG_OVERLAP 2
+int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
+  int offset, int end);
+void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
+ struct sk_buff *parent);
+void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
+   void *reasm_data);
+struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
+
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 0fb49dedc9fb..2325cd3454a6 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -24,6 +24,62 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   union {
+   struct inet_skb_parmh4;
+   struct inet6_skb_parm   h6;
+   };
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void fragcb_clear(struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void fragrun_append_to_last(struct inet_frag_queue *q,
+  struct sk_buff *skb)
+{
+   fragcb_clear(skb);
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void fragrun_create(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+   fragcb_clear(skb);
+
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+&q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
 
 /* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
  * Value : 0xff if frame should be dropped.
@@ -122,6 +178,28 @@ static void inet_frag_destroy_rcu(struct rcu_head *head)
kmem_cache_free(f->frags_cachep, q);
 }
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+   struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
+
+   while (p) {
+   struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+   p = rb_next(p);
+   rb_erase(&skb->rbnode, root);
+   while (skb) {
+   struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+   sum += skb->truesize;
+   kfree_skb(skb);

[PATCH 4.9 stable 1/5] ipv6: frags: fix a lockdep false positive

2019-04-26 Thread Peter Oskolkov
From: Eric Dumazet 

[ Upstream commit 415787d7799f4fccbe8d49cb0b8e5811be6b0389 ]

lockdep does not know that the locks used by IPv4 defrag
and IPv6 reassembly units are of different classes.

It complains because of following chains :

1) sch_direct_xmit()(lock txq->_xmit_lock)
dev_hard_start_xmit()
 xmit_one()
  dev_queue_xmit_nit()
   packet_rcv_fanout()
ip_check_defrag()
 ip_defrag()
  spin_lock() (lock frag queue spinlock)

2) ip6_input_finish()
ipv6_frag_rcv()   (lock frag queue spinlock)
 ip6_frag_queue()
  icmpv6_param_prob() (lock txq->_xmit_lock at some point)

We could add lockdep annotations, but we also can make sure IPv6
calls icmpv6_param_prob() only after the release of the frag queue spinlock,
since this naturally makes frag queue spinlock a leaf in lock hierarchy.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
---
 net/ipv6/reassembly.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 74ffbcb306a6..64c8f20d3c41 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -169,7 +169,8 @@ fq_find(struct net *net, __be32 id, const struct ipv6hdr *hdr, int iif)
 }
 
 static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
-  struct frag_hdr *fhdr, int nhoff)
+ struct frag_hdr *fhdr, int nhoff,
+ u32 *prob_offset)
 {
struct sk_buff *prev, *next;
struct net_device *dev;
@@ -185,11 +186,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
if ((unsigned int)end > IPV6_MAXPLEN) {
-   __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-   IPSTATS_MIB_INHDRERRORS);
-   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
- ((u8 *)&fhdr->frag_off -
-  skb_network_header(skb)));
+   *prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
return -1;
}
 
@@ -220,10 +217,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
/* RFC2460 says always send parameter problem in
 * this case. -DaveM
 */
-   __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-   IPSTATS_MIB_INHDRERRORS);
-   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
- offsetof(struct ipv6hdr, payload_len));
+   *prob_offset = offsetof(struct ipv6hdr, payload_len);
return -1;
}
if (end > fq->q.len) {
@@ -524,15 +518,22 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
+   u32 prob_offset = 0;
int ret;
 
spin_lock(&fq->q.lock);
 
fq->iif = iif;
-   ret = ip6_frag_queue(fq, skb, fhdr, IP6CB(skb)->nhoff);
+   ret = ip6_frag_queue(fq, skb, fhdr, IP6CB(skb)->nhoff,
+&prob_offset);
 
spin_unlock(&fq->q.lock);
inet_frag_put(&fq->q);
+   if (prob_offset) {
+   __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+   IPSTATS_MIB_INHDRERRORS);
+   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, prob_offset);
+   }
return ret;
}
 
-- 
2.21.0.593.g511ec345e18-goog



[PATCH 4.9 stable 4/5] net: IP6 defrag: use rbtrees for IPv6 defrag

2019-04-26 Thread Peter Oskolkov
[ Upstream commit d4289fcc9b16b89619ee1c54f829e05e56de8b9a ]

Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IPv6, removing the 1280 byte restriction.
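
As with the queueing path, the expiry path has to respect the
sk_buff::dev / sk_buff::rbnode union: the head fragment must be unlinked
from the rb-tree before head->dev may be written. Condensed from the
ipv6_frag.h hunk below plus the upstream expiry path (a sketch, not an
extra change):

	/* pull the head skb out of the rb-tree first; only then is
	 * writing head->dev safe, since dev and rbnode overlap */
	head = inet_frag_pull_head(&fq->q);
	if (!head)
		goto out;

	head->dev = dev;
	icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);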

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 include/net/ipv6_frag.h |  11 +-
 net/ipv6/reassembly.c   | 246 
 2 files changed, 81 insertions(+), 176 deletions(-)

diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 6ced1e6899b6..28aa9b30aece 100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -82,8 +82,15 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue *fq)
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
/* Don't send error if the first segment did not arrive. */
-   head = fq->q.fragments;
-   if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!(fq->q.flags & INET_FRAG_FIRST_IN))
+   goto out;
+
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   head = inet_frag_pull_head(&fq->q);
+   if (!head)
goto out;
 
head->dev = dev;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 199c44a9358d..4aed9c45a91a 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -62,13 +62,6 @@
 
 static const char ip6_frag_cache_name[] = "ip6-frags";
 
-struct ip6frag_skb_cb {
-   struct inet6_skb_parm   h;
-   int offset;
-};
-
-#define FRAG6_CB(skb)  ((struct ip6frag_skb_cb *)((skb)->cb))
-
 static u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 {
return 1 << (ipv6_get_dsfield(ipv6h) & INET_ECN_MASK);
@@ -76,8 +69,8 @@ static u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 
 static struct inet_frags ip6_frags;
 
-static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
- struct net_device *dev);
+static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *skb,
+ struct sk_buff *prev_tail, struct net_device *dev);
 
 static void ip6_frag_expire(unsigned long data)
 {
@@ -117,21 +110,26 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
  struct frag_hdr *fhdr, int nhoff,
  u32 *prob_offset)
 {
-   struct sk_buff *prev, *next;
-   struct net_device *dev;
-   int offset, end;
struct net *net = dev_net(skb_dst(skb)->dev);
+   int offset, end, fragsize;
+   struct sk_buff *prev_tail;
+   struct net_device *dev;
+   int err = -ENOENT;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE)
goto err;
 
+   err = -EINVAL;
offset = ntohs(fhdr->frag_off) & ~0x7;
end = offset + (ntohs(ipv6_hdr(skb)->payload_len) -
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
if ((unsigned int)end > IPV6_MAXPLEN) {
*prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
+   /* note that if prob_offset is set, the skb is freed elsewhere,
+* we do not free it here.
+*/
return -1;
}
 
@@ -151,7 +149,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 */
if (end < fq->q.len ||
((fq->q.flags & INET_FRAG_LAST_IN) && end != fq->q.len))
-   goto err;
+   goto discard_fq;
fq->q.flags |= INET_FRAG_LAST_IN;
fq->q.len = end;
} else {
@@ -168,75 +166,45 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
if (end > fq->q.len) {
/* Some bits beyond end -> corruption. */
if (fq->q.flags & INET_FRAG_LAST_IN)
-   goto err;
+   goto discard_fq;
fq->q.len = end;
}
}
 
if (end == offset)
-   goto err;
+   goto discard_fq;
 
+   err = -ENOMEM;
/* Point into the IP datagram 'data' part. */
if (!pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data))
-   goto err;
-
-   if (pskb_trim_rcsum(sk

[PATCH 4.9 stable 0/5] net: ip6 defrag: backport fixes

2019-04-26 Thread Peter Oskolkov
This is a backport of a 5.1rc patchset:
  https://patchwork.ozlabs.org/cover/1029418/

Which was backported into 4.19:
  https://patchwork.ozlabs.org/cover/1081619/

and into 4.14:
  https://patchwork.ozlabs.org/cover/1089651/


This 4.9 patchset is very close to the 4.14 patchset above
(cherry-picks from 4.14 were almost clean).


Eric Dumazet (1):
  ipv6: frags: fix a lockdep false positive

Florian Westphal (1):
  ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module

Peter Oskolkov (3):
  net: IP defrag: encapsulate rbtree defrag code into callable functions
  net: IP6 defrag: use rbtrees for IPv6 defrag
  net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

 include/net/inet_frag.h   |  16 +-
 include/net/ipv6.h|  29 --
 include/net/ipv6_frag.h   | 111 +++
 net/ieee802154/6lowpan/reassembly.c   |   2 +-
 net/ipv4/inet_fragment.c  | 293 ++
 net/ipv4/ip_fragment.c| 295 +++---
 net/ipv6/netfilter/nf_conntrack_reasm.c   | 273 +---
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |   3 +-
 net/ipv6/reassembly.c | 361 ++
 net/openvswitch/conntrack.c   |   1 +
 10 files changed, 631 insertions(+), 753 deletions(-)
 create mode 100644 include/net/ipv6_frag.h

-- 
2.21.0.593.g511ec345e18-goog



[PATCH 4.9 stable 3/5] ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module

2019-04-26 Thread Peter Oskolkov
From: Florian Westphal 

[ Upstream commit 70b095c84326640eeacfd69a411db8fc36e8ab1a ]

IPV6=m
DEFRAG_IPV6=m
CONNTRACK=y yields:

net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get':
net/netfilter/nf_conntrack_proto.c:802: undefined reference to 
`nf_defrag_ipv6_enable'
net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to 
`nf_conntrack_l4proto_icmpv6'

Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params
ip6_frag_init and ip6_expire_frag_queue so it would be needed to force
IPV6=y too.

This patch gets rid of the 'followup linker error' by removing
the dependency of ipv6.ko symbols from netfilter ipv6 defrag.

Shared code is placed into a header, then used from both.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/ipv6.h|  29 --
 include/net/ipv6_frag.h   | 104 ++
 net/ieee802154/6lowpan/reassembly.c   |   2 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c   |  17 ++--
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |   3 +-
 net/ipv6/reassembly.c |  92 ++-
 net/openvswitch/conntrack.c   |   1 +
 7 files changed, 126 insertions(+), 122 deletions(-)
 create mode 100644 include/net/ipv6_frag.h

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 7cb100d25bb5..168009eef5e4 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -511,35 +511,6 @@ static inline bool ipv6_prefix_equal(const struct in6_addr 
*addr1,
 }
 #endif
 
-struct inet_frag_queue;
-
-enum ip6_defrag_users {
-   IP6_DEFRAG_LOCAL_DELIVER,
-   IP6_DEFRAG_CONNTRACK_IN,
-   __IP6_DEFRAG_CONNTRACK_IN   = IP6_DEFRAG_CONNTRACK_IN + USHRT_MAX,
-   IP6_DEFRAG_CONNTRACK_OUT,
-   __IP6_DEFRAG_CONNTRACK_OUT  = IP6_DEFRAG_CONNTRACK_OUT + USHRT_MAX,
-   IP6_DEFRAG_CONNTRACK_BRIDGE_IN,
-   __IP6_DEFRAG_CONNTRACK_BRIDGE_IN = IP6_DEFRAG_CONNTRACK_BRIDGE_IN + 
USHRT_MAX,
-};
-
-void ip6_frag_init(struct inet_frag_queue *q, const void *a);
-extern const struct rhashtable_params ip6_rhash_params;
-
-/*
- * Equivalent of ipv4 struct ip
- */
-struct frag_queue {
-   struct inet_frag_queue  q;
-
-   int iif;
-   unsigned intcsum;
-   __u16   nhoffset;
-   u8  ecn;
-};
-
-void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq);
-
 static inline bool ipv6_addr_any(const struct in6_addr *a)
 {
 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
new file mode 100644
index ..6ced1e6899b6
--- /dev/null
+++ b/include/net/ipv6_frag.h
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _IPV6_FRAG_H
+#define _IPV6_FRAG_H
+#include 
+#include 
+#include 
+#include 
+
+enum ip6_defrag_users {
+   IP6_DEFRAG_LOCAL_DELIVER,
+   IP6_DEFRAG_CONNTRACK_IN,
+   __IP6_DEFRAG_CONNTRACK_IN   = IP6_DEFRAG_CONNTRACK_IN + USHRT_MAX,
+   IP6_DEFRAG_CONNTRACK_OUT,
+   __IP6_DEFRAG_CONNTRACK_OUT  = IP6_DEFRAG_CONNTRACK_OUT + USHRT_MAX,
+   IP6_DEFRAG_CONNTRACK_BRIDGE_IN,
+   __IP6_DEFRAG_CONNTRACK_BRIDGE_IN = IP6_DEFRAG_CONNTRACK_BRIDGE_IN + 
USHRT_MAX,
+};
+
+/*
+ * Equivalent of ipv4 struct ip
+ */
+struct frag_queue {
+   struct inet_frag_queue  q;
+
+   int iif;
+   __u16   nhoffset;
+   u8  ecn;
+};
+
+#if IS_ENABLED(CONFIG_IPV6)
+static inline void ip6frag_init(struct inet_frag_queue *q, const void *a)
+{
+   struct frag_queue *fq = container_of(q, struct frag_queue, q);
+   const struct frag_v6_compare_key *key = a;
+
+   q->key.v6 = *key;
+   fq->ecn = 0;
+}
+
+static inline u32 ip6frag_key_hashfn(const void *data, u32 len, u32 seed)
+{
+   return jhash2(data,
+ sizeof(struct frag_v6_compare_key) / sizeof(u32), seed);
+}
+
+static inline u32 ip6frag_obj_hashfn(const void *data, u32 len, u32 seed)
+{
+   const struct inet_frag_queue *fq = data;
+
+   return jhash2((const u32 *)&fq->key.v6,
+ sizeof(struct frag_v6_compare_key) / sizeof(u32), seed);
+}
+
+static inline int
+ip6frag_obj_cmpfn(struct rhashtable_compare_arg *arg, const void *ptr)
+{
+   const struct frag_v6_compare_key *key = arg->key;
+   const struct inet_frag_queue *fq = ptr;
+
+   return !!memcmp(&fq->key, key, sizeof(*key));
+}
+
+static inline void
+ip6frag_expire_frag_queue(struct net *net, struct frag_queue *fq)
+{
+   struct net_device *dev = NULL;
+   struct sk_buff *head;
+
+   rcu_read_lock();
+   spin_lock(&fq->q.lock);
+
+   if (fq->q.flags & INET_FRAG_COMPLETE)
+   goto out;
+
+   inet_frag_kill(&fq->q);
+
+   dev = dev_get_by_index_rcu(net, fq->iif);
+   if (!dev)
+   goto out;
+
+   __IP6_INC_STATS(net, _

Re: [PATCH 4.9 stable 0/5] net: ip6 defrag: backport fixes

2019-04-26 Thread Peter Oskolkov
On Fri, Apr 26, 2019 at 8:41 AM Peter Oskolkov  wrote:
>
> This is a backport of a 5.1rc patchset:
>   https://patchwork.ozlabs.org/cover/1029418/
>
> Which was backported into 4.19:
>   https://patchwork.ozlabs.org/cover/1081619/
>
> and into 4.14:
>   https://patchwork.ozlabs.org/cover/1089651/
>
>
> This 4.9 patchset is very close to the 4.14 patchset above
> (cherry-picks from 4.14 were almost clean).

FYI: I have a patchset that backports these into 4.4, but things got
much hairier
there, as I needed to backport three additional netfilter patches. So I'm not
going to send the patchset to the lists unless there is a real need and somebody
with enough knowledge of netfilter volunteers to review/test it (I tested
that IP defrag works, but there are netfilter-related pieces that
I understand little about).

>
>
> Eric Dumazet (1):
>   ipv6: frags: fix a lockdep false positive
>
> Florian Westphal (1):
>   ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module
>
> Peter Oskolkov (3):
>   net: IP defrag: encapsulate rbtree defrag code into callable functions
>   net: IP6 defrag: use rbtrees for IPv6 defrag
>   net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c
>
>  include/net/inet_frag.h   |  16 +-
>  include/net/ipv6.h|  29 --
>  include/net/ipv6_frag.h   | 111 +++
>  net/ieee802154/6lowpan/reassembly.c   |   2 +-
>  net/ipv4/inet_fragment.c  | 293 ++
>  net/ipv4/ip_fragment.c| 295 +++---
>  net/ipv6/netfilter/nf_conntrack_reasm.c   | 273 +---
>  net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |   3 +-
>  net/ipv6/reassembly.c | 361 ++
>  net/openvswitch/conntrack.c   |   1 +
>  10 files changed, 631 insertions(+), 753 deletions(-)
>  create mode 100644 include/net/ipv6_frag.h
>
> --
> 2.21.0.593.g511ec345e18-goog
>


Re: [PATCH 4.9 stable 0/5] net: ip6 defrag: backport fixes

2019-04-29 Thread Peter Oskolkov
On Mon, Apr 29, 2019 at 10:24 AM Captain Wiggum  wrote:
>
> Hi Peter,
>
> I forgot to mention one thing about the 4.9 patch set.
> When patching against 4.9.170, I had to remove a couple of snippets
> that were already in release:

Hi John, I see these checks still present in 4.4.171. Maybe you had
them removed in your local branch?

>
> Patch #604 (linux-4.9-4-ip6-defrag-use-rbtrees.patch):
> + /usr/bin/cat 
> /home/admin/WORK/os/PACKAGES/kernel49/WORK/linux-4.9-4-ip6-defrag-use-rbtrees.patch
> + /usr/bin/patch -p1 -b --suffix .ip6-4 --fuzz=0
> patching file include/net/ipv6_frag.h
> patching file net/ipv6/reassembly.c
> Hunk #10 FAILED at 357.
> Hunk #11 succeeded at 374 (offset -4 lines).
> 1 out of 11 hunks FAILED -- saving rejects to file net/ipv6/reassembly.c.rej
>
> --- net/ipv6/reassembly.c
> +++ net/ipv6/reassembly.c
> @@ -357,10 +258,6 @@
> return 1;
> }
>
> -   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
> -   fhdr->frag_off & htons(IP6_MF))
> -   goto fail_hdr;
> -
> iif = skb->dev ? skb->dev->ifindex : 0;
> fq = fq_find(net, fhdr->identification, hdr, iif);
> if (fq) {
>
> Patch #605 (linux-4.9-5-ip6-defrag-use-rbtrees-in-nf_conntrack_reasm.patch):
> + /usr/bin/cat 
> /home/admin/WORK/os/PACKAGES/kernel49/WORK/linux-4.9-5-ip6-defrag-use-rbtrees-in-nf_conntrack_reasm.patch
> + /usr/bin/patch -p1 -b --suffix .ip6-5 --fuzz=0
> patching file net/ipv6/netfilter/nf_conntrack_reasm.c
> Hunk #8 FAILED at 464.
> Hunk #9 succeeded at 475 (offset -4 lines).
> 1 out of 9 hunks FAILED -- saving rejects to file
> net/ipv6/netfilter/nf_conntrack_reasm.c.rej
>
> --- net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -464,10 +363,6 @@
> hdr = ipv6_hdr(skb);
> fhdr = (struct frag_hdr *)skb_transport_header(skb);
>
> -   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
> -   fhdr->frag_off & htons(IP6_MF))
> -   return -EINVAL;
> -
> skb_orphan(skb);
> fq = fq_find(net, fhdr->identification, user, hdr,
>  skb->dev ? skb->dev->ifindex : 0);
>
> On Mon, Apr 29, 2019 at 10:57 AM Captain Wiggum  wrote:
> >
> > I have run the 4.9 patch set on the full TAHI test suite.
> > Similar to 4.14, it does fix all the IPv6 frag header issues.
> > But the "change MTU" message routing is still broken.
> > Overall, it fixes what it was intended to fix, so I suggest it move
> > toward release.
> > Thanks Peter!
> >
> > --John Masinter
> >
> > On Fri, Apr 26, 2019 at 9:41 AM Peter Oskolkov  wrote:
> > >
> > > This is a backport of a 5.1rc patchset:
> > >   https://patchwork.ozlabs.org/cover/1029418/
> > >
> > > Which was backported into 4.19:
> > >   https://patchwork.ozlabs.org/cover/1081619/
> > >
> > > and into 4.14:
> > >   https://patchwork.ozlabs.org/cover/1089651/
> > >
> > >
> > > This 4.9 patchset is very close to the 4.14 patchset above
> > > (cherry-picks from 4.14 were almost clean).
> > >
> > >
> > > Eric Dumazet (1):
> > >   ipv6: frags: fix a lockdep false positive
> > >
> > > Florian Westphal (1):
> > >   ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module
> > >
> > > Peter Oskolkov (3):
> > >   net: IP defrag: encapsulate rbtree defrag code into callable functions
> > >   net: IP6 defrag: use rbtrees for IPv6 defrag
> > >   net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c
> > >
> > >  include/net/inet_frag.h   |  16 +-
> > >  include/net/ipv6.h|  29 --
> > >  include/net/ipv6_frag.h   | 111 +++
> > >  net/ieee802154/6lowpan/reassembly.c   |   2 +-
> > >  net/ipv4/inet_fragment.c  | 293 ++
> > >  net/ipv4/ip_fragment.c| 295 +++---
> > >  net/ipv6/netfilter/nf_conntrack_reasm.c   | 273 +---
> > >  net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |   3 +-
> > >  net/ipv6/reassembly.c | 361 ++
> > >  net/openvswitch/conntrack.c   |   1 +
> > >  10 files changed, 631 insertions(+), 753 deletions(-)
> > >  create mode 100644 include/net/ipv6_frag.h
> > >
> > > --
> > > 2.21.0.593.g511ec345e18-goog
> > >


[PATCH net-next] net: fix double-free in bpf_lwt_xmit_reroute

2019-02-23 Thread Peter Oskolkov
dst_output() frees skb when it fails (see, for example,
ip_finish_output2), so it must not be freed in this case.

Fixes: 3bd0b15281af ("bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c")
Signed-off-by: Peter Oskolkov 
---
 net/core/lwt_bpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a5c8c79d468a..cf2f8897ca19 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -260,7 +260,7 @@ static int bpf_lwt_xmit_reroute(struct sk_buff *skb)
 
err = dst_output(dev_net(skb_dst(skb)->dev), skb->sk, skb);
if (unlikely(err))
-   goto err;
+   return err;
 
/* ip[6]_finish_output2 understand LWTUNNEL_XMIT_DONE */
return LWTUNNEL_XMIT_DONE;
-- 
2.21.0.rc0.258.g878e2cd30e-goog



[PATCH net-next] net: remove unused struct inet_frag_queue.fragments field

2019-02-25 Thread Peter Oskolkov
Now that all users of struct inet_frag_queue have been converted
to use 'rb_fragments', remove the unused 'fragments' field.

Build with `make allyesconfig` succeeded. ip_defrag selftest passed.

Signed-off-by: Peter Oskolkov 
---
 include/net/inet_frag.h |  4 +--
 net/ieee802154/6lowpan/reassembly.c |  1 -
 net/ipv4/inet_fragment.c| 44 -
 net/ipv4/ip_fragment.c  |  2 --
 net/ipv6/netfilter/nf_conntrack_reasm.c |  1 -
 net/ipv6/reassembly.c   |  1 -
 6 files changed, 14 insertions(+), 39 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index b02bf737d019..378904ee9129 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -56,7 +56,6 @@ struct frag_v6_compare_key {
  * @timer: queue expiration timer
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
- * @fragments: received fragments head
  * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
  * @last_run_head: the head of the last "run". see ip_fragment.c
@@ -77,8 +76,7 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;  /* used in 6lopwpan IPv6. */
-   struct rb_root  rb_fragments; /* Used in IPv4/IPv6. */
+   struct rb_root  rb_fragments;
struct sk_buff  *fragments_tail;
struct sk_buff  *last_run_head;
ktime_t stamp;
diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index bd61633d2c32..4196bcd4105a 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -179,7 +179,6 @@ static int lowpan_frag_reasm(struct lowpan_frag_queue *fq, 
struct sk_buff *skb,
 
skb->dev = ldev;
skb->tstamp = fq->q.stamp;
-   fq->q.fragments = NULL;
fq->q.rb_fragments = RB_ROOT;
fq->q.fragments_tail = NULL;
fq->q.last_run_head = NULL;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 9f69411251d0..737808e27f8b 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -203,7 +203,6 @@ EXPORT_SYMBOL(inet_frag_rbtree_purge);
 
 void inet_frag_destroy(struct inet_frag_queue *q)
 {
-   struct sk_buff *fp;
struct netns_frags *nf;
unsigned int sum, sum_truesize = 0;
struct inet_frags *f;
@@ -212,20 +211,9 @@ void inet_frag_destroy(struct inet_frag_queue *q)
WARN_ON(del_timer(&q->timer) != 0);
 
/* Release all fragment data. */
-   fp = q->fragments;
nf = q->net;
f = nf->f;
-   if (fp) {
-   do {
-   struct sk_buff *xp = fp->next;
-
-   sum_truesize += fp->truesize;
-   kfree_skb(fp);
-   fp = xp;
-   } while (fp);
-   } else {
-   sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
-   }
+   sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
sum = sum_truesize + f->qsize;
 
call_rcu(&q->rcu, inet_frag_destroy_rcu);
@@ -489,26 +477,20 @@ EXPORT_SYMBOL(inet_frag_reasm_finish);
 
 struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q)
 {
-   struct sk_buff *head;
+   struct sk_buff *head, *skb;
 
-   if (q->fragments) {
-   head = q->fragments;
-   q->fragments = head->next;
-   } else {
-   struct sk_buff *skb;
+   head = skb_rb_first(&q->rb_fragments);
+   if (!head)
+   return NULL;
+   skb = FRAG_CB(head)->next_frag;
+   if (skb)
+   rb_replace_node(&head->rbnode, &skb->rbnode,
+   &q->rb_fragments);
+   else
+   rb_erase(&head->rbnode, &q->rb_fragments);
+   memset(&head->rbnode, 0, sizeof(head->rbnode));
+   barrier();
 
-   head = skb_rb_first(&q->rb_fragments);
-   if (!head)
-   return NULL;
-   skb = FRAG_CB(head)->next_frag;
-   if (skb)
-   rb_replace_node(&head->rbnode, &skb->rbnode,
-   &q->rb_fragments);
-   else
-   rb_erase(&head->rbnode, &q->rb_fragments);
-   memset(&head->rbnode, 0, sizeof(head->rbnode));
-   barrier();
-   }
if (head == q->fragments_tail)
q->fragments_tail = NULL;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 486ecb0aeb87..cf2b0a6a3337 100644
--- a/net

Re: [PATCH bpf-next] bpf: fix memory leak in bpf_lwt_xmit_reroute

2019-03-04 Thread Peter Oskolkov
On Sun, Mar 3, 2019 at 6:55 PM Willem de Bruijn
 wrote:
>
> On Fri, Mar 1, 2019 at 9:27 PM David Ahern  wrote:
> >
> > On 2/28/19 10:57 AM, Peter Oskolkov wrote:
> > > David: I'm not sure how to test GSO (I assume we are talking about GSO
> > > here) in
> > > the selftest: the encapping code sets SKB_GSO_DODGY flag, and veth does
> > > not support
> > > dodginess: "tx-gso-robust: off [fixed]".
> > >
> > > If the "dodgy" flag is not set, then gso validation in dev.c passes, and
> > > large GSO packets
> > > happily go through; if the "dodgy" flag is set, "dodgy" GSO packets are
> > > rejected, TCP does
> > > segmentation, and non-GSO packets happily go through (with an mtu tweak
> > > to the LWT tunnel).
>
> Very few devices unconditionally accept dodgy packets (only veth?).
>
> A device that lacks the robust gso feature will cause a gso packet
> with dodgy flag to enter software gso instead of passing to device
> segmentation offload.
>
> That should be perfect for checking that the packets can be segmented
> correctly with the new header.
>
> If the gso layer drops the packets, that is not due to dropping all
> dodgy sources. It will be dropped somewhere else inside gso,
> indication that something is not as expected with the packet.
>
> > > So I see three options:
> > > - add a sysctl to _not_ set SKB_GSO_DODGY flag in lwt_bpf.c =>
> > > handle_gso_type();
> > > - change veth to accept "dodgy" GSO packets
>
> Neither, as these would bypass segmentation offload and pass the large
> packet to the receive path. It is more interesting to validate the
> packet in gso.

I found the problem: skb->inner_protocol was not set, so the software GSO
fallback failed.  I have a patch that fixes the issue: IPIP+GRE+TCP
GSO works! net-next is closed though... Will have to wait for net-next
to reopen.
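
For reference, the gist of that fix (the full patch appears later in
this archive as "net: fix GSO in bpf_lwt_push_ip_encap") is to make the
inner headers and the inner protocol visible to the software GSO path
before the outer headers are pushed on. A condensed, hedged sketch of
the relevant lines (kernel context assumed; encap_len stands in for the
length of the pushed headers):

    /* Inside bpf_lwt_push_ip_encap()-like code, before skb_push():
     * record where the inner headers start and what protocol they
     * carry, so gre_gso_segment() can later compute
     *   tnl_hlen = skb_inner_mac_header(skb) - skb_transport_header(skb);
     */
    skb_reset_inner_headers(skb);               /* inner network/transport */
    skb_reset_inner_mac_header(skb);            /* mac header not set yet  */
    skb_set_inner_protocol(skb, skb->protocol); /* e.g. ETH_P_IP for IPIP  */
    skb->encapsulation = 1;
    skb_push(skb, encap_len);                   /* now add the outer headers */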


>
> > > - test the code "as is", meaning that GSO will be tried and disabled by
> > > TCP stack
> > >
> > > Which approach would you prefer?
> > >
> >
> > definitely not a sysctl.
> >
> > After that, I don't have a suggestion for GSO at the moment.


Re: [PATCH bpf-next] bpf: fix memory leak in bpf_lwt_xmit_reroute

2019-03-04 Thread Peter Oskolkov
On Mon, Mar 4, 2019 at 1:03 PM David Ahern  wrote:
>
> On 3/4/19 1:39 PM, Peter Oskolkov wrote:
> > I found the problem: skb->inner_protocol was not set, so software GSO
> > fallback failed.  I have a patch that fixes the issue: IPIP+GRE+TCP
> > gso works! net-next is closed though... Will have to wait for net-next
> > to reopen.
>
> That's a bug fix. I suggest sending now.

I see the encap patches neither in net nor in bpf trees, only in
net-next and bpf-next. And *-next trees are closed, so there is
nowhere to send the fix. Am I missing something?


[PATCH net-next (fix) 2/2] selftests/bpf: test that GSO works in lwt_ip_encap

2019-03-04 Thread Peter Oskolkov
Add a test on egress that a large TCP packet successfully goes through
the lwt+bpf encap tunnel.

Although there is no direct evidence that GSO worked, as opposed to
e.g. TCP segmentation or IP fragmentation (maybe a kernel stats counter
should be added to track the number of failed GSO attempts?), without
the previous patch in the patchset this test fails, and printk-debugging
showed that software-based GSO succeeded here (veth is not compatible with
SKB_GSO_DODGY, so GSO happens in the software stack).

Also removed an unnecessary nodad and added a missing failure flag.

Signed-off-by: Peter Oskolkov 
---
 .../selftests/bpf/test_lwt_ip_encap.sh| 54 ++-
 1 file changed, 52 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh 
b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
index 612632c1425f..d4d3391cc13a 100755
--- a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -78,6 +78,8 @@ TEST_STATUS=0
 TESTS_SUCCEEDED=0
 TESTS_FAILED=0
 
+TMPFILE=""
+
 process_test_results()
 {
if [[ "${TEST_STATUS}" -eq 0 ]] ; then
@@ -147,7 +149,6 @@ setup()
ip -netns ${NS2} -6 addr add ${IPv6_7}/128 nodad dev veth7
ip -netns ${NS3} -6 addr add ${IPv6_8}/128 nodad dev veth8
 
-
ip -netns ${NS1} link set dev veth1 up
ip -netns ${NS2} link set dev veth2 up
ip -netns ${NS2} link set dev veth3 up
@@ -205,7 +206,7 @@ setup()
# configure IPv4 GRE device in NS3, and a route to it via the "bottom" 
route
ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local 
${IPv4_GRE} ttl 255
ip -netns ${NS3} link set gre_dev up
-   ip -netns ${NS3} addr add ${IPv4_GRE} nodad dev gre_dev
+   ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
 
@@ -222,12 +223,18 @@ setup()
ip netns exec ${NS2} sysctl -wq net.ipv4.conf.all.rp_filter=0
ip netns exec ${NS3} sysctl -wq net.ipv4.conf.all.rp_filter=0
 
+   TMPFILE=$(mktemp /tmp/test_lwt_ip_encap.XXXXXX)
+
sleep 1  # reduce flakiness
set +e
 }
 
 cleanup()
 {
+   if [ -f ${TMPFILE} ] ; then
+   rm ${TMPFILE}
+   fi
+
ip netns del ${NS1} 2> /dev/null
ip netns del ${NS2} 2> /dev/null
ip netns del ${NS3} 2> /dev/null
@@ -278,6 +285,46 @@ test_ping()
fi
 }
 
+test_gso()
+{
+   local readonly PROTO=$1
+   local readonly PKT_SZ=5000
+   local IP_DST=""
+   : > ${TMPFILE}  # trim the capture file
+
+   # check that nc is present
+   command -v nc >/dev/null 2>&1 || \
+   { echo >&2 "nc is not available: skipping TSO tests"; return; }
+
+   # listen on IPv*_DST, capture TCP into $TMPFILE
+   if [ "${PROTO}" == "IPv4" ] ; then
+   IP_DST=${IPv4_DST}
+   ip netns exec ${NS3} bash -c \
+   "nc -4 -l -s ${IPv4_DST} -p 9000 > ${TMPFILE} &"
+   elif [ "${PROTO}" == "IPv6" ] ; then
+   IP_DST=${IPv6_DST}
+   ip netns exec ${NS3} bash -c \
+   "nc -6 -l -s ${IPv6_DST} -p 9000 > ${TMPFILE} &"
+   RET=$?
+   else
+   echo "test_gso: unknown PROTO: ${PROTO}"
+   TEST_STATUS=1
+   fi
+   sleep 1  # let nc start listening
+
+   # send a packet larger than MTU
+   ip netns exec ${NS1} bash -c \
+   "dd if=/dev/zero bs=$PKT_SZ count=1 > /dev/tcp/${IP_DST}/9000 
2>/dev/null"
+   sleep 2 # let the packet get delivered
+
+   # verify we received all expected bytes
+   SZ=$(stat -c %s ${TMPFILE})
+   if [ "$SZ" != "$PKT_SZ" ] ; then
+   echo "test_gso failed: ${PROTO}"
+   TEST_STATUS=1
+   fi
+}
+
 test_egress()
 {
local readonly ENCAP=$1
@@ -307,6 +354,8 @@ test_egress()
fi
test_ping IPv4 0
test_ping IPv6 0
+   test_gso IPv4
+   test_gso IPv6
 
# a negative test: remove routes to GRE devices: ping fails
remove_routes_to_gredev
@@ -350,6 +399,7 @@ test_ingress()
ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj 
test_lwt_ip_encap.o sec encap_gre6 dev veth2
else
echo "FAIL: unknown encap ${ENCAP}"
+   TEST_STATUS=1
fi
test_ping IPv4 0
test_ping IPv6 0
-- 
2.21.0.352.gf09ad66450-goog



[PATCH net-next (fix) 0/2] fix GSO bpf_lwt_ip_encap

2019-03-04 Thread Peter Oskolkov
This is a small fix and a test. Sent to net-next because
the offending patch is not in net yet.

Peter Oskolkov (2):
  net: fix GSO in bpf_lwt_push_ip_encap
  selftests/bpf: test that GSO works in lwt_ip_encap

 net/core/lwt_bpf.c|  2 +
 .../selftests/bpf/test_lwt_ip_encap.sh| 51 ++-
 2 files changed, 51 insertions(+), 2 deletions(-)

-- 
2.21.0.352.gf09ad66450-goog



[PATCH net-next (fix) 1/2] net: fix GSO in bpf_lwt_push_ip_encap

2019-03-04 Thread Peter Oskolkov
GSO needs inner headers and inner protocol set properly to work.

skb->inner_mac_header: skb_reset_inner_headers() assigns the current
mac header value to inner_mac_header; but it is not set at the point,
so we need to call skb_reset_inner_mac_header, otherwise gre_gso_segment
fails: it does

int tnl_hlen = skb_inner_mac_header(skb) - skb_transport_header(skb);
...
if (unlikely(!pskb_may_pull(skb, tnl_hlen)))
...

skb->inner_protocol should also be correctly set.

Fixes: ca78801a81e0 ("bpf: handle GSO in bpf_lwt_push_encap")
Signed-off-by: Peter Oskolkov 
---
 net/core/lwt_bpf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index cf2f8897ca19..126d31ff5ee3 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -625,6 +625,8 @@ int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, 
u32 len, bool ingress)
 
/* push the encap headers and fix pointers */
skb_reset_inner_headers(skb);
+   skb_reset_inner_mac_header(skb);  /* mac header is not yet set */
+   skb_set_inner_protocol(skb, skb->protocol);
skb->encapsulation = 1;
skb_push(skb, len);
if (ingress)
-- 
2.21.0.352.gf09ad66450-goog



[PATCH bpf] bpf: make bpf_skb_ecn_set_ce callable from BPF_PROG_TYPE_SCHED_ACT

2019-03-14 Thread Peter Oskolkov
This helper is useful if a bpf tc filter sets skb->tstamp.

Signed-off-by: Peter Oskolkov 
---
 net/core/filter.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 647c63a7b25b..c6d016d9c4b8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5719,6 +5719,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_tcp_sock_proto;
case BPF_FUNC_get_listener_sock:
return &bpf_get_listener_sock_proto;
+   case BPF_FUNC_skb_ecn_set_ce:
+   return &bpf_skb_ecn_set_ce_proto;
 #endif
default:
return bpf_base_func_proto(func_id);
-- 
2.21.0.360.g471c308f928-goog



Re: [PATCH bpf] bpf: make bpf_skb_ecn_set_ce callable from BPF_PROG_TYPE_SCHED_ACT

2019-03-15 Thread Peter Oskolkov
On Fri, Mar 15, 2019 at 9:52 AM Martin Lau  wrote:
>
> On Thu, Mar 14, 2019 at 05:28:58PM -0700, Peter Oskolkov wrote:
> > This helper is useful if a bpf tc filter sets skb->tstamp.
> >
> For the patch,
> Acked-by: Martin KaFai Lau 
>
> Not sure if it should belong to bpf-next material though.

Thanks, Martin! I consider it a bug that the helper is not available
in TC context... ;-)
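
As a concrete illustration of the use case, here is a minimal tc-bpf
sketch (not part of the patch; it mirrors progs/test_tc_edt.c from the
series sent later in this archive, and the 5 ms horizon is an
illustrative value):

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include "bpf_helpers.h"

    #define ECN_HORIZON_NS 5000000ULL  /* mark CE beyond ~5ms of queueing */

    SEC("cls_edt_ce")
    int edt_ce(struct __sk_buff *skb)
    {
            __u64 now = bpf_ktime_get_ns();

            /* skb->tstamp holds the earliest departure time set by the
             * shaper; if the packet is scheduled far into the future,
             * signal congestion via CE instead of dropping it.
             */
            if (skb->tstamp > now + ECN_HORIZON_NS)
                    bpf_skb_ecn_set_ce(skb);

            return TC_ACT_OK;
    }

    char __license[] SEC("license") = "GPL";

The patch itself only exposes the existing helper to
tc_cls_act_func_proto(); the sketch assumes a clsact egress attachment.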


[PATCH bpf-next] bpf: make bpf_skb_ecn_set_ce callable from BPF_PROG_TYPE_SCHED_ACT

2019-03-20 Thread Peter Oskolkov
This helper is useful if a bpf tc filter sets skb->tstamp.

Signed-off-by: Peter Oskolkov 
---
 net/core/filter.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 647c63a7b25b..c6d016d9c4b8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5719,6 +5719,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_tcp_sock_proto;
case BPF_FUNC_get_listener_sock:
return &bpf_get_listener_sock_proto;
+   case BPF_FUNC_skb_ecn_set_ce:
+   return &bpf_skb_ecn_set_ce_proto;
 #endif
default:
return bpf_base_func_proto(func_id);
-- 
2.21.0.225.g810b269d1ac-goog



[PATCH bpf-next 1/2] bpf: make bpf_skb_ecn_set_ce callable from BPF_PROG_TYPE_SCHED_ACT

2019-03-22 Thread Peter Oskolkov
This helper is useful if a bpf tc filter sets skb->tstamp.

Signed-off-by: Peter Oskolkov 
---
 net/core/filter.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index c1d19b074d6c..0a972fbf60df 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5959,6 +5959,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_skc_lookup_tcp_proto;
case BPF_FUNC_tcp_check_syncookie:
return &bpf_tcp_check_syncookie_proto;
+   case BPF_FUNC_skb_ecn_set_ce:
+   return &bpf_skb_ecn_set_ce_proto;
 #endif
default:
return bpf_base_func_proto(func_id);
-- 
2.21.0.392.gf8f6787159e-goog



[PATCH bpf-next 2/2] selftests: bpf: tc-bpf flow shaping with EDT

2019-03-22 Thread Peter Oskolkov
Add a small test that shows how to shape a TCP flow in tc-bpf
with EDT and ECN.

Signed-off-by: Peter Oskolkov 
---
 tools/testing/selftests/bpf/Makefile  |   3 +-
 .../testing/selftests/bpf/progs/test_tc_edt.c | 109 ++
 tools/testing/selftests/bpf/test_tc_edt.sh|  99 
 3 files changed, 210 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_edt.c
 create mode 100755 tools/testing/selftests/bpf/test_tc_edt.sh

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index cdcc54ddf4b9..77b73b892136 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -53,7 +53,8 @@ TEST_PROGS := test_kmod.sh \
test_xdp_vlan.sh \
test_lwt_ip_encap.sh \
test_tcp_check_syncookie.sh \
-   test_tc_tunnel.sh
+   test_tc_tunnel.sh \
+   test_tc_edt.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
with_tunnels.sh \
diff --git a/tools/testing/selftests/bpf/progs/test_tc_edt.c 
b/tools/testing/selftests/bpf/progs/test_tc_edt.c
new file mode 100644
index ..3af64c470d64
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tc_edt.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+/* the maximum delay we are willing to add (drop packets beyond that) */
+#define TIME_HORIZON_NS (2000 * 1000 * 1000)
+#define NS_PER_SEC 1000000000
+#define ECN_HORIZON_NS 5000000
+#define THROTTLE_RATE_BPS (5 * 1000 * 1000)
+
+/* flow_key => last_tstamp timestamp used */
+struct bpf_map_def SEC("maps") flow_map = {
+   .type = BPF_MAP_TYPE_HASH,
+   .key_size = sizeof(uint32_t),
+   .value_size = sizeof(uint64_t),
+   .max_entries = 1,
+};
+
+static inline int throttle_flow(struct __sk_buff *skb)
+{
+   int key = 0;
+   uint64_t *last_tstamp = bpf_map_lookup_elem(&flow_map, &key);
+   uint64_t delay_ns = ((uint64_t)skb->len) * NS_PER_SEC /
+   THROTTLE_RATE_BPS;
+   uint64_t now = bpf_ktime_get_ns();
+   uint64_t tstamp, next_tstamp = 0;
+
+   if (last_tstamp)
+   next_tstamp = *last_tstamp + delay_ns;
+
+   tstamp = skb->tstamp;
+   if (tstamp < now)
+   tstamp = now;
+
+   /* should we throttle? */
+   if (next_tstamp <= tstamp) {
+   if (bpf_map_update_elem(&flow_map, &key, &tstamp, BPF_ANY))
+   return TC_ACT_SHOT;
+   return TC_ACT_OK;
+   }
+
+   /* do not queue past the time horizon */
+   if (next_tstamp - now >= TIME_HORIZON_NS)
+   return TC_ACT_SHOT;
+
+   /* set ecn bit, if needed */
+   if (next_tstamp - now >= ECN_HORIZON_NS)
+   bpf_skb_ecn_set_ce(skb);
+
+   if (bpf_map_update_elem(&flow_map, &key, &next_tstamp, BPF_EXIST))
+   return TC_ACT_SHOT;
+   skb->tstamp = next_tstamp;
+
+   return TC_ACT_OK;
+}
+
+static inline int handle_tcp(struct __sk_buff *skb, struct tcphdr *tcp)
+{
+   void *data_end = (void *)(long)skb->data_end;
+
+   /* drop malformed packets */
+   if ((void *)(tcp + 1) > data_end)
+   return TC_ACT_SHOT;
+
+   if (tcp->dest == bpf_htons(9000))
+   return throttle_flow(skb);
+
+   return TC_ACT_OK;
+}
+
+static inline int handle_ipv4(struct __sk_buff *skb)
+{
+   void *data_end = (void *)(long)skb->data_end;
+   void *data = (void *)(long)skb->data;
+   struct iphdr *iph;
+   uint32_t ihl;
+
+   /* drop malformed packets */
+   if (data + sizeof(struct ethhdr) > data_end)
+   return TC_ACT_SHOT;
+   iph = (struct iphdr *)(data + sizeof(struct ethhdr));
+   if ((void *)(iph + 1) > data_end)
+   return TC_ACT_SHOT;
+   ihl = iph->ihl * 4;
+   if (((void *)iph) + ihl > data_end)
+   return TC_ACT_SHOT;
+
+   if (iph->protocol == IPPROTO_TCP)
+   return handle_tcp(skb, (struct tcphdr *)(((void *)iph) + ihl));
+
+   return TC_ACT_OK;
+}
+
+SEC("cls_test") int tc_prog(struct __sk_buff *skb)
+{
+   if (skb->protocol == bpf_htons(ETH_P_IP))
+   return handle_ipv4(skb);
+
+   return TC_ACT_OK;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tc_edt.sh 
b/tools/testing/selftests/bpf/test_tc_edt.sh
new file mode 100755
index ..f38567ef694b
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tc_edt.sh
@@ -0,0 +1,99 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# This test installs a TC bpf program that throttles a TCP flow
+# with dst port = 9000 down to 5MBps. Then it measures actual
+# throughput of the 

Re: [PATCH net] ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-05-03 Thread Peter Oskolkov
On Fri, May 3, 2019 at 4:47 AM Eric Dumazet  wrote:
>
> Since ip6frag_expire_frag_queue() now pulls the head skb
> from frag queue, we should no longer use skb_get(), since
> this leads to an skb leak.
>
> Stefan Bader initially reported a problem in 4.4.stable [1] caused
> by the skb_get(), so this patch should also fix this issue.
>
> 296583.091021] kernel BUG at 
> /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
> [296583.091734] Call Trace:
> [296583.091749]  [] __pskb_pull_tail+0x50/0x350
> [296583.091764]  [] _decode_session6+0x26a/0x400
> [296583.091779]  [] __xfrm_decode_session+0x39/0x50
> [296583.091795]  [] icmpv6_route_lookup+0xf0/0x1c0
> [296583.091809]  [] icmp6_send+0x5e1/0x940
> [296583.091823]  [] ? __netif_receive_skb+0x18/0x60
> [296583.091838]  [] ? netif_receive_skb_internal+0x32/0xa0
> [296583.091858]  [] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
> [296583.091876]  [] ? nf_ct_net_exit+0x50/0x50 
> [nf_defrag_ipv6]
> [296583.091893]  [] icmpv6_send+0x21/0x30
> [296583.091906]  [] ip6_expire_frag_queue+0xe0/0x120
> [296583.091921]  [] nf_ct_frag6_expire+0x1f/0x30 
> [nf_defrag_ipv6]
> [296583.091938]  [] call_timer_fn+0x37/0x140
> [296583.091951]  [] ? nf_ct_net_exit+0x50/0x50 
> [nf_defrag_ipv6]
> [296583.091968]  [] run_timer_softirq+0x234/0x330
> [296583.091982]  [] __do_softirq+0x109/0x2b0
>
> Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")
> Signed-off-by: Eric Dumazet 
> Reported-by: Stefan Bader 
> Cc: Peter Oskolkov 
> Cc: Florian Westphal 
> ---
>  include/net/ipv6_frag.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
> index 
> 28aa9b30aeceac9a86ee6754e4b5809be115e947..1f77fb4dc79df6bc4e41d6d2f4d49ace32082ca4
>  100644
> --- a/include/net/ipv6_frag.h
> +++ b/include/net/ipv6_frag.h
> @@ -94,7 +94,6 @@ ip6frag_expire_frag_queue(struct net *net, struct 
> frag_queue *fq)
> goto out;
>
> head->dev = dev;
> -   skb_get(head);

This skb_get was introduced by commit 05c0b86b9696802fd0ce5676a92a63f1b455bdf3
("ipv6: frags: rewrite ip6_expire_frag_queue()"), and the rbtree patch
is not in 4.4, where the bug was reported.
Shouldn't the "Fixes" tag also reference the original patch?


> spin_unlock(&fq->q.lock);
>
> icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
> --
> 2.21.0.1020.gf2820cf01a-goog
>


Re: [PATCH net] ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-05-03 Thread Peter Oskolkov
On Fri, May 3, 2019 at 8:52 AM Eric Dumazet  wrote:
>
> On Fri, May 3, 2019 at 11:33 AM Peter Oskolkov  wrote:
> >
> > This skb_get was introduced by commit 
> > 05c0b86b9696802fd0ce5676a92a63f1b455bdf3
> > "ipv6: frags: rewrite ip6_expire_frag_queue()", and the rbtree patch
> > is not in 4.4, where the bug is reported at.
> > Shouldn't the "Fixes" tag also reference the original patch?
>
> No, this bug really fixes a memory leak.
>
> Fact that it also fixes the XFRM issue is secondary, since all your
> patches are being backported in stable
> trees anyway for other reasons.

There are no plans to backport rbtree patches to 4.4 and earlier at
the moment, afaik.

>
> There is no need to list all commits and give a complete context for a
> bug fix like this one,
> this would be quite noisy.


[PATCH net-next 0/4] net: IP defrag: use rbtrees in IPv6 defragmentation

2019-01-22 Thread Peter Oskolkov
Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break compatibility
with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patchset contains four patches:
- patch 1 moves rbtree-related code from IPv4 to files shared b/w
IPv4/IPv6
- patch 2 changes IPv6 defragmenation code to use rbtrees for defrag
queue
- patch 3 changes nf_conntrack IPv6 defragmentation code to use rbtrees
- patch 4 changes ip_defrag selftest to test changes made in the
previous three patches.

Along the way, the 1280-byte restrictions are removed.

I plan to introduce similar changes to 6lowpan defragmentation code
once I figure out how to test it.
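
To illustrate the bookkeeping that patch 1 below introduces (see the
FRAG_CB comment there), here is a toy userspace model of the "runs"
invariant: consecutive fragments chain off a single rb-tree node, and
only run heads live in the tree. The names are illustrative, not the
kernel API:

    #include <stdio.h>

    struct frag { int offset, len; struct frag *next_frag; int run_len; };

    int main(void)
    {
            struct frag f0 = { 0, 8 }, f8 = { 8, 8 }, f40 = { 40, 8 };

            /* f0 starts a run: it would be linked into the rb-tree. */
            f0.run_len = f0.len;
            /* f8 is adjacent (0 + 8 == 8): chained to f0's run, no new node. */
            f0.next_frag = &f8;
            f0.run_len += f8.len;
            /* f40 is not adjacent (8 + 8 != 40): it starts a second run. */
            f40.run_len = f40.len;

            printf("run@0 spans %d bytes; run@40 spans %d bytes\n",
                   f0.run_len, f40.run_len);
            return 0;
    }

In the kernel code, fragrun_append_to_last() and fragrun_create() in
patch 1 implement exactly these two cases.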

Peter Oskolkov (4):
  net: IP defrag: encapsulate rbtree defrag code into callable functions
  net: IP6 defrag: use rbtrees for IPv6 defrag
  net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c
  selftests: net: ip_defrag:  cover new IPv6 defrag behavior

 include/net/inet_frag.h  |  16 +-
 include/net/ipv6_frag.h  |  11 +-
 net/ipv4/inet_fragment.c | 293 +++
 net/ipv4/ip_fragment.c   | 289 +++---
 net/ipv6/netfilter/nf_conntrack_reasm.c  | 260 ++--
 net/ipv6/reassembly.c| 233 +-
 tools/testing/selftests/net/ip_defrag.c  |  69 +++---
 tools/testing/selftests/net/ip_defrag.sh |  16 ++
 8 files changed, 527 insertions(+), 660 deletions(-)

-- 
2.20.1.321.g9e740568ce-goog



[PATCH net-next 3/4] net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c

2019-01-22 Thread Peter Oskolkov
Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IP6 defragmentation in nf_conntrack, removing the 1280 byte
restriction.

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 260 +++-
 1 file changed, 71 insertions(+), 189 deletions(-)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 181da2c40f9a..cb1b4772dac0 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -136,6 +136,9 @@ static void __net_exit 
nf_ct_frags6_sysctl_unregister(struct net *net)
 }
 #endif
 
+static int nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
+
 static inline u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 {
return 1 << (ipv6_get_dsfield(ipv6h) & INET_ECN_MASK);
@@ -177,9 +180,10 @@ static struct frag_queue *fq_find(struct net *net, __be32 
id, u32 user,
 static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
 const struct frag_hdr *fhdr, int nhoff)
 {
-   struct sk_buff *prev, *next;
unsigned int payload_len;
-   int offset, end;
+   struct net_device *dev;
+   struct sk_buff *prev;
+   int offset, end, err;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE) {
@@ -254,55 +258,18 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, 
struct sk_buff *skb,
goto err;
}
 
-   /* Find out which fragments are in front and at the back of us
-* in the chain of fragments so far.  We must know where to put
-* this fragment, right?
-*/
-   prev = fq->q.fragments_tail;
-   if (!prev || prev->ip_defrag_offset < offset) {
-   next = NULL;
-   goto found;
-   }
-   prev = NULL;
-   for (next = fq->q.fragments; next != NULL; next = next->next) {
-   if (next->ip_defrag_offset >= offset)
-   break;  /* bingo! */
-   prev = next;
-   }
-
-found:
-   /* RFC5722, Section 4:
-*  When reassembling an IPv6 datagram, if
-*   one or more its constituent fragments is determined to be an
-*   overlapping fragment, the entire datagram (and any constituent
-*   fragments, including those not yet received) MUST be silently
-*   discarded.
-*/
-
-   /* Check for overlap with preceding fragment. */
-   if (prev &&
-   (prev->ip_defrag_offset + prev->len) > offset)
-   goto discard_fq;
-
-   /* Look for overlap with succeeding segment. */
-   if (next && next->ip_defrag_offset < end)
-   goto discard_fq;
-
-   /* Note : skb->ip_defrag_offset and skb->dev share the same location */
-   if (skb->dev)
-   fq->iif = skb->dev->ifindex;
+   /* Note : skb->rbnode and skb->dev share the same location. */
+   dev = skb->dev;
/* Makes sure compiler wont do silly aliasing games */
barrier();
-   skb->ip_defrag_offset = offset;
 
-   /* Insert this fragment in the chain of fragments. */
-   skb->next = next;
-   if (!next)
-   fq->q.fragments_tail = skb;
-   if (prev)
-   prev->next = skb;
-   else
-   fq->q.fragments = skb;
+   prev = fq->q.fragments_tail;
+   err = inet_frag_queue_insert(&fq->q, skb, offset, end);
+   if (err)
+   goto insert_error;
+
+   if (dev)
+   fq->iif = dev->ifindex;
 
fq->q.stamp = skb->tstamp;
fq->q.meat += skb->len;
@@ -319,11 +286,25 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, 
struct sk_buff *skb,
fq->q.flags |= INET_FRAG_FIRST_IN;
}
 
-   return 0;
+   if (fq->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
+   fq->q.meat == fq->q.len) {
+   unsigned long orefdst = skb->_skb_refdst;
+
+   skb->_skb_refdst = 0UL;
+   err = nf_ct_frag6_reasm(fq, skb, prev, dev);
+   skb->_skb_refdst = orefdst;
+   return err;
+   }
+
+   skb_dst_drop(skb);
+   return -EINPROGRESS;
 
-discard_fq:
+insert_error:
+   if (err == IPFRAG_DUP)

[PATCH net-next 2/4] net: IP6 defrag: use rbtrees for IPv6 defrag

2019-01-22 Thread Peter Oskolkov
Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implementations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IPv6, removing the 1280 byte restriction.

Signed-off-by: Peter Oskolkov 
Reported-by: Tom Herbert 
Cc: Eric Dumazet 
Cc: Florian Westphal 
---
 include/net/ipv6_frag.h |  11 +-
 net/ipv6/reassembly.c   | 233 +++-
 2 files changed, 71 insertions(+), 173 deletions(-)

diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 6ced1e6899b6..28aa9b30aece 100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -82,8 +82,15 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue 
*fq)
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
/* Don't send error if the first segment did not arrive. */
-   head = fq->q.fragments;
-   if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!(fq->q.flags & INET_FRAG_FIRST_IN))
+   goto out;
+
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   head = inet_frag_pull_head(&fq->q);
+   if (!head)
goto out;
 
head->dev = dev;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 36a3d8dc61f5..24264d0a4b85 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -69,8 +69,8 @@ static u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 
 static struct inet_frags ip6_frags;
 
-static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
- struct net_device *dev);
+static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *skb,
+ struct sk_buff *prev_tail, struct net_device *dev);
 
 static void ip6_frag_expire(struct timer_list *t)
 {
@@ -111,21 +111,26 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
  struct frag_hdr *fhdr, int nhoff,
  u32 *prob_offset)
 {
-   struct sk_buff *prev, *next;
-   struct net_device *dev;
-   int offset, end, fragsize;
struct net *net = dev_net(skb_dst(skb)->dev);
+   int offset, end, fragsize;
+   struct sk_buff *prev_tail;
+   struct net_device *dev;
+   int err = -ENOENT;
u8 ecn;
 
if (fq->q.flags & INET_FRAG_COMPLETE)
goto err;
 
+   err = -EINVAL;
offset = ntohs(fhdr->frag_off) & ~0x7;
end = offset + (ntohs(ipv6_hdr(skb)->payload_len) -
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
if ((unsigned int)end > IPV6_MAXPLEN) {
*prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
+   /* note that if prob_offset is set, the skb is freed elsewhere,
+* we do not free it here.
+*/
return -1;
}
 
@@ -170,62 +175,27 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
if (end == offset)
goto discard_fq;
 
+   err = -ENOMEM;
/* Point into the IP datagram 'data' part. */
if (!pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data))
goto discard_fq;
 
-   if (pskb_trim_rcsum(skb, end - offset))
+   err = pskb_trim_rcsum(skb, end - offset);
+   if (err)
goto discard_fq;
 
-   /* Find out which fragments are in front and at the back of us
-* in the chain of fragments so far.  We must know where to put
-* this fragment, right?
-*/
-   prev = fq->q.fragments_tail;
-   if (!prev || prev->ip_defrag_offset < offset) {
-   next = NULL;
-   goto found;
-   }
-   prev = NULL;
-   for (next = fq->q.fragments; next != NULL; next = next->next) {
-   if (next->ip_defrag_offset >= offset)
-   break;  /* bingo! */
-   prev = next;
-   }
-
-found:
-   /* RFC5722, Section 4, amended by Errata ID : 3089
-*  When reassembling an IPv6 datagram, if
-*   one or more its constituent fragments is determined to be an
-*   overlapping fragment, the entire datagram (and any constituent
-*   fragments) MUST be silently discarded.
-*/
-
-   /* Check for overlap with preceding fragment. */
-   if (prev &&
-   (prev->ip_defrag_offset + prev->len) > offset)
-   goto discard_fq;
-

[PATCH net-next 4/4] selftests: net: ip_defrag: cover new IPv6 defrag behavior

2019-01-22 Thread Peter Oskolkov
This patch adds several changes to the ip_defrag selftest, to cover
new IPv6 defrag behavior:

- min IPv6 frag size is now 8 instead of 1280

- new test cases to cover IPv6 defragmentation in nf_conntrack_reasm.c

- new "permissive" mode in negative (overlap) tests: netfilter
sometimes drops invalid packets without passing them to IPv6
underneath, and thus defragmentation sometimes succeeds when
it is expected to fail; so the permissive mode does not fail the
test if the correct reassembled datagram is received instead of a
timeout.

Signed-off-by: Peter Oskolkov 
---
 tools/testing/selftests/net/ip_defrag.c  | 69 
 tools/testing/selftests/net/ip_defrag.sh | 16 ++
 2 files changed, 51 insertions(+), 34 deletions(-)

diff --git a/tools/testing/selftests/net/ip_defrag.c 
b/tools/testing/selftests/net/ip_defrag.c
index 5d56cc0838f6..c0c9ecb891e1 100644
--- a/tools/testing/selftests/net/ip_defrag.c
+++ b/tools/testing/selftests/net/ip_defrag.c
@@ -20,6 +20,7 @@ static bool   cfg_do_ipv4;
 static boolcfg_do_ipv6;
 static boolcfg_verbose;
 static boolcfg_overlap;
+static boolcfg_permissive;
 static unsigned short  cfg_port = 9000;
 
 const struct in_addr addr4 = { .s_addr = __constant_htonl(INADDR_LOOPBACK + 2) 
};
@@ -35,7 +36,7 @@ const struct in6_addr addr6 = IN6ADDR_LOOPBACK_INIT;
 static int payload_len;
 static int max_frag_len;
 
-#define MSG_LEN_MAX 60000  /* Max UDP payload length. */
+#define MSG_LEN_MAX 10000  /* Max UDP payload length. */
 
 #define IP4_MF (1u << 13)  /* IPv4 MF flag. */
 #define IP6_MF (1)  /* IPv6 MF flag. */
@@ -59,13 +60,14 @@ static void recv_validate_udp(int fd_udp)
msg_counter++;
 
if (cfg_overlap) {
-   if (ret != -1)
-   error(1, 0, "recv: expected timeout; got %d",
-   (int)ret);
-   if (errno != ETIMEDOUT && errno != EAGAIN)
-   error(1, errno, "recv: expected timeout: %d",
-errno);
-   return;  /* OK */
+   if (ret == -1 && (errno == ETIMEDOUT || errno == EAGAIN))
+   return;  /* OK */
+   if (!cfg_permissive) {
+   if (ret != -1)
+   error(1, 0, "recv: expected timeout; got %d",
+   (int)ret);
+   error(1, errno, "recv: expected timeout: %d", errno);
+   }
}
 
if (ret == -1)
@@ -203,7 +205,6 @@ static void send_udp_frags(int fd_raw, struct sockaddr 
*addr,
 {
struct ip *iphdr = (struct ip *)ip_frame;
struct ip6_hdr *ip6hdr = (struct ip6_hdr *)ip_frame;
-   const bool ipv4 = !ipv6;
int res;
int offset;
int frag_len;
@@ -251,7 +252,7 @@ static void send_udp_frags(int fd_raw, struct sockaddr 
*addr,
}
 
/* Occasionally test IPv4 "runs" (see net/ipv4/ip_fragment.c) */
-   if (ipv4 && !cfg_overlap && (rand() % 100 < 20) &&
+   if (!cfg_overlap && (rand() % 100 < 20) &&
(payload_len > 9 * max_frag_len)) {
offset = 6 * max_frag_len;
while (offset < (UDP_HLEN + payload_len)) {
@@ -276,41 +277,38 @@ static void send_udp_frags(int fd_raw, struct sockaddr 
*addr,
while (offset < (UDP_HLEN + payload_len)) {
send_fragment(fd_raw, addr, alen, offset, ipv6);
/* IPv4 ignores duplicates, so randomly send a duplicate. */
-   if (ipv4 && (1 == rand() % 100))
+   if (rand() % 100 == 1)
send_fragment(fd_raw, addr, alen, offset, ipv6);
offset += 2 * max_frag_len;
}
 
if (cfg_overlap) {
-   /* Send an extra random fragment. */
+   /* Send an extra random fragment.
+*
+* Duplicates and some fragments completely inside
+* previously sent fragments are dropped/ignored. So
+* random offset and frag_len can result in a dropped
+* fragment instead of a dropped queue/packet. Thus we
+* hard-code offset and frag_len.
+*/
+   if (max_frag_len * 4 < payload_len || max_frag_len < 16) {
+   /* not enough payload for random offset and frag_len. */
+   offset = 8;
+   frag_len = UDP_HLEN + max_frag_len;
+   } else {
+   offset = rand() % (payload_len / 2);
+   frag_len = 2 * max_frag_len + 1 + rand() % 256;
+   }
if (ipv6) {
struct ip6_frag *fraghdr = (struct ip6_frag *)(ip_frame 
+ I

[PATCH net-next 1/4] net: IP defrag: encapsulate rbtree defrag code into callable functions

2019-01-22 Thread Peter Oskolkov
This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.

Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Cc: Tom Herbert 
---
 include/net/inet_frag.h  |  16 ++-
 net/ipv4/inet_fragment.c | 293 +++
 net/ipv4/ip_fragment.c   | 289 --
 3 files changed, 334 insertions(+), 264 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 1662cbc0b46b..b02bf737d019 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -77,8 +77,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;  /* Used in IPv6. */
-   struct rb_root  rb_fragments; /* Used in IPv4. */
+   struct sk_buff  *fragments;  /* used in 6lopwpan IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4/IPv6. */
struct sk_buff  *fragments_tail;
struct sk_buff  *last_run_head;
ktime_t stamp;
@@ -153,4 +153,16 @@ static inline void add_frag_mem_limit(struct netns_frags 
*nf, long val)
 
 extern const u8 ip_frag_ecn_table[16];
 
+/* Return values of inet_frag_queue_insert() */
+#define IPFRAG_OK  0
+#define IPFRAG_DUP 1
+#define IPFRAG_OVERLAP 2
+int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
+  int offset, int end);
+void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
+ struct sk_buff *parent);
+void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
+   void *reasm_data);
+struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
+
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 760a9e52e02b..9f69411251d0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -25,6 +25,62 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   union {
+   struct inet_skb_parmh4;
+   struct inet6_skb_parm   h6;
+   };
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void fragcb_clear(struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void fragrun_append_to_last(struct inet_frag_queue *q,
+  struct sk_buff *skb)
+{
+   fragcb_clear(skb);
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void fragrun_create(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+   fragcb_clear(skb);
+
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+&q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
 
 /* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
  * Value : 0xff if frame should be dropped.
@@ -123,6 +179,28 @@ static void inet_frag_destroy_rcu(struct rcu_head *head)
kmem_cache_free(f->frags_cachep, q);
 }
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+   struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
+
+   while (p) {
+   struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+   p = rb_next(p);
+   rb_erase(&skb->rbnode, root);
+   while (skb) {
+   struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+   sum += skb->truesize;
+   kfree_skb(skb);
+   skb = next;
+   }
+   }
+   return 
