Re: [PATCH net] udp: correct reuseport selection with connected sockets

2019-09-13 Thread Craig Gallek
On Thu, Sep 12, 2019 at 9:16 PM Willem de Bruijn
 wrote:
>
> From: Willem de Bruijn 
>
> UDP reuseport groups can hold a mix of unconnected and connected sockets.
> Ensure that connected sockets receive all traffic addressed to their 4-tuple.
>
> Fast reuseport returns on the first reuseport match on the assumption
> that all matches are equal. Only when connections are present does the
> lookup return to the previous behavior of scoring all sockets.
>
> Record if connections are present and if so (1) treat such connected
> sockets as an independent match from the group, (2) only return
> 2-tuple matches from reuseport and (3) do not return on the first
> 2-tuple reuseport match to allow for a higher scoring match later.
>
> New field has_conns is set without locks. No other fields in the
> bitmap are modified at runtime and the field is only ever set
> unconditionally, so an RMW cannot miss a change.
>
> Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
> Link: http://lkml.kernel.org/r/ca+futsfrp09ajnyrt04ss6qj22viioewawmlawx0psk8-pg...@mail.gmail.com
> Signed-off-by: Willem de Bruijn 

Slick, no additional cost for the BPF case and just a single branch
for the unconnected UDP and TCP listener cases!

Acked-by: Craig Gallek 
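
For illustration, here is a minimal standalone sketch of the selection rule described above (the structure and field names are assumptions made for the example, not the kernel's; the real lookup also requires a positive score before consulting the group):

/* Sketch only: once a group has connected sockets (has_conns), the walk may
 * not return on the first reuseport match, because a connected socket can
 * score higher on the full 4-tuple later on.
 */
#include <stdbool.h>
#include <stddef.h>

struct ex_sock {
	bool connected;   /* bound to a full 4-tuple via connect() */
	int  score;       /* compute_score()-style result: 4-tuple > 2-tuple */
};

struct ex_group {
	bool has_conns;   /* set once, never cleared, when any member connects */
	struct ex_sock *socks;
	size_t n;
};

static struct ex_sock *ex_select(const struct ex_group *g)
{
	struct ex_sock *best = NULL;

	for (size_t i = 0; i < g->n; i++) {
		if (!best || g->socks[i].score > best->score)
			best = &g->socks[i];
		if (!g->has_conns)
			return best;   /* old fast path: first match wins */
	}
	return best;                   /* connections present: keep scoring */
}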


Re: [PATCH bpf 1/2] bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err

2019-06-03 Thread Craig Gallek
On Fri, May 31, 2019 at 6:29 PM Martin KaFai Lau  wrote:
>
> __udp6_lib_err() may be called when handling icmpv6 message. For example,
> the icmpv6 toobig(type=2).  __udp6_lib_lookup() is then called
> which may call reuseport_select_sock().  reuseport_select_sock() will
> call into a bpf_prog (if there is one).
>
> reuseport_select_sock() expects skb->data to point to the
> transport header (udphdr in this case).  For example, run_bpf_filter()
> pulls the transport header.
>
> However, in the __udp6_lib_err() path, skb->data points to the
> ipv6hdr instead of the udphdr.
>
> One option is to pull and push the ipv6hdr in __udp6_lib_err().
> Instead of doing this, this patch follows what the original
> commit 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
> did for IPv4 and passes a NULL skb pointer to
> reuseport_select_sock().
>
> Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
> Cc: Craig Gallek 
> Signed-off-by: Martin KaFai Lau 

Acked-by: Craig Gallek 
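
In effect (a one-line sketch of the intent, not the actual diff; the surrounding variables are assumed to exist), the error path simply opts out of the BPF selection:

	/* skb->data sits at the ipv6hdr in the error path, not the udphdr, so
	 * don't hand the skb to the reuseport BPF program; with a NULL skb
	 * the group falls back to plain hash selection.
	 */
	sk2 = reuseport_select_sock(sk, hash, NULL, sizeof(struct udphdr));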


Re: [PATCH bpf v2] bpf, lpm: fix lookup bug in map_delete_elem

2019-02-22 Thread Craig Gallek
On Fri, Feb 22, 2019 at 8:19 AM Alban Crequy  wrote:
>
> From: Alban Crequy 
>
> trie_delete_elem() was deleting an entry even when it did not match,
> as long as its prefixlen was correct. This patch adds a check on matchlen.
>
> Reproducer:
>
> $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
> $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
> $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
> key: 10 00 00 00 aa bb cc dd  value: 01
> Found 1 element
> $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
> $ echo $?
> 0
> $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
> Found 0 elements
>
> A similar reproducer is added in the selftests.
>
> Without the patch:
>
> $ sudo ./tools/testing/selftests/bpf/test_lpm_map
> test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
> Aborted
>
> With the patch: test_lpm_map runs without errors.
>
> Fixes: e454cf595853 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
> Cc: Craig Gallek 
> Signed-off-by: Alban Crequy 
>
> ---
Acked-by: Craig Gallek 

Good catch, thanks for the fix!
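
For readers unfamiliar with the bug, a small self-contained illustration of the missing guard (names are illustrative, not the kernel trie internals): a node with the right prefixlen is only a hit when the stored prefix bits also match the key.

#include <stdint.h>
#include <stdbool.h>

/* number of leading bits shared by two 32-bit keys */
static uint32_t ex_match_len(uint32_t a, uint32_t b)
{
	uint32_t diff = a ^ b;
	uint32_t n = 0;

	while (n < 32 && !(diff & (1u << (31 - n))))
		n++;
	return n;
}

/* the spirit of the added check: same prefixlen AND a full prefix match */
static bool ex_delete_is_hit(uint32_t stored, uint32_t stored_plen,
			     uint32_t key, uint32_t key_plen)
{
	return stored_plen == key_plen &&
	       ex_match_len(stored, key) >= key_plen;
}

With the reproducer above, the stored prefix aa.bb.cc.dd/16 and the delete key ff.ff.ff.ff/16 share no 16-bit prefix, so the delete should fail with ENOENT instead of removing the entry.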


Re: [PATCH net] sock_diag: fix use-after-free read in __sk_free

2018-05-18 Thread Craig Gallek
0, 88018a02e3c0)
> The buggy address belongs to the page:
> page:ea0006280b00 count:1 mapcount:0 mapping:88018a02c140 index:0x0 
> compound_mapcount: 0
> flags: 0x2fffc008100(slab|head)
> raw: 02fffc0000008100 88018a02c140  00010001
> raw: ea00062a1320 ea0006268020 8801d9bdde40 
> page dumped because: kasan: bad access detected
>
> Fixes: b922622ec6ef ("sock_diag: don't broadcast kernel sockets")
> Signed-off-by: Eric Dumazet 
> Cc: Craig Gallek 
> Reported-by: syzbot 

Acked-by: Craig Gallek 

Thanks Eric!


Re: [PATCH net] soreuseport: fix mem leak in reuseport_add_sock()

2018-02-02 Thread Craig Gallek
On Fri, Feb 2, 2018 at 1:27 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> reuseport_add_sock() needs to deal with attaching a socket that already
> has its own sk_reuseport_cb from a prior
> setsockopt(SO_ATTACH_REUSEPORT_?BPF).
>
> Without this fix, not only was a WARN_ONCE() issued, but we were also
> leaking memory.
>
> Thanks to syzbot and Eric Biggers for providing us nice C repros.
>
> [ cut here ]
> socket already in reuseport group
> WARNING: CPU: 0 PID: 3496 at net/core/sock_reuseport.c:119
> reuseport_add_sock+0x742/0x9b0 net/core/sock_reuseport.c:117
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 0 PID: 3496 Comm: syzkaller869503 Not tainted 4.15.0-rc6+ #245
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:17 [inline]
>   dump_stack+0x194/0x257 lib/dump_stack.c:53
>   panic+0x1e4/0x41c kernel/panic.c:183
>   __warn+0x1dc/0x200 kernel/panic.c:547
>   report_bug+0x211/0x2d0 lib/bug.c:184
>   fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>   fixup_bug arch/x86/kernel/traps.c:247 [inline]
>   do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>   do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>   invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079
>
> Fixes: ef456144da8e ("soreuseport: define reuseport groups")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot+c0ea2226f77a42936...@syzkaller.appspotmail.com

Clever fix, thanks Eric(s)!
Acked-by: Craig Gallek 


Re: [PATCH net] ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only

2018-01-25 Thread Craig Gallek
On Thu, Jan 25, 2018 at 2:15 AM, Martin KaFai Lau  wrote:
> If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it
> implicitly means the socket is ipv6only.  However, in inet6_bind(),
> this addr_type checking and setting sk->sk_ipv6only to 1 are only done
> after sk->sk_prot->get_port(sk, snum) has been completed successfully.
>
> This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses
> the 'get_port()'.
>
> In particular, when binding SO_REUSEPORT UDP sockets,
> udp_reuseport_add_sock(sk,...) is called.  udp_reuseport_add_sock()
> checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to
> sk2->sk_reuseport_cb.  In this case, ipv6_only_sock(sk2) could be
> 1 while ipv6_only_sock(sk) is still 0 here.  The end result is,
> reuseport_alloc(sk) is called instead of adding sk to the existing
> sk2->sk_reuseport_cb.
>
> It can be reproduced by binding two SO_REUSEPORT UDP sockets on an
> IPv6 address (!ANY and !MAPPED).  Only one of the sockets will
> receive packets.
>
> The fix is to set the implicit sk_ipv6only before calling get_port().
> The original sk_ipv6only has to be saved such that it can be restored
> in case get_port() failed.  The situation is similar to the
> inet_reset_saddr(sk) after get_port() has failed.
>
> Thanks to Calvin Owens  who created an easy
> reproduction which leads to a fix.
>
> Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
> Signed-off-by: Martin KaFai Lau 

Wow, good catch!
Acked-by: Craig Gallek 
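
The ordering fix can be sketched roughly as follows (an illustrative pseudo-fragment, not the actual diff; "addr_is_specific_ipv6" stands in for the !IPV6_ADDR_ANY && !IPV6_ADDR_MAPPED test from the commit message, and the error handling is assumed):

	int old_ipv6only = sk->sk_ipv6only;

	if (addr_is_specific_ipv6)
		sk->sk_ipv6only = 1;            /* make the flag consistent before get_port() */

	if (sk->sk_prot->get_port(sk, snum)) {
		sk->sk_ipv6only = old_ipv6only; /* roll back on failure, like inet_reset_saddr() */
		err = -EADDRINUSE;
		goto fail;
	}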


Re: [PATCH net v2] netns, rtnetlink: fix struct net reference leak

2017-12-29 Thread Craig Gallek
On Sat, Dec 23, 2017 at 5:12 PM, Nicolas Dichtel
 wrote:
> On 22/12/2017 at 21:36, Craig Gallek wrote:
>> From: Craig Gallek 
>> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>> index 60a71be75aea..4b7ea33f5705 100644
>> --- a/net/core/net_namespace.c
>> +++ b/net/core/net_namespace.c
>> @@ -627,6 +627,8 @@ static int rtnl_net_newid(struct sk_buff *skb, struct 
>> nlmsghdr *nlh,
>>   return -EINVAL;
>>   }
>>   nsid = nla_get_s32(tb[NETNSA_NSID]);
>> + if (nsid < 0)
>> + return -EINVAL;
> No, this breaks the current behavior.
> Look at alloc_netid(). If reqid is < 0, the kernel allocates an nsid with no
> constraint. If reqid is >= 0, it tries to alloc the specified nsid.
Ah, thanks.  alloc_netid does appear to do the right thing.  In fact,
this seems to be another clue to the problem.  The current behavior is
to allocate from [0,max) when the input value is negative and the
problem seems to trigger when 0 is allocated.  Changing this range to
[1, max) fixes the problem, so there must be code elsewhere that does
not handle the case where the id is zero properly...
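
A standalone model of the allocation behavior being discussed (an assumption based on the description above, not the actual alloc_netid() code): a negative requested id means "pick any free id starting from 0", while a non-negative id must be allocated exactly.

#include <stdbool.h>

static int alloc_id_model(int reqid, bool *used, int max)
{
	int lo = reqid < 0 ? 0 : reqid;
	int hi = reqid < 0 ? max : reqid + 1;

	for (int id = lo; id < hi && id < max; id++) {
		if (!used[id]) {
			used[id] = true;
			return id;
		}
	}
	return -1;	/* no id available */
}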

>>
>>   if (tb[NETNSA_PID]) {
>>   peer = get_net_ns_by_pid(nla_get_u32(tb[NETNSA_PID]));
>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>> index dabba2a91fc8..a928b8f081b8 100644
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -1733,10 +1733,12 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, 
>> struct netlink_callback *cb)
>>   ifla_policy, NULL) >= 0) {
>>   if (tb[IFLA_IF_NETNSID]) {
>>   netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
>> - tgt_net = get_target_net(skb, netnsid);
>> - if (IS_ERR(tgt_net)) {
>> - tgt_net = net;
>> - netnsid = -1;
>> + if (netnsid >= 0) {
>> + tgt_net = get_target_net(skb, netnsid);
> I would prefer to put this test in get_target_net.
>
>> + if (IS_ERR(tgt_net)) {
>> + tgt_net = net;
>> + netnsid = -1;
> Maybe using NETNSA_NSID_NOT_ASSIGNED is better? Same for the initialization of
> this variable.
>
>> + }
>>   }
>>   }
>>
>> @@ -2792,6 +2794,11 @@ static int rtnl_newlink(struct sk_buff *skb, struct 
>> nlmsghdr *nlh,
>>   if (tb[IFLA_LINK_NETNSID]) {
>>   int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
>>
>> + if (id < 0) {
>> + err =  -EINVAL;
>> + goto out;
>> + }
>> +
> This is not needed. get_net_ns_by_id() returns NULL if id is < 0.
Indeed, and by extension get_target_net does not need this check
either (as it calls get_net_ns_by_id).

I'm having trouble debugging this remotely, so I'll give it a whirl
when I get back to the office next week.

Thanks again for the pointers,
Craig


[PATCH net v2] netns, rtnetlink: fix struct net reference leak

2017-12-22 Thread Craig Gallek
From: Craig Gallek 

netns ids were added in commit 0c7aecd4bde4 and defined as signed
integers in both the kernel data structures and the netlink interface.
However, the semantics of the implementation assume that the ids
are always greater than or equal to zero, except for an internal
sentinel value NETNSA_NSID_NOT_ASSIGNED.

Several subsequent patches carried this pattern forward.  This patch
updates all of the netlink input paths for this value to enforce the
'greater than or equal to zero' constraint.

This issue was discovered by syzkaller.  It would set a negative
value for a netns id and then repeatedly call RTM_GETLINK.
The put_net call in that path would not trigger for negative netns ids,
causing a reference count leak that eventually overflowed.  There are
probably additional error paths that do not handle this situation
correctly, but this was the only one through which I was able to
trigger a real issue.

Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
Fixes: 317f4810e45e ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
Fixes: 79e1ad148c84 ("rtnetlink: use netnsid to query interface")
CC: Jiri Benc 
CC: Nicolas Dichtel 
CC: Jason A. Donenfeld 
Signed-off-by: Craig Gallek 
---
 net/core/net_namespace.c |  2 ++
 net/core/rtnetlink.c | 17 +++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 60a71be75aea..4b7ea33f5705 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -627,6 +627,8 @@ static int rtnl_net_newid(struct sk_buff *skb, struct nlmsghdr *nlh,
return -EINVAL;
}
nsid = nla_get_s32(tb[NETNSA_NSID]);
+   if (nsid < 0)
+   return -EINVAL;
 
if (tb[NETNSA_PID]) {
peer = get_net_ns_by_pid(nla_get_u32(tb[NETNSA_PID]));
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dabba2a91fc8..a928b8f081b8 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1733,10 +1733,12 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
ifla_policy, NULL) >= 0) {
if (tb[IFLA_IF_NETNSID]) {
netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
-   tgt_net = get_target_net(skb, netnsid);
-   if (IS_ERR(tgt_net)) {
-   tgt_net = net;
-   netnsid = -1;
+   if (netnsid >= 0) {
+   tgt_net = get_target_net(skb, netnsid);
+   if (IS_ERR(tgt_net)) {
+   tgt_net = net;
+   netnsid = -1;
+   }
}
}
 
@@ -2792,6 +2794,11 @@ static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
if (tb[IFLA_LINK_NETNSID]) {
int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
 
+   if (id < 0) {
+   err =  -EINVAL;
+   goto out;
+   }
+
link_net = get_net_ns_by_id(dest_net, id);
if (!link_net) {
err =  -EINVAL;
@@ -2883,6 +2890,8 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 
if (tb[IFLA_IF_NETNSID]) {
netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
+   if (netnsid < 0)
+   return -EINVAL;
tgt_net = get_target_net(skb, netnsid);
if (IS_ERR(tgt_net))
return PTR_ERR(tgt_net);
-- 
2.15.1.620.gb9897f4670-goog



Re: [PATCH net] rtnetlink: fix struct net reference leak

2017-12-22 Thread Craig Gallek
On Fri, Dec 22, 2017 at 8:59 AM, Craig Gallek  wrote:
> On Fri, Dec 22, 2017 at 3:11 AM, Nicolas Dichtel
>  wrote:
>> On 21/12/2017 at 23:18, Craig Gallek wrote:
>>> From: Craig Gallek 
>>>
>>> The below referenced commit extended the RTM_GETLINK interface to
>>> allow querying by netns id.  The netnsid property was previously
>>> defined as a signed integer, but this patch assumes that the user
>>> always passes a positive integer.  syzkaller discovered this problem
>>> by setting a negative netnsid and then calling the get-link path
>>> in a tight loop.  This surprisingly quickly overflows the reference
>>> count on the associated struct net, potentially destroying it.  When the
>>> default namespace is used, the machine crashes in strange and interesting
>>> ways.
>>>
>>> Unfortunately, this is not easy to reproduce with just the ip tool
>>> as it enforces unsigned integer parsing despite the interface interpreting
>>> the NETNSID attribute as signed.
>>>
>>> I'm not sure why this attribute is signed in the first place, but
>>> the first commit that introduced it (6621dd29eb9b) is in v4.15-rc4,
>>> so I assume it's too late to change.
>> A valid (assigned) nsid is always >= 0.
>>
>>>
>>> This patch removes the positive netns id assumption, but adds another
>>> assumption that the netns id 0 is always the 'self' identifying id (for
>>> which an additional struct net reference is not necessary).
>> We cannot make this assumption; it is wrong. nsids may be automatically
>> allocated by the kernel, and the allocation starts at 0.
>> The current netns can be identified by NETNSA_NSID_NOT_ASSIGNED, i.e. -1.
> Thank you, I'll respin this with NETNSA_NSID_NOT_ASSIGNED as the sentinel value.

Looking at the netns id code more closely, there are several places
that assume ids will never be zero (short of the sentinel).  I think
the only simple fix here is to update the netlink interfaces to not
accept negative values as input.  I'm going to send that patch
instead...


Re: [PATCH net] rtnetlink: fix struct net reference leak

2017-12-22 Thread Craig Gallek
On Fri, Dec 22, 2017 at 3:11 AM, Nicolas Dichtel
 wrote:
> On 21/12/2017 at 23:18, Craig Gallek wrote:
>> From: Craig Gallek 
>>
>> The below referenced commit extended the RTM_GETLINK interface to
>> allow querying by netns id.  The netnsid property was previously
>> defined as a signed integer, but this patch assumes that the user
>> always passes a positive integer.  syzkaller discovered this problem
>> by setting a negative netnsid and then calling the get-link path
>> in a tight loop.  This surprisingly quickly overflows the reference
>> count on the associated struct net, potentially destroying it.  When the
>> default namespace is used, the machine crashes in strange and interesting
>> ways.
>>
>> Unfortunately, this is not easy to reproduce with just the ip tool
>> as it enforces unsigned integer parsing despite the interface interpreting
>> the NETNSID attribute as signed.
>>
>> I'm not sure why this attribute is signed in the first place, but
>> the first commit that introduced it (6621dd29eb9b) is in v4.15-rc4,
>> so I assume it's too late to change.
> A valid (assigned) nsid is always >= 0.
>
>>
>> This patch removes the positive netns id assumption, but adds another
>> assumption that the netns id 0 is always the 'self' identifying id (for
>> which an additional struct net reference is not necessary).
> We cannot make this assumption; it is wrong. nsids may be automatically
> allocated by the kernel, and the allocation starts at 0.
> The current netns can be identified by NETNSA_NSID_NOT_ASSIGNED, i.e. -1.
Thank you, I'll respin this with NETNSA_NSID_NOT_ASSIGNED as the sentinel value.

Craig


[PATCH net] rtnetlink: fix struct net reference leak

2017-12-21 Thread Craig Gallek
From: Craig Gallek 

The below referenced commit extended the RTM_GETLINK interface to
allow querying by netns id.  The netnsid property was previously
defined as a signed integer, but this patch assumes that the user
always passes a positive integer.  syzkaller discovered this problem
by setting a negative netnsid and then calling the get-link path
in a tight loop.  This surprisingly quickly overflows the reference
count on the associated struct net, potentially destroying it.  When the
default namespace is used, the machine crashes in strange and interesting
ways.

Unfortunately, this is not easy to reproduce with just the ip tool
as it enforces unsigned integer parsing despite the interface interpreting
the NETNSID attribute as signed.

I'm not sure why this attribute is signed in the first place, but
the first commit that introduced it (6621dd29eb9b) is in v4.15-rc4,
so I assume it's too late to change.

This patch removes the positive netns id assumption, but adds another
assumption that the netns id 0 is always the 'self' identifying id (for
which an additional struct net reference is not necessary).

Fixes: 79e1ad148c84 ("rtnetlink: use netnsid to query interface")
CC: Jiri Benc 
CC: Nicolas Dichtel 
CC: Jason A. Donenfeld 
Signed-off-by: Craig Gallek 
---
 net/core/rtnetlink.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dabba2a91fc8..3de033b7e4b9 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1451,7 +1451,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
ifm->ifi_flags = dev_get_flags(dev);
ifm->ifi_change = change;
 
-   if (tgt_netnsid >= 0 && nla_put_s32(skb, IFLA_IF_NETNSID, tgt_netnsid))
+   if (tgt_netnsid && nla_put_s32(skb, IFLA_IF_NETNSID, tgt_netnsid))
goto nla_put_failure;
 
if (nla_put_string(skb, IFLA_IFNAME, dev->name) ||
@@ -1712,7 +1712,7 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
const struct rtnl_link_ops *kind_ops = NULL;
unsigned int flags = NLM_F_MULTI;
int master_idx = 0;
-   int netnsid = -1;
+   int netnsid = 0;
int err;
int hdrlen;
 
@@ -1733,10 +1733,12 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
ifla_policy, NULL) >= 0) {
if (tb[IFLA_IF_NETNSID]) {
netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
-   tgt_net = get_target_net(skb, netnsid);
-   if (IS_ERR(tgt_net)) {
-   tgt_net = net;
-   netnsid = -1;
+   if (netnsid) {
+   tgt_net = get_target_net(skb, netnsid);
+   if (IS_ERR(tgt_net)) {
+   tgt_net = net;
+   netnsid = 0;
+   }
}
}
 
@@ -1786,7 +1788,7 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
cb->args[0] = h;
cb->seq = net->dev_base_seq;
nl_dump_check_consistent(cb, nlmsg_hdr(skb));
-   if (netnsid >= 0)
+   if (netnsid)
put_net(tgt_net);
 
return err;
@@ -2873,7 +2875,7 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
struct nlattr *tb[IFLA_MAX+1];
struct net_device *dev = NULL;
struct sk_buff *nskb;
-   int netnsid = -1;
+   int netnsid = 0;
int err;
u32 ext_filter_mask = 0;
 
@@ -2883,9 +2885,11 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 
if (tb[IFLA_IF_NETNSID]) {
netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
-   tgt_net = get_target_net(skb, netnsid);
-   if (IS_ERR(tgt_net))
-   return PTR_ERR(tgt_net);
+   if (netnsid) {
+   tgt_net = get_target_net(skb, netnsid);
+   if (IS_ERR(tgt_net))
+   return PTR_ERR(tgt_net);
+   }
}
 
if (tb[IFLA_IFNAME])
@@ -2923,7 +2927,7 @@ static int rtnl_getlink(struct sk_buff *skb, struct 
nlmsghdr *nlh,
} else
err = rtnl_unicast(nskb, net, NETLINK_CB(skb).portid);
 out:
-   if (netnsid >= 0)
+   if (netnsid)
put_net(tgt_net);
 
return err;
-- 
2.15.1.620.gb9897f4670-goog



Re: [RFC PATCH] reuseport: compute the ehash only if needed

2017-12-12 Thread Craig Gallek
On Tue, Dec 12, 2017 at 8:09 AM, Paolo Abeni  wrote:
> When a reuseport socket group is using a BPF filter to distribute
> the packets among the sockets, we don't need to compute any hash
> value, but the current reuseport_select_sock() requires the
> caller to compute such hash in advance.
>
> This patch reworks reuseport_select_sock() to compute the hash value
> only if needed - missing or failing BPF filter. Since different
> hash functions have different argument types - ipv4 addresses vs ipv6
> ones - to avoid over-complicating the interface, reuseport_select_sock()
> is now a macro.
Purely subjective, but I think a slightly more complicated function
signature for reuseport_select_sock (and reuseport_select_sock6?)
would look a little better than this macro.  It would avoid needing to
expose the reuseport_info struct and would keep the rcu semantics
entirely within the function call (the fast-path memory access
semantics here are already non-trivial...)
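
As a rough standalone illustration of that alternative (not kernel types, just the shape of the interface): pass a callback plus context so the hash is only computed on the fallback path.

#include <stdint.h>

typedef uint32_t (*ex_hash_fn)(const void *ctx);

/* Sketch: a valid BPF result wins; the hash is only computed otherwise.
 * Assumes nsocks > 0.
 */
static int ex_select_idx(int bpf_idx, int nsocks, ex_hash_fn fn, const void *ctx)
{
	if (bpf_idx >= 0 && bpf_idx < nsocks)
		return bpf_idx;
	return (int)(fn(ctx) % (uint32_t)nsocks);
}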

> Additionally, the sk_reuseport test is moved inside reuseport_select_sock,
> to avoid some code duplication.
>
> Overall this gives a small but measurable performance improvement
> under UDP flood while using SO_REUSEPORT + BPF.
Exciting, do you have some specific numbers here?  I'd be interested
in knowing what kinds of loads you end up seeing improvements for.

> Signed-off-by: Paolo Abeni 


Re: Uninitialized value in __sk_nulls_add_node_rcu()

2017-12-05 Thread Craig Gallek
On Tue, Dec 5, 2017 at 3:07 PM, Eric Dumazet  wrote:
> On Tue, 2017-12-05 at 14:39 -0500, Craig Gallek wrote:
>> On Tue, Dec 5, 2017 at 9:18 AM, Eric Dumazet 
>> wrote:
>> > On Tue, 2017-12-05 at 06:15 -0800, Eric Dumazet wrote:
>> > >
>> > > + hlist_nulls_add_head_rcu(&sk->sk_nulss_node, list);
>> >
>> > Typo here, this needs sk_nulls_node of course.
>> >
>>
>> Thanks Eric, this looks good to me.  The tail insertion is still
>> required in udp_lib_get_port for the second layer hash, but not here.
>> fwiw, reuseport_dualstack in the selftests directory verifies this
>> behavior.  I tried it with your patch (it still passes) and removing
>> the udp_lib_get_port path (to make sure it breaks when it should).
>>
>
> Thanks for confirming this, I will send the official patch then ;)

Great, feel free to include my Acked-by.


Re: Uninitialized value in __sk_nulls_add_node_rcu()

2017-12-05 Thread Craig Gallek
On Tue, Dec 5, 2017 at 9:18 AM, Eric Dumazet  wrote:
> On Tue, 2017-12-05 at 06:15 -0800, Eric Dumazet wrote:
>>
>> + hlist_nulls_add_head_rcu(&sk->sk_nulss_node, list);
>
> Typo here, this needs sk_nulls_node of course.
>

Thanks Eric, this looks good to me.  The tail insertion is still
required in udp_lib_get_port for the second layer hash, but not here.
fwiw, reuseport_dualstack in the selftests directory verifies this
behavior.  I tried it with your patch (it still passes) and removing
the udp_lib_get_port path (to make sure it breaks when it should).

Craig


Re: [PATCH net-next] net/reuseport: drop legacy code

2017-11-30 Thread Craig Gallek
On Thu, Nov 30, 2017 at 9:39 AM, Paolo Abeni  wrote:
> Since commit e32ea7e74727 ("soreuseport: fast reuseport UDP socket
> selection") and commit c125e80b8868 ("soreuseport: fast reuseport
> TCP socket selection") the relevant reuseport socket matching the current
> packet is selected by the reuseport_select_sock() call. The only
> exceptions are invalid BPF filters/filters returning out-of-range
> indices.
> In the latter case the code implicitly falls back to using the hash
> demultiplexing, but instead of selecting the socket inside the
> reuseport_select_sock() function, it relies on the hash selection
> logic introduced with the early soreuseport implementation.
>
> With this patch, in case of a BPF filter returning a bad socket
> index value, we fall back to hash-based selection inside the
> reuseport_select_sock() body, so that we can drop some duplicate
> code in the ipv4 and ipv6 stack.
>
> This also allows faster lookup in the above scenario and will allow
> us to avoid computing the hash value for successful, BPF based
> demultiplexing - in a later patch.
>
> Signed-off-by: Paolo Abeni 

Great optimization!
Acked-by: Craig Gallek 

> ---
>  net/core/sock_reuseport.c   |  4 +++-
>  net/ipv4/inet_hashtables.c  | 11 ++-
>  net/ipv4/udp.c  | 22 --
>  net/ipv6/inet6_hashtables.c | 11 ++-
>  net/ipv6/udp.c  | 22 --
>  5 files changed, 15 insertions(+), 55 deletions(-)
>
> diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> index 5eeb1d20cc38..c5bb52bc73a1 100644
> --- a/net/core/sock_reuseport.c
> +++ b/net/core/sock_reuseport.c
> @@ -235,7 +235,9 @@ struct sock *reuseport_select_sock(struct sock *sk,
>
> if (prog && skb)
> sk2 = run_bpf(reuse, socks, prog, skb, hdr_len);
> -   else
> +
> +   /* no bpf or invalid bpf result: fall back to hash usage */
> +   if (!sk2)
> sk2 = reuse->socks[reciprocal_scale(hash, socks)];
> }
>
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e7d15fb0d94d..427b705d7c64 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -216,32 +216,25 @@ struct sock *__inet_lookup_listener(struct net *net,
>  {
> unsigned int hash = inet_lhashfn(net, hnum);
> struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
> -   int score, hiscore = 0, matches = 0, reuseport = 0;
> bool exact_dif = inet_exact_dif_match(net, skb);
> struct sock *sk, *result = NULL;
> +   int score, hiscore = 0;
> u32 phash = 0;
>
> sk_for_each_rcu(sk, &ilb->head) {
> score = compute_score(sk, net, hnum, daddr,
>   dif, sdif, exact_dif);
> if (score > hiscore) {
> -   reuseport = sk->sk_reuseport;
> -   if (reuseport) {
> +   if (sk->sk_reuseport) {
> phash = inet_ehashfn(net, daddr, hnum,
>  saddr, sport);
> result = reuseport_select_sock(sk, phash,
>skb, doff);
> if (result)
> return result;
> -   matches = 1;
> }
> result = sk;
> hiscore = score;
> -   } else if (score == hiscore && reuseport) {
> -   matches++;
> -   if (reciprocal_scale(phash, matches) == 0)
> -   result = sk;
> -   phash = next_pseudo_random32(phash);
> }
> }
> return result;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index e4ff25c947c5..36f857c87fe2 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -445,7 +445,7 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>  struct sk_buff *skb)
>  {
> struct sock *sk, *result;
> -   int score, badness, matches = 0, reuseport = 0;
> +   int score, badness;
> u32 hash = 0;
>
> result = NULL;
> @@ -454,23 +454,16 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> score = compute_score(sk, net, saddr, sport,
>   daddr, hnum, dif, sdif, exact_dif);
>  

[PATCH net-next v2] bpf: fix verifier NULL pointer dereference

2017-11-02 Thread Craig Gallek
From: Craig Gallek 

do_check() can fail early without allocating env->cur_state under
memory pressure.  Syzkaller found the stack below on the linux-next
tree because of this.

  kasan: CONFIG_KASAN_INLINE enabled
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault:  [#1] SMP KASAN
  Dumping ftrace buffer:
 (ftrace buffer empty)
  Modules linked in:
  CPU: 1 PID: 27062 Comm: syz-executor5 Not tainted 4.14.0-rc7+ #106
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  task: 8801c2c74700 task.stack: 8801c3e28000
  RIP: 0010:free_verifier_state kernel/bpf/verifier.c:347 [inline]
  RIP: 0010:bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533
  RSP: 0018:8801c3e2f5c8 EFLAGS: 00010202
  RAX: dc00 RBX: fff4 RCX: 
  RDX: 0070 RSI: 817d5aa9 RDI: 0380
  RBP: 8801c3e2f668 R08:  R09: 1100387c5d9f
  R10: 218c4e80 R11: 85b34380 R12: 8801c4dc6a28
  R13:  R14: 8801c4dc6a00 R15: 8801c4dc6a20
  FS:  7f311079b700() GS:8801db30() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 004d4a24 CR3: 0001cbcd CR4: 001406e0
  DR0:  DR1:  DR2: 
  DR3:  DR6: fffe0ff0 DR7: 0400
  Call Trace:
   bpf_prog_load+0xcbb/0x18e0 kernel/bpf/syscall.c:1166
   SYSC_bpf kernel/bpf/syscall.c:1690 [inline]
   SyS_bpf+0xae9/0x4620 kernel/bpf/syscall.c:1652
   entry_SYSCALL_64_fastpath+0x1f/0xbe
  RIP: 0033:0x452869
  RSP: 002b:7f311079abe8 EFLAGS: 0212 ORIG_RAX: 0141
  RAX: ffda RBX: 00758020 RCX: 00452869
  RDX: 0030 RSI: 20168000 RDI: 0005
  RBP: 7f311079aa20 R08:  R09: 
  R10:  R11: 0212 R12: 004b7550
  R13: 7f311079ab58 R14: 004b7560 R15: 
  Code: df 48 c1 ea 03 80 3c 02 00 0f 85 e6 0b 00 00 4d 8b 6e 20 48 b8 00 00 00 00 00 fc ff df 49 8d bd 80 03 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 b6 0b 00 00 49 8b bd 80 03 00 00 e8 d6 0c 26
  RIP: free_verifier_state kernel/bpf/verifier.c:347 [inline] RSP: 8801c3e2f5c8
  RIP: bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533 RSP: 8801c3e2f5c8
  ---[ end trace c8d37f339dc64004 ]---

Fixes: 638f5b90d460 ("bpf: reduce verifier memory consumption")
Fixes: 1969db47f8d0 ("bpf: fix verifier memory leaks")
Signed-off-by: Craig Gallek 
---
 kernel/bpf/verifier.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

v2:
  Forgot second spot for the same bug in bpf_analyzer (from Alexei).

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 530b68550fd2..624aee966ab5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4530,8 +4530,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
 
ret = do_check(env);
-   free_verifier_state(env->cur_state, true);
-   env->cur_state = NULL;
+   if (env->cur_state) {
+   free_verifier_state(env->cur_state, true);
+   env->cur_state = NULL;
+   }
 
 skip_full_check:
while (!pop_stack(env, NULL, NULL));
@@ -4637,8 +4639,10 @@ int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops,
env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
 
ret = do_check(env);
-   free_verifier_state(env->cur_state, true);
-   env->cur_state = NULL;
+   if (env->cur_state) {
+   free_verifier_state(env->cur_state, true);
+   env->cur_state = NULL;
+   }
 
 skip_full_check:
while (!pop_stack(env, NULL, NULL));
-- 
2.15.0.403.gc27cc4dac6-goog



Re: [PATCH net-next] bpf: fix verifier NULL pointer dereference

2017-11-02 Thread Craig Gallek
On Thu, Nov 2, 2017 at 11:07 AM, Alexei Starovoitov  wrote:
> On 11/2/17 7:21 AM, Craig Gallek wrote:
>>
>> From: Craig Gallek 
>>
>> do_check() can fail early without allocating env->cur_state under
>> memory pressure.  Syzkaller found the stack below on the linux-next
>> tree because of this.
>>
>>   kasan: CONFIG_KASAN_INLINE enabled
>>   kasan: GPF could be caused by NULL-ptr deref or user memory access
>>   general protection fault:  [#1] SMP KASAN
>>   Dumping ftrace buffer:
>>  (ftrace buffer empty)
>>   Modules linked in:
>>   CPU: 1 PID: 27062 Comm: syz-executor5 Not tainted 4.14.0-rc7+ #106
>>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>>   task: 8801c2c74700 task.stack: 8801c3e28000
>>   RIP: 0010:free_verifier_state kernel/bpf/verifier.c:347 [inline]
>>   RIP: 0010:bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533
>>   RSP: 0018:8801c3e2f5c8 EFLAGS: 00010202
>>   RAX: dc00 RBX: fff4 RCX: 
>>   RDX: 0070 RSI: 817d5aa9 RDI: 0380
>>   RBP: 8801c3e2f668 R08:  R09: 1100387c5d9f
>>   R10: 218c4e80 R11: 85b34380 R12: 8801c4dc6a28
>>   R13:  R14: 8801c4dc6a00 R15: 8801c4dc6a20
>>   FS:  7f311079b700() GS:8801db30()
>> knlGS:
>>   CS:  0010 DS:  ES:  CR0: 80050033
>>   CR2: 004d4a24 CR3: 0001cbcd CR4: 001406e0
>>   DR0:  DR1:  DR2: 
>>   DR3:  DR6: fffe0ff0 DR7: 0400
>>   Call Trace:
>>bpf_prog_load+0xcbb/0x18e0 kernel/bpf/syscall.c:1166
>>SYSC_bpf kernel/bpf/syscall.c:1690 [inline]
>>SyS_bpf+0xae9/0x4620 kernel/bpf/syscall.c:1652
>>entry_SYSCALL_64_fastpath+0x1f/0xbe
>>   RIP: 0033:0x452869
>>   RSP: 002b:7f311079abe8 EFLAGS: 0212 ORIG_RAX: 0141
>>   RAX: ffda RBX: 00758020 RCX: 00452869
>>   RDX: 0030 RSI: 20168000 RDI: 0005
>>   RBP: 7f311079aa20 R08:  R09: 
>>   R10:  R11: 0212 R12: 004b7550
>>   R13: 7f311079ab58 R14: 004b7560 R15: 
>>   Code: df 48 c1 ea 03 80 3c 02 00 0f 85 e6 0b 00 00 4d 8b 6e 20 48 b8 00
>> 00 00 00 00 fc ff df 49 8d bd 80 03 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00
>> 0f 85 b6 0b 00 00 49 8b bd 80 03 00 00 e8 d6 0c 26
>>   RIP: free_verifier_state kernel/bpf/verifier.c:347 [inline] RSP:
>> ffff8801c3e2f5c8
>>   RIP: bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533 RSP:
>> 8801c3e2f5c8
>>   ---[ end trace c8d37f339dc64004 ]---
>>
>> Fixes: 638f5b90d460 ("bpf: reduce verifier memory consumption")
>> Fixes: 1969db47f8d0 ("bpf: fix verifier memory leaks")
>> Signed-off-by: Craig Gallek 
>> ---
>>  kernel/bpf/verifier.c | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 530b68550fd2..199ae7ccb2b7 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -4530,8 +4530,10 @@ int bpf_check(struct bpf_prog **prog, union
>> bpf_attr *attr)
>> env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
>>
>> ret = do_check(env);
>> -   free_verifier_state(env->cur_state, true);
>> -   env->cur_state = NULL;
>> +   if (env->cur_state) {
>> +   free_verifier_state(env->cur_state, true);
>> +   env->cur_state = NULL;
>> +   }
>
>
> right. good catch. similar fix needed in bpf_analyzer()
> Can you respin with fix for both?

I swear I had that in my test tree ;)  Sending now...


[PATCH net-next] bpf: fix verifier NULL pointer dereference

2017-11-02 Thread Craig Gallek
From: Craig Gallek 

do_check() can fail early without allocating env->cur_state under
memory pressure.  Syzkaller found the stack below on the linux-next
tree because of this.

  kasan: CONFIG_KASAN_INLINE enabled
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault:  [#1] SMP KASAN
  Dumping ftrace buffer:
 (ftrace buffer empty)
  Modules linked in:
  CPU: 1 PID: 27062 Comm: syz-executor5 Not tainted 4.14.0-rc7+ #106
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  task: 8801c2c74700 task.stack: 8801c3e28000
  RIP: 0010:free_verifier_state kernel/bpf/verifier.c:347 [inline]
  RIP: 0010:bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533
  RSP: 0018:8801c3e2f5c8 EFLAGS: 00010202
  RAX: dc00 RBX: fff4 RCX: 
  RDX: 0070 RSI: 817d5aa9 RDI: 0380
  RBP: 8801c3e2f668 R08:  R09: 1100387c5d9f
  R10: 218c4e80 R11: 85b34380 R12: 8801c4dc6a28
  R13:  R14: 8801c4dc6a00 R15: 8801c4dc6a20
  FS:  7f311079b700() GS:8801db30() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 004d4a24 CR3: 0001cbcd CR4: 001406e0
  DR0:  DR1:  DR2: 
  DR3:  DR6: fffe0ff0 DR7: 0400
  Call Trace:
   bpf_prog_load+0xcbb/0x18e0 kernel/bpf/syscall.c:1166
   SYSC_bpf kernel/bpf/syscall.c:1690 [inline]
   SyS_bpf+0xae9/0x4620 kernel/bpf/syscall.c:1652
   entry_SYSCALL_64_fastpath+0x1f/0xbe
  RIP: 0033:0x452869
  RSP: 002b:7f311079abe8 EFLAGS: 0212 ORIG_RAX: 0141
  RAX: ffda RBX: 00758020 RCX: 00452869
  RDX: 0030 RSI: 20168000 RDI: 0005
  RBP: 7f311079aa20 R08:  R09: 
  R10:  R11: 0212 R12: 004b7550
  R13: 7f311079ab58 R14: 004b7560 R15: 
  Code: df 48 c1 ea 03 80 3c 02 00 0f 85 e6 0b 00 00 4d 8b 6e 20 48 b8 00 00 00 00 00 fc ff df 49 8d bd 80 03 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 b6 0b 00 00 49 8b bd 80 03 00 00 e8 d6 0c 26
  RIP: free_verifier_state kernel/bpf/verifier.c:347 [inline] RSP: 8801c3e2f5c8
  RIP: bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533 RSP: 8801c3e2f5c8
  ---[ end trace c8d37f339dc64004 ]---

Fixes: 638f5b90d460 ("bpf: reduce verifier memory consumption")
Fixes: 1969db47f8d0 ("bpf: fix verifier memory leaks")
Signed-off-by: Craig Gallek 
---
 kernel/bpf/verifier.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 530b68550fd2..199ae7ccb2b7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4530,8 +4530,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
 
ret = do_check(env);
-   free_verifier_state(env->cur_state, true);
-   env->cur_state = NULL;
+   if (env->cur_state) {
+   free_verifier_state(env->cur_state, true);
+   env->cur_state = NULL;
+   }
 
 skip_full_check:
while (!pop_stack(env, NULL, NULL));
-- 
2.15.0.403.gc27cc4dac6-goog



[PATCH net] tun/tap: sanitize TUNSETSNDBUF input

2017-10-30 Thread Craig Gallek
From: Craig Gallek 

Syzkaller found several variants of the lockup below by setting negative
values with the TUNSETSNDBUF ioctl.  This patch adds a sanity check
to both the tun and tap versions of this ioctl.

  watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [repro:2389]
  Modules linked in:
  irq event stamp: 329692056
  hardirqs last  enabled at (329692055): [] _raw_spin_unlock_irqrestore+0x31/0x75
  hardirqs last disabled at (329692056): [] apic_timer_interrupt+0x98/0xb0
  softirqs last  enabled at (35659740): [] __do_softirq+0x328/0x48c
  softirqs last disabled at (35659731): [] irq_exit+0xbc/0xd0
  CPU: 0 PID: 2389 Comm: repro Not tainted 4.14.0-rc7 #23
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  task: 880009452140 task.stack: 880006a2
  RIP: 0010:_raw_spin_lock_irqsave+0x11/0x80
  RSP: 0018:880006a27c50 EFLAGS: 0282 ORIG_RAX: ff10
  RAX: 880009ac68d0 RBX: 880006a27ce0 RCX: 
  RDX: 0001 RSI: 880006a27ce0 RDI: 880009ac6900
  RBP: 880006a27c60 R08:  R09: 
  R10: 0001 R11: 0063ff00 R12: 880009ac6900
  R13: 880006a27cf8 R14: 0001 R15: 880006a27cf8
  FS:  7f4be4838700() GS:88000cc0() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 20101000 CR3: 09616000 CR4: 06f0
  Call Trace:
   prepare_to_wait+0x26/0xc0
   sock_alloc_send_pskb+0x14e/0x270
   ? remove_wait_queue+0x60/0x60
   tun_get_user+0x2cc/0x19d0
   ? __tun_get+0x60/0x1b0
   tun_chr_write_iter+0x57/0x86
   __vfs_write+0x156/0x1e0
   vfs_write+0xf7/0x230
   SyS_write+0x57/0xd0
   entry_SYSCALL_64_fastpath+0x1f/0xbe
  RIP: 0033:0x7f4be4356df9
  RSP: 002b:7ffc18101c08 EFLAGS: 0293 ORIG_RAX: 0001
  RAX: ffda RBX:  RCX: 7f4be4356df9
  RDX: 0046 RSI: 20101000 RDI: 0005
  RBP: 7ffc18101c40 R08: 0001 R09: 0001
  R10: 0001 R11: 0293 R12: 559c75f64780
  R13: 7ffc18101d30 R14:  R15: 

Fixes: 33dccbb050bb ("tun: Limit amount of queued packets per device")
Fixes: 20d29d7a916a ("net: macvtap driver")
Signed-off-by: Craig Gallek 
---
 drivers/net/tap.c | 2 ++
 drivers/net/tun.c | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 1b10fcc6a58d..6c0c84c33e1f 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1032,6 +1032,8 @@ static long tap_ioctl(struct file *file, unsigned int cmd,
case TUNSETSNDBUF:
if (get_user(s, sp))
return -EFAULT;
+   if (s <= 0)
+   return -EINVAL;
 
q->sk.sk_sndbuf = s;
return 0;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 5550f56cb895..42bb820a56c9 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2429,6 +2429,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
ret = -EFAULT;
break;
}
+   if (sndbuf <= 0) {
+   ret = -EINVAL;
+   break;
+   }
 
tun->sndbuf = sndbuf;
tun_set_sndbuf(tun);
-- 
2.15.0.rc2.357.g7e34df9404-goog



[PATCH net] soreuseport: fix initialization race

2017-10-19 Thread Craig Gallek
From: Craig Gallek 

Syzkaller stumbled upon a way to trigger
WARNING: CPU: 1 PID: 13881 at net/core/sock_reuseport.c:41 reuseport_alloc+0x306/0x3b0 net/core/sock_reuseport.c:39

There are two initialization paths for the sock_reuseport structure in a
socket: through the udp/tcp bind paths of SO_REUSEPORT sockets or through
SO_ATTACH_REUSEPORT_[CE]BPF before bind.  The existing implementation
assumed that the socket lock protected both of these paths when it actually
only protects the SO_ATTACH_REUSEPORT path.  Syzkaller triggered this
double allocation by running these paths concurrently.

This patch moves the check for double allocation into the reuseport_alloc
function which is protected by a global spin lock.

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek 
---
 net/core/sock_reuseport.c  | 12 +---
 net/ipv4/inet_hashtables.c |  5 +
 net/ipv4/udp.c |  5 +
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index eed1ebf7f29d..b1e0dbea1e8c 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -36,9 +36,14 @@ int reuseport_alloc(struct sock *sk)
 * soft irq of receive path or setsockopt from process context
 */
spin_lock_bh(&reuseport_lock);
-   WARN_ONCE(rcu_dereference_protected(sk->sk_reuseport_cb,
-   lockdep_is_held(&reuseport_lock)),
- "multiple allocations for the same socket");
+
+   /* Allocation attempts can occur concurrently via the setsockopt path
+* and the bind/hash path.  Nothing to do when we lose the race.
+*/
+   if (rcu_dereference_protected(sk->sk_reuseport_cb,
+ lockdep_is_held(&reuseport_lock)))
+   goto out;
+
reuse = __reuseport_alloc(INIT_SOCKS);
if (!reuse) {
spin_unlock_bh(&reuseport_lock);
@@ -49,6 +54,7 @@ int reuseport_alloc(struct sock *sk)
reuse->num_socks = 1;
rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
 
+out:
spin_unlock_bh(&reuseport_lock);
 
return 0;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 597bb4cfe805..e7d15fb0d94d 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -456,10 +456,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
return reuseport_add_sock(sk, sk2);
}
 
-   /* Initial allocation may have already happened via setsockopt */
-   if (!rcu_access_pointer(sk->sk_reuseport_cb))
-   return reuseport_alloc(sk);
-   return 0;
+   return reuseport_alloc(sk);
 }
 
 int __inet_hash(struct sock *sk, struct sock *osk)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e45177ceb0ee..3cd8103bab2c 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -231,10 +231,7 @@ static int udp_reuseport_add_sock(struct sock *sk, struct udp_hslot *hslot)
}
}
 
-   /* Initial allocation may have already happened via setsockopt */
-   if (!rcu_access_pointer(sk->sk_reuseport_cb))
-   return reuseport_alloc(sk);
-   return 0;
+   return reuseport_alloc(sk);
 }
 
 /**
-- 
2.15.0.rc1.287.g2b38de12cc-goog



[PATCH net-next v3 1/2] libbpf: parse maps sections of varying size

2017-10-05 Thread Craig Gallek
From: Craig Gallek 

This library previously assumed a fixed-size map options structure.
Any new options were ignored.  In order to allow the options structure
to grow and to support parsing older programs, this patch updates
the maps section parsing to handle varying sizes.

Object files with maps sections smaller than expected will have the new
fields initialized to zero.  Object files which have larger than expected
maps sections will be rejected unless all of the unrecognized data is zero.

This change still assumes that each map definition in the maps section
is the same size.

Signed-off-by: Craig Gallek 
---
 tools/lib/bpf/libbpf.c | 70 +-
 1 file changed, 41 insertions(+), 29 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4f402dcdf372..23152890ec60 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -579,31 +579,6 @@ bpf_object__init_kversion(struct bpf_object *obj,
return 0;
 }
 
-static int
-bpf_object__validate_maps(struct bpf_object *obj)
-{
-   int i;
-
-   /*
-* If there's only 1 map, the only error case should have been
-* catched in bpf_object__init_maps().
-*/
-   if (!obj->maps || !obj->nr_maps || (obj->nr_maps == 1))
-   return 0;
-
-   for (i = 1; i < obj->nr_maps; i++) {
-   const struct bpf_map *a = &obj->maps[i - 1];
-   const struct bpf_map *b = &obj->maps[i];
-
-   if (b->offset - a->offset < sizeof(struct bpf_map_def)) {
-   pr_warning("corrupted map section in %s: map \"%s\" too 
small\n",
-  obj->path, a->name);
-   return -EINVAL;
-   }
-   }
-   return 0;
-}
-
 static int compare_bpf_map(const void *_a, const void *_b)
 {
const struct bpf_map *a = _a;
@@ -615,7 +590,7 @@ static int compare_bpf_map(const void *_a, const void *_b)
 static int
 bpf_object__init_maps(struct bpf_object *obj)
 {
-   int i, map_idx, nr_maps = 0;
+   int i, map_idx, map_def_sz, nr_maps = 0;
Elf_Scn *scn;
Elf_Data *data;
Elf_Data *symbols = obj->efile.symbols;
@@ -658,6 +633,15 @@ bpf_object__init_maps(struct bpf_object *obj)
if (!nr_maps)
return 0;
 
+   /* Assume equally sized map definitions */
+   map_def_sz = data->d_size / nr_maps;
+   if (!data->d_size || (data->d_size % nr_maps) != 0) {
+   pr_warning("unable to determine map definition size "
+  "section %s, %d maps in %zd bytes\n",
+  obj->path, nr_maps, data->d_size);
+   return -EINVAL;
+   }
+
obj->maps = calloc(nr_maps, sizeof(obj->maps[0]));
if (!obj->maps) {
pr_warning("alloc maps for object failed\n");
@@ -690,7 +674,7 @@ bpf_object__init_maps(struct bpf_object *obj)
  obj->efile.strtabidx,
  sym.st_name);
obj->maps[map_idx].offset = sym.st_value;
-   if (sym.st_value + sizeof(struct bpf_map_def) > data->d_size) {
+   if (sym.st_value + map_def_sz > data->d_size) {
pr_warning("corrupted maps section in %s: last map 
\"%s\" too small\n",
   obj->path, map_name);
return -EINVAL;
@@ -704,12 +688,40 @@ bpf_object__init_maps(struct bpf_object *obj)
pr_debug("map %d is \"%s\"\n", map_idx,
 obj->maps[map_idx].name);
def = (struct bpf_map_def *)(data->d_buf + sym.st_value);
-   obj->maps[map_idx].def = *def;
+   /*
+* If the definition of the map in the object file fits in
+* bpf_map_def, copy it.  Any extra fields in our version
+* of bpf_map_def will default to zero as a result of the
+* calloc above.
+*/
+   if (map_def_sz <= sizeof(struct bpf_map_def)) {
+   memcpy(&obj->maps[map_idx].def, def, map_def_sz);
+   } else {
+   /*
+* Here the map structure being read is bigger than what
+* we expect, truncate if the excess bits are all zero.
+* If they are not zero, reject this map as
+* incompatible.
+*/
+   char *b;
+   for (b = ((char *)def) + sizeof(struct bpf_map_def);
+b < ((char *)def) + map_def_sz; b++) {
+   if (*b != 0) {
+ 

[PATCH net-next v3 0/2] libbpf: support more map options

2017-10-05 Thread Craig Gallek
From: Craig Gallek 

The functional change to this series is the ability to use flags when
creating maps from object files loaded by libbpf.  In order to do this,
the first patch updates the library to handle map definitions that
differ in size from libbpf's struct bpf_map_def.

For object files with a larger map definition, libbpf will continue to load
if the unknown fields are all zero, otherwise the map is rejected.  If the
map definition in the object file is smaller than expected, libbpf will use
zero as a default value in the missing fields.

Craig Gallek (2):
  libbpf: parse maps sections of varying size
  libbpf: use map_flags when creating maps

 tools/lib/bpf/libbpf.c | 72 +-
 tools/lib/bpf/libbpf.h |  1 +
 2 files changed, 43 insertions(+), 30 deletions(-)

-- 
v3:
  - explicit memcpy instead of struct assignment.
  - remove unnecessary bpf_object__validate_maps function

v2
  - determine bpf_map_def structure size dynamically from object file

2.14.2.920.gcf0c67979c-goog



[PATCH net-next v3 2/2] libbpf: use map_flags when creating maps

2017-10-05 Thread Craig Gallek
From: Craig Gallek 

This is required to use BPF_MAP_TYPE_LPM_TRIE or any other map type
which requires flags.

Signed-off-by: Craig Gallek 
---
 tools/lib/bpf/libbpf.c | 2 +-
 tools/lib/bpf/libbpf.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 23152890ec60..5aa45f89da93 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -942,7 +942,7 @@ bpf_object__create_maps(struct bpf_object *obj)
   def->key_size,
   def->value_size,
   def->max_entries,
-  0);
+  def->map_flags);
if (*pfd < 0) {
size_t j;
int err = *pfd;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 7959086eb9c9..6e20003109e0 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -207,6 +207,7 @@ struct bpf_map_def {
unsigned int key_size;
unsigned int value_size;
unsigned int max_entries;
+   unsigned int map_flags;
 };
 
 /*
-- 
2.14.2.920.gcf0c67979c-goog



Re: [PATCH net-next v2 1/2] libbpf: parse maps sections of varying size

2017-10-04 Thread Craig Gallek
On Tue, Oct 3, 2017 at 10:39 AM, Daniel Borkmann  wrote:
> On 10/03/2017 01:07 AM, Alexei Starovoitov wrote:
>>
>> On 10/2/17 9:41 AM, Craig Gallek wrote:
>>>
>>> +/* Assume equally sized map definitions */
>>> +map_def_sz = data->d_size / nr_maps;
>>> +if (!data->d_size || (data->d_size % nr_maps) != 0) {
>>> +pr_warning("unable to determine map definition size "
>>> +   "section %s, %d maps in %zd bytes\n",
>>> +   obj->path, nr_maps, data->d_size);
>>> +return -EINVAL;
>>> +}
>>
>>
>> this approach is not as flexible as done by samples/bpf/bpf_load.c
>> where it looks at every map independently by walking symtab,
>> but I guess it's ok.
>
>
> Regarding different map spec structs in a single prog: unless
> we have a good use case why we would need it (and I'm not aware
> of anything in particular), I would just go with a fixed size.
> I did kind of similar sanity checks in bpf_fetch_maps_end() in
> iproute2 loader as well.

Split vote?  I'm happy to try to make this work with varying sizes,
but I agree the usefulness seems low.  It would imply building map
sections with different definition structures and we would then need a
way to differentiate them.  If the goal is to allow for different map
definition structures, I don't think size alone is a sufficient
differentiator.


Re: [PATCH net-next v2 1/2] libbpf: parse maps sections of varying size

2017-10-04 Thread Craig Gallek
On Tue, Oct 3, 2017 at 10:11 AM, Jesper Dangaard Brouer
 wrote:
> On Mon,  2 Oct 2017 12:41:28 -0400
> Craig Gallek  wrote:
>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 4f402dcdf372..28b300868ad7 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -580,7 +580,7 @@ bpf_object__init_kversion(struct bpf_object *obj,
>>  }
>>
>>  static int
>> -bpf_object__validate_maps(struct bpf_object *obj)
>> +bpf_object__validate_maps(struct bpf_object *obj, int map_def_sz)
>>  {
>>   int i;
>>
>> @@ -595,9 +595,11 @@ bpf_object__validate_maps(struct bpf_object *obj)
>>   const struct bpf_map *a = &obj->maps[i - 1];
>>   const struct bpf_map *b = &obj->maps[i];
>>
>> - if (b->offset - a->offset < sizeof(struct bpf_map_def)) {
>> - pr_warning("corrupted map section in %s: map \"%s\" 
>> too small\n",
>> -obj->path, a->name);
>> + if (b->offset - a->offset < map_def_sz) {
>> + pr_warning("corrupted map section in %s: map \"%s\" 
>> too small "
>> +"(%zd vs %d)\n",
>> +obj->path, a->name, b->offset - a->offset,
>> +map_def_sz);
>>   return -EINVAL;
>
> Hmm... one more comment.  You have just coded handling of ELF
> map_def_sz which are smaller in a safe manner, but here this case will
> get rejected (in bpf_object__validate_maps).  That cannot be the right
> intent?

I think you are right, but my test program didn't catch this...
Perhaps an off-by-one (I only used slightly smaller map_def sizes).  I
suppose this entire function is unnecessary now.  Even the old version
wouldn't catch the case where the last offset was out of bounds?

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next v2 1/2] libbpf: parse maps sections of varying size

2017-10-04 Thread Craig Gallek
On Tue, Oct 3, 2017 at 10:03 AM, Jesper Dangaard Brouer
 wrote:
>
>
> First of all, thank you Craig for working on this.  As Alexei says, we
> need to improve tools/lib/bpf/libbpf and move towards converting users
> of bpf_load.c to this lib instead.
>
> Comments inlined below.
>
>> + obj->maps[map_idx].def = *def;
>
> I'm not too happy/comfortable with this way of copying the memory of
> "def" (the type-cast struct bpf_map_def).  I guess it works, and is
> part of the C-standard(?).

I believe this is a C++-ism.  I'm not sure if it was pulled into the
C99 standard or if it's just a gcc 'feature' now.  I kept it because
it was in the initial code, but I'm happy to do an explicit copy for
v3 if that looks better.
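
For what it's worth, struct assignment has been part of standard C since C89, so both forms below perform the same copy for a flat struct like this (the type name is illustrative):

#include <string.h>

struct example_def { unsigned int type, key_size, value_size, max_entries; };

static void copy_def(struct example_def *dst, const struct example_def *src)
{
	*dst = *src;                     /* struct assignment */
	memcpy(dst, src, sizeof(*dst));  /* equivalent explicit copy */
}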


[PATCH net-next v2 2/2] libbpf: use map_flags when creating maps

2017-10-02 Thread Craig Gallek
From: Craig Gallek 

This is required to use BPF_MAP_TYPE_LPM_TRIE or any other map type
which requires flags.

Signed-off-by: Craig Gallek 
---
 tools/lib/bpf/libbpf.c | 2 +-
 tools/lib/bpf/libbpf.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 28b300868ad7..5996e7565cc8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -968,7 +968,7 @@ bpf_object__create_maps(struct bpf_object *obj)
   def->key_size,
   def->value_size,
   def->max_entries,
-  0);
+  def->map_flags);
if (*pfd < 0) {
size_t j;
int err = *pfd;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 7959086eb9c9..6e20003109e0 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -207,6 +207,7 @@ struct bpf_map_def {
unsigned int key_size;
unsigned int value_size;
unsigned int max_entries;
+   unsigned int map_flags;
 };
 
 /*
-- 
2.14.2.822.g60be5d43e6-goog



[PATCH net-next v2 1/2] libbpf: parse maps sections of varying size

2017-10-02 Thread Craig Gallek
From: Craig Gallek 

This library previously assumed a fixed-size map options structure.
Any new options were ignored.  In order to allow the options structure
to grow and to support parsing older programs, this patch updates
the maps section parsing to handle varying sizes.

Object files with maps sections smaller than expected will have the new
fields initialized to zero.  Object files which have larger than expected
maps sections will be rejected unless all of the unrecognized data is zero.

This change still assumes that each map definition in the maps section
is the same size.

Signed-off-by: Craig Gallek 
---
 tools/lib/bpf/libbpf.c | 54 ++
 1 file changed, 46 insertions(+), 8 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4f402dcdf372..28b300868ad7 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -580,7 +580,7 @@ bpf_object__init_kversion(struct bpf_object *obj,
 }
 
 static int
-bpf_object__validate_maps(struct bpf_object *obj)
+bpf_object__validate_maps(struct bpf_object *obj, int map_def_sz)
 {
int i;
 
@@ -595,9 +595,11 @@ bpf_object__validate_maps(struct bpf_object *obj)
const struct bpf_map *a = &obj->maps[i - 1];
const struct bpf_map *b = &obj->maps[i];
 
-   if (b->offset - a->offset < sizeof(struct bpf_map_def)) {
-   pr_warning("corrupted map section in %s: map \"%s\" too 
small\n",
-  obj->path, a->name);
+   if (b->offset - a->offset < map_def_sz) {
+   pr_warning("corrupted map section in %s: map \"%s\" too 
small "
+  "(%zd vs %d)\n",
+  obj->path, a->name, b->offset - a->offset,
+  map_def_sz);
return -EINVAL;
}
}
@@ -615,7 +617,7 @@ static int compare_bpf_map(const void *_a, const void *_b)
 static int
 bpf_object__init_maps(struct bpf_object *obj)
 {
-   int i, map_idx, nr_maps = 0;
+   int i, map_idx, map_def_sz, nr_maps = 0;
Elf_Scn *scn;
Elf_Data *data;
Elf_Data *symbols = obj->efile.symbols;
@@ -658,6 +660,15 @@ bpf_object__init_maps(struct bpf_object *obj)
if (!nr_maps)
return 0;
 
+   /* Assume equally sized map definitions */
+   map_def_sz = data->d_size / nr_maps;
+   if (!data->d_size || (data->d_size % nr_maps) != 0) {
+   pr_warning("unable to determine map definition size "
+  "section %s, %d maps in %zd bytes\n",
+  obj->path, nr_maps, data->d_size);
+   return -EINVAL;
+   }
+
obj->maps = calloc(nr_maps, sizeof(obj->maps[0]));
if (!obj->maps) {
pr_warning("alloc maps for object failed\n");
@@ -690,7 +701,7 @@ bpf_object__init_maps(struct bpf_object *obj)
  obj->efile.strtabidx,
  sym.st_name);
obj->maps[map_idx].offset = sym.st_value;
-   if (sym.st_value + sizeof(struct bpf_map_def) > data->d_size) {
+   if (sym.st_value + map_def_sz > data->d_size) {
pr_warning("corrupted maps section in %s: last map 
\"%s\" too small\n",
   obj->path, map_name);
return -EINVAL;
@@ -704,12 +715,39 @@ bpf_object__init_maps(struct bpf_object *obj)
pr_debug("map %d is \"%s\"\n", map_idx,
 obj->maps[map_idx].name);
def = (struct bpf_map_def *)(data->d_buf + sym.st_value);
-   obj->maps[map_idx].def = *def;
+   /*
+* If the definition of the map in the object file fits in
+* bpf_map_def, copy it.  Any extra fields in our version
+* of bpf_map_def will default to zero as a result of the
+* calloc above.
+*/
+   if (map_def_sz <= sizeof(struct bpf_map_def)) {
+   memcpy(&obj->maps[map_idx].def, def, map_def_sz);
+   } else {
+   /*
+* Here the map structure being read is bigger than what
+* we expect, truncate if the excess bits are all zero.
+* If they are not zero, reject this map as
+* incompatible.
+*/
+   char *b;
+   for (b = ((char *)def) + sizeof(struct bpf_map_def);
+b < (

[PATCH net-next v2 0/2] libbpf: support more map options

2017-10-02 Thread Craig Gallek
From: Craig Gallek 

The functional change to this series is the ability to use flags when
creating maps from object files loaded by libbpf.  In order to do this,
the first patch updates the library to handle map definitions that
differ in size from libbpf's struct bpf_map_def.

For object files with a larger map definition, libbpf will continue to load
if the unknown fields are all zero, otherwise the map is rejected.  If the
map definition in the object file is smaller than expected, libbpf will use
zero as a default value in the missing fields.

Craig Gallek (2):
  libbpf: parse maps sections of varying size
  libbpf: use map_flags when creating maps

 tools/lib/bpf/libbpf.c | 56 ++
 tools/lib/bpf/libbpf.h |  1 +
 2 files changed, 48 insertions(+), 9 deletions(-)

-- 
2.14.2.822.g60be5d43e6-goog



Re: [PATCH net-next] libbpf: use map_flags when creating maps

2017-09-28 Thread Craig Gallek
On Wed, Sep 27, 2017 at 6:03 PM, Daniel Borkmann  wrote:
> On 09/27/2017 06:29 PM, Alexei Starovoitov wrote:
>>
>> On 9/27/17 7:04 AM, Craig Gallek wrote:
>>>
>>> From: Craig Gallek 
>>>
>>> This extends struct bpf_map_def to include a flags field.  Note that
>>> this has the potential to break the validation logic in
>>> bpf_object__validate_maps and bpf_object__init_maps as they use
>>> sizeof(struct bpf_map_def) as a minimal allowable size of a map section.
>>> Any bpf program compiled with a smaller struct bpf_map_def will fail this
>>> check.
>>>
>>> I don't believe this will be an issue in practice as both compile-time
>>> definitions of struct bpf_map_def (in samples/bpf/bpf_load.h and
>>> tools/testing/selftests/bpf/bpf_helpers.h) have always been larger
>>> than this newly updated version in libbpf.h.
>>>
>>> Signed-off-by: Craig Gallek 
>>> ---
>>>  tools/lib/bpf/libbpf.c | 2 +-
>>>  tools/lib/bpf/libbpf.h | 1 +
>>>  2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>>> index 35f6dfcdc565..6bea85f260a3 100644
>>> --- a/tools/lib/bpf/libbpf.c
>>> +++ b/tools/lib/bpf/libbpf.c
>>> @@ -874,7 +874,7 @@ bpf_object__create_maps(struct bpf_object *obj)
>>>def->key_size,
>>>def->value_size,
>>>def->max_entries,
>>> -  0);
>>> +  def->map_flags);
>>>  if (*pfd < 0) {
>>>  size_t j;
>>>  int err = *pfd;
>>> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
>>> index 7959086eb9c9..6e20003109e0 100644
>>> --- a/tools/lib/bpf/libbpf.h
>>> +++ b/tools/lib/bpf/libbpf.h
>>> @@ -207,6 +207,7 @@ struct bpf_map_def {
>>>  unsigned int key_size;
>>>  unsigned int value_size;
>>>  unsigned int max_entries;
>>> +unsigned int map_flags;
>>>  };
>>
>>
>> yes it will break loading of pre-compiled .o
>> Instead of breaking, let's fix the loader to do it the way
>> samples/bpf/bpf_load.c does.
>> See commit 156450d9d964 ("samples/bpf: make bpf_load.c code compatible
>> with ELF maps section changes")
>
>
> +1, iproute2 loader also does map spec fixup
>
> For libbpf it would be good also such that it reduces the diff
> further between the libbpf and bpf_load so that it allows move
> to libbpf for samples in future.

Fair enough, I'll try to get this to work more dynamically.  I did
notice that the fields of struct bpf_map_def in
selftests/.../bpf_helpers.h and iproute2's struct bpf_elf_map have
diverged. The flags field is the only thing missing from libbpf right
now (and they are at the same offset for both), so it won't be an
issue for this change, but it is going to make unifying all of these
things under libbpf not trivial at some point...


[PATCH net-next] libbpf: use map_flags when creating maps

2017-09-27 Thread Craig Gallek
From: Craig Gallek 

This extends struct bpf_map_def to include a flags field.  Note that
this has the potential to break the validation logic in
bpf_object__validate_maps and bpf_object__init_maps as they use
sizeof(struct bpf_map_def) as a minimal allowable size of a map section.
Any bpf program compiled with a smaller struct bpf_map_def will fail this
check.

I don't believe this will be an issue in practice as both compile-time
definitions of struct bpf_map_def (in samples/bpf/bpf_load.h and
tools/testing/selftests/bpf/bpf_helpers.h) have always been larger
than this newly updated version in libbpf.h.

Signed-off-by: Craig Gallek 
---
 tools/lib/bpf/libbpf.c | 2 +-
 tools/lib/bpf/libbpf.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 35f6dfcdc565..6bea85f260a3 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -874,7 +874,7 @@ bpf_object__create_maps(struct bpf_object *obj)
  def->key_size,
  def->value_size,
  def->max_entries,
- 0);
+ def->map_flags);
if (*pfd < 0) {
size_t j;
int err = *pfd;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 7959086eb9c9..6e20003109e0 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -207,6 +207,7 @@ struct bpf_map_def {
unsigned int key_size;
unsigned int value_size;
unsigned int max_entries;
+   unsigned int map_flags;
 };
 
 /*
-- 
2.14.2.822.g60be5d43e6-goog



[PATCH net-next v2] bpf: Optimize lpm trie delete

2017-09-21 Thread Craig Gallek
From: Craig Gallek 

Before the delete operator was added, this datastructure maintained
an invariant that intermediate nodes were only present when necessary
to build the tree.  This patch updates the delete operation to reinstate
that invariant by removing unnecessary intermediate nodes after a node is
removed and thus keeping the tree structure at a minimal size.

Suggested-by: Daniel Mack 
Signed-off-by: Craig Gallek 
---
v2:
  I really failed the interview on this one.  v1 did not collapse all
  extra intermediate node possibilities.  I believe this one does by
additionally tracking node parent information.  See comments.

 kernel/bpf/lpm_trie.c | 71 +++
 1 file changed, 43 insertions(+), 28 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 9d58a576b2ae..34d8a690ea05 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -394,8 +394,8 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 {
struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
struct bpf_lpm_trie_key *key = _key;
-   struct lpm_trie_node __rcu **trim;
-   struct lpm_trie_node *node;
+   struct lpm_trie_node __rcu **trim, **trim2;
+   struct lpm_trie_node *node, *parent;
unsigned long irq_flags;
unsigned int next_bit;
size_t matchlen = 0;
@@ -407,31 +407,26 @@ static int trie_delete_elem(struct bpf_map *map, void 
*_key)
raw_spin_lock_irqsave(&trie->lock, irq_flags);
 
/* Walk the tree looking for an exact key/length match and keeping
-* track of where we could begin trimming the tree.  The trim-point
-* is the sub-tree along the walk consisting of only single-child
-* intermediate nodes and ending at a leaf node that we want to
-* remove.
+* track of the path we traverse.  We will need to know the node
+* we wish to delete, and the slot that points to the node we want
+* to delete.  We may also need to know the nodes parent and the
+* slot that contains it.
 */
trim = &trie->root;
-   node = rcu_dereference_protected(
-   trie->root, lockdep_is_held(&trie->lock));
-   while (node) {
+   trim2 = trim;
+   parent = NULL;
+   while ((node = rcu_dereference_protected(
+  *trim, lockdep_is_held(&trie->lock)))) {
matchlen = longest_prefix_match(trie, node, key);
 
if (node->prefixlen != matchlen ||
node->prefixlen == key->prefixlen)
break;
 
+   parent = node;
+   trim2 = trim;
next_bit = extract_bit(key->data, node->prefixlen);
-   /* If we hit a node that has more than one child or is a valid
-* prefix itself, do not remove it. Reset the root of the trim
-* path to its descendant on our path.
-*/
-   if (!(node->flags & LPM_TREE_NODE_FLAG_IM) ||
-   (node->child[0] && node->child[1]))
-   trim = &node->child[next_bit];
-   node = rcu_dereference_protected(
-   node->child[next_bit], lockdep_is_held(&trie->lock));
+   trim = &node->child[next_bit];
}
 
if (!node || node->prefixlen != key->prefixlen ||
@@ -442,27 +437,47 @@ static int trie_delete_elem(struct bpf_map *map, void 
*_key)
 
trie->n_entries--;
 
-   /* If the node we are removing is not a leaf node, simply mark it
+   /* If the node we are removing has two children, simply mark it
 * as intermediate and we are done.
 */
-   if (rcu_access_pointer(node->child[0]) ||
+   if (rcu_access_pointer(node->child[0]) &&
rcu_access_pointer(node->child[1])) {
node->flags |= LPM_TREE_NODE_FLAG_IM;
goto out;
}
 
-   /* trim should now point to the slot holding the start of a path from
-* zero or more intermediate nodes to our leaf node for deletion.
+   /* If the parent of the node we are about to delete is an intermediate
+* node, and the deleted node doesn't have any children, we can delete
+* the intermediate parent as well and promote its other child
+* up the tree.  Doing this maintains the invariant that all
+* intermediate nodes have exactly 2 children and that there are no
+* unnecessary intermediate nodes in the tree.
 */
-   while ((node = rcu_dereference_protected(
-   *trim, lockdep_is_held(&trie->lock)))) {
-   RCU_INIT_POINTER(*trim, NULL);
-   trim = rcu_access_pointer(node->child[0]) ?
-   &node->child[0] :
-

Re: [PATCH net-next] bpf: Optimize lpm trie delete

2017-09-21 Thread Craig Gallek
On Wed, Sep 20, 2017 at 6:56 PM, Daniel Mack  wrote:
> On 09/20/2017 08:51 PM, Craig Gallek wrote:
>> On Wed, Sep 20, 2017 at 12:51 PM, Daniel Mack  wrote:
>>> Hi Craig,
>>>
>>> Thanks, this looks much cleaner already :)
>>>
>>> On 09/20/2017 06:22 PM, Craig Gallek wrote:
>>>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>>>> index 9d58a576b2ae..b5a7d70ec8b5 100644
>>>> --- a/kernel/bpf/lpm_trie.c
>>>> +++ b/kernel/bpf/lpm_trie.c
>>>> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void 
>>>> *_key)
>>>>   struct lpm_trie_node __rcu **trim;
>>>>   struct lpm_trie_node *node;
>>>>   unsigned long irq_flags;
>>>> - unsigned int next_bit;
>>>> + unsigned int next_bit = 0;
>>>
>>> This default assignment seems wrong, and I guess you only added it to
>>> squelch a compiler warning?
>> Yes, this variable is only initialized after the lookup iterations
>> below (meaning it will never be initialized in the root-removal
>> case).
>
> Right, and once set, it's only updated in case we don't have an exact
> match and try to drill down further.
>
>>> [...]
>>>
>>>> + /* If the node has one child, we may be able to collapse the tree
>>>> +  * while removing this node if the node's child is in the same
>>>> +  * 'next bit' slot as this node was in its parent or if the node
>>>> +  * itself is the root.
>>>> +  */
>>>> + if (trim == &trie->root) {
>>>> + next_bit = node->child[0] ? 0 : 1;
>>>> + rcu_assign_pointer(trie->root, node->child[next_bit]);
>>>> + kfree_rcu(node, rcu);
>>>
>>> I don't think you should treat this 'root' case special.
>>>
>>> Instead, move the 'next_bit' assignment outside of the condition ...
>> I'm not quite sure I follow.  Are you saying do something like this:
>>
>> if (trim == &trie->root) {
>>   next_bit = node->child[0] ? 0 : 1;
>> }
>> if (rcu_access_pointer(node->child[next_bit])) {
>> ...
>>
>> This would save a couple lines of code, but I think the as-is
>> implementation is slightly easier to understand.  I don't have a
>> strong opinion either way, though.
>
> Me neither :)
>
> My idea was to set
>
>   next_bit = node->child[0] ? 0 : 1;
>
> unconditionally, because it should result in the same in both cases.
>
> It might be a bit of bike shedding, but I dislike this default
> assignment, and I believe that not relying on next_bit to be set as a
> side effect of the lookup loop makes the code a bit more readable.
>
> WDYT?
That sounds reasonable.  I'll spin a v2 today if no one else has any comments.

Thanks again,
Craig


Re: [PATCH net-next] bpf: Optimize lpm trie delete

2017-09-20 Thread Craig Gallek
On Wed, Sep 20, 2017 at 12:51 PM, Daniel Mack  wrote:
> Hi Craig,
>
> Thanks, this looks much cleaner already :)
>
> On 09/20/2017 06:22 PM, Craig Gallek wrote:
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index 9d58a576b2ae..b5a7d70ec8b5 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void 
>> *_key)
>>   struct lpm_trie_node __rcu **trim;
>>   struct lpm_trie_node *node;
>>   unsigned long irq_flags;
>> - unsigned int next_bit;
>> + unsigned int next_bit = 0;
>
> This default assignment seems wrong, and I guess you only added it to
> squelch a compiler warning?
Yes, this variable is only initialized after the lookup iterations
below (meaning it will never be initialized in the root-removal
case).

> [...]
>
>> + /* If the node has one child, we may be able to collapse the tree
>> +  * while removing this node if the node's child is in the same
>> +  * 'next bit' slot as this node was in its parent or if the node
>> +  * itself is the root.
>> +  */
>> + if (trim == &trie->root) {
>> + next_bit = node->child[0] ? 0 : 1;
>> + rcu_assign_pointer(trie->root, node->child[next_bit]);
>> + kfree_rcu(node, rcu);
>
> I don't think you should treat this 'root' case special.
>
> Instead, move the 'next_bit' assignment outside of the condition ...
I'm not quite sure I follow.  Are you saying do something like this:

if (trim == &trie->root) {
  next_bit = node->child[0] ? 0 : 1;
}
if (rcu_access_pointer(node->child[next_bit])) {
...

This would save a couple lines of code, but I think the as-is
implementation is slightly easier to understand.  I don't have a
strong opinion either way, though.

Thanks for the pointers,
Craig


[PATCH net-next] bpf: Optimize lpm trie delete

2017-09-20 Thread Craig Gallek
From: Craig Gallek 

Before the delete operator was added, this datastructure maintained
an invariant that intermediate nodes were only present when necessary
to build the tree.  This patch updates the delete operation to reinstate
that invariant by removing unnecessary intermediate nodes after a node is
removed and thus keeping the tree structure at a minimal size.

Suggested-by: Daniel Mack 
Signed-off-by: Craig Gallek 
---
 kernel/bpf/lpm_trie.c | 55 +++
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 9d58a576b2ae..b5a7d70ec8b5 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
struct lpm_trie_node __rcu **trim;
struct lpm_trie_node *node;
unsigned long irq_flags;
-   unsigned int next_bit;
+   unsigned int next_bit = 0;
size_t matchlen = 0;
int ret = 0;
 
@@ -408,14 +408,12 @@ static int trie_delete_elem(struct bpf_map *map, void 
*_key)
 
/* Walk the tree looking for an exact key/length match and keeping
 * track of where we could begin trimming the tree.  The trim-point
-* is the sub-tree along the walk consisting of only single-child
-* intermediate nodes and ending at a leaf node that we want to
-* remove.
+* is the location of the pointer where we will remove a node from the
+* tree.
 */
trim = &trie->root;
-   node = rcu_dereference_protected(
-   trie->root, lockdep_is_held(&trie->lock));
-   while (node) {
+   while ((node = rcu_dereference_protected(
+  *trim, lockdep_is_held(&trie->lock)))) {
matchlen = longest_prefix_match(trie, node, key);
 
if (node->prefixlen != matchlen ||
@@ -423,15 +421,7 @@ static int trie_delete_elem(struct bpf_map *map, void 
*_key)
break;
 
next_bit = extract_bit(key->data, node->prefixlen);
-   /* If we hit a node that has more than one child or is a valid
-* prefix itself, do not remove it. Reset the root of the trim
-* path to its descendant on our path.
-*/
-   if (!(node->flags & LPM_TREE_NODE_FLAG_IM) ||
-   (node->child[0] && node->child[1]))
-   trim = &node->child[next_bit];
-   node = rcu_dereference_protected(
-   node->child[next_bit], lockdep_is_held(&trie->lock));
+   trim = &node->child[next_bit];
}
 
if (!node || node->prefixlen != key->prefixlen ||
@@ -442,25 +432,38 @@ static int trie_delete_elem(struct bpf_map *map, void 
*_key)
 
trie->n_entries--;
 
-   /* If the node we are removing is not a leaf node, simply mark it
+   /* If the node we are removing has two children, simply mark it
 * as intermediate and we are done.
 */
-   if (rcu_access_pointer(node->child[0]) ||
+   if (rcu_access_pointer(node->child[0]) &&
rcu_access_pointer(node->child[1])) {
node->flags |= LPM_TREE_NODE_FLAG_IM;
goto out;
}
 
-   /* trim should now point to the slot holding the start of a path from
-* zero or more intermediate nodes to our leaf node for deletion.
-*/
-   while ((node = rcu_dereference_protected(
-   *trim, lockdep_is_held(&trie->lock)))) {
+   /* If the node has no children, it can be completely removed */
+   if (!rcu_access_pointer(node->child[0]) &&
+   !rcu_access_pointer(node->child[1])) {
RCU_INIT_POINTER(*trim, NULL);
-   trim = rcu_access_pointer(node->child[0]) ?
-   &node->child[0] :
-   &node->child[1];
kfree_rcu(node, rcu);
+   goto out;
+   }
+
+   /* If the node has one child, we may be able to collapse the tree
+* while removing this node if the node's child is in the same
+* 'next bit' slot as this node was in its parent or if the node
+* itself is the root.
+*/
+   if (trim == &trie->root) {
+   next_bit = node->child[0] ? 0 : 1;
+   rcu_assign_pointer(trie->root, node->child[next_bit]);
+   kfree_rcu(node, rcu);
+   } else if (rcu_access_pointer(node->child[next_bit])) {
+   rcu_assign_pointer(*trim, node->child[next_bit]);
+   kfree_rcu(node, rcu);
+   } else {
+   /* If we can't collapse, just mark this node as intermediate */
+   node->flags |= LPM_TREE_NODE_FLAG_IM;
}
 
 out:
-- 
2.14.1.821.g8fa685d3b7-goog



Re: [PATCH net-next 0/3] Implement delete for BPF LPM trie

2017-09-19 Thread Craig Gallek
On Tue, Sep 19, 2017 at 5:13 PM, Daniel Mack  wrote:
> On 09/19/2017 10:55 PM, David Miller wrote:
>> From: Craig Gallek 
>> Date: Mon, 18 Sep 2017 15:30:54 -0400
>>
>>> This was previously left as a TODO.  Add the implementation and
>>> extend the test to cover it.
>>
>> Series applied, thanks.
>>
>
> Hmm, I think these patches need some more discussion regarding the IM
> nodes handling, see the reply I sent an hour ago. Could you wait for
> that before pushing your tree?

I can follow up with a patch to implement your suggestion.  It's
really just an efficiency improvement, though, so I think it's ok to
handle independently. (Sorry, I haven't had a chance to play with the
implementation details yet).

Craig


Re: [PATCH net-next 1/3] bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE

2017-09-19 Thread Craig Gallek
On Mon, Sep 18, 2017 at 6:53 PM, Alexei Starovoitov  wrote:
Thanks for the review!  Please correct me if I'm wrong...

> On 9/18/17 12:30 PM, Craig Gallek wrote:
>>
>> From: Craig Gallek 
>>
>> This is a simple non-recursive delete operation.  It prunes paths
>> of empty nodes in the tree, but it does not try to further compress
>> the tree as nodes are removed.
>>
>> Signed-off-by: Craig Gallek 
>> ---
>>  kernel/bpf/lpm_trie.c | 80
>> +--
>>  1 file changed, 77 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index 1b767844a76f..9d58a576b2ae 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -389,10 +389,84 @@ static int trie_update_elem(struct bpf_map *map,
>> return ret;
>>  }
>>
>> -static int trie_delete_elem(struct bpf_map *map, void *key)
>> +/* Called from syscall or from eBPF program */
>> +static int trie_delete_elem(struct bpf_map *map, void *_key)
>>  {
>> -   /* TODO */
>> -   return -ENOSYS;
>> +   struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
>> +   struct bpf_lpm_trie_key *key = _key;
>> +   struct lpm_trie_node __rcu **trim;
>> +   struct lpm_trie_node *node;
>> +   unsigned long irq_flags;
>> +   unsigned int next_bit;
>> +   size_t matchlen = 0;
>> +   int ret = 0;
>> +
>> +   if (key->prefixlen > trie->max_prefixlen)
>> +   return -EINVAL;
>> +
>> +   raw_spin_lock_irqsave(&trie->lock, irq_flags);
>> +
>> +   /* Walk the tree looking for an exact key/length match and keeping
>> +* track of where we could begin trimming the tree.  The
>> trim-point
>> +* is the sub-tree along the walk consisting of only single-child
>> +* intermediate nodes and ending at a leaf node that we want to
>> +* remove.
>> +*/
>> +   trim = &trie->root;
>> +   node = rcu_dereference_protected(
>> +   trie->root, lockdep_is_held(&trie->lock));
>> +   while (node) {
>> +   matchlen = longest_prefix_match(trie, node, key);
>> +
>> +   if (node->prefixlen != matchlen ||
>> +   node->prefixlen == key->prefixlen)
>> +   break;
>
>
> curious why there is no need to do
> 'node->prefixlen == trie->max_prefixlen' in the above
> like update/lookup do?
I don't believe the node->prefixlen == trie->max_prefixlen check in
trie_update_elem is necessary. Reaching this third clause implies that
the first two clauses evaluated false, which happens when we find an
exact prefix match for the current node but the to-be-inserted key
prefix is different.  If the node we are comparing against had a prefix
of max_prefixlen, it would not be possible to have a full prefix match
and different prefix lengths at the same time.  This
assumes that there are no nodes in the tree with > max_prefixlen
prefixes, but that is handled earlier in the update function.

There's a similar (I believe) unnecessary max_prefixlen check in
trie_lookup_elem.  The function should behave the same way without
that check, but at least in this case it's used as an early-out and
saves a few lines of execution.

>
>> +
>> +   next_bit = extract_bit(key->data, node->prefixlen);
>> +   /* If we hit a node that has more than one child or is a
>> valid
>> +* prefix itself, do not remove it. Reset the root of the
>> trim
>> +* path to its descendant on our path.
>> +*/
>> +   if (!(node->flags & LPM_TREE_NODE_FLAG_IM) ||
>> +   (node->child[0] && node->child[1]))
>> +   trim = &node->child[next_bit];
>> +   node = rcu_dereference_protected(
>> +   node->child[next_bit],
>> lockdep_is_held(&trie->lock));
>> +   }
>> +
>> +   if (!node || node->prefixlen != key->prefixlen ||
>> +   (node->flags & LPM_TREE_NODE_FLAG_IM)) {
>> +   ret = -ENOENT;
>> +   goto out;
>> +   }
>> +
>> +   trie->n_entries--;
>> +
>> +   /* If the node we are removing is not a leaf node, simply mark it
>> +* as intermediate and we are done.
>> +*/
>> +   if (rcu_access_po

[PATCH net-next 3/3] bpf: Test deletion in BPF_MAP_TYPE_LPM_TRIE

2017-09-18 Thread Craig Gallek
From: Craig Gallek 

Extend the 'random' operation tests to include a delete operation
(delete half of the nodes from both lpm implementions and ensure
that lookups are still equivalent).

Also, add a simple IPv4 test which verifies lookup behavior as nodes
are deleted from the tree.

Signed-off-by: Craig Gallek 
---
 tools/testing/selftests/bpf/test_lpm_map.c | 187 -
 1 file changed, 183 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_lpm_map.c 
b/tools/testing/selftests/bpf/test_lpm_map.c
index a5ed7c4f1819..6fedb9fec36b 100644
--- a/tools/testing/selftests/bpf/test_lpm_map.c
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -104,6 +104,34 @@ static struct tlpm_node *tlpm_match(struct tlpm_node *list,
return best;
 }
 
+static struct tlpm_node *tlpm_delete(struct tlpm_node *list,
+const uint8_t *key,
+size_t n_bits)
+{
+   struct tlpm_node *best = tlpm_match(list, key, n_bits);
+   struct tlpm_node *node;
+
+   if (!best || best->n_bits != n_bits)
+   return list;
+
+   if (best == list) {
+   node = best->next;
+   free(best);
+   return node;
+   }
+
+   for (node = list; node; node = node->next) {
+   if (node->next == best) {
+   node->next = best->next;
+   free(best);
+   return list;
+   }
+   }
+   /* should never get here */
+   assert(0);
+   return list;
+}
+
 static void test_lpm_basic(void)
 {
struct tlpm_node *list = NULL, *t1, *t2;
@@ -126,6 +154,13 @@ static void test_lpm_basic(void)
assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 15));
assert(!tlpm_match(list, (uint8_t[]){ 0x7f, 0xff }, 16));
 
+   list = tlpm_delete(list, (uint8_t[]){ 0xff, 0xff }, 16);
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16));
+
+   list = tlpm_delete(list, (uint8_t[]){ 0xff }, 8);
+   assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+
tlpm_clear(list);
 }
 
@@ -170,7 +205,7 @@ static void test_lpm_order(void)
 
 static void test_lpm_map(int keysize)
 {
-   size_t i, j, n_matches, n_nodes, n_lookups;
+   size_t i, j, n_matches, n_matches_after_delete, n_nodes, n_lookups;
struct tlpm_node *t, *list = NULL;
struct bpf_lpm_trie_key *key;
uint8_t *data, *value;
@@ -182,6 +217,7 @@ static void test_lpm_map(int keysize)
 */
 
n_matches = 0;
+   n_matches_after_delete = 0;
n_nodes = 1 << 8;
n_lookups = 1 << 16;
 
@@ -235,15 +271,54 @@ static void test_lpm_map(int keysize)
}
}
 
+   /* Remove the first half of the elements in the tlpm and the
+* corresponding nodes from the bpf-lpm.  Then run the same
+* large number of random lookups in both and make sure they match.
+* Note: we need to count the number of nodes actually inserted
+* since there may have been duplicates.
+*/
+   for (i = 0, t = list; t; i++, t = t->next)
+   ;
+   for (j = 0; j < i / 2; ++j) {
+   key->prefixlen = list->n_bits;
+   memcpy(key->data, list->key, keysize);
+   r = bpf_map_delete_elem(map, key);
+   assert(!r);
+   list = tlpm_delete(list, list->key, list->n_bits);
+   assert(list);
+   }
+   for (i = 0; i < n_lookups; ++i) {
+   for (j = 0; j < keysize; ++j)
+   data[j] = rand() & 0xff;
+
+   t = tlpm_match(list, data, 8 * keysize);
+
+   key->prefixlen = 8 * keysize;
+   memcpy(key->data, data, keysize);
+   r = bpf_map_lookup_elem(map, key, value);
+   assert(!r || errno == ENOENT);
+   assert(!t == !!r);
+
+   if (t) {
+   ++n_matches_after_delete;
+   assert(t->n_bits == value[keysize]);
+   for (j = 0; j < t->n_bits; ++j)
+   assert((t->key[j / 8] & (1 << (7 - j % 8))) ==
+  (value[j / 8] & (1 << (7 - j % 8;
+   }
+   }
+
close(map);
tlpm_clear(list);
 
/* With 255 random nodes in the map, we are pretty likely to match
 * something on every lookup. For statistics, use this:
 *
-* printf("  nodes: %zu\n"
-*"lookups: %zu\n"
-*"matches: %zu\n", n_nodes, n_lookups, n_matches);
+* printf("  nodes: %zu\n"
+*"looku

[PATCH net-next 0/3] Implement delete for BPF LPM trie

2017-09-18 Thread Craig Gallek
From: Craig Gallek 

This was previously left as a TODO.  Add the implementation and
extend the test to cover it.

Craig Gallek (3):
  bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE
  bpf: Add uniqueness invariant to trivial lpm test implementation
  bpf: Test deletion in BPF_MAP_TYPE_LPM_TRIE

 kernel/bpf/lpm_trie.c  |  80 +++-
 tools/testing/selftests/bpf/test_lpm_map.c | 201 -
 2 files changed, 273 insertions(+), 8 deletions(-)

-- 
2.14.1.690.gbb1197296e-goog



[PATCH net-next 2/3] bpf: Add uniqueness invariant to trivial lpm test implementation

2017-09-18 Thread Craig Gallek
From: Craig Gallek 

The 'trivial' lpm implementation in this test allows equivalent nodes
to be added (that is, nodes consisting of the same prefix and prefix
length).  For lookup operations, this is fine because insertion happens
at the head of the (singly linked) list and the first, best match is
returned.  In order to support deletion, the tlpm data structure must
first enforce uniqueness.  This change modifies the insertion algorithm
to search for equivalent nodes and remove them.  Note: the
BPF_MAP_TYPE_LPM_TRIE already has a uniqueness invariant that is
implemented as node replacement.

Signed-off-by: Craig Gallek 
---
 tools/testing/selftests/bpf/test_lpm_map.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_lpm_map.c 
b/tools/testing/selftests/bpf/test_lpm_map.c
index e97565243d59..a5ed7c4f1819 100644
--- a/tools/testing/selftests/bpf/test_lpm_map.c
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -31,6 +31,10 @@ struct tlpm_node {
uint8_t key[];
 };
 
+static struct tlpm_node *tlpm_match(struct tlpm_node *list,
+   const uint8_t *key,
+   size_t n_bits);
+
 static struct tlpm_node *tlpm_add(struct tlpm_node *list,
  const uint8_t *key,
  size_t n_bits)
@@ -38,9 +42,17 @@ static struct tlpm_node *tlpm_add(struct tlpm_node *list,
struct tlpm_node *node;
size_t n;
 
+   n = (n_bits + 7) / 8;
+
+   /* 'overwrite' an equivalent entry if one already exists */
+   node = tlpm_match(list, key, n_bits);
+   if (node && node->n_bits == n_bits) {
+   memcpy(node->key, key, n);
+   return list;
+   }
+
/* add new entry with @key/@n_bits to @list and return new head */
 
-   n = (n_bits + 7) / 8;
node = malloc(sizeof(*node) + n);
assert(node);
 
-- 
2.14.1.690.gbb1197296e-goog
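
A short usage sketch of the reference implementation, in the spirit of
test_lpm_basic(): with the uniqueness invariant above in place, deleting the
longer of two overlapping prefixes falls back to the shorter match.  It
assumes the tlpm_* helpers from test_lpm_map.c (including the tlpm_delete()
added in patch 3/3) and is illustrative only:

static void example_delete_falls_back(void)
{
        struct tlpm_node *list = NULL;

        list = tlpm_add(list, (uint8_t[]){ 0xff }, 8);
        list = tlpm_add(list, (uint8_t[]){ 0xff, 0xff }, 16);
        assert(tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16)->n_bits == 16);

        /* Deleting the /16 leaves the /8 as the best remaining match. */
        list = tlpm_delete(list, (uint8_t[]){ 0xff, 0xff }, 16);
        assert(tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16)->n_bits == 8);
        tlpm_clear(list);
}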



[PATCH net-next 1/3] bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE

2017-09-18 Thread Craig Gallek
From: Craig Gallek 

This is a simple non-recursive delete operation.  It prunes paths
of empty nodes in the tree, but it does not try to further compress
the tree as nodes are removed.

Signed-off-by: Craig Gallek 
---
 kernel/bpf/lpm_trie.c | 80 +--
 1 file changed, 77 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 1b767844a76f..9d58a576b2ae 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -389,10 +389,84 @@ static int trie_update_elem(struct bpf_map *map,
return ret;
 }
 
-static int trie_delete_elem(struct bpf_map *map, void *key)
+/* Called from syscall or from eBPF program */
+static int trie_delete_elem(struct bpf_map *map, void *_key)
 {
-   /* TODO */
-   return -ENOSYS;
+   struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
+   struct bpf_lpm_trie_key *key = _key;
+   struct lpm_trie_node __rcu **trim;
+   struct lpm_trie_node *node;
+   unsigned long irq_flags;
+   unsigned int next_bit;
+   size_t matchlen = 0;
+   int ret = 0;
+
+   if (key->prefixlen > trie->max_prefixlen)
+   return -EINVAL;
+
+   raw_spin_lock_irqsave(&trie->lock, irq_flags);
+
+   /* Walk the tree looking for an exact key/length match and keeping
+* track of where we could begin trimming the tree.  The trim-point
+* is the sub-tree along the walk consisting of only single-child
+* intermediate nodes and ending at a leaf node that we want to
+* remove.
+*/
+   trim = &trie->root;
+   node = rcu_dereference_protected(
+   trie->root, lockdep_is_held(&trie->lock));
+   while (node) {
+   matchlen = longest_prefix_match(trie, node, key);
+
+   if (node->prefixlen != matchlen ||
+   node->prefixlen == key->prefixlen)
+   break;
+
+   next_bit = extract_bit(key->data, node->prefixlen);
+   /* If we hit a node that has more than one child or is a valid
+* prefix itself, do not remove it. Reset the root of the trim
+* path to its descendant on our path.
+*/
+   if (!(node->flags & LPM_TREE_NODE_FLAG_IM) ||
+   (node->child[0] && node->child[1]))
+   trim = &node->child[next_bit];
+   node = rcu_dereference_protected(
+   node->child[next_bit], lockdep_is_held(&trie->lock));
+   }
+
+   if (!node || node->prefixlen != key->prefixlen ||
+   (node->flags & LPM_TREE_NODE_FLAG_IM)) {
+   ret = -ENOENT;
+   goto out;
+   }
+
+   trie->n_entries--;
+
+   /* If the node we are removing is not a leaf node, simply mark it
+* as intermediate and we are done.
+*/
+   if (rcu_access_pointer(node->child[0]) ||
+   rcu_access_pointer(node->child[1])) {
+   node->flags |= LPM_TREE_NODE_FLAG_IM;
+   goto out;
+   }
+
+   /* trim should now point to the slot holding the start of a path from
+* zero or more intermediate nodes to our leaf node for deletion.
+*/
+   while ((node = rcu_dereference_protected(
+   *trim, lockdep_is_held(&trie->lock)))) {
+   RCU_INIT_POINTER(*trim, NULL);
+   trim = rcu_access_pointer(node->child[0]) ?
+   &node->child[0] :
+   &node->child[1];
+   kfree_rcu(node, rcu);
+   }
+
+out:
+   raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
+
+   return ret;
 }
 
 #define LPM_DATA_SIZE_MAX  256
-- 
2.14.1.690.gbb1197296e-goog
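
A hedged userspace sketch of exercising the new delete path, in the style of
the selftest: the key is a 32-bit prefix length followed by the address bytes
(the struct bpf_lpm_trie_key layout), and the call fails with ENOENT unless
an entry with exactly that prefix and length exists.  The local key struct
and include path are assumptions, not part of this patch:

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <bpf/bpf.h>    /* assumed path for the tools/lib/bpf syscall wrappers */

struct lpm_key4 {       /* local stand-in mirroring struct bpf_lpm_trie_key */
        uint32_t prefixlen;
        uint8_t  data[4];
};

static int delete_v4_prefix(int map_fd, const uint8_t addr[4], uint32_t plen)
{
        struct lpm_key4 key = { .prefixlen = plen };

        memcpy(key.data, addr, sizeof(key.data));
        /* Returns -1 with errno == ENOENT when no exact prefix/len match. */
        return bpf_map_delete_elem(map_fd, &key);
}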



[PATCH net-next] dsa: fix flow dissector null pointer

2017-08-15 Thread Craig Gallek
From: Craig Gallek 

A recent change to fix up DSA device behavior made the assumption that
all skbs passing through the flow disector will be associated with a
device. This does not appear to be a safe assumption.  Syzkaller found
the crash below by attaching a BPF socket filter that tries to find the
payload offset of a packet passing between two unix sockets.

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 2940 Comm: syzkaller872007 Not tainted 4.13.0-rc4-next-20170811 #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
task: 8801d1b425c0 task.stack: 8801d0bc
RIP: 0010:__skb_flow_dissect+0xdcd/0x3ae0 net/core/flow_dissector.c:445
RSP: 0018:8801d0bc7340 EFLAGS: 00010206
RAX: dc00 RBX:  RCX: 
RDX: 0060 RSI: 856dc080 RDI: 0300
RBP: 8801d0bc7870 R08:  R09: 
R10: 0008 R11: ed003a178f1e R12: 
R13:  R14: 856dc080 R15: 8801ce223140
FS:  016ed880() GS:8801dc00() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 20008000 CR3: 0001ce22d000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 skb_flow_dissect_flow_keys include/linux/skbuff.h:1176 [inline]
 skb_get_poff+0x9a/0x1a0 net/core/flow_dissector.c:1079
 __skb_get_pay_offset net/core/filter.c:114 [inline]
 __skb_get_pay_offset+0x15/0x20 net/core/filter.c:112
Code: 80 3c 02 00 44 89 6d 10 0f 85 44 2b 00 00 4d 8b 67 20 48 b8 00 00 00 00 
00 fc ff df 49 8d bc 24 00 03 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 13 
2b 00 00 4d 8b a4 24 00 03 00 00 4d 85 e4
RIP: __skb_flow_dissect+0xdcd/0x3ae0 net/core/flow_dissector.c:445 RSP: 
8801d0bc7340

Fixes: 43e665287f93 ("net-next: dsa: fix flow dissection")
Reported-by: Dmitry Vyukov 
Signed-off-by: Craig Gallek 
---
 net/core/flow_dissector.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 79b9c06c83ad..e2eaa1ff948d 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -442,7 +442,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
nhoff = skb_network_offset(skb);
hlen = skb_headlen(skb);
 #if IS_ENABLED(CONFIG_NET_DSA)
-   if (unlikely(netdev_uses_dsa(skb->dev))) {
+   if (unlikely(skb->dev && netdev_uses_dsa(skb->dev))) {
const struct dsa_device_ops *ops;
int offset;
 
-- 
2.14.0.434.g98096fd7a8-goog
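
A rough sketch of the reproducer class described above (not the original
syzkaller program): a classic BPF filter that loads SKF_AD_PAY_OFFSET forces
the flow dissector to run on AF_UNIX traffic, where skb->dev is NULL.  On a
kernel without this fix it should trigger the crash; constants come from
linux/filter.h and the destination/payload details are assumptions:

#include <linux/filter.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct sock_filter insns[] = {
                /* Ancillary load of the payload offset: skb_get_poff() ->
                 * __skb_flow_dissect() with skb->dev == NULL. */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                         SKF_AD_OFF + SKF_AD_PAY_OFFSET),
                BPF_STMT(BPF_RET | BPF_K, 0xffff),
        };
        struct sock_fprog prog = { .len = 2, .filter = insns };
        char buf[64];
        int sv[2];

        socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
        setsockopt(sv[1], SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog));
        memset(buf, 0, sizeof(buf));
        write(sv[0], buf, sizeof(buf)); /* sv[1]'s filter runs on this send */
        close(sv[0]);
        close(sv[1]);
        return 0;
}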



Re: [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf

2017-06-21 Thread Craig Gallek
On Wed, Jun 21, 2017 at 12:51 PM, Lawrence Brakmo  wrote:
>
> On 6/20/17, 2:25 PM, "Craig Gallek"  wrote:
>
> On Mon, Jun 19, 2017 at 11:00 PM, Lawrence Brakmo  wrote:
> > Added support for calling a subset of socket setsockopts from
> > BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
> > than making the changes to call the socket setsockopt function because
> > the changes required would have been larger.
> >
> > @@ -2671,6 +2672,69 @@ static const struct bpf_func_proto 
> bpf_get_socket_uid_proto = {
> > .arg1_type  = ARG_PTR_TO_CTX,
> >  };
> >
> > +BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> > +  int, level, int, optname, char *, optval, int, optlen)
> > +{
> > +   struct sock *sk = bpf_sock->sk;
> > +   int ret = 0;
> > +   int val;
> > +
> > +   if (bpf_sock->is_req_sock)
> > +   return -EINVAL;
> > +
> > +   if (level == SOL_SOCKET) {
> > +   /* Only some socketops are supported */
> > +   val = *((int *)optval);
> > +
> > +   switch (optname) {
> > +   case SO_RCVBUF:
> > +   sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > +   sk->sk_rcvbuf = max_t(int, val * 2, 
> SOCK_MIN_RCVBUF);
> > +   break;
> > +   case SO_SNDBUF:
> > +   sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> > +   sk->sk_sndbuf = max_t(int, val * 2, 
> SOCK_MIN_SNDBUF);
> > +   break;
> > +   case SO_MAX_PACING_RATE:
> > +   sk->sk_max_pacing_rate = val;
> > +   sk->sk_pacing_rate = min(sk->sk_pacing_rate,
> > +
> sk->sk_max_pacing_rate);
> > +   break;
> > +   case SO_PRIORITY:
> > +   sk->sk_priority = val;
> > +   break;
> > +   case SO_RCVLOWAT:
> > +   if (val < 0)
> > +   val = INT_MAX;
> > +   sk->sk_rcvlowat = val ? : 1;
> > +   break;
> > +   case SO_MARK:
> > +   sk->sk_mark = val;
> > +   break;
>
> Isn't the socket lock required when manipulating these fields?  It's
> not obvious that the lock is held from every bpf hook point that could
> trigger this function...
>
> The sock_ops BPF programs are being called from within the network
> stack and my understanding is that the lock has already been taken.
> Currently they are only called:
> (1) after a packet is received, where there is the call to
> bh_lock_sock_nested() in tcp_v4_rcv() before calling
> tcp_v4_do_rcv().
> (2) in tcp_connect(), where there should be no issue
Someone who understands the TCP stack better than I should verify
this, but even if it's OK to do in these specific spots, it's not
unreasonable to believe that someone will add another socket-context
bpf hook in the future where it would not be safe.  Without some
additional check to prevent this setsockopt function from being called
in those spots, we could run into trouble.  The only other
socket-context point currently is the cgroup one, which happens during
socket creation and should also be safe.

> Just in case I added a check “sock_owned_by_me(sk)” in tcp_call_bpf()
> Do you think this is enough, or should I explicitly add a bh_lock_sock_nested
> in the bpf_setsockopt function?
Adding the check is certainly a way to test the behavior as
implemented, but this bpf function could be called by any
socket-context bpf (not just the tcp_call_bpf ones).  I believe the
current bpf hook points only guarantee RCU read-side lock.  Adding an
additional lock guarantee may have undesirable performance
implications.  If this is just for socket creation or other rare
events it's probably not a big deal, but if it's for a hook in the
fast path it's probably a non-starter.

I guess the higher level question is what should the locking
guarantees for socket-context bpf programs be?


Re: [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf

2017-06-20 Thread Craig Gallek
On Mon, Jun 19, 2017 at 11:00 PM, Lawrence Brakmo  wrote:
> Added support for calling a subset of socket setsockopts from
> BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
> than making the changes to call the socket setsockopt function because
> the changes required would have been larger.
>
> @@ -2671,6 +2672,69 @@ static const struct bpf_func_proto 
> bpf_get_socket_uid_proto = {
> .arg1_type  = ARG_PTR_TO_CTX,
>  };
>
> +BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> +  int, level, int, optname, char *, optval, int, optlen)
> +{
> +   struct sock *sk = bpf_sock->sk;
> +   int ret = 0;
> +   int val;
> +
> +   if (bpf_sock->is_req_sock)
> +   return -EINVAL;
> +
> +   if (level == SOL_SOCKET) {
> +   /* Only some socketops are supported */
> +   val = *((int *)optval);
> +
> +   switch (optname) {
> +   case SO_RCVBUF:
> +   sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +   sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +   break;
> +   case SO_SNDBUF:
> +   sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> +   sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
> +   break;
> +   case SO_MAX_PACING_RATE:
> +   sk->sk_max_pacing_rate = val;
> +   sk->sk_pacing_rate = min(sk->sk_pacing_rate,
> +sk->sk_max_pacing_rate);
> +   break;
> +   case SO_PRIORITY:
> +   sk->sk_priority = val;
> +   break;
> +   case SO_RCVLOWAT:
> +   if (val < 0)
> +   val = INT_MAX;
> +   sk->sk_rcvlowat = val ? : 1;
> +   break;
> +   case SO_MARK:
> +   sk->sk_mark = val;
> +   break;

Isn't the socket lock required when manipulating these fields?  It's
not obvious that the lock is held from every bpf hook point that could
trigger this function...


Re: Leak in ipv6_gso_segment()?

2017-06-02 Thread Craig Gallek
On Fri, Jun 2, 2017 at 2:25 PM, Craig Gallek  wrote:
> On Fri, Jun 2, 2017 at 2:05 PM, David Miller  wrote:
>> From: Ben Hutchings 
>> Date: Wed, 31 May 2017 13:26:02 +0100
>>
>>> If I'm not mistaken, ipv6_gso_segment() now leaks segs if
>>> ip6_find_1stfragopt() fails.  I'm not sure whether the fix would be as
>>> simple as adding a kfree_skb(segs) or whether more complex cleanup is
>>> needed.
>>
>> I think we need to use kfree_skb_list(), like the following.
> I think this is problematic as well.  ipv6_gso_segment could
> previously return errors, in which case the caller uses kfree_skb (ex
> validate_xmit_skb() -> skb_gso_segment -> ...
> callbacks.gso_segment()).  Having the kfree_skb_list here would cause
> a double free if I'm reading this correctly.
>
> My first guess was going to be skb_gso_error_unwind(), but I'm still
> trying to understand that code...
>
> Sorry again for the fallout from this bug fix.  This is my first time
> down this code path and I clearly didn't understand it fully :/

Ok, I take it back.  I believe your kfree_skb_list suggestion is correct.

I was assuming that skb_segment consumed the original skb upon
successful segmentation.  It does not.  This is exactly why
validate_xmit_skb calls consume_skb when segments are returned.
Further, there is at least one existing example of kfree_skb_list in a
similar post-segment cleanup path (esp6_gso_segment).


Re: Leak in ipv6_gso_segment()?

2017-06-02 Thread Craig Gallek
On Fri, Jun 2, 2017 at 2:05 PM, David Miller  wrote:
> From: Ben Hutchings 
> Date: Wed, 31 May 2017 13:26:02 +0100
>
>> If I'm not mistaken, ipv6_gso_segment() now leaks segs if
>> ip6_find_1stfragopt() fails.  I'm not sure whether the fix would be as
>> simple as adding a kfree_skb(segs) or whether more complex cleanup is
>> needed.
>
> I think we need to use kfree_skb_list(), like the following.
I think this is problematic as well.  ipv6_gso_segment could
previously return errors, in which case the caller uses kfree_skb (ex
validate_xmit_skb() -> skb_gso_segment -> ...
callbacks.gso_segment()).  Having the kfree_skb_list here would cause
a double free if I'm reading this correctly.

My first guess was going to be skb_gso_error_unwind(), but I'm still
trying to understand that code...

Sorry again for the fallout from this bug fix.  This is my first time
down this code path and I clearly didn't understand it fully :/

> Can someone please audit and review this for me?  We need to
> get this to -stable.
>
> Thanks.
>
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index 280268f..cdb3728 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -116,8 +116,10 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff 
> *skb,
>
> if (udpfrag) {
> int err = ip6_find_1stfragopt(skb, &prevhdr);
> -   if (err < 0)
> +   if (err < 0) {
> +   kfree_skb_list(segs);
> return ERR_PTR(err);
> +   }
> fptr = (struct frag_hdr *)((u8 *)ipv6h + err);
> fptr->frag_off = htons(offset);
> if (skb->next)


Re: [PATCH net] ipv6: xfrm: Handle errors reported by xfrm6_find_1stfragopt()

2017-05-31 Thread Craig Gallek
On Wed, May 31, 2017 at 8:15 AM, Ben Hutchings  wrote:
> xfrm6_find_1stfragopt() may now return an error code and we must
> not treat it as a length.
>
> Fixes: 2423496af35d ("ipv6: Prevent overrun when parsing v6 header options")
> Signed-off-by: Ben Hutchings 
> ---
> Commits 2423496af35d "ipv6: Prevent overrun when parsing v6 header
> options" and 7dd7eb9513bd "ipv6: Check ip6_find_1stfragopt() return
> value properly." changed ip6_find_1stfragopt() to return negative
> error codes and changed some of its callers to handle this.
>
> However, there is also xfrm6_find_1stfragopt() which is a wrapper for
> ip6_find_1stfragopt() and is called indirectly by xfrm6_ro_output()
> and xfrm6_transport_output().  Neither of them check for errors.
>
> This is totally untested.  I think xfrm_type::hdr_offset deserves a
> comment about its semantics, but I don't understand it well enough to
> write that.
Thank you for finding this, it's a good lesson to not rely solely on cscope :\

I believe this fix is correct and sufficient.  I only found 2 up-stack
callers of this (pktgen_output_ipsec and xfrm_output_one) and they
both ultimately call kfree_skb on error.

However, the fact that this function is used as a function pointer
through xfrm_type.hdr_offset made me look at a couple other functions
that can be stored in this structure.  mip6_destopt_offset and
mip6_rthdr_offset have very similar implementations to the original
ip6_find_1stfragopt and may very well suffer from the same bug I was
trying to fix.  Maybe it doesn't matter since that bug relied on the
user changing the v6 nexthdr field.  I need to understand the mip6
code first...

In any event, I think this patch applies on its own.  Thanks again.

Acked-by: Craig Gallek 


Re: [net:master 9/12] net/ipv6/ip6_offload.c:120:7-21: WARNING: Unsigned expression compared with zero: unfrag_ip6hlen < 0 (fwd)

2017-05-18 Thread Craig Gallek
On Wed, May 17, 2017 at 10:58 PM, David Miller  wrote:
> From: Julia Lawall 
> Date: Thu, 18 May 2017 10:01:07 +0800 (SGT)
>
>> It may be worth checking on these.  The code context is shown in the first
>> case (line 120).  For the others, at least it gives the line numbers.
>  ...
 net/ipv6/ip6_offload.c:120:7-21: WARNING: Unsigned expression compared 
 with zero: unfrag_ip6hlen < 0
>> --
 net/ipv6/ip6_output.c:601:5-9: WARNING: Unsigned expression compared with 
 zero: hlen < 0
>> --
 net/ipv6/udp_offload.c:94:6-20: WARNING: Unsigned expression compared with 
 zero: unfrag_ip6hlen < 0
>
> Sigh, this is worse than the bug the commit introducing these warnings
> was trying to fix.
Thanks for the fix, sorry I missed this.


[PATCH net-next] ipv6: Prevent overrun when parsing v6 header options

2017-05-16 Thread Craig Gallek
From: Craig Gallek 

The KASAN warning reported below was discovered with a syzkaller
program.  The reproducer is basically:
  int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP);
  send(s, &one_byte_of_data, 1, MSG_MORE);
  send(s, &more_than_mtu_bytes_data, 2000, 0);

The socket() call sets the nexthdr field of the v6 header to
NEXTHDR_HOP, the first send call primes the payload with a non zero
byte of data, and the second send call triggers the fragmentation path.

The fragmentation code tries to parse the header options in order
to figure out where to insert the fragment option.  Since nexthdr points
to an invalid option, the calculation of the size of the network header
can be made much larger than the linear section of the skb and data
is read outside of it.

This fix makes ip6_find_1stfragopt() return an error if it detects
running out-of-bounds.

[   42.361487] 
==
[   42.364412] BUG: KASAN: slab-out-of-bounds in ip6_fragment+0x11c8/0x3730
[   42.365471] Read of size 840 at addr 88000969e798 by task 
ip6_fragment-oo/3789
[   42.366469]
[   42.366696] CPU: 1 PID: 3789 Comm: ip6_fragment-oo Not tainted 4.11.0+ #41
[   42.367628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.1-1ubuntu1 04/01/2014
[   42.368824] Call Trace:
[   42.369183]  dump_stack+0xb3/0x10b
[   42.369664]  print_address_description+0x73/0x290
[   42.370325]  kasan_report+0x252/0x370
[   42.370839]  ? ip6_fragment+0x11c8/0x3730
[   42.371396]  check_memory_region+0x13c/0x1a0
[   42.371978]  memcpy+0x23/0x50
[   42.372395]  ip6_fragment+0x11c8/0x3730
[   42.372920]  ? nf_ct_expect_unregister_notifier+0x110/0x110
[   42.373681]  ? ip6_copy_metadata+0x7f0/0x7f0
[   42.374263]  ? ip6_forward+0x2e30/0x2e30
[   42.374803]  ip6_finish_output+0x584/0x990
[   42.375350]  ip6_output+0x1b7/0x690
[   42.375836]  ? ip6_finish_output+0x990/0x990
[   42.376411]  ? ip6_fragment+0x3730/0x3730
[   42.376968]  ip6_local_out+0x95/0x160
[   42.377471]  ip6_send_skb+0xa1/0x330
[   42.377969]  ip6_push_pending_frames+0xb3/0xe0
[   42.378589]  rawv6_sendmsg+0x2051/0x2db0
[   42.379129]  ? rawv6_bind+0x8b0/0x8b0
[   42.379633]  ? _copy_from_user+0x84/0xe0
[   42.380193]  ? debug_check_no_locks_freed+0x290/0x290
[   42.380878]  ? ___sys_sendmsg+0x162/0x930
[   42.381427]  ? rcu_read_lock_sched_held+0xa3/0x120
[   42.382074]  ? sock_has_perm+0x1f6/0x290
[   42.382614]  ? ___sys_sendmsg+0x167/0x930
[   42.383173]  ? lock_downgrade+0x660/0x660
[   42.383727]  inet_sendmsg+0x123/0x500
[   42.384226]  ? inet_sendmsg+0x123/0x500
[   42.384748]  ? inet_recvmsg+0x540/0x540
[   42.385263]  sock_sendmsg+0xca/0x110
[   42.385758]  SYSC_sendto+0x217/0x380
[   42.386249]  ? SYSC_connect+0x310/0x310
[   42.386783]  ? __might_fault+0x110/0x1d0
[   42.387324]  ? lock_downgrade+0x660/0x660
[   42.387880]  ? __fget_light+0xa1/0x1f0
[   42.388403]  ? __fdget+0x18/0x20
[   42.388851]  ? sock_common_setsockopt+0x95/0xd0
[   42.389472]  ? SyS_setsockopt+0x17f/0x260
[   42.390021]  ? entry_SYSCALL_64_fastpath+0x5/0xbe
[   42.390650]  SyS_sendto+0x40/0x50
[   42.391103]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[   42.391731] RIP: 0033:0x7fbbb711e383
[   42.392217] RSP: 002b:74d34f28 EFLAGS: 0246 ORIG_RAX: 
002c
[   42.393235] RAX: ffda RBX:  RCX: 7fbbb711e383
[   42.394195] RDX: 1000 RSI: 74d34f60 RDI: 0003
[   42.395145] RBP: 0046 R08: 74d34f40 R09: 0018
[   42.396056] R10:  R11: 0246 R12: 00400aad
[   42.396598] R13: 0066 R14: 74d34ee0 R15: 7fbbb717af00
[   42.397257]
[   42.397411] Allocated by task 3789:
[   42.397702]  save_stack_trace+0x16/0x20
[   42.398005]  save_stack+0x46/0xd0
[   42.398267]  kasan_kmalloc+0xad/0xe0
[   42.398548]  kasan_slab_alloc+0x12/0x20
[   42.398848]  __kmalloc_node_track_caller+0xcb/0x380
[   42.399224]  __kmalloc_reserve.isra.32+0x41/0xe0
[   42.399654]  __alloc_skb+0xf8/0x580
[   42.43]  sock_wmalloc+0xab/0xf0
[   42.400346]  __ip6_append_data.isra.41+0x2472/0x33d0
[   42.400813]  ip6_append_data+0x1a8/0x2f0
[   42.401122]  rawv6_sendmsg+0x11ee/0x2db0
[   42.401505]  inet_sendmsg+0x123/0x500
[   42.401860]  sock_sendmsg+0xca/0x110
[   42.402209]  ___sys_sendmsg+0x7cb/0x930
[   42.402582]  __sys_sendmsg+0xd9/0x190
[   42.402941]  SyS_sendmsg+0x2d/0x50
[   42.403273]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[   42.403718]
[   42.403871] Freed by task 1794:
[   42.404146]  save_stack_trace+0x16/0x20
[   42.404515]  save_stack+0x46/0xd0
[   42.404827]  kasan_slab_free+0x72/0xc0
[   42.405167]  kfree+0xe8/0x2b0
[   42.405462]  skb_free_head+0x74/0xb0
[   42.405806]  skb_release_data+0x30e/0x3a0
[   42.406198]  skb_release_all+0x4a/0x60
[   42.406563]  consume_skb+0x113/0x2e0
[   42.406910]  skb_free_datagram+0x1a/0xe0
[   42.407288]  netlink_recvmsg+0x60d/0xe40
[   42.407667]  sock_recv
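
An expanded, hedged version of the three-line reproducer at the top of the
commit message; the connected loopback destination and payload bytes are
assumptions (the syzkaller program is not reproduced verbatim), and the raw
socket requires CAP_NET_RAW:

#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_in6 dst = {
                .sin6_family = AF_INET6,
                .sin6_addr   = IN6ADDR_LOOPBACK_INIT,
        };
        char one = 0x11, big[2000];
        int s = socket(AF_INET6, SOCK_RAW, IPPROTO_HOPOPTS); /* NEXTHDR_HOP */

        memset(big, 0x22, sizeof(big));
        connect(s, (struct sockaddr *)&dst, sizeof(dst));
        send(s, &one, 1, MSG_MORE);     /* primes a bogus options header */
        send(s, big, sizeof(big), 0);   /* > MTU, forces ip6_fragment() */
        close(s);
        return 0;
}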

Re: [PATCH] ipv6: Need to export ipv6_push_frag_opts for tunneling now.

2017-05-01 Thread Craig Gallek
On Mon, May 1, 2017 at 3:11 PM, David Miller  wrote:
>
> Since that change also made the nfrag function not necessary
> for exports, remove it.
>
> Fixes: 89a23c8b528b ("ip6_tunnel: Fix missing tunnel encapsulation limit 
> option")
> Signed-off-by: David S. Miller 
Woops, sorry I missed this.  Thanks for the fix!

Acked-by: Craig Gallek 


[PATCH v2 net-next] ip6_tunnel: Fix missing tunnel encapsulation limit option

2017-04-26 Thread Craig Gallek
From: Craig Gallek 

The IPv6 tunneling code tries to insert IPV6_TLV_TNL_ENCAP_LIMIT and
IPV6_TLV_PADN options when an encapsulation limit is defined (the
default is a limit of 4).  An MTU adjustment is done to account for
these options as well.  However, the options are never present in the
generated packets.

The issue appears to be a subtlety between IPV6_DSTOPTS and
IPV6_RTHDRDSTOPTS defined in RFC 3542.  When the IPIP tunnel driver was
written, the encap limit options were included as IPV6_RTHDRDSTOPTS in
dst0opt of struct ipv6_txoptions.  Later, ipv6_push_nfrags_opts was
(correctly) updated to require IPV6_RTHDR options when IPV6_RTHDRDSTOPTS
are to be used.  This caused the options to no longer be included in v6
encapsulated packets.

The fix is to use IPV6_DSTOPTS (in dst1opt of struct ipv6_txoptions)
instead.  IPV6_DSTOPTS do not have the additional IPV6_RTHDR requirement.

Fixes: 1df64a8569c7: ("[IPV6]: Add ip6ip6 tunnel driver.")
Fixes: 333fad5364d6: ("[IPV6]: Support several new sockopt / ancillary data in 
Advanced API (RFC3542)")
Signed-off-by: Craig Gallek 
---

v2: Change tunnel code to use dst1opt rather than making the checks for
dst0opt more permissive.

 net/ipv6/ip6_tunnel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index ad15d38b41e8..c81f9541f1f7 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -954,7 +954,7 @@ static void init_tel_txopt(struct ipv6_tel_txoption *opt, 
__u8 encap_limit)
opt->dst_opt[5] = IPV6_TLV_PADN;
opt->dst_opt[6] = 1;
 
-   opt->ops.dst0opt = (struct ipv6_opt_hdr *) opt->dst_opt;
+   opt->ops.dst1opt = (struct ipv6_opt_hdr *) opt->dst_opt;
opt->ops.opt_nflen = 8;
 }
 
@@ -1176,7 +1176,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
 
if (encap_limit >= 0) {
init_tel_txopt(&opt, encap_limit);
-   ipv6_push_nfrag_opts(skb, &opt.ops, &proto, NULL, NULL);
+   ipv6_push_frag_opts(skb, &opt.ops, &proto);
}
 
/* Calculate max headroom for all the headers and adjust
-- 
2.13.0.rc0.306.g87b477812d-goog
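
For clarity, the full 8-byte destination-options block that init_tel_txopt()
builds; the hunk above only shows the PadN bytes, so the leading bytes below
are recalled from the driver and should be treated as an assumption rather
than a quote of this patch:

static const unsigned char tel_dstopt[8] = {
        0x00,   /* nexthdr, filled in when the header is pushed */
        0x00,   /* hdr ext len: (8 bytes >> 3) - 1 */
        0x04,   /* IPV6_TLV_TNL_ENCAP_LIMIT (RFC 2473 option type 4) */
        0x01,   /* option data length */
        0x04,   /* encap limit value (driver default) */
        0x01,   /* IPV6_TLV_PADN */
        0x01,   /* PadN data length */
        0x00,   /* PadN data byte */
};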



Re: [PATCH net-next] ip6_tunnel: Fix missing tunnel encapsulation limit option

2017-04-26 Thread Craig Gallek
On Wed, Apr 26, 2017 at 1:07 PM, Craig Gallek  wrote:
> From: Craig Gallek 
>
> The IPv6 tunneling code tries to insert IPV6_TLV_TNL_ENCAP_LIMIT and
> IPV6_TLV_PADN options when an encapsulation limit is defined (the
> default is a limit of 4).  An MTU adjustment is done to account for
> these options as well.  However, the options are never present in the
> generated packets.
>
> ipv6_push_nfrag_opts requires that IPV6_RTHDR be present in order to
> include any IPV6_DSTOPTS options.  The v6 tunnel code does not
> use routing options, so the encap limit options are not included.
>
> A brief reading of RFC 3542 section 9.2 (specifically the 4th paragraph)
> makes me believe that this requirement in the kernel is incorrect.
Looking more closely, I think I'm wrong here.  Specifically, the cmsg
parser puts IPV6_RTHDRDSTOPTS in dst0opt and IPV6_DSTOPTS in dst1opt.
The tunnel code is currently building dst0opt and using
ipv6_push_nfrag_opts.  Perhaps it should be building dst1opt and
calling ipv6_push_frag_opts?
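
For reference, the RFC 3542 distinction discussed here is also visible from
userspace.  A minimal sketch (not part of any patch in this thread) that
builds the same 8-byte encapsulation-limit destination-options header the
tunnel code emits and attaches it as a sticky IPV6_DSTOPTS option; the raw
TLV values 4 and 1 mirror the kernel's IPV6_TLV_TNL_ENCAP_LIMIT and
IPV6_TLV_PADN constants, and setting sticky options may require CAP_NET_RAW:

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET6, SOCK_DGRAM, 0);

	/* Destination Options header carrying a tunnel encapsulation
	 * limit of 4, padded out to 8 bytes, like init_tel_txopt():
	 *   [0] next header (filled in by the stack)
	 *   [1] hdr ext len = 0, i.e. (0 + 1) * 8 = 8 bytes total
	 *   [2] option type 4 (tunnel encapsulation limit)
	 *   [3] option len  1
	 *   [4] limit value 4
	 *   [5] option type 1 (PadN)
	 *   [6] option len  1
	 *   [7] one pad byte
	 */
	unsigned char dstopts[8] = { 0, 0, 4, 1, 4, 1, 1, 0 };

	if (fd < 0)
		return 1;
	/* IPV6_DSTOPTS (dst1opt) works without a routing header;
	 * IPV6_RTHDRDSTOPTS (dst0opt) is only emitted when IPV6_RTHDR is
	 * also set, which is exactly the subtlety described above. */
	if (setsockopt(fd, IPPROTO_IPV6, IPV6_DSTOPTS,
		       dstopts, sizeof(dstopts)) < 0)
		perror("IPV6_DSTOPTS");
	return 0;
}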


[PATCH net-next] ip6_tunnel: Fix missing tunnel encapsulation limit option

2017-04-26 Thread Craig Gallek
From: Craig Gallek 

The IPv6 tunneling code tries to insert IPV6_TLV_TNL_ENCAP_LIMIT and
IPV6_TLV_PADN options when an encapsulation limit is defined (the
default is a limit of 4).  An MTU adjustment is done to account for
these options as well.  However, the options are never present in the
generated packets.

ipv6_push_nfrag_opts requires that IPV6_RTHDR be present in order to
include any IPV6_DSTOPTS options.  The v6 tunnel code does not
use routing options, so the encap limit options are not included.

A brief reading of RFC 3542 section 9.2 (specifically the 4th paragraph)
makes me believe that this requirement in the kernel is incorrect.

Fixes: 333fad5364d6: ("[IPV6]: Support several new sockopt / ancillary data in 
Advanced API (RFC3542)")
Signed-off-by: Craig Gallek 
---
 net/ipv6/exthdrs.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index 25192a3b0cd7..224a89e68a42 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -932,15 +932,12 @@ void ipv6_push_nfrag_opts(struct sk_buff *skb, struct 
ipv6_txoptions *opt,
  u8 *proto,
  struct in6_addr **daddr, struct in6_addr *saddr)
 {
-   if (opt->srcrt) {
+   if (opt->srcrt)
ipv6_push_rthdr(skb, proto, opt->srcrt, daddr, saddr);
-   /*
-* IPV6_RTHDRDSTOPTS is ignored
-* unless IPV6_RTHDR is set (RFC3542).
-*/
-   if (opt->dst0opt)
-   ipv6_push_exthdr(skb, proto, NEXTHDR_DEST, 
opt->dst0opt);
-   }
+
+   if (opt->dst0opt)
+   ipv6_push_exthdr(skb, proto, NEXTHDR_DEST, opt->dst0opt);
+
if (opt->hopopt)
ipv6_push_exthdr(skb, proto, NEXTHDR_HOP, opt->hopopt);
 }
-- 
2.13.0.rc0.306.g87b477812d-goog



[PATCH iproute2] gre6: fix copy/paste bugs in GREv6 attribute manipulation

2017-04-21 Thread Craig Gallek
From: Craig Gallek 

Fixes: af89576d7a8c("iproute2: GRE over IPv6 tunnel support.")
Signed-off-by: Craig Gallek 
---
 ip/link_gre6.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index a91f635760fa..1b4fb051b37f 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -355,7 +355,7 @@ get_failed:
addattr_l(n, 1024, IFLA_GRE_TTL, &hop_limit, 1);
addattr_l(n, 1024, IFLA_GRE_ENCAP_LIMIT, &encap_limit, 1);
addattr_l(n, 1024, IFLA_GRE_FLOWINFO, &flowinfo, 4);
-   addattr_l(n, 1024, IFLA_GRE_FLAGS, &flowinfo, 4);
+   addattr32(n, 1024, IFLA_GRE_FLAGS, flags);
 
addattr16(n, 1024, IFLA_GRE_ENCAP_TYPE, encaptype);
addattr16(n, 1024, IFLA_GRE_ENCAP_FLAGS, encapflags);
@@ -383,7 +383,7 @@ static void gre_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
flags = rta_getattr_u32(tb[IFLA_GRE_FLAGS]);
 
if (tb[IFLA_GRE_FLOWINFO])
-   flags = rta_getattr_u32(tb[IFLA_GRE_FLOWINFO]);
+   flowinfo = rta_getattr_u32(tb[IFLA_GRE_FLOWINFO]);
 
if (tb[IFLA_GRE_REMOTE]) {
struct in6_addr addr;
-- 
2.12.2.816.g281164-goog



[PATCH iproute2] iplink: Expose IFLA_*_FWMARK attributes for supported link types

2017-04-21 Thread Craig Gallek
From: Craig Gallek 

This attribute allows the administrator to adjust the packet marking
attribute of tunnels that support policy-based routing.

Signed-off-by: Craig Gallek 
---
 include/linux/if_tunnel.h |  3 +++
 ip/link_gre.c | 16 
 ip/link_gre6.c| 24 +++-
 ip/link_ip6tnl.c  | 23 ++-
 ip/link_iptnl.c   | 16 
 ip/link_vti.c | 16 
 ip/link_vti6.c| 15 +++
 7 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 4f975f5704d8..7375335a0773 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -75,6 +75,7 @@ enum {
IFLA_IPTUN_ENCAP_SPORT,
IFLA_IPTUN_ENCAP_DPORT,
IFLA_IPTUN_COLLECT_METADATA,
+   IFLA_IPTUN_FWMARK,
__IFLA_IPTUN_MAX,
 };
 #define IFLA_IPTUN_MAX (__IFLA_IPTUN_MAX - 1)
@@ -132,6 +133,7 @@ enum {
IFLA_GRE_ENCAP_DPORT,
IFLA_GRE_COLLECT_METADATA,
IFLA_GRE_IGNORE_DF,
+   IFLA_GRE_FWMARK,
__IFLA_GRE_MAX,
 };
 
@@ -147,6 +149,7 @@ enum {
IFLA_VTI_OKEY,
IFLA_VTI_LOCAL,
IFLA_VTI_REMOTE,
+   IFLA_VTI_FWMARK,
__IFLA_VTI_MAX,
 };
 
diff --git a/ip/link_gre.c b/ip/link_gre.c
index 35d437a15562..82df900614bf 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -42,11 +42,13 @@ static void print_usage(FILE *f)
"[ [no]encap-csum ]\n"
"[ [no]encap-csum6 ]\n"
"[ [no]encap-remcsum ]\n"
+   "[ fwmark MARK ]\n"
"\n"
"Where: ADDR := { IP_ADDRESS | any }\n"
"   TOS  := { NUMBER | inherit }\n"
"   TTL  := { 1..255 | inherit }\n"
"   KEY  := { DOTTED_QUAD | NUMBER }\n"
+   "   MARK := { 0x0..0x }\n"
);
 }
 
@@ -91,6 +93,7 @@ static int gre_parse_opt(struct link_util *lu, int argc, char 
**argv,
__u16 encapsport = 0;
__u16 encapdport = 0;
__u8 metadata = 0;
+   __u32 fwmark = 0;
 
if (!(n->nlmsg_flags & NLM_F_CREATE)) {
if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
@@ -160,6 +163,9 @@ get_failed:
 
if (greinfo[IFLA_GRE_COLLECT_METADATA])
metadata = 1;
+
+   if (greinfo[IFLA_GRE_FWMARK])
+   fwmark = rta_getattr_u32(greinfo[IFLA_GRE_FWMARK]);
}
 
while (argc > 0) {
@@ -305,6 +311,10 @@ get_failed:
encapflags |= ~TUNNEL_ENCAP_FLAG_REMCSUM;
} else if (strcmp(*argv, "external") == 0) {
metadata = 1;
+   } else if (strcmp(*argv, "fwmark") == 0) {
+   NEXT_ARG();
+   if (get_u32(&fwmark, *argv, 0))
+   invarg("invalid fwmark\n", *argv);
} else
usage();
argc--; argv++;
@@ -335,6 +345,7 @@ get_failed:
addattr32(n, 1024, IFLA_GRE_LINK, link);
addattr_l(n, 1024, IFLA_GRE_TTL, &ttl, 1);
addattr_l(n, 1024, IFLA_GRE_TOS, &tos, 1);
+   addattr32(n, 1024, IFLA_GRE_FWMARK, fwmark);
} else {
addattr_l(n, 1024, IFLA_GRE_COLLECT_METADATA, NULL, 0);
}
@@ -426,6 +437,11 @@ static void gre_print_direct_opt(FILE *f, struct rtattr 
*tb[])
fputs("icsum ", f);
if (oflags & GRE_CSUM)
fputs("ocsum ", f);
+
+   if (tb[IFLA_GRE_FWMARK] && rta_getattr_u32(tb[IFLA_GRE_FWMARK])) {
+   fprintf(f, "fwmark 0x%x ",
+   rta_getattr_u32(tb[IFLA_GRE_FWMARK]));
+   }
 }
 
 static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 1b4fb051b37f..205bada78054 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -43,6 +43,7 @@ static void print_usage(FILE *f)
"  [ tclass TCLASS ]\n"
"  [ flowlabel FLOWLABEL ]\n"
"  [ dscp inherit ]\n"
+   "  [ fwmark MARK ]\n"
"  [ dev PHYS_DEV ]\n"
"  [ noencap ]\n"
"  [ encap {

[PATCH net-next 1/2] ip6_tunnel: Allow policy-based routing through tunnels

2017-04-19 Thread Craig Gallek
From: Craig Gallek 

This feature allows the administrator to set an fwmark for
packets traversing a tunnel.  This allows the use of independent
routing tables for tunneled packets without the use of iptables.

Signed-off-by: Craig Gallek 
---
 include/net/ip6_tunnel.h   |  2 ++
 include/uapi/linux/if_tunnel.h |  3 +++
 net/ipv6/ip6_gre.c | 14 +-
 net/ipv6/ip6_tunnel.c  | 15 ++-
 net/ipv6/ip6_vti.c | 10 +-
 5 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 1b1cf33cbfb0..08fbc7f7d8d7 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -33,6 +33,8 @@ struct __ip6_tnl_parm {
__be16  o_flags;
__be32  i_key;
__be32  o_key;
+
+   __u32   fwmark;
 };
 
 /* IPv6 tunnel */
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 92f3c8677523..6792d1967d31 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -75,6 +75,7 @@ enum {
IFLA_IPTUN_ENCAP_SPORT,
IFLA_IPTUN_ENCAP_DPORT,
IFLA_IPTUN_COLLECT_METADATA,
+   IFLA_IPTUN_FWMARK,
__IFLA_IPTUN_MAX,
 };
 #define IFLA_IPTUN_MAX (__IFLA_IPTUN_MAX - 1)
@@ -132,6 +133,7 @@ enum {
IFLA_GRE_ENCAP_DPORT,
IFLA_GRE_COLLECT_METADATA,
IFLA_GRE_IGNORE_DF,
+   IFLA_GRE_FWMARK,
__IFLA_GRE_MAX,
 };
 
@@ -147,6 +149,7 @@ enum {
IFLA_VTI_OKEY,
IFLA_VTI_LOCAL,
IFLA_VTI_REMOTE,
+   IFLA_VTI_FWMARK,
__IFLA_VTI_MAX,
 };
 
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 6fcb7cb49bb2..8d128ba79b66 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -544,6 +544,8 @@ static inline int ip6gre_xmit_ipv4(struct sk_buff *skb, 
struct net_device *dev)
  & IPV6_TCLASS_MASK;
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK)
fl6.flowi6_mark = skb->mark;
+   else
+   fl6.flowi6_mark = t->parms.fwmark;
 
fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL);
 
@@ -603,6 +605,8 @@ static inline int ip6gre_xmit_ipv6(struct sk_buff *skb, 
struct net_device *dev)
fl6.flowlabel |= ip6_flowlabel(ipv6h);
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK)
fl6.flowi6_mark = skb->mark;
+   else
+   fl6.flowi6_mark = t->parms.fwmark;
 
fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL);
 
@@ -780,6 +784,7 @@ static int ip6gre_tnl_change(struct ip6_tnl *t,
t->parms.o_key = p->o_key;
t->parms.i_flags = p->i_flags;
t->parms.o_flags = p->o_flags;
+   t->parms.fwmark = p->fwmark;
dst_cache_reset(&t->dst_cache);
ip6gre_tnl_link_config(t, set_mtu);
return 0;
@@ -1249,6 +1254,9 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
 
if (data[IFLA_GRE_FLAGS])
parms->flags = nla_get_u32(data[IFLA_GRE_FLAGS]);
+
+   if (data[IFLA_GRE_FWMARK])
+   parms->fwmark = nla_get_u32(data[IFLA_GRE_FWMARK]);
 }
 
 static int ip6gre_tap_init(struct net_device *dev)
@@ -1470,6 +1478,8 @@ static size_t ip6gre_get_size(const struct net_device 
*dev)
nla_total_size(2) +
/* IFLA_GRE_ENCAP_DPORT */
nla_total_size(2) +
+   /* IFLA_GRE_FWMARK */
+   nla_total_size(4) +
0;
 }
 
@@ -1490,7 +1500,8 @@ static int ip6gre_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
nla_put_u8(skb, IFLA_GRE_TTL, p->hop_limit) ||
nla_put_u8(skb, IFLA_GRE_ENCAP_LIMIT, p->encap_limit) ||
nla_put_be32(skb, IFLA_GRE_FLOWINFO, p->flowinfo) ||
-   nla_put_u32(skb, IFLA_GRE_FLAGS, p->flags))
+   nla_put_u32(skb, IFLA_GRE_FLAGS, p->flags) ||
+   nla_put_u32(skb, IFLA_GRE_FWMARK, p->fwmark))
goto nla_put_failure;
 
if (nla_put_u16(skb, IFLA_GRE_ENCAP_TYPE,
@@ -1525,6 +1536,7 @@ static const struct nla_policy ip6gre_policy[IFLA_GRE_MAX 
+ 1] = {
[IFLA_GRE_ENCAP_FLAGS]  = { .type = NLA_U16 },
[IFLA_GRE_ENCAP_SPORT]  = { .type = NLA_U16 },
[IFLA_GRE_ENCAP_DPORT]  = { .type = NLA_U16 },
+   [IFLA_GRE_FWMARK]   = { .type = NLA_U32 },
 };
 
 static struct rtnl_link_ops ip6gre_link_ops __read_mostly = {
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 75fac933c209..ad15d38b41e8 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1256,6 +1256,8 @@ ip4ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev)
 & IPV6_TCLASS_MASK;
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK)
fl6.flowi6_

[PATCH net-next 2/2] ip_tunnel: Allow policy-based routing through tunnels

2017-04-19 Thread Craig Gallek
From: Craig Gallek 

This feature allows the administrator to set an fwmark for
packets traversing a tunnel.  This allows the use of independent
routing tables for tunneled packets without the use of iptables.

There is no concept of per-packet routing decisions through IPv4
tunnels, so this implementation does not need to work with
per-packet route lookups as the v6 implementation may
(with IP6_TNL_F_USE_ORIG_FWMARK).

Further, since the v4 tunnel ioctls share datastructures
(which can not be trivially modified) with the kernel's internal
tunnel configuration structures, the mark attribute must be stored
in the tunnel structure itself and passed as a parameter when
creating or changing tunnel attributes.

Signed-off-by: Craig Gallek 
---
 include/net/ip_tunnels.h |  5 +++--
 net/ipv4/ip_gre.c| 24 +---
 net/ipv4/ip_tunnel.c | 27 +--
 net/ipv4/ip_vti.c| 20 +++-
 net/ipv4/ipip.c  | 24 +---
 net/ipv6/sit.c   | 37 -
 6 files changed, 93 insertions(+), 44 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 95056796657c..520809912f03 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -132,6 +132,7 @@ struct ip_tunnel {
unsigned intprl_count;  /* # of entries in PRL */
unsigned intip_tnl_net_id;
struct gro_cellsgro_cells;
+   __u32   fwmark;
boolcollect_md;
boolignore_df;
 };
@@ -273,9 +274,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff 
*skb,
  const struct tnl_ptk_info *tpi, struct metadata_dst *tun_dst,
  bool log_ecn_error);
 int ip_tunnel_changelink(struct net_device *dev, struct nlattr *tb[],
-struct ip_tunnel_parm *p);
+struct ip_tunnel_parm *p, __u32 fwmark);
 int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[],
- struct ip_tunnel_parm *p);
+ struct ip_tunnel_parm *p, __u32 fwmark);
 void ip_tunnel_setup(struct net_device *dev, unsigned int net_id);
 
 struct ip_tunnel_encap_ops {
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index c9c1cb635d9a..e90c80a548ad 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -829,7 +829,8 @@ static int ipgre_tap_validate(struct nlattr *tb[], struct 
nlattr *data[])
 static int ipgre_netlink_parms(struct net_device *dev,
struct nlattr *data[],
struct nlattr *tb[],
-   struct ip_tunnel_parm *parms)
+   struct ip_tunnel_parm *parms,
+   __u32 *fwmark)
 {
struct ip_tunnel *t = netdev_priv(dev);
 
@@ -886,6 +887,9 @@ static int ipgre_netlink_parms(struct net_device *dev,
t->ignore_df = !!nla_get_u8(data[IFLA_GRE_IGNORE_DF]);
}
 
+   if (data[IFLA_GRE_FWMARK])
+   *fwmark = nla_get_u32(data[IFLA_GRE_FWMARK]);
+
return 0;
 }
 
@@ -957,6 +961,7 @@ static int ipgre_newlink(struct net *src_net, struct 
net_device *dev,
 {
struct ip_tunnel_parm p;
struct ip_tunnel_encap ipencap;
+   __u32 fwmark = 0;
int err;
 
if (ipgre_netlink_encap_parms(data, &ipencap)) {
@@ -967,31 +972,32 @@ static int ipgre_newlink(struct net *src_net, struct 
net_device *dev,
return err;
}
 
-   err = ipgre_netlink_parms(dev, data, tb, &p);
+   err = ipgre_netlink_parms(dev, data, tb, &p, &fwmark);
if (err < 0)
return err;
-   return ip_tunnel_newlink(dev, tb, &p);
+   return ip_tunnel_newlink(dev, tb, &p, fwmark);
 }
 
 static int ipgre_changelink(struct net_device *dev, struct nlattr *tb[],
struct nlattr *data[])
 {
+   struct ip_tunnel *t = netdev_priv(dev);
struct ip_tunnel_parm p;
struct ip_tunnel_encap ipencap;
+   __u32 fwmark = t->fwmark;
int err;
 
if (ipgre_netlink_encap_parms(data, &ipencap)) {
-   struct ip_tunnel *t = netdev_priv(dev);
err = ip_tunnel_encap_setup(t, &ipencap);
 
if (err < 0)
return err;
}
 
-   err = ipgre_netlink_parms(dev, data, tb, &p);
+   err = ipgre_netlink_parms(dev, data, tb, &p, &fwmark);
if (err < 0)
return err;
-   return ip_tunnel_changelink(dev, tb, &p);
+   return ip_tunnel_changelink(dev, tb, &p, fwmark);
 }
 
 static size_t ipgre_get_size(const struct net_device *dev)
@@ -1029,6 +1035,8 @@ static size_t ipgre_get_size(const struct net_device *dev)
nla_total_size(0) +
 

[PATCH net-next 0/2] ip_tunnel: Allow policy-based routing through tunnels

2017-04-19 Thread Craig Gallek
From: Craig Gallek 

iproute2 changes to follow.  Example usage:
  ip link add gre-test type gre local 10.0.0.1 remote 10.0.0.2 fwmark 0x4
  ip -detail link show gre-test
  ...
  ip link set gre-test type gre fwmark 0


Craig Gallek (2):
  ip6_tunnel: Allow policy-based routing through tunnels
  ip_tunnel: Allow policy-based routing through tunnels

 include/net/ip6_tunnel.h   |  2 ++
 include/net/ip_tunnels.h   |  5 +++--
 include/uapi/linux/if_tunnel.h |  3 +++
 net/ipv4/ip_gre.c  | 24 +---
 net/ipv4/ip_tunnel.c   | 27 +--
 net/ipv4/ip_vti.c  | 20 +++-
 net/ipv4/ipip.c| 24 +---
 net/ipv6/ip6_gre.c | 14 +-
 net/ipv6/ip6_tunnel.c  | 15 ++-
 net/ipv6/ip6_vti.c | 10 +-
 net/ipv6/sit.c | 37 -
 11 files changed, 134 insertions(+), 47 deletions(-)

-- 
2.12.2.816.g281164-goog



Re: [PATCH] soreuseport: use "unsigned int" in __reuseport_alloc()

2017-04-03 Thread Craig Gallek
On Sun, Apr 2, 2017 at 6:18 PM, Alexey Dobriyan  wrote:
> Number of sockets is limited by 16-bit, so 64-bit allocation will never
> happen.
>
> 16-bit ops are the worst code density-wise on x86_64 because of
> additional prefix (66).
So this boils down to a compiled code density vs a
readability/maintainability argument?  I'm not familiar with the 16
bit problem you're referring to, but I'd argue that using the
self-documenting u16 as an input parameter to define the range
expectations is more useful than the micro optimization that this
change may buy you in the assembly of one platform.  Especially given
that this is a rare-use function.

>
> Space savings:
>
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-3 (-3)
> function old new   delta
> reuseport_add_sock   539 536  -3
>
> Signed-off-by: Alexey Dobriyan 
> ---
>
>  net/core/sock_reuseport.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- a/net/core/sock_reuseport.c
> +++ b/net/core/sock_reuseport.c
> @@ -13,9 +13,9 @@
>
>  static DEFINE_SPINLOCK(reuseport_lock);
>
> -static struct sock_reuseport *__reuseport_alloc(u16 max_socks)
> +static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
>  {
> -   size_t size = sizeof(struct sock_reuseport) +
> +   unsigned int size = sizeof(struct sock_reuseport) +
>   sizeof(struct sock *) * max_socks;
> struct sock_reuseport *reuse = kzalloc(size, GFP_ATOMIC);
>


Re: [PATCH 3/5] net/packet: fix overflow in check for tp_frame_nr

2017-03-29 Thread Craig Gallek
On Tue, Mar 28, 2017 at 1:19 PM, Andrey Konovalov  wrote:
> On Tue, Mar 28, 2017 at 5:54 PM, Craig Gallek  wrote:
>> On Tue, Mar 28, 2017 at 10:00 AM, Andrey Konovalov
>>  wrote:
>>> When calculating rb->frames_per_block * req->tp_block_nr the result
>>> can overflow.
>>>
>>> Add a check that tp_block_size * tp_block_nr <= UINT_MAX.
>>>
>>> Since frames_per_block <= tp_block_size, the expression would
>>> never overflow.
>>>
>>> Signed-off-by: Andrey Konovalov 
>>> ---
>>>  net/packet/af_packet.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>>> index 506348abdf2f..c5c43fff8c01 100644
>>> --- a/net/packet/af_packet.c
>>> +++ b/net/packet/af_packet.c
>>> @@ -4197,6 +4197,9 @@ static int packet_set_ring(struct sock *sk, union 
>>> tpacket_req_u *req_u,
>>> goto out;
>>> if (unlikely(req->tp_frame_size == 0))
>>> goto out;
>>> +   if (unlikely((u64)req->tp_block_size * req->tp_block_nr >
>>> +   UINT_MAX))
>>> +   goto out;
>> So this may be pedantic, but really the only guarantee that you have
>> for the 'unsigned int' type of these fields is that they are _at
>> least_ 16 bits.  There is no guarantee on the upper bound size, so
>> casting to a u64 will be problematic on a compiler that happens to use
>> 64 bits for 'unsigned int'.  I'm not aware of any that use greater
>> than 32 bits right now and using one that does may very well break
>> other things in the kernel, but here we are...  Perhaps an alternative
>> fix would be to do the multiplication into an 'unsigned int' type and
>> ensure that the result is larger than each of the original two values?
>
> I don't mind changing the check, but I've never encountered such compilers.
>
> Would this alternative work? It doesn't seem obvious.
>
> Other alternatives that I see for this check are:
>
> 1. req->tp_block_size > UINT_MAX / req->tp_block_nr
>
> 2. (req->tp_block_size * req->tp_block_nr) / req->tp_block_nr !=
> req->tp_block_size
>
> I'm not sure which one is better.
I'm by no means the style expert here, but I would go with whichever
makes the intention of the check (preventing overflow) most obvious.
Maybe #1 in your example?  I'm also not sure what the acceptable
assumptions about the size of 'int' are in the kernel code.  I'm sure
there's a thread out there with Linus expressing a strong feeling one
way or another, but I haven't found it yet ;)

>
>>
>> The real issue is that explicit size types should have been used in
>> this userspace structure.
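
For concreteness, a small sketch of the overflow-safe form discussed above
(Andrey's option #1).  The helper name is made up for illustration; the point
is that dividing avoids the overflow entirely and makes no assumption about
how wide 'unsigned int' is on the target:

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Would tp_block_size * tp_block_nr fit in an unsigned int? */
static bool ring_size_fits(unsigned int tp_block_size, unsigned int tp_block_nr)
{
	if (tp_block_nr == 0)
		return true;
	return tp_block_size <= UINT_MAX / tp_block_nr;
}

int main(void)
{
	printf("%d\n", ring_size_fits(1U << 20, 4095)); /* 1 */
	printf("%d\n", ring_size_fits(1U << 20, 4097)); /* 0 with a 32-bit unsigned int */
	return 0;
}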


Re: [PATCH 3/5] net/packet: fix overflow in check for tp_frame_nr

2017-03-28 Thread Craig Gallek
On Tue, Mar 28, 2017 at 10:00 AM, Andrey Konovalov
 wrote:
> When calculating rb->frames_per_block * req->tp_block_nr the result
> can overflow.
>
> Add a check that tp_block_size * tp_block_nr <= UINT_MAX.
>
> Since frames_per_block <= tp_block_size, the expression would
> never overflow.
>
> Signed-off-by: Andrey Konovalov 
> ---
>  net/packet/af_packet.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 506348abdf2f..c5c43fff8c01 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -4197,6 +4197,9 @@ static int packet_set_ring(struct sock *sk, union 
> tpacket_req_u *req_u,
> goto out;
> if (unlikely(req->tp_frame_size == 0))
> goto out;
> +   if (unlikely((u64)req->tp_block_size * req->tp_block_nr >
> +   UINT_MAX))
> +   goto out;
So this may be pedantic, but really the only guarantee that you have
for the 'unsigned int' type of these fields is that they are _at
least_ 16 bits.  There is no guarantee on the upper bound size, so
casting to a u64 will be problematic on a compiler that happens to use
64 bits for 'unsigned int'.  I'm not aware of any that use greater
than 32 bits right now and using one that does may very well break
other things in the kernel, but here we are...  Perhaps an alternative
fix would be to do the multiplication into an 'unsigned int' type and
ensure that the result is larger than each of the original two values?

The real issue is that explicit size types should have been used in
this userspace structure.


Re: [PATCH 1/6 net-next] inet: collapse ipv4/v6 rcv_saddr_equal functions into one

2017-01-12 Thread Craig Gallek
On Wed, Jan 11, 2017 at 3:19 PM, Josef Bacik  wrote:
> +int inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
> +bool match_wildcard)
> +{
> +#if IS_ENABLED(CONFIG_IPV6)
> +   if (sk->sk_family == AF_INET6)
Still wrapping my head around this, so take it with a grain of salt,
but it's not obvious to me that sk and sk2 are guaranteed to have the
same family here (or if it even matters).  Especially in the context
of the next patch which removes the bind_conflict callback...  Does
this need to be an OR check for the family of either socket?  Or is it
safe as-is because the first socket passed to this function is always
the existing one and the second one is the possible conflict socket?


Re: [PATCH 5/5 net-next] inet: reset tb->fastreuseport when adding a reuseport sk

2016-12-21 Thread Craig Gallek
On Tue, Dec 20, 2016 at 3:07 PM, Josef Bacik  wrote:
> If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 
> and
> never set it again.  Which means that in the future if we end up adding a 
> bunch
> of reuseport sk's to that tb we'll have to do the expensive scan every time.
> Instead add a sock_common to the tb so we know what reuseport sk succeeded 
> last.
> Once one sk has made it onto the list we know that there are no potential bind
> conflicts on the owners list that match that sk's rcv_addr.  So copy the sk's
> common into our tb->fastsock and set tb->fastreuseport to FASTREUSEPORT_STRICT 
> so
> we know we have to do an extra check for subsequent reuseport sockets and skip
> the expensive bind conflict check.
>
> Signed-off-by: Josef Bacik 
> ---
>  include/net/inet_hashtables.h   |  4 
>  net/ipv4/inet_connection_sock.c | 53 
> +
>  2 files changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 50f635c..b776401 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -74,12 +74,16 @@ struct inet_ehash_bucket {
>   * users logged onto your box, isn't it nice to know that new data
>   * ports are created in O(1) time?  I thought so. ;-)  -DaveM
>   */
> +#define FASTREUSEPORT_ANY  1
> +#define FASTREUSEPORT_STRICT   2
> +
>  struct inet_bind_bucket {
> possible_net_t  ib_net;
> unsigned short  port;
> signed char fastreuse;
> signed char fastreuseport;
> kuid_t  fastuid;
> +   struct sock_common  fastsock;
> int num_owners;
> struct hlist_node   node;
> struct hlist_head   owners;
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index d3ccf62..9e29fad 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -164,6 +164,32 @@ success:
> return head;
>  }
>
> +static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
> +struct sock *sk)
> +{
> +   struct sock *sk2 = (struct sock *)&tb->fastsock;
> +   kuid_t uid = sock_i_uid(sk);
> +
> +   if (tb->fastreuseport <= 0)
> +   return 0;
> +   if (!sk->sk_reuseport)
> +   return 0;
> +   if (rcu_access_pointer(sk->sk_reuseport_cb))
> +   return 0;
> +   if (!uid_eq(tb->fastuid, uid))
> +   return 0;
> +   /* We only need to check the rcv_saddr if this tb was once marked
> +* without fastreuseport and then was reset, as we can only know that
> +* the fastsock has no potential bind conflicts with the rest of the
> +* possible socks on the owners list.
> +*/
> +   if (tb->fastreuseport == FASTREUSEPORT_ANY)
> +   return 1;
> +   if (!inet_csk(sk)->icsk_af_ops->rcv_saddr_equal(sk, sk2, true))
The rcv_saddr_equal functions assume the type of the sk to be
inet_sock.  It doesn't look like they actually depend on any of the
inet-specific fields, but it's probably safer to either remove this
assumption or change the type of tb->fastsock to be a full inet_sock.

> +   return 0;
> +   return 1;
> +}
> +
>  /* Obtain a reference to a local port for the given sock,
>   * if snum is zero it means select any available local port.
>   * We try to allocate an odd port (and leave even ports for connect())
> @@ -206,9 +232,7 @@ tb_found:
> goto success;
>
> if ((tb->fastreuse > 0 && reuse) ||
> -(tb->fastreuseport > 0 &&
> - !rcu_access_pointer(sk->sk_reuseport_cb) &&
> - sk->sk_reuseport && uid_eq(tb->fastuid, uid)))
> +   sk_reuseport_match(tb, sk))
> goto success;
> if (inet_csk_bind_conflict(sk, tb, true, true))
> goto fail_unlock;
> @@ -220,14 +244,35 @@ success:
> if (sk->sk_reuseport) {
> tb->fastreuseport = 1;
> tb->fastuid = uid;
> +   memcpy(&tb->fastsock, &sk->__sk_common,
> +  sizeof(struct sock_common));
> } else {
> tb->fastreuseport = 0;
> }
> } else {
> if (!reuse)
> tb->fastreuse = 0;
> -   if (!sk->sk_reuseport || !uid_eq(tb->fastuid, uid))
> +   if (sk->sk_reuseport) {
> +   /* We didn't match or we don't have fastreuseport set 
> on
> +* the tb, but we have sk_reuseport set on this socket
> +* and we know that there are no bind conflicts with
> +* this socket in th

Re: Soft lockup in inet_put_port on 4.6

2016-12-15 Thread Craig Gallek
On Thu, Dec 15, 2016 at 5:39 PM, Tom Herbert  wrote:
> On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik  wrote:
>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert  wrote:
>>>
>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>> wrote:
>>>>
>>>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>>> wrote:
>>>>>
>>>>>  I think there may be some suspicious code in inet_csk_get_port. At
>>>>>  tb_found there is:
>>>>>
>>>>>  if (((tb->fastreuse > 0 && reuse) ||
>>>>>   (tb->fastreuseport > 0 &&
>>>>>!rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>>>>>  smallest_size == -1)
>>>>>  goto success;
>>>>>  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb,
>>>>> true)) {
>>>>>  if ((reuse ||
>>>>>   (tb->fastreuseport > 0 &&
>>>>>sk->sk_reuseport &&
>>>>>!rcu_access_pointer(sk->sk_reuseport_cb)
>>>>> &&
>>>>>uid_eq(tb->fastuid, uid))) &&
>>>>>  smallest_size != -1 && --attempts >= 0) {
>>>>>  spin_unlock_bh(&head->lock);
>>>>>  goto again;
>>>>>  }
>>>>>  goto fail_unlock;
>>>>>  }
>>>>>
>>>>>  AFAICT there is redundancy in these two conditionals.  The same clause
>>>>>  is being checked in both: (tb->fastreuseport > 0 &&
>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>>  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
>>>>>  first conditional should be hit, goto done,  and the second will never
>>>>>  evaluate that part to true-- unless the sk is changed (do we need
>>>>>  READ_ONCE for sk->sk_reuseport_cb?).
>>>>
>>>>  That's an interesting point... It looks like this function also
>>>>  changed in 4.6 from using a single local_bh_disable() at the beginning
>>>>  with several spin_lock(&head->lock) to exclusively
>>>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
>>>>  disable variant was preventing the timers in your stack trace from
>>>>  running interleaved with this function before?
>>>
>>>
>>> Could be, although dropping the lock shouldn't be able to affect the
>>> search state. TBH, I'm a little lost in reading function, the
>>> SO_REUSEPORT handling is pretty complicated. For instance,
>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
>>> function and also in every call to inet_csk_bind_conflict. I wonder if
>>> we can simply this under the assumption that SO_REUSEPORT is only
>>> allowed if the port number (snum) is explicitly specified.
>>
>>
>> Ok first I have data for you Hannes, here's the time distributions before
>> during and after the lockup (with all the debugging in place the box
>> eventually recovers).  I've attached it as a text file since it is long.
>>
>> Second is I was thinking about why we would spend so much time doing the
>> ->owners list, and obviously it's because of the massive amount of timewait
>> sockets on the owners list.  I wrote the following dumb patch and tested it
>> and the problem has disappeared completely.  Now I don't know if this is
>> right at all, but I thought it was weird we weren't copying the soreuseport
>> option from the original socket onto the twsk.  Is there a reason we
>> aren't doing this currently?  Does this help explain what is happening?
>> Thanks,
>>
> I think that would explain it. We would be walking long lists of TW
> sockets in inet_bind_bucket_for_each(tb, &head->chain). This should
> break, although now I'm wondering if there's other ways we can get
> into this situation. reuseport ensures that we can have long lists of
> sockets in a single bucket, TW sockets can make that list really long.
What if the time-wait timer implementation was changed to do more
opportunistic removals?  In this case, you seem to have a coordinated
timer event causing many independent locking events on the bucket in
question.  If one of those firing events realized it could handle all
of them, you could greatly reduce the contention.  The fact that they
all hash to the same bucket may make this even easier...

> Tom
>
>> Josef


Re: [PATCH net-next 2/2] inet: Fix get port to handle zero port number with soreuseport set

2016-12-15 Thread Craig Gallek
On Wed, Dec 14, 2016 at 7:54 PM, Tom Herbert  wrote:
> A user may call listen without binding an explicit port with the intent
> that the kernel will assign an available port to the socket. In this
> case inet_csk_get_port does a port scan. For such sockets, the user may
> also set soreuseport with the intent of creating more sockets for the
> port that is selected. The problem is that the initial socket being
> opened could inadvertently choose an existing and unrelated port
> number that was already created with soreuseport.
Good catch!  I think this problem may also exist in the UDP path?
(udp_lib_get_port -> udp_lib_lport_inuse[2])


Re: Soft lockup in inet_put_port on 4.6

2016-12-13 Thread Craig Gallek
On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert  wrote:
> I think there may be some suspicious code in inet_csk_get_port. At
> tb_found there is:
>
> if (((tb->fastreuse > 0 && reuse) ||
>  (tb->fastreuseport > 0 &&
>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>   sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
> smallest_size == -1)
> goto success;
> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
> if ((reuse ||
>  (tb->fastreuseport > 0 &&
>   sk->sk_reuseport &&
>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>   uid_eq(tb->fastuid, uid))) &&
> smallest_size != -1 && --attempts >= 0) {
> spin_unlock_bh(&head->lock);
> goto again;
> }
> goto fail_unlock;
> }
>
> AFAICT there is redundancy in these two conditionals.  The same clause
> is being checked in both: (tb->fastreuseport > 0 &&
> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
> first conditional should be hit, goto done,  and the second will never
> evaluate that part to true-- unless the sk is changed (do we need
> READ_ONCE for sk->sk_reuseport_cb?).
That's an interesting point... It looks like this function also
changed in 4.6 from using a single local_bh_disable() at the beginning
with several spin_lock(&head->lock) to exclusively
spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
disable variant was preventing the timers in your stack trace from
running interleaved with this function before?


[PATCH net] inet: Fix missing return value in inet6_hash

2016-10-25 Thread Craig Gallek
From: Craig Gallek 

As part of a series to implement faster SO_REUSEPORT lookups,
commit 086c653f5862 ("sock: struct proto hash function may error")
added return values to protocol hash functions and
commit 496611d7b5ea ("inet: create IPv6-equivalent inet_hash function")
implemented a new hash function for IPv6.  However, the latter does
not respect the former's convention.

This properly propagates the hash errors in the IPv6 case.

Fixes: 496611d7b5ea ("inet: create IPv6-equivalent inet_hash function")
Reported-by: Soheil Hassas Yeganeh 
Signed-off-by: Craig Gallek 
---
 net/ipv6/inet6_hashtables.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 2fd0374a35b1..02761c9fe43e 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -264,13 +264,15 @@ EXPORT_SYMBOL_GPL(inet6_hash_connect);
 
 int inet6_hash(struct sock *sk)
 {
+   int err = 0;
+
if (sk->sk_state != TCP_CLOSE) {
local_bh_disable();
-   __inet_hash(sk, NULL, ipv6_rcv_saddr_equal);
+   err = __inet_hash(sk, NULL, ipv6_rcv_saddr_equal);
local_bh_enable();
}
 
-   return 0;
+   return err;
 }
 EXPORT_SYMBOL_GPL(inet6_hash);
 
-- 
2.8.0.rc3.226.g39d4020



Re: [RFC PATCH v2] net: sched: convert qdisc linked list to hashtable

2016-07-07 Thread Craig Gallek
On Thu, Jul 7, 2016 at 4:36 PM, Jiri Kosina  wrote:
> From: Jiri Kosina 
>
> Convert the per-device linked list into a hashtable. The primary
> motivation for this change is that currently, we're not tracking all the
> qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
> performed over the linked list by qdisc_match_from_root() is rather
> expensive.
>
> The ultimate goal is to get rid of hidden qdiscs completely, which will
> bring much more determinism in user experience.
>
> As we're adding hashtable.h include into generic netdevice.h, we have to make
> sure HASH_SIZE macro is now non-conflicting with local definitions.
>
> Signed-off-by: Jiri Kosina 
> ---
>
> v1 -> v2:   fix up RCU hashtable usage wrt. rtnl
> fix compilation of .c files which define their own
>  HASH_SIZE that now conflicts with the one from
>  hashtable.h (newly included via netdevice.h)
This sort of seems like it's just side-stepping the problem.  Given
that the size of this hash table is fixed, the lookup time of this
operation is still going to approach linear as the number of qdiscs
increases.  I took a quick pass at trying to use rhashtable for this
purpose a few weeks ago but dropped it when I realized many of the
table operations (which can trigger resize events) need to happen
while holding the rtnl lock.  I still think it would be possible to
use a dynamically sizable datastructure for this purpose, but it will
be a fair amount of work to change the current locking semantics to
make it work...


[PATCH net-next] tun: Don't assume type tun in tun_device_event

2016-07-06 Thread Craig Gallek
From: Craig Gallek 

The referenced change added a netdevice notifier for processing
device queue size events.  These events are fired for all devices
but the registered callback assumed they only occurred for tun
devices.  This fix adds a check (borrowed from macvtap.c) to discard
non-tun device events.

For reference, this fixes the following splat:
[   71.505935] BUG: unable to handle kernel NULL pointer dereference at 
0010
[   71.513870] IP: [] tun_device_event+0x110/0x340
[   71.519906] PGD 3f41f56067 PUD 3f264b7067 PMD 0
[   71.524497] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[   71.529374] gsmi: Log Shutdown Reason 0x03
[   71.533417] Modules linked in:[   71.533826] mlx4_en: eth1: Link Up

[   71.539616]  bonding w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en 
ib_uverbs mlx4_ib ib_core mlx4_core
[   71.549282] CPU: 12 PID: 7915 Comm: set.ixion-haswe Not tainted 
4.7.0-dbx-DEV #8
[   71.556586] Hardware name: Intel Grantley,Wellsburg/Ixion_IT_15, BIOS 2.58.0 
05/03/2016
[   71.564495] task: 887f00bb20c0 ti: 887f00798000 task.ti: 
887f00798000
[   71.571894] RIP: 0010:[]  [] 
tun_device_event+0x110/0x340
[   71.580327] RSP: 0018:887f0079bbd8  EFLAGS: 00010202
[   71.585576] RAX: fae8 RBX: 887ef6d03378 RCX: 
[   71.592624] RDX:  RSI: 0028 RDI: 
[   71.599675] RBP: 887f0079bc48 R08:  R09: 0001
[   71.606730] R10: 0004 R11:  R12: 0010
[   71.613780] R13:  R14: 0001 R15: 887f0079bd00
[   71.620832] FS:  7f5cdc581700() GS:883f7f70() 
knlGS:
[   71.628826] CS:  0010 DS:  ES:  CR0: 80050033
[   71.634500] CR2: 0010 CR3: 003f3eb62000 CR4: 001406e0
[   71.641549] Stack:
[   71.643533]  887f0079bc08 0246 001e 
887ef6d0
[   71.650871]  887f0079bd00   

[   71.658210]  887f0079bc48 81d24070 fff9 
81cec7a0
[   71.665549] Call Trace:
[   71.667975]  [] notifier_call_chain+0x5d/0x80
[   71.673823]  [] ? show_tx_maxrate+0x30/0x30
[   71.679502]  [] __raw_notifier_call_chain+0xe/0x10
[   71.685778]  [] raw_notifier_call_chain+0x16/0x20
[   71.691976]  [] call_netdevice_notifiers_info+0x40/0x70
[   71.698681]  [] call_netdevice_notifiers+0x16/0x20
[   71.704956]  [] change_tx_queue_len+0x66/0x90
[   71.710807]  [] netdev_store.isra.5+0xbf/0xd0
[   71.716658]  [] tx_queue_len_store+0x50/0x60
[   71.722431]  [] dev_attr_store+0x18/0x30
[   71.727857]  [] sysfs_kf_write+0x4f/0x70
[   71.733274]  [] kernfs_fop_write+0x147/0x1d0
[   71.739045]  [] ? rcu_read_lock_sched_held+0x8f/0xa0
[   71.745499]  [] __vfs_write+0x28/0x120
[   71.750748]  [] ? percpu_down_read+0x57/0x90
[   71.756516]  [] ? __sb_start_write+0xc8/0xe0
[   71.762278]  [] ? __sb_start_write+0xc8/0xe0
[   71.768038]  [] vfs_write+0xbe/0x1b0
[   71.773113]  [] SyS_write+0x52/0xa0
[   71.778110]  [] entry_SYSCALL_64_fastpath+0x18/0xa8
[   71.784472] Code: 45 31 f6 48 8b 93 78 33 00 00 48 81 c3 78 33 00 00 48 39 
d3 48 8d 82 e8 fa ff ff 74 25 48 8d b0 40 05 00 00 49 63 d6 41 83 c6 01 <49> 89 
34 d4 48 8b 90 18 05 00 00 48 39 d3 48 8d 82 e8 fa ff ff
[   71.803655] RIP  [] tun_device_event+0x110/0x340
[   71.809769]  RSP 
[   71.813213] CR2: 0010
[   71.816512] ---[ end trace 4db6449606319f73 ]---

Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
Signed-off-by: Craig Gallek 
---
 drivers/net/tun.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 5eadb7a1ad7b..9c8b5bc2b9d8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2503,6 +2503,9 @@ static int tun_device_event(struct notifier_block *unused,
struct net_device *dev = netdev_notifier_info_to_dev(ptr);
struct tun_struct *tun = netdev_priv(dev);
 
+   if (dev->rtnl_link_ops != &tun_link_ops)
+   return NOTIFY_DONE;
+
switch (event) {
case NETDEV_CHANGE_TX_QUEUE_LEN:
if (tun_queue_resize(tun))
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH net-next V4 0/6] switch to use tx skb array in tun

2016-07-06 Thread Craig Gallek
On Thu, Jun 30, 2016 at 2:45 AM, Jason Wang  wrote:
> Hi all:
>
> This series tries to switch to use skb array in tun. This is used to
> eliminate the spinlock contention between producer and consumer. The
> conversion was straightforward: just introduce a tx skb array and use
> it instead of sk_receive_queue.

I'm seeing the splat below after this series.  I'm still wrapping my
head around this code, but it appears to be happening because the
tun_struct passed into tun_queue_resize is uninitialized.
Specifically, iteration over the disabled list_head fails because prev
= next = NULL.  This seems to happen when a startup script on my test
machine changes the queue length.  I'll try to figure out what's
happening, but if it's obvious to someone else from the stack, please
let me know.

[   72.322236] BUG: unable to handle kernel NULL pointer dereference
at 0010
[   72.329993] IP: [] tun_device_event+0x110/0x340
[   72.336032] PGD 7f054f1067 PUD 7ef6f3f067 PMD 0
[   72.340616] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[   72.345498] gsmi: Log Shutdown Reason 0x03
[   72.349541] Modules linked in: w1_therm wire cdc_acm ehci_pci
ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
[   72.359870] CPU: 12 PID: 7820 Comm: set.ixion-haswe Not tainted
4.7.0-dbx-DEV #10
[   72.360253] mlx4_en: eth0: Link Up
[   72.370618] Hardware name: Intel Grantley,Wellsburg/Ixion_IT_15,
BIOS 2.50.0 01/21/2016
[   72.378525] task: 883f2501e8c0 ti: 883f3ef08000 task.ti:
883f3ef08000
[   72.385917] RIP: 0010:[]  []
tun_device_event+0x110/0x340
[   72.394353] RSP: 0018:883f3ef0bbe8  EFLAGS: 00010202
[   72.399599] RAX: fae8 RBX: 887ef9883378 RCX: 
[   72.406647] RDX:  RSI: 0028 RDI: 
[   72.413694] RBP: 883f3ef0bc58 R08:  R09: 0001
[   72.420742] R10: 0004 R11:  R12: 0010
[   72.427789] R13:  R14: 0001 R15: 883f3ef0bd10
[   72.434837] FS:  7fac4e5dd700() GS:883f7f70()
knlGS:
[   72.442832] CS:  0010 DS:  ES:  CR0: 80050033
[   72.448507] CR2: 0010 CR3: 007ef66ac000 CR4: 001406e0
[   72.45] Stack:
[   72.457541]  883f3ef0bc18 0246 001e
887ef988
[   72.464880]  883f3ef0bd10  

[   72.472219]  883f3ef0bc58 81d24070 fff9
81cec7a0
[   72.479559] Call Trace:
[   72.481986]  [] notifier_call_chain+0x5d/0x80
[   72.487844]  [] ? show_tx_maxrate+0x30/0x30
[   72.493520]  [] __raw_notifier_call_chain+0xe/0x10
[   72.499801]  [] raw_notifier_call_chain+0x16/0x20
[   72.506001]  [] call_netdevice_notifiers_info+0x40/0x70
[   72.512706]  [] call_netdevice_notifiers+0x16/0x20
[   72.518986]  [] change_tx_queue_len+0x38/0x80
[   72.524838]  [] netdev_store.isra.5+0xbf/0xd0
[   72.530688]  [] tx_queue_len_store+0x50/0x60
[   72.536459]  [] dev_attr_store+0x18/0x30
[   72.541888]  [] sysfs_kf_write+0x4f/0x70
[   72.547306]  [] kernfs_fop_write+0x147/0x1d0
[   72.553077]  [] ? rcu_read_lock_sched_held+0x8f/0xa0
[   72.559534]  [] __vfs_write+0x28/0x120
[   72.564781]  [] ? percpu_down_read+0x57/0x90
[   72.570542]  [] ? __sb_start_write+0xc8/0xe0
[   72.576303]  [] ? __sb_start_write+0xc8/0xe0
[   72.582063]  [] vfs_write+0xbe/0x1b0
[   72.587138]  [] SyS_write+0x52/0xa0
[   72.592135]  [] entry_SYSCALL_64_fastpath+0x18/0xa8
[   72.598497] Code: 45 31 f6 48 8b 93 78 33 00 00 48 81 c3 78 33 00
00 48 39 d3 48 8d 82 e8 fa ff ff 74 25 48 8d b0 40 05 00 00 49 63 d6
41 83 c6 01 <49> 89 34 d4 48 8b 90 18 05 00 00 48 39 d3 48 8d 82 e8 fa
ff ff
[   72.617767] RIP  [] tun_device_event+0x110/0x340
[   72.623883]  RSP 
[   72.627327] CR2: 0010
[   72.630638] ---[ end trace b0c54137cf861b91 ]---
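
To illustrate the failure mode described above: netdev_priv() of a non-tun
device is not a tun_struct, so the "disabled" list head the resize path walks
was never initialized.  A toy userspace sketch of that situation; fake_tun
and its layout are invented for illustration and are not the real tun_struct:

#include <stdio.h>
#include <string.h>

/* Stand-in for the kernel's struct list_head, illustration only. */
struct list_head { struct list_head *next, *prev; };

struct fake_tun {
	char other_fields[64];		/* whatever the real device keeps here */
	struct list_head disabled;	/* never INIT_LIST_HEAD()'d */
};

int main(void)
{
	struct fake_tun bogus;

	memset(&bogus, 0, sizeof(bogus));	/* prev = next = NULL */
	if (bogus.disabled.next == NULL)
		printf("disabled.next is NULL: a list_for_each_entry() walk "
		       "dereferences it immediately, matching the oops above\n");
	return 0;
}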


Re: [PATCH] soreuseport: add compat case for setsockopt SO_ATTACH_REUSEPORT_CBPF

2016-06-03 Thread Craig Gallek
On Fri, Jun 3, 2016 at 5:09 PM, Helge Deller  wrote:
> Any idea for a better naming than "do_sockopt_fix_sock_fprog()" ?
Thanks for catching and fixing this.  I'd suggest simply leaving the
function name as-is.  Your fix to the condition in that function is
sufficient to address the issue.

Craig


Re: [PATCH] soreuseport: Fix reuseport_bpf testcase on 32bit architectures

2016-06-03 Thread Craig Gallek
On Fri, Jun 3, 2016 at 1:19 PM, Helge Deller  wrote:
> This fixes the following compiler warnings when compiling the
> reuseport_bpf testcase on a 32 bit platform:
>
> reuseport_bpf.c: In function ‘attach_ebpf’:
> reuseport_bpf.c:114:15: warning: cast from pointer to integer of different 
> size [-Wpointer-to-int-cast]
>
> Signed-off-by: Helge Deller 
Acked-by: Craig Gallek 

Thanks!


[PATCH v3 net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Craig Gallek
From: Craig Gallek 

I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group.  This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
The same error does not exist in the UDP path.

Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek 
---
v3 Changes
  - Eric pointed out that the net namespace check isn't necessary when
comparing bucket pointers.  They can not be equal across namespaces.
v2 Changes
  - Suggestions from Eric Dumazet to include network namespace equality
check and to avoid a dereference by simply checking inet_bind_bucket
pointer equality.
---
 net/ipv4/inet_hashtables.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bc68eced0105..0d9e9d7bb029 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -470,6 +470,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
 const struct sock *sk2,
 bool match_wildcard))
 {
+   struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
struct sock *sk2;
struct hlist_nulls_node *node;
kuid_t uid = sock_i_uid(sk);
@@ -479,6 +480,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
sk2->sk_family == sk->sk_family &&
ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
+   inet_csk(sk2)->icsk_bind_hash == tb &&
sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
saddr_same(sk, sk2, false))
return reuseport_add_sock(sk, sk2);
-- 
2.8.0.rc3.226.g39d4020
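
The collision scenario described in the commit message above can be
reproduced from userspace.  A hedged sketch (not the kernel selftest): with
the 32-bucket listener hash of this era (INET_LHTABLE_SIZE == 32), ports that
are equal modulo 32 in the same namespace share a bucket, so two unrelated
SO_REUSEPORT listener groups owned by the same user could be merged before
the port-equality check was added; reuseport_listener() and the port numbers
are made up for illustration:

#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

static int reuseport_listener(unsigned short port)
{
	int one = 1;
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(port),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) || listen(fd, 16))
		perror("listener");
	return fd;
}

int main(void)
{
	reuseport_listener(8000);	/* reuseport group A             */
	reuseport_listener(8032);	/* group B, same listener bucket */
	return 0;
}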



Re: [PATCH v2 net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Craig Gallek
On Thu, Apr 28, 2016 at 5:59 PM, Eric Dumazet  wrote:
> On Thu, 2016-04-28 at 17:07 -0400, Craig Gallek wrote:
>> From: Craig Gallek 
>>
>> I forgot to include a check for listener port equality when deciding
>> if two sockets should belong to the same reuseport group.  This was
>> not caught previously because it's only necessary when two listening
>> sockets for the same user happen to hash to the same listener bucket.
>> This change also includes a check for network namespace equality.
>> The same error does not exist in the UDP path.
>>
>> Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
>> Signed-off-by: Craig Gallek 
>> ---
>> v2 Changes
>>   - Suggestions from Eric Dumazet to include network namespace equality
>> check and to avoid a dereference by simply checking inet_bind_bucket
>> pointer equality.
>> ---
>>  net/ipv4/inet_hashtables.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index bc68eced0105..5c5658268d5e 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -470,15 +470,19 @@ static int inet_reuseport_add_sock(struct sock *sk,
>>const struct sock *sk2,
>>bool match_wildcard))
>>  {
>> + struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
>> + struct net *net = sock_net(sk);
>>   struct sock *sk2;
>>   struct hlist_nulls_node *node;
>>   kuid_t uid = sock_i_uid(sk);
>>
>>   sk_nulls_for_each_rcu(sk2, node, &ilb->head) {
>> - if (sk2 != sk &&
>> + if (net_eq(sock_net(sk2), net) &&
>> + sk2 != sk &&
>>   sk2->sk_family == sk->sk_family &&
>>   ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
>>   sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
>> + inet_csk(sk2)->icsk_bind_hash == tb &&
>>   sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
>>   saddr_same(sk, sk2, false))
>>   return reuseport_add_sock(sk, sk2);
>
> Note that I suggested to only use "inet_csk(sk2)->icsk_bind_hash == tb"
>
> If test is true, it means that sockets share same name space and same
> port ;)
>
> Therefore the added net_eq(sock_net(sk2), net) test is redundant.
Thanks for the quick review, Eric; sorry I misread :\  I'll send a v3...

> No strong opinion, as this patch works, and this is not fast path
> anyway.
>
> Acked-by: Eric Dumazet 
>
>


[PATCH v2 net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Craig Gallek
From: Craig Gallek 

I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group.  This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
This change also includes a check for network namespace equality.
The same error does not exist in the UDP path.

Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek 
---
v2 Changes
  - Suggestions from Eric Dumazet to include network namespace equality
check and to avoid a dereference by simply checking inet_bind_bucket
pointer equality.
---
 net/ipv4/inet_hashtables.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bc68eced0105..5c5658268d5e 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -470,15 +470,19 @@ static int inet_reuseport_add_sock(struct sock *sk,
 const struct sock *sk2,
 bool match_wildcard))
 {
+   struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
+   struct net *net = sock_net(sk);
struct sock *sk2;
struct hlist_nulls_node *node;
kuid_t uid = sock_i_uid(sk);
 
sk_nulls_for_each_rcu(sk2, node, &ilb->head) {
-   if (sk2 != sk &&
+   if (net_eq(sock_net(sk2), net) &&
+   sk2 != sk &&
sk2->sk_family == sk->sk_family &&
ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
+   inet_csk(sk2)->icsk_bind_hash == tb &&
sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
saddr_same(sk, sk2, false))
return reuseport_add_sock(sk, sk2);
-- 
2.8.0.rc3.226.g39d4020



[PATCH net] soreuseport: Fix TCP listener hash collision

2016-04-28 Thread Craig Gallek
From: Craig Gallek 

I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group.  This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
The same error does not exist in the UDP path.

Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek 
---
 net/ipv4/inet_hashtables.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bc68eced0105..326d26c7a9e6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -470,6 +470,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
 const struct sock *sk2,
 bool match_wildcard))
 {
+   struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
struct sock *sk2;
struct hlist_nulls_node *node;
kuid_t uid = sock_i_uid(sk);
@@ -479,6 +480,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
sk2->sk_family == sk->sk_family &&
ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
sk2->sk_bound_dev_if == sk->sk_bound_dev_if &&
+   inet_csk(sk2)->icsk_bind_hash->port == tb->port &&
sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) &&
saddr_same(sk, sk2, false))
return reuseport_add_sock(sk, sk2);
-- 
2.8.0.rc3.226.g39d4020



Re: net merged into net-next

2016-04-25 Thread Craig Gallek
Thanks David,
There was one other change that conflicts (functionally) with this
merge as well: 3b24d854cb35 ("tcp/dccp: do not touch listener
sk_refcnt under synflood")
It did a similar hlist_nulls -> hlist transform for the TCP stack.
I'll send a formal patch to address this as well.

Craig

On Sat, Apr 23, 2016 at 9:39 PM, Eric Dumazet  wrote:
> On Sat, 2016-04-23 at 20:12 -0400, David Miller wrote:
>> Eric, please double check my merge work for net/ipv4/udp.c
>>
>> In net we fixed a soreuseport bug that added a hlist_nulls_add_tail_rcu()
>> call to net/ipv4/udp.c
>>
>> And in net-next you converted UDP sockets away from nulls lists.
>>
>> Therefore I had to add a hlist_add_tail_rcu() implementation to
>> linux/rculist.h and use it in net/ipv4/udp.c
>>
>> Please make sure this is fine and will work safely.
>>
>> Thanks!
>
> Looks fine to me, thanks David
>
> Craig, please have a look ?
>
> ( commit 1602f49b58abcb0d34a5f0a29d68e7c1769547aa in net-next)
>
>


[PATCH net-next] soreuseport: Resolve merge conflict for v4/v6 ordering fix

2016-04-25 Thread Craig Gallek
From: Craig Gallek 

d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree.  Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist.  This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.

Fixes: 1602f49b58ab ("Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek 
---
 include/net/sock.h | 6 +-
 net/ipv4/inet_hashtables.c | 6 +-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 52448baf19d7..d6f26b3119aa 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -630,7 +630,11 @@ static inline void sk_add_node(struct sock *sk, struct 
hlist_head *list)
 static inline void sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
 {
sock_hold(sk);
-   hlist_add_head_rcu(&sk->sk_node, list);
+   if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
+   sk->sk_family == AF_INET6)
+   hlist_add_tail_rcu(&sk->sk_node, list);
+   else
+   hlist_add_head_rcu(&sk->sk_node, list);
 }
 
 static inline void __sk_nulls_add_node_rcu(struct sock *sk, struct 
hlist_nulls_head *list)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index fcadb670f50b..b76b0d7e59c1 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -479,7 +479,11 @@ int __inet_hash(struct sock *sk, struct sock *osk,
if (err)
goto unlock;
}
-   hlist_add_head_rcu(&sk->sk_node, &ilb->head);
+   if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
+   sk->sk_family == AF_INET6)
+   hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
+   else
+   hlist_add_head_rcu(&sk->sk_node, &ilb->head);
sock_set_flag(sk, SOCK_RCU_FREE);
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 unlock:
-- 
2.8.0.rc3.226.g39d4020



[RFC net-next] soreuseport: fix ordering for mixed v4/v6 sockets

2016-04-15 Thread Craig Gallek
From: Craig Gallek 

With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.

Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).

The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match.  They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.

To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list.  This will force AF_INET sockets to always be
considered first.

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek 
---
A similar patch was recently accepted to the net tree:
d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")

However, two patches have already been submitted to net-next which
will conflict when net is merged back into net-next:
ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

These net-next patches change the TCP and UDP socket list datastructures
from hlist_nulls to hlists.  The fix for net needed to extend the
hlist_nulls API, the fix for net-next will need to extend the hlist API.
Further, the TCP stack now directly uses the hlist API rather than
the sk_* helper functions that wrapped them.

This RFC patch is a re-implementation of the net patch for the net-next
tree.  It could be used if the net patch is first reverted before merging to
net-next or simply used as a reference to correct the merge conflict.
The test submitted with the initial patch should work in both cases.
---
 include/linux/rculist.h| 35 +++
 include/net/sock.h |  6 +-
 net/ipv4/inet_hashtables.c |  6 +-
 net/ipv4/udp.c |  9 +++--
 4 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 17d4f849c65e..7c5a8f7b0cb1 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -542,6 +542,41 @@ static inline void hlist_add_behind_rcu(struct hlist_node 
*n,
n->next->pprev = &n->next;
 }
 
+/**
+ * hlist_add_tail_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the end of the specified hlist,
+ * while permitting racing traversals.  NOTE: tail insertion requires
+ * list traversal.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_add_head_rcu()
+ * or hlist_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.  Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+
+static inline void hlist_add_tail_rcu(struct hlist_node *n,
+   struct hlist_head *h)
+{
+   struct hlist_node *i, *last = NULL;
+
+   for (i = hlist_first_rcu(h); i; i = hlist_next_rcu(i))
+   last = i;
+
+   if (last)
+   hlist_add_behind_rcu(n, last);
+   else
+   hlist_add_head_rcu(n, h);
+}
+
 #define __hlist_for_each_rcu(pos, head)\
for (pos = rcu_dereference(hlist_first_rcu(head));  \
 pos;   \
diff --git a/include/net/sock.h b/include/net/sock.h
index d997ec13a643..2b620c79f531 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -630,7 +630,11 @@ static inline void sk_add_node(struct sock *sk, struct 
hlist_head *list)
 static inline void sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
 {
sock_hold(sk);
-   hlist_add_head_rcu(&sk->sk_node, list);
+   if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
+   sk->sk_family == AF_INET6)
+   hlist_add_tail

[PATCH net 2/2] soreuseport: test mixed v4/v6 sockets

2016-04-12 Thread Craig Gallek
From: Craig Gallek 

Test to validate the behavior of SO_REUSEPORT sockets that are
created with both AF_INET and AF_INET6.  See the commit prior to this
for a description of this behavior.

Signed-off-by: Craig Gallek 
---
 tools/testing/selftests/net/.gitignore|   1 +
 tools/testing/selftests/net/Makefile  |   2 +-
 tools/testing/selftests/net/reuseport_dualstack.c | 208 ++
 3 files changed, 210 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/reuseport_dualstack.c

diff --git a/tools/testing/selftests/net/.gitignore 
b/tools/testing/selftests/net/.gitignore
index 69bb3fc38fb2..0840684deb7d 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -3,3 +3,4 @@ psock_fanout
 psock_tpacket
 reuseport_bpf
 reuseport_bpf_cpu
+reuseport_dualstack
diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index c658792d47b4..0e5340742620 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -4,7 +4,7 @@ CFLAGS = -Wall -O2 -g
 
 CFLAGS += -I../../../../usr/include/
 
-NET_PROGS = socket psock_fanout psock_tpacket reuseport_bpf reuseport_bpf_cpu
+NET_PROGS = socket psock_fanout psock_tpacket reuseport_bpf reuseport_bpf_cpu 
reuseport_dualstack
 
 all: $(NET_PROGS)
 %: %.c
diff --git a/tools/testing/selftests/net/reuseport_dualstack.c 
b/tools/testing/selftests/net/reuseport_dualstack.c
new file mode 100644
index ..90958fb9
--- /dev/null
+++ b/tools/testing/selftests/net/reuseport_dualstack.c
@@ -0,0 +1,208 @@
+/*
+ * It is possible to use SO_REUSEPORT to open multiple sockets bound to
+ * equivalent local addresses using AF_INET and AF_INET6 at the same time.  If
+ * the AF_INET6 socket has IPV6_V6ONLY set, it's clear which socket should
+ * receive a given incoming packet.  However, when it is not set, incoming v4
+ * packets should prefer the AF_INET socket(s).  This behavior was defined with
+ * the original SO_REUSEPORT implementation, but broke with
+ * e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
+ * This test creates these mixed AF_INET/AF_INET6 sockets and asserts the
+ * AF_INET preference for v4 packets.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static const int PORT = ;
+
+static void build_rcv_fd(int family, int proto, int *rcv_fds, int count)
+{
+   struct sockaddr_storage addr;
+   struct sockaddr_in  *addr4;
+   struct sockaddr_in6 *addr6;
+   int opt, i;
+
+   switch (family) {
+   case AF_INET:
+   addr4 = (struct sockaddr_in *)&addr;
+   addr4->sin_family = AF_INET;
+   addr4->sin_addr.s_addr = htonl(INADDR_ANY);
+   addr4->sin_port = htons(PORT);
+   break;
+   case AF_INET6:
+   addr6 = (struct sockaddr_in6 *)&addr;
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_addr = in6addr_any;
+   addr6->sin6_port = htons(PORT);
+   break;
+   default:
+   error(1, 0, "Unsupported family %d", family);
+   }
+
+   for (i = 0; i < count; ++i) {
+   rcv_fds[i] = socket(family, proto, 0);
+   if (rcv_fds[i] < 0)
+   error(1, errno, "failed to create receive socket");
+
+   opt = 1;
+   if (setsockopt(rcv_fds[i], SOL_SOCKET, SO_REUSEPORT, &opt,
+  sizeof(opt)))
+   error(1, errno, "failed to set SO_REUSEPORT");
+
+   if (bind(rcv_fds[i], (struct sockaddr *)&addr, sizeof(addr)))
+   error(1, errno, "failed to bind receive socket");
+
+   if (proto == SOCK_STREAM && listen(rcv_fds[i], 10))
+   error(1, errno, "failed to listen on receive port");
+   }
+}
+
+static void send_from_v4(int proto)
+{
+   struct sockaddr_in  saddr, daddr;
+   int fd;
+
+   saddr.sin_family = AF_INET;
+   saddr.sin_addr.s_addr = htonl(INADDR_ANY);
+   saddr.sin_port = 0;
+
+   daddr.sin_family = AF_INET;
+   daddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+   daddr.sin_port = htons(PORT);
+
+   fd = socket(AF_INET, proto, 0);
+   if (fd < 0)
+   error(1, errno, "failed to create send socket");
+
+   if (bind(fd, (struct sockaddr *)&saddr, sizeof(saddr)))
+   error(1, errno, "failed to bind send socket");
+
+   if (connect(fd, (struct sockaddr *)&daddr, sizeof(daddr)))
+   error(1, errno, "failed to connect send socket");
+
+   if (send(fd, "a", 1, 0) < 0)
+   error(1, errn

[PATCH net 0/2] Fixes for SO_REUSEPORT and mixed v4/v6 sockets

2016-04-12 Thread Craig Gallek
From: Craig Gallek 

Recent changes to the datastructures associated with SO_REUSEPORT broke
an existing behavior when equivalent SO_REUSEPORT sockets are created
using both AF_INET and AF_INET6.  This patch series restores the previous
behavior and includes a test to validate it.

This series should be a trivial merge to stable kernels (if deemed
necessary), but will have conflicts in net-next.  The following patches
recently replaced the use of hlist_nulls with hlists for UDP and TCP
socket lists:
ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

If this series is accepted, I will send an RFC for the net-next change
to assist with the merge.

Craig Gallek (2):
  soreuseport: fix ordering for mixed v4/v6 sockets
  soreuseport: test mixed v4/v6 sockets

 include/linux/rculist_nulls.h |  39 
 include/net/sock.h|   6 +-
 net/ipv4/udp.c|   9 +-
 tools/testing/selftests/net/.gitignore|   1 +
 tools/testing/selftests/net/Makefile  |   2 +-
 tools/testing/selftests/net/reuseport_dualstack.c | 208 ++
 6 files changed, 261 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/net/reuseport_dualstack.c

-- 
2.8.0.rc3.226.g39d4020



[PATCH net 1/2] soreuseport: fix ordering for mixed v4/v6 sockets

2016-04-12 Thread Craig Gallek
From: Craig Gallek 

With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.

Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).

The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match.  They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.

To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list.  This will force AF_INET sockets to always be
considered first.

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")

Reported-by: Maciej Żenczykowski 
Signed-off-by: Craig Gallek 
Signed-off-by: Eric Dumazet 
---
 include/linux/rculist_nulls.h | 39 +++
 include/net/sock.h|  6 +-
 net/ipv4/udp.c|  9 +++--
 3 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
index 1c33dd7da4a7..4ae95f7e8597 100644
--- a/include/linux/rculist_nulls.h
+++ b/include/linux/rculist_nulls.h
@@ -98,6 +98,45 @@ static inline void hlist_nulls_add_head_rcu(struct 
hlist_nulls_node *n,
if (!is_a_nulls(first))
first->pprev = &n->next;
 }
+
+/**
+ * hlist_nulls_add_tail_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the end of the specified hlist_nulls,
+ * while permitting racing traversals.  NOTE: tail insertion requires
+ * list traversal.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
+ * or hlist_nulls_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.  Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n,
+   struct hlist_nulls_head *h)
+{
+   struct hlist_nulls_node *i, *last = NULL;
+
+   for (i = hlist_nulls_first_rcu(h); !is_a_nulls(i);
+i = hlist_nulls_next_rcu(i))
+   last = i;
+
+   if (last) {
+   n->next = last->next;
+   n->pprev = &last->next;
+   rcu_assign_pointer(hlist_nulls_next_rcu(last), n);
+   } else {
+   hlist_nulls_add_head_rcu(n, h);
+   }
+}
+
 /**
  * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
  * @tpos:  the type * to use as a loop cursor.
diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e03727b..121ffc115c4f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -630,7 +630,11 @@ static inline void sk_add_node_rcu(struct sock *sk, struct 
hlist_head *list)
 
 static inline void __sk_nulls_add_node_rcu(struct sock *sk, struct 
hlist_nulls_head *list)
 {
-   hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
+   if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
+   sk->sk_family == AF_INET6)
+   hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
+   else
+   hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
 }
 
 static inline void sk_nulls_add_node_rcu(struct sock *sk, struct 
hlist_nulls_head *list)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 08eed5e16df0..a2e7f55a1f61 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -339,8 +339,13 @@ found:
 
hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash);
spin_lock(&hslot2->lock);
-   hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
-&hslot2->head);
+   if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&

Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Craig Gallek
On Fri, Mar 25, 2016 at 12:21 PM, Alexei Starovoitov
 wrote:
> On Fri, Mar 25, 2016 at 11:29:10AM -0400, Craig Gallek wrote:
>> On Thu, Mar 24, 2016 at 2:00 PM, Willy Tarreau  wrote:
>> > The pattern is :
>> >
>> >   t0 : unprivileged processes 1 and 2 are listening to the same port
>> >(sock1@pid1) (sock2@pid2)
>> ><-- listening -->
>> >
>> >   t1 : new processes are started to replace the old ones
>> >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
>> ><-- listening --> <-- listening -->
>> >
>> >   t2 : new processes signal the old ones they must stop
>> >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
>> ><--- draining --> <-- listening -->
>> >
>> >   t3 : pids 1 and 2 have finished, they go away
>> >  (sock3@pid3) (sock4@pid4)
>> > <-- gone ->  <-- listening -->
> ...
>> t3: Close the first two sockets and only use the last two.  This is
>> the tricky step.  Before this point, the sockets are numbered 0
>> through 3 from the perspective of the BPF program (in the order
>> listen() was called).  As soon as socket 0 is closed, the last socket
>> in the list replaces it (what was 3 becomes 0).  When socket 1 is
>> closed, socket 2 moves into that position.  The assumptions about the
>> socket indexes in the BPF program need to change as the indexes change
>> as a result of closing them.
>
> yeah, the way reuseport_detach_sock() was done makes it hard to manage
> such transitions from bpf program, but I don't see yet what stops
> pid1 and pid2 at stage t2 from just closing their sockets.
> If these 'draining' pids don't want to receive packets, they should
> close their sockets. Complicating bpf side to redistribute spraying
> to sock3 and sock4 only (while sock1 and sock2 are still open) is possible,
> but looks unnecessarily complex to me.
> Just close sock1 and sock2 at t2 time and then exit pid1, pid2 later.
> If they are tcp sockets with rpc protocol on top and have a problem of
> partial messages, then kcm can solve that and it will simplify
> the user space side as well.
I believe the issue here is that closing the listen sockets will drop
any connections that are in the listen queue but have not been
accepted yet.  In the case of reuseport, you could in theory drain
those queues into the non-closed sockets, but that probably has some
interesting consequences...


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Craig Gallek
On Thu, Mar 24, 2016 at 2:00 PM, Willy Tarreau  wrote:
> The pattern is :
>
>   t0 : unprivileged processes 1 and 2 are listening to the same port
>(sock1@pid1) (sock2@pid2)
><-- listening -->
>
>   t1 : new processes are started to replace the old ones
>(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
><-- listening --> <-- listening -->
>
>   t2 : new processes signal the old ones they must stop
>(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
><--- draining --> <-- listening -->
>
>   t3 : pids 1 and 2 have finished, they go away
>  (sock3@pid3) (sock4@pid4)
> <-- gone ->  <-- listening -->

To address the documentation issues, I'd like to reference the following:
- The filter.txt document in the kernel tree:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/filter.txt
- It uses (and extends) the BPF instruction set defined in the
original BSD BPF paper: http://www.tcpdump.org/papers/bpf-usenix93.pdf
- The kernel headers define all of the user-space structures used:
  * 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/filter.h
  * 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/bpf.h

I've been trying to come up with an example BPF program for the
scenario Willy gave earlier in this thread (using 4 points in time and
describing one process with two listening sockets replacing another
with two listening sockets).  Everything except the last step is
pretty straightforward using what is currently available in the
kernel.  I'm using random distribution for simplicity, but you could
easily do something smarter using more information about the specific
hardware:

t0: Evenly distrubute load to two SO_REUSEPORT sockets in a single process:
  ld rand
  mod #2
  ret a

t1: Fork a new process, create two new listening sockets in the same
group. Even after calling listen(), but before updating the BPF
program, only the first two sockets will see new connections.  The
program is trivially modified to use all 4.
  ld rand
  mod #4
  ret a

t2: Stop sending new connections to the first two sockets (draining)
  ld rand
  mod #2
  add #2
  ret a

t3: Close the first two sockets and only use the last two.  This is
the tricky step.  Before this point, the sockets are numbered 0
through 3 from the perspective of the BPF program (in the order
listen() was called).  As soon as socket 0 is closed, the last socket
in the list replaces it (what was 3 becomes 0).  When socket 1 is
closed, socket 2 moves into that position.  The assumptions about the
socket indexes in the BPF program need to change as the indexes change
as a result of closing them.

Even if you use an EBPF map as a level of indirection here, you still
have the issue that the socket indexes change as a result of some of
them leaving the group.  I'm not sure yet how to properly fix this,
but it will probably mean changing the way the socket indexing
works...  The current scheme is really an implementation detail
optimized for efficiency.  It may be worth modifying or creating a
mode which results in a stable mapping.  This will probably be
necessary for any scheme which expects sockets to regularly enter or
leave the group.
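
For completeness, a rough userspace sketch of how the t0 program above
could be expressed as a classic BPF filter and attached with
SO_ATTACH_REUSEPORT_CBPF (illustrative only, not part of any patch in
this thread; it assumes SO_REUSEPORT is already set on the socket and
defines the option value in case the libc headers are too old):

#include <errno.h>
#include <error.h>
#include <linux/filter.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_REUSEPORT_CBPF
#define SO_ATTACH_REUSEPORT_CBPF 51
#endif

/* Attach "A = random; A %= groups; return A" to one socket in the group. */
static void attach_random_cbpf(int fd, int groups)
{
	struct sock_filter code[] = {
		/* A = prandom_u32()              ("ld rand") */
		{ BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_RANDOM },
		/* A = A % groups                 ("mod #groups") */
		{ BPF_ALU | BPF_MOD | BPF_K, 0, 0, groups },
		/* return A as the socket index   ("ret a") */
		{ BPF_RET | BPF_A, 0, 0, 0 },
	};
	struct sock_fprog prog = {
		.len = sizeof(code) / sizeof(code[0]),
		.filter = code,
	};

	if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
		       &prog, sizeof(prog)))
		error(1, errno, "failed to attach reuseport cBPF program");
}

The t1 and t2 programs only change the modulus and add a constant
offset, so the same attach call can be reused with an adjusted
instruction array; the attach may be issued on any single socket in
the group and replaces the program used by all of them.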


Re: [PATCH v2] socket.7: Document some BPF-related socket options

2016-03-01 Thread Craig Gallek
On Tue, Mar 1, 2016 at 5:29 AM, Michael Kerrisk (man-pages)
 wrote:
> On 03/01/2016 11:10 AM, Vincent Bernat wrote:
>>  ❦  1 mars 2016 11:03 +0100, "Michael Kerrisk (man-pages)" 
>>  :
>>
>>>   Once   the   SO_LOCK_FILTER  option  has  been  enabled,
>>>   attempts by an unprivileged process to change or  remove
>>>   the  filter  attached  to  a  socket,  or to disable the
>>>   SO_LOCK_FILTER option will fail with the error EPERM.
>>
>> You should remove "unprivileged". I didn't try to check for permissions
>> because I was just lazy (and I didn't have a need for it). As root, you
>> can just recreate another socket.
>
> Bother. That's what I meant to do, and then I omitted to do it! Done now
> And thanks for catching that, Vincent.
>
> Revised text below, with another query.
>
>SO_LOCK_FILTER
>   When set, this option will prevent changing the  filters
>   associated  with  the socket.  These filters include any
>   set   using   the   socket   options   SO_ATTACH_FILTER,
>   SO_ATTACH_BPF,SO_ATTACH_REUSEPORT_CBPF   and
>   SO_ATTACH_REUSEPORT_EPBF.
>
>   The typical use case is for a privileged process to  set
>   up  a  socket with restrictive filters, set SO_LOCK_FIL‐
>   TER, and then either drop its  privileges  or  pass  the
>   socket file descriptor to an unprivileged process.
>
>   Once   the   SO_LOCK_FILTER  option  has  been  enabled,
>   attempts to change or remove the filter  attached  to  a
>   socket,  or  to  disable  the SO_LOCK_FILTER option will
>   fail with the error EPERM.
>
> I think the second paragraph should probably drop mention of privileges,
> right? In fact, maybe just drop the paragraph altogether?
Thanks Michael, all of your changes in the git tree look good to me.
I parsed the one-way nature of SO_LOCK_FILTER completely backwards from
the commit message.  It's describing BSD's root-modify behavior, not
the implementation in Linux.  I think I like this last paragraph as
you have it, to explicitly call this out as intended behavior.
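
For what it's worth, the flow that paragraph describes looks roughly
like this in practice (an illustrative sketch only, not taken from the
patch; the socket and the sock_fprog are assumed to be set up
elsewhere):

#include <errno.h>
#include <error.h>
#include <linux/filter.h>
#include <sys/socket.h>
#include <unistd.h>

/* Illustrative only: attach a restrictive filter, lock it, then drop
 * privileges before handing the socket to less trusted code. */
static void lock_filter_and_drop_privs(int fd, const struct sock_fprog *prog,
				       uid_t unprivileged_uid)
{
	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, prog, sizeof(*prog)))
		error(1, errno, "failed to attach filter");
	if (setsockopt(fd, SOL_SOCKET, SO_LOCK_FILTER, &one, sizeof(one)))
		error(1, errno, "failed to lock filter");

	/* From here on, attempts to change or remove the filter, or to
	 * clear SO_LOCK_FILTER, on this socket fail with EPERM. */
	if (setuid(unprivileged_uid))
		error(1, errno, "failed to drop privileges");
}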

Thanks again,
Craig

> Cheers,
>
> Michael
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/


[PATCH v2] socket.7: Document some BPF-related socket options

2016-02-29 Thread Craig Gallek
From: Craig Gallek 

Document the behavior and the first kernel version for each of the
following socket options:
SO_ATTACH_FILTER
SO_ATTACH_BPF
SO_ATTACH_REUSEPORT_CBPF
SO_ATTACH_REUSEPORT_EBPF
SO_DETACH_FILTER
SO_DETACH_BPF
SO_LOCK_FILTER

Signed-off-by: Craig Gallek 
---
v2 changes:
- Content suggestions from Michael Kerrisk :
  * Clarify socket filter return value semantics
  * Clarify wording of minimal kernel versions
  * Explain behavior of multiple calls using SO_ATTACH_[BPF|FILTER]
  * Define 'reuseport groups' in SO_ATTACH_REUSEPORT_*
- Include SO_LOCK_FILTER documentation mostly based off of the wording
  in the commit message by Vincent Bernat 
  d59577b6ffd3 ("sk-filter: Add ability to lock a socket filter program")

---
 man7/socket.7 | 136 +-
 1 file changed, 115 insertions(+), 21 deletions(-)

diff --git a/man7/socket.7 b/man7/socket.7
index db7cb8324dde..d22107cc47d7 100644
--- a/man7/socket.7
+++ b/man7/socket.7
@@ -41,9 +41,6 @@
 .\"SO_GET_FILTER (3.8)
 .\"commit a8fc92778080c845eaadc369a0ecf5699a03bef0
 .\"Author: Pavel Emelyanov 
-.\"SO_LOCK_FILTER (3.9)
-.\"commit d59577b6ffd313d0ab3be39cb1ab47e29bdc9182
-.\"Author: Vincent Bernat 
 .\"SO_SELECT_ERR_QUEUE (3.10)
 .\" commit 7d4c04fc170087119727119074e72445f2bb192b
 .\"Author: Keller, Jacob E 
@@ -53,13 +50,6 @@
 .\" SO_BPF_EXTENSIONS (3.14)
 .\" commit ea02f9411d9faa3553ed09ce0ec9f00ceae9885e
 .\"Author: Michal Sekletar 
-.\" SO_ATTACH_BPF (3.19)
-.\" and SO_DETACH_BPF as synonym for SO_DETACH_FILTER
-.\" commit 89aa075832b0da4402acebd698d0411dcc82d03e
-.\"Author: Alexei Starovoitov 
-.\"SO_ATTACH_REUSEPORT_CBPF, SO_ATTACH_REUSEPORT_EBPF (4.5)
-.\"commit 538950a1b7527a0a52ccd9337e3fcd304f027f13
-.\"Author: Craig Gallek 
 .\"
 .TH SOCKET 7 2015-05-07 Linux "Linux Programmer's Manual"
 .SH NAME
@@ -311,6 +301,90 @@ The value 0 indicates that this is not a listening socket,
 the value 1 indicates that this is a listening socket.
 This socket option is read-only.
 .TP
+.BR SO_ATTACH_FILTER " and " SO_ATTACH_BPF
+Attach a classic or extended BPF program (respectively) to the socket
+for use as a filter of incoming packets. A packet will be dropped if
+the filter program returns zero.  If the filter program returns a
+non-zero value which is less than the packet's data length, the packet
+will be truncated to the length returned.  If the value returned by
+the filter is greater than or equal to the packet's data length, the
+packet is allowed to proceed unmodified.
+
+The argument for
+.BR SO_ATTACH_FILTER
+is a
+.I sock_fprog
+structure in
+.B <linux/filter.h>.
+.sp
+.in +4n
+.nf
+struct sock_fprog {
+unsigned short  len;
+struct sock_filter *filter;
+};
+.fi
+.in
+.IP
+The argument for
+.BR SO_ATTACH_BPF
+is a file descriptor returned by the
+.BR bpf (2)
+system call and must refer to a program of type
+.BR BPF_PROG_TYPE_SOCKET_FILTER.
+These options may be set multiple times for a given socket, each time
+replacing the previous filter program.  The classic and extended
+versions may be called on the same socket, but the previous filter
+will always be replaced such that a socket never has more than one
+filter defined.
+
+.BR SO_ATTACH_FILTER
+is available since Linux 2.2.
+.BR SO_ATTACH_BPF
+is available since Linux 3.19.  Both classic and extended BPF are
+explained in the kernel source file
+.I Documentation/networking/filter.txt
+.TP
+.BR SO_ATTACH_REUSEPORT_CBPF " and " SO_ATTACH_REUSEPORT_EBPF " (since Linux 
4.5)"
+For use with the
+.BR SO_REUSEPORT
+option, these options allow the user to set a classic or extended
+BPF program (respectively) which defines how packets are assigned to
+the sockets in the reuseport group (that is, all sockets which have
+.BR SO_REUSEPORT
+set and are using the same local address to receive packets).  The BPF
+program must return an index between 0 and N-1 representing the socket
+which should receive the packet (where N is the number of sockets in
+the group). If the BPF program returns an invalid index, socket
+selection will fall back to the plain
+.BR SO_REUSEPORT
+mechanism.
+
+Sockets are numbered in the order in which they are added to the group
+(that is, the order of
+.BR bind (2)
+calls for UDP sockets or the order of
+.BR listen (2)
+calls for TCP sockets).  New sockets added to a reuseport group will
+inherit the BPF program.  When a socket is removed from a reuseport
+group (via
+.BR close (2))
+the last socket in the group will be moved into the closed socket's
+position.
+
+These options may be set repeatedly at any time on any single socket
+in the group to replace the current 

[PATCH] socket.7: Document some BPF-related socket options

2016-02-25 Thread Craig Gallek
From: Craig Gallek 

Document the behavior and the first kernel version for each of the
following socket options:
SO_ATTACH_FILTER
SO_ATTACH_BPF
SO_ATTACH_REUSEPORT_CBPF
SO_ATTACH_REUSEPORT_EBPF
SO_DETACH_FILTER
SO_DETACH_BPF

Signed-off-by: Craig Gallek 
---
 man7/socket.7 | 104 --
 1 file changed, 86 insertions(+), 18 deletions(-)

diff --git a/man7/socket.7 b/man7/socket.7
index db7cb8324dde..79b4f3158541 100644
--- a/man7/socket.7
+++ b/man7/socket.7
@@ -53,13 +53,6 @@
 .\" SO_BPF_EXTENSIONS (3.14)
 .\" commit ea02f9411d9faa3553ed09ce0ec9f00ceae9885e
 .\"Author: Michal Sekletar 
-.\" SO_ATTACH_BPF (3.19)
-.\" and SO_DETACH_BPF as synonym for SO_DETACH_FILTER
-.\" commit 89aa075832b0da4402acebd698d0411dcc82d03e
-.\"Author: Alexei Starovoitov 
-.\"SO_ATTACH_REUSEPORT_CBPF, SO_ATTACH_REUSEPORT_EBPF (4.5)
-.\"commit 538950a1b7527a0a52ccd9337e3fcd304f027f13
-.\"Author: Craig Gallek 
 .\"
 .TH SOCKET 7 2015-05-07 Linux "Linux Programmer's Manual"
 .SH NAME
@@ -311,6 +304,80 @@ The value 0 indicates that this is not a listening socket,
 the value 1 indicates that this is a listening socket.
 This socket option is read-only.
 .TP
+.BR SO_ATTACH_FILTER " and " SO_ATTACH_BPF
+Attach a classic or extended BPF program (respectively) to the socket
+for use as a filter of incoming packets.  A packet will be dropped if
+the filter returns zero or have its data truncated to the non-zero
+length returned.  If the value returned is greater or equal to the
+packet's data length, the packet is allowed to proceed unmodified.
+
+The argument for
+.BR SO_ATTACH_FILTER
+is a
+.I sock_fprog
+structure in
+.B <linux/filter.h>.
+.sp
+.in +4n
+.nf
+struct sock_fprog {
+unsigned short  len;
+struct sock_filter *filter;
+};
+.fi
+.in
+.IP
+The argument for
+.BR SO_ATTACH_BPF
+is a file descriptor returned by the
+.BR bpf (2)
+system call and must represent a program of type
+.BR BPF_PROG_TYPE_SOCKET_FILTER.
+
+.BR SO_ATTACH_FILTER
+is available in Linux 2.2.
+.BR SO_ATTACH_BPF
+is available in Linux 3.19.  Both classic and extended BPF are
+explained in the kernel source file
+.I Documentation/networking/filter.txt
+.TP
+.BR SO_ATTACH_REUSEPORT_CBPF " and " SO_ATTACH_REUSEPORT_EBPF " (since Linux 
4.5)"
+For use with the
+.BR SO_REUSEPORT
+option, these options allow the user to define a classic or extended
+BPF program (respectively) which defines how packets are assigned to
+the sockets in the reuseport group.  The program must return an index
+between 0 and N-1 representing the socket which should receive the
+packet (where N is the number of sockets in the group). If the BPF
+program returns an invalid index, socket selection will fall back to
+the plain
+.BR SO_REUSEPORT
+mechanism.
+
+Sockets are numbered in the order in which they are added to the group
+(that is, the order of
+.BR bind (2)
+calls for UDP sockets or the order of
+.BR listen (2)
+calls for TCP sockets).  New sockets added to the group will inherit
+the program.  When a socket is removed from the group (via
+.BR close (2))
+the last socket in the group will be moved into the closed socket's
+position.
+
+These options may be set repeatedly at any time on any single socket
+in the group to replace the current BPF program used by all sockets in
+the group.
+.BR SO_ATTACH_REUSEPORT_CBPF
+takes the same socket argument type as
+.BR SO_ATTACH_FILTER
+and
+.BR SO_ATTACH_REUSEPORT_EBPF
+takes the same socket argument type as
+.BR SO_ATTACH_BPF.
+UDP support for this feature is available in Linux 4.5.
+TCP support for this feature is available in Linux 4.6.
+.TP
 .B SO_BINDTODEVICE
 Bind this socket to a particular device like \(lqeth0\(rq,
 as specified in the passed interface name.
@@ -368,6 +435,18 @@ Only allowed for processes with the
 .B CAP_NET_ADMIN
 capability or an effective user ID of 0.
 .TP
+.BR SO_DETACH_FILTER " and " SO_DETACH_BPF
+These options may be used to remove the BPF program attached to the
+socket with either
+.BR SO_ATTACH_FILTER
+or
+.BR SO_ATTACH_BPF.
+The option value is ignored.
+.BR SO_DETACH_FILTER
+is available in Linux 2.2.
+.BR SO_DETACH_BPF
+is available in Linux 3.19.
+.TP
 .BR SO_DOMAIN " (since Linux 2.6.32)"
 Retrieves the socket domain as an integer, returning a value such as
 .BR AF_INET6 .
@@ -991,17 +1070,6 @@ where only the later program needs to set the
 option.
 Typically this difference is invisible, since, for example, a server
 program is designed to always set this option.
-.SH BUGS
-The
-.B CONFIG_FILTER
-socket options
-.B SO_ATTACH_FILTER
-and
-.B SO_DETACH_FILTER
-.\" FIXME Document SO_ATTACH_FILTER and SO_DETACH_FILTER
-are not documented.
-The suggested interface to use them is via the libpcap
-library.
 .\" .SH AUTHORS
 .\" This man page was written by Andi Kleen.
 .SH SEE ALSO
-- 
2.7.0.rc3.207.g0ac5344
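
As a companion to the SO_ATTACH_REUSEPORT_EBPF text above, a minimal
userspace sketch (not part of the man page patch) that loads a trivial
extended BPF program with bpf(2) and attaches it to a reuseport
socket.  The program always returns index 0, i.e. the first socket
bound into the group; the option value is defined locally in case the
libc headers do not yet provide it:

#include <errno.h>
#include <error.h>
#include <linux/bpf.h>
#include <linux/unistd.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_ATTACH_REUSEPORT_EBPF
#define SO_ATTACH_REUSEPORT_EBPF 52
#endif

static unsigned long long ptr_to_u64(const void *ptr)
{
	return (unsigned long long)(unsigned long)ptr;
}

/* Load "return idx" as a BPF_PROG_TYPE_SOCKET_FILTER program and attach
 * the resulting program fd with SO_ATTACH_REUSEPORT_EBPF. */
static void attach_const_ebpf(int fd, int idx)
{
	static char log_buf[65536];
	struct bpf_insn insns[] = {
		/* R0 = idx */
		{ BPF_ALU64 | BPF_MOV | BPF_K, BPF_REG_0, 0, 0, idx },
		/* return R0 */
		{ BPF_JMP | BPF_EXIT, 0, 0, 0, 0 },
	};
	union bpf_attr attr;
	int pfd;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
	attr.insn_cnt = sizeof(insns) / sizeof(insns[0]);
	attr.insns = ptr_to_u64(insns);
	attr.license = ptr_to_u64("GPL");
	attr.log_buf = ptr_to_u64(log_buf);
	attr.log_size = sizeof(log_buf);
	attr.log_level = 1;

	pfd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
	if (pfd < 0)
		error(1, errno, "failed to load eBPF program: %s", log_buf);

	if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
		       &pfd, sizeof(pfd)))
		error(1, errno, "failed to attach reuseport eBPF program");
}

As with SO_ATTACH_BPF, the option argument is the program file
descriptor itself rather than a sock_fprog structure.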



[PATCH net-next] soreuseport: fix merge conflict in tcp bind

2016-02-22 Thread Craig Gallek
From: Craig Gallek 

One of the validation checks for the new array-based TCP SO_REUSEPORT
support was unintentionally dropped in ea8add2b1903.  This patch adds it back.

Lack of this check allows the user to allocate multiple sock_reuseport
structures (leaking all but the first).

Fixes: ea8add2b1903 ("tcp/dccp: better use of ephemeral ports in bind()")
Signed-off-by: Craig Gallek 
---
 net/ipv4/inet_connection_sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 3d28c6d5c3c3..fb0349acbd45 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -202,6 +202,7 @@ tb_found:
 
if (((tb->fastreuse > 0 && reuse) ||
 (tb->fastreuseport > 0 &&
+ !rcu_access_pointer(sk->sk_reuseport_cb) &&
  sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
smallest_size == -1)
goto success;
-- 
2.7.0.rc3.207.g0ac5344



[PATCH net-next v4 5/7] soreuseport: Prep for fast reuseport TCP socket selection

2016-02-10 Thread Craig Gallek
From: Craig Gallek 

Both of the lines in this patch probably should have been included
in the initial implementation of this code for generic socket
support, but weren't technically necessary since only UDP sockets
were supported.

First, the sk_reuseport_cb points to a structure which assumes
each socket in the group has this pointer assigned at the same
time it's added to the array in the structure.  The sk_clone_lock
function breaks this assumption.  Since a child socket shouldn't
implicitly be in a reuseport group, the simple fix is to clear
the field in the clone.

Second, the SO_ATTACH_REUSEPORT_xBPF socket options require that
SO_REUSEPORT also be set first.  For UDP sockets, this is easily
enforced at bind-time since that process both puts the socket in
the appropriate receive hlist and updates the reuseport structures.
Since these operations can happen at two different times for TCP
sockets (bind and listen) it must be explicitly checked to enforce
the use of SO_REUSEPORT with SO_ATTACH_REUSEPORT_xBPF in the
setsockopt call.

Signed-off-by: Craig Gallek 
---
 net/core/filter.c | 2 +-
 net/core/sock.c   | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 94d26201080d..2a6e9562f1ab 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1181,7 +1181,7 @@ static int __reuseport_attach_prog(struct bpf_prog *prog, 
struct sock *sk)
if (bpf_prog_size(prog->len) > sysctl_optmem_max)
return -ENOMEM;
 
-   if (sk_unhashed(sk)) {
+   if (sk_unhashed(sk) && sk->sk_reuseport) {
err = reuseport_alloc(sk);
if (err)
return err;
diff --git a/net/core/sock.c b/net/core/sock.c
index 6c1c8bc93412..46dc8ad7d050 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1531,6 +1531,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
gfp_t priority)
newsk = NULL;
goto out;
}
+   RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
 
newsk->sk_err  = 0;
newsk->sk_priority = 0;
-- 
2.7.0.rc3.207.g0ac5344
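
In userspace terms, the ordering requirement described above looks
roughly like this for a TCP listener (an illustrative sketch only; the
bind address and the sock_fprog are assumed to be prepared elsewhere):

#include <errno.h>
#include <error.h>
#include <linux/filter.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_REUSEPORT_CBPF
#define SO_ATTACH_REUSEPORT_CBPF 51
#endif

/* SO_REUSEPORT must be enabled before the reuseport program can be
 * attached; with this patch the attach is rejected otherwise. */
static int make_reuseport_listener(const struct sockaddr_in *addr,
				   const struct sock_fprog *prog)
{
	int fd, one = 1;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		error(1, errno, "failed to create socket");
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)))
		error(1, errno, "failed to set SO_REUSEPORT");
	if (bind(fd, (const struct sockaddr *)addr, sizeof(*addr)))
		error(1, errno, "failed to bind");
	if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
		       prog, sizeof(*prog)))
		error(1, errno, "failed to attach reuseport program");
	if (listen(fd, 10))
		error(1, errno, "failed to listen");
	return fd;
}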



[PATCH net-next v4 6/7] soreuseport: fast reuseport TCP socket selection

2016-02-10 Thread Craig Gallek
From: Craig Gallek 

This change extends the fast SO_REUSEPORT socket lookup implemented
for UDP to TCP.  Listener sockets with SO_REUSEPORT and the same
receive address are additionally added to an array for faster
random access.  This means that only a single socket from the group
must be found in the listener list before any socket in the group can
be used to receive a packet.  Previously, every socket in the group
needed to be considered before handing off the incoming packet.

This feature also exposes the ability to use a BPF program when
selecting a socket from a reuseport group.

Signed-off-by: Craig Gallek 
---
 include/net/inet_hashtables.h|  5 +++-
 net/ipv4/inet_connection_sock.c  | 14 ++---
 net/ipv4/inet_hashtables.c   | 64 +---
 net/ipv4/udp.c   |  4 +--
 net/ipv6/inet6_connection_sock.c |  2 ++
 net/ipv6/inet6_hashtables.c  | 16 +-
 6 files changed, 93 insertions(+), 12 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 82403390af58..50f635c2c536 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -207,7 +207,10 @@ void inet_hashinfo_init(struct inet_hashinfo *h);
 
 bool inet_ehash_insert(struct sock *sk, struct sock *osk);
 bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
-void __inet_hash(struct sock *sk, struct sock *osk);
+int __inet_hash(struct sock *sk, struct sock *osk,
+   int (*saddr_same)(const struct sock *sk1,
+ const struct sock *sk2,
+ bool match_wildcard));
 int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 12c8d389dc18..c16a2e6273d9 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef INET_CSK_DEBUG
 const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
@@ -67,7 +68,8 @@ int inet_csk_bind_conflict(const struct sock *sk,
if ((!reuse || !sk2->sk_reuse ||
sk2->sk_state == TCP_LISTEN) &&
(!reuseport || !sk2->sk_reuseport ||
-   (sk2->sk_state != TCP_TIME_WAIT &&
+rcu_access_pointer(sk->sk_reuseport_cb) ||
+(sk2->sk_state != TCP_TIME_WAIT &&
 !uid_eq(uid, sock_i_uid(sk2) {
 
if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr ||
@@ -132,6 +134,7 @@ again:
  sk->sk_state != TCP_LISTEN) ||
 (tb->fastreuseport > 0 &&
  sk->sk_reuseport &&
+ 
!rcu_access_pointer(sk->sk_reuseport_cb) &&
  uid_eq(tb->fastuid, uid))) &&
(tb->num_owners < smallest_size || 
smallest_size == -1)) {
smallest_size = tb->num_owners;
@@ -193,15 +196,18 @@ tb_found:
if (((tb->fastreuse > 0 &&
  sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
 (tb->fastreuseport > 0 &&
- sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
-   smallest_size == -1) {
+ sk->sk_reuseport &&
+ !rcu_access_pointer(sk->sk_reuseport_cb) &&
+ uid_eq(tb->fastuid, uid))) && smallest_size == -1) {
goto success;
} else {
ret = 1;
if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, 
true)) {
if (((sk->sk_reuse && sk->sk_state != 
TCP_LISTEN) ||
 (tb->fastreuseport > 0 &&
- sk->sk_reuseport && uid_eq(tb->fastuid, 
uid))) &&
+ sk->sk_reuseport &&
+ !rcu_access_pointer(sk->sk_reuseport_cb) 
&&
+ uid_eq(tb->fastuid, uid))) &&
smallest_size != -1 && --attempts >= 0) {
spin_unlock(&head->lock);
goto again;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 5e4290b83255..c0f994

[PATCH net-next v4 7/7] soreuseport: BPF selection functional test for TCP

2016-02-10 Thread Craig Gallek
From: Craig Gallek 

Unfortunately the existing test relied on packet payload in order to
map incoming packets to sockets.  In order to get this to work with TCP,
TCP_FASTOPEN needed to be used.

Since the fast open path is slightly different from the standard TCP path,
I created a second test which sends to reuseport group members based
on the receiving cpu core id.  This will probably serve as a better
real-world example as well.

Signed-off-by: Craig Gallek 
---
 tools/testing/selftests/net/.gitignore  |   1 +
 tools/testing/selftests/net/Makefile|   2 +-
 tools/testing/selftests/net/reuseport_bpf.c | 117 ++-
 tools/testing/selftests/net/reuseport_bpf_cpu.c | 258 
 4 files changed, 370 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/net/reuseport_bpf_cpu.c

diff --git a/tools/testing/selftests/net/.gitignore 
b/tools/testing/selftests/net/.gitignore
index 6fb23366b258..69bb3fc38fb2 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -2,3 +2,4 @@ socket
 psock_fanout
 psock_tpacket
 reuseport_bpf
+reuseport_bpf_cpu
diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 41449b5ad0a9..c658792d47b4 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -4,7 +4,7 @@ CFLAGS = -Wall -O2 -g
 
 CFLAGS += -I../../../../usr/include/
 
-NET_PROGS = socket psock_fanout psock_tpacket reuseport_bpf
+NET_PROGS = socket psock_fanout psock_tpacket reuseport_bpf reuseport_bpf_cpu
 
 all: $(NET_PROGS)
 %: %.c
diff --git a/tools/testing/selftests/net/reuseport_bpf.c 
b/tools/testing/selftests/net/reuseport_bpf.c
index bec1b5dd2530..96ba386b1b7b 100644
--- a/tools/testing/selftests/net/reuseport_bpf.c
+++ b/tools/testing/selftests/net/reuseport_bpf.c
@@ -9,10 +9,12 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -169,9 +171,15 @@ static void build_recv_group(const struct test_params p, 
int fd[], uint16_t mod,
if (bind(fd[i], addr, sockaddr_size()))
error(1, errno, "failed to bind recv socket %d", i);
 
-   if (p.protocol == SOCK_STREAM)
+   if (p.protocol == SOCK_STREAM) {
+   opt = 4;
+   if (setsockopt(fd[i], SOL_TCP, TCP_FASTOPEN, &opt,
+  sizeof(opt)))
+   error(1, errno,
+ "failed to set TCP_FASTOPEN on %d", i);
if (listen(fd[i], p.recv_socks * 10))
error(1, errno, "failed to listen on socket");
+   }
}
free(addr);
 }
@@ -189,10 +197,8 @@ static void send_from(struct test_params p, uint16_t 
sport, char *buf,
 
if (bind(fd, saddr, sockaddr_size()))
error(1, errno, "failed to bind send socket");
-   if (connect(fd, daddr, sockaddr_size()))
-   error(1, errno, "failed to connect");
 
-   if (send(fd, buf, len, 0) < 0)
+   if (sendto(fd, buf, len, MSG_FASTOPEN, daddr, sockaddr_size()) < 0)
error(1, errno, "failed to send message");
 
close(fd);
@@ -260,7 +266,7 @@ static void test_recv_order(const struct test_params p, int 
fd[], int mod)
}
 }
 
-static void test_reuseport_ebpf(const struct test_params p)
+static void test_reuseport_ebpf(struct test_params p)
 {
int i, fd[p.recv_socks];
 
@@ -268,6 +274,7 @@ static void test_reuseport_ebpf(const struct test_params p)
build_recv_group(p, fd, p.recv_socks, attach_ebpf);
test_recv_order(p, fd, p.recv_socks);
 
+   p.send_port_min += p.recv_socks * 2;
fprintf(stderr, "Reprograming, testing mod %zd...\n", p.recv_socks / 2);
attach_ebpf(fd[0], p.recv_socks / 2);
test_recv_order(p, fd, p.recv_socks / 2);
@@ -276,7 +283,7 @@ static void test_reuseport_ebpf(const struct test_params p)
close(fd[i]);
 }
 
-static void test_reuseport_cbpf(const struct test_params p)
+static void test_reuseport_cbpf(struct test_params p)
 {
int i, fd[p.recv_socks];
 
@@ -284,6 +291,7 @@ static void test_reuseport_cbpf(const struct test_params p)
build_recv_group(p, fd, p.recv_socks, attach_cbpf);
test_recv_order(p, fd, p.recv_socks);
 
+   p.send_port_min += p.recv_socks * 2;
fprintf(stderr, "Reprograming, testing mod %zd...\n", p.recv_socks / 2);
attach_cbpf(fd[0], p.recv_socks / 2);
test_recv_order(p, fd, p.recv_socks / 2);
@@ -377,7 +385,7 @@ static void test_filter_no_reuseport(const struct 
test_params p)
 
 static void test_filter_without_bind(void)
 {
-   int fd1, fd2;
+   int fd1, fd2, opt = 1;
 
fprintf(stderr, "
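
As a rough sketch (an assumption based on the commit message, not the
actual selftest source), the cpu-based selection program amounts to a
single helper call; it can be loaded with BPF_PROG_LOAD and attached
with SO_ATTACH_REUSEPORT_EBPF in the same way as the constant-index
eBPF sketch shown earlier:

/* Select the reuseport socket whose index matches the cpu that is
 * handling the incoming packet (requires <linux/bpf.h>). */
struct bpf_insn insns[] = {
	/* R0 = bpf_get_smp_processor_id() */
	{ BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_smp_processor_id },
	/* return R0 */
	{ BPF_JMP | BPF_EXIT, 0, 0, 0, 0 },
};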

[PATCH net-next v4 0/7] Faster SO_REUSEPORT for TCP

2016-02-10 Thread Craig Gallek
From: Craig Gallek 

This patch series complements an earlier series (6a5ef90c58da),
which added faster SO_REUSEPORT lookup for UDP sockets, by
extending the feature to TCP sockets.  It uses the same
array-based data structure which allows for socket selection
after finding the first listening socket that matches an incoming
packet.  Prior to this feature, every socket in the reuseport
group needed to be found and examined before a selection could be
made.

With this series the SO_ATTACH_REUSEPORT_CBPF and
SO_ATTACH_REUSEPORT_EBPF socket options now work for TCP sockets
as well.  The test at the end of the series includes an example of
how to use these options to select a reuseport socket based on the
cpu core id handling the incoming packet.

There are several refactoring patches that precede the feature
implementation.  Only the last two patches in this series
should result in any behavioral changes.

v4
- Fix a build issue when compiling IPv6 as a module.  This required
  moving ipv6_rcv_saddr_equal into an object that is always built in.
  I included this change in the second patch, which adds inet6_hash,
  since that is where ipv6_rcv_saddr_equal will later be called from
  non-module code.

v3:
- Another warning in the first patch caught by a build bot.  Return 0 in
  the no-op UDP hash function.

v2:
- In the first patch I missed a couple of hash functions that should now be
  returning int instead of void.  I missed these the first time through as it
  only generated a warning and not an error :\

Craig Gallek (7):
  sock: struct proto hash function may error
  inet: create IPv6-equivalent inet_hash function
  tcp: __tcp_hdrlen() helper
  inet: refactor inet[6]_lookup functions to take skb
  soreuseport: Prep for fast reuseport TCP socket selection
  soreuseport: fast reuseport TCP socket selection
  soreuseport: BPF selection functional test for TCP

 include/linux/tcp.h |   7 +-
 include/net/addrconf.h  |   2 +
 include/net/inet6_hashtables.h  |  13 +-
 include/net/inet_hashtables.h   |  25 ++-
 include/net/phonet/phonet.h |   2 +-
 include/net/ping.h  |   2 +-
 include/net/raw.h   |   2 +-
 include/net/sock.h  |   6 +-
 include/net/udp.h   |   3 +-
 net/core/filter.c   |   2 +-
 net/core/sock.c |   1 +
 net/dccp/ipv4.c |   2 +-
 net/dccp/ipv6.c |   4 +-
 net/ieee802154/socket.c |  17 +-
 net/ipv4/af_inet.c  |   9 +-
 net/ipv4/inet_connection_sock.c |  22 +-
 net/ipv4/inet_diag.c|   6 +-
 net/ipv4/inet_hashtables.c  |  67 +-
 net/ipv4/ping.c |   4 +-
 net/ipv4/raw.c  |   4 +-
 net/ipv4/tcp_ipv4.c |  10 +-
 net/ipv4/udp.c  |   4 +-
 net/ipv6/af_inet6.c |   6 +-
 net/ipv6/inet6_connection_sock.c|   2 +
 net/ipv6/inet6_hashtables.c |  78 ++-
 net/ipv6/tcp_ipv6.c |  10 +-
 net/ipv6/udp.c  |  44 +---
 net/l2tp/l2tp_ip6.c |   3 +-
 net/netfilter/xt_TPROXY.c   |  31 ++-
 net/netfilter/xt_socket.c   |  28 ++-
 net/phonet/socket.c |   6 +-
 net/sctp/socket.c   |   3 +-
 tools/testing/selftests/net/.gitignore  |   1 +
 tools/testing/selftests/net/Makefile|   2 +-
 tools/testing/selftests/net/reuseport_bpf.c | 117 ++-
 tools/testing/selftests/net/reuseport_bpf_cpu.c | 258 
 36 files changed, 670 insertions(+), 133 deletions(-)
 create mode 100644 tools/testing/selftests/net/reuseport_bpf_cpu.c

-- 
2.7.0.rc3.207.g0ac5344



[PATCH net-next v4 1/7] sock: struct proto hash function may error

2016-02-10 Thread Craig Gallek
From: Craig Gallek 

In order to support fast reuseport lookups in TCP, the hash function
defined in struct proto must be capable of returning an error code.
This patch changes the function signature of all related hash functions
to return an integer and handles or propagates this return value at
all call sites.

Signed-off-by: Craig Gallek 
---
 include/net/inet_hashtables.h   |  2 +-
 include/net/phonet/phonet.h |  2 +-
 include/net/ping.h  |  2 +-
 include/net/raw.h   |  2 +-
 include/net/sock.h  |  6 +++---
 include/net/udp.h   |  3 ++-
 net/ieee802154/socket.c | 17 +
 net/ipv4/af_inet.c  |  9 ++---
 net/ipv4/inet_connection_sock.c |  8 +---
 net/ipv4/inet_hashtables.c  |  4 +++-
 net/ipv4/ping.c |  4 +++-
 net/ipv4/raw.c  |  4 +++-
 net/ipv6/af_inet6.c |  6 +-
 net/phonet/socket.c |  6 --
 net/sctp/socket.c   |  3 ++-
 15 files changed, 53 insertions(+), 25 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index de2e3ade6102..554440e7f83d 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -208,7 +208,7 @@ void inet_hashinfo_init(struct inet_hashinfo *h);
 bool inet_ehash_insert(struct sock *sk, struct sock *osk);
 bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
 void __inet_hash(struct sock *sk, struct sock *osk);
-void inet_hash(struct sock *sk);
+int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
 
 struct sock *__inet_lookup_listener(struct net *net,
diff --git a/include/net/phonet/phonet.h b/include/net/phonet/phonet.h
index 68e509750caa..039cc29cb4a8 100644
--- a/include/net/phonet/phonet.h
+++ b/include/net/phonet/phonet.h
@@ -51,7 +51,7 @@ void pn_sock_init(void);
 struct sock *pn_find_sock_by_sa(struct net *net, const struct sockaddr_pn *sa);
 void pn_deliver_sock_broadcast(struct net *net, struct sk_buff *skb);
 void phonet_get_local_port_range(int *min, int *max);
-void pn_sock_hash(struct sock *sk);
+int pn_sock_hash(struct sock *sk);
 void pn_sock_unhash(struct sock *sk);
 int pn_sock_get_port(struct sock *sk, unsigned short sport);
 
diff --git a/include/net/ping.h b/include/net/ping.h
index ac80cb45e630..5fd7cc244833 100644
--- a/include/net/ping.h
+++ b/include/net/ping.h
@@ -65,7 +65,7 @@ struct pingfakehdr {
 };
 
 int  ping_get_port(struct sock *sk, unsigned short ident);
-void ping_hash(struct sock *sk);
+int ping_hash(struct sock *sk);
 void ping_unhash(struct sock *sk);
 
 int  ping_init_sock(struct sock *sk);
diff --git a/include/net/raw.h b/include/net/raw.h
index 6a40c6562dd2..3e789008394d 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -57,7 +57,7 @@ int raw_seq_open(struct inode *ino, struct file *file,
 
 #endif
 
-void raw_hash_sk(struct sock *sk);
+int raw_hash_sk(struct sock *sk);
 void raw_unhash_sk(struct sock *sk);
 
 struct raw_sock {
diff --git a/include/net/sock.h b/include/net/sock.h
index f5ea148853e2..255d3e03727b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -984,7 +984,7 @@ struct proto {
void(*release_cb)(struct sock *sk);
 
/* Keeping track of sk's, looking them up, and port selection methods. 
*/
-   void(*hash)(struct sock *sk);
+   int (*hash)(struct sock *sk);
void(*unhash)(struct sock *sk);
void(*rehash)(struct sock *sk);
int (*get_port)(struct sock *sk, unsigned short 
snum);
@@ -1194,10 +1194,10 @@ static inline void sock_prot_inuse_add(struct net *net, 
struct proto *prot,
 /* With per-bucket locks this operation is not-atomic, so that
  * this version is not worse.
  */
-static inline void __sk_prot_rehash(struct sock *sk)
+static inline int __sk_prot_rehash(struct sock *sk)
 {
sk->sk_prot->unhash(sk);
-   sk->sk_prot->hash(sk);
+   return sk->sk_prot->hash(sk);
 }
 
 void sk_prot_clear_portaddr_nulls(struct sock *sk, int size);
diff --git a/include/net/udp.h b/include/net/udp.h
index 2842541e28e7..92927f729ac8 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -177,9 +177,10 @@ static inline struct udphdr *udp_gro_udphdr(struct sk_buff 
*skb)
 }
 
 /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
-static inline void udp_lib_hash(struct sock *sk)
+static inline int udp_lib_hash(struct sock *sk)
 {
BUG();
+   return 0;
 }
 
 void udp_lib_unhash(struct sock *sk);
diff --git a/net/ieee802154/socket.c b/net/ieee802154/socket.c
index a548be247e15..e0bd013a1e5e 100644
--- a/net/ieee802154/socket.c
+++ b/net/ieee802154/socket.c
@@ -182,12 +182,14 @@ static int ieee802154_sock_ioctl(struct socket *sock, 
unsigned int cmd,
 static HLIST_HEAD(raw_head);
 static DEFINE_RWLOCK(raw_lock);
 
-static void raw_hash(struct sock *sk)
+static int r

[PATCH net-next v4 4/7] inet: refactor inet[6]_lookup functions to take skb

2016-02-10 Thread Craig Gallek
From: Craig Gallek 

This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups.  Doing so with a BPF filter will require access to the
skb in question.  This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.

Signed-off-by: Craig Gallek 
---
 include/net/addrconf.h |  2 ++
 include/net/inet6_hashtables.h | 11 +++
 include/net/inet_hashtables.h  | 18 --
 net/dccp/ipv4.c|  2 +-
 net/dccp/ipv6.c|  2 +-
 net/ipv4/inet_diag.c   |  6 +++---
 net/ipv4/inet_hashtables.c |  1 +
 net/ipv4/tcp_ipv4.c| 10 ++
 net/ipv6/inet6_hashtables.c|  8 ++--
 net/ipv6/tcp_ipv6.c|  8 +---
 net/netfilter/xt_TPROXY.c  | 31 ---
 net/netfilter/xt_socket.c  | 28 +---
 12 files changed, 85 insertions(+), 42 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 47f52d3cd8df..730d856683e5 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -87,6 +87,8 @@ int __ipv6_get_lladdr(struct inet6_dev *idev, struct in6_addr 
*addr,
  u32 banned_flags);
 int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr,
u32 banned_flags);
+int ipv4_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
+bool match_wildcard);
 int ipv6_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
 bool match_wildcard);
 void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr);
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index b3c28a9dfbf1..28332bdac333 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -53,6 +53,7 @@ struct sock *__inet6_lookup_established(struct net *net,
 
 struct sock *inet6_lookup_listener(struct net *net,
   struct inet_hashinfo *hashinfo,
+  struct sk_buff *skb, int doff,
   const struct in6_addr *saddr,
   const __be16 sport,
   const struct in6_addr *daddr,
@@ -60,6 +61,7 @@ struct sock *inet6_lookup_listener(struct net *net,
 
 static inline struct sock *__inet6_lookup(struct net *net,
  struct inet_hashinfo *hashinfo,
+ struct sk_buff *skb, int doff,
  const struct in6_addr *saddr,
  const __be16 sport,
  const struct in6_addr *daddr,
@@ -71,12 +73,12 @@ static inline struct sock *__inet6_lookup(struct net *net,
if (sk)
return sk;
 
-   return inet6_lookup_listener(net, hashinfo, saddr, sport,
+   return inet6_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
 daddr, hnum, dif);
 }
 
 static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
- struct sk_buff *skb,
+ struct sk_buff *skb, int doff,
  const __be16 sport,
  const __be16 dport,
  int iif)
@@ -86,13 +88,14 @@ static inline struct sock *__inet6_lookup_skb(struct 
inet_hashinfo *hashinfo,
if (sk)
return sk;
 
-   return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo,
- &ipv6_hdr(skb)->saddr, sport,
+   return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
+ doff, &ipv6_hdr(skb)->saddr, sport,
  &ipv6_hdr(skb)->daddr, ntohs(dport),
  iif);
 }
 
 struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
+ struct sk_buff *skb, int doff,
  const struct in6_addr *saddr, const __be16 sport,
  const struct in6_addr *daddr, const __be16 dport,
  const int dif);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 554440e7f83d..82403390af58 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -213,6 +213,7 @@ void inet_unhash(struct sock *sk);
 
 struct sock *__inet_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
+   struct sk_buff *skb, int doff,
const __be32 saddr, const __be16 sport,
 

[PATCH net-next v4 3/7] tcp: __tcp_hdrlen() helper

2016-02-10 Thread Craig Gallek
From: Craig Gallek 

tcp_hdrlen is wasteful if you already have a pointer to struct tcphdr.
This splits the size calculation into a helper function that can be
used if a struct tcphdr is already available.

Signed-off-by: Craig Gallek 
---
 include/linux/tcp.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d909feeeaea2..bcbf51da4e1e 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -29,9 +29,14 @@ static inline struct tcphdr *tcp_hdr(const struct sk_buff 
*skb)
return (struct tcphdr *)skb_transport_header(skb);
 }
 
+static inline unsigned int __tcp_hdrlen(const struct tcphdr *th)
+{
+   return th->doff * 4;
+}
+
 static inline unsigned int tcp_hdrlen(const struct sk_buff *skb)
 {
-   return tcp_hdr(skb)->doff * 4;
+   return __tcp_hdrlen(tcp_hdr(skb));
 }
 
 static inline struct tcphdr *inner_tcp_hdr(const struct sk_buff *skb)
-- 
2.7.0.rc3.207.g0ac5344



[PATCH net-next v4 2/7] inet: create IPv6-equivalent inet_hash function

2016-02-10 Thread Craig Gallek
From: Craig Gallek 

In order to support fast lookups for TCP sockets with SO_REUSEPORT,
the function that adds sockets to the listening hash set needs
to be able to check receive address equality.  Since this equality
check is different for IPv4 and IPv6, we will need two different
socket hashing functions.

This patch adds inet6_hash identical to the existing inet_hash function
and updates the appropriate references.  A following patch will
differentiate the two by passing different comparison functions to
__inet_hash.

Additionally, in order to use the IPv6 address equality function from
inet6_hashtables (which is compiled as a built-in object when IPv6 is
enabled) it also needs to be in a built-in object file as well.  This
moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this.

Signed-off-by: Craig Gallek 
---
 include/net/inet6_hashtables.h |  2 ++
 net/dccp/ipv6.c|  2 +-
 net/ipv6/inet6_hashtables.c| 56 ++
 net/ipv6/tcp_ipv6.c|  2 +-
 net/ipv6/udp.c | 44 +
 net/l2tp/l2tp_ip6.c|  3 ++-
 6 files changed, 63 insertions(+), 46 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 7ff588ca6817..b3c28a9dfbf1 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -96,6 +96,8 @@ struct sock *inet6_lookup(struct net *net, struct 
inet_hashinfo *hashinfo,
  const struct in6_addr *saddr, const __be16 sport,
  const struct in6_addr *daddr, const __be16 dport,
  const int dif);
+
+int inet6_hash(struct sock *sk);
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
 #define INET6_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif) \
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 9c6d0508e63a..90a8269b28d0 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -993,7 +993,7 @@ static struct proto dccp_v6_prot = {
.sendmsg   = dccp_sendmsg,
.recvmsg   = dccp_recvmsg,
.backlog_rcv   = dccp_v6_do_rcv,
-   .hash  = inet_hash,
+   .hash  = inet6_hash,
.unhash= inet_unhash,
.accept= inet_csk_accept,
.get_port  = inet_csk_get_port,
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 21ace5a2bf7c..072653dd9c98 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -274,3 +274,59 @@ int inet6_hash_connect(struct inet_timewait_death_row 
*death_row,
                                   __inet6_check_established);
 }
 EXPORT_SYMBOL_GPL(inet6_hash_connect);
+
+int inet6_hash(struct sock *sk)
+{
+   if (sk->sk_state != TCP_CLOSE) {
+   local_bh_disable();
+   __inet_hash(sk, NULL);
+   local_bh_enable();
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(inet6_hash);
+
+/* match_wildcard == true:  IPV6_ADDR_ANY equals to any IPv6 addresses if IPv6
+ *  only, and any IPv4 addresses if not IPv6 only
+ * match_wildcard == false: addresses must be exactly the same, i.e.
+ *  IPV6_ADDR_ANY only equals to IPV6_ADDR_ANY,
+ *  and 0.0.0.0 equals to 0.0.0.0 only
+ */
+int ipv6_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
+bool match_wildcard)
+{
+   const struct in6_addr *sk2_rcv_saddr6 = inet6_rcv_saddr(sk2);
+   int sk2_ipv6only = inet_v6_ipv6only(sk2);
+   int addr_type = ipv6_addr_type(&sk->sk_v6_rcv_saddr);
+   int addr_type2 = sk2_rcv_saddr6 ? ipv6_addr_type(sk2_rcv_saddr6) : 
IPV6_ADDR_MAPPED;
+
+   /* if both are mapped, treat as IPv4 */
+   if (addr_type == IPV6_ADDR_MAPPED && addr_type2 == IPV6_ADDR_MAPPED) {
+   if (!sk2_ipv6only) {
+   if (sk->sk_rcv_saddr == sk2->sk_rcv_saddr)
+   return 1;
+   if (!sk->sk_rcv_saddr || !sk2->sk_rcv_saddr)
+   return match_wildcard;
+   }
+   return 0;
+   }
+
+   if (addr_type == IPV6_ADDR_ANY && addr_type2 == IPV6_ADDR_ANY)
+   return 1;
+
+   if (addr_type2 == IPV6_ADDR_ANY && match_wildcard &&
+   !(sk2_ipv6only && addr_type == IPV6_ADDR_MAPPED))
+   return 1;
+
+   if (addr_type == IPV6_ADDR_ANY && match_wildcard &&
+   !(ipv6_only_sock(sk) && addr_type2 == IPV6_ADDR_MAPPED))
+   return 1;
+
+   if (sk2_rcv_saddr6 &&
+   ipv6_addr_equal(&sk->sk_v6_rcv_saddr, sk2_rcv_saddr6))
+   return 1;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(ipv6_rcv_saddr_equal);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 006396e31cb0..d72bc
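
For illustration only (a hypothetical userspace sketch, not from this
series): the kind of socket pairing whose bind-time comparison ends up in
the wildcard branches of ipv6_rcv_saddr_equal() -- an AF_INET6 wildcard
listener with IPV6_V6ONLY cleared alongside an AF_INET wildcard listener
on the same port, both in a reuseport group.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

/* Open a wildcard listener of the given family with SO_REUSEPORT set.
 * Calling this once with AF_INET6 and once with AF_INET on the same port
 * exercises the IPV6_ADDR_ANY vs. IPV6_ADDR_MAPPED cases above.
 */
static int wildcard_listener(int family, uint16_t port)
{
        int fd = socket(family, SOCK_STREAM, 0);
        int one = 1, v6only = 0;

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        if (family == AF_INET6) {
                struct sockaddr_in6 a6 = { .sin6_family = AF_INET6,
                                           .sin6_port = htons(port),
                                           .sin6_addr = in6addr_any };

                setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY,
                           &v6only, sizeof(v6only));
                bind(fd, (struct sockaddr *)&a6, sizeof(a6));
        } else {
                struct sockaddr_in a4 = { .sin_family = AF_INET,
                                          .sin_port = htons(port),
                                          .sin_addr.s_addr = htonl(INADDR_ANY) };

                bind(fd, (struct sockaddr *)&a4, sizeof(a4));
        }
        listen(fd, 128);
        return fd;
}
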

[PATCH net-next v3 6/7] soreuseport: fast reuseport TCP socket selection

2016-02-09 Thread Craig Gallek
From: Craig Gallek 

This change extends the fast SO_REUSEPORT socket lookup implemented
for UDP to TCP.  Listener sockets with SO_REUSEPORT and the same
receive address are additionally added to an array for faster
random access.  This means that only a single socket from the group
must be found in the listener list before any socket in the group can
be used to receive a packet.  Previously, every socket in the group
needed to be considered before handing off the incoming packet.

This feature also exposes the ability to use a BPF program when
selecting a socket from a reuseport group.

Signed-off-by: Craig Gallek 
---
 include/net/inet_hashtables.h|  5 +++-
 net/ipv4/inet_connection_sock.c  | 14 ++---
 net/ipv4/inet_hashtables.c   | 64 +---
 net/ipv4/udp.c   |  4 +--
 net/ipv6/inet6_connection_sock.c |  2 ++
 net/ipv6/inet6_hashtables.c  | 16 +-
 6 files changed, 93 insertions(+), 12 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 82403390af58..50f635c2c536 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -207,7 +207,10 @@ void inet_hashinfo_init(struct inet_hashinfo *h);
 
 bool inet_ehash_insert(struct sock *sk, struct sock *osk);
 bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
-void __inet_hash(struct sock *sk, struct sock *osk);
+int __inet_hash(struct sock *sk, struct sock *osk,
+   int (*saddr_same)(const struct sock *sk1,
+ const struct sock *sk2,
+ bool match_wildcard));
 int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 4c457c492b1f..0daac5d7c3b9 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef INET_CSK_DEBUG
 const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
@@ -67,7 +68,8 @@ int inet_csk_bind_conflict(const struct sock *sk,
if ((!reuse || !sk2->sk_reuse ||
sk2->sk_state == TCP_LISTEN) &&
(!reuseport || !sk2->sk_reuseport ||
-   (sk2->sk_state != TCP_TIME_WAIT &&
+rcu_access_pointer(sk->sk_reuseport_cb) ||
+(sk2->sk_state != TCP_TIME_WAIT &&
     !uid_eq(uid, sock_i_uid(sk2))))) {
 
if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr ||
@@ -132,6 +134,7 @@ again:
  sk->sk_state != TCP_LISTEN) ||
 (tb->fastreuseport > 0 &&
  sk->sk_reuseport &&
+ 
!rcu_access_pointer(sk->sk_reuseport_cb) &&
  uid_eq(tb->fastuid, uid))) &&
(tb->num_owners < smallest_size || 
smallest_size == -1)) {
smallest_size = tb->num_owners;
@@ -193,15 +196,18 @@ tb_found:
if (((tb->fastreuse > 0 &&
  sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
 (tb->fastreuseport > 0 &&
- sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
-   smallest_size == -1) {
+ sk->sk_reuseport &&
+ !rcu_access_pointer(sk->sk_reuseport_cb) &&
+ uid_eq(tb->fastuid, uid))) && smallest_size == -1) {
goto success;
} else {
ret = 1;
if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, 
true)) {
if (((sk->sk_reuse && sk->sk_state != 
TCP_LISTEN) ||
 (tb->fastreuseport > 0 &&
- sk->sk_reuseport && uid_eq(tb->fastuid, 
uid))) &&
+ sk->sk_reuseport &&
+ !rcu_access_pointer(sk->sk_reuseport_cb) 
&&
+ uid_eq(tb->fastuid, uid))) &&
smallest_size != -1 && --attempts >= 0) {
spin_unlock(&head->lock);
goto again;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 5e4290b83255..c0f994
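
For context, a rough userspace sketch (not part of the patch; it assumes
SO_REUSEPORT and the SO_ATTACH_REUSEPORT_CBPF option from earlier patches
in this series are visible in the installed headers) of how the fast TCP
path would be exercised: several listeners share one port, and a trivial
cBPF program steers every new connection to the first socket that joined
the group.

#include <arpa/inet.h>
#include <linux/filter.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

/* Open one listener of a reuseport group; the caller decides which
 * instance attaches the (group-wide) steering program.
 */
static int reuseport_listener(uint16_t port, int attach_prog)
{
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(port),
                                    .sin_addr.s_addr = htonl(INADDR_ANY) };
        int one = 1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 128);

        if (attach_prog) {
                /* "ret #0": always pick index 0 in the reuseport group. */
                struct sock_filter code[] = {
                        BPF_STMT(BPF_RET | BPF_K, 0),
                };
                struct sock_fprog prog = {
                        .len = sizeof(code) / sizeof(code[0]),
                        .filter = code,
                };

                /* Attached after listen(), once this socket has joined
                 * the reuseport group.
                 */
                setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                           &prog, sizeof(prog));
        }
        return fd;
}
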

[PATCH net-next v3 3/7] tcp: __tcp_hdrlen() helper

2016-02-09 Thread Craig Gallek
From: Craig Gallek 

tcp_hdrlen is wasteful if you already have a pointer to struct tcphdr.
This splits the size calculation into a helper function that can be
used if a struct tcphdr is already available.

Signed-off-by: Craig Gallek 
---
 include/linux/tcp.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b386361ba3e8..c216707d63bf 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -29,9 +29,14 @@ static inline struct tcphdr *tcp_hdr(const struct sk_buff 
*skb)
return (struct tcphdr *)skb_transport_header(skb);
 }
 
+static inline unsigned int __tcp_hdrlen(const struct tcphdr *th)
+{
+   return th->doff * 4;
+}
+
 static inline unsigned int tcp_hdrlen(const struct sk_buff *skb)
 {
-   return tcp_hdr(skb)->doff * 4;
+   return __tcp_hdrlen(tcp_hdr(skb));
 }
 
 static inline struct tcphdr *inner_tcp_hdr(const struct sk_buff *skb)
-- 
2.7.0.rc3.207.g0ac5344



[PATCH net-next v3 4/7] inet: refactor inet[6]_lookup functions to take skb

2016-02-09 Thread Craig Gallek
From: Craig Gallek 

This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups.  Doing so with a BPF filter will require access to the
skb in question.  This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.

Signed-off-by: Craig Gallek 
---
 include/net/addrconf.h |  2 ++
 include/net/inet6_hashtables.h | 11 +++
 include/net/inet_hashtables.h  | 18 --
 net/dccp/ipv4.c|  2 +-
 net/dccp/ipv6.c|  2 +-
 net/ipv4/inet_diag.c   |  6 +++---
 net/ipv4/inet_hashtables.c |  1 +
 net/ipv4/tcp_ipv4.c| 10 ++
 net/ipv6/inet6_hashtables.c|  8 ++--
 net/ipv6/tcp_ipv6.c|  8 +---
 net/netfilter/xt_TPROXY.c  | 31 ---
 net/netfilter/xt_socket.c  | 28 +---
 12 files changed, 85 insertions(+), 42 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 47f52d3cd8df..730d856683e5 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -87,6 +87,8 @@ int __ipv6_get_lladdr(struct inet6_dev *idev, struct in6_addr 
*addr,
  u32 banned_flags);
 int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr,
u32 banned_flags);
+int ipv4_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
+bool match_wildcard);
 int ipv6_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
 bool match_wildcard);
 void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr);
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index b3c28a9dfbf1..28332bdac333 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -53,6 +53,7 @@ struct sock *__inet6_lookup_established(struct net *net,
 
 struct sock *inet6_lookup_listener(struct net *net,
   struct inet_hashinfo *hashinfo,
+  struct sk_buff *skb, int doff,
   const struct in6_addr *saddr,
   const __be16 sport,
   const struct in6_addr *daddr,
@@ -60,6 +61,7 @@ struct sock *inet6_lookup_listener(struct net *net,
 
 static inline struct sock *__inet6_lookup(struct net *net,
  struct inet_hashinfo *hashinfo,
+ struct sk_buff *skb, int doff,
  const struct in6_addr *saddr,
  const __be16 sport,
  const struct in6_addr *daddr,
@@ -71,12 +73,12 @@ static inline struct sock *__inet6_lookup(struct net *net,
if (sk)
return sk;
 
-   return inet6_lookup_listener(net, hashinfo, saddr, sport,
+   return inet6_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
 daddr, hnum, dif);
 }
 
 static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
- struct sk_buff *skb,
+ struct sk_buff *skb, int doff,
  const __be16 sport,
  const __be16 dport,
  int iif)
@@ -86,13 +88,14 @@ static inline struct sock *__inet6_lookup_skb(struct 
inet_hashinfo *hashinfo,
if (sk)
return sk;
 
-   return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo,
- &ipv6_hdr(skb)->saddr, sport,
+   return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
+ doff, &ipv6_hdr(skb)->saddr, sport,
  &ipv6_hdr(skb)->daddr, ntohs(dport),
  iif);
 }
 
 struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
+ struct sk_buff *skb, int doff,
  const struct in6_addr *saddr, const __be16 sport,
  const struct in6_addr *daddr, const __be16 dport,
  const int dif);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 554440e7f83d..82403390af58 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -213,6 +213,7 @@ void inet_unhash(struct sock *sk);
 
 struct sock *__inet_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
+   struct sk_buff *skb, int doff,
const __be32 saddr, const __be16 sport,
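
The hunks above are cut off before the call sites; as a sketch of the
shape of the change (illustrative, not a quoted hunk), the tcp_v4_rcv()
lookup would now pass the skb along with the offset past the TCP header:

        /* __tcp_hdrlen(th) is the TCP header length, i.e. the offset from
         * the transport header to the payload that a later reuseport BPF
         * program will start reading from.
         */
        sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th),
                               th->source, th->dest);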
 

  1   2   >