date:20180124

[PATCH net-next] ptr_ring: fix integer overflow

2018-01-24 Thread Jason Wang

We try to allocate one more entry for lockless peeking. The addition
may overflow, which causes zero to be passed to kmalloc(). In this
case, kmalloc() returns ZERO_SIZE_PTR without the ptr ring noticing.
Trying to produce to or consume from such a ring then leads to a NULL
dereference. Fix this by detecting the overflow and failing early.

Fixes: bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can overrun array bounds")
Reported-by: syzbot+87678bcf753b44c39...@syzkaller.appspotmail.com
Cc: John Fastabend 
Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 9ca1726..3f99484 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -453,6 +453,8 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
 
 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
+   if (unlikely(size + 1 == 0))
+   return NULL;
/* Allocate an extra dummy element at end of ring to avoid consumer head
 * or produce head access past the end of the array. Possible when
 * producer/consumer operations and __ptr_ring_peek operations run in
-- 
2.7.4
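
The wrap-around that the ptr_ring patch above rejects can be shown in
userspace. This is only a sketch under stated assumptions, not the kernel
code: ring_alloc_slots() is a hypothetical helper that mirrors the
"size + 1" computation with the same unsigned int type, and it returns 0
to signal "reject" where the patch returns NULL.

```c
#include <limits.h>

/*
 * Hypothetical helper (not in the kernel): number of slots to allocate
 * for a ring of `size` entries plus one dummy entry.
 * Returns 0 when size + 1 wraps around, i.e. when size == UINT_MAX.
 */
static unsigned int ring_alloc_slots(unsigned int size)
{
	/* Unsigned arithmetic wraps: UINT_MAX + 1 == 0. */
	if (size + 1 == 0)
		return 0;

	return size + 1;
}
```

Without the check, the wrapped value 0 would be used as the element count
and the allocator would be asked for zero bytes.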

[PATCH net] ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only

2018-01-24 Thread Martin KaFai Lau

If sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, the socket
is implicitly an ipv6only socket.  However, in inet6_bind(),
this addr_type checking and setting sk->sk_ipv6only to 1 are only done
after sk->sk_prot->get_port(sk, snum) has been completed successfully.

This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses
get_port().

In particular, when binding SO_REUSEPORT UDP sockets,
udp_reuseport_add_sock(sk,...) is called.  udp_reuseport_add_sock()
checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to
sk2->sk_reuseport_cb.  In this case, ipv6_only_sock(sk2) could be
1 while ipv6_only_sock(sk) is still 0 here.  The end result is,
reuseport_alloc(sk) is called instead of adding sk to the existing
sk2->sk_reuseport_cb.

It can be reproduced by binding two SO_REUSEPORT UDP sockets on an
IPv6 address (!ANY and !MAPPED).  Only one of the sockets will
receive packets.

The fix is to set the implicit sk_ipv6only before calling get_port().
The original sk_ipv6only has to be saved such that it can be restored
in case get_port() failed.  The situation is similar to the
inet_reset_saddr(sk) after get_port() has failed.

Thanks to Calvin Owens, who created an easy
reproduction which led to the fix.

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Signed-off-by: Martin KaFai Lau 
---
 net/ipv6/af_inet6.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index c9441ca45399..416917719a6f 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -284,6 +284,7 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
struct net *net = sock_net(sk);
__be32 v4addr = 0;
unsigned short snum;
+   bool saved_ipv6only;
int addr_type = 0;
int err = 0;
 
@@ -389,19 +390,21 @@ int inet6_bind(struct socket *sock, struct sockaddr 
*uaddr, int addr_len)
if (!(addr_type & IPV6_ADDR_MULTICAST))
np->saddr = addr->sin6_addr;
 
+   saved_ipv6only = sk->sk_ipv6only;
+   if (addr_type != IPV6_ADDR_ANY && addr_type != IPV6_ADDR_MAPPED)
+   sk->sk_ipv6only = 1;
+
/* Make sure we are allowed to bind here. */
if ((snum || !inet->bind_address_no_port) &&
sk->sk_prot->get_port(sk, snum)) {
+   sk->sk_ipv6only = saved_ipv6only;
inet_reset_saddr(sk);
err = -EADDRINUSE;
goto out;
}
 
-   if (addr_type != IPV6_ADDR_ANY) {
+   if (addr_type != IPV6_ADDR_ANY)
sk->sk_userlocks |= SOCK_BINDADDR_LOCK;
-   if (addr_type != IPV6_ADDR_MAPPED)
-   sk->sk_ipv6only = 1;
-   }
if (snum)
sk->sk_userlocks |= SOCK_BINDPORT_LOCK;
inet->inet_sport = htons(inet->inet_num);
-- 
2.9.5
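
The save/set/restore ordering in the inet6_bind() patch above can be
modelled in a few lines of userspace C. Everything here is an assumption
made for illustration: struct sock_model, enum addr_type, bind_model() and
the two callbacks are hypothetical stand-ins, the enum values do not match
the kernel's IPV6_ADDR_* flags, and -1 stands in for -EADDRINUSE.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's address-type classification. */
enum addr_type { ADDR_ANY, ADDR_MAPPED, ADDR_UNICAST };

struct sock_model {
	bool ipv6only;
};

/* Hypothetical get_port(): 0 on success, nonzero on failure. */
typedef int (*get_port_fn)(struct sock_model *sk);

static int bind_model(struct sock_model *sk, enum addr_type type,
		      get_port_fn get_port)
{
	bool saved_ipv6only = sk->ipv6only;

	/* Set the implicit flag first so get_port() sees a consistent socket. */
	if (type != ADDR_ANY && type != ADDR_MAPPED)
		sk->ipv6only = true;

	if (get_port(sk)) {
		/* Roll back on failure, as the patch does. */
		sk->ipv6only = saved_ipv6only;
		return -1;
	}

	return 0;
}

/* Succeeds only if the flag is already visible when get_port() runs. */
static int port_requires_ipv6only(struct sock_model *sk)
{
	return sk->ipv6only ? 0 : 1;
}

/* Always fails, to exercise the rollback path. */
static int port_busy(struct sock_model *sk)
{
	(void)sk;
	return 1;
}
```

With the old ordering (flag set only after get_port() succeeds),
port_requires_ipv6only() would fail for a non-ANY, non-MAPPED address,
which is the inconsistency the commit message describes.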

linux-next: manual merge of the net-next tree with the vfs tree

2018-01-24 Thread Stephen Rothwell

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/tipc/socket.c

between commit:

  ade994f4f6c8 ("net: annotate ->poll() instances")

from the vfs tree and commit:

  60c253069632 ("tipc: fix race between poll() and setsockopt()")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/tipc/socket.c
index 2aa46e8cd8fe,473a096b6fba..
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@@ -715,8 -716,7 +716,7 @@@ static __poll_t tipc_poll(struct file *
  {
struct sock *sk = sock->sk;
struct tipc_sock *tsk = tipc_sk(sk);
-   struct tipc_group *grp = tsk->group;
 -  u32 revents = 0;
 +  __poll_t revents = 0;
  
sock_poll_wait(file, sk_sleep(sk), wait);

Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-24 Thread jianchao.wang

Hi Eric

Thanks for your kind response and suggestion.
That's really appreciated.

Jianchao

On 01/25/2018 11:55 AM, Eric Dumazet wrote:
> On Thu, 2018-01-25 at 11:27 +0800, jianchao.wang wrote:
>> Hi Tariq
>>
>> On 01/22/2018 10:12 AM, jianchao.wang wrote:
>>>>> On 19/01/2018 5:49 PM, Eric Dumazet wrote:
>>>>>> On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:
>>>>>>> Hi Tariq
>>>>>>>
>>>>>>> Very sad that the crash was reproduced again after applied the patch.
>>>>
>>>> Memory barriers vary for different Archs, can you please share more 
>>>> details regarding arch and repro steps?
>>> The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 
>>> 12/27/2015
>>> The xen is installed. The crash occurred in DOM0.
>>> Regarding to the repro steps, it is a customer's test which does heavy disk 
>>> I/O over NFS storage without any guest.
>>>
>>
>> What is the final suggestion on this ?
>> If use wmb there, is the performance pulled down ?
> 
> Since 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=dad42c3038a59d27fced28ee4ec1d4a891b28155
> 
> we batch allocations, so mlx4_en_refill_rx_buffers() is not called that often.
> 
> I doubt the additional wmb() will have serious impact there.
> 
>

[RFC] net: qcom/emac: mdiobus-dev fwnode should point to emac-adev

2018-01-24 Thread Wang Dongsheng

mdiobus always tries to get a GPIO "reset" consumer. With ACPI, the
GPIO should be described in the emac-adev _DSD or _CRS.

ACPI uses the common mdio API to register; however, mdio->dev->fwnode is
not pointing to any adev, so the "reset" consumer can never be found.

OF handles this with of_mdiobus_register(): the mdiobus gets the emac
of_node and goes through the of_node to find a GPIO "reset" consumer.

I am not sure whether ACPI needs the same API for mdio as OF has, because
mdio isn't a real entity in ACPI. So I think no work is needed in ACPI
itself; the MAC driver needs to pass the adev to the mdiobus when the
mdio bus is registered.

Signed-off-by: Wang Dongsheng 
---
 drivers/net/ethernet/qualcomm/emac/emac-phy.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-phy.c 
b/drivers/net/ethernet/qualcomm/emac/emac-phy.c
index 53dbf1e..69171d5 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-phy.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-phy.c
@@ -117,6 +117,10 @@ int emac_phy_config(struct platform_device *pdev, struct 
emac_adapter *adpt)
 
if (has_acpi_companion(&pdev->dev)) {
u32 phy_addr;
+   struct fwnode_handle *fwnode;
+
+   fwnode = acpi_fwnode_handle(ACPI_COMPANION(&pdev->dev));
+   mii_bus->dev.fwnode = fwnode;
 
ret = mdiobus_register(mii_bus);
if (ret) {
-- 
2.7.4

Re: [PATCH v2 net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_{match|target}

2018-01-24 Thread Florian Westphal

Eric Dumazet  wrote:
> From: Eric Dumazet 
> 
> It looks like syzbot found its way into netfilter territory.
> 
> Issue here is that @name comes from user space and might
> not be null terminated.
> 
> Out-of-bound reads happen, KASAN is not happy.
> 
> v2 added similar fix for xt_request_find_target(),
> as Florian advised.
> 
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 

Thanks a lot Eric!

Acked-by: Florian Westphal

Re: [PATCH] cls_flower: check if filter is in HW before calling fl_hw_destroy_filter()

2018-01-24 Thread Sathya Perla

On Wed, Jan 24, 2018 at 9:37 PM, Jiri Pirko  wrote:
> Wed, Jan 24, 2018 at 12:42:55PM CET, sathya.pe...@broadcom.com wrote:
>>When a filter cannot be added in HW (i.e, fl_hw_replace_filter() returns
>>error), the TCA_CLS_FLAGS_IN_HW flag is not set in the filter flags.
>>
>>This flag (via tc_in_hw()) must be checked before issuing the call
>>to delete a filter in HW (fl_hw_destroy_filter()) and before issuing the
>>call to query stats (fl_hw_update_stats()).
>>
>>Signed-off-by: Sathya Perla 
>
> 1) You have to indicate what tree you aim this to be applied on:
>[patch net] or [patch net-next]
> 2) Please provide a "Fixes" line
> 3) Please use scripts/get_maintainer.pl to get the people to cc
> 4) Please aim the fix not only to cls_flower, but to other cls as well.

Ok, will do. thanks!

Re: [PATCH] cls_flower: check if filter is in HW before calling fl_hw_destroy_filter()

2018-01-24 Thread Sathya Perla

On Thu, Jan 25, 2018 at 3:53 AM, Jakub Kicinski  wrote:
>
> On Wed, 24 Jan 2018 17:12:55 +0530, Sathya Perla wrote:
> > When a filter cannot be added in HW (i.e, fl_hw_replace_filter() returns
> > error), the TCA_CLS_FLAGS_IN_HW flag is not set in the filter flags.
> >
> > This flag (via tc_in_hw()) must be checked before issuing the call
> > to delete a filter in HW (fl_hw_destroy_filter()) and before issuing the
> > call to query stats (fl_hw_update_stats()).
> >
> > Signed-off-by: Sathya Perla 
>
> Could you explain why you want to make that change?  Saying "tc_in_hw()
> must be checked" is a bit strong, tc_in_hw() is useless from correctness
> POV.  Your patch may be a good optimization, but with shared blocks in
> the picture now tc_in_hw() == true doesn't mean it's in *your* HW.

I agree that for shared filters when skip_sw is false
tcf_block_cb_call() can return a
positive status even if the filter add on one of the devices fails.
I'll change the commit-log wording to indicate that this new check is
an optimization.
Thanks!

Re: [v2] wcn36xx: release resources in case of error

2018-01-24 Thread Kalle Valo

Ramon Fried  wrote:

> wcn36xx_dxe_init() doesn't check the return value of
> wcn36xx_dxe_init_descs(); release the resources in case an error occurred.
> 
> Signed-off-by: Ramon Fried 
> Signed-off-by: Kalle Valo 

Patch applied to ath-next branch of ath.git, thanks.

d0bb950b9f5f wcn36xx: release DMA memory in case of error

-- 
https://patchwork.kernel.org/patch/10180503/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

[PATCH v2 4/4] net: check the size of a packet in validate_xmit_skb

2018-01-24 Thread Daniel Axtens

There are a number of paths where an oversize skb could be sent to
a driver. The driver should not be required to check for this - the
core layer should do it instead.

Add a check to validate_xmit_skb that checks both GSO and non-GSO
packets and drops them if they are too large.

Signed-off-by: Daniel Axtens 
---
 net/core/dev.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 6c96c26aadbf..f09eece2cd21 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1830,13 +1830,11 @@ static inline void net_timestamp_set(struct sk_buff 
*skb)
__net_timestamp(SKB);   \
}   \
 
-bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff 
*skb)
+static inline bool skb_mac_len_fits_dev(const struct net_device *dev,
+   const struct sk_buff *skb)
 {
unsigned int len;
 
-   if (!(dev->flags & IFF_UP))
-   return false;
-
len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
if (skb->len <= len)
return true;
@@ -1850,6 +1848,14 @@ bool is_skb_forwardable(const struct net_device *dev, 
const struct sk_buff *skb)
 
return false;
 }
+
+bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff 
*skb)
+{
+   if (!(dev->flags & IFF_UP))
+   return false;
+
+   return skb_mac_len_fits_dev(dev, skb);
+}
 EXPORT_SYMBOL_GPL(is_skb_forwardable);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
@@ -3081,6 +3087,9 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff 
*skb, struct net_device
if (unlikely(!skb))
goto out_null;
 
+   if (unlikely(!skb_mac_len_fits_dev(dev, skb)))
+   goto out_kfree_skb;
+
if (netif_needs_gso(skb, features)) {
struct sk_buff *segs;
 
-- 
2.14.1

[PATCH v2 1/4] net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len

2018-01-24 Thread Daniel Axtens

If you take a GSO skb, and split it into packets, will the network
length (L3 headers + L4 headers + payload) of those packets be small
enough to fit within a given MTU?

skb_gso_validate_mtu gives you the answer to that question. However,
we're about to add a way to validate the MAC length of a split GSO
skb (L2+L3+L4+payload), and the names get confusing, so rename
skb_gso_validate_mtu to skb_gso_validate_network_len

Signed-off-by: Daniel Axtens 
---
 include/linux/skbuff.h  | 2 +-
 net/core/skbuff.c   | 9 +
 net/ipv4/ip_forward.c   | 2 +-
 net/ipv4/ip_output.c| 2 +-
 net/ipv4/netfilter/nf_flow_table_ipv4.c | 2 +-
 net/ipv6/ip6_output.c   | 2 +-
 net/ipv6/netfilter/nf_flow_table_ipv6.c | 2 +-
 net/mpls/af_mpls.c  | 2 +-
 net/xfrm/xfrm_device.c  | 2 +-
 9 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b8e0da6c27d6..b137c79bf88d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3286,7 +3286,7 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, 
const u32 len);
 int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
 unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
-bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu);
+bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu);
 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
 struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
 int skb_ensure_writable(struct sk_buff *skb, int write_len);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 01e8285aea73..a93e5c7aa5b2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4914,15 +4914,16 @@ unsigned int skb_gso_transport_seglen(const struct 
sk_buff *skb)
 EXPORT_SYMBOL_GPL(skb_gso_transport_seglen);
 
 /**
- * skb_gso_validate_mtu - Return in case such skb fits a given MTU
+ * skb_gso_validate_network_len - Will a split GSO skb fit into a given MTU?
  *
  * @skb: GSO skb
  * @mtu: MTU to validate against
  *
- * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU
- * once split.
+ * skb_gso_validate_network_len validates if a given skb will fit a
+ * wanted MTU once split. It considers L3 headers, L4 headers, and the
+ * payload.
  */
-bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu)
+bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu)
 {
const struct skb_shared_info *shinfo = skb_shinfo(skb);
const struct sk_buff *iter;
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 2dd21c3281a1..b54b948b0596 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -55,7 +55,7 @@ static bool ip_exceeds_mtu(const struct sk_buff *skb, 
unsigned int mtu)
if (skb->ignore_df)
return false;
 
-   if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu))
+   if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
return false;
 
return true;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e8e675be60ec..66340ab750e6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -248,7 +248,7 @@ static int ip_finish_output_gso(struct net *net, struct 
sock *sk,
 
/* common case: seglen is <= mtu
 */
-   if (skb_gso_validate_mtu(skb, mtu))
+   if (skb_gso_validate_network_len(skb, mtu))
return ip_finish_output2(net, sk, skb);
 
/* Slowpath -  GSO segment length exceeds the egress MTU.
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c 
b/net/ipv4/netfilter/nf_flow_table_ipv4.c
index b2d01eb25f2c..cdf2625dc277 100644
--- a/net/ipv4/netfilter/nf_flow_table_ipv4.c
+++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c
@@ -185,7 +185,7 @@ static bool __nf_flow_exceeds_mtu(const struct sk_buff 
*skb, unsigned int mtu)
if ((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0)
return false;
 
-   if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu))
+   if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
return false;
 
return true;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 18547a44bdaf..4e888328d4dd 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -412,7 +412,7 @@ static bool ip6_pkt_too_big(const struct sk_buff *skb, 
unsigned int mtu)
if (skb->ignore_df)
return false;
 
-   if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu))
+   if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
return false;
 
return true;
diff --git a/net/ipv6/netfilter/nf_flow_table_ipv6.c 
b/net/ipv6/netfilter/nf_flow_table_ipv6.c
index 0c3b9d32f64f..f1ab4e03df7d 100644

[PATCH v2 2/4] net: move skb_gso_mac_seglen to skbuff.h

2018-01-24 Thread Daniel Axtens

We're about to use this elsewhere, so move it into the header with
the other related functions like skb_gso_network_seglen().

Signed-off-by: Daniel Axtens 
---
 include/linux/skbuff.h | 15 +++
 net/sched/sch_tbf.c| 10 --
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b137c79bf88d..4b3ca6a5ec0a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4120,6 +4120,21 @@ static inline unsigned int skb_gso_network_seglen(const 
struct sk_buff *skb)
return hdr_len + skb_gso_transport_seglen(skb);
 }
 
+/**
+ * skb_gso_mac_seglen - Return length of individual segments of a gso packet
+ *
+ * @skb: GSO skb
+ *
+ * skb_gso_mac_seglen is used to determine the real size of the
+ * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4
+ * headers (TCP/UDP).
+ */
+static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
+{
+   unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
+   return hdr_len + skb_gso_transport_seglen(skb);
+}
+
 /* Local Checksum Offload.
  * Compute outer checksum based on the assumption that the
  * inner checksum will be offloaded later.
diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index 83e76d046993..229172d509cc 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -142,16 +142,6 @@ static u64 psched_ns_t2l(const struct psched_ratecfg *r,
return len;
 }
 
-/*
- * Return length of individual segments of a gso packet,
- * including all headers (MAC, IP, TCP/UDP)
- */
-static unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
-{
-   unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
-   return hdr_len + skb_gso_transport_seglen(skb);
-}
-
 /* GSO packet is too big, segment it so that tbf can transmit
  * each segment in time
  */
-- 
2.14.1
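
The arithmetic behind skb_gso_mac_seglen() can be checked with a small
userspace model. This is a sketch under assumptions: struct pkt_model and
both helpers are hypothetical, header positions are plain byte offsets
rather than skb pointers, and the transport segment length is simplified
to "L4 header + gso_size".

```c
#include <stddef.h>

/* Hypothetical, flattened view of the offsets an skb tracks. */
struct pkt_model {
	size_t mac_header;       /* offset of the L2 header */
	size_t transport_header; /* offset of the L4 header */
	size_t l4_hdr_len;       /* e.g. 20 for TCP without options */
	size_t gso_size;         /* payload bytes per segment */
};

/* L4 header plus payload of one segment. */
static size_t transport_seglen(const struct pkt_model *p)
{
	return p->l4_hdr_len + p->gso_size;
}

/* L2 + L3 + L4 headers plus payload of one segment. */
static size_t mac_seglen(const struct pkt_model *p)
{
	size_t hdr_len = p->transport_header - p->mac_header;

	return hdr_len + transport_seglen(p);
}
```

For an Ethernet (14 bytes) + IPv4 (20 bytes) + TCP (20 bytes) packet with
a gso_size of 1460, the model gives 14 + 20 + 20 + 1460 = 1514 bytes per
segment on the wire.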

[PATCH v2 3/4] net: is_skb_forwardable: check the size of GSO segments

2018-01-24 Thread Daniel Axtens

is_skb_forwardable attempts to detect if a packet is too large to
be sent to the destination device. However, this test does not
consider GSO skbs, and it is possible that a GSO skb, when
segmented, will be larger than the device can transmit.

Create skb_gso_validate_mac_len, and use that to check.

Signed-off-by: Daniel Axtens 
---
 include/linux/skbuff.h |  1 +
 net/core/dev.c |  7 +++---
 net/core/skbuff.c  | 67 +++---
 3 files changed, 57 insertions(+), 18 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4b3ca6a5ec0a..ec9c47b5a1c8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, 
int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
 unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
 bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu);
+bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len);
 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
 struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
 int skb_ensure_writable(struct sk_buff *skb, int write_len);
diff --git a/net/core/dev.c b/net/core/dev.c
index 94435cd09072..6c96c26aadbf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1841,11 +1841,12 @@ bool is_skb_forwardable(const struct net_device *dev, 
const struct sk_buff *skb)
if (skb->len <= len)
return true;
 
-   /* if TSO is enabled, we don't care about the length as the packet
-* could be forwarded without being segmented before
+   /*
+* if TSO is enabled, we need to check the size of the
+* segmented packets
 */
if (skb_is_gso(skb))
-   return true;
+   return skb_gso_validate_mac_len(skb, len);
 
return false;
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a93e5c7aa5b2..93f66725c32d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4914,37 +4914,74 @@ unsigned int skb_gso_transport_seglen(const struct 
sk_buff *skb)
 EXPORT_SYMBOL_GPL(skb_gso_transport_seglen);
 
 /**
- * skb_gso_validate_network_len - Will a split GSO skb fit into a given MTU?
+ * skb_gso_size_check - check the skb size, considering GSO_BY_FRAGS
  *
- * @skb: GSO skb
- * @mtu: MTU to validate against
+ * There are a couple of instances where we have a GSO skb, and we
+ * want to determine what size it would be after it is segmented.
  *
- * skb_gso_validate_network_len validates if a given skb will fit a
- * wanted MTU once split. It considers L3 headers, L4 headers, and the
- * payload.
+ * We might want to check:
+ * -L3+L4+payload size (e.g. IP forwarding)
+ * - L2+L3+L4+payload size (e.g. sanity check before passing to driver)
+ *
+ * This is a helper to do that correctly considering GSO_BY_FRAGS.
+ *
+ * @seg_len: The segmented length (from skb_gso_*_seglen). In the
+ *   GSO_BY_FRAGS case this will be [header sizes + GSO_BY_FRAGS].
+ *
+ * @max_len: The maximum permissible length.
+ *
+ * Returns true if the segmented length <= max length.
  */
-bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu)
-{
+static inline bool skb_gso_size_check(const struct sk_buff *skb,
+ unsigned int seg_len,
+ unsigned int max_len) {
const struct skb_shared_info *shinfo = skb_shinfo(skb);
const struct sk_buff *iter;
-   unsigned int hlen;
-
-   hlen = skb_gso_network_seglen(skb);
 
if (shinfo->gso_size != GSO_BY_FRAGS)
-   return hlen <= mtu;
+   return seg_len <= max_len;
 
/* Undo this so we can re-use header sizes */
-   hlen -= GSO_BY_FRAGS;
+   seg_len -= GSO_BY_FRAGS;
 
skb_walk_frags(skb, iter) {
-   if (hlen + skb_headlen(iter) > mtu)
+   if (seg_len + skb_headlen(iter) > max_len)
return false;
}
 
return true;
 }
-EXPORT_SYMBOL_GPL(skb_gso_validate_mtu);
+
+/**
+ * skb_gso_validate_network_len - Does an skb fit a given MTU?
+ *
+ * @skb: GSO skb
+ * @mtu: MTU to validate against
+ *
+ * skb_gso_validate_network_len validates if a given skb will fit a
+ * wanted MTU once split. It considers L3 headers, L4 headers, and the
+ * payload.
+ */
+bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu)
+{
+   return skb_gso_size_check(skb, skb_gso_network_seglen(skb), mtu);
+}
+EXPORT_SYMBOL_GPL(skb_gso_validate_network_len);
+
+/**
+ * skb_gso_validate_mac_len - Will a split GSO skb fit in a given length?
+ *
+ * @skb: GSO skb
+ * @len: length to validate against
+ *
+ * skb_gso_validate_mac_len validates if a given skb will fit a wanted
+ * length once split, including L2, L3 and L4 headers and the payload.
+ */
+bool

[PATCH v2 0/4] Check size of packets before sending

2018-01-24 Thread Daniel Axtens

There are a few ways we can send packets that are too large to a
network driver.

When non-GSO packets are forwarded, we validate their size, based on
the MTU of the destination device. However, when GSO packets are
forwarded, we do not validate their size. We implicitly assume that
when they are segmented, the resultant packets will be correctly
sized.

This is not always the case.

We observed a case where a packet received on an ibmveth device had a
GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
device, where it caused a firmware assert. This is described in detail
at [0] and was the genesis of this series.

Rather than fixing this in the driver, this series fixes the
core path. It does it in 2 steps:

 1) make is_skb_forwardable check GSO packets - this catches bridges
 
 2) make validate_xmit_skb check the size of all packets, so as to
catch everything else (e.g. macvlan, tc mirred, OVS)

I am a bit nervous about how this series will interact with nested
VLANs, as the existing code only allows for one VLAN_HLEN. (Previously
these packets would sail past unchecked.) But I thought it would be
prudent to get more eyes on this sooner rather than later.

Thanks,
Daniel

v1: https://www.spinics.net/lists/netdev/msg478634.html
Changes in v2:

 - improve names, thanks Marcelo Ricardo Leitner

 - add check to validate_xmit_skb; thanks to everyone who participated
   in the discussion.

 - drop extra check in Open vSwitch. Bad packets will be caught by
   validate_xmit_skb for now and we can come back and add it later if
   OVS people would like the extra logging.
   
[0]: https://patchwork.ozlabs.org/patch/859410/

Cc: Jason Wang 
Cc: Pravin Shelar 
Cc: Marcelo Ricardo Leitner 
Cc: manish.cho...@cavium.com
Cc: d...@openvswitch.org

Daniel Axtens (4):
  net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
  net: move skb_gso_mac_seglen to skbuff.h
  net: is_skb_forwardable: check the size of GSO segments
  net: check the size of a packet in validate_xmit_skb

 include/linux/skbuff.h  | 18 -
 net/core/dev.c  | 24 
 net/core/skbuff.c   | 66 ++---
 net/ipv4/ip_forward.c   |  2 +-
 net/ipv4/ip_output.c|  2 +-
 net/ipv4/netfilter/nf_flow_table_ipv4.c |  2 +-
 net/ipv6/ip6_output.c   |  2 +-
 net/ipv6/netfilter/nf_flow_table_ipv6.c |  2 +-
 net/mpls/af_mpls.c  |  2 +-
 net/sched/sch_tbf.c | 10 -
 net/xfrm/xfrm_device.c  |  2 +-
 11 files changed, 93 insertions(+), 39 deletions(-)

-- 
2.14.1
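
The two checks described in this cover letter can be sketched as a
userspace model. All names here are assumptions for illustration
(struct dev_model, struct skb_model, mac_len_fits_dev(), forwardable()),
the per-segment MAC length is passed in precomputed, and the GSO_BY_FRAGS
case handled by the real skb_gso_size_check() is deliberately left out.

```c
#include <stdbool.h>

#define VLAN_HLEN 4 /* room for a single 802.1Q tag */

struct dev_model {
	unsigned int mtu;
	unsigned int hard_header_len;
	bool up;
};

struct skb_model {
	unsigned int len;        /* total length of the skb */
	bool is_gso;
	unsigned int mac_seglen; /* L2+L3+L4+payload of one segment */
};

/* Size check shared by the forwarding and transmit paths. */
static bool mac_len_fits_dev(const struct dev_model *dev,
			     const struct skb_model *skb)
{
	unsigned int len = dev->mtu + dev->hard_header_len + VLAN_HLEN;

	if (skb->len <= len)
		return true;

	/* A GSO skb is fine only if each resulting segment fits. */
	if (skb->is_gso)
		return skb->mac_seglen <= len;

	return false;
}

/* Forwarding additionally requires the device to be up. */
static bool forwardable(const struct dev_model *dev,
			const struct skb_model *skb)
{
	if (!dev->up)
		return false;

	return mac_len_fits_dev(dev, skb);
}
```

With a 1500-byte MTU and a 14-byte hard header the limit is 1518 bytes, so
a GSO skb whose segments are around 10kB is rejected, while one with
1514-byte segments passes regardless of its total length.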

Re: [PATCH 10/10] kill kernel_sock_ioctl()

2018-01-24 Thread David Miller

From: Al Viro 
Date: Thu, 25 Jan 2018 00:01:25 +

> On Wed, Jan 24, 2018 at 03:52:44PM -0500, David Miller wrote:
>> 
>> Al this series looks fine to me, want me to toss it into net-next?
> 
> Do you want them reposted (with updated commit messages), or would
> you prefer a pull request (with or without rebase to current tip
> of net-next)?

A pull request works for me.  Rebasing to net-next tip is pilot's
discretion.

Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-24 Thread Eric Dumazet

On Thu, 2018-01-25 at 11:27 +0800, jianchao.wang wrote:
> Hi Tariq
> 
> On 01/22/2018 10:12 AM, jianchao.wang wrote:
> > > > On 19/01/2018 5:49 PM, Eric Dumazet wrote:
> > > > > On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:
> > > > > > Hi Tariq
> > > > > > 
> > > > > > Very sad that the crash was reproduced again after applied the 
> > > > > > patch.
> > > 
> > > Memory barriers vary for different Archs, can you please share more 
> > > details regarding arch and repro steps?
> > The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 
> > 12/27/2015
> > The xen is installed. The crash occurred in DOM0.
> > Regarding to the repro steps, it is a customer's test which does heavy disk 
> > I/O over NFS storage without any guest.
> > 
> 
> What is the final suggestion on this ?
> If use wmb there, is the performance pulled down ?

Since 
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=dad42c3038a59d27fced28ee4ec1d4a891b28155

we batch allocations, so mlx4_en_refill_rx_buffers() is not called that often.

I doubt the additional wmb() will have serious impact there.

[PATCH net-next 1/2] net: vrf: Add support for sends to local broadcast address

2018-01-24 Thread David Ahern

Sukumar reported that sends to the local broadcast address
(255.255.255.255) are broken. Check for the address in vrf driver
and do not redirect to the VRF device - similar to multicast
packets.

With this change sockets can use SO_BINDTODEVICE to specify an
egress interface and receive responses. Note: the egress interface
can not be a VRF device but needs to be the enslaved device.

https://bugzilla.kernel.org/show_bug.cgi?id=198521

Reported-by: Sukumar Gopalakrishnan 
Signed-off-by: David Ahern 

---
Dave: Really this is a day 1 bug that goes back to the beginning of VRF.
IMO, backport to the 4.14 LTS kernel is sufficient; the multicast
handling for IPv4 was only complete as of the 4.12 kernel. I directed
this at net-next because it is not urgent for the 4.15 merge window.

 drivers/net/vrf.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index feb1b2e15c2e..139c61c8244a 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -673,8 +673,9 @@ static struct sk_buff *vrf_ip_out(struct net_device 
*vrf_dev,
  struct sock *sk,
  struct sk_buff *skb)
 {
-   /* don't divert multicast */
-   if (ipv4_is_multicast(ip_hdr(skb)->daddr))
+   /* don't divert multicast or local broadcast */
+   if (ipv4_is_multicast(ip_hdr(skb)->daddr) ||
+   ipv4_is_lbcast(ip_hdr(skb)->daddr))
return skb;
 
if (qdisc_tx_is_default(vrf_dev))
-- 
2.11.0
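
The address test added to vrf_ip_out() can be illustrated in userspace.
This is a sketch, not the kernel helpers: is_multicast() and is_lbcast()
are written here to behave like ipv4_is_multicast() and ipv4_is_lbcast()
on addresses in network byte order, and skip_vrf_redirect() is a
hypothetical name for the combined condition.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* 224.0.0.0/4, address in network byte order. */
static bool is_multicast(uint32_t addr_be)
{
	return (addr_be & htonl(0xf0000000)) == htonl(0xe0000000);
}

/* 255.255.255.255, address in network byte order. */
static bool is_lbcast(uint32_t addr_be)
{
	return addr_be == htonl(0xffffffff);
}

/* True when the packet should not be diverted to the VRF device. */
static bool skip_vrf_redirect(uint32_t daddr_be)
{
	return is_multicast(daddr_be) || is_lbcast(daddr_be);
}
```

The local broadcast address falls outside 224.0.0.0/4, which is why the
multicast test alone did not cover it and a separate check is needed.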

[PATCH net-next 2/2] net/ipv4: Allow send to local broadcast from a socket bound to a VRF

2018-01-24 Thread David Ahern

Message sends to the local broadcast address (255.255.255.255) require
uc_index or sk_bound_dev_if to be set to an egress device. However,
responses are only received if the socket is bound to the device. This
is overly constraining for processes running in an L3 domain. This
patch allows a socket bound to the VRF device to send to the local
broadcast address by using IP_UNICAST_IF to set the egress interface
with packet receipt handled by the VRF binding.

Similar to IP_MULTICAST_IF, relax the constraint on setting
IP_UNICAST_IF if a socket is bound to an L3 master device. In this
case allow uc_index to be set to an enslaved device if sk_bound_dev_if is
an L3 master device and is the master device for the ifindex.

In udp and raw sendmsg, allow uc_index to override the oif if the
uc_index master device is oif (i.e., the oif is an L3 master and the
index is an L3 slave).

Signed-off-by: David Ahern 
---
 net/ipv4/ip_sockglue.c |  6 +-
 net/ipv4/raw.c | 15 ++-
 net/ipv4/udp.c | 15 ++-
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 60fb1eb7d7d8..6cc70fa488cb 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -808,6 +808,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
{
struct net_device *dev = NULL;
int ifindex;
+   int midx;
 
if (optlen != sizeof(int))
goto e_inval;
@@ -823,10 +824,13 @@ static int do_ip_setsockopt(struct sock *sk, int level,
err = -EADDRNOTAVAIL;
if (!dev)
break;
+
+   midx = l3mdev_master_ifindex(dev);
dev_put(dev);
 
err = -EINVAL;
-   if (sk->sk_bound_dev_if)
+   if (sk->sk_bound_dev_if &&
+   (!midx || midx != sk->sk_bound_dev_if))
break;
 
inet->uc_index = ifindex;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 136544b36a46..7c509697ebc7 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -617,8 +617,21 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
ipc.oif = inet->mc_index;
if (!saddr)
saddr = inet->mc_addr;
-   } else if (!ipc.oif)
+   } else if (!ipc.oif) {
ipc.oif = inet->uc_index;
+   } else if (ipv4_is_lbcast(daddr) && inet->uc_index) {
+   /* oif is set, packet is to local broadcast and
+* uc_index is set. oif is most likely set
+* by sk_bound_dev_if. If uc_index != oif check if the
+* oif is an L3 master and uc_index is an L3 slave.
+* If so, we want to allow the send using the uc_index.
+*/
+   if (ipc.oif != inet->uc_index &&
+   ipc.oif == l3mdev_master_ifindex_by_index(sock_net(sk),
+ inet->uc_index)) {
+   ipc.oif = inet->uc_index;
+   }
+   }
 
flowi4_init_output(&fl4, ipc.oif, sk->sk_mark, tos,
   RT_SCOPE_UNIVERSE,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 853321555a4e..3f018f34cf56 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -977,8 +977,21 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
if (!saddr)
saddr = inet->mc_addr;
connected = 0;
-   } else if (!ipc.oif)
+   } else if (!ipc.oif) {
ipc.oif = inet->uc_index;
+   } else if (ipv4_is_lbcast(daddr) && inet->uc_index) {
+   /* oif is set, packet is to local broadcast and
+* uc_index is set. oif is most likely set
+* by sk_bound_dev_if. If uc_index != oif check if the
+* oif is an L3 master and uc_index is an L3 slave.
+* If so, we want to allow the send using the uc_index.
+*/
+   if (ipc.oif != inet->uc_index &&
+   ipc.oif == l3mdev_master_ifindex_by_index(sock_net(sk),
+ inet->uc_index)) {
+   ipc.oif = inet->uc_index;
+   }
+   }
 
if (connected)
rt = (struct rtable *)sk_dst_check(sk, 0);
-- 
2.11.0

[PATCH net-next 0/2] net: vrf: Fix send to local broadcast address

2018-01-24 Thread David Ahern

Patch set to fix packet send to the 255.255.255.255 address from a VRF.

The first patch tells the vrf driver to ignore those packets. The second
patch updates the uapi to allow sends from sockets bound to an L3 master
device.

David Ahern (2):
  net: vrf: Add support for sends to local broadcast address
  net/ipv4: Allow send to local broadcast from a socket bound to a VRF

 drivers/net/vrf.c  |  5 +++--
 net/ipv4/ip_sockglue.c |  6 +-
 net/ipv4/raw.c | 15 ++-
 net/ipv4/udp.c | 15 ++-
 4 files changed, 36 insertions(+), 5 deletions(-)

-- 
2.11.0

Re: [PATCH v6 16/36] nds32: DMA mapping API

2018-01-24 Thread Greentime Hu

Hi, Arnd:

2018-01-24 19:36 GMT+08:00 Arnd Bergmann :
> On Tue, Jan 23, 2018 at 12:52 PM, Greentime Hu  wrote:
>> Hi, Arnd:
>>
>> 2018-01-23 16:23 GMT+08:00 Greentime Hu :
>>> Hi, Arnd:
>>>
>>> 2018-01-18 18:26 GMT+08:00 Arnd Bergmann :
 On Mon, Jan 15, 2018 at 6:53 AM, Greentime Hu  wrote:
> From: Greentime Hu 
>
> This patch adds support for the DMA mapping API. It uses dma_map_ops for
> flexibility.
>
> Signed-off-by: Vincent Chen 
> Signed-off-by: Greentime Hu 

 I'm still unhappy about the way the cache flushes are done here as 
 discussed
 before. It's not a show-stopper, but no Ack from me.
>>>
>>> How about this implementation?
>
>> I am not sure if I understand it correctly.
>> I list all the combinations.
>>
>> RAM to DEVICE
>> before DMA => writeback cache
>> after DMA => nop
>>
>> DEVICE to RAM
>> before DMA => nop
>> after DMA => invalidate cache
>>
>> static void consistent_sync(void *vaddr, size_t size, int direction, int master)
>> {
>> unsigned long start = (unsigned long)vaddr;
>> unsigned long end = start + size;
>>
>> if (master == FOR_CPU) {
>> switch (direction) {
>> case DMA_TO_DEVICE:
>> break;
>> case DMA_FROM_DEVICE:
>> case DMA_BIDIRECTIONAL:
>> cpu_dma_inval_range(start, end);
>> break;
>> default:
>> BUG();
>> }
>> } else {
>> /* FOR_DEVICE */
>> switch (direction) {
>> case DMA_FROM_DEVICE:
>> break;
>> case DMA_TO_DEVICE:
>> case DMA_BIDIRECTIONAL:
>> cpu_dma_wb_range(start, end);
>> break;
>> default:
>> BUG();
>> }
>> }
>> }
>
> That looks reasonable enough, but it does depend on a number of factors,
> and the dma-mapping.h implementation is not just about cache flushes.
>
> As I don't know the microarchitecture, can you answer these questions:
>
> - are caches always write-back, or could they be write-through?
Yes, we can config it to write-back or write-through.

> - can the cache be shared with another CPU or a device?
No, we don't support it.

> - if the cache is shared, is it always coherent, never coherent, or
> either of them?
We don't support SMP and the device will access memory through bus. I
think the cache is not shared.

> - could the same memory be visible at different physical addresses
>   and have conflicting caches?
We currently don't have such kind of SoC memory map.

> - is the CPU physical address always the same as the address visible to the
>   device?
Yes, it is always the same unless the CPU uses local memory. The
physical address of local memory will overlap the original bus
address.
I think the local memory case can be ignored because we don't use it for now.

> - are there devices that can only see a subset of the physical memory?
All devices are able to see the whole physical memory in our current
SoC, but I think other SoC may support such kind of HW behavior.

> - can there be an IOMMU?
No.

> - are there write-buffers in the CPU that might need to get flushed before
>   flushing the cache?
Yes, there are write-buffers in front of CPU caches but it should be
transparent to SW. We don't need to flush it.

> - could cache lines be loaded speculatively or with read-ahead while
>   a buffer is owned by a device?
No.

[PATCH v3 net-next] net/ipv6: Do not allow route add with a device that is down

2018-01-24 Thread David Ahern

IPv6 allows routes to be installed when the device is not up (admin up).
Worse, it does not mark it as LINKDOWN. IPv4 does not allow it and really
there is no reason for IPv6 to allow it, so check the flags and deny if
device is admin down.

Signed-off-by: David Ahern 
---
v3
- moved err=-ENETDOWN under the if check per Eric's request
- left the up check using dev->flags for consistency with IPv4
  and that it is used more often in ipv4 and ipv6 code than
  netif_running

v2
- missed setting err to -ENETDOWN (thanks for catching that Roopa)

 net/ipv6/route.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f85da2f1e729..aa4411c81e7e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2734,6 +2734,12 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg,
if (!dev)
goto out;
 
+   if (!(dev->flags & IFF_UP)) {
+   NL_SET_ERR_MSG(extack, "Nexthop device is not up");
+   err = -ENETDOWN;
+   goto out;
+   }
+
if (!ipv6_addr_any(&cfg->fc_prefsrc)) {
if (!ipv6_chk_addr(net, &cfg->fc_prefsrc, dev, 0)) {
NL_SET_ERR_MSG(extack, "Invalid source address");
-- 
2.11.0

Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-24 Thread jianchao.wang

Hi Tariq

On 01/22/2018 10:12 AM, jianchao.wang wrote:
>>> On 19/01/2018 5:49 PM, Eric Dumazet wrote:
 On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:
> Hi Tariq
>
> Very sad that the crash was reproduced again after applying the patch.
>> Memory barriers vary for different Archs, can you please share more details 
>> regarding arch and repro steps?

> The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 
> 12/27/2015
> The xen is installed. The crash occurred in DOM0.
> Regarding to the repro steps, it is a customer's test which does heavy disk 
> I/O over NFS storage without any guest.
> 

What is the final suggestion on this?
If we use wmb there, will performance drop?

Thanks in advance
Jianchao

[PATCH bpf-next v9 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB

2018-01-24 Thread Lawrence Brakmo

Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

There is a new enum in include/uapi/linux/bpf.h that exports the TCP
states, prepending BPF_ to the current TCP state names. If it is ever
necessary to change the internal TCP state values (other than adding
more to the end), then it will become necessary to convert from the
internal TCP state value to the BPF value before calling the BPF
sock_ops function. There is a set of compile checks added in tcp.c
to detect if the internal and BPF values differ so we can make the
necessary fixes.

New op: BPF_SOCK_OPS_STATE_CB.
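
The compile-time check described above can be sketched outside the kernel
with C11 static_assert in place of BUILD_BUG_ON. Everything prefixed
DEMO_ or demo_ is invented for this example, and only the first three
states are mirrored; the real enums are the ones in the diff below.

```c
#include <assert.h>

/* Stand-in for the kernel-internal TCP states (first three only). */
enum demo_tcp_state {
	DEMO_TCP_ESTABLISHED = 1,
	DEMO_TCP_SYN_SENT,
	DEMO_TCP_SYN_RECV,
};

/* Stand-in for the exported copy that carries an extra prefix. */
enum demo_bpf_tcp_state {
	DEMO_BPF_TCP_ESTABLISHED = 1,
	DEMO_BPF_TCP_SYN_SENT,
	DEMO_BPF_TCP_SYN_RECV,
};

/* Fail the build if an exported value drifts from the internal one. */
#define DEMO_SAME_STATE(s) \
	static_assert((int)DEMO_BPF_TCP_##s == (int)DEMO_TCP_##s, \
		      "exported and internal value of " #s " differ")

DEMO_SAME_STATE(ESTABLISHED);
DEMO_SAME_STATE(SYN_SENT);
DEMO_SAME_STATE(SYN_RECV);

/* The identity mapping is only valid while the assertions above hold;
 * if one ever fires, a real translation table has to go here.
 */
static int demo_state_to_bpf(int state)
{
	return state;
}
```

Changing one of the demo values makes compilation fail at the matching
DEMO_SAME_STATE() line, which is the point of doing the check at build
time rather than at run time.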

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h | 26 ++
 include/uapi/linux/tcp.h |  3 ++-
 net/ipv4/tcp.c   | 24 
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59fa771..ff7758d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1047,6 +1047,32 @@ enum {
 * Arg3: return value of
 *   tcp_transmit_skb (0 => success)
 */
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
+};
+
+/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
+ * changes between the TCP and BPF versions. Ideally this should never happen.
+ * If it does, we need to add code to convert them before calling
+ * the BPF sock_ops function.
+ */
+enum {
+   BPF_TCP_ESTABLISHED = 1,
+   BPF_TCP_SYN_SENT,
+   BPF_TCP_SYN_RECV,
+   BPF_TCP_FIN_WAIT1,
+   BPF_TCP_FIN_WAIT2,
+   BPF_TCP_TIME_WAIT,
+   BPF_TCP_CLOSE,
+   BPF_TCP_CLOSE_WAIT,
+   BPF_TCP_LAST_ACK,
+   BPF_TCP_LISTEN,
+   BPF_TCP_CLOSING,/* Now a valid state */
+   BPF_TCP_NEW_SYN_RECV,
+
+   BPF_TCP_MAX_STATES  /* Leave at the end! */
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index ec03a2b..cf0b861 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -271,7 +271,8 @@ struct tcp_diag_md5sig {
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
+#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x7/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 88b6244..f013ddc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2042,6 +2042,30 @@ void tcp_set_state(struct sock *sk, int state)
 {
int oldstate = sk->sk_state;
 
+   /* We defined a new enum for TCP states that are exported in BPF
+* so as not to force the internal TCP states to be frozen. The
+* following checks will detect if an internal state value ever
+* differs from the BPF value. If this ever happens, then we will
+* need to remap the internal value to the BPF value before calling
+* tcp_call_bpf_2arg.
+*/
+   BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2);
+   BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK);
+   BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING);
+   BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
+
switch (state) {
case TCP_ESTABLISHED:
if (oldstate != TCP_ESTABLISHED)
-- 
2.9.5

[PATCH bpf-next v9 01/12] bpf: Only reply field should be writeable

2018-01-24 Thread Lawrence Brakmo

Currently, a sock_ops BPF program can write the op field and all the
reply fields (reply and replylong). This is a bug. The op field should
not have been writeable and there is currently no way to use replylong
field for indices >= 1. This patch enforces that only the reply field
(which equals replylong[0]) is writeable.
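
A minimal user-space sketch of the rule stated above, assuming a reduced
copy of the struct layout. struct demo_sock_ops and demo_write_allowed()
are invented names; the size test is folded in here for brevity, whereas
in the kernel at this point it lives in a separate helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for struct bpf_sock_ops: leading fields only. */
struct demo_sock_ops {
	uint32_t op;
	union {
		uint32_t reply;
		uint32_t replylong[4];
	};
	uint32_t family;
};

/* reply and replylong[0] share one offset, so a single case covers both. */
static_assert(offsetof(struct demo_sock_ops, reply) ==
	      offsetof(struct demo_sock_ops, replylong),
	      "reply must alias replylong[0]");

/* A write is accepted only if it targets exactly the reply field. */
static bool demo_write_allowed(size_t off, size_t size)
{
	assert(size > 0);

	return off == offsetof(struct demo_sock_ops, reply) &&
	       size == sizeof(uint32_t);
}
```

Writes to op, to replylong[1] through replylong[3], or to any later field
are rejected by this check.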

Fixes: 40304b2a1567 ("bpf: BPF support for sock_ops")
Signed-off-by: Lawrence Brakmo 
Acked-by: Yuchung Cheng 
---
 net/core/filter.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 18da42a..bf9bb75 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3845,8 +3845,7 @@ static bool sock_ops_is_valid_access(int off, int size,
 {
if (type == BPF_WRITE) {
switch (off) {
-   case offsetof(struct bpf_sock_ops, op) ...
-offsetof(struct bpf_sock_ops, replylong[3]):
+   case offsetof(struct bpf_sock_ops, reply):
break;
default:
return false;
-- 
2.9.5

[PATCH bpf-next v9 07/12] bpf: Add sock_ops RTO callback

2018-01-24 Thread Lawrence Brakmo

Adds an optional call to sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
The BPF program is passed 3 arguments: icsk_retransmits, icsk_rto and
whether the RTO has expired.

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h | 5 +
 include/uapi/linux/tcp.h | 3 ++-
 net/ipv4/tcp_timer.c | 7 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7573f5b..2a8c40a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1014,6 +1014,11 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index d1df2f6..129032ca 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -269,7 +269,8 @@ struct tcp_diag_md5sig {
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6db3124..257abdd 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -213,11 +213,18 @@ static int tcp_write_timeout(struct sock *sk)
icsk->icsk_user_timeout);
}
tcp_fastopen_active_detect_blackhole(sk, expired);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+ icsk->icsk_retransmits,
+ icsk->icsk_rto, (int)expired);
+
if (expired) {
/* Has it gone just too far? */
tcp_write_err(sk);
return 1;
}
+
return 0;
 }
 
-- 
2.9.5

[PATCH bpf-next v9 06/12] bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock

2018-01-24 Thread Lawrence Brakmo

Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set it, per connection and
as necessary, when the connection is established.

It also adds support for reading and writing the field within a
sock_ops BPF program. Reading is done by accessing the field directly.
However, writing is done through the helper function
bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
is trying to set a callback that is not supported in the current kernel
(i.e. running an older kernel). The helper function returns 0 if it was
able to set all of the bits set in the argument, a positive number
containing the bits that could not be set, or -EINVAL if the socket is
not a full TCP socket.

Examples of where one could call the bpf program:

1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received
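
The return convention of the helper can be sketched in plain C as shown
here. This is a user-space model under stated assumptions: struct
demo_tcp_sock and demo_cb_flags_set() are invented names, and the two
flag bits are placeholders (in this patch itself the supported mask is
still 0; callbacks are added by later patches in the series).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder callback bits and the mask of bits this "kernel" knows. */
#define DEMO_RTO_CB_FLAG	(1 << 0)
#define DEMO_RETRANS_CB_FLAG	(1 << 1)
#define DEMO_ALL_CB_FLAGS	0x3

/* Stand-in for the socket: is it a full socket, and which bits are set. */
struct demo_tcp_sock {
	bool	fullsock;
	uint8_t	cb_flags;
};

/* Returns 0 if every requested bit was accepted, a positive value holding
 * the unsupported bits otherwise, or -EINVAL for a non-full socket.
 */
static int demo_cb_flags_set(struct demo_tcp_sock *tp, int argval)
{
	int val = argval & DEMO_ALL_CB_FLAGS;

	assert(tp != NULL);

	if (!tp->fullsock)
		return -EINVAL;

	/* As in the patch, only a non-zero supported subset is stored. */
	if (val)
		tp->cb_flags = val;

	return argval & ~DEMO_ALL_CB_FLAGS;
}
```

A program built against newer headers can therefore request a bit the
running kernel does not know, see that bit come back in the return value,
and fall back gracefully while the supported bits stay enabled.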

Signed-off-by: Lawrence Brakmo 
Acked-by: Alexei Starovoitov 
---
 include/linux/tcp.h  | 11 +++
 include/uapi/linux/bpf.h | 12 +++-
 include/uapi/linux/tcp.h |  5 +
 net/core/filter.c| 34 ++
 4 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4f93f095..8f4c549 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -335,6 +335,17 @@ struct tcp_sock {
 
int linger2;
 
+
+/* Sock_ops bpf program related variables */
+#ifdef CONFIG_BPF
+   u8  bpf_sock_ops_cb_flags;  /* Control calling BPF programs
+* values defined in uapi/linux/tcp.h
+*/
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
+#else
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
+#endif
+
 /* Receiver side RTT estimation */
struct {
u32 rtt_us;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8d5874c..7573f5b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -969,6 +978,7 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index b4a4f64..d1df2f6 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -268,4 +268,9 @@ struct tcp_diag_md5sig {
__u8tcpm_key[TCP_MD5SIG_MAXKEYLEN];
 };
 
+/* Definitions for bpf_sock_ops_cb_flags */
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+* supported cb flags
+*/
+
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index c356ec0..6936d19 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3328,6 +3328,33 @@ static const struct bpf_func_proto bpf_getsockopt_proto = {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
+  int, argval)
+{
+   struct sock *sk = bpf_sock->sk;
+   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+
+   if (!sk_fullsock(sk))
+   return -EINVAL;
+
+#ifdef CONFIG_INET
+   if (val)
+   tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+
+   return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+#else
+   return -EINVAL;
+#endif
+}
+
+static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
+   .func   = bpf_sock_ops_cb_flags_set,
+   .gpl_only   =

[PATCH bpf-next v9 08/12] bpf: Add support for reading sk_state and more

2018-01-24 Thread Lawrence Brakmo

Add support for reading many more tcp_sock fields

  state,same as sk->sk_state
  rtt_min   same as sk->rtt_min.s[0].v (current rtt_min)
  snd_ssthresh
  rcv_nxt
  snd_nxt
  snd_una
  mss_cache
  ecn_flags
  rate_delivered
  rate_interval_us
  packets_out
  retrans_out
  total_retrans
  segs_in
  data_segs_in
  segs_out
  data_segs_out
  lost_out
  sacked_out
  sk_txhash
  bytes_received (__u64)
  bytes_acked(__u64)

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h |  22 
 net/core/filter.c| 143 +++
 2 files changed, 154 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a8c40a..5f08420 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -979,6 +979,28 @@ struct bpf_sock_ops {
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
__u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 lost_out;
+   __u32 sacked_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index 6936d19..a858ebc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
-static bool __is_valid_sock_ops_access(int off, int size)
+static bool sock_ops_is_valid_access(int off, int size,
+enum bpf_access_type type,
+struct bpf_insn_access_aux *info)
 {
+   const int size_default = sizeof(__u32);
+
if (off < 0 || off >= sizeof(struct bpf_sock_ops))
return false;
+
/* The verifier guarantees that size > 0. */
if (off % size != 0)
return false;
-   if (size != sizeof(__u32))
-   return false;
-
-   return true;
-}
 
-static bool sock_ops_is_valid_access(int off, int size,
-enum bpf_access_type type,
-struct bpf_insn_access_aux *info)
-{
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   if (size != size_default)
+   return false;
break;
default:
return false;
}
+   } else {
+   switch (off) {
+   case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
+   bytes_acked):
+   if (size != sizeof(__u64))
+   return false;
+   break;
+   default:
+   if (size != size_default)
+   return false;
+   break;
+   }
}
 
-   return __is_valid_sock_ops_access(off, size);
+   return true;
 }
 
 static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
@@ -4498,6 +4508,32 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
   is_fullsock));
break;
 
+   case offsetof(struct bpf_sock_ops, state):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1);
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_state));
+   break;
+
+   case offsetof(struct bpf_sock_ops, rtt_min):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+sizeof(struct minmax));
+   BUILD_BUG_ON(sizeof(struct minmax) <
+sizeof(struct minmax_sample));
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern,

[PATCH bpf-next v9 00/12] bpf: More sock_ops callbacks

2018-01-24 Thread Lawrence Brakmo

This patchset adds support for:

- direct R or R/W access to many tcp_sock fields
- passing up to 4 arguments to sock_ops BPF functions
- tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
- optionally calling sock_ops BPF program when RTO fires
- optionally calling sock_ops BPF program when packet is retransmitted
- optionally calling sock_ops BPF program when TCP state changes
- access to tclass and sk_txhash
- new selftest

v2: Fixed commit message 0/11. The commit is to "bpf-next" but the patch
below used "bpf" and Patchwork didn't work correctly.
v3: Cleaned RTO callback as per  Yuchung's comment
Added BPF enum for TCP states as per  Alexei's comment
v4: Fixed compile warnings related to detecting changes between TCP
internal states and the BPF defined states.
v5: Fixed comment issues in some selftest files
Fixed access issue with u64 fields in bpf_sock_ops struct
v6: Made fixes based on comments from Eric Dumazet:
The field bpf_sock_ops_cb_flags was added in a hole on 64bit kernels
Field bpf_sock_ops_cb_flags is now set through a helper function
which returns an error when a BPF program tries to set bits for
callbacks that are not supported in the current kernel.
Added a comment indicating that when adding fields to bpf_sock_ops_kern
they should be added before the field named "temp" if they need to be
cleared before calling the BPF function.  
v7: Enforced fields "op" and "replylong[1] .. replylong[3]" not be writable
based on comments from Eric Dumazet and Alexei Starovoitov.
Filled 32 bit hole in bpf_sock_ops struct with sk_txhash based on
comments from Daniel Borkmann.
Removed unused functions (tcp_call_bpf_1arg, tcp_call_bpf_4arg) based
on comments from Daniel Borkmann.
v8: Add commit message 00/12
Add Acked-by as appropriate
v9: Moved the bug fix to the front of the patchset
Changed RETRANS_CB so it is always called (before it was only called if
the retransmit succeeded). It is now called with an extra argument, the
return value of tcp_transmit_skb (0 => success). Based on comments
from Yuchung Cheng.
Added support for reading 2 new fields, sacked_out and lost_out, based on
comments from Yuchung Cheng.

Consists of the following patches:
[PATCH bpf-next v9 01/12] bpf: Only reply field should be writeable
[PATCH bpf-next v9 02/12] bpf: Make SOCK_OPS_GET_TCP size independent
[PATCH bpf-next v9 03/12] bpf: Make SOCK_OPS_GET_TCP struct
[PATCH bpf-next v9 04/12] bpf: Add write access to tcp_sock and sock
[PATCH bpf-next v9 05/12] bpf: Support passing args to sock_ops bpf
[PATCH bpf-next v9 06/12] bpf: Adds field bpf_sock_ops_cb_flags to
[PATCH bpf-next v9 07/12] bpf: Add sock_ops RTO callback
[PATCH bpf-next v9 08/12] bpf: Add support for reading sk_state and
[PATCH bpf-next v9 09/12] bpf: Add sock_ops R/W access to tclass
[PATCH bpf-next v9 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB
[PATCH bpf-next v9 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB
[PATCH bpf-next v9 12/12] bpf: add selftest for tcpbpf

 include/linux/filter.h |  10 ++
 include/linux/tcp.h|  11 ++
 include/net/tcp.h  |  42 -
 include/uapi/linux/bpf.h   |  76 +++-
 include/uapi/linux/tcp.h   |   8 +
 net/core/filter.c  | 290 ---
 net/ipv4/tcp.c |  26 ++-
 net/ipv4/tcp_nv.c  |   2 +-
 net/ipv4/tcp_output.c  |   6 +-
 net/ipv4/tcp_timer.c   |   7 +
 tools/include/uapi/linux/bpf.h |  78 -
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  52 ++
 tools/testing/selftests/bpf/tcp_server.py  |  79 +
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 ++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 ++
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 ++
 18 files changed, 927 insertions(+), 39 deletions(-)

[PATCH bpf-next v9 04/12] bpf: Add write access to tcp_sock and sock fields

2018-01-24 Thread Lawrence Brakmo

This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.
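
The register-borrowing step can be illustrated in isolation. The sketch
below only models how a scratch register is chosen so that it collides
with neither dst_reg nor src_reg; the save to and restore from the temp
field are not modeled. The DEMO_REG_* constants and
demo_pick_scratch_reg() are invented for this example.

```c
#include <assert.h>

/* Illustrative register numbers; only their relative order matters. */
enum { DEMO_REG_7 = 7, DEMO_REG_8 = 8, DEMO_REG_9 = 9 };

/* Start from the highest candidate and step down past any register that
 * is already in use as dst_reg or src_reg. Two registers are taken, so
 * at most two steps are ever needed.
 */
static int demo_pick_scratch_reg(int dst_reg, int src_reg)
{
	int reg = DEMO_REG_9;

	if (dst_reg == reg || src_reg == reg)
		reg--;
	if (dst_reg == reg || src_reg == reg)
		reg--;

	/* The result never drops below the third candidate. */
	assert(reg >= DEMO_REG_7);
	return reg;
}
```

For example, with dst_reg 9 and src_reg 8 the function settles on 7, and
with neither in the 8..9 range it keeps 9.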

Signed-off-by: Lawrence Brakmo 
Acked-by: Alexei Starovoitov 
---
 include/linux/filter.h |  9 +
 include/net/tcp.h  |  2 +-
 net/core/filter.c  | 48 
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 425056c..daa5a67 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1007,6 +1007,15 @@ struct bpf_sock_ops_kern {
u32 replylong[4];
};
u32 is_fullsock;
+   u64 temp;   /* temp and everything after is not
+* initialized to 0 before calling
+* the BPF program. New fields that
+* should be initialized to 0 should
+* be inserted before temp.
+* temp is scratch storage used by
+* sock_ops_convert_ctx_access
+* as temporary storage of a register.
+*/
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5a1d26a..6092eaf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2011,7 +2011,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
struct bpf_sock_ops_kern sock_ops;
int ret;
 
-   memset(&sock_ops, 0, sizeof(sock_ops));
+   memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
if (sk_fullsock(sk)) {
sock_ops.is_fullsock = 1;
sock_owned_by_me(sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index dbb6d2f..c356ec0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4491,6 +4491,54 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
  offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other register. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
+   do {  \
+   int reg = BPF_REG_9;  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   *insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  temp));\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, \
+   is_fullsock), \
+ reg, si->dst_reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  is_fullsock)); \
+   *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, sk),\
+

[PATCH bpf-next v9 03/12] bpf: Make SOCK_OPS_GET_TCP struct independent

2018-01-24 Thread Lawrence Brakmo

Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2
arguments so now it can also work with struct sock fields.
The first argument is the name of the field in the bpf_sock_ops
struct, the 2nd argument is the name of the field in the OBJ struct.

Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
New:  SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)

Where OBJ is either "struct tcp_sock" or "struct sock" (without
quotation). BPF_FIELD is the name of the field in the bpf_sock_ops
struct and OBJ_FIELD is the name of the field in the OBJ struct.

Although the field names are currently the same, the kernel struct names
could change in the future and this change makes it easier to support
that.

Note that adding access to tcp_sock fields in sock_ops programs does
not preclude the tcp_sock fields from being removed as long as we are
willing to do one of the following:

  1) Return a fixed value (e.g. 0 or 0x), or
  2) Make the verifier fail if that field is accessed (i.e. program
fails to load) so the user will know that field is no longer
supported.

Signed-off-by: Lawrence Brakmo 
---
 net/core/filter.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 62e7874..dbb6d2f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4469,11 +4469,11 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
-/* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP(FIELD_NAME) \
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
-FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4485,18 +4485,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
-  FIELD_NAME), si->dst_reg,  \
- si->dst_reg,\
- offsetof(struct tcp_sock, FIELD_NAME)); \
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,   \
+  OBJ_FIELD),\
+ si->dst_reg, si->dst_reg,   \
+ offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP(snd_cwnd);
+   SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP(srtt_us);
+   SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
break;
}
return insn - insn_buf;
-- 
2.9.5

[PATCH bpf-next v9 05/12] bpf: Support passing args to sock_ops bpf function

2018-01-24 Thread Lawrence Brakmo

Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reuses the reply union, so the bpf_sock_ops structures are not
increased in size.
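
The idea of sharing storage between arguments and the reply can be sketched
in user space as follows. This is only an illustration under assumptions:
demo_sock_ops, demo_call and demo_sum_prog are hypothetical names, and a
plain function pointer stands in for the attached BPF program.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_ARGS 4

/* Hypothetical context: args, reply and replylong share the same storage,
 * so adding args does not grow the struct. */
struct demo_sock_ops {
    uint32_t op;
    union {
        uint32_t args[DEMO_MAX_ARGS];      /* passed to the program */
        uint32_t reply;                    /* returned by the program */
        uint32_t replylong[DEMO_MAX_ARGS]; /* optionally returned */
    };
};

typedef int (*demo_prog_t)(struct demo_sock_ops *ctx);

/* Copy nargs values into the context, run the program, and return its
 * reply, or -1 if the program reports failure. */
static int demo_call(demo_prog_t prog, uint32_t op,
                     uint32_t nargs, const uint32_t *args)
{
    struct demo_sock_ops ctx;

    assert(nargs <= DEMO_MAX_ARGS);
    memset(&ctx, 0, sizeof(ctx));
    ctx.op = op;
    if (nargs > 0)
        memcpy(ctx.args, args, nargs * sizeof(*args));

    if (prog(&ctx) != 0)
        return -1;
    return (int)ctx.reply;
}

/* Convenience wrapper in the spirit of the 3-argument helper. */
static int demo_call_3arg(demo_prog_t prog, uint32_t op,
                          uint32_t a1, uint32_t a2, uint32_t a3)
{
    uint32_t args[3] = { a1, a2, a3 };

    return demo_call(prog, op, 3, args);
}

/* Example program: reads all args before writing reply, because reply
 * aliases args[0]. */
static int demo_sum_prog(struct demo_sock_ops *ctx)
{
    uint32_t sum = ctx->args[0] + ctx->args[1] + ctx->args[2];

    ctx->reply = sum;
    return 0;
}
```

One consequence of the shared storage is visible in demo_sum_prog: a program
that wants both the arguments and a reply has to read the arguments first.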

Signed-off-by: Lawrence Brakmo 
---
 include/linux/filter.h   |  1 +
 include/net/tcp.h| 40 +++-
 include/uapi/linux/bpf.h |  5 +++--
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_nv.c|  2 +-
 net/ipv4/tcp_output.c|  2 +-
 6 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index daa5a67..20384c4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1003,6 +1003,7 @@ struct bpf_sock_ops_kern {
struct  sock *sk;
u32 op;
union {
+   u32 args[4];
u32 reply;
u32 replylong[4];
};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6092eaf..093e967 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2006,7 +2006,7 @@ void tcp_cleanup_ulp(struct sock *sk);
  * program loaded).
  */
 #ifdef CONFIG_BPF
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
struct bpf_sock_ops_kern sock_ops;
int ret;
@@ -2019,6 +2019,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 
sock_ops.sk = sk;
sock_ops.op = op;
+   if (nargs > 0)
+   memcpy(sock_ops.args, args, nargs * sizeof(*args));
 
ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
if (ret == 0)
@@ -2027,18 +2029,46 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
ret = -1;
return ret;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   u32 args[2] = {arg1, arg2};
+
+   return tcp_call_bpf(sk, op, 2, args);
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   u32 args[3] = {arg1, arg2, arg3};
+
+   return tcp_call_bpf(sk, op, 3, args);
+}
+
 #else
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
return -EPERM;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   return -EPERM;
+}
+
 #endif
 
 static inline u32 tcp_timeout_init(struct sock *sk)
 {
int timeout;
 
-   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT);
+   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
 
if (timeout <= 0)
timeout = TCP_TIMEOUT_INIT;
@@ -2049,7 +2079,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 {
int rwnd;
 
-   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT);
+   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL);
 
if (rwnd < 0)
rwnd = 0;
@@ -2058,7 +2088,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 
 static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 {
-   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }
 
 #if IS_ENABLED(CONFIG_SMC)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 406c19d..8d5874c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -952,8 +952,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d7cf861..88b6244 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -463,7 +463,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
-   tcp_call_bpf(sk, bpf_op);
+   tcp_call_bpf(sk, bpf_op, 0, NULL);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);
 }
diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
index 0b5a05b..ddbce73 100644
--- a/net/ipv4/tcp_nv.c
+++ b/net/ipv4/tcp_nv.c
@@ -146,7 +146,7 @@ static void tcpnv_init(struct sock *sk)
 * within a datacenter, where we have reasonable estimates of
 * RTTs
 */
-   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT);
+   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL);
if (base_rtt >

[PATCH bpf-next v9 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB

2018-01-24 Thread Lawrence Brakmo

Adds support for calling sock_ops BPF program when there is a
retransmission. Three arguments are used; one for the sequence number,
another for the number of segments retransmitted, and the last one for
the return value of tcp_transmit_skb (0 => success).
Does not include syn-ack retransmissions.

New op: BPF_SOCK_OPS_RETRANS_CB.
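
A minimal user-space sketch of the flag-gated callback pattern follows. All
demo_* names are hypothetical. The choice to return the unsupported bits from
the setter is an assumption modelled on the helper description in patch 12/12
of this series, not a statement about the exact kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_RTO_CB_FLAG      (1u << 0)
#define DEMO_RETRANS_CB_FLAG  (1u << 1)
#define DEMO_ALL_CB_FLAGS     0x3u   /* mask of all supported flags */

/* Hypothetical per-socket state. */
struct demo_tcp_sock {
    uint8_t  cb_flags;        /* which optional callbacks are enabled */
    unsigned retrans_events;  /* how many times the callback ran */
    uint32_t last_seq;        /* arg1 of the most recent callback */
    uint32_t last_segs;       /* arg2 of the most recent callback */
    int      last_err;        /* arg3 of the most recent callback */
};

/* Enable the supported flags and return the bits that were not supported
 * (0 means every requested flag was accepted). */
static uint32_t demo_cb_flags_set(struct demo_tcp_sock *tp, uint32_t flags)
{
    assert(tp);
    tp->cb_flags = (uint8_t)(flags & DEMO_ALL_CB_FLAGS);
    return flags & ~DEMO_ALL_CB_FLAGS;
}

/* Called on every retransmission; the callback only runs when the
 * retransmit flag is set, so sockets that did not opt in pay one test. */
static void demo_retransmit(struct demo_tcp_sock *tp, uint32_t seq,
                            uint32_t segs, int err)
{
    assert(tp);
    if (!(tp->cb_flags & DEMO_RETRANS_CB_FLAG))
        return;
    tp->retrans_events++;
    tp->last_seq = seq;
    tp->last_segs = segs;
    tp->last_err = err;
}
```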

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h | 6 ++
 include/uapi/linux/tcp.h | 3 ++-
 net/ipv4/tcp_output.c| 4 
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5f08420..59fa771 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1041,6 +1041,12 @@ enum {
 * Arg2: value of icsk_rto
 * Arg3: whether RTO has expired
 */
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+* Arg3: return value of
+*   tcp_transmit_skb (0 => success)
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 129032ca..ec03a2b 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -270,7 +270,8 @@ struct tcp_diag_md5sig {
 
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d12f7f7..e9f985e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2905,6 +2905,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
}
 
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RETRANS_CB,
+ TCP_SKB_CB(skb)->seq, segs, err);
+
if (likely(!err)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb);
-- 
2.9.5

[PATCH bpf-next v9 12/12] bpf: add selftest for tcpbpf

2018-01-24 Thread Lawrence Brakmo

Added a selftest for tcpbpf (sock_ops) that checks that the appropriate
callbacks occurred, that it can access tcp_sock fields, and that their
values are correct.

Run with command: ./test_tcpbpf_user

Signed-off-by: Lawrence Brakmo 
Acked-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/bpf.h |  78 ++-
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  52 ++
 tools/testing/selftests/bpf/tcp_server.py  |  79 +++
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 +++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 +
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 
 8 files changed, 482 insertions(+), 6 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/tcp_client.py
 create mode 100755 tools/testing/selftests/bpf/tcp_server.py
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_user.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index af1f49a..ff7758d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -17,7 +17,7 @@
 #define BPF_ALU64  0x07/* alu mode in double word width */
 
 /* ld/ldx fields */
-#define BPF_DW 0x18/* double word */
+#define BPF_DW 0x18/* double word (64-bit) */
 #define BPF_XADD   0xc0/* exclusive add */
 
 /* alu/jmp fields */
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -952,8 +961,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
@@ -968,6 +978,29 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 lost_out;
+   __u32 sacked_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
@@ -1003,6 +1036,43 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+* Arg3: return value of
+*   tcp_transmit_skb (0 => success)
+*/
+

[PATCH bpf-next v9 09/12] bpf: Add sock_ops R/W access to tclass

2018-01-24 Thread Lawrence Brakmo

Adds direct write access to sk_txhash and access to tclass for ipv6
flows through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).
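
The value check applied on the setsockopt side of this patch can be restated
as a small stand-alone function. This is a sketch only: demo_set_tclass is a
hypothetical name and a plain int stands in for the per-socket tclass field.

```c
#include <assert.h>
#include <errno.h>

/* Accept -1 (meaning "use the default", stored as 0) or a value that fits
 * in the 8-bit traffic class; reject everything else and leave *tclass
 * unchanged. */
static int demo_set_tclass(int *tclass, int val)
{
    assert(tclass);
    if (val < -1 || val > 0xff)
        return -EINVAL;
    if (val == -1)
        val = 0;
    *tclass = val;
    return 0;
}
```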

Signed-off-by: Lawrence Brakmo 
---
 net/core/filter.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index a858ebc..fe2c793 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3232,6 +3232,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
ret = -EINVAL;
}
 #ifdef CONFIG_INET
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   return -EINVAL;
+
+   val = *((int *)optval);
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   if (val < -1 || val > 0xff) {
+   ret = -EINVAL;
+   } else {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (val == -1)
+   val = 0;
+   np->tclass = val;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   }
+#endif
} else if (level == SOL_TCP &&
   sk->sk_prot->setsockopt == tcp_setsockopt) {
if (optname == TCP_CONGESTION) {
@@ -3241,7 +3264,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
strncpy(name, optval, min_t(long, optlen,
TCP_CA_NAME_MAX-1));
name[TCP_CA_NAME_MAX-1] = 0;
-   ret = tcp_set_congestion_control(sk, name, false, 
reinit);
+   ret = tcp_set_congestion_control(sk, name, false,
+reinit);
} else {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3307,6 +3331,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
} else {
goto err_clear;
}
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   goto err_clear;
+
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   *((int *)optval) = (int)np->tclass;
+   break;
+   default:
+   goto err_clear;
+   }
+#endif
} else {
goto err_clear;
}
@@ -3871,6 +3911,7 @@ static bool sock_ops_is_valid_access(int off, int size,
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   case offsetof(struct bpf_sock_ops, sk_txhash):
if (size != size_default)
return false;
break;
@@ -4690,7 +4731,8 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
case offsetof(struct bpf_sock_ops, sk_txhash):
-   SOCK_OPS_GET_FIELD(sk_txhash, sk_txhash, struct sock);
+   SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
+ struct sock, type);
break;
 
case offsetof(struct bpf_sock_ops, bytes_received):
@@ -4701,6 +4743,7 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
case offsetof(struct bpf_sock_ops, bytes_acked):
SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
break;
+
}
return insn - insn_buf;
 }
-- 
2.9.5

[PATCH bpf-next v9 02/12] bpf: Make SOCK_OPS_GET_TCP size independent

2018-01-24 Thread Lawrence Brakmo

Make SOCK_OPS_GET_TCP helper macro size independent (previously it only
worked with 4-byte fields).
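
For readers less familiar with BPF load instructions: the size operand is a
size code, not a byte count. The sketch below maps a field width to the
corresponding code. The DEMO_* names and demo_tcp_sock are hypothetical; the
numeric values mirror the BPF_W, BPF_H, BPF_B and BPF_DW definitions in the
uapi headers. The byte count and the size code are different values, which is
presumably why patch 03/12 of this series wraps the final load in
BPF_FIELD_SIZEOF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size codes, mirroring the uapi BPF definitions. */
enum {
    DEMO_BPF_W  = 0x00, /* 4 bytes */
    DEMO_BPF_H  = 0x08, /* 2 bytes */
    DEMO_BPF_B  = 0x10, /* 1 byte  */
    DEMO_BPF_DW = 0x18, /* 8 bytes */
};

/* Translate a width in bytes to a size code, or -1 if BPF cannot load a
 * value of that width with a single instruction. */
static int demo_bytes_to_bpf_size(size_t bytes)
{
    switch (bytes) {
    case 1: return DEMO_BPF_B;
    case 2: return DEMO_BPF_H;
    case 4: return DEMO_BPF_W;
    case 8: return DEMO_BPF_DW;
    default: return -1;
    }
}

/* Size code for a struct member, derived from its declared type. */
#define DEMO_BPF_FIELD_SIZEOF(type, field) \
    demo_bytes_to_bpf_size(sizeof(((type *)0)->field))

/* Hypothetical object with fields of different widths. */
struct demo_tcp_sock {
    uint32_t snd_cwnd;
    uint64_t bytes_acked;
    uint8_t  ecn_flags;
};
```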

Signed-off-by: Lawrence Brakmo 
---
 net/core/filter.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index bf9bb75..62e7874 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4470,9 +4470,10 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
 /* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP32(FIELD_NAME)   \
+#define SOCK_OPS_GET_TCP(FIELD_NAME) \
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) != 4); \
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
+FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4484,16 +4485,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,\
+   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
+  FIELD_NAME), si->dst_reg,  \
+ si->dst_reg,\
  offsetof(struct tcp_sock, FIELD_NAME)); \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP32(snd_cwnd);
+   SOCK_OPS_GET_TCP(snd_cwnd);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP32(srtt_us);
+   SOCK_OPS_GET_TCP(srtt_us);
break;
}
return insn - insn_buf;
-- 
2.9.5

Re: [net-next 0/8][pull request] 1GbE Intel Wired LAN Driver Updates 2018-01-24

2018-01-24 Thread Alexander Duyck

On Wed, Jan 24, 2018 at 1:10 PM, David Miller  wrote:
> From: Jeff Kirsher 
> Date: Wed, 24 Jan 2018 12:55:12 -0800
>
>> This series contains updates to igb and e1000e only.
>
> Pulled, however:
>
>> Corinna Vinschen implements the ability to set the VF MAC to
>> 00:00:00:00:00:00 via RTM_SETLINK on the PF, to prevent receiving
>> "invalid argument" when libvirt attempts to restore the MAC address back
>> to its original state of 00:00:00:00:00:00.
>
> This is really a mess and the wrong way to go about this.
>
> No interface, even a VF, should come up or ever have an invalid
> MAC address like all-zeros.  That's the fundamental problem and
> once you fix that all of this other crazy logic and workarounds
> no longer become necessary.

In the case of igbvf the VFs never come up with 0s in their MAC
address. An all 0's MAC address basically leaves it open to VF's
choice for assigning themselves a MAC address, or at least that is the
way I recall coding it back in the day.

There are a few issues with making changes to this at this point. The
first being that this concept is pretty much baked into the VF driver
logic for most drivers supporting legacy SR-IOV, and as pointed out in
the patch comments the libvirt interface is writing 0's to disable the
VF MAC address when it is not in use. At this point we cannot change
this without breaking the libvirt userspace.

One of the motivations for clearing this is to avoid having the PF
misdirect traffic as having a MAC address mapped to a
disabled/unassigned VF could result in traffic being dropped when it
should be directed elsewhere such as a bridge on the PF, or out to
some other PF that is now running the VM there.

> Whatever it takes, just do it.  We can even come up with a global
> MAC address range that on a Linux system is reserved for VFs to
> come up with.

That is normally how the VFs handle this on their side. The code was
set up such that if the PF provided an all 0's MAC address then the VF
would assign itself a locally administered address so that it wouldn't
come up with an address of 0s. If you are saying the VFs shouldn't be
allowed to come up with an all 0's MAC address I believe that none of
them do. I believe they either fail to come up at all or report a
locally administered address for themselves. I can double check that
though (at least for Intel) to verify that it is in fact a consistent
behavior. In theory there isn't likely to be a VF bound to the
interface anyway, usually when the MAC address is invalidated it is
because a VM has been terminated and the VF driver is just in limbo
since it is usually assigned to a VFIO interface which doesn't
actually expose the network interface to the kernel.
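
To make the fallback described above concrete, here is a small user-space
sketch of "use the PF-provided address unless it is all zeros, otherwise
derive a locally administered one". It is an illustration only: the demo_*
names are hypothetical, and the seed bytes stand in for whatever random
source a real driver would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN 6

/* True if all six octets are zero. */
static bool demo_is_zero_mac(const uint8_t *addr)
{
    static const uint8_t zero[DEMO_ETH_ALEN];

    return memcmp(addr, zero, DEMO_ETH_ALEN) == 0;
}

/* Turn six arbitrary bytes into a unicast, locally administered address:
 * clear the multicast bit (bit 0) and set the local bit (bit 1) of the
 * first octet. */
static void demo_make_laa(uint8_t *addr)
{
    addr[0] &= 0xfe;
    addr[0] |= 0x02;
}

/* VF-side choice: keep the PF-provided address when it is set, otherwise
 * build a locally administered address from the seed bytes. */
static void demo_vf_pick_mac(uint8_t *out, const uint8_t *pf_mac,
                             const uint8_t *seed)
{
    assert(out && pf_mac && seed);
    if (!demo_is_zero_mac(pf_mac)) {
        memcpy(out, pf_mac, DEMO_ETH_ALEN);
        return;
    }
    memcpy(out, seed, DEMO_ETH_ALEN);
    demo_make_laa(out);
}
```

Because the local bit is always set, the result can never be all zeros, even
if the seed is.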

I suppose we could look at pushing the LAA generation up into the PF,
but we would still want to maintain the all 0's address while the VF
is inactive since we need to clear the stale VF addresses from the MAC
address table in the event of a VM being relocated to a different
server and taking the MAC address with it.

The good news to all this is that this is going to be fading out and
going away anyway as SwitchDev takes over for SR-IOV.

> Thanks.

I'll double check our VF drivers and make sure none of them are
exposing a netdevice with an all 0's MAC address, and see what we can
do about relocating the locally administered address generation into
the PF.

Thanks.

- Alex

[PATCH net-next] ipv6: raw: use IPv4 raw_sendmsg on v4-mapped IPv6 destinations

2018-01-24 Thread Ivan Delalande

Make IPv6 SOCK_RAW sockets operate like IPv6 UDP and TCP sockets with
respect to IPv4 mapped addresses by calling IPv4 raw_sendmsg from
rawv6_sendmsg to send those messages out.
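
The address conversion at the heart of this change can be sketched in user
space with the standard socket headers. demo_v4mapped_to_sin is a
hypothetical helper written for illustration; it is not the kernel code,
which works on its own msghdr and socket structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <netinet/in.h>

/* If daddr is an IPv4-mapped IPv6 address (::ffff:a.b.c.d), fill *sin with
 * the equivalent IPv4 destination and return true; otherwise return false
 * and leave *sin untouched. port_be is in network byte order. */
static bool demo_v4mapped_to_sin(const struct in6_addr *daddr,
                                 in_port_t port_be, struct sockaddr_in *sin)
{
    assert(daddr && sin);
    if (!IN6_IS_ADDR_V4MAPPED(daddr))
        return false;

    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = port_be;
    /* The IPv4 address occupies the last four bytes, already in network
     * byte order. */
    memcpy(&sin->sin_addr.s_addr, &daddr->s6_addr[12], 4);
    return true;
}
```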

Signed-off-by: Travis Brown 
Signed-off-by: Ivan Delalande 
---
 include/net/raw.h |  1 +
 net/ipv4/raw.c|  5 +++--
 net/ipv6/raw.c| 14 ++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index 99d26d0c4a19..b4dbf730da54 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -33,6 +33,7 @@ void raw_icmp_error(struct sk_buff *, int, u32);
 int raw_local_deliver(struct sk_buff *, int);
 
 int raw_rcv(struct sock *, struct sk_buff *);
+int rawv4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
 
 #define RAW_HTABLE_SIZEMAX_INET_PROTOS
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 136544b36a46..09f719af8642 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -499,7 +499,7 @@ static int raw_getfrag(void *from, char *to, int offset, 
int len, int odd,
return ip_generic_getfrag(rfv->msg, to, offset, len, odd, skb);
 }
 
-static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+int rawv4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
struct inet_sock *inet = inet_sk(sk);
struct net *net = sock_net(sk);
@@ -692,6 +692,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = 0;
goto done;
 }
+EXPORT_SYMBOL_GPL(rawv4_sendmsg);
 
 static void raw_close(struct sock *sk, long timeout)
 {
@@ -969,7 +970,7 @@ struct proto raw_prot = {
.init  = raw_init,
.setsockopt= raw_setsockopt,
.getsockopt= raw_getsockopt,
-   .sendmsg   = raw_sendmsg,
+   .sendmsg   = rawv4_sendmsg,
.recvmsg   = raw_recvmsg,
.bind  = raw_bind,
.backlog_rcv   = raw_rcv_skb,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index ddda7eb3c623..f8513e2f1481 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -844,6 +844,20 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
fl6.flowlabel = np->flow_label;
}
 
+   if (daddr && ipv6_addr_v4mapped(daddr)) {
+   struct sockaddr_in sin;
+
+   sin.sin_family = AF_INET;
+   sin.sin_port = sin6 ? sin6->sin6_port : inet->inet_dport;
+   sin.sin_addr.s_addr = daddr->s6_addr32[3];
+   msg->msg_name = &sin;
+   msg->msg_namelen = sizeof(sin);
+
+   if (__ipv6_only_sock(sk))
+   return -ENETUNREACH;
+   return rawv4_sendmsg(sk, msg, len);
+   }
+
if (fl6.flowi6_oif == 0)
fl6.flowi6_oif = sk->sk_bound_dev_if;
 
-- 
2.16.1

[PATCH] kbuild: make Makefile|Kbuild in each directory optional

2018-01-24 Thread Jakub Kicinski

It is useful to be able to build single object files, e.g.:
$ make net/sched/cls_flower.o W=1 C=2

Currently kbuild does a hard include of a Kbuild or Makefile
for directory where that object would reside.  Kbuild doesn't
cater too well to multi-directory drivers, meaning such drivers
will usually only use a single central Makefile.  This in turn
means it will be impossible to build most of object files
individually for such drivers.

Make the include of $dir/{Makefile,Kbuild} optional.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
I must admit I have no idea whose tree I should send this to :(
Could it go via net-next if no one on linux-kbuild objects?

 scripts/Makefile.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 47cddf32aeba..178864f877d5 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -42,7 +42,7 @@ save-cflags := $(CFLAGS)
 # The filename Kbuild has precedence over Makefile
 kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src))
 kbuild-file := $(if $(wildcard 
$(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
-include $(kbuild-file)
+-include $(kbuild-file)
 
 # If the save-* variables changed error out
 ifeq ($(KBUILD_NOPEDANTIC),)
-- 
2.15.1

Re: [PATCH nf-next,RFC v4] netfilter: nf_flow_table: add hardware offload support

2018-01-24 Thread Jakub Kicinski

On Thu, 25 Jan 2018 01:09:41 +0100, Pablo Neira Ayuso wrote:
> This patch adds the infrastructure to offload flows to hardware, in case
> the nic/switch comes with built-in flow tables capabilities.
> 
> If the hardware comes with no hardware flow tables or they have
> limitations in terms of features, the existing infrastructure falls back
> to the software flow table implementation.
> 
> The software flow table garbage collector skips entries that reside in
> the hardware, so the hardware will be responsible for releasing this
> flow table entry too via flow_offload_dead().
> 
> Hardware configuration, either to add or to delete entries, is done from
> the hardware offload workqueue, to ensure this is done from user context
> given that we may sleep when grabbing the mdio mutex.
> 
> Signed-off-by: Pablo Neira Ayuso 

I wonder how do you deal with device/table removal?  I know regrettably
little about internals of nftables.  I assume the table cannot be
removed/module unloaded as long as there are flow entries?  And on
device removal all flows pertaining to the removed ifindex will be
automatically flushed?

Still there could be outstanding work items targeting the device, so
this WARN_ON:

+   indev = dev_get_by_index(net, ifindex);
+   if (WARN_ON(!indev))
+   return 0;

looks possible to trigger.

On the general architecture - I think it's worth documenting somewhere
clearly that unlike TC offloads and most NDOs add/del of NFT flows are
not protected by rtnl_lock.

> v4: More work in progress
> - Decouple nf_flow_table_hw from nft_flow_offload via rcu hooks
> - Consolidate ->ndo invocations, now they happen from the hw worker.
> - Fix bug in list handling, use list_replace_init()
> - cleanup entries on nf_flow_table_hw module removal
> - add NFT_FLOWTABLE_F_HW flag to flowtables to explicit signal that user wants
>   to offload entries to hardware.
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ed0799a12bf2..be0c12acc3f0 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -859,6 +859,13 @@ struct dev_ifalias {
>   char ifalias[];
>  };
>  
> +struct flow_offload;
> +
> +enum flow_offload_type {
> + FLOW_OFFLOAD_ADD= 0,
> + FLOW_OFFLOAD_DEL,
> +};
> +
>  /*
>   * This structure defines the management hooks for network devices.
>   * The following hooks can be defined; unless noted otherwise, they are
> @@ -1316,6 +1323,8 @@ struct net_device_ops {
>   int (*ndo_bridge_dellink)(struct net_device *dev,
> struct nlmsghdr *nlh,
> u16 flags);
> + int (*ndo_flow_offload)(enum flow_offload_type type,
> + struct flow_offload *flow);

nit: should there be kdoc for the new NDO?  ndo kdoc comment doesn't
 look like it would be recognized by tools anyway though..

nit: using "flow" as the name rings slightly grandiose to me :)  
 I would appreciate a nf_ prefix for clarity.  Drivers will have 
 to juggle a number of "flow" things, it would make the code easier
 to follow if names were prefixed clearly, I feel.

>   int (*ndo_change_carrier)(struct net_device *dev,
> bool new_carrier);
>   int (*ndo_get_phys_port_id)(struct net_device *dev,

[PATCH v2 net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_{match|target}

2018-01-24 Thread Eric Dumazet

From: Eric Dumazet 

It looks like syzbot found its way into netfilter territory.

Issue here is that @name comes from user space and might
not be null terminated.

Out-of-bound reads happen, KASAN is not happy.

v2 added similar fix for xt_request_find_target(),
as Florian advised.
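
The check itself is a bounded search for the terminator before the name is
treated as a C string. A user-space sketch, under assumptions: the buffer
size DEMO_MAXNAMELEN and the demo_* names are hypothetical, and memchr() is
used so the example stays within ISO C (strnlen() is the POSIX equivalent of
what the patch calls).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAXNAMELEN 29 /* assumed fixed size of the name buffer */

/* True if a NUL byte occurs within the first DEMO_MAXNAMELEN bytes, i.e.
 * the buffer can safely be handed to string functions. */
static bool demo_name_is_terminated(const char *name)
{
    assert(name);
    return memchr(name, '\0', DEMO_MAXNAMELEN) != NULL;
}

/* Stand-in for a lookup: reject unterminated names up front, otherwise
 * return the name length. */
static int demo_lookup_name(const char *name)
{
    if (!demo_name_is_terminated(name))
        return -1;
    return (int)strlen(name);
}
```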

Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
No Fixes: tag, bug seems to be a day-0 one.

 net/netfilter/x_tables.c |6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 
55802e97f906d1987ed78b4296584deb38e5f876..ecffc51ce83b07c063a0db67cdb33d9bf48a75ac
 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -210,6 +210,9 @@ xt_request_find_match(uint8_t nfproto, const char *name, 
uint8_t revision)
 {
struct xt_match *match;
 
+   if (strnlen(name, XT_EXTENSION_MAXNAMELEN) == XT_EXTENSION_MAXNAMELEN)
+   return ERR_PTR(-EINVAL);
+
match = xt_find_match(nfproto, name, revision);
if (IS_ERR(match)) {
request_module("%st_%s", xt_prefix[nfproto], name);
@@ -252,6 +255,9 @@ struct xt_target *xt_request_find_target(u8 af, const char 
*name, u8 revision)
 {
struct xt_target *target;
 
+   if (strnlen(name, XT_EXTENSION_MAXNAMELEN) == XT_EXTENSION_MAXNAMELEN)
+   return ERR_PTR(-EINVAL);
+
target = xt_find_target(af, name, revision);
if (IS_ERR(target)) {
request_module("%st_%s", xt_prefix[af], name);

Re: [PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()

2018-01-24 Thread Eric Dumazet

On Thu, 2018-01-25 at 01:13 +0100, Pablo Neira Ayuso wrote:
> On Thu, Jan 25, 2018 at 12:50:56AM +0100, Pablo Neira Ayuso wrote:
> > On Thu, Jan 25, 2018 at 12:19:52AM +0100, Florian Westphal wrote:
> > > Eric Dumazet  wrote:
> > > > From: Eric Dumazet 
> > > > 
> > > > It looks like syzbot found its way into netfilter territory.
> > > 
> > > Excellent.  This will sure allow to find and fix more bugs :-)
> > > 
> > > > Issue here is that @name comes from user space and might
> > > > not be null terminated.
> > > 
> > > Indeed, thanks for fixing this Eric.
> > > 
> > > xt_find_target() and xt_find_table_lock() might have similar issues.
> > 
> > I'm going to keep back this patch then, it would be good if we can
> > find this in one single patch.
> 
> s/find/fix/
> 
> Sorry.


Ok, but apparently you partially fixed this recently :/

Commits 78b79876761b8 and b301f25387599 took care of
xt_find_table_lock() it seems.

I'll send a V2 including xt_request_find_target()

[PATCH net-next] rds: tcp: per-netns flag to stop new connection creation when rds-tcp is being dismantled

2018-01-24 Thread Sowmini Varadhan

An rds_connection can get added during netns deletion between lines 528
and 529 of

  506 static void rds_tcp_kill_sock(struct net *net)
  :
  /* code to pull out all the rds_connections that should be destroyed */
  :
  528 spin_unlock_irq(&rds_tcp_conn_lock);
  529 list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
  530 rds_conn_destroy(tc->t_cpath->cp_conn);

Such an rds_connection would miss out the rds_conn_destroy()
loop (that cancels all pending work) and (if it was scheduled
after netns deletion) could trigger the use-after-free.

A similar race-window exists for the module unload path
in rds_tcp_exit -> rds_tcp_destroy_conns

To avoid the addition of new rds_connections during kill_sock
or netns_delete, this patch introduces a per-netns flag,
RTN_DELETE_PENDING, that will cause RDS connection creation to fail.
RCU is used to make sure that we wait for the critical
section of __rds_conn_create threads (that may have started before
the setting of RTN_DELETE_PENDING) to complete before starting
the connection destruction.
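
The core of the approach, checking a "delete pending" flag under the same
lock that protects the connection list, can be sketched in user space as
below. This is a simplified illustration: the demo_* names are hypothetical,
a C11 atomic_flag spin lock stands in for the kernel spinlock, a counter
stands in for the list, and the RCU wait for in-flight creators is not
modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-namespace state. */
struct demo_netns {
    atomic_flag lock;     /* protects the fields below */
    bool delete_pending;  /* set once teardown has started */
    int nconns;           /* stand-in for the connection list */
};

static void demo_netns_init(struct demo_netns *ns)
{
    assert(ns);
    atomic_flag_clear(&ns->lock);
    ns->delete_pending = false;
    ns->nconns = 0;
}

static void demo_lock(struct demo_netns *ns)
{
    while (atomic_flag_test_and_set_explicit(&ns->lock,
                                             memory_order_acquire))
        ; /* spin until the lock is free */
}

static void demo_unlock(struct demo_netns *ns)
{
    atomic_flag_clear_explicit(&ns->lock, memory_order_release);
}

/* Add a connection unless teardown has begun. The flag test and the
 * insertion happen under one lock hold, so no connection can slip in
 * after demo_netns_kill() has collected the list. */
static int demo_conn_add(struct demo_netns *ns)
{
    int ret = 0;

    demo_lock(ns);
    if (ns->delete_pending)
        ret = -ENETDOWN;
    else
        ns->nconns++;
    demo_unlock(ns);
    return ret;
}

/* Begin teardown: set the flag and take ownership of every connection
 * added so far. Returns how many connections must be destroyed. */
static int demo_netns_kill(struct demo_netns *ns)
{
    int n;

    demo_lock(ns);
    ns->delete_pending = true;
    n = ns->nconns;
    ns->nconns = 0;
    demo_unlock(ns);
    return n;
}
```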

Reported-by: syzbot+bbd8e9a06452cc480...@syzkaller.appspotmail.com
Signed-off-by: Sowmini Varadhan 
---
 net/rds/connection.c |3 ++
 net/rds/tcp.c|   82 -
 net/rds/tcp.h|1 +
 3 files changed, 57 insertions(+), 29 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index b10c0ef..2ae539d 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -220,8 +220,10 @@ static void __rds_conn_path_init(struct rds_connection 
*conn,
 is_outgoing);
conn->c_path[i].cp_index = i;
}
+   rcu_read_lock();
ret = trans->conn_alloc(conn, gfp);
if (ret) {
+   rcu_read_unlock();
kfree(conn->c_path);
kmem_cache_free(rds_conn_slab, conn);
conn = ERR_PTR(ret);
@@ -283,6 +285,7 @@ static void __rds_conn_path_init(struct rds_connection 
*conn,
}
}
spin_unlock_irqrestore(&rds_conn_lock, flags);
+   rcu_read_unlock();
 
 out:
return conn;
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 9920d2f..2bdd3cc 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -274,14 +274,13 @@ static int rds_tcp_laddr_check(struct net *net, __be32 
addr)
 static void rds_tcp_conn_free(void *arg)
 {
struct rds_tcp_connection *tc = arg;
-   unsigned long flags;
 
rdsdebug("freeing tc %p\n", tc);
 
-   spin_lock_irqsave(&rds_tcp_conn_lock, flags);
+   spin_lock_bh(&rds_tcp_conn_lock);
if (!tc->t_tcp_node_detached)
list_del(&tc->t_tcp_node);
-   spin_unlock_irqrestore(&rds_tcp_conn_lock, flags);
+   spin_unlock_bh(&rds_tcp_conn_lock);
 
kmem_cache_free(rds_tcp_conn_slab, tc);
 }
@@ -296,7 +295,7 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, 
gfp_t gfp)
tc = kmem_cache_alloc(rds_tcp_conn_slab, gfp);
if (!tc) {
ret = -ENOMEM;
-   break;
+   goto fail;
}
mutex_init(&tc->t_conn_path_lock);
tc->t_sock = NULL;
@@ -306,14 +305,25 @@ static int rds_tcp_conn_alloc(struct rds_connection 
*conn, gfp_t gfp)
 
conn->c_path[i].cp_transport_data = tc;
tc->t_cpath = &conn->c_path[i];
+   tc->t_tcp_node_detached = true;
 
-   spin_lock_irq(&rds_tcp_conn_lock);
-   tc->t_tcp_node_detached = false;
-   list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
-   spin_unlock_irq(&rds_tcp_conn_lock);
rdsdebug("rds_conn_path [%d] tc %p\n", i,
 conn->c_path[i].cp_transport_data);
}
+   spin_lock_bh(&rds_tcp_conn_lock);
+   if (rds_tcp_netns_delete_pending(rds_conn_net(conn))) {
+   rdsdebug("RTN_DELETE_PENDING\n");
+   ret = -ENETDOWN;
+   spin_unlock_bh(&rds_tcp_conn_lock);
+   goto fail;
+   }
+   for (i = 0; i < RDS_MPATH_WORKERS; i++) {
+   tc = conn->c_path[i].cp_transport_data;
+   tc->t_tcp_node_detached = false;
+   list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
+   }
+   spin_unlock_bh(&rds_tcp_conn_lock);
+fail:
if (ret) {
for (j = 0; j < i; j++)
rds_tcp_conn_free(conn->c_path[j].cp_transport_data);
@@ -332,23 +342,6 @@ static bool list_has_conn(struct list_head *list, struct 
rds_connection *conn)
return false;
 }
 
-static void rds_tcp_destroy_conns(void)
-{
-   struct rds_tcp_connection *tc, *_tc;
-   LIST_HEAD(tmp_list);
-
-   /* avoid calling conn_destroy with irqs off */
-   spin_lock_irq(&rds_tcp_conn_lock);
-   list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) {
-   if (!list_has_conn(&tmp_list,

Re: [PATCH net-next 1/4] net: core: Fix kernel-doc for carrier_* attributes

2018-01-24 Thread kbuild test robot

Hi Florian,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-core-Fix-kernel-doc-for-carrier_-attributes/20180125-062300
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   Warning: Could not extract kernel version
   WARNING: convert(1) not found, for SVG to PDF conversion install ImageMagick (https://www.imagemagick.org)
   include/crypto/hash.h:89: warning: duplicate section name 'Note'
   include/crypto/hash.h:95: warning: duplicate section name 'Note'
   include/crypto/hash.h:102: warning: duplicate section name 'Note'
   include/linux/gpio/driver.h:142: warning: No description found for parameter 'request_key'
   drivers/gpio/gpiolib.c:602: warning: No description found for parameter '16'
   drivers/gpio/gpiolib.c:602: warning: Excess struct member 'events' description in 'lineevent_state'
   include/linux/iio/iio.h:610: warning: No description found for parameter 'iio_dev'
   include/linux/iio/iio.h:610: warning: Excess function parameter 'indio_dev' description in 'iio_device_register'
   include/linux/iio/trigger.h:79: warning: No description found for parameter 'owner'
   fs/inode.c:1680: warning: No description found for parameter 'rcu'
   include/linux/jbd2.h:443: warning: No description found for parameter 'i_transaction'
   include/linux/jbd2.h:443: warning: No description found for parameter 'i_next_transaction'
   include/linux/jbd2.h:443: warning: No description found for parameter 'i_list'
   include/linux/jbd2.h:443: warning: No description found for parameter 'i_vfs_inode'
   include/linux/jbd2.h:443: warning: No description found for parameter 'i_flags'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_rsv_handle'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_reserved'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_type'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_line_no'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_start_jiffies'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_requested_credits'
   include/linux/jbd2.h:497: warning: No description found for parameter 'saved_alloc_context'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_chkpt_bhs'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_devname'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_average_commit_time'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_min_batch_time'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_max_batch_time'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_commit_callback'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_failed_commit'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_chksum_driver'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_csum_seed'
   fs/jbd2/transaction.c:511: warning: No description found for parameter 'type'
   fs/jbd2/transaction.c:511: warning: No description found for parameter 'line_no'
   fs/jbd2/transaction.c:641: warning: No description found for parameter 'gfp_mask'
   include/drm/drm_drv.h:594: warning: No description found for parameter 'gem_prime_pin'

Re: [PATCH 10/10] kill kernel_sock_ioctl()

2018-01-24 Thread Al Viro

On Thu, Jan 25, 2018 at 12:01:25AM +0000, Al Viro wrote:
> On Wed, Jan 24, 2018 at 03:52:44PM -0500, David Miller wrote:
> > 
> > Al this series looks fine to me, want me to toss it into net-next?
> 
> Do you want them reposted (with updated commit messages), or would
> you prefer a pull request (with or without rebase to current tip
> of net-next)?

Below is a pull request for rebased branch.  Patches themselves are
identical to what had been posted, Reviewed-by added and commit message
for "kill dev_ifsioc()" made more detailed.

The following changes since commit be1b6e8b5470e8311bfa1a3dfd7bd59e85a99759:

  Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue (2018-01-24 18:02:17 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git rebased-net-ioctl

for you to fetch changes up to 5c59e564e46dcbab2ee7a4e9e0243562a39679a2:

  kill kernel_sock_ioctl() (2018-01-24 19:13:45 -0500)

Al Viro (10):
  net: separate SIOCGIFCONF handling from dev_ioctl()
  devinet_ioctl(): take copyin/copyout to caller
  ip_rt_ioctl(): take copyin to caller
  kill dev_ifsioc()
  kill bond_ioctl()
  kill dev_ifname32()
  lift handling of SIOCIW... out of dev_ioctl()
  ipconfig: use dev_set_mtu()
  dev_ioctl(): move copyin/copyout to callers
  kill kernel_sock_ioctl()

 include/linux/inetdevice.h |   2 +-
 include/linux/net.h|   1 -
 include/linux/netdevice.h  |   7 +-
 include/net/route.h|   2 +-
 include/net/wext.h |   4 +-
 net/core/dev_ioctl.c   | 132 ++
 net/ipv4/af_inet.c |  28 -
 net/ipv4/devinet.c |  57 --
 net/ipv4/fib_frontend.c|   8 +-
 net/ipv4/ipconfig.c|  47 ++--
 net/socket.c   | 271 -
 net/wireless/wext-core.c   |  13 ++-
 12 files changed, 173 insertions(+), 399 deletions(-)

[PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc()

2018-01-24 Thread Roman Gushchin

We've caught several cgroup css refcounting issues on 4.15-rc7,
triggered from different release paths. We were using cgroups v2.
I added a temporary per-memcg sockmem atomic counter
and found that it sometimes falls below 0. It was easy
to reproduce, so I was able to bisect the problem.

It was introduced by the commit 9f1c2674b328 ("net: memcontrol:
defer call to mem_cgroup_sk_alloc()"), which moved
the mem_cgroup_sk_alloc() call from the BH context
into inet_csk_accept().

The problem is that all the memory allocated before
mem_cgroup_sk_alloc() is charged to the socket,
but not charged to the memcg. So, when we're releasing
the socket, we're uncharging more than we've charged.

Fix this by charging the cgroup by the amount of already
allocated memory right after mem_cgroup_sk_alloc() in
inet_csk_accept().
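
The imbalance described above can be reproduced with a toy accounting model.
The sketch below is plain user-space C under stated assumptions: struct
model_sock, memcg_charged and the model_*() helpers are hypothetical and only
mimic the bookkeeping, not the kernel's socket or memcg code.

```c
#include <stdbool.h>

/* Hypothetical accounting model; names are illustrative, not kernel APIs. */
struct model_sock {
	long allocated;     /* pages accounted to the socket */
	bool has_memcg;     /* set once the socket is attached to a memcg */
};

static long memcg_charged;  /* pages charged to the (single) memcg */

/* Account pages to the socket, and to the memcg if one is attached. */
static void model_charge(struct model_sock *sk, long pages)
{
	sk->allocated += pages;
	if (sk->has_memcg)
		memcg_charged += pages;
}

/* Attach to the memcg; with catch_up set, charge what was allocated earlier. */
static void model_attach(struct model_sock *sk, bool catch_up)
{
	sk->has_memcg = true;
	if (catch_up && sk->allocated)
		memcg_charged += sk->allocated;
}

/* Release everything: the memcg is uncharged by the full socket amount. */
static void model_release(struct model_sock *sk)
{
	if (sk->has_memcg)
		memcg_charged -= sk->allocated;
	sk->allocated = 0;
}

/* Returns the memcg balance after one alloc -> attach -> release cycle. */
static long model_cycle(bool catch_up)
{
	struct model_sock sk = { 0, false };

	memcg_charged = 0;
	model_charge(&sk, 3);        /* allocated before the memcg is known */
	model_attach(&sk, catch_up);
	model_charge(&sk, 2);        /* allocated after the attach */
	model_release(&sk);
	return memcg_charged;
}
```

In this model, model_cycle(false) leaves the balance negative by the amount
allocated before the attach, while model_cycle(true), which mirrors the
catch-up charge this patch adds, returns to zero.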

Fixes: 9f1c2674b328 ("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
Signed-off-by: Roman Gushchin 
Cc: Eric Dumazet 
Cc: Johannes Weiner 
Cc: Tejun Heo 
Cc: David S. Miller 
---
 net/ipv4/inet_connection_sock.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 4ca46dc08e63..f439162c2ea2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -434,6 +434,7 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, 
int *err, bool kern)
struct request_sock *req;
struct sock *newsk;
int error;
+   long amt;
 
lock_sock(sk);
 
@@ -476,6 +477,10 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, 
int *err, bool kern)
spin_unlock_bh(&queue->fastopenq.lock);
}
mem_cgroup_sk_alloc(newsk);
+   amt = sk_memory_allocated(newsk);
+   if (amt && newsk->sk_memcg)
+   mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
+
 out:
release_sock(sk);
if (req)
-- 
2.14.3

[PATCH net-next 3/8] mlx5: use tc_cls_can_offload_and_chain0()

2018-01-24 Thread Jakub Kicinski

Make use of tc_cls_can_offload_and_chain0() to set extack msg in case
ethtool tc offload flag is not set or chain unsupported.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
CC: Saeed Mahameed 
CC: Or Gerlitz 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c  | 5 +
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8530c770c873..47bab842c5ee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2944,9 +2944,6 @@ static int mlx5e_setup_tc_mqprio(struct net_device 
*netdev,
 static int mlx5e_setup_tc_cls_flower(struct mlx5e_priv *priv,
 struct tc_cls_flower_offload *cls_flower)
 {
-   if (cls_flower->common.chain_index)
-   return -EOPNOTSUPP;
-
switch (cls_flower->command) {
case TC_CLSFLOWER_REPLACE:
return mlx5e_configure_flower(priv, cls_flower);
@@ -2964,7 +2961,7 @@ int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void 
*type_data,
 {
struct mlx5e_priv *priv = cb_priv;
 
-   if (!tc_can_offload(priv->netdev))
+   if (!tc_cls_can_offload_and_chain0(priv->netdev, type_data))
return -EOPNOTSUPP;
 
switch (type) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 10fa6a18fcf9..363d8dcb7f17 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -719,9 +719,6 @@ static int
 mlx5e_rep_setup_tc_cls_flower(struct mlx5e_priv *priv,
  struct tc_cls_flower_offload *cls_flower)
 {
-   if (cls_flower->common.chain_index)
-   return -EOPNOTSUPP;
-
switch (cls_flower->command) {
case TC_CLSFLOWER_REPLACE:
return mlx5e_configure_flower(priv, cls_flower);
@@ -739,7 +736,7 @@ static int mlx5e_rep_setup_tc_cb(enum tc_setup_type type, 
void *type_data,
 {
struct mlx5e_priv *priv = cb_priv;
 
-   if (!tc_can_offload(priv->netdev))
+   if (!tc_cls_can_offload_and_chain0(priv->netdev, type_data))
return -EOPNOTSUPP;
 
switch (type) {
-- 
2.15.1

[PATCH net-next 1/8] pkt_cls: add new tc cls helper to check offload flag and chain index

2018-01-24 Thread Jakub Kicinski

Very few (mlxsw) upstream drivers seem to allow offload of chains
other than 0.  Save driver developers typing and add a helper for
checking both if ethtool's TC offload flag is on and if chain is 0.
This helper will set the extack appropriately in both error cases.
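
To make the order of the two checks concrete, here is a small user-space
model of such a helper. All model_* types and functions are hypothetical
simplifications; in particular the "TC offload is disabled" wording is only
an illustrative message, and the real helper is the one added to
include/net/pkt_cls.h in the diff below.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_extack { const char *msg; };
struct model_dev { bool hw_tc_offload; };
struct model_common {
	unsigned int chain_index;
	struct model_extack *extack;
};

/* Record an error message if the caller supplied an extack. */
static void model_set_msg(struct model_extack *extack, const char *msg)
{
	if (extack)
		extack->msg = msg;
}

/* Models the existing "is the offload flag on" check. */
static bool model_can_offload_extack(const struct model_dev *dev,
				     struct model_extack *extack)
{
	if (!dev->hw_tc_offload)
		model_set_msg(extack, "TC offload is disabled on net device");
	return dev->hw_tc_offload;
}

/* Models the combined helper: reject non-zero chains, then check the flag. */
static bool model_cls_can_offload_and_chain0(const struct model_dev *dev,
					     struct model_common *common)
{
	if (common->chain_index) {
		model_set_msg(common->extack,
			      "Driver supports only offload of chain 0");
		return false;
	}
	return model_can_offload_extack(dev, common->extack);
}
```

A driver callback would then need a single call where it previously had two
separate checks, and both failure paths leave a message for user space.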

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  4 +---
 drivers/net/netdevsim/bpf.c   |  5 +
 include/net/pkt_cls.h | 12 
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index b3206855535a..322027792fe8 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -130,7 +130,7 @@ static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type 
type,
   "only offload of BPF classifiers supported");
return -EOPNOTSUPP;
}
-   if (!tc_can_offload_extack(nn->dp.netdev, cls_bpf->common.extack))
+   if (!tc_cls_can_offload_and_chain0(nn->dp.netdev, &cls_bpf->common))
return -EOPNOTSUPP;
if (!nfp_net_ebpf_capable(nn)) {
NL_SET_ERR_MSG_MOD(cls_bpf->common.extack,
@@ -142,8 +142,6 @@ static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type 
type,
   "only ETH_P_ALL supported as filter 
protocol");
return -EOPNOTSUPP;
}
-   if (cls_bpf->common.chain_index)
-   return -EOPNOTSUPP;
 
/* Only support TC direct action */
if (!cls_bpf->exts_integrated ||
diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c
index 8166f121bbcc..de73c1ff0939 100644
--- a/drivers/net/netdevsim/bpf.c
+++ b/drivers/net/netdevsim/bpf.c
@@ -135,7 +135,7 @@ int nsim_bpf_setup_tc_block_cb(enum tc_setup_type type,
return -EOPNOTSUPP;
}
 
-   if (!tc_can_offload_extack(ns->netdev, cls_bpf->common.extack))
+   if (!tc_cls_can_offload_and_chain0(ns->netdev, &cls_bpf->common))
return -EOPNOTSUPP;
 
if (cls_bpf->common.protocol != htons(ETH_P_ALL)) {
@@ -144,9 +144,6 @@ int nsim_bpf_setup_tc_block_cb(enum tc_setup_type type,
return -EOPNOTSUPP;
}
 
-   if (cls_bpf->common.chain_index)
-   return -EOPNOTSUPP;
-
if (!ns->bpf_tc_accept) {
NSIM_EA(cls_bpf->common.extack,
"netdevsim configured to reject BPF TC offload");
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 1a41513cec7f..4db08d7dd22c 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -656,6 +656,18 @@ static inline bool tc_can_offload_extack(const struct 
net_device *dev,
return can;
 }
 
+static inline bool
+tc_cls_can_offload_and_chain0(const struct net_device *dev,
+ struct tc_cls_common_offload *common)
+{
+   if (common->chain_index) {
+   NL_SET_ERR_MSG(common->extack,
+  "Driver supports only offload of chain 0");
+   return false;
+   }
+   return tc_can_offload_extack(dev, common->extack);
+}
+
 static inline bool tc_skip_hw(u32 flags)
 {
return (flags & TCA_CLS_FLAGS_SKIP_HW) ? true : false;
-- 
2.15.1

[PATCH net-next 0/8] use tc_cls_can_offload_and_chain0() throughout the drivers

2018-01-24 Thread Jakub Kicinski

Hi!

This set makes most drivers use a new tc_cls_can_offload_and_chain0()
helper which will set extack in case TC hw offload flag is disabled.
i40e patch will follow after net -> net-next merge.

I chose to keep the new helper which also looks at the chain but
renamed it more appropriately.  The rationale being that most drivers
don't accept chains other than 0 and since we have to pass extack
to the helper we may as well pass the entire struct tc_cls_common_offload
and perform the most common checks.  Jiri, please let me know if that's
acceptable for you.

This code makes the assumption that type_data in the callback can
be interpreted as struct tc_cls_common_offload, i.e. the real offload
structure has common part as the first member.  This allows us to
make the check once for all classifier types if driver supports
more than one.  This also means I've dropped the last patch of
the RFC (preventing use of common before type validation in nfp).
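
The layout assumption in the paragraph above can be sketched in standalone C.
The model_* structures are hypothetical miniatures of the per-classifier
offload structs; the point is only that a pointer to a struct may be read
through a pointer to its first member, and that the assumption can be checked
at compile time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the shared "common" part. */
struct model_common {
	unsigned int chain_index;
};

/* Two classifier-specific structs that both start with the common part. */
struct model_flower_offload {
	struct model_common common;  /* must stay the first member */
	int command;
};

struct model_u32_offload {
	struct model_common common;  /* must stay the first member */
	int command;
};

/* Fail the build if someone moves "common" away from offset 0. */
_Static_assert(offsetof(struct model_flower_offload, common) == 0,
	       "common must be the first member");
_Static_assert(offsetof(struct model_u32_offload, common) == 0,
	       "common must be the first member");

/* One check for every classifier type: treat type_data as the common part. */
static bool model_chain0(void *type_data)
{
	const struct model_common *common = type_data;

	return common->chain_index == 0;
}
```

The static assertions are an addition of this sketch, not something the
series does; they show one way to guard the assumption if it were ever at
risk.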

Jakub Kicinski (8):
  pkt_cls: add new tc cls helper to check offload flag and chain index
  bnxt: use tc_cls_can_offload_and_chain0()
  nfp: flower: use tc_cls_can_offload_and_chain0()
  cxgb4: use tc_cls_can_offload_and_chain0()
  ixgbe: use tc_cls_can_offload_and_chain0()
  mlx5: use tc_cls_can_offload_and_chain0()
  mlxsw: use tc_cls_can_offload_and_chain0()
  selftests/bpf: check for spurious extacks from the driver

 drivers/net/ethernet/broadcom/bnxt/bnxt.c  |  3 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   |  3 ---
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c  |  3 ++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c|  8 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  5 +---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  5 +---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  5 +---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  6 ++---
 drivers/net/ethernet/netronome/nfp/bpf/main.c  |  4 +---
 .../net/ethernet/netronome/nfp/flower/offload.c|  7 +++---
 drivers/net/netdevsim/bpf.c|  5 +---
 include/net/pkt_cls.h  | 12 ++
 tools/testing/selftests/bpf/test_offload.py| 27 ++
 13 files changed, 54 insertions(+), 39 deletions(-)

-- 
2.15.1

[PATCH net-next 2/8] cxgb4: use tc_cls_can_offload_and_chain0()

2018-01-24 Thread Jakub Kicinski

Make use of tc_cls_can_offload_and_chain0() to set extack msg in case
ethtool tc offload flag is not set or chain unsupported.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
CC: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index f0fd2eba30c2..1e3cd8abc56d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2928,9 +2928,6 @@ static int cxgb_set_tx_maxrate(struct net_device *dev, 
int index, u32 rate)
 static int cxgb_setup_tc_flower(struct net_device *dev,
struct tc_cls_flower_offload *cls_flower)
 {
-   if (cls_flower->common.chain_index)
-   return -EOPNOTSUPP;
-
switch (cls_flower->command) {
case TC_CLSFLOWER_REPLACE:
return cxgb4_tc_flower_replace(dev, cls_flower);
@@ -2946,9 +2943,6 @@ static int cxgb_setup_tc_flower(struct net_device *dev,
 static int cxgb_setup_tc_cls_u32(struct net_device *dev,
 struct tc_cls_u32_offload *cls_u32)
 {
-   if (cls_u32->common.chain_index)
-   return -EOPNOTSUPP;
-
switch (cls_u32->command) {
case TC_CLSU32_NEW_KNODE:
case TC_CLSU32_REPLACE_KNODE:
@@ -2974,7 +2968,7 @@ static int cxgb_setup_tc_block_cb(enum tc_setup_type 
type, void *type_data,
return -EINVAL;
}
 
-   if (!tc_can_offload(dev))
+   if (!tc_cls_can_offload_and_chain0(dev, type_data))
return -EOPNOTSUPP;
 
switch (type) {
-- 
2.15.1

[PATCH net-next 6/8] ixgbe: use tc_cls_can_offload_and_chain0()

2018-01-24 Thread Jakub Kicinski

Make use of tc_cls_can_offload_and_chain0() to set extack msg in case
ethtool tc offload flag is not set or chain unsupported.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
CC: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 722cc3153a99..bbb622f15a77 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9303,9 +9303,6 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter 
*adapter,
 static int ixgbe_setup_tc_cls_u32(struct ixgbe_adapter *adapter,
  struct tc_cls_u32_offload *cls_u32)
 {
-   if (cls_u32->common.chain_index)
-   return -EOPNOTSUPP;
-
switch (cls_u32->command) {
case TC_CLSU32_NEW_KNODE:
case TC_CLSU32_REPLACE_KNODE:
@@ -9327,7 +9324,7 @@ static int ixgbe_setup_tc_block_cb(enum tc_setup_type 
type, void *type_data,
 {
struct ixgbe_adapter *adapter = cb_priv;
 
-   if (!tc_can_offload(adapter->netdev))
+   if (!tc_cls_can_offload_and_chain0(adapter->netdev, type_data))
return -EOPNOTSUPP;
 
switch (type) {
-- 
2.15.1

[PATCH net-next 8/8] selftests/bpf: check for spurious extacks from the driver

2018-01-24 Thread Jakub Kicinski

Drivers should not report errors when offload is not forced.
Check stdout and stderr for familiar messages, both with no
skip flags and with skip_hw.  Check for add, replace, and
destroy.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 tools/testing/selftests/bpf/test_offload.py | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_offload.py 
b/tools/testing/selftests/bpf/test_offload.py
index ae3eea3ab820..3a43fbc896db 100755
--- a/tools/testing/selftests/bpf/test_offload.py
+++ b/tools/testing/selftests/bpf/test_offload.py
@@ -543,6 +543,10 @@ netns = [] # net namespaces to be removed
 def check_extack_nsim(output, reference, args):
 check_extack(output, "Error: netdevsim: " + reference, args)
 
+def check_no_extack(res, needle):
+fail((res[1] + res[2]).find(needle) != -1,
+ "Found '%s' in command output, leaky extack?" % (needle))
+
 def check_verifier_log(output, reference):
 lines = output.split("\n")
 for l in reversed(lines):
@@ -550,6 +554,18 @@ netns = [] # net namespaces to be removed
 return
 fail(True, "Missing or incorrect message from netdevsim in verifier log")
 
+def test_spurios_extack(sim, obj, skip_hw, needle):
+res = sim.cls_bpf_add_filter(obj, prio=1, handle=1, skip_hw=skip_hw,
+ include_stderr=True)
+check_no_extack(res, needle)
+res = sim.cls_bpf_add_filter(obj, op="replace", prio=1, handle=1,
+ skip_hw=skip_hw, include_stderr=True)
+check_no_extack(res, needle)
+res = sim.cls_filter_op(op="delete", prio=1, handle=1, cls="bpf",
+include_stderr=True)
+check_no_extack(res, needle)
+
+
 # Parse command line
 parser = argparse.ArgumentParser()
 parser.add_argument("--log", help="output verbose log to given file")
@@ -687,6 +703,17 @@ netns = []
  (j))
 sim.cls_filter_op(op="delete", prio=1, handle=1, cls="bpf")
 
+start_test("Test spurious extack from the driver...")
+test_spurios_extack(sim, obj, False, "netdevsim")
+test_spurios_extack(sim, obj, True, "netdevsim")
+
+sim.set_ethtool_tc_offloads(False)
+
+test_spurios_extack(sim, obj, False, "TC offload is disabled")
+test_spurios_extack(sim, obj, True, "TC offload is disabled")
+
+sim.set_ethtool_tc_offloads(True)
+
 sim.tc_flush_filters()
 
 start_test("Test TC offloads work...")
-- 
2.15.1

[PATCH net-next 4/8] bnxt: use tc_cls_can_offload_and_chain0()

2018-01-24 Thread Jakub Kicinski

Make use of tc_cls_can_offload_and_chain0() to set extack msg in case
ethtool tc offload flag is not set or chain unsupported.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
CC: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  | 3 ---
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c | 3 ++-
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6b7e99675571..4b001d2050c2 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7778,7 +7778,8 @@ static int bnxt_setup_tc_block_cb(enum tc_setup_type 
type, void *type_data,
 {
struct bnxt *bp = cb_priv;
 
-   if (!bnxt_tc_flower_enabled(bp) || !tc_can_offload(bp->dev))
+   if (!bnxt_tc_flower_enabled(bp) ||
+   !tc_cls_can_offload_and_chain0(bp->dev, type_data))
return -EOPNOTSUPP;
 
switch (type) {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 2ece1645f55d..fbe6e208e17b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -1474,9 +1474,6 @@ int bnxt_tc_setup_flower(struct bnxt *bp, u16 src_fid,
 {
int rc = 0;
 
-   if (cls_flower->common.chain_index)
-   return -EOPNOTSUPP;
-
switch (cls_flower->command) {
case TC_CLSFLOWER_REPLACE:
rc = bnxt_tc_add_flow(bp, src_fid, cls_flower);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
index 2ca11be64182..26290403f38f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
@@ -124,7 +124,8 @@ static int bnxt_vf_rep_setup_tc_block_cb(enum tc_setup_type 
type,
struct bnxt *bp = vf_rep->bp;
int vf_fid = bp->pf.vf[vf_rep->vf_idx].fw_fid;
 
-   if (!bnxt_tc_flower_enabled(vf_rep->bp) || !tc_can_offload(bp->dev))
+   if (!bnxt_tc_flower_enabled(vf_rep->bp) ||
+   !tc_cls_can_offload_and_chain0(bp->dev, type_data))
return -EOPNOTSUPP;
 
switch (type) {
-- 
2.15.1

[PATCH net-next 7/8] mlxsw: use tc_cls_can_offload_and_chain0()

2018-01-24 Thread Jakub Kicinski

Make use of tc_cls_can_offload_and_chain0() to set extack msg in case
ethtool tc offload flag is not set or chain unsupported.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
CC: Jiri Pirko 
CC: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 833cd0a96fd9..3dcc58d61506 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1738,9 +1738,6 @@ static int mlxsw_sp_setup_tc_cls_matchall(struct 
mlxsw_sp_port *mlxsw_sp_port,
  struct tc_cls_matchall_offload *f,
  bool ingress)
 {
-   if (f->common.chain_index)
-   return -EOPNOTSUPP;
-
switch (f->command) {
case TC_CLSMATCHALL_REPLACE:
return mlxsw_sp_port_add_cls_matchall(mlxsw_sp_port, f,
@@ -1780,7 +1777,8 @@ static int mlxsw_sp_setup_tc_block_cb_matchall(enum 
tc_setup_type type,
 
switch (type) {
case TC_SETUP_CLSMATCHALL:
-   if (!tc_can_offload(mlxsw_sp_port->dev))
+   if (!tc_cls_can_offload_and_chain0(mlxsw_sp_port->dev,
+  type_data))
return -EOPNOTSUPP;
 
return mlxsw_sp_setup_tc_cls_matchall(mlxsw_sp_port, type_data,
-- 
2.15.1

[PATCH net-next 5/8] nfp: flower: use tc_cls_can_offload_and_chain0()

2018-01-24 Thread Jakub Kicinski

Make use of tc_cls_can_offload_and_chain0() to set extack msg in case
ethtool tc offload flag is not set or chain unsupported.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 837134a9137c..08c4c6dc5f7f 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -483,8 +483,7 @@ static int
 nfp_flower_repr_offload(struct nfp_app *app, struct net_device *netdev,
struct tc_cls_flower_offload *flower, bool egress)
 {
-   if (!eth_proto_is_802_3(flower->common.protocol) ||
-   flower->common.chain_index)
+   if (!eth_proto_is_802_3(flower->common.protocol))
return -EOPNOTSUPP;
 
switch (flower->command) {
@@ -504,7 +503,7 @@ int nfp_flower_setup_tc_egress_cb(enum tc_setup_type type, 
void *type_data,
 {
struct nfp_repr *repr = cb_priv;
 
-   if (!tc_can_offload(repr->netdev))
+   if (!tc_cls_can_offload_and_chain0(repr->netdev, type_data))
return -EOPNOTSUPP;
 
switch (type) {
@@ -521,7 +520,7 @@ static int nfp_flower_setup_tc_block_cb(enum tc_setup_type 
type,
 {
struct nfp_repr *repr = cb_priv;
 
-   if (!tc_can_offload(repr->netdev))
+   if (!tc_cls_can_offload_and_chain0(repr->netdev, type_data))
return -EOPNOTSUPP;
 
switch (type) {
-- 
2.15.1

Re: [PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()

2018-01-24 Thread Pablo Neira Ayuso

On Thu, Jan 25, 2018 at 12:50:56AM +0100, Pablo Neira Ayuso wrote:
> On Thu, Jan 25, 2018 at 12:19:52AM +0100, Florian Westphal wrote:
> > Eric Dumazet  wrote:
> > > From: Eric Dumazet 
> > > 
> > > It looks like syzbot found its way into netfilter territory.
> > 
> > Excellent.  This will sure allow to find and fix more bugs :-)
> > 
> > > Issue here is that @name comes from user space and might
> > > not be null terminated.
> > 
> > Indeed, thanks for fixing this Eric.
> > 
> > xt_find_target() and xt_find_table_lock() might have similar issues.
> 
> I'm going to keep back this patch then, it would be good if we can
> find this in one single patch.

s/find/fix/

Sorry.
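
For readers following the thread, the class of bug quoted above (a name
copied from user space that may lack a terminating NUL) can be guarded
against with a bounded scan. This is a user-space sketch only:
MODEL_NAME_MAX and model_check_name() are hypothetical, and the actual
kernel fix may be written differently.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Illustrative size of the fixed name field; chosen arbitrarily here. */
#define MODEL_NAME_MAX 29

/*
 * Return 0 if the fixed-size buffer holds a NUL-terminated name,
 * -EINVAL otherwise. memchr() never looks past MODEL_NAME_MAX bytes,
 * so an unterminated buffer is safe to test.
 */
static int model_check_name(const char *name)
{
	if (memchr(name, '\0', MODEL_NAME_MAX) == NULL)
		return -EINVAL;
	return 0;
}
```

Only after a check like this is it safe to hand the buffer to functions
that expect a C string, such as strcmp().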

[PATCH nf-next,RFC v4] netfilter: nf_flow_table: add hardware offload support

2018-01-24 Thread Pablo Neira Ayuso

This patch adds the infrastructure to offload flows to hardware, in case
the nic/switch comes with built-in flow table capabilities.

If the hardware comes with no hardware flow tables or they have
limitations in terms of features, the existing infrastructure falls back
to the software flow table implementation.

The software flow table garbage collector skips entries that reside in
the hardware, so the hardware will be responsible for releasing this
flow table entry too via flow_offload_dead().

Hardware configuration, either to add or to delete entries, is done from
the hardware offload workqueue, to ensure this is done from user context
given that we may sleep when grabbing the mdio mutex.
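
The garbage-collector rule described above can be pictured with a short
user-space sketch. The two flag values mirror FLOW_OFFLOAD_DYING and
FLOW_OFFLOAD_HW from this patch; struct model_flow and model_gc() are
hypothetical and do not reflect the rhashtable walk the real collector uses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag values mirror the ones in this patch; the rest is hypothetical. */
#define MODEL_FLOW_DYING 0x4
#define MODEL_FLOW_HW    0x8

struct model_flow {
	unsigned int flags;
	bool expired;       /* software timeout has elapsed */
};

/* Software GC pass: returns how many entries it would release. */
static size_t model_gc(struct model_flow *flows, size_t n)
{
	size_t released = 0;

	for (size_t i = 0; i < n; i++) {
		/* Entries living in hardware are left for the hardware path. */
		if (flows[i].flags & MODEL_FLOW_HW)
			continue;
		if (flows[i].expired || (flows[i].flags & MODEL_FLOW_DYING)) {
			flows[i].flags |= MODEL_FLOW_DYING;
			released++;
		}
	}
	return released;
}
```

An expired entry with the hardware flag set is deliberately not counted
here, which is why the hardware side has to report the flow as dead itself.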

Signed-off-by: Pablo Neira Ayuso 
---
v4: More work in progress
- Decouple nf_flow_table_hw from nft_flow_offload via rcu hooks
- Consolidate ->ndo invocations, now they happen from the hw worker.
- Fix bug in list handling, use list_replace_init()
- cleanup entries on nf_flow_table_hw module removal
- add NFT_FLOWTABLE_F_HW flag to flowtables to explicitly signal that the user
  wants to offload entries to hardware.

 include/linux/netdevice.h|   9 ++
 include/net/netfilter/nf_flow_table.h|  16 +++
 include/uapi/linux/netfilter/nf_tables.h |  11 ++
 net/netfilter/Kconfig|   9 ++
 net/netfilter/Makefile   |   1 +
 net/netfilter/nf_flow_table.c|  60 +++
 net/netfilter/nf_flow_table_hw.c | 174 +++
 net/netfilter/nf_tables_api.c|  12 ++-
 net/netfilter/nft_flow_offload.c |   5 +
 9 files changed, 296 insertions(+), 1 deletion(-)
 create mode 100644 net/netfilter/nf_flow_table_hw.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ed0799a12bf2..be0c12acc3f0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -859,6 +859,13 @@ struct dev_ifalias {
char ifalias[];
 };
 
+struct flow_offload;
+
+enum flow_offload_type {
+   FLOW_OFFLOAD_ADD= 0,
+   FLOW_OFFLOAD_DEL,
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1316,6 +1323,8 @@ struct net_device_ops {
int (*ndo_bridge_dellink)(struct net_device *dev,
  struct nlmsghdr *nlh,
  u16 flags);
+   int (*ndo_flow_offload)(enum flow_offload_type type,
+   struct flow_offload *flow);
int (*ndo_change_carrier)(struct net_device *dev,
  bool new_carrier);
int (*ndo_get_phys_port_id)(struct net_device *dev,
diff --git a/include/net/netfilter/nf_flow_table.h 
b/include/net/netfilter/nf_flow_table.h
index ed49cd169ecf..69067deb61b6 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -22,7 +22,9 @@ struct nf_flowtable_type {
 struct nf_flowtable {
struct rhashtable   rhashtable;
const struct nf_flowtable_type  *type;
+   u32 flags;
struct delayed_work gc_work;
+   possible_net_t  ft_net;
 };
 
 enum flow_offload_tuple_dir {
@@ -65,6 +67,7 @@ struct flow_offload_tuple_rhash {
 #define FLOW_OFFLOAD_SNAT  0x1
 #define FLOW_OFFLOAD_DNAT  0x2
 #define FLOW_OFFLOAD_DYING 0x4
+#define FLOW_OFFLOAD_HW0x8
 
 struct flow_offload {
struct flow_offload_tuple_rhash tuplehash[FLOW_OFFLOAD_DIR_MAX];
@@ -119,6 +122,19 @@ unsigned int nf_flow_offload_ip_hook(void *priv, struct 
sk_buff *skb,
 unsigned int nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
   const struct nf_hook_state *state);
 
+void nf_flow_offload_hw_add(struct net *net, struct flow_offload *flow,
+   struct nf_conn *ct);
+void nf_flow_offload_hw_del(struct net *net, struct flow_offload *flow);
+
+struct nf_flow_table_hw {
+   void (*add)(struct net *net, struct flow_offload *flow,
+   struct nf_conn *ct);
+   void (*del)(struct net *net, struct flow_offload *flow);
+};
+
+int nf_flow_table_hw_register(const struct nf_flow_table_hw *offload);
+void nf_flow_table_hw_unregister(const struct nf_flow_table_hw *offload);
+
 #define MODULE_ALIAS_NF_FLOWTABLE(family)  \
MODULE_ALIAS("nf-flowtable-" __stringify(family))
 
diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 66dceee0ae30..1974829d6440 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -1334,6 +1334,15 @@ enum nft_object_attributes {
 #define NFTA_OBJ_MAX

[PATCH bpf-next v7 0/5] libbpf: add XDP setup support

2018-01-24 Thread Eric Leblond


Hello,

This patchset fixes the problem found by Alexei when building libbpf on a
system with old headers. It has been tested on an old Ubuntu and seems
to behave fine.

Best regards,
--
Eric

[PATCH bpf-next v7 2/5] libbpf: add function to setup XDP

2018-01-24 Thread Eric Leblond

Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
slightly modified to be library compliant.
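
As a rough illustration of the attribute layout that this function
builds by hand (a nested IFLA_XDP attribute carrying IFLA_XDP_FD and,
optionally, IFLA_XDP_FLAGS), here is a minimal userspace sketch. All
toy_*/TOY_* names are hypothetical stand-ins rather than the uapi
definitions. The sketch assumes 4-byte attribute alignment and a 4-byte
int, and it only fills a buffer without sending anything.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the netlink attribute header (length, then type). */
struct toy_nlattr {
	uint16_t nla_len;
	uint16_t nla_type;
};

#define TOY_NLA_ALIGN(len)  (((len) + 3u) & ~3u)
#define TOY_NLA_HDRLEN      ((unsigned int)TOY_NLA_ALIGN(sizeof(struct toy_nlattr)))
#define TOY_NLA_F_NESTED    (1u << 15)
#define TOY_IFLA_XDP        43	/* values as used in the patch */
#define TOY_IFLA_XDP_FD     1
#define TOY_IFLA_XDP_FLAGS  3

/* Write one attribute (header + payload) at dst; return its unpadded length. */
static unsigned int toy_put_attr(char *dst, uint16_t type,
				 const void *data, uint16_t datalen)
{
	struct toy_nlattr hdr = {
		.nla_len = (uint16_t)(TOY_NLA_HDRLEN + datalen),
		.nla_type = type,
	};

	memcpy(dst, &hdr, sizeof(hdr));
	memcpy(dst + TOY_NLA_HDRLEN, data, datalen);
	return hdr.nla_len;
}

/*
 * Build IFLA_XDP { IFLA_XDP_FD [, IFLA_XDP_FLAGS] } into buf.
 * Returns the nested attribute's nla_len, or 0 if buf is too small.
 * Both payloads are 4 bytes here, so no padding is needed between them.
 */
static unsigned int toy_build_xdp_attr(char *buf, size_t buflen,
				       int fd, uint32_t flags)
{
	struct toy_nlattr outer;
	unsigned int len = TOY_NLA_HDRLEN;	/* room for the outer header */
	size_t need = TOY_NLA_HDRLEN + TOY_NLA_HDRLEN + sizeof(fd) +
		      (flags ? TOY_NLA_HDRLEN + sizeof(flags) : 0);

	if (buflen < need)
		return 0;

	len += toy_put_attr(buf + len, TOY_IFLA_XDP_FD, &fd, sizeof(fd));
	if (flags)
		len += toy_put_attr(buf + len, TOY_IFLA_XDP_FLAGS,
				    &flags, sizeof(flags));

	/* The outer length covers its own header plus every child. */
	outer.nla_len = (uint16_t)len;
	outer.nla_type = (uint16_t)(TOY_NLA_F_NESTED | TOY_IFLA_XDP);
	memcpy(buf, &outer, sizeof(outer));
	return len;
}
```

With a 64-byte buffer, calling toy_build_xdp_attr(buf, sizeof(buf), fd, 0)
yields a nested attribute of 12 bytes (4 + 8), and passing non-zero flags
adds another 8-byte child.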

Signed-off-by: Eric Leblond 
Acked-by: Alexei Starovoitov 
---
 tools/lib/bpf/bpf.c| 127 +
 tools/lib/bpf/libbpf.c |   2 +
 tools/lib/bpf/libbpf.h |   4 ++
 3 files changed, 133 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 5128677e4117..749a447ec9ed 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -25,6 +25,17 @@
 #include 
 #include 
 #include "bpf.h"
+#include "libbpf.h"
+#include "nlattr.h"
+#include 
+#include 
+#include 
+
+#ifndef IFLA_XDP_MAX
+#define IFLA_XDP   43
+#define IFLA_XDP_FD1
+#define IFLA_XDP_FLAGS 3
+#endif
 
 /*
  * When building perf, unistd.h is overridden. __NR_bpf is
@@ -46,7 +57,9 @@
 # endif
 #endif
 
+#ifndef min
 #define min(x, y) ((x) < (y) ? (x) : (y))
+#endif
 
 static inline __u64 ptr_to_u64(const void *ptr)
 {
@@ -413,3 +426,117 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 
*info_len)
 
return err;
 }
+
+int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+   struct sockaddr_nl sa;
+   int sock, seq = 0, len, ret = -1;
+   char buf[4096];
+   struct nlattr *nla, *nla_xdp;
+   struct {
+   struct nlmsghdr  nh;
+   struct ifinfomsg ifinfo;
+   char attrbuf[64];
+   } req;
+   struct nlmsghdr *nh;
+   struct nlmsgerr *err;
+   socklen_t addrlen;
+
+   memset(&sa, 0, sizeof(sa));
+   sa.nl_family = AF_NETLINK;
+
+   sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+   if (sock < 0) {
+   return -errno;
+   }
+
+   if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+   ret = -errno;
+   goto cleanup;
+   }
+
+   addrlen = sizeof(sa);
+   if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0) {
+   ret = -errno;
+   goto cleanup;
+   }
+
+   if (addrlen != sizeof(sa)) {
+   ret = -LIBBPF_ERRNO__INTERNAL;
+   goto cleanup;
+   }
+
+   memset(&req, 0, sizeof(req));
+   req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+   req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+   req.nh.nlmsg_type = RTM_SETLINK;
+   req.nh.nlmsg_pid = 0;
+   req.nh.nlmsg_seq = ++seq;
+   req.ifinfo.ifi_family = AF_UNSPEC;
+   req.ifinfo.ifi_index = ifindex;
+
+   /* started nested attribute for XDP */
+   nla = (struct nlattr *)(((char *)&req)
+   + NLMSG_ALIGN(req.nh.nlmsg_len));
+   nla->nla_type = NLA_F_NESTED | IFLA_XDP;
+   nla->nla_len = NLA_HDRLEN;
+
+   /* add XDP fd */
+   nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+   nla_xdp->nla_type = IFLA_XDP_FD;
+   nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+   memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+   nla->nla_len += nla_xdp->nla_len;
+
+   /* if user passed in any flags, add those too */
+   if (flags) {
+   nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+   nla_xdp->nla_type = IFLA_XDP_FLAGS;
+   nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+   memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
+   nla->nla_len += nla_xdp->nla_len;
+   }
+
+   req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+   if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+   ret = -errno;
+   goto cleanup;
+   }
+
+   len = recv(sock, buf, sizeof(buf), 0);
+   if (len < 0) {
+   ret = -errno;
+   goto cleanup;
+   }
+
+   for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+nh = NLMSG_NEXT(nh, len)) {
+   if (nh->nlmsg_pid != sa.nl_pid) {
+   ret = -LIBBPF_ERRNO__WRNGPID;
+   goto cleanup;
+   }
+   if (nh->nlmsg_seq != seq) {
+   ret = -LIBBPF_ERRNO__INVSEQ;
+   goto cleanup;
+   }
+   switch (nh->nlmsg_type) {
+   case NLMSG_ERROR:
+   err = (struct nlmsgerr *)NLMSG_DATA(nh);
+   if (!err->error)
+   continue;
+   ret = err->error;
+   goto cleanup;
+   case NLMSG_DONE:
+   break;
+   default:
+   break;
+   }
+   }
+
+   ret = 0;
+
+cleanup:
+   close(sock);
+   return ret;
+}
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 30c776375118..c60122d3ea85 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -106,6 +106,8 @@ static const char *libbpf_strerror_table[NR_ERRNO] = {
[ERRCODE_OFFSET(PROG2BIG)]  =

Re: [Patch net-next v2 2/3] net_sched: plug in qdisc ops change_tx_queue_len

2018-01-24 Thread John Fastabend

On 01/23/2018 10:18 AM, Cong Wang wrote:
> Introduce a new qdisc ops ->change_tx_queue_len() so that
> each qdisc could decide how to implement this if it wants.
> Previously we simply read dev->tx_queue_len, after pfifo_fast
> switches to skb array, we need this API to resize the skb array
> when we change dev->tx_queue_len.
> 
> To avoid handling race conditions with TX BH, we need to
> deactivate all TX queues before change the value and bring them
> back after we are done, this also makes implementation easier.
> 
> Cc: John Fastabend 
> Signed-off-by: Cong Wang 
> ---
>  include/net/sch_generic.h |  2 ++
>  net/core/dev.c|  1 +
>  net/sched/sch_generic.c   | 33 +
>  3 files changed, 36 insertions(+)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index cd1be1f25c36..d13dd129d085 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -200,6 +200,7 @@ struct Qdisc_ops {
> struct nlattr *arg,
> struct netlink_ext_ack *extack);
>   void(*attach)(struct Qdisc *sch);
> + int (*change_tx_queue_len)(struct Qdisc *, unsigned 
> int);
>  
>   int (*dump)(struct Qdisc *, struct sk_buff *);
>   int (*dump_stats)(struct Qdisc *, struct gnet_dump 
> *);
> @@ -488,6 +489,7 @@ void qdisc_class_hash_remove(struct Qdisc_class_hash *,
>  void qdisc_class_hash_grow(struct Qdisc *, struct Qdisc_class_hash *);
>  void qdisc_class_hash_destroy(struct Qdisc_class_hash *);
>  
> +int dev_qdisc_change_tx_queue_len(struct net_device *dev);
>  void dev_init_scheduler(struct net_device *dev);
>  void dev_shutdown(struct net_device *dev);
>  void dev_activate(struct net_device *dev);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 913655e82859..a9d7d883416d 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -7059,6 +7059,7 @@ int dev_change_tx_queue_len(struct net_device *dev, 
> unsigned long new_len)
>   dev->tx_queue_len = orig_len;
>   return res;
>   }
> + return dev_qdisc_change_tx_queue_len(dev);
>   }
>  
>   return 0;
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 1816bde47256..08f9fa27e06e 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1178,6 +1178,39 @@ void dev_deactivate(struct net_device *dev)
>  }
>  EXPORT_SYMBOL(dev_deactivate);
>  
> +static int qdisc_change_tx_queue_len(struct net_device *dev,
> +  struct netdev_queue *dev_queue)
> +{
> + struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
> + const struct Qdisc_ops *ops = qdisc->ops;
> +
> + if (ops->change_tx_queue_len)
> + return ops->change_tx_queue_len(qdisc, dev->tx_queue_len);
> + return 0;
> +}
> +
> +int dev_qdisc_change_tx_queue_len(struct net_device *dev)
> +{
> + bool up = dev->flags & IFF_UP;
> + unsigned int i;
> + int ret = 0;
> +
> + if (up)
> + dev_deactivate(dev);
> +
> + for (i = 0; i < dev->num_tx_queues; i++) {
> + ret = qdisc_change_tx_queue_len(dev, &dev->_tx[i]);
> +
> + /* TODO: revert changes on a partial failure */
> + if (ret)
> + break;

After another look it seems we can solve this without too much pain
by using skb_array_resize_multiple() in patch 3/3. Then pass the
error back here via qdisc_change_tx_queue_len and reset queue length
to orig_length.

Mind giving it a try? Or else I'll do it Friday probably.

Thanks,
John
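
For illustration only, the "undo on partial failure" idea behind that
TODO can be sketched in plain userspace C. This is not
skb_array_resize_multiple() and not kernel code: struct toy_queue,
toy_queue_resize() and toy_resize_all() are hypothetical names, and the
sketch assumes that restoring the original length cannot itself fail.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-queue ring; not the kernel's skb_array. */
struct toy_queue {
	unsigned int len;	/* current capacity */
	unsigned int max_len;	/* resizing above this fails, to simulate an error */
};

/* Resize one queue; returns 0 on success, -1 on (simulated) failure. */
static int toy_queue_resize(struct toy_queue *q, unsigned int new_len)
{
	if (new_len > q->max_len)
		return -1;
	q->len = new_len;
	return 0;
}

/*
 * Resize all n queues to new_len. If queue i fails, restore queues
 * 0..i-1 to orig_len so the caller never sees a half-applied change,
 * then hand the error back.
 */
static int toy_resize_all(struct toy_queue *qs, size_t n,
			  unsigned int orig_len, unsigned int new_len)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = toy_queue_resize(&qs[i], new_len);
		if (ret)
			break;
	}

	if (ret) {
		/* Roll back only the queues that were already resized. */
		while (i-- > 0)
			qs[i].len = orig_len;
	}
	return ret;
}
```

A caller in this sketch would pass the old length as orig_len and, on a
non-zero return, also put its own copy of the queue length back, which
mirrors the "reset queue length to orig_length" step suggested above.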

[PATCH bpf-next v7 1/5] tools: import netlink header in tools uapi

2018-01-24 Thread Eric Leblond

The header is necessary to compile libbpf on systems with older
versions of the headers.

Signed-off-by: Eric Leblond 
---
 tools/include/uapi/linux/netlink.h | 251 +
 tools/lib/bpf/Makefile |   3 +
 2 files changed, 254 insertions(+)
 create mode 100644 tools/include/uapi/linux/netlink.h

diff --git a/tools/include/uapi/linux/netlink.h 
b/tools/include/uapi/linux/netlink.h
new file mode 100644
index ..776bc92e9118
--- /dev/null
+++ b/tools/include/uapi/linux/netlink.h
@@ -0,0 +1,251 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI__LINUX_NETLINK_H
+#define _UAPI__LINUX_NETLINK_H
+
+#include 
+#include  /* for __kernel_sa_family_t */
+#include 
+
+#define NETLINK_ROUTE  0   /* Routing/device hook  
*/
+#define NETLINK_UNUSED 1   /* Unused number
*/
+#define NETLINK_USERSOCK   2   /* Reserved for user mode socket 
protocols  */
+#define NETLINK_FIREWALL   3   /* Unused number, formerly ip_queue 
*/
+#define NETLINK_SOCK_DIAG  4   /* socket monitoring
*/
+#define NETLINK_NFLOG  5   /* netfilter/iptables ULOG */
+#define NETLINK_XFRM   6   /* ipsec */
+#define NETLINK_SELINUX7   /* SELinux event notifications 
*/
+#define NETLINK_ISCSI  8   /* Open-iSCSI */
+#define NETLINK_AUDIT  9   /* auditing */
+#define NETLINK_FIB_LOOKUP 10  
+#define NETLINK_CONNECTOR  11
+#define NETLINK_NETFILTER  12  /* netfilter subsystem */
+#define NETLINK_IP6_FW 13
+#define NETLINK_DNRTMSG14  /* DECnet routing messages */
+#define NETLINK_KOBJECT_UEVENT 15  /* Kernel messages to userspace */
+#define NETLINK_GENERIC16
+/* leave room for NETLINK_DM (DM Events) */
+#define NETLINK_SCSITRANSPORT  18  /* SCSI Transports */
+#define NETLINK_ECRYPTFS   19
+#define NETLINK_RDMA   20
+#define NETLINK_CRYPTO 21  /* Crypto layer */
+#define NETLINK_SMC22  /* SMC monitoring */
+
+#define NETLINK_INET_DIAG  NETLINK_SOCK_DIAG
+
+#define MAX_LINKS 32   
+
+struct sockaddr_nl {
+   __kernel_sa_family_tnl_family;  /* AF_NETLINK   */
+   unsigned short  nl_pad; /* zero */
+   __u32   nl_pid; /* port ID  */
+   __u32   nl_groups;  /* multicast groups mask */
+};
+
+struct nlmsghdr {
+   __u32   nlmsg_len;  /* Length of message including header */
+   __u16   nlmsg_type; /* Message content */
+   __u16   nlmsg_flags;/* Additional flags */
+   __u32   nlmsg_seq;  /* Sequence number */
+   __u32   nlmsg_pid;  /* Sending process port ID */
+};
+
+/* Flags values */
+
+#define NLM_F_REQUEST  0x01/* It is request message.   */
+#define NLM_F_MULTI0x02/* Multipart message, terminated by 
NLMSG_DONE */
+#define NLM_F_ACK  0x04/* Reply with ack, with zero or error 
code */
+#define NLM_F_ECHO 0x08/* Echo this request*/
+#define NLM_F_DUMP_INTR0x10/* Dump was inconsistent due to 
sequence change */
+#define NLM_F_DUMP_FILTERED0x20/* Dump was filtered as requested */
+
+/* Modifiers to GET request */
+#define NLM_F_ROOT 0x100   /* specify tree root*/
+#define NLM_F_MATCH0x200   /* return all matching  */
+#define NLM_F_ATOMIC   0x400   /* atomic GET   */
+#define NLM_F_DUMP (NLM_F_ROOT|NLM_F_MATCH)
+
+/* Modifiers to NEW request */
+#define NLM_F_REPLACE  0x100   /* Override existing*/
+#define NLM_F_EXCL 0x200   /* Do not touch, if it exists   */
+#define NLM_F_CREATE   0x400   /* Create, if it does not exist */
+#define NLM_F_APPEND   0x800   /* Add to end of list   */
+
+/* Modifiers to DELETE request */
+#define NLM_F_NONREC   0x100   /* Do not delete recursively*/
+
+/* Flags for ACK message */
+#define NLM_F_CAPPED   0x100   /* request was capped */
+#define NLM_F_ACK_TLVS 0x200   /* extended ACK TVLs were included */
+
+/*
+   4.4BSD ADD  NLM_F_CREATE|NLM_F_EXCL
+   4.4BSD CHANGE   NLM_F_REPLACE
+
+   True CHANGE NLM_F_CREATE|NLM_F_REPLACE
+   Append  NLM_F_CREATE
+   Check   NLM_F_EXCL
+ */
+
+#define NLMSG_ALIGNTO  4U
+#define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) )
+#define NLMSG_HDRLEN((int) NLMSG_ALIGN(sizeof(struct nlmsghdr)))
+#define NLMSG_LENGTH(len) ((len) + NLMSG_HDRLEN)
+#define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len))
+#define NLMSG_DATA(nlh)  ((void*)(((char*)nlh) + NLMSG_LENGTH(0)))
+#define NLMSG_NEXT(nlh,len) ((len) -= NLMSG_ALIGN((nlh)->nlmsg_len), \
+ (struct

[PATCH bpf-next v7 4/5] libbpf: add missing SPDX-License-Identifier

2018-01-24 Thread Eric Leblond

Signed-off-by: Eric Leblond 
Acked-by: Alexei Starovoitov 
---
 tools/lib/bpf/bpf.c| 2 ++
 tools/lib/bpf/bpf.h| 2 ++
 tools/lib/bpf/libbpf.c | 2 ++
 tools/lib/bpf/libbpf.h | 2 ++
 4 files changed, 8 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 765fd95b0657..e850d8365100 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: LGPL-2.1
+
 /*
  * common eBPF ELF operations.
  *
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9f44c196931e..8d18fb73d7fb 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -1,3 +1,5 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
 /*
  * common eBPF ELF operations.
  *
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index c60122d3ea85..71ddc481f349 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: LGPL-2.1
+
 /*
  * Common eBPF ELF object loading operations.
  *
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e42f96900318..f85906533cdd 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1,3 +1,5 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
 /*
  * Common eBPF ELF object loading operations.
  *
-- 
2.15.1

[PATCH bpf-next v7 3/5] libbpf: add error reporting in XDP

2018-01-24 Thread Eric Leblond

Parse the netlink ext ack attribute to get the error message returned by
the card. The code is partially taken from libnl.

We add netlink.h to the uapi includes of tools. We also need to avoid
including the userspace netlink header for the samples to build
successfully, so nlattr.h has a define that prevents the inclusion.
Using a direct define could have been an issue, as NLMSGERR_ATTR_MAX
can change in the future.

We also define SOL_NETLINK if it is not already defined, to avoid having
to copy socket.h for a single fixed value.
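
A minimal standalone sketch of that fallback-define approach, assuming a
Linux system: open_route_socket() is a hypothetical helper name, the
value 270 for SOL_NETLINK is the one used in this patch, and 11 for
NETLINK_EXT_ACK is an assumption taken from current uapi headers. A
failed setsockopt() is treated as "extended ACKs unavailable" rather
than as a fatal error.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/* Fallbacks for older headers; both are fixed ABI values. */
#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_EXT_ACK
#define NETLINK_EXT_ACK 11
#endif

/*
 * Open a NETLINK_ROUTE socket and try to enable extended ACK reporting.
 * Returns the socket fd (>= 0) or -errno. *ext_ack_on is set to 1 only
 * if the kernel accepted the option, 0 otherwise.
 */
static int open_route_socket(int *ext_ack_on)
{
	int one = 1;
	int sock;

	*ext_ack_on = 0;

	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	if (sock < 0)
		return -errno;

	/* Older kernels reject the option; carry on without it. */
	if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK,
		       &one, sizeof(one)) == 0)
		*ext_ack_on = 1;

	return sock;
}
```

The caller is expected to close() the returned descriptor and can use
*ext_ack_on to decide whether parsing extended error attributes is worth
attempting.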

Signed-off-by: Eric Leblond 
Acked-by: Alexei Starovoitov 

remote rtne

Signed-off-by: Eric Leblond 
---
 samples/bpf/Makefile   |   2 +-
 tools/lib/bpf/Build|   2 +-
 tools/lib/bpf/bpf.c|  13 +++-
 tools/lib/bpf/nlattr.c | 187 +
 tools/lib/bpf/nlattr.h |  72 +++
 5 files changed, 273 insertions(+), 3 deletions(-)
 create mode 100644 tools/lib/bpf/nlattr.c
 create mode 100644 tools/lib/bpf/nlattr.h

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 7f61a3d57fa7..5c4cd3745282 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -45,7 +45,7 @@ hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 
 # Libbpf dependencies
-LIBBPF := ../../tools/lib/bpf/bpf.o
+LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
 
 test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build
index d8749756352d..64c679d67109 100644
--- a/tools/lib/bpf/Build
+++ b/tools/lib/bpf/Build
@@ -1 +1 @@
-libbpf-y := libbpf.o bpf.o
+libbpf-y := libbpf.o bpf.o nlattr.o
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 749a447ec9ed..765fd95b0657 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -27,7 +27,7 @@
 #include "bpf.h"
 #include "libbpf.h"
 #include "nlattr.h"
-#include 
+#include 
 #include 
 #include 
 
@@ -37,6 +37,10 @@
 #define IFLA_XDP_FLAGS 3
 #endif
 
+#ifndef SOL_NETLINK
+#define SOL_NETLINK 270
+#endif
+
 /*
  * When building perf, unistd.h is overridden. __NR_bpf is
  * required to be defined explicitly.
@@ -441,6 +445,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
struct nlmsghdr *nh;
struct nlmsgerr *err;
socklen_t addrlen;
+   int one = 1;
 
memset(&sa, 0, sizeof(sa));
sa.nl_family = AF_NETLINK;
@@ -450,6 +455,11 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
return -errno;
}
 
+   if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK,
+  &one, sizeof(one)) < 0) {
+   fprintf(stderr, "Netlink error reporting not supported\n");
+   }
+
if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
ret = -errno;
goto cleanup;
@@ -526,6 +536,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
if (!err->error)
continue;
ret = err->error;
+   nla_dump_errormsg(nh);
goto cleanup;
case NLMSG_DONE:
break;
diff --git a/tools/lib/bpf/nlattr.c b/tools/lib/bpf/nlattr.c
new file mode 100644
index ..4719434278b2
--- /dev/null
+++ b/tools/lib/bpf/nlattr.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: LGPL-2.1
+
+/*
+ * NETLINK  Netlink attributes
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation version 2.1
+ * of the License.
+ *
+ * Copyright (c) 2003-2013 Thomas Graf 
+ */
+
+#include 
+#include "nlattr.h"
+#include 
+#include 
+#include 
+
+static uint16_t nla_attr_minlen[NLA_TYPE_MAX+1] = {
+   [NLA_U8]= sizeof(uint8_t),
+   [NLA_U16]   = sizeof(uint16_t),
+   [NLA_U32]   = sizeof(uint32_t),
+   [NLA_U64]   = sizeof(uint64_t),
+   [NLA_STRING]= 1,
+   [NLA_FLAG]  = 0,
+};
+
+static int nla_len(const struct nlattr *nla)
+{
+   return nla->nla_len - NLA_HDRLEN;
+}
+
+static struct nlattr *nla_next(const struct nlattr *nla, int *remaining)
+{
+   int totlen = NLA_ALIGN(nla->nla_len);
+
+   *remaining -= totlen;
+   return (struct nlattr *) ((char *) nla + totlen);
+}
+
+static int nla_ok(const struct nlattr *nla, int remaining)
+{
+   return remaining >= sizeof(*nla) &&
+  nla->nla_len >= sizeof(*nla) &&
+  nla->nla_len <= remaining;
+}
+
+static void *nla_data(const struct nlattr *nla)
+{
+   return (char *) nla + NLA_HDRLEN;
+}
+
+static int nla_type(const struct nlattr *nla)
+{
+   return nla->nla_type & NLA_TYPE_MASK;
+}
+
+static int validate_nla(struct nlattr *nla, int maxtype,
+   struct nla_policy *policy)
+{
+

[PATCH bpf-next v7 5/5] samples/bpf: use bpf_set_link_xdp_fd

2018-01-24 Thread Eric Leblond

Use bpf_set_link_xdp_fd instead of set_link_xdp_fd to remove some
code duplication and benefit from netlink ext ack error messages.

Signed-off-by: Eric Leblond 
---
 samples/bpf/bpf_load.c  | 102 
 samples/bpf/bpf_load.h  |   2 +-
 samples/bpf/xdp1_user.c |   4 +-
 samples/bpf/xdp_redirect_cpu_user.c |   6 +--
 samples/bpf/xdp_redirect_map_user.c |   8 +--
 samples/bpf/xdp_redirect_user.c |   8 +--
 samples/bpf/xdp_router_ipv4_user.c  |  10 ++--
 samples/bpf/xdp_rxq_info_user.c |   4 +-
 samples/bpf/xdp_tx_iptunnel_user.c  |   6 +--
 9 files changed, 24 insertions(+), 126 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 242631aa4ea2..69806d74fa53 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -695,105 +695,3 @@ struct ksym *ksym_search(long key)
return [0];
 }
 
-int set_link_xdp_fd(int ifindex, int fd, __u32 flags)
-{
-   struct sockaddr_nl sa;
-   int sock, seq = 0, len, ret = -1;
-   char buf[4096];
-   struct nlattr *nla, *nla_xdp;
-   struct {
-   struct nlmsghdr  nh;
-   struct ifinfomsg ifinfo;
-   char attrbuf[64];
-   } req;
-   struct nlmsghdr *nh;
-   struct nlmsgerr *err;
-
-   memset(&sa, 0, sizeof(sa));
-   sa.nl_family = AF_NETLINK;
-
-   sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
-   if (sock < 0) {
-   printf("open netlink socket: %s\n", strerror(errno));
-   return -1;
-   }
-
-   if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
-   printf("bind to netlink: %s\n", strerror(errno));
-   goto cleanup;
-   }
-
-   memset(&req, 0, sizeof(req));
-   req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
-   req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
-   req.nh.nlmsg_type = RTM_SETLINK;
-   req.nh.nlmsg_pid = 0;
-   req.nh.nlmsg_seq = ++seq;
-   req.ifinfo.ifi_family = AF_UNSPEC;
-   req.ifinfo.ifi_index = ifindex;
-
-   /* started nested attribute for XDP */
-   nla = (struct nlattr *)(((char *)&req)
-   + NLMSG_ALIGN(req.nh.nlmsg_len));
-   nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
-   nla->nla_len = NLA_HDRLEN;
-
-   /* add XDP fd */
-   nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
-   nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
-   nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
-   memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
-   nla->nla_len += nla_xdp->nla_len;
-
-   /* if user passed in any flags, add those too */
-   if (flags) {
-   nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
-   nla_xdp->nla_type = 3/*IFLA_XDP_FLAGS*/;
-   nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
-   memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
-   nla->nla_len += nla_xdp->nla_len;
-   }
-
-   req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
-
-   if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
-   printf("send to netlink: %s\n", strerror(errno));
-   goto cleanup;
-   }
-
-   len = recv(sock, buf, sizeof(buf), 0);
-   if (len < 0) {
-   printf("recv from netlink: %s\n", strerror(errno));
-   goto cleanup;
-   }
-
-   for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
-nh = NLMSG_NEXT(nh, len)) {
-   if (nh->nlmsg_pid != getpid()) {
-   printf("Wrong pid %d, expected %d\n",
-  nh->nlmsg_pid, getpid());
-   goto cleanup;
-   }
-   if (nh->nlmsg_seq != seq) {
-   printf("Wrong seq %d, expected %d\n",
-  nh->nlmsg_seq, seq);
-   goto cleanup;
-   }
-   switch (nh->nlmsg_type) {
-   case NLMSG_ERROR:
-   err = (struct nlmsgerr *)NLMSG_DATA(nh);
-   if (!err->error)
-   continue;
-   printf("nlmsg error %s\n", strerror(-err->error));
-   goto cleanup;
-   case NLMSG_DONE:
-   break;
-   }
-   }
-
-   ret = 0;
-
-cleanup:
-   close(sock);
-   return ret;
-}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 7d57a4248893..453c200b389b 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -61,5 +61,5 @@ struct ksym {
 
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
-int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
+int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
 #endif
diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
index fdaefe91801d..b901ee2b3336 100644
---

Re: [PATCH 10/10] kill kernel_sock_ioctl()

2018-01-24 Thread Al Viro

On Wed, Jan 24, 2018 at 03:52:44PM -0500, David Miller wrote:
> 
> Al this series looks fine to me, want me to toss it into net-next?

Do you want them reposted (with updated commit messages), or would
you prefer a pull request (with or without rebase to current tip
of net-next)?

Re: [PATCH net-next 2/2] net: sched: add em_ipt ematch for calling xtables matches

2018-01-24 Thread Pablo Neira Ayuso

On Wed, Jan 24, 2018 at 04:37:16PM -0500, David Miller wrote:
> From: Eyal Birger 
> Date: Tue, 23 Jan 2018 11:17:32 +0200
> 
> > +   network_offset = skb_network_offset(skb);
> > +   skb_pull(skb, network_offset);
> > +
> > +   rcu_read_lock();
> > +
> > +   if (skb->skb_iif)
> > +   indev = dev_get_by_index_rcu(em->net, skb->skb_iif);
> > +
> > +   nf_hook_state_init(, im->hook, im->nfproto, indev ?: skb->dev,
> > +  skb->dev, NULL, em->net, NULL);
> > +
> > +   acpar.match = im->match;
> > +   acpar.matchinfo = im->match_data;
> > +   acpar.state = 
> > +
> > +   ret = im->match->match(skb, &acpar);
> > +
> > +   rcu_read_unlock();
> > +
> > +   skb_push(skb, network_offset);
> 
> If the SKB is shared in any way, this pull/push around the NF hook
> invocation is illegal.

At ingress, skb->data points to the network header, which is what the
xtables matches expect, so these are actually noops, therefore,
skb_pull() and skb_push() can be removed.

Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2018-01-24 Thread Ben Greear


On 06/20/2017 08:03 PM, David Ahern wrote:

On 6/20/17 5:41 PM, Ben Greear wrote:

On 06/20/2017 11:05 AM, Michal Kubecek wrote:

On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:

On 06/14/2017 03:25 PM, David Ahern wrote:

On 6/14/17 4:23 PM, Ben Greear wrote:

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no
longer
reproduce the problem.



ok, let's figure out which one is doing that. There are 3 debug
statements. I suspect fib6_del_route is the one setting the state to
FWS_U. Can you remove the debug prints in fib6_repair_tree and
fib6_walk_continue and try again?


We cannot reproduce with just that one printf in the kernel either.  It
must change the timing too much to trigger the bug.


You might try trace_printk() which should have less impact (don't forget
to enable /proc/sys/kernel/ftrace_dump_on_oops).


We cannot reproduce with trace_printk() either.


I think that suggests the walker state is set to FWS_U in
fib6_del_route, and it is the FWS_U case in fib6_walk_continue that
triggers the fault -- the null parent (pn = fn->parent). So we have the
2 areas of code that are interacting.

I'm on a road trip through the end of this week with little time to
focus on this problem. I'll get back to you another suggestion when I can.


So, though I don't know the right way to fix it, the patch below appears
to make the system not crash.


diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 68b9cc7..bf19a14 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w)
pn = fn->parent;
w->node = pn;
 #ifdef CONFIG_IPV6_SUBTREES
+   if (WARN_ON_ONCE(!pn)) {
+   pr_err("FWS-U, w: %p  fn: %p  pn: %p\n",
+  w, fn, pn);
+   /* Attempt to work around crash that has been 
here forever. --Ben */
+   return 0;
+   }
if (FIB6_SUBTREE(pn) == fn) {
WARN_ON(!(fn->fn_flags & RTN_ROOT));
w->state = FWS_L;



The printout looks like this (when adding 4000 mac-vlans, so it is pretty 
rare).  PN is definitely NULL sometimes:

[root@2u-6n ~]# journalctl -f|grep FWS
Jan 24 15:48:05 2u-6n kernel: IPv6: FWS-U, w: 8807ea121ba0  fn: 
880856a09260  pn:   (null)
Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 8807e3963de0  fn: 
880856a09260  pn:   (null)
Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 88081ac22de0  fn: 
880856a09260  pn:   (null)
Jan 24 15:53:13 2u-6n kernel: IPv6: FWS-U, w: 8808290c69c0  fn: 
8807e369f920  pn:   (null)
Jan 24 15:53:24 2u-6n kernel: IPv6: FWS-U, w: 8807ea3156c0  fn: 
88082d1eeb60  pn:   (null)



8066 Jan 24 15:48:04 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device 
eth2#1006
 8067 Jan 24 15:48:05 2u-6n kernel: [ cut here ]
 8068 Jan 24 15:48:05 2u-6n kernel: WARNING: CPU: 5 PID: 3346 at /home/greearb/git/linux-4.13.dev.y/net/ipv6/ip6_fib.c:1617 fib6_walk_continue+ 
0x154/0x1b0 [ipv6]
 8069 Jan 24 15:48:05 2u-6n kernel: Modules linked in: 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen ipmi_ssif coretemp intel_raplsb_edac 
x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm ath9k irqbypass iTCO_wdt ath9k_common iTCO_vendor_support ath9k_hw ath  i2c_i801 mac80211 joydev 
lpc_ich cfg80211 ioatdma shpchp tpm_tis tpm_tis_core wmi tpm ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl 
sch_fq_codel lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core ipv6 crc_ccitt

 8070 Jan 24 15:48:05 2u-6n kernel: CPU: 5 PID: 3346 Comm: ip Tainted: G
   O4.13.16+ #22
 8071 Jan 24 15:48:05 2u-6n kernel: Hardware name: Iron_Systems,Inc 
CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
 8072 Jan 24 15:48:05 2u-6n kernel: task: 8807e9ef1dc0 task.stack: 
c9002083c000
 8073 Jan 24 15:48:05 2u-6n kernel: RIP: 0010:fib6_walk_continue+0x154/0x1b0 
[ipv6]
 8074 Jan 24 15:48:05 2u-6n kernel: RSP: 0018:c9002083fbc0 EFLAGS: 00010246
 8075 Jan 24 15:48:05 2u-6n kernel: RAX:  RBX: 8807ea121ba0 
RCX: 
 8076 Jan 24 15:48:05 2u-6n kernel: RDX: 880856a09260 RSI: c9002083fc00 
RDI: 81ef2140
 8077 Jan 24 15:48:05 2u-6n kernel: RBP: c9002083fbc8 R08: 0008 
R09: 8807e36f6b25
 8078 Jan 24 15:48:05 2u-6n kernel: R10: c9002083fb70 R11:  
R12: 0002
 8079 Jan 24 15:48:05 2u-6n kernel:

Re: [PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()

2018-01-24 Thread Pablo Neira Ayuso

On Thu, Jan 25, 2018 at 12:19:52AM +0100, Florian Westphal wrote:
> Eric Dumazet  wrote:
> > From: Eric Dumazet 
> > 
> > It looks like syzbot found its way into netfilter territory.
> 
> Excellent.  This will sure allow to find and fix more bugs :-)
> 
> > Issue here is that @name comes from user space and might
> > not be null terminated.
> 
> Indeed, thanks for fixing this Eric.
> 
> xt_find_target() and xt_find_table_lock() might have similar issues.

I'm going to keep back this patch then, it would be good if we can
find this in one single patch.

Thanks.

Re: [PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()

2018-01-24 Thread Pablo Neira Ayuso

On Wed, Jan 24, 2018 at 02:49:48PM -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> It looks like syzbot found its way into netfilter territory.
> 
> Issue here is that @name comes from user space and might
> not be null terminated.
> 
> Out-of-bound reads happen, KASAN is not happy.

Applied, thanks Eric.
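
For readers following along, the class of bug discussed in this thread
(a fixed-size name field copied from user space that may lack a
terminating NUL) can be guarded against roughly as in the userspace
sketch below. It is not the applied patch: TOY_NAME_MAXLEN and
toy_check_name() are hypothetical, and memchr() is used here as a
portable way to express a bounded search for the terminator.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_NAME_MAXLEN 29	/* hypothetical size of the fixed name field */

/*
 * Return 0 if the TOY_NAME_MAXLEN-byte field at name contains a NUL
 * terminator, -EINVAL otherwise. Never reads past the field, so it is
 * safe to call before any strcmp()-style lookup on the name.
 */
static int toy_check_name(const char *name)
{
	if (memchr(name, '\0', TOY_NAME_MAXLEN) == NULL)
		return -EINVAL;
	return 0;
}
```

The point of the check is ordering: the bounded scan has to run before
the name reaches any helper that assumes a C string.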

[GIT] Networking

2018-01-24 Thread David Miller


1) Avoid negative netdev refcount in error flow of xfrm state add,
   from Aviad Yehezkel.

2) Fix tcpdump decoding of IPSEC decap'd frames by filling in the
   ethernet header protocol field in xfrm{4,6}_mode_tunnel_input().
   From Yossi Kuperman.

3) Fix a syzbot triggered skb_under_panic in pppoe having to do
   with failing to allocate an appropriate amount of headroom.  From
   Guillaume Nault.

4) Fix memory leak in vmxnet3 driver, from Neil Horman.

5) Cure out-of-bounds packet memory access in em_nbyte EMATCH
   module, from Wolfgang Bumiller.

6) Restrict what kinds of sockets can be bound to the KCM
   multiplexer and also disallow when another layer has
   attached to the socket and made use of sk_user_data.
   From Tom Herbert.

7) Fix use before init of IOTLB in vhost code, from Jason Wang.

8) Correct STACR register write bit definition in IBM emac driver,
   from Ivan Mikhaylov.

Please pull, thanks a lot.

The following changes since commit a84a8ab94ed5cb65a1355fe9e8d1d55283375808:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-01-23 
08:52:55 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 624ca9c33c8a853a4a589836e310d776620f4ab9:

  net/ibm/emac: wrong bit is used for STA control register write (2018-01-24 
18:10:57 -0500)


Aviad Yehezkel (1):
  xfrm: fix error flow in case of add state fails

Ben Hutchings (1):
  ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL

David S. Miller (3):
  Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
  Merge branch 'kcm-fix-two-syzcaller-issues'
  Merge branch 'qed-rdma-bug-fixes'

Guillaume Nault (1):
  pppoe: take ->needed_headroom of lower device into account on xmit

Gustavo A. R. Silva (1):
  xfrm: fix boolean assignment in xfrm_get_type_offload

Ivan Mikhaylov (2):
  net/ibm/emac: add 8192 rx/tx fifo size
  net/ibm/emac: wrong bit is used for STA control register write

Jakub Kicinski (1):
  i40e: flower: check if TC offload is enabled on a netdev

Jason Wang (2):
  vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()
  vhost: do not try to access device IOTLB when not initialized

Michal Kalderon (2):
  qed: Remove reserveration of dpi for kernel
  qed: Free reserved MR tid

Neil Horman (1):
  vmxnet3: repair memory leak

Tom Herbert (2):
  kcm: Only allow TCP sockets to be attached to a KCM mux
  kcm: Check if sk_user_data already set in kcm_attach

Wolfgang Bumiller (2):
  net: sched: em_nbyte: don't add the data offset twice
  net: sched: fix TCF_LAYER_LINK case in tcf_get_base_ptr

Yossi Kuperman (2):
  xfrm: Add SA to hardware at the end of xfrm_state_construct()
  xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version

Yuval Mintz (1):
  mlxsw: spectrum_router: Don't log an error on missing neighbor

 drivers/net/ethernet/ibm/emac/core.c  |  6 ++
 drivers/net/ethernet/ibm/emac/emac.h  |  4 +++-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  2 ++
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 10 ++
 drivers/net/ethernet/qlogic/qed/qed_rdma.c| 31 
+--
 drivers/net/ppp/pppoe.c   | 11 ++-
 drivers/net/vmxnet3/vmxnet3_drv.c |  2 +-
 drivers/vhost/vhost.c |  6 +-
 include/net/ipv6.h|  1 +
 include/net/pkt_cls.h |  2 +-
 net/ipv4/xfrm4_mode_tunnel.c  |  1 +
 net/ipv6/ip6_output.c |  2 +-
 net/ipv6/ipv6_sockglue.c  |  2 +-
 net/ipv6/xfrm6_mode_tunnel.c  |  1 +
 net/kcm/kcmsock.c | 25 
+
 net/sched/em_nbyte.c  |  2 +-
 net/xfrm/xfrm_device.c|  1 +
 net/xfrm/xfrm_state.c | 12 
 net/xfrm/xfrm_user.c  | 18 +++---
 19 files changed, 90 insertions(+), 49 deletions(-)

Re: [PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()

2018-01-24 Thread Florian Westphal

Eric Dumazet  wrote:
> From: Eric Dumazet 
> 
> It looks like syzbot found its way into netfilter territory.

Excellent.  This will surely allow us to find and fix more bugs :-)

> Issue here is that @name comes from user space and might
> not be null terminated.

Indeed, thanks for fixing this Eric.

xt_find_target() and xt_find_table_lock() might have similar issues.

Re: [PATCH v2 1/2] net/ibm/emac: add 8192 rx/tx fifo size

2018-01-24 Thread David Miller

From: Ivan Mikhaylov 
Date: Wed, 24 Jan 2018 15:53:24 +0300

> emac4syn chips can use rx/tx fifo buffer sizes of 8192. Currently, if we
> set 8192 in the dts, for example, we get only 2048, which may impact
> network speed.
> 
> Signed-off-by: Ivan Mikhaylov 

Applied.

Re: [PATCH v2 2/2] net/ibm/emac: wrong bit is used for STA control register write

2018-01-24 Thread David Miller

From: Ivan Mikhaylov 
Date: Wed, 24 Jan 2018 15:53:25 +0300

> The STA control register has fields for the mode and for operation opcodes.
> Bit 18 selects the mode, where 0 is the old MIO/MDIO access method and 1 is
> the indirect access mode. Bits 19-20 select the read/write operation (STA
> opcodes). Currently 'read' is set in old MIO/MDIO mode with bit 19, but the
> write operation is set with bit 18, which is the mode selection bit, not a
> write operation. To make write consistent with read, we set it with bit 20.
> All bit numbers are MSB 0 based.
> 
> Signed-off-by: Ivan Mikhaylov 

Applied.
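
For readers less used to MSB 0 numbering, here is a minimal sketch of how the bit numbers in the quoted description map to masks. It assumes a 32-bit register; the helper name msb0_bit() is hypothetical and is not taken from the emac driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an MSB 0 bit index into a mask for a 32-bit register
 * (assumption: the register is 32 bits wide). In MSB 0 numbering,
 * bit 0 is the most significant bit and bit 31 the least significant.
 * msb0_bit() is an illustrative name, not a driver function.
 */
static uint32_t msb0_bit(unsigned int bit)
{
	assert(bit < 32);
	return UINT32_C(1) << (31 - bit);
}
```

Under that assumption, bit 18 (mode selection) is 0x2000, bit 19 is 0x1000 and bit 20 is 0x0800, so a write opcode placed at bit 18 would collide with the mode bit, which matches the problem the patch describes.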

Re: [net-next 0/7][pull request] 100GbE Intel Wired LAN Driver Updates 2018-01-24

2018-01-24 Thread David Miller

From: Jeff Kirsher 
Date: Wed, 24 Jan 2018 14:45:39 -0800

> This series contains updates to fm10k only.
> 
> Alex fixes MACVLAN offload for fm10k, where we were not seeing unicast
> packets being received because we did not correctly configure the
> default VLAN ID for the port and defaulted to 0.
> 
> Jake cleans up unnecessary parentheses in a couple of "if" statements.
> Fixed the driver to stop adding VLAN 0 into the VLAN table, since it
> would cause the VLAN table to be inconsistent between the PF and VF.
> Also fixed an issue where we were assuming that VLAN 1 is enabled when
> the default VLAN ID is not set, so resolve by not requesting any filters
> for the default_vid if it has not yet been assigned.
> 
> Ngai fixes an issue which was generating a dmesg message about being unable
> to kill a particular VLAN ID for the device.  This happens because
> ndo_vlan_rx_kill_vid() exits with an error: the handler for this ndo is
> fm10k_update_vid(), which exits prematurely under PF VLAN management.
> So to resolve, we must check the VLAN update action type before exiting
> fm10k_update_vid(), and act appropriately based on the action type.
> Also corrected code comment typos.

Looks good, pulled, thanks Jeff.

Re: [PATCH net-next 3/3] net/ipv6: Add support for onlink flag

2018-01-24 Thread David Miller

From: David Ahern 
Date: Wed, 24 Jan 2018 15:08:39 -0700

> On 1/23/18 8:00 PM, David Ahern wrote:
>> +tbid = l3mdev_fib_table(dev) ? : RT_TABLE_MAIN;
>> +if (cfg->fc_table && cfg->fc_table != tbid) {
>> +NL_SET_ERR_MSG(extack,
>> +   "Table id mismatch between given table and 
>> device");
>> +return -EINVAL;
>> +}
>> +
>> +cfg->fc_table = tbid;
>> +
>> +return 0;
> 
> This table check is too restrictive for some PBR cases.
> 
> Dave: please drop this set; I'll repost.

Ok.

[PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()

2018-01-24 Thread Eric Dumazet

From: Eric Dumazet 

It looks like syzbot found its way into netfilter territory.

Issue here is that @name comes from user space and might
not be null terminated.

Out-of-bound reads happen, KASAN is not happy.

Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
No Fixes: tag, bug seems to be a day-0 one.

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 
55802e97f906d1987ed78b4296584deb38e5f876..8516dc459b539342f44d2b2b3e21b140677c7826
 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -210,6 +210,9 @@ xt_request_find_match(uint8_t nfproto, const char *name, 
uint8_t revision)
 {
struct xt_match *match;
 
+   if (strnlen(name, XT_EXTENSION_MAXNAMELEN) == XT_EXTENSION_MAXNAMELEN)
+   return ERR_PTR(-EINVAL);
+
match = xt_find_match(nfproto, name, revision);
if (IS_ERR(match)) {
request_module("%st_%s", xt_prefix[nfproto], name);
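
The check added by this patch can be tried out in user space. The following is only a sketch of the same strnlen() idiom: NAME_MAXLEN and check_name_terminated() are hypothetical stand-ins chosen for illustration, not the kernel's definitions.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for XT_EXTENSION_MAXNAMELEN; the value is illustrative. */
#define NAME_MAXLEN 29

/*
 * Return 0 if the fixed-size buffer contains a NUL terminator,
 * -EINVAL otherwise. strnlen() reads at most NAME_MAXLEN bytes, so the
 * check itself cannot run past the end of the buffer even when the
 * caller passed an unterminated name.
 */
static int check_name_terminated(const char name[NAME_MAXLEN])
{
	if (strnlen(name, NAME_MAXLEN) == NAME_MAXLEN)
		return -EINVAL;
	return 0;
}
```

A buffer filled entirely with non-NUL bytes is rejected, while a short name such as "tcp" passes, so later string operations on an accepted name stay inside the buffer.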

Re: [Intel-wired-lan] [RFC v2 net-next 01/10] net: Add a new socket option for a future transmit time.

2018-01-24 Thread Vinicius Costa Gomes

Hi Richard,

Richard Cochran  writes:

> On Tue, Jan 23, 2018 at 01:22:37PM -0800, Vinicius Costa Gomes wrote:
>> What I think would be the ideal scenario would be if the clockid
>> parameter to the TBS Qdisc would not be necessary (if offload was
>> enabled), but that's not quite possible right now, because there's no
>> support for using the hrtimer infrastructure with dynamic clocks
>> (/dev/ptp*).
>
> We don't need hrtimer for HW offloading.  Just enqueue the packets.  I
> thought we agreed that user space get the ordering correct.  In fact,
> davem insisted on it, IIRC.

About the ordering of packets: from here [1], there are 3 clear points
(in my understanding):

1. Re-ordering of TX descriptors on the device queue should/must not
   happen;

2. Out of order requests are an error;

3. Timestamps in the past are an error;
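
As a rough illustration of points 2 and 3, an enqueue-time check might look like the sketch below. Everything in it is an assumption made for the example: the function name, the use of a single 64-bit timestamp, and -EINVAL as the error are not taken from the RFC.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical enqueue-time validation, for illustration only.
 *   now         - current time on the chosen clock
 *   last_txtime - transmit time of the previously accepted packet
 *   txtime      - requested transmit time of the new packet
 * Returns 0 if the request is acceptable, -EINVAL otherwise.
 */
static int toy_check_txtime(uint64_t now, uint64_t last_txtime, uint64_t txtime)
{
	if (txtime < now)
		return -EINVAL;	/* point 3: timestamp in the past */
	if (txtime < last_txtime)
		return -EINVAL;	/* point 2: out-of-order request */
	return 0;
}
```

A check like this rejects bad requests but does not reorder anything, so it leaves open how point 1 is guaranteed once several sockets feed the same queue.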

The only robust way we could think of to keep the packets in order for
the device queue is to re-order packets in the Qdisc.

We tried to reach out for confirmation [2] of this understanding but
didn't receive any word.

Even if we reach a decision that the Qdisc should not re-order packets
(we wouldn't have any dependency on hrtimers in the offload case, as you
pointed out), we still need hrtimers for the software implementation.

So, I guess, the problem remains, if it's possible for the user to
express a /dev/ptp* clock, what should we do? 

>
> Thanks,
> Richard

Cheers,
--
Vinicius

[1] https://patchwork.ozlabs.org/comment/1770302/

[2] https://patchwork.ozlabs.org/comment/1816492/q

[net-next 1/7] fm10k: Fix configuration for macvlan offload

2018-01-24 Thread Jeff Kirsher

From: Alexander Duyck 

The fm10k driver didn't work correctly when macvlan offload was enabled.
Specifically what would occur is that we would see no unicast packets being
received. This was traced down to us not correctly configuring the default
VLAN ID for the port and defaulting to 0.

To correct this we either use the default ID provided by the switch or
simply use 1. With that we are able to pass and receive traffic without any
issues.

In addition we were not repopulating the filter table following a reset. To
correct that I have added a bit of code to fm10k_restore_rx_state that will
repopulate the Rx filter configuration for the macvlan interfaces.

Signed-off-by: Alexander Duyck 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index adc62fb38c49..6d9088956407 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1182,9 +1182,10 @@ static void fm10k_set_rx_mode(struct net_device *dev)
 
 void fm10k_restore_rx_state(struct fm10k_intfc *interface)
 {
+   struct fm10k_l2_accel *l2_accel = interface->l2_accel;
struct net_device *netdev = interface->netdev;
struct fm10k_hw *hw = &interface->hw;
-   int xcast_mode;
+   int xcast_mode, i;
u16 vid, glort;
 
/* record glort for this interface */
@@ -1234,6 +1235,24 @@ void fm10k_restore_rx_state(struct fm10k_intfc 
*interface)
__dev_uc_sync(netdev, fm10k_uc_sync, fm10k_uc_unsync);
__dev_mc_sync(netdev, fm10k_mc_sync, fm10k_mc_unsync);
 
+   /* synchronize macvlan addresses */
+   if (l2_accel) {
+   for (i = 0; i < l2_accel->size; i++) {
+   struct net_device *sdev = l2_accel->macvlan[i];
+
+   if (!sdev)
+   continue;
+
+   glort = l2_accel->dglort + 1 + i;
+
+   hw->mac.ops.update_xcast_mode(hw, glort,
+ FM10K_XCAST_MODE_MULTI);
+   fm10k_queue_mac_request(interface, glort,
+   sdev->dev_addr,
+   hw->mac.default_vid, true);
+   }
+   }
+
fm10k_mbx_unlock(interface);
 
/* record updated xcast mode state */
@@ -1490,7 +1509,7 @@ static void *fm10k_dfwd_add_station(struct net_device 
*dev,
hw->mac.ops.update_xcast_mode(hw, glort,
  FM10K_XCAST_MODE_MULTI);
fm10k_queue_mac_request(interface, glort, sdev->dev_addr,
-   0, true);
+   hw->mac.default_vid, true);
}
 
fm10k_mbx_unlock(interface);
@@ -1530,7 +1549,7 @@ static void fm10k_dfwd_del_station(struct net_device 
*dev, void *priv)
hw->mac.ops.update_xcast_mode(hw, glort,
  FM10K_XCAST_MODE_NONE);
fm10k_queue_mac_request(interface, glort, sdev->dev_addr,
-   0, false);
+   hw->mac.default_vid, false);
}
 
fm10k_mbx_unlock(interface);
-- 
2.14.3

[net-next 5/7] fm10k: don't assume VLAN 1 is enabled

2018-01-24 Thread Jeff Kirsher

From: Jacob Keller 

Since commit 856dfd69e84f ("fm10k: Fix multicast mode synch issues",
2016-03-03) we've incorrectly assumed that VLAN 1 is enabled when the
default VID is not set.

This occurs because we check the default_vid and if it's zero, start
several loops over the active_vlans bitmask at 1, instead of checking to
ensure that that bit is active.

This happened because of commit d9ff3ee8efe9 ("fm10k: Add support for
VLAN 0 w/o default VLAN", 2014-08-07) which mistakenly assumed that we
should send requests for MAC and VLAN filters with VLAN 0 when the
default_vid isn't set.

However, the switch generally considers this an invalid configuration,
so the only time we'd have a default_vid of 0 is when the switch is
down.

Instead, let's just not request any filters for the default_vid if it's
not yet been assigned.

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 4cf68a235318..4c9d8e52415b 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1050,14 +1050,13 @@ static int __fm10k_uc_sync(struct net_device *dev,
   const unsigned char *addr, bool sync)
 {
struct fm10k_intfc *interface = netdev_priv(dev);
-   struct fm10k_hw *hw = &interface->hw;
u16 vid, glort = interface->glort;
s32 err;
 
if (!is_valid_ether_addr(addr))
return -EADDRNOTAVAIL;
 
-   for (vid = hw->mac.default_vid ? fm10k_find_next_vlan(interface, 0) : 1;
+   for (vid = fm10k_find_next_vlan(interface, 0);
 vid < VLAN_N_VID;
 vid = fm10k_find_next_vlan(interface, vid)) {
err = fm10k_queue_mac_request(interface, glort,
@@ -1116,14 +1115,13 @@ static int __fm10k_mc_sync(struct net_device *dev,
   const unsigned char *addr, bool sync)
 {
struct fm10k_intfc *interface = netdev_priv(dev);
-   struct fm10k_hw *hw = &interface->hw;
u16 vid, glort = interface->glort;
s32 err;
 
if (!is_multicast_ether_addr(addr))
return -EADDRNOTAVAIL;
 
-   for (vid = hw->mac.default_vid ? fm10k_find_next_vlan(interface, 0) : 1;
+   for (vid = fm10k_find_next_vlan(interface, 0);
 vid < VLAN_N_VID;
 vid = fm10k_find_next_vlan(interface, vid)) {
err = fm10k_queue_mac_request(interface, glort,
@@ -1223,7 +1221,7 @@ void fm10k_restore_rx_state(struct fm10k_intfc *interface)
 xcast_mode == FM10K_XCAST_MODE_PROMISC);
 
/* update table with current entries */
-   for (vid = hw->mac.default_vid ? fm10k_find_next_vlan(interface, 0) : 1;
+   for (vid = fm10k_find_next_vlan(interface, 0);
 vid < VLAN_N_VID;
 vid = fm10k_find_next_vlan(interface, vid)) {
fm10k_queue_vlan_request(interface, vid, 0, true);
-- 
2.14.3
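
To make the loop change easier to see, here is a small user-space sketch. It assumes a 64-entry VLAN space and a helper that returns the next set bit strictly above the given vid. All toy_* names are hypothetical, and the behaviour of the real fm10k_find_next_vlan() is not reproduced here.

```c
#include <stdint.h>

#define TOY_N_VID 64	/* hypothetical small VLAN space for the example */

/* Return the first set bit strictly above 'vid', or TOY_N_VID if none. */
static unsigned int toy_find_next_vlan(uint64_t active, unsigned int vid)
{
	for (vid++; vid < TOY_N_VID; vid++)
		if (active & (UINT64_C(1) << vid))
			return vid;
	return TOY_N_VID;
}

/* Count how many VLAN IDs a sync loop visits when it starts at 'first'. */
static unsigned int toy_visit(uint64_t active, unsigned int first)
{
	unsigned int vid, n = 0;

	for (vid = first; vid < TOY_N_VID; vid = toy_find_next_vlan(active, vid))
		n++;
	return n;
}

/* Old loop shape: start at 1 when no default VID is set. */
static unsigned int toy_old_count(uint64_t active, unsigned int default_vid)
{
	return toy_visit(active, default_vid ? toy_find_next_vlan(active, 0) : 1);
}

/* New loop shape: always ask the bitmask for the first active VLAN. */
static unsigned int toy_new_count(uint64_t active)
{
	return toy_visit(active, toy_find_next_vlan(active, 0));
}
```

With an empty bitmask and no default VID, the old shape still visits VLAN 1 once, while the new shape visits nothing, which is the behaviour the commit message argues for.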

[net-next 7/7] fm10k: clarify action when updating the VLAN table

2018-01-24 Thread Jeff Kirsher

From: Ngai-Mint Kwan 

Clarify the comment to say that we update the VLAN table when entering
promiscuous mode. Add a comment distinguishing the case where we're
exiting promiscuous mode and need to clear the entire VLAN table.

Signed-off-by: Ngai-Mint Kwan 
Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 4c9d8e52415b..a38ae5c54da3 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1165,10 +1165,12 @@ static void fm10k_set_rx_mode(struct net_device *dev)
 
/* update xcast mode first, but only if it changed */
if (interface->xcast_mode != xcast_mode) {
-   /* update VLAN table */
+   /* update VLAN table when entering promiscuous mode */
if (xcast_mode == FM10K_XCAST_MODE_PROMISC)
fm10k_queue_vlan_request(interface, FM10K_VLAN_ALL,
 0, true);
+
+   /* clear VLAN table when exiting promiscuous mode */
if (interface->xcast_mode == FM10K_XCAST_MODE_PROMISC)
fm10k_clear_unused_vlans(interface);
 
-- 
2.14.3

[net-next 0/7][pull request] 100GbE Intel Wired LAN Driver Updates 2018-01-24

2018-01-24 Thread Jeff Kirsher

This series contains updates to fm10k only.

Alex fixes MACVLAN offload for fm10k, where we were not seeing unicast
packets being received because we did not correctly configure the
default VLAN ID for the port and defaulted to 0.

Jake cleans up unnecessary parentheses in a couple of "if" statements.
Fixed the driver to stop adding VLAN 0 into the VLAN table, since it
would cause the VLAN table to be inconsistent between the PF and VF.
Also fixed an issue where we were assuming that VLAN 1 is enabled when
the default VLAN ID is not set, so resolve by not requesting any filters
for the default_vid if it has not yet been assigned.

Ngai fixes an issue which was generating a dmesg message about being unable
to kill a particular VLAN ID for the device.  This happens because
ndo_vlan_rx_kill_vid() exits with an error: the handler for this ndo is
fm10k_update_vid(), which exits prematurely under PF VLAN management.
So to resolve, we must check the VLAN update action type before exiting
fm10k_update_vid(), and act appropriately based on the action type.
Also corrected code comment typos.

The following are changes since commit 46410c2efa9cb5b2f40c9ce24a75d147f44aedeb:
  Merge branch 'pktgen-Behavior-flags-fixes'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Alexander Duyck (1):
  fm10k: Fix configuration for macvlan offload

Jacob Keller (3):
  fm10k: cleanup unnecessary parenthesis in fm10k_iov.c
  fm10k: stop adding VLAN 0 to the VLAN table
  fm10k: don't assume VLAN 1 is enabled

Ngai-Mint Kwan (3):
  fm10k: fix "failed to kill vid" message for VF
  fm10k: correct typo in fm10k_pf.c
  fm10k: clarify action when updating the VLAN table

 drivers/net/ethernet/intel/fm10k/fm10k_iov.c|  4 +-
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 54 ++---
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c |  2 +-
 3 files changed, 43 insertions(+), 17 deletions(-)

-- 
2.14.3

[net-next 4/7] fm10k: stop adding VLAN 0 to the VLAN table

2018-01-24 Thread Jeff Kirsher

From: Jacob Keller 

Currently, when the driver loads, it sends a request to add VLAN 0 to the
VLAN table. For the PF, this is honored, and VLAN 0 is indeed set. For
the VF, this request is silently converted into a request for the
default VLAN as defined by either the switch vid or the PF vid.

This results in the odd behavior that the VLAN table doesn't appear
consistent between the PF and the VF.

Furthermore, setting a MAC filter with VLAN 0 is generally considered an
invalid configuration by the switch, and since commit 856dfd69e84f
("fm10k: Fix multicast mode synch issues", 2016-03-03) we've had code
which prevents us from ever sending such a request.

Since there's not really a good reason to keep VLAN 0 in the VLAN table,
stop requesting it in fm10k_restore_rx_state().

This might seem to indicate that we would no longer properly configure
the MAC and VLAN tables for the default vid. However, due to the way
that fm10k_find_next_vlan() behaves, it will always return the
default_vid as enabled.

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index e85e0b077da3..4cf68a235318 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1222,9 +1222,6 @@ void fm10k_restore_rx_state(struct fm10k_intfc *interface)
fm10k_queue_vlan_request(interface, FM10K_VLAN_ALL, 0,
 xcast_mode == FM10K_XCAST_MODE_PROMISC);
 
-   /* Add filter for VLAN 0 */
-   fm10k_queue_vlan_request(interface, 0, 0, true);
-
/* update table with current entries */
for (vid = hw->mac.default_vid ? fm10k_find_next_vlan(interface, 0) : 1;
 vid < VLAN_N_VID;
-- 
2.14.3

[net-next 3/7] fm10k: fix "failed to kill vid" message for VF

2018-01-24 Thread Jeff Kirsher

From: Ngai-Mint Kwan 

When a VF is under PF VLAN assignment:

ip link set  vf <#> vlan 

This will remove all previous entries in the VLAN table including those
generated by VLAN interfaces created on the VF. The issue arises when
the VF is under PF VLAN assignment and one or more of these VLAN
interfaces of the VF are deleted. When deleting these VLAN interfaces,
the following message will be generated in "dmesg":

failed to kill vid 0081/ for device 

This is due to the fact that "ndo_vlan_rx_kill_vid" exits with an error.
The handler for this ndo is "fm10k_update_vid". Any calls to this
function while under PF VLAN management will exit prematurely and, thus,
it will generate the failure message.

Additionally, since "fm10k_update_vid" exits prematurely, none of the
VLAN updates are performed. So, even though the actual VLAN interfaces of
the VF will be deleted, the active_vlans bitmask is not cleared. When
the VF is no longer under PF VLAN assignment, the driver mistakenly
restores the previous entries of the VLAN table based on an
unsynchronized list of active VLANs.

The solution to this issue involves checking the VLAN update action type
before exiting "fm10k_update_vid". If the VLAN update action type is to
"add", this action will not be permitted while the VF is under PF VLAN
assignment and the VLAN update is abandoned like before.

However, if the VLAN update action type is to "kill", then we also need
to clear the active_vlans bitmask. We don't need to actually queue any
messages to the PF, though, because the MAC and VLAN tables have
already been cleared, and the PF would silently ignore these requests
anyways.

Signed-off-by: Ngai-Mint Kwan 
Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 6d9088956407..e85e0b077da3 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -934,8 +934,12 @@ static int fm10k_update_vid(struct net_device *netdev, u16 
vid, bool set)
if (vid >= VLAN_N_VID)
return -EINVAL;
 
-   /* Verify we have permission to add VLANs */
-   if (hw->mac.vlan_override)
+   /* Verify that we have permission to add VLANs. If this is a request
+* to remove a VLAN, we still want to allow the user to remove the
+* VLAN device. In that case, we need to clear the bit in the
+* active_vlans bitmask.
+*/
+   if (set && hw->mac.vlan_override)
return -EACCES;
 
/* update active_vlans bitmask */
@@ -954,6 +958,12 @@ static int fm10k_update_vid(struct net_device *netdev, u16 
vid, bool set)
rx_ring->vid &= ~FM10K_VLAN_CLEAR;
}
 
+   /* If our VLAN has been overridden, there is no reason to send VLAN
+* removal requests as they will be silently ignored.
+*/
+   if (hw->mac.vlan_override)
+   return 0;
+
/* Do not remove default VLAN ID related entries from VLAN and MAC
 * tables
 */
-- 
2.14.3
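
The resulting control flow can be modelled in a few lines of user-space C. This is only a sketch under simplifying assumptions: struct toy_vf, its 64-bit bitmask and the requests_sent counter are hypothetical and stand in for the driver state and the mailbox requests described above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified VF state used only for this illustration. */
struct toy_vf {
	uint64_t active_vlans;		/* one bit per VLAN ID, 64 IDs only */
	bool vlan_override;		/* true while under PF VLAN assignment */
	unsigned int requests_sent;	/* stands in for messages queued to the PF */
};

static int toy_update_vid(struct toy_vf *vf, unsigned int vid, bool set)
{
	if (vid >= 64)
		return -EINVAL;

	/* Adding a VLAN is refused under PF assignment; removal is allowed. */
	if (set && vf->vlan_override)
		return -EACCES;

	/* Keep the bitmask in sync in both directions. */
	if (set)
		vf->active_vlans |= UINT64_C(1) << vid;
	else
		vf->active_vlans &= ~(UINT64_C(1) << vid);

	/* Under override, a removal request would be ignored, so skip it. */
	if (vf->vlan_override)
		return 0;

	vf->requests_sent++;
	return 0;
}
```

In this model, a "kill" under override succeeds and clears the bit without sending anything, so the bitmask no longer holds stale entries when the override is later lifted.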

[net-next 6/7] fm10k: correct typo in fm10k_pf.c

2018-01-24 Thread Jeff Kirsher

From: Ngai-Mint Kwan 

Signed-off-by: Ngai-Mint Kwan 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
index 425d814aed4d..d6406fc31ffb 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
@@ -866,7 +866,7 @@ static s32 fm10k_iov_assign_default_mac_vlan_pf(struct 
fm10k_hw *hw,
/* Determine correct default VLAN ID. The FM10K_VLAN_OVERRIDE bit is
 * used here to indicate to the VF that it will not have privilege to
 * write VLAN_TABLE. All policy is enforced on the PF but this allows
-* the VF to correctly report errors to userspace rqeuests.
+* the VF to correctly report errors to userspace requests.
 */
if (vf_info->pf_vid)
vf_vid = vf_info->pf_vid | FM10K_VLAN_OVERRIDE;
-- 
2.14.3

[net-next 2/7] fm10k: cleanup unnecessary parenthesis in fm10k_iov.c

2018-01-24 Thread Jeff Kirsher

From: Jacob Keller 

This fixes a few warnings found by checkpatch.pl --strict

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
index ea3ab24265ee..760cfa52d02c 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
@@ -353,7 +353,7 @@ int fm10k_iov_resume(struct pci_dev *pdev)
struct fm10k_vf_info *vf_info = &iov_data->vf_info[i];
 
/* allocate all but the last GLORT to the VFs */
-   if (i == ((~hw->mac.dglort_map) >> FM10K_DGLORTMAP_MASK_SHIFT))
+   if (i == (~hw->mac.dglort_map >> FM10K_DGLORTMAP_MASK_SHIFT))
break;
 
/* assign GLORT to VF, and restrict it to multicast */
@@ -511,7 +511,7 @@ int fm10k_iov_configure(struct pci_dev *pdev, int num_vfs)
return err;
 
/* allocate VFs if not already allocated */
-   if (num_vfs && (num_vfs != current_vfs)) {
+   if (num_vfs && num_vfs != current_vfs) {
/* Disable completer abort error reporting as
 * the VFs can trigger this any time they read a queue
 * that they don't own.
-- 
2.14.3

Re: [PATCH] cls_flower: check if filter is in HW before calling fl_hw_destroy_filter()

2018-01-24 Thread Jakub Kicinski

On Wed, 24 Jan 2018 17:12:55 +0530, Sathya Perla wrote:
> When a filter cannot be added in HW (i.e., fl_hw_replace_filter() returns
> error), the TCA_CLS_FLAGS_IN_HW flag is not set in the filter flags.
> 
> This flag (via tc_in_hw()) must be checked before issuing the call
> to delete a filter in HW (fl_hw_destroy_filter()) and before issuing the
> call to query stats (fl_hw_update_stats()).
> 
> Signed-off-by: Sathya Perla 

Could you explain why you want to make that change?  Saying "tc_in_hw()
must be checked" is a bit strong, tc_in_hw() is useless from correctness 
POV.  Your patch may be a good optimization, but with shared blocks in
the picture now tc_in_hw() == true doesn't mean it's in *your* HW.

Re: [PATCH bpf-next v8 08/12] bpf: Add support for reading sk_state and more

2018-01-24 Thread Lawrence Brakmo

On 1/24/18, 12:07 PM, "netdev-ow...@vger.kernel.org on behalf of Yuchung Cheng" 
 wrote:

On Tue, Jan 23, 2018 at 11:57 PM, Lawrence Brakmo  wrote:
> Add support for reading many more tcp_sock fields
>
>   state,same as sk->sk_state
>   rtt_min   same as sk->rtt_min.s[0].v (current rtt_min)
>   snd_ssthresh
>   rcv_nxt
>   snd_nxt
>   snd_una
>   mss_cache
>   ecn_flags
>   rate_delivered
>   rate_interval_us
>   packets_out
>   retrans_out
Might as well get ca_state, sacked_out and lost_out to estimate CA
states and the packets in flight?

Will try to add in updated patchset. If not, I will add as a new patch.

>   total_retrans
>   segs_in
>   data_segs_in
>   segs_out
>   data_segs_out
>   sk_txhash
>   bytes_received (__u64)
>   bytes_acked(__u64)
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  include/uapi/linux/bpf.h |  20 +++
>  net/core/filter.c| 135 
+++
>  2 files changed, 144 insertions(+), 11 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2a8c40a..6998032 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -979,6 +979,26 @@ struct bpf_sock_ops {
> __u32 snd_cwnd;
> __u32 srtt_us;  /* Averaged RTT << 3 in usecs */
> __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h 
*/
> +   __u32 state;
> +   __u32 rtt_min;
> +   __u32 snd_ssthresh;
> +   __u32 rcv_nxt;
> +   __u32 snd_nxt;
> +   __u32 snd_una;
> +   __u32 mss_cache;
> +   __u32 ecn_flags;
> +   __u32 rate_delivered;
> +   __u32 rate_interval_us;
> +   __u32 packets_out;
> +   __u32 retrans_out;
> +   __u32 total_retrans;
> +   __u32 segs_in;
> +   __u32 data_segs_in;
> +   __u32 segs_out;
> +   __u32 data_segs_out;
> +   __u32 sk_txhash;
> +   __u64 bytes_received;
> +   __u64 bytes_acked;
>  };
>
>  /* List of known BPF sock_ops operators.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6936d19..ffe9b60 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
>  }
>  EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
>
> -static bool __is_valid_sock_ops_access(int off, int size)
> +static bool sock_ops_is_valid_access(int off, int size,
> +enum bpf_access_type type,
> +struct bpf_insn_access_aux *info)
>  {
> +   const int size_default = sizeof(__u32);
> +
> if (off < 0 || off >= sizeof(struct bpf_sock_ops))
> return false;
> +
> /* The verifier guarantees that size > 0. */
> if (off % size != 0)
> return false;
> -   if (size != sizeof(__u32))
> -   return false;
> -
> -   return true;
> -}
>
> -static bool sock_ops_is_valid_access(int off, int size,
> -enum bpf_access_type type,
> -struct bpf_insn_access_aux *info)
> -{
> if (type == BPF_WRITE) {
> switch (off) {
> case offsetof(struct bpf_sock_ops, reply):
> +   if (size != size_default)
> +   return false;
> break;
> default:
> return false;
> }
> +   } else {
> +   switch (off) {
> +   case bpf_ctx_range_till(struct bpf_sock_ops, 
bytes_received,
> +   bytes_acked):
> +   if (size != sizeof(__u64))
> +   return false;
> +   break;
> +   default:
> +   if (size != size_default)
> +   return false;
> +   break;
> +   }
> }
>
> -   return __is_valid_sock_ops_access(off, size);
> +   return true;
>  }
>
>  static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
> @@ -4498,6 +4508,32 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
>is_fullsock));
> break;
>
> +   case offsetof(struct bpf_sock_ops, state):
> +
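
The read-side size rule in the quoted sock_ops_is_valid_access() hunk can be sketched outside the kernel as follows. The struct layout and all toy_* names are hypothetical; only the rule itself (4-byte reads for the 32-bit fields, 8-byte reads for the two trailing 64-bit fields) is taken from the quoted diff.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, much smaller stand-in for struct bpf_sock_ops. */
struct toy_sock_ops {
	uint32_t op;
	uint32_t reply;
	uint32_t state;
	uint32_t pad;			/* keeps the 64-bit fields 8-byte aligned */
	uint64_t bytes_received;
	uint64_t bytes_acked;
};

/* Return true if a read of 'size' bytes at offset 'off' is acceptable. */
static bool toy_is_valid_read(int off, int size)
{
	if (off < 0 || (size_t)off >= sizeof(struct toy_sock_ops))
		return false;

	/* Accesses must be naturally aligned for their size. */
	if (size <= 0 || off % size != 0)
		return false;

	/* The trailing 64-bit counters must be read as a whole. */
	if ((size_t)off >= offsetof(struct toy_sock_ops, bytes_received))
		return size == (int)sizeof(uint64_t);

	/* Everything before them is a 32-bit field. */
	return size == (int)sizeof(uint32_t);
}
```

In this toy layout, a 4-byte read of state is accepted, while a 4-byte read into the middle of bytes_received is rejected even though it is aligned.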

Re: [PATCH net-next 3/3] net/ipv6: Add support for onlink flag

2018-01-24 Thread David Ahern

On 1/23/18 8:00 PM, David Ahern wrote:
> + tbid = l3mdev_fib_table(dev) ? : RT_TABLE_MAIN;
> + if (cfg->fc_table && cfg->fc_table != tbid) {
> + NL_SET_ERR_MSG(extack,
> +"Table id mismatch between given table and 
> device");
> + return -EINVAL;
> + }
> +
> + cfg->fc_table = tbid;
> +
> + return 0;

This table check is too restrictive for some PBR cases.

Dave: please drop this set; I'll repost.

Re: [PATCH net-next] cxgb4: make symbol pedits static

2018-01-24 Thread David Miller

From: Wei Yongjun 
Date: Wed, 24 Jan 2018 02:14:33 +

> Fixes the following sparse warning:
> 
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:46:27: warning:
>  symbol 'pedits' was not declared. Should it be static?
> 
> Fixes: 27ece1f357b7 ("cxgb4: add tc flower support for ETH-DMAC rewrite")
> Signed-off-by: Wei Yongjun 

Applied, thank you.

Re: [PATCH net 1/2] vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()

2018-01-24 Thread David Miller

From: "Michael S. Tsirkin" 
Date: Wed, 24 Jan 2018 23:46:19 +0200

> On Wed, Jan 24, 2018 at 04:38:30PM -0500, David Miller wrote:
>> From: Jason Wang 
>> Date: Tue, 23 Jan 2018 17:27:25 +0800
>> 
>> > We used to call mutex_lock() in vhost_dev_lock_vqs() which tries to
>> > hold mutexes of all virtqueues. This may confuse lockdep to report a
>> > possible deadlock because of trying to hold locks belong to same
>> > class. Switch to use mutex_lock_nested() to avoid false positive.
>> > 
>> > Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
>> > Reported-by: syzbot+dbb7c1161485e61b0...@syzkaller.appspotmail.com
>> > Signed-off-by: Jason Wang 
>> 
>> Michael, I see you ACK'd this, meaning that you're OK with these two
>> fixes going via my net tree?
>> 
>> Thanks.
> 
> Yes - this seems to be what Jason wanted (judging by the net
> tag in the subject) and I'm fine with it.
> Thanks a lot.

Great, not a problem, done.

Re: [PATCH net] net: erspan: fix use-after-free

2018-01-24 Thread David Miller

From: William Tu 
Date: Tue, 23 Jan 2018 17:01:29 -0800

> When building the erspan header for either v1 or v2, the eth_hdr()
> does not point to the right inner packet's eth_hdr,
> causing kasan report use-after-free and slab-out-of-bounds read.
 ...
> Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre")
> Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
> Reported-by: syzbot+9723f2d288e49b492...@syzkaller.appspotmail.com
> Reported-by: syzbot+f0ddeb2b032a8e1d9...@syzkaller.appspotmail.com
> Reported-by: syzbot+f14b3703cd8d76702...@syzkaller.appspotmail.com
> Reported-by: syzbot+eefa384efad8d7997...@syzkaller.appspotmail.com
> Signed-off-by: William Tu 

Applied to net-next.

Re: [PATCH net] i40e: flower: check if TC offload is enabled on a netdev

2018-01-24 Thread David Miller

From: Jeff Kirsher 
Date: Tue, 23 Jan 2018 08:47:29 -0800

> On Tue, 2018-01-23 at 00:08 -0800, Jakub Kicinski wrote:
>> Since TC block changes drivers are required to check if
>> the TC hw offload flag is set on the interface themselves.
>> 
>> Fixes: 2f4b411a3d67 ("i40e: Enable cloud filters via tc-flower")
>> Fixes: 44ae12a768b7 ("net: sched: move the can_offload check from
>> binding phase to rule insertion phase")
>> Signed-off-by: Jakub Kicinski 
>> Reviewed-by: Simon Horman 
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e_main.c | 2 ++
>>  1 file changed, 2 insertions(+)
> 
> Acked-by: Jeff Kirsher 
> 
> Dave, feel free to pick this up.

Ok, done.  Thanks.

Re: [Patch net-next v2 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-24 Thread David Miller

From: Cong Wang 
Date: Tue, 23 Jan 2018 10:18:56 -0800

> This patchset restores the pfifo_fast qdisc behavior of dropping
> packets based on latest dev->tx_queue_len. Patch 1 introduces
> a helper, patch 2 introduces a new Qdisc ops which is called when
> we modify tx_queue_len, patch 3 implements this ops for pfifo_fast.
> 
> Please see each patch for details.
> 
> ---
> v2: handle error case for ->change_tx_queue_len()

John, please review.

Thanks.

Re: [PATCH net-next v2 00/12] net: sched: propagate extack to cls offloads on destroy and only with skip_sw

2018-01-24 Thread Jakub Kicinski

On Wed, 24 Jan 2018 22:15:00 +0100, Jiri Pirko wrote:
> Wed, Jan 24, 2018 at 10:07:25PM CET, dsah...@gmail.com wrote:
> >On 1/24/18 2:04 PM, Jiri Pirko wrote:  
> >> For the record, I still think it is odd to have 6 patches just to add
> >> one arg to a function. I wonder where this unnecessary patch splits
> >> would lead to in the future.  
> >
> >I think it made the review much easier than 1 really long patch.  
> 
> Even squashed, the patch is quite small. Doing the same thing in every
> hunk.
> 
> On the contrary, the split made it more complicated for me, because when
> I looked at patch 1 and the function duplication with another arg,
> I did not understand what is going on. Only the last patch actually
> explained it. But perhaps I'm slow.

Next time I'll do a better job explaining things in commit logs, sorry!

Re: [PATCH net 1/2] vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()

2018-01-24 Thread Michael S. Tsirkin

On Wed, Jan 24, 2018 at 04:38:30PM -0500, David Miller wrote:
> From: Jason Wang 
> Date: Tue, 23 Jan 2018 17:27:25 +0800
> 
> > We used to call mutex_lock() in vhost_dev_lock_vqs() which tries to
> > hold mutexes of all virtqueues. This may confuse lockdep to report a
> > possible deadlock because of trying to hold locks belong to same
> > class. Switch to use mutex_lock_nested() to avoid false positive.
> > 
> > Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> > Reported-by: syzbot+dbb7c1161485e61b0...@syzkaller.appspotmail.com
> > Signed-off-by: Jason Wang 
> 
> Michael, I see you ACK'd this, meaning that you're OK with these two
> fixes going via my net tree?
> 
> Thanks.

Yes - this seems to be what Jason wanted (judging by the net
tag in the subject) and I'm fine with it.
Thanks a lot.

-- 
MST

Re: [PATCH net 0/2] qed: rdma bug fixes

2018-01-24 Thread David Miller

From: Michal Kalderon 
Date: Tue, 23 Jan 2018 11:33:45 +0200

> This patch contains two small bug fixes related to RDMA.
> Both are related to resource reservations.
> 
> Signed-off-by: Michal Kalderon 
> Signed-off-by: Ariel Elior 

Series applied, thanks Michal.

Re: [PATCH net-next] rds: tcp: per-netns flag to stop new connection creation when rds-tcp is being dismantled

2018-01-24 Thread Santosh Shilimkar


On 1/24/2018 1:03 PM, Sowmini Varadhan wrote:

An rds_connection can get added during netns deletion between lines 528
and 529 of

   506 static void rds_tcp_kill_sock(struct net *net)
   :
   /* code to pull out all the rds_connections that should be destroyed */
   :
   528 spin_unlock_irq(&rds_tcp_conn_lock);
   529 list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
   530 rds_conn_destroy(tc->t_cpath->cp_conn);

Such an rds_connection would miss out the rds_conn_destroy()
loop (that cancels all pending work) and (if it was scheduled
after netns deletion) could trigger the use-after-free.

A similar race-window exists for the module unload path
in rds_tcp_exit -> rds_tcp_destroy_conns

To avoid the addition of new rds_connections during kill_sock
or netns_delete, this patch introduces a per-netns flag,
RTN_DELETE_PENDING, that will cause RDS connection creation to fail.
RCU is used to make sure that we wait for the critical
section of __rds_conn_create threads (that may have started before
the setting of RTN_DELETE_PENDING) to complete before starting
the connection destruction.

Reported-by: syzbot+bbd8e9a06452cc480...@syzkaller.appspotmail.com
Signed-off-by: Sowmini Varadhan 
---
  net/rds/connection.c |3 ++
  net/rds/tcp.c|   82 -
  net/rds/tcp.h|1 +
  3 files changed, 57 insertions(+), 29 deletions(-)


FWIW,
Acked-by: Santosh Shilimkar 

Just for the archives, summarizing the off-list discussion: netns
destroy now makes use of conn_destroy, which in the past was used only
for module unload, and this use is racy.

It's not possible to make it race free with flags alone; it needs an
rcu sync kind of mechanism. RDS being sensitive to brownouts on
reconnects, rcu usage has been minimised. Netns delete is expected to
be a non-frequent operation, and hence usage of rcu as done in this
patch is probably ok. If needed it will be revisited in future for
optimization.

regards,
Santosh

Re: [PATCH net 1/2] vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()

2018-01-24 Thread David Miller

From: Jason Wang 
Date: Tue, 23 Jan 2018 17:27:25 +0800

> We used to call mutex_lock() in vhost_dev_lock_vqs() which tries to
> hold mutexes of all virtqueues. This may confuse lockdep to report a
> possible deadlock because of trying to hold locks belong to same
> class. Switch to use mutex_lock_nested() to avoid false positive.
> 
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Reported-by: syzbot+dbb7c1161485e61b0...@syzkaller.appspotmail.com
> Signed-off-by: Jason Wang 

Michael, I see you ACK'd this, meaning that you're OK with these two
fixes going via my net tree?

Thanks.

Re: [PATCH net-next 2/2] net: sched: add em_ipt ematch for calling xtables matches

2018-01-24 Thread David Miller

From: Eyal Birger 
Date: Tue, 23 Jan 2018 11:17:32 +0200

> + network_offset = skb_network_offset(skb);
> + skb_pull(skb, network_offset);
> +
> + rcu_read_lock();
> +
> + if (skb->skb_iif)
> + indev = dev_get_by_index_rcu(em->net, skb->skb_iif);
> +
> + nf_hook_state_init(&state, im->hook, im->nfproto, indev ?: skb->dev,
> +skb->dev, NULL, em->net, NULL);
> +
> + acpar.match = im->match;
> + acpar.matchinfo = im->match_data;
> + acpar.state = &state;
> +
> + ret = im->match->match(skb, &acpar);
> +
> + rcu_read_unlock();
> +
> + skb_push(skb, network_offset);

If the SKB is shared in any way, this pull/push around the NF hook
invocation is illegal.

Re: [PATCH v4] net: qcom/emac: extend DMA mask to 46bits

2018-01-24 Thread David Miller

From: Wang Dongsheng 
Date: Mon, 22 Jan 2018 20:25:06 -0800

> Bit TPD3[31] is used as a timestamp bit if PTP is enabled, but
> it's used as an address bit if PTP is disabled.  Since PTP isn't
> supported by the driver, we can extend the DMA address to 46 bits.
> 
> Signed-off-by: Wang Dongsheng 

Applied to net-next, thanks.
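
For readers unfamiliar with what a 46-bit DMA mask means numerically, here
is a minimal user-space sketch. The DMA_BIT_MASK() definition mirrors the
macro in the kernel's include/linux/dma-mapping.h; addr_fits_mask() is a
hypothetical helper added only for illustration and is not part of the
driver or of the patch above.

```c
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK(): a mask with the low n bits set.
 * The n == 64 case is special-cased because shifting by 64 is undefined. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical helper (not a kernel API): returns 1 if every set bit of
 * addr lies inside mask, i.e. the address is reachable under that mask. */
static int addr_fits_mask(uint64_t addr, uint64_t mask)
{
    return (addr & ~mask) == 0;
}
```

With a 46-bit mask, an address with bit 45 set is still addressable, while
one with bit 46 set is not. This says nothing about how the emac driver
itself applies the mask; it only illustrates the arithmetic.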

Re: [PATCH] ip_tunnel: Use mark in skb by default

2018-01-24 Thread David Miller

From: Thomas Winter 
Date: Tue, 23 Jan 2018 16:46:24 +1300

> This allows marks set by connmark in iptables
> to be used for route lookups.
> 
> Signed-off-by: Thomas Winter 

Applied to net-next, thanks.
