[PATCH net-next] ptr_ring: fix integer overflow
We try to allocate one more entry for lockless peeking. The addition may overflow, in which case zero is passed to kmalloc(). kmalloc() then returns ZERO_SIZE_PTR, which ptr_ring does not notice. Any subsequent produce or consume on such a ring will lead to a NULL pointer dereference. Fix this by detecting the overflow and failing early.

Fixes: bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can overrun array bounds")
Reported-by: syzbot+87678bcf753b44c39...@syzkaller.appspotmail.com
Cc: John Fastabend
Signed-off-by: Jason Wang
---
 include/linux/ptr_ring.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 9ca1726..3f99484 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -453,6 +453,8 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
+	if (unlikely(size + 1 == 0))
+		return NULL;
 	/* Allocate an extra dummy element at end of ring to avoid consumer head
 	 * or produce head access past the end of the array. Possible when
 	 * producer/consumer operations and __ptr_ring_peek operations run in
-- 
2.7.4
[PATCH net] ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only
If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it implies the socket is ipv6only. However, in inet6_bind(), this addr_type check and the setting of sk->sk_ipv6only to 1 are only done after sk->sk_prot->get_port(sk, snum) has completed successfully. This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses get_port().

In particular, when binding SO_REUSEPORT UDP sockets, udp_reuseport_add_sock(sk,...) is called. udp_reuseport_add_sock() checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to sk2->sk_reuseport_cb. In this case, ipv6_only_sock(sk2) could be 1 while ipv6_only_sock(sk) is still 0. The end result is that reuseport_alloc(sk) is called instead of adding sk to the existing sk2->sk_reuseport_cb.

It can be reproduced by binding two SO_REUSEPORT UDP sockets on an IPv6 address (!ANY and !MAPPED). Only one of the sockets will receive packets.

The fix is to set the implicit sk_ipv6only before calling get_port(). The original sk_ipv6only has to be saved so that it can be restored in case get_port() fails. The situation is similar to the inet_reset_saddr(sk) call after get_port() has failed.

Thanks to Calvin Owens, who created an easy reproduction which led to this fix.
Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Signed-off-by: Martin KaFai Lau --- net/ipv6/af_inet6.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index c9441ca45399..416917719a6f 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -284,6 +284,7 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) struct net *net = sock_net(sk); __be32 v4addr = 0; unsigned short snum; + bool saved_ipv6only; int addr_type = 0; int err = 0; @@ -389,19 +390,21 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) if (!(addr_type & IPV6_ADDR_MULTICAST)) np->saddr = addr->sin6_addr; + saved_ipv6only = sk->sk_ipv6only; + if (addr_type != IPV6_ADDR_ANY && addr_type != IPV6_ADDR_MAPPED) + sk->sk_ipv6only = 1; + /* Make sure we are allowed to bind here. */ if ((snum || !inet->bind_address_no_port) && sk->sk_prot->get_port(sk, snum)) { + sk->sk_ipv6only = saved_ipv6only; inet_reset_saddr(sk); err = -EADDRINUSE; goto out; } - if (addr_type != IPV6_ADDR_ANY) { + if (addr_type != IPV6_ADDR_ANY) sk->sk_userlocks |= SOCK_BINDADDR_LOCK; - if (addr_type != IPV6_ADDR_MAPPED) - sk->sk_ipv6only = 1; - } if (snum) sk->sk_userlocks |= SOCK_BINDPORT_LOCK; inet->inet_sport = htons(inet->inet_num); -- 2.9.5
linux-next: manual merge of the net-next tree with the vfs tree
Hi all, Today's linux-next merge of the net-next tree got a conflict in: net/tipc/socket.c between commit: ade994f4f6c8 ("net: annotate ->poll() instances") from the vfs tree and commit: 60c253069632 ("tipc: fix race between poll() and setsockopt()") from the net-next tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc net/tipc/socket.c index 2aa46e8cd8fe,473a096b6fba.. --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@@ -715,8 -716,7 +716,7 @@@ static __poll_t tipc_poll(struct file * { struct sock *sk = sock->sk; struct tipc_sock *tsk = tipc_sk(sk); - struct tipc_group *grp = tsk->group; - u32 revents = 0; + __poll_t revents = 0; sock_poll_wait(file, sk_sleep(sk), wait);
Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating
Hi Eric

Thanks for your kind response and suggestion. That's really appreciated.

Jianchao

On 01/25/2018 11:55 AM, Eric Dumazet wrote:
> On Thu, 2018-01-25 at 11:27 +0800, jianchao.wang wrote:
>> Hi Tariq
>>
>> On 01/22/2018 10:12 AM, jianchao.wang wrote:
> On 19/01/2018 5:49 PM, Eric Dumazet wrote:
>> On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:
>>> Hi Tariq
>>>
>>> Very sad that the crash was reproduced again after applying the patch.
>>> Memory barriers vary for different archs, can you please share more
>>> details regarding arch and repro steps?
>>> The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89
>>> 12/27/2015
>>> Xen is installed. The crash occurred in DOM0.
>>> Regarding the repro steps, it is a customer's test which does heavy disk
>>> I/O over NFS storage without any guest.
>>>
>>
>> What is the final suggestion on this?
>> If wmb() is used there, is the performance pulled down?
>
> Since
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=dad42c3038a59d27fced28ee4ec1d4a891b28155
> we batch allocations, so mlx4_en_refill_rx_buffers() is not called that often.
>
> I doubt the additional wmb() will have serious impact there.
[RFC] net: qcom/emac: mdiobus-dev fwnode should point to emac-adev
mdiobus always tries to get a GPIO "reset" consumer; with ACPI, the GPIO should be described in the emac adev's _DSD or _CRS. ACPI uses the common mdio API to register, but mdio->dev->fwnode does not point to any adev, so the "reset" consumer can never be found.

OF handles this with of_mdiobus_register(): the mdiobus gets the emac of_node and walks it to find a GPIO "reset" consumer.

I am not sure whether ACPI needs to add a similar API for mdio as OF has, because mdio isn't a real entity in ACPI. So I think no work is needed on the ACPI side; the MAC driver needs to hand the adev to the mdiobus when the mdio bus is registered.

Signed-off-by: Wang Dongsheng
---
 drivers/net/ethernet/qualcomm/emac/emac-phy.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-phy.c b/drivers/net/ethernet/qualcomm/emac/emac-phy.c
index 53dbf1e..69171d5 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-phy.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-phy.c
@@ -117,6 +117,10 @@ int emac_phy_config(struct platform_device *pdev, struct emac_adapter *adpt)
 	if (has_acpi_companion(&pdev->dev)) {
 		u32 phy_addr;
+		struct fwnode_handle *fwnode;
+
+		fwnode = acpi_fwnode_handle(ACPI_COMPANION(&pdev->dev));
+		mii_bus->dev.fwnode = fwnode;
 
 		ret = mdiobus_register(mii_bus);
 		if (ret) {
-- 
2.7.4
Re: [PATCH v2 net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_{match|target}
Eric Dumazet wrote:
> From: Eric Dumazet
>
> It looks like syzbot found its way into netfilter territory.
>
> Issue here is that @name comes from user space and might
> not be null terminated.
>
> Out-of-bound reads happen, KASAN is not happy.
>
> v2 added similar fix for xt_request_find_target(),
> as Florian advised.
>
> Signed-off-by: Eric Dumazet
> Reported-by: syzbot

Thanks a lot Eric!

Acked-by: Florian Westphal
Re: [PATCH] cls_flower: check if filter is in HW before calling fl_hw_destroy_filter()
On Wed, Jan 24, 2018 at 9:37 PM, Jiri Pirko wrote:
> Wed, Jan 24, 2018 at 12:42:55PM CET, sathya.pe...@broadcom.com wrote:
>> When a filter cannot be added in HW (i.e., fl_hw_replace_filter() returns
>> error), the TCA_CLS_FLAGS_IN_HW flag is not set in the filter flags.
>>
>> This flag (via tc_in_hw()) must be checked before issuing the call
>> to delete a filter in HW (fl_hw_destroy_filter()) and before issuing the
>> call to query stats (fl_hw_update_stats()).
>>
>> Signed-off-by: Sathya Perla
>
> 1) You have to indicate what tree you aim this to be applied on:
>    [patch net] or [patch net-next]
> 2) Please provide a "Fixes" line
> 3) Please use scripts/get_maintainer.pl to get the people to cc
> 4) Please aim the fix not only at cls_flower, but at the other cls as well.

Ok, will do. thanks!
Re: [PATCH] cls_flower: check if filter is in HW before calling fl_hw_destroy_filter()
On Thu, Jan 25, 2018 at 3:53 AM, Jakub Kicinski wrote:
>
> On Wed, 24 Jan 2018 17:12:55 +0530, Sathya Perla wrote:
> > When a filter cannot be added in HW (i.e., fl_hw_replace_filter() returns
> > error), the TCA_CLS_FLAGS_IN_HW flag is not set in the filter flags.
> >
> > This flag (via tc_in_hw()) must be checked before issuing the call
> > to delete a filter in HW (fl_hw_destroy_filter()) and before issuing the
> > call to query stats (fl_hw_update_stats()).
> >
> > Signed-off-by: Sathya Perla
>
> Could you explain why you want to make that change? Saying "tc_in_hw()
> must be checked" is a bit strong, tc_in_hw() is useless from correctness
> POV. Your patch may be a good optimization, but with shared blocks in
> the picture now tc_in_hw() == true doesn't mean it's in *your* HW.

I agree that for shared filters when skip_sw is false tcf_block_cb_call() can return a positive status even if the filter add on one of the devices fails. I'll change the commit-log wording to indicate that this new check is an optimization. Thanks!
Re: [v2] wcn36xx: release resources in case of error
Ramon Fried wrote:
> wcn36xx_dxe_init() doesn't check for the return value of
> wcn36xx_dxe_init_descs(); release the resources in case an error occurred.
>
> Signed-off-by: Ramon Fried
> Signed-off-by: Kalle Valo

Patch applied to ath-next branch of ath.git, thanks.

d0bb950b9f5f wcn36xx: release DMA memory in case of error

-- 
https://patchwork.kernel.org/patch/10180503/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
[PATCH v2 4/4] net: check the size of a packet in validate_xmit_skb
There are a number of paths where an oversize skb could be sent to a driver. The driver should not be required to check for this - the core layer should do it instead.

Add a check to validate_xmit_skb that checks both GSO and non-GSO packets and drops them if they are too large.

Signed-off-by: Daniel Axtens
---
 net/core/dev.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 6c96c26aadbf..f09eece2cd21 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1830,13 +1830,11 @@ static inline void net_timestamp_set(struct sk_buff *skb)
 			__net_timestamp(SKB);		\
 	}						\
 
-bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
+static inline bool skb_mac_len_fits_dev(const struct net_device *dev,
+					const struct sk_buff *skb)
 {
 	unsigned int len;
 
-	if (!(dev->flags & IFF_UP))
-		return false;
-
 	len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
 	if (skb->len <= len)
 		return true;
@@ -1850,6 +1848,14 @@ bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
 	return false;
 }
 
+
+bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
+{
+	if (!(dev->flags & IFF_UP))
+		return false;
+
+	return skb_mac_len_fits_dev(dev, skb);
+}
 EXPORT_SYMBOL_GPL(is_skb_forwardable);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
@@ -3081,6 +3087,9 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device
 	if (unlikely(!skb))
 		goto out_null;
 
+	if (unlikely(!skb_mac_len_fits_dev(dev, skb)))
+		goto out_kfree_skb;
+
 	if (netif_needs_gso(skb, features)) {
 		struct sk_buff *segs;
-- 
2.14.1
[PATCH v2 1/4] net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
If you take a GSO skb, and split it into packets, will the network length (L3 headers + L4 headers + payload) of those packets be small enough to fit within a given MTU? skb_gso_validate_mtu gives you the answer to that question. However, we're about to add a way to validate the MAC length of a split GSO skb (L2+L3+L4+payload), and the names get confusing, so rename skb_gso_validate_mtu to skb_gso_validate_network_len.

Signed-off-by: Daniel Axtens
---
 include/linux/skbuff.h                  | 2 +-
 net/core/skbuff.c                       | 9 +
 net/ipv4/ip_forward.c                   | 2 +-
 net/ipv4/ip_output.c                    | 2 +-
 net/ipv4/netfilter/nf_flow_table_ipv4.c | 2 +-
 net/ipv6/ip6_output.c                   | 2 +-
 net/ipv6/netfilter/nf_flow_table_ipv6.c | 2 +-
 net/mpls/af_mpls.c                      | 2 +-
 net/xfrm/xfrm_device.c                  | 2 +-
 9 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b8e0da6c27d6..b137c79bf88d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3286,7 +3286,7 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len);
 int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
 unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
-bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu);
+bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu);
 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
 struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
 int skb_ensure_writable(struct sk_buff *skb, int write_len);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 01e8285aea73..a93e5c7aa5b2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4914,15 +4914,16 @@ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
 EXPORT_SYMBOL_GPL(skb_gso_transport_seglen);
 
 /**
- * skb_gso_validate_mtu - Return in case such skb fits a given MTU
+ * skb_gso_validate_network_len - Will a split GSO skb fit into a
given MTU? * * @skb: GSO skb * @mtu: MTU to validate against * - * skb_gso_validate_mtu validates if a given skb will fit a wanted MTU - * once split. + * skb_gso_validate_network_len validates if a given skb will fit a + * wanted MTU once split. It considers L3 headers, L4 headers, and the + * payload. */ -bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu) +bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu) { const struct skb_shared_info *shinfo = skb_shinfo(skb); const struct sk_buff *iter; diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c index 2dd21c3281a1..b54b948b0596 100644 --- a/net/ipv4/ip_forward.c +++ b/net/ipv4/ip_forward.c @@ -55,7 +55,7 @@ static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu) if (skb->ignore_df) return false; - if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu)) + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) return false; return true; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index e8e675be60ec..66340ab750e6 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -248,7 +248,7 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk, /* common case: seglen is <= mtu */ - if (skb_gso_validate_mtu(skb, mtu)) + if (skb_gso_validate_network_len(skb, mtu)) return ip_finish_output2(net, sk, skb); /* Slowpath - GSO segment length exceeds the egress MTU. 
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c b/net/ipv4/netfilter/nf_flow_table_ipv4.c index b2d01eb25f2c..cdf2625dc277 100644 --- a/net/ipv4/netfilter/nf_flow_table_ipv4.c +++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c @@ -185,7 +185,7 @@ static bool __nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu) if ((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0) return false; - if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu)) + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) return false; return true; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 18547a44bdaf..4e888328d4dd 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -412,7 +412,7 @@ static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) if (skb->ignore_df) return false; - if (skb_is_gso(skb) && skb_gso_validate_mtu(skb, mtu)) + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) return false; return true; diff --git a/net/ipv6/netfilter/nf_flow_table_ipv6.c b/net/ipv6/netfilter/nf_flow_table_ipv6.c index 0c3b9d32f64f..f1ab4e03df7d 100644
[PATCH v2 2/4] net: move skb_gso_mac_seglen to skbuff.h
We're about to use this elsewhere, so move it into the header with the other related functions like skb_gso_network_seglen().

Signed-off-by: Daniel Axtens
---
 include/linux/skbuff.h | 15 +++
 net/sched/sch_tbf.c    | 10 --
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b137c79bf88d..4b3ca6a5ec0a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4120,6 +4120,21 @@ static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb)
 	return hdr_len + skb_gso_transport_seglen(skb);
 }
 
+/**
+ * skb_gso_mac_seglen - Return length of individual segments of a gso packet
+ *
+ * @skb: GSO skb
+ *
+ * skb_gso_mac_seglen is used to determine the real size of the
+ * individual segments, including MAC/L2, Layer3 (IP, IPv6) and L4
+ * headers (TCP/UDP).
+ */
+static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
+{
+	unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
+	return hdr_len + skb_gso_transport_seglen(skb);
+}
+
 /* Local Checksum Offload.
  * Compute outer checksum based on the assumption that the
  * inner checksum will be offloaded later.
diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index 83e76d046993..229172d509cc 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -142,16 +142,6 @@ static u64 psched_ns_t2l(const struct psched_ratecfg *r,
 	return len;
 }
 
-/*
- * Return length of individual segments of a gso packet,
- * including all headers (MAC, IP, TCP/UDP)
- */
-static unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
-{
-	unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
-	return hdr_len + skb_gso_transport_seglen(skb);
-}
-
/* GSO packet is too big, segment it so that tbf can transmit
 * each segment in time */
-- 
2.14.1
[PATCH v2 3/4] net: is_skb_forwardable: check the size of GSO segments
is_skb_forwardable attempts to detect if a packet is too large to be sent to the destination device. However, this test does not consider GSO skbs, and it is possible that a GSO skb, when segmented, will be larger than the device can transmit.

Create skb_gso_validate_mac_len, and use that to check.

Signed-off-by: Daniel Axtens
---
 include/linux/skbuff.h |  1 +
 net/core/dev.c         |  7 +++---
 net/core/skbuff.c      | 67 +++---
 3 files changed, 57 insertions(+), 18 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4b3ca6a5ec0a..ec9c47b5a1c8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
 unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
 bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu);
+bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len);
 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
 struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
 int skb_ensure_writable(struct sk_buff *skb, int write_len);
diff --git a/net/core/dev.c b/net/core/dev.c
index 94435cd09072..6c96c26aadbf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1841,11 +1841,12 @@ bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
 	if (skb->len <= len)
 		return true;
 
-	/* if TSO is enabled, we don't care about the length as the packet
-	 * could be forwarded without being segmented before
+	/*
+	 * if TSO is enabled, we need to check the size of the
+	 * segmented packets
 	 */
 	if (skb_is_gso(skb))
-		return true;
+		return skb_gso_validate_mac_len(skb, len);
 
 	return false;
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a93e5c7aa5b2..93f66725c32d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4914,37 +4914,74 @@ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
EXPORT_SYMBOL_GPL(skb_gso_transport_seglen); /** - * skb_gso_validate_network_len - Will a split GSO skb fit into a given MTU? + * skb_gso_size_check - check the skb size, considering GSO_BY_FRAGS * - * @skb: GSO skb - * @mtu: MTU to validate against + * There are a couple of instances where we have a GSO skb, and we + * want to determine what size it would be after it is segmented. * - * skb_gso_validate_network_len validates if a given skb will fit a - * wanted MTU once split. It considers L3 headers, L4 headers, and the - * payload. + * We might want to check: + * -L3+L4+payload size (e.g. IP forwarding) + * - L2+L3+L4+payload size (e.g. sanity check before passing to driver) + * + * This is a helper to do that correctly considering GSO_BY_FRAGS. + * + * @seg_len: The segmented length (from skb_gso_*_seglen). In the + * GSO_BY_FRAGS case this will be [header sizes + GSO_BY_FRAGS]. + * + * @max_len: The maximum permissible length. + * + * Returns true if the segmented length <= max length. */ -bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu) -{ +static inline bool skb_gso_size_check(const struct sk_buff *skb, + unsigned int seg_len, + unsigned int max_len) { const struct skb_shared_info *shinfo = skb_shinfo(skb); const struct sk_buff *iter; - unsigned int hlen; - - hlen = skb_gso_network_seglen(skb); if (shinfo->gso_size != GSO_BY_FRAGS) - return hlen <= mtu; + return seg_len <= max_len; /* Undo this so we can re-use header sizes */ - hlen -= GSO_BY_FRAGS; + seg_len -= GSO_BY_FRAGS; skb_walk_frags(skb, iter) { - if (hlen + skb_headlen(iter) > mtu) + if (seg_len + skb_headlen(iter) > max_len) return false; } return true; } -EXPORT_SYMBOL_GPL(skb_gso_validate_mtu); + +/** + * skb_gso_validate_network_len - Does an skb fit a given MTU? + * + * @skb: GSO skb + * @mtu: MTU to validate against + * + * skb_gso_validate_network_len validates if a given skb will fit a + * wanted MTU once split. 
It considers L3 headers, L4 headers, and the + * payload. + */ +bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu) +{ + return skb_gso_size_check(skb, skb_gso_network_seglen(skb), mtu); +} +EXPORT_SYMBOL_GPL(skb_gso_validate_network_len); + +/** + * skb_gso_validate_mac_len - Will a split GSO skb fit in a given length? + * + * @skb: GSO skb + * @len: length to validate against + * + * skb_gso_validate_mac_len validates if a given skb will fit a wanted + * length once split, including L2, L3 and L4 headers and the payload. + */ +bool
[PATCH v2 0/4] Check size of packets before sending
There are a few ways we can send packets that are too large to a network driver. When non-GSO packets are forwarded, we validate their size based on the MTU of the destination device. However, when GSO packets are forwarded, we do not validate their size. We implicitly assume that when they are segmented, the resultant packets will be correctly sized.

This is not always the case. We observed a case where a packet received on an ibmveth device had a GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x device, where it caused a firmware assert. This is described in detail at [0] and was the genesis of this series.

Rather than fixing this in the driver, this series fixes the core path. It does it in 2 steps:

1) make is_skb_forwardable check GSO packets - this catches bridges
2) make validate_xmit_skb check the size of all packets, so as to catch everything else (e.g. macvlan, tc mirred, OVS)

I am a bit nervous about how this series will interact with nested VLANs, as the existing code only allows for one VLAN_HLEN. (Previously these packets would sail past unchecked.) But I thought it would be prudent to get more eyes on this sooner rather than later.

Thanks, Daniel

v1: https://www.spinics.net/lists/netdev/msg478634.html

Changes in v2:
- improve names, thanks Marcelo Ricardo Leitner
- add check to validate_xmit_skb; thanks to everyone who participated in the discussion.
- drop extra check in Open vSwitch. Bad packets will be caught by validate_xmit_skb for now and we can come back and add it later if OVS people would like the extra logging.
[0]: https://patchwork.ozlabs.org/patch/859410/

Cc: Jason Wang
Cc: Pravin Shelar
Cc: Marcelo Ricardo Leitner
Cc: manish.cho...@cavium.com
Cc: d...@openvswitch.org

Daniel Axtens (4):
  net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
  net: move skb_gso_mac_seglen to skbuff.h
  net: is_skb_forwardable: check the size of GSO segments
  net: check the size of a packet in validate_xmit_skb

 include/linux/skbuff.h                  | 18 -
 net/core/dev.c                          | 24 
 net/core/skbuff.c                       | 66 ++---
 net/ipv4/ip_forward.c                   |  2 +-
 net/ipv4/ip_output.c                    |  2 +-
 net/ipv4/netfilter/nf_flow_table_ipv4.c |  2 +-
 net/ipv6/ip6_output.c                   |  2 +-
 net/ipv6/netfilter/nf_flow_table_ipv6.c |  2 +-
 net/mpls/af_mpls.c                      |  2 +-
 net/sched/sch_tbf.c                     | 10 -
 net/xfrm/xfrm_device.c                  |  2 +-
 11 files changed, 93 insertions(+), 39 deletions(-)

-- 
2.14.1
Re: [PATCH 10/10] kill kernel_sock_ioctl()
From: Al Viro
Date: Thu, 25 Jan 2018 00:01:25 +

> On Wed, Jan 24, 2018 at 03:52:44PM -0500, David Miller wrote:
>>
>> Al this series looks fine to me, want me to toss it into net-next?
>
> Do you want them reposted (with updated commit messages), or would
> you prefer a pull request (with or without rebase to current tip
> of net-next)?

A pull request works for me. Rebasing to net-next tip is pilot's discretion.
Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating
On Thu, 2018-01-25 at 11:27 +0800, jianchao.wang wrote:
> Hi Tariq
>
> On 01/22/2018 10:12 AM, jianchao.wang wrote:
> > > > On 19/01/2018 5:49 PM, Eric Dumazet wrote:
> > > > > On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:
> > > > > > Hi Tariq
> > > > > >
> > > > > > Very sad that the crash was reproduced again after applying the patch.
> > > > Memory barriers vary for different archs, can you please share more
> > > > details regarding arch and repro steps?
> > The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89
> > 12/27/2015
> > Xen is installed. The crash occurred in DOM0.
> > Regarding the repro steps, it is a customer's test which does heavy disk
> > I/O over NFS storage without any guest.
>
> What is the final suggestion on this?
> If wmb() is used there, is the performance pulled down?

Since
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=dad42c3038a59d27fced28ee4ec1d4a891b28155
we batch allocations, so mlx4_en_refill_rx_buffers() is not called that often.

I doubt the additional wmb() will have serious impact there.
[PATCH net-next 1/2] net: vrf: Add support for sends to local broadcast address
Sukumar reported that sends to the local broadcast address (255.255.255.255) are broken. Check for the address in the vrf driver and do not redirect to the VRF device - similar to multicast packets.

With this change sockets can use SO_BINDTODEVICE to specify an egress interface and receive responses. Note: the egress interface can not be a VRF device but needs to be the enslaved device.

https://bugzilla.kernel.org/show_bug.cgi?id=198521

Reported-by: Sukumar Gopalakrishnan
Signed-off-by: David Ahern
---
Dave: Really this is a day 1 bug that goes back to the beginning of VRF. IMO, backport to the 4.14 LTS kernel is sufficient; the multicast handling for IPv4 was only complete as of the 4.12 kernel. I directed this at net-next because it is not urgent for the 4.15 merge window.

 drivers/net/vrf.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index feb1b2e15c2e..139c61c8244a 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -673,8 +673,9 @@ static struct sk_buff *vrf_ip_out(struct net_device *vrf_dev,
 					  struct sock *sk,
 					  struct sk_buff *skb)
 {
-	/* don't divert multicast */
-	if (ipv4_is_multicast(ip_hdr(skb)->daddr))
+	/* don't divert multicast or local broadcast */
+	if (ipv4_is_multicast(ip_hdr(skb)->daddr) ||
+	    ipv4_is_lbcast(ip_hdr(skb)->daddr))
 		return skb;
 
 	if (qdisc_tx_is_default(vrf_dev))
-- 
2.11.0
[PATCH net-next 2/2] net/ipv4: Allow send to local broadcast from a socket bound to a VRF
Message sends to the local broadcast address (255.255.255.255) require uc_index or sk_bound_dev_if to be set to an egress device. However, responses are only received if the socket is bound to the device. This is overly constraining for processes running in an L3 domain.

This patch allows a socket bound to the VRF device to send to the local broadcast address by using IP_UNICAST_IF to set the egress interface, with packet receipt handled by the VRF binding.

Similar to IP_MULTICAST_IF, relax the constraint on setting IP_UNICAST_IF if a socket is bound to an L3 master device. In this case allow uc_index to be set to an enslaved device if sk_bound_dev_if is an L3 master device and is the master device for the ifindex.

In udp and raw sendmsg, allow uc_index to override the oif if the uc_index master device is the oif (ie, the oif is an L3 master and the index is an L3 slave).

Signed-off-by: David Ahern
---
 net/ipv4/ip_sockglue.c |  6 +-
 net/ipv4/raw.c         | 15 ++-
 net/ipv4/udp.c         | 15 ++-
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 60fb1eb7d7d8..6cc70fa488cb 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -808,6 +808,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 	{
 		struct net_device *dev = NULL;
 		int ifindex;
+		int midx;
 
 		if (optlen != sizeof(int))
 			goto e_inval;
@@ -823,10 +824,13 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 		err = -EADDRNOTAVAIL;
 		if (!dev)
 			break;
+
+		midx = l3mdev_master_ifindex(dev);
 		dev_put(dev);
 
 		err = -EINVAL;
-		if (sk->sk_bound_dev_if)
+		if (sk->sk_bound_dev_if &&
+		    (!midx || midx != sk->sk_bound_dev_if))
 			break;
 
 		inet->uc_index = ifindex;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 136544b36a46..7c509697ebc7 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -617,8 +617,21 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			ipc.oif = inet->mc_index;
 		if (!saddr)
 			saddr = inet->mc_addr;
-	} else if (!ipc.oif)
+	} else if (!ipc.oif) {
 		ipc.oif = inet->uc_index;
+	} else if (ipv4_is_lbcast(daddr) && inet->uc_index) {
+		/* oif is set, packet is to local broadcast and
+		 * and uc_index is set. oif is most likely set
+		 * by sk_bound_dev_if. If uc_index != oif check if the
+		 * oif is an L3 master and uc_index is an L3 slave.
+		 * If so, we want to allow the send using the uc_index.
+		 */
+		if (ipc.oif != inet->uc_index &&
+		    ipc.oif == l3mdev_master_ifindex_by_index(sock_net(sk),
+							      inet->uc_index)) {
+			ipc.oif = inet->uc_index;
+		}
+	}
 
 	flowi4_init_output(&fl4, ipc.oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 853321555a4e..3f018f34cf56 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -977,8 +977,21 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		if (!saddr)
 			saddr = inet->mc_addr;
 		connected = 0;
-	} else if (!ipc.oif)
+	} else if (!ipc.oif) {
 		ipc.oif = inet->uc_index;
+	} else if (ipv4_is_lbcast(daddr) && inet->uc_index) {
+		/* oif is set, packet is to local broadcast and
+		 * and uc_index is set. oif is most likely set
+		 * by sk_bound_dev_if. If uc_index != oif check if the
+		 * oif is an L3 master and uc_index is an L3 slave.
+		 * If so, we want to allow the send using the uc_index.
+		 */
+		if (ipc.oif != inet->uc_index &&
+		    ipc.oif == l3mdev_master_ifindex_by_index(sock_net(sk),
+							      inet->uc_index)) {
+			ipc.oif = inet->uc_index;
+		}
+	}
 
 	if (connected)
 		rt = (struct rtable *)sk_dst_check(sk, 0);
-- 
2.11.0
[PATCH net-next 0/2] net: vrf: Fix send to local broadcast address
Patch set to fix packet sends to the 255.255.255.255 address from a VRF. The first patch tells the vrf driver to ignore those packets. The second patch updates the uapi to allow sends from sockets bound to an L3 master device. David Ahern (2): net: vrf: Add support for sends to local broadcast address net/ipv4: Allow send to local broadcast from a socket bound to a VRF drivers/net/vrf.c | 5 +++-- net/ipv4/ip_sockglue.c | 6 +- net/ipv4/raw.c | 15 ++- net/ipv4/udp.c | 15 ++- 4 files changed, 36 insertions(+), 5 deletions(-) -- 2.11.0
Re: [PATCH v6 16/36] nds32: DMA mapping API
Hi, Arnd: 2018-01-24 19:36 GMT+08:00 Arnd Bergmann: > On Tue, Jan 23, 2018 at 12:52 PM, Greentime Hu wrote: >> Hi, Arnd: >> >> 2018-01-23 16:23 GMT+08:00 Greentime Hu : >>> Hi, Arnd: >>> >>> 2018-01-18 18:26 GMT+08:00 Arnd Bergmann : On Mon, Jan 15, 2018 at 6:53 AM, Greentime Hu wrote: > From: Greentime Hu > > This patch adds support for the DMA mapping API. It uses dma_map_ops for > flexibility. > > Signed-off-by: Vincent Chen > Signed-off-by: Greentime Hu I'm still unhappy about the way the cache flushes are done here as discussed before. It's not a show-stopper, but no Ack from me. >>> >>> How about this implementation? > >> I am not sure if I understand it correctly. >> I list all the combinations. >> >> RAM to DEVICE >> before DMA => writeback cache >> after DMA => nop >> >> DEVICE to RAM >> before DMA => nop >> after DMA => invalidate cache >> >> static void consistent_sync(void *vaddr, size_t size, int direction, int >> master) >> { >> unsigned long start = (unsigned long)vaddr; >> unsigned long end = start + size; >> >> if (master == FOR_CPU) { >> switch (direction) { >> case DMA_TO_DEVICE: >> break; >> case DMA_FROM_DEVICE: >> case DMA_BIDIRECTIONAL: >> cpu_dma_inval_range(start, end); >> break; >> default: >> BUG(); >> } >> } else { >> /* FOR_DEVICE */ >> switch (direction) { >> case DMA_FROM_DEVICE: >> break; >> case DMA_TO_DEVICE: >> case DMA_BIDIRECTIONAL: >> cpu_dma_wb_range(start, end); >> break; >> default: >> BUG(); >> } >> } >> } > > That looks reasonable enough, but it does depend on a number of factors, > and the dma-mapping.h implementation is not just about cache flushes. > > As I don't know the microarchitecture, can you answer these questions: > > - are caches always write-back, or could they be write-through? Yes, we can config it to write-back or write-through. > - can the cache be shared with another CPU or a device? No, we don't support it. > - if the cache is shared, is it always coherent, never coherent, or > either of them?
We don't support SMP and the device will access memory through bus. I think the cache is not shared. > - could the same memory be visible at different physical addresses > and have conflicting caches? We currently don't have such kind of SoC memory map. > - is the CPU physical address always the same as the address visible to the > device? Yes, it is always the same unless the CPU uses local memory. The physical address of local memory will overlap the original bus address. I think the local memory case can be ignored because we don't use it for now. > - are there devices that can only see a subset of the physical memory? All devices are able to see the whole physical memory in our current SoC, but I think other SoC may support such kind of HW behavior. > - can there be an IOMMU? No. > - are there write-buffers in the CPU that might need to get flushed before > flushing the cache? Yes, there are write-buffers in front of CPU caches but it should be transparent to SW. We don't need to flush it. > - could cache lines be loaded speculatively or with read-ahead while > a buffer is owned by a device? No.
[PATCH v3 net-next] net/ipv6: Do not allow route add with a device that is down
IPv6 allows routes to be installed when the device is not up (admin up). Worse, it does not mark it as LINKDOWN. IPv4 does not allow it and really there is no reason for IPv6 to allow it, so check the flags and deny if device is admin down. Signed-off-by: David Ahern --- v3 - moved err=-ENETDOWN under the if check per Eric's request - left the up check using dev->flags for consistency with IPv4 and that it is used more often in ipv4 and ipv6 code than netif_running v2 - missed setting err to -ENETDOWN (thanks for catching that Roopa) net/ipv6/route.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index f85da2f1e729..aa4411c81e7e 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2734,6 +2734,12 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg, if (!dev) goto out; + if (!(dev->flags & IFF_UP)) { + NL_SET_ERR_MSG(extack, "Nexthop device is not up"); + err = -ENETDOWN; + goto out; + } + if (!ipv6_addr_any(&cfg->fc_prefsrc)) { if (!ipv6_chk_addr(net, &cfg->fc_prefsrc, dev, 0)) { NL_SET_ERR_MSG(extack, "Invalid source address"); -- 2.11.0
Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating
Hi Tariq On 01/22/2018 10:12 AM, jianchao.wang wrote: >>> On 19/01/2018 5:49 PM, Eric Dumazet wrote: On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote: > Hi Tariq > > Very sad that the crash was reproduced again after applying the patch. >> Memory barriers vary for different Archs, can you please share more details >> regarding arch and repro steps? > The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 > 12/27/2015 > The xen is installed. The crash occurred in DOM0. > Regarding the repro steps, it is a customer's test which does heavy disk > I/O over NFS storage without any guest. > What is the final suggestion on this ? If we use wmb there, is the performance pulled down ? Thanks in advance Jianchao
[PATCH bpf-next v9 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB
Adds support for calling a sock_ops BPF program when there is a TCP state change. Two arguments are used; one for the old state and another for the new state. There is a new enum in include/uapi/linux/bpf.h that exports the TCP states, prepending BPF_ to the current TCP state names. If it is ever necessary to change the internal TCP state values (other than adding more to the end), then it will become necessary to convert from the internal TCP state value to the BPF value before calling the BPF sock_ops function. There is a set of compile checks added in tcp.c to detect if the internal and BPF values differ so we can make the necessary fixes. New op: BPF_SOCK_OPS_STATE_CB. Signed-off-by: Lawrence Brakmo --- include/uapi/linux/bpf.h | 26 ++ include/uapi/linux/tcp.h | 3 ++- net/ipv4/tcp.c | 24 3 files changed, 52 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 59fa771..ff7758d 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1047,6 +1047,32 @@ enum { * Arg3: return value of * tcp_transmit_skb (0 => success) */ + BPF_SOCK_OPS_STATE_CB, /* Called when TCP changes state. +* Arg1: old_state +* Arg2: new_state +*/ +}; + +/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect + * changes between the TCP and BPF versions. Ideally this should never happen. + * If it does, we need to add code to convert them before calling + * the BPF sock_ops function. + */ +enum { + BPF_TCP_ESTABLISHED = 1, + BPF_TCP_SYN_SENT, + BPF_TCP_SYN_RECV, + BPF_TCP_FIN_WAIT1, + BPF_TCP_FIN_WAIT2, + BPF_TCP_TIME_WAIT, + BPF_TCP_CLOSE, + BPF_TCP_CLOSE_WAIT, + BPF_TCP_LAST_ACK, + BPF_TCP_LISTEN, + BPF_TCP_CLOSING,/* Now a valid state */ + BPF_TCP_NEW_SYN_RECV, + + BPF_TCP_MAX_STATES /* Leave at the end! 
*/ }; #define TCP_BPF_IW 1001/* Set TCP initial congestion window */ diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index ec03a2b..cf0b861 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -271,7 +271,8 @@ struct tcp_diag_md5sig { /* Definitions for bpf_sock_ops_cb_flags */ #define BPF_SOCK_OPS_RTO_CB_FLAG (1<<0) #define BPF_SOCK_OPS_RETRANS_CB_FLAG (1<<1) -#define BPF_SOCK_OPS_ALL_CB_FLAGS 0x3/* Mask of all currently +#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2) +#define BPF_SOCK_OPS_ALL_CB_FLAGS 0x7/* Mask of all currently * supported cb flags */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 88b6244..f013ddc 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2042,6 +2042,30 @@ void tcp_set_state(struct sock *sk, int state) { int oldstate = sk->sk_state; + /* We defined a new enum for TCP states that are exported in BPF +* so as not force the internal TCP states to be frozen. The +* following checks will detect if an internal state value ever +* differs from the BPF value. If this ever happens, then we will +* need to remap the internal value to the BPF value before calling +* tcp_call_bpf_2arg. 
+*/ + BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED); + BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT); + BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV); + BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1); + BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2); + BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT); + BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE); + BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT); + BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK); + BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN); + BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING); + BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV); + BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES); + + if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG)) + tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state); + switch (state) { case TCP_ESTABLISHED: if (oldstate != TCP_ESTABLISHED) -- 2.9.5
[PATCH bpf-next v9 01/12] bpf: Only reply field should be writeable
Currently, a sock_ops BPF program can write the op field and all the reply fields (reply and replylong). This is a bug. The op field should not have been writeable and there is currently no way to use the replylong field for indices >= 1. This patch enforces that only the reply field (which equals replylong[0]) is writeable. Fixes: 40304b2a1567 ("bpf: BPF support for sock_ops") Signed-off-by: Lawrence Brakmo Acked-by: Yuchung Cheng --- net/core/filter.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 18da42a..bf9bb75 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3845,8 +3845,7 @@ static bool sock_ops_is_valid_access(int off, int size, { if (type == BPF_WRITE) { switch (off) { - case offsetof(struct bpf_sock_ops, op) ... -offsetof(struct bpf_sock_ops, replylong[3]): + case offsetof(struct bpf_sock_ops, reply): break; default: return false; -- 2.9.5
[PATCH bpf-next v9 07/12] bpf: Add sock_ops RTO callback
Adds an optional call to sock_ops BPF program based on whether the BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags. The BPF program is passed 2 arguments: icsk_retransmits and whether the RTO has expired. Signed-off-by: Lawrence Brakmo--- include/uapi/linux/bpf.h | 5 + include/uapi/linux/tcp.h | 3 ++- net/ipv4/tcp_timer.c | 7 +++ 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 7573f5b..2a8c40a 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1014,6 +1014,11 @@ enum { * a congestion threshold. RTTs above * this indicate congestion */ + BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered. +* Arg1: value of icsk_retransmits +* Arg2: value of icsk_rto +* Arg3: whether RTO has expired +*/ }; #define TCP_BPF_IW 1001/* Set TCP initial congestion window */ diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index d1df2f6..129032ca 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -269,7 +269,8 @@ struct tcp_diag_md5sig { }; /* Definitions for bpf_sock_ops_cb_flags */ -#define BPF_SOCK_OPS_ALL_CB_FLAGS 0 /* Mask of all currently +#define BPF_SOCK_OPS_RTO_CB_FLAG (1<<0) +#define BPF_SOCK_OPS_ALL_CB_FLAGS 0x1/* Mask of all currently * supported cb flags */ diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 6db3124..257abdd 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -213,11 +213,18 @@ static int tcp_write_timeout(struct sock *sk) icsk->icsk_user_timeout); } tcp_fastopen_active_detect_blackhole(sk, expired); + + if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG)) + tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB, + icsk->icsk_retransmits, + icsk->icsk_rto, (int)expired); + if (expired) { /* Has it gone just too far? */ tcp_write_err(sk); return 1; } + return 0; } -- 2.9.5
[PATCH bpf-next v9 06/12] bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock
Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary use is to determine if there should be calls to sock_ops bpf program at various points in the TCP code. The field is initialized to zero, disabling the calls. A sock_ops BPF program can set it, per connection and as necessary, when the connection is established. It also adds support for reading and writing the field within a sock_ops BPF program. Reading is done by accessing the field directly. However, writing is done through the helper function bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program is trying to set a callback that is not supported in the current kernel (i.e. running an older kernel). The helper function returns 0 if it was able to set all of the bits set in the argument, a positive number containing the bits that could not be set, or -EINVAL if the socket is not a full TCP socket. Examples of where one could call the bpf program: 1) When RTO fires 2) When a packet is retransmitted 3) When the connection terminates 4) When a packet is sent 5) When a packet is received Signed-off-by: Lawrence Brakmo Acked-by: Alexei Starovoitov --- include/linux/tcp.h | 11 +++ include/uapi/linux/bpf.h | 12 +++- include/uapi/linux/tcp.h | 5 + net/core/filter.c| 34 ++ 4 files changed, 61 insertions(+), 1 deletion(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 4f93f095..8f4c549 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -335,6 +335,17 @@ struct tcp_sock { int linger2; + +/* Sock_ops bpf program related variables */ +#ifdef CONFIG_BPF + u8 bpf_sock_ops_cb_flags; /* Control calling BPF programs +* values defined in uapi/linux/tcp.h +*/ +#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG) +#else +#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0 +#endif + /* Receiver side RTT estimation */ struct { u32 rtt_us; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 8d5874c..7573f5b 100644 ---
a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -642,6 +642,14 @@ union bpf_attr { * @optlen: length of optval in bytes * Return: 0 or negative error * + * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags) + * Set callback flags for sock_ops + * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct + * @flags: flags value + * Return: 0 for no error + * -EINVAL if there is no full tcp socket + * bits in flags that are not supported by current kernel + * * int bpf_skb_adjust_room(skb, len_diff, mode, flags) * Grow or shrink room in sk_buff. * @skb: pointer to skb @@ -748,7 +756,8 @@ union bpf_attr { FN(perf_event_read_value), \ FN(perf_prog_read_value), \ FN(getsockopt), \ - FN(override_return), + FN(override_return),\ + FN(sock_ops_cb_flags_set), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -969,6 +978,7 @@ struct bpf_sock_ops { */ __u32 snd_cwnd; __u32 srtt_us; /* Averaged RTT << 3 in usecs */ + __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */ }; /* List of known BPF sock_ops operators. 
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index b4a4f64..d1df2f6 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -268,4 +268,9 @@ struct tcp_diag_md5sig { __u8tcpm_key[TCP_MD5SIG_MAXKEYLEN]; }; +/* Definitions for bpf_sock_ops_cb_flags */ +#define BPF_SOCK_OPS_ALL_CB_FLAGS 0 /* Mask of all currently +* supported cb flags +*/ + #endif /* _UAPI_LINUX_TCP_H */ diff --git a/net/core/filter.c b/net/core/filter.c index c356ec0..6936d19 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3328,6 +3328,33 @@ static const struct bpf_func_proto bpf_getsockopt_proto = { .arg5_type = ARG_CONST_SIZE, }; +BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock, + int, argval) +{ + struct sock *sk = bpf_sock->sk; + int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS; + + if (!sk_fullsock(sk)) + return -EINVAL; + +#ifdef CONFIG_INET + if (val) + tcp_sk(sk)->bpf_sock_ops_cb_flags = val; + + return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS); +#else + return -EINVAL; +#endif +} + +static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = { + .func = bpf_sock_ops_cb_flags_set, + .gpl_only =
[PATCH bpf-next v9 08/12] bpf: Add support for reading sk_state and more
Add support for reading many more tcp_sock fields state,same as sk->sk_state rtt_min same as sk->rtt_min.s[0].v (current rtt_min) snd_ssthresh rcv_nxt snd_nxt snd_una mss_cache ecn_flags rate_delivered rate_interval_us packets_out retrans_out total_retrans segs_in data_segs_in segs_out data_segs_out lost_out sacked_out sk_txhash bytes_received (__u64) bytes_acked(__u64) Signed-off-by: Lawrence Brakmo--- include/uapi/linux/bpf.h | 22 net/core/filter.c| 143 +++ 2 files changed, 154 insertions(+), 11 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 2a8c40a..5f08420 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -979,6 +979,28 @@ struct bpf_sock_ops { __u32 snd_cwnd; __u32 srtt_us; /* Averaged RTT << 3 in usecs */ __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */ + __u32 state; + __u32 rtt_min; + __u32 snd_ssthresh; + __u32 rcv_nxt; + __u32 snd_nxt; + __u32 snd_una; + __u32 mss_cache; + __u32 ecn_flags; + __u32 rate_delivered; + __u32 rate_interval_us; + __u32 packets_out; + __u32 retrans_out; + __u32 total_retrans; + __u32 segs_in; + __u32 data_segs_in; + __u32 segs_out; + __u32 data_segs_out; + __u32 lost_out; + __u32 sacked_out; + __u32 sk_txhash; + __u64 bytes_received; + __u64 bytes_acked; }; /* List of known BPF sock_ops operators. diff --git a/net/core/filter.c b/net/core/filter.c index 6936d19..a858ebc 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act) } EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action); -static bool __is_valid_sock_ops_access(int off, int size) +static bool sock_ops_is_valid_access(int off, int size, +enum bpf_access_type type, +struct bpf_insn_access_aux *info) { + const int size_default = sizeof(__u32); + if (off < 0 || off >= sizeof(struct bpf_sock_ops)) return false; + /* The verifier guarantees that size > 0. 
*/ if (off % size != 0) return false; - if (size != sizeof(__u32)) - return false; - - return true; -} -static bool sock_ops_is_valid_access(int off, int size, -enum bpf_access_type type, -struct bpf_insn_access_aux *info) -{ if (type == BPF_WRITE) { switch (off) { case offsetof(struct bpf_sock_ops, reply): + if (size != size_default) + return false; break; default: return false; } + } else { + switch (off) { + case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received, + bytes_acked): + if (size != sizeof(__u64)) + return false; + break; + default: + if (size != size_default) + return false; + break; + } } - return __is_valid_sock_ops_access(off, size); + return true; } static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write, @@ -4498,6 +4508,32 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, is_fullsock)); break; + case offsetof(struct bpf_sock_ops, state): + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct bpf_sock_ops_kern, sk), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, sk)); + *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg, + offsetof(struct sock_common, skc_state)); + break; + + case offsetof(struct bpf_sock_ops, rtt_min): + BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) != +sizeof(struct minmax)); + BUILD_BUG_ON(sizeof(struct minmax) < +sizeof(struct minmax_sample)); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct bpf_sock_ops_kern, sk), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern,
[PATCH bpf-next v9 00/12] bpf: More sock_ops callbacks
This patchset adds support for: - direct R or R/W access to many tcp_sock fields - passing up to 4 arguments to sock_ops BPF functions - tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks - optionally calling sock_ops BPF program when RTO fires - optionally calling sock_ops BPF program when packet is retransmitted - optionally calling sock_ops BPF program when TCP state changes - access to tclass and sk_txhash - new selftest v2: Fixed commit message 0/11. The commit is to "bpf-next" but the patch below used "bpf" and Patchwork didn't work correctly. v3: Cleaned RTO callback as per Yuchung's comment Added BPF enum for TCP states as per Alexei's comment v4: Fixed compile warnings related to detecting changes between TCP internal states and the BPF defined states. v5: Fixed comment issues in some selftest files Fixed access issue with u64 fields in bpf_sock_ops struct v6: Made fixes based on comments from Eric Dumazet: The field bpf_sock_ops_cb_flags was added in a hole on 64bit kernels Field bpf_sock_ops_cb_flags is now set through a helper function which returns an error when a BPF program tries to set bits for callbacks that are not supported in the current kernel. Added a comment indicating that when adding fields to bpf_sock_ops_kern they should be added before the field named "temp" if they need to be cleared before calling the BPF function. v7: Enforced that fields "op" and "replylong[1] .. replylong[3]" are not writable based on comments from Eric Dumazet and Alexei Starovoitov. Filled 32 bit hole in bpf_sock_ops struct with sk_txhash based on comments from Daniel Borkmann. Removed unused functions (tcp_call_bpf_1arg, tcp_call_bpf_4arg) based on comments from Daniel Borkmann. v8: Add commit message 00/12 Add Acked-by as appropriate v9: Moved the bug fix to the front of the patchset Changed RETRANS_CB so it is always called (before it was only called if the retransmit succeeded). 
It is now called with an extra argument, the return value of tcp_transmit_skb (0 => success). Based on comments from Yuchung Cheng. Added support for reading 2 new fields, sacked_out and lost_out, based on comments from Yuchung Cheng. Consists of the following patches: [PATCH bpf-next v9 01/12] bpf: Only reply field should be writeable [PATCH bpf-next v9 02/12] bpf: Make SOCK_OPS_GET_TCP size independent [PATCH bpf-next v9 03/12] bpf: Make SOCK_OPS_GET_TCP struct [PATCH bpf-next v9 04/12] bpf: Add write access to tcp_sock and sock [PATCH bpf-next v9 05/12] bpf: Support passing args to sock_ops bpf [PATCH bpf-next v9 06/12] bpf: Adds field bpf_sock_ops_cb_flags to [PATCH bpf-next v9 07/12] bpf: Add sock_ops RTO callback [PATCH bpf-next v9 08/12] bpf: Add support for reading sk_state and [PATCH bpf-next v9 09/12] bpf: Add sock_ops R/W access to tclass [PATCH bpf-next v9 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB [PATCH bpf-next v9 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB [PATCH bpf-next v9 12/12] bpf: add selftest for tcpbpf include/linux/filter.h | 10 ++ include/linux/tcp.h| 11 ++ include/net/tcp.h | 42 - include/uapi/linux/bpf.h | 76 +++- include/uapi/linux/tcp.h | 8 + net/core/filter.c | 290 --- net/ipv4/tcp.c | 26 ++- net/ipv4/tcp_nv.c | 2 +- net/ipv4/tcp_output.c | 6 +- net/ipv4/tcp_timer.c | 7 + tools/include/uapi/linux/bpf.h | 78 - tools/testing/selftests/bpf/Makefile | 4 +- tools/testing/selftests/bpf/bpf_helpers.h | 2 + tools/testing/selftests/bpf/tcp_client.py | 52 ++ tools/testing/selftests/bpf/tcp_server.py | 79 + tools/testing/selftests/bpf/test_tcpbpf.h | 16 ++ tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 ++ tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 ++ 18 files changed, 927 insertions(+), 39 deletions(-)
[PATCH bpf-next v9 04/12] bpf: Add write access to tcp_sock and sock fields
This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to struct tcp_sock or struct sock fields. This required adding a new field "temp" to struct bpf_sock_ops_kern for temporary storage that is used by sock_ops_convert_ctx_access. It is used to store and recover the contents of a register, so the register can be used to store the address of the sk. Since we cannot overwrite the dst_reg because it contains the pointer to ctx, nor the src_reg since it contains the value we want to store, we need an extra register to contain the address of the sk. Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the GET or SET macros depending on the value of the TYPE field. Signed-off-by: Lawrence BrakmoAcked-by: Alexei Starovoitov --- include/linux/filter.h | 9 + include/net/tcp.h | 2 +- net/core/filter.c | 48 3 files changed, 58 insertions(+), 1 deletion(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 425056c..daa5a67 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1007,6 +1007,15 @@ struct bpf_sock_ops_kern { u32 replylong[4]; }; u32 is_fullsock; + u64 temp; /* temp and everything after is not +* initialized to 0 before calling +* the BPF program. New fields that +* should be initialized to 0 should +* be inserted before temp. +* temp is scratch storage used by +* sock_ops_convert_ctx_access +* as temporary storage of a register. 
+*/ }; #endif /* __LINUX_FILTER_H__ */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 5a1d26a..6092eaf 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2011,7 +2011,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op) struct bpf_sock_ops_kern sock_ops; int ret; - memset(_ops, 0, sizeof(sock_ops)); + memset(_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); if (sk_fullsock(sk)) { sock_ops.is_fullsock = 1; sock_owned_by_me(sk); diff --git a/net/core/filter.c b/net/core/filter.c index dbb6d2f..c356ec0 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4491,6 +4491,54 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, offsetof(OBJ, OBJ_FIELD)); \ } while (0) +/* Helper macro for adding write access to tcp_sock or sock fields. + * The macro is called with two registers, dst_reg which contains a pointer + * to ctx (context) and src_reg which contains the value that should be + * stored. However, we need an additional register since we cannot overwrite + * dst_reg because it may be used later in the program. + * Instead we "borrow" one of the other register. We first save its value + * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore + * it at the end of the macro. + */ +#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\ + do { \ + int reg = BPF_REG_9; \ + BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) > \ +FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD)); \ + if (si->dst_reg == reg || si->src_reg == reg) \ + reg--;\ + if (si->dst_reg == reg || si->src_reg == reg) \ + reg--;\ + *insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg, \ + offsetof(struct bpf_sock_ops_kern, \ + temp));\ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ + struct bpf_sock_ops_kern, \ + is_fullsock), \ + reg, si->dst_reg, \ + offsetof(struct bpf_sock_ops_kern, \ + is_fullsock)); \ + *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);\ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ + struct bpf_sock_ops_kern, sk),\ +
[PATCH bpf-next v9 03/12] bpf: Make SOCK_OPS_GET_TCP struct independent
Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2 arguments so now it can also work with struct sock fields. The first argument is the name of the field in the bpf_sock_ops struct, the 2nd argument is the name of the field in the OBJ struct. Previous: SOCK_OPS_GET_TCP(FIELD_NAME) New: SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ) Where OBJ is either "struct tcp_sock" or "struct sock" (without quotation). BPF_FIELD is the name of the field in the bpf_sock_ops struct and OBJ_FIELD is the name of the field in the OBJ struct. Although the field names are currently the same, the kernel struct names could change in the future and this change makes it easier to support that. Note that adding access to tcp_sock fields in sock_ops programs does not preclude the tcp_sock fields from being removed as long as we are willing to do one of the following: 1) Return a fixed value (e.x. 0 or 0x), or 2) Make the verifier fail if that field is accessed (i.e. program fails to load) so the user will know that field is no longer supported. Signed-off-by: Lawrence Brakmo--- net/core/filter.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 62e7874..dbb6d2f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4469,11 +4469,11 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, is_fullsock)); break; -/* Helper macro for adding read access to tcp_sock fields. */ -#define SOCK_OPS_GET_TCP(FIELD_NAME) \ +/* Helper macro for adding read access to tcp_sock or sock fields. 
*/ +#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\ do { \ - BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) > \ -FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME)); \ + BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) > \ +FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD)); \ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ struct bpf_sock_ops_kern, \ is_fullsock), \ @@ -4485,18 +4485,18 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, struct bpf_sock_ops_kern, sk),\ si->dst_reg, si->src_reg, \ offsetof(struct bpf_sock_ops_kern, sk));\ - *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock, \ - FIELD_NAME), si->dst_reg, \ - si->dst_reg,\ - offsetof(struct tcp_sock, FIELD_NAME)); \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ, \ + OBJ_FIELD),\ + si->dst_reg, si->dst_reg, \ + offsetof(OBJ, OBJ_FIELD)); \ } while (0) case offsetof(struct bpf_sock_ops, snd_cwnd): - SOCK_OPS_GET_TCP(snd_cwnd); + SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock); break; case offsetof(struct bpf_sock_ops, srtt_us): - SOCK_OPS_GET_TCP(srtt_us); + SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock); break; } return insn - insn_buf; -- 2.9.5
[PATCH bpf-next v9 05/12] bpf: Support passing args to sock_ops bpf function
Adds support for passing up to 4 arguments to sock_ops bpf functions. It reuses the reply union, so the bpf_sock_ops structures are not increased in size. Signed-off-by: Lawrence Brakmo --- include/linux/filter.h | 1 + include/net/tcp.h| 40 +++- include/uapi/linux/bpf.h | 5 +++-- net/ipv4/tcp.c | 2 +- net/ipv4/tcp_nv.c| 2 +- net/ipv4/tcp_output.c| 2 +- 6 files changed, 42 insertions(+), 10 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index daa5a67..20384c4 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1003,6 +1003,7 @@ struct bpf_sock_ops_kern { struct sock *sk; u32 op; union { + u32 args[4]; u32 reply; u32 replylong[4]; }; diff --git a/include/net/tcp.h b/include/net/tcp.h index 6092eaf..093e967 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2006,7 +2006,7 @@ void tcp_cleanup_ulp(struct sock *sk); * program loaded). */ #ifdef CONFIG_BPF -static inline int tcp_call_bpf(struct sock *sk, int op) +static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args) { struct bpf_sock_ops_kern sock_ops; int ret; @@ -2019,6 +2019,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op) sock_ops.sk = sk; sock_ops.op = op; + if (nargs > 0) + memcpy(sock_ops.args, args, nargs * sizeof(*args)); ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops); if (ret == 0) @@ -2027,18 +2029,46 @@ static inline int tcp_call_bpf(struct sock *sk, int op) ret = -1; return ret; } + +static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 arg2) +{ + u32 args[2] = {arg1, arg2}; + + return tcp_call_bpf(sk, op, 2, args); +} + +static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2, + u32 arg3) +{ + u32 args[3] = {arg1, arg2, arg3}; + + return tcp_call_bpf(sk, op, 3, args); +} + #else -static inline int tcp_call_bpf(struct sock *sk, int op) +static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args) { return -EPERM; } + +static inline int tcp_call_bpf_2arg(struct sock 
*sk, int op, u32 arg1, u32 arg2) +{ + return -EPERM; +} + +static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2, + u32 arg3) +{ + return -EPERM; +} + #endif static inline u32 tcp_timeout_init(struct sock *sk) { int timeout; - timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT); + timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL); if (timeout <= 0) timeout = TCP_TIMEOUT_INIT; @@ -2049,7 +2079,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk) { int rwnd; - rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT); + rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL); if (rwnd < 0) rwnd = 0; @@ -2058,7 +2088,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk) static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk) { - return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1); + return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1); } #if IS_ENABLED(CONFIG_SMC) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 406c19d..8d5874c 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -952,8 +952,9 @@ struct bpf_map_info { struct bpf_sock_ops { __u32 op; union { - __u32 reply; - __u32 replylong[4]; + __u32 args[4]; /* Optionally passed to bpf program */ + __u32 reply;/* Returned by bpf program */ + __u32 replylong[4]; /* Optionally returned by bpf prog */ }; __u32 family; __u32 remote_ip4; /* Stored in network byte order */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index d7cf861..88b6244 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -463,7 +463,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op) tcp_mtup_init(sk); icsk->icsk_af_ops->rebuild_header(sk); tcp_init_metrics(sk); - tcp_call_bpf(sk, bpf_op); + tcp_call_bpf(sk, bpf_op, 0, NULL); tcp_init_congestion_control(sk); tcp_init_buffer_space(sk); } diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c index 0b5a05b..ddbce73 100644 --- a/net/ipv4/tcp_nv.c +++ b/net/ipv4/tcp_nv.c @@ -146,7 +146,7 @@ static void 
tcpnv_init(struct sock *sk) * within a datacenter, where we have reasonable estimates of * RTTs */ - base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT); + base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL); if (base_rtt >
[PATCH bpf-next v9 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB
Adds support for calling a sock_ops BPF program when there is a retransmission. Three arguments are used: one for the sequence number, another for the number of segments retransmitted, and the last one for the return value of tcp_transmit_skb (0 => success). Does not include syn-ack retransmissions. New op: BPF_SOCK_OPS_RETRANS_CB. Signed-off-by: Lawrence Brakmo --- include/uapi/linux/bpf.h | 6 ++ include/uapi/linux/tcp.h | 3 ++- net/ipv4/tcp_output.c | 4 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 5f08420..59fa771 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1041,6 +1041,12 @@ enum { * Arg2: value of icsk_rto * Arg3: whether RTO has expired */ + BPF_SOCK_OPS_RETRANS_CB, /* Called when skb is retransmitted. + * Arg1: sequence number of 1st byte + * Arg2: # segments + * Arg3: return value of + * tcp_transmit_skb (0 => success) + */ }; #define TCP_BPF_IW 1001 /* Set TCP initial congestion window */ diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index 129032ca..ec03a2b 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -270,7 +270,8 @@ struct tcp_diag_md5sig { /* Definitions for bpf_sock_ops_cb_flags */ #define BPF_SOCK_OPS_RTO_CB_FLAG (1<<0) -#define BPF_SOCK_OPS_ALL_CB_FLAGS 0x1 /* Mask of all currently +#define BPF_SOCK_OPS_RETRANS_CB_FLAG (1<<1) +#define BPF_SOCK_OPS_ALL_CB_FLAGS 0x3 /* Mask of all currently * supported cb flags */ diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index d12f7f7..e9f985e 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2905,6 +2905,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC); } + if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG)) + tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RETRANS_CB, + TCP_SKB_CB(skb)->seq, segs, err); + if (likely(!err)) { TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb); -- 2.9.5
[PATCH bpf-next v9 12/12] bpf: add selftest for tcpbpf
Added a selftest for tcpbpf (sock_ops) that checks that the appropriate callbacks occurred, that it can access tcp_sock fields, and that their values are correct. Run with command: ./test_tcpbpf_user Signed-off-by: Lawrence Brakmo Acked-by: Alexei Starovoitov --- tools/include/uapi/linux/bpf.h | 78 ++- tools/testing/selftests/bpf/Makefile | 4 +- tools/testing/selftests/bpf/bpf_helpers.h | 2 + tools/testing/selftests/bpf/tcp_client.py | 52 ++ tools/testing/selftests/bpf/tcp_server.py | 79 +++ tools/testing/selftests/bpf/test_tcpbpf.h | 16 +++ tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 + tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 8 files changed, 482 insertions(+), 6 deletions(-) create mode 100755 tools/testing/selftests/bpf/tcp_client.py create mode 100755 tools/testing/selftests/bpf/tcp_server.py create mode 100644 tools/testing/selftests/bpf/test_tcpbpf.h create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_kern.c create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_user.c diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index af1f49a..ff7758d 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -17,7 +17,7 @@ #define BPF_ALU64 0x07 /* alu mode in double word width */ /* ld/ldx fields */ -#define BPF_DW 0x18 /* double word */ +#define BPF_DW 0x18 /* double word (64-bit) */ #define BPF_XADD 0xc0 /* exclusive add */ /* alu/jmp fields */ @@ -642,6 +642,14 @@ union bpf_attr { * @optlen: length of optval in bytes * Return: 0 or negative error * + * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags) + * Set callback flags for sock_ops + * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct + * @flags: flags value + * Return: 0 for no error + * -EINVAL if there is no full tcp socket + * bits in flags that are not supported by current kernel + * * int bpf_skb_adjust_room(skb, len_diff, mode, flags) * Grow or shrink room in sk_buff.
* @skb: pointer to skb @@ -748,7 +756,8 @@ union bpf_attr { FN(perf_event_read_value), \ FN(perf_prog_read_value), \ FN(getsockopt), \ - FN(override_return), + FN(override_return),\ + FN(sock_ops_cb_flags_set), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -952,8 +961,9 @@ struct bpf_map_info { struct bpf_sock_ops { __u32 op; union { - __u32 reply; - __u32 replylong[4]; + __u32 args[4]; /* Optionally passed to bpf program */ + __u32 reply;/* Returned by bpf program */ + __u32 replylong[4]; /* Optionally returned by bpf prog */ }; __u32 family; __u32 remote_ip4; /* Stored in network byte order */ @@ -968,6 +978,29 @@ struct bpf_sock_ops { */ __u32 snd_cwnd; __u32 srtt_us; /* Averaged RTT << 3 in usecs */ + __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */ + __u32 state; + __u32 rtt_min; + __u32 snd_ssthresh; + __u32 rcv_nxt; + __u32 snd_nxt; + __u32 snd_una; + __u32 mss_cache; + __u32 ecn_flags; + __u32 rate_delivered; + __u32 rate_interval_us; + __u32 packets_out; + __u32 retrans_out; + __u32 total_retrans; + __u32 segs_in; + __u32 data_segs_in; + __u32 segs_out; + __u32 data_segs_out; + __u32 lost_out; + __u32 sacked_out; + __u32 sk_txhash; + __u64 bytes_received; + __u64 bytes_acked; }; /* List of known BPF sock_ops operators. @@ -1003,6 +1036,43 @@ enum { * a congestion threshold. RTTs above * this indicate congestion */ + BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered. +* Arg1: value of icsk_retransmits +* Arg2: value of icsk_rto +* Arg3: whether RTO has expired +*/ + BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted. +* Arg1: sequence number of 1st byte +* Arg2: # segments +* Arg3: return value of +* tcp_transmit_skb (0 => success) +*/ +
[PATCH bpf-next v9 09/12] bpf: Add sock_ops R/W access to tclass
Adds direct write access to sk_txhash and access to tclass for ipv6 flows through getsockopt and setsockopt. Sample usage for tclass: bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v)) where skops is a pointer to the ctx (struct bpf_sock_ops). Signed-off-by: Lawrence Brakmo --- net/core/filter.c | 47 +-- 1 file changed, 45 insertions(+), 2 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index a858ebc..fe2c793 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3232,6 +3232,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock, ret = -EINVAL; } #ifdef CONFIG_INET +#if IS_ENABLED(CONFIG_IPV6) + } else if (level == SOL_IPV6) { + if (optlen != sizeof(int) || sk->sk_family != AF_INET6) + return -EINVAL; + + val = *((int *)optval); + /* Only some options are supported */ + switch (optname) { + case IPV6_TCLASS: + if (val < -1 || val > 0xff) { + ret = -EINVAL; + } else { + struct ipv6_pinfo *np = inet6_sk(sk); + + if (val == -1) + val = 0; + np->tclass = val; + } + break; + default: + ret = -EINVAL; + } +#endif } else if (level == SOL_TCP && sk->sk_prot->setsockopt == tcp_setsockopt) { if (optname == TCP_CONGESTION) { @@ -3241,7 +3264,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock, strncpy(name, optval, min_t(long, optlen, TCP_CA_NAME_MAX-1)); name[TCP_CA_NAME_MAX-1] = 0; - ret = tcp_set_congestion_control(sk, name, false, reinit); + ret = tcp_set_congestion_control(sk, name, false, + reinit); } else { struct tcp_sock *tp = tcp_sk(sk); @@ -3307,6 +3331,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, } else { goto err_clear; } +#if IS_ENABLED(CONFIG_IPV6) + } else if (level == SOL_IPV6) { + struct ipv6_pinfo *np = inet6_sk(sk); + + if (optlen != sizeof(int) || sk->sk_family != AF_INET6) + goto err_clear; + + /* Only some options are supported */ + switch (optname) { + case IPV6_TCLASS: + *((int *)optval) = (int)np->tclass; + break; + default: + goto err_clear; + } +#endif } else { goto
err_clear; } @@ -3871,6 +3911,7 @@ static bool sock_ops_is_valid_access(int off, int size, if (type == BPF_WRITE) { switch (off) { case offsetof(struct bpf_sock_ops, reply): + case offsetof(struct bpf_sock_ops, sk_txhash): if (size != size_default) return false; break; @@ -4690,7 +4731,8 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, break; case offsetof(struct bpf_sock_ops, sk_txhash): - SOCK_OPS_GET_FIELD(sk_txhash, sk_txhash, struct sock); + SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash, + struct sock, type); break; case offsetof(struct bpf_sock_ops, bytes_received): @@ -4701,6 +4743,7 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, case offsetof(struct bpf_sock_ops, bytes_acked): SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock); break; + } return insn - insn_buf; } -- 2.9.5
[PATCH bpf-next v9 02/12] bpf: Make SOCK_OPS_GET_TCP size independent
Make the SOCK_OPS_GET_TCP helper macro size-independent (before it only worked with 4-byte fields). Signed-off-by: Lawrence Brakmo --- net/core/filter.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index bf9bb75..62e7874 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4470,9 +4470,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, break; /* Helper macro for adding read access to tcp_sock fields. */ -#define SOCK_OPS_GET_TCP32(FIELD_NAME) \ +#define SOCK_OPS_GET_TCP(FIELD_NAME) \ do { \ - BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) != 4); \ + BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) > \ + FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME)); \ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ struct bpf_sock_ops_kern, \ is_fullsock), \ @@ -4484,16 +4485,18 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, struct bpf_sock_ops_kern, sk), \ si->dst_reg, si->src_reg, \ offsetof(struct bpf_sock_ops_kern, sk)); \ - *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, \ + *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock, \ + FIELD_NAME), si->dst_reg, \ + si->dst_reg, \ offsetof(struct tcp_sock, FIELD_NAME)); \ } while (0) case offsetof(struct bpf_sock_ops, snd_cwnd): - SOCK_OPS_GET_TCP32(snd_cwnd); + SOCK_OPS_GET_TCP(snd_cwnd); break; case offsetof(struct bpf_sock_ops, srtt_us): - SOCK_OPS_GET_TCP32(srtt_us); + SOCK_OPS_GET_TCP(srtt_us); break; } return insn - insn_buf; -- 2.9.5
Re: [net-next 0/8][pull request] 1GbE Intel Wired LAN Driver Updates 2018-01-24
On Wed, Jan 24, 2018 at 1:10 PM, David Miller wrote: > From: Jeff Kirsher > Date: Wed, 24 Jan 2018 12:55:12 -0800 > >> This series contains updates to igb and e1000e only. > > Pulled, however: > >> Corinna Vinschen implements the ability to set the VF MAC to >> 00:00:00:00:00:00 via RTM_SETLINK on the PF, to prevent receiving >> "invlaid argument" when libvirt attempts to restore the MAC address back >> to its original state of 00:00:00:00:00:00. > > This is really a mess and the wrong way to go about this. > > No interface, even a VF, should come up or ever have an invalid > MAC addres like all-zeros. That's the fundamental problem and > once you fix that all of this other crazy logic and workarounds > no longer become necessary. In the case of igbvf the VFs never come up with 0s in their MAC address. An all 0's MAC address basically leaves it open to VF's choice for assigning themselves a MAC address, or at least that is the way I recall coding it back in the day. There are a few issues with making changes to this at this point. The first being that this concept is pretty much baked into the VF driver logic for most drivers supporting legacy SR-IOV, and as pointed out in the patch comments the libvirt interface is writing 0's to disable the VF MAC address when it is not in use. At this point we cannot change this without breaking the libvirt userspace. One of the motivations for clearing this is to avoid having the PF misdirect traffic as having a MAC address mapped to a disabled/unassigned VF could result in traffic being dropped when it should be directed elsewhere such as a bridge on the PF, or out to some other PF that is now running the VM there. > Whatever it takes, just do it. We can even come up with a global > MAC address range that on a Linux system is reserved for VFs to > come up with. That is normally how the VFs handle this on their side.
The code was set up so that if the PF provided an all 0's MAC address then the VF would assign itself a locally administered address so that it wouldn't come up with an address of 0s. If you are saying the VFs shouldn't be allowed to come up with an all 0's MAC address, I believe that none of them do. I believe they either fail to come up at all or report a locally administered address for themselves. I can double check that though (at least for Intel) to verify that it is in fact a consistent behavior. In theory there isn't likely to be a VF bound to the interface anyway; usually when the MAC address is invalidated it is because a VM has been terminated and the VF driver is just in limbo since it is usually assigned to a VFIO interface which doesn't actually expose the network interface to the kernel. I suppose we could look at pushing the LAA generation up into the PF, but we would still want to maintain the all 0's address while the VF is inactive since we need to clear the stale VF addresses from the MAC address table in the event of a VM being relocated to a different server and taking the MAC address with it. The good news in all this is that it is going to fade out and go away anyway as SwitchDev takes over for SR-IOV. > Thanks. I'll double check our VF drivers and make sure none of them are exposing a netdevice with an all 0's MAC address, and see what we can do about relocating the locally administered address generation into the PF. Thanks. - Alex
[PATCH net-next] ipv6: raw: use IPv4 raw_sendmsg on v4-mapped IPv6 destinations
Make IPv6 SOCK_RAW sockets operate like IPv6 UDP and TCP sockets with respect to IPv4-mapped addresses by calling IPv4 raw_sendmsg from rawv6_sendmsg to send those messages out. Signed-off-by: Travis Brown Signed-off-by: Ivan Delalande --- include/net/raw.h | 1 + net/ipv4/raw.c | 5 +++-- net/ipv6/raw.c | 14 ++ 3 files changed, 18 insertions(+), 2 deletions(-) diff --git a/include/net/raw.h b/include/net/raw.h index 99d26d0c4a19..b4dbf730da54 100644 --- a/include/net/raw.h +++ b/include/net/raw.h @@ -33,6 +33,7 @@ void raw_icmp_error(struct sk_buff *, int, u32); int raw_local_deliver(struct sk_buff *, int); int raw_rcv(struct sock *, struct sk_buff *); +int rawv4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len); #define RAW_HTABLE_SIZE MAX_INET_PROTOS diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 136544b36a46..09f719af8642 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -499,7 +499,7 @@ static int raw_getfrag(void *from, char *to, int offset, int len, int odd, return ip_generic_getfrag(rfv->msg, to, offset, len, odd, skb); } -static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) +int rawv4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) { struct inet_sock *inet = inet_sk(sk); struct net *net = sock_net(sk); @@ -692,6 +692,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) err = 0; goto done; } +EXPORT_SYMBOL_GPL(rawv4_sendmsg); static void raw_close(struct sock *sk, long timeout) { @@ -969,7 +970,7 @@ struct proto raw_prot = { .init = raw_init, .setsockopt = raw_setsockopt, .getsockopt = raw_getsockopt, - .sendmsg = raw_sendmsg, + .sendmsg = rawv4_sendmsg, .recvmsg = raw_recvmsg, .bind = raw_bind, .backlog_rcv = raw_rcv_skb, diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index ddda7eb3c623..f8513e2f1481 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -844,6 +844,20 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) fl6.flowlabel = np->flow_label; } + if (daddr &&
ipv6_addr_v4mapped(daddr)) { + struct sockaddr_in sin; + + sin.sin_family = AF_INET; + sin.sin_port = sin6 ? sin6->sin6_port : inet->inet_dport; + sin.sin_addr.s_addr = daddr->s6_addr32[3]; + msg->msg_name = &sin; + msg->msg_namelen = sizeof(sin); + + if (__ipv6_only_sock(sk)) + return -ENETUNREACH; + return rawv4_sendmsg(sk, msg, len); + } + if (fl6.flowi6_oif == 0) fl6.flowi6_oif = sk->sk_bound_dev_if; -- 2.16.1
[PATCH] kbuild: make Makefile|Kbuild in each directory optional
It is useful to be able to build single object files, e.g.: $ make net/sched/cls_flower.o W=1 C=2 Currently kbuild does a hard include of a Kbuild or Makefile for the directory where that object would reside. Kbuild doesn't cater too well to multi-directory drivers, meaning such drivers will usually only use a single central Makefile. This in turn means it will be impossible to build most object files individually for such drivers. Make the include of $dir/{Makefile,Kbuild} optional. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- I must admit I have no idea whose tree I should send this to :( Could it go via net-next if no one on linux-kbuild objects? scripts/Makefile.build | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/Makefile.build b/scripts/Makefile.build index 47cddf32aeba..178864f877d5 100644 --- a/scripts/Makefile.build +++ b/scripts/Makefile.build @@ -42,7 +42,7 @@ save-cflags := $(CFLAGS) # The filename Kbuild has precedence over Makefile kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src)) kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile) -include $(kbuild-file) +-include $(kbuild-file) # If the save-* variables changed error out ifeq ($(KBUILD_NOPEDANTIC),) -- 2.15.1
Re: [PATCH nf-next,RFC v4] netfilter: nf_flow_table: add hardware offload support
On Thu, 25 Jan 2018 01:09:41 +0100, Pablo Neira Ayuso wrote: > This patch adds the infrastructure to offload flows to hardware, in case > the nic/switch comes with built-in flow tables capabilities. > > If the hardware comes with no hardware flow tables or they have > limitations in terms of features, the existing infrastructure falls back > to the software flow table implementation. > > The software flow table garbage collector skips entries that resides in > the hardware, so the hardware will be responsible for releasing this > flow table entry too via flow_offload_dead(). > > Hardware configuration, either to add or to delete entries, is done from > the hardware offload workqueue, to ensure this is done from user context > given that we may sleep when grabbing the mdio mutex. > > Signed-off-by: Pablo Neira Ayuso

I wonder how you deal with device/table removal. I know regrettably little about the internals of nftables. I assume the table cannot be removed/module unloaded as long as there are flow entries? And on device removal all flows pertaining to the removed ifindex will be automatically flushed? Still there could be outstanding work items targeting the device, so this WARN_ON: + indev = dev_get_by_index(net, ifindex); + if (WARN_ON(!indev)) + return 0; looks possible to trigger. On the general architecture - I think it's worth documenting somewhere clearly that unlike TC offloads and most NDOs, add/del of NFT flows are not protected by rtnl_lock. > v4: More work in progress > - Decouple nf_flow_table_hw from nft_flow_offload via rcu hooks > - Consolidate ->ndo invocations, now they happen from the hw worker. > - Fix bug in list handling, use list_replace_init() > - cleanup entries on nf_flow_table_hw module removal > - add NFT_FLOWTABLE_F_HW flag to flowtables to explicit signal that user wants > to offload entries to hardware.
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index ed0799a12bf2..be0c12acc3f0 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -859,6 +859,13 @@ struct dev_ifalias { > char ifalias[]; > }; > > +struct flow_offload; > + > +enum flow_offload_type { > + FLOW_OFFLOAD_ADD= 0, > + FLOW_OFFLOAD_DEL, > +}; > + > /* > * This structure defines the management hooks for network devices. > * The following hooks can be defined; unless noted otherwise, they are > @@ -1316,6 +1323,8 @@ struct net_device_ops { > int (*ndo_bridge_dellink)(struct net_device *dev, > struct nlmsghdr *nlh, > u16 flags); > + int (*ndo_flow_offload)(enum flow_offload_type type, > + struct flow_offload *flow); nit: should there be kdoc for the new NDO? ndo kdoc comment doesn't look like it would be recognized by tools anyway though.. nit: using "flow" as the name rings slightly grandiose to me :) I would appreciate a nf_ prefix for clarity. Drivers will have to juggle a number of "flow" things, it would make the code easier to follow if names were prefixed clearly, I feel. > int (*ndo_change_carrier)(struct net_device *dev, > bool new_carrier); > int (*ndo_get_phys_port_id)(struct net_device *dev,
[PATCH v2 net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_{match|target}
From: Eric Dumazet

It looks like syzbot found its way into netfilter territory. Issue here is that @name comes from user space and might not be null terminated. Out-of-bounds reads happen, and KASAN is not happy. v2 added a similar fix for xt_request_find_target(), as Florian advised. Signed-off-by: Eric Dumazet Reported-by: syzbot --- No Fixes: tag, bug seems to be a day-0 one. net/netfilter/x_tables.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index 55802e97f906d1987ed78b4296584deb38e5f876..ecffc51ce83b07c063a0db67cdb33d9bf48a75ac 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -210,6 +210,9 @@ xt_request_find_match(uint8_t nfproto, const char *name, uint8_t revision) { struct xt_match *match; + if (strnlen(name, XT_EXTENSION_MAXNAMELEN) == XT_EXTENSION_MAXNAMELEN) + return ERR_PTR(-EINVAL); + match = xt_find_match(nfproto, name, revision); if (IS_ERR(match)) { request_module("%st_%s", xt_prefix[nfproto], name); @@ -252,6 +255,9 @@ struct xt_target *xt_request_find_target(u8 af, const char *name, u8 revision) { struct xt_target *target; + if (strnlen(name, XT_EXTENSION_MAXNAMELEN) == XT_EXTENSION_MAXNAMELEN) + return ERR_PTR(-EINVAL); + target = xt_find_target(af, name, revision); if (IS_ERR(target)) { request_module("%st_%s", xt_prefix[af], name);
Re: [PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()
On Thu, 2018-01-25 at 01:13 +0100, Pablo Neira Ayuso wrote: > On Thu, Jan 25, 2018 at 12:50:56AM +0100, Pablo Neira Ayuso wrote: > > On Thu, Jan 25, 2018 at 12:19:52AM +0100, Florian Westphal wrote: > > > Eric Dumazet wrote: > > > > From: Eric Dumazet > > > > > > > > It looks like syzbot found its way into netfilter territory. > > > > > > Excellent. This will sure allow to find and fix more bugs :-) > > > > > > > Issue here is that @name comes from user space and might > > > > not be null terminated. > > > > > > Indeed, thanks for fixing this Eric. > > > > > > xt_find_target() and xt_find_table_lock() might have similar issues. > > > > I'm going to keep back this patch then, it would be good if we can > > find this in one single patch. > > s/find/fix/ > > Sorry. Ok, but apparently you partially fixed this recently :/ Commits 78b79876761b8 and b301f25387599 took care of xt_find_table_lock() it seems. I'll send a V2 including xt_request_find_target()
[PATCH net-next] rds: tcp: per-netns flag to stop new connection creation when rds-tcp is being dismantled
An rds_connection can get added during netns deletion between lines 528 and 529 of 506 static void rds_tcp_kill_sock(struct net *net) : /* code to pull out all the rds_connections that should be destroyed */ : 528 spin_unlock_irq(&rds_tcp_conn_lock); 529 list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node) 530 rds_conn_destroy(tc->t_cpath->cp_conn); Such an rds_connection would be missed by the rds_conn_destroy() loop (which cancels all pending work) and, if work was scheduled after netns deletion, could trigger a use-after-free. A similar race window exists for the module unload path in rds_tcp_exit -> rds_tcp_destroy_conns. To avoid the addition of new rds_connections during kill_sock or netns_delete, this patch introduces a per-netns flag, RTN_DELETE_PENDING, that will cause RDS connection creation to fail. RCU is used to make sure that we wait for the critical section of __rds_conn_create threads (that may have started before the setting of RTN_DELETE_PENDING) to complete before starting the connection destruction.
Reported-by: syzbot+bbd8e9a06452cc480...@syzkaller.appspotmail.com Signed-off-by: Sowmini Varadhan --- net/rds/connection.c | 3 ++ net/rds/tcp.c | 82 - net/rds/tcp.h | 1 + 3 files changed, 57 insertions(+), 29 deletions(-) diff --git a/net/rds/connection.c b/net/rds/connection.c index b10c0ef..2ae539d 100644 --- a/net/rds/connection.c +++ b/net/rds/connection.c @@ -220,8 +220,10 @@ static void __rds_conn_path_init(struct rds_connection *conn, is_outgoing); conn->c_path[i].cp_index = i; } + rcu_read_lock(); ret = trans->conn_alloc(conn, gfp); if (ret) { + rcu_read_unlock(); kfree(conn->c_path); kmem_cache_free(rds_conn_slab, conn); conn = ERR_PTR(ret); @@ -283,6 +285,7 @@ static void __rds_conn_path_init(struct rds_connection *conn, } } spin_unlock_irqrestore(&rds_conn_lock, flags); + rcu_read_unlock(); out: return conn; diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 9920d2f..2bdd3cc 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -274,14 +274,13 @@ static int rds_tcp_laddr_check(struct net *net, __be32 addr) static void rds_tcp_conn_free(void *arg) { struct rds_tcp_connection *tc = arg; - unsigned long flags; rdsdebug("freeing tc %p\n", tc); - spin_lock_irqsave(&rds_tcp_conn_lock, flags); + spin_lock_bh(&rds_tcp_conn_lock); if (!tc->t_tcp_node_detached) list_del(&tc->t_tcp_node); - spin_unlock_irqrestore(&rds_tcp_conn_lock, flags); + spin_unlock_bh(&rds_tcp_conn_lock); kmem_cache_free(rds_tcp_conn_slab, tc); } @@ -296,7 +295,7 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp) tc = kmem_cache_alloc(rds_tcp_conn_slab, gfp); if (!tc) { ret = -ENOMEM; - break; + goto fail; } mutex_init(&tc->t_conn_path_lock); tc->t_sock = NULL; @@ -306,14 +305,25 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp) conn->c_path[i].cp_transport_data = tc; tc->t_cpath = &conn->c_path[i]; + tc->t_tcp_node_detached = true; - spin_lock_irq(&rds_tcp_conn_lock); - tc->t_tcp_node_detached = false; - list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list); - spin_unlock_irq(&rds_tcp_conn_lock);
rdsdebug("rds_conn_path [%d] tc %p\n", i, conn->c_path[i].cp_transport_data); } + spin_lock_bh(&rds_tcp_conn_lock); + if (rds_tcp_netns_delete_pending(rds_conn_net(conn))) { + rdsdebug("RTN_DELETE_PENDING\n"); + ret = -ENETDOWN; + spin_unlock_bh(&rds_tcp_conn_lock); + goto fail; + } + for (i = 0; i < RDS_MPATH_WORKERS; i++) { + tc = conn->c_path[i].cp_transport_data; + tc->t_tcp_node_detached = false; + list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list); + } + spin_unlock_bh(&rds_tcp_conn_lock); +fail: if (ret) { for (j = 0; j < i; j++) rds_tcp_conn_free(conn->c_path[j].cp_transport_data); @@ -332,23 +342,6 @@ static bool list_has_conn(struct list_head *list, struct rds_connection *conn) return false; } -static void rds_tcp_destroy_conns(void) -{ - struct rds_tcp_connection *tc, *_tc; - LIST_HEAD(tmp_list); - - /* avoid calling conn_destroy with irqs off */ - spin_lock_irq(&rds_tcp_conn_lock); - list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) { - if (!list_has_conn(&tmp_list,
Re: [PATCH net-next 1/4] net: core: Fix kernel-doc for carrier_* attributes
Hi Florian, I love your patch! Perhaps something to improve: [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-core-Fix-kernel-doc-for-carrier_-attributes/20180125-062300 reproduce: make htmldocs All warnings (new ones prefixed by >>): Warning: Could not extract kernel version WARNING: convert(1) not found, for SVG to PDF conversion install ImageMagick (https://www.imagemagick.org) include/crypto/hash.h:89: warning: duplicate section name 'Note' include/crypto/hash.h:95: warning: duplicate section name 'Note' include/crypto/hash.h:102: warning: duplicate section name 'Note' include/crypto/hash.h:89: warning: duplicate section name 'Note' include/crypto/hash.h:95: warning: duplicate section name 'Note' include/crypto/hash.h:102: warning: duplicate section name 'Note' include/crypto/hash.h:89: warning: duplicate section name 'Note' include/crypto/hash.h:95: warning: duplicate section name 'Note' include/crypto/hash.h:102: warning: duplicate section name 'Note' include/crypto/hash.h:89: warning: duplicate section name 'Note' include/crypto/hash.h:95: warning: duplicate section name 'Note' include/crypto/hash.h:102: warning: duplicate section name 'Note' include/crypto/hash.h:89: warning: duplicate section name 'Note' include/crypto/hash.h:95: warning: duplicate section name 'Note' include/crypto/hash.h:102: warning: duplicate section name 'Note' include/crypto/hash.h:89: warning: duplicate section name 'Note' include/crypto/hash.h:95: warning: duplicate section name 'Note' include/crypto/hash.h:102: warning: duplicate section name 'Note' include/crypto/hash.h:89: warning: duplicate section name 'Note' include/crypto/hash.h:95: warning: duplicate section name 'Note' include/crypto/hash.h:102: warning: duplicate section name 'Note' include/crypto/hash.h:89: warning: duplicate section name 'Note' include/crypto/hash.h:95: warning: duplicate section name 'Note' include/crypto/hash.h:102: warning: duplicate 
section name 'Note' include/linux/gpio/driver.h:142: warning: No description found for parameter 'request_key' drivers/gpio/gpiolib.c:602: warning: No description found for parameter '16' drivers/gpio/gpiolib.c:602: warning: Excess struct member 'events' description in 'lineevent_state' include/linux/iio/iio.h:610: warning: No description found for parameter 'iio_dev' include/linux/iio/iio.h:610: warning: Excess function parameter 'indio_dev' description in 'iio_device_register' include/linux/iio/trigger.h:79: warning: No description found for parameter 'owner' fs/inode.c:1680: warning: No description found for parameter 'rcu' include/linux/jbd2.h:443: warning: No description found for parameter 'i_transaction' include/linux/jbd2.h:443: warning: No description found for parameter 'i_next_transaction' include/linux/jbd2.h:443: warning: No description found for parameter 'i_list' include/linux/jbd2.h:443: warning: No description found for parameter 'i_vfs_inode' include/linux/jbd2.h:443: warning: No description found for parameter 'i_flags' include/linux/jbd2.h:497: warning: No description found for parameter 'h_rsv_handle' include/linux/jbd2.h:497: warning: No description found for parameter 'h_reserved' include/linux/jbd2.h:497: warning: No description found for parameter 'h_type' include/linux/jbd2.h:497: warning: No description found for parameter 'h_line_no' include/linux/jbd2.h:497: warning: No description found for parameter 'h_start_jiffies' include/linux/jbd2.h:497: warning: No description found for parameter 'h_requested_credits' include/linux/jbd2.h:497: warning: No description found for parameter 'saved_alloc_context' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_chkpt_bhs' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_devname' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_average_commit_time' include/linux/jbd2.h:1050: warning: No description found for parameter 
'j_min_batch_time' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_max_batch_time' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_commit_callback' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_failed_commit' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_chksum_driver' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_csum_seed' fs/jbd2/transaction.c:511: warning: No description found for parameter 'type' fs/jbd2/transaction.c:511: warning: No description found for parameter 'line_no' fs/jbd2/transaction.c:641: warning: No description found for parameter 'gfp_mask' include/drm/drm_drv.h:594: warning: No description found for parameter 'gem_prime_pin'
Re: [PATCH 10/10] kill kernel_sock_ioctl()
On Thu, Jan 25, 2018 at 12:01:25AM +, Al Viro wrote: > On Wed, Jan 24, 2018 at 03:52:44PM -0500, David Miller wrote: > > > > Al this series looks fine to me, want me to toss it into net-next? > > Do you want them reposted (with updated commit messages), or would > you prefer a pull request (with or without rebase to current tip > of net-next)? Below is a pull request for rebased branch. Patches themselves are identical to what had been posted, Reviewed-by added and commit message for "kill dev_ifsioc()" made more detailed. The following changes since commit be1b6e8b5470e8311bfa1a3dfd7bd59e85a99759: Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue (2018-01-24 18:02:17 -0500) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git rebased-net-ioctl for you to fetch changes up to 5c59e564e46dcbab2ee7a4e9e0243562a39679a2: kill kernel_sock_ioctl() (2018-01-24 19:13:45 -0500) Al Viro (10): net: separate SIOCGIFCONF handling from dev_ioctl() devinet_ioctl(): take copyin/copyout to caller ip_rt_ioctl(): take copyin to caller kill dev_ifsioc() kill bond_ioctl() kill dev_ifname32() lift handling of SIOCIW... out of dev_ioctl() ipconfig: use dev_set_mtu() dev_ioctl(): move copyin/copyout to callers kill kernel_sock_ioctl() include/linux/inetdevice.h | 2 +- include/linux/net.h| 1 - include/linux/netdevice.h | 7 +- include/net/route.h| 2 +- include/net/wext.h | 4 +- net/core/dev_ioctl.c | 132 ++ net/ipv4/af_inet.c | 28 - net/ipv4/devinet.c | 57 -- net/ipv4/fib_frontend.c| 8 +- net/ipv4/ipconfig.c| 47 ++-- net/socket.c | 271 - net/wireless/wext-core.c | 13 ++- 12 files changed, 173 insertions(+), 399 deletions(-)
[PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc()
We've caught several cgroup css refcounting issues on 4.15-rc7, triggered from different release paths. We've used cgroups v2. I've added a temporary per-memcg sockmem atomic counter and found that we're sometimes falling below 0. It was easy to reproduce, so I was able to bisect the problem. It was introduced by the commit 9f1c2674b328 ("net: memcontrol: defer call to mem_cgroup_sk_alloc()"), which moved the mem_cgroup_sk_alloc() call from the BH context into inet_csk_accept(). The problem is that all the memory allocated before mem_cgroup_sk_alloc() is charged to the socket, but not charged to the memcg. So, when we're releasing the socket, we're uncharging more than we've charged. Fix this by charging the cgroup by the amount of already allocated memory right after mem_cgroup_sk_alloc() in inet_csk_accept(). Fixes: 9f1c2674b328 ("net: memcontrol: defer call to mem_cgroup_sk_alloc()") Signed-off-by: Roman Gushchin Cc: Eric Dumazet Cc: Johannes Weiner Cc: Tejun Heo Cc: David S. Miller --- net/ipv4/inet_connection_sock.c | 5 + 1 file changed, 5 insertions(+) diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 4ca46dc08e63..f439162c2ea2 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -434,6 +434,7 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) struct request_sock *req; struct sock *newsk; int error; + long amt; lock_sock(sk); @@ -476,6 +477,10 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) spin_unlock_bh(&queue->fastopenq.lock); } mem_cgroup_sk_alloc(newsk); + amt = sk_memory_allocated(newsk); + if (amt && newsk->sk_memcg) + mem_cgroup_charge_skmem(newsk->sk_memcg, amt); + out: release_sock(sk); if (req) -- 2.14.3
[PATCH net-next 3/8] mlx5: use tc_cls_can_offload_and_chain0()
Make use of tc_cls_can_offload_and_chain0() to set extack msg in case ethtool tc offload flag is not set or chain unsupported. Signed-off-by: Jakub KicinskiReviewed-by: Simon Horman --- CC: Saeed Mahameed CC: Or Gerlitz --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 + drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 5 + 2 files changed, 2 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 8530c770c873..47bab842c5ee 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -2944,9 +2944,6 @@ static int mlx5e_setup_tc_mqprio(struct net_device *netdev, static int mlx5e_setup_tc_cls_flower(struct mlx5e_priv *priv, struct tc_cls_flower_offload *cls_flower) { - if (cls_flower->common.chain_index) - return -EOPNOTSUPP; - switch (cls_flower->command) { case TC_CLSFLOWER_REPLACE: return mlx5e_configure_flower(priv, cls_flower); @@ -2964,7 +2961,7 @@ int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data, { struct mlx5e_priv *priv = cb_priv; - if (!tc_can_offload(priv->netdev)) + if (!tc_cls_can_offload_and_chain0(priv->netdev, type_data)) return -EOPNOTSUPP; switch (type) { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c index 10fa6a18fcf9..363d8dcb7f17 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c @@ -719,9 +719,6 @@ static int mlx5e_rep_setup_tc_cls_flower(struct mlx5e_priv *priv, struct tc_cls_flower_offload *cls_flower) { - if (cls_flower->common.chain_index) - return -EOPNOTSUPP; - switch (cls_flower->command) { case TC_CLSFLOWER_REPLACE: return mlx5e_configure_flower(priv, cls_flower); @@ -739,7 +736,7 @@ static int mlx5e_rep_setup_tc_cb(enum tc_setup_type type, void *type_data, { struct mlx5e_priv *priv = cb_priv; - if 
(!tc_can_offload(priv->netdev)) + if (!tc_cls_can_offload_and_chain0(priv->netdev, type_data)) return -EOPNOTSUPP; switch (type) { -- 2.15.1
[PATCH net-next 1/8] pkt_cls: add new tc cls helper to check offload flag and chain index
Very few (mlxsw) upstream drivers seem to allow offload of chains other than 0. Save driver developers typing and add a helper for checking both if ethtool's TC offload flag is on and if chain is 0. This helper will set the extack appropriately in both error cases. Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 4 +--- drivers/net/netdevsim/bpf.c | 5 + include/net/pkt_cls.h | 12 3 files changed, 14 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index b3206855535a..322027792fe8 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -130,7 +130,7 @@ static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type type, "only offload of BPF classifiers supported"); return -EOPNOTSUPP; } - if (!tc_can_offload_extack(nn->dp.netdev, cls_bpf->common.extack)) + if (!tc_cls_can_offload_and_chain0(nn->dp.netdev, &cls_bpf->common)) return -EOPNOTSUPP; if (!nfp_net_ebpf_capable(nn)) { NL_SET_ERR_MSG_MOD(cls_bpf->common.extack, @@ -142,8 +142,6 @@ static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type type, "only ETH_P_ALL supported as filter protocol"); return -EOPNOTSUPP; } - if (cls_bpf->common.chain_index) - return -EOPNOTSUPP; /* Only support TC direct action */ if (!cls_bpf->exts_integrated || diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index 8166f121bbcc..de73c1ff0939 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -135,7 +135,7 @@ int nsim_bpf_setup_tc_block_cb(enum tc_setup_type type, return -EOPNOTSUPP; } - if (!tc_can_offload_extack(ns->netdev, cls_bpf->common.extack)) + if (!tc_cls_can_offload_and_chain0(ns->netdev, &cls_bpf->common)) return -EOPNOTSUPP; if (cls_bpf->common.protocol != htons(ETH_P_ALL)) { @@ -144,9 +144,6 @@ int nsim_bpf_setup_tc_block_cb(enum tc_setup_type type, return -EOPNOTSUPP; } - if
(cls_bpf->common.chain_index) - return -EOPNOTSUPP; - if (!ns->bpf_tc_accept) { NSIM_EA(cls_bpf->common.extack, "netdevsim configured to reject BPF TC offload"); diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index 1a41513cec7f..4db08d7dd22c 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -656,6 +656,18 @@ static inline bool tc_can_offload_extack(const struct net_device *dev, return can; } +static inline bool +tc_cls_can_offload_and_chain0(const struct net_device *dev, + struct tc_cls_common_offload *common) +{ + if (common->chain_index) { + NL_SET_ERR_MSG(common->extack, + "Driver supports only offload of chain 0"); + return false; + } + return tc_can_offload_extack(dev, common->extack); +} + static inline bool tc_skip_hw(u32 flags) { return (flags & TCA_CLS_FLAGS_SKIP_HW) ? true : false; -- 2.15.1
[PATCH net-next 0/8] use tc_cls_can_offload_and_chain0() throughout the drivers
Hi! This set makes most drivers use a new tc_cls_can_offload_and_chain0() helper which will set extack in case TC hw offload flag is disabled. i40e patch will follow after net -> net-next merge. I chose to keep the new helper which also looks at the chain but renamed it more appropriately. The rationale being that most drivers don't accept chains other than 0 and since we have to pass extack to the helper we can as well pass the entire struct tc_cls_common_offload and perform the most common checks. Jiri, please let me know if that's acceptable for you. This code makes the assumption that type_data in the callback can be interpreted as struct tc_cls_common_offload, i.e. the real offload structure has common part as the first member. This allows us to make the check once for all classifier types if driver supports more than one. This also means I've dropped the last patch of the RFC (preventing use of common before type validation in nfp). Jakub Kicinski (8): pkt_cls: add new tc cls helper to check offload flag and chain index bnxt: use tc_cls_can_offload_and_chain0() nfp: flower: use tc_cls_can_offload_and_chain0() cxgb4: use tc_cls_can_offload_and_chain0() ixgbe: use tc_cls_can_offload_and_chain0() mlx5: use tc_cls_can_offload_and_chain0() mlxsw: use tc_cls_can_offload_and_chain0() selftests/bpf: check for spurious extacks from the driver drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 ++- drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 3 --- drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c | 3 ++- drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c| 8 +-- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5 +--- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +--- drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 5 +--- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 6 ++--- drivers/net/ethernet/netronome/nfp/bpf/main.c | 4 +--- .../net/ethernet/netronome/nfp/flower/offload.c| 7 +++--- drivers/net/netdevsim/bpf.c| 5 +--- include/net/pkt_cls.h | 12 ++ 
tools/testing/selftests/bpf/test_offload.py| 27 ++ 13 files changed, 54 insertions(+), 39 deletions(-) -- 2.15.1
[PATCH net-next 2/8] cxgb4: use tc_cls_can_offload_and_chain0()
Make use of tc_cls_can_offload_and_chain0() to set extack msg in case ethtool tc offload flag is not set or chain unsupported. Signed-off-by: Jakub KicinskiReviewed-by: Simon Horman --- CC: Ganesh Goudar --- drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c index f0fd2eba30c2..1e3cd8abc56d 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c @@ -2928,9 +2928,6 @@ static int cxgb_set_tx_maxrate(struct net_device *dev, int index, u32 rate) static int cxgb_setup_tc_flower(struct net_device *dev, struct tc_cls_flower_offload *cls_flower) { - if (cls_flower->common.chain_index) - return -EOPNOTSUPP; - switch (cls_flower->command) { case TC_CLSFLOWER_REPLACE: return cxgb4_tc_flower_replace(dev, cls_flower); @@ -2946,9 +2943,6 @@ static int cxgb_setup_tc_flower(struct net_device *dev, static int cxgb_setup_tc_cls_u32(struct net_device *dev, struct tc_cls_u32_offload *cls_u32) { - if (cls_u32->common.chain_index) - return -EOPNOTSUPP; - switch (cls_u32->command) { case TC_CLSU32_NEW_KNODE: case TC_CLSU32_REPLACE_KNODE: @@ -2974,7 +2968,7 @@ static int cxgb_setup_tc_block_cb(enum tc_setup_type type, void *type_data, return -EINVAL; } - if (!tc_can_offload(dev)) + if (!tc_cls_can_offload_and_chain0(dev, type_data)) return -EOPNOTSUPP; switch (type) { -- 2.15.1
[PATCH net-next 6/8] ixgbe: use tc_cls_can_offload_and_chain0()
Make use of tc_cls_can_offload_and_chain0() to set extack msg in case ethtool tc offload flag is not set or chain unsupported. Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman --- CC: Jeff Kirsher --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 722cc3153a99..bbb622f15a77 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -9303,9 +9303,6 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter *adapter, static int ixgbe_setup_tc_cls_u32(struct ixgbe_adapter *adapter, struct tc_cls_u32_offload *cls_u32) { - if (cls_u32->common.chain_index) - return -EOPNOTSUPP; - switch (cls_u32->command) { case TC_CLSU32_NEW_KNODE: case TC_CLSU32_REPLACE_KNODE: @@ -9327,7 +9324,7 @@ static int ixgbe_setup_tc_block_cb(enum tc_setup_type type, void *type_data, { struct ixgbe_adapter *adapter = cb_priv; - if (!tc_can_offload(adapter->netdev)) + if (!tc_cls_can_offload_and_chain0(adapter->netdev, type_data)) return -EOPNOTSUPP; switch (type) { -- 2.15.1
[PATCH net-next 8/8] selftests/bpf: check for spurious extacks from the driver
Drivers should not report errors when offload is not forced. Check stdout and stderr for familiar messages when with no skip flags and with skip_hw. Check for add, replace, and destroy. Signed-off-by: Jakub KicinskiReviewed-by: Simon Horman --- tools/testing/selftests/bpf/test_offload.py | 27 +++ 1 file changed, 27 insertions(+) diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py index ae3eea3ab820..3a43fbc896db 100755 --- a/tools/testing/selftests/bpf/test_offload.py +++ b/tools/testing/selftests/bpf/test_offload.py @@ -543,6 +543,10 @@ netns = [] # net namespaces to be removed def check_extack_nsim(output, reference, args): check_extack(output, "Error: netdevsim: " + reference, args) +def check_no_extack(res, needle): +fail((res[1] + res[2]).find(needle) != -1, + "Found '%s' in command output, leaky extack?" % (needle)) + def check_verifier_log(output, reference): lines = output.split("\n") for l in reversed(lines): @@ -550,6 +554,18 @@ netns = [] # net namespaces to be removed return fail(True, "Missing or incorrect message from netdevsim in verifier log") +def test_spurios_extack(sim, obj, skip_hw, needle): +res = sim.cls_bpf_add_filter(obj, prio=1, handle=1, skip_hw=skip_hw, + include_stderr=True) +check_no_extack(res, needle) +res = sim.cls_bpf_add_filter(obj, op="replace", prio=1, handle=1, + skip_hw=skip_hw, include_stderr=True) +check_no_extack(res, needle) +res = sim.cls_filter_op(op="delete", prio=1, handle=1, cls="bpf", +include_stderr=True) +check_no_extack(res, needle) + + # Parse command line parser = argparse.ArgumentParser() parser.add_argument("--log", help="output verbose log to given file") @@ -687,6 +703,17 @@ netns = [] (j)) sim.cls_filter_op(op="delete", prio=1, handle=1, cls="bpf") +start_test("Test spurious extack from the driver...") +test_spurios_extack(sim, obj, False, "netdevsim") +test_spurios_extack(sim, obj, True, "netdevsim") + +sim.set_ethtool_tc_offloads(False) + 
+test_spurios_extack(sim, obj, False, "TC offload is disabled") +test_spurios_extack(sim, obj, True, "TC offload is disabled") + +sim.set_ethtool_tc_offloads(True) + sim.tc_flush_filters() start_test("Test TC offloads work...") -- 2.15.1
[PATCH net-next 4/8] bnxt: use tc_cls_can_offload_and_chain0()
Make use of tc_cls_can_offload_and_chain0() to set extack msg in case ethtool tc offload flag is not set or chain unsupported. Signed-off-by: Jakub KicinskiReviewed-by: Simon Horman --- CC: Michael Chan --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 ++- drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 3 --- drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c | 3 ++- 3 files changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 6b7e99675571..4b001d2050c2 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -7778,7 +7778,8 @@ static int bnxt_setup_tc_block_cb(enum tc_setup_type type, void *type_data, { struct bnxt *bp = cb_priv; - if (!bnxt_tc_flower_enabled(bp) || !tc_can_offload(bp->dev)) + if (!bnxt_tc_flower_enabled(bp) || + !tc_cls_can_offload_and_chain0(bp->dev, type_data)) return -EOPNOTSUPP; switch (type) { diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c index 2ece1645f55d..fbe6e208e17b 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c @@ -1474,9 +1474,6 @@ int bnxt_tc_setup_flower(struct bnxt *bp, u16 src_fid, { int rc = 0; - if (cls_flower->common.chain_index) - return -EOPNOTSUPP; - switch (cls_flower->command) { case TC_CLSFLOWER_REPLACE: rc = bnxt_tc_add_flow(bp, src_fid, cls_flower); diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c index 2ca11be64182..26290403f38f 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c @@ -124,7 +124,8 @@ static int bnxt_vf_rep_setup_tc_block_cb(enum tc_setup_type type, struct bnxt *bp = vf_rep->bp; int vf_fid = bp->pf.vf[vf_rep->vf_idx].fw_fid; - if (!bnxt_tc_flower_enabled(vf_rep->bp) || !tc_can_offload(bp->dev)) + if (!bnxt_tc_flower_enabled(vf_rep->bp) 
|| + !tc_cls_can_offload_and_chain0(bp->dev, type_data)) return -EOPNOTSUPP; switch (type) { -- 2.15.1
[PATCH net-next 7/8] mlxsw: use tc_cls_can_offload_and_chain0()
Make use of tc_cls_can_offload_and_chain0() to set extack msg in case ethtool tc offload flag is not set or chain unsupported. Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman --- CC: Jiri Pirko CC: Ido Schimmel --- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index 833cd0a96fd9..3dcc58d61506 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -1738,9 +1738,6 @@ static int mlxsw_sp_setup_tc_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port, struct tc_cls_matchall_offload *f, bool ingress) { - if (f->common.chain_index) - return -EOPNOTSUPP; - switch (f->command) { case TC_CLSMATCHALL_REPLACE: return mlxsw_sp_port_add_cls_matchall(mlxsw_sp_port, f, @@ -1780,7 +1777,8 @@ static int mlxsw_sp_setup_tc_block_cb_matchall(enum tc_setup_type type, switch (type) { case TC_SETUP_CLSMATCHALL: - if (!tc_can_offload(mlxsw_sp_port->dev)) + if (!tc_cls_can_offload_and_chain0(mlxsw_sp_port->dev, + type_data)) return -EOPNOTSUPP; return mlxsw_sp_setup_tc_cls_matchall(mlxsw_sp_port, type_data, -- 2.15.1
[PATCH net-next 5/8] nfp: flower: use tc_cls_can_offload_and_chain0()
Make use of tc_cls_can_offload_and_chain0() to set extack msg in case ethtool tc offload flag is not set or chain unsupported. Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman --- drivers/net/ethernet/netronome/nfp/flower/offload.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c index 837134a9137c..08c4c6dc5f7f 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c @@ -483,8 +483,7 @@ static int nfp_flower_repr_offload(struct nfp_app *app, struct net_device *netdev, struct tc_cls_flower_offload *flower, bool egress) { - if (!eth_proto_is_802_3(flower->common.protocol) || - flower->common.chain_index) + if (!eth_proto_is_802_3(flower->common.protocol)) return -EOPNOTSUPP; switch (flower->command) { @@ -504,7 +503,7 @@ int nfp_flower_setup_tc_egress_cb(enum tc_setup_type type, void *type_data, { struct nfp_repr *repr = cb_priv; - if (!tc_can_offload(repr->netdev)) + if (!tc_cls_can_offload_and_chain0(repr->netdev, type_data)) return -EOPNOTSUPP; switch (type) { @@ -521,7 +520,7 @@ static int nfp_flower_setup_tc_block_cb(enum tc_setup_type type, { struct nfp_repr *repr = cb_priv; - if (!tc_can_offload(repr->netdev)) + if (!tc_cls_can_offload_and_chain0(repr->netdev, type_data)) return -EOPNOTSUPP; switch (type) { -- 2.15.1
Re: [PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()
On Thu, Jan 25, 2018 at 12:50:56AM +0100, Pablo Neira Ayuso wrote: > On Thu, Jan 25, 2018 at 12:19:52AM +0100, Florian Westphal wrote: > > Eric Dumazet wrote: > > > From: Eric Dumazet > > > > > > It looks like syzbot found its way into netfilter territory. > > > > Excellent. This will sure allow to find and fix more bugs :-) > > > > > Issue here is that @name comes from user space and might > > > not be null terminated. > > > > Indeed, thanks for fixing this Eric. > > > > xt_find_target() and xt_find_table_lock() might have similar issues. > > I'm going to keep back this patch then, it would be good if we can > find this in one single patch. s/find/fix/ Sorry.
[PATCH nf-next,RFC v4] netfilter: nf_flow_table: add hardware offload support
This patch adds the infrastructure to offload flows to hardware, in case the nic/switch comes with built-in flow tables capabilities. If the hardware comes with no hardware flow tables or they have limitations in terms of features, the existing infrastructure falls back to the software flow table implementation. The software flow table garbage collector skips entries that resides in the hardware, so the hardware will be responsible for releasing this flow table entry too via flow_offload_dead(). Hardware configuration, either to add or to delete entries, is done from the hardware offload workqueue, to ensure this is done from user context given that we may sleep when grabbing the mdio mutex. Signed-off-by: Pablo Neira Ayuso--- v4: More work in progress - Decouple nf_flow_table_hw from nft_flow_offload via rcu hooks - Consolidate ->ndo invocations, now they happen from the hw worker. - Fix bug in list handling, use list_replace_init() - cleanup entries on nf_flow_table_hw module removal - add NFT_FLOWTABLE_F_HW flag to flowtables to explicit signal that user wants to offload entries to hardware. include/linux/netdevice.h| 9 ++ include/net/netfilter/nf_flow_table.h| 16 +++ include/uapi/linux/netfilter/nf_tables.h | 11 ++ net/netfilter/Kconfig| 9 ++ net/netfilter/Makefile | 1 + net/netfilter/nf_flow_table.c| 60 +++ net/netfilter/nf_flow_table_hw.c | 174 +++ net/netfilter/nf_tables_api.c| 12 ++- net/netfilter/nft_flow_offload.c | 5 + 9 files changed, 296 insertions(+), 1 deletion(-) create mode 100644 net/netfilter/nf_flow_table_hw.c diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ed0799a12bf2..be0c12acc3f0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -859,6 +859,13 @@ struct dev_ifalias { char ifalias[]; }; +struct flow_offload; + +enum flow_offload_type { + FLOW_OFFLOAD_ADD= 0, + FLOW_OFFLOAD_DEL, +}; + /* * This structure defines the management hooks for network devices. 
* The following hooks can be defined; unless noted otherwise, they are @@ -1316,6 +1323,8 @@ struct net_device_ops { int (*ndo_bridge_dellink)(struct net_device *dev, struct nlmsghdr *nlh, u16 flags); + int (*ndo_flow_offload)(enum flow_offload_type type, + struct flow_offload *flow); int (*ndo_change_carrier)(struct net_device *dev, bool new_carrier); int (*ndo_get_phys_port_id)(struct net_device *dev, diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h index ed49cd169ecf..69067deb61b6 100644 --- a/include/net/netfilter/nf_flow_table.h +++ b/include/net/netfilter/nf_flow_table.h @@ -22,7 +22,9 @@ struct nf_flowtable_type { struct nf_flowtable { struct rhashtable rhashtable; const struct nf_flowtable_type *type; + u32 flags; struct delayed_work gc_work; + possible_net_t ft_net; }; enum flow_offload_tuple_dir { @@ -65,6 +67,7 @@ struct flow_offload_tuple_rhash { #define FLOW_OFFLOAD_SNAT 0x1 #define FLOW_OFFLOAD_DNAT 0x2 #define FLOW_OFFLOAD_DYING 0x4 +#define FLOW_OFFLOAD_HW0x8 struct flow_offload { struct flow_offload_tuple_rhash tuplehash[FLOW_OFFLOAD_DIR_MAX]; @@ -119,6 +122,19 @@ unsigned int nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb, unsigned int nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state); +void nf_flow_offload_hw_add(struct net *net, struct flow_offload *flow, + struct nf_conn *ct); +void nf_flow_offload_hw_del(struct net *net, struct flow_offload *flow); + +struct nf_flow_table_hw { + void (*add)(struct net *net, struct flow_offload *flow, + struct nf_conn *ct); + void (*del)(struct net *net, struct flow_offload *flow); +}; + +int nf_flow_table_hw_register(const struct nf_flow_table_hw *offload); +void nf_flow_table_hw_unregister(const struct nf_flow_table_hw *offload); + #define MODULE_ALIAS_NF_FLOWTABLE(family) \ MODULE_ALIAS("nf-flowtable-" __stringify(family)) diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h index 66dceee0ae30..1974829d6440 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -1334,6 +1334,15 @@ enum nft_object_attributes { #define NFTA_OBJ_MAX
[PATCH bpf-next v7 0/5] libbpf: add XDP setup support
Hello, This patchset fixes the problem found by Alexei when building libbpf on a system with old headers. It has been tested on an old Ubuntu and seems to behave fine. Best regards, -- Eric
[PATCH bpf-next v7 2/5] libbpf: add function to setup XDP
Most of the code is taken from set_link_xdp_fd() in bpf_load.c and slightly modified to be library compliant. Signed-off-by: Eric Leblond Acked-by: Alexei Starovoitov --- tools/lib/bpf/bpf.c| 127 + tools/lib/bpf/libbpf.c | 2 + tools/lib/bpf/libbpf.h | 4 ++ 3 files changed, 133 insertions(+) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 5128677e4117..749a447ec9ed 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -25,6 +25,17 @@ #include #include #include "bpf.h" +#include "libbpf.h" +#include "nlattr.h" +#include +#include +#include + +#ifndef IFLA_XDP_MAX +#define IFLA_XDP 43 +#define IFLA_XDP_FD 1 +#define IFLA_XDP_FLAGS 3 +#endif /* * When building perf, unistd.h is overridden. __NR_bpf is @@ -46,7 +57,9 @@ # endif #endif +#ifndef min #define min(x, y) ((x) < (y) ? (x) : (y)) +#endif static inline __u64 ptr_to_u64(const void *ptr) { @@ -413,3 +426,117 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len) return err; } + +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) +{ + struct sockaddr_nl sa; + int sock, seq = 0, len, ret = -1; + char buf[4096]; + struct nlattr *nla, *nla_xdp; + struct { + struct nlmsghdr nh; + struct ifinfomsg ifinfo; + char attrbuf[64]; + } req; + struct nlmsghdr *nh; + struct nlmsgerr *err; + socklen_t addrlen; + + memset(&sa, 0, sizeof(sa)); + sa.nl_family = AF_NETLINK; + + sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE); + if (sock < 0) { + return -errno; + } + + if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) { + ret = -errno; + goto cleanup; + } + + addrlen = sizeof(sa); + if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0) { + ret = -errno; + goto cleanup; + } + + if (addrlen != sizeof(sa)) { + ret = -LIBBPF_ERRNO__INTERNAL; + goto cleanup; + } + + memset(&req, 0, sizeof(req)); + req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)); + req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; + req.nh.nlmsg_type = RTM_SETLINK; + req.nh.nlmsg_pid = 0; + req.nh.nlmsg_seq = ++seq; + 
req.ifinfo.ifi_family = AF_UNSPEC; + req.ifinfo.ifi_index = ifindex; + + /* started nested attribute for XDP */ + nla = (struct nlattr *)(((char *)&req) + + NLMSG_ALIGN(req.nh.nlmsg_len)); + nla->nla_type = NLA_F_NESTED | IFLA_XDP; + nla->nla_len = NLA_HDRLEN; + + /* add XDP fd */ + nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len); + nla_xdp->nla_type = IFLA_XDP_FD; + nla_xdp->nla_len = NLA_HDRLEN + sizeof(int); + memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd)); + nla->nla_len += nla_xdp->nla_len; + + /* if user passed in any flags, add those too */ + if (flags) { + nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len); + nla_xdp->nla_type = IFLA_XDP_FLAGS; + nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags); + memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags)); + nla->nla_len += nla_xdp->nla_len; + } + + req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len); + + if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) { + ret = -errno; + goto cleanup; + } + + len = recv(sock, buf, sizeof(buf), 0); + if (len < 0) { + ret = -errno; + goto cleanup; + } + + for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len); +nh = NLMSG_NEXT(nh, len)) { + if (nh->nlmsg_pid != sa.nl_pid) { + ret = -LIBBPF_ERRNO__WRNGPID; + goto cleanup; + } + if (nh->nlmsg_seq != seq) { + ret = -LIBBPF_ERRNO__INVSEQ; + goto cleanup; + } + switch (nh->nlmsg_type) { + case NLMSG_ERROR: + err = (struct nlmsgerr *)NLMSG_DATA(nh); + if (!err->error) + continue; + ret = err->error; + goto cleanup; + case NLMSG_DONE: + break; + default: + break; + } + } + + ret = 0; + +cleanup: + close(sock); + return ret; +} diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 30c776375118..c60122d3ea85 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -106,6 +106,8 @@ static const char *libbpf_strerror_table[NR_ERRNO] = { [ERRCODE_OFFSET(PROG2BIG)] =
Re: [Patch net-next v2 2/3] net_sched: plug in qdisc ops change_tx_queue_len
On 01/23/2018 10:18 AM, Cong Wang wrote: > Introduce a new qdisc ops ->change_tx_queue_len() so that > each qdisc could decide how to implement this if it wants. > Previously we simply read dev->tx_queue_len, after pfifo_fast > switches to skb array, we need this API to resize the skb array > when we change dev->tx_queue_len. > > To avoid handling race conditions with TX BH, we need to > deactivate all TX queues before change the value and bring them > back after we are done, this also makes implementation easier. > > Cc: John Fastabend> Signed-off-by: Cong Wang > --- > include/net/sch_generic.h | 2 ++ > net/core/dev.c| 1 + > net/sched/sch_generic.c | 33 + > 3 files changed, 36 insertions(+) > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > index cd1be1f25c36..d13dd129d085 100644 > --- a/include/net/sch_generic.h > +++ b/include/net/sch_generic.h > @@ -200,6 +200,7 @@ struct Qdisc_ops { > struct nlattr *arg, > struct netlink_ext_ack *extack); > void(*attach)(struct Qdisc *sch); > + int (*change_tx_queue_len)(struct Qdisc *, unsigned > int); > > int (*dump)(struct Qdisc *, struct sk_buff *); > int (*dump_stats)(struct Qdisc *, struct gnet_dump > *); > @@ -488,6 +489,7 @@ void qdisc_class_hash_remove(struct Qdisc_class_hash *, > void qdisc_class_hash_grow(struct Qdisc *, struct Qdisc_class_hash *); > void qdisc_class_hash_destroy(struct Qdisc_class_hash *); > > +int dev_qdisc_change_tx_queue_len(struct net_device *dev); > void dev_init_scheduler(struct net_device *dev); > void dev_shutdown(struct net_device *dev); > void dev_activate(struct net_device *dev); > diff --git a/net/core/dev.c b/net/core/dev.c > index 913655e82859..a9d7d883416d 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -7059,6 +7059,7 @@ int dev_change_tx_queue_len(struct net_device *dev, > unsigned long new_len) > dev->tx_queue_len = orig_len; > return res; > } > + return dev_qdisc_change_tx_queue_len(dev); > } > > return 0; > diff --git a/net/sched/sch_generic.c 
b/net/sched/sch_generic.c > index 1816bde47256..08f9fa27e06e 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -1178,6 +1178,39 @@ void dev_deactivate(struct net_device *dev) > } > EXPORT_SYMBOL(dev_deactivate); > > +static int qdisc_change_tx_queue_len(struct net_device *dev, > + struct netdev_queue *dev_queue) > +{ > + struct Qdisc *qdisc = dev_queue->qdisc_sleeping; > + const struct Qdisc_ops *ops = qdisc->ops; > + > + if (ops->change_tx_queue_len) > + return ops->change_tx_queue_len(qdisc, dev->tx_queue_len); > + return 0; > +} > + > +int dev_qdisc_change_tx_queue_len(struct net_device *dev) > +{ > + bool up = dev->flags & IFF_UP; > + unsigned int i; > + int ret = 0; > + > + if (up) > + dev_deactivate(dev); > + > + for (i = 0; i < dev->num_tx_queues; i++) { > + ret = qdisc_change_tx_queue_len(dev, &dev->_tx[i]); > + > + /* TODO: revert changes on a partial failure */ > + if (ret) > + break; After another look it seems we can solve this without too much pain by using skb_array_resize_multiple() in patch 3/3. Then pass the error back here via qdisc_change_tx_queue_len and reset the queue length to orig_length. Mind giving it a try? Or else I'll do it Friday probably. Thanks, John
[PATCH bpf-next v7 1/5] tools: import netlink header in tools uapi
The header is necessary for libbpf compilation on system with older version of the headers. Signed-off-by: Eric Leblond--- tools/include/uapi/linux/netlink.h | 251 + tools/lib/bpf/Makefile | 3 + 2 files changed, 254 insertions(+) create mode 100644 tools/include/uapi/linux/netlink.h diff --git a/tools/include/uapi/linux/netlink.h b/tools/include/uapi/linux/netlink.h new file mode 100644 index ..776bc92e9118 --- /dev/null +++ b/tools/include/uapi/linux/netlink.h @@ -0,0 +1,251 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI__LINUX_NETLINK_H +#define _UAPI__LINUX_NETLINK_H + +#include +#include /* for __kernel_sa_family_t */ +#include + +#define NETLINK_ROUTE 0 /* Routing/device hook */ +#define NETLINK_UNUSED 1 /* Unused number */ +#define NETLINK_USERSOCK 2 /* Reserved for user mode socket protocols */ +#define NETLINK_FIREWALL 3 /* Unused number, formerly ip_queue */ +#define NETLINK_SOCK_DIAG 4 /* socket monitoring */ +#define NETLINK_NFLOG 5 /* netfilter/iptables ULOG */ +#define NETLINK_XFRM 6 /* ipsec */ +#define NETLINK_SELINUX7 /* SELinux event notifications */ +#define NETLINK_ISCSI 8 /* Open-iSCSI */ +#define NETLINK_AUDIT 9 /* auditing */ +#define NETLINK_FIB_LOOKUP 10 +#define NETLINK_CONNECTOR 11 +#define NETLINK_NETFILTER 12 /* netfilter subsystem */ +#define NETLINK_IP6_FW 13 +#define NETLINK_DNRTMSG14 /* DECnet routing messages */ +#define NETLINK_KOBJECT_UEVENT 15 /* Kernel messages to userspace */ +#define NETLINK_GENERIC16 +/* leave room for NETLINK_DM (DM Events) */ +#define NETLINK_SCSITRANSPORT 18 /* SCSI Transports */ +#define NETLINK_ECRYPTFS 19 +#define NETLINK_RDMA 20 +#define NETLINK_CRYPTO 21 /* Crypto layer */ +#define NETLINK_SMC22 /* SMC monitoring */ + +#define NETLINK_INET_DIAG NETLINK_SOCK_DIAG + +#define MAX_LINKS 32 + +struct sockaddr_nl { + __kernel_sa_family_tnl_family; /* AF_NETLINK */ + unsigned short nl_pad; /* zero */ + __u32 nl_pid; /* port ID */ + __u32 nl_groups; /* multicast groups mask 
*/ +}; + +struct nlmsghdr { + __u32 nlmsg_len; /* Length of message including header */ + __u16 nlmsg_type; /* Message content */ + __u16 nlmsg_flags;/* Additional flags */ + __u32 nlmsg_seq; /* Sequence number */ + __u32 nlmsg_pid; /* Sending process port ID */ +}; + +/* Flags values */ + +#define NLM_F_REQUEST 0x01/* It is request message. */ +#define NLM_F_MULTI0x02/* Multipart message, terminated by NLMSG_DONE */ +#define NLM_F_ACK 0x04/* Reply with ack, with zero or error code */ +#define NLM_F_ECHO 0x08/* Echo this request*/ +#define NLM_F_DUMP_INTR0x10/* Dump was inconsistent due to sequence change */ +#define NLM_F_DUMP_FILTERED0x20/* Dump was filtered as requested */ + +/* Modifiers to GET request */ +#define NLM_F_ROOT 0x100 /* specify tree root*/ +#define NLM_F_MATCH0x200 /* return all matching */ +#define NLM_F_ATOMIC 0x400 /* atomic GET */ +#define NLM_F_DUMP (NLM_F_ROOT|NLM_F_MATCH) + +/* Modifiers to NEW request */ +#define NLM_F_REPLACE 0x100 /* Override existing*/ +#define NLM_F_EXCL 0x200 /* Do not touch, if it exists */ +#define NLM_F_CREATE 0x400 /* Create, if it does not exist */ +#define NLM_F_APPEND 0x800 /* Add to end of list */ + +/* Modifiers to DELETE request */ +#define NLM_F_NONREC 0x100 /* Do not delete recursively*/ + +/* Flags for ACK message */ +#define NLM_F_CAPPED 0x100 /* request was capped */ +#define NLM_F_ACK_TLVS 0x200 /* extended ACK TVLs were included */ + +/* + 4.4BSD ADD NLM_F_CREATE|NLM_F_EXCL + 4.4BSD CHANGE NLM_F_REPLACE + + True CHANGE NLM_F_CREATE|NLM_F_REPLACE + Append NLM_F_CREATE + Check NLM_F_EXCL + */ + +#define NLMSG_ALIGNTO 4U +#define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) ) +#define NLMSG_HDRLEN((int) NLMSG_ALIGN(sizeof(struct nlmsghdr))) +#define NLMSG_LENGTH(len) ((len) + NLMSG_HDRLEN) +#define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len)) +#define NLMSG_DATA(nlh) ((void*)(((char*)nlh) + NLMSG_LENGTH(0))) +#define NLMSG_NEXT(nlh,len) ((len) -= NLMSG_ALIGN((nlh)->nlmsg_len), \ + 
(struct
[PATCH bpf-next v7 4/5] libbpf: add missing SPDX-License-Identifier
Signed-off-by: Eric LeblondAcked-by: Alexei Starovoitov --- tools/lib/bpf/bpf.c| 2 ++ tools/lib/bpf/bpf.h| 2 ++ tools/lib/bpf/libbpf.c | 2 ++ tools/lib/bpf/libbpf.h | 2 ++ 4 files changed, 8 insertions(+) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 765fd95b0657..e850d8365100 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -1,3 +1,5 @@ +// SPDX-License-Identifier: LGPL-2.1 + /* * common eBPF ELF operations. * diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index 9f44c196931e..8d18fb73d7fb 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -1,3 +1,5 @@ +/* SPDX-License-Identifier: LGPL-2.1 */ + /* * common eBPF ELF operations. * diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index c60122d3ea85..71ddc481f349 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -1,3 +1,5 @@ +// SPDX-License-Identifier: LGPL-2.1 + /* * Common eBPF ELF object loading operations. * diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index e42f96900318..f85906533cdd 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -1,3 +1,5 @@ +/* SPDX-License-Identifier: LGPL-2.1 */ + /* * Common eBPF ELF object loading operations. * -- 2.15.1
[PATCH bpf-next v7 3/5] libbpf: add error reporting in XDP
Parse the netlink ext ack attribute to get the error message returned by the card. Code is partially taken from libnl. We add netlink.h to the uapi include of tools. And we need to avoid including the userspace netlink header to get a successful build of the samples, so nlattr.h has a define to avoid the inclusion. Using a direct define could have been an issue as NLMSGERR_ATTR_MAX can change in the future. We also define SOL_NETLINK if not defined, to avoid having to copy socket.h for a fixed value. Signed-off-by: Eric LeblondAcked-by: Alexei Starovoitov Signed-off-by: Eric Leblond --- samples/bpf/Makefile | 2 +- tools/lib/bpf/Build| 2 +- tools/lib/bpf/bpf.c| 13 +++- tools/lib/bpf/nlattr.c | 187 + tools/lib/bpf/nlattr.h | 72 +++ 5 files changed, 273 insertions(+), 3 deletions(-) create mode 100644 tools/lib/bpf/nlattr.c create mode 100644 tools/lib/bpf/nlattr.h diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 7f61a3d57fa7..5c4cd3745282 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -45,7 +45,7 @@ hostprogs-y += xdp_rxq_info hostprogs-y += syscall_tp # Libbpf dependencies -LIBBPF := ../../tools/lib/bpf/bpf.o +LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o test_lru_dist-objs := test_lru_dist.o $(LIBBPF) diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build index d8749756352d..64c679d67109 100644 --- a/tools/lib/bpf/Build +++ b/tools/lib/bpf/Build @@ -1 +1 @@ -libbpf-y := libbpf.o bpf.o +libbpf-y := libbpf.o bpf.o nlattr.o diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 749a447ec9ed..765fd95b0657 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -27,7 +27,7 @@ #include "bpf.h" #include "libbpf.h" #include "nlattr.h" -#include +#include #include #include @@ -37,6 +37,10 @@ #define IFLA_XDP_FLAGS 3 #endif +#ifndef SOL_NETLINK +#define SOL_NETLINK 270 +#endif + /* * When building perf, unistd.h is overridden. 
__NR_bpf is * required to be defined explicitly. @@ -441,6 +445,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) struct nlmsghdr *nh; struct nlmsgerr *err; socklen_t addrlen; + int one = 1; memset(, 0, sizeof(sa)); sa.nl_family = AF_NETLINK; @@ -450,6 +455,11 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) return -errno; } + if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK, + , sizeof(one)) < 0) { + fprintf(stderr, "Netlink error reporting not supported\n"); + } + if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) { ret = -errno; goto cleanup; @@ -526,6 +536,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) if (!err->error) continue; ret = err->error; + nla_dump_errormsg(nh); goto cleanup; case NLMSG_DONE: break; diff --git a/tools/lib/bpf/nlattr.c b/tools/lib/bpf/nlattr.c new file mode 100644 index ..4719434278b2 --- /dev/null +++ b/tools/lib/bpf/nlattr.c @@ -0,0 +1,187 @@ +// SPDX-License-Identifier: LGPL-2.1 + +/* + * NETLINK Netlink attributes + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation version 2.1 + * of the License. 
+ * + * Copyright (c) 2003-2013 Thomas Graf + */ + +#include +#include "nlattr.h" +#include +#include +#include + +static uint16_t nla_attr_minlen[NLA_TYPE_MAX+1] = { + [NLA_U8]= sizeof(uint8_t), + [NLA_U16] = sizeof(uint16_t), + [NLA_U32] = sizeof(uint32_t), + [NLA_U64] = sizeof(uint64_t), + [NLA_STRING]= 1, + [NLA_FLAG] = 0, +}; + +static int nla_len(const struct nlattr *nla) +{ + return nla->nla_len - NLA_HDRLEN; +} + +static struct nlattr *nla_next(const struct nlattr *nla, int *remaining) +{ + int totlen = NLA_ALIGN(nla->nla_len); + + *remaining -= totlen; + return (struct nlattr *) ((char *) nla + totlen); +} + +static int nla_ok(const struct nlattr *nla, int remaining) +{ + return remaining >= sizeof(*nla) && + nla->nla_len >= sizeof(*nla) && + nla->nla_len <= remaining; +} + +static void *nla_data(const struct nlattr *nla) +{ + return (char *) nla + NLA_HDRLEN; +} + +static int nla_type(const struct nlattr *nla) +{ + return nla->nla_type & NLA_TYPE_MASK; +} + +static int validate_nla(struct nlattr *nla, int maxtype, + struct nla_policy *policy) +{ +
[PATCH bpf-next v7 5/5] samples/bpf: use bpf_set_link_xdp_fd
Use bpf_set_link_xdp_fd instead of set_link_xdp_fd to remove some code duplication and benefit from netlink ext ack error messages. Signed-off-by: Eric Leblond--- samples/bpf/bpf_load.c | 102 samples/bpf/bpf_load.h | 2 +- samples/bpf/xdp1_user.c | 4 +- samples/bpf/xdp_redirect_cpu_user.c | 6 +-- samples/bpf/xdp_redirect_map_user.c | 8 +-- samples/bpf/xdp_redirect_user.c | 8 +-- samples/bpf/xdp_router_ipv4_user.c | 10 ++-- samples/bpf/xdp_rxq_info_user.c | 4 +- samples/bpf/xdp_tx_iptunnel_user.c | 6 +-- 9 files changed, 24 insertions(+), 126 deletions(-) diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c index 242631aa4ea2..69806d74fa53 100644 --- a/samples/bpf/bpf_load.c +++ b/samples/bpf/bpf_load.c @@ -695,105 +695,3 @@ struct ksym *ksym_search(long key) return &syms[0]; } -int set_link_xdp_fd(int ifindex, int fd, __u32 flags) -{ - struct sockaddr_nl sa; - int sock, seq = 0, len, ret = -1; - char buf[4096]; - struct nlattr *nla, *nla_xdp; - struct { - struct nlmsghdr nh; - struct ifinfomsg ifinfo; - char attrbuf[64]; - } req; - struct nlmsghdr *nh; - struct nlmsgerr *err; - - memset(&sa, 0, sizeof(sa)); - sa.nl_family = AF_NETLINK; - - sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE); - if (sock < 0) { - printf("open netlink socket: %s\n", strerror(errno)); - return -1; - } - - if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) { - printf("bind to netlink: %s\n", strerror(errno)); - goto cleanup; - } - - memset(&req, 0, sizeof(req)); - req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)); - req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; - req.nh.nlmsg_type = RTM_SETLINK; - req.nh.nlmsg_pid = 0; - req.nh.nlmsg_seq = ++seq; - req.ifinfo.ifi_family = AF_UNSPEC; - req.ifinfo.ifi_index = ifindex; - - /* started nested attribute for XDP */ - nla = (struct nlattr *)(((char *)&req) - + NLMSG_ALIGN(req.nh.nlmsg_len)); - nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/; - nla->nla_len = NLA_HDRLEN; - - /* add XDP fd */ - nla_xdp = (struct nlattr *)((char *)nla + 
nla->nla_len); - nla_xdp->nla_type = 1/*IFLA_XDP_FD*/; - nla_xdp->nla_len = NLA_HDRLEN + sizeof(int); - memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd)); - nla->nla_len += nla_xdp->nla_len; - - /* if user passed in any flags, add those too */ - if (flags) { - nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len); - nla_xdp->nla_type = 3/*IFLA_XDP_FLAGS*/; - nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags); - memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags)); - nla->nla_len += nla_xdp->nla_len; - } - - req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len); - - if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) { - printf("send to netlink: %s\n", strerror(errno)); - goto cleanup; - } - - len = recv(sock, buf, sizeof(buf), 0); - if (len < 0) { - printf("recv from netlink: %s\n", strerror(errno)); - goto cleanup; - } - - for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len); -nh = NLMSG_NEXT(nh, len)) { - if (nh->nlmsg_pid != getpid()) { - printf("Wrong pid %d, expected %d\n", - nh->nlmsg_pid, getpid()); - goto cleanup; - } - if (nh->nlmsg_seq != seq) { - printf("Wrong seq %d, expected %d\n", - nh->nlmsg_seq, seq); - goto cleanup; - } - switch (nh->nlmsg_type) { - case NLMSG_ERROR: - err = (struct nlmsgerr *)NLMSG_DATA(nh); - if (!err->error) - continue; - printf("nlmsg error %s\n", strerror(-err->error)); - goto cleanup; - case NLMSG_DONE: - break; - } - } - - ret = 0; - -cleanup: - close(sock); - return ret; -} diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h index 7d57a4248893..453c200b389b 100644 --- a/samples/bpf/bpf_load.h +++ b/samples/bpf/bpf_load.h @@ -61,5 +61,5 @@ struct ksym { int load_kallsyms(void); struct ksym *ksym_search(long key); -int set_link_xdp_fd(int ifindex, int fd, __u32 flags); +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags); #endif diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c index fdaefe91801d..b901ee2b3336 100644 ---
Re: [PATCH 10/10] kill kernel_sock_ioctl()
On Wed, Jan 24, 2018 at 03:52:44PM -0500, David Miller wrote: > > Al this series looks fine to me, want me to toss it into net-next? Do you want them reposted (with updated commit messages), or would you prefer a pull request (with or without rebase to current tip of net-next)?
Re: [PATCH net-next 2/2] net: sched: add em_ipt ematch for calling xtables matches
On Wed, Jan 24, 2018 at 04:37:16PM -0500, David Miller wrote: > From: Eyal Birger> Date: Tue, 23 Jan 2018 11:17:32 +0200 > > > + network_offset = skb_network_offset(skb); > > + skb_pull(skb, network_offset); > > + > > + rcu_read_lock(); > > + > > + if (skb->skb_iif) > > + indev = dev_get_by_index_rcu(em->net, skb->skb_iif); > > + > > + nf_hook_state_init(&state, im->hook, im->nfproto, indev ?: skb->dev, > > + skb->dev, NULL, em->net, NULL); > > + > > + acpar.match = im->match; > > + acpar.matchinfo = im->match_data; > > + acpar.state = &state; > > + > > + ret = im->match->match(skb, &acpar); > > + > > + rcu_read_unlock(); > > + > > + skb_push(skb, network_offset); > > If the SKB is shared in any way, this pull/push around the NF hook > invocation is illegal. At ingress, skb->data points to the network header, which is what the xtables matches expect, so these are actually no-ops; therefore, skb_pull() and skb_push() can be removed.
Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
On 06/20/2017 08:03 PM, David Ahern wrote: On 6/20/17 5:41 PM, Ben Greear wrote: On 06/20/2017 11:05 AM, Michal Kubecek wrote: On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote: On 06/14/2017 03:25 PM, David Ahern wrote: On 6/14/17 4:23 PM, Ben Greear wrote: On 06/13/2017 07:27 PM, David Ahern wrote: Let's try a targeted debug patch. See attached I had to change it to pr_err so it would go to our serial console since the system locked hard on crash, and that appears to be enough to change the timing where we can no longer reproduce the problem. ok, let's figure out which one is doing that. There are 3 debug statements. I suspect fib6_del_route is the one setting the state to FWS_U. Can you remove the debug prints in fib6_repair_tree and fib6_walk_continue and try again? We cannot reproduce with just that one printf in the kernel either. It must change the timing too much to trigger the bug. You might try trace_printk() which should have less impact (don't forget to enable /proc/sys/kernel/ftrace_dump_on_oops). We cannot reproduce with trace_printk() either. I think that suggests the walker state is set to FWS_U in fib6_del_route, and it is the FWS_U case in fib6_walk_continue that triggers the fault -- the null parent (pn = fn->parent). So we have the 2 areas of code that are interacting. I'm on a road trip through the end of this week with little time to focus on this problem. I'll get back to you another suggestion when I can. So, though I don't know the right way to fix it, the patch below appears to make the system not crash. diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 68b9cc7..bf19a14 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w) pn = fn->parent; w->node = pn; #ifdef CONFIG_IPV6_SUBTREES + if (WARN_ON_ONCE(!pn)) { + pr_err("FWS-U, w: %p fn: %p pn: %p\n", + w, fn, pn); + /* Attempt to work around crash that has been here forever. 
--Ben */ + return 0; + } if (FIB6_SUBTREE(pn) == fn) { WARN_ON(!(fn->fn_flags & RTN_ROOT)); w->state = FWS_L; The printout looks like this (when adding 4000 mac-vlans, so it is pretty rare). PN is definitely NULL sometimes: [root@2u-6n ~]# journalctl -f|grep FWS Jan 24 15:48:05 2u-6n kernel: IPv6: FWS-U, w: 8807ea121ba0 fn: 880856a09260 pn: (null) Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 8807e3963de0 fn: 880856a09260 pn: (null) Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: 88081ac22de0 fn: 880856a09260 pn: (null) Jan 24 15:53:13 2u-6n kernel: IPv6: FWS-U, w: 8808290c69c0 fn: 8807e369f920 pn: (null) Jan 24 15:53:24 2u-6n kernel: IPv6: FWS-U, w: 8807ea3156c0 fn: 88082d1eeb60 pn: (null) 8066 Jan 24 15:48:04 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device eth2#1006 8067 Jan 24 15:48:05 2u-6n kernel: [ cut here ] 8068 Jan 24 15:48:05 2u-6n kernel: WARNING: CPU: 5 PID: 3346 at /home/greearb/git/linux-4.13.dev.y/net/ipv6/ip6_fib.c:1617 fib6_walk_continue+ 0x154/0x1b0 [ipv6] 8069 Jan 24 15:48:05 2u-6n kernel: Modules linked in: 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen ipmi_ssif coretemp intel_raplsb_edac x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm ath9k irqbypass iTCO_wdt ath9k_common iTCO_vendor_support ath9k_hw ath i2c_i801 mac80211 joydev lpc_ich cfg80211 ioatdma shpchp tpm_tis tpm_tis_core wmi tpm ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl sch_fq_codel lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core ipv6 crc_ccitt 8070 Jan 24 15:48:05 2u-6n kernel: CPU: 5 PID: 3346 Comm: ip Tainted: G O4.13.16+ #22 8071 Jan 24 15:48:05 2u-6n kernel: Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017 8072 Jan 24 15:48:05 2u-6n kernel: task: 8807e9ef1dc0 task.stack: c9002083c000 8073 Jan 24 15:48:05 2u-6n kernel: RIP: 0010:fib6_walk_continue+0x154/0x1b0 [ipv6] 8074 Jan 24 15:48:05 2u-6n kernel: RSP: 0018:c9002083fbc0 EFLAGS: 00010246 
8075 Jan 24 15:48:05 2u-6n kernel: RAX: RBX: 8807ea121ba0 RCX: 8076 Jan 24 15:48:05 2u-6n kernel: RDX: 880856a09260 RSI: c9002083fc00 RDI: 81ef2140 8077 Jan 24 15:48:05 2u-6n kernel: RBP: c9002083fbc8 R08: 0008 R09: 8807e36f6b25 8078 Jan 24 15:48:05 2u-6n kernel: R10: c9002083fb70 R11: R12: 0002 8079 Jan 24 15:48:05 2u-6n kernel:
Re: [PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()
On Thu, Jan 25, 2018 at 12:19:52AM +0100, Florian Westphal wrote: > Eric Dumazetwrote: > > From: Eric Dumazet > > > > It looks like syzbot found its way into netfilter territory. > > Excellent. This will sure allow to find and fix more bugs :-) > > > Issue here is that @name comes from user space and might > > not be null terminated. > > Indeed, thanks for fixing this Eric. > > xt_find_target() and xt_find_table_lock() might have similar issues. I'm going to hold back this patch then; it would be good if we can fix all of these in one single patch. Thanks.
Re: [PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()
On Wed, Jan 24, 2018 at 02:49:48PM -0800, Eric Dumazet wrote: > From: Eric Dumazet> > It looks like syzbot found its way into netfilter territory. > > Issue here is that @name comes from user space and might > not be null terminated. > > Out-of-bound reads happen, KASAN is not happy. Applied, thanks Eric.
[GIT] Networking
1) Avoid negative netdev refcount in error flow of xfrm state add, from Aviad Yehezkel. 2) Fix tcpdump decoding of IPSEC decap'd frames by filling in the ethernet header protocol field in xfrm{4,6}_mode_tunnel_input(). From Yossi Kuperman. 3) Fix a syzbot triggered skb_under_panic in pppoe having to do with failing to allocate an appropriate amount of headroom. From Guillaume Nault. 4) Fix memory leak in vmxnet3 driver, from Neil Horman. 5) Cure out-of-bounds packet memory access in em_nbyte EMATCH module, from Wolfgang Bumiller. 6) Restrict what kinds of sockets can be bound to the KCM multiplexer and also disallow when another layer has attached to the socket and made use of sk_user_data. From Tom Herbert. 7) Fix use before init of IOTLB in vhost code, from Jason Wang. 8) Correct STACR register write bit definition in IBM emac driver, from Ivan Mikhaylov. Please pull, thanks a lot. The following changes since commit a84a8ab94ed5cb65a1355fe9e8d1d55283375808: Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-01-23 08:52:55 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git for you to fetch changes up to 624ca9c33c8a853a4a589836e310d776620f4ab9: net/ibm/emac: wrong bit is used for STA control register write (2018-01-24 18:10:57 -0500) Aviad Yehezkel (1): xfrm: fix error flow in case of add state fails Ben Hutchings (1): ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL David S. Miller (3): Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec Merge branch 'kcm-fix-two-syzcaller-issues' Merge branch 'qed-rdma-bug-fixes' Guillaume Nault (1): pppoe: take ->needed_headroom of lower device into account on xmit Gustavo A. R. 
Silva (1): xfrm: fix boolean assignment in xfrm_get_type_offload Ivan Mikhaylov (2): net/ibm/emac: add 8192 rx/tx fifo size net/ibm/emac: wrong bit is used for STA control register write Jakub Kicinski (1): i40e: flower: check if TC offload is enabled on a netdev Jason Wang (2): vhost: use mutex_lock_nested() in vhost_dev_lock_vqs() vhost: do not try to access device IOTLB when not initialized Michal Kalderon (2): qed: Remove reserveration of dpi for kernel qed: Free reserved MR tid Neil Horman (1): vmxnet3: repair memory leak Tom Herbert (2): kcm: Only allow TCP sockets to be attached to a KCM mux kcm: Check if sk_user_data already set in kcm_attach Wolfgang Bumiller (2): net: sched: em_nbyte: don't add the data offset twice net: sched: fix TCF_LAYER_LINK case in tcf_get_base_ptr Yossi Kuperman (2): xfrm: Add SA to hardware at the end of xfrm_state_construct() xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version Yuval Mintz (1): mlxsw: spectrum_router: Don't log an error on missing neighbor drivers/net/ethernet/ibm/emac/core.c | 6 ++ drivers/net/ethernet/ibm/emac/emac.h | 4 +++- drivers/net/ethernet/intel/i40e/i40e_main.c | 2 ++ drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 10 ++ drivers/net/ethernet/qlogic/qed/qed_rdma.c| 31 +-- drivers/net/ppp/pppoe.c | 11 ++- drivers/net/vmxnet3/vmxnet3_drv.c | 2 +- drivers/vhost/vhost.c | 6 +- include/net/ipv6.h| 1 + include/net/pkt_cls.h | 2 +- net/ipv4/xfrm4_mode_tunnel.c | 1 + net/ipv6/ip6_output.c | 2 +- net/ipv6/ipv6_sockglue.c | 2 +- net/ipv6/xfrm6_mode_tunnel.c | 1 + net/kcm/kcmsock.c | 25 + net/sched/em_nbyte.c | 2 +- net/xfrm/xfrm_device.c| 1 + net/xfrm/xfrm_state.c | 12 net/xfrm/xfrm_user.c | 18 +++--- 19 files changed, 90 insertions(+), 49 deletions(-)
Re: [PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()
Eric Dumazetwrote: > From: Eric Dumazet > > It looks like syzbot found its way into netfilter territory. Excellent. This will sure allow to find and fix more bugs :-) > Issue here is that @name comes from user space and might > not be null terminated. Indeed, thanks for fixing this Eric. xt_find_target() and xt_find_table_lock() might have similar issues.
Re: [PATCH v2 1/2] net/ibm/emac: add 8192 rx/tx fifo size
From: Ivan MikhaylovDate: Wed, 24 Jan 2018 15:53:24 +0300 > emac4syn chips are able to use 8192 rx/tx fifo buffer sizes; > in the current state, if we set 8192 in the dts as an example, we get > only 2048, which may impact network speed. > > Signed-off-by: Ivan Mikhaylov Applied.
Re: [PATCH v2 2/2] net/ibm/emac: wrong bit is used for STA control register write
From: Ivan MikhaylovDate: Wed, 24 Jan 2018 15:53:25 +0300 > The STA control register has fields for the mode and for operation opcodes. > Bit 18 is used for mode selection, where 0 is the old MIO/MDIO access method > and 1 is indirect access mode. Bits 19-20 are used for setting up the > read/write operation (STA opcodes). In the current state, 'read' is set into > the old MIO/MDIO mode with bit 19, and the write operation is set into bit 18, > which is mode selection, not a write operation. To correlate write with read, > we set it into bit 20. All these bit positions are MSB-0 based. > > Signed-off-by: Ivan Mikhaylov Applied.
Re: [net-next 0/7][pull request] 100GbE Intel Wired LAN Driver Updates 2018-01-24
From: Jeff KirsherDate: Wed, 24 Jan 2018 14:45:39 -0800 > This series contains updates to fm10k only. > > Alex fixes MACVLAN offload for fm10k, where we were not seeing unicast > packets being received because we did not correctly configure the > default VLAN ID for the port and defaulted to 0. > > Jake cleans up unnecessary parentheses in a couple of "if" statements. > Fixed the driver to stop adding VLAN 0 into the VLAN table, since it > would cause the VLAN table to be inconsistent between the PF and VF. > Also fixed an issue where we were assuming that VLAN 1 is enabled when > the default VLAN ID is not set, resolved by not requesting any filters > for the default_vid if it has not yet been assigned. > > Ngai fixes an issue which was generating a dmesg warning about being unable > to kill a particular VLAN ID for the device. This is because > ndo_vlan_rx_kill_vid() exits with an error, and the handler for this ndo > is fm10k_update_vid(), which exits prematurely under PF VLAN management. > To resolve this, we must check the VLAN update action type before exiting > fm10k_update_vid(), and act appropriately based on the action type. > Also corrected code comment typos. Looks good, pulled, thanks Jeff.
Re: [PATCH net-next 3/3] net/ipv6: Add support for onlink flag
From: David AhernDate: Wed, 24 Jan 2018 15:08:39 -0700 > On 1/23/18 8:00 PM, David Ahern wrote: >> +tbid = l3mdev_fib_table(dev) ? : RT_TABLE_MAIN; >> +if (cfg->fc_table && cfg->fc_table != tbid) { >> +NL_SET_ERR_MSG(extack, >> + "Table id mismatch between given table and >> device"); >> +return -EINVAL; >> +} >> + >> +cfg->fc_table = tbid; >> + >> +return 0; > > This table check is too restrictive for some PBR cases. > > Dave: please drop this set; I'll repost. Ok.
[PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()
From: Eric DumazetIt looks like syzbot found its way into netfilter territory. Issue here is that @name comes from user space and might not be null terminated. Out-of-bound reads happen, KASAN is not happy. Signed-off-by: Eric Dumazet Reported-by: syzbot --- No Fixes: tag, bug seems to be a day-0 one. diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index 55802e97f906d1987ed78b4296584deb38e5f876..8516dc459b539342f44d2b2b3e21b140677c7826 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -210,6 +210,9 @@ xt_request_find_match(uint8_t nfproto, const char *name, uint8_t revision) { struct xt_match *match; + if (strnlen(name, XT_EXTENSION_MAXNAMELEN) == XT_EXTENSION_MAXNAMELEN) + return ERR_PTR(-EINVAL); + match = xt_find_match(nfproto, name, revision); if (IS_ERR(match)) { request_module("%st_%s", xt_prefix[nfproto], name);
Re: [Intel-wired-lan] [RFC v2 net-next 01/10] net: Add a new socket option for a future transmit time.
Hi Richard, Richard Cochranwrites: > On Tue, Jan 23, 2018 at 01:22:37PM -0800, Vinicius Costa Gomes wrote: >> What I think would be the ideal scenario would be if the clockid >> parameter to the TBS Qdisc would not be necessary (if offload was >> enabled), but that's not quite possible right now, because there's no >> support for using the hrtimer infrastructure with dynamic clocks >> (/dev/ptp*). > > We don't need hrtimer for HW offloading. Just enqueue the packets. I > thought we agreed that user space gets the ordering correct. In fact, > davem insisted on it, IIRC. About the ordering of packets, from here [1], there are 3 clear points (in my understanding): 1. Re-ordering of TX descriptors on the device queue should/must not happen; 2. Out of order requests are an error; 3. Timestamps in the past are an error. The only robust way we could think of to keep the packets in order for the device queue is re-ordering packets in the Qdisc. We tried to reach out for confirmation [2] of this understanding but didn't receive any word. Even if we reach a decision that the Qdisc should not re-order packets (we wouldn't have any dependency on hrtimers in the offload case, as you pointed out), we still need hrtimers for the software implementation. So, I guess, the problem remains: if it's possible for the user to express a /dev/ptp* clock, what should we do? > > Thanks, > Richard Cheers, -- Vinicius [1] https://patchwork.ozlabs.org/comment/1770302/ [2] https://patchwork.ozlabs.org/comment/1816492/q
[net-next 1/7] fm10k: Fix configuration for macvlan offload
From: Alexander DuyckThe fm10k driver didn't work correctly when macvlan offload was enabled. Specifically what would occur is that we would see no unicast packets being received. This was traced down to us not correctly configuring the default VLAN ID for the port and defaulting to 0. To correct this we either use the default ID provided by the switch or simply use 1. With that we are able to pass and receive traffic without any issues. In addition we were not repopulating the filter table following a reset. To correct that I have added a bit of code to fm10k_restore_rx_state that will repopulate the Rx filter configuration for the macvlan interfaces. Signed-off-by: Alexander Duyck Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c index adc62fb38c49..6d9088956407 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c @@ -1182,9 +1182,10 @@ static void fm10k_set_rx_mode(struct net_device *dev) void fm10k_restore_rx_state(struct fm10k_intfc *interface) { + struct fm10k_l2_accel *l2_accel = interface->l2_accel; struct net_device *netdev = interface->netdev; struct fm10k_hw *hw = >hw; - int xcast_mode; + int xcast_mode, i; u16 vid, glort; /* record glort for this interface */ @@ -1234,6 +1235,24 @@ void fm10k_restore_rx_state(struct fm10k_intfc *interface) __dev_uc_sync(netdev, fm10k_uc_sync, fm10k_uc_unsync); __dev_mc_sync(netdev, fm10k_mc_sync, fm10k_mc_unsync); + /* synchronize macvlan addresses */ + if (l2_accel) { + for (i = 0; i < l2_accel->size; i++) { + struct net_device *sdev = l2_accel->macvlan[i]; + + if (!sdev) + continue; + + glort = l2_accel->dglort + 1 + i; + + hw->mac.ops.update_xcast_mode(hw, glort, + FM10K_XCAST_MODE_MULTI); + fm10k_queue_mac_request(interface, glort, 
+						sdev->dev_addr,
+						hw->mac.default_vid, true);
+		}
+	}
+
 	fm10k_mbx_unlock(interface);

 	/* record updated xcast mode state */
@@ -1490,7 +1509,7 @@ static void *fm10k_dfwd_add_station(struct net_device *dev,
 		hw->mac.ops.update_xcast_mode(hw, glort,
 					      FM10K_XCAST_MODE_MULTI);
 		fm10k_queue_mac_request(interface, glort, sdev->dev_addr,
-					0, true);
+					hw->mac.default_vid, true);
 	}

 	fm10k_mbx_unlock(interface);
@@ -1530,7 +1549,7 @@ static void fm10k_dfwd_del_station(struct net_device *dev, void *priv)
 		hw->mac.ops.update_xcast_mode(hw, glort,
 					      FM10K_XCAST_MODE_NONE);
 		fm10k_queue_mac_request(interface, glort, sdev->dev_addr,
-					0, false);
+					hw->mac.default_vid, false);
 	}

 	fm10k_mbx_unlock(interface);
-- 
2.14.3
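The restore loop above derives each macvlan's glort arithmetically from the interface's base dglort. As a rough userspace sketch of that slot walk (the table representation and names here are simplified stand-ins, not the driver's real types):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the l2_accel table walk in the patch above:
 * skip empty macvlan slots and compute the glort each active slot
 * would be programmed with (base dglort + 1 + slot index). */
static size_t restore_macvlan_glorts(const char *const *macvlan, size_t size,
                                     unsigned int dglort,
                                     unsigned int *glorts_out)
{
    size_t n = 0;

    for (size_t i = 0; i < size; i++) {
        if (!macvlan[i])
            continue;               /* unused slot, nothing to restore */
        glorts_out[n++] = dglort + 1 + (unsigned int)i;
    }
    return n;
}
```

The driver additionally issues an xcast-mode update and a MAC request per active slot; the sketch only shows the glort numbering.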
[net-next 5/7] fm10k: don't assume VLAN 1 is enabled
From: Jacob Keller

Since commit 856dfd69e84f ("fm10k: Fix multicast mode synch issues", 2016-03-03) we've incorrectly assumed that VLAN 1 is enabled when the default VID is not set. This occurs because we check the default_vid and if it's zero, start several loops over the active_vlans bitmask at 1, instead of checking to ensure that that bit is active.

This happened because of commit d9ff3ee8efe9 ("fm10k: Add support for VLAN 0 w/o default VLAN", 2014-08-07) which mistakenly assumed that we should send requests for MAC and VLAN filters with VLAN 0 when the default_vid isn't set. However, the switch generally considers this an invalid configuration, so the only time we'd have a default_vid of 0 is when the switch is down.

Instead, let's just not request any filters for the default_vid if it's not yet been assigned.

Signed-off-by: Jacob Keller
Tested-by: Krishneil Singh
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 4cf68a235318..4c9d8e52415b 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1050,14 +1050,13 @@ static int __fm10k_uc_sync(struct net_device *dev,
 			   const unsigned char *addr, bool sync)
 {
 	struct fm10k_intfc *interface = netdev_priv(dev);
-	struct fm10k_hw *hw = &interface->hw;
 	u16 vid, glort = interface->glort;
 	s32 err;

 	if (!is_valid_ether_addr(addr))
 		return -EADDRNOTAVAIL;

-	for (vid = hw->mac.default_vid ?
	     fm10k_find_next_vlan(interface, 0) : 1;
+	for (vid = fm10k_find_next_vlan(interface, 0);
 	     vid < VLAN_N_VID;
 	     vid = fm10k_find_next_vlan(interface, vid)) {
 		err = fm10k_queue_mac_request(interface, glort,
@@ -1116,14 +1115,13 @@ static int __fm10k_mc_sync(struct net_device *dev,
 			   const unsigned char *addr, bool sync)
 {
 	struct fm10k_intfc *interface = netdev_priv(dev);
-	struct fm10k_hw *hw = &interface->hw;
 	u16 vid, glort = interface->glort;
 	s32 err;

 	if (!is_multicast_ether_addr(addr))
 		return -EADDRNOTAVAIL;

-	for (vid = hw->mac.default_vid ? fm10k_find_next_vlan(interface, 0) : 1;
+	for (vid = fm10k_find_next_vlan(interface, 0);
 	     vid < VLAN_N_VID;
 	     vid = fm10k_find_next_vlan(interface, vid)) {
 		err = fm10k_queue_mac_request(interface, glort,
@@ -1223,7 +1221,7 @@ void fm10k_restore_rx_state(struct fm10k_intfc *interface)
 			 xcast_mode == FM10K_XCAST_MODE_PROMISC);

 	/* update table with current entries */
-	for (vid = hw->mac.default_vid ? fm10k_find_next_vlan(interface, 0) : 1;
+	for (vid = fm10k_find_next_vlan(interface, 0);
 	     vid < VLAN_N_VID;
 	     vid = fm10k_find_next_vlan(interface, vid)) {
 		fm10k_queue_vlan_request(interface, vid, 0, true);
-- 
2.14.3
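All three loops above now start from fm10k_find_next_vlan(interface, 0) instead of special-casing default_vid. A minimal userspace analogue of such a find-next-set-bit walk over an active_vlans-style bitmask (simplified; the driver's real helper also factors in the default VID):

```c
#include <assert.h>

#define VLAN_N_VID    4096
#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))

/* Simplified analogue of fm10k_find_next_vlan(): return the first VLAN
 * ID strictly greater than vid whose bit is set in the bitmask, or
 * VLAN_N_VID when no further VLAN is active. */
static unsigned int find_next_vlan(const unsigned long *active,
                                   unsigned int vid)
{
    for (vid++; vid < VLAN_N_VID; vid++)
        if (active[vid / BITS_PER_LONG] & (1UL << (vid % BITS_PER_LONG)))
            return vid;
    return VLAN_N_VID;
}
```

A loop of the shape `for (vid = find_next_vlan(act, 0); vid < VLAN_N_VID; vid = find_next_vlan(act, vid))` then visits exactly the active VLANs, which is what the patched driver loops rely on.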
[net-next 7/7] fm10k: clarify action when updating the VLAN table
From: Ngai-Mint Kwan

Clarify the comment for when entering promiscuous mode that we update the VLAN table. Add a comment distinguishing the case where we're exiting promiscuous mode and need to clear the entire VLAN table.

Signed-off-by: Ngai-Mint Kwan
Signed-off-by: Jacob Keller
Tested-by: Krishneil Singh
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 4c9d8e52415b..a38ae5c54da3 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1165,10 +1165,12 @@ static void fm10k_set_rx_mode(struct net_device *dev)

 	/* update xcast mode first, but only if it changed */
 	if (interface->xcast_mode != xcast_mode) {
-		/* update VLAN table */
+		/* update VLAN table when entering promiscuous mode */
 		if (xcast_mode == FM10K_XCAST_MODE_PROMISC)
 			fm10k_queue_vlan_request(interface, FM10K_VLAN_ALL,
 						 0, true);
+
+		/* clear VLAN table when exiting promiscuous mode */
 		if (interface->xcast_mode == FM10K_XCAST_MODE_PROMISC)
 			fm10k_clear_unused_vlans(interface);
-- 
2.14.3
[net-next 0/7][pull request] 100GbE Intel Wired LAN Driver Updates 2018-01-24
This series contains updates to fm10k only.

Alex fixes MACVLAN offload for fm10k, where we were not seeing unicast packets being received because we did not correctly configure the default VLAN ID for the port, defaulting to 0 instead.

Jake cleans up unnecessary parentheses in a couple of "if" statements. Fixed the driver to stop adding VLAN 0 into the VLAN table, since it would cause the VLAN table to be inconsistent between the PF and VF. Also fixed an issue where we were assuming that VLAN 1 is enabled when the default VLAN ID is not set; resolve this by not requesting any filters for the default_vid if it has not yet been assigned.

Ngai fixes an issue which was generating a dmesg message about being unable to kill a particular VLAN ID for the device. This is due to ndo_vlan_rx_kill_vid() exiting with an error; the handler for this ndo is fm10k_update_vid(), which exits prematurely under PF VLAN management. To resolve this, we must check the VLAN update action type before exiting fm10k_update_vid() and act appropriately based on the action type. Also corrected code comment typos.

The following are changes since commit 46410c2efa9cb5b2f40c9ce24a75d147f44aedeb:

  Merge branch 'pktgen-Behavior-flags-fixes'

and are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Alexander Duyck (1):
  fm10k: Fix configuration for macvlan offload

Jacob Keller (3):
  fm10k: cleanup unnecessary parenthesis in fm10k_iov.c
  fm10k: stop adding VLAN 0 to the VLAN table
  fm10k: don't assume VLAN 1 is enabled

Ngai-Mint Kwan (3):
  fm10k: fix "failed to kill vid" message for VF
  fm10k: correct typo in fm10k_pf.c
  fm10k: clarify action when updating the VLAN table

 drivers/net/ethernet/intel/fm10k/fm10k_iov.c    | 4 +-
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 54 ++---
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c     | 2 +-
 3 files changed, 43 insertions(+), 17 deletions(-)
-- 
2.14.3
[net-next 4/7] fm10k: stop adding VLAN 0 to the VLAN table
From: Jacob Keller

Currently, when the driver loads, it sends a request to add VLAN 0 to the VLAN table. For the PF, this is honored, and VLAN 0 is indeed set. For the VF, this request is silently converted into a request for the default VLAN as defined by either the switch vid or the PF vid. This results in the odd behavior that the VLAN table doesn't appear consistent between the PF and the VF.

Furthermore, setting a MAC filter with VLAN 0 is generally considered an invalid configuration by the switch, and since commit 856dfd69e84f ("fm10k: Fix multicast mode synch issues", 2016-03-03) we've had code which prevents us from ever sending such a request.

Since there's not really a good reason to keep VLAN 0 in the VLAN table, stop requesting it in fm10k_restore_rx_state(). This might seem to indicate that we would no longer properly configure the MAC and VLAN tables for the default vid. However, due to the way that fm10k_find_next_vlan() behaves, it will always return the default_vid as enabled.

Signed-off-by: Jacob Keller
Tested-by: Krishneil Singh
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index e85e0b077da3..4cf68a235318 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1222,9 +1222,6 @@ void fm10k_restore_rx_state(struct fm10k_intfc *interface)
 	fm10k_queue_vlan_request(interface, FM10K_VLAN_ALL, 0,
 				 xcast_mode == FM10K_XCAST_MODE_PROMISC);

-	/* Add filter for VLAN 0 */
-	fm10k_queue_vlan_request(interface, 0, 0, true);
-
 	/* update table with current entries */
 	for (vid = hw->mac.default_vid ? fm10k_find_next_vlan(interface, 0) : 1;
 	     vid < VLAN_N_VID;
-- 
2.14.3
[net-next 3/7] fm10k: fix "failed to kill vid" message for VF
From: Ngai-Mint Kwan

When a VF is under PF VLAN assignment:

  ip link set vf <#> vlan

This will remove all previous entries in the VLAN table including those generated by VLAN interfaces created on the VF. The issue arises when the VF is under PF VLAN assignment and one or more of these VLAN interfaces of the VF are deleted.

When deleting these VLAN interfaces, the following message will be generated in "dmesg":

  failed to kill vid 0081/ for device

This is due to the fact that "ndo_vlan_rx_kill_vid" exits with an error. The handler for this ndo is "fm10k_update_vid". Any calls to this function while under PF VLAN management will exit prematurely and, thus, it will generate the failure message.

Additionally, since "fm10k_update_vid" exits prematurely, none of the VLAN update is performed. So, even though the actual VLAN interfaces of the VF will be deleted, the active_vlans bitmask is not cleared. When the VF is no longer under PF VLAN assignment, the driver mistakenly restores the previous entries of the VLAN table based on an unsynchronized list of active VLANs.

The solution to this issue involves checking the VLAN update action type before exiting "fm10k_update_vid". If the VLAN update action type is to "add", this action will not be permitted while the VF is under PF VLAN assignment and the VLAN update is abandoned like before. However, if the VLAN update action type is to "kill", then we need to also clear the active_vlans bitmask. However, we don't need to actually queue any messages to the PF, because the MAC and VLAN tables have already been cleared, and the PF would silently ignore these requests anyways.
Signed-off-by: Ngai-Mint Kwan
Signed-off-by: Jacob Keller
Tested-by: Krishneil Singh
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 6d9088956407..e85e0b077da3 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -934,8 +934,12 @@ static int fm10k_update_vid(struct net_device *netdev, u16 vid, bool set)
 	if (vid >= VLAN_N_VID)
 		return -EINVAL;

-	/* Verify we have permission to add VLANs */
-	if (hw->mac.vlan_override)
+	/* Verify that we have permission to add VLANs. If this is a request
+	 * to remove a VLAN, we still want to allow the user to remove the
+	 * VLAN device. In that case, we need to clear the bit in the
+	 * active_vlans bitmask.
+	 */
+	if (set && hw->mac.vlan_override)
 		return -EACCES;

 	/* update active_vlans bitmask */
@@ -954,6 +958,12 @@ static int fm10k_update_vid(struct net_device *netdev, u16 vid, bool set)
 		rx_ring->vid &= ~FM10K_VLAN_CLEAR;
 	}

+	/* If our VLAN has been overridden, there is no reason to send VLAN
+	 * removal requests as they will be silently ignored.
+	 */
+	if (hw->mac.vlan_override)
+		return 0;
+
 	/* Do not remove default VLAN ID related entries from VLAN and MAC
 	 * tables
 	 */
-- 
2.14.3
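The two hunks above reduce to a small decision table: an "add" under PF VLAN override is refused, while a "kill" proceeds locally but skips messaging the PF. A hedged sketch of that logic in isolation (the function and return codes are illustrative, not the driver's):

```c
#include <assert.h>

#define VID_OK         0   /* proceed with the full update          */
#define VID_LOCAL_ONLY 1   /* clear local state, don't message PF   */
#define VID_EACCES     2   /* adding refused under PF VLAN override */

/* Decision logic isolated from fm10k_update_vid() as patched above. */
static int update_vid_action(int set, int vlan_override)
{
    if (set && vlan_override)
        return VID_EACCES;      /* PF owns the VLAN table */
    if (vlan_override)
        return VID_LOCAL_ONLY;  /* kill: PF would silently ignore it */
    return VID_OK;
}
```

The key point of the fix is the middle case: the kill path must still run far enough to clear the active_vlans bit, even though no request is queued to the PF.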
[net-next 6/7] fm10k: correct typo in fm10k_pf.c
From: Ngai-Mint Kwan

Signed-off-by: Ngai-Mint Kwan
Tested-by: Krishneil Singh
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
index 425d814aed4d..d6406fc31ffb 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
@@ -866,7 +866,7 @@ static s32 fm10k_iov_assign_default_mac_vlan_pf(struct fm10k_hw *hw,
 	/* Determine correct default VLAN ID. The FM10K_VLAN_OVERRIDE bit is
 	 * used here to indicate to the VF that it will not have privilege to
 	 * write VLAN_TABLE. All policy is enforced on the PF but this allows
-	 * the VF to correctly report errors to userspace rqeuests.
+	 * the VF to correctly report errors to userspace requests.
 	 */
 	if (vf_info->pf_vid)
 		vf_vid = vf_info->pf_vid | FM10K_VLAN_OVERRIDE;
-- 
2.14.3
[net-next 2/7] fm10k: cleanup unnecessary parenthesis in fm10k_iov.c
From: Jacob Keller

This fixes a few warnings found by checkpatch.pl --strict

Signed-off-by: Jacob Keller
Tested-by: Krishneil Singh
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
index ea3ab24265ee..760cfa52d02c 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
@@ -353,7 +353,7 @@ int fm10k_iov_resume(struct pci_dev *pdev)
 		struct fm10k_vf_info *vf_info = &iov_data->vf_info[i];

 		/* allocate all but the last GLORT to the VFs */
-		if (i == ((~hw->mac.dglort_map) >> FM10K_DGLORTMAP_MASK_SHIFT))
+		if (i == (~hw->mac.dglort_map >> FM10K_DGLORTMAP_MASK_SHIFT))
 			break;

 		/* assign GLORT to VF, and restrict it to multicast */
@@ -511,7 +511,7 @@ int fm10k_iov_configure(struct pci_dev *pdev, int num_vfs)
 		return err;

 	/* allocate VFs if not already allocated */
-	if (num_vfs && (num_vfs != current_vfs)) {
+	if (num_vfs && num_vfs != current_vfs) {
 		/* Disable completer abort error reporting as
 		 * the VFs can trigger this any time they read a queue
 		 * that they don't own.
-- 
2.14.3
Re: [PATCH] cls_flower: check if filter is in HW before calling fl_hw_destroy_filter()
On Wed, 24 Jan 2018 17:12:55 +0530, Sathya Perla wrote:
> When a filter cannot be added in HW (i.e, fl_hw_replace_filter() returns
> error), the TCA_CLS_FLAGS_IN_HW flag is not set in the filter flags.
>
> This flag (via tc_in_hw()) must be checked before issuing the call
> to delete a filter in HW (fl_hw_destroy_filter()) and before issuing the
> call to query stats (fl_hw_update_stats()).
>
> Signed-off-by: Sathya Perla

Could you explain why you want to make that change? Saying "tc_in_hw() must be checked" is a bit strong, tc_in_hw() is useless from correctness POV. Your patch may be a good optimization, but with shared blocks in the picture now tc_in_hw() == true doesn't mean it's in *your* HW.
Re: [PATCH bpf-next v8 08/12] bpf: Add support for reading sk_state and more
On 1/24/18, 12:07 PM, "netdev-ow...@vger.kernel.org on behalf of Yuchung Cheng" wrote:

On Tue, Jan 23, 2018 at 11:57 PM, Lawrence Brakmo wrote:
> Add support for reading many more tcp_sock fields
>
>   state,            same as sk->sk_state
>   rtt_min           same as sk->rtt_min.s[0].v (current rtt_min)
>   snd_ssthresh
>   rcv_nxt
>   snd_nxt
>   snd_una
>   mss_cache
>   ecn_flags
>   rate_delivered
>   rate_interval_us
>   packets_out
>   retrans_out

Might as well get ca_state, sacked_out and lost_out to estimate CA states and the packets in flight?

Will try to add in updated patchset. If not, I will add as a new patch.

>   total_retrans
>   segs_in
>   data_segs_in
>   segs_out
>   data_segs_out
>   sk_txhash
>   bytes_received (__u64)
>   bytes_acked (__u64)
>
> Signed-off-by: Lawrence Brakmo
> ---
>  include/uapi/linux/bpf.h |  20 +++
>  net/core/filter.c        | 135 +++
>  2 files changed, 144 insertions(+), 11 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2a8c40a..6998032 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -979,6 +979,26 @@ struct bpf_sock_ops {
>         __u32 snd_cwnd;
>         __u32 srtt_us;          /* Averaged RTT << 3 in usecs */
>         __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
> +       __u32 state;
> +       __u32 rtt_min;
> +       __u32 snd_ssthresh;
> +       __u32 rcv_nxt;
> +       __u32 snd_nxt;
> +       __u32 snd_una;
> +       __u32 mss_cache;
> +       __u32 ecn_flags;
> +       __u32 rate_delivered;
> +       __u32 rate_interval_us;
> +       __u32 packets_out;
> +       __u32 retrans_out;
> +       __u32 total_retrans;
> +       __u32 segs_in;
> +       __u32 data_segs_in;
> +       __u32 segs_out;
> +       __u32 data_segs_out;
> +       __u32 sk_txhash;
> +       __u64 bytes_received;
> +       __u64 bytes_acked;
> };
>
> /* List of known BPF sock_ops operators.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6936d19..ffe9b60 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
>  }
>  EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
>
> -static bool __is_valid_sock_ops_access(int off, int size)
> +static bool sock_ops_is_valid_access(int off, int size,
> +                                    enum bpf_access_type type,
> +                                    struct bpf_insn_access_aux *info)
>  {
> +       const int size_default = sizeof(__u32);
> +
>         if (off < 0 || off >= sizeof(struct bpf_sock_ops))
>                 return false;
> +
>         /* The verifier guarantees that size > 0. */
>         if (off % size != 0)
>                 return false;
> -       if (size != sizeof(__u32))
> -               return false;
> -
> -       return true;
> -}
>
> -static bool sock_ops_is_valid_access(int off, int size,
> -                                    enum bpf_access_type type,
> -                                    struct bpf_insn_access_aux *info)
> -{
>         if (type == BPF_WRITE) {
>                 switch (off) {
>                 case offsetof(struct bpf_sock_ops, reply):
> +                       if (size != size_default)
> +                               return false;
>                         break;
>                 default:
>                         return false;
>                 }
> +       } else {
> +               switch (off) {
> +               case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
> +                                       bytes_acked):
> +                       if (size != sizeof(__u64))
> +                               return false;
> +                       break;
> +               default:
> +                       if (size != size_default)
> +                               return false;
> +                       break;
> +               }
>         }
>
> -       return __is_valid_sock_ops_access(off, size);
> +       return true;
>  }
>
>  static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
> @@ -4498,6 +4508,32 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
>                                        is_fullsock));
>                 break;
>
> +       case offsetof(struct bpf_sock_ops, state):
> +
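The read-size rule in sock_ops_is_valid_access() above is: the two __u64 fields must be read with 8-byte loads, all other fields with 4-byte loads. A standalone sketch of that check (the mirror struct below is illustrative, not the real bpf_sock_ops layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the tail of struct bpf_sock_ops: a run of
 * 32-bit fields followed by the two 64-bit counters added above. */
struct sock_ops_mirror {
    uint32_t sk_txhash;       /* stands in for all the __u32 fields */
    uint64_t bytes_received;
    uint64_t bytes_acked;
};

/* Read-access size rule from the patch: offsets inside the __u64 range
 * require an 8-byte access, everything else a 4-byte access. */
static int read_size_ok(size_t off, size_t size)
{
    if (off >= offsetof(struct sock_ops_mirror, bytes_received))
        return size == sizeof(uint64_t);
    return size == sizeof(uint32_t);
}
```

In the kernel the __u64 range is expressed with bpf_ctx_range_till() instead of an explicit offset comparison, but the effect on allowed access sizes is the same.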
Re: [PATCH net-next 3/3] net/ipv6: Add support for onlink flag
On 1/23/18 8:00 PM, David Ahern wrote:
> +	tbid = l3mdev_fib_table(dev) ? : RT_TABLE_MAIN;
> +	if (cfg->fc_table && cfg->fc_table != tbid) {
> +		NL_SET_ERR_MSG(extack,
> +			       "Table id mismatch between given table and device");
> +		return -EINVAL;
> +	}
> +
> +	cfg->fc_table = tbid;
> +
> +	return 0;

This table check is too restrictive for some PBR cases. Dave: please drop this set; I'll repost.
Re: [PATCH net-next] cxgb4: make symbol pedits static
From: Wei Yongjun
Date: Wed, 24 Jan 2018 02:14:33 +

> Fixes the following sparse warning:
>
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:46:27: warning:
>  symbol 'pedits' was not declared. Should it be static?
>
> Fixes: 27ece1f357b7 ("cxgb4: add tc flower support for ETH-DMAC rewrite")
> Signed-off-by: Wei Yongjun

Applied, thank you.
Re: [PATCH net 1/2] vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()
From: "Michael S. Tsirkin"
Date: Wed, 24 Jan 2018 23:46:19 +0200

> On Wed, Jan 24, 2018 at 04:38:30PM -0500, David Miller wrote:
>> From: Jason Wang
>> Date: Tue, 23 Jan 2018 17:27:25 +0800
>>
>> > We used to call mutex_lock() in vhost_dev_lock_vqs() which tries to
>> > hold mutexes of all virtqueues. This may confuse lockdep to report a
>> > possible deadlock because of trying to hold locks belong to same
>> > class. Switch to use mutex_lock_nested() to avoid false positive.
>> >
>> > Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
>> > Reported-by: syzbot+dbb7c1161485e61b0...@syzkaller.appspotmail.com
>> > Signed-off-by: Jason Wang
>>
>> Michael, I see you ACK'd this, meaning that you're OK with these two
>> fixes going via my net tree?
>>
>> Thanks.
>
> Yes - this seems to be what Jason wanted (judging by the net
> tag in the subject) and I'm fine with it.
> Thanks a lot.

Great, not a problem, done.
Re: [PATCH net] net: erspan: fix use-after-free
From: William Tu
Date: Tue, 23 Jan 2018 17:01:29 -0800

> When building the erspan header for either v1 or v2, the eth_hdr()
> does not point to the right inner packet's eth_hdr,
> causing kasan to report use-after-free and slab-out-of-bounds read.
 ...
> Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre")
> Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
> Reported-by: syzbot+9723f2d288e49b492...@syzkaller.appspotmail.com
> Reported-by: syzbot+f0ddeb2b032a8e1d9...@syzkaller.appspotmail.com
> Reported-by: syzbot+f14b3703cd8d76702...@syzkaller.appspotmail.com
> Reported-by: syzbot+eefa384efad8d7997...@syzkaller.appspotmail.com
> Signed-off-by: William Tu

Applied to net-next.
Re: [PATCH net] i40e: flower: check if TC offload is enabled on a netdev
From: Jeff Kirsher
Date: Tue, 23 Jan 2018 08:47:29 -0800

> On Tue, 2018-01-23 at 00:08 -0800, Jakub Kicinski wrote:
>> Since TC block changes drivers are required to check if
>> the TC hw offload flag is set on the interface themselves.
>>
>> Fixes: 2f4b411a3d67 ("i40e: Enable cloud filters via tc-flower")
>> Fixes: 44ae12a768b7 ("net: sched: move the can_offload check from
>> binding phase to rule insertion phase")
>> Signed-off-by: Jakub Kicinski
>> Reviewed-by: Simon Horman
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e_main.c | 2 ++
>>  1 file changed, 2 insertions(+)
>
> Acked-by: Jeff Kirsher
>
> Dave, feel free to pick this up.

Ok, done. Thanks.
Re: [Patch net-next v2 0/3] net_sched: reflect tx_queue_len change for pfifo_fast
From: Cong Wang
Date: Tue, 23 Jan 2018 10:18:56 -0800

> This patchset restores the pfifo_fast qdisc behavior of dropping
> packets based on latest dev->tx_queue_len. Patch 1 introduces
> a helper, patch 2 introduces a new Qdisc ops which is called when
> we modify tx_queue_len, patch 3 implements this ops for pfifo_fast.
>
> Please see each patch for details.
>
> ---
> v2: handle error case for ->change_tx_queue_len()

John, please review. Thanks.
Re: [PATCH net-next v2 00/12] net: sched: propagate extack to cls offloads on destroy and only with skip_sw
On Wed, 24 Jan 2018 22:15:00 +0100, Jiri Pirko wrote:
> Wed, Jan 24, 2018 at 10:07:25PM CET, dsah...@gmail.com wrote:
> >On 1/24/18 2:04 PM, Jiri Pirko wrote:
> >> For the record, I still think it is odd to have 6 patches just to add
> >> one arg to a function. I wonder where this unnecessary patch splits
> >> would lead to in the future.
> >
> >I think it made the review much easier than 1 really long patch.
>
> Even squashed, the patch is quite small. Doing the same thing in every
> hunk.
>
> On contrary, the split made it more complicated for me, because when
> I looked at patch 1 and the function duplication with another arg,
> I did not understand what is going on. Only the last patch actually
> explained it. But perhaps I'm slow.

Next time I'll do a better job explaining things in commit logs, sorry!
Re: [PATCH net 1/2] vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()
On Wed, Jan 24, 2018 at 04:38:30PM -0500, David Miller wrote:
> From: Jason Wang
> Date: Tue, 23 Jan 2018 17:27:25 +0800
>
> > We used to call mutex_lock() in vhost_dev_lock_vqs() which tries to
> > hold mutexes of all virtqueues. This may confuse lockdep to report a
> > possible deadlock because of trying to hold locks belong to same
> > class. Switch to use mutex_lock_nested() to avoid false positive.
> >
> > Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> > Reported-by: syzbot+dbb7c1161485e61b0...@syzkaller.appspotmail.com
> > Signed-off-by: Jason Wang
>
> Michael, I see you ACK'd this, meaning that you're OK with these two
> fixes going via my net tree?
>
> Thanks.

Yes - this seems to be what Jason wanted (judging by the net
tag in the subject) and I'm fine with it.

Thanks a lot.

-- 
MST
Re: [PATCH net 0/2] qed: rdma bug fixes
From: Michal Kalderon
Date: Tue, 23 Jan 2018 11:33:45 +0200

> This patch contains two small bug fixes related to RDMA.
> Both related to resource reservations.
>
> Signed-off-by: Michal Kalderon
> Signed-off-by: Ariel Elior

Series applied, thanks Michal.
Re: [PATCH net-next] rds: tcp: per-netns flag to stop new connection creation when rds-tcp is being dismantled
On 1/24/2018 1:03 PM, Sowmini Varadhan wrote:

> An rds_connection can get added during netns deletion between lines
> 528 and 529 of
>
>   506 static void rds_tcp_kill_sock(struct net *net)
>       :
>       /* code to pull out all the rds_connections that should be destroyed */
>       :
>   528         spin_unlock_irq(&rds_tcp_conn_lock);
>   529         list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
>   530                 rds_conn_destroy(tc->t_cpath->cp_conn);
>
> Such an rds_connection would miss out the rds_conn_destroy() loop (that
> cancels all pending work) and (if it was scheduled after netns deletion)
> could trigger the use-after-free.
>
> A similar race-window exists for the module unload path in
> rds_tcp_exit -> rds_tcp_destroy_conns
>
> To avoid the addition of new rds_connections during kill_sock or
> netns_delete, this patch introduces a per-netns flag, RTN_DELETE_PENDING,
> that will cause RDS connection creation to fail. RCU is used to make sure
> that we wait for the critical section of __rds_conn_create threads (that
> may have started before the setting of RTN_DELETE_PENDING) to complete
> before starting the connection destruction.
>
> Reported-by: syzbot+bbd8e9a06452cc480...@syzkaller.appspotmail.com
> Signed-off-by: Sowmini Varadhan
> ---
>  net/rds/connection.c |    3 ++
>  net/rds/tcp.c        |   82 -
>  net/rds/tcp.h        |    1 +
>  3 files changed, 57 insertions(+), 29 deletions(-)

FWIW,

Acked-by: Santosh Shilimkar

Just for archives, summarizing off-list discussion: netns destroy making use of conn_destroy, which in the past was used only for module unload, is racy. It's not possible to make it race-free with flags alone; it needs an rcu-sync kind of mechanism. RDS being sensitive to brownouts on reconnects, rcu usage has been minimised. Netns delete is expected to be a non-frequent operation, so the usage of rcu as done in this patch is probably OK. If needed it will be revisited in future for optimization.

regards,
Santosh
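The RTN_DELETE_PENDING gate described above can be modelled minimally in userspace: teardown sets the flag first, and any creation attempt that observes the flag fails, so no new connection slips in behind the destroy loop. This is only a sketch of the flag part; the real patch additionally uses RCU so teardown can wait out creators that started before the flag was set.

```c
#include <stdatomic.h>

/* Model of the per-netns RTN_DELETE_PENDING flag (zero-initialized). */
static atomic_int delete_pending;

/* Connection creation path: refuse once teardown has begun. */
static int conn_create(void)
{
    if (atomic_load(&delete_pending))
        return -1;          /* netns teardown in progress: refuse */
    return 0;               /* connection may be created */
}

/* Teardown path: publish the flag before destroying connections.
 * In the kernel, synchronize_rcu() would follow to wait for creators
 * already inside their rcu_read_lock() section. */
static void begin_teardown(void)
{
    atomic_store(&delete_pending, 1);
}
```

The ordering matters: the flag must be visible before the destroy loop runs, which is exactly the window between the unlock and the destroy loop that the original report identified.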
Re: [PATCH net 1/2] vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()
From: Jason Wang
Date: Tue, 23 Jan 2018 17:27:25 +0800

> We used to call mutex_lock() in vhost_dev_lock_vqs() which tries to
> hold mutexes of all virtqueues. This may confuse lockdep to report a
> possible deadlock because of trying to hold locks belong to same
> class. Switch to use mutex_lock_nested() to avoid false positive.
>
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Reported-by: syzbot+dbb7c1161485e61b0...@syzkaller.appspotmail.com
> Signed-off-by: Jason Wang

Michael, I see you ACK'd this, meaning that you're OK with these two
fixes going via my net tree?

Thanks.
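The fix discussed in this thread swaps mutex_lock() for mutex_lock_nested() so that lockdep accepts taking every virtqueue mutex (all in one lock class) in index order. A userspace model of that ordered acquisition (plain flags instead of mutexes; the lockdep subclass annotation itself is kernel-only and is only recorded here):

```c
#define VQ_MAX 4

/* Model of vhost_dev_lock_vqs(): all virtqueue "mutexes" are taken in
 * index order.  Because they share one lock class, plain mutex_lock()
 * makes lockdep suspect a deadlock; the fix passes a distinct subclass
 * per lock via mutex_lock_nested(lock, subclass).  Here we record the
 * subclass each acquisition would use. */
static int locked[VQ_MAX];

static int dev_lock_vqs(int *subclass_used)
{
    for (int i = 0; i < VQ_MAX; i++) {
        if (locked[i])
            return -1;          /* would deadlock: already held */
        locked[i] = 1;
        subclass_used[i] = i;   /* i % MAX_LOCKDEP_SUBCLASSES in-kernel */
    }
    return 0;
}

static void dev_unlock_vqs(void)
{
    for (int i = VQ_MAX - 1; i >= 0; i--)
        locked[i] = 0;
}
```

The fixed acquisition order is what actually prevents deadlock; the nested annotation only tells lockdep that this same-class nesting is intentional rather than a bug.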
Re: [PATCH net-next 2/2] net: sched: add em_ipt ematch for calling xtables matches
From: Eyal Birger
Date: Tue, 23 Jan 2018 11:17:32 +0200

> +	network_offset = skb_network_offset(skb);
> +	skb_pull(skb, network_offset);
> +
> +	rcu_read_lock();
> +
> +	if (skb->skb_iif)
> +		indev = dev_get_by_index_rcu(em->net, skb->skb_iif);
> +
> +	nf_hook_state_init(&state, im->hook, im->nfproto, indev ?: skb->dev,
> +			   skb->dev, NULL, em->net, NULL);
> +
> +	acpar.match = im->match;
> +	acpar.matchinfo = im->match_data;
> +	acpar.state = &state;
> +
> +	ret = im->match->match(skb, &acpar);
> +
> +	rcu_read_unlock();
> +
> +	skb_push(skb, network_offset);

If the SKB is shared in any way, this pull/push around the NF hook invocation is illegal.
Re: [PATCH v4] net: qcom/emac: extend DMA mask to 46bits
From: Wang Dongsheng
Date: Mon, 22 Jan 2018 20:25:06 -0800

> Bit TPD3[31] is used as a timestamp bit if PTP is enabled, but
> it's used as an address bit if PTP is disabled. Since PTP isn't
> supported by the driver, we can extend the DMA address to 46 bits.
>
> Signed-off-by: Wang Dongsheng

Applied to net-next, thanks.
Re: [PATCH] ip_tunnel: Use mark in skb by default
From: Thomas Winter
Date: Tue, 23 Jan 2018 16:46:24 +1300

> This allows marks set by connmark in iptables
> to be used for route lookups.
>
> Signed-off-by: Thomas Winter

Applied to net-next, thanks.