Re: [PATCH ipsec-next] xfrm: Allow xfrmi if_id to be updated by UPDSA
On Thu, Jul 19, 2018 at 07:07:47PM -0700, Nathan Harold wrote: > Allow attaching an SA to an xfrm interface id after > the creation of the SA, so that tasks such as keying, > which must be done as the SA is created, can remain > separate from the decision on how to route traffic > from an SA. This permits SA creation to be decomposed > into three separate steps: > 1) allocation of a SPI > 2) algorithm and key negotiation > 3) insertion into the data path > > Signed-off-by: Nathan Harold Applied to ipsec-next, thanks Nathan!
Re: [PATCH ipsec-next] xfrm: Remove xfrmi interface ID from flowi
On Thu, Jul 19, 2018 at 10:50:44AM -0700, Benedict Wong wrote: > In order to remove the performance impact of having the extra u32 in every > single flowi, this change removes the flowi_xfrm struct, preferring to > take the if_id as a method parameter where needed. > > In the inbound direction, if_id is only needed during the > __xfrm_check_policy() function, and the if_id can be determined at that > point based on the skb. As such, xfrmi_decode_session() is only called > with the skb in __xfrm_check_policy(). > > In the outbound direction, the only place where if_id is needed is the > xfrm_lookup() call in xfrmi_xmit2(). With this change, the if_id is > directly passed into the xfrm_lookup_with_ifid() call. All existing > callers can still call xfrm_lookup(), which uses a default if_id of 0. > > This change does not change any behavior of XFRMIs except for improving > overall system performance via flowi size reduction. > > This change has been tested against the Android Kernel Networking Tests: > > https://android.googlesource.com/kernel/tests/+/master/net/test > > Signed-off-by: Benedict Wong Nice improvement. Applied to ipsec-next, thanks a lot!
Re: [PATCH ipsec-next 1/1] xfrm: don't check offload_handle for nonzero
On Tue, Jun 26, 2018 at 02:19:10PM -0700, Shannon Nelson wrote: > The offload_handle should be an opaque data cookie for the driver > to use, much like the data cookie for a timer or alarm callback. > Thus, the XFRM stack should not be checking for non-zero, because > the driver might use that to store an array reference, which could > be zero, or some other zero but meaningful value. > > We can remove the checks for non-zero because there are plenty > other attributes also being checked to see if there is an offload > in place for the SA in question. > > Signed-off-by: Shannon Nelson Applied to ipsec-next, thanks Shannon!
Re: [PATCH net] ip: in cmsg IP(V6)_ORIGDSTADDR do not read beyond headlen
On Sun, Jul 22, 2018 at 9:43 PM Eric Dumazet wrote: > > > > On 07/22/2018 05:43 PM, Willem de Bruijn wrote: > > From: Willem de Bruijn > > > > Syzbot reported a read beyond the end of the skb head when returning > > IPV6_ORIGDSTADDR: > > > > BUG: KMSAN: kernel-infoleak in put_cmsg+0x5ef/0x860 net/core/scm.c:242 > > CPU: 0 PID: 4501 Comm: syz-executor128 Not tainted 4.17.0+ #9 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > Google 01/01/2011 > > Call Trace: > > __dump_stack lib/dump_stack.c:77 [inline] > > dump_stack+0x185/0x1d0 lib/dump_stack.c:113 > > kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1125 > > kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1219 > > kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1261 > > copy_to_user include/linux/uaccess.h:184 [inline] > > put_cmsg+0x5ef/0x860 net/core/scm.c:242 > > ip6_datagram_recv_specific_ctl+0x1cf3/0x1eb0 net/ipv6/datagram.c:719 > > ip6_datagram_recv_ctl+0x41c/0x450 net/ipv6/datagram.c:733 > > rawv6_recvmsg+0x10fb/0x1460 net/ipv6/raw.c:521 > > [..] > > > > This logic and its ipv4 counterpart read the destination port from > > the packet at skb_transport_offset(skb) + 4. > > > > With MSG_MORE and a local SOCK_RAW sender, syzbot was able to cook a > > packet that stores headers exactly up to skb_transport_offset(skb) in > > the head and the remainder in a frag. > > > > Avoid reading beyond skb->tail by testing skb_headlen(skb) instead of > > skb->len when trying to access this field. 
> > > > Link: > > http://lkml.kernel.org/r/CAF=yd-lejwzj5a1-baaj2oy_hkmgygv6rsj_woraynv-fna...@mail.gmail.com > > Reported-by: syzbot+9adb4b567003cac78...@syzkaller.appspotmail.com > > Signed-off-by: Willem de Bruijn > > --- > > net/ipv4/ip_sockglue.c | 2 +- > > net/ipv6/datagram.c| 2 +- > > 2 files changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c > > index 64c76dcf7386..5f228d641d9b 100644 > > --- a/net/ipv4/ip_sockglue.c > > +++ b/net/ipv4/ip_sockglue.c > > @@ -152,7 +152,7 @@ static void ip_cmsg_recv_dstaddr(struct msghdr *msg, > > struct sk_buff *skb) > > const struct iphdr *iph = ip_hdr(skb); > > __be16 *ports = (__be16 *)skb_transport_header(skb); > > > > - if (skb_transport_offset(skb) + 4 > (int)skb->len) > > + if (skb_transport_offset(skb) + 4 > (int)skb_headlen(skb)) > > return; > > > > This really should call pskb_may_pull() or skb_header_pointer(). Okay. Since in all common cases these bytes do lie in the skb head, pskb_may_pull is pretty cheap. I'll use that. The linked discussion already outlines a rough patch.
Re: [PATCH net] ip: in cmsg IP(V6)_ORIGDSTADDR do not read beyond headlen
On 07/22/2018 05:43 PM, Willem de Bruijn wrote: > From: Willem de Bruijn > > Syzbot reported a read beyond the end of the skb head when returning > IPV6_ORIGDSTADDR: > > BUG: KMSAN: kernel-infoleak in put_cmsg+0x5ef/0x860 net/core/scm.c:242 > CPU: 0 PID: 4501 Comm: syz-executor128 Not tainted 4.17.0+ #9 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x185/0x1d0 lib/dump_stack.c:113 > kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1125 > kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1219 > kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1261 > copy_to_user include/linux/uaccess.h:184 [inline] > put_cmsg+0x5ef/0x860 net/core/scm.c:242 > ip6_datagram_recv_specific_ctl+0x1cf3/0x1eb0 net/ipv6/datagram.c:719 > ip6_datagram_recv_ctl+0x41c/0x450 net/ipv6/datagram.c:733 > rawv6_recvmsg+0x10fb/0x1460 net/ipv6/raw.c:521 > [..] > > This logic and its ipv4 counterpart read the destination port from > the packet at skb_transport_offset(skb) + 4. > > With MSG_MORE and a local SOCK_RAW sender, syzbot was able to cook a > packet that stores headers exactly up to skb_transport_offset(skb) in > the head and the remainder in a frag. > > Avoid reading beyond skb->tail by testing skb_headlen(skb) instead of > skb->len when trying to access this field. 
> > Link: > http://lkml.kernel.org/r/CAF=yd-lejwzj5a1-baaj2oy_hkmgygv6rsj_woraynv-fna...@mail.gmail.com > Reported-by: syzbot+9adb4b567003cac78...@syzkaller.appspotmail.com > Signed-off-by: Willem de Bruijn > --- > net/ipv4/ip_sockglue.c | 2 +- > net/ipv6/datagram.c| 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c > index 64c76dcf7386..5f228d641d9b 100644 > --- a/net/ipv4/ip_sockglue.c > +++ b/net/ipv4/ip_sockglue.c > @@ -152,7 +152,7 @@ static void ip_cmsg_recv_dstaddr(struct msghdr *msg, > struct sk_buff *skb) > const struct iphdr *iph = ip_hdr(skb); > __be16 *ports = (__be16 *)skb_transport_header(skb); > > - if (skb_transport_offset(skb) + 4 > (int)skb->len) > + if (skb_transport_offset(skb) + 4 > (int)skb_headlen(skb)) > return; > This really should call pskb_may_pull() or skb_header_pointer().
[PATCH net] ip: in cmsg IP(V6)_ORIGDSTADDR do not read beyond headlen
From: Willem de Bruijn

Syzbot reported a read beyond the end of the skb head when returning
IPV6_ORIGDSTADDR:

  BUG: KMSAN: kernel-infoleak in put_cmsg+0x5ef/0x860 net/core/scm.c:242
  CPU: 0 PID: 4501 Comm: syz-executor128 Not tainted 4.17.0+ #9
  Hardware name: Google Google Compute Engine/Google Compute Engine,
  BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x185/0x1d0 lib/dump_stack.c:113
   kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1125
   kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1219
   kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1261
   copy_to_user include/linux/uaccess.h:184 [inline]
   put_cmsg+0x5ef/0x860 net/core/scm.c:242
   ip6_datagram_recv_specific_ctl+0x1cf3/0x1eb0 net/ipv6/datagram.c:719
   ip6_datagram_recv_ctl+0x41c/0x450 net/ipv6/datagram.c:733
   rawv6_recvmsg+0x10fb/0x1460 net/ipv6/raw.c:521
   [..]

This logic and its ipv4 counterpart read the destination port from
the packet at skb_transport_offset(skb) + 4.

With MSG_MORE and a local SOCK_RAW sender, syzbot was able to cook a
packet that stores headers exactly up to skb_transport_offset(skb) in
the head and the remainder in a frag.

Avoid reading beyond skb->tail by testing skb_headlen(skb) instead of
skb->len when trying to access this field.

Link: http://lkml.kernel.org/r/CAF=yd-lejwzj5a1-baaj2oy_hkmgygv6rsj_woraynv-fna...@mail.gmail.com
Reported-by: syzbot+9adb4b567003cac78...@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn
---
 net/ipv4/ip_sockglue.c | 2 +-
 net/ipv6/datagram.c    | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 64c76dcf7386..5f228d641d9b 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -152,7 +152,7 @@ static void ip_cmsg_recv_dstaddr(struct msghdr *msg, struct sk_buff *skb)
 	const struct iphdr *iph = ip_hdr(skb);
 	__be16 *ports = (__be16 *)skb_transport_header(skb);
 
-	if (skb_transport_offset(skb) + 4 > (int)skb->len)
+	if (skb_transport_offset(skb) + 4 > (int)skb_headlen(skb))
 		return;
 
 	/* All current transport protocols have the port numbers in the
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 2ee08b6a86a4..fd43810aacd2 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -702,7 +702,7 @@ void ip6_datagram_recv_specific_ctl(struct sock *sk, struct msghdr *msg,
 		struct sockaddr_in6 sin6;
 		__be16 *ports = (__be16 *)skb_transport_header(skb);
 
-		if (skb_transport_offset(skb) + 4 <= (int)skb->len) {
+		if (skb_transport_offset(skb) + 4 <= (int)skb_headlen(skb)) {
 			/* All current transport protocols have the port numbers in the
 			 * first four bytes of the transport header and this function is
 			 * written with this assumption in mind.
-- 
2.18.0.233.g985f88cf7e-goog
Re: [PATCH] net: dsa: mv88e6xxx: fix races between lock and irq freeing
On Sun, Jul 22, 2018 at 01:04:11PM -0700, David Miller wrote: > From: Uwe Kleine-König > Date: Sun, 22 Jul 2018 21:00:35 +0200 > > > On Sat, Jul 21, 2018 at 10:44:09PM -0700, David Miller wrote: > >> From: Uwe Kleine-König > >> Date: Fri, 20 Jul 2018 11:53:15 +0200 > >> > >> > free_irq() waits until all handlers for this IRQ have completed. As the > >> > relevant handler (mv88e6xxx_g1_irq_thread_fn()) takes the chip's reg_lock > >> > it might never return if the thread calling free_irq() holds this lock. > >> > > >> > For the same reason kthread_cancel_delayed_work_sync() in the polling case > >> > must not hold this lock. > >> > > >> > Also first free the irq (or stop the worker respectively) such that > >> > mv88e6xxx_g1_irq_thread_work() isn't called any more before the irq > >> > mappings are dropped in mv88e6xxx_g1_irq_free_common() to prevent the > >> > worker thread to call handle_nested_irq(0) which results in a NULL-pointer > >> > exception. > >> > > >> > Signed-off-by: Uwe Kleine-König > >> > >> Looks good. > >> > >> Note that the IRQ domain unmapping will do a synchronize_irq() which > >> should cause the same deadlock as free_irq() will with the reg_lock > >> held. > > > > Do you think that there is still a problem? When free_irq() for the > > external visible irq returns the muxed irqs should be all gone, too, so > > this should not trigger, should it? > > It shouldn't be a problem after your changes. > > I'm just saying that I'm surprised that, in the original code, you see > the deadlock in free_irq(), since the synchronize_irq() done by the > IRQ domain code should have happened first. Ah, I see. This didn't happen because I added an msleep to mv88e6xxx_g1_irq_thread_work() before the lock is taken, to widen the race window for a different problem. So the sub-irqs were not active when mv88e6xxx_g1_irq_free() ran, only the mux-irq was.
When irq_dispose_mapping() is called for the sub-irq there is no problem as this results in synchronize_irq() for the sub-irq, not the mux-irq. Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-König| Industrial Linux Solutions | http://www.pengutronix.de/ |
Re: [PATCH] net: dsa: mv88e6xxx: fix races between lock and irq freeing
From: Uwe Kleine-König Date: Sun, 22 Jul 2018 21:00:35 +0200 > On Sat, Jul 21, 2018 at 10:44:09PM -0700, David Miller wrote: >> From: Uwe Kleine-König >> Date: Fri, 20 Jul 2018 11:53:15 +0200 >> >> > free_irq() waits until all handlers for this IRQ have completed. As the >> > relevant handler (mv88e6xxx_g1_irq_thread_fn()) takes the chip's reg_lock >> > it might never return if the thread calling free_irq() holds this lock. >> > >> > For the same reason kthread_cancel_delayed_work_sync() in the polling case >> > must not hold this lock. >> > >> > Also first free the irq (or stop the worker respectively) such that >> > mv88e6xxx_g1_irq_thread_work() isn't called any more before the irq >> > mappings are dropped in mv88e6xxx_g1_irq_free_common() to prevent the >> > worker thread to call handle_nested_irq(0) which results in a NULL-pointer >> > exception. >> > >> > Signed-off-by: Uwe Kleine-König >> >> Looks good. >> >> Note that the IRQ domain unmapping will do a synchronize_irq() which >> should cause the same deadlock as free_irq() will with the reg_lock >> held. > > Do you think that there is still a problem? When free_irq() for the > external visible irq returns the muxed irqs should be all gone, too, so > this should not trigger, should it? It shouldn't be a problem after your changes. I'm just saying that I'm surprised that, in the original code, you see the deadlock in free_irq(), since the synchronize_irq() done by the IRQ domain code should have happened first.
Re: [PATCH] net: dsa: mv88e6xxx: fix races between lock and irq freeing
Hello, On Sat, Jul 21, 2018 at 10:44:09PM -0700, David Miller wrote: > From: Uwe Kleine-König > Date: Fri, 20 Jul 2018 11:53:15 +0200 > > > free_irq() waits until all handlers for this IRQ have completed. As the > > relevant handler (mv88e6xxx_g1_irq_thread_fn()) takes the chip's reg_lock > > it might never return if the thread calling free_irq() holds this lock. > > > > For the same reason kthread_cancel_delayed_work_sync() in the polling case > > must not hold this lock. > > > > Also first free the irq (or stop the worker respectively) such that > > mv88e6xxx_g1_irq_thread_work() isn't called any more before the irq > > mappings are dropped in mv88e6xxx_g1_irq_free_common() to prevent the > > worker thread to call handle_nested_irq(0) which results in a NULL-pointer > > exception. > > > > Signed-off-by: Uwe Kleine-König > > Looks good. > > Note that the IRQ domain unmapping will do a synchronize_irq() which > should cause the same deadlock as free_irq() will with the reg_lock > held. Do you think that there is still a problem? When free_irq() for the external visible irq returns the muxed irqs should be all gone, too, so this should not trigger, should it? > Note also that g2 IRQ freeing gets the ordering right, and doesn't need > a lock because it doesn't program any registers when tearing down its > IRQ. Yes. > Applied and queued up for -stable, thanks. Fine, thanks Uwe -- Pengutronix e.K. | Uwe Kleine-König | Industrial Linux Solutions | http://www.pengutronix.de/ |
Re: [PATCH net-next 1/2] net: phy: add helper phy_polling_mode
I think you can combine these two patches into one. Thank you.
Re: [PATCH net-next 2/2] nfp: avoid buffer leak when FW communication fails
From: Jakub Kicinski Date: Fri, 20 Jul 2018 21:14:39 -0700 > After device is stopped we reset the rings by moving all free buffers > to positions [0, cnt - 2], and clear the position cnt - 1 in the ring. > We then proceed to clear the read/write pointers. This means that if > we try to reset the ring again the code will assume that the next to > fill buffer is at position 0 and swap it with cnt - 1. Since we > previously cleared position cnt - 1 it will lead to leaking the first > buffer and leaving ring in a bad state. > > This scenario can only happen if FW communication fails, in which case > the ring will never be used again, so the fact it's in a bad state will > not be noticed. Buffer leak is the only problem. Don't try to move > buffers in the ring if the read/write pointers indicate the ring was > never used or have already been reset. > > nfp_net_clear_config_and_disable() is now fully idempotent. > > Found by code inspection, FW communication failures are very rare, > and reconfiguring a live device is not common either, so it's unlikely > anyone has ever noticed the leak. > > Signed-off-by: Jakub Kicinski > Reviewed-by: Dirk van der Merwe Applied. > This is arguably net material but IMHO the risk of me missing something > this could break is higher than the error actually occurring, and a > page leak on a FW communication error doesn't seem like it's worth > it at -rc6 time.. I'm happy to respin if I'm wrong! Agreed, net-next is more appropriate for this. Thanks.
Re: [PATCH net-next 1/2] nfp: bring back support for offloading shared blocks
From: Jakub Kicinski Date: Fri, 20 Jul 2018 21:14:38 -0700 > Now that we have offload replay infrastructure added by > commit 326367427cc0 ("net: sched: call reoffload op on block callback reg") > and flows are guaranteed to be removed correctly, we can revert > commit 951a8ee6def3 ("nfp: reject binding to shared blocks"). > > Signed-off-by: Jakub Kicinski > Reviewed-by: John Hurley Applied.
Re: [PATCH net] nfp: flower: ensure dead neighbour entries are not offloaded
From: Jakub Kicinski Date: Fri, 20 Jul 2018 21:07:54 -0700 > From: John Hurley > > Previously only the neighbour state was checked to decide if an offloaded > entry should be removed. However, there can be situations when the entry > is dead but still marked as valid. This can lead to dead entries not > being removed from fw tables or even incorrect data being added. > > Check the entry dead bit before deciding if it should be added to or > removed from fw neighbour tables. > > Fixes: 8e6a9046b66a ("nfp: flower vxlan neighbour offload") > Signed-off-by: John Hurley > Reviewed-by: Jakub Kicinski Patch applied and queued up for -stable, thanks Jakub.
Re: [PATCH net 0/4] vxlan: fix default fdb entry user-space notify ordering/race
From: Roopa Prabhu Date: Fri, 20 Jul 2018 13:21:00 -0700 > From: Roopa Prabhu > > Problem: > In vxlan_newlink, a default fdb entry is added before register_netdev. > The default fdb creation function notifies user-space of the > fdb entry on the vxlan device which user-space does not know about yet. > (RTM_NEWNEIGH goes before RTM_NEWLINK for the same ifindex). > > This series fixes the user-space netlink notification ordering issue > with the following changes: > - decouple fdb notify from fdb create. > - Move fdb notify after register_netdev. > - modify rtnl_configure_link to allow configuring a link early. > - Call rtnl_configure_link in vxlan newlink handler to notify > userspace about the newlink before fdb notify and > hence avoiding the user-space race. > > Fixes: afbd8bae9c79 ("vxlan: add implicit fdb entry for default destination") > Signed-off-by: Roopa Prabhu Series applied and queued up for -stable, thanks Roopa.
Re: [PATCH net] atl1c: reserve min skb headroom
From: Florian Westphal Date: Fri, 20 Jul 2018 19:30:57 +0200 > Got crash report with following backtrace: > BUG: unable to handle kernel paging request at 8801869daffe > RIP: 0010:[] [] > ip6_finish_output2+0x394/0x4c0 > RSP: 0018:880186c83a98 EFLAGS: 00010283 > RAX: 8801869db00e ... > [] ip6_finish_output+0x8c/0xf0 > [] ip6_output+0x57/0x100 > [] ip6_forward+0x4b9/0x840 > [] ip6_rcv_finish+0x66/0xc0 > [] ipv6_rcv+0x319/0x530 > [] netif_receive_skb+0x1c/0x70 > [] atl1c_clean+0x1ec/0x310 [atl1c] > ... > > The bad access is in neigh_hh_output(), at skb->data - 16 (HH_DATA_MOD). > atl1c driver provided skb with no headroom, so 14 bytes (ethernet > header) got pulled, but then 16 are copied. > > Reserve NET_SKB_PAD bytes headroom, like netdev_alloc_skb(). > > Compile tested only; I lack hardware. > > Fixes: 7b7017642199 ("atl1c: Fix misuse of netdev_alloc_skb in refilling rx > ring") > Signed-off-by: Florian Westphal Ancient bug :-/ Applied and queued up for -stable, thanks Florian.
Re: [PATCH net-next] bonding: don't cast const buf in sysfs store
From: Nikolay Aleksandrov Date: Sun, 22 Jul 2018 11:37:31 +0300 > As was recently discussed [1], let's avoid casting the const buf in > bonding_sysfs_store_option and use kstrndup/kfree instead. > > [1] http://lists.openwall.net/netdev/2018/07/22/25 > > Signed-off-by: Nikolay Aleksandrov Applied, thanks Nikolay.
[PATCH v3 bpf-next 7/8] veth: Add XDP TX and REDIRECT
From: Toshiaki Makita

This allows further redirection of xdp_frames like

 NIC -> veth--veth -> veth--veth
 (XDP)       (XDP)        (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP. In this way return_frame is not fully
guarded by NAPI, since another NAPI handler on another cpu may use the
same xdp_mem_info concurrently. Thus disable napi_direct by
XDP_MEM_RF_NO_DIRECT flag.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita
---
 drivers/net/veth.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 57187e955fea..0323a4ca74e2 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -32,6 +32,10 @@
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Separating two types of XDP xmit */
+#define VETH_XDP_TX	BIT(0)
+#define VETH_XDP_REDIR	BIT(1)
+
 struct pcpu_vstats {
 	u64	packets;
 	u64	bytes;
@@ -45,6 +49,7 @@ struct veth_priv {
 	struct bpf_prog		*_xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
+	struct xdp_mem_info	xdp_mem;
 	unsigned		requested_headroom;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
@@ -311,10 +316,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 	return n - drops;
 }
 
+static void veth_xdp_flush(struct net_device *dev)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct net_device *rcv;
+
+	rcu_read_lock();
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		goto out;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+		goto out;
+
+	__veth_xdp_flush(rcv_priv);
+out:
+	rcu_read_unlock();
+}
+
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return veth_xdp_xmit(dev, 1, &frame, 0);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-					struct xdp_frame *frame)
+					struct xdp_frame *frame,
+					unsigned int *xdp_xmit)
 {
 	int len = frame->len, delta = 0;
+	struct xdp_frame orig_frame;
 	struct bpf_prog *xdp_prog;
 	unsigned int headroom;
 	struct sk_buff *skb;
@@ -338,6 +375,31 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 			delta = frame->data - xdp.data;
 			len = xdp.data_end - xdp.data;
 			break;
+		case XDP_TX:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
+			if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+				trace_xdp_exception(priv->dev, xdp_prog, act);
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_xmit |= VETH_XDP_TX;
+			rcu_read_unlock();
+			goto xdp_xmit;
+		case XDP_REDIRECT:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
+			if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_xmit |= VETH_XDP_REDIR;
+			rcu_read_unlock();
+			goto xdp_xmit;
 		default:
 			bpf_warn_invalid_xdp_action(act);
 		case XDP_ABORTED:
@@ -362,12 +424,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 err_xdp:
 	rcu_read_unlock();
 	xdp_return_frame(frame);
-
+xdp_xmit:
 	return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-					struct sk_buff *skb)
+					struct sk_buff *skb,
+					unsigned int *xdp_xmit)
 {
[PATCH v3 bpf-next 4/8] veth: Handle xdp_frames in xdp napi ring
From: Toshiaki Makita

This is preparation for XDP TX and ndo_xdp_xmit. This allows napi
handler to handle xdp_frames through xdp ring as well as sk_buff.

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame fields in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita
---
 drivers/net/veth.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f5b72e937d9d..4be75c58bc6a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -22,12 +22,12 @@
 #include
 #include
 #include
-#include
 #include
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_XDP_FLAG	BIT(0)
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -115,6 +115,24 @@ static const struct ethtool_ops veth_ethtool_ops = {
 
 /* general routines */
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+	return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
+static void veth_ptr_free(void *ptr)
+{
+	if (veth_is_xdp_frame(ptr))
+		xdp_return_frame(veth_ptr_to_xdp(ptr));
+	else
+		kfree_skb(ptr);
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
 	/* Write ptr_ring before reading rx_notify_masked */
@@ -249,6 +267,61 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+					struct xdp_frame *frame)
+{
+	int len = frame->len, delta = 0;
+	struct bpf_prog *xdp_prog;
+	unsigned int headroom;
+	struct sk_buff *skb;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (likely(xdp_prog)) {
+		struct xdp_buff xdp;
+		u32 act;
+
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = frame->data + frame->len;
+		xdp.data_meta = frame->data - frame->metasize;
+		xdp.rxq = &priv->xdp_rxq;
+
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+		switch (act) {
+		case XDP_PASS:
+			delta = frame->data - xdp.data;
+			len = xdp.data_end - xdp.data;
+			break;
+		default:
+			bpf_warn_invalid_xdp_action(act);
+		case XDP_ABORTED:
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+		case XDP_DROP:
+			goto err_xdp;
+		}
+	}
+	rcu_read_unlock();
+
+	headroom = frame->data - delta - (void *)frame;
+	skb = veth_build_skb(frame, headroom, len, 0);
+	if (!skb) {
+		xdp_return_frame(frame);
+		goto err;
+	}
+
+	memset(frame, 0, sizeof(*frame));
+	skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+	return skb;
+err_xdp:
+	rcu_read_unlock();
+	xdp_return_frame(frame);
+
+	return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 					struct sk_buff *skb)
 {
@@ -358,12 +431,16 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 	int i, done = 0;
 
 	for (i = 0; i < budget; i++) {
-		struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+		void *ptr = __ptr_ring_consume(&priv->xdp_ring);
+		struct sk_buff *skb;
 
-		if (!skb)
+		if (!ptr)
 			break;
 
-		skb = veth_xdp_rcv_skb(priv, skb);
+		if (veth_is_xdp_frame(ptr))
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+		else
+			skb = veth_xdp_rcv_skb(priv, ptr);
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
@@ -416,7 +493,7 @@ static void veth_napi_del(struct net_device *dev)
 	napi_disable(&priv->xdp_napi);
 	netif_napi_del(&priv->xdp_napi);
 	priv->rx_notify_masked = false;
-	ptr_ring_cleanup(&priv->xdp_ring, __skb_array_destroy_skb);
+	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
 static int veth_enable_xdp(struct net_device *dev)
-- 
2.14.3
[PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit
From: Toshiaki Makita

This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device. This can be thought
as calling another XDP program by XDP program using REDIRECT, when the
peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita
---
 drivers/net/veth.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 4be75c58bc6a..57187e955fea 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
 	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void veth_ptr_free(void *ptr)
 {
 	if (veth_is_xdp_frame(ptr))
@@ -267,6 +273,44 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, int n,
+			 struct xdp_frame **frames, u32 flags)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct net_device *rcv;
+	int i, drops = 0;
+
+	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+		return -EINVAL;
+
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		return -ENXIO;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (!rcu_access_pointer(rcv_priv->xdp_prog))
+		return -ENXIO;
+
+	spin_lock(&rcv_priv->xdp_ring.producer_lock);
+	for (i = 0; i < n; i++) {
+		struct xdp_frame *frame = frames[i];
+		void *ptr = veth_xdp_to_ptr(frame);
+
+		if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
+			     __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+			xdp_return_frame_rx_napi(frame);
+			drops++;
+		}
+	}
+	spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+
+	if (flags & XDP_XMIT_FLUSH)
+		__veth_xdp_flush(rcv_priv);
+
+	return n - drops;
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 					struct xdp_frame *frame)
 {
@@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
+	.ndo_xdp_xmit		= veth_xdp_xmit,
 };
 
 #define VETH_FEATURES	(NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.14.3
[PATCH v3 bpf-next 0/8] veth: Driver XDP
From: Toshiaki Makita This patch set introduces driver XDP for veth. Basically this is used in conjunction with redirect action of another XDP program. NIC ---> veth===veth (XDP) (redirect)(XDP) In this case xdp_frame can be forwarded to the peer veth without modification, so we can expect far better performance than generic XDP. Envisioned use-cases * Container managed XDP program Container host redirects frames to containers by XDP redirect action, and privileged containers can deploy their own XDP programs. * XDP program cascading Two or more XDP programs can be called for each packet by redirecting xdp frames to veth. * Internal interface for an XDP bridge When using XDP redirection to create a virtual bridge, veth can be used to create an internal interface for the bridge. Implementation -- This changeset is making use of NAPI to implement ndo_xdp_xmit and XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI context. - patch 1: Export a function needed for veth XDP. - patch 2-3: Basic implementation of veth XDP. - patch 4-5: Add ndo_xdp_xmit. - patch 6-7: Add XDP_TX and XDP_REDIRECT. - patch 8: Performance optimization for multi-queue env. Tests and performance numbers - Tested with a simple XDP program which only redirects packets between NIC and veth. I used i40e 25G NIC (XXV710) for the physical NIC. The server has 20 of Xeon Silver 2.20 GHz cores. pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP) The rightmost veth loads XDP progs and just does DROP or TX. The number of packets is measured in the XDP progs. The leftmost pktgen sends packets at 37.1 Mpps (almost 25G wire speed). veth XDP actionFlowsMpps DROP 110.6 DROP 221.2 DROP 10036.0 TX 1 5.0 TX 210.0 TX 10031.0 I also measured netperf TCP_STREAM but was not so great performance due to lack of tx/rx checksum offload and TSO, etc. 
 netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction          Flows    Gbps
--------------------------------
external->veth     1        20.8
external->veth     2        23.5
external->veth     100      23.6
veth->external     1         9.0
veth->external     2        17.8
veth->external     100      22.9

Also tested doing ifup/down or loading/unloading an XDP program
repeatedly while processing XDP packets in order to check that
enabling/disabling NAPI works as expected, and found no problems.

Note: This patchset depends on commit d8d7218ad842 ("xdp: XDP_REDIRECT
should check IFF_UP and MTU") which is currently in DaveM's net-next
tree.

v3:
- Drop the skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).

Signed-off-by: Toshiaki Makita

Toshiaki Makita (8):
  net: Export skb_headers_offset_update
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  veth: Handle xdp_frames in xdp napi ring
  veth: Add ndo_xdp_xmit
  xdp: Add a flag for disabling napi_direct of xdp_return_frame in
    xdp_mem_info
  veth: Add XDP TX and REDIRECT
  veth: Support per queue XDP ring

 drivers/net/veth.c     | 735 -
 include/linux/skbuff.h |   1 +
 include/net/xdp.h      |   4 +
 net/core/skbuff.c      |   3 +-
 net/core/xdp.c         |   6 +-
 5 files changed, 737 insertions(+), 12 deletions(-)

-- 
2.14.3
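For readers unfamiliar with the redirect setup in the diagram above, a minimal XDP program on the physical NIC that steers every frame into the veth pair could look like the sketch below. This is not part of the patch set; the map name and the way userspace fills in the veth ifindex are assumptions for illustration.

```
/* Hypothetical XDP program for the physical NIC: redirect every frame
 * to a veth whose ifindex userspace has stored in a devmap at key 0.
 * Built against libbpf-style headers of that era (bpf_helpers.h). */
#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") tx_port = {
	.type		= BPF_MAP_TYPE_DEVMAP,
	.key_size	= sizeof(int),
	.value_size	= sizeof(int),
	.max_entries	= 1,
};

SEC("xdp")
int xdp_redirect_veth(struct xdp_md *ctx)
{
	/* Key 0 holds the veth ifindex; userspace fills it at setup. */
	return bpf_redirect_map(&tx_port, 0, 0);
}

char _license[] SEC("license") = "GPL";
```

With the rightmost veth of the pair loading a second program that returns XDP_DROP or XDP_TX, this reproduces the benchmark topology.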
[PATCH v3 bpf-next 2/8] veth: Add driver XDP
From: Toshiaki Makita

This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This itself is not so useful, but a starting point to implement other
useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP heavily relies on
NAPI context. Use ptr_ring to emulate the NIC ring. The tx function
enqueues packets to the ring and the peer NAPI handler drains the ring.

Currently only one ring is allocated for each veth device, so it does
not scale in a multiqueue environment. This can be resolved by
allocating rings on a per-queue basis later.

Note that netif_rx is used instead of NAPI when XDP is not loaded, so
this does not change the default behaviour.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.
Signed-off-by: Toshiaki Makita
---
 drivers/net/veth.c | 373 -
 1 file changed, 366 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39ee57e..78fa08cb6e24 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,18 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/ptr_ring.h>
+#include <linux/bpf_trace.h>
+#include <net/xdp.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_RING_SIZE		256
+#define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
 	u64			packets;
 	u64			bytes;
@@ -30,9 +38,16 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+	struct napi_struct	xdp_napi;
+	struct net_device	*dev;
+	struct bpf_prog __rcu	*xdp_prog;
+	struct bpf_prog		*_xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	unsigned		requested_headroom;
+	bool			rx_notify_masked;
+	struct ptr_ring		xdp_ring;
+	struct xdp_rxq_info	xdp_rxq;
 };
 
 /*
@@ -98,11 +113,43 @@ static const struct ethtool_ops veth_ethtool_ops = {
 	.get_link_ksettings	= veth_get_link_ksettings,
 };
 
-static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+/* general routines */
+
+static void __veth_xdp_flush(struct veth_priv *priv)
+{
+	/* Write ptr_ring before reading rx_notify_masked */
+	smp_mb();
+	if (!priv->rx_notify_masked) {
+		priv->rx_notify_masked = true;
+		napi_schedule(&priv->xdp_napi);
+	}
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+	if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+		dev_kfree_skb_any(skb);
+		return NET_RX_DROP;
+	}
+
+	return NET_RX_SUCCESS;
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+
+	return __dev_forward_skb(dev, skb) ?: xdp ?
+		veth_xdp_rx(priv, skb) :
+		netif_rx(skb);
+}
+
+static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
 	int length = skb->len;
+	bool rcv_xdp = false;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -111,7 +158,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
-	if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+	rcv_priv = netdev_priv(rcv);
+	rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+	if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -122,14 +172,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 drop:
 		atomic64_inc(&priv->dropped);
 	}
+
+	if (rcv_xdp)
+		__veth_xdp_flush(rcv_priv);
+
 	rcu_read_unlock();
+
 	return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -179,18 +230,253 @@ static void veth_set_multicast_list(struct net_device *dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+				      int buflen)
+{
+	struct sk_buff *skb;
+
+	if (!buflen) {
+		buflen = SKB_DATA_ALIGN(headroom + len) +
[PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled
From: Toshiaki Makita

All oversized packets including GSO packets are dropped if XDP is
enabled on the receiver side, so don't send such packets from the peer.

Drop TSO and SCTP fragmentation features so that veth devices
themselves segment packets with XDP enabled. Also cap MTU accordingly.

Signed-off-by: Toshiaki Makita
---
 drivers/net/veth.c | 41 +++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 78fa08cb6e24..f5b72e937d9d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -542,6 +542,23 @@ static int veth_get_iflink(const struct net_device *dev)
 	return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+					   netdev_features_t features)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
+
+	peer = rtnl_dereference(priv->peer);
+	if (peer) {
+		struct veth_priv *peer_priv = netdev_priv(peer);
+
+		if (peer_priv->_xdp_prog)
+			features &= ~NETIF_F_GSO_SOFTWARE;
+	}
+
+	return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
 	struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -591,14 +608,33 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 			goto err;
 		}
 	}
+
+		if (!old_prog) {
+			peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+			peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+					peer->hard_header_len -
+					SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+			if (peer->mtu > peer->max_mtu)
+				dev_set_mtu(peer, peer->max_mtu);
+		}
 	}
 
 	if (old_prog) {
-		if (!prog && dev->flags & IFF_UP)
-			veth_disable_xdp(dev);
+		if (!prog) {
+			if (dev->flags & IFF_UP)
+				veth_disable_xdp(dev);
+
+			if (peer) {
+				peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+				peer->max_mtu = ETH_MAX_MTU;
+			}
+		}
 		bpf_prog_put(old_prog);
 	}
 
+	if ((!!old_prog ^ !!prog) && peer)
+		netdev_update_features(peer);
+
 	return 0;
 err:
 	priv->_xdp_prog = old_prog;
@@ -643,6 +679,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_poll_controller	= veth_poll_controller,
 #endif
 	.ndo_get_iflink		= veth_get_iflink,
+	.ndo_fix_features	= veth_fix_features,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
-- 
2.14.3
[PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info
From: Toshiaki Makita

We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.

When veth gets support of XDP_REDIRECT, it will redirect packets which
are redirected from other devices. On redirection veth will reuse the
xdp_mem_info of the redirection source device to make return_frame
work. But in this case the .ndo_xdp_xmit() called from veth redirection
uses an xdp_mem_info which is not guarded by NAPI, because the
.ndo_xdp_xmit() is not called directly from the rxq which owns the
xdp_mem_info.

This approach introduces a flag in xdp_mem_info to indicate that
napi_direct should be disabled even when the _rx_napi variant is used.

Signed-off-by: Toshiaki Makita
---
 include/net/xdp.h | 4 ++++
 net/core/xdp.c    | 6 ++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index fcb033f51d8c..1d1bc6553ff2 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -41,6 +41,9 @@ enum xdp_mem_type {
 	MEM_TYPE_MAX,
 };
 
+/* XDP flags for xdp_mem_info */
+#define XDP_MEM_RF_NO_DIRECT	BIT(0)	/* don't use napi_direct */
+
 /* XDP flags for ndo_xdp_xmit */
 #define XDP_XMIT_FLUSH		(1U << 0)	/* doorbell signal consumer */
 #define XDP_XMIT_FLAGS_MASK	XDP_XMIT_FLUSH
@@ -48,6 +51,7 @@ enum xdp_mem_type {
 struct xdp_mem_info {
 	u32 type; /* enum xdp_mem_type, but known size type */
 	u32 id;
+	u32 flags;
 };
 
 struct page_pool;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 57285383ed00..1426c608fd75 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
 		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
 		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
 		page = virt_to_head_page(data);
-		if (xa)
+		if (xa) {
+			napi_direct &= !(mem->flags & XDP_MEM_RF_NO_DIRECT);
 			page_pool_put_page(xa->page_pool, page, napi_direct);
-		else
+		} else {
 			put_page(page);
+		}
 		rcu_read_unlock();
 		break;
 	case MEM_TYPE_PAGE_SHARED:
-- 
2.14.3
[PATCH v3 bpf-next 1/8] net: Export skb_headers_offset_update
From: Toshiaki Makita

This is needed for veth XDP which does an skb_copy_expand()-like
operation.

v2:
- Drop the skb_copy_header part because it has already been exported
  now.

Signed-off-by: Toshiaki Makita
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c      | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..f6929688853a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1035,6 +1035,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0c1a00672ba9..5366d1660e5b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1291,7 +1291,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
 	/* Only adjust this if it actually is csum_start rather than csum */
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1305,6 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_network_header += off;
 	skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
-- 
2.14.3
[PATCH v3 bpf-next 8/8] veth: Support per queue XDP ring
From: Toshiaki Makita

Move the XDP and napi related fields in veth_priv to a newly created
veth_rq structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, the rxq is
selected by the current cpu. When skbs are enqueued from the peer
device, the rxq is a one-to-one mapping of its peer txq. This way we
have a restriction that the number of rxqs must not be less than the
number of peer txqs, but it leaves the possibility to achieve bulk skb
xmit in the future, because the txq lock would make it possible to
remove the rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita
---
 drivers/net/veth.c | 278 -
 1 file changed, 188 insertions(+), 90 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0323a4ca74e2..84482d9901ec 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -42,20 +42,24 @@ struct pcpu_vstats {
 	struct u64_stats_sync	syncp;
 };
 
-struct veth_priv {
+struct veth_rq {
 	struct napi_struct	xdp_napi;
 	struct net_device	*dev;
 	struct bpf_prog __rcu	*xdp_prog;
-	struct bpf_prog		*_xdp_prog;
-	struct net_device __rcu	*peer;
-	atomic64_t		dropped;
 	struct xdp_mem_info	xdp_mem;
-	unsigned		requested_headroom;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
 	struct xdp_rxq_info	xdp_rxq;
 };
 
+struct veth_priv {
+	struct net_device __rcu	*peer;
+	atomic64_t		dropped;
+	struct bpf_prog		*_xdp_prog;
+	struct veth_rq		*rq;
+	unsigned int		requested_headroom;
+};
+
 /*
  * ethtool interface
  */
@@ -144,19 +148,19 @@ static void veth_ptr_free(void *ptr)
 	kfree_skb(ptr);
 }
 
-static void __veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_rq *rq)
 {
 	/* Write ptr_ring before reading rx_notify_masked */
 	smp_mb();
-	if (!priv->rx_notify_masked) {
-		priv->rx_notify_masked = true;
-		napi_schedule(&priv->xdp_napi);
+	if (!rq->rx_notify_masked) {
+		rq->rx_notify_masked = true;
+		napi_schedule(&rq->xdp_napi);
 	}
 }
 
-static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
-	if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
 		dev_kfree_skb_any(skb);
 		return NET_RX_DROP;
 	}
@@ -164,21 +168,22 @@ static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
 	return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
+			    struct veth_rq *rq, bool xdp)
 {
-	struct veth_priv *priv = netdev_priv(dev);
-
 	return __dev_forward_skb(dev, skb) ?: xdp ?
-		veth_xdp_rx(priv, skb) :
+		veth_xdp_rx(rq, skb) :
 		netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct veth_rq *rq = NULL;
 	struct net_device *rcv;
 	int length = skb->len;
 	bool rcv_xdp = false;
+	int rxq;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -188,9 +193,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	rcv_priv = netdev_priv(rcv);
-	rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+	rxq = skb_get_queue_mapping(skb);
+	if (rxq < rcv->real_num_rx_queues) {
+		rq = &rcv_priv->rq[rxq];
+		rcv_xdp = rcu_access_pointer(rq->xdp_prog);
+		if (rcv_xdp)
+			skb_record_rx_queue(skb, rxq);
+	}
 
-	if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
+	if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -203,7 +214,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	if (rcv_xdp)
-		__veth_xdp_flush(rcv_priv);
+		__veth_xdp_flush(rq);
 
 	rcu_read_unlock();
 
@@ -278,11 +289,17 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static int veth_select_rxq(struct net_device *dev)
+{
+	return smp_processor_id() % dev->real_num_rx_queues;
+}
+
 static int veth_xdp_xmit(struct net_device *dev, int n,
			 struct xdp_frame **frames, u32 flags)
 {
	struct veth_priv *rcv_priv, *priv =
Re: [PATCH net-next 4/4] act_mirred: use ACT_MIRRED when possible
On Sat, 2018-07-21 at 16:29 -0700, David Miller wrote:
> From: Paolo Abeni
> Date: Thu, 19 Jul 2018 15:02:29 +0200
>
> > kernel openswitch datapath.
>       ^^
> "openvswitch"

Oops, typo! Thank you for the feedback! I do hope this is your major
concern ;) I will address it in v3.

Thanks,

Paolo
Security enhancement proposal for kernel TLS
Hi

The kernel based TLS record layer allows the user space world to use a
decoupled TLS implementation. The applications need not be linked with
a TLS stack; the TLS handshake can be done by a TLS daemon on behalf of
the applications. Presently, as soon as the handshake process derives
keys, it pushes the negotiated keys to kernel TLS. Thereafter the
applications can directly read and write data on their TCP socket
(without having to use SSL APIs).

With the current kernel TLS implementation, there is a security
problem. Since the kernel TLS socket does not have information about
the state of the handshake, it allows applications to receive data from
the peer TLS endpoint even when handshake verification has not been
completed by the SSL daemon. It is a security problem if applications
can receive data before verification of the handshake transcript is
completed (done with processing the TLS FINISHED message).

My proposal:

- Kernel TLS should maintain the state of the handshake (verified or
  unverified). In the unverified state, data records should not be
  allowed to pass through to the applications.

- Add a new control interface using which the user space SSL stack can
  tell the TLS socket that the handshake has been verified and DATA
  records can flow. In the 'unverified' state, only control records
  should be allowed to pass, and reception of a DATA record should
  pause the receive-side record decryption.

- The handshake state should fall back to 'unverified' in case a
  control record is seen again by kernel TLS (e.g. in case of
  renegotiation, post-handshake client auth, etc.).

Kindly comment.

Regards
Vakul
[PATCH iproute2-next] ip: Add violation counters to VF statistics
Extend VF statistics by receive and transmit violation counters.

Example:
"ip -s link show dev enp5s0f0"
6: enp5s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:a5:28:f0 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       2
    TX: bytes  packets  errors  dropped carrier collsns
    1406       17       0       0       0       0
    vf 0 MAC 00:00:ca:fe:ca:fe, vlan 5, spoof checking off, link-state auto, trust off, query_rss off
    RX: bytes  packets  mcast   bcast   dropped
    1666       29       14      32      0
    TX: bytes  packets  dropped
    2880       44       2412

Signed-off-by: Eran Ben Elisha
---
 ip/ipaddress.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 5009bfe6d2e3..ddb8bf876e10 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -562,6 +562,9 @@ static void print_vf_stats64(FILE *fp, struct rtattr *vfstats)
 			  rta_getattr_u64(vf[IFLA_VF_STATS_MULTICAST]));
 		print_u64(PRINT_JSON, "broadcast", NULL,
 			  rta_getattr_u64(vf[IFLA_VF_STATS_BROADCAST]));
+		if (vf[IFLA_VF_STATS_RX_DROPPED])
+			print_u64(PRINT_JSON, "dropped", NULL,
+				  rta_getattr_u64(vf[IFLA_VF_STATS_RX_DROPPED]));
 		close_json_object();
 
 		/* TX stats */
@@ -570,26 +573,39 @@ static void print_vf_stats64(FILE *fp, struct rtattr *vfstats)
 			  rta_getattr_u64(vf[IFLA_VF_STATS_TX_BYTES]));
 		print_u64(PRINT_JSON, "tx_packets", NULL,
 			  rta_getattr_u64(vf[IFLA_VF_STATS_TX_PACKETS]));
+		if (vf[IFLA_VF_STATS_TX_DROPPED])
+			print_u64(PRINT_JSON, "dropped", NULL,
+				  rta_getattr_u64(vf[IFLA_VF_STATS_TX_DROPPED]));
 		close_json_object();
 		close_json_object();
 	} else {
 		/* RX stats */
 		fprintf(fp, "%s", _SL_);
-		fprintf(fp, "    RX: bytes  packets  mcast   bcast %s", _SL_);
+		fprintf(fp, "    RX: bytes  packets  mcast   bcast ");
+		if (vf[IFLA_VF_STATS_RX_DROPPED])
+			fprintf(fp, "  dropped ");
+		fprintf(fp, "%s", _SL_);
 		fprintf(fp, "    ");
 		print_num(fp, 10, rta_getattr_u64(vf[IFLA_VF_STATS_RX_BYTES]));
 		print_num(fp, 8, rta_getattr_u64(vf[IFLA_VF_STATS_RX_PACKETS]));
 		print_num(fp, 7, rta_getattr_u64(vf[IFLA_VF_STATS_MULTICAST]));
 		print_num(fp, 7, rta_getattr_u64(vf[IFLA_VF_STATS_BROADCAST]));
+		if (vf[IFLA_VF_STATS_RX_DROPPED])
+			print_num(fp, 8, rta_getattr_u64(vf[IFLA_VF_STATS_RX_DROPPED]));
 
 		/* TX stats */
 		fprintf(fp, "%s", _SL_);
-		fprintf(fp, "    TX: bytes  packets %s", _SL_);
+		fprintf(fp, "    TX: bytes  packets ");
+		if (vf[IFLA_VF_STATS_TX_DROPPED])
+			fprintf(fp, "  dropped ");
+		fprintf(fp, "%s", _SL_);
 		fprintf(fp, "    ");
 		print_num(fp, 10, rta_getattr_u64(vf[IFLA_VF_STATS_TX_BYTES]));
 		print_num(fp, 8, rta_getattr_u64(vf[IFLA_VF_STATS_TX_PACKETS]));
+		if (vf[IFLA_VF_STATS_TX_DROPPED])
+			print_num(fp, 8, rta_getattr_u64(vf[IFLA_VF_STATS_TX_DROPPED]));
 	}
 }
-- 
1.8.3.1
Re: [PATCH 4/4] net: dsa: Add Lantiq / Intel DSA driver for vrx200
On 07/22/2018 05:17 AM, David Miller wrote: > From: Hauke Mehrtens > Date: Sat, 21 Jul 2018 21:13:58 +0200 > >> +// start the table access: > > Please stick to C-style comments, except perhaps in the SPDX > identifiers. > > Thank you. > Sorry that I missed that, it is fixed now. Hauke
[PATCH net-next] bonding: don't cast const buf in sysfs store
As was recently discussed [1], let's avoid casting the const buf in
bonding_sysfs_store_option and use kstrndup/kfree instead.

[1] http://lists.openwall.net/netdev/2018/07/22/25

Signed-off-by: Nikolay Aleksandrov
---
 drivers/net/bonding/bond_sysfs.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 6096440e96ea..35847250da5a 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -160,14 +160,19 @@ static ssize_t bonding_sysfs_store_option(struct device *d,
 {
 	struct bonding *bond = to_bond(d);
 	const struct bond_option *opt;
+	char *buffer_clone;
 	int ret;
 
 	opt = bond_opt_get_by_name(attr->attr.name);
 	if (WARN_ON(!opt))
 		return -ENOENT;
-	ret = bond_opt_tryset_rtnl(bond, opt->id, (char *)buffer);
+	buffer_clone = kstrndup(buffer, count, GFP_KERNEL);
+	if (!buffer_clone)
+		return -ENOMEM;
+	ret = bond_opt_tryset_rtnl(bond, opt->id, buffer_clone);
 	if (!ret)
 		ret = count;
+	kfree(buffer_clone);
 
 	return ret;
 }
-- 
2.11.0
Re: [PATCH net] net: rollback orig value on failure of dev_qdisc_change_tx_queue_len
On 19/07/2018 8:25 PM, Cong Wang wrote:
> On Thu, Jul 19, 2018 at 7:34 AM Tariq Toukan wrote:
>> Fix dev_change_tx_queue_len so it rolls back the original value upon
>> a failure in dev_qdisc_change_tx_queue_len. This is already done for
>> notifiers' failures, share the code. The revert of changes in
>> dev_qdisc_change_tx_queue_len in such a case is needed but missing
>> (marked as TODO), yet it is still better to not apply the new
>> requested value.
>
> You misunderstand the TODO, that is for reverting tx queue len change
> for previous queues in the loop. I still don't have any nice solution
> for this yet.

I understood this, but maybe didn't describe it clearly. I'll
re-phrase.

> Yeah, your change itself looks good. Thanks. Please update the
> changelog.

Thanks.
Re: [PATCH net-next] net: remove redundant input checks in SIOCSIFTXQLEN case of dev_ifsioc
On 19/07/2018 8:21 PM, Cong Wang wrote:
> On Thu, Jul 19, 2018 at 7:50 AM Tariq Toukan wrote:
>> --- a/net/core/dev_ioctl.c
>> +++ b/net/core/dev_ioctl.c
>> @@ -282,14 +282,7 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd)
>>  		return dev_mc_del_global(dev, ifr->ifr_hwaddr.sa_data);
>>
>>  	case SIOCSIFTXQLEN:
>> -		if (ifr->ifr_qlen < 0)
>> -			return -EINVAL;
>
> Are you sure we can remove this if check too? The other one is safe
> to remove.

Hmm, let's see: dev_change_tx_queue_len gets an unsigned long new_len,
so any negative value passed is interpreted as a very large number, and
then we test:

	if (new_len != (unsigned int)new_len)

This test returns true if the range of unsigned long is larger than the
range of unsigned int. AFAIK these ranges are arch-dependent and there
is no guarantee this holds. Right?
Query about tls patch
Hi

I have a query after reading patch https://patchwork.ozlabs.org/patch/943442/
(already merged): "tls: Fix zerocopy_from_iter iov handling".

In tls_sw_sendmsg(), if zerocopy_from_iter() fails, we go to
fallback_to_reg_send. Here we first call iov_iter_revert(), but the
iov_iter_advance() didn't happen in the first place.

Similarly, in zerocopy_from_iter(), if 'revert' is passed as 'true',
iov_iter_revert() is called even if iov_iter_advance() never happened
(the error-return case).

In the tls_sw_recvmsg() flow, iov_iter_revert() is always triggered.
With the revert happening, I am missing how the case works where
multiple records are decrypted into the user space buffer with one
invocation of recvmsg(). It seems that successively dequeued TLS
records would get decrypted at the same location in the output buffer
and keep overwriting the previous one.

Regards
Vakul
Re: [PATCH net] multicast: do not restore deleted record source filter mode to new one
From: Hangbin Liu Date: Fri, 20 Jul 2018 14:04:27 +0800 > There are two scenarios that we will restore deleted records. The first is > when device down and up(or unmap/remap). In this scenario the new filter > mode is same with previous one. Because we get it from in_dev->mc_list and > we do not touch it during device down and up. > > The other scenario is when a new socket join a group which was just delete > and not finish sending status reports. In this scenario, we should use the > current filter mode instead of restore old one. Here are 4 cases in total. > > old_socketnew_socket before_fix after_fix > IN(A) IN(A) ALLOW(A) ALLOW(A) > IN(A) EX( ) TO_IN( ) TO_EX( ) > EX( ) IN(A) TO_EX( ) ALLOW(A) > EX( ) EX( ) TO_EX( ) TO_EX( ) > > Fixes: 24803f38a5c0b (igmp: do not remove igmp souce list info when set link > down) > Fixes: 1666d49e1d416 (mld: do not remove mld souce list info when set link > down) > Signed-off-by: Hangbin Liu Applied and queued up for -stable, thank you.
Re: [linux-sunxi] [PATCH 4/5] arm64: allwinner: h6: add EMAC device nodes
On Sun, Jul 22, 2018 at 01:39:54PM +0800, Icenowy Zheng wrote: > Allwinner H6 SoC has an EMAC like the one in A64. > > Add device tree nodes for the H6 DTSI file. > > Signed-off-by: Icenowy Zheng > --- > arch/arm64/boot/dts/allwinner/sun50i-h6.dtsi | 30 > 1 file changed, 30 insertions(+) > > diff --git a/arch/arm64/boot/dts/allwinner/sun50i-h6.dtsi > b/arch/arm64/boot/dts/allwinner/sun50i-h6.dtsi > index 3ab6cf0256ca..c65311de301a 100644 > --- a/arch/arm64/boot/dts/allwinner/sun50i-h6.dtsi > +++ b/arch/arm64/boot/dts/allwinner/sun50i-h6.dtsi > @@ -149,6 +149,14 @@ > interrupt-controller; > #interrupt-cells = <3>; > > + ext_rgmii_pins: rgmii_pins { > + pins = "PD0", "PD1", "PD2", "PD3", "PD4", > +"PD5", "PD7", "PD8", "PD9", "PD10", > +"PD11", "PD12", "PD13", "PD19", "PD20"; > + function = "emac"; > + drive-strength = <40>; > + }; > + > mmc0_pins: mmc0-pins { > pins = "PF0", "PF1", "PF2", "PF3", > "PF4", "PF5"; > @@ -258,6 +266,28 @@ > status = "disabled"; > }; > > + emac: ethernet@502 { > + compatible = "allwinner,sun50i-a64-emac", > + "allwinner,sun50i-h6-emac"; > + syscon = <>; > + reg = <0x0502 0x1>; > + interrupts = ; > + interrupt-names = "macirq"; > + resets = < RST_BUS_EMAC>; > + reset-names = "stmmaceth"; > + clocks = < CLK_BUS_EMAC>; > + clock-names = "stmmaceth"; > + status = "disabled"; > + #address-cells = <1>; > + #size-cells = <0>; #address-cells and #size-cells is unnecessary in emac node. Regards