Re: [PATCH bpf-next 0/6] Few x64 jit improvements to shrink image size
On Sat, Feb 24, 2018 at 01:07:57AM +0100, Daniel Borkmann wrote:
> Couple of minor improvements to the x64 JIT I had still around from
> pre merge window in order to shrink the image size further. Added
> test cases for kselftests too as well as running Cilium workloads on
> them w/o issues.

Applied to bpf-next, thanks Daniel.
tcp_bind_bucket is missing from slabinfo
Somewhere back around 3.17 the kmem cache "tcp_bind_bucket" dropped out of /proc/slabinfo. It turns out the ss command was dumpster diving in slabinfo to determine the number of bound sockets, and now it always reports 0. Not sure why; the cache is still created, but it doesn't show up in slabinfo. Could it be some part of making the slab/slub code common, or network namespaces? The cache is created in tcp_init but is not visible. Any ideas?
RE: [Intel-wired-lan] [PATCH net-queue] e1000e: Fix check_for_link return value with autoneg off.
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Benjamin Poirier
> Sent: Monday, February 19, 2018 10:12 PM
> To: Kirsher, Jeffrey T
> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; linux-
> ker...@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH net-queue] e1000e: Fix check_for_link
> return value with autoneg off.
>
> When autoneg is off, the .check_for_link callback functions clear the
> get_link_status flag and systematically return a "pseudo-error". This means
> that the link is not detected as up until the next execution of the
> e1000_watchdog_task() 2 seconds later.
>
> Fixes: 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
> Signed-off-by: Benjamin Poirier
> ---
>  drivers/net/ethernet/intel/e1000e/ich8lan.c | 2 +-
>  drivers/net/ethernet/intel/e1000e/mac.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)

Tested-by: Aaron Brown
GIT: net has been merged into net-next
Just FYI...
Re: [next-queue PATCH 7/8] igb: Add support for adding offloaded clsflower filters
On February 23, 2018 5:20:35 PM PST, Vinicius Costa Gomes wrote:
>This allows filters added by tc-flower and specifying MAC addresses,
>Ethernet types, and the VLAN priority field, to be offloaded to the
>controller.
>
>This reuses most of the infrastructure used by ethtool, ethtool can be
>used to read these filters, but modification and deletion can only be
>done via tc-flower.

You would want to check what other drivers supporting both ethtool::rxnfc and cls_flower do, but this can be seriously confusing to a user. As a user I would be more comfortable seeing only rules added through ethtool via ethtool, and those added by cls_flower via cls_flower. They will both access a shared set of resources, but it seems easier for me to dump rules with both tools to figure out why -ENOSPC was returned, rather than to see something I did not add. Others might see it entirely differently.

If you added the ability for cls_flower to indicate a queue number and either a fixed rule index or auto-placement (RX_CLS_LOC_ANY), could that entirely eliminate the need for adding MAC address steering in the earlier patches?

--
Florian
Re: [next-queue PATCH 5/8] igb: Add support for ethtool MAC address filters
On February 23, 2018 5:20:33 PM PST, Vinicius Costa Gomes wrote:
>This adds the capability of configuring the queue steering of arriving
>packets based on their source and destination MAC addresses.
>
>In practical terms this adds support for the following use cases,
>characterized by these examples:
>
>$ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
>(this will direct packets with destination address "aa:aa:aa:aa:aa:aa"
>to the RX queue 0)
>
>$ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 action 3
>(this will direct packets with source address "44:44:44:44:44:44"
>to the RX queue 3)
>
>Signed-off-by: Vinicius Costa Gomes
>---

[snip]

>diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c
>b/drivers/net/ethernet/intel/igb/igb_ethtool.c
>index 143f0bb34e4d..d8686a0f5b5d 100644
>--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
>+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
>@@ -152,6 +152,9 @@ static const char
>igb_priv_flags_strings[][ETH_GSTRING_LEN] = {
>
> #define IGB_PRIV_FLAGS_STR_LEN ARRAY_SIZE(igb_priv_flags_strings)
>
>+static const u8 broadcast_addr[ETH_ALEN] = {
>+	0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

This is already defined in an existing header; I don't have it handy, but it is likely etherdevice.h.

--
Florian
Re: [next-queue PATCH 8/8] igb: Add support for removing offloaded tc-flower filters
On February 23, 2018 5:20:36 PM PST, Vinicius Costa Gomes wrote:
>This allows tc-flower filters that were offloaded to be removed.

This should be squashed into your previous patch: either the functionality is there and you can add/remove, or it is not.

--
Florian
[PATCH V4 net 1/3] Revert "tuntap: add missing xdp flush"
This reverts commit 762c330d670e3d4b795cf7a8d761866fdd1eef49. The reason: we try to batch packets for devmap, which causes xdp_do_flush() to be called in process context. Simply disabling preemption may not work, since the process may migrate between processors, which leads xdp_do_flush() to miss flushes on some processors.

So simply revert the patch; a follow-up patch will add the xdp flush correctly.

Reported-by: Christoffer Dall
Fixes: 762c330d670e ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang
---
 drivers/net/tun.c | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b52258c..2823a4a 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -181,7 +181,6 @@ struct tun_file {
 	struct tun_struct *detached;
 	struct ptr_ring tx_ring;
 	struct xdp_rxq_info xdp_rxq;
-	int xdp_pending_pkts;
 };
 
 struct tun_flow_entry {
@@ -1662,7 +1661,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	case XDP_REDIRECT:
 		get_page(alloc_frag->page);
 		alloc_frag->offset += buflen;
-		++tfile->xdp_pending_pkts;
 		err = xdp_do_redirect(tun->dev, &xdp, xdp_prog);
 		if (err)
 			goto err_redirect;
@@ -1984,11 +1982,6 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	result = tun_get_user(tun, tfile, NULL, from,
 			      file->f_flags & O_NONBLOCK, false);
 
-	if (tfile->xdp_pending_pkts) {
-		tfile->xdp_pending_pkts = 0;
-		xdp_do_flush_map();
-	}
-
 	tun_put(tun);
 	return result;
 }
@@ -2325,13 +2318,6 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 	ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
 			   m->msg_flags & MSG_DONTWAIT,
 			   m->msg_flags & MSG_MORE);
-
-	if (tfile->xdp_pending_pkts >= NAPI_POLL_WEIGHT ||
-	    !(m->msg_flags & MSG_MORE)) {
-		tfile->xdp_pending_pkts = 0;
-		xdp_do_flush_map();
-	}
-
 	tun_put(tun);
 	return ret;
 }
@@ -3163,7 +3149,6 @@ static int tun_chr_open(struct inode *inode, struct file * file)
 	sock_set_flag(&tfile->sk, SOCK_ZEROCOPY);
 
 	memset(&tfile->tx_ring, 0, sizeof(tfile->tx_ring));
-	tfile->xdp_pending_pkts = 0;
 
 	return 0;
 }
-- 
2.7.4
[PATCH V4 net 3/3] tuntap: correctly add the missing XDP flush
We don't flush batched XDP packets through xdp_do_flush_map(), which causes packets to stall at the TX queue. Since we don't do XDP in NAPI poll(), the only possible fix is to call xdp_do_flush_map() immediately after xdp_do_redirect(). Note that this in fact won't batch packets through devmap; we could address that in the future.

Reported-by: Christoffer Dall
Fixes: 761876c857cb ("tap: XDP support")
Signed-off-by: Jason Wang
---
 drivers/net/tun.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 63d39fe6..7433bb2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1663,6 +1663,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		get_page(alloc_frag->page);
 		alloc_frag->offset += buflen;
 		err = xdp_do_redirect(tun->dev, &xdp, xdp_prog);
+		xdp_do_flush_map();
 		if (err)
 			goto err_redirect;
 		rcu_read_unlock();
-- 
2.7.4
[PATCH V4 net 2/3] tuntap: disable preemption during XDP processing
Except for tuntap, all other drivers implement XDP in the NAPI poll() routine, in a bh. This guarantees that all XDP operations are done on the same CPU, which is required by e.g. BPF_MAP_TYPE_PERCPU_ARRAY. But tuntap does it in process context and tries to protect XDP processing with the RCU reader lock. This is insufficient, since CONFIG_PREEMPT_RCU can preempt the RCU reader critical section, which breaks the assumption that all XDP processing happens on the same CPU.

Fix this by simply disabling preemption during XDP processing.

Fixes: 761876c857cb ("tap: XDP support")
Signed-off-by: Jason Wang
---
 drivers/net/tun.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2823a4a..63d39fe6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1642,6 +1642,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	else
 		*skb_xdp = 0;
 
+	preempt_disable();
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(tun->xdp_prog);
 	if (xdp_prog && !*skb_xdp) {
@@ -1665,6 +1666,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 			if (err)
 				goto err_redirect;
 			rcu_read_unlock();
+			preempt_enable();
 			return NULL;
 		case XDP_TX:
 			xdp_xmit = true;
@@ -1686,6 +1688,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	skb = build_skb(buf, buflen);
 	if (!skb) {
 		rcu_read_unlock();
+		preempt_enable();
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -1698,10 +1701,12 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		skb->dev = tun->dev;
 		generic_xdp_tx(skb, xdp_prog);
 		rcu_read_unlock();
+		preempt_enable();
 		return NULL;
 	}
 
 	rcu_read_unlock();
+	preempt_enable();
 
 	return skb;
 
@@ -1709,6 +1714,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	put_page(alloc_frag->page);
 err_xdp:
 	rcu_read_unlock();
+	preempt_enable();
 	this_cpu_inc(tun->pcpu_stats->rx_dropped);
 	return NULL;
 }
-- 
2.7.4
Re: Fwd: Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest
On Sat, Feb 24, 2018 at 01:49:35AM +, Chris Mi wrote:
> To verify this patch, the following is a sanity test case:
>
> # tc qdisc delete dev $link ingress > /dev/null 2>&1;
> # tc qdisc add dev $link ingress;
> # tc filter add dev $link prio 1 protocol ip handle 0x8001 parent :
>   flower skip_hw src_mac e4:11:0:0:0:2 dst_mac e4:12:0:0:0:2 action drop;
> # tc filter show dev $link parent :
>
> filter pref 1 flower chain 0
> filter pref 1 flower chain 0 handle 0x8001

I added these tests to my local tree for now.

diff --git a/tools/testing/radix-tree/idr-test.c b/tools/testing/radix-tree/idr-test.c
index 44ef9eba5a7a..28d99325a32d 100644
--- a/tools/testing/radix-tree/idr-test.c
+++ b/tools/testing/radix-tree/idr-test.c
@@ -178,6 +178,29 @@ void idr_get_next_test(int base)
 	idr_destroy(&idr);
 }
 
+void idr_u32_test(struct idr *idr, int base)
+{
+	assert(idr_is_empty(idr));
+	idr_init_base(idr, base);
+	u32 handle = 10;
+	idr_alloc_u32(idr, NULL, &handle, handle, GFP_KERNEL);
+	BUG_ON(handle != 10);
+	idr_remove(idr, handle);
+	assert(idr_is_empty(idr));
+
+	handle = 0x8001;
+	idr_alloc_u32(idr, NULL, &handle, handle, GFP_KERNEL);
+	BUG_ON(handle != 0x8001);
+	idr_remove(idr, handle);
+	assert(idr_is_empty(idr));
+
+	handle = 0xffe0;
+	idr_alloc_u32(idr, NULL, &handle, handle, GFP_KERNEL);
+	BUG_ON(handle != 0xffe0);
+	idr_remove(idr, handle);
+	assert(idr_is_empty(idr));
+}
+
 void idr_checks(void)
 {
 	unsigned long i;
@@ -248,6 +271,9 @@ void idr_checks(void)
 	idr_get_next_test(0);
 	idr_get_next_test(1);
 	idr_get_next_test(4);
+	idr_u32_test(&idr, 0);
+	idr_u32_test(&idr, 1);
+	idr_u32_test(&idr, 4);
 }
 
 /*
Re: [PATCH v2 net-next 1/2] lan743x: Add main source files for new lan743x driver
> Ok, but it seems to me that what I have is an example of "specific book keeping
> private information". Can you clarify the style you prefer?
>
> In cases of allocation where I can just compare a pointer to null, I can easily remove
> the flags. But in other cases I need a record of which steps completed in order to
> clean up properly. In cases where I need some sort of a flag, would you prefer
> I avoid a bit mask and have a standalone variable for each flag?

Hi Bryan

Often you know something has been done because, if it had not been done, you would have bombed out with an error. In the release function you can assume everything done in probe has been done, otherwise the probe would not have been successful. In close, you can assume everything done in open was successful, otherwise the open would have failed.

So probe does not need any flags, and open does not need any flags.

Andrew
RE: Fwd: Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest
> -Original Message-
> From: Matthew Wilcox [mailto:wi...@infradead.org]
> Sent: Saturday, February 24, 2018 9:15 AM
> To: Cong Wang ; Khalid Aziz ; linux-ker...@vger.kernel.org;
> netdev@vger.kernel.org
> Cc: Chris Mi
> Subject: Re: Fwd: Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running
> selftest
>
> On Fri, Feb 23, 2018 Randy Dunlap wrote:
> > [add Matthew Wilcox; hopefully he can look/see]
>
> Thanks, Randy. I don't understand why nobody else thought to cc the author
> of the patch that it was bisected to ...
>
> > On 02/23/2018 04:13 PM, Cong Wang wrote:
> > > On Fri, Feb 23, 2018 at 3:27 PM, Cong Wang wrote:
> > >> On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap wrote:
> > >>> On 02/23/2018 08:05 AM, Khalid Aziz wrote:
> > Same selftest does not cause panic on 4.15. git bisect pointed to
> > commit 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based
> > IDRs more efficient").
> > Kernel config is attached.
> > >>
> > >> Looks like something horribly wrong with u32 key id idr...
> > >
> > > Adding a few printk's, I got:
> > >
> > > [ 31.231560] requested handle = ffe0
> > > [ 31.232426] allocated handle = 0
> > > ...
> > > [ 31.246475] requested handle = ffd0
> > > [ 31.247555] allocated handle = 1
> > >
> > > So the bug is here where we can't allocate a specific handle:
> > >
> > > err = idr_alloc_u32(&tp_c->handle_idr, ht, &handle,
> > >                     handle, GFP_KERNEL);
> > > if (err) {
> > >         kfree(ht);
> > >         return err;
> > > }
>
> Please try this patch. It fixes ffe0, but there may be more things tested
> that it may not work for.
>
> Chris Mi, what happened to that set of testcases you promised to write for
> me?

I promised to write it after the API is stabilized, since you were going to change it. I will inform the management about this new task and get back to you later.

> diff --git a/lib/idr.c b/lib/idr.c
> index c98d77fcf393..10d9b8d47c33 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -36,8 +36,8 @@ int idr_alloc_u32(struct idr *idr, void *ptr, u32 *nextid,
>  {
>  	struct radix_tree_iter iter;
>  	void __rcu **slot;
> -	int base = idr->idr_base;
> -	int id = *nextid;
> +	unsigned int base = idr->idr_base;
> +	unsigned int id = *nextid;
>
>  	if (WARN_ON_ONCE(radix_tree_is_internal_node(ptr)))
>  		return -EINVAL;

To verify this patch, the following is a sanity test case:

# tc qdisc delete dev $link ingress > /dev/null 2>&1;
# tc qdisc add dev $link ingress;
# tc filter add dev $link prio 1 protocol ip handle 0x8001 parent : flower skip_hw src_mac e4:11:0:0:0:2 dst_mac e4:12:0:0:0:2 action drop;
# tc filter show dev $link parent :

filter pref 1 flower chain 0
filter pref 1 flower chain 0 handle 0x8001
  dst_mac e4:12:00:00:00:02
  src_mac e4:11:00:00:00:02
  eth_type ipv4
  skip_hw
  not_in_hw
	action order 1: gact action drop
	 random type none pass val 0
	 index 1 ref 1 bind 1

Please make sure the handle is the same as the user specifies.
Re: [PATCH net-next] r8169: improve interrupt handling
Heiner Kallweit :
[...]
> Last but not least it enables a feature which was (I presume accidentally)
> disabled before. There are members of the RTL8169 family supporting MSI
> (e.g. RTL8169SB), however MSI never got enabled because RTL_CFG_0 was
> missing flag RTL_FEATURE_MSI.
> An indicator for "accidently" is that statement "cfg2 |= MSIEnable;"

The reality is simpler: it could have been removed.

> in rtl_try_msi() is dead code. cfg2 is used for chip versions <= 06
> only and for all these chip versions RTL_FEATURE_MSI isn't set.

On purpose:
1. mostly untested
2. MSI without MSI-X does not buy much
3. wrt 2., ok, it kills (yucky) plain old shared PCI irq (remember those?) but I didn't end up feeling good enough about realtek MSI support on older chipsets to enable it any further

[...]
> diff --git a/drivers/net/ethernet/realtek/r8169.c
> b/drivers/net/ethernet/realtek/r8169.c
> index 96db3283e..4730db990 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
[...]
> -static unsigned rtl_try_msi(struct rtl8169_private *tp,
> -			    const struct rtl_cfg_info *cfg)
> +static void rtl_alloc_irq(struct rtl8169_private *tp)
> {
[...]
> +	ret = pci_alloc_irq_vectors(tp->pci_dev, 1, 1, PCI_IRQ_ALL_TYPES);
> +	if (ret < 0) {
> +		netif_err(tp, drv, tp->dev, "failed to allocate irq!\n");
> +		return;
[...]
> @@ -8497,9 +8495,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> 	chipset = tp->mac_version;
> 	tp->txd_version = rtl_chip_infos[chipset].txd_version;
>
> -	RTL_W8(Cfg9346, Cfg9346_Unlock);
> -	tp->features |= rtl_try_msi(tp, cfg);
> -	RTL_W8(Cfg9346, Cfg9346_Lock);
> +	rtl_alloc_irq(tp);

Happily proceeding after error. :o/

--
Ueimor
[next-queue PATCH 3/8] igb: Enable the hardware traffic class feature bit for igb models
This will allow functionality that depends on the hardware being traffic class aware to work. In particular, the tc-flower offloading check verifies that this bit is set.

Signed-off-by: Vinicius Costa Gomes
---
 drivers/net/ethernet/intel/igb/igb_main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 0ea32be07d71..543aa99892eb 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2820,8 +2820,10 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 			       NETIF_F_HW_VLAN_CTAG_TX |
 			       NETIF_F_RXALL;
 
-	if (hw->mac.type >= e1000_i350)
-		netdev->hw_features |= NETIF_F_NTUPLE;
+	if (hw->mac.type >= e1000_i350) {
+		netdev->hw_features |= (NETIF_F_NTUPLE | NETIF_F_HW_TC);
+		netdev->features |= NETIF_F_HW_TC;
+	}
 
 	if (pci_using_dac)
 		netdev->features |= NETIF_F_HIGHDMA;
-- 
2.16.2
[next-queue PATCH 2/8] igb: Fix queue selection on MAC filters on i210 and i211
On the RAH registers there are semantic differences in the meaning of the "queue" parameter for traffic steering, depending on the controller model: there is the 82575 meaning, in which "queue" means an RX hardware queue, and the i350 meaning, in which it is a reception pool.

The previous behaviour had no effect for i210 and i211 based controllers, because the QSEL bit of the RAH register wasn't being set.

This patch separates the condition into discrete cases, so the different handling is clearer.

Fixes: 83c21335c876 ("igb: improve MAC filter handling")
Signed-off-by: Vinicius Costa Gomes
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c      | 15 +++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 83cabff1e0ab..573bf177fd08 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -490,6 +490,7 @@
  * manageability enabled, allowing us room for 15 multicast addresses.
  */
 #define E1000_RAH_AV		0x8000	/* Receive descriptor valid */
+#define E1000_RAH_QSEL_ENABLE	0x1000
 #define E1000_RAL_MAC_ADDR_LEN	4
 #define E1000_RAH_MAC_ADDR_LEN	2
 #define E1000_RAH_POOL_MASK	0x03FC
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae785369..0ea32be07d71 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8741,12 +8741,19 @@ static void igb_rar_set_index(struct igb_adapter *adapter, u32 index)
 		if (is_valid_ether_addr(addr))
 			rar_high |= E1000_RAH_AV;
 
-		if (hw->mac.type == e1000_82575)
+		switch (hw->mac.type) {
+		case e1000_82575:
+		case e1000_i210:
+		case e1000_i211:
+			rar_high |= E1000_RAH_QSEL_ENABLE;
 			rar_high |= E1000_RAH_POOL_1 *
-				    adapter->mac_table[index].queue;
-		else
+				    adapter->mac_table[index].queue;
+			break;
+		default:
 			rar_high |= E1000_RAH_POOL_1 <<
-				    adapter->mac_table[index].queue;
+				    adapter->mac_table[index].queue;
+			break;
+		}
 	}
 
 	wr32(E1000_RAL(index), rar_low);
-- 
2.16.2
[next-queue PATCH 8/8] igb: Add support for removing offloaded tc-flower filters
This allows tc-flower filters that were offloaded to be removed.

Signed-off-by: Vinicius Costa Gomes
---
 drivers/net/ethernet/intel/igb/igb_main.c | 32 ++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b1d401e77d62..5e0e1df0e941 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2641,10 +2641,40 @@ static int igb_configure_clsflower(struct igb_adapter *adapter,
 	return err;
 }
 
+static int igb_delete_filter_by_cookie(struct igb_adapter *adapter,
+				       unsigned long cookie)
+{
+	struct igb_nfc_filter *filter;
+	int err;
+
+	spin_lock(&adapter->nfc_lock);
+
+	hlist_for_each_entry(filter, &adapter->nfc_filter_list, nfc_node) {
+		if (filter->cookie == cookie)
+			break;
+	}
+
+	if (!filter) {
+		err = -ENOENT;
+		goto out;
+	}
+
+	err = igb_erase_filter(adapter, filter);
+
+	hlist_del(&filter->nfc_node);
+	kfree(filter);
+	adapter->nfc_filter_count--;
+
+out:
+	spin_unlock(&adapter->nfc_lock);
+
+	return err;
+}
+
 static int igb_delete_clsflower(struct igb_adapter *adapter,
 				struct tc_cls_flower_offload *cls_flower)
 {
-	return -EOPNOTSUPP;
+	return igb_delete_filter_by_cookie(adapter, cls_flower->cookie);
 }
 
 static int igb_setup_tc_cls_flower(struct igb_adapter *adapter,
-- 
2.16.2
[PATCH V2 net-next 0/3] RDS: optimized notification for zerocopy completion
RDS applications predominantly use request-response, transaction-based IPC, so ingress and egress traffic are well balanced, and it is possible/desirable to reduce system-call overhead by piggybacking the notifications for zerocopy completion on the response data. Moreover, it has been pointed out that socket functions block if sk_err is non-zero; thus, if the RDS code does not plan/need to use the sk_error_queue path for completion notification, it is preferable to remove the sk_error_queue related paths in RDS. Both of these goals are implemented in this series.

Sowmini Varadhan (3):
  selftests/net: revert the zerocopy Rx path for PF_RDS
  rds: deliver zerocopy completion notification with data
  selftests/net: reap zerocopy completions passed up as ancillary data.

 include/uapi/linux/errqueue.h              |   2 -
 include/uapi/linux/rds.h                   |   7 ++
 net/rds/af_rds.c                           |   7 ++-
 net/rds/message.c                          |  35 --
 net/rds/rds.h                              |   2 +
 net/rds/recv.c                             |  34 +-
 tools/testing/selftests/net/msg_zerocopy.c | 109 +++-
 7 files changed, 103 insertions(+), 93 deletions(-)
[next-queue PATCH 5/8] igb: Add support for ethtool MAC address filters
This adds the capability of configuring the queue steering of arriving packets based on their source and destination MAC addresses.

In practical terms this adds support for the following use cases, characterized by these examples:

$ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
(this will direct packets with destination address "aa:aa:aa:aa:aa:aa" to the RX queue 0)

$ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 action 3
(this will direct packets with source address "44:44:44:44:44:44" to the RX queue 3)

Signed-off-by: Vinicius Costa Gomes
---
 drivers/net/ethernet/intel/igb/igb.h         |   9 +++
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 110 ++-
 drivers/net/ethernet/intel/igb/igb_main.c    |   8 +-
 3 files changed, 119 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index d5cd5f6708d9..e06d6fdcb2ce 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -440,6 +440,8 @@ struct hwmon_buff {
 enum igb_filter_match_flags {
 	IGB_FILTER_FLAG_ETHER_TYPE = 0x1,
 	IGB_FILTER_FLAG_VLAN_TCI = 0x2,
+	IGB_FILTER_FLAG_SRC_MAC_ADDR = 0x4,
+	IGB_FILTER_FLAG_DST_MAC_ADDR = 0x8,
 };
 
 #define IGB_MAX_RXNFC_FILTERS 16
@@ -454,6 +456,8 @@ struct igb_nfc_input {
 	u8 match_flags;
 	__be16 etype;
 	__be16 vlan_tci;
+	u8 src_addr[ETH_ALEN];
+	u8 dst_addr[ETH_ALEN];
 };
 
 struct igb_nfc_filter {
@@ -738,4 +742,9 @@ int igb_add_filter(struct igb_adapter *adapter,
 int igb_erase_filter(struct igb_adapter *adapter,
 		     struct igb_nfc_filter *input);
 
+int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
+		       const u8 queue, const u8 flags);
+int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
+		       const u8 queue, const u8 flags);
+
 #endif /* _IGB_H_ */
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 143f0bb34e4d..d8686a0f5b5d 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -152,6 +152,9 @@ static const char igb_priv_flags_strings[][ETH_GSTRING_LEN] = {
 
 #define IGB_PRIV_FLAGS_STR_LEN ARRAY_SIZE(igb_priv_flags_strings)
 
+static const u8 broadcast_addr[ETH_ALEN] = {
+	0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
+
 static int igb_get_link_ksettings(struct net_device *netdev,
 				  struct ethtool_link_ksettings *cmd)
 {
@@ -2494,6 +2497,25 @@ static int igb_get_ethtool_nfc_entry(struct igb_adapter *adapter,
 			fsp->h_ext.vlan_tci = rule->filter.vlan_tci;
 			fsp->m_ext.vlan_tci = htons(VLAN_PRIO_MASK);
 		}
+		if (rule->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR) {
+			ether_addr_copy(fsp->h_u.ether_spec.h_dest,
+					rule->filter.dst_addr);
+			/* As we only support matching by the full
+			 * mask, return the mask to userspace
+			 */
+			ether_addr_copy(fsp->m_u.ether_spec.h_dest,
+					broadcast_addr);
+		}
+		if (rule->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR) {
+			ether_addr_copy(fsp->h_u.ether_spec.h_source,
+					rule->filter.src_addr);
+			/* As we only support matching by the full
+			 * mask, return the mask to userspace
+			 */
+			ether_addr_copy(fsp->m_u.ether_spec.h_source,
+					broadcast_addr);
+		}
+
 		return 0;
 	}
 	return -EINVAL;
@@ -2698,6 +2720,58 @@ static int igb_set_rss_hash_opt(struct igb_adapter *adapter,
 	return 0;
 }
 
+static int igb_rxnfc_write_dst_mac_filter(struct igb_adapter *adapter,
+					  struct igb_nfc_filter *input)
+{
+	int err;
+
+	err = igb_add_mac_filter(adapter, input->filter.dst_addr,
+				 input->action, 0);
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static int igb_rxnfc_write_src_mac_filter(struct igb_adapter *adapter,
+					  struct igb_nfc_filter *input)
+{
+	int err;
+
+	err = igb_add_mac_filter(adapter, input->filter.src_addr,
+				 input->action, IGB_MAC_STATE_SRC_ADDR);
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static int igb_rxnfc_del_dst_mac_filter(struct igb_adapter *adapter,
+					struct igb_nfc_filter *input)
+{
+	int err;
+
+	err = igb_del_mac_filter(adapt
[next-queue PATCH 7/8] igb: Add support for adding offloaded clsflower filters
This allows filters added by tc-flower and specifying MAC addresses, Ethernet types, and the VLAN priority field, to be offloaded to the controller. This reuses most of the infrastructure used by ethtool, ethtool can be used to read these filters, but modification and deletion can only be done via tc-flower. Signed-off-by: Vinicius Costa Gomes --- drivers/net/ethernet/intel/igb/igb.h | 5 + drivers/net/ethernet/intel/igb/igb_ethtool.c | 13 ++- drivers/net/ethernet/intel/igb/igb_main.c| 140 ++- 3 files changed, 152 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h index e06d6fdcb2ce..05d8c827d33e 100644 --- a/drivers/net/ethernet/intel/igb/igb.h +++ b/drivers/net/ethernet/intel/igb/igb.h @@ -463,6 +463,7 @@ struct igb_nfc_input { struct igb_nfc_filter { struct hlist_node nfc_node; struct igb_nfc_input filter; + unsigned long cookie; u16 etype_reg_index; u16 sw_idx; u16 action; @@ -747,4 +748,8 @@ int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr, int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr, const u8 queue, const u8 flags); +int igb_update_ethtool_nfc_entry(struct igb_adapter *adapter, +struct igb_nfc_filter *input, +u16 sw_idx); + #endif /* _IGB_H_ */ diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c index d8686a0f5b5d..5386eb68ab15 100644 --- a/drivers/net/ethernet/intel/igb/igb_ethtool.c +++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c @@ -2918,9 +2918,9 @@ int igb_erase_filter(struct igb_adapter *adapter, struct igb_nfc_filter *input) return 0; } -static int igb_update_ethtool_nfc_entry(struct igb_adapter *adapter, - struct igb_nfc_filter *input, - u16 sw_idx) +int igb_update_ethtool_nfc_entry(struct igb_adapter *adapter, +struct igb_nfc_filter *input, +u16 sw_idx) { struct igb_nfc_filter *rule, *parent; int err = -EINVAL; @@ -2935,8 +2935,11 @@ static int igb_update_ethtool_nfc_entry(struct 
igb_adapter *adapter, parent = rule; } - /* if there is an old rule occupying our place remove it */ - if (rule && (rule->sw_idx == sw_idx)) { + /* if there is an old rule occupying our place remove it, also +* only allow rules added by ethtool to be removed, these +* rules don't have a cookie +*/ + if (rule && (!rule->cookie && rule->sw_idx == sw_idx)) { if (!input) err = igb_erase_filter(adapter, rule); diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index 5344261e6f45..b1d401e77d62 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -2497,10 +2497,148 @@ static int igb_offload_cbs(struct igb_adapter *adapter, return 0; } +#define ETHER_TYPE_FULL_MASK ((__force __be16)~0) + +static int igb_parse_cls_flower(struct igb_adapter *adapter, + struct tc_cls_flower_offload *f, + int traffic_class, + struct igb_nfc_filter *input) +{ + if (f->dissector->used_keys & + ~(BIT(FLOW_DISSECTOR_KEY_BASIC) | + BIT(FLOW_DISSECTOR_KEY_CONTROL) | + BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) | + BIT(FLOW_DISSECTOR_KEY_VLAN))) { + dev_err(&adapter->pdev->dev, "Unsupported key used: 0x%x\n", + f->dissector->used_keys); + return -EOPNOTSUPP; + } + + if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) { + struct flow_dissector_key_eth_addrs *key = + skb_flow_dissector_target(f->dissector, + FLOW_DISSECTOR_KEY_ETH_ADDRS, + f->key); + + struct flow_dissector_key_eth_addrs *mask = + skb_flow_dissector_target(f->dissector, + FLOW_DISSECTOR_KEY_ETH_ADDRS, + f->mask); + + if (is_broadcast_ether_addr(mask->dst)) { + input->filter.match_flags |= + IGB_FILTER_FLAG_DST_MAC_ADDR; + ether_addr_copy(input->filter.dst_addr, key->dst); + } + + if (is_broadcast_ether_addr(mask->src)) { + input->filter.match_flags |= + IGB_FILTER_FLAG_SRC_MAC_ADDR; + ether_addr_co
[next-queue PATCH 4/8] igb: Add support for MAC address filters specifying source addresses
Makes it possible to direct packets to queues based on their source address. Documents the expected usage of the 'flags' parameter. Signed-off-by: Vinicius Costa Gomes --- drivers/net/ethernet/intel/igb/e1000_defines.h | 1 + drivers/net/ethernet/intel/igb/igb.h | 1 + drivers/net/ethernet/intel/igb/igb_main.c | 35 +++--- 3 files changed, 28 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h index 573bf177fd08..c6f552de30dd 100644 --- a/drivers/net/ethernet/intel/igb/e1000_defines.h +++ b/drivers/net/ethernet/intel/igb/e1000_defines.h @@ -490,6 +490,7 @@ * manageability enabled, allowing us room for 15 multicast addresses. */ #define E1000_RAH_AV 0x8000/* Receive descriptor valid */ +#define E1000_RAH_ASEL_SRC_ADDR 0x0001 #define E1000_RAH_QSEL_ENABLE 0x1000 #define E1000_RAL_MAC_ADDR_LEN 4 #define E1000_RAH_MAC_ADDR_LEN 2 diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h index 1c6b8d9176a8..d5cd5f6708d9 100644 --- a/drivers/net/ethernet/intel/igb/igb.h +++ b/drivers/net/ethernet/intel/igb/igb.h @@ -472,6 +472,7 @@ struct igb_mac_addr { #define IGB_MAC_STATE_DEFAULT 0x1 #define IGB_MAC_STATE_IN_USE 0x2 +#define IGB_MAC_STATE_SRC_ADDR 0x4 /* board specific private data structure */ struct igb_adapter { diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index 543aa99892eb..db66b697fe3b 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -6837,8 +6837,13 @@ static void igb_set_default_mac_filter(struct igb_adapter *adapter) igb_rar_set_index(adapter, 0); } +/* Add a MAC filter for 'addr' directing matching traffic to 'queue', + * 'flags' is used to indicate what kind of match is made, match is by + * default for the destination address, if matching by source address + * is desired the flag IGB_MAC_STATE_SRC_ADDR can be used. 
+ */ static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr, - const u8 queue) + const u8 queue, const u8 flags) { struct e1000_hw *hw = &adapter->hw; int rar_entries = hw->mac.rar_entry_count - @@ -6858,7 +6863,7 @@ static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr, ether_addr_copy(adapter->mac_table[i].addr, addr); adapter->mac_table[i].queue = queue; - adapter->mac_table[i].state |= IGB_MAC_STATE_IN_USE; + adapter->mac_table[i].state |= (IGB_MAC_STATE_IN_USE | flags); igb_rar_set_index(adapter, i); return i; @@ -6867,8 +6872,14 @@ static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr, return -ENOSPC; } +/* Remove a MAC filter for 'addr' directing matching traffic to + * 'queue', 'flags' is used to indicate what kind of match need to be + * removed, match is by default for the destination address, if + * matching by source address is to be removed the flag + * IGB_MAC_STATE_SRC_ADDR can be used. + */ static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr, - const u8 queue) + const u8 queue, const u8 flags) { struct e1000_hw *hw = &adapter->hw; int rar_entries = hw->mac.rar_entry_count - @@ -6883,14 +6894,15 @@ static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr, * for the VF MAC addresses. 
*/ for (i = 0; i < rar_entries; i++) { - if (!(adapter->mac_table[i].state & IGB_MAC_STATE_IN_USE)) + if (!(adapter->mac_table[i].state & + (IGB_MAC_STATE_IN_USE | flags))) continue; if (adapter->mac_table[i].queue != queue) continue; if (!ether_addr_equal(adapter->mac_table[i].addr, addr)) continue; - adapter->mac_table[i].state &= ~IGB_MAC_STATE_IN_USE; + adapter->mac_table[i].state &= ~(IGB_MAC_STATE_IN_USE | flags); memset(adapter->mac_table[i].addr, 0, ETH_ALEN); adapter->mac_table[i].queue = 0; @@ -6906,7 +6918,8 @@ static int igb_uc_sync(struct net_device *netdev, const unsigned char *addr) struct igb_adapter *adapter = netdev_priv(netdev); int ret; - ret = igb_add_mac_filter(adapter, addr, adapter->vfs_allocated_count); + ret = igb_add_mac_filter(adapter, addr, +adapter->vfs_allocated_count, 0); return min_t(int, ret, 0); } @@ -6915,7 +6928,7 @@ static int igb_uc_unsync(struct net_device *netdev, const unsigned char *addr) { struct igb_adapter *adapter = netdev_priv(netdev); - igb_del_mac_f
[next-queue PATCH 0/8] igb: offloading of receive filters
Hi, This series enables some ethtool and tc-flower filters to be offloaded to igb-based network controllers. This is useful when the system configurator wants to steer specific kinds of traffic to a particular hardware queue. The first two commits are bug fixes. The basis of this series is to export the internal API used to configure address filters, so they can be used by ethtool, and to extend the functionality so source addresses can be handled. Then, we enable the tc-flower offloading implementation to re-use the same infrastructure as ethtool, storing the filters in the per-adapter "nfc" (Network Filter Config?) list. For consistency, destructive access is kept separate, i.e. a filter added by tc-flower can only be removed by tc-flower, but ethtool can read them all. Only VLAN priority, source and destination MAC address, and Ethertype matches are supported for now. Open question: - igb is initialized with the number of traffic classes as 1; if we want to use multiple traffic classes we need to increase this value, and the only way I could find is to use mqprio (for example). Should igb be initialized with, say, the number of queues as its "num_tc"? Vinicius Costa Gomes (8): igb: Fix not adding filter elements to the list igb: Fix queue selection on MAC filters on i210 and i211 igb: Enable the hardware traffic class feature bit for igb models igb: Add support for MAC address filters specifying source addresses igb: Add support for ethtool MAC address filters igb: Add the skeletons for tc-flower offloading igb: Add support for adding offloaded clsflower filters igb: Add support for removing offloaded tc-flower filters drivers/net/ethernet/intel/igb/e1000_defines.h | 2 + drivers/net/ethernet/intel/igb/igb.h | 15 ++ drivers/net/ethernet/intel/igb/igb_ethtool.c | 125 ++- drivers/net/ethernet/intel/igb/igb_main.c | 294 +++-- 4 files changed, 409 insertions(+), 27 deletions(-) -- 2.16.2
[next-queue PATCH 6/8] igb: Add the skeletons for tc-flower offloading
This adds basic functions needed to implement offloading for filters created by tc-flower. Signed-off-by: Vinicius Costa Gomes --- drivers/net/ethernet/intel/igb/igb_main.c | 66 +++ 1 file changed, 66 insertions(+) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index 5dfbdf05f765..5344261e6f45 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -2496,6 +2497,69 @@ static int igb_offload_cbs(struct igb_adapter *adapter, return 0; } +static int igb_configure_clsflower(struct igb_adapter *adapter, + struct tc_cls_flower_offload *cls_flower) +{ + return -EOPNOTSUPP; +} + +static int igb_delete_clsflower(struct igb_adapter *adapter, + struct tc_cls_flower_offload *cls_flower) +{ + return -EOPNOTSUPP; +} + +static int igb_setup_tc_cls_flower(struct igb_adapter *adapter, + struct tc_cls_flower_offload *cls_flower) +{ + switch (cls_flower->command) { + case TC_CLSFLOWER_REPLACE: + return igb_configure_clsflower(adapter, cls_flower); + case TC_CLSFLOWER_DESTROY: + return igb_delete_clsflower(adapter, cls_flower); + case TC_CLSFLOWER_STATS: + return -EOPNOTSUPP; + default: + return -EINVAL; + } +} + +static int igb_setup_tc_block_cb(enum tc_setup_type type, void *type_data, +void *cb_priv) +{ + struct igb_adapter *adapter = cb_priv; + + if (!tc_cls_can_offload_and_chain0(adapter->netdev, type_data)) + return -EOPNOTSUPP; + + switch (type) { + case TC_SETUP_CLSFLOWER: + return igb_setup_tc_cls_flower(adapter, type_data); + + default: + return -EOPNOTSUPP; + } +} + +static int igb_setup_tc_block(struct igb_adapter *adapter, + struct tc_block_offload *f) +{ + if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS) + return -EOPNOTSUPP; + + switch (f->command) { + case TC_BLOCK_BIND: + return tcf_block_cb_register(f->block, igb_setup_tc_block_cb, +adapter, adapter); + case TC_BLOCK_UNBIND: 
+ tcf_block_cb_unregister(f->block, igb_setup_tc_block_cb, + adapter); + return 0; + default: + return -EOPNOTSUPP; + } +} + static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type, void *type_data) { @@ -2504,6 +2568,8 @@ static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type, switch (type) { case TC_SETUP_QDISC_CBS: return igb_offload_cbs(adapter, type_data); + case TC_SETUP_BLOCK: + return igb_setup_tc_block(adapter, type_data); default: return -EOPNOTSUPP; -- 2.16.2
[next-queue PATCH 1/8] igb: Fix not adding filter elements to the list
Because the order of the parameters passed to 'hlist_add_behind()' was inverted, the 'parent' node was added "behind" the 'input'; as 'input' is not yet in the list, this causes the 'input' node to be lost. Fixes: 0e71def25281 ("igb: add support of RX network flow classification") Signed-off-by: Vinicius Costa Gomes --- drivers/net/ethernet/intel/igb/igb_ethtool.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c index 606e6761758f..143f0bb34e4d 100644 --- a/drivers/net/ethernet/intel/igb/igb_ethtool.c +++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c @@ -2864,7 +2864,7 @@ static int igb_update_ethtool_nfc_entry(struct igb_adapter *adapter, /* add filter to the list */ if (parent) - hlist_add_behind(&parent->nfc_node, &input->nfc_node); + hlist_add_behind(&input->nfc_node, &parent->nfc_node); else hlist_add_head(&input->nfc_node, &adapter->nfc_filter_list); -- 2.16.2
Re: Fwd: Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest
On Fri, Feb 23, 2018 Randy Dunlap wrote: > [add Matthew Wilcox; hopefully he can look/see] Thanks, Randy. I don't understand why nobody else thought to cc the author of the patch that it was bisected to ... > On 02/23/2018 04:13 PM, Cong Wang wrote: > > On Fri, Feb 23, 2018 at 3:27 PM, Cong Wang > wrote: > >> On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap > wrote: > >>> On 02/23/2018 08:05 AM, Khalid Aziz wrote: > Same selftest does not cause panic on 4.15. git bisect pointed to > commit 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based IDRs > more efficient"). > Kernel config is attached. > >> > >> Looks like something horribly wrong with u32 key id idr... > > > > Adding a few printk's, I got: > > > > [ 31.231560] requested handle = ffe0 > > [ 31.232426] allocated handle = 0 > > ... > > [ 31.246475] requested handle = ffd0 > > [ 31.247555] allocated handle = 1 > > > > > > So the bug is here where we can't allocate a specific handle: > > > > err = idr_alloc_u32(&tp_c->handle_idr, ht, > &handle, > > handle, GFP_KERNEL); > > if (err) { > > kfree(ht); > > return err; > > } Please try this patch. It fixes ffe0, but there may be more things tested that it may not work for. Chris Mi, what happened to that set of testcases you promised to write for me? diff --git a/lib/idr.c b/lib/idr.c index c98d77fcf393..10d9b8d47c33 100644 --- a/lib/idr.c +++ b/lib/idr.c @@ -36,8 +36,8 @@ int idr_alloc_u32(struct idr *idr, void *ptr, u32 *nextid, { struct radix_tree_iter iter; void __rcu **slot; - int base = idr->idr_base; - int id = *nextid; + unsigned int base = idr->idr_base; + unsigned int id = *nextid; if (WARN_ON_ONCE(radix_tree_is_internal_node(ptr))) return -EINVAL;
Re: [PATCH V7 2/4] sctp: Add ip option support
On Fri, Feb 23, 2018 at 11:11:50AM -0500, Paul Moore wrote: > On Thu, Feb 22, 2018 at 9:40 PM, Marcelo Ricardo Leitner > wrote: > > On Thu, Feb 22, 2018 at 06:08:05PM -0500, Paul Moore wrote: > >> On Wed, Feb 21, 2018 at 3:45 PM, Paul Moore wrote: > >> > On February 21, 2018 9:33:51 AM Marcelo Ricardo Leitner > >> > wrote: > >> >> On Tue, Feb 20, 2018 at 07:15:27PM +, Richard Haines wrote: > >> >>> Add ip option support to allow LSM security modules to utilise > >> >>> CIPSO/IPv4 > >> >>> and CALIPSO/IPv6 services. > >> >>> > >> >>> Signed-off-by: Richard Haines > >> >> > >> >> LGTM too, thanks! > >> >> > >> >> Acked-by: Marcelo Ricardo Leitner > >> > > >> > I agree, thanks everyone for all the work, review, and patience behind > >> > this patchset! I'll work on merging this into selinux/next and I'll > >> > send a note when it's done. > >> > >> I just merged the four patches (1,3,4 from the v6 patchset, 2 from the > >> v7 patchset) in selinux/next and did a quick sanity test on the kernel > >> (booted, no basic SELinux regressions). Additional testing help is > >> always appreciated ... > > > > I'll try it early next week. > > > > Any ideas on when this is going to appear on Dave's net-next tree? > > We have a lot of SCTP changes to be posted on this cycle and would be > > nice if we could avoid merge conflicts. > > It's merged into the SELinux tree, next branch; see the links below. > Last I checked DaveM doesn't pull the selinux/next into his net-next > tree (that would be a little funny for historical reasons). > > Any idea on how bad the merge conflicts are? I know about 5 patchsets that we are cooking. For 4 of them I think it would be mostly fine, perhaps one conflict here and there. 
But the other one is a refactoring on MTU handling and it touches lots of places that 92c49e12646e4 ("sctp: Add ip option support") also touched, like in the chunk below: +++ b/include/net/sctp/sctp.h @@ -441,9 +441,11 @@ static inline int sctp_list_single_entry(struct list_head *head) static inline int sctp_frag_point(const struct sctp_association *asoc, int pmtu) { struct sctp_sock *sp = sctp_sk(asoc->base.sk); + struct sctp_af *af = sp->pf->af; int frag = pmtu; - frag -= sp->pf->af->net_header_len; + frag -= af->ip_options_len(asoc->base.sk); + frag -= af->net_header_len; In the refactor I'm removing this function from here and adding a similar, not quite the same but similar, in a .c file. I post the mtu patchset as RFC next week so we can know better. Marcelo > > >> * git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git > >> * https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git > > -- > paul moore > www.paul-moore.com
Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest
[add Matthew Wilcox; hopefully he can look/see] On 02/23/2018 04:13 PM, Cong Wang wrote: > On Fri, Feb 23, 2018 at 3:27 PM, Cong Wang wrote: >> On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap wrote: >>> [adding netdev] >>> >>> On 02/23/2018 08:05 AM, Khalid Aziz wrote: I am seeing a kernel panic with 4.16-rc1 and 4.16-rc2 kernels when running selftests from tools/testing/selftests. Last messages from selftest before kernel panic are: >> ... Same selftest does not cause panic on 4.15. git bisect pointed to commit 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based IDRs more efficient"). Kernel config is attached. >> >> Looks like something horribly wrong with u32 key id idr... > > Adding a few printk's, I got: > > [ 31.231560] requested handle = ffe0 > [ 31.232426] allocated handle = 0 > ... > [ 31.246475] requested handle = ffd0 > [ 31.247555] allocated handle = 1 > > > So the bug is here where we can't allocate a specific handle: > > err = idr_alloc_u32(&tp_c->handle_idr, ht, &handle, > handle, GFP_KERNEL); > if (err) { > kfree(ht); > return err; > } > -- ~Randy
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Fri, Feb 23, 2018 at 2:38 PM, Jiri Pirko wrote: > Fri, Feb 23, 2018 at 11:22:36PM CET, losewe...@gmail.com wrote: > > [...] > No, that's not what I was talking about of course. I thought you mentioned the upgrade scenario this patch would like to address is to use the bypass interface "to take the place of the original virtio, and get udev to rename the bypass to what the original virtio_net was". That is one of the possible upgrade paths for sure. However the upgrade path I was seeking is to use the bypass interface to take the place of original VF interface while retaining the name and network configs, which generally can be done simply with kernel upgrade. It would become limiting as this patch makes the bypass interface share the same virtio pci device with virito backup. Can this bypass interface be made general to take place of any pci device other than virtio-net? This will be more helpful as the cloud users who has existing setup on VF interface don't have to recreate it on virtio-net and VF separately again. > > How that could work? If you have the VF netdev with all configuration > including IPs and routes and whatever - now you want to do migration > so you add virtio_net and do some weird in-driver bonding with it. But > then, VF disappears and the VF netdev with that and also all > configuration it had. > I don't think this scenario is valid. We are talking about making udev aware of the new virtio-bypass to rebind the name of the old VF interface with supposedly virtio-bypass *post the kernel upgrade*. Of course, this needs virtio-net backend to supply the [bdf] info where the VF/PT device was located. -Siwei > > >>> >>> >>> Yes. This sounds interesting. Looks like you want an existing VM image with >>> VF only configuration to get transparent live migration support by adding >>> virtio_net with BACKUP feature. We may need another feature bit to switch >>> between these 2 options. >> >>Yes, that's what I was thinking about. 
I have been building something >>like this before, and would like to get back after merging with your >>patch.
Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest
On Fri, Feb 23, 2018 at 3:27 PM, Cong Wang wrote: > On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap wrote: >> [adding netdev] >> >> On 02/23/2018 08:05 AM, Khalid Aziz wrote: >>> I am seeing a kernel panic with 4.16-rc1 and 4.16-rc2 kernels when running >>> selftests >>> from tools/testing/selftests. Last messages from selftest before kernel >>> panic are: >>> > ... >>> Same selftest does not cause panic on 4.15. git bisect pointed to commit >>> 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based IDRs more >>> efficient"). >>> Kernel config is attached. > > Looks like something horribly wrong with u32 key id idr... Adding a few printk's, I got: [ 31.231560] requested handle = ffe0 [ 31.232426] allocated handle = 0 ... [ 31.246475] requested handle = ffd0 [ 31.247555] allocated handle = 1 So the bug is here where we can't allocate a specific handle: err = idr_alloc_u32(&tp_c->handle_idr, ht, &handle, handle, GFP_KERNEL); if (err) { kfree(ht); return err; }
[PATCH bpf-next 2/6] bpf, x64: save several bytes by using mov over movabsq when possible
While analyzing some of the more complex BPF programs from Cilium, I found that LLVM generally prefers to emit LD_IMM64 instead of MOV32 BPF instructions for loading unsigned 32-bit immediates into a register. Given we cannot change the current/stable LLVM versions that are already out there, lets optimize this case such that the JIT prefers to emit 'mov %eax, imm32' over 'movabsq %rax, imm64' whenever suitable in order to reduce the image size by 4-5 bytes per such load in the typical case, reducing image size on some of the bigger programs by up to 4%. emit_mov_imm32() and emit_mov_imm64() have been added as helpers. Signed-off-by: Daniel Borkmann --- arch/x86/net/bpf_jit_comp.c | 125 ++-- 1 file changed, 74 insertions(+), 51 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 4bc36bd..f3e5cd8 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -60,7 +60,12 @@ static bool is_imm8(int value) static bool is_simm32(s64 value) { - return value == (s64) (s32) value; + return value == (s64)(s32)value; +} + +static bool is_uimm32(u64 value) +{ + return value == (u64)(u32)value; } /* mov dst, src */ @@ -355,6 +360,68 @@ static void emit_load_skb_data_hlen(u8 **pprog) *pprog = prog; } +static void emit_mov_imm32(u8 **pprog, bool sign_propagate, + u32 dst_reg, const u32 imm32) +{ + u8 *prog = *pprog; + u8 b1, b2, b3; + int cnt = 0; + + /* optimization: if imm32 is positive, use 'mov %eax, imm32' +* (which zero-extends imm32) to save 2 bytes. +*/ + if (sign_propagate && (s32)imm32 < 0) { + /* 'mov %rax, imm32' sign extends imm32 */ + b1 = add_1mod(0x48, dst_reg); + b2 = 0xC7; + b3 = 0xC0; + EMIT3_off32(b1, b2, add_1reg(b3, dst_reg), imm32); + goto done; + } + + /* optimization: if imm32 is zero, use 'xor %eax, %eax' +* to save 3 bytes. 
+*/ + if (imm32 == 0) { + if (is_ereg(dst_reg)) + EMIT1(add_2mod(0x40, dst_reg, dst_reg)); + b2 = 0x31; /* xor */ + b3 = 0xC0; + EMIT2(b2, add_2reg(b3, dst_reg, dst_reg)); + goto done; + } + + /* mov %eax, imm32 */ + if (is_ereg(dst_reg)) + EMIT1(add_1mod(0x40, dst_reg)); + EMIT1_off32(add_1reg(0xB8, dst_reg), imm32); +done: + *pprog = prog; +} + +static void emit_mov_imm64(u8 **pprog, u32 dst_reg, + const u32 imm32_hi, const u32 imm32_lo) +{ + u8 *prog = *pprog; + int cnt = 0; + + if (is_uimm32(((u64)imm32_hi << 32) | (u32)imm32_lo)) { + /* For emitting plain u32, where sign bit must not be +* propagated LLVM tends to load imm64 over mov32 +* directly, so save couple of bytes by just doing +* 'mov %eax, imm32' instead. +*/ + emit_mov_imm32(&prog, false, dst_reg, imm32_lo); + } else { + /* movabsq %rax, imm64 */ + EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg)); + EMIT(imm32_lo, 4); + EMIT(imm32_hi, 4); + } + + *pprog = prog; +} + static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, int oldproglen, struct jit_context *ctx) { @@ -377,7 +444,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, const s32 imm32 = insn->imm; u32 dst_reg = insn->dst_reg; u32 src_reg = insn->src_reg; - u8 b1 = 0, b2 = 0, b3 = 0; + u8 b2 = 0, b3 = 0; s64 jmp_offset; u8 jmp_cond; bool reload_skb_data; @@ -485,58 +552,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, break; case BPF_ALU64 | BPF_MOV | BPF_K: - /* optimization: if imm32 is positive, -* use 'mov eax, imm32' (which zero-extends imm32) -* to save 2 bytes -*/ - if (imm32 < 0) { - /* 'mov rax, imm32' sign extends imm32 */ - b1 = add_1mod(0x48, dst_reg); - b2 = 0xC7; - b3 = 0xC0; - EMIT3_off32(b1, b2, add_1reg(b3, dst_reg), imm32); - break; - } - case BPF_ALU | BPF_MOV | BPF_K: - /* optimization: if imm32 is zero, use 'xor ,' -* to save 3 bytes. -*/ - if (imm32 == 0) { - if (is_ereg(dst
[PATCH bpf-next 3/6] bpf, x64: save several bytes when mul dest is r0/r3 anyway
Instead of unconditionally performing push/pop on rax/rdx in case of multiplication, we can save a few bytes in case of dest register being either BPF r0 (rax) or r3 (rdx) since the result is written in there anyway. Signed-off-by: Daniel Borkmann --- arch/x86/net/bpf_jit_comp.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index f3e5cd8..9895ca3 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -615,8 +615,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, case BPF_ALU | BPF_MUL | BPF_X: case BPF_ALU64 | BPF_MUL | BPF_K: case BPF_ALU64 | BPF_MUL | BPF_X: - EMIT1(0x50); /* push rax */ - EMIT1(0x52); /* push rdx */ + if (dst_reg != BPF_REG_0) + EMIT1(0x50); /* push rax */ + if (dst_reg != BPF_REG_3) + EMIT1(0x52); /* push rdx */ /* mov r11, dst_reg */ EMIT_mov(AUX_REG, dst_reg); @@ -636,14 +638,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, /* mul(q) r11 */ EMIT2(0xF7, add_1reg(0xE0, AUX_REG)); - /* mov r11, rax */ - EMIT_mov(AUX_REG, BPF_REG_0); - - EMIT1(0x5A); /* pop rdx */ - EMIT1(0x58); /* pop rax */ - - /* mov dst_reg, r11 */ - EMIT_mov(dst_reg, AUX_REG); + if (dst_reg != BPF_REG_3) + EMIT1(0x5A); /* pop rdx */ + if (dst_reg != BPF_REG_0) { + /* mov dst_reg, rax */ + EMIT_mov(dst_reg, BPF_REG_0); + EMIT1(0x58); /* pop rax */ + } break; /* shifts */ -- 2.9.5
[PATCH bpf-next 5/6] bpf, x64: save 5 bytes in prologue when ebpf insns came from cbpf
While it's rather cumbersome to reduce prologue for cBPF->eBPF migrations wrt spill/fill for r15 which is callee saved register due to bpf_error path in bpf_jit.S that is both used by migrations as well as native eBPF, we can still trivially save 5 bytes in prologue for the former since tail calls can never be used there. cBPF->eBPF migrations also have their own custom prologue in BPF asm that xors A and X reg anyway, so it's fine we skip this here. Signed-off-by: Daniel Borkmann --- arch/x86/net/bpf_jit_comp.c | 28 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 5b8fc13..70f9748 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -216,7 +216,7 @@ struct jit_context { /* emit x64 prologue code for BPF program and check it's size. * bpf_tail_call helper will skip it while jumping into another program */ -static void emit_prologue(u8 **pprog, u32 stack_depth) +static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf) { u8 *prog = *pprog; int cnt = 0; @@ -251,18 +251,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth) /* mov qword ptr [rbp+24],r15 */ EMIT4(0x4C, 0x89, 0x7D, 24); - /* Clear the tail call counter (tail_call_cnt): for eBPF tail calls -* we need to reset the counter to 0. It's done in two instructions, -* resetting rax register to 0 (xor on eax gets 0 extended), and -* moving it to the counter location. -*/ + if (!ebpf_from_cbpf) { + /* Clear the tail call counter (tail_call_cnt): for eBPF tail +* calls we need to reset the counter to 0. It's done in two +* instructions, resetting rax register to 0, and moving it +* to the counter location. 
+*/ - /* xor eax, eax */ - EMIT2(0x31, 0xc0); - /* mov qword ptr [rbp+32], rax */ - EMIT4(0x48, 0x89, 0x45, 32); + /* xor eax, eax */ + EMIT2(0x31, 0xc0); + /* mov qword ptr [rbp+32], rax */ + EMIT4(0x48, 0x89, 0x45, 32); + + BUILD_BUG_ON(cnt != PROLOGUE_SIZE); + } - BUILD_BUG_ON(cnt != PROLOGUE_SIZE); *pprog = prog; } @@ -453,7 +456,8 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, int proglen = 0; u8 *prog = temp; - emit_prologue(&prog, bpf_prog->aux->stack_depth); + emit_prologue(&prog, bpf_prog->aux->stack_depth, + bpf_prog_was_classic(bpf_prog)); if (seen_ld_abs) emit_load_skb_data_hlen(&prog); -- 2.9.5
[PATCH bpf-next 6/6] bpf: add various jit test cases
Add a few test cases that check the run-time results under JIT. Signed-off-by: Daniel Borkmann --- tools/testing/selftests/bpf/test_verifier.c | 89 + 1 file changed, 89 insertions(+) diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 2971ba2..c987d3a 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -11140,6 +11140,95 @@ static struct bpf_test tests[] = { .result = REJECT, .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, + { + "jit: lsh, rsh, arsh by 1", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_MOV64_IMM(BPF_REG_1, 0xff), + BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 1), + BPF_ALU32_IMM(BPF_LSH, BPF_REG_1, 1), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0x3fc, 1), + BPF_EXIT_INSN(), + BPF_ALU64_IMM(BPF_RSH, BPF_REG_1, 1), + BPF_ALU32_IMM(BPF_RSH, BPF_REG_1, 1), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0xff, 1), + BPF_EXIT_INSN(), + BPF_ALU64_IMM(BPF_ARSH, BPF_REG_1, 1), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0x7f, 1), + BPF_EXIT_INSN(), + BPF_MOV64_IMM(BPF_REG_0, 2), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .retval = 2, + }, + { + "jit: mov32 for ldimm64, 1", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 2), + BPF_LD_IMM64(BPF_REG_1, 0xfeffULL), + BPF_ALU64_IMM(BPF_RSH, BPF_REG_1, 32), + BPF_LD_IMM64(BPF_REG_2, 0xfeffULL), + BPF_JMP_REG(BPF_JEQ, BPF_REG_1, BPF_REG_2, 1), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .retval = 2, + }, + { + "jit: mov32 for ldimm64, 2", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_LD_IMM64(BPF_REG_1, 0x1ULL), + BPF_LD_IMM64(BPF_REG_2, 0xULL), + BPF_JMP_REG(BPF_JEQ, BPF_REG_1, BPF_REG_2, 1), + BPF_MOV64_IMM(BPF_REG_0, 2), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .retval = 2, + }, + { + "jit: various mul tests", + .insns = { + BPF_LD_IMM64(BPF_REG_2, 0xeeff0d413122ULL), + BPF_LD_IMM64(BPF_REG_0, 0xfefefeULL), + BPF_LD_IMM64(BPF_REG_1, 0xefefefULL), + BPF_ALU64_REG(BPF_MUL, BPF_REG_0, BPF_REG_1), + 
BPF_JMP_REG(BPF_JEQ, BPF_REG_0, BPF_REG_2, 2), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + BPF_LD_IMM64(BPF_REG_3, 0xfefefeULL), + BPF_ALU64_REG(BPF_MUL, BPF_REG_3, BPF_REG_1), + BPF_JMP_REG(BPF_JEQ, BPF_REG_3, BPF_REG_2, 2), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + BPF_MOV32_REG(BPF_REG_2, BPF_REG_2), + BPF_LD_IMM64(BPF_REG_0, 0xfefefeULL), + BPF_ALU32_REG(BPF_MUL, BPF_REG_0, BPF_REG_1), + BPF_JMP_REG(BPF_JEQ, BPF_REG_0, BPF_REG_2, 2), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + BPF_LD_IMM64(BPF_REG_3, 0xfefefeULL), + BPF_ALU32_REG(BPF_MUL, BPF_REG_3, BPF_REG_1), + BPF_JMP_REG(BPF_JEQ, BPF_REG_3, BPF_REG_2, 2), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + BPF_LD_IMM64(BPF_REG_0, 0x952a7bbcULL), + BPF_LD_IMM64(BPF_REG_1, 0xfefefeULL), + BPF_LD_IMM64(BPF_REG_2, 0xeeff0d413122ULL), + BPF_ALU32_REG(BPF_MUL, BPF_REG_2, BPF_REG_1), + BPF_JMP_REG(BPF_JEQ, BPF_REG_2, BPF_REG_0, 2), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + BPF_MOV64_IMM(BPF_REG_0, 2), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .retval = 2, + }, + }; static int probe_filter_length(const struct bpf_insn *fp) -- 2.9.5
[PATCH bpf-next 4/6] bpf, x64: save few bytes when mul is in alu32
Add a generic emit_mov_reg() helper in order to reuse it in BPF multiplication to load the src into rax, we can save a few bytes in alu32 while doing so. Signed-off-by: Daniel Borkmann --- arch/x86/net/bpf_jit_comp.c | 43 --- 1 file changed, 28 insertions(+), 15 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 9895ca3..5b8fc13 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -422,6 +422,24 @@ static void emit_mov_imm64(u8 **pprog, u32 dst_reg, *pprog = prog; } +static void emit_mov_reg(u8 **pprog, bool is64, u32 dst_reg, u32 src_reg) +{ + u8 *prog = *pprog; + int cnt = 0; + + if (is64) { + /* mov dst, src */ + EMIT_mov(dst_reg, src_reg); + } else { + /* mov32 dst, src */ + if (is_ereg(dst_reg) || is_ereg(src_reg)) + EMIT1(add_2mod(0x40, dst_reg, src_reg)); + EMIT2(0x89, add_2reg(0xC0, dst_reg, src_reg)); + } + + *pprog = prog; +} + static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, int oldproglen, struct jit_context *ctx) { @@ -480,16 +498,11 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, EMIT2(b2, add_2reg(0xC0, dst_reg, src_reg)); break; - /* mov dst, src */ case BPF_ALU64 | BPF_MOV | BPF_X: - EMIT_mov(dst_reg, src_reg); - break; - - /* mov32 dst, src */ case BPF_ALU | BPF_MOV | BPF_X: - if (is_ereg(dst_reg) || is_ereg(src_reg)) - EMIT1(add_2mod(0x40, dst_reg, src_reg)); - EMIT2(0x89, add_2reg(0xC0, dst_reg, src_reg)); + emit_mov_reg(&prog, +BPF_CLASS(insn->code) == BPF_ALU64, +dst_reg, src_reg); break; /* neg dst */ @@ -615,6 +628,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, case BPF_ALU | BPF_MUL | BPF_X: case BPF_ALU64 | BPF_MUL | BPF_K: case BPF_ALU64 | BPF_MUL | BPF_X: + { + bool is64 = BPF_CLASS(insn->code) == BPF_ALU64; + if (dst_reg != BPF_REG_0) EMIT1(0x50); /* push rax */ if (dst_reg != BPF_REG_3) @@ -624,14 +640,11 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, EMIT_mov(AUX_REG, dst_reg); if 
(BPF_SRC(insn->code) == BPF_X) - /* mov rax, src_reg */ - EMIT_mov(BPF_REG_0, src_reg); + emit_mov_reg(&prog, is64, BPF_REG_0, src_reg); else - /* mov rax, imm32 */ - emit_mov_imm32(&prog, true, - BPF_REG_0, imm32); + emit_mov_imm32(&prog, is64, BPF_REG_0, imm32); - if (BPF_CLASS(insn->code) == BPF_ALU64) + if (is64) EMIT1(add_1mod(0x48, AUX_REG)); else if (is_ereg(AUX_REG)) EMIT1(add_1mod(0x40, AUX_REG)); @@ -646,7 +659,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, EMIT1(0x58); /* pop rax */ } break; - + } /* shifts */ case BPF_ALU | BPF_LSH | BPF_K: case BPF_ALU | BPF_RSH | BPF_K: -- 2.9.5
[PATCH bpf-next 0/6] Few x64 jit improvements to shrink image size
A couple of minor improvements to the x64 JIT that I still had around from before the merge window, in order to shrink the image size further. Added test cases for kselftests as well, and ran Cilium workloads on them w/o issues. Thanks! Daniel Borkmann (6): bpf, x64: save one byte per shl/shr/sar when imm is 1 bpf, x64: save several bytes by using mov over movabsq when possible bpf, x64: save several bytes when mul dest is r0/r3 anyway bpf, x64: save few bytes when mul is in alu32 bpf, x64: save 5 bytes in prologue when ebpf insns came from cbpf bpf: add various jit test cases arch/x86/net/bpf_jit_comp.c | 219 +--- tools/testing/selftests/bpf/test_verifier.c | 89 +++ 2 files changed, 221 insertions(+), 87 deletions(-) -- 2.9.5
[PATCH bpf-next 1/6] bpf, x64: save one byte per shl/shr/sar when imm is 1
When we shift by one, we can use a different encoding where imm is not explicitly needed, which saves 1 byte per such op. Signed-off-by: Daniel Borkmann --- arch/x86/net/bpf_jit_comp.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 4923d92..4bc36bd 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -640,7 +640,11 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, case BPF_RSH: b3 = 0xE8; break; case BPF_ARSH: b3 = 0xF8; break; } - EMIT3(0xC1, add_1reg(b3, dst_reg), imm32); + + if (imm32 == 1) + EMIT2(0xD1, add_1reg(b3, dst_reg)); + else + EMIT3(0xC1, add_1reg(b3, dst_reg), imm32); break; case BPF_ALU | BPF_LSH | BPF_X: -- 2.9.5
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
(pruned to reduce thread) On Wed, 21 Feb 2018 16:17:19 -0800 Alexander Duyck wrote: > >>> FWIW two solutions that immediately come to mind is to export "backup" > >>> as phys_port_name of the backup virtio link and/or assign a name to the > >>> master like you are doing already. I think team uses team%d and bond > >>> uses bond%d, soft naming of master devices seems quite natural in this > >>> case. > >> > >> I figured I had overlooked something like that.. Thanks for pointing > >> this out. Okay so I think the phys_port_name approach might resolve > >> the original issue. If I am reading things correctly what we end up > >> with is the master showing up as "ens1" for example and the backup > >> showing up as "ens1nbackup". Am I understanding that right? > >> > >> The problem with the team/bond%d approach is that it creates a new > >> netdevice and so it would require guest configuration changes. > >> > >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio > >>> link is quite neat. > >> > >> I agree. For non-"backup" virio_net devices would it be okay for us to > >> just return -EOPNOTSUPP? I assume it would be and that way the legacy > >> behavior could be maintained although the function still exists. > >> > - When the 'active' netdev is unplugged OR not present on a destination > system after live migration, the user will see 2 virtio_net netdevs. > >>> > >>> That's necessary and expected, all configuration applies to the master > >>> so master must exist. > >> > >> With the naming issue resolved this is the only item left outstanding. > >> This becomes a matter of form vs function. > >> > >> The main complaint about the "3 netdev" solution is a bit confusing to > >> have the 2 netdevs present if the VF isn't there. The idea is that > >> having the extra "master" netdev there if there isn't really a bond is > >> a bit ugly. > > > > Is it this uglier in terms of user experience rather than > > functionality? 
I don't want it dynamically changed between 2-netdev > > and 3-netdev depending on the presence of VF. That gets back to my > > original question and suggestion earlier: why not just hide the lower > > netdevs from udev renaming and such? Which important observability > > benefits users may get if exposing the lower netdevs? > > > > Thanks, > > -Siwei > > The only real advantage to a 2 netdev solution is that it looks like > the netvsc solution, however it doesn't behave like it since there are > some features like XDP that may not function correctly if they are > left enabled in the virtio_net interface. > > As far as functionality the advantage of not hiding the lower devices > is that they are free to be managed. The problem with pushing all of > the configuration into the upper device is that you are limited to the > intersection of the features of the lower devices. This can be > limiting for some setups as some VFs support things like more queues, > or better interrupt moderation options than others so trying to make > everything work with one config would be ugly. > Let's not make XDP the blocker for doing the best solution from the end user point of view. XDP is just yet another offload thing which needs to be handled. The current backup device solution used in netvsc doesn't handle the full range of offload options (things like flow direction, DCB, etc); no one but the HW vendors seems to care.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Thu, 22 Feb 2018 13:30:12 -0800 Alexander Duyck wrote:
> > Again, I understand your motivation. Yet I don't like your solution.
> > But if the decision is made to do this in-driver bonding, I would like
> > to see it being done some generic way:
> > 1) share the same "in-driver bonding core" code with netvsc
> >    put to net/core.
> > 2) the "in-driver bonding core" will strictly limit the functionality,
> >    like active-backup mode only, one vf, one backup, vf netdev type
> >    check (so no one could enslave a tap or anything else)
> > If user would need something more, he should employ team/bond.

Sharing would be good, but the netvsc world would really like to have
only one visible network device.
Re: [PATCH iproute2-next v3 2/8] iplink: Correctly report error when network device isn't found
On 2/22/18 6:02 AM, Serhey Popovych wrote:
> @@ -650,6 +658,9 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
>  		bool drv = strcmp(*argv, "xdpdrv") == 0;
>  		bool offload = strcmp(*argv, "xdpoffload") == 0;
>
> +		if (offload)
> +			has_dev(*dev, dev_index);
> +

I think this is actually the wrong direction. It seems to me argv should
be passed to xdp_parse rather than the generic, drv, offload bools. That
function can then check which option it is, and it has the knowledge
about whether a device is needed or not.

>  		NEXT_ARG();
>  		if (xdp_parse(&argc, &argv, req, dev_index,
>  			      generic, drv, offload))
Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest
On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap wrote:
> [adding netdev]
>
> On 02/23/2018 08:05 AM, Khalid Aziz wrote:
>> I am seeing a kernel panic with 4.16-rc1 and 4.16-rc2 kernels when
>> running selftests from tools/testing/selftests. Last messages from
>> selftest before kernel panic are:
>> ...
>> Same selftest does not cause panic on 4.15. git bisect pointed to commit
>> 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based IDRs more
>> efficient"). Kernel config is attached.

Looks like something horribly wrong with the u32 key id idr...
RE: [PATCH v2 net-next 1/2] lan743x: Add main source files for new lan743x driver
Hi Florian,

Thanks for your review. I have the following questions/comments.

> On 02/21/2018 11:06 AM, Bryan Whitehead wrote:
> > Add main source files for new lan743x driver.
> >
> > Signed-off-by: Bryan Whitehead
> > ---
>
> > +lan743x-objs := lan743x_main.o
>
> Should we assume that you have additional object files you would like to
> contribute at a later point? If that is the case, this is fine; if this is
> going to be the only file of this driver, consider renaming it so you
> don't even have to have this lan743x-objs line at all.

I am planning to add additional source files later. At the very least
there will be a lan743x_ethtool.o and a lan743x_ptp.o. I didn't want to
have to change the name of the main file later, so I gave it the
expected name now.

> > +
> > +static void lan743x_pci_cleanup(struct lan743x_adapter *adapter)
> > +{
> > +	struct lan743x_pci *pci = &adapter->pci;
> > +
> > +	if (pci->init_flags & INIT_FLAG_PCI_REGIONS_REQUESTED) {
>
> There is a pattern throughout the driver of maintaining flags to track
> what was initialized and what was not, do you really need that, or can
> you check for specific book keeping private information. Maintaining
> flags is error prone and requires you to keep adding new ones, that does
> not really scale.

OK, but it seems to me that what I have is an example of "specific book
keeping private information". Can you clarify the style you prefer? In
cases of allocation where I can just compare a pointer to NULL, I can
easily remove the flags. But in other cases I need a record of which
steps completed in order to clean up properly. In cases where I need
some sort of a flag, would you prefer I avoid a bit mask and have a
standalone variable for each flag?
[snip]
> > +	dmac_int_en &= ioc_bit;
> > +	dmac_int_sts &= dmac_int_en;
> > +	if (dmac_int_sts & ioc_bit) {
> > +		tasklet_schedule(&tx->tx_isr_bottom_half);
> > +		enable_flag = 0;	/* tasklet will re-enable later */
> > +	}
>
> Consider migrating your TX buffer reclamation to a NAPI context. If you
> have one TX queue and one RX, the same NAPI context can be re-used, if
> you have separate RX/TX queues, you may create a NAPI context per RX/TX
> pair, or you may create separate NAPI contexts per TX queues and RX
> queues.

This may take some time to refactor, but I will see what I can do.

[snip]
> > +static int lan743x_phy_open(struct lan743x_adapter *adapter)
> > +{
> > +	struct lan743x_phy *phy = &adapter->phy;
> > +	struct phy_device *phydev;
> > +	struct net_device *netdev;
> > +	int ret = -EIO;
> > +	u32 mii_adv;
> > +
> > +	netdev = adapter->netdev;
> > +	phydev = phy_find_first(adapter->mdiobus);
> > +	if (!phydev) {
> > +		ret = -EIO;
> > +		goto clean_up;
> > +	}
> > +	ret = phy_connect_direct(netdev, phydev,
> > +				 lan743x_phy_link_status_change,
> > +				 PHY_INTERFACE_MODE_GMII);
> > +	if (ret)
> > +		goto clean_up;
> > +	phy->flags |= PHY_FLAG_ATTACHED;
> > +
> > +	/* MAC doesn't support 1000T Half */
> > +	phydev->supported &= ~SUPPORTED_1000baseT_Half;
> > +
> > +	/* support both flow controls */
> > +	phy->fc_request_control = (FLOW_CTRL_RX | FLOW_CTRL_TX);
> > +	phydev->advertising &= ~(ADVERTISED_Pause | ADVERTISED_Asym_Pause);
> > +	mii_adv = (u32)mii_advertise_flowctrl(phy->fc_request_control);
> > +	phydev->advertising |= mii_adv_to_ethtool_adv_t(mii_adv);
> > +	phy->fc_autoneg = phydev->autoneg;
>
> No driver does things like that, that appears to be really wrong.

Can you clarify? What is wrong and how should it be?
[snip]
> > +static int lan743x_dmac_init(struct lan743x_adapter *adapter)
> > +{
> > +	struct lan743x_dmac *dmac = &adapter->dmac;
> > +	u32 data = 0;
> > +
> > +	dmac->flags = 0;
> > +	dmac->descriptor_spacing = DEFAULT_DMA_DESCRIPTOR_SPACING;
> > +	lan743x_csr_write(adapter, DMAC_CMD, DMAC_CMD_SWR_);
> > +	lan743x_csr_wait_for_bit(adapter, DMAC_CMD, DMAC_CMD_SWR_,
> > +				 0, 1000, 2, 100);
> > +	switch (dmac->descriptor_spacing) {
> > +	case DMA_DESCRIPTOR_SPACING_16:
> > +		data = DMAC_CFG_MAX_DSPACE_16_;
> > +		break;
> > +	case DMA_DESCRIPTOR_SPACING_32:
> > +		data = DMAC_CFG_MAX_DSPACE_32_;
> > +		break;
> > +	case DMA_DESCRIPTOR_SPACING_64:
> > +		data = DMAC_CFG_MAX_DSPACE_64_;
> > +		break;
> > +	case DMA_DESCRIPTOR_SPACING_128:
> > +		data = DMAC_CFG_MAX_DSPACE_128_;
> > +		break;
> > +	default:
> > +		return -EPERM;
>
> -EINVAL maybe?

I think -EPERM is more appropriate because it reflects an unresolvable
error. In other words, the platform is incompatible with the device.
Would you prefer I use a preprocessor error instead, such as:

#if invalid_setting
#error incompatible setting
#endif

> [snip]
>
> > +
> > +done:
> > +
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Fri, Feb 23, 2018 at 11:22:36PM CET, losewe...@gmail.com wrote:
[...]
>>> No, that's not what I was talking about of course. I thought you
>>> mentioned the upgrade scenario this patch would like to address is to
>>> use the bypass interface "to take the place of the original virtio,
>>> and get udev to rename the bypass to what the original virtio_net
>>> was". That is one of the possible upgrade paths for sure. However the
>>> upgrade path I was seeking is to use the bypass interface to take the
>>> place of the original VF interface while retaining the name and network
>>> configs, which generally can be done simply with a kernel upgrade. It
>>> would become limiting as this patch makes the bypass interface share
>>> the same virtio pci device with virtio backup. Can this bypass
>>> interface be made general to take the place of any pci device other
>>> than virtio-net? This will be more helpful as the cloud users who have
>>> existing setups on a VF interface don't have to recreate them on
>>> virtio-net and VF separately again.

How could that work? If you have the VF netdev with all configuration,
including IPs and routes and whatever, and now you want to do migration,
you add virtio_net and do some weird in-driver bonding with it. But then
the VF disappears, and with it the VF netdev and also all the
configuration it had. I don't think this scenario is valid.

>> Yes. This sounds interesting. Looks like you want an existing VM image
>> with a VF-only configuration to get transparent live migration support
>> by adding virtio_net with the BACKUP feature. We may need another
>> feature bit to switch between these 2 options.
>
> Yes, that's what I was thinking about. I have been building something
> like this before, and would like to get back after merging with your
> patch.
Re: [PATCH bpf] bpf: allow xadd only on aligned memory
On Fri, Feb 23, 2018 at 10:29:05PM +0100, Daniel Borkmann wrote:
> The requirements around atomic_add() / atomic64_add() resp. their
> JIT implementations differ across architectures. E.g. while x86_64
> seems just fine with BPF's xadd on unaligned memory, on arm64 it
> triggers via interpreter but also JIT the following crash:
>
> [  830.864985] Unable to handle kernel paging request at virtual address 8097d7ed6703
> [...]
> [  830.916161] Internal error: Oops: 9621 [#1] SMP
> [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
> [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
> [  830.998998] pstate: 8045 (Nzcv daif +PAN -UAO)
> [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
> [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
> [  831.012485] sp : 1ccabc20
> [  831.015786] x29: 1ccabc20 x28: 8017d56a0f00
> [  831.021087] x27: 0001 x26:
> [  831.026387] x25: 00c168d9db98 x24:
> [  831.031686] x23: 08203878 x22: 09488000
> [  831.036986] x21: 08b14e28 x20: 1ccabcb0
> [  831.042286] x19: 097b5080 x18: 0a03
> [  831.047585] x17: x16:
> [  831.052885] x15: aeca8000 x14:
> [  831.058184] x13: x12:
> [  831.063484] x11: 0001 x10:
> [  831.068783] x9 : x8 :
> [  831.074083] x7 : x6 : 000580d42800
> [  831.079383] x5 : 0018 x4 :
> [  831.084682] x3 : 1ccabcb0 x2 : 0001
> [  831.089982] x1 : 8097d7ed6703 x0 : 0001
> [  831.095282] Process test_verifier (pid: 2788, stack limit = 0x18370044)
> [  831.102577] Call trace:
> [  831.105012]  __ll_sc_atomic_add+0x4/0x18
> [  831.108923]  __bpf_prog_run32+0x4c/0x70
> [  831.112748]  bpf_test_run+0x78/0xf8
> [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
> [  831.120567]  SyS_bpf+0x77c/0x1110
> [  831.123873]  el0_svc_naked+0x30/0x34
> [  831.127437] Code: 97fffe97 17ec f9800031 (885f7c31)
>
> Reason for this is because memory is required to be aligned. In
> case of BPF, we always enforce alignment in terms of stack access,
> but not when accessing map values or packet data when the underlying
> arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.
>
> xadd on packet data that is local to us anyway is just wrong, so
> forbid this case entirely. The only place where xadd makes sense in
> fact are map values; xadd on stack is wrong as well, but it's been
> around for much longer. Specifically enforce strict alignment in case
> of xadd, so that we handle this case generically and avoid such crashes
> in the first place.
>
> Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
> Signed-off-by: Daniel Borkmann

Nice catch! Applied to bpf tree, thanks Daniel.
[PATCH V2 net-next 3/3] selftests/net: reap zerocopy completions passed up as ancillary data.
PF_RDS sockets pass up cookies for zerocopy completion as ancillary data. Update msg_zerocopy to reap this information. Signed-off-by: Sowmini Varadhan --- v2: receive zerocopy completion notification as POLLIN tools/testing/selftests/net/msg_zerocopy.c | 60 1 files changed, 52 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c index eff9cf2..8c466e8 100644 --- a/tools/testing/selftests/net/msg_zerocopy.c +++ b/tools/testing/selftests/net/msg_zerocopy.c @@ -344,7 +344,48 @@ static int do_setup_tx(int domain, int type, int protocol) return fd; } -static bool do_recv_completion(int fd) +static int do_process_zerocopy_cookies(struct rds_zcopy_cookies *ck) +{ + int ncookies, i; + + ncookies = ck->num; + if (ncookies > RDS_MAX_ZCOOKIES) + error(1, 0, "Returned %d cookies, max expected %d\n", + ncookies, RDS_MAX_ZCOOKIES); + for (i = 0; i < ncookies; i++) + if (cfg_verbose >= 2) + fprintf(stderr, "%d\n", ck->cookies[i]); + return ncookies; +} + +static int do_recvmsg_completion(int fd) +{ + struct msghdr msg; + char cmsgbuf[256]; + struct cmsghdr *cmsg; + bool ret = false; + struct rds_zcopy_cookies *ck; + + memset(&msg, 0, sizeof(msg)); + msg.msg_control = cmsgbuf; + msg.msg_controllen = sizeof(cmsgbuf); + + if (recvmsg(fd, &msg, MSG_DONTWAIT)) + return ret; + for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) { + if (cmsg->cmsg_level == SOL_RDS && + cmsg->cmsg_type == RDS_CMSG_ZCOPY_COMPLETION) { + + ck = (struct rds_zcopy_cookies *)CMSG_DATA(cmsg); + completions += do_process_zerocopy_cookies(ck); + ret = true; + break; + } + } + return ret; +} + +static bool do_recv_completion(int fd, int domain) { struct sock_extended_err *serr; struct msghdr msg = {}; @@ -353,6 +394,9 @@ static bool do_recv_completion(int fd) int ret, zerocopy; char control[100]; + if (domain == PF_RDS) + return do_recvmsg_completion(fd); + msg.msg_control = control; msg.msg_controllen = 
sizeof(control); @@ -409,20 +453,20 @@ static bool do_recv_completion(int fd) } /* Read all outstanding messages on the errqueue */ -static void do_recv_completions(int fd) +static void do_recv_completions(int fd, int domain) { - while (do_recv_completion(fd)) {} + while (do_recv_completion(fd, domain)) {} } /* Wait for all remaining completions on the errqueue */ -static void do_recv_remaining_completions(int fd) +static void do_recv_remaining_completions(int fd, int domain) { int64_t tstop = gettimeofday_ms() + cfg_waittime_ms; while (completions < expected_completions && gettimeofday_ms() < tstop) { - if (do_poll(fd, POLLERR)) - do_recv_completions(fd); + if (do_poll(fd, domain == PF_RDS ? POLLIN : POLLERR)) + do_recv_completions(fd, domain); } if (completions < expected_completions) @@ -503,13 +547,13 @@ static void do_tx(int domain, int type, int protocol) while (!do_poll(fd, POLLOUT)) { if (cfg_zerocopy) - do_recv_completions(fd); + do_recv_completions(fd, domain); } } while (gettimeofday_ms() < tstop); if (cfg_zerocopy) - do_recv_remaining_completions(fd); + do_recv_remaining_completions(fd, domain); if (close(fd)) error(1, errno, "close"); -- 1.7.1
[PATCH V2 net-next 1/3] selftests/net: revert the zerocopy Rx path for PF_RDS
In preparation for optimized reception of zerocopy completion, revert the Rx side changes introduced by Commit dfb8434b0a94 ("selftests/net: add zerocopy support for PF_RDS test case") Signed-off-by: Sowmini Varadhan --- v2: prepare to remove sk_error_queue based path; remove recvmsg() as well, PF_RDS can also use recv() for the usage pattern in msg_zerocopy tools/testing/selftests/net/msg_zerocopy.c | 67 1 files changed, 0 insertions(+), 67 deletions(-) diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c index 5cc2a53..eff9cf2 100644 --- a/tools/testing/selftests/net/msg_zerocopy.c +++ b/tools/testing/selftests/net/msg_zerocopy.c @@ -344,26 +344,6 @@ static int do_setup_tx(int domain, int type, int protocol) return fd; } -static int do_process_zerocopy_cookies(struct sock_extended_err *serr, - uint32_t *ckbuf, size_t nbytes) -{ - int ncookies, i; - - if (serr->ee_errno != 0) - error(1, 0, "serr: wrong error code: %u", serr->ee_errno); - ncookies = serr->ee_data; - if (ncookies > SO_EE_ORIGIN_MAX_ZCOOKIES) - error(1, 0, "Returned %d cookies, max expected %d\n", - ncookies, SO_EE_ORIGIN_MAX_ZCOOKIES); - if (nbytes != ncookies * sizeof(uint32_t)) - error(1, 0, "Expected %d cookies, got %ld\n", - ncookies, nbytes/sizeof(uint32_t)); - for (i = 0; i < ncookies; i++) - if (cfg_verbose >= 2) - fprintf(stderr, "%d\n", ckbuf[i]); - return ncookies; -} - static bool do_recv_completion(int fd) { struct sock_extended_err *serr; @@ -372,17 +352,10 @@ static bool do_recv_completion(int fd) uint32_t hi, lo, range; int ret, zerocopy; char control[100]; - uint32_t ckbuf[SO_EE_ORIGIN_MAX_ZCOOKIES]; - struct iovec iov; msg.msg_control = control; msg.msg_controllen = sizeof(control); - iov.iov_base = ckbuf; - iov.iov_len = (SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(ckbuf[0])); - msg.msg_iov = &iov; - msg.msg_iovlen = 1; - ret = recvmsg(fd, &msg, MSG_ERRQUEUE); if (ret == -1 && errno == EAGAIN) return false; @@ -402,10 +375,6 @@ static bool 
do_recv_completion(int fd) serr = (void *) CMSG_DATA(cm); - if (serr->ee_origin == SO_EE_ORIGIN_ZCOOKIE) { - completions += do_process_zerocopy_cookies(serr, ckbuf, ret); - return true; - } if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) error(1, 0, "serr: wrong origin: %u", serr->ee_origin); if (serr->ee_errno != 0) @@ -631,40 +600,6 @@ static void do_flush_datagram(int fd, int type) bytes += cfg_payload_len; } - -static void do_recvmsg(int fd) -{ - int ret, off = 0; - char *buf; - struct iovec iov; - struct msghdr msg; - struct sockaddr_storage din; - - buf = calloc(cfg_payload_len, sizeof(char)); - iov.iov_base = buf; - iov.iov_len = cfg_payload_len; - - memset(&msg, 0, sizeof(msg)); - msg.msg_name = &din; - msg.msg_namelen = sizeof(din); - msg.msg_iov = &iov; - msg.msg_iovlen = 1; - - ret = recvmsg(fd, &msg, MSG_TRUNC); - - if (ret == -1) - error(1, errno, "recv"); - if (ret != cfg_payload_len) - error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len); - - if (memcmp(buf + off, payload, ret)) - error(1, 0, "recv: data mismatch"); - - free(buf); - packets++; - bytes += cfg_payload_len; -} - static void do_rx(int domain, int type, int protocol) { uint64_t tstop; @@ -676,8 +611,6 @@ static void do_rx(int domain, int type, int protocol) do { if (type == SOCK_STREAM) do_flush_tcp(fd); - else if (domain == PF_RDS) - do_recvmsg(fd); else do_flush_datagram(fd, type); -- 1.7.1
[PATCH V2 net-next 2/3] rds: deliver zerocopy completion notification with data
This commit is an optimization of the commit 01883eda72bd ("rds: support for zcopy completion notification") for PF_RDS sockets. RDS applications are predominantly request-response transactions, so it is more efficient to reduce the number of system calls and have zerocopy completion notification delivered as ancillary data on the POLLIN channel. Cookies are passed up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when the returned value of recvmsg() is greater than, or equal to, 0. A max of RDS_MAX_ZCOOKIES may be passed with each message. This commit removes support for zerocopy completion notification on MSG_ERRQUEUE for PF_RDS sockets. Signed-off-by: Sowmini Varadhan --- v2: remove sk_error_queue path; lot of cautionary checks rds_recvmsg_zcookie() and callers to make sure we dont remove cookies from the queue and then fail to pass it up to caller include/uapi/linux/errqueue.h |2 -- include/uapi/linux/rds.h |7 +++ net/rds/af_rds.c |7 +-- net/rds/message.c | 35 +-- net/rds/rds.h |2 ++ net/rds/recv.c| 34 +- 6 files changed, 60 insertions(+), 27 deletions(-) diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h index 28812ed..dc64cfa 100644 --- a/include/uapi/linux/errqueue.h +++ b/include/uapi/linux/errqueue.h @@ -20,13 +20,11 @@ struct sock_extended_err { #define SO_EE_ORIGIN_ICMP6 3 #define SO_EE_ORIGIN_TXSTATUS 4 #define SO_EE_ORIGIN_ZEROCOPY 5 -#define SO_EE_ORIGIN_ZCOOKIE 6 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1)) #define SO_EE_CODE_ZEROCOPY_COPIED 1 -#defineSO_EE_ORIGIN_MAX_ZCOOKIES 8 /** * struct scm_timestamping - timestamps exposed through cmsg diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h index 12e3bca..a66b213 100644 --- a/include/uapi/linux/rds.h +++ b/include/uapi/linux/rds.h @@ -104,6 +104,7 @@ #define RDS_CMSG_MASKED_ATOMIC_CSWP9 #define RDS_CMSG_RXPATH_LATENCY11 #defineRDS_CMSG_ZCOPY_COOKIE 12 
+#defineRDS_CMSG_ZCOPY_COMPLETION 13 #define RDS_INFO_FIRST 1 #define RDS_INFO_COUNTERS 1 @@ -317,6 +318,12 @@ struct rds_rdma_notify { #define RDS_RDMA_DROPPED 3 #define RDS_RDMA_OTHER_ERROR 4 +#defineRDS_MAX_ZCOOKIES8 +struct rds_zcopy_cookies { + __u32 num; + __u32 cookies[RDS_MAX_ZCOOKIES]; +}; + /* * Common set of flags for all RDMA related structs */ diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index a937f18..f712610 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -77,6 +77,7 @@ static int rds_release(struct socket *sock) rds_send_drop_to(rs, NULL); rds_rdma_drop_keys(rs); rds_notify_queue_get(rs, NULL); + __skb_queue_purge(&rs->rs_zcookie_queue); spin_lock_bh(&rds_sock_lock); list_del_init(&rs->rs_item); @@ -144,7 +145,7 @@ static int rds_getname(struct socket *sock, struct sockaddr *uaddr, * - to signal that a previously congested destination may have become * uncongested * - A notification has been queued to the socket (this can be a congestion - * update, or a RDMA completion). + * update, or a RDMA completion, or a MSG_ZEROCOPY completion). * * EPOLLOUT is asserted if there is room on the send queue. This does not mean * however, that the next sendmsg() call will succeed. 
If the application tries @@ -178,7 +179,8 @@ static __poll_t rds_poll(struct file *file, struct socket *sock, spin_unlock(&rs->rs_lock); } if (!list_empty(&rs->rs_recv_queue) || - !list_empty(&rs->rs_notify_queue)) + !list_empty(&rs->rs_notify_queue) || + !skb_queue_empty(&rs->rs_zcookie_queue)) mask |= (EPOLLIN | EPOLLRDNORM); if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) mask |= (EPOLLOUT | EPOLLWRNORM); @@ -513,6 +515,7 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol) INIT_LIST_HEAD(&rs->rs_recv_queue); INIT_LIST_HEAD(&rs->rs_notify_queue); INIT_LIST_HEAD(&rs->rs_cong_list); + skb_queue_head_init(&rs->rs_zcookie_queue); spin_lock_init(&rs->rs_rdma_lock); rs->rs_rdma_keys = RB_ROOT; rs->rs_rx_traces = 0; diff --git a/net/rds/message.c b/net/rds/message.c index 6518345..2e8bdaf 100644 --- a/net/rds/message.c +++ b/net/rds/message.c @@ -58,32 +58,26 @@ void rds_message_addref(struct rds_message *rm) static inline bool skb_zcookie_add(struct sk_buff *skb, u32 cookie) { - struct sock_exterr_skb *serr = SKB_EXT_ERR(skb); - int ncookies; - u32 *ptr; + struct rds_zcopy_cookies *ck = (struct rds_zcopy_cookies *)skb->cb; + int ncookies = c
Re: [RFC PATCH bpf-next 07/12] bpf/verifier: allow bounded loops with JLT/true back-edge
On 02/23/2018 09:41 AM, Edward Cree wrote:
> Where the register umin_value is increasing sufficiently fast, the loop
> will terminate after a reasonable number of iterations, so we can allow
> to keep walking it.

Continuing to walk the loop is problematic because we hit the complexity
limit. What we want to do is a static analysis step upfront on the basic
block containing the loop to ensure the loop is bounded. We can do this
by finding the induction variables (variables of the form c*n + d) and
ensuring the comparison register is an induction variable that is also
increasing/decreasing correctly with respect to the comparison op.
Finding induction variables is common in compilers (e.g. to pull
computation out of loops), so it should be doable in the verifier.

> The semantics of prev_insn_idx are changed slightly: it now always refers
> to the last jump insn walked, even when that jump was not taken. Also it
> is moved into env, alongside a new bool last_jump_taken that records
> whether the jump was taken or not. This is needed so that, when we detect
> a loop, we can see how we ended up in the back-edge: what was the jump
> condition, and is it the true- or the false-branch that ends up looping?
> We record in our explored_states whether they represent a conditional jump
> and whether that jump produced a good loop bound. This allows us to make
> unconditional jumps "not responsible" for ensuring the loop is bounded, as
> long as there is a conditional jump somewhere in the loop body; whereas a
> conditional jump must ensure that there is a bounding conditional somewhere
> in the loop body.
> Recursive calls have to remain forbidden because the calculation of maximum
> stack depth assumes the call graph is acyclic, even though the loop
> handling code could determine whether the recursion was bounded.
>
> Signed-off-by: Edward Cree
> ---
Re: [RFC PATCH bpf-next 05/12] bpf/verifier: detect loops dynamically rather than statically
On 02/23/2018 09:40 AM, Edward Cree wrote:
> Add in a new chain of parent states, which does not cross function-call
> boundaries, and check whether our current insn_idx appears anywhere in
> the chain. Since all jump targets have state-list marks (now placed
> by prepare_cfg_marks(), which replaces check_cfg()), it suffices to
> thread this chain through the existing explored_states and check it
> only in is_state_visited().
> By using this parent-state chain to detect loops, we open up the
> possibility that in future we could determine a loop to be bounded and
> thus accept it.
> A few selftests had to be changed to ensure that all the insns walked
> before reaching the back-edge are valid.

I still believe this needs to be done in an initial pass where we can
build a proper CFG and analyze the loops to ensure we have loops that
can be properly verified. Otherwise we end up trying to handle all the
special cases later, like the 'three-jump trick' noted in patch 7/12. So
rather than try to handle all loops, only handle "normal" loops. In an
early CFG stage with a DOM tree, the algorithm for this is already used
by gcc/clang, so there is good precedent for this type of work. Without
this we are open to other "clever" goto programs that create loops we
would have to special case. Better to just reject all these classes of
loops because most users will never need them.

See more notes in the 7/12 patch; sorry, I've been busy and have not had
a chance to push code to build the CFG and dom tree yet. Also, a case
analysis in the patch description describing the types of loops this
handles (either here or in another patch) would help me understand at
least.

> Signed-off-by: Edward Cree
> ---
>  include/linux/bpf_verifier.h                |   2 +
>  kernel/bpf/verifier.c                       | 280 +---
>  tools/testing/selftests/bpf/test_verifier.c |  12 +-
>  3 files changed, 97 insertions(+), 197 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 0bc49c768585..24ddbc1ed3b2 100644
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 6:35 PM, Samudrala, Sridhar wrote: > On 2/21/2018 5:59 PM, Siwei Liu wrote: >> >> On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck >> wrote: >>> >>> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: I haven't checked emails for days and did not realize the new revision had already came out. And thank you for the effort, this revision really looks to be a step forward towards our use case and is close to what we wanted to do. A few questions in line. On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck wrote: > > On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: >> >> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: >>> >>> Ppatch 2 is in response to the community request for a 3 netdev >>> solution. However, it creates some issues we'll get into in a >>> moment. >>> It extends virtio_net to use alternate datapath when available and >>> registered. When BACKUP feature is enabled, virtio_net driver creates >>> an additional 'bypass' netdev that acts as a master device and >>> controls >>> 2 slave devices. The original virtio_net netdev is registered as >>> 'backup' netdev and a passthru/vf device with the same MAC gets >>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >>> associated with the same 'pci' device. The user accesses the network >>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' >>> netdev >>> as default for transmits when it is available with link up and >>> running. >> >> Thank you do doing this. >> >>> We noticed a couple of issues with this approach during testing. >>> - As both 'bypass' and 'backup' netdevs are associated with the same >>>virtio pci device, udev tries to rename both of them with the same >>> name >>>and the 2nd rename will fail. This would be OK as long as the >>> first netdev >>>to be renamed is the 'bypass' netdev, but the order in which udev >>> gets >>>to rename the 2 netdevs is not reliable. 
>> >> Out of curiosity - why do you link the master netdev to the virtio >> struct device? > The basic idea of all this is that we wanted this to work with an > existing VM image that was using virtio. As such we were trying to > make it so that the bypass interface takes the place of the original > virtio and get udev to rename the bypass to what the original > virtio_net was. Could it also be made possible to take over the config from the VF instead of virtio on an existing VM image? And get udev to rename the bypass netdev to what the original VF was. I don't mean tightly binding the bypass master to only virtio or VF, but I think we should provide both options to support different upgrade paths. Possibly we could tweak the device tree layout to reuse the same PCI slot for the master bypass netdev, such that udev would not get confused when renaming the device. The VF needs to use a different function slot afterwards. Perhaps we might need a special multiseat-like QEMU device for that purpose? In our case we'll upgrade the config from VF to virtio-bypass directly. >>> >>> So if I am understanding what you are saying you are wanting to flip >>> the backup interface from the virtio to a VF. The problem is that >>> becomes a bit of a vendor lock-in solution since it would rely on a >>> specific VF driver. I would agree with Jiri that we don't want to go >>> down that path. We don't want every VF out there firing up its own >>> separate bond. Ideally you want the hypervisor to be able to manage >>> all of this which is why it makes sense to have virtio manage this and >>> why this is associated with the virtio_net interface. >> >> No, that's not what I was talking about of course. I thought you >> mentioned the upgrade scenario this patch would like to address is to >> use the bypass interface "to take the place of the original virtio, >> and get udev to rename the bypass to what the original virtio_net >> was". That is one of the possible upgrade paths for sure. 
However the >> upgrade path I was seeking is to use the bypass interface to take the >> place of the original VF interface while retaining the name and network >> configs, which generally can be done simply with a kernel upgrade. It >> would become limiting as this patch makes the bypass interface share >> the same virtio pci device with the virtio backup. Can this bypass >> interface be made general to take the place of any pci device other than >> virtio-net? This will be more helpful as the cloud users who have >> an existing setup on the VF interface don't have to recreate it on virtio-net >> and VF separately again. > > > Yes. This sounds interesting. Looks like you want an existing VM image with > VF only configuration to get transparent live migration support by adding > virtio_
Re: [v3,net-next,2/2] tls: Use correct sk->sk_prot for IPV6
Hi Ilya, On Mon, Sep 04, 2017 at 01:14:01PM +0300, Ilya Lesokhin wrote: > The tls ulp overrides sk->prot with new tls-specific proto structs. > The tls-specific structs were previously based on the ipv4-specific > tcp_prot struct. > As a result, attaching the tls ulp to an ipv6 tcp socket replaced > some ipv6 callbacks with the ipv4 equivalents. > > This patch adds ipv6 tls proto structs and uses them when > attached to ipv6 sockets. > Do you plan to update this patch with the missing TCPv6 support ? As far as I can see, the part that was accepted upstream does not fix the TCPv6 protocol issue which triggers CVE-2018-5703. If adding IPv6 support is not possible/acceptable, would it make sense to limit TLS support to TCPv4, i.e. add something like if (sk->sk_prot != &tcp_prot) return -EINVAL; to tls_init() ? Thanks, Guenter > Fixes: 3c4d7559159b ('tls: kernel TLS support') > Signed-off-by: Boris Pismenny > Signed-off-by: Ilya Lesokhin > --- > net/tls/Kconfig| 1 + > net/tls/tls_main.c | 50 ++ > 2 files changed, 39 insertions(+), 12 deletions(-) > > diff --git a/net/tls/Kconfig b/net/tls/Kconfig > index eb58303..7e9cf8b 100644 > --- a/net/tls/Kconfig > +++ b/net/tls/Kconfig > @@ -7,6 +7,7 @@ config TLS > select CRYPTO > select CRYPTO_AES > select CRYPTO_GCM > + select IPV6 > default n > ---help--- > Enable kernel support for TLS protocol. 
This allows symmetric > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c > index 60aff60..33c499e 100644 > --- a/net/tls/tls_main.c > +++ b/net/tls/tls_main.c > @@ -40,13 +40,25 @@ > #include > > #include > +#include > > MODULE_AUTHOR("Mellanox Technologies"); > MODULE_DESCRIPTION("Transport Layer Security Support"); > MODULE_LICENSE("Dual BSD/GPL"); > > -static struct proto tls_base_prot; > -static struct proto tls_sw_prot; > +enum { > + TLSV4, > + TLSV6, > + TLS_NUM_PROTS, > +}; > + > +enum { > + TLS_BASE_TX, > + TLS_SW_TX, > + TLS_NUM_CONFIG, > +}; > + > +static struct proto tls_prots[TLS_NUM_PROTS][TLS_NUM_CONFIG]; > > int wait_on_pending_writer(struct sock *sk, long *timeo) > { > @@ -342,6 +354,7 @@ static int do_tls_setsockopt_tx(struct sock *sk, char > __user *optval, > struct tls_context *ctx = tls_get_ctx(sk); > struct proto *prot = NULL; > int rc = 0; > + int ip_ver = sk->sk_family == AF_INET6 ? TLSV6 : TLSV4; > > if (!optval || (optlen < sizeof(*crypto_info))) { > rc = -EINVAL; > @@ -396,7 +409,7 @@ static int do_tls_setsockopt_tx(struct sock *sk, char > __user *optval, > > /* currently SW is default, we will have ethtool in future */ > rc = tls_set_sw_offload(sk, ctx); > - prot = &tls_sw_prot; > + prot = &tls_prots[ip_ver][TLS_SW_TX]; > if (rc) > goto err_crypto_info; > > @@ -443,6 +456,12 @@ static int tls_init(struct sock *sk) > struct inet_connection_sock *icsk = inet_csk(sk); > struct tls_context *ctx; > int rc = 0; > + int ip_ver = TLSV4; > + > + if (sk->sk_prot == &tcpv6_prot) > + ip_ver = TLSV6; > + else if (sk->sk_prot != &tcp_prot) > + return -EINVAL; > > /* allocate tls context */ > ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); > @@ -453,7 +472,8 @@ static int tls_init(struct sock *sk) > icsk->icsk_ulp_data = ctx; > ctx->setsockopt = sk->sk_prot->setsockopt; > ctx->getsockopt = sk->sk_prot->getsockopt; > - sk->sk_prot = &tls_base_prot; > + > + sk->sk_prot = &tls_prots[ip_ver][TLS_BASE_TX]; > out: > return rc; > } > @@ -464,16 +484,22 @@ static 
int tls_init(struct sock *sk) > .init = tls_init, > }; > > +static void build_protos(struct proto *prot, struct proto *base) > +{ > + prot[TLS_BASE_TX] = *base; > + prot[TLS_BASE_TX].setsockopt = tls_setsockopt; > + prot[TLS_BASE_TX].getsockopt = tls_getsockopt; > + > + prot[TLS_SW_TX] = prot[TLS_BASE_TX]; > + prot[TLS_SW_TX].close = tls_sk_proto_close; > + prot[TLS_SW_TX].sendmsg = tls_sw_sendmsg; > + prot[TLS_SW_TX].sendpage= tls_sw_sendpage; > +} > + > static int __init tls_register(void) > { > - tls_base_prot = tcp_prot; > - tls_base_prot.setsockopt= tls_setsockopt; > - tls_base_prot.getsockopt= tls_getsockopt; > - > - tls_sw_prot = tls_base_prot; > - tls_sw_prot.sendmsg = tls_sw_sendmsg; > - tls_sw_prot.sendpage= tls_sw_sendpage; > - tls_sw_prot.close = tls_sk_proto_close; > + build_protos(tls_prots[TLSV4], &tcp_prot); > + build_protos(tls_prots[TLSV6], &tcpv6_prot); > > tcp_register_ulp(&tcp_tls_ulp_ops); >
[PATCH net-next] r8169: improve interrupt handling
This patch improves a few aspects of interrupt handling: - update to current interrupt allocation API (use pci_alloc_irq_vectors() instead of deprecated pci_enable_msi()) - this implicitly will allocate a MSI-X interrupt if available - get rid of flag RTL_FEATURE_MSI Last but not least it enables a feature which was (I presume accidentally) disabled before. There are members of the RTL8169 family supporting MSI (e.g. RTL8169SB), however MSI never got enabled because RTL_CFG_0 was missing flag RTL_FEATURE_MSI. An indicator for "accidentally" is that statement "cfg2 |= MSIEnable;" in rtl_try_msi() is dead code. cfg2 is used for chip versions <= 06 only and for all these chip versions RTL_FEATURE_MSI isn't set. The patch works fine on a RTL8168evl (chip version 34). The previously accidentally disabled MSI support for the few members of the old RTL8169 family supporting MSI isn't tested. The RTL8169SB (chip version 04) I have at least still works in a PCI slot w/o MSI. Signed-off-by: Heiner Kallweit --- drivers/net/ethernet/realtek/r8169.c | 44 1 file changed, 20 insertions(+), 24 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 96db3283e..4730db990 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -736,8 +736,7 @@ struct ring_info { }; enum features { - RTL_FEATURE_MSI = (1 << 0), - RTL_FEATURE_GMII= (1 << 1), + RTL_FEATURE_GMII= (1 << 0), }; struct rtl8169_counters { @@ -7847,7 +7846,7 @@ static int rtl8169_close(struct net_device *dev) cancel_work_sync(&tp->wk.work); - free_irq(pdev->irq, dev); + pci_free_irq(pdev, 0, dev); dma_free_coherent(&pdev->dev, R8169_RX_RING_BYTES, tp->RxDescArray, tp->RxPhyAddr); @@ -7903,9 +7902,8 @@ static int rtl_open(struct net_device *dev) rtl_request_firmware(tp); - retval = request_irq(pdev->irq, rtl8169_interrupt, -(tp->features & RTL_FEATURE_MSI) ? 
0 : IRQF_SHARED, -dev->name, dev); + retval = pci_request_irq(pdev, 0, rtl8169_interrupt, NULL, dev, +dev->name); if (retval < 0) goto err_release_fw_2; @@ -8253,7 +8251,7 @@ static const struct rtl_cfg_info { .region = 2, .align = 8, .event_slow = SYSErr | LinkChg | RxOverflow, - .features = RTL_FEATURE_GMII | RTL_FEATURE_MSI, + .features = RTL_FEATURE_GMII, .coalesce_info = rtl_coalesce_info_8168_8136, .default_ver= RTL_GIGA_MAC_VER_11, }, @@ -8263,32 +8261,32 @@ static const struct rtl_cfg_info { .align = 8, .event_slow = SYSErr | LinkChg | RxOverflow | RxFIFOOver | PCSTimeout, - .features = RTL_FEATURE_MSI, .coalesce_info = rtl_coalesce_info_8168_8136, .default_ver= RTL_GIGA_MAC_VER_13, } }; /* Cfg9346_Unlock assumed. */ -static unsigned rtl_try_msi(struct rtl8169_private *tp, - const struct rtl_cfg_info *cfg) +static void rtl_alloc_irq(struct rtl8169_private *tp) { void __iomem *ioaddr = tp->mmio_addr; - unsigned msi = 0; u8 cfg2; + int ret; - cfg2 = RTL_R8(Config2) & ~MSIEnable; - if (cfg->features & RTL_FEATURE_MSI) { - if (pci_enable_msi(tp->pci_dev)) { - netif_info(tp, hw, tp->dev, "no MSI. 
Back to INTx.\n"); - } else { - cfg2 |= MSIEnable; - msi = RTL_FEATURE_MSI; - } + ret = pci_alloc_irq_vectors(tp->pci_dev, 1, 1, PCI_IRQ_ALL_TYPES); + if (ret < 0) { + netif_err(tp, drv, tp->dev, "failed to allocate irq!\n"); + return; } - if (tp->mac_version <= RTL_GIGA_MAC_VER_06) + + if (tp->mac_version <= RTL_GIGA_MAC_VER_06) { + RTL_W8(Cfg9346, Cfg9346_Unlock); + cfg2 = RTL_R8(Config2) & ~MSIEnable; + if (pci_dev_msi_enabled(tp->pci_dev)) + cfg2 |= MSIEnable; RTL_W8(Config2, cfg2); - return msi; + RTL_W8(Cfg9346, Cfg9346_Lock); + } } DECLARE_RTL_COND(rtl_link_list_ready_cond) @@ -8497,9 +8495,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) chipset = tp->mac_version; tp->txd_version = rtl_chip_infos[chipset].txd_version; - RTL_W8(Cfg9346, Cfg9346_Unlock); - tp->features |= rtl_try_msi(tp, cfg); - RTL_W8(Cfg9346, Cfg9346_Lock); + rtl_alloc_irq(tp); /* override BIOS settings, use userspace tools to enable WOL */
[PATCH bpf] bpf: allow xadd only on aligned memory
The requirements around atomic_add() / atomic64_add() resp. their JIT implementations differ across architectures. E.g. while x86_64 seems just fine with BPF's xadd on unaligned memory, on arm64 it triggers via interpreter but also JIT the following crash: [ 830.864985] Unable to handle kernel paging request at virtual address 8097d7ed6703 [...] [ 830.916161] Internal error: Oops: 9621 [#1] SMP [ 830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8 [ 830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017 [ 830.998998] pstate: 8045 (Nzcv daif +PAN -UAO) [ 831.003793] pc : __ll_sc_atomic_add+0x4/0x18 [ 831.008055] lr : ___bpf_prog_run+0x1198/0x1588 [ 831.012485] sp : 1ccabc20 [ 831.015786] x29: 1ccabc20 x28: 8017d56a0f00 [ 831.021087] x27: 0001 x26: [ 831.026387] x25: 00c168d9db98 x24: [ 831.031686] x23: 08203878 x22: 09488000 [ 831.036986] x21: 08b14e28 x20: 1ccabcb0 [ 831.042286] x19: 097b5080 x18: 0a03 [ 831.047585] x17: x16: [ 831.052885] x15: aeca8000 x14: [ 831.058184] x13: x12: [ 831.063484] x11: 0001 x10: [ 831.068783] x9 : x8 : [ 831.074083] x7 : x6 : 000580d42800 [ 831.079383] x5 : 0018 x4 : [ 831.084682] x3 : 1ccabcb0 x2 : 0001 [ 831.089982] x1 : 8097d7ed6703 x0 : 0001 [ 831.095282] Process test_verifier (pid: 2788, stack limit = 0x18370044) [ 831.102577] Call trace: [ 831.105012] __ll_sc_atomic_add+0x4/0x18 [ 831.108923] __bpf_prog_run32+0x4c/0x70 [ 831.112748] bpf_test_run+0x78/0xf8 [ 831.116224] bpf_prog_test_run_xdp+0xb4/0x120 [ 831.120567] SyS_bpf+0x77c/0x1110 [ 831.123873] el0_svc_naked+0x30/0x34 [ 831.127437] Code: 97fffe97 17ec f9800031 (885f7c31) Reason for this is because memory is required to be aligned. In case of BPF, we always enforce alignment in terms of stack access, but not when accessing map values or packet data when the underlying arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set. xadd on packet data that is local to us anyway is just wrong, so forbid this case entirely. 
The only place where xadd makes sense in fact are map values; xadd on stack is wrong as well, but it's been around for much longer. Specifically enforce strict alignment in case of xadd, so that we handle this case generically and avoid such crashes in the first place. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Daniel Borkmann --- kernel/bpf/verifier.c | 42 + tools/testing/selftests/bpf/test_verifier.c | 58 + 2 files changed, 84 insertions(+), 16 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 5fb69a8..c6eff10 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1356,6 +1356,13 @@ static bool is_ctx_reg(struct bpf_verifier_env *env, int regno) return reg->type == PTR_TO_CTX; } +static bool is_pkt_reg(struct bpf_verifier_env *env, int regno) +{ + const struct bpf_reg_state *reg = cur_regs(env) + regno; + + return type_is_pkt_pointer(reg->type); +} + static int check_pkt_ptr_alignment(struct bpf_verifier_env *env, const struct bpf_reg_state *reg, int off, int size, bool strict) @@ -1416,10 +1423,10 @@ static int check_generic_ptr_alignment(struct bpf_verifier_env *env, } static int check_ptr_alignment(struct bpf_verifier_env *env, - const struct bpf_reg_state *reg, - int off, int size) + const struct bpf_reg_state *reg, int off, + int size, bool strict_alignment_once) { - bool strict = env->strict_alignment; + bool strict = env->strict_alignment || strict_alignment_once; const char *pointer_desc = ""; switch (reg->type) { @@ -1576,9 +1583,9 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size) * if t==write && value_regno==-1, some unknown value is stored into memory * if t==read && value_regno==-1, don't care what we read from memory */ -static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regno, int off, - int bpf_size, enum bpf_access_type t, - int value_regno) +static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regno, + int 
off, int bpf_size, enum bpf_access_type t, + int value_regno,
Re: VRF destination unreachable
On 2/23/18 10:49 AM, Stephen Suryaputra wrote: > Greetings, > > We found that ICMP destination unreachable isn't sent if VRF > forwarding isn't configured, i.e. > /proc/sys/net/ipv4/conf//forwarding isn't set. The > relevant code is: > > static int ip_error(struct sk_buff *skb) > { > ... > // in_dev is the vrf net_device > if (!IN_DEV_FORWARD(in_dev)) { > switch (rt->dst.error) { > case EHOSTUNREACH: > __IP_INC_STATS(net, IPSTATS_MIB_INADDRERRORS); > break; > > case ENETUNREACH: > __IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES); > break; > } > goto out; > } > ... > out:kfree_skb(skb); > return 0; > } > > The question: is it intended to be set? The basic forwarding seems to > be working without. We do set it on the slave net devices. Unintended side effect of VRF as a netdev. This should fix it: diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 5ca7415cd48c..d59d005fb7c5 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -944,7 +944,7 @@ static int ip_error(struct sk_buff *skb) goto out; net = dev_net(rt->dst.dev); - if (!IN_DEV_FORWARD(in_dev)) { + if (!IN_DEV_FORWARD(in_dev) && !netif_is_l3_master(skb->dev)) { switch (rt->dst.error) { case EHOSTUNREACH: __IP_INC_STATS(net, IPSTATS_MIB_INADDRERRORS);
Re: [PATCH] net: Allow mac_pton() to work on non-NULL terminated strings
On Fri, 2018-02-23 at 23:41 +0300, Alexey Dobriyan wrote: > On Fri, Feb 23, 2018 at 09:17:48PM +0100, Stefan Hellermann wrote: > > @@ -8,10 +8,6 @@ bool mac_pton(const char *s, u8 *mac) > > { > > int i; > > > > - /* XX:XX:XX:XX:XX:XX */ > > - if (strlen(s) < 3 * ETH_ALEN - 1) > > - return false; > > - > > /* Don't dirty result unless string is valid MAC. */ > > for (i = 0; i < ETH_ALEN; i++) { > > if (!isxdigit(s[i * 3]) || !isxdigit(s[i * 3 + 1])) > > Short string will bail in the loop, indeed. > > Reviewed-by: Alexey Dobriyan Since the author is okay with the change, I'm following: Reviewed-by: Andy Shevchenko -- Andy Shevchenko Intel Finland Oy
Re: [PATCH] net: fib_rules: Add new attribute to set protocol
From: Donald Sharp Date: Fri, 23 Feb 2018 14:01:52 -0500 > For ages iproute2 has used `struct rtmsg` as the ancillary header for > FIB rules and in the process set the protocol value to RTPROT_BOOT. > Until ca56209a66 ("net: Allow a rule to track originating protocol") > the kernel rules code ignored the protocol value sent from userspace > and always returned 0 in notifications. To avoid incompatibility with > existing iproute2, send the protocol as a new attribute. > > Fixes: cac56209a66 ("net: Allow a rule to track originating protocol") > Signed-off-by: Donald Sharp Applied, thanks Donald.
Re: [PATCH] net: Allow mac_pton() to work on non-NULL terminated strings
On Fri, Feb 23, 2018 at 09:17:48PM +0100, Stefan Hellermann wrote: > @@ -8,10 +8,6 @@ bool mac_pton(const char *s, u8 *mac) > { > int i; > > - /* XX:XX:XX:XX:XX:XX */ > - if (strlen(s) < 3 * ETH_ALEN - 1) > - return false; > - > /* Don't dirty result unless string is valid MAC. */ > for (i = 0; i < ETH_ALEN; i++) { > if (!isxdigit(s[i * 3]) || !isxdigit(s[i * 3 + 1])) Short string will bail in the loop, indeed. Reviewed-by: Alexey Dobriyan
[net] ixgbe: fix crash in build_skb Rx code path
From: Emil Tantilov Add check for build_skb enabled ring in ixgbe_dma_sync_frag(). In that case &skb_shinfo(skb)->frags[0] may not always be set which can lead to a crash. Instead we derive the page offset from skb->data. Fixes: 42073d91a214 ("ixgbe: Have the CPU take ownership of the buffers sooner") CC: stable Reported-by: Ambarish Soman Suggested-by: Alexander Duyck Signed-off-by: Emil Tantilov Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 0da5aa2c8aba..9fc063af233c 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1888,6 +1888,14 @@ static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring, ixgbe_rx_pg_size(rx_ring), DMA_FROM_DEVICE, IXGBE_RX_DMA_ATTR); + } else if (ring_uses_build_skb(rx_ring)) { + unsigned long offset = (unsigned long)(skb->data) & ~PAGE_MASK; + + dma_sync_single_range_for_cpu(rx_ring->dev, + IXGBE_CB(skb)->dma, + offset, + skb_headlen(skb), + DMA_FROM_DEVICE); } else { struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0]; -- 2.14.3
Re: [PATCH] net: Allow mac_pton() to work on non-NULL terminated strings
On Fri, Feb 23, 2018 at 09:17:48PM +0100, Stefan Hellermann wrote: > Commit 4904dbda41c8 ("ARM: orion5x: use mac_pton() helper") crashes my > QNAP TS-209 NAS early on boot. > > The boot code for the TS-209 is looping through an ext2 filesystem on a > 384kB mtd partition (factory configuration put there by QNAP). There it > looks on every 1kB boundary if there is a valid MAC address. The > filesystem has a 1kB block size, so this seems to work. > > On my device the MAC address is on the 37th 1kB block. But: On the 27th > block is a large file (1,5kB) without 0 bytes inside. The code in > qnap_tsx09_find_mac_addr() maps 1kB into memory (not a whole file or the > whole 384kB) and then calls qnap_tsx09_check_mac_addr() -> mac_pton() -> > strlen() on this memory block. as there is no 0 byte in the file on the > 27th block, strlen() runs into bad memory and the machine panics. The old > code had no strlen(). > > Actually mac_pton() doesn't need to call strlen(), the following loop > catches short strings quite nicely. The strlen() seems to be an > optimization for calls to mac_pton with empty string. But this is rarely > the case and this is not a hot path. Remove it to reduce code size and > speed up calls with an not empty string. > > Besides fixing the crash there is are other users interested in > this change, see https://patchwork.ozlabs.org/patch/851008/ > > Fixes: 4904dbda41c8 ("ARM: orion5x: use mac_pton() helper") > Signed-off-by: Stefan Hellermann > Cc: [4.4+] Reviewed-by: Andrew Lunn Andrew
[PATCH] net: Allow mac_pton() to work on non-NULL terminated strings
Commit 4904dbda41c8 ("ARM: orion5x: use mac_pton() helper") crashes my QNAP TS-209 NAS early on boot. The boot code for the TS-209 is looping through an ext2 filesystem on a 384kB mtd partition (factory configuration put there by QNAP). There it looks on every 1kB boundary if there is a valid MAC address. The filesystem has a 1kB block size, so this seems to work. On my device the MAC address is on the 37th 1kB block. But: On the 27th block is a large file (1,5kB) without 0 bytes inside. The code in qnap_tsx09_find_mac_addr() maps 1kB into memory (not a whole file or the whole 384kB) and then calls qnap_tsx09_check_mac_addr() -> mac_pton() -> strlen() on this memory block. As there is no 0 byte in the file on the 27th block, strlen() runs into bad memory and the machine panics. The old code had no strlen(). Actually mac_pton() doesn't need to call strlen(); the following loop catches short strings quite nicely. The strlen() seems to be an optimization for calls to mac_pton with an empty string. But this is rarely the case and this is not a hot path. Remove it to reduce code size and speed up calls with a non-empty string. Besides fixing the crash there are other users interested in this change, see https://patchwork.ozlabs.org/patch/851008/ Fixes: 4904dbda41c8 ("ARM: orion5x: use mac_pton() helper") Signed-off-by: Stefan Hellermann Cc: [4.4+] --- lib/net_utils.c | 4 1 file changed, 4 deletions(-) diff --git a/lib/net_utils.c b/lib/net_utils.c index af525353395d..9d38da67ee44 100644 --- a/lib/net_utils.c +++ b/lib/net_utils.c @@ -8,10 +8,6 @@ bool mac_pton(const char *s, u8 *mac) { int i; - /* XX:XX:XX:XX:XX:XX */ - if (strlen(s) < 3 * ETH_ALEN - 1) - return false; - /* Don't dirty result unless string is valid MAC. */ for (i = 0; i < ETH_ALEN; i++) { if (!isxdigit(s[i * 3]) || !isxdigit(s[i * 3 + 1])) -- 2.11.0
[PATCH net 1/1] tc: python3, string formattings
This patch converts old-style string formattings to new-style string formattings to adapt the Linux Traffic Control (tc) unit testing suite to Python 3. The suite's code quality is improved with this patch. According to the Python documentation, "The built-in string class provides the ability to do complex variable substitutions and value formatting via the format() method described in PEP 3101." But the project was using old-style and new-style string formattings together; this patch's main purpose is converting all old-style usages to the new style. Following files changed: 1. tools/testing/selftests/tc-testing/tdc.py 2. tools/testing/selftests/tc-testing/tdc_batch.py Following PEP rules applied: 1. PEP8 - Code Styling 2. PEP3101 - Advanced Code Formatting Signed-off-by: Batuhan Osman Taskaya --- tools/testing/selftests/tc-testing/tdc.py | 2 +- tools/testing/selftests/tc-testing/tdc_batch.py | 8 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/tc-testing/tdc.py b/tools/testing/selftests/tc-testing/tdc.py index fc373fdf2bdc..80c75464c67e 100755 --- a/tools/testing/selftests/tc-testing/tdc.py +++ b/tools/testing/selftests/tc-testing/tdc.py @@ -277,7 +277,7 @@ def generate_case_ids(alltests): for c in alltests: if (c["id"] == ""): while True: -newid = str('%04x' % random.randrange(16**4)) +newid = str('{:04x}'.format(random.randrange(16**4))) if (does_id_exist(alltests, newid)): continue else: diff --git a/tools/testing/selftests/tc-testing/tdc_batch.py b/tools/testing/selftests/tc-testing/tdc_batch.py index 707c6bfef689..52fa539dc662 100755 --- a/tools/testing/selftests/tc-testing/tdc_batch.py +++ b/tools/testing/selftests/tc-testing/tdc_batch.py @@ -49,13 +49,13 @@ index = 0 for i in range(0x100): for j in range(0x100): for k in range(0x100): -mac = ("%02x:%02x:%02x" % (i, j, k)) +mac = ("{:02x}:{:02x}:{:02x}".format(i, j, k)) src_mac = "e4:11:00:" + mac dst_mac = "e4:12:00:" + mac -cmd = 
("filter add dev %s %s protocol ip parent : flower %s " - "src_mac %s dst_mac %s action drop %s" % +cmd = ("filter add dev {} {} protocol ip parent : flower {} " + "src_mac {} dst_mac {} action drop {}".format (device, prio, skip, src_mac, dst_mac, share_action)) -file.write("%s\n" % cmd) +file.write("{}\n".format(cmd)) index += 1 if index >= number: file.close() -- 2.16.1
[PATCH 1/1] Linux Traffic Control (tc) unit testing suite's code quality improved, String formattings updated. According to python documentation "The built-in string class provides the ability to do c
Signed-off-by: Batuhan Osman Taskaya --- tools/testing/selftests/tc-testing/tdc.py | 2 +- tools/testing/selftests/tc-testing/tdc_batch.py | 8 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/tc-testing/tdc.py b/tools/testing/selftests/tc-testing/tdc.py index fc373fdf2bdc..80c75464c67e 100755 --- a/tools/testing/selftests/tc-testing/tdc.py +++ b/tools/testing/selftests/tc-testing/tdc.py @@ -277,7 +277,7 @@ def generate_case_ids(alltests): for c in alltests: if (c["id"] == ""): while True: -newid = str('%04x' % random.randrange(16**4)) +newid = str('{:04x}'.format(random.randrange(16**4))) if (does_id_exist(alltests, newid)): continue else: diff --git a/tools/testing/selftests/tc-testing/tdc_batch.py b/tools/testing/selftests/tc-testing/tdc_batch.py index 707c6bfef689..52fa539dc662 100755 --- a/tools/testing/selftests/tc-testing/tdc_batch.py +++ b/tools/testing/selftests/tc-testing/tdc_batch.py @@ -49,13 +49,13 @@ index = 0 for i in range(0x100): for j in range(0x100): for k in range(0x100): -mac = ("%02x:%02x:%02x" % (i, j, k)) +mac = ("{:02x}:{:02x}:{:02x}".format(i, j, k)) src_mac = "e4:11:00:" + mac dst_mac = "e4:12:00:" + mac -cmd = ("filter add dev %s %s protocol ip parent : flower %s " - "src_mac %s dst_mac %s action drop %s" % +cmd = ("filter add dev {} {} protocol ip parent : flower {} " + "src_mac {} dst_mac {} action drop {}".format (device, prio, skip, src_mac, dst_mac, share_action)) -file.write("%s\n" % cmd) +file.write("{}\n".format(cmd)) index += 1 if index >= number: file.close() -- 2.16.1
[PATCH v3 iproute2-next 1/3] ip: Use the `struct fib_rule_hdr` for rules
The iprule.c code was using `struct rtmsg` as the data type to pass into the kernel for the netlink message. While 'struct rtmsg' and `struct fib_rule_hdr` are the same size and mostly the same, we should use the correct data structure. This commit translates the data structures to have iprule.c use the correct one. Additionally copy over the modified fib_rules.h file Signed-off-by: Donald Sharp --- include/uapi/linux/fib_rules.h | 1 + ip/iprule.c| 128 + 2 files changed, 68 insertions(+), 61 deletions(-) diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h index 2b642bf9..9477c3af 100644 --- a/include/uapi/linux/fib_rules.h +++ b/include/uapi/linux/fib_rules.h @@ -58,6 +58,7 @@ enum { FRA_PAD, FRA_L3MDEV, /* iif or oif is l3mdev goto its table */ FRA_UID_RANGE, /* UID range */ + FRA_PROTOCOL, __FRA_MAX }; diff --git a/ip/iprule.c b/ip/iprule.c index a3abf2f6..94356bf8 100644 --- a/ip/iprule.c +++ b/ip/iprule.c @@ -73,25 +73,33 @@ static struct inet_prefix dst; } filter; +static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb) +{ + __u32 table = frh->table; + if (tb[RTA_TABLE]) + table = rta_getattr_u32(tb[RTA_TABLE]); + return table; +} + static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) { - struct rtmsg *r = NLMSG_DATA(n); + struct fib_rule_hdr *frh = NLMSG_DATA(n); __u32 table; - if (preferred_family != AF_UNSPEC && r->rtm_family != preferred_family) + if (preferred_family != AF_UNSPEC && frh->family != preferred_family) return false; if (filter.prefmask && filter.pref ^ (tb[FRA_PRIORITY] ? 
rta_getattr_u32(tb[FRA_PRIORITY]) : 0)) return false; - if (filter.not && !(r->rtm_flags & FIB_RULE_INVERT)) + if (filter.not && !(frh->flags & FIB_RULE_INVERT)) return false; if (filter.src.family) { inet_prefix *f_src = &filter.src; - if (f_src->family != r->rtm_family || - f_src->bitlen > r->rtm_src_len) + if (f_src->family != frh->family || + f_src->bitlen > frh->src_len) return false; if (inet_addr_match_rta(f_src, tb[FRA_SRC])) @@ -101,15 +109,15 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) if (filter.dst.family) { inet_prefix *f_dst = &filter.dst; - if (f_dst->family != r->rtm_family || - f_dst->bitlen > r->rtm_dst_len) + if (f_dst->family != frh->family || + f_dst->bitlen > frh->dst_len) return false; if (inet_addr_match_rta(f_dst, tb[FRA_DST])) return false; } - if (filter.tosmask && filter.tos ^ r->rtm_tos) + if (filter.tosmask && filter.tos ^ frh->tos) return false; if (filter.fwmark) { @@ -159,7 +167,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) return false; } - table = rtm_get_table(r, tb); + table = frh_get_table(frh, tb); if (filter.tb > 0 && filter.tb ^ table) return false; @@ -169,7 +177,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) { FILE *fp = (FILE *)arg; - struct rtmsg *r = NLMSG_DATA(n); + struct fib_rule_hdr *frh = NLMSG_DATA(n); int len = n->nlmsg_len; int host_len = -1; __u32 table; @@ -180,13 +188,13 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (n->nlmsg_type != RTM_NEWRULE && n->nlmsg_type != RTM_DELRULE) return 0; - len -= NLMSG_LENGTH(sizeof(*r)); + len -= NLMSG_LENGTH(sizeof(*frh)); if (len < 0) return -1; - parse_rtattr(tb, FRA_MAX, RTM_RTA(r), len); + parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len); - host_len = af_bit_len(r->rtm_family); + host_len = af_bit_len(frh->family); if (!filter_nlmsg(n, tb, host_len)) 
return 0; @@ -200,41 +208,41 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) else fprintf(fp, "0:\t"); - if (r->rtm_flags & FIB_RULE_INVERT) + if (frh->flags & FIB_RULE_INVERT) fprintf(fp, "not "); if (tb[FRA_SRC]) { - if (r->rtm_src_len != host_len) { + if (frh->src_len != host_len) { fprintf(fp, "from %s/%u ", - rt_addr_n2a_rta(r->rtm_family, tb[FRA_SRC]), -
[PATCH v3 iproute2-next 2/3] ip: Display ip rule protocol used
Modify 'ip rule' command to notice when the kernel passes to us the originating protocol. Add code to allow the `ip rule flush protocol XXX` command to be accepted and properly handled. Modify the documentation to reflect these code changes. Signed-off-by: Donald Sharp --- ip/iprule.c| 36 +--- man/man8/ip-rule.8 | 18 +- 2 files changed, 46 insertions(+), 8 deletions(-) diff --git a/ip/iprule.c b/ip/iprule.c index 94356bf8..17df9e9b 100644 --- a/ip/iprule.c +++ b/ip/iprule.c @@ -47,6 +47,7 @@ static void usage(void) "[ iif STRING ] [ oif STRING ] [ pref NUMBER ] [ l3mdev ]\n" "[ uidrange NUMBER-NUMBER ]\n" "ACTION := [ table TABLE_ID ]\n" + " [ protocol PROTO ]\n" " [ nat ADDRESS ]\n" " [ realms [SRCREALM/]DSTREALM ]\n" " [ goto NUMBER ]\n" @@ -71,6 +72,8 @@ static struct struct fib_rule_uid_range range; inet_prefix src; inet_prefix dst; + int protocol; + int protocolmask; } filter; static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb) @@ -338,6 +341,16 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) rtnl_rtntype_n2a(frh->action, b1, sizeof(b1))); + if (tb[FRA_PROTOCOL]) { + __u8 protocol = rta_getattr_u8(tb[FRA_PROTOCOL]); + + if ((protocol && protocol != RTPROT_KERNEL) || + show_details > 0) { + fprintf(fp, " proto %s ", + rtnl_rtprot_n2a(protocol, b1, sizeof(b1))); + } + } + fprintf(fp, "\n"); fflush(fp); return 0; @@ -391,6 +404,10 @@ static int flush_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len); + if (tb[FRA_PROTOCOL] && + (filter.protocol^rta_getattr_u8(tb[FRA_PROTOCOL]))&filter.protocolmask) + return 0; + if (tb[FRA_PRIORITY]) { n->nlmsg_type = RTM_DELRULE; n->nlmsg_flags = NLM_F_REQUEST; @@ -415,12 +432,6 @@ static int iprule_list_flush_or_save(int argc, char **argv, int action) if (af == AF_UNSPEC) af = AF_INET; - if (action != IPRULE_LIST && argc > 0) { - fprintf(stderr, "\"ip rule %s\" does not take any arguments.\n", - action == IPRULE_SAVE ? 
"save" : "flush"); - return -1; - } - switch (action) { case IPRULE_SAVE: if (save_rule_prep()) @@ -508,7 +519,18 @@ static int iprule_list_flush_or_save(int argc, char **argv, int action) NEXT_ARG(); if (get_prefix(&filter.src, *argv, af)) invarg("from value is invalid\n", *argv); - } else { + } else if (matches(*argv, "protocol") == 0) { + __u32 prot; + NEXT_ARG(); + filter.protocolmask = -1; + if (rtnl_rtprot_a2n(&prot, *argv)) { + if (strcmp(*argv, "all") != 0) + invarg("invalid \"protocol\"\n", *argv); + prot = 0; + filter.protocolmask = 0; + } + filter.protocol = prot; + } else{ if (matches(*argv, "dst") == 0 || matches(*argv, "to") == 0) { NEXT_ARG(); diff --git a/man/man8/ip-rule.8 b/man/man8/ip-rule.8 index a5c47981..f4070542 100644 --- a/man/man8/ip-rule.8 +++ b/man/man8/ip-rule.8 @@ -50,6 +50,8 @@ ip-rule \- routing policy database management .IR ACTION " := [ " .B table .IR TABLE_ID " ] [ " +.B protocol +.IR PROTO " ] [ " .B nat .IR ADDRESS " ] [ " .B realms @@ -240,6 +242,10 @@ The options preference and order are synonyms with priority. the routing table identifier to lookup if the rule selector matches. It is also possible to use lookup instead of table. +.TP +.BI protocol " PROTO" +the protocol who installed the rule in question. + .TP .BI suppress_prefixlength " NUMBER" reject routing decisions that have a prefix length of NUMBER or less. @@ -275,7 +281,11 @@ updates, it flushes the routing cache with .RE .TP .B ip rule flush - also dumps all the deleted rules. -This command has no arguments. +.RS +.TP +.BI protocol " PROTO" +Select the originating protocol. +.RE .TP .B ip rule show - list rules This command has no arguments. @@ -283,6 +293,12 @@ The options list or lst are synonyms with show. .TP .B ip rule save +.RS +.TP +.BI protocl " PROTO" +Select the
[PATCH v3 iproute2-next 3/3] ip: Allow rules to accept a specified protocol
Allow the specification of a protocol when the user adds/modifies/deletes a rule. Signed-off-by: Donald Sharp --- ip/iprule.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/ip/iprule.c b/ip/iprule.c index 17df9e9b..796da3b3 100644 --- a/ip/iprule.c +++ b/ip/iprule.c @@ -689,6 +689,12 @@ static int iprule_modify(int cmd, int argc, char **argv) if (get_rt_realms_or_raw(&realm, *argv)) invarg("invalid realms\n", *argv); addattr32(&req.n, sizeof(req), FRA_FLOW, realm); + } else if (matches(*argv, "protocol") == 0) { + __u32 proto; + NEXT_ARG(); + if (rtnl_rtprot_a2n(&proto, *argv)) + invarg("\"protocol\" value is invalid\n", *argv); + addattr8(&req.n, sizeof(req), FRA_PROTOCOL, proto); } else if (matches(*argv, "table") == 0 || strcmp(*argv, "lookup") == 0) { NEXT_ARG(); -- 2.14.3
[PATCH v3 iproute2-next 0/3] Allow 'ip rule' command to use protocol
Fix iprule.c to use the actual `struct fib_rule_hdr` and to allow the end user to see and use the protocol keyword for rule manipulation. v2: Rearrange and code changes as per David Ahern v3: Fix some missed RTN_XXX to appropriate FR_XX and doc changes Donald Sharp (3): ip: Use the `struct fib_rule_hdr` for rules ip: Display ip rule protocol used ip: Allow rules to accept a specified protocol include/uapi/linux/fib_rules.h | 1 + ip/iprule.c| 170 - man/man8/ip-rule.8 | 18 - 3 files changed, 120 insertions(+), 69 deletions(-) -- 2.14.3
Re: [PATCH iproute2 v2] ip: Properly display AF_BRIDGE address information for neighbor events
On Fri, 23 Feb 2018 14:10:09 -0500 Donald Sharp wrote: > The vxlan driver when a neighbor add/delete event occurs sends > NDA_DST filled with a union: > > union vxlan_addr { > struct sockaddr_in sin; > struct sockaddr_in6 sin6; > struct sockaddr sa; > }; > > This eventually calls rt_addr_n2a_r which had no handler for the > AF_BRIDGE family and "???" was being printed. > > Add code to properly display this data when requested. > > Signed-off-by: Donald Sharp I made some minor changes to the patch and applied it (needed a space after "switch", and the local variable "family" isn't needed). Thanks.
Re: [for-next 7/7] IB/mlx5: Implement fragmented completion queue (CQ)
On Thu, 2018-02-22 at 16:04 -0800, Santosh Shilimkar wrote: > Hi Saeed > > On 2/21/2018 12:13 PM, Saeed Mahameed wrote: > > From: Yonatan Cohen > > > > The current implementation of create CQ requires contiguous > > memory, such a requirement is problematic once the memory is > > fragmented or the system is low on memory, it causes > > failures in dma_zalloc_coherent(). > > > > This patch implements a new scheme of fragmented CQ to overcome > > this issue by introducing new type: 'struct mlx5_frag_buf_ctrl' > > to allocate fragmented buffers, rather than contiguous ones. > > > > Base the Completion Queues (CQs) on this new fragmented buffer. > > > > It fixes the following crashes: > > kworker/29:0: page allocation failure: order:6, mode:0x80d0 > > CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0 > > Workqueue: ib_cm cm_work_handler [ib_cm] > > Call Trace: > > [<>] dump_stack+0x19/0x1b > > [<>] warn_alloc_failed+0x110/0x180 > > [<>] __alloc_pages_slowpath+0x6b7/0x725 > > [<>] __alloc_pages_nodemask+0x405/0x420 > > [<>] dma_generic_alloc_coherent+0x8f/0x140 > > [<>] x86_swiotlb_alloc_coherent+0x21/0x50 > > [<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core] > > [<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core] > > [<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core] > > [<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core] > > [<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib] > > [<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib] > > > > Signed-off-by: Yonatan Cohen > > Reviewed-by: Tariq Toukan > > Signed-off-by: Leon Romanovsky > > Signed-off-by: Saeed Mahameed > > --- > > Jason mentioned this patch to me off-list. We were > seeing a similar issue with SRQs & QPs. So I am wondering whether > you have any plans to do a similar change for other resources > too so that they don't rely on higher order page allocation > for icm tables. > Hi Santosh, Adding Majd, Which ULP is in question? How big are the QPs/SRQs you create that lead to this problem?
For icm tables we already allocate only order 0 pages: see alloc_system_page() in drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c. But for kernel RDMA SRQ and QP buffers there is room for improvement. Majd, do you know if we have any near-future plans for this? > Regards, > Santosh
Re: [PATCH 1/1] Linux Traffic Control (tc) unit testing suite's code quality improved, String formattings updated. According to python documentation "The built-in string class provides the ability to
From: BTaskaya Date: Fri, 23 Feb 2018 21:48:30 +0300 > Signed-off-by: Batuhan Osman Taskaya Sorry, this still is not submitted properly. Your Subject line is enormous. It should contain a short, concise title for the change, including a subsystem prefix followed by a colon character and a space, then the title text. The rest belongs in the commit message proper, and thus the message body. I asked you quite politely to look at how other people submit patches properly on this mailing list, so that you could observe, learn, and see how it is supposed to look. If you had done so, we wouldn't have to go back and forth like this. So please take my polite suggestion seriously. Thank you.
[PATCH iproute2 v2] ip: Properly display AF_BRIDGE address information for neighbor events
The vxlan driver when a neighbor add/delete event occurs sends NDA_DST filled with a union: union vxlan_addr { struct sockaddr_in sin; struct sockaddr_in6 sin6; struct sockaddr sa; }; This eventually calls rt_addr_n2a_r which had no handler for the AF_BRIDGE family and "???" was being printed. Add code to properly display this data when requested. Signed-off-by: Donald Sharp --- lib/utils.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/lib/utils.c b/lib/utils.c index 24aeddd8..fe5841f6 100644 --- a/lib/utils.c +++ b/lib/utils.c @@ -1004,6 +1004,25 @@ const char *rt_addr_n2a_r(int af, int len, } case AF_PACKET: return ll_addr_n2a(addr, len, ARPHRD_VOID, buf, buflen); + case AF_BRIDGE: + { + const union { + struct sockaddr sa; + struct sockaddr_in sin; + struct sockaddr_in6 sin6; + } *sa = addr; + unsigned short family = sa->sa.sa_family; + + switch(family) { + case AF_INET: + return inet_ntop(AF_INET, &sa->sin.sin_addr, buf, buflen); + case AF_INET6: + return inet_ntop(AF_INET6, &sa->sin6.sin6_addr, +buf, buflen); + } + + /* fallthrough */ + } default: return "???"; } -- 2.14.3
[PATCH] net: fib_rules: Add new attribute to set protocol
For ages iproute2 has used `struct rtmsg` as the ancillary header for FIB rules and in the process set the protocol value to RTPROT_BOOT. Until cac56209a66 ("net: Allow a rule to track originating protocol") the kernel rules code ignored the protocol value sent from userspace and always returned 0 in notifications. To avoid incompatibility with existing iproute2, send the protocol as a new attribute. Fixes: cac56209a66 ("net: Allow a rule to track originating protocol") Signed-off-by: Donald Sharp --- drivers/net/vrf.c | 5 - include/net/fib_rules.h| 3 ++- include/uapi/linux/fib_rules.h | 5 +++-- net/core/fib_rules.c | 15 +++ 4 files changed, 20 insertions(+), 8 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 951a4b42cb29..9ce0182223a0 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -1145,6 +1145,7 @@ static inline size_t vrf_fib_rule_nl_size(void) sz = NLMSG_ALIGN(sizeof(struct fib_rule_hdr)); sz += nla_total_size(sizeof(u8)); /* FRA_L3MDEV */ sz += nla_total_size(sizeof(u32)); /* FRA_PRIORITY */ + sz += nla_total_size(sizeof(u8)); /* FRA_PROTOCOL */ return sz; } @@ -1174,7 +1175,9 @@ static int vrf_fib_rule(const struct net_device *dev, __u8 family, bool add_it) memset(frh, 0, sizeof(*frh)); frh->family = family; frh->action = FR_ACT_TO_TBL; - frh->proto = RTPROT_KERNEL; + + if (nla_put_u8(skb, FRA_PROTOCOL, RTPROT_KERNEL)) + goto nla_put_failure; if (nla_put_u8(skb, FRA_L3MDEV, 1)) goto nla_put_failure; diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h index b166ef07e6d4..b3d216249240 100644 --- a/include/net/fib_rules.h +++ b/include/net/fib_rules.h @@ -109,7 +109,8 @@ struct fib_rule_notifier_info { [FRA_SUPPRESS_IFGROUP] = { .type = NLA_U32 }, \ [FRA_GOTO] = { .type = NLA_U32 }, \ [FRA_L3MDEV]= { .type = NLA_U8 }, \ - [FRA_UID_RANGE] = { .len = sizeof(struct fib_rule_uid_range) } + [FRA_UID_RANGE] = { .len = sizeof(struct fib_rule_uid_range) }, \ + [FRA_PROTOCOL] = { .type = NLA_U8 } static inline void
fib_rule_get(struct fib_rule *rule) { diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h index 925539172d5b..77d90ae38114 100644 --- a/include/uapi/linux/fib_rules.h +++ b/include/uapi/linux/fib_rules.h @@ -23,8 +23,8 @@ struct fib_rule_hdr { __u8tos; __u8table; - __u8proto; - __u8res1; /* reserved */ + __u8res1; /* reserved */ + __u8res2; /* reserved */ __u8action; __u32 flags; @@ -58,6 +58,7 @@ enum { FRA_PAD, FRA_L3MDEV, /* iif or oif is l3mdev goto its table */ FRA_UID_RANGE, /* UID range */ + FRA_PROTOCOL, /* Originator of the rule */ __FRA_MAX }; diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 88298f18cbae..a6aea805a0a2 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -466,11 +466,13 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh, } refcount_set(&rule->refcnt, 1); rule->fr_net = net; - rule->proto = frh->proto; rule->pref = tb[FRA_PRIORITY] ? nla_get_u32(tb[FRA_PRIORITY]) : fib_default_rule_pref(ops); + rule->proto = tb[FRA_PROTOCOL] ? 
+ nla_get_u8(tb[FRA_PROTOCOL]) : RTPROT_UNSPEC; + if (tb[FRA_IIFNAME]) { struct net_device *dev; @@ -666,7 +668,8 @@ int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr *nlh, } list_for_each_entry(rule, &ops->rules_list, list) { - if (frh->proto && (frh->proto != rule->proto)) + if (tb[FRA_PROTOCOL] && + (rule->proto != nla_get_u8(tb[FRA_PROTOCOL]))) continue; if (frh->action && (frh->action != rule->action)) @@ -786,7 +789,8 @@ static inline size_t fib_rule_nlmsg_size(struct fib_rules_ops *ops, + nla_total_size(4) /* FRA_FWMARK */ + nla_total_size(4) /* FRA_FWMASK */ + nla_total_size_64bit(8) /* FRA_TUN_ID */ -+ nla_total_size(sizeof(struct fib_kuid_range)); ++ nla_total_size(sizeof(struct fib_kuid_range)) ++ nla_total_size(1); /* FRA_PROTOCOL */ if (ops->nlmsg_payload) payload += ops->nlmsg_payload(rule); @@ -813,9 +817,12 @@ static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule, if (nla_put_u32(skb, FRA_SUPPRESS_PREFIXLEN, rule->suppress_prefixlen)) goto nla_put_failure; frh->res1 = 0; + frh->res2 = 0; frh->action = rule->action;
Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest
[adding netdev] On 02/23/2018 08:05 AM, Khalid Aziz wrote: > I am seeing a kernel panic with 4.16-rc1 and 4.16-rc2 kernels when running > selftests > from tools/testing/selftests. Last messages from selftest before kernel panic > are: > > > running psock_tpacket test > > test: TPACKET_V1 with PACKET_RX_RING test: skip TPACKET_V1 PACKET_RX_RING > since user and kernel space have different bit width > test: TPACKET_V1 with PACKET_TX_RING test: skip TPACKET_V1 PACKET_TX_RING > since user and kernel space have different bit width > test: TPACKET_V2 with PACKET_RX_RING 100 pkts (14200 > bytes) > test: TPACKET_V2 with PACKET_TX_RING 100 pkts (14200 > bytes) > test: TPACKET_V3 with PACKET_RX_RING 100 pkts (14200 > bytes) > test: TPACKET_V3 with PACKET_TX_RING 100 pkts (14200 > bytes) > OK. All tests passed > [PASS] > ok 1..7 selftests: run_afpackettests [PASS] > selftests: test_bpf.sh > > test_bpf: [FAIL] > not ok 1..8 selftests: test_bpf.sh [FAIL] > selftests: netdevice.sh > > ok 1..9 selftests: netdevice.sh [PASS] > selftests: rtnetlink.sh > > PASS: policy routing > PASS: route get > > > Kernel panic message is below: > > [ 572.486722] BUG: unable to handle kernel paging request at 0600 > [ 572.494498] IP: tcf_exts_dump_stats+0x10/0x30 > [ 572.499360] PGD 80be413cb067 P4D 80be413cb067 PUD bead15c067 PMD 0 > [ 572.507126] Oops: [#1] SMP PTI > [ 572.511010] Modules linked in: cls_u32 sch_htb dummy vfat fat ext4 mbcache > jb > d2 intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul > crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper > cryptd sg iTCO_wdt iTCO_vendor_support ioatdma ipmi_ssif pcspkr wmi i2c_i801 > lpc_ich shpchp mfd_core ipmi_si ipmi_devintf ipmi_msghandler nfsd auth_rpcgss > nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb ahci > crc32c_intel nvme libahci dca drm megaraid_sas nvme_core i2c_algo_bit libata > bnxt_en i2c_core
dm_mirror dm_region_hash dm_log dm_mod > [ 572.574377] CPU: 81 PID: 17886 Comm: tc Not tainted 4.16.0-rc2 #112 > [ 572.581371] Hardware name: Oracle Corporation ORACLE SERVER X7-2/ASM, MB, > X7-2, BIOS 41017600 10/06/2017 > [ 572.591957] RIP: 0010:tcf_exts_dump_stats+0x10/0x30 > [ 572.597402] RSP: 0018:c900313b7928 EFLAGS: 00010206 > [ 572.603226] RAX: 0600 RBX: 88bea9117db0 RCX: > 1ca4 > [ 572.611191] RDX: 1ca3 RSI: 88bea90cf018 RDI: > 88be4fb6c000 > [ 572.619157] RBP: 88be4fb6c000 R08: 00024800 R09: > a05697fb > [ 572.627121] R10: 88bebe064800 R11: ea02faa445c0 R12: > 88bea90ce034 > [ 572.635087] R13: 88bea90cf000 R14: 88be9fe33300 R15: > 88bea90ce000 > [ 572.643053] FS: 7f98ae464740() GS:88bebe04() > knlGS: > [ 572.652084] CS: 0010 DS: ES: CR0: 80050033 > [ 572.658497] CR2: 0600 CR3: 00be41a94005 CR4: > 007606e0 > [ 572.666462] DR0: DR1: DR2: > > [ 572.674428] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 572.682393] PKRU: 5554 > [ 572.685413] Call Trace: > [ 572.688145] u32_dump+0x2be/0x3c0 [cls_u32] > [ 572.692816] tcf_fill_node.isra.29+0x15b/0x1f0 > [ 572.69] tfilter_notify+0xc1/0x150 > [ 572.701952] tc_ctl_tfilter+0x87d/0xbd0 > [ 572.706238] rtnetlink_rcv_msg+0x29c/0x310 > [ 572.710813] ? _cond_resched+0x15/0x30 > [ 572.714999] ? __kmalloc_node_track_caller+0x1b9/0x270 > [ 572.720737] ? rtnl_calcit.isra.28+0x100/0x100 > [ 572.725697] netlink_rcv_skb+0xd2/0x110 > [ 572.729969] netlink_unicast+0x17c/0x230 > [ 572.734348] netlink_sendmsg+0x2cd/0x3c0 > [ 572.738719] sock_sendmsg+0x30/0x40 > [ 572.742612] ___sys_sendmsg+0x27a/0x290 > [ 572.746896] ? do_wp_page+0x89/0x4c0 > [ 572.750886] ? page_add_new_anon_rmap+0x72/0xc0 > [ 572.755944] ?
__handle_mm_fault+0x74b/0x1280 > [ 572.760807] __sys_sendmsg+0x51/0x90 > [ 572.764800] do_syscall_64+0x6e/0x1a0 > [ 572.76] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > [ 572.774526] RIP: 0033:0x7f98ada843b0 > [ 572.778515] RSP: 002b:7fff833a4f38 EFLAGS: 0246 ORIG_RAX: > 002e > [ 572.786963] RAX: ffda RBX: 5a8deb31 RCX: > 7f98ada843b0 > [ 572.794929] RDX: RSI: 7fff833a4f80 RDI: > 0003 > [ 572.802892] RBP: 7fff833a4f80 R08: R09: > 0001 > [ 572.810856] R10: 7fff833a4320 R11: 0246 R12: > > [ 572.818823] R13: 00650ba0 R14: 7fff833b11e8 R15:
[GIT] Networking
1) Fix TTL offset calculation in mac80211 mesh code, from Peter Oh. 2) Fix races with procfs in ipt_CLUSTERIP, from Cong Wang. 3) Memory leak fix in lpm_trie BPF map code, from Yonghong Song. 4) Need to use GFP_ATOMIC in BPF cpumap allocations, from Jason Wang. 5) Fix potential deadlocks in netfilter getsockopt() code paths, from Paolo Abeni. 6) Netfilter stackpointer size checks really are needed to validate user input, from Florian Westphal. 7) Missing timer init in x_tables, from Paolo Abeni. 8) Don't use WQ_MEM_RECLAIM in mac80211 hwsim, from Johannes Berg. 9) When an ibmvnic device is brought down then back up again, it can be sent queue entries from a previous session, handle this properly instead of crashing. From Thomas Falcon. 10) Fix TCP checksums on LRO buffers in mlx5e, from Gal Pressman. 11) When we are dumping filters in cls_api, the output SKB is empty, and the filter we are dumping is too large for the space in the SKB, we should return -EMSGSIZE like other netlink dump operations do. Otherwise userland has no signal that it needs to increase the size of its read buffer. From Roman Kapl. 12) Several XDP fixes for virtio_net, from Jesper Dangaard Brouer. 13) Module refcount leak in netlink when a dump start fails, from Jason A. Donenfeld. 14) Handle sub-optimal GSO sizes better in TCP BBR congestion control, from Eric Dumazet. 15) Releasing bpf per-cpu arraymaps can take a long time, add a conditional scheduling point. From Eric Dumazet. 16) Implement retpolines for tail calls in x64 and arm64 bpf JITs. From Daniel Borkmann. 17) Fix page leak in gianfar driver, from Andy Spencer. 18) Missed clearing of estimator scratch buffer, from Eric Dumazet. Please pull, thanks a lot!
The following changes since commit 79c0ef3e85c015b0921a8fd5dd539d1480e9cd6c: Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-02-19 11:58:19 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git for you to fetch changes up to a5f7add332b4ea6d4b9480971b3b0f5e66466ae9: net_sched: gen_estimator: fix broken estimators based on percpu stats (2018-02-23 12:35:46 -0500) Alexey Kodanev (1): macvlan: fix use-after-free in macvlan_common_newlink() Anders Roxell (2): selftests/bpf: tcpbpf_kern: use in6_* macros from glibc selftests/bpf: update gitignore with test_libbpf_open Andy Spencer (1): gianfar: simplify FCS handling and fix memory leak Arnd Bergmann (3): cfg80211: fix cfg80211_beacon_dup bpf: clean up unused-variable warning ipv6 sit: work around bogus gcc-8 -Wrestrict warning Avraham Stern (1): cfg80211: clear wep keys after disconnection Cong Wang (2): netfilter: ipt_CLUSTERIP: fix a race condition of proc file creation netfilter: ipt_CLUSTERIP: fix a refcount bug in clusterip_config_find_get() Dan Carpenter (1): net: aquantia: Fix error handling in aq_pci_probe() Daniel Borkmann (5): bpf: fix bpf_prog_array_copy_to_user warning from perf event prog query Merge branch 'bpf-bpftool-json-fixes' bpf: fix mlock precharge on arraymaps bpf, x64: implement retpoline for tail call bpf, arm64: fix out of bounds access in tail call Daniel Jurgens (1): net/mlx5: Use 128B cacheline size for 128B or larger cachelines David Ahern (1): net: ipv4: Set addr_type in hash_keys for forwarded case David Howells (1): rxrpc: Fix send in rxrpc_send_data_packet() David S. 
Miller (6): Merge git://git.kernel.org/.../pablo/nf Merge tag 'mlx5-fixes-2018-02-20' of git://git.kernel.org/.../saeed/linux Merge branch 'virtio_net-XDP-fixes' Merge git://git.kernel.org/.../bpf/bpf Merge tag 'mac80211-for-davem-2018-02-22' of git://git.kernel.org/.../jberg/mac80211 Merge git://git.kernel.org/.../bpf/bpf Eran Ben Elisha (1): net/mlx5e: Verify inline header size do not exceed SKB linear size Eric Dumazet (7): bpf: fix sock_map_alloc() error path netfilter: xt_hashlimit: fix lock imbalance netfilter: IDLETIMER: be syzkaller friendly smsc75xx: fix smsc75xx_set_features() tcp_bbr: better deal with suboptimal GSO bpf: add schedule points in percpu arrays management net_sched: gen_estimator: fix broken estimators based on percpu stats Eugenia Emantayev (1): net/mlx5: E-Switch, Fix drop counters use before creation Felix Fietkau (1): mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4 Finn Thain (1): net/smc9194: Remove bogus CONFIG_MAC reference Florian Westphal (10): netfilter: add back stackpointer size checks netfilter: x_tables: remove pr_info where possible netfilter: x_tables: use pr ratelimiting in xt core netfilter: xt_CT: use pr ratelimiting netfilter: xt_NFQUEUE: use pr r
VRF destination unreachable
Greetings, We found that ICMP destination unreachable isn't sent if VRF forwarding isn't configured, i.e. /proc/sys/net/ipv4/conf//forwarding isn't set. The relevant code is: static int ip_error(struct sk_buff *skb) { ... // in_dev is the vrf net_device if (!IN_DEV_FORWARD(in_dev)) { switch (rt->dst.error) { case EHOSTUNREACH: __IP_INC_STATS(net, IPSTATS_MIB_INADDRERRORS); break; case ENETUNREACH: __IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES); break; } goto out; } ... out:kfree_skb(skb); return 0; } The question: is it intended to be set? The basic forwarding seems to be working without. We do set it on the slave net devices. Thank you, Stephen.
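Per the ip_error() snippet above, the unreachables are only generated when IN_DEV_FORWARD() is true for the VRF device itself, not just its slaves. Assuming the current behavior is intended, the workaround would be to enable forwarding on the VRF device as well. The device name below is purely hypothetical:

```
# /etc/sysctl.d/90-vrf.conf -- "vrf-blue" is a placeholder VRF device name
net.ipv4.conf.vrf-blue.forwarding = 1
```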
Re: [PATCH 1/1] Linux Traffic Control (tc) unit testing suite's code quality improved, String formattings updated. According to python documentation "The built-in string class provides the ability to
Sorry, this is still not submitted in an acceptable way. Please read the documents I pointed you to, look at how other people submit patches on this list, and submit it properly. Thank you.
Re: [PATCH net-next] selftests/net: ignore background traffic in psock_fanout
From: Willem de Bruijn Date: Fri, 23 Feb 2018 11:56:20 -0500 > From: Willem de Bruijn > > The packet fanout test generates UDP traffic and reads this with > a pair of packet sockets, testing the various fanout algorithms. > > Avoid non-determinism from reading unrelated background traffic. > Fanout decisions are made before unrelated packets can be dropped with > a filter, so that is an insufficient strategy [*]. Run the packet > socket tests in a network namespace, similar to msg_zerocopy. > > It is still good practice to install a filter on a packet socket > before accepting traffic. Because this is example code, demonstrate > that pattern. Open the socket initially bound to no protocol, install > a filter, and only then bind to ETH_P_IP. > > Another source of non-determinism is hash collisions in FANOUT_HASH. > The hash function used to select a socket in the fanout group includes > the pseudorandom number hashrnd, which is not visible from userspace. > To work around this, the test tries to find a pair of UDP source ports > that do not collide. It gives up too soon (5 times, every 32 runs) and > output is confusing. Increase tries to 20 and revise the error msg. > > [*] another approach would be to add a third socket to the fanout > group and direct all unexpected traffic here. This is possible > only when reimplementing methods like RR or HASH alongside this > extra catch-all bucket, using the BPF fanout method. > > Signed-off-by: Willem de Bruijn Applied, thanks Willem. Indeed, not being able to control hashrnd makes determinism in tests like this quite difficult.
[PATCH net 2/5] l2tp: don't use inet_shutdown on ppp session destroy
Previously, if a ppp session was closed, we called inet_shutdown to mark the socket as unconnected such that userspace would get errors and then close the socket. This could race with userspace closing the socket. Instead, leave userspace to close the socket in its own time (our session will be detached anyway). BUG: KASAN: use-after-free in inet_shutdown+0x5d/0x1c0 Read of size 4 at addr 880010ea3ac0 by task syzbot_347bd5ac/8296 CPU: 3 PID: 8296 Comm: syzbot_347bd5ac Not tainted 4.16.0-rc1+ #91 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 Call Trace: dump_stack+0x101/0x157 ? inet_shutdown+0x5d/0x1c0 print_address_description+0x78/0x260 ? inet_shutdown+0x5d/0x1c0 kasan_report+0x240/0x360 __asan_load4+0x78/0x80 inet_shutdown+0x5d/0x1c0 ? pppol2tp_show+0x80/0x80 pppol2tp_session_close+0x68/0xb0 l2tp_tunnel_closeall+0x199/0x210 ? udp_v6_flush_pending_frames+0x90/0x90 l2tp_udp_encap_destroy+0x6b/0xc0 ? l2tp_tunnel_del_work+0x2e0/0x2e0 udpv6_destroy_sock+0x8c/0x90 sk_common_release+0x47/0x190 udp_lib_close+0x15/0x20 inet_release+0x85/0xd0 inet6_release+0x43/0x60 sock_release+0x53/0x100 ? 
sock_alloc_file+0x260/0x260 sock_close+0x1b/0x20 __fput+0x19f/0x380 fput+0x1a/0x20 task_work_run+0xd2/0x110 exit_to_usermode_loop+0x18d/0x190 do_syscall_64+0x389/0x3b0 entry_SYSCALL_64_after_hwframe+0x26/0x9b RIP: 0033:0x7fe240a45259 RSP: 002b:7fe241132df8 EFLAGS: 0297 ORIG_RAX: 0003 RAX: RBX: RCX: 7fe240a45259 RDX: 7fe240a45259 RSI: RDI: 00a5 RBP: 7fe241132e20 R08: 7fe241133700 R09: R10: 7fe241133700 R11: 0297 R12: R13: 7ffc49aff84f R14: R15: 7fe241141040 Allocated by task 8331: save_stack+0x43/0xd0 kasan_kmalloc+0xad/0xe0 kasan_slab_alloc+0x12/0x20 kmem_cache_alloc+0x144/0x3e0 sock_alloc_inode+0x22/0x130 alloc_inode+0x3d/0xf0 new_inode_pseudo+0x1c/0x90 sock_alloc+0x30/0x110 __sock_create+0xaa/0x4c0 SyS_socket+0xbe/0x130 do_syscall_64+0x128/0x3b0 entry_SYSCALL_64_after_hwframe+0x26/0x9b Freed by task 8314: save_stack+0x43/0xd0 __kasan_slab_free+0x11a/0x170 kasan_slab_free+0xe/0x10 kmem_cache_free+0x88/0x2b0 sock_destroy_inode+0x49/0x50 destroy_inode+0x77/0xb0 evict+0x285/0x340 iput+0x429/0x530 dentry_unlink_inode+0x28c/0x2c0 __dentry_kill+0x1e3/0x2f0 dput.part.21+0x500/0x560 dput+0x24/0x30 __fput+0x2aa/0x380 fput+0x1a/0x20 task_work_run+0xd2/0x110 exit_to_usermode_loop+0x18d/0x190 do_syscall_64+0x389/0x3b0 entry_SYSCALL_64_after_hwframe+0x26/0x9b Fixes: fd558d186df2c ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts") Signed-off-by: James Chapman --- net/l2tp/l2tp_ppp.c | 10 -- 1 file changed, 10 deletions(-) diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c index 59f246d7b290..2d2955e8f710 100644 --- a/net/l2tp/l2tp_ppp.c +++ b/net/l2tp/l2tp_ppp.c @@ -420,16 +420,6 @@ static int pppol2tp_xmit(struct ppp_channel *chan, struct sk_buff *skb) */ static void pppol2tp_session_close(struct l2tp_session *session) { - struct sock *sk; - - BUG_ON(session->magic != L2TP_SESSION_MAGIC); - - sk = pppol2tp_session_get_sock(session); - if (sk) { - if (sk->sk_socket) - inet_shutdown(sk->sk_socket, SEND_SHUTDOWN); - sock_put(sk); - } } /* Really kill the session 
socket. (Called from sock_put() if -- 1.9.1
[PATCH net 5/5] l2tp: fix tunnel lookup use-after-free race
l2tp_tunnel_get walks the tunnel list to find a matching tunnel instance and if a match is found, its refcount is increased before returning the tunnel pointer. But when tunnel objects are destroyed, they remain on the tunnel list after their refcount has hit zero. Fix this by moving the code that removes the tunnel from the tunnel list from the tunnel socket destructor into the l2tp_tunnel_delete path, before the tunnel refcount is decremented. refcount_t: increment on 0; use-after-free. WARNING: CPU: 3 PID: 13507 at lib/refcount.c:153 refcount_inc+0x47/0x50 Modules linked in: CPU: 3 PID: 13507 Comm: syzbot_6e6a5ec8 Not tainted 4.16.0-rc2+ #36 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 RIP: 0010:refcount_inc+0x47/0x50 RSP: 0018:8800136ffb20 EFLAGS: 00010286 RAX: dc08 RBX: 880017068e68 RCX: 814d RDX: RSI: 88001a59f6d8 RDI: 88001a59f6d8 RBP: 8800136ffb28 R08: R09: R10: 8800136ffab0 R11: R12: 880017068e50 R13: R14: 8800174da800 R15: 0004 FS: 7f403ab1e700() GS:88001a58() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 205fafd2 CR3: 1677 CR4: 06e0 Call Trace: l2tp_tunnel_get+0x2dd/0x4e0 pppol2tp_connect+0x428/0x13c0 ? pppol2tp_session_create+0x170/0x170 ? __might_fault+0x115/0x1d0 ? lock_downgrade+0x860/0x860 ? __might_fault+0xe5/0x1d0 ? security_socket_connect+0x8e/0xc0 SYSC_connect+0x1b6/0x310 ? SYSC_bind+0x280/0x280 ? __do_page_fault+0x5d1/0xca0 ? up_read+0x1f/0x40 ? __do_page_fault+0x3c8/0xca0 SyS_connect+0x29/0x30 ? SyS_accept+0x40/0x40 do_syscall_64+0x1e0/0x730 ?
trace_hardirqs_off_thunk+0x1a/0x1c entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x7f403a42f259 RSP: 002b:7f403ab1dee8 EFLAGS: 0296 ORIG_RAX: 002a RAX: ffda RBX: 205fafe4 RCX: 7f403a42f259 RDX: 002e RSI: 205fafd2 RDI: 0004 RBP: 7f403ab1df20 R08: 7f403ab1e700 R09: R10: 7f403ab1e700 R11: 0296 R12: R13: 7ffc81906cbf R14: R15: 7f403ab2b040 Code: 3b ff 5b 5d c3 e8 ca 5f 3b ff 80 3d 49 8e 66 04 00 75 ea e8 bc 5f 3b ff 48 c7 c7 60 69 64 85 c6 05 34 8e 66 04 01 e8 59 49 15 ff <0f> 0b eb ce 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 Fixes: f8ccac0e44934 ("l2tp: put tunnel socket release on a workqueue") Reported-and-tested-by: syzbot+19c09769f14b48810...@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+347bd5acde002e353...@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+6e6a5ec8de31a94cd...@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+9df43faf09bd400f2...@syzkaller.appspotmail.com Signed-off-by: James Chapman --- net/l2tp/l2tp_core.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c index 0fa53ead24aa..83421c6f0bef 100644 --- a/net/l2tp/l2tp_core.c +++ b/net/l2tp/l2tp_core.c @@ -1164,7 +1164,6 @@ int l2tp_xmit_skb(struct l2tp_session *session, struct sk_buff *skb, int hdr_len static void l2tp_tunnel_destruct(struct sock *sk) { struct l2tp_tunnel *tunnel = l2tp_tunnel(sk); - struct l2tp_net *pn; if (tunnel == NULL) goto end; @@ -1187,12 +1186,6 @@ static void l2tp_tunnel_destruct(struct sock *sk) sk->sk_destruct = tunnel->old_sk_destruct; sk->sk_user_data = NULL; - /* Remove the tunnel struct from the tunnel list */ - pn = l2tp_pernet(tunnel->l2tp_net); - spin_lock_bh(&pn->l2tp_tunnel_list_lock); - list_del_rcu(&tunnel->list); - spin_unlock_bh(&pn->l2tp_tunnel_list_lock); - /* Call the original destructor */ if (sk->sk_destruct) (*sk->sk_destruct)(sk); @@ -1271,6 +1264,7 @@ static void l2tp_tunnel_del_work(struct work_struct *work) del_work); struct sock *sk = 
tunnel->sock; struct socket *sock = sk->sk_socket; + struct l2tp_net *pn; l2tp_tunnel_closeall(tunnel); @@ -1284,6 +1278,12 @@ static void l2tp_tunnel_del_work(struct work_struct *work) } } + /* Remove the tunnel struct from the tunnel list */ + pn = l2tp_pernet(tunnel->l2tp_net); + spin_lock_bh(&pn->l2tp_tunnel_list_lock); + list_del_rcu(&tunnel->list); + spin_unlock_bh(&pn->l2tp_tunnel_list_lock); + /* drop initial ref */ l2tp_tunnel_dec_refcount(tunnel); -- 1.9.1
[PATCH net 1/5] l2tp: don't use inet_shutdown on tunnel destroy
Previously, if a tunnel was closed, we called inet_shutdown to mark the socket as unconnected such that userspace would get errors and then close the socket. This could race with userspace closing the socket. Instead, leave userspace to close the socket in its own time (our tunnel will be detached anyway). BUG: unable to handle kernel NULL pointer dereference at 00a0 IP: __lock_acquire+0x263/0x1630 PGD 0 P4D 0 Oops: [#1] SMP KASAN Modules linked in: CPU: 2 PID: 42 Comm: kworker/u8:2 Not tainted 4.15.0-rc7+ #129 Workqueue: l2tp l2tp_tunnel_del_work RIP: 0010:__lock_acquire+0x263/0x1630 RSP: 0018:88001a37fc70 EFLAGS: 00010002 RAX: 0001 RBX: 0088 RCX: RDX: RSI: RDI: RBP: 88001a37fd18 R08: 0001 R09: R10: R11: 76fd R12: 00a0 R13: 88001a3722c0 R14: 0001 R15: FS: () GS:88001ad0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 00a0 CR3: 1730b000 CR4: 06e0 Call Trace: ? __lock_acquire+0xc77/0x1630 ? console_trylock+0x11/0xa0 lock_acquire+0x117/0x230 ? lock_sock_nested+0x3a/0xa0 _raw_spin_lock_bh+0x3a/0x50 ? lock_sock_nested+0x3a/0xa0 lock_sock_nested+0x3a/0xa0 inet_shutdown+0x33/0xf0 l2tp_tunnel_del_work+0x60/0xef process_one_work+0x1ea/0x5f0 ? process_one_work+0x162/0x5f0 worker_thread+0x48/0x3e0 ? trace_hardirqs_on+0xd/0x10 kthread+0x108/0x140 ? process_one_work+0x5f0/0x5f0 ? 
kthread_stop+0x2a0/0x2a0 ret_from_fork+0x24/0x30 Code: 00 41 81 ff ff 1f 00 00 0f 87 7a 13 00 00 45 85 f6 49 8b 85 68 08 00 00 0f 84 ae 03 00 00 c7 44 24 18 00 00 00 00 e9 f0 00 00 00 <49> 81 3c 24 80 93 3f 83 b8 00 00 00 00 44 0f 44 c0 83 fe 01 0f RIP: __lock_acquire+0x263/0x1630 RSP: 88001a37fc70 CR2: 00a0 Fixes: 309795f4bec2d ("l2tp: Add netlink control API for L2TP") Signed-off-by: James Chapman --- net/l2tp/l2tp_core.c | 11 ++- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c index 194a7483bb93..9cd2a99d0752 100644 --- a/net/l2tp/l2tp_core.c +++ b/net/l2tp/l2tp_core.c @@ -1327,17 +1327,10 @@ static void l2tp_tunnel_del_work(struct work_struct *work) sock = sk->sk_socket; - /* If the tunnel socket was created by userspace, then go through the -* inet layer to shut the socket down, and let userspace close it. -* Otherwise, if we created the socket directly within the kernel, use + /* If the tunnel socket was created within the kernel, use * the sk API to release it here. -* In either case the tunnel resources are freed in the socket -* destructor when the tunnel socket goes away. */ - if (tunnel->fd >= 0) { - if (sock) - inet_shutdown(sock, 2); - } else { + if (tunnel->fd < 0) { if (sock) { kernel_sock_shutdown(sock, SHUT_RDWR); sock_release(sock); -- 1.9.1
[PATCH net 0/5] l2tp: fix API races discovered by syzbot
This patch series addresses several races with L2TP APIs discovered by syzbot. There are no functional changes. The set of patches 1-5 in combination fix the following syzbot reports. 19c09769f WARNING in debug_print_object 347bd5acd KASAN: use-after-free Read in inet_shutdown 6e6a5ec8d general protection fault in pppol2tp_connect 9df43faf0 KASAN: use-after-free Read in pppol2tp_connect My first attempts to fix these issues were as net-next patches but the series included other refactoring and cleanup work. I was asked to separate out the bugfixes and redo for the net tree, which is what these patches are. The changes are: 1. Fix inet_shutdown races when L2TP tunnels and sessions close. (patches 1-2) 2. Fix races with tunnel and its socket. (patch 3) 3. Fix race in pppol2tp_release with session and its socket. (patch 4) 4. Fix tunnel lookup use-after-free. (patch 5) All of the syzbot reproducers hit races in the tunnel and pppol2tp session create and destroy paths. These tests create and destroy pppol2tp tunnels and sessions rapidly using multiple threads, provoking races in several tunnel/session create/destroy paths. The key problem was that each tunnel/session socket could be destroyed while its associated tunnel/session object still existed (patches 3, 4). Patch 5 addresses a problem with the way tunnels are removed from the tunnel list. Patch 5 is tagged that it addresses all four syzbot issues, though all 5 patches are needed. James Chapman (5): l2tp: don't use inet_shutdown on tunnel destroy l2tp: don't use inet_shutdown on ppp session destroy l2tp: fix races with tunnel socket close l2tp: fix race in pppol2tp_release with session object destroy l2tp: fix tunnel lookup use-after-free race net/l2tp/l2tp_core.c | 142 --- net/l2tp/l2tp_core.h | 23 + net/l2tp/l2tp_ip.c | 10 ++-- net/l2tp/l2tp_ip6.c | 8 ++- net/l2tp/l2tp_ppp.c | 60 ++ 5 files changed, 77 insertions(+), 166 deletions(-) --
[PATCH net 3/5] l2tp: fix races with tunnel socket close
The tunnel socket tunnel->sock (struct sock) is accessed when preparing a new ppp session on a tunnel at pppol2tp_session_init. If the socket is closed by a thread while another is creating a new session, the threads race. In pppol2tp_connect, the tunnel object may be created if the pppol2tp socket is associated with the special session_id 0 and the tunnel socket is looked up using the provided fd. When handling this, pppol2tp_connect cannot sock_hold the tunnel socket to prevent it being destroyed during pppol2tp_connect since this may itself race with the socket being destroyed. Doing sockfd_lookup in pppol2tp_connect isn't sufficient to prevent tunnel->sock going away either because a given tunnel socket fd may be reused between calls to pppol2tp_connect. Instead, have l2tp_tunnel_create sock_hold the tunnel socket before it does sockfd_put. This ensures that the tunnel's socket is always extant while the tunnel object exists. Hold a ref on the socket until the tunnel is destroyed and ensure that all tunnel destroy paths go through a common function (l2tp_tunnel_delete) since this will do the final sock_put to release the tunnel socket. Since the tunnel's socket is now guaranteed to exist if the tunnel exists, we no longer need to use sockfd_lookup via l2tp_sock_to_tunnel to derive the tunnel from the socket since this is always sk_user_data. Also, sessions no longer sock_hold the tunnel socket since sessions already hold a tunnel ref and the tunnel sock will not be freed until the tunnel is freed. Removing these sock_holds in l2tp_session_register avoids a possible sock leak in the pppol2tp_connect error path if l2tp_session_register succeeds but attaching a ppp channel fails. The pppol2tp_connect error path could have been fixed instead and have the sock ref dropped when the session is freed, but doing a sock_put of the tunnel socket when the session is freed would require a new session_free callback. 
It is simpler to just remove the sock_hold of the tunnel socket in l2tp_session_register, now that the tunnel socket lifetime is guaranteed. Finally, some init code in l2tp_tunnel_create is reordered to ensure that the new tunnel object's refcount is set and the tunnel socket ref is taken before the tunnel socket destructor callbacks are set. kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: [#1] SMP KASAN Modules linked in: CPU: 0 PID: 4360 Comm: syzbot_19c09769 Not tainted 4.16.0-rc2+ #34 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 RIP: 0010:pppol2tp_session_init+0x1d6/0x500 RSP: 0018:88001377fb40 EFLAGS: 00010212 RAX: dc00 RBX: 88001636a940 RCX: 84836c1d RDX: 0045 RSI: 55976744 RDI: 0228 RBP: 88001377fb60 R08: 84836bc8 R09: 0002 R10: 88001377fab8 R11: 0001 R12: R13: 88001636aac8 R14: 8800160f81c0 R15: 1100026eff76 FS: 7ffb3ea66700() GS:88001a40() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 20e77000 CR3: 16261000 CR4: 06f0 Call Trace: pppol2tp_connect+0xd18/0x13c0 ? pppol2tp_session_create+0x170/0x170 ? __might_fault+0x115/0x1d0 ? lock_downgrade+0x860/0x860 ? __might_fault+0xe5/0x1d0 ? security_socket_connect+0x8e/0xc0 SYSC_connect+0x1b6/0x310 ? SYSC_bind+0x280/0x280 ? __do_page_fault+0x5d1/0xca0 ? up_read+0x1f/0x40 ? __do_page_fault+0x3c8/0xca0 SyS_connect+0x29/0x30 ? SyS_accept+0x40/0x40 do_syscall_64+0x1e0/0x730 ? 
trace_hardirqs_off_thunk+0x1a/0x1c entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x7ffb3e376259 RSP: 002b:7ffeda4f6508 EFLAGS: 0202 ORIG_RAX: 002a RAX: ffda RBX: 20e77012 RCX: 7ffb3e376259 RDX: 002e RSI: 20e77000 RDI: 0004 RBP: 7ffeda4f6540 R08: R09: R10: R11: 0202 R12: 00400b60 R13: 7ffeda4f6660 R14: R15: Code: 80 3d b0 ff 06 02 00 0f 84 07 02 00 00 e8 13 d6 db fc 49 8d bc 24 28 02 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 f a 48 c1 ea 03 <80> 3c 02 00 0f 85 ed 02 00 00 4d 8b a4 24 28 02 00 00 e8 13 16 Fixes: 80d84ef3ff1dd ("l2tp: prevent l2tp_tunnel_delete racing with userspace close") Signed-off-by: James Chapman --- net/l2tp/l2tp_core.c | 117 +++ net/l2tp/l2tp_core.h | 23 +- net/l2tp/l2tp_ip.c | 10 ++--- net/l2tp/l2tp_ip6.c | 8 ++-- 4 files changed, 42 insertions(+), 116 deletions(-) diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c index 9cd2a99d0752..0fa53ead24aa 100644 --- a/net/l2tp/l2tp_core.c +++ b/net/l2tp/l2tp_core.c @@ -136,51 +136,6 @@ static inline struct l2tp_net *l2tp_pernet(const struct net *net) } -/* Lookup the tunnel socket, possibly involving the
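The refcount rule this patch establishes — the tunnel takes its own reference on the socket at create time and drops it only in the common delete path — can be illustrated with a small userspace sketch. All names here (toy_sock, toy_tunnel, etc.) are invented for the example; this is a minimal analogue of the lifetime rule, not the kernel code:

```c
#include <assert.h>

/* Toy analogue of the patch's refcount rule: the tunnel grabs its own
 * reference on the socket when it is created, so the socket cannot be
 * freed while the tunnel object exists, regardless of when userspace
 * closes its fd. */

struct toy_sock {
	int refcnt;
	int freed;
};

static void toy_sock_hold(struct toy_sock *sk)
{
	sk->refcnt++;
}

static void toy_sock_put(struct toy_sock *sk)
{
	if (--sk->refcnt == 0)
		sk->freed = 1;	/* stand-in for the real free */
}

struct toy_tunnel {
	struct toy_sock *sock;
};

/* l2tp_tunnel_create-style: take a ref before the fd lookup ref is dropped */
static void toy_tunnel_create(struct toy_tunnel *t, struct toy_sock *sk)
{
	toy_sock_hold(sk);
	t->sock = sk;
}

/* l2tp_tunnel_delete-style: the final sock_put happens only here */
static void toy_tunnel_delete(struct toy_tunnel *t)
{
	toy_sock_put(t->sock);
	t->sock = 0;
}
```

With this ordering, closing the fd first merely drops one ref; the socket stays extant until the tunnel destroy path runs.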
[PATCH net 4/5] l2tp: fix race in pppol2tp_release with session object destroy
pppol2tp_release uses call_rcu to put the final ref on its socket. But the session object doesn't hold a ref on the session socket so may be freed while the pppol2tp_put_sk RCU callback is scheduled. Fix this by having the session hold a ref on its socket until the session is destroyed. It is this ref that is dropped via call_rcu. Sessions are also deleted via l2tp_tunnel_closeall. This must now also put the final ref via call_rcu. So move the call_rcu call site into pppol2tp_session_close so that this happens in both destroy paths. A common destroy path should really be implemented, perhaps with l2tp_tunnel_closeall calling l2tp_session_delete like pppol2tp_release does, but this will be looked at later. ODEBUG: activate active (active state 1) object type: rcu_head hint: (null) WARNING: CPU: 3 PID: 13407 at lib/debugobjects.c:291 debug_print_object+0x166/0x220 Modules linked in: CPU: 3 PID: 13407 Comm: syzbot_19c09769 Not tainted 4.16.0-rc2+ #38 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 RIP: 0010:debug_print_object+0x166/0x220 RSP: 0018:880013647a00 EFLAGS: 00010082 RAX: dc08 RBX: 0003 RCX: 814d RDX: RSI: 0001 RDI: 88001a59f6d0 RBP: 880013647a40 R08: R09: 0001 R10: 8800136479a8 R11: R12: 0001 R13: 86161420 R14: 85648b60 R15: FS: () GS:88001a58() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 20e77000 CR3: 06022000 CR4: 06e0 Call Trace: debug_object_activate+0x38b/0x530 ? debug_object_assert_init+0x3b0/0x3b0 ? __mutex_unlock_slowpath+0x85/0x8b0 ? pppol2tp_session_destruct+0x110/0x110 __call_rcu.constprop.66+0x39/0x890 ? __call_rcu.constprop.66+0x39/0x890 call_rcu_sched+0x17/0x20 pppol2tp_release+0x2c7/0x440 ? fcntl_setlk+0xca0/0xca0 ? sock_alloc_file+0x340/0x340 sock_release+0x92/0x1e0 sock_close+0x1b/0x20 __fput+0x296/0x6e0 fput+0x1a/0x20 task_work_run+0x127/0x1a0 do_exit+0x7f9/0x2ce0 ? SYSC_connect+0x212/0x310 ? mm_update_next_owner+0x690/0x690 ? up_read+0x1f/0x40 ? __do_page_fault+0x3c8/0xca0 do_group_exit+0x10d/0x330 ? 
do_group_exit+0x330/0x330 SyS_exit_group+0x22/0x30 do_syscall_64+0x1e0/0x730 ? trace_hardirqs_off_thunk+0x1a/0x1c entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x7f362e471259 RSP: 002b:7ffe389abe08 EFLAGS: 0202 ORIG_RAX: 00e7 RAX: ffda RBX: RCX: 7f362e471259 RDX: 7f362e471259 RSI: 002e RDI: RBP: 7ffe389abe30 R08: R09: 7f362e944270 R10: R11: 0202 R12: 00400b60 R13: 7ffe389abf50 R14: R15: Code: 8d 3c dd a0 8f 64 85 48 89 fa 48 c1 ea 03 80 3c 02 00 75 7b 48 8b 14 dd a0 8f 64 85 4c 89 f6 48 c7 c7 20 85 64 85 e 8 2a 55 14 ff <0f> 0b 83 05 ad 2a 68 04 01 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 Fixes: ee40fb2e1eb5b ("l2tp: protect sock pointer of struct pppol2tp_session with RCU") Signed-off-by: James Chapman --- net/l2tp/l2tp_ppp.c | 52 +++- 1 file changed, 27 insertions(+), 25 deletions(-) diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c index 2d2955e8f710..3b02f24ea9ec 100644 --- a/net/l2tp/l2tp_ppp.c +++ b/net/l2tp/l2tp_ppp.c @@ -416,10 +416,28 @@ static int pppol2tp_xmit(struct ppp_channel *chan, struct sk_buff *skb) * Session (and tunnel control) socket create/destroy. */ +static void pppol2tp_put_sk(struct rcu_head *head) +{ + struct pppol2tp_session *ps; + + ps = container_of(head, typeof(*ps), rcu); + sock_put(ps->__sk); +} + /* Called by l2tp_core when a session socket is being closed. */ static void pppol2tp_session_close(struct l2tp_session *session) { + struct pppol2tp_session *ps; + + ps = l2tp_session_priv(session); + mutex_lock(&ps->sk_lock); + ps->__sk = rcu_dereference_protected(ps->sk, +lockdep_is_held(&ps->sk_lock)); + RCU_INIT_POINTER(ps->sk, NULL); + if (ps->__sk) + call_rcu(&ps->rcu, pppol2tp_put_sk); + mutex_unlock(&ps->sk_lock); } /* Really kill the session socket. 
(Called from sock_put() if @@ -439,14 +457,6 @@ static void pppol2tp_session_destruct(struct sock *sk) } } -static void pppol2tp_put_sk(struct rcu_head *head) -{ - struct pppol2tp_session *ps; - - ps = container_of(head, typeof(*ps), rcu); - sock_put(ps->__sk); -} - /* Called when the PPPoX socket (session) is closed. */ static int pppol2tp_release(struct socket *sock) @@ -470,26 +480,17 @@ static int pppol2tp_release(struct socket *sock) sock_orphan(sk); sock->sk = NULL;
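The destroy path described above — unpublish the socket pointer, then drop the session's ref from a deferred callback — can be modelled in userspace. This is a loose sketch with invented names; call_rcu is modelled as an immediate callback rather than a real grace period, so it only shows the ordering, not the RCU semantics:

```c
#include <assert.h>
#include <stddef.h>

struct toy_psock {
	int refcnt;
	int freed;
};

static void toy_put(struct toy_psock *sk)
{
	if (--sk->refcnt == 0)
		sk->freed = 1;
}

struct toy_session {
	struct toy_psock *sk;	/* published; RCU-protected in the kernel */
	struct toy_psock *__sk;	/* saved pointer for the deferred put */
};

typedef void (*toy_rcu_cb)(struct toy_session *);

/* Stand-in for call_rcu(): the kernel runs the callback after a grace
 * period; the toy runs it immediately. */
static void toy_call_rcu(struct toy_session *s, toy_rcu_cb cb)
{
	cb(s);
}

static void toy_put_sk(struct toy_session *s)
{
	toy_put(s->__sk);
}

/* Mirrors the reworked pppol2tp_session_close(): unpublish the socket,
 * then defer the final put so concurrent RCU readers stay safe. */
static void toy_session_close(struct toy_session *s)
{
	s->__sk = s->sk;
	s->sk = NULL;
	if (s->__sk)
		toy_call_rcu(s, toy_put_sk);
}
```

The point of routing both destroy paths through the close callback is that the deferred put happens exactly once, whichever path tears the session down.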
Re: [PATCH bpf-next] bpf: NULL pointer check is not needed in BPF_CGROUP_RUN_PROG_INET_SOCK
From: Yafang Shao Date: Fri, 23 Feb 2018 14:58:41 +0800 > sk is already allocated in inet_create/inet6_create, hence when > BPF_CGROUP_RUN_PROG_INET_SOCK is executed sk will never be NULL. > > The logic is as below, > sk = sk_alloc(); > if (!sk) > goto out; > BPF_CGROUP_RUN_PROG_INET_SOCK(sk); > > Signed-off-by: Yafang Shao Acked-by: David S. Miller
Re: [PATCH] atm: idt77252: remove redundant bit-wise or'ing of zero
From: Colin King Date: Fri, 23 Feb 2018 12:22:52 + > From: Colin Ian King > > Zero is being bit-wise or'd in a calculation twice; these are redundant > and can be removed. > > Signed-off-by: Colin Ian King Applied.
Re: [PATCH net-next] cxgb4vf: Forcefully link up virtual interfaces
From: Ganesh Goudar Date: Fri, 23 Feb 2018 20:36:53 +0530 > @@ -92,6 +92,18 @@ module_param(msi, int, 0644); > MODULE_PARM_DESC(msi, "whether to use MSI-X or MSI"); > > /* > + * Logic controls. > + * === > + */ > + > +/* The Virtual Interfaces are connected to an internal switch on the chip > + * which allows VIs attached to the same port to talk to each other even when > + * the port link is down. As a result, we generally want to always report a > + * VI's link as being "up". > + */ > +#define force_link_up 1 Please don't do this, just perform the actions unconditionally.
Re: [PATCH v2 net] net_sched: gen_estimator: fix broken estimators based on percpu stats
From: Eric Dumazet Date: Thu, 22 Feb 2018 19:45:27 -0800 > From: Eric Dumazet > > pfifo_fast got percpu stats lately, uncovering a bug I introduced last > year in linux-4.10. > > I missed the fact that we have to clear our temporary storage > before calling __gnet_stats_copy_basic() in the case of percpu stats. > > Without this fix, rate estimators (tc qd replace dev xxx root est 1sec > 4sec pfifo_fast) are utterly broken. > > Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate > estimators") > Signed-off-by: Eric Dumazet > --- > v2: Perform the zeroing in est_fetch_counters() instead of caller(s) Applied and queued up for -stable.
[RFC PATCH bpf-next 12/12] bpf/verifier: update selftests
Slight change to message when execution can run off the end of the program. Signed-off-by: Edward Cree --- tools/testing/selftests/bpf/test_verifier.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index fda35a5a0ff9..a54e50a887dc 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -9758,7 +9758,7 @@ static struct bpf_test tests[] = { BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, -2), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "jump out of range from insn 5 to 6", + .errstr = "no exit/jump at end of program (insn 5)", .result = REJECT, }, {
[RFC PATCH bpf-next 11/12] bpf/verifier: better detection of statically unreachable code
My previous version just checked that every insn could be jumped to from somewhere, which meant that the program could contain multiple connected components. Instead let's do a flood-fill of the insns set to ensure all insns are actually reachable from the entry point. Since this involves essentially the same "where does the jump go" calculations as placing state list marks, do both in the same pass. Signed-off-by: Edward Cree --- kernel/bpf/verifier.c | 149 +- 1 file changed, 88 insertions(+), 61 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index cc8aaf73b4a2..d9d851d8c84e 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3813,83 +3813,110 @@ static int check_return_code(struct bpf_verifier_env *env) #define STATE_LIST_MARK ((struct bpf_verifier_state_list *) -1L) -static int mark_jump_target(struct bpf_verifier_env *env, int from, int off) -{ - int to = from + off + 1; - - if (to < 0 || to >= env->prog->len) { - verbose(env, "jump out of range from insn %d to %d\n", from, to); - return -EINVAL; - } - /* mark branch target for state pruning */ - env->explored_states[to] = STATE_LIST_MARK; - return 0; -} - /* Mark jump/branch targets for control flow tracking & state pruning */ static int prepare_cfg_marks(struct bpf_verifier_env *env) { struct bpf_insn *insns = env->prog->insnsi; int insn_cnt = env->prog->len, i, err = 0; + unsigned long *frontier, *visited; - for (i = 0; i < insn_cnt; i++) { - u8 opcode = BPF_OP(insns[i].code); + frontier = kcalloc(BITS_TO_LONGS(insn_cnt), sizeof(unsigned long), + GFP_KERNEL); + if (!frontier) + return -ENOMEM; + visited = kcalloc(BITS_TO_LONGS(insn_cnt), sizeof(unsigned long), + GFP_KERNEL); + if (!visited) { + err = -ENOMEM; + goto out_frontier; + } + /* insn 0 is our start node */ + __set_bit(0, frontier); + + /* flood fill to find all reachable insns and mark jump targets */ + while (true) { + bool fallthrough = false, jump = false; + int target; + u8 opcode; + + /* select a 
frontier node */ + i = find_first_bit(frontier, insn_cnt); + if (i >= insn_cnt) /* frontier is empty, done */ + break; + /* remove it from the frontier */ + __clear_bit(i, frontier); + /* mark it as visited */ + __set_bit(i, visited); + opcode = BPF_OP(insns[i].code); if (BPF_CLASS(insns[i].code) != BPF_JMP) { - if (i + 1 == insn_cnt) { + fallthrough = true; + } else { + switch (opcode) { + case BPF_EXIT: + break; + case BPF_CALL: + /* following insn is target of function return */ + fallthrough = true; + /* subprog call insns 'branch both ways', as the +* subprog will eventually exit +*/ + if (insns[i].src_reg == BPF_PSEUDO_CALL) { + jump = true; + target = i + insns[i].imm + 1; + } + break; + case BPF_JA: + /* unconditional jump; mark target */ + jump = true; + target = i + insns[i].off + 1; + break; + default: + /* conditional jump; branch both ways */ + fallthrough = true; + jump = true; + target = i + insns[i].off + 1; + break; + } + } + /* if targets are unvisited, add them to the frontier */ + if (fallthrough) { + if (i + 1 >= insn_cnt) { verbose(env, "no exit/jump at end of program (insn %d)\n", i); - return -EINVAL; + err = -EINVAL; + goto out_visited; } - continue; + if (!test_bit(i + 1, visited)) + __set_bit(i + 1, frontier); } - switch (opcode) { - case BPF_EXIT: -
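The flood-fill idea in this patch — start from insn 0, repeatedly pull an insn off the frontier, mark it visited, and push its fallthrough and/or jump targets — can be shown with a self-contained userspace sketch. The toy instruction encoding below is invented for the example (it is not struct bpf_insn), but the traversal and the `target = i + off + 1` convention mirror the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum toy_op { OP_FALL, OP_JA, OP_COND, OP_EXIT };

struct toy_insn {
	enum toy_op op;
	int off;	/* jump offset; target = i + off + 1, as in BPF */
};

#define MAX_INSNS 64

/* Flood-fill reachability from insn 0; fills visited[] and returns the
 * number of reachable insns. Insns never visited are statically
 * unreachable, which the patch rejects. */
static int toy_reach(const struct toy_insn *insns, int cnt, bool *visited)
{
	bool frontier[MAX_INSNS] = { false };
	int count = 0, i;

	memset(visited, 0, cnt * sizeof(bool));
	frontier[0] = true;	/* insn 0 is the start node */

	for (;;) {
		/* select a frontier node (linear scan stands in for
		 * find_first_bit) */
		for (i = 0; i < cnt && !frontier[i]; i++)
			;
		if (i == cnt)
			break;		/* frontier empty: done */
		frontier[i] = false;	/* remove from frontier */
		visited[i] = true;	/* mark visited */
		count++;

		bool fall = false, jump = false;
		switch (insns[i].op) {
		case OP_EXIT:
			break;
		case OP_JA:		/* unconditional jump */
			jump = true;
			break;
		case OP_COND:		/* conditional: branch both ways */
			fall = jump = true;
			break;
		default:		/* non-jump insn falls through */
			fall = true;
			break;
		}
		if (fall && i + 1 < cnt && !visited[i + 1])
			frontier[i + 1] = true;
		if (jump) {
			int t = i + insns[i].off + 1;

			if (t >= 0 && t < cnt && !visited[t])
				frontier[t] = true;
		}
	}
	return count;
}
```

Unlike the earlier "every insn has some predecessor" check, this catches a connected component that is never entered from insn 0.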
[RFC PATCH bpf-next 10/12] bpf/verifier: parent state pointer is not per-frame
Computing min_frame in find_loop and using it to detect recursion means we don't need to play games with per-frame parent pointers, and can instead have a single parent pointer in the verifier_state. Signed-off-by: Edward Cree --- include/linux/bpf_verifier.h | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index ee034232fbd6..0320df10555b 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -121,12 +121,6 @@ struct bpf_func_state { */ u32 subprogno; - /* loop detection; points into an explored_state */ - struct bpf_func_state *parent; - /* These flags are only meaningful in an explored_state, not cur_state */ - bool bounded_loop, conditional; - int live_children; - /* should be second to last. See copy_func_state() */ int allocated_stack; struct bpf_stack_state *stack; @@ -137,6 +131,12 @@ struct bpf_verifier_state { /* call stack tracking */ struct bpf_func_state *frame[MAX_CALL_FRAMES]; u32 curframe; + + /* loop detection; points into an explored_state */ + struct bpf_verifier_state *parent; + /* These flags are only meaningful in an explored_state, not cur_state */ + bool bounded_loop, conditional; + int live_children; }; /* linked list of verifier states used to prune search */
Re: [PATCH][next] xen-netback: make function xenvif_rx_skb static
On Fri, Feb 23, 2018 at 05:16:57PM +, Colin King wrote: > From: Colin Ian King > > The function xenvif_rx_skb is local to the source and does not need > to be in global scope, so make it static. > > Cleans up sparse warning: > drivers/net/xen-netback/rx.c:422:6: warning: symbol 'xenvif_rx_skb' > was not declared. Should it be static? > > Signed-off-by: Colin Ian King Acked-by: Wei Liu Thanks
[RFC PATCH bpf-next 09/12] bpf/verifier: count still-live children of explored_states
By reference-counting how many children of an explored_state are still being walked, we can avoid pruning based on a state that's in our own history (and thus hasn't reached an exit yet) without a persistent mark that prevents other, later branches from being pruned against it when it _has_ reached an exit. Includes a check at free_states() time to ensure that all the reference counts have fallen to zero. Signed-off-by: Edward Cree --- include/linux/bpf_verifier.h | 3 +- kernel/bpf/verifier.c| 109 --- 2 files changed, 74 insertions(+), 38 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 6abd484391f4..ee034232fbd6 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -124,7 +124,8 @@ struct bpf_func_state { /* loop detection; points into an explored_state */ struct bpf_func_state *parent; /* These flags are only meaningful in an explored_state, not cur_state */ - bool in_loop, bounded_loop, conditional; + bool bounded_loop, conditional; + int live_children; /* should be second to last. 
See copy_func_state() */ int allocated_stack; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 9828cb0cde73..cc8aaf73b4a2 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -398,6 +398,12 @@ static void free_verifier_state(struct bpf_verifier_state *state, free_func_state(state->frame[i]); state->frame[i] = NULL; } + /* Check that the live_children accounting is correct */ + if (state->live_children) + pr_warn("Leaked live_children=%d at insn %d, frame %d\n", + state->live_children, + state->frame[state->curframe]->insn_idx, + state->curframe); if (free_self) kfree(state); } @@ -429,6 +435,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state, dst_state->frame[i] = NULL; } dst_state->curframe = src->curframe; + dst_state->parent = src->parent; for (i = 0; i <= src->curframe; i++) { dst = dst_state->frame[i]; @@ -445,6 +452,15 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state, return 0; } +/* Mark this thread as having reached an exit */ +static void kill_thread(struct bpf_verifier_state *state) +{ + struct bpf_verifier_state *cur = state->parent; + + while (cur && !--cur->live_children) + cur = cur->parent; +} + static int pop_stack(struct bpf_verifier_env *env, int *insn_idx) { struct bpf_verifier_state *cur = env->cur_state; @@ -458,6 +474,8 @@ static int pop_stack(struct bpf_verifier_env *env, int *insn_idx) err = copy_verifier_state(cur, &head->st); if (err) return err; + } else { + kill_thread(&head->st); } if (insn_idx) *insn_idx = head->insn_idx; @@ -479,6 +497,7 @@ static int unpush_stack(struct bpf_verifier_env *env) return -ENOENT; elem = head->next; + kill_thread(&head->st); free_verifier_state(&head->st, false); kfree(head); env->head = elem; @@ -509,6 +528,8 @@ static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env, verbose(env, "BPF program is too complex\n"); goto err; } + if (elem->st.parent) + elem->st.parent->live_children++; return &elem->st; err: 
free_verifier_state(env->cur_state, true); @@ -728,11 +749,9 @@ static void init_reg_state(struct bpf_verifier_env *env, static void init_func_state(struct bpf_verifier_env *env, struct bpf_func_state *state, - struct bpf_func_state *parent, int entry, int frameno, int subprogno) { state->insn_idx = entry; - state->parent = parent; state->frameno = frameno; state->subprogno = subprogno; init_reg_state(env, state); @@ -2111,7 +2130,6 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, * callee can read/write into caller's stack */ init_func_state(env, callee, - caller->parent /* parent state for loop detection */, target /* entry point */, state->curframe + 1 /* frameno within this callchain */, subprog /* subprog number within this prog */); @@ -4207,14 +4225,20 @@ static int propagate_liveness(struct bpf_verifier_env *env, return err; } -static struct bpf_func_state *find_loop(struct bpf_verifier_env *env, - bool *saw_cond, bool *saw_bound) +static struct bpf_veri
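The live_children accounting described above — decrement up the parent chain when a verification thread reaches an exit, stopping at the first ancestor that still has other live children — is small enough to sketch standalone. The structure names are invented for the example, but toy_kill_thread mirrors kill_thread() from the patch:

```c
#include <assert.h>
#include <stddef.h>

struct toy_state {
	struct toy_state *parent;
	int live_children;	/* children still being walked */
};

/* Called when a new branch is pushed onto the verification stack. */
static void toy_new_child(struct toy_state *parent)
{
	if (parent)
		parent->live_children++;
}

/* Mirrors kill_thread(): a thread reached an exit, so drop the count up
 * the chain; stop at the first ancestor with other live children. */
static void toy_kill_thread(struct toy_state *state)
{
	struct toy_state *cur = state->parent;

	while (cur && !--cur->live_children)
		cur = cur->parent;
}
```

A state with a nonzero count is still in some walker's history, so pruning against it would be unsound; once the count hits zero along the whole chain, it is a completed state and safe to prune against.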
[RFC PATCH bpf-next 08/12] bpf/verifier: selftests for bounded loops
Mainly consists of tests that broke (or I expected to break) earlier versions of the bounded-loop handling. Also updated some existing tests to deal with changed error messages, programs now being accepted etc. Signed-off-by: Edward Cree --- tools/testing/selftests/bpf/test_verifier.c | 198 +++- 1 file changed, 191 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 722a16b2e9c4..fda35a5a0ff9 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -9338,7 +9338,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "frames is too deep", + .errstr = "recursive call", .result = REJECT, }, { @@ -9389,8 +9389,8 @@ static struct bpf_test tests[] = { BPF_JMP_IMM(BPF_JA, 0, 0, -6), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "back-edge from insn", - .result = REJECT, + .result = ACCEPT, + .retval = 1, }, { "calls: conditional call 4", @@ -9424,8 +9424,8 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "back-edge from insn", - .result = REJECT, + .result = ACCEPT, + .retval = 1, }, { "calls: conditional call 6", @@ -9666,7 +9666,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "frames is too deep", + .errstr = "recursive call", .result = REJECT, }, { @@ -9678,7 +9678,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "frames is too deep", + .errstr = "recursive call", .result = REJECT, }, { @@ -11135,6 +11135,190 @@ static struct bpf_test tests[] = { .result = REJECT, .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, + { + "bounded loop, count to 4", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 1), + BPF_JMP_IMM(BPF_JLT, BPF_REG_0, 4, -2), + BPF_EXIT_INSN(), + }, + 
.result = ACCEPT, + .prog_type = BPF_PROG_TYPE_TRACEPOINT, + .retval = 4, + }, + { + "bounded loop, count to 20", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 1), + BPF_JMP_IMM(BPF_JLT, BPF_REG_0, 20, -2), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "r0 is increasing too slowly", + .prog_type = BPF_PROG_TYPE_TRACEPOINT, + }, + { + "bounded loop, count from positive unknown to 4", + .insns = { + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0), + BPF_JMP_IMM(BPF_JSLT, BPF_REG_0, 0, 2), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 1), + BPF_JMP_IMM(BPF_JLT, BPF_REG_0, 4, -2), + BPF_EXIT_INSN(), + }, + .fixup_map1 = { 3 }, + .result = ACCEPT, + .prog_type = BPF_PROG_TYPE_TRACEPOINT, + .retval = 4, + }, + { + "bounded loop, count from totally unknown to 4", + .insns = { + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0), + BPF_ALU64_IMM(BPF_ADD, BPF_REG
[RFC PATCH bpf-next 05/12] bpf/verifier: detect loops dynamically rather than statically
Add in a new chain of parent states, which does not cross function-call boundaries, and check whether our current insn_idx appears anywhere in the chain. Since all jump targets have state-list marks (now placed by prepare_cfg_marks(), which replaces check_cfg()), it suffices to thread this chain through the existing explored_states and check it only in is_state_visited(). By using this parent-state chain to detect loops, we open up the possibility that in future we could determine a loop to be bounded and thus accept it. A few selftests had to be changed to ensure that all the insns walked before reaching the back-edge are valid. Signed-off-by: Edward Cree --- include/linux/bpf_verifier.h| 2 + kernel/bpf/verifier.c | 280 +--- tools/testing/selftests/bpf/test_verifier.c | 12 +- 3 files changed, 97 insertions(+), 197 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 0bc49c768585..24ddbc1ed3b2 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -120,6 +120,8 @@ struct bpf_func_state { * zero == main subprog */ u32 subprogno; + /* loop detection; points into an explored_state */ + struct bpf_func_state *parent; /* should be second to last. 
See copy_func_state() */ int allocated_stack; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 7a08b5e8e071..e7d898f4fd29 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -429,6 +429,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state, dst_state->frame[i] = NULL; } dst_state->curframe = src->curframe; + for (i = 0; i <= src->curframe; i++) { dst = dst_state->frame[i]; if (!dst) { @@ -716,6 +717,7 @@ static void init_func_state(struct bpf_verifier_env *env, int entry, int frameno, int subprogno) { state->insn_idx = entry; + state->parent = NULL; state->frameno = frameno; state->subprogno = subprogno; init_reg_state(env, state); @@ -3747,211 +3749,85 @@ static int check_return_code(struct bpf_verifier_env *env) return 0; } -/* non-recursive DFS pseudo code - * 1 procedure DFS-iterative(G,v): - * 2 label v as discovered - * 3 let S be a stack - * 4 S.push(v) - * 5 while S is not empty - * 6t <- S.pop() - * 7if t is what we're looking for: - * 8return t - * 9for all edges e in G.adjacentEdges(t) do - * 10 if edge e is already labelled - * 11 continue with the next edge - * 12 w <- G.adjacentVertex(t,e) - * 13 if vertex w is not discovered and not explored - * 14 label e as tree-edge - * 15 label w as discovered - * 16 S.push(w) - * 17 continue at 5 - * 18 else if vertex w is discovered - * 19 label e as back-edge - * 20 else - * 21 // vertex w is explored - * 22 label e as forward- or cross-edge - * 23 label t as explored - * 24 S.pop() - * - * convention: - * 0x10 - discovered - * 0x11 - discovered and fall-through edge labelled - * 0x12 - discovered and fall-through and branch edges labelled - * 0x20 - explored - */ - -enum { - DISCOVERED = 0x10, - EXPLORED = 0x20, - FALLTHROUGH = 1, - BRANCH = 2, -}; - #define STATE_LIST_MARK ((struct bpf_verifier_state_list *) -1L) -static int *insn_stack;/* stack of insns to process */ -static int cur_stack; /* current stack index */ -static int *insn_state; - -/* t, w, e - 
match pseudo-code above: - * t - index of current instruction - * w - next instruction - * e - edge - */ -static int push_insn(int t, int w, int e, struct bpf_verifier_env *env) +static int mark_jump_target(struct bpf_verifier_env *env, int from, int off) { - if (e == FALLTHROUGH && insn_state[t] >= (DISCOVERED | FALLTHROUGH)) - return 0; - - if (e == BRANCH && insn_state[t] >= (DISCOVERED | BRANCH)) - return 0; + int to = from + off + 1; - if (w < 0 || w >= env->prog->len) { - verbose(env, "jump out of range from insn %d to %d\n", t, w); + if (to < 0 || to >= env->prog->len) { + verbose(env, "jump out of range from insn %d to %d\n", from, to); return -EINVAL; } - - if (e == BRANCH) - /* mark branch target for state pruning */ - env->explored_states[w] = STATE_LIST_MARK; - - if (insn_state[w] == 0) { - /* tree-edge */ - insn_state[t] = DISCOVERED | e; - insn_state[w] = DISCOVERED; - if (cur_stack >= env->prog->len) -
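The dynamic loop check this patch moves to is simple at its core: when about to process an insn, walk the chain of parent states and see whether that insn already appears on the current path. A minimal userspace sketch, with names invented for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_vstate {
	struct toy_vstate *parent;	/* chain through explored_states */
	int insn_idx;
};

/* Back-edge detection: does insn_idx already appear in our own history?
 * If so, we have walked around a loop rather than merely revisiting an
 * insn via a different, already-completed branch. */
static bool toy_in_loop(const struct toy_vstate *cur, int insn_idx)
{
	const struct toy_vstate *p;

	for (p = cur->parent; p; p = p->parent)
		if (p->insn_idx == insn_idx)
			return true;
	return false;
}
```

Because the chain is threaded only through states that have state-list marks (all jump targets), the walk is short, and detecting the loop dynamically — rather than rejecting every static back-edge — is what opens the door to accepting bounded loops later in the series.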
[RFC PATCH bpf-next 07/12] bpf/verifier: allow bounded loops with JLT/true back-edge
Where the register umin_value is increasing sufficiently fast, the loop will terminate after a reasonable number of iterations, so we can allow the verifier to keep walking it. The semantics of prev_insn_idx are changed slightly: it now always refers to the last jump insn walked, even when that jump was not taken. Also it is moved into env, alongside a new bool last_jump_taken that records whether the jump was taken or not. This is needed so that, when we detect a loop, we can see how we ended up in the back-edge: what was the jump condition, and is it the true- or the false-branch that ends up looping? We record in our explored_states whether they represent a conditional jump and whether that jump produced a good loop bound. This allows us to make unconditional jumps "not responsible" for ensuring the loop is bounded, as long as there is a conditional jump somewhere in the loop body; whereas a conditional jump must ensure that there is a bounding conditional somewhere in the loop body. Recursive calls have to remain forbidden because the calculation of maximum stack depth assumes the call graph is acyclic, even though the loop handling code could determine whether the recursion was bounded. Signed-off-by: Edward Cree --- include/linux/bpf_verifier.h | 5 + kernel/bpf/verifier.c | 295 +++ 2 files changed, 244 insertions(+), 56 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 24ddbc1ed3b2..6abd484391f4 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -120,8 +120,11 @@ struct bpf_func_state { * zero == main subprog */ u32 subprogno; + /* loop detection; points into an explored_state */ struct bpf_func_state *parent; + /* These flags are only meaningful in an explored_state, not cur_state */ + bool in_loop, bounded_loop, conditional; /* should be second to last.
See copy_func_state() */ int allocated_stack; @@ -197,6 +200,8 @@ struct bpf_verifier_env { u32 id_gen; /* used to generate unique reg IDs */ bool allow_ptr_leaks; bool seen_direct_write; + int prev_insn_idx; /* last branch (BPF_JMP-class) insn */ + bool last_jump_taken; /* Was branch at prev_insn_idx taken? */ struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */ struct bpf_verifer_log log; struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS]; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 7a8ae633d0c3..9828cb0cde73 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -445,8 +445,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state, return 0; } -static int pop_stack(struct bpf_verifier_env *env, int *prev_insn_idx, -int *insn_idx) +static int pop_stack(struct bpf_verifier_env *env, int *insn_idx) { struct bpf_verifier_state *cur = env->cur_state; struct bpf_verifier_stack_elem *elem, *head = env->head; @@ -462,8 +461,7 @@ static int pop_stack(struct bpf_verifier_env *env, int *prev_insn_idx, } if (insn_idx) *insn_idx = head->insn_idx; - if (prev_insn_idx) - *prev_insn_idx = head->prev_insn_idx; + env->prev_insn_idx = head->prev_insn_idx; elem = head->next; free_verifier_state(&head->st, false); kfree(head); @@ -516,7 +514,7 @@ static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env, free_verifier_state(env->cur_state, true); env->cur_state = NULL; /* pop all elements and return */ - while (!pop_stack(env, NULL, NULL)); + while (!pop_stack(env, NULL)); return NULL; } @@ -730,10 +728,11 @@ static void init_reg_state(struct bpf_verifier_env *env, static void init_func_state(struct bpf_verifier_env *env, struct bpf_func_state *state, + struct bpf_func_state *parent, int entry, int frameno, int subprogno) { state->insn_idx = entry; - state->parent = NULL; + state->parent = parent; state->frameno = frameno; state->subprogno = subprogno; init_reg_state(env, state); @@ -2112,6 +2111,7 
@@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, * callee can read/write into caller's stack */ init_func_state(env, callee, + caller->parent /* parent state for loop detection */, target /* entry point */, state->curframe + 1 /* frameno within this callchain */, subprog /* subprog number within this prog */); @@ -3480,6 +3480,12 @@ static bool reg_is_impossible(struct bpf_reg_state reg) reg.smin_value > reg.smax_valu
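The bound rule described in this patch can be sketched as a small Python model (illustrative only, not the kernel C; `verifier_walks` and its parameters are hypothetical names): a JLT-guarded back-edge is a good loop bound when the counter's tracked umin_value strictly increases each iteration, because the unsigned-less-than condition must then eventually become false.

```python
# Illustrative model of a JLT/true back-edge with increasing umin_value.
# The verifier can keep walking such a loop because each pass raises the
# counter's minimum, so "counter < bound" goes false in finitely many steps.

def verifier_walks(umin, bound, step):
    """Count how many times the back-edge's true-branch is walked."""
    assert step > 0, "umin_value must increase for the bound to hold"
    walks = 0
    while umin < bound:   # JLT: take the back-edge while counter < bound
        umin += step      # loop body raises the counter's umin_value
        walks += 1
    return walks
```

For example, a counter stepping by 4 toward a bound of 64 walks the back-edge 16 times; with step 0 the walk would never terminate, which is why the increasing-umin_value requirement matters.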
[RFC PATCH bpf-next 06/12] bpf/verifier: do not walk impossible branches
If a conditional branch would produce an inconsistent set of bounds (e.g. umin_value > umax_value) on one leg, then that leg cannot actually happen so we don't need to walk it. This will be necessary for bounded loop support so that we stop walking round the loop once the termination condition is known to have been met. Signed-off-by: Edward Cree --- kernel/bpf/verifier.c | 53 ++- 1 file changed, 52 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index e7d898f4fd29..7a8ae633d0c3 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -472,6 +472,22 @@ static int pop_stack(struct bpf_verifier_env *env, int *prev_insn_idx, return 0; } +/* Throw away top element of stack. Like pop_stack but don't update cur_state */ +static int unpush_stack(struct bpf_verifier_env *env) +{ + struct bpf_verifier_stack_elem *elem, *head = env->head; + + if (env->head == NULL) + return -ENOENT; + + elem = head->next; + free_verifier_state(&head->st, false); + kfree(head); + env->head = elem; + env->stack_size--; + return 0; +} + static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx) { @@ -2376,8 +2392,10 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, if ((known && (smin_val != smax_val || umin_val != umax_val)) || smin_val > smax_val || umin_val > umax_val) { /* Taint dst register if offset had invalid bounds derived from -* e.g. dead branches. +* e.g. dead branches. This should be impossible now, since we +* prune dead branches in check_cond_jmp_op().
*/ + WARN_ON_ONCE(1); __mark_reg_unknown(dst_reg); return 0; } @@ -3455,6 +3473,13 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn, return true; } +static bool reg_is_impossible(struct bpf_reg_state reg) +{ + return reg.type == SCALAR_VALUE && + (reg.umin_value > reg.umax_value || + reg.smin_value > reg.smax_value); +} + static int check_cond_jmp_op(struct bpf_verifier_env *env, struct bpf_insn *insn, int *insn_idx) { @@ -3574,6 +3599,32 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, } if (env->log.level) print_verifier_state(env, this_branch->frame[this_branch->curframe]); + + if (reg_is_impossible(other_branch_regs[insn->src_reg]) || + reg_is_impossible(other_branch_regs[insn->dst_reg])) { + /* if (false) goto pc+off; +* only follow fall-through branch, since +* that's where the program will go +*/ + verbose(env, "pruning impossible jump\n"); + unpush_stack(env); + } else if (reg_is_impossible(regs[insn->src_reg]) || + reg_is_impossible(regs[insn->dst_reg])) { + /* if (true) goto pc+off; +* only follow the goto, ignore fall-through +*/ + verbose(env, "pruning impossible fall-through\n"); + err = pop_stack(env, NULL, insn_idx); + if (err < 0) { + if (err != -ENOENT) + return err; + } + /* pushed goto included the +1, which caller will try to apply +* so let's undo it here. +*/ + (*insn_idx)--; + } + return 0; }
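The reg_is_impossible() test above is easy to model in a few lines of Python (a sketch mirroring the patch's check, not the kernel code): after a conditional jump, the predicted bounds on one leg can become self-contradictory, which means that leg can never execute and need not be walked.

```python
# Minimal model of reg_is_impossible(): a scalar whose tracked bounds are
# inconsistent (umin > umax or smin > smax) marks a branch leg as dead.

from dataclasses import dataclass

@dataclass
class ScalarBounds:
    umin: int
    umax: int
    smin: int
    smax: int

def reg_is_impossible(reg):
    return reg.umin > reg.umax or reg.smin > reg.smax

# E.g. taking "if r1 < 4" when we already know umin_value >= 10 would
# force umax <= 3, contradicting umin == 10 -> that leg is pruned.
dead = ScalarBounds(umin=10, umax=3, smin=0, smax=3)
live = ScalarBounds(umin=0, umax=3, smin=0, smax=3)
```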
[RFC PATCH bpf-next 04/12] bpf/verifier: store insn_idx instead of callsite in bpf_func_state
This means each entry in the parentage chain can have its insn identified, which will help to support bounded-loop handling later. Signed-off-by: Edward Cree --- include/linux/bpf_verifier.h | 6 -- kernel/bpf/verifier.c| 22 +++--- 2 files changed, 15 insertions(+), 13 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 2b56be9dfb56..0bc49c768585 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -107,8 +107,10 @@ struct bpf_stack_state { */ struct bpf_func_state { struct bpf_reg_state regs[MAX_BPF_REG]; - /* index of call instruction that called into this func */ - int callsite; + /* index of last instruction processed in this func. In frames other +* than innermost, will be call insn +*/ + int insn_idx; /* stack frame number of this function state from pov of * enclosing bpf_verifier_state. * 0 = main function, 1 = first callee. diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index dfba9dbc5bfb..7a08b5e8e071 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -267,7 +267,7 @@ static void print_verifier_state(struct bpf_verifier_env *env, /* reg->off should be 0 for SCALAR_VALUE */ verbose(env, "%lld", reg->var_off.value + reg->off); if (t == PTR_TO_STACK) - verbose(env, ",call_%d", func(env, reg)->callsite); + verbose(env, ",frame_%u", reg->frameno); } else { verbose(env, "(id=%d", reg->id); if (t != SCALAR_VALUE) @@ -711,12 +711,11 @@ static void init_reg_state(struct bpf_verifier_env *env, mark_reg_known_zero(env, regs, BPF_REG_1); } -#define BPF_MAIN_FUNC (-1) static void init_func_state(struct bpf_verifier_env *env, struct bpf_func_state *state, - int callsite, int frameno, int subprogno) + int entry, int frameno, int subprogno) { - state->callsite = callsite; + state->insn_idx = entry; state->frameno = frameno; state->subprogno = subprogno; init_reg_state(env, state); @@ -2095,8 +2094,7 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn 
*insn, * callee can read/write into caller's stack */ init_func_state(env, callee, - /* remember the callsite, it will be used by bpf_exit */ - *insn_idx /* callsite */, + target /* entry point */, state->curframe + 1 /* frameno within this callchain */, subprog /* subprog number within this prog */); @@ -2151,7 +2149,7 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx) /* return to the caller whatever r0 had in the callee */ caller->regs[BPF_REG_0] = *r0; - *insn_idx = callee->callsite + 1; + *insn_idx = caller->insn_idx + 1; if (env->log.level) { verbose(env, "returning from callee:\n"); print_verifier_state(env, callee); @@ -4232,7 +4230,7 @@ static bool states_equal(struct bpf_verifier_env *env, * and all frame states need to be equivalent */ for (i = 0; i <= old->curframe; i++) { - if (old->frame[i]->callsite != cur->frame[i]->callsite) + if (old->frame[i]->insn_idx != cur->frame[i]->insn_idx) return false; if (!func_states_equal(old->frame[i], cur->frame[i])) return false; @@ -4327,7 +4325,7 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) * technically the current state is not proven to be safe yet, * but it will either reach outer most bpf_exit (which means it's safe) * or it will be rejected. Since there are no loops, we won't be -* seeing this tuple (frame[0].callsite, frame[1].callsite, .. insn_idx) +* seeing this tuple (frame[0].insn_idx, frame[1].insn_idx, .. insn_idx) * again on the way to bpf_exit */ new_sl = kzalloc(sizeof(struct bpf_verifier_state_list), GFP_KERNEL); @@ -4394,11 +4392,11 @@ static int do_check(struct bpf_verifier_env *env) mainprogno = add_subprog(env, 0); if (mainprogno < 0) return mainprogno; + insn_idx = 0; init_func_state(env, state->frame[0], - BPF_MAIN_FUNC /* callsite */, + insn_idx /* entry point */, 0 /* frameno */, mainprogno /* subprogno */); - insn_idx = 0; for (;;) { struct bpf_insn_aux_data *aux = &env->insn_aux_data[insn_idx]; struct bpf_func_st
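The callsite-to-insn_idx change can be illustrated with a toy Python model (hypothetical names, not kernel code): every frame records the last instruction index it processed, so on function exit the verifier resumes the caller at caller.insn_idx + 1 instead of consulting a separately stored callsite.

```python
# Toy model of per-frame insn_idx tracking across a call/return.

class Frame:
    def __init__(self, entry, frameno):
        self.insn_idx = entry   # last insn processed in this function
        self.frameno = frameno

def prepare_func_exit(call_stack):
    call_stack.pop()                    # leave the callee frame
    return call_stack[-1].insn_idx + 1  # resume just after the call insn

stack = [Frame(entry=0, frameno=0)]
stack[0].insn_idx = 7                    # main reaches a call insn at 7
stack.append(Frame(entry=20, frameno=1)) # callee starts at its entry point
resume_at = prepare_func_exit(stack)
```

The side benefit noted in the commit message falls out directly: since each frame in the chain carries its own insn_idx, every entry in the parentage chain identifies its instruction, which the later loop-detection patches rely on.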
[RFC PATCH bpf-next 03/12] bpf/verifier: per-register parent pointers
By giving each register its own liveness chain, we elide the skip_callee() logic. Instead, each register's parent is the state it inherits from; both check_func_call() and prepare_func_exit() automatically connect reg states to the correct chain since when they copy the reg state across (r1-r5 into the callee as args, and r0 out as the return value) they also copy the parent pointer. Signed-off-by: Edward Cree --- include/linux/bpf_verifier.h | 8 +- kernel/bpf/verifier.c| 180 ++- 2 files changed, 45 insertions(+), 143 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 0387e0c61161..2b56be9dfb56 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -41,6 +41,7 @@ enum bpf_reg_liveness { }; struct bpf_reg_state { + /* Ordering of fields matters. See states_equal() */ enum bpf_reg_type type; union { /* valid when type == PTR_TO_PACKET */ @@ -59,7 +60,6 @@ struct bpf_reg_state { * came from, when one is tested for != NULL. */ u32 id; - /* Ordering of fields matters. See states_equal() */ /* For scalar types (SCALAR_VALUE), this represents our knowledge of * the actual value. * For pointer types, this represents the variable part of the offset @@ -76,15 +76,15 @@ struct bpf_reg_state { s64 smax_value; /* maximum possible (s64)value */ u64 umin_value; /* minimum possible (u64)value */ u64 umax_value; /* maximum possible (u64)value */ + /* parentage chain for liveness checking */ + struct bpf_reg_state *parent; /* Inside the callee two registers can be both PTR_TO_STACK like * R1=fp-8 and R2=fp-8, but one of them points to this function stack * while another to the caller's stack. To differentiate them 'frameno' * is used which is an index in bpf_verifier_state->frame[] array * pointing to bpf_func_state. -* This field must be second to last, for states_equal() reasons. */ u32 frameno; - /* This field must be last, for states_equal() reasons. 
*/ enum bpf_reg_liveness live; }; @@ -107,7 +107,6 @@ struct bpf_stack_state { */ struct bpf_func_state { struct bpf_reg_state regs[MAX_BPF_REG]; - struct bpf_verifier_state *parent; /* index of call instruction that called into this func */ int callsite; /* stack frame number of this function state from pov of @@ -129,7 +128,6 @@ struct bpf_func_state { struct bpf_verifier_state { /* call stack tracking */ struct bpf_func_state *frame[MAX_CALL_FRAMES]; - struct bpf_verifier_state *parent; u32 curframe; }; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 2dc69fb3bfbe..dfba9dbc5bfb 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -343,9 +343,9 @@ static int copy_stack_state(struct bpf_func_state *dst, /* do_check() starts with zero-sized stack in struct bpf_verifier_state to * make it consume minimal amount of memory. check_stack_write() access from * the program calls into realloc_func_state() to grow the stack size. - * Note there is a non-zero 'parent' pointer inside bpf_verifier_state - * which this function copies over. It points to previous bpf_verifier_state - * which is never reallocated + * Note there is a non-zero parent pointer inside each reg of bpf_verifier_state + * which this function copies over. 
It points to corresponding reg in previous + * bpf_verifier_state which is never reallocated */ static int realloc_func_state(struct bpf_func_state *state, int size, bool copy_old) @@ -429,7 +429,6 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state, dst_state->frame[i] = NULL; } dst_state->curframe = src->curframe; - dst_state->parent = src->parent; for (i = 0; i <= src->curframe; i++) { dst = dst_state->frame[i]; if (!dst) { @@ -699,6 +698,7 @@ static void init_reg_state(struct bpf_verifier_env *env, for (i = 0; i < MAX_BPF_REG; i++) { mark_reg_not_init(env, regs, i); regs[i].live = REG_LIVE_NONE; + regs[i].parent = NULL; } /* frame pointer */ @@ -773,74 +773,21 @@ static int add_subprog(struct bpf_verifier_env *env, int off) return ret; } -static -struct bpf_verifier_state *skip_callee(struct bpf_verifier_env *env, - const struct bpf_verifier_state *state, - struct bpf_verifier_state *parent, - u32 regno) -{ - struct bpf_verifier_state *tmp = NULL; - - /* 'parent' could be a state of caller and -* 'state' could be a state of callee. In such case -* parent->cur
[RFC PATCH bpf-next 02/12] bpf/verifier: update selftests
Error messages for some bad programs have changed, partly because we now check for loops / out-of-bounds jumps before checking subprogs. Problematic selftests: 513 calls: wrong recursive calls This is now rejected with 'unreachable insn 1'. I'm not entirely sure what it was meant to do/test, since all of the JMP|CALLs are also unreachable. 546 calls: ld_abs with changing ctx data in callee Rejected with R1 !read_ok. It was testing for the "can't mix LD_ABS with function calls", which has now changed to "can't use LD_ABS in functions other than main()". I'm still not 100% sure that's right though. Signed-off-by: Edward Cree --- tools/testing/selftests/bpf/test_verifier.c | 46 ++--- 1 file changed, 29 insertions(+), 17 deletions(-) diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 697bd83de295..9c7531887ee3 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -644,7 +644,7 @@ static struct bpf_test tests[] = { .insns = { BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2), }, - .errstr = "not an exit", + .errstr = "jump out of range", .result = REJECT, }, { @@ -9288,7 +9288,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "last insn is not an exit or jmp", + .errstr = "insn 1 was in subprog 1, now 0", .result = REJECT, }, { @@ -9354,7 +9354,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "jump out of range", + .errstr = "insn 5 was in subprog 1, now 0", .result = REJECT, }, { @@ -9633,7 +9633,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_SCHED_CLS, - .errstr = "jump out of range from insn 1 to 4", + .errstr = "insn 5 was in subprog 1, now 0", .result = REJECT, }, { @@ -9649,13 +9649,12 @@ static struct bpf_test tests[] = { BPF_ALU64_REG(BPF_ADD, BPF_REG_7, BPF_REG_0), BPF_MOV64_REG(BPF_REG_0, 
BPF_REG_7), BPF_EXIT_INSN(), - BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, - offsetof(struct __sk_buff, len)), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8), BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, -3), BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "jump out of range from insn 11 to 9", + .errstr = "insn 9 was in subprog 1, now 2", .result = REJECT, }, { @@ -9707,7 +9706,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "invalid destination", + .errstr = "jump out of range from insn 2 to -1", .result = REJECT, }, { @@ -9719,7 +9718,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "invalid destination", + .errstr = "jump out of range from insn 2 to -2147483646", .result = REJECT, }, { @@ -9732,7 +9731,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "jump out of range", + .errstr = "insn 1 was in subprog 0, now 1", .result = REJECT, }, { @@ -9745,7 +9744,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "jump out of range", + .errstr = "insn 4 was in subprog 1, now 0", .result = REJECT, }, { @@ -9759,7 +9758,7 @@ static struct bpf_test tests[] = { BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, -2), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, - .errstr = "not an exit", + .errstr = "jump out of range from insn 5 to 6", .result = REJECT, }, { @@ -9773,7 +9772,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), },
[RFC PATCH bpf-next 01/12] bpf/verifier: validate func_calls by marking at do_check() time
Removes a couple of passes from the verifier, one to check subprogs don't overlap etc., and one to compute max stack depth (which now is done by topologically sorting the call graph). Signed-off-by: Edward Cree --- As noted in the cover letter, I haven't yet integrated the feedback I got the first time I posted this patch. include/linux/bpf_verifier.h | 24 ++- kernel/bpf/verifier.c| 425 +++ 2 files changed, 242 insertions(+), 207 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 6b66cd1aa0b9..0387e0c61161 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -146,6 +146,8 @@ struct bpf_insn_aux_data { s32 call_imm; /* saved imm field of call insn */ }; int ctx_field_size; /* the ctx field size for load insn, maybe 0 */ + u16 subprogno; /* subprog in which this insn resides, valid iff @seen */ + u16 subprog_off; /* insn_idx within subprog, computed in jit_subprogs */ bool seen; /* this insn was processed by the verifier */ }; @@ -168,6 +170,15 @@ static inline bool bpf_verifier_log_full(const struct bpf_verifer_log *log) #define BPF_MAX_SUBPROGS 256 +struct bpf_subprog_info { + /* which other subprogs does this one directly call? */ + DECLARE_BITMAP(callees, BPF_MAX_SUBPROGS); + u32 start; /* insn idx of function entry point */ + u16 stack_depth; /* max. stack depth used by this function */ + u16 total_stack_depth; /* max. 
stack depth used by entire call chain */ + u16 len; /* #insns in this subprog */ +}; + /* single container for all structs * one verifier_env per bpf_check() call */ @@ -186,20 +197,23 @@ struct bpf_verifier_env { bool seen_direct_write; struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */ struct bpf_verifer_log log; - u32 subprog_starts[BPF_MAX_SUBPROGS]; - /* computes the stack depth of each bpf function */ - u16 subprog_stack_depth[BPF_MAX_SUBPROGS + 1]; + struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS]; u32 subprog_cnt; }; __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env, const char *fmt, ...); -static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env) +static inline struct bpf_func_state *cur_frame(struct bpf_verifier_env *env) { struct bpf_verifier_state *cur = env->cur_state; - return cur->frame[cur->curframe]->regs; + return cur->frame[cur->curframe]; +} + +static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env) +{ + return cur_frame(env)->regs; } int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 5fb69a85d967..2dc69fb3bfbe 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -728,111 +728,49 @@ enum reg_arg_type { DST_OP_NO_MARK /* same as above, check only, don't mark */ }; -static int cmp_subprogs(const void *a, const void *b) +static int find_subprog(struct bpf_verifier_env *env, int insn_idx) { - return *(int *)a - *(int *)b; -} - -static int find_subprog(struct bpf_verifier_env *env, int off) -{ - u32 *p; + struct bpf_insn_aux_data *aux; + int insn_cnt = env->prog->len; + u32 subprogno; - p = bsearch(&off, env->subprog_starts, env->subprog_cnt, - sizeof(env->subprog_starts[0]), cmp_subprogs); - if (!p) + if (insn_idx >= insn_cnt || insn_idx < 0) { + verbose(env, "find_subprog of invalid insn_idx %d\n", insn_idx); + return -EINVAL; + } + aux = &env->insn_aux_data[insn_idx]; + if 
(!aux->seen) /* haven't visited this line yet */ return -ENOENT; - return p - env->subprog_starts; - + subprogno = aux->subprogno; + /* validate that we are at start of subprog */ + if (env->subprog_info[subprogno].start != insn_idx) { + verbose(env, "insn_idx %d is in subprog %u but that starts at %d\n", + insn_idx, subprogno, env->subprog_info[subprogno].start); + return -EINVAL; + } + return subprogno; } static int add_subprog(struct bpf_verifier_env *env, int off) { int insn_cnt = env->prog->len; + struct bpf_subprog_info *info; int ret; if (off >= insn_cnt || off < 0) { verbose(env, "call to invalid destination\n"); return -EINVAL; } - ret = find_subprog(env, off); - if (ret >= 0) - return 0; if (env->subprog_cnt >= BPF_MAX_SUBPROGS) { verbose(env, "too many subprograms\n"); return -E2BIG; } - env->subprog_starts[env->subprog_cnt++] = off; - sort(env->subpro
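The new max-stack-depth computation mentioned in the commit message can be sketched as follows (a hedged Python model under the patch's acyclic-call-graph assumption; `total_stack_depth` and its inputs are illustrative names, not the kernel's): a subprog's total depth is its own frame size plus the deepest callee chain, which a memoized walk over the DAG computes in one pass.

```python
# Total stack depth over an acyclic call graph: own frame + deepest callee.

def total_stack_depth(stack_depth, callees):
    """stack_depth: per-subprog frame size; callees: adjacency lists."""
    memo = {}
    def depth(sp):
        if sp not in memo:
            memo[sp] = stack_depth[sp] + max(
                (depth(c) for c in callees[sp]), default=0)
        return memo[sp]
    return [depth(sp) for sp in range(len(stack_depth))]

# main (0) calls subprogs 1 and 2; 1 also calls 2; 2 is a leaf.
depths = total_stack_depth([64, 32, 16], [[1, 2], [2], []])
```

This is also why the later loop patch must keep recursion forbidden: a cycle in `callees` would make the recursion above (and the kernel's equivalent) ill-defined.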
[PATCH 1/1] Linux Traffic Control (tc) unit testing suite's code quality improved, string formattings updated. According to python documentation "The built-in string class provides the ability to do complex variable substitutions and value formatting via the format() method."
--- tools/testing/selftests/tc-testing/tdc.py | 2 +- tools/testing/selftests/tc-testing/tdc_batch.py | 8 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/tc-testing/tdc.py b/tools/testing/selftests/tc-testing/tdc.py index fc373fdf2bdc..80c75464c67e 100755 --- a/tools/testing/selftests/tc-testing/tdc.py +++ b/tools/testing/selftests/tc-testing/tdc.py @@ -277,7 +277,7 @@ def generate_case_ids(alltests): for c in alltests: if (c["id"] == ""): while True: -newid = str('%04x' % random.randrange(16**4)) +newid = str('{:04x}'.format(random.randrange(16**4))) if (does_id_exist(alltests, newid)): continue else: diff --git a/tools/testing/selftests/tc-testing/tdc_batch.py b/tools/testing/selftests/tc-testing/tdc_batch.py index 707c6bfef689..52fa539dc662 100755 --- a/tools/testing/selftests/tc-testing/tdc_batch.py +++ b/tools/testing/selftests/tc-testing/tdc_batch.py @@ -49,13 +49,13 @@ index = 0 for i in range(0x100): for j in range(0x100): for k in range(0x100): -mac = ("%02x:%02x:%02x" % (i, j, k)) +mac = ("{:02x}:{:02x}:{:02x}".format(i, j, k)) src_mac = "e4:11:00:" + mac dst_mac = "e4:12:00:" + mac -cmd = ("filter add dev %s %s protocol ip parent : flower %s " - "src_mac %s dst_mac %s action drop %s" % +cmd = ("filter add dev {} {} protocol ip parent : flower {} " + "src_mac {} dst_mac {} action drop {}".format (device, prio, skip, src_mac, dst_mac, share_action)) -file.write("%s\n" % cmd) +file.write("{}\n".format(cmd)) index += 1 if index >= number: file.close() -- 2.16.1
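The conversions in this patch are mechanical: each '%'-style conversion specifier maps one-to-one onto a str.format() replacement field, so the generated filter commands and test IDs are byte-identical. A quick check of the two specifiers the patch touches:

```python
# '%'-formatting and str.format() produce identical output for the
# zero-padded hex fields used by tdc.py and tdc_batch.py.

mac_old = "%02x:%02x:%02x" % (0xe4, 0x11, 0x00)
mac_new = "{:02x}:{:02x}:{:02x}".format(0xe4, 0x11, 0x00)

id_old = '%04x' % 255
id_new = '{:04x}'.format(255)
```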