Re: [PATCH bpf-next 0/6] Few x64 jit improvements to shrink image size

2018-02-23 Thread Alexei Starovoitov
On Sat, Feb 24, 2018 at 01:07:57AM +0100, Daniel Borkmann wrote:
> Couple of minor improvements to the x64 JIT I had still around from
> pre merge window in order to shrink the image size further. Added
> test cases for kselftests too as well as running Cilium workloads on
> them w/o issues.

Applied to bpf-next, thanks Daniel.



tcp_bind_bucket is missing from slabinfo

2018-02-23 Thread Stephen Hemminger
Somewhere back around 3.17 the kmem cache "tcp_bind_bucket" dropped out
of /proc/slabinfo. It turns out the ss command was dumpster diving
in slabinfo to determine the number of bound sockets and now it always
reports 0.

Not sure why; the cache is still created, but it doesn't
show up in /proc/slabinfo. Could it be some part of making slab/slub common
code (or network namespaces)? The cache is created in tcp_init() but is not visible.

Any ideas?
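For reference, what ss was doing amounts to scanning /proc/slabinfo for the
cache name and reading its active-objects column; when the cache stops being
listed (or SLUB merges it into an anonymous cache), that scan finds nothing and
the tool reports 0. A minimal C sketch of the scan, with a hypothetical helper
name and simplified format handling:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the slabinfo scan: find the named cache and
 * return its active-object count (the first numeric column).  If the
 * cache is not listed at all, return 0 -- which is exactly what ss
 * ends up reporting for bound sockets here. */
static long count_cache_objs(const char *slabinfo, const char *cache)
{
	char line[256];
	const char *p = slabinfo;

	while (*p) {
		size_t n = strcspn(p, "\n");

		if (n >= sizeof(line))
			n = sizeof(line) - 1;
		memcpy(line, p, n);
		line[n] = '\0';

		char name[64];
		long active;

		/* A slabinfo data line starts "<name> <active_objs> ..." */
		if (sscanf(line, "%63s %ld", name, &active) == 2 &&
		    strcmp(name, cache) == 0)
			return active;

		p += n;
		if (*p == '\n')
			p++;
	}
	return 0;	/* cache not listed: caller sees 0 objects */
}
```

Header lines such as "slabinfo - version: 2.1" fail the two-field parse and are
skipped naturally.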


RE: [Intel-wired-lan] [PATCH net-queue] e1000e: Fix check_for_link return value with autoneg off.

2018-02-23 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Benjamin Poirier
> Sent: Monday, February 19, 2018 10:12 PM
> To: Kirsher, Jeffrey T 
> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; linux-
> ker...@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH net-queue] e1000e: Fix check_for_link
> return value with autoneg off.
> 
> When autoneg is off, the .check_for_link callback functions clear the
> get_link_status flag and systematically return a "pseudo-error". This means
> that the link is not detected as up until the next execution of the
> e1000_watchdog_task() 2 seconds later.
> 
> Fixes: 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
> Signed-off-by: Benjamin Poirier 
> ---
>  drivers/net/ethernet/intel/e1000e/ich8lan.c | 2 +-
>  drivers/net/ethernet/intel/e1000e/mac.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 

Tested-by: Aaron Brown 


GIT: net has been merged into net-next

2018-02-23 Thread David Miller

Just FYI...


Re: [next-queue PATCH 7/8] igb: Add support for adding offloaded clsflower filters

2018-02-23 Thread Florian Fainelli
On February 23, 2018 5:20:35 PM PST, Vinicius Costa Gomes 
 wrote:
>This allows filters added by tc-flower and specifying MAC addresses,
>Ethernet types, and the VLAN priority field, to be offloaded to the
>controller.
>
>This reuses most of the infrastructure used by ethtool, ethtool can be
>used to read these filters, but modification and deletion can only be
>done via tc-flower.

You would want to check what other drivers supporting both ethtool::rxnfc and 
cls_flower do, but this can be seriously confusing to a user. As a user I 
would be more comfortable seeing only rules added through ethtool via 
ethtool and those added by cls_flower via cls_flower. They will both access a 
shared set of resources, but it seems easier to me to dump rules with both 
tools to figure out why -ENOSPC was returned than to see something I did 
not add. Others might see it entirely differently.

If you added the ability for cls_flower to indicate a queue number and either a 
fixed rule index or auto-placement (RX_CLS_LOC_ANY), could that eliminate 
entirely the need for adding MAC address steering in earlier patches?

-- 
Florian


Re: [next-queue PATCH 5/8] igb: Add support for ethtool MAC address filters

2018-02-23 Thread Florian Fainelli
On February 23, 2018 5:20:33 PM PST, Vinicius Costa Gomes 
 wrote:
>This adds the capability of configuring the queue steering of arriving
>packets based on their source and destination MAC addresses.
>
>In practical terms this adds support for the following use cases,
>characterized by these examples:
>
>$ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
>(this will direct packets with destination address "aa:aa:aa:aa:aa:aa"
>to the RX queue 0)
>
>$ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 action 3
>(this will direct packets with source address "44:44:44:44:44:44"
>to the RX queue 3)
>
>Signed-off-by: Vinicius Costa Gomes 
>---

[snip]

>diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c
>b/drivers/net/ethernet/intel/igb/igb_ethtool.c
>index 143f0bb34e4d..d8686a0f5b5d 100644
>--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
>+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
>@@ -152,6 +152,9 @@ static const char
>igb_priv_flags_strings[][ETH_GSTRING_LEN] = {
> 
> #define IGB_PRIV_FLAGS_STR_LEN ARRAY_SIZE(igb_priv_flags_strings)
> 
>+static const u8 broadcast_addr[ETH_ALEN] = {
>+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

This is already defined in an existing header, don't have it handy but likely 
etherdevice.h.

-- 
Florian


Re: [next-queue PATCH 8/8] igb: Add support for removing offloaded tc-flower filters

2018-02-23 Thread Florian Fainelli
On February 23, 2018 5:20:36 PM PST, Vinicius Costa Gomes 
 wrote:
>This allows tc-flower filters that were offloaded to be removed.

This should be squashed into your previous patch: either the functionality is 
there and you can add/remove, or it is not.

-- 
Florian


[PATCH V4 net 1/3] Revert "tuntap: add missing xdp flush"

2018-02-23 Thread Jason Wang
This reverts commit 762c330d670e3d4b795cf7a8d761866fdd1eef49. The
reason is that we try to batch packets for devmap, which causes
xdp_do_flush() to be called in process context. Simply disabling
preemption may not be enough, since the process may move among
processors, which leads xdp_do_flush() to miss some flushes on some
processors.

So simply revert the patch; a follow-up patch will add the xdp flush
correctly.

Reported-by: Christoffer Dall 
Fixes: 762c330d670e ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b52258c..2823a4a 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -181,7 +181,6 @@ struct tun_file {
struct tun_struct *detached;
struct ptr_ring tx_ring;
struct xdp_rxq_info xdp_rxq;
-   int xdp_pending_pkts;
 };
 
 struct tun_flow_entry {
@@ -1662,7 +1661,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
case XDP_REDIRECT:
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
-   ++tfile->xdp_pending_pkts;
err = xdp_do_redirect(tun->dev, &xdp, xdp_prog);
if (err)
goto err_redirect;
@@ -1984,11 +1982,6 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
result = tun_get_user(tun, tfile, NULL, from,
  file->f_flags & O_NONBLOCK, false);
 
-   if (tfile->xdp_pending_pkts) {
-   tfile->xdp_pending_pkts = 0;
-   xdp_do_flush_map();
-   }
-
tun_put(tun);
return result;
 }
@@ -2325,13 +2318,6 @@ static int tun_sendmsg(struct socket *sock, struct 
msghdr *m, size_t total_len)
ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
   m->msg_flags & MSG_DONTWAIT,
   m->msg_flags & MSG_MORE);
-
-   if (tfile->xdp_pending_pkts >= NAPI_POLL_WEIGHT ||
-   !(m->msg_flags & MSG_MORE)) {
-   tfile->xdp_pending_pkts = 0;
-   xdp_do_flush_map();
-   }
-
tun_put(tun);
return ret;
 }
@@ -3163,7 +3149,6 @@ static int tun_chr_open(struct inode *inode, struct file 
* file)
sock_set_flag(&tfile->sk, SOCK_ZEROCOPY);
 
memset(&tfile->tx_ring, 0, sizeof(tfile->tx_ring));
-   tfile->xdp_pending_pkts = 0;
 
return 0;
 }
-- 
2.7.4



[PATCH V4 net 3/3] tuntap: correctly add the missing XDP flush

2018-02-23 Thread Jason Wang
We don't flush batched XDP packets through xdp_do_flush_map(), which
causes packets to stall in the TX queue. Considering that we don't do
XDP in NAPI poll(), the only possible fix is to call xdp_do_flush_map()
immediately after xdp_do_redirect().

Note that this in fact defeats batching packets through devmap; we
could address that in the future.

Reported-by: Christoffer Dall 
Fixes: 761876c857cb ("tap: XDP support")
Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 63d39fe6..7433bb2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1663,6 +1663,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
err = xdp_do_redirect(tun->dev, &xdp, xdp_prog);
+   xdp_do_flush_map();
if (err)
goto err_redirect;
rcu_read_unlock();
-- 
2.7.4



[PATCH V4 net 2/3] tuntap: disable preemption during XDP processing

2018-02-23 Thread Jason Wang
Except for tuntap, all other drivers implement XDP in the NAPI
poll() routine, in bh context. This guarantees that all XDP operations
are done on the same CPU, which is required by e.g.
BPF_MAP_TYPE_PERCPU_ARRAY. But for tuntap, we do it in process context
and try to protect XDP processing with only the RCU reader lock. This
is insufficient since CONFIG_PREEMPT_RCU can preempt the RCU read-side
critical section, which breaks the assumption that all XDP processing
happens on the same CPU.

Fix this by simply disabling preemption during XDP processing.

Fixes: 761876c857cb ("tap: XDP support")
Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2823a4a..63d39fe6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1642,6 +1642,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
else
*skb_xdp = 0;
 
+   preempt_disable();
rcu_read_lock();
xdp_prog = rcu_dereference(tun->xdp_prog);
if (xdp_prog && !*skb_xdp) {
@@ -1665,6 +1666,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
if (err)
goto err_redirect;
rcu_read_unlock();
+   preempt_enable();
return NULL;
case XDP_TX:
xdp_xmit = true;
@@ -1686,6 +1688,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
skb = build_skb(buf, buflen);
if (!skb) {
rcu_read_unlock();
+   preempt_enable();
return ERR_PTR(-ENOMEM);
}
 
@@ -1698,10 +1701,12 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
skb->dev = tun->dev;
generic_xdp_tx(skb, xdp_prog);
rcu_read_unlock();
+   preempt_enable();
return NULL;
}
 
rcu_read_unlock();
+   preempt_enable();
 
return skb;
 
@@ -1709,6 +1714,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
put_page(alloc_frag->page);
 err_xdp:
rcu_read_unlock();
+   preempt_enable();
this_cpu_inc(tun->pcpu_stats->rx_dropped);
return NULL;
 }
-- 
2.7.4



Re: Fwd: Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest

2018-02-23 Thread Matthew Wilcox
On Sat, Feb 24, 2018 at 01:49:35AM +, Chris Mi wrote:
> To verify this patch, the following is a sanity test case:
> 
> # tc qdisc delete dev $link ingress > /dev/null 2>&1;
> # tc qdisc add dev $link ingress;
> # tc filter add dev $link prio 1 protocol ip handle 0x8001 parent : 
> flower skip_hw src_mac e4:11:0:0:0:2 dst_mac e4:12:0:0:0:2 action drop;
> # tc filter show dev $link parent :
> 
> filter pref 1 flower chain 0
> filter pref 1 flower chain 0 handle 0x8001

I added these tests to my local tree for now.

diff --git a/tools/testing/radix-tree/idr-test.c 
b/tools/testing/radix-tree/idr-test.c
index 44ef9eba5a7a..28d99325a32d 100644
--- a/tools/testing/radix-tree/idr-test.c
+++ b/tools/testing/radix-tree/idr-test.c
@@ -178,6 +178,29 @@ void idr_get_next_test(int base)
idr_destroy(&idr);
 }
 
+void idr_u32_test(struct idr *idr, int base)
+{
+   assert(idr_is_empty(idr));
+   idr_init_base(idr, base);
+   u32 handle = 10;
+   idr_alloc_u32(idr, NULL, &handle, handle, GFP_KERNEL);
+   BUG_ON(handle != 10);
+   idr_remove(idr, handle);
+   assert(idr_is_empty(idr));
+
+   handle = 0x8001;
+   idr_alloc_u32(idr, NULL, &handle, handle, GFP_KERNEL);
+   BUG_ON(handle != 0x8001);
+   idr_remove(idr, handle);
+   assert(idr_is_empty(idr));
+
+   handle = 0xffe0;
+   idr_alloc_u32(idr, NULL, &handle, handle, GFP_KERNEL);
+   BUG_ON(handle != 0xffe0);
+   idr_remove(idr, handle);
+   assert(idr_is_empty(idr));
+}
+
 void idr_checks(void)
 {
unsigned long i;
@@ -248,6 +271,9 @@ void idr_checks(void)
idr_get_next_test(0);
idr_get_next_test(1);
idr_get_next_test(4);
+   idr_u32_test(&idr, 0);
+   idr_u32_test(&idr, 1);
+   idr_u32_test(&idr, 4);
 }
 
 /*


Re: [PATCH v2 net-next 1/2] lan743x: Add main source files for new lan743x driver

2018-02-23 Thread Andrew Lunn
> Ok, but it seems to me that what I have is an example of "specific book 
> keeping
> private information". Can you clarify the style you prefer?
> 
> In cases of allocation where I can just compare a pointer to null, I can 
> easily remove
> the flags. But in other cases I need a record of which steps completed in 
> order to
> clean up properly. In cases where I need some sort of a flag would you prefer
> I avoid a bit mask, and have a standalone variable for each flag?

Hi Bryan

Often you know something has been done because, if it had not been
done, you would have bailed out with an error. In the release function
you can assume everything done in probe has been done; otherwise the
probe would not have been successful. In close, you can assume everything
done in open was successful; otherwise the open would have failed.

So probe does not need any flags, and open does not need any flags.

   Andrew
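The pattern Andrew describes is the usual goto-unwind style: probe() releases,
in reverse order, only what it already acquired, so by the time remove() runs,
every step is known to have succeeded and no bookkeeping flags are needed. A
userspace C sketch of the idiom (all names and resources made up for
illustration):

```c
#include <stdlib.h>

/* Illustrative stand-in for a driver's private data. */
struct fake_dev {
	void *ring;
	void *irq;
};

static int fake_probe(struct fake_dev *dev)
{
	dev->ring = malloc(64);		/* step 1 */
	if (!dev->ring)
		goto err_ring;

	dev->irq = malloc(16);		/* step 2 */
	if (!dev->irq)
		goto err_irq;

	return 0;	/* success: both steps are guaranteed complete */

err_irq:
	free(dev->ring);		/* unwind step 1 only */
err_ring:
	return -1;
}

static void fake_remove(struct fake_dev *dev)
{
	/* No flag checks: probe succeeded, so both steps completed. */
	free(dev->irq);
	free(dev->ring);
}
```

Partial failures clean up after themselves inside probe(), so remove() never
needs to ask "how far did we get?".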


RE: Fwd: Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest

2018-02-23 Thread Chris Mi
> -Original Message-
> From: Matthew Wilcox [mailto:wi...@infradead.org]
> Sent: Saturday, February 24, 2018 9:15 AM
> To: Cong Wang ; Khalid Aziz
> ; linux-ker...@vger.kernel.org;
> netdev@vger.kernel.org
> Cc: Chris Mi 
> Subject: Re: Fwd: Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running
> selftest
> 
> On Fri, Feb 23, 2018 Randy Dunlap wrote:
> > [add Matthew Wilcox; hopefully he can look/see]
> 
> Thanks, Randy.  I don't understand why nobody else thought to cc the author
> of the patch that it was bisected to ...
> 
> > On 02/23/2018 04:13 PM, Cong Wang wrote:
> > > On Fri, Feb 23, 2018 at 3:27 PM, Cong Wang
> > > 
> > wrote:
> > >> On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap
> > >> 
> > wrote:
> > >>> On 02/23/2018 08:05 AM, Khalid Aziz wrote:
> >  Same selftest does not cause panic on 4.15. git bisect pointed to
> > commit 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based
> > IDRs more efficient").
> >  Kernel config is attached.
> > >>
> > >> Looks like something horribly wrong with u32 key id idr...
> > >
> > > Adding a few printk's, I got:
> > >
> > > [   31.231560] requested handle = ffe0
> > > [   31.232426] allocated handle = 0
> > > ...
> > > [   31.246475] requested handle = ffd0
> > > [   31.247555] allocated handle = 1
> > >
> > >
> > > So the bug is here where we can't allocate a specific handle:
> > >
> > > err = idr_alloc_u32(&tp_c->handle_idr, ht,
> > &handle,
> > > handle, GFP_KERNEL);
> > > if (err) {
> > > kfree(ht);
> > > return err;
> > > }
> 
> Please try this patch.  It fixes ffe0, but there may be more things tested
> that it may not work for.
> 
> Chris Mi, what happened to that set of testcases you promised to write for
> me?
I promised to write them after the API stabilized, since you were going to 
change it.
I will inform management about this new task and get back to you later.
> 
> diff --git a/lib/idr.c b/lib/idr.c
> index c98d77fcf393..10d9b8d47c33 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -36,8 +36,8 @@ int idr_alloc_u32(struct idr *idr, void *ptr, u32 *nextid,  
> {
>   struct radix_tree_iter iter;
>   void __rcu **slot;
> - int base = idr->idr_base;
> - int id = *nextid;
> + unsigned int base = idr->idr_base;
> + unsigned int id = *nextid;
> 
>   if (WARN_ON_ONCE(radix_tree_is_internal_node(ptr)))
>   return -EINVAL;
To verify this patch, the following is a sanity test case:

# tc qdisc delete dev $link ingress > /dev/null 2>&1;
# tc qdisc add dev $link ingress;
# tc filter add dev $link prio 1 protocol ip handle 0x8001 parent : 
flower skip_hw src_mac e4:11:0:0:0:2 dst_mac e4:12:0:0:0:2 action drop;
# tc filter show dev $link parent :

filter pref 1 flower chain 0
filter pref 1 flower chain 0 handle 0x8001
  dst_mac e4:12:00:00:00:02
  src_mac e4:11:00:00:00:02
  eth_type ipv4
  skip_hw
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 1 ref 1 bind 1

Please make sure the handle is the same as the user specifies.
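The root cause the patch above addresses is the signedness of the local
variables in idr_alloc_u32(): a u32 handle at or above 0x80000000 turns
negative when held in a plain int, so it no longer compares or round-trips the
way the caller intended. A standalone C sketch of the effect (hypothetical
helper names; the int conversion assumes the usual two's-complement behaviour):

```c
/* What the old code effectively did: hold a u32 handle in an int.
 * For handles >= 0x80000000 (e.g. 0xffffffe0 from the tc test case)
 * the value becomes negative. */
static int handle_is_nonnegative_as_int(unsigned int handle)
{
	int id = (int)handle;

	return id >= 0;
}

/* The fixed code keeps the handle unsigned, so it survives intact. */
static unsigned int handle_as_unsigned(unsigned int handle)
{
	unsigned int id = handle;

	return id;
}
```

With the signed variable, 0xffffffe0 reads as -32, which is how the requested
handle could come back as 0 or 1 in the printk trace above.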


Re: [PATCH net-next] r8169: improve interrupt handling

2018-02-23 Thread Francois Romieu
Heiner Kallweit  :
[...]
> Last but not least it enables a feature which was (I presume accidently)
> disabled before. There are members of the RTL8169 family supporting MSI
> (e.g. RTL8169SB), however MSI never got enabled because RTL_CFG_0 was
> missing flag RTL_FEATURE_MSI.
> An indicator for "accidently" is that statement "cfg2 |= MSIEnable;"

The reality is simpler: it could have been removed.

> in rtl_try_msi() is dead code. cfg2 is used for chip versions <= 06
> only and for all these chip versions RTL_FEATURE_MSI isn't set.

On purpose:
1. mostly untested
2. MSI without MSI-X does not buy much
3. wrt 2., ok, it kills (yucky) plain old shared PCI irq (remember those ?)
   but I didn't end up feeling good enough about realtek MSI support on older
   chipsets to enable it any further

[...]
> diff --git a/drivers/net/ethernet/realtek/r8169.c 
> b/drivers/net/ethernet/realtek/r8169.c
> index 96db3283e..4730db990 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
[...]
> -static unsigned rtl_try_msi(struct rtl8169_private *tp,
> - const struct rtl_cfg_info *cfg)
> +static void rtl_alloc_irq(struct rtl8169_private *tp)
>  {
[...]
> + ret = pci_alloc_irq_vectors(tp->pci_dev, 1, 1, PCI_IRQ_ALL_TYPES);
> + if (ret < 0) {
> + netif_err(tp, drv, tp->dev, "failed to allocate irq!\n");
> + return;
[...]
> @@ -8497,9 +8495,7 @@ static int rtl_init_one(struct pci_dev *pdev, const 
> struct pci_device_id *ent)
>   chipset = tp->mac_version;
>   tp->txd_version = rtl_chip_infos[chipset].txd_version;
>  
> - RTL_W8(Cfg9346, Cfg9346_Unlock);
> - tp->features |= rtl_try_msi(tp, cfg);
> - RTL_W8(Cfg9346, Cfg9346_Lock);
> + rtl_alloc_irq(tp);

Happily proceeding after error. :o/

-- 
Ueimor


[next-queue PATCH 3/8] igb: Enable the hardware traffic class feature bit for igb models

2018-02-23 Thread Vinicius Costa Gomes
This allows functionality that depends on the hardware being traffic
class aware to work. In particular, the tc-flower offload code verifies
that this bit is set.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 0ea32be07d71..543aa99892eb 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2820,8 +2820,10 @@ static int igb_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
   NETIF_F_HW_VLAN_CTAG_TX |
   NETIF_F_RXALL;
 
-   if (hw->mac.type >= e1000_i350)
-   netdev->hw_features |= NETIF_F_NTUPLE;
+   if (hw->mac.type >= e1000_i350) {
+   netdev->hw_features |= (NETIF_F_NTUPLE | NETIF_F_HW_TC);
+   netdev->features |= NETIF_F_HW_TC;
+   }
 
if (pci_using_dac)
netdev->features |= NETIF_F_HIGHDMA;
-- 
2.16.2



[next-queue PATCH 2/8] igb: Fix queue selection on MAC filters on i210 and i211

2018-02-23 Thread Vinicius Costa Gomes
In the RAH registers there are semantic differences in the meaning of
the "queue" parameter for traffic steering, depending on the controller
model: there is the 82575 meaning, in which "queue" means an RX hardware
queue, and the i350 meaning, where it means a reception pool.

The previous behaviour had no effect on i210 and i211 based
controllers because the QSEL bit of the RAH register wasn't being set.

This patch separates the condition into discrete cases, so the different
handling is clearer.

Fixes: 83c21335c876 ("igb: improve MAC filter handling")
Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c  | 15 +++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 83cabff1e0ab..573bf177fd08 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -490,6 +490,7 @@
  * manageability enabled, allowing us room for 15 multicast addresses.
  */
 #define E1000_RAH_AV  0x8000/* Receive descriptor valid */
+#define E1000_RAH_QSEL_ENABLE 0x1000
 #define E1000_RAL_MAC_ADDR_LEN 4
 #define E1000_RAH_MAC_ADDR_LEN 2
 #define E1000_RAH_POOL_MASK 0x03FC
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae785369..0ea32be07d71 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8741,12 +8741,19 @@ static void igb_rar_set_index(struct igb_adapter 
*adapter, u32 index)
if (is_valid_ether_addr(addr))
rar_high |= E1000_RAH_AV;
 
-   if (hw->mac.type == e1000_82575)
+   switch (hw->mac.type) {
+   case e1000_82575:
+   case e1000_i210:
+   case e1000_i211:
+   rar_high |= E1000_RAH_QSEL_ENABLE;
rar_high |= E1000_RAH_POOL_1 *
-   adapter->mac_table[index].queue;
-   else
+ adapter->mac_table[index].queue;
+   break;
+   default:
rar_high |= E1000_RAH_POOL_1 <<
-   adapter->mac_table[index].queue;
+   adapter->mac_table[index].queue;
+   break;
+   }
}
 
wr32(E1000_RAL(index), rar_low);
-- 
2.16.2
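The two "queue" semantics the patch separates can be sketched numerically.
This is an illustrative C sketch, not driver code; the bit positions of
E1000_RAH_QSEL_ENABLE and E1000_RAH_POOL_1 (bit 28 and bit 18 here) are
assumptions made for the example:

```c
/* Assumed bit positions, for illustration only. */
#define RAH_QSEL_ENABLE	0x10000000u	/* queue-select mode, vs pool-select */
#define RAH_POOL_1	0x00040000u	/* lowest bit of the queue/pool field */

/* 82575/i210/i211 semantics: "queue" is an RX queue *number* placed in
 * the field, and the QSEL bit must be set for it to take effect. */
static unsigned int rah_queue_select(unsigned int queue)
{
	return RAH_QSEL_ENABLE | (RAH_POOL_1 * queue);
}

/* i350 semantics: "queue" picks a reception pool, encoded as a one-hot
 * bitmask in the same field. */
static unsigned int rah_pool_select(unsigned int queue)
{
	return RAH_POOL_1 << queue;
}
```

The multiply encodes the queue number directly (queue 2 becomes field value 2),
while the shift sets bit number "queue" (pool 2 becomes a one-hot mask), which
is why the two cases cannot share one expression.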



[next-queue PATCH 8/8] igb: Add support for removing offloaded tc-flower filters

2018-02-23 Thread Vinicius Costa Gomes
This allows tc-flower filters that were offloaded to be removed.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 32 ++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index b1d401e77d62..5e0e1df0e941 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2641,10 +2641,40 @@ static int igb_configure_clsflower(struct igb_adapter 
*adapter,
return err;
 }
 
+static int igb_delete_filter_by_cookie(struct igb_adapter *adapter,
+  unsigned long cookie)
+{
+   struct igb_nfc_filter *filter;
+   int err;
+
+   spin_lock(&adapter->nfc_lock);
+
+   hlist_for_each_entry(filter, &adapter->nfc_filter_list, nfc_node) {
+   if (filter->cookie == cookie)
+   break;
+   }
+
+   if (!filter) {
+   err = -ENOENT;
+   goto out;
+   }
+
+   err = igb_erase_filter(adapter, filter);
+
+   hlist_del(&filter->nfc_node);
+   kfree(filter);
+   adapter->nfc_filter_count--;
+
+out:
+   spin_unlock(&adapter->nfc_lock);
+
+   return err;
+}
+
 static int igb_delete_clsflower(struct igb_adapter *adapter,
struct tc_cls_flower_offload *cls_flower)
 {
-   return -EOPNOTSUPP;
+   return igb_delete_filter_by_cookie(adapter, cls_flower->cookie);
 }
 
 static int igb_setup_tc_cls_flower(struct igb_adapter *adapter,
-- 
2.16.2



[PATCH V2 net-next 0/3] RDS: optimized notification for zerocopy completion

2018-02-23 Thread Sowmini Varadhan
RDS applications use predominantly request-response, transaction 
based IPC, so that ingress and egress traffic are well-balanced,
and it is possible/desirable to reduce system-call overhead by
piggybacking the notifications for zerocopy completion
with data. 

Moreover, it has been pointed out that socket functions block
if sk_err is non-zero, thus if the RDS code does not plan/need
to use the sk_error_queue path for completion notification, it
is preferable to remove the sk_error_queue related paths in
RDS.

Both of these goals are implemented in this series.

Sowmini Varadhan (3):
  selftests/net: revert the zerocopy Rx path for PF_RDS
  rds: deliver zerocopy completion notification with data
  selftests/net: reap zerocopy completions passed up as ancillary data.

 include/uapi/linux/errqueue.h  |2 -
 include/uapi/linux/rds.h   |7 ++
 net/rds/af_rds.c   |7 ++-
 net/rds/message.c  |   35 --
 net/rds/rds.h  |2 +
 net/rds/recv.c |   34 +-
 tools/testing/selftests/net/msg_zerocopy.c |  109 +++-
 7 files changed, 103 insertions(+), 93 deletions(-)



[next-queue PATCH 5/8] igb: Add support for ethtool MAC address filters

2018-02-23 Thread Vinicius Costa Gomes
This adds the capability of configuring the queue steering of arriving
packets based on their source and destination MAC addresses.

In practical terms this adds support for the following use cases,
characterized by these examples:

$ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
(this will direct packets with destination address "aa:aa:aa:aa:aa:aa"
to the RX queue 0)

$ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 action 3
(this will direct packets with source address "44:44:44:44:44:44"
to the RX queue 3)

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb.h |   9 +++
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 110 ++-
 drivers/net/ethernet/intel/igb/igb_main.c|   8 +-
 3 files changed, 119 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index d5cd5f6708d9..e06d6fdcb2ce 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -440,6 +440,8 @@ struct hwmon_buff {
 enum igb_filter_match_flags {
IGB_FILTER_FLAG_ETHER_TYPE = 0x1,
IGB_FILTER_FLAG_VLAN_TCI   = 0x2,
+   IGB_FILTER_FLAG_SRC_MAC_ADDR   = 0x4,
+   IGB_FILTER_FLAG_DST_MAC_ADDR   = 0x8,
 };
 
 #define IGB_MAX_RXNFC_FILTERS 16
@@ -454,6 +456,8 @@ struct igb_nfc_input {
u8 match_flags;
__be16 etype;
__be16 vlan_tci;
+   u8 src_addr[ETH_ALEN];
+   u8 dst_addr[ETH_ALEN];
 };
 
 struct igb_nfc_filter {
@@ -738,4 +742,9 @@ int igb_add_filter(struct igb_adapter *adapter,
 int igb_erase_filter(struct igb_adapter *adapter,
 struct igb_nfc_filter *input);
 
+int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
+  const u8 queue, const u8 flags);
+int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
+  const u8 queue, const u8 flags);
+
 #endif /* _IGB_H_ */
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 143f0bb34e4d..d8686a0f5b5d 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -152,6 +152,9 @@ static const char igb_priv_flags_strings[][ETH_GSTRING_LEN] 
= {
 
 #define IGB_PRIV_FLAGS_STR_LEN ARRAY_SIZE(igb_priv_flags_strings)
 
+static const u8 broadcast_addr[ETH_ALEN] = {
+   0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
+
 static int igb_get_link_ksettings(struct net_device *netdev,
  struct ethtool_link_ksettings *cmd)
 {
@@ -2494,6 +2497,25 @@ static int igb_get_ethtool_nfc_entry(struct igb_adapter 
*adapter,
fsp->h_ext.vlan_tci = rule->filter.vlan_tci;
fsp->m_ext.vlan_tci = htons(VLAN_PRIO_MASK);
}
+   if (rule->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR) {
+   ether_addr_copy(fsp->h_u.ether_spec.h_dest,
+   rule->filter.dst_addr);
+   /* As we only support matching by the full
+* mask, return the mask to userspace
+*/
+   ether_addr_copy(fsp->m_u.ether_spec.h_dest,
+   broadcast_addr);
+   }
+   if (rule->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR) {
+   ether_addr_copy(fsp->h_u.ether_spec.h_source,
+   rule->filter.src_addr);
+   /* As we only support matching by the full
+* mask, return the mask to userspace
+*/
+   ether_addr_copy(fsp->m_u.ether_spec.h_source,
+   broadcast_addr);
+   }
+
return 0;
}
return -EINVAL;
@@ -2698,6 +2720,58 @@ static int igb_set_rss_hash_opt(struct igb_adapter 
*adapter,
return 0;
 }
 
+static int igb_rxnfc_write_dst_mac_filter(struct igb_adapter *adapter,
+ struct igb_nfc_filter *input)
+{
+   int err;
+
+   err = igb_add_mac_filter(adapter, input->filter.dst_addr,
+input->action, 0);
+   if (err < 0)
+   return err;
+
+   return 0;
+}
+
+static int igb_rxnfc_write_src_mac_filter(struct igb_adapter *adapter,
+ struct igb_nfc_filter *input)
+{
+   int err;
+
+   err = igb_add_mac_filter(adapter, input->filter.src_addr,
+input->action, IGB_MAC_STATE_SRC_ADDR);
+   if (err < 0)
+   return err;
+
+   return 0;
+}
+
+static int igb_rxnfc_del_dst_mac_filter(struct igb_adapter *adapter,
+   struct igb_nfc_filter *input)
+{
+   int err;
+
+   err = igb_del_mac_filter(adapt

[next-queue PATCH 7/8] igb: Add support for adding offloaded clsflower filters

2018-02-23 Thread Vinicius Costa Gomes
This allows filters added by tc-flower and specifying MAC addresses,
Ethernet types, and the VLAN priority field, to be offloaded to the
controller.

This reuses most of the infrastructure used by ethtool, ethtool can be
used to read these filters, but modification and deletion can only be
done via tc-flower.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb.h |   5 +
 drivers/net/ethernet/intel/igb/igb_ethtool.c |  13 ++-
 drivers/net/ethernet/intel/igb/igb_main.c| 140 ++-
 3 files changed, 152 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index e06d6fdcb2ce..05d8c827d33e 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -463,6 +463,7 @@ struct igb_nfc_input {
 struct igb_nfc_filter {
struct hlist_node nfc_node;
struct igb_nfc_input filter;
+   unsigned long cookie;
u16 etype_reg_index;
u16 sw_idx;
u16 action;
@@ -747,4 +748,8 @@ int igb_add_mac_filter(struct igb_adapter *adapter, const 
u8 *addr,
 int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
   const u8 queue, const u8 flags);
 
+int igb_update_ethtool_nfc_entry(struct igb_adapter *adapter,
+struct igb_nfc_filter *input,
+u16 sw_idx);
+
 #endif /* _IGB_H_ */
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index d8686a0f5b5d..5386eb68ab15 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2918,9 +2918,9 @@ int igb_erase_filter(struct igb_adapter *adapter, struct 
igb_nfc_filter *input)
return 0;
 }
 
-static int igb_update_ethtool_nfc_entry(struct igb_adapter *adapter,
-   struct igb_nfc_filter *input,
-   u16 sw_idx)
+int igb_update_ethtool_nfc_entry(struct igb_adapter *adapter,
+struct igb_nfc_filter *input,
+u16 sw_idx)
 {
struct igb_nfc_filter *rule, *parent;
int err = -EINVAL;
@@ -2935,8 +2935,11 @@ static int igb_update_ethtool_nfc_entry(struct 
igb_adapter *adapter,
parent = rule;
}
 
-   /* if there is an old rule occupying our place remove it */
-   if (rule && (rule->sw_idx == sw_idx)) {
+   /* if there is an old rule occupying our place remove it, also
+* only allow rules added by ethtool to be removed, these
+* rules don't have a cookie
+*/
+   if (rule && (!rule->cookie && rule->sw_idx == sw_idx)) {
if (!input)
err = igb_erase_filter(adapter, rule);
 
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 5344261e6f45..b1d401e77d62 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2497,10 +2497,148 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
return 0;
 }
 
+#define ETHER_TYPE_FULL_MASK ((__force __be16)~0)
+
+static int igb_parse_cls_flower(struct igb_adapter *adapter,
+   struct tc_cls_flower_offload *f,
+   int traffic_class,
+   struct igb_nfc_filter *input)
+{
+   if (f->dissector->used_keys &
+   ~(BIT(FLOW_DISSECTOR_KEY_BASIC) |
+ BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+ BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
+ BIT(FLOW_DISSECTOR_KEY_VLAN))) {
+   dev_err(&adapter->pdev->dev, "Unsupported key used: 0x%x\n",
+   f->dissector->used_keys);
+   return -EOPNOTSUPP;
+   }
+
+   if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+   struct flow_dissector_key_eth_addrs *key =
+   skb_flow_dissector_target(f->dissector,
+ FLOW_DISSECTOR_KEY_ETH_ADDRS,
+ f->key);
+
+   struct flow_dissector_key_eth_addrs *mask =
+   skb_flow_dissector_target(f->dissector,
+ FLOW_DISSECTOR_KEY_ETH_ADDRS,
+ f->mask);
+
+   if (is_broadcast_ether_addr(mask->dst)) {
+   input->filter.match_flags |=
+   IGB_FILTER_FLAG_DST_MAC_ADDR;
+   ether_addr_copy(input->filter.dst_addr, key->dst);
+   }
+
+   if (is_broadcast_ether_addr(mask->src)) {
+   input->filter.match_flags |=
+   IGB_FILTER_FLAG_SRC_MAC_ADDR;
+   ether_addr_copy(input->filter.src_addr, key->src);

[next-queue PATCH 4/8] igb: Add support for MAC address filters specifying source addresses

2018-02-23 Thread Vinicius Costa Gomes
Makes it possible to direct packets to queues based on their source
address. Documents the expected usage of the 'flags' parameter.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  1 +
 drivers/net/ethernet/intel/igb/igb.h   |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c  | 35 +++---
 3 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 573bf177fd08..c6f552de30dd 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -490,6 +490,7 @@
  * manageability enabled, allowing us room for 15 multicast addresses.
  */
 #define E1000_RAH_AV  0x8000/* Receive descriptor valid */
+#define E1000_RAH_ASEL_SRC_ADDR 0x0001
 #define E1000_RAH_QSEL_ENABLE 0x1000
 #define E1000_RAL_MAC_ADDR_LEN 4
 #define E1000_RAH_MAC_ADDR_LEN 2
diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 1c6b8d9176a8..d5cd5f6708d9 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -472,6 +472,7 @@ struct igb_mac_addr {
 
 #define IGB_MAC_STATE_DEFAULT  0x1
 #define IGB_MAC_STATE_IN_USE   0x2
+#define IGB_MAC_STATE_SRC_ADDR  0x4
 
 /* board specific private data structure */
 struct igb_adapter {
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 543aa99892eb..db66b697fe3b 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6837,8 +6837,13 @@ static void igb_set_default_mac_filter(struct 
igb_adapter *adapter)
igb_rar_set_index(adapter, 0);
 }
 
+/* Add a MAC filter for 'addr' directing matching traffic to 'queue',
+ * 'flags' is used to indicate what kind of match is made, match is by
+ * default for the destination address, if matching by source address
+ * is desired the flag IGB_MAC_STATE_SRC_ADDR can be used.
+ */
 static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
- const u8 queue)
+ const u8 queue, const u8 flags)
 {
struct e1000_hw *hw = &adapter->hw;
int rar_entries = hw->mac.rar_entry_count -
@@ -6858,7 +6863,7 @@ static int igb_add_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
 
ether_addr_copy(adapter->mac_table[i].addr, addr);
adapter->mac_table[i].queue = queue;
-   adapter->mac_table[i].state |= IGB_MAC_STATE_IN_USE;
+   adapter->mac_table[i].state |= (IGB_MAC_STATE_IN_USE | flags);
 
igb_rar_set_index(adapter, i);
return i;
@@ -6867,8 +6872,14 @@ static int igb_add_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
return -ENOSPC;
 }
 
+/* Remove a MAC filter for 'addr' directing matching traffic to
+ * 'queue', 'flags' is used to indicate what kind of match need to be
+ * removed, match is by default for the destination address, if
+ * matching by source address is to be removed the flag
+ * IGB_MAC_STATE_SRC_ADDR can be used.
+ */
 static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
- const u8 queue)
+ const u8 queue, const u8 flags)
 {
struct e1000_hw *hw = &adapter->hw;
int rar_entries = hw->mac.rar_entry_count -
@@ -6883,14 +6894,15 @@ static int igb_del_mac_filter(struct igb_adapter 
*adapter, const u8 *addr,
 * for the VF MAC addresses.
 */
for (i = 0; i < rar_entries; i++) {
-   if (!(adapter->mac_table[i].state & IGB_MAC_STATE_IN_USE))
+   if (!(adapter->mac_table[i].state &
+ (IGB_MAC_STATE_IN_USE | flags)))
continue;
if (adapter->mac_table[i].queue != queue)
continue;
if (!ether_addr_equal(adapter->mac_table[i].addr, addr))
continue;
 
-   adapter->mac_table[i].state &= ~IGB_MAC_STATE_IN_USE;
+   adapter->mac_table[i].state &= ~(IGB_MAC_STATE_IN_USE | flags);
memset(adapter->mac_table[i].addr, 0, ETH_ALEN);
adapter->mac_table[i].queue = 0;
 
@@ -6906,7 +6918,8 @@ static int igb_uc_sync(struct net_device *netdev, const 
unsigned char *addr)
struct igb_adapter *adapter = netdev_priv(netdev);
int ret;
 
-   ret = igb_add_mac_filter(adapter, addr, adapter->vfs_allocated_count);
+   ret = igb_add_mac_filter(adapter, addr,
+adapter->vfs_allocated_count, 0);
 
return min_t(int, ret, 0);
 }
@@ -6915,7 +6928,7 @@ static int igb_uc_unsync(struct net_device *netdev, const 
unsigned char *addr)
 {
struct igb_adapter *adapter = netdev_priv(netdev);
 
-   igb_del_mac_filter(adapter, addr, adapter->vfs_allocated_count);

[next-queue PATCH 0/8] igb: offloading of receive filters

2018-02-23 Thread Vinicius Costa Gomes
Hi,

This series enables some ethtool and tc-flower filters to be offloaded
to igb-based network controllers. This is useful when the system
configurator wants to steer certain kinds of traffic to a specific hardware
queue.

The first two commits are bug fixes.

The basis of this series is to export the internal API used to
configure address filters, so it can be used by ethtool, and to
extend the functionality so that source addresses can be handled.

Then, we enable the tc-flower offloading implementation to re-use the
same infrastructure as ethtool, storing the filters in the per-adapter
"nfc" (Network Filter Config?) list. For consistency, destructive
access is kept separate, i.e. a filter added by tc-flower can only be
removed by tc-flower, but ethtool can read them all.

Only VLAN priority, source and destination MAC address, and Ethertype
matches are supported for now.

Open question:
  - igb is initialized with the number of traffic classes set to 1. If we
  want to use multiple traffic classes we need to increase this value;
  the only way I could find to do that is to use mqprio (for example).
  Should igb be initialized with, say, its number of queues as "num_tc"?


Vinicius Costa Gomes (8):
  igb: Fix not adding filter elements to the list
  igb: Fix queue selection on MAC filters on i210 and i211
  igb: Enable the hardware traffic class feature bit for igb models
  igb: Add support for MAC address filters specifying source addresses
  igb: Add support for ethtool MAC address filters
  igb: Add the skeletons for tc-flower offloading
  igb: Add support for adding offloaded clsflower filters
  igb: Add support for removing offloaded tc-flower filters

 drivers/net/ethernet/intel/igb/e1000_defines.h |   2 +
 drivers/net/ethernet/intel/igb/igb.h   |  15 ++
 drivers/net/ethernet/intel/igb/igb_ethtool.c   | 125 ++-
 drivers/net/ethernet/intel/igb/igb_main.c  | 294 +++--
 4 files changed, 409 insertions(+), 27 deletions(-)

--
2.16.2


[next-queue PATCH 6/8] igb: Add the skeletons for tc-flower offloading

2018-02-23 Thread Vinicius Costa Gomes
This adds basic functions needed to implement offloading for filters
created by tc-flower.

Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 66 +++
 1 file changed, 66 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 5dfbdf05f765..5344261e6f45 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2496,6 +2497,69 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
return 0;
 }
 
+static int igb_configure_clsflower(struct igb_adapter *adapter,
+  struct tc_cls_flower_offload *cls_flower)
+{
+   return -EOPNOTSUPP;
+}
+
+static int igb_delete_clsflower(struct igb_adapter *adapter,
+   struct tc_cls_flower_offload *cls_flower)
+{
+   return -EOPNOTSUPP;
+}
+
+static int igb_setup_tc_cls_flower(struct igb_adapter *adapter,
+  struct tc_cls_flower_offload *cls_flower)
+{
+   switch (cls_flower->command) {
+   case TC_CLSFLOWER_REPLACE:
+   return igb_configure_clsflower(adapter, cls_flower);
+   case TC_CLSFLOWER_DESTROY:
+   return igb_delete_clsflower(adapter, cls_flower);
+   case TC_CLSFLOWER_STATS:
+   return -EOPNOTSUPP;
+   default:
+   return -EINVAL;
+   }
+}
+
+static int igb_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
+void *cb_priv)
+{
+   struct igb_adapter *adapter = cb_priv;
+
+   if (!tc_cls_can_offload_and_chain0(adapter->netdev, type_data))
+   return -EOPNOTSUPP;
+
+   switch (type) {
+   case TC_SETUP_CLSFLOWER:
+   return igb_setup_tc_cls_flower(adapter, type_data);
+
+   default:
+   return -EOPNOTSUPP;
+   }
+}
+
+static int igb_setup_tc_block(struct igb_adapter *adapter,
+ struct tc_block_offload *f)
+{
+   if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
+   return -EOPNOTSUPP;
+
+   switch (f->command) {
+   case TC_BLOCK_BIND:
+   return tcf_block_cb_register(f->block, igb_setup_tc_block_cb,
+adapter, adapter);
+   case TC_BLOCK_UNBIND:
+   tcf_block_cb_unregister(f->block, igb_setup_tc_block_cb,
+   adapter);
+   return 0;
+   default:
+   return -EOPNOTSUPP;
+   }
+}
+
 static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
void *type_data)
 {
@@ -2504,6 +2568,8 @@ static int igb_setup_tc(struct net_device *dev, enum 
tc_setup_type type,
switch (type) {
case TC_SETUP_QDISC_CBS:
return igb_offload_cbs(adapter, type_data);
+   case TC_SETUP_BLOCK:
+   return igb_setup_tc_block(adapter, type_data);
 
default:
return -EOPNOTSUPP;
-- 
2.16.2



[next-queue PATCH 1/8] igb: Fix not adding filter elements to the list

2018-02-23 Thread Vinicius Costa Gomes
Because the order of the parameters passed to 'hlist_add_behind()' was
inverted, the 'parent' node was added "behind" the 'input' node. As
'input' is not yet in the list, this causes the 'input' node to be lost.

Fixes: 0e71def25281 ("igb: add support of RX network flow classification")
Signed-off-by: Vinicius Costa Gomes 
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 606e6761758f..143f0bb34e4d 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2864,7 +2864,7 @@ static int igb_update_ethtool_nfc_entry(struct 
igb_adapter *adapter,
 
/* add filter to the list */
if (parent)
-   hlist_add_behind(&parent->nfc_node, &input->nfc_node);
+   hlist_add_behind(&input->nfc_node, &parent->nfc_node);
else
hlist_add_head(&input->nfc_node, &adapter->nfc_filter_list);
 
-- 
2.16.2



Re: Fwd: Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest

2018-02-23 Thread Matthew Wilcox
On Fri, Feb 23, 2018 Randy Dunlap wrote:
> [add Matthew Wilcox; hopefully he can look/see]

Thanks, Randy.  I don't understand why nobody else thought to cc the
author of the patch that it was bisected to ...

> On 02/23/2018 04:13 PM, Cong Wang wrote:
> > On Fri, Feb 23, 2018 at 3:27 PM, Cong Wang 
> wrote:
> >> On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap 
> wrote:
> >>> On 02/23/2018 08:05 AM, Khalid Aziz wrote:
>  Same selftest does not cause panic on 4.15. git bisect pointed to
> commit 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based IDRs
> more efficient").
>  Kernel config is attached.
> >>
> >> Looks like something horribly wrong with u32 key id idr...
> >
> > Adding a few printk's, I got:
> >
> > [   31.231560] requested handle = ffe0
> > [   31.232426] allocated handle = 0
> > ...
> > [   31.246475] requested handle = ffd0
> > [   31.247555] allocated handle = 1
> >
> >
> > So the bug is here where we can't allocate a specific handle:
> >
> > err = idr_alloc_u32(&tp_c->handle_idr, ht,
> &handle,
> > handle, GFP_KERNEL);
> > if (err) {
> > kfree(ht);
> > return err;
> > }

Please try this patch.  It fixes the ffe0 case, but there may be
other cases in the tests that it does not cover.

Chris Mi, what happened to that set of testcases you promised to write
for me?

diff --git a/lib/idr.c b/lib/idr.c
index c98d77fcf393..10d9b8d47c33 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -36,8 +36,8 @@ int idr_alloc_u32(struct idr *idr, void *ptr, u32 *nextid,
 {
struct radix_tree_iter iter;
void __rcu **slot;
-   int base = idr->idr_base;
-   int id = *nextid;
+   unsigned int base = idr->idr_base;
+   unsigned int id = *nextid;
 
if (WARN_ON_ONCE(radix_tree_is_internal_node(ptr)))
return -EINVAL;


Re: [PATCH V7 2/4] sctp: Add ip option support

2018-02-23 Thread Marcelo Ricardo Leitner
On Fri, Feb 23, 2018 at 11:11:50AM -0500, Paul Moore wrote:
> On Thu, Feb 22, 2018 at 9:40 PM, Marcelo Ricardo Leitner
>  wrote:
> > On Thu, Feb 22, 2018 at 06:08:05PM -0500, Paul Moore wrote:
> >> On Wed, Feb 21, 2018 at 3:45 PM, Paul Moore  wrote:
> >> > On February 21, 2018 9:33:51 AM Marcelo Ricardo Leitner 
> >> >  wrote:
> >> >> On Tue, Feb 20, 2018 at 07:15:27PM +, Richard Haines wrote:
> >> >>> Add ip option support to allow LSM security modules to utilise 
> >> >>> CIPSO/IPv4
> >> >>> and CALIPSO/IPv6 services.
> >> >>>
> >> >>> Signed-off-by: Richard Haines 
> >> >>
> >> >> LGTM too, thanks!
> >> >>
> >> >> Acked-by: Marcelo Ricardo Leitner 
> >> >
> >> > I agree, thanks everyone for all the work, review, and patience behind 
> >> > this patchset!  I'll work on merging this into selinux/next and I'll 
> >> > send a note when it's done.
> >>
> >> I just merged the four patches (1,3,4 from the v6 patchset, 2 from the
> >> v7 patchset) in selinux/next and did a quick sanity test on the kernel
> >> (booted, no basic SELinux regressions).  Additional testing help is
> >> always appreciated ...
> >
> > I'll try it early next week.
> >
> > Any ideas on when this is going to appear on Dave's net-next tree?
> > We have a lot of SCTP changes to be posted on this cycle and would be
> > nice if we could avoid merge conflicts.
> 
> It's merged into the SELinux tree, next branch; see the links below.
> Last I checked DaveM doesn't pull the selinux/next into his net-next
> tree (that would be a little funny for historical reasons).
> 
> Any idea on how bad the merge conflicts are?

I know about 5 patchsets that we are cooking. For 4 of them I think it
would be mostly fine, perhaps one conflict here and there. But the
other one is a refactoring on MTU handling and it touches lots of
places that 92c49e12646e4 ("sctp: Add ip option support") also
touched, like in the chunk below:

+++ b/include/net/sctp/sctp.h
@@ -441,9 +441,11 @@ static inline int sctp_list_single_entry(struct list_head 
*head)
 static inline int sctp_frag_point(const struct sctp_association *asoc, int 
pmtu)
 {
struct sctp_sock *sp = sctp_sk(asoc->base.sk);
+   struct sctp_af *af = sp->pf->af;
int frag = pmtu;
 
-   frag -= sp->pf->af->net_header_len;
+   frag -= af->ip_options_len(asoc->base.sk);
+   frag -= af->net_header_len;

In the refactor I'm removing this function from here and adding a
similar, not quite the same but similar, in a .c file. 

I post the mtu patchset as RFC next week so we can know better.

  Marcelo

> 
> >> * git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git
> >> * https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git
> 
> -- 
> paul moore
> www.paul-moore.com


Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest

2018-02-23 Thread Randy Dunlap
[add Matthew Wilcox; hopefully he can look/see]

On 02/23/2018 04:13 PM, Cong Wang wrote:
> On Fri, Feb 23, 2018 at 3:27 PM, Cong Wang  wrote:
>> On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap  wrote:
>>> [adding netdev]
>>>
>>> On 02/23/2018 08:05 AM, Khalid Aziz wrote:
 I am seeing a kernel panic with 4.16-rc1 and 4.16-rc2 kernels when running 
 selftests
 from tools/testing/selftests. Last messages from selftest before kernel 
 panic are:

>> ...
 Same selftest does not cause panic on 4.15. git bisect pointed to commit 
 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based IDRs more 
 efficient").
 Kernel config is attached.
>>
>> Looks like something horribly wrong with u32 key id idr...
> 
> Adding a few printk's, I got:
> 
> [   31.231560] requested handle = ffe0
> [   31.232426] allocated handle = 0
> ...
> [   31.246475] requested handle = ffd0
> [   31.247555] allocated handle = 1
> 
> 
> So the bug is here where we can't allocate a specific handle:
> 
> err = idr_alloc_u32(&tp_c->handle_idr, ht, &handle,
> handle, GFP_KERNEL);
> if (err) {
> kfree(ht);
> return err;
> }
> 


-- 
~Randy


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Siwei Liu
On Fri, Feb 23, 2018 at 2:38 PM, Jiri Pirko  wrote:
> Fri, Feb 23, 2018 at 11:22:36PM CET, losewe...@gmail.com wrote:
>
> [...]
>

 No, that's not what I was talking about of course. I thought you
 mentioned the upgrade scenario this patch would like to address is to
 use the bypass interface "to take the place of the original virtio,
 and get udev to rename the bypass to what the original virtio_net
 was". That is one of the possible upgrade paths for sure. However the
 upgrade path I was seeking is to use the bypass interface to take the
 place of original VF interface while retaining the name and network
 configs, which generally can be done simply with kernel upgrade. It
 would become limiting as this patch makes the bypass interface share
 the same virtio pci device with virito backup. Can this bypass
 interface be made general to take place of any pci device other than
 virtio-net? This will be more helpful as the cloud users who has
 existing setup on VF interface don't have to recreate it on virtio-net
 and VF separately again.
>
> How that could work? If you have the VF netdev with all configuration
> including IPs and routes and whatever - now you want to do migration
> so you add virtio_net and do some weird in-driver bonding with it. But
> then, VF disappears and the VF netdev with that and also all
> configuration it had.
> I don't think this scenario is valid.

We are talking about making udev aware of the new virtio-bypass to
rebind the name of the old VF interface with supposedly virtio-bypass
*post the kernel upgrade*. Of course, this needs virtio-net backend to
supply the [bdf] info where the VF/PT device was located.

-Siwei


>
>
>>>
>>>
>>> Yes. This sounds interesting. Looks like you want an existing VM image with
>>> VF only configuration to get transparent live migration support by adding
>>> virtio_net with BACKUP feature.  We may need another feature bit to switch
>>> between these 2 options.
>>
>>Yes, that's what I was thinking about. I have been building something
>>like this before, and would like to get back after merging with your
>>patch.


Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest

2018-02-23 Thread Cong Wang
On Fri, Feb 23, 2018 at 3:27 PM, Cong Wang  wrote:
> On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap  wrote:
>> [adding netdev]
>>
>> On 02/23/2018 08:05 AM, Khalid Aziz wrote:
>>> I am seeing a kernel panic with 4.16-rc1 and 4.16-rc2 kernels when running 
>>> selftests
>>> from tools/testing/selftests. Last messages from selftest before kernel 
>>> panic are:
>>>
> ...
>>> Same selftest does not cause panic on 4.15. git bisect pointed to commit 
>>> 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based IDRs more 
>>> efficient").
>>> Kernel config is attached.
>
> Looks like something horribly wrong with u32 key id idr...

Adding a few printk's, I got:

[   31.231560] requested handle = ffe0
[   31.232426] allocated handle = 0
...
[   31.246475] requested handle = ffd0
[   31.247555] allocated handle = 1


So the bug is here where we can't allocate a specific handle:

err = idr_alloc_u32(&tp_c->handle_idr, ht, &handle,
handle, GFP_KERNEL);
if (err) {
kfree(ht);
return err;
}


[PATCH bpf-next 2/6] bpf, x64: save several bytes by using mov over movabsq when possible

2018-02-23 Thread Daniel Borkmann
While analyzing some of the more complex BPF programs from Cilium,
I found that LLVM generally prefers to emit LD_IMM64 instead of MOV32
BPF instructions for loading unsigned 32-bit immediates into a
register. Given we cannot change the current/stable LLVM versions
that are already out there, let's optimize this case such that the
JIT prefers to emit 'mov %eax, imm32' over 'movabsq %rax, imm64'
whenever suitable in order to reduce the image size by 4-5 bytes per
such load in the typical case, reducing image size on some of the
bigger programs by up to 4%. emit_mov_imm32() and emit_mov_imm64()
have been added as helpers.

Signed-off-by: Daniel Borkmann 
---
 arch/x86/net/bpf_jit_comp.c | 125 ++--
 1 file changed, 74 insertions(+), 51 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4bc36bd..f3e5cd8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -60,7 +60,12 @@ static bool is_imm8(int value)
 
 static bool is_simm32(s64 value)
 {
-   return value == (s64) (s32) value;
+   return value == (s64)(s32)value;
+}
+
+static bool is_uimm32(u64 value)
+{
+   return value == (u64)(u32)value;
 }
 
 /* mov dst, src */
@@ -355,6 +360,68 @@ static void emit_load_skb_data_hlen(u8 **pprog)
*pprog = prog;
 }
 
+static void emit_mov_imm32(u8 **pprog, bool sign_propagate,
+  u32 dst_reg, const u32 imm32)
+{
+   u8 *prog = *pprog;
+   u8 b1, b2, b3;
+   int cnt = 0;
+
+   /* optimization: if imm32 is positive, use 'mov %eax, imm32'
+* (which zero-extends imm32) to save 2 bytes.
+*/
+   if (sign_propagate && (s32)imm32 < 0) {
+   /* 'mov %rax, imm32' sign extends imm32 */
+   b1 = add_1mod(0x48, dst_reg);
+   b2 = 0xC7;
+   b3 = 0xC0;
+   EMIT3_off32(b1, b2, add_1reg(b3, dst_reg), imm32);
+   goto done;
+   }
+
+   /* optimization: if imm32 is zero, use 'xor %eax, %eax'
+* to save 3 bytes.
+*/
+   if (imm32 == 0) {
+   if (is_ereg(dst_reg))
+   EMIT1(add_2mod(0x40, dst_reg, dst_reg));
+   b2 = 0x31; /* xor */
+   b3 = 0xC0;
+   EMIT2(b2, add_2reg(b3, dst_reg, dst_reg));
+   goto done;
+   }
+
+   /* mov %eax, imm32 */
+   if (is_ereg(dst_reg))
+   EMIT1(add_1mod(0x40, dst_reg));
+   EMIT1_off32(add_1reg(0xB8, dst_reg), imm32);
+done:
+   *pprog = prog;
+}
+
+static void emit_mov_imm64(u8 **pprog, u32 dst_reg,
+  const u32 imm32_hi, const u32 imm32_lo)
+{
+   u8 *prog = *pprog;
+   int cnt = 0;
+
+   if (is_uimm32(((u64)imm32_hi << 32) | (u32)imm32_lo)) {
+   /* For emitting plain u32, where sign bit must not be
+* propagated LLVM tends to load imm64 over mov32
+* directly, so save couple of bytes by just doing
+* 'mov %eax, imm32' instead.
+*/
+   emit_mov_imm32(&prog, false, dst_reg, imm32_lo);
+   } else {
+   /* movabsq %rax, imm64 */
+   EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
+   EMIT(imm32_lo, 4);
+   EMIT(imm32_hi, 4);
+   }
+
+   *pprog = prog;
+}
+
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
  int oldproglen, struct jit_context *ctx)
 {
@@ -377,7 +444,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 
*image,
const s32 imm32 = insn->imm;
u32 dst_reg = insn->dst_reg;
u32 src_reg = insn->src_reg;
-   u8 b1 = 0, b2 = 0, b3 = 0;
+   u8 b2 = 0, b3 = 0;
s64 jmp_offset;
u8 jmp_cond;
bool reload_skb_data;
@@ -485,58 +552,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
break;
 
case BPF_ALU64 | BPF_MOV | BPF_K:
-   /* optimization: if imm32 is positive,
-* use 'mov eax, imm32' (which zero-extends imm32)
-* to save 2 bytes
-*/
-   if (imm32 < 0) {
-   /* 'mov rax, imm32' sign extends imm32 */
-   b1 = add_1mod(0x48, dst_reg);
-   b2 = 0xC7;
-   b3 = 0xC0;
-   EMIT3_off32(b1, b2, add_1reg(b3, dst_reg), 
imm32);
-   break;
-   }
-
case BPF_ALU | BPF_MOV | BPF_K:
-   /* optimization: if imm32 is zero, use 'xor ,'
-* to save 3 bytes.
-*/
-   if (imm32 == 0) {
-   if (is_ereg(dst_reg))

[PATCH bpf-next 3/6] bpf, x64: save several bytes when mul dest is r0/r3 anyway

2018-02-23 Thread Daniel Borkmann
Instead of unconditionally performing push/pop on rax/rdx
in case of multiplication, we can save a few bytes in case
of dest register being either BPF r0 (rax) or r3 (rdx)
since the result is written in there anyway.

Signed-off-by: Daniel Borkmann 
---
 arch/x86/net/bpf_jit_comp.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index f3e5cd8..9895ca3 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -615,8 +615,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
case BPF_ALU | BPF_MUL | BPF_X:
case BPF_ALU64 | BPF_MUL | BPF_K:
case BPF_ALU64 | BPF_MUL | BPF_X:
-   EMIT1(0x50); /* push rax */
-   EMIT1(0x52); /* push rdx */
+   if (dst_reg != BPF_REG_0)
+   EMIT1(0x50); /* push rax */
+   if (dst_reg != BPF_REG_3)
+   EMIT1(0x52); /* push rdx */
 
/* mov r11, dst_reg */
EMIT_mov(AUX_REG, dst_reg);
@@ -636,14 +638,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
/* mul(q) r11 */
EMIT2(0xF7, add_1reg(0xE0, AUX_REG));
 
-   /* mov r11, rax */
-   EMIT_mov(AUX_REG, BPF_REG_0);
-
-   EMIT1(0x5A); /* pop rdx */
-   EMIT1(0x58); /* pop rax */
-
-   /* mov dst_reg, r11 */
-   EMIT_mov(dst_reg, AUX_REG);
+   if (dst_reg != BPF_REG_3)
+   EMIT1(0x5A); /* pop rdx */
+   if (dst_reg != BPF_REG_0) {
+   /* mov dst_reg, rax */
+   EMIT_mov(dst_reg, BPF_REG_0);
+   EMIT1(0x58); /* pop rax */
+   }
break;
 
/* shifts */
-- 
2.9.5



[PATCH bpf-next 5/6] bpf, x64: save 5 bytes in prologue when ebpf insns came from cbpf

2018-02-23 Thread Daniel Borkmann
While it's rather cumbersome to reduce the prologue for cBPF->eBPF
migrations with regard to spill/fill of r15, which is a callee-saved
register, due to the bpf_error path in bpf_jit.S that is used by
migrations as well as by native eBPF, we can still trivially save 5
bytes in the prologue for the former since tail calls can never be
used there. cBPF->eBPF migrations also have their own custom prologue
in BPF asm that xors the A and X regs anyway, so it's fine to skip
this here.

Signed-off-by: Daniel Borkmann 
---
 arch/x86/net/bpf_jit_comp.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 5b8fc13..70f9748 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -216,7 +216,7 @@ struct jit_context {
 /* emit x64 prologue code for BPF program and check it's size.
  * bpf_tail_call helper will skip it while jumping into another program
  */
-static void emit_prologue(u8 **pprog, u32 stack_depth)
+static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
 {
u8 *prog = *pprog;
int cnt = 0;
@@ -251,18 +251,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth)
/* mov qword ptr [rbp+24],r15 */
EMIT4(0x4C, 0x89, 0x7D, 24);
 
-   /* Clear the tail call counter (tail_call_cnt): for eBPF tail calls
-* we need to reset the counter to 0. It's done in two instructions,
-* resetting rax register to 0 (xor on eax gets 0 extended), and
-* moving it to the counter location.
-*/
+   if (!ebpf_from_cbpf) {
+   /* Clear the tail call counter (tail_call_cnt): for eBPF tail
+* calls we need to reset the counter to 0. It's done in two
+* instructions, resetting rax register to 0, and moving it
+* to the counter location.
+*/
 
-   /* xor eax, eax */
-   EMIT2(0x31, 0xc0);
-   /* mov qword ptr [rbp+32], rax */
-   EMIT4(0x48, 0x89, 0x45, 32);
+   /* xor eax, eax */
+   EMIT2(0x31, 0xc0);
+   /* mov qword ptr [rbp+32], rax */
+   EMIT4(0x48, 0x89, 0x45, 32);
+
+   BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
+   }
 
-   BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
*pprog = prog;
 }
 
@@ -453,7 +456,8 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 
*image,
int proglen = 0;
u8 *prog = temp;
 
-   emit_prologue(&prog, bpf_prog->aux->stack_depth);
+   emit_prologue(&prog, bpf_prog->aux->stack_depth,
+ bpf_prog_was_classic(bpf_prog));
 
if (seen_ld_abs)
emit_load_skb_data_hlen(&prog);
-- 
2.9.5



[PATCH bpf-next 6/6] bpf: add various jit test cases

2018-02-23 Thread Daniel Borkmann
Add a few test cases that check the run-time results under JIT.

Signed-off-by: Daniel Borkmann 
---
 tools/testing/selftests/bpf/test_verifier.c | 89 +
 1 file changed, 89 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 2971ba2..c987d3a 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -11140,6 +11140,95 @@ static struct bpf_test tests[] = {
.result = REJECT,
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
},
+   {
+   "jit: lsh, rsh, arsh by 1",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_MOV64_IMM(BPF_REG_1, 0xff),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 1),
+   BPF_ALU32_IMM(BPF_LSH, BPF_REG_1, 1),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0x3fc, 1),
+   BPF_EXIT_INSN(),
+   BPF_ALU64_IMM(BPF_RSH, BPF_REG_1, 1),
+   BPF_ALU32_IMM(BPF_RSH, BPF_REG_1, 1),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0xff, 1),
+   BPF_EXIT_INSN(),
+   BPF_ALU64_IMM(BPF_ARSH, BPF_REG_1, 1),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0x7f, 1),
+   BPF_EXIT_INSN(),
+   BPF_MOV64_IMM(BPF_REG_0, 2),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .retval = 2,
+   },
+   {
+   "jit: mov32 for ldimm64, 1",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 2),
+   BPF_LD_IMM64(BPF_REG_1, 0xfeffULL),
+   BPF_ALU64_IMM(BPF_RSH, BPF_REG_1, 32),
+   BPF_LD_IMM64(BPF_REG_2, 0xfeffULL),
+   BPF_JMP_REG(BPF_JEQ, BPF_REG_1, BPF_REG_2, 1),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .retval = 2,
+   },
+   {
+   "jit: mov32 for ldimm64, 2",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_LD_IMM64(BPF_REG_1, 0x1ULL),
+   BPF_LD_IMM64(BPF_REG_2, 0xULL),
+   BPF_JMP_REG(BPF_JEQ, BPF_REG_1, BPF_REG_2, 1),
+   BPF_MOV64_IMM(BPF_REG_0, 2),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .retval = 2,
+   },
+   {
+   "jit: various mul tests",
+   .insns = {
+   BPF_LD_IMM64(BPF_REG_2, 0xeeff0d413122ULL),
+   BPF_LD_IMM64(BPF_REG_0, 0xfefefeULL),
+   BPF_LD_IMM64(BPF_REG_1, 0xefefefULL),
+   BPF_ALU64_REG(BPF_MUL, BPF_REG_0, BPF_REG_1),
+   BPF_JMP_REG(BPF_JEQ, BPF_REG_0, BPF_REG_2, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   BPF_LD_IMM64(BPF_REG_3, 0xfefefeULL),
+   BPF_ALU64_REG(BPF_MUL, BPF_REG_3, BPF_REG_1),
+   BPF_JMP_REG(BPF_JEQ, BPF_REG_3, BPF_REG_2, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   BPF_MOV32_REG(BPF_REG_2, BPF_REG_2),
+   BPF_LD_IMM64(BPF_REG_0, 0xfefefeULL),
+   BPF_ALU32_REG(BPF_MUL, BPF_REG_0, BPF_REG_1),
+   BPF_JMP_REG(BPF_JEQ, BPF_REG_0, BPF_REG_2, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   BPF_LD_IMM64(BPF_REG_3, 0xfefefeULL),
+   BPF_ALU32_REG(BPF_MUL, BPF_REG_3, BPF_REG_1),
+   BPF_JMP_REG(BPF_JEQ, BPF_REG_3, BPF_REG_2, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   BPF_LD_IMM64(BPF_REG_0, 0x952a7bbcULL),
+   BPF_LD_IMM64(BPF_REG_1, 0xfefefeULL),
+   BPF_LD_IMM64(BPF_REG_2, 0xeeff0d413122ULL),
+   BPF_ALU32_REG(BPF_MUL, BPF_REG_2, BPF_REG_1),
+   BPF_JMP_REG(BPF_JEQ, BPF_REG_2, BPF_REG_0, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   BPF_MOV64_IMM(BPF_REG_0, 2),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .retval = 2,
+   },
+
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)
-- 
2.9.5



[PATCH bpf-next 4/6] bpf, x64: save few bytes when mul is in alu32

2018-02-23 Thread Daniel Borkmann
Add a generic emit_mov_reg() helper and reuse it in BPF
multiplication for loading the src into rax; we can save a few bytes
in alu32 while doing so.

Signed-off-by: Daniel Borkmann 
---
 arch/x86/net/bpf_jit_comp.c | 43 ---
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 9895ca3..5b8fc13 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -422,6 +422,24 @@ static void emit_mov_imm64(u8 **pprog, u32 dst_reg,
*pprog = prog;
 }
 
+static void emit_mov_reg(u8 **pprog, bool is64, u32 dst_reg, u32 src_reg)
+{
+   u8 *prog = *pprog;
+   int cnt = 0;
+
+   if (is64) {
+   /* mov dst, src */
+   EMIT_mov(dst_reg, src_reg);
+   } else {
+   /* mov32 dst, src */
+   if (is_ereg(dst_reg) || is_ereg(src_reg))
+   EMIT1(add_2mod(0x40, dst_reg, src_reg));
+   EMIT2(0x89, add_2reg(0xC0, dst_reg, src_reg));
+   }
+
+   *pprog = prog;
+}
+
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
  int oldproglen, struct jit_context *ctx)
 {
@@ -480,16 +498,11 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
EMIT2(b2, add_2reg(0xC0, dst_reg, src_reg));
break;
 
-   /* mov dst, src */
case BPF_ALU64 | BPF_MOV | BPF_X:
-   EMIT_mov(dst_reg, src_reg);
-   break;
-
-   /* mov32 dst, src */
case BPF_ALU | BPF_MOV | BPF_X:
-   if (is_ereg(dst_reg) || is_ereg(src_reg))
-   EMIT1(add_2mod(0x40, dst_reg, src_reg));
-   EMIT2(0x89, add_2reg(0xC0, dst_reg, src_reg));
+   emit_mov_reg(&prog,
+BPF_CLASS(insn->code) == BPF_ALU64,
+dst_reg, src_reg);
break;
 
/* neg dst */
@@ -615,6 +628,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
case BPF_ALU | BPF_MUL | BPF_X:
case BPF_ALU64 | BPF_MUL | BPF_K:
case BPF_ALU64 | BPF_MUL | BPF_X:
+   {
+   bool is64 = BPF_CLASS(insn->code) == BPF_ALU64;
+
if (dst_reg != BPF_REG_0)
EMIT1(0x50); /* push rax */
if (dst_reg != BPF_REG_3)
@@ -624,14 +640,11 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
EMIT_mov(AUX_REG, dst_reg);
 
if (BPF_SRC(insn->code) == BPF_X)
-   /* mov rax, src_reg */
-   EMIT_mov(BPF_REG_0, src_reg);
+   emit_mov_reg(&prog, is64, BPF_REG_0, src_reg);
else
-   /* mov rax, imm32 */
-   emit_mov_imm32(&prog, true,
-  BPF_REG_0, imm32);
+   emit_mov_imm32(&prog, is64, BPF_REG_0, imm32);
 
-   if (BPF_CLASS(insn->code) == BPF_ALU64)
+   if (is64)
EMIT1(add_1mod(0x48, AUX_REG));
else if (is_ereg(AUX_REG))
EMIT1(add_1mod(0x40, AUX_REG));
@@ -646,7 +659,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
EMIT1(0x58); /* pop rax */
}
break;
-
+   }
/* shifts */
case BPF_ALU | BPF_LSH | BPF_K:
case BPF_ALU | BPF_RSH | BPF_K:
-- 
2.9.5



[PATCH bpf-next 0/6] Few x64 jit improvements to shrink image size

2018-02-23 Thread Daniel Borkmann
Couple of minor improvements to the x64 JIT I had still around from
pre merge window in order to shrink the image size further. Added
test cases for kselftests too as well as running Cilium workloads on
them w/o issues.

Thanks!

Daniel Borkmann (6):
  bpf, x64: save one byte per shl/shr/sar when imm is 1
  bpf, x64: save several bytes by using mov over movabsq when possible
  bpf, x64: save several bytes when mul dest is r0/r3 anyway
  bpf, x64: save few bytes when mul is in alu32
  bpf, x64: save 5 bytes in prologue when ebpf insns came from cbpf
  bpf: add various jit test cases

 arch/x86/net/bpf_jit_comp.c | 219 +---
 tools/testing/selftests/bpf/test_verifier.c |  89 +++
 2 files changed, 221 insertions(+), 87 deletions(-)

-- 
2.9.5



[PATCH bpf-next 1/6] bpf, x64: save one byte per shl/shr/sar when imm is 1

2018-02-23 Thread Daniel Borkmann
When we shift by one, we can use a different encoding where imm
is not explicitly needed, which saves 1 byte per such op.

Signed-off-by: Daniel Borkmann 
---
 arch/x86/net/bpf_jit_comp.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4923d92..4bc36bd 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -640,7 +640,11 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
case BPF_RSH: b3 = 0xE8; break;
case BPF_ARSH: b3 = 0xF8; break;
}
-   EMIT3(0xC1, add_1reg(b3, dst_reg), imm32);
+
+   if (imm32 == 1)
+   EMIT2(0xD1, add_1reg(b3, dst_reg));
+   else
+   EMIT3(0xC1, add_1reg(b3, dst_reg), imm32);
break;
 
case BPF_ALU | BPF_LSH | BPF_X:
-- 
2.9.5



Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Stephen Hemminger
(pruned to reduce thread)

On Wed, 21 Feb 2018 16:17:19 -0800
Alexander Duyck  wrote:

> >>> FWIW two solutions that immediately come to mind is to export "backup"
> >>> as phys_port_name of the backup virtio link and/or assign a name to the
> >>> master like you are doing already.  I think team uses team%d and bond
> >>> uses bond%d, soft naming of master devices seems quite natural in this
> >>> case.  
> >>
> >> I figured I had overlooked something like that.. Thanks for pointing
> >> this out. Okay so I think the phys_port_name approach might resolve
> >> the original issue. If I am reading things correctly what we end up
> >> with is the master showing up as "ens1" for example and the backup
> >> showing up as "ens1nbackup". Am I understanding that right?
> >>
> >> The problem with the team/bond%d approach is that it creates a new
> >> netdevice and so it would require guest configuration changes.
> >>  
> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
> >>> link is quite neat.  
> >>
> >> I agree. For non-"backup" virio_net devices would it be okay for us to
> >> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> >> behavior could be maintained although the function still exists.
> >>  
>  - When the 'active' netdev is unplugged OR not present on a destination
>    system after live migration, the user will see 2 virtio_net netdevs.  
> >>>
> >>> That's necessary and expected, all configuration applies to the master
> >>> so master must exist.  
> >>
> >> With the naming issue resolved this is the only item left outstanding.
> >> This becomes a matter of form vs function.
> >>
> >> The main complaint about the "3 netdev" solution is a bit confusing to
> >> have the 2 netdevs present if the VF isn't there. The idea is that
> >> having the extra "master" netdev there if there isn't really a bond is
> >> a bit ugly.  
> >
> > Is this uglier in terms of user experience rather than
> > functionality? I don't want it dynamically changed between 2-netdev
> > and 3-netdev depending on the presence of VF. That gets back to my
> > original question and suggestion earlier: why not just hide the lower
> > netdevs from udev renaming and such? Which important observability
> > benefits users may get if exposing the lower netdevs?
> >
> > Thanks,
> > -Siwei  
> 
> The only real advantage to a 2 netdev solution is that it looks like
> the netvsc solution, however it doesn't behave like it since there are
> some features like XDP that may not function correctly if they are
> left enabled in the virtio_net interface.
> 
> As far as functionality the advantage of not hiding the lower devices
> is that they are free to be managed. The problem with pushing all of
> the configuration into the upper device is that you are limited to the
> intersection of the features of the lower devices. This can be
> limiting for some setups as some VFs support things like more queues,
> or better interrupt moderation options than others so trying to make
> everything work with one config would be ugly.
> 


Let's not make XDP the blocker for doing the best solution
from the end user point of view. XDP is just yet another offload
thing which needs to be handled.  The current backup device solution
used in netvsc doesn't handle the full range of offload options
(things like flow direction, DCB, etc); no one but the HW vendors
seems to care.


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Stephen Hemminger
On Thu, 22 Feb 2018 13:30:12 -0800
Alexander Duyck  wrote:

> > Again, I undertand your motivation. Yet I don't like your solution.
> > But if the decision is made to do this in-driver bonding. I would like
> > to see it baing done some generic way:
> > 1) share the same "in-driver bonding core" code with netvsc
> >put to net/core.
> > 2) the "in-driver bonding core" will strictly limit the functionality,
> >like active-backup mode only, one vf, one backup, vf netdev type
> >check (so noone could enslave a tap or anything else)
> > If user would need something more, he should employ team/bond.  

Sharing would be good, but the netvsc world would really like to have
only one visible network device.


Re: [PATCH iproute2-next v3 2/8] iplink: Correctly report error when network device isn't found

2018-02-23 Thread David Ahern
On 2/22/18 6:02 AM, Serhey Popovych wrote:
> @@ -650,6 +658,9 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
>   bool drv = strcmp(*argv, "xdpdrv") == 0;
>   bool offload = strcmp(*argv, "xdpoffload") == 0;
>  
> + if (offload)
> + has_dev(*dev, dev_index);
> +

I think this is actually the wrong direction. It seems to me argv should be
passed to xdp_parse rather than the generic, drv, offload bools. That
function can then check which option it is, since it has the knowledge
about whether a device is needed or not.


>   NEXT_ARG();
>   if (xdp_parse(&argc, &argv, req, dev_index,
> generic, drv, offload))


Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest

2018-02-23 Thread Cong Wang
On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap  wrote:
> [adding netdev]
>
> On 02/23/2018 08:05 AM, Khalid Aziz wrote:
>> I am seeing a kernel panic with 4.16-rc1 and 4.16-rc2 kernels when running selftests
>> from tools/testing/selftests. Last messages from selftest before kernel panic are:
>>
...
>> Same selftest does not cause panic on 4.15. git bisect pointed to commit
>> 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based IDRs more efficient").
>> Kernel config is attached.

Looks like something is horribly wrong with the u32 key id idr...


RE: [PATCH v2 net-next 1/2] lan743x: Add main source files for new lan743x driver

2018-02-23 Thread Bryan.Whitehead
Hi Florian,
Thanks for your review. I have the following questions/comments.

> On 02/21/2018 11:06 AM, Bryan Whitehead wrote:
> > Add main source files for new lan743x driver.
> >
> > Signed-off-by: Bryan Whitehead 
> > ---
> 
> > +lan743x-objs := lan743x_main.o
> 
> Should we assume that you have additional object files you would like to
> contribute at a later point? If that is the case, this is fine; if this is
> going to be the only file of this driver, consider renaming it so you don't
> even have to have this lan743x-objs line at all.

I am planning to add additional source files later. At the very least there will
be a lan743x_ethtool.o, and a lan743x_ptp.o. I didn't want to have to change
the name of the main file later, so I gave it the expected name now.

> > +
> > +static void lan743x_pci_cleanup(struct lan743x_adapter *adapter) {
> > +   struct lan743x_pci *pci = &adapter->pci;
> > +
> > +   if (pci->init_flags & INIT_FLAG_PCI_REGIONS_REQUESTED) {
> 
> There is a pattern throughout the driver of maintaining flags to track what
> was initialized and what was not. Do you really need that, or can you check
> for specific book keeping private information? Maintaining flags is error
> prone and requires you to keep adding new ones; that does not really scale.

Ok, but it seems to me that what I have is an example of "specific book keeping
private information". Can you clarify the style you prefer?

In cases of allocation where I can just compare a pointer to null, I can
easily remove the flags. But in other cases I need a record of which steps
completed in order to clean up properly. In cases where I need some sort of
a flag, would you prefer I avoid a bit mask and have a standalone variable
for each flag?

[snip]

> > +   dmac_int_en &= ioc_bit;
> > +   dmac_int_sts &= dmac_int_en;
> > +   if (dmac_int_sts & ioc_bit) {
> > +   tasklet_schedule(&tx->tx_isr_bottom_half);
> > +   enable_flag = 0;/* tasklet will re-enable later */
> > +   }
> 
> Consider migrating your TX buffer reclamation to a NAPI context. If you have
> one TX queue and one RX, the same NAPI context can be re-used, if you
> have separate RX/TX queues, you may create a NAPI context per RX/TX pair,
> or you may create separate NAPI contexts per TX queues and RX queues.

This may take some time to refactor, But I will see what I can do.

[snip]


> > +static int lan743x_phy_open(struct lan743x_adapter *adapter) {
> > +   struct lan743x_phy *phy = &adapter->phy;
> > +   struct phy_device *phydev;
> > +   struct net_device *netdev;
> > +   int ret = -EIO;
> > +   u32 mii_adv;
> > +
> > +   netdev = adapter->netdev;
> > +   phydev = phy_find_first(adapter->mdiobus);
> > +   if (!phydev) {
> > +   ret = -EIO;
> > +   goto clean_up;
> > +   }
> > +   ret = phy_connect_direct(netdev, phydev,
> > +lan743x_phy_link_status_change,
> > +PHY_INTERFACE_MODE_GMII);
> > +   if (ret)
> > +   goto clean_up;
> > +   phy->flags |= PHY_FLAG_ATTACHED;
> > +
> > +   /* MAC doesn't support 1000T Half */
> > +   phydev->supported &= ~SUPPORTED_1000baseT_Half;
> > +
> > +   /* support both flow controls */
> > +   phy->fc_request_control = (FLOW_CTRL_RX | FLOW_CTRL_TX);
> > +   phydev->advertising &= ~(ADVERTISED_Pause |
> ADVERTISED_Asym_Pause);
> > +   mii_adv = (u32)mii_advertise_flowctrl(phy->fc_request_control);
> > +   phydev->advertising |= mii_adv_to_ethtool_adv_t(mii_adv);
> > +   phy->fc_autoneg = phydev->autoneg;
> 
> No driver does things like that, that appears to be really wrong.
Can you clarify? What is wrong and how should it be?

[snip]

> > +static int lan743x_dmac_init(struct lan743x_adapter *adapter) {
> > +   struct lan743x_dmac *dmac = &adapter->dmac;
> > +   u32 data = 0;
> > +
> > +   dmac->flags = 0;
> > +   dmac->descriptor_spacing = DEFAULT_DMA_DESCRIPTOR_SPACING;
> > +   lan743x_csr_write(adapter, DMAC_CMD, DMAC_CMD_SWR_);
> > +   lan743x_csr_wait_for_bit(adapter, DMAC_CMD,
> DMAC_CMD_SWR_,
> > +0, 1000, 2, 100);
> > +   switch (dmac->descriptor_spacing) {
> > +   case DMA_DESCRIPTOR_SPACING_16:
> > +   data = DMAC_CFG_MAX_DSPACE_16_;
> > +   break;
> > +   case DMA_DESCRIPTOR_SPACING_32:
> > +   data = DMAC_CFG_MAX_DSPACE_32_;
> > +   break;
> > +   case DMA_DESCRIPTOR_SPACING_64:
> > +   data = DMAC_CFG_MAX_DSPACE_64_;
> > +   break;
> > +   case DMA_DESCRIPTOR_SPACING_128:
> > +   data = DMAC_CFG_MAX_DSPACE_128_;
> > +   break;
> > +   default:
> > +   return -EPERM;
> 
> -EINVAL maybe?
I think -EPERM is more appropriate because it reflects an unresolvable error.
In other words, the platform is incompatible with the device.
Would you prefer I use a preprocessor error instead, such as

#if invalid_setting
#error incompatible setting
#endif

> 
> [snip]
> 
> > +
> > +done:
> > +   

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Jiri Pirko
Fri, Feb 23, 2018 at 11:22:36PM CET, losewe...@gmail.com wrote:

[...]

>>>
>>> No, that's not what I was talking about of course. I thought you
>>> mentioned the upgrade scenario this patch would like to address is to
>>> use the bypass interface "to take the place of the original virtio,
>>> and get udev to rename the bypass to what the original virtio_net
>>> was". That is one of the possible upgrade paths for sure. However the
>>> upgrade path I was seeking is to use the bypass interface to take the
>>> place of the original VF interface while retaining the name and network
>>> configs, which generally can be done simply with a kernel upgrade. It
>>> would become limiting as this patch makes the bypass interface share
>>> the same virtio pci device with the virtio backup. Can this bypass
>>> interface be made general to take the place of any pci device other than
>>> virtio-net? This would be more helpful, as cloud users who have an
>>> existing setup on the VF interface wouldn't have to recreate it on
>>> virtio-net and VF separately again.

How could that work? If you have the VF netdev with all configuration
including IPs and routes and whatever - now you want to do migration,
so you add virtio_net and do some weird in-driver bonding with it. But
then the VF disappears, and with it goes the VF netdev and all the
configuration it had.
I don't think this scenario is valid.


>>
>>
>> Yes. This sounds interesting. Looks like you want an existing VM image with
>> a VF-only configuration to get transparent live migration support by adding
>> virtio_net with the BACKUP feature.  We may need another feature bit to
>> switch between these 2 options.
>
>Yes, that's what I was thinking about. I have been building something
>like this before, and would like to get back after merging with your
>patch.


Re: [PATCH bpf] bpf: allow xadd only on aligned memory

2018-02-23 Thread Alexei Starovoitov
On Fri, Feb 23, 2018 at 10:29:05PM +0100, Daniel Borkmann wrote:
> The requirements around atomic_add() / atomic64_add() resp. their
> JIT implementations differ across architectures. E.g. while x86_64
> seems just fine with BPF's xadd on unaligned memory, on arm64 it
> triggers via interpreter but also JIT the following crash:
> 
>   [  830.864985] Unable to handle kernel paging request at virtual address 8097d7ed6703
>   [...]
>   [  830.916161] Internal error: Oops: 9621 [#1] SMP
>   [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
>   [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
>   [  830.998998] pstate: 8045 (Nzcv daif +PAN -UAO)
>   [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
>   [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
>   [  831.012485] sp : 1ccabc20
>   [  831.015786] x29: 1ccabc20 x28: 8017d56a0f00
>   [  831.021087] x27: 0001 x26: 
>   [  831.026387] x25: 00c168d9db98 x24: 
>   [  831.031686] x23: 08203878 x22: 09488000
>   [  831.036986] x21: 08b14e28 x20: 1ccabcb0
>   [  831.042286] x19: 097b5080 x18: 0a03
>   [  831.047585] x17:  x16: 
>   [  831.052885] x15: aeca8000 x14: 
>   [  831.058184] x13:  x12: 
>   [  831.063484] x11: 0001 x10: 
>   [  831.068783] x9 :  x8 : 
>   [  831.074083] x7 :  x6 : 000580d42800
>   [  831.079383] x5 : 0018 x4 : 
>   [  831.084682] x3 : 1ccabcb0 x2 : 0001
>   [  831.089982] x1 : 8097d7ed6703 x0 : 0001
>   [  831.095282] Process test_verifier (pid: 2788, stack limit = 0x18370044)
>   [  831.102577] Call trace:
>   [  831.105012]  __ll_sc_atomic_add+0x4/0x18
>   [  831.108923]  __bpf_prog_run32+0x4c/0x70
>   [  831.112748]  bpf_test_run+0x78/0xf8
>   [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
>   [  831.120567]  SyS_bpf+0x77c/0x1110
>   [  831.123873]  el0_svc_naked+0x30/0x34
>   [  831.127437] Code: 97fffe97 17ec  f9800031 (885f7c31)
> 
> The reason for this is that memory is required to be aligned. In
> case of BPF, we always enforce alignment in terms of stack access,
> but not when accessing map values or packet data when the underlying
> arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.
> 
> xadd on packet data that is local to us anyway is just wrong, so
> forbid this case entirely. The only place where xadd makes sense in
> fact are map values; xadd on stack is wrong as well, but it's been
> around for much longer. Specifically enforce strict alignment in case
> of xadd, so that we handle this case generically and avoid such crashes
> in the first place.
> 
> Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
> Signed-off-by: Daniel Borkmann 

Nice catch!
Applied to bpf tree, Thanks Daniel.



[PATCH V2 net-next 3/3] selftests/net: reap zerocopy completions passed up as ancillary data.

2018-02-23 Thread Sowmini Varadhan
PF_RDS sockets pass up cookies for zerocopy completion as ancillary
data. Update msg_zerocopy to reap this information.

Signed-off-by: Sowmini Varadhan 
---
v2: receive zerocopy completion notification as POLLIN

 tools/testing/selftests/net/msg_zerocopy.c |   60 
 1 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index eff9cf2..8c466e8 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -344,7 +344,48 @@ static int do_setup_tx(int domain, int type, int protocol)
return fd;
 }
 
-static bool do_recv_completion(int fd)
+static int do_process_zerocopy_cookies(struct rds_zcopy_cookies *ck)
+{
+   int ncookies, i;
+
+   ncookies = ck->num;
+   if (ncookies > RDS_MAX_ZCOOKIES)
+   error(1, 0, "Returned %d cookies, max expected %d\n",
+ ncookies, RDS_MAX_ZCOOKIES);
+   for (i = 0; i < ncookies; i++)
+   if (cfg_verbose >= 2)
+   fprintf(stderr, "%d\n", ck->cookies[i]);
+   return ncookies;
+}
+
+static int do_recvmsg_completion(int fd)
+{
+   struct msghdr msg;
+   char cmsgbuf[256];
+   struct cmsghdr *cmsg;
+   bool ret = false;
+   struct rds_zcopy_cookies *ck;
+
+   memset(&msg, 0, sizeof(msg));
+   msg.msg_control = cmsgbuf;
+   msg.msg_controllen = sizeof(cmsgbuf);
+
+   if (recvmsg(fd, &msg, MSG_DONTWAIT))
+   return ret;
+   for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+   if (cmsg->cmsg_level == SOL_RDS &&
+   cmsg->cmsg_type == RDS_CMSG_ZCOPY_COMPLETION) {
+
+   ck = (struct rds_zcopy_cookies *)CMSG_DATA(cmsg);
+   completions += do_process_zerocopy_cookies(ck);
+   ret = true;
+   break;
+   }
+   }
+   return ret;
+}
+
+static bool do_recv_completion(int fd, int domain)
 {
struct sock_extended_err *serr;
struct msghdr msg = {};
@@ -353,6 +394,9 @@ static bool do_recv_completion(int fd)
int ret, zerocopy;
char control[100];
 
+   if (domain == PF_RDS)
+   return do_recvmsg_completion(fd);
+
msg.msg_control = control;
msg.msg_controllen = sizeof(control);
 
@@ -409,20 +453,20 @@ static bool do_recv_completion(int fd)
 }
 
 /* Read all outstanding messages on the errqueue */
-static void do_recv_completions(int fd)
+static void do_recv_completions(int fd, int domain)
 {
-   while (do_recv_completion(fd)) {}
+   while (do_recv_completion(fd, domain)) {}
 }
 
 /* Wait for all remaining completions on the errqueue */
-static void do_recv_remaining_completions(int fd)
+static void do_recv_remaining_completions(int fd, int domain)
 {
int64_t tstop = gettimeofday_ms() + cfg_waittime_ms;
 
while (completions < expected_completions &&
   gettimeofday_ms() < tstop) {
-   if (do_poll(fd, POLLERR))
-   do_recv_completions(fd);
+   if (do_poll(fd, domain == PF_RDS ? POLLIN : POLLERR))
+   do_recv_completions(fd, domain);
}
 
if (completions < expected_completions)
@@ -503,13 +547,13 @@ static void do_tx(int domain, int type, int protocol)
 
while (!do_poll(fd, POLLOUT)) {
if (cfg_zerocopy)
-   do_recv_completions(fd);
+   do_recv_completions(fd, domain);
}
 
} while (gettimeofday_ms() < tstop);
 
if (cfg_zerocopy)
-   do_recv_remaining_completions(fd);
+   do_recv_remaining_completions(fd, domain);
 
if (close(fd))
error(1, errno, "close");
-- 
1.7.1



[PATCH V2 net-next 1/3] selftests/net: revert the zerocopy Rx path for PF_RDS

2018-02-23 Thread Sowmini Varadhan
In preparation for optimized reception of zerocopy completion,
revert the Rx side changes introduced by Commit dfb8434b0a94
("selftests/net: add zerocopy support for PF_RDS test case")

Signed-off-by: Sowmini Varadhan 
---
v2: prepare to remove sk_error_queue based path; remove recvmsg() as well,
PF_RDS can also use recv() for the usage pattern in msg_zerocopy

 tools/testing/selftests/net/msg_zerocopy.c |   67 
 1 files changed, 0 insertions(+), 67 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index 5cc2a53..eff9cf2 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -344,26 +344,6 @@ static int do_setup_tx(int domain, int type, int protocol)
return fd;
 }
 
-static int do_process_zerocopy_cookies(struct sock_extended_err *serr,
-  uint32_t *ckbuf, size_t nbytes)
-{
-   int ncookies, i;
-
-   if (serr->ee_errno != 0)
-   error(1, 0, "serr: wrong error code: %u", serr->ee_errno);
-   ncookies = serr->ee_data;
-   if (ncookies > SO_EE_ORIGIN_MAX_ZCOOKIES)
-   error(1, 0, "Returned %d cookies, max expected %d\n",
- ncookies, SO_EE_ORIGIN_MAX_ZCOOKIES);
-   if (nbytes != ncookies * sizeof(uint32_t))
-   error(1, 0, "Expected %d cookies, got %ld\n",
- ncookies, nbytes/sizeof(uint32_t));
-   for (i = 0; i < ncookies; i++)
-   if (cfg_verbose >= 2)
-   fprintf(stderr, "%d\n", ckbuf[i]);
-   return ncookies;
-}
-
 static bool do_recv_completion(int fd)
 {
struct sock_extended_err *serr;
@@ -372,17 +352,10 @@ static bool do_recv_completion(int fd)
uint32_t hi, lo, range;
int ret, zerocopy;
char control[100];
-   uint32_t ckbuf[SO_EE_ORIGIN_MAX_ZCOOKIES];
-   struct iovec iov;
 
msg.msg_control = control;
msg.msg_controllen = sizeof(control);
 
-   iov.iov_base = ckbuf;
-   iov.iov_len = (SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(ckbuf[0]));
-   msg.msg_iov = &iov;
-   msg.msg_iovlen = 1;
-
ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
if (ret == -1 && errno == EAGAIN)
return false;
@@ -402,10 +375,6 @@ static bool do_recv_completion(int fd)
 
serr = (void *) CMSG_DATA(cm);
 
-   if (serr->ee_origin == SO_EE_ORIGIN_ZCOOKIE) {
-   completions += do_process_zerocopy_cookies(serr, ckbuf, ret);
-   return true;
-   }
if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
if (serr->ee_errno != 0)
@@ -631,40 +600,6 @@ static void do_flush_datagram(int fd, int type)
bytes += cfg_payload_len;
 }
 
-
-static void do_recvmsg(int fd)
-{
-   int ret, off = 0;
-   char *buf;
-   struct iovec iov;
-   struct msghdr msg;
-   struct sockaddr_storage din;
-
-   buf = calloc(cfg_payload_len, sizeof(char));
-   iov.iov_base = buf;
-   iov.iov_len = cfg_payload_len;
-
-   memset(&msg, 0, sizeof(msg));
-   msg.msg_name = &din;
-   msg.msg_namelen = sizeof(din);
-   msg.msg_iov = &iov;
-   msg.msg_iovlen = 1;
-
-   ret = recvmsg(fd, &msg, MSG_TRUNC);
-
-   if (ret == -1)
-   error(1, errno, "recv");
-   if (ret != cfg_payload_len)
-   error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len);
-
-   if (memcmp(buf + off, payload, ret))
-   error(1, 0, "recv: data mismatch");
-
-   free(buf);
-   packets++;
-   bytes += cfg_payload_len;
-}
-
 static void do_rx(int domain, int type, int protocol)
 {
uint64_t tstop;
@@ -676,8 +611,6 @@ static void do_rx(int domain, int type, int protocol)
do {
if (type == SOCK_STREAM)
do_flush_tcp(fd);
-   else if (domain == PF_RDS)
-   do_recvmsg(fd);
else
do_flush_datagram(fd, type);
 
-- 
1.7.1



[PATCH V2 net-next 2/3] rds: deliver zerocopy completion notification with data

2018-02-23 Thread Sowmini Varadhan
This commit is an optimization of the commit 01883eda72bd
("rds: support for zcopy completion notification") for PF_RDS sockets.

RDS applications are predominantly request-response transactions, so
it is more efficient to reduce the number of system calls and have
zerocopy completion notification delivered as ancillary data on the
POLLIN channel.

Cookies are passed up as ancillary data (at level SOL_RDS) in a
struct rds_zcopy_cookies when the returned value of recvmsg() is
greater than, or equal to, 0. A max of RDS_MAX_ZCOOKIES may be passed
with each message.

This commit removes support for zerocopy completion notification on
MSG_ERRQUEUE for PF_RDS sockets.

Signed-off-by: Sowmini Varadhan 
---
v2: remove the sk_error_queue path; add cautionary checks in
rds_recvmsg_zcookie() and callers to make sure we don't remove cookies
from the queue and then fail to pass them up to the caller

 include/uapi/linux/errqueue.h |2 --
 include/uapi/linux/rds.h  |7 +++
 net/rds/af_rds.c  |7 +--
 net/rds/message.c |   35 +--
 net/rds/rds.h |2 ++
 net/rds/recv.c|   34 +-
 6 files changed, 60 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 28812ed..dc64cfa 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -20,13 +20,11 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP6 3
 #define SO_EE_ORIGIN_TXSTATUS  4
 #define SO_EE_ORIGIN_ZEROCOPY  5
-#define SO_EE_ORIGIN_ZCOOKIE   6
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))
 
 #define SO_EE_CODE_ZEROCOPY_COPIED 1
-#defineSO_EE_ORIGIN_MAX_ZCOOKIES   8
 
 /**
  * struct scm_timestamping - timestamps exposed through cmsg
diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 12e3bca..a66b213 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -104,6 +104,7 @@
 #define RDS_CMSG_MASKED_ATOMIC_CSWP9
 #define RDS_CMSG_RXPATH_LATENCY11
 #defineRDS_CMSG_ZCOPY_COOKIE   12
+#defineRDS_CMSG_ZCOPY_COMPLETION   13
 
 #define RDS_INFO_FIRST 1
 #define RDS_INFO_COUNTERS  1
@@ -317,6 +318,12 @@ struct rds_rdma_notify {
 #define RDS_RDMA_DROPPED   3
 #define RDS_RDMA_OTHER_ERROR   4
 
+#defineRDS_MAX_ZCOOKIES8
+struct rds_zcopy_cookies {
+   __u32 num;
+   __u32 cookies[RDS_MAX_ZCOOKIES];
+};
+
 /*
  * Common set of flags for all RDMA related structs
  */
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index a937f18..f712610 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -77,6 +77,7 @@ static int rds_release(struct socket *sock)
rds_send_drop_to(rs, NULL);
rds_rdma_drop_keys(rs);
rds_notify_queue_get(rs, NULL);
+   __skb_queue_purge(&rs->rs_zcookie_queue);
 
spin_lock_bh(&rds_sock_lock);
list_del_init(&rs->rs_item);
@@ -144,7 +145,7 @@ static int rds_getname(struct socket *sock, struct sockaddr *uaddr,
  *  -  to signal that a previously congested destination may have become
  * uncongested
  *  -  A notification has been queued to the socket (this can be a congestion
- * update, or a RDMA completion).
+ * update, or a RDMA completion, or a MSG_ZEROCOPY completion).
  *
  * EPOLLOUT is asserted if there is room on the send queue. This does not mean
  * however, that the next sendmsg() call will succeed. If the application tries
@@ -178,7 +179,8 @@ static __poll_t rds_poll(struct file *file, struct socket *sock,
spin_unlock(&rs->rs_lock);
}
if (!list_empty(&rs->rs_recv_queue) ||
-   !list_empty(&rs->rs_notify_queue))
+   !list_empty(&rs->rs_notify_queue) ||
+   !skb_queue_empty(&rs->rs_zcookie_queue))
mask |= (EPOLLIN | EPOLLRDNORM);
if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
mask |= (EPOLLOUT | EPOLLWRNORM);
@@ -513,6 +515,7 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol)
INIT_LIST_HEAD(&rs->rs_recv_queue);
INIT_LIST_HEAD(&rs->rs_notify_queue);
INIT_LIST_HEAD(&rs->rs_cong_list);
+   skb_queue_head_init(&rs->rs_zcookie_queue);
spin_lock_init(&rs->rs_rdma_lock);
rs->rs_rdma_keys = RB_ROOT;
rs->rs_rx_traces = 0;
diff --git a/net/rds/message.c b/net/rds/message.c
index 6518345..2e8bdaf 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -58,32 +58,26 @@ void rds_message_addref(struct rds_message *rm)
 
 static inline bool skb_zcookie_add(struct sk_buff *skb, u32 cookie)
 {
-   struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
-   int ncookies;
-   u32 *ptr;
+   struct rds_zcopy_cookies *ck = (struct rds_zcopy_cookies *)skb->cb;
+   int ncookies = c

Re: [RFC PATCH bpf-next 07/12] bpf/verifier: allow bounded loops with JLT/true back-edge

2018-02-23 Thread John Fastabend
On 02/23/2018 09:41 AM, Edward Cree wrote:
> Where the register umin_value is increasing sufficiently fast, the loop
>  will terminate after a reasonable number of iterations, so we can allow
>  the walk to continue.

Continuing to walk the loop is problematic because we hit the complexity
limit. What we want to do is a static analysis step upfront on the basic
block containing the loop to ensure the loop is bounded.

We can do this by finding the induction variables (variables of the form
c*n + d) and ensuring the comparison register is an induction variable that
is also increasing/decreasing correctly with respect to the comparison op.

This seems to be common in compilers to pull computation out of the
loop. So should be doable in the verifier.


> The semantics of prev_insn_idx are changed slightly: it now always refers
>  to the last jump insn walked, even when that jump was not taken.  Also it
>  is moved into env, alongside a new bool last_jump_taken that records
>  whether the jump was taken or not.  This is needed so that, when we detect
>  a loop, we can see how we ended up in the back-edge: what was the jump
>  condition, and is it the true- or the false-branch that ends up looping?
> We record in our explored_states whether they represent a conditional jump
>  and whether that jump produced a good loop bound.  This allows us to make
>  unconditional jumps "not responsible" for ensuring the loop is bounded, as
>  long as there is a conditional jump somewhere in the loop body; whereas a
>  conditional jump must ensure that there is a bounding conditional somewhere
>  in the loop body.
> Recursive calls have to remain forbidden because the calculation of maximum
>  stack depth assumes the call graph is acyclic, even though the loop
>  handling code could determine whether the recursion was bounded.
> 
> Signed-off-by: Edward Cree 
> ---


Re: [RFC PATCH bpf-next 05/12] bpf/verifier: detect loops dynamically rather than statically

2018-02-23 Thread John Fastabend
On 02/23/2018 09:40 AM, Edward Cree wrote:
> Add in a new chain of parent states, which does not cross function-call
>  boundaries, and check whether our current insn_idx appears anywhere in
>  the chain.  Since all jump targets have state-list marks (now placed
>  by prepare_cfg_marks(), which replaces check_cfg()), it suffices to
>  thread this chain through the existing explored_states and check it
>  only in is_state_visited().
> By using this parent-state chain to detect loops, we open up the
>  possibility that in future we could determine a loop to be bounded and
>  thus accept it.
> A few selftests had to be changed to ensure that all the insns walked
>  before reaching the back-edge are valid.
> 

I still believe this needs to be done in an initial pass where we can
build a proper CFG and analyze the loops to ensure we have loops that
can be properly verified. Otherwise we end up trying to handle all the
special cases later like the comment in patch 7/12 'three-jump trick'.
So rather than try to handle all loops only handle "normal" loops. In
an early CFG stage with DOM tree the algorithm for this is already
used by gcc/clang so there is good precedent for this type of work.

Without this we are open to other "clever" goto programs that create
loops that we have to special case. Better to just reject all these
classes of loops because most users will never need them.

See more notes in 7/12 patch, sorry I've been busy and not had a chance
to push code to build cfg and dom tree yet.

Also case analysis in patch description describing types of loops
this handles (either here or in another patch) would help me understand
at least.
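For readers unfamiliar with the gcc/clang approach mentioned above, here is a minimal illustrative sketch (CFG shape and node names invented): compute dominators iteratively over the CFG, then accept an edge u -> v as a loop back-edge only when v dominates u, i.e. only "natural" loops.

```python
# Sketch of classic natural-loop detection via dominators; not kernel code.
def dominators(succ, entry):
    nodes = set(succ)
    dom = {n: set(nodes) for n in nodes}   # start with "dominated by all"
    dom[entry] = {entry}
    pred = {n: [p for p in nodes if n in succ[p]] for n in nodes}
    changed = True
    while changed:                         # iterate to a fixed point
        changed = False
        for n in nodes - {entry}:
            new = set(nodes)
            for p in pred[n]:
                new &= dom[p]              # intersect predecessors' dominators
            new |= {n}                     # a node always dominates itself
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def natural_back_edges(succ, entry):
    dom = dominators(succ, entry)
    # u -> v is a back-edge of a natural loop iff v dominates u
    return [(u, v) for u in succ for v in succ[u] if v in dom[u]]

# entry -> header -> body -> header (a well-formed loop), header -> exit
cfg = {'entry': ['header'], 'header': ['body', 'exit'],
       'body': ['header'], 'exit': []}
assert natural_back_edges(cfg, 'entry') == [('body', 'header')]
```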

> Signed-off-by: Edward Cree 
> ---
>  include/linux/bpf_verifier.h|   2 +
>  kernel/bpf/verifier.c   | 280 
> +---
>  tools/testing/selftests/bpf/test_verifier.c |  12 +-
>  3 files changed, 97 insertions(+), 197 deletions(-)
> 
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 0bc49c768585..24ddbc1ed3b2 100644
>


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Siwei Liu
On Wed, Feb 21, 2018 at 6:35 PM, Samudrala, Sridhar
 wrote:
> On 2/21/2018 5:59 PM, Siwei Liu wrote:
>>
>> On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
>>  wrote:
>>>
>>> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu  wrote:

 I haven't checked emails for days and did not realize the new revision
 had already come out. And thank you for the effort, this revision
 really looks to be a step forward towards our use case and is close to
 what we wanted to do. A few questions in line.

 On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
  wrote:
>
> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
>>
>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>
>>> Patch 2 is in response to the community request for a 3 netdev
>>> solution.  However, it creates some issues we'll get into in a
>>> moment.
>>> It extends virtio_net to use alternate datapath when available and
>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>> an additional 'bypass' netdev that acts as a master device and
>>> controls
>>> 2 slave devices.  The original virtio_net netdev is registered as
>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>> associated with the same 'pci' device.  The user accesses the network
>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active'
>>> netdev
>>> as default for transmits when it is available with link up and
>>> running.
>>
>> Thank you for doing this.
>>
>>> We noticed a couple of issues with this approach during testing.
>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>virtio pci device, udev tries to rename both of them with the same
>>> name
>>>and the 2nd rename will fail. This would be OK as long as the
>>> first netdev
>>>to be renamed is the 'bypass' netdev, but the order in which udev
>>> gets
>>>to rename the 2 netdevs is not reliable.
>>
>> Out of curiosity - why do you link the master netdev to the virtio
>> struct device?
>
> The basic idea of all this is that we wanted this to work with an
> existing VM image that was using virtio. As such we were trying to
> make it so that the bypass interface takes the place of the original
> virtio and get udev to rename the bypass to what the original
> virtio_net was.

 Could it be made possible to take over the config from VF instead
 of virtio on an existing VM image? And get udev to rename the bypass
 netdev to what the original VF was. I don't say tightly binding the
 bypass master to only virtio or VF, but I think we should provide both
 options to support different upgrade paths. Possibly we could tweak
 the device tree layout to reuse the same PCI slot for the master
 bypass netdev, such that udev would not get confused when renaming the
 device. The VF needs to use a different function slot afterwards.
 Perhaps we might need a special multiseat-like QEMU device for that
 purpose?

 Our case we'll upgrade the config from VF to virtio-bypass directly.
>>>
>>> So if I am understanding what you are saying you are wanting to flip
>>> the backup interface from the virtio to a VF. The problem is that
>>> becomes a bit of a vendor lock-in solution since it would rely on a
>>> specific VF driver. I would agree with Jiri that we don't want to go
>>> down that path. We don't want every VF out there firing up its own
>>> separate bond. Ideally you want the hypervisor to be able to manage
>>> all of this which is why it makes sense to have virtio manage this and
>>> why this is associated with the virtio_net interface.
>>
>> No, that's not what I was talking about of course. I thought you
>> mentioned the upgrade scenario this patch would like to address is to
>> use the bypass interface "to take the place of the original virtio,
>> and get udev to rename the bypass to what the original virtio_net
>> was". That is one of the possible upgrade paths for sure. However the
>> upgrade path I was seeking is to use the bypass interface to take the
>> place of original VF interface while retaining the name and network
>> configs, which generally can be done simply with kernel upgrade. It
>> would become limiting as this patch makes the bypass interface share
>> the same virtio pci device with virtio backup. Can this bypass
>> interface be made general to take the place of any pci device other than
>> virtio-net? This will be more helpful as the cloud users who have
>> existing setup on VF interface don't have to recreate it on virtio-net
>> and VF separately again.
>
>
> Yes. This sounds interesting. Looks like you want an existing VM image with
> VF only configuration to get transparent live migration support by adding
> virtio_

Re: [v3,net-next,2/2] tls: Use correct sk->sk_prot for IPV6

2018-02-23 Thread Guenter Roeck
Hi Ilya,

On Mon, Sep 04, 2017 at 01:14:01PM +0300, Ilya Lesokhin wrote:
> The tls ulp overrides sk->sk_prot with new tls-specific proto structs.
> The tls-specific structs were previously based on the ipv4-specific
> tcp_prot struct.
> As a result, attaching the tls ulp to an ipv6 tcp socket replaced
> some ipv6 callbacks with the ipv4 equivalents.
> 
> This patch adds ipv6 tls proto structs and uses them when
> attached to ipv6 sockets.
> 

Do you plan to update this patch with the missing TCPv6 support ?
As far as I can see, the part that was accepted upstream does not fix
the TCPv6 protocol issue which triggers CVE-2018-5703.

If adding IPv6 support is not possible/acceptable, would it make sense
to limit TLS support to TCPv4, ie add something like

if (sk->sk_prot != &tcp_prot)
return -EINVAL;

to tls_init() ?

Thanks,
Guenter

> Fixes: 3c4d7559159b ('tls: kernel TLS support')
> Signed-off-by: Boris Pismenny 
> Signed-off-by: Ilya Lesokhin 
> ---
>  net/tls/Kconfig|  1 +
>  net/tls/tls_main.c | 50 ++
>  2 files changed, 39 insertions(+), 12 deletions(-)
> 
> diff --git a/net/tls/Kconfig b/net/tls/Kconfig
> index eb58303..7e9cf8b 100644
> --- a/net/tls/Kconfig
> +++ b/net/tls/Kconfig
> @@ -7,6 +7,7 @@ config TLS
>   select CRYPTO
>   select CRYPTO_AES
>   select CRYPTO_GCM
> + select IPV6
>   default n
>   ---help---
>   Enable kernel support for TLS protocol. This allows symmetric
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index 60aff60..33c499e 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -40,13 +40,25 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  MODULE_AUTHOR("Mellanox Technologies");
>  MODULE_DESCRIPTION("Transport Layer Security Support");
>  MODULE_LICENSE("Dual BSD/GPL");
>  
> -static struct proto tls_base_prot;
> -static struct proto tls_sw_prot;
> +enum {
> + TLSV4,
> + TLSV6,
> + TLS_NUM_PROTS,
> +};
> +
> +enum {
> + TLS_BASE_TX,
> + TLS_SW_TX,
> + TLS_NUM_CONFIG,
> +};
> +
> +static struct proto tls_prots[TLS_NUM_PROTS][TLS_NUM_CONFIG];
>  
>  int wait_on_pending_writer(struct sock *sk, long *timeo)
>  {
> @@ -342,6 +354,7 @@ static int do_tls_setsockopt_tx(struct sock *sk, char 
> __user *optval,
>   struct tls_context *ctx = tls_get_ctx(sk);
>   struct proto *prot = NULL;
>   int rc = 0;
> + int ip_ver = sk->sk_family == AF_INET6 ? TLSV6 : TLSV4;
>  
>   if (!optval || (optlen < sizeof(*crypto_info))) {
>   rc = -EINVAL;
> @@ -396,7 +409,7 @@ static int do_tls_setsockopt_tx(struct sock *sk, char 
> __user *optval,
>  
>   /* currently SW is default, we will have ethtool in future */
>   rc = tls_set_sw_offload(sk, ctx);
> - prot = &tls_sw_prot;
> + prot = &tls_prots[ip_ver][TLS_SW_TX];
>   if (rc)
>   goto err_crypto_info;
>  
> @@ -443,6 +456,12 @@ static int tls_init(struct sock *sk)
>   struct inet_connection_sock *icsk = inet_csk(sk);
>   struct tls_context *ctx;
>   int rc = 0;
> + int ip_ver = TLSV4;
> +
> + if (sk->sk_prot == &tcpv6_prot)
> + ip_ver = TLSV6;
> + else if (sk->sk_prot != &tcp_prot)
> + return -EINVAL;
>  
>   /* allocate tls context */
>   ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> @@ -453,7 +472,8 @@ static int tls_init(struct sock *sk)
>   icsk->icsk_ulp_data = ctx;
>   ctx->setsockopt = sk->sk_prot->setsockopt;
>   ctx->getsockopt = sk->sk_prot->getsockopt;
> - sk->sk_prot = &tls_base_prot;
> +
> + sk->sk_prot = &tls_prots[ip_ver][TLS_BASE_TX];
>  out:
>   return rc;
>  }
> @@ -464,16 +484,22 @@ static int tls_init(struct sock *sk)
>   .init   = tls_init,
>  };
>  
> +static void build_protos(struct proto *prot, struct proto *base)
> +{
> + prot[TLS_BASE_TX] = *base;
> + prot[TLS_BASE_TX].setsockopt = tls_setsockopt;
> + prot[TLS_BASE_TX].getsockopt = tls_getsockopt;
> +
> + prot[TLS_SW_TX] = prot[TLS_BASE_TX];
> + prot[TLS_SW_TX].close   = tls_sk_proto_close;
> + prot[TLS_SW_TX].sendmsg = tls_sw_sendmsg;
> + prot[TLS_SW_TX].sendpage= tls_sw_sendpage;
> +}
> +
>  static int __init tls_register(void)
>  {
> - tls_base_prot   = tcp_prot;
> - tls_base_prot.setsockopt= tls_setsockopt;
> - tls_base_prot.getsockopt= tls_getsockopt;
> -
> - tls_sw_prot = tls_base_prot;
> - tls_sw_prot.sendmsg = tls_sw_sendmsg;
> - tls_sw_prot.sendpage= tls_sw_sendpage;
> - tls_sw_prot.close   = tls_sk_proto_close;
> + build_protos(tls_prots[TLSV4], &tcp_prot);
> + build_protos(tls_prots[TLSV6], &tcpv6_prot);
>  
>   tcp_register_ulp(&tcp_tls_ulp_ops);
>  
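The shape of the fix in the quoted patch — one proto-table slot per (address family, TLS config) pair, so an IPv6 socket never inherits IPv4 callbacks — can be sketched in miniature. Python dicts stand in for the C `struct proto` instances; the callback names mirror the patch, the base-proto contents are invented for illustration:

```python
# Miniature model of tls_prots[TLS_NUM_PROTS][TLS_NUM_CONFIG] from the patch.
TLSV4, TLSV6 = 0, 1
TLS_BASE_TX, TLS_SW_TX = 0, 1

def build_protos(base):
    """Derive the TLS proto variants from a family-specific base proto."""
    base_tx = dict(base, setsockopt='tls_setsockopt',
                   getsockopt='tls_getsockopt')
    sw_tx = dict(base_tx, sendmsg='tls_sw_sendmsg',
                 sendpage='tls_sw_sendpage', close='tls_sk_proto_close')
    return {TLS_BASE_TX: base_tx, TLS_SW_TX: sw_tx}

tls_prots = {
    TLSV4: build_protos({'family': 'AF_INET', 'sendmsg': 'tcp_sendmsg'}),
    TLSV6: build_protos({'family': 'AF_INET6', 'sendmsg': 'tcpv6_sendmsg'}),
}

# The IPv6 slots keep tcpv6 identity instead of falling back to tcp_prot.
assert tls_prots[TLSV6][TLS_BASE_TX]['family'] == 'AF_INET6'
assert tls_prots[TLSV6][TLS_SW_TX]['sendmsg'] == 'tls_sw_sendmsg'
```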


[PATCH net-next] r8169: improve interrupt handling

2018-02-23 Thread Heiner Kallweit
This patch improves a few aspects of interrupt handling:
- update to current interrupt allocation API
  (use pci_alloc_irq_vectors() instead of deprecated pci_enable_msi())
- this implicitly will allocate a MSI-X interrupt if available
- get rid of flag RTL_FEATURE_MSI

Last but not least it enables a feature which was (I presume accidentally)
disabled before. There are members of the RTL8169 family supporting MSI
(e.g. RTL8169SB), however MSI never got enabled because RTL_CFG_0 was
missing flag RTL_FEATURE_MSI.
An indicator for "accidentally" is that statement "cfg2 |= MSIEnable;"
in rtl_try_msi() is dead code. cfg2 is used for chip versions <= 06
only and for all these chip versions RTL_FEATURE_MSI isn't set.

The patch works fine on a RTL8168evl (chip version 34).

The previously accidentally disabled MSI support for the few members of
the old RTL8169 family supporting MSI isn't tested. The RTL8169SB
(chip version 04) I have at least still works in a PCI slot w/o MSI.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 44 
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 96db3283e..4730db990 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -736,8 +736,7 @@ struct ring_info {
 };
 
 enum features {
-   RTL_FEATURE_MSI = (1 << 0),
-   RTL_FEATURE_GMII= (1 << 1),
+   RTL_FEATURE_GMII= (1 << 0),
 };
 
 struct rtl8169_counters {
@@ -7847,7 +7846,7 @@ static int rtl8169_close(struct net_device *dev)
 
cancel_work_sync(&tp->wk.work);
 
-   free_irq(pdev->irq, dev);
+   pci_free_irq(pdev, 0, dev);
 
dma_free_coherent(&pdev->dev, R8169_RX_RING_BYTES, tp->RxDescArray,
  tp->RxPhyAddr);
@@ -7903,9 +7902,8 @@ static int rtl_open(struct net_device *dev)
 
rtl_request_firmware(tp);
 
-   retval = request_irq(pdev->irq, rtl8169_interrupt,
-(tp->features & RTL_FEATURE_MSI) ? 0 : IRQF_SHARED,
-dev->name, dev);
+   retval = pci_request_irq(pdev, 0, rtl8169_interrupt, NULL, dev,
+dev->name);
if (retval < 0)
goto err_release_fw_2;
 
@@ -8253,7 +8251,7 @@ static const struct rtl_cfg_info {
.region = 2,
.align  = 8,
.event_slow = SYSErr | LinkChg | RxOverflow,
-   .features   = RTL_FEATURE_GMII | RTL_FEATURE_MSI,
+   .features   = RTL_FEATURE_GMII,
.coalesce_info  = rtl_coalesce_info_8168_8136,
.default_ver= RTL_GIGA_MAC_VER_11,
},
@@ -8263,32 +8261,32 @@ static const struct rtl_cfg_info {
.align  = 8,
.event_slow = SYSErr | LinkChg | RxOverflow | RxFIFOOver |
  PCSTimeout,
-   .features   = RTL_FEATURE_MSI,
.coalesce_info  = rtl_coalesce_info_8168_8136,
.default_ver= RTL_GIGA_MAC_VER_13,
}
 };
 
 /* Cfg9346_Unlock assumed. */
-static unsigned rtl_try_msi(struct rtl8169_private *tp,
-   const struct rtl_cfg_info *cfg)
+static void rtl_alloc_irq(struct rtl8169_private *tp)
 {
void __iomem *ioaddr = tp->mmio_addr;
-   unsigned msi = 0;
u8 cfg2;
+   int ret;
 
-   cfg2 = RTL_R8(Config2) & ~MSIEnable;
-   if (cfg->features & RTL_FEATURE_MSI) {
-   if (pci_enable_msi(tp->pci_dev)) {
-   netif_info(tp, hw, tp->dev, "no MSI. Back to INTx.\n");
-   } else {
-   cfg2 |= MSIEnable;
-   msi = RTL_FEATURE_MSI;
-   }
+   ret = pci_alloc_irq_vectors(tp->pci_dev, 1, 1, PCI_IRQ_ALL_TYPES);
+   if (ret < 0) {
+   netif_err(tp, drv, tp->dev, "failed to allocate irq!\n");
+   return;
}
-   if (tp->mac_version <= RTL_GIGA_MAC_VER_06)
+
+   if (tp->mac_version <= RTL_GIGA_MAC_VER_06) {
+   RTL_W8(Cfg9346, Cfg9346_Unlock);
+   cfg2 = RTL_R8(Config2) & ~MSIEnable;
+   if (pci_dev_msi_enabled(tp->pci_dev))
+   cfg2 |= MSIEnable;
RTL_W8(Config2, cfg2);
-   return msi;
+   RTL_W8(Cfg9346, Cfg9346_Lock);
+   }
 }
 
 DECLARE_RTL_COND(rtl_link_list_ready_cond)
@@ -8497,9 +8495,7 @@ static int rtl_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
chipset = tp->mac_version;
tp->txd_version = rtl_chip_infos[chipset].txd_version;
 
-   RTL_W8(Cfg9346, Cfg9346_Unlock);
-   tp->features |= rtl_try_msi(tp, cfg);
-   RTL_W8(Cfg9346, Cfg9346_Lock);
+   rtl_alloc_irq(tp);
 
/* override BIOS settings, use userspace tools to enable WOL */

[PATCH bpf] bpf: allow xadd only on aligned memory

2018-02-23 Thread Daniel Borkmann
The requirements around atomic_add() / atomic64_add() and their
JIT implementations differ across architectures. E.g. while x86_64
seems just fine with BPF's xadd on unaligned memory, on arm64 it
triggers the following crash, both via the interpreter and the JIT:

  [  830.864985] Unable to handle kernel paging request at virtual address 
8097d7ed6703
  [...]
  [  830.916161] Internal error: Oops: 9621 [#1] SMP
  [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ 
#8
  [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 
07/17/2017
  [  830.998998] pstate: 8045 (Nzcv daif +PAN -UAO)
  [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
  [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
  [  831.012485] sp : 1ccabc20
  [  831.015786] x29: 1ccabc20 x28: 8017d56a0f00
  [  831.021087] x27: 0001 x26: 
  [  831.026387] x25: 00c168d9db98 x24: 
  [  831.031686] x23: 08203878 x22: 09488000
  [  831.036986] x21: 08b14e28 x20: 1ccabcb0
  [  831.042286] x19: 097b5080 x18: 0a03
  [  831.047585] x17:  x16: 
  [  831.052885] x15: aeca8000 x14: 
  [  831.058184] x13:  x12: 
  [  831.063484] x11: 0001 x10: 
  [  831.068783] x9 :  x8 : 
  [  831.074083] x7 :  x6 : 000580d42800
  [  831.079383] x5 : 0018 x4 : 
  [  831.084682] x3 : 1ccabcb0 x2 : 0001
  [  831.089982] x1 : 8097d7ed6703 x0 : 0001
  [  831.095282] Process test_verifier (pid: 2788, stack limit = 
0x18370044)
  [  831.102577] Call trace:
  [  831.105012]  __ll_sc_atomic_add+0x4/0x18
  [  831.108923]  __bpf_prog_run32+0x4c/0x70
  [  831.112748]  bpf_test_run+0x78/0xf8
  [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
  [  831.120567]  SyS_bpf+0x77c/0x1110
  [  831.123873]  el0_svc_naked+0x30/0x34
  [  831.127437] Code: 97fffe97 17ec  f9800031 (885f7c31)

The reason is that memory is required to be aligned. In
case of BPF, we always enforce alignment in terms of stack access,
but not when accessing map values or packet data when the underlying
arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.

xadd on packet data that is local to us anyway is just wrong, so
forbid this case entirely. The only place where xadd makes sense in
fact are map values; xadd on stack is wrong as well, but it's been
around for much longer. Specifically enforce strict alignment in case
of xadd, so that we handle this case generically and avoid such crashes
in the first place.
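The rule the patch enforces can be modeled very roughly (ignoring variable offsets and register state, which the real check_ptr_alignment() handles): an xadd access is accepted only if its offset is a multiple of the access size, even on architectures that set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.

```python
# Simplified model of the strict-alignment check for BPF_XADD; the real
# verifier also folds in the register's known offset and variable parts.
def xadd_offset_ok(off, size):
    return off % size == 0

assert xadd_offset_ok(0, 8)       # aligned 64-bit xadd: allowed
assert xadd_offset_ok(4, 4)       # aligned 32-bit xadd: allowed
assert not xadd_offset_ok(1, 4)   # misaligned: rejected even on x86_64
assert not xadd_offset_ok(4, 8)   # 64-bit xadd at offset 4: rejected
```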

Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c   | 42 +
 tools/testing/selftests/bpf/test_verifier.c | 58 +
 2 files changed, 84 insertions(+), 16 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5fb69a8..c6eff10 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1356,6 +1356,13 @@ static bool is_ctx_reg(struct bpf_verifier_env *env, int 
regno)
return reg->type == PTR_TO_CTX;
 }
 
+static bool is_pkt_reg(struct bpf_verifier_env *env, int regno)
+{
+   const struct bpf_reg_state *reg = cur_regs(env) + regno;
+
+   return type_is_pkt_pointer(reg->type);
+}
+
 static int check_pkt_ptr_alignment(struct bpf_verifier_env *env,
   const struct bpf_reg_state *reg,
   int off, int size, bool strict)
@@ -1416,10 +1423,10 @@ static int check_generic_ptr_alignment(struct 
bpf_verifier_env *env,
 }
 
 static int check_ptr_alignment(struct bpf_verifier_env *env,
-  const struct bpf_reg_state *reg,
-  int off, int size)
+  const struct bpf_reg_state *reg, int off,
+  int size, bool strict_alignment_once)
 {
-   bool strict = env->strict_alignment;
+   bool strict = env->strict_alignment || strict_alignment_once;
const char *pointer_desc = "";
 
switch (reg->type) {
@@ -1576,9 +1583,9 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, 
int size)
  * if t==write && value_regno==-1, some unknown value is stored into memory
  * if t==read && value_regno==-1, don't care what we read from memory
  */
-static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 
regno, int off,
-   int bpf_size, enum bpf_access_type t,
-   int value_regno)
+static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 
regno,
+   int off, int bpf_size, enum bpf_access_type t,
+   int value_regno,

Re: VRF destination unreachable

2018-02-23 Thread David Ahern
On 2/23/18 10:49 AM, Stephen Suryaputra wrote:
> Greetings,
> 
> We found that ICMP destination unreachable isn't sent if VRF
> forwarding isn't configured, i.e.
> /proc/sys/net/ipv4/conf//forwarding isn't set. The
> relevant code is:
> 
> static int ip_error(struct sk_buff *skb)
> {
> ...
> // in_dev is the vrf net_device
> if (!IN_DEV_FORWARD(in_dev)) {
> switch (rt->dst.error) {
> case EHOSTUNREACH:
> __IP_INC_STATS(net, IPSTATS_MIB_INADDRERRORS);
> break;
> 
> case ENETUNREACH:
> __IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES);
> break;
> }
> goto out;
> }
> ...
> out:kfree_skb(skb);
> return 0;
> }
> 
> The question: is it intended to be set? The basic forwarding seems to
> be working without. We do set it on the slave net devices.

Unintended side effect of VRF as a netdev. This should fix it:

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5ca7415cd48c..d59d005fb7c5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -944,7 +944,7 @@ static int ip_error(struct sk_buff *skb)
goto out;

net = dev_net(rt->dst.dev);
-   if (!IN_DEV_FORWARD(in_dev)) {
+   if (!IN_DEV_FORWARD(in_dev) && !netif_is_l3_master(skb->dev)) {
switch (rt->dst.error) {
case EHOSTUNREACH:
__IP_INC_STATS(net, IPSTATS_MIB_INADDRERRORS);
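The semantics of the one-line fix above can be stated as a tiny predicate (an illustrative model, not kernel code): ip_error() now sends the ICMP error either when forwarding is enabled on the ingress device or when that device is an l3mdev (VRF) master.

```python
# Model of the ip_error() decision after David's fix: a VRF master device
# no longer suppresses destination-unreachable generation.
def should_send_icmp_error(in_dev_forwarding, dev_is_l3_master):
    return in_dev_forwarding or dev_is_l3_master

assert should_send_icmp_error(False, True)    # VRF master, forwarding off
assert not should_send_icmp_error(False, False)  # plain dev, forwarding off
assert should_send_icmp_error(True, False)    # forwarding on, as before
```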


Re: [PATCH] net: Allow mac_pton() to work on non-NULL terminated strings

2018-02-23 Thread Andy Shevchenko
On Fri, 2018-02-23 at 23:41 +0300, Alexey Dobriyan wrote:
> On Fri, Feb 23, 2018 at 09:17:48PM +0100, Stefan Hellermann wrote:
> > @@ -8,10 +8,6 @@ bool mac_pton(const char *s, u8 *mac)
> >  {
> > int i;
> >  
> > -   /* XX:XX:XX:XX:XX:XX */
> > -   if (strlen(s) < 3 * ETH_ALEN - 1)
> > -   return false;
> > -
> > /* Don't dirty result unless string is valid MAC. */
> > for (i = 0; i < ETH_ALEN; i++) {
> > if (!isxdigit(s[i * 3]) || !isxdigit(s[i * 3 + 1]))
> 
> Short string will bail in the loop, indeed.
> 
> Reviewed-by: Alexey Dobriyan 

Since the author is okay with the change, I'm following:

Reviewed-by: Andy Shevchenko 

-- 
Andy Shevchenko 
Intel Finland Oy


Re: [PATCH] net: fib_rules: Add new attribute to set protocol

2018-02-23 Thread David Miller
From: Donald Sharp 
Date: Fri, 23 Feb 2018 14:01:52 -0500

> For ages iproute2 has used `struct rtmsg` as the ancillary header for
> FIB rules and in the process set the protocol value to RTPROT_BOOT.
> Until ca56209a66 ("net: Allow a rule to track originating protocol")
> the kernel rules code ignored the protocol value sent from userspace
> and always returned 0 in notifications. To avoid incompatibility with
> existing iproute2, send the protocol as a new attribute.
> 
> Fixes: cac56209a66 ("net: Allow a rule to track originating protocol")
> Signed-off-by: Donald Sharp 

Applied, thanks Donald.


Re: [PATCH] net: Allow mac_pton() to work on non-NULL terminated strings

2018-02-23 Thread Alexey Dobriyan
On Fri, Feb 23, 2018 at 09:17:48PM +0100, Stefan Hellermann wrote:
> @@ -8,10 +8,6 @@ bool mac_pton(const char *s, u8 *mac)
>  {
>   int i;
>  
> - /* XX:XX:XX:XX:XX:XX */
> - if (strlen(s) < 3 * ETH_ALEN - 1)
> - return false;
> -
>   /* Don't dirty result unless string is valid MAC. */
>   for (i = 0; i < ETH_ALEN; i++) {
>   if (!isxdigit(s[i * 3]) || !isxdigit(s[i * 3 + 1]))

Short string will bail in the loop, indeed.

Reviewed-by: Alexey Dobriyan 


[net] ixgbe: fix crash in build_skb Rx code path

2018-02-23 Thread Jeff Kirsher
From: Emil Tantilov 

Add check for build_skb enabled ring in ixgbe_dma_sync_frag().
In that case &skb_shinfo(skb)->frags[0] may not always be set which
can lead to a crash. Instead we derive the page offset from skb->data.
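The offset derivation is plain mask arithmetic; a small model of the C expression `(unsigned long)skb->data & ~PAGE_MASK` (with PAGE_MASK == ~(PAGE_SIZE - 1), assuming 4 KiB pages):

```python
# Page-offset extraction as used in the fix, modeled with Python ints.
PAGE_SIZE = 4096
PAGE_MASK = ~(PAGE_SIZE - 1)

def page_offset(addr):
    return addr & ~PAGE_MASK      # equivalently: addr & (PAGE_SIZE - 1)

assert page_offset(0x1000) == 0          # page-aligned buffer
assert page_offset(0x1abc) == 0xabc      # offset within the page
assert page_offset(0x2abc) == page_offset(0x1abc)  # page number is masked off
```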

Fixes: 42073d91a214
("ixgbe: Have the CPU take ownership of the buffers sooner")
CC: stable 
Reported-by: Ambarish Soman 
Suggested-by: Alexander Duyck 
Signed-off-by: Emil Tantilov 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0da5aa2c8aba..9fc063af233c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1888,6 +1888,14 @@ static void ixgbe_dma_sync_frag(struct ixgbe_ring 
*rx_ring,
 ixgbe_rx_pg_size(rx_ring),
 DMA_FROM_DEVICE,
 IXGBE_RX_DMA_ATTR);
+   } else if (ring_uses_build_skb(rx_ring)) {
+   unsigned long offset = (unsigned long)(skb->data) & ~PAGE_MASK;
+
+   dma_sync_single_range_for_cpu(rx_ring->dev,
+ IXGBE_CB(skb)->dma,
+ offset,
+ skb_headlen(skb),
+ DMA_FROM_DEVICE);
} else {
struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
 
-- 
2.14.3





Re: [PATCH] net: Allow mac_pton() to work on non-NULL terminated strings

2018-02-23 Thread Andrew Lunn
On Fri, Feb 23, 2018 at 09:17:48PM +0100, Stefan Hellermann wrote:
> Commit 4904dbda41c8 ("ARM: orion5x: use mac_pton() helper") crashes my
> QNAP TS-209 NAS early on boot.
> 
> The boot code for the TS-209 is looping through an ext2 filesystem on a
> 384kB mtd partition (factory configuration put there by QNAP). There it
> looks on every 1kB boundary if there is a valid MAC address. The
> filesystem has a 1kB block size, so this seems to work.
> 
> On my device the MAC address is on the 37th 1kB block. But: On the 27th
> block is a large file (1,5kB) without 0 bytes inside. The code in
> qnap_tsx09_find_mac_addr() maps 1kB into memory (not a whole file or the
> whole 384kB) and then calls qnap_tsx09_check_mac_addr() -> mac_pton() ->
> strlen() on this memory block. as there is no 0 byte in the file on the
> 27th block, strlen() runs into bad memory and the machine panics. The old
> code had no strlen().
> 
> Actually mac_pton() doesn't need to call strlen(); the following loop
> catches short strings quite nicely. The strlen() seems to be an
> optimization for calls to mac_pton() with an empty string. But this is
> rarely the case and this is not a hot path. Remove it to reduce code
> size and speed up calls with a non-empty string.
> 
> Besides fixing the crash, there are other users interested in
> this change, see https://patchwork.ozlabs.org/patch/851008/
> 
> Fixes: 4904dbda41c8 ("ARM: orion5x: use mac_pton() helper")
> Signed-off-by: Stefan Hellermann 
> Cc:  [4.4+]

Reviewed-by: Andrew Lunn 

Andrew


[PATCH] net: Allow mac_pton() to work on non-NULL terminated strings

2018-02-23 Thread Stefan Hellermann
Commit 4904dbda41c8 ("ARM: orion5x: use mac_pton() helper") crashes my
QNAP TS-209 NAS early on boot.

The boot code for the TS-209 is looping through an ext2 filesystem on a
384kB mtd partition (factory configuration put there by QNAP). There it
looks on every 1kB boundary if there is a valid MAC address. The
filesystem has a 1kB block size, so this seems to work.

On my device the MAC address is on the 37th 1kB block. But: On the 27th
block is a large file (1,5kB) without 0 bytes inside. The code in
qnap_tsx09_find_mac_addr() maps 1kB into memory (not a whole file or the
whole 384kB) and then calls qnap_tsx09_check_mac_addr() -> mac_pton() ->
strlen() on this memory block. as there is no 0 byte in the file on the
27th block, strlen() runs into bad memory and the machine panics. The old
code had no strlen().

Actually mac_pton() doesn't need to call strlen(); the following loop
catches short strings quite nicely. The strlen() seems to be an
optimization for calls to mac_pton() with an empty string. But this is
rarely the case and this is not a hot path. Remove it to reduce code
size and speed up calls with a non-empty string.

Besides fixing the crash, there are other users interested in
this change, see https://patchwork.ozlabs.org/patch/851008/

Fixes: 4904dbda41c8 ("ARM: orion5x: use mac_pton() helper")
Signed-off-by: Stefan Hellermann 
Cc:  [4.4+]
---
 lib/net_utils.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/lib/net_utils.c b/lib/net_utils.c
index af525353395d..9d38da67ee44 100644
--- a/lib/net_utils.c
+++ b/lib/net_utils.c
@@ -8,10 +8,6 @@ bool mac_pton(const char *s, u8 *mac)
 {
int i;
 
-   /* XX:XX:XX:XX:XX:XX */
-   if (strlen(s) < 3 * ETH_ALEN - 1)
-   return false;
-
/* Don't dirty result unless string is valid MAC. */
for (i = 0; i < ETH_ALEN; i++) {
if (!isxdigit(s[i * 3]) || !isxdigit(s[i * 3 + 1]))
-- 
2.11.0
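A rough Python transcription of the loop in the patched mac_pton() (with explicit length checks standing in for the C code reading the string's NUL terminator) shows why the removed strlen() guard was redundant: a short string fails the per-group checks before the result buffer is written.

```python
# Illustrative Python analogue of mac_pton() after the patch; not the
# kernel implementation itself.
import string

ETH_ALEN = 6

def mac_pton(s):
    mac = []
    for i in range(ETH_ALEN):
        # each group: two hex digits, then ':' (except after the last group)
        if len(s) < i * 3 + 2:
            return None            # in C: isxdigit('\0') fails here
        if not (s[i*3] in string.hexdigits and s[i*3+1] in string.hexdigits):
            return None
        if i != ETH_ALEN - 1 and (len(s) < i*3 + 3 or s[i*3+2] != ':'):
            return None
        mac.append(int(s[i*3:i*3+2], 16))
    return bytes(mac)

assert mac_pton("e4:11:00:aa:bb:cc") == bytes([0xe4, 0x11, 0, 0xaa, 0xbb, 0xcc])
assert mac_pton("e4:11") is None      # short string bails in the loop
assert mac_pton("zz:11:00:aa:bb:cc") is None
```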



[PATCH net 1/1] tc: python3, string formattings

2018-02-23 Thread BTaskaya
This patch converts old-style string formatting to new-style string
formatting as part of adapting the Linux Traffic Control (tc) unit
testing suite to python3.

The Linux Traffic Control (tc) unit testing suite's code quality is
improved with this patch.
According to the python documentation:
"The built-in string class provides the ability to do complex variable
substitutions and value formatting via the format() method described in PEP 3101."
The project was using old-style and new-style string formatting together;
this patch's main purpose is converting all old-style usages to the new style.

Following files changed:
 1. tools/testing/selftests/tc-testing/tdc.py
 2. tools/testing/selftests/tc-testing/tdc_batch.py

Following PEP rules applied:
 1. PEP8 - Code Styling
 2. PEP3101 - Advanced Code Formatting

Signed-off-by: Batuhan Osman Taskaya 
---
 tools/testing/selftests/tc-testing/tdc.py   | 2 +-
 tools/testing/selftests/tc-testing/tdc_batch.py | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/tdc.py 
b/tools/testing/selftests/tc-testing/tdc.py
index fc373fdf2bdc..80c75464c67e 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -277,7 +277,7 @@ def generate_case_ids(alltests):
 for c in alltests:
 if (c["id"] == ""):
 while True:
-newid = str('%04x' % random.randrange(16**4))
+newid = str('{:04x}'.format(random.randrange(16**4)))
 if (does_id_exist(alltests, newid)):
 continue
 else:
diff --git a/tools/testing/selftests/tc-testing/tdc_batch.py 
b/tools/testing/selftests/tc-testing/tdc_batch.py
index 707c6bfef689..52fa539dc662 100755
--- a/tools/testing/selftests/tc-testing/tdc_batch.py
+++ b/tools/testing/selftests/tc-testing/tdc_batch.py
@@ -49,13 +49,13 @@ index = 0
 for i in range(0x100):
 for j in range(0x100):
 for k in range(0x100):
-mac = ("%02x:%02x:%02x" % (i, j, k))
+mac = ("{:02x}:{:02x}:{:02x}".format(i, j, k))
 src_mac = "e4:11:00:" + mac
 dst_mac = "e4:12:00:" + mac
-cmd = ("filter add dev %s %s protocol ip parent : flower %s "
-   "src_mac %s dst_mac %s action drop %s" %
+cmd = ("filter add dev {} {} protocol ip parent : flower {} "
+   "src_mac {} dst_mac {} action drop {}".format
(device, prio, skip, src_mac, dst_mac, share_action))
-file.write("%s\n" % cmd)
+file.write("{}\n".format(cmd))
 index += 1
 if index >= number:
 file.close()
-- 
2.16.1



[PATCH 1/1] Linux Traffic Control (tc) unit testing suite's code quality improved, String formattings updated. According to python documentation "The built-in string class provides the ability to do c

2018-02-23 Thread BTaskaya
Signed-off-by: Batuhan Osman Taskaya 
---
 tools/testing/selftests/tc-testing/tdc.py   | 2 +-
 tools/testing/selftests/tc-testing/tdc_batch.py | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/tdc.py 
b/tools/testing/selftests/tc-testing/tdc.py
index fc373fdf2bdc..80c75464c67e 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -277,7 +277,7 @@ def generate_case_ids(alltests):
 for c in alltests:
 if (c["id"] == ""):
 while True:
-newid = str('%04x' % random.randrange(16**4))
+newid = str('{:04x}'.format(random.randrange(16**4)))
 if (does_id_exist(alltests, newid)):
 continue
 else:
diff --git a/tools/testing/selftests/tc-testing/tdc_batch.py 
b/tools/testing/selftests/tc-testing/tdc_batch.py
index 707c6bfef689..52fa539dc662 100755
--- a/tools/testing/selftests/tc-testing/tdc_batch.py
+++ b/tools/testing/selftests/tc-testing/tdc_batch.py
@@ -49,13 +49,13 @@ index = 0
 for i in range(0x100):
 for j in range(0x100):
 for k in range(0x100):
-mac = ("%02x:%02x:%02x" % (i, j, k))
+mac = ("{:02x}:{:02x}:{:02x}".format(i, j, k))
 src_mac = "e4:11:00:" + mac
 dst_mac = "e4:12:00:" + mac
-cmd = ("filter add dev %s %s protocol ip parent : flower %s "
-   "src_mac %s dst_mac %s action drop %s" %
+cmd = ("filter add dev {} {} protocol ip parent : flower {} "
+   "src_mac {} dst_mac {} action drop {}".format
(device, prio, skip, src_mac, dst_mac, share_action))
-file.write("%s\n" % cmd)
+file.write("{}\n".format(cmd))
 index += 1
 if index >= number:
 file.close()
-- 
2.16.1



[PATCH v3 iproute2-next 1/3] ip: Use the `struct fib_rule_hdr` for rules

2018-02-23 Thread Donald Sharp
The iprule.c code was using `struct rtmsg` as the data
type to pass into the kernel for the netlink message.
While 'struct rtmsg' and `struct fib_rule_hdr` are
the same size and mostly the same, we should use
the correct data structure.  This commit translates
the data structures to have iprule.c use the correct
one.

Additionally copy over the modified fib_rules.h file

Signed-off-by: Donald Sharp 
---
 include/uapi/linux/fib_rules.h |   1 +
 ip/iprule.c| 128 +
 2 files changed, 68 insertions(+), 61 deletions(-)

diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 2b642bf9..9477c3af 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -58,6 +58,7 @@ enum {
FRA_PAD,
FRA_L3MDEV, /* iif or oif is l3mdev goto its table */
FRA_UID_RANGE,  /* UID range */
+   FRA_PROTOCOL,
__FRA_MAX
 };
 
diff --git a/ip/iprule.c b/ip/iprule.c
index a3abf2f6..94356bf8 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -73,25 +73,33 @@ static struct
inet_prefix dst;
 } filter;
 
+static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb)
+{
+   __u32 table = frh->table;
+   if (tb[RTA_TABLE])
+   table = rta_getattr_u32(tb[RTA_TABLE]);
+   return table;
+}
+
 static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 {
-   struct rtmsg *r = NLMSG_DATA(n);
+   struct fib_rule_hdr *frh = NLMSG_DATA(n);
__u32 table;
 
-   if (preferred_family != AF_UNSPEC && r->rtm_family != preferred_family)
+   if (preferred_family != AF_UNSPEC && frh->family != preferred_family)
return false;
 
if (filter.prefmask &&
filter.pref ^ (tb[FRA_PRIORITY] ? rta_getattr_u32(tb[FRA_PRIORITY]) 
: 0))
return false;
-   if (filter.not && !(r->rtm_flags & FIB_RULE_INVERT))
+   if (filter.not && !(frh->flags & FIB_RULE_INVERT))
return false;
 
if (filter.src.family) {
inet_prefix *f_src = &filter.src;
 
-   if (f_src->family != r->rtm_family ||
-   f_src->bitlen > r->rtm_src_len)
+   if (f_src->family != frh->family ||
+   f_src->bitlen > frh->src_len)
return false;
 
if (inet_addr_match_rta(f_src, tb[FRA_SRC]))
@@ -101,15 +109,15 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct 
rtattr **tb, int host_len)
if (filter.dst.family) {
inet_prefix *f_dst = &filter.dst;
 
-   if (f_dst->family != r->rtm_family ||
-   f_dst->bitlen > r->rtm_dst_len)
+   if (f_dst->family != frh->family ||
+   f_dst->bitlen > frh->dst_len)
return false;
 
if (inet_addr_match_rta(f_dst, tb[FRA_DST]))
return false;
}
 
-   if (filter.tosmask && filter.tos ^ r->rtm_tos)
+   if (filter.tosmask && filter.tos ^ frh->tos)
return false;
 
if (filter.fwmark) {
@@ -159,7 +167,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
return false;
}
 
-   table = rtm_get_table(r, tb);
+   table = frh_get_table(frh, tb);
if (filter.tb > 0 && filter.tb ^ table)
return false;
 
@@ -169,7 +177,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
 int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 {
FILE *fp = (FILE *)arg;
-   struct rtmsg *r = NLMSG_DATA(n);
+   struct fib_rule_hdr *frh = NLMSG_DATA(n);
int len = n->nlmsg_len;
int host_len = -1;
__u32 table;
@@ -180,13 +188,13 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (n->nlmsg_type != RTM_NEWRULE && n->nlmsg_type != RTM_DELRULE)
return 0;
 
-   len -= NLMSG_LENGTH(sizeof(*r));
+   len -= NLMSG_LENGTH(sizeof(*frh));
if (len < 0)
return -1;
 
-   parse_rtattr(tb, FRA_MAX, RTM_RTA(r), len);
+   parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len);
 
-   host_len = af_bit_len(r->rtm_family);
+   host_len = af_bit_len(frh->family);
 
if (!filter_nlmsg(n, tb, host_len))
return 0;
@@ -200,41 +208,41 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
else
fprintf(fp, "0:\t");
 
-   if (r->rtm_flags & FIB_RULE_INVERT)
+   if (frh->flags & FIB_RULE_INVERT)
fprintf(fp, "not ");
 
if (tb[FRA_SRC]) {
-   if (r->rtm_src_len != host_len) {
+   if (frh->src_len != host_len) {
fprintf(fp, "from %s/%u ",
-   rt_addr_n2a_rta(r->rtm_family, tb[FRA_SRC]),
-

[PATCH v3 iproute2-next 2/3] ip: Display ip rule protocol used

2018-02-23 Thread Donald Sharp
Modify 'ip rule' command to notice when the kernel passes
to us the originating protocol.

Add code to allow the `ip rule flush protocol XXX`
command to be accepted and properly handled.

Modify the documentation to reflect these code changes.
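
The flush path added below matches rules with an XOR-and-mask test; a minimal
Python sketch of that idiom (function and values are illustrative, not the
iproute2 C code):

```python
# Sketch of the XOR-and-mask filter idiom used by the flush path below.
# The function name and protocol numbers are illustrative only.
def protocol_matches(filter_protocol, filter_mask, rule_protocol):
    # XOR is zero iff the two values are equal; ANDing with the mask lets
    # a zero mask ("all") disable the comparison entirely.
    return ((filter_protocol ^ rule_protocol) & filter_mask) == 0

assert protocol_matches(186, 0xff, 186)      # exact match kept
assert not protocol_matches(186, 0xff, 3)    # mismatch filtered out
assert protocol_matches(0, 0, 99)            # mask 0 == "all": always matches
```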

Signed-off-by: Donald Sharp 
---
 ip/iprule.c| 36 +---
 man/man8/ip-rule.8 | 18 +-
 2 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/ip/iprule.c b/ip/iprule.c
index 94356bf8..17df9e9b 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -47,6 +47,7 @@ static void usage(void)
"[ iif STRING ] [ oif STRING ] [ pref NUMBER ] [ 
l3mdev ]\n"
"[ uidrange NUMBER-NUMBER ]\n"
"ACTION := [ table TABLE_ID ]\n"
+   "  [ protocol PROTO ]\n"
"  [ nat ADDRESS ]\n"
"  [ realms [SRCREALM/]DSTREALM ]\n"
"  [ goto NUMBER ]\n"
@@ -71,6 +72,8 @@ static struct
struct fib_rule_uid_range range;
inet_prefix src;
inet_prefix dst;
+   int protocol;
+   int protocolmask;
 } filter;
 
 static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb)
@@ -338,6 +341,16 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
rtnl_rtntype_n2a(frh->action,
 b1, sizeof(b1)));
 
+   if (tb[FRA_PROTOCOL]) {
+   __u8 protocol = rta_getattr_u8(tb[FRA_PROTOCOL]);
+
+   if ((protocol && protocol != RTPROT_KERNEL) ||
+   show_details > 0) {
+   fprintf(fp, " proto %s ",
+   rtnl_rtprot_n2a(protocol, b1, sizeof(b1)));
+   }
+   }
+
fprintf(fp, "\n");
fflush(fp);
return 0;
@@ -391,6 +404,10 @@ static int flush_rule(const struct sockaddr_nl *who, 
struct nlmsghdr *n,
 
parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len);
 
+   if (tb[FRA_PROTOCOL] &&
+   
(filter.protocol^rta_getattr_u8(tb[FRA_PROTOCOL]))&filter.protocolmask)
+   return 0;
+
if (tb[FRA_PRIORITY]) {
n->nlmsg_type = RTM_DELRULE;
n->nlmsg_flags = NLM_F_REQUEST;
@@ -415,12 +432,6 @@ static int iprule_list_flush_or_save(int argc, char 
**argv, int action)
if (af == AF_UNSPEC)
af = AF_INET;
 
-   if (action != IPRULE_LIST && argc > 0) {
-   fprintf(stderr, "\"ip rule %s\" does not take any arguments.\n",
-   action == IPRULE_SAVE ? "save" : "flush");
-   return -1;
-   }
-
switch (action) {
case IPRULE_SAVE:
if (save_rule_prep())
@@ -508,7 +519,18 @@ static int iprule_list_flush_or_save(int argc, char 
**argv, int action)
NEXT_ARG();
if (get_prefix(&filter.src, *argv, af))
invarg("from value is invalid\n", *argv);
-   } else {
+   } else if (matches(*argv, "protocol") == 0) {
+   __u32 prot;
+   NEXT_ARG();
+   filter.protocolmask = -1;
+   if (rtnl_rtprot_a2n(&prot, *argv)) {
+   if (strcmp(*argv, "all") != 0)
+   invarg("invalid \"protocol\"\n", *argv);
+   prot = 0;
+   filter.protocolmask = 0;
+   }
+   filter.protocol = prot;
+   } else{
if (matches(*argv, "dst") == 0 ||
matches(*argv, "to") == 0) {
NEXT_ARG();
diff --git a/man/man8/ip-rule.8 b/man/man8/ip-rule.8
index a5c47981..f4070542 100644
--- a/man/man8/ip-rule.8
+++ b/man/man8/ip-rule.8
@@ -50,6 +50,8 @@ ip-rule \- routing policy database management
 .IR ACTION " := [ "
 .B  table
 .IR TABLE_ID " ] [ "
+.B  protocol
+.IR PROTO " ] [ "
 .B  nat
 .IR ADDRESS " ] [ "
 .B realms
@@ -240,6 +242,10 @@ The options preference and order are synonyms with 
priority.
 the routing table identifier to lookup if the rule selector matches.
 It is also possible to use lookup instead of table.
 
+.TP
+.BI protocol " PROTO"
+the protocol who installed the rule in question.
+
 .TP
 .BI suppress_prefixlength " NUMBER"
 reject routing decisions that have a prefix length of NUMBER or less.
@@ -275,7 +281,11 @@ updates, it flushes the routing cache with
 .RE
 .TP
 .B ip rule flush - also dumps all the deleted rules.
-This command has no arguments.
+.RS
+.TP
+.BI protocol " PROTO"
+Select the originating protocol.
+.RE
 .TP
 .B ip rule show - list rules
 This command has no arguments.
@@ -283,6 +293,12 @@ The options list or lst are synonyms with show.
 
 .TP
 .B ip rule save
+.RS
+.TP
+.BI protocl " PROTO"
+Select the

[PATCH v3 iproute2-next 3/3] ip: Allow rules to accept a specified protocol

2018-02-23 Thread Donald Sharp
Allow the specification of a protocol when the user
adds/modifies/deletes a rule.

Signed-off-by: Donald Sharp 
---
 ip/iprule.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index 17df9e9b..796da3b3 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -689,6 +689,12 @@ static int iprule_modify(int cmd, int argc, char **argv)
if (get_rt_realms_or_raw(&realm, *argv))
invarg("invalid realms\n", *argv);
addattr32(&req.n, sizeof(req), FRA_FLOW, realm);
+   } else if (matches(*argv, "protocol") == 0) {
+   __u32 proto;
+   NEXT_ARG();
+   if (rtnl_rtprot_a2n(&proto, *argv))
+   invarg("\"protocol\" value is invalid\n", 
*argv);
+   addattr8(&req.n, sizeof(req), FRA_PROTOCOL, proto);
} else if (matches(*argv, "table") == 0 ||
   strcmp(*argv, "lookup") == 0) {
NEXT_ARG();
-- 
2.14.3



[PATCH v3 iproute2-next 0/3] Allow 'ip rule' command to use protocol

2018-02-23 Thread Donald Sharp
Fix iprule.c to use the actual `struct fib_rule_hdr` and to
allow the end user to see and use the protocol keyword
for rule manipulation.

v2: Rearrange and code changes as per David Ahern
v3: Fix some missed RTN_XXX to appropriate FR_XX and doc changes

Donald Sharp (3):
  ip: Use the `struct fib_rule_hdr` for rules
  ip: Display ip rule protocol used
  ip: Allow rules to accept a specified protocol

 include/uapi/linux/fib_rules.h |   1 +
 ip/iprule.c| 170 -
 man/man8/ip-rule.8 |  18 -
 3 files changed, 120 insertions(+), 69 deletions(-)

-- 
2.14.3



Re: [PATCH iproute2 v2] ip: Properly display AF_BRIDGE address information for neighbor events

2018-02-23 Thread Stephen Hemminger
On Fri, 23 Feb 2018 14:10:09 -0500
Donald Sharp  wrote:

> The vxlan driver when a neighbor add/delete event occurs sends
> NDA_DST filled with a union:
> 
> union vxlan_addr {
>   struct sockaddr_in sin;
>   struct sockaddr_in6 sin6;
>   struct sockaddr sa;
> };
> 
> This eventually calls rt_addr_n2a_r which had no handler for the
> AF_BRIDGE family and "???" was being printed.
> 
> Add code to properly display this data when requested.
> 
> Signed-off-by: Donald Sharp 

I made some minor changes to the patch and applied it
(need a space after "switch", and the local variable "family" isn't needed).
Thanks.



Re: [for-next 7/7] IB/mlx5: Implement fragmented completion queue (CQ)

2018-02-23 Thread Saeed Mahameed
On Thu, 2018-02-22 at 16:04 -0800, Santosh Shilimkar wrote:
> Hi Saeed
> 
> On 2/21/2018 12:13 PM, Saeed Mahameed wrote:
> > From: Yonatan Cohen 
> > 
> > The current implementation of create CQ requires contiguous
> > memory, such requirement is problematic once the memory is
> > fragmented or the system is low in memory, it causes for
> > failures in dma_zalloc_coherent().
> > 
> > This patch implements new scheme of fragmented CQ to overcome
> > this issue by introducing new type: 'struct mlx5_frag_buf_ctrl'
> > to allocate fragmented buffers, rather than contiguous ones.
> > 
> > Base the Completion Queues (CQs) on this new fragmented buffer.
> > 
> > It fixes following crashes:
> > kworker/29:0: page allocation failure: order:6, mode:0x80d0
> > CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0
> > Workqueue: ib_cm cm_work_handler [ib_cm]
> > Call Trace:
> > [<>] dump_stack+0x19/0x1b
> > [<>] warn_alloc_failed+0x110/0x180
> > [<>] __alloc_pages_slowpath+0x6b7/0x725
> > [<>] __alloc_pages_nodemask+0x405/0x420
> > [<>] dma_generic_alloc_coherent+0x8f/0x140
> > [<>] x86_swiotlb_alloc_coherent+0x21/0x50
> > [<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
> > [<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core]
> > [<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
> > [<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
> > [<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib]
> > [<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib]
> > 
> > Signed-off-by: Yonatan Cohen 
> > Reviewed-by: Tariq Toukan 
> > Signed-off-by: Leon Romanovsky 
> > Signed-off-by: Saeed Mahameed 
> > ---
> 
> Jason mentioned about this patch to me off-list. We were
> seeing similar issue with SRQs & QPs. So wondering whether
> you have any plans to do a similar change for other resources
> too so that they don't rely on higher order page allocation
> for icm tables.
> 

Hi Santosh,

Adding Majd,

Which ULP is in question? How big are the QPs/SRQs you create that
lead to this problem?

For icm tables we already allocate only order 0 pages:
see alloc_system_page() in
drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c

But for kernel RDMA SRQ and QP buffers there is a place for
improvement.

Majd, do you know if we have any near-future plans for this?

> Regards,
> Santosh

Re: [PATCH 1/1] Linux Traffic Control (tc) unit testing suite's code quality improved, String formattings updated. According to python documentation "The built-in string class provides the ability to

2018-02-23 Thread David Miller
From: BTaskaya 
Date: Fri, 23 Feb 2018 21:48:30 +0300

> Signed-off-by: Batuhan Osman Taskaya 

Sorry, this still is not submitted properly.

You Subject line is enormous.  It should contain a short,
concise, title for the change.  Including subsystem prefixes,
followed by a colon character, and a space, then the
title text.

The rest belongs in the commit message proper, and thus the
message body.

I asked you quite politely to look at how other people submit
patches properly on this mailing list, so that you could observe,
learn, and see how it is supposed to look.

If you had done so, we wouldn't have to go back and forth like
this.

So please take my polite suggestion seriously.

Thank you.


[PATCH iproute2 v2] ip: Properly display AF_BRIDGE address information for neighbor events

2018-02-23 Thread Donald Sharp
The vxlan driver when a neighbor add/delete event occurs sends
NDA_DST filled with a union:

union vxlan_addr {
struct sockaddr_in sin;
struct sockaddr_in6 sin6;
struct sockaddr sa;
};

This eventually calls rt_addr_n2a_r which had no handler for the
AF_BRIDGE family and "???" was being printed.

Add code to properly display this data when requested.
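
The dispatch this patch adds can be sketched in Python (a hypothetical
analogue for illustration, not the iproute2 C code; the function name is
made up):

```python
# Hypothetical Python analogue of the AF_BRIDGE handling this patch adds:
# dispatch on the embedded address family and fall back to "???" for
# anything unrecognised, as the C default case does.
import socket

def addr_to_str(family, raw):
    if family == socket.AF_INET:
        return socket.inet_ntop(socket.AF_INET, raw)
    if family == socket.AF_INET6:
        return socket.inet_ntop(socket.AF_INET6, raw)
    return "???"  # same fallback the C code keeps for unknown families

print(addr_to_str(socket.AF_INET, bytes([192, 0, 2, 1])))  # → 192.0.2.1
print(addr_to_str(socket.AF_INET6, bytes(15) + b'\x01'))   # → ::1
print(addr_to_str(0, b''))                                 # → ???
```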

Signed-off-by: Donald Sharp 
---
 lib/utils.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/lib/utils.c b/lib/utils.c
index 24aeddd8..fe5841f6 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -1004,6 +1004,25 @@ const char *rt_addr_n2a_r(int af, int len,
}
case AF_PACKET:
return ll_addr_n2a(addr, len, ARPHRD_VOID, buf, buflen);
+   case AF_BRIDGE:
+   {
+   const union {
+   struct sockaddr sa;
+   struct sockaddr_in sin;
+   struct sockaddr_in6 sin6;
+   } *sa = addr;
+   unsigned short family = sa->sa.sa_family;
+
+   switch(family) {
+   case AF_INET:
+   return inet_ntop(AF_INET, &sa->sin.sin_addr, buf, 
buflen);
+   case AF_INET6:
+   return inet_ntop(AF_INET6, &sa->sin6.sin6_addr,
+buf, buflen);
+   }
+
+   /* fallthrough */
+   }
default:
return "???";
}
-- 
2.14.3



[PATCH] net: fib_rules: Add new attribute to set protocol

2018-02-23 Thread Donald Sharp
For ages iproute2 has used `struct rtmsg` as the ancillary header for
FIB rules and in the process set the protocol value to RTPROT_BOOT.
Until ca56209a66 ("net: Allow a rule to track originating protocol")
the kernel rules code ignored the protocol value sent from userspace
and always returned 0 in notifications. To avoid incompatibility with
existing iproute2, send the protocol as a new attribute.

Fixes: cac56209a66 ("net: Allow a rule to track originating protocol")
Signed-off-by: Donald Sharp 
---
 drivers/net/vrf.c  |  5 -
 include/net/fib_rules.h|  3 ++-
 include/uapi/linux/fib_rules.h |  5 +++--
 net/core/fib_rules.c   | 15 +++
 4 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 951a4b42cb29..9ce0182223a0 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -1145,6 +1145,7 @@ static inline size_t vrf_fib_rule_nl_size(void)
sz  = NLMSG_ALIGN(sizeof(struct fib_rule_hdr));
sz += nla_total_size(sizeof(u8));   /* FRA_L3MDEV */
sz += nla_total_size(sizeof(u32));  /* FRA_PRIORITY */
+   sz += nla_total_size(sizeof(u8));   /* FRA_PROTOCOL */
 
return sz;
 }
@@ -1174,7 +1175,9 @@ static int vrf_fib_rule(const struct net_device *dev, 
__u8 family, bool add_it)
memset(frh, 0, sizeof(*frh));
frh->family = family;
frh->action = FR_ACT_TO_TBL;
-   frh->proto = RTPROT_KERNEL;
+
+   if (nla_put_u8(skb, FRA_PROTOCOL, RTPROT_KERNEL))
+   goto nla_put_failure;
 
if (nla_put_u8(skb, FRA_L3MDEV, 1))
goto nla_put_failure;
diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index b166ef07e6d4..b3d216249240 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -109,7 +109,8 @@ struct fib_rule_notifier_info {
[FRA_SUPPRESS_IFGROUP] = { .type = NLA_U32 }, \
[FRA_GOTO]  = { .type = NLA_U32 }, \
[FRA_L3MDEV]= { .type = NLA_U8 }, \
-   [FRA_UID_RANGE] = { .len = sizeof(struct fib_rule_uid_range) }
+   [FRA_UID_RANGE] = { .len = sizeof(struct fib_rule_uid_range) }, \
+   [FRA_PROTOCOL]  = { .type = NLA_U8 }
 
 static inline void fib_rule_get(struct fib_rule *rule)
 {
diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 925539172d5b..77d90ae38114 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -23,8 +23,8 @@ struct fib_rule_hdr {
__u8tos;
 
__u8table;
-   __u8proto;
-   __u8res1;   /* reserved */
+   __u8res1;   /* reserved */
+   __u8res2;   /* reserved */
__u8action;
 
__u32   flags;
@@ -58,6 +58,7 @@ enum {
FRA_PAD,
FRA_L3MDEV, /* iif or oif is l3mdev goto its table */
FRA_UID_RANGE,  /* UID range */
+   FRA_PROTOCOL,   /* Originator of the rule */
__FRA_MAX
 };
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 88298f18cbae..a6aea805a0a2 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -466,11 +466,13 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr 
*nlh,
}
refcount_set(&rule->refcnt, 1);
rule->fr_net = net;
-   rule->proto = frh->proto;
 
rule->pref = tb[FRA_PRIORITY] ? nla_get_u32(tb[FRA_PRIORITY])
  : fib_default_rule_pref(ops);
 
+   rule->proto = tb[FRA_PROTOCOL] ?
+   nla_get_u8(tb[FRA_PROTOCOL]) : RTPROT_UNSPEC;
+
if (tb[FRA_IIFNAME]) {
struct net_device *dev;
 
@@ -666,7 +668,8 @@ int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr 
*nlh,
}
 
list_for_each_entry(rule, &ops->rules_list, list) {
-   if (frh->proto && (frh->proto != rule->proto))
+   if (tb[FRA_PROTOCOL] &&
+   (rule->proto != nla_get_u8(tb[FRA_PROTOCOL])))
continue;
 
if (frh->action && (frh->action != rule->action))
@@ -786,7 +789,8 @@ static inline size_t fib_rule_nlmsg_size(struct 
fib_rules_ops *ops,
 + nla_total_size(4) /* FRA_FWMARK */
 + nla_total_size(4) /* FRA_FWMASK */
 + nla_total_size_64bit(8) /* FRA_TUN_ID */
-+ nla_total_size(sizeof(struct fib_kuid_range));
++ nla_total_size(sizeof(struct fib_kuid_range))
++ nla_total_size(1); /* FRA_PROTOCOL */
 
if (ops->nlmsg_payload)
payload += ops->nlmsg_payload(rule);
@@ -813,9 +817,12 @@ static int fib_nl_fill_rule(struct sk_buff *skb, struct 
fib_rule *rule,
if (nla_put_u32(skb, FRA_SUPPRESS_PREFIXLEN, rule->suppress_prefixlen))
goto nla_put_failure;
frh->res1 = 0;
+   frh->res2 = 0;
frh->action = rule->action;
  

Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest

2018-02-23 Thread Randy Dunlap
[adding netdev]

On 02/23/2018 08:05 AM, Khalid Aziz wrote:
> I am seeing a kernel panic with 4.16-rc1 and 4.16-rc2 kernels when running 
> selftests
> from tools/testing/selftests. Last messages from selftest before kernel panic 
> are:
> 
> 
> running psock_tpacket test
> 
> test: TPACKET_V1 with PACKET_RX_RING test: skip TPACKET_V1 PACKET_RX_RING 
> since user and kernel space have different bit width
> test: TPACKET_V1 with PACKET_TX_RING test: skip TPACKET_V1 PACKET_TX_RING 
> since user and kernel space have different bit width
> test: TPACKET_V2 with PACKET_RX_RING  100 pkts (14200 
> bytes)
> test: TPACKET_V2 with PACKET_TX_RING  100 pkts (14200 
> bytes)
> test: TPACKET_V3 with PACKET_RX_RING  100 pkts (14200 
> bytes)
> test: TPACKET_V3 with PACKET_TX_RING  100 pkts (14200 
> bytes)
> OK. All tests passed
> [PASS]
> ok 1..7 selftests: run_afpackettests [PASS]
> selftests: test_bpf.sh
> 
> test_bpf: [FAIL]
> not ok 1..8 selftests:  test_bpf.sh [FAIL]
> selftests: netdevice.sh
> 
> ok 1..9 selftests: netdevice.sh [PASS]
> selftests: rtnetlink.sh
> 
> PASS: policy routing
> PASS: route get
> 
> 
> Kernel panic message is below:
> 
> [  572.486722] BUG: unable to handle kernel paging request at 0600
> [  572.494498] IP: tcf_exts_dump_stats+0x10/0x30
> [  572.499360] PGD 80be413cb067 P4D 80be413cb067 PUD bead15c067 PMD 0 
> [  572.507126] Oops:  [#1] SMP PTI
> [  572.511010] Modules linked in: cls_u32 sch_htb dummy vfat fat ext4 mbcache 
> jb
> d2 intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul 
> crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper 
> cryptd sg iTCO_wdt iTCO_vendor_support ioatdma ipmi_ssif pcspkr wmi i2c_i801 
> lpc_ich shpchp mfd_core ipmi_si ipmi_devintf ipmi_msghandler nfsd auth_rpcgss 
> nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb ahci 
> crc32c_intel nvme libahci dca drm megaraid_sas nvme_core i2c_algo_bit libata 
> bnxt_en i2c_core dm_mirror dm_region_hash dm_log dm_mod
> [  572.574377] CPU: 81 PID: 17886 Comm: tc Not tainted 4.16.0-rc2 #112
> [  572.581371] Hardware name: Oracle Corporation ORACLE SERVER X7-2/ASM, MB, 
> X7-2, BIOS 41017600 10/06/2017
> [  572.591957] RIP: 0010:tcf_exts_dump_stats+0x10/0x30
> [  572.597402] RSP: 0018:c900313b7928 EFLAGS: 00010206
> [  572.603226] RAX: 0600 RBX: 88bea9117db0 RCX: 
> 1ca4
> [  572.611191] RDX: 1ca3 RSI: 88bea90cf018 RDI: 
> 88be4fb6c000
> [  572.619157] RBP: 88be4fb6c000 R08: 00024800 R09: 
> a05697fb
> [  572.627121] R10: 88bebe064800 R11: ea02faa445c0 R12: 
> 88bea90ce034
> [  572.635087] R13: 88bea90cf000 R14: 88be9fe33300 R15: 
> 88bea90ce000
> [  572.643053] FS:  7f98ae464740() GS:88bebe04() 
> knlGS:
> [  572.652084] CS:  0010 DS:  ES:  CR0: 80050033
> [  572.658497] CR2: 0600 CR3: 00be41a94005 CR4: 
> 007606e0
> [  572.666462] DR0:  DR1:  DR2: 
> 
> [  572.674428] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  572.682393] PKRU: 5554
> [  572.685413] Call Trace:
> [  572.688145]  u32_dump+0x2be/0x3c0 [cls_u32]
> [  572.692816]  tcf_fill_node.isra.29+0x15b/0x1f0
> [  572.69]  tfilter_notify+0xc1/0x150
> [  572.701952]  tc_ctl_tfilter+0x87d/0xbd0
> [  572.706238]  rtnetlink_rcv_msg+0x29c/0x310
> [  572.710813]  ? _cond_resched+0x15/0x30
> [  572.714999]  ? __kmalloc_node_track_caller+0x1b9/0x270
> [  572.720737]  ? rtnl_calcit.isra.28+0x100/0x100
> [  572.725697]  netlink_rcv_skb+0xd2/0x110
> [  572.729969]  netlink_unicast+0x17c/0x230
> [  572.734348]  netlink_sendmsg+0x2cd/0x3c0
> [  572.738719]  sock_sendmsg+0x30/0x40
> [  572.742612]  ___sys_sendmsg+0x27a/0x290
> [  572.746896]  ? do_wp_page+0x89/0x4c0
> [  572.750886]  ? page_add_new_anon_rmap+0x72/0xc0
> [  572.755944]  ? __handle_mm_fault+0x74b/0x1280
> [  572.760807]  __sys_sendmsg+0x51/0x90
> [  572.764800]  do_syscall_64+0x6e/0x1a0
> [  572.76]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [  572.774526] RIP: 0033:0x7f98ada843b0
> [  572.778515] RSP: 002b:7fff833a4f38 EFLAGS: 0246 ORIG_RAX: 
> 002e
> [  572.786963] RAX: ffda RBX: 5a8deb31 RCX: 
> 7f98ada843b0
> [  572.794929] RDX:  RSI: 7fff833a4f80 RDI: 
> 0003
> [  572.802892] RBP: 7fff833a4f80 R08:  R09: 
> 0001
> [  572.810856] R10: 7fff833a4320 R11: 0246 R12: 
> 
> [  572.818823] R13: 00650ba0 R14: 7fff833b11e8 R15: 

[GIT] Networking

2018-02-23 Thread David Miller

1) Fix TTL offset calculation in mac80211 mesh code, from Peter Oh.

2) Fix races with procfs in ipt_CLUSTERIP, from Cong Wang.

3) Memory leak fix in lpm_trie BPF map code, from Yonghong Song.

4) Need to use GFP_ATOMIC in BPF cpumap allocations, from Jason Wang.

5) Fix potential deadlocks in netfilter getsockopt() code paths, from
   Paolo Abeni.

6) Netfilter stackpointer size checks really are needed to validate
   user input, from Florian Westphal.

7) Missing timer init in x_tables, from Paolo Abeni.

8) Don't use WQ_MEM_RECLAIM in mac80211 hwsim, from Johannes Berg.

9) When an ibmvnic device is brought down then back up again, it can
   be sent queue entries from a previous session, handle this properly
   instead of crashing.  From Thomas Falcon.

10) Fix TCP checkums on LRO buffers in mlx5e, from Gal Pressman.

11) When we are dumping filters in cls_api, the output SKB is empty,
and the filter we are dumping is too large for the space in the
SKB, we should return -EMSGSIZE like other netlink dump operations
do.  Otherwise userland has no signal that is needs to increase
the size of it's read buffer.  From Roman Kapl.

12) Several XDP fixes for virtio_net, from Jesper Dangaard Brouer.

13) Module refcount leak in netlink when a dump start fails, from
Jason A. Donenfeld.

14) Handle sub-optimal GSO sizes better in TCP BBR congestion control,
from Eric Dumazet.

15) Releasing bpf per-cpu arraymaps can take a long time, add a
condtional scheduling point.  From Eric Dumazet.

16) Implement retpolines for tail calls in x64 and arm64 bpf JITs.
From Daniel Borkmann.

17) Fix page leak in gianfar driver, from Andy Spencer.

18) Missed clearing of estimator scratch buffer, from Eric Dumazet.

Please pull, thanks a lot!

The following changes since commit 79c0ef3e85c015b0921a8fd5dd539d1480e9cd6c:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-02-19 
11:58:19 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to a5f7add332b4ea6d4b9480971b3b0f5e66466ae9:

  net_sched: gen_estimator: fix broken estimators based on percpu stats 
(2018-02-23 12:35:46 -0500)


Alexey Kodanev (1):
  macvlan: fix use-after-free in macvlan_common_newlink()

Anders Roxell (2):
  selftests/bpf: tcpbpf_kern: use in6_* macros from glibc
  selftests/bpf: update gitignore with test_libbpf_open

Andy Spencer (1):
  gianfar: simplify FCS handling and fix memory leak

Arnd Bergmann (3):
  cfg80211: fix cfg80211_beacon_dup
  bpf: clean up unused-variable warning
  ipv6 sit: work around bogus gcc-8 -Wrestrict warning

Avraham Stern (1):
  cfg80211: clear wep keys after disconnection

Cong Wang (2):
  netfilter: ipt_CLUSTERIP: fix a race condition of proc file creation
  netfilter: ipt_CLUSTERIP: fix a refcount bug in 
clusterip_config_find_get()

Dan Carpenter (1):
  net: aquantia: Fix error handling in aq_pci_probe()

Daniel Borkmann (5):
  bpf: fix bpf_prog_array_copy_to_user warning from perf event prog query
  Merge branch 'bpf-bpftool-json-fixes'
  bpf: fix mlock precharge on arraymaps
  bpf, x64: implement retpoline for tail call
  bpf, arm64: fix out of bounds access in tail call

Daniel Jurgens (1):
  net/mlx5: Use 128B cacheline size for 128B or larger cachelines

David Ahern (1):
  net: ipv4: Set addr_type in hash_keys for forwarded case

David Howells (1):
  rxrpc: Fix send in rxrpc_send_data_packet()

David S. Miller (6):
  Merge git://git.kernel.org/.../pablo/nf
  Merge tag 'mlx5-fixes-2018-02-20' of git://git.kernel.org/.../saeed/linux
  Merge branch 'virtio_net-XDP-fixes'
  Merge git://git.kernel.org/.../bpf/bpf
  Merge tag 'mac80211-for-davem-2018-02-22' of 
git://git.kernel.org/.../jberg/mac80211
  Merge git://git.kernel.org/.../bpf/bpf

Eran Ben Elisha (1):
  net/mlx5e: Verify inline header size do not exceed SKB linear size

Eric Dumazet (7):
  bpf: fix sock_map_alloc() error path
  netfilter: xt_hashlimit: fix lock imbalance
  netfilter: IDLETIMER: be syzkaller friendly
  smsc75xx: fix smsc75xx_set_features()
  tcp_bbr: better deal with suboptimal GSO
  bpf: add schedule points in percpu arrays management
  net_sched: gen_estimator: fix broken estimators based on percpu stats

Eugenia Emantayev (1):
  net/mlx5: E-Switch, Fix drop counters use before creation

Felix Fietkau (1):
  mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4

Finn Thain (1):
  net/smc9194: Remove bogus CONFIG_MAC reference

Florian Westphal (10):
  netfilter: add back stackpointer size checks
  netfilter: x_tables: remove pr_info where possible
  netfilter: x_tables: use pr ratelimiting in xt core
  netfilter: xt_CT: use pr ratelimiting
  netfilter: xt_NFQUEUE: use pr r

VRF destination unreachable

2018-02-23 Thread Stephen Suryaputra
Greetings,

We found that ICMP destination unreachable isn't sent if VRF
forwarding isn't configured, i.e.
/proc/sys/net/ipv4/conf//forwarding isn't set. The
relevant code is:

static int ip_error(struct sk_buff *skb)
{
...
// in_dev is the vrf net_device
if (!IN_DEV_FORWARD(in_dev)) {
switch (rt->dst.error) {
case EHOSTUNREACH:
__IP_INC_STATS(net, IPSTATS_MIB_INADDRERRORS);
break;

case ENETUNREACH:
__IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES);
break;
}
goto out;
}
...
out:kfree_skb(skb);
return 0;
}

The question: is it intended to be set? The basic forwarding seems to
be working without. We do set it on the slave net devices.

Thank you,

Stephen.


Re: [PATCH 1/1] Linux Traffic Control (tc) unit testing suite's code quality improved, String formattings updated. According to python documentation "The built-in string class provides the ability to

2018-02-23 Thread David Miller

Sorry, this is still not submitted in an acceptable way.

Please read the documents I pointed you too, look at how other people
submit patches on this list, and submit it properly.

Thank you.


Re: [PATCH net-next] selftests/net: ignore background traffic in psock_fanout

2018-02-23 Thread David Miller
From: Willem de Bruijn 
Date: Fri, 23 Feb 2018 11:56:20 -0500

> From: Willem de Bruijn 
> 
> The packet fanout test generates UDP traffic and reads this with
> a pair of packet sockets, testing the various fanout algorithms.
> 
> Avoid non-determinism from reading unrelated background traffic.
> Fanout decisions are made before unrelated packets can be dropped with
> a filter, so that is an insufficient strategy [*]. Run the packet
> socket tests in a network namespace, similar to msg_zerocopy.
> 
> It is still good practice to install a filter on a packet socket
> before accepting traffic. Because this is example code, demonstrate
> that pattern. Open the socket initially bound to no protocol, install
> a filter, and only then bind to ETH_P_IP.
> 
> Another source of non-determinism is hash collisions in FANOUT_HASH.
> The hash function used to select a socket in the fanout group includes
> the pseudorandom number hashrnd, which is not visible from userspace.
> To work around this, the test tries to find a pair of UDP source ports
> that do not collide. It gives up too soon (5 times, every 32 runs) and
> output is confusing. Increase tries to 20 and revise the error msg.
> 
> [*] another approach would be to add a third socket to the fanout
> group and direct all unexpected traffic here. This is possible
> only when reimplementing methods like RR or HASH alongside this
> extra catch-all bucket, using the BPF fanout method.
> 
> Signed-off-by: Willem de Bruijn 

Applied, thanks Willem.

Indeed, not being able to control hashrnd makes determinism in tests
like this quite difficult.


[PATCH net 2/5] l2tp: don't use inet_shutdown on ppp session destroy

2018-02-23 Thread James Chapman
Previously, if a ppp session was closed, we called inet_shutdown to mark
the socket as unconnected such that userspace would get errors and
then close the socket. This could race with userspace closing the
socket. Instead, leave userspace to close the socket in its own time
(our session will be detached anyway).

BUG: KASAN: use-after-free in inet_shutdown+0x5d/0x1c0
Read of size 4 at addr 880010ea3ac0 by task syzbot_347bd5ac/8296

CPU: 3 PID: 8296 Comm: syzbot_347bd5ac Not tainted 4.16.0-rc1+ #91
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
 dump_stack+0x101/0x157
 ? inet_shutdown+0x5d/0x1c0
 print_address_description+0x78/0x260
 ? inet_shutdown+0x5d/0x1c0
 kasan_report+0x240/0x360
 __asan_load4+0x78/0x80
 inet_shutdown+0x5d/0x1c0
 ? pppol2tp_show+0x80/0x80
 pppol2tp_session_close+0x68/0xb0
 l2tp_tunnel_closeall+0x199/0x210
 ? udp_v6_flush_pending_frames+0x90/0x90
 l2tp_udp_encap_destroy+0x6b/0xc0
 ? l2tp_tunnel_del_work+0x2e0/0x2e0
 udpv6_destroy_sock+0x8c/0x90
 sk_common_release+0x47/0x190
 udp_lib_close+0x15/0x20
 inet_release+0x85/0xd0
 inet6_release+0x43/0x60
 sock_release+0x53/0x100
 ? sock_alloc_file+0x260/0x260
 sock_close+0x1b/0x20
 __fput+0x19f/0x380
 fput+0x1a/0x20
 task_work_run+0xd2/0x110
 exit_to_usermode_loop+0x18d/0x190
 do_syscall_64+0x389/0x3b0
 entry_SYSCALL_64_after_hwframe+0x26/0x9b
RIP: 0033:0x7fe240a45259
RSP: 002b:7fe241132df8 EFLAGS: 0297 ORIG_RAX: 0003
RAX:  RBX:  RCX: 7fe240a45259
RDX: 7fe240a45259 RSI:  RDI: 00a5
RBP: 7fe241132e20 R08: 7fe241133700 R09: 
R10: 7fe241133700 R11: 0297 R12: 
R13: 7ffc49aff84f R14:  R15: 7fe241141040

Allocated by task 8331:
 save_stack+0x43/0xd0
 kasan_kmalloc+0xad/0xe0
 kasan_slab_alloc+0x12/0x20
 kmem_cache_alloc+0x144/0x3e0
 sock_alloc_inode+0x22/0x130
 alloc_inode+0x3d/0xf0
 new_inode_pseudo+0x1c/0x90
 sock_alloc+0x30/0x110
 __sock_create+0xaa/0x4c0
 SyS_socket+0xbe/0x130
 do_syscall_64+0x128/0x3b0
 entry_SYSCALL_64_after_hwframe+0x26/0x9b

Freed by task 8314:
 save_stack+0x43/0xd0
 __kasan_slab_free+0x11a/0x170
 kasan_slab_free+0xe/0x10
 kmem_cache_free+0x88/0x2b0
 sock_destroy_inode+0x49/0x50
 destroy_inode+0x77/0xb0
 evict+0x285/0x340
 iput+0x429/0x530
 dentry_unlink_inode+0x28c/0x2c0
 __dentry_kill+0x1e3/0x2f0
 dput.part.21+0x500/0x560
 dput+0x24/0x30
 __fput+0x2aa/0x380
 fput+0x1a/0x20
 task_work_run+0xd2/0x110
 exit_to_usermode_loop+0x18d/0x190
 do_syscall_64+0x389/0x3b0
 entry_SYSCALL_64_after_hwframe+0x26/0x9b

Fixes: fd558d186df2c ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: James Chapman 
---
 net/l2tp/l2tp_ppp.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index 59f246d7b290..2d2955e8f710 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -420,16 +420,6 @@ static int pppol2tp_xmit(struct ppp_channel *chan, struct 
sk_buff *skb)
  */
 static void pppol2tp_session_close(struct l2tp_session *session)
 {
-   struct sock *sk;
-
-   BUG_ON(session->magic != L2TP_SESSION_MAGIC);
-
-   sk = pppol2tp_session_get_sock(session);
-   if (sk) {
-   if (sk->sk_socket)
-   inet_shutdown(sk->sk_socket, SEND_SHUTDOWN);
-   sock_put(sk);
-   }
 }
 
 /* Really kill the session socket. (Called from sock_put() if
-- 
1.9.1



[PATCH net 5/5] l2tp: fix tunnel lookup use-after-free race

2018-02-23 Thread James Chapman
l2tp_tunnel_get walks the tunnel list to find a matching tunnel
instance and if a match is found, its refcount is increased before
returning the tunnel pointer. But when tunnel objects are destroyed,
they are still on the tunnel list after their refcount hits zero. Fix
this by moving the code that removes the tunnel from the tunnel list
from the tunnel socket destructor into the l2tp_tunnel_delete path,
before the tunnel refcount is decremented.

refcount_t: increment on 0; use-after-free.
WARNING: CPU: 3 PID: 13507 at lib/refcount.c:153 refcount_inc+0x47/0x50
Modules linked in:
CPU: 3 PID: 13507 Comm: syzbot_6e6a5ec8 Not tainted 4.16.0-rc2+ #36
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
RIP: 0010:refcount_inc+0x47/0x50
RSP: 0018:8800136ffb20 EFLAGS: 00010286
RAX: dc08 RBX: 880017068e68 RCX: 814d
RDX:  RSI: 88001a59f6d8 RDI: 88001a59f6d8
RBP: 8800136ffb28 R08:  R09: 
R10: 8800136ffab0 R11:  R12: 880017068e50
R13:  R14: 8800174da800 R15: 0004
FS:  7f403ab1e700() GS:88001a58() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 205fafd2 CR3: 1677 CR4: 06e0
Call Trace:
 l2tp_tunnel_get+0x2dd/0x4e0
 pppol2tp_connect+0x428/0x13c0
 ? pppol2tp_session_create+0x170/0x170
 ? __might_fault+0x115/0x1d0
 ? lock_downgrade+0x860/0x860
 ? __might_fault+0xe5/0x1d0
 ? security_socket_connect+0x8e/0xc0
 SYSC_connect+0x1b6/0x310
 ? SYSC_bind+0x280/0x280
 ? __do_page_fault+0x5d1/0xca0
 ? up_read+0x1f/0x40
 ? __do_page_fault+0x3c8/0xca0
 SyS_connect+0x29/0x30
 ? SyS_accept+0x40/0x40
 do_syscall_64+0x1e0/0x730
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7f403a42f259
RSP: 002b:7f403ab1dee8 EFLAGS: 0296 ORIG_RAX: 002a
RAX: ffda RBX: 205fafe4 RCX: 7f403a42f259
RDX: 002e RSI: 205fafd2 RDI: 0004
RBP: 7f403ab1df20 R08: 7f403ab1e700 R09: 
R10: 7f403ab1e700 R11: 0296 R12: 
R13: 7ffc81906cbf R14:  R15: 7f403ab2b040
Code: 3b ff 5b 5d c3 e8 ca 5f 3b ff 80 3d 49 8e 66 04 00 75 ea e8 bc 5f 3b ff 
48 c7 c7 60 69 64 85 c6 05 34 8e 66 04 01 e8 59 49 15 ff <0f> 0b eb ce 0f 1f 44 
00 00 55 48 89 e5 41 56 41 55 41 54 53 49

Fixes: f8ccac0e44934 ("l2tp: put tunnel socket release on a workqueue")
Reported-and-tested-by: syzbot+19c09769f14b48810...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+347bd5acde002e353...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+6e6a5ec8de31a94cd...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+9df43faf09bd400f2...@syzkaller.appspotmail.com
Signed-off-by: James Chapman 
---
 net/l2tp/l2tp_core.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 0fa53ead24aa..83421c6f0bef 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1164,7 +1164,6 @@ int l2tp_xmit_skb(struct l2tp_session *session, struct 
sk_buff *skb, int hdr_len
 static void l2tp_tunnel_destruct(struct sock *sk)
 {
struct l2tp_tunnel *tunnel = l2tp_tunnel(sk);
-   struct l2tp_net *pn;
 
if (tunnel == NULL)
goto end;
@@ -1187,12 +1186,6 @@ static void l2tp_tunnel_destruct(struct sock *sk)
sk->sk_destruct = tunnel->old_sk_destruct;
sk->sk_user_data = NULL;
 
-   /* Remove the tunnel struct from the tunnel list */
-   pn = l2tp_pernet(tunnel->l2tp_net);
-   spin_lock_bh(&pn->l2tp_tunnel_list_lock);
-   list_del_rcu(&tunnel->list);
-   spin_unlock_bh(&pn->l2tp_tunnel_list_lock);
-
/* Call the original destructor */
if (sk->sk_destruct)
(*sk->sk_destruct)(sk);
@@ -1271,6 +1264,7 @@ static void l2tp_tunnel_del_work(struct work_struct *work)
  del_work);
struct sock *sk = tunnel->sock;
struct socket *sock = sk->sk_socket;
+   struct l2tp_net *pn;
 
l2tp_tunnel_closeall(tunnel);
 
@@ -1284,6 +1278,12 @@ static void l2tp_tunnel_del_work(struct work_struct 
*work)
}
}
 
+   /* Remove the tunnel struct from the tunnel list */
+   pn = l2tp_pernet(tunnel->l2tp_net);
+   spin_lock_bh(&pn->l2tp_tunnel_list_lock);
+   list_del_rcu(&tunnel->list);
+   spin_unlock_bh(&pn->l2tp_tunnel_list_lock);
+
/* drop initial ref */
l2tp_tunnel_dec_refcount(tunnel);
 
-- 
1.9.1



[PATCH net 1/5] l2tp: don't use inet_shutdown on tunnel destroy

2018-02-23 Thread James Chapman
Previously, if a tunnel was closed, we called inet_shutdown to mark
the socket as unconnected such that userspace would get errors and
then close the socket. This could race with userspace closing the
socket. Instead, leave userspace to close the socket in its own time
(our tunnel will be detached anyway).

BUG: unable to handle kernel NULL pointer dereference at 00a0
IP: __lock_acquire+0x263/0x1630
PGD 0 P4D 0
Oops:  [#1] SMP KASAN
Modules linked in:
CPU: 2 PID: 42 Comm: kworker/u8:2 Not tainted 4.15.0-rc7+ #129
Workqueue: l2tp l2tp_tunnel_del_work
RIP: 0010:__lock_acquire+0x263/0x1630
RSP: 0018:88001a37fc70 EFLAGS: 00010002
RAX: 0001 RBX: 0088 RCX: 
RDX:  RSI:  RDI: 
RBP: 88001a37fd18 R08: 0001 R09: 
R10:  R11: 76fd R12: 00a0
R13: 88001a3722c0 R14: 0001 R15: 
FS:  () GS:88001ad0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 00a0 CR3: 1730b000 CR4: 06e0
Call Trace:
 ? __lock_acquire+0xc77/0x1630
 ? console_trylock+0x11/0xa0
 lock_acquire+0x117/0x230
 ? lock_sock_nested+0x3a/0xa0
 _raw_spin_lock_bh+0x3a/0x50
 ? lock_sock_nested+0x3a/0xa0
 lock_sock_nested+0x3a/0xa0
 inet_shutdown+0x33/0xf0
 l2tp_tunnel_del_work+0x60/0xef
 process_one_work+0x1ea/0x5f0
 ? process_one_work+0x162/0x5f0
 worker_thread+0x48/0x3e0
 ? trace_hardirqs_on+0xd/0x10
 kthread+0x108/0x140
 ? process_one_work+0x5f0/0x5f0
 ? kthread_stop+0x2a0/0x2a0
 ret_from_fork+0x24/0x30
Code: 00 41 81 ff ff 1f 00 00 0f 87 7a 13 00 00 45 85 f6 49 8b 85
68 08 00 00 0f 84 ae 03 00 00 c7 44 24 18 00 00 00 00 e9 f0 00 00 00 <49> 81 3c
24 80 93 3f 83 b8 00 00 00 00 44 0f 44 c0 83 fe 01 0f
RIP: __lock_acquire+0x263/0x1630 RSP: 88001a37fc70
CR2: 00a0

Fixes: 309795f4bec2d ("l2tp: Add netlink control API for L2TP")
Signed-off-by: James Chapman 
---
 net/l2tp/l2tp_core.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 194a7483bb93..9cd2a99d0752 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1327,17 +1327,10 @@ static void l2tp_tunnel_del_work(struct work_struct 
*work)
 
sock = sk->sk_socket;
 
-   /* If the tunnel socket was created by userspace, then go through the
-* inet layer to shut the socket down, and let userspace close it.
-* Otherwise, if we created the socket directly within the kernel, use
+   /* If the tunnel socket was created within the kernel, use
 * the sk API to release it here.
-* In either case the tunnel resources are freed in the socket
-* destructor when the tunnel socket goes away.
 */
-   if (tunnel->fd >= 0) {
-   if (sock)
-   inet_shutdown(sock, 2);
-   } else {
+   if (tunnel->fd < 0) {
if (sock) {
kernel_sock_shutdown(sock, SHUT_RDWR);
sock_release(sock);
-- 
1.9.1



[PATCH net 0/5] l2tp: fix API races discovered by syzbot

2018-02-23 Thread James Chapman
This patch series addresses several races with L2TP APIs discovered by
syzbot. There are no functional changes.

The set of patches 1-5 in combination fix the following syzbot reports.

19c09769f WARNING in debug_print_object
347bd5acd KASAN: use-after-free Read in inet_shutdown
6e6a5ec8d general protection fault in pppol2tp_connect
9df43faf0 KASAN: use-after-free Read in pppol2tp_connect

My first attempts to fix these issues were as net-next patches but
the series included other refactoring and cleanup work. I was asked to
separate out the bugfixes and redo for the net tree, which is what
these patches are.

The changes are:

 1. Fix inet_shutdown races when L2TP tunnels and sessions close. (patches 1-2)
 2. Fix races with tunnel and its socket. (patch 3)
 3. Fix race in pppol2tp_release with session and its socket. (patch 4)
 4. Fix tunnel lookup use-after-free. (patch 5)

All of the syzbot reproducers hit races in the tunnel and pppol2tp
session create and destroy paths. These tests create and destroy
pppol2tp tunnels and sessions rapidly using multiple threads,
provoking races in several tunnel/session create/destroy paths. The
key problem was that each tunnel/session socket could be destroyed
while its associated tunnel/session object still existed (patches 3,
4). Patch 5 addresses a problem with the way tunnels are removed from
the tunnel list. Patch 5 is tagged as addressing all four syzbot
issues, though all 5 patches are needed.

James Chapman (5):
  l2tp: don't use inet_shutdown on tunnel destroy
  l2tp: don't use inet_shutdown on ppp session destroy
  l2tp: fix races with tunnel socket close
  l2tp: fix race in pppol2tp_release with session object destroy
  l2tp: fix tunnel lookup use-after-free race

 net/l2tp/l2tp_core.c | 142 ---
 net/l2tp/l2tp_core.h |  23 +
 net/l2tp/l2tp_ip.c   |  10 ++--
 net/l2tp/l2tp_ip6.c  |   8 ++-
 net/l2tp/l2tp_ppp.c  |  60 ++
 5 files changed, 77 insertions(+), 166 deletions(-)

-- 



[PATCH net 3/5] l2tp: fix races with tunnel socket close

2018-02-23 Thread James Chapman
The tunnel socket tunnel->sock (struct sock) is accessed when
preparing a new ppp session on a tunnel at pppol2tp_session_init. If
the socket is closed by a thread while another is creating a new
session, the threads race. In pppol2tp_connect, the tunnel object may
be created if the pppol2tp socket is associated with the special
session_id 0 and the tunnel socket is looked up using the provided
fd. When handling this, pppol2tp_connect cannot sock_hold the tunnel
socket to prevent it being destroyed during pppol2tp_connect since
this may itself race with the socket being destroyed. Doing
sockfd_lookup in pppol2tp_connect isn't sufficient to prevent
tunnel->sock going away either because a given tunnel socket fd may be
reused between calls to pppol2tp_connect. Instead, have
l2tp_tunnel_create sock_hold the tunnel socket before it does
sockfd_put. This ensures that the tunnel's socket is always extant
while the tunnel object exists. Hold a ref on the socket until the
tunnel is destroyed and ensure that all tunnel destroy paths go
through a common function (l2tp_tunnel_delete) since this will do the
final sock_put to release the tunnel socket.

Since the tunnel's socket is now guaranteed to exist if the tunnel
exists, we no longer need to use sockfd_lookup via l2tp_sock_to_tunnel
to derive the tunnel from the socket since this is always
sk_user_data.

Also, sessions no longer sock_hold the tunnel socket since sessions
already hold a tunnel ref and the tunnel sock will not be freed until
the tunnel is freed. Removing these sock_holds in
l2tp_session_register avoids a possible sock leak in the
pppol2tp_connect error path if l2tp_session_register succeeds but
attaching a ppp channel fails. The pppol2tp_connect error path could
have been fixed instead and have the sock ref dropped when the session
is freed, but doing a sock_put of the tunnel socket when the session
is freed would require a new session_free callback. It is simpler to
just remove the sock_hold of the tunnel socket in
l2tp_session_register, now that the tunnel socket lifetime is
guaranteed.

Finally, some init code in l2tp_tunnel_create is reordered to ensure
that the new tunnel object's refcount is set and the tunnel socket ref
is taken before the tunnel socket destructor callbacks are set.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 4360 Comm: syzbot_19c09769 Not tainted 4.16.0-rc2+ #34
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
RIP: 0010:pppol2tp_session_init+0x1d6/0x500
RSP: 0018:88001377fb40 EFLAGS: 00010212
RAX: dc00 RBX: 88001636a940 RCX: 84836c1d
RDX: 0045 RSI: 55976744 RDI: 0228
RBP: 88001377fb60 R08: 84836bc8 R09: 0002
R10: 88001377fab8 R11: 0001 R12: 
R13: 88001636aac8 R14: 8800160f81c0 R15: 1100026eff76
FS:  7ffb3ea66700() GS:88001a40() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 20e77000 CR3: 16261000 CR4: 06f0
Call Trace:
 pppol2tp_connect+0xd18/0x13c0
 ? pppol2tp_session_create+0x170/0x170
 ? __might_fault+0x115/0x1d0
 ? lock_downgrade+0x860/0x860
 ? __might_fault+0xe5/0x1d0
 ? security_socket_connect+0x8e/0xc0
 SYSC_connect+0x1b6/0x310
 ? SYSC_bind+0x280/0x280
 ? __do_page_fault+0x5d1/0xca0
 ? up_read+0x1f/0x40
 ? __do_page_fault+0x3c8/0xca0
 SyS_connect+0x29/0x30
 ? SyS_accept+0x40/0x40
 do_syscall_64+0x1e0/0x730
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7ffb3e376259
RSP: 002b:7ffeda4f6508 EFLAGS: 0202 ORIG_RAX: 002a
RAX: ffda RBX: 20e77012 RCX: 7ffb3e376259
RDX: 002e RSI: 20e77000 RDI: 0004
RBP: 7ffeda4f6540 R08:  R09: 
R10:  R11: 0202 R12: 00400b60
R13: 7ffeda4f6660 R14:  R15: 
Code: 80 3d b0 ff 06 02 00 0f 84 07 02 00 00 e8 13 d6 db fc 49 8d bc 24 28 02 
00 00 48 b8 00 00 00 00 00 fc ff df 48 89 f
a 48 c1 ea 03 <80> 3c 02 00 0f 85 ed 02 00 00 4d 8b a4 24 28 02 00 00 e8 13 16

Fixes: 80d84ef3ff1dd ("l2tp: prevent l2tp_tunnel_delete racing with userspace close")
Signed-off-by: James Chapman 
---
 net/l2tp/l2tp_core.c | 117 +++
 net/l2tp/l2tp_core.h |  23 +-
 net/l2tp/l2tp_ip.c   |  10 ++---
 net/l2tp/l2tp_ip6.c  |   8 ++--
 4 files changed, 42 insertions(+), 116 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 9cd2a99d0752..0fa53ead24aa 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -136,51 +136,6 @@ static inline struct l2tp_net *l2tp_pernet(const struct 
net *net)
 
 }
 
-/* Lookup the tunnel socket, possibly involving the 

[PATCH net 4/5] l2tp: fix race in pppol2tp_release with session object destroy

2018-02-23 Thread James Chapman
pppol2tp_release uses call_rcu to put the final ref on its socket. But
the session object doesn't hold a ref on the session socket so may be
freed while the pppol2tp_put_sk RCU callback is scheduled. Fix this by
having the session hold a ref on its socket until the session is
destroyed. It is this ref that is dropped via call_rcu.

Sessions are also deleted via l2tp_tunnel_closeall. This must now also put
the final ref via call_rcu. So move the call_rcu call site into
pppol2tp_session_close so that this happens in both destroy paths. A
common destroy path should really be implemented, perhaps with
l2tp_tunnel_closeall calling l2tp_session_delete like pppol2tp_release
does, but this will be looked at later.

ODEBUG: activate active (active state 1) object type: rcu_head hint:   
(null)
WARNING: CPU: 3 PID: 13407 at lib/debugobjects.c:291 
debug_print_object+0x166/0x220
Modules linked in:
CPU: 3 PID: 13407 Comm: syzbot_19c09769 Not tainted 4.16.0-rc2+ #38
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
RIP: 0010:debug_print_object+0x166/0x220
RSP: 0018:880013647a00 EFLAGS: 00010082
RAX: dc08 RBX: 0003 RCX: 814d
RDX:  RSI: 0001 RDI: 88001a59f6d0
RBP: 880013647a40 R08:  R09: 0001
R10: 8800136479a8 R11:  R12: 0001
R13: 86161420 R14: 85648b60 R15: 
FS:  () GS:88001a58() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 20e77000 CR3: 06022000 CR4: 06e0
Call Trace:
 debug_object_activate+0x38b/0x530
 ? debug_object_assert_init+0x3b0/0x3b0
 ? __mutex_unlock_slowpath+0x85/0x8b0
 ? pppol2tp_session_destruct+0x110/0x110
 __call_rcu.constprop.66+0x39/0x890
 ? __call_rcu.constprop.66+0x39/0x890
 call_rcu_sched+0x17/0x20
 pppol2tp_release+0x2c7/0x440
 ? fcntl_setlk+0xca0/0xca0
 ? sock_alloc_file+0x340/0x340
 sock_release+0x92/0x1e0
 sock_close+0x1b/0x20
 __fput+0x296/0x6e0
 fput+0x1a/0x20
 task_work_run+0x127/0x1a0
 do_exit+0x7f9/0x2ce0
 ? SYSC_connect+0x212/0x310
 ? mm_update_next_owner+0x690/0x690
 ? up_read+0x1f/0x40
 ? __do_page_fault+0x3c8/0xca0
 do_group_exit+0x10d/0x330
 ? do_group_exit+0x330/0x330
 SyS_exit_group+0x22/0x30
 do_syscall_64+0x1e0/0x730
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7f362e471259
RSP: 002b:7ffe389abe08 EFLAGS: 0202 ORIG_RAX: 00e7
RAX: ffda RBX:  RCX: 7f362e471259
RDX: 7f362e471259 RSI: 002e RDI: 
RBP: 7ffe389abe30 R08:  R09: 7f362e944270
R10:  R11: 0202 R12: 00400b60
R13: 7ffe389abf50 R14:  R15: 
Code: 8d 3c dd a0 8f 64 85 48 89 fa 48 c1 ea 03 80 3c 02 00 75 7b 48 8b 14 dd 
a0 8f 64 85 4c 89 f6 48 c7 c7 20 85 64 85 e
8 2a 55 14 ff <0f> 0b 83 05 ad 2a 68 04 01 48 83 c4 18 5b 41 5c 41 5d 41 5e 41

Fixes: ee40fb2e1eb5b ("l2tp: protect sock pointer of struct pppol2tp_session with RCU")
Signed-off-by: James Chapman 
---
 net/l2tp/l2tp_ppp.c | 52 +++-
 1 file changed, 27 insertions(+), 25 deletions(-)

diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index 2d2955e8f710..3b02f24ea9ec 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -416,10 +416,28 @@ static int pppol2tp_xmit(struct ppp_channel *chan, struct 
sk_buff *skb)
  * Session (and tunnel control) socket create/destroy.
  */
 
+static void pppol2tp_put_sk(struct rcu_head *head)
+{
+   struct pppol2tp_session *ps;
+
+   ps = container_of(head, typeof(*ps), rcu);
+   sock_put(ps->__sk);
+}
+
 /* Called by l2tp_core when a session socket is being closed.
  */
 static void pppol2tp_session_close(struct l2tp_session *session)
 {
+   struct pppol2tp_session *ps;
+
+   ps = l2tp_session_priv(session);
+   mutex_lock(&ps->sk_lock);
+   ps->__sk = rcu_dereference_protected(ps->sk,
+lockdep_is_held(&ps->sk_lock));
+   RCU_INIT_POINTER(ps->sk, NULL);
+   if (ps->__sk)
+   call_rcu(&ps->rcu, pppol2tp_put_sk);
+   mutex_unlock(&ps->sk_lock);
 }
 
 /* Really kill the session socket. (Called from sock_put() if
@@ -439,14 +457,6 @@ static void pppol2tp_session_destruct(struct sock *sk)
}
 }
 
-static void pppol2tp_put_sk(struct rcu_head *head)
-{
-   struct pppol2tp_session *ps;
-
-   ps = container_of(head, typeof(*ps), rcu);
-   sock_put(ps->__sk);
-}
-
 /* Called when the PPPoX socket (session) is closed.
  */
 static int pppol2tp_release(struct socket *sock)
@@ -470,26 +480,17 @@ static int pppol2tp_release(struct socket *sock)
sock_orphan(sk);
sock->sk = NULL;
 

Re: [PATCH bpf-next] bpf: NULL pointer check is not needed in BPF_CGROUP_RUN_PROG_INET_SOCK

2018-02-23 Thread David Miller
From: Yafang Shao 
Date: Fri, 23 Feb 2018 14:58:41 +0800

> sk is already allocated in inet_create/inet6_create, hence when
> BPF_CGROUP_RUN_PROG_INET_SOCK is executed sk will never be NULL.
> 
> The logic is as below,
>   sk = sk_alloc();
>   if (!sk)
>   goto out;
>   BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
> 
> Signed-off-by: Yafang Shao 

Acked-by: David S. Miller 


Re: [PATCH] atm: idt77252: remove redundant bit-wise or'ing of zero

2018-02-23 Thread David Miller
From: Colin King 
Date: Fri, 23 Feb 2018 12:22:52 +

> From: Colin Ian King 
> 
> Zero is being bit-wise or'd in a calculation twice; these are redundant
> and can be removed.
> 
> Signed-off-by: Colin Ian King 

Applied.


Re: [PATCH net-next] cxgb4vf: Forcefully link up virtual interfaces

2018-02-23 Thread David Miller
From: Ganesh Goudar 
Date: Fri, 23 Feb 2018 20:36:53 +0530

> @@ -92,6 +92,18 @@ module_param(msi, int, 0644);
>  MODULE_PARM_DESC(msi, "whether to use MSI-X or MSI");
>  
>  /*
> + * Logic controls.
> + * ===
> + */
> +
> +/* The Virtual Interfaces are connected to an internal switch on the chip
> + * which allows VIs attached to the same port to talk to each other even when
> + * the port link is down.  As a result, we generally want to always report a
> + * VI's link as being "up".
> + */
> +#define force_link_up 1

Please don't do this, just perform the actions unconditionally.


Re: [PATCH v2 net] net_sched: gen_estimator: fix broken estimators based on percpu stats

2018-02-23 Thread David Miller
From: Eric Dumazet 
Date: Thu, 22 Feb 2018 19:45:27 -0800

> From: Eric Dumazet 
> 
> pfifo_fast got percpu stats lately, uncovering a bug I introduced last
> year in linux-4.10.
> 
> I missed the fact that we have to clear our temporary storage
> before calling __gnet_stats_copy_basic() in the case of percpu stats.
> 
> Without this fix, rate estimators (tc qd replace dev xxx root est 1sec
> 4sec pfifo_fast) are utterly broken.
> 
> Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators")
> Signed-off-by: Eric Dumazet 
> ---
> v2: Perform the zeroing in est_fetch_counters() instead of caller(s)

Applied and queued up for -stable.


[RFC PATCH bpf-next 12/12] bpf/verifier: update selftests

2018-02-23 Thread Edward Cree
Slight change to message when execution can run off the end of the program.

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index fda35a5a0ff9..a54e50a887dc 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -9758,7 +9758,7 @@ static struct bpf_test tests[] = {
BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, -2),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "jump out of range from insn 5 to 6",
+   .errstr = "no exit/jump at end of program (insn 5)",
.result = REJECT,
},
{


[RFC PATCH bpf-next 11/12] bpf/verifier: better detection of statically unreachable code

2018-02-23 Thread Edward Cree
My previous version just checked that every insn could be jumped to from
 somewhere, which meant that the program could contain multiple connected
 components.  Instead let's do a flood-fill of the insns set to ensure all
 insns are actually reachable from the entry point.  Since this involves
 essentially the same "where does the jump go" calculations as placing
 state list marks, do both in the same pass.

Signed-off-by: Edward Cree 
---
 kernel/bpf/verifier.c | 149 +-
 1 file changed, 88 insertions(+), 61 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cc8aaf73b4a2..d9d851d8c84e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3813,83 +3813,110 @@ static int check_return_code(struct bpf_verifier_env 
*env)
 
 #define STATE_LIST_MARK ((struct bpf_verifier_state_list *) -1L)
 
-static int mark_jump_target(struct bpf_verifier_env *env, int from, int off)
-{
-   int to = from + off + 1;
-
-   if (to < 0 || to >= env->prog->len) {
-   verbose(env, "jump out of range from insn %d to %d\n", from, 
to);
-   return -EINVAL;
-   }
-   /* mark branch target for state pruning */
-   env->explored_states[to] = STATE_LIST_MARK;
-   return 0;
-}
-
 /* Mark jump/branch targets for control flow tracking & state pruning */
 static int prepare_cfg_marks(struct bpf_verifier_env *env)
 {
struct bpf_insn *insns = env->prog->insnsi;
int insn_cnt = env->prog->len, i, err = 0;
+   unsigned long *frontier, *visited;
 
-   for (i = 0; i < insn_cnt; i++) {
-   u8 opcode = BPF_OP(insns[i].code);
+   frontier = kcalloc(BITS_TO_LONGS(insn_cnt), sizeof(unsigned long),
+  GFP_KERNEL);
+   if (!frontier)
+   return -ENOMEM;
+   visited = kcalloc(BITS_TO_LONGS(insn_cnt), sizeof(unsigned long),
+ GFP_KERNEL);
+   if (!visited) {
+   err = -ENOMEM;
+   goto out_frontier;
+   }
 
+   /* insn 0 is our start node */
+   __set_bit(0, frontier);
+
+   /* flood fill to find all reachable insns and mark jump targets */
+   while (true) {
+   bool fallthrough = false, jump = false;
+   int target;
+   u8 opcode;
+
+   /* select a frontier node */
+   i = find_first_bit(frontier, insn_cnt);
+   if (i >= insn_cnt) /* frontier is empty, done */
+   break;
+   /* remove it from the frontier */
+   __clear_bit(i, frontier);
+   /* mark it as visited */
+   __set_bit(i, visited);
+   opcode = BPF_OP(insns[i].code);
if (BPF_CLASS(insns[i].code) != BPF_JMP) {
-   if (i + 1 == insn_cnt) {
+   fallthrough = true;
+   } else {
+   switch (opcode) {
+   case BPF_EXIT:
+   break;
+   case BPF_CALL:
+   /* following insn is target of function return 
*/
+   fallthrough = true;
+   /* subprog call insns 'branch both ways', as the
+* subprog will eventually exit
+*/
+   if (insns[i].src_reg == BPF_PSEUDO_CALL) {
+   jump = true;
+   target = i + insns[i].imm + 1;
+   }
+   break;
+   case BPF_JA:
+   /* unconditional jump; mark target */
+   jump = true;
+   target = i + insns[i].off + 1;
+   break;
+   default:
+   /* conditional jump; branch both ways */
+   fallthrough = true;
+   jump = true;
+   target = i + insns[i].off + 1;
+   break;
+   }
+   }
+   /* if targets are unvisited, add them to the frontier */
+   if (fallthrough) {
+   if (i + 1 >= insn_cnt) {
verbose(env, "no exit/jump at end of program 
(insn %d)\n",
i);
-   return -EINVAL;
+   err = -EINVAL;
+   goto out_visited;
}
-   continue;
+   if (!test_bit(i + 1, visited))
+   __set_bit(i + 1, frontier);
}
-   switch (opcode) {
-   case BPF_EXIT:
- 

[RFC PATCH bpf-next 10/12] bpf/verifier: parent state pointer is not per-frame

2018-02-23 Thread Edward Cree
Computing min_frame in find_loop and using it to detect recursion means we
 don't need to play games with per-frame parent pointers, and can instead
 have a single parent pointer in the verifier_state.

Signed-off-by: Edward Cree 
---
 include/linux/bpf_verifier.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index ee034232fbd6..0320df10555b 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -121,12 +121,6 @@ struct bpf_func_state {
 */
u32 subprogno;
 
-   /* loop detection; points into an explored_state */
-   struct bpf_func_state *parent;
-   /* These flags are only meaningful in an explored_state, not cur_state 
*/
-   bool bounded_loop, conditional;
-   int live_children;
-
/* should be second to last. See copy_func_state() */
int allocated_stack;
struct bpf_stack_state *stack;
@@ -137,6 +131,12 @@ struct bpf_verifier_state {
/* call stack tracking */
struct bpf_func_state *frame[MAX_CALL_FRAMES];
u32 curframe;
+
+   /* loop detection; points into an explored_state */
+   struct bpf_verifier_state *parent;
+   /* These flags are only meaningful in an explored_state, not cur_state 
*/
+   bool bounded_loop, conditional;
+   int live_children;
 };
 
 /* linked list of verifier states used to prune search */



Re: [PATCH][next] xen-netback: make function xenvif_rx_skb static

2018-02-23 Thread Wei Liu
On Fri, Feb 23, 2018 at 05:16:57PM +, Colin King wrote:
> From: Colin Ian King 
> 
> The function xenvif_rx_skb is local to the source and does not need
> to be in global scope, so make it static.
> 
> Cleans up sparse warning:
> drivers/net/xen-netback/rx.c:422:6: warning: symbol 'xenvif_rx_skb'
> was not declared. Should it be static?
> 
> Signed-off-by: Colin Ian King 

Acked-by: Wei Liu 

Thanks


[RFC PATCH bpf-next 09/12] bpf/verifier: count still-live children of explored_states

2018-02-23 Thread Edward Cree
By reference-counting how many children of an explored_state are still
 being walked, we can avoid pruning based on a state that's in our own
 history (and thus hasn't reached an exit yet) without a persistent mark
 that prevents other, later branches from being pruned against it when
 it _has_ reached an exit.
Includes a check at free_states() time to ensure that all the reference
 counts have fallen to zero.

Signed-off-by: Edward Cree 
---
 include/linux/bpf_verifier.h |   3 +-
 kernel/bpf/verifier.c| 109 ---
 2 files changed, 74 insertions(+), 38 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 6abd484391f4..ee034232fbd6 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -124,7 +124,8 @@ struct bpf_func_state {
/* loop detection; points into an explored_state */
struct bpf_func_state *parent;
/* These flags are only meaningful in an explored_state, not cur_state */
-   bool in_loop, bounded_loop, conditional;
+   bool bounded_loop, conditional;
+   int live_children;
 
/* should be second to last. See copy_func_state() */
int allocated_stack;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9828cb0cde73..cc8aaf73b4a2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -398,6 +398,12 @@ static void free_verifier_state(struct bpf_verifier_state *state,
free_func_state(state->frame[i]);
state->frame[i] = NULL;
}
+   /* Check that the live_children accounting is correct */
+   if (state->live_children)
+   pr_warn("Leaked live_children=%d at insn %d, frame %d\n",
+   state->live_children,
+   state->frame[state->curframe]->insn_idx,
+   state->curframe);
if (free_self)
kfree(state);
 }
@@ -429,6 +435,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
dst_state->frame[i] = NULL;
}
dst_state->curframe = src->curframe;
+   dst_state->parent = src->parent;
 
for (i = 0; i <= src->curframe; i++) {
dst = dst_state->frame[i];
@@ -445,6 +452,15 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
return 0;
 }
 
+/* Mark this thread as having reached an exit */
+static void kill_thread(struct bpf_verifier_state *state)
+{
+   struct bpf_verifier_state *cur = state->parent;
+
+   while (cur && !--cur->live_children)
+   cur = cur->parent;
+}
+
 static int pop_stack(struct bpf_verifier_env *env, int *insn_idx)
 {
struct bpf_verifier_state *cur = env->cur_state;
@@ -458,6 +474,8 @@ static int pop_stack(struct bpf_verifier_env *env, int *insn_idx)
err = copy_verifier_state(cur, &head->st);
if (err)
return err;
+   } else {
+   kill_thread(&head->st);
}
if (insn_idx)
*insn_idx = head->insn_idx;
@@ -479,6 +497,7 @@ static int unpush_stack(struct bpf_verifier_env *env)
return -ENOENT;
 
elem = head->next;
+   kill_thread(&head->st);
free_verifier_state(&head->st, false);
kfree(head);
env->head = elem;
@@ -509,6 +528,8 @@ static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env,
verbose(env, "BPF program is too complex\n");
goto err;
}
+   if (elem->st.parent)
+   elem->st.parent->live_children++;
return &elem->st;
 err:
free_verifier_state(env->cur_state, true);
@@ -728,11 +749,9 @@ static void init_reg_state(struct bpf_verifier_env *env,
 
 static void init_func_state(struct bpf_verifier_env *env,
struct bpf_func_state *state,
-   struct bpf_func_state *parent,
int entry, int frameno, int subprogno)
 {
state->insn_idx = entry;
-   state->parent = parent;
state->frameno = frameno;
state->subprogno = subprogno;
init_reg_state(env, state);
@@ -2111,7 +2130,6 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 * callee can read/write into caller's stack
 */
init_func_state(env, callee,
-   caller->parent /* parent state for loop detection */,
target /* entry point */,
state->curframe + 1 /* frameno within this callchain */,
subprog /* subprog number within this prog */);
@@ -4207,14 +4225,20 @@ static int propagate_liveness(struct bpf_verifier_env *env,
return err;
 }
 
-static struct bpf_func_state *find_loop(struct bpf_verifier_env *env,
-   bool *saw_cond, bool *saw_bound)
+static struct bpf_veri

[RFC PATCH bpf-next 08/12] bpf/verifier: selftests for bounded loops

2018-02-23 Thread Edward Cree
Mainly consists of tests that broke (or I expected to break) earlier
 versions of the bounded-loop handling.
Also updated some existing tests to deal with changed error messages,
 programs now being accepted etc.

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_verifier.c | 198 +++-
 1 file changed, 191 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 722a16b2e9c4..fda35a5a0ff9 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -9338,7 +9338,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "frames is too deep",
+   .errstr = "recursive call",
.result = REJECT,
},
{
@@ -9389,8 +9389,8 @@ static struct bpf_test tests[] = {
BPF_JMP_IMM(BPF_JA, 0, 0, -6),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "back-edge from insn",
-   .result = REJECT,
+   .result = ACCEPT,
+   .retval = 1,
},
{
"calls: conditional call 4",
@@ -9424,8 +9424,8 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "back-edge from insn",
-   .result = REJECT,
+   .result = ACCEPT,
+   .retval = 1,
},
{
"calls: conditional call 6",
@@ -9666,7 +9666,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "frames is too deep",
+   .errstr = "recursive call",
.result = REJECT,
},
{
@@ -9678,7 +9678,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "frames is too deep",
+   .errstr = "recursive call",
.result = REJECT,
},
{
@@ -11135,6 +11135,190 @@ static struct bpf_test tests[] = {
.result = REJECT,
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
},
+   {
+   "bounded loop, count to 4",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 1),
+   BPF_JMP_IMM(BPF_JLT, BPF_REG_0, 4, -2),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+   .retval = 4,
+   },
+   {
+   "bounded loop, count to 20",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 1),
+   BPF_JMP_IMM(BPF_JLT, BPF_REG_0, 20, -2),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "r0 is increasing too slowly",
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+   },
+   {
+   "bounded loop, count from positive unknown to 4",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+   BPF_JMP_IMM(BPF_JSLT, BPF_REG_0, 0, 2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 1),
+   BPF_JMP_IMM(BPF_JLT, BPF_REG_0, 4, -2),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map1 = { 3 },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+   .retval = 4,
+   },
+   {
+   "bounded loop, count from totally unknown to 4",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG

[RFC PATCH bpf-next 05/12] bpf/verifier: detect loops dynamically rather than statically

2018-02-23 Thread Edward Cree
Add in a new chain of parent states, which does not cross function-call
 boundaries, and check whether our current insn_idx appears anywhere in
 the chain.  Since all jump targets have state-list marks (now placed
 by prepare_cfg_marks(), which replaces check_cfg()), it suffices to
 thread this chain through the existing explored_states and check it
 only in is_state_visited().
By using this parent-state chain to detect loops, we open up the
 possibility that in future we could determine a loop to be bounded and
 thus accept it.
A few selftests had to be changed to ensure that all the insns walked
 before reaching the back-edge are valid.

Signed-off-by: Edward Cree 
---
 include/linux/bpf_verifier.h|   2 +
 kernel/bpf/verifier.c   | 280 +---
 tools/testing/selftests/bpf/test_verifier.c |  12 +-
 3 files changed, 97 insertions(+), 197 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 0bc49c768585..24ddbc1ed3b2 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -120,6 +120,8 @@ struct bpf_func_state {
 * zero == main subprog
 */
u32 subprogno;
+   /* loop detection; points into an explored_state */
+   struct bpf_func_state *parent;
 
/* should be second to last. See copy_func_state() */
int allocated_stack;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7a08b5e8e071..e7d898f4fd29 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -429,6 +429,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
dst_state->frame[i] = NULL;
}
dst_state->curframe = src->curframe;
+
for (i = 0; i <= src->curframe; i++) {
dst = dst_state->frame[i];
if (!dst) {
@@ -716,6 +717,7 @@ static void init_func_state(struct bpf_verifier_env *env,
int entry, int frameno, int subprogno)
 {
state->insn_idx = entry;
+   state->parent = NULL;
state->frameno = frameno;
state->subprogno = subprogno;
init_reg_state(env, state);
@@ -3747,211 +3749,85 @@ static int check_return_code(struct bpf_verifier_env *env)
return 0;
 }
 
-/* non-recursive DFS pseudo code
- * 1  procedure DFS-iterative(G,v):
- * 2  label v as discovered
- * 3  let S be a stack
- * 4  S.push(v)
- * 5  while S is not empty
- * 6t <- S.pop()
- * 7if t is what we're looking for:
- * 8return t
- * 9for all edges e in G.adjacentEdges(t) do
- * 10   if edge e is already labelled
- * 11   continue with the next edge
- * 12   w <- G.adjacentVertex(t,e)
- * 13   if vertex w is not discovered and not explored
- * 14   label e as tree-edge
- * 15   label w as discovered
- * 16   S.push(w)
- * 17   continue at 5
- * 18   else if vertex w is discovered
- * 19   label e as back-edge
- * 20   else
- * 21   // vertex w is explored
- * 22   label e as forward- or cross-edge
- * 23   label t as explored
- * 24   S.pop()
- *
- * convention:
- * 0x10 - discovered
- * 0x11 - discovered and fall-through edge labelled
- * 0x12 - discovered and fall-through and branch edges labelled
- * 0x20 - explored
- */
-
-enum {
-   DISCOVERED = 0x10,
-   EXPLORED = 0x20,
-   FALLTHROUGH = 1,
-   BRANCH = 2,
-};
-
 #define STATE_LIST_MARK ((struct bpf_verifier_state_list *) -1L)
 
-static int *insn_stack;/* stack of insns to process */
-static int cur_stack;  /* current stack index */
-static int *insn_state;
-
-/* t, w, e - match pseudo-code above:
- * t - index of current instruction
- * w - next instruction
- * e - edge
- */
-static int push_insn(int t, int w, int e, struct bpf_verifier_env *env)
+static int mark_jump_target(struct bpf_verifier_env *env, int from, int off)
 {
-   if (e == FALLTHROUGH && insn_state[t] >= (DISCOVERED | FALLTHROUGH))
-   return 0;
-
-   if (e == BRANCH && insn_state[t] >= (DISCOVERED | BRANCH))
-   return 0;
+   int to = from + off + 1;
 
-   if (w < 0 || w >= env->prog->len) {
-   verbose(env, "jump out of range from insn %d to %d\n", t, w);
+   if (to < 0 || to >= env->prog->len) {
+   verbose(env, "jump out of range from insn %d to %d\n", from, to);
return -EINVAL;
}
-
-   if (e == BRANCH)
-   /* mark branch target for state pruning */
-   env->explored_states[w] = STATE_LIST_MARK;
-
-   if (insn_state[w] == 0) {
-   /* tree-edge */
-   insn_state[t] = DISCOVERED | e;
-   insn_state[w] = DISCOVERED;
-   if (cur_stack >= env->prog->len)
-   

[RFC PATCH bpf-next 07/12] bpf/verifier: allow bounded loops with JLT/true back-edge

2018-02-23 Thread Edward Cree
Where the register umin_value is increasing sufficiently fast, the loop
 will terminate after a reasonable number of iterations, so we can allow
 to keep walking it.
The semantics of prev_insn_idx are changed slightly: it now always refers
 to the last jump insn walked, even when that jump was not taken.  Also it
 is moved into env, alongside a new bool last_jump_taken that records
 whether the jump was taken or not.  This is needed so that, when we detect
 a loop, we can see how we ended up in the back-edge: what was the jump
 condition, and is it the true- or the false-branch that ends up looping?
We record in our explored_states whether they represent a conditional jump
 and whether that jump produced a good loop bound.  This allows us to make
 unconditional jumps "not responsible" for ensuring the loop is bounded, as
 long as there is a conditional jump somewhere in the loop body; whereas a
 conditional jump must ensure that there is a bounding conditional somewhere
 in the loop body.
Recursive calls have to remain forbidden because the calculation of maximum
 stack depth assumes the call graph is acyclic, even though the loop
 handling code could determine whether the recursion was bounded.

Signed-off-by: Edward Cree 
---
 include/linux/bpf_verifier.h |   5 +
 kernel/bpf/verifier.c| 295 +++
 2 files changed, 244 insertions(+), 56 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 24ddbc1ed3b2..6abd484391f4 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -120,8 +120,11 @@ struct bpf_func_state {
 * zero == main subprog
 */
u32 subprogno;
+
/* loop detection; points into an explored_state */
struct bpf_func_state *parent;
+   /* These flags are only meaningful in an explored_state, not cur_state */
+   bool in_loop, bounded_loop, conditional;
 
/* should be second to last. See copy_func_state() */
int allocated_stack;
@@ -197,6 +200,8 @@ struct bpf_verifier_env {
u32 id_gen; /* used to generate unique reg IDs */
bool allow_ptr_leaks;
bool seen_direct_write;
+   int prev_insn_idx;  /* last branch (BPF_JMP-class) insn */
+   bool last_jump_taken;   /* Was branch at prev_insn_idx taken? */
struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
struct bpf_verifer_log log;
struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS];
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7a8ae633d0c3..9828cb0cde73 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -445,8 +445,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
return 0;
 }
 
-static int pop_stack(struct bpf_verifier_env *env, int *prev_insn_idx,
-int *insn_idx)
+static int pop_stack(struct bpf_verifier_env *env, int *insn_idx)
 {
struct bpf_verifier_state *cur = env->cur_state;
struct bpf_verifier_stack_elem *elem, *head = env->head;
@@ -462,8 +461,7 @@ static int pop_stack(struct bpf_verifier_env *env, int *prev_insn_idx,
}
if (insn_idx)
*insn_idx = head->insn_idx;
-   if (prev_insn_idx)
-   *prev_insn_idx = head->prev_insn_idx;
+   env->prev_insn_idx = head->prev_insn_idx;
elem = head->next;
free_verifier_state(&head->st, false);
kfree(head);
@@ -516,7 +514,7 @@ static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env,
free_verifier_state(env->cur_state, true);
env->cur_state = NULL;
/* pop all elements and return */
-   while (!pop_stack(env, NULL, NULL));
+   while (!pop_stack(env, NULL));
return NULL;
 }
 
@@ -730,10 +728,11 @@ static void init_reg_state(struct bpf_verifier_env *env,
 
 static void init_func_state(struct bpf_verifier_env *env,
struct bpf_func_state *state,
+   struct bpf_func_state *parent,
int entry, int frameno, int subprogno)
 {
state->insn_idx = entry;
-   state->parent = NULL;
+   state->parent = parent;
state->frameno = frameno;
state->subprogno = subprogno;
init_reg_state(env, state);
@@ -2112,6 +2111,7 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 * callee can read/write into caller's stack
 */
init_func_state(env, callee,
+   caller->parent /* parent state for loop detection */,
target /* entry point */,
state->curframe + 1 /* frameno within this callchain */,
subprog /* subprog number within this prog */);
@@ -3480,6 +3480,12 @@ static bool reg_is_impossible(struct bpf_reg_state reg)
reg.smin_value > reg.smax_valu

[RFC PATCH bpf-next 06/12] bpf/verifier: do not walk impossible branches

2018-02-23 Thread Edward Cree
If a conditional branch would produce an inconsistent set of bounds (e.g.
 umin_value > umax_value) on one leg, then that leg cannot actually happen
 so we don't need to walk it.
This will be necessary for bounded-loop support, so that we stop walking round
 the loop once the termination condition is known to have been met.

Signed-off-by: Edward Cree 
---
 kernel/bpf/verifier.c | 53 ++-
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e7d898f4fd29..7a8ae633d0c3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -472,6 +472,22 @@ static int pop_stack(struct bpf_verifier_env *env, int *prev_insn_idx,
return 0;
 }
 
+/* Throw away top element of stack.  Like pop_stack but don't update cur_state */
+static int unpush_stack(struct bpf_verifier_env *env)
+{
+   struct bpf_verifier_stack_elem *elem, *head = env->head;
+
+   if (env->head == NULL)
+   return -ENOENT;
+
+   elem = head->next;
+   free_verifier_state(&head->st, false);
+   kfree(head);
+   env->head = elem;
+   env->stack_size--;
+   return 0;
+}
+
 static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env,
 int insn_idx, int prev_insn_idx)
 {
@@ -2376,8 +2392,10 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
if ((known && (smin_val != smax_val || umin_val != umax_val)) ||
smin_val > smax_val || umin_val > umax_val) {
/* Taint dst register if offset had invalid bounds derived from
-* e.g. dead branches.
+* e.g. dead branches.  This should be impossible now, since we
+* prune dead branches in check_cond_jmp_op().
 */
+   WARN_ON_ONCE(1);
__mark_reg_unknown(dst_reg);
return 0;
}
@@ -3455,6 +3473,13 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn,
return true;
 }
 
+static bool reg_is_impossible(struct bpf_reg_state reg)
+{
+   return reg.type == SCALAR_VALUE &&
+  (reg.umin_value > reg.umax_value ||
+   reg.smin_value > reg.smax_value);
+}
+
 static int check_cond_jmp_op(struct bpf_verifier_env *env,
 struct bpf_insn *insn, int *insn_idx)
 {
@@ -3574,6 +3599,32 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
}
if (env->log.level)
print_verifier_state(env, this_branch->frame[this_branch->curframe]);
+
+   if (reg_is_impossible(other_branch_regs[insn->src_reg]) ||
+   reg_is_impossible(other_branch_regs[insn->dst_reg])) {
+   /* if (false) goto pc+off;
+* only follow fall-through branch, since
+* that's where the program will go
+*/
+   verbose(env, "pruning impossible jump\n");
+   unpush_stack(env);
+   } else if (reg_is_impossible(regs[insn->src_reg]) ||
+  reg_is_impossible(regs[insn->dst_reg])) {
+   /* if (true) goto pc+off;
+* only follow the goto, ignore fall-through
+*/
+   verbose(env, "pruning impossible fall-through\n");
+   err = pop_stack(env, NULL, insn_idx);
+   if (err < 0) {
+   if (err != -ENOENT)
+   return err;
+   }
+   /* pushed goto included the +1, which caller will try to apply
+* so let's undo it here.
+*/
+   (*insn_idx)--;
+   }
+
return 0;
 }
 



[RFC PATCH bpf-next 04/12] bpf/verifier: store insn_idx instead of callsite in bpf_func_state

2018-02-23 Thread Edward Cree
This means each entry in the parentage chain can have its insn identified,
 which will help to support bounded-loop handling later.

Signed-off-by: Edward Cree 
---
 include/linux/bpf_verifier.h |  6 --
 kernel/bpf/verifier.c| 22 +++---
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 2b56be9dfb56..0bc49c768585 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -107,8 +107,10 @@ struct bpf_stack_state {
  */
 struct bpf_func_state {
struct bpf_reg_state regs[MAX_BPF_REG];
-   /* index of call instruction that called into this func */
-   int callsite;
+   /* index of last instruction processed in this func.  In frames other
+* than innermost, will be call insn
+*/
+   int insn_idx;
/* stack frame number of this function state from pov of
 * enclosing bpf_verifier_state.
 * 0 = main function, 1 = first callee.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index dfba9dbc5bfb..7a08b5e8e071 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -267,7 +267,7 @@ static void print_verifier_state(struct bpf_verifier_env *env,
/* reg->off should be 0 for SCALAR_VALUE */
verbose(env, "%lld", reg->var_off.value + reg->off);
if (t == PTR_TO_STACK)
-   verbose(env, ",call_%d", func(env, reg)->callsite);
+   verbose(env, ",frame_%u", reg->frameno);
} else {
verbose(env, "(id=%d", reg->id);
if (t != SCALAR_VALUE)
@@ -711,12 +711,11 @@ static void init_reg_state(struct bpf_verifier_env *env,
mark_reg_known_zero(env, regs, BPF_REG_1);
 }
 
-#define BPF_MAIN_FUNC (-1)
 static void init_func_state(struct bpf_verifier_env *env,
struct bpf_func_state *state,
-   int callsite, int frameno, int subprogno)
+   int entry, int frameno, int subprogno)
 {
-   state->callsite = callsite;
+   state->insn_idx = entry;
state->frameno = frameno;
state->subprogno = subprogno;
init_reg_state(env, state);
@@ -2095,8 +2094,7 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 * callee can read/write into caller's stack
 */
init_func_state(env, callee,
-   /* remember the callsite, it will be used by bpf_exit */
-   *insn_idx /* callsite */,
+   target /* entry point */,
state->curframe + 1 /* frameno within this callchain */,
subprog /* subprog number within this prog */);
 
@@ -2151,7 +2149,7 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
/* return to the caller whatever r0 had in the callee */
caller->regs[BPF_REG_0] = *r0;
 
-   *insn_idx = callee->callsite + 1;
+   *insn_idx = caller->insn_idx + 1;
if (env->log.level) {
verbose(env, "returning from callee:\n");
print_verifier_state(env, callee);
@@ -4232,7 +4230,7 @@ static bool states_equal(struct bpf_verifier_env *env,
 * and all frame states need to be equivalent
 */
for (i = 0; i <= old->curframe; i++) {
-   if (old->frame[i]->callsite != cur->frame[i]->callsite)
+   if (old->frame[i]->insn_idx != cur->frame[i]->insn_idx)
return false;
if (!func_states_equal(old->frame[i], cur->frame[i]))
return false;
@@ -4327,7 +4325,7 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 * technically the current state is not proven to be safe yet,
 * but it will either reach outer most bpf_exit (which means it's safe)
 * or it will be rejected. Since there are no loops, we won't be
-* seeing this tuple (frame[0].callsite, frame[1].callsite, .. insn_idx)
+* seeing this tuple (frame[0].insn_idx, frame[1].insn_idx, .. insn_idx)
 * again on the way to bpf_exit
 */
new_sl = kzalloc(sizeof(struct bpf_verifier_state_list), GFP_KERNEL);
@@ -4394,11 +4392,11 @@ static int do_check(struct bpf_verifier_env *env)
mainprogno = add_subprog(env, 0);
if (mainprogno < 0)
return mainprogno;
+   insn_idx = 0;
init_func_state(env, state->frame[0],
-   BPF_MAIN_FUNC /* callsite */,
+   insn_idx /* entry point */,
0 /* frameno */,
mainprogno /* subprogno */);
-   insn_idx = 0;
for (;;) {
struct bpf_insn_aux_data *aux = &env->insn_aux_data[insn_idx];
struct bpf_func_st

[RFC PATCH bpf-next 03/12] bpf/verifier: per-register parent pointers

2018-02-23 Thread Edward Cree
By giving each register its own liveness chain, we elide the skip_callee()
 logic.  Instead, each register's parent is the state it inherits from;
 both check_func_call() and prepare_func_exit() automatically connect
 reg states to the correct chain since when they copy the reg state across
 (r1-r5 into the callee as args, and r0 out as the return value) they also
 copy the parent pointer.

Signed-off-by: Edward Cree 
---
 include/linux/bpf_verifier.h |   8 +-
 kernel/bpf/verifier.c| 180 ++-
 2 files changed, 45 insertions(+), 143 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 0387e0c61161..2b56be9dfb56 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -41,6 +41,7 @@ enum bpf_reg_liveness {
 };
 
 struct bpf_reg_state {
+   /* Ordering of fields matters.  See states_equal() */
enum bpf_reg_type type;
union {
/* valid when type == PTR_TO_PACKET */
@@ -59,7 +60,6 @@ struct bpf_reg_state {
 * came from, when one is tested for != NULL.
 */
u32 id;
-   /* Ordering of fields matters.  See states_equal() */
/* For scalar types (SCALAR_VALUE), this represents our knowledge of
 * the actual value.
 * For pointer types, this represents the variable part of the offset
@@ -76,15 +76,15 @@ struct bpf_reg_state {
s64 smax_value; /* maximum possible (s64)value */
u64 umin_value; /* minimum possible (u64)value */
u64 umax_value; /* maximum possible (u64)value */
+   /* parentage chain for liveness checking */
+   struct bpf_reg_state *parent;
/* Inside the callee two registers can be both PTR_TO_STACK like
 * R1=fp-8 and R2=fp-8, but one of them points to this function stack
 * while another to the caller's stack. To differentiate them 'frameno'
 * is used which is an index in bpf_verifier_state->frame[] array
 * pointing to bpf_func_state.
-* This field must be second to last, for states_equal() reasons.
 */
u32 frameno;
-   /* This field must be last, for states_equal() reasons. */
enum bpf_reg_liveness live;
 };
 
@@ -107,7 +107,6 @@ struct bpf_stack_state {
  */
 struct bpf_func_state {
struct bpf_reg_state regs[MAX_BPF_REG];
-   struct bpf_verifier_state *parent;
/* index of call instruction that called into this func */
int callsite;
/* stack frame number of this function state from pov of
@@ -129,7 +128,6 @@ struct bpf_func_state {
 struct bpf_verifier_state {
/* call stack tracking */
struct bpf_func_state *frame[MAX_CALL_FRAMES];
-   struct bpf_verifier_state *parent;
u32 curframe;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2dc69fb3bfbe..dfba9dbc5bfb 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -343,9 +343,9 @@ static int copy_stack_state(struct bpf_func_state *dst,
 /* do_check() starts with zero-sized stack in struct bpf_verifier_state to
  * make it consume minimal amount of memory. check_stack_write() access from
  * the program calls into realloc_func_state() to grow the stack size.
- * Note there is a non-zero 'parent' pointer inside bpf_verifier_state
- * which this function copies over. It points to previous bpf_verifier_state
- * which is never reallocated
+ * Note there is a non-zero parent pointer inside each reg of bpf_verifier_state
+ * which this function copies over. It points to corresponding reg in previous
+ * bpf_verifier_state which is never reallocated
  */
 static int realloc_func_state(struct bpf_func_state *state, int size,
  bool copy_old)
@@ -429,7 +429,6 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
dst_state->frame[i] = NULL;
}
dst_state->curframe = src->curframe;
-   dst_state->parent = src->parent;
for (i = 0; i <= src->curframe; i++) {
dst = dst_state->frame[i];
if (!dst) {
@@ -699,6 +698,7 @@ static void init_reg_state(struct bpf_verifier_env *env,
for (i = 0; i < MAX_BPF_REG; i++) {
mark_reg_not_init(env, regs, i);
regs[i].live = REG_LIVE_NONE;
+   regs[i].parent = NULL;
}
 
/* frame pointer */
@@ -773,74 +773,21 @@ static int add_subprog(struct bpf_verifier_env *env, int off)
return ret;
 }
 
-static
-struct bpf_verifier_state *skip_callee(struct bpf_verifier_env *env,
-  const struct bpf_verifier_state *state,
-  struct bpf_verifier_state *parent,
-  u32 regno)
-{
-   struct bpf_verifier_state *tmp = NULL;
-
-   /* 'parent' could be a state of caller and
-* 'state' could be a state of callee. In such case
-* parent->cur

[RFC PATCH bpf-next 02/12] bpf/verifier: update selftests

2018-02-23 Thread Edward Cree
Error messages for some bad programs have changed, partly because we now
 check for loops / out-of-bounds jumps before checking subprogs.

Problematic selftests:
513 calls: wrong recursive calls
 This is now rejected with 'unreachable insn 1'.  I'm not entirely sure what
 it was meant to do/test, since all of the JMP|CALLs are also unreachable.
546 calls: ld_abs with changing ctx data in callee
 Rejected with R1 !read_ok.  It was testing for the "can't mix LD_ABS with
 function calls", which has now changed to "can't use LD_ABS in functions
 other than main()".  I'm still not 100% sure that's right though.

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_verifier.c | 46 ++---
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 697bd83de295..9c7531887ee3 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -644,7 +644,7 @@ static struct bpf_test tests[] = {
.insns = {
BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2),
},
-   .errstr = "not an exit",
+   .errstr = "jump out of range",
.result = REJECT,
},
{
@@ -9288,7 +9288,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "last insn is not an exit or jmp",
+   .errstr = "insn 1 was in subprog 1, now 0",
.result = REJECT,
},
{
@@ -9354,7 +9354,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "jump out of range",
+   .errstr = "insn 5 was in subprog 1, now 0",
.result = REJECT,
},
{
@@ -9633,7 +9633,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-   .errstr = "jump out of range from insn 1 to 4",
+   .errstr = "insn 5 was in subprog 1, now 0",
.result = REJECT,
},
{
@@ -9649,13 +9649,12 @@ static struct bpf_test tests[] = {
BPF_ALU64_REG(BPF_ADD, BPF_REG_7, BPF_REG_0),
BPF_MOV64_REG(BPF_REG_0, BPF_REG_7),
BPF_EXIT_INSN(),
-   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
-   offsetof(struct __sk_buff, len)),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8),
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, -3),
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "jump out of range from insn 11 to 9",
+   .errstr = "insn 9 was in subprog 1, now 2",
.result = REJECT,
},
{
@@ -9707,7 +9706,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "invalid destination",
+   .errstr = "jump out of range from insn 2 to -1",
.result = REJECT,
},
{
@@ -9719,7 +9718,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "invalid destination",
+   .errstr = "jump out of range from insn 2 to -2147483646",
.result = REJECT,
},
{
@@ -9732,7 +9731,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "jump out of range",
+   .errstr = "insn 1 was in subprog 0, now 1",
.result = REJECT,
},
{
@@ -9745,7 +9744,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "jump out of range",
+   .errstr = "insn 4 was in subprog 1, now 0",
.result = REJECT,
},
{
@@ -9759,7 +9758,7 @@ static struct bpf_test tests[] = {
BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, -2),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-   .errstr = "not an exit",
+   .errstr = "jump out of range from insn 5 to 6",
.result = REJECT,
},
{
@@ -9773,7 +9772,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},

[RFC PATCH bpf-next 01/12] bpf/verifier: validate func_calls by marking at do_check() time

2018-02-23 Thread Edward Cree
Removes a couple of passes from the verifier: one that checks that subprogs
 don't overlap and related invariants, and one that computes the max stack
 depth (which is now done by topologically sorting the call graph).

Signed-off-by: Edward Cree 
---
As noted in the cover letter, I haven't yet integrated the feedback I got the
 first time I posted this patch.
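For reviewers who want the idea without reading the C: a minimal sketch (in
 Python, not the kernel code) of computing total_stack_depth over the
 subprog call graph. The names `callees` and `stack_depth` mirror the fields
 added to struct bpf_subprog_info; the acyclicity assumption holds because
 the verifier rejects recursive calls, so a memoized post-order walk is
 equivalent to processing subprogs in topological order.

```python
def total_stack_depth(callees, stack_depth):
    """callees[i]: set of subprogs directly called by subprog i.
    stack_depth[i]: subprog i's own frame size.
    Returns, per subprog, the worst-case stack use of any call
    chain rooted at that subprog (the verifier would then check
    the entry subprog's total against MAX_BPF_STACK, 512 bytes)."""
    memo = {}

    def depth(i):
        if i in memo:
            return memo[i]
        # acyclic call graph: a plain memoized post-order walk
        # visits callees before callers (topological order)
        memo[i] = stack_depth[i] + max(
            (depth(c) for c in callees[i]), default=0)
        return memo[i]

    return [depth(i) for i in range(len(callees))]


# subprog 0 calls 1 and 2; subprog 1 calls 2; subprog 2 is a leaf
print(total_stack_depth([{1, 2}, {2}, set()], [64, 32, 128]))
# → [224, 160, 128]: the deepest chain from 0 is 0 -> 1 -> 2
```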

 include/linux/bpf_verifier.h |  24 ++-
 kernel/bpf/verifier.c| 425 +++
 2 files changed, 242 insertions(+), 207 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 6b66cd1aa0b9..0387e0c61161 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -146,6 +146,8 @@ struct bpf_insn_aux_data {
s32 call_imm;   /* saved imm field of call insn 
*/
};
int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
+   u16 subprogno; /* subprog in which this insn resides, valid iff @seen */
+   u16 subprog_off; /* insn_idx within subprog, computed in jit_subprogs */
bool seen; /* this insn was processed by the verifier */
 };
 
@@ -168,6 +170,15 @@ static inline bool bpf_verifier_log_full(const struct 
bpf_verifer_log *log)
 
 #define BPF_MAX_SUBPROGS 256
 
+struct bpf_subprog_info {
+   /* which other subprogs does this one directly call? */
+   DECLARE_BITMAP(callees, BPF_MAX_SUBPROGS);
+   u32 start; /* insn idx of function entry point */
+   u16 stack_depth; /* max. stack depth used by this function */
+   u16 total_stack_depth; /* max. stack depth used by entire call chain */
+   u16 len; /* #insns in this subprog */
+};
+
 /* single container for all structs
  * one verifier_env per bpf_check() call
  */
@@ -186,20 +197,23 @@ struct bpf_verifier_env {
bool seen_direct_write;
struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
struct bpf_verifer_log log;
-   u32 subprog_starts[BPF_MAX_SUBPROGS];
-   /* computes the stack depth of each bpf function */
-   u16 subprog_stack_depth[BPF_MAX_SUBPROGS + 1];
+   struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS];
u32 subprog_cnt;
 };
 
 __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
   const char *fmt, ...);
 
-static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
+static inline struct bpf_func_state *cur_frame(struct bpf_verifier_env *env)
 {
struct bpf_verifier_state *cur = env->cur_state;
 
-   return cur->frame[cur->curframe]->regs;
+   return cur->frame[cur->curframe];
+}
+
+static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
+{
+   return cur_frame(env)->regs;
 }
 
 int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5fb69a85d967..2dc69fb3bfbe 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -728,111 +728,49 @@ enum reg_arg_type {
DST_OP_NO_MARK  /* same as above, check only, don't mark */
 };
 
-static int cmp_subprogs(const void *a, const void *b)
+static int find_subprog(struct bpf_verifier_env *env, int insn_idx)
 {
-   return *(int *)a - *(int *)b;
-}
-
-static int find_subprog(struct bpf_verifier_env *env, int off)
-{
-   u32 *p;
+   struct bpf_insn_aux_data *aux;
+   int insn_cnt = env->prog->len;
+   u32 subprogno;
 
-   p = bsearch(&off, env->subprog_starts, env->subprog_cnt,
-   sizeof(env->subprog_starts[0]), cmp_subprogs);
-   if (!p)
+   if (insn_idx >= insn_cnt || insn_idx < 0) {
+   verbose(env, "find_subprog of invalid insn_idx %d\n", insn_idx);
+   return -EINVAL;
+   }
+   aux = &env->insn_aux_data[insn_idx];
+   if (!aux->seen) /* haven't visited this line yet */
return -ENOENT;
-   return p - env->subprog_starts;
-
+   subprogno = aux->subprogno;
+   /* validate that we are at start of subprog */
+   if (env->subprog_info[subprogno].start != insn_idx) {
+   verbose(env, "insn_idx %d is in subprog %u but that starts at 
%d\n",
+   insn_idx, subprogno, 
env->subprog_info[subprogno].start);
+   return -EINVAL;
+   }
+   return subprogno;
 }
 
 static int add_subprog(struct bpf_verifier_env *env, int off)
 {
int insn_cnt = env->prog->len;
+   struct bpf_subprog_info *info;
int ret;
 
if (off >= insn_cnt || off < 0) {
verbose(env, "call to invalid destination\n");
return -EINVAL;
}
-   ret = find_subprog(env, off);
-   if (ret >= 0)
-   return 0;
if (env->subprog_cnt >= BPF_MAX_SUBPROGS) {
verbose(env, "too many subprograms\n");
return -E2BIG;
}
-   env->subprog_starts[env->subprog_cnt++] = off;
-   sort(env->subpro

[PATCH 1/1] Linux Traffic Control (tc) unit testing suite: improve code quality by updating string formatting. According to the Python documentation, "The built-in string class provides the ability to do complex variable substitutions and value formatting via the format() method."

2018-02-23 Thread BTaskaya
---
 tools/testing/selftests/tc-testing/tdc.py   | 2 +-
 tools/testing/selftests/tc-testing/tdc_batch.py | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/tdc.py 
b/tools/testing/selftests/tc-testing/tdc.py
index fc373fdf2bdc..80c75464c67e 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -277,7 +277,7 @@ def generate_case_ids(alltests):
 for c in alltests:
 if (c["id"] == ""):
 while True:
-newid = str('%04x' % random.randrange(16**4))
+newid = str('{:04x}'.format(random.randrange(16**4)))
 if (does_id_exist(alltests, newid)):
 continue
 else:
diff --git a/tools/testing/selftests/tc-testing/tdc_batch.py 
b/tools/testing/selftests/tc-testing/tdc_batch.py
index 707c6bfef689..52fa539dc662 100755
--- a/tools/testing/selftests/tc-testing/tdc_batch.py
+++ b/tools/testing/selftests/tc-testing/tdc_batch.py
@@ -49,13 +49,13 @@ index = 0
 for i in range(0x100):
 for j in range(0x100):
 for k in range(0x100):
-mac = ("%02x:%02x:%02x" % (i, j, k))
+mac = ("{:02x}:{:02x}:{:02x}".format(i, j, k))
 src_mac = "e4:11:00:" + mac
 dst_mac = "e4:12:00:" + mac
-cmd = ("filter add dev %s %s protocol ip parent : flower %s "
-   "src_mac %s dst_mac %s action drop %s" %
+cmd = ("filter add dev {} {} protocol ip parent : flower {} "
+   "src_mac {} dst_mac {} action drop {}".format
(device, prio, skip, src_mac, dst_mac, share_action))
-file.write("%s\n" % cmd)
+file.write("{}\n".format(cmd))
 index += 1
 if index >= number:
 file.close()
-- 
2.16.1


