[PATCH bpf-next] bpf: comment why dots in filenames under BPF virtual FS are not allowed

2018-03-08 Thread Jakub Kicinski
From: Quentin Monnet 

When pinning a file under the BPF virtual file system (traditionally
/sys/fs/bpf), using a dot in the name of the location to pin at is not
allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be
rejected with -EPERM.

This check was introduced at the same time as the BPF file system
itself, with commit b2197755b263 ("bpf: add support for persistent
maps/progs"). At this time, it was checked in a function called
"bpf_dname_reserved()", which made clear that using a dot was reserved
for future extensions.

This function disappeared and the check was moved elsewhere with commit
0c93b7d85d40 ("bpf: reject invalid names right in ->lookup()"), and the
meaning of the dot ban was lost.

The present commit simply adds a comment in the source to explain to the
reader that the usage of dots is reserved for future usage.

Signed-off-by: Quentin Monnet 
---
 kernel/bpf/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 81e2f6995adb..bf6da59ae0d0 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -178,6 +178,9 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, 
void *arg)
 static struct dentry *
 bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags)
 {
+   /* Dots in names (e.g. "/sys/fs/bpf/foo.bar") are reserved for future
+* extensions.
+*/
if (strchr(dentry->d_name.name, '.'))
return ERR_PTR(-EPERM);
 
-- 
2.15.1



Re: [PATCH bpf-next 1/7] tools: bpftool: silence 'missing initializer' warnings

2018-03-08 Thread Jakub Kicinski
On Thu,  8 Mar 2018 23:00:35 +0100, Jiri Benc wrote:
> When building bpf tool, gcc emits piles of warnings:
> 
> prog.c: In function ‘prog_fd_by_tag’:
> prog.c:101:9: warning: missing initializer for field ‘type’ of ‘struct 
> bpf_prog_info’ [-Wmissing-field-initializers]
>   struct bpf_prog_info info = {};
>  ^
> In file included from 
> /home/storage/jbenc/git/net-next/tools/lib/bpf/bpf.h:26:0,
>  from prog.c:47:
> /home/storage/jbenc/git/net-next/tools/include/uapi/linux/bpf.h:925:8: note: 
> ‘type’ declared here
>   __u32 type;
> ^
> 
> As these warnings are not useful, switch them off.
> 
> Signed-off-by: Jiri Benc 

FWIW I couldn't reproduce this one.  Out of curiosity what GCC did you
use?


Re: [PATCH bpf-next 0/7] tools: bpf: standardize make

2018-03-08 Thread Jakub Kicinski
On Thu, 8 Mar 2018 21:10:59 -0800, Alexei Starovoitov wrote:
> On Thu, Mar 8, 2018 at 2:00 PM, Jiri Benc  wrote:
> > Currently, 'make bpf' in the tools/ directory does not provide the standard
> > quiet output except for bpftool (which is however listed with a wrong
> > directory). Worse, it does not respect the build output directory.
> >
> > The 'make bpf_install' does not work as one would expect, either. It
> > installs unconditionally to /usr/bin without respecting DESTDIR and prefix.
> >
> > This patchset improves that behavior.  
> 
> Jakub,
> please review this set.
> 
> Thanks!

I stared at it for a while, and it looks good!

Reviewed-by: Jakub Kicinski 

In addition I found a GCC 7 warning in bpftool, .PHONY lacks some
targets and we don't clean up after feature detection.  I'll send
patches for that unless someone beats me to it.


[PATCH net-next] cxgb4: increase max tx rate limit to 100 Gbps

2018-03-08 Thread Ganesh Goudar
T6 cards can support up to 100 G speeds. So, increase
max programmable tx rate limit to 100 Gbps.

Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 6 +++---
 drivers/net/ethernet/chelsio/cxgb4/sched.h  | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 7b452e8..1b44652 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2870,11 +2870,11 @@ static int cxgb_set_tx_maxrate(struct net_device *dev, 
int index, u32 rate)
/* Convert from Mbps to Kbps */
req_rate = rate << 10;
 
-   /* Max rate is 10 Gbps */
+   /* Max rate is 100 Gbps */
if (req_rate >= SCHED_MAX_RATE_KBPS) {
dev_err(adap->pdev_dev,
-   "Invalid rate %u Mbps, Max rate is %u Gbps\n",
-   rate, SCHED_MAX_RATE_KBPS);
+   "Invalid rate %u Mbps, Max rate is %u Mbps\n",
+   rate, SCHED_MAX_RATE_KBPS >> 10);
return -ERANGE;
}
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.h 
b/drivers/net/ethernet/chelsio/cxgb4/sched.h
index 77b2b3f..3a49e00 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.h
@@ -42,8 +42,8 @@
 
 #define FW_SCHED_CLS_NONE 0x
 
-/* Max rate that can be set to a scheduling class is 10 Gbps */
-#define SCHED_MAX_RATE_KBPS 1000U
+/* Max rate that can be set to a scheduling class is 100 Gbps */
+#define SCHED_MAX_RATE_KBPS 1U
 
 enum {
SCHED_STATE_ACTIVE,
-- 
2.1.0



Re: [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap

2018-03-08 Thread Jason Wang



On 2018年03月08日 21:07, Jesper Dangaard Brouer wrote:

Introduce an xdp_return_frame API, and convert over cpumap as
the first user, given it have queued XDP frame structure to leverage.

Signed-off-by: Jesper Dangaard Brouer 
---
  include/net/xdp.h   |   32 +++
  kernel/bpf/cpumap.c |   60 +++
  net/core/xdp.c  |   18 +++
  3 files changed, 86 insertions(+), 24 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index b2362ddfa694..3cb726a6dc5b 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -33,16 +33,48 @@
   * also mandatory during RX-ring setup.
   */
  
+enum mem_type {

+   MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
+   MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */
+   // Possible new ideas for types:
+   // MEM_TYPE_PAGE_POOL,/* Will be added later  */
+   // MEM_TYPE_AF_XDP,
+   // MEM_TYPE_DEVICE_OWNED -- invoking an dev->ndo?
+   MEM_TYPE_MAX,
+};


So if we plan to support dev->ndo, it looks to me two types AF_XDP and 
DEVICE_OWNED are sufficient? Driver can do what it wants (e.g page pool 
or ordinary page allocator) in ndo or what ever other callbacks.



+
+struct xdp_mem_info {
+   u32 type; /* enum mem_type, but known size type */
+   u32 id; // Needed later (to lookup struct xdp_rxq_info)
+};
+
  struct xdp_rxq_info {
struct net_device *dev;
u32 queue_index;
u32 reg_state;
+   struct xdp_mem_info mem;
  } cacheline_aligned; /* perf critical, avoid false-sharing */
  
+

+static inline
+void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+{
+   if (mem->type == MEM_TYPE_PAGE_SHARED)
+   page_frag_free(data);
+
+   if (mem->type == MEM_TYPE_PAGE_ORDER0) {
+   struct page *page = virt_to_page(data); /* Assumes order0 page*/
+
+   put_page(page);
+   }
+}
+
  int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 struct net_device *dev, u32 queue_index);
  void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
  void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
  bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
+int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
+  enum mem_type type, void *allocator);
  
  #endif /* __LINUX_NET_XDP_H__ */

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a4bb0b34375a..3e4bbcbe3e86 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -19,6 +19,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -137,27 +138,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
return ERR_PTR(err);
  }
  
-static void __cpu_map_queue_destructor(void *ptr)

-{
-   /* The tear-down procedure should have made sure that queue is
-* empty.  See __cpu_map_entry_replace() and work-queue
-* invoked cpu_map_kthread_stop(). Catch any broken behaviour
-* gracefully and warn once.
-*/
-   if (WARN_ON_ONCE(ptr))
-   page_frag_free(ptr);
-}
-
-static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
-{
-   if (atomic_dec_and_test(&rcpu->refcnt)) {
-   /* The queue should be empty at this point */
-   ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
-   kfree(rcpu->queue);
-   kfree(rcpu);
-   }
-}
-
  static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
  {
atomic_inc(&rcpu->refcnt);
@@ -188,6 +168,10 @@ struct xdp_pkt {
u16 len;
u16 headroom;
u16 metasize;
+   /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+* while mem info is valid on remote CPU.
+*/


Can we simply move the xdp_mem_info to xdp_buff to avoid conversion?

Thanks


+   struct xdp_mem_info mem;
struct net_device *dev_rx;
  };
  
@@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)

xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
xdp_pkt->metasize = metasize;
  
+	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */

+   xdp_pkt->mem = xdp->rxq->mem;
+
return xdp_pkt;
  }
  
@@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,

return skb;
  }
  
+static void __cpu_map_ring_cleanup(struct ptr_ring *ring)

+{
+   /* The tear-down procedure should have made sure that queue is
+* empty.  See __cpu_map_entry_replace() and work-queue
+* invoked cpu_map_kthread_stop(). Catch any broken behaviour
+* gracefully and warn once.
+*/
+   struct xdp_pkt *xdp_pkt;
+
+   while ((xdp_pkt = ptr_ring_consume(ring)))
+   if (WARN_ON_ONCE(xdp_pkt))
+   xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+}
+
+static void put_cpu_map_entry(struct bpf

Re: [bpf-next V2 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API

2018-03-08 Thread Jason Wang



On 2018年03月08日 23:16, Jesper Dangaard Brouer wrote:

Hi Jason,

Please see below FIXME, which is actually a question to you.

On Thu, 08 Mar 2018 14:08:11 +0100 Jesper Dangaard Brouer  
wrote:


diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 475088f947bb..cd046cf31b77 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c

[...]


@@ -1290,17 +1290,18 @@ static const struct net_device_ops tun_netdev_ops = {
  static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
  {
struct tun_struct *tun = netdev_priv(dev);
-   struct xdp_buff *buff = xdp->data_hard_start;
-   int headroom = xdp->data - xdp->data_hard_start;
+   struct xdp_frame *frame;
struct tun_file *tfile;
u32 numqueues;
int ret = 0;
  
-	/* Assure headroom is available and buff is properly aligned */

-   if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
-   return -ENOSPC;
+   /* FIXME: Explain why this check is the needed! */
+   if (unlikely(tun_is_xdp_frame(xdp)))
+   return -EBADRQC;
  
-	*buff = *xdp;

+   frame = convert_to_xdp_frame(xdp);
+   if (unlikely(!frame))
+   return -EOVERFLOW;

To Jason, in the FIXME, I'm inheriting a check you put in, but I don't
understand why this check was needed?



Sorry for the late reply.

I think it was used to make sure to not use misaligned or invalid 
pointer that caller passed to us.


Thanks


[PATCH net 1/3] vhost_net: initialize rx_ring in vhost_net_open()

2018-03-08 Thread Jason Wang
From: Alexander Potapenko 

KMSAN reported a use of uninit memory in vhost_net_buf_unproduce()
while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring:

==
BUG: KMSAN: use of uninitialized memory in vhost_net_buf_unproduce+0x7bb/0x9a0 
drivers/vho
et.c:170
CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170
 vhost_net_stop_vq drivers/vhost/net.c:974 [inline]
 vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982
 vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015
 __fput+0x49f/0xa00 fs/file_table.c:209
 fput+0x37/0x40 fs/file_table.c:243
 task_work_run+0x243/0x2c0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:191 [inline]
 exit_to_usermode_loop arch/x86/entry/common.c:166 [inline]
 prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196
 syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265
 do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292
...
origin:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213
 kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392
 kmalloc_large_node_hook mm/slub.c:1366 [inline]
 kmalloc_large_node mm/slub.c:3808 [inline]
 __kmalloc_node+0x100e/0x1290 mm/slub.c:3818
 kmalloc_node include/linux/slab.h:554 [inline]
 kvmalloc_node+0x1a5/0x2e0 mm/util.c:419
 kvmalloc include/linux/mm.h:541 [inline]
 vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921
 misc_open+0x7b5/0x8b0 drivers/char/misc.c:154
 chrdev_open+0xc28/0xd90 fs/char_dev.c:417
 do_dentry_open+0xccb/0x1430 fs/open.c:752
 vfs_open+0x272/0x2e0 fs/open.c:866
 do_last fs/namei.c:3378 [inline]
 path_openat+0x49ad/0x6580 fs/namei.c:3519
 do_filp_open+0x267/0x640 fs/namei.c:3553
 do_sys_open+0x6ad/0x9c0 fs/open.c:1059
 SYSC_openat+0xc7/0xe0 fs/open.c:1086
 SyS_openat+0x63/0x90 fs/open.c:1080
 do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287
==

Fixes: c67df11f6e480 ("vhost_net: try batch dequing from skb array")
Signed-off-by: Alexander Potapenko 
Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 610cba2..60f1080 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -948,6 +948,7 @@ static int vhost_net_open(struct inode *inode, struct file 
*f)
n->vqs[i].done_idx = 0;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rx_ring = NULL;
vhost_net_buf_init(&n->vqs[i].rxq);
}
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
-- 
2.7.4



[PATCH net 0/3] Several fixes for vhost_net ptr_ring usage

2018-03-08 Thread Jason Wang
Hi:

This small series try to fix several bugs of ptr_ring usage in
vhost_net. Please review.

Thanks

Alexander Potapenko (1):
  vhost_net: initialize rx_ring in vhost_net_open()

Jason Wang (2):
  vhost_net: keep private_data and rx_ring synced
  vhost_net: examine pointer types during un-producing

 drivers/net/tun.c  | 3 ++-
 drivers/vhost/net.c| 8 +---
 include/linux/if_tun.h | 4 
 3 files changed, 11 insertions(+), 4 deletions(-)

-- 
2.7.4



[PATCH net 3/3] vhost_net: examine pointer types during un-producing

2018-03-08 Thread Jason Wang
After commit fc72d1d54dd9 ("tuntap: XDP transmission"), we can
actually queueing XDP pointers in the pointer ring, so we should
examine the pointer type before freeing the pointer.

Fixes: fc72d1d54dd9 ("tuntap: XDP transmission")
Reported-by: Michael S. Tsirkin 
Acked-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
 drivers/net/tun.c  | 3 ++-
 drivers/vhost/net.c| 2 +-
 include/linux/if_tun.h | 4 
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7433bb2..28cfa64 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -655,7 +655,7 @@ static struct tun_struct *tun_enable_queue(struct tun_file 
*tfile)
return tun;
 }
 
-static void tun_ptr_free(void *ptr)
+void tun_ptr_free(void *ptr)
 {
if (!ptr)
return;
@@ -667,6 +667,7 @@ static void tun_ptr_free(void *ptr)
__skb_array_destroy_skb(ptr);
}
 }
+EXPORT_SYMBOL_GPL(tun_ptr_free);
 
 static void tun_queue_purge(struct tun_file *tfile)
 {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index efb9306..8139bc7 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -170,7 +170,7 @@ static void vhost_net_buf_unproduce(struct 
vhost_net_virtqueue *nvq)
if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
   vhost_net_buf_get_size(rxq),
-  __skb_array_destroy_skb);
+  tun_ptr_free);
rxq->head = rxq->tail = 0;
}
 }
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index c5b0a75..fd00170 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -25,6 +25,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file);
 bool tun_is_xdp_buff(void *ptr);
 void *tun_xdp_to_ptr(void *ptr);
 void *tun_ptr_to_xdp(void *ptr);
+void tun_ptr_free(void *ptr);
 #else
 #include 
 #include 
@@ -50,5 +51,8 @@ static inline void *tun_ptr_to_xdp(void *ptr)
 {
return NULL;
 }
+static inline void tun_ptr_free(void *ptr)
+{
+}
 #endif /* CONFIG_TUN */
 #endif /* __IF_TUN_H */
-- 
2.7.4



[PATCH net 2/3] vhost_net: keep private_data and rx_ring synced

2018-03-08 Thread Jason Wang
We get pointer ring from the exported sock, this means we should keep
rx_ring and vq->private synced during both vq stop and backend set,
otherwise we may see stale rx_ring.

Fixes: c67df11f6e480 ("vhost_net: try batch dequing from skb array")
Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 60f1080..efb9306 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -973,6 +973,7 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
vhost_net_disable_vq(n, vq);
vq->private_data = NULL;
vhost_net_buf_unproduce(nvq);
+   nvq->rx_ring = NULL;
mutex_unlock(&vq->mutex);
return sock;
 }
@@ -1162,14 +1163,14 @@ static long vhost_net_set_backend(struct vhost_net *n, 
unsigned index, int fd)
vhost_net_disable_vq(n, vq);
vq->private_data = sock;
vhost_net_buf_unproduce(nvq);
-   if (index == VHOST_NET_VQ_RX)
-   nvq->rx_ring = get_tap_ptr_ring(fd);
r = vhost_vq_init_access(vq);
if (r)
goto err_used;
r = vhost_net_enable_vq(n, vq);
if (r)
goto err_used;
+   if (index == VHOST_NET_VQ_RX)
+   nvq->rx_ring = get_tap_ptr_ring(fd);
 
oldubufs = nvq->ubufs;
nvq->ubufs = ubufs;
-- 
2.7.4



[PATCH] rsi: Remove stack VLA usage

2018-03-08 Thread Tobin C. Harding
The kernel would like to have all stack VLA usage removed[1].  rsi uses
a VLA based on 'blksize'.  Elsewhere in the SDIO code maximum block size
is defined using a magic number.  We can use a pre-processor defined
constant and declare the array to maximum size.  We add a check before
accessing the array in case of programmer error.

[1]: https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Tobin C. Harding 
---
 drivers/net/wireless/rsi/rsi_91x_hal.c  | 13 +++--
 drivers/net/wireless/rsi/rsi_91x_sdio.c |  9 +++--
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireless/rsi/rsi_91x_hal.c 
b/drivers/net/wireless/rsi/rsi_91x_hal.c
index 1176de646942..839ebdd602df 100644
--- a/drivers/net/wireless/rsi/rsi_91x_hal.c
+++ b/drivers/net/wireless/rsi/rsi_91x_hal.c
@@ -641,7 +641,7 @@ static int ping_pong_write(struct rsi_hw *adapter, u8 cmd, 
u8 *addr, u32 size)
u32 cmd_addr;
u16 cmd_resp, cmd_req;
u8 *str;
-   int status;
+   int status, ret;
 
if (cmd == PING_WRITE) {
cmd_addr = PING_BUFFER_ADDRESS;
@@ -655,12 +655,13 @@ static int ping_pong_write(struct rsi_hw *adapter, u8 
cmd, u8 *addr, u32 size)
str = "PONG_VALID";
}
 
-   status = hif_ops->load_data_master_write(adapter, cmd_addr, size,
+   ret = hif_ops->load_data_master_write(adapter, cmd_addr, size,
block_size, addr);
-   if (status) {
-   rsi_dbg(ERR_ZONE, "%s: Unable to write blk at addr %0x\n",
-   __func__, *addr);
-   return status;
+   if (ret) {
+   if (ret != -EINVAL)
+   rsi_dbg(ERR_ZONE, "%s: Unable to write blk at addr 
%0x\n",
+   __func__, *addr);
+   return ret;
}
 
status = bl_cmd(adapter, cmd_req, cmd_resp, str);
diff --git a/drivers/net/wireless/rsi/rsi_91x_sdio.c 
b/drivers/net/wireless/rsi/rsi_91x_sdio.c
index b0cf41195051..b766578b591a 100644
--- a/drivers/net/wireless/rsi/rsi_91x_sdio.c
+++ b/drivers/net/wireless/rsi/rsi_91x_sdio.c
@@ -20,6 +20,8 @@
 #include "rsi_common.h"
 #include "rsi_hal.h"
 
+#define RSI_MAX_BLOCK_SIZE 256
+
 /**
  * rsi_sdio_set_cmd52_arg() - This function prepares cmd 52 read/write arg.
  * @rw: Read/write
@@ -362,7 +364,7 @@ static int rsi_setblocklength(struct rsi_hw *adapter, u32 
length)
rsi_dbg(INIT_ZONE, "%s: Setting the block length\n", __func__);
 
status = sdio_set_block_size(dev->pfunction, length);
-   dev->pfunction->max_blksize = 256;
+   dev->pfunction->max_blksize = RSI_MAX_BLOCK_SIZE;
adapter->block_size = dev->pfunction->max_blksize;
 
rsi_dbg(INFO_ZONE,
@@ -567,9 +569,12 @@ static int rsi_sdio_load_data_master_write(struct rsi_hw 
*adapter,
 {
u32 num_blocks, offset, i;
u16 msb_address, lsb_address;
-   u8 temp_buf[block_size];
+   u8 temp_buf[RSI_MAX_BLOCK_SIZE];
int status;
 
+   if (block_size > RSI_MAX_BLOCK_SIZE)
+   return -EINVAL;
+
num_blocks = instructions_sz / block_size;
msb_address = base_address >> 16;
 
-- 
2.7.4



Re: [PATCH] vhost_net: initialize rx_ring in vhost_net_open()

2018-03-08 Thread Jason Wang



On 2018年03月08日 23:50, Alexander Potapenko wrote:

On Thu, Mar 8, 2018 at 4:45 PM, Eric Dumazet  wrote:

On 03/08/2018 07:20 AM, Alexander Potapenko wrote:

On Thu, Mar 8, 2018 at 4:15 PM, Eric Dumazet
wrote:


On 03/08/2018 05:37 AM, Alexander Potapenko wrote:

KMSAN reported a use of uninit memory in vhost_net_buf_unproduce()
while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring:


==

Signed-off-by: Alexander Potapenko


Please identify bug origin with a Fixes: tag

Fixes: 5990a30510ed1 ("tun/tap: use ptr_ring instead of skb_array")

Please send a V2 with this added tag. patchwork does not recognize it yet.

Ok, will do. Thanks for reminding about the tag!


The commit should be c67df11f6e480 ("vhost_net: try batch dequing from 
skb array").


Let me squash this patch into my series.

Thanks



[PATCH] net: ethernet: ave: enable Rx drop interrupt

2018-03-08 Thread Kunihiko Hayashi
This enables AVE_GI_RXDROP interrupt factor. This factor indicates
depletion of Rx descriptors and the handler counts the number
of dropped packets.

Signed-off-by: Kunihiko Hayashi 
---
 drivers/net/ethernet/socionext/sni_ave.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/socionext/sni_ave.c 
b/drivers/net/ethernet/socionext/sni_ave.c
index b247707..043eb55 100644
--- a/drivers/net/ethernet/socionext/sni_ave.c
+++ b/drivers/net/ethernet/socionext/sni_ave.c
@@ -1349,7 +1349,7 @@ static int ave_open(struct net_device *ndev)
val |= AVE_IIRQC_EN0 | (AVE_INTM_COUNT << 16);
writel(val, priv->base + AVE_IIRQC);
 
-   val = AVE_GI_RXIINT | AVE_GI_RXOVF | AVE_GI_TX;
+   val = AVE_GI_RXIINT | AVE_GI_RXOVF | AVE_GI_TX | AVE_GI_RXDROP;
ave_irq_restore(ndev, val);
 
napi_enable(&priv->napi_rx);
-- 
2.7.4



Re: [PATCH net] vhost_net: examine pointer types during un-producing

2018-03-08 Thread Jason Wang



On 2018年03月09日 11:16, Jason Wang wrote:

After commit 761876c857cb ("tap: XDP support"), we can actually
queueing XDP pointers in the pointer ring, so we should examine the
pointer type before freeing the pointer.

Fixes: 761876c857cb ("tap: XDP support")


Oops, the commit is wrong, let me repost.

Thanks


Reported-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
  drivers/net/tun.c  | 3 ++-
  drivers/vhost/net.c| 2 +-
  include/linux/if_tun.h | 4 
  3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7433bb2..28cfa64 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -655,7 +655,7 @@ static struct tun_struct *tun_enable_queue(struct tun_file 
*tfile)
return tun;
  }
  
-static void tun_ptr_free(void *ptr)

+void tun_ptr_free(void *ptr)
  {
if (!ptr)
return;
@@ -667,6 +667,7 @@ static void tun_ptr_free(void *ptr)
__skb_array_destroy_skb(ptr);
}
  }
+EXPORT_SYMBOL_GPL(tun_ptr_free);
  
  static void tun_queue_purge(struct tun_file *tfile)

  {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 610cba2..54a138f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -170,7 +170,7 @@ static void vhost_net_buf_unproduce(struct 
vhost_net_virtqueue *nvq)
if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
   vhost_net_buf_get_size(rxq),
-  __skb_array_destroy_skb);
+  tun_ptr_free);
rxq->head = rxq->tail = 0;
}
  }
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index c5b0a75..fd00170 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -25,6 +25,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file);
  bool tun_is_xdp_buff(void *ptr);
  void *tun_xdp_to_ptr(void *ptr);
  void *tun_ptr_to_xdp(void *ptr);
+void tun_ptr_free(void *ptr);
  #else
  #include 
  #include 
@@ -50,5 +51,8 @@ static inline void *tun_ptr_to_xdp(void *ptr)
  {
return NULL;
  }
+static inline void tun_ptr_free(void *ptr)
+{
+}
  #endif /* CONFIG_TUN */
  #endif /* __IF_TUN_H */




[PATCH] pktgen: Remove VLA usage

2018-03-08 Thread Gustavo A. R. Silva
In preparation to enabling -Wvla, remove VLA usage and replace it
with a fixed-length array instead.

Signed-off-by: Gustavo A. R. Silva 
---
David,

I'm not sure how often this function is being called and,
depending on the frequency it may be worth to use
dynamic memory allocation instead?

Thanks
--
Gustavo

 net/core/pktgen.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index d81bddd..e2d6ae3 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -907,7 +907,7 @@ static ssize_t pktgen_if_write(struct file *file,
 
if (debug) {
size_t copy = min_t(size_t, count, 1023);
-   char tb[copy + 1];
+   char tb[1024];
if (copy_from_user(tb, user_buffer, copy))
return -EFAULT;
tb[copy] = 0;
-- 
2.7.4



Re: [PATCH bpf-next 0/7] tools: bpf: standardize make

2018-03-08 Thread Alexei Starovoitov
On Thu, Mar 8, 2018 at 2:00 PM, Jiri Benc  wrote:
> Currently, 'make bpf' in the tools/ directory does not provide the standard
> quiet output except for bpftool (which is however listed with a wrong
> directory). Worse, it does not respect the build output directory.
>
> The 'make bpf_install' does not work as one would expect, either. It
> installs unconditionally to /usr/bin without respecting DESTDIR and prefix.
>
> This patchset improves that behavior.

Jakub,
please review this set.

Thanks!


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov

On 3/8/18 7:54 PM, Andy Lutomirski wrote:





On Mar 8, 2018, at 7:06 PM, Linus Torvalds  
wrote:


Honestly, that "read twice" thing may be what scuttles this.
Initially, I thought it was a non-issue, because anybody who controls
the module subdirectory enough to rewrite files would be in a position
to just execute the file itself directly instead.



On further consideration, I think there’s another showstopper. This patch is a 
potentially severe ABI break. Right now, loading a module *copies* it into 
memory and does not hold a reference to the underlying fs. With the patch 
applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe 
okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, 
then umount it, then clear the ramdisk, something will go horribly wrong. 
Exactly what goes wrong depends on whether userspace notices that umount() 
failed. Similarly, if you load one of these modules over a network and then 
lose your connection, you have a problem.


there is not abi breakage and file cannot disappear from running task.
One cannot umount fs while file is still being used.



The “read twice” thing is also bad for another reason: containers. Suppose I 
have a setup where a container can load a signed module blob. With the read 
twice code, the container can race and run an entirely different blob outside 
the container.


Not only "read twice", but "read many".
If .text sections of elf that are not yet in memory can be modified
by malicious user, later they will be brought in with different code.
I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski



> On Mar 8, 2018, at 7:06 PM, Linus Torvalds  
> wrote:
> 
> 
> Honestly, that "read twice" thing may be what scuttles this.
> Initially, I thought it was a non-issue, because anybody who controls
> the module subdirectory enough to rewrite files would be in a position
> to just execute the file itself directly instead.
> 

On further consideration, I think there’s another showstopper. This patch is a 
potentially severe ABI break. Right now, loading a module *copies* it into 
memory and does not hold a reference to the underlying fs. With the patch 
applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe 
okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, 
then umount it, then clear the ramdisk, something will go horribly wrong. 
Exactly what goes wrong depends on whether userspace notices that umount() 
failed. Similarly, if you load one of these modules over a network and then 
lose your connection, you have a problem. 


The “read twice” thing is also bad for another reason: containers. Suppose I 
have a setup where a container can load a signed module blob. With the read 
twice code, the container can race and run an entirely different blob outside 
the container. 

Re: [PATCH net] vhost_net: examine pointer types during un-producing

2018-03-08 Thread Michael S. Tsirkin
On Fri, Mar 09, 2018 at 11:16:22AM +0800, Jason Wang wrote:
> After commit 761876c857cb ("tap: XDP support"), we can actually
> queueing XDP pointers in the pointer ring, so we should examine the
> pointer type before freeing the pointer.
> 
> Fixes: 761876c857cb ("tap: XDP support")
> Reported-by: Michael S. Tsirkin 
> Signed-off-by: Jason Wang 

Acked-by: Michael S. Tsirkin 

> ---
>  drivers/net/tun.c  | 3 ++-
>  drivers/vhost/net.c| 2 +-
>  include/linux/if_tun.h | 4 
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 7433bb2..28cfa64 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -655,7 +655,7 @@ static struct tun_struct *tun_enable_queue(struct 
> tun_file *tfile)
>   return tun;
>  }
>  
> -static void tun_ptr_free(void *ptr)
> +void tun_ptr_free(void *ptr)
>  {
>   if (!ptr)
>   return;
> @@ -667,6 +667,7 @@ static void tun_ptr_free(void *ptr)
>   __skb_array_destroy_skb(ptr);
>   }
>  }
> +EXPORT_SYMBOL_GPL(tun_ptr_free);
>  
>  static void tun_queue_purge(struct tun_file *tfile)
>  {
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 610cba2..54a138f 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -170,7 +170,7 @@ static void vhost_net_buf_unproduce(struct 
> vhost_net_virtqueue *nvq)
>   if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
>   ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
>  vhost_net_buf_get_size(rxq),
> -__skb_array_destroy_skb);
> +tun_ptr_free);
>   rxq->head = rxq->tail = 0;
>   }
>  }
> diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
> index c5b0a75..fd00170 100644
> --- a/include/linux/if_tun.h
> +++ b/include/linux/if_tun.h
> @@ -25,6 +25,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file);
>  bool tun_is_xdp_buff(void *ptr);
>  void *tun_xdp_to_ptr(void *ptr);
>  void *tun_ptr_to_xdp(void *ptr);
> +void tun_ptr_free(void *ptr);
>  #else
>  #include 
>  #include 
> @@ -50,5 +51,8 @@ static inline void *tun_ptr_to_xdp(void *ptr)
>  {
>   return NULL;
>  }
> +static inline void tun_ptr_free(void *ptr)
> +{
> +}
>  #endif /* CONFIG_TUN */
>  #endif /* __IF_TUN_H */
> -- 
> 2.7.4


Re: [PATCH] vhost_net: initialize rx_ring in vhost_net_open()

2018-03-08 Thread Jason Wang



On 2018年03月09日 11:29, Michael S. Tsirkin wrote:

On Fri, Mar 09, 2018 at 10:30:17AM +0800, Jason Wang wrote:


On 2018年03月09日 00:00, Michael S. Tsirkin wrote:

On Thu, Mar 08, 2018 at 04:55:39PM +0100, Alexander Potapenko wrote:

On Thu, Mar 8, 2018 at 4:33 PM, Michael S. Tsirkin  wrote:

On Thu, Mar 08, 2018 at 02:37:17PM +0100, Alexander Potapenko wrote:

KMSAN reported a use of uninit memory in vhost_net_buf_unproduce()
while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring:

==
BUG: KMSAN: use of uninitialized memory in vhost_net_buf_unproduce+0x7bb/0x9a0 
drivers/vho
et.c:170
CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
   __dump_stack lib/dump_stack.c:17 [inline]
   dump_stack+0x185/0x1d0 lib/dump_stack.c:53
   kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093
   __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
   vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170
   vhost_net_stop_vq drivers/vhost/net.c:974 [inline]
   vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982
   vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015
   __fput+0x49f/0xa00 fs/file_table.c:209
   fput+0x37/0x40 fs/file_table.c:243
   task_work_run+0x243/0x2c0 kernel/task_work.c:113
   tracehook_notify_resume include/linux/tracehook.h:191 [inline]
   exit_to_usermode_loop arch/x86/entry/common.c:166 [inline]
   prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196
   syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265
   do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292
...
origin:
   kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline]
   kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213
   kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392
   kmalloc_large_node_hook mm/slub.c:1366 [inline]
   kmalloc_large_node mm/slub.c:3808 [inline]
   __kmalloc_node+0x100e/0x1290 mm/slub.c:3818
   kmalloc_node include/linux/slab.h:554 [inline]
   kvmalloc_node+0x1a5/0x2e0 mm/util.c:419
   kvmalloc include/linux/mm.h:541 [inline]
   vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921
   misc_open+0x7b5/0x8b0 drivers/char/misc.c:154
   chrdev_open+0xc28/0xd90 fs/char_dev.c:417
   do_dentry_open+0xccb/0x1430 fs/open.c:752
   vfs_open+0x272/0x2e0 fs/open.c:866
   do_last fs/namei.c:3378 [inline]
   path_openat+0x49ad/0x6580 fs/namei.c:3519
   do_filp_open+0x267/0x640 fs/namei.c:3553
   do_sys_open+0x6ad/0x9c0 fs/open.c:1059
   SYSC_openat+0xc7/0xe0 fs/open.c:1086
   SyS_openat+0x63/0x90 fs/open.c:1080
   do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287
==

Signed-off-by: Alexander Potapenko 
---
   drivers/vhost/net.c | 1 +
   1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 610cba276d47..60f1080bffc7 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -948,6 +948,7 @@ static int vhost_net_open(struct inode *inode, struct file 
*f)
n->vqs[i].done_idx = 0;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+ n->vqs[i].rx_ring = NULL;
vhost_net_buf_init(&n->vqs[i].rxq);
}
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
--
2.16.2.395.g2e18187dfd-goog

I suspect that's not sufficient. rx ring is tied to the tap device.
I think we need to drop it every time we drop the device.

Unfortunately I've no idea where is the device dropped. Are you
referring to vhost_net_vq_reset()?
I can fix that part if needed, but won't be able to validate it with KMSAN.

I see several issues. For example in vhost_net_set_backend
if there's a value then rx ring will point to the
ring of the wrong socket.
Something like the below might help but we really need
documentation of when is rx_ring valid. Is it only valid
when private-data is valid?

I think so, we need keep rx_ring synced with private_data.


If yes need to make sure
we reset it with private_data.

Also I see __skb_array_destroy_skb used with ptr_ring which
seems suspicious: how do we know the entries are skbs?

Good catch, will post an independent patch to fix this.


Patch below is on top of yours, and

Signed-off-by: Michael S. Tsirkin 

But I really would like Jason to look and come up with a
patch to address all these issues.

---

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 610cba2..7a65b69 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -972,6 +973,7 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
vhost_net_disable_vq(n, vq);
vq->private_data = NULL;
vhost_net_buf_unproduce(nvq);
+   vq->rx_ring = NULL;
mutex_unlock(&vq->mutex);
return sock;
   }
@@ -1161,8 +1163,6 @@ static long vhost_net_set_backend(struct vhost_net *n, 
unsigned index, int fd)
vhost_net_disab

Re: [PATCH] vhost_net: initialize rx_ring in vhost_net_open()

2018-03-08 Thread Michael S. Tsirkin
On Fri, Mar 09, 2018 at 10:30:17AM +0800, Jason Wang wrote:
> 
> 
> On 2018年03月09日 00:00, Michael S. Tsirkin wrote:
> > On Thu, Mar 08, 2018 at 04:55:39PM +0100, Alexander Potapenko wrote:
> > > On Thu, Mar 8, 2018 at 4:33 PM, Michael S. Tsirkin  
> > > wrote:
> > > > On Thu, Mar 08, 2018 at 02:37:17PM +0100, Alexander Potapenko wrote:
> > > > > KMSAN reported a use of uninit memory in vhost_net_buf_unproduce()
> > > > > while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring:
> > > > > 
> > > > > ==
> > > > > BUG: KMSAN: use of uninitialized memory in 
> > > > > vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vho
> > > > > et.c:170
> > > > > CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853
> > > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 
> > > > > 04/01/2014
> > > > > Call Trace:
> > > > >   __dump_stack lib/dump_stack.c:17 [inline]
> > > > >   dump_stack+0x185/0x1d0 lib/dump_stack.c:53
> > > > >   kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093
> > > > >   __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
> > > > >   vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170
> > > > >   vhost_net_stop_vq drivers/vhost/net.c:974 [inline]
> > > > >   vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982
> > > > >   vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015
> > > > >   __fput+0x49f/0xa00 fs/file_table.c:209
> > > > >   fput+0x37/0x40 fs/file_table.c:243
> > > > >   task_work_run+0x243/0x2c0 kernel/task_work.c:113
> > > > >   tracehook_notify_resume include/linux/tracehook.h:191 [inline]
> > > > >   exit_to_usermode_loop arch/x86/entry/common.c:166 [inline]
> > > > >   prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196
> > > > >   syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265
> > > > >   do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292
> > > > > ...
> > > > > origin:
> > > > >   kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline]
> > > > >   kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213
> > > > >   kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392
> > > > >   kmalloc_large_node_hook mm/slub.c:1366 [inline]
> > > > >   kmalloc_large_node mm/slub.c:3808 [inline]
> > > > >   __kmalloc_node+0x100e/0x1290 mm/slub.c:3818
> > > > >   kmalloc_node include/linux/slab.h:554 [inline]
> > > > >   kvmalloc_node+0x1a5/0x2e0 mm/util.c:419
> > > > >   kvmalloc include/linux/mm.h:541 [inline]
> > > > >   vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921
> > > > >   misc_open+0x7b5/0x8b0 drivers/char/misc.c:154
> > > > >   chrdev_open+0xc28/0xd90 fs/char_dev.c:417
> > > > >   do_dentry_open+0xccb/0x1430 fs/open.c:752
> > > > >   vfs_open+0x272/0x2e0 fs/open.c:866
> > > > >   do_last fs/namei.c:3378 [inline]
> > > > >   path_openat+0x49ad/0x6580 fs/namei.c:3519
> > > > >   do_filp_open+0x267/0x640 fs/namei.c:3553
> > > > >   do_sys_open+0x6ad/0x9c0 fs/open.c:1059
> > > > >   SYSC_openat+0xc7/0xe0 fs/open.c:1086
> > > > >   SyS_openat+0x63/0x90 fs/open.c:1080
> > > > >   do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287
> > > > > ==
> > > > > 
> > > > > Signed-off-by: Alexander Potapenko 
> > > > > ---
> > > > >   drivers/vhost/net.c | 1 +
> > > > >   1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > > > index 610cba276d47..60f1080bffc7 100644
> > > > > --- a/drivers/vhost/net.c
> > > > > +++ b/drivers/vhost/net.c
> > > > > @@ -948,6 +948,7 @@ static int vhost_net_open(struct inode *inode, 
> > > > > struct file *f)
> > > > >n->vqs[i].done_idx = 0;
> > > > >n->vqs[i].vhost_hlen = 0;
> > > > >n->vqs[i].sock_hlen = 0;
> > > > > + n->vqs[i].rx_ring = NULL;
> > > > >vhost_net_buf_init(&n->vqs[i].rxq);
> > > > >}
> > > > >vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
> > > > > --
> > > > > 2.16.2.395.g2e18187dfd-goog
> > > > 
> > > > I suspect that's not sufficient. rx ring is tied to the tap device.
> > > > I think we need to drop it every time we drop the device.
> > > Unfortunately I've no idea where is the device dropped. Are you
> > > referring to vhost_net_vq_reset()?
> > > I can fix that part if needed, but won't be able to validate it with 
> > > KMSAN.
> > I see several issues. For example in vhost_net_set_backend
> > if there's a value then rx ring will point to the
> > ring of the wrong socket.
> > Something like the below might help but we really need
> > documentation of when is rx_ring valid. Is it only valid
> > when private-data is valid?
> 
> I think so, we need keep rx_ring synced with private_data.
> 
> > If yes need to make sure
> > we reset it with private_data.
> > 
> > Also I see __skb_array_destroy_skb used with ptr_ring which
> > seems suspicious: how do we know the entries are skbs?
> 
> Good catch, will post an 

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov
On Fri, Mar 09, 2018 at 02:12:24AM +, Andy Lutomirski wrote:
> On Fri, Mar 9, 2018 at 1:20 AM, Alexei Starovoitov
>  wrote:
> > On Fri, Mar 09, 2018 at 12:59:36AM +, Andy Lutomirski wrote:
> >>
> >> Alexei, can you give an example use case?  I'm sure it's upthread
> >> somewhere, but I'm having trouble finding it.
> >
> > at the time of iptable's setsockopt() the kernel will do
> > err = request_module("bpfilter");
> > once.
> > The rough POC code:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25
> 
> Here's what I gather from reading that code: you have a new kernel
> feature (consisting of actual kernel code) that wants to defer some of
> its implementation to user mode.  I like this idea a lot.  But I have
> a suggestion for a slightly different way of accomplishing the same
> thing.  Rather than extending init_module() to accept ELF input,
> except the call_umh code to be able to call blobs.  You'd use it it
> very roughly like this:
> 
> First, compile your user code and emit a staitc binary.  Use objdump
> fiddling or a trivial .S file to make that static binary into a
> variable.  Then write a tiny shim module like this:
> 
> extern unsigned char __begin_user_code[], __end_user_code[];
> 
> int __init init_shim_module(void)
> {
>   return call_umh_blob(__begin_user_code, __end_user_code - 
> __begin_user_code);
> }
> 
> By itself, this is clearly a worse solution than yours, but it has two
> benefits, one small and two big.  The small benefit is that it is
> completely invisible to userspace: the .ko file is a bona fide module.

Unfortunately it's not quite the case.
The normal .ko that does call_umh_blob is indeed seen in lsmod,
but the umh process is a separate task.
It could have been oomed or killed by admin and
this normal .ko wouldn't notice it, so health check of umh process
by the kernel is needed regardless.
Right now bpfilter has trivial fuse-like protocol. This part
is still to be designed cleanly.

No doubt that visibility and debuggability into this umh processes
is must have, but lsmod/rmmod interface doesn't quite fit.
As you said letting this priv tasks register themselves in lsmod
is certainly no-go.
I think if they will be in lsmod, kernel has to register them
and establish health check with umh at the same time.
I think worrying about restarting is not necessary.
This is still kernel code with the same high standards and
review process. If they crash it's really a kernel bug.
It only doesn't take the system down.

> I think we don't want to end up in a situation where we ship a program
> with a .ko extension that opens something in /dev, for example.

this part I don't get. What's wrong with open of /dev ?
I don't see a use case for it, but technically why not?

> call_umh_blob() would create an anon_inode or similar object backed by
> the blob and exec it.

Interesting. I haven't considered such approach.

For full context it all started from the idea of 'unprivileged kernel modules'
or 'hardened kernel modules'. Something that kernel can easily
interact with, but much safer than traditional kernel module.

I've tried a bunch of crappy ideas first:
1. have a piece of kernel .text vm_mmap-ed into user process that
doing iptables setsockopt and on return to user space force handle_signal
to execute that code. Sort of like forced ld_preload where parasite
code is provided by the kernel but runs in user space
2. have a special set of kernel page tables in read-only mode while
iptable->bpf conversion is happening
3. have load_module() fork a user task and load real kernel .ko into it

trying to hack #3 realized that I'm mainly copy-pasting a lot of
load_elf_binary() code without elf_interpreter bits,
so figured it's much easier and simpler to blend sys_finit_module
with load_elf_binary via tweaking do_execveat_common and keeping
that .ko as normal elf which is implemented in this patch.
Debugging of that fake .ko is so much better.
If it's done via call_umh_blob() the process that starts
will indeed be a user mode process, but you won't be able to
attach gdb to it. Whereas in this patch it's normal elf and
standard debugging techniques apply. A developer can
do gdb breakpoints, debug info, etc.
That is huge advantage of keeping it as normal elf.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 7:06 PM, Linus Torvalds
 wrote:
>
> So I don't like Andy's "let's make it a kernel module and then that
> kernel module can execve() a blob". THAT seems like just stupid
> indirection.
>
> But I do like Andy's "execve a blob" part, because it is the *blob*
> that has had its signature verified, not the file!

To be honest., Andy's approach has the advantage that all the
synchronization we do for modules still works.

In particular, we have module loading logic where we synchronize
loading of modules with the same name, so that two things that do
request_module() concurrently will not lead to two modules being
loaded.

See that whole "module_mutex" thing, and the logic in (for example)
"add_unformed_module()".

Andrei's patch short-circuits the module loading before that, so if
you have two concurrent request_module() calls, you'll end up with no
serialization, and two executables.

So maybe Andy is right. He often is, after all.

 Linus


[PATCH net] vhost_net: examine pointer types during un-producing

2018-03-08 Thread Jason Wang
After commit 761876c857cb ("tap: XDP support"), we can actually
queueing XDP pointers in the pointer ring, so we should examine the
pointer type before freeing the pointer.

Fixes: 761876c857cb ("tap: XDP support")
Reported-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
 drivers/net/tun.c  | 3 ++-
 drivers/vhost/net.c| 2 +-
 include/linux/if_tun.h | 4 
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7433bb2..28cfa64 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -655,7 +655,7 @@ static struct tun_struct *tun_enable_queue(struct tun_file 
*tfile)
return tun;
 }
 
-static void tun_ptr_free(void *ptr)
+void tun_ptr_free(void *ptr)
 {
if (!ptr)
return;
@@ -667,6 +667,7 @@ static void tun_ptr_free(void *ptr)
__skb_array_destroy_skb(ptr);
}
 }
+EXPORT_SYMBOL_GPL(tun_ptr_free);
 
 static void tun_queue_purge(struct tun_file *tfile)
 {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 610cba2..54a138f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -170,7 +170,7 @@ static void vhost_net_buf_unproduce(struct 
vhost_net_virtqueue *nvq)
if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
   vhost_net_buf_get_size(rxq),
-  __skb_array_destroy_skb);
+  tun_ptr_free);
rxq->head = rxq->tail = 0;
}
 }
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index c5b0a75..fd00170 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -25,6 +25,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file);
 bool tun_is_xdp_buff(void *ptr);
 void *tun_xdp_to_ptr(void *ptr);
 void *tun_ptr_to_xdp(void *ptr);
+void tun_ptr_free(void *ptr);
 #else
 #include 
 #include 
@@ -50,5 +51,8 @@ static inline void *tun_ptr_to_xdp(void *ptr)
 {
return NULL;
 }
+static inline void tun_ptr_free(void *ptr)
+{
+}
 #endif /* CONFIG_TUN */
 #endif /* __IF_TUN_H */
-- 
2.7.4



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski



> On Mar 8, 2018, at 6:31 PM, David Miller  wrote:
> 
> From: Andy Lutomirski 
> Date: Fri, 9 Mar 2018 02:12:24 +
> 
>> First, compile your user code and emit a staitc binary.  Use objdump
>> fiddling or a trivial .S file to make that static binary into a
>> variable.  Then write a tiny shim module like this:
>> 
>> extern unsigned char __begin_user_code[], __end_user_code[];
>> 
>> int __init init_shim_module(void)
>> {
>>  return call_umh_blob(__begin_user_code, __end_user_code - 
>> __begin_user_code);
>> }
>> 
>> By itself, this is clearly a worse solution than yours, but it has two
>> benefits, one small and two big.  The small benefit is that it is
>> completely invisible to userspace: the .ko file is a bona fide module.
> 
> Anything you try to do which makes these binaries "special" is a huge
> negative.

I don’t know what you mean.  Alexei’s approach introduces a whole new kind of 
special module.  Mine doesn’t. 

> 
>> The big benefits are:
> 
> I don't see those things as benefits at all, and Alexei's scheme can
> easily be made to work in your benefit #1 case too.
> 

How?  I think you’ll find that a non-modular implementation of a bundled ELF 
binary looks a *lot* like my call_umh_blob().

> It's a user binary.  It's shipped with the kernel and it's signed.
> 
> If we can't trust that, we can't trust much else.

I’m not making any arguments about security at all. I’m talking about 
functionality. 

If we apply Alexei’s patch as is, then I think we’ll have a situation where 
ET_EXEC modules are only useful if they can do their jobs without any 
filesystem access at all.  This is fine for networking, where netlink sockets 
are used, but I think it’s not so great for other use cases. If we ever try to 
stick a usb driver into userspace, we’re going to want to instantiate the user 
task once per device, passed as stdin or similar, and Alexei’s code will make 
that very awkward.



[PATCH] net: use skb_is_gso_sctp() instead of open-coding

2018-03-08 Thread Daniel Axtens
As well as the basic conversion, I noticed that a lot of the
SCTP code checks gso_type without first checking skb_is_gso()
so I have added that where appropriate.

Also, document the helper.

Cc: Daniel Borkmann 
Cc: Marcelo Ricardo Leitner 
Signed-off-by: Daniel Axtens 

---

This depends on d02f51cbcf12 ("bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat
to deal with gso sctp skbs") which introduces the required helper.
That is in the bpf tree at the moment.
---
 Documentation/networking/segmentation-offloads.txt | 5 -
 net/core/skbuff.c  | 2 +-
 net/sched/act_csum.c   | 2 +-
 net/sctp/input.c   | 8 
 net/sctp/inqueue.c | 2 +-
 net/sctp/offload.c | 2 +-
 6 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/segmentation-offloads.txt 
b/Documentation/networking/segmentation-offloads.txt
index 23a8dd91a9ec..36bb931b35e0 100644
--- a/Documentation/networking/segmentation-offloads.txt
+++ b/Documentation/networking/segmentation-offloads.txt
@@ -155,7 +155,10 @@ Therefore, any code in the core networking stack must be 
aware of the
 possibility that gso_size will be GSO_BY_FRAGS and handle that case
 appropriately.
 
-There are a couple of helpers to make this easier:
+There are some helpers to make this easier:
+
+ - skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if
+   an skb is an SCTP GSO skb.
 
  - For size checks, the skb_gso_validate_*_len family of helpers correctly
considers GSO_BY_FRAGS.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0bb0d8877954..baf990528943 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4904,7 +4904,7 @@ static unsigned int skb_gso_transport_seglen(const struct 
sk_buff *skb)
thlen += inner_tcp_hdrlen(skb);
} else if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) {
thlen = tcp_hdrlen(skb);
-   } else if (unlikely(shinfo->gso_type & SKB_GSO_SCTP)) {
+   } else if (unlikely(skb_is_gso_sctp(skb))) {
thlen = sizeof(struct sctphdr);
}
/* UFO sets gso_size to the size of the fragmentation
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index b7ba9b06b147..24b2e8e681cf 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -350,7 +350,7 @@ static int tcf_csum_sctp(struct sk_buff *skb, unsigned int 
ihl,
 {
struct sctphdr *sctph;
 
-   if (skb_is_gso(skb) && skb_shinfo(skb)->gso_type & SKB_GSO_SCTP)
+   if (skb_is_gso(skb) && skb_is_gso_sctp(skb))
return 1;
 
sctph = tcf_csum_skb_nextlayer(skb, ihl, ipl, sizeof(*sctph));
diff --git a/net/sctp/input.c b/net/sctp/input.c
index 0247cc432e02..b381d78548ac 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -106,6 +106,7 @@ int sctp_rcv(struct sk_buff *skb)
int family;
struct sctp_af *af;
struct net *net = dev_net(skb->dev);
+   bool is_gso = skb_is_gso(skb) && skb_is_gso_sctp(skb);
 
if (skb->pkt_type != PACKET_HOST)
goto discard_it;
@@ -123,8 +124,7 @@ int sctp_rcv(struct sk_buff *skb)
 * it's better to just linearize it otherwise crc computing
 * takes longer.
 */
-   if ((!(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) &&
-skb_linearize(skb)) ||
+   if ((!is_gso && skb_linearize(skb)) ||
!pskb_may_pull(skb, sizeof(struct sctphdr)))
goto discard_it;
 
@@ -135,7 +135,7 @@ int sctp_rcv(struct sk_buff *skb)
if (skb_csum_unnecessary(skb))
__skb_decr_checksum_unnecessary(skb);
else if (!sctp_checksum_disable &&
-!(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) &&
+!is_gso &&
 sctp_rcv_checksum(net, skb) < 0)
goto discard_it;
skb->csum_valid = 1;
@@ -1218,7 +1218,7 @@ static struct sctp_association 
*__sctp_rcv_lookup_harder(struct net *net,
 * issue as packets hitting this are mostly INIT or INIT-ACK and
 * those cannot be on GSO-style anyway.
 */
-   if ((skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) == SKB_GSO_SCTP)
+   if (skb_is_gso(skb) && skb_is_gso_sctp(skb))
return NULL;
 
ch = (struct sctp_chunkhdr *)skb->data;
diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
index 48392552ee7c..23ebc5318edc 100644
--- a/net/sctp/inqueue.c
+++ b/net/sctp/inqueue.c
@@ -170,7 +170,7 @@ struct sctp_chunk *sctp_inq_pop(struct sctp_inq *queue)
 
chunk = list_entry(entry, struct sctp_chunk, list);
 
-   if ((skb_shinfo(chunk->skb)->gso_type & SKB_GSO_SCTP) == 
SKB_GSO_SCTP) {
+   if (skb_is_gso(chunk->skb) && skb_is_gso_sctp(chunk->skb)) {
/* GSO-marked skbs but without frags, handle
 * them normally
   

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 5:44 PM, Kees Cook  wrote:
>
> My concerns are mostly about crossing namespaces. If a container
> triggers an autoload, the result runs in the init_ns.

Heh. I saw that as an advantage. It's basically the same semantics as
a normal module load does - in that the "kernel namespace" is init_ns.

My own personal worry is actually different - we do check the
signature of the file we're loading, but we're then passing it off to
execve() not as the image we loaded, but as the file pointer.

So the execve() will end up not using the actual buffer we checked the
signature on, but instead just re-reading the file.

Now, that has two issues:

 (a) it means that 'init_module' doesn't work, only 'finit_module'.

Not a big deal, but I do think that we should fail the 'info->file
== NULL' case explicitly (instead of failing when it does an execve()
of /dev/null, which is what I think it does now - but that's just from
the reading the patch, not from testing it).

 (b) somebody could maybe try to time it and modify the file
after-the-fact of the signature check, and then we execute something
else.

Honestly, that "read twice" thing may be what scuttles this.
Initially, I thought it was a non-issue, because anybody who controls
the module subdirectory enough to rewrite files would be in a position
to just execute the file itself directly instead.

But it turns out that isn't needed. Some bad actor could just do
"finit_module()" with a file that they just *copied* from the module
directory.

Yes, yes, they still need CAP_SYS_MODULE to even get that far, but it
does worry me.

So this does seem to turn "CAP_SYS_MODULE" into meaning "can run any
executable as root in the init namespace". They don't have to have the
module signing key that I thought would protect us, because they can
do that "rewrite after signature check".

So that does seem like a bad problem with the approach.

So I don't like Andy's "let's make it a kernel module and then that
kernel module can execve() a blob". THAT seems like just stupid
indirection.

But I do like Andy's "execve a blob" part, because it is the *blob*
that has had its signature verified, not the file!

 Linus


[PATCH] drivers: vhost: vsock: fixed a brace coding style issue

2018-03-08 Thread Vaibhav Murkute
Fixed a coding style issue.

Signed-off-by: Vaibhav Murkute 
---
 drivers/vhost/vsock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 0d14e2ff19f1..0898dbdbf955 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -61,9 +61,9 @@ static struct vhost_vsock *__vhost_vsock_get(u32 guest_cid)
if (other_cid == 0)
continue;
 
-   if (other_cid == guest_cid) {
+   if (other_cid == guest_cid)
return vsock;
-   }
+
}
 
return NULL;
-- 
2.15.1



Re: [PATCH v2 net-next 0/2] ntuple filters with RSS

2018-03-08 Thread David Miller
From: Edward Cree 
Date: Thu, 8 Mar 2018 15:43:50 +

> This series introduces the ability to mark an ethtool steering filter to use
>  RSS spreading, and the ability to create and configure multiple RSS contexts
>  with different indirection tables, hash keys, and hash fields.
> An implementation for the sfc driver (for 7000-series and later SFC NICs) is
>  included in patch 2/2.
> 
> The anticipated use case of this feature is for steering traffic destined for
>  a container (or virtual machine) to the subset of CPUs on which processes in
>  the container (or the VM's vCPUs) are bound, while retaining the scalability
>  of RSS spreading from the viewpoint inside the container.
> The use of both a base queue number (ring_cookie) and indirection table is
>  intended to allow re-use of a single RSS context to target multiple sets of
>  CPUs.  For instance, if an 8-core system is hosting three containers on CPUs
>  [1,2], [3,4] and [6,7], then a single RSS context with an equal-weight [0,1]
>  indirection table could be used to target all three containers by setting
>  ring_cookie to 1, 3 and 6 on the respective filters.
> 
> v2: Initialised ctx in efx_ef10_filter_insert() to avoid (false positive) gcc
>  warning.

Series applied, thanks Edward.


Re: [RFC PATCH linux-next] rds: rds_info_from_znotifier() can be static

2018-03-08 Thread David Miller
From: kbuild test robot 
Date: Thu, 8 Mar 2018 19:37:30 +0800

> Fixes: 9426bbc6de99 ("rds: use list structure to track information for 
> zerocopy completion notification")
> Signed-off-by: Fengguang Wu 

Applied.


Re: [RFC PATCH linux-next] rds: rds_message_zcopy_from_user() can be static

2018-03-08 Thread David Miller
From: Sowmini Varadhan 
Date: Thu, 8 Mar 2018 06:25:52 -0500

> On (03/08/18 18:56), kbuild test robot wrote:
>> 
>> Fixes: d40a126b16ea ("rds: refactor zcopy code into 
>> rds_message_zcopy_from_user")
>> Signed-off-by: Fengguang Wu 
> 
> Acked-by: Sowmini Varadhan 
> 
> (do I need to separately submit a non-RFC patch for this?)

You don't have to resubmit in a situation like this.

Applied, thanks everyone.


Re: [PATCH net-next] net: stmmac: remove superfluous wmb() memory barriers

2018-03-08 Thread David Miller
From: Niklas Cassel 
Date: Thu,  8 Mar 2018 11:30:05 +0100

> These wmb() memory barriers are performed after the last descriptor write,
> and they are followed by enable_dma_transmission()/set_tx_tail_ptr(),
> i.e. a writel() to MMIO register space.
> Since writel() itself performs the equivalent of a wmb() before doing the
> actual write, these barriers are superfluous, and removing them should
> thus not change any existing behavior.
> 
> Ordering within the descriptor writes is already ensured with dma_wmb()
> barriers inside prepare_tx_desc(first, ..)/prepare_tso_tx_desc(first, ..).
> 
> Signed-off-by: Niklas Cassel 

Please allow me some time to consider this issue a little bit more
before applying this patch.

Thank you Niklas.


Re: [PATCH 1/2 net-next] net/ncsi: use kfree_skb() instead of kfree()

2018-03-08 Thread David Miller
From: Dan Carpenter 
Date: Thu, 8 Mar 2018 12:36:04 +0300

> We're supposed to use kfree_skb() to free these sk_buffs.
> 
> Fixes: 955dc68cb9b2 ("net/ncsi: Add generic netlink family")
> Signed-off-by: Dan Carpenter 

Applied.


Re: [PATCH 2/2 net-next] net/ncsi: unlock on error in ncsi_set_interface_nl()

2018-03-08 Thread David Miller
From: Dan Carpenter 
Date: Thu, 8 Mar 2018 12:36:28 +0300

> There are two error paths which are missing unlocks in this function.
> 
> Fixes: 955dc68cb9b2 ("net/ncsi: Add generic netlink family")
> Signed-off-by: Dan Carpenter 

Also applied, thanks Dan.


Re: [PATCH net-next] liquidio: fix ndo_change_mtu to always return correct status to the caller

2018-03-08 Thread David Miller
From: Felix Manlunas 
Date: Wed, 7 Mar 2018 22:36:46 -0800

> + /* command is successful, change the MTU. */
> + netif_info(lio, probe, lio->netdev, "MTU changed from %d to %d\n",
> +netdev->mtu, new_mtu);

Please do not do this, this log message is excessive.

The success return of the MTU change is enough of an indicator
for the user.


Re: [PATCH net-next] liquidio: avoid doing useless work

2018-03-08 Thread David Miller
From: Felix Manlunas 
Date: Wed, 7 Mar 2018 22:23:32 -0800

> From: Prasad Kanneganti 
> 
> Avoid doing useless work by making sure that the response_list is not empty
> before scheduling work to process it.
> 
> Signed-off-by: Prasad Kanneganti 
> Signed-off-by: Felix Manlunas 

Applied.


Re: [PATCH net-next] liquidio: Resolved mbox read issue while reading more than one 64bit data

2018-03-08 Thread David Miller
From: Felix Manlunas 
Date: Wed, 7 Mar 2018 22:12:24 -0800

> From: Intiyaz Basha 
> 
> Corrected length check when data received in the mbox is more than one
> 64 bit data value
> 
> Signed-off-by: Intiyaz Basha 
> Signed-off-by: Felix Manlunas 

Applied.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread David Miller
From: Andy Lutomirski 
Date: Fri, 9 Mar 2018 02:12:24 +

> First, compile your user code and emit a staitc binary.  Use objdump
> fiddling or a trivial .S file to make that static binary into a
> variable.  Then write a tiny shim module like this:
> 
> extern unsigned char __begin_user_code[], __end_user_code[];
> 
> int __init init_shim_module(void)
> {
>   return call_umh_blob(__begin_user_code, __end_user_code - 
> __begin_user_code);
> }
> 
> By itself, this is clearly a worse solution than yours, but it has two
> benefits, one small and two big.  The small benefit is that it is
> completely invisible to userspace: the .ko file is a bona fide module.

Anything you try to do which makes these binaries "special" is a huge
negative.

> The big benefits are:

I don't see those things as benefits at all, and Alexei's scheme can
easily be made to work in your benefit #1 case too.

It's a user binary.  It's shipped with the kernel and it's signed.

If we can't trust that, we can't trust much else.

And this whole container argument..  It's a mirage.

Kernel modules are 1000 times worse, since they can access any
container and any namespace they want.


Re: [PATCH] vhost_net: initialize rx_ring in vhost_net_open()

2018-03-08 Thread Jason Wang



On 2018年03月09日 00:00, Michael S. Tsirkin wrote:

On Thu, Mar 08, 2018 at 04:55:39PM +0100, Alexander Potapenko wrote:

On Thu, Mar 8, 2018 at 4:33 PM, Michael S. Tsirkin  wrote:

On Thu, Mar 08, 2018 at 02:37:17PM +0100, Alexander Potapenko wrote:

KMSAN reported a use of uninit memory in vhost_net_buf_unproduce()
while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring:

==
BUG: KMSAN: use of uninitialized memory in vhost_net_buf_unproduce+0x7bb/0x9a0 
drivers/vho
et.c:170
CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
  kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
  vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170
  vhost_net_stop_vq drivers/vhost/net.c:974 [inline]
  vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982
  vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015
  __fput+0x49f/0xa00 fs/file_table.c:209
  fput+0x37/0x40 fs/file_table.c:243
  task_work_run+0x243/0x2c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:191 [inline]
  exit_to_usermode_loop arch/x86/entry/common.c:166 [inline]
  prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196
  syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265
  do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292
...
origin:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline]
  kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213
  kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392
  kmalloc_large_node_hook mm/slub.c:1366 [inline]
  kmalloc_large_node mm/slub.c:3808 [inline]
  __kmalloc_node+0x100e/0x1290 mm/slub.c:3818
  kmalloc_node include/linux/slab.h:554 [inline]
  kvmalloc_node+0x1a5/0x2e0 mm/util.c:419
  kvmalloc include/linux/mm.h:541 [inline]
  vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921
  misc_open+0x7b5/0x8b0 drivers/char/misc.c:154
  chrdev_open+0xc28/0xd90 fs/char_dev.c:417
  do_dentry_open+0xccb/0x1430 fs/open.c:752
  vfs_open+0x272/0x2e0 fs/open.c:866
  do_last fs/namei.c:3378 [inline]
  path_openat+0x49ad/0x6580 fs/namei.c:3519
  do_filp_open+0x267/0x640 fs/namei.c:3553
  do_sys_open+0x6ad/0x9c0 fs/open.c:1059
  SYSC_openat+0xc7/0xe0 fs/open.c:1086
  SyS_openat+0x63/0x90 fs/open.c:1080
  do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287
==

Signed-off-by: Alexander Potapenko 
---
  drivers/vhost/net.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 610cba276d47..60f1080bffc7 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -948,6 +948,7 @@ static int vhost_net_open(struct inode *inode, struct file 
*f)
   n->vqs[i].done_idx = 0;
   n->vqs[i].vhost_hlen = 0;
   n->vqs[i].sock_hlen = 0;
+ n->vqs[i].rx_ring = NULL;
   vhost_net_buf_init(&n->vqs[i].rxq);
   }
   vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
--
2.16.2.395.g2e18187dfd-goog


I suspect that's not sufficient. rx ring is tied to the tap device.
I think we need to drop it every time we drop the device.

Unfortunately I've no idea where is the device dropped. Are you
referring to vhost_net_vq_reset()?
I can fix that part if needed, but won't be able to validate it with KMSAN.

I see several issues. For example in vhost_net_set_backend
if there's a value then rx ring will point to the
ring of the wrong socket.
Something like the below might help but we really need
documentation of when is rx_ring valid. Is it only valid
when private-data is valid?


I think so, we need keep rx_ring synced with private_data.


If yes need to make sure
we reset it with private_data.

Also I see __skb_array_destroy_skb used with ptr_ring which
seems suspicious: how do we know the entries are skbs?


Good catch, will post an independent patch to fix this.



Patch below is on top of yours, and

Signed-off-by: Michael S. Tsirkin 

But I really would like Jason to look and come up with a
patch to address all these issues.

---

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 610cba2..7a65b69 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -972,6 +973,7 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
vhost_net_disable_vq(n, vq);
vq->private_data = NULL;
vhost_net_buf_unproduce(nvq);
+   vq->rx_ring = NULL;
mutex_unlock(&vq->mutex);
return sock;
  }
@@ -1161,8 +1163,6 @@ static long vhost_net_set_backend(struct vhost_net *n, 
unsigned index, int fd)
vhost_net_disable_vq(n, vq);
vq->private_data = sock;
vhost_net_buf_unproduce(nvq);
-   if (index == VHOST_NET_VQ_RX)
- 

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 1:20 AM, Alexei Starovoitov
 wrote:
> On Fri, Mar 09, 2018 at 12:59:36AM +, Andy Lutomirski wrote:
>>
>> Alexei, can you give an example use case?  I'm sure it's upthread
>> somewhere, but I'm having trouble finding it.
>
> at the time of iptable's setsockopt() the kernel will do
> err = request_module("bpfilter");
> once.
> The rough POC code:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25

Here's what I gather from reading that code: you have a new kernel
feature (consisting of actual kernel code) that wants to defer some of
its implementation to user mode.  I like this idea a lot.  But I have
a suggestion for a slightly different way of accomplishing the same
thing.  Rather than extending init_module() to accept ELF input,
except the call_umh code to be able to call blobs.  You'd use it it
very roughly like this:

First, compile your user code and emit a staitc binary.  Use objdump
fiddling or a trivial .S file to make that static binary into a
variable.  Then write a tiny shim module like this:

extern unsigned char __begin_user_code[], __end_user_code[];

int __init init_shim_module(void)
{
  return call_umh_blob(__begin_user_code, __end_user_code - __begin_user_code);
}

By itself, this is clearly a worse solution than yours, but it has two
benefits, one small and two big.  The small benefit is that it is
completely invisible to userspace: the .ko file is a bona fide module.
The big benefits are:

1. It works even in a non-modular kernel!  (Okay, it probably only
works if you can arrange for the built-in module to be initialized
late enough, but that's straightforward.)

2. It allows future extensions to change the way the glue works.  For
example, maybe you want the module to integrate properly with lsmod,
etc.  Rather than adding a mechanism for general privileged programs
to register themselves with lsmod (ick!), you could do it entirely in
the kernel where lsmod would know that a particular umh task is
special.  More usefully, you could extend call_umh_blob() to pass in
some pre-initialized struct files, which would give a clean way to
*synchronously* create a communication channel to user code for
whatever service the user code provides.  And it would be more
straightforward to make the umh blob do what it needs to do without
relying on any particular filesystems being mounted.

I think we don't want to end up in a situation where we ship a program
with a .ko extension that opens something in /dev, for example.

call_umh_blob() would create an anon_inode or similar object backed by
the blob and exec it.


[PATCH net v2] openvswitch: meter: fix the incorrect calculation of max delta_t

2018-03-08 Thread zhangliping
From: zhangliping 

Max delat_t should be the full_bucket/rate instead of the full_bucket.
Also report EINVAL if the rate is zero.

Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
Cc: Andy Zhou 
Signed-off-by: zhangliping 
---
 V2: report EINVAL if the rate is 0 to avoid divide by zero error.

 net/openvswitch/meter.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
index 04b94281a30b..b891a91577f8 100644
--- a/net/openvswitch/meter.c
+++ b/net/openvswitch/meter.c
@@ -242,14 +242,20 @@ static struct dp_meter *dp_meter_create(struct nlattr **a)
 
band->type = nla_get_u32(attr[OVS_BAND_ATTR_TYPE]);
band->rate = nla_get_u32(attr[OVS_BAND_ATTR_RATE]);
+   if (band->rate == 0) {
+   err = -EINVAL;
+   goto exit_free_meter;
+   }
+
band->burst_size = nla_get_u32(attr[OVS_BAND_ATTR_BURST]);
/* Figure out max delta_t that is enough to fill any bucket.
 * Keep max_delta_t size to the bucket units:
 * pkts => 1/1000 packets, kilobits => bits.
+*
+* Start with a full bucket.
 */
-   band_max_delta_t = (band->burst_size + band->rate) * 1000;
-   /* Start with a full bucket. */
-   band->bucket = band_max_delta_t;
+   band->bucket = (band->burst_size + band->rate) * 1000;
+   band_max_delta_t = band->bucket / band->rate;
if (band_max_delta_t > meter->max_delta_t)
meter->max_delta_t = band_max_delta_t;
band++;
-- 
2.13.4




[PATCH net-next 1/9] net: hns3: refactor the hclge_get/set_rss function

2018-03-08 Thread Peng Li
From: Yunsheng Lin 

This patch refactors the hclge_get/set_rss function in
order to fix the rss configuration loss problem during
reset process.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 39 --
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  2 ++
 2 files changed, 9 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 39bc0e6..cd5b040 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2979,31 +2979,6 @@ static u32 hclge_get_rss_indir_size(struct hnae3_handle 
*handle)
return HCLGE_RSS_IND_TBL_SIZE;
 }
 
-static int hclge_get_rss_algo(struct hclge_dev *hdev)
-{
-   struct hclge_rss_config_cmd *req;
-   struct hclge_desc desc;
-   int rss_hash_algo;
-   int ret;
-
-   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_RSS_GENERIC_CONFIG, true);
-
-   ret = hclge_cmd_send(&hdev->hw, &desc, 1);
-   if (ret) {
-   dev_err(&hdev->pdev->dev,
-   "Get link status error, status =%d\n", ret);
-   return ret;
-   }
-
-   req = (struct hclge_rss_config_cmd *)desc.data;
-   rss_hash_algo = (req->hash_config & HCLGE_RSS_HASH_ALGO_MASK);
-
-   if (rss_hash_algo == HCLGE_RSS_HASH_ALGO_TOEPLITZ)
-   return ETH_RSS_HASH_TOP;
-
-   return -EINVAL;
-}
-
 static int hclge_set_rss_algo_key(struct hclge_dev *hdev,
  const u8 hfunc, const u8 *key)
 {
@@ -3042,7 +3017,7 @@ static int hclge_set_rss_algo_key(struct hclge_dev *hdev,
return 0;
 }
 
-static int hclge_set_rss_indir_table(struct hclge_dev *hdev, const u32 *indir)
+static int hclge_set_rss_indir_table(struct hclge_dev *hdev, const u8 *indir)
 {
struct hclge_rss_indirection_table_cmd *req;
struct hclge_desc desc;
@@ -3138,12 +3113,11 @@ static int hclge_get_rss(struct hnae3_handle *handle, 
u32 *indir,
 u8 *key, u8 *hfunc)
 {
struct hclge_vport *vport = hclge_get_vport(handle);
-   struct hclge_dev *hdev = vport->back;
int i;
 
/* Get hash algorithm */
if (hfunc)
-   *hfunc = hclge_get_rss_algo(hdev);
+   *hfunc = vport->rss_algo;
 
/* Get the RSS Key required by the user */
if (key)
@@ -3167,8 +3141,6 @@ static int hclge_set_rss(struct hnae3_handle *handle, 
const u32 *indir,
 
/* Set the RSS Hash Key if specififed by the user */
if (key) {
-   /* Update the shadow RSS key with user specified qids */
-   memcpy(vport->rss_hash_key, key, HCLGE_RSS_KEY_SIZE);
 
if (hfunc == ETH_RSS_HASH_TOP ||
hfunc == ETH_RSS_HASH_NO_CHANGE)
@@ -3178,6 +3150,10 @@ static int hclge_set_rss(struct hnae3_handle *handle, 
const u32 *indir,
ret = hclge_set_rss_algo_key(hdev, hash_algo, key);
if (ret)
return ret;
+
+   /* Update the shadow RSS key with user specified qids */
+   memcpy(vport->rss_hash_key, key, HCLGE_RSS_KEY_SIZE);
+   vport->rss_algo = hash_algo;
}
 
/* Update the shadow RSS table with user specified qids */
@@ -3185,8 +3161,7 @@ static int hclge_set_rss(struct hnae3_handle *handle, 
const u32 *indir,
vport->rss_indirection_tbl[i] = indir[i];
 
/* Update the hardware */
-   ret = hclge_set_rss_indir_table(hdev, indir);
-   return ret;
+   return hclge_set_rss_indir_table(hdev, vport->rss_indirection_tbl);
 }
 
 static u8 hclge_get_rss_hash_bits(struct ethtool_rxnfc *nfc)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index d99a76a..7e762c4 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -579,6 +579,8 @@ struct hclge_vport {
u8  rss_hash_key[HCLGE_RSS_KEY_SIZE]; /* User configured hash keys */
/* User configured lookup table entries */
u8  rss_indirection_tbl[HCLGE_RSS_IND_TBL_SIZE];
+   int rss_algo;   /* User configured hash algorithm */
+
u16 alloc_rss_size;
 
u16 qs_offset;
-- 
2.9.3



[PATCH net-next 4/9] net: hns3: fix for pause configuration lost during reset

2018-03-08 Thread Peng Li
From: Yunsheng Lin 

Pause configuration will be set to default value by hclge_tm_schd_init
during reset, which causes the RSS configuration loss problem.

This patch fixes it by calling hclge_tm_init_hw during reset process
, which will set the pause configuration to default value.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 34a0905..109155d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5493,9 +5493,9 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev)
return ret;
}
 
-   ret = hclge_tm_schd_init(hdev);
+   ret = hclge_tm_init_hw(hdev);
if (ret) {
-   dev_err(&pdev->dev, "tm schd init fail, ret =%d\n", ret);
+   dev_err(&pdev->dev, "tm init hw fail, ret =%d\n", ret);
return ret;
}
 
-- 
2.9.3



[PATCH net-next 5/9] net: hns3: fix for use-after-free when setting ring parameter

2018-03-08 Thread Peng Li
From: Yunsheng Lin 

In hns3_set_ringparam, hns3_uninit_all_ring frees the
memory pointed by priv->ring_data[i].ring, and
hns3_change_all_ring_bd_num use that pointer without mallocing,
which will cause a use-after-free problem.

The patch fixes it by not freeing the memory in
hns3_uninit_all_ring, and uses hns3_put_ring_config to free it
when necessary.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index c936945..e0ab161 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2967,13 +2967,8 @@ int hns3_uninit_all_ring(struct hns3_nic_priv *priv)
h->ae_algo->ops->reset_queue(h, i);
 
hns3_fini_ring(priv->ring_data[i].ring);
-   devm_kfree(priv->dev, priv->ring_data[i].ring);
hns3_fini_ring(priv->ring_data[i + h->kinfo.num_tqps].ring);
-   devm_kfree(priv->dev,
-  priv->ring_data[i + h->kinfo.num_tqps].ring);
}
-   devm_kfree(priv->dev, priv->ring_data);
-
return 0;
 }
 
@@ -3111,6 +3106,8 @@ static void hns3_client_uninit(struct hnae3_handle 
*handle, bool reset)
if (ret)
netdev_err(netdev, "uninit ring error\n");
 
+   hns3_put_ring_config(priv);
+
priv->ring_data = NULL;
 
free_netdev(netdev);
@@ -3316,6 +3313,8 @@ static int hns3_reset_notify_uninit_enet(struct 
hnae3_handle *handle)
if (ret)
netdev_err(netdev, "uninit ring error\n");
 
+   hns3_put_ring_config(priv);
+
priv->ring_data = NULL;
 
return ret;
@@ -3422,6 +3421,7 @@ int hns3_set_channels(struct net_device *netdev,
}
 
hns3_uninit_all_ring(priv);
+   hns3_put_ring_config(priv);
 
org_tqp_num = h->kinfo.num_tqps;
ret = hns3_modify_tqp_num(netdev, new_tqp_num);
-- 
2.9.3



[PATCH net-next 8/9] net: hns3: refactor the coalesce related struct

2018-03-08 Thread Peng Li
From: Yunsheng Lin 

This patch refoctors the coalesce related struct by introducing
the hns3_enet_coalesce struct, in order to fix the coalesce
configuation lost problem when changing the channel number.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 46 +++---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h| 10 +++--
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 26 +++-
 3 files changed, 46 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 07dbebb..f466f60 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -168,8 +168,8 @@ void hns3_set_vector_coalesce_rl(struct 
hns3_enet_tqp_vector *tqp_vector,
 * GL and RL(Rate Limiter) are 2 ways to acheive interrupt coalescing
 */
 
-   if (rl_reg > 0 && !tqp_vector->tx_group.gl_adapt_enable &&
-   !tqp_vector->rx_group.gl_adapt_enable)
+   if (rl_reg > 0 && !tqp_vector->tx_group.coal.gl_adapt_enable &&
+   !tqp_vector->rx_group.coal.gl_adapt_enable)
/* According to the hardware, the range of rl_reg is
 * 0-59 and the unit is 4.
 */
@@ -205,17 +205,17 @@ static void hns3_vector_gl_rl_init(struct 
hns3_enet_tqp_vector *tqp_vector,
 */
 
/* Default: enable interrupt coalescing self-adaptive and GL */
-   tqp_vector->tx_group.gl_adapt_enable = 1;
-   tqp_vector->rx_group.gl_adapt_enable = 1;
+   tqp_vector->tx_group.coal.gl_adapt_enable = 1;
+   tqp_vector->rx_group.coal.gl_adapt_enable = 1;
 
-   tqp_vector->tx_group.int_gl = HNS3_INT_GL_50K;
-   tqp_vector->rx_group.int_gl = HNS3_INT_GL_50K;
+   tqp_vector->tx_group.coal.int_gl = HNS3_INT_GL_50K;
+   tqp_vector->rx_group.coal.int_gl = HNS3_INT_GL_50K;
 
/* Default: disable RL */
h->kinfo.int_rl_setting = 0;
 
-   tqp_vector->rx_group.flow_level = HNS3_FLOW_LOW;
-   tqp_vector->tx_group.flow_level = HNS3_FLOW_LOW;
+   tqp_vector->rx_group.coal.flow_level = HNS3_FLOW_LOW;
+   tqp_vector->tx_group.coal.flow_level = HNS3_FLOW_LOW;
 }
 
 static void hns3_vector_gl_rl_init_hw(struct hns3_enet_tqp_vector *tqp_vector,
@@ -224,9 +224,9 @@ static void hns3_vector_gl_rl_init_hw(struct 
hns3_enet_tqp_vector *tqp_vector,
struct hnae3_handle *h = priv->ae_handle;
 
hns3_set_vector_coalesce_tx_gl(tqp_vector,
-  tqp_vector->tx_group.int_gl);
+  tqp_vector->tx_group.coal.int_gl);
hns3_set_vector_coalesce_rx_gl(tqp_vector,
-  tqp_vector->rx_group.int_gl);
+  tqp_vector->rx_group.coal.int_gl);
hns3_set_vector_coalesce_rl(tqp_vector, h->kinfo.int_rl_setting);
 }
 
@@ -2393,12 +2393,12 @@ static bool hns3_get_new_int_gl(struct 
hns3_enet_ring_group *ring_group)
u16 new_int_gl;
int usecs;
 
-   if (!ring_group->int_gl)
+   if (!ring_group->coal.int_gl)
return false;
 
if (ring_group->total_packets == 0) {
-   ring_group->int_gl = HNS3_INT_GL_50K;
-   ring_group->flow_level = HNS3_FLOW_LOW;
+   ring_group->coal.int_gl = HNS3_INT_GL_50K;
+   ring_group->coal.flow_level = HNS3_FLOW_LOW;
return true;
}
 
@@ -2408,10 +2408,10 @@ static bool hns3_get_new_int_gl(struct 
hns3_enet_ring_group *ring_group)
 * 20-1249MB/s high  (18000 ints/s)
 * > 4pps  ultra (8000 ints/s)
 */
-   new_flow_level = ring_group->flow_level;
-   new_int_gl = ring_group->int_gl;
+   new_flow_level = ring_group->coal.flow_level;
+   new_int_gl = ring_group->coal.int_gl;
tqp_vector = ring_group->ring->tqp_vector;
-   usecs = (ring_group->int_gl << 1);
+   usecs = (ring_group->coal.int_gl << 1);
bytes_per_usecs = ring_group->total_bytes / usecs;
/* 100 microseconds */
packets_per_secs = ring_group->total_packets * 100 / usecs;
@@ -2458,9 +2458,9 @@ static bool hns3_get_new_int_gl(struct 
hns3_enet_ring_group *ring_group)
 
ring_group->total_bytes = 0;
ring_group->total_packets = 0;
-   ring_group->flow_level = new_flow_level;
-   if (new_int_gl != ring_group->int_gl) {
-   ring_group->int_gl = new_int_gl;
+   ring_group->coal.flow_level = new_flow_level;
+   if (new_int_gl != ring_group->coal.int_gl) {
+   ring_group->coal.int_gl = new_int_gl;
return true;
}
return false;
@@ -2472,18 +2472,18 @@ static void hns3_update_new_int_gl(struct 
hns3_enet_tqp_vector *tqp_vector)
struct hns3_enet_ring_group *tx_group = &tqp_vector->tx_group;
bool rx_update, tx_upda

[PATCH net-next 7/9] net: hns3: fix for coalesce configuration lost during reset

2018-03-08 Thread Peng Li
From: Yunsheng Lin 

Coalesce configuration will be set to default value by
hns3_nic_init_vector_data during reset, which causes the
coalesce configuration loss problem.

This patch fixes it by setting the default value in
hns3_nic_alloc_vector_data, which will not be called in the
reset process.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 156 +---
 1 file changed, 114 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index bbaa21f..07dbebb 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -211,19 +211,25 @@ static void hns3_vector_gl_rl_init(struct 
hns3_enet_tqp_vector *tqp_vector,
tqp_vector->tx_group.int_gl = HNS3_INT_GL_50K;
tqp_vector->rx_group.int_gl = HNS3_INT_GL_50K;
 
-   hns3_set_vector_coalesce_tx_gl(tqp_vector,
-  tqp_vector->tx_group.int_gl);
-   hns3_set_vector_coalesce_rx_gl(tqp_vector,
-  tqp_vector->rx_group.int_gl);
-
/* Default: disable RL */
h->kinfo.int_rl_setting = 0;
-   hns3_set_vector_coalesce_rl(tqp_vector, h->kinfo.int_rl_setting);
 
tqp_vector->rx_group.flow_level = HNS3_FLOW_LOW;
tqp_vector->tx_group.flow_level = HNS3_FLOW_LOW;
 }
 
+static void hns3_vector_gl_rl_init_hw(struct hns3_enet_tqp_vector *tqp_vector,
+ struct hns3_nic_priv *priv)
+{
+   struct hnae3_handle *h = priv->ae_handle;
+
+   hns3_set_vector_coalesce_tx_gl(tqp_vector,
+  tqp_vector->tx_group.int_gl);
+   hns3_set_vector_coalesce_rx_gl(tqp_vector,
+  tqp_vector->rx_group.int_gl);
+   hns3_set_vector_coalesce_rl(tqp_vector, h->kinfo.int_rl_setting);
+}
+
 static int hns3_nic_set_real_num_queue(struct net_device *netdev)
 {
struct hnae3_handle *h = hns3_get_handle(netdev);
@@ -2625,32 +2631,18 @@ static int hns3_nic_init_vector_data(struct 
hns3_nic_priv *priv)
struct hnae3_ring_chain_node vector_ring_chain;
struct hnae3_handle *h = priv->ae_handle;
struct hns3_enet_tqp_vector *tqp_vector;
-   struct hnae3_vector_info *vector;
-   struct pci_dev *pdev = h->pdev;
-   u16 tqp_num = h->kinfo.num_tqps;
-   u16 vector_num;
int ret = 0;
u16 i;
 
-   /* RSS size, cpu online and vector_num should be the same */
-   /* Should consider 2p/4p later */
-   vector_num = min_t(u16, num_online_cpus(), tqp_num);
-   vector = devm_kcalloc(&pdev->dev, vector_num, sizeof(*vector),
- GFP_KERNEL);
-   if (!vector)
-   return -ENOMEM;
-
-   vector_num = h->ae_algo->ops->get_vector(h, vector_num, vector);
-
-   priv->vector_num = vector_num;
-   priv->tqp_vector = (struct hns3_enet_tqp_vector *)
-   devm_kcalloc(&pdev->dev, vector_num, sizeof(*priv->tqp_vector),
-GFP_KERNEL);
-   if (!priv->tqp_vector)
-   return -ENOMEM;
+   for (i = 0; i < priv->vector_num; i++) {
+   tqp_vector = &priv->tqp_vector[i];
+   hns3_vector_gl_rl_init_hw(tqp_vector, priv);
+   tqp_vector->num_tqps = 0;
+   }
 
-   for (i = 0; i < tqp_num; i++) {
-   u16 vector_i = i % vector_num;
+   for (i = 0; i < h->kinfo.num_tqps; i++) {
+   u16 vector_i = i % priv->vector_num;
+   u16 tqp_num = h->kinfo.num_tqps;
 
tqp_vector = &priv->tqp_vector[vector_i];
 
@@ -2660,52 +2652,94 @@ static int hns3_nic_init_vector_data(struct 
hns3_nic_priv *priv)
hns3_add_ring_to_group(&tqp_vector->rx_group,
   priv->ring_data[i + tqp_num].ring);
 
-   tqp_vector->idx = vector_i;
-   tqp_vector->mask_addr = vector[vector_i].io_addr;
-   tqp_vector->vector_irq = vector[vector_i].vector;
-   tqp_vector->num_tqps++;
-
priv->ring_data[i].ring->tqp_vector = tqp_vector;
priv->ring_data[i + tqp_num].ring->tqp_vector = tqp_vector;
+   tqp_vector->num_tqps++;
}
 
-   for (i = 0; i < vector_num; i++) {
+   for (i = 0; i < priv->vector_num; i++) {
tqp_vector = &priv->tqp_vector[i];
 
tqp_vector->rx_group.total_bytes = 0;
tqp_vector->rx_group.total_packets = 0;
tqp_vector->tx_group.total_bytes = 0;
tqp_vector->tx_group.total_packets = 0;
-   hns3_vector_gl_rl_init(tqp_vector, priv);
tqp_vector->handle = h;
 
ret = hns3_get_vector_ring_chain(tqp_vector,
 &vector_ring_chain);
   

[PATCH net-next 0/9] fixes for configuration lost problems

2018-03-08 Thread Peng Li
This patchset refactors some functions and some bugs in order
to fix the configuration loss problem when resetting and
setting channel number.

Yunsheng Lin (9):
  net: hns3: refactor the hclge_get/set_rss function
  net: hns3: refactor the hclge_get/set_rss_tuple function
  net: hns3: fix for RSS configuration loss problem during reset
  net: hns3: fix for pause configuration lost during reset
  net: hns3: fix for use-after-free when setting ring parameter
  net: hns3: refactor the get/put_vector function
  net: hns3: fix for coalesce configuration lost during reset
  net: hns3: refactor the coalesce related struct
  net: hns3: fix for coal configuation lost when setting the channel

 drivers/net/ethernet/hisilicon/hns3/hnae3.h|   3 +
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 249 +++--
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|  10 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  26 ++-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c |   2 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 235 +--
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  16 ++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  |  12 +-
 8 files changed, 351 insertions(+), 202 deletions(-)

-- 
2.9.3



[PATCH net-next 2/9] net: hns3: refactor the hclge_get/set_rss_tuple function

2018-03-08 Thread Peng Li
From: Yunsheng Lin 

This patch refactors the hclge_get/set_rss_tuple function
in order to fix the rss configuration loss problem during
reset process.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 91 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 13 
 2 files changed, 67 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index cd5b040..747cc8f 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -3091,14 +3091,16 @@ static int hclge_set_rss_input_tuple(struct hclge_dev 
*hdev)
hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_RSS_INPUT_TUPLE, false);
 
req = (struct hclge_rss_input_tuple_cmd *)desc.data;
-   req->ipv4_tcp_en = HCLGE_RSS_INPUT_TUPLE_OTHER;
-   req->ipv4_udp_en = HCLGE_RSS_INPUT_TUPLE_OTHER;
-   req->ipv4_sctp_en = HCLGE_RSS_INPUT_TUPLE_SCTP;
-   req->ipv4_fragment_en = HCLGE_RSS_INPUT_TUPLE_OTHER;
-   req->ipv6_tcp_en = HCLGE_RSS_INPUT_TUPLE_OTHER;
-   req->ipv6_udp_en = HCLGE_RSS_INPUT_TUPLE_OTHER;
-   req->ipv6_sctp_en = HCLGE_RSS_INPUT_TUPLE_SCTP;
-   req->ipv6_fragment_en = HCLGE_RSS_INPUT_TUPLE_OTHER;
+
+   /* Get the tuple cfg from pf */
+   req->ipv4_tcp_en = hdev->vport[0].rss_tuple_sets.ipv4_tcp_en;
+   req->ipv4_udp_en = hdev->vport[0].rss_tuple_sets.ipv4_udp_en;
+   req->ipv4_sctp_en = hdev->vport[0].rss_tuple_sets.ipv4_sctp_en;
+   req->ipv4_fragment_en = hdev->vport[0].rss_tuple_sets.ipv4_fragment_en;
+   req->ipv6_tcp_en = hdev->vport[0].rss_tuple_sets.ipv6_tcp_en;
+   req->ipv6_udp_en = hdev->vport[0].rss_tuple_sets.ipv6_udp_en;
+   req->ipv6_sctp_en = hdev->vport[0].rss_tuple_sets.ipv6_sctp_en;
+   req->ipv6_fragment_en = hdev->vport[0].rss_tuple_sets.ipv6_fragment_en;
ret = hclge_cmd_send(&hdev->hw, &desc, 1);
if (ret) {
dev_err(&hdev->pdev->dev,
@@ -3204,15 +3206,16 @@ static int hclge_set_rss_tuple(struct hnae3_handle 
*handle,
return -EINVAL;
 
req = (struct hclge_rss_input_tuple_cmd *)desc.data;
-   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_RSS_INPUT_TUPLE, true);
-   ret = hclge_cmd_send(&hdev->hw, &desc, 1);
-   if (ret) {
-   dev_err(&hdev->pdev->dev,
-   "Read rss tuple fail, status = %d\n", ret);
-   return ret;
-   }
+   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_RSS_INPUT_TUPLE, false);
 
-   hclge_cmd_reuse_desc(&desc, false);
+   req->ipv4_tcp_en = vport->rss_tuple_sets.ipv4_tcp_en;
+   req->ipv4_udp_en = vport->rss_tuple_sets.ipv4_udp_en;
+   req->ipv4_sctp_en = vport->rss_tuple_sets.ipv4_sctp_en;
+   req->ipv4_fragment_en = vport->rss_tuple_sets.ipv4_fragment_en;
+   req->ipv6_tcp_en = vport->rss_tuple_sets.ipv6_tcp_en;
+   req->ipv6_udp_en = vport->rss_tuple_sets.ipv6_udp_en;
+   req->ipv6_sctp_en = vport->rss_tuple_sets.ipv6_sctp_en;
+   req->ipv6_fragment_en = vport->rss_tuple_sets.ipv6_fragment_en;
 
tuple_sets = hclge_get_rss_hash_bits(nfc);
switch (nfc->flow_type) {
@@ -3249,52 +3252,49 @@ static int hclge_set_rss_tuple(struct hnae3_handle 
*handle,
}
 
ret = hclge_cmd_send(&hdev->hw, &desc, 1);
-   if (ret)
+   if (ret) {
dev_err(&hdev->pdev->dev,
"Set rss tuple fail, status = %d\n", ret);
+   return ret;
+   }
 
-   return ret;
+   vport->rss_tuple_sets.ipv4_tcp_en = req->ipv4_tcp_en;
+   vport->rss_tuple_sets.ipv4_udp_en = req->ipv4_udp_en;
+   vport->rss_tuple_sets.ipv4_sctp_en = req->ipv4_sctp_en;
+   vport->rss_tuple_sets.ipv4_fragment_en = req->ipv4_fragment_en;
+   vport->rss_tuple_sets.ipv6_tcp_en = req->ipv6_tcp_en;
+   vport->rss_tuple_sets.ipv6_udp_en = req->ipv6_udp_en;
+   vport->rss_tuple_sets.ipv6_sctp_en = req->ipv6_sctp_en;
+   vport->rss_tuple_sets.ipv6_fragment_en = req->ipv6_fragment_en;
+   return 0;
 }
 
 static int hclge_get_rss_tuple(struct hnae3_handle *handle,
   struct ethtool_rxnfc *nfc)
 {
struct hclge_vport *vport = hclge_get_vport(handle);
-   struct hclge_dev *hdev = vport->back;
-   struct hclge_rss_input_tuple_cmd *req;
-   struct hclge_desc desc;
u8 tuple_sets;
-   int ret;
 
nfc->data = 0;
 
-   req = (struct hclge_rss_input_tuple_cmd *)desc.data;
-   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_RSS_INPUT_TUPLE, true);
-   ret = hclge_cmd_send(&hdev->hw, &desc, 1);
-   if (ret) {
-   dev_err(&hdev->pdev->dev,
-   "Read rss tuple fail, status = %d\n", ret);
-   return ret;
-   }
-
switch (nfc->flow_type) {
case TCP_V4_FLOW

[PATCH net-next 9/9] net: hns3: fix for coal configuation lost when setting the channel

2018-03-08 Thread Peng Li
From: Yunsheng Lin 

This patch fixes the coalesce configuation lost problem when
setting the channel number by restoring all vectors's coalesce
configuation to vector 0's, because all vectors belonging to
the same netdev have the same coalesce configuation for now.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 37 +++--
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index f466f60..0a8c427 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -3413,7 +3413,24 @@ static int hns3_reset_notify(struct hnae3_handle *handle,
return ret;
 }
 
-static int hns3_modify_tqp_num(struct net_device *netdev, u16 new_tqp_num)
+static void hns3_restore_coal(struct hns3_nic_priv *priv,
+ struct hns3_enet_coalesce *tx,
+ struct hns3_enet_coalesce *rx)
+{
+   u16 vector_num = priv->vector_num;
+   int i;
+
+   for (i = 0; i < vector_num; i++) {
+   memcpy(&priv->tqp_vector[i].tx_group.coal, tx,
+  sizeof(struct hns3_enet_coalesce));
+   memcpy(&priv->tqp_vector[i].rx_group.coal, rx,
+  sizeof(struct hns3_enet_coalesce));
+   }
+}
+
+static int hns3_modify_tqp_num(struct net_device *netdev, u16 new_tqp_num,
+  struct hns3_enet_coalesce *tx,
+  struct hns3_enet_coalesce *rx)
 {
struct hns3_nic_priv *priv = netdev_priv(netdev);
struct hnae3_handle *h = hns3_get_handle(netdev);
@@ -3431,6 +3448,8 @@ static int hns3_modify_tqp_num(struct net_device *netdev, 
u16 new_tqp_num)
if (ret)
goto err_alloc_vector;
 
+   hns3_restore_coal(priv, tx, rx);
+
ret = hns3_nic_init_vector_data(priv);
if (ret)
goto err_uninit_vector;
@@ -3461,6 +3480,7 @@ int hns3_set_channels(struct net_device *netdev,
struct hns3_nic_priv *priv = netdev_priv(netdev);
struct hnae3_handle *h = hns3_get_handle(netdev);
struct hnae3_knic_private_info *kinfo = &h->kinfo;
+   struct hns3_enet_coalesce tx_coal, rx_coal;
bool if_running = netif_running(netdev);
u32 new_tqp_num = ch->combined_count;
u16 org_tqp_num;
@@ -3494,15 +3514,26 @@ int hns3_set_channels(struct net_device *netdev,
goto open_netdev;
}
 
+   /* Changing the tqp num may also change the vector num,
+* ethtool only support setting and querying one coal
+* configuation for now, so save the vector 0' coal
+* configuation here in order to restore it.
+*/
+   memcpy(&tx_coal, &priv->tqp_vector[0].tx_group.coal,
+  sizeof(struct hns3_enet_coalesce));
+   memcpy(&rx_coal, &priv->tqp_vector[0].rx_group.coal,
+  sizeof(struct hns3_enet_coalesce));
+
hns3_nic_dealloc_vector_data(priv);
 
hns3_uninit_all_ring(priv);
hns3_put_ring_config(priv);
 
org_tqp_num = h->kinfo.num_tqps;
-   ret = hns3_modify_tqp_num(netdev, new_tqp_num);
+   ret = hns3_modify_tqp_num(netdev, new_tqp_num, &tx_coal, &rx_coal);
if (ret) {
-   ret = hns3_modify_tqp_num(netdev, org_tqp_num);
+   ret = hns3_modify_tqp_num(netdev, org_tqp_num,
+ &tx_coal, &rx_coal);
if (ret) {
/* If revert to old tqp failed, fatal error occurred */
dev_err(&netdev->dev,
-- 
2.9.3



[PATCH net-next 6/9] net: hns3: refactor the get/put_vector function

2018-03-08 Thread Peng Li
From: Yunsheng Lin 

There is a get_vector function, which allocate the vectors
for a client, but there is not a put_vector to free the
vector.

This patch introduces the put_vector function in order to
fix the coalesce configuration lost problem during reset
process.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  3 +++
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c|  4 
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 28 --
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 12 +++---
 4 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index fd06bc7..a072048 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -265,6 +265,8 @@ struct hnae3_ae_dev {
  *   Get tc size of handle
  * get_vector()
  *   Get vector number and vector information
+ * put_vector()
+ *   Put the vector in hdev
  * map_ring_to_vector()
  *   Map rings to vector
  * unmap_ring_from_vector()
@@ -375,6 +377,7 @@ struct hnae3_ae_ops {
 
int (*get_vector)(struct hnae3_handle *handle, u16 vector_num,
  struct hnae3_vector_info *vector_info);
+   int (*put_vector)(struct hnae3_handle *handle, int vector_num);
int (*map_ring_to_vector)(struct hnae3_handle *handle,
  int vector_num,
  struct hnae3_ring_chain_node *vr_chain);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index e0ab161..bbaa21f 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2721,6 +2721,10 @@ static int hns3_nic_uninit_vector_data(struct 
hns3_nic_priv *priv)
if (ret)
return ret;
 
+   ret = h->ae_algo->ops->put_vector(h, tqp_vector->vector_irq);
+   if (ret)
+   return ret;
+
hns3_free_vector_ring_chain(tqp_vector, &vector_ring_chain);
 
if (priv->tqp_vector[i].irq_init_flag == HNS3_VECTOR_INITED) {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 109155d..a318773 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2969,6 +2969,24 @@ static int hclge_get_vector_index(struct hclge_dev 
*hdev, int vector)
return -EINVAL;
 }
 
+static int hclge_put_vector(struct hnae3_handle *handle, int vector)
+{
+   struct hclge_vport *vport = hclge_get_vport(handle);
+   struct hclge_dev *hdev = vport->back;
+   int vector_id;
+
+   vector_id = hclge_get_vector_index(hdev, vector);
+   if (vector_id < 0) {
+   dev_err(&hdev->pdev->dev,
+   "Get vector index fail. vector_id =%d\n", vector_id);
+   return vector_id;
+   }
+
+   hclge_free_vector(hdev, vector_id);
+
+   return 0;
+}
+
 static u32 hclge_get_rss_key_size(struct hnae3_handle *handle)
 {
return HCLGE_RSS_KEY_SIZE;
@@ -3523,18 +3541,13 @@ static int hclge_unmap_ring_frm_vector(struct 
hnae3_handle *handle,
}
 
ret = hclge_bind_ring_with_vector(vport, vector_id, false, ring_chain);
-   if (ret) {
+   if (ret)
dev_err(&handle->pdev->dev,
"Unmap ring from vector fail. vectorid=%d, ret =%d\n",
vector_id,
ret);
-   return ret;
-   }
-
-   /* Free this MSIX or MSI vector */
-   hclge_free_vector(hdev, vector_id);
 
-   return 0;
+   return ret;
 }
 
 int hclge_cmd_set_promisc_mode(struct hclge_dev *hdev,
@@ -5994,6 +6007,7 @@ static const struct hnae3_ae_ops hclge_ops = {
.map_ring_to_vector = hclge_map_ring_to_vector,
.unmap_ring_from_vector = hclge_unmap_ring_frm_vector,
.get_vector = hclge_get_vector,
+   .put_vector = hclge_put_vector,
.set_promisc_mode = hclge_set_promisc_mode,
.set_loopback = hclge_set_loopback,
.start = hclge_ae_start,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index c9a916a..09a2a66 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -627,13 +627,18 @@ static int hclgevf_unmap_ring_from_vector(
}
 
ret = hclgevf_bind_ring_to_vector(handle, false, vector, ring_chain);
-   if (ret) {
+   if (ret)
dev_err(&handle->pdev->dev,
"Unmap ring from vector fail. vector=%d, ret =%d\n",
vector_id,
 

[PATCH net-next 3/9] net: hns3: fix for RSS configuration loss problem during reset

2018-03-08 Thread Peng Li
From: Yunsheng Lin 

RSS configuration will be set to default value by hclge_rss_init_hw
during reset, which causes the RSS configuration loss problem.

This patch fixes it by setting the default value in
hclge_rss_init_cfg function, which will not be called in the reset
process.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c |   2 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 107 ++---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|   1 +
 3 files changed, 56 insertions(+), 54 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c
index 5018d66..407cfab 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c
@@ -144,6 +144,8 @@ static int hclge_map_update(struct hnae3_handle *h)
if (ret)
return ret;
 
+   hclge_rss_indir_init_cfg(hdev);
+
return hclge_rss_init_hw(hdev);
 }
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 747cc8f..34a0905 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -3329,67 +3329,28 @@ static int hclge_get_tc_size(struct hnae3_handle 
*handle)
 
 int hclge_rss_init_hw(struct hclge_dev *hdev)
 {
-   const  u8 hfunc = HCLGE_RSS_HASH_ALGO_TOEPLITZ;
struct hclge_vport *vport = hdev->vport;
+   u8 *rss_indir = vport[0].rss_indirection_tbl;
+   u16 rss_size = vport[0].alloc_rss_size;
+   u8 *key = vport[0].rss_hash_key;
+   u8 hfunc = vport[0].rss_algo;
u16 tc_offset[HCLGE_MAX_TC_NUM];
-   u8 rss_key[HCLGE_RSS_KEY_SIZE];
u16 tc_valid[HCLGE_MAX_TC_NUM];
u16 tc_size[HCLGE_MAX_TC_NUM];
-   u32 *rss_indir = NULL;
-   u16 rss_size = 0, roundup_size;
-   const u8 *key;
-   int i, ret, j;
-
-   rss_indir = kcalloc(HCLGE_RSS_IND_TBL_SIZE, sizeof(u32), GFP_KERNEL);
-   if (!rss_indir)
-   return -ENOMEM;
-
-   /* Get default RSS key */
-   netdev_rss_key_fill(rss_key, HCLGE_RSS_KEY_SIZE);
-
-   /* Initialize RSS indirect table for each vport */
-   for (j = 0; j < hdev->num_vmdq_vport + 1; j++) {
-   vport[j].rss_tuple_sets.ipv4_tcp_en =
-   HCLGE_RSS_INPUT_TUPLE_OTHER;
-   vport[j].rss_tuple_sets.ipv4_udp_en =
-   HCLGE_RSS_INPUT_TUPLE_OTHER;
-   vport[j].rss_tuple_sets.ipv4_sctp_en =
-   HCLGE_RSS_INPUT_TUPLE_SCTP;
-   vport[j].rss_tuple_sets.ipv4_fragment_en =
-   HCLGE_RSS_INPUT_TUPLE_OTHER;
-   vport[j].rss_tuple_sets.ipv6_tcp_en =
-   HCLGE_RSS_INPUT_TUPLE_OTHER;
-   vport[j].rss_tuple_sets.ipv6_udp_en =
-   HCLGE_RSS_INPUT_TUPLE_OTHER;
-   vport[j].rss_tuple_sets.ipv6_sctp_en =
-   HCLGE_RSS_INPUT_TUPLE_SCTP;
-   vport[j].rss_tuple_sets.ipv6_fragment_en =
-   HCLGE_RSS_INPUT_TUPLE_OTHER;
-
-   for (i = 0; i < HCLGE_RSS_IND_TBL_SIZE; i++) {
-   vport[j].rss_indirection_tbl[i] =
-   i % vport[j].alloc_rss_size;
-
-   /* vport 0 is for PF */
-   if (j != 0)
-   continue;
+   u16 roundup_size;
+   int i, ret;
 
-   rss_size = vport[j].alloc_rss_size;
-   rss_indir[i] = vport[j].rss_indirection_tbl[i];
-   }
-   }
ret = hclge_set_rss_indir_table(hdev, rss_indir);
if (ret)
-   goto err;
+   return ret;
 
-   key = rss_key;
ret = hclge_set_rss_algo_key(hdev, hfunc, key);
if (ret)
-   goto err;
+   return ret;
 
ret = hclge_set_rss_input_tuple(hdev);
if (ret)
-   goto err;
+   return ret;
 
/* Each TC have the same queue size, and tc_size set to hardware is
 * the log2 of roundup power of two of rss_size, the acutal queue
@@ -3399,8 +3360,7 @@ int hclge_rss_init_hw(struct hclge_dev *hdev)
dev_err(&hdev->pdev->dev,
"Configure rss tc size failed, invalid TC_SIZE = %d\n",
rss_size);
-   ret = -EINVAL;
-   goto err;
+   return -EINVAL;
}
 
roundup_size = roundup_pow_of_two(rss_size);
@@ -3417,12 +3377,50 @@ int hclge_rss_init_hw(struct hclge_dev *hdev)
tc_offset[i] = rss_size * i;
}
 
-   ret = hclge_set_rss_tc_mode(hdev, tc_valid, tc_size, tc_offset);
+   return hclge_set_rss_tc_mode(hdev, tc_valid, tc_size, tc_offset);
+}

[PATCH iproute2-next 2/3] ipmroute: convert to output JSON

2018-03-08 Thread Stephen Hemminger
From: Stephen Hemminger 

Should be no change for non-json case except putting color
on address if desired.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmroute.c | 99 ---
 1 file changed, 67 insertions(+), 32 deletions(-)

diff --git a/ip/ipmroute.c b/ip/ipmroute.c
index aa5029b44f41..83548caf4946 100644
--- a/ip/ipmroute.c
+++ b/ip/ipmroute.c
@@ -29,6 +29,7 @@
 #include 
 #include "utils.h"
 #include "ip_common.h"
+#include "json_print.h"
 
 static void usage(void) __attribute__((noreturn));
 
@@ -53,13 +54,12 @@ struct rtfilter {
 
 int print_mroute(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 {
-   FILE *fp = (FILE *)arg;
struct rtmsg *r = NLMSG_DATA(n);
int len = n->nlmsg_len;
struct rtattr *tb[RTA_MAX+1];
-   char obuf[256];
-
+   const char *src, *dst;
SPRINT_BUF(b1);
+   SPRINT_BUF(b2);
__u32 table;
int iif = 0;
int family;
@@ -103,30 +103,44 @@ int print_mroute(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
 
family = get_real_family(r->rtm_type, r->rtm_family);
 
+   open_json_object(NULL);
if (n->nlmsg_type == RTM_DELROUTE)
-   fprintf(fp, "Deleted ");
+   print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
if (tb[RTA_SRC])
-   len = snprintf(obuf, sizeof(obuf),
-  "(%s, ", rt_addr_n2a_rta(family, tb[RTA_SRC]));
+   src = rt_addr_n2a_r(family, RTA_PAYLOAD(tb[RTA_SRC]),
+   RTA_DATA(tb[RTA_SRC]), b1, sizeof(b1));
else
-   len = sprintf(obuf, "(unknown, ");
+   src = "unknown";
+
if (tb[RTA_DST])
-   snprintf(obuf + len, sizeof(obuf) - len,
-"%s)", rt_addr_n2a_rta(family, tb[RTA_DST]));
+   dst = rt_addr_n2a_r(family, RTA_PAYLOAD(tb[RTA_DST]),
+   RTA_DATA(tb[RTA_DST]), b2, sizeof(b2));
else
-   snprintf(obuf + len, sizeof(obuf) - len, "unknown) ");
+   dst = "unknown";
+
+   if (is_json_context()) {
+   print_string(PRINT_JSON, "src", NULL, src);
+   print_string(PRINT_JSON, "dst", NULL, dst);
+   } else {
+   char obuf[256];
+
+   snprintf(obuf, sizeof(obuf), "(%s,%s)", src, dst);
+   print_string(PRINT_FP, NULL,
+"%-32s Iif: ", obuf);
+   }
 
-   fprintf(fp, "%-32s Iif: ", obuf);
if (iif)
-   fprintf(fp, "%-10s ", ll_index_to_name(iif));
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "iif", "%-10s ", ll_index_to_name(iif));
else
-   fprintf(fp, "unresolved ");
+   print_string(PRINT_ANY,"iif", "%s ", "unresolved");
 
if (tb[RTA_MULTIPATH]) {
struct rtnexthop *nh = RTA_DATA(tb[RTA_MULTIPATH]);
int first = 1;
 
+   open_json_array(PRINT_JSON, "multipath");
len = RTA_PAYLOAD(tb[RTA_MULTIPATH]);
 
for (;;) {
@@ -135,47 +149,65 @@ int print_mroute(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (nh->rtnh_len > len)
break;
 
+   open_json_object(NULL);
if (first) {
-   fprintf(fp, "Oifs: ");
+   print_string(PRINT_FP, NULL, "Oifs: ", NULL);
first = 0;
}
-   fprintf(fp, "%s", ll_index_to_name(nh->rtnh_ifindex));
+
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "oif", "%s", 
ll_index_to_name(nh->rtnh_ifindex));
+
if (nh->rtnh_hops > 1)
-   fprintf(fp, "(ttl %d) ", nh->rtnh_hops);
+   print_uint(PRINT_ANY,
+  "ttl", "(ttl %u) ", nh->rtnh_hops);
else
-   fprintf(fp, " ");
+   print_string(PRINT_FP, NULL, " ", NULL);
+
+   close_json_object();
len -= NLMSG_ALIGN(nh->rtnh_len);
nh = RTNH_NEXT(nh);
}
+   close_json_array(PRINT_JSON, NULL);
}
-   fprintf(fp, " State: %s",
-   r->rtm_flags & RTNH_F_UNRESOLVED ? "unresolved" : "resolved");
+
+   print_string(PRINT_ANY, "state", " State: %s",
+(r->rtm_flags & RTNH_F_UNRESOLVED) ? "unresolved" : 
"resolved");
+
if (r->rtm_flags & RTNH_F_OFFLOAD)
-   fprintf(fp, " offload");
+   print_null(PRINT_ANY, "offload", " offload", NULL);
+
if (show_stats && tb[R

[PATCH iproute2-next 1/3] ipmaddr: json and color support

2018-03-08 Thread Stephen Hemminger
From: Stephen Hemminger 

Support printing mulitcast addresses in json and color mode.
Output format is unchanged for normal use.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmaddr.c | 69 +---
 1 file changed, 43 insertions(+), 26 deletions(-)

diff --git a/ip/ipmaddr.c b/ip/ipmaddr.c
index d7bf1f99f67e..a48499029e17 100644
--- a/ip/ipmaddr.c
+++ b/ip/ipmaddr.c
@@ -28,6 +28,7 @@
 #include "rt_names.h"
 #include "utils.h"
 #include "ip_common.h"
+#include "json_print.h"
 
 static struct {
char *dev;
@@ -193,50 +194,66 @@ static void read_igmp6(struct ma_info **result_p)
 
 static void print_maddr(FILE *fp, struct ma_info *list)
 {
-   fprintf(fp, "\t");
+   print_string(PRINT_FP, NULL, "\t", NULL);
 
+   open_json_object(NULL);
if (list->addr.family == AF_PACKET) {
SPRINT_BUF(b1);
-   fprintf(fp, "link  %s", ll_addr_n2a((unsigned char 
*)list->addr.data,
-   list->addr.bytelen, 0,
-   b1, sizeof(b1)));
+
+   print_string(PRINT_FP, NULL, "link  ", NULL);
+   print_color_string(PRINT_ANY, COLOR_MAC, "link", "%s",
+  ll_addr_n2a((void *)list->addr.data, 
list->addr.bytelen,
+  0, b1, sizeof(b1)));
} else {
-   switch (list->addr.family) {
-   case AF_INET:
-   fprintf(fp, "inet  ");
-   break;
-   case AF_INET6:
-   fprintf(fp, "inet6 ");
-   break;
-   default:
-   fprintf(fp, "family %d ", list->addr.family);
-   break;
-   }
-   fprintf(fp, "%s",
-   format_host(list->addr.family,
-   -1, list->addr.data));
+   print_string(PRINT_ANY, "family", "%-5s ",
+family_name(list->addr.family));
+   print_color_string(PRINT_ANY, 
ifa_family_color(list->addr.family),
+  "address", "%s",
+  format_host(list->addr.family,
+  -1, list->addr.data));
}
+
if (list->users != 1)
-   fprintf(fp, " users %d", list->users);
+   print_uint(PRINT_ANY, "users", " users %u", list->users);
+
if (list->features)
-   fprintf(fp, " %s", list->features);
-   fprintf(fp, "\n");
+   print_string(PRINT_ANY, "features", " %s", list->features);
+
+   print_string(PRINT_FP, NULL, "\n", NULL);
+   close_json_object();
 }
 
 static void print_mlist(FILE *fp, struct ma_info *list)
 {
int cur_index = 0;
 
+   new_json_obj(json);
for (; list; list = list->next) {
-   if (oneline) {
-   cur_index = list->index;
-   fprintf(fp, "%d:\t%s%s", cur_index, list->name, _SL_);
-   } else if (cur_index != list->index) {
+
+   if (list->index != cur_index || oneline) {
+   if (cur_index) {
+   close_json_array(PRINT_JSON, NULL);
+   close_json_object();
+   }
+   open_json_object(NULL);
+
+   print_uint(PRINT_ANY, "ifindex", "%d:", list->index);
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "ifname", "\t%s", list->name);
+   print_string(PRINT_FP, NULL, "%s", _SL_);
cur_index = list->index;
-   fprintf(fp, "%d:\t%s\n", cur_index, list->name);
+
+   open_json_array(PRINT_JSON, "maddr");
}
+
print_maddr(fp, list);
}
+   if (cur_index) {
+   close_json_array(PRINT_JSON, NULL);
+   close_json_object();
+   }
+
+   delete_json_obj();
 }
 
 static int multiaddr_list(int argc, char **argv)
-- 
2.16.1



[PATCH iproute2-next 3/3] ipmroute: better error message if no kernel mroute

2018-03-08 Thread Stephen Hemminger
From: Stephen Hemminger 

If kernel does not support the IP multicast address family,
then it will report all routes (PF_UNSPEC).
Give the user a better error message and abort the command.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmroute.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/ip/ipmroute.c b/ip/ipmroute.c
index 83548caf4946..a4a8fb2bbe1f 100644
--- a/ip/ipmroute.c
+++ b/ip/ipmroute.c
@@ -75,10 +75,11 @@ int print_mroute(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
return -1;
}
+
if (r->rtm_type != RTN_MULTICAST) {
-   fprintf(stderr, "Not a multicast route (type: %s)\n",
-   rtnl_rtntype_n2a(r->rtm_type, b1, sizeof(b1)));
-   return 0;
+   fprintf(stderr,
+   "Non multicast route received, kernel does support IP 
multicast?\n");
+   return -1;
}
 
parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
-- 
2.16.1



[PATCH iproute2-next 0/3] ip: multicast commands JSON

2018-03-08 Thread Stephen Hemminger
From: Stephen Hemminger 

Some more JSON support and report better error if kernel
is configured without multicast.

Stephen Hemminger (3):
  ipmaddr: json and color support
  ipmroute: convert to output JSON
  ipmroute: better error message if no kernel mroute

 ip/ipmaddr.c  |  69 --
 ip/ipmroute.c | 106 +++---
 2 files changed, 114 insertions(+), 61 deletions(-)

-- 
2.16.1



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Luis R. Rodriguez
On Thu, Mar 08, 2018 at 03:07:01PM -0800, Alexei Starovoitov wrote:
> On 3/7/18 5:23 PM, Luis R. Rodriguez wrote:
> > 
> > request_module() has its own world though too. How often in your proof of
> > concept is request_module() called? How many times do you envision it being
> > called?
> 
> once.

What about other users later in the kernel?

> > > +static int run_umh(struct file *file)
> > > +{
> > > + struct subprocess_info *sub_info = call_usermodehelper_setup_file(file);
> > > +
> > > + if (!sub_info)
> > > + return -ENOMEM;
> > > + return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
> > > +}
> > 
> > run_umh() calls the program and waits. Note that while we are running a UMH 
> > we
> > can't suspend. What if they take forever, who is hosing them down with an
> > equivalent:
> > 
> > schedule();
> > try_to_freeze();
> > 
> > Say they are buggy and never return, will they stall suspend, have you 
> > tried?
> > 
> > call_usermodehelper_exec() uses helper_lock() which in turn is used for
> > umh's accounting for number of running umh's. One of the sad obscure uses
> > for this is to deal with suspend/resume. Refer to __usermodehelper* calls
> > on kernel/power/process.c
> > 
> > Note how you use UMH_WAIT_EXEC too, so this is all synchronous.
> 
> looks like you misread this code
>
> #define UMH_NO_WAIT 0   /* don't wait at all */
> #define UMH_WAIT_EXEC   1   /* wait for the exec, but not the process */
> #define UMH_WAIT_PROC   2   /* wait for the process to complete */
> #define UMH_KILLABLE4   /* wait for EXEC/PROC killable */

I certainly did get the incorrect impression this was a sync op, sorry
about that. In that case its odd to see a request_module() call, when
what is really meant is more of a request_module_nowait(), its also
UMH_NO_WAIT on the modprobe call itself as well.

In fact I think I'd much prefer at the very least to see a different wrapper
for this, even if its using the same bolts and nuts underneath for userspace
processes compiled on the kernel. The sanity checks in place for
request_module() may change later and this use case seems rather simple and
limited. Keeping tabs on this type of users seems desirable. At the *very
least*

diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 40c89ad4bea6..7530e00da03b 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -37,11 +37,13 @@ extern __printf(2, 3)
 int __request_module(bool wait, const char *name, ...);
 #define request_module(mod...) __request_module(true, mod)
 #define request_module_nowait(mod...) __request_module(false, mod)
+#define request_umh_proc(mod...) __request_module(false, mod)
 #define try_then_request_module(x, mod...) \
((x) ?: (__request_module(true, mod), (x)))
 #else
 static inline int request_module(const char *name, ...) { return -ENOSYS; }
 static inline int request_module_nowait(const char *name, ...) { return 
-ENOSYS; }
+static inline int request_umh_proc(const char *name, ...) { return -ENOSYS; }
 #define try_then_request_module(x, mod...) (x)
 #endif

Modulo, kernel/umh.c is already its own thing, so pick a name that helps
identify these things clearly.

> and the rest of your concerns with suspend/resume are not applicable any
> more.

Agreed, except it does still mean these userspace processes are limited to
execution only the kernel is up, and not going to suspend, but I think that's
a proper sanity check by the umh API.

> bpftiler.ko is started once and runs non-stop from there on.
> Unless it crashes, but it's a different discussion.

Sure.

  Luis


Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Kees Cook
On Thu, Mar 8, 2018 at 5:35 PM, Linus Torvalds
 wrote:
> I don't want to weaken the type enforcement, and I _thought_ you had
> done that __builtin_types_compatible_p() to keep it in place.

I thought so too (that originally came from Josh), but on removal, I
was surprised that the checking was retained. :)

> But if that's not why you did it, then why was it there at all? If the
> type warning shows through even if it's in the other expression, then
> just a
>
>
> #define __min(t1, t2, x, y) \
> __builtin_choose_expr(  \
> __builtin_constant_p(x) &   \
> __builtin_constant_p(y),\
> (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),  \
> __single_eval_min(t1, t2,   \
>...
>
> would seem to be sufficient?
>
> Because logically, the only thing that matters is that x and y don't
> have any side effects and can be evaluated twice, and
> "__builtin_constant_p()" is already a much stronger version of that.
>
> Hmm? The __builtin_types_compatible_p() just doesn't seem to matter
> for the only thing I thought it was there for.

Yup, agreed. I'll drop it.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Kees Cook
On Thu, Mar 8, 2018 at 5:38 PM, Linus Torvalds
 wrote:
> On Thu, Mar 8, 2018 at 4:59 PM, Andy Lutomirski  wrote:
>>
>> Also, I don't see how this is any more exploitable than any other
>> init_module().
>
> Absolutely. If Kees doesn't trust the files to be loaded, an
> executable - even if it's running with root privileges and in the
> initns - is still fundamentally weaker than a kernel module.
>
> So I don't understand the security argument AT ALL. It's nonsensical.
> The executable loading does all the same security checks that the
> module loading does, including the signing check.
>
> And the whole point is that we can now do things with building and
> loading a ebpf rule instead of having a full module.

My concerns are mostly about crossing namespaces. If a container
triggers an autoload, the result runs in the init_ns. So, really,
there's nothing new from my perspective, except that it's in userspace
instead of in the kernel.

Perhaps it's an orthogonal concern.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 4:59 PM, Andy Lutomirski  wrote:
>
> Also, I don't see how this is any more exploitable than any other
> init_module().

Absolutely. If Kees doesn't trust the files to be loaded, an
executable - even if it's running with root privileges and in the
initns - is still fundamentally weaker than a kernel module.

So I don't understand the security argument AT ALL. It's nonsensical.
The executable loading does all the same security checks that the
module loading does, including the signing check.

And the whole point is that we can now do things with building and
loading a ebpf rule instead of having a full module.

 Linus


Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 4:45 PM, Kees Cook  wrote:
>
> Rasmus mentioned this too. What I said there was that I was shy to
> make that change, since we already can't mix that kind of thing with
> the existing min()/max() implementation. The existing min()/max() is
> already extremely strict, so there are no instances of this in the
> tree.

Yes, but I also didn't want to add any new cases in case people add
new min/max() users.

But:

> If I explicitly add one, I see this with or without the patch:
>
> In file included from drivers/misc/lkdtm.h:7:0,
>  from drivers/misc/lkdtm_core.c:33:
> drivers/misc/lkdtm_core.c: In function ‘lkdtm_module_exit’:
> ./include/linux/kernel.h:809:16: warning: comparison of distinct
> pointer types lacks a cast

Oh, ok, in that case, just drop the __builtin_types_compatible_p()
entirely. It's not adding anything.

I was expecting the non-chosen expression in __builtin_choose_expr()
to not cause type warnings. I'm actually surprised it does. Type games
is why __builtin_choose_expr() tends to exist in the first place.

> So are you saying you _want_ the type enforcement weakened here, or
> that I should just not use __builtin_types_compatible_p()?

I don't want to weaken the type enforcement, and I _thought_ you had
done that __builtin_types_compatible_p() to keep it in place.

But if that's not why you did it, then why was it there at all? If the
type warning shows through even if it's in the other expression, then
just a


#define __min(t1, t2, x, y) \
__builtin_choose_expr(  \
__builtin_constant_p(x) &   \
__builtin_constant_p(y),\
(t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),  \
__single_eval_min(t1, t2,   \
   ...

would seem to be sufficient?

Because logically, the only thing that matters is that x and y don't
have any side effects and can be evaluated twice, and
"__builtin_constant_p()" is already a much stronger version of that.

Hmm? The __builtin_types_compatible_p() just doesn't seem to matter
for the only thing I thought it was there for.

  Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov
On Fri, Mar 09, 2018 at 01:04:39AM +, Andy Lutomirski wrote:
> On Fri, Mar 9, 2018 at 12:57 AM, Alexei Starovoitov  wrote:
> > On 3/8/18 4:24 PM, Kees Cook wrote:
> >>
> >> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
> >> like forcing the userspace helper to be non-PIE would defeat some of
> >> the userspace defenses in use in most distros.
> >
> >
> > because we don't add features without concrete users.
> 
> I disagree here.  If you're going to add a magic trick that triggers
> an execve(), then I think you should either support *both* standard,
> widely-used types of ELF programs or you should give a compelling use
> case that works for ET_EXEC but not for ET_DYN.  Keep in mind that
> many distros have a very strong preference for ET_DYN.

misunderstanding here is that this bpfiler.ko is part of _kernel build_.
Kernel decides how it's build.
Nothing to do with distros.
Current Makefile is very dumb and it's built with HOSTCC:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/bpfilter/Makefile?h=ipt_bpf
but it will be standalone with CC before final to make sure crosscompiling 
works.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Kees Cook
On Thu, Mar 8, 2018 at 4:57 PM, Alexei Starovoitov  wrote:
> The above are three paragraphs of security paranoia without single
> concrete example of a security issue.

How is running an arbitrary ELF as full init_ns root from a container
not a concrete example?

I'm not saying this approach can never happen. And this isn't
paranoia. There are very real security boundary violations in this
model, IMO. Though, as Andy says, it's more about autoloading than umh
itself. I just don't want to extend that problem further.

>> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
>> like forcing the userspace helper to be non-PIE would defeat some of
>> the userspace defenses in use in most distros.
>
>
> because we don't add features without concrete users.

It's just a two-line change, and then it would work on distros that
build PIE-by-default. That seems like a concrete use-case.

>> The exec.c changes should be split into a separate patch. Something
>> feels incorrectly refactored here, though. Passing all three of fd,
>> filename, and file to __do_execve_file() seems wrong; perhaps the fd
>> to file handling needs to happen externally in what you have here as
>> do_execveat_common()? The resulting __do_execve_file() has repeated
>> conditionals on filename... maybe what I object to is being able to
>> pass a NULL filename at all. The filename can be (painfully)
>> reconstructed from the struct file.
>
> reconstruct the path and instantly introduce the race between execing
> something by path vs prior check that it's actual elf of already
> opened file ?!
> excellent suggestion to improve security.

I'm not sure why you're being so hostile. We're both interested in
improving the kernel.

Path names aren't stable, but they provide good _debugging_ about
things when needed. For example, the LSM hooks, if they report on one
of these loads, can produce a hint to the user about what happened.
Passing "/dev/null" around is confusing at the very least. The ELF is
absolutely not /dev/null. Why make someone guess?

>>> [...]
>>> diff --git a/kernel/module.c b/kernel/module.c
>>> index ad2d420024f6..6cfa35795741 100644
>>> --- a/kernel/module.c
>>> +++ b/kernel/module.c
>>> [...]
>>> @@ -3870,14 +3896,21 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char
>>> __user *, uargs, int, flags)
>>>   |MODULE_INIT_IGNORE_VERMAGIC))
>>> return -EINVAL;
>>>
>>> -   err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
>>> -  READING_MODULE);
>>> +   f = fdget(fd);
>>> +   if (!f.file)
>>> +   return -EBADF;
>>> +
>>> +   err = kernel_read_file(f.file, &hdr, &size, INT_MAX,
>>> READING_MODULE);
>>
>>
>> For the LSM subsystem, I think this should also get it's own enum
>> kernel_read_file_id. This is really no longer a kernel module...
>
>
> at this point it's a _file_. It could have been text file just well.
> If lsm is thinking that at this point kernel is processing
> kernel module that lsm is badly broken.

Again, this is about making things more discoverable. We already know
if we're loading a kernel module or a umh ELF. Passing this
information to the LSM is valuable to the LSM to distinguish between
types of files. They have very different effects on the system, for
example. The issue is about validating intent with target. "Is this
kernel module allowed?" and "Is this umh ELF allowed?" are better
questions that "Is something loaded through finit_module() allowed?"
Note already that the LSM distinguishes between many other file types
in kernel_read_file*().

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov
On Fri, Mar 09, 2018 at 12:59:36AM +, Andy Lutomirski wrote:
> 
> Alexei, can you give an example use case?  I'm sure it's upthread
> somewhere, but I'm having trouble finding it.

at the time of iptable's setsockopt() the kernel will do
err = request_module("bpfilter");
once.
The rough POC code:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25

> Also, I just tested this concept a bit.  Depmod invoked explicitly on
> an ET_EXEC with a.ko extension gets mad, but depmod -a on a kernel
> that has a "module" like that seems to work fine.  Go figure.

right. that's with the current patch.
In v2 I require .modinfo section to make sure license is specified,
but depmod still not very happy:
$ depmod /lib/modules/`uname -r`/kernel/net/bpfilter/bpfilter.ko
depmod: ERROR: Bad version passed 
/lib/modules/4.16.0-rc4-00799-g1716f0aa3039-dirty/kernel/net/bpfilter/bpfilter.ko
I'm not sure it's worth to silence it, since
as you noticed 'depmod -a' works.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 12:57 AM, Alexei Starovoitov  wrote:
> On 3/8/18 4:24 PM, Kees Cook wrote:
>>
>> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
>> like forcing the userspace helper to be non-PIE would defeat some of
>> the userspace defenses in use in most distros.
>
>
> because we don't add features without concrete users.

I disagree here.  If you're going to add a magic trick that triggers
an execve(), then I think you should either support *both* standard,
widely-used types of ELF programs or you should give a compelling use
case that works for ET_EXEC but not for ET_DYN.  Keep in mind that
many distros have a very strong preference for ET_DYN.

Or you could argue that ET_DYN requires tooling changes, but I think
it's awkward to ask the tooling to change in advance of the kernel
being willing to actually invoke the thing.  I'm not actually
convinced that any tooling changes would be needed, though.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 12:24 AM, Kees Cook  wrote:
> How is this not marked [RFC]? :)
>
> On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  wrote:
>> As the first step in development of bpfilter project [1] the request_module()
>> code is extended to allow user mode helpers to be invoked. Idea is that
>> user mode helpers are built as part of the kernel build and installed as
>> traditional kernel modules with .ko file extension into distro specified
>> location, such that from a distribution point of view, they are no different
>> than regular kernel modules. Thus, allow request_module() logic to load such
>> user mode helper (umh) modules via:
>>
>>   request_module("foo") ->
>> call_umh("modprobe foo") ->
>>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>> call_umh(struct file)
>
> Yikes. This means autoloaded artbitrary exec() now, instead of
> autoloading just loading arbitrary modules? That seems extremely bad.
> This just extends all the problems we've had with defining security
> boundaries with modules out to umh too. I would need some major
> convincing that this can be made safe.
>
> It's easy for container to trigger arbitrary module loading, so if it
> can find/use a similar one of the bugs we've seen in the past to
> redirect the module loading we don't just get new module-created
> attack surface added to the kernel, but we again get arbitrary ELF
> running in the root ns (not in the container). And in the insane case
> of a container with CAP_SYS_MODULE, this is a trivial escape without
> even needing to build a special kernel module: the root ns, running
> with all privileges runs an arbitrary ELF.
>
> As it stands, I have to NAK this. At the very least, you need to solve
> the execution environment problems here: the ELF should run with no
> greater privileges than what loaded the module, and very importantly,
> must not be allowed to bypass these checks through autoloading. What
> _triggered_ the autoload must be the environment, not the "modprobe",
> since that's running with full privileges. Basically, we need to fix
> module autoloading first, or at least a significant subset of the
> problems: https://lkml.org/lkml/2017/11/27/754

I disagree.  If we're going to somehow have ELF binaries that pretend
to be modules, then they should run with high privilege and, more
importantly, should run in a context independent of whoever triggered
autoloading.

Also, I don't see how this is any more exploitable than any other
init_module().  If someone has CAP_SYS_MODULE or autoload privileges
and they can trigger init_module() on a file, then they're granting
maximum privilege to the contents of that file.  So who cares if that
file is a kernel module or an ELF binary?

Alexei, can you give an example use case?  I'm sure it's upthread
somewhere, but I'm having trouble finding it.

Also, I just tested this concept a bit.  Depmod invoked explicitly on
an ET_EXEC with a.ko extension gets mad, but depmod -a on a kernel
that has a "module" like that seems to work fine.  Go figure.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov

On 3/8/18 4:24 PM, Kees Cook wrote:

How is this not marked [RFC]? :)

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  wrote:

As the first step in development of bpfilter project [1] the request_module()
code is extended to allow user mode helpers to be invoked. Idea is that
user mode helpers are built as part of the kernel build and installed as
traditional kernel modules with .ko file extension into distro specified
location, such that from a distribution point of view, they are no different
than regular kernel modules. Thus, allow request_module() logic to load such
user mode helper (umh) modules via:

  request_module("foo") ->
call_umh("modprobe foo") ->
  sys_finit_module(FD of /lib/modules/.../foo.ko) ->
call_umh(struct file)


Yikes. This means autoloaded artbitrary exec() now, instead of
autoloading just loading arbitrary modules? That seems extremely bad.
This just extends all the problems we've had with defining security
boundaries with modules out to umh too. I would need some major
convincing that this can be made safe.

It's easy for container to trigger arbitrary module loading, so if it
can find/use a similar one of the bugs we've seen in the past to
redirect the module loading we don't just get new module-created
attack surface added to the kernel, but we again get arbitrary ELF
running in the root ns (not in the container). And in the insane case
of a container with CAP_SYS_MODULE, this is a trivial escape without
even needing to build a special kernel module: the root ns, running
with all privileges runs an arbitrary ELF.

As it stands, I have to NAK this. At the very least, you need to solve
the execution environment problems here: the ELF should run with no
greater privileges than what loaded the module, and very importantly,
must not be allowed to bypass these checks through autoloading. What
_triggered_ the autoload must be the environment, not the "modprobe",
since that's running with full privileges. Basically, we need to fix
module autoloading first, or at least a significant subset of the
problems: https://lkml.org/lkml/2017/11/27/754


The above are three paragraphs of security paranoia without single
concrete example of a security issue.


Such approach enables kernel to delegate functionality traditionally done
by kernel modules into user space processes (either root or !root) and
reduces security attack surface of such new code, meaning in case of
potential bugs only the umh would crash but not the kernel. Another
advantage coming with that would be that bpfilter.ko can be debugged and
tested out of user space as well (e.g. opening the possibility to run
all clang sanitizers, fuzzers or test suites for checking translation).
Also, such architecture makes the kernel/user boundary very precise:
control plane is done by the user space while data plane stays in the kernel.

It's easy to distinguish "umh module" from traditional kernel module:

$ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
  Type:  EXEC (Executable file)


As Andy asked earlier, why not DYN too to catch PIE executables? Seems
like forcing the userspace helper to be non-PIE would defeat some of
the userspace defenses in use in most distros.


because we don't add features without concrete users.


$ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
  Type:  REL (Relocatable file)

Since umh can crash, can be oom-ed by the kernel, killed by admin,
the subsystem that uses them (like bpfilter) need to manage life
time of umh on its own, so module infra doesn't do any accounting
of them. They don't appear in "lsmod" and cannot be "rmmod".
Multiple request_module("umh") will load multiple umh.ko processes.

Similar to kernel modules the kernel will be tainted if "umh module"
has invalid signature.

[1] 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_747551_&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=pMlEM-qKOmGljadUKAdwKBpuYHQRXzApvSGkZFIT4Gg&s=0-vP6LH-GxXnL7LuV-KfMl1NbqsVJUINSygoVa9R6nk&e=

Signed-off-by: Alexei Starovoitov 
---
 fs/exec.c   | 40 +++-
 include/linux/binfmts.h |  1 +
 include/linux/umh.h |  3 +++
 kernel/module.c | 43 ++-
 kernel/umh.c| 24 +---
 5 files changed, 94 insertions(+), 17 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 7eb8d21bcab9..0483c438de7d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execveat_common(int fd, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp,
- int flags)
+static int __do_execve_file(int fd, struct filename *filename,
+

Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Kees Cook
On Thu, Mar 8, 2018 at 3:48 PM, Linus Torvalds
 wrote:
> On Thu, Mar 8, 2018 at 1:40 PM, Kees Cook  wrote:
>> +#define __min(t1, t2, x, y)\
>> +   __builtin_choose_expr(__builtin_constant_p(x) &&\
>> + __builtin_constant_p(y) &&\
>> + __builtin_types_compatible_p(t1, t2), \
>> + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
>
> I understand why you use __builtin_types_compatible_p(), but please don't.
>
> It will mean that trivial constants like "5" and "sizeof(x)" won't
> simplify, because they have different types.

Rasmus mentioned this too. What I said there was that I was shy to
make that change, since we already can't mix that kind of thing with
the existing min()/max() implementation. The existing min()/max() is
already extremely strict, so there are no instances of this in the
tree. If I explicitly add one, I see this with or without the patch:

In file included from drivers/misc/lkdtm.h:7:0,
 from drivers/misc/lkdtm_core.c:33:
drivers/misc/lkdtm_core.c: In function ‘lkdtm_module_exit’:
./include/linux/kernel.h:809:16: warning: comparison of distinct
pointer types lacks a cast
  (void) (&max1 == &max2);   \
^
./include/linux/kernel.h:818:2: note: in expansion of macro ‘__max’
  __max(typeof(x), typeof(y),   \
  ^
./include/linux/printk.h:308:34: note: in expansion of macro ‘max’
  printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
  ^~~
drivers/misc/lkdtm_core.c:500:2: note: in expansion of macro ‘pr_info’
  pr_info("%lu\n", max(16, sizeof(unsigned long)));
  ^~~

> The ?: will give the right combined type anyway, and if you want the
> type comparison warning, just add a comma-expression with something
> like like
>
>(t1 *)1 == (t2 *)1
>
> to get the type compatibility warning.

When I tried removing __builtin_types_compatible_p(), I still got the
type-check warning because I think the preprocessor still sees the
"(void) (&min1 == &min2)" before optimizing? So, I technically _can_
drop the __builtin_types_compatible_p(), and still keep the type
warning. :P

> Yeah, yeah, maybe none of the VLA cases triggered that, but it seems
> silly to not just get that obvious constant case right.
>
> Hmm?

So are you saying you _want_ the type enforcement weakened here, or
that I should just not use __builtin_types_compatible_p()?

Thanks!

-Kees

-- 
Kees Cook
Pixel Security


[PATCH net] macvlan: filter out unsupported feature flags

2018-03-08 Thread Shannon Nelson
Adding a macvlan device on top of a lowerdev that supports
the xfrm offloads fails with a new regression:
  # ip link add link ens1f0 mv0 type macvlan
  RTNETLINK answers: Operation not permitted

Tracing down the failure shows that the macvlan device inherits
the NETIF_F_HW_ESP and NETIF_F_HW_ESP_TX_CSUM feature flags
from the lowerdev, but with no dev->xfrmdev_ops API filled
in, it doesn't actually support xfrm.  When the request is
made to add the new macvlan device, the XFRM listener for
NETDEV_REGISTER calls xfrm_api_check() which fails the new
registration because dev->xfrmdev_ops is NULL.

The macvlan creation succeeds when we filter out the ESP
feature flags in macvlan_fix_features(), so let's filter them
out like we're already filtering out ~NETIF_F_NETNS_LOCAL.
When XFRM support is added in the future, we can add the flags
into MACVLAN_FEATURES.

This same problem could crop up in the future with any other
new feature flags, so let's filter out any flags that aren't
defined as supported in macvlan.

Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API")
Reported-by: Alexey Kodanev 
Signed-off-by: Shannon Nelson 
---
 drivers/net/macvlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 8fc02d9..725f4b4 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1036,7 +1036,7 @@ static netdev_features_t macvlan_fix_features(struct 
net_device *dev,
lowerdev_features &= (features | ~NETIF_F_LRO);
features = netdev_increment_features(lowerdev_features, features, mask);
features |= ALWAYS_ON_FEATURES;
-   features &= ~NETIF_F_NETNS_LOCAL;
+   features &= (ALWAYS_ON_FEATURES | MACVLAN_FEATURES);
 
return features;
 }
-- 
2.7.4



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Kees Cook
How is this not marked [RFC]? :)

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
>
>   request_module("foo") ->
> call_umh("modprobe foo") ->
>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)

Yikes. This means autoloaded artbitrary exec() now, instead of
autoloading just loading arbitrary modules? That seems extremely bad.
This just extends all the problems we've had with defining security
boundaries with modules out to umh too. I would need some major
convincing that this can be made safe.

It's easy for container to trigger arbitrary module loading, so if it
can find/use a similar one of the bugs we've seen in the past to
redirect the module loading we don't just get new module-created
attack surface added to the kernel, but we again get arbitrary ELF
running in the root ns (not in the container). And in the insane case
of a container with CAP_SYS_MODULE, this is a trivial escape without
even needing to build a special kernel module: the root ns, running
with all privileges runs an arbitrary ELF.

As it stands, I have to NAK this. At the very least, you need to solve
the execution environment problems here: the ELF should run with no
greater privileges than what loaded the module, and very importantly,
must not be allowed to bypass these checks through autoloading. What
_triggered_ the autoload must be the environment, not the "modprobe",
since that's running with full privileges. Basically, we need to fix
module autoloading first, or at least a significant subset of the
problems: https://lkml.org/lkml/2017/11/27/754

> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel. Another
> advantage coming with that would be that bpfilter.ko can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).
> Also, such architecture makes the kernel/user boundary very precise:
> control plane is done by the user space while data plane stays in the kernel.
>
> It's easy to distinguish "umh module" from traditional kernel module:
>
> $ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
>   Type:  EXEC (Executable file)

As Andy asked earlier, why not DYN too to catch PIE executables? Seems
like forcing the userspace helper to be non-PIE would defeat some of
the userspace defenses in use in most distros.

> $ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
>   Type:  REL (Relocatable file)
>
> Since umh can crash, can be oom-ed by the kernel, killed by admin,
> the subsystem that uses them (like bpfilter) need to manage life
> time of umh on its own, so module infra doesn't do any accounting
> of them. They don't appear in "lsmod" and cannot be "rmmod".
> Multiple request_module("umh") will load multiple umh.ko processes.
>
> Similar to kernel modules the kernel will be tainted if "umh module"
> has invalid signature.
>
> [1] https://lwn.net/Articles/747551/
>
> Signed-off-by: Alexei Starovoitov 
> ---
>  fs/exec.c   | 40 +++-
>  include/linux/binfmts.h |  1 +
>  include/linux/umh.h |  3 +++
>  kernel/module.c | 43 ++-
>  kernel/umh.c| 24 +---
>  5 files changed, 94 insertions(+), 17 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 7eb8d21bcab9..0483c438de7d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
>  /*
>   * sys_execve() executes a new program.
>   */
> -static int do_execveat_common(int fd, struct filename *filename,
> - struct user_arg_ptr argv,
> - struct user_arg_ptr envp,
> - int flags)
> +static int __do_execve_file(int fd, struct filename *filename,
> +   struct user_arg_ptr argv,
> +   struct user_arg_ptr envp,
> +   int flags, struct file *file)
>  {
> char *pathbuf = NULL;
> struct linux_binprm *bprm;
> -   struct file *file;
> struct files_struct *displaced;
> i

Re: [RFC v3 net-next 00/18] Time based packet transmission

2018-03-08 Thread Jesus Sanchez-Palencia
Hi,


On 03/08/2018 02:54 PM, Henrik Austad wrote:
> Just looking at the timestamp when the frames were received. They should be 
> sent at regular intervals if I read udp_tai.c correctly, so the assumption 
> was that the timestamp from tcpdump should give an inkling to how well it 
> worked.
> 
> I set it up to send a frame every 10ms and computed the diff between each 
> UDP packet received. Nothing fancy, just tcpdump and grep for the 
> timestamp and look at the distribution.

Ok, I see it now. Just as a reference, this is how I've been running tcpdump on
my tests:

$ tcpdump -i enp3s0 -w foo.pcap -j adapter_unsynced \
-tt --time-stamp-precision=nano udp port 7788 -c 1


> 
>>> I have to dig more into why this is happening, a lot frames delayed much 
>>> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
>>> obvious fix is move some hw around and do a direct link, but I didn't have 
>>> time for that right now.
>>>
>>> I'm very interested in doing what Richard's original test was when he used 
>>> ptp-synched clocks and also used hw receive-time and compared with expected 
>>> tx-time. So, while I'm getting that up and running, I thought I should 
>>> share the early results.
>>
>> Sure, thanks. Which delta and clockid are you using, please?
> 
> I used the example provided in -00,
> 
> tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
>  map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> tc qdisc add dev eth2 parent 100:1 tbs offload delta 10 clockid \
>  CLOCK_REALTIME sorting


The delta value is highly dependent on the system. I recommend playing around
with it a bit before running long tests. On my KabyLake desktop I noticed that
150us is quite reliable value, for example. (same kernel as yours, and no
preempt-rt applied) But that is not the issue here it seems.



> 
>> Also, was this clock synchronized to the PHC? You need that for hw offload 
>> with
>> sorting enabled.
> 
> Hmm, good point, no, NIC clock was not synchronized, I'll do that in the 
> next round for both sender and receiver!

Oh, then you need to get that setup first. Here I synchronize both PHCs over the
network first with ptp4l:

Rx) $ ptp4l --summary_interval=3 -i enp3s0 -m -2
Tx) $ ptp4l --summary_interval=3 -i enp3s0 -s -m -2 &

My Rx is the PTP master and the Tx is the PTP slave.
Then I synchronize the PHC to the system clock on the Tx side only:

Tx) $ phc2sys -a -r -r -u 8 &


And udp_tai is using CLOCK_REALTIME. The UTC vs TAI 37s offset makes no
difference for this test specifically because I compensate for it when
calculating the offsets on the Rx side.

For the next patchset version I will be providing a more complete set of testing
instructions. I hope that helps for now.


Thanks,
Jesus






[PULL resend] virtio: bugfix

2018-03-08 Thread Michael S. Tsirkin
Looks like my pull had corrupted headers. Resending with
fixed up ones.

The following changes since commit 4a3928c6f8a53fa1aed28ccba227742486e8ddcb:

  Linux 4.16-rc3 (2018-02-25 18:50:41 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to e82df670235138575b37ff0ec24412a471efd97f:

  virtio_ring: fix num_free handling in error case (2018-03-01 18:53:38 +0200)


virtio: bugfix

This includes a bugfix for error handling in virtio.

Signed-off-by: Michael S. Tsirkin 


Tiwei Bie (1):
  virtio_ring: fix num_free handling in error case

 drivers/virtio/virtio_ring.c | 2 --
 1 file changed, 2 deletions(-)


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-08 Thread Andy Lutomirski
On Thu, Mar 8, 2018 at 11:51 PM, Mickaël Salaün  wrote:
>
> On 07/03/2018 02:21, Andy Lutomirski wrote:
>> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün  wrote:
>>>
>>> On 06/03/2018 23:46, Tycho Andersen wrote:
 On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>> container, but I don't want to allow moun() in general and I want to
>>> emulate certain mount() actions.  I can write a filter that catches
>>> mount using seccomp and calls out to the container manager for help.
>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>> supported.
>>
>> Well, I think this use case should be handled with something like
>> LD_PRELOAD and a helper library. FYI, I did something like this:
>> https://github.com/stemjail/stemshim
>
> I doubt that will work for containers.  Containers that use user
> namespaces and, for example, setuid programs aren't going to honor
> LD_PRELOAD.

 Or anything that calls syscalls directly, like go programs.
>>>
>>> That's why the vDSO-like approach. Enforcing an access control is not
>>> the issue here, patching a buggy userland (without patching its code) is
>>> the issue isn't it?
>>>
>>> As far as I remember, the main problem is to handle file descriptors
>>> while "emulating" the kernel behavior. This can be done with a "shim"
>>> code mapped in every processes. Chrome used something like this (in a
>>> previous sandbox mechanism) as a kind of emulation (with the current
>>> seccomp-bpf ). I think it should be doable to replace the (userland)
>>> emulation code with an IPC wrapper receiving file descriptors through
>>> UNIX socket.
>>>
>>
>> Can you explain exactly what you mean by "vDSO-like"?
>>
>> When a 64-bit program does a syscall, it just executes the SYSCALL
>> instruction.  The vDSO isn't involved at all.  32-bit programs usually
>> go through the vDSO, but not always.
>>
>> It could be possible to force-load a DSO into an entire container and
>> rig up seccomp to intercept all SYSCALLs not originating from the DSO
>> such that they merely redirect control to the DSO, but that seems
>> quite messy.
>
> vDSO is a code mapped for all processes. As you said, these processes
> may use it or not. What I was thinking about is to use the same concept,
> i.e. map a "shim" code into each processes pertaining to a particular
> hierarchy (the same way seccomp filters are inherited across processes).
> With a seccomp filter matching some syscall (e.g. mount, open), it is
> possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
> shim code should then be able to emulate/patch what is needed, even
> faking a file opening by receiving a file descriptor through a UNIX
> socket. As did the Chrome sandbox, the seccomp filter may look at the
> calling address to allow the shim code to call syscalls without being
> catched, if needed. However, relying on SIGSYS may not fit with
> arbitrary code. Using a new SECCOMP_RET_EMULATE (?) may be used to jump
> to a specific process address, to emulate the syscall in an easier way
> than only relying on a {c,e}BPF program.
>

This could indeed be done, but I think that Tycho's approach is much
cleaner and probably faster.


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-08 Thread Mickaël Salaün

On 07/03/2018 02:21, Andy Lutomirski wrote:
> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün  wrote:
>>
>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>> Suppose I'm writing a container manager.  I want to run "mount" in the
>> container, but I don't want to allow moun() in general and I want to
>> emulate certain mount() actions.  I can write a filter that catches
>> mount using seccomp and calls out to the container manager for help.
>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>> supported.
>
> Well, I think this use case should be handled with something like
> LD_PRELOAD and a helper library. FYI, I did something like this:
> https://github.com/stemjail/stemshim

 I doubt that will work for containers.  Containers that use user
 namespaces and, for example, setuid programs aren't going to honor
 LD_PRELOAD.
>>>
>>> Or anything that calls syscalls directly, like go programs.
>>
>> That's why the vDSO-like approach. Enforcing an access control is not
>> the issue here, patching a buggy userland (without patching its code) is
>> the issue isn't it?
>>
>> As far as I remember, the main problem is to handle file descriptors
>> while "emulating" the kernel behavior. This can be done with a "shim"
>> code mapped in every processes. Chrome used something like this (in a
>> previous sandbox mechanism) as a kind of emulation (with the current
>> seccomp-bpf ). I think it should be doable to replace the (userland)
>> emulation code with an IPC wrapper receiving file descriptors through
>> UNIX socket.
>>
> 
> Can you explain exactly what you mean by "vDSO-like"?
> 
> When a 64-bit program does a syscall, it just executes the SYSCALL
> instruction.  The vDSO isn't involved at all.  32-bit programs usually
> go through the vDSO, but not always.
> 
> It could be possible to force-load a DSO into an entire container and
> rig up seccomp to intercept all SYSCALLs not originating from the DSO
> such that they merely redirect control to the DSO, but that seems
> quite messy.

vDSO is a code mapped for all processes. As you said, these processes
may use it or not. What I was thinking about is to use the same concept,
i.e. map a "shim" code into each processes pertaining to a particular
hierarchy (the same way seccomp filters are inherited across processes).
With a seccomp filter matching some syscall (e.g. mount, open), it is
possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
shim code should then be able to emulate/patch what is needed, even
faking a file opening by receiving a file descriptor through a UNIX
socket. As did the Chrome sandbox, the seccomp filter may look at the
calling address to allow the shim code to call syscalls without being
catched, if needed. However, relying on SIGSYS may not fit with
arbitrary code. Using a new SECCOMP_RET_EMULATE (?) may be used to jump
to a specific process address, to emulate the syscall in an easier way
than only relying on a {c,e}BPF program.



signature.asc
Description: OpenPGP digital signature


Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 1:40 PM, Kees Cook  wrote:
> +#define __min(t1, t2, x, y)\
> +   __builtin_choose_expr(__builtin_constant_p(x) &&\
> + __builtin_constant_p(y) &&\
> + __builtin_types_compatible_p(t1, t2), \
> + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\

I understand why you use __builtin_types_compatible_p(), but please don't.

It will mean that trivial constants like "5" and "sizeof(x)" won't
simplify, because they have different types.

The ?: will give the right combined type anyway, and if you want the
type comparison warning, just add a comma-expression with something
like like

   (t1 *)1 == (t2 *)1

to get the type compatibility warning.

Yeah, yeah, maybe none of the VLA cases triggered that, but it seems
silly to not just get that obvious constant case right.

Hmm?

  Linus


[PATCH v2] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Kees Cook
When max() is used in stack array size calculations from literal values
(e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
thinks this is a dynamic calculation due to the single-eval logic, which
is not needed in the literal case. This change removes several accidental
stack VLAs from an x86 allmodconfig build:

$ diff -u before.txt after.txt | grep ^-
-drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
variable length array ‘ids’ [-Wvla]
-fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length array 
‘namebuf’ [-Wvla]
-lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
[-Wvla]
-net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
[-Wvla]
-net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
[-Wvla]
-net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array ‘buff64’ 
[-Wvla]

Based on an earlier patch from Josh Poimboeuf.

Signed-off-by: Kees Cook 
---
v2:
- fix copy/paste-o max1_/max2_ (ijc)
- clarify "compile-time" constant in comment (Rasmus)
- clean up formatting on min_t()/max_t()
---
 include/linux/kernel.h | 50 --
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3fd291503576..108cdf7bd484 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -787,37 +787,57 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
oops_dump_mode) { }
  * strict type-checking.. See the
  * "unnecessary" pointer comparison.
  */
-#define __min(t1, t2, min1, min2, x, y) ({ \
+#define __single_eval_min(t1, t2, min1, min2, x, y) ({ \
t1 min1 = (x);  \
t2 min2 = (y);  \
(void) (&min1 == &min2);\
min1 < min2 ? min1 : min2; })
 
+/*
+ * In the case of compile-time constant values, there is no need to do
+ * the double-evaluation protection, so the raw comparison can be made.
+ * This allows min()/max() to be used in stack array allocations and
+ * avoid the compiler thinking it is a dynamic value leading to an
+ * accidental VLA.
+ */
+#define __min(t1, t2, x, y)\
+   __builtin_choose_expr(__builtin_constant_p(x) &&\
+ __builtin_constant_p(y) &&\
+ __builtin_types_compatible_p(t1, t2), \
+ (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
+ __single_eval_min(t1, t2, \
+   __UNIQUE_ID(min1_), \
+   __UNIQUE_ID(min2_), \
+   x, y))
+
 /**
  * min - return minimum of two values of the same or compatible types
  * @x: first value
  * @y: second value
  */
-#define min(x, y)  \
-   __min(typeof(x), typeof(y), \
- __UNIQUE_ID(min1_), __UNIQUE_ID(min2_),   \
- x, y)
+#define min(x, y)  __min(typeof(x), typeof(y), x, y)
 
-#define __max(t1, t2, max1, max2, x, y) ({ \
+#define __single_eval_max(t1, t2, max1, max2, x, y) ({ \
t1 max1 = (x);  \
t2 max2 = (y);  \
(void) (&max1 == &max2);\
max1 > max2 ? max1 : max2; })
 
+#define __max(t1, t2, x, y)\
+   __builtin_choose_expr(__builtin_constant_p(x) &&\
+ __builtin_constant_p(y) &&\
+ __builtin_types_compatible_p(t1, t2), \
+ (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
+ __single_eval_max(t1, t2, \
+   __UNIQUE_ID(max1_), \
+   __UNIQUE_ID(max2_), \
+   x, y))
 /**
  * max - return maximum of two values of the same or compatible types
  * @x: first value
  * @y: second value
  */
-#define max(x, y)  \
-   __max(typeof(x), typeof(y), \
- __UNIQUE_ID(max1_), __UNIQUE_ID(max2_),   \
- x, y)
+#define max(x, y)  __max(typeof(x), typeof(y), x, y)
 
 /**
  * min3 - return minimum of three values
@@ -869,10 +889,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
oops_dump_mode) { }
  * @x: first value
  * @y: second value
  */
-#define min_t(type, x, y)  \
-   __min(type, type,   \
- __UNIQUE_ID(min1_), __UNIQUE_ID(min2_),   \
- 

Re: [PATCH net] macvlan: filter out xfrm feature flags

2018-03-08 Thread Shannon Nelson

On 3/8/2018 9:33 AM, David Miller wrote:

From: Shannon Nelson 
Date: Tue,  6 Mar 2018 14:57:08 -0800


This isn't broken for vlans because they use a separate features
connection (vlan_features) for inheriting features.  This is
fine, but I don't think trying to add something like this to
every driver for every new upperdev is a good idea - I think
the upperdev should try to protect itself.


I think this fix is correct.

But for how many upperdevs are we going to duplicate code like this,
and how many subtle differences and in fact bugs will result from all
of that duplication?

I think we really need something common for these upperdev drivers
to use.  Maybe just a macro defining feature bits to trim in this
situation.

Thanks.


Thanks, Dave.  Yes, this could use something a little more generic, 
something that would catch any future "dangerous" bits.


I'm not sure we can come up with a generic mask to be used by everyone, 
as each upper and lower dev has their own feature support levels.  But 
we might come up with a better example for others to follow.


Rather than calling out specific non-supported bits, we can probably 
just build a mask from bits that we already know are supported, 
something like this:

features &= (ALWAYS_ON_FEATURES | MACVLAN_FEATURES);

which would take care of NETIF_F_NETNS_LOCAL, the ESP flags, and 
anything else that wasn't already specifically called for.  I'll repost 
with this and see what folks think.


sln


Re: [PATCH 0/3] Remove accidental VLA usage

2018-03-08 Thread Kees Cook
On Thu, Mar 8, 2018 at 2:12 PM, Rasmus Villemoes
 wrote:
> On 8 March 2018 at 21:39, Kees Cook  wrote:
>> However, this works for me:
>>
>> #define __new_max(t1, t2, max1, max2, x, y)\
>>__builtin_choose_expr(__builtin_constant_p(x) && \
>>  __builtin_constant_p(y) && \
>>  __builtin_types_compatible_p(t1, t2), \
>>  (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
>>  __max(t1, t2, max1, max2, x, y))
>>
>> #define new_max(x, y) \
>> __new_max(typeof(x), typeof(y), \
>>   __UNIQUE_ID(max1_), __UNIQUE_ID(max2_),   \
>>   x, y)
>
> Yes, that would seem to do the trick.
>
> Thinking out loud: do we really want or need the
> __builtin_types_compatible condition when x and y are compile-time
> constants? I think it would be nice to be able to use max(16,
> sizeof(bla)) without having to cast either the literal or the sizeof.
> Just omitting the type compatibility check might be dangerous, but
> perhaps it could be relaxed to a check that both values are
> representable in their common promoted type. Something like
>
> (type_signed(t1) == type_signed(t2)) || ((t1)x >= 0 && (t2)y >= 0)
>
> should be safe (if the types have same signedness, or the value of
> signed type is positive), though it doesn't allow a few corner cases
> (e.g. short vs. unsigned char is always ok due to promotion to int,
> and also if the signed type is strictly wider than the unsigned type).

I agree, it would be nice. However, I think it'd be better to continue
to depend on max_t() for these kinds of cases though. For example:

char foo[max_t(size_t, 6, sizeof(something))];

Works with the proposed patch.

Also, I think this mismatch would already be triggering warnings, so
we shouldn't have any currently.

-Kees

-- 
Kees Cook
Pixel Security


Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]

2018-03-08 Thread Cong Wang
On Thu, Mar 8, 2018 at 8:02 AM, Marco Berizzi  wrote:
>> Marco Berizzi wrote:
>>
>>
>> Hello everyone,
>>
>> Yesterday I got this error on a slackware linux 4.16-rc4 system
>> running as a traffic shaping gateway and netfilter nat.
>> The error has been arisen after a partial ISP network outage,
>> so unfortunately it will not trivial for me to reproduce it again.
>
> Hello everyone,
>
> I'm getting this error twice/day, so fortunately I'm able to
> reproduce it.

IIRC, there was a patch for this, but it got lost...

I will take a look anyway.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov

On 3/7/18 5:23 PM, Luis R. Rodriguez wrote:


request_module() has its own world though too. How often in your proof of
concept is request_module() called? How many times do you envision it being
called?


once.


+static int run_umh(struct file *file)
+{
+   struct subprocess_info *sub_info = call_usermodehelper_setup_file(file);
+
+   if (!sub_info)
+   return -ENOMEM;
+   return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
+}


run_umh() calls the program and waits. Note that while we are running a UMH we
can't suspend. What if they take forever, who is hosing them down with an
equivalent:

schedule();
try_to_freeze();

Say they are buggy and never return, will they stall suspend, have you tried?

call_usermodehelper_exec() uses helper_lock() which in turn is used for
umh's accounting for number of running umh's. One of the sad obscure uses
for this is to deal with suspend/resume. Refer to __usermodehelper* calls
on kernel/power/process.c

Note how you use UMH_WAIT_EXEC too, so this is all synchronous.


looks like you misread this code and the rest of your concerns
with suspend/resume are not applicable any more.

#define UMH_NO_WAIT 0   /* don't wait at all */
#define UMH_WAIT_EXEC   1   /* wait for the exec, but not the process */
#define UMH_WAIT_PROC   2   /* wait for the process to complete */
#define UMH_KILLABLE4   /* wait for EXEC/PROC killable */

bpftiler.ko is started once and runs non-stop from there on.
Unless it crashes, but it's a different discussion.


+   if (info->hdr->e_type == ET_EXEC) {
+#ifdef CONFIG_MODULE_SIG
+   if (!info->sig_ok) {
+   pr_notice_once("umh %s verification failed: signature and/or 
required key missing - tainting kernel\n",
+  info->file->f_path.dentry->d_name.name);
+   add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
+   }
+#endif


So I guess this check is done *after* run_umh() then, what about the enforce 
mode,
don't we want to reject loading at all in any circumstance?


already answered this twice in the thread.



Re: [RFC v3 net-next 00/18] Time based packet transmission

2018-03-08 Thread Henrik Austad
On Thu, Mar 08, 2018 at 10:06:46AM -0800, Jesus Sanchez-Palencia wrote:
> Hi,
> 
> 
> On 03/08/2018 06:09 AM, Henrik Austad wrote:
> 
> (...)
> 
> > 
> > A lot of new knobs, I see the need, I would've like to have fewer, but 
> > you've documented them pretty well. Perhaps we should add something to 
> > Documentation/ at one stage?
> 
> Sure. The idea is working on that once the interfaces have been accepted.

Yeah, probably a good idea.

> > Anyways, the patches applied cleanly so I gave them a (very) quick spin. 
> > Using udp_tai and tcpdump in the other end to grab the frames
> > 
> > Setting up with hw offload and sorting in qdisc.
> > 
> > Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss 
> > bypass as dual-core and i210 is not friends):
> > 
> > udp_tai -c1 -i eth2 -p 20 -P 1000
> > 
> > Receiver (imx7, kernel 4.9.11):
> > chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 
> > 256" > tai_imx7.log
> > 
> > Note: this involves 2 swtiches and a somewhat hackish kernel running on the 
> > receiver, so these numbers can only improve.
> > 
> > count2340.00
> > mean0.043770
> > std 0.047784
> > min 0.009025
> > 25% 0.010003
> > 50% 0.010010
> > 75% 0.109998
> > max 0.120060
> > 
> 
> Thanks for giving it a shot.
> 
> But I'm not sure I follow the numbers above, sorry :/
> Are you computing the packet's Rx timestamp offset from the (expected) Tx 
> time?

Just looking at the timestamp when the frames were received. They should be 
sent at regular intervals if I read udp_tai.c correctly, so the assumption 
was that the timestamp from tcpdump should give an inkling to how well it 
worked.

I set it up to send a frame every 10ms and computed the diff between each 
UDP packet received. Nothing fancy, just tcpdump and grep for the 
timestamp and look at the distribution.

> > I have to dig more into why this is happening, a lot frames delayed much 
> > more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
> > obvious fix is move some hw around and do a direct link, but I didn't have 
> > time for that right now.
> > 
> > I'm very interested in doing what Richard's original test was when he used 
> > ptp-synched clocks and also used hw receive-time and compared with expected 
> > tx-time. So, while I'm getting that up and running, I thought I should 
> > share the early results.
> 
> Sure, thanks. Which delta and clockid are you using, please?

I used the example provided in -00,

tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

tc qdisc add dev eth2 parent 100:1 tbs offload delta 10 clockid \
 CLOCK_REALTIME sorting

> Also, was this clock synchronized to the PHC? You need that for hw offload 
> with
> sorting enabled.

Hmm, good point, no, NIC clock was not synchronized, I'll do that in the 
next round for both sender and receiver!

-henrik


signature.asc
Description: PGP signature


[RFC] Removing VLA usage in l1oip_core

2018-03-08 Thread Gustavo A. R. Silva
Hi Karsten,

I'm trying to figure out the best way to fix the following VLA warning:

drivers/isdn/mISDN/l1oip_core.c:282:2: warning: ISO C90 forbids variable length 
array ‘frame’ [-Wvla]
  u8 frame[len + 32];
  ^~

So while doing some research I've found the following.

Based on this code at include/linux/mISDNhw.h:38: #define MAX_DFRAME_LEN_L1 300 
and the following at drivers/isdn/mISDN/l1oip_core.c:1115:


if (skb->len > MAX_DFRAME_LEN_L1 || skb->len > L1OIP_MAX_LEN) {
printk(KERN_WARNING "%s: skb too large\n",
   __func__);
break;
}
/* check for AIS / ulaw-silence */
l = skb->len;
if (!memchr_inv(skb->data, 0xff, l)) {
if (debug & DEBUG_L1OIP_MSG)
printk(KERN_DEBUG "%s: got AIS, not sending, "
   "but counting\n", __func__);
hc->chan[bch->slot].tx_counter += l;
skb_trim(skb, 0);
queue_ch_frame(ch, PH_DATA_CNF, hh->id, skb);
return 0;
}
/* check for silence */
l = skb->len;
if (!memchr_inv(skb->data, 0x2a, l)) {
if (debug & DEBUG_L1OIP_MSG)
printk(KERN_DEBUG "%s: got silence, not sending"
   ", but counting\n", __func__);
hc->chan[bch->slot].tx_counter += l;
skb_trim(skb, 0);
queue_ch_frame(ch, PH_DATA_CNF, hh->id, skb);
return 0;
}

/* send frame */
p = skb->data;
l = skb->len;
while (l) {
ll = (l < L1OIP_MAX_PERFRAME) ? l : L1OIP_MAX_PERFRAME;
l1oip_socket_send(hc, hc->codec, bch->slot, 0,
  hc->chan[bch->slot].tx_counter, p, 
ll);
hc->chan[bch->slot].tx_counter += ll;
p += ll;
l -= ll;
}


it seems that the maximum value 'len' can take at 
drivers/isdn/mISDN/l1oip_core.c:282 is 300


/*
 * send a frame via socket, if open and restart timer
 */
static int
l1oip_socket_send(struct l1oip *hc, u8 localcodec, u8 channel, u32 chanmask,
  u16 timebase, u8 *buf, int len)
{
u8 *p;
u8 frame[len + 32];


If this is correct, I could send the following patch to fix the VLA warning:

diff --git a/drivers/isdn/mISDN/l1oip_core.c b/drivers/isdn/mISDN/l1oip_core.c
index 21d50e4..31e3cd5 100644
--- a/drivers/isdn/mISDN/l1oip_core.c
+++ b/drivers/isdn/mISDN/l1oip_core.c
@@ -279,7 +279,7 @@ l1oip_socket_send(struct l1oip *hc, u8 localcodec, u8 
channel, u32 chanmask,
  u16 timebase, u8 *buf, int len)
 {
u8 *p;
-   u8 frame[len + 32];
+   u8 frame[332];
struct socket *socket = NULL;
 
if (debug & DEBUG_L1OIP_MSG)

But I wanted to ask for your feedback first, in case I may be missing something.

Another solution is to use dynamic memory allocation, but if the maximum size 
for 'frame' is in the hundreds of bytes, it might not be worth the performace 
penalty.

What do you think?

Thanks!
--
Gustavo


Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Kees Cook
On Thu, Mar 8, 2018 at 2:18 PM, Andrew Morton  wrote:
> On Thu, 8 Mar 2018 13:40:45 -0800 Kees Cook  wrote:
>
>> When max() is used in stack array size calculations from literal values
>> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
>> thinks this is a dynamic calculation due to the single-eval logic, which
>> is not needed in the literal case. This change removes several accidental
>> stack VLAs from an x86 allmodconfig build:
>>
>> $ diff -u before.txt after.txt | grep ^-
>> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
>> variable length array ‘ids’ [-Wvla]
>> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
>> array ‘namebuf’ [-Wvla]
>> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
>> [-Wvla]
>> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array 
>> ‘buff’ [-Wvla]
>> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array 
>> ‘buff’ [-Wvla]
>> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
>> ‘buff64’ [-Wvla]
>>
>> Based on an earlier patch from Josh Poimboeuf.
>>
>> ...
>>
>> --- a/include/linux/kernel.h
>> +++ b/include/linux/kernel.h
>> @@ -787,37 +787,57 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
>> oops_dump_mode) { }
>>   * strict type-checking.. See the
>>   * "unnecessary" pointer comparison.
>>   */
>> -#define __min(t1, t2, min1, min2, x, y) ({   \
>> +#define __single_eval_min(t1, t2, min1, min2, x, y) ({   \
>>   t1 min1 = (x);  \
>>   t2 min2 = (y);  \
>>   (void) (&min1 == &min2);\
>>   min1 < min2 ? min1 : min2; })
>>
>> +/*
>> + * In the case of builtin constant values, there is no need to do the
>> + * double-evaluation protection, so the raw comparison can be made.
>> + * This allows min()/max() to be used in stack array allocations and
>> + * avoid the compiler thinking it is a dynamic value leading to an
>> + * accidental VLA.
>> + */
>> +#define __min(t1, t2, x, y)  \
>> + __builtin_choose_expr(__builtin_constant_p(x) &&\
>> +   __builtin_constant_p(y) &&\
>> +   __builtin_types_compatible_p(t1, t2), \
>> +   (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
>> +   __single_eval_min(t1, t2, \
>> + __UNIQUE_ID(max1_), \
>> + __UNIQUE_ID(max2_), \
>> + x, y))
>> +
>
> Holy crap.
>
> I suppose gcc will one day be fixed and we won't need this.
>
> Is there a good reason to convert min()?  Surely nobody will be using
> min to dimension an array - always max?  Just for symmetry, I guess.

I just went with symmetry. It seems like an ugly risk to implement min
and mix differently. :) In theory it may produce smaller code for rare
min() uses, but I haven't actually verified that.

I will send a v2 with the two nits mentioned...

-Kees

-- 
Kees Cook
Pixel Security


[PATCH iproute2-next 1/2] tc: Add missing documentation for codel and fq_codel parameters

2018-03-08 Thread Toke Høiland-Jørgensen
Add missing documentation of the memory_limit fq_codel parameter and the
ce_threshold codel and fq_codel parameters.

Signed-off-by: Toke Høiland-Jørgensen 
---
 man/man8/tc-codel.8| 10 +-
 man/man8/tc-fq_codel.8 | 18 +-
 tc/q_fq_codel.c|  1 +
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-codel.8 b/man/man8/tc-codel.8
index a0e50a4e..e538e940 100644
--- a/man/man8/tc-codel.8
+++ b/man/man8/tc-codel.8
@@ -13,7 +13,9 @@ TIME ] [
 .B ecn
 |
 .B noecn
-]
+] [
+.B ce_threshold
+TIME ]
 
 .SH DESCRIPTION
 CoDel (pronounced "coddle") is an adaptive "no-knobs" active queue management
@@ -80,6 +82,12 @@ can be used to turn it off and vice-a-versa. By default,
 .B ecn
 is turned off.
 
+.SS ce_threshold
+sets a threshold above which all packets are marked with ECN Congestion
+Experienced. This is useful for DCTCP-style congestion control algorithms that
+require marking at very shallow queueing thresholds.
+
+
 .SH EXAMPLES
  # tc qdisc add dev eth0 root codel
  # tc -s qdisc show
diff --git a/man/man8/tc-fq_codel.8 b/man/man8/tc-fq_codel.8
index a80389a1..7ee6c269 100644
--- a/man/man8/tc-fq_codel.8
+++ b/man/man8/tc-fq_codel.8
@@ -17,7 +17,11 @@ BYTES ] [
 .B ecn
 |
 .B noecn
-]
+] [
+.B ce_threshold
+TIME ] [
+.B memory_limit
+BYTES ]
 
 .SH DESCRIPTION
 FQ_Codel (Fair Queuing Controlled Delay) is queuing discipline that combines 
Fair
@@ -35,6 +39,13 @@ and is the hard limit on the real queue size.
 When this limit is reached, incoming packets are dropped. Default is 10240
 packets.
 
+.SS memory_limit
+sets a limit on the total number of bytes that can be queued in this FQ-CoDel
+instance. The lower of the packet limit of the
+.B limit
+parameter and the memory limit will be enforced. Default is 32 MB.
+
+
 .SS flows
 is the number of flows into which the incoming packets are classified. Due to
 the stochastic nature of hashing, multiple flows may end up being hashed into
@@ -73,6 +84,11 @@ can be used to turn it off and vice-a-versa. Unlike
 .B codel, ecn
 is turned on by default.
 
+.SS ce_threshold
+sets a threshold above which all packets are marked with ECN Congestion
+Experienced. This is useful for DCTCP-style congestion control algorithms that
+require marking at very shallow queueing thresholds.
+
 .SH EXAMPLES
 #tc qdisc add   dev eth0 root fq_codel
 .br
diff --git a/tc/q_fq_codel.c b/tc/q_fq_codel.c
index 9e3736fe..1b2931ef 100644
--- a/tc/q_fq_codel.c
+++ b/tc/q_fq_codel.c
@@ -50,6 +50,7 @@
 static void explain(void)
 {
fprintf(stderr, "Usage: ... fq_codel [ limit PACKETS ] [ flows NUMBER 
]\n");
+   fprintf(stderr, "[ memory_limit BYTES ]\n");
fprintf(stderr, "[ target TIME ] [ interval TIME 
]\n");
fprintf(stderr, "[ quantum BYTES ] [ [no]ecn ]\n");
fprintf(stderr, "[ ce_threshold TIME ]\n");
-- 
2.16.2



[PATCH iproute2-next 2/2] tc: Add JSON output of fq_codel stats

2018-03-08 Thread Toke Høiland-Jørgensen
Enable proper JSON output support for fq_codel in `tc -s qdisc` output.

Signed-off-by: Toke Høiland-Jørgensen 
---
 tc/q_fq_codel.c | 49 -
 1 file changed, 32 insertions(+), 17 deletions(-)

diff --git a/tc/q_fq_codel.c b/tc/q_fq_codel.c
index 1b2931ef..02ad2214 100644
--- a/tc/q_fq_codel.c
+++ b/tc/q_fq_codel.c
@@ -239,35 +239,50 @@ static int fq_codel_print_xstats(struct qdisc_util *qu, 
FILE *f,
st = &_st;
}
if (st->type == TCA_FQ_CODEL_XSTATS_QDISC) {
-   fprintf(f, "  maxpacket %u drop_overlimit %u new_flow_count %u 
ecn_mark %u",
-   st->qdisc_stats.maxpacket,
-   st->qdisc_stats.drop_overlimit,
-   st->qdisc_stats.new_flow_count,
+   print_uint(PRINT_ANY, "maxpacket", "  maxpacket %u",
+   st->qdisc_stats.maxpacket);
+   print_uint(PRINT_ANY, "drop_overlimit", " drop_overlimit %u",
+   st->qdisc_stats.drop_overlimit);
+   print_uint(PRINT_ANY, "new_flow_count", " new_flow_count %u",
+   st->qdisc_stats.new_flow_count);
+   print_uint(PRINT_ANY, "ecn_mark", " ecn_mark %u",
st->qdisc_stats.ecn_mark);
if (st->qdisc_stats.ce_mark)
-   fprintf(f, " ce_mark %u", st->qdisc_stats.ce_mark);
+   print_uint(PRINT_ANY, "ce_mark", " ce_mark %u",
+   st->qdisc_stats.ce_mark);
if (st->qdisc_stats.memory_usage)
-   fprintf(f, " memory_used %u", 
st->qdisc_stats.memory_usage);
+   print_uint(PRINT_ANY, "memory_used", " memory_used %u",
+   st->qdisc_stats.memory_usage);
if (st->qdisc_stats.drop_overmemory)
-   fprintf(f, " drop_overmemory %u", 
st->qdisc_stats.drop_overmemory);
-   fprintf(f, "\n  new_flows_len %u old_flows_len %u",
-   st->qdisc_stats.new_flows_len,
+   print_uint(PRINT_ANY, "drop_overmemory", " 
drop_overmemory %u",
+   st->qdisc_stats.drop_overmemory);
+   print_uint(PRINT_ANY, "new_flows_len", "\n  new_flows_len %u",
+   st->qdisc_stats.new_flows_len);
+   print_uint(PRINT_ANY, "old_flows_len", " old_flows_len %u",
st->qdisc_stats.old_flows_len);
}
if (st->type == TCA_FQ_CODEL_XSTATS_CLASS) {
-   fprintf(f, "  deficit %d count %u lastcount %u ldelay %s",
-   st->class_stats.deficit,
-   st->class_stats.count,
-   st->class_stats.lastcount,
+   print_uint(PRINT_ANY, "deficit", "  deficit %u",
+   st->class_stats.deficit);
+   print_uint(PRINT_ANY, "count", " count %u",
+   st->class_stats.count);
+   print_uint(PRINT_ANY, "lastcount", " lastcount %u",
+   st->class_stats.lastcount);
+   print_uint(PRINT_JSON, "ldelay", NULL,
+   st->class_stats.ldelay);
+   print_string(PRINT_FP, NULL, " ldelay %s",
sprint_time(st->class_stats.ldelay, b1));
if (st->class_stats.dropping) {
-   fprintf(f, " dropping");
+   print_bool(PRINT_ANY, "dropping", " dropping", true);
if (st->class_stats.drop_next < 0)
-   fprintf(f, " drop_next -%s",
+   print_string(PRINT_FP, NULL, " drop_next -%s",
sprint_time(-st->class_stats.drop_next, 
b1));
-   else
-   fprintf(f, " drop_next %s",
+   else {
+   print_uint(PRINT_JSON, "drop_next", NULL,
+   st->class_stats.drop_next);
+   print_string(PRINT_FP, NULL, " drop_next %s",
sprint_time(st->class_stats.drop_next, 
b1));
+   }
}
}
return 0;
-- 
2.16.2



[PATCH net v3] net: phy: Tell caller result of phy_change()

2018-03-08 Thread Brad Mouring
In 664fcf123a30e (net: phy: Threaded interrupts allow some simplification)
the phy_interrupt system was changed to use a traditional threaded
interrupt scheme instead of a workqueue approach.

With this change, the phy status check moved into phy_change, which
did not report back to the caller whether or not the interrupt was
handled. This means that, in the case of a shared phy interrupt,
only the first phydev's interrupt registers are checked (since
phy_interrupt() would always return IRQ_HANDLED). This leads to
interrupt storms when it is a secondary device that's actually the
interrupt source.

Signed-off-by: Brad Mouring 
---
 drivers/net/phy/phy.c | 145 +-
 include/linux/phy.h   |   1 -
 2 files changed, 72 insertions(+), 74 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index e3e29c2b028b..dfff2cff7da9 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -617,6 +617,77 @@ static void phy_error(struct phy_device *phydev)
phy_trigger_machine(phydev, false);
 }
 
+/**
+ * phy_disable_interrupts - Disable the PHY interrupts from the PHY side
+ * @phydev: target phy_device struct
+ */
+static int phy_disable_interrupts(struct phy_device *phydev)
+{
+   int err;
+
+   /* Disable PHY interrupts */
+   err = phy_config_interrupt(phydev, PHY_INTERRUPT_DISABLED);
+   if (err)
+   goto phy_err;
+
+   /* Clear the interrupt */
+   err = phy_clear_interrupt(phydev);
+   if (err)
+   goto phy_err;
+
+   return 0;
+
+phy_err:
+   phy_error(phydev);
+
+   return err;
+}
+
+/**
+ * phy_change - Called by the phy_interrupt to handle PHY changes
+ * @phydev: phy_device struct that interrupted
+ */
+static irqreturn_t phy_change(struct phy_device *phydev)
+{
+   if (phy_interrupt_is_valid(phydev)) {
+   if (phydev->drv->did_interrupt &&
+   !phydev->drv->did_interrupt(phydev))
+   return IRQ_NONE;
+
+   if (phydev->state == PHY_HALTED)
+   if (phy_disable_interrupts(phydev))
+   goto phy_err;
+   }
+
+   mutex_lock(&phydev->lock);
+   if ((PHY_RUNNING == phydev->state) || (PHY_NOLINK == phydev->state))
+   phydev->state = PHY_CHANGELINK;
+   mutex_unlock(&phydev->lock);
+
+   /* reschedule state queue work to run as soon as possible */
+   phy_trigger_machine(phydev, true);
+
+   if (phy_interrupt_is_valid(phydev) && phy_clear_interrupt(phydev))
+   goto phy_err;
+   return IRQ_HANDLED;
+
+phy_err:
+   phy_error(phydev);
+   return IRQ_NONE;
+}
+
+/**
+ * phy_change_work - Scheduled by the phy_mac_interrupt to handle PHY changes
+ * @work: work_struct that describes the work to be done
+ */
+void phy_change_work(struct work_struct *work)
+{
+   struct phy_device *phydev =
+   container_of(work, struct phy_device, phy_queue);
+
+   phy_change(phydev);
+}
+
 /**
  * phy_interrupt - PHY interrupt handler
  * @irq: interrupt line
@@ -632,9 +703,7 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
if (PHY_HALTED == phydev->state)
return IRQ_NONE;/* It can't be ours.  */
 
-   phy_change(phydev);
-
-   return IRQ_HANDLED;
+   return phy_change(phydev);
 }
 
 /**
@@ -651,32 +720,6 @@ static int phy_enable_interrupts(struct phy_device *phydev)
return phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED);
 }
 
-/**
- * phy_disable_interrupts - Disable the PHY interrupts from the PHY side
- * @phydev: target phy_device struct
- */
-static int phy_disable_interrupts(struct phy_device *phydev)
-{
-   int err;
-
-   /* Disable PHY interrupts */
-   err = phy_config_interrupt(phydev, PHY_INTERRUPT_DISABLED);
-   if (err)
-   goto phy_err;
-
-   /* Clear the interrupt */
-   err = phy_clear_interrupt(phydev);
-   if (err)
-   goto phy_err;
-
-   return 0;
-
-phy_err:
-   phy_error(phydev);
-
-   return err;
-}
-
 /**
  * phy_start_interrupts - request and enable interrupts for a PHY device
  * @phydev: target phy_device struct
@@ -719,50 +762,6 @@ int phy_stop_interrupts(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(phy_stop_interrupts);
 
-/**
- * phy_change - Called by the phy_interrupt to handle PHY changes
- * @phydev: phy_device struct that interrupted
- */
-void phy_change(struct phy_device *phydev)
-{
-   if (phy_interrupt_is_valid(phydev)) {
-   if (phydev->drv->did_interrupt &&
-   !phydev->drv->did_interrupt(phydev))
-   return;
-
-   if (phydev->state == PHY_HALTED)
-   if (phy_disable_interrupts(phydev))
-   goto phy_err;
-   }
-
-   mutex_lock(&phydev->lock);
-   if ((PHY_RUNNING == phydev->state) || (PHY_NOLINK == phydev->

Re: [PATCH RFC 4/5] tls: RX path for ktls

2018-03-08 Thread Dave Watson
On 03/08/18 09:48 PM, Boris Pismenny wrote:
> Hi Dave,
> 
> On 03/08/18 18:50, Dave Watson wrote:
> > Add rx path for tls software implementation.
> > 
> > recvmsg, splice_read, and poll implemented.
> > 
> > An additional sockopt TLS_RX is added, with the same interface as
> > TLS_TX.  Either TLX_RX or TLX_TX may be provided separately, or
> > together (with two different setsockopt calls with appropriate keys).
> > 
> > Control messages are passed via CMSG in a similar way to transmit.
> > If no cmsg buffer is passed, then only application data records
> > will be passed to userspace, and EIO is returned for other types of
> > alerts.
> > 
> > EBADMSG is passed for decryption errors, and E2BIG is passed for framing
> > errors.  Both are unrecoverable.
> 
> I think E2BIG is for too long argument list. EMSGSIZE might be more
> appropriate.

Sounds good.

> Also, we must check that the record is not too short (cipher specific).
> For TLS1.2 with AES-GCM the minimum length is 8 (IV) + 16 (TAG).
> The correct error for this case is EBADMSG, like a decryption failure.
> 
> Also, how about bad TLS version (e.g. not TLS1.2)?
> A separate error type is required for bad version, because it triggers a
> unique alert in libraries such as OpenSSL.
> I thought of using EINVAL for bad version. What do you think?

Ah, I did not realize there was a separate alert for that, sounds good.

> 
> I wonder if we should provide a more flexible method of obtaining errors for
> the future.
> Maybe use a special CMSG for errors?
> This CMSG will be triggered only after the socket enters the error state.

I'm not opposed to this in principle, but without a concrete use am
hesitant to add it.  I don't know of any other error codes that could
be returned besides the ones discussed above.

> > 
> > +
> > +int tls_sw_recvmsg(struct sock *sk,
> > +  struct msghdr *msg,
> > +  size_t len,
> > +  int nonblock,
> > +  int flags,
> > +  int *addr_len)
> > +{
> > +   struct tls_context *tls_ctx = tls_get_ctx(sk);
> > +   struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
> > +   unsigned char control;
> > +   struct strp_msg *rxm;
> > +   struct sk_buff *skb;
> > +   ssize_t copied = 0;
> > +   bool cmsg = false;
> > +   int err = 0;
> > +   long timeo;
> Maybe try to read from the error queue here?

Sure.



Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Andrew Morton
On Thu, 8 Mar 2018 13:40:45 -0800 Kees Cook  wrote:

> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:
> 
> $ diff -u before.txt after.txt | grep ^-
> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
> variable length array ‘ids’ [-Wvla]
> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
> array ‘namebuf’ [-Wvla]
> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
> [-Wvla]
> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
> ‘buff64’ [-Wvla]
> 
> Based on an earlier patch from Josh Poimboeuf.
> 
> ...
>
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -787,37 +787,57 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
> oops_dump_mode) { }
>   * strict type-checking.. See the
>   * "unnecessary" pointer comparison.
>   */
> -#define __min(t1, t2, min1, min2, x, y) ({   \
> +#define __single_eval_min(t1, t2, min1, min2, x, y) ({   \
>   t1 min1 = (x);  \
>   t2 min2 = (y);  \
>   (void) (&min1 == &min2);\
>   min1 < min2 ? min1 : min2; })
>  
> +/*
> + * In the case of builtin constant values, there is no need to do the
> + * double-evaluation protection, so the raw comparison can be made.
> + * This allows min()/max() to be used in stack array allocations and
> + * avoid the compiler thinking it is a dynamic value leading to an
> + * accidental VLA.
> + */
> +#define __min(t1, t2, x, y)  \
> + __builtin_choose_expr(__builtin_constant_p(x) &&\
> +   __builtin_constant_p(y) &&\
> +   __builtin_types_compatible_p(t1, t2), \
> +   (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
> +   __single_eval_min(t1, t2, \
> + __UNIQUE_ID(max1_), \
> + __UNIQUE_ID(max2_), \
> + x, y))
> +

Holy crap.

I suppose gcc will one day be fixed and we won't need this.

Is there a good reason to convert min()?  Surely nobody will be using
min to dimension an array - always max?  Just for symmetry, I guess.



Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Ian Campbell
On Thu, 2018-03-08 at 13:40 -0800, Kees Cook wrote:
> 
> +#define __min(t1, t2, x, y)  \
> + __builtin_choose_expr(__builtin_constant_p(x) &&\
> +   __builtin_constant_p(y) &&\
> +   __builtin_types_compatible_p(t1, t2), \
> +   (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
> +   __single_eval_min(t1, t2, \
> + __UNIQUE_ID(max1_), \
> + __UNIQUE_ID(max2_), \

min1_ etc?

Ian.


Re: [PATCH 0/3] Remove accidental VLA usage

2018-03-08 Thread Rasmus Villemoes
On 8 March 2018 at 21:39, Kees Cook  wrote:
> On Thu, Mar 8, 2018 at 11:57 AM, Rasmus Villemoes
>  wrote:
>> On 2018-03-08 16:02, Josh Poimboeuf wrote:
>>> On Wed, Mar 07, 2018 at 07:30:44PM -0800, Kees Cook wrote:
>>> +extern long __error_incompatible_types_in_min_macro;
>>> +extern long __error_incompatible_types_in_max_macro;
>>> +
>>> +#define __min(t1, t2, x, y)  \
>>> + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \
>>> +   (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
>>> +   (t1)__error_incompatible_types_in_min_macro)
>>>
>>>  /**
>>>   * min - return minimum of two values of the same or compatible types
>>>   * @x: first value
>>>   * @y: second value
>>>   */
>>> -#define min(x, y)\
>>> - __min(typeof(x), typeof(y), \
>>> -   __UNIQUE_ID(min1_), __UNIQUE_ID(min2_),   \
>>> -   x, y)
>>> +#define min(x, y) __min(typeof(x), typeof(y), x, y)  \
>>>
>>
>> But this introduces the the-chosen-one-of-x-and-y-gets-evaluated-twice
>> problem. Maybe we don't care? But until we get a
>> __builtin_assert_this_has_no_side_effects() I think that's a little
>> dangerous.
>
> Eek, yes, we can't do the double-eval. The proposed change breaks
> things badly. :)
>
> a:   20
> b:   40
> max(a++, b++): 40
> a:   21
> b:   41
>
> a:   20
> b:   40
> new_max(a++, b++): 41
> a:   21
> b:   42
>
> However, this works for me:
>
> #define __new_max(t1, t2, max1, max2, x, y)\
>__builtin_choose_expr(__builtin_constant_p(x) && \
>  __builtin_constant_p(y) && \
>  __builtin_types_compatible_p(t1, t2), \
>  (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
>  __max(t1, t2, max1, max2, x, y))
>
> #define new_max(x, y) \
> __new_max(typeof(x), typeof(y), \
>   __UNIQUE_ID(max1_), __UNIQUE_ID(max2_),   \
>   x, y)

Yes, that would seem to do the trick.

Thinking out loud: do we really want or need the
__builtin_types_compatible condition when x and y are compile-time
constants? I think it would be nice to be able to use max(16,
sizeof(bla)) without having to cast either the literal or the sizeof.
Just omitting the type compatibility check might be dangerous, but
perhaps it could be relaxed to a check that both values are
representable in their common promoted type. Something like

(type_signed(t1) == type_signed(t2)) || ((t1)x >= 0 && (t2)y >= 0)

should be safe (if the types have same signedness, or the value of
signed type is positive), though it doesn't allow a few corner cases
(e.g. short vs. unsigned char is always ok due to promotion to int,
and also if the signed type is strictly wider than the unsigned type).

Rasmus


[PATCH bpf-next 2/7] tools: bpf: respect output directory during build

2018-03-08 Thread Jiri Benc
Currently, the programs under tools/bpf (with the notable exception of
bpftool) do not respect the output directory (make O=dir). Fix that.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 38 ++
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index c8ec0ae16bf0..e7b15967492e 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -1,4 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
+include ../scripts/Makefile.include
+
 prefix = /usr
 
 CC = gcc
@@ -7,7 +9,7 @@ YACC = bison
 MAKE = make
 
 CFLAGS += -Wall -O2
-CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include
+CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/include/uapi -I$(srctree)/include
 
 ifeq ($(srctree),)
 srctree := $(patsubst %/,%,$(dir $(CURDIR)))
@@ -38,32 +40,36 @@ ifeq ($(feature-disassembler-four-args), 1)
 CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
 endif
 
-%.yacc.c: %.y
+$(OUTPUT)%.yacc.c: $(srctree)/tools/bpf/%.y
$(YACC) -o $@ -d $<
 
-%.lex.c: %.l
+$(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l
$(LEX) -o $@ $<
 
-all: bpf_jit_disasm bpf_dbg bpf_asm bpftool
+$(OUTPUT)%.o: $(srctree)/tools/bpf/%.c
+   $(COMPILE.c) -o $@ $<
+
+all: $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm bpftool
 
-bpf_jit_disasm : CFLAGS += -DPACKAGE='bpf_jit_disasm'
-bpf_jit_disasm : LDLIBS = -lopcodes -lbfd -ldl
-bpf_jit_disasm : bpf_jit_disasm.o
+$(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm'
+$(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl
+$(OUTPUT)bpf_jit_disasm: $(OUTPUT)bpf_jit_disasm.o
 
-bpf_dbg : LDLIBS = -lreadline
-bpf_dbg : bpf_dbg.o
+$(OUTPUT)bpf_dbg: LDLIBS = -lreadline
+$(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o
 
-bpf_asm : LDLIBS =
-bpf_asm : bpf_asm.o bpf_exp.yacc.o bpf_exp.lex.o
-bpf_exp.lex.o : bpf_exp.yacc.c
+$(OUTPUT)bpf_asm: LDLIBS =
+$(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o 
$(OUTPUT)bpf_exp.lex.o
+$(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c
 
 clean: bpftool_clean
-   rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.*
+   rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \
+  $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.*
 
 install: bpftool_install
-   install bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
-   install bpf_dbg $(prefix)/bin/bpf_dbg
-   install bpf_asm $(prefix)/bin/bpf_asm
+   install $(OUTPUT)bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
+   install $(OUTPUT)bpf_dbg $(prefix)/bin/bpf_dbg
+   install $(OUTPUT)bpf_asm $(prefix)/bin/bpf_asm
 
 bpftool:
$(MAKE) -C bpftool
-- 
1.8.3.1



[PATCH bpf-next 4/7] tools: bpf: make install should build first

2018-03-08 Thread Jiri Benc
Make the 'install' target depend on the 'all' target to build the binaries
first.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index c42ca24a072d..e8d9e4125bf3 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -50,7 +50,9 @@ $(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l
 $(OUTPUT)%.o: $(srctree)/tools/bpf/%.c
$(COMPILE.c) -o $@ $<
 
-all: $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm bpftool
+PROGS = $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm
+
+all: $(PROGS) bpftool
 
 $(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm'
 $(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl
@@ -67,7 +69,7 @@ clean: bpftool_clean
rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \
   $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.*
 
-install: bpftool_install
+install: $(PROGS) bpftool_install
$(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin
$(INSTALL) $(OUTPUT)bpf_jit_disasm 
$(DESTDIR)$(prefix)/bin/bpf_jit_disasm
$(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg
-- 
1.8.3.1



[PATCH bpf-next 0/7] tools: bpf: standardize make

2018-03-08 Thread Jiri Benc
Currently, 'make bpf' in the tools/ directory does not provide the standard
quiet output except for bpftool (which is however listed with a wrong
directory). Worse, it does not respect the build output directory.

The 'make bpf_install' does not work as one would expect, either. It
installs unconditionally to /usr/bin without respecting DESTDIR and prefix.

This patchset improves that behavior.

Jiri Benc (7):
  tools: bpftool: silence 'missing initializer' warnings
  tools: bpf: respect output directory during build
  tools: bpf: consistent make bpf_install
  tools: bpf: make install should build first
  tools: bpf: call descend in Makefile
  tools: bpf: respect quiet/verbose build
  tools: bpf: silence make by not deleting intermediate file

 tools/bpf/Makefile | 74 +++---
 tools/bpf/bpftool/Makefile |  2 +-
 2 files changed, 51 insertions(+), 25 deletions(-)

-- 
1.8.3.1



[PATCH bpf-next 6/7] tools: bpf: respect quiet/verbose build

2018-03-08 Thread Jiri Benc
Default to quiet build, with V=1 enabling verbose build as is usual.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 38 +++---
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index daca0a4277d1..757ea22c428a 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -17,6 +17,12 @@ srctree := $(patsubst %/,%,$(dir $(CURDIR)))
 srctree := $(patsubst %/,%,$(dir $(srctree)))
 endif
 
+ifeq ($(V),1)
+  Q =
+else
+  Q = @
+endif
+
 FEATURE_USER = .bpf
 FEATURE_TESTS = libbfd disassembler-four-args
 FEATURE_DISPLAY = libbfd disassembler-four-args
@@ -42,38 +48,48 @@ CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
 endif
 
 $(OUTPUT)%.yacc.c: $(srctree)/tools/bpf/%.y
-   $(YACC) -o $@ -d $<
+   $(QUIET_BISON)$(YACC) -o $@ -d $<
 
 $(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l
-   $(LEX) -o $@ $<
+   $(QUIET_FLEX)$(LEX) -o $@ $<
 
 $(OUTPUT)%.o: $(srctree)/tools/bpf/%.c
-   $(COMPILE.c) -o $@ $<
+   $(QUIET_CC)$(COMPILE.c) -o $@ $<
+
+$(OUTPUT)%.yacc.o: $(OUTPUT)%.yacc.c
+   $(QUIET_CC)$(COMPILE.c) -o $@ $<
+$(OUTPUT)%.lex.o: $(OUTPUT)%.lex.c
+   $(QUIET_CC)$(COMPILE.c) -o $@ $<
 
 PROGS = $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm
 
 all: $(PROGS) bpftool
 
 $(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm'
-$(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl
 $(OUTPUT)bpf_jit_disasm: $(OUTPUT)bpf_jit_disasm.o
+   $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ -lopcodes -lbfd -ldl
 
-$(OUTPUT)bpf_dbg: LDLIBS = -lreadline
 $(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o
+   $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ -lreadline
 
-$(OUTPUT)bpf_asm: LDLIBS =
 $(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o 
$(OUTPUT)bpf_exp.lex.o
+   $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^
+
 $(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c
 
 clean: bpftool_clean
-   rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \
+   $(call QUIET_CLEAN, bpf-progs)
+   $(Q)rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \
   $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.*
 
 install: $(PROGS) bpftool_install
-   $(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin
-   $(INSTALL) $(OUTPUT)bpf_jit_disasm 
$(DESTDIR)$(prefix)/bin/bpf_jit_disasm
-   $(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg
-   $(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm
+   $(call QUIET_INSTALL, bpf_jit_disasm)
+   $(Q)$(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin
+   $(Q)$(INSTALL) $(OUTPUT)bpf_jit_disasm 
$(DESTDIR)$(prefix)/bin/bpf_jit_disasm
+   $(call QUIET_INSTALL, bpf_dbg)
+   $(Q)$(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg
+   $(call QUIET_INSTALL, bpf_asm)
+   $(Q)$(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm
 
 bpftool:
$(call descend,bpftool)
-- 
1.8.3.1



[PATCH bpf-next 7/7] tools: bpf: silence make by not deleting intermediate file

2018-03-08 Thread Jiri Benc
Even in quiet mode, make finishes with

rm tools/bpf/bpf_exp.lex.c

That's because it considers the file to be intermediate. Silence that by
mentioning the lex.c file instead of the lex.o file; the dependency still
stays.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index 757ea22c428a..c07b4e718eeb 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -75,7 +75,7 @@ $(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o
 $(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o 
$(OUTPUT)bpf_exp.lex.o
$(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^
 
-$(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c
+$(OUTPUT)bpf_exp.lex.c: $(OUTPUT)bpf_exp.yacc.c
 
 clean: bpftool_clean
$(call QUIET_CLEAN, bpf-progs)
-- 
1.8.3.1



[PATCH bpf-next 3/7] tools: bpf: consistent make bpf_install

2018-03-08 Thread Jiri Benc
Currently, make bpf_install in tools/ does not respect DESTDIR. Moreover, it
installs to /usr/bin/ unconditionally.

Let it respect DESTDIR and allow prefix to be specified. Also, to be more
consistent with bpftool and with the usual customs, default the prefix to
/usr/local instead of /usr.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index e7b15967492e..c42ca24a072d 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -1,12 +1,13 @@
 # SPDX-License-Identifier: GPL-2.0
 include ../scripts/Makefile.include
 
-prefix = /usr
+prefix ?= /usr/local
 
 CC = gcc
 LEX = flex
 YACC = bison
 MAKE = make
+INSTALL ?= install
 
 CFLAGS += -Wall -O2
 CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/include/uapi -I$(srctree)/include
@@ -67,9 +68,10 @@ clean: bpftool_clean
   $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.*
 
 install: bpftool_install
-   install $(OUTPUT)bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
-   install $(OUTPUT)bpf_dbg $(prefix)/bin/bpf_dbg
-   install $(OUTPUT)bpf_asm $(prefix)/bin/bpf_asm
+   $(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin
+   $(INSTALL) $(OUTPUT)bpf_jit_disasm 
$(DESTDIR)$(prefix)/bin/bpf_jit_disasm
+   $(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg
+   $(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm
 
 bpftool:
$(MAKE) -C bpftool
-- 
1.8.3.1



[PATCH bpf-next 1/7] tools: bpftool: silence 'missing initializer' warnings

2018-03-08 Thread Jiri Benc
When building bpf tool, gcc emits piles of warnings:

prog.c: In function ‘prog_fd_by_tag’:
prog.c:101:9: warning: missing initializer for field ‘type’ of ‘struct 
bpf_prog_info’ [-Wmissing-field-initializers]
  struct bpf_prog_info info = {};
 ^
In file included from /home/storage/jbenc/git/net-next/tools/lib/bpf/bpf.h:26:0,
 from prog.c:47:
/home/storage/jbenc/git/net-next/tools/include/uapi/linux/bpf.h:925:8: note: 
‘type’ declared here
  __u32 type;
^

As these warnings are not useful, switch them off.

Signed-off-by: Jiri Benc 
---
 tools/bpf/bpftool/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index 26901ec87361..4c2867481f5c 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -38,7 +38,7 @@ bash_compdir ?= /usr/share/bash-completion/completions
 CC = gcc
 
 CFLAGS += -O2
-CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow
+CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow 
-Wno-missing-field-initializers
 CFLAGS += -DPACKAGE='"bpftool"' -D__EXPORTED_HEADERS__ 
-I$(srctree)/tools/include/uapi -I$(srctree)/tools/include 
-I$(srctree)/tools/lib/bpf -I$(srctree)/kernel/bpf/
 CFLAGS += -DBPFTOOL_VERSION='"$(BPFTOOL_VERSION)"'
 LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
-- 
1.8.3.1



Re: [1/2] net: macb: Add phy-handle DT support

2018-03-08 Thread Brad Mouring
On Thu, Mar 08, 2018 at 06:32:47PM +0100, Andrew Lunn wrote:
> On Wed, Mar 07, 2018 at 04:42:56PM -0600, Brad Mouring wrote:
> > This optional binding (as described in the ethernet DT bindings doc)
> > directs the netdev to the phydev to use. This is useful for a phy
> > chip that has >1 phy in it, and two netdevs are using the same phy
> > chip (i.e. the second mac's phy lives on the first mac's MDIO bus)
> > ...
> Hi Brad
> 
> I think it is more logical to do this in macb_mii_probe().
> 
> I would probably also move the fixed_link code from macb_mii_init() to
> macb_mii_probe(). I would probably also move the fallback to standard
> phy registration. Make macb_mii_init() about registering the MDIO bus,
> and macb_mii_probe() about probing the MDIO bus to find the PHY to
> use. At the moment, it is all rather mixed up.
> 
>  Andrew

Hi Andrew

That makes sense, I'll rework and resend. Thanks for the suggestion.

Brad


[PATCH bpf-next 5/7] tools: bpf: call descend in Makefile

2018-03-08 Thread Jiri Benc
Use the descend macro to properly propagate $(subdir) to bpftool.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index e8d9e4125bf3..daca0a4277d1 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -76,12 +76,12 @@ install: $(PROGS) bpftool_install
$(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm
 
 bpftool:
-   $(MAKE) -C bpftool
+   $(call descend,bpftool)
 
 bpftool_install:
-   $(MAKE) -C bpftool install
+   $(call descend,bpftool,install)
 
 bpftool_clean:
-   $(MAKE) -C bpftool clean
+   $(call descend,bpftool,clean)
 
 .PHONY: bpftool FORCE
-- 
1.8.3.1



[PATCH v3 net-next 3/4] net sched actions: calculate add/delete event message size

2018-03-08 Thread Roman Mashak
Introduce routines to calculate size of the shared tc netlink attributes
and the full message size including netlink header and tc service header.

Update add/delete action logic to have the size for event messages,
the size is passed to tcf_add_notify() and tcf_del_notify() where the
notification message is being allocated and constructed.

Signed-off-by: Roman Mashak 
---
 net/sched/act_api.c | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 3de0e0610200..57cf37145282 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -109,6 +109,42 @@ int __tcf_idr_release(struct tc_action *p, bool bind, bool 
strict)
 }
 EXPORT_SYMBOL(__tcf_idr_release);
 
+static size_t tcf_action_shared_attrs_size(const struct tc_action *act)
+{
+   u32 cookie_len = 0;
+
+   if (act->act_cookie)
+   cookie_len = nla_total_size(act->act_cookie->len);
+
+   return  nla_total_size(0) /* action number nested */
+   + nla_total_size(IFNAMSIZ) /* TCA_ACT_KIND */
+   + cookie_len /* TCA_ACT_COOKIE */
+   + nla_total_size(0) /* TCA_ACT_STATS nested */
+   /* TCA_STATS_BASIC */
+   + nla_total_size_64bit(sizeof(struct gnet_stats_basic))
+   /* TCA_STATS_QUEUE */
+   + nla_total_size_64bit(sizeof(struct gnet_stats_queue))
+   + nla_total_size(0) /* TCA_OPTIONS nested */
+   + nla_total_size(sizeof(struct tcf_t)); /* TCA_GACT_TM */
+}
+
+static size_t tcf_action_full_attrs_size(size_t sz)
+{
+   return NLMSG_HDRLEN /* struct nlmsghdr */
+   + sizeof(struct tcamsg)
+   + nla_total_size(0) /* TCA_ACT_TAB nested */
+   + sz;
+}
+
+static size_t tcf_action_fill_size(const struct tc_action *act)
+{
+   size_t sz = tcf_action_shared_attrs_size(act);
+
+   if (act->ops->get_fill_size)
+   return act->ops->get_fill_size(act) + sz;
+   return sz;
+}
+
 static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
   struct netlink_callback *cb)
 {
@@ -746,6 +782,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, 
struct nlattr *nla,
 {
struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
struct tc_action *act;
+   size_t sz = 0;
int err;
int i;
 
@@ -761,11 +798,14 @@ int tcf_action_init(struct net *net, struct tcf_proto 
*tp, struct nlattr *nla,
goto err;
}
act->order = i;
+   sz += tcf_action_fill_size(act);
if (ovr)
act->tcfa_refcnt++;
list_add_tail(&act->list, actions);
}
 
+   *attr_size = tcf_action_full_attrs_size(sz);
+
/* Remove the temp refcnt which was necessary to protect against
 * destroying an existing action which was being replaced
 */
@@ -1056,9 +1096,12 @@ tca_action_gd(struct net *net, struct nlattr *nla, 
struct nlmsghdr *n,
goto err;
}
act->order = i;
+   attr_size += tcf_action_fill_size(act);
list_add_tail(&act->list, &actions);
}
 
+   attr_size = tcf_action_full_attrs_size(attr_size);
+
if (event == RTM_GETACTION)
ret = tcf_get_notify(net, portid, n, &actions, event, extack);
else { /* delete */
-- 
2.7.4



[PATCH v3 net-next 4/4] net sched actions: implement get_fill_size routine in act_gact

2018-03-08 Thread Roman Mashak
Signed-off-by: Roman Mashak 
---
 net/sched/act_gact.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index 74563254e676..88fbb8403565 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -217,6 +217,19 @@ static int tcf_gact_search(struct net *net, struct 
tc_action **a, u32 index,
return tcf_idr_search(tn, a, index);
 }
 
+static size_t tcf_gact_get_fill_size(const struct tc_action *act)
+{
+   size_t sz = nla_total_size(sizeof(struct tc_gact)); /* TCA_GACT_PARMS */
+
+#ifdef CONFIG_GACT_PROB
+   if (to_gact(act)->tcfg_ptype)
+   /* TCA_GACT_PROB */
+   sz += nla_total_size(sizeof(struct tc_gact_p));
+#endif
+
+   return sz;
+}
+
 static struct tc_action_ops act_gact_ops = {
.kind   =   "gact",
.type   =   TCA_ACT_GACT,
@@ -227,6 +240,7 @@ static struct tc_action_ops act_gact_ops = {
.init   =   tcf_gact_init,
.walk   =   tcf_gact_walker,
.lookup =   tcf_gact_search,
+   .get_fill_size  =   tcf_gact_get_fill_size,
.size   =   sizeof(struct tcf_gact),
 };
 
-- 
2.7.4



  1   2   3   >