[PATCH bpf-next] bpf: comment why dots in filenames under BPF virtual FS are not allowed
From: Quentin Monnet When pinning a file under the BPF virtual file system (traditionally /sys/fs/bpf), using a dot in the name of the location to pin at is not allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be rejected with -EPERM. This check was introduced at the same time as the BPF file system itself, with commit b2197755b263 ("bpf: add support for persistent maps/progs"). At this time, it was checked in a function called "bpf_dname_reserved()", which made clear that using a dot was reserved for future extensions. This function disappeared and the check was moved elsewhere with commit 0c93b7d85d40 ("bpf: reject invalid names right in ->lookup()"), and the meaning of the dot ban was lost. The present commit simply adds a comment in the source to explain to the reader that the usage of dots is reserved for future usage. Signed-off-by: Quentin Monnet --- kernel/bpf/inode.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 81e2f6995adb..bf6da59ae0d0 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -178,6 +178,9 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg) static struct dentry * bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags) { + /* Dots in names (e.g. "/sys/fs/bpf/foo.bar") are reserved for future +* extensions. +*/ if (strchr(dentry->d_name.name, '.')) return ERR_PTR(-EPERM); -- 2.15.1
Re: [PATCH bpf-next 1/7] tools: bpftool: silence 'missing initializer' warnings
On Thu, 8 Mar 2018 23:00:35 +0100, Jiri Benc wrote: > When building bpf tool, gcc emits piles of warnings: > > prog.c: In function ‘prog_fd_by_tag’: > prog.c:101:9: warning: missing initializer for field ‘type’ of ‘struct > bpf_prog_info’ [-Wmissing-field-initializers] > struct bpf_prog_info info = {}; > ^ > In file included from > /home/storage/jbenc/git/net-next/tools/lib/bpf/bpf.h:26:0, > from prog.c:47: > /home/storage/jbenc/git/net-next/tools/include/uapi/linux/bpf.h:925:8: note: > ‘type’ declared here > __u32 type; > ^ > > As these warnings are not useful, switch them off. > > Signed-off-by: Jiri Benc FWIW I couldn't reproduce this one. Out of curiosity what GCC did you use?
Re: [PATCH bpf-next 0/7] tools: bpf: standardize make
On Thu, 8 Mar 2018 21:10:59 -0800, Alexei Starovoitov wrote: > On Thu, Mar 8, 2018 at 2:00 PM, Jiri Benc wrote: > > Currently, 'make bpf' in the tools/ directory does not provide the standard > > quiet output except for bpftool (which is however listed with a wrong > > directory). Worse, it does not respect the build output directory. > > > > The 'make bpf_install' does not work as one would expect, either. It > > installs unconditionally to /usr/bin without respecting DESTDIR and prefix. > > > > This patchset improves that behavior. > > Jakub, > please review this set. > > Thanks! I stared at it for a while, and it looks good! Reviewed-by: Jakub Kicinski In addition I found a GCC 7 warning in bpftool, .PHONY lacks some targets and we don't clean up after feature detection. I'll send patches for that unless someone beats me to it.
[PATCH net-next] cxgb4: increase max tx rate limit to 100 Gbps
T6 cards can support up to 100 G speeds. So, increase max programmable tx rate limit to 100 Gbps. Signed-off-by: Ganesh Goudar --- drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 6 +++--- drivers/net/ethernet/chelsio/cxgb4/sched.h | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c index 7b452e8..1b44652 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c @@ -2870,11 +2870,11 @@ static int cxgb_set_tx_maxrate(struct net_device *dev, int index, u32 rate) /* Convert from Mbps to Kbps */ req_rate = rate << 10; - /* Max rate is 10 Gbps */ + /* Max rate is 100 Gbps */ if (req_rate >= SCHED_MAX_RATE_KBPS) { dev_err(adap->pdev_dev, - "Invalid rate %u Mbps, Max rate is %u Gbps\n", - rate, SCHED_MAX_RATE_KBPS); + "Invalid rate %u Mbps, Max rate is %u Mbps\n", + rate, SCHED_MAX_RATE_KBPS >> 10); return -ERANGE; } diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.h b/drivers/net/ethernet/chelsio/cxgb4/sched.h index 77b2b3f..3a49e00 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/sched.h +++ b/drivers/net/ethernet/chelsio/cxgb4/sched.h @@ -42,8 +42,8 @@ #define FW_SCHED_CLS_NONE 0x -/* Max rate that can be set to a scheduling class is 10 Gbps */ -#define SCHED_MAX_RATE_KBPS 1000U +/* Max rate that can be set to a scheduling class is 100 Gbps */ +#define SCHED_MAX_RATE_KBPS 1U enum { SCHED_STATE_ACTIVE, -- 2.1.0
Re: [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
On 2018年03月08日 21:07, Jesper Dangaard Brouer wrote: Introduce an xdp_return_frame API, and convert over cpumap as the first user, given it have queued XDP frame structure to leverage. Signed-off-by: Jesper Dangaard Brouer --- include/net/xdp.h | 32 +++ kernel/bpf/cpumap.c | 60 +++ net/core/xdp.c | 18 +++ 3 files changed, 86 insertions(+), 24 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index b2362ddfa694..3cb726a6dc5b 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -33,16 +33,48 @@ * also mandatory during RX-ring setup. */ +enum mem_type { + MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */ + MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */ + // Possible new ideas for types: + // MEM_TYPE_PAGE_POOL,/* Will be added later */ + // MEM_TYPE_AF_XDP, + // MEM_TYPE_DEVICE_OWNED -- invoking an dev->ndo? + MEM_TYPE_MAX, +}; So if we plan to support dev->ndo, it looks to me two types AF_XDP and DEVICE_OWNED are sufficient? Driver can do what it wants (e.g page pool or ordinary page allocator) in ndo or what ever other callbacks. + +struct xdp_mem_info { + u32 type; /* enum mem_type, but known size type */ + u32 id; // Needed later (to lookup struct xdp_rxq_info) +}; + struct xdp_rxq_info { struct net_device *dev; u32 queue_index; u32 reg_state; + struct xdp_mem_info mem; } cacheline_aligned; /* perf critical, avoid false-sharing */ + +static inline +void xdp_return_frame(void *data, struct xdp_mem_info *mem) +{ + if (mem->type == MEM_TYPE_PAGE_SHARED) + page_frag_free(data); + + if (mem->type == MEM_TYPE_PAGE_ORDER0) { + struct page *page = virt_to_page(data); /* Assumes order0 page*/ + + put_page(page); + } +} + int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq, struct net_device *dev, u32 queue_index); void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq); void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq); bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq); +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, + enum mem_type type, void *allocator); #endif /* __LINUX_NET_XDP_H__ */ diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index a4bb0b34375a..3e4bbcbe3e86 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -137,27 +138,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) return ERR_PTR(err); } -static void __cpu_map_queue_destructor(void *ptr) -{ - /* The tear-down procedure should have made sure that queue is -* empty. See __cpu_map_entry_replace() and work-queue -* invoked cpu_map_kthread_stop(). Catch any broken behaviour -* gracefully and warn once. -*/ - if (WARN_ON_ONCE(ptr)) - page_frag_free(ptr); -} - -static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) -{ - if (atomic_dec_and_test(&rcpu->refcnt)) { - /* The queue should be empty at this point */ - ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor); - kfree(rcpu->queue); - kfree(rcpu); - } -} - static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) { atomic_inc(&rcpu->refcnt); @@ -188,6 +168,10 @@ struct xdp_pkt { u16 len; u16 headroom; u16 metasize; + /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time, +* while mem info is valid on remote CPU. +*/ Can we simply move the xdp_mem_info to xdp_buff to avoid conversion? Thanks + struct xdp_mem_info mem; struct net_device *dev_rx; }; @@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp) xdp_pkt->headroom = headroom - sizeof(*xdp_pkt); xdp_pkt->metasize = metasize; + /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */ + xdp_pkt->mem = xdp->rxq->mem; + return xdp_pkt; } @@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, return skb; } +static void __cpu_map_ring_cleanup(struct ptr_ring *ring) +{ + /* The tear-down procedure should have made sure that queue is +* empty. See __cpu_map_entry_replace() and work-queue +* invoked cpu_map_kthread_stop(). Catch any broken behaviour +* gracefully and warn once. +*/ + struct xdp_pkt *xdp_pkt; + + while ((xdp_pkt = ptr_ring_consume(ring))) + if (WARN_ON_ONCE(xdp_pkt)) + xdp_return_frame(xdp_pkt, &xdp_pkt->mem); +} + +static void put_cpu_map_entry(struct bpf
Re: [bpf-next V2 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API
On 2018年03月08日 23:16, Jesper Dangaard Brouer wrote: Hi Jason, Please see below FIXME, which is actually a question to you. On Thu, 08 Mar 2018 14:08:11 +0100 Jesper Dangaard Brouer wrote: diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 475088f947bb..cd046cf31b77 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c [...] @@ -1290,17 +1290,18 @@ static const struct net_device_ops tun_netdev_ops = { static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) { struct tun_struct *tun = netdev_priv(dev); - struct xdp_buff *buff = xdp->data_hard_start; - int headroom = xdp->data - xdp->data_hard_start; + struct xdp_frame *frame; struct tun_file *tfile; u32 numqueues; int ret = 0; - /* Assure headroom is available and buff is properly aligned */ - if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp))) - return -ENOSPC; + /* FIXME: Explain why this check is the needed! */ + if (unlikely(tun_is_xdp_frame(xdp))) + return -EBADRQC; - *buff = *xdp; + frame = convert_to_xdp_frame(xdp); + if (unlikely(!frame)) + return -EOVERFLOW; To Jason, in the FIXME, I'm inheriting a check you put in, but I don't understand why this check was needed? Sorry for the late reply. I think it was used to make sure to not use misaligned or invalid pointer that caller passed to us. Thanks
[PATCH net 1/3] vhost_net: initialize rx_ring in vhost_net_open()
From: Alexander Potapenko KMSAN reported a use of uninit memory in vhost_net_buf_unproduce() while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring: == BUG: KMSAN: use of uninitialized memory in vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vho et.c:170 CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170 vhost_net_stop_vq drivers/vhost/net.c:974 [inline] vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982 vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015 __fput+0x49f/0xa00 fs/file_table.c:209 fput+0x37/0x40 fs/file_table.c:243 task_work_run+0x243/0x2c0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:191 [inline] exit_to_usermode_loop arch/x86/entry/common.c:166 [inline] prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196 syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265 do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292 ... origin: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213 kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392 kmalloc_large_node_hook mm/slub.c:1366 [inline] kmalloc_large_node mm/slub.c:3808 [inline] __kmalloc_node+0x100e/0x1290 mm/slub.c:3818 kmalloc_node include/linux/slab.h:554 [inline] kvmalloc_node+0x1a5/0x2e0 mm/util.c:419 kvmalloc include/linux/mm.h:541 [inline] vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921 misc_open+0x7b5/0x8b0 drivers/char/misc.c:154 chrdev_open+0xc28/0xd90 fs/char_dev.c:417 do_dentry_open+0xccb/0x1430 fs/open.c:752 vfs_open+0x272/0x2e0 fs/open.c:866 do_last fs/namei.c:3378 [inline] path_openat+0x49ad/0x6580 fs/namei.c:3519 do_filp_open+0x267/0x640 fs/namei.c:3553 do_sys_open+0x6ad/0x9c0 fs/open.c:1059 SYSC_openat+0xc7/0xe0 fs/open.c:1086 SyS_openat+0x63/0x90 fs/open.c:1080 do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287 == Fixes: c67df11f6e480 ("vhost_net: try batch dequing from skb array") Signed-off-by: Alexander Potapenko Signed-off-by: Jason Wang --- drivers/vhost/net.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 610cba2..60f1080 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -948,6 +948,7 @@ static int vhost_net_open(struct inode *inode, struct file *f) n->vqs[i].done_idx = 0; n->vqs[i].vhost_hlen = 0; n->vqs[i].sock_hlen = 0; + n->vqs[i].rx_ring = NULL; vhost_net_buf_init(&n->vqs[i].rxq); } vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX); -- 2.7.4
[PATCH net 0/3] Several fixes for vhost_net ptr_ring usage
Hi: This small series try to fix several bugs of ptr_ring usage in vhost_net. Please review. Thanks Alexander Potapenko (1): vhost_net: initialize rx_ring in vhost_net_open() Jason Wang (2): vhost_net: keep private_data and rx_ring synced vhost_net: examine pointer types during un-producing drivers/net/tun.c | 3 ++- drivers/vhost/net.c| 8 +--- include/linux/if_tun.h | 4 3 files changed, 11 insertions(+), 4 deletions(-) -- 2.7.4
[PATCH net 3/3] vhost_net: examine pointer types during un-producing
After commit fc72d1d54dd9 ("tuntap: XDP transmission"), we can actually queueing XDP pointers in the pointer ring, so we should examine the pointer type before freeing the pointer. Fixes: fc72d1d54dd9 ("tuntap: XDP transmission") Reported-by: Michael S. Tsirkin Acked-by: Michael S. Tsirkin Signed-off-by: Jason Wang --- drivers/net/tun.c | 3 ++- drivers/vhost/net.c| 2 +- include/linux/if_tun.h | 4 3 files changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 7433bb2..28cfa64 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -655,7 +655,7 @@ static struct tun_struct *tun_enable_queue(struct tun_file *tfile) return tun; } -static void tun_ptr_free(void *ptr) +void tun_ptr_free(void *ptr) { if (!ptr) return; @@ -667,6 +667,7 @@ static void tun_ptr_free(void *ptr) __skb_array_destroy_skb(ptr); } } +EXPORT_SYMBOL_GPL(tun_ptr_free); static void tun_queue_purge(struct tun_file *tfile) { diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index efb9306..8139bc7 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -170,7 +170,7 @@ static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq) if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) { ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head, vhost_net_buf_get_size(rxq), - __skb_array_destroy_skb); + tun_ptr_free); rxq->head = rxq->tail = 0; } } diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h index c5b0a75..fd00170 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -25,6 +25,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file); bool tun_is_xdp_buff(void *ptr); void *tun_xdp_to_ptr(void *ptr); void *tun_ptr_to_xdp(void *ptr); +void tun_ptr_free(void *ptr); #else #include #include @@ -50,5 +51,8 @@ static inline void *tun_ptr_to_xdp(void *ptr) { return NULL; } +static inline void tun_ptr_free(void *ptr) +{ +} #endif /* CONFIG_TUN */ #endif /* __IF_TUN_H */ -- 2.7.4
[PATCH net 2/3] vhost_net: keep private_data and rx_ring synced
We get pointer ring from the exported sock, this means we should keep rx_ring and vq->private synced during both vq stop and backend set, otherwise we may see stale rx_ring. Fixes: c67df11f6e480 ("vhost_net: try batch dequing from skb array") Signed-off-by: Michael S. Tsirkin Signed-off-by: Jason Wang --- drivers/vhost/net.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 60f1080..efb9306 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -973,6 +973,7 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n, vhost_net_disable_vq(n, vq); vq->private_data = NULL; vhost_net_buf_unproduce(nvq); + nvq->rx_ring = NULL; mutex_unlock(&vq->mutex); return sock; } @@ -1162,14 +1163,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd) vhost_net_disable_vq(n, vq); vq->private_data = sock; vhost_net_buf_unproduce(nvq); - if (index == VHOST_NET_VQ_RX) - nvq->rx_ring = get_tap_ptr_ring(fd); r = vhost_vq_init_access(vq); if (r) goto err_used; r = vhost_net_enable_vq(n, vq); if (r) goto err_used; + if (index == VHOST_NET_VQ_RX) + nvq->rx_ring = get_tap_ptr_ring(fd); oldubufs = nvq->ubufs; nvq->ubufs = ubufs; -- 2.7.4
[PATCH] rsi: Remove stack VLA usage
The kernel would like to have all stack VLA usage removed[1]. rsi uses a VLA based on 'blksize'. Elsewhere in the SDIO code maximum block size is defined using a magic number. We can use a pre-processor defined constant and declare the array to maximum size. We add a check before accessing the array in case of programmer error. [1]: https://lkml.org/lkml/2018/3/7/621 Signed-off-by: Tobin C. Harding --- drivers/net/wireless/rsi/rsi_91x_hal.c | 13 +++-- drivers/net/wireless/rsi/rsi_91x_sdio.c | 9 +++-- 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/drivers/net/wireless/rsi/rsi_91x_hal.c b/drivers/net/wireless/rsi/rsi_91x_hal.c index 1176de646942..839ebdd602df 100644 --- a/drivers/net/wireless/rsi/rsi_91x_hal.c +++ b/drivers/net/wireless/rsi/rsi_91x_hal.c @@ -641,7 +641,7 @@ static int ping_pong_write(struct rsi_hw *adapter, u8 cmd, u8 *addr, u32 size) u32 cmd_addr; u16 cmd_resp, cmd_req; u8 *str; - int status; + int status, ret; if (cmd == PING_WRITE) { cmd_addr = PING_BUFFER_ADDRESS; @@ -655,12 +655,13 @@ static int ping_pong_write(struct rsi_hw *adapter, u8 cmd, u8 *addr, u32 size) str = "PONG_VALID"; } - status = hif_ops->load_data_master_write(adapter, cmd_addr, size, + ret = hif_ops->load_data_master_write(adapter, cmd_addr, size, block_size, addr); - if (status) { - rsi_dbg(ERR_ZONE, "%s: Unable to write blk at addr %0x\n", - __func__, *addr); - return status; + if (ret) { + if (ret != -EINVAL) + rsi_dbg(ERR_ZONE, "%s: Unable to write blk at addr %0x\n", + __func__, *addr); + return ret; } status = bl_cmd(adapter, cmd_req, cmd_resp, str); diff --git a/drivers/net/wireless/rsi/rsi_91x_sdio.c b/drivers/net/wireless/rsi/rsi_91x_sdio.c index b0cf41195051..b766578b591a 100644 --- a/drivers/net/wireless/rsi/rsi_91x_sdio.c +++ b/drivers/net/wireless/rsi/rsi_91x_sdio.c @@ -20,6 +20,8 @@ #include "rsi_common.h" #include "rsi_hal.h" +#define RSI_MAX_BLOCK_SIZE 256 + /** * rsi_sdio_set_cmd52_arg() - This function prepares cmd 52 read/write arg. * @rw: Read/write @@ -362,7 +364,7 @@ static int rsi_setblocklength(struct rsi_hw *adapter, u32 length) rsi_dbg(INIT_ZONE, "%s: Setting the block length\n", __func__); status = sdio_set_block_size(dev->pfunction, length); - dev->pfunction->max_blksize = 256; + dev->pfunction->max_blksize = RSI_MAX_BLOCK_SIZE; adapter->block_size = dev->pfunction->max_blksize; rsi_dbg(INFO_ZONE, @@ -567,9 +569,12 @@ static int rsi_sdio_load_data_master_write(struct rsi_hw *adapter, { u32 num_blocks, offset, i; u16 msb_address, lsb_address; - u8 temp_buf[block_size]; + u8 temp_buf[RSI_MAX_BLOCK_SIZE]; int status; + if (block_size > RSI_MAX_BLOCK_SIZE) + return -EINVAL; + num_blocks = instructions_sz / block_size; msb_address = base_address >> 16; -- 2.7.4
Re: [PATCH] vhost_net: initialize rx_ring in vhost_net_open()
On 2018年03月08日 23:50, Alexander Potapenko wrote: On Thu, Mar 8, 2018 at 4:45 PM, Eric Dumazet wrote: On 03/08/2018 07:20 AM, Alexander Potapenko wrote: On Thu, Mar 8, 2018 at 4:15 PM, Eric Dumazet wrote: On 03/08/2018 05:37 AM, Alexander Potapenko wrote: KMSAN reported a use of uninit memory in vhost_net_buf_unproduce() while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring: == Signed-off-by: Alexander Potapenko Please identify bug origin with a Fixes: tag Fixes: 5990a30510ed1 ("tun/tap: use ptr_ring instead of skb_array") Please send a V2 with this added tag. patchwork does not recognize it yet. Ok, will do. Thanks for reminding about the tag! The commit should be c67df11f6e480 ("vhost_net: try batch dequing from skb array"). Let me squash this patch into my series. Thanks
[PATCH] net: ethernet: ave: enable Rx drop interrupt
This enables AVE_GI_RXDROP interrupt factor. This factor indicates depletion of Rx descriptors and the handler counts the number of dropped packets. Signed-off-by: Kunihiko Hayashi --- drivers/net/ethernet/socionext/sni_ave.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/socionext/sni_ave.c b/drivers/net/ethernet/socionext/sni_ave.c index b247707..043eb55 100644 --- a/drivers/net/ethernet/socionext/sni_ave.c +++ b/drivers/net/ethernet/socionext/sni_ave.c @@ -1349,7 +1349,7 @@ static int ave_open(struct net_device *ndev) val |= AVE_IIRQC_EN0 | (AVE_INTM_COUNT << 16); writel(val, priv->base + AVE_IIRQC); - val = AVE_GI_RXIINT | AVE_GI_RXOVF | AVE_GI_TX; + val = AVE_GI_RXIINT | AVE_GI_RXOVF | AVE_GI_TX | AVE_GI_RXDROP; ave_irq_restore(ndev, val); napi_enable(&priv->napi_rx); -- 2.7.4
Re: [PATCH net] vhost_net: examine pointer types during un-producing
On 2018年03月09日 11:16, Jason Wang wrote: After commit 761876c857cb ("tap: XDP support"), we can actually queueing XDP pointers in the pointer ring, so we should examine the pointer type before freeing the pointer. Fixes: 761876c857cb ("tap: XDP support") Oops, the commit is wrong, let me repost. Thanks Reported-by: Michael S. Tsirkin Signed-off-by: Jason Wang --- drivers/net/tun.c | 3 ++- drivers/vhost/net.c| 2 +- include/linux/if_tun.h | 4 3 files changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 7433bb2..28cfa64 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -655,7 +655,7 @@ static struct tun_struct *tun_enable_queue(struct tun_file *tfile) return tun; } -static void tun_ptr_free(void *ptr) +void tun_ptr_free(void *ptr) { if (!ptr) return; @@ -667,6 +667,7 @@ static void tun_ptr_free(void *ptr) __skb_array_destroy_skb(ptr); } } +EXPORT_SYMBOL_GPL(tun_ptr_free); static void tun_queue_purge(struct tun_file *tfile) { diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 610cba2..54a138f 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -170,7 +170,7 @@ static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq) if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) { ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head, vhost_net_buf_get_size(rxq), - __skb_array_destroy_skb); + tun_ptr_free); rxq->head = rxq->tail = 0; } } diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h index c5b0a75..fd00170 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -25,6 +25,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file); bool tun_is_xdp_buff(void *ptr); void *tun_xdp_to_ptr(void *ptr); void *tun_ptr_to_xdp(void *ptr); +void tun_ptr_free(void *ptr); #else #include #include @@ -50,5 +51,8 @@ static inline void *tun_ptr_to_xdp(void *ptr) { return NULL; } +static inline void tun_ptr_free(void *ptr) +{ +} #endif /* CONFIG_TUN */ #endif /* __IF_TUN_H */
[PATCH] pktgen: Remove VLA usage
In preparation to enabling -Wvla, remove VLA usage and replace it with a fixed-length array instead. Signed-off-by: Gustavo A. R. Silva --- David, I'm not sure how often this function is being called and, depending on the frequency it may be worth to use dynamic memory allocation instead? Thanks -- Gustavo net/core/pktgen.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/pktgen.c b/net/core/pktgen.c index d81bddd..e2d6ae3 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -907,7 +907,7 @@ static ssize_t pktgen_if_write(struct file *file, if (debug) { size_t copy = min_t(size_t, count, 1023); - char tb[copy + 1]; + char tb[1024]; if (copy_from_user(tb, user_buffer, copy)) return -EFAULT; tb[copy] = 0; -- 2.7.4
Re: [PATCH bpf-next 0/7] tools: bpf: standardize make
On Thu, Mar 8, 2018 at 2:00 PM, Jiri Benc wrote: > Currently, 'make bpf' in the tools/ directory does not provide the standard > quiet output except for bpftool (which is however listed with a wrong > directory). Worse, it does not respect the build output directory. > > The 'make bpf_install' does not work as one would expect, either. It > installs unconditionally to /usr/bin without respecting DESTDIR and prefix. > > This patchset improves that behavior. Jakub, please review this set. Thanks!
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On 3/8/18 7:54 PM, Andy Lutomirski wrote: On Mar 8, 2018, at 7:06 PM, Linus Torvalds wrote: Honestly, that "read twice" thing may be what scuttles this. Initially, I thought it was a non-issue, because anybody who controls the module subdirectory enough to rewrite files would be in a position to just execute the file itself directly instead. On further consideration, I think there’s another showstopper. This patch is a potentially severe ABI break. Right now, loading a module *copies* it into memory and does not hold a reference to the underlying fs. With the patch applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, then umount it, then clear the ramdisk, something will go horribly wrong. Exactly what goes wrong depends on whether userspace notices that umount() failed. Similarly, if you load one of these modules over a network and then lose your connection, you have a problem. there is not abi breakage and file cannot disappear from running task. One cannot umount fs while file is still being used. The “read twice” thing is also bad for another reason: containers. Suppose I have a setup where a container can load a signed module blob. With the read twice code, the container can race and run an entirely different blob outside the container. Not only "read twice", but "read many". If .text sections of elf that are not yet in memory can be modified by malicious user, later they will be brought in with different code. I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
> On Mar 8, 2018, at 7:06 PM, Linus Torvalds > wrote: > > > Honestly, that "read twice" thing may be what scuttles this. > Initially, I thought it was a non-issue, because anybody who controls > the module subdirectory enough to rewrite files would be in a position > to just execute the file itself directly instead. > On further consideration, I think there’s another showstopper. This patch is a potentially severe ABI break. Right now, loading a module *copies* it into memory and does not hold a reference to the underlying fs. With the patch applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, then umount it, then clear the ramdisk, something will go horribly wrong. Exactly what goes wrong depends on whether userspace notices that umount() failed. Similarly, if you load one of these modules over a network and then lose your connection, you have a problem. The “read twice” thing is also bad for another reason: containers. Suppose I have a setup where a container can load a signed module blob. With the read twice code, the container can race and run an entirely different blob outside the container.
Re: [PATCH net] vhost_net: examine pointer types during un-producing
On Fri, Mar 09, 2018 at 11:16:22AM +0800, Jason Wang wrote: > After commit 761876c857cb ("tap: XDP support"), we can actually > queueing XDP pointers in the pointer ring, so we should examine the > pointer type before freeing the pointer. > > Fixes: 761876c857cb ("tap: XDP support") > Reported-by: Michael S. Tsirkin > Signed-off-by: Jason Wang Acked-by: Michael S. Tsirkin > --- > drivers/net/tun.c | 3 ++- > drivers/vhost/net.c| 2 +- > include/linux/if_tun.h | 4 > 3 files changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 7433bb2..28cfa64 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -655,7 +655,7 @@ static struct tun_struct *tun_enable_queue(struct > tun_file *tfile) > return tun; > } > > -static void tun_ptr_free(void *ptr) > +void tun_ptr_free(void *ptr) > { > if (!ptr) > return; > @@ -667,6 +667,7 @@ static void tun_ptr_free(void *ptr) > __skb_array_destroy_skb(ptr); > } > } > +EXPORT_SYMBOL_GPL(tun_ptr_free); > > static void tun_queue_purge(struct tun_file *tfile) > { > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > index 610cba2..54a138f 100644 > --- a/drivers/vhost/net.c > +++ b/drivers/vhost/net.c > @@ -170,7 +170,7 @@ static void vhost_net_buf_unproduce(struct > vhost_net_virtqueue *nvq) > if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) { > ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head, > vhost_net_buf_get_size(rxq), > -__skb_array_destroy_skb); > +tun_ptr_free); > rxq->head = rxq->tail = 0; > } > } > diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h > index c5b0a75..fd00170 100644 > --- a/include/linux/if_tun.h > +++ b/include/linux/if_tun.h > @@ -25,6 +25,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file); > bool tun_is_xdp_buff(void *ptr); > void *tun_xdp_to_ptr(void *ptr); > void *tun_ptr_to_xdp(void *ptr); > +void tun_ptr_free(void *ptr); > #else > #include > #include > @@ -50,5 +51,8 @@ static inline void *tun_ptr_to_xdp(void *ptr) > { > return NULL; > } > +static inline void tun_ptr_free(void *ptr) > +{ > +} > #endif /* CONFIG_TUN */ > #endif /* __IF_TUN_H */ > -- > 2.7.4
Re: [PATCH] vhost_net: initialize rx_ring in vhost_net_open()
On 2018年03月09日 11:29, Michael S. Tsirkin wrote: On Fri, Mar 09, 2018 at 10:30:17AM +0800, Jason Wang wrote: On 2018年03月09日 00:00, Michael S. Tsirkin wrote: On Thu, Mar 08, 2018 at 04:55:39PM +0100, Alexander Potapenko wrote: On Thu, Mar 8, 2018 at 4:33 PM, Michael S. Tsirkin wrote: On Thu, Mar 08, 2018 at 02:37:17PM +0100, Alexander Potapenko wrote: KMSAN reported a use of uninit memory in vhost_net_buf_unproduce() while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring: == BUG: KMSAN: use of uninitialized memory in vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vho et.c:170 CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170 vhost_net_stop_vq drivers/vhost/net.c:974 [inline] vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982 vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015 __fput+0x49f/0xa00 fs/file_table.c:209 fput+0x37/0x40 fs/file_table.c:243 task_work_run+0x243/0x2c0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:191 [inline] exit_to_usermode_loop arch/x86/entry/common.c:166 [inline] prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196 syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265 do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292 ... origin: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213 kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392 kmalloc_large_node_hook mm/slub.c:1366 [inline] kmalloc_large_node mm/slub.c:3808 [inline] __kmalloc_node+0x100e/0x1290 mm/slub.c:3818 kmalloc_node include/linux/slab.h:554 [inline] kvmalloc_node+0x1a5/0x2e0 mm/util.c:419 kvmalloc include/linux/mm.h:541 [inline] vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921 misc_open+0x7b5/0x8b0 drivers/char/misc.c:154 chrdev_open+0xc28/0xd90 fs/char_dev.c:417 do_dentry_open+0xccb/0x1430 fs/open.c:752 vfs_open+0x272/0x2e0 fs/open.c:866 do_last fs/namei.c:3378 [inline] path_openat+0x49ad/0x6580 fs/namei.c:3519 do_filp_open+0x267/0x640 fs/namei.c:3553 do_sys_open+0x6ad/0x9c0 fs/open.c:1059 SYSC_openat+0xc7/0xe0 fs/open.c:1086 SyS_openat+0x63/0x90 fs/open.c:1080 do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287 == Signed-off-by: Alexander Potapenko --- drivers/vhost/net.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 610cba276d47..60f1080bffc7 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -948,6 +948,7 @@ static int vhost_net_open(struct inode *inode, struct file *f) n->vqs[i].done_idx = 0; n->vqs[i].vhost_hlen = 0; n->vqs[i].sock_hlen = 0; + n->vqs[i].rx_ring = NULL; vhost_net_buf_init(&n->vqs[i].rxq); } vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX); -- 2.16.2.395.g2e18187dfd-goog I suspect that's not sufficient. rx ring is tied to the tap device. I think we need to drop it every time we drop the device. Unfortunately I've no idea where is the device dropped. Are you referring to vhost_net_vq_reset()? I can fix that part if needed, but won't be able to validate it with KMSAN. I see several issues. For example in vhost_net_set_backend if there's a value then rx ring will point to the ring of the wrong socket. Something like the below might help but we really need documentation of when is rx_ring valid. Is it only valid when private-data is valid? I think so, we need keep rx_ring synced with private_data. If yes need to make sure we reset it with private_data. Also I see __skb_array_destroy_skb used with ptr_ring which seems suspicious: how do we know the entries are skbs? Good catch, will post an independent patch to fix this. Patch below is on top of yours, and Signed-off-by: Michael S. Tsirkin But I really would like Jason to look and come up with a patch to address all these issues. --- diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 610cba2..7a65b69 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -972,6 +973,7 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n, vhost_net_disable_vq(n, vq); vq->private_data = NULL; vhost_net_buf_unproduce(nvq); + vq->rx_ring = NULL; mutex_unlock(&vq->mutex); return sock; } @@ -1161,8 +1163,6 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd) vhost_net_disab
Re: [PATCH] vhost_net: initialize rx_ring in vhost_net_open()
On Fri, Mar 09, 2018 at 10:30:17AM +0800, Jason Wang wrote: > > > On 2018年03月09日 00:00, Michael S. Tsirkin wrote: > > On Thu, Mar 08, 2018 at 04:55:39PM +0100, Alexander Potapenko wrote: > > > On Thu, Mar 8, 2018 at 4:33 PM, Michael S. Tsirkin > > > wrote: > > > > On Thu, Mar 08, 2018 at 02:37:17PM +0100, Alexander Potapenko wrote: > > > > > KMSAN reported a use of uninit memory in vhost_net_buf_unproduce() > > > > > while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring: > > > > > > > > > > == > > > > > BUG: KMSAN: use of uninitialized memory in > > > > > vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vho > > > > > et.c:170 > > > > > CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853 > > > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 > > > > > 04/01/2014 > > > > > Call Trace: > > > > > __dump_stack lib/dump_stack.c:17 [inline] > > > > > dump_stack+0x185/0x1d0 lib/dump_stack.c:53 > > > > > kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093 > > > > > __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 > > > > > vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170 > > > > > vhost_net_stop_vq drivers/vhost/net.c:974 [inline] > > > > > vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982 > > > > > vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015 > > > > > __fput+0x49f/0xa00 fs/file_table.c:209 > > > > > fput+0x37/0x40 fs/file_table.c:243 > > > > > task_work_run+0x243/0x2c0 kernel/task_work.c:113 > > > > > tracehook_notify_resume include/linux/tracehook.h:191 [inline] > > > > > exit_to_usermode_loop arch/x86/entry/common.c:166 [inline] > > > > > prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196 > > > > > syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265 > > > > > do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292 > > > > > ... > > > > > origin: > > > > > kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline] > > > > > kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213 > > > > > kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392 > > > > > kmalloc_large_node_hook mm/slub.c:1366 [inline] > > > > > kmalloc_large_node mm/slub.c:3808 [inline] > > > > > __kmalloc_node+0x100e/0x1290 mm/slub.c:3818 > > > > > kmalloc_node include/linux/slab.h:554 [inline] > > > > > kvmalloc_node+0x1a5/0x2e0 mm/util.c:419 > > > > > kvmalloc include/linux/mm.h:541 [inline] > > > > > vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921 > > > > > misc_open+0x7b5/0x8b0 drivers/char/misc.c:154 > > > > > chrdev_open+0xc28/0xd90 fs/char_dev.c:417 > > > > > do_dentry_open+0xccb/0x1430 fs/open.c:752 > > > > > vfs_open+0x272/0x2e0 fs/open.c:866 > > > > > do_last fs/namei.c:3378 [inline] > > > > > path_openat+0x49ad/0x6580 fs/namei.c:3519 > > > > > do_filp_open+0x267/0x640 fs/namei.c:3553 > > > > > do_sys_open+0x6ad/0x9c0 fs/open.c:1059 > > > > > SYSC_openat+0xc7/0xe0 fs/open.c:1086 > > > > > SyS_openat+0x63/0x90 fs/open.c:1080 > > > > > do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287 > > > > > == > > > > > > > > > > Signed-off-by: Alexander Potapenko > > > > > --- > > > > > drivers/vhost/net.c | 1 + > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > > > > > index 610cba276d47..60f1080bffc7 100644 > > > > > --- a/drivers/vhost/net.c > > > > > +++ b/drivers/vhost/net.c > > > > > @@ -948,6 +948,7 @@ static int vhost_net_open(struct inode *inode, > > > > > struct file *f) > > > > >n->vqs[i].done_idx = 0; > > > > >n->vqs[i].vhost_hlen = 0; > > > > >n->vqs[i].sock_hlen = 0; > > > > > + n->vqs[i].rx_ring = NULL; > > > > >vhost_net_buf_init(&n->vqs[i].rxq); > > > > >} > > > > >vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX); > > > > > -- > > > > > 2.16.2.395.g2e18187dfd-goog > > > > > > > > I suspect that's not sufficient. rx ring is tied to the tap device. > > > > I think we need to drop it every time we drop the device. > > > Unfortunately I've no idea where is the device dropped. Are you > > > referring to vhost_net_vq_reset()? > > > I can fix that part if needed, but won't be able to validate it with > > > KMSAN. > > I see several issues. For example in vhost_net_set_backend > > if there's a value then rx ring will point to the > > ring of the wrong socket. > > Something like the below might help but we really need > > documentation of when is rx_ring valid. Is it only valid > > when private-data is valid? > > I think so, we need keep rx_ring synced with private_data. > > > If yes need to make sure > > we reset it with private_data. > > > > Also I see __skb_array_destroy_skb used with ptr_ring which > > seems suspicious: how do we know the entries are skbs? > > Good catch, will post an
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Fri, Mar 09, 2018 at 02:12:24AM +, Andy Lutomirski wrote: > On Fri, Mar 9, 2018 at 1:20 AM, Alexei Starovoitov > wrote: > > On Fri, Mar 09, 2018 at 12:59:36AM +, Andy Lutomirski wrote: > >> > >> Alexei, can you give an example use case? I'm sure it's upthread > >> somewhere, but I'm having trouble finding it. > > > > at the time of iptable's setsockopt() the kernel will do > > err = request_module("bpfilter"); > > once. > > The rough POC code: > > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25 > > Here's what I gather from reading that code: you have a new kernel > feature (consisting of actual kernel code) that wants to defer some of > its implementation to user mode. I like this idea a lot. But I have > a suggestion for a slightly different way of accomplishing the same > thing. Rather than extending init_module() to accept ELF input, > except the call_umh code to be able to call blobs. You'd use it it > very roughly like this: > > First, compile your user code and emit a staitc binary. Use objdump > fiddling or a trivial .S file to make that static binary into a > variable. Then write a tiny shim module like this: > > extern unsigned char __begin_user_code[], __end_user_code[]; > > int __init init_shim_module(void) > { > return call_umh_blob(__begin_user_code, __end_user_code - > __begin_user_code); > } > > By itself, this is clearly a worse solution than yours, but it has two > benefits, one small and two big. The small benefit is that it is > completely invisible to userspace: the .ko file is a bona fide module. Unfortunately it's not quite the case. The normal .ko that does call_umh_blob is indeed seen in lsmod, but the umh process is a separate task. It could have been oomed or killed by admin and this normal .ko wouldn't notice it, so health check of umh process by the kernel is needed regardless. Right now bpfilter has trivial fuse-like protocol. This part is still to be designed cleanly. No doubt that visibility and debuggability into this umh processes is must have, but lsmod/rmmod interface doesn't quite fit. As you said letting this priv tasks register themselves in lsmod is certainly no-go. I think if they will be in lsmod, kernel has to register them and establish health check with umh at the same time. I think worrying about restarting is not necessary. This is still kernel code with the same high standards and review process. If they crash it's really a kernel bug. It only doesn't take the system down. > I think we don't want to end up in a situation where we ship a program > with a .ko extension that opens something in /dev, for example. this part I don't get. What's wrong with open of /dev ? I don't see a use case for it, but technically why not? > call_umh_blob() would create an anon_inode or similar object backed by > the blob and exec it. Interesting. I haven't considered such approach. For full context it all started from the idea of 'unprivileged kernel modules' or 'hardened kernel modules'. Something that kernel can easily interact with, but much safer than traditional kernel module. I've tried a bunch of crappy ideas first: 1. have a piece of kernel .text vm_mmap-ed into user process that doing iptables setsockopt and on return to user space force handle_signal to execute that code. Sort of like forced ld_preload where parasite code is provided by the kernel but runs in user space 2. have a special set of kernel page tables in read-only mode while iptable->bpf conversion is happening 3. have load_module() fork a user task and load real kernel .ko into it trying to hack #3 realized that I'm mainly copy-pasting a lot of load_elf_binary() code without elf_interpreter bits, so figured it's much easier and simpler to blend sys_finit_module with load_elf_binary via tweaking do_execveat_common and keeping that .ko as normal elf which is implemented in this patch. Debugging of that fake .ko is so much better. If it's done via call_umh_blob() the process that starts will indeed be a user mode process, but you won't be able to attach gdb to it. Whereas in this patch it's normal elf and standard debugging techniques apply. A developer can do gdb breakpoints, debug info, etc. That is huge advantage of keeping it as normal elf.
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Thu, Mar 8, 2018 at 7:06 PM, Linus Torvalds wrote: > > So I don't like Andy's "let's make it a kernel module and then that > kernel module can execve() a blob". THAT seems like just stupid > indirection. > > But I do like Andy's "execve a blob" part, because it is the *blob* > that has had its signature verified, not the file! To be honest., Andy's approach has the advantage that all the synchronization we do for modules still works. In particular, we have module loading logic where we synchronize loading of modules with the same name, so that two things that do request_module() concurrently will not lead to two modules being loaded. See that whole "module_mutex" thing, and the logic in (for example) "add_unformed_module()". Andrei's patch short-circuits the module loading before that, so if you have two concurrent request_module() calls, you'll end up with no serialization, and two executables. So maybe Andy is right. He often is, after all. Linus
[PATCH net] vhost_net: examine pointer types during un-producing
After commit 761876c857cb ("tap: XDP support"), we can actually queueing XDP pointers in the pointer ring, so we should examine the pointer type before freeing the pointer. Fixes: 761876c857cb ("tap: XDP support") Reported-by: Michael S. Tsirkin Signed-off-by: Jason Wang --- drivers/net/tun.c | 3 ++- drivers/vhost/net.c| 2 +- include/linux/if_tun.h | 4 3 files changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 7433bb2..28cfa64 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -655,7 +655,7 @@ static struct tun_struct *tun_enable_queue(struct tun_file *tfile) return tun; } -static void tun_ptr_free(void *ptr) +void tun_ptr_free(void *ptr) { if (!ptr) return; @@ -667,6 +667,7 @@ static void tun_ptr_free(void *ptr) __skb_array_destroy_skb(ptr); } } +EXPORT_SYMBOL_GPL(tun_ptr_free); static void tun_queue_purge(struct tun_file *tfile) { diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 610cba2..54a138f 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -170,7 +170,7 @@ static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq) if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) { ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head, vhost_net_buf_get_size(rxq), - __skb_array_destroy_skb); + tun_ptr_free); rxq->head = rxq->tail = 0; } } diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h index c5b0a75..fd00170 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -25,6 +25,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file); bool tun_is_xdp_buff(void *ptr); void *tun_xdp_to_ptr(void *ptr); void *tun_ptr_to_xdp(void *ptr); +void tun_ptr_free(void *ptr); #else #include #include @@ -50,5 +51,8 @@ static inline void *tun_ptr_to_xdp(void *ptr) { return NULL; } +static inline void tun_ptr_free(void *ptr) +{ +} #endif /* CONFIG_TUN */ #endif /* __IF_TUN_H */ -- 2.7.4
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
> On Mar 8, 2018, at 6:31 PM, David Miller wrote: > > From: Andy Lutomirski > Date: Fri, 9 Mar 2018 02:12:24 + > >> First, compile your user code and emit a staitc binary. Use objdump >> fiddling or a trivial .S file to make that static binary into a >> variable. Then write a tiny shim module like this: >> >> extern unsigned char __begin_user_code[], __end_user_code[]; >> >> int __init init_shim_module(void) >> { >> return call_umh_blob(__begin_user_code, __end_user_code - >> __begin_user_code); >> } >> >> By itself, this is clearly a worse solution than yours, but it has two >> benefits, one small and two big. The small benefit is that it is >> completely invisible to userspace: the .ko file is a bona fide module. > > Anything you try to do which makes these binaries "special" is a huge > negative. I don’t know what you mean. Alexei’s approach introduces a whole new kind of special module. Mine doesn’t. > >> The big benefits are: > > I don't see those things as benefits at all, and Alexei's scheme can > easily be made to work in your benefit #1 case too. > How? I think you’ll find that a non-modular implementation of a bundled ELF binary looks a *lot* like my call_umh_blob(). > It's a user binary. It's shipped with the kernel and it's signed. > > If we can't trust that, we can't trust much else. I’m not making any arguments about security at all. I’m talking about functionality. If we apply Alexei’s patch as is, then I think we’ll have a situation where ET_EXEC modules are only useful if they can do their jobs without any filesystem access at all. This is fine for networking, where netlink sockets are used, but I think it’s not so great for other use cases. If we ever try to stick a usb driver into userspace, we’re going to want to instantiate the user task once per device, passed as stdin or similar, and Alexei’s code will make that very awkward.
[PATCH] net: use skb_is_gso_sctp() instead of open-coding
As well as the basic conversion, I noticed that a lot of the SCTP code checks gso_type without first checking skb_is_gso() so I have added that where appropriate. Also, document the helper. Cc: Daniel Borkmann Cc: Marcelo Ricardo Leitner Signed-off-by: Daniel Axtens --- This depends on d02f51cbcf12 ("bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat to deal with gso sctp skbs") which introduces the required helper. That is in the bpf tree at the moment. --- Documentation/networking/segmentation-offloads.txt | 5 - net/core/skbuff.c | 2 +- net/sched/act_csum.c | 2 +- net/sctp/input.c | 8 net/sctp/inqueue.c | 2 +- net/sctp/offload.c | 2 +- 6 files changed, 12 insertions(+), 9 deletions(-) diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt index 23a8dd91a9ec..36bb931b35e0 100644 --- a/Documentation/networking/segmentation-offloads.txt +++ b/Documentation/networking/segmentation-offloads.txt @@ -155,7 +155,10 @@ Therefore, any code in the core networking stack must be aware of the possibility that gso_size will be GSO_BY_FRAGS and handle that case appropriately. -There are a couple of helpers to make this easier: +There are some helpers to make this easier: + + - skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if + an skb is an SCTP GSO skb. - For size checks, the skb_gso_validate_*_len family of helpers correctly considers GSO_BY_FRAGS. diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 0bb0d8877954..baf990528943 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4904,7 +4904,7 @@ static unsigned int skb_gso_transport_seglen(const struct sk_buff *skb) thlen += inner_tcp_hdrlen(skb); } else if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) { thlen = tcp_hdrlen(skb); - } else if (unlikely(shinfo->gso_type & SKB_GSO_SCTP)) { + } else if (unlikely(skb_is_gso_sctp(skb))) { thlen = sizeof(struct sctphdr); } /* UFO sets gso_size to the size of the fragmentation diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c index b7ba9b06b147..24b2e8e681cf 100644 --- a/net/sched/act_csum.c +++ b/net/sched/act_csum.c @@ -350,7 +350,7 @@ static int tcf_csum_sctp(struct sk_buff *skb, unsigned int ihl, { struct sctphdr *sctph; - if (skb_is_gso(skb) && skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) + if (skb_is_gso(skb) && skb_is_gso_sctp(skb)) return 1; sctph = tcf_csum_skb_nextlayer(skb, ihl, ipl, sizeof(*sctph)); diff --git a/net/sctp/input.c b/net/sctp/input.c index 0247cc432e02..b381d78548ac 100644 --- a/net/sctp/input.c +++ b/net/sctp/input.c @@ -106,6 +106,7 @@ int sctp_rcv(struct sk_buff *skb) int family; struct sctp_af *af; struct net *net = dev_net(skb->dev); + bool is_gso = skb_is_gso(skb) && skb_is_gso_sctp(skb); if (skb->pkt_type != PACKET_HOST) goto discard_it; @@ -123,8 +124,7 @@ int sctp_rcv(struct sk_buff *skb) * it's better to just linearize it otherwise crc computing * takes longer. */ - if ((!(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) && -skb_linearize(skb)) || + if ((!is_gso && skb_linearize(skb)) || !pskb_may_pull(skb, sizeof(struct sctphdr))) goto discard_it; @@ -135,7 +135,7 @@ int sctp_rcv(struct sk_buff *skb) if (skb_csum_unnecessary(skb)) __skb_decr_checksum_unnecessary(skb); else if (!sctp_checksum_disable && -!(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) && +!is_gso && sctp_rcv_checksum(net, skb) < 0) goto discard_it; skb->csum_valid = 1; @@ -1218,7 +1218,7 @@ static struct sctp_association *__sctp_rcv_lookup_harder(struct net *net, * issue as packets hitting this are mostly INIT or INIT-ACK and * those cannot be on GSO-style anyway. */ - if ((skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) == SKB_GSO_SCTP) + if (skb_is_gso(skb) && skb_is_gso_sctp(skb)) return NULL; ch = (struct sctp_chunkhdr *)skb->data; diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c index 48392552ee7c..23ebc5318edc 100644 --- a/net/sctp/inqueue.c +++ b/net/sctp/inqueue.c @@ -170,7 +170,7 @@ struct sctp_chunk *sctp_inq_pop(struct sctp_inq *queue) chunk = list_entry(entry, struct sctp_chunk, list); - if ((skb_shinfo(chunk->skb)->gso_type & SKB_GSO_SCTP) == SKB_GSO_SCTP) { + if (skb_is_gso(chunk->skb) && skb_is_gso_sctp(chunk->skb)) { /* GSO-marked skbs but without frags, handle * them normally
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Thu, Mar 8, 2018 at 5:44 PM, Kees Cook wrote: > > My concerns are mostly about crossing namespaces. If a container > triggers an autoload, the result runs in the init_ns. Heh. I saw that as an advantage. It's basically the same semantics as a normal module load does - in that the "kernel namespace" is init_ns. My own personal worry is actually different - we do check the signature of the file we're loading, but we're then passing it off to execve() not as the image we loaded, but as the file pointer. So the execve() will end up not using the actual buffer we checked the signature on, but instead just re-reading the file. Now, that has two issues: (a) it means that 'init_module' doesn't work, only 'finit_module'. Not a big deal, but I do think that we should fail the 'info->file == NULL' case explicitly (instead of failing when it does an execve() of /dev/null, which is what I think it does now - but that's just from the reading the patch, not from testing it). (b) somebody could maybe try to time it and modify the file after-the-fact of the signature check, and then we execute something else. Honestly, that "read twice" thing may be what scuttles this. Initially, I thought it was a non-issue, because anybody who controls the module subdirectory enough to rewrite files would be in a position to just execute the file itself directly instead. But it turns out that isn't needed. Some bad actor could just do "finit_module()" with a file that they just *copied* from the module directory. Yes, yes, they still need CAP_SYS_MODULE to even get that far, but it does worry me. So this does seem to turn "CAP_SYS_MODULE" into meaning "can run any executable as root in the init namespace". They don't have to have the module signing key that I thought would protect us, because they can do that "rewrite after signature check". So that does seem like a bad problem with the approach. So I don't like Andy's "let's make it a kernel module and then that kernel module can execve() a blob". THAT seems like just stupid indirection. But I do like Andy's "execve a blob" part, because it is the *blob* that has had its signature verified, not the file! Linus
[PATCH] drivers: vhost: vsock: fixed a brace coding style issue
Fixed a coding style issue. Signed-off-by: Vaibhav Murkute --- drivers/vhost/vsock.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c index 0d14e2ff19f1..0898dbdbf955 100644 --- a/drivers/vhost/vsock.c +++ b/drivers/vhost/vsock.c @@ -61,9 +61,9 @@ static struct vhost_vsock *__vhost_vsock_get(u32 guest_cid) if (other_cid == 0) continue; - if (other_cid == guest_cid) { + if (other_cid == guest_cid) return vsock; - } + } return NULL; -- 2.15.1
Re: [PATCH v2 net-next 0/2] ntuple filters with RSS
From: Edward Cree Date: Thu, 8 Mar 2018 15:43:50 + > This series introduces the ability to mark an ethtool steering filter to use > RSS spreading, and the ability to create and configure multiple RSS contexts > with different indirection tables, hash keys, and hash fields. > An implementation for the sfc driver (for 7000-series and later SFC NICs) is > included in patch 2/2. > > The anticipated use case of this feature is for steering traffic destined for > a container (or virtual machine) to the subset of CPUs on which processes in > the container (or the VM's vCPUs) are bound, while retaining the scalability > of RSS spreading from the viewpoint inside the container. > The use of both a base queue number (ring_cookie) and indirection table is > intended to allow re-use of a single RSS context to target multiple sets of > CPUs. For instance, if an 8-core system is hosting three containers on CPUs > [1,2], [3,4] and [6,7], then a single RSS context with an equal-weight [0,1] > indirection table could be used to target all three containers by setting > ring_cookie to 1, 3 and 6 on the respective filters. > > v2: Initialised ctx in efx_ef10_filter_insert() to avoid (false positive) gcc > warning. Series applied, thanks Edward.
Re: [RFC PATCH linux-next] rds: rds_info_from_znotifier() can be static
From: kbuild test robot Date: Thu, 8 Mar 2018 19:37:30 +0800 > Fixes: 9426bbc6de99 ("rds: use list structure to track information for > zerocopy completion notification") > Signed-off-by: Fengguang Wu Applied.
Re: [RFC PATCH linux-next] rds: rds_message_zcopy_from_user() can be static
From: Sowmini Varadhan Date: Thu, 8 Mar 2018 06:25:52 -0500 > On (03/08/18 18:56), kbuild test robot wrote: >> >> Fixes: d40a126b16ea ("rds: refactor zcopy code into >> rds_message_zcopy_from_user") >> Signed-off-by: Fengguang Wu > > Acked-by: Sowmini Varadhan > > (do I need to separately submit a non-RFC patch for this?) You don't have to resubmit in a situation like this. Applied, thanks everyone.
Re: [PATCH net-next] net: stmmac: remove superfluous wmb() memory barriers
From: Niklas Cassel Date: Thu, 8 Mar 2018 11:30:05 +0100 > These wmb() memory barriers are performed after the last descriptor write, > and they are followed by enable_dma_transmission()/set_tx_tail_ptr(), > i.e. a writel() to MMIO register space. > Since writel() itself performs the equivalent of a wmb() before doing the > actual write, these barriers are superfluous, and removing them should > thus not change any existing behavior. > > Ordering within the descriptor writes is already ensured with dma_wmb() > barriers inside prepare_tx_desc(first, ..)/prepare_tso_tx_desc(first, ..). > > Signed-off-by: Niklas Cassel Please allow me some time to consider this issue a little bit more before applying this patch. Thank you Niklas.
Re: [PATCH 1/2 net-next] net/ncsi: use kfree_skb() instead of kfree()
From: Dan Carpenter Date: Thu, 8 Mar 2018 12:36:04 +0300 > We're supposed to use kfree_skb() to free these sk_buffs. > > Fixes: 955dc68cb9b2 ("net/ncsi: Add generic netlink family") > Signed-off-by: Dan Carpenter Applied.
Re: [PATCH 2/2 net-next] net/ncsi: unlock on error in ncsi_set_interface_nl()
From: Dan Carpenter Date: Thu, 8 Mar 2018 12:36:28 +0300 > There are two error paths which are missing unlocks in this function. > > Fixes: 955dc68cb9b2 ("net/ncsi: Add generic netlink family") > Signed-off-by: Dan Carpenter Also applied, thanks Dan.
Re: [PATCH net-next] liquidio: fix ndo_change_mtu to always return correct status to the caller
From: Felix Manlunas Date: Wed, 7 Mar 2018 22:36:46 -0800 > + /* command is successful, change the MTU. */ > + netif_info(lio, probe, lio->netdev, "MTU changed from %d to %d\n", > +netdev->mtu, new_mtu); Please do not do this, this log message is excessive. The success return of the MTU change is enough of an indicator for the user.
Re: [PATCH net-next] liquidio: avoid doing useless work
From: Felix Manlunas Date: Wed, 7 Mar 2018 22:23:32 -0800 > From: Prasad Kanneganti > > Avoid doing useless work by making sure that the response_list is not empty > before scheduling work to process it. > > Signed-off-by: Prasad Kanneganti > Signed-off-by: Felix Manlunas Applied.
Re: [PATCH net-next] liquidio: Resolved mbox read issue while reading more than one 64bit data
From: Felix Manlunas Date: Wed, 7 Mar 2018 22:12:24 -0800 > From: Intiyaz Basha > > Corrected length check when data received in the mbox is more than one > 64 bit data value > > Signed-off-by: Intiyaz Basha > Signed-off-by: Felix Manlunas Applied.
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
From: Andy Lutomirski Date: Fri, 9 Mar 2018 02:12:24 + > First, compile your user code and emit a staitc binary. Use objdump > fiddling or a trivial .S file to make that static binary into a > variable. Then write a tiny shim module like this: > > extern unsigned char __begin_user_code[], __end_user_code[]; > > int __init init_shim_module(void) > { > return call_umh_blob(__begin_user_code, __end_user_code - > __begin_user_code); > } > > By itself, this is clearly a worse solution than yours, but it has two > benefits, one small and two big. The small benefit is that it is > completely invisible to userspace: the .ko file is a bona fide module. Anything you try to do which makes these binaries "special" is a huge negative. > The big benefits are: I don't see those things as benefits at all, and Alexei's scheme can easily be made to work in your benefit #1 case too. It's a user binary. It's shipped with the kernel and it's signed. If we can't trust that, we can't trust much else. And this whole container argument.. It's a mirage. Kernel modules are 1000 times worse, since they can access any container and any namespace they want.
Re: [PATCH] vhost_net: initialize rx_ring in vhost_net_open()
On 2018年03月09日 00:00, Michael S. Tsirkin wrote: On Thu, Mar 08, 2018 at 04:55:39PM +0100, Alexander Potapenko wrote: On Thu, Mar 8, 2018 at 4:33 PM, Michael S. Tsirkin wrote: On Thu, Mar 08, 2018 at 02:37:17PM +0100, Alexander Potapenko wrote: KMSAN reported a use of uninit memory in vhost_net_buf_unproduce() while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring: == BUG: KMSAN: use of uninitialized memory in vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vho et.c:170 CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170 vhost_net_stop_vq drivers/vhost/net.c:974 [inline] vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982 vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015 __fput+0x49f/0xa00 fs/file_table.c:209 fput+0x37/0x40 fs/file_table.c:243 task_work_run+0x243/0x2c0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:191 [inline] exit_to_usermode_loop arch/x86/entry/common.c:166 [inline] prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196 syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265 do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292 ... origin: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213 kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392 kmalloc_large_node_hook mm/slub.c:1366 [inline] kmalloc_large_node mm/slub.c:3808 [inline] __kmalloc_node+0x100e/0x1290 mm/slub.c:3818 kmalloc_node include/linux/slab.h:554 [inline] kvmalloc_node+0x1a5/0x2e0 mm/util.c:419 kvmalloc include/linux/mm.h:541 [inline] vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921 misc_open+0x7b5/0x8b0 drivers/char/misc.c:154 chrdev_open+0xc28/0xd90 fs/char_dev.c:417 do_dentry_open+0xccb/0x1430 fs/open.c:752 vfs_open+0x272/0x2e0 fs/open.c:866 do_last fs/namei.c:3378 [inline] path_openat+0x49ad/0x6580 fs/namei.c:3519 do_filp_open+0x267/0x640 fs/namei.c:3553 do_sys_open+0x6ad/0x9c0 fs/open.c:1059 SYSC_openat+0xc7/0xe0 fs/open.c:1086 SyS_openat+0x63/0x90 fs/open.c:1080 do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287 == Signed-off-by: Alexander Potapenko --- drivers/vhost/net.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 610cba276d47..60f1080bffc7 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -948,6 +948,7 @@ static int vhost_net_open(struct inode *inode, struct file *f) n->vqs[i].done_idx = 0; n->vqs[i].vhost_hlen = 0; n->vqs[i].sock_hlen = 0; + n->vqs[i].rx_ring = NULL; vhost_net_buf_init(&n->vqs[i].rxq); } vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX); -- 2.16.2.395.g2e18187dfd-goog I suspect that's not sufficient. rx ring is tied to the tap device. I think we need to drop it every time we drop the device. Unfortunately I've no idea where is the device dropped. Are you referring to vhost_net_vq_reset()? I can fix that part if needed, but won't be able to validate it with KMSAN. I see several issues. For example in vhost_net_set_backend if there's a value then rx ring will point to the ring of the wrong socket. Something like the below might help but we really need documentation of when is rx_ring valid. Is it only valid when private-data is valid? I think so, we need keep rx_ring synced with private_data. If yes need to make sure we reset it with private_data. Also I see __skb_array_destroy_skb used with ptr_ring which seems suspicious: how do we know the entries are skbs? Good catch, will post an independent patch to fix this. Patch below is on top of yours, and Signed-off-by: Michael S. Tsirkin But I really would like Jason to look and come up with a patch to address all these issues. --- diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 610cba2..7a65b69 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -972,6 +973,7 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n, vhost_net_disable_vq(n, vq); vq->private_data = NULL; vhost_net_buf_unproduce(nvq); + vq->rx_ring = NULL; mutex_unlock(&vq->mutex); return sock; } @@ -1161,8 +1163,6 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd) vhost_net_disable_vq(n, vq); vq->private_data = sock; vhost_net_buf_unproduce(nvq); - if (index == VHOST_NET_VQ_RX) -
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Fri, Mar 9, 2018 at 1:20 AM, Alexei Starovoitov wrote: > On Fri, Mar 09, 2018 at 12:59:36AM +, Andy Lutomirski wrote: >> >> Alexei, can you give an example use case? I'm sure it's upthread >> somewhere, but I'm having trouble finding it. > > at the time of iptable's setsockopt() the kernel will do > err = request_module("bpfilter"); > once. > The rough POC code: > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25 Here's what I gather from reading that code: you have a new kernel feature (consisting of actual kernel code) that wants to defer some of its implementation to user mode. I like this idea a lot. But I have a suggestion for a slightly different way of accomplishing the same thing. Rather than extending init_module() to accept ELF input, except the call_umh code to be able to call blobs. You'd use it it very roughly like this: First, compile your user code and emit a staitc binary. Use objdump fiddling or a trivial .S file to make that static binary into a variable. Then write a tiny shim module like this: extern unsigned char __begin_user_code[], __end_user_code[]; int __init init_shim_module(void) { return call_umh_blob(__begin_user_code, __end_user_code - __begin_user_code); } By itself, this is clearly a worse solution than yours, but it has two benefits, one small and two big. The small benefit is that it is completely invisible to userspace: the .ko file is a bona fide module. The big benefits are: 1. It works even in a non-modular kernel! (Okay, it probably only works if you can arrange for the built-in module to be initialized late enough, but that's straightforward.) 2. It allows future extensions to change the way the glue works. For example, maybe you want the module to integrate properly with lsmod, etc. Rather than adding a mechanism for general privileged programs to register themselves with lsmod (ick!), you could do it entirely in the kernel where lsmod would know that a particular umh task is special. More usefully, you could extend call_umh_blob() to pass in some pre-initialized struct files, which would give a clean way to *synchronously* create a communication channel to user code for whatever service the user code provides. And it would be more straightforward to make the umh blob do what it needs to do without relying on any particular filesystems being mounted. I think we don't want to end up in a situation where we ship a program with a .ko extension that opens something in /dev, for example. call_umh_blob() would create an anon_inode or similar object backed by the blob and exec it.
[PATCH net v2] openvswitch: meter: fix the incorrect calculation of max delta_t
From: zhangliping Max delat_t should be the full_bucket/rate instead of the full_bucket. Also report EINVAL if the rate is zero. Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure") Cc: Andy Zhou Signed-off-by: zhangliping --- V2: report EINVAL if the rate is 0 to avoid divide by zero error. net/openvswitch/meter.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c index 04b94281a30b..b891a91577f8 100644 --- a/net/openvswitch/meter.c +++ b/net/openvswitch/meter.c @@ -242,14 +242,20 @@ static struct dp_meter *dp_meter_create(struct nlattr **a) band->type = nla_get_u32(attr[OVS_BAND_ATTR_TYPE]); band->rate = nla_get_u32(attr[OVS_BAND_ATTR_RATE]); + if (band->rate == 0) { + err = -EINVAL; + goto exit_free_meter; + } + band->burst_size = nla_get_u32(attr[OVS_BAND_ATTR_BURST]); /* Figure out max delta_t that is enough to fill any bucket. * Keep max_delta_t size to the bucket units: * pkts => 1/1000 packets, kilobits => bits. +* +* Start with a full bucket. */ - band_max_delta_t = (band->burst_size + band->rate) * 1000; - /* Start with a full bucket. */ - band->bucket = band_max_delta_t; + band->bucket = (band->burst_size + band->rate) * 1000; + band_max_delta_t = band->bucket / band->rate; if (band_max_delta_t > meter->max_delta_t) meter->max_delta_t = band_max_delta_t; band++; -- 2.13.4
[PATCH net-next 1/9] net: hns3: refactor the hclge_get/set_rss function
From: Yunsheng Lin This patch refactors the hclge_get/set_rss function in order to fix the rss configuration loss problem during reset process. Signed-off-by: Yunsheng Lin Signed-off-by: Peng Li --- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 39 -- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 2 ++ 2 files changed, 9 insertions(+), 32 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 39bc0e6..cd5b040 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -2979,31 +2979,6 @@ static u32 hclge_get_rss_indir_size(struct hnae3_handle *handle) return HCLGE_RSS_IND_TBL_SIZE; } -static int hclge_get_rss_algo(struct hclge_dev *hdev) -{ - struct hclge_rss_config_cmd *req; - struct hclge_desc desc; - int rss_hash_algo; - int ret; - - hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_RSS_GENERIC_CONFIG, true); - - ret = hclge_cmd_send(&hdev->hw, &desc, 1); - if (ret) { - dev_err(&hdev->pdev->dev, - "Get link status error, status =%d\n", ret); - return ret; - } - - req = (struct hclge_rss_config_cmd *)desc.data; - rss_hash_algo = (req->hash_config & HCLGE_RSS_HASH_ALGO_MASK); - - if (rss_hash_algo == HCLGE_RSS_HASH_ALGO_TOEPLITZ) - return ETH_RSS_HASH_TOP; - - return -EINVAL; -} - static int hclge_set_rss_algo_key(struct hclge_dev *hdev, const u8 hfunc, const u8 *key) { @@ -3042,7 +3017,7 @@ static int hclge_set_rss_algo_key(struct hclge_dev *hdev, return 0; } -static int hclge_set_rss_indir_table(struct hclge_dev *hdev, const u32 *indir) +static int hclge_set_rss_indir_table(struct hclge_dev *hdev, const u8 *indir) { struct hclge_rss_indirection_table_cmd *req; struct hclge_desc desc; @@ -3138,12 +3113,11 @@ static int hclge_get_rss(struct hnae3_handle *handle, u32 *indir, u8 *key, u8 *hfunc) { struct hclge_vport *vport = hclge_get_vport(handle); - struct hclge_dev *hdev = vport->back; int i; /* Get hash algorithm */ if (hfunc) - *hfunc = hclge_get_rss_algo(hdev); + *hfunc = vport->rss_algo; /* Get the RSS Key required by the user */ if (key) @@ -3167,8 +3141,6 @@ static int hclge_set_rss(struct hnae3_handle *handle, const u32 *indir, /* Set the RSS Hash Key if specififed by the user */ if (key) { - /* Update the shadow RSS key with user specified qids */ - memcpy(vport->rss_hash_key, key, HCLGE_RSS_KEY_SIZE); if (hfunc == ETH_RSS_HASH_TOP || hfunc == ETH_RSS_HASH_NO_CHANGE) @@ -3178,6 +3150,10 @@ static int hclge_set_rss(struct hnae3_handle *handle, const u32 *indir, ret = hclge_set_rss_algo_key(hdev, hash_algo, key); if (ret) return ret; + + /* Update the shadow RSS key with user specified qids */ + memcpy(vport->rss_hash_key, key, HCLGE_RSS_KEY_SIZE); + vport->rss_algo = hash_algo; } /* Update the shadow RSS table with user specified qids */ @@ -3185,8 +3161,7 @@ static int hclge_set_rss(struct hnae3_handle *handle, const u32 *indir, vport->rss_indirection_tbl[i] = indir[i]; /* Update the hardware */ - ret = hclge_set_rss_indir_table(hdev, indir); - return ret; + return hclge_set_rss_indir_table(hdev, vport->rss_indirection_tbl); } static u8 hclge_get_rss_hash_bits(struct ethtool_rxnfc *nfc) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index d99a76a..7e762c4 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -579,6 +579,8 @@ struct hclge_vport { u8 rss_hash_key[HCLGE_RSS_KEY_SIZE]; /* User configured hash keys */ /* User configured lookup table entries */ u8 rss_indirection_tbl[HCLGE_RSS_IND_TBL_SIZE]; + int rss_algo; /* User configured hash algorithm */ + u16 alloc_rss_size; u16 qs_offset; -- 2.9.3
[PATCH net-next 4/9] net: hns3: fix for pause configuration lost during reset
From: Yunsheng Lin Pause configuration will be set to default value by hclge_tm_schd_init during reset, which causes the RSS configuration loss problem. This patch fixes it by calling hclge_tm_init_hw during reset process , which will set the pause configuration to default value. Signed-off-by: Yunsheng Lin Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 34a0905..109155d 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -5493,9 +5493,9 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev) return ret; } - ret = hclge_tm_schd_init(hdev); + ret = hclge_tm_init_hw(hdev); if (ret) { - dev_err(&pdev->dev, "tm schd init fail, ret =%d\n", ret); + dev_err(&pdev->dev, "tm init hw fail, ret =%d\n", ret); return ret; } -- 2.9.3
[PATCH net-next 5/9] net: hns3: fix for use-after-free when setting ring parameter
From: Yunsheng Lin In hns3_set_ringparam, hns3_uninit_all_ring frees the memory pointed by priv->ring_data[i].ring, and hns3_change_all_ring_bd_num use that pointer without mallocing, which will cause a use-after-free problem. The patch fixes it by not freeing the memory in hns3_uninit_all_ring, and uses hns3_put_ring_config to free it when necessary. Signed-off-by: Yunsheng Lin Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index c936945..e0ab161 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2967,13 +2967,8 @@ int hns3_uninit_all_ring(struct hns3_nic_priv *priv) h->ae_algo->ops->reset_queue(h, i); hns3_fini_ring(priv->ring_data[i].ring); - devm_kfree(priv->dev, priv->ring_data[i].ring); hns3_fini_ring(priv->ring_data[i + h->kinfo.num_tqps].ring); - devm_kfree(priv->dev, - priv->ring_data[i + h->kinfo.num_tqps].ring); } - devm_kfree(priv->dev, priv->ring_data); - return 0; } @@ -3111,6 +3106,8 @@ static void hns3_client_uninit(struct hnae3_handle *handle, bool reset) if (ret) netdev_err(netdev, "uninit ring error\n"); + hns3_put_ring_config(priv); + priv->ring_data = NULL; free_netdev(netdev); @@ -3316,6 +3313,8 @@ static int hns3_reset_notify_uninit_enet(struct hnae3_handle *handle) if (ret) netdev_err(netdev, "uninit ring error\n"); + hns3_put_ring_config(priv); + priv->ring_data = NULL; return ret; @@ -3422,6 +3421,7 @@ int hns3_set_channels(struct net_device *netdev, } hns3_uninit_all_ring(priv); + hns3_put_ring_config(priv); org_tqp_num = h->kinfo.num_tqps; ret = hns3_modify_tqp_num(netdev, new_tqp_num); -- 2.9.3
[PATCH net-next 8/9] net: hns3: refactor the coalesce related struct
From: Yunsheng Lin This patch refoctors the coalesce related struct by introducing the hns3_enet_coalesce struct, in order to fix the coalesce configuation lost problem when changing the channel number. Signed-off-by: Yunsheng Lin Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 46 +++--- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h| 10 +++-- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 26 +++- 3 files changed, 46 insertions(+), 36 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 07dbebb..f466f60 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -168,8 +168,8 @@ void hns3_set_vector_coalesce_rl(struct hns3_enet_tqp_vector *tqp_vector, * GL and RL(Rate Limiter) are 2 ways to acheive interrupt coalescing */ - if (rl_reg > 0 && !tqp_vector->tx_group.gl_adapt_enable && - !tqp_vector->rx_group.gl_adapt_enable) + if (rl_reg > 0 && !tqp_vector->tx_group.coal.gl_adapt_enable && + !tqp_vector->rx_group.coal.gl_adapt_enable) /* According to the hardware, the range of rl_reg is * 0-59 and the unit is 4. */ @@ -205,17 +205,17 @@ static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector, */ /* Default: enable interrupt coalescing self-adaptive and GL */ - tqp_vector->tx_group.gl_adapt_enable = 1; - tqp_vector->rx_group.gl_adapt_enable = 1; + tqp_vector->tx_group.coal.gl_adapt_enable = 1; + tqp_vector->rx_group.coal.gl_adapt_enable = 1; - tqp_vector->tx_group.int_gl = HNS3_INT_GL_50K; - tqp_vector->rx_group.int_gl = HNS3_INT_GL_50K; + tqp_vector->tx_group.coal.int_gl = HNS3_INT_GL_50K; + tqp_vector->rx_group.coal.int_gl = HNS3_INT_GL_50K; /* Default: disable RL */ h->kinfo.int_rl_setting = 0; - tqp_vector->rx_group.flow_level = HNS3_FLOW_LOW; - tqp_vector->tx_group.flow_level = HNS3_FLOW_LOW; + tqp_vector->rx_group.coal.flow_level = HNS3_FLOW_LOW; + tqp_vector->tx_group.coal.flow_level = HNS3_FLOW_LOW; } static void hns3_vector_gl_rl_init_hw(struct hns3_enet_tqp_vector *tqp_vector, @@ -224,9 +224,9 @@ static void hns3_vector_gl_rl_init_hw(struct hns3_enet_tqp_vector *tqp_vector, struct hnae3_handle *h = priv->ae_handle; hns3_set_vector_coalesce_tx_gl(tqp_vector, - tqp_vector->tx_group.int_gl); + tqp_vector->tx_group.coal.int_gl); hns3_set_vector_coalesce_rx_gl(tqp_vector, - tqp_vector->rx_group.int_gl); + tqp_vector->rx_group.coal.int_gl); hns3_set_vector_coalesce_rl(tqp_vector, h->kinfo.int_rl_setting); } @@ -2393,12 +2393,12 @@ static bool hns3_get_new_int_gl(struct hns3_enet_ring_group *ring_group) u16 new_int_gl; int usecs; - if (!ring_group->int_gl) + if (!ring_group->coal.int_gl) return false; if (ring_group->total_packets == 0) { - ring_group->int_gl = HNS3_INT_GL_50K; - ring_group->flow_level = HNS3_FLOW_LOW; + ring_group->coal.int_gl = HNS3_INT_GL_50K; + ring_group->coal.flow_level = HNS3_FLOW_LOW; return true; } @@ -2408,10 +2408,10 @@ static bool hns3_get_new_int_gl(struct hns3_enet_ring_group *ring_group) * 20-1249MB/s high (18000 ints/s) * > 4pps ultra (8000 ints/s) */ - new_flow_level = ring_group->flow_level; - new_int_gl = ring_group->int_gl; + new_flow_level = ring_group->coal.flow_level; + new_int_gl = ring_group->coal.int_gl; tqp_vector = ring_group->ring->tqp_vector; - usecs = (ring_group->int_gl << 1); + usecs = (ring_group->coal.int_gl << 1); bytes_per_usecs = ring_group->total_bytes / usecs; /* 100 microseconds */ packets_per_secs = ring_group->total_packets * 100 / usecs; @@ -2458,9 +2458,9 @@ static bool hns3_get_new_int_gl(struct hns3_enet_ring_group *ring_group) ring_group->total_bytes = 0; ring_group->total_packets = 0; - ring_group->flow_level = new_flow_level; - if (new_int_gl != ring_group->int_gl) { - ring_group->int_gl = new_int_gl; + ring_group->coal.flow_level = new_flow_level; + if (new_int_gl != ring_group->coal.int_gl) { + ring_group->coal.int_gl = new_int_gl; return true; } return false; @@ -2472,18 +2472,18 @@ static void hns3_update_new_int_gl(struct hns3_enet_tqp_vector *tqp_vector) struct hns3_enet_ring_group *tx_group = &tqp_vector->tx_group; bool rx_update, tx_upda
[PATCH net-next 7/9] net: hns3: fix for coalesce configuration lost during reset
From: Yunsheng Lin Coalesce configuration will be set to default value by hns3_nic_init_vector_data during reset, which causes the coalesce configuration loss problem. This patch fixes it by setting the default value in hns3_nic_alloc_vector_data, which will not be called in the reset process. Signed-off-by: Yunsheng Lin Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 156 +--- 1 file changed, 114 insertions(+), 42 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index bbaa21f..07dbebb 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -211,19 +211,25 @@ static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector, tqp_vector->tx_group.int_gl = HNS3_INT_GL_50K; tqp_vector->rx_group.int_gl = HNS3_INT_GL_50K; - hns3_set_vector_coalesce_tx_gl(tqp_vector, - tqp_vector->tx_group.int_gl); - hns3_set_vector_coalesce_rx_gl(tqp_vector, - tqp_vector->rx_group.int_gl); - /* Default: disable RL */ h->kinfo.int_rl_setting = 0; - hns3_set_vector_coalesce_rl(tqp_vector, h->kinfo.int_rl_setting); tqp_vector->rx_group.flow_level = HNS3_FLOW_LOW; tqp_vector->tx_group.flow_level = HNS3_FLOW_LOW; } +static void hns3_vector_gl_rl_init_hw(struct hns3_enet_tqp_vector *tqp_vector, + struct hns3_nic_priv *priv) +{ + struct hnae3_handle *h = priv->ae_handle; + + hns3_set_vector_coalesce_tx_gl(tqp_vector, + tqp_vector->tx_group.int_gl); + hns3_set_vector_coalesce_rx_gl(tqp_vector, + tqp_vector->rx_group.int_gl); + hns3_set_vector_coalesce_rl(tqp_vector, h->kinfo.int_rl_setting); +} + static int hns3_nic_set_real_num_queue(struct net_device *netdev) { struct hnae3_handle *h = hns3_get_handle(netdev); @@ -2625,32 +2631,18 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv) struct hnae3_ring_chain_node vector_ring_chain; struct hnae3_handle *h = priv->ae_handle; struct hns3_enet_tqp_vector *tqp_vector; - struct hnae3_vector_info *vector; - struct pci_dev *pdev = h->pdev; - u16 tqp_num = h->kinfo.num_tqps; - u16 vector_num; int ret = 0; u16 i; - /* RSS size, cpu online and vector_num should be the same */ - /* Should consider 2p/4p later */ - vector_num = min_t(u16, num_online_cpus(), tqp_num); - vector = devm_kcalloc(&pdev->dev, vector_num, sizeof(*vector), - GFP_KERNEL); - if (!vector) - return -ENOMEM; - - vector_num = h->ae_algo->ops->get_vector(h, vector_num, vector); - - priv->vector_num = vector_num; - priv->tqp_vector = (struct hns3_enet_tqp_vector *) - devm_kcalloc(&pdev->dev, vector_num, sizeof(*priv->tqp_vector), -GFP_KERNEL); - if (!priv->tqp_vector) - return -ENOMEM; + for (i = 0; i < priv->vector_num; i++) { + tqp_vector = &priv->tqp_vector[i]; + hns3_vector_gl_rl_init_hw(tqp_vector, priv); + tqp_vector->num_tqps = 0; + } - for (i = 0; i < tqp_num; i++) { - u16 vector_i = i % vector_num; + for (i = 0; i < h->kinfo.num_tqps; i++) { + u16 vector_i = i % priv->vector_num; + u16 tqp_num = h->kinfo.num_tqps; tqp_vector = &priv->tqp_vector[vector_i]; @@ -2660,52 +2652,94 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv) hns3_add_ring_to_group(&tqp_vector->rx_group, priv->ring_data[i + tqp_num].ring); - tqp_vector->idx = vector_i; - tqp_vector->mask_addr = vector[vector_i].io_addr; - tqp_vector->vector_irq = vector[vector_i].vector; - tqp_vector->num_tqps++; - priv->ring_data[i].ring->tqp_vector = tqp_vector; priv->ring_data[i + tqp_num].ring->tqp_vector = tqp_vector; + tqp_vector->num_tqps++; } - for (i = 0; i < vector_num; i++) { + for (i = 0; i < priv->vector_num; i++) { tqp_vector = &priv->tqp_vector[i]; tqp_vector->rx_group.total_bytes = 0; tqp_vector->rx_group.total_packets = 0; tqp_vector->tx_group.total_bytes = 0; tqp_vector->tx_group.total_packets = 0; - hns3_vector_gl_rl_init(tqp_vector, priv); tqp_vector->handle = h; ret = hns3_get_vector_ring_chain(tqp_vector, &vector_ring_chain);
[PATCH net-next 0/9] fixes for configuration lost problems
This patchset refactors some functions and some bugs in order to fix the configuration loss problem when resetting and setting channel number. Yunsheng Lin (9): net: hns3: refactor the hclge_get/set_rss function net: hns3: refactor the hclge_get/set_rss_tuple function net: hns3: fix for RSS configuration loss problem during reset net: hns3: fix for pause configuration lost during reset net: hns3: fix for use-after-free when setting ring parameter net: hns3: refactor the get/put_vector function net: hns3: fix for coalesce configuration lost during reset net: hns3: refactor the coalesce related struct net: hns3: fix for coal configuation lost when setting the channel drivers/net/ethernet/hisilicon/hns3/hnae3.h| 3 + drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 249 +++-- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h| 10 +- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 26 ++- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c | 2 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 235 +-- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 16 ++ .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 12 +- 8 files changed, 351 insertions(+), 202 deletions(-) -- 2.9.3
[PATCH net-next 2/9] net: hns3: refactor the hclge_get/set_rss_tuple function
From: Yunsheng Lin This patch refactors the hclge_get/set_rss_tuple function in order to fix the rss configuration loss problem during reset process. Signed-off-by: Yunsheng Lin Signed-off-by: Peng Li --- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 91 +- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 13 2 files changed, 67 insertions(+), 37 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index cd5b040..747cc8f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3091,14 +3091,16 @@ static int hclge_set_rss_input_tuple(struct hclge_dev *hdev) hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_RSS_INPUT_TUPLE, false); req = (struct hclge_rss_input_tuple_cmd *)desc.data; - req->ipv4_tcp_en = HCLGE_RSS_INPUT_TUPLE_OTHER; - req->ipv4_udp_en = HCLGE_RSS_INPUT_TUPLE_OTHER; - req->ipv4_sctp_en = HCLGE_RSS_INPUT_TUPLE_SCTP; - req->ipv4_fragment_en = HCLGE_RSS_INPUT_TUPLE_OTHER; - req->ipv6_tcp_en = HCLGE_RSS_INPUT_TUPLE_OTHER; - req->ipv6_udp_en = HCLGE_RSS_INPUT_TUPLE_OTHER; - req->ipv6_sctp_en = HCLGE_RSS_INPUT_TUPLE_SCTP; - req->ipv6_fragment_en = HCLGE_RSS_INPUT_TUPLE_OTHER; + + /* Get the tuple cfg from pf */ + req->ipv4_tcp_en = hdev->vport[0].rss_tuple_sets.ipv4_tcp_en; + req->ipv4_udp_en = hdev->vport[0].rss_tuple_sets.ipv4_udp_en; + req->ipv4_sctp_en = hdev->vport[0].rss_tuple_sets.ipv4_sctp_en; + req->ipv4_fragment_en = hdev->vport[0].rss_tuple_sets.ipv4_fragment_en; + req->ipv6_tcp_en = hdev->vport[0].rss_tuple_sets.ipv6_tcp_en; + req->ipv6_udp_en = hdev->vport[0].rss_tuple_sets.ipv6_udp_en; + req->ipv6_sctp_en = hdev->vport[0].rss_tuple_sets.ipv6_sctp_en; + req->ipv6_fragment_en = hdev->vport[0].rss_tuple_sets.ipv6_fragment_en; ret = hclge_cmd_send(&hdev->hw, &desc, 1); if (ret) { dev_err(&hdev->pdev->dev, @@ -3204,15 +3206,16 @@ static int hclge_set_rss_tuple(struct hnae3_handle *handle, return -EINVAL; req = (struct hclge_rss_input_tuple_cmd *)desc.data; - hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_RSS_INPUT_TUPLE, true); - ret = hclge_cmd_send(&hdev->hw, &desc, 1); - if (ret) { - dev_err(&hdev->pdev->dev, - "Read rss tuple fail, status = %d\n", ret); - return ret; - } + hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_RSS_INPUT_TUPLE, false); - hclge_cmd_reuse_desc(&desc, false); + req->ipv4_tcp_en = vport->rss_tuple_sets.ipv4_tcp_en; + req->ipv4_udp_en = vport->rss_tuple_sets.ipv4_udp_en; + req->ipv4_sctp_en = vport->rss_tuple_sets.ipv4_sctp_en; + req->ipv4_fragment_en = vport->rss_tuple_sets.ipv4_fragment_en; + req->ipv6_tcp_en = vport->rss_tuple_sets.ipv6_tcp_en; + req->ipv6_udp_en = vport->rss_tuple_sets.ipv6_udp_en; + req->ipv6_sctp_en = vport->rss_tuple_sets.ipv6_sctp_en; + req->ipv6_fragment_en = vport->rss_tuple_sets.ipv6_fragment_en; tuple_sets = hclge_get_rss_hash_bits(nfc); switch (nfc->flow_type) { @@ -3249,52 +3252,49 @@ static int hclge_set_rss_tuple(struct hnae3_handle *handle, } ret = hclge_cmd_send(&hdev->hw, &desc, 1); - if (ret) + if (ret) { dev_err(&hdev->pdev->dev, "Set rss tuple fail, status = %d\n", ret); + return ret; + } - return ret; + vport->rss_tuple_sets.ipv4_tcp_en = req->ipv4_tcp_en; + vport->rss_tuple_sets.ipv4_udp_en = req->ipv4_udp_en; + vport->rss_tuple_sets.ipv4_sctp_en = req->ipv4_sctp_en; + vport->rss_tuple_sets.ipv4_fragment_en = req->ipv4_fragment_en; + vport->rss_tuple_sets.ipv6_tcp_en = req->ipv6_tcp_en; + vport->rss_tuple_sets.ipv6_udp_en = req->ipv6_udp_en; + vport->rss_tuple_sets.ipv6_sctp_en = req->ipv6_sctp_en; + vport->rss_tuple_sets.ipv6_fragment_en = req->ipv6_fragment_en; + return 0; } static int hclge_get_rss_tuple(struct hnae3_handle *handle, struct ethtool_rxnfc *nfc) { struct hclge_vport *vport = hclge_get_vport(handle); - struct hclge_dev *hdev = vport->back; - struct hclge_rss_input_tuple_cmd *req; - struct hclge_desc desc; u8 tuple_sets; - int ret; nfc->data = 0; - req = (struct hclge_rss_input_tuple_cmd *)desc.data; - hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_RSS_INPUT_TUPLE, true); - ret = hclge_cmd_send(&hdev->hw, &desc, 1); - if (ret) { - dev_err(&hdev->pdev->dev, - "Read rss tuple fail, status = %d\n", ret); - return ret; - } - switch (nfc->flow_type) { case TCP_V4_FLOW
[PATCH net-next 9/9] net: hns3: fix for coal configuation lost when setting the channel
From: Yunsheng Lin This patch fixes the coalesce configuation lost problem when setting the channel number by restoring all vectors's coalesce configuation to vector 0's, because all vectors belonging to the same netdev have the same coalesce configuation for now. Signed-off-by: Yunsheng Lin Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 37 +++-- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index f466f60..0a8c427 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -3413,7 +3413,24 @@ static int hns3_reset_notify(struct hnae3_handle *handle, return ret; } -static int hns3_modify_tqp_num(struct net_device *netdev, u16 new_tqp_num) +static void hns3_restore_coal(struct hns3_nic_priv *priv, + struct hns3_enet_coalesce *tx, + struct hns3_enet_coalesce *rx) +{ + u16 vector_num = priv->vector_num; + int i; + + for (i = 0; i < vector_num; i++) { + memcpy(&priv->tqp_vector[i].tx_group.coal, tx, + sizeof(struct hns3_enet_coalesce)); + memcpy(&priv->tqp_vector[i].rx_group.coal, rx, + sizeof(struct hns3_enet_coalesce)); + } +} + +static int hns3_modify_tqp_num(struct net_device *netdev, u16 new_tqp_num, + struct hns3_enet_coalesce *tx, + struct hns3_enet_coalesce *rx) { struct hns3_nic_priv *priv = netdev_priv(netdev); struct hnae3_handle *h = hns3_get_handle(netdev); @@ -3431,6 +3448,8 @@ static int hns3_modify_tqp_num(struct net_device *netdev, u16 new_tqp_num) if (ret) goto err_alloc_vector; + hns3_restore_coal(priv, tx, rx); + ret = hns3_nic_init_vector_data(priv); if (ret) goto err_uninit_vector; @@ -3461,6 +3480,7 @@ int hns3_set_channels(struct net_device *netdev, struct hns3_nic_priv *priv = netdev_priv(netdev); struct hnae3_handle *h = hns3_get_handle(netdev); struct hnae3_knic_private_info *kinfo = &h->kinfo; + struct hns3_enet_coalesce tx_coal, rx_coal; bool if_running = netif_running(netdev); u32 new_tqp_num = ch->combined_count; u16 org_tqp_num; @@ -3494,15 +3514,26 @@ int hns3_set_channels(struct net_device *netdev, goto open_netdev; } + /* Changing the tqp num may also change the vector num, +* ethtool only support setting and querying one coal +* configuation for now, so save the vector 0' coal +* configuation here in order to restore it. +*/ + memcpy(&tx_coal, &priv->tqp_vector[0].tx_group.coal, + sizeof(struct hns3_enet_coalesce)); + memcpy(&rx_coal, &priv->tqp_vector[0].rx_group.coal, + sizeof(struct hns3_enet_coalesce)); + hns3_nic_dealloc_vector_data(priv); hns3_uninit_all_ring(priv); hns3_put_ring_config(priv); org_tqp_num = h->kinfo.num_tqps; - ret = hns3_modify_tqp_num(netdev, new_tqp_num); + ret = hns3_modify_tqp_num(netdev, new_tqp_num, &tx_coal, &rx_coal); if (ret) { - ret = hns3_modify_tqp_num(netdev, org_tqp_num); + ret = hns3_modify_tqp_num(netdev, org_tqp_num, + &tx_coal, &rx_coal); if (ret) { /* If revert to old tqp failed, fatal error occurred */ dev_err(&netdev->dev, -- 2.9.3
[PATCH net-next 6/9] net: hns3: refactor the get/put_vector function
From: Yunsheng Lin There is a get_vector function, which allocate the vectors for a client, but there is not a put_vector to free the vector. This patch introduces the put_vector function in order to fix the coalesce configuration lost problem during reset process. Signed-off-by: Yunsheng Lin Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hnae3.h| 3 +++ drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 4 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 28 -- .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 12 +++--- 4 files changed, 37 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index fd06bc7..a072048 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -265,6 +265,8 @@ struct hnae3_ae_dev { * Get tc size of handle * get_vector() * Get vector number and vector information + * put_vector() + * Put the vector in hdev * map_ring_to_vector() * Map rings to vector * unmap_ring_from_vector() @@ -375,6 +377,7 @@ struct hnae3_ae_ops { int (*get_vector)(struct hnae3_handle *handle, u16 vector_num, struct hnae3_vector_info *vector_info); + int (*put_vector)(struct hnae3_handle *handle, int vector_num); int (*map_ring_to_vector)(struct hnae3_handle *handle, int vector_num, struct hnae3_ring_chain_node *vr_chain); diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index e0ab161..bbaa21f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2721,6 +2721,10 @@ static int hns3_nic_uninit_vector_data(struct hns3_nic_priv *priv) if (ret) return ret; + ret = h->ae_algo->ops->put_vector(h, tqp_vector->vector_irq); + if (ret) + return ret; + hns3_free_vector_ring_chain(tqp_vector, &vector_ring_chain); if (priv->tqp_vector[i].irq_init_flag == HNS3_VECTOR_INITED) { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 109155d..a318773 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -2969,6 +2969,24 @@ static int hclge_get_vector_index(struct hclge_dev *hdev, int vector) return -EINVAL; } +static int hclge_put_vector(struct hnae3_handle *handle, int vector) +{ + struct hclge_vport *vport = hclge_get_vport(handle); + struct hclge_dev *hdev = vport->back; + int vector_id; + + vector_id = hclge_get_vector_index(hdev, vector); + if (vector_id < 0) { + dev_err(&hdev->pdev->dev, + "Get vector index fail. vector_id =%d\n", vector_id); + return vector_id; + } + + hclge_free_vector(hdev, vector_id); + + return 0; +} + static u32 hclge_get_rss_key_size(struct hnae3_handle *handle) { return HCLGE_RSS_KEY_SIZE; @@ -3523,18 +3541,13 @@ static int hclge_unmap_ring_frm_vector(struct hnae3_handle *handle, } ret = hclge_bind_ring_with_vector(vport, vector_id, false, ring_chain); - if (ret) { + if (ret) dev_err(&handle->pdev->dev, "Unmap ring from vector fail. vectorid=%d, ret =%d\n", vector_id, ret); - return ret; - } - - /* Free this MSIX or MSI vector */ - hclge_free_vector(hdev, vector_id); - return 0; + return ret; } int hclge_cmd_set_promisc_mode(struct hclge_dev *hdev, @@ -5994,6 +6007,7 @@ static const struct hnae3_ae_ops hclge_ops = { .map_ring_to_vector = hclge_map_ring_to_vector, .unmap_ring_from_vector = hclge_unmap_ring_frm_vector, .get_vector = hclge_get_vector, + .put_vector = hclge_put_vector, .set_promisc_mode = hclge_set_promisc_mode, .set_loopback = hclge_set_loopback, .start = hclge_ae_start, diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index c9a916a..09a2a66 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -627,13 +627,18 @@ static int hclgevf_unmap_ring_from_vector( } ret = hclgevf_bind_ring_to_vector(handle, false, vector, ring_chain); - if (ret) { + if (ret) dev_err(&handle->pdev->dev, "Unmap ring from vector fail. vector=%d, ret =%d\n", vector_id,
[PATCH net-next 3/9] net: hns3: fix for RSS configuration loss problem during reset
From: Yunsheng Lin RSS configuration will be set to default value by hclge_rss_init_hw during reset, which causes the RSS configuration loss problem. This patch fixes it by setting the default value in hclge_rss_init_cfg function, which will not be called in the reset process. Signed-off-by: Yunsheng Lin Signed-off-by: Peng Li --- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c | 2 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 107 ++--- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 1 + 3 files changed, 56 insertions(+), 54 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c index 5018d66..407cfab 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c @@ -144,6 +144,8 @@ static int hclge_map_update(struct hnae3_handle *h) if (ret) return ret; + hclge_rss_indir_init_cfg(hdev); + return hclge_rss_init_hw(hdev); } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 747cc8f..34a0905 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3329,67 +3329,28 @@ static int hclge_get_tc_size(struct hnae3_handle *handle) int hclge_rss_init_hw(struct hclge_dev *hdev) { - const u8 hfunc = HCLGE_RSS_HASH_ALGO_TOEPLITZ; struct hclge_vport *vport = hdev->vport; + u8 *rss_indir = vport[0].rss_indirection_tbl; + u16 rss_size = vport[0].alloc_rss_size; + u8 *key = vport[0].rss_hash_key; + u8 hfunc = vport[0].rss_algo; u16 tc_offset[HCLGE_MAX_TC_NUM]; - u8 rss_key[HCLGE_RSS_KEY_SIZE]; u16 tc_valid[HCLGE_MAX_TC_NUM]; u16 tc_size[HCLGE_MAX_TC_NUM]; - u32 *rss_indir = NULL; - u16 rss_size = 0, roundup_size; - const u8 *key; - int i, ret, j; - - rss_indir = kcalloc(HCLGE_RSS_IND_TBL_SIZE, sizeof(u32), GFP_KERNEL); - if (!rss_indir) - return -ENOMEM; - - /* Get default RSS key */ - netdev_rss_key_fill(rss_key, HCLGE_RSS_KEY_SIZE); - - /* Initialize RSS indirect table for each vport */ - for (j = 0; j < hdev->num_vmdq_vport + 1; j++) { - vport[j].rss_tuple_sets.ipv4_tcp_en = - HCLGE_RSS_INPUT_TUPLE_OTHER; - vport[j].rss_tuple_sets.ipv4_udp_en = - HCLGE_RSS_INPUT_TUPLE_OTHER; - vport[j].rss_tuple_sets.ipv4_sctp_en = - HCLGE_RSS_INPUT_TUPLE_SCTP; - vport[j].rss_tuple_sets.ipv4_fragment_en = - HCLGE_RSS_INPUT_TUPLE_OTHER; - vport[j].rss_tuple_sets.ipv6_tcp_en = - HCLGE_RSS_INPUT_TUPLE_OTHER; - vport[j].rss_tuple_sets.ipv6_udp_en = - HCLGE_RSS_INPUT_TUPLE_OTHER; - vport[j].rss_tuple_sets.ipv6_sctp_en = - HCLGE_RSS_INPUT_TUPLE_SCTP; - vport[j].rss_tuple_sets.ipv6_fragment_en = - HCLGE_RSS_INPUT_TUPLE_OTHER; - - for (i = 0; i < HCLGE_RSS_IND_TBL_SIZE; i++) { - vport[j].rss_indirection_tbl[i] = - i % vport[j].alloc_rss_size; - - /* vport 0 is for PF */ - if (j != 0) - continue; + u16 roundup_size; + int i, ret; - rss_size = vport[j].alloc_rss_size; - rss_indir[i] = vport[j].rss_indirection_tbl[i]; - } - } ret = hclge_set_rss_indir_table(hdev, rss_indir); if (ret) - goto err; + return ret; - key = rss_key; ret = hclge_set_rss_algo_key(hdev, hfunc, key); if (ret) - goto err; + return ret; ret = hclge_set_rss_input_tuple(hdev); if (ret) - goto err; + return ret; /* Each TC have the same queue size, and tc_size set to hardware is * the log2 of roundup power of two of rss_size, the acutal queue @@ -3399,8 +3360,7 @@ int hclge_rss_init_hw(struct hclge_dev *hdev) dev_err(&hdev->pdev->dev, "Configure rss tc size failed, invalid TC_SIZE = %d\n", rss_size); - ret = -EINVAL; - goto err; + return -EINVAL; } roundup_size = roundup_pow_of_two(rss_size); @@ -3417,12 +3377,50 @@ int hclge_rss_init_hw(struct hclge_dev *hdev) tc_offset[i] = rss_size * i; } - ret = hclge_set_rss_tc_mode(hdev, tc_valid, tc_size, tc_offset); + return hclge_set_rss_tc_mode(hdev, tc_valid, tc_size, tc_offset); +}
[PATCH iproute2-next 2/3] ipmroute: convert to output JSON
From: Stephen Hemminger Should be no change for non-json case except putting color on address if desired. Signed-off-by: Stephen Hemminger --- ip/ipmroute.c | 99 --- 1 file changed, 67 insertions(+), 32 deletions(-) diff --git a/ip/ipmroute.c b/ip/ipmroute.c index aa5029b44f41..83548caf4946 100644 --- a/ip/ipmroute.c +++ b/ip/ipmroute.c @@ -29,6 +29,7 @@ #include #include "utils.h" #include "ip_common.h" +#include "json_print.h" static void usage(void) __attribute__((noreturn)); @@ -53,13 +54,12 @@ struct rtfilter { int print_mroute(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) { - FILE *fp = (FILE *)arg; struct rtmsg *r = NLMSG_DATA(n); int len = n->nlmsg_len; struct rtattr *tb[RTA_MAX+1]; - char obuf[256]; - + const char *src, *dst; SPRINT_BUF(b1); + SPRINT_BUF(b2); __u32 table; int iif = 0; int family; @@ -103,30 +103,44 @@ int print_mroute(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) family = get_real_family(r->rtm_type, r->rtm_family); + open_json_object(NULL); if (n->nlmsg_type == RTM_DELROUTE) - fprintf(fp, "Deleted "); + print_bool(PRINT_ANY, "deleted", "Deleted ", true); if (tb[RTA_SRC]) - len = snprintf(obuf, sizeof(obuf), - "(%s, ", rt_addr_n2a_rta(family, tb[RTA_SRC])); + src = rt_addr_n2a_r(family, RTA_PAYLOAD(tb[RTA_SRC]), + RTA_DATA(tb[RTA_SRC]), b1, sizeof(b1)); else - len = sprintf(obuf, "(unknown, "); + src = "unknown"; + if (tb[RTA_DST]) - snprintf(obuf + len, sizeof(obuf) - len, -"%s)", rt_addr_n2a_rta(family, tb[RTA_DST])); + dst = rt_addr_n2a_r(family, RTA_PAYLOAD(tb[RTA_DST]), + RTA_DATA(tb[RTA_DST]), b2, sizeof(b2)); else - snprintf(obuf + len, sizeof(obuf) - len, "unknown) "); + dst = "unknown"; + + if (is_json_context()) { + print_string(PRINT_JSON, "src", NULL, src); + print_string(PRINT_JSON, "dst", NULL, dst); + } else { + char obuf[256]; + + snprintf(obuf, sizeof(obuf), "(%s,%s)", src, dst); + print_string(PRINT_FP, NULL, +"%-32s Iif: ", obuf); + } - fprintf(fp, "%-32s Iif: ", obuf); if (iif) - fprintf(fp, "%-10s ", ll_index_to_name(iif)); + print_color_string(PRINT_ANY, COLOR_IFNAME, + "iif", "%-10s ", ll_index_to_name(iif)); else - fprintf(fp, "unresolved "); + print_string(PRINT_ANY,"iif", "%s ", "unresolved"); if (tb[RTA_MULTIPATH]) { struct rtnexthop *nh = RTA_DATA(tb[RTA_MULTIPATH]); int first = 1; + open_json_array(PRINT_JSON, "multipath"); len = RTA_PAYLOAD(tb[RTA_MULTIPATH]); for (;;) { @@ -135,47 +149,65 @@ int print_mroute(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (nh->rtnh_len > len) break; + open_json_object(NULL); if (first) { - fprintf(fp, "Oifs: "); + print_string(PRINT_FP, NULL, "Oifs: ", NULL); first = 0; } - fprintf(fp, "%s", ll_index_to_name(nh->rtnh_ifindex)); + + print_color_string(PRINT_ANY, COLOR_IFNAME, + "oif", "%s", ll_index_to_name(nh->rtnh_ifindex)); + if (nh->rtnh_hops > 1) - fprintf(fp, "(ttl %d) ", nh->rtnh_hops); + print_uint(PRINT_ANY, + "ttl", "(ttl %u) ", nh->rtnh_hops); else - fprintf(fp, " "); + print_string(PRINT_FP, NULL, " ", NULL); + + close_json_object(); len -= NLMSG_ALIGN(nh->rtnh_len); nh = RTNH_NEXT(nh); } + close_json_array(PRINT_JSON, NULL); } - fprintf(fp, " State: %s", - r->rtm_flags & RTNH_F_UNRESOLVED ? "unresolved" : "resolved"); + + print_string(PRINT_ANY, "state", " State: %s", +(r->rtm_flags & RTNH_F_UNRESOLVED) ? "unresolved" : "resolved"); + if (r->rtm_flags & RTNH_F_OFFLOAD) - fprintf(fp, " offload"); + print_null(PRINT_ANY, "offload", " offload", NULL); + if (show_stats && tb[R
[PATCH iproute2-next 1/3] ipmaddr: json and color support
From: Stephen Hemminger Support printing mulitcast addresses in json and color mode. Output format is unchanged for normal use. Signed-off-by: Stephen Hemminger --- ip/ipmaddr.c | 69 +--- 1 file changed, 43 insertions(+), 26 deletions(-) diff --git a/ip/ipmaddr.c b/ip/ipmaddr.c index d7bf1f99f67e..a48499029e17 100644 --- a/ip/ipmaddr.c +++ b/ip/ipmaddr.c @@ -28,6 +28,7 @@ #include "rt_names.h" #include "utils.h" #include "ip_common.h" +#include "json_print.h" static struct { char *dev; @@ -193,50 +194,66 @@ static void read_igmp6(struct ma_info **result_p) static void print_maddr(FILE *fp, struct ma_info *list) { - fprintf(fp, "\t"); + print_string(PRINT_FP, NULL, "\t", NULL); + open_json_object(NULL); if (list->addr.family == AF_PACKET) { SPRINT_BUF(b1); - fprintf(fp, "link %s", ll_addr_n2a((unsigned char *)list->addr.data, - list->addr.bytelen, 0, - b1, sizeof(b1))); + + print_string(PRINT_FP, NULL, "link ", NULL); + print_color_string(PRINT_ANY, COLOR_MAC, "link", "%s", + ll_addr_n2a((void *)list->addr.data, list->addr.bytelen, + 0, b1, sizeof(b1))); } else { - switch (list->addr.family) { - case AF_INET: - fprintf(fp, "inet "); - break; - case AF_INET6: - fprintf(fp, "inet6 "); - break; - default: - fprintf(fp, "family %d ", list->addr.family); - break; - } - fprintf(fp, "%s", - format_host(list->addr.family, - -1, list->addr.data)); + print_string(PRINT_ANY, "family", "%-5s ", +family_name(list->addr.family)); + print_color_string(PRINT_ANY, ifa_family_color(list->addr.family), + "address", "%s", + format_host(list->addr.family, + -1, list->addr.data)); } + if (list->users != 1) - fprintf(fp, " users %d", list->users); + print_uint(PRINT_ANY, "users", " users %u", list->users); + if (list->features) - fprintf(fp, " %s", list->features); - fprintf(fp, "\n"); + print_string(PRINT_ANY, "features", " %s", list->features); + + print_string(PRINT_FP, NULL, "\n", NULL); + close_json_object(); } static void print_mlist(FILE *fp, struct ma_info *list) { int cur_index = 0; + new_json_obj(json); for (; list; list = list->next) { - if (oneline) { - cur_index = list->index; - fprintf(fp, "%d:\t%s%s", cur_index, list->name, _SL_); - } else if (cur_index != list->index) { + + if (list->index != cur_index || oneline) { + if (cur_index) { + close_json_array(PRINT_JSON, NULL); + close_json_object(); + } + open_json_object(NULL); + + print_uint(PRINT_ANY, "ifindex", "%d:", list->index); + print_color_string(PRINT_ANY, COLOR_IFNAME, + "ifname", "\t%s", list->name); + print_string(PRINT_FP, NULL, "%s", _SL_); cur_index = list->index; - fprintf(fp, "%d:\t%s\n", cur_index, list->name); + + open_json_array(PRINT_JSON, "maddr"); } + print_maddr(fp, list); } + if (cur_index) { + close_json_array(PRINT_JSON, NULL); + close_json_object(); + } + + delete_json_obj(); } static int multiaddr_list(int argc, char **argv) -- 2.16.1
[PATCH iproute2-next 3/3] ipmroute: better error message if no kernel mroute
From: Stephen Hemminger If kernel does not support the IP multicast address family, then it will report all routes (PF_UNSPEC). Give the user a better error message and abort the command. Signed-off-by: Stephen Hemminger --- ip/ipmroute.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/ip/ipmroute.c b/ip/ipmroute.c index 83548caf4946..a4a8fb2bbe1f 100644 --- a/ip/ipmroute.c +++ b/ip/ipmroute.c @@ -75,10 +75,11 @@ int print_mroute(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) fprintf(stderr, "BUG: wrong nlmsg len %d\n", len); return -1; } + if (r->rtm_type != RTN_MULTICAST) { - fprintf(stderr, "Not a multicast route (type: %s)\n", - rtnl_rtntype_n2a(r->rtm_type, b1, sizeof(b1))); - return 0; + fprintf(stderr, + "Non multicast route received, kernel does support IP multicast?\n"); + return -1; } parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len); -- 2.16.1
[PATCH iproute2-next 0/3] ip: multicast commands JSON
From: Stephen Hemminger Some more JSON support and report better error if kernel is configured without multicast. Stephen Hemminger (3): ipmaddr: json and color support ipmroute: convert to output JSON ipmroute: better error message if no kernel mroute ip/ipmaddr.c | 69 -- ip/ipmroute.c | 106 +++--- 2 files changed, 114 insertions(+), 61 deletions(-) -- 2.16.1
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Thu, Mar 08, 2018 at 03:07:01PM -0800, Alexei Starovoitov wrote: > On 3/7/18 5:23 PM, Luis R. Rodriguez wrote: > > > > request_module() has its own world though too. How often in your proof of > > concept is request_module() called? How many times do you envision it being > > called? > > once. What about other users later in the kernel? > > > +static int run_umh(struct file *file) > > > +{ > > > + struct subprocess_info *sub_info = call_usermodehelper_setup_file(file); > > > + > > > + if (!sub_info) > > > + return -ENOMEM; > > > + return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC); > > > +} > > > > run_umh() calls the program and waits. Note that while we are running a UMH > > we > > can't suspend. What if they take forever, who is hosing them down with an > > equivalent: > > > > schedule(); > > try_to_freeze(); > > > > Say they are buggy and never return, will they stall suspend, have you > > tried? > > > > call_usermodehelper_exec() uses helper_lock() which in turn is used for > > umh's accounting for number of running umh's. One of the sad obscure uses > > for this is to deal with suspend/resume. Refer to __usermodehelper* calls > > on kernel/power/process.c > > > > Note how you use UMH_WAIT_EXEC too, so this is all synchronous. > > looks like you misread this code > > #define UMH_NO_WAIT 0 /* don't wait at all */ > #define UMH_WAIT_EXEC 1 /* wait for the exec, but not the process */ > #define UMH_WAIT_PROC 2 /* wait for the process to complete */ > #define UMH_KILLABLE4 /* wait for EXEC/PROC killable */ I certainly did get the incorrect impression this was a sync op, sorry about that. In that case its odd to see a request_module() call, when what is really meant is more of a request_module_nowait(), its also UMH_NO_WAIT on the modprobe call itself as well. In fact I think I'd much prefer at the very least to see a different wrapper for this, even if its using the same bolts and nuts underneath for userspace processes compiled on the kernel. The sanity checks in place for request_module() may change later and this use case seems rather simple and limited. Keeping tabs on this type of users seems desirable. At the *very least* diff --git a/include/linux/kmod.h b/include/linux/kmod.h index 40c89ad4bea6..7530e00da03b 100644 --- a/include/linux/kmod.h +++ b/include/linux/kmod.h @@ -37,11 +37,13 @@ extern __printf(2, 3) int __request_module(bool wait, const char *name, ...); #define request_module(mod...) __request_module(true, mod) #define request_module_nowait(mod...) __request_module(false, mod) +#define request_umh_proc(mod...) __request_module(false, mod) #define try_then_request_module(x, mod...) \ ((x) ?: (__request_module(true, mod), (x))) #else static inline int request_module(const char *name, ...) { return -ENOSYS; } static inline int request_module_nowait(const char *name, ...) { return -ENOSYS; } +static inline int request_umh_proc(const char *name, ...) { return -ENOSYS; } #define try_then_request_module(x, mod...) (x) #endif Modulo, kernel/umh.c is already its own thing, so pick a name that helps identify these things clearly. > and the rest of your concerns with suspend/resume are not applicable any > more. Agreed, except it does still mean these userspace processes are limited to execution only the kernel is up, and not going to suspend, but I think that's a proper sanity check by the umh API. > bpftiler.ko is started once and runs non-stop from there on. > Unless it crashes, but it's a different discussion. Sure. Luis
Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()
On Thu, Mar 8, 2018 at 5:35 PM, Linus Torvalds wrote: > I don't want to weaken the type enforcement, and I _thought_ you had > done that __builtin_types_compatible_p() to keep it in place. I thought so too (that originally came from Josh), but on removal, I was surprised that the checking was retained. :) > But if that's not why you did it, then why was it there at all? If the > type warning shows through even if it's in the other expression, then > just a > > > #define __min(t1, t2, x, y) \ > __builtin_choose_expr( \ > __builtin_constant_p(x) & \ > __builtin_constant_p(y),\ > (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y), \ > __single_eval_min(t1, t2, \ >... > > would seem to be sufficient? > > Because logically, the only thing that matters is that x and y don't > have any side effects and can be evaluated twice, and > "__builtin_constant_p()" is already a much stronger version of that. > > Hmm? The __builtin_types_compatible_p() just doesn't seem to matter > for the only thing I thought it was there for. Yup, agreed. I'll drop it. -Kees -- Kees Cook Pixel Security
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Thu, Mar 8, 2018 at 5:38 PM, Linus Torvalds wrote: > On Thu, Mar 8, 2018 at 4:59 PM, Andy Lutomirski wrote: >> >> Also, I don't see how this is any more exploitable than any other >> init_module(). > > Absolutely. If Kees doesn't trust the files to be loaded, an > executable - even if it's running with root privileges and in the > initns - is still fundamentally weaker than a kernel module. > > So I don't understand the security argument AT ALL. It's nonsensical. > The executable loading does all the same security checks that the > module loading does, including the signing check. > > And the whole point is that we can now do things with building and > loading a ebpf rule instead of having a full module. My concerns are mostly about crossing namespaces. If a container triggers an autoload, the result runs in the init_ns. So, really, there's nothing new from my perspective, except that it's in userspace instead of in the kernel. Perhaps it's an orthogonal concern. -Kees -- Kees Cook Pixel Security
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Thu, Mar 8, 2018 at 4:59 PM, Andy Lutomirski wrote: > > Also, I don't see how this is any more exploitable than any other > init_module(). Absolutely. If Kees doesn't trust the files to be loaded, an executable - even if it's running with root privileges and in the initns - is still fundamentally weaker than a kernel module. So I don't understand the security argument AT ALL. It's nonsensical. The executable loading does all the same security checks that the module loading does, including the signing check. And the whole point is that we can now do things with building and loading a ebpf rule instead of having a full module. Linus
Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()
On Thu, Mar 8, 2018 at 4:45 PM, Kees Cook wrote: > > Rasmus mentioned this too. What I said there was that I was shy to > make that change, since we already can't mix that kind of thing with > the existing min()/max() implementation. The existing min()/max() is > already extremely strict, so there are no instances of this in the > tree. Yes, but I also didn't want to add any new cases in case people add new min/max() users. But: > If I explicitly add one, I see this with or without the patch: > > In file included from drivers/misc/lkdtm.h:7:0, > from drivers/misc/lkdtm_core.c:33: > drivers/misc/lkdtm_core.c: In function ‘lkdtm_module_exit’: > ./include/linux/kernel.h:809:16: warning: comparison of distinct > pointer types lacks a cast Oh, ok, in that case, just drop the __builtin_types_compatible_p() entirely. It's not adding anything. I was expecting the non-chosen expression in __builtin_choose_expr() to not cause type warnings. I'm actually surprised it does. Type games is why __builtin_choose_expr() tends to exist in the first place. > So are you saying you _want_ the type enforcement weakened here, or > that I should just not use __builtin_types_compatible_p()? I don't want to weaken the type enforcement, and I _thought_ you had done that __builtin_types_compatible_p() to keep it in place. But if that's not why you did it, then why was it there at all? If the type warning shows through even if it's in the other expression, then just a #define __min(t1, t2, x, y) \ __builtin_choose_expr( \ __builtin_constant_p(x) & \ __builtin_constant_p(y),\ (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y), \ __single_eval_min(t1, t2, \ ... would seem to be sufficient? Because logically, the only thing that matters is that x and y don't have any side effects and can be evaluated twice, and "__builtin_constant_p()" is already a much stronger version of that. Hmm? The __builtin_types_compatible_p() just doesn't seem to matter for the only thing I thought it was there for. Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Fri, Mar 09, 2018 at 01:04:39AM +, Andy Lutomirski wrote: > On Fri, Mar 9, 2018 at 12:57 AM, Alexei Starovoitov wrote: > > On 3/8/18 4:24 PM, Kees Cook wrote: > >> > >> As Andy asked earlier, why not DYN too to catch PIE executables? Seems > >> like forcing the userspace helper to be non-PIE would defeat some of > >> the userspace defenses in use in most distros. > > > > > > because we don't add features without concrete users. > > I disagree here. If you're going to add a magic trick that triggers > an execve(), then I think you should either support *both* standard, > widely-used types of ELF programs or you should give a compelling use > case that works for ET_EXEC but not for ET_DYN. Keep in mind that > many distros have a very strong preference for ET_DYN. misunderstanding here is that this bpfiler.ko is part of _kernel build_. Kernel decides how it's build. Nothing to do with distros. Current Makefile is very dumb and it's built with HOSTCC: https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/bpfilter/Makefile?h=ipt_bpf but it will be standalone with CC before final to make sure crosscompiling works.
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Thu, Mar 8, 2018 at 4:57 PM, Alexei Starovoitov wrote: > The above are three paragraphs of security paranoia without single > concrete example of a security issue. How is running an arbitrary ELF as full init_ns root from a container not a concrete example? I'm not saying this approach can never happen. And this isn't paranoia. There are very real security boundary violations in this model, IMO. Though, as Andy says, it's more about autoloading than umh itself. I just don't want to extend that problem further. >> As Andy asked earlier, why not DYN too to catch PIE executables? Seems >> like forcing the userspace helper to be non-PIE would defeat some of >> the userspace defenses in use in most distros. > > > because we don't add features without concrete users. It's just a two-line change, and then it would work on distros that build PIE-by-default. That seems like a concrete use-case. >> The exec.c changes should be split into a separate patch. Something >> feels incorrectly refactored here, though. Passing all three of fd, >> filename, and file to __do_execve_file() seems wrong; perhaps the fd >> to file handling needs to happen externally in what you have here as >> do_execveat_common()? The resulting __do_execve_file() has repeated >> conditionals on filename... maybe what I object to is being able to >> pass a NULL filename at all. The filename can be (painfully) >> reconstructed from the struct file. > > reconstruct the path and instantly introduce the race between execing > something by path vs prior check that it's actual elf of already > opened file ?! > excellent suggestion to improve security. I'm not sure why you're being so hostile. We're both interested in improving the kernel. Path names aren't stable, but they provide good _debugging_ about things when needed. For example, the LSM hooks, if they report on one of these loads, can produce a hint to the user about what happened. Passing "/dev/null" around is confusing at the very least. The ELF is absolutely not /dev/null. Why make someone guess? >>> [...] >>> diff --git a/kernel/module.c b/kernel/module.c >>> index ad2d420024f6..6cfa35795741 100644 >>> --- a/kernel/module.c >>> +++ b/kernel/module.c >>> [...] >>> @@ -3870,14 +3896,21 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char >>> __user *, uargs, int, flags) >>> |MODULE_INIT_IGNORE_VERMAGIC)) >>> return -EINVAL; >>> >>> - err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX, >>> - READING_MODULE); >>> + f = fdget(fd); >>> + if (!f.file) >>> + return -EBADF; >>> + >>> + err = kernel_read_file(f.file, &hdr, &size, INT_MAX, >>> READING_MODULE); >> >> >> For the LSM subsystem, I think this should also get it's own enum >> kernel_read_file_id. This is really no longer a kernel module... > > > at this point it's a _file_. It could have been text file just well. > If lsm is thinking that at this point kernel is processing > kernel module that lsm is badly broken. Again, this is about making things more discoverable. We already know if we're loading a kernel module or a umh ELF. Passing this information to the LSM is valuable to the LSM to distinguish between types of files. They have very different effects on the system, for example. The issue is about validating intent with target. "Is this kernel module allowed?" and "Is this umh ELF allowed?" are better questions that "Is something loaded through finit_module() allowed?" Note already that the LSM distinguishes between many other file types in kernel_read_file*(). -Kees -- Kees Cook Pixel Security
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Fri, Mar 09, 2018 at 12:59:36AM +, Andy Lutomirski wrote: > > Alexei, can you give an example use case? I'm sure it's upthread > somewhere, but I'm having trouble finding it. at the time of iptable's setsockopt() the kernel will do err = request_module("bpfilter"); once. The rough POC code: https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25 > Also, I just tested this concept a bit. Depmod invoked explicitly on > an ET_EXEC with a.ko extension gets mad, but depmod -a on a kernel > that has a "module" like that seems to work fine. Go figure. right. that's with the current patch. In v2 I require .modinfo section to make sure license is specified, but depmod still not very happy: $ depmod /lib/modules/`uname -r`/kernel/net/bpfilter/bpfilter.ko depmod: ERROR: Bad version passed /lib/modules/4.16.0-rc4-00799-g1716f0aa3039-dirty/kernel/net/bpfilter/bpfilter.ko I'm not sure it's worth to silence it, since as you noticed 'depmod -a' works.
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Fri, Mar 9, 2018 at 12:57 AM, Alexei Starovoitov wrote: > On 3/8/18 4:24 PM, Kees Cook wrote: >> >> As Andy asked earlier, why not DYN too to catch PIE executables? Seems >> like forcing the userspace helper to be non-PIE would defeat some of >> the userspace defenses in use in most distros. > > > because we don't add features without concrete users. I disagree here. If you're going to add a magic trick that triggers an execve(), then I think you should either support *both* standard, widely-used types of ELF programs or you should give a compelling use case that works for ET_EXEC but not for ET_DYN. Keep in mind that many distros have a very strong preference for ET_DYN. Or you could argue that ET_DYN requires tooling changes, but I think it's awkward to ask the tooling to change in advance of the kernel being willing to actually invoke the thing. I'm not actually convinced that any tooling changes would be needed, though.
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Fri, Mar 9, 2018 at 12:24 AM, Kees Cook wrote: > How is this not marked [RFC]? :) > > On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov wrote: >> As the first step in development of bpfilter project [1] the request_module() >> code is extended to allow user mode helpers to be invoked. Idea is that >> user mode helpers are built as part of the kernel build and installed as >> traditional kernel modules with .ko file extension into distro specified >> location, such that from a distribution point of view, they are no different >> than regular kernel modules. Thus, allow request_module() logic to load such >> user mode helper (umh) modules via: >> >> request_module("foo") -> >> call_umh("modprobe foo") -> >> sys_finit_module(FD of /lib/modules/.../foo.ko) -> >> call_umh(struct file) > > Yikes. This means autoloaded artbitrary exec() now, instead of > autoloading just loading arbitrary modules? That seems extremely bad. > This just extends all the problems we've had with defining security > boundaries with modules out to umh too. I would need some major > convincing that this can be made safe. > > It's easy for container to trigger arbitrary module loading, so if it > can find/use a similar one of the bugs we've seen in the past to > redirect the module loading we don't just get new module-created > attack surface added to the kernel, but we again get arbitrary ELF > running in the root ns (not in the container). And in the insane case > of a container with CAP_SYS_MODULE, this is a trivial escape without > even needing to build a special kernel module: the root ns, running > with all privileges runs an arbitrary ELF. > > As it stands, I have to NAK this. At the very least, you need to solve > the execution environment problems here: the ELF should run with no > greater privileges than what loaded the module, and very importantly, > must not be allowed to bypass these checks through autoloading. What > _triggered_ the autoload must be the environment, not the "modprobe", > since that's running with full privileges. Basically, we need to fix > module autoloading first, or at least a significant subset of the > problems: https://lkml.org/lkml/2017/11/27/754 I disagree. If we're going to somehow have ELF binaries that pretend to be modules, then they should run with high privilege and, more importantly, should run in a context independent of whoever triggered autoloading. Also, I don't see how this is any more exploitable than any other init_module(). If someone has CAP_SYS_MODULE or autoload privileges and they can trigger init_module() on a file, then they're granting maximum privilege to the contents of that file. So who cares if that file is a kernel module or an ELF binary? Alexei, can you give an example use case? I'm sure it's upthread somewhere, but I'm having trouble finding it. Also, I just tested this concept a bit. Depmod invoked explicitly on an ET_EXEC with a.ko extension gets mad, but depmod -a on a kernel that has a "module" like that seems to work fine. Go figure.
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On 3/8/18 4:24 PM, Kees Cook wrote: How is this not marked [RFC]? :) On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov wrote: As the first step in development of bpfilter project [1] the request_module() code is extended to allow user mode helpers to be invoked. Idea is that user mode helpers are built as part of the kernel build and installed as traditional kernel modules with .ko file extension into distro specified location, such that from a distribution point of view, they are no different than regular kernel modules. Thus, allow request_module() logic to load such user mode helper (umh) modules via: request_module("foo") -> call_umh("modprobe foo") -> sys_finit_module(FD of /lib/modules/.../foo.ko) -> call_umh(struct file) Yikes. This means autoloaded artbitrary exec() now, instead of autoloading just loading arbitrary modules? That seems extremely bad. This just extends all the problems we've had with defining security boundaries with modules out to umh too. I would need some major convincing that this can be made safe. It's easy for container to trigger arbitrary module loading, so if it can find/use a similar one of the bugs we've seen in the past to redirect the module loading we don't just get new module-created attack surface added to the kernel, but we again get arbitrary ELF running in the root ns (not in the container). And in the insane case of a container with CAP_SYS_MODULE, this is a trivial escape without even needing to build a special kernel module: the root ns, running with all privileges runs an arbitrary ELF. As it stands, I have to NAK this. At the very least, you need to solve the execution environment problems here: the ELF should run with no greater privileges than what loaded the module, and very importantly, must not be allowed to bypass these checks through autoloading. What _triggered_ the autoload must be the environment, not the "modprobe", since that's running with full privileges. Basically, we need to fix module autoloading first, or at least a significant subset of the problems: https://lkml.org/lkml/2017/11/27/754 The above are three paragraphs of security paranoia without single concrete example of a security issue. Such approach enables kernel to delegate functionality traditionally done by kernel modules into user space processes (either root or !root) and reduces security attack surface of such new code, meaning in case of potential bugs only the umh would crash but not the kernel. Another advantage coming with that would be that bpfilter.ko can be debugged and tested out of user space as well (e.g. opening the possibility to run all clang sanitizers, fuzzers or test suites for checking translation). Also, such architecture makes the kernel/user boundary very precise: control plane is done by the user space while data plane stays in the kernel. It's easy to distinguish "umh module" from traditional kernel module: $ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type Type: EXEC (Executable file) As Andy asked earlier, why not DYN too to catch PIE executables? Seems like forcing the userspace helper to be non-PIE would defeat some of the userspace defenses in use in most distros. because we don't add features without concrete users. $ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type Type: REL (Relocatable file) Since umh can crash, can be oom-ed by the kernel, killed by admin, the subsystem that uses them (like bpfilter) need to manage life time of umh on its own, so module infra doesn't do any accounting of them. They don't appear in "lsmod" and cannot be "rmmod". Multiple request_module("umh") will load multiple umh.ko processes. Similar to kernel modules the kernel will be tainted if "umh module" has invalid signature. [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_747551_&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=pMlEM-qKOmGljadUKAdwKBpuYHQRXzApvSGkZFIT4Gg&s=0-vP6LH-GxXnL7LuV-KfMl1NbqsVJUINSygoVa9R6nk&e= Signed-off-by: Alexei Starovoitov --- fs/exec.c | 40 +++- include/linux/binfmts.h | 1 + include/linux/umh.h | 3 +++ kernel/module.c | 43 ++- kernel/umh.c| 24 +--- 5 files changed, 94 insertions(+), 17 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 7eb8d21bcab9..0483c438de7d 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm) /* * sys_execve() executes a new program. */ -static int do_execveat_common(int fd, struct filename *filename, - struct user_arg_ptr argv, - struct user_arg_ptr envp, - int flags) +static int __do_execve_file(int fd, struct filename *filename, +
Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()
On Thu, Mar 8, 2018 at 3:48 PM, Linus Torvalds wrote: > On Thu, Mar 8, 2018 at 1:40 PM, Kees Cook wrote: >> +#define __min(t1, t2, x, y)\ >> + __builtin_choose_expr(__builtin_constant_p(x) &&\ >> + __builtin_constant_p(y) &&\ >> + __builtin_types_compatible_p(t1, t2), \ >> + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\ > > I understand why you use __builtin_types_compatible_p(), but please don't. > > It will mean that trivial constants like "5" and "sizeof(x)" won't > simplify, because they have different types. Rasmus mentioned this too. What I said there was that I was shy to make that change, since we already can't mix that kind of thing with the existing min()/max() implementation. The existing min()/max() is already extremely strict, so there are no instances of this in the tree. If I explicitly add one, I see this with or without the patch: In file included from drivers/misc/lkdtm.h:7:0, from drivers/misc/lkdtm_core.c:33: drivers/misc/lkdtm_core.c: In function ‘lkdtm_module_exit’: ./include/linux/kernel.h:809:16: warning: comparison of distinct pointer types lacks a cast (void) (&max1 == &max2); \ ^ ./include/linux/kernel.h:818:2: note: in expansion of macro ‘__max’ __max(typeof(x), typeof(y), \ ^ ./include/linux/printk.h:308:34: note: in expansion of macro ‘max’ printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__) ^~~ drivers/misc/lkdtm_core.c:500:2: note: in expansion of macro ‘pr_info’ pr_info("%lu\n", max(16, sizeof(unsigned long))); ^~~ > The ?: will give the right combined type anyway, and if you want the > type comparison warning, just add a comma-expression with something > like like > >(t1 *)1 == (t2 *)1 > > to get the type compatibility warning. When I tried removing __builtin_types_compatible_p(), I still got the type-check warning because I think the preprocessor still sees the "(void) (&min1 == &min2)" before optimizing? So, I technically _can_ drop the __builtin_types_compatible_p(), and still keep the type warning. :P > Yeah, yeah, maybe none of the VLA cases triggered that, but it seems > silly to not just get that obvious constant case right. > > Hmm? So are you saying you _want_ the type enforcement weakened here, or that I should just not use __builtin_types_compatible_p()? Thanks! -Kees -- Kees Cook Pixel Security
[PATCH net] macvlan: filter out unsupported feature flags
Adding a macvlan device on top of a lowerdev that supports the xfrm offloads fails with a new regression: # ip link add link ens1f0 mv0 type macvlan RTNETLINK answers: Operation not permitted Tracing down the failure shows that the macvlan device inherits the NETIF_F_HW_ESP and NETIF_F_HW_ESP_TX_CSUM feature flags from the lowerdev, but with no dev->xfrmdev_ops API filled in, it doesn't actually support xfrm. When the request is made to add the new macvlan device, the XFRM listener for NETDEV_REGISTER calls xfrm_api_check() which fails the new registration because dev->xfrmdev_ops is NULL. The macvlan creation succeeds when we filter out the ESP feature flags in macvlan_fix_features(), so let's filter them out like we're already filtering out ~NETIF_F_NETNS_LOCAL. When XFRM support is added in the future, we can add the flags into MACVLAN_FEATURES. This same problem could crop up in the future with any other new feature flags, so let's filter out any flags that aren't defined as supported in macvlan. Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API") Reported-by: Alexey Kodanev Signed-off-by: Shannon Nelson --- drivers/net/macvlan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 8fc02d9..725f4b4 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -1036,7 +1036,7 @@ static netdev_features_t macvlan_fix_features(struct net_device *dev, lowerdev_features &= (features | ~NETIF_F_LRO); features = netdev_increment_features(lowerdev_features, features, mask); features |= ALWAYS_ON_FEATURES; - features &= ~NETIF_F_NETNS_LOCAL; + features &= (ALWAYS_ON_FEATURES | MACVLAN_FEATURES); return features; } -- 2.7.4
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
How is this not marked [RFC]? :) On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov wrote: > As the first step in development of bpfilter project [1] the request_module() > code is extended to allow user mode helpers to be invoked. Idea is that > user mode helpers are built as part of the kernel build and installed as > traditional kernel modules with .ko file extension into distro specified > location, such that from a distribution point of view, they are no different > than regular kernel modules. Thus, allow request_module() logic to load such > user mode helper (umh) modules via: > > request_module("foo") -> > call_umh("modprobe foo") -> > sys_finit_module(FD of /lib/modules/.../foo.ko) -> > call_umh(struct file) Yikes. This means autoloaded artbitrary exec() now, instead of autoloading just loading arbitrary modules? That seems extremely bad. This just extends all the problems we've had with defining security boundaries with modules out to umh too. I would need some major convincing that this can be made safe. It's easy for container to trigger arbitrary module loading, so if it can find/use a similar one of the bugs we've seen in the past to redirect the module loading we don't just get new module-created attack surface added to the kernel, but we again get arbitrary ELF running in the root ns (not in the container). And in the insane case of a container with CAP_SYS_MODULE, this is a trivial escape without even needing to build a special kernel module: the root ns, running with all privileges runs an arbitrary ELF. As it stands, I have to NAK this. At the very least, you need to solve the execution environment problems here: the ELF should run with no greater privileges than what loaded the module, and very importantly, must not be allowed to bypass these checks through autoloading. What _triggered_ the autoload must be the environment, not the "modprobe", since that's running with full privileges. Basically, we need to fix module autoloading first, or at least a significant subset of the problems: https://lkml.org/lkml/2017/11/27/754 > Such approach enables kernel to delegate functionality traditionally done > by kernel modules into user space processes (either root or !root) and > reduces security attack surface of such new code, meaning in case of > potential bugs only the umh would crash but not the kernel. Another > advantage coming with that would be that bpfilter.ko can be debugged and > tested out of user space as well (e.g. opening the possibility to run > all clang sanitizers, fuzzers or test suites for checking translation). > Also, such architecture makes the kernel/user boundary very precise: > control plane is done by the user space while data plane stays in the kernel. > > It's easy to distinguish "umh module" from traditional kernel module: > > $ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type > Type: EXEC (Executable file) As Andy asked earlier, why not DYN too to catch PIE executables? Seems like forcing the userspace helper to be non-PIE would defeat some of the userspace defenses in use in most distros. > $ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type > Type: REL (Relocatable file) > > Since umh can crash, can be oom-ed by the kernel, killed by admin, > the subsystem that uses them (like bpfilter) need to manage life > time of umh on its own, so module infra doesn't do any accounting > of them. They don't appear in "lsmod" and cannot be "rmmod". > Multiple request_module("umh") will load multiple umh.ko processes. > > Similar to kernel modules the kernel will be tainted if "umh module" > has invalid signature. > > [1] https://lwn.net/Articles/747551/ > > Signed-off-by: Alexei Starovoitov > --- > fs/exec.c | 40 +++- > include/linux/binfmts.h | 1 + > include/linux/umh.h | 3 +++ > kernel/module.c | 43 ++- > kernel/umh.c| 24 +--- > 5 files changed, 94 insertions(+), 17 deletions(-) > > diff --git a/fs/exec.c b/fs/exec.c > index 7eb8d21bcab9..0483c438de7d 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm) > /* > * sys_execve() executes a new program. > */ > -static int do_execveat_common(int fd, struct filename *filename, > - struct user_arg_ptr argv, > - struct user_arg_ptr envp, > - int flags) > +static int __do_execve_file(int fd, struct filename *filename, > + struct user_arg_ptr argv, > + struct user_arg_ptr envp, > + int flags, struct file *file) > { > char *pathbuf = NULL; > struct linux_binprm *bprm; > - struct file *file; > struct files_struct *displaced; > i
Re: [RFC v3 net-next 00/18] Time based packet transmission
Hi, On 03/08/2018 02:54 PM, Henrik Austad wrote: > Just looking at the timestamp when the frames were received. They should be > sent at regular intervals if I read udp_tai.c correctly, so the assumption > was that the timestamp from tcpdump should give an inkling to how well it > worked. > > I set it up to send a frame every 10ms and computed the diff between each > UDP packet received. Nothing fancy, just tcpdump and grep for the > timestamp and look at the distribution. Ok, I see it now. Just as a reference, this is how I've been running tcpdump on my tests: $ tcpdump -i enp3s0 -w foo.pcap -j adapter_unsynced \ -tt --time-stamp-precision=nano udp port 7788 -c 1 > >>> I have to dig more into why this is happening, a lot frames delayed much >>> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One >>> obvious fix is move some hw around and do a direct link, but I didn't have >>> time for that right now. >>> >>> I'm very interested in doing what Richard's original test was when he used >>> ptp-synched clocks and also used hw receive-time and compared with expected >>> tx-time. So, while I'm getting that up and running, I thought I should >>> share the early results. >> >> Sure, thanks. Which delta and clockid are you using, please? > > I used the example provided in -00, > > tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \ > map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 > > tc qdisc add dev eth2 parent 100:1 tbs offload delta 10 clockid \ > CLOCK_REALTIME sorting The delta value is highly dependent on the system. I recommend playing around with it a bit before running long tests. On my KabyLake desktop I noticed that 150us is quite reliable value, for example. (same kernel as yours, and no preempt-rt applied) But that is not the issue here it seems. > >> Also, was this clock synchronized to the PHC? You need that for hw offload >> with >> sorting enabled. > > Hmm, good point, no, NIC clock was not synchronized, I'll do that in the > next round for both sender and receiver! Oh, then you need to get that setup first. Here I synchronize both PHCs over the network first with ptp4l: Rx) $ ptp4l --summary_interval=3 -i enp3s0 -m -2 Tx) $ ptp4l --summary_interval=3 -i enp3s0 -s -m -2 & My Rx is the PTP master and the Tx is the PTP slave. Then I synchronize the PHC to the system clock on the Tx side only: Tx) $ phc2sys -a -r -r -u 8 & And udp_tai is using CLOCK_REALTIME. The UTC vs TAI 37s offset makes no difference for this test specifically because I compensate for it when calculating the offsets on the Rx side. For the next patchset version I will be providing a more complete set of testing instructions. I hope that helps for now. Thanks, Jesus
[PULL resend] virtio: bugfix
Looks like my pull had corrupted headers. Resending with fixed up ones. The following changes since commit 4a3928c6f8a53fa1aed28ccba227742486e8ddcb: Linux 4.16-rc3 (2018-02-25 18:50:41 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus for you to fetch changes up to e82df670235138575b37ff0ec24412a471efd97f: virtio_ring: fix num_free handling in error case (2018-03-01 18:53:38 +0200) virtio: bugfix This includes a bugfix for error handling in virtio. Signed-off-by: Michael S. Tsirkin Tiwei Bie (1): virtio_ring: fix num_free handling in error case drivers/virtio/virtio_ring.c | 2 -- 1 file changed, 2 deletions(-)
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On Thu, Mar 8, 2018 at 11:51 PM, Mickaël Salaün wrote: > > On 07/03/2018 02:21, Andy Lutomirski wrote: >> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün wrote: >>> >>> On 06/03/2018 23:46, Tycho Andersen wrote: On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote: >>> Suppose I'm writing a container manager. I want to run "mount" in the >>> container, but I don't want to allow moun() in general and I want to >>> emulate certain mount() actions. I can write a filter that catches >>> mount using seccomp and calls out to the container manager for help. >>> This isn't theoretical -- Tycho wants *exactly* this use case to be >>> supported. >> >> Well, I think this use case should be handled with something like >> LD_PRELOAD and a helper library. FYI, I did something like this: >> https://github.com/stemjail/stemshim > > I doubt that will work for containers. Containers that use user > namespaces and, for example, setuid programs aren't going to honor > LD_PRELOAD. Or anything that calls syscalls directly, like go programs. >>> >>> That's why the vDSO-like approach. Enforcing an access control is not >>> the issue here, patching a buggy userland (without patching its code) is >>> the issue isn't it? >>> >>> As far as I remember, the main problem is to handle file descriptors >>> while "emulating" the kernel behavior. This can be done with a "shim" >>> code mapped in every processes. Chrome used something like this (in a >>> previous sandbox mechanism) as a kind of emulation (with the current >>> seccomp-bpf ). I think it should be doable to replace the (userland) >>> emulation code with an IPC wrapper receiving file descriptors through >>> UNIX socket. >>> >> >> Can you explain exactly what you mean by "vDSO-like"? >> >> When a 64-bit program does a syscall, it just executes the SYSCALL >> instruction. The vDSO isn't involved at all. 32-bit programs usually >> go through the vDSO, but not always. >> >> It could be possible to force-load a DSO into an entire container and >> rig up seccomp to intercept all SYSCALLs not originating from the DSO >> such that they merely redirect control to the DSO, but that seems >> quite messy. > > vDSO is a code mapped for all processes. As you said, these processes > may use it or not. What I was thinking about is to use the same concept, > i.e. map a "shim" code into each processes pertaining to a particular > hierarchy (the same way seccomp filters are inherited across processes). > With a seccomp filter matching some syscall (e.g. mount, open), it is > possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This > shim code should then be able to emulate/patch what is needed, even > faking a file opening by receiving a file descriptor through a UNIX > socket. As did the Chrome sandbox, the seccomp filter may look at the > calling address to allow the shim code to call syscalls without being > catched, if needed. However, relying on SIGSYS may not fit with > arbitrary code. Using a new SECCOMP_RET_EMULATE (?) may be used to jump > to a specific process address, to emulate the syscall in an easier way > than only relying on a {c,e}BPF program. > This could indeed be done, but I think that Tycho's approach is much cleaner and probably faster.
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On 07/03/2018 02:21, Andy Lutomirski wrote: > On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün wrote: >> >> On 06/03/2018 23:46, Tycho Andersen wrote: >>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote: >> Suppose I'm writing a container manager. I want to run "mount" in the >> container, but I don't want to allow moun() in general and I want to >> emulate certain mount() actions. I can write a filter that catches >> mount using seccomp and calls out to the container manager for help. >> This isn't theoretical -- Tycho wants *exactly* this use case to be >> supported. > > Well, I think this use case should be handled with something like > LD_PRELOAD and a helper library. FYI, I did something like this: > https://github.com/stemjail/stemshim I doubt that will work for containers. Containers that use user namespaces and, for example, setuid programs aren't going to honor LD_PRELOAD. >>> >>> Or anything that calls syscalls directly, like go programs. >> >> That's why the vDSO-like approach. Enforcing an access control is not >> the issue here, patching a buggy userland (without patching its code) is >> the issue isn't it? >> >> As far as I remember, the main problem is to handle file descriptors >> while "emulating" the kernel behavior. This can be done with a "shim" >> code mapped in every processes. Chrome used something like this (in a >> previous sandbox mechanism) as a kind of emulation (with the current >> seccomp-bpf ). I think it should be doable to replace the (userland) >> emulation code with an IPC wrapper receiving file descriptors through >> UNIX socket. >> > > Can you explain exactly what you mean by "vDSO-like"? > > When a 64-bit program does a syscall, it just executes the SYSCALL > instruction. The vDSO isn't involved at all. 32-bit programs usually > go through the vDSO, but not always. > > It could be possible to force-load a DSO into an entire container and > rig up seccomp to intercept all SYSCALLs not originating from the DSO > such that they merely redirect control to the DSO, but that seems > quite messy. vDSO is a code mapped for all processes. As you said, these processes may use it or not. What I was thinking about is to use the same concept, i.e. map a "shim" code into each processes pertaining to a particular hierarchy (the same way seccomp filters are inherited across processes). With a seccomp filter matching some syscall (e.g. mount, open), it is possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This shim code should then be able to emulate/patch what is needed, even faking a file opening by receiving a file descriptor through a UNIX socket. As did the Chrome sandbox, the seccomp filter may look at the calling address to allow the shim code to call syscalls without being catched, if needed. However, relying on SIGSYS may not fit with arbitrary code. Using a new SECCOMP_RET_EMULATE (?) may be used to jump to a specific process address, to emulate the syscall in an easier way than only relying on a {c,e}BPF program. signature.asc Description: OpenPGP digital signature
Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()
On Thu, Mar 8, 2018 at 1:40 PM, Kees Cook wrote: > +#define __min(t1, t2, x, y)\ > + __builtin_choose_expr(__builtin_constant_p(x) &&\ > + __builtin_constant_p(y) &&\ > + __builtin_types_compatible_p(t1, t2), \ > + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\ I understand why you use __builtin_types_compatible_p(), but please don't. It will mean that trivial constants like "5" and "sizeof(x)" won't simplify, because they have different types. The ?: will give the right combined type anyway, and if you want the type comparison warning, just add a comma-expression with something like like (t1 *)1 == (t2 *)1 to get the type compatibility warning. Yeah, yeah, maybe none of the VLA cases triggered that, but it seems silly to not just get that obvious constant case right. Hmm? Linus
[PATCH v2] kernel.h: Skip single-eval logic on literals in min()/max()
When max() is used in stack array size calculations from literal values (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler thinks this is a dynamic calculation due to the single-eval logic, which is not needed in the literal case. This change removes several accidental stack VLAs from an x86 allmodconfig build: $ diff -u before.txt after.txt | grep ^- -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids variable length array ‘ids’ [-Wvla] -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length array ‘namebuf’ [-Wvla] -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ [-Wvla] -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ [-Wvla] -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ [-Wvla] -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array ‘buff64’ [-Wvla] Based on an earlier patch from Josh Poimboeuf. Signed-off-by: Kees Cook --- v2: - fix copy/paste-o max1_/max2_ (ijc) - clarify "compile-time" constant in comment (Rasmus) - clean up formatting on min_t()/max_t() --- include/linux/kernel.h | 50 -- 1 file changed, 32 insertions(+), 18 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 3fd291503576..108cdf7bd484 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -787,37 +787,57 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } * strict type-checking.. See the * "unnecessary" pointer comparison. */ -#define __min(t1, t2, min1, min2, x, y) ({ \ +#define __single_eval_min(t1, t2, min1, min2, x, y) ({ \ t1 min1 = (x); \ t2 min2 = (y); \ (void) (&min1 == &min2);\ min1 < min2 ? min1 : min2; }) +/* + * In the case of compile-time constant values, there is no need to do + * the double-evaluation protection, so the raw comparison can be made. + * This allows min()/max() to be used in stack array allocations and + * avoid the compiler thinking it is a dynamic value leading to an + * accidental VLA. + */ +#define __min(t1, t2, x, y)\ + __builtin_choose_expr(__builtin_constant_p(x) &&\ + __builtin_constant_p(y) &&\ + __builtin_types_compatible_p(t1, t2), \ + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\ + __single_eval_min(t1, t2, \ + __UNIQUE_ID(min1_), \ + __UNIQUE_ID(min2_), \ + x, y)) + /** * min - return minimum of two values of the same or compatible types * @x: first value * @y: second value */ -#define min(x, y) \ - __min(typeof(x), typeof(y), \ - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ - x, y) +#define min(x, y) __min(typeof(x), typeof(y), x, y) -#define __max(t1, t2, max1, max2, x, y) ({ \ +#define __single_eval_max(t1, t2, max1, max2, x, y) ({ \ t1 max1 = (x); \ t2 max2 = (y); \ (void) (&max1 == &max2);\ max1 > max2 ? max1 : max2; }) +#define __max(t1, t2, x, y)\ + __builtin_choose_expr(__builtin_constant_p(x) &&\ + __builtin_constant_p(y) &&\ + __builtin_types_compatible_p(t1, t2), \ + (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\ + __single_eval_max(t1, t2, \ + __UNIQUE_ID(max1_), \ + __UNIQUE_ID(max2_), \ + x, y)) /** * max - return maximum of two values of the same or compatible types * @x: first value * @y: second value */ -#define max(x, y) \ - __max(typeof(x), typeof(y), \ - __UNIQUE_ID(max1_), __UNIQUE_ID(max2_), \ - x, y) +#define max(x, y) __max(typeof(x), typeof(y), x, y) /** * min3 - return minimum of three values @@ -869,10 +889,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } * @x: first value * @y: second value */ -#define min_t(type, x, y) \ - __min(type, type, \ - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ -
Re: [PATCH net] macvlan: filter out xfrm feature flags
On 3/8/2018 9:33 AM, David Miller wrote: From: Shannon Nelson Date: Tue, 6 Mar 2018 14:57:08 -0800 This isn't broken for vlans because they use a separate features connection (vlan_features) for inheriting features. This is fine, but I don't think trying to add something like this to every driver for every new upperdev is a good idea - I think the upperdev should try to protect itself. I think this fix is correct. But for how many upperdevs are we going to duplicate code like this, and how many subtle differences and in fact bugs will result from all of that duplication? I think we really need something common for these upperdev drivers to use. Maybe just a macro defining feature bits to trim in this situation. Thanks. Thanks, Dave. Yes, this could use something a little more generic, something that would catch any future "dangerous" bits. I'm not sure we can come up with a generic mask to be used by everyone, as each upper and lower dev has their own feature support levels. But we might come up with a better example for others to follow. Rather than calling out specific non-supported bits, we can probably just build a mask from bits that we already know are supported, something like this: features &= (ALWAYS_ON_FEATURES | MACVLAN_FEATURES); which would take care of NETIF_F_NETNS_LOCAL, the ESP flags, and anything else that wasn't already specifically called for. I'll repost with this and see what folks think. sln
Re: [PATCH 0/3] Remove accidental VLA usage
On Thu, Mar 8, 2018 at 2:12 PM, Rasmus Villemoes wrote: > On 8 March 2018 at 21:39, Kees Cook wrote: >> However, this works for me: >> >> #define __new_max(t1, t2, max1, max2, x, y)\ >>__builtin_choose_expr(__builtin_constant_p(x) && \ >> __builtin_constant_p(y) && \ >> __builtin_types_compatible_p(t1, t2), \ >> (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\ >> __max(t1, t2, max1, max2, x, y)) >> >> #define new_max(x, y) \ >> __new_max(typeof(x), typeof(y), \ >> __UNIQUE_ID(max1_), __UNIQUE_ID(max2_), \ >> x, y) > > Yes, that would seem to do the trick. > > Thinking out loud: do we really want or need the > __builtin_types_compatible condition when x and y are compile-time > constants? I think it would be nice to be able to use max(16, > sizeof(bla)) without having to cast either the literal or the sizeof. > Just omitting the type compatibility check might be dangerous, but > perhaps it could be relaxed to a check that both values are > representable in their common promoted type. Something like > > (type_signed(t1) == type_signed(t2)) || ((t1)x >= 0 && (t2)y >= 0) > > should be safe (if the types have same signedness, or the value of > signed type is positive), though it doesn't allow a few corner cases > (e.g. short vs. unsigned char is always ok due to promotion to int, > and also if the signed type is strictly wider than the unsigned type). I agree, it would be nice. However, I think it'd be better to continue to depend on max_t() for these kinds of cases though. For example: char foo[max_t(size_t, 6, sizeof(something))]; Works with the proposed patch. Also, I think this mismatch would already be triggering warnings, so we shouldn't have any currently. -Kees -- Kees Cook Pixel Security
Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]
On Thu, Mar 8, 2018 at 8:02 AM, Marco Berizzi wrote: >> Marco Berizzi wrote: >> >> >> Hello everyone, >> >> Yesterday I got this error on a slackware linux 4.16-rc4 system >> running as a traffic shaping gateway and netfilter nat. >> The error has been arisen after a partial ISP network outage, >> so unfortunately it will not trivial for me to reproduce it again. > > Hello everyone, > > I'm getting this error twice/day, so fortunately I'm able to > reproduce it. IIRC, there was a patch for this, but it got lost... I will take a look anyway.
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On 3/7/18 5:23 PM, Luis R. Rodriguez wrote: request_module() has its own world though too. How often in your proof of concept is request_module() called? How many times do you envision it being called? once. +static int run_umh(struct file *file) +{ + struct subprocess_info *sub_info = call_usermodehelper_setup_file(file); + + if (!sub_info) + return -ENOMEM; + return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC); +} run_umh() calls the program and waits. Note that while we are running a UMH we can't suspend. What if they take forever, who is hosing them down with an equivalent: schedule(); try_to_freeze(); Say they are buggy and never return, will they stall suspend, have you tried? call_usermodehelper_exec() uses helper_lock() which in turn is used for umh's accounting for number of running umh's. One of the sad obscure uses for this is to deal with suspend/resume. Refer to __usermodehelper* calls on kernel/power/process.c Note how you use UMH_WAIT_EXEC too, so this is all synchronous. looks like you misread this code and the rest of your concerns with suspend/resume are not applicable any more. #define UMH_NO_WAIT 0 /* don't wait at all */ #define UMH_WAIT_EXEC 1 /* wait for the exec, but not the process */ #define UMH_WAIT_PROC 2 /* wait for the process to complete */ #define UMH_KILLABLE4 /* wait for EXEC/PROC killable */ bpftiler.ko is started once and runs non-stop from there on. Unless it crashes, but it's a different discussion. + if (info->hdr->e_type == ET_EXEC) { +#ifdef CONFIG_MODULE_SIG + if (!info->sig_ok) { + pr_notice_once("umh %s verification failed: signature and/or required key missing - tainting kernel\n", + info->file->f_path.dentry->d_name.name); + add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK); + } +#endif So I guess this check is done *after* run_umh() then, what about the enforce mode, don't we want to reject loading at all in any circumstance? already answered this twice in the thread.
Re: [RFC v3 net-next 00/18] Time based packet transmission
On Thu, Mar 08, 2018 at 10:06:46AM -0800, Jesus Sanchez-Palencia wrote: > Hi, > > > On 03/08/2018 06:09 AM, Henrik Austad wrote: > > (...) > > > > > A lot of new knobs, I see the need, I would've like to have fewer, but > > you've documented them pretty well. Perhaps we should add something to > > Documentation/ at one stage? > > Sure. The idea is working on that once the interfaces have been accepted. Yeah, probably a good idea. > > Anyways, the patches applied cleanly so I gave them a (very) quick spin. > > Using udp_tai and tcpdump in the other end to grab the frames > > > > Setting up with hw offload and sorting in qdisc. > > > > Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss > > bypass as dual-core and i210 is not friends): > > > > udp_tai -c1 -i eth2 -p 20 -P 1000 > > > > Receiver (imx7, kernel 4.9.11): > > chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length > > 256" > tai_imx7.log > > > > Note: this involves 2 swtiches and a somewhat hackish kernel running on the > > receiver, so these numbers can only improve. > > > > count2340.00 > > mean0.043770 > > std 0.047784 > > min 0.009025 > > 25% 0.010003 > > 50% 0.010010 > > 75% 0.109998 > > max 0.120060 > > > > Thanks for giving it a shot. > > But I'm not sure I follow the numbers above, sorry :/ > Are you computing the packet's Rx timestamp offset from the (expected) Tx > time? Just looking at the timestamp when the frames were received. They should be sent at regular intervals if I read udp_tai.c correctly, so the assumption was that the timestamp from tcpdump should give an inkling to how well it worked. I set it up to send a frame every 10ms and computed the diff between each UDP packet received. Nothing fancy, just tcpdump and grep for the timestamp and look at the distribution. > > I have to dig more into why this is happening, a lot frames delayed much > > more than I'd expect, but at this stage I'm pretty sure this is pebkac. One > > obvious fix is move some hw around and do a direct link, but I didn't have > > time for that right now. > > > > I'm very interested in doing what Richard's original test was when he used > > ptp-synched clocks and also used hw receive-time and compared with expected > > tx-time. So, while I'm getting that up and running, I thought I should > > share the early results. > > Sure, thanks. Which delta and clockid are you using, please? I used the example provided in -00, tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 tc qdisc add dev eth2 parent 100:1 tbs offload delta 10 clockid \ CLOCK_REALTIME sorting > Also, was this clock synchronized to the PHC? You need that for hw offload > with > sorting enabled. Hmm, good point, no, NIC clock was not synchronized, I'll do that in the next round for both sender and receiver! -henrik signature.asc Description: PGP signature
[RFC] Removing VLA usage in l1oip_core
Hi Karsten, I'm trying to figure out the best way to fix the following VLA warning: drivers/isdn/mISDN/l1oip_core.c:282:2: warning: ISO C90 forbids variable length array ‘frame’ [-Wvla] u8 frame[len + 32]; ^~ So while doing some research I've found the following. Based on this code at include/linux/mISDNhw.h:38: #define MAX_DFRAME_LEN_L1 300 and the following at drivers/isdn/mISDN/l1oip_core.c:1115: if (skb->len > MAX_DFRAME_LEN_L1 || skb->len > L1OIP_MAX_LEN) { printk(KERN_WARNING "%s: skb too large\n", __func__); break; } /* check for AIS / ulaw-silence */ l = skb->len; if (!memchr_inv(skb->data, 0xff, l)) { if (debug & DEBUG_L1OIP_MSG) printk(KERN_DEBUG "%s: got AIS, not sending, " "but counting\n", __func__); hc->chan[bch->slot].tx_counter += l; skb_trim(skb, 0); queue_ch_frame(ch, PH_DATA_CNF, hh->id, skb); return 0; } /* check for silence */ l = skb->len; if (!memchr_inv(skb->data, 0x2a, l)) { if (debug & DEBUG_L1OIP_MSG) printk(KERN_DEBUG "%s: got silence, not sending" ", but counting\n", __func__); hc->chan[bch->slot].tx_counter += l; skb_trim(skb, 0); queue_ch_frame(ch, PH_DATA_CNF, hh->id, skb); return 0; } /* send frame */ p = skb->data; l = skb->len; while (l) { ll = (l < L1OIP_MAX_PERFRAME) ? l : L1OIP_MAX_PERFRAME; l1oip_socket_send(hc, hc->codec, bch->slot, 0, hc->chan[bch->slot].tx_counter, p, ll); hc->chan[bch->slot].tx_counter += ll; p += ll; l -= ll; } it seems that the maximum value 'len' can take at drivers/isdn/mISDN/l1oip_core.c:282 is 300 /* * send a frame via socket, if open and restart timer */ static int l1oip_socket_send(struct l1oip *hc, u8 localcodec, u8 channel, u32 chanmask, u16 timebase, u8 *buf, int len) { u8 *p; u8 frame[len + 32]; If this is correct, I could send the following patch to fix the VLA warning: diff --git a/drivers/isdn/mISDN/l1oip_core.c b/drivers/isdn/mISDN/l1oip_core.c index 21d50e4..31e3cd5 100644 --- a/drivers/isdn/mISDN/l1oip_core.c +++ b/drivers/isdn/mISDN/l1oip_core.c @@ -279,7 +279,7 @@ l1oip_socket_send(struct l1oip *hc, u8 localcodec, u8 channel, u32 chanmask, u16 timebase, u8 *buf, int len) { u8 *p; - u8 frame[len + 32]; + u8 frame[332]; struct socket *socket = NULL; if (debug & DEBUG_L1OIP_MSG) But I wanted to ask for your feedback first, in case I may be missing something. Another solution is to use dynamic memory allocation, but if the maximum size for 'frame' is in the hundreds of bytes, it might not be worth the performace penalty. What do you think? Thanks! -- Gustavo
Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()
On Thu, Mar 8, 2018 at 2:18 PM, Andrew Morton wrote: > On Thu, 8 Mar 2018 13:40:45 -0800 Kees Cook wrote: > >> When max() is used in stack array size calculations from literal values >> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler >> thinks this is a dynamic calculation due to the single-eval logic, which >> is not needed in the literal case. This change removes several accidental >> stack VLAs from an x86 allmodconfig build: >> >> $ diff -u before.txt after.txt | grep ^- >> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids >> variable length array ‘ids’ [-Wvla] >> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length >> array ‘namebuf’ [-Wvla] >> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ >> [-Wvla] >> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array >> ‘buff’ [-Wvla] >> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array >> ‘buff’ [-Wvla] >> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array >> ‘buff64’ [-Wvla] >> >> Based on an earlier patch from Josh Poimboeuf. >> >> ... >> >> --- a/include/linux/kernel.h >> +++ b/include/linux/kernel.h >> @@ -787,37 +787,57 @@ static inline void ftrace_dump(enum ftrace_dump_mode >> oops_dump_mode) { } >> * strict type-checking.. See the >> * "unnecessary" pointer comparison. >> */ >> -#define __min(t1, t2, min1, min2, x, y) ({ \ >> +#define __single_eval_min(t1, t2, min1, min2, x, y) ({ \ >> t1 min1 = (x); \ >> t2 min2 = (y); \ >> (void) (&min1 == &min2);\ >> min1 < min2 ? min1 : min2; }) >> >> +/* >> + * In the case of builtin constant values, there is no need to do the >> + * double-evaluation protection, so the raw comparison can be made. >> + * This allows min()/max() to be used in stack array allocations and >> + * avoid the compiler thinking it is a dynamic value leading to an >> + * accidental VLA. >> + */ >> +#define __min(t1, t2, x, y) \ >> + __builtin_choose_expr(__builtin_constant_p(x) &&\ >> + __builtin_constant_p(y) &&\ >> + __builtin_types_compatible_p(t1, t2), \ >> + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\ >> + __single_eval_min(t1, t2, \ >> + __UNIQUE_ID(max1_), \ >> + __UNIQUE_ID(max2_), \ >> + x, y)) >> + > > Holy crap. > > I suppose gcc will one day be fixed and we won't need this. > > Is there a good reason to convert min()? Surely nobody will be using > min to dimension an array - always max? Just for symmetry, I guess. I just went with symmetry. It seems like an ugly risk to implement min and mix differently. :) In theory it may produce smaller code for rare min() uses, but I haven't actually verified that. I will send a v2 with the two nits mentioned... -Kees -- Kees Cook Pixel Security
[PATCH iproute2-next 1/2] tc: Add missing documentation for codel and fq_codel parameters
Add missing documentation of the memory_limit fq_codel parameter and the ce_threshold codel and fq_codel parameters. Signed-off-by: Toke Høiland-Jørgensen --- man/man8/tc-codel.8| 10 +- man/man8/tc-fq_codel.8 | 18 +- tc/q_fq_codel.c| 1 + 3 files changed, 27 insertions(+), 2 deletions(-) diff --git a/man/man8/tc-codel.8 b/man/man8/tc-codel.8 index a0e50a4e..e538e940 100644 --- a/man/man8/tc-codel.8 +++ b/man/man8/tc-codel.8 @@ -13,7 +13,9 @@ TIME ] [ .B ecn | .B noecn -] +] [ +.B ce_threshold +TIME ] .SH DESCRIPTION CoDel (pronounced "coddle") is an adaptive "no-knobs" active queue management @@ -80,6 +82,12 @@ can be used to turn it off and vice-a-versa. By default, .B ecn is turned off. +.SS ce_threshold +sets a threshold above which all packets are marked with ECN Congestion +Experienced. This is useful for DCTCP-style congestion control algorithms that +require marking at very shallow queueing thresholds. + + .SH EXAMPLES # tc qdisc add dev eth0 root codel # tc -s qdisc show diff --git a/man/man8/tc-fq_codel.8 b/man/man8/tc-fq_codel.8 index a80389a1..7ee6c269 100644 --- a/man/man8/tc-fq_codel.8 +++ b/man/man8/tc-fq_codel.8 @@ -17,7 +17,11 @@ BYTES ] [ .B ecn | .B noecn -] +] [ +.B ce_threshold +TIME ] [ +.B memory_limit +BYTES ] .SH DESCRIPTION FQ_Codel (Fair Queuing Controlled Delay) is queuing discipline that combines Fair @@ -35,6 +39,13 @@ and is the hard limit on the real queue size. When this limit is reached, incoming packets are dropped. Default is 10240 packets. +.SS memory_limit +sets a limit on the total number of bytes that can be queued in this FQ-CoDel +instance. The lower of the packet limit of the +.B limit +parameter and the memory limit will be enforced. Default is 32 MB. + + .SS flows is the number of flows into which the incoming packets are classified. Due to the stochastic nature of hashing, multiple flows may end up being hashed into @@ -73,6 +84,11 @@ can be used to turn it off and vice-a-versa. Unlike .B codel, ecn is turned on by default. +.SS ce_threshold +sets a threshold above which all packets are marked with ECN Congestion +Experienced. This is useful for DCTCP-style congestion control algorithms that +require marking at very shallow queueing thresholds. + .SH EXAMPLES #tc qdisc add dev eth0 root fq_codel .br diff --git a/tc/q_fq_codel.c b/tc/q_fq_codel.c index 9e3736fe..1b2931ef 100644 --- a/tc/q_fq_codel.c +++ b/tc/q_fq_codel.c @@ -50,6 +50,7 @@ static void explain(void) { fprintf(stderr, "Usage: ... fq_codel [ limit PACKETS ] [ flows NUMBER ]\n"); + fprintf(stderr, "[ memory_limit BYTES ]\n"); fprintf(stderr, "[ target TIME ] [ interval TIME ]\n"); fprintf(stderr, "[ quantum BYTES ] [ [no]ecn ]\n"); fprintf(stderr, "[ ce_threshold TIME ]\n"); -- 2.16.2
[PATCH iproute2-next 2/2] tc: Add JSON output of fq_codel stats
Enable proper JSON output support for fq_codel in `tc -s qdisc` output. Signed-off-by: Toke Høiland-Jørgensen --- tc/q_fq_codel.c | 49 - 1 file changed, 32 insertions(+), 17 deletions(-) diff --git a/tc/q_fq_codel.c b/tc/q_fq_codel.c index 1b2931ef..02ad2214 100644 --- a/tc/q_fq_codel.c +++ b/tc/q_fq_codel.c @@ -239,35 +239,50 @@ static int fq_codel_print_xstats(struct qdisc_util *qu, FILE *f, st = &_st; } if (st->type == TCA_FQ_CODEL_XSTATS_QDISC) { - fprintf(f, " maxpacket %u drop_overlimit %u new_flow_count %u ecn_mark %u", - st->qdisc_stats.maxpacket, - st->qdisc_stats.drop_overlimit, - st->qdisc_stats.new_flow_count, + print_uint(PRINT_ANY, "maxpacket", " maxpacket %u", + st->qdisc_stats.maxpacket); + print_uint(PRINT_ANY, "drop_overlimit", " drop_overlimit %u", + st->qdisc_stats.drop_overlimit); + print_uint(PRINT_ANY, "new_flow_count", " new_flow_count %u", + st->qdisc_stats.new_flow_count); + print_uint(PRINT_ANY, "ecn_mark", " ecn_mark %u", st->qdisc_stats.ecn_mark); if (st->qdisc_stats.ce_mark) - fprintf(f, " ce_mark %u", st->qdisc_stats.ce_mark); + print_uint(PRINT_ANY, "ce_mark", " ce_mark %u", + st->qdisc_stats.ce_mark); if (st->qdisc_stats.memory_usage) - fprintf(f, " memory_used %u", st->qdisc_stats.memory_usage); + print_uint(PRINT_ANY, "memory_used", " memory_used %u", + st->qdisc_stats.memory_usage); if (st->qdisc_stats.drop_overmemory) - fprintf(f, " drop_overmemory %u", st->qdisc_stats.drop_overmemory); - fprintf(f, "\n new_flows_len %u old_flows_len %u", - st->qdisc_stats.new_flows_len, + print_uint(PRINT_ANY, "drop_overmemory", " drop_overmemory %u", + st->qdisc_stats.drop_overmemory); + print_uint(PRINT_ANY, "new_flows_len", "\n new_flows_len %u", + st->qdisc_stats.new_flows_len); + print_uint(PRINT_ANY, "old_flows_len", " old_flows_len %u", st->qdisc_stats.old_flows_len); } if (st->type == TCA_FQ_CODEL_XSTATS_CLASS) { - fprintf(f, " deficit %d count %u lastcount %u ldelay %s", - st->class_stats.deficit, - st->class_stats.count, - st->class_stats.lastcount, + print_uint(PRINT_ANY, "deficit", " deficit %u", + st->class_stats.deficit); + print_uint(PRINT_ANY, "count", " count %u", + st->class_stats.count); + print_uint(PRINT_ANY, "lastcount", " lastcount %u", + st->class_stats.lastcount); + print_uint(PRINT_JSON, "ldelay", NULL, + st->class_stats.ldelay); + print_string(PRINT_FP, NULL, " ldelay %s", sprint_time(st->class_stats.ldelay, b1)); if (st->class_stats.dropping) { - fprintf(f, " dropping"); + print_bool(PRINT_ANY, "dropping", " dropping", true); if (st->class_stats.drop_next < 0) - fprintf(f, " drop_next -%s", + print_string(PRINT_FP, NULL, " drop_next -%s", sprint_time(-st->class_stats.drop_next, b1)); - else - fprintf(f, " drop_next %s", + else { + print_uint(PRINT_JSON, "drop_next", NULL, + st->class_stats.drop_next); + print_string(PRINT_FP, NULL, " drop_next %s", sprint_time(st->class_stats.drop_next, b1)); + } } } return 0; -- 2.16.2
[PATCH net v3] net: phy: Tell caller result of phy_change()
In 664fcf123a30e (net: phy: Threaded interrupts allow some simplification) the phy_interrupt system was changed to use a traditional threaded interrupt scheme instead of a workqueue approach. With this change, the phy status check moved into phy_change, which did not report back to the caller whether or not the interrupt was handled. This means that, in the case of a shared phy interrupt, only the first phydev's interrupt registers are checked (since phy_interrupt() would always return IRQ_HANDLED). This leads to interrupt storms when it is a secondary device that's actually the interrupt source. Signed-off-by: Brad Mouring --- drivers/net/phy/phy.c | 145 +- include/linux/phy.h | 1 - 2 files changed, 72 insertions(+), 74 deletions(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index e3e29c2b028b..dfff2cff7da9 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -617,6 +617,77 @@ static void phy_error(struct phy_device *phydev) phy_trigger_machine(phydev, false); } +/** + * phy_disable_interrupts - Disable the PHY interrupts from the PHY side + * @phydev: target phy_device struct + */ +static int phy_disable_interrupts(struct phy_device *phydev) +{ + int err; + + /* Disable PHY interrupts */ + err = phy_config_interrupt(phydev, PHY_INTERRUPT_DISABLED); + if (err) + goto phy_err; + + /* Clear the interrupt */ + err = phy_clear_interrupt(phydev); + if (err) + goto phy_err; + + return 0; + +phy_err: + phy_error(phydev); + + return err; +} + +/** + * phy_change - Called by the phy_interrupt to handle PHY changes + * @phydev: phy_device struct that interrupted + */ +static irqreturn_t phy_change(struct phy_device *phydev) +{ + if (phy_interrupt_is_valid(phydev)) { + if (phydev->drv->did_interrupt && + !phydev->drv->did_interrupt(phydev)) + return IRQ_NONE; + + if (phydev->state == PHY_HALTED) + if (phy_disable_interrupts(phydev)) + goto phy_err; + } + + mutex_lock(&phydev->lock); + if ((PHY_RUNNING == phydev->state) || (PHY_NOLINK == phydev->state)) + phydev->state = PHY_CHANGELINK; + mutex_unlock(&phydev->lock); + + /* reschedule state queue work to run as soon as possible */ + phy_trigger_machine(phydev, true); + + if (phy_interrupt_is_valid(phydev) && phy_clear_interrupt(phydev)) + goto phy_err; + return IRQ_HANDLED; + +phy_err: + phy_error(phydev); + return IRQ_NONE; +} + +/** + * phy_change_work - Scheduled by the phy_mac_interrupt to handle PHY changes + * @work: work_struct that describes the work to be done + */ +void phy_change_work(struct work_struct *work) +{ + struct phy_device *phydev = + container_of(work, struct phy_device, phy_queue); + + phy_change(phydev); +} + /** * phy_interrupt - PHY interrupt handler * @irq: interrupt line @@ -632,9 +703,7 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat) if (PHY_HALTED == phydev->state) return IRQ_NONE;/* It can't be ours. */ - phy_change(phydev); - - return IRQ_HANDLED; + return phy_change(phydev); } /** @@ -651,32 +720,6 @@ static int phy_enable_interrupts(struct phy_device *phydev) return phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED); } -/** - * phy_disable_interrupts - Disable the PHY interrupts from the PHY side - * @phydev: target phy_device struct - */ -static int phy_disable_interrupts(struct phy_device *phydev) -{ - int err; - - /* Disable PHY interrupts */ - err = phy_config_interrupt(phydev, PHY_INTERRUPT_DISABLED); - if (err) - goto phy_err; - - /* Clear the interrupt */ - err = phy_clear_interrupt(phydev); - if (err) - goto phy_err; - - return 0; - -phy_err: - phy_error(phydev); - - return err; -} - /** * phy_start_interrupts - request and enable interrupts for a PHY device * @phydev: target phy_device struct @@ -719,50 +762,6 @@ int phy_stop_interrupts(struct phy_device *phydev) } EXPORT_SYMBOL(phy_stop_interrupts); -/** - * phy_change - Called by the phy_interrupt to handle PHY changes - * @phydev: phy_device struct that interrupted - */ -void phy_change(struct phy_device *phydev) -{ - if (phy_interrupt_is_valid(phydev)) { - if (phydev->drv->did_interrupt && - !phydev->drv->did_interrupt(phydev)) - return; - - if (phydev->state == PHY_HALTED) - if (phy_disable_interrupts(phydev)) - goto phy_err; - } - - mutex_lock(&phydev->lock); - if ((PHY_RUNNING == phydev->state) || (PHY_NOLINK == phydev->
Re: [PATCH RFC 4/5] tls: RX path for ktls
On 03/08/18 09:48 PM, Boris Pismenny wrote: > Hi Dave, > > On 03/08/18 18:50, Dave Watson wrote: > > Add rx path for tls software implementation. > > > > recvmsg, splice_read, and poll implemented. > > > > An additional sockopt TLS_RX is added, with the same interface as > > TLS_TX. Either TLX_RX or TLX_TX may be provided separately, or > > together (with two different setsockopt calls with appropriate keys). > > > > Control messages are passed via CMSG in a similar way to transmit. > > If no cmsg buffer is passed, then only application data records > > will be passed to userspace, and EIO is returned for other types of > > alerts. > > > > EBADMSG is passed for decryption errors, and E2BIG is passed for framing > > errors. Both are unrecoverable. > > I think E2BIG is for too long argument list. EMSGSIZE might be more > appropriate. Sounds good. > Also, we must check that the record is not too short (cipher specific). > For TLS1.2 with AES-GCM the minimum length is 8 (IV) + 16 (TAG). > The correct error for this case is EBADMSG, like a decryption failure. > > Also, how about bad TLS version (e.g. not TLS1.2)? > A separate error type is required for bad version, because it triggers a > unique alert in libraries such as OpenSSL. > I thought of using EINVAL for bad version. What do you think? Ah, I did not realize there was a separate alert for that, sounds good. > > I wonder if we should provide a more flexible method of obtaining errors for > the future. > Maybe use a special CMSG for errors? > This CMSG will be triggered only after the socket enters the error state. I'm not opposed to this in principle, but without a concrete use am hesitant to add it. I don't know of any other error codes that could be returned besides the ones discussed above. > > > > + > > +int tls_sw_recvmsg(struct sock *sk, > > + struct msghdr *msg, > > + size_t len, > > + int nonblock, > > + int flags, > > + int *addr_len) > > +{ > > + struct tls_context *tls_ctx = tls_get_ctx(sk); > > + struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); > > + unsigned char control; > > + struct strp_msg *rxm; > > + struct sk_buff *skb; > > + ssize_t copied = 0; > > + bool cmsg = false; > > + int err = 0; > > + long timeo; > Maybe try to read from the error queue here? Sure.
Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()
On Thu, 8 Mar 2018 13:40:45 -0800 Kees Cook wrote: > When max() is used in stack array size calculations from literal values > (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler > thinks this is a dynamic calculation due to the single-eval logic, which > is not needed in the literal case. This change removes several accidental > stack VLAs from an x86 allmodconfig build: > > $ diff -u before.txt after.txt | grep ^- > -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids > variable length array ‘ids’ [-Wvla] > -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length > array ‘namebuf’ [-Wvla] > -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ > [-Wvla] > -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ > [-Wvla] > -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ > [-Wvla] > -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array > ‘buff64’ [-Wvla] > > Based on an earlier patch from Josh Poimboeuf. > > ... > > --- a/include/linux/kernel.h > +++ b/include/linux/kernel.h > @@ -787,37 +787,57 @@ static inline void ftrace_dump(enum ftrace_dump_mode > oops_dump_mode) { } > * strict type-checking.. See the > * "unnecessary" pointer comparison. > */ > -#define __min(t1, t2, min1, min2, x, y) ({ \ > +#define __single_eval_min(t1, t2, min1, min2, x, y) ({ \ > t1 min1 = (x); \ > t2 min2 = (y); \ > (void) (&min1 == &min2);\ > min1 < min2 ? min1 : min2; }) > > +/* > + * In the case of builtin constant values, there is no need to do the > + * double-evaluation protection, so the raw comparison can be made. > + * This allows min()/max() to be used in stack array allocations and > + * avoid the compiler thinking it is a dynamic value leading to an > + * accidental VLA. > + */ > +#define __min(t1, t2, x, y) \ > + __builtin_choose_expr(__builtin_constant_p(x) &&\ > + __builtin_constant_p(y) &&\ > + __builtin_types_compatible_p(t1, t2), \ > + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\ > + __single_eval_min(t1, t2, \ > + __UNIQUE_ID(max1_), \ > + __UNIQUE_ID(max2_), \ > + x, y)) > + Holy crap. I suppose gcc will one day be fixed and we won't need this. Is there a good reason to convert min()? Surely nobody will be using min to dimension an array - always max? Just for symmetry, I guess.
Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()
On Thu, 2018-03-08 at 13:40 -0800, Kees Cook wrote: > > +#define __min(t1, t2, x, y) \ > + __builtin_choose_expr(__builtin_constant_p(x) &&\ > + __builtin_constant_p(y) &&\ > + __builtin_types_compatible_p(t1, t2), \ > + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\ > + __single_eval_min(t1, t2, \ > + __UNIQUE_ID(max1_), \ > + __UNIQUE_ID(max2_), \ min1_ etc? Ian.
Re: [PATCH 0/3] Remove accidental VLA usage
On 8 March 2018 at 21:39, Kees Cook wrote: > On Thu, Mar 8, 2018 at 11:57 AM, Rasmus Villemoes > wrote: >> On 2018-03-08 16:02, Josh Poimboeuf wrote: >>> On Wed, Mar 07, 2018 at 07:30:44PM -0800, Kees Cook wrote: >>> +extern long __error_incompatible_types_in_min_macro; >>> +extern long __error_incompatible_types_in_max_macro; >>> + >>> +#define __min(t1, t2, x, y) \ >>> + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \ >>> + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\ >>> + (t1)__error_incompatible_types_in_min_macro) >>> >>> /** >>> * min - return minimum of two values of the same or compatible types >>> * @x: first value >>> * @y: second value >>> */ >>> -#define min(x, y)\ >>> - __min(typeof(x), typeof(y), \ >>> - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ >>> - x, y) >>> +#define min(x, y) __min(typeof(x), typeof(y), x, y) \ >>> >> >> But this introduces the the-chosen-one-of-x-and-y-gets-evaluated-twice >> problem. Maybe we don't care? But until we get a >> __builtin_assert_this_has_no_side_effects() I think that's a little >> dangerous. > > Eek, yes, we can't do the double-eval. The proposed change breaks > things badly. :) > > a: 20 > b: 40 > max(a++, b++): 40 > a: 21 > b: 41 > > a: 20 > b: 40 > new_max(a++, b++): 41 > a: 21 > b: 42 > > However, this works for me: > > #define __new_max(t1, t2, max1, max2, x, y)\ >__builtin_choose_expr(__builtin_constant_p(x) && \ > __builtin_constant_p(y) && \ > __builtin_types_compatible_p(t1, t2), \ > (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\ > __max(t1, t2, max1, max2, x, y)) > > #define new_max(x, y) \ > __new_max(typeof(x), typeof(y), \ > __UNIQUE_ID(max1_), __UNIQUE_ID(max2_), \ > x, y) Yes, that would seem to do the trick. Thinking out loud: do we really want or need the __builtin_types_compatible condition when x and y are compile-time constants? I think it would be nice to be able to use max(16, sizeof(bla)) without having to cast either the literal or the sizeof. Just omitting the type compatibility check might be dangerous, but perhaps it could be relaxed to a check that both values are representable in their common promoted type. Something like (type_signed(t1) == type_signed(t2)) || ((t1)x >= 0 && (t2)y >= 0) should be safe (if the types have same signedness, or the value of signed type is positive), though it doesn't allow a few corner cases (e.g. short vs. unsigned char is always ok due to promotion to int, and also if the signed type is strictly wider than the unsigned type). Rasmus
[PATCH bpf-next 2/7] tools: bpf: respect output directory during build
Currently, the programs under tools/bpf (with the notable exception of bpftool) do not respect the output directory (make O=dir). Fix that. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 38 ++ 1 file changed, 22 insertions(+), 16 deletions(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index c8ec0ae16bf0..e7b15967492e 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -1,4 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 +include ../scripts/Makefile.include + prefix = /usr CC = gcc @@ -7,7 +9,7 @@ YACC = bison MAKE = make CFLAGS += -Wall -O2 -CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include +CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/include/uapi -I$(srctree)/include ifeq ($(srctree),) srctree := $(patsubst %/,%,$(dir $(CURDIR))) @@ -38,32 +40,36 @@ ifeq ($(feature-disassembler-four-args), 1) CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE endif -%.yacc.c: %.y +$(OUTPUT)%.yacc.c: $(srctree)/tools/bpf/%.y $(YACC) -o $@ -d $< -%.lex.c: %.l +$(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l $(LEX) -o $@ $< -all: bpf_jit_disasm bpf_dbg bpf_asm bpftool +$(OUTPUT)%.o: $(srctree)/tools/bpf/%.c + $(COMPILE.c) -o $@ $< + +all: $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm bpftool -bpf_jit_disasm : CFLAGS += -DPACKAGE='bpf_jit_disasm' -bpf_jit_disasm : LDLIBS = -lopcodes -lbfd -ldl -bpf_jit_disasm : bpf_jit_disasm.o +$(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm' +$(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl +$(OUTPUT)bpf_jit_disasm: $(OUTPUT)bpf_jit_disasm.o -bpf_dbg : LDLIBS = -lreadline -bpf_dbg : bpf_dbg.o +$(OUTPUT)bpf_dbg: LDLIBS = -lreadline +$(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o -bpf_asm : LDLIBS = -bpf_asm : bpf_asm.o bpf_exp.yacc.o bpf_exp.lex.o -bpf_exp.lex.o : bpf_exp.yacc.c +$(OUTPUT)bpf_asm: LDLIBS = +$(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o $(OUTPUT)bpf_exp.lex.o +$(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c clean: bpftool_clean - rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.* + rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \ + $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.* install: bpftool_install - install bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm - install bpf_dbg $(prefix)/bin/bpf_dbg - install bpf_asm $(prefix)/bin/bpf_asm + install $(OUTPUT)bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm + install $(OUTPUT)bpf_dbg $(prefix)/bin/bpf_dbg + install $(OUTPUT)bpf_asm $(prefix)/bin/bpf_asm bpftool: $(MAKE) -C bpftool -- 1.8.3.1
[PATCH bpf-next 4/7] tools: bpf: make install should build first
Make the 'install' target depend on the 'all' target to build the binaries first. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index c42ca24a072d..e8d9e4125bf3 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -50,7 +50,9 @@ $(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l $(OUTPUT)%.o: $(srctree)/tools/bpf/%.c $(COMPILE.c) -o $@ $< -all: $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm bpftool +PROGS = $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm + +all: $(PROGS) bpftool $(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm' $(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl @@ -67,7 +69,7 @@ clean: bpftool_clean rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \ $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.* -install: bpftool_install +install: $(PROGS) bpftool_install $(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin $(INSTALL) $(OUTPUT)bpf_jit_disasm $(DESTDIR)$(prefix)/bin/bpf_jit_disasm $(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg -- 1.8.3.1
[PATCH bpf-next 0/7] tools: bpf: standardize make
Currently, 'make bpf' in the tools/ directory does not provide the standard quiet output except for bpftool (which is however listed with a wrong directory). Worse, it does not respect the build output directory. The 'make bpf_install' does not work as one would expect, either. It installs unconditionally to /usr/bin without respecting DESTDIR and prefix. This patchset improves that behavior. Jiri Benc (7): tools: bpftool: silence 'missing initializer' warnings tools: bpf: respect output directory during build tools: bpf: consistent make bpf_install tools: bpf: make install should build first tools: bpf: call descend in Makefile tools: bpf: respect quiet/verbose build tools: bpf: silence make by not deleting intermediate file tools/bpf/Makefile | 74 +++--- tools/bpf/bpftool/Makefile | 2 +- 2 files changed, 51 insertions(+), 25 deletions(-) -- 1.8.3.1
[PATCH bpf-next 6/7] tools: bpf: respect quiet/verbose build
Default to quiet build, with V=1 enabling verbose build as is usual. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 38 +++--- 1 file changed, 27 insertions(+), 11 deletions(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index daca0a4277d1..757ea22c428a 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -17,6 +17,12 @@ srctree := $(patsubst %/,%,$(dir $(CURDIR))) srctree := $(patsubst %/,%,$(dir $(srctree))) endif +ifeq ($(V),1) + Q = +else + Q = @ +endif + FEATURE_USER = .bpf FEATURE_TESTS = libbfd disassembler-four-args FEATURE_DISPLAY = libbfd disassembler-four-args @@ -42,38 +48,48 @@ CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE endif $(OUTPUT)%.yacc.c: $(srctree)/tools/bpf/%.y - $(YACC) -o $@ -d $< + $(QUIET_BISON)$(YACC) -o $@ -d $< $(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l - $(LEX) -o $@ $< + $(QUIET_FLEX)$(LEX) -o $@ $< $(OUTPUT)%.o: $(srctree)/tools/bpf/%.c - $(COMPILE.c) -o $@ $< + $(QUIET_CC)$(COMPILE.c) -o $@ $< + +$(OUTPUT)%.yacc.o: $(OUTPUT)%.yacc.c + $(QUIET_CC)$(COMPILE.c) -o $@ $< +$(OUTPUT)%.lex.o: $(OUTPUT)%.lex.c + $(QUIET_CC)$(COMPILE.c) -o $@ $< PROGS = $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm all: $(PROGS) bpftool $(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm' -$(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl $(OUTPUT)bpf_jit_disasm: $(OUTPUT)bpf_jit_disasm.o + $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ -lopcodes -lbfd -ldl -$(OUTPUT)bpf_dbg: LDLIBS = -lreadline $(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o + $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ -lreadline -$(OUTPUT)bpf_asm: LDLIBS = $(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o $(OUTPUT)bpf_exp.lex.o + $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ + $(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c clean: bpftool_clean - rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \ + $(call QUIET_CLEAN, bpf-progs) + $(Q)rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \ $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.* install: $(PROGS) bpftool_install - $(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin - $(INSTALL) $(OUTPUT)bpf_jit_disasm $(DESTDIR)$(prefix)/bin/bpf_jit_disasm - $(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg - $(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm + $(call QUIET_INSTALL, bpf_jit_disasm) + $(Q)$(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin + $(Q)$(INSTALL) $(OUTPUT)bpf_jit_disasm $(DESTDIR)$(prefix)/bin/bpf_jit_disasm + $(call QUIET_INSTALL, bpf_dbg) + $(Q)$(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg + $(call QUIET_INSTALL, bpf_asm) + $(Q)$(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm bpftool: $(call descend,bpftool) -- 1.8.3.1
[PATCH bpf-next 7/7] tools: bpf: silence make by not deleting intermediate file
Even in quiet mode, make finishes with rm tools/bpf/bpf_exp.lex.c That's because it considers the file to be intermediate. Silence that by mentioning the lex.c file instead of the lex.o file; the dependency still stays. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index 757ea22c428a..c07b4e718eeb 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -75,7 +75,7 @@ $(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o $(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o $(OUTPUT)bpf_exp.lex.o $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ -$(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c +$(OUTPUT)bpf_exp.lex.c: $(OUTPUT)bpf_exp.yacc.c clean: bpftool_clean $(call QUIET_CLEAN, bpf-progs) -- 1.8.3.1
[PATCH bpf-next 3/7] tools: bpf: consistent make bpf_install
Currently, make bpf_install in tools/ does not respect DESTDIR. Moreover, it installs to /usr/bin/ unconditionally. Let it respect DESTDIR and allow prefix to be specified. Also, to be more consistent with bpftool and with the usual customs, default the prefix to /usr/local instead of /usr. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index e7b15967492e..c42ca24a072d 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -1,12 +1,13 @@ # SPDX-License-Identifier: GPL-2.0 include ../scripts/Makefile.include -prefix = /usr +prefix ?= /usr/local CC = gcc LEX = flex YACC = bison MAKE = make +INSTALL ?= install CFLAGS += -Wall -O2 CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/include/uapi -I$(srctree)/include @@ -67,9 +68,10 @@ clean: bpftool_clean $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.* install: bpftool_install - install $(OUTPUT)bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm - install $(OUTPUT)bpf_dbg $(prefix)/bin/bpf_dbg - install $(OUTPUT)bpf_asm $(prefix)/bin/bpf_asm + $(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin + $(INSTALL) $(OUTPUT)bpf_jit_disasm $(DESTDIR)$(prefix)/bin/bpf_jit_disasm + $(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg + $(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm bpftool: $(MAKE) -C bpftool -- 1.8.3.1
[PATCH bpf-next 1/7] tools: bpftool: silence 'missing initializer' warnings
When building bpf tool, gcc emits piles of warnings: prog.c: In function ‘prog_fd_by_tag’: prog.c:101:9: warning: missing initializer for field ‘type’ of ‘struct bpf_prog_info’ [-Wmissing-field-initializers] struct bpf_prog_info info = {}; ^ In file included from /home/storage/jbenc/git/net-next/tools/lib/bpf/bpf.h:26:0, from prog.c:47: /home/storage/jbenc/git/net-next/tools/include/uapi/linux/bpf.h:925:8: note: ‘type’ declared here __u32 type; ^ As these warnings are not useful, switch them off. Signed-off-by: Jiri Benc --- tools/bpf/bpftool/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile index 26901ec87361..4c2867481f5c 100644 --- a/tools/bpf/bpftool/Makefile +++ b/tools/bpf/bpftool/Makefile @@ -38,7 +38,7 @@ bash_compdir ?= /usr/share/bash-completion/completions CC = gcc CFLAGS += -O2 -CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow +CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow -Wno-missing-field-initializers CFLAGS += -DPACKAGE='"bpftool"' -D__EXPORTED_HEADERS__ -I$(srctree)/tools/include/uapi -I$(srctree)/tools/include -I$(srctree)/tools/lib/bpf -I$(srctree)/kernel/bpf/ CFLAGS += -DBPFTOOL_VERSION='"$(BPFTOOL_VERSION)"' LIBS = -lelf -lbfd -lopcodes $(LIBBPF) -- 1.8.3.1
Re: [1/2] net: macb: Add phy-handle DT support
On Thu, Mar 08, 2018 at 06:32:47PM +0100, Andrew Lunn wrote: > On Wed, Mar 07, 2018 at 04:42:56PM -0600, Brad Mouring wrote: > > This optional binding (as described in the ethernet DT bindings doc) > > directs the netdev to the phydev to use. This is useful for a phy > > chip that has >1 phy in it, and two netdevs are using the same phy > > chip (i.e. the second mac's phy lives on the first mac's MDIO bus) > > ... > Hi Brad > > I think it is more logical to do this in macb_mii_probe(). > > I would probably also move the fixed_link code from macb_mii_init() to > macb_mii_probe(). I would probably also move the fallback to standard > phy registration. Make macb_mii_init() about registering the MDIO bus, > and macb_mii_probe() about probing the MDIO bus to find the PHY to > use. At the moment, it is all rather mixed up. > > Andrew Hi Andrew That makes sense, I'll rework and resend. Thanks for the suggestion. Brad
[PATCH bpf-next 5/7] tools: bpf: call descend in Makefile
Use the descend macro to properly propagate $(subdir) to bpftool. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index e8d9e4125bf3..daca0a4277d1 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -76,12 +76,12 @@ install: $(PROGS) bpftool_install $(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm bpftool: - $(MAKE) -C bpftool + $(call descend,bpftool) bpftool_install: - $(MAKE) -C bpftool install + $(call descend,bpftool,install) bpftool_clean: - $(MAKE) -C bpftool clean + $(call descend,bpftool,clean) .PHONY: bpftool FORCE -- 1.8.3.1
[PATCH v3 net-next 3/4] net sched actions: calculate add/delete event message size
Introduce routines to calculate size of the shared tc netlink attributes and the full message size including netlink header and tc service header. Update add/delete action logic to have the size for event messages, the size is passed to tcf_add_notify() and tcf_del_notify() where the notification message is being allocated and constructed. Signed-off-by: Roman Mashak --- net/sched/act_api.c | 43 +++ 1 file changed, 43 insertions(+) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 3de0e0610200..57cf37145282 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -109,6 +109,42 @@ int __tcf_idr_release(struct tc_action *p, bool bind, bool strict) } EXPORT_SYMBOL(__tcf_idr_release); +static size_t tcf_action_shared_attrs_size(const struct tc_action *act) +{ + u32 cookie_len = 0; + + if (act->act_cookie) + cookie_len = nla_total_size(act->act_cookie->len); + + return nla_total_size(0) /* action number nested */ + + nla_total_size(IFNAMSIZ) /* TCA_ACT_KIND */ + + cookie_len /* TCA_ACT_COOKIE */ + + nla_total_size(0) /* TCA_ACT_STATS nested */ + /* TCA_STATS_BASIC */ + + nla_total_size_64bit(sizeof(struct gnet_stats_basic)) + /* TCA_STATS_QUEUE */ + + nla_total_size_64bit(sizeof(struct gnet_stats_queue)) + + nla_total_size(0) /* TCA_OPTIONS nested */ + + nla_total_size(sizeof(struct tcf_t)); /* TCA_GACT_TM */ +} + +static size_t tcf_action_full_attrs_size(size_t sz) +{ + return NLMSG_HDRLEN /* struct nlmsghdr */ + + sizeof(struct tcamsg) + + nla_total_size(0) /* TCA_ACT_TAB nested */ + + sz; +} + +static size_t tcf_action_fill_size(const struct tc_action *act) +{ + size_t sz = tcf_action_shared_attrs_size(act); + + if (act->ops->get_fill_size) + return act->ops->get_fill_size(act) + sz; + return sz; +} + static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb, struct netlink_callback *cb) { @@ -746,6 +782,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, { struct nlattr *tb[TCA_ACT_MAX_PRIO + 1]; struct tc_action *act; + size_t sz = 0; int err; int i; @@ -761,11 +798,14 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla, goto err; } act->order = i; + sz += tcf_action_fill_size(act); if (ovr) act->tcfa_refcnt++; list_add_tail(&act->list, actions); } + *attr_size = tcf_action_full_attrs_size(sz); + /* Remove the temp refcnt which was necessary to protect against * destroying an existing action which was being replaced */ @@ -1056,9 +1096,12 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n, goto err; } act->order = i; + attr_size += tcf_action_fill_size(act); list_add_tail(&act->list, &actions); } + attr_size = tcf_action_full_attrs_size(attr_size); + if (event == RTM_GETACTION) ret = tcf_get_notify(net, portid, n, &actions, event, extack); else { /* delete */ -- 2.7.4
[PATCH v3 net-next 4/4] net sched actions: implement get_fill_size routine in act_gact
Signed-off-by: Roman Mashak --- net/sched/act_gact.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c index 74563254e676..88fbb8403565 100644 --- a/net/sched/act_gact.c +++ b/net/sched/act_gact.c @@ -217,6 +217,19 @@ static int tcf_gact_search(struct net *net, struct tc_action **a, u32 index, return tcf_idr_search(tn, a, index); } +static size_t tcf_gact_get_fill_size(const struct tc_action *act) +{ + size_t sz = nla_total_size(sizeof(struct tc_gact)); /* TCA_GACT_PARMS */ + +#ifdef CONFIG_GACT_PROB + if (to_gact(act)->tcfg_ptype) + /* TCA_GACT_PROB */ + sz += nla_total_size(sizeof(struct tc_gact_p)); +#endif + + return sz; +} + static struct tc_action_ops act_gact_ops = { .kind = "gact", .type = TCA_ACT_GACT, @@ -227,6 +240,7 @@ static struct tc_action_ops act_gact_ops = { .init = tcf_gact_init, .walk = tcf_gact_walker, .lookup = tcf_gact_search, + .get_fill_size = tcf_gact_get_fill_size, .size = sizeof(struct tcf_gact), }; -- 2.7.4