Re: [PATCH v6 RESEND] r8152: Add support for setting pass through MAC address on RTL8153-AD
> From: Mario Limonciello > Date: Mon, 11 Jul 2016 19:58:04 -0500 > > > The RTL8153-AD supports a persistent system specific MAC address. > > This means a device plugged into two different systems with host > > side support will show different (but persistent) MAC addresses. > > > > This information for the system's persistent MAC address is burned > > in when the system HW is built and available under \_SB.AMAC in the > > DSDT at runtime. > > > > This technology is currently implemented in the Dell TB15 and WD15 > > Type-C docks. More information is available here: > > http://www.dell.com/support/article/us/en/04/SLN301147 > > > > Signed-off-by: Mario Limonciello > > Applied. Hi, Isn't it possible that the same ACPI name could be used for other vendor-specific extensions on other laptops and that r8152 will now wreak havoc there? Regards, MP
[PATCH 1/1] bonding: restrict the data protected by rcu_read_lock
In this function, origin is a pointer to struct aggregator. No matter what agg is changed to, it has nothing to do with origin. CC: Jay Vosburgh Signed-off-by: Zhu Yanjun --- drivers/net/bonding/bond_3ad.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c index edc70ff..20afee3 100644 --- a/drivers/net/bonding/bond_3ad.c +++ b/drivers/net/bonding/bond_3ad.c @@ -1624,8 +1624,8 @@ static void ad_agg_selection_logic(struct aggregator *agg, struct slave *slave; struct port *port; - rcu_read_lock(); origin = agg; + rcu_read_lock(); active = __get_active_agg(agg); best = (active && agg_device_up(active)) ? active : NULL; -- 1.7.9.5
[PATCH v8 05/11] Add sample for adding simple drop program to link
Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX hook of a link. With the drop-only program, observed single core rate is ~20Mpps. Other tests were run, for instance without the dropcnt increment or without reading from the packet header, the packet rate was mostly unchanged. $ perf record -a samples/bpf/xdp1 $( --- samples/bpf/Makefile| 4 ++ samples/bpf/bpf_load.c | 8 +++ samples/bpf/xdp1_kern.c | 93 + samples/bpf/xdp1_user.c | 181 4 files changed, 286 insertions(+) create mode 100644 samples/bpf/xdp1_kern.c create mode 100644 samples/bpf/xdp1_user.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index a98b780..0e4ab3a 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -21,6 +21,7 @@ hostprogs-y += spintest hostprogs-y += map_perf_test hostprogs-y += test_overhead hostprogs-y += test_cgrp2_array_pin +hostprogs-y += xdp1 test_verifier-objs := test_verifier.o libbpf.o test_maps-objs := test_maps.o libbpf.o @@ -42,6 +43,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o +xdp1-objs := bpf_load.o libbpf.o xdp1_user.o # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -64,6 +66,7 @@ always += test_overhead_tp_kern.o always += test_overhead_kprobe_kern.o always += parse_varlen.o parse_simple.o parse_ldabs.o always += test_cgrp2_tc_kern.o +always += xdp1_kern.o HOSTCFLAGS += -I$(objtree)/usr/include @@ -84,6 +87,7 @@ HOSTLOADLIBES_offwaketime += -lelf HOSTLOADLIBES_spintest += -lelf HOSTLOADLIBES_map_perf_test += -lelf -lrt HOSTLOADLIBES_test_overhead += -lelf -lrt +HOSTLOADLIBES_xdp1 += -lelf # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c index 022af71..0cfda23 100644 --- a/samples/bpf/bpf_load.c +++ b/samples/bpf/bpf_load.c @@ -50,6 +50,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) bool is_kprobe = strncmp(event, "kprobe/", 7) == 0; bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0; bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0; + bool is_xdp = strncmp(event, "xdp", 3) == 0; enum bpf_prog_type prog_type; char buf[256]; int fd, efd, err, id; @@ -66,6 +67,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) prog_type = BPF_PROG_TYPE_KPROBE; } else if (is_tracepoint) { prog_type = BPF_PROG_TYPE_TRACEPOINT; + } else if (is_xdp) { + prog_type = BPF_PROG_TYPE_XDP; } else { printf("Unknown event '%s'\n", event); return -1; @@ -79,6 +82,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) prog_fd[prog_cnt++] = fd; + if (is_xdp) + return 0; + if (is_socket) { event += 6; if (*event != '/') @@ -319,6 +325,7 @@ int load_bpf_file(char *path) if (memcmp(shname_prog, "kprobe/", 7) == 0 || memcmp(shname_prog, "kretprobe/", 10) == 0 || memcmp(shname_prog, "tracepoint/", 11) == 0 || + memcmp(shname_prog, "xdp", 3) == 0 || memcmp(shname_prog, "socket", 6) == 0) load_and_attach(shname_prog, insns, data_prog->d_size); } @@ -336,6 +343,7 @@ int load_bpf_file(char *path) if (memcmp(shname, "kprobe/", 7) == 0 || memcmp(shname, "kretprobe/", 10) == 0 || memcmp(shname, "tracepoint/", 11) == 0 || + memcmp(shname, "xdp", 3) == 0 || memcmp(shname, "socket", 6) == 0) load_and_attach(shname, data->d_buf, data->d_size); } diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c new file mode 100644 index 000..e7dd8ac --- /dev/null +++ b/samples/bpf/xdp1_kern.c @@ -0,0 +1,93 @@ +/* Copyright (c) 2016 PLUMgrid + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + */ +#define KBUILD_MODNAME "foo" +#include +#include +#include +#include +#include +#include +#include +#include "bpf_helpers.h" + +struct bpf_map_def SEC("maps") dropcnt = { + .type = BPF_MAP_TYPE_PERCPU_ARRAY, + .key_size = sizeof(u32), + .value_size = sizeof(long), + .max_entries = 256, +}; + +st
[PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support
The mlx4 driver by default allocates order-3 pages for the ring to consume in multiple fragments. When the device has an xdp program, this behavior will prevent tx actions since the page must be re-mapped in TODEVICE mode, which cannot be done if the page is still shared. Start by making the allocator configurable based on whether xdp is running, such that order-0 pages are always used and never shared. Since this will stress the page allocator, add a simple page cache to each rx ring. Pages in the cache are left dma-mapped, and in drop-only stress tests the page allocator is eliminated from the perf report. Note that setting an xdp program will now require the rings to be reconfigured. Before: 26.91% ksoftirqd/0 [mlx4_en] [k] mlx4_en_process_rx_cq 17.88% ksoftirqd/0 [mlx4_en] [k] mlx4_en_alloc_frags 6.00% ksoftirqd/0 [mlx4_en] [k] mlx4_en_free_frag 4.49% ksoftirqd/0 [kernel.vmlinux] [k] get_page_from_freelist 3.21% swapper [kernel.vmlinux] [k] intel_idle 2.73% ksoftirqd/0 [kernel.vmlinux] [k] bpf_map_lookup_elem 2.57% swapper [mlx4_en] [k] mlx4_en_process_rx_cq After: 31.72% swapper [kernel.vmlinux] [k] intel_idle 8.79% swapper [mlx4_en] [k] mlx4_en_process_rx_cq 7.54% swapper [kernel.vmlinux] [k] poll_idle 6.36% swapper [mlx4_core][k] mlx4_eq_int 4.21% swapper [kernel.vmlinux] [k] tasklet_action 4.03% swapper [kernel.vmlinux] [k] cpuidle_enter_state 3.43% swapper [mlx4_en] [k] mlx4_en_prepare_rx_desc 2.18% swapper [kernel.vmlinux] [k] native_irq_return_iret 1.37% swapper [kernel.vmlinux] [k] menu_select 1.09% swapper [kernel.vmlinux] [k] bpf_map_lookup_elem Signed-off-by: Brenden Blanco --- drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 46 +++-- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 69 ++--- drivers/net/ethernet/mellanox/mlx4/mlx4_en.h| 12 - 4 files changed, 115 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c index 51a2e82..d3d51fa 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c @@ -47,7 +47,7 @@ #define EN_ETHTOOL_SHORT_MASK cpu_to_be16(0x) #define EN_ETHTOOL_WORD_MASK cpu_to_be32(0x) -static int mlx4_en_moderation_update(struct mlx4_en_priv *priv) +int mlx4_en_moderation_update(struct mlx4_en_priv *priv) { int i; int err = 0; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 369a2ef..252c893d 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -2532,21 +2532,59 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog) { struct mlx4_en_priv *priv = netdev_priv(dev); + struct mlx4_en_dev *mdev = priv->mdev; struct bpf_prog *old_prog; + int port_up = 0; + int err; + + /* No need to reconfigure buffers when simply swapping the +* program for a new one. +*/ + if (READ_ONCE(priv->prog) && prog) { + /* This xchg is paired with READ_ONCE in the fast path, but is +* also protected from itself via rtnl lock +*/ + old_prog = xchg(&priv->prog, prog); + if (old_prog) + bpf_prog_put(old_prog); + return 0; + } if (priv->num_frags > 1) { en_err(priv, "Cannot set XDP if MTU requires multiple frags\n"); return -EOPNOTSUPP; } - /* This xchg is paired with READ_ONCE in the fast path, but is -* also protected from itself via rtnl lock -*/ + mutex_lock(&mdev->state_lock); + if (priv->port_up) { + port_up = 1; + mlx4_en_stop_port(dev, 1); + } + + mlx4_en_free_resources(priv); + old_prog = xchg(&priv->prog, prog); if (old_prog) bpf_prog_put(old_prog); - return 0; + err = mlx4_en_alloc_resources(priv); + if (err) { + en_err(priv, "Failed reallocating port resources\n"); + goto out; + } + if (port_up) { + err = mlx4_en_start_port(dev); + if (err) + en_err(priv, "Failed starting port\n"); + } + + err = mlx4_en_moderation_update(priv); + +out: + if (err) + priv->prog = NULL; + mutex_unlock(&mdev->state_lock); + return err; } static bool mlx4_xdp_attached(struct net_device *dev) diff
[PATCH v8 02/11] net: add ndo to setup/query xdp prog in adapter rx
Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP filter. The single op is used for both setup/query of the xdp program, modelled after ndo_setup_tc. Signed-off-by: Brenden Blanco --- include/linux/netdevice.h | 34 ++ net/core/dev.c| 33 + 2 files changed, 67 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 49736a3..fab9a1c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -63,6 +63,7 @@ struct wpan_dev; struct mpls_dev; /* UDP Tunnel offloads */ struct udp_tunnel_info; +struct bpf_prog; void netdev_set_default_ethtool_ops(struct net_device *dev, const struct ethtool_ops *ops); @@ -799,6 +800,33 @@ struct tc_to_netdev { }; }; +/* These structures hold the attributes of xdp state that are being passed + * to the netdevice through the xdp op. + */ +enum xdp_netdev_command { + /* Set or clear a bpf program used in the earliest stages of packet +* rx. The prog will have been loaded as BPF_PROG_TYPE_XDP. The callee +* is responsible for calling bpf_prog_put on any old progs that are +* stored. In case of error, the callee need not release the new prog +* reference, but on success it takes ownership and must bpf_prog_put +* when it is no longer used. +*/ + XDP_SETUP_PROG, + /* Check if a bpf program is set on the device. The callee should +* return true if a program is currently attached and running. +*/ + XDP_QUERY_PROG, +}; + +struct netdev_xdp { + enum xdp_netdev_command command; + union { + /* XDP_SETUP_PROG */ + struct bpf_prog *prog; + /* XDP_QUERY_PROG */ + bool prog_attached; + }; +}; /* * This structure defines the management hooks for network devices. @@ -1087,6 +1115,9 @@ struct tc_to_netdev { * appropriate rx headroom value allows avoiding skb head copy on * forward. Setting a negative value resets the rx headroom to the * default value. + * int (*ndo_xdp)(struct net_device *dev, struct netdev_xdp *xdp); + * This function is used to set or query state related to XDP on the + * netdevice. See definition of enum xdp_netdev_command for details. * */ struct net_device_ops { @@ -1271,6 +1302,8 @@ struct net_device_ops { struct sk_buff *skb); void(*ndo_set_rx_headroom)(struct net_device *dev, int needed_headroom); + int (*ndo_xdp)(struct net_device *dev, + struct netdev_xdp *xdp); }; /** @@ -3257,6 +3290,7 @@ int dev_get_phys_port_id(struct net_device *dev, int dev_get_phys_port_name(struct net_device *dev, char *name, size_t len); int dev_change_proto_down(struct net_device *dev, bool proto_down); +int dev_change_xdp_fd(struct net_device *dev, int fd); struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev); struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq, int *ret); diff --git a/net/core/dev.c b/net/core/dev.c index 7894e40..2a9c39f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -94,6 +94,7 @@ #include #include #include +#include #include #include #include @@ -6615,6 +6616,38 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down) EXPORT_SYMBOL(dev_change_proto_down); /** + * dev_change_xdp_fd - set or clear a bpf program for a device rx path + * @dev: device + * @fd: new program fd or negative value to clear + * + * Set or clear a bpf program for a device + */ +int dev_change_xdp_fd(struct net_device *dev, int fd) +{ + const struct net_device_ops *ops = dev->netdev_ops; + struct bpf_prog *prog = NULL; + struct netdev_xdp xdp = {}; + int err; + + if (!ops->ndo_xdp) + return -EOPNOTSUPP; + if (fd >= 0) { + prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP); + if (IS_ERR(prog)) + return PTR_ERR(prog); + } + + xdp.command = XDP_SETUP_PROG; + xdp.prog = prog; + err = ops->ndo_xdp(dev, &xdp); + if (err < 0 && prog) + bpf_prog_put(prog); + + return err; +} +EXPORT_SYMBOL(dev_change_xdp_fd); + +/** * dev_new_index - allocate an ifindex * @net: the applicable net namespace * -- 2.8.2
[PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding
This patch set introduces new infrastructure for programmatically processing packets in the earliest stages of rx, as part of an effort others are calling eXpress Data Path (XDP) [1]. Start this effort by introducing a new bpf program type for early packet filtering, before even an skb has been allocated. Extend on this with the ability to modify packet data and send back out on the same port. Patch 1 introduces the new prog type and helpers for validating the bpf program. A new userspace struct is defined containing only data and data_end as fields, with others to follow in the future. In patch 2, create a new ndo to pass the fd to supported drivers. In patch 3, expose a new rtnl option to userspace. In patch 4, enable support in mlx4 driver. In patch 5, create a sample drop and count program. With single core, achieved ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes packet data access, bpf array lookup, and increment. In patch 6, add a page recycle facility to mlx4 rx, enabled when xdp is active. In patch 7, add the XDP_TX type to bpf.h In patch 8, add helper in tx patch for writing tx_desc In patch 9, add support in mlx4 for packet data write and forwarding In patch 10, turn on packet write support in the bpf verifier In patch 11, add a sample program for packet write and forwarding. With single core, achieved ~10 Mpps rewrite and forwarding. [1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf v8: 1/11: Reduce WARN_ONCE to single line. Also, change act param of that function to u32 to match return type of bpf_prog_run_xdp. 2/11: Clarify locking semantics in ndo comment. 4/11: Add en_err warning in mlx4_xdp_set on num_frags/mtu violation. v7: Addressing two of the major discussion points: return codes and ndo. The rest will be taken as todo items for separate patches. Add an XDP_ABORTED type, which explicitly falls through to DROP. The same result must be taken for the default case as well, as it is now well-defined API behavior. Merge ndo_xdp_* into a single ndo. The style is similar to ndo_setup_tc, but with less unidirectional naming convention. The IFLA parameter names are unchanged. TODOs: Add ethtool per-ring stats for aborted, default cases, maybe even drop and tx as well. Avoid duplicate dma sync operation in XDP_PASS case as mentioned by Saeed. 1/12: Add XDP_ABORTED enum, reword API comment, and update commit message. 2/12: Rewrite ndo_xdp_*() into single ndo_xdp() with type/union style calling convention. 3/12: Switch to ndo_xdp callback. 4/12: Add XDP_ABORTED case as a fall-through to XDP_DROP. Implement ndo_xdp. 12/12: Dropped, this will need some more work. v6: 2/12: drop unnecessary netif_device_present check 4/12, 6/12, 9/12: Reorder default case statement above drop case to remove some copy/paste. v5: 0/12: Rebase and remove previous 1/13 patch 1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed in future. Add bpf_warn_invalid_xdp_action() helper, to be used when out of bounds action is returned by the program. Add a comment to bpf.h denoting the undefined nature of out of bounds returns. 2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to ndo_xdp_attached(). 3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED. 4/12: Fixup the use of READ_ONCE in the ndos. Add a user of bpf_warn_invalid_xdp_action helper. 5/12: Adjust to using the nested netlink options. 6/12: kbuild was complaining about overflow of u16 on tile architecture...bump frag_stride to u32. The page_offset member that is computed from this was already u32. v4: 2/12: Add inline helper for calling xdp bpf prog under rcu 3/12: Add detail to ndo comments 5/12: Remove mlx4_call_xdp and use inline helper instead. 6/12: Fix checkpatch complaints 9/12: Introduce new patch 9/12 with common helper for tx_desc write Refactor to use common tx_desc write helper 11/12: Fix checkpatch complaints v3: Rewrite from v2 trying to incorporate feedback from multiple sources. Specifically, add ability to forward packets out the same port and allow packet modification. For packet forwarding, the driver reserves a dedicated set of tx rings for exclusive use by xdp. Upon completion, the pages on this ring are recycled directly back to a small per-rx-ring page cache without being dma unmapped. Use of the percpu skb is dropped in favor of a lightweight struct xdp_buff. The direct packet access feature is leveraged to remove dependence on the skb. The mlx4 driver implementation allocates a page-per-packet and maps it in PCI_DMA_BIDIRECTIONAL mode when the bpf program is activated. Naming is converted to use "xdp" instead of "phys_dev". v2: 1/5: Drop xdp from types, instead consistently use bpf_phys_dev_ Introduce enum for return values from phy
[PATCH v8 11/11] bpf: add sample for xdp forwarding and rewrite
Add a sample that rewrites and forwards packets out on the same interface. Observed single core forwarding performance of ~10Mpps. Since the mlx4 driver under test recycles every single packet page, the perf output shows almost exclusively just the ring management and bpf program work. Slowdowns are likely occurring due to cache misses. Signed-off-by: Brenden Blanco --- samples/bpf/Makefile| 5 +++ samples/bpf/xdp2_kern.c | 114 2 files changed, 119 insertions(+) create mode 100644 samples/bpf/xdp2_kern.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 0e4ab3a..d2d2b35 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -22,6 +22,7 @@ hostprogs-y += map_perf_test hostprogs-y += test_overhead hostprogs-y += test_cgrp2_array_pin hostprogs-y += xdp1 +hostprogs-y += xdp2 test_verifier-objs := test_verifier.o libbpf.o test_maps-objs := test_maps.o libbpf.o @@ -44,6 +45,8 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o xdp1-objs := bpf_load.o libbpf.o xdp1_user.o +# reuse xdp1 source intentionally +xdp2-objs := bpf_load.o libbpf.o xdp1_user.o # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -67,6 +70,7 @@ always += test_overhead_kprobe_kern.o always += parse_varlen.o parse_simple.o parse_ldabs.o always += test_cgrp2_tc_kern.o always += xdp1_kern.o +always += xdp2_kern.o HOSTCFLAGS += -I$(objtree)/usr/include @@ -88,6 +92,7 @@ HOSTLOADLIBES_spintest += -lelf HOSTLOADLIBES_map_perf_test += -lelf -lrt HOSTLOADLIBES_test_overhead += -lelf -lrt HOSTLOADLIBES_xdp1 += -lelf +HOSTLOADLIBES_xdp2 += -lelf # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c new file mode 100644 index 000..38fe7e1 --- /dev/null +++ b/samples/bpf/xdp2_kern.c @@ -0,0 +1,114 @@ +/* Copyright (c) 2016 PLUMgrid + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + */ +#define KBUILD_MODNAME "foo" +#include +#include +#include +#include +#include +#include +#include +#include "bpf_helpers.h" + +struct bpf_map_def SEC("maps") dropcnt = { + .type = BPF_MAP_TYPE_PERCPU_ARRAY, + .key_size = sizeof(u32), + .value_size = sizeof(long), + .max_entries = 256, +}; + +static void swap_src_dst_mac(void *data) +{ + unsigned short *p = data; + unsigned short dst[3]; + + dst[0] = p[0]; + dst[1] = p[1]; + dst[2] = p[2]; + p[0] = p[3]; + p[1] = p[4]; + p[2] = p[5]; + p[3] = dst[0]; + p[4] = dst[1]; + p[5] = dst[2]; +} + +static int parse_ipv4(void *data, u64 nh_off, void *data_end) +{ + struct iphdr *iph = data + nh_off; + + if (iph + 1 > data_end) + return 0; + return iph->protocol; +} + +static int parse_ipv6(void *data, u64 nh_off, void *data_end) +{ + struct ipv6hdr *ip6h = data + nh_off; + + if (ip6h + 1 > data_end) + return 0; + return ip6h->nexthdr; +} + +SEC("xdp1") +int xdp_prog1(struct xdp_md *ctx) +{ + void *data_end = (void *)(long)ctx->data_end; + void *data = (void *)(long)ctx->data; + struct ethhdr *eth = data; + int rc = XDP_DROP; + long *value; + u16 h_proto; + u64 nh_off; + u32 index; + + nh_off = sizeof(*eth); + if (data + nh_off > data_end) + return rc; + + h_proto = eth->h_proto; + + if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) { + struct vlan_hdr *vhdr; + + vhdr = data + nh_off; + nh_off += sizeof(struct vlan_hdr); + if (data + nh_off > data_end) + return rc; + h_proto = vhdr->h_vlan_encapsulated_proto; + } + if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) { + struct vlan_hdr *vhdr; + + vhdr = data + nh_off; + nh_off += sizeof(struct vlan_hdr); + if (data + nh_off > data_end) + return rc; + h_proto = vhdr->h_vlan_encapsulated_proto; + } + + if (h_proto == htons(ETH_P_IP)) + index = parse_ipv4(data, nh_off, data_end); + else if (h_proto == htons(ETH_P_IPV6)) + index = parse_ipv6(data, nh_off, data_end); + else + index = 0; + + value = bpf_map_lookup_elem(&dropcnt, &index); + if (value) + *value += 1; + + if (index == 17) { + swap_src_dst_mac(data); +
[PATCH v8 07/11] bpf: add XDP_TX xdp_action for direct forwarding
XDP enabled drivers must transmit received packets back out on the same port they were received on when a program returns this action. Signed-off-by: Brenden Blanco --- include/uapi/linux/bpf.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 4282d44..a8f1ea1 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -447,6 +447,7 @@ enum xdp_action { XDP_ABORTED = 0, XDP_DROP, XDP_PASS, + XDP_TX, }; /* user accessible metadata for XDP packet hook -- 2.8.2
[PATCH v8 08/11] net/mlx4_en: break out tx_desc write into separate function
In preparation for writing the tx descriptor from multiple functions, create a helper for both normal and blueflame access. Signed-off-by: Brenden Blanco --- drivers/infiniband/hw/mlx4/qp.c| 11 +-- drivers/net/ethernet/mellanox/mlx4/en_tx.c | 127 + include/linux/mlx4/qp.h| 18 ++-- 3 files changed, 90 insertions(+), 66 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 8db8405..768085f 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -232,7 +232,7 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) } } else { ctrl = buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); - s = (ctrl->fence_size & 0x3f) << 4; + s = (ctrl->qpn_vlan.fence_size & 0x3f) << 4; for (i = 64; i < s; i += 64) { wqe = buf + i; *wqe = cpu_to_be32(0x); @@ -264,7 +264,7 @@ static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); } ctrl->srcrb_flags = 0; - ctrl->fence_size = size / 16; + ctrl->qpn_vlan.fence_size = size / 16; /* * Make sure descriptor is fully written before setting ownership bit * (because HW can start executing as soon as we do). @@ -1992,7 +1992,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); if (qp->sq_max_wqes_per_wr == 1) - ctrl->fence_size = 1 << (qp->sq.wqe_shift - 4); + ctrl->qpn_vlan.fence_size = + 1 << (qp->sq.wqe_shift - 4); stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } @@ -3169,8 +3170,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, wmb(); *lso_wqe = lso_hdr_sz; - ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ? - MLX4_WQE_CTRL_FENCE : 0) | size; + ctrl->qpn_vlan.fence_size = (wr->send_flags & IB_SEND_FENCE ? +MLX4_WQE_CTRL_FENCE : 0) | size; /* * Make sure descriptor is fully written before diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c index 76aa4d2..c29191e 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c @@ -700,10 +700,66 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src, __iowrite64_copy(dst, src, bytecnt / 8); } +void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring) +{ + wmb(); + /* Since there is no iowrite*_native() that writes the +* value as is, without byteswapping - using the one +* the doesn't do byteswapping in the relevant arch +* endianness. +*/ +#if defined(__LITTLE_ENDIAN) + iowrite32( +#else + iowrite32be( +#endif + ring->doorbell_qpn, + ring->bf.uar->map + MLX4_SEND_DOORBELL); +} + +static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring, + struct mlx4_en_tx_desc *tx_desc, + union mlx4_wqe_qpn_vlan qpn_vlan, + int desc_size, int bf_index, + __be32 op_own, bool bf_ok, + bool send_doorbell) +{ + tx_desc->ctrl.qpn_vlan = qpn_vlan; + + if (bf_ok) { + op_own |= htonl((bf_index & 0x) << 8); + /* Ensure new descriptor hits memory +* before setting ownership of this descriptor to HW +*/ + dma_wmb(); + tx_desc->ctrl.owner_opcode = op_own; + + wmb(); + + mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl, +desc_size); + + wmb(); + + ring->bf.offset ^= ring->bf.buf_size; + } else { + /* Ensure new descriptor hits memory +* before setting ownership of this descriptor to HW +*/ + dma_wmb(); + tx_desc->ctrl.owner_opcode = op_own; + if (send_doorbell) + mlx4_en_xmit_doorbell(ring); + else + ring->xmit_more++; + } +} + netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev) { struct skb_shared_info *shinfo = skb_shinfo(skb); struct mlx4_en_priv *priv = netdev_priv(dev); + union mlx4_wqe_qpn_vlan
[PATCH v8 03/11] rtnl: add option for setting link xdp prog
Sets the bpf program represented by fd as an early filter in the rx path of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP. Providing a negative value as fd clears the program. Getting the fd back via rtnl is not possible, therefore reading of this value merely provides a bool whether the program is valid on the link or not. Signed-off-by: Brenden Blanco --- include/uapi/linux/if_link.h | 12 + net/core/rtnetlink.c | 64 2 files changed, 76 insertions(+) diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 4285ac3..a1b5202 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -156,6 +156,7 @@ enum { IFLA_GSO_MAX_SEGS, IFLA_GSO_MAX_SIZE, IFLA_PAD, + IFLA_XDP, __IFLA_MAX }; @@ -843,4 +844,15 @@ enum { }; #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1) +/* XDP section */ + +enum { + IFLA_XDP_UNSPEC, + IFLA_XDP_FD, + IFLA_XDP_ATTACHED, + __IFLA_XDP_MAX, +}; + +#define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1) + #endif /* _UAPI_LINUX_IF_LINK_H */ diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index a9e3805..eba2b82 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -891,6 +891,16 @@ static size_t rtnl_port_size(const struct net_device *dev, return port_self_size; } +static size_t rtnl_xdp_size(const struct net_device *dev) +{ + size_t xdp_size = nla_total_size(1);/* XDP_ATTACHED */ + + if (!dev->netdev_ops->ndo_xdp) + return 0; + else + return xdp_size; +} + static noinline size_t if_nlmsg_size(const struct net_device *dev, u32 ext_filter_mask) { @@ -927,6 +937,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev, + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */ + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */ + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */ + + rtnl_xdp_size(dev) /* IFLA_XDP */ + nla_total_size(1); /* IFLA_PROTO_DOWN */ } @@ -1211,6 +1222,33 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev) return 0; } +static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev) +{ + struct netdev_xdp xdp_op = {}; + struct nlattr *xdp; + int err; + + if (!dev->netdev_ops->ndo_xdp) + return 0; + xdp = nla_nest_start(skb, IFLA_XDP); + if (!xdp) + return -EMSGSIZE; + xdp_op.command = XDP_QUERY_PROG; + err = dev->netdev_ops->ndo_xdp(dev, &xdp_op); + if (err) + goto err_cancel; + err = nla_put_u8(skb, IFLA_XDP_ATTACHED, xdp_op.prog_attached); + if (err) + goto err_cancel; + + nla_nest_end(skb, xdp); + return 0; + +err_cancel: + nla_nest_cancel(skb, xdp); + return err; +} + static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, int type, u32 pid, u32 seq, u32 change, unsigned int flags, u32 ext_filter_mask) @@ -1307,6 +1345,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, if (rtnl_port_fill(skb, dev, ext_filter_mask)) goto nla_put_failure; + if (rtnl_xdp_fill(skb, dev)) + goto nla_put_failure; + if (dev->rtnl_link_ops || rtnl_have_link_slave_info(dev)) { if (rtnl_link_fill(skb, dev) < 0) goto nla_put_failure; @@ -1392,6 +1433,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = { [IFLA_PHYS_SWITCH_ID] = { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN }, [IFLA_LINK_NETNSID] = { .type = NLA_S32 }, [IFLA_PROTO_DOWN] = { .type = NLA_U8 }, + [IFLA_XDP] = { .type = NLA_NESTED }, }; static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = { @@ -1429,6 +1471,11 @@ static const struct nla_policy ifla_port_policy[IFLA_PORT_MAX+1] = { [IFLA_PORT_RESPONSE]= { .type = NLA_U16, }, }; +static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = { + [IFLA_XDP_FD] = { .type = NLA_S32 }, + [IFLA_XDP_ATTACHED] = { .type = NLA_U8 }, +}; + static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr *nla) { const struct rtnl_link_ops *ops = NULL; @@ -2054,6 +2101,23 @@ static int do_setlink(const struct sk_buff *skb, status |= DO_SETLINK_NOTIFY; } + if (tb[IFLA_XDP]) { + struct nlattr *xdp[IFLA_XDP_MAX + 1]; + + err = nla_parse_nested(xdp, IFLA_XDP_MAX, tb[IFLA_XDP], + ifla_xdp_policy); + if (err < 0) + goto
[PATCH v8 10/11] bpf: enable direct packet data write for xdp progs
For forwarding to be effective, XDP programs should be allowed to rewrite packet data. This requires that the drivers supporting XDP must all map the packet memory as TODEVICE or BIDIRECTIONAL before invoking the program. Signed-off-by: Brenden Blanco --- kernel/bpf/verifier.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index a8d67d0..f72f23b 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -653,6 +653,16 @@ static int check_map_access(struct verifier_env *env, u32 regno, int off, #define MAX_PACKET_OFF 0x +static bool may_write_pkt_data(enum bpf_prog_type type) +{ + switch (type) { + case BPF_PROG_TYPE_XDP: + return true; + default: + return false; + } +} + static int check_packet_access(struct verifier_env *env, u32 regno, int off, int size) { @@ -806,10 +816,15 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off, err = check_stack_read(state, off, size, value_regno); } } else if (state->regs[regno].type == PTR_TO_PACKET) { - if (t == BPF_WRITE) { + if (t == BPF_WRITE && !may_write_pkt_data(env->prog->type)) { verbose("cannot write into packet\n"); return -EACCES; } + if (t == BPF_WRITE && value_regno >= 0 && + is_pointer_value(env, value_regno)) { + verbose("R%d leaks addr into packet\n", value_regno); + return -EACCES; + } err = check_packet_access(env, regno, off, size); if (!err && t == BPF_READ && value_regno >= 0) mark_reg_unknown_value(state->regs, value_regno); -- 2.8.2
[PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver. In tc/socket bpf programs, helpers linearize skb fragments as needed when the program touches the packet data. However, in the pursuit of speed, XDP programs will not be allowed to use these slower functions, especially if it involves allocating an skb. Therefore, disallow MTU settings that would produce a multi-fragment packet that XDP programs would fail to access. Future enhancements could be done to increase the allowable MTU. Signed-off-by: Brenden Blanco --- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 51 ++ drivers/net/ethernet/mellanox/mlx4/en_rx.c | 37 +-- drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 5 +++ 3 files changed, 89 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 6083775..369a2ef 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -31,6 +31,7 @@ * */ +#include #include #include #include @@ -2084,6 +2085,9 @@ void mlx4_en_destroy_netdev(struct net_device *dev) if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_TS) mlx4_en_remove_timestamp(mdev); + if (priv->prog) + bpf_prog_put(priv->prog); + /* Detach the netdev so tasks would not attempt to access it */ mutex_lock(&mdev->state_lock); mdev->pndev[priv->port] = NULL; @@ -2112,6 +2116,11 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu) en_err(priv, "Bad MTU size:%d.\n", new_mtu); return -EPERM; } + if (priv->prog && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) { + en_err(priv, "MTU size:%d requires frags but XDP prog running", + new_mtu); + return -EOPNOTSUPP; + } dev->mtu = new_mtu; if (netif_running(dev)) { @@ -2520,6 +2529,46 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m return err; } +static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog) +{ + struct mlx4_en_priv *priv = netdev_priv(dev); + struct bpf_prog *old_prog; + + if (priv->num_frags > 1) { + en_err(priv, "Cannot set XDP if MTU requires multiple frags\n"); + return -EOPNOTSUPP; + } + + /* This xchg is paired with READ_ONCE in the fast path, but is +* also protected from itself via rtnl lock +*/ + old_prog = xchg(&priv->prog, prog); + if (old_prog) + bpf_prog_put(old_prog); + + return 0; +} + +static bool mlx4_xdp_attached(struct net_device *dev) +{ + struct mlx4_en_priv *priv = netdev_priv(dev); + + return !!READ_ONCE(priv->prog); +} + +static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp) +{ + switch (xdp->command) { + case XDP_SETUP_PROG: + return mlx4_xdp_set(dev, xdp->prog); + case XDP_QUERY_PROG: + xdp->prog_attached = mlx4_xdp_attached(dev); + return 0; + default: + return -EINVAL; + } +} + static const struct net_device_ops mlx4_netdev_ops = { .ndo_open = mlx4_en_open, .ndo_stop = mlx4_en_close, @@ -2548,6 +2597,7 @@ static const struct net_device_ops mlx4_netdev_ops = { .ndo_udp_tunnel_del = mlx4_en_del_vxlan_port, .ndo_features_check = mlx4_en_features_check, .ndo_set_tx_maxrate = mlx4_en_set_tx_maxrate, + .ndo_xdp= mlx4_xdp, }; static const struct net_device_ops mlx4_netdev_ops_master = { @@ -2584,6 +2634,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = { .ndo_udp_tunnel_del = mlx4_en_del_vxlan_port, .ndo_features_check = mlx4_en_features_check, .ndo_set_tx_maxrate = mlx4_en_set_tx_maxrate, + .ndo_xdp= mlx4_xdp, }; struct mlx4_en_bond { diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index c1b3a9c..adfa123 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -743,6 +743,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring]; struct mlx4_en_rx_alloc *frags; struct mlx4_en_rx_desc *rx_desc; + struct bpf_prog *prog; struct sk_buff *skb; int index; int nr; @@ -759,6 +760,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud if (budget <= 0) return polled; + prog = READ_ONCE(priv->prog); + /* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx * descriptor offset can be deduced from the CQE
[PATCH v8 01/11] bpf: add XDP prog type for early driver filter
Add a new bpf prog type that is intended to run in early stages of the packet rx path. Only minimal packet metadata will be available, hence a new context type, struct xdp_md, is exposed to userspace. So far only expose the packet start and end pointers, and only in read mode. An XDP program must return one of the well known enum values, all other return codes are reserved for future use. Unfortunately, this restriction is hard to enforce at verification time, so take the approach of warning at runtime when such programs are encountered. Out of bounds return codes should alias to XDP_ABORTED. Signed-off-by: Brenden Blanco --- include/linux/filter.h | 18 +++ include/uapi/linux/bpf.h | 20 kernel/bpf/verifier.c| 1 + net/core/filter.c| 79 4 files changed, 118 insertions(+) diff --git a/include/linux/filter.h b/include/linux/filter.h index 6fc31ef..15d816a 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -368,6 +368,11 @@ struct bpf_skb_data_end { void *data_end; }; +struct xdp_buff { + void *data; + void *data_end; +}; + /* compute the linear packet data range [data, data_end) which * will be accessed by cls_bpf and act_bpf programs */ @@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog, return BPF_PROG_RUN(prog, skb); } +static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog, + struct xdp_buff *xdp) +{ + u32 ret; + + rcu_read_lock(); + ret = BPF_PROG_RUN(prog, (void *)xdp); + rcu_read_unlock(); + + return ret; +} + static inline unsigned int bpf_prog_size(unsigned int proglen) { return max(sizeof(struct bpf_prog), @@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func); struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, const struct bpf_insn *patch, u32 len); +void bpf_warn_invalid_xdp_action(u32 act); #ifdef CONFIG_BPF_JIT extern int bpf_jit_enable; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 262a7e8..4282d44 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -94,6 +94,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SCHED_CLS, BPF_PROG_TYPE_SCHED_ACT, BPF_PROG_TYPE_TRACEPOINT, + BPF_PROG_TYPE_XDP, }; #define BPF_PSEUDO_MAP_FD 1 @@ -437,4 +438,23 @@ struct bpf_tunnel_key { __u32 tunnel_label; }; +/* User return codes for XDP prog type. + * A valid XDP program must return one of these defined values. All other + * return codes are reserved for future use. Unknown return codes will result + * in packet drop. + */ +enum xdp_action { + XDP_ABORTED = 0, + XDP_DROP, + XDP_PASS, +}; + +/* user accessible metadata for XDP packet hook + * new fields must be added to the end of this structure + */ +struct xdp_md { + __u32 data; + __u32 data_end; +}; + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index e206c21..a8d67d0 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg, switch (env->prog->type) { case BPF_PROG_TYPE_SCHED_CLS: case BPF_PROG_TYPE_SCHED_ACT: + case BPF_PROG_TYPE_XDP: break; default: verbose("verifier is misconfigured\n"); diff --git a/net/core/filter.c b/net/core/filter.c index 10c4a2f..2d770f5 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id) } } +static const struct bpf_func_proto * +xdp_func_proto(enum bpf_func_id func_id) +{ + return sk_filter_func_proto(func_id); +} + static bool __is_valid_access(int off, int size, enum bpf_access_type type) { if (off < 0 || off >= sizeof(struct __sk_buff)) @@ -2436,6 +2442,44 @@ static bool tc_cls_act_is_valid_access(int off, int size, return __is_valid_access(off, size, type); } +static bool __is_valid_xdp_access(int off, int size, + enum bpf_access_type type) +{ + if (off < 0 || off >= sizeof(struct xdp_md)) + return false; + if (off % size != 0) + return false; + if (size != 4) + return false; + + return true; +} + +static bool xdp_is_valid_access(int off, int size, + enum bpf_access_type type, + enum bpf_reg_type *reg_type) +{ + if (type == BPF_WRITE) + return false; + + switch (off) { + case offsetof(struct xdp_md, data): + *reg_type = PTR_TO_PACKET; + break; + case offsetof(struct xdp_md, data_end): + *reg_type = PTR_TO_PACKET_END;
[PATCH v8 09/11] net/mlx4_en: add xdp forwarding and data write support
A user will now be able to loop packets back out of the same port using a bpf program attached to xdp hook. Updates to the packet contents from the bpf program is also supported. For the packet write feature to work, the rx buffers are now mapped as bidirectional when the page is allocated. This occurs only when the xdp hook is active. When the program returns a TX action, enqueue the packet directly to a dedicated tx ring, so as to avoid completely any locking. This requires the tx ring to be allocated 1:1 for each rx ring, as well as the tx completion running in the same softirq. Upon tx completion, this dedicated tx ring recycles pages without unmapping directly back to the original rx ring. In steady state tx/drop workload, effectively 0 page allocs/frees will occur. Signed-off-by: Brenden Blanco --- drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 15 ++- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 19 +++- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 14 +++ drivers/net/ethernet/mellanox/mlx4/en_tx.c | 126 +++- drivers/net/ethernet/mellanox/mlx4/mlx4_en.h| 14 ++- 5 files changed, 181 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c index d3d51fa..10642b1 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c @@ -1694,6 +1694,11 @@ static int mlx4_en_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd) return err; } +static int mlx4_en_max_tx_channels(struct mlx4_en_priv *priv) +{ + return (MAX_TX_RINGS - priv->rsv_tx_rings) / MLX4_EN_NUM_UP; +} + static void mlx4_en_get_channels(struct net_device *dev, struct ethtool_channels *channel) { @@ -1705,7 +1710,7 @@ static void mlx4_en_get_channels(struct net_device *dev, channel->max_tx = MLX4_EN_MAX_TX_RING_P_UP; channel->rx_count = priv->rx_ring_num; - channel->tx_count = priv->tx_ring_num / MLX4_EN_NUM_UP; + channel->tx_count = priv->num_tx_rings_p_up; } static int mlx4_en_set_channels(struct net_device *dev, @@ -1717,7 +1722,7 @@ static int mlx4_en_set_channels(struct net_device *dev, int err = 0; if (channel->other_count || channel->combined_count || - channel->tx_count > MLX4_EN_MAX_TX_RING_P_UP || + channel->tx_count > mlx4_en_max_tx_channels(priv) || channel->rx_count > MAX_RX_RINGS || !channel->tx_count || !channel->rx_count) return -EINVAL; @@ -1731,7 +1736,8 @@ static int mlx4_en_set_channels(struct net_device *dev, mlx4_en_free_resources(priv); priv->num_tx_rings_p_up = channel->tx_count; - priv->tx_ring_num = channel->tx_count * MLX4_EN_NUM_UP; + priv->tx_ring_num = channel->tx_count * MLX4_EN_NUM_UP + + priv->rsv_tx_rings; priv->rx_ring_num = channel->rx_count; err = mlx4_en_alloc_resources(priv); @@ -1740,7 +1746,8 @@ static int mlx4_en_set_channels(struct net_device *dev, goto out; } - netif_set_real_num_tx_queues(dev, priv->tx_ring_num); + netif_set_real_num_tx_queues(dev, priv->tx_ring_num - + priv->rsv_tx_rings); netif_set_real_num_rx_queues(dev, priv->rx_ring_num); if (dev->num_tc) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 252c893d..50fbc8f 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -1636,7 +1636,7 @@ int mlx4_en_start_port(struct net_device *dev) /* Configure ring */ tx_ring = priv->tx_ring[i]; err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn, - i / priv->num_tx_rings_p_up); + i / (priv->tx_ring_num / MLX4_EN_NUM_UP)); if (err) { en_err(priv, "Failed allocating Tx ring\n"); mlx4_en_deactivate_cq(priv, cq); @@ -2022,6 +2022,16 @@ int mlx4_en_alloc_resources(struct mlx4_en_priv *priv) goto err; } + /* When rsv_tx_rings is non-zero, each rx ring will have a +* corresponding tx ring, with the tx completion event for that ring +* recycling buffers into the cache. +*/ + for (i = 0; i < priv->rsv_tx_rings; i++) { + int j = (priv->tx_ring_num - priv->rsv_tx_rings) + i; + + priv->tx_ring[j]->recycle_ring = priv->rx_ring[i]; + } + #ifdef CONFIG_RFS_ACCEL priv->dev->rx_cpu_rmap = mlx4_get_cpu_rmap(priv->mdev->dev, priv->port); #endif @@ -2534,9 +2544,12 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog) struct mlx4_e
[iproute PATCH v2] ip-address.8: Document autojoin flag
Description copied from related kernel support commit message with a little tailoring to fit. While at it, fix font of non-terminal CONFFLAG-LIST in synopsis. Signed-off-by: Phil Sutter --- Changes since v1: - Set commands/flags in bold font instead of quotes. - Make capitalization consistent for 'Ethernet' and 'IGMP'. - Set 'VXLAN' and 'Openvswitch' in upper case. - Fix typo: then -> them. - Misc grammar and language fixup. --- man/man8/ip-address.8.in | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/man/man8/ip-address.8.in b/man/man8/ip-address.8.in index 7cb9271731339..7d6eb9b2b1915 100644 --- a/man/man8/ip-address.8.in +++ b/man/man8/ip-address.8.in @@ -77,14 +77,15 @@ ip-address \- protocol address management .IR FLAG " := " .RB "[ " permanent " | " dynamic " | " secondary " | " primary " |" .RB [ - ] tentative " | [" - ] deprecated " | [" - ] dadfailed " |" -.BR temporary " | " CONFFLAG-LIST " ]" +.BR temporary " |" +.IR CONFFLAG-LIST " ]" .ti -8 .IR CONFFLAG-LIST " := [ " CONFFLAG-LIST " ] " CONFFLAG .ti -8 .IR CONFFLAG " := " -.RB "[ " home " | " mngtmpaddr " | " nodad " | " noprefixroute " ]" +.RB "[ " home " | " mngtmpaddr " | " nodad " | " noprefixroute " | " autojoin " ]" .ti -8 .IR LIFETIME " := [ " @@ -249,6 +250,26 @@ address, and don't search for one to delete when removing the address. Changing an address to add this flag will remove the automatically added prefix route, changing it to remove this flag will create the prefix route automatically. +.TP +.B autojoin +Joining multicast groups on Ethernet level via +.B "ip maddr" +command does not work if connected to an Ethernet switch that does IGMP +snooping since the switch would not replicate multicast packets on ports that +did not have IGMP reports for the multicast addresses. + +Linux VXLAN interfaces created via +.B "ip link add vxlan" +have the +.B group +option that enables them to do the required join. + +Using the +.B autojoin +flag when adding a multicast address enables similar functionality for +Openvswitch VXLAN interfaces as well as other tunneling mechanisms that need to +receive multicast traffic. + .SS ip address delete - delete protocol address .B Arguments: coincide with the arguments of -- 2.8.2
Re: [PATCH] iwlwifi: add missing type declaration
On Mon, 2016-07-11 at 22:49 +0200, Arnd Bergmann wrote: > The iwl-debug.h header relies in implicit inclusion of linux/device.h > and > we get a lot of warnings without that: > > drivers/net/wireless/intel/iwlwifi/iwl-debug.h:44:23: error: 'struct > device' declared inside parameter list will not be visible outside of > this definition or declaration [-Werror] > void __iwl_err(struct device *dev, bool rfkill_prefix, bool > only_trace, > ^~ > In file included from drivers/net/wireless/intel/iwlwifi/iwl-eeprom- > read.h:66:0, > from drivers/net/wireless/intel/iwlwifi/iwl-eeprom- > read.c:68: > drivers/net/wireless/intel/iwlwifi/iwl-trans.h: In function > 'iwl_trans_tx': > drivers/net/wireless/intel/iwlwifi/iwl-trans.h:1030:348: error: > passing argument 1 of '__iwl_err' from incompatible pointer type [- > Werror=incompatible-pointer-types] > IWL_ERR(trans, "%s bad state = %d\n", __func__, trans->state); > > > > > > ^ > In file included from drivers/net/wireless/intel/iwlwifi/iwl-eeprom- > read.c:67:0: > drivers/net/wireless/intel/iwlwifi/iwl-debug.h:44:6: note: expected > 'struct device *' but argument is of type 'struct device *' > void __iwl_err(struct device *dev, bool rfkill_prefix, bool > only_trace, > ^ > > The easiest workaround is to just declare 'struct device' before its > first use, > rather than including the entire header file. > > Signed-off-by: Arnd Bergmann > Fixes: 21cb3222fe56 ("iwlwifi: decouple PCIe transport from > mac80211") > --- Acked-by: Luca Coelho Agree with Kalle that he will take this directly to wireless-drivers- next. -- Cheers, Luca.
[iproute PATCH] ip-macsec.8: fix typo in 'examples' section
fix wrong 'device' keyword in 'ip link add device eth0' Signed-off-by: Davide Caratti --- man/man8/ip-macsec.8 | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/man/man8/ip-macsec.8 b/man/man8/ip-macsec.8 index e8455d7..03ed700 100644 --- a/man/man8/ip-macsec.8 +++ b/man/man8/ip-macsec.8 @@ -74,7 +74,7 @@ type. .PP .SS Create a MACsec device on link eth0 .nf -# ip link add device eth0 macsec0 type macsec port 11 encrypt on +# ip link add link eth0 macsec0 type macsec port 11 encrypt on .PP .SS Configure a secure association on that device .nf -- 2.5.5
Re: [PATCH 2/3] chcr: Support for Chelsio's Crypto Hardware
On Mon, Jul 11, 2016 at 11:28:07AM -0700, Yeshaswi M R Gowda wrote: > > + u_ctx = ULD_CTX(ctx); > + if (cxgb4_is_crypto_q_full(u_ctx->lldi.ports[0], ctx->tx_channel_id)) > + return -EBUSY; You cannot just return -EBUSY. If the request has the MAY_BACKLOG bit set, it must be queued regardless, but you should return -EBUSY in order to throttle the user and then call the completion function with -EINPROGRESS once the queue can accept more requests from the user. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [patch net-next 1/2] devlink: add hardware messages tracing facility
Mon, Jul 11, 2016 at 06:08:14PM CEST, rost...@goodmis.org wrote: >On Mon, 11 Jul 2016 15:18:47 +0200 >Jiri Pirko wrote: > >> From: Jiri Pirko >> >> Define a tracepoint and allow user to trace messages going to and from >> hardware associated with devlink instance. >> >> Signed-off-by: Jiri Pirko >> --- >> include/net/devlink.h | 8 +++ >> include/trace/events/devlink.h | 49 >> ++ >> net/core/devlink.c | 9 >> 3 files changed, 66 insertions(+) >> create mode 100644 include/trace/events/devlink.h >> >> diff --git a/include/net/devlink.h b/include/net/devlink.h >> index c99ffe8..865ade6 100644 >> --- a/include/net/devlink.h >> +++ b/include/net/devlink.h >> @@ -115,6 +115,8 @@ struct devlink *devlink_alloc(const struct devlink_ops >> *ops, size_t priv_size); >> int devlink_register(struct devlink *devlink, struct device *dev); >> void devlink_unregister(struct devlink *devlink); >> void devlink_free(struct devlink *devlink); >> +void devlink_trace_hwmsg(const struct devlink *devlink, bool incoming, >> + unsigned long type, const u8 *buf, size_t len); >> int devlink_port_register(struct devlink *devlink, >>struct devlink_port *devlink_port, >>unsigned int port_index); >> @@ -154,6 +156,12 @@ static inline void devlink_free(struct devlink *devlink) >> kfree(devlink); >> } >> >> +static inline void devlink_trace_hwmsg(const struct devlink *devlink, >> + bool incoming, unsigned long type, >> + const u8 *buf, size_t len); >> +{ >> +} >> + >> static inline int devlink_port_register(struct devlink *devlink, >> struct devlink_port *devlink_port, >> unsigned int port_index) >> diff --git a/include/trace/events/devlink.h b/include/trace/events/devlink.h >> new file mode 100644 >> index 000..7918c57 >> --- /dev/null >> +++ b/include/trace/events/devlink.h >> @@ -0,0 +1,49 @@ >> +#undef TRACE_SYSTEM >> +#define TRACE_SYSTEM devlink >> + >> +#if !defined(_TRACE_DEVLINK_H) || defined(TRACE_HEADER_MULTI_READ) >> +#define _TRACE_DEVLINK_H >> + >> +#include >> +#include >> +#include >> + >> +/* >> + * Tracepoint for devlink hardware message: >> + */ >> +TRACE_EVENT(devlink_hwmsg, >> +TP_PROTO(const struct devlink *devlink, bool incoming, >> + unsigned long type, const u8 *buf, size_t len), >> + >> +TP_ARGS(devlink, incoming, type, buf, len), >> + >> +TP_STRUCT__entry( >> +__string(bus_name, devlink->dev->bus->name) >> +__string(dev_name, dev_name(devlink->dev)) >> +__string(owner_name, devlink->dev->driver->owner->name) >> +__field(bool, incoming) >> +__field(unsigned long, type) >> +__dynamic_array(u8, buf, len) >> +__field(size_t, len) >> +), >> + >> +TP_fast_assign( >> +__assign_str(bus_name, devlink->dev->bus->name); >> +__assign_str(dev_name, dev_name(devlink->dev)); >> +__assign_str(owner_name, devlink->dev->driver->owner->name); >> +__entry->incoming = incoming; >> +__entry->type = type; >> +memcpy(__get_dynamic_array(buf), buf, len); >> +__entry->len = len; >> +), >> + >> +TP_printk("bus_name=%s dev_name=%s owner_name=%s incoming=%d type=%lu >> buf=0x[%*phD] len=%lu", >> + __get_str(bus_name), __get_str(dev_name), >> + __get_str(owner_name), __entry->incoming, __entry->type, >> + (int) __entry->len, __get_dynamic_array(buf), __entry->len) >> +); >> + >> +#endif /* _TRACE_DEVLINK_H */ >> + >> +/* This part must be outside protection */ >> +#include >> diff --git a/net/core/devlink.c b/net/core/devlink.c >> index b2e592a..8cfa3b0 100644 >> --- a/net/core/devlink.c >> +++ b/net/core/devlink.c >> @@ -26,6 +26,8 @@ >> #include >> #include >> #include >> +#define CREATE_TRACE_POINTS >> +#include >> >> static LIST_HEAD(devlink_list); >> >> @@ -1679,6 +1681,13 @@ void devlink_free(struct devlink *devlink) >> } >> EXPORT_SYMBOL_GPL(devlink_free); >> >> +void devlink_trace_hwmsg(const struct devlink *devlink, bool incoming, >> + unsigned long type, const u8 *buf, size_t len) >> +{ >> +trace_devlink_hwmsg(devlink, incoming, type, buf, len); >> +} >> +EXPORT_SYMBOL_GPL(devlink_trace_hwmsg); >> + > >Instead of having a function that always gets called even when tracing >isn't enabled, why not have the caller call the trace_devlink_hwmsg() >directly? > >In the trace/devlink.h file you could encapsulate it with: > >#if IS_ENABLED(CONFIG_NET_DEVLINK) > >[...] > >#else >static inline void trace_devlink_hwmsg(const struct devlink *devlink, > bool incoming, unsigned long type, >
Re: [PATCH 3/3] crypto: Added Chelsio Menu to the Kconfig file
On Tue, Jul 12, 2016 at 03:30:41AM +0800, kbuild test robot wrote: > Hi, > > [auto build test WARNING on net-next/master] > [also build test WARNING on v4.7-rc7 next-20160711] > [if your patch is applied to the wrong git tree, please drop us a note to > help improve the system] Yeshaswi, please fix these warnings/errors even though they're compile-only. Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc
On 05/07/2016 17:28, Florian Fainelli wrote: > Le 05/07/2016 07:50, Mason wrote: > >> On 05/07/2016 15:33, Mason wrote: >> >>> I was testing suspend/resume sequences where the suspend operation >>> fails and returns without having suspended the platform. Please forget I ever mentioned suspend, that was a red herring. The warning is displayed even if I never suspend. (Dropping linux-pm from this discussion.) >> I rebooted, tried the same operation, and hit the same warning >> still 310 seconds later: However, the 310 seconds time span still seems to be relevant. Steps to reproduce: I booted the system, logged in as root, mounted an NFS file system, then left the system idling at the prompt. (I don't remember seeing this warning in v4.1 and v4.4) What's going wrong here? Is it related to NFS? Here is the defconfig I'm using http://pastebin.ubuntu.com/19160299/ Regards. [ 317.940133] [ cut here ] [ 317.944815] WARNING: CPU: 1 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc [ 317.953223] Modules linked in: [ 317.956305] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc6-00010-gd07031bdc433-dirty #2 [ 317.964784] Hardware name: Sigma Tango DT [ 317.968809] Backtrace: [ 317.971279] [] (dump_backtrace) from [] (show_stack+0x18/0x1c) [ 317.978884] r7:6113 r6:c080ea84 r5: r4:c080ea84 [ 317.984590] [] (show_stack) from [] (dump_stack+0x80/0x94) [ 317.991856] [] (dump_stack) from [] (__warn+0xec/0x104) [ 317.998849] r7:0009 r6:c05e3fc8 r5: r4: [ 318.004549] [] (__warn) from [] (warn_slowpath_null+0x28/0x30) [ 318.012154] r9:dfbea4e0 r8:000a r7:df45fe30 r6:dec19594 r5:df68f144 r4:df68f040 [ 318.019954] [] (warn_slowpath_null) from [] (inet_sock_destruct+0x1c4/0x1dc) [ 318.028788] [] (inet_sock_destruct) from [] (__sk_destruct+0x28/0xe0) [ 318.037005] r7:df45fe30 r6:dec19594 r5:df68f040 r4:df68f1ec [ 318.042710] [] (__sk_destruct) from [] (rcu_process_callbacks+0x488/0x59c) [ 318.051363] r5: r4: [ 318.054962] [] (rcu_process_callbacks) from [] (__do_softirq+0x138/0x264) [ 318.063527] r10:c08020a0 r9:4001 r8:0101 r7:df45e000 r6:c08020a4 r5:0009 [ 318.071408] r4: [ 318.073953] [] (__do_softirq) from [] (irq_exit+0xc8/0x104) [ 318.081296] r10:df45ff58 r9:df402400 r8:0001 r7: r6:0013 r5: [ 318.089176] r4:c0735428 [ 318.091723] [] (irq_exit) from [] (__handle_domain_irq+0x88/0xf4) [ 318.099595] [] (__handle_domain_irq) from [] (gic_handle_irq+0x50/0x94) [ 318.107986] r10: r9:e0803100 r8:e0802100 r7:df45ff58 r6:e080210c r5:c080277c [ 318.115865] r4:c080eca0 r3:df45ff58 [ 318.119461] [] (gic_handle_irq) from [] (__irq_svc+0x54/0x90) [ 318.126980] Exception stack(0xdf45ff58 to 0xdf45ffa0) [ 318.132053] ff40: 0001 [ 318.140273] ff60: ab80 c0117c80 df45e000 c08024f8 c0802494 c081e2d6 c05b9550 413fc090 [ 318.148492] ff80: df45ffb4 df45ffb8 df45ffa8 c01086b0 c01086b4 6013 [ 318.156709] r9:413fc090 r8:c05b9550 r7:df45ff8c r6: r5:6013 r4:c01086b4 [ 318.164512] [] (arch_cpu_idle) from [] (default_idle_call+0x28/0x34) [ 318.172646] [] (default_idle_call) from [] (cpu_startup_entry+0x128/0x17c) [ 318.181305] [] (cpu_startup_entry) from [] (secondary_start_kernel+0x158/0x164) [ 318.190395] r7:c081e7c8 r4:c080b4f0 [ 318.193993] [] (secondary_start_kernel) from [<8010158c>] (0x8010158c) [ 318.201423] r5:0051 r4:9f44006a [ 318.205024] ---[ end trace 6e04001434b19cb9 ]--- Just to be sure, I performed the same steps a second time: [ 316.238527] [ cut here ] [ 316.243210] WARNING: CPU: 1 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc [ 316.251619] Modules linked in: [ 316.254702] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc6-00010-gd07031bdc433-dirty #2 [ 316.263182] Hardware name: Sigma Tango DT [ 316.267206] Backtrace: [ 316.269675] [] (dump_backtrace) from [] (show_stack+0x18/0x1c) [ 316.277280] r7:6113 r6:c080ea84 r5: r4:c080ea84 [ 316.282986] [] (show_stack) from [] (dump_stack+0x80/0x94) [ 316.290254] [] (dump_stack) from [] (__warn+0xec/0x104) [ 316.297247] r7:0009 r6:c05e3fc8 r5: r4: [ 316.302947] [] (__warn) from [] (warn_slowpath_null+0x28/0x30) [ 316.310552] r9:dfbea4e0 r8:000a r7:df45fe30 r6:dec15694 r5:df6063c4 r4:df6062c0 [ 316.318354] [] (warn_slowpath_null) from [] (inet_sock_destruct+0x1c4/0x1dc) [ 316.327190] [] (inet_sock_destruct) from [] (__sk_destruct+0x28/0xe0) [ 316.335406] r7:df45fe30 r6:dec15694 r5:df6062c0 r4:df60646c [ 316.341112] [] (__sk_destruct) from [] (rcu_process_callbacks+0x488/0x59c) [ 316.349765] r5: r4: [ 316.353363] [] (rcu_process_callbacks) from [] (__do_softirq+0x138/0x264) [ 316.361929] r10:c08020a0 r9:4001 r8:010
RE: [PATCH v2 0/8] thunderbolt: Introducing Thunderbolt(TM) networking
On Wed, Jun 29 2016, 11:35 AM, Levy, Amir (Jer) wrote: > This is version 2 of Thunderbolt(TM) driver for non-Apple hardware. > > Changes since v1: > - Separation to 2 modules. > - Moved ICM specific registers definition to ICM header file. > - Added new Thunderbolt device IDs. > - Renamed the Thunderbolt networking documentation. > - General cleanups > > These patches were pushed to GitHub where they can be reviewed more > comfortably with green/red highlighting: > https://github.com/01org/thunderbolt-software-kernel-tree > > Daemon code: > https://github.com/01org/thunderbolt-software-daemon > > For reference, here's a link to version 1: > [v1]: http://www.spinics.net/lists/linux-pci/msg51287.html > > Amir Levy (8): > thunderbolt: Macro rename > thunderbolt: Updating device IDs > thunderbolt: Updating the register definitions > thunderbolt: Kconfig for Thunderbolt(TM) networking > thunderbolt: Communication with the ICM (firmware) > thunderbolt: Networking state machine > thunderbolt: Networking transmit and receive > thunderbolt: Networking doc > > Documentation/00-INDEX |2 + > Documentation/thunderbolt-networking.txt | 135 ++ > drivers/thunderbolt/Kconfig | 25 +- > drivers/thunderbolt/Makefile |4 +- > drivers/thunderbolt/icm_nhi.c| 1631 + > drivers/thunderbolt/icm_nhi.h| 84 ++ > drivers/thunderbolt/net.c| 2277 > ++ > drivers/thunderbolt/net.h| 274 > drivers/thunderbolt/nhi_regs.h | 115 +- > include/linux/pci_ids.h | 44 +- > 10 files changed, 4565 insertions(+), 26 deletions(-) create mode 100644 > Documentation/thunderbolt-networking.txt > create mode 100644 drivers/thunderbolt/icm_nhi.c create mode 100644 > drivers/thunderbolt/icm_nhi.h create mode 100644 > drivers/thunderbolt/net.c create mode 100644 drivers/thunderbolt/net.h > > -- > 2.7.4 Hi All, Do you have any comments or suggestions for this patch set? Regards, Amir
[PATCH -next] stmmac: dwmac-socfpga: fix wrong pointer passed to PTR_ERR()
From: Wei Yongjun PTR_ERR should access the value just tested by IS_ERR, otherwise the wrong error code will be returned. Signed-off-by: Wei Yongjun --- drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c index 4bee2f9..3bc1fa2 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c @@ -214,7 +214,7 @@ static int socfpga_dwmac_parse_data(struct socfpga_dwmac *dwmac, struct device * dev_err(dev, "%s: ERROR: failed mapping tse control port\n", __func__); - return PTR_ERR(dwmac->pcs.sgmii_adapter_base); + return PTR_ERR(dwmac->pcs.tse_pcs_base); } } }
[PATCH -next] rxrpc: Fix error handling in af_rxrpc_init()
From: Wei Yongjun security initialized after alloc workqueue, so we should exit security before destroy workqueue in the error handing. Fixes: 648af7fca159 ("rxrpc: Absorb the rxkad security module") Signed-off-by: Wei Yongjun --- net/rxrpc/af_rxrpc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c index d6e4e3b..88effad 100644 --- a/net/rxrpc/af_rxrpc.c +++ b/net/rxrpc/af_rxrpc.c @@ -766,9 +766,9 @@ error_key_type: error_sock: proto_unregister(&rxrpc_proto); error_proto: - destroy_workqueue(rxrpc_workqueue); -error_security: rxrpc_exit_security(); +error_security: + destroy_workqueue(rxrpc_workqueue); error_work_queue: kmem_cache_destroy(rxrpc_call_jar); error_call_jar:
[PATCH -next] net: mediatek: fix non static symbol warnings
From: Wei Yongjun Fixes the following sparse warnings: drivers/net/ethernet/mediatek/mtk_eth_soc.c:79:5: warning: symbol '_mtk_mdio_write' was not declared. Should it be static? drivers/net/ethernet/mediatek/mtk_eth_soc.c:98:5: warning: symbol '_mtk_mdio_read' was not declared. Should it be static? Signed-off-by: Wei Yongjun --- drivers/net/ethernet/mediatek/mtk_eth_soc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c index 682797e..6f597eb 100644 --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c @@ -76,8 +76,8 @@ static int mtk_mdio_busy_wait(struct mtk_eth *eth) return -1; } -u32 _mtk_mdio_write(struct mtk_eth *eth, u32 phy_addr, - u32 phy_register, u32 write_data) +static u32 _mtk_mdio_write(struct mtk_eth *eth, u32 phy_addr, + u32 phy_register, u32 write_data) { if (mtk_mdio_busy_wait(eth)) return -1; @@ -95,7 +95,7 @@ u32 _mtk_mdio_write(struct mtk_eth *eth, u32 phy_addr, return 0; } -u32 _mtk_mdio_read(struct mtk_eth *eth, int phy_addr, int phy_reg) +static u32 _mtk_mdio_read(struct mtk_eth *eth, int phy_addr, int phy_reg) { u32 d;
[PATCH -next] dwc_eth_qos: fix missing clk_disable_unprepare() on error in dwceqos_probe()
From: Wei Yongjun Fix missing clk_disable_unprepare() call before return from dwceqos_probe() in the error handling case of invalid fixed-link. Signed-off-by: Wei Yongjun --- drivers/net/ethernet/synopsys/dwc_eth_qos.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/synopsys/dwc_eth_qos.c b/drivers/net/ethernet/synopsys/dwc_eth_qos.c index c34111b..fc1ea80 100644 --- a/drivers/net/ethernet/synopsys/dwc_eth_qos.c +++ b/drivers/net/ethernet/synopsys/dwc_eth_qos.c @@ -2877,7 +2877,7 @@ static int dwceqos_probe(struct platform_device *pdev) ret = of_phy_register_fixed_link(lp->pdev->dev.of_node); if (ret < 0) { dev_err(&pdev->dev, "invalid fixed-link"); - goto err_out_unregister_netdev; + goto err_out_unregister_clk_notifier; } lp->phy_node = of_node_get(lp->pdev->dev.of_node);
Re: WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc
On 12/07/2016 11:53, Mason wrote: > However, the 310 seconds time span still seems to be relevant. > > Steps to reproduce: I booted the system, logged in as root, > mounted an NFS file system, then left the system idling at > the prompt. > > (I don't remember seeing this warning in v4.1 and v4.4) > > What's going wrong here? Is it related to NFS? > > Here is the defconfig I'm using > http://pastebin.ubuntu.com/19160299/ > > > [ 317.940133] [ cut here ] > [ 317.944815] WARNING: CPU: 1 PID: 0 at net/ipv4/af_inet.c:155 > inet_sock_destruct+0x1c4/0x1dc > [ 317.953223] Modules linked in: > [ 317.956305] CPU: 1 PID: 0 Comm: swapper/1 Not tainted > 4.7.0-rc6-00010-gd07031bdc433-dirty #2 > [ 317.964784] Hardware name: Sigma Tango DT > [ 317.968809] Backtrace: > [ 317.971279] [] (dump_backtrace) from [] > (show_stack+0x18/0x1c) > [ 317.978884] r7:6113 r6:c080ea84 r5: r4:c080ea84 > [ 317.984590] [] (show_stack) from [] > (dump_stack+0x80/0x94) > [ 317.991856] [] (dump_stack) from [] (__warn+0xec/0x104) > [ 317.998849] r7:0009 r6:c05e3fc8 r5: r4: > [ 318.004549] [] (__warn) from [] > (warn_slowpath_null+0x28/0x30) > [ 318.012154] r9:dfbea4e0 r8:000a r7:df45fe30 r6:dec19594 r5:df68f144 > r4:df68f040 > [ 318.019954] [] (warn_slowpath_null) from [] > (inet_sock_destruct+0x1c4/0x1dc) > [ 318.028788] [] (inet_sock_destruct) from [] > (__sk_destruct+0x28/0xe0) > [ 318.037005] r7:df45fe30 r6:dec19594 r5:df68f040 r4:df68f1ec > [ 318.042710] [] (__sk_destruct) from [] > (rcu_process_callbacks+0x488/0x59c) > [ 318.051363] r5: r4: > [ 318.054962] [] (rcu_process_callbacks) from [] > (__do_softirq+0x138/0x264) > [ 318.063527] r10:c08020a0 r9:4001 r8:0101 r7:df45e000 r6:c08020a4 > r5:0009 > [ 318.071408] r4: > [ 318.073953] [] (__do_softirq) from [] > (irq_exit+0xc8/0x104) > [ 318.081296] r10:df45ff58 r9:df402400 r8:0001 r7: r6:0013 > r5: > [ 318.089176] r4:c0735428 > [ 318.091723] [] (irq_exit) from [] > (__handle_domain_irq+0x88/0xf4) > [ 318.099595] [] (__handle_domain_irq) from [] > (gic_handle_irq+0x50/0x94) > [ 318.107986] r10: r9:e0803100 r8:e0802100 r7:df45ff58 r6:e080210c > r5:c080277c > [ 318.115865] r4:c080eca0 r3:df45ff58 > [ 318.119461] [] (gic_handle_irq) from [] > (__irq_svc+0x54/0x90) > [ 318.126980] Exception stack(0xdf45ff58 to 0xdf45ffa0) > [ 318.132053] ff40: > 0001 > [ 318.140273] ff60: ab80 c0117c80 df45e000 c08024f8 c0802494 c081e2d6 > c05b9550 413fc090 > [ 318.148492] ff80: df45ffb4 df45ffb8 df45ffa8 c01086b0 c01086b4 > 6013 > [ 318.156709] r9:413fc090 r8:c05b9550 r7:df45ff8c r6: r5:6013 > r4:c01086b4 > [ 318.164512] [] (arch_cpu_idle) from [] > (default_idle_call+0x28/0x34) > [ 318.172646] [] (default_idle_call) from [] > (cpu_startup_entry+0x128/0x17c) > [ 318.181305] [] (cpu_startup_entry) from [] > (secondary_start_kernel+0x158/0x164) > [ 318.190395] r7:c081e7c8 r4:c080b4f0 > [ 318.193993] [] (secondary_start_kernel) from [<8010158c>] > (0x8010158c) > [ 318.201423] r5:0051 r4:9f44006a > [ 318.205024] ---[ end trace 6e04001434b19cb9 ]--- > > > Just to be sure, I performed the same steps a second time: > > [ 316.238527] [ cut here ] > [ 316.243210] WARNING: CPU: 1 PID: 0 at net/ipv4/af_inet.c:155 > inet_sock_destruct+0x1c4/0x1dc > [ 316.251619] Modules linked in: > [ 316.254702] CPU: 1 PID: 0 Comm: swapper/1 Not tainted > 4.7.0-rc6-00010-gd07031bdc433-dirty #2 > [ 316.263182] Hardware name: Sigma Tango DT > [ 316.267206] Backtrace: > [ 316.269675] [] (dump_backtrace) from [] > (show_stack+0x18/0x1c) > [ 316.277280] r7:6113 r6:c080ea84 r5: r4:c080ea84 > [ 316.282986] [] (show_stack) from [] > (dump_stack+0x80/0x94) > [ 316.290254] [] (dump_stack) from [] (__warn+0xec/0x104) > [ 316.297247] r7:0009 r6:c05e3fc8 r5: r4: > [ 316.302947] [] (__warn) from [] > (warn_slowpath_null+0x28/0x30) > [ 316.310552] r9:dfbea4e0 r8:000a r7:df45fe30 r6:dec15694 r5:df6063c4 > r4:df6062c0 > [ 316.318354] [] (warn_slowpath_null) from [] > (inet_sock_destruct+0x1c4/0x1dc) > [ 316.327190] [] (inet_sock_destruct) from [] > (__sk_destruct+0x28/0xe0) > [ 316.335406] r7:df45fe30 r6:dec15694 r5:df6062c0 r4:df60646c > [ 316.341112] [] (__sk_destruct) from [] > (rcu_process_callbacks+0x488/0x59c) > [ 316.349765] r5: r4: > [ 316.353363] [] (rcu_process_callbacks) from [] > (__do_softirq+0x138/0x264) > [ 316.361929] r10:c08020a0 r9:4001 r8:0101 r7:df45e000 r6:c08020a4 > r5:0009 > [ 316.369811] r4: > [ 316.372356] [] (__do_softirq) from [] > (irq_exit+0xc8/0x104) > [ 316.379699] r10:df45ff58 r9:df402400 r8:0001 r7: r6:0013 > r5: > [ 316.387579] r4:c073
Re: iwlwifi: add missing type declaration
Arnd Bergmann wrote: > The iwl-debug.h header relies in implicit inclusion of linux/device.h and > we get a lot of warnings without that: > > drivers/net/wireless/intel/iwlwifi/iwl-debug.h:44:23: error: 'struct device' > declared inside parameter list will not be visible outside of this definition > or declaration [-Werror] > void __iwl_err(struct device *dev, bool rfkill_prefix, bool only_trace, >^~ > In file included from > drivers/net/wireless/intel/iwlwifi/iwl-eeprom-read.h:66:0, > from drivers/net/wireless/intel/iwlwifi/iwl-eeprom-read.c:68: > drivers/net/wireless/intel/iwlwifi/iwl-trans.h: In function 'iwl_trans_tx': > drivers/net/wireless/intel/iwlwifi/iwl-trans.h:1030:348: error: passing > argument 1 of '__iwl_err' from incompatible pointer type > [-Werror=incompatible-pointer-types] >IWL_ERR(trans, "%s bad state = %d\n", __func__, trans->state); > > > > > ^ > In file included from > drivers/net/wireless/intel/iwlwifi/iwl-eeprom-read.c:67:0: > drivers/net/wireless/intel/iwlwifi/iwl-debug.h:44:6: note: expected 'struct > device *' but argument is of type 'struct device *' > void __iwl_err(struct device *dev, bool rfkill_prefix, bool only_trace, > ^ > > The easiest workaround is to just declare 'struct device' before its first > use, > rather than including the entire header file. > > Signed-off-by: Arnd Bergmann > Fixes: 21cb3222fe56 ("iwlwifi: decouple PCIe transport from mac80211") > Acked-by: Luca Coelho Thanks, 1 patch applied to wireless-drivers-next.git: 25f700ef0653 iwlwifi: add missing type declaration -- Sent by pwcli https://patchwork.kernel.org/patch/9224105/
RE: [PATCH v8 08/11] net/mlx4_en: break out tx_desc write into separate function
>In preparation for writing the tx descriptor from multiple functions, create a >helper for both normal and blueflame access. > >Signed-off-by: Brenden Blanco Reviewed-by: Tariq Toukan
Re: [PATCH net v2 2/2] net: ethoc: Correctly pad short packets
Hi Florian, On Mon, Jul 11, 2016 at 04:35:55PM -0700, Florian Fainelli wrote: > Even though the hardware can be doing zero padding, we want the SKB to > be going out on the wire with the appropriate size. This fixes packet > truncations observed with e.g: ARP packets. > > Signed-off-by: Florian Fainelli > --- > drivers/net/ethernet/ethoc.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c > index 06ae14a8e946..ca678d46c322 100644 > --- a/drivers/net/ethernet/ethoc.c > +++ b/drivers/net/ethernet/ethoc.c > @@ -860,6 +860,11 @@ static netdev_tx_t ethoc_start_xmit(struct sk_buff *skb, > struct net_device *dev) > unsigned int entry; > void *dest; > > + if (skb_put_padto(skb, ETHOC_ZLEN)) { > + dev->stats.tx_errors++; > + goto out; > + } skb_put_padto says that skb is freed on error, so this shoudn't go to label 'out' on error where skb is freed again? -- Thanks. -- Max
Re: [patch net-next 1/2] devlink: add hardware messages tracing facility
On Tue, 12 Jul 2016 10:38:37 +0200 Jiri Pirko wrote: > Hmm, I'm trying to make it work this was but it seems to be a bit more > complicated. I did not find any code doing this (having the tracepoint > compiled in or not) so I cannot copy&paste the solution. > > The problem is that trace_devlink_hwmsg is included from > include/trace/events/devlink.h when devlink is on and from > include/net/devlink.h. Odd but ok. > > I cannot include include/trace/events/devlink.h in include/net/devlink.h > because include/net/devlink.h is included in include/trace/events/devlink.h > > So I have to directly include include/trace/events/devlink.h from mlxsw > driver (trace caller). But when devlink is not compiled in, trecepoint > is not created and I'm getting errors. > > I see no easy way our of this other than the wrapper function I was > originally using. Seem simple and elegant. No. You missed my solution. Any user of trace_devlink_hwmsg() should directly include trace/events/devlink.h. That's what all other trace users do. They include the trace file header. I said in that header do: #if IS_ENABLED(CONFIG_NET_DEVLINK) #undef TRACE_SYSTEM #define TRACE_SYSTEM devlink #if !defined(_TRACE_DEVLINK_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_DEVLINK_H #include #include #include [..] #endif /* _TRACE_DEVLINK_H */ /* This part must be outside protection */ #include #else /* CONFIG_NET_DEVLINK */ static inline void trace_devlink_hwmsg(const struct devlink *devlink, bool incoming, unsigned long type, const u8 *buf, size_t len) { } #endif -- Steve
Re: [net-next PATCH RFC] mlx4: RX prefetch loop
On Mon, 11 Jul 2016 16:05:11 -0700 Alexei Starovoitov wrote: > On Mon, Jul 11, 2016 at 01:09:22PM +0200, Jesper Dangaard Brouer wrote: > > > - /* Process all completed CQEs */ > > > + /* Extract and prefetch completed CQEs */ > > > while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK, > > > cq->mcq.cons_index & cq->size)) { > > > + void *data; > > > > > > frags = ring->rx_info + (index << priv->log_rx_info); > > > rx_desc = ring->buf + (index << ring->log_stride); > > > + prefetch(rx_desc); > > > > > > /* > > >* make sure we read the CQE after we read the ownership bit > > >*/ > > > dma_rmb(); > > > > > > + cqe_array[cqe_idx++] = cqe; > > > + > > > + /* Base error handling here, free handled in next loop */ > > > + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == > > > + MLX4_CQE_OPCODE_ERROR)) > > > + goto skip; > > > + > > > + data = page_address(frags[0].page) + frags[0].page_offset; > > > + prefetch(data); > > that's probably not correct in all cases, since doing prefetch on the address > that is going to be evicted soon may hurt performance. > We need to dma_sync_single_for_cpu() before doing a prefetch or > somehow figure out that dma_sync is a nop, so we can omit it altogether > and do whatever prefetches we like. Sure, DMA can be synced first (actually already played with this). > Also unconditionally doing batch of 8 may also hurt depending on what > is happening either with the stack, bpf afterwards or even cpu version. See this as software DDIO, if the unlikely case that data will get evicted, it will still exist in L2 or L3 cache (like DDIO). Notice, only 1024 bytes are getting prefetched here. > Doing single prefetch of Nth packet is probably ok most of the time, > but asking cpu to prefetch 8 packets at once is unnecessary especially > since single prefetch gives the same performance. No, unconditionally prefetch of the Nth packet, will be wrong most of the time, for real work loads, as Eric Dumazet already pointed out. This patch does NOT unconditionally prefetch 8 packets. Prefetching _only_ happens when it is known that packets are ready in the RX ring. We know this prefetch data will be used/touched, within the NAPI cycle. Even if the processing of the packet flush L1 cache, then it will be in L2 or L3 (like DDIO). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [patch net-next 1/2] devlink: add hardware messages tracing facility
Tue, Jul 12, 2016 at 02:44:17PM CEST, rost...@goodmis.org wrote: >On Tue, 12 Jul 2016 10:38:37 +0200 >Jiri Pirko wrote: > > >> Hmm, I'm trying to make it work this was but it seems to be a bit more >> complicated. I did not find any code doing this (having the tracepoint >> compiled in or not) so I cannot copy&paste the solution. >> >> The problem is that trace_devlink_hwmsg is included from >> include/trace/events/devlink.h when devlink is on and from >> include/net/devlink.h. Odd but ok. >> >> I cannot include include/trace/events/devlink.h in include/net/devlink.h >> because include/net/devlink.h is included in include/trace/events/devlink.h >> >> So I have to directly include include/trace/events/devlink.h from mlxsw >> driver (trace caller). But when devlink is not compiled in, trecepoint >> is not created and I'm getting errors. >> >> I see no easy way our of this other than the wrapper function I was >> originally using. Seem simple and elegant. > >No. You missed my solution. > >Any user of trace_devlink_hwmsg() should directly include >trace/events/devlink.h. That's what all other trace users do. They >include the trace file header. > >I said in that header do: > > >#if IS_ENABLED(CONFIG_NET_DEVLINK) > >#undef TRACE_SYSTEM >#define TRACE_SYSTEM devlink > >#if !defined(_TRACE_DEVLINK_H) || defined(TRACE_HEADER_MULTI_READ) >#define _TRACE_DEVLINK_H > >#include >#include >#include > >[..] > >#endif /* _TRACE_DEVLINK_H */ > >/* This part must be outside protection */ >#include > >#else /* CONFIG_NET_DEVLINK */ >static inline void trace_devlink_hwmsg(const struct devlink *devlink, > bool incoming, unsigned long type, > const u8 *buf, size_t len) >{ >} >#endif Okay. That looks good. Thanks! > > >-- Steve
RE: [PATCH] net: Fragment large datagrams even when IP_HDRINCL is set.
>> On Wed, 15 Jun 2016, Alan Davey wrote: >> >>> The only case that would break is that where an application relies on >>> the existing (documented as a bug) feature of getting an EMSGSIZE >>> return code in the case of an over-sized packet. Applications that >>> perform their own fragmentation would be unaffected. >> >> If this doesn't break existing applications that are doing >> fragmentation in userspace on raw sockets (e.g. Quagga ospfd), that's >> better. >> >> As per previous email, I'd love to be able to get rid of that code and >> have the kernel do it for me. However, I also don't want to have to do >> anything other non-trivial to that code either. :) >> >> The issue for us is, how would we know on any given host whether the >> kernel will do the fragmentation or whether ospfd has to do it? We >> need to be able to probe for that capability, surely? > >The fact is, regardless of whether you could probe for the capability or not, >you have to keep the fragmentation code around forever. > >And that is yet another reason I do not want to add this change at all. > >It doesn't make any existing server any simpler, in fact it makes them all >more complicated because not only do they keep the fragmentation code, they >also >get new logic to test for the feature that would allow them to avoid >using it. > >Sorry, there is no way I am adding this, it's a net lose. David, I accept you don't want to take this patch. However, I don't yet understand why, so I have a few more questions (which will hopefully help me produce patches that are likely to be accepted in future). Adding the patch means that some existing applications will continue to contain their own fragmentation code, which becomes unnecessary. This is OK; those applications will continue to fragment packets, and will continue to work with an updated kernel. Do you agree? Not adding the patch means that - all future applications have to continue to implement their own fragmentation code, duplicating that which already exists in the kernel - hardware vendors may choose to apply the patch themselves (they do not want to implement function that already exists), but would surely prefer to have it in the standard kernel. So I still don't understand what part of taking the patch would have a negative result on existing applications, it should be neutral to them. Could you help me out here and explain? Regards Alan
Re: [PATCH v8 01/11] bpf: add XDP prog type for early driver filter
On Tue, 12 Jul 2016 00:51:24 -0700 Brenden Blanco wrote: > Add a new bpf prog type that is intended to run in early stages of the > packet rx path. Only minimal packet metadata will be available, hence a > new context type, struct xdp_md, is exposed to userspace. So far only > expose the packet start and end pointers, and only in read mode. > > An XDP program must return one of the well known enum values, all other > return codes are reserved for future use. Unfortunately, this > restriction is hard to enforce at verification time, so take the > approach of warning at runtime when such programs are encountered. Out > of bounds return codes should alias to XDP_ABORTED. This is going to be a serious usability problem, when XDP gets extended with newer features. How can users know what XDP capabilities a given driver support? If the user loads a XDP program using capabilities not avail in the driver, then all his traffic will be hard dropped. My proposal is to NOT allow XDP programs to be loaded if they want to use return-codes/features not supported by the driver. Thus, eBPF programs register/annotate their needed capabilities upfront (I guess this could also help HW offload engines). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
RE: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
>Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver. > >In tc/socket bpf programs, helpers linearize skb fragments as needed when the >program touches the packet data. However, in the pursuit of speed, XDP >programs will not be allowed to use these slower functions, especially if it >involves allocating an skb. > >Therefore, disallow MTU settings that would produce a multi-fragment packet >that XDP programs would fail to access. Future enhancements could be done to >increase the allowable MTU. > >Signed-off-by: Brenden Blanco Reviewed-by: Tariq Toukan
[PATCH iproute2] ip route: restore route entries in correct order
Sometimes we cannot restore route entries, because in kernel [1] fib_check_nh() [2] fib_valid_prefsrc() cause some routes to depend on existence of others while adding. For example, we saved all the routes, and flushed all tables [a] default via 192.168.122.1 dev eth0 [b] 192.168.122.0/24 dev eth0 src 192.168.122.21 [c] broadcast 127.0.0.0 dev lo table local src 127.0.0.1 [d] local 127.0.0.0/8 dev lo table local src 127.0.0.1 [e] local 127.0.0.1 dev lo table local src 127.0.0.1 [f] broadcast 127.255.255.255 dev lo table local src 127.0.0.1 [g] broadcast 192.168.122.0 dev eth0 table local src 192.168.122.21 [h] local 192.168.122.21 dev eth0 table local src 192.168.122.21 [i] broadcast 192.168.122.255 dev eth0 table local src 192.168.122.21 Now start to restore them: If we want to add [a], we have to add [b] first, as [1] and 'via 192.168.122.1' in [a]. If we want to add [b], we have to add [h] first, as [2] and 'src 192.168.122.21' in [b]. So the correct order to restore should be like: [e][h] -> [b][c][d][f][g][i] -> [a] This patch fixes it by traversing the file 3 times, it only restores part of them in each run according to the following conditions, to make sure every entry can be restored successfully. 1. !gw && (!fib_prefsrc || fib_prefsrc == cfg->fc_dst) 2. !gw && (fib_prefsrc != cfg->fc_dst) 3. gw Signed-off-by: Xin Long --- ip/iproute.c | 47 +-- 1 file changed, 45 insertions(+), 2 deletions(-) diff --git a/ip/iproute.c b/ip/iproute.c index 24f6b01..c735000 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -1810,12 +1810,42 @@ static int iproute_get(int argc, char **argv) return 0; } +static int rtattr_cmp(struct rtattr *rta1, struct rtattr *rta2) +{ + if (!rta1 || !rta2 || rta1->rta_len != rta2->rta_len) + return 1; + + return memcmp(RTA_DATA(rta1), RTA_DATA(rta2), RTA_PAYLOAD(rta1)); +} + static int restore_handler(const struct sockaddr_nl *nl, struct rtnl_ctrl_data *ctrl, struct nlmsghdr *n, void *arg) { - int ret; + struct rtmsg *r = NLMSG_DATA(n); + struct rtattr *tb[RTA_MAX+1]; + int len = n->nlmsg_len - NLMSG_LENGTH(sizeof(*r)); + int ret, prio = *(int *)arg; + + parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len); + /* Restore routes in correct order: +* 0. ones for local addresses, +* 1. ones for local networks, +* 2. others (remote networks/hosts). +*/ + if (!prio && !tb[RTA_GATEWAY] && (!tb[RTA_PREFSRC] || + !rtattr_cmp(tb[RTA_PREFSRC], tb[RTA_DST]))) + goto restore; + else if (prio == 1 && !tb[RTA_GATEWAY] && +rtattr_cmp(tb[RTA_PREFSRC], tb[RTA_DST])) + goto restore; + else if (prio == 2 && tb[RTA_GATEWAY]) + goto restore; + + return 0; + +restore: n->nlmsg_flags |= NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK; ll_init_map(&rth); @@ -1848,10 +1878,23 @@ static int route_dump_check_magic(void) static int iproute_restore(void) { + int pos, prio; + if (route_dump_check_magic()) exit(-1); - exit(rtnl_from_file(stdin, &restore_handler, NULL)); + pos = ftell(stdin); + for (prio = 0; prio < 3; prio++) { + int err; + + err = rtnl_from_file(stdin, &restore_handler, &prio); + if (err) + exit(err); + + fseek(stdin, pos, SEEK_SET); + } + + exit(0); } static int show_handler(const struct sockaddr_nl *nl, -- 2.1.0
Re: [PATCH v2 net] ipv6: addrconf: fix Juniper SSL VPN client regression
David Miller writes: > What really irks me is that we "fixing" something without knowing what > actually is the problem. Agreed > Someone needs to figure out exactly what is making the Juniper thing > unhappy. It really shouldn't care if a link local address is assigned > to the tun device, this is fundamental ipv6 stuff. Yes. Looks like this is up to Jonas and/or Valdis. I tried looking for a demo site which could be used to test the client, but could not find any. The product itself seems to be replaced, and it's no longer Juniper. And the recommended Linux solution seems to be OpenConnect: http://www.infradead.org/openconnect/juniper.html Anyway, it would be good to sort out the problems with the java(?) based client. A few proposals (not an exhaustive list - please use your creativity): a) Try to figure out what the traffic on the interface looks like (there was a single TX packet and no RX, I believe?). Snoop on it and see if that is an IPv6 RS from the kernel or something the client sends. b) Try to isolate the problem by tweaking what you can on the tun- interface. ip addr del dev tun0 echo 1 > /proc/sys/net/ipv6/conf/tun0/disable_ipv6 etc. Is there anything that will make the traffic flow, or is it just dead? c) Try to figure out what the client is doing. strace it. run lsof on it. Anything unexpected? Does it for example happen to read an packet from the tun file descriptor and choke? etc, Bjørn
Re: [PATCH net v2 0/2] net: ethoc: Error path and transmit fixes
Hello, On Mon, Jul 11, 2016 at 04:35:53PM -0700, Florian Fainelli wrote: > This patch series contains two patches for the ethoc driver while testing on a > TS-7300 board where ethoc is provided by an on-board FPGA. > > First patch was cooked after chasing crashes with invalid resources passed to > the driver. > > Second patch was cooked after seeing that an interface configured with IP > 192.168.2.2 was sending ARP packets for 192.168.0.0, no wonder why it could > not > work. I can see opencores intrerface sending ARP packets shorter than 64 bytes, but I couldn't reproduce truncation that affects packet contents on my hardware. > I don't have access to any other platform using an ethoc interface so > it could be good to some testing on Xtensa for instance. I've tested success and error paths affected by this series with the following additional change on top of it: diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c index ca678d4..8c94f45 100644 --- a/drivers/net/ethernet/ethoc.c +++ b/drivers/net/ethernet/ethoc.c @@ -862,7 +862,7 @@ static netdev_tx_t ethoc_start_xmit(struct sk_buff *skb, struct net_device *dev) if (skb_put_padto(skb, ETHOC_ZLEN)) { dev->stats.tx_errors++; - goto out; + return NETDEV_TX_OK; } if (unlikely(skb->len > ETHOC_BUFSIZ)) { Without it the interface becomes non-functional after the first error in skb_put_padto. Tested-by: Max Filippov Reviewed-by: Max Filippov -- Thanks. -- Max
Re: [PATCH iproute2] ip route: restore route entries in correct order
On Tue, Jul 12, 2016 at 09:37:58PM +0800, Xin Long wrote: > Sometimes we cannot restore route entries, because in kernel > [1] fib_check_nh() > [2] fib_valid_prefsrc() > cause some routes to depend on existence of others while adding. > > For example, we saved all the routes, and flushed all tables > [a] default via 192.168.122.1 dev eth0 > [b] 192.168.122.0/24 dev eth0 src 192.168.122.21 > [c] broadcast 127.0.0.0 dev lo table local src 127.0.0.1 > [d] local 127.0.0.0/8 dev lo table local src 127.0.0.1 > [e] local 127.0.0.1 dev lo table local src 127.0.0.1 > [f] broadcast 127.255.255.255 dev lo table local src 127.0.0.1 > [g] broadcast 192.168.122.0 dev eth0 table local src 192.168.122.21 > [h] local 192.168.122.21 dev eth0 table local src 192.168.122.21 > [i] broadcast 192.168.122.255 dev eth0 table local src 192.168.122.21 > > Now start to restore them: > If we want to add [a], we have to add [b] first, as [1] and > 'via 192.168.122.1' in [a]. > If we want to add [b], we have to add [h] first, as [2] and > 'src 192.168.122.21' in [b]. > > So the correct order to restore should be like: > [e][h] -> [b][c][d][f][g][i] -> [a] > > This patch fixes it by traversing the file 3 times, it only restores > part of them in each run according to the following conditions, to > make sure every entry can be restored successfully. > 1. !gw && (!fib_prefsrc || fib_prefsrc == cfg->fc_dst) > 2. !gw && (fib_prefsrc != cfg->fc_dst) > 3. gw > > Signed-off-by: Xin Long Acked-by: Phil Sutter
Re: [PATCH v2 net] ipv6: addrconf: fix Juniper SSL VPN client regression
Hannes, I just realized that I missed you on the CC list of the v2 patch. Sorry. It was definitely not my intention. I really appreciate the feedback you have kindly provided before. I guess you might have seen it in netdev anyway, but in case not, here is a direct patchwork link: http://patchwork.ozlabs.org/patch/646958/ I'll try to be less sloppy about CCs the next time, if we ever get to a v3 of this. At the moment that requires figuring out what the actual problem with the client is. Something I don't have the necessary tools to do AFAICS. So I will let the whole thing just sleep until there is some more data about the Juniper client failure. Bjørn
Re: Multi-thread udp 4.7 regression, bisected to 71d8c47fc653
On Mon, Jul 11, 2016 at 06:17:39PM -0300, Marc Dionne wrote: > Hi Pablo, > > Testing out your patch: > > 1) With no NAT in place, the clash resolution happens, with no side > effects. No EPERM errors are seen. > > 2) With ip(6)table_nat loaded, the clash resolution fails and I get > some EPERM errors from sendmsg(), same as before 71d8c47fc653. > > Turns out that even though I have no NAT rules in my iptables config, > the system also had firewalld active and that caused the modules to be > loaded. > > So the bottom line is that the patch looks good to me.. Thanks Marc, I'm going to apply this then.
Re: WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc
On Tue, 2016-07-12 at 11:53 +0200, Mason wrote: > On 05/07/2016 17:28, Florian Fainelli wrote: > > > Le 05/07/2016 07:50, Mason wrote: > > > >> On 05/07/2016 15:33, Mason wrote: > >> > >>> I was testing suspend/resume sequences where the suspend operation > >>> fails and returns without having suspended the platform. > > Please forget I ever mentioned suspend, that was a red herring. > The warning is displayed even if I never suspend. > (Dropping linux-pm from this discussion.) > > >> I rebooted, tried the same operation, and hit the same warning > >> still 310 seconds later: > > However, the 310 seconds time span still seems to be relevant. > > Steps to reproduce: I booted the system, logged in as root, > mounted an NFS file system, then left the system idling at > the prompt. > > (I don't remember seeing this warning in v4.1 and v4.4) > > What's going wrong here? Is it related to NFS? > > Here is the defconfig I'm using > http://pastebin.ubuntu.com/19160299/ > > Regards. > Could you try this debug patch ? diff --git a/net/core/sock.c b/net/core/sock.c index b7f12639c26a..7fb1aeadbda7 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1465,6 +1465,7 @@ static void __sk_destruct(struct rcu_head *head) void sk_destruct(struct sock *sk) { + WARN_ON_ONCE(sk->sk_forward_alloc); if (sock_flag(sk, SOCK_RCU_FREE)) call_rcu(&sk->sk_rcu, __sk_destruct); else
[PATCH 1/1] Fix PCS reset
From: Noam Camus During commit b54b8c2d6e3c ("net: ezchip: adapt driver to little endian architecture") adapting to little endian architecture, zeroing of controller was left out. Signed-off-by: Elad Kanfi --- drivers/net/ethernet/ezchip/nps_enet.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/net/ethernet/ezchip/nps_enet.c b/drivers/net/ethernet/ezchip/nps_enet.c index 06f0317..9b7a3f5 100644 --- a/drivers/net/ethernet/ezchip/nps_enet.c +++ b/drivers/net/ethernet/ezchip/nps_enet.c @@ -285,6 +285,7 @@ static void nps_enet_hw_reset(struct net_device *ndev) ge_rst_value |= NPS_ENET_ENABLE << RST_GMAC_0_SHIFT; nps_enet_reg_set(priv, NPS_ENET_REG_GE_RST, ge_rst_value); usleep_range(10, 20); + ge_rst_value = 0; nps_enet_reg_set(priv, NPS_ENET_REG_GE_RST, ge_rst_value); /* Tx fifo reset sequence */ -- 1.7.1
Re: [PATCH v2 net] ipv6: addrconf: fix Juniper SSL VPN client regression
On 11.07.2016 16:48, David Miller wrote: > From: Bjørn Mork > Date: Mon, 11 Jul 2016 16:43:50 +0200 > >> The Juniper SSL VPN client use a "tun" interface and seems to >> be picky about visible changes.to it. Commit cc9da6cc4f56 >> ("ipv6: addrconf: use stable address generator for ARPHRD_NONE") >> made such interfaces get an auto-generated IPv6 link local address >> by default, similar to most other interface types. This made the >> Juniper SSL VPN client fail for unknown reasons. >> >> Fixing this regression by adding a new private netdevice flag >> which disables automatic IPv6 link local address generation, and >> making the flag default for "tun" devices. >> >> Setting an explicit addrgenmode will disable the flag, so userspace >> can choose to enable automatic LL generation by selecting a suitable >> addrgenmode. >> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=121131 >> Fixes: cc9da6cc4f56 ("ipv6: addrconf: use stable address generator for >> ARPHRD_NONE") >> Reported-by: Valdis Kletnieks >> Reported-by: Jonas Lippuner >> Suggested-by: Hannes Frederic Sowa >> Cc: 吉藤英明 >> Signed-off-by: Bjørn Mork > > What really irks me is that we "fixing" something without knowing what > actually is the problem. > > Someone needs to figure out exactly what is making the Juniper thing > unhappy. It really shouldn't care if a link local address is assigned > to the tun device, this is fundamental ipv6 stuff. I agree, but there could be a lot of factors affecting this. For example we know start to send out an icmp router solicitation into the vpn as soon as the link goes up, as well as initial mld change record. It could very well trigger security software and kill the VPN (something I can very well imagine right now). I actually don't suspect something funny going on in the Juniper's Java client itself, this would just be too strange. Unfortunately I don't have access to such systems to check what could be the problem. Bjorn send a list of things that could be tried. Thanks, Hannes
Re: WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc
On 12/07/2016 16:25, Eric Dumazet wrote: > Could you try this debug patch ? Note: I've been unable to trigger the warning again. Dunno what has changed... With your patch applied, I get a warning at boot: [4.668309] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx [4.688609] Sending DHCP requests ., OK [4.711935] IP-Config: Got DHCP answer from 172.27.200.1, my address is 172.27.64.49 [4.719956] IP-Config: Complete: [4.723221] device=eth0, hwaddr=00:16:e8:02:08:42, ipaddr=172.27.64.49, mask=255.255.192.0, gw=172.27.64.1 [4.733376] host=toto5, domain=france.foo.com sac.foo.com asic.foo.com soft.sde, nis-domain=france.foo.com [4.745279] bootserver=172.27.64.1, rootserver=172.27.64.1, rootpath=/export/roots/titi/6_2_0_8756,v3 nameserver0=172.27.0.17 [4.759725] [ cut here ] [4.764426] WARNING: CPU: 0 PID: 877 at net/core/sock.c:1468 sk_destruct+0x74/0x78 [4.772056] Modules linked in: [4.775133] CPU: 0 PID: 877 Comm: kworker/0:1H Not tainted 4.7.0-rc6-00010-gd07031bdc433-dirty #6 [4.784050] Hardware name: Sigma Tango DT [4.788084] Workqueue: rpciod rpc_async_schedule [4.792725] Backtrace: [4.795196] [] (dump_backtrace) from [] (show_stack+0x18/0x1c) [4.802802] r7:6013 r6:c080ea84 r5: r4:c080ea84 [4.808513] [] (show_stack) from [] (dump_stack+0x80/0x94) [4.815781] [] (dump_stack) from [] (__warn+0xec/0x104) [4.822776] r7:0009 r6:c05e4d04 r5: r4: [4.828482] [] (__warn) from [] (warn_slowpath_null+0x28/0x30) [4.836089] r9: r8:df5edb58 r7:df711364 r6:df006c80 r5:df5edb58 r4:df5eda40 [4.843898] [] (warn_slowpath_null) from [] (sk_destruct+0x74/0x78) [4.851945] [] (sk_destruct) from [] (__sk_free+0x50/0xbc) [4.859203] r5:df5edb58 r4:df5eda40 [4.862802] [] (__sk_free) from [] (sk_free+0x3c/0x40) [4.869710] r5:df5edb58 r4:df5eda40 [4.873310] [] (sk_free) from [] (sk_common_release+0xe8/0xf4) [4.880924] [] (sk_common_release) from [] (udp_lib_close+0x10/0x14) [4.889054] r5:df006c80 r4:df5eda40 [4.892657] [] (udp_lib_close) from [] (inet_release+0x4c/0x78) [4.900360] [] (inet_release) from [] (sock_release+0x28/0xa8) [4.907967] r5: r4:df006c80 [4.911575] [] (sock_release) from [] (xs_reset_transport+0xac/0xbc) [4.919706] r5:df5eda40 r4:df711000 [4.923306] [] (xs_reset_transport) from [] (xs_destroy+0x24/0x54) [4.931262] r9: r8:c049c614 r7:df7c3218 r6:df711000 r5: r4:df711000 [4.939070] [] (xs_destroy) from [] (xprt_destroy+0x88/0x8c) [4.946502] r5:df711218 r4:df711000 [4.950102] [] (xprt_destroy) from [] (xprt_put+0x40/0x44) [4.957358] r5:df7c3200 r4:df613d00 [4.960959] [] (xprt_put) from [] (rpc_task_release_client+0x7c/0x80) [4.969181] [] (rpc_task_release_client) from [] (rpc_release_resources_task+0x34/0x38) [4.978971] r7:c049c298 r6:0001 r5: r4:df613d00 [4.984674] [] (rpc_release_resources_task) from [] (__rpc_execute+0xb0/0x2a8) [4.993678] r5: r4:df613d00 [4.997277] [] (__rpc_execute) from [] (rpc_async_schedule+0x14/0x18) [5.005496] r10:df683500 r9: r8:dfbe4700 r7: r6:dfbdcc80 r5:df683500 [5.013383] r4:df613d24 [5.015935] [] (rpc_async_schedule) from [] (process_one_work+0x128/0x3fc) [5.024595] [] (process_one_work) from [] (worker_thread+0x58/0x574) [5.032726] r10:df683500 r9:df6d4000 r8:dfbdcc98 r7:c0802100 r6:0008 r5:df683518 [5.040614] r4:dfbdcc80 [5.043162] [] (worker_thread) from [] (kthread+0xe4/0xfc) [5.050419] r10: r9: r8: r7:c0132914 r6:df683500 r5:df6cc600 [5.058309] r4: [5.060858] [] (kthread) from [] (ret_from_fork+0x14/0x3c) [5.068116] r7: r6: r5:c013838c r4:df6cc600 [5.073846] ---[ end trace 05b24e2dedd2f2a0 ]--- [5.082009] VFS: Mounted root (nfs filesystem) readonly on device 0:11. [5.090328] Freeing unused kernel memory: 1024K (c070 - c080) [5.452552] random: udevd urandom read with 68 bits of entropy available
Re: [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding
Regression tests for mlx4_en are currently running, results will be ready by tomorrow morning. Regards, Tariq
Re: [PATCH v8 01/11] bpf: add XDP prog type for early driver filter
On Tue, Jul 12, 2016 at 6:14 AM, Jesper Dangaard Brouer wrote: > > On Tue, 12 Jul 2016 00:51:24 -0700 Brenden Blanco > wrote: > >> Add a new bpf prog type that is intended to run in early stages of the >> packet rx path. Only minimal packet metadata will be available, hence a >> new context type, struct xdp_md, is exposed to userspace. So far only >> expose the packet start and end pointers, and only in read mode. >> >> An XDP program must return one of the well known enum values, all other >> return codes are reserved for future use. Unfortunately, this >> restriction is hard to enforce at verification time, so take the >> approach of warning at runtime when such programs are encountered. Out >> of bounds return codes should alias to XDP_ABORTED. > > This is going to be a serious usability problem, when XDP gets extended > with newer features. > > How can users know what XDP capabilities a given driver support? > We talked about this a the XDP summit and I have some slides on this (hope to have slides posted shortly) . Drivers, more generally XDP platforms, will provide a list APIs that they support. APIs would be contained in a sort common specification files that and would have some name like basic_xdp_api, adv_switch_api, etc. An XDP program is written using one of the published APIs and the API it uses is expressed as part of the program. At runtime (possibly compile time) the API used by the program is checked against the list of APIs for the target platform-- if the API is not in the set then the program is not allowed to be loaded. To whatever extent possible we should also try to verify that program adhere's to the API as load and compile time. In the event that program violates the API it claims to be using, such as return invalid return code for the API, that is an error condition. > If the user loads a XDP program using capabilities not avail in the > driver, then all his traffic will be hard dropped. > > > My proposal is to NOT allow XDP programs to be loaded if they want to use > return-codes/features not supported by the driver. Thus, eBPF programs > register/annotate their needed capabilities upfront (I guess this could > also help HW offload engines). > Exactly. We just need to define exactly how this is done ;-) One open issue is whether XDP needs to be binary compatible across platforms or source code compatible (also require recompile in latter case for each platform). Personally I prefer source code compatible since that might allow platforms to implement the API specifically for their needs (like the ordering of fields in a meta data structure. Tom > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer
RE: [PATCH v6 RESEND] r8152: Add support for setting pass through MAC address on RTL8153-AD
> -Original Message- > From: Michał Pecio [mailto:michal.pe...@gmail.com] > Sent: Tuesday, July 12, 2016 2:03 AM > To: David Miller > Cc: Limonciello, Mario ; linux- > ker...@vger.kernel.org; netdev@vger.kernel.org; linux- > u...@vger.kernel.org; anthony.w...@canonical.com > Subject: Re: [PATCH v6 RESEND] r8152: Add support for setting pass through > MAC address on RTL8153-AD > > > From: Mario Limonciello > > Date: Mon, 11 Jul 2016 19:58:04 -0500 > > > > > The RTL8153-AD supports a persistent system specific MAC address. > > > This means a device plugged into two different systems with host > > > side support will show different (but persistent) MAC addresses. > > > > > > This information for the system's persistent MAC address is burned > > > in when the system HW is built and available under \_SB.AMAC in the > > > DSDT at runtime. > > > > > > This technology is currently implemented in the Dell TB15 and WD15 > > > Type-C docks. More information is available here: > > > http://www.dell.com/support/article/us/en/04/SLN301147 > > > > > > Signed-off-by: Mario Limonciello > > > > Applied. > > Hi, > > Isn't it possible that the same ACPI name could be used for other > vendor-specific extensions on other laptops and that r8152 will now > wreak havoc there? > > Regards, > MP This has been discussed a bit previously in earlier submissions. In short, this is an extreme corner case. Some changes were made to diminish potential impact. The ACPI code is resolved only when the specific variant of RTL8135 (-AD) has a bit set on the efuse. The patch also explicitly checks the return type and contents of the ACPI object and won't proceed if it's invalid. The Type-C devices that provide this are currently only sold by Dell. This of course may change one day if other OEM's add this feature, but it just shows that device scope is limited. For now the way this is implemented if Realtek does choose to work with another OEM on this feature, there should be no kernel code change necessary for interoperability of peripherals on other OEM machines. So in order to hit this hypothetical corner case today you would need to be using a real world type-C device something such as a Dell WD15 on another OEM's machine that has type-C and the exact same ACPI object name that does $BADSTUFF other than return a buffer. If that situation arises please alert me and I'll send a follow up patch that whitelists this to match DMI vendor of only Dell systems.
[PATCH -next] ixgbe: Add missing destroy_workqueue() on error in ixgbe_init_module()
From: Wei Yongjun Add the missing destroy_workqueue() before return from ixgbe_init_module() in the error handling case. Signed-off-by: Wei Yongjun --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index ef82cf9..c5c3d68 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -10109,6 +10109,7 @@ static int __init ixgbe_init_module(void) ret = pci_register_driver(&ixgbe_driver); if (ret) { + destroy_workqueue(ixgbe_wq); ixgbe_dbg_exit(); return ret; }
Re: [PATCH v3] Marvell phy: add fiber status check and configuration for some phys
On Tue, Jul 12, 2016 at 05:00:52PM +0200, Charles-Antoine Couret wrote: > Hello, > I'm back with another patch evrsion about Marvell phys with a fiber interface. > >From the previous release, I fixed some issues reported by yours and I added > >some functions around the fiber interface to get statistics, to configure > >the aneg, etc. Hi Charles It is best to submit a number of smaller patches, each doing one thing, than a single big patch. It makes review and discussion much simpler. So for example, this should be a patch of its own: > @@ -151,6 +165,7 @@ struct marvell_hw_stat { > > static struct marvell_hw_stat marvell_hw_stats[] = { > { "phy_receive_errors", 0, 21, 16}, > + { "phy_receive_errors_fiber", 1, 21, 16}, > { "phy_idle_errors", 0, 10, 8 }, > }; I think we should also rename phy_receive_errors to phy_receive_errors_copper. Andrew
[PATCH v3] Marvell phy: add fiber status check and configuration for some phys
Hello, I'm back with another patch evrsion about Marvell phys with a fiber interface. >From the previous release, I fixed some issues reported by yours and I added >some functions around the fiber interface to get statistics, to configure the >aneg, etc. Yes, to implement that, copper and fiber side are closed, but the fiber link needs some custom registers or bit values. I'm based on some genphy functions to adapt to fiber link. With that, the 88E1512 for example could have the same features for the copper and fiber links. I'm interested by your feedback about this patch. Thank you in advance. Regards. Charles-Antoine Couret >From 4ca5935f8aa97c3ba02cb27e970e1bcf248aed18 Mon Sep 17 00:00:00 2001 From: Charles-Antoine Couret Date: Fri, 1 Apr 2016 16:16:35 +0200 Subject: [PATCH] Marvell phy: add fiber status check for some components Some Marvell's phys have two modes: fiber and copper. Currently, the driver configures and checks only the copper mode registers which could be the wrong link. This commit add handlers to check fiber then copper status link or to configure the fiber interface. This patch is not tested with all Marvell's phys. The new functions are enabled only for tested phys. Some of them functions are based on genphy functions. Signed-off-by: Charles-Antoine Couret --- drivers/net/phy/marvell.c | 323 +++--- 1 file changed, 307 insertions(+), 16 deletions(-) diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c index ec2c1ee..3cd8d94 100644 --- a/drivers/net/phy/marvell.c +++ b/drivers/net/phy/marvell.c @@ -138,6 +138,20 @@ #define MII_88E1510_GEN_CTRL_REG_1_MODE_SGMII 0x1 /* SGMII to copper */ #define MII_88E1510_GEN_CTRL_REG_1_RESET 0x8000 /* Soft reset */ +#define LPA_FIBER_1000HALF 0x40 +#define LPA_FIBER_1000FULL 0x20 + +#define LPA_PAUSE_FIBER0x180 +#define LPA_PAUSE_ASYM_FIBER 0x100 + +#define ADVERTISE_FIBER_1000HALF 0x40 +#define ADVERTISE_FIBER_1000FULL 0x20 + +#define ADVERTISE_PAUSE_FIBER 0x180 +#define ADVERTISE_PAUSE_ASYM_FIBER 0x100 + +#define REGISTER_LINK_STATUS 0x400 + MODULE_DESCRIPTION("Marvell PHY driver"); MODULE_AUTHOR("Andy Fleming"); MODULE_LICENSE("GPL"); @@ -151,6 +165,7 @@ struct marvell_hw_stat { static struct marvell_hw_stat marvell_hw_stats[] = { { "phy_receive_errors", 0, 21, 16}, + { "phy_receive_errors_fiber", 1, 21, 16}, { "phy_idle_errors", 0, 10, 8 }, }; @@ -477,15 +492,122 @@ static int m88e1318_config_aneg(struct phy_device *phydev) return m88e1121_config_aneg(phydev); } +/** + * ethtool_adv_to_fiber_adv_t + * @ethadv: the ethtool advertisement settings + * + * A small helper function that translates ethtool advertisement + * settings to phy autonegotiation advertisements for the + * MII_ADV register for fiber link. + */ +static inline u32 ethtool_adv_to_fiber_adv_t(u32 ethadv) +{ + u32 result = 0; + + if (ethadv & ADVERTISED_1000baseT_Half) + result |= ADVERTISE_FIBER_1000HALF; + if (ethadv & ADVERTISED_1000baseT_Full) + result |= ADVERTISE_FIBER_1000FULL; + + if ((ethadv & ADVERTISE_PAUSE_ASYM) && (ethadv & ADVERTISE_PAUSE_CAP)) + result |= LPA_PAUSE_ASYM_FIBER; + else if (ethadv & ADVERTISE_PAUSE_CAP) + result |= (ADVERTISE_PAUSE_FIBER + & (~ADVERTISE_PAUSE_ASYM_FIBER)); + + return result; +} + +/** + * marvell_config_aneg_fiber - restart auto-negotiation or write BMCR + * @phydev: target phy_device struct + * + * Description: If auto-negotiation is enabled, we configure the + * advertising, and then restart auto-negotiation. If it is not + * enabled, then we write the BMCR. Adapted for fiber link in + * some Marvell's devices. + */ +static int marvell_config_aneg_fiber(struct phy_device *phydev) +{ + int changed = 0; + int err; + int adv, oldadv; + u32 advertise; + + if (phydev->autoneg != AUTONEG_ENABLE) + return genphy_setup_forced(phydev); + + /* Only allow advertising what this PHY supports */ + phydev->advertising &= phydev->supported; + advertise = phydev->advertising; + + /* Setup fiber advertisement */ + adv = phy_read(phydev, MII_ADVERTISE); + if (adv < 0) + return adv; + + oldadv = adv; + adv &= ~(ADVERTISE_FIBER_1000HALF | ADVERTISE_FIBER_1000FULL + | LPA_PAUSE_FIBER); + adv |= ethtool_adv_to_fiber_adv_t(advertise); + + if (adv != oldadv) { + err = phy_write(phydev, MII_ADVERTISE, adv); + if (err < 0) + return err; + + changed = 1; + } + + if (changed == 0) { + /* Advertisement hasn't changed, but maybe aneg was never on to +* begin with? Or maybe phy was isolated? +*/ + int c
[PATCH -next] net: dsa: Fix non static symbol warning
From: Wei Yongjun Fixes the following sparse warning: net/dsa/dsa2.c:680:6: warning: symbol '_dsa_unregister_switch' was not declared. Should it be static? Signed-off-by: Wei Yongjun --- net/dsa/dsa2.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c index 78e4c01..f30bad9 100644 --- a/net/dsa/dsa2.c +++ b/net/dsa/dsa2.c @@ -677,7 +677,7 @@ int dsa_register_switch(struct dsa_switch *ds, struct device_node *np) } EXPORT_SYMBOL_GPL(dsa_register_switch); -void _dsa_unregister_switch(struct dsa_switch *ds) +static void _dsa_unregister_switch(struct dsa_switch *ds) { struct dsa_switch_tree *dst = ds->dst;
[iproute PATCH v2] ip-macsec.8: fix typo in 'examples' section
fix wrong 'device' keyword in 'ip link add device eth0' changes since v1: while at it, add missing description of 'validate' keyword and remove spurious bracket near 'encrypt' keyword. Also, wrong 'es' and 'encoding' keywords in MACsec section of ip-link.8 have been fixed. Signed-off-by: Davide Caratti --- man/man8/ip-link.8.in | 8 man/man8/ip-macsec.8 | 5 +++-- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in index ad18f75..edb00ff 100644 --- a/man/man8/ip-link.8.in +++ b/man/man8/ip-link.8.in @@ -910,14 +910,14 @@ the following additional arguments are supported: .BR encrypt " {" .BR on " | " off " } ] [ " .BR send_sci " { " on " | " off " } ] [" -.BR es " { " on " | " off " } ] [" +.BR end_station " { " on " | " off " } ] [" .BR scb " { " on " | " off " } ] [" .BR protect " { " on " | " off " } ] [" .BR replay " { " on " | " off " }" .BR window " { " .IR 0..2^32-1 " } ] [" .BR validate " { " strict " | " check " | " disabled " } ] [" -.BR encoding " { " +.BR encodingsa " { " .IR 0..3 " } ]" .in +8 @@ -942,7 +942,7 @@ the following additional arguments are supported: - specifies whether the SCI is included in every packet, or only when it is necessary. .sp -.BR "es on " or " es off" +.BR "end_station on " or " end_station off" - sets the End Station bit. .sp @@ -970,7 +970,7 @@ the following additional arguments are supported: - sets the validation mode on the device. .sp -.BI encoding " AN " +.BI encodingsa " AN " - sets the active secure association for transmission. .in -8 diff --git a/man/man8/ip-macsec.8 b/man/man8/ip-macsec.8 index e8455d7..761e4de 100644 --- a/man/man8/ip-macsec.8 +++ b/man/man8/ip-macsec.8 @@ -6,7 +6,7 @@ ip-macsec \- MACsec device configuration [ [ .BR cipher " { " default " | " gcm-aes-128 " } ] " .BI icvlen " ICVLEN" -] [ [ +] [ .BR encrypt " { " on " | " off " } ] [" .BR send_sci " { " on " | " off " } ] [" .BR end_station " { " on " | " off " } ] [" @@ -15,6 +15,7 @@ ip-macsec \- MACsec device configuration .BR replay " { " on " | " off " } ] [" .BI window " WINDOW" ] [ +.BR validate " { " strict " | " check " | " disabled " } ] [" .BI encodingsa " SA" ] @@ -74,7 +75,7 @@ type. .PP .SS Create a MACsec device on link eth0 .nf -# ip link add device eth0 macsec0 type macsec port 11 encrypt on +# ip link add link eth0 macsec0 type macsec port 11 encrypt on .PP .SS Configure a secure association on that device .nf -- 2.5.5
[PATCH nf-next v2 0/3] Compact netfilter hooks list
This series makes a simple change to shrink the netfilter hook list from a double linked list, to a singly linked list. Since the hooks are always traversed in-order, there is no need to maintain a previous pointer. This was jointly developed by Florian Westphal. It has been tested with RCU and lockdep debugging enabled. Aaron Conole (2): netfilter: bridge: add and use br_nf_hook_thresh netfilter: replace list_head with single linked list Florian Westphal (1): netfilter: call nf_hook_state_init with rcu_read_lock held include/linux/netdevice.h | 2 +- include/linux/netfilter.h | 26 +++-- include/linux/netfilter_ingress.h | 15 ++- include/net/netfilter/br_netfilter.h | 6 ++ include/net/netfilter/nf_queue.h | 9 +- include/net/netns/netfilter.h | 2 +- net/bridge/br_netfilter_hooks.c| 50 -- net/bridge/br_netfilter_ipv6.c | 12 +-- net/bridge/netfilter/ebt_redirect.c| 2 +- net/bridge/netfilter/ebtables.c| 2 +- net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 2 +- net/ipv4/netfilter/nf_conntrack_proto_icmp.c | 2 +- net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 2 +- net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c | 2 +- net/netfilter/core.c | 132 - net/netfilter/nf_conntrack_core.c | 2 +- net/netfilter/nf_conntrack_h323_main.c | 2 +- net/netfilter/nf_conntrack_helper.c| 2 +- net/netfilter/nf_internals.h | 10 +- net/netfilter/nf_queue.c | 15 ++- net/netfilter/nfnetlink_cthelper.c | 2 +- net/netfilter/nfnetlink_log.c | 8 +- net/netfilter/nfnetlink_queue.c| 7 +- net/netfilter/xt_helper.c | 2 +- 24 files changed, 202 insertions(+), 114 deletions(-) -- 2.5.5
[PATCH v2 3/3] netfilter: replace list_head with single linked list
The netfilter hook list never uses the prev pointer, and so can be trimmed to be a smaller singly-linked list. In addition to having a more light weight structure for hook traversal, struct net becomes 5568 bytes (down from 6400) and struct net_device becomes 2176 bytes (down from 2240). Signed-off-by: Aaron Conole Signed-off-by: Florian Westphal --- v2: * Adjusted the hook list head function, and retested with rcu and lockdep debugging enabled. include/linux/netdevice.h | 2 +- include/linux/netfilter.h | 18 +++--- include/linux/netfilter_ingress.h | 14 +++-- include/net/netfilter/nf_queue.h | 9 ++- include/net/netns/netfilter.h | 2 +- net/bridge/br_netfilter_hooks.c | 21 +++ net/netfilter/core.c | 127 -- net/netfilter/nf_internals.h | 10 +-- net/netfilter/nf_queue.c | 15 +++-- net/netfilter/nfnetlink_queue.c | 5 +- 10 files changed, 130 insertions(+), 93 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 49736a3..9da07ad 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1749,7 +1749,7 @@ struct net_device { #endif struct netdev_queue __rcu *ingress_queue; #ifdef CONFIG_NETFILTER_INGRESS - struct list_headnf_hooks_ingress; + struct nf_hook_entry __rcu *nf_hooks_ingress; #endif unsigned char broadcast[MAX_ADDR_LEN]; diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index ad444f0..3390a84 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -55,12 +55,12 @@ struct nf_hook_state { struct net_device *out; struct sock *sk; struct net *net; - struct list_head *hook_list; + struct nf_hook_entry *hook_list; int (*okfn)(struct net *, struct sock *, struct sk_buff *); }; static inline void nf_hook_state_init(struct nf_hook_state *p, - struct list_head *hook_list, + struct nf_hook_entry *hook_list, unsigned int hook, int thresh, u_int8_t pf, struct net_device *indev, @@ -97,6 +97,12 @@ struct nf_hook_ops { int priority; }; +struct nf_hook_entry { + struct nf_hook_entry __rcu *next; + struct nf_hook_ops ops; + const struct nf_hook_ops*orig_ops; +}; + struct nf_sockopt_ops { struct list_head list; @@ -161,8 +167,6 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook, int (*okfn)(struct net *, struct sock *, struct sk_buff *), int thresh) { - struct list_head *hook_list; - #ifdef HAVE_JUMP_LABEL if (__builtin_constant_p(pf) && __builtin_constant_p(hook) && @@ -170,14 +174,14 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook, return 1; #endif - hook_list = &net->nf.hooks[pf][hook]; - - if (!list_empty(hook_list)) { + if (rcu_access_pointer(net->nf.hooks[pf][hook])) { + struct nf_hook_entry *hook_list; struct nf_hook_state state; int ret; /* We may already have this, but read-locks nest anyway */ rcu_read_lock(); + hook_list = rcu_dereference(net->nf.hooks[pf][hook]); nf_hook_state_init(&state, hook_list, hook, thresh, pf, indev, outdev, sk, net, okfn); diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h index 6965ba0..e3e3f6d 100644 --- a/include/linux/netfilter_ingress.h +++ b/include/linux/netfilter_ingress.h @@ -11,23 +11,27 @@ static inline bool nf_hook_ingress_active(const struct sk_buff *skb) if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS])) return false; #endif - return !list_empty(&skb->dev->nf_hooks_ingress); + return rcu_access_pointer(skb->dev->nf_hooks_ingress) != NULL; } /* caller must hold rcu_read_lock */ static inline int nf_hook_ingress(struct sk_buff *skb) { + struct nf_hook_entry *e = rcu_dereference(skb->dev->nf_hooks_ingress); struct nf_hook_state state; - nf_hook_state_init(&state, &skb->dev->nf_hooks_ingress, - NF_NETDEV_INGRESS, INT_MIN, NFPROTO_NETDEV, - skb->dev, NULL, NULL, dev_net(skb->dev), NULL); + if (unlikely(!e)) + return 0; + + nf_hook_state_init(&state, e, NF_NETDEV_INGRESS, INT_MIN, + NFPROTO_NETDEV, skb->dev, NULL, NULL, + dev_net(skb->dev), NULL); return nf_hook_slow(skb, &state); } static inline void nf_hook_ingress_init(struct net_devic
[PATCH nf-next v2 1/3] netfilter: bridge: add and use br_nf_hook_thresh
From: Florian Westphal This replaces the last uses of NF_HOOK_THRESH(). Followup patch will remove it and rename nf_hook_thresh. The reason is that inet (non-bridge) netfilter no longer invokes the hooks from hooks, so we do no longer need the thresh value to skip hooks with a lower priority. The bridge netfilter however may need to do this. br_nf_hook_thresh is a wrapper that is supposed to do this, i.e. only call hooks with a priority that exceeds NF_BR_PRI_BRNF. It's used only in the recursion cases of br_netfilter. Signed-off-by: Florian Westphal Signed-off-by: Aaron Conole --- include/net/netfilter/br_netfilter.h | 6 net/bridge/br_netfilter_hooks.c | 57 ++-- net/bridge/br_netfilter_ipv6.c | 12 3 files changed, 59 insertions(+), 16 deletions(-) diff --git a/include/net/netfilter/br_netfilter.h b/include/net/netfilter/br_netfilter.h index e8d1448..0b0c35c 100644 --- a/include/net/netfilter/br_netfilter.h +++ b/include/net/netfilter/br_netfilter.h @@ -15,6 +15,12 @@ static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb) void nf_bridge_update_protocol(struct sk_buff *skb); +int br_nf_hook_thresh(unsigned int hook, struct net *net, struct sock *sk, + struct sk_buff *skb, struct net_device *indev, + struct net_device *outdev, + int (*okfn)(struct net *, struct sock *, + struct sk_buff *)); + static inline struct nf_bridge_info * nf_bridge_info_get(const struct sk_buff *skb) { diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c index 77e7f69..7856129 100644 --- a/net/bridge/br_netfilter_hooks.c +++ b/net/bridge/br_netfilter_hooks.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include @@ -395,11 +396,10 @@ bridged_dnat: skb->dev = nf_bridge->physindev; nf_bridge_update_protocol(skb); nf_bridge_push_encap_header(skb); - NF_HOOK_THRESH(NFPROTO_BRIDGE, - NF_BR_PRE_ROUTING, - net, sk, skb, skb->dev, NULL, - br_nf_pre_routing_finish_bridge, - 1); + br_nf_hook_thresh(NF_BR_PRE_ROUTING, + net, sk, skb, skb->dev, + NULL, + br_nf_pre_routing_finish); return 0; } ether_addr_copy(eth_hdr(skb)->h_dest, dev->dev_addr); @@ -417,10 +417,8 @@ bridged_dnat: skb->dev = nf_bridge->physindev; nf_bridge_update_protocol(skb); nf_bridge_push_encap_header(skb); - NF_HOOK_THRESH(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, net, sk, skb, - skb->dev, NULL, - br_handle_frame_finish, 1); - + br_nf_hook_thresh(NF_BR_PRE_ROUTING, net, sk, skb, skb->dev, NULL, + br_handle_frame_finish); return 0; } @@ -992,6 +990,47 @@ static struct notifier_block brnf_notifier __read_mostly = { .notifier_call = brnf_device_event, }; +/* recursively invokes nf_hook_slow (again), skipping already-called + * hooks (< NF_BR_PRI_BRNF). + * + * Called with rcu read lock held. + */ +int br_nf_hook_thresh(unsigned int hook, struct net *net, + struct sock *sk, struct sk_buff *skb, + struct net_device *indev, + struct net_device *outdev, + int (*okfn)(struct net *, struct sock *, + struct sk_buf *)) +{ + struct nf_hook_ops *elem; + struct nf_hook_state state; + struct list_head *head; + int ret; + + head = &net->nf.hooks[NFPROTO_BRIDGE][hook]; + + list_for_each_entry_rcu(elem, head, list) { + struct nf_hook_ops *next; + + next = list_entry_rcu(list_next_rcu(&elem->list), + struct nf_hook_ops, list); + if (next->priority <= NF_BR_PRI_BRNF) + continue; + } + + if (&elem->list == head) + return okfn(net, sk, skb); + + nf_hook_state_init(&state, head, hook, NF_BR_PRI_BRNF + 1, + NFPROTO_BRIDGE, indev, outdev, sk, net, okfn); + + ret = nf_hook_slow(skb, &state); + if (ret == 1) + ret = okfn(net, sk, skb); + + return ret; +} + #ifdef CONFIG_SYSCTL static int brnf_sysctl_call_tables(struct ctl_table *ctl, int write, diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c index 5e59a84..5989661 100644 --- a/net
[PATCH nf-next v2 2/3] netfilter: call nf_hook_state_init with rcu_read_lock held
From: Florian Westphal This makes things simpler because we can store the head of the list in the nf_state structure without worrying about concurrent add/delete of hook elements from the list. Signed-off-by: Florian Westphal Signed-off-by: Aaron Conole --- include/linux/netfilter.h | 8 +++- include/linux/netfilter_ingress.h | 1 + net/bridge/netfilter/ebt_redirect.c| 2 +- net/bridge/netfilter/ebtables.c| 2 +- net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 2 +- net/ipv4/netfilter/nf_conntrack_proto_icmp.c | 2 +- net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 2 +- net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c | 2 +- net/netfilter/core.c | 5 + net/netfilter/nf_conntrack_core.c | 2 +- net/netfilter/nf_conntrack_h323_main.c | 2 +- net/netfilter/nf_conntrack_helper.c| 2 +- net/netfilter/nfnetlink_cthelper.c | 2 +- net/netfilter/nfnetlink_log.c | 8 ++-- net/netfilter/nfnetlink_queue.c| 2 +- net/netfilter/xt_helper.c | 2 +- 16 files changed, 27 insertions(+), 19 deletions(-) diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index 9230f9a..ad444f0 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -174,10 +174,16 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook, if (!list_empty(hook_list)) { struct nf_hook_state state; + int ret; + /* We may already have this, but read-locks nest anyway */ + rcu_read_lock(); nf_hook_state_init(&state, hook_list, hook, thresh, pf, indev, outdev, sk, net, okfn); - return nf_hook_slow(skb, &state); + + ret = nf_hook_slow(skb, &state); + rcu_read_unlock(); + return ret; } return 1; } diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h index 5fcd375..6965ba0 100644 --- a/include/linux/netfilter_ingress.h +++ b/include/linux/netfilter_ingress.h @@ -14,6 +14,7 @@ static inline bool nf_hook_ingress_active(const struct sk_buff *skb) return !list_empty(&skb->dev->nf_hooks_ingress); } +/* caller must hold rcu_read_lock */ static inline int nf_hook_ingress(struct sk_buff *skb) { struct nf_hook_state state; diff --git a/net/bridge/netfilter/ebt_redirect.c b/net/bridge/netfilter/ebt_redirect.c index 20396499..2e7c4f9 100644 --- a/net/bridge/netfilter/ebt_redirect.c +++ b/net/bridge/netfilter/ebt_redirect.c @@ -24,7 +24,7 @@ ebt_redirect_tg(struct sk_buff *skb, const struct xt_action_param *par) return EBT_DROP; if (par->hooknum != NF_BR_BROUTING) - /* rcu_read_lock()ed by nf_hook_slow */ + /* rcu_read_lock()ed by nf_hook_thresh */ ether_addr_copy(eth_hdr(skb)->h_dest, br_port_get_rcu(par->in)->br->dev->dev_addr); else diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c index cceac5b..dd71332 100644 --- a/net/bridge/netfilter/ebtables.c +++ b/net/bridge/netfilter/ebtables.c @@ -146,7 +146,7 @@ ebt_basic_match(const struct ebt_entry *e, const struct sk_buff *skb, return 1; if (NF_INVF(e, EBT_IOUT, ebt_dev_check(e->out, out))) return 1; - /* rcu_read_lock()ed by nf_hook_slow */ + /* rcu_read_lock()ed by nf_hook_thresh */ if (in && (p = br_port_get_rcu(in)) != NULL && NF_INVF(e, EBT_ILOGICALIN, ebt_dev_check(e->logical_in, p->br->dev))) diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c index ae1a71a..eab0239 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c @@ -110,7 +110,7 @@ static unsigned int ipv4_helper(void *priv, if (!help) return NF_ACCEPT; - /* rcu_read_lock()ed by nf_hook_slow */ + /* rcu_read_lock()ed by nf_hook_thresh */ helper = rcu_dereference(help->helper); if (!helper) return NF_ACCEPT; diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c index c567e1b..2c08d6a 100644 --- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c +++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c @@ -149,7 +149,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb, return -NF_ACCEPT; } - /* rcu_read_lock()ed by nf_hook_slow */ + /* rcu_read_lock()ed by nf_hook_thresh */ innerproto = __nf_ct_l4proto_find(PF_INET, origtuple.dst.protonum); /* Ordinarily, we'd expect the inverted tupleproto, but it's diff --git a/net/ipv6/netfilter/nf_co
[PATCH 1/1] net: nps_enet: Fix PCS reset
From: Noam Camus During commit b54b8c2d6e3c ("net: ezchip: adapt driver to little endian architecture") adapting to little endian architecture, zeroing of controller was left out. Signed-off-by: Elad Kanfi --- drivers/net/ethernet/ezchip/nps_enet.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/net/ethernet/ezchip/nps_enet.c b/drivers/net/ethernet/ezchip/nps_enet.c index 06f0317..9b7a3f5 100644 --- a/drivers/net/ethernet/ezchip/nps_enet.c +++ b/drivers/net/ethernet/ezchip/nps_enet.c @@ -285,6 +285,7 @@ static void nps_enet_hw_reset(struct net_device *ndev) ge_rst_value |= NPS_ENET_ENABLE << RST_GMAC_0_SHIFT; nps_enet_reg_set(priv, NPS_ENET_REG_GE_RST, ge_rst_value); usleep_range(10, 20); + ge_rst_value = 0; nps_enet_reg_set(priv, NPS_ENET_REG_GE_RST, ge_rst_value); /* Tx fifo reset sequence */ -- 1.7.1
Re: [PATCH v3] Marvell phy: add fiber status check and configuration for some phys
> +#define LPA_FIBER_1000HALF 0x40 > +#define LPA_FIBER_1000FULL 0x20 > + > +#define LPA_PAUSE_FIBER 0x180 > +#define LPA_PAUSE_ASYM_FIBER 0x100 > + > +#define ADVERTISE_FIBER_1000HALF 0x40 > +#define ADVERTISE_FIBER_1000FULL 0x20 > + > +#define ADVERTISE_PAUSE_FIBER0x180 > +#define ADVERTISE_PAUSE_ASYM_FIBER 0x100 Are these standardised anywhere? If they are following a standard, they should be put into include/uapi/linux/mii.h. > +static inline u32 ethtool_adv_to_fiber_adv_t(u32 ethadv) > +{ > + u32 result = 0; > + > + if (ethadv & ADVERTISED_1000baseT_Half) > + result |= ADVERTISE_FIBER_1000HALF; Dumb question: Does 1000baseT_Half even make sense for fibre? Can you do half duplex? Would that not mean you have a single fibre, both ends are using the same laser frequency, and you are doing some form of CSMA/CD? > + if (ethadv & ADVERTISED_1000baseT_Full) > + result |= ADVERTISE_FIBER_1000FULL; > + > + if ((ethadv & ADVERTISE_PAUSE_ASYM) && (ethadv & ADVERTISE_PAUSE_CAP)) > + result |= LPA_PAUSE_ASYM_FIBER; > + else if (ethadv & ADVERTISE_PAUSE_CAP) > + result |= (ADVERTISE_PAUSE_FIBER > +& (~ADVERTISE_PAUSE_ASYM_FIBER)); > + > + return result; > +} If these values are standardised, i think this function should be moved into the generic code. If however, this is Marvell specific, keep it here. > * > * Generic status code does not detect Fiber correctly! > @@ -906,12 +1070,17 @@ static int marvell_read_status(struct phy_device > *phydev) > int lpa; > int lpagb; > int status = 0; > + int page, fiber; > > - /* Update the link, but return if there > + /* Detect and update the link, but return if there >* was an error */ > - err = genphy_update_link(phydev); > - if (err) > - return err; > + page = phy_read(phydev, MII_MARVELL_PHY_PAGE); > + if (page == MII_M_FIBER) > + fiber = 1; > + else > + fiber = 0; This read is expensive, since the MDIO bus is slow. It would be better just to pass fibre as a parameter. > +/* marvell_read_fiber_status > + * > + * Some Marvell's phys have two modes: fiber and copper. > + * Both need status checked. > + * Description: > + * First, check the fiber link and status. > + * If the fiber link is down, check the copper link and status which > + * will be the default value if both link are down. > + */ > +static int marvell_read_fiber_status(struct phy_device *phydev) The name is a bit confusing. I would probably use marvell_read_copper_fiber_status() making it clear it reads both. > +{ > + int err; > + > + /* Check the fiber mode first */ > + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER); > + if (err < 0) > + goto error; > + > + err = marvell_read_status(phydev); > + if (err < 0) > + goto error; > + > + if (phydev->link) { > + phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); > + return 0; > + } > + > + /* If fiber link is down, check and save copper mode state */ > + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); > + if (err < 0) > + goto error; There should be a big fat comment somewhere that after this function the copper or fibre page is left selected, depending on which has link. There is a danger somebody misses this assumption and breaks the code by unconditionally restoring to copper. > + > + return marvell_read_status(phydev); > + > +error: > + phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); > + return err; > +} > + > +/* marvell_suspend_fiber > + * > + * Some Marvell's phys have two modes: fiber and copper. > + * Both need to be suspended > + */ > +static int marvell_suspend_fiber(struct phy_device *phydev) > +{ > + int err; > + > + /* Suspend the fiber mode first */ > + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER); > + if (err < 0) > + goto error; > + > + err = genphy_suspend(phydev); > + if (err < 0) > + goto error; > + > + /* Then, the copper link */ > + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); > + if (err < 0) > + goto error; > + > + return genphy_suspend(phydev); > + > +error: > + phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); > + return err; > +} I think it would be better to look for SUPPORTED_FIBRE in drv->features, rather than have two different functions. In fact, i would do that in general, rather than add your _fibre() functions. > + > +/* marvell_resume_fiber > + * > + * Some Marvell's phys have two modes: fiber and copper. > + * Both need to be resumed > + */ > +static int marvell_resume_fiber(struct phy_device *phydev) > +{ > + int err; > + > + /* Resume the fiber mode first */ > +
Re: [PATCH v8 01/11] bpf: add XDP prog type for early driver filter
On Tue, 12 Jul 2016 07:52:53 -0700, Tom Herbert wrote: > On Tue, Jul 12, 2016 at 6:14 AM, Jesper Dangaard Brouer > wrote: > > > > On Tue, 12 Jul 2016 00:51:24 -0700 Brenden Blanco > > wrote: > > > >> Add a new bpf prog type that is intended to run in early stages of the > >> packet rx path. Only minimal packet metadata will be available, hence a > >> new context type, struct xdp_md, is exposed to userspace. So far only > >> expose the packet start and end pointers, and only in read mode. > >> > >> An XDP program must return one of the well known enum values, all other > >> return codes are reserved for future use. Unfortunately, this > >> restriction is hard to enforce at verification time, so take the > >> approach of warning at runtime when such programs are encountered. Out > >> of bounds return codes should alias to XDP_ABORTED. > > > > This is going to be a serious usability problem, when XDP gets extended > > with newer features. > > > > How can users know what XDP capabilities a given driver support? I'm personally not a big fan of return codes. I'm hoping we can express most decisions by setting fields in metadata. It provides a better structure and is easier to detect/translate/optimize. > We talked about this a the XDP summit and I have some slides on this > (hope to have slides posted shortly) . Drivers, more generally XDP > platforms, will provide a list APIs that they support. APIs would be > contained in a sort common specification files that and would have > some name like basic_xdp_api, adv_switch_api, etc. An XDP program is > written using one of the published APIs and the API it uses is > expressed as part of the program. At runtime (possibly compile time) > the API used by the program is checked against the list of APIs for > the target platform-- if the API is not in the set then the program is > not allowed to be loaded. To whatever extent possible we should also > try to verify that program adhere's to the API as load and compile > time. In the event that program violates the API it claims to be > using, such as return invalid return code for the API, that is an > error condition. +1 > > If the user loads a XDP program using capabilities not avail in the > > driver, then all his traffic will be hard dropped. > > > > > > My proposal is to NOT allow XDP programs to be loaded if they want to use > > return-codes/features not supported by the driver. Thus, eBPF programs > > register/annotate their needed capabilities upfront (I guess this could > > also help HW offload engines). Not sure we need annotation, use of an API will probably be easily detectable (call of a function, access to metadata field). Verifier could collect that info in-kernel with little effort. > Exactly. We just need to define exactly how this is done ;-) > > One open issue is whether XDP needs to be binary compatible across > platforms or source code compatible (also require recompile in latter > case for each platform). Personally I prefer source code compatible > since that might allow platforms to implement the API specifically for > their needs (like the ordering of fields in a meta data structure. Since XDP is so close to hardware by design, I think we could consider introducing per-HW translation stage between the verifier and host JIT. For extended metadata problem for example - we could define metadata as: meta { void *packet_start; void *packet_end; struct standard_meta *extended_meta; } standard_meta { vlan; timestamp; hash; ... } Program coming from the user space would just use standard_meta but per-driver/per-HW translator would patch it up. extended_meta pointer would probably become pointer to HW-specific queue descriptor at runtime. The per-HW translator stage would change the offsets and add necessary checks and fix-ups. It could even be possible to translate from extended_meta access to access to metadata prepended to the frame, translator would need a hint from the verifier on how to get the packet pointer. I haven't thought this through in detail, it's just an idea which would help us to keep binary compatibility. HW-specific optimizations in user space compiler would be great, but breaking binary compatibility is a bit of a scary step.
[patch net-next v2 2/2] mlxsw: core: Trace EMAD messages
From: Jiri Pirko Trace EMAD messages going down to HW and up from HW. Devlink needs to be registered before EMAD init so the trace function can be called with valid devlink handle. Signed-off-by: Jiri Pirko v1->v2: - Use trace_devlink_hwmsg directly --- drivers/net/ethernet/mellanox/mlxsw/core.c | 18 +- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c index 01ae548..480a3ba 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core.c +++ b/drivers/net/ethernet/mellanox/mlxsw/core.c @@ -58,6 +58,7 @@ #include #include #include +#include #include "core.h" #include "item.h" @@ -447,6 +448,10 @@ static int mlxsw_emad_transmit(struct mlxsw_core *mlxsw_core, if (!skb) return -ENOMEM; + trace_devlink_hwmsg(priv_to_devlink(mlxsw_core), false, 0, + skb->data + mlxsw_core->driver->txhdr_len, + skb->len - mlxsw_core->driver->txhdr_len); + atomic_set(&trans->active, 1); err = mlxsw_core_skb_transmit(mlxsw_core, skb, &trans->tx_info); if (err) { @@ -529,6 +534,9 @@ static void mlxsw_emad_rx_listener_func(struct sk_buff *skb, u8 local_port, struct mlxsw_core *mlxsw_core = priv; struct mlxsw_reg_trans *trans; + trace_devlink_hwmsg(priv_to_devlink(mlxsw_core), true, 0, + skb->data, skb->len); + if (!mlxsw_emad_is_resp(skb)) goto free_skb; @@ -1110,14 +1118,14 @@ int mlxsw_core_bus_device_register(const struct mlxsw_bus_info *mlxsw_bus_info, if (err) goto err_emad_init; - err = mlxsw_hwmon_init(mlxsw_core, mlxsw_bus_info, &mlxsw_core->hwmon); - if (err) - goto err_hwmon_init; - err = devlink_register(devlink, mlxsw_bus_info->dev); if (err) goto err_devlink_register; + err = mlxsw_hwmon_init(mlxsw_core, mlxsw_bus_info, &mlxsw_core->hwmon); + if (err) + goto err_hwmon_init; + err = mlxsw_driver->init(mlxsw_core, mlxsw_bus_info); if (err) goto err_driver_init; @@ -1131,9 +1139,9 @@ int mlxsw_core_bus_device_register(const struct mlxsw_bus_info *mlxsw_bus_info, err_debugfs_init: mlxsw_core->driver->fini(mlxsw_core); err_driver_init: +err_hwmon_init: devlink_unregister(devlink); err_devlink_register: -err_hwmon_init: mlxsw_emad_fini(mlxsw_core); err_emad_init: mlxsw_bus->fini(bus_priv); -- 2.5.5
[patch net-next v2 1/2] devlink: add hardware messages tracing facility
From: Jiri Pirko Define a tracepoint and allow user to trace messages going to and from hardware associated with devlink instance. Signed-off-by: Jiri Pirko --- v1->v2: - Use EXPORT_TRACEPOINT_SYMBOL_GPL instead of a wrapper function as suggested by David Ahern and Steven Rostedt --- include/trace/events/devlink.h | 68 ++ net/core/devlink.c | 4 +++ 2 files changed, 72 insertions(+) create mode 100644 include/trace/events/devlink.h diff --git a/include/trace/events/devlink.h b/include/trace/events/devlink.h new file mode 100644 index 000..333c32a --- /dev/null +++ b/include/trace/events/devlink.h @@ -0,0 +1,68 @@ +#if IS_ENABLED(CONFIG_NET_DEVLINK) + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM devlink + +#if !defined(_TRACE_DEVLINK_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_DEVLINK_H + +#include +#include +#include + +/* + * Tracepoint for devlink hardware message: + */ +TRACE_EVENT(devlink_hwmsg, + TP_PROTO(const struct devlink *devlink, bool incoming, +unsigned long type, const u8 *buf, size_t len), + + TP_ARGS(devlink, incoming, type, buf, len), + + TP_STRUCT__entry( + __string(bus_name, devlink->dev->bus->name) + __string(dev_name, dev_name(devlink->dev)) + __string(owner_name, devlink->dev->driver->owner->name) + __field(bool, incoming) + __field(unsigned long, type) + __dynamic_array(u8, buf, len) + __field(size_t, len) + ), + + TP_fast_assign( + __assign_str(bus_name, devlink->dev->bus->name); + __assign_str(dev_name, dev_name(devlink->dev)); + __assign_str(owner_name, devlink->dev->driver->owner->name); + __entry->incoming = incoming; + __entry->type = type; + memcpy(__get_dynamic_array(buf), buf, len); + __entry->len = len; + ), + + TP_printk("bus_name=%s dev_name=%s owner_name=%s incoming=%d type=%lu buf=0x[%*phD] len=%lu", + __get_str(bus_name), __get_str(dev_name), + __get_str(owner_name), __entry->incoming, __entry->type, + (int) __entry->len, __get_dynamic_array(buf), __entry->len) +); + +#endif /* _TRACE_DEVLINK_H */ + +/* This part must be outside protection */ +#include + +#else /* CONFIG_NET_DEVLINK */ + +#if !defined(_TRACE_DEVLINK_H) +#define _TRACE_DEVLINK_H + +#include + +static inline void trace_devlink_hwmsg(const struct devlink *devlink, + bool incoming, unsigned long type, + const u8 *buf, size_t len) +{ +} + +#endif /* _TRACE_DEVLINK_H */ + +#endif diff --git a/net/core/devlink.c b/net/core/devlink.c index b2e592a..1b50630 100644 --- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -26,6 +26,10 @@ #include #include #include +#define CREATE_TRACE_POINTS +#include + +EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_hwmsg); static LIST_HEAD(devlink_list); -- 2.5.5
[PATCH 4/6] ipvs: fix bind to link-local mcast IPv6 address in backup
From: Quentin Armitage When using HEAD from https://git.kernel.org/cgit/utils/kernel/ipvsadm/ipvsadm.git/, the command: ipvsadm --start-daemon backup --mcast-interface eth0.60 \ --mcast-group ff02::1:81 fails with the error message: Argument list too long whereas both: ipvsadm --start-daemon master --mcast-interface eth0.60 \ --mcast-group ff02::1:81 and: ipvsadm --start-daemon backup --mcast-interface eth0.60 \ --mcast-group 224.0.0.81 are successful. The error message "Argument list too long" isn't helpful. The error occurs because an IPv6 address is given in backup mode. The error is in make_receive_sock() in net/netfilter/ipvs/ip_vs_sync.c, since it fails to set the interface on the address or the socket before calling inet6_bind() (via sock->ops->bind), where the test 'if (!sk->sk_bound_dev_if)' failed. Setting sock->sk->sk_bound_dev_if on the socket before calling inet6_bind() resolves the issue. Fixes: d33288172e72 ("ipvs: add more mcast parameters for the sync daemon") Signed-off-by: Quentin Armitage Acked-by: Julian Anastasov Signed-off-by: Simon Horman --- net/netfilter/ipvs/ip_vs_sync.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c index 803001a..1b07578 100644 --- a/net/netfilter/ipvs/ip_vs_sync.c +++ b/net/netfilter/ipvs/ip_vs_sync.c @@ -1545,7 +1545,8 @@ error: /* * Set up receiving multicast socket over UDP */ -static struct socket *make_receive_sock(struct netns_ipvs *ipvs, int id) +static struct socket *make_receive_sock(struct netns_ipvs *ipvs, int id, + int ifindex) { /* multicast addr */ union ipvs_sockaddr mcast_addr; @@ -1566,6 +1567,7 @@ static struct socket *make_receive_sock(struct netns_ipvs *ipvs, int id) set_sock_size(sock->sk, 0, result); get_mcast_sockaddr(&mcast_addr, &salen, &ipvs->bcfg, id); + sock->sk->sk_bound_dev_if = ifindex; result = sock->ops->bind(sock, (struct sockaddr *)&mcast_addr, salen); if (result < 0) { pr_err("Error binding to the multicast addr\n"); @@ -1868,7 +1870,7 @@ int start_sync_thread(struct netns_ipvs *ipvs, struct ipvs_sync_daemon_cfg *c, if (state == IP_VS_STATE_MASTER) sock = make_send_sock(ipvs, id); else - sock = make_receive_sock(ipvs, id); + sock = make_receive_sock(ipvs, id, dev->ifindex); if (IS_ERR(sock)) { result = PTR_ERR(sock); goto outtinfo; -- 2.1.4
[PATCH 2/6] netfilter: nft_meta: set skb->nf_trace appropriately
From: Liping Zhang When user add a nft rule to set nftrace to zero, for example: # nft add rule ip filter input nftrace set 0 We should set nf_trace to zero also. Signed-off-by: Liping Zhang Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_meta.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c index 16c50b0..f4bad9d 100644 --- a/net/netfilter/nft_meta.c +++ b/net/netfilter/nft_meta.c @@ -227,7 +227,7 @@ void nft_meta_set_eval(const struct nft_expr *expr, skb->pkt_type = value; break; case NFT_META_NFTRACE: - skb->nf_trace = 1; + skb->nf_trace = !!value; break; default: WARN_ON(1); -- 2.1.4
[PATCH 1/6] netfilter: nf_tables: fix memory leak if expr init fails
From: Liping Zhang If expr init fails then we need to free it. So when the user add a nft rule as follows: # nft add rule filter input tcp dport 22 flow table ssh \ { ip saddr limit rate 0/second } memory leak will happen. Signed-off-by: Liping Zhang Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_api.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 2c88187..cf7c745 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -1724,9 +1724,11 @@ struct nft_expr *nft_expr_init(const struct nft_ctx *ctx, err = nf_tables_newexpr(ctx, &info, expr); if (err < 0) - goto err2; + goto err3; return expr; +err3: + kfree(expr); err2: module_put(info.ops->type->owner); err1: -- 2.1.4
[PATCH 0/6] Netfilter/IPVS fixes for net
Hi David, The following patchset contains Netfilter/IPVS fixes for your net tree. they are: 1) Fix leak in the error path of nft_expr_init(), from Liping Zhang. 2) Tracing from nf_tables cannot be disabled, also from Zhang. 3) Fix an integer overflow on 32bit archs when setting the number of hashtable buckets, from Florian Westphal. 4) Fix configuration of ipvs sync in backup mode with IPv6 address, from Quentin Armitage via Simon Horman. 5) Fix incorrect timeout calculation in nft_ct NFT_CT_EXPIRATION, from Florian Westphal. 6) Skip clash resolution in conntrack insertion races if NAT is in place. You can pull these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git We're rather late in the release cycle of 4.7, so if these cannot get upstream I'll make sure to submit them to -stable, no problem. Thanks! The following changes since commit acd43fe85b2d1dbad55ce211b8817e6d6687246f: Merge branch 'mlx4-fixes' (2016-06-22 16:38:17 -0400) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD for you to fetch changes up to 590b52e10d410e1439ae86be9fe19d75fdab628b: netfilter: conntrack: skip clash resolution if nat is in place (2016-07-12 16:28:41 +0200) Florian Westphal (2): netfilter: conntrack: avoid integer overflow when resizing netfilter: nft_ct: fix expiration getter Liping Zhang (2): netfilter: nf_tables: fix memory leak if expr init fails netfilter: nft_meta: set skb->nf_trace appropriately Pablo Neira Ayuso (2): Merge tag 'ipvs-fixes2-for-v4.7' of https://git.kernel.org/.../horms/ipvs netfilter: conntrack: skip clash resolution if nat is in place Quentin Armitage (1): ipvs: fix bind to link-local mcast IPv6 address in backup include/net/netfilter/nf_conntrack.h | 8 net/netfilter/ipvs/ip_vs_sync.c | 6 -- net/netfilter/nf_conntrack_core.c| 8 net/netfilter/nf_tables_api.c| 4 +++- net/netfilter/nft_ct.c | 6 +- net/netfilter/nft_meta.c | 2 +- 6 files changed, 25 insertions(+), 9 deletions(-)
[PATCH 6/6] netfilter: conntrack: skip clash resolution if nat is in place
The clash resolution is not easy to apply if the NAT table is registered. Even if no NAT rules are installed, the nul-binding ensures that a unique tuple is used, thus, the packet that loses race gets a different source port number, as described by: http://marc.info/?l=netfilter-devel&m=146818011604484&w=2 Clash resolution with NAT is also problematic if addresses/port range ports are used since the conntrack that wins race may describe a different mangling that we may have earlier applied to the packet via nf_nat_setup_info(). Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race") Signed-off-by: Pablo Neira Ayuso Tested-by: Marc Dionne --- net/netfilter/nf_conntrack_core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 62c42e9..9f530ad 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -646,6 +646,7 @@ static int nf_ct_resolve_clash(struct net *net, struct sk_buff *skb, l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); if (l4proto->allow_clash && + !nfct_nat(ct) && !nf_ct_is_dying(ct) && atomic_inc_not_zero(&ct->ct_general.use)) { nf_ct_acct_merge(ct, ctinfo, (struct nf_conn *)skb->nfct); -- 2.1.4
[PATCH 3/6] netfilter: conntrack: avoid integer overflow when resizing
From: Florian Westphal Can overflow so we might allocate very small table when bucket count is high on a 32bit platform. Note: resize is only possible from init_netns. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_core.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index f204274..62c42e9 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1601,8 +1601,15 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls) unsigned int nr_slots, i; size_t sz; + if (*sizep > (UINT_MAX / sizeof(struct hlist_nulls_head))) + return NULL; + BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct hlist_head)); nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_nulls_head)); + + if (nr_slots > (UINT_MAX / sizeof(struct hlist_nulls_head))) + return NULL; + sz = nr_slots * sizeof(struct hlist_nulls_head); hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, get_order(sz)); -- 2.1.4
[PATCH 5/6] netfilter: nft_ct: fix expiration getter
From: Florian Westphal We need to compute timeout.expires - jiffies, not the other way around. Add a helper, another patch can then later change more places in conntrack code where we currently open-code this. Will allow us to only change one place later when we remove per-ct timer. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_conntrack.h | 8 net/netfilter/nft_ct.c | 6 +- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index dd78bea..b6083c3 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -284,6 +284,14 @@ static inline bool nf_is_loopback_packet(const struct sk_buff *skb) return skb->dev && skb->skb_iif && skb->dev->flags & IFF_LOOPBACK; } +/* jiffies until ct expires, 0 if already expired */ +static inline unsigned long nf_ct_expires(const struct nf_conn *ct) +{ + long timeout = (long)ct->timeout.expires - (long)jiffies; + + return timeout > 0 ? timeout : 0; +} + struct kernel_param; int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp); diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c index 137e308..81fbb45 100644 --- a/net/netfilter/nft_ct.c +++ b/net/netfilter/nft_ct.c @@ -54,7 +54,6 @@ static void nft_ct_get_eval(const struct nft_expr *expr, const struct nf_conn_help *help; const struct nf_conntrack_tuple *tuple; const struct nf_conntrack_helper *helper; - long diff; unsigned int state; ct = nf_ct_get(pkt->skb, &ctinfo); @@ -94,10 +93,7 @@ static void nft_ct_get_eval(const struct nft_expr *expr, return; #endif case NFT_CT_EXPIRATION: - diff = (long)jiffies - (long)ct->timeout.expires; - if (diff < 0) - diff = 0; - *dest = jiffies_to_msecs(diff); + *dest = jiffies_to_msecs(nf_ct_expires(ct)); return; case NFT_CT_HELPER: if (ct->master == NULL) -- 2.1.4
Re: eBPF tunable max instructions or max tail call?
On Mon, Jul 11, 2016 at 8:14 PM, Alexei Starovoitov wrote: > On Mon, Jul 11, 2016 at 05:56:07PM -0700, Sargun Dhillon wrote: >> It would be nice to have eBPF programs that are longer than 4096 >> instructions. I'm trying to implement XSalsa20 in eBPF, and >> unfortunately, it doesn't fit into 4096 instructions since I'm >> unrolling all of the loops. Further than that, doing tail calls to >> process each block results in me hitting the tail call limit. > > a cipher in bpf? wow. that's pushing it :) > we've been discussing various way of adding 'bounded loop' instruction > to avoid manual unrolling, but it will be still limited to the 4k > instruction per program, so probably won't help this use case. > Are you trying to do it in the networking context? Yeah, I'm trying to do this as a TC filter. Instruction wise, each 64 byte chunk is about 5000 instructions using LLVM's automatic loop unrolling. I need the first and last invocation to be for finishing and initializing the key schedule, setting checksums, etc.. So, I'm pretty close -- this implementation wasn't actually XSalsa20, it was a port of the Kernel's implementation of Salsa20. I think bumping the instruction limit to 8k would do the trick. > >> It don't think that it makes much sense to expose the crypto API as >> BPF helpers, as I'm not sure if we can ensure safety, and timely >> execution with it. I may be wrong here, and if there is a sane, safe >> way to expose the crypto API, I'm all ears. > > we had the patches to connect crypto api with bpf, but they were > too hacky to upstream, since then we redesigned the approach > and the latest should be much cleaner. The keys will be managed > through normal xfrm api and bpf will call into crypto with > mechanism similar to tail-call. The program will specify the > offset/length within the packet to encrypt/decrypt and next > program to execute when crypto operation completes. > Root only for xdp and tc only. > This is really interesting to me. Right now, I'm passing the key via embedding it in the code itself. It allows LLVM to do a bit more optimization. The crypto APIs are really nice and well fleshed out. XFRM on the other hand introduces a lot of complexity that I'm trying to avoid. It'd be nice if we could treat cryptographic state as just another type of BPF map. >> Other than that, it would be nice to make the max instructions a knob, >> and I don't think that it has much downside, given it's only checked >> on load time. It would be nice to make the tail call limit a tunable >> as well, but I'm unsure of the performance impact it might have given >> that it's checked at runtime. >> >> What do y'all think is reasonable? Make them both tunable? Just one? None? > > It is preferred to achieve the goal without introducing a knob. > Also sounds like that increasing 4k to 8k won't really solve it anyway. >
Re: [PATCH v3] Marvell phy: add fiber status check and configuration for some phys
Le 12/07/2016 à 17:18, Andrew Lunn a écrit : > Hi Charles It's Charles-Antoine. ;) > > It is best to submit a number of smaller patches, each doing one > thing, than a single big patch. It makes review and discussion much > simpler. I'm sorry, I will fix that. > So for example, this should be a patch of its own: > >> @@ -151,6 +165,7 @@ struct marvell_hw_stat { >> >> static struct marvell_hw_stat marvell_hw_stats[] = { >> { "phy_receive_errors", 0, 21, 16}, >> +{ "phy_receive_errors_fiber", 1, 21, 16}, >> { "phy_idle_errors", 0, 10, 8 }, >> }; > > I think we should also rename phy_receive_errors to > phy_receive_errors_copper. I agree with you. I will fix that too. Thanks. Charles-Antoine Couret
Re: [PATCH v7 04/11] net/mlx4_en: add support for fast rx drop bpf program
On Mon, Jul 11, 2016 at 11:17:00PM -0700, David Miller wrote: > From: Brenden Blanco > Date: Mon, 11 Jul 2016 14:29:51 -0700 > > > + if (priv->num_frags > 1) > > + return -EOPNOTSUPP; > > I hate to be the user who has to debug why his XDP program won't > load just because he set a jumbo MTU beforehand. Me too. Patch v8 now has an en_err message, and double checked that a message is printed when MTU is attempted to be increased when XDP program has already been loaded. Thanks.
Re: [patch net-next v2 1/2] devlink: add hardware messages tracing facility
On Tue, 12 Jul 2016 18:05:03 +0200 Jiri Pirko wrote: > From: Jiri Pirko > > Define a tracepoint and allow user to trace messages going to and from > hardware associated with devlink instance. > > Signed-off-by: Jiri Pirko > --- > v1->v2: > - Use EXPORT_TRACEPOINT_SYMBOL_GPL instead of a wrapper function > as suggested by David Ahern and Steven Rostedt FYI, you can use the Suggested-by: tag too ;-) > --- > include/trace/events/devlink.h | 68 > ++ > net/core/devlink.c | 4 +++ > 2 files changed, 72 insertions(+) > create mode 100644 include/trace/events/devlink.h > > diff --git a/include/trace/events/devlink.h b/include/trace/events/devlink.h > new file mode 100644 > index 000..333c32a > --- /dev/null > +++ b/include/trace/events/devlink.h > @@ -0,0 +1,68 @@ > +#if IS_ENABLED(CONFIG_NET_DEVLINK) > + > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM devlink > + > +#if !defined(_TRACE_DEVLINK_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_DEVLINK_H > + > +#include > +#include > +#include > + > +/* > + * Tracepoint for devlink hardware message: > + */ > +TRACE_EVENT(devlink_hwmsg, > + TP_PROTO(const struct devlink *devlink, bool incoming, > + unsigned long type, const u8 *buf, size_t len), > + > + TP_ARGS(devlink, incoming, type, buf, len), > + > + TP_STRUCT__entry( > + __string(bus_name, devlink->dev->bus->name) > + __string(dev_name, dev_name(devlink->dev)) > + __string(owner_name, devlink->dev->driver->owner->name) > + __field(bool, incoming) > + __field(unsigned long, type) > + __dynamic_array(u8, buf, len) > + __field(size_t, len) > + ), > + > + TP_fast_assign( > + __assign_str(bus_name, devlink->dev->bus->name); > + __assign_str(dev_name, dev_name(devlink->dev)); > + __assign_str(owner_name, devlink->dev->driver->owner->name); > + __entry->incoming = incoming; > + __entry->type = type; > + memcpy(__get_dynamic_array(buf), buf, len); > + __entry->len = len; > + ), > + > + TP_printk("bus_name=%s dev_name=%s owner_name=%s incoming=%d type=%lu > buf=0x[%*phD] len=%lu", > + __get_str(bus_name), __get_str(dev_name), > + __get_str(owner_name), __entry->incoming, __entry->type, > + (int) __entry->len, __get_dynamic_array(buf), __entry->len) > +); > + > +#endif /* _TRACE_DEVLINK_H */ > + > +/* This part must be outside protection */ > +#include > + > +#else /* CONFIG_NET_DEVLINK */ > + > +#if !defined(_TRACE_DEVLINK_H) > +#define _TRACE_DEVLINK_H Ah, I guess you do need header protection here. OK, looks good to me. Acked-by: Steven Rostedt -- Steve > + > +#include > + > +static inline void trace_devlink_hwmsg(const struct devlink *devlink, > +bool incoming, unsigned long type, > +const u8 *buf, size_t len) > +{ > +} > + > +#endif /* _TRACE_DEVLINK_H */ > + > +#endif > diff --git a/net/core/devlink.c b/net/core/devlink.c > index b2e592a..1b50630 100644 > --- a/net/core/devlink.c > +++ b/net/core/devlink.c > @@ -26,6 +26,10 @@ > #include > #include > #include > +#define CREATE_TRACE_POINTS > +#include > + > +EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_hwmsg); > > static LIST_HEAD(devlink_list); >
Re: [net-next PATCH RFC] mlx4: RX prefetch loop
On Tue, Jul 12, 2016 at 5:45 AM, Jesper Dangaard Brouer wrote: > On Mon, 11 Jul 2016 16:05:11 -0700 > Alexei Starovoitov wrote: > >> On Mon, Jul 11, 2016 at 01:09:22PM +0200, Jesper Dangaard Brouer wrote: >> > > - /* Process all completed CQEs */ >> > > + /* Extract and prefetch completed CQEs */ >> > > while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK, >> > > cq->mcq.cons_index & cq->size)) { >> > > + void *data; >> > > >> > > frags = ring->rx_info + (index << priv->log_rx_info); >> > > rx_desc = ring->buf + (index << ring->log_stride); >> > > + prefetch(rx_desc); >> > > >> > > /* >> > >* make sure we read the CQE after we read the ownership bit >> > >*/ >> > > dma_rmb(); >> > > >> > > + cqe_array[cqe_idx++] = cqe; >> > > + >> > > + /* Base error handling here, free handled in next loop */ >> > > + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == >> > > + MLX4_CQE_OPCODE_ERROR)) >> > > + goto skip; >> > > + >> > > + data = page_address(frags[0].page) + frags[0].page_offset; >> > > + prefetch(data); >> >> that's probably not correct in all cases, since doing prefetch on the address >> that is going to be evicted soon may hurt performance. >> We need to dma_sync_single_for_cpu() before doing a prefetch or >> somehow figure out that dma_sync is a nop, so we can omit it altogether >> and do whatever prefetches we like. > > Sure, DMA can be synced first (actually already played with this). Yes, but the point I think that Alexei is kind of indirectly getting at is that you are doing all your tests on x86 architecture are you not? The x86 stuff is a very different beast from architectures like ARM which have a very different architecture when it comes to how they handle the memory organization of the system. In the case of x86 the only time dma_sync is not a nop is if you force swiotlb to be enabled at which point the whole performance argument is kind of pointless anyway. >> Also unconditionally doing batch of 8 may also hurt depending on what >> is happening either with the stack, bpf afterwards or even cpu version. > > See this as software DDIO, if the unlikely case that data will get > evicted, it will still exist in L2 or L3 cache (like DDIO). Notice, > only 1024 bytes are getting prefetched here. I disagree. DDIO only pushes received frames into the L3 cache. What you are potentially doing is flooding the L2 cache. The difference in size between the L3 and L2 caches is very significant. L3 cache size is in the MB range while the L2 cache is only 256KB or so for Xeon processors and such. In addition DDIO is really meant for an architecture that has a fairly large cache region to spare and it it limits itself to that cache region, the approach taken in this code could potentially prefetch a fairly significant chunk of memory. >> Doing single prefetch of Nth packet is probably ok most of the time, >> but asking cpu to prefetch 8 packets at once is unnecessary especially >> since single prefetch gives the same performance. > > No, unconditionally prefetch of the Nth packet, will be wrong most of > the time, for real work loads, as Eric Dumazet already pointed out. > > This patch does NOT unconditionally prefetch 8 packets. Prefetching > _only_ happens when it is known that packets are ready in the RX ring. > We know this prefetch data will be used/touched, within the NAPI cycle. > Even if the processing of the packet flush L1 cache, then it will be in > L2 or L3 (like DDIO). I think the point you are missing here Jesper is that the packet isn't what will be flushed out of L1. It will be all the data that had been fetched before that. So for example the L1 cache can only hold 32K, and the way it is setup if you fetch the first 64 bytes of 8 pages you will have evicted everything that was in that cache set will be flushed out to L2. Also it might be worth while to see what instruction is being used for the prefetch. Last I knew for read prefetches it was prefetchnta on x86 which would only pull the data into the L1 cache as a "non-temporal" store. If I am not mistaken I think you run the risk of having the data prefetched evicted back out and bypassing the L2 and L3 caches unless it is modified. That was kind of the point of the prefetchnta as it really meant to be a read-only prefetch and meant to avoid polluting the L2 and L3 caches. - Alex
Re: [patch net-next v2 1/2] devlink: add hardware messages tracing facility
Tue, Jul 12, 2016 at 06:38:26PM CEST, rost...@goodmis.org wrote: >On Tue, 12 Jul 2016 18:05:03 +0200 >Jiri Pirko wrote: > >> From: Jiri Pirko >> >> Define a tracepoint and allow user to trace messages going to and from >> hardware associated with devlink instance. >> >> Signed-off-by: Jiri Pirko >> --- >> v1->v2: >> - Use EXPORT_TRACEPOINT_SYMBOL_GPL instead of a wrapper function >> as suggested by David Ahern and Steven Rostedt > >FYI, you can use the Suggested-by: tag too ;-) Thought that the purpose of that tag is if someone suggested the whole patch existence. >> +/* This part must be outside protection */ >> +#include >> + >> +#else /* CONFIG_NET_DEVLINK */ >> + >> +#if !defined(_TRACE_DEVLINK_H) >> +#define _TRACE_DEVLINK_H > >Ah, I guess you do need header protection here. Yes, you do. > >OK, looks good to me. > >Acked-by: Steven Rostedt Thanks Steven!
Re: [RFC PATCH net-next 1/1] Introduce skbmod action
On Mon, Jul 11, 2016 at 5:12 AM, Jamal Hadi Salim wrote: > From: Jamal Hadi Salim > > This action is intended to be an upgrade from a usability perspective > from pedit. Compare this: > Definitely agree we need a more user-friendly interface. > > pedit is a good starting point - but once you start going to a large > number of policies then from a programmability, ops and usability point > of view you need something with more succint params. > But it is still unclear why we can't just build something on top of pedit? Since pedit accepts keys like u32, user-space is free to introduce any wrapper on top of it. It should not be a problem for libnl3 to build anything on top too.
Re: [PATCH 0/6] Netfilter/IPVS fixes for net
From: Pablo Neira Ayuso Date: Tue, 12 Jul 2016 18:10:56 +0200 > The following patchset contains Netfilter/IPVS fixes for your net tree. > they are: > > 1) Fix leak in the error path of nft_expr_init(), from Liping Zhang. > > 2) Tracing from nf_tables cannot be disabled, also from Zhang. > > 3) Fix an integer overflow on 32bit archs when setting the number of >hashtable buckets, from Florian Westphal. > > 4) Fix configuration of ipvs sync in backup mode with IPv6 address, >from Quentin Armitage via Simon Horman. > > 5) Fix incorrect timeout calculation in nft_ct NFT_CT_EXPIRATION, >from Florian Westphal. > > 6) Skip clash resolution in conntrack insertion races if NAT is in >place. > > You can pull these changes from: > > git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git Pulled, thanks Pablo. > We're rather late in the release cycle of 4.7, so if these cannot > get upstream I'll make sure to submit them to -stable, no problem. I'm pretty sure there is going to be an -rc8, don't worry...
Re: [patch net-next v2 1/2] devlink: add hardware messages tracing facility
On Tue, 12 Jul 2016 19:10:46 +0200 Jiri Pirko wrote: > Tue, Jul 12, 2016 at 06:38:26PM CEST, rost...@goodmis.org wrote: > >On Tue, 12 Jul 2016 18:05:03 +0200 > >Jiri Pirko wrote: > > > >> From: Jiri Pirko > >> > >> Define a tracepoint and allow user to trace messages going to and from > >> hardware associated with devlink instance. > >> > >> Signed-off-by: Jiri Pirko > >> --- > >> v1->v2: > >> - Use EXPORT_TRACEPOINT_SYMBOL_GPL instead of a wrapper function > >> as suggested by David Ahern and Steven Rostedt > > > >FYI, you can use the Suggested-by: tag too ;-) > > Thought that the purpose of that tag is if someone suggested the whole > patch existence. Yeah, it has different meanings. You could always be creative and add: Parts-suggested-by: ;-) -- Steve
Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
On Mon, Jul 11, 2016 at 12:45 PM, wrote: > Hi > > On latest kernel i noticed kernel panic happening 1-2 times per day. It is > also happening on older kernel (at least 4.5.3). > ... > [42916.426463] Call Trace: > [42916.426658] > > [42916.426719] [] skb_push+0x36/0x37 > [42916.427111] [] ppp_start_xmit+0x10f/0x150 > [ppp_generic] > [42916.427314] [] dev_hard_start_xmit+0x25a/0x2d3 > [42916.427516] [] ? > validate_xmit_skb.isra.107.part.108+0x11d/0x238 > [42916.427858] [] sch_direct_xmit+0x89/0x1b5 > [42916.428060] [] __qdisc_run+0x133/0x170 > [42916.428261] [] net_tx_action+0xe3/0x148 > [42916.428462] [] __do_softirq+0xb9/0x1a9 > [42916.428663] [] irq_exit+0x37/0x7c > [42916.428862] [] smp_apic_timer_interrupt+0x3d/0x48 > [42916.429063] [] apic_timer_interrupt+0x7c/0x90 Interesting, we call a skb_cow_head() before skb_push() in ppp_start_xmit(), I have no idea why this could happen. Do you have any tc qdisc, filter or actions on this ppp device?
Re: [PATCH net v2 0/2] net: ethoc: Error path and transmit fixes
From: Max Filippov Date: Tue, 12 Jul 2016 16:51:10 +0300 > Hello, > > On Mon, Jul 11, 2016 at 04:35:53PM -0700, Florian Fainelli wrote: >> This patch series contains two patches for the ethoc driver while testing on >> a >> TS-7300 board where ethoc is provided by an on-board FPGA. >> >> First patch was cooked after chasing crashes with invalid resources passed to >> the driver. >> >> Second patch was cooked after seeing that an interface configured with IP >> 192.168.2.2 was sending ARP packets for 192.168.0.0, no wonder why it could >> not >> work. > > I can see opencores intrerface sending ARP packets shorter than 64 bytes, > but I couldn't reproduce truncation that affects packet contents on my > hardware. > >> I don't have access to any other platform using an ethoc interface so >> it could be good to some testing on Xtensa for instance. > > I've tested success and error paths affected by this series with the > following additional change on top of it: > > diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c > index ca678d4..8c94f45 100644 > --- a/drivers/net/ethernet/ethoc.c > +++ b/drivers/net/ethernet/ethoc.c > @@ -862,7 +862,7 @@ static netdev_tx_t ethoc_start_xmit(struct sk_buff *skb, > struct net_device *dev) > > if (skb_put_padto(skb, ETHOC_ZLEN)) { > dev->stats.tx_errors++; > - goto out; > + return NETDEV_TX_OK; > } > > if (unlikely(skb->len > ETHOC_BUFSIZ)) { > > Without it the interface becomes non-functional after the first error > in skb_put_padto. > > Tested-by: Max Filippov > Reviewed-by: Max Filippov Indeed, skb_put_padto() frees the skb on error so this would have caused a double free. Florian, please respin this series with the fix and tags added. Thanks.
Re: [RFC PATCH v3] net: sched: convert qdisc linked list to hashtable
On Mon, Jul 11, 2016 at 7:02 AM, Jiri Kosina wrote: > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index f45929c..0b5c172e 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -52,6 +52,7 @@ > #include > #include > #include > +#include > > struct netpoll_info; > struct device; > @@ -1778,6 +1779,7 @@ struct net_device { > unsigned intnum_tx_queues; > unsigned intreal_num_tx_queues; > struct Qdisc*qdisc; > + DECLARE_HASHTABLE (qdisc_hash, 4); > unsigned long tx_queue_len; > spinlock_t tx_global_lock; > int watchdog_timeo; Should it be surrounded by CONFIG_NET_SCHED? To save several bytes for !CONFIG_NET_SCHED case.
Re: [PATCH] virtio-net: Remove more stack DMA
From: Andy Lutomirski Date: Mon, 11 Jul 2016 14:30:28 -0700 > DaveM, is it okay for this to go in via -tip? Sure.
Re: [PATCH net-next,v3] tools: hv: Add a script to help bonding synthetic and VF NICs
From: Haiyang Zhang Date: Mon, 11 Jul 2016 17:06:42 -0700 > From: Haiyang Zhang > > This script helps to create bonding network devices based on synthetic NIC > (the virtual network adapter usually provided by Hyper-V) and the matching > VF NIC (SRIOV virtual function). So the synthetic NIC and VF NIC can > function as one network device, and fail over to the synthetic NIC if VF is > down. > > Mayjor distros (RHEL, Ubuntu, SLES) supported by Hyper-V are supported by > this script. > > Signed-off-by: Haiyang Zhang > Reviewed-by: K. Y. Srinivasan Applied.
Re: ethtool TODO list - additional info
Tue, Jul 05, 2016 at 05:37:42PM CEST, jorge.garcia.gonza...@gmail.com wrote: >Hi ! > >Some days ago, Jiri Pirko was talking about some next steps to >implement for ethtool. > > I haven't seen any follow up since ethtool's maintainer change. Can we have >additional details about these ? > >- libethtool - API >- generic netlink Yep, exactly, no reply. Apparently nobody really want to do any initiative here, and I'm lacking time to do it :( >- sub commands syntax >- TODO/bugzilla > > >On Mon, Jul 4, 2016 at 9:24 AM, Jiri Pirko wrote: > I was thinking of adding a TODO file to the repository, but it's really for the new maintainer to decide what to do. So here's my list as a suggestion: * Add regression test coverage for all sub-commands with complex logic * Internationalise output and error messages * Build a libethtool that handles all the API quirks and fallbacks for old kernel versions. This might help people writing language bindings or other utilities that use the ethtool API. * Provide a 'cleaned up' ethtool (under some other name) that has: - More conventional sub-command syntax, i.e. no '-'/'--' prefix - More consistent output formatting >>> >>>That seems like a reasonable start for a TODO list. I'll bet there >>>are a few people out there with other suggestions as well...? >> >> Before that, I would like to see ethtool migrate to use generic >> netlink. Then, the new tool would be needed anyway, should exist within >> iproute2 package and have similar command line syntax. >> >> I have some ideas about the gennetlink ethtool, have to find some time >> to implement some initial part of it.
Re: [PATCH 1/1] bonding: restrict the data protected by rcu_read_lock
From: Zhu Yanjun Date: Tue, 12 Jul 2016 15:28:17 +0800 > In this function, origin is a pointer to struct aggregator. > No matter what agg is changed to, it has nothing to do with origin. > > CC: Jay Vosburgh > Signed-off-by: Zhu Yanjun I really hate changes like this. Imagine if a inner function was called with RCU protection around it, as a rule. That's the same as what is happening here, the RCU locking is done around the entire contents of the function and that is perfectly fine even if one variable assignment or whatever is superfluously contained inside of it. I'm not applying this patch, sorry.
Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
On 2016-07-12 20:31, Cong Wang wrote: On Mon, Jul 11, 2016 at 12:45 PM, wrote: Hi On latest kernel i noticed kernel panic happening 1-2 times per day. It is also happening on older kernel (at least 4.5.3). ... [42916.426463] Call Trace: [42916.426658] [42916.426719] [] skb_push+0x36/0x37 [42916.427111] [] ppp_start_xmit+0x10f/0x150 [ppp_generic] [42916.427314] [] dev_hard_start_xmit+0x25a/0x2d3 [42916.427516] [] ? validate_xmit_skb.isra.107.part.108+0x11d/0x238 [42916.427858] [] sch_direct_xmit+0x89/0x1b5 [42916.428060] [] __qdisc_run+0x133/0x170 [42916.428261] [] net_tx_action+0xe3/0x148 [42916.428462] [] __do_softirq+0xb9/0x1a9 [42916.428663] [] irq_exit+0x37/0x7c [42916.428862] [] smp_apic_timer_interrupt+0x3d/0x48 [42916.429063] [] apic_timer_interrupt+0x7c/0x90 Interesting, we call a skb_cow_head() before skb_push() in ppp_start_xmit(), I have no idea why this could happen. Do you have any tc qdisc, filter or actions on this ppp device? Yes, i have policing filters for incoming traffic (ingress), and also on egress htb + pfifo + filters.
Re: ethtool TODO list - additional info
On Tue, Jul 12, 2016 at 12:55 PM, Jiri Pirko wrote: > Tue, Jul 05, 2016 at 05:37:42PM CEST, jorge.garcia.gonza...@gmail.com wrote: >>Hi ! >> >>Some days ago, Jiri Pirko was talking about some next steps to >>implement for ethtool. >> >> I haven't seen any follow up since ethtool's maintainer change. Can we have >>additional details about these ? >> >>- libethtool - API >>- generic netlink > > Yep, exactly, no reply. Apparently nobody really want to do any initiative > here, and I'm lacking time to do it :( > That's fine, I already got a git repo, I'm trying to understand what 'use generic netlink' means. This is a piece of what grep got me on iproute git repo (I'm still trying to understand) grep -i netlink -r iproute2/ iproute2/bridge/mdb.c:#include "libnetlink.h" iproute2/bridge/vlan.c:#include "libnetlink.h" ... .. > > >>- sub commands syntax >>- TODO/bugzilla >> >> >>On Mon, Jul 4, 2016 at 9:24 AM, Jiri Pirko wrote: >> > I was thinking of adding a TODO file to the repository, but it's really > for the new maintainer to decide what to do. So here's my list as a > suggestion: > > * Add regression test coverage for all sub-commands with complex logic > > * Internationalise output and error messages > > * Build a libethtool that handles all the API quirks and fallbacks for > old kernel versions. This might help people writing language > bindings or other utilities that use the ethtool API. > > * Provide a 'cleaned up' ethtool (under some other name) that has: > - More conventional sub-command syntax, i.e. no '-'/'--' prefix > - More consistent output formatting That seems like a reasonable start for a TODO list. I'll bet there are a few people out there with other suggestions as well...? >>> >>> Before that, I would like to see ethtool migrate to use generic >>> netlink. Then, the new tool would be needed anyway, should exist within >>> iproute2 package and have similar command line syntax. >>> >>> I have some ideas about the gennetlink ethtool, have to find some time >>> to implement some initial part of it.
Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
On Tue, Jul 12, 2016 at 11:03 AM, wrote: > On 2016-07-12 20:31, Cong Wang wrote: >> >> On Mon, Jul 11, 2016 at 12:45 PM, wrote: >>> >>> Hi >>> >>> On latest kernel i noticed kernel panic happening 1-2 times per day. It >>> is >>> also happening on older kernel (at least 4.5.3). >>> >> ... >>> >>> [42916.426463] Call Trace: >>> [42916.426658] >>> >>> [42916.426719] [] skb_push+0x36/0x37 >>> [42916.427111] [] ppp_start_xmit+0x10f/0x150 >>> [ppp_generic] >>> [42916.427314] [] dev_hard_start_xmit+0x25a/0x2d3 >>> [42916.427516] [] ? >>> validate_xmit_skb.isra.107.part.108+0x11d/0x238 >>> [42916.427858] [] sch_direct_xmit+0x89/0x1b5 >>> [42916.428060] [] __qdisc_run+0x133/0x170 >>> [42916.428261] [] net_tx_action+0xe3/0x148 >>> [42916.428462] [] __do_softirq+0xb9/0x1a9 >>> [42916.428663] [] irq_exit+0x37/0x7c >>> [42916.428862] [] smp_apic_timer_interrupt+0x3d/0x48 >>> [42916.429063] [] apic_timer_interrupt+0x7c/0x90 >> >> >> Interesting, we call a skb_cow_head() before skb_push() in >> ppp_start_xmit(), >> I have no idea why this could happen. >> >> Do you have any tc qdisc, filter or actions on this ppp device? > > Yes, i have policing filters for incoming traffic (ingress), and also on > egress htb + pfifo + filters. Does it make any difference if you remove the egress qdisc and/or filters? If yes, please share the `tc qd show...` and `tc filter show ...`? Thanks!
Re: [PATCH -next] stmmac: dwmac-socfpga: fix wrong pointer passed to PTR_ERR()
From: weiyj...@163.com Date: Tue, 12 Jul 2016 11:00:09 + > From: Wei Yongjun > > PTR_ERR should access the value just tested by IS_ERR, otherwise > the wrong error code will be returned. > > Signed-off-by: Wei Yongjun Applied.
Re: [PATCH -next] dwc_eth_qos: fix missing clk_disable_unprepare() on error in dwceqos_probe()
From: weiyj...@163.com Date: Tue, 12 Jul 2016 11:43:37 + > From: Wei Yongjun > > Fix missing clk_disable_unprepare() call before return > from dwceqos_probe() in the error handling case of invalid > fixed-link. > > Signed-off-by: Wei Yongjun Applied.
Re: Fwd: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory Communications - RDMA
On 2016/07/06 17:29, Ursula Braun wrote: > Dave, > > we still like to see SMC-R included into a future Linux-kernel. After > answering your first 2 questions, there is no longer a response. What should > we do next? > - Still wait for an answer from you? > - Resend the same whole SMC-R patch series, this time with the cover letter > adapted to your requested changes? ^^^ I would suggest to send v2 of the patch series with the changes that were requested. > - Put the SMC-R development on hold, and concentrate on another > s390-specific SMC-solution first (not RDMA-based), that makes use of the > SMC-socket family as well. > - Anything else? > > Kind regards, Ursula > > Forwarded Message > Subject: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory > Communications - RDMA > Date: Tue, 21 Jun 2016 16:02:59 +0200 > From: Ursula Braun > To: da...@davemloft.net > CC: netdev@vger.kernel.org, linux-s...@vger.kernel.org, > schwidef...@de.ibm.com, heiko.carst...@de.ibm.com, utz.bac...@de.ibm.com > > Dave, > > the SMC-R patches submitted 2016-06-03 show up in state "Changes > Requested" on patchwork: > https://patchwork.ozlabs.org/project/netdev/list/?submitter=2266&state=*&page=1 > > You had requested a change of the SMC-R description in the cover letter. > We came up with the response below. Do you need anything else from us? > > Kind regards, > Ursula Braun > > Forwarded Message > Subject: Re: [PATCH net-next 00/15] net/smc: Shared Memory > Communications - RDMA > Date: Thu, 9 Jun 2016 17:36:28 +0200 > From: Ursula Braun > To: da...@davemloft.net > CC: netdev@vger.kernel.org, linux-s...@vger.kernel.org, > schwidef...@de.ibm.com, heiko.carst...@de.ibm.com > > On Tue, 2016-06-07 at 15:07 -0700, David Miller wrote: > > In case my previous reply wasn't clear enough, I require that you provide > > a more accurate description of what the implications of this feature are. > > > > Namely, that important _CORE_ networking features are completely bypassed > > and unusable when SMC applies to a connection. > > > > Specifically, all packet shaping, filtering, traffic inspection, and > > flow management facilitites in the kernel will not be able to see nor > > act upon the data flow of these TCP connections once established. > > > > It is always important, and in my opinion required, to list the > > negative aspects of your change and not just the "wow, amazing" > > positive aspects. > > > > Thanks. > > > > > Correct, the SMC-R data stream bypasses TCP and thus cannot enjoy its > features. This is the price for leveraging the TCP application ecosystem > and reducing CPU load. > > When a load balancer allows the TCP handshake to take place between a > worker node and the TCP client, RDMA will be used between these two > nodes. So anything based on TCP connection establishment (including a > firewall) can apply to SMC-R, too. To be clear -- yes, the data flow > later on is not subject to these features anymore. At least VLAN > isolation from the TCP part can be leveraged for RDMA traffic. From our > experience, discussions, etc., that tradeoff seems acceptable in a > classical data center environment. > > Improving our cover letter would result in the following new > introductory motivation part at the beginning and a slightly modified > list of > planned enhancements at the end: > > On Fri, 2016-06-03 at 17:26 +0200, Ursula Braun wrote: > > > These patches are the initial part of the implementation of the > > "Shared Memory Communications-RDMA" (SMC-R) protocol. The protocol is > > defined in RFC7609 [1]. It allows transformation of TCP connections > > using the "Remote Direct Memory Access over Converged Ethernet" (RoCE) > > feature of specific communication hardware for data center environments. > > > > SMC-R inherits TCP qualities such as reliable connections, host-based > > firewall packet filtering (on connection establishment) and unmodified > > application of communication encryption such as TLS (transport layer > > security) or SSL (secure sockets layer). It is transparent to most existing > > TCP connection load balancers that are commonly used in the enterprise data > > center environment for multi-tier application workloads. > > > > Being designed for the data center network switched fabric environment, it > > does not need congestion control and thus reaches line speed right away > > without having to go through slow start as with TCP. This can be beneficial > > for short living flows including request response patterns requiring > > reliability. A full SMC-R implementation also provides seamless high > > availability and load-balancing demanded by enterprise installations. > > > > SMC-R does not require an RDMA communication manager (RDMA CM). Its use of > > RDMA provides CPU savings transparently for unmodified applications. > > For instance, when running 10 parallel connections with uperf, we measured > > a decrease of 60% in CPU consumption with
Re: [PATCH -next] rxrpc: Fix error handling in af_rxrpc_init()
From: weiyj...@163.com Date: Tue, 12 Jul 2016 11:21:17 + > From: Wei Yongjun > > security initialized after alloc workqueue, so we should exit security > before destroy workqueue in the error handing. > > Fixes: 648af7fca159 ("rxrpc: Absorb the rxkad security module") > Signed-off-by: Wei Yongjun Applied.
Re: [PATCH -next] net: mediatek: fix non static symbol warnings
From: weiyj...@163.com Date: Tue, 12 Jul 2016 11:36:44 + > From: Wei Yongjun > > Fixes the following sparse warnings: > > drivers/net/ethernet/mediatek/mtk_eth_soc.c:79:5: warning: > symbol '_mtk_mdio_write' was not declared. Should it be static? > drivers/net/ethernet/mediatek/mtk_eth_soc.c:98:5: warning: > symbol '_mtk_mdio_read' was not declared. Should it be static? > > Signed-off-by: Wei Yongjun Applied.
Re: Fwd: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory Communications - RDMA
On 2016/07/12 11:08, Benjamin Poirier wrote: > On 2016/07/06 17:29, Ursula Braun wrote: > > Dave, > > > > we still like to see SMC-R included into a future Linux-kernel. After > > answering your first 2 questions, there is no longer a response. What should > > we do next? > > - Still wait for an answer from you? > > - Resend the same whole SMC-R patch series, this time with the cover letter > > adapted to your requested changes? > > ^^^ I would suggest to send v2 of the patch series with the changes > that were requested. > > > - Put the SMC-R development on hold, and concentrate on another > > s390-specific SMC-solution first (not RDMA-based), that makes use of the > > SMC-socket family as well. > > - Anything else? > > ... and I would suggest to trim my emails next time. Sorry.
Re: [PATCH] net: Fragment large datagrams even when IP_HDRINCL is set.
From: Alan Davey Date: Tue, 12 Jul 2016 12:34:07 + > - all future applications have to continue to implement their own > fragmentation code, duplicating that which already exists in the kernel They have to do this anyways, don't you see this? Otherwise they don't support %99 of the kernels out there. Even if I put this change in now, it would take years for it to propagate to even a moderate percentage of Linux machines out there. And applications doing a "we only support kernels > version x.y.z" is a situation I don't want to promote if you think that's a reasonable way to handle this.
Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
On 2016-07-12 21:05, Cong Wang wrote: On Tue, Jul 12, 2016 at 11:03 AM, wrote: On 2016-07-12 20:31, Cong Wang wrote: On Mon, Jul 11, 2016 at 12:45 PM, wrote: Hi On latest kernel i noticed kernel panic happening 1-2 times per day. It is also happening on older kernel (at least 4.5.3). ... [42916.426463] Call Trace: [42916.426658] [42916.426719] [] skb_push+0x36/0x37 [42916.427111] [] ppp_start_xmit+0x10f/0x150 [ppp_generic] [42916.427314] [] dev_hard_start_xmit+0x25a/0x2d3 [42916.427516] [] ? validate_xmit_skb.isra.107.part.108+0x11d/0x238 [42916.427858] [] sch_direct_xmit+0x89/0x1b5 [42916.428060] [] __qdisc_run+0x133/0x170 [42916.428261] [] net_tx_action+0xe3/0x148 [42916.428462] [] __do_softirq+0xb9/0x1a9 [42916.428663] [] irq_exit+0x37/0x7c [42916.428862] [] smp_apic_timer_interrupt+0x3d/0x48 [42916.429063] [] apic_timer_interrupt+0x7c/0x90 Interesting, we call a skb_cow_head() before skb_push() in ppp_start_xmit(), I have no idea why this could happen. Do you have any tc qdisc, filter or actions on this ppp device? Yes, i have policing filters for incoming traffic (ingress), and also on egress htb + pfifo + filters. Does it make any difference if you remove the egress qdisc and/or filters? If yes, please share the `tc qd show...` and `tc filter show ...`? Thanks! It is not easy, because it is NAS with approx 5000 users connected (and they are constantly connecting/disconnecting), and crash can't be reproduced easily. If i will remove qdisc/filters users will get unlimited speed and this will cause serious service degradation. But maybe i can add some debug lines and run some test kernel if necessary (if it will not cause serious performance overhead).
Re: ethtool TODO list - additional info
On Tue, Jul 12, 2016 at 01:04:52PM -0500, Jorge Alberto Garcia wrote: > On Tue, Jul 12, 2016 at 12:55 PM, Jiri Pirko wrote: > > Tue, Jul 05, 2016 at 05:37:42PM CEST, jorge.garcia.gonza...@gmail.com wrote: > >>Hi ! > >> > >>Some days ago, Jiri Pirko was talking about some next steps to > >>implement for ethtool. > >> > >> I haven't seen any follow up since ethtool's maintainer change. Can we have > >>additional details about these ? > >> > >>- libethtool - API > >>- generic netlink > > > > Yep, exactly, no reply. Apparently nobody really want to do any initiative > > here, and I'm lacking time to do it :( > > > > That's fine, I already got a git repo, I'm trying to understand what > 'use generic netlink' means. > > This is a piece of what grep got me on iproute git repo (I'm still > trying to understand) > grep -i netlink -r iproute2/ > iproute2/bridge/mdb.c:#include "libnetlink.h" > iproute2/bridge/vlan.c:#include "libnetlink.h" > ... > .. The general notion would be to replace the current ioctl-based ethtool API with one that is based on netlink, like many of the other networking APIs. I'm fine with the general idea of that, but so far I haven't heard a strong case for why it is necessary. It certainly doesn't seem urgent to me -- if someone disagrees, then please explain! John > > > > > >>- sub commands syntax > >>- TODO/bugzilla > >> > >> > >>On Mon, Jul 4, 2016 at 9:24 AM, Jiri Pirko wrote: > >> > > I was thinking of adding a TODO file to the repository, but it's really > > for the new maintainer to decide what to do. So here's my list as a > > suggestion: > > > > * Add regression test coverage for all sub-commands with complex logic > > > > * Internationalise output and error messages > > > > * Build a libethtool that handles all the API quirks and fallbacks for > > old kernel versions. This might help people writing language > > bindings or other utilities that use the ethtool API. > > > > * Provide a 'cleaned up' ethtool (under some other name) that has: > > - More conventional sub-command syntax, i.e. no '-'/'--' prefix > > - More consistent output formatting > > That seems like a reasonable start for a TODO list. I'll bet there > are a few people out there with other suggestions as well...? > >>> > >>> Before that, I would like to see ethtool migrate to use generic > >>> netlink. Then, the new tool would be needed anyway, should exist within > >>> iproute2 package and have similar command line syntax. > >>> > >>> I have some ideas about the gennetlink ethtool, have to find some time > >>> to implement some initial part of it. > -- John W. LinvilleSomeday the world will need a hero, and you linvi...@tuxdriver.com might be all we have. Be ready.
Re: ethtool TODO list - additional info
On 16-07-12 11:12 AM, John W. Linville wrote: > On Tue, Jul 12, 2016 at 01:04:52PM -0500, Jorge Alberto Garcia wrote: >> On Tue, Jul 12, 2016 at 12:55 PM, Jiri Pirko wrote: >>> Tue, Jul 05, 2016 at 05:37:42PM CEST, jorge.garcia.gonza...@gmail.com wrote: Hi ! Some days ago, Jiri Pirko was talking about some next steps to implement for ethtool. I haven't seen any follow up since ethtool's maintainer change. Can we have additional details about these ? - libethtool - API - generic netlink >>> >>> Yep, exactly, no reply. Apparently nobody really want to do any initiative >>> here, and I'm lacking time to do it :( >>> >> >> That's fine, I already got a git repo, I'm trying to understand what >> 'use generic netlink' means. >> >> This is a piece of what grep got me on iproute git repo (I'm still >> trying to understand) >> grep -i netlink -r iproute2/ >> iproute2/bridge/mdb.c:#include "libnetlink.h" >> iproute2/bridge/vlan.c:#include "libnetlink.h" >> ... >> .. > > The general notion would be to replace the current ioctl-based > ethtool API with one that is based on netlink, like many of the other > networking APIs. I'm fine with the general idea of that, but so far > I haven't heard a strong case for why it is necessary. It certainly > doesn't seem urgent to me -- if someone disagrees, then please explain! > And probably obvious but you can't remove the ioctl based ethtool interface because we have a lot of software using it so you will have to maintain both in parallel if there is a netlink equivalent. Also most the stuff in ethtool tends to be things you set once at init time or for debugging so the benefit of notifiers and such are minimal. Sure if I was doing it from scratch netlink would be better and there are probably parts of it that are worth converting over where notifier hooks and standardizing would be beneficial. I think Roopa for example was coming up with some more general statistics mechanism. My $.02 anyways. JohnF
Re: ethtool TODO list - additional info
From: "John W. Linville" Date: Tue, 12 Jul 2016 14:12:52 -0400 > I haven't heard a strong case for why it is necessary. Stats on large scale setups/systems is problematic and ethtool is a part of that problem. Extensibility in general suffers because of the ioctl() interface as well. For getting ethtool settings or stats, we have no clean filtering mechanism. I could go on and on... I think before any new features are added to ethtool, we move over to the netlink based interface and lock the existing ioctl() API in stone.