date:20160712

Re: [PATCH v6 RESEND] r8152: Add support for setting pass through MAC address on RTL8153-AD

2016-07-12 Thread Michał Pecio

> From: Mario Limonciello 
> Date: Mon, 11 Jul 2016 19:58:04 -0500
> 
> > The RTL8153-AD supports a persistent system specific MAC address.
> > This means a device plugged into two different systems with host
> > side support will show different (but persistent) MAC addresses.
> > 
> > This information for the system's persistent MAC address is burned
> > in when the system HW is built and available under \_SB.AMAC in the
> > DSDT at runtime.
> > 
> > This technology is currently implemented in the Dell TB15 and WD15
> > Type-C docks.  More information is available here:
> > http://www.dell.com/support/article/us/en/04/SLN301147
> > 
> > Signed-off-by: Mario Limonciello   
> 
> Applied.

Hi,

Isn't it possible that the same ACPI name could be used for other
vendor-specific extensions on other laptops and that r8152 will now
wreak havoc there?

Regards,
MP

[PATCH 1/1] bonding: restrict the data protected by rcu_read_lock

2016-07-12 Thread Zhu Yanjun

In this function, origin is a pointer to struct aggregator.
No matter what agg is changed to, it has nothing to do with origin.

CC: Jay Vosburgh 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/bonding/bond_3ad.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index edc70ff..20afee3 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -1624,8 +1624,8 @@ static void ad_agg_selection_logic(struct aggregator *agg,
struct slave *slave;
struct port *port;
 
-   rcu_read_lock();
origin = agg;
+   rcu_read_lock();
active = __get_active_agg(agg);
best = (active && agg_device_up(active)) ? active : NULL;
 
-- 
1.7.9.5

[PATCH v8 05/11] Add sample for adding simple drop program to link

2016-07-12 Thread Brenden Blanco

Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX
hook of a link. With the drop-only program, observed single core rate is
~20Mpps.

Other tests were run, for instance without the dropcnt increment or
without reading from the packet header, the packet rate was mostly
unchanged.

$ perf record -a samples/bpf/xdp1 $(
---
 samples/bpf/Makefile|   4 ++
 samples/bpf/bpf_load.c  |   8 +++
 samples/bpf/xdp1_kern.c |  93 +
 samples/bpf/xdp1_user.c | 181 
 4 files changed, 286 insertions(+)
 create mode 100644 samples/bpf/xdp1_kern.c
 create mode 100644 samples/bpf/xdp1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a98b780..0e4ab3a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -21,6 +21,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += xdp1
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -42,6 +43,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -64,6 +66,7 @@ always += test_overhead_tp_kern.o
 always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
+always += xdp1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -84,6 +87,7 @@ HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
+HOSTLOADLIBES_xdp1 += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 022af71..0cfda23 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -50,6 +50,7 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+   bool is_xdp = strncmp(event, "xdp", 3) == 0;
enum bpf_prog_type prog_type;
char buf[256];
int fd, efd, err, id;
@@ -66,6 +67,8 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
prog_type = BPF_PROG_TYPE_KPROBE;
} else if (is_tracepoint) {
prog_type = BPF_PROG_TYPE_TRACEPOINT;
+   } else if (is_xdp) {
+   prog_type = BPF_PROG_TYPE_XDP;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -79,6 +82,9 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
 
prog_fd[prog_cnt++] = fd;
 
+   if (is_xdp)
+   return 0;
+
if (is_socket) {
event += 6;
if (*event != '/')
@@ -319,6 +325,7 @@ int load_bpf_file(char *path)
if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
memcmp(shname_prog, "kretprobe/", 10) == 0 ||
memcmp(shname_prog, "tracepoint/", 11) == 0 ||
+   memcmp(shname_prog, "xdp", 3) == 0 ||
memcmp(shname_prog, "socket", 6) == 0)
load_and_attach(shname_prog, insns, 
data_prog->d_size);
}
@@ -336,6 +343,7 @@ int load_bpf_file(char *path)
if (memcmp(shname, "kprobe/", 7) == 0 ||
memcmp(shname, "kretprobe/", 10) == 0 ||
memcmp(shname, "tracepoint/", 11) == 0 ||
+   memcmp(shname, "xdp", 3) == 0 ||
memcmp(shname, "socket", 6) == 0)
load_and_attach(shname, data->d_buf, data->d_size);
}
diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c
new file mode 100644
index 000..e7dd8ac
--- /dev/null
+++ b/samples/bpf/xdp1_kern.c
@@ -0,0 +1,93 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(long),
+   .max_entries = 256,
+};
+
+st

[PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support

2016-07-12 Thread Brenden Blanco

The mlx4 driver by default allocates order-3 pages for the ring to
consume in multiple fragments. When the device has an xdp program, this
behavior will prevent tx actions since the page must be re-mapped in
TODEVICE mode, which cannot be done if the page is still shared.

Start by making the allocator configurable based on whether xdp is
running, such that order-0 pages are always used and never shared.

Since this will stress the page allocator, add a simple page cache to
each rx ring. Pages in the cache are left dma-mapped, and in drop-only
stress tests the page allocator is eliminated from the perf report.

Note that setting an xdp program will now require the rings to be
reconfigured.

Before:
 26.91%  ksoftirqd/0  [mlx4_en] [k] mlx4_en_process_rx_cq
 17.88%  ksoftirqd/0  [mlx4_en] [k] mlx4_en_alloc_frags
  6.00%  ksoftirqd/0  [mlx4_en] [k] mlx4_en_free_frag
  4.49%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
  3.21%  swapper  [kernel.vmlinux]  [k] intel_idle
  2.73%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.57%  swapper  [mlx4_en] [k] mlx4_en_process_rx_cq

After:
 31.72%  swapper  [kernel.vmlinux]   [k] intel_idle
  8.79%  swapper  [mlx4_en]  [k] mlx4_en_process_rx_cq
  7.54%  swapper  [kernel.vmlinux]   [k] poll_idle
  6.36%  swapper  [mlx4_core][k] mlx4_eq_int
  4.21%  swapper  [kernel.vmlinux]   [k] tasklet_action
  4.03%  swapper  [kernel.vmlinux]   [k] cpuidle_enter_state
  3.43%  swapper  [mlx4_en]  [k] mlx4_en_prepare_rx_desc
  2.18%  swapper  [kernel.vmlinux]   [k] native_irq_return_iret
  1.37%  swapper  [kernel.vmlinux]   [k] menu_select
  1.09%  swapper  [kernel.vmlinux]   [k] bpf_map_lookup_elem

Signed-off-by: Brenden Blanco 
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  | 46 +++--
 drivers/net/ethernet/mellanox/mlx4/en_rx.c  | 69 ++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h| 12 -
 4 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 51a2e82..d3d51fa 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -47,7 +47,7 @@
 #define EN_ETHTOOL_SHORT_MASK cpu_to_be16(0x)
 #define EN_ETHTOOL_WORD_MASK  cpu_to_be32(0x)
 
-static int mlx4_en_moderation_update(struct mlx4_en_priv *priv)
+int mlx4_en_moderation_update(struct mlx4_en_priv *priv)
 {
int i;
int err = 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 369a2ef..252c893d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2532,21 +2532,59 @@ static int mlx4_en_set_tx_maxrate(struct net_device 
*dev, int queue_index, u32 m
 static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 {
struct mlx4_en_priv *priv = netdev_priv(dev);
+   struct mlx4_en_dev *mdev = priv->mdev;
struct bpf_prog *old_prog;
+   int port_up = 0;
+   int err;
+
+   /* No need to reconfigure buffers when simply swapping the
+* program for a new one.
+*/
+   if (READ_ONCE(priv->prog) && prog) {
+   /* This xchg is paired with READ_ONCE in the fast path, but is
+* also protected from itself via rtnl lock
+*/
+   old_prog = xchg(&priv->prog, prog);
+   if (old_prog)
+   bpf_prog_put(old_prog);
+   return 0;
+   }
 
if (priv->num_frags > 1) {
en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
return -EOPNOTSUPP;
}
 
-   /* This xchg is paired with READ_ONCE in the fast path, but is
-* also protected from itself via rtnl lock
-*/
+   mutex_lock(&mdev->state_lock);
+   if (priv->port_up) {
+   port_up = 1;
+   mlx4_en_stop_port(dev, 1);
+   }
+
+   mlx4_en_free_resources(priv);
+
old_prog = xchg(&priv->prog, prog);
if (old_prog)
bpf_prog_put(old_prog);
 
-   return 0;
+   err = mlx4_en_alloc_resources(priv);
+   if (err) {
+   en_err(priv, "Failed reallocating port resources\n");
+   goto out;
+   }
+   if (port_up) {
+   err = mlx4_en_start_port(dev);
+   if (err)
+   en_err(priv, "Failed starting port\n");
+   }
+
+   err = mlx4_en_moderation_update(priv);
+
+out:
+   if (err)
+   priv->prog = NULL;
+   mutex_unlock(&mdev->state_lock);
+   return err;
 }
 
 static bool mlx4_xdp_attached(struct net_device *dev)
diff

[PATCH v8 02/11] net: add ndo to setup/query xdp prog in adapter rx

2016-07-12 Thread Brenden Blanco

Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP
filter. The single op is used for both setup/query of the xdp program,
modelled after ndo_setup_tc.

Signed-off-by: Brenden Blanco 
---
 include/linux/netdevice.h | 34 ++
 net/core/dev.c| 33 +
 2 files changed, 67 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 49736a3..fab9a1c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -63,6 +63,7 @@ struct wpan_dev;
 struct mpls_dev;
 /* UDP Tunnel offloads */
 struct udp_tunnel_info;
+struct bpf_prog;
 
 void netdev_set_default_ethtool_ops(struct net_device *dev,
const struct ethtool_ops *ops);
@@ -799,6 +800,33 @@ struct tc_to_netdev {
};
 };
 
+/* These structures hold the attributes of xdp state that are being passed
+ * to the netdevice through the xdp op.
+ */
+enum xdp_netdev_command {
+   /* Set or clear a bpf program used in the earliest stages of packet
+* rx. The prog will have been loaded as BPF_PROG_TYPE_XDP. The callee
+* is responsible for calling bpf_prog_put on any old progs that are
+* stored. In case of error, the callee need not release the new prog
+* reference, but on success it takes ownership and must bpf_prog_put
+* when it is no longer used.
+*/
+   XDP_SETUP_PROG,
+   /* Check if a bpf program is set on the device.  The callee should
+* return true if a program is currently attached and running.
+*/
+   XDP_QUERY_PROG,
+};
+
+struct netdev_xdp {
+   enum xdp_netdev_command command;
+   union {
+   /* XDP_SETUP_PROG */
+   struct bpf_prog *prog;
+   /* XDP_QUERY_PROG */
+   bool prog_attached;
+   };
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -1087,6 +1115,9 @@ struct tc_to_netdev {
  * appropriate rx headroom value allows avoiding skb head copy on
  * forward. Setting a negative value resets the rx headroom to the
  * default value.
+ * int (*ndo_xdp)(struct net_device *dev, struct netdev_xdp *xdp);
+ * This function is used to set or query state related to XDP on the
+ * netdevice. See definition of enum xdp_netdev_command for details.
  *
  */
 struct net_device_ops {
@@ -1271,6 +1302,8 @@ struct net_device_ops {
   struct sk_buff *skb);
void(*ndo_set_rx_headroom)(struct net_device *dev,
   int needed_headroom);
+   int (*ndo_xdp)(struct net_device *dev,
+  struct netdev_xdp *xdp);
 };
 
 /**
@@ -3257,6 +3290,7 @@ int dev_get_phys_port_id(struct net_device *dev,
 int dev_get_phys_port_name(struct net_device *dev,
   char *name, size_t len);
 int dev_change_proto_down(struct net_device *dev, bool proto_down);
+int dev_change_xdp_fd(struct net_device *dev, int fd);
 struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device 
*dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device 
*dev,
struct netdev_queue *txq, int *ret);
diff --git a/net/core/dev.c b/net/core/dev.c
index 7894e40..2a9c39f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -94,6 +94,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -6615,6 +6616,38 @@ int dev_change_proto_down(struct net_device *dev, bool 
proto_down)
 EXPORT_SYMBOL(dev_change_proto_down);
 
 /**
+ * dev_change_xdp_fd - set or clear a bpf program for a device rx path
+ * @dev: device
+ * @fd: new program fd or negative value to clear
+ *
+ * Set or clear a bpf program for a device
+ */
+int dev_change_xdp_fd(struct net_device *dev, int fd)
+{
+   const struct net_device_ops *ops = dev->netdev_ops;
+   struct bpf_prog *prog = NULL;
+   struct netdev_xdp xdp = {};
+   int err;
+
+   if (!ops->ndo_xdp)
+   return -EOPNOTSUPP;
+   if (fd >= 0) {
+   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+   }
+
+   xdp.command = XDP_SETUP_PROG;
+   xdp.prog = prog;
+   err = ops->ndo_xdp(dev, &xdp);
+   if (err < 0 && prog)
+   bpf_prog_put(prog);
+
+   return err;
+}
+EXPORT_SYMBOL(dev_change_xdp_fd);
+
+/**
  * dev_new_index   -   allocate an ifindex
  * @net: the applicable net namespace
  *
-- 
2.8.2

[PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding

2016-07-12 Thread Brenden Blanco

This patch set introduces new infrastructure for programmatically
processing packets in the earliest stages of rx, as part of an effort
others are calling eXpress Data Path (XDP) [1]. Start this effort by
introducing a new bpf program type for early packet filtering, before
even an skb has been allocated.

Extend on this with the ability to modify packet data and send back out
on the same port.

Patch 1 introduces the new prog type and helpers for validating the bpf
  program. A new userspace struct is defined containing only data and
  data_end as fields, with others to follow in the future.
In patch 2, create a new ndo to pass the fd to supported drivers.
In patch 3, expose a new rtnl option to userspace.
In patch 4, enable support in mlx4 driver.
In patch 5, create a sample drop and count program. With single core,
  achieved ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes
  packet data access, bpf array lookup, and increment.
In patch 6, add a page recycle facility to mlx4 rx, enabled when xdp is
  active.
In patch 7, add the XDP_TX type to bpf.h
In patch 8, add helper in tx patch for writing tx_desc
In patch 9, add support in mlx4 for packet data write and forwarding
In patch 10, turn on packet write support in the bpf verifier
In patch 11, add a sample program for packet write and forwarding. With
  single core, achieved ~10 Mpps rewrite and forwarding.

[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

v8:
 1/11: Reduce WARN_ONCE to single line. Also, change act param of that
   function to u32 to match return type of bpf_prog_run_xdp.
 2/11: Clarify locking semantics in ndo comment.
 4/11: Add en_err warning in mlx4_xdp_set on num_frags/mtu violation.

v7:
 Addressing two of the major discussion points: return codes and ndo.
 The rest will be taken as todo items for separate patches.

 Add an XDP_ABORTED type, which explicitly falls through to DROP. The
 same result must be taken for the default case as well, as it is now
 well-defined API behavior.

 Merge ndo_xdp_* into a single ndo. The style is similar to
 ndo_setup_tc, but with less unidirectional naming convention. The IFLA
 parameter names are unchanged.

 TODOs:
 Add ethtool per-ring stats for aborted, default cases, maybe even drop
 and tx as well.
 Avoid duplicate dma sync operation in XDP_PASS case as mentioned by
 Saeed.

  1/12: Add XDP_ABORTED enum, reword API comment, and update commit
   message.
  2/12: Rewrite ndo_xdp_*() into single ndo_xdp() with type/union style
calling convention.
  3/12: Switch to ndo_xdp callback.
  4/12: Add XDP_ABORTED case as a fall-through to XDP_DROP. Implement
ndo_xdp.
 12/12: Dropped, this will need some more work.

v6:
  2/12: drop unnecessary netif_device_present check
  4/12, 6/12, 9/12: Reorder default case statement above drop case to
remove some copy/paste.

v5:
  0/12: Rebase and remove previous 1/13 patch
  1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed
in future. Add bpf_warn_invalid_xdp_action() helper, to be used when
out of bounds action is returned by the program. Add a comment to
bpf.h denoting the undefined nature of out of bounds returns.
  2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to
ndo_xdp_attached().
  3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy
for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED.
  4/12: Fixup the use of READ_ONCE in the ndos. Add a user of
bpf_warn_invalid_xdp_action helper.
  5/12: Adjust to using the nested netlink options.
  6/12: kbuild was complaining about overflow of u16 on tile
architecture...bump frag_stride to u32. The page_offset member that
is computed from this was already u32.

v4:
  2/12: Add inline helper for calling xdp bpf prog under rcu
  3/12: Add detail to ndo comments
  5/12: Remove mlx4_call_xdp and use inline helper instead.
  6/12: Fix checkpatch complaints
  9/12: Introduce new patch 9/12 with common helper for tx_desc write
Refactor to use common tx_desc write helper
 11/12: Fix checkpatch complaints

v3:
  Rewrite from v2 trying to incorporate feedback from multiple sources.
  Specifically, add ability to forward packets out the same port and
allow packet modification.
  For packet forwarding, the driver reserves a dedicated set of tx rings
for exclusive use by xdp. Upon completion, the pages on this ring are
recycled directly back to a small per-rx-ring page cache without
being dma unmapped.
  Use of the percpu skb is dropped in favor of a lightweight struct
xdp_buff. The direct packet access feature is leveraged to remove
dependence on the skb.
  The mlx4 driver implementation allocates a page-per-packet and maps it
in PCI_DMA_BIDIRECTIONAL mode when the bpf program is activated.
  Naming is converted to use "xdp" instead of "phys_dev".

v2:
  1/5: Drop xdp from types, instead consistently use bpf_phys_dev_
Introduce enum for return values from phy

[PATCH v8 11/11] bpf: add sample for xdp forwarding and rewrite

2016-07-12 Thread Brenden Blanco

Add a sample that rewrites and forwards packets out on the same
interface. Observed single core forwarding performance of ~10Mpps.

Since the mlx4 driver under test recycles every single packet page, the
perf output shows almost exclusively just the ring management and bpf
program work. Slowdowns are likely occurring due to cache misses.

Signed-off-by: Brenden Blanco 
---
 samples/bpf/Makefile|   5 +++
 samples/bpf/xdp2_kern.c | 114 
 2 files changed, 119 insertions(+)
 create mode 100644 samples/bpf/xdp2_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 0e4ab3a..d2d2b35 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
+hostprogs-y += xdp2
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -44,6 +45,8 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
+# reuse xdp1 source intentionally
+xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -67,6 +70,7 @@ always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
+always += xdp2_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -88,6 +92,7 @@ HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
+HOSTLOADLIBES_xdp2 += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
new file mode 100644
index 000..38fe7e1
--- /dev/null
+++ b/samples/bpf/xdp2_kern.c
@@ -0,0 +1,114 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(long),
+   .max_entries = 256,
+};
+
+static void swap_src_dst_mac(void *data)
+{
+   unsigned short *p = data;
+   unsigned short dst[3];
+
+   dst[0] = p[0];
+   dst[1] = p[1];
+   dst[2] = p[2];
+   p[0] = p[3];
+   p[1] = p[4];
+   p[2] = p[5];
+   p[3] = dst[0];
+   p[4] = dst[1];
+   p[5] = dst[2];
+}
+
+static int parse_ipv4(void *data, u64 nh_off, void *data_end)
+{
+   struct iphdr *iph = data + nh_off;
+
+   if (iph + 1 > data_end)
+   return 0;
+   return iph->protocol;
+}
+
+static int parse_ipv6(void *data, u64 nh_off, void *data_end)
+{
+   struct ipv6hdr *ip6h = data + nh_off;
+
+   if (ip6h + 1 > data_end)
+   return 0;
+   return ip6h->nexthdr;
+}
+
+SEC("xdp1")
+int xdp_prog1(struct xdp_md *ctx)
+{
+   void *data_end = (void *)(long)ctx->data_end;
+   void *data = (void *)(long)ctx->data;
+   struct ethhdr *eth = data;
+   int rc = XDP_DROP;
+   long *value;
+   u16 h_proto;
+   u64 nh_off;
+   u32 index;
+
+   nh_off = sizeof(*eth);
+   if (data + nh_off > data_end)
+   return rc;
+
+   h_proto = eth->h_proto;
+
+   if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+   struct vlan_hdr *vhdr;
+
+   vhdr = data + nh_off;
+   nh_off += sizeof(struct vlan_hdr);
+   if (data + nh_off > data_end)
+   return rc;
+   h_proto = vhdr->h_vlan_encapsulated_proto;
+   }
+   if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+   struct vlan_hdr *vhdr;
+
+   vhdr = data + nh_off;
+   nh_off += sizeof(struct vlan_hdr);
+   if (data + nh_off > data_end)
+   return rc;
+   h_proto = vhdr->h_vlan_encapsulated_proto;
+   }
+
+   if (h_proto == htons(ETH_P_IP))
+   index = parse_ipv4(data, nh_off, data_end);
+   else if (h_proto == htons(ETH_P_IPV6))
+   index = parse_ipv6(data, nh_off, data_end);
+   else
+   index = 0;
+
+   value = bpf_map_lookup_elem(&dropcnt, &index);
+   if (value)
+   *value += 1;
+
+   if (index == 17) {
+   swap_src_dst_mac(data);
+

[PATCH v8 07/11] bpf: add XDP_TX xdp_action for direct forwarding

2016-07-12 Thread Brenden Blanco

XDP enabled drivers must transmit received packets back out on the same
port they were received on when a program returns this action.

Signed-off-by: Brenden Blanco 
---
 include/uapi/linux/bpf.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4282d44..a8f1ea1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -447,6 +447,7 @@ enum xdp_action {
XDP_ABORTED = 0,
XDP_DROP,
XDP_PASS,
+   XDP_TX,
 };
 
 /* user accessible metadata for XDP packet hook
-- 
2.8.2

[PATCH v8 08/11] net/mlx4_en: break out tx_desc write into separate function

2016-07-12 Thread Brenden Blanco

In preparation for writing the tx descriptor from multiple functions,
create a helper for both normal and blueflame access.

Signed-off-by: Brenden Blanco 
---
 drivers/infiniband/hw/mlx4/qp.c|  11 +--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 127 +
 include/linux/mlx4/qp.h|  18 ++--
 3 files changed, 90 insertions(+), 66 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 8db8405..768085f 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -232,7 +232,7 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, 
int size)
}
} else {
ctrl = buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1));
-   s = (ctrl->fence_size & 0x3f) << 4;
+   s = (ctrl->qpn_vlan.fence_size & 0x3f) << 4;
for (i = 64; i < s; i += 64) {
wqe = buf + i;
*wqe = cpu_to_be32(0x);
@@ -264,7 +264,7 @@ static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int 
size)
inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof 
*inl));
}
ctrl->srcrb_flags = 0;
-   ctrl->fence_size = size / 16;
+   ctrl->qpn_vlan.fence_size = size / 16;
/*
 * Make sure descriptor is fully written before setting ownership bit
 * (because HW can start executing as soon as we do).
@@ -1992,7 +1992,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
ctrl = get_send_wqe(qp, i);
ctrl->owner_opcode = cpu_to_be32(1 << 31);
if (qp->sq_max_wqes_per_wr == 1)
-   ctrl->fence_size = 1 << (qp->sq.wqe_shift - 4);
+   ctrl->qpn_vlan.fence_size =
+   1 << (qp->sq.wqe_shift - 4);
 
stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift);
}
@@ -3169,8 +3170,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr,
wmb();
*lso_wqe = lso_hdr_sz;
 
-   ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ?
-   MLX4_WQE_CTRL_FENCE : 0) | size;
+   ctrl->qpn_vlan.fence_size = (wr->send_flags & IB_SEND_FENCE ?
+MLX4_WQE_CTRL_FENCE : 0) | size;
 
/*
 * Make sure descriptor is fully written before
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 76aa4d2..c29191e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -700,10 +700,66 @@ static void mlx4_bf_copy(void __iomem *dst, const void 
*src,
__iowrite64_copy(dst, src, bytecnt / 8);
 }
 
+void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
+{
+   wmb();
+   /* Since there is no iowrite*_native() that writes the
+* value as is, without byteswapping - using the one
+* the doesn't do byteswapping in the relevant arch
+* endianness.
+*/
+#if defined(__LITTLE_ENDIAN)
+   iowrite32(
+#else
+   iowrite32be(
+#endif
+ ring->doorbell_qpn,
+ ring->bf.uar->map + MLX4_SEND_DOORBELL);
+}
+
+static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
+ struct mlx4_en_tx_desc *tx_desc,
+ union mlx4_wqe_qpn_vlan qpn_vlan,
+ int desc_size, int bf_index,
+ __be32 op_own, bool bf_ok,
+ bool send_doorbell)
+{
+   tx_desc->ctrl.qpn_vlan = qpn_vlan;
+
+   if (bf_ok) {
+   op_own |= htonl((bf_index & 0x) << 8);
+   /* Ensure new descriptor hits memory
+* before setting ownership of this descriptor to HW
+*/
+   dma_wmb();
+   tx_desc->ctrl.owner_opcode = op_own;
+
+   wmb();
+
+   mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
+desc_size);
+
+   wmb();
+
+   ring->bf.offset ^= ring->bf.buf_size;
+   } else {
+   /* Ensure new descriptor hits memory
+* before setting ownership of this descriptor to HW
+*/
+   dma_wmb();
+   tx_desc->ctrl.owner_opcode = op_own;
+   if (send_doorbell)
+   mlx4_en_xmit_doorbell(ring);
+   else
+   ring->xmit_more++;
+   }
+}
+
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct skb_shared_info *shinfo = skb_shinfo(skb);
struct mlx4_en_priv *priv = netdev_priv(dev);
+   union mlx4_wqe_qpn_vlan

[PATCH v8 03/11] rtnl: add option for setting link xdp prog

2016-07-12 Thread Brenden Blanco

Sets the bpf program represented by fd as an early filter in the rx path
of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP.
Providing a negative value as fd clears the program. Getting the fd back
via rtnl is not possible, therefore reading of this value merely
provides a bool whether the program is valid on the link or not.

Signed-off-by: Brenden Blanco 
---
 include/uapi/linux/if_link.h | 12 +
 net/core/rtnetlink.c | 64 
 2 files changed, 76 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4285ac3..a1b5202 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -156,6 +156,7 @@ enum {
IFLA_GSO_MAX_SEGS,
IFLA_GSO_MAX_SIZE,
IFLA_PAD,
+   IFLA_XDP,
__IFLA_MAX
 };
 
@@ -843,4 +844,15 @@ enum {
 };
 #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1)
 
+/* XDP section */
+
+enum {
+   IFLA_XDP_UNSPEC,
+   IFLA_XDP_FD,
+   IFLA_XDP_ATTACHED,
+   __IFLA_XDP_MAX,
+};
+
+#define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1)
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a9e3805..eba2b82 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -891,6 +891,16 @@ static size_t rtnl_port_size(const struct net_device *dev,
return port_self_size;
 }
 
+static size_t rtnl_xdp_size(const struct net_device *dev)
+{
+   size_t xdp_size = nla_total_size(1);/* XDP_ATTACHED */
+
+   if (!dev->netdev_ops->ndo_xdp)
+   return 0;
+   else
+   return xdp_size;
+}
+
 static noinline size_t if_nlmsg_size(const struct net_device *dev,
 u32 ext_filter_mask)
 {
@@ -927,6 +937,7 @@ static noinline size_t if_nlmsg_size(const struct 
net_device *dev,
   + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
   + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
   + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */
+  + rtnl_xdp_size(dev) /* IFLA_XDP */
   + nla_total_size(1); /* IFLA_PROTO_DOWN */
 
 }
@@ -1211,6 +1222,33 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, 
struct net_device *dev)
return 0;
 }
 
+static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
+{
+   struct netdev_xdp xdp_op = {};
+   struct nlattr *xdp;
+   int err;
+
+   if (!dev->netdev_ops->ndo_xdp)
+   return 0;
+   xdp = nla_nest_start(skb, IFLA_XDP);
+   if (!xdp)
+   return -EMSGSIZE;
+   xdp_op.command = XDP_QUERY_PROG;
+   err = dev->netdev_ops->ndo_xdp(dev, &xdp_op);
+   if (err)
+   goto err_cancel;
+   err = nla_put_u8(skb, IFLA_XDP_ATTACHED, xdp_op.prog_attached);
+   if (err)
+   goto err_cancel;
+
+   nla_nest_end(skb, xdp);
+   return 0;
+
+err_cancel:
+   nla_nest_cancel(skb, xdp);
+   return err;
+}
+
 static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
int type, u32 pid, u32 seq, u32 change,
unsigned int flags, u32 ext_filter_mask)
@@ -1307,6 +1345,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct 
net_device *dev,
if (rtnl_port_fill(skb, dev, ext_filter_mask))
goto nla_put_failure;
 
+   if (rtnl_xdp_fill(skb, dev))
+   goto nla_put_failure;
+
if (dev->rtnl_link_ops || rtnl_have_link_slave_info(dev)) {
if (rtnl_link_fill(skb, dev) < 0)
goto nla_put_failure;
@@ -1392,6 +1433,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_PHYS_SWITCH_ID]   = { .type = NLA_BINARY, .len = 
MAX_PHYS_ITEM_ID_LEN },
[IFLA_LINK_NETNSID] = { .type = NLA_S32 },
[IFLA_PROTO_DOWN]   = { .type = NLA_U8 },
+   [IFLA_XDP]  = { .type = NLA_NESTED },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1429,6 +1471,11 @@ static const struct nla_policy 
ifla_port_policy[IFLA_PORT_MAX+1] = {
[IFLA_PORT_RESPONSE]= { .type = NLA_U16, },
 };
 
+static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = {
+   [IFLA_XDP_FD]   = { .type = NLA_S32 },
+   [IFLA_XDP_ATTACHED] = { .type = NLA_U8 },
+};
+
 static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr 
*nla)
 {
const struct rtnl_link_ops *ops = NULL;
@@ -2054,6 +2101,23 @@ static int do_setlink(const struct sk_buff *skb,
status |= DO_SETLINK_NOTIFY;
}
 
+   if (tb[IFLA_XDP]) {
+   struct nlattr *xdp[IFLA_XDP_MAX + 1];
+
+   err = nla_parse_nested(xdp, IFLA_XDP_MAX, tb[IFLA_XDP],
+  ifla_xdp_policy);
+   if (err < 0)
+   goto

[PATCH v8 10/11] bpf: enable direct packet data write for xdp progs

2016-07-12 Thread Brenden Blanco

For forwarding to be effective, XDP programs should be allowed to
rewrite packet data.

This requires that the drivers supporting XDP must all map the packet
memory as TODEVICE or BIDIRECTIONAL before invoking the program.

Signed-off-by: Brenden Blanco 
---
 kernel/bpf/verifier.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a8d67d0..f72f23b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -653,6 +653,16 @@ static int check_map_access(struct verifier_env *env, u32 
regno, int off,
 
 #define MAX_PACKET_OFF 0x
 
+static bool may_write_pkt_data(enum bpf_prog_type type)
+{
+   switch (type) {
+   case BPF_PROG_TYPE_XDP:
+   return true;
+   default:
+   return false;
+   }
+}
+
 static int check_packet_access(struct verifier_env *env, u32 regno, int off,
   int size)
 {
@@ -806,10 +816,15 @@ static int check_mem_access(struct verifier_env *env, u32 
regno, int off,
err = check_stack_read(state, off, size, value_regno);
}
} else if (state->regs[regno].type == PTR_TO_PACKET) {
-   if (t == BPF_WRITE) {
+   if (t == BPF_WRITE && !may_write_pkt_data(env->prog->type)) {
verbose("cannot write into packet\n");
return -EACCES;
}
+   if (t == BPF_WRITE && value_regno >= 0 &&
+   is_pointer_value(env, value_regno)) {
+   verbose("R%d leaks addr into packet\n", value_regno);
+   return -EACCES;
+   }
err = check_packet_access(env, regno, off, size);
if (!err && t == BPF_READ && value_regno >= 0)
mark_reg_unknown_value(state->regs, value_regno);
-- 
2.8.2

[PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program

2016-07-12 Thread Brenden Blanco

Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.

In tc/socket bpf programs, helpers linearize skb fragments as needed
when the program touches the packet data. However, in the pursuit of
speed, XDP programs will not be allowed to use these slower functions,
especially if it involves allocating an skb.

Therefore, disallow MTU settings that would produce a multi-fragment
packet that XDP programs would fail to access. Future enhancements could
be done to increase the allowable MTU.

Signed-off-by: Brenden Blanco 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 51 ++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 37 +--
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 +++
 3 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 6083775..369a2ef 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@
  *
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -2084,6 +2085,9 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_TS)
mlx4_en_remove_timestamp(mdev);
 
+   if (priv->prog)
+   bpf_prog_put(priv->prog);
+
/* Detach the netdev so tasks would not attempt to access it */
mutex_lock(&mdev->state_lock);
mdev->pndev[priv->port] = NULL;
@@ -2112,6 +2116,11 @@ static int mlx4_en_change_mtu(struct net_device *dev, 
int new_mtu)
en_err(priv, "Bad MTU size:%d.\n", new_mtu);
return -EPERM;
}
+   if (priv->prog && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
+   en_err(priv, "MTU size:%d requires frags but XDP prog running",
+  new_mtu);
+   return -EOPNOTSUPP;
+   }
dev->mtu = new_mtu;
 
if (netif_running(dev)) {
@@ -2520,6 +2529,46 @@ static int mlx4_en_set_tx_maxrate(struct net_device 
*dev, int queue_index, u32 m
return err;
 }
 
+static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
+{
+   struct mlx4_en_priv *priv = netdev_priv(dev);
+   struct bpf_prog *old_prog;
+
+   if (priv->num_frags > 1) {
+   en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
+   return -EOPNOTSUPP;
+   }
+
+   /* This xchg is paired with READ_ONCE in the fast path, but is
+* also protected from itself via rtnl lock
+*/
+   old_prog = xchg(&priv->prog, prog);
+   if (old_prog)
+   bpf_prog_put(old_prog);
+
+   return 0;
+}
+
+static bool mlx4_xdp_attached(struct net_device *dev)
+{
+   struct mlx4_en_priv *priv = netdev_priv(dev);
+
+   return !!READ_ONCE(priv->prog);
+}
+
+static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+   switch (xdp->command) {
+   case XDP_SETUP_PROG:
+   return mlx4_xdp_set(dev, xdp->prog);
+   case XDP_QUERY_PROG:
+   xdp->prog_attached = mlx4_xdp_attached(dev);
+   return 0;
+   default:
+   return -EINVAL;
+   }
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
.ndo_open   = mlx4_en_open,
.ndo_stop   = mlx4_en_close,
@@ -2548,6 +2597,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
.ndo_udp_tunnel_del = mlx4_en_del_vxlan_port,
.ndo_features_check = mlx4_en_features_check,
.ndo_set_tx_maxrate = mlx4_en_set_tx_maxrate,
+   .ndo_xdp= mlx4_xdp,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2584,6 +2634,7 @@ static const struct net_device_ops mlx4_netdev_ops_master 
= {
.ndo_udp_tunnel_del = mlx4_en_del_vxlan_port,
.ndo_features_check = mlx4_en_features_check,
.ndo_set_tx_maxrate = mlx4_en_set_tx_maxrate,
+   .ndo_xdp= mlx4_xdp,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index c1b3a9c..adfa123 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -743,6 +743,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
struct mlx4_en_rx_alloc *frags;
struct mlx4_en_rx_desc *rx_desc;
+   struct bpf_prog *prog;
struct sk_buff *skb;
int index;
int nr;
@@ -759,6 +760,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
if (budget <= 0)
return polled;
 
+   prog = READ_ONCE(priv->prog);
+
/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 * descriptor offset can be deduced from the CQE

[PATCH v8 01/11] bpf: add XDP prog type for early driver filter

2016-07-12 Thread Brenden Blanco

Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_md, is exposed to userspace. So far only
expose the packet start and end pointers, and only in read mode.

An XDP program must return one of the well known enum values, all other
return codes are reserved for future use. Unfortunately, this
restriction is hard to enforce at verification time, so take the
approach of warning at runtime when such programs are encountered. Out
of bounds return codes should alias to XDP_ABORTED.

Signed-off-by: Brenden Blanco 
---
 include/linux/filter.h   | 18 +++
 include/uapi/linux/bpf.h | 20 
 kernel/bpf/verifier.c|  1 +
 net/core/filter.c| 79 
 4 files changed, 118 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6fc31ef..15d816a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -368,6 +368,11 @@ struct bpf_skb_data_end {
void *data_end;
 };
 
+struct xdp_buff {
+   void *data;
+   void *data_end;
+};
+
 /* compute the linear packet data range [data, data_end) which
  * will be accessed by cls_bpf and act_bpf programs
  */
@@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct 
bpf_prog *prog,
return BPF_PROG_RUN(prog, skb);
 }
 
+static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
+  struct xdp_buff *xdp)
+{
+   u32 ret;
+
+   rcu_read_lock();
+   ret = BPF_PROG_RUN(prog, (void *)xdp);
+   rcu_read_unlock();
+
+   return ret;
+}
+
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
return max(sizeof(struct bpf_prog),
@@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
+void bpf_warn_invalid_xdp_action(u32 act);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 262a7e8..4282d44 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SCHED_CLS,
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
+   BPF_PROG_TYPE_XDP,
 };
 
 #define BPF_PSEUDO_MAP_FD  1
@@ -437,4 +438,23 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+/* User return codes for XDP prog type.
+ * A valid XDP program must return one of these defined values. All other
+ * return codes are reserved for future use. Unknown return codes will result
+ * in packet drop.
+ */
+enum xdp_action {
+   XDP_ABORTED = 0,
+   XDP_DROP,
+   XDP_PASS,
+};
+
+/* user accessible metadata for XDP packet hook
+ * new fields must be added to the end of this structure
+ */
+struct xdp_md {
+   __u32 data;
+   __u32 data_end;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e206c21..a8d67d0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, 
struct reg_state *reg,
switch (env->prog->type) {
case BPF_PROG_TYPE_SCHED_CLS:
case BPF_PROG_TYPE_SCHED_ACT:
+   case BPF_PROG_TYPE_XDP:
break;
default:
verbose("verifier is misconfigured\n");
diff --git a/net/core/filter.c b/net/core/filter.c
index 10c4a2f..2d770f5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
}
 }
 
+static const struct bpf_func_proto *
+xdp_func_proto(enum bpf_func_id func_id)
+{
+   return sk_filter_func_proto(func_id);
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2436,6 +2442,44 @@ static bool tc_cls_act_is_valid_access(int off, int size,
return __is_valid_access(off, size, type);
 }
 
+static bool __is_valid_xdp_access(int off, int size,
+ enum bpf_access_type type)
+{
+   if (off < 0 || off >= sizeof(struct xdp_md))
+   return false;
+   if (off % size != 0)
+   return false;
+   if (size != 4)
+   return false;
+
+   return true;
+}
+
+static bool xdp_is_valid_access(int off, int size,
+   enum bpf_access_type type,
+   enum bpf_reg_type *reg_type)
+{
+   if (type == BPF_WRITE)
+   return false;
+
+   switch (off) {
+   case offsetof(struct xdp_md, data):
+   *reg_type = PTR_TO_PACKET;
+   break;
+   case offsetof(struct xdp_md, data_end):
+   *reg_type = PTR_TO_PACKET_END;

[PATCH v8 09/11] net/mlx4_en: add xdp forwarding and data write support

2016-07-12 Thread Brenden Blanco

A user will now be able to loop packets back out of the same port using
a bpf program attached to xdp hook. Updates to the packet contents from
the bpf program is also supported.

For the packet write feature to work, the rx buffers are now mapped as
bidirectional when the page is allocated. This occurs only when the xdp
hook is active.

When the program returns a TX action, enqueue the packet directly to a
dedicated tx ring, so as to avoid completely any locking. This requires
the tx ring to be allocated 1:1 for each rx ring, as well as the tx
completion running in the same softirq.

Upon tx completion, this dedicated tx ring recycles pages without
unmapping directly back to the original rx ring. In steady state tx/drop
workload, effectively 0 page allocs/frees will occur.

Signed-off-by: Brenden Blanco 
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  15 ++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |  19 +++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c  |  14 +++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c  | 126 +++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h|  14 ++-
 5 files changed, 181 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index d3d51fa..10642b1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -1694,6 +1694,11 @@ static int mlx4_en_set_rxnfc(struct net_device *dev, 
struct ethtool_rxnfc *cmd)
return err;
 }
 
+static int mlx4_en_max_tx_channels(struct mlx4_en_priv *priv)
+{
+   return (MAX_TX_RINGS - priv->rsv_tx_rings) / MLX4_EN_NUM_UP;
+}
+
 static void mlx4_en_get_channels(struct net_device *dev,
 struct ethtool_channels *channel)
 {
@@ -1705,7 +1710,7 @@ static void mlx4_en_get_channels(struct net_device *dev,
channel->max_tx = MLX4_EN_MAX_TX_RING_P_UP;
 
channel->rx_count = priv->rx_ring_num;
-   channel->tx_count = priv->tx_ring_num / MLX4_EN_NUM_UP;
+   channel->tx_count = priv->num_tx_rings_p_up;
 }
 
 static int mlx4_en_set_channels(struct net_device *dev,
@@ -1717,7 +1722,7 @@ static int mlx4_en_set_channels(struct net_device *dev,
int err = 0;
 
if (channel->other_count || channel->combined_count ||
-   channel->tx_count > MLX4_EN_MAX_TX_RING_P_UP ||
+   channel->tx_count > mlx4_en_max_tx_channels(priv) ||
channel->rx_count > MAX_RX_RINGS ||
!channel->tx_count || !channel->rx_count)
return -EINVAL;
@@ -1731,7 +1736,8 @@ static int mlx4_en_set_channels(struct net_device *dev,
mlx4_en_free_resources(priv);
 
priv->num_tx_rings_p_up = channel->tx_count;
-   priv->tx_ring_num = channel->tx_count * MLX4_EN_NUM_UP;
+   priv->tx_ring_num = channel->tx_count * MLX4_EN_NUM_UP +
+   priv->rsv_tx_rings;
priv->rx_ring_num = channel->rx_count;
 
err = mlx4_en_alloc_resources(priv);
@@ -1740,7 +1746,8 @@ static int mlx4_en_set_channels(struct net_device *dev,
goto out;
}
 
-   netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
+   netif_set_real_num_tx_queues(dev, priv->tx_ring_num -
+   priv->rsv_tx_rings);
netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
 
if (dev->num_tc)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 252c893d..50fbc8f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1636,7 +1636,7 @@ int mlx4_en_start_port(struct net_device *dev)
/* Configure ring */
tx_ring = priv->tx_ring[i];
err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn,
-   i / priv->num_tx_rings_p_up);
+   i / (priv->tx_ring_num / MLX4_EN_NUM_UP));
if (err) {
en_err(priv, "Failed allocating Tx ring\n");
mlx4_en_deactivate_cq(priv, cq);
@@ -2022,6 +2022,16 @@ int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
goto err;
}
 
+   /* When rsv_tx_rings is non-zero, each rx ring will have a
+* corresponding tx ring, with the tx completion event for that ring
+* recycling buffers into the cache.
+*/
+   for (i = 0; i < priv->rsv_tx_rings; i++) {
+   int j = (priv->tx_ring_num - priv->rsv_tx_rings) + i;
+
+   priv->tx_ring[j]->recycle_ring = priv->rx_ring[i];
+   }
+
 #ifdef CONFIG_RFS_ACCEL
priv->dev->rx_cpu_rmap = mlx4_get_cpu_rmap(priv->mdev->dev, priv->port);
 #endif
@@ -2534,9 +2544,12 @@ static int mlx4_xdp_set(struct net_device *dev, struct 
bpf_prog *prog)
struct mlx4_e

[iproute PATCH v2] ip-address.8: Document autojoin flag

2016-07-12 Thread Phil Sutter

Description copied from related kernel support commit message with a
little tailoring to fit.

While at it, fix font of non-terminal CONFFLAG-LIST in synopsis.

Signed-off-by: Phil Sutter 
---
Changes since v1:
- Set commands/flags in bold font instead of quotes.
- Make capitalization consistent for 'Ethernet' and 'IGMP'.
- Set 'VXLAN' and 'Openvswitch' in upper case.
- Fix typo: then -> them.
- Misc grammar and language fixup.
---
 man/man8/ip-address.8.in | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/man/man8/ip-address.8.in b/man/man8/ip-address.8.in
index 7cb9271731339..7d6eb9b2b1915 100644
--- a/man/man8/ip-address.8.in
+++ b/man/man8/ip-address.8.in
@@ -77,14 +77,15 @@ ip-address \- protocol address management
 .IR FLAG " := "
 .RB "[ " permanent " | " dynamic " | " secondary " | " primary " |"
 .RB [ - ] tentative " | [" - ] deprecated " | [" - ] dadfailed " |"
-.BR temporary " | " CONFFLAG-LIST " ]"
+.BR temporary " |"
+.IR CONFFLAG-LIST " ]"
 
 .ti -8
 .IR CONFFLAG-LIST " := [ "  CONFFLAG-LIST " ] " CONFFLAG
 
 .ti -8
 .IR CONFFLAG " := "
-.RB "[ " home " | " mngtmpaddr " | " nodad " | " noprefixroute " ]"
+.RB "[ " home " | " mngtmpaddr " | " nodad " | " noprefixroute " | " autojoin 
" ]"
 
 .ti -8
 .IR LIFETIME " := [ "
@@ -249,6 +250,26 @@ address, and don't search for one to delete when removing 
the address. Changing
 an address to add this flag will remove the automatically added prefix route,
 changing it to remove this flag will create the prefix route automatically.
 
+.TP
+.B autojoin
+Joining multicast groups on Ethernet level via
+.B "ip maddr"
+command does not work if connected to an Ethernet switch that does IGMP
+snooping since the switch would not replicate multicast packets on ports that
+did not have IGMP reports for the multicast addresses.
+
+Linux VXLAN interfaces created via
+.B "ip link add vxlan"
+have the
+.B group
+option that enables them to do the required join.
+
+Using the
+.B autojoin
+flag when adding a multicast address enables similar functionality for
+Openvswitch VXLAN interfaces as well as other tunneling mechanisms that need to
+receive multicast traffic.
+
 .SS ip address delete - delete protocol address
 .B Arguments:
 coincide with the arguments of
-- 
2.8.2

Re: [PATCH] iwlwifi: add missing type declaration

2016-07-12 Thread Luca Coelho

On Mon, 2016-07-11 at 22:49 +0200, Arnd Bergmann wrote:
> The iwl-debug.h header relies in implicit inclusion of linux/device.h
> and
> we get a lot of warnings without that:
> 
> drivers/net/wireless/intel/iwlwifi/iwl-debug.h:44:23: error: 'struct
> device' declared inside parameter list will not be visible outside of
> this definition or declaration [-Werror]
>  void __iwl_err(struct device *dev, bool rfkill_prefix, bool
> only_trace,
>    ^~
> In file included from drivers/net/wireless/intel/iwlwifi/iwl-eeprom-
> read.h:66:0,
>  from drivers/net/wireless/intel/iwlwifi/iwl-eeprom-
> read.c:68:
> drivers/net/wireless/intel/iwlwifi/iwl-trans.h: In function
> 'iwl_trans_tx':
> drivers/net/wireless/intel/iwlwifi/iwl-trans.h:1030:348: error:
> passing argument 1 of '__iwl_err' from incompatible pointer type [-
> Werror=incompatible-pointer-types]
>    IWL_ERR(trans, "%s bad state = %d\n", __func__, trans->state);
>  
>  
>  
>  
>  
>    ^
> In file included from drivers/net/wireless/intel/iwlwifi/iwl-eeprom-
> read.c:67:0:
> drivers/net/wireless/intel/iwlwifi/iwl-debug.h:44:6: note: expected
> 'struct device *' but argument is of type 'struct device *'
>  void __iwl_err(struct device *dev, bool rfkill_prefix, bool
> only_trace,
>   ^
> 
> The easiest workaround is to just declare 'struct device' before its
> first use,
> rather than including the entire header file.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: 21cb3222fe56 ("iwlwifi: decouple PCIe transport from
> mac80211")
> ---

Acked-by: Luca Coelho 

Agree with Kalle that he will take this directly to wireless-drivers-
next.

--
Cheers,
Luca.

[iproute PATCH] ip-macsec.8: fix typo in 'examples' section

2016-07-12 Thread Davide Caratti

fix wrong 'device' keyword in 'ip link add device eth0'

Signed-off-by: Davide Caratti 
---
 man/man8/ip-macsec.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/ip-macsec.8 b/man/man8/ip-macsec.8
index e8455d7..03ed700 100644
--- a/man/man8/ip-macsec.8
+++ b/man/man8/ip-macsec.8
@@ -74,7 +74,7 @@ type.
 .PP
 .SS Create a MACsec device on link eth0
 .nf
-# ip link add device eth0 macsec0 type macsec port 11 encrypt on
+# ip link add link eth0 macsec0 type macsec port 11 encrypt on
 .PP
 .SS Configure a secure association on that device
 .nf
-- 
2.5.5

Re: [PATCH 2/3] chcr: Support for Chelsio's Crypto Hardware

2016-07-12 Thread Herbert Xu

On Mon, Jul 11, 2016 at 11:28:07AM -0700, Yeshaswi M R Gowda wrote:
>
> + u_ctx = ULD_CTX(ctx);
> + if (cxgb4_is_crypto_q_full(u_ctx->lldi.ports[0], ctx->tx_channel_id))
> + return -EBUSY;

You cannot just return -EBUSY.  If the request has the MAY_BACKLOG
bit set, it must be queued regardless, but you should return -EBUSY
in order to throttle the user and then call the completion function
with -EINPROGRESS once the queue can accept more requests from the
user.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [patch net-next 1/2] devlink: add hardware messages tracing facility

2016-07-12 Thread Jiri Pirko

Mon, Jul 11, 2016 at 06:08:14PM CEST, rost...@goodmis.org wrote:
>On Mon, 11 Jul 2016 15:18:47 +0200
>Jiri Pirko  wrote:
>
>> From: Jiri Pirko 
>> 
>> Define a tracepoint and allow user to trace messages going to and from
>> hardware associated with devlink instance.
>> 
>> Signed-off-by: Jiri Pirko 
>> ---
>>  include/net/devlink.h  |  8 +++
>>  include/trace/events/devlink.h | 49 
>> ++
>>  net/core/devlink.c |  9 
>>  3 files changed, 66 insertions(+)
>>  create mode 100644 include/trace/events/devlink.h
>> 
>> diff --git a/include/net/devlink.h b/include/net/devlink.h
>> index c99ffe8..865ade6 100644
>> --- a/include/net/devlink.h
>> +++ b/include/net/devlink.h
>> @@ -115,6 +115,8 @@ struct devlink *devlink_alloc(const struct devlink_ops 
>> *ops, size_t priv_size);
>>  int devlink_register(struct devlink *devlink, struct device *dev);
>>  void devlink_unregister(struct devlink *devlink);
>>  void devlink_free(struct devlink *devlink);
>> +void devlink_trace_hwmsg(const struct devlink *devlink, bool incoming,
>> + unsigned long type, const u8 *buf, size_t len);
>>  int devlink_port_register(struct devlink *devlink,
>>struct devlink_port *devlink_port,
>>unsigned int port_index);
>> @@ -154,6 +156,12 @@ static inline void devlink_free(struct devlink *devlink)
>>  kfree(devlink);
>>  }
>>  
>> +static inline void devlink_trace_hwmsg(const struct devlink *devlink,
>> +   bool incoming, unsigned long type,
>> +   const u8 *buf, size_t len);
>> +{
>> +}
>> +
>>  static inline int devlink_port_register(struct devlink *devlink,
>>  struct devlink_port *devlink_port,
>>  unsigned int port_index)
>> diff --git a/include/trace/events/devlink.h b/include/trace/events/devlink.h
>> new file mode 100644
>> index 000..7918c57
>> --- /dev/null
>> +++ b/include/trace/events/devlink.h
>> @@ -0,0 +1,49 @@
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM devlink
>> +
>> +#if !defined(_TRACE_DEVLINK_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_DEVLINK_H
>> +
>> +#include 
>> +#include 
>> +#include 
>> +
>> +/*
>> + * Tracepoint for devlink hardware message:
>> + */
>> +TRACE_EVENT(devlink_hwmsg,
>> +TP_PROTO(const struct devlink *devlink, bool incoming,
>> + unsigned long type, const u8 *buf, size_t len),
>> +
>> +TP_ARGS(devlink, incoming, type, buf, len),
>> +
>> +TP_STRUCT__entry(
>> +__string(bus_name, devlink->dev->bus->name)
>> +__string(dev_name, dev_name(devlink->dev))
>> +__string(owner_name, devlink->dev->driver->owner->name)
>> +__field(bool, incoming)
>> +__field(unsigned long, type)
>> +__dynamic_array(u8, buf, len)
>> +__field(size_t, len)
>> +),
>> +
>> +TP_fast_assign(
>> +__assign_str(bus_name, devlink->dev->bus->name);
>> +__assign_str(dev_name, dev_name(devlink->dev));
>> +__assign_str(owner_name, devlink->dev->driver->owner->name);
>> +__entry->incoming = incoming;
>> +__entry->type = type;
>> +memcpy(__get_dynamic_array(buf), buf, len);
>> +__entry->len = len;
>> +),
>> +
>> +TP_printk("bus_name=%s dev_name=%s owner_name=%s incoming=%d type=%lu 
>> buf=0x[%*phD] len=%lu",
>> +  __get_str(bus_name), __get_str(dev_name),
>> +  __get_str(owner_name), __entry->incoming, __entry->type,
>> +  (int) __entry->len, __get_dynamic_array(buf), __entry->len)
>> +);
>> +
>> +#endif /* _TRACE_DEVLINK_H */
>> +
>> +/* This part must be outside protection */
>> +#include 
>> diff --git a/net/core/devlink.c b/net/core/devlink.c
>> index b2e592a..8cfa3b0 100644
>> --- a/net/core/devlink.c
>> +++ b/net/core/devlink.c
>> @@ -26,6 +26,8 @@
>>  #include 
>>  #include 
>>  #include 
>> +#define CREATE_TRACE_POINTS
>> +#include 
>>  
>>  static LIST_HEAD(devlink_list);
>>  
>> @@ -1679,6 +1681,13 @@ void devlink_free(struct devlink *devlink)
>>  }
>>  EXPORT_SYMBOL_GPL(devlink_free);
>>  
>> +void devlink_trace_hwmsg(const struct devlink *devlink, bool incoming,
>> + unsigned long type, const u8 *buf, size_t len)
>> +{
>> +trace_devlink_hwmsg(devlink, incoming, type, buf, len);
>> +}
>> +EXPORT_SYMBOL_GPL(devlink_trace_hwmsg);
>> +
>
>Instead of having a function that always gets called even when tracing
>isn't enabled, why not have the caller call the trace_devlink_hwmsg()
>directly?
>
>In the trace/devlink.h file you could encapsulate it with:
>
>#if IS_ENABLED(CONFIG_NET_DEVLINK)
>
>[...]
>
>#else
>static inline void trace_devlink_hwmsg(const struct devlink *devlink,
>  bool incoming, unsigned long type,
>

Re: [PATCH 3/3] crypto: Added Chelsio Menu to the Kconfig file

2016-07-12 Thread Herbert Xu

On Tue, Jul 12, 2016 at 03:30:41AM +0800, kbuild test robot wrote:
> Hi,
> 
> [auto build test WARNING on net-next/master]
> [also build test WARNING on v4.7-rc7 next-20160711]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]

Yeshaswi, please fix these warnings/errors even though they're
compile-only.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc

2016-07-12 Thread Mason

On 05/07/2016 17:28, Florian Fainelli wrote:

> Le 05/07/2016 07:50, Mason wrote:
> 
>> On 05/07/2016 15:33, Mason wrote:
>> 
>>> I was testing suspend/resume sequences where the suspend operation
>>> fails and returns without having suspended the platform.

Please forget I ever mentioned suspend, that was a red herring.
The warning is displayed even if I never suspend.
(Dropping linux-pm from this discussion.)

>> I rebooted, tried the same operation, and hit the same warning
>> still 310 seconds later:

However, the 310 seconds time span still seems to be relevant.

Steps to reproduce: I booted the system, logged in as root,
mounted an NFS file system, then left the system idling at
the prompt.

(I don't remember seeing this warning in v4.1 and v4.4)

What's going wrong here? Is it related to NFS?

Here is the defconfig I'm using
http://pastebin.ubuntu.com/19160299/

Regards.


[  317.940133] [ cut here ]
[  317.944815] WARNING: CPU: 1 PID: 0 at net/ipv4/af_inet.c:155 
inet_sock_destruct+0x1c4/0x1dc
[  317.953223] Modules linked in:
[  317.956305] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
4.7.0-rc6-00010-gd07031bdc433-dirty #2
[  317.964784] Hardware name: Sigma Tango DT
[  317.968809] Backtrace: 
[  317.971279] [] (dump_backtrace) from [] 
(show_stack+0x18/0x1c)
[  317.978884]  r7:6113 r6:c080ea84 r5: r4:c080ea84
[  317.984590] [] (show_stack) from [] 
(dump_stack+0x80/0x94)
[  317.991856] [] (dump_stack) from [] (__warn+0xec/0x104)
[  317.998849]  r7:0009 r6:c05e3fc8 r5: r4:
[  318.004549] [] (__warn) from [] 
(warn_slowpath_null+0x28/0x30)
[  318.012154]  r9:dfbea4e0 r8:000a r7:df45fe30 r6:dec19594 r5:df68f144 
r4:df68f040
[  318.019954] [] (warn_slowpath_null) from [] 
(inet_sock_destruct+0x1c4/0x1dc)
[  318.028788] [] (inet_sock_destruct) from [] 
(__sk_destruct+0x28/0xe0)
[  318.037005]  r7:df45fe30 r6:dec19594 r5:df68f040 r4:df68f1ec
[  318.042710] [] (__sk_destruct) from [] 
(rcu_process_callbacks+0x488/0x59c)
[  318.051363]  r5: r4:
[  318.054962] [] (rcu_process_callbacks) from [] 
(__do_softirq+0x138/0x264)
[  318.063527]  r10:c08020a0 r9:4001 r8:0101 r7:df45e000 r6:c08020a4 
r5:0009
[  318.071408]  r4:
[  318.073953] [] (__do_softirq) from [] 
(irq_exit+0xc8/0x104)
[  318.081296]  r10:df45ff58 r9:df402400 r8:0001 r7: r6:0013 
r5:
[  318.089176]  r4:c0735428
[  318.091723] [] (irq_exit) from [] 
(__handle_domain_irq+0x88/0xf4)
[  318.099595] [] (__handle_domain_irq) from [] 
(gic_handle_irq+0x50/0x94)
[  318.107986]  r10: r9:e0803100 r8:e0802100 r7:df45ff58 r6:e080210c 
r5:c080277c
[  318.115865]  r4:c080eca0 r3:df45ff58
[  318.119461] [] (gic_handle_irq) from [] 
(__irq_svc+0x54/0x90)
[  318.126980] Exception stack(0xdf45ff58 to 0xdf45ffa0)
[  318.132053] ff40:   
0001 
[  318.140273] ff60: ab80 c0117c80 df45e000 c08024f8 c0802494 c081e2d6 
c05b9550 413fc090
[  318.148492] ff80:  df45ffb4 df45ffb8 df45ffa8 c01086b0 c01086b4 
6013 
[  318.156709]  r9:413fc090 r8:c05b9550 r7:df45ff8c r6: r5:6013 
r4:c01086b4
[  318.164512] [] (arch_cpu_idle) from [] 
(default_idle_call+0x28/0x34)
[  318.172646] [] (default_idle_call) from [] 
(cpu_startup_entry+0x128/0x17c)
[  318.181305] [] (cpu_startup_entry) from [] 
(secondary_start_kernel+0x158/0x164)
[  318.190395]  r7:c081e7c8 r4:c080b4f0
[  318.193993] [] (secondary_start_kernel) from [<8010158c>] 
(0x8010158c)
[  318.201423]  r5:0051 r4:9f44006a
[  318.205024] ---[ end trace 6e04001434b19cb9 ]---


Just to be sure, I performed the same steps a second time:

[  316.238527] [ cut here ]
[  316.243210] WARNING: CPU: 1 PID: 0 at net/ipv4/af_inet.c:155 
inet_sock_destruct+0x1c4/0x1dc
[  316.251619] Modules linked in:
[  316.254702] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
4.7.0-rc6-00010-gd07031bdc433-dirty #2
[  316.263182] Hardware name: Sigma Tango DT
[  316.267206] Backtrace: 
[  316.269675] [] (dump_backtrace) from [] 
(show_stack+0x18/0x1c)
[  316.277280]  r7:6113 r6:c080ea84 r5: r4:c080ea84
[  316.282986] [] (show_stack) from [] 
(dump_stack+0x80/0x94)
[  316.290254] [] (dump_stack) from [] (__warn+0xec/0x104)
[  316.297247]  r7:0009 r6:c05e3fc8 r5: r4:
[  316.302947] [] (__warn) from [] 
(warn_slowpath_null+0x28/0x30)
[  316.310552]  r9:dfbea4e0 r8:000a r7:df45fe30 r6:dec15694 r5:df6063c4 
r4:df6062c0
[  316.318354] [] (warn_slowpath_null) from [] 
(inet_sock_destruct+0x1c4/0x1dc)
[  316.327190] [] (inet_sock_destruct) from [] 
(__sk_destruct+0x28/0xe0)
[  316.335406]  r7:df45fe30 r6:dec15694 r5:df6062c0 r4:df60646c
[  316.341112] [] (__sk_destruct) from [] 
(rcu_process_callbacks+0x488/0x59c)
[  316.349765]  r5: r4:
[  316.353363] [] (rcu_process_callbacks) from [] 
(__do_softirq+0x138/0x264)
[  316.361929]  r10:c08020a0 r9:4001 r8:010

RE: [PATCH v2 0/8] thunderbolt: Introducing Thunderbolt(TM) networking

2016-07-12 Thread Levy, Amir (Jer)

On Wed, Jun 29 2016, 11:35 AM, Levy, Amir (Jer) wrote:
> This is version 2 of Thunderbolt(TM) driver for non-Apple hardware.
> 
> Changes since v1:
>  - Separation to 2 modules.
>  - Moved ICM specific registers definition to ICM header file.
>  - Added new Thunderbolt device IDs.
>  - Renamed the Thunderbolt networking documentation.
>  - General cleanups
> 
> These patches were pushed to GitHub where they can be reviewed more
> comfortably with green/red highlighting:
>   https://github.com/01org/thunderbolt-software-kernel-tree
> 
> Daemon code:
>   https://github.com/01org/thunderbolt-software-daemon
> 
> For reference, here's a link to version 1:
> [v1]: http://www.spinics.net/lists/linux-pci/msg51287.html
> 
> Amir Levy (8):
>   thunderbolt: Macro rename
>   thunderbolt: Updating device IDs
>   thunderbolt: Updating the register definitions
>   thunderbolt: Kconfig for Thunderbolt(TM) networking
>   thunderbolt: Communication with the ICM (firmware)
>   thunderbolt: Networking state machine
>   thunderbolt: Networking transmit and receive
>   thunderbolt: Networking doc
> 
>  Documentation/00-INDEX   |2 +
>  Documentation/thunderbolt-networking.txt |  135 ++
>  drivers/thunderbolt/Kconfig  |   25 +-
>  drivers/thunderbolt/Makefile |4 +-
>  drivers/thunderbolt/icm_nhi.c| 1631 +
>  drivers/thunderbolt/icm_nhi.h|   84 ++
>  drivers/thunderbolt/net.c| 2277
> ++
>  drivers/thunderbolt/net.h|  274 
>  drivers/thunderbolt/nhi_regs.h   |  115 +-
>  include/linux/pci_ids.h  |   44 +-
>  10 files changed, 4565 insertions(+), 26 deletions(-)  create mode 100644
> Documentation/thunderbolt-networking.txt
>  create mode 100644 drivers/thunderbolt/icm_nhi.c  create mode 100644
> drivers/thunderbolt/icm_nhi.h  create mode 100644
> drivers/thunderbolt/net.c  create mode 100644 drivers/thunderbolt/net.h
> 
> --
> 2.7.4

Hi All,
Do you have any comments or suggestions for this patch set?

Regards,
Amir

[PATCH -next] stmmac: dwmac-socfpga: fix wrong pointer passed to PTR_ERR()

2016-07-12 Thread weiyj_lk

From: Wei Yongjun 

PTR_ERR should access the value just tested by IS_ERR, otherwise
the wrong error code will be returned.

Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
index 4bee2f9..3bc1fa2 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
@@ -214,7 +214,7 @@ static int socfpga_dwmac_parse_data(struct socfpga_dwmac 
*dwmac, struct device *
dev_err(dev,
"%s: ERROR: failed mapping tse control 
port\n",
__func__);
-   return PTR_ERR(dwmac->pcs.sgmii_adapter_base);
+   return PTR_ERR(dwmac->pcs.tse_pcs_base);
}
}
}

[PATCH -next] rxrpc: Fix error handling in af_rxrpc_init()

2016-07-12 Thread weiyj_lk

From: Wei Yongjun 

security initialized after alloc workqueue, so we should exit security
before destroy workqueue in the error handing.

Fixes: 648af7fca159 ("rxrpc: Absorb the rxkad security module")
Signed-off-by: Wei Yongjun 
---
 net/rxrpc/af_rxrpc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index d6e4e3b..88effad 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -766,9 +766,9 @@ error_key_type:
 error_sock:
proto_unregister(&rxrpc_proto);
 error_proto:
-   destroy_workqueue(rxrpc_workqueue);
-error_security:
rxrpc_exit_security();
+error_security:
+   destroy_workqueue(rxrpc_workqueue);
 error_work_queue:
kmem_cache_destroy(rxrpc_call_jar);
 error_call_jar:

[PATCH -next] net: mediatek: fix non static symbol warnings

2016-07-12 Thread weiyj_lk

From: Wei Yongjun 

Fixes the following sparse warnings:

drivers/net/ethernet/mediatek/mtk_eth_soc.c:79:5: warning:
 symbol '_mtk_mdio_write' was not declared. Should it be static?
drivers/net/ethernet/mediatek/mtk_eth_soc.c:98:5: warning:
 symbol '_mtk_mdio_read' was not declared. Should it be static?

Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 682797e..6f597eb 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -76,8 +76,8 @@ static int mtk_mdio_busy_wait(struct mtk_eth *eth)
return -1;
 }
 
-u32 _mtk_mdio_write(struct mtk_eth *eth, u32 phy_addr,
-   u32 phy_register, u32 write_data)
+static u32 _mtk_mdio_write(struct mtk_eth *eth, u32 phy_addr,
+  u32 phy_register, u32 write_data)
 {
if (mtk_mdio_busy_wait(eth))
return -1;
@@ -95,7 +95,7 @@ u32 _mtk_mdio_write(struct mtk_eth *eth, u32 phy_addr,
return 0;
 }
 
-u32 _mtk_mdio_read(struct mtk_eth *eth, int phy_addr, int phy_reg)
+static u32 _mtk_mdio_read(struct mtk_eth *eth, int phy_addr, int phy_reg)
 {
u32 d;

[PATCH -next] dwc_eth_qos: fix missing clk_disable_unprepare() on error in dwceqos_probe()

2016-07-12 Thread weiyj_lk

From: Wei Yongjun 

Fix missing clk_disable_unprepare() call before return
from dwceqos_probe() in the error handling case of invalid
fixed-link.

Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/synopsys/dwc_eth_qos.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/synopsys/dwc_eth_qos.c 
b/drivers/net/ethernet/synopsys/dwc_eth_qos.c
index c34111b..fc1ea80 100644
--- a/drivers/net/ethernet/synopsys/dwc_eth_qos.c
+++ b/drivers/net/ethernet/synopsys/dwc_eth_qos.c
@@ -2877,7 +2877,7 @@ static int dwceqos_probe(struct platform_device *pdev)
ret = of_phy_register_fixed_link(lp->pdev->dev.of_node);
if (ret < 0) {
dev_err(&pdev->dev, "invalid fixed-link");
-   goto err_out_unregister_netdev;
+   goto err_out_unregister_clk_notifier;
}
 
lp->phy_node = of_node_get(lp->pdev->dev.of_node);

Re: WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc

2016-07-12 Thread Mason

On 12/07/2016 11:53, Mason wrote:

> However, the 310 seconds time span still seems to be relevant.
> 
> Steps to reproduce: I booted the system, logged in as root,
> mounted an NFS file system, then left the system idling at
> the prompt.
> 
> (I don't remember seeing this warning in v4.1 and v4.4)
> 
> What's going wrong here? Is it related to NFS?
> 
> Here is the defconfig I'm using
> http://pastebin.ubuntu.com/19160299/
> 
> 
> [  317.940133] [ cut here ]
> [  317.944815] WARNING: CPU: 1 PID: 0 at net/ipv4/af_inet.c:155 
> inet_sock_destruct+0x1c4/0x1dc
> [  317.953223] Modules linked in:
> [  317.956305] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
> 4.7.0-rc6-00010-gd07031bdc433-dirty #2
> [  317.964784] Hardware name: Sigma Tango DT
> [  317.968809] Backtrace: 
> [  317.971279] [] (dump_backtrace) from [] 
> (show_stack+0x18/0x1c)
> [  317.978884]  r7:6113 r6:c080ea84 r5: r4:c080ea84
> [  317.984590] [] (show_stack) from [] 
> (dump_stack+0x80/0x94)
> [  317.991856] [] (dump_stack) from [] (__warn+0xec/0x104)
> [  317.998849]  r7:0009 r6:c05e3fc8 r5: r4:
> [  318.004549] [] (__warn) from [] 
> (warn_slowpath_null+0x28/0x30)
> [  318.012154]  r9:dfbea4e0 r8:000a r7:df45fe30 r6:dec19594 r5:df68f144 
> r4:df68f040
> [  318.019954] [] (warn_slowpath_null) from [] 
> (inet_sock_destruct+0x1c4/0x1dc)
> [  318.028788] [] (inet_sock_destruct) from [] 
> (__sk_destruct+0x28/0xe0)
> [  318.037005]  r7:df45fe30 r6:dec19594 r5:df68f040 r4:df68f1ec
> [  318.042710] [] (__sk_destruct) from [] 
> (rcu_process_callbacks+0x488/0x59c)
> [  318.051363]  r5: r4:
> [  318.054962] [] (rcu_process_callbacks) from [] 
> (__do_softirq+0x138/0x264)
> [  318.063527]  r10:c08020a0 r9:4001 r8:0101 r7:df45e000 r6:c08020a4 
> r5:0009
> [  318.071408]  r4:
> [  318.073953] [] (__do_softirq) from [] 
> (irq_exit+0xc8/0x104)
> [  318.081296]  r10:df45ff58 r9:df402400 r8:0001 r7: r6:0013 
> r5:
> [  318.089176]  r4:c0735428
> [  318.091723] [] (irq_exit) from [] 
> (__handle_domain_irq+0x88/0xf4)
> [  318.099595] [] (__handle_domain_irq) from [] 
> (gic_handle_irq+0x50/0x94)
> [  318.107986]  r10: r9:e0803100 r8:e0802100 r7:df45ff58 r6:e080210c 
> r5:c080277c
> [  318.115865]  r4:c080eca0 r3:df45ff58
> [  318.119461] [] (gic_handle_irq) from [] 
> (__irq_svc+0x54/0x90)
> [  318.126980] Exception stack(0xdf45ff58 to 0xdf45ffa0)
> [  318.132053] ff40:   
> 0001 
> [  318.140273] ff60: ab80 c0117c80 df45e000 c08024f8 c0802494 c081e2d6 
> c05b9550 413fc090
> [  318.148492] ff80:  df45ffb4 df45ffb8 df45ffa8 c01086b0 c01086b4 
> 6013 
> [  318.156709]  r9:413fc090 r8:c05b9550 r7:df45ff8c r6: r5:6013 
> r4:c01086b4
> [  318.164512] [] (arch_cpu_idle) from [] 
> (default_idle_call+0x28/0x34)
> [  318.172646] [] (default_idle_call) from [] 
> (cpu_startup_entry+0x128/0x17c)
> [  318.181305] [] (cpu_startup_entry) from [] 
> (secondary_start_kernel+0x158/0x164)
> [  318.190395]  r7:c081e7c8 r4:c080b4f0
> [  318.193993] [] (secondary_start_kernel) from [<8010158c>] 
> (0x8010158c)
> [  318.201423]  r5:0051 r4:9f44006a
> [  318.205024] ---[ end trace 6e04001434b19cb9 ]---
> 
> 
> Just to be sure, I performed the same steps a second time:
> 
> [  316.238527] [ cut here ]
> [  316.243210] WARNING: CPU: 1 PID: 0 at net/ipv4/af_inet.c:155 
> inet_sock_destruct+0x1c4/0x1dc
> [  316.251619] Modules linked in:
> [  316.254702] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
> 4.7.0-rc6-00010-gd07031bdc433-dirty #2
> [  316.263182] Hardware name: Sigma Tango DT
> [  316.267206] Backtrace: 
> [  316.269675] [] (dump_backtrace) from [] 
> (show_stack+0x18/0x1c)
> [  316.277280]  r7:6113 r6:c080ea84 r5: r4:c080ea84
> [  316.282986] [] (show_stack) from [] 
> (dump_stack+0x80/0x94)
> [  316.290254] [] (dump_stack) from [] (__warn+0xec/0x104)
> [  316.297247]  r7:0009 r6:c05e3fc8 r5: r4:
> [  316.302947] [] (__warn) from [] 
> (warn_slowpath_null+0x28/0x30)
> [  316.310552]  r9:dfbea4e0 r8:000a r7:df45fe30 r6:dec15694 r5:df6063c4 
> r4:df6062c0
> [  316.318354] [] (warn_slowpath_null) from [] 
> (inet_sock_destruct+0x1c4/0x1dc)
> [  316.327190] [] (inet_sock_destruct) from [] 
> (__sk_destruct+0x28/0xe0)
> [  316.335406]  r7:df45fe30 r6:dec15694 r5:df6062c0 r4:df60646c
> [  316.341112] [] (__sk_destruct) from [] 
> (rcu_process_callbacks+0x488/0x59c)
> [  316.349765]  r5: r4:
> [  316.353363] [] (rcu_process_callbacks) from [] 
> (__do_softirq+0x138/0x264)
> [  316.361929]  r10:c08020a0 r9:4001 r8:0101 r7:df45e000 r6:c08020a4 
> r5:0009
> [  316.369811]  r4:
> [  316.372356] [] (__do_softirq) from [] 
> (irq_exit+0xc8/0x104)
> [  316.379699]  r10:df45ff58 r9:df402400 r8:0001 r7: r6:0013 
> r5:
> [  316.387579]  r4:c073

Re: iwlwifi: add missing type declaration

2016-07-12 Thread Kalle Valo

Arnd Bergmann  wrote:
> The iwl-debug.h header relies in implicit inclusion of linux/device.h and
> we get a lot of warnings without that:
> 
> drivers/net/wireless/intel/iwlwifi/iwl-debug.h:44:23: error: 'struct device' 
> declared inside parameter list will not be visible outside of this definition 
> or declaration [-Werror]
>  void __iwl_err(struct device *dev, bool rfkill_prefix, bool only_trace,
>^~
> In file included from 
> drivers/net/wireless/intel/iwlwifi/iwl-eeprom-read.h:66:0,
>  from drivers/net/wireless/intel/iwlwifi/iwl-eeprom-read.c:68:
> drivers/net/wireless/intel/iwlwifi/iwl-trans.h: In function 'iwl_trans_tx':
> drivers/net/wireless/intel/iwlwifi/iwl-trans.h:1030:348: error: passing 
> argument 1 of '__iwl_err' from incompatible pointer type 
> [-Werror=incompatible-pointer-types]
>IWL_ERR(trans, "%s bad state = %d\n", __func__, trans->state);
>   
>   
>   
>   
> ^
> In file included from 
> drivers/net/wireless/intel/iwlwifi/iwl-eeprom-read.c:67:0:
> drivers/net/wireless/intel/iwlwifi/iwl-debug.h:44:6: note: expected 'struct 
> device *' but argument is of type 'struct device *'
>  void __iwl_err(struct device *dev, bool rfkill_prefix, bool only_trace,
>   ^
> 
> The easiest workaround is to just declare 'struct device' before its first 
> use,
> rather than including the entire header file.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: 21cb3222fe56 ("iwlwifi: decouple PCIe transport from mac80211")
> Acked-by: Luca Coelho 

Thanks, 1 patch applied to wireless-drivers-next.git:

25f700ef0653 iwlwifi: add missing type declaration

-- 
Sent by pwcli
https://patchwork.kernel.org/patch/9224105/

RE: [PATCH v8 08/11] net/mlx4_en: break out tx_desc write into separate function

2016-07-12 Thread Tariq Toukan

>In preparation for writing the tx descriptor from multiple functions, create a 
>helper for both normal and blueflame access.
>
>Signed-off-by: Brenden Blanco 

Reviewed-by: Tariq Toukan

Re: [PATCH net v2 2/2] net: ethoc: Correctly pad short packets

2016-07-12 Thread Max Filippov

Hi Florian,

On Mon, Jul 11, 2016 at 04:35:55PM -0700, Florian Fainelli wrote:
> Even though the hardware can be doing zero padding, we want the SKB to
> be going out on the wire with the appropriate size. This fixes packet
> truncations observed with e.g: ARP packets.
> 
> Signed-off-by: Florian Fainelli 
> ---
>  drivers/net/ethernet/ethoc.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c
> index 06ae14a8e946..ca678d46c322 100644
> --- a/drivers/net/ethernet/ethoc.c
> +++ b/drivers/net/ethernet/ethoc.c
> @@ -860,6 +860,11 @@ static netdev_tx_t ethoc_start_xmit(struct sk_buff *skb, 
> struct net_device *dev)
>   unsigned int entry;
>   void *dest;
>  
> + if (skb_put_padto(skb, ETHOC_ZLEN)) {
> + dev->stats.tx_errors++;
> + goto out;
> + }

skb_put_padto says that skb is freed on error, so this shoudn't go to
label 'out' on error where skb is freed again?

-- 
Thanks.
-- Max

Re: [patch net-next 1/2] devlink: add hardware messages tracing facility

2016-07-12 Thread Steven Rostedt

On Tue, 12 Jul 2016 10:38:37 +0200
Jiri Pirko  wrote:


> Hmm, I'm trying to make it work this was but it seems to be a bit more
> complicated. I did not find any code doing this (having the tracepoint
> compiled in or not) so I cannot copy&paste the solution.
> 
> The problem is that trace_devlink_hwmsg is included from
> include/trace/events/devlink.h when devlink is on and from
> include/net/devlink.h. Odd but ok.
> 
> I cannot include include/trace/events/devlink.h in include/net/devlink.h
> because include/net/devlink.h is included in include/trace/events/devlink.h
> 
> So I have to directly include include/trace/events/devlink.h from mlxsw
> driver (trace caller). But when devlink is not compiled in, trecepoint
> is not created and I'm getting errors.
> 
> I see no easy way our of this other than the wrapper function I was
> originally using. Seem simple and elegant.

No. You missed my solution.

Any user of trace_devlink_hwmsg() should directly include
trace/events/devlink.h. That's what all other trace users do. They
include the trace file header.

I said in that header do:


#if IS_ENABLED(CONFIG_NET_DEVLINK)

#undef TRACE_SYSTEM
#define TRACE_SYSTEM devlink

#if !defined(_TRACE_DEVLINK_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_DEVLINK_H

#include 
#include 
#include 

[..]

#endif /* _TRACE_DEVLINK_H */

/* This part must be outside protection */
#include 

#else /* CONFIG_NET_DEVLINK */
static inline void trace_devlink_hwmsg(const struct devlink *devlink,
   bool incoming, unsigned long type,
   const u8 *buf, size_t len)
{
}
#endif


-- Steve

Re: [net-next PATCH RFC] mlx4: RX prefetch loop

2016-07-12 Thread Jesper Dangaard Brouer

On Mon, 11 Jul 2016 16:05:11 -0700
Alexei Starovoitov  wrote:

> On Mon, Jul 11, 2016 at 01:09:22PM +0200, Jesper Dangaard Brouer wrote:
> > > - /* Process all completed CQEs */
> > > + /* Extract and prefetch completed CQEs */
> > >   while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
> > >   cq->mcq.cons_index & cq->size)) {
> > > + void *data;
> > >  
> > >   frags = ring->rx_info + (index << priv->log_rx_info);
> > >   rx_desc = ring->buf + (index << ring->log_stride);
> > > + prefetch(rx_desc);
> > >  
> > >   /*
> > >* make sure we read the CQE after we read the ownership bit
> > >*/
> > >   dma_rmb();
> > >  
> > > + cqe_array[cqe_idx++] = cqe;
> > > +
> > > + /* Base error handling here, free handled in next loop */
> > > + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> > > +  MLX4_CQE_OPCODE_ERROR))
> > > + goto skip;
> > > +
> > > + data = page_address(frags[0].page) + frags[0].page_offset;
> > > + prefetch(data);  
> 
> that's probably not correct in all cases, since doing prefetch on the address
> that is going to be evicted soon may hurt performance.
> We need to dma_sync_single_for_cpu() before doing a prefetch or
> somehow figure out that dma_sync is a nop, so we can omit it altogether
> and do whatever prefetches we like.

Sure, DMA can be synced first (actually already played with this).

> Also unconditionally doing batch of 8 may also hurt depending on what
> is happening either with the stack, bpf afterwards or even cpu version.

See this as software DDIO, if the unlikely case that data will get
evicted, it will still exist in L2 or L3 cache (like DDIO). Notice,
only 1024 bytes are getting prefetched here.

> Doing single prefetch of Nth packet is probably ok most of the time,
> but asking cpu to prefetch 8 packets at once is unnecessary especially
> since single prefetch gives the same performance.

No, unconditionally prefetch of the Nth packet, will be wrong most of
the time, for real work loads, as Eric Dumazet already pointed out.

This patch does NOT unconditionally prefetch 8 packets.  Prefetching
_only_ happens when it is known that packets are ready in the RX ring.
We know this prefetch data will be used/touched, within the NAPI cycle.
Even if the processing of the packet flush L1 cache, then it will be in
L2 or L3 (like DDIO).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [patch net-next 1/2] devlink: add hardware messages tracing facility

2016-07-12 Thread Jiri Pirko

Tue, Jul 12, 2016 at 02:44:17PM CEST, rost...@goodmis.org wrote:
>On Tue, 12 Jul 2016 10:38:37 +0200
>Jiri Pirko  wrote:
>
>
>> Hmm, I'm trying to make it work this was but it seems to be a bit more
>> complicated. I did not find any code doing this (having the tracepoint
>> compiled in or not) so I cannot copy&paste the solution.
>> 
>> The problem is that trace_devlink_hwmsg is included from
>> include/trace/events/devlink.h when devlink is on and from
>> include/net/devlink.h. Odd but ok.
>> 
>> I cannot include include/trace/events/devlink.h in include/net/devlink.h
>> because include/net/devlink.h is included in include/trace/events/devlink.h
>> 
>> So I have to directly include include/trace/events/devlink.h from mlxsw
>> driver (trace caller). But when devlink is not compiled in, trecepoint
>> is not created and I'm getting errors.
>> 
>> I see no easy way our of this other than the wrapper function I was
>> originally using. Seem simple and elegant.
>
>No. You missed my solution.
>
>Any user of trace_devlink_hwmsg() should directly include
>trace/events/devlink.h. That's what all other trace users do. They
>include the trace file header.
>
>I said in that header do:
>
>
>#if IS_ENABLED(CONFIG_NET_DEVLINK)
>
>#undef TRACE_SYSTEM
>#define TRACE_SYSTEM devlink
>
>#if !defined(_TRACE_DEVLINK_H) || defined(TRACE_HEADER_MULTI_READ)
>#define _TRACE_DEVLINK_H
>
>#include 
>#include 
>#include 
>
>[..]
>
>#endif /* _TRACE_DEVLINK_H */
>
>/* This part must be outside protection */
>#include 
>
>#else /* CONFIG_NET_DEVLINK */
>static inline void trace_devlink_hwmsg(const struct devlink *devlink,
>  bool incoming, unsigned long type,
>  const u8 *buf, size_t len)
>{
>}
>#endif


Okay. That looks good. Thanks!


>
>
>-- Steve

RE: [PATCH] net: Fragment large datagrams even when IP_HDRINCL is set.

2016-07-12 Thread Alan Davey

>> On Wed, 15 Jun 2016, Alan Davey wrote:
>> 
>>> The only case that would break is that where an application relies on 
>>> the existing (documented as a bug) feature of getting an EMSGSIZE 
>>> return code in the case of an over-sized packet.  Applications that 
>>> perform their own fragmentation would be unaffected.
>> 
>> If this doesn't break existing applications that are doing 
>> fragmentation in userspace on raw sockets (e.g. Quagga ospfd), that's 
>> better.
>> 
>> As per previous email, I'd love to be able to get rid of that code and 
>> have the kernel do it for me. However, I also don't want to have to do 
>> anything other non-trivial to that code either. :)
>> 
>> The issue for us is, how would we know on any given host whether the 
>> kernel will do the fragmentation or whether ospfd has to do it? We 
>> need to be able to probe for that capability, surely?
>
>The fact is, regardless of whether you could probe for the capability or not, 
>you have to keep the fragmentation code around forever.
>
>And that is yet another reason I do not want to add this change at all.
>
>It doesn't make any existing server any simpler, in fact it makes them all 
>more complicated because not only do they keep the fragmentation code, they 
>also >get new logic to test for the feature that would allow them to avoid 
>using it.
>
>Sorry, there is no way I am adding this, it's a net lose.

David,

I accept you don't want to take this patch.  However, I don't yet understand 
why, so I have a few more questions (which will hopefully help me produce 
patches that are likely to be accepted in future).

Adding the patch means that some existing applications will continue to contain 
their own fragmentation code, which becomes unnecessary.  This is OK; those 
applications will continue to fragment packets, and will continue to work with 
an updated kernel.  Do you agree?

Not adding the patch means that 
-  all future applications have to continue to implement their own 
fragmentation code, duplicating that which already exists in the kernel
-  hardware vendors may choose to apply the patch themselves (they do not want 
to implement function that already exists), but would surely prefer to have it 
in the standard kernel.

So I still don't understand what part of taking the patch would have a negative 
result on existing applications, it should be neutral to them.  Could you help 
me out here and explain?

Regards
Alan

Re: [PATCH v8 01/11] bpf: add XDP prog type for early driver filter

2016-07-12 Thread Jesper Dangaard Brouer

On Tue, 12 Jul 2016 00:51:24 -0700 Brenden Blanco  wrote:

> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct xdp_md, is exposed to userspace. So far only
> expose the packet start and end pointers, and only in read mode.
> 
> An XDP program must return one of the well known enum values, all other
> return codes are reserved for future use. Unfortunately, this
> restriction is hard to enforce at verification time, so take the
> approach of warning at runtime when such programs are encountered. Out
> of bounds return codes should alias to XDP_ABORTED.

This is going to be a serious usability problem, when XDP gets extended
with newer features.

How can users know what XDP capabilities a given driver support?

If the user loads a XDP program using capabilities not avail in the
driver, then all his traffic will be hard dropped.

My proposal is to NOT allow XDP programs to be loaded if they want to use
return-codes/features not supported by the driver.  Thus, eBPF programs
register/annotate their needed capabilities upfront (I guess this could
also help HW offload engines).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

RE: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program

2016-07-12 Thread Tariq Toukan


>Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>
>In tc/socket bpf programs, helpers linearize skb fragments as needed when the 
>program touches the packet data. However, in the pursuit of speed, XDP 
>programs will not be allowed to use these slower functions, especially if it 
>involves allocating an skb.
>
>Therefore, disallow MTU settings that would produce a multi-fragment packet 
>that XDP programs would fail to access. Future enhancements could be done to 
>increase the allowable MTU.
>
>Signed-off-by: Brenden Blanco 

Reviewed-by: Tariq Toukan

[PATCH iproute2] ip route: restore route entries in correct order

2016-07-12 Thread Xin Long

Sometimes we cannot restore route entries, because in kernel
  [1] fib_check_nh()
  [2] fib_valid_prefsrc()
cause some routes to depend on existence of others while adding.

For example, we saved all the routes, and flushed all tables
  [a] default via 192.168.122.1 dev eth0
  [b] 192.168.122.0/24 dev eth0 src 192.168.122.21
  [c] broadcast 127.0.0.0 dev lo table local src 127.0.0.1
  [d] local 127.0.0.0/8 dev lo table local  src 127.0.0.1
  [e] local 127.0.0.1 dev lo table local src 127.0.0.1
  [f] broadcast 127.255.255.255 dev lo table local src 127.0.0.1
  [g] broadcast 192.168.122.0 dev eth0 table local src 192.168.122.21
  [h] local 192.168.122.21 dev eth0 table local src 192.168.122.21
  [i] broadcast 192.168.122.255 dev eth0 table local src 192.168.122.21

  Now start to restore them:
If we want to add [a], we have to add [b] first, as [1] and
'via 192.168.122.1' in [a].
If we want to add [b], we have to add [h] first, as [2] and
'src 192.168.122.21' in [b].

  So the correct order to restore should be like:
[e][h] -> [b][c][d][f][g][i] -> [a]

This patch fixes it by traversing the file 3 times, it only restores
part of them in each run according to the following conditions, to
make sure every entry can be restored successfully.
  1. !gw && (!fib_prefsrc || fib_prefsrc == cfg->fc_dst)
  2. !gw && (fib_prefsrc != cfg->fc_dst)
  3. gw

Signed-off-by: Xin Long 
---
 ip/iproute.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 24f6b01..c735000 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1810,12 +1810,42 @@ static int iproute_get(int argc, char **argv)
return 0;
 }
 
+static int rtattr_cmp(struct rtattr *rta1, struct rtattr *rta2)
+{
+   if (!rta1 || !rta2 || rta1->rta_len != rta2->rta_len)
+   return 1;
+
+   return memcmp(RTA_DATA(rta1), RTA_DATA(rta2), RTA_PAYLOAD(rta1));
+}
+
 static int restore_handler(const struct sockaddr_nl *nl,
   struct rtnl_ctrl_data *ctrl,
   struct nlmsghdr *n, void *arg)
 {
-   int ret;
+   struct rtmsg *r = NLMSG_DATA(n);
+   struct rtattr *tb[RTA_MAX+1];
+   int len = n->nlmsg_len - NLMSG_LENGTH(sizeof(*r));
+   int ret, prio = *(int *)arg;
+
+   parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
 
+   /* Restore routes in correct order:
+* 0. ones for local addresses,
+* 1. ones for local networks,
+* 2. others (remote networks/hosts).
+*/
+   if (!prio && !tb[RTA_GATEWAY] && (!tb[RTA_PREFSRC] ||
+   !rtattr_cmp(tb[RTA_PREFSRC], tb[RTA_DST])))
+   goto restore;
+   else if (prio == 1 && !tb[RTA_GATEWAY] &&
+rtattr_cmp(tb[RTA_PREFSRC], tb[RTA_DST]))
+   goto restore;
+   else if (prio == 2 && tb[RTA_GATEWAY])
+   goto restore;
+
+   return 0;
+
+restore:
n->nlmsg_flags |= NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
 
ll_init_map(&rth);
@@ -1848,10 +1878,23 @@ static int route_dump_check_magic(void)
 
 static int iproute_restore(void)
 {
+   int pos, prio;
+
if (route_dump_check_magic())
exit(-1);
 
-   exit(rtnl_from_file(stdin, &restore_handler, NULL));
+   pos = ftell(stdin);
+   for (prio = 0; prio < 3; prio++) {
+   int err;
+
+   err = rtnl_from_file(stdin, &restore_handler, &prio);
+   if (err)
+   exit(err);
+
+   fseek(stdin, pos, SEEK_SET);
+   }
+
+   exit(0);
 }
 
 static int show_handler(const struct sockaddr_nl *nl,
-- 
2.1.0

Re: [PATCH v2 net] ipv6: addrconf: fix Juniper SSL VPN client regression

2016-07-12 Thread Bjørn Mork

David Miller  writes:

> What really irks me is that we "fixing" something without knowing what
> actually is the problem.

Agreed

> Someone needs to figure out exactly what is making the Juniper thing
> unhappy.  It really shouldn't care if a link local address is assigned
> to the tun device, this is fundamental ipv6 stuff.

Yes. Looks like this is up to Jonas and/or Valdis.  I tried looking for
a demo site which could be used to test the client, but could not find
any.  The product itself seems to be replaced, and it's no longer Juniper.
And the recommended Linux solution seems to be OpenConnect:
http://www.infradead.org/openconnect/juniper.html

Anyway, it would be good to sort out the problems with the java(?) based
client.  A few proposals (not an exhaustive list - please use your
creativity):

a) Try to figure out what the traffic on the interface looks like (there
   was a single TX packet and no RX, I believe?).  Snoop on it and see
   if that is an IPv6 RS from the kernel or something the client sends.

b) Try to isolate the problem by tweaking what you can on the tun-
   interface.

 ip addr del  dev tun0
 echo 1 > /proc/sys/net/ipv6/conf/tun0/disable_ipv6

  etc.  Is there anything that will make the traffic flow, or is it just
  dead?

c) Try to figure out what the client is doing.  strace it.  run lsof on
   it.  Anything unexpected?  Does it for example happen to read an
   packet from the tun file descriptor and choke?

etc,

Bjørn

Re: [PATCH net v2 0/2] net: ethoc: Error path and transmit fixes

2016-07-12 Thread Max Filippov

Hello,

On Mon, Jul 11, 2016 at 04:35:53PM -0700, Florian Fainelli wrote:
> This patch series contains two patches for the ethoc driver while testing on a
> TS-7300 board where ethoc is provided by an on-board FPGA.
> 
> First patch was cooked after chasing crashes with invalid resources passed to
> the driver.
> 
> Second patch was cooked after seeing that an interface configured with IP
> 192.168.2.2 was sending ARP packets for 192.168.0.0, no wonder why it could 
> not
> work.

I can see opencores intrerface sending ARP packets shorter than 64 bytes,
but I couldn't reproduce truncation that affects packet contents on my
hardware.

> I don't have access to any other platform using an ethoc interface so
> it could be good to some testing on Xtensa for instance.

I've tested success and error paths affected by this series with the
following additional change on top of it:

diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c
index ca678d4..8c94f45 100644
--- a/drivers/net/ethernet/ethoc.c
+++ b/drivers/net/ethernet/ethoc.c
@@ -862,7 +862,7 @@ static netdev_tx_t ethoc_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
if (skb_put_padto(skb, ETHOC_ZLEN)) {
dev->stats.tx_errors++;
-   goto out;
+   return NETDEV_TX_OK;
}
 
if (unlikely(skb->len > ETHOC_BUFSIZ)) {

Without it the interface becomes non-functional after the first error
in skb_put_padto.

Tested-by: Max Filippov 
Reviewed-by: Max Filippov 

-- 
Thanks.
-- Max

Re: [PATCH iproute2] ip route: restore route entries in correct order

2016-07-12 Thread Phil Sutter

On Tue, Jul 12, 2016 at 09:37:58PM +0800, Xin Long wrote:
> Sometimes we cannot restore route entries, because in kernel
>   [1] fib_check_nh()
>   [2] fib_valid_prefsrc()
> cause some routes to depend on existence of others while adding.
> 
> For example, we saved all the routes, and flushed all tables
>   [a] default via 192.168.122.1 dev eth0
>   [b] 192.168.122.0/24 dev eth0 src 192.168.122.21
>   [c] broadcast 127.0.0.0 dev lo table local src 127.0.0.1
>   [d] local 127.0.0.0/8 dev lo table local  src 127.0.0.1
>   [e] local 127.0.0.1 dev lo table local src 127.0.0.1
>   [f] broadcast 127.255.255.255 dev lo table local src 127.0.0.1
>   [g] broadcast 192.168.122.0 dev eth0 table local src 192.168.122.21
>   [h] local 192.168.122.21 dev eth0 table local src 192.168.122.21
>   [i] broadcast 192.168.122.255 dev eth0 table local src 192.168.122.21
> 
>   Now start to restore them:
> If we want to add [a], we have to add [b] first, as [1] and
> 'via 192.168.122.1' in [a].
> If we want to add [b], we have to add [h] first, as [2] and
> 'src 192.168.122.21' in [b].
> 
>   So the correct order to restore should be like:
> [e][h] -> [b][c][d][f][g][i] -> [a]
> 
> This patch fixes it by traversing the file 3 times, it only restores
> part of them in each run according to the following conditions, to
> make sure every entry can be restored successfully.
>   1. !gw && (!fib_prefsrc || fib_prefsrc == cfg->fc_dst)
>   2. !gw && (fib_prefsrc != cfg->fc_dst)
>   3. gw
> 
> Signed-off-by: Xin Long 

Acked-by: Phil Sutter

Re: [PATCH v2 net] ipv6: addrconf: fix Juniper SSL VPN client regression

2016-07-12 Thread Bjørn Mork

Hannes,

I just realized that I missed you on the CC list of the v2 patch. Sorry.
It was definitely not my intention. I really appreciate the feedback you
have kindly provided before.

I guess you might have seen it in netdev anyway, but in case not, here
is a direct patchwork link: http://patchwork.ozlabs.org/patch/646958/

I'll try to be less sloppy about CCs the next time, if we ever get to a
v3 of this. At the moment that requires figuring out what the actual
problem with the client is. Something I don't have the necessary tools
to do AFAICS.  So I will let the whole thing just sleep until there is
some more data about the Juniper client failure.


Bjørn

Re: Multi-thread udp 4.7 regression, bisected to 71d8c47fc653

2016-07-12 Thread Pablo Neira Ayuso

On Mon, Jul 11, 2016 at 06:17:39PM -0300, Marc Dionne wrote:
> Hi Pablo,
> 
> Testing out your patch:
> 
> 1) With no NAT in place, the clash resolution happens, with no side
> effects.  No EPERM errors are seen.
> 
> 2) With ip(6)table_nat loaded, the clash resolution fails and I get
> some EPERM errors from sendmsg(), same as before 71d8c47fc653.
> 
> Turns out that even though I have no NAT rules in my iptables config,
> the system also had firewalld active and that caused the modules to be
> loaded.
> 
> So the bottom line is that the patch looks good to me..

Thanks Marc, I'm going to apply this then.

Re: WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc

2016-07-12 Thread Eric Dumazet

On Tue, 2016-07-12 at 11:53 +0200, Mason wrote:
> On 05/07/2016 17:28, Florian Fainelli wrote:
> 
> > Le 05/07/2016 07:50, Mason wrote:
> > 
> >> On 05/07/2016 15:33, Mason wrote:
> >> 
> >>> I was testing suspend/resume sequences where the suspend operation
> >>> fails and returns without having suspended the platform.
> 
> Please forget I ever mentioned suspend, that was a red herring.
> The warning is displayed even if I never suspend.
> (Dropping linux-pm from this discussion.)
> 
> >> I rebooted, tried the same operation, and hit the same warning
> >> still 310 seconds later:
> 
> However, the 310 seconds time span still seems to be relevant.
> 
> Steps to reproduce: I booted the system, logged in as root,
> mounted an NFS file system, then left the system idling at
> the prompt.
> 
> (I don't remember seeing this warning in v4.1 and v4.4)
> 
> What's going wrong here? Is it related to NFS?
> 
> Here is the defconfig I'm using
> http://pastebin.ubuntu.com/19160299/
> 
> Regards.
> 

Could you try this debug patch ?

diff --git a/net/core/sock.c b/net/core/sock.c
index b7f12639c26a..7fb1aeadbda7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1465,6 +1465,7 @@ static void __sk_destruct(struct rcu_head *head)
 
 void sk_destruct(struct sock *sk)
 {
+   WARN_ON_ONCE(sk->sk_forward_alloc);
if (sock_flag(sk, SOCK_RCU_FREE))
call_rcu(&sk->sk_rcu, __sk_destruct);
else

[PATCH 1/1] Fix PCS reset

2016-07-12 Thread Elad Kanfi

From: Noam Camus 

During commit b54b8c2d6e3c
 ("net: ezchip: adapt driver to little endian architecture")
 adapting to little endian architecture,
 zeroing of controller was left out.

Signed-off-by: Elad Kanfi 
---
 drivers/net/ethernet/ezchip/nps_enet.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/ezchip/nps_enet.c 
b/drivers/net/ethernet/ezchip/nps_enet.c
index 06f0317..9b7a3f5 100644
--- a/drivers/net/ethernet/ezchip/nps_enet.c
+++ b/drivers/net/ethernet/ezchip/nps_enet.c
@@ -285,6 +285,7 @@ static void nps_enet_hw_reset(struct net_device *ndev)
ge_rst_value |= NPS_ENET_ENABLE << RST_GMAC_0_SHIFT;
nps_enet_reg_set(priv, NPS_ENET_REG_GE_RST, ge_rst_value);
usleep_range(10, 20);
+   ge_rst_value = 0;
nps_enet_reg_set(priv, NPS_ENET_REG_GE_RST, ge_rst_value);
 
/* Tx fifo reset sequence */
-- 
1.7.1

Re: [PATCH v2 net] ipv6: addrconf: fix Juniper SSL VPN client regression

2016-07-12 Thread Hannes Frederic Sowa

On 11.07.2016 16:48, David Miller wrote:
> From: Bjørn Mork 
> Date: Mon, 11 Jul 2016 16:43:50 +0200
> 
>> The Juniper SSL VPN client use a "tun" interface and seems to
>> be picky about visible changes.to it. Commit cc9da6cc4f56
>> ("ipv6: addrconf: use stable address generator for ARPHRD_NONE")
>> made such interfaces get an auto-generated IPv6 link local address
>> by default, similar to most other interface types. This made the
>> Juniper SSL VPN client fail for unknown reasons.
>>
>> Fixing this regression by adding a new private netdevice flag
>> which disables automatic IPv6 link local address generation, and
>> making the flag default for "tun" devices.
>>
>> Setting an explicit addrgenmode will disable the flag, so userspace
>> can choose to enable automatic LL generation by selecting a suitable
>> addrgenmode.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=121131
>> Fixes: cc9da6cc4f56 ("ipv6: addrconf: use stable address generator for 
>> ARPHRD_NONE")
>> Reported-by: Valdis Kletnieks 
>> Reported-by: Jonas Lippuner 
>> Suggested-by: Hannes Frederic Sowa 
>> Cc: 吉藤英明 
>> Signed-off-by: Bjørn Mork 
> 
> What really irks me is that we "fixing" something without knowing what
> actually is the problem.
> 
> Someone needs to figure out exactly what is making the Juniper thing
> unhappy.  It really shouldn't care if a link local address is assigned
> to the tun device, this is fundamental ipv6 stuff.

I agree, but there could be a lot of factors affecting this. For example
we know start to send out an icmp router solicitation into the vpn as
soon as the link goes up, as well as initial mld change record. It could
very well trigger security software and kill the VPN (something I can
very well imagine right now).

I actually don't suspect something funny going on in the Juniper's Java
client itself, this would just be too strange. Unfortunately I don't
have access to such systems to check what could be the problem.

Bjorn send a list of things that could be tried.

Thanks,
Hannes

Re: WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc

2016-07-12 Thread Mason

On 12/07/2016 16:25, Eric Dumazet wrote:

> Could you try this debug patch ?

Note: I've been unable to trigger the warning again. Dunno what has changed...

With your patch applied, I get a warning at boot:

[4.668309] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow 
control rx/tx
[4.688609] Sending DHCP requests ., OK
[4.711935] IP-Config: Got DHCP answer from 172.27.200.1, my address is 
172.27.64.49
[4.719956] IP-Config: Complete:
[4.723221]  device=eth0, hwaddr=00:16:e8:02:08:42, ipaddr=172.27.64.49, 
mask=255.255.192.0, gw=172.27.64.1
[4.733376]  host=toto5, domain=france.foo.com sac.foo.com asic.foo.com 
soft.sde, nis-domain=france.foo.com
[4.745279]  bootserver=172.27.64.1, rootserver=172.27.64.1, 
rootpath=/export/roots/titi/6_2_0_8756,v3 nameserver0=172.27.0.17
[4.759725] [ cut here ]
[4.764426] WARNING: CPU: 0 PID: 877 at net/core/sock.c:1468 
sk_destruct+0x74/0x78
[4.772056] Modules linked in:
[4.775133] CPU: 0 PID: 877 Comm: kworker/0:1H Not tainted 
4.7.0-rc6-00010-gd07031bdc433-dirty #6
[4.784050] Hardware name: Sigma Tango DT
[4.788084] Workqueue: rpciod rpc_async_schedule
[4.792725] Backtrace: 
[4.795196] [] (dump_backtrace) from [] 
(show_stack+0x18/0x1c)
[4.802802]  r7:6013 r6:c080ea84 r5: r4:c080ea84
[4.808513] [] (show_stack) from [] 
(dump_stack+0x80/0x94)
[4.815781] [] (dump_stack) from [] (__warn+0xec/0x104)
[4.822776]  r7:0009 r6:c05e4d04 r5: r4:
[4.828482] [] (__warn) from [] 
(warn_slowpath_null+0x28/0x30)
[4.836089]  r9: r8:df5edb58 r7:df711364 r6:df006c80 r5:df5edb58 
r4:df5eda40
[4.843898] [] (warn_slowpath_null) from [] 
(sk_destruct+0x74/0x78)
[4.851945] [] (sk_destruct) from [] 
(__sk_free+0x50/0xbc)
[4.859203]  r5:df5edb58 r4:df5eda40
[4.862802] [] (__sk_free) from [] (sk_free+0x3c/0x40)
[4.869710]  r5:df5edb58 r4:df5eda40
[4.873310] [] (sk_free) from [] 
(sk_common_release+0xe8/0xf4)
[4.880924] [] (sk_common_release) from [] 
(udp_lib_close+0x10/0x14)
[4.889054]  r5:df006c80 r4:df5eda40
[4.892657] [] (udp_lib_close) from [] 
(inet_release+0x4c/0x78)
[4.900360] [] (inet_release) from [] 
(sock_release+0x28/0xa8)
[4.907967]  r5: r4:df006c80
[4.911575] [] (sock_release) from [] 
(xs_reset_transport+0xac/0xbc)
[4.919706]  r5:df5eda40 r4:df711000
[4.923306] [] (xs_reset_transport) from [] 
(xs_destroy+0x24/0x54)
[4.931262]  r9: r8:c049c614 r7:df7c3218 r6:df711000 r5: 
r4:df711000
[4.939070] [] (xs_destroy) from [] 
(xprt_destroy+0x88/0x8c)
[4.946502]  r5:df711218 r4:df711000
[4.950102] [] (xprt_destroy) from [] 
(xprt_put+0x40/0x44)
[4.957358]  r5:df7c3200 r4:df613d00
[4.960959] [] (xprt_put) from [] 
(rpc_task_release_client+0x7c/0x80)
[4.969181] [] (rpc_task_release_client) from [] 
(rpc_release_resources_task+0x34/0x38)
[4.978971]  r7:c049c298 r6:0001 r5: r4:df613d00
[4.984674] [] (rpc_release_resources_task) from [] 
(__rpc_execute+0xb0/0x2a8)
[4.993678]  r5: r4:df613d00
[4.997277] [] (__rpc_execute) from [] 
(rpc_async_schedule+0x14/0x18)
[5.005496]  r10:df683500 r9: r8:dfbe4700 r7: r6:dfbdcc80 
r5:df683500
[5.013383]  r4:df613d24
[5.015935] [] (rpc_async_schedule) from [] 
(process_one_work+0x128/0x3fc)
[5.024595] [] (process_one_work) from [] 
(worker_thread+0x58/0x574)
[5.032726]  r10:df683500 r9:df6d4000 r8:dfbdcc98 r7:c0802100 r6:0008 
r5:df683518
[5.040614]  r4:dfbdcc80
[5.043162] [] (worker_thread) from [] 
(kthread+0xe4/0xfc)
[5.050419]  r10: r9: r8: r7:c0132914 r6:df683500 
r5:df6cc600
[5.058309]  r4:
[5.060858] [] (kthread) from [] 
(ret_from_fork+0x14/0x3c)
[5.068116]  r7: r6: r5:c013838c r4:df6cc600
[5.073846] ---[ end trace 05b24e2dedd2f2a0 ]---
[5.082009] VFS: Mounted root (nfs filesystem) readonly on device 0:11.
[5.090328] Freeing unused kernel memory: 1024K (c070 - c080)
[5.452552] random: udevd urandom read with 68 bits of entropy available

Re: [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding

2016-07-12 Thread Tariq Toukan

Regression tests for mlx4_en are currently running, results will be 
ready by tomorrow morning.


Regards,
Tariq

Re: [PATCH v8 01/11] bpf: add XDP prog type for early driver filter

2016-07-12 Thread Tom Herbert

On Tue, Jul 12, 2016 at 6:14 AM, Jesper Dangaard Brouer
 wrote:
>
> On Tue, 12 Jul 2016 00:51:24 -0700 Brenden Blanco  
> wrote:
>
>> Add a new bpf prog type that is intended to run in early stages of the
>> packet rx path. Only minimal packet metadata will be available, hence a
>> new context type, struct xdp_md, is exposed to userspace. So far only
>> expose the packet start and end pointers, and only in read mode.
>>
>> An XDP program must return one of the well known enum values, all other
>> return codes are reserved for future use. Unfortunately, this
>> restriction is hard to enforce at verification time, so take the
>> approach of warning at runtime when such programs are encountered. Out
>> of bounds return codes should alias to XDP_ABORTED.
>
> This is going to be a serious usability problem, when XDP gets extended
> with newer features.
>
> How can users know what XDP capabilities a given driver support?
>
We talked about this a the XDP summit and I have some slides on this
(hope to have slides posted shortly) . Drivers, more generally XDP
platforms, will provide a list APIs that they support. APIs would be
contained in a sort common specification files that and would have
some name like basic_xdp_api, adv_switch_api, etc. An XDP program is
written using one of the published APIs and the API it uses is
expressed as part of the program. At runtime (possibly compile time)
the API used by the program is checked against the list of APIs for
the target platform-- if the API is not in the set then the program is
not allowed to be loaded. To whatever extent possible we should also
try to verify that program adhere's to the API as load and compile
time. In the event that program violates the API it claims to be
using, such as return invalid return code for the API, that is an
error condition.

> If the user loads a XDP program using capabilities not avail in the
> driver, then all his traffic will be hard dropped.
>
>
> My proposal is to NOT allow XDP programs to be loaded if they want to use
> return-codes/features not supported by the driver.  Thus, eBPF programs
> register/annotate their needed capabilities upfront (I guess this could
> also help HW offload engines).
>
Exactly. We just need to define exactly how this is done ;-)

One open issue is whether XDP needs to be binary compatible across
platforms or source code compatible (also require recompile in latter
case for each platform). Personally I prefer source code compatible
since that might allow platforms to implement the API specifically for
their needs (like the ordering of fields in a meta data structure.

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

RE: [PATCH v6 RESEND] r8152: Add support for setting pass through MAC address on RTL8153-AD

2016-07-12 Thread Mario_Limonciello

> -Original Message-
> From: Michał Pecio [mailto:michal.pe...@gmail.com]
> Sent: Tuesday, July 12, 2016 2:03 AM
> To: David Miller 
> Cc: Limonciello, Mario ; linux-
> ker...@vger.kernel.org; netdev@vger.kernel.org; linux-
> u...@vger.kernel.org; anthony.w...@canonical.com
> Subject: Re: [PATCH v6 RESEND] r8152: Add support for setting pass through
> MAC address on RTL8153-AD
> 
> > From: Mario Limonciello 
> > Date: Mon, 11 Jul 2016 19:58:04 -0500
> >
> > > The RTL8153-AD supports a persistent system specific MAC address.
> > > This means a device plugged into two different systems with host
> > > side support will show different (but persistent) MAC addresses.
> > >
> > > This information for the system's persistent MAC address is burned
> > > in when the system HW is built and available under \_SB.AMAC in the
> > > DSDT at runtime.
> > >
> > > This technology is currently implemented in the Dell TB15 and WD15
> > > Type-C docks.  More information is available here:
> > > http://www.dell.com/support/article/us/en/04/SLN301147
> > >
> > > Signed-off-by: Mario Limonciello 
> >
> > Applied.
> 
> Hi,
> 
> Isn't it possible that the same ACPI name could be used for other
> vendor-specific extensions on other laptops and that r8152 will now
> wreak havoc there?
> 
> Regards,
> MP

This has been discussed a bit previously in earlier submissions.

In short, this is an extreme corner case.  Some changes were made 
to diminish potential impact. The ACPI code is resolved only when 
the specific variant of RTL8135 (-AD) has a bit set on the efuse.

The patch also explicitly checks the return type and contents of the
ACPI object and won't proceed if it's invalid.

The Type-C devices that provide this are currently only sold by Dell.  
This of course may change one day if other OEM's add this
feature, but it just shows that device scope is limited.

For now the way this is implemented if Realtek does choose to 
work with another OEM on this feature, there should be no
kernel code change necessary for interoperability of peripherals
on other OEM machines.

So in order to hit this hypothetical corner case today you would 
need to be using a real world type-C device something such as a 
Dell WD15 on another OEM's machine that has type-C and the exact
same ACPI object name that does $BADSTUFF other than return a
buffer.

If that situation arises please alert me and I'll send a follow up
patch that whitelists this to match DMI vendor of only Dell 
systems.

[PATCH -next] ixgbe: Add missing destroy_workqueue() on error in ixgbe_init_module()

2016-07-12 Thread weiyj_lk

From: Wei Yongjun 

Add the missing destroy_workqueue() before return from
ixgbe_init_module() in the error handling case.

Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ef82cf9..c5c3d68 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -10109,6 +10109,7 @@ static int __init ixgbe_init_module(void)
 
ret = pci_register_driver(&ixgbe_driver);
if (ret) {
+   destroy_workqueue(ixgbe_wq);
ixgbe_dbg_exit();
return ret;
}

Re: [PATCH v3] Marvell phy: add fiber status check and configuration for some phys

2016-07-12 Thread Andrew Lunn

On Tue, Jul 12, 2016 at 05:00:52PM +0200, Charles-Antoine Couret wrote:
> Hello,
> I'm back with another patch evrsion about Marvell phys with a fiber interface.
> >From the previous release, I fixed some issues reported by yours and I added 
> >some functions around the fiber interface to get statistics, to configure 
> >the aneg, etc.

Hi Charles

It is best to submit a number of smaller patches, each doing one
thing, than a single big patch. It makes review and discussion much
simpler.

So for example, this should be a patch of its own:

> @@ -151,6 +165,7 @@ struct marvell_hw_stat {
>  
>  static struct marvell_hw_stat marvell_hw_stats[] = {
>   { "phy_receive_errors", 0, 21, 16},
> + { "phy_receive_errors_fiber", 1, 21, 16},
>   { "phy_idle_errors", 0, 10, 8 },
>  };

I think we should also rename phy_receive_errors to
phy_receive_errors_copper.

Andrew

[PATCH v3] Marvell phy: add fiber status check and configuration for some phys

2016-07-12 Thread Charles-Antoine Couret

Hello,
I'm back with another patch evrsion about Marvell phys with a fiber interface.
>From the previous release, I fixed some issues reported by yours and I added 
>some functions around the fiber interface to get statistics, to configure the 
>aneg, etc.
Yes, to implement that, copper and fiber side are closed, but the fiber link 
needs some custom registers or bit values. I'm based on some genphy functions 
to adapt to fiber link.

With that, the 88E1512 for example could have the same features for the copper 
and fiber links.
I'm interested by your feedback about this patch.

Thank you in advance.
Regards.
Charles-Antoine Couret

>From 4ca5935f8aa97c3ba02cb27e970e1bcf248aed18 Mon Sep 17 00:00:00 2001
From: Charles-Antoine Couret 
Date: Fri, 1 Apr 2016 16:16:35 +0200
Subject: [PATCH] Marvell phy: add fiber status check for some components

Some Marvell's phys have two modes: fiber and copper. Currently, the driver
configures and checks only the copper mode registers
which could be the wrong link.

This commit add handlers to check fiber then copper status link
or to configure the fiber interface.

This patch is not tested with all Marvell's phys.
The new functions are enabled only for tested phys.
Some of them functions are based on genphy functions.

Signed-off-by: Charles-Antoine Couret 
---
 drivers/net/phy/marvell.c | 323 +++---
 1 file changed, 307 insertions(+), 16 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index ec2c1ee..3cd8d94 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -138,6 +138,20 @@
 #define MII_88E1510_GEN_CTRL_REG_1_MODE_SGMII  0x1 /* SGMII to copper */
 #define MII_88E1510_GEN_CTRL_REG_1_RESET   0x8000  /* Soft reset */
 
+#define LPA_FIBER_1000HALF 0x40
+#define LPA_FIBER_1000FULL 0x20
+
+#define LPA_PAUSE_FIBER0x180
+#define LPA_PAUSE_ASYM_FIBER   0x100
+
+#define ADVERTISE_FIBER_1000HALF   0x40
+#define ADVERTISE_FIBER_1000FULL   0x20
+
+#define ADVERTISE_PAUSE_FIBER  0x180
+#define ADVERTISE_PAUSE_ASYM_FIBER 0x100
+
+#define REGISTER_LINK_STATUS   0x400
+
 MODULE_DESCRIPTION("Marvell PHY driver");
 MODULE_AUTHOR("Andy Fleming");
 MODULE_LICENSE("GPL");
@@ -151,6 +165,7 @@ struct marvell_hw_stat {
 
 static struct marvell_hw_stat marvell_hw_stats[] = {
{ "phy_receive_errors", 0, 21, 16},
+   { "phy_receive_errors_fiber", 1, 21, 16},
{ "phy_idle_errors", 0, 10, 8 },
 };
 
@@ -477,15 +492,122 @@ static int m88e1318_config_aneg(struct phy_device 
*phydev)
return m88e1121_config_aneg(phydev);
 }
 
+/**
+ * ethtool_adv_to_fiber_adv_t
+ * @ethadv: the ethtool advertisement settings
+ *
+ * A small helper function that translates ethtool advertisement
+ * settings to phy autonegotiation advertisements for the
+ * MII_ADV register for fiber link.
+ */
+static inline u32 ethtool_adv_to_fiber_adv_t(u32 ethadv)
+{
+   u32 result = 0;
+
+   if (ethadv & ADVERTISED_1000baseT_Half)
+   result |= ADVERTISE_FIBER_1000HALF;
+   if (ethadv & ADVERTISED_1000baseT_Full)
+   result |= ADVERTISE_FIBER_1000FULL;
+
+   if ((ethadv & ADVERTISE_PAUSE_ASYM) && (ethadv & ADVERTISE_PAUSE_CAP))
+   result |= LPA_PAUSE_ASYM_FIBER;
+   else if (ethadv & ADVERTISE_PAUSE_CAP)
+   result |= (ADVERTISE_PAUSE_FIBER
+  & (~ADVERTISE_PAUSE_ASYM_FIBER));
+
+   return result;
+}
+
+/**
+ * marvell_config_aneg_fiber - restart auto-negotiation or write BMCR
+ * @phydev: target phy_device struct
+ *
+ * Description: If auto-negotiation is enabled, we configure the
+ *   advertising, and then restart auto-negotiation.  If it is not
+ *   enabled, then we write the BMCR. Adapted for fiber link in
+ *   some Marvell's devices.
+ */
+static int marvell_config_aneg_fiber(struct phy_device *phydev)
+{
+   int changed = 0;
+   int err;
+   int adv, oldadv;
+   u32 advertise;
+
+   if (phydev->autoneg != AUTONEG_ENABLE)
+   return genphy_setup_forced(phydev);
+
+   /* Only allow advertising what this PHY supports */
+   phydev->advertising &= phydev->supported;
+   advertise = phydev->advertising;
+
+   /* Setup fiber advertisement */
+   adv = phy_read(phydev, MII_ADVERTISE);
+   if (adv < 0)
+   return adv;
+
+   oldadv = adv;
+   adv &= ~(ADVERTISE_FIBER_1000HALF | ADVERTISE_FIBER_1000FULL
+   | LPA_PAUSE_FIBER);
+   adv |= ethtool_adv_to_fiber_adv_t(advertise);
+
+   if (adv != oldadv) {
+   err = phy_write(phydev, MII_ADVERTISE, adv);
+   if (err < 0)
+   return err;
+
+   changed = 1;
+   }
+
+   if (changed == 0) {
+   /* Advertisement hasn't changed, but maybe aneg was never on to
+* begin with?  Or maybe phy was isolated?
+*/
+   int c

[PATCH -next] net: dsa: Fix non static symbol warning

2016-07-12 Thread weiyj_lk

From: Wei Yongjun 

Fixes the following sparse warning:

net/dsa/dsa2.c:680:6: warning:
 symbol '_dsa_unregister_switch' was not declared. Should it be static?

Signed-off-by: Wei Yongjun 
---
 net/dsa/dsa2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 78e4c01..f30bad9 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -677,7 +677,7 @@ int dsa_register_switch(struct dsa_switch *ds, struct 
device_node *np)
 }
 EXPORT_SYMBOL_GPL(dsa_register_switch);
 
-void _dsa_unregister_switch(struct dsa_switch *ds)
+static void _dsa_unregister_switch(struct dsa_switch *ds)
 {
struct dsa_switch_tree *dst = ds->dst;

[iproute PATCH v2] ip-macsec.8: fix typo in 'examples' section

2016-07-12 Thread Davide Caratti

fix wrong 'device' keyword in 'ip link add device eth0'

changes since v1:
while at it, add missing description of 'validate' keyword and remove
spurious bracket near 'encrypt' keyword. Also, wrong 'es' and 'encoding'
keywords in MACsec section of ip-link.8 have been fixed.

Signed-off-by: Davide Caratti 
---
 man/man8/ip-link.8.in | 8 
 man/man8/ip-macsec.8  | 5 +++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index ad18f75..edb00ff 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -910,14 +910,14 @@ the following additional arguments are supported:
 .BR encrypt " {"
 .BR on " | " off " } ] [ "
 .BR send_sci " { " on " | " off " } ] ["
-.BR es " { " on " | " off " } ] ["
+.BR end_station " { " on " | " off " } ] ["
 .BR scb " { " on " | " off " } ] ["
 .BR protect " { " on " | " off " } ] ["
 .BR replay " { " on " | " off " }"
 .BR window " { "
 .IR 0..2^32-1 " } ] ["
 .BR validate " { " strict " | " check " | " disabled " } ] ["
-.BR encoding " { "
+.BR encodingsa " { "
 .IR 0..3 " } ]"
 
 .in +8
@@ -942,7 +942,7 @@ the following additional arguments are supported:
 - specifies whether the SCI is included in every packet, or only when it is 
necessary.
 
 .sp
-.BR "es on " or " es off"
+.BR "end_station on " or " end_station off"
 - sets the End Station bit.
 
 .sp
@@ -970,7 +970,7 @@ the following additional arguments are supported:
 - sets the validation mode on the device.
 
 .sp
-.BI encoding " AN "
+.BI encodingsa " AN "
 - sets the active secure association for transmission.
 
 .in -8
diff --git a/man/man8/ip-macsec.8 b/man/man8/ip-macsec.8
index e8455d7..761e4de 100644
--- a/man/man8/ip-macsec.8
+++ b/man/man8/ip-macsec.8
@@ -6,7 +6,7 @@ ip-macsec \- MACsec device configuration
 [ [
 .BR cipher " { " default " | " gcm-aes-128 " } ] "
 .BI icvlen " ICVLEN"
-] [ [
+] [
 .BR encrypt " { " on " | " off " } ] ["
 .BR send_sci " { " on " | " off " } ] ["
 .BR end_station " { " on " | " off " } ] ["
@@ -15,6 +15,7 @@ ip-macsec \- MACsec device configuration
 .BR replay " { " on " | " off " } ] ["
 .BI window " WINDOW"
 ] [
+.BR validate " { " strict " | " check " | " disabled " } ] ["
 .BI encodingsa " SA"
 ]
 
@@ -74,7 +75,7 @@ type.
 .PP
 .SS Create a MACsec device on link eth0
 .nf
-# ip link add device eth0 macsec0 type macsec port 11 encrypt on
+# ip link add link eth0 macsec0 type macsec port 11 encrypt on
 .PP
 .SS Configure a secure association on that device
 .nf
-- 
2.5.5

[PATCH nf-next v2 0/3] Compact netfilter hooks list

2016-07-12 Thread Aaron Conole

This series makes a simple change to shrink the netfilter hook list
from a double linked list, to a singly linked list.  Since the hooks
are always traversed in-order, there is no need to maintain a previous
pointer.

This was jointly developed by Florian Westphal.

It has been tested with RCU and lockdep debugging enabled.

Aaron Conole (2):
  netfilter: bridge: add and use br_nf_hook_thresh
  netfilter: replace list_head with single linked list

Florian Westphal (1):
  netfilter: call nf_hook_state_init with rcu_read_lock held

 include/linux/netdevice.h  |   2 +-
 include/linux/netfilter.h  |  26 +++--
 include/linux/netfilter_ingress.h  |  15 ++-
 include/net/netfilter/br_netfilter.h   |   6 ++
 include/net/netfilter/nf_queue.h   |   9 +-
 include/net/netns/netfilter.h  |   2 +-
 net/bridge/br_netfilter_hooks.c|  50 --
 net/bridge/br_netfilter_ipv6.c |  12 +--
 net/bridge/netfilter/ebt_redirect.c|   2 +-
 net/bridge/netfilter/ebtables.c|   2 +-
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |   2 +-
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |   2 +-
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |   2 +-
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |   2 +-
 net/netfilter/core.c   | 132 -
 net/netfilter/nf_conntrack_core.c  |   2 +-
 net/netfilter/nf_conntrack_h323_main.c |   2 +-
 net/netfilter/nf_conntrack_helper.c|   2 +-
 net/netfilter/nf_internals.h   |  10 +-
 net/netfilter/nf_queue.c   |  15 ++-
 net/netfilter/nfnetlink_cthelper.c |   2 +-
 net/netfilter/nfnetlink_log.c  |   8 +-
 net/netfilter/nfnetlink_queue.c|   7 +-
 net/netfilter/xt_helper.c  |   2 +-
 24 files changed, 202 insertions(+), 114 deletions(-)

-- 
2.5.5

[PATCH v2 3/3] netfilter: replace list_head with single linked list

2016-07-12 Thread Aaron Conole

The netfilter hook list never uses the prev pointer, and so can be
trimmed to be a smaller singly-linked list.

In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device
becomes 2176 bytes (down from 2240).

Signed-off-by: Aaron Conole 
Signed-off-by: Florian Westphal 
---
v2:
* Adjusted the hook list head function, and retested with rcu and lockdep
  debugging enabled.

 include/linux/netdevice.h |   2 +-
 include/linux/netfilter.h |  18 +++---
 include/linux/netfilter_ingress.h |  14 +++--
 include/net/netfilter/nf_queue.h  |   9 ++-
 include/net/netns/netfilter.h |   2 +-
 net/bridge/br_netfilter_hooks.c   |  21 +++
 net/netfilter/core.c  | 127 --
 net/netfilter/nf_internals.h  |  10 +--
 net/netfilter/nf_queue.c  |  15 +++--
 net/netfilter/nfnetlink_queue.c   |   5 +-
 10 files changed, 130 insertions(+), 93 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 49736a3..9da07ad 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1749,7 +1749,7 @@ struct net_device {
 #endif
struct netdev_queue __rcu *ingress_queue;
 #ifdef CONFIG_NETFILTER_INGRESS
-   struct list_headnf_hooks_ingress;
+   struct nf_hook_entry __rcu *nf_hooks_ingress;
 #endif
 
unsigned char   broadcast[MAX_ADDR_LEN];
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index ad444f0..3390a84 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -55,12 +55,12 @@ struct nf_hook_state {
struct net_device *out;
struct sock *sk;
struct net *net;
-   struct list_head *hook_list;
+   struct nf_hook_entry *hook_list;
int (*okfn)(struct net *, struct sock *, struct sk_buff *);
 };
 
 static inline void nf_hook_state_init(struct nf_hook_state *p,
- struct list_head *hook_list,
+ struct nf_hook_entry *hook_list,
  unsigned int hook,
  int thresh, u_int8_t pf,
  struct net_device *indev,
@@ -97,6 +97,12 @@ struct nf_hook_ops {
int priority;
 };
 
+struct nf_hook_entry {
+   struct nf_hook_entry __rcu  *next;
+   struct nf_hook_ops  ops;
+   const struct nf_hook_ops*orig_ops;
+};
+
 struct nf_sockopt_ops {
struct list_head list;
 
@@ -161,8 +167,6 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int 
hook,
 int (*okfn)(struct net *, struct sock *, 
struct sk_buff *),
 int thresh)
 {
-   struct list_head *hook_list;
-
 #ifdef HAVE_JUMP_LABEL
if (__builtin_constant_p(pf) &&
__builtin_constant_p(hook) &&
@@ -170,14 +174,14 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned 
int hook,
return 1;
 #endif
 
-   hook_list = &net->nf.hooks[pf][hook];
-
-   if (!list_empty(hook_list)) {
+   if (rcu_access_pointer(net->nf.hooks[pf][hook])) {
+   struct nf_hook_entry *hook_list;
struct nf_hook_state state;
int ret;
 
/* We may already have this, but read-locks nest anyway */
rcu_read_lock();
+   hook_list = rcu_dereference(net->nf.hooks[pf][hook]);
nf_hook_state_init(&state, hook_list, hook, thresh,
   pf, indev, outdev, sk, net, okfn);
 
diff --git a/include/linux/netfilter_ingress.h 
b/include/linux/netfilter_ingress.h
index 6965ba0..e3e3f6d 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -11,23 +11,27 @@ static inline bool nf_hook_ingress_active(const struct 
sk_buff *skb)
if 
(!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS]))
return false;
 #endif
-   return !list_empty(&skb->dev->nf_hooks_ingress);
+   return rcu_access_pointer(skb->dev->nf_hooks_ingress) != NULL;
 }
 
 /* caller must hold rcu_read_lock */
 static inline int nf_hook_ingress(struct sk_buff *skb)
 {
+   struct nf_hook_entry *e = rcu_dereference(skb->dev->nf_hooks_ingress);
struct nf_hook_state state;
 
-   nf_hook_state_init(&state, &skb->dev->nf_hooks_ingress,
-  NF_NETDEV_INGRESS, INT_MIN, NFPROTO_NETDEV,
-  skb->dev, NULL, NULL, dev_net(skb->dev), NULL);
+   if (unlikely(!e))
+   return 0;
+
+   nf_hook_state_init(&state, e, NF_NETDEV_INGRESS, INT_MIN,
+  NFPROTO_NETDEV, skb->dev, NULL, NULL,
+  dev_net(skb->dev), NULL);
return nf_hook_slow(skb, &state);
 }
 
 static inline void nf_hook_ingress_init(struct net_devic

[PATCH nf-next v2 1/3] netfilter: bridge: add and use br_nf_hook_thresh

2016-07-12 Thread Aaron Conole

From: Florian Westphal 

This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.

The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we do no longer need the thresh value to skip hooks
with a lower priority.

The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.

It's used only in the recursion cases of br_netfilter.

Signed-off-by: Florian Westphal 
Signed-off-by: Aaron Conole 
---
 include/net/netfilter/br_netfilter.h |  6 
 net/bridge/br_netfilter_hooks.c  | 57 ++--
 net/bridge/br_netfilter_ipv6.c   | 12 
 3 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/include/net/netfilter/br_netfilter.h 
b/include/net/netfilter/br_netfilter.h
index e8d1448..0b0c35c 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -15,6 +15,12 @@ static inline struct nf_bridge_info *nf_bridge_alloc(struct 
sk_buff *skb)
 
 void nf_bridge_update_protocol(struct sk_buff *skb);
 
+int br_nf_hook_thresh(unsigned int hook, struct net *net, struct sock *sk,
+ struct sk_buff *skb, struct net_device *indev,
+ struct net_device *outdev,
+ int (*okfn)(struct net *, struct sock *,
+ struct sk_buff *));
+
 static inline struct nf_bridge_info *
 nf_bridge_info_get(const struct sk_buff *skb)
 {
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 77e7f69..7856129 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -395,11 +396,10 @@ bridged_dnat:
skb->dev = nf_bridge->physindev;
nf_bridge_update_protocol(skb);
nf_bridge_push_encap_header(skb);
-   NF_HOOK_THRESH(NFPROTO_BRIDGE,
-  NF_BR_PRE_ROUTING,
-  net, sk, skb, skb->dev, NULL,
-  br_nf_pre_routing_finish_bridge,
-  1);
+   br_nf_hook_thresh(NF_BR_PRE_ROUTING,
+ net, sk, skb, skb->dev,
+ NULL,
+ br_nf_pre_routing_finish);
return 0;
}
ether_addr_copy(eth_hdr(skb)->h_dest, dev->dev_addr);
@@ -417,10 +417,8 @@ bridged_dnat:
skb->dev = nf_bridge->physindev;
nf_bridge_update_protocol(skb);
nf_bridge_push_encap_header(skb);
-   NF_HOOK_THRESH(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, net, sk, skb,
-  skb->dev, NULL,
-  br_handle_frame_finish, 1);
-
+   br_nf_hook_thresh(NF_BR_PRE_ROUTING, net, sk, skb, skb->dev, NULL,
+ br_handle_frame_finish);
return 0;
 }
 
@@ -992,6 +990,47 @@ static struct notifier_block brnf_notifier __read_mostly = 
{
.notifier_call = brnf_device_event,
 };
 
+/* recursively invokes nf_hook_slow (again), skipping already-called
+ * hooks (< NF_BR_PRI_BRNF).
+ *
+ * Called with rcu read lock held.
+ */
+int br_nf_hook_thresh(unsigned int hook, struct net *net,
+ struct sock *sk, struct sk_buff *skb,
+ struct net_device *indev,
+ struct net_device *outdev,
+ int (*okfn)(struct net *, struct sock *,
+ struct sk_buf *))
+{
+   struct nf_hook_ops *elem;
+   struct nf_hook_state state;
+   struct list_head *head;
+   int ret;
+
+   head = &net->nf.hooks[NFPROTO_BRIDGE][hook];
+
+   list_for_each_entry_rcu(elem, head, list) {
+   struct nf_hook_ops *next;
+
+   next = list_entry_rcu(list_next_rcu(&elem->list),
+ struct nf_hook_ops, list);
+   if (next->priority <= NF_BR_PRI_BRNF)
+   continue;
+   }
+
+   if (&elem->list == head)
+   return okfn(net, sk, skb);
+
+   nf_hook_state_init(&state, head, hook, NF_BR_PRI_BRNF + 1,
+  NFPROTO_BRIDGE, indev, outdev, sk, net, okfn);
+
+   ret = nf_hook_slow(skb, &state);
+   if (ret == 1)
+   ret = okfn(net, sk, skb);
+
+   return ret;
+}
+
 #ifdef CONFIG_SYSCTL
 static
 int brnf_sysctl_call_tables(struct ctl_table *ctl, int write,
diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
index 5e59a84..5989661 100644
--- a/net

[PATCH nf-next v2 2/3] netfilter: call nf_hook_state_init with rcu_read_lock held

2016-07-12 Thread Aaron Conole

From: Florian Westphal 

This makes things simpler because we can store the head of the list
in the nf_state structure without worrying about concurrent add/delete
of hook elements from the list.

Signed-off-by: Florian Westphal 
Signed-off-by: Aaron Conole 
---
 include/linux/netfilter.h  | 8 +++-
 include/linux/netfilter_ingress.h  | 1 +
 net/bridge/netfilter/ebt_redirect.c| 2 +-
 net/bridge/netfilter/ebtables.c| 2 +-
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 2 +-
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   | 2 +-
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 2 +-
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c | 2 +-
 net/netfilter/core.c   | 5 +
 net/netfilter/nf_conntrack_core.c  | 2 +-
 net/netfilter/nf_conntrack_h323_main.c | 2 +-
 net/netfilter/nf_conntrack_helper.c| 2 +-
 net/netfilter/nfnetlink_cthelper.c | 2 +-
 net/netfilter/nfnetlink_log.c  | 8 ++--
 net/netfilter/nfnetlink_queue.c| 2 +-
 net/netfilter/xt_helper.c  | 2 +-
 16 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 9230f9a..ad444f0 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -174,10 +174,16 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned 
int hook,
 
if (!list_empty(hook_list)) {
struct nf_hook_state state;
+   int ret;
 
+   /* We may already have this, but read-locks nest anyway */
+   rcu_read_lock();
nf_hook_state_init(&state, hook_list, hook, thresh,
   pf, indev, outdev, sk, net, okfn);
-   return nf_hook_slow(skb, &state);
+
+   ret = nf_hook_slow(skb, &state);
+   rcu_read_unlock();
+   return ret;
}
return 1;
 }
diff --git a/include/linux/netfilter_ingress.h 
b/include/linux/netfilter_ingress.h
index 5fcd375..6965ba0 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -14,6 +14,7 @@ static inline bool nf_hook_ingress_active(const struct 
sk_buff *skb)
return !list_empty(&skb->dev->nf_hooks_ingress);
 }
 
+/* caller must hold rcu_read_lock */
 static inline int nf_hook_ingress(struct sk_buff *skb)
 {
struct nf_hook_state state;
diff --git a/net/bridge/netfilter/ebt_redirect.c 
b/net/bridge/netfilter/ebt_redirect.c
index 20396499..2e7c4f9 100644
--- a/net/bridge/netfilter/ebt_redirect.c
+++ b/net/bridge/netfilter/ebt_redirect.c
@@ -24,7 +24,7 @@ ebt_redirect_tg(struct sk_buff *skb, const struct 
xt_action_param *par)
return EBT_DROP;
 
if (par->hooknum != NF_BR_BROUTING)
-   /* rcu_read_lock()ed by nf_hook_slow */
+   /* rcu_read_lock()ed by nf_hook_thresh */
ether_addr_copy(eth_hdr(skb)->h_dest,
br_port_get_rcu(par->in)->br->dev->dev_addr);
else
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index cceac5b..dd71332 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -146,7 +146,7 @@ ebt_basic_match(const struct ebt_entry *e, const struct 
sk_buff *skb,
return 1;
if (NF_INVF(e, EBT_IOUT, ebt_dev_check(e->out, out)))
return 1;
-   /* rcu_read_lock()ed by nf_hook_slow */
+   /* rcu_read_lock()ed by nf_hook_thresh */
if (in && (p = br_port_get_rcu(in)) != NULL &&
NF_INVF(e, EBT_ILOGICALIN,
ebt_dev_check(e->logical_in, p->br->dev)))
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c 
b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index ae1a71a..eab0239 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -110,7 +110,7 @@ static unsigned int ipv4_helper(void *priv,
if (!help)
return NF_ACCEPT;
 
-   /* rcu_read_lock()ed by nf_hook_slow */
+   /* rcu_read_lock()ed by nf_hook_thresh */
helper = rcu_dereference(help->helper);
if (!helper)
return NF_ACCEPT;
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c 
b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index c567e1b..2c08d6a 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -149,7 +149,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, 
struct sk_buff *skb,
return -NF_ACCEPT;
}
 
-   /* rcu_read_lock()ed by nf_hook_slow */
+   /* rcu_read_lock()ed by nf_hook_thresh */
innerproto = __nf_ct_l4proto_find(PF_INET, origtuple.dst.protonum);
 
/* Ordinarily, we'd expect the inverted tupleproto, but it's
diff --git a/net/ipv6/netfilter/nf_co

[PATCH 1/1] net: nps_enet: Fix PCS reset

2016-07-12 Thread Elad Kanfi

From: Noam Camus 

During commit b54b8c2d6e3c
 ("net: ezchip: adapt driver to little endian architecture")
 adapting to little endian architecture,
 zeroing of controller was left out.

Signed-off-by: Elad Kanfi 
---
 drivers/net/ethernet/ezchip/nps_enet.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/ezchip/nps_enet.c 
b/drivers/net/ethernet/ezchip/nps_enet.c
index 06f0317..9b7a3f5 100644
--- a/drivers/net/ethernet/ezchip/nps_enet.c
+++ b/drivers/net/ethernet/ezchip/nps_enet.c
@@ -285,6 +285,7 @@ static void nps_enet_hw_reset(struct net_device *ndev)
ge_rst_value |= NPS_ENET_ENABLE << RST_GMAC_0_SHIFT;
nps_enet_reg_set(priv, NPS_ENET_REG_GE_RST, ge_rst_value);
usleep_range(10, 20);
+   ge_rst_value = 0;
nps_enet_reg_set(priv, NPS_ENET_REG_GE_RST, ge_rst_value);
 
/* Tx fifo reset sequence */
-- 
1.7.1

Re: [PATCH v3] Marvell phy: add fiber status check and configuration for some phys

2016-07-12 Thread Andrew Lunn

> +#define LPA_FIBER_1000HALF   0x40
> +#define LPA_FIBER_1000FULL   0x20
> +
> +#define LPA_PAUSE_FIBER  0x180
> +#define LPA_PAUSE_ASYM_FIBER 0x100
> +
> +#define ADVERTISE_FIBER_1000HALF 0x40
> +#define ADVERTISE_FIBER_1000FULL 0x20
> +
> +#define ADVERTISE_PAUSE_FIBER0x180
> +#define ADVERTISE_PAUSE_ASYM_FIBER   0x100

Are these standardised anywhere? If they are following a standard,
they should be put into include/uapi/linux/mii.h.

> +static inline u32 ethtool_adv_to_fiber_adv_t(u32 ethadv)
> +{
> + u32 result = 0;
> +
> + if (ethadv & ADVERTISED_1000baseT_Half)
> + result |= ADVERTISE_FIBER_1000HALF;

Dumb question: Does 1000baseT_Half even make sense for fibre? Can you
do half duplex?  Would that not mean you have a single fibre, both
ends are using the same laser frequency, and you are doing some form
of CSMA/CD?

> + if (ethadv & ADVERTISED_1000baseT_Full)
> + result |= ADVERTISE_FIBER_1000FULL;
> +
> + if ((ethadv & ADVERTISE_PAUSE_ASYM) && (ethadv & ADVERTISE_PAUSE_CAP))
> + result |= LPA_PAUSE_ASYM_FIBER;
> + else if (ethadv & ADVERTISE_PAUSE_CAP)
> + result |= (ADVERTISE_PAUSE_FIBER
> +& (~ADVERTISE_PAUSE_ASYM_FIBER));
> +
> + return result;
> +}

If these values are standardised, i think this function should be
moved into the generic code. If however, this is Marvell specific,
keep it here.

>   *
>   * Generic status code does not detect Fiber correctly!
> @@ -906,12 +1070,17 @@ static int marvell_read_status(struct phy_device 
> *phydev)
>   int lpa;
>   int lpagb;
>   int status = 0;
> + int page, fiber;
>  
> - /* Update the link, but return if there
> + /* Detect and update the link, but return if there
>* was an error */
> - err = genphy_update_link(phydev);
> - if (err)
> - return err;
> + page = phy_read(phydev, MII_MARVELL_PHY_PAGE);
> + if (page == MII_M_FIBER)
> + fiber = 1;
> + else
> + fiber = 0;

This read is expensive, since the MDIO bus is slow. It would be better
just to pass fibre as a parameter.

> +/* marvell_read_fiber_status
> + *
> + * Some Marvell's phys have two modes: fiber and copper.
> + * Both need status checked.
> + * Description:
> + *   First, check the fiber link and status.
> + *   If the fiber link is down, check the copper link and status which
> + *   will be the default value if both link are down.
> + */
> +static int marvell_read_fiber_status(struct phy_device *phydev)

The name is a bit confusing. I would probably use
marvell_read_copper_fiber_status() making it clear it reads both.

> +{
> + int err;
> +
> + /* Check the fiber mode first */
> + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER);
> + if (err < 0)
> + goto error;
> +
> + err = marvell_read_status(phydev);
> + if (err < 0)
> + goto error;
> +
> + if (phydev->link) {
> + phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER);
> + return 0;
> + }
> +
> + /* If fiber link is down, check and save copper mode state */
> + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER);
> + if (err < 0)
> + goto error;

There should be a big fat comment somewhere that after this function
the copper or fibre page is left selected, depending on which has
link.  There is a danger somebody misses this assumption and breaks
the code by unconditionally restoring to copper.

> +
> + return marvell_read_status(phydev);
> +
> +error:
> + phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER);
> + return err;
> +}
> +
> +/* marvell_suspend_fiber
> + *
> + * Some Marvell's phys have two modes: fiber and copper.
> + * Both need to be suspended
> + */
> +static int marvell_suspend_fiber(struct phy_device *phydev)
> +{
> + int err;
> +
> + /* Suspend the fiber mode first */
> + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER);
> + if (err < 0)
> + goto error;
> +
> + err = genphy_suspend(phydev);
> + if (err < 0)
> + goto error;
> +
> + /* Then, the copper link */
> + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER);
> + if (err < 0)
> + goto error;
> +
> + return genphy_suspend(phydev);
> +
> +error:
> + phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER);
> + return err;
> +}

I think it would be better to look for SUPPORTED_FIBRE in
drv->features, rather than have two different functions.

In fact, i would do that in general, rather than add your _fibre()
functions.

> +
> +/* marvell_resume_fiber
> + *
> + * Some Marvell's phys have two modes: fiber and copper.
> + * Both need to be resumed
> + */
> +static int marvell_resume_fiber(struct phy_device *phydev)
> +{
> + int err;
> +
> + /* Resume the fiber mode first */
> +

Re: [PATCH v8 01/11] bpf: add XDP prog type for early driver filter

2016-07-12 Thread Jakub Kicinski

On Tue, 12 Jul 2016 07:52:53 -0700, Tom Herbert wrote:
> On Tue, Jul 12, 2016 at 6:14 AM, Jesper Dangaard Brouer
>  wrote:
> >
> > On Tue, 12 Jul 2016 00:51:24 -0700 Brenden Blanco  
> > wrote:
> >  
> >> Add a new bpf prog type that is intended to run in early stages of the
> >> packet rx path. Only minimal packet metadata will be available, hence a
> >> new context type, struct xdp_md, is exposed to userspace. So far only
> >> expose the packet start and end pointers, and only in read mode.
> >>
> >> An XDP program must return one of the well known enum values, all other
> >> return codes are reserved for future use. Unfortunately, this
> >> restriction is hard to enforce at verification time, so take the
> >> approach of warning at runtime when such programs are encountered. Out
> >> of bounds return codes should alias to XDP_ABORTED.  
> >
> > This is going to be a serious usability problem, when XDP gets extended
> > with newer features.
> >
> > How can users know what XDP capabilities a given driver support?

I'm personally not a big fan of return codes.  I'm hoping we can express
most decisions by setting fields in metadata.  It provides a better
structure and is easier to detect/translate/optimize.

> We talked about this a the XDP summit and I have some slides on this
> (hope to have slides posted shortly) . Drivers, more generally XDP
> platforms, will provide a list APIs that they support. APIs would be
> contained in a sort common specification files that and would have
> some name like basic_xdp_api, adv_switch_api, etc. An XDP program is
> written using one of the published APIs and the API it uses is
> expressed as part of the program. At runtime (possibly compile time)
> the API used by the program is checked against the list of APIs for
> the target platform-- if the API is not in the set then the program is
> not allowed to be loaded. To whatever extent possible we should also
> try to verify that program adhere's to the API as load and compile
> time. In the event that program violates the API it claims to be
> using, such as return invalid return code for the API, that is an
> error condition.

+1

> > If the user loads a XDP program using capabilities not avail in the
> > driver, then all his traffic will be hard dropped.
> >
> >
> > My proposal is to NOT allow XDP programs to be loaded if they want to use
> > return-codes/features not supported by the driver.  Thus, eBPF programs
> > register/annotate their needed capabilities upfront (I guess this could
> > also help HW offload engines).

Not sure we need annotation, use of an API will probably be easily
detectable (call of a function, access to metadata field).  Verifier
could collect that info in-kernel with little effort.

> Exactly. We just need to define exactly how this is done ;-)
> 
> One open issue is whether XDP needs to be binary compatible across
> platforms or source code compatible (also require recompile in latter
> case for each platform). Personally I prefer source code compatible
> since that might allow platforms to implement the API specifically for
> their needs (like the ordering of fields in a meta data structure.

Since XDP is so close to hardware by design, I think we could consider
introducing per-HW translation stage between the verifier and host JIT.
For extended metadata problem for example - we could define metadata as:

meta {
void *packet_start;
void *packet_end;
struct standard_meta *extended_meta;
}

standard_meta {
vlan;
timestamp;
hash;
...
}

Program coming from the user space would just use standard_meta but
per-driver/per-HW translator would patch it up.  extended_meta pointer
would probably become pointer to HW-specific queue descriptor at
runtime.  The per-HW translator stage would change the offsets and add
necessary checks and fix-ups.  It could even be possible to translate
from extended_meta access to access to metadata prepended to the frame,
translator would need a hint from the verifier on how to get the packet
pointer.

I haven't thought this through in detail, it's just an idea which would
help us to keep binary compatibility.  HW-specific optimizations in
user space compiler would be great, but breaking binary compatibility
is a bit of a scary step.

[patch net-next v2 2/2] mlxsw: core: Trace EMAD messages

2016-07-12 Thread Jiri Pirko

From: Jiri Pirko 

Trace EMAD messages going down to HW and up from HW. Devlink needs to be
registered before EMAD init so the trace function can be called
with valid devlink handle.

Signed-off-by: Jiri Pirko 
v1->v2:
- Use trace_devlink_hwmsg directly
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 01ae548..480a3ba 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "core.h"
 #include "item.h"
@@ -447,6 +448,10 @@ static int mlxsw_emad_transmit(struct mlxsw_core 
*mlxsw_core,
if (!skb)
return -ENOMEM;
 
+   trace_devlink_hwmsg(priv_to_devlink(mlxsw_core), false, 0,
+   skb->data + mlxsw_core->driver->txhdr_len,
+   skb->len - mlxsw_core->driver->txhdr_len);
+
atomic_set(&trans->active, 1);
err = mlxsw_core_skb_transmit(mlxsw_core, skb, &trans->tx_info);
if (err) {
@@ -529,6 +534,9 @@ static void mlxsw_emad_rx_listener_func(struct sk_buff 
*skb, u8 local_port,
struct mlxsw_core *mlxsw_core = priv;
struct mlxsw_reg_trans *trans;
 
+   trace_devlink_hwmsg(priv_to_devlink(mlxsw_core), true, 0,
+   skb->data, skb->len);
+
if (!mlxsw_emad_is_resp(skb))
goto free_skb;
 
@@ -1110,14 +1118,14 @@ int mlxsw_core_bus_device_register(const struct 
mlxsw_bus_info *mlxsw_bus_info,
if (err)
goto err_emad_init;
 
-   err = mlxsw_hwmon_init(mlxsw_core, mlxsw_bus_info, &mlxsw_core->hwmon);
-   if (err)
-   goto err_hwmon_init;
-
err = devlink_register(devlink, mlxsw_bus_info->dev);
if (err)
goto err_devlink_register;
 
+   err = mlxsw_hwmon_init(mlxsw_core, mlxsw_bus_info, &mlxsw_core->hwmon);
+   if (err)
+   goto err_hwmon_init;
+
err = mlxsw_driver->init(mlxsw_core, mlxsw_bus_info);
if (err)
goto err_driver_init;
@@ -1131,9 +1139,9 @@ int mlxsw_core_bus_device_register(const struct 
mlxsw_bus_info *mlxsw_bus_info,
 err_debugfs_init:
mlxsw_core->driver->fini(mlxsw_core);
 err_driver_init:
+err_hwmon_init:
devlink_unregister(devlink);
 err_devlink_register:
-err_hwmon_init:
mlxsw_emad_fini(mlxsw_core);
 err_emad_init:
mlxsw_bus->fini(bus_priv);
-- 
2.5.5

[patch net-next v2 1/2] devlink: add hardware messages tracing facility

2016-07-12 Thread Jiri Pirko

From: Jiri Pirko 

Define a tracepoint and allow user to trace messages going to and from
hardware associated with devlink instance.

Signed-off-by: Jiri Pirko 
---
v1->v2:
- Use EXPORT_TRACEPOINT_SYMBOL_GPL instead of a wrapper function
  as suggested by David Ahern and Steven Rostedt
---
 include/trace/events/devlink.h | 68 ++
 net/core/devlink.c |  4 +++
 2 files changed, 72 insertions(+)
 create mode 100644 include/trace/events/devlink.h

diff --git a/include/trace/events/devlink.h b/include/trace/events/devlink.h
new file mode 100644
index 000..333c32a
--- /dev/null
+++ b/include/trace/events/devlink.h
@@ -0,0 +1,68 @@
+#if IS_ENABLED(CONFIG_NET_DEVLINK)
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM devlink
+
+#if !defined(_TRACE_DEVLINK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DEVLINK_H
+
+#include 
+#include 
+#include 
+
+/*
+ * Tracepoint for devlink hardware message:
+ */
+TRACE_EVENT(devlink_hwmsg,
+   TP_PROTO(const struct devlink *devlink, bool incoming,
+unsigned long type, const u8 *buf, size_t len),
+
+   TP_ARGS(devlink, incoming, type, buf, len),
+
+   TP_STRUCT__entry(
+   __string(bus_name, devlink->dev->bus->name)
+   __string(dev_name, dev_name(devlink->dev))
+   __string(owner_name, devlink->dev->driver->owner->name)
+   __field(bool, incoming)
+   __field(unsigned long, type)
+   __dynamic_array(u8, buf, len)
+   __field(size_t, len)
+   ),
+
+   TP_fast_assign(
+   __assign_str(bus_name, devlink->dev->bus->name);
+   __assign_str(dev_name, dev_name(devlink->dev));
+   __assign_str(owner_name, devlink->dev->driver->owner->name);
+   __entry->incoming = incoming;
+   __entry->type = type;
+   memcpy(__get_dynamic_array(buf), buf, len);
+   __entry->len = len;
+   ),
+
+   TP_printk("bus_name=%s dev_name=%s owner_name=%s incoming=%d type=%lu 
buf=0x[%*phD] len=%lu",
+ __get_str(bus_name), __get_str(dev_name),
+ __get_str(owner_name), __entry->incoming, __entry->type,
+ (int) __entry->len, __get_dynamic_array(buf), __entry->len)
+);
+
+#endif /* _TRACE_DEVLINK_H */
+
+/* This part must be outside protection */
+#include 
+
+#else /* CONFIG_NET_DEVLINK */
+
+#if !defined(_TRACE_DEVLINK_H)
+#define _TRACE_DEVLINK_H
+
+#include 
+
+static inline void trace_devlink_hwmsg(const struct devlink *devlink,
+  bool incoming, unsigned long type,
+  const u8 *buf, size_t len)
+{
+}
+
+#endif /* _TRACE_DEVLINK_H */
+
+#endif
diff --git a/net/core/devlink.c b/net/core/devlink.c
index b2e592a..1b50630 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -26,6 +26,10 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include 
+
+EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_hwmsg);
 
 static LIST_HEAD(devlink_list);
 
-- 
2.5.5

[PATCH 4/6] ipvs: fix bind to link-local mcast IPv6 address in backup

2016-07-12 Thread Pablo Neira Ayuso

From: Quentin Armitage 

When using HEAD from
https://git.kernel.org/cgit/utils/kernel/ipvsadm/ipvsadm.git/,
the command:
ipvsadm --start-daemon backup --mcast-interface eth0.60 \
--mcast-group ff02::1:81
fails with the error message:
Argument list too long

whereas both:
ipvsadm --start-daemon master --mcast-interface eth0.60 \
--mcast-group ff02::1:81
and:
ipvsadm --start-daemon backup --mcast-interface eth0.60 \
--mcast-group 224.0.0.81
are successful.

The error message "Argument list too long" isn't helpful. The error occurs
because an IPv6 address is given in backup mode.

The error is in make_receive_sock() in net/netfilter/ipvs/ip_vs_sync.c,
since it fails to set the interface on the address or the socket before
calling inet6_bind() (via sock->ops->bind), where the test
'if (!sk->sk_bound_dev_if)' failed.

Setting sock->sk->sk_bound_dev_if on the socket before calling
inet6_bind() resolves the issue.

Fixes: d33288172e72 ("ipvs: add more mcast parameters for the sync daemon")
Signed-off-by: Quentin Armitage 
Acked-by: Julian Anastasov 
Signed-off-by: Simon Horman 
---
 net/netfilter/ipvs/ip_vs_sync.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index 803001a..1b07578 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -1545,7 +1545,8 @@ error:
 /*
  *  Set up receiving multicast socket over UDP
  */
-static struct socket *make_receive_sock(struct netns_ipvs *ipvs, int id)
+static struct socket *make_receive_sock(struct netns_ipvs *ipvs, int id,
+   int ifindex)
 {
/* multicast addr */
union ipvs_sockaddr mcast_addr;
@@ -1566,6 +1567,7 @@ static struct socket *make_receive_sock(struct netns_ipvs 
*ipvs, int id)
set_sock_size(sock->sk, 0, result);
 
get_mcast_sockaddr(&mcast_addr, &salen, &ipvs->bcfg, id);
+   sock->sk->sk_bound_dev_if = ifindex;
result = sock->ops->bind(sock, (struct sockaddr *)&mcast_addr, salen);
if (result < 0) {
pr_err("Error binding to the multicast addr\n");
@@ -1868,7 +1870,7 @@ int start_sync_thread(struct netns_ipvs *ipvs, struct 
ipvs_sync_daemon_cfg *c,
if (state == IP_VS_STATE_MASTER)
sock = make_send_sock(ipvs, id);
else
-   sock = make_receive_sock(ipvs, id);
+   sock = make_receive_sock(ipvs, id, dev->ifindex);
if (IS_ERR(sock)) {
result = PTR_ERR(sock);
goto outtinfo;
-- 
2.1.4

[PATCH 2/6] netfilter: nft_meta: set skb->nf_trace appropriately

2016-07-12 Thread Pablo Neira Ayuso

From: Liping Zhang 

When user add a nft rule to set nftrace to zero, for example:

  # nft add rule ip filter input nftrace set 0

We should set nf_trace to zero also.

Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_meta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 16c50b0..f4bad9d 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -227,7 +227,7 @@ void nft_meta_set_eval(const struct nft_expr *expr,
skb->pkt_type = value;
break;
case NFT_META_NFTRACE:
-   skb->nf_trace = 1;
+   skb->nf_trace = !!value;
break;
default:
WARN_ON(1);
-- 
2.1.4

[PATCH 1/6] netfilter: nf_tables: fix memory leak if expr init fails

2016-07-12 Thread Pablo Neira Ayuso

From: Liping Zhang 

If expr init fails then we need to free it.

So when the user add a nft rule as follows:

  # nft add rule filter input tcp dport 22 flow table ssh \
{ ip saddr limit rate 0/second }

memory leak will happen.

Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_tables_api.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 2c88187..cf7c745 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1724,9 +1724,11 @@ struct nft_expr *nft_expr_init(const struct nft_ctx *ctx,
 
err = nf_tables_newexpr(ctx, &info, expr);
if (err < 0)
-   goto err2;
+   goto err3;
 
return expr;
+err3:
+   kfree(expr);
 err2:
module_put(info.ops->type->owner);
 err1:
-- 
2.1.4

[PATCH 0/6] Netfilter/IPVS fixes for net

2016-07-12 Thread Pablo Neira Ayuso

Hi David,

The following patchset contains Netfilter/IPVS fixes for your net tree.
they are:

1) Fix leak in the error path of nft_expr_init(), from Liping Zhang.

2) Tracing from nf_tables cannot be disabled, also from Zhang.

3) Fix an integer overflow on 32bit archs when setting the number of
   hashtable buckets, from Florian Westphal.

4) Fix configuration of ipvs sync in backup mode with IPv6 address,
   from Quentin Armitage via Simon Horman.

5) Fix incorrect timeout calculation in nft_ct NFT_CT_EXPIRATION,
   from Florian Westphal.

6) Skip clash resolution in conntrack insertion races if NAT is in
   place.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

We're rather late in the release cycle of 4.7, so if these cannot get
upstream I'll make sure to submit them to -stable, no problem.

Thanks!



The following changes since commit acd43fe85b2d1dbad55ce211b8817e6d6687246f:

  Merge branch 'mlx4-fixes' (2016-06-22 16:38:17 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD

for you to fetch changes up to 590b52e10d410e1439ae86be9fe19d75fdab628b:

  netfilter: conntrack: skip clash resolution if nat is in place (2016-07-12 
16:28:41 +0200)


Florian Westphal (2):
  netfilter: conntrack: avoid integer overflow when resizing
  netfilter: nft_ct: fix expiration getter

Liping Zhang (2):
  netfilter: nf_tables: fix memory leak if expr init fails
  netfilter: nft_meta: set skb->nf_trace appropriately

Pablo Neira Ayuso (2):
  Merge tag 'ipvs-fixes2-for-v4.7' of https://git.kernel.org/.../horms/ipvs
  netfilter: conntrack: skip clash resolution if nat is in place

Quentin Armitage (1):
  ipvs: fix bind to link-local mcast IPv6 address in backup

 include/net/netfilter/nf_conntrack.h | 8 
 net/netfilter/ipvs/ip_vs_sync.c  | 6 --
 net/netfilter/nf_conntrack_core.c| 8 
 net/netfilter/nf_tables_api.c| 4 +++-
 net/netfilter/nft_ct.c   | 6 +-
 net/netfilter/nft_meta.c | 2 +-
 6 files changed, 25 insertions(+), 9 deletions(-)

[PATCH 6/6] netfilter: conntrack: skip clash resolution if nat is in place

2016-07-12 Thread Pablo Neira Ayuso

The clash resolution is not easy to apply if the NAT table is
registered. Even if no NAT rules are installed, the nul-binding ensures
that a unique tuple is used, thus, the packet that loses race gets a
different source port number, as described by:

http://marc.info/?l=netfilter-devel&m=146818011604484&w=2

Clash resolution with NAT is also problematic if addresses/port range
ports are used since the conntrack that wins race may describe a
different mangling that we may have earlier applied to the packet via
nf_nat_setup_info().

Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on 
insertion race")
Signed-off-by: Pablo Neira Ayuso 
Tested-by: Marc Dionne 
---
 net/netfilter/nf_conntrack_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 62c42e9..9f530ad 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -646,6 +646,7 @@ static int nf_ct_resolve_clash(struct net *net, struct 
sk_buff *skb,
 
l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct));
if (l4proto->allow_clash &&
+   !nfct_nat(ct) &&
!nf_ct_is_dying(ct) &&
atomic_inc_not_zero(&ct->ct_general.use)) {
nf_ct_acct_merge(ct, ctinfo, (struct nf_conn *)skb->nfct);
-- 
2.1.4

[PATCH 3/6] netfilter: conntrack: avoid integer overflow when resizing

2016-07-12 Thread Pablo Neira Ayuso

From: Florian Westphal 

Can overflow so we might allocate very small table when bucket count is
high on a 32bit platform.

Note: resize is only possible from init_netns.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_core.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index f204274..62c42e9 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1601,8 +1601,15 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int 
nulls)
unsigned int nr_slots, i;
size_t sz;
 
+   if (*sizep > (UINT_MAX / sizeof(struct hlist_nulls_head)))
+   return NULL;
+
BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct 
hlist_head));
nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct 
hlist_nulls_head));
+
+   if (nr_slots > (UINT_MAX / sizeof(struct hlist_nulls_head)))
+   return NULL;
+
sz = nr_slots * sizeof(struct hlist_nulls_head);
hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO,
get_order(sz));
-- 
2.1.4

[PATCH 5/6] netfilter: nft_ct: fix expiration getter

2016-07-12 Thread Pablo Neira Ayuso

From: Florian Westphal 

We need to compute timeout.expires - jiffies, not the other way around.
Add a helper, another patch can then later change more places in
conntrack code where we currently open-code this.

Will allow us to only change one place later when we remove per-ct timer.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_conntrack.h | 8 
 net/netfilter/nft_ct.c   | 6 +-
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h 
b/include/net/netfilter/nf_conntrack.h
index dd78bea..b6083c3 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -284,6 +284,14 @@ static inline bool nf_is_loopback_packet(const struct 
sk_buff *skb)
return skb->dev && skb->skb_iif && skb->dev->flags & IFF_LOOPBACK;
 }
 
+/* jiffies until ct expires, 0 if already expired */
+static inline unsigned long nf_ct_expires(const struct nf_conn *ct)
+{
+   long timeout = (long)ct->timeout.expires - (long)jiffies;
+
+   return timeout > 0 ? timeout : 0;
+}
+
 struct kernel_param;
 
 int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 137e308..81fbb45 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -54,7 +54,6 @@ static void nft_ct_get_eval(const struct nft_expr *expr,
const struct nf_conn_help *help;
const struct nf_conntrack_tuple *tuple;
const struct nf_conntrack_helper *helper;
-   long diff;
unsigned int state;
 
ct = nf_ct_get(pkt->skb, &ctinfo);
@@ -94,10 +93,7 @@ static void nft_ct_get_eval(const struct nft_expr *expr,
return;
 #endif
case NFT_CT_EXPIRATION:
-   diff = (long)jiffies - (long)ct->timeout.expires;
-   if (diff < 0)
-   diff = 0;
-   *dest = jiffies_to_msecs(diff);
+   *dest = jiffies_to_msecs(nf_ct_expires(ct));
return;
case NFT_CT_HELPER:
if (ct->master == NULL)
-- 
2.1.4

Re: eBPF tunable max instructions or max tail call?

2016-07-12 Thread Sargun Dhillon

On Mon, Jul 11, 2016 at 8:14 PM, Alexei Starovoitov
 wrote:
> On Mon, Jul 11, 2016 at 05:56:07PM -0700, Sargun Dhillon wrote:
>> It would be nice to have eBPF programs that are longer than 4096
>> instructions. I'm trying to implement XSalsa20 in eBPF, and
>> unfortunately, it doesn't fit into 4096 instructions since I'm
>> unrolling all of the loops. Further than that, doing tail calls to
>> process each block results in me hitting the tail call limit.
>
> a cipher in bpf? wow. that's pushing it :)
> we've been discussing various way of adding 'bounded loop' instruction
> to avoid manual unrolling, but it will be still limited to the 4k
> instruction per program, so probably won't help this use case.
> Are you trying to do it in the networking context?

Yeah, I'm trying to do this as a TC filter. Instruction wise, each 64
byte chunk is about 5000 instructions using LLVM's automatic loop
unrolling. I need the first and last invocation to be for finishing
and initializing the key schedule, setting checksums, etc.. So, I'm
pretty close -- this implementation wasn't actually XSalsa20, it was a
port of the Kernel's implementation of Salsa20. I think bumping the
instruction limit to 8k would do the trick.

>
>> It don't think that it makes much sense to expose the crypto API as
>> BPF helpers, as I'm not sure if we can ensure safety, and timely
>> execution with it. I may be wrong here, and if there is a sane, safe
>> way to expose the crypto API, I'm all ears.
>
> we had the patches to connect crypto api with bpf, but they were
> too hacky to upstream, since then we redesigned the approach
> and the latest should be much cleaner. The keys will be managed
> through normal xfrm api and bpf will call into crypto with
> mechanism similar to tail-call. The program will specify the
> offset/length within the packet to encrypt/decrypt and next
> program to execute when crypto operation completes.
> Root only for xdp and tc only.
>
This is really interesting to me. Right now, I'm passing the key via
embedding it in the code itself. It allows LLVM to do a bit more
optimization. The crypto APIs are really nice and well fleshed out.
XFRM on the other hand introduces a lot of complexity that I'm trying
to avoid. It'd be nice if we could treat cryptographic state as just
another type of BPF map.

>> Other than that, it would be nice to make the max instructions a knob,
>> and I don't think that it has much downside, given it's only checked
>> on load time. It would be nice to make the tail call limit a tunable
>> as well, but I'm unsure of the performance impact it might have given
>> that it's checked at runtime.
>>
>> What do y'all think is reasonable? Make them both tunable? Just one? None?
>
> It is preferred to achieve the goal without introducing a knob.
> Also sounds like that increasing 4k to 8k won't really solve it anyway.
>

Re: [PATCH v3] Marvell phy: add fiber status check and configuration for some phys

2016-07-12 Thread Charles-Antoine Couret



Le 12/07/2016 à 17:18, Andrew Lunn a écrit :
> Hi Charles
It's Charles-Antoine. ;)

> 
> It is best to submit a number of smaller patches, each doing one
> thing, than a single big patch. It makes review and discussion much
> simpler.
I'm sorry, I will fix that.


> So for example, this should be a patch of its own:
> 
>> @@ -151,6 +165,7 @@ struct marvell_hw_stat {
>>  
>>  static struct marvell_hw_stat marvell_hw_stats[] = {
>>  { "phy_receive_errors", 0, 21, 16},
>> +{ "phy_receive_errors_fiber", 1, 21, 16},
>>  { "phy_idle_errors", 0, 10, 8 },
>>  };
> 
> I think we should also rename phy_receive_errors to
> phy_receive_errors_copper.

I agree with you. I will fix that too.
Thanks.
Charles-Antoine Couret

Re: [PATCH v7 04/11] net/mlx4_en: add support for fast rx drop bpf program

2016-07-12 Thread Brenden Blanco

On Mon, Jul 11, 2016 at 11:17:00PM -0700, David Miller wrote:
> From: Brenden Blanco 
> Date: Mon, 11 Jul 2016 14:29:51 -0700
> 
> > +   if (priv->num_frags > 1)
> > +   return -EOPNOTSUPP;
> 
> I hate to be the user who has to debug why his XDP program won't
> load just because he set a jumbo MTU beforehand.
Me too. Patch v8 now has an en_err message, and double checked that a
message is printed when MTU is attempted to be increased when XDP
program has already been loaded.

Thanks.

Re: [patch net-next v2 1/2] devlink: add hardware messages tracing facility

2016-07-12 Thread Steven Rostedt

On Tue, 12 Jul 2016 18:05:03 +0200
Jiri Pirko  wrote:

> From: Jiri Pirko 
> 
> Define a tracepoint and allow user to trace messages going to and from
> hardware associated with devlink instance.
> 
> Signed-off-by: Jiri Pirko 
> ---
> v1->v2:
> - Use EXPORT_TRACEPOINT_SYMBOL_GPL instead of a wrapper function
>   as suggested by David Ahern and Steven Rostedt

FYI, you can use the Suggested-by: tag too ;-)

> ---
>  include/trace/events/devlink.h | 68 
> ++
>  net/core/devlink.c |  4 +++
>  2 files changed, 72 insertions(+)
>  create mode 100644 include/trace/events/devlink.h
> 
> diff --git a/include/trace/events/devlink.h b/include/trace/events/devlink.h
> new file mode 100644
> index 000..333c32a
> --- /dev/null
> +++ b/include/trace/events/devlink.h
> @@ -0,0 +1,68 @@
> +#if IS_ENABLED(CONFIG_NET_DEVLINK)
> +
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM devlink
> +
> +#if !defined(_TRACE_DEVLINK_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_DEVLINK_H
> +
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * Tracepoint for devlink hardware message:
> + */
> +TRACE_EVENT(devlink_hwmsg,
> + TP_PROTO(const struct devlink *devlink, bool incoming,
> +  unsigned long type, const u8 *buf, size_t len),
> +
> + TP_ARGS(devlink, incoming, type, buf, len),
> +
> + TP_STRUCT__entry(
> + __string(bus_name, devlink->dev->bus->name)
> + __string(dev_name, dev_name(devlink->dev))
> + __string(owner_name, devlink->dev->driver->owner->name)
> + __field(bool, incoming)
> + __field(unsigned long, type)
> + __dynamic_array(u8, buf, len)
> + __field(size_t, len)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(bus_name, devlink->dev->bus->name);
> + __assign_str(dev_name, dev_name(devlink->dev));
> + __assign_str(owner_name, devlink->dev->driver->owner->name);
> + __entry->incoming = incoming;
> + __entry->type = type;
> + memcpy(__get_dynamic_array(buf), buf, len);
> + __entry->len = len;
> + ),
> +
> + TP_printk("bus_name=%s dev_name=%s owner_name=%s incoming=%d type=%lu 
> buf=0x[%*phD] len=%lu",
> +   __get_str(bus_name), __get_str(dev_name),
> +   __get_str(owner_name), __entry->incoming, __entry->type,
> +   (int) __entry->len, __get_dynamic_array(buf), __entry->len)
> +);
> +
> +#endif /* _TRACE_DEVLINK_H */
> +
> +/* This part must be outside protection */
> +#include 
> +
> +#else /* CONFIG_NET_DEVLINK */
> +
> +#if !defined(_TRACE_DEVLINK_H)
> +#define _TRACE_DEVLINK_H

Ah, I guess you do need header protection here.

OK, looks good to me.

Acked-by: Steven Rostedt 

-- Steve

> +
> +#include 
> +
> +static inline void trace_devlink_hwmsg(const struct devlink *devlink,
> +bool incoming, unsigned long type,
> +const u8 *buf, size_t len)
> +{
> +}
> +
> +#endif /* _TRACE_DEVLINK_H */
> +
> +#endif
> diff --git a/net/core/devlink.c b/net/core/devlink.c
> index b2e592a..1b50630 100644
> --- a/net/core/devlink.c
> +++ b/net/core/devlink.c
> @@ -26,6 +26,10 @@
>  #include 
>  #include 
>  #include 
> +#define CREATE_TRACE_POINTS
> +#include 
> +
> +EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_hwmsg);
>  
>  static LIST_HEAD(devlink_list);
>

Re: [net-next PATCH RFC] mlx4: RX prefetch loop

2016-07-12 Thread Alexander Duyck

On Tue, Jul 12, 2016 at 5:45 AM, Jesper Dangaard Brouer
 wrote:
> On Mon, 11 Jul 2016 16:05:11 -0700
> Alexei Starovoitov  wrote:
>
>> On Mon, Jul 11, 2016 at 01:09:22PM +0200, Jesper Dangaard Brouer wrote:
>> > > - /* Process all completed CQEs */
>> > > + /* Extract and prefetch completed CQEs */
>> > >   while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
>> > >   cq->mcq.cons_index & cq->size)) {
>> > > + void *data;
>> > >
>> > >   frags = ring->rx_info + (index << priv->log_rx_info);
>> > >   rx_desc = ring->buf + (index << ring->log_stride);
>> > > + prefetch(rx_desc);
>> > >
>> > >   /*
>> > >* make sure we read the CQE after we read the ownership bit
>> > >*/
>> > >   dma_rmb();
>> > >
>> > > + cqe_array[cqe_idx++] = cqe;
>> > > +
>> > > + /* Base error handling here, free handled in next loop */
>> > > + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
>> > > +  MLX4_CQE_OPCODE_ERROR))
>> > > + goto skip;
>> > > +
>> > > + data = page_address(frags[0].page) + frags[0].page_offset;
>> > > + prefetch(data);
>>
>> that's probably not correct in all cases, since doing prefetch on the address
>> that is going to be evicted soon may hurt performance.
>> We need to dma_sync_single_for_cpu() before doing a prefetch or
>> somehow figure out that dma_sync is a nop, so we can omit it altogether
>> and do whatever prefetches we like.
>
> Sure, DMA can be synced first (actually already played with this).

Yes, but the point I think that Alexei is kind of indirectly getting
at is that you are doing all your tests on x86 architecture are you
not?  The x86 stuff is a very different beast from architectures like
ARM which have a very different architecture when it comes to how they
handle the memory organization of the system.  In the case of x86 the
only time dma_sync is not a nop is if you force swiotlb to be enabled
at which point the whole performance argument is kind of pointless
anyway.

>> Also unconditionally doing batch of 8 may also hurt depending on what
>> is happening either with the stack, bpf afterwards or even cpu version.
>
> See this as software DDIO, if the unlikely case that data will get
> evicted, it will still exist in L2 or L3 cache (like DDIO). Notice,
> only 1024 bytes are getting prefetched here.

I disagree.  DDIO only pushes received frames into the L3 cache.  What
you are potentially doing is flooding the L2 cache.  The difference in
size between the L3 and L2 caches is very significant.  L3 cache size
is in the MB range while the L2 cache is only 256KB or so for Xeon
processors and such.  In addition DDIO is really meant for an
architecture that has a fairly large cache region to spare and it it
limits itself to that cache region, the approach taken in this code
could potentially prefetch a fairly significant chunk of memory.

>> Doing single prefetch of Nth packet is probably ok most of the time,
>> but asking cpu to prefetch 8 packets at once is unnecessary especially
>> since single prefetch gives the same performance.
>
> No, unconditionally prefetch of the Nth packet, will be wrong most of
> the time, for real work loads, as Eric Dumazet already pointed out.
>
> This patch does NOT unconditionally prefetch 8 packets.  Prefetching
> _only_ happens when it is known that packets are ready in the RX ring.
> We know this prefetch data will be used/touched, within the NAPI cycle.
> Even if the processing of the packet flush L1 cache, then it will be in
> L2 or L3 (like DDIO).

I think the point you are missing here Jesper is that the packet isn't
what will be flushed out of L1.  It will be all the data that had been
fetched before that.  So for example the L1 cache can only hold 32K,
and the way it is setup if you fetch the first 64 bytes of 8 pages you
will have evicted everything that was in that cache set will be
flushed out to L2.

Also it might be worth while to see what instruction is being used for
the prefetch.  Last I knew for read prefetches it was prefetchnta on
x86 which would only pull the data into the L1 cache as a
"non-temporal" store.  If I am not mistaken I think you run the risk
of having the data prefetched evicted back out and bypassing the L2
and L3 caches unless it is modified.  That was kind of the point of
the prefetchnta as it really meant to be a read-only prefetch and
meant to avoid polluting the L2 and L3 caches.

- Alex

Re: [patch net-next v2 1/2] devlink: add hardware messages tracing facility

2016-07-12 Thread Jiri Pirko

Tue, Jul 12, 2016 at 06:38:26PM CEST, rost...@goodmis.org wrote:
>On Tue, 12 Jul 2016 18:05:03 +0200
>Jiri Pirko  wrote:
>
>> From: Jiri Pirko 
>> 
>> Define a tracepoint and allow user to trace messages going to and from
>> hardware associated with devlink instance.
>> 
>> Signed-off-by: Jiri Pirko 
>> ---
>> v1->v2:
>> - Use EXPORT_TRACEPOINT_SYMBOL_GPL instead of a wrapper function
>>   as suggested by David Ahern and Steven Rostedt
>
>FYI, you can use the Suggested-by: tag too ;-)

Thought that the purpose of that tag is if someone suggested the whole
patch existence.



>> +/* This part must be outside protection */
>> +#include 
>> +
>> +#else /* CONFIG_NET_DEVLINK */
>> +
>> +#if !defined(_TRACE_DEVLINK_H)
>> +#define _TRACE_DEVLINK_H
>
>Ah, I guess you do need header protection here.

Yes, you do.

>
>OK, looks good to me.
>
>Acked-by: Steven Rostedt 

Thanks Steven!

Re: [RFC PATCH net-next 1/1] Introduce skbmod action

2016-07-12 Thread Cong Wang

On Mon, Jul 11, 2016 at 5:12 AM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> This action is intended to be an upgrade from a usability perspective
> from pedit. Compare this:
>

Definitely agree we need a more user-friendly interface.

>
> pedit is a good starting point - but once you start going to a large
> number of policies then from a programmability, ops and usability point
> of view you need something with more succint params.
>

But it is still unclear why we can't just build something on top of
pedit? Since pedit accepts keys like u32, user-space is free to
introduce any wrapper on top of it.

It should not be a problem for libnl3 to build anything on top too.

Re: [PATCH 0/6] Netfilter/IPVS fixes for net

2016-07-12 Thread David Miller

From: Pablo Neira Ayuso 
Date: Tue, 12 Jul 2016 18:10:56 +0200

> The following patchset contains Netfilter/IPVS fixes for your net tree.
> they are:
> 
> 1) Fix leak in the error path of nft_expr_init(), from Liping Zhang.
> 
> 2) Tracing from nf_tables cannot be disabled, also from Zhang.
> 
> 3) Fix an integer overflow on 32bit archs when setting the number of
>hashtable buckets, from Florian Westphal.
> 
> 4) Fix configuration of ipvs sync in backup mode with IPv6 address,
>from Quentin Armitage via Simon Horman.
> 
> 5) Fix incorrect timeout calculation in nft_ct NFT_CT_EXPIRATION,
>from Florian Westphal.
> 
> 6) Skip clash resolution in conntrack insertion races if NAT is in
>place.
> 
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks Pablo.

> We're rather late in the release cycle of 4.7, so if these cannot
> get upstream I'll make sure to submit them to -stable, no problem.

I'm pretty sure there is going to be an -rc8, don't worry...

Re: [patch net-next v2 1/2] devlink: add hardware messages tracing facility

2016-07-12 Thread Steven Rostedt

On Tue, 12 Jul 2016 19:10:46 +0200
Jiri Pirko  wrote:

> Tue, Jul 12, 2016 at 06:38:26PM CEST, rost...@goodmis.org wrote:
> >On Tue, 12 Jul 2016 18:05:03 +0200
> >Jiri Pirko  wrote:
> >  
> >> From: Jiri Pirko 
> >> 
> >> Define a tracepoint and allow user to trace messages going to and from
> >> hardware associated with devlink instance.
> >> 
> >> Signed-off-by: Jiri Pirko 
> >> ---
> >> v1->v2:
> >> - Use EXPORT_TRACEPOINT_SYMBOL_GPL instead of a wrapper function
> >>   as suggested by David Ahern and Steven Rostedt  
> >
> >FYI, you can use the Suggested-by: tag too ;-)  
> 
> Thought that the purpose of that tag is if someone suggested the whole
> patch existence.

Yeah, it has different meanings. You could always be creative and add: 

Parts-suggested-by: 

;-)

-- Steve

Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit

2016-07-12 Thread Cong Wang

On Mon, Jul 11, 2016 at 12:45 PM,   wrote:
> Hi
>
> On latest kernel i noticed kernel panic happening 1-2 times per day. It is
> also happening on older kernel (at least 4.5.3).
>
...
>  [42916.426463] Call Trace:
>  [42916.426658]  
>
>  [42916.426719]  [] skb_push+0x36/0x37
>  [42916.427111]  [] ppp_start_xmit+0x10f/0x150
> [ppp_generic]
>  [42916.427314]  [] dev_hard_start_xmit+0x25a/0x2d3
>  [42916.427516]  [] ?
> validate_xmit_skb.isra.107.part.108+0x11d/0x238
>  [42916.427858]  [] sch_direct_xmit+0x89/0x1b5
>  [42916.428060]  [] __qdisc_run+0x133/0x170
>  [42916.428261]  [] net_tx_action+0xe3/0x148
>  [42916.428462]  [] __do_softirq+0xb9/0x1a9
>  [42916.428663]  [] irq_exit+0x37/0x7c
>  [42916.428862]  [] smp_apic_timer_interrupt+0x3d/0x48
>  [42916.429063]  [] apic_timer_interrupt+0x7c/0x90

Interesting, we call a skb_cow_head() before skb_push() in ppp_start_xmit(),
I have no idea why this could happen.

Do you have any tc qdisc, filter or actions on this ppp device?

Re: [PATCH net v2 0/2] net: ethoc: Error path and transmit fixes

2016-07-12 Thread David Miller

From: Max Filippov 
Date: Tue, 12 Jul 2016 16:51:10 +0300

> Hello,
> 
> On Mon, Jul 11, 2016 at 04:35:53PM -0700, Florian Fainelli wrote:
>> This patch series contains two patches for the ethoc driver while testing on 
>> a
>> TS-7300 board where ethoc is provided by an on-board FPGA.
>> 
>> First patch was cooked after chasing crashes with invalid resources passed to
>> the driver.
>> 
>> Second patch was cooked after seeing that an interface configured with IP
>> 192.168.2.2 was sending ARP packets for 192.168.0.0, no wonder why it could 
>> not
>> work.
> 
> I can see opencores intrerface sending ARP packets shorter than 64 bytes,
> but I couldn't reproduce truncation that affects packet contents on my
> hardware.
> 
>> I don't have access to any other platform using an ethoc interface so
>> it could be good to some testing on Xtensa for instance.
> 
> I've tested success and error paths affected by this series with the
> following additional change on top of it:
> 
> diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c
> index ca678d4..8c94f45 100644
> --- a/drivers/net/ethernet/ethoc.c
> +++ b/drivers/net/ethernet/ethoc.c
> @@ -862,7 +862,7 @@ static netdev_tx_t ethoc_start_xmit(struct sk_buff *skb, 
> struct net_device *dev)
>  
>   if (skb_put_padto(skb, ETHOC_ZLEN)) {
>   dev->stats.tx_errors++;
> - goto out;
> + return NETDEV_TX_OK;
>   }
>  
>   if (unlikely(skb->len > ETHOC_BUFSIZ)) {
> 
> Without it the interface becomes non-functional after the first error
> in skb_put_padto.
> 
> Tested-by: Max Filippov 
> Reviewed-by: Max Filippov 

Indeed, skb_put_padto() frees the skb on error so this would have caused
a double free.

Florian, please respin this series with the fix and tags added.

Thanks.

Re: [RFC PATCH v3] net: sched: convert qdisc linked list to hashtable

2016-07-12 Thread Cong Wang

On Mon, Jul 11, 2016 at 7:02 AM, Jiri Kosina  wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f45929c..0b5c172e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -52,6 +52,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  struct netpoll_info;
>  struct device;
> @@ -1778,6 +1779,7 @@ struct net_device {
> unsigned intnum_tx_queues;
> unsigned intreal_num_tx_queues;
> struct Qdisc*qdisc;
> +   DECLARE_HASHTABLE   (qdisc_hash, 4);
> unsigned long   tx_queue_len;
> spinlock_t  tx_global_lock;
> int watchdog_timeo;

Should it be surrounded by CONFIG_NET_SCHED?
To save several bytes for !CONFIG_NET_SCHED case.

Re: [PATCH] virtio-net: Remove more stack DMA

2016-07-12 Thread David Miller

From: Andy Lutomirski 
Date: Mon, 11 Jul 2016 14:30:28 -0700

> DaveM, is it okay for this to go in via -tip?

Sure.

Re: [PATCH net-next,v3] tools: hv: Add a script to help bonding synthetic and VF NICs

2016-07-12 Thread David Miller

From: Haiyang Zhang 
Date: Mon, 11 Jul 2016 17:06:42 -0700

> From: Haiyang Zhang 
> 
> This script helps to create bonding network devices based on synthetic NIC
> (the virtual network adapter usually provided by Hyper-V) and the matching
> VF NIC (SRIOV virtual function). So the synthetic NIC and VF NIC can
> function as one network device, and fail over to the synthetic NIC if VF is
> down.
> 
> Mayjor distros (RHEL, Ubuntu, SLES) supported by Hyper-V are supported by
> this script.
> 
> Signed-off-by: Haiyang Zhang 
> Reviewed-by: K. Y. Srinivasan 

Applied.

Re: ethtool TODO list - additional info

2016-07-12 Thread Jiri Pirko

Tue, Jul 05, 2016 at 05:37:42PM CEST, jorge.garcia.gonza...@gmail.com wrote:
>Hi !
>
>Some days ago, Jiri Pirko was talking about some next steps to
>implement for ethtool.
>
> I haven't seen any follow up since ethtool's maintainer change. Can we have
>additional details about these ?
>
>- libethtool - API
>- generic netlink

Yep, exactly, no reply. Apparently nobody really want to do any initiative
here, and I'm lacking time to do it :(



>- sub commands syntax
>- TODO/bugzilla
>
>
>On Mon, Jul 4, 2016 at 9:24 AM, Jiri Pirko  wrote:
>
 I was thinking of adding a TODO file to the repository, but it's really
 for the new maintainer to decide what to do.  So here's my list as a
 suggestion:

 * Add regression test coverage for all sub-commands with complex logic

 * Internationalise output and error messages

 * Build a libethtool that handles all the API quirks and fallbacks for
   old kernel versions.  This might help people writing language
   bindings or other utilities that use the ethtool API.

 * Provide a 'cleaned up' ethtool (under some other name) that has:
   - More conventional sub-command syntax, i.e. no '-'/'--' prefix
   - More consistent output formatting
>>>
>>>That seems like a reasonable start for a TODO list. I'll bet there
>>>are a few people out there with other suggestions as well...?
>>
>> Before that, I would like to see ethtool migrate to use generic
>> netlink. Then, the new tool would be needed anyway, should exist within
>> iproute2 package and have similar command line syntax.
>>
>> I have some ideas about the gennetlink ethtool, have to find some time
>> to implement some initial part of it.

Re: [PATCH 1/1] bonding: restrict the data protected by rcu_read_lock

2016-07-12 Thread David Miller

From: Zhu Yanjun 
Date: Tue, 12 Jul 2016 15:28:17 +0800

> In this function, origin is a pointer to struct aggregator.
> No matter what agg is changed to, it has nothing to do with origin.
> 
> CC: Jay Vosburgh 
> Signed-off-by: Zhu Yanjun 

I really hate changes like this.

Imagine if a inner function was called with RCU protection around
it, as a rule.

That's the same as what is happening here, the RCU locking is done
around the entire contents of the function and that is perfectly
fine even if one variable assignment or whatever is superfluously
contained inside of it.

I'm not applying this patch, sorry.

Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit

2016-07-12 Thread nuclearcat


On 2016-07-12 20:31, Cong Wang wrote:

On Mon, Jul 11, 2016 at 12:45 PM,   wrote:

Hi

On latest kernel i noticed kernel panic happening 1-2 times per day. 
It is

also happening on older kernel (at least 4.5.3).


...

 [42916.426463] Call Trace:
 [42916.426658]  

 [42916.426719]  [] skb_push+0x36/0x37
 [42916.427111]  [] ppp_start_xmit+0x10f/0x150
[ppp_generic]
 [42916.427314]  [] dev_hard_start_xmit+0x25a/0x2d3
 [42916.427516]  [] ?
validate_xmit_skb.isra.107.part.108+0x11d/0x238
 [42916.427858]  [] sch_direct_xmit+0x89/0x1b5
 [42916.428060]  [] __qdisc_run+0x133/0x170
 [42916.428261]  [] net_tx_action+0xe3/0x148
 [42916.428462]  [] __do_softirq+0xb9/0x1a9
 [42916.428663]  [] irq_exit+0x37/0x7c
 [42916.428862]  [] 
smp_apic_timer_interrupt+0x3d/0x48

 [42916.429063]  [] apic_timer_interrupt+0x7c/0x90


Interesting, we call a skb_cow_head() before skb_push() in 
ppp_start_xmit(),

I have no idea why this could happen.

Do you have any tc qdisc, filter or actions on this ppp device?
Yes, i have policing filters for incoming traffic (ingress), and also on 
egress htb + pfifo + filters.

Re: ethtool TODO list - additional info

2016-07-12 Thread Jorge Alberto Garcia

On Tue, Jul 12, 2016 at 12:55 PM, Jiri Pirko  wrote:
> Tue, Jul 05, 2016 at 05:37:42PM CEST, jorge.garcia.gonza...@gmail.com wrote:
>>Hi !
>>
>>Some days ago, Jiri Pirko was talking about some next steps to
>>implement for ethtool.
>>
>> I haven't seen any follow up since ethtool's maintainer change. Can we have
>>additional details about these ?
>>
>>- libethtool - API
>>- generic netlink
>
> Yep, exactly, no reply. Apparently nobody really want to do any initiative
> here, and I'm lacking time to do it :(
>

That's fine,  I already got  a git repo, I'm trying to understand what
'use generic netlink' means.

This is a piece of what grep got me on iproute git repo (I'm still
trying to understand)
grep -i netlink -r iproute2/
iproute2/bridge/mdb.c:#include "libnetlink.h"
iproute2/bridge/vlan.c:#include "libnetlink.h"
...
..



>
>
>>- sub commands syntax
>>- TODO/bugzilla
>>
>>
>>On Mon, Jul 4, 2016 at 9:24 AM, Jiri Pirko  wrote:
>>
> I was thinking of adding a TODO file to the repository, but it's really
> for the new maintainer to decide what to do.  So here's my list as a
> suggestion:
>
> * Add regression test coverage for all sub-commands with complex logic
>
> * Internationalise output and error messages
>
> * Build a libethtool that handles all the API quirks and fallbacks for
>   old kernel versions.  This might help people writing language
>   bindings or other utilities that use the ethtool API.
>
> * Provide a 'cleaned up' ethtool (under some other name) that has:
>   - More conventional sub-command syntax, i.e. no '-'/'--' prefix
>   - More consistent output formatting

That seems like a reasonable start for a TODO list. I'll bet there
are a few people out there with other suggestions as well...?
>>>
>>> Before that, I would like to see ethtool migrate to use generic
>>> netlink. Then, the new tool would be needed anyway, should exist within
>>> iproute2 package and have similar command line syntax.
>>>
>>> I have some ideas about the gennetlink ethtool, have to find some time
>>> to implement some initial part of it.

Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit

2016-07-12 Thread Cong Wang

On Tue, Jul 12, 2016 at 11:03 AM,   wrote:
> On 2016-07-12 20:31, Cong Wang wrote:
>>
>> On Mon, Jul 11, 2016 at 12:45 PM,   wrote:
>>>
>>> Hi
>>>
>>> On latest kernel i noticed kernel panic happening 1-2 times per day. It
>>> is
>>> also happening on older kernel (at least 4.5.3).
>>>
>> ...
>>>
>>>  [42916.426463] Call Trace:
>>>  [42916.426658]  
>>>
>>>  [42916.426719]  [] skb_push+0x36/0x37
>>>  [42916.427111]  [] ppp_start_xmit+0x10f/0x150
>>> [ppp_generic]
>>>  [42916.427314]  [] dev_hard_start_xmit+0x25a/0x2d3
>>>  [42916.427516]  [] ?
>>> validate_xmit_skb.isra.107.part.108+0x11d/0x238
>>>  [42916.427858]  [] sch_direct_xmit+0x89/0x1b5
>>>  [42916.428060]  [] __qdisc_run+0x133/0x170
>>>  [42916.428261]  [] net_tx_action+0xe3/0x148
>>>  [42916.428462]  [] __do_softirq+0xb9/0x1a9
>>>  [42916.428663]  [] irq_exit+0x37/0x7c
>>>  [42916.428862]  [] smp_apic_timer_interrupt+0x3d/0x48
>>>  [42916.429063]  [] apic_timer_interrupt+0x7c/0x90
>>
>>
>> Interesting, we call a skb_cow_head() before skb_push() in
>> ppp_start_xmit(),
>> I have no idea why this could happen.
>>
>> Do you have any tc qdisc, filter or actions on this ppp device?
>
> Yes, i have policing filters for incoming traffic (ingress), and also on
> egress htb + pfifo + filters.

Does it make any difference if you remove the egress qdisc and/or
filters? If yes, please share the `tc qd show...` and `tc filter show ...`?

Thanks!

Re: [PATCH -next] stmmac: dwmac-socfpga: fix wrong pointer passed to PTR_ERR()

2016-07-12 Thread David Miller

From: weiyj...@163.com
Date: Tue, 12 Jul 2016 11:00:09 +

> From: Wei Yongjun 
> 
> PTR_ERR should access the value just tested by IS_ERR, otherwise
> the wrong error code will be returned.
> 
> Signed-off-by: Wei Yongjun 

Applied.

Re: [PATCH -next] dwc_eth_qos: fix missing clk_disable_unprepare() on error in dwceqos_probe()

2016-07-12 Thread David Miller

From: weiyj...@163.com
Date: Tue, 12 Jul 2016 11:43:37 +

> From: Wei Yongjun 
> 
> Fix missing clk_disable_unprepare() call before return
> from dwceqos_probe() in the error handling case of invalid
> fixed-link.
> 
> Signed-off-by: Wei Yongjun 

Applied.

Re: Fwd: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory Communications - RDMA

2016-07-12 Thread Benjamin Poirier

On 2016/07/06 17:29, Ursula Braun wrote:
> Dave,
> 
> we still like to see SMC-R included into a future Linux-kernel. After
> answering your first 2 questions, there is no longer a response. What should
> we do next?
> - Still wait for an answer from you?
> - Resend the same whole SMC-R patch series, this time with the cover letter
> adapted to your requested changes?

^^^ I would suggest to send v2 of the patch series with the changes
that were requested.

> - Put the SMC-R development on hold, and concentrate on another
> s390-specific SMC-solution first (not RDMA-based), that makes use of the
> SMC-socket family as well.
> - Anything else?
> 
> Kind regards, Ursula
> 
>  Forwarded Message 
> Subject: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory
> Communications - RDMA
> Date: Tue, 21 Jun 2016 16:02:59 +0200
> From: Ursula Braun 
> To: da...@davemloft.net
> CC: netdev@vger.kernel.org, linux-s...@vger.kernel.org,
> schwidef...@de.ibm.com, heiko.carst...@de.ibm.com, utz.bac...@de.ibm.com
> 
> Dave,
> 
> the SMC-R patches submitted 2016-06-03 show up in state "Changes
> Requested" on patchwork:
> https://patchwork.ozlabs.org/project/netdev/list/?submitter=2266&state=*&page=1
> 
> You had requested a change of the SMC-R description in the cover letter.
> We came up with the response below. Do you need anything else from us?
> 
> Kind regards,
> Ursula Braun
> 
>  Forwarded Message 
> Subject: Re: [PATCH net-next 00/15] net/smc: Shared Memory
> Communications - RDMA
> Date: Thu,  9 Jun 2016 17:36:28 +0200
> From: Ursula Braun 
> To: da...@davemloft.net
> CC: netdev@vger.kernel.org, linux-s...@vger.kernel.org,
> schwidef...@de.ibm.com, heiko.carst...@de.ibm.com
> 
> On Tue, 2016-06-07 at 15:07 -0700, David Miller wrote:
> > In case my previous reply wasn't clear enough, I require that you provide
> > a more accurate description of what the implications of this feature are.
> > 
> > Namely, that important _CORE_ networking features are completely bypassed
> > and unusable when SMC applies to a connection.
> > 
> > Specifically, all packet shaping, filtering, traffic inspection, and
> > flow management facilitites in the kernel will not be able to see nor
> > act upon the data flow of these TCP connections once established.
> > 
> > It is always important, and in my opinion required, to list the
> > negative aspects of your change and not just the "wow, amazing"
> > positive aspects.
> > 
> > Thanks.
> > 
> > 
> Correct, the SMC-R data stream bypasses TCP and thus cannot enjoy its
> features. This is the price for leveraging the TCP application ecosystem
> and reducing CPU load.
> 
> When a load balancer allows the TCP handshake to take place between a
> worker node and the TCP client, RDMA will be used between these two
> nodes. So anything based on TCP connection establishment (including a
> firewall) can apply to SMC-R, too. To be clear -- yes, the data flow
> later on is not subject to these features anymore.  At least VLAN
> isolation from the TCP part can be leveraged for RDMA traffic. From our
> experience, discussions, etc., that tradeoff seems acceptable in a
> classical data center environment.
> 
> Improving our cover letter would result in the following new
> introductory motivation part at the beginning and a slightly modified
> list of
> planned enhancements at the end:
> 
> On Fri, 2016-06-03 at 17:26 +0200, Ursula Braun wrote:
> 
> > These patches are the initial part of the implementation of the
> > "Shared Memory Communications-RDMA" (SMC-R) protocol. The protocol is
> > defined in RFC7609 [1]. It allows transformation of TCP connections
> > using the "Remote Direct Memory Access over Converged Ethernet" (RoCE)
> > feature of specific communication hardware for data center environments.
> > 
> > SMC-R inherits TCP qualities such as reliable connections, host-based
> > firewall packet filtering (on connection establishment) and unmodified
> > application of communication encryption such as TLS (transport layer
> > security) or SSL (secure sockets layer). It is transparent to most existing
> > TCP connection load balancers that are commonly used in the enterprise data
> > center environment for multi-tier application workloads.
> > 
> > Being designed for the data center network switched fabric environment, it
> > does not need congestion control and thus reaches line speed right away
> > without having to go through slow start as with TCP. This can be beneficial
> > for short living flows including request response patterns requiring
> > reliability. A full SMC-R implementation also provides seamless high
> > availability and load-balancing demanded by enterprise installations.
> > 
> > SMC-R does not require an RDMA communication manager (RDMA CM). Its use of
> > RDMA provides CPU savings transparently for unmodified applications.
> > For instance, when running 10 parallel connections with uperf, we measured
> > a decrease of 60% in CPU consumption with

Re: [PATCH -next] rxrpc: Fix error handling in af_rxrpc_init()

2016-07-12 Thread David Miller

From: weiyj...@163.com
Date: Tue, 12 Jul 2016 11:21:17 +

> From: Wei Yongjun 
> 
> security initialized after alloc workqueue, so we should exit security
> before destroy workqueue in the error handing.
> 
> Fixes: 648af7fca159 ("rxrpc: Absorb the rxkad security module")
> Signed-off-by: Wei Yongjun 

Applied.

Re: [PATCH -next] net: mediatek: fix non static symbol warnings

2016-07-12 Thread David Miller

From: weiyj...@163.com
Date: Tue, 12 Jul 2016 11:36:44 +

> From: Wei Yongjun 
> 
> Fixes the following sparse warnings:
> 
> drivers/net/ethernet/mediatek/mtk_eth_soc.c:79:5: warning:
>  symbol '_mtk_mdio_write' was not declared. Should it be static?
> drivers/net/ethernet/mediatek/mtk_eth_soc.c:98:5: warning:
>  symbol '_mtk_mdio_read' was not declared. Should it be static?
> 
> Signed-off-by: Wei Yongjun 

Applied.

Re: Fwd: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory Communications - RDMA

2016-07-12 Thread Benjamin Poirier

On 2016/07/12 11:08, Benjamin Poirier wrote:
> On 2016/07/06 17:29, Ursula Braun wrote:
> > Dave,
> > 
> > we still like to see SMC-R included into a future Linux-kernel. After
> > answering your first 2 questions, there is no longer a response. What should
> > we do next?
> > - Still wait for an answer from you?
> > - Resend the same whole SMC-R patch series, this time with the cover letter
> > adapted to your requested changes?
> 
> ^^^ I would suggest to send v2 of the patch series with the changes
> that were requested.
> 
> > - Put the SMC-R development on hold, and concentrate on another
> > s390-specific SMC-solution first (not RDMA-based), that makes use of the
> > SMC-socket family as well.
> > - Anything else?
> > 

... and I would suggest to trim my emails next time. Sorry.

Re: [PATCH] net: Fragment large datagrams even when IP_HDRINCL is set.

2016-07-12 Thread David Miller

From: Alan Davey 
Date: Tue, 12 Jul 2016 12:34:07 +

> -  all future applications have to continue to implement their own 
> fragmentation code, duplicating that which already exists in the kernel

They have to do this anyways, don't you see this?

Otherwise they don't support %99 of the kernels out there.

Even if I put this change in now, it would take years for it to
propagate to even a moderate percentage of Linux machines out there.

And applications doing a "we only support kernels > version x.y.z" is
a situation I don't want to promote if you think that's a reasonable
way to handle this.

Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit

2016-07-12 Thread nuclearcat


On 2016-07-12 21:05, Cong Wang wrote:

On Tue, Jul 12, 2016 at 11:03 AM,   wrote:

On 2016-07-12 20:31, Cong Wang wrote:


On Mon, Jul 11, 2016 at 12:45 PM,   wrote:


Hi

On latest kernel i noticed kernel panic happening 1-2 times per day. 
It

is
also happening on older kernel (at least 4.5.3).


...


 [42916.426463] Call Trace:
 [42916.426658]  

 [42916.426719]  [] skb_push+0x36/0x37
 [42916.427111]  [] ppp_start_xmit+0x10f/0x150
[ppp_generic]
 [42916.427314]  [] 
dev_hard_start_xmit+0x25a/0x2d3

 [42916.427516]  [] ?
validate_xmit_skb.isra.107.part.108+0x11d/0x238
 [42916.427858]  [] sch_direct_xmit+0x89/0x1b5
 [42916.428060]  [] __qdisc_run+0x133/0x170
 [42916.428261]  [] net_tx_action+0xe3/0x148
 [42916.428462]  [] __do_softirq+0xb9/0x1a9
 [42916.428663]  [] irq_exit+0x37/0x7c
 [42916.428862]  [] 
smp_apic_timer_interrupt+0x3d/0x48

 [42916.429063]  [] apic_timer_interrupt+0x7c/0x90



Interesting, we call a skb_cow_head() before skb_push() in
ppp_start_xmit(),
I have no idea why this could happen.

Do you have any tc qdisc, filter or actions on this ppp device?


Yes, i have policing filters for incoming traffic (ingress), and also 
on

egress htb + pfifo + filters.


Does it make any difference if you remove the egress qdisc and/or
filters? If yes, please share the `tc qd show...` and `tc filter show 
...`?


Thanks!


It is not easy, because it is NAS with approx 5000 users connected (and 
they are constantly connecting/disconnecting), and crash can't be 
reproduced easily. If i will remove qdisc/filters users will get 
unlimited speed and this will cause serious service degradation.
But maybe i can add some debug lines and run some test kernel if 
necessary (if it will not cause serious performance overhead).

Re: ethtool TODO list - additional info

2016-07-12 Thread John W. Linville

On Tue, Jul 12, 2016 at 01:04:52PM -0500, Jorge Alberto Garcia wrote:
> On Tue, Jul 12, 2016 at 12:55 PM, Jiri Pirko  wrote:
> > Tue, Jul 05, 2016 at 05:37:42PM CEST, jorge.garcia.gonza...@gmail.com wrote:
> >>Hi !
> >>
> >>Some days ago, Jiri Pirko was talking about some next steps to
> >>implement for ethtool.
> >>
> >> I haven't seen any follow up since ethtool's maintainer change. Can we have
> >>additional details about these ?
> >>
> >>- libethtool - API
> >>- generic netlink
> >
> > Yep, exactly, no reply. Apparently nobody really want to do any initiative
> > here, and I'm lacking time to do it :(
> >
> 
> That's fine,  I already got  a git repo, I'm trying to understand what
> 'use generic netlink' means.
> 
> This is a piece of what grep got me on iproute git repo (I'm still
> trying to understand)
> grep -i netlink -r iproute2/
> iproute2/bridge/mdb.c:#include "libnetlink.h"
> iproute2/bridge/vlan.c:#include "libnetlink.h"
> ...
> ..

The general notion would be to replace the current ioctl-based
ethtool API with one that is based on netlink, like many of the other
networking APIs. I'm fine with the general idea of that, but so far
I haven't heard a strong case for why it is necessary. It certainly
doesn't seem urgent to me -- if someone disagrees, then please explain!

John

> >
> >
> >>- sub commands syntax
> >>- TODO/bugzilla
> >>
> >>
> >>On Mon, Jul 4, 2016 at 9:24 AM, Jiri Pirko  wrote:
> >>
> > I was thinking of adding a TODO file to the repository, but it's really
> > for the new maintainer to decide what to do.  So here's my list as a
> > suggestion:
> >
> > * Add regression test coverage for all sub-commands with complex logic
> >
> > * Internationalise output and error messages
> >
> > * Build a libethtool that handles all the API quirks and fallbacks for
> >   old kernel versions.  This might help people writing language
> >   bindings or other utilities that use the ethtool API.
> >
> > * Provide a 'cleaned up' ethtool (under some other name) that has:
> >   - More conventional sub-command syntax, i.e. no '-'/'--' prefix
> >   - More consistent output formatting
> 
> That seems like a reasonable start for a TODO list. I'll bet there
> are a few people out there with other suggestions as well...?
> >>>
> >>> Before that, I would like to see ethtool migrate to use generic
> >>> netlink. Then, the new tool would be needed anyway, should exist within
> >>> iproute2 package and have similar command line syntax.
> >>>
> >>> I have some ideas about the gennetlink ethtool, have to find some time
> >>> to implement some initial part of it.
> 

-- 
John W. LinvilleSomeday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.

Re: ethtool TODO list - additional info

2016-07-12 Thread John Fastabend

On 16-07-12 11:12 AM, John W. Linville wrote:
> On Tue, Jul 12, 2016 at 01:04:52PM -0500, Jorge Alberto Garcia wrote:
>> On Tue, Jul 12, 2016 at 12:55 PM, Jiri Pirko  wrote:
>>> Tue, Jul 05, 2016 at 05:37:42PM CEST, jorge.garcia.gonza...@gmail.com wrote:
 Hi !

 Some days ago, Jiri Pirko was talking about some next steps to
 implement for ethtool.

 I haven't seen any follow up since ethtool's maintainer change. Can we have
 additional details about these ?

 - libethtool - API
 - generic netlink
>>>
>>> Yep, exactly, no reply. Apparently nobody really want to do any initiative
>>> here, and I'm lacking time to do it :(
>>>
>>
>> That's fine,  I already got  a git repo, I'm trying to understand what
>> 'use generic netlink' means.
>>
>> This is a piece of what grep got me on iproute git repo (I'm still
>> trying to understand)
>> grep -i netlink -r iproute2/
>> iproute2/bridge/mdb.c:#include "libnetlink.h"
>> iproute2/bridge/vlan.c:#include "libnetlink.h"
>> ...
>> ..
> 
> The general notion would be to replace the current ioctl-based
> ethtool API with one that is based on netlink, like many of the other
> networking APIs. I'm fine with the general idea of that, but so far
> I haven't heard a strong case for why it is necessary. It certainly
> doesn't seem urgent to me -- if someone disagrees, then please explain!
> 

And probably obvious but you can't remove the ioctl based ethtool
interface because we have a lot of software using it so you will have
to maintain both in parallel if there is a netlink equivalent.

Also most the stuff in ethtool tends to be things you set once at
init time or for debugging so the benefit of notifiers and such are
minimal.

Sure if I was doing it from scratch netlink would be better and there
are probably parts of it that are worth converting over where notifier
hooks and standardizing would be beneficial. I think Roopa for example
was coming up with some more general statistics mechanism.

My $.02 anyways.

JohnF

Re: ethtool TODO list - additional info

2016-07-12 Thread David Miller

From: "John W. Linville" 
Date: Tue, 12 Jul 2016 14:12:52 -0400

> I haven't heard a strong case for why it is necessary.

Stats on large scale setups/systems is problematic and ethtool
is a part of that problem.

Extensibility in general suffers because of the ioctl() interface as
well.

For getting ethtool settings or stats, we have no clean filtering
mechanism.

I could go on and on...

I think before any new features are added to ethtool, we move over to
the netlink based interface and lock the existing ioctl() API in stone.

1 2 >

1 - 100 of 151 matches

Mail list logo