[PATCH net] Revert "net: core: maybe return -EEXIST in __dev_alloc_name"

2017-12-01 Thread Johannes Berg
From: Johannes Berg 

This reverts commit d6f295e9def0; some userspace (in the case
we noticed, it's wpa_supplicant) is relying on the current
error code to determine that a fixed-name interface already
exists.

Reported-by: Jouni Malinen 
Signed-off-by: Johannes Berg 
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 07ed21d64f92..f47e96b62308 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1106,7 +1106,7 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 * when the name is long and there isn't enough space left
 * for the digits, or if all bits are used.
 */
-   return p ? -ENFILE : -EEXIST;
+   return -ENFILE;
 }
 
 static int dev_alloc_name_ns(struct net *net,
-- 
2.14.2
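
To illustrate why the error code matters to userspace, here is a minimal
userspace sketch of the post-revert return logic. It is a simplified model and
not the kernel code: model_alloc_name(), the name table, NAME_LEN and MAX_UNITS
are hypothetical, and only the final "always -ENFILE" return mirrors the patch.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN  16	/* stands in for IFNAMSIZ */
#define MAX_UNITS 8	/* hypothetical cap on unit numbers */

/* Returns 1 if `name` is already in the table of existing interfaces. */
static int name_in_use(const char *const *existing, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(existing[i], name) == 0)
			return 1;
	return 0;
}

/*
 * `name` is either a fixed name ("wlan0") or a template ending in "%d"
 * ("wlan%d"). On success the unit number is returned (0 for a fixed name)
 * and the final name is written to `buf`. Every failure returns -ENFILE,
 * including "fixed name already exists", as after the revert.
 */
static int model_alloc_name(const char *const *existing, size_t n,
			    const char *name, char *buf)
{
	const char *p = strchr(name, '%');

	if (p) {
		int prefix = (int)(p - name);

		for (int unit = 0; unit < MAX_UNITS; unit++) {
			snprintf(buf, NAME_LEN, "%.*s%d", prefix, name, unit);
			if (!name_in_use(existing, n, buf))
				return unit;
		}
		return -ENFILE;	/* all units taken */
	}

	snprintf(buf, NAME_LEN, "%s", name);
	if (!name_in_use(existing, n, buf))
		return 0;
	return -ENFILE;		/* fixed name exists: not -EEXIST */
}
```

A caller that treats -ENFILE on a fixed name as "interface already exists"
keeps working with this behaviour, which is the dependency the revert preserves.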



[PATCH v2] net: macb: change GFP_KERNEL to GFP_ATOMIC

2017-12-01 Thread Julia Lawall
Function gem_add_flow_filter called on line 2958 inside lock on line 2949
but uses GFP_KERNEL

Generated by: scripts/coccinelle/locks/call_kern.cocci

Fixes: ae8223de3df5 ("net: macb: Added support for RX filtering")
CC: Rafal Ozieblo 
Signed-off-by: Julia Lawall 
Signed-off-by: Fengguang Wu 
---

v2: Fix some broken email addresses.  No change to the patch.

tree:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
master
head:   fb20eb9d798d2f4c1a75b7fe981d72dfa8d7270d
commit: ae8223de3df5a0ce651d14a50dad31b9cae029f2 [2033/2251] net: macb:
Added support for RX filtering

 macb_main.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -2799,7 +2799,7 @@ static int gem_add_flow_filter(struct ne
int ret = -EINVAL;
bool added = false;

-   newfs = kmalloc(sizeof(*newfs), GFP_KERNEL);
+   newfs = kmalloc(sizeof(*newfs), GFP_ATOMIC);
if (newfs == NULL)
return -ENOMEM;
memcpy(&newfs->fs, fs, sizeof(newfs->fs));


Re: [Patch net-next] act_mirred: use tcfm_dev in tcf_mirred_get_dev()

2017-12-01 Thread Jiri Pirko
Fri, Dec 01, 2017 at 10:46:42PM CET, xiyou.wangc...@gmail.com wrote:
>On Fri, Dec 1, 2017 at 9:56 AM, Jiri Pirko  wrote:
>>
>> Isn't this here so user may specify a ifindex of netdev which is not yet
>> present on the system (not sure how much sense that would make though...)
>
>How is this even possible? If an ifindex is not present, we return ENODEV:

Right, I missed this. Thanks.

>
>if (parm->ifindex) {
>dev = __dev_get_by_index(net, parm->ifindex);
>if (dev == NULL) {
>if (exists)
>tcf_idr_release(*a, bind);
>return -ENODEV;
>}


[PATCH net] nfp: fix port stats for mac representors

2017-12-01 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Previously we swapped the tx_packets, tx_bytes and tx_dropped counters
with rx_packets, rx_bytes and rx_dropped counters, respectively. This
behaviour is correct and expected for VF representors but it should not
be swapped for physical port mac representors.

Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Simon Horman 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
index 924a05e05da0..78b36c67c232 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
@@ -84,16 +84,13 @@ nfp_repr_phy_port_get_stats64(struct nfp_port *port,
 {
u8 __iomem *mem = port->eth_stats;
 
-   /* TX and RX stats are flipped as we are returning the stats as seen
-* at the switch port corresponding to the phys port.
-*/
-   stats->tx_packets = readq(mem + NFP_MAC_STATS_RX_FRAMES_RECEIVED_OK);
-   stats->tx_bytes = readq(mem + NFP_MAC_STATS_RX_IN_OCTETS);
-   stats->tx_dropped = readq(mem + NFP_MAC_STATS_RX_IN_ERRORS);
+   stats->tx_packets = readq(mem + NFP_MAC_STATS_TX_FRAMES_TRANSMITTED_OK);
+   stats->tx_bytes = readq(mem + NFP_MAC_STATS_TX_OUT_OCTETS);
+   stats->tx_dropped = readq(mem + NFP_MAC_STATS_TX_OUT_ERRORS);
 
-   stats->rx_packets = readq(mem + NFP_MAC_STATS_TX_FRAMES_TRANSMITTED_OK);
-   stats->rx_bytes = readq(mem + NFP_MAC_STATS_TX_OUT_OCTETS);
-   stats->rx_dropped = readq(mem + NFP_MAC_STATS_TX_OUT_ERRORS);
+   stats->rx_packets = readq(mem + NFP_MAC_STATS_RX_FRAMES_RECEIVED_OK);
+   stats->rx_bytes = readq(mem + NFP_MAC_STATS_RX_IN_OCTETS);
+   stats->rx_dropped = readq(mem + NFP_MAC_STATS_RX_IN_ERRORS);
 }
 
 static void
-- 
2.15.1
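
The mapping described above can be sketched in plain C. This is only an
illustration under assumptions: struct mac_counters, struct port_stats and
fill_stats() are hypothetical names, and the `flip` flag stands in for the
distinction between VF representors (flipped) and physical port MAC
representors (not flipped).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of counters as seen by the MAC itself. */
struct mac_counters {
	uint64_t rx_frames, rx_octets, rx_errors;
	uint64_t tx_frames, tx_octets, tx_errors;
};

struct port_stats {
	uint64_t rx_packets, rx_bytes, rx_dropped;
	uint64_t tx_packets, tx_bytes, tx_dropped;
};

/*
 * flip == true:  report the counters from the far side of the link, so the
 *                source's RX becomes the reported TX (VF representor case).
 * flip == false: report the counters as the MAC sees them (physical port
 *                MAC representor case, the behaviour after this patch).
 */
static void fill_stats(const struct mac_counters *mac,
		       struct port_stats *stats, bool flip)
{
	if (flip) {
		stats->tx_packets = mac->rx_frames;
		stats->tx_bytes   = mac->rx_octets;
		stats->tx_dropped = mac->rx_errors;
		stats->rx_packets = mac->tx_frames;
		stats->rx_bytes   = mac->tx_octets;
		stats->rx_dropped = mac->tx_errors;
	} else {
		stats->tx_packets = mac->tx_frames;
		stats->tx_bytes   = mac->tx_octets;
		stats->tx_dropped = mac->tx_errors;
		stats->rx_packets = mac->rx_frames;
		stats->rx_bytes   = mac->rx_octets;
		stats->rx_dropped = mac->rx_errors;
	}
}
```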



[PATCH 1/1] timecounter: Make cyclecounter struct part of timecounter struct

2017-12-01 Thread Sagar Arun Kamble
There is no real need for the users of timecounters to define cyclecounter
and timecounter variables separately. Since timecounter will always be
based on cyclecounter, have cyclecounter struct as member of timecounter
struct.

Suggested-by: Chris Wilson 
Signed-off-by: Sagar Arun Kamble 
Cc: Chris Wilson 
Cc: Richard Cochran 
Cc: John Stultz 
Cc: Thomas Gleixner 
Cc: Stephen Boyd 
Cc: linux-ker...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: netdev@vger.kernel.org
Cc: intel-wired-...@lists.osuosl.org
Cc: linux-r...@vger.kernel.org
Cc: alsa-de...@alsa-project.org
Cc: kvm...@lists.cs.columbia.edu
---
 arch/microblaze/kernel/timer.c | 20 ++--
 drivers/clocksource/arm_arch_timer.c   | 19 ++--
 drivers/net/ethernet/amd/xgbe/xgbe-dev.c   |  3 +-
 drivers/net/ethernet/amd/xgbe/xgbe-ptp.c   |  9 +++---
 drivers/net/ethernet/amd/xgbe/xgbe.h   |  1 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h|  1 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c   | 20 ++--
 drivers/net/ethernet/freescale/fec.h   |  1 -
 drivers/net/ethernet/freescale/fec_ptp.c   | 30 +-
 drivers/net/ethernet/intel/e1000e/e1000.h  |  1 -
 drivers/net/ethernet/intel/e1000e/netdev.c | 27 
 drivers/net/ethernet/intel/e1000e/ptp.c|  2 +-
 drivers/net/ethernet/intel/igb/igb.h   |  1 -
 drivers/net/ethernet/intel/igb/igb_ptp.c   | 25 ---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  1 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c   | 17 +-
 drivers/net/ethernet/mellanox/mlx4/en_clock.c  | 28 -
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 -
 .../net/ethernet/mellanox/mlx5/core/lib/clock.c| 34 ++--
 drivers/net/ethernet/qlogic/qede/qede_ptp.c| 20 ++--
 drivers/net/ethernet/ti/cpts.c | 36 --
 drivers/net/ethernet/ti/cpts.h |  1 -
 include/linux/mlx5/driver.h|  1 -
 include/linux/timecounter.h|  4 +--
 include/sound/hdaudio.h|  1 -
 kernel/time/timecounter.c  | 28 -
 sound/hda/hdac_stream.c|  7 +++--
 virt/kvm/arm/arch_timer.c  |  6 ++--
 28 files changed, 163 insertions(+), 182 deletions(-)

diff --git a/arch/microblaze/kernel/timer.c b/arch/microblaze/kernel/timer.c
index 7de941c..b7f89e9 100644
--- a/arch/microblaze/kernel/timer.c
+++ b/arch/microblaze/kernel/timer.c
@@ -199,27 +199,25 @@ static u64 xilinx_read(struct clocksource *cs)
return (u64)xilinx_clock_read();
 }
 
-static struct timecounter xilinx_tc = {
-   .cc = NULL,
-};
-
 static u64 xilinx_cc_read(const struct cyclecounter *cc)
 {
return xilinx_read(NULL);
 }
 
-static struct cyclecounter xilinx_cc = {
-   .read = xilinx_cc_read,
-   .mask = CLOCKSOURCE_MASK(32),
-   .shift = 8,
+static struct timecounter xilinx_tc = {
+   .cc.read = xilinx_cc_read,
+   .cc.mask = CLOCKSOURCE_MASK(32),
+   .cc.mult = 0,
+   .cc.shift = 8,
 };
 
 static int __init init_xilinx_timecounter(void)
 {
-   xilinx_cc.mult = div_sc(timer_clock_freq, NSEC_PER_SEC,
-   xilinx_cc.shift);
+   struct cyclecounter *cc = &xilinx_tc.cc;
+
+   cc->mult = div_sc(timer_clock_freq, NSEC_PER_SEC, cc->shift);
 
-   timecounter_init(&xilinx_tc, &xilinx_cc, sched_clock());
+   timecounter_init(&xilinx_tc, sched_clock());
 
return 0;
 }
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 57cb2f0..31543e5 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -179,11 +179,6 @@ static u64 arch_counter_read_cc(const struct cyclecounter *cc)
.flags  = CLOCK_SOURCE_IS_CONTINUOUS,
 };
 
-static struct cyclecounter cyclecounter __ro_after_init = {
-   .read   = arch_counter_read_cc,
-   .mask   = CLOCKSOURCE_MASK(56),
-};
-
 struct ate_acpi_oem_info {
char oem_id[ACPI_OEM_ID_SIZE + 1];
char oem_table_id[ACPI_OEM_TABLE_ID_SIZE + 1];
@@ -915,7 +910,10 @@ static u64 arch_counter_get_cntvct_mem(void)
return ((u64) vct_hi << 32) | vct_lo;
 }
 
-static struct arch_timer_kvm_info arch_timer_kvm_info;
+static struct arch_timer_kvm_info arch_timer_kvm_info = {
+   .timecounter.cc.read = arch_counter_read_cc,
+   .timecounter.cc.mask = CLOCKSOURCE_MASK(56),
+};
 
 struct arch_timer_kvm_info *arch_timer_get_kvm_info(void)
 {
@@ -925,6 +923,7 @@ struct arch_timer_kvm_info *arch_timer_get_kvm_info(void)
 static void __init 
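
The layout change proposed by this patch can be sketched as a small userspace
model. It is a simplification under assumptions, not the kernel implementation:
the structs below carry only the fields needed for the illustration, the
fractional-nanosecond bookkeeping is left out, and fake_cycles/fake_read() are
hypothetical stand-ins for a hardware counter. The two-argument
timecounter_init() follows the call shown in the diff above.

```c
#include <stdint.h>

struct cyclecounter {
	uint64_t (*read)(const struct cyclecounter *cc);
	uint64_t mask;
	uint32_t mult;
	uint32_t shift;
};

/* The cyclecounter is an embedded member instead of a separate object. */
struct timecounter {
	struct cyclecounter cc;
	uint64_t cycle_last;
	uint64_t nsec;
};

/* Record the current cycle count and the starting time in nanoseconds. */
static void timecounter_init(struct timecounter *tc, uint64_t start_ns)
{
	tc->cycle_last = tc->cc.read(&tc->cc);
	tc->nsec = start_ns;
}

/* Advance the time by the cycles elapsed since the previous read. */
static uint64_t timecounter_read(struct timecounter *tc)
{
	uint64_t now = tc->cc.read(&tc->cc);
	/* Masking makes the delta correct across one counter wrap-around. */
	uint64_t delta = (now - tc->cycle_last) & tc->cc.mask;

	tc->cycle_last = now;
	tc->nsec += (delta * tc->cc.mult) >> tc->cc.shift;
	return tc->nsec;
}

/* Hypothetical counter source for the illustration. */
static uint64_t fake_cycles;

static uint64_t fake_read(const struct cyclecounter *cc)
{
	(void)cc;
	return fake_cycles;
}
```

With this layout a user defines a single object and initialises the counter
fields through designated initialisers such as `.cc.read` and `.cc.mask`, as
the microblaze hunk does.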

[PATCH net-next v2] net: dsa: Allow compiling out legacy support

2017-12-01 Thread Florian Fainelli
Introduce a configuration option, CONFIG_NET_DSA_LEGACY, that allows compiling
out support for the old platform device and Device Tree binding registration.
Support for these configurations is scheduled to be removed in 4.17.

Signed-off-by: Florian Fainelli 
---
Changes in v2:
- make the option enabled by default
- make the .probe function part of NET_DSA_LEGACY
- make mv88e6060 depend on NET_DSA_LEGACY
- move dsa_legacy_fdb_{add,del} out of net/dsa/legacy.c

 drivers/net/dsa/Kconfig  |  2 +-
 drivers/net/dsa/mv88e6xxx/chip.c |  4 
 include/net/dsa.h| 11 +++
 net/dsa/Kconfig  |  9 +
 net/dsa/Makefile |  3 ++-
 net/dsa/dsa_priv.h   |  9 +
 net/dsa/legacy.c | 20 
 net/dsa/slave.c  | 20 
 8 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index 83a9bc892a3b..2b81b97e994f 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -33,7 +33,7 @@ config NET_DSA_MT7530
 
 config NET_DSA_MV88E6060
tristate "Marvell 88E6060 ethernet switch chip support"
-   depends on NET_DSA
+   depends on NET_DSA && NET_DSA_LEGACY
select NET_DSA_TAG_TRAILER
---help---
  This enables support for the Marvell 88E6060 ethernet switch
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 8171055fde7a..b2afbd730051 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -3739,6 +3739,7 @@ static enum dsa_tag_protocol mv88e6xxx_get_tag_protocol(struct dsa_switch *ds,
return chip->info->tag_protocol;
 }
 
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
 static const char *mv88e6xxx_drv_probe(struct device *dsa_dev,
   struct device *host_dev, int sw_addr,
   void **priv)
@@ -3786,6 +3787,7 @@ static const char *mv88e6xxx_drv_probe(struct device *dsa_dev,
 
return NULL;
 }
+#endif
 
 static int mv88e6xxx_port_mdb_prepare(struct dsa_switch *ds, int port,
  const struct switchdev_obj_port_mdb *mdb,
@@ -3827,7 +3829,9 @@ static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, int port,
 }
 
 static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
.probe  = mv88e6xxx_drv_probe,
+#endif
.get_tag_protocol   = mv88e6xxx_get_tag_protocol,
.setup  = mv88e6xxx_setup,
.adjust_link= mv88e6xxx_adjust_link,
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 2a05738570d8..e4326695653e 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -315,12 +315,14 @@ static inline u8 dsa_upstream_port(struct dsa_switch *ds)
 typedef int dsa_fdb_dump_cb_t(const unsigned char *addr, u16 vid,
  bool is_static, void *data);
 struct dsa_switch_ops {
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
/*
 * Legacy probing.
 */
const char  *(*probe)(struct device *dsa_dev,
  struct device *host_dev, int sw_addr,
  void **priv);
+#endif
 
enum dsa_tag_protocol (*get_tag_protocol)(struct dsa_switch *ds,
  int port);
@@ -472,11 +474,20 @@ struct dsa_switch_driver {
const struct dsa_switch_ops *ops;
 };
 
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
 /* Legacy driver registration */
 void register_switch_driver(struct dsa_switch_driver *type);
 void unregister_switch_driver(struct dsa_switch_driver *type);
 struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev);
 
+#else
+static inline void register_switch_driver(struct dsa_switch_driver *type) { }
+static inline void unregister_switch_driver(struct dsa_switch_driver *type) { }
+static inline struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev)
+{
+   return NULL;
+}
+#endif
 struct net_device *dsa_dev_to_net_device(struct device *dev);
 
 /* Keep inline for faster access in hot path */
diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index 03c3bdf25468..bbf2c82cf7b2 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -16,6 +16,15 @@ config NET_DSA
 
 if NET_DSA
 
+config NET_DSA_LEGACY
+   bool "Support for older platform device and Device Tree registration"
+   default y
+   ---help---
+ Say Y if you want to enable support for the older platform device and
+ deprecated Device Tree binding registration.
+
+ This feature is scheduled for removal in 4.17.
+
 # tagging formats
 config NET_DSA_TAG_BRCM
bool
diff --git a/net/dsa/Makefile b/net/dsa/Makefile
index 0e13c1f95d13..9e4d3536f977 100644
--- a/net/dsa/Makefile
+++ b/net/dsa/Makefile
@@ -1,7 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
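
The compile-out pattern used in include/net/dsa.h above (real declarations when
the option is enabled, empty static inline stubs otherwise) can be sketched in
standalone C. All names here are hypothetical: DEMO_LEGACY stands in for
IS_ENABLED(CONFIG_NET_DSA_LEGACY), and the demo_* functions stand in for the
legacy registration helpers.

```c
#include <stddef.h>

/* Hypothetical switch; build with -DDEMO_LEGACY=1 to keep the legacy path. */
#ifndef DEMO_LEGACY
#define DEMO_LEGACY 0
#endif

struct demo_driver {
	const char *name;
};

#if DEMO_LEGACY
/* Legacy path compiled in: registration really stores the driver. */
static const struct demo_driver *demo_registered;

static void demo_register_driver(const struct demo_driver *drv)
{
	demo_registered = drv;
}

static const struct demo_driver *demo_lookup_driver(void)
{
	return demo_registered;
}
#else
/* Legacy path compiled out: callers still build, the calls do nothing. */
static inline void demo_register_driver(const struct demo_driver *drv)
{
	(void)drv;
}

static inline const struct demo_driver *demo_lookup_driver(void)
{
	return NULL;
}
#endif
```

The benefit of the stubs is that call sites need no #ifdef of their own; only
the header decides whether the legacy code exists.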

[PATCH net-next] enic: add sw timestamp support

2017-12-01 Thread Govindarajulu Varadarajan
Add ethtool ops to advertise sw timestamping.
Call skb_tx_timestamp() just before ringing the wq doorbell.

Signed-off-by: Govindarajulu Varadarajan 
---
 drivers/net/ethernet/cisco/enic/enic_ethtool.c | 12 
 drivers/net/ethernet/cisco/enic/enic_main.c|  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/cisco/enic/enic_ethtool.c b/drivers/net/ethernet/cisco/enic/enic_ethtool.c
index 462d0ce51240..efb9333c7cf8 100644
--- a/drivers/net/ethernet/cisco/enic/enic_ethtool.c
+++ b/drivers/net/ethernet/cisco/enic/enic_ethtool.c
@@ -18,6 +18,7 @@
 
 #include 
 #include 
+#include 
 
 #include "enic_res.h"
 #include "enic.h"
@@ -578,6 +579,16 @@ static int enic_set_rxfh(struct net_device *netdev, const u32 *indir,
return __enic_set_rsskey(enic);
 }
 
+static int enic_get_ts_info(struct net_device *netdev,
+   struct ethtool_ts_info *info)
+{
+   info->so_timestamping = SOF_TIMESTAMPING_TX_SOFTWARE |
+   SOF_TIMESTAMPING_RX_SOFTWARE |
+   SOF_TIMESTAMPING_SOFTWARE;
+
+   return 0;
+}
+
 static const struct ethtool_ops enic_ethtool_ops = {
.get_drvinfo = enic_get_drvinfo,
.get_msglevel = enic_get_msglevel,
@@ -597,6 +608,7 @@ static const struct ethtool_ops enic_ethtool_ops = {
.get_rxfh = enic_get_rxfh,
.set_rxfh = enic_set_rxfh,
.get_link_ksettings = enic_get_ksettings,
+   .get_ts_info = enic_get_ts_info,
 };
 
 void enic_set_ethtool_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index e130fb757e7b..d98676e43e03 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -856,6 +856,7 @@ static netdev_tx_t enic_hard_start_xmit(struct sk_buff *skb,
 
if (vnic_wq_desc_avail(wq) < MAX_SKB_FRAGS + ENIC_DESC_MAX_SPLITS)
netif_tx_stop_queue(txq);
+   skb_tx_timestamp(skb);
if (!skb->xmit_more || netif_xmit_stopped(txq))
vnic_wq_doorbell(wq);
 
-- 
2.15.0
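
As a rough illustration of what the new get_ts_info callback advertises, the
sketch below builds the same capability mask in userspace and checks it the way
a consumer might. The DEMO_* constants are local copies of the flag values from
the Linux UAPI header linux/net_tstamp.h; struct demo_ts_info and the demo_*
functions are hypothetical and are not the ethtool API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the SOF_TIMESTAMPING_* bit values used by the patch. */
enum {
	DEMO_TX_HARDWARE = 1 << 0,	/* SOF_TIMESTAMPING_TX_HARDWARE */
	DEMO_TX_SOFTWARE = 1 << 1,	/* SOF_TIMESTAMPING_TX_SOFTWARE */
	DEMO_RX_SOFTWARE = 1 << 3,	/* SOF_TIMESTAMPING_RX_SOFTWARE */
	DEMO_SOFTWARE    = 1 << 4,	/* SOF_TIMESTAMPING_SOFTWARE */
};

/* Hypothetical, cut-down version of the info structure. */
struct demo_ts_info {
	uint32_t so_timestamping;
};

/* Advertise software timestamping only, as the driver callback does. */
static int demo_get_ts_info(struct demo_ts_info *info)
{
	info->so_timestamping = DEMO_TX_SOFTWARE |
				DEMO_RX_SOFTWARE |
				DEMO_SOFTWARE;
	return 0;
}

/* True if every capability bit in `wanted` is advertised. */
static bool demo_supports(const struct demo_ts_info *info, uint32_t wanted)
{
	return (info->so_timestamping & wanted) == wanted;
}
```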



Re: [PATCH/RFC] Re: 'perf test BPF' failing, libbpf regression wrt "basic API for BPF obj name"

2017-12-01 Thread Alexei Starovoitov

On 12/1/17 9:51 AM, Arnaldo Carvalho de Melo wrote:


But this is not just testcase expectations, the usecase is someone
wanting to use a newer tool, with perhaps some new features of interest
that don't depend on changes in the kernel, in an older kernel on a
system where updating it is not possible or desirable.


I think it's also dangerous for the core library like libbpf to
be smarter than the tool that is using it.
In this case we added prog and map names by default into loader and
create_map functions to make sure that all tools pick them up
automatically and we can see a bit more human readable bpf names
in kernel stack traces and in debug tools like bpftool, bcc/bps.
When kernel is older and doesn't support prog/map names, it's perfectly
reasonable to fall back to map creation without the name, but
library shouldn't be doing it in all cases.
For example, the prog_load command recently got a new prog_ifindex field.
It would be incorrect to fall back to loading without it.
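
A minimal sketch of the name-only fallback discussed above, under loud
assumptions: none of these names are libbpf API. demo_create_map() and struct
demo_map_attr are hypothetical, the "kernel" is a function pointer so the
example is self-contained, and the use of -EINVAL as the "field not supported"
signal is an assumption. Whether such a retry belongs in the library or in the
tool is exactly what the thread debates.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical attribute set; name == NULL means "no name requested". */
struct demo_map_attr {
	const char *name;
	unsigned int max_entries;
};

/* Stand-in for the syscall: returns an fd (>= 0) or a negative errno. */
typedef int (*demo_create_fn)(const struct demo_map_attr *attr);

/*
 * Try to create the map with its name. If that is rejected with -EINVAL
 * and a name was set, retry once without the name. Only the purely
 * cosmetic field is dropped; fields that change behaviour are never
 * stripped.
 */
static int demo_create_map(demo_create_fn create,
			   const struct demo_map_attr *attr)
{
	int fd = create(attr);

	if (fd == -EINVAL && attr->name) {
		struct demo_map_attr unnamed = *attr;

		unnamed.name = NULL;
		fd = create(&unnamed);
	}
	return fd;
}

/* Stand-in for an older kernel that rejects any named map. */
static int demo_old_kernel(const struct demo_map_attr *attr)
{
	return attr->name ? -EINVAL : 3;
}

/* Stand-in for a newer kernel that accepts names. */
static int demo_new_kernel(const struct demo_map_attr *attr)
{
	(void)attr;
	return 4;
}
```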



Re: [PATCH net-next V3 3/3] net: add a sysctl to make auto flowlabel consistent

2017-12-01 Thread Tom Herbert
On Fri, Dec 1, 2017 at 3:31 PM, Shaohua Li  wrote:
> From: Shaohua Li 
>
> Currently if there is negative routing, we change sock's txhash, so the
> sock will have a different flowlabel and route to different path.
> According to Tom, we'd better to have option to enable this, because some
> routers require flowlabel consistent. By default, we maintain consistent
> flowlabel, eg, negative routing doesn't change flowlabel.
>
> Suggested-by: Tom Herbert 
> Signed-off-by: Shaohua Li 
> ---
>  Documentation/networking/ip-sysctl.txt |  7 +++
>  include/net/netns/ipv6.h   |  1 +
>  include/net/sock.h | 28 +++-
>  net/ipv6/af_inet6.c|  1 +
>  net/ipv6/sysctl_net_ipv6.c |  8 
>  5 files changed, 32 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 46c7e10..14132a0 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -1345,6 +1345,13 @@ auto_flowlabels - INTEGER
>be disabled by the socket option
> Default: 1
>
> +consistent_auto_flowlabel - BOOLEAN

I think we should call it consistent_txhash since this isn't just
about the flow label.

> +   When auto_flowlabels is enabled, this option makes socket flowlabel
> +   consistent in the lifetime.
> +   TRUE: enabled
> +   FALSE: disabled
> +   Default: TRUE
> +
>  flowlabel_state_ranges - BOOLEAN
> Split the flow label number space into two ranges. 0-0x7FFFF is
> reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF
> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
> index 987cc45..e55f851 100644
> --- a/include/net/netns/ipv6.h
> +++ b/include/net/netns/ipv6.h
> @@ -30,6 +30,7 @@ struct netns_sysctl_ipv6 {
> int ip6_rt_min_advmss;
> int flowlabel_consistency;
> int auto_flowlabels;
> +   int consistent_auto_flowlabel;
> int icmpv6_time;
> int anycast_src_echo_reply;
> int ip_nonlocal_bind;
> diff --git a/include/net/sock.h b/include/net/sock.h
> index b9cb9d2..45e868f 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1729,6 +1729,18 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
> return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
>  }
>
> +static inline
> +struct net *sock_net(const struct sock *sk)
> +{
> +   return read_pnet(&sk->sk_net);
> +}
> +
> +static inline
> +void sock_net_set(struct sock *sk, struct net *net)
> +{
> +   write_pnet(&sk->sk_net, net);
> +}
> +
>  static inline void sk_set_txhash(struct sock *sk, u32 hash)
>  {
> sk->sk_txhash = hash;
> @@ -1736,7 +1748,9 @@ static inline void sk_set_txhash(struct sock *sk, u32 hash)
>
>  static inline void sk_rethink_txhash(struct sock *sk)
>  {
> -   if (sk->sk_txhash) {
> +   struct net *net = sock_net(sk);
> +
> +   if (sk->sk_txhash && !net->ipv6.sysctl.consistent_auto_flowlabel) {
> u32 v = prandom_u32();
> sk->sk_txhash = v ?: 1;
> }
> @@ -2291,18 +2305,6 @@ static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
> __kfree_skb(skb);
>  }
>
> -static inline
> -struct net *sock_net(const struct sock *sk)
> -{
> -   return read_pnet(&sk->sk_net);
> -}
> -
> -static inline
> -void sock_net_set(struct sock *sk, struct net *net)
> -{
> -   write_pnet(&sk->sk_net, net);
> -}
> -
>  static inline struct sock *skb_steal_sock(struct sk_buff *skb)
>  {
> if (skb->sk) {
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index c26f712..fe9b312 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -807,6 +807,7 @@ static int __net_init inet6_net_init(struct net *net)
> net->ipv6.sysctl.icmpv6_time = 1*HZ;
> net->ipv6.sysctl.flowlabel_consistency = 1;
> net->ipv6.sysctl.auto_flowlabels = IP6_DEFAULT_AUTO_FLOW_LABELS;
> +   net->ipv6.sysctl.consistent_auto_flowlabel = 1;
> net->ipv6.sysctl.idgen_retries = 3;
> net->ipv6.sysctl.idgen_delay = 1 * HZ;
> net->ipv6.sysctl.flowlabel_state_ranges = 0;
> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
> index a789a8a..8908092 100644
> --- a/net/ipv6/sysctl_net_ipv6.c
> +++ b/net/ipv6/sysctl_net_ipv6.c
> @@ -126,6 +126,13 @@ static struct ctl_table ipv6_table_template[] = {
> .mode   = 0644,
> .proc_handler   = proc_dointvec
> },
> +   {
> +   .procname   = "consistent_auto_flowlabel",
> +   .data   = &init_net.ipv6.sysctl.consistent_auto_flowlabel,
> +   .maxlen = sizeof(int),
> +   .mode   = 0644,
> +   .proc_handler   = proc_dointvec
> +   },
> 
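
The effect of the proposed sysctl on sk_rethink_txhash() can be sketched in
userspace. This is an illustration under assumptions: struct demo_sock and the
demo_* functions are hypothetical, the per-namespace sysctl is reduced to a
boolean parameter, and a small linear congruential generator stands in for
prandom_u32().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down socket: only the transmit hash matters here. */
struct demo_sock {
	uint32_t txhash;
};

/* Deterministic stand-in for prandom_u32(). */
static uint32_t demo_rand_state = 12345;

static uint32_t demo_rand(void)
{
	demo_rand_state = demo_rand_state * 1664525u + 1013904223u;
	return demo_rand_state;
}

/*
 * Pick a new transmit hash after negative routing advice, unless the
 * administrator asked for a consistent hash (and thus flow label).
 */
static void demo_rethink_txhash(struct demo_sock *sk, bool consistent)
{
	if (sk->txhash && !consistent) {
		uint32_t v = demo_rand();

		sk->txhash = v ? v : 1;	/* 0 means "no hash set" */
	}
}
```

With `consistent` set, the hash and therefore the derived flow label stay fixed
for the lifetime of the socket; with it cleared, each rethink selects a new
non-zero value.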

Re: [PATCH net-next 1/5] rhashtable: Don't reset walker table in rhashtable_walk_start

2017-12-01 Thread Tom Herbert
On Fri, Dec 1, 2017 at 3:29 PM, Tom Herbert  wrote:
> On Fri, Dec 1, 2017 at 2:18 PM, Herbert Xu wrote:
>> On Thu, Nov 30, 2017 at 04:03:01PM -0800, Tom Herbert wrote:
>>> Remove the code that resets the walker table. The walker table should
>>> only be initialized in the walk init function or when a future table is
>>> encountered. If the walker table is NULL this is the indication that
>>> the walk has completed and this information can be used to break a
>>> multi-call walk in the table (e.g. successive calls to netlink_dump
>>> that are dumping elements of an rhashtable).
>>>
>>> This also allows us to change rhashtable_walk_start to return void
>>> since the only error it was returning was -EAGAIN for a table change.
>>> This patch changes all the callers of rhashtable_walk_start to expect
>>> void which eliminates logic needed to check the return value for a
>>> rare condition. Note that -EAGAIN will be returned in a call
>>> to rhashtable_walk_next which seems to always follow the start
>>> of the walk so there should be no behavioral change in doing this.
>>>
>>> Signed-off-by: Tom Herbert 
>>
>> Doesn't this mean that if a walk encounters a rehash you may end up
>> missing half or more of the hash table?
>>
> Because of tbl->rehash < tbl->size conditions in walk stop? How about
> we add a flag to iter that indicates table needs a reset and set it
> along with setting walker.tbl to NULL? On the next walk start do the
> reload when walker.tbl is NULL and flag is set. In this case walk
> start would automatically set walker.tbl which is already done by
> nearly all callers already in that they ignore -EAGAIN returned from
> start walk.
>
Herbert,

Looking at this some more, I am wondering if the walkers list is
necessary. When a table rehash is done, the new table is assigned to
ht->tbl and walker->tbl is cleared for all walkers. In walk start the
walker tbl is checked and if it's NULL ht->tbl is loaded. Assuming
that -EAGAIN isn't interesting to callers here, it seems like we could
just get iter->walker.tbl in each call to walk start and not need to
maintain the walkers list at all. Am I missing something?

Tom

> Thanks,
> Tom
>
>> Cheers,
>> --
>> Email: Herbert Xu 
>> Home Page: http://gondor.apana.org.au/~herbert/
>> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
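
One possible reading of the "no walkers list" idea from the message above,
sketched as a toy model. Everything here is hypothetical: the demo_* types are
not the rhashtable API, and the sketch ignores RCU and table lifetime entirely,
so it says nothing about whether the approach is safe in the kernel.

```c
#include <stddef.h>

struct demo_table {
	unsigned int size;
};

/* The hash table points at its current bucket table. */
struct demo_ht {
	struct demo_table *tbl;
};

/* The iterator remembers which table it was walking and where. */
struct demo_iter {
	struct demo_ht *ht;
	struct demo_table *tbl;
	unsigned int slot;
};

static void demo_walk_init(struct demo_iter *it, struct demo_ht *ht)
{
	it->ht = ht;
	it->tbl = NULL;
	it->slot = 0;
}

/*
 * Returns void, as proposed for rhashtable_walk_start(). Instead of the
 * rehash code clearing every registered walker's table, the iterator
 * compares its table with the current one on each start and restarts
 * from slot 0 if they differ.
 */
static void demo_walk_start(struct demo_iter *it)
{
	if (it->tbl != it->ht->tbl) {
		it->tbl = it->ht->tbl;
		it->slot = 0;
	}
}
```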


pull-request: bpf 2017-12-02

2017-12-01 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a compilation warning in xdp redirect tracepoint due to
   missing bpf.h include that pulls in struct bpf_map, from Xie.

2) Limit the maximum number of attachable BPF progs for a given
   perf event as long as uabi is not frozen yet. The hard upper
   limit is now 64 and therefore the same as with BPF multi-prog
   for cgroups. Also add related error checking for the sample
   BPF loader when enabling and attaching to the perf event, from
   Yonghong.

3) Specifically set the RLIMIT_MEMLOCK for the test_verifier_log
   case, so that the test case can always pass and not fail in
   some environments due to too low default limit, also from
   Yonghong.

4) Fix up a missing license header comment for kernel/bpf/offload.c,
   from Jakub.

5) Several fixes for bpftool, among others a crash on incorrect
   arguments when json output is used, error message handling
   fixes on unknown options and proper destruction of json writer
   for some exit cases, all from Quentin.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 2e724dca7749223204bbae21745c0e3fc932700a:

  tipc: eliminate access after delete in group_filter_msg() (2017-11-27 14:44:45 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 0ec9552b43b98deb882bf48efd347be4bd7afc9f:

  samples/bpf: add error checking for perf ioctl calls in bpf loader (2017-12-01 02:59:21 +0100)


Daniel Borkmann (1):
  Merge branch 'bpftool-misc-fixes'

Jakub Kicinski (1):
  bpf: offload: add a license header

Quentin Monnet (6):
  tools: bpftool: fix crash on bad parameters with JSON
  tools: bpftool: clean up the JSON writer before exiting in usage()
  tools: bpftool: make error message from getopt_long() JSON-friendly
  tools: bpftool: remove spurious line break from error message
  tools: bpftool: unify installation directories
  tools: bpftool: declare phony targets as such

Xie XiuQi (1):
  trace/xdp: fix compile warning: 'struct bpf_map' declared inside parameter list

Yonghong Song (3):
  tools/bpf: adjust rlimit RLIMIT_MEMLOCK for test_verifier_log
  bpf: set maximum number of attached progs to 64 for a single perf tp
  samples/bpf: add error checking for perf ioctl calls in bpf loader

 include/trace/events/xdp.h  |  1 +
 kernel/bpf/core.c   |  3 ++-
 kernel/bpf/offload.c| 15 +++
 kernel/trace/bpf_trace.c|  8 ++
 samples/bpf/bpf_load.c  | 14 --
 tools/bpf/bpftool/Documentation/Makefile|  2 +-
 tools/bpf/bpftool/Makefile  |  7 ++---
 tools/bpf/bpftool/main.c| 36 -
 tools/bpf/bpftool/main.h|  5 ++--
 tools/testing/selftests/bpf/test_verifier_log.c |  7 +
 10 files changed, 77 insertions(+), 21 deletions(-)
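
Item 2 of the summary above (a hard upper limit of 64 attached programs per
perf event) amounts to a bounds check at attach time. The sketch below is a toy
model under assumptions: struct demo_event and demo_attach_prog() are
hypothetical, and the -E2BIG return value is an assumption about the error
code rather than something stated in this pull request.

```c
#include <errno.h>

#define DEMO_MAX_PROGS 64	/* the limit stated in the pull request */

/* Hypothetical event that only counts its attached programs. */
struct demo_event {
	int nr_progs;
};

/* Attach one more program, refusing once the limit is reached. */
static int demo_attach_prog(struct demo_event *ev)
{
	if (ev->nr_progs >= DEMO_MAX_PROGS)
		return -E2BIG;	/* assumed error code */

	ev->nr_progs++;
	return 0;
}
```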


Re: [PATCH net-next 0/2] allow setting gso_maximum values

2017-12-01 Thread Solio Sarabia
On Fri, Dec 01, 2017 at 03:30:01PM -0800, Stephen Hemminger wrote:
> On Fri,  1 Dec 2017 12:11:56 -0800
> Stephen Hemminger  wrote:
> 
> > This is another way of addressing the GSO maximum performance issues for
> > containers on Azure. What happens is that the underlying infrastructure uses
> > an overlay network such that GSO packets over 64K - vlan header end up causing
> > either guest or host to have to do expensive software copy and fragmentation.
> > 
> > The netvsc driver reports GSO maximum settings correctly, the issue
> > is that containers on veth devices still have the larger settings.
> > One solution that was examined was propagating the values back
> > through the bridge device, but this does not work for cases where
> > virtual container network is done on L3.
> > 
> > This patch set punts the problem to the orchestration layer that sets
> > up the container network. It also enables other virtual devices
> > to have configurable settings for GSO maximum.
> > 
> > Stephen Hemminger (2):
> >   rtnetlink: allow GSO maximums to be passed to device
> >   veth: allow configuring GSO maximums
> > 
> >  drivers/net/veth.c   | 20 
> >  net/core/rtnetlink.c |  2 ++
> >  2 files changed, 22 insertions(+)
> > 
> 
> I would like a confirmation from Intel that is doing Docker testing
> that this works for them before merging.

This change and its iproute2 counterpart allow creating veth pairs with
specific gso_max_{size,segs}. Thanks.

However, the Docker code that sets up veth pairs is Go-compiled in
their libnetwork. End users won't be able to tweak GSO settings at veth
creation. In this case, we would need to add ioctl (ip/iplink.c:do_set)
support to allow changes after the veth is created.


x86 boot broken on -rc1?

2017-12-01 Thread Jakub Kicinski
Hi!

I'm hitting these after DaveM pulled rc1 into net-next on my Xeon
E5-2630 v4 box.  It also happens on linux-next.  Did anyone else
experience it?  (.config attached)

[5.003771] WARNING: CPU: 14 PID: 1 at ../arch/x86/events/intel/uncore.c:936 uncore_pci_probe+0x285/0x2b0
[5.007544] Modules linked in:
[5.007544] CPU: 14 PID: 1 Comm: swapper/0 Not tainted 4.15.0-rc1-perf-00225-gb2a4e0a76b1d #782
[5.007544] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
[5.007544] task: 9e842725 task.stack: 8a63fd2d
[5.007544] RIP: 0010:uncore_pci_probe+0x285/0x2b0
[5.007544] RSP: :ad8580163d10 EFLAGS: 00010286
[5.007544] RAX: 98576cc3df30 RBX: b08037e0 RCX: b0c1a120
[5.007544] RDX:  RSI:  RDI: b0c1a960
[5.007544] RBP: 985b6c00ac00 R08: fffe R09: 000f
[5.007544] R10: 98576f1b6018 R11: 0022 R12: 985b6c641000
[5.007544] R13: 0001 R14: 0001 R15: 0001
[5.007544] FS:  () GS:98576fb8() 
knlGS:
[5.007544] CS:  0010 DS:  ES:  CR0: 80050033
[5.007544] CR2:  CR3: 000185c09001 CR4: 003606e0
[5.007544] DR0:  DR1:  DR2: 
[5.007544] DR3:  DR6: fffe0ff0 DR7: 0400
[5.007544] Call Trace:
[5.007544]  local_pci_probe+0x3d/0x90
[5.007544]  ? pci_match_device+0xd9/0x100
[5.007544]  pci_device_probe+0x122/0x180
[5.007544]  driver_probe_device+0x246/0x330
[5.007544]  ? set_debug_rodata+0x11/0x11
[5.007544]  __driver_attach+0x8a/0x90
[5.007544]  ? driver_probe_device+0x330/0x330
[5.007544]  bus_for_each_dev+0x5c/0x90
[5.007544]  bus_add_driver+0x196/0x220
[5.007544]  driver_register+0x57/0xc0
[5.007544]  intel_uncore_init+0x1e3/0x249
[5.007544]  ? uncore_type_init+0x193/0x193
[5.007544]  ? set_debug_rodata+0x11/0x11
[5.007544]  do_one_initcall+0x4b/0x190
[5.007544]  kernel_init_freeable+0x16e/0x1f5
[5.007544]  ? rest_init+0xd0/0xd0
[5.007544]  kernel_init+0xa/0x100
[5.007544]  ret_from_fork+0x1f/0x30
[5.007544] Code: 48 8b 52 08 48 85 d2 74 0d 89 44 24 04 48 89 df ff d2 8b 
44 24 04 48 89 df 89 44 24 04 e8 54 0a 1c 00 8b 44 24 0 
[5.007544] ---[ end trace 4dc4c3d5f5afcd2f ]---
[5.244504] bdx_uncore: probe of :ff:08.2 failed with error -22
[5.251604] bdx_uncore: probe of :ff:0b.1 failed with error -22
[5.258711] bdx_uncore: probe of :ff:10.1 failed with error -22
[5.265819] bdx_uncore: probe of :ff:14.0 failed with error -22
[5.272919] bdx_uncore: probe of :ff:14.1 failed with error -22
[5.280019] bdx_uncore: probe of :ff:15.0 failed with error -22
[5.287112] bdx_uncore: probe of :ff:15.1 failed with error -22
[5.294376] WARNING: CPU: 1 PID: 15 at ../arch/x86/events/intel/uncore.c:1065 uncore_change_type_ctx.isra.5+0xe6/0xf0
[5.298362] Modules linked in:
[5.298362] CPU: 1 PID: 15 Comm: cpuhp/1 Tainted: GW 4.15.0-rc1-perf-00225-gb2a4e0a76b1d #782
[5.298362] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
[5.298362] task: ae78bc8f task.stack: f79660c1
[5.298362] RIP: 0010:uncore_change_type_ctx.isra.5+0xe6/0xf0
[5.298362] RSP: :ad85833b3db8 EFLAGS: 00010213
[5.298362] RAX:  RBX: 9857669b0200 RCX: 0001
[5.298362] RDX: 985b6f00 RSI: 985b66580400 RDI: b0c1ae8c
[5.298362] RBP: 985b66580400 R08: b0c1ae8c R09: 0001
[5.298362] R10:  R11: 003d0900 R12: 
[5.298362] R13:  R14: 0001 R15: 0008
[5.298362] FS:  () GS:985b6f00() 
knlGS:
[5.298362] CS:  0010 DS:  ES:  CR0: 80050033
[5.298362] CR2:  CR3: 000185c09001 CR4: 003606e0
[5.298362] DR0:  DR1:  DR2: 
[5.298362] DR3:  DR6: fffe0ff0 DR7: 0400
[5.298362] Call Trace:
[5.298362]  uncore_event_cpu_online+0x283/0x340
[5.298362]  ? uncore_event_cpu_offline+0x180/0x180
[5.298362]  cpuhp_invoke_callback+0x8c/0x620
[5.298362]  ? __schedule+0x1ad/0x6c0
[5.298362]  ? sort_range+0x20/0x20
[5.298362]  cpuhp_thread_fun+0xbc/0x140
[5.298362]  smpboot_thread_fn+0x114/0x1d0
[5.298362]  kthread+0x111/0x130
[5.298362]  ? kthread_create_on_node+0x40/0x40
[5.298362]  ret_from_fork+0x1f/0x30
[5.298362] Code: 2a 44 89 73 10 41 83 c4 01 48 81 c5 40 01 00 00 45 3b 20 
7c cf 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f f 
[5.298362] ---[ end trace 

[Patch net-next] net_sched: get rid of rcu_barrier() in tcf_block_put_ext()

2017-12-01 Thread Cong Wang
Both Eric and Paolo noticed the rcu_barrier() we use in
tcf_block_put_ext() could be a performance bottleneck when
we have lots of filters.

Paolo provided the following to demonstrate the issue:

tc qdisc add dev lo root htb
for I in `seq 1 1000`; do
        tc class add dev lo parent 1: classid 1:$I htb rate 100kbit
        tc qdisc add dev lo parent 1:$I handle $((I + 1)): htb
        for J in `seq 1 10`; do
                tc filter add dev lo parent $((I + 1)): u32 match ip src 1.1.1.$J
        done
done
time tc qdisc del dev lo root

real    0m54.764s
user    0m0.023s
sys     0m0.000s

The rcu_barrier() there is to ensure we free the block after all chains
are gone, that is, to queue tcf_block_put_final() at the tail of workqueue.
We can achieve this ordering requirement by refcnt'ing tcf block instead,
that is, the tcf block is freed only when the last chain in this block is
gone. This also simplifies the code.
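
The ordering argument above can be sketched in user space. This is only
an illustration under stated assumptions: struct block, struct chain and
the helper names are hypothetical stand-ins, not the kernel's tcf_block
API, and the "freed" pointer exists only so the sketch can observe when
the block is released.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins for tcf_block and tcf_chain. */
struct block {
	unsigned int nr_chains;	/* chains that still point at this block */
	int *freed;		/* set to 1 on release, for observation only */
};

struct chain {
	struct block *block;
};

static struct block *block_create(int *freed)
{
	struct block *b = calloc(1, sizeof(*b));

	if (b)
		b->freed = freed;
	return b;
}

static struct chain *chain_create(struct block *b)
{
	struct chain *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	c->block = b;
	b->nr_chains++;		/* every chain pins its block */
	return c;
}

static void chain_destroy(struct chain *c)
{
	struct block *b = c->block;

	free(c);
	/* The block goes away only with its last chain, so no barrier
	 * or deferred work is needed to order the two frees.
	 */
	if (!--b->nr_chains) {
		*b->freed = 1;
		free(b);
	}
}

/* Returns 1 if the block was released exactly when the last chain died. */
static int demo(void)
{
	int freed = 0;
	struct block *b = block_create(&freed);
	struct chain *c0, *c1;

	if (!b)
		return -1;
	c0 = chain_create(b);
	c1 = chain_create(b);
	if (!c0 || !c1)
		return -1;

	chain_destroy(c0);
	if (freed)
		return 0;	/* released too early */
	chain_destroy(c1);
	return freed;
}
```

The sketch ignores locking; in the patch the counter is a plain unsigned
int, which presumably relies on the callers already being serialized.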

Paolo reported after this patch we get:

real    0m0.017s
user    0m0.000s
sys     0m0.017s

Tested-by: Paolo Abeni 
Cc: Eric Dumazet 
Cc: Jiri Pirko 
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
---
 include/net/sch_generic.h |  2 +-
 net/sched/cls_api.c   | 31 +--
 2 files changed, 10 insertions(+), 23 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 65d0d25f2648..b013ded1a38d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -278,7 +278,7 @@ struct tcf_block {
struct net *net;
struct Qdisc *q;
struct list_head cb_list;
-   struct work_struct work;
+   unsigned int nr_chains;
 };
 
 static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index ddcf04b4ab43..dec0d36078c8 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -190,6 +190,7 @@ static struct tcf_chain *tcf_chain_create(struct tcf_block *block,
return NULL;
list_add_tail(&chain->list, &block->chain_list);
chain->block = block;
+   block->nr_chains++;
chain->index = chain_index;
chain->refcnt = 1;
return chain;
@@ -218,8 +219,12 @@ static void tcf_chain_flush(struct tcf_chain *chain)
 
 static void tcf_chain_destroy(struct tcf_chain *chain)
 {
+   struct tcf_block *block = chain->block;
+
list_del(&chain->list);
kfree(chain);
+   if (!--block->nr_chains)
+   kfree(block);
 }
 
 static void tcf_chain_hold(struct tcf_chain *chain)
@@ -330,27 +335,13 @@ int tcf_block_get(struct tcf_block **p_block,
 }
 EXPORT_SYMBOL(tcf_block_get);
 
-static void tcf_block_put_final(struct work_struct *work)
-{
-   struct tcf_block *block = container_of(work, struct tcf_block, work);
-   struct tcf_chain *chain, *tmp;
-
-   rtnl_lock();
-
-   /* At this point, all the chains should have refcnt == 1. */
-   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
-   tcf_chain_put(chain);
-   rtnl_unlock();
-   kfree(block);
-}
-
 /* XXX: Standalone actions are not allowed to jump to any chain, and bound
  * actions should be all removed after flushing.
  */
 void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
   struct tcf_block_ext_info *ei)
 {
-   struct tcf_chain *chain;
+   struct tcf_chain *chain, *tmp;
 
/* Hold a refcnt for all chains, except 0, so that they don't disappear
 * while we are iterating.
@@ -364,13 +355,9 @@ void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
 
tcf_block_offload_unbind(block, q, ei);
 
-   INIT_WORK(&block->work, tcf_block_put_final);
-   /* Wait for existing RCU callbacks to cool down, make sure their works
-* have been queued before this. We can not flush pending works here
-* because we are holding the RTNL lock.
-*/
-   rcu_barrier();
-   tcf_queue_work(&block->work);
+   /* At this point, all the chains should have refcnt >= 1. */
+   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+   tcf_chain_put(chain);
 }
 EXPORT_SYMBOL(tcf_block_put_ext);
 
-- 
2.13.0



Re: [PATCH net-next] net: dsa: Allow compiling out legacy support

2017-12-01 Thread Florian Fainelli


On 12/01/2017 07:21 AM, Vivien Didelot wrote:
> Hi Florian,
> 
> Florian Fainelli  writes:
> 
>> +#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
>>  /* Legacy driver registration */
>>  void register_switch_driver(struct dsa_switch_driver *type);
>>  void unregister_switch_driver(struct dsa_switch_driver *type);
>>  struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev);
>>  
>> +#else
>> +static inline void register_switch_driver(struct dsa_switch_driver *type) { }
>> +static inline void unregister_switch_driver(struct dsa_switch_driver *type) { }
>> +static inline struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev)
>> +{
>> +return NULL;
>> +}
>> +#endif
> 
> The .probe dsa_switch_ops is part of the legacy code, we may want to
> wrap it in a CONFIG_NET_DSA_LEGACY check as well.

Fixed, also made 88e6060 dependent on CONFIG_NET_DSA_LEGACY as a result.

> 
>>  struct net_device *dsa_dev_to_net_device(struct device *dev);
>>  
>>  /* Keep inline for faster access in hot path */
>> diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
>> index 03c3bdf25468..b6ec8e9069e4 100644
>> --- a/net/dsa/Kconfig
>> +++ b/net/dsa/Kconfig
>> @@ -16,6 +16,14 @@ config NET_DSA
>>  
>>  if NET_DSA
>>  
>> +config NET_DSA_LEGACY
> 
> We need to have it enabled by default, otherwise we'll miss errors when
> touching the code shared by both legacy and new bindings.

Fixed.

> 
>> +bool "Support for older platform device and Device Tree registration"
>> +---help---
>> +  Say Y if you want to enable support for the older platform device and
>> +  deprectaed Device Tree binding registration.
> 
>   deprecated*
> 
>> +
>> +  This feature is scheduled for removal in 4.17.
>> +
>>  /* legacy.c */
>> +#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
>>  int dsa_legacy_register(void);
>>  void dsa_legacy_unregister(void);
>>  int dsa_legacy_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
>> @@ -106,6 +107,28 @@ int dsa_legacy_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
>>  int dsa_legacy_fdb_del(struct ndmsg *ndm, struct nlattr *tb[],
>> struct net_device *dev,
>> const unsigned char *addr, u16 vid);
> 
> the dsa_legacy_fdb_{add,del} routines are "legacy" in terms of FDB
> handling, not in terms of DSA bindings, we must keep them.

Oh, right. This should probably be moved somewhere else then, right? The
whole idea was to compile out net/dsa/legacy.c
-- 
Florian


Re: [PATCH v4 3/8] MIPS: Octeon: Add a global resource manager.

2017-12-01 Thread Philippe Ombredanne
On Fri, Dec 1, 2017 at 9:56 PM, David Daney  wrote:
> On 12/01/2017 12:41 PM, Philippe Ombredanne wrote:
>>
>> David,
>>
>> On Fri, Dec 1, 2017 at 9:01 PM, David Daney 
>> wrote:
>>>
>>> On 12/01/2017 11:49 AM, Philippe Ombredanne wrote:


 David, Greg,

 On Fri, Dec 1, 2017 at 6:42 PM, David Daney 
 wrote:
>
>
> On 11/30/2017 11:53 PM, Philippe Ombredanne wrote:


 [...]


 --- /dev/null
 +++ b/arch/mips/cavium-octeon/resource-mgr.c
 @@ -0,0 +1,371 @@
 +// SPDX-License-Identifier: GPL-2.0
 +/*
 + * Resource manager for Octeon.
 + *
 + * This file is subject to the terms and conditions of the GNU
 General
 Public
 + * License.  See the file "COPYING" in the main directory of this
 archive
 + * for more details.
 + *
 + * Copyright (C) 2017 Cavium, Inc.
 + */
>>
>>
>>
>>
>> Since you nicely included an SPDX id, you would not need the
>> boilerplate anymore. e.g. these can go alright?
>
>
>
>
> They may not be strictly speaking necessary, but I don't think they
> hurt
> anything.  Unless there is a requirement to strip out the license text,
> we
> would stick with it as is.



 I think the requirement is there and that would be much better for
 everyone: keeping both is redundant and does not bring any value, does
 it? Instead it kinda removes the benefits of having the SPDX id in the
 first place IMHO.

 Furthermore, as there have been already ~12K+ files cleaned up and
 still over 60K files to go, it would really nice if new files could
 adopt the new style: this way we will not have to revisit and repatch
 them in the future.

>>>
>>> I am happy to follow any style Greg would suggest.  There doesn't seem to
>>> be
>>> much documentation about how this should be done yet.
>>
>>
>> Thomas (tglx) has already submitted a first series of doc patches a
>> few weeks ago. And AFAIK he might be working on posting the updates
>> soon, whenever his real time clock yields a few cycles away from real
>> time coding work ;)
>>
>> See also these discussions with Linus [1][2][3], Thomas[4] and Greg[5]
>> on this and mostly related topics
>>
>> [1] https://lkml.org/lkml/2017/11/2/715
>> [2] https://lkml.org/lkml/2017/11/25/125
>> [3] https://lkml.org/lkml/2017/11/25/133
>> [4] https://lkml.org/lkml/2017/11/2/805
>> [5] https://lkml.org/lkml/2017/10/19/165
>>
>
> OK, you convinced me.
>
> Thanks,
> David
>

No, thank you: for doing real work on the kernel that makes my
servers and laptops run, while I nitpick over comments.

-- 
Cordially
Philippe Ombredanne


[PATCH net-next V3 2/3] net-next: copy user configured flowlabel to reset packet

2017-12-01 Thread Shaohua Li
From: Shaohua Li 

A reset packet does not use the user-configured flowlabel; it always
uses 0, which makes the flowlabel inconsistent. The tw sock already
records the flowlabel info, so we can use it directly.

Cc: Martin KaFai Lau 
Cc: Eric Dumazet 
Cc: Florent Fourcot 
Cc: Cong Wang 
Cc: Tom Herbert 
Signed-off-by: Shaohua Li 
---
 net/ipv6/tcp_ipv6.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 1e4ce06..b8383be 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -902,6 +902,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
struct sock *sk1 = NULL;
 #endif
int oif = 0;
+   u8 tclass = 0;
+   __be32 flowlabel = 0;
 
if (th->rst)
return;
@@ -955,7 +957,21 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
trace_tcp_send_reset(sk, skb);
}
 
-   tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
+   if (sk) {
+   if (sk_fullsock(sk)) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   tclass = np->tclass;
+   flowlabel = np->flow_label & IPV6_FLOWLABEL_MASK;
+   } else {
+   struct inet_timewait_sock *tw = inet_twsk(sk);
+
+   tclass = tw->tw_tclass;
+   flowlabel = cpu_to_be32(tw->tw_flowlabel);
+   }
+   }
+   tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1,
+   tclass, flowlabel);
 
 #ifdef CONFIG_TCP_MD5SIG
 out:
-- 
2.9.5



[PATCH net-next V3 3/3] net: add a sysctl to make auto flowlabel consistent

2017-12-01 Thread Shaohua Li
From: Shaohua Li 

Currently, if there is negative routing, we change the sock's txhash, so
the sock gets a different flowlabel and its packets take a different path.
According to Tom, this behavior should be optional, because some
routers require a consistent flowlabel. By default we keep the flowlabel
consistent, i.e. negative routing does not change it.
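
The gating described above can be sketched in user space as follows.
Assumptions: struct fake_sock and rethink_txhash() are hypothetical
names, and the fresh random value is passed in as a parameter (instead
of calling a random generator) so the behavior is deterministic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the one struct sock field that matters here. */
struct fake_sock {
	uint32_t txhash;
};

/*
 * Pick a new hash only when one is already set and the consistency
 * knob is off. 'new_random' stands in for a fresh random value.
 */
static void rethink_txhash(struct fake_sock *sk,
			   int consistent_auto_flowlabel,
			   uint32_t new_random)
{
	if (sk->txhash && !consistent_auto_flowlabel)
		sk->txhash = new_random ? new_random : 1; /* never store 0 */
}
```

With the knob set to 1 (the default in this patch) the hash, and hence
the automatic flowlabel derived from it, stays put; with 0 the old
rehash-on-negative-routing behavior applies.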

Suggested-by: Tom Herbert 
Signed-off-by: Shaohua Li 
---
 Documentation/networking/ip-sysctl.txt |  7 +++
 include/net/netns/ipv6.h   |  1 +
 include/net/sock.h | 28 +++-
 net/ipv6/af_inet6.c|  1 +
 net/ipv6/sysctl_net_ipv6.c |  8 
 5 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 46c7e10..14132a0 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1345,6 +1345,13 @@ auto_flowlabels - INTEGER
   be disabled by the socket option
Default: 1
 
+consistent_auto_flowlabel - BOOLEAN
+   When auto_flowlabels is enabled, this option makes socket flowlabel
+   consistent in the lifetime.
+   TRUE: enabled
+   FALSE: disabled
+   Default: TRUE
+
 flowlabel_state_ranges - BOOLEAN
Split the flow label number space into two ranges. 0-0x7 is
reserved for the IPv6 flow manager facility, 0x8-0xF
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 987cc45..e55f851 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -30,6 +30,7 @@ struct netns_sysctl_ipv6 {
int ip6_rt_min_advmss;
int flowlabel_consistency;
int auto_flowlabels;
+   int consistent_auto_flowlabel;
int icmpv6_time;
int anycast_src_echo_reply;
int ip_nonlocal_bind;
diff --git a/include/net/sock.h b/include/net/sock.h
index b9cb9d2..45e868f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1729,6 +1729,18 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
 }
 
+static inline
+struct net *sock_net(const struct sock *sk)
+{
+   return read_pnet(&sk->sk_net);
+}
+
+static inline
+void sock_net_set(struct sock *sk, struct net *net)
+{
+   write_pnet(&sk->sk_net, net);
+}
+
 static inline void sk_set_txhash(struct sock *sk, u32 hash)
 {
sk->sk_txhash = hash;
@@ -1736,7 +1748,9 @@ static inline void sk_set_txhash(struct sock *sk, u32 hash)
 
 static inline void sk_rethink_txhash(struct sock *sk)
 {
-   if (sk->sk_txhash) {
+   struct net *net = sock_net(sk);
+
+   if (sk->sk_txhash && !net->ipv6.sysctl.consistent_auto_flowlabel) {
u32 v = prandom_u32();
sk->sk_txhash = v ?: 1;
}
@@ -2291,18 +2305,6 @@ static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
__kfree_skb(skb);
 }
 
-static inline
-struct net *sock_net(const struct sock *sk)
-{
-   return read_pnet(&sk->sk_net);
-}
-
-static inline
-void sock_net_set(struct sock *sk, struct net *net)
-{
-   write_pnet(&sk->sk_net, net);
-}
-
 static inline struct sock *skb_steal_sock(struct sk_buff *skb)
 {
if (skb->sk) {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index c26f712..fe9b312 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -807,6 +807,7 @@ static int __net_init inet6_net_init(struct net *net)
net->ipv6.sysctl.icmpv6_time = 1*HZ;
net->ipv6.sysctl.flowlabel_consistency = 1;
net->ipv6.sysctl.auto_flowlabels = IP6_DEFAULT_AUTO_FLOW_LABELS;
+   net->ipv6.sysctl.consistent_auto_flowlabel = 1;
net->ipv6.sysctl.idgen_retries = 3;
net->ipv6.sysctl.idgen_delay = 1 * HZ;
net->ipv6.sysctl.flowlabel_state_ranges = 0;
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index a789a8a..8908092 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -126,6 +126,13 @@ static struct ctl_table ipv6_table_template[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec
},
+   {
+   .procname   = "consistent_auto_flowlabel",
+   .data   = &init_net.ipv6.sysctl.consistent_auto_flowlabel,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{ }
 };
 
@@ -190,6 +197,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
ipv6_table[11].data = &net->ipv6.sysctl.max_hbh_opts_cnt;
ipv6_table[12].data = &net->ipv6.sysctl.max_dst_opts_len;
ipv6_table[13].data = &net->ipv6.sysctl.max_hbh_opts_len;
+   ipv6_table[14].data = &net->ipv6.sysctl.consistent_auto_flowlabel;
 
ipv6_route_table = ipv6_route_sysctl_init(net);
if 

[PATCH net-next V3 0/3] net: fix flowlabel inconsistency in reset packet

2017-12-01 Thread Shaohua Li
From: Shaohua Li 

Hi,

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options 
[mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 
43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 
7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options 
[nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload 
length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The TCP reset packet has a different flowlabel, so our router does not
correctly close the TCP connection. We use the flowlabel for load
balancing, and routers in the path maintain connection state. If the flow
label changes, the packet is routed through a different router, and the
old router never sees the reset packet that would close the TCP
connection.

The reason is that a normal packet gets skb->hash from sk->sk_txhash,
which is generated randomly, and ip6_make_flowlabel then uses that hash
to create the flowlabel. The reset packet is not assigned a hash, so its
flowlabel is calculated from flowi6.
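
To make the mismatch concrete, here is a deliberately simplified sketch
of deriving a 20-bit flowlabel from a 32-bit hash. Assumptions:
label_from_hash() is a hypothetical helper working in host byte order;
it only masks the hash and leaves out whatever else ip6_make_flowlabel
does, and the two input hashes are made-up values chosen so the results
match the labels seen in the trace.

```c
#include <assert.h>
#include <stdint.h>

/* An IPv6 flow label is 20 bits wide. */
#define SKETCH_FLOWLABEL_MASK 0x000FFFFFu

/*
 * Simplified illustration: the label is a pure function of the hash,
 * so packets built from different hashes carry different labels.
 */
static uint32_t label_from_hash(uint32_t hash)
{
	return hash & SKETCH_FLOWLABEL_MASK;
}
```

For example, a hypothetical socket hash of 0x123d827f gives label
0xd827f on every normal packet, while a reset built from some other
hash, say 0xabcb34d5, gives 0xb34d5, which is the kind of mismatch
visible in the last line of the tcpdump output.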

The patches fix the issue.

Thanks,
Shaohua

V2->V3:
- Address Tom's comments
- Add a new sysctl suggested by Tom

Shaohua Li (3):
  net-next: use five-tuple hash for sk_txhash
  net-next: copy user configured flowlabel to reset packet
  net: add a sysctl to make auto flowlabel consistent

 Documentation/networking/ip-sysctl.txt |  7 +++
 include/linux/tcp.h|  5 +
 include/net/netns/ipv6.h   |  1 +
 include/net/sock.h | 35 +++-
 include/net/tcp.h  |  2 +-
 net/ipv4/datagram.c|  2 +-
 net/ipv4/syncookies.c  |  4 +++-
 net/ipv4/tcp_input.c   |  1 -
 net/ipv4/tcp_ipv4.c| 18 -
 net/ipv4/tcp_output.c  |  1 -
 net/ipv6/af_inet6.c|  1 +
 net/ipv6/datagram.c|  4 +++-
 net/ipv6/syncookies.c  |  3 ++-
 net/ipv6/sysctl_net_ipv6.c |  8 
 net/ipv6/tcp_ipv6.c| 37 --
 15 files changed, 92 insertions(+), 

[PATCH net-next V3 1/3] net-next: use five-tuple hash for sk_txhash

2017-12-01 Thread Shaohua Li
From: Shaohua Li 

We use sk_txhash to calculate the flowlabel, but sk_txhash isn't
always available, for example in inet_timewait_sock. This is a
problem for the reset packet, which ends up with a different flowlabel,
so our router does not correctly close the TCP connection. We use the
flowlabel for load balancing, and routers in the path maintain connection
state. If the flow label changes, the packet is routed through a
different router, and the old router never sees the reset packet that
would close the TCP connection.

Per Tom's suggestion, we switch back to five-tuple hash, so we can
reconstruct correct flowlabel for reset packet.

In most places we already have the flowi info, so we use it directly to
build sk_txhash. For synack, we do this after the route lookup; at that
point the flowi info is ready, so we don't need to create it
again.

Cc: Martin KaFai Lau 
Cc: Eric Dumazet 
Cc: Florent Fourcot 
Cc: Cong Wang 
Cc: Tom Herbert 
Signed-off-by: Shaohua Li 
---
 include/linux/tcp.h   |  5 +
 include/net/sock.h| 17 ++---
 include/net/tcp.h |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 18 +-
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 19 ++-
 11 files changed, 48 insertions(+), 28 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index df5d97a..227e8b2 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -139,6 +139,11 @@ struct tcp_request_sock {
  */
 };
 
+static inline void tcp_rsk_set_txhash(struct tcp_request_sock *rsk, u32 hash)
+{
+   rsk->txhash = hash;
+}
+
 static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
 {
return (struct tcp_request_sock *)req;
diff --git a/include/net/sock.h b/include/net/sock.h
index 79e1a2c..b9cb9d2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1729,22 +1729,17 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
 }
 
-static inline u32 net_tx_rndhash(void)
+static inline void sk_set_txhash(struct sock *sk, u32 hash)
 {
-   u32 v = prandom_u32();
-
-   return v ?: 1;
-}
-
-static inline void sk_set_txhash(struct sock *sk)
-{
-   sk->sk_txhash = net_tx_rndhash();
+   sk->sk_txhash = hash;
 }
 
 static inline void sk_rethink_txhash(struct sock *sk)
 {
-   if (sk->sk_txhash)
-   sk_set_txhash(sk);
+   if (sk->sk_txhash) {
+   u32 v = prandom_u32();
+   sk->sk_txhash = v ?: 1;
+   }
 }
 
 static inline struct dst_entry *
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4e09398..a5c28be 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1840,7 +1840,7 @@ struct tcp_request_sock_ops {
 __u16 *mss);
 #endif
struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
-  const struct request_sock *req);
+  struct request_sock *req);
u32 (*init_seq)(const struct sk_buff *skb);
u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb);
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abf..1f2f9fc 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -74,7 +74,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
inet->inet_daddr = fl4->daddr;
inet->inet_dport = usin->sin_port;
sk->sk_state = TCP_ESTABLISHED;
-   sk_set_txhash(sk);
+   sk_set_txhash(sk, get_hash_from_flowi4(fl4));
inet->inet_id = jiffies;
 
sk_dst_set(sk, &rt->dst);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index fda37f2..ecf6e7a 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -335,7 +335,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
treq->rcv_isn   = ntohl(th->seq) - 1;
treq->snt_isn   = cookie;
treq->ts_off= 0;
-   treq->txhash= net_tx_rndhash();
req->mss= mss;
ireq->ir_num= ntohs(th->dest);
ireq->ir_rmt_port   = th->source;
@@ -376,6 +375,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
   opt->srr ? opt->faddr : ireq->ir_rmt_addr,
   ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid);
security_req_classify_flow(req, flowi4_to_flowi(&fl4));
+
+   tcp_rsk_set_txhash(treq, 

Re: [PATCH net-next 0/2] allow setting gso_maximum values

2017-12-01 Thread Stephen Hemminger
On Fri,  1 Dec 2017 12:11:56 -0800
Stephen Hemminger  wrote:

> This is another way of addressing the GSO maximum performance issues for
> containers on Azure. What happens is that the underlying infrastructure uses
> an overlay network such that GSO packets over 64K - vlan header end up causing
> either guest or host to have to do expensive software copy and fragmentation.
> 
> The netvsc driver reports GSO maximum settings correctly, the issue
> is that containers on veth devices still have the larger settings.
> One solution that was examined was propagating the values back
> through the bridge device, but this does not work for cases where
> virtual container network is done on L3.
> 
> This patch set punts the problem to the orchestration layer that sets
> up the container network. It also enables other virtual devices
> to have configurable settings for GSO maximum.
> 
> Stephen Hemminger (2):
>   rtnetlink: allow GSO maximums to be passed to device
>   veth: allow configuring GSO maximums
> 
>  drivers/net/veth.c   | 20 
>  net/core/rtnetlink.c |  2 ++
>  2 files changed, 22 insertions(+)
> 

I would like confirmation from Intel, who are doing the Docker testing,
that this works for them before merging.


Re: [PATCH v5 net-next,mips 1/7] dt-bindings: Add Cavium Octeon Common Ethernet Interface.

2017-12-01 Thread Florian Fainelli


On 12/01/2017 03:18 PM, David Daney wrote:
> From: Carlos Munoz 
> 
> Add bindings for Common Ethernet Interface (BGX) block.
> 
> Acked-by: Rob Herring 
> Signed-off-by: Carlos Munoz 
> Signed-off-by: Steven J. Hill 
> Signed-off-by: David Daney 

Reviewed-by: Florian Fainelli 
-- 
Florian


[PATCH net] Revert "tcp: must block bh in __inet_twsk_hashdance()"

2017-12-01 Thread Eric Dumazet
From: Eric Dumazet 

We had to disable BH _before_ calling __inet_twsk_hashdance() in commit
cfac7f836a71 ("tcp/dccp: block bh before arming time_wait timer").

This means we can revert 614bdd4d6e61 ("tcp: must block bh in
__inet_twsk_hashdance()").

Signed-off-by: Eric Dumazet 
---
 net/ipv4/inet_timewait_sock.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index c690cd0d9b3f0af53c23b9a1ecc87be4098ae059..b563e0c46bac2362acccf38495546a8b6b726384 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -93,7 +93,7 @@ static void inet_twsk_add_bind_node(struct inet_timewait_sock *tw,
 }
 
 /*
- * Enter the time wait state.
+ * Enter the time wait state. This is called with locally disabled BH.
  * Essentially we whip up a timewait bucket, copy the relevant info into it
  * from the SK, and mess with hash chains and list linkage.
  */
@@ -111,7 +111,7 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 */
bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), inet->inet_num,
hashinfo->bhash_size)];
-   spin_lock_bh(&bhead->lock);
+   spin_lock(&bhead->lock);
tw->tw_tb = icsk->icsk_bind_hash;
WARN_ON(!icsk->icsk_bind_hash);
inet_twsk_add_bind_node(tw, &tw->tw_tb->owners);
@@ -137,7 +137,7 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
if (__sk_nulls_del_node_init_rcu(sk))
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 
-   spin_unlock_bh(lock);
+   spin_unlock(lock);
 }
 EXPORT_SYMBOL_GPL(__inet_twsk_hashdance);
 


Re: [PATCH net-next 1/5] rhashtable: Don't reset walker table in rhashtable_walk_start

2017-12-01 Thread Tom Herbert
On Fri, Dec 1, 2017 at 2:18 PM, Herbert Xu  wrote:
> On Thu, Nov 30, 2017 at 04:03:01PM -0800, Tom Herbert wrote:
>> Remove the code that resets the walker table. The walker table should
>> only be initialized in the walk init function or when a future table is
>> encountered. If the walker table is NULL this is the indication that
>> the walk has completed and this information can be used to break a
>> multi-call walk in the table (e.g. successive calls to netlink_dump
>> that are dumping elements of an rhashtable).
>>
>> This also allows us to change rhashtable_walk_start to return void
>> since the only error it was returning was -EAGAIN for a table change.
>> This patch changes all the callers of rhashtable_walk_start to expect
>> void which eliminates logic needed to check the return value for a
>> rare condition. Note that -EAGAIN will be returned in a call
>> to rhashtable_walk_next which seems to always follow the start
>> of the walk so there should be no behavioral change in doing this.
>>
>> Signed-off-by: Tom Herbert 
>
> Doesn't this mean that if a walk encounters a rehash you may end up
> missing half or more of the hash table?
>
Because of the tbl->rehash < tbl->size conditions in walk stop? How about
we add a flag to iter that indicates the table needs a reset, and set it
along with setting walker.tbl to NULL? On the next walk start, do the
reload when walker.tbl is NULL and the flag is set. In this case walk
start would automatically set walker.tbl, which matches what nearly all
callers already do, in that they ignore the -EAGAIN returned from walk
start.

Thanks,
Tom

> Cheers,
> --
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH net-next 2/3] bpf: allow disabling tunnel csum for ipv6

2017-12-01 Thread William Tu
Before this patch, BPF_F_ZERO_CSUM_TX could be used only for ipv4 tunnels.
With the introduction of ip6gretap collect_md mode, the flag should also be
supported for ipv6.
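
The shape of the change can be sketched in plain C. Assumptions: the
SKETCH_* constants and apply_flags() are hypothetical and their values
are chosen only for this illustration; the real flag definitions live
in the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values, for this sketch only. */
#define SKETCH_F_TUNINFO_IPV6	(1u << 0)
#define SKETCH_F_ZERO_CSUM_TX	(1u << 1)
#define SKETCH_TUNNEL_CSUM	(1u << 0)

/* Returns the tunnel flags after applying the caller's request flags. */
static uint32_t apply_flags(uint32_t tun_flags, uint32_t flags)
{
	if (flags & SKETCH_F_TUNINFO_IPV6) {
		/* IPv6-specific key setup would go here. */
	} else {
		/* IPv4-specific key setup would go here. */
	}

	/* Checked once, outside the address-family branch, so the
	 * request is honored for both IPv4 and IPv6 tunnels.
	 */
	if (flags & SKETCH_F_ZERO_CSUM_TX)
		tun_flags &= ~SKETCH_TUNNEL_CSUM;

	return tun_flags;
}
```

Hoisting the check out of the else branch is the whole change; nothing
else about the key setup is altered.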

Signed-off-by: William Tu 
Cc: Daniel Borkmann 
---
 net/core/filter.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 6a85e67fafce..8ec5a504eb28 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3026,10 +3026,11 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb,
  IPV6_FLOWLABEL_MASK;
} else {
info->key.u.ipv4.dst = cpu_to_be32(from->remote_ipv4);
-   if (flags & BPF_F_ZERO_CSUM_TX)
-   info->key.tun_flags &= ~TUNNEL_CSUM;
}
 
+   if (flags & BPF_F_ZERO_CSUM_TX)
+   info->key.tun_flags &= ~TUNNEL_CSUM;
+
return 0;
 }
 
-- 
2.7.4



[PATCH net-next 0/3] add ip6 gre and gretap collect_md mode

2017-12-01 Thread William Tu
Similar to gre, vxlan, geneve, ipip tunnels, allow ip6gretap tunnels to
operate in collect metadata mode.  The first patch adds the support to
ip6_gre.c. The second patch enables unsetting the csum for ipv6 tunnel,
when using bpf_skb_[gs]et_tunnel_key() helpers.  Finally, the last patch
adds the ip6 gre and gretap tunnel test cases to BPF sample code.

The corresponding iproute2 patch:
https://marc.info/?l=linux-netdev&m=151216943128087&w=2

William Tu (3):
  ip6_gre: add ip6 gre and gretap collect_md mode
  bpf: allow disabling tunnel csum for ipv6
  samples/bpf: extend test_tunnel_bpf.sh with ip6gre

 net/core/filter.c  |   5 +-
 net/ipv6/ip6_gre.c | 105 +
 net/ipv6/ip6_tunnel.c  |   5 +-
 samples/bpf/tcbpf2_kern.c  |  43 +
 samples/bpf/test_tunnel_bpf.sh |  65 +
 5 files changed, 210 insertions(+), 13 deletions(-)

-- 
2.7.4



[PATCH net-next 3/3] samples/bpf: extend test_tunnel_bpf.sh with ip6gre

2017-12-01 Thread William Tu
Extend existing tests for vxlan, gre, geneve, ipip, erspan,
to include ip6 gre and gretap tunnel.
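
The sample program in this patch sets key.remote_ipv6[3] to
_htonl(0x11) and annotates it as ::11.  A userspace sketch of why that
holds, assuming the four 32-bit words are laid out as the 16 address
bytes in network order (sketch_ipv6_text() is a name made up for the
example):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Render four network-order 32-bit words as an IPv6 address string. */
static const char *sketch_ipv6_text(const uint32_t words[4],
				    char *buf, socklen_t len)
{
	struct in6_addr addr;

	assert(len >= (socklen_t)INET6_ADDRSTRLEN);
	memcpy(&addr, words, sizeof(addr));
	return inet_ntop(AF_INET6, &addr, buf, len);
}
```

Only the last word is non-zero, so the leading zero groups collapse
into "::" when the address is printed.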

Signed-off-by: William Tu 
Cc: Alexei Starovoitov 
---
 samples/bpf/tcbpf2_kern.c  | 43 
 samples/bpf/test_tunnel_bpf.sh | 65 ++
 2 files changed, 108 insertions(+)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 370b749f5ee6..15a469220e19 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -81,6 +81,49 @@ int _gre_get_tunnel(struct __sk_buff *skb)
return TC_ACT_OK;
 }
 
+SEC("ip6gretap_set_tunnel")
+int _ip6gretap_set_tunnel(struct __sk_buff *skb)
+{
+   struct bpf_tunnel_key key;
+   int ret;
+
+   __builtin_memset(&key, 0x0, sizeof(key));
+   key.remote_ipv6[3] = _htonl(0x11); /* ::11 */
+   key.tunnel_id = 2;
+   key.tunnel_tos = 0;
+   key.tunnel_ttl = 64;
+   key.tunnel_label = 0xabcde;
+
+   ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key),
+BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX);
+   if (ret < 0) {
+   ERROR(ret);
+   return TC_ACT_SHOT;
+   }
+
+   return TC_ACT_OK;
+}
+
+SEC("ip6gretap_get_tunnel")
+int _ip6gretap_get_tunnel(struct __sk_buff *skb)
+{
+   char fmt[] = "key %d remote ip6 ::%x label %x\n";
+   struct bpf_tunnel_key key;
+   int ret;
+
+   ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key),
+BPF_F_TUNINFO_IPV6);
+   if (ret < 0) {
+   ERROR(ret);
+   return TC_ACT_SHOT;
+   }
+
+   bpf_trace_printk(fmt, sizeof(fmt),
+key.tunnel_id, key.remote_ipv6[3], key.tunnel_label);
+
+   return TC_ACT_OK;
+}
+
 SEC("erspan_set_tunnel")
 int _erspan_set_tunnel(struct __sk_buff *skb)
 {
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index 312e1722a39f..226f45381b76 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -33,6 +33,30 @@ function add_gre_tunnel {
ip addr add dev $DEV 10.1.1.200/24
 }
 
+function add_ip6gretap_tunnel {
+
+   # assign ipv6 address
+   ip netns exec at_ns0 ip addr add ::11/96 dev veth0
+   ip netns exec at_ns0 ip link set dev veth0 up
+   ip addr add dev veth1 ::22/96
+   ip link set dev veth1 up
+
+   # in namespace
+   ip netns exec at_ns0 \
+   ip link add dev $DEV_NS type $TYPE flowlabel 0xbcdef key 2 \
+   local ::11 remote ::22
+
+   ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
+   ip netns exec at_ns0 ip addr add dev $DEV_NS fc80::100/96
+   ip netns exec at_ns0 ip link set dev $DEV_NS up
+
+   # out of namespace
+   ip link add dev $DEV type $TYPE external
+   ip addr add dev $DEV 10.1.1.200/24
+   ip addr add dev $DEV fc80::200/24
+   ip link set dev $DEV up
+}
+
 function add_erspan_tunnel {
# in namespace
ip netns exec at_ns0 \
@@ -113,6 +137,41 @@ function test_gre {
cleanup
 }
 
+function test_ip6gre {
+   TYPE=ip6gre
+   DEV_NS=ip6gre00
+   DEV=ip6gre11
+   config_device
+   # reuse the ip6gretap function
+   add_ip6gretap_tunnel
+   attach_bpf $DEV ip6gretap_set_tunnel ip6gretap_get_tunnel
+   # underlay
+   ping6 -c 4 ::11
+   # overlay: ipv4 over ipv6
+   ip netns exec at_ns0 ping -c 1 10.1.1.200
+   ping -c 1 10.1.1.100
+   # overlay: ipv6 over ipv6
+   ip netns exec at_ns0 ping6 -c 1 fc80::200
+   cleanup
+}
+
+function test_ip6gretap {
+   TYPE=ip6gretap
+   DEV_NS=ip6gretap00
+   DEV=ip6gretap11
+   config_device
+   add_ip6gretap_tunnel
+   attach_bpf $DEV ip6gretap_set_tunnel ip6gretap_get_tunnel
+   # underlay
+   ping6 -c 4 ::11
+   # overlay: ipv4 over ipv6
+   ip netns exec at_ns0 ping -i .2 -c 1 10.1.1.200
+   ping -c 1 10.1.1.100
+   # overlay: ipv6 over ipv6
+   ip netns exec at_ns0 ping6 -c 1 fc80::200
+   cleanup
+}
+
 function test_erspan {
TYPE=erspan
DEV_NS=erspan00
@@ -175,6 +234,8 @@ function cleanup {
ip link del veth1
ip link del ipip11
ip link del gretap11
+   ip link del ip6gre11
+   ip link del ip6gretap11
ip link del vxlan11
ip link del geneve11
ip link del erspan11
@@ -187,6 +248,10 @@ trap cleanup 0 2 3 6 9
 cleanup
 echo "Testing GRE tunnel..."
 test_gre
+echo "Testing IP6GRE tunnel..."
+test_ip6gre
+echo "Testing IP6GRETAP tunnel..."
+test_ip6gretap
 echo "Testing ERSPAN tunnel..."
 test_erspan
 echo "Testing VXLAN tunnel..."
-- 
2.7.4



[PATCH net-next 1/3] ip6_gre: add ip6 gre and gretap collect_md mode

2017-12-01 Thread William Tu
Similar to gre, vxlan, geneve, ipip tunnels, allow ip6 gre and gretap
tunnels to operate in collect metadata mode.  bpf_skb_[gs]et_tunnel_key()
helpers can make use of it right away.  OVS can use it as well in the
future.
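
A simplified userspace model of the receive-side lookup order that the
diff below sets up: a regular match wins, then the collect_md tunnel if
it is up, then the fallback device.  The struct and function names are
invented for the sketch, and the hash-table search is reduced to a
single "exact" pointer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_tnl {
	bool up;
};

struct sketch_net {
	struct sketch_tnl *exact;          /* match from the hash tables */
	struct sketch_tnl *collect_md_tun; /* the collect_md tunnel, if any */
	struct sketch_tnl *fallback;       /* fallback device */
};

/* Pick the tunnel that should receive a packet, or NULL to reject it. */
static struct sketch_tnl *sketch_lookup(const struct sketch_net *net)
{
	assert(net != NULL);

	if (net->exact)
		return net->exact;
	if (net->collect_md_tun && net->collect_md_tun->up)
		return net->collect_md_tun;
	if (net->fallback && net->fallback->up)
		return net->fallback;
	return NULL;
}
```

The collect_md tunnel therefore only sees traffic that no explicitly
configured tunnel claimed.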

Signed-off-by: William Tu 
---
 net/ipv6/ip6_gre.c| 105 +-
 net/ipv6/ip6_tunnel.c |   5 ++-
 2 files changed, 99 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 76379f01bcd2..1510ce9a4e4e 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 
 
 static bool log_ecn_error = true;
@@ -69,6 +70,7 @@ static unsigned int ip6gre_net_id __read_mostly;
 struct ip6gre_net {
struct ip6_tnl __rcu *tunnels[4][IP6_GRE_HASH_SIZE];
 
+   struct ip6_tnl __rcu *collect_md_tun;
struct net_device *fb_tunnel_dev;
 };
 
@@ -229,6 +231,10 @@ static struct ip6_tnl *ip6gre_tunnel_lookup(struct net_device *dev,
if (cand)
return cand;
 
+   t = rcu_dereference(ign->collect_md_tun);
+   if (t && t->dev->flags & IFF_UP)
+   return t;
+
dev = ign->fb_tunnel_dev;
if (dev->flags & IFF_UP)
return netdev_priv(dev);
@@ -264,6 +270,9 @@ static void ip6gre_tunnel_link(struct ip6gre_net *ign, struct ip6_tnl *t)
 {
struct ip6_tnl __rcu **tp = ip6gre_bucket(ign, t);
 
+   if (t->parms.collect_md)
+   rcu_assign_pointer(ign->collect_md_tun, t);
+
rcu_assign_pointer(t->next, rtnl_dereference(*tp));
rcu_assign_pointer(*tp, t);
 }
@@ -273,6 +282,9 @@ static void ip6gre_tunnel_unlink(struct ip6gre_net *ign, struct ip6_tnl *t)
struct ip6_tnl __rcu **tp;
struct ip6_tnl *iter;
 
+   if (t->parms.collect_md)
+   rcu_assign_pointer(ign->collect_md_tun, NULL);
+
for (tp = ip6gre_bucket(ign, t);
 (iter = rtnl_dereference(*tp)) != NULL;
 tp = &iter->next) {
@@ -463,7 +475,22 @@ static int ip6gre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi)
  &ipv6h->saddr, &ipv6h->daddr, tpi->key,
  tpi->proto);
if (tunnel) {
-   ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
+   if (tunnel->parms.collect_md) {
+   struct metadata_dst *tun_dst;
+   __be64 tun_id;
+   __be16 flags;
+
+   flags = tpi->flags;
+   tun_id = key32_to_tunnel_id(tpi->key);
+
+   tun_dst = ipv6_tun_rx_dst(skb, flags, tun_id, 0);
+   if (!tun_dst)
+   return PACKET_REJECT;
+
+   ip6_tnl_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
+   } else {
+   ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
+   }
 
return PACKET_RCVD;
}
@@ -633,8 +660,38 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
 
/* Push GRE header. */
protocol = (dev->type == ARPHRD_ETHER) ? htons(ETH_P_TEB) : proto;
-   gre_build_header(skb, tunnel->tun_hlen, tunnel->parms.o_flags,
-protocol, tunnel->parms.o_key, htonl(tunnel->o_seqno));
+
+   if (tunnel->parms.collect_md) {
+   struct ip_tunnel_info *tun_info;
+   const struct ip_tunnel_key *key;
+   __be16 flags;
+
+   tun_info = skb_tunnel_info(skb);
+   if (unlikely(!tun_info ||
+!(tun_info->mode & IP_TUNNEL_INFO_TX) ||
+ip_tunnel_info_af(tun_info) != AF_INET6))
+   return -EINVAL;
+
+   key = &tun_info->key;
+   memset(fl6, 0, sizeof(*fl6));
+   fl6->flowi6_proto = IPPROTO_GRE;
+   fl6->daddr = key->u.ipv6.dst;
+   fl6->flowlabel = key->label;
+   fl6->flowi6_uid = sock_net_uid(dev_net(dev), NULL);
+
+   dsfield = key->tos;
+   flags = key->tun_flags & (TUNNEL_CSUM | TUNNEL_KEY);
+   tunnel->tun_hlen = gre_calc_hlen(flags);
+
+   gre_build_header(skb, tunnel->tun_hlen,
+flags, protocol,
+tunnel_id_to_key32(tun_info->key.tun_id), 0);
+
+   } else {
+   gre_build_header(skb, tunnel->tun_hlen, tunnel->parms.o_flags,
+protocol, tunnel->parms.o_key,
+htonl(tunnel->o_seqno));
+   }
 
return ip6_tnl_xmit(skb, dev, dsfield, fl6, encap_limit, pmtu,
NEXTHDR_GRE);
@@ -645,13 +702,15 @@ static inline int ip6gre_xmit_ipv4(struct sk_buff *skb, struct net_device *dev)
struct ip6_tnl *t = netdev_priv(dev);
int encap_limit = -1;
struct 

Re: [PATCH net-next resubmit 2/2] net: phy: remove generic settings for callbacks config_aneg and read_status from drivers

2017-12-01 Thread Florian Fainelli


On 12/01/2017 02:37 PM, Heiner Kallweit wrote:
> On 01.12.2017 at 21:42, David Miller wrote:
>> From: Heiner Kallweit 
>> Date: Thu, 30 Nov 2017 23:47:52 +0100
>>
>>> Remove generic settings for callbacks config_aneg and read_status
>>> from drivers.
>>>
> When re-testing I just figured out that in drivers/net/phy/broadcom.c
> I mistakenly removed three lines too many.
> Do you prefer a fixed version of the patch or just a patch with the
> fix?

Once the patches have been applied by David, you should send an
incremental change to fix your previous patches. Thank you.
-- 
Florian


[PATCH v5 net-next,mips 5/7] MIPS: Octeon: Automatically provision CVMSEG space.

2017-12-01 Thread David Daney
Remove CONFIG_CAVIUM_OCTEON_CVMSEG_SIZE and automatically calculate
the amount of CVMSEG space needed.

1st 128-bytes: Used by IOBDMA
2nd 128-bytes: Reserved by kernel for scratch/TLS emulation.
3rd 128-bytes: OCTEON-III LMTLINE

New config variable CONFIG_CAVIUM_OCTEON_EXTRA_CVMSEG provisions
additional lines, defaults to zero.
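
The sizing rule described above can be sketched as follows.  This is
plain C rather than the early-boot assembly the patch uses;
sketch_cvmseg_lines() and sketch_cvmseg_bytes() are names made up for
the example, and "has_lmtline" stands for "this model needs the third
line":

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_CVMSEG_LINE_BYTES 128u

/* Two lines are always needed; LMTLINE-capable models need a third. */
static unsigned int sketch_cvmseg_lines(bool has_lmtline, unsigned int extra)
{
	unsigned int lines = 2;	/* IOBDMA + kernel scratch/TLS emulation */

	assert(extra <= 50);	/* mirrors the Kconfig range 0..50 */

	if (has_lmtline)
		lines++;	/* OCTEON-III LMTLINE */

	return lines + extra;
}

static unsigned int sketch_cvmseg_bytes(unsigned int lines)
{
	return lines * SKETCH_CVMSEG_LINE_BYTES;
}
```

With no extra lines configured this gives 256 bytes on older models and
384 bytes on models that use the LMTLINE.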

Signed-off-by: David Daney 
Signed-off-by: Carlos Munoz 
---
 arch/mips/cavium-octeon/Kconfig| 27 
 arch/mips/cavium-octeon/setup.c| 16 ++--
 .../asm/mach-cavium-octeon/kernel-entry-init.h | 20 +--
 arch/mips/include/asm/mipsregs.h   |  2 ++
 arch/mips/include/asm/octeon/octeon.h  |  2 ++
 arch/mips/include/asm/processor.h  |  2 +-
 arch/mips/kernel/octeon_switch.S   |  2 --
 arch/mips/mm/tlbex.c   | 29 ++
 drivers/staging/octeon/ethernet-defines.h  |  2 +-
 9 files changed, 50 insertions(+), 52 deletions(-)

diff --git a/arch/mips/cavium-octeon/Kconfig b/arch/mips/cavium-octeon/Kconfig
index ce469f982134..29c4d81364a6 100644
--- a/arch/mips/cavium-octeon/Kconfig
+++ b/arch/mips/cavium-octeon/Kconfig
@@ -11,21 +11,26 @@ config CAVIUM_CN63XXP1
  non-CN63XXP1 hardware, so it is recommended to select "n"
  unless it is known the workarounds are needed.
 
-config CAVIUM_OCTEON_CVMSEG_SIZE
-   int "Number of L1 cache lines reserved for CVMSEG memory"
-   range 0 54
-   default 1
-   help
- CVMSEG LM is a segment that accesses portions of the dcache as a
- local memory; the larger CVMSEG is, the smaller the cache is.
- This selects the size of CVMSEG LM, which is in cache blocks. The
- legally range is from zero to 54 cache blocks (i.e. CVMSEG LM is
- between zero and 6192 bytes).
-
 endif # CPU_CAVIUM_OCTEON
 
 if CAVIUM_OCTEON_SOC
 
+config CAVIUM_OCTEON_EXTRA_CVMSEG
+   int "Number of extra L1 cache lines reserved for CVMSEG memory"
+   range 0 50
+   default 0
+   help
+ CVMSEG LM is a segment that accesses portions of the dcache
+ as a local memory; the larger CVMSEG is, the smaller the
+ cache is.  The kernel uses two or three blocks (one for TLB
+ exception handlers, one for driver IOBDMA operations, and on
+ models that need it, one for LMTDMA operations). This
+ selects an optional extra number of CVMSEG lines for use by
+ other software.
+
+ Normally no extra lines are required, and this parameter
+ should be set to zero.
+
 config CAVIUM_OCTEON_LOCK_L2
bool "Lock often used kernel code in the L2"
default "y"
diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c
index 99e6a68bc652..51c4d3c3cada 100644
--- a/arch/mips/cavium-octeon/setup.c
+++ b/arch/mips/cavium-octeon/setup.c
@@ -68,6 +68,12 @@ extern void pci_console_init(const char *arg);
 static unsigned long long max_memory = ULLONG_MAX;
 static unsigned long long reserve_low_mem;
 
+/*
+ * modified in kernel-entry-init.h, must have an initial value to keep
+ * it from being clobbered when bss is zeroed.
+ */
+u32 octeon_cvmseg_lines = 2;
+
 DEFINE_SEMAPHORE(octeon_bootbus_sem);
 EXPORT_SYMBOL(octeon_bootbus_sem);
 
@@ -604,11 +610,7 @@ void octeon_user_io_init(void)
 
/* R/W If set, CVMSEG is available for loads/stores in
 * kernel/debug mode. */
-#if CONFIG_CAVIUM_OCTEON_CVMSEG_SIZE > 0
cvmmemctl.s.cvmsegenak = 1;
-#else
-   cvmmemctl.s.cvmsegenak = 0;
-#endif
if (OCTEON_IS_OCTEON3()) {
/* Enable LMTDMA */
cvmmemctl.s.lmtena = 1;
@@ -626,9 +628,9 @@ void octeon_user_io_init(void)
 
/* Setup of CVMSEG is done in kernel-entry-init.h */
if (smp_processor_id() == 0)
-   pr_notice("CVMSEG size: %d cache lines (%d bytes)\n",
- CONFIG_CAVIUM_OCTEON_CVMSEG_SIZE,
- CONFIG_CAVIUM_OCTEON_CVMSEG_SIZE * 128);
+   pr_notice("CVMSEG size: %u cache lines (%u bytes)\n",
+ octeon_cvmseg_lines,
+ octeon_cvmseg_lines * 128);
 
if (octeon_has_feature(OCTEON_FEATURE_FAU)) {
union cvmx_iob_fau_timeout fau_timeout;
diff --git a/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h b/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h
index c38b38ce5a3d..cdcca60978a2 100644
--- a/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h
+++ b/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h
@@ -26,11 +26,18 @@
# a3 = address of boot descriptor block
.set push
.set arch=octeon
+   mfc0    v1, CP0_PRID_REG
+   andi    v1, 0xff00
+   li      v0, 0x9500  # cn78XX or later
+   subu    v1, v1, v0
+ 

[PATCH v5 net-next,mips 2/7] MIPS: Octeon: Enable LMTDMA/LMTST operations.

2017-12-01 Thread David Daney
From: Carlos Munoz 

LMTDMA/LMTST operations move data between cores and I/O devices:

* LMTST operations can send an address and a variable length
  (up to 128 bytes) of data to an I/O device.
* LMTDMA operations can send an address and a variable length
  (up to 128 bytes) of data to the I/O device and then return a
  variable length (up to 128 bytes) response from the I/O device.

For both LMTST and LMTDMA, the data sent to the device is first stored
in the CVMSEG core local memory cache line indexed by
CVMMEMCTL[LMTLINE]; the data is then atomically transmitted to the
device with a store to the CVMSEG LMTDMA trigger location.
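
To make the register side of this concrete, here is a shift-and-mask
sketch of setting the two new fields.  The kernel code uses bitfields
instead; the bit positions here (enable at bit 51, line index in bits
50..45) are inferred from the reserved_52_57 and reserved_41_44 field
names in the diff below and should be treated as an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Positions inferred from the neighbouring reserved-field names. */
#define SKETCH_LMTENA_SHIFT   51
#define SKETCH_LMTLINE_SHIFT  45
#define SKETCH_LMTLINE_MASK   0x3full	/* 6-bit field */

/* Enable LMT operations and select which CVMSEG line they use. */
static uint64_t sketch_set_lmt(uint64_t cvmmemctl, unsigned int line)
{
	assert(line <= SKETCH_LMTLINE_MASK);

	cvmmemctl &= ~(SKETCH_LMTLINE_MASK << SKETCH_LMTLINE_SHIFT);
	cvmmemctl |= (uint64_t)line << SKETCH_LMTLINE_SHIFT;
	cvmmemctl |= 1ull << SKETCH_LMTENA_SHIFT;
	return cvmmemctl;
}

static unsigned int sketch_get_lmtline(uint64_t cvmmemctl)
{
	return (unsigned int)((cvmmemctl >> SKETCH_LMTLINE_SHIFT) &
			      SKETCH_LMTLINE_MASK);
}
```

The patch selects line 2, i.e. the third 128-byte CVMSEG line.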

Reviewed-by: James Hogan 
Signed-off-by: Carlos Munoz 
Signed-off-by: Steven J. Hill 
Signed-off-by: David Daney 
---
 arch/mips/cavium-octeon/setup.c   |  6 ++
 arch/mips/include/asm/octeon/octeon.h | 12 ++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c
index a8034d0dcade..99e6a68bc652 100644
--- a/arch/mips/cavium-octeon/setup.c
+++ b/arch/mips/cavium-octeon/setup.c
@@ -609,6 +609,12 @@ void octeon_user_io_init(void)
 #else
cvmmemctl.s.cvmsegenak = 0;
 #endif
+   if (OCTEON_IS_OCTEON3()) {
+   /* Enable LMTDMA */
+   cvmmemctl.s.lmtena = 1;
+   /* Scratch line to use for LMT operation */
+   cvmmemctl.s.lmtline = 2;
+   }
/* R/W If set, CVMSEG is available for loads/stores in
 * supervisor mode. */
cvmmemctl.s.cvmsegenas = 0;
diff --git a/arch/mips/include/asm/octeon/octeon.h b/arch/mips/include/asm/octeon/octeon.h
index c99c4b6a79f4..92a17d67c1fa 100644
--- a/arch/mips/include/asm/octeon/octeon.h
+++ b/arch/mips/include/asm/octeon/octeon.h
@@ -179,7 +179,15 @@ union octeon_cvmemctl {
/* RO 1 = BIST fail, 0 = BIST pass */
__BITFIELD_FIELD(uint64_t wbfbist:1,
/* Reserved */
-   __BITFIELD_FIELD(uint64_t reserved:17,
+   __BITFIELD_FIELD(uint64_t reserved_52_57:6,
+   /* When set, LMTDMA/LMTST operations are permitted */
+   __BITFIELD_FIELD(uint64_t lmtena:1,
+   /* Selects the CVMSEG LM cacheline used by LMTDMA
+* LMTST and wide atomic store operations.
+*/
+   __BITFIELD_FIELD(uint64_t lmtline:6,
+   /* Reserved */
+   __BITFIELD_FIELD(uint64_t reserved_41_44:4,
/* OCTEON II - TLB replacement policy: 0 = bitmask LRU; 1 = NLU.
 * This field selects between the TLB replacement policies:
 * bitmask LRU or NLU. Bitmask LRU maintains a mask of
@@ -275,7 +283,7 @@ union octeon_cvmemctl {
/* R/W Size of local memory in cache blocks, 54 (6912
 * bytes) is max legal value. */
__BITFIELD_FIELD(uint64_t lmemsz:6,
-   ;)
+   ;
} s;
 };
 
-- 
2.14.3



[PATCH v5 net-next,mips 4/7] MIPS: Octeon: Add Free Pointer Unit (FPA) support.

2017-12-01 Thread David Daney
From: Carlos Munoz 

From the hardware user manual: "The FPA is a unit that maintains
pools of pointers to free L2/DRAM memory. To provide QoS, the pools
are referenced indirectly through 1024 auras. Both core software
and hardware units allocate and free pointers."
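
A toy software model of the pool/aura split, for readers unfamiliar
with the idea.  It is only an analogy under stated assumptions: the
real unit is hardware driven through CSRs, and the stack layout, the
per-aura limit semantics and all sketch_* names are invented here:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_POOL_CAP 8

/* A pool is a stack of pointers to free buffers. */
struct sketch_pool {
	void *free_ptrs[SKETCH_POOL_CAP];
	size_t top;
};

/* An aura refers to a pool and caps how many buffers it may hold. */
struct sketch_aura {
	struct sketch_pool *pool;
	size_t cnt;	/* buffers currently allocated through this aura */
	size_t limit;	/* QoS limit for this aura */
};

static int sketch_pool_fill(struct sketch_pool *p, void *buf)
{
	if (p->top == SKETCH_POOL_CAP)
		return -1;
	p->free_ptrs[p->top++] = buf;
	return 0;
}

static void *sketch_aura_alloc(struct sketch_aura *a)
{
	if (a->cnt >= a->limit || a->pool->top == 0)
		return NULL;
	a->cnt++;
	return a->pool->free_ptrs[--a->pool->top];
}

static void sketch_aura_free(struct sketch_aura *a, void *buf)
{
	assert(a->cnt > 0);
	assert(a->pool->top < SKETCH_POOL_CAP);
	a->cnt--;
	a->pool->free_ptrs[a->pool->top++] = buf;
}
```

Two auras can share one pool, so a low-priority user can be limited
without carving the memory into separate pools.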

Signed-off-by: Carlos Munoz 
Signed-off-by: Steven J. Hill 
Signed-off-by: David Daney 
---
 arch/mips/cavium-octeon/Kconfig   |   8 +
 arch/mips/cavium-octeon/Makefile  |   1 +
 arch/mips/cavium-octeon/octeon-fpa3.c | 363 ++
 arch/mips/include/asm/octeon/octeon.h |  13 ++
 4 files changed, 385 insertions(+)
 create mode 100644 arch/mips/cavium-octeon/octeon-fpa3.c

diff --git a/arch/mips/cavium-octeon/Kconfig b/arch/mips/cavium-octeon/Kconfig
index 204a1670fd9b..ce469f982134 100644
--- a/arch/mips/cavium-octeon/Kconfig
+++ b/arch/mips/cavium-octeon/Kconfig
@@ -87,4 +87,12 @@ config OCTEON_ILM
  To compile this driver as a module, choose M here.  The module
  will be called octeon-ilm
 
+config OCTEON_FPA3
+   tristate "Octeon III fpa driver"
+   help
+ This option enables an Octeon III driver for the Free Pool Unit (FPA).
+ The FPA is a hardware unit that manages pools of pointers to free
+ L2/DRAM memory. This driver provides an interface to reserve,
+ initialize, and fill fpa pools.
+
 endif # CAVIUM_OCTEON_SOC
diff --git a/arch/mips/cavium-octeon/Makefile b/arch/mips/cavium-octeon/Makefile
index 28c0bb75d1a4..9d547c2cd77d 100644
--- a/arch/mips/cavium-octeon/Makefile
+++ b/arch/mips/cavium-octeon/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_MTD)   += flash_setup.o
 obj-$(CONFIG_SMP)+= smp.o
 obj-$(CONFIG_OCTEON_ILM) += oct_ilm.o
 obj-$(CONFIG_USB)+= octeon-usb.o
+obj-$(CONFIG_OCTEON_FPA3)+= octeon-fpa3.o
diff --git a/arch/mips/cavium-octeon/octeon-fpa3.c b/arch/mips/cavium-octeon/octeon-fpa3.c
new file mode 100644
index ..3f0c10e9d915
--- /dev/null
+++ b/arch/mips/cavium-octeon/octeon-fpa3.c
@@ -0,0 +1,363 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Driver for the Octeon III Free Pool Unit (fpa).
+ *
+ * Copyright (C) 2015-2017 Cavium, Inc.
+ */
+
+#include 
+
+#include 
+
+
+/* Registers are accessed via xkphys */
+#define SET_XKPHYS (1ull << 63)
+#define NODE_OFFSET 0x10ull
+#define SET_NODE(node) ((node) * NODE_OFFSET)
+
+#define FPA_BASE   0x12800ull
+#define SET_FPA_BASE(node) (SET_XKPHYS + SET_NODE(node) + FPA_BASE)
+
+#define FPA_GEN_CFG(n) (SET_FPA_BASE(n)   + 0x0050)
+
+#define FPA_POOLX_CFG(n, p)(SET_FPA_BASE(n) + (p<<3)  + 0x1000)
+#define FPA_POOLX_START_ADDR(n, p) (SET_FPA_BASE(n) + (p<<3)  + 0x1050)
+#define FPA_POOLX_END_ADDR(n, p)   (SET_FPA_BASE(n) + (p<<3)  + 0x1060)
+#define FPA_POOLX_STACK_BASE(n, p) (SET_FPA_BASE(n) + (p<<3)  + 0x1070)
+#define FPA_POOLX_STACK_END(n, p)  (SET_FPA_BASE(n) + (p<<3)  + 0x1080)
+#define FPA_POOLX_STACK_ADDR(n, p) (SET_FPA_BASE(n) + (p<<3)  + 0x1090)
+
+#define FPA_AURAX_POOL(n, a)   (SET_FPA_BASE(n) + (a<<3)  + 0x2000)
+#define FPA_AURAX_CFG(n, a)(SET_FPA_BASE(n) + (a<<3)  + 0x2010)
+#define FPA_AURAX_CNT(n, a)(SET_FPA_BASE(n) + (a<<3)  + 0x2020)
+#define FPA_AURAX_CNT_LIMIT(n, a)  (SET_FPA_BASE(n) + (a<<3)  + 0x2040)
+#define FPA_AURAX_CNT_THRESHOLD(n, a)  (SET_FPA_BASE(n) + (a<<3)  + 0x2050)
+#define FPA_AURAX_POOL_LEVELS(n, a)(SET_FPA_BASE(n) + (a<<3)  + 0x2070)
+#define FPA_AURAX_CNT_LEVELS(n, a) (SET_FPA_BASE(n) + (a<<3)  + 0x2080)
+
+static inline u64 oct_csr_read(u64 addr)
+{
+   return __raw_readq((void __iomem *)addr);
+}
+
+static inline void oct_csr_write(u64 data, u64 addr)
+{
+   __raw_writeq(data, (void __iomem *)addr);
+}
+
+static DEFINE_MUTEX(octeon_fpa3_lock);
+
+static int get_num_pools(void)
+{
+   if (OCTEON_IS_MODEL(OCTEON_CN78XX))
+   return 64;
+   if (OCTEON_IS_MODEL(OCTEON_CNF75XX) || OCTEON_IS_MODEL(OCTEON_CN73XX))
+   return 32;
+   return 0;
+}
+
+static int get_num_auras(void)
+{
+   if (OCTEON_IS_MODEL(OCTEON_CN78XX))
+   return 1024;
+   if (OCTEON_IS_MODEL(OCTEON_CNF75XX) || OCTEON_IS_MODEL(OCTEON_CN73XX))
+   return 512;
+   return 0;
+}
+
+/**
+ * octeon_fpa3_init() - Initialize the fpa to default values.
+ * @node: Node of fpa to initialize.
+ *
+ * Return: 0 if successful.
+ * < 0 for error codes.
+ */
+int octeon_fpa3_init(int node)
+{
+   static bool init_done[2];
+   u64 data;
+   int aura_cnt, i;
+
+   mutex_lock(&octeon_fpa3_lock);
+
+   if (init_done[node])
+   goto done;
+
+   aura_cnt = get_num_auras();
+   

[PATCH v5 net-next,mips 3/7] MIPS: Octeon: Add a global resource manager.

2017-12-01 Thread David Daney
From: Carlos Munoz 

Add a global resource manager to manage tagged pointers within
bootmem allocated memory. This is used by various functional
blocks in the Octeon core like the FPA, Ethernet nexus, etc.
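
The patch below guards the shared table with a cmpxchg()-based lock
word because software outside Linux also takes it.  A userspace
approximation using C11 atomics follows; acquire/release ordering is
swapped in for the kernel's rmb()/wmb() pairs, bootmem allocation is
left out, and every sketch_* name is hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_MAX_RESOURCES 4

struct sketch_tag {
	uint64_t lo, hi;
};

struct sketch_table {
	atomic_uint rlock;	/* 0 = free, 1 = held */
	uint64_t entry_cnt;
	struct sketch_tag tags[SKETCH_MAX_RESOURCES];
};

static void sketch_lock(struct sketch_table *t)
{
	unsigned int expected = 0;

	/* Spin until we are the one who changes the word from 0 to 1. */
	while (!atomic_compare_exchange_weak_explicit(&t->rlock, &expected, 1,
						      memory_order_acquire,
						      memory_order_relaxed))
		expected = 0;
}

static void sketch_unlock(struct sketch_table *t)
{
	assert(atomic_load_explicit(&t->rlock, memory_order_relaxed) == 1);
	atomic_store_explicit(&t->rlock, 0, memory_order_release);
}

/* Register a tag; fails if it already exists or the table is full. */
static int sketch_create(struct sketch_table *t, struct sketch_tag tag)
{
	uint64_t i;
	int rc = 0;

	sketch_lock(t);
	for (i = 0; i < t->entry_cnt; i++) {
		if (t->tags[i].lo == tag.lo && t->tags[i].hi == tag.hi) {
			rc = -EEXIST;
			goto out;
		}
	}
	if (t->entry_cnt >= SKETCH_MAX_RESOURCES) {
		rc = -ENOSPC;
		goto out;
	}
	t->tags[t->entry_cnt++] = tag;
out:
	sketch_unlock(t);
	return rc;
}
```

Every exit path goes through the single unlock at "out", which is the
same shape the patch uses.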

Signed-off-by: Carlos Munoz 
Signed-off-by: Steven J. Hill 
Signed-off-by: David Daney 
---
 arch/mips/cavium-octeon/Makefile   |   1 +
 arch/mips/cavium-octeon/resource-mgr.c | 351 +
 arch/mips/include/asm/octeon/octeon.h  |  18 ++
 3 files changed, 370 insertions(+)
 create mode 100644 arch/mips/cavium-octeon/resource-mgr.c

diff --git a/arch/mips/cavium-octeon/Makefile b/arch/mips/cavium-octeon/Makefile
index 7c02e542959a..28c0bb75d1a4 100644
--- a/arch/mips/cavium-octeon/Makefile
+++ b/arch/mips/cavium-octeon/Makefile
@@ -10,6 +10,7 @@
 #
 
 obj-y := cpu.o setup.o octeon-platform.o octeon-irq.o csrc-octeon.o
+obj-y += resource-mgr.o
 obj-y += dma-octeon.o
 obj-y += octeon-memcpy.o
 obj-y += executive/
diff --git a/arch/mips/cavium-octeon/resource-mgr.c b/arch/mips/cavium-octeon/resource-mgr.c
new file mode 100644
index ..74efda5420ff
--- /dev/null
+++ b/arch/mips/cavium-octeon/resource-mgr.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Resource manager for Octeon.
+ *
+ * Copyright (C) 2017 Cavium, Inc.
+ */
+#include 
+
+#include 
+#include 
+
+#define RESOURCE_MGR_BLOCK_NAME "cvmx-global-resources"
+#define MAX_RESOURCES  128
+#define INST_AVAILABLE -88
+#define OWNER  0xbadc0de
+
+struct global_resource_entry {
+   struct global_resource_tag tag;
+   u64 phys_addr;
+   u64 size;
+};
+
+struct global_resources {
+#ifdef __LITTLE_ENDIAN_BITFIELD
+   u32 rlock;
+   u32 pad;
+#else
+   u32 pad;
+   u32 rlock;
+#endif
+   u64 entry_cnt;
+   struct global_resource_entry resource_entry[];
+};
+
+static struct global_resources *res_mgr_info;
+
+
+/*
+ * The resource manager interacts with software running outside of the
+ * Linux kernel, which necessitates locking to maintain data structure
+ * consistency.  These custom locking functions implement the locking
+ * protocol, and cannot be replaced by kernel locking functions that
+ * may use different in-memory structures.
+ */
+
+static void res_mgr_lock(void)
+{
+   while (cmpxchg(&res_mgr_info->rlock, 0, 1))
+   ; /* Loop while not zero */
+   rmb();
+}
+
+static void res_mgr_unlock(void)
+{
+   /* Wait until all resource operations finish before unlocking. */
+   wmb();
+   WRITE_ONCE(res_mgr_info->rlock, 0);
+   /* Force a write buffer flush. */
+   wmb();
+}
+
+static int res_mgr_find_resource(struct global_resource_tag tag)
+{
+   struct global_resource_entry *res_entry;
+   int i;
+
+   for (i = 0; i < res_mgr_info->entry_cnt; i++) {
+   res_entry = &res_mgr_info->resource_entry[i];
+   if (res_entry->tag.lo == tag.lo && res_entry->tag.hi == tag.hi)
+   return i;
+   }
+   return -1;
+}
+
+/**
+ * res_mgr_create_resource() - Create a resource.
+ * @tag: Identifies the resource.
+ * @inst_cnt: Number of resource instances to create.
+ *
+ * Returns 0 if the resource was created successfully.
+ * Returns < 0 for error codes.
+ */
+int res_mgr_create_resource(struct global_resource_tag tag, int inst_cnt)
+{
+   struct global_resource_entry *res_entry;
+   u64 size;
+   u64 *res_addr;
+   int res_index, i, rc = 0;
+
+   res_mgr_lock();
+
+   /* Make sure resource doesn't already exist. */
+   res_index = res_mgr_find_resource(tag);
+   if (res_index >= 0) {
+   rc = -EEXIST;
+   goto err;
+   }
+
+   if (res_mgr_info->entry_cnt >= MAX_RESOURCES) {
+   pr_err("Resource max limit reached, not created\n");
+   rc = -ENOSPC;
+   goto err;
+   }
+
+   /*
+* Each instance is kept in an array of u64s. The first array element
+* holds the number of allocated instances.
+*/
+   size = sizeof(u64) * (inst_cnt + 1);
+   res_addr = cvmx_bootmem_alloc_range(size, CVMX_CACHE_LINE_SIZE, 0, 0);
+   if (!res_addr) {
+   pr_err("Failed to allocate resource. not created\n");
+   rc = -ENOMEM;
+   goto err;
+   }
+
+   /* Initialize the newly created resource. */
+   *res_addr = inst_cnt;
+   for (i = 1; i <= inst_cnt; i++)
+   res_addr[i] = INST_AVAILABLE;
+
+   res_index = res_mgr_info->entry_cnt;
+   res_entry = &res_mgr_info->resource_entry[res_index];
+   res_entry->tag = tag;
+   res_entry->phys_addr = virt_to_phys(res_addr);
+   res_entry->size = size;
+   res_mgr_info->entry_cnt++;
+
+err:
+   res_mgr_unlock();
+
+   return rc;
+}
+EXPORT_SYMBOL(res_mgr_create_resource);
+
+/**
+ * 

[PATCH v5 net-next,mips 7/7] MAINTAINERS: Add entry for drivers/net/ethernet/cavium/octeon/octeon3-*

2017-12-01 Thread David Daney
Signed-off-by: David Daney 
---
 MAINTAINERS | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 77d819b458a9..5aff6fb41b21 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3249,6 +3249,12 @@ W:   http://www.cavium.com
 S: Supported
 F: drivers/mmc/host/cavium*
 
+CAVIUM OCTEON-III NETWORK DRIVER
+M: David Daney 
+L: netdev@vger.kernel.org
+S: Supported
+F: drivers/net/ethernet/cavium/octeon/octeon3-*
+
 CAVIUM OCTEON-TX CRYPTO DRIVER
 M: George Cherian 
 L: linux-cry...@vger.kernel.org
-- 
2.14.3



[PATCH v5 net-next,mips 0/7] Cavium OCTEON-III network driver.

2017-12-01 Thread David Daney
We are adding the Cavium OCTEON-III network driver.  But since
interacting with the input and output queues is done via special CPU
local memory, we also need to add support to the MIPS/Octeon
architecture code.  Aren't SoCs nice in this way?

The first five patches add the SoC support needed by the driver; the
last two add the driver and an entry in MAINTAINERS.

Since these touch several subsystems (mips, netdev), I would
propose merging via netdev, but defer to the maintainers if they think
something else would work better.

A separate pull request was recently done by Steven Hill for the
firmware required by the driver.

Changes from v4:

o Removed cleanup patch for previous generation SoC "staging" driver,
  as it will be sent as a follow-on.

o Fixed kernel doc formatting in all patches.

o Removed redundant licensing text boilerplate.

o Reviewed-by: header added to 2/7.

o Rewrote locking code in 3/7 to eliminate inline asm.

Changes from v3:

o Use phy_print_status() instead of open coding the equivalent.

o Print warning on phy mode mismatch.

o Improve dt-bindings and add Acked-by.

Changes from v2:

o Fix PKI (RX path) initialization to work with little endian kernel.

Changes from v1:

o Cleanup and use of standard bindings in the device tree bindings
  document.

o Added (hopefully) clarifying comments about several OCTEON
  architectural peculiarities.

o Removed unused testing code from the driver.

o Removed some module parameters that already default to the proper
  values.

o KConfig cleanup, including testing on x86_64, arm64 and mips.

o Fixed breakage to the driver for previous generation of OCTEON SoCs (in
  the staging directory still).

o Verified bisectability of the patch set.

Carlos Munoz (5):
  dt-bindings: Add Cavium Octeon Common Ethernet Interface.
  MIPS: Octeon: Enable LMTDMA/LMTST operations.
  MIPS: Octeon: Add a global resource manager.
  MIPS: Octeon: Add Free Pointer Unit (FPA) support.
  netdev: octeon-ethernet: Add Cavium Octeon III support.

David Daney (2):
  MIPS: Octeon: Automatically provision CVMSEG space.
  MAINTAINERS: Add entry for
drivers/net/ethernet/cavium/octeon/octeon3-*

 .../devicetree/bindings/net/cavium-bgx.txt |   61 +
 MAINTAINERS|6 +
 arch/mips/cavium-octeon/Kconfig|   35 +-
 arch/mips/cavium-octeon/Makefile   |2 +
 arch/mips/cavium-octeon/octeon-fpa3.c  |  363 
 arch/mips/cavium-octeon/resource-mgr.c |  351 
 arch/mips/cavium-octeon/setup.c|   22 +-
 .../asm/mach-cavium-octeon/kernel-entry-init.h |   20 +-
 arch/mips/include/asm/mipsregs.h   |2 +
 arch/mips/include/asm/octeon/octeon.h  |   45 +-
 arch/mips/include/asm/processor.h  |2 +-
 arch/mips/kernel/octeon_switch.S   |2 -
 arch/mips/mm/tlbex.c   |   29 +-
 drivers/net/ethernet/cavium/Kconfig|   55 +-
 drivers/net/ethernet/cavium/octeon/Makefile|6 +
 .../net/ethernet/cavium/octeon/octeon3-bgx-nexus.c |  701 +++
 .../net/ethernet/cavium/octeon/octeon3-bgx-port.c  | 2015 +++
 drivers/net/ethernet/cavium/octeon/octeon3-core.c  | 2069 
 drivers/net/ethernet/cavium/octeon/octeon3-pki.c   |  824 
 drivers/net/ethernet/cavium/octeon/octeon3-pko.c   | 1688 
 drivers/net/ethernet/cavium/octeon/octeon3-sso.c   |  301 +++
 drivers/net/ethernet/cavium/octeon/octeon3.h   |  418 
 drivers/staging/octeon/ethernet-defines.h  |2 +-
 23 files changed, 8955 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/cavium-bgx.txt
 create mode 100644 arch/mips/cavium-octeon/octeon-fpa3.c
 create mode 100644 arch/mips/cavium-octeon/resource-mgr.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-bgx-nexus.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-bgx-port.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-core.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-pki.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-pko.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-sso.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3.h

-- 
2.14.3



[PATCH v5 net-next,mips 1/7] dt-bindings: Add Cavium Octeon Common Ethernet Interface.

2017-12-01 Thread David Daney
From: Carlos Munoz 

Add bindings for Common Ethernet Interface (BGX) block.

Acked-by: Rob Herring 
Signed-off-by: Carlos Munoz 
Signed-off-by: Steven J. Hill 
Signed-off-by: David Daney 
---
 .../devicetree/bindings/net/cavium-bgx.txt | 61 ++
 1 file changed, 61 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/cavium-bgx.txt

diff --git a/Documentation/devicetree/bindings/net/cavium-bgx.txt b/Documentation/devicetree/bindings/net/cavium-bgx.txt
new file mode 100644
index ..830c5f08
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/cavium-bgx.txt
@@ -0,0 +1,61 @@
+* Common Ethernet Interface (BGX) block
+
+Properties:
+
+- compatible: "cavium,octeon-7890-bgx": Compatibility with all cn7xxx SOCs.
+
+- reg: The base address of the BGX block.
+
+- #address-cells: Must be <1>.
+
+- #size-cells: Must be <0>.  BGX addresses have no size component.
+
+A BGX block has several children, each representing an Ethernet
+interface.
+
+
+* Ethernet Interface (BGX port) connects to PKI/PKO
+
+Properties:
+
+- compatible: "cavium,octeon-7890-bgx-port": Compatibility with all
+ cn7xxx SOCs.
+
+ "cavium,octeon-7360-xcv": Compatibility with cn73xx SOCs
+ for RGMII.
+
+- reg: The index of the interface within the BGX block.
+
+Optional properties:
+
+- local-mac-address: MAC address for the interface.
+
+- phy-handle: phandle to the phy node connected to the interface.
+
+- phy-mode: described in ethernet.txt.
+
+- fixed-link: described in fixed-link.txt.
+
+Example:
+
+   ethernet-mac-nexus@11800e000 {
+   compatible = "cavium,octeon-7890-bgx";
+   reg = <0x00011800 0xe000 0x 0x0100>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   ethernet@0 {
+   compatible = "cavium,octeon-7360-xcv";
+   reg = <0>;
+   local-mac-address = [ 00 01 23 45 67 89 ];
+   phy-handle = <>;
+   phy-mode = "rgmii-rxid";
+   };
+   ethernet@1 {
+   compatible = "cavium,octeon-7890-bgx-port";
+   reg = <1>;
+   local-mac-address = [ 00 01 23 45 67 8a ];
+   phy-handle = <>;
+   phy-mode = "sgmii";
+   };
+   };
-- 
2.14.3



[PATCH net-next v3 2/8] net: xdp: report flags program was installed with on query

2017-12-01 Thread Jakub Kicinski
Some drivers enforce that flags on program replacement and
removal must match the flags passed on install.  This leaves
the possibility open to enable simultaneous loading
of XDP programs both to HW and DRV.

Allow such drivers to report the flags back to the stack.
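
The rule described above can be illustrated with a minimal userspace sketch.
This is not the nfp driver code: the model_* names, the flag values and the
choice of -EBUSY for a mismatch are assumptions made for the example.  Only
the idea of reporting prog_id and prog_flags on query comes from this patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the XDP mode flags; values are illustrative. */
#define MODEL_XDP_FLAGS_DRV_MODE (1U << 2)
#define MODEL_XDP_FLAGS_HW_MODE  (1U << 3)

/* Per-device state a driver would keep (names are assumptions). */
struct model_dev {
	bool     has_prog;   /* true while a program is installed */
	uint32_t prog_id;    /* id of the installed program, 0 if none */
	uint32_t xdp_flags;  /* flags the program was installed with */
};

/* Subset of the query result: what the stack gets back. */
struct model_query {
	uint32_t prog_id;
	uint32_t prog_flags;
};

/*
 * Install (prog_id != 0) or remove (prog_id == 0) a program.
 * Replacement and removal are refused unless the flags match
 * the ones used at install time.
 */
static int model_xdp_setup(struct model_dev *dev, uint32_t prog_id,
			   uint32_t flags)
{
	if (dev->has_prog && dev->xdp_flags != flags)
		return -EBUSY;

	dev->has_prog = prog_id != 0;
	dev->prog_id = prog_id;
	dev->xdp_flags = dev->has_prog ? flags : 0;
	return 0;
}

/* Report id and install flags; both are 0 when nothing is installed. */
static void model_xdp_query(const struct model_dev *dev,
			    struct model_query *q)
{
	q->prog_id = dev->has_prog ? dev->prog_id : 0;
	q->prog_flags = dev->has_prog ? dev->xdp_flags : 0;
}
```

With the flags available from the query, a caller that wants to remove the
program can pass back exactly what the driver expects instead of guessing.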

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 1 +
 include/linux/netdevice.h   | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 1a603fdd9e80..ea6bbf1efefc 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3392,6 +3392,7 @@ static int nfp_net_xdp(struct net_device *netdev, struct netdev_bpf *xdp)
if (nn->dp.bpf_offload_xdp)
xdp->prog_attached = XDP_ATTACHED_HW;
xdp->prog_id = nn->xdp_prog ? nn->xdp_prog->aux->id : 0;
+   xdp->prog_flags = nn->xdp_prog ? nn->xdp_flags : 0;
return 0;
case BPF_OFFLOAD_VERIFIER_PREP:
return nfp_app_bpf_verifier_prep(nn->app, nn, xdp);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 667bdd3ad33e..cc4ce7456e38 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -820,6 +820,8 @@ struct netdev_bpf {
struct {
u8 prog_attached;
u32 prog_id;
+   /* flags with which program was installed */
+   u32 prog_flags;
};
/* BPF_OFFLOAD_VERIFIER_PREP */
struct {
-- 
2.15.0



[PATCH net-next v3 8/8] net: dummy: remove fake SR-IOV functionality

2017-12-01 Thread Jakub Kicinski
netdevsim driver seems like a better place for fake SR-IOV
functionality.  Remove the code previously added to dummy.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Phil Sutter 
---
CC: Phil Sutter 
CC: Sabrina Dubroca  
---
 drivers/net/dummy.c | 215 +---
 1 file changed, 1 insertion(+), 214 deletions(-)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index 58483af80bdb..30b1c8512049 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -42,48 +42,7 @@
 #define DRV_NAME   "dummy"
 #define DRV_VERSION"1.0"
 
-#undef pr_fmt
-#define pr_fmt(fmt) DRV_NAME ": " fmt
-
 static int numdummies = 1;
-static int num_vfs;
-
-struct vf_data_storage {
-   u8  vf_mac[ETH_ALEN];
-   u16 pf_vlan; /* When set, guest VLAN config not allowed. */
-   u16 pf_qos;
-   __be16  vlan_proto;
-   u16 min_tx_rate;
-   u16 max_tx_rate;
-   u8  spoofchk_enabled;
-   boolrss_query_enabled;
-   u8  trusted;
-   int link_state;
-};
-
-struct dummy_priv {
-   struct vf_data_storage  *vfinfo;
-};
-
-static int dummy_num_vf(struct device *dev)
-{
-   return num_vfs;
-}
-
-static struct bus_type dummy_bus = {
-   .name   = "dummy",
-   .num_vf = dummy_num_vf,
-};
-
-static void release_dummy_parent(struct device *dev)
-{
-}
-
-static struct device dummy_parent = {
-   .init_name  = "dummy",
-   .bus= &dummy_bus,
-   .release= release_dummy_parent,
-};
 
 /* fake multicast ability */
 static void set_multicast_list(struct net_device *dev)
@@ -133,25 +92,10 @@ static netdev_tx_t dummy_xmit(struct sk_buff *skb, struct net_device *dev)
 
 static int dummy_dev_init(struct net_device *dev)
 {
-   struct dummy_priv *priv = netdev_priv(dev);
-
dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
if (!dev->dstats)
return -ENOMEM;
 
-   priv->vfinfo = NULL;
-
-   if (!num_vfs)
-   return 0;
-
-   dev->dev.parent = &dummy_parent;
-   priv->vfinfo = kcalloc(num_vfs, sizeof(struct vf_data_storage),
-  GFP_KERNEL);
-   if (!priv->vfinfo) {
-   free_percpu(dev->dstats);
-   return -ENOMEM;
-   }
-
return 0;
 }
 
@@ -169,117 +113,6 @@ static int dummy_change_carrier(struct net_device *dev, bool new_carrier)
return 0;
 }
 
-static int dummy_set_vf_mac(struct net_device *dev, int vf, u8 *mac)
-{
-   struct dummy_priv *priv = netdev_priv(dev);
-
-   if (!is_valid_ether_addr(mac) || (vf >= num_vfs))
-   return -EINVAL;
-
-   memcpy(priv->vfinfo[vf].vf_mac, mac, ETH_ALEN);
-
-   return 0;
-}
-
-static int dummy_set_vf_vlan(struct net_device *dev, int vf,
-u16 vlan, u8 qos, __be16 vlan_proto)
-{
-   struct dummy_priv *priv = netdev_priv(dev);
-
-   if ((vf >= num_vfs) || (vlan > 4095) || (qos > 7))
-   return -EINVAL;
-
-   priv->vfinfo[vf].pf_vlan = vlan;
-   priv->vfinfo[vf].pf_qos = qos;
-   priv->vfinfo[vf].vlan_proto = vlan_proto;
-
-   return 0;
-}
-
-static int dummy_set_vf_rate(struct net_device *dev, int vf, int min, int max)
-{
-   struct dummy_priv *priv = netdev_priv(dev);
-
-   if (vf >= num_vfs)
-   return -EINVAL;
-
-   priv->vfinfo[vf].min_tx_rate = min;
-   priv->vfinfo[vf].max_tx_rate = max;
-
-   return 0;
-}
-
-static int dummy_set_vf_spoofchk(struct net_device *dev, int vf, bool val)
-{
-   struct dummy_priv *priv = netdev_priv(dev);
-
-   if (vf >= num_vfs)
-   return -EINVAL;
-
-   priv->vfinfo[vf].spoofchk_enabled = val;
-
-   return 0;
-}
-
-static int dummy_set_vf_rss_query_en(struct net_device *dev, int vf, bool val)
-{
-   struct dummy_priv *priv = netdev_priv(dev);
-
-   if (vf >= num_vfs)
-   return -EINVAL;
-
-   priv->vfinfo[vf].rss_query_enabled = val;
-
-   return 0;
-}
-
-static int dummy_set_vf_trust(struct net_device *dev, int vf, bool val)
-{
-   struct dummy_priv *priv = netdev_priv(dev);
-
-   if (vf >= num_vfs)
-   return -EINVAL;
-
-   priv->vfinfo[vf].trusted = val;
-
-   return 0;
-}
-
-static int dummy_get_vf_config(struct net_device *dev,
-  int vf, struct ifla_vf_info *ivi)
-{
-   struct dummy_priv *priv = netdev_priv(dev);
-
-   if (vf >= num_vfs)
-   return -EINVAL;
-
-   ivi->vf = vf;
-   memcpy(&ivi->mac, priv->vfinfo[vf].vf_mac, ETH_ALEN);
-   ivi->vlan = priv->vfinfo[vf].pf_vlan;
-   ivi->qos = priv->vfinfo[vf].pf_qos;
-   ivi->spoofchk = priv->vfinfo[vf].spoofchk_enabled;
-   ivi->linkstate = priv->vfinfo[vf].link_state;
-   ivi->min_tx_rate = priv->vfinfo[vf].min_tx_rate;
-

[PATCH net-next v3 5/8] netdevsim: add bpf offload support

2017-12-01 Thread Jakub Kicinski
Add support for loading programs for netdevsim devices and
expose the related information via DebugFS.  Offload of both
XDP and cls_bpf programs is supported.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 drivers/net/netdevsim/Makefile|   1 +
 drivers/net/netdevsim/bpf.c   | 373 ++
 drivers/net/netdevsim/netdev.c| 116 +++-
 drivers/net/netdevsim/netdevsim.h |  40 
 4 files changed, 529 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/netdevsim/bpf.c

diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile
index 07867bfe873b..074ddebbc41d 100644
--- a/drivers/net/netdevsim/Makefile
+++ b/drivers/net/netdevsim/Makefile
@@ -4,3 +4,4 @@ obj-$(CONFIG_NETDEVSIM) += netdevsim.o
 
 netdevsim-objs := \
netdev.o \
+   bpf.o \
diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c
new file mode 100644
index ..8e4398a50903
--- /dev/null
+++ b/drivers/net/netdevsim/bpf.c
@@ -0,0 +1,373 @@
+/*
+ * Copyright (C) 2017 Netronome Systems, Inc.
+ *
+ * This software is licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree.
+ *
+ * THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
+ * WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
+ * BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
+ * OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
+ * THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "netdevsim.h"
+
+struct nsim_bpf_bound_prog {
+   struct netdevsim *ns;
+   struct bpf_prog *prog;
+   struct dentry *ddir;
+   const char *state;
+   bool is_loaded;
+   struct list_head l;
+};
+
+static int nsim_debugfs_bpf_string_read(struct seq_file *file, void *data)
+{
+   const char **str = file->private;
+
+   if (*str)
+   seq_printf(file, "%s\n", *str);
+
+   return 0;
+}
+
+static int nsim_debugfs_bpf_string_open(struct inode *inode, struct file *f)
+{
+   return single_open(f, nsim_debugfs_bpf_string_read, inode->i_private);
+}
+
+static const struct file_operations nsim_bpf_string_fops = {
+   .owner = THIS_MODULE,
+   .open = nsim_debugfs_bpf_string_open,
+   .release = single_release,
+   .read = seq_read,
+   .llseek = seq_lseek
+};
+
+static int
+nsim_bpf_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn)
+{
+   struct nsim_bpf_bound_prog *state;
+
+   state = env->prog->aux->offload->dev_priv;
+   if (state->ns->bpf_bind_verifier_delay && !insn_idx)
+   msleep(state->ns->bpf_bind_verifier_delay);
+
+   return 0;
+}
+
+static const struct bpf_ext_analyzer_ops nsim_bpf_analyzer_ops = {
+   .insn_hook = nsim_bpf_verify_insn,
+};
+
+static bool nsim_xdp_offload_active(struct netdevsim *ns)
+{
+   return ns->xdp_prog_mode == XDP_ATTACHED_HW;
+}
+
+static void nsim_prog_set_loaded(struct bpf_prog *prog, bool loaded)
+{
+   struct nsim_bpf_bound_prog *state;
+
+   if (!prog || !prog->aux->offload)
+   return;
+
+   state = prog->aux->offload->dev_priv;
+   state->is_loaded = loaded;
+}
+
+static int
+nsim_bpf_offload(struct netdevsim *ns, struct bpf_prog *prog, bool oldprog)
+{
+   nsim_prog_set_loaded(ns->bpf_offloaded, false);
+
+   WARN(!!ns->bpf_offloaded != oldprog,
+"bad offload state, expected offload %sto be active",
+oldprog ? "" : "not ");
+   ns->bpf_offloaded = prog;
+   ns->bpf_offloaded_id = prog ? prog->aux->id : 0;
+   nsim_prog_set_loaded(prog, true);
+
+   return 0;
+}
+
+int nsim_bpf_setup_tc_block_cb(enum tc_setup_type type,
+  void *type_data, void *cb_priv)
+{
+   struct tc_cls_bpf_offload *cls_bpf = type_data;
+   struct bpf_prog *prog = cls_bpf->prog;
+   struct netdevsim *ns = cb_priv;
+   bool skip_sw;
+
+   if (type != TC_SETUP_CLSBPF ||
+   !tc_can_offload(ns->netdev) ||
+   cls_bpf->common.protocol != htons(ETH_P_ALL) ||
+   cls_bpf->common.chain_index)
+   return -EOPNOTSUPP;
+
+   skip_sw = cls_bpf->gen_flags & TCA_CLS_FLAGS_SKIP_SW;
+
+   if (nsim_xdp_offload_active(ns))
+   return -EBUSY;
+
+   if (!ns->bpf_tc_accept)
+   return -EOPNOTSUPP;
+   /* Note: progs without skip_sw will probably not be dev bound */
+   if (prog && !prog->aux->offload && !ns->bpf_tc_non_bound_accept)
+   return -EOPNOTSUPP;
+
+   switch (cls_bpf->command) {
+   

[PATCH net-next v3 3/8] net: xdp: make the stack take care of the tear down

2017-12-01 Thread Jakub Kicinski
Since day one of XDP, drivers have had to remember to free the program
on the remove path.  This leads to code duplication and is error
prone.  Make the stack query the installed programs on unregister
and if something is installed, remove the program.  Freeing of
program attached to XDP generic is moved from free_netdev() as well.

Because the remove will now be called before notifiers are
invoked, BPF offload state of the program will not get destroyed
before uninstall.
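
The teardown described above can be sketched in userspace as follows.  This
is a simplified model under stated assumptions, not the code added to
net/core/dev.c: the model_* types are hypothetical, the stand-in for
struct netdev_bpf is flattened (the real structure uses a union), and the
toy driver callback exists only to make the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum model_cmd { MODEL_SETUP_PROG, MODEL_QUERY_PROG };

/* Flattened, hypothetical stand-in for struct netdev_bpf. */
struct model_bpf {
	enum model_cmd command;
	uint32_t flags;          /* SETUP: flags for install/remove */
	const void *prog;        /* SETUP: program, NULL to remove */
	uint8_t prog_attached;   /* QUERY: non-zero if a program is installed */
	uint32_t prog_flags;     /* QUERY: flags used at install time */
};

struct model_netdev {
	const void *prog;        /* currently installed program, if any */
	uint32_t flags;          /* flags it was installed with */
};

typedef int (*model_bpf_op_t)(struct model_netdev *dev, struct model_bpf *bpf);

/* Toy driver callback that insists on matching flags for removal. */
static int model_driver_bpf(struct model_netdev *dev, struct model_bpf *bpf)
{
	switch (bpf->command) {
	case MODEL_QUERY_PROG:
		bpf->prog_attached = dev->prog != NULL;
		bpf->prog_flags = dev->prog ? dev->flags : 0;
		return 0;
	case MODEL_SETUP_PROG:
		if (dev->prog && dev->flags != bpf->flags)
			return -1;
		dev->prog = bpf->prog;
		dev->flags = bpf->prog ? bpf->flags : 0;
		return 0;
	}
	return -1;
}

/* What the core would do on unregister: query, then remove if needed. */
static void model_xdp_uninstall(struct model_netdev *dev, model_bpf_op_t op)
{
	struct model_bpf xdp;
	uint32_t flags;

	if (!op)
		return;

	memset(&xdp, 0, sizeof(xdp));
	xdp.command = MODEL_QUERY_PROG;
	if (op(dev, &xdp) < 0 || !xdp.prog_attached)
		return;

	flags = xdp.prog_flags;
	memset(&xdp, 0, sizeof(xdp));
	xdp.command = MODEL_SETUP_PROG;
	xdp.flags = flags;   /* reuse install flags so the driver accepts it */
	xdp.prog = NULL;     /* NULL program means "remove" */
	op(dev, &xdp);
}
```

The point of the design is that the driver's remove path no longer needs its
own bpf_prog_put() call; the same callback used for a normal removal is
driven by the core.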

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c  |  2 --
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  3 ---
 drivers/net/ethernet/netronome/nfp/bpf/main.c  |  7 --
 .../net/ethernet/netronome/nfp/nfp_net_common.c|  3 ---
 drivers/net/ethernet/qlogic/qede/qede_main.c   |  4 ---
 drivers/net/tun.c  |  4 ---
 net/core/dev.c | 29 --
 7 files changed, 22 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 33c49ad697e4..413ad2444ba2 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7800,8 +7800,6 @@ static void bnxt_remove_one(struct pci_dev *pdev)
bnxt_dcb_free(bp);
kfree(bp->edev);
bp->edev = NULL;
-   if (bp->xdp_prog)
-   bpf_prog_put(bp->xdp_prog);
bnxt_cleanup_pci(bp);
free_netdev(dev);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index d2b057a3e512..0f5c012de52e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4308,9 +4308,6 @@ static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
 {
mlx5e_ipsec_cleanup(priv);
mlx5e_vxlan_cleanup(priv);
-
-   if (priv->channels.params.xdp_prog)
-   bpf_prog_put(priv->channels.params.xdp_prog);
 }
 
 static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index e379b78e86ef..54bfd7846f6d 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -82,12 +82,6 @@ static const char *nfp_bpf_extra_cap(struct nfp_app *app, struct nfp_net *nn)
return nfp_net_ebpf_capable(nn) ? "BPF" : "";
 }
 
-static void nfp_bpf_vnic_free(struct nfp_app *app, struct nfp_net *nn)
-{
-   if (nn->dp.bpf_offload_xdp)
-   nfp_bpf_xdp_offload(app, nn, NULL);
-}
-
 static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type type,
 void *type_data, void *cb_priv)
 {
@@ -168,7 +162,6 @@ const struct nfp_app_type app_bpf = {
.extra_cap  = nfp_bpf_extra_cap,
 
.vnic_alloc = nfp_app_nic_vnic_alloc,
-   .vnic_free  = nfp_bpf_vnic_free,
 
.setup_tc   = nfp_bpf_setup_tc,
.tc_busy= nfp_bpf_tc_busy,
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index ea6bbf1efefc..ad3e9f6a61e5 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3562,9 +3562,6 @@ struct nfp_net *nfp_net_alloc(struct pci_dev *pdev, bool needs_netdev,
  */
 void nfp_net_free(struct nfp_net *nn)
 {
-   if (nn->xdp_prog)
-   bpf_prog_put(nn->xdp_prog);
-
if (nn->dp.netdev)
free_netdev(nn->dp.netdev);
else
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 8f9b3eb82137..57332b3e5e64 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -1068,10 +1068,6 @@ static void __qede_remove(struct pci_dev *pdev, enum qede_remove_mode mode)
 
pci_set_drvdata(pdev, NULL);
 
-   /* Release edev's reference to XDP's bpf if such exist */
-   if (edev->xdp_prog)
-   bpf_prog_put(edev->xdp_prog);
-
/* Use global ops since we've freed edev */
qed_ops->common->slowpath_stop(cdev);
if (system_state == SYSTEM_POWER_OFF)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6a7bde9bc4b2..6f7e8e45c961 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -673,7 +673,6 @@ static void tun_detach(struct tun_file *tfile, bool clean)
 static void tun_detach_all(struct net_device *dev)
 {
struct tun_struct *tun = netdev_priv(dev);
-   struct bpf_prog *xdp_prog = rtnl_dereference(tun->xdp_prog);
struct tun_file *tfile, *tmp;
int i, n = tun->numqueues;
 
@@ -708,9 +707,6 @@ static 

[PATCH net-next v3 0/8] xdp: make stack perform remove and add selftests

2017-12-01 Thread Jakub Kicinski
Hi!

The purpose of this series is to add a software model of BPF offloads
to make it easier for everyone to test them and make some of the more
arcane rules and assumptions more clear.

The series starts with 3 patches aiming to make XDP handling in the
drivers less error prone.  Currently driver authors have to remember
to free XDP programs if XDP is active during unregister.  With this
series the core will disable XDP on its own.  This takes place
after close, so drivers are not expected to perform reconfiguration
when disabling XDP on a downed device.

Next two patches add the software netdev driver, followed by a python
test which exercises all the corner cases which came to my mind.

Test needs to be run as root.  It will print basic information to
stdout, but can also create a more detailed log of all commands
when --log option is passed.  Log is in Emacs Org-mode format.

  ./tools/testing/selftests/bpf/test_offload.py --log /tmp/log

Last two patches replace the SR-IOV API implementation of dummy.

v3:
 - move the freeing of vfs to release (Phil).
v2:
 - free device from the release function;
 - use bus-based name generation instead of netdev name.
v1:
 - replace the SR-IOV API implementation of dummy;
 - make the dev_xdp_uninstall() also handle the XDP generic (Daniel).

Jakub Kicinski (8):
  net: xdp: avoid output parameters when querying XDP prog
  net: xdp: report flags program was installed with on query
  net: xdp: make the stack take care of the tear down
  netdevsim: add software driver for testing offloads
  netdevsim: add bpf offload support
  selftests/bpf: add offload test based on netdevsim
  netdevsim: add SR-IOV functionality
  net: dummy: remove fake SR-IOV functionality

 MAINTAINERS|   5 +
 drivers/net/Kconfig|  11 +
 drivers/net/Makefile   |   1 +
 drivers/net/dummy.c| 215 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt.c  |   2 -
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 -
 drivers/net/ethernet/netronome/nfp/bpf/main.c  |   7 -
 .../net/ethernet/netronome/nfp/nfp_net_common.c|   4 +-
 drivers/net/ethernet/qlogic/qede/qede_main.c   |   4 -
 drivers/net/netdevsim/Makefile |   7 +
 drivers/net/netdevsim/bpf.c| 373 +++
 drivers/net/netdevsim/netdev.c | 502 +++
 drivers/net/netdevsim/netdevsim.h  |  78 +++
 drivers/net/tun.c  |   4 -
 include/linux/netdevice.h  |   5 +-
 net/core/dev.c |  53 +-
 net/core/rtnetlink.c   |   6 +-
 tools/testing/selftests/bpf/Makefile   |   5 +-
 tools/testing/selftests/bpf/sample_ret0.c  |   7 +
 tools/testing/selftests/bpf/test_offload.py| 681 +
 20 files changed, 1715 insertions(+), 258 deletions(-)
 create mode 100644 drivers/net/netdevsim/Makefile
 create mode 100644 drivers/net/netdevsim/bpf.c
 create mode 100644 drivers/net/netdevsim/netdev.c
 create mode 100644 drivers/net/netdevsim/netdevsim.h
 create mode 100644 tools/testing/selftests/bpf/sample_ret0.c
 create mode 100755 tools/testing/selftests/bpf/test_offload.py

-- 
2.15.0



[PATCH net-next v3 4/8] netdevsim: add software driver for testing offloads

2017-12-01 Thread Jakub Kicinski
To be able to run selftests without any hardware required we
need a software model.  The model can also serve as an example
implementation for those implementing actual HW offloads.
The dummy driver has previously been extended to test SR-IOV,
but the general consensus seems to be against adding further
features to it.

Add a new driver for purposes of software modelling only.
eBPF and SR-IOV will be added here shortly, others are invited
to further extend the driver with their offload models.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 MAINTAINERS   |   5 ++
 drivers/net/Kconfig   |  11 
 drivers/net/Makefile  |   1 +
 drivers/net/netdevsim/Makefile|   6 ++
 drivers/net/netdevsim/netdev.c| 118 ++
 drivers/net/netdevsim/netdevsim.h |  26 +
 6 files changed, 167 insertions(+)
 create mode 100644 drivers/net/netdevsim/Makefile
 create mode 100644 drivers/net/netdevsim/netdev.c
 create mode 100644 drivers/net/netdevsim/netdevsim.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 77d819b458a9..010e46a38373 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9599,6 +9599,11 @@ NETWORKING [WIRELESS]
 L: linux-wirel...@vger.kernel.org
 Q: http://patchwork.kernel.org/project/linux-wireless/list/
 
+NETDEVSIM
+M: Jakub Kicinski 
+S: Maintained
+F: drivers/net/netdevsim/*
+
 NETXEN (1/10) GbE SUPPORT
 M: Manish Chopra 
 M: Rahul Verma 
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 0936da592e12..944ec3c9282c 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -497,4 +497,15 @@ config THUNDERBOLT_NET
 
 source "drivers/net/hyperv/Kconfig"
 
+config NETDEVSIM
+   tristate "Simulated networking device"
+   depends on DEBUG_FS
+   help
+ This driver is a developer testing tool and software model that can
+ be used to test various control path networking APIs, especially
+ HW-offload related.
+
+ To compile this driver as a module, choose M here: the module
+ will be called netdevsim.
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 766f62d02a0b..04c3b747812c 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_FUJITSU_ES) += fjes/
 
 thunderbolt-net-y += thunderbolt.o
 obj-$(CONFIG_THUNDERBOLT_NET) += thunderbolt-net.o
+obj-$(CONFIG_NETDEVSIM) += netdevsim/
diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile
new file mode 100644
index ..07867bfe873b
--- /dev/null
+++ b/drivers/net/netdevsim/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_NETDEVSIM) += netdevsim.o
+
+netdevsim-objs := \
+   netdev.o \
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
new file mode 100644
index ..7599c72c477a
--- /dev/null
+++ b/drivers/net/netdevsim/netdev.c
@@ -0,0 +1,118 @@
+/*
+ * Copyright (C) 2017 Netronome Systems, Inc.
+ *
+ * This software is licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree.
+ *
+ * THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
+ * WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
+ * BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
+ * OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
+ * THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "netdevsim.h"
+
+static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct netdevsim *ns = netdev_priv(dev);
+
+   u64_stats_update_begin(&ns->syncp);
+   ns->tx_packets++;
+   ns->tx_bytes += skb->len;
+   u64_stats_update_end(&ns->syncp);
+
+   dev_kfree_skb(skb);
+
+   return NETDEV_TX_OK;
+}
+
+static void nsim_set_rx_mode(struct net_device *dev)
+{
+}
+
+static void
+nsim_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
+{
+   struct netdevsim *ns = netdev_priv(dev);
+   unsigned int start;
+
+   do {
+   start = u64_stats_fetch_begin(&ns->syncp);
+   stats->tx_bytes = ns->tx_bytes;
+   stats->tx_packets = ns->tx_packets;
+   } while (u64_stats_fetch_retry(&ns->syncp, start));
+}
+
+static const struct net_device_ops nsim_netdev_ops = {
+   .ndo_start_xmit = nsim_start_xmit,
+   .ndo_set_rx_mode= nsim_set_rx_mode,
+   .ndo_set_mac_address= eth_mac_addr,
+   

[PATCH net-next v3 1/8] net: xdp: avoid output parameters when querying XDP prog

2017-12-01 Thread Jakub Kicinski
The output parameters will get unwieldy if we want to add more
information about the program.  Simply pass the entire
struct netdev_bpf in.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 include/linux/netdevice.h |  3 ++-
 net/core/dev.c| 24 ++--
 net/core/rtnetlink.c  |  6 +-
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ef789e1d679e..667bdd3ad33e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3330,7 +3330,8 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
 int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
  int fd, u32 flags);
-u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t xdp_op, u32 *prog_id);
+void __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op,
+struct netdev_bpf *xdp);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index 07ed21d64f92..3f271c9cb5e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7073,17 +7073,21 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 }
 EXPORT_SYMBOL(dev_change_proto_down);
 
-u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t bpf_op, u32 *prog_id)
+void __dev_xdp_query(struct net_device *dev, bpf_op_t bpf_op,
+struct netdev_bpf *xdp)
 {
-   struct netdev_bpf xdp;
-
-   memset(&xdp, 0, sizeof(xdp));
-   xdp.command = XDP_QUERY_PROG;
+   memset(xdp, 0, sizeof(*xdp));
+   xdp->command = XDP_QUERY_PROG;
 
/* Query must always succeed. */
-   WARN_ON(bpf_op(dev, &xdp) < 0);
-   if (prog_id)
-   *prog_id = xdp.prog_id;
+   WARN_ON(bpf_op(dev, xdp) < 0);
+}
+
+static u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t bpf_op)
+{
+   struct netdev_bpf xdp;
+
+   __dev_xdp_query(dev, bpf_op, &xdp);
 
return xdp.prog_attached;
 }
@@ -7134,10 +7138,10 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
bpf_chk = generic_xdp_install;
 
if (fd >= 0) {
-   if (bpf_chk && __dev_xdp_attached(dev, bpf_chk, NULL))
+   if (bpf_chk && __dev_xdp_attached(dev, bpf_chk))
return -EEXIST;
if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) &&
-   __dev_xdp_attached(dev, bpf_op, NULL))
+   __dev_xdp_attached(dev, bpf_op))
return -EBUSY;
 
prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dabba2a91fc8..9c4cb584bfb0 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1261,6 +1261,7 @@ static u8 rtnl_xdp_attached_mode(struct net_device *dev, u32 *prog_id)
 {
const struct net_device_ops *ops = dev->netdev_ops;
const struct bpf_prog *generic_xdp_prog;
+   struct netdev_bpf xdp;
 
ASSERT_RTNL();
 
@@ -1273,7 +1274,10 @@ static u8 rtnl_xdp_attached_mode(struct net_device *dev, u32 *prog_id)
if (!ops->ndo_bpf)
return XDP_ATTACHED_NONE;
 
-   return __dev_xdp_attached(dev, ops->ndo_bpf, prog_id);
+   __dev_xdp_query(dev, ops->ndo_bpf, &xdp);
+   *prog_id = xdp.prog_id;
+
+   return xdp.prog_attached;
 }
 
 static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
-- 
2.15.0



[PATCH net-next v3 6/8] selftests/bpf: add offload test based on netdevsim

2017-12-01 Thread Jakub Kicinski
Add a test of BPF offload control path interfaces based on
just-added netdevsim driver.  Perform various checks of both
the stack and the expected driver behaviour.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 tools/testing/selftests/bpf/Makefile|   5 +-
 tools/testing/selftests/bpf/sample_ret0.c   |   7 +
 tools/testing/selftests/bpf/test_offload.py | 681 
 3 files changed, 691 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sample_ret0.c
 create mode 100755 tools/testing/selftests/bpf/test_offload.py

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 333a48655ee0..2c9d8c63c6fa 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -17,9 +17,10 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
-   sockmap_verdict_prog.o dev_cgroup.o
+   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o
 
-TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh
+TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \
+   test_offload.py
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/bpf/sample_ret0.c b/tools/testing/selftests/bpf/sample_ret0.c
new file mode 100644
index ..fec99750d6ea
--- /dev/null
+++ b/tools/testing/selftests/bpf/sample_ret0.c
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) */
+
+/* Sample program which should always load for testing control paths. */
+int func()
+{
+   return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py
new file mode 100755
index ..3914f7a4585a
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_offload.py
@@ -0,0 +1,681 @@
+#!/usr/bin/python3
+
+# Copyright (C) 2017 Netronome Systems, Inc.
+#
+# This software is licensed under the GNU General License Version 2,
+# June 1991 as shown in the file COPYING in the top-level directory of this
+# source tree.
+#
+# THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
+# WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
+# BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+# FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
+# OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
+# THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+from datetime import datetime
+import argparse
+import json
+import os
+import pprint
+import subprocess
+import time
+
+logfile = None
+log_level = 1
+bpf_test_dir = os.path.dirname(os.path.realpath(__file__))
+pp = pprint.PrettyPrinter()
+devs = [] # devices we created for clean up
+files = [] # files to be removed
+
+def log_get_sec(level=0):
+return "*" * (log_level + level)
+
+def log_level_inc(add=1):
+global log_level
+log_level += add
+
+def log_level_dec(sub=1):
+global log_level
+log_level -= sub
+
+def log_level_set(level):
+global log_level
+log_level = level
+
+def log(header, data, level=None):
+"""
+Output to an optional log.
+"""
+if logfile is None:
+return
+if level is not None:
+log_level_set(level)
+
+if not isinstance(data, str):
+data = pp.pformat(data)
+
+if len(header):
+logfile.write("\n" + log_get_sec() + " ")
+logfile.write(header)
+if len(header) and len(data.strip()):
+logfile.write("\n")
+logfile.write(data)
+
+def skip(cond, msg):
+if not cond:
+return
+print("SKIP: " + msg)
+log("SKIP: " + msg, "", level=1)
+os.sys.exit(0)
+
+def fail(cond, msg):
+if not cond:
+return
+print("FAIL: " + msg)
+log("FAIL: " + msg, "", level=1)
+os.sys.exit(1)
+
+def start_test(msg):
+log(msg, "", level=1)
+log_level_inc()
+print(msg)
+
+def cmd(cmd, shell=True, include_stderr=False, background=False, fail=True):
+"""
+Run a command in subprocess and return tuple of (retval, stdout);
+optionally return stderr as well as third value.
+"""
+proc = subprocess.Popen(cmd, shell=shell, stdout=subprocess.PIPE,
+stderr=subprocess.PIPE)
+if background:
+msg = "%s START: %s" % (log_get_sec(1),
+datetime.now().strftime("%H:%M:%S.%f"))
+log("BKG " + proc.args, msg)
+return proc
+
+return cmd_result(proc, include_stderr=include_stderr, fail=fail)
+
+def cmd_result(proc, include_stderr=False, fail=False):
+stdout, stderr = proc.communicate()
+stdout 

[PATCH net-next v3 7/8] netdevsim: add SR-IOV functionality

2017-12-01 Thread Jakub Kicinski
dummy driver was extended with VF-related netdev APIs for testing
SR-IOV-related software.  netdevsim did not exist back then.
Implement SR-IOV functionality in netdevsim.  Notable difference
is that since netdevsim has no module parameters, we will actually
create a device with sriov_numvfs attribute for each netdev.
The zero MAC address is accepted as some HW use it to mean any
address is allowed.  Link state is also now validated.
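
The two validation changes mentioned above can be sketched in userspace as
follows.  This is a model only: the model_* names and the link-state values
are hypothetical, and the assumption that only multicast addresses are
rejected is mine; the message above states just that the zero address is
accepted and that link state is validated.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical link-state values for the model. */
enum model_vf_link_state {
	MODEL_VF_LINK_AUTO,
	MODEL_VF_LINK_ENABLE,
	MODEL_VF_LINK_DISABLE,
};

/* Multicast addresses have the lowest bit of the first octet set. */
static bool model_is_multicast(const uint8_t mac[6])
{
	return (mac[0] & 0x01) != 0;
}

/*
 * Accept any non-multicast address for an existing VF.  The all-zero
 * address passes this check, unlike a check that demands a "valid"
 * unicast address.
 */
static int model_set_vf_mac(unsigned int num_vfs, unsigned int vf,
			    const uint8_t mac[6])
{
	if (vf >= num_vfs || model_is_multicast(mac))
		return -EINVAL;
	return 0;
}

/* Reject link-state values outside the known set. */
static int model_set_vf_link_state(unsigned int num_vfs, unsigned int vf,
				   int state)
{
	if (vf >= num_vfs)
		return -EINVAL;

	switch (state) {
	case MODEL_VF_LINK_AUTO:
	case MODEL_VF_LINK_ENABLE:
	case MODEL_VF_LINK_DISABLE:
		return 0;
	default:
		return -EINVAL;
	}
}
```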

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
CC: Phil Sutter 
CC: Sabrina Dubroca  
---
 drivers/net/netdevsim/netdev.c| 274 +-
 drivers/net/netdevsim/netdevsim.h |  12 ++
 2 files changed, 284 insertions(+), 2 deletions(-)

diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 828c1ce49a8b..eb8c679fca9f 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -25,6 +25,125 @@
 
 #include "netdevsim.h"
 
+struct nsim_vf_config {
+   int link_state;
+   u16 min_tx_rate;
+   u16 max_tx_rate;
+   u16 vlan;
+   __be16 vlan_proto;
+   u16 qos;
+   u8 vf_mac[ETH_ALEN];
+   bool spoofchk_enabled;
+   bool trusted;
+   bool rss_query_enabled;
+};
+
+static u32 nsim_dev_id;
+
+static int nsim_num_vf(struct device *dev)
+{
+   struct netdevsim *ns = to_nsim(dev);
+
+   return ns->num_vfs;
+}
+
+static struct bus_type nsim_bus = {
+   .name   = DRV_NAME,
+   .dev_name   = DRV_NAME,
+   .num_vf = nsim_num_vf,
+};
+
+static int nsim_vfs_enable(struct netdevsim *ns, unsigned int num_vfs)
+{
+   ns->vfconfigs = kcalloc(num_vfs, sizeof(struct nsim_vf_config),
+   GFP_KERNEL);
+   if (!ns->vfconfigs)
+   return -ENOMEM;
+   ns->num_vfs = num_vfs;
+
+   return 0;
+}
+
+static void nsim_vfs_disable(struct netdevsim *ns)
+{
+   kfree(ns->vfconfigs);
+   ns->vfconfigs = NULL;
+   ns->num_vfs = 0;
+}
+
+static ssize_t
+nsim_numvfs_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+   struct netdevsim *ns = to_nsim(dev);
+   unsigned int num_vfs;
+   int ret;
+
+   ret = kstrtouint(buf, 0, &num_vfs);
+   if (ret)
+   return ret;
+
+   rtnl_lock();
+   if (ns->num_vfs == num_vfs)
+   goto exit_good;
+   if (ns->num_vfs && num_vfs) {
+   ret = -EBUSY;
+   goto exit_unlock;
+   }
+
+   if (num_vfs) {
+   ret = nsim_vfs_enable(ns, num_vfs);
+   if (ret)
+   goto exit_unlock;
+   } else {
+   nsim_vfs_disable(ns);
+   }
+exit_good:
+   ret = count;
+exit_unlock:
+   rtnl_unlock();
+
+   return ret;
+}
+
+static ssize_t
+nsim_numvfs_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+   struct netdevsim *ns = to_nsim(dev);
+
+   return sprintf(buf, "%u\n", ns->num_vfs);
+}
+
+static struct device_attribute nsim_numvfs_attr =
+   __ATTR(sriov_numvfs, 0664, nsim_numvfs_show, nsim_numvfs_store);
+
+static struct attribute *nsim_dev_attrs[] = {
+   &nsim_numvfs_attr.attr,
+   NULL,
+};
+
+static const struct attribute_group nsim_dev_attr_group = {
+   .attrs = nsim_dev_attrs,
+};
+
+static const struct attribute_group *nsim_dev_attr_groups[] = {
+   &nsim_dev_attr_group,
+   NULL,
+};
+
+static void nsim_dev_release(struct device *dev)
+{
+   struct netdevsim *ns = to_nsim(dev);
+
+   nsim_vfs_disable(ns);
+   free_netdev(ns->netdev);
+}
+
+struct device_type nsim_dev_type = {
+   .groups = nsim_dev_attr_groups,
+   .release = nsim_dev_release,
+};
+
 static int nsim_init(struct net_device *dev)
 {
struct netdevsim *ns = netdev_priv(dev);
@@ -37,8 +156,19 @@ static int nsim_init(struct net_device *dev)
if (err)
goto err_debugfs_destroy;
 
+   ns->dev.id = nsim_dev_id++;
+   ns->dev.bus = &nsim_bus;
+   ns->dev.type = &nsim_dev_type;
+   err = device_register(&ns->dev);
+   if (err)
+   goto err_bpf_uninit;
+
+   SET_NETDEV_DEV(dev, &ns->dev);
+
return 0;
 
+err_bpf_uninit:
+   nsim_bpf_uninit(ns);
 err_debugfs_destroy:
debugfs_remove_recursive(ns->ddir);
return err;
@@ -52,6 +182,14 @@ static void nsim_uninit(struct net_device *dev)
nsim_bpf_uninit(ns);
 }
 
+static void nsim_free(struct net_device *dev)
+{
+   struct netdevsim *ns = netdev_priv(dev);
+
+   device_unregister(&ns->dev);
+   /* netdev and vf state will be freed out of device_release() */
+}
+
 static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct netdevsim *ns = netdev_priv(dev);
@@ -122,6 +260,123 @@ nsim_setup_tc_block(struct net_device *dev, struct tc_block_offload *f)

[PATCH net 1/2] tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()

2017-12-01 Thread Eric Dumazet
James Morris reported kernel stack corruption bug [1] while
running the SELinux testsuite, and bisected to a recent
commit bffa72cf7f9d ("net: sk_buff rbnode reorg")

We believe this commit is fine, but exposes an older bug.

SELinux code runs from tcp_filter() and might send an ICMP,
expecting IP options to be found in skb->cb[] using regular IPCB placement.

We need to defer TCP mangling of skb->cb[] after tcp_filter() calls.

This patch adds tcp_v4_fill_cb()/tcp_v4_restore_cb() in a very
similar way we added them for IPv6.
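
To make the cb[] sharing concrete, here is a minimal userspace sketch. It is
not the kernel code: union cb_model, ip_cb_model, tcp_cb_model and both helper
names are hypothetical, and the layouts are heavily simplified. The assumption
it illustrates is only that the IP layer and TCP view the same bytes with
different layouts, so the IP state has to be moved aside before TCP writes its
fields and moved back before anything expects the IP layout again.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct inet_skb_parm. */
struct ip_cb_model {
	int iif;
	uint16_t flags;
	uint16_t opt_len;
};

/* Hypothetical, simplified stand-in for struct tcp_skb_cb. */
struct tcp_cb_model {
	uint32_t seq;
	uint32_t end_seq;
	uint32_t ack_seq;
	uint8_t tcp_flags;
	struct ip_cb_model h4;	/* IP state kept at a TCP-specific offset */
};

/* Both layers look at the same storage, as with skb->cb[]. */
union cb_model {
	struct ip_cb_model ip;
	struct tcp_cb_model tcp;
};

/* Switch the buffer to the TCP layout, preserving the IP state in h4. */
void tcp_fill_cb_model(union cb_model *cb, uint32_t seq, uint32_t len)
{
	/* Save the IP state first; the writes below clobber offset 0. */
	memmove(&cb->tcp.h4, &cb->ip, sizeof(cb->ip));
	cb->tcp.seq = seq;
	cb->tcp.end_seq = seq + len;
}

/* Switch the buffer back to the layout the IP layer expects. */
void tcp_restore_cb_model(union cb_model *cb)
{
	memmove(&cb->ip, &cb->tcp.h4, sizeof(cb->ip));
}
```

In this model, any code that reads cb->ip between the fill and the restore
sees TCP sequence numbers instead of IP state, which is the kind of confusion
the patch avoids by filling only after tcp_filter() has run.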

[1]
[  339.806024] SELinux: failure in selinux_parse_skb(), unable to parse packet
[  339.822505] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: 81745af5
[  339.822505]
[  339.852250] CPU: 4 PID: 3642 Comm: client Not tainted 4.15.0-rc1-test #15
[  339.868498] Hardware name: LENOVO 10FGS0VA1L/30BC, BIOS FWKT68A   01/19/2017
[  339.885060] Call Trace:
[  339.896875]  <IRQ>
[  339.908103]  dump_stack+0x63/0x87
[  339.920645]  panic+0xe8/0x248
[  339.932668]  ? ip_push_pending_frames+0x33/0x40
[  339.946328]  ? icmp_send+0x525/0x530
[  339.958861]  ? kfree_skbmem+0x60/0x70
[  339.971431]  __stack_chk_fail+0x1b/0x20
[  339.984049]  icmp_send+0x525/0x530
[  339.996205]  ? netlbl_skbuff_err+0x36/0x40
[  340.008997]  ? selinux_netlbl_err+0x11/0x20
[  340.021816]  ? selinux_socket_sock_rcv_skb+0x211/0x230
[  340.035529]  ? security_sock_rcv_skb+0x3b/0x50
[  340.048471]  ? sk_filter_trim_cap+0x44/0x1c0
[  340.061246]  ? tcp_v4_inbound_md5_hash+0x69/0x1b0
[  340.074562]  ? tcp_filter+0x2c/0x40
[  340.086400]  ? tcp_v4_rcv+0x820/0xa20
[  340.098329]  ? ip_local_deliver_finish+0x71/0x1a0
[  340.111279]  ? ip_local_deliver+0x6f/0xe0
[  340.123535]  ? ip_rcv_finish+0x3a0/0x3a0
[  340.135523]  ? ip_rcv_finish+0xdb/0x3a0
[  340.147442]  ? ip_rcv+0x27c/0x3c0
[  340.158668]  ? inet_del_offload+0x40/0x40
[  340.170580]  ? __netif_receive_skb_core+0x4ac/0x900
[  340.183285]  ? rcu_accelerate_cbs+0x5b/0x80
[  340.195282]  ? __netif_receive_skb+0x18/0x60
[  340.207288]  ? process_backlog+0x95/0x140
[  340.218948]  ? net_rx_action+0x26c/0x3b0
[  340.230416]  ? __do_softirq+0xc9/0x26a
[  340.241625]  ? do_softirq_own_stack+0x2a/0x40
[  340.253368]  </IRQ>
[  340.262673]  ? do_softirq+0x50/0x60
[  340.273450]  ? __local_bh_enable_ip+0x57/0x60
[  340.285045]  ? ip_finish_output2+0x175/0x350
[  340.296403]  ? ip_finish_output+0x127/0x1d0
[  340.307665]  ? nf_hook_slow+0x3c/0xb0
[  340.318230]  ? ip_output+0x72/0xe0
[  340.328524]  ? ip_fragment.constprop.54+0x80/0x80
[  340.340070]  ? ip_local_out+0x35/0x40
[  340.350497]  ? ip_queue_xmit+0x15c/0x3f0
[  340.361060]  ? __kmalloc_reserve.isra.40+0x31/0x90
[  340.372484]  ? __skb_clone+0x2e/0x130
[  340.382633]  ? tcp_transmit_skb+0x558/0xa10
[  340.393262]  ? tcp_connect+0x938/0xad0
[  340.403370]  ? ktime_get_with_offset+0x4c/0xb0
[  340.414206]  ? tcp_v4_connect+0x457/0x4e0
[  340.424471]  ? __inet_stream_connect+0xb3/0x300
[  340.435195]  ? inet_stream_connect+0x3b/0x60
[  340.445607]  ? SYSC_connect+0xd9/0x110
[  340.455455]  ? __audit_syscall_entry+0xaf/0x100
[  340.466112]  ? syscall_trace_enter+0x1d0/0x2b0
[  340.476636]  ? __audit_syscall_exit+0x209/0x290
[  340.487151]  ? SyS_connect+0xe/0x10
[  340.496453]  ? do_syscall_64+0x67/0x1b0
[  340.506078]  ? entry_SYSCALL64_slow_path+0x25/0x25

Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet 
Reported-by: James Morris 
Tested-by: James Morris 
Tested-by: Casey Schaufler 
---
 net/ipv4/tcp_ipv4.c | 59 -
 net/ipv6/tcp_ipv6.c | 10 +
 2 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c6bc0c4d19c624888b0d0b5a4246c7183edf63f5..77ea45da0fe9c746907a312989658af3ad3b198d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1591,6 +1591,34 @@ int tcp_filter(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(tcp_filter);
 
+static void tcp_v4_restore_cb(struct sk_buff *skb)
+{
+   memmove(IPCB(skb), &TCP_SKB_CB(skb)->header.h4,
+   sizeof(struct inet_skb_parm));
+}
+
+static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
+  const struct tcphdr *th)
+{
+   /* This is tricky : We move IPCB at its correct location into TCP_SKB_CB()
+* barrier() makes sure compiler wont play fool^Waliasing games.
+*/
+   memmove(&TCP_SKB_CB(skb)->header.h4, IPCB(skb),
+   sizeof(struct inet_skb_parm));
+   barrier();
+
+   TCP_SKB_CB(skb)->seq = ntohl(th->seq);
+   TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+   skb->len - th->doff * 4);
+   TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
+   TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+   

[PATCH net 0/2] tcp: fix SELinux/Smack corruptions

2017-12-01 Thread Eric Dumazet
James Morris reported kernel stack corruption bug that
we tracked back to commit 971f10eca186 ("tcp: better TCP_SKB_CB
layout to reduce cache line misses")

First patch needs to be backported to kernels >= 3.18,
while second patch needs to be backported to kernels >= 4.9, since
this was the time when inet_exact_dif_match appeared.

David Ahern (1):
  tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()

Eric Dumazet (1):
  tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()

 include/net/tcp.h   |  3 +--
 net/ipv4/tcp_ipv4.c | 59 -
 net/ipv6/tcp_ipv6.c | 10 +
 3 files changed, 47 insertions(+), 25 deletions(-)

-- 
2.15.0.531.g2ccb3012c9-goog



[PATCH net 2/2] tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()

2017-12-01 Thread Eric Dumazet
From: David Ahern 

After this fix : ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()"),
socket lookups happen while skb->cb[] has not been mangled yet by TCP.

Fixes: a04a480d4392 ("net: Require exact match for TCP socket lookups if dif is l3mdev")
Signed-off-by: David Ahern 
Signed-off-by: Eric Dumazet 
---
 include/net/tcp.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4e09398009c10a72478b43d3cffc24ba01612b91..6998707e81f343ef8d893c0b2ba16db541082230 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -844,12 +844,11 @@ static inline int tcp_v6_sdif(const struct sk_buff *skb)
 }
 #endif
 
-/* TCP_SKB_CB reference means this can not be used from early demux */
 static inline bool inet_exact_dif_match(struct net *net, struct sk_buff *skb)
 {
 #if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
if (!net->ipv4.sysctl_tcp_l3mdev_accept &&
-   skb && ipv4_l3mdev_skb(TCP_SKB_CB(skb)->header.h4.flags))
+   skb && ipv4_l3mdev_skb(IPCB(skb)->flags))
return true;
 #endif
return false;
-- 
2.15.0.531.g2ccb3012c9-goog



[PATCH iproute2 net-next] gre6: add collect metadata support

2017-12-01 Thread William Tu
The patch adds an 'external' option to support collect-metadata
gre6 tunnels. Example of L3 and L2 gre devices:
bash:~# ip link add dev ip6gre123 type ip6gre external
bash:~# ip link add dev ip6gretap123 type ip6gretap external

Signed-off-by: William Tu 
---
 ip/link_gre6.c| 55 ---
 man/man8/ip-link.8.in |  6 ++
 2 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 0a82eaecf2cd..2cb46ca116d0 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -105,6 +105,7 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
__u16 encapflags = TUNNEL_ENCAP_FLAG_CSUM6;
__u16 encapsport = 0;
__u16 encapdport = 0;
+   __u8 metadata = 0;
int len;
__u32 fwmark = 0;
__u32 erspan_idx = 0;
@@ -178,6 +179,9 @@ get_failed:
if (greinfo[IFLA_GRE_ENCAP_SPORT])
encapsport = rta_getattr_u16(greinfo[IFLA_GRE_ENCAP_SPORT]);
 
+   if (greinfo[IFLA_GRE_COLLECT_METADATA])
+   metadata = 1;
+
if (greinfo[IFLA_GRE_ENCAP_DPORT])
encapdport = rta_getattr_u16(greinfo[IFLA_GRE_ENCAP_DPORT]);
 
@@ -355,6 +359,8 @@ get_failed:
encapflags |= TUNNEL_ENCAP_FLAG_REMCSUM;
} else if (strcmp(*argv, "noencap-remcsum") == 0) {
encapflags &= ~TUNNEL_ENCAP_FLAG_REMCSUM;
+   } else if (strcmp(*argv, "external") == 0) {
+   metadata = 1;
} else if (strcmp(*argv, "fwmark") == 0) {
NEXT_ARG();
if (strcmp(*argv, "inherit") == 0) {
@@ -388,26 +394,30 @@ get_failed:
argc--; argv++;
}
 
-   addattr32(n, 1024, IFLA_GRE_IKEY, ikey);
-   addattr32(n, 1024, IFLA_GRE_OKEY, okey);
-   addattr_l(n, 1024, IFLA_GRE_IFLAGS, &iflags, 2);
-   addattr_l(n, 1024, IFLA_GRE_OFLAGS, &oflags, 2);
-   addattr_l(n, 1024, IFLA_GRE_LOCAL, &laddr, sizeof(laddr));
-   addattr_l(n, 1024, IFLA_GRE_REMOTE, &raddr, sizeof(raddr));
-   if (link)
-   addattr32(n, 1024, IFLA_GRE_LINK, link);
-   addattr_l(n, 1024, IFLA_GRE_TTL, &hop_limit, 1);
-   addattr_l(n, 1024, IFLA_GRE_ENCAP_LIMIT, &encap_limit, 1);
-   addattr_l(n, 1024, IFLA_GRE_FLOWINFO, &flowinfo, 4);
-   addattr32(n, 1024, IFLA_GRE_FLAGS, flags);
-   addattr32(n, 1024, IFLA_GRE_FWMARK, fwmark);
-   if (erspan_idx != 0)
-   addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
-
-   addattr16(n, 1024, IFLA_GRE_ENCAP_TYPE, encaptype);
-   addattr16(n, 1024, IFLA_GRE_ENCAP_FLAGS, encapflags);
-   addattr16(n, 1024, IFLA_GRE_ENCAP_SPORT, htons(encapsport));
-   addattr16(n, 1024, IFLA_GRE_ENCAP_DPORT, htons(encapdport));
+   if (!metadata) {
+   addattr32(n, 1024, IFLA_GRE_IKEY, ikey);
+   addattr32(n, 1024, IFLA_GRE_OKEY, okey);
+   addattr_l(n, 1024, IFLA_GRE_IFLAGS, &iflags, 2);
+   addattr_l(n, 1024, IFLA_GRE_OFLAGS, &oflags, 2);
+   addattr_l(n, 1024, IFLA_GRE_LOCAL, &laddr, sizeof(laddr));
+   addattr_l(n, 1024, IFLA_GRE_REMOTE, &raddr, sizeof(raddr));
+   if (link)
+   addattr32(n, 1024, IFLA_GRE_LINK, link);
+   addattr_l(n, 1024, IFLA_GRE_TTL, &hop_limit, 1);
+   addattr_l(n, 1024, IFLA_GRE_ENCAP_LIMIT, &encap_limit, 1);
+   addattr_l(n, 1024, IFLA_GRE_FLOWINFO, &flowinfo, 4);
+   addattr32(n, 1024, IFLA_GRE_FLAGS, flags);
+   addattr32(n, 1024, IFLA_GRE_FWMARK, fwmark);
+   if (erspan_idx != 0)
+   addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
+
+   addattr16(n, 1024, IFLA_GRE_ENCAP_TYPE, encaptype);
+   addattr16(n, 1024, IFLA_GRE_ENCAP_FLAGS, encapflags);
+   addattr16(n, 1024, IFLA_GRE_ENCAP_SPORT, htons(encapsport));
+   addattr16(n, 1024, IFLA_GRE_ENCAP_DPORT, htons(encapdport));
+   } else {
+   addattr_l(n, 1024, IFLA_GRE_COLLECT_METADATA, NULL, 0);
+   }
 
return 0;
 }
@@ -426,6 +436,11 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
if (!tb)
return;
 
+   if (tb[IFLA_GRE_COLLECT_METADATA]) {
+   print_bool(PRINT_ANY, "collect_metadata", "external", true);
+   return;
+   }
+
if (tb[IFLA_GRE_FLAGS])
flags = rta_getattr_u32(tb[IFLA_GRE_FLAGS]);
 
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index a6a10e577b1f..c9b9bb7b2a4e 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -755,6 +755,8 @@ the following additional arguments are supported:
 .BI "dscp inherit"
 ] [
 .BI dev " PHYS_DEV "
+] [
+.RB [ no ] external
 ]
 
 .in +8
@@ -833,6 +835,10 @@ or
 .IR 00 ".." ff
 when tunneling non-IP packets. The default value is 

Re: [PATCH net-next] openvswitch: do not propagate headroom updates to internal port

2017-12-01 Thread Pravin Shelar
On Thu, Nov 30, 2017 at 6:35 AM, Paolo Abeni  wrote:
> After commit 3a927bc7cf9d ("ovs: propagate per dp max headroom to
> all vports") the need_headroom for the internal vport is updated
> according to the max needed headroom in its datapath.
>
> That avoids the pskb_expand_head() costs when sending/forwarding
> packets towards tunnel devices, at least for some scenarios.
>
> We still require such copy when using the ovs-preferred configuration
> for vxlan tunnels:
>
> br_int
>   /   \
> tap  vxlan
>(remote_ip:X)
>
> br_phy
>  \
> NIC
>
> where the route towards the IP 'X' is via 'br_phy'.
>
> When forwarding traffic from the tap towards the vxlan device, we
> will call pskb_expand_head() in vxlan_build_skb() because
> br-phy->needed_headroom is equal to tun->needed_headroom.
>
> With this change we avoid updating the internal vport needed_headroom,
> so that in the above scenario no head copy is needed, giving 5%
> performance improvement in UDP throughput test.
>
> As a trade-off, packets sent from the internal port towards a tunnel
> device will now experience the head copy overhead. The rationale is
> that the latter use-case is less relevant performance-wise.
>
> Signed-off-by: Paolo Abeni 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCH net-next 1/5] libbpf: add ability to guess program type based on section name

2017-12-01 Thread Jakub Kicinski
On Fri, 1 Dec 2017 10:22:57 +, Quentin Monnet wrote:
> Thanks Roman!
> One comment in-line.
> 
> 2017-11-30 13:42 UTC+ ~ Roman Gushchin 
> > The bpf_prog_load() function will guess program type if it's not
> > specified explicitly. This functionality will be used to implement
> > loading of different programs without asking a user to specify
> > the program type. In first order it will be used by bpftool.
> > 
> > Signed-off-by: Roman Gushchin 
> > Cc: Alexei Starovoitov 
> > Cc: Daniel Borkmann 
> > Cc: Jakub Kicinski 
> > ---
> >  tools/lib/bpf/libbpf.c | 47 +++
> >  1 file changed, 47 insertions(+)
> > 
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 5aa45f89da93..9f2410beaa18 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -1721,6 +1721,41 @@ BPF_PROG_TYPE_FNS(tracepoint, BPF_PROG_TYPE_TRACEPOINT);
> >  BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
> >  BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
> >  
> > +static enum bpf_prog_type bpf_program__guess_type(struct bpf_program *prog)
> > +{
> > +   if (!prog->section_name)
> > +   goto err;
> > +
> > +   if (strncmp(prog->section_name, "socket", 6) == 0)
> > +   return BPF_PROG_TYPE_SOCKET_FILTER;
> > +   if (strncmp(prog->section_name, "kprobe/", 7) == 0)
> > +   return BPF_PROG_TYPE_KPROBE;
> > +   if (strncmp(prog->section_name, "kretprobe/", 10) == 0)
> > +   return BPF_PROG_TYPE_KPROBE;
> > +   if (strncmp(prog->section_name, "tracepoint/", 11) == 0)
> > +   return BPF_PROG_TYPE_TRACEPOINT;
> > +   if (strncmp(prog->section_name, "xdp", 3) == 0)
> > +   return BPF_PROG_TYPE_XDP;
> > +   if (strncmp(prog->section_name, "perf_event", 10) == 0)
> > +   return BPF_PROG_TYPE_PERF_EVENT;
> > +   if (strncmp(prog->section_name, "cgroup/skb", 10) == 0)
> > +   return BPF_PROG_TYPE_CGROUP_SKB;
> > +   if (strncmp(prog->section_name, "cgroup/sock", 11) == 0)
> > +   return BPF_PROG_TYPE_CGROUP_SOCK;
> > +   if (strncmp(prog->section_name, "cgroup/dev", 10) == 0)
> > +   return BPF_PROG_TYPE_CGROUP_DEVICE;
> > +   if (strncmp(prog->section_name, "sockops", 7) == 0)
> > +   return BPF_PROG_TYPE_SOCK_OPS;
> > +   if (strncmp(prog->section_name, "sk_skb", 6) == 0)
> > +   return BPF_PROG_TYPE_SK_SKB;  
> 
> I do not really like these hard-coded lengths, maybe we could work out
> something nicer with a bit of pre-processing work? Perhaps something like:
> 
> #define SOCKET_FILTER_SEC_PREFIX "socket"
> #define KPROBE_SEC_PREFIX "kprobe/"
> […]
> 
> #define TRY_TYPE(string, __TYPE)  \
>   do {\
>   if (!strncmp(string, __TYPE ## _SEC_PREFIX, \
>sizeof(__TYPE ## _SEC_PREFIX)))\
>   return BPF_PROG_TYPE_ ## __TYPE;\
>   } while(0);

I like the suggestion, but I think return and goto statements hiding
inside macros are slightly frowned upon in netdev.  Perhaps just
a macro that wraps the strncmp() with sizeof would be enough?  Without
the return inside?
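
For illustration, such a wrapper might look roughly like the sketch below.
The names SEC_HAS_PREFIX, prog_type_model and guess_type_model are
hypothetical and not part of libbpf; the enum is a reduced stand-in for
enum bpf_prog_type. Note the "- 1": sizeof() on a string literal counts the
trailing NUL, so comparing sizeof(prefix) bytes would only match when the
section name is exactly the prefix.

```c
#include <string.h>

/*
 * True if string s starts with the string literal prefix.  sizeof() on a
 * literal includes the trailing NUL, hence the "- 1".
 */
#define SEC_HAS_PREFIX(s, prefix) \
	(strncmp((s), (prefix), sizeof(prefix) - 1) == 0)

/* Hypothetical stand-ins for a few enum bpf_prog_type values. */
enum prog_type_model {
	PROG_TYPE_UNSPEC,
	PROG_TYPE_SOCKET_FILTER,
	PROG_TYPE_KPROBE,
	PROG_TYPE_XDP,
};

/* The control flow stays visible: the macro only does the comparison. */
enum prog_type_model guess_type_model(const char *section_name)
{
	if (!section_name)
		return PROG_TYPE_UNSPEC;

	if (SEC_HAS_PREFIX(section_name, "socket"))
		return PROG_TYPE_SOCKET_FILTER;
	if (SEC_HAS_PREFIX(section_name, "kprobe/"))
		return PROG_TYPE_KPROBE;
	if (SEC_HAS_PREFIX(section_name, "kretprobe/"))
		return PROG_TYPE_KPROBE;
	if (SEC_HAS_PREFIX(section_name, "xdp"))
		return PROG_TYPE_XDP;

	return PROG_TYPE_UNSPEC;
}
```

This keeps the hard-coded lengths out of the callers while leaving every
return statement in plain sight.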

> static enum bpf_prog_type bpf_program__guess_type(struct bpf_program *prog)
> {
>   if (!prog->section_name)
>   goto err;
> 
>   TRY_TYPE(prog->section_name, SOCKET_FILTER);
>   TRY_TYPE(prog->section_name, KPROBE);
>   […]
> 
> err:
>   pr_warning("…",
>  prog->section_name);
> 
>   return BPF_PROG_TYPE_UNSPEC;
> }


Re: [PATCH net-next resubmit 2/2] net: phy: remove generic settings for callbacks config_aneg and read_status from drivers

2017-12-01 Thread Heiner Kallweit
On 01.12.2017 at 21:42, David Miller wrote:
> From: Heiner Kallweit 
> Date: Thu, 30 Nov 2017 23:47:52 +0100
> 
>> Remove generic settings for callbacks config_aneg and read_status
>> from drivers.
>>
When re-testing I just figured out that in drivers/net/phy/broadcom.c
I mistakenly removed three lines too many.
Do you prefer a fixed version of the patch or just a patch with the
fix?

Sorry, Heiner

>> Signed-off-by: Heiner Kallweit 
>> Reviewed-by: Florian Fainelli 
> 
> Applied.
> 



Re: [PATCH v2 net-next 3/4] inet: Add a 2nd listener hashtable (port+addr)

2017-12-01 Thread Eric Dumazet
On Fri, 2017-12-01 at 12:52 -0800, Martin KaFai Lau wrote:
> The current listener hashtable is hashed by port only.
> When a process is listening at many IP addresses with the same port
> (e.g.
> [IP1]:443, [IP2]:443... [IPN]:443), the inet[6]_lookup_listener()
> performance is degraded to a linked list.  It is prone to syn attack.
> 
> UDP had a similar issue and a second hashtable was added to resolve
> it.
> 
> This patch adds a second hashtable for the listener's sockets.
> The second hashtable is hashed by port and address.
> 
> It cannot reuse the existing skc_portaddr_node which is shared
> with skc_bind_node.  TCP listener needs to use skc_bind_node.
> Instead, this patch adds a hlist_node 'icsk_listen_portaddr_node' to
> the inet_connection_sock which the listener (like TCP) also belongs
> to.
> 
> The new portaddr hashtable may need two lookups (First by IP:PORT.
> Second by INADDR_ANY:PORT if the IP:PORT is not found).   Hence,
> it implements a similar cut off as UDP such that it will only consult
> the
> new portaddr hashtable if the current port-only hashtable has >10
> sk in the link-list.
> 
> lhash2 and lhash2_mask are added to 'struct inet_hashinfo'.  I take
> this chance to plug a 4 bytes hole.  It is done by first moving
> the existing bind_bucket_cachep up and then add the new
> (int lhash2_mask, *lhash2) after the existing bhash_size.
> 
> Signed-off-by: Martin KaFai Lau 


Nice work, thanks Martin !

Reviewed-by: Eric Dumazet 




Re: [PATCH net-next 1/5] rhashtable: Don't reset walker table in rhashtable_walk_start

2017-12-01 Thread Herbert Xu
On Thu, Nov 30, 2017 at 04:03:01PM -0800, Tom Herbert wrote:
> Remove the code that resets the walker table. The walker table should
> only be initialized in the walk init function or when a future table is
> encountered. If the walker table is NULL this is the indication that
> the walk has completed and this information can be used to break a
> multi-call walk in the table (e.g. successive calls to netlink_dump
> that are dumping elements of an rhashtable).
> 
> This also allows us to change rhashtable_walk_start to return void
> since the only error it was returning was -EAGAIN for a table change.
> This patch changes all the callers of rhashtable_walk_start to expect
> void which eliminates logic needed to check the return value for a
> rare condition. Note that -EAGAIN will be returned in a call
> to rhashtable_walk_next which seems to always follow the start
> of the walk so there should be no behavioral change in doing this.
> 
> Signed-off-by: Tom Herbert 

Doesn't this mean that if a walk encounters a rehash you may end up
missing half or more of the hash table?

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH net-next v2 7/8] netdevsim: add SR-IOV functionality

2017-12-01 Thread Jakub Kicinski
On Fri, 1 Dec 2017 22:58:29 +0100, Phil Sutter wrote:
> > > > > > +   ret = count;
> > > > > > +exit_unlock:
> > > > > > +   rtnl_unlock();
> > > > > > +
> > > > > > +   return ret;
> > > > > > +}  
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > +static void nsim_free(struct net_device *dev)
> > > > > > +{
> > > > > > +   struct netdevsim *ns = netdev_priv(dev);
> > > > > > +
> > > > > > > +   device_unregister(&ns->dev);
> > > > > >  }  
> > > > > 
> > > > > Shouldn't this also kfree(ns->vfconfigs)?
> > > > 
> > > > It's in uninit, I will move it to release.
> > > 
> > > Oh, I missed that. If you're certain this won't lead to memleaks, no
> > > objection from my side. :)  
> > 
> > OK, I will respin v3 with the free moved :)  
> 
> So it did leak? I'm glad the traffic I caused wasn't completely
> pointless then. :)

There is a window where it could've been re-enabled and that
would leak, yes.  Thanks for catching it :)


Re: [RFC PATCH] net_sched: bulk free tcf_block

2017-12-01 Thread Cong Wang
On Fri, Dec 1, 2017 at 3:05 AM, Paolo Abeni  wrote:
>
> Thank you for the feedback.
>
> I tested your patch and in the above scenario I measure:
>
> real    0m0.017s
> user    0m0.000s
> sys     0m0.017s
>
> so it apparently works well for this case.

Thanks a lot for testing it! I will test it further. If it goes well I will
send a formal patch with your Tested-by unless you object.


>
> We could still have a storm of rtnl lock/unlock operations while
> deleting a large tc tree with a lot of filters, and I think we can reduce
> them with bulk free, eventually applying it to filters, too.
>
> That will also reduce the pressure on the rtnl lock when e.g. OVS H/W
> offload pushes a lot of rules/sec.
>
> WDYT?
>

Why is this specific to tc filters? From what you are saying, we need to
batch all TC operations (qdisc, filter and action) rather than just filters?

In the short term, I think batching rtnl lock/unlock is a good optimization,
so I have no objection. In the long term, I think we need to revise the RTNL
lock and probably move it down to each layer, but clearly that requires
much more work.

Thanks.


Re: [PATCH net-next v2 7/8] netdevsim: add SR-IOV functionality

2017-12-01 Thread Phil Sutter
On Fri, Dec 01, 2017 at 01:45:09PM -0800, Jakub Kicinski wrote:
> On Fri, 1 Dec 2017 22:36:52 +0100, Phil Sutter wrote:
> > On Fri, Dec 01, 2017 at 12:14:07PM -0800, Jakub Kicinski wrote:
> > > On Fri, 1 Dec 2017 14:43:06 +0100, Phil Sutter wrote:  
> > > > On Thu, Nov 30, 2017 at 05:35:39PM -0800, Jakub Kicinski wrote:
> > > > [...]  
> > > > > +static int nsim_vfs_enable(struct netdevsim *ns, unsigned int num_vfs)
> > > > > +{
> > > > > + ns->vfconfigs = kcalloc(num_vfs, sizeof(struct nsim_vf_config),
> > > > > + GFP_KERNEL);
> > > > > + if (!ns->vfconfigs)
> > > > > + return -ENOMEM;
> > > > > + ns->num_vfs = num_vfs;
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +static void nsim_vfs_disable(struct netdevsim *ns)
> > > > > +{
> > > > > + kfree(ns->vfconfigs);
> > > > > + ns->vfconfigs = NULL;
> > > > > + ns->num_vfs = 0;
> > > > > +}
> > > > 
> > > > Why not something like:
> > > > 
> > > > | static int nsim_vfs_set(struct netdevsim *ns, unsigned int num_vfs)
> > > > | {
> > > > |   void *ptr = krealloc(ns->vfconfigs,
> > > > |num_vfs * sizeof(struct nsim_vf_config),
> > > > |GFP_KERNEL);
> > > > | 
> > > > |   if (!ptr)
> > > > |   return -ENOMEM;
> > > > | 
> > > > |   ns->vfconfigs = ptr;
> > > > |   ns->num_vfs = num_vfs;
> > > > |   return 0;
> > > > | }  
> > > 
> > > Um.  It either frees or allocates, never reallocates so I felt realloc
> > > is misleading.  ZERO_SIZE_PTR is less clearly a NULL than a NULL.  I
> > > will have to specify __GFP_ZERO.  It's not a calloc so there could be
> > > potentially some overflows?  
> > 
> > I don't understand: How can overflows happen if I use malloc() instead
> > of calloc()?
> 
> The multiplication may overflow.  That's why we have kmalloc_array().
> Note this explicit check in kmalloc_array() (which is also called by
> kcalloc):
> 
>   if (size != 0 && n > SIZE_MAX / size)
>   return NULL;
> 
> Where:
> 
> #define SIZE_MAX  (~(size_t)0)

Ah, I see. Thanks for educating me on this!

> > > > > + ret = count;
> > > > > +exit_unlock:
> > > > > + rtnl_unlock();
> > > > > +
> > > > > + return ret;
> > > > > +}
> > > > 
> > > > [...]
> > > >   
> > > > > +static void nsim_free(struct net_device *dev)
> > > > > +{
> > > > > + struct netdevsim *ns = netdev_priv(dev);
> > > > > +
> > > > > + device_unregister(&ns->dev);
> > > > >  }
> > > > 
> > > > Shouldn't this also kfree(ns->vfconfigs)?  
> > > 
> > > It's in uninit, I will move it to release.  
> > 
> > Oh, I missed that. If you're certain this won't lead to memleaks, no
> > objection from my side. :)
> 
> OK, I will respin v3 with the free moved :)

So it did leak? I'm glad the traffic I caused wasn't completely
pointless then. :)

Thanks, Phil


Re: [Patch net-next] act_mirred: use tcfm_dev in tcf_mirred_get_dev()

2017-12-01 Thread Cong Wang
On Fri, Dec 1, 2017 at 9:56 AM, Jiri Pirko  wrote:
>
> Isn't this here so the user may specify an ifindex of a netdev which is not yet
> present on the system (not sure how much sense that would make though...)?

How is this even possible? If an ifindex is not present, we return ENODEV:

if (parm->ifindex) {
dev = __dev_get_by_index(net, parm->ifindex);
if (dev == NULL) {
if (exists)
tcf_idr_release(*a, bind);
return -ENODEV;
}


Re: [PATCH net-next v2 7/8] netdevsim: add SR-IOV functionality

2017-12-01 Thread Jakub Kicinski
On Fri, 1 Dec 2017 22:36:52 +0100, Phil Sutter wrote:
> On Fri, Dec 01, 2017 at 12:14:07PM -0800, Jakub Kicinski wrote:
> > On Fri, 1 Dec 2017 14:43:06 +0100, Phil Sutter wrote:  
> > > On Thu, Nov 30, 2017 at 05:35:39PM -0800, Jakub Kicinski wrote:
> > > [...]  
> > > > +static int nsim_vfs_enable(struct netdevsim *ns, unsigned int num_vfs)
> > > > +{
> > > > +   ns->vfconfigs = kcalloc(num_vfs, sizeof(struct nsim_vf_config),
> > > > +   GFP_KERNEL);
> > > > +   if (!ns->vfconfigs)
> > > > +   return -ENOMEM;
> > > > +   ns->num_vfs = num_vfs;
> > > > +
> > > > +   return 0;
> > > > +}
> > > > +
> > > > +static void nsim_vfs_disable(struct netdevsim *ns)
> > > > +{
> > > > +   kfree(ns->vfconfigs);
> > > > +   ns->vfconfigs = NULL;
> > > > +   ns->num_vfs = 0;
> > > > +}
> > > 
> > > Why not something like:
> > > 
> > > | static int nsim_vfs_set(struct netdevsim *ns, unsigned int num_vfs)
> > > | {
> > > | void *ptr = krealloc(ns->vfconfigs,
> > > |  num_vfs * sizeof(struct nsim_vf_config),
> > > |  GFP_KERNEL);
> > > | 
> > > | if (!ptr)
> > > | return -ENOMEM;
> > > | 
> > > | ns->vfconfigs = ptr;
> > > | ns->num_vfs = num_vfs;
> > > | return 0;
> > > | }  
> > 
> > Um.  It either frees or allocates, never reallocates so I felt realloc
> > is misleading.  ZERO_SIZE_PTR is less clearly a NULL than a NULL.  I
> > will have to specify __GFP_ZERO.  It's not a calloc so there could be
> > potentially some overflows?  
> 
> I don't understand: How can overflows happen if I use malloc() instead
> of calloc()?

The multiplication may overflow.  That's why we have kmalloc_array().
Note this explicit check in kmalloc_array() (which is also called by
kcalloc):

if (size != 0 && n > SIZE_MAX / size)
return NULL;

Where:

#define SIZE_MAX	(~(size_t)0)
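
As a rough userspace illustration of why that check matters (a sketch only:
alloc_array_checked and mul_would_overflow are hypothetical names, and
malloc() stands in for the kernel allocator), an unchecked n * size can wrap
around to a small value and yield a buffer far smaller than the caller
expects:

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if n * size does not fit in size_t, 0 otherwise. */
int mul_would_overflow(size_t n, size_t size)
{
	return size != 0 && n > SIZE_MAX / size;
}

/*
 * Userspace model of the kmalloc_array() check: refuse the request when
 * n * size would wrap around instead of allocating a too-small buffer.
 */
void *alloc_array_checked(size_t n, size_t size)
{
	if (mul_would_overflow(n, size))
		return NULL;
	return malloc(n * size);
}
```

For example, with n = SIZE_MAX / 2 + 1 and size = 2 the plain product wraps
to 0, while the checked helper returns NULL.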

> > > > +   ret = count;
> > > > +exit_unlock:
> > > > +   rtnl_unlock();
> > > > +
> > > > +   return ret;
> > > > +}
> > > 
> > > [...]
> > >   
> > > > +static void nsim_free(struct net_device *dev)
> > > > +{
> > > > +   struct netdevsim *ns = netdev_priv(dev);
> > > > +
> > > > +   device_unregister(&ns->dev);
> > > >  }
> > > 
> > > Shouldn't this also kfree(ns->vfconfigs)?  
> > 
> > It's in uninit, I will move it to release.  
> 
> Oh, I missed that. If you're certain this won't lead to memleaks, no
> objection from my side. :)

OK, I will respin v3 with the free moved :)


Re: [PATCH net-next v2 8/8] net: dummy: remove fake SR-IOV functionality

2017-12-01 Thread Phil Sutter
On Fri, Dec 01, 2017 at 12:19:52PM -0800, Jakub Kicinski wrote:
> On Fri, 1 Dec 2017 14:46:34 +0100, Phil Sutter wrote:
> > On Thu, Nov 30, 2017 at 05:35:40PM -0800, Jakub Kicinski wrote:
> > > netdevsim driver seems like a better place for fake SR-IOV
> > > functionality.  Remove the code previously added to dummy.
> > > 
> > > Signed-off-by: Jakub Kicinski 
> > > Reviewed-by: Quentin Monnet   
> > 
> > Acked-by: Phil Sutter 
> 
> Thanks!
> 
> Did you have an opportunity to run your tests against this?  I didn't
> find anything that uses dummy's SR-IOV in selftests.

In fact, at Red Hat nobody uses dummy for iproute SR-IOV testing yet
(which was the motivation for it in the first place). Hence why I didn't
see a problem with moving it from dummy over to something else.

Hopefully upstream iproute will at some point contain a testsuite which
makes use of this, but sadly that's still wishful thinking. :(

Cheers, Phil


Re: [PATCH net-next v2 7/8] netdevsim: add SR-IOV functionality

2017-12-01 Thread Phil Sutter
On Fri, Dec 01, 2017 at 12:14:07PM -0800, Jakub Kicinski wrote:
> On Fri, 1 Dec 2017 14:43:06 +0100, Phil Sutter wrote:
> > On Thu, Nov 30, 2017 at 05:35:39PM -0800, Jakub Kicinski wrote:
> > [...]
> > > +static int nsim_vfs_enable(struct netdevsim *ns, unsigned int num_vfs)
> > > +{
> > > + ns->vfconfigs = kcalloc(num_vfs, sizeof(struct nsim_vf_config),
> > > + GFP_KERNEL);
> > > + if (!ns->vfconfigs)
> > > + return -ENOMEM;
> > > + ns->num_vfs = num_vfs;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static void nsim_vfs_disable(struct netdevsim *ns)
> > > +{
> > > + kfree(ns->vfconfigs);
> > > + ns->vfconfigs = NULL;
> > > + ns->num_vfs = 0;
> > > +}  
> > 
> > Why not something like:
> > 
> > | static int nsim_vfs_set(struct netdevsim *ns, unsigned int num_vfs)
> > | {
> > |   void *ptr = krealloc(ns->vfconfigs,
> > |num_vfs * sizeof(struct nsim_vf_config),
> > |GFP_KERNEL);
> > | 
> > |   if (!ptr)
> > |   return -ENOMEM;
> > | 
> > |   ns->vfconfigs = ptr;
> > |   ns->num_vfs = num_vfs;
> > |   return 0;
> > | }
> 
> Um.  It either frees or allocates, never reallocates so I felt realloc
> is misleading.  ZERO_SIZE_PTR is less clearly a NULL than a NULL.  I
> will have to specify __GFP_ZERO.  It's not a calloc so there could be
> potentially some overflows?

I don't understand: How can overflows happen if I use malloc() instead
of calloc()?
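
For context on the overflow remark: kcalloc() checks the count * size
multiplication before allocating, while an open-coded "num_vfs * sizeof(...)"
passed to a malloc-style call can wrap around silently. A minimal userspace
sketch of that check, not kernel code; checked_array_bytes() is a
hypothetical helper name used only for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute n * size and report whether the product
 * fits in size_t.  calloc()-style allocators perform an equivalent check;
 * an open-coded "n * size" handed to a malloc()-style call does not.
 */
static bool checked_array_bytes(size_t n, size_t size, size_t *bytes)
{
	if (size != 0 && n > SIZE_MAX / size)
		return false;	/* product would wrap around */
	*bytes = n * size;
	return true;
}
```

With an unsigned int count and a small element size the wrap is unlikely on
64-bit, which may be why the concern was phrased as "potentially".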

> > > +static ssize_t
> > > +nsim_numvfs_store(struct device *dev, struct device_attribute *attr,
> > > +   const char *buf, size_t count)
> > > +{
> > > + struct netdevsim *ns = to_nsim(dev);
> > > + unsigned int num_vfs;
> > > + int ret;
> > > +
> > > > + ret = kstrtouint(buf, 0, &num_vfs);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + rtnl_lock();
> > > + if (ns->num_vfs == num_vfs)
> > > + goto exit_good;  
> > 
> > Then replace this:
> > 
> > > + if (ns->num_vfs && num_vfs) {
> > > + ret = -EBUSY;
> > > + goto exit_unlock;
> > > + }
> > > +
> > > + if (num_vfs) {
> > > + ret = nsim_vfs_enable(ns, num_vfs);
> > > + if (ret)
> > > + goto exit_unlock;
> > > + } else {
> > > + nsim_vfs_disable(ns);
> > > + }  
> > 
> > with just:
> > 
> > |   nsim_vfs_set(ns, num_vfs);
> 
> I'm trying to mirror the PCI subsystem behaviour here, which only
> allows enable or disable, not increase.  I felt we should follow how
> real devices behave:
> 
>   /* enable VFs */
>   if (pdev->sriov->num_VFs) {
>   dev_warn(&pdev->dev, "%d VFs already enabled. Disable before enabling %d VFs\n",
>pdev->sriov->num_VFs, num_vfs);
>   return -EBUSY;
>   }
> 
> So IOW this is intentional.

Ah, I see. Yes, then it makes sense! Keeping this virtual VF
functionality as close to real ones as possible is certainly feasible.
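
The enable/disable-only rule agreed on above can be sketched in userspace C.
This is an illustration under assumptions, not the netdevsim code: the
struct and function names with a _sketch suffix are hypothetical, and
calloc()/free() stand in for kcalloc()/kfree().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the structures in the patch. */
struct vf_config_sketch {
	int dummy;
};

struct nsim_sketch {
	struct vf_config_sketch *vfconfigs;
	unsigned int num_vfs;
};

/*
 * Mirror the rule quoted above: 0 -> N enables, N -> 0 disables,
 * N -> N is a no-op, and N -> M (both non-zero) is refused with -EBUSY.
 */
static int nsim_sketch_set_vfs(struct nsim_sketch *ns, unsigned int num_vfs)
{
	if (ns->num_vfs == num_vfs)
		return 0;
	if (ns->num_vfs && num_vfs)
		return -EBUSY;

	if (num_vfs) {
		ns->vfconfigs = calloc(num_vfs, sizeof(*ns->vfconfigs));
		if (!ns->vfconfigs)
			return -ENOMEM;
	} else {
		free(ns->vfconfigs);
		ns->vfconfigs = NULL;
	}
	ns->num_vfs = num_vfs;
	return 0;
}
```

A caller that wants to go from 4 to 8 VFs has to write 0 first and then 8,
which is the behaviour of the PCI snippet quoted above.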

> > > + ret = count;
> > > +exit_unlock:
> > > + rtnl_unlock();
> > > +
> > > + return ret;
> > > +}  
> > 
> > [...]
> > 
> > > +static void nsim_free(struct net_device *dev)
> > > +{
> > > + struct netdevsim *ns = netdev_priv(dev);
> > > +
> > > > + device_unregister(&ns->dev);
> > >  }  
> > 
> > Shouldn't this also kfree(ns->vfconfigs)?
> 
> It's in uninit, I will move it to release.

Oh, I missed that. If you're certain this won't lead to memleaks, no
objection from my side. :)

Cheers, Phil


Re: [PATCH net-next 00/11] net: ethernet: ti: cpsw/ale clean up and optimization

2017-12-01 Thread David Miller
From: Grygorii Strashko 
Date: Thu, 30 Nov 2017 18:21:09 -0600

> This is set of non critical clean ups and optimizations for TI
> CPSW and ALE drivers.
> 
> Rebased on top on net-next.

Series applied, thank you.


Re: netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'

2017-12-01 Thread Daniel Borkmann
On 12/01/2017 04:48 AM, Al Viro wrote:
> On Fri, Dec 01, 2017 at 01:33:04AM +, Al Viro wrote:
> 
>> Use of file descriptors should be limited to "got a number from userland,
>> convert to struct file *" on the way in and "install struct file * into
>> descriptor table and return the descriptor to userland" on the way out.
>> And the latter - *ONLY* after the last possible point of failure.  Once
>> a file reference is inserted into descriptor table, that's it - you
>> can't undo that.
>>
>> The only way to use bpf_obj_get_user() is to pass its return value to
>> userland.  As return value of syscall - not even put_user() (for that
>> you'd need to reserve the descriptor, copy it to userland and only
>> then attach struct file * to it).
>>
>> The whole approach stinks - what it needs is something that would
>> take struct filename * and return struct bpf_prog * or struct file *
>> reference.  With bpf_obj_get_user() and this thing implemented
>> via that.

Agree, the "fix" is completely buggy due to fd being exposed to user
space during that period of time ...

>> I'm looking into that thing...
> 
> What it tries to pull off is something not far from
> 
> static struct bpf_prog *__get_prog(struct inode *inode, enum bpf_prog_type type)
> {
>   struct bpf_prog *prog;
>   int err = inode_permission(inode, FMODE_READ | FMODE_WRITE);
>   if (err)
>       return ERR_PTR(err);
> 
>   if (inode->i_op == &bpf_map_iops)
>       return ERR_PTR(-EINVAL);
> 
>   if (inode->i_op != &bpf_prog_iops)
>       return ERR_PTR(-EACCES);
> 
>   prog = inode->i_private;
>   err = security_bpf_prog(prog);
>   if (err < 0)
>       return ERR_PTR(err);
> 
>   if (!bpf_prog_get_ok(prog, &type, false))
>       return ERR_PTR(-EINVAL);
> 
>   return bpf_prog_inc(prog);
> }
> 
> struct bpf_prog *get_prog_path_type(const char *name, enum bpf_prog_type type)
> {
>   struct path path;
>   struct bpf_prog *prog;
>   int err = kern_path(name, LOOKUP_FOLLOW, &path);
>   if (err)
>       return ERR_PTR(err);
>   prog = __get_prog(d_backing_inode(path.dentry), type);
>   if (!IS_ERR(prog))
>       touch_atime(&path);
>   path_put(&path);
>   return prog;
> }
> 
> static int __bpf_mt_check_path(const char *path, struct bpf_prog **ret)
> {
>   *ret = get_prog_path_type(path, BPF_PROG_TYPE_SOCKET_FILTER);
> return PTR_ERR_OR_ZERO(*ret);
> }
> 
> That skips all tracepoint random shite (pardon the triple redundance) and makes
> a somewhat arbitrary change for touch_atime() logics.  And, of course, it is
> not even compile-tested.
> 
> Something similar to get_prog_path_type() above might make for a usable
> primitive, IMO...

The above looks good to me!


Re: [PATCH net-next V2 1/2] net-next: use five-tuple hash for sk_txhash

2017-12-01 Thread Tom Herbert
On Fri, Dec 1, 2017 at 1:00 PM, Shaohua Li  wrote:
> From: Shaohua Li 
>
> We use sk_txhash to calculate the flowlabel, but sk_txhash isn't
> always available, for example in inet_timewait_sock. This is a
> problem for reset packets, which end up with a different flowlabel,
> so our router doesn't correctly close the TCP connection. We use the
> flowlabel for load balancing, and routers in the path maintain
> connection state. If the flow label changes, the packet is routed
> through a different router, and the old router never gets the reset
> packet that would close the TCP connection.
>
> Per Tom's suggestion, we switch back to a five-tuple hash, so we can
> reconstruct the correct flowlabel for the reset packet.
>
Thanks for doing this!

> In most places we already have the flowi info, so we use it directly
> to build sk_txhash. For synack, we do this after the route search. At
> that time the flowi info is ready, so we don't need to create it
> again.
>
> I don't change sk_rethink_txhash() though; it still uses a random
> hash, since the whole point is to select a different path after
> negative routing advice.
>
> Cc: Martin KaFai Lau 
> Cc: Eric Dumazet 
> Cc: Florent Fourcot 
> Cc: Cong Wang 
> Cc: Tom Herbert 
> Signed-off-by: Shaohua Li 
> ---
>  include/net/sock.h| 18 --
>  include/net/tcp.h |  2 +-
>  net/ipv4/datagram.c   |  2 +-
>  net/ipv4/syncookies.c |  4 +++-
>  net/ipv4/tcp_input.c  |  1 -
>  net/ipv4/tcp_ipv4.c   | 17 -
>  net/ipv4/tcp_output.c |  1 -
>  net/ipv6/datagram.c   |  4 +++-
>  net/ipv6/syncookies.c |  3 ++-
>  net/ipv6/tcp_ipv6.c   | 18 +-
>  10 files changed, 39 insertions(+), 31 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 79e1a2c..640db0f 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1729,22 +1729,12 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
> return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
>  }
>
> -static inline u32 net_tx_rndhash(void)
> -{
> -   u32 v = prandom_u32();
> -
> -   return v ?: 1;
> -}
> -
> -static inline void sk_set_txhash(struct sock *sk)
> -{
> -   sk->sk_txhash = net_tx_rndhash();
> -}
> -
>  static inline void sk_rethink_txhash(struct sock *sk)
>  {
> -   if (sk->sk_txhash)
> -   sk_set_txhash(sk);
> +   if (sk->sk_txhash) {
> +   u32 v = prandom_u32();
> +   sk->sk_txhash = v ?: 1;
> +   }

We'll need to add configuration about whether rethink is done at all.
Conservative approach is probably to disable it by default. That is
the default behavior of the stack is that flow label is consistent for
lifetime of a flow.

>  }
>
>  static inline struct dst_entry *
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 4e09398..a5c28be 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1840,7 +1840,7 @@ struct tcp_request_sock_ops {
>  __u16 *mss);
>  #endif
> struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
> -  const struct request_sock *req);
> +  struct request_sock *req);
> u32 (*init_seq)(const struct sk_buff *skb);
> u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb);
> int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
> diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
> index f915abf..ed9ccb7 100644
> --- a/net/ipv4/datagram.c
> +++ b/net/ipv4/datagram.c
> @@ -74,7 +74,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
> inet->inet_daddr = fl4->daddr;
> inet->inet_dport = usin->sin_port;
> sk->sk_state = TCP_ESTABLISHED;
> -   sk_set_txhash(sk);
> +   sk->sk_txhash = get_hash_from_flowi4(fl4);

Maybe keep sk_set_txhash but add an argument that gives the hash.
Hiding behind a function gives us the place to add/change logic in the
future.
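
One way to read this suggestion, sketched in userspace C. The names with a
_sketch suffix are hypothetical, and mapping a zero hash to 1 is an
assumption carried over from the removed net_tx_rndhash(), not something the
patch specifies:

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct sock. */
struct sock_sketch {
	uint32_t sk_txhash;	/* 0 means "no hash set" */
};

/*
 * Keep the setter, but let the caller supply the hash.  Mapping 0 to 1
 * is an assumption carried over from the removed net_tx_rndhash(), which
 * never returned 0.
 */
static inline void sk_set_txhash_sketch(struct sock_sketch *sk, uint32_t hash)
{
	sk->sk_txhash = hash ? hash : 1;
}
```

Call sites would then pass the flow hash instead of assigning the field
directly, so any later policy change stays in one place.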

> inet->inet_id = jiffies;
>
> sk_dst_set(sk, &rt->dst);
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index fda37f2..76f1cf6 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -335,7 +335,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
> treq->rcv_isn   = ntohl(th->seq) - 1;
> treq->snt_isn   = cookie;
> treq->ts_off= 0;
> -   treq->txhash= net_tx_rndhash();
> req->mss= mss;
> ireq->ir_num= ntohs(th->dest);
> ireq->ir_rmt_port   = th->source;
> @@ -376,6 +375,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
>  

[PATCH net-next V2 2/2] net-next: copy user configured flowlabel to reset packet

2017-12-01 Thread Shaohua Li
From: Shaohua Li 

Reset packets don't use the user-configured flowlabel; they always use
0, which makes the flowlabel inconsistent. The tw sock already records
the flowlabel info, so we can use it directly.

Cc: Martin KaFai Lau 
Cc: Eric Dumazet 
Cc: Florent Fourcot 
Cc: Cong Wang 
Cc: Tom Herbert 
Signed-off-by: Shaohua Li 
---
 net/ipv6/tcp_ipv6.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a1a5802..9b678cd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -901,6 +901,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
struct sock *sk1 = NULL;
 #endif
int oif = 0;
+   u8 tclass = 0;
+   __be32 flowlabel = 0;
 
if (th->rst)
return;
@@ -954,7 +956,21 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
trace_tcp_send_reset(sk, skb);
}
 
-   tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
+   if (sk) {
+   if (sk_fullsock(sk)) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   tclass = np->tclass;
+   flowlabel = np->flow_label & IPV6_FLOWLABEL_MASK;
+   } else {
+   struct inet_timewait_sock *tw = inet_twsk(sk);
+
+   tclass = tw->tw_tclass;
+   flowlabel = cpu_to_be32(tw->tw_flowlabel);
+   }
+   }
+   tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1,
+   tclass, flowlabel);
 
 #ifdef CONFIG_TCP_MD5SIG
 out:
-- 
2.9.5



[PATCH net-next V2 1/2] net-next: use five-tuple hash for sk_txhash

2017-12-01 Thread Shaohua Li
From: Shaohua Li 

We use sk_txhash to calculate the flowlabel, but sk_txhash isn't
always available, for example in inet_timewait_sock. This is a
problem for reset packets, which end up with a different flowlabel,
so our router doesn't correctly close the TCP connection. We use the
flowlabel for load balancing, and routers in the path maintain
connection state. If the flow label changes, the packet is routed
through a different router, and the old router never gets the reset
packet that would close the TCP connection.

Per Tom's suggestion, we switch back to a five-tuple hash, so we can
reconstruct the correct flowlabel for the reset packet.

In most places we already have the flowi info, so we use it directly
to build sk_txhash. For synack, we do this after the route search. At
that time the flowi info is ready, so we don't need to create it
again.

I don't change sk_rethink_txhash() though; it still uses a random
hash, since the whole point is to select a different path after
negative routing advice.

Cc: Martin KaFai Lau 
Cc: Eric Dumazet 
Cc: Florent Fourcot 
Cc: Cong Wang 
Cc: Tom Herbert 
Signed-off-by: Shaohua Li 
---
 include/net/sock.h| 18 --
 include/net/tcp.h |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 18 +-
 10 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 79e1a2c..640db0f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1729,22 +1729,12 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
 }
 
-static inline u32 net_tx_rndhash(void)
-{
-   u32 v = prandom_u32();
-
-   return v ?: 1;
-}
-
-static inline void sk_set_txhash(struct sock *sk)
-{
-   sk->sk_txhash = net_tx_rndhash();
-}
-
 static inline void sk_rethink_txhash(struct sock *sk)
 {
-   if (sk->sk_txhash)
-   sk_set_txhash(sk);
+   if (sk->sk_txhash) {
+   u32 v = prandom_u32();
+   sk->sk_txhash = v ?: 1;
+   }
 }
 
 static inline struct dst_entry *
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4e09398..a5c28be 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1840,7 +1840,7 @@ struct tcp_request_sock_ops {
 __u16 *mss);
 #endif
struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
-  const struct request_sock *req);
+  struct request_sock *req);
u32 (*init_seq)(const struct sk_buff *skb);
u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb);
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abf..ed9ccb7 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -74,7 +74,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
inet->inet_daddr = fl4->daddr;
inet->inet_dport = usin->sin_port;
sk->sk_state = TCP_ESTABLISHED;
-   sk_set_txhash(sk);
+   sk->sk_txhash = get_hash_from_flowi4(fl4);
inet->inet_id = jiffies;
 
sk_dst_set(sk, &rt->dst);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index fda37f2..76f1cf6 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -335,7 +335,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
treq->rcv_isn   = ntohl(th->seq) - 1;
treq->snt_isn   = cookie;
treq->ts_off= 0;
-   treq->txhash= net_tx_rndhash();
req->mss= mss;
ireq->ir_num= ntohs(th->dest);
ireq->ir_rmt_port   = th->source;
@@ -376,6 +375,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
   opt->srr ? opt->faddr : ireq->ir_rmt_addr,
   ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid);
security_req_classify_flow(req, flowi4_to_flowi(&fl4));
+
+   treq->txhash = get_hash_from_flowi4(&fl4);
+
rt = ip_route_output_key(sock_net(sk), &fl4);
if (IS_ERR(rt)) {
reqsk_free(req);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 734cfc8..e886c28 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6288,7 +6288,6 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
}
 
tcp_rsk(req)->snt_isn = isn;
-   tcp_rsk(req)->txhash = net_tx_rndhash();

[PATCH net-next V2 0/2] net: fix flowlabel inconsistency in reset packet

2017-12-01 Thread Shaohua Li
From: Shaohua Li 

Hi,

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options 
[mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 
43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 
7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options 
[nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload 
length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The TCP reset packet has a different flowlabel, so our router doesn't
correctly close the TCP connection. We use the flowlabel for load
balancing, and routers in the path maintain connection state. If the
flow label changes, the packet is routed through a different router, and
the old router never gets the reset packet that would close the TCP
connection.

The reason is that a normal packet gets its skb->hash from sk->sk_txhash,
which is generated randomly, and ip6_make_flowlabel then uses that hash
to create the flowlabel. The reset packet doesn't get assigned a hash, so
its flowlabel is calculated from flowi6.

The patches fix the issue.
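
To illustrate why a hash derived from the five-tuple helps here: any code
path that knows the addresses, ports and protocol can recompute the same
value, even when no full socket is left. The following is a minimal
userspace sketch under assumptions, not the kernel implementation. FNV-1a is
a stand-in for the kernel's flow hashing, all _sketch names are
hypothetical, and the real ip6_make_flowlabel() derives the label
differently; the mask below only illustrates the 20-bit label width.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key; the kernel uses struct flowi4/flowi6 instead. */
struct flow_key_sketch {
	uint8_t  saddr[16];
	uint8_t  daddr[16];
	uint16_t sport;
	uint16_t dport;
	uint8_t  proto;
};

/* FNV-1a step; a stand-in for the kernel's flow hashing, not the real thing. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

static uint32_t flow_hash_sketch(const struct flow_key_sketch *k)
{
	uint32_t h = 2166136261u;

	/* Hash field by field so struct padding never enters the hash. */
	h = fnv1a(h, k->saddr, sizeof(k->saddr));
	h = fnv1a(h, k->daddr, sizeof(k->daddr));
	h = fnv1a(h, &k->sport, sizeof(k->sport));
	h = fnv1a(h, &k->dport, sizeof(k->dport));
	h = fnv1a(h, &k->proto, sizeof(k->proto));
	return h ? h : 1;	/* keep 0 reserved for "no hash" */
}

/* An IPv6 flow label is 20 bits wide. */
static uint32_t flowlabel_sketch(uint32_t hash)
{
	return hash & 0xFFFFFu;
}
```

A random per-socket hash, by contrast, is lost once the socket is gone,
which matches the differing flowlabel on the reset packet in the capture
above.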

Thanks,
Shaohua


Shaohua Li (2):
  net-next: use five-tuple hash for sk_txhash
  net-next: copy user configured flowlabel to reset packet

 include/net/sock.h| 18 --
 include/net/tcp.h |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 36 ++--
 10 files changed, 56 insertions(+), 32 deletions(-)

-- 
2.9.5



Re: [PATCH v4 3/8] MIPS: Octeon: Add a global resource manager.

2017-12-01 Thread David Daney

On 12/01/2017 12:41 PM, Philippe Ombredanne wrote:
> David,
>
> On Fri, Dec 1, 2017 at 9:01 PM, David Daney wrote:
> > On 12/01/2017 11:49 AM, Philippe Ombredanne wrote:
> > > David, Greg,
> > >
> > > On Fri, Dec 1, 2017 at 6:42 PM, David Daney wrote:
> > > > On 11/30/2017 11:53 PM, Philippe Ombredanne wrote:
> > > > > [...]
> > > > > > --- /dev/null
> > > > > > +++ b/arch/mips/cavium-octeon/resource-mgr.c
> > > > > > @@ -0,0 +1,371 @@
> > > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > > +/*
> > > > > > + * Resource manager for Octeon.
> > > > > > + *
> > > > > > + * This file is subject to the terms and conditions of the GNU General Public
> > > > > > + * License.  See the file "COPYING" in the main directory of this archive
> > > > > > + * for more details.
> > > > > > + *
> > > > > > + * Copyright (C) 2017 Cavium, Inc.
> > > > > > + */
> > > > >
> > > > > Since you nicely included an SPDX id, you would not need the
> > > > > boilerplate anymore. e.g. these can go alright?
> > > >
> > > > They may not be strictly speaking necessary, but I don't think they hurt
> > > > anything.  Unless there is a requirement to strip out the license text, we
> > > > would stick with it as is.
> > >
> > > I think the requirement is there and that would be much better for
> > > everyone: keeping both is redundant and does not bring any value, does
> > > it? Instead it kinda removes the benefits of having the SPDX id in the
> > > first place IMHO.
> > >
> > > Furthermore, as there have been already ~12K+ files cleaned up and
> > > still over 60K files to go, it would be really nice if new files could
> > > adopt the new style: this way we will not have to revisit and repatch
> > > them in the future.
> >
> > I am happy to follow any style Greg would suggest.  There doesn't seem to be
> > much documentation about how this should be done yet.
>
> Thomas (tglx) has already submitted a first series of doc patches a
> few weeks ago. And AFAIK he might be working on posting the updates
> soon, whenever his real time clock yields a few cycles away from real
> time coding work ;)
>
> See also these discussions with Linus [1][2][3], Thomas[4] and Greg[5]
> on this and mostly related topics
>
> [1] https://lkml.org/lkml/2017/11/2/715
> [2] https://lkml.org/lkml/2017/11/25/125
> [3] https://lkml.org/lkml/2017/11/25/133
> [4] https://lkml.org/lkml/2017/11/2/805
> [5] https://lkml.org/lkml/2017/10/19/165



OK, you convinced me.

Thanks,
David



[PATCH v2 net-next 0/4] tcp: Add a 2nd listener hashtable (port+addr)

2017-12-01 Thread Martin KaFai Lau
This patch set adds a 2nd listener hashtable.  It is to resolve
the performance issue when a process is listening at many IP
addresses with the same port (e.g. [IP1]:443, [IP2]:443... [IPN]:443)

v2:
- Move the new lhash2 and lhash2_mask before the existing
  listening_hash to avoid adding another cacheline
  to inet_hashinfo (Suggested by Eric Dumazet, Thanks!)
- I take this chance to plug an existing 4 bytes hole while
  adding 'unsigned int lhash2_mask'.
- Add some comments about lhash2 in inet_hashtables.h

Martin KaFai Lau (4):
  inet: Add a count to struct inet_listen_hashbucket
  udp: Move udp[46]_portaddr_hash() to net/ip[v6].h
  inet: Add a 2nd listener hashtable (port+addr)
  tcp: Enable 2nd listener hashtable in TCP

 include/net/inet_connection_sock.h |   2 +
 include/net/inet_hashtables.h  |  29 +--
 include/net/ip.h   |   9 ++
 include/net/ipv6.h |  17 
 net/ipv4/inet_hashtables.c | 173 +++--
 net/ipv4/tcp.c |   3 +
 net/ipv4/udp.c |  22 ++---
 net/ipv6/inet6_hashtables.c|  66 ++
 net/ipv6/udp.c |  32 ++-
 9 files changed, 301 insertions(+), 52 deletions(-)

-- 
2.9.5



[PATCH v2 net-next 2/4] udp: Move udp[46]_portaddr_hash() to net/ip[v6].h

2017-12-01 Thread Martin KaFai Lau
This patch moves udp[46]_portaddr_hash()
to net/ip[v6].h.  The functions are renamed to
ipv[46]_portaddr_hash().

It will be used by a later patch which adds a second listener
hashtable hashed by the address and port.

Signed-off-by: Martin KaFai Lau 
Reviewed-by: Eric Dumazet 
---
 include/net/ip.h   |  9 +
 include/net/ipv6.h | 17 +
 net/ipv4/udp.c | 22 --
 net/ipv6/udp.c | 32 
 4 files changed, 42 insertions(+), 38 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 9896f46cbbf1..fc9bf1b1fe2c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -26,12 +26,14 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #define IPV4_MAX_PMTU  65535U  /* RFC 2675, Section 5.1 */
 
@@ -521,6 +523,13 @@ static inline unsigned int ipv4_addr_hash(__be32 ip)
return (__force unsigned int) ip;
 }
 
+static inline u32 ipv4_portaddr_hash(const struct net *net,
+__be32 saddr,
+unsigned int port)
+{
+   return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port;
+}
+
 bool ip_call_ra_chain(struct sk_buff *skb);
 
 /*
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index f73797e2fa60..25be4715578c 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define SIN6_LEN_RFC2133   24
 
@@ -673,6 +674,22 @@ static inline bool ipv6_addr_v4mapped(const struct in6_addr *a)
cpu_to_be32(0x))) == 0UL;
 }
 
+static inline u32 ipv6_portaddr_hash(const struct net *net,
+const struct in6_addr *addr6,
+unsigned int port)
+{
+   unsigned int hash, mix = net_hash_mix(net);
+
+   if (ipv6_addr_any(addr6))
+   hash = jhash_1word(0, mix);
+   else if (ipv6_addr_v4mapped(addr6))
+   hash = jhash_1word((__force u32)addr6->s6_addr32[3], mix);
+   else
+   hash = jhash2((__force u32 *)addr6->s6_addr32, 4, mix);
+
+   return hash ^ port;
+}
+
 /*
  * Check for a RFC 4843 ORCHID address
  * (Overlay Routable Cryptographic Hash Identifiers)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 36f857c87fe2..e9c0d1e1772e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -357,18 +357,12 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 }
 EXPORT_SYMBOL(udp_lib_get_port);
 
-static u32 udp4_portaddr_hash(const struct net *net, __be32 saddr,
- unsigned int port)
-{
-   return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port;
-}
-
 int udp_v4_get_port(struct sock *sk, unsigned short snum)
 {
unsigned int hash2_nulladdr =
-   udp4_portaddr_hash(sock_net(sk), htonl(INADDR_ANY), snum);
+   ipv4_portaddr_hash(sock_net(sk), htonl(INADDR_ANY), snum);
unsigned int hash2_partial =
-   udp4_portaddr_hash(sock_net(sk), inet_sk(sk)->inet_rcv_saddr, 0);
+   ipv4_portaddr_hash(sock_net(sk), inet_sk(sk)->inet_rcv_saddr, 0);
 
/* precompute partial secondary hash */
udp_sk(sk)->udp_portaddr_hash = hash2_partial;
@@ -485,7 +479,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
u32 hash = 0;
 
if (hslot->count > 10) {
-   hash2 = udp4_portaddr_hash(net, daddr, hnum);
+   hash2 = ipv4_portaddr_hash(net, daddr, hnum);
slot2 = hash2 & udptable->mask;
hslot2 = &udptable->hash2[slot2];
if (hslot->count < hslot2->count)
@@ -496,7 +490,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
  exact_dif, hslot2, skb);
if (!result) {
unsigned int old_slot2 = slot2;
-   hash2 = udp4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
+   hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
slot2 = hash2 & udptable->mask;
/* avoid searching the same slot again. */
if (unlikely(slot2 == old_slot2))
@@ -1761,7 +1755,7 @@ EXPORT_SYMBOL(udp_lib_rehash);
 
 static void udp_v4_rehash(struct sock *sk)
 {
-   u16 new_hash = udp4_portaddr_hash(sock_net(sk),
+   u16 new_hash = ipv4_portaddr_hash(sock_net(sk),
  inet_sk(sk)->inet_rcv_saddr,
  inet_sk(sk)->inet_num);
udp_lib_rehash(sk, new_hash);
@@ -1952,9 +1946,9 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
struct sk_buff *nskb;
 
if (use_hash2) {
-   hash2_any = udp4_portaddr_hash(net, 

[PATCH v2 net-next 3/4] inet: Add a 2nd listener hashtable (port+addr)

2017-12-01 Thread Martin KaFai Lau
The current listener hashtable is hashed by port only.
When a process is listening at many IP addresses with the same port (e.g.
[IP1]:443, [IP2]:443... [IPN]:443), inet[6]_lookup_listener()
degrades to a linked-list walk, which makes it prone to SYN attacks.

UDP had a similar issue and a second hashtable was added to resolve it.

This patch adds a second hashtable for the listener's sockets.
The second hashtable is hashed by port and address.

It cannot reuse the existing skc_portaddr_node which is shared
with skc_bind_node.  TCP listener needs to use skc_bind_node.
Instead, this patch adds a hlist_node 'icsk_listen_portaddr_node' to
the inet_connection_sock which the listener (like TCP) also belongs to.

The new portaddr hashtable may need two lookups (first by IP:PORT, then
by INADDR_ANY:PORT if IP:PORT is not found).  Hence, it implements a
similar cut-off to UDP: it only consults the new portaddr hashtable if
the current port-only hashtable has >10 sk in the linked list.
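
The two-step lookup and the cut-off described above can be sketched in
userspace C. This is an illustration under assumptions, not the code in
this patch: the _sketch names are hypothetical, a linear scan stands in for
walking a hash bucket, and addresses are reduced to a single 32-bit value.

```c
#include <stddef.h>
#include <stdint.h>

#define ADDR_ANY_SKETCH 0u	/* stand-in for INADDR_ANY */
#define PORT_ONLY_CUTOFF 10u	/* ">10 sk" threshold from the description */

/* Hypothetical, flattened listener entry. */
struct listener_sketch {
	uint32_t addr;
	uint16_t port;
};

/* Linear scan standing in for a walk over one port+addr hash bucket. */
static const struct listener_sketch *
find_exact(const struct listener_sketch *tbl, size_t n,
	   uint32_t addr, uint16_t port)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].addr == addr && tbl[i].port == port)
			return &tbl[i];
	return NULL;
}

/* Consult the port+addr table only when the port-only bucket is long. */
static int use_portaddr_table(unsigned int port_bucket_count)
{
	return port_bucket_count > PORT_ONLY_CUTOFF;
}

/* Two-step lookup: exact IP:PORT first, then the wildcard address. */
static const struct listener_sketch *
lookup_portaddr(const struct listener_sketch *tbl, size_t n,
		uint32_t daddr, uint16_t port)
{
	const struct listener_sketch *l = find_exact(tbl, n, daddr, port);

	if (!l)
		l = find_exact(tbl, n, ADDR_ANY_SKETCH, port);
	return l;
}
```

Because a miss costs two bucket walks, falling back to the port-only table
for short chains keeps the common case at a single walk.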

lhash2 and lhash2_mask are added to 'struct inet_hashinfo'.  I take
this chance to plug a 4-byte hole.  It is done by first moving
the existing bind_bucket_cachep up and then adding the new
(int lhash2_mask, *lhash2) after the existing bhash_size.

Signed-off-by: Martin KaFai Lau 
---
 include/net/inet_connection_sock.h |   2 +
 include/net/inet_hashtables.h  |  28 +--
 net/ipv4/inet_hashtables.c | 168 +++--
 net/ipv6/inet6_hashtables.c|  66 +++
 4 files changed, 249 insertions(+), 15 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 0358745ea059..8e1bf9ae4a5e 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
  * @icsk_af_ops   Operations which are AF_INET{4,6} specific
  * @icsk_ulp_ops  Pluggable ULP control hook
  * @icsk_ulp_data ULP private data
+ * @icsk_listen_portaddr_node  hash to the portaddr listener hashtable
  * @icsk_ca_state:Congestion control state
  * @icsk_retransmits: Number of unrecovered [RTO] timeouts
  * @icsk_pending: Scheduled timer event
@@ -101,6 +102,7 @@ struct inet_connection_sock {
const struct inet_connection_sock_af_ops *icsk_af_ops;
const struct tcp_ulp_ops  *icsk_ulp_ops;
void  *icsk_ulp_data;
+   struct hlist_node icsk_listen_portaddr_node;
unsigned int  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
__u8  icsk_ca_state:6,
  icsk_ca_setsockopt:1,
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 4cce516c41ac..9141e95529e7 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -133,12 +133,13 @@ struct inet_hashinfo {
/* Ok, let's try this, I give up, we do need a local binding
 * TCP hash as well as the others for fast bind/connect.
 */
+   struct kmem_cache   *bind_bucket_cachep;
struct inet_bind_hashbucket *bhash;
-
unsigned intbhash_size;
-   /* 4 bytes hole on 64 bit */
 
-   struct kmem_cache   *bind_bucket_cachep;
+   /* The 2nd listener table hashed by local port and address */
+   unsigned intlhash2_mask;
+   struct inet_listen_hashbucket   *lhash2;
 
/* All the above members are written once at bootup and
 * never written again _or_ are predominantly read-access.
@@ -146,14 +147,25 @@ struct inet_hashinfo {
 * Now align to a new cache line as all the following members
 * might be often dirty.
 */
-   /* All sockets in TCP_LISTEN state will be in here.  This is the only
-* table where wildcard'd TCP sockets can exist.  Hash function here
-* is just local port number.
+   /* All sockets in TCP_LISTEN state will be in listening_hash.
+* This is the only table where wildcard'd TCP sockets can
+* exist.  listening_hash is only hashed by local port number.
+* If lhash2 is initialized, the same socket will also be hashed
+* to lhash2 by port and address.
 */
struct inet_listen_hashbucket   listening_hash[INET_LHTABLE_SIZE]
cacheline_aligned_in_smp;
 };
 
+#define inet_lhash2_for_each_icsk_rcu(__icsk, list) \
+   hlist_for_each_entry_rcu(__icsk, list, icsk_listen_portaddr_node)
+
+static inline struct inet_listen_hashbucket *
+inet_lhash2_bucket(struct inet_hashinfo *h, u32 hash)
+{
+   return &h->lhash2[hash & h->lhash2_mask];
+}
+
 static inline struct inet_ehash_bucket *inet_ehash_bucket(
struct inet_hashinfo *hashinfo,
unsigned int hash)
@@ -209,6 +221,10 @@ int __inet_inherit_port(const struct sock 

[PATCH v2 net-next 4/4] tcp: Enable 2nd listener hashtable in TCP

2017-12-01 Thread Martin KaFai Lau
Enable the second listener hashtable in TCP.
The scale is the same as UDP which is one slot per 2MB.

Signed-off-by: Martin KaFai Lau 
Reviewed-by: Eric Dumazet 
---
 net/ipv4/tcp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bf97317e6c97..180311636023 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3577,6 +3577,9 @@ void __init tcp_init(void)
percpu_counter_init(&tcp_sockets_allocated, 0, GFP_KERNEL);
percpu_counter_init(&tcp_orphan_count, 0, GFP_KERNEL);
inet_hashinfo_init(&tcp_hashinfo);
+   inet_hashinfo2_init(&tcp_hashinfo, "tcp_listen_portaddr_hash",
+   thash_entries, 21,  /* one slot per 2 MB*/
+   0, 64 * 1024);
tcp_hashinfo.bind_bucket_cachep =
kmem_cache_create("tcp_bind_bucket",
  sizeof(struct inet_bind_bucket), 0,
-- 
2.9.5
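
A rough user-space sketch of the sizing rule in the commit message above ("one slot per 2 MB", i.e. scale 21, capped at 64 * 1024 buckets), together with the mask-based bucket selection used by inet_lhash2_bucket(). The helper names lhash2_slots() and lhash2_index() are made up for illustration, and the round-up to a power of two is an assumption about how the boot-time allocator behaves when no explicit entry count is given; this is not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a kernel API: number of buckets for a given
 * amount of memory at "one bucket per 2^scale bytes", rounded up to a
 * power of two and capped at high_limit (itself a power of two).
 */
static uint64_t lhash2_slots(uint64_t mem_bytes, unsigned int scale,
                             uint64_t high_limit)
{
    uint64_t want = mem_bytes >> scale;
    uint64_t slots = 1;

    assert(high_limit != 0 && (high_limit & (high_limit - 1)) == 0);
    while (slots < want && slots < high_limit)
        slots <<= 1;
    return slots;
}

/* Bucket selection by masking, valid because slots is a power of two. */
static uint64_t lhash2_index(uint32_t hash, uint64_t slots)
{
    return hash & (slots - 1);
}
```

With 1 GiB of memory and scale 21 this gives 512 buckets; very large machines hit the 64 * 1024 cap.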



[PATCH v2 net-next 1/4] inet: Add a count to struct inet_listen_hashbucket

2017-12-01 Thread Martin KaFai Lau
This patch adds a count to the 'struct inet_listen_hashbucket'.
It counts how many sks are hashed to a bucket.  It will be
used to decide if the (to-be-added) portaddr listener's hashtable
should be used during inet[6]_lookup_listener().

Signed-off-by: Martin KaFai Lau 
Reviewed-by: Eric Dumazet 
---
 include/net/inet_hashtables.h |  1 +
 net/ipv4/inet_hashtables.c| 11 +--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 2dbbbff5e1e3..4cce516c41ac 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -111,6 +111,7 @@ struct inet_bind_hashbucket {
  */
 struct inet_listen_hashbucket {
spinlock_t  lock;
+   unsigned intcount;
struct hlist_head   head;
 };
 
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 427b705d7c64..80cfd3fa21ca 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -476,6 +476,7 @@ int __inet_hash(struct sock *sk, struct sock *osk)
hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
else
hlist_add_head_rcu(&sk->sk_node, &ilb->head);
+   ilb->count++;
sock_set_flag(sk, SOCK_RCU_FREE);
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 unlock:
@@ -502,6 +503,7 @@ EXPORT_SYMBOL_GPL(inet_hash);
 void inet_unhash(struct sock *sk)
 {
struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+   struct inet_listen_hashbucket *ilb;
spinlock_t *lock;
bool listener = false;
int done;
@@ -510,7 +512,8 @@ void inet_unhash(struct sock *sk)
return;
 
if (sk->sk_state == TCP_LISTEN) {
-   lock = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)].lock;
+   ilb = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)];
+   lock = &ilb->lock;
listener = true;
} else {
lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
@@ -522,8 +525,11 @@ void inet_unhash(struct sock *sk)
done = __sk_del_node_init(sk);
else
done = __sk_nulls_del_node_init_rcu(sk);
-   if (done)
+   if (done) {
+   if (listener)
+   ilb->count--;
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+   }
spin_unlock_bh(lock);
 }
 EXPORT_SYMBOL_GPL(inet_unhash);
@@ -658,6 +664,7 @@ void inet_hashinfo_init(struct inet_hashinfo *h)
for (i = 0; i < INET_LHTABLE_SIZE; i++) {
spin_lock_init(&h->listening_hash[i].lock);
INIT_HLIST_HEAD(&h->listening_hash[i].head);
+   h->listening_hash[i].count = 0;
}
 }
 EXPORT_SYMBOL_GPL(inet_hashinfo_init);
-- 
2.9.5
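
The bookkeeping this patch adds can be illustrated in user space: the count goes up on every insert and goes down only when a node was really unlinked, mirroring the `if (done)` guard in inet_unhash(). This is a simplified sketch with hypothetical types (struct node, struct listen_bucket) and no locking or RCU, not the kernel data structures.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
};

/* Hypothetical stand-in for struct inet_listen_hashbucket (no lock). */
struct listen_bucket {
    unsigned int count;   /* number of nodes currently linked */
    struct node *head;
};

static void bucket_add(struct listen_bucket *b, struct node *n)
{
    n->next = b->head;
    b->head = n;
    b->count++;
}

/* Returns 1 if n was found and unlinked, 0 otherwise.  The count is
 * only decremented when something was actually removed.
 */
static int bucket_del(struct listen_bucket *b, struct node *n)
{
    struct node **pp;

    for (pp = &b->head; *pp; pp = &(*pp)->next) {
        if (*pp == n) {
            *pp = n->next;
            n->next = NULL;
            assert(b->count > 0);
            b->count--;
            return 1;
        }
    }
    return 0;
}
```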



Re: [PATCH net] tcp/dccp: block bh before arming time_wait timer

2017-12-01 Thread Eric Dumazet
On Fri, 2017-12-01 at 15:12 -0500, David Miller wrote:
> From: Eric Dumazet 
> Date: Fri, 01 Dec 2017 10:06:56 -0800
> 
> > From: Eric Dumazet 
> > 
> > Maciej Żenczykowski reported some panics in tcp_twsk_destructor()
> > that might be caused by the following bug.
> > 
> > timewait timer is pinned to the cpu, because we want to transition
> > timewait refcount from 0 to 4 in one go, once everything has been
> > initialized.
> > 
> > At the time commit ed2e92394589 ("tcp/dccp: fix timewait races in timer
> > handling") was merged, TCP was always running from BH handler.
> > 
> > After commit 5413d1babe8f ("net: do not block BH while processing
> > socket backlog") we definitely can run tcp_time_wait() from process
> > context.
> > 
> > We need to block BH in the critical section so that the pinned timer
> > still serves its purpose.
> > 
> > This bug is more likely to happen under stress and when very small RTO
> > are used in datacenter flows.
> > 
> > Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
> > Signed-off-by: Eric Dumazet 
> > Reported-by: Maciej Żenczykowski 
> 
> Applied and queued up for -stable, thanks Eric.

It just occurred to me that we can now revert 614bdd4d6e61d26
("tcp: must block bh in __inet_twsk_hashdance()")




Re: netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'

2017-12-01 Thread Daniel Borkmann
On 12/01/2017 06:39 PM, Al Viro wrote:
[...]
> If that does not scream "wrong or missing primitive", I don't know what would.
> You want something along the lines of "create a filesystem object at given
> location, calling this function with this argument for actual object creation"?
> Fair enough, but then let's add a primitive that would do just that.
> 
> And grepping around for similar sick tricks catches a slightly milder example -
> mq_open(2) doesn't play with encoding stuff into dev_t, but otherwise it's very
> similar and could also benefit from the same primitive.
> 
> How about something like this:
> int vfs_mkobj(struct dentry *dentry, umode_t mode,
> int (*f)(struct dentry *, umode_t, void *),
>   void *arg)
> {
>   struct inode *dir = dentry->d_parent->d_inode;
> int error = may_create(dir, dentry);
> if (error)
> return error;
> 
> mode &= S_IALLUGO;
> mode |= S_IFREG;
> error = security_inode_create(dir, dentry, mode);
> if (error)
> return error;
> error = f(dentry, mode, arg);
> if (!error)
> fsnotify_create(dir, dentry);
> return error;
> }
> 
> exported by fs/namei.c, with your code doing
> 
>   switch (type) {
>   case BPF_TYPE_PROG:
>   error = vfs_mkobj(path.dentry, mode, bpf_mkprog, raw);
>   break;
>   case BPF_TYPE_MAP:
>   error = vfs_mkobj(path.dentry, mode, bpf_mkmap, raw);
>   break;
>   default:
>   error = -EPERM;
>   }
> instead of that vfs_mknod() hack, with
> 
> static int bpf_mkprog(struct inode *dir, struct dentry *dentry,
>umode_t mode, void *raw)
> {
>   return bpf_mkobj_ops(dir, dentry, mode, raw, &bpf_prog_iops);
> }
> 
> static int bpf_mkmap(struct inode *dir, struct dentry *dentry,
>umode_t mode, void *raw)
> {
>   return bpf_mkobj_ops(dir, dentry, mode, raw, &bpf_map_iops);
> }
> 
> static int bpf_mkobj_ops(struct inode *dir, struct dentry *dentry,
>umode_t mode, void *raw, struct inode_operations *iops)
> {
> struct inode *inode;
> 
> inode = bpf_get_inode(dir->i_sb, dir, mode);
> if (IS_ERR(inode))
> return PTR_ERR(inode);
> 
> inode->i_op = iops;
> inode->i_private = raw;
> 
> bpf_dentry_finalize(dentry, inode, dir);
> return 0;
> }
> 
> And to hell with messing with dev_t, ->d_fsdata or having ->mknod() there at all...
> Might want to replace security_path_mknod() with something saner, while we are
> at it.
> 
> Objections?

No, thanks for looking into this, and sorry for this fugly hack! :( Not
that this makes it any better, but I think back then I took it
over from mqueue implementation ... should have known better and looked
into making this generic instead, sigh. The above looks good to me, so
no objections from my side and thanks for working on it!

> PS: mqueue.c would also benefit from such primitive - do_create() there would
> simply pass attr as callback's argument into vfs_mkobj(), with callback being
> the guts of mqueue_create()...
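
The shape of the proposed primitive (normalise the mode, then hand actual object creation to a caller-supplied callback with an opaque argument) can be sketched in user space. Everything here is hypothetical: mkobj(), record_mode() and the SKETCH_* constants are illustration-only names, the permission and security checks of the real proposal are omitted, and the callback takes a name instead of a dentry.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ALLUGO 07777u    /* permission + setuid/setgid/sticky bits */
#define SKETCH_IFREG  0100000u  /* "regular file" type bit */

typedef int (*mkobj_fn)(const char *name, unsigned int mode, void *arg);

/* Force the object type to "regular file", keep only permission bits,
 * then let the callback do the real creation work.
 */
static int mkobj(const char *name, unsigned int mode, mkobj_fn f, void *arg)
{
    assert(name != NULL && f != NULL);
    mode &= SKETCH_ALLUGO;
    mode |= SKETCH_IFREG;
    return f(name, mode, arg);
}

/* Example callback: just records the mode it was handed. */
static int record_mode(const char *name, unsigned int mode, void *arg)
{
    (void)name;
    *(unsigned int *)arg = mode;
    return 0;
}
```

The point of the design is that callers such as the BPF filesystem or mqueue no longer need to smuggle their payload through dev_t or ->d_fsdata; it travels in the opaque argument.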


Re: [PATCH net-next resubmit 2/2] net: phy: remove generic settings for callbacks config_aneg and read_status from drivers

2017-12-01 Thread David Miller
From: Heiner Kallweit 
Date: Thu, 30 Nov 2017 23:47:52 +0100

> Remove generic settings for callbacks config_aneg and read_status
> from drivers.
> 
> Signed-off-by: Heiner Kallweit 
> Reviewed-by: Florian Fainelli 

Applied.


Re: [PATCH net-next resubmit 1/2] net: phy: core: use genphy version of callbacks read_status and config_aneg per default

2017-12-01 Thread David Miller
From: Heiner Kallweit 
Date: Thu, 30 Nov 2017 23:46:19 +0100

> read_status and config_aneg are the only mandatory callbacks and most
> of the time the generic implementation is used by drivers.
> So make the core fall back to the generic version if a driver doesn't
> implement the respective callback.
> 
> Also currently the core doesn't seem to verify that drivers implement
> the mandatory calls. If a driver doesn't do so we'd just get a NPE.
> With this patch this potential issue doesn't exist any longer.
> 
> Signed-off-by: Heiner Kallweit 
> Reviewed-by: Florian Fainelli 

Applied.
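
A minimal sketch of the fallback described in the commit message: if a driver leaves a callback unset, the core uses the generic implementation instead of dereferencing a NULL pointer. The struct and function names below (phy_drv, phy_dev, phy_do_read_status, and so on) are hypothetical stand-ins, not the phylib definitions.

```c
#include <assert.h>
#include <stddef.h>

struct phy_dev;

/* Hypothetical stand-in for a PHY driver's callback table. */
struct phy_drv {
    int (*config_aneg)(struct phy_dev *dev);
    int (*read_status)(struct phy_dev *dev);
};

struct phy_dev {
    const struct phy_drv *drv;
};

/* Generic implementations; they return 0 purely for illustration. */
static int generic_config_aneg(struct phy_dev *dev) { (void)dev; return 0; }
static int generic_read_status(struct phy_dev *dev) { (void)dev; return 0; }

/* Example driver-specific override, returning a distinct value. */
static int custom_read_status(struct phy_dev *dev) { (void)dev; return 1; }

static int phy_do_config_aneg(struct phy_dev *dev)
{
    assert(dev && dev->drv);
    if (dev->drv->config_aneg)
        return dev->drv->config_aneg(dev);
    return generic_config_aneg(dev);   /* fallback, no NULL call */
}

static int phy_do_read_status(struct phy_dev *dev)
{
    assert(dev && dev->drv);
    if (dev->drv->read_status)
        return dev->drv->read_status(dev);
    return generic_read_status(dev);   /* fallback, no NULL call */
}
```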


Re: [PATCH v4 3/8] MIPS: Octeon: Add a global resource manager.

2017-12-01 Thread Philippe Ombredanne
David,

On Fri, Dec 1, 2017 at 9:01 PM, David Daney  wrote:
> On 12/01/2017 11:49 AM, Philippe Ombredanne wrote:
>>
>> David, Greg,
>>
>> On Fri, Dec 1, 2017 at 6:42 PM, David Daney 
>> wrote:
>>>
>>> On 11/30/2017 11:53 PM, Philippe Ombredanne wrote:
>>
>> [...]
>>
>> --- /dev/null
>> +++ b/arch/mips/cavium-octeon/resource-mgr.c
>> @@ -0,0 +1,371 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Resource manager for Octeon.
>> + *
>> + * This file is subject to the terms and conditions of the GNU General Public
>> + * License.  See the file "COPYING" in the main directory of this archive
>> + * for more details.
>> + *
>> + * Copyright (C) 2017 Cavium, Inc.
>> + */



>>>> Since you nicely included an SPDX id, you would not need the
>>>> boilerplate anymore. e.g. these can go alright?
>>>
>>>
>>>
>>> They may not be strictly speaking necessary, but I don't think they hurt
>>> anything.  Unless there is a requirement to strip out the license text, we
>>> would stick with it as is.
>>
>>
>> I think the requirement is there and that would be much better for
>> everyone: keeping both is redundant and does not bring any value, does
>> it? Instead it kinda removes the benefits of having the SPDX id in the
>> first place IMHO.
>>
>> Furthermore, as there have been already ~12K+ files cleaned up and
>> still over 60K files to go, it would be really nice if new files could
>> adopt the new style: this way we will not have to revisit and repatch
>> them in the future.
>>
>
> I am happy to follow any style Greg would suggest.  There doesn't seem to be
> much documentation about how this should be done yet.

Thomas (tglx) has already submitted a first series of doc patches a
few weeks ago. And AFAIK he might be working on posting the updates
soon, whenever his real time clock yields a few cycles away from real
time coding work ;)

See also these discussions with Linus [1][2][3], Thomas[4] and Greg[5]
on this and mostly related topics

[1] https://lkml.org/lkml/2017/11/2/715
[2] https://lkml.org/lkml/2017/11/25/125
[3] https://lkml.org/lkml/2017/11/25/133
[4] https://lkml.org/lkml/2017/11/2/805
[5] https://lkml.org/lkml/2017/10/19/165

-- 
Cordially
Philippe Ombredanne


Re: [PATCH v5 net-next 0/3] ip6_gre: add erspan native tunnel for ipv6

2017-12-01 Thread David Miller
From: William Tu 
Date: Thu, 30 Nov 2017 11:51:26 -0800

> The patch series adds support for ERSPAN tunnel over ipv6.  The first patch
> refactors the existing ipv4 gre implementation and the second refactors the
> ipv6 gre's xmit code.  Finally the last patch introduces erspan protocol.

Series applied, thanks William.


Re: [PATCH RFC 2/2] veth: propagate bridge GSO to peer

2017-12-01 Thread Stephen Hemminger
On Mon, 27 Nov 2017 19:02:01 -0700
David Ahern  wrote:

> On 11/27/17 6:42 PM, Solio Sarabia wrote:
> > On Mon, Nov 27, 2017 at 01:15:02PM -0800, Stephen Hemminger wrote:  
> >> On Mon, 27 Nov 2017 12:14:19 -0800
> >> Solio Sarabia  wrote:
> >>  
> >>> On Sun, Nov 26, 2017 at 11:07:25PM -0800, Stephen Hemminger wrote:  
> >>>> On Sun, 26 Nov 2017 20:13:39 -0700
> >>>> David Ahern  wrote:
> >>>>
> >>>>> On 11/26/17 11:17 AM, Stephen Hemminger wrote:
> >>>>>> This allows veth device in containers to see the GSO maximum
> >>>>>> settings of the actual device being used for output.
> >>>>>
> >>>>> veth devices can be added to a VRF instead of a bridge, and I do not
> >>>>> believe the gso propagation works for L3 master devices.
> >>>>>
> >>>>> From a quick grep, team devices do not appear to handle gso changes either.
> >>>>
> >>>> This code should still work correctly, but no optimization would happen.
> >>>> The gso_max_size of the VRF or team will
> >>>> still be GSO_MAX_SIZE so there would be no change. If VRF or Team ever got smart
> >>>> enough to handle GSO limits, then the algorithm would handle it.
> >>>
> >>> This patch propagates gso value from bridge to its veth endpoints.
> >>> However, since bridge is never aware of the GSO limit from underlying
> >>> interfaces, bridge/veth still have larger GSO size.
> >>>
> >>> In the docker case, bridge is not linked directly to physical or
> >>> synthetic interfaces; it relies on iptables to decide which interface to
> >>> forward packets to.  
> >>
> >> So for the docker case, then direct control of GSO values via netlink (ie ip link set)
> >> seems like the better solution.  
> > 
> > Adding ioctl support for 'ip link set' would work. I'm still concerned
> > how to enforce the upper limit to not exceed that of the lower devices.
> > 
> > Consider a system with three NICs, each reporting values in the range
> > [60,000 - 62,780]. Users could set virtual interfaces' gso to 65,536,
> > exceeding the limit, and having the host do sw gso (vms settings must
> > not affect host performance.)
> > 
> > Looping through interfaces?  With the difference that now it'd be
> > triggered upon user's request, not every time a veth is created (like one
> > previous patch discussed.)
> >   
> 
> You are concerned about the routed case right? One option is to have VRF
> devices propagate gso sizes to all devices (veth, vlan, etc) enslaved to
> it. VRF devices are Layer 3 master devices so an L3 parallel to a bridge.

See the patch set I posted today which punts the problem to veth setup.


Re: [PATCH 0/2] net: ethtool: add support for ETH_RESET_AP

2017-12-01 Thread David Miller
From: Scott Branden 
Date: Thu, 30 Nov 2017 11:35:58 -0800

> Add support to reset application processors inside SmartNICs by
> defining new ETH_RESET_AP bit.
> 
> And use new ETH_RESET_AP bit in bnxt ethernet driver.

Looks good, series applied, thanks!


Re: [PATCH net-next 0/3] rds-tcp netns delete related fixes

2017-12-01 Thread David Miller
From: Sowmini Varadhan 
Date: Thu, 30 Nov 2017 11:11:26 -0800

> Patchset contains cleanup and bug fixes. Patch 1 is the removal
> of some redundant code/functions. Patch 2 and 3 are fixes for 
> corner cases identified by syzkaller. I've not been able to
> reproduce the actual use-after-free race flagged in the syzkaller
> reports, thus these fixes are based on code inspection plus 
> manual testing to make sure the modified code paths are executed 
> without problems in the commonly encountered timing cases.

Series applied, thanks.


Re: [net-next 1/1] tipc: fall back to smaller MTU if allocation of local send skb fails

2017-12-01 Thread David Miller
From: Jon Maloy 
Date: Thu, 30 Nov 2017 16:47:25 +0100

> When sending node local messages the code is using an 'mtu' of 66060
> bytes to avoid unnecessary fragmentation. During situations of low
> memory tipc_msg_build() may sometimes fail to allocate such large
> buffers, resulting in unnecessary send failures. This can easily be
> remedied by falling back to a smaller MTU, and then reassemble the
> buffer chain as if the message were arriving from a remote node.
> 
> At the same time, we change the initial MTU setting of the broadcast
> link to a lower value, so that large messages always are fragmented
> into smaller buffers even when we run in single node mode. Apart from
> obtaining the same advantage as for the 'fallback' solution above, this
> turns out to give a significant performance improvement. This can
> probably be explained with the __pskb_copy() operation performed on the
> buffer for each recipient during reception. We found the optimal value
> for this, considering the most relevant skb pool, to be 3744 bytes.
> 
> Acked-by: Ying Xue 
> Signed-off-by: Jon Maloy 

Applied, thanks Jon.
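
The fallback idea from the commit message can be sketched as: try the large node-local buffer first, and only if that allocation fails retry with a smaller size. This is a simplified illustration, not tipc_msg_build(): the function alloc_with_fallback(), the injectable allocator, and the 1500-byte fallback size used in the usage note are assumptions, and the real code goes on to fragment and reassemble the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(size_t size);

/* Try to allocate local_mtu bytes; on failure fall back to fallback_mtu.
 * *used reports which size succeeded (0 if both attempts failed).
 */
static void *alloc_with_fallback(alloc_fn alloc, size_t local_mtu,
                                 size_t fallback_mtu, size_t *used)
{
    void *buf;

    assert(alloc && used && fallback_mtu <= local_mtu);
    buf = alloc(local_mtu);
    if (buf) {
        *used = local_mtu;
        return buf;
    }
    buf = alloc(fallback_mtu);
    *used = buf ? fallback_mtu : 0;
    return buf;
}

/* Test allocator simulating low memory: refuses large requests. */
static void *stingy_alloc(size_t size)
{
    return size > 4096 ? NULL : malloc(size);
}
```

Called with the 66060-byte node-local size and a hypothetical 1500-byte fallback, the stingy allocator forces the second path while plain malloc() takes the first.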


Re: [PATCH net-next v2 8/8] net: dummy: remove fake SR-IOV functionality

2017-12-01 Thread Jakub Kicinski
On Fri, 1 Dec 2017 14:46:34 +0100, Phil Sutter wrote:
> On Thu, Nov 30, 2017 at 05:35:40PM -0800, Jakub Kicinski wrote:
> > netdevsim driver seems like a better place for fake SR-IOV
> > functionality.  Remove the code previously added to dummy.
> > 
> > Signed-off-by: Jakub Kicinski 
> > Reviewed-by: Quentin Monnet   
> 
> Acked-by: Phil Sutter 

Thanks!

Did you have an opportunity to run your tests against this?  I didn't
find anything that uses dummy's SR-IOV in selftests.


Re: [PATCH 0/4] SFP/phylink fixes

2017-12-01 Thread David Miller
From: Russell King - ARM Linux 
Date: Thu, 30 Nov 2017 13:58:35 +

> Here are four phylink fixes:
> - the "options" is a big-endian value, we must test the bits taking the
>   endian-ness into account.
> - improve the handling of RX_LOS polarity, taking no RX_LOS polarity
>   bits set to mean there is no RX_LOS functionality provided.
> - do not report modules that require the address mode switching as
>   supporting SFF8472.
> - ensure that the mac_link_down() function is called when phylink_stop()
>   is called.

Series applied, thank you.
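
The first fix in the list above concerns testing bits in a big-endian field. A minimal sketch of doing that portably: assemble the 16-bit value from its two bytes in wire order before masking, so the result is the same on little- and big-endian hosts. The helper names and the bit values used in the usage note are hypothetical and do not correspond to actual SFP option bits.

```c
#include <assert.h>
#include <stdint.h>

/* Load a 16-bit big-endian value from two raw bytes. */
static uint16_t be16_load(const uint8_t raw[2])
{
    return (uint16_t)(((unsigned int)raw[0] << 8) | raw[1]);
}

/* Returns 1 if the given bit (in host terms) is set in the
 * big-endian field, 0 otherwise.
 */
static int option_set(const uint8_t raw[2], uint16_t bit)
{
    assert(bit != 0);
    return (be16_load(raw) & bit) != 0;
}
```

For the raw bytes { 0x00, 0x02 }, bit 0x0002 is set and bit 0x0200 is not; reading the same two bytes as a native little-endian integer would give the opposite answer.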


Re: [PATCH] net: phy-micrel: check return code in flp center function

2017-12-01 Thread David Miller
From: Max Uvarov 
Date: Thu, 30 Nov 2017 13:08:29 +0300

> Fix obvious typo that first return value is set but not checked.
> 
> Signed-off-by: Max Uvarov 

Applied, thank you.


Re: [PATCH net v2] tipc: call tipc_rcv() only if bearer is up in tipc_udp_recv()

2017-12-01 Thread David Miller
From: Tommi Rantala 
Date: Wed, 29 Nov 2017 12:48:42 +0200

> Remove the second tipc_rcv() call in tipc_udp_recv(). We have just
> checked that the bearer is not up, and calling tipc_rcv() with a bearer
> that is not up leads to a TIPC div-by-zero crash in
> tipc_node_calculate_timer(). The crash is rare in practice, but can
> happen like this:
> 
>   We're enabling a bearer, but it's not yet up and fully initialized.
>   At the same time we receive a discovery packet, and in tipc_udp_recv()
>   we end up calling tipc_rcv() with the not-yet-initialized bearer,
>   causing later the div-by-zero crash in tipc_node_calculate_timer().
> 
> Jon Maloy explains the impact of removing the second tipc_rcv() call:
>   "link setup in the worst case will be delayed until the next arriving
>discovery messages, 1 sec later, and this is an acceptable delay."
> 
> As the tipc_rcv() call is removed, just leave the function via the
> rcu_out label, so that we will kfree_skb().
 ...
> Fixes: c9b64d492b1f ("tipc: add replicast peer discovery")
> Signed-off-by: Tommi Rantala 
> Cc: Jon Maloy 

Applied and queued up for -stable, thanks.


Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-01 Thread Shannon Nelson

On 11/30/2017 6:11 AM, Michael S. Tsirkin wrote:

On Thu, Nov 30, 2017 at 10:08:45AM +0200, achiad shochat wrote:

Re. problem #2:
Indeed the best way to address it seems to be to enslave the VF driver
netdev under a persistent anchor netdev.
And it's indeed desired to allow (but not enforce) PV netdev and VF
netdev to work in conjunction.
And it's indeed desired that this enslavement logic work out-of-the box.
But in case of PV+VF some configurable policies must be in place (and
they'd better be generic rather than differ per PV technology).
For example - based on which characteristics should the PV+VF coupling
be done? netvsc uses MAC address, but that might not always be the
desire.


It's a policy but not guest userspace policy.

The hypervisor certainly knows.

Are you concerned that someone might want to create two devices with the
same MAC for an unrelated reason?  If so, hypervisor could easily set a
flag in the virtio device to say "this is a backup, use MAC to find
another device".


This is something I was going to suggest: a flag or other configuration 
on the virtio device to help control how this new feature is used.  I 
can imagine this might be useful to control from either the hypervisor 
side or the VM side.


The hypervisor might want to (1) disable it (force it off), (2) enable 
it for VM choice, or (3) force it on for the VM.  In case (2), the VM 
might be able to choose whether it wants to make use of the feature, or 
stick with the bonding solution.


Either way, the kernel is making a feature available, and the user (VM 
or hypervisor) is able to control it by selecting the feature based on 
the policy desired.


sln


Re: [PATCH net-next v2 7/8] netdevsim: add SR-IOV functionality

2017-12-01 Thread Jakub Kicinski
On Fri, 1 Dec 2017 14:43:06 +0100, Phil Sutter wrote:
> On Thu, Nov 30, 2017 at 05:35:39PM -0800, Jakub Kicinski wrote:
> [...]
> > +static int nsim_vfs_enable(struct netdevsim *ns, unsigned int num_vfs)
> > +{
> > +   ns->vfconfigs = kcalloc(num_vfs, sizeof(struct nsim_vf_config),
> > +   GFP_KERNEL);
> > +   if (!ns->vfconfigs)
> > +   return -ENOMEM;
> > +   ns->num_vfs = num_vfs;
> > +
> > +   return 0;
> > +}
> > +
> > +static void nsim_vfs_disable(struct netdevsim *ns)
> > +{
> > +   kfree(ns->vfconfigs);
> > +   ns->vfconfigs = NULL;
> > +   ns->num_vfs = 0;
> > +}  
> 
> Why not something like:
> 
> | static int nsim_vfs_set(struct netdevsim *ns, unsigned int num_vfs)
> | {
> | void *ptr = krealloc(ns->vfconfigs,
> |  num_vfs * sizeof(struct nsim_vf_config),
> |  GFP_KERNEL);
> | 
> | if (!ptr)
> | return -ENOMEM;
> | 
> | ns->vfconfigs = ptr;
> | ns->num_vfs = num_vfs;
> | return 0;
> | }

Um.  It either frees or allocates, never reallocates so I felt realloc
is misleading.  ZERO_SIZE_PTR is less clearly a NULL than a NULL.  I
will have to specify __GFP_ZERO.  It's not a calloc so there could be
potentially some overflows?

> > +static ssize_t
> > +nsim_numvfs_store(struct device *dev, struct device_attribute *attr,
> > + const char *buf, size_t count)
> > +{
> > +   struct netdevsim *ns = to_nsim(dev);
> > +   unsigned int num_vfs;
> > +   int ret;
> > +
> > +   ret = kstrtouint(buf, 0, &num_vfs);
> > +   if (ret)
> > +   return ret;
> > +
> > +   rtnl_lock();
> > +   if (ns->num_vfs == num_vfs)
> > +   goto exit_good;  
> 
> Then replace this:
> 
> > +   if (ns->num_vfs && num_vfs) {
> > +   ret = -EBUSY;
> > +   goto exit_unlock;
> > +   }
> > +
> > +   if (num_vfs) {
> > +   ret = nsim_vfs_enable(ns, num_vfs);
> > +   if (ret)
> > +   goto exit_unlock;
> > +   } else {
> > +   nsim_vfs_disable(ns);
> > +   }  
> 
> with just:
> 
> | nsim_vfs_set(ns, num_vfs);

I'm trying to mirror the PCI subsystem behaviour here, which only
allows enable or disable, not increase.  I felt we should follow how
real devices behave:

/* enable VFs */
if (pdev->sriov->num_VFs) {
dev_warn(&pdev->dev, "%d VFs already enabled. Disable before enabling %d VFs\n",
 pdev->sriov->num_VFs, num_vfs);
return -EBUSY;
}

So IOW this is intentional.

> > +   ret = count;
> > +exit_unlock:
> > +   rtnl_unlock();
> > +
> > +   return ret;
> > +}  
> 
> [...]
> 
> > +static void nsim_free(struct net_device *dev)
> > +{
> > +   struct netdevsim *ns = netdev_priv(dev);
> > +
> > +   device_unregister(&ns->dev);
> >  }  
> 
> Shouldn't this also kfree(ns->vfconfigs)?

It's in uninit, I will move it to release.
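
The enable-or-disable semantics argued for above (same value is a no-op, non-zero to a different non-zero is refused with -EBUSY, zero frees everything) can be sketched in user space. The struct sim and sim_set_numvfs() names are hypothetical, calloc()/free() stand in for kcalloc()/kfree(), and there is no locking.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct vf_config {
    int placeholder;   /* per-VF settings would live here */
};

struct sim {
    struct vf_config *vfconfigs;
    unsigned int num_vfs;
};

static int sim_set_numvfs(struct sim *s, unsigned int num_vfs)
{
    assert(s != NULL);
    if (s->num_vfs == num_vfs)
        return 0;                 /* nothing to do */
    if (s->num_vfs && num_vfs)
        return -EBUSY;            /* must disable before re-enabling */

    if (num_vfs) {
        s->vfconfigs = calloc(num_vfs, sizeof(*s->vfconfigs));
        if (!s->vfconfigs)
            return -ENOMEM;
        s->num_vfs = num_vfs;
    } else {
        free(s->vfconfigs);
        s->vfconfigs = NULL;
        s->num_vfs = 0;
    }
    return 0;
}
```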


Re: netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'

2017-12-01 Thread Daniel Borkmann
On 12/01/2017 07:28 PM, Linus Torvalds wrote:
> [ Sorry for HTML email crud - traveling and on mobile right now ]
> 
> On Nov 30, 2017 23:54, "Al Viro"  wrote:
> 
> Would cause problems for tracepoints in there, though.  And that, BTW,
> is precisely why I don't want tracepoints in core VFS, TYVM - makes
> restructuring the code harder...
> 
> Just ignore them, see if anybody notices, and then they can add them back.
> Tracepoints shouldn't hold up kernel development, and I doubt these are
> ones that could be noticed by normal users.

Yep, agree, if it really gets in the way, then lets remove them for
now. After all, that was what was decided anyway.


[PATCH iproute2 net-next] iplink: allow configuring GSO max values

2017-12-01 Thread Stephen Hemminger
This allows sending GSO maximum values when configuring a device.
The values are advisory. Most devices will ignore them but for some
pseudo devices such as veth pairs they can be set.

Example:
# ip link add dev vm1 type veth peer name vm2 gso_max_size 32768

Signed-off-by: Stephen Hemminger 
---
 ip/iplink.c   | 19 ++-
 man/man8/ip-link.8.in | 13 +
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index 0a8eb56fb252..6379b16a14f5 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -97,7 +97,8 @@ void iplink_usage(void)
" [ master DEVICE ][ vrf NAME ]\n"
" [ nomaster ]\n"
" [ addrgenmode { eui64 | none | stable_secret | random } ]\n"
-   " [ protodown { on | off } ]\n"
+   " [ protodown { on | off } ]\n"
+   " [ gso_max_size BYTES ] | [ gso_max_segs PACKETS ]\n"
"\n"
"   ip link show [ DEVICE | group GROUP ] [up] [master DEV] [vrf NAME] [type TYPE]\n");
 
@@ -848,6 +849,22 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
return on_off("protodown", *argv);
addattr8(&req->n, sizeof(*req), IFLA_PROTO_DOWN,
 proto_down);
+   } else if (strcmp(*argv, "gso_max_size") == 0) {
+   unsigned int max_size;
+
+   NEXT_ARG();
+   if (get_unsigned(&max_size, *argv, 0) || max_size > UINT16_MAX)
+   invarg("Invalid \"gso_max_size\" value\n",
+  *argv);
+   addattr32(&req->n, sizeof(*req), IFLA_GSO_MAX_SIZE, max_size);
+   } else if (strcmp(*argv, "gso_max_segs") == 0) {
+   unsigned int max_segs;
+
+   NEXT_ARG();
+   if (get_unsigned(&max_segs, *argv, 0) || max_segs > UINT16_MAX)
+   invarg("Invalid \"gso_max_segs\" value\n",
+  *argv);
+   addattr32(&req->n, sizeof(*req), IFLA_GSO_MAX_SEGS, max_segs);
} else {
if (matches(*argv, "help") == 0)
usage();
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index a6a10e577b1f..0db2582e19f7 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -36,6 +36,11 @@ ip-link \- network device configuration
 .RB "[ " numrxqueues
 .IR QUEUE_COUNT " ]"
 .br
+.BR "[" gso_max_size
+.IR BYTES " ]"
+.RB "[ " gso_max_segs
+.IR SEGMENTS " ]"
+.br
 .BI type " TYPE"
 .RI "[ " ARGS " ]"
 
@@ -343,6 +348,14 @@ specifies the number of transmit queues for new device.
 specifies the number of receive queues for new device.
 
 .TP
+.BI gso_max_size " BYTES "
+specifies the recommended maximum size of a Generic Segmentation Offload packet the new device should accept.
+
+.TP
+.BI gso_max_segs " SEGMENTS "
+specifies the recommended maximum number of Generic Segmentation Offload segments the new device should accept.
+
+.TP
 .BI index " IDX "
 specifies the desired index of the new virtual device. The link creation fails, if the index is busy.
 
-- 
2.11.0
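
The argument check in the patch above (parse an unsigned number, reject anything above UINT16_MAX) can be approximated in plain C. This sketch uses strtoul() rather than iproute2's own get_unsigned() helper, and parse_gso_max() is a hypothetical name; it only illustrates the validation, not the netlink attribute handling.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 0 and stores the value on success, -1 if the argument is
 * empty, negative, malformed, or larger than UINT16_MAX.
 */
static int parse_gso_max(const char *arg, unsigned int *out)
{
    char *end;
    unsigned long v;

    assert(arg != NULL && out != NULL);
    if (*arg == '\0' || *arg == '-')
        return -1;

    errno = 0;
    v = strtoul(arg, &end, 0);   /* base 0: accepts decimal, 0x..., 0... */
    if (errno || *end != '\0' || v > UINT16_MAX)
        return -1;

    *out = (unsigned int)v;
    return 0;
}
```

With this check the example value 32768 from the commit message is accepted, while 65536 is rejected on the command line.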



Re: [PATCH net] tcp/dccp: block bh before arming time_wait timer

2017-12-01 Thread David Miller
From: Eric Dumazet 
Date: Fri, 01 Dec 2017 10:06:56 -0800

> From: Eric Dumazet 
> 
> Maciej Żenczykowski reported some panics in tcp_twsk_destructor()
> that might be caused by the following bug.
> 
> timewait timer is pinned to the cpu, because we want to transition
> timewait refcount from 0 to 4 in one go, once everything has been
> initialized.
> 
> At the time commit ed2e92394589 ("tcp/dccp: fix timewait races in timer
> handling") was merged, TCP was always running from BH handler.
> 
> After commit 5413d1babe8f ("net: do not block BH while processing
> socket backlog") we definitely can run tcp_time_wait() from process
> context.
> 
> We need to block BH in the critical section so that the pinned timer
> still serves its purpose.
> 
> This bug is more likely to happen under stress and when very small RTO
> are used in datacenter flows.
> 
> Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
> Signed-off-by: Eric Dumazet 
> Reported-by: Maciej Żenczykowski 

Applied and queued up for -stable, thanks Eric.


[PATCH net-next 1/2] rtnetlink: allow GSO maximums to be passed to device

2017-12-01 Thread Stephen Hemminger
Allow GSO maximum segments and size as netlink parameters on input.

Signed-off-by: Stephen Hemminger 
---
 net/core/rtnetlink.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dabba2a91fc8..8138194c5f81 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1569,6 +1569,8 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_PROMISCUITY]  = { .type = NLA_U32 },
[IFLA_NUM_TX_QUEUES]= { .type = NLA_U32 },
[IFLA_NUM_RX_QUEUES]= { .type = NLA_U32 },
+   [IFLA_GSO_MAX_SEGS] = { .type = NLA_U32 },
+   [IFLA_GSO_MAX_SIZE] = { .type = NLA_U32 },
[IFLA_PHYS_PORT_ID] = { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
[IFLA_CARRIER_CHANGES]  = { .type = NLA_U32 },  /* ignored */
[IFLA_PHYS_SWITCH_ID]   = { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
-- 
2.11.0



[PATCH net-next 2/2] veth: allow configuring GSO maximums

2017-12-01 Thread Stephen Hemminger
Veth's can be used in environments (like Azure) where the underlying
network device is impacted by large GSO packets. This patch allows
gso maximum values to be passed in when creating the device via
netlink.

In theory, other pseudo devices could also use netlink attributes
to set GSO maximums but for now veth is what has been observed
to be an issue.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/veth.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f5438d0978ca..510c058ba227 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -410,6 +410,26 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
if (ifmp && (dev->ifindex != 0))
peer->ifindex = ifmp->ifi_index;
 
+   if (tbp[IFLA_GSO_MAX_SIZE]) {
+   u32 max_size = nla_get_u32(tbp[IFLA_GSO_MAX_SIZE]);
+
+   if (max_size > GSO_MAX_SIZE)
+   return -EINVAL;
+
+   peer->gso_max_size = max_size;
+   dev->gso_max_size = max_size;
+   }
+
+   if (tbp[IFLA_GSO_MAX_SEGS]) {
+   u32 max_segs = nla_get_u32(tbp[IFLA_GSO_MAX_SEGS]);
+
+   if (max_segs > GSO_MAX_SEGS)
+   return -EINVAL;
+
+   peer->gso_max_segs = max_segs;
+   dev->gso_max_segs = max_segs;
+   }
+
err = register_netdevice(peer);
put_net(net);
net = NULL;
-- 
2.11.0



[PATCH net-next 0/2] allow setting gso_maximum values

2017-12-01 Thread Stephen Hemminger
This is another way of addressing the GSO maximum performance issues for
containers on Azure. What happens is that the underlying infrastructure uses
an overlay network such that GSO packets over 64K - vlan header end up causing
either guest or host to have to do expensive software copy and fragmentation.

The netvsc driver reports GSO maximum settings correctly, the issue
is that containers on veth devices still have the larger settings.
One solution that was examined was propagating the values back
through the bridge device, but this does not work for cases where
virtual container network is done on L3.

This patch set punts the problem to the orchestration layer that sets
up the container network. It also enables other virtual devices
to have configurable settings for GSO maximum.

Stephen Hemminger (2):
  rtnetlink: allow GSO maximums to be passed to device
  veth: allow configuring GSO maximums

 drivers/net/veth.c   | 20 
 net/core/rtnetlink.c |  2 ++
 2 files changed, 22 insertions(+)

-- 
2.11.0



Re: [PATCH net-next 00/13] nfp: bpf: jump resolution and memcpy update

2017-12-01 Thread Daniel Borkmann
On 12/01/2017 06:32 AM, Jakub Kicinski wrote:
> Hi!
> 
> Jiong says:
> 
> Currently, the compiler will lower a memcpy function call in an XDP/eBPF C
> program into a sequence of eBPF load/store pairs in some scenarios.
> 
> The compiler considers this "inline" optimization beneficial, as it can
> avoid a function call and also increase code locality.
> 
> However, the Netronome NPU is not a traditional load/store architecture,
> so doing a sequence of individual load/store actions is not efficient.
> 
> This patch set tries to identify the load/store sequences composed of
> load/store pairs that come from memcpy lowering, then accelerates them
> through the NPU's Command Push Pull (CPP) instruction.
> 
> This patch set registers a new optimization pass that runs before the
> actual JIT work. It traverses the eBPF IR; once it finds a candidate
> sequence, it records the memory copy source, destination and length
> information in the first load instruction starting the sequence and marks
> all remaining instructions in the sequence as skippable. Later, when
> JITing the first load instruction, optimal instructions are generated
> using the recorded information.
> 
> For the safety of this transformation:
> 
>   - a jump into the middle of the sequence will cancel the optimization.
> 
>   - overlapped memory access will cancel the optimization.
> 
>   - the load destination register still contains the same value as before
>     the transformation.

Series applied to bpf-next, thanks guys!
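
The pass described in the quoted cover letter can be pictured with a toy model. This is a rough sketch under stated assumptions, not the nfp driver's code: struct demo_insn, demo_is_pair() and demo_mark_memcpy() are hypothetical, the instruction format is heavily simplified, and a jump target simply ends the run here rather than reproducing the driver's exact cancellation rules. The third safety condition (the load destination register keeping its value) is not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_op { DEMO_LD, DEMO_ST, DEMO_OTHER };

/* Hypothetical, simplified instruction record. */
struct demo_insn {
	enum demo_op op;
	int reg;          /* value register: LD destination or ST source */
	int base;         /* base address register */
	int off;          /* byte offset from base */
	int size;         /* access width in bytes */
	bool jump_target; /* some branch lands on this instruction */
	bool skip;        /* set by the pass: emit nothing for it */
	int copy_len;     /* set on the first load: bytes to copy, else 0 */
};

/* True if in[i], in[i+1] form "LD r, [a+x]; ST [b+y], r" of equal width. */
static bool demo_is_pair(const struct demo_insn *in, size_t i, size_t n)
{
	return i + 1 < n &&
	       in[i].op == DEMO_LD && in[i + 1].op == DEMO_ST &&
	       in[i].reg == in[i + 1].reg && in[i].size == in[i + 1].size;
}

/* Find runs of contiguous load/store pairs and fold each into one copy. */
static void demo_mark_memcpy(struct demo_insn *in, size_t n)
{
	size_t i = 0;

	while (i < n) {
		size_t start = i, j, k;
		int len, dist;
		bool overlap;

		/* A branch landing on the first store is a mid-sequence jump. */
		if (!demo_is_pair(in, i, n) || in[i + 1].jump_target) {
			i++;
			continue;
		}

		len = in[start].size;
		j = start + 2;
		/* Extend while the next pair continues both address ranges. */
		while (demo_is_pair(in, j, n) &&
		       !in[j].jump_target && !in[j + 1].jump_target &&
		       in[j].base == in[start].base &&
		       in[j + 1].base == in[start + 1].base &&
		       in[j].off == in[start].off + len &&
		       in[j + 1].off == in[start + 1].off + len) {
			len += in[j].size;
			j += 2;
		}

		/* Same base register and ranges closer than len bytes: overlap. */
		dist = in[start].off - in[start + 1].off;
		if (dist < 0)
			dist = -dist;
		overlap = in[start].base == in[start + 1].base && dist < len;

		/* Require at least two pairs for the rewrite to be worthwhile. */
		if (j - start >= 4 && !overlap) {
			in[start].copy_len = len;
			for (k = start + 1; k < j; k++)
				in[k].skip = true;
		}
		i = j;
	}
}
```

A later code generator would emit a single block copy of copy_len bytes when it reaches the first load and nothing for instructions with skip set.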


Re: [PATCH v4 3/8] MIPS: Octeon: Add a global resource manager.

2017-12-01 Thread David Daney

On 12/01/2017 11:49 AM, Philippe Ombredanne wrote:
> David, Greg,
>
> On Fri, Dec 1, 2017 at 6:42 PM, David Daney wrote:
> > On 11/30/2017 11:53 PM, Philippe Ombredanne wrote:
> > >
> > > [...]
> > > >
> > > > --- /dev/null
> > > > +++ b/arch/mips/cavium-octeon/resource-mgr.c
> > > > @@ -0,0 +1,371 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * Resource manager for Octeon.
> > > > + *
> > > > + * This file is subject to the terms and conditions of the GNU General Public
> > > > + * License.  See the file "COPYING" in the main directory of this archive
> > > > + * for more details.
> > > > + *
> > > > + * Copyright (C) 2017 Cavium, Inc.
> > > > + */
> > >
> > > Since you nicely included an SPDX id, you would not need the
> > > boilerplate anymore. e.g. these can go alright?
> >
> > They may not be strictly speaking necessary, but I don't think they hurt
> > anything.  Unless there is a requirement to strip out the license text, we
> > would stick with it as is.
>
> I think the requirement is there and that would be much better for
> everyone: keeping both is redundant and does not bring any value, does
> it? Instead it kinda removes the benefits of having the SPDX id in the
> first place IMHO.
>
> Furthermore, as there have been already ~12K+ files cleaned up and
> still over 60K files to go, it would be really nice if new files could
> adopt the new style: this way we will not have to revisit and repatch
> them in the future.

I am happy to follow any style Greg would suggest.  There doesn't seem
to be much documentation about how this should be done yet.


David Daney


Re: [PATCH iproute2] iproute2: Fix undeclared __kernel_long_t type build error in RHEL 6.8

2017-12-01 Thread Michal Kubecek
On Fri, Dec 01, 2017 at 08:48:07AM -0800, Stephen Hemminger wrote:
> On Fri,  1 Dec 2017 13:04:51 +0200
> Leon Romanovsky  wrote:
> 
> > From: Leon Romanovsky 
> > 
> > Add asm/posix_types.h header file to the list of needed includes,
> > because the header files in RHEL 6.8 are too old and don't
> > have a declaration of __kernel_long_t.
> > 
> > In file included from ../include/uapi/linux/kernel.h:5,
> >  from ../include/uapi/linux/netfilter/x_tables.h:4,
> >  from ../include/xtables.h:20,
> >  from em_ipset.c:26:
> > ../include/uapi/linux/sysinfo.h:9: error: expected specifier-qualifier-list before ‘__kernel_long_t’
> > 
> > Cc: Riad Abo Raed 
> > Cc: Guy Ergas 
> > Signed-off-by: Leon Romanovsky 
> 
> I see the problem, but the solution of dragging in posix_types.h
> would be too much of a long term maintenance issue.
> All the headers in uapi are regularly generated from upstream
> kernel headers; I don't want to start making exceptions.
> 
> Is it just the xtables stuff (which has always been problematic)?

Actually, the only place where __kernel_long_t and __kernel_ulong_t
appear is struct sysinfo in include/uapi/linux/sysinfo.h, and this
structure isn't even used anywhere in iproute2 source (not even in the
include/uapi/linux/kernel.h file which includes <linux/sysinfo.h>).

So one could work around the problem by defining _LINUX_SYSINFO_H, but
that seems a bit of a dirty hack.

Michal Kubecek
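
The include-guard trick mentioned above can be illustrated in a single file. This is only a sketch of the general technique, assuming nothing about the real headers: DEMO_SYSINFO_H, struct demo_sysinfo, demo_missing_long_t and demo_sysinfo_body_seen() are made-up names standing in for the guard, the structure and the undeclared type.

```c
/*
 * Pre-define the guard, as a -DDEMO_SYSINFO_H compiler flag or an
 * earlier #define in the including file would.
 */
#define DEMO_SYSINFO_H

/* ---- the lines below stand in for the contents of the header ---- */
#ifndef DEMO_SYSINFO_H
#define DEMO_SYSINFO_H
#define DEMO_SYSINFO_BODY_SEEN
struct demo_sysinfo {
	/* This type is undeclared, so compiling the body would fail. */
	demo_missing_long_t uptime;
};
#endif
/* ---- end of stand-in header ---- */

/* Reports whether the guarded body was compiled (1) or skipped (0). */
static int demo_sysinfo_body_seen(void)
{
#ifdef DEMO_SYSINFO_BODY_SEEN
	return 1;
#else
	return 0;
#endif
}
```

Because the guard is already defined, the preprocessor drops the body and the undeclared type is never seen. The price is that the structure becomes unavailable to any code that does need it, which is part of why this counts as a hack.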



[PATCH tip/core/rcu 14/21] netfilter: Remove now-redundant smp_read_barrier_depends()

2017-12-01 Thread Paul E. McKenney
READ_ONCE() now implies smp_read_barrier_depends(), which means that
the instances in arpt_do_table(), ipt_do_table(), and ip6t_do_table()
are now redundant.  This commit removes them and adjusts the comments.

Signed-off-by: Paul E. McKenney 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Cc: "David S. Miller" 
---
 net/ipv4/netfilter/arp_tables.c | 7 +--
 net/ipv4/netfilter/ip_tables.c  | 7 +--
 net/ipv6/netfilter/ip6_tables.c | 7 +--
 3 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index f88221aebc9d..d242c2d29161 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -202,13 +202,8 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
table_base = private->entries;
jumpstack  = (struct arpt_entry **)private->jumpstack[cpu];
 
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 4cbe5e80f3bf..46866cc24a84 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -260,13 +260,8 @@ ipt_do_table(struct sk_buff *skb,
WARN_ON(!(table->valid_hooks & (1 << hook)));
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu= smp_processor_id();
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
table_base = private->entries;
jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
 
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index f06e25065a34..ac1db84722a7 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -282,12 +282,7 @@ ip6t_do_table(struct sk_buff *skb,
 
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu= smp_processor_id();
table_base = private->entries;
jumpstack  = (struct ip6t_entry **)private->jumpstack[cpu];
-- 
2.5.2
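
The access pattern the three hunks converge on can be sketched in userspace. This is an approximation under stated assumptions, not the kernel's definition: DEMO_READ_ONCE is a hypothetical macro built on a volatile access and the GCC/Clang __typeof__ extension, so it only models the "load the pointer exactly once" part. The ordering of the dependent loads behind it comes from the address dependency on most architectures, and the kernel's own READ_ONCE() supplies the extra barrier where that is not enough.

```c
/* Hypothetical approximation: force a single, untorn load of x. */
#define DEMO_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Stand-ins for struct xt_table_info and struct xt_table. */
struct demo_table_info {
	int entries[4];
};

struct demo_table {
	struct demo_table_info *priv; /* may be replaced by an updater */
};

static int demo_first_entry(struct demo_table *table)
{
	/* Fetch the base pointer once. */
	struct demo_table_info *priv = DEMO_READ_ONCE(table->priv);

	/*
	 * Every later access goes through the local copy, so it depends on
	 * the address that was loaded above.
	 */
	return priv->entries[0];
}
```

The point of the local copy is that the compiler cannot reload table->priv between uses, so all member accesses refer to the same table snapshot.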


