Re: [net-next 1/5] qede: use reset to set network header
> Since offset is zero, it's not necessary to use the set function. The
> reset function is straightforward, and will remove the unnecessary add
> operation in the set function.

Thanks,
Acked-by: Yuval Mintz
[For #2, bnx2x, as well]
Re: [PATCH v3 0/3] Add QLogic FastLinQ iSCSI (qedi) driver.
Hi Dave,

Please consider applying the qed patches 1 & 2 to net-next. Patch 3, qedi, goes to SCSI. Martin P. is ok to have you take qedi (already Acked) through your tree. Please let us know which way you prefer.

Thanks,
Manish

On 01/12/16 1:51 PM, "Rangankar, Manish" wrote:

>This series introduces a hardware offload iSCSI initiator driver for the
>41000 Series Converged Network Adapters (579xx chip) by QLogic. The overall
>driver design includes a common module ('qed') and protocol specific
>dependent modules ('qedi' for iSCSI).
>
>This is an open-iscsi driver; modifications to open-iscsi user components
>'iscsid', 'iscsiuio', etc. are required for the solution to work. The user
>space changes are also in the process of being submitted.
>
>https://groups.google.com/forum/#!forum/open-iscsi
>
>The 'qed' common module, under drivers/net/ethernet/qlogic/qed/, is
>enhanced with functionality required for the iSCSI support. This series
>is based on:
>
>net tree base: Merge of net and net-next as of 11/29/2016
>
>Changes from RFC v2:
>
>  1. qedi patches are squashed into a single patch to prevent kbuild robot
>     warnings.
>  2. Fixed 'hw_p_cpuq' incompatible pointer type.
>  3. Fixed sparse incompatible types in comparison expression.
>  4. Misc fixes with the latest 'checkpatch --strict' option.
>  5. Removed the int_mode option from MODULE_PARAM.
>  6. Prefixed all MODULE_PARAM params with qedi_*.
>  7. Use CONFIG_QED_ISCSI instead of CONFIG_QEDI.
>  8. Added bad task mem access fix.
>
>Manish Rangankar (1):
>  qedi: Add QLogic FastLinQ offload iSCSI driver framework.
>
>Yuval Mintz (2):
>  qed: Add support for hardware offloaded iSCSI.
>  qed: Add iSCSI out of order packet handling.
>
> MAINTAINERS                                    |    6 +
> drivers/net/ethernet/qlogic/Kconfig            |    3 +
> drivers/net/ethernet/qlogic/qed/Makefile       |    1 +
> drivers/net/ethernet/qlogic/qed/qed.h          |    8 +-
> drivers/net/ethernet/qlogic/qed/qed_dev.c      |   22 +
> drivers/net/ethernet/qlogic/qed/qed_iscsi.c    | 1277 +
> drivers/net/ethernet/qlogic/qed/qed_iscsi.h    |   52 +
> drivers/net/ethernet/qlogic/qed/qed_ll2.c      |  555 +-
> drivers/net/ethernet/qlogic/qed/qed_ll2.h      |    9 +
> drivers/net/ethernet/qlogic/qed/qed_ooo.c      |  501 +
> drivers/net/ethernet/qlogic/qed/qed_ooo.h      |  176 ++
> drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |    2 +
> drivers/net/ethernet/qlogic/qed/qed_roce.c     |    1 +
> drivers/net/ethernet/qlogic/qed/qed_spq.c      |   24 +
> drivers/scsi/Kconfig                           |    1 +
> drivers/scsi/Makefile                          |    1 +
> drivers/scsi/qedi/Kconfig                      |   10 +
> drivers/scsi/qedi/Makefile                     |    5 +
> drivers/scsi/qedi/qedi.h                       |  364
> drivers/scsi/qedi/qedi_dbg.c                   |  143 ++
> drivers/scsi/qedi/qedi_dbg.h                   |  144 ++
> drivers/scsi/qedi/qedi_debugfs.c               |  244 +++
> drivers/scsi/qedi/qedi_fw.c                    | 2378
> drivers/scsi/qedi/qedi_gbl.h                   |   73 +
> drivers/scsi/qedi/qedi_hsi.h                   |   52 +
> drivers/scsi/qedi/qedi_iscsi.c                 | 1624
> drivers/scsi/qedi/qedi_iscsi.h                 |  232 +++
> drivers/scsi/qedi/qedi_main.c                  | 2127 +
> drivers/scsi/qedi/qedi_sysfs.c                 |   52 +
> drivers/scsi/qedi/qedi_version.h               |   14 +
> include/linux/qed/qed_if.h                     |    2 +
> include/linux/qed/qed_iscsi_if.h               |  229 +++
> 32 files changed, 10303 insertions(+), 29 deletions(-)
> create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.c
> create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.h
> create mode 100644 drivers/net/ethernet/qlogic/qed/qed_ooo.c
> create mode 100644 drivers/net/ethernet/qlogic/qed/qed_ooo.h
> create mode 100644 drivers/scsi/qedi/Kconfig
> create mode 100644 drivers/scsi/qedi/Makefile
> create mode 100644 drivers/scsi/qedi/qedi.h
> create mode 100644 drivers/scsi/qedi/qedi_dbg.c
> create mode 100644 drivers/scsi/qedi/qedi_dbg.h
> create mode 100644 drivers/scsi/qedi/qedi_debugfs.c
> create mode 100644 drivers/scsi/qedi/qedi_fw.c
> create mode 100644 drivers/scsi/qedi/qedi_gbl.h
> create mode 100644 drivers/scsi/qedi/qedi_hsi.h
> create mode 100644 drivers/scsi/qedi/qedi_iscsi.c
> create mode 100644 drivers/scsi/qedi/qedi_iscsi.h
> create mode 100644 drivers/scsi/qedi/qedi_main.c
> create mode 100644 drivers/scsi/qedi/qedi_sysfs.c
> create mode 100644 drivers/scsi/qedi/qedi_version.h
> create mode 100644 include/linux/qed/qed_iscsi_if.h
>
>-- 
>1.8.3.1
>
Re: iproute2 public git outdated?
On Thu, Dec 01, 2016 at 10:39:10PM +0200, Rami Rosen wrote:
> I suggest that you try again now; it seems that the iproute2 git
> repo was updated in the last 2-4 hours, and "git log" on master now
> shows a patch from 30 November (actually it is your "Add notes about
> dropped IPv4 route cache" patch).

Thanks Rami for the heads-up. I noticed yesterday already (I checked
immediately after receiving feedback on some other patch from Stephen).

Cheers, Phil
Re: [PATCH 2/3] netns: add dummy struct inside "struct net_generic"
On Thu, Dec 1, 2016 at 9:42 PM, Cong Wang wrote:
> On Thu, Dec 1, 2016 at 5:12 PM, Alexey Dobriyan wrote:
>>  struct net_generic {
>> -	unsigned int len;
>> -	struct rcu_head rcu;
>> +	struct {
>> +		unsigned int len;
>> +		struct rcu_head rcu;
>> +	} s;
>>
>>  	void *ptr[0];
>>  };
>
> I think you can put them in a union, since rcu is only used
> for kfree_rcu() where len is not needed and rcu_head doesn't
> need to be initialized by callers.

Never mind, readers could still be reading ->len while we modify ->rcu,
so they can't be in a union.
[PATCH net-next v4] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
Implemented RFC7527 Enhanced DAD. IPv6 duplicate address detection can
fail if there is some temporary loopback of Ethernet frames. RFC7527
solves this by including a random nonce in the NS messages used for DAD,
and if an NS is received with the same nonce it is assumed to be a
looped-back DAD probe and is ignored.

RFC7527 is enabled by default. It can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.

Signed-off-by: Erik Nordmark
Signed-off-by: Bob Gilligan
Reviewed-by: Hannes Frederic Sowa
---
v2: renamed sysctl and made it default to true, plus minor code review fixes
v3: respun with later net-next; fixed whitespace issues
v4: fixed kbuild test robot for route.c; added Reviewed-by

 Documentation/networking/ip-sysctl.txt |  9 +
 include/linux/ipv6.h                   |  1 +
 include/net/if_inet6.h                 |  1 +
 include/net/ndisc.h                    |  5 -
 include/uapi/linux/ipv6.h              |  1 +
 net/ipv6/addrconf.c                    | 22 +-
 net/ipv6/ndisc.c                       | 31 ---
 net/ipv6/route.c                       |  2 +-
 8 files changed, 66 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5af48dd..d9ef566 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1729,6 +1729,15 @@ drop_unsolicited_na - BOOLEAN
 	By default this is turned off.
 
+enhanced_dad - BOOLEAN
+	Include a nonce option in the IPv6 neighbor solicitation messages used for
+	duplicate address detection per RFC7527. A received DAD NS will only signal
+	a duplicate address if the nonce is different. This avoids any false
+	detection of duplicates due to loopback of the NS messages that we send.
+	The nonce option will be sent on an interface unless both of
+	conf/{all,interface}/enhanced_dad are set to FALSE.
+	Default: TRUE
+
 icmp/*:
 ratelimit - INTEGER
 	Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 3f95233..671d014 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -68,6 +68,7 @@ struct ipv6_devconf {
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	__s32		seg6_require_hmac;
 #endif
+	__u32		enhanced_dad;
 
 	struct ctl_table_header *sysctl_header;
 };
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index b0576cb..0fa4c32 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -55,6 +55,7 @@ struct inet6_ifaddr {
 	__u8			stable_privacy_retry;
 
 	__u16			scope;
+	__u64			dad_nonce;
 
 	unsigned long		cstamp;	/* created timestamp */
 	unsigned long		tstamp; /* updated timestamp */
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index be1fe228..d562a2f 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -31,6 +31,7 @@ enum {
 	ND_OPT_PREFIX_INFO = 3,		/* RFC2461 */
 	ND_OPT_REDIRECT_HDR = 4,	/* RFC2461 */
 	ND_OPT_MTU = 5,			/* RFC2461 */
+	ND_OPT_NONCE = 14,		/* RFC7527 */
 	__ND_OPT_ARRAY_MAX,
 	ND_OPT_ROUTE_INFO = 24,		/* RFC4191 */
 	ND_OPT_RDNSS = 25,		/* RFC5006 */
@@ -121,6 +122,7 @@ struct ndisc_options {
 #define nd_opts_pi_end		nd_opt_array[__ND_OPT_PREFIX_INFO_END]
 #define nd_opts_rh		nd_opt_array[ND_OPT_REDIRECT_HDR]
 #define nd_opts_mtu		nd_opt_array[ND_OPT_MTU]
+#define nd_opts_nonce		nd_opt_array[ND_OPT_NONCE]
 #define nd_802154_opts_src_lladdr	nd_802154_opt_array[ND_OPT_SOURCE_LL_ADDR]
 #define nd_802154_opts_tgt_lladdr	nd_802154_opt_array[ND_OPT_TARGET_LL_ADDR]
@@ -398,7 +400,8 @@ static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev, cons
 int ndisc_rcv(struct sk_buff *skb);
 
 void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
-		   const struct in6_addr *daddr, const struct in6_addr *saddr);
+		   const struct in6_addr *daddr, const struct in6_addr *saddr,
+		   u64 nonce);
 
 void ndisc_send_rs(struct net_device *dev,
		   const struct in6_addr *saddr, const struct in6_addr *daddr);
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 53561be..eaf65dc 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -181,6 +181,7 @@ enum {
 	DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
 	DEVCONF_SEG6_ENABLED,
 	DEVCONF_SEG6_REQUIRE_HMAC,
+	DEVCONF_ENHANCED_DAD,
 	DEVCONF_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4c387dc..c1e124b 100644
---
Re: [PATCH 2/3] netns: add dummy struct inside "struct net_generic"
On Thu, Dec 1, 2016 at 5:12 PM, Alexey Dobriyan wrote:
>  struct net_generic {
> -	unsigned int len;
> -	struct rcu_head rcu;
> +	struct {
> +		unsigned int len;
> +		struct rcu_head rcu;
> +	} s;
>
>  	void *ptr[0];
>  };

I think you can put them in a union, since rcu is only used
for kfree_rcu() where len is not needed and rcu_head doesn't
need to be initialized by callers.
Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
On 12/1/16 2:28 PM, Hannes Frederic Sowa wrote:

> Reviewed-by: Hannes Frederic Sowa

Thanks - I'll add that. Also got a warning from the kbuild test robot
for route.c which I'll fix. Thanks!

> @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb)
>  have_ifp:
>  	if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
>  		if (dad) {
> +			if (nonce != 0 && ifp->dad_nonce == nonce) {
> +				u8 *np = (u8 *)&ifp->dad_nonce;
> +				/* Matching nonce if looped back */
> +				ND_PRINTK(2, notice,
> +					  "%s: IPv6 DAD loopback for address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n",
> +					  ifp->idev->dev->name,
> +					  &ifp->addr,
> +					  np[0], np[1], np[2], np[3],
> +					  np[4], np[5]);
> +				goto out;
> +			}
>  			/*
>  			 * We are colliding with another node
>  			 * who is doing DAD
>
> I think it could be a "%pM" because it looks like a MAC address, but
> better leave it like that. :)

It is 6 bytes, but it isn't a mac address so I'll leave it.

Thanks,
Erik
Re: linux-next: manual merge of the wireless-drivers-next tree with the net-next tree
Stephen Rothwell writes:

> Hi all,
>
> Today's linux-next merge of the wireless-drivers-next tree got a
> conflict in:
>
>   drivers/net/wireless/ath/ath10k/mac.c
>
> between commit:
>
>   f3fe4e93dd63 ("mac80211: add a HW flag for supporting HW TX fragmentation")
>
> from the net-next tree and commit:
>
>   ff32eeb86aa1 ("ath10k: advertize hardware packet loss mechanism")
>
> from the wireless-drivers-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging. You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

The fix looks good, thanks. I sent a pull request to Dave yesterday which
should fix this.

-- 
Kalle Valo
Re: [PATCH 23/39] Annotate hardware config module parameters in drivers/net/wireless/
David Howells writes:

> When the kernel is running in secure boot mode, we lock down the kernel to
> prevent userspace from modifying the running kernel image. Whilst this
> includes prohibiting access to things like /dev/mem, it must also prevent
> access by means of configuring driver modules in such a way as to cause a
> device to access or modify the kernel image.
>
> To this end, annotate module_param* statements that refer to hardware
> configuration and indicate for future reference what type of parameter they
> specify. The parameter parser in the core sees this information and can
> skip such parameters with an error message if the kernel is locked down.
> The module initialisation then runs as normal, but just sees whatever the
> default values for those parameters is.
>
> Note that we do still need to do the module initialisation because some
> drivers have viable defaults set in case parameters aren't specified and
> some drivers support automatic configuration (e.g. PNP or PCI) in addition
> to manually coded parameters.
>
> This patch annotates drivers in drivers/net/wireless/.
>
> Suggested-by: One Thousand Gnomes
> Signed-off-by: David Howells
> cc: Kalle Valo
> cc: linux-wirel...@vger.kernel.org
> cc: netdev@vger.kernel.org
> ---
>
>  drivers/net/wireless/cisco/airo.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Via which tree are you planning to submit this? Should I take it to
wireless-drivers, or will someone else handle it? I didn't get the cover
letter so I have no idea.

-- 
Kalle Valo
Re: [PATCH main-v4.9-rc7] net/ipv6: allow sysctl to change link-local address generation mode
On Thu, Dec 1, 2016 at 6:14 PM, Felix Jia wrote:
> +static void addrconf_addrgenmode_change(struct net *net)
> +{
> +	struct net_device *dev;
> +	struct inet6_dev *idev;
> +
> +	read_lock(&dev_base_lock);
> +	for_each_netdev(net, dev) {
> +		rcu_read_lock();
> +		idev = __in6_dev_get(dev);
> +		if (idev) {
> +			idev->cnf.addrgenmode = ipv6_devconf_dflt.addrgenmode;
> +			idev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
> +			rtnl_lock();
> +			addrconf_dev_config(idev->dev);
> +			rtnl_unlock();

You can't call rtnl_lock() in atomic context.

> +		}
> +		rcu_read_unlock();
> +	}
> +	read_unlock(&dev_base_lock);

Did you test your patch seriously?
[PATCH net-next] samples/bpf: silence compiler warnings
Silence some of the clang compiler warnings, like:

include/linux/fs.h:2693:9: warning: comparison of unsigned enum expression < 0 is always false
arch/x86/include/asm/processor.h:491:30: warning: taking address of packed member 'sp0' of class or structure 'x86_hw_tss' may result in an unaligned pointer value
include/linux/cgroup-defs.h:326:16: warning: field 'cgrp' with variable sized type 'struct cgroup' not at the end of a struct or class is a GNU extension

since they add too much noise to the samples/bpf/ build.

Signed-off-by: Alexei Starovoitov
---
 samples/bpf/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 3ceb5a9d86df..270e2fd7337a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -158,4 +158,6 @@ $(obj)/%.o: $(src)/%.c
 	$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
 		-D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
 		-Wno-compare-distinct-pointer-types \
+		-Wno-gnu-variable-sized-type-not-at-end \
+		-Wno-address-of-packed-member -Wno-tautological-compare \
 		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
-- 
2.8.0
[PATCH main-v4.9-rc7] net/ipv6: allow sysctl to change link-local address generation mode
The address generation mode for IPv6 link-local addresses can only be
configured by netlink messages. This patch adds the ability to change
the address generation mode via sysctl.

A possible improvement is to remove the addrgenmode variable from the
idev structure and use the sysctl storage for the flag.

The patch is based on v4.9-rc7 in mainline.

Signed-off-by: Felix Jia
Cc: Carl Smith
---
 include/linux/ipv6.h      |  1 +
 include/uapi/linux/ipv6.h |  1 +
 net/ipv6/addrconf.c       | 77 ++-
 3 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index a064997..0d9e5d4 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -64,6 +64,7 @@ struct ipv6_devconf {
 	} stable_secret;
 	__s32		use_oif_addrs_only;
 	__s32		keep_addr_on_down;
+	__s32		addrgenmode;
 
 	struct ctl_table_header *sysctl_header;
 };
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 8c27723..0524e2c 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -178,6 +178,7 @@ enum {
 	DEVCONF_DROP_UNSOLICITED_NA,
 	DEVCONF_KEEP_ADDR_ON_DOWN,
 	DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
+	DEVCONF_ADDRGENMODE,
 	DEVCONF_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4bc5ba3..8d1f6e6 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -238,6 +238,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
 	.use_oif_addrs_only	= 0,
 	.ignore_routes_with_linkdown = 0,
 	.keep_addr_on_down	= 0,
+	.addrgenmode		= IN6_ADDR_GEN_MODE_EUI64,
 };
 
 static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
@@ -284,6 +285,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
 	.use_oif_addrs_only	= 0,
 	.ignore_routes_with_linkdown = 0,
 	.keep_addr_on_down	= 0,
+	.addrgenmode		= IN6_ADDR_GEN_MODE_EUI64,
 };
 
 /* Check if a valid qdisc is available */
@@ -378,7 +380,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 	if (ndev->cnf.stable_secret.initialized)
 		ndev->addr_gen_mode = IN6_ADDR_GEN_MODE_STABLE_PRIVACY;
 	else
-		ndev->addr_gen_mode = IN6_ADDR_GEN_MODE_EUI64;
+		ndev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
 
 	ndev->cnf.mtu6 = dev->mtu;
 	ndev->nd_parms = neigh_parms_alloc(dev, &nd_tbl);
@@ -4950,6 +4952,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
 	array[DEVCONF_DROP_UNICAST_IN_L2_MULTICAST] = cnf->drop_unicast_in_l2_multicast;
 	array[DEVCONF_DROP_UNSOLICITED_NA] = cnf->drop_unsolicited_na;
 	array[DEVCONF_KEEP_ADDR_ON_DOWN] = cnf->keep_addr_on_down;
+	array[DEVCONF_ADDRGENMODE] = cnf->addrgenmode;
 }
 
 static inline size_t inet6_ifla6_size(void)
@@ -5496,6 +5499,71 @@ int addrconf_sysctl_mtu(struct ctl_table *ctl, int write,
 	return proc_dointvec_minmax(&lctl, write, buffer, lenp, ppos);
 }
 
+static void addrconf_addrgenmode_change(struct net *net)
+{
+	struct net_device *dev;
+	struct inet6_dev *idev;
+
+	read_lock(&dev_base_lock);
+	for_each_netdev(net, dev) {
+		rcu_read_lock();
+		idev = __in6_dev_get(dev);
+		if (idev) {
+			idev->cnf.addrgenmode = ipv6_devconf_dflt.addrgenmode;
+			idev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
+			rtnl_lock();
+			addrconf_dev_config(idev->dev);
+			rtnl_unlock();
+		}
+		rcu_read_unlock();
+	}
+	read_unlock(&dev_base_lock);
+}
+
+static int addrconf_sysctl_addrgenmode(struct ctl_table *ctl, int write,
+				       void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int new_val;
+	struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
+	struct net *net = (struct net *)ctl->extra2;
+
+	if (write) { /* sysctl write request */
+		ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
+		new_val = *((int *)ctl->data);
+
+		/* request for the all */
+		if (&net->ipv6.devconf_all->addrgenmode == ctl->data) {
+			ipv6_devconf_dflt.addrgenmode = new_val;
+			addrconf_addrgenmode_change(net);
+
+		/* request for default */
+		} else if (&net->ipv6.devconf_dflt->addrgenmode == ctl->data) {
+			ipv6_devconf_dflt.addrgenmode = new_val;
+
+		/* request for individual inet device */
+		} else {
+			if (!idev) {
+				return ret;
+			}
+			if
[net-next 2/5] bnx2x: use reset to set network header
Since offset is zero, it's not necessary to use the set function. The reset function is straightforward, and will remove the unnecessary add operation in the set function.

Signed-off-by: Zhang Shengju
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 3fd36b4..3e199d3 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -724,7 +724,7 @@ static void bnx2x_gro_ipv6_csum(struct bnx2x *bp, struct sk_buff *skb)
 static void bnx2x_gro_csum(struct bnx2x *bp, struct sk_buff *skb,
 			   void (*gro_func)(struct bnx2x*, struct sk_buff*))
 {
-	skb_set_network_header(skb, 0);
+	skb_reset_network_header(skb);
 	gro_func(bp, skb);
 	tcp_gro_complete(skb);
 }
-- 
1.8.3.1
[net-next 3/5] mlx4: use reset to set mac header
Since offset is zero, it's not necessary to use the set function. The reset function is straightforward, and will remove the unnecessary add operation in the set function.

Signed-off-by: Zhang Shengju
---
 drivers/net/ethernet/mellanox/mlx4/en_selftest.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_selftest.c b/drivers/net/ethernet/mellanox/mlx4/en_selftest.c
index c06346a..95290e1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_selftest.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_selftest.c
@@ -68,7 +68,7 @@ static int mlx4_en_test_loopback_xmit(struct mlx4_en_priv *priv)
 	memcpy(ethh->h_dest, priv->dev->dev_addr, ETH_ALEN);
 	eth_zero_addr(ethh->h_source);
 	ethh->h_proto = htons(ETH_P_ARP);
-	skb_set_mac_header(skb, 0);
+	skb_reset_mac_header(skb);
 
 	for (i = 0; i < packet_size; ++i)	/* fill our packet */
 		packet[i] = (unsigned char)(i & 0xff);
-- 
1.8.3.1
[net-next 1/5] qede: use reset to set network header
Since offset is zero, it's not necessary to use the set function. The reset function is straightforward, and will remove the unnecessary add operation in the set function.

Signed-off-by: Zhang Shengju
---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index b84a2c4..31c8449 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -1220,7 +1220,7 @@ static void qede_gro_receive(struct qede_dev *edev,
 #ifdef CONFIG_INET
 	if (skb_shinfo(skb)->gso_size) {
-		skb_set_network_header(skb, 0);
+		skb_reset_network_header(skb);
 
 		switch (skb->protocol) {
 		case htons(ETH_P_IP):
-- 
1.8.3.1
[net-next 0/5] use reset to set headers
This patch series replaces the 'set' functions with 'reset', since the offset is zero. It's not necessary to use set; the reset functions are straightforward, and remove the unnecessary add operation in the set functions.

Zhang Shengju (5):
  qede: use reset to set network header
  bnx2x: use reset to set network header
  mlx4: use reset to set mac header
  iwlwifi: use reset to set transport header
  staging: wilc1000: use reset to set mac header

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c  | 2 +-
 drivers/net/ethernet/mellanox/mlx4/en_selftest.c | 2 +-
 drivers/net/ethernet/qlogic/qede/qede_main.c     | 2 +-
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c     | 2 +-
 drivers/staging/wilc1000/linux_mon.c             | 4 ++--
 5 files changed, 6 insertions(+), 6 deletions(-)

-- 
1.8.3.1
[net-next 5/5] staging: wilc1000: use reset to set mac header
Since offset is zero, it's not necessary to use the set function. The reset function is straightforward, and will remove the unnecessary add operation in the set function.

Signed-off-by: Zhang Shengju
---
 drivers/staging/wilc1000/linux_mon.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/wilc1000/linux_mon.c b/drivers/staging/wilc1000/linux_mon.c
index 242f82f..f328d75 100644
--- a/drivers/staging/wilc1000/linux_mon.c
+++ b/drivers/staging/wilc1000/linux_mon.c
@@ -111,7 +111,7 @@ void WILC_WFI_monitor_rx(u8 *buff, u32 size)
 	}
 
 	skb->dev = wilc_wfi_mon;
-	skb_set_mac_header(skb, 0);
+	skb_reset_mac_header(skb);
 	skb->ip_summed = CHECKSUM_UNNECESSARY;
 	skb->pkt_type = PACKET_OTHERHOST;
 	skb->protocol = htons(ETH_P_802_2);
@@ -215,7 +215,7 @@ static netdev_tx_t WILC_WFI_mon_xmit(struct sk_buff *skb,
 	cb_hdr->tx_flags = 0x0004;
 
 	skb2->dev = wilc_wfi_mon;
-	skb_set_mac_header(skb2, 0);
+	skb_reset_mac_header(skb2);
 	skb2->ip_summed = CHECKSUM_UNNECESSARY;
 	skb2->pkt_type = PACKET_OTHERHOST;
 	skb2->protocol = htons(ETH_P_802_2);
-- 
1.8.3.1
[net-next 4/5] iwlwifi: use reset to set transport header
Since offset is zero, it's not necessary to use the set function. The reset function is straightforward, and will remove the unnecessary add operation in the set function.

Signed-off-by: Zhang Shengju
---
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
index 5f840f1..e44e5ad 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
@@ -2196,7 +2196,7 @@ static int iwl_fill_data_tbs_amsdu(struct iwl_trans *trans, struct sk_buff *skb,
 			memcpy(skb_put(csum_skb, tcp_hdrlen(skb)),
 			       tcph, tcp_hdrlen(skb));
-			skb_set_transport_header(csum_skb, 0);
+			skb_reset_transport_header(csum_skb);
 			csum_skb->csum_start =
 				(unsigned char *)tcp_hdr(csum_skb) -
 				csum_skb->head;
-- 
1.8.3.1
Re: [PATCH v8 3/8] thunderbolt: Communication with the ICM (firmware)
On 09/28/2016 07:44 AM, Amir Levy wrote:
> This patch provides the communication protocol between the Intel
> Connection Manager (ICM) firmware that is operational in the Thunderbolt
> controller in non-Apple hardware. The ICM firmware-based controller is
> used for establishing and maintaining the Thunderbolt Networking
> connection - we need to be able to communicate with it.

I'm a bit late to the party, but here goes. I have two big questions:

1. Why is this using netlink at all? A system has zero or more Thunderbolt controllers, they're probed just like any other PCI devices (by nhi_probe() if I'm understanding correctly), they'll have nodes in sysfs, etc. Shouldn't there be a simple char device per Thunderbolt controller that a daemon can connect to? This will clean up lots of things:

a) You can actually enforce one-daemon-at-a-time in a very natural way. Your current code seems to try, but it's rather buggy. Your subscription count is a guess, your unsubscribe is entirely unchecked, and you are entirely unable to detect if a daemon crashes AFAICT.

b) You won't need all of the complexity that's currently there to figure out *which* Thunderbolt device a daemon is talking to.

c) You can use regular ioctl passing *structs* instead of netlink attrs. There's nothing wrong with netlink attrs, except that your driver seems to have a whole lot of boilerplate that just converts back and forth to regular structures.

d) The userspace code that does stuff like "send message, wait 150ms, receive reply, complain if no reply" goes away because ioctl is synchronous. (Or you can use read and write, but it's still simpler.)

e) You could have one daemon per Thunderbolt device if you were so inclined.

f) You get privilege separation in userspace. Creating a netlink socket and dropping privilege is busted^Winteresting. Opening a device node and dropping privilege works quite nicely.

2. Why do you need a daemon anyway? Functionally, what exactly does it do?
(Okay, I get that it seems to talk to a giant pile of code running in SMM, and I get that Intel, for some bizarre reason, wants everyone except Apple to use this code in SMM, and that Apple (for entirely understandable reasons) turned it off, but that's beside the point. What does the user code do that's useful and that the kernel can't do all by itself? The only really interesting bit I can see is the part that approves PCI devices. I'm not going to review this in detail, but here's a tiny bit: +static int nhi_genl_unsubscribe(__always_unused struct sk_buff *u_skb, + __always_unused struct genl_info *info) +{ + atomic_dec_if_positive(); + + return 0; +} + This, for example, is really quite buggy. This entire function here: +static int nhi_genl_query_information(__always_unused struct sk_buff *u_skb, + struct genl_info *info) +{ + struct tbt_nhi_ctxt *nhi_ctxt; + struct sk_buff *skb; + bool msg_too_long; + int res = -ENODEV; + u32 *msg_head; + + if (!info || !info->userhdr) + return -EINVAL; + + skb = genlmsg_new(NLMSG_ALIGN(nhi_genl_family.hdrsize) + + nla_total_size(sizeof(DRV_VERSION)) + + nla_total_size(sizeof(nhi_ctxt->nvm_ver_offset)) + + nla_total_size(sizeof(nhi_ctxt->num_ports)) + + nla_total_size(sizeof(nhi_ctxt->dma_port)) + + nla_total_size(0),/* nhi_ctxt->support_full_e2e */ + GFP_KERNEL); + if (!skb) + return -ENOMEM; + + msg_head = genlmsg_put_reply(skb, info, _genl_family, 0, +NHI_CMD_QUERY_INFORMATION); + if (!msg_head) { + res = -ENOMEM; + goto genl_put_reply_failure; + } + + if (mutex_lock_interruptible(_list_mutex)) { + res = -ERESTART; + goto genl_put_reply_failure; + } + + nhi_ctxt = nhi_search_ctxt(*(u32 *)info->userhdr); + if (nhi_ctxt && !nhi_ctxt->d0_exit) { + *msg_head = nhi_ctxt->id; + + msg_too_long = !!nla_put_string(skb, NHI_ATTR_DRV_VERSION, + DRV_VERSION); + + msg_too_long = msg_too_long || + nla_put_u16(skb, NHI_ATTR_NVM_VER_OFFSET, + nhi_ctxt->nvm_ver_offset); + + msg_too_long = msg_too_long || + nla_put_u8(skb, 
NHI_ATTR_NUM_PORTS, + nhi_ctxt->num_ports); + + msg_too_long = msg_too_long || + nla_put_u8(skb, NHI_ATTR_DMA_PORT, + nhi_ctxt->dma_port); + + if (msg_too_long) { + res = -EMSGSIZE; +
[PATCH net-next 2/8] drivers: net: xgene: Configure classifier with pagepool
This patch configures classifier with the pagepool information. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_cle.c | 16 ++-- drivers/net/ethernet/apm/xgene/xgene_enet_cle.h | 2 ++ drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 7 +-- drivers/net/ethernet/apm/xgene/xgene_enet_hw.h| 6 -- drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 11 +-- drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 3 ++- drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 9 ++--- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 7 +-- 8 files changed, 47 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c index 7aac0fb..caa55bd 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c @@ -52,6 +52,7 @@ static void xgene_cle_dbptr_to_hw(struct xgene_enet_pdata *pdata, { buf[0] = SET_VAL(CLE_DROP, dbptr->drop); buf[4] = SET_VAL(CLE_FPSEL, dbptr->fpsel) | +SET_VAL(CLE_NFPSEL, dbptr->nxtfpsel) | SET_VAL(CLE_DSTQIDL, dbptr->dstqid); buf[5] = SET_VAL(CLE_DSTQIDH, (u32)dbptr->dstqid >> CLE_DSTQIDL_LEN) | @@ -349,8 +350,12 @@ static int xgene_cle_set_rss_idt(struct xgene_enet_pdata *pdata) fpsel = xgene_enet_get_fpsel(pool_id); dstqid = xgene_enet_dst_ring_num(pdata->rx_ring[idx]); nfpsel = 0; - idt_reg = 0; + if (pdata->rx_ring[idx]->page_pool) { + pool_id = pdata->rx_ring[idx]->page_pool->id; + nfpsel = xgene_enet_get_fpsel(pool_id); + } + idt_reg = 0; xgene_cle_idt_to_hw(pdata, dstqid, fpsel, nfpsel, _reg); ret = xgene_cle_dram_wr(>cle, _reg, 1, i, RSS_IDT, CLE_CMD_WR); @@ -400,9 +405,9 @@ static int xgene_cle_setup_rss(struct xgene_enet_pdata *pdata) static int xgene_enet_cle_init(struct xgene_enet_pdata *pdata) { struct xgene_enet_cle *enet_cle = >cle; + u32 def_qid, def_fpsel, def_nxtfpsel, pool_id; struct xgene_cle_dbptr dbptr[DB_MAX_PTRS]; struct xgene_cle_ptree_branch *br; - u32 
def_qid, def_fpsel, pool_id; struct xgene_cle_ptree *ptree; struct xgene_cle_ptree_kn kn; int ret; @@ -707,13 +712,20 @@ static int xgene_enet_cle_init(struct xgene_enet_pdata *pdata) def_qid = xgene_enet_dst_ring_num(pdata->rx_ring[0]); pool_id = pdata->rx_ring[0]->buf_pool->id; def_fpsel = xgene_enet_get_fpsel(pool_id); + def_nxtfpsel = 0; + if (pdata->rx_ring[0]->page_pool) { + pool_id = pdata->rx_ring[0]->page_pool->id; + def_nxtfpsel = xgene_enet_get_fpsel(pool_id); + } memset(dbptr, 0, sizeof(struct xgene_cle_dbptr) * DB_MAX_PTRS); dbptr[DB_RES_ACCEPT].fpsel = def_fpsel; + dbptr[DB_RES_ACCEPT].nxtfpsel = def_nxtfpsel; dbptr[DB_RES_ACCEPT].dstqid = def_qid; dbptr[DB_RES_ACCEPT].cle_priority = 1; dbptr[DB_RES_DEF].fpsel = def_fpsel; + dbptr[DB_RES_DEF].nxtfpsel = def_nxtfpsel; dbptr[DB_RES_DEF].dstqid = def_qid; dbptr[DB_RES_DEF].cle_priority = 7; xgene_cle_setup_def_dbptr(pdata, enet_cle, [DB_RES_DEF], diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.h b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.h index 9ac9f8e..903be0c 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.h @@ -91,6 +91,8 @@ #define CLE_DSTQIDH_LEN5 #define CLE_FPSEL_POS 21 #define CLE_FPSEL_LEN 4 +#define CLE_NFPSEL_POS 17 +#define CLE_NFPSEL_LEN 4 #define CLE_PRIORITY_POS 5 #define CLE_PRIORITY_LEN 3 diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index 1007074..c395df3 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -550,12 +550,14 @@ static void xgene_enet_config_ring_if_assoc(struct xgene_enet_pdata *pdata) } static void xgene_enet_cle_bypass(struct xgene_enet_pdata *pdata, - u32 dst_ring_num, u16 bufpool_id) + u32 dst_ring_num, u16 bufpool_id, + u16 nxtbufpool_id) { u32 cb; - u32 fpsel; + u32 fpsel, nxtfpsel; fpsel = xgene_enet_get_fpsel(bufpool_id); + nxtfpsel = 
xgene_enet_get_fpsel(nxtbufpool_id); xgene_enet_rd_csr(pdata, CLE_BYPASS_REG0_0_ADDR, ); cb |= CFG_CLE_BYPASS_EN0; @@ -565,6 +567,7 @@ static void xgene_enet_cle_bypass(struct
[PATCH net-next 5/8] drivers: net: xgene: fix: RSS for non-TCP/UDP
This patch fixes RSS feature, for non-TCP/UDP packets. Signed-off-by: Khuong DinhSigned-off-by: Iyappan Subramanian --- drivers/net/ethernet/apm/xgene/xgene_enet_cle.c | 90 - drivers/net/ethernet/apm/xgene/xgene_enet_cle.h | 1 + 2 files changed, 89 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c index caa55bd..1dc6c20 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c @@ -485,11 +485,11 @@ static int xgene_enet_cle_init(struct xgene_enet_pdata *pdata) }, { .valid = 0, - .next_packet_pointer = 260, + .next_packet_pointer = 26, .jump_bw = JMP_FW, .jump_rel = JMP_ABS, .operation = EQT, - .next_node = LAST_NODE, + .next_node = RSS_IPV4_OTHERS_NODE, .next_branch = 0, .data = 0x0, .mask = 0x @@ -667,6 +667,92 @@ static int xgene_enet_cle_init(struct xgene_enet_pdata *pdata) } }, { + /* RSS_IPV4_OTHERS_NODE */ + .node_type = EWDN, + .last_node = 0, + .hdr_len_store = 1, + .hdr_extn = NO_BYTE, + .byte_store = NO_BYTE, + .search_byte_store = BOTH_BYTES, + .result_pointer = DB_RES_DROP, + .num_branches = 6, + .branch = { + { + /* SRC IPV4 B01 */ + .valid = 0, + .next_packet_pointer = 28, + .jump_bw = JMP_FW, + .jump_rel = JMP_ABS, + .operation = EQT, + .next_node = RSS_IPV4_OTHERS_NODE, + .next_branch = 1, + .data = 0x0, + .mask = 0x + }, + { + /* SRC IPV4 B23 */ + .valid = 0, + .next_packet_pointer = 30, + .jump_bw = JMP_FW, + .jump_rel = JMP_ABS, + .operation = EQT, + .next_node = RSS_IPV4_OTHERS_NODE, + .next_branch = 2, + .data = 0x0, + .mask = 0x + }, + { + /* DST IPV4 B01 */ + .valid = 0, + .next_packet_pointer = 32, + .jump_bw = JMP_FW, + .jump_rel = JMP_ABS, + .operation = EQT, + .next_node = RSS_IPV4_OTHERS_NODE, + .next_branch = 3, + .data = 0x0, + .mask = 0x + }, + { + /* DST IPV4 B23 */ + .valid = 0, + .next_packet_pointer = 34, + .jump_bw = JMP_FW, + .jump_rel = JMP_ABS, + .operation = EQT, + .next_node = 
RSS_IPV4_OTHERS_NODE, + .next_branch = 4, + .data = 0x0, + .mask = 0x + }, + { + /* TCP SRC Port */ + .valid = 0, + .next_packet_pointer = 36, + .jump_bw = JMP_FW, +
[PATCH net-next 7/8] drivers: net: xgene: Add flow control initialization
This patch adds flow control/pause frame initialization and advertising capabilities. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 57 +++ drivers/net/ethernet/apm/xgene/xgene_enet_hw.h| 7 +++ drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 44 - drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 17 +++ drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h | 7 +++ 5 files changed, 131 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index 23a0175..06e6816 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -577,6 +577,17 @@ static void xgene_gmac_init(struct xgene_enet_pdata *pdata) /* Rtype should be copied from FP */ xgene_enet_wr_csr(pdata, RSIF_RAM_DBG_REG0_ADDR, 0); + /* Configure HW pause frame generation */ + xgene_enet_rd_mcx_csr(pdata, CSR_MULTI_DPF0_ADDR, ); + value = (DEF_QUANTA << 16) | (value & 0x); + xgene_enet_wr_mcx_csr(pdata, CSR_MULTI_DPF0_ADDR, value); + + xgene_enet_wr_csr(pdata, RXBUF_PAUSE_THRESH, DEF_PAUSE_THRES); + xgene_enet_wr_csr(pdata, RXBUF_PAUSE_OFF_THRESH, DEF_PAUSE_OFF_THRES); + + xgene_gmac_flowctl_tx(pdata, pdata->tx_pause); + xgene_gmac_flowctl_rx(pdata, pdata->rx_pause); + /* Rx-Tx traffic resume */ xgene_enet_wr_csr(pdata, CFG_LINK_AGGR_RESUME_0_ADDR, TX_PORT0); @@ -749,6 +760,48 @@ static void xgene_gport_shutdown(struct xgene_enet_pdata *pdata) } } +static u32 xgene_enet_flowctrl_cfg(struct net_device *ndev) +{ + struct xgene_enet_pdata *pdata = netdev_priv(ndev); + struct phy_device *phydev = ndev->phydev; + u16 lcladv, rmtadv = 0; + u32 rx_pause, tx_pause; + u8 flowctl = 0; + + if (!phydev->duplex || !pdata->pause_autoneg) + return 0; + + if (pdata->tx_pause) + flowctl |= FLOW_CTRL_TX; + + if (pdata->rx_pause) + flowctl |= FLOW_CTRL_RX; + + lcladv = mii_advertise_flowctrl(flowctl); + + if (phydev->pause) + 
rmtadv = LPA_PAUSE_CAP; + + if (phydev->asym_pause) + rmtadv |= LPA_PAUSE_ASYM; + + flowctl = mii_resolve_flowctrl_fdx(lcladv, rmtadv); + tx_pause = !!(flowctl & FLOW_CTRL_TX); + rx_pause = !!(flowctl & FLOW_CTRL_RX); + + if (tx_pause != pdata->tx_pause) { + pdata->tx_pause = tx_pause; + pdata->mac_ops->flowctl_tx(pdata, pdata->tx_pause); + } + + if (rx_pause != pdata->rx_pause) { + pdata->rx_pause = rx_pause; + pdata->mac_ops->flowctl_rx(pdata, pdata->rx_pause); + } + + return 0; +} + static void xgene_enet_adjust_link(struct net_device *ndev) { struct xgene_enet_pdata *pdata = netdev_priv(ndev); @@ -763,6 +816,8 @@ static void xgene_enet_adjust_link(struct net_device *ndev) mac_ops->tx_enable(pdata); phy_print_status(phydev); } + + xgene_enet_flowctrl_cfg(ndev); } else { mac_ops->rx_disable(pdata); mac_ops->tx_disable(pdata); @@ -836,6 +891,8 @@ int xgene_enet_phy_connect(struct net_device *ndev) phy_dev->supported &= ~SUPPORTED_10baseT_Half & ~SUPPORTED_100baseT_Half & ~SUPPORTED_1000baseT_Half; + phy_dev->supported |= SUPPORTED_Pause | + SUPPORTED_Asym_Pause; phy_dev->advertising = phy_dev->supported; return 0; diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h index 7ba649d..5f83037 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h @@ -172,6 +172,13 @@ enum xgene_enet_rm { #define CFG_CLE_FPSEL0(val)(((val) << 16) & GENMASK(19, 16)) #define CSR_ECM_CFG_0_ADDR 0x0220 #define CSR_ECM_CFG_1_ADDR 0x0224 +#define CSR_MULTI_DPF0_ADDR0x0230 +#define RXBUF_PAUSE_THRESH 0x0534 +#define RXBUF_PAUSE_OFF_THRESH 0x0540 +#define DEF_PAUSE_THRES0x7d +#define DEF_PAUSE_OFF_THRES0x6d +#define DEF_QUANTA 0x8000 +#define NORM_PAUSE_OPCODE 0x0001 #define PAUSE_XON_EN BIT(30) #define MULTI_DPF_AUTOCTRL BIT(28) #define CFG_CLE_NXTFPSEL0(val) (((val) << 20) & GENMASK(23, 20)) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c index 1c9b8ba..a8e063b 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c +++
[PATCH net-next 3/8] drivers: net: xgene: Add support for Jumbo frame
This patch adds support for jumbo frame, by allocating additional buffer (page) pool and configuring the hardware. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 3 + drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 305 -- drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 20 +- drivers/net/ethernet/apm/xgene/xgene_enet_ring2.c | 1 + drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 3 + drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 4 + 6 files changed, 311 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index c395df3..fc9010f 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -679,6 +679,9 @@ static void xgene_gport_shutdown(struct xgene_enet_pdata *pdata) for (i = 0; i < pdata->rxq_cnt; i++) { ring = pdata->rx_ring[i]->buf_pool; pb |= BIT(xgene_enet_get_fpsel(ring->id)); + ring = pdata->rx_ring[i]->page_pool; + if (ring) + pb |= BIT(xgene_enet_get_fpsel(ring->id)); } xgene_enet_wr_ring_if(pdata, ENET_CFGSSQMIFPRESET_ADDR, pb); diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c index c89acf5..698df27 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c @@ -37,6 +37,9 @@ static void xgene_enet_init_bufpool(struct xgene_enet_desc_ring *buf_pool) struct xgene_enet_raw_desc16 *raw_desc; int i; + if (!buf_pool) + return; + for (i = 0; i < buf_pool->slots; i++) { raw_desc = _pool->raw_desc16[i]; @@ -47,6 +50,86 @@ static void xgene_enet_init_bufpool(struct xgene_enet_desc_ring *buf_pool) } } +static u16 xgene_enet_get_data_len(u64 bufdatalen) +{ + u16 hw_len, mask; + + hw_len = GET_VAL(BUFDATALEN, bufdatalen); + + if (unlikely(hw_len == 0x7800)) { + return 0; + } else if (!(hw_len & BIT(14))) { + mask = GENMASK(13, 0); 
+ return (hw_len & mask) ? (hw_len & mask) : SIZE_16K; + } else if (!(hw_len & GENMASK(13, 12))) { + mask = GENMASK(11, 0); + return (hw_len & mask) ? (hw_len & mask) : SIZE_4K; + } else { + mask = GENMASK(11, 0); + return (hw_len & mask) ? (hw_len & mask) : SIZE_2K; + } +} + +static u16 xgene_enet_set_data_len(u32 size) +{ + u16 hw_len; + + hw_len = (size == SIZE_4K) ? BIT(14) : 0; + + return hw_len; +} + +static int xgene_enet_refill_pagepool(struct xgene_enet_desc_ring *buf_pool, + u32 nbuf) +{ + struct xgene_enet_raw_desc16 *raw_desc; + struct xgene_enet_pdata *pdata; + struct net_device *ndev; + dma_addr_t dma_addr; + struct device *dev; + struct page *page; + u32 slots, tail; + u16 hw_len; + int i; + + if (unlikely(!buf_pool)) + return 0; + + ndev = buf_pool->ndev; + pdata = netdev_priv(ndev); + dev = ndev_to_dev(ndev); + slots = buf_pool->slots - 1; + tail = buf_pool->tail; + + for (i = 0; i < nbuf; i++) { + raw_desc = _pool->raw_desc16[tail]; + + page = dev_alloc_page(); + if (unlikely(!page)) + return -ENOMEM; + + dma_addr = dma_map_page(dev, page, 0, + PAGE_SIZE, DMA_FROM_DEVICE); + if (unlikely(dma_mapping_error(dev, dma_addr))) { + put_page(page); + return -ENOMEM; + } + + hw_len = xgene_enet_set_data_len(PAGE_SIZE); + raw_desc->m1 = cpu_to_le64(SET_VAL(DATAADDR, dma_addr) | + SET_VAL(BUFDATALEN, hw_len) | + SET_BIT(COHERENT)); + + buf_pool->frag_page[tail] = page; + tail = (tail + 1) & slots; + } + + pdata->ring_ops->wr_cmd(buf_pool, nbuf); + buf_pool->tail = tail; + + return 0; +} + static int xgene_enet_refill_bufpool(struct xgene_enet_desc_ring *buf_pool, u32 nbuf) { @@ -64,8 +147,9 @@ static int xgene_enet_refill_bufpool(struct xgene_enet_desc_ring *buf_pool, ndev = buf_pool->ndev; dev = ndev_to_dev(buf_pool->ndev); pdata = netdev_priv(ndev); + bufdatalen = BUF_LEN_CODE_2K | (SKB_BUFFER_SIZE & GENMASK(11, 0)); - len = XGENE_ENET_MAX_MTU; + len = XGENE_ENET_STD_MTU; for (i = 0; i < nbuf; i++) { raw_desc =
[PATCH net-next 4/8] drivers: net: xgene: Add change_mtu function
This patch implements ndo_change_mtu() callback function that enables mtu change. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 6 ++ drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 20 drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 1 + drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 6 ++ drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 7 +++ 5 files changed, 40 insertions(+) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index fc9010f..92cc7e5 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -504,6 +504,11 @@ static void xgene_gmac_set_speed(struct xgene_enet_pdata *pdata) xgene_enet_wr_mcx_csr(pdata, ICM_CONFIG2_REG_0_ADDR, icm2); } +static void xgene_enet_set_frame_size(struct xgene_enet_pdata *pdata, int size) +{ + xgene_enet_wr_mcx_mac(pdata, MAX_FRAME_LEN_ADDR, size); +} + static void xgene_gmac_init(struct xgene_enet_pdata *pdata) { u32 value; @@ -903,6 +908,7 @@ void xgene_enet_mdio_remove(struct xgene_enet_pdata *pdata) .tx_disable = xgene_gmac_tx_disable, .set_speed = xgene_gmac_set_speed, .set_mac_addr = xgene_gmac_set_mac_addr, + .set_framesize = xgene_enet_set_frame_size, }; const struct xgene_port_ops xgene_gport_ops = { diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c index 698df27..6c7eea8 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c @@ -1500,12 +1500,31 @@ static int xgene_enet_set_mac_address(struct net_device *ndev, void *addr) return ret; } +static int xgene_change_mtu(struct net_device *ndev, int new_mtu) +{ + struct xgene_enet_pdata *pdata = netdev_priv(ndev); + int frame_size; + + if (!netif_running(ndev)) + return 0; + + frame_size = (new_mtu > ETH_DATA_LEN) ? 
(new_mtu + 18) : 0x600; + + xgene_enet_close(ndev); + ndev->mtu = new_mtu; + pdata->mac_ops->set_framesize(pdata, frame_size); + xgene_enet_open(ndev); + + return 0; +} + static const struct net_device_ops xgene_ndev_ops = { .ndo_open = xgene_enet_open, .ndo_stop = xgene_enet_close, .ndo_start_xmit = xgene_enet_start_xmit, .ndo_tx_timeout = xgene_enet_timeout, .ndo_get_stats64 = xgene_enet_get_stats64, + .ndo_change_mtu = xgene_change_mtu, .ndo_set_mac_address = xgene_enet_set_mac_address, }; @@ -1832,6 +1851,7 @@ static int xgene_enet_init_hw(struct xgene_enet_pdata *pdata) buf_pool->id, ring_id); } + ndev->max_mtu = XGENE_ENET_MAX_MTU; pdata->phy_speed = SPEED_UNKNOWN; pdata->mac_ops->init(pdata); diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h index 023ed76..61aa987 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h @@ -156,6 +156,7 @@ struct xgene_mac_ops { void (*rx_disable)(struct xgene_enet_pdata *pdata); void (*set_speed)(struct xgene_enet_pdata *pdata); void (*set_mac_addr)(struct xgene_enet_pdata *pdata); + void (*set_framesize)(struct xgene_enet_pdata *pdata, int framesize); void (*set_mss)(struct xgene_enet_pdata *pdata, u16 mss, u8 index); void (*link_state)(struct work_struct *work); }; diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c index a10ab64..6283f2b 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c @@ -343,6 +343,11 @@ static void xgene_sgmac_set_speed(struct xgene_enet_pdata *p) xgene_enet_wr_mcx_csr(p, icm2_addr, icm2); } +static void xgene_sgmac_set_frame_size(struct xgene_enet_pdata *pdata, int size) +{ + xgene_enet_wr_mac(pdata, MAX_FRAME_LEN_ADDR, size); +} + static void xgene_sgmii_enable_autoneg(struct xgene_enet_pdata *p) { u32 data, loop = 10; @@ -595,6 +600,7 @@ static 
void xgene_enet_link_state(struct work_struct *work) .tx_disable = xgene_sgmac_tx_disable, .set_speed = xgene_sgmac_set_speed, .set_mac_addr = xgene_sgmac_set_mac_addr, + .set_framesize = xgene_sgmac_set_frame_size, .link_state = xgene_enet_link_state }; diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c b/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c index 4109776..2a9761b 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c @@ -250,6 +250,12 @@
[PATCH net-next 1/8] drivers: net: xgene: Add helper function
This is a prepartion patch and adds xgene_enet_get_fpsel() helper function to get buffer pool number. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_cle.c | 4 ++-- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 19 +++ drivers/net/ethernet/apm/xgene/xgene_enet_hw.h| 8 drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 20 +++- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 20 +++- 5 files changed, 31 insertions(+), 40 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c index 23d72af..7aac0fb 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c @@ -346,7 +346,7 @@ static int xgene_cle_set_rss_idt(struct xgene_enet_pdata *pdata) for (i = 0; i < XGENE_CLE_IDT_ENTRIES; i++) { idx = i % pdata->rxq_cnt; pool_id = pdata->rx_ring[idx]->buf_pool->id; - fpsel = xgene_enet_ring_bufnum(pool_id) - 0x20; + fpsel = xgene_enet_get_fpsel(pool_id); dstqid = xgene_enet_dst_ring_num(pdata->rx_ring[idx]); nfpsel = 0; idt_reg = 0; @@ -706,7 +706,7 @@ static int xgene_enet_cle_init(struct xgene_enet_pdata *pdata) def_qid = xgene_enet_dst_ring_num(pdata->rx_ring[0]); pool_id = pdata->rx_ring[0]->buf_pool->id; - def_fpsel = xgene_enet_ring_bufnum(pool_id) - 0x20; + def_fpsel = xgene_enet_get_fpsel(pool_id); memset(dbptr, 0, sizeof(struct xgene_cle_dbptr) * DB_MAX_PTRS); dbptr[DB_RES_ACCEPT].fpsel = def_fpsel; diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index 5390ae8..1007074 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -555,7 +555,7 @@ static void xgene_enet_cle_bypass(struct xgene_enet_pdata *pdata, u32 cb; u32 fpsel; - fpsel = xgene_enet_ring_bufnum(bufpool_id) - 0x20; + fpsel = xgene_enet_get_fpsel(bufpool_id); xgene_enet_rd_csr(pdata, 
CLE_BYPASS_REG0_0_ADDR, ); cb |= CFG_CLE_BYPASS_EN0; @@ -652,16 +652,14 @@ static int xgene_enet_reset(struct xgene_enet_pdata *pdata) static void xgene_enet_clear(struct xgene_enet_pdata *pdata, struct xgene_enet_desc_ring *ring) { - u32 addr, val, data; - - val = xgene_enet_ring_bufnum(ring->id); + u32 addr, data; if (xgene_enet_is_bufpool(ring->id)) { addr = ENET_CFGSSQMIFPRESET_ADDR; - data = BIT(val - 0x20); + data = BIT(xgene_enet_get_fpsel(ring->id)); } else { addr = ENET_CFGSSQMIWQRESET_ADDR; - data = BIT(val); + data = BIT(xgene_enet_ring_bufnum(ring->id)); } xgene_enet_wr_ring_if(pdata, addr, data); @@ -671,24 +669,21 @@ static void xgene_gport_shutdown(struct xgene_enet_pdata *pdata) { struct device *dev = >pdev->dev; struct xgene_enet_desc_ring *ring; - u32 pb, val; + u32 pb; int i; pb = 0; for (i = 0; i < pdata->rxq_cnt; i++) { ring = pdata->rx_ring[i]->buf_pool; + pb |= BIT(xgene_enet_get_fpsel(ring->id)); - val = xgene_enet_ring_bufnum(ring->id); - pb |= BIT(val - 0x20); } xgene_enet_wr_ring_if(pdata, ENET_CFGSSQMIFPRESET_ADDR, pb); pb = 0; for (i = 0; i < pdata->txq_cnt; i++) { ring = pdata->tx_ring[i]; - - val = xgene_enet_ring_bufnum(ring->id); - pb |= BIT(val); + pb |= BIT(xgene_enet_ring_bufnum(ring->id)); } xgene_enet_wr_ring_if(pdata, ENET_CFGSSQMIWQRESET_ADDR, pb); diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h index 06e598c..e73cbb1 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h @@ -346,6 +346,14 @@ static inline bool xgene_enet_is_bufpool(u16 id) return ((id & RING_BUFNUM_MASK) >= 0x20) ? 
true : false; } +static inline u8 xgene_enet_get_fpsel(u16 id) +{ + if (xgene_enet_is_bufpool(id)) + return xgene_enet_ring_bufnum(id) - RING_BUFNUM_BUFPOOL; + + return 0; +} + static inline u16 xgene_enet_get_numslots(u16 id, u32 size) { bool is_bufpool = xgene_enet_is_bufpool(id); diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c index d12e9cb..8e4209c 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c @@ -501,7 +501,7 @@ static
[PATCH net-next 0/8] drivers: net: xgene: Add Jumbo and Pause frame support
This patch set adds, 1. Jumbo frame support 2. Pause frame based flow control and fixes RSS for non-TCP/UDP packets. Signed-off-by: Iyappan Subramanian--- Iyappan Subramanian (8): drivers: net: xgene: Add helper function drivers: net: xgene: Configure classifier with pagepool drivers: net: xgene: Add support for Jumbo frame drivers: net: xgene: Add change_mtu function drivers: net: xgene: fix: RSS for non-TCP/UDP drivers: net: xgene: Add flow control configuration drivers: net: xgene: Add flow control initialization drivers: net: xgene: ethtool: Add get/set_pauseparam drivers/net/ethernet/apm/xgene/xgene_enet_cle.c| 110 ++- drivers/net/ethernet/apm/xgene/xgene_enet_cle.h| 3 + .../net/ethernet/apm/xgene/xgene_enet_ethtool.c| 70 + drivers/net/ethernet/apm/xgene/xgene_enet_hw.c | 140 - drivers/net/ethernet/apm/xgene/xgene_enet_hw.h | 27 +- drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 336 +++-- drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 30 +- drivers/net/ethernet/apm/xgene/xgene_enet_ring2.c | 1 + drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 146 +++-- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 121 +++- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h | 9 + 11 files changed, 895 insertions(+), 98 deletions(-) -- 1.9.1
[PATCH net-next 6/8] drivers: net: xgene: Add flow control configuration
This patch adds functions to configure mac, when flow control and pause frame settings change. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 48 + drivers/net/ethernet/apm/xgene/xgene_enet_hw.h| 6 +++ drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 6 +++ drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 64 -- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 66 ++- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h | 2 + 6 files changed, 176 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index 92cc7e5..23a0175 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -509,6 +509,51 @@ static void xgene_enet_set_frame_size(struct xgene_enet_pdata *pdata, int size) xgene_enet_wr_mcx_mac(pdata, MAX_FRAME_LEN_ADDR, size); } +static void xgene_gmac_enable_tx_pause(struct xgene_enet_pdata *pdata, + bool enable) +{ + u32 data; + + xgene_enet_rd_mcx_csr(pdata, CSR_ECM_CFG_0_ADDR, ); + + if (enable) + data |= MULTI_DPF_AUTOCTRL | PAUSE_XON_EN; + else + data &= ~(MULTI_DPF_AUTOCTRL | PAUSE_XON_EN); + + xgene_enet_wr_mcx_csr(pdata, CSR_ECM_CFG_0_ADDR, data); +} + +static void xgene_gmac_flowctl_tx(struct xgene_enet_pdata *pdata, bool enable) +{ + u32 data; + + xgene_enet_rd_mcx_mac(pdata, MAC_CONFIG_1_ADDR, ); + + if (enable) + data |= TX_FLOW_EN; + else + data &= ~TX_FLOW_EN; + + xgene_enet_wr_mcx_mac(pdata, MAC_CONFIG_1_ADDR, data); + + pdata->mac_ops->enable_tx_pause(pdata, enable); +} + +static void xgene_gmac_flowctl_rx(struct xgene_enet_pdata *pdata, bool enable) +{ + u32 data; + + xgene_enet_rd_mcx_mac(pdata, MAC_CONFIG_1_ADDR, ); + + if (enable) + data |= RX_FLOW_EN; + else + data &= ~RX_FLOW_EN; + + xgene_enet_wr_mcx_mac(pdata, MAC_CONFIG_1_ADDR, data); +} + static void xgene_gmac_init(struct xgene_enet_pdata *pdata) { u32 value; @@ -909,6 +954,9 
@@ void xgene_enet_mdio_remove(struct xgene_enet_pdata *pdata) .set_speed = xgene_gmac_set_speed, .set_mac_addr = xgene_gmac_set_mac_addr, .set_framesize = xgene_enet_set_frame_size, + .enable_tx_pause = xgene_gmac_enable_tx_pause, + .flowctl_tx = xgene_gmac_flowctl_tx, + .flowctl_rx = xgene_gmac_flowctl_rx, }; const struct xgene_port_ops xgene_gport_ops = { diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h index bd6cb6c..7ba649d 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h @@ -170,6 +170,10 @@ enum xgene_enet_rm { #define CFG_WAITASYNCRD_SET(dst, val) xgene_set_bits(dst, val, 0, 16) #define CFG_CLE_DSTQID0(val) ((val) & GENMASK(11, 0)) #define CFG_CLE_FPSEL0(val)(((val) << 16) & GENMASK(19, 16)) +#define CSR_ECM_CFG_0_ADDR 0x0220 +#define CSR_ECM_CFG_1_ADDR 0x0224 +#define PAUSE_XON_EN BIT(30) +#define MULTI_DPF_AUTOCTRL BIT(28) #define CFG_CLE_NXTFPSEL0(val) (((val) << 20) & GENMASK(23, 20)) #define ICM_CONFIG0_REG_0_ADDR 0x0400 #define ICM_CONFIG2_REG_0_ADDR 0x0410 @@ -198,6 +202,8 @@ enum xgene_enet_rm { #define SOFT_RESET1BIT(31) #define TX_EN BIT(0) #define RX_EN BIT(2) +#define TX_FLOW_EN BIT(4) +#define RX_FLOW_EN BIT(5) #define ENET_LHD_MODE BIT(25) #define ENET_GHD_MODE BIT(26) #define FULL_DUPLEX2 BIT(0) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h index 61aa987..5257174 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h @@ -159,6 +159,9 @@ struct xgene_mac_ops { void (*set_framesize)(struct xgene_enet_pdata *pdata, int framesize); void (*set_mss)(struct xgene_enet_pdata *pdata, u16 mss, u8 index); void (*link_state)(struct work_struct *work); + void (*enable_tx_pause)(struct xgene_enet_pdata *pdata, bool enable); + void (*flowctl_rx)(struct xgene_enet_pdata *pdata, bool enable); + void 
(*flowctl_tx)(struct xgene_enet_pdata *pdata, bool enable); }; struct xgene_port_ops { @@ -234,6 +237,9 @@ struct xgene_enet_pdata { bool mdio_driver; struct gpio_desc *sfp_rdy; bool sfp_gpio_en; + u32 pause_autoneg; + bool
[PATCH net-next 8/8] drivers: net: xgene: ethtool: Add get/set_pauseparam
This patch adds get_pauseparam and set_pauseparam functions. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- .../net/ethernet/apm/xgene/xgene_enet_ethtool.c| 70 ++ 1 file changed, 70 insertions(+) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_ethtool.c b/drivers/net/ethernet/apm/xgene/xgene_enet_ethtool.c index d372d42..28fdedc 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_ethtool.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_ethtool.c @@ -163,6 +163,74 @@ static void xgene_get_ethtool_stats(struct net_device *ndev, *data++ = *(u64 *)(pdata + gstrings_stats[i].offset); } +static void xgene_get_pauseparam(struct net_device *ndev, +struct ethtool_pauseparam *pp) +{ + struct xgene_enet_pdata *pdata = netdev_priv(ndev); + + pp->autoneg = pdata->pause_autoneg; + pp->tx_pause = pdata->tx_pause; + pp->rx_pause = pdata->rx_pause; +} + +static int xgene_set_pauseparam(struct net_device *ndev, + struct ethtool_pauseparam *pp) +{ + struct xgene_enet_pdata *pdata = netdev_priv(ndev); + struct phy_device *phydev = ndev->phydev; + u32 oldadv, newadv; + + if (pdata->phy_mode == PHY_INTERFACE_MODE_RGMII || + pdata->phy_mode == PHY_INTERFACE_MODE_SGMII) { + if (!phydev) + return -EINVAL; + + if (!(phydev->supported & SUPPORTED_Pause) || + (!(phydev->supported & SUPPORTED_Asym_Pause) && +pp->rx_pause != pp->tx_pause)) + return -EINVAL; + + pdata->pause_autoneg = pp->autoneg; + pdata->tx_pause = pp->tx_pause; + pdata->rx_pause = pp->rx_pause; + + oldadv = phydev->advertising; + newadv = oldadv & ~(ADVERTISED_Pause | ADVERTISED_Asym_Pause); + + if (pp->rx_pause) + newadv |= ADVERTISED_Pause | ADVERTISED_Asym_Pause; + + if (pp->tx_pause) + newadv ^= ADVERTISED_Asym_Pause; + + if (oldadv ^ newadv) { + phydev->advertising = newadv; + + if (phydev->autoneg) + return phy_start_aneg(phydev); + + if (!pp->autoneg) { + pdata->mac_ops->flowctl_tx(pdata, + pdata->tx_pause); + pdata->mac_ops->flowctl_rx(pdata, + pdata->rx_pause); + } + } + + } else { + if 
(pp->autoneg) + return -EINVAL; + + pdata->tx_pause = pp->tx_pause; + pdata->rx_pause = pp->rx_pause; + + pdata->mac_ops->flowctl_tx(pdata, pdata->tx_pause); + pdata->mac_ops->flowctl_rx(pdata, pdata->rx_pause); + } + + return 0; +} + static const struct ethtool_ops xgene_ethtool_ops = { .get_drvinfo = xgene_get_drvinfo, .get_link = ethtool_op_get_link, @@ -171,6 +239,8 @@ static void xgene_get_ethtool_stats(struct net_device *ndev, .get_ethtool_stats = xgene_get_ethtool_stats, .get_link_ksettings = xgene_get_link_ksettings, .set_link_ksettings = xgene_set_link_ksettings, + .get_pauseparam = xgene_get_pauseparam, + .set_pauseparam = xgene_set_pauseparam }; void xgene_enet_set_ethtool_ops(struct net_device *ndev) -- 1.9.1
Re: [PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
On Thu, 2016-12-01 at 15:47 -0800, Mahesh Bandewar (महेश बंडेवार) wrote: > [...] > >> @@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, > >> struct sk_buff *skb) > >> > >> probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES); > >> if (probes < 0) { > >> + memset(_dev_hw_addr, 0, dev->addr_len); > >> if (!(neigh->nud_state & NUD_VALID)) > >> pr_debug("trying to ucast probe in NUD_INVALID\n"); > >> neigh_ha_snapshot(dst_ha, neigh, dev); > >> - dst_hw = dst_ha; > >> + if (memcmp(_ha, _dev_hw_addr, dev->addr_len) != 0) > >> + dst_hw = dst_ha; > >> } else { > >> probes -= NEIGH_VAR(neigh->parms, APP_PROBES); > >> if (probes < 0) { > > > > Why is is an IPv4 specific issue ? > I think the issue is that neigh_ha_snapshot() gets neigh->ha > unconditionally even if the neigh state is NUD_INVALID. > > > What about IPv6 ? > Well it's not ARP. The ndisc_solicit() calls ndisc_send_ns() with > neigh parameter for unicast probe while call with NULL for the > broadcast probe case. However it does not use this parameter in > unicast case and probably relies on the route-entry. Hence it is not > subjected to the same issue. Well, it looks like the issue is in neighbour code. Fact that IPv6 might not be impacted is not the point. > > > > > > I would try something in neighbour code, maybe : > > > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > > index > > 782dd866366554e53dda3e6c69c807ec90bd0e08..fdfb177eecb6a9b1479eedde457cb1f652d32c68 > > 100644 > > --- a/net/core/neighbour.c > > +++ b/net/core/neighbour.c > > @@ -916,7 +916,10 @@ static void neigh_timer_handler(unsigned long arg) > > neigh_dbg(2, "neigh %p is probed\n", neigh); > > neigh->nud_state = NUD_PROBE; > > neigh->updated = jiffies; > > - atomic_set(>probes, 0); > > + atomic_set(>probes, > > + (neigh->output == neigh_blackhole) ? 
> > + NEIGH_VAR(neigh->parms, > > UCAST_PROBES) : > > + 0); > This would work if we change the above line (in arp_solicit() code) > from 'if (probes < 0)' to 'if (probes <= 0)'. Then code at line 973 is wrong ? atomic_set(&neigh->probes, NEIGH_VAR(neigh->parms, UCAST_PROBES)); That would be a more serious issue :)
Re: [PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
[...] >> @@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, >> struct sk_buff *skb) >> >> probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES); >> if (probes < 0) { >> + memset(_dev_hw_addr, 0, dev->addr_len); >> if (!(neigh->nud_state & NUD_VALID)) >> pr_debug("trying to ucast probe in NUD_INVALID\n"); >> neigh_ha_snapshot(dst_ha, neigh, dev); >> - dst_hw = dst_ha; >> + if (memcmp(_ha, _dev_hw_addr, dev->addr_len) != 0) >> + dst_hw = dst_ha; >> } else { >> probes -= NEIGH_VAR(neigh->parms, APP_PROBES); >> if (probes < 0) { > > Why is is an IPv4 specific issue ? I think the issue is that neigh_ha_snapshot() gets neigh->ha unconditionally even if the neigh state is NUD_INVALID. > What about IPv6 ? Well it's not ARP. The ndisc_solicit() calls ndisc_send_ns() with neigh parameter for unicast probe while call with NULL for the broadcast probe case. However it does not use this parameter in unicast case and probably relies on the route-entry. Hence it is not subjected to the same issue. > > > I would try something in neighbour code, maybe : > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index > 782dd866366554e53dda3e6c69c807ec90bd0e08..fdfb177eecb6a9b1479eedde457cb1f652d32c68 > 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c > @@ -916,7 +916,10 @@ static void neigh_timer_handler(unsigned long arg) > neigh_dbg(2, "neigh %p is probed\n", neigh); > neigh->nud_state = NUD_PROBE; > neigh->updated = jiffies; > - atomic_set(>probes, 0); > + atomic_set(>probes, > + (neigh->output == neigh_blackhole) ? > + NEIGH_VAR(neigh->parms, UCAST_PROBES) > : > + 0); This would work if we change the above line (in arp_solicit() code) from 'if (probes < 0)' to 'if (probes <= 0)'. > notify = 1; > next = now + NEIGH_VAR(neigh->parms, RETRANS_TIME); > } > > Thanks. > >
Re: Initial thoughts on TXDP
On 12/01/2016 02:12 PM, Tom Herbert wrote:
> We have to consider both request size and response size in RPC. Presumably, something like a memcache server is mostly serving data as opposed to reading it, so we are looking at receiving much smaller packets than are being sent. Requests are going to be quite small, say 100 bytes, and unless we are doing a significant amount of pipelining on connections GRO would rarely kick in. Response size will have a lot of variability, anything from a few kilobytes up to a megabyte. I'm sorry I can't be more specific; this is an artifact of datacenters that have 100s of different applications and communication patterns. Maybe a 100B request size with 8K, 16K, 64K response sizes might be good for a test.

No worries on the specific sizes, it is a classic "How long is a piece of string?" sort of question. Not surprisingly, as the size of what is being received grows, so too does the delta between GRO on and off.

stack@np-cp1-c0-m1-mgmt:~/rjones2$ HDR="-P 1"; for r in 8K 16K 64K 1M; do for gro in on off; do sudo ethtool -K hed0 gro ${gro}; brand="$r gro $gro"; ./netperf -B "$brand" -c -H np-cp1-c1-m3-mgmt -t TCP_RR $HDR -- -P 12867 -r 128,${r} -o result_brand,throughput,local_sd; HDR="-P 0"; done; done

MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Result Tag,Throughput,Local Service Demand
"8K gro on",9899.84,35.947
"8K gro off",7299.54,61.097
"16K gro on",8119.38,58.367
"16K gro off",5176.87,95.317
"64K gro on",4429.57,110.629
"64K gro off",2128.58,289.913
"1M gro on",887.85,918.447
"1M gro off",335.97,3427.587

So that gives a feel for by how much this alternative mechanism would have to reduce path-length to maintain the CPU overhead, were the mechanism to preclude GRO.

rick
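Reading the numbers back, the claim that the delta grows with response size holds: the GRO on/off throughput ratio increases monotonically from roughly 1.36x at 8K to roughly 2.64x at 1M. A small check with the values copied from the table above:

```c
#include <assert.h>

/* Throughput pairs (GRO on, GRO off) from the netperf TCP_RR runs
 * above, for 8K, 16K, 64K and 1M response sizes. */
static const double gro_on[]  = { 9899.84, 8119.38, 4429.57, 887.85 };
static const double gro_off[] = { 7299.54, 5176.87, 2128.58, 335.97 };

/* Returns 1 if the on/off ratio strictly increases with response size. */
static int gro_ratio_monotonic(void)
{
	double prev = 0.0;

	for (int i = 0; i < 4; i++) {
		double r = gro_on[i] / gro_off[i];

		if (r <= prev)
			return 0;
		prev = r;
	}
	return 1;
}
```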
linux-next: manual merge of the wireless-drivers-next tree with the net-next tree
Hi all,

Today's linux-next merge of the wireless-drivers-next tree got a conflict in:

  drivers/net/wireless/ath/ath10k/mac.c

between commit:

  f3fe4e93dd63 ("mac80211: add a HW flag for supporting HW TX fragmentation")

from the net-next tree and commit:

  ff32eeb86aa1 ("ath10k: advertize hardware packet loss mechanism")

from the wireless-drivers-next tree.

I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts.

--
Cheers,
Stephen Rothwell

diff --cc drivers/net/wireless/ath/ath10k/mac.c
index 717b2fad9a8a,db6ddf974d1d..
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@@ -8005,7 -7993,7 +7993,8 @@@ int ath10k_mac_register(struct ath10k *
  	ieee80211_hw_set(ar->hw, WANT_MONITOR_VIF);
  	ieee80211_hw_set(ar->hw, CHANCTX_STA_CSA);
  	ieee80211_hw_set(ar->hw, QUEUE_CONTROL);
 +	ieee80211_hw_set(ar->hw, SUPPORTS_TX_FRAG);
+ 	ieee80211_hw_set(ar->hw, REPORTS_LOW_ACK);

  	if (!test_bit(ATH10K_FLAG_RAW_MODE, &ar->dev_flags))
  		ieee80211_hw_set(ar->hw, SW_CRYPTO_CONTROL);
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 2:47 PM, Hannes Frederic Sowa wrote:
> Side note:
>
> On 01.12.2016 20:51, Tom Herbert wrote:
>>> > E.g. "mini-skb": Even if we assume that this provides a speedup
>>> > (where does that come from? should make no difference if a 32 or
>>> > 320 byte buffer gets allocated).
>>> >
>> It's the zero'ing of three cache lines. I believe we talked about that
>> at netdev.
>
> Jesper and me played with that again very recently:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
>
> In micro-benchmarks we saw a pretty good speed up not using the rep
> stosb generated by the gcc builtin but plain movq's. Probably the cost
> model for __builtin_memset in gcc is wrong?
>
> When Jesper is free we wanted to benchmark this and maybe come up with
> an arch specific way of cleaning if it turns out to really improve
> throughput.
>
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.
>
One nice direction of XDP is that it forces drivers to defer allocating (and hence zero'ing) skbs. In the receive path I think we can exploit this property deeper into the stack. The only time we _really_ need to allocate an skbuff is when we need to put the packet onto a queue. All the other use cases are really just to pass a structure containing a packet from function to function. For that purpose we should be able to just pass a much smaller structure in a stack argument and only allocate an skbuff when we need to enqueue. In cases where we don't ever queue a packet we might never need to allocate any skbuff -- this includes pure acks and packets that end up being dropped. But even more than that, if a received packet generates a TX packet (like a SYN causes a SYN-ACK) then we might even be able to just recycle the received packet and avoid needing any skbuff allocation on transmit (XDP_TX already does this in a limited context) -- this could be a win for handling SYN attacks, for instance.
Also, since we don't queue on the socket buffer for UDP it's conceivable we could avoid skbuffs in an expedited UDP TX path. Currently, nearly the whole stack depends on packets always being passed in skbuffs, however __skb_flow_dissect is an interesting exception as it can handle packets passed in either an skbuff or by just a void *-- so we know that this "dual mode" is at least possible. Trying to retrain the whole stack to be able to handle both skbuffs and raw pages is probably untenable at this point, but selectively augmenting some critical performance functions for dual mode (ip_rcv, tcp_rcv, udp_rcv functions for instance) might work. Thanks, Tom > Bye, > Hannes >
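The "dual mode" point about __skb_flow_dissect can be sketched as a helper that accepts either an skb-like wrapper or a raw buffer; the struct and function below are illustrative stand-ins, not kernel APIs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for struct sk_buff: just a buffer plus length. */
struct fake_skb {
	const uint8_t *data;
	size_t len;
};

/* Dual-mode in the spirit of __skb_flow_dissect(): callers pass either
 * an skb (with data == NULL) or a raw buffer (with skb == NULL), and
 * the same parsing code runs in both cases. */
static int first_byte(const struct fake_skb *skb,
		      const uint8_t *data, size_t len)
{
	if (!data) {			/* skb mode */
		data = skb->data;
		len = skb->len;
	}
	return len ? data[0] : -1;
}

/* Exercise the skb-mode path. */
static int demo_skb_mode(void)
{
	static const uint8_t pkt[] = { 0x45, 0x00 };
	struct fake_skb skb = { pkt, sizeof(pkt) };

	return first_byte(&skb, NULL, 0);
}
```

Both call styles funnel into one body, which is what makes selectively retrofitting rcv functions for dual mode plausible without retraining the whole stack.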
[PATCH 6/7] net: ethernet: ti: cpsw: add support for descs_pool_size dt property
The CPSW CPDMA can process buffer descriptors placed in internal CPPI RAM as well as in DDR. This patch adds support in CPSW and CPDMA for the descs_pool_size DT property, which defines the total number of CPDMA CPPI descriptors to be used for both ingress/egress packet processing:
- the memory size required for the CPDMA descriptor pool is calculated based on the number of descriptors specified by the user in descs_pool_size and the CPDMA descriptor size;
- the CPDMA descriptor pool is allocated in DDR if the pool memory size > internal CPPI RAM, otherwise internal CPPI RAM is used;
- if descs_pool_size is not specified in DT, the default value 256 will be used, which allows placing the CPDMA descriptor pool into the internal CPPI RAM (current default behaviour);
- CPDMA will ignore descs_pool_size if descs_pool_size = 0, for backward compatibility with davinci_emac.

Signed-off-by: Grygorii Strashko
---
 drivers/net/ethernet/ti/cpsw.c          |  5 +
 drivers/net/ethernet/ti/cpsw.h          |  1 +
 drivers/net/ethernet/ti/davinci_cpdma.c | 12
 drivers/net/ethernet/ti/davinci_cpdma.h |  1 +
 4 files changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index dd5d830..a98c6260 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -145,6 +145,7 @@ do {						\
 		cpsw->data.active_slave)
 #define IRQ_NUM			2
 #define CPSW_MAX_QUEUES		8
+#define CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT 256

 static int debug_level;
 module_param(debug_level, int, 0);
@@ -2557,6 +2558,9 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 	if (of_property_read_bool(node, "dual_emac"))
 		data->dual_emac = 1;

+	data->descs_pool_size = CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT;
+	of_property_read_u32(node, "descs_pool_size", &data->descs_pool_size);
+
 	/*
 	 * Populate all the child nodes here...
 	 */
@@ -2967,6 +2971,7 @@ static int cpsw_probe(struct platform_device *pdev)
 	dma_params.has_ext_regs		= true;
 	dma_params.desc_hw_addr		= dma_params.desc_mem_phys;
 	dma_params.bus_freq_mhz		= cpsw->bus_freq_mhz;
+	dma_params.descs_pool_size	= cpsw->data.descs_pool_size;

 	cpsw->dma = cpdma_ctlr_create(&dma_params);
 	if (!cpsw->dma) {
diff --git a/drivers/net/ethernet/ti/cpsw.h b/drivers/net/ethernet/ti/cpsw.h
index 16b54c6..8835d79 100644
--- a/drivers/net/ethernet/ti/cpsw.h
+++ b/drivers/net/ethernet/ti/cpsw.h
@@ -38,6 +38,7 @@ struct cpsw_platform_data {
 	u32	mac_control;	/* Mac control register */
 	u16	default_vlan;	/* Def VLAN for ALE lookup in VLAN aware mode*/
 	bool	dual_emac;	/* Enable Dual EMAC mode */
+	u32	descs_pool_size;	/* Number of Rx/Tx Descriptors */
 };

 void cpsw_phy_sel(struct device *dev, phy_interface_t phy_mode, int slave);
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index ba892bb..f45bb8a 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -219,6 +219,18 @@ int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
 				cpdma_params->desc_align);
 	pool->num_desc	= pool->mem_size / pool->desc_size;

+	if (cpdma_params->descs_pool_size) {
+		/* recalculate memory size required for the cpdma descriptor
+		 * pool based on the number of descriptors specified by the
+		 * user and, if memory size > CPPI internal RAM size
+		 * (desc_mem_size), switch to using DDR
+		 */
+		pool->num_desc = cpdma_params->descs_pool_size;
+		pool->mem_size = pool->desc_size * pool->num_desc;
+		if (pool->mem_size > cpdma_params->desc_mem_size)
+			cpdma_params->desc_mem_phys = 0;
+	}
+
 	pool->gen_pool = devm_gen_pool_create(ctlr->dev, ilog2(pool->desc_size),
 					      -1, "cpdma");
 	if (IS_ERR(pool->gen_pool)) {
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.h b/drivers/net/ethernet/ti/davinci_cpdma.h
index 4a167db..cb45f8f 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.h
+++ b/drivers/net/ethernet/ti/davinci_cpdma.h
@@ -37,6 +37,7 @@ struct cpdma_params {
 	int		desc_mem_size;
 	int		desc_align;
 	u32		bus_freq_mhz;
+	u32		descs_pool_size;

 	/*
 	 * Some instances of embedded cpdma controllers have extra control and
--
2.10.1
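The sizing rule this patch adds can be sketched in a few lines. Field names follow the patch; the concrete sizes are illustrative (0x2000 matches the bd_ram_size used in the DT examples elsewhere in this series, 32 bytes is an assumed aligned descriptor size, and the physical address is made up):

```c
#include <assert.h>

/* Sketch of the descs_pool_size logic: compute the pool memory size
 * from the requested descriptor count, and fall back to DDR (signalled
 * here by zeroing desc_mem_phys) when the pool no longer fits in the
 * internal CPPI RAM. */
struct pool_cfg {
	unsigned int desc_size;		/* bytes per aligned descriptor */
	unsigned int desc_mem_size;	/* internal CPPI RAM size */
	unsigned int desc_mem_phys;	/* 0 => descriptors go to DDR */
};

static unsigned int pool_mem_size(struct pool_cfg *cfg, unsigned int num_desc)
{
	unsigned int mem_size = cfg->desc_size * num_desc;

	if (mem_size > cfg->desc_mem_size)
		cfg->desc_mem_phys = 0;	/* too big for CPPI RAM */
	return mem_size;
}

/* Returns 1 if num_desc descriptors still fit in internal CPPI RAM. */
static int fits_in_cppi(unsigned int num_desc)
{
	struct pool_cfg cfg = { 32, 0x2000, 0x4a102000 /* illustrative */ };

	pool_mem_size(&cfg, num_desc);
	return cfg.desc_mem_phys != 0;
}
```

With these assumed sizes the default of 256 descriptors exactly fills the 0x2000-byte CPPI RAM, and anything larger spills to DDR, which is the "current default behaviour" preserved by the patch.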
[PATCH 2/7] net: ethernet: ti: cpdma: fix desc re-queuing
The currently processed cpdma descriptor with the EOQ flag set may contain two values in its Next Descriptor Pointer field:
- a valid pointer: means CPDMA missed the addition of a new desc in the queue;
- null: no more descriptors in the queue.

In the latter case, it's not required to write to the HDP register, but CPDMA currently does. Hence, add an additional check for Next Descriptor Pointer != null in the cpdma_chan_process() function before writing to the HDP register.

Signed-off-by: Grygorii Strashko
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 0924014..379314f 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -1152,7 +1152,7 @@ static int __cpdma_chan_process(struct cpdma_chan *chan)
 	chan->count--;
 	chan->stats.good_dequeue++;

-	if (status & CPDMA_DESC_EOQ) {
+	if ((status & CPDMA_DESC_EOQ) && chan->head) {
 		chan->stats.requeue++;
 		chan_write(chan, hdp, desc_phys(pool, chan->head));
 	}
--
2.10.1
[PATCH 1/7] net: ethernet: ti: cpdma: am437x: allow descs to be placed in ddr
It's observed that cpsw/cpdma is not working properly when CPPI descriptors are placed in DDR instead of internal CPPI RAM on the am437x SoC:
- rx/tx silently stops processing packets;
- or, after boot it works for some time, but gets stuck once network load is increased (ping works, but iperf does not).
(The same issue has not been reproduced on am335x and am57xx.)

It seems that the write to the HDP register is processed faster by the interconnect than the write of the descriptor memory buffer in DDR, which is probably caused by store buffer / write buffer differences, as these functions are implemented differently across devices.

So, to fix this I came up with two changes:

1) all accesses to the channel HDP/CP/RXFREE registers should be done using the sync IO accessors readl()/writel(), because all previous memory writes have to be completed before starting the channel (write to HDP) or completing desc processing;

2) change 1 alone doesn't work on am437x, and an additional read of a desc's field is required right after the new descriptor is filled with data and before the pointer to it is stored in the prev_desc->hw_next field or the HDP register.

Signed-off-by: Grygorii Strashko
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c776e45..0924014 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -167,10 +167,10 @@ static struct cpdma_control_info controls[] = {

 /* various accessors */
 #define dma_reg_read(ctlr, ofs)		__raw_readl((ctlr)->dmaregs + (ofs))
-#define chan_read(chan, fld)		__raw_readl((chan)->fld)
+#define chan_read(chan, fld)		readl((chan)->fld)
 #define desc_read(desc, fld)		__raw_readl(&(desc)->fld)
 #define dma_reg_write(ctlr, ofs, v)	__raw_writel(v, (ctlr)->dmaregs + (ofs))
-#define chan_write(chan, fld, v)	__raw_writel(v, (chan)->fld)
+#define chan_write(chan, fld, v)	writel(v, (chan)->fld)
 #define desc_write(desc, fld, v)	__raw_writel((u32)(v), &(desc)->fld)

 #define cpdma_desc_to_port(chan, mode, directed)	\
@@ -1064,6 +1064,7 @@ int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
 	desc_write(desc, sw_token,	token);
 	desc_write(desc, sw_buffer,	buffer);
 	desc_write(desc, sw_len,	len);
+	desc_read(desc, sw_len);

 	__cpdma_chan_submit(chan, desc);
--
2.10.1
[PATCH 0/7] net: ethernet: ti: cpsw: support placing CPDMA descriptors into DDR
This series adds support for placing CPDMA descriptors into DDR by introducing a new DT property, "descs_pool_size", to specify the buffer descriptor pool size. "descs_pool_size" defines the total number of CPDMA CPPI descriptors to be used for both ingress/egress packet processing. If not specified, the default value 256 will be used, which allows placing the descriptor pool into the internal CPPI RAM. This significantly reduces the UDP packet drop rate for bandwidths >301 Mbits/sec (am57x).

Before enabling this feature, the am437x SoC has to be fixed, as it's proven not to work when CPDMA descriptors are placed in DDR. So, patch 1 fixes this issue.

Grygorii Strashko (7):
  net: ethernet: ti: cpdma: am437x: allow descs to be placed in ddr
  net: ethernet: ti: cpdma: fix desc re-queuing
  net: ethernet: ti: cpdma: minimize number of parameters in cpdma_desc_pool_create/destroy()
  net: ethernet: ti: cpdma: use devm_ioremap
  Documentation: DT: net: cpsw: allow to specify descriptors pool size
  net: ethernet: ti: cpsw: add support for descs_pool_size dt property
  Documentation: DT: net: cpsw: remove no_bd_ram property

 Documentation/devicetree/bindings/net/cpsw.txt |  8 ++-
 arch/arm/boot/dts/am33xx.dtsi                  |  1 -
 arch/arm/boot/dts/am4372.dtsi                  |  1 -
 arch/arm/boot/dts/dm814x.dtsi                  |  1 -
 arch/arm/boot/dts/dra7.dtsi                    |  1 -
 drivers/net/ethernet/ti/cpsw.c                 |  5 ++
 drivers/net/ethernet/ti/cpsw.h                 |  1 +
 drivers/net/ethernet/ti/davinci_cpdma.c        | 90 +++---
 drivers/net/ethernet/ti/davinci_cpdma.h        |  1 +
 9 files changed, 63 insertions(+), 46 deletions(-)
--
2.10.1
[PATCH 3/7] net: ethernet: ti: cpdma: minimize number of parameters in cpdma_desc_pool_create/destroy()
Update cpdma_desc_pool_create/destroy() to accept only one parameter, struct cpdma_ctlr *, as this structure contains all required information for pool creation/destruction.

Signed-off-by: Grygorii Strashko
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 66 -
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 379314f..db0a7fd 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -181,8 +181,10 @@ static struct cpdma_control_info controls[] = {
 		 (directed << CPDMA_TO_PORT_SHIFT));	\
 	} while (0)

-static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
+static void cpdma_desc_pool_destroy(struct cpdma_ctlr *ctlr)
 {
+	struct cpdma_desc_pool *pool = ctlr->pool;
+
 	if (!pool)
 		return;

@@ -191,7 +193,7 @@ static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
 		  gen_pool_size(pool->gen_pool),
 		  gen_pool_avail(pool->gen_pool));
 	if (pool->cpumap)
-		dma_free_coherent(pool->dev, pool->mem_size, pool->cpumap,
+		dma_free_coherent(ctlr->dev, pool->mem_size, pool->cpumap,
 				  pool->phys);
 	else
 		iounmap(pool->iomap);
@@ -203,57 +205,60 @@ static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
  * devices (e.g. cpsw switches) use plain old memory.  Descriptor pools
  * abstract out these details
  */
-static struct cpdma_desc_pool *
-cpdma_desc_pool_create(struct device *dev, u32 phys, dma_addr_t hw_addr,
-		       int size, int align)
+int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
 {
+	struct cpdma_params *cpdma_params = &ctlr->params;
 	struct cpdma_desc_pool *pool;
-	int ret;
+	int ret = 0;

-	pool = devm_kzalloc(dev, sizeof(*pool), GFP_KERNEL);
+	pool = devm_kzalloc(ctlr->dev, sizeof(*pool), GFP_KERNEL);
 	if (!pool)
 		goto gen_pool_create_fail;
+	ctlr->pool = pool;

-	pool->dev	= dev;
-	pool->mem_size	= size;
-	pool->desc_size	= ALIGN(sizeof(struct cpdma_desc), align);
-	pool->num_desc	= size / pool->desc_size;
+	pool->mem_size	= cpdma_params->desc_mem_size;
+	pool->desc_size	= ALIGN(sizeof(struct cpdma_desc),
+				cpdma_params->desc_align);
+	pool->num_desc	= pool->mem_size / pool->desc_size;

-	pool->gen_pool = devm_gen_pool_create(dev, ilog2(pool->desc_size), -1,
-					      "cpdma");
+	pool->gen_pool = devm_gen_pool_create(ctlr->dev, ilog2(pool->desc_size),
+					      -1, "cpdma");
 	if (IS_ERR(pool->gen_pool)) {
-		dev_err(dev, "pool create failed %ld\n",
-			PTR_ERR(pool->gen_pool));
+		ret = PTR_ERR(pool->gen_pool);
+		dev_err(ctlr->dev, "pool create failed %d\n", ret);
 		goto gen_pool_create_fail;
 	}

-	if (phys) {
-		pool->phys  = phys;
-		pool->iomap = ioremap(phys, size); /* should be memremap? */
-		pool->hw_addr = hw_addr;
+	if (cpdma_params->desc_mem_phys) {
+		pool->phys  = cpdma_params->desc_mem_phys;
+		pool->iomap = ioremap(pool->phys, pool->mem_size);
+		pool->hw_addr = cpdma_params->desc_hw_addr;
 	} else {
-		pool->cpumap = dma_alloc_coherent(dev, size, &pool->hw_addr,
-						  GFP_KERNEL);
+		pool->cpumap = dma_alloc_coherent(ctlr->dev, pool->mem_size,
+						  &pool->hw_addr, GFP_KERNEL);
 		pool->iomap = (void __iomem __force *)pool->cpumap;
 		pool->phys = pool->hw_addr; /* assumes no IOMMU, don't use this value */
 	}

-	if (!pool->iomap)
+	if (!pool->iomap) {
+		ret = -ENOMEM;
 		goto gen_pool_create_fail;
+	}

 	ret = gen_pool_add_virt(pool->gen_pool, (unsigned long)pool->iomap,
 				pool->phys, pool->mem_size, -1);
 	if (ret < 0) {
-		dev_err(dev, "pool add failed %d\n", ret);
+		dev_err(ctlr->dev, "pool add failed %d\n", ret);
 		goto gen_pool_add_virt_fail;
 	}

-	return pool;
+	return 0;

 gen_pool_add_virt_fail:
-	cpdma_desc_pool_destroy(pool);
+	cpdma_desc_pool_destroy(ctlr);
 gen_pool_create_fail:
-	return NULL;
+	ctlr->pool = NULL;
+	return ret;
 }

 static inline dma_addr_t desc_phys(struct cpdma_desc_pool *pool,
@@ -502,12 +507,7 @@ struct cpdma_ctlr *cpdma_ctlr_create(struct cpdma_params *params)
 	ctlr->chan_num = 0;
[PATCH 4/7] net: ethernet: ti: cpdma: use devm_ioremap
Use devm_ioremap() and simplify the code.

Signed-off-by: Grygorii Strashko
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index db0a7fd..ba892bb 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -195,8 +195,6 @@ static void cpdma_desc_pool_destroy(struct cpdma_ctlr *ctlr)
 	if (pool->cpumap)
 		dma_free_coherent(ctlr->dev, pool->mem_size, pool->cpumap,
 				  pool->phys);
-	else
-		iounmap(pool->iomap);
 }

 /*
@@ -231,7 +229,8 @@ int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)

 	if (cpdma_params->desc_mem_phys) {
 		pool->phys  = cpdma_params->desc_mem_phys;
-		pool->iomap = ioremap(pool->phys, pool->mem_size);
+		pool->iomap = devm_ioremap(ctlr->dev, pool->phys,
+					   pool->mem_size);
 		pool->hw_addr = cpdma_params->desc_hw_addr;
 	} else {
 		pool->cpumap = dma_alloc_coherent(ctlr->dev, pool->mem_size,
--
2.10.1
[PATCH 7/7] Documentation: DT: net: cpsw: remove no_bd_ram property
Although the no_bd_ram property is described in the TI CPSW bindings, support for it has never been introduced in the CPSW driver, so there are no real users of it. Hence, remove the no_bd_ram property. Signed-off-by: Grygorii Strashko--- Documentation/devicetree/bindings/net/cpsw.txt | 3 --- arch/arm/boot/dts/am33xx.dtsi | 1 - arch/arm/boot/dts/am4372.dtsi | 1 - arch/arm/boot/dts/dm814x.dtsi | 1 - arch/arm/boot/dts/dra7.dtsi| 1 - 5 files changed, 7 deletions(-) diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt index b99d196..4e8b673 100644 --- a/Documentation/devicetree/bindings/net/cpsw.txt +++ b/Documentation/devicetree/bindings/net/cpsw.txt @@ -25,7 +25,6 @@ Required properties: Optional properties: - ti,hwmods: Must be "cpgmac0" -- no_bd_ram: Must be 0 or 1 - dual_emac: Specifies Switch to act as Dual EMAC - syscon : Phandle to the system control device node, which is the control module device of the am33x @@ -73,7 +72,6 @@ Examples: cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; rx_descs = <64>; mac_control = <0x20>; slaves = <2>; @@ -102,7 +100,6 @@ Examples: cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; rx_descs = <64>; mac_control = <0x20>; slaves = <2>; diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi index 194d884..7af5520 100644 --- a/arch/arm/boot/dts/am33xx.dtsi +++ b/arch/arm/boot/dts/am33xx.dtsi @@ -777,7 +777,6 @@ cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; mac_control = <0x20>; slaves = <2>; active_slave = <0>; diff --git a/arch/arm/boot/dts/am4372.dtsi b/arch/arm/boot/dts/am4372.dtsi index a275fa9..4f651be 100644 --- a/arch/arm/boot/dts/am4372.dtsi +++ b/arch/arm/boot/dts/am4372.dtsi @@ -668,7 +668,6 @@ cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; mac_control = <0x20>; slaves = <2>; active_slave = <0>; diff --git
a/arch/arm/boot/dts/dm814x.dtsi b/arch/arm/boot/dts/dm814x.dtsi index ff90a6c..614a4ba 100644 --- a/arch/arm/boot/dts/dm814x.dtsi +++ b/arch/arm/boot/dts/dm814x.dtsi @@ -508,7 +508,6 @@ cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; mac_control = <0x20>; slaves = <2>; active_slave = <0>; diff --git a/arch/arm/boot/dts/dra7.dtsi b/arch/arm/boot/dts/dra7.dtsi index d4fcd68..cf7325d 100644 --- a/arch/arm/boot/dts/dra7.dtsi +++ b/arch/arm/boot/dts/dra7.dtsi @@ -1706,7 +1706,6 @@ cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; mac_control = <0x20>; slaves = <2>; active_slave = <0>; -- 2.10.1
[PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size
Add an optional property, "descs_pool_size", to specify the buffer descriptor pool size. "descs_pool_size" should define the total number of CPDMA CPPI descriptors to be used for both ingress/egress packet processing. If not specified, the default value 256 will be used, which allows placing the descriptor pool into the internal CPPI RAM on most TI SoCs.

Signed-off-by: Grygorii Strashko
---
 Documentation/devicetree/bindings/net/cpsw.txt | 5 +
 1 file changed, 5 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
index 5ad439f..b99d196 100644
--- a/Documentation/devicetree/bindings/net/cpsw.txt
+++ b/Documentation/devicetree/bindings/net/cpsw.txt
@@ -35,6 +35,11 @@ Optional properties:
 		  For example in dra72x-evm, pcf gpio has to be
 		  driven low so that cpsw slave 0 and phy data
 		  lines are connected via mux.
+- descs_pool_size	: total number of CPDMA CPPI descriptors to be used for
+			  both ingress/egress packets processing. if not
+			  specified the default value 256 will be used which
+			  will allow to place descriptors pool into the
+			  internal CPPI RAM.

 Slave Properties:
--
2.10.1
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On 02.12.2016 00:14, Ido Schimmel wrote: [...] >> Basically, if you delete a node right now the kernel might simply do a >> RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers >> or synchronization with the reader side. Thus you might get a callback >> from the notifier for a delete event on the one CPU and you end up >> queueing this fib entry after the delete queue, because the RCU walk >> isn't protected by any means. >> >> Looking closer at this series again, I overlooked the fact that you >> fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all >> orders fetching of fib_seq and thus the RCU dumping after any concurrent >> executing fib table update, also the mutex_lock and unlock provide >> proper acquire and release fences, so the CPU indeed sees the effect of >> a RCU_INIT_POINTER update done on another CPU, because they pair with >> the rtnl_unlock which might happen on the other CPU. > > Yep, Exactly. I had a feeling this is the issue you were referring to, > but then you were the one to suggest the use of RTNL, so I was quite > confused. At that time I actually had in mind that the fib_register would happen under the sequence lock, so I didn't look closely to the memory barrier pairings. I kinda still consider this to be a happy accident. ;) >> My question is if this is a bit of luck and if we should make this >> explicit by putting the registration itself under the protection of the >> sequence counter. I favor the additional protection, e.g. if we some day >> actually we optimize the fib_seq code? Otherwise we might probably >> document this fact. :) > > Well, some listeners don't require a dump, but only registration > (rocker) and in the future we might only need a dump (e.g., port being > moved to a different net namespace). So I'm not sure if bundling both > together is a good idea. 
> > Maybe we can keep register_fib_notifier() as-is and add 'bool register' > to fib_notifier_dump() so that when set, 'nb' is also registered after > RCU walk, but before we check if the dump is consistent (unregistered if > inconsistent)? I really like that. Would you mind adding this? [...] >> Quick follow-up question: How can I quickly find out the hw limitations >> via the kernel api? > > That's a good question. Currently, you can't. However, we already have a > mechanism in place to read device's capabilities from the firmware and > we can (and should) expose some of them to the user. The best API for > that would be devlink, as it can represent the entire device as opposed > to only a port netdev like other tools. > > We're also working on making the pipeline more visible to the user, so > that it would be easier for users to understand and debug their > networks. I believe a colleague of mine (Matty) presented this during > the last netdev conference. Thanks, I will look it up! Bye, Hannes
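The consistency scheme being discussed (snapshot fib_seq, do the RCU walk, then re-check the counter, unregistering and retrying on a mismatch) can be modeled in a few lines. Everything here is a toy stand-in, not the mlxsw or FIB notifier code:

```c
#include <assert.h>
#include <stddef.h>

static unsigned int fib_seq;		/* bumped on every FIB table update */

static void fib_update(void)
{
	fib_seq++;
}

/* Returns 1 if the dump observed no concurrent update (consistent:
 * keep the new listener registered), 0 otherwise (unregister it and
 * retry the dump).  The optional callback simulates an update racing
 * with the walk. */
static int dump_consistent(void (*concurrent)(void))
{
	unsigned int snap = fib_seq;

	/* ... the RCU walk of the FIB tables would happen here ... */
	if (concurrent)
		concurrent();

	return fib_seq == snap;
}
```

The ordering guarantees discussed above (RTNL acquire/release fences around reading fib_seq) are what make the snapshot comparison meaningful across CPUs; the sketch only captures the retry protocol itself.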
[PATCH 3/3] netns: fix net_generic() "id - 1" bloat
The net_generic() function is both a) inline and b) used ~600 times. It has the following code inside:

	...
	ptr = ng->ptr[id - 1];
	...

"id" is never a compile-time constant, so the compiler is forced to subtract 1, and those decrements or LEA [r32 - 1] instructions add up. We also start id'ing from 1 to catch bugs where a pernet subsystem id is not initialized and is 0. This is a quite pointless idea in general (nothing will work, or there will be immediate interference with the first registered subsystem), but it hints at what needs to be done for code size reduction: namely, overlaying the allocation of the pointer array and the fixed part of the structure at the beginning, and using the usual base-0 addressing. Ids are just cookies, their exact values do not matter, so let's start with 3 on x86_64.

Code size savings (oh boy): -4.2 KB

As usual, ignore the initial compiler stupidity part of the table.

add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
function                      old     new   delta
tipc_nametbl_insert_publ     1250    1270     +20
nlmclnt_lookup_host           686     703     +17
nfsd4_encode_fattr           5930    5941     +11
nfs_get_client               1050    1061     +11
register_pernet_operations    333     342      +9
tcf_mirred_init               843     849      +6
tcf_bpf_init                 1143    1149      +6
gss_setup_upcall              990     994      +4
idmap_name_to_id              432     434      +2
ops_init                      274     275      +1
nfsd_inject_forget_client     259     260      +1
nfs4_alloc_client             612     613      +1
tunnel_key_walker             164     163      -1
...
tipc_bcbase_select_primary    392     360     -32
mac80211_hwsim_new_radio     2808    2767     -41
ipip6_tunnel_ioctl           2228    2186     -42
tipc_bcast_rcv                715     672     -43
tipc_link_build_proto_msg    1140    1089     -51
nfsd4_lock                   3851    3796     -55
tipc_mon_rcv                 1012     956     -56
Total: Before=156643951, After=156639743, chg -0.00%

Signed-off-by: Alexey Dobriyan
---
 include/net/netns/generic.h | 16 +---
 net/core/net_namespace.c    | 20
 2 files changed, 21 insertions(+), 15 deletions(-)

--- a/include/net/netns/generic.h
+++ b/include/net/netns/generic.h
@@ -25,12 +25,14 @@
  */

 struct net_generic {
-	struct {
-		unsigned int len;
-		struct rcu_head rcu;
-	} s;
-
-	void *ptr[0];
+	union {
+		struct {
+			unsigned int len;
+			struct rcu_head rcu;
+		} s;
+
+		void *ptr[0];
+	};
 };

 static inline void *net_generic(const struct net *net, unsigned int id)
@@ -40,7 +42,7 @@ static inline void *net_generic(const struct net *net, unsigned int id)

 	rcu_read_lock();
 	ng = rcu_dereference(net->gen);
-	ptr = ng->ptr[id - 1];
+	ptr = ng->ptr[id];
 	rcu_read_unlock();

 	return ptr;
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -39,6 +39,9 @@ EXPORT_SYMBOL(init_net);

 static bool init_net_initialized;

+#define MIN_PERNET_OPS_ID	\
+	((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *))
+
 #define INITIAL_NET_GEN_PTRS	13 /* +1 for len +2 for rcu_head */

 static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
@@ -46,7 +49,7 @@ static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
 static struct net_generic *net_alloc_generic(void)
 {
 	struct net_generic *ng;
-	size_t generic_size = offsetof(struct net_generic, ptr[max_gen_ptrs]);
+	unsigned int generic_size = offsetof(struct net_generic, ptr[max_gen_ptrs]);

 	ng = kzalloc(generic_size, GFP_KERNEL);
 	if (ng)
@@ -60,12 +63,12 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	struct net_generic *ng, *old_ng;

 	BUG_ON(!mutex_is_locked(&net_mutex));
-	BUG_ON(id == 0);
+	BUG_ON(id < MIN_PERNET_OPS_ID);

 	old_ng = rcu_dereference_protected(net->gen,
					   lockdep_is_held(&net_mutex));

-	if (old_ng->s.len >= id) {
-		old_ng->ptr[id - 1] = data;
+	if (old_ng->s.len > id) {
+		old_ng->ptr[id] = data;
 		return 0;
 	}

@@ -84,8 +87,9 @@ static int net_assign_generic(struct net *net,
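The "start with 3 on x86_64" remark follows directly from the union overlay: MIN_PERNET_OPS_ID is the first ptr[] slot lying past the len/rcu header. A compile-time check of that arithmetic, using a two-pointer rcu_head stand-in (the real struct rcu_head is also two pointers) and a [1]-element array in place of the flexible array:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct rcu_head: two pointers, as in the kernel. */
struct rcu_head_stub {
	void *next;
	void (*func)(void *);
};

/* Mirrors the union overlay from the patch: the len/rcu header and the
 * pointer array share storage. */
struct net_generic_model {
	union {
		struct {
			unsigned int len;
			struct rcu_head_stub rcu;
		} s;
		void *ptr[1];	/* stand-in for ptr[0] */
	};
};

/* First id whose slot does not alias the header: round the header size
 * up to a whole number of pointer slots. */
#define MODEL_MIN_PERNET_OPS_ID \
	((sizeof(struct net_generic_model) + sizeof(void *) - 1) / sizeof(void *))

static size_t min_ops_id(void)
{
	return MODEL_MIN_PERNET_OPS_ID;
}

static size_t header_size(void)
{
	return sizeof(((struct net_generic_model *)0)->s);
}
```

On LP64 the header is 24 bytes (4-byte len padded to 8, plus 16 for rcu), so the first free slot is index 3; ILP32 (12-byte header, 4-byte slots) lands on 3 as well, which is why ids are just cookies starting at 3.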
Re: [PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
On Thu, 2016-12-01 at 14:56 -0800, Mahesh Bandewar wrote:
> From: Mahesh Bandewar
>
> If the initial broadcast probe(s) is/are lost, the neigh entry won't have
> a valid address for the neighbour. In a situation like this, the fallback
> should be to send a broadcast probe; however, the code logic continues
> sending ucast probes to 00:00:00:00:00:00. The default value of ucast
> probes is 3, so the system usually recovers after three such probes, but
> if the configured value is larger it takes that many probes (a probe is
> sent every second in the default config) / seconds to recover, making the
> machine not available on the network.
>
> This patch just ensures that the unicast address is not NULL, and
> otherwise falls back to sending a broadcast probe.
>
> Signed-off-by: Mahesh Bandewar
> ---
>  net/ipv4/arp.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
> index 89a8cac4726a..56fb33d5ed31 100644
> --- a/net/ipv4/arp.c
> +++ b/net/ipv4/arp.c
> @@ -330,6 +330,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  {
>  	__be32 saddr = 0;
>  	u8 dst_ha[MAX_ADDR_LEN], *dst_hw = NULL;
> +	u8 null_dev_hw_addr[MAX_ADDR_LEN];
>  	struct net_device *dev = neigh->dev;
>  	__be32 target = *(__be32 *)neigh->primary_key;
>  	int probes = atomic_read(&neigh->probes);
> @@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>
>  	probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES);
>  	if (probes < 0) {
> +		memset(&null_dev_hw_addr, 0, dev->addr_len);
>  		if (!(neigh->nud_state & NUD_VALID))
>  			pr_debug("trying to ucast probe in NUD_INVALID\n");
>  		neigh_ha_snapshot(dst_ha, neigh, dev);
> -		dst_hw = dst_ha;
> +		if (memcmp(&dst_ha, &null_dev_hw_addr, dev->addr_len) != 0)
> +			dst_hw = dst_ha;
>  	} else {
>  		probes -= NEIGH_VAR(neigh->parms, APP_PROBES);
>  		if (probes < 0) {

Why is this an IPv4-specific issue ?

What about IPv6 ?
I would try something in neighbour code, maybe : diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 782dd866366554e53dda3e6c69c807ec90bd0e08..fdfb177eecb6a9b1479eedde457cb1f652d32c68 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -916,7 +916,10 @@ static void neigh_timer_handler(unsigned long arg) neigh_dbg(2, "neigh %p is probed\n", neigh); neigh->nud_state = NUD_PROBE; neigh->updated = jiffies; - atomic_set(&neigh->probes, 0); + atomic_set(&neigh->probes, + (neigh->output == neigh_blackhole) ? + NEIGH_VAR(neigh->parms, UCAST_PROBES) : + 0); notify = 1; next = now + NEIGH_VAR(neigh->parms, RETRANS_TIME); } Thanks.
[PATCH] netlink: 2-clause nla_ok()
nla_ok() consists of 3 clauses: 1) int rem >= (int)sizeof(struct nlattr) 2) u16 nla_len >= sizeof(struct nlattr) 3) u16 nla_len <= int rem The statement is that clause (1) is redundant. What it does is ensure that "rem" is a positive number, so that in clause (3) a positive number is compared to a positive number with no problems. However, "u16" fully fits into "int", and integers do not change value when upcast even to a signed type. Negative integers will be rejected by clause (3) just fine. Small positive integers will be rejected by transitivity of the comparison operator. NOTE: all of the above DOES NOT apply to nlmsg_ok(), where ->nlmsg_len is u32(!), so 3 clauses AND A CAST TO INT are necessary. Obligatory space savings report: -1.6 KB $ ./scripts/bloat-o-meter ../vmlinux-000* ../vmlinux-001* add/remove: 0/0 grow/shrink: 3/63 up/down: 35/-1692 (-1657) function old new delta validate_scan_freqs 142 155 +13 tcf_em_tree_validate 867 879 +12 dcbnl_ieee_del 328 338 +10 netlbl_cipsov4_add_common.isra 218 215 -3 ... ovs_nla_put_actions 888 806 -82 netlbl_cipsov4_add_std 1648 1566 -82 nl80211_parse_sched_scan 2889 2780 -109 ip_tun_from_nlattr 3086 2945 -141 Signed-off-by: Alexey Dobriyan --- include/net/netlink.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- a/include/net/netlink.h +++ b/include/net/netlink.h @@ -698,8 +698,7 @@ static inline int nla_len(const struct nlattr *nla) */ static inline int nla_ok(const struct nlattr *nla, int remaining) { - return remaining >= (int) sizeof(*nla) && - nla->nla_len >= sizeof(*nla) && + return nla->nla_len >= sizeof(*nla) && nla->nla_len <= remaining; }
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On Thu, Dec 01, 2016 at 10:57:52PM +0100, Hannes Frederic Sowa wrote: > On 30.11.2016 19:22, Ido Schimmel wrote: > > On Wed, Nov 30, 2016 at 05:49:56PM +0100, Hannes Frederic Sowa wrote: > >> On 30.11.2016 17:32, Ido Schimmel wrote: > >>> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote: > On 30.11.2016 11:09, Jiri Pirko wrote: > > From: Ido Schimmel> > > > Make sure the device has a complete view of the FIB tables by invoking > > their dump during module init. > > > > Signed-off-by: Ido Schimmel > > Signed-off-by: Jiri Pirko > > --- > > .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 23 > > ++ > > 1 file changed, 23 insertions(+) > > > > diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c > > b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c > > index 14bed1d..d176047 100644 > > --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c > > +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c > > @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct > > notifier_block *nb, > > return NOTIFY_DONE; > > } > > > > +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb) > > +{ > > + struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, > > fib_nb); > > + > > + /* Flush pending FIB notifications and then flush the device's > > +* table before requesting another dump. Do that with RTNL held, > > +* as FIB notification block is already registered. 
> > +*/ > > + mlxsw_core_flush_owq(); > > + rtnl_lock(); > > + mlxsw_sp_router_fib_flush(mlxsw_sp); > > + rtnl_unlock(); > > +} > > + > > int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp) > > { > > + fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush; > > int err; > > > > INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list); > > @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp > > *mlxsw_sp) > > > > mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event; > > register_fib_notifier(&mlxsw_sp->fib_nb); > > Sorry to pick in here again: > > There is a race here. You need to protect the registration of the fib > notifier as well by the sequence counter. Updates here are not ordered > in relation to this code below. > >>> > >>> You mean updates that can be received after you registered the notifier > >>> and until the dump started? I'm aware of that and that's OK. This > >>> listener should be able to handle duplicates. > >> > >> I am not concerned about duplicates, but about ordering deletes and > >> getting an add from the RCU code: you will add the node to hw while it is > >> deleted in the software path. You probably will ignore the delete > >> because nothing is installed in hw and later add the node which was > >> actually deleted but just reordered, which happened on another CPU, no? > > > > Are you referring to reordering in the workqueue? We already covered > > this using an ordered workqueue, which has one context of execution > > system-wide. > > Oops, sorry, I missed that mail. Probably read it on the mobile phone and > it became invisible for me later on. Busy day... ;) Yet another reason not to read emails on your phone ;) > The reordering in the workqueue seems fine to me and also still necessary. Correct. > Basically, if you delete a node right now the kernel might simply do a > RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers > or synchronization with the reader side.
Thus you might get a callback > from the notifier for a delete event on the one CPU and you end up > queueing this fib entry after the delete queue, because the RCU walk > isn't protected by any means. > > Looking closer at this series again, I overlooked the fact that you > fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all > orders fetching of fib_seq and thus the RCU dumping after any concurrent > executing fib table update, also the mutex_lock and unlock provide > proper acquire and release fences, so the CPU indeed sees the effect of > a RCU_INIT_POINTER update done on another CPU, because they pair with > the rtnl_unlock which might happen on the other CPU. Yep, Exactly. I had a feeling this is the issue you were referring to, but then you were the one to suggest the use of RTNL, so I was quite confused. > My question is if this is a bit of luck and if we should make this > explicit by putting the registration itself under the protection of the > sequence counter. I favor the additional protection, e.g. if we some day > actually we optimize the fib_seq code? Otherwise we might probably > document this fact. :) Well,
[PATCH 2/3] netns: add dummy struct inside "struct net_generic"
This is a precursor to fixing the "[id - 1]" bloat inside net_generic(). The name "s" is chosen to complement the name "u" often used for dummy unions. Signed-off-by: Alexey Dobriyan --- include/net/netns/generic.h | 6 -- net/core/net_namespace.c | 8 2 files changed, 8 insertions(+), 6 deletions(-) --- a/include/net/netns/generic.h +++ b/include/net/netns/generic.h @@ -25,8 +25,10 @@ */ struct net_generic { - unsigned int len; - struct rcu_head rcu; + struct { + unsigned int len; + struct rcu_head rcu; + } s; void *ptr[0]; }; --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -50,7 +50,7 @@ static struct net_generic *net_alloc_generic(void) ng = kzalloc(generic_size, GFP_KERNEL); if (ng) - ng->len = max_gen_ptrs; + ng->s.len = max_gen_ptrs; return ng; } @@ -64,7 +64,7 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data) old_ng = rcu_dereference_protected(net->gen, lockdep_is_held(&net_mutex)); - if (old_ng->len >= id) { + if (old_ng->s.len >= id) { old_ng->ptr[id - 1] = data; return 0; } @@ -84,11 +84,11 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data) * the old copy for kfree after a grace period. */ - memcpy(&ng->ptr, &old_ng->ptr, old_ng->len * sizeof(void*)); + memcpy(&ng->ptr, &old_ng->ptr, old_ng->s.len * sizeof(void*)); ng->ptr[id - 1] = data; rcu_assign_pointer(net->gen, ng); - kfree_rcu(old_ng, rcu); + kfree_rcu(old_ng, s.rcu); return 0; }
[PATCH 1/3] netns: publish net_generic correctly
Publishing the net_generic pointer is done with a silly mistake: the new array is published BEFORE setting the freshly acquired pernet subsystem pointer. memcpy rcu_assign_pointer kfree_rcu ng->ptr[id - 1] = data; This bug was introduced with commit dec827d174d7f76c457238800183ca864a639365 ("[NETNS]: The generic per-net pointers.") in the glorious days of chopping the networking stack into containers proper, 8.5 years ago (whee...). How did it not trigger for so long? Well, you need a quite specific set of conditions: *) the race window opens once per pernet subsystem addition (read: modprobe or boot) *) not every pernet subsystem is eligible (need ->id and ->size) *) not every pernet subsystem is vulnerable (need incorrect or absent ordering of register_pernet_subsys() and actually using net_generic()) *) to hide the bug even more, the default is to preallocate 13 pointers, which is actually quite a lot. You need IPv6, netfilter, bridging etc. loaded together to trigger reallocation in the first place. Trimmed-down configs are OK. Signed-off-by: Alexey Dobriyan --- net/core/net_namespace.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -64,9 +64,10 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data) old_ng = rcu_dereference_protected(net->gen, lockdep_is_held(&net_mutex)); - ng = old_ng; - if (old_ng->len >= id) - goto assign; + if (old_ng->len >= id) { + old_ng->ptr[id - 1] = data; + return 0; + } ng = net_alloc_generic(); if (ng == NULL) @@ -84,11 +85,10 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data) */ memcpy(&ng->ptr, &old_ng->ptr, old_ng->len * sizeof(void*)); + ng->ptr[id - 1] = data; rcu_assign_pointer(net->gen, ng); kfree_rcu(old_ng, rcu); -assign: - ng->ptr[id - 1] = data; return 0; }
[PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
From: Mahesh Bandewar If the initial broadcast probe(s) is/are lost, the neigh entry won't have a valid address for the neighbour. In a situation like this, the fall back should be to send a broadcast probe; however, the code logic continues sending ucast probes to 00:00:00:00:00:00. The default value of ucast probes is 3, so the system usually recovers after three such probes, but if the configured value is larger it takes that many probes (a probe is sent every second in the default config) / seconds to recover, making the machine unavailable on the network. This patch just ensures that the unicast address is not NULL, otherwise it falls back to sending a broadcast probe. Signed-off-by: Mahesh Bandewar --- net/ipv4/arp.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 89a8cac4726a..56fb33d5ed31 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -330,6 +330,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) { __be32 saddr = 0; u8 dst_ha[MAX_ADDR_LEN], *dst_hw = NULL; + u8 null_dev_hw_addr[MAX_ADDR_LEN]; struct net_device *dev = neigh->dev; __be32 target = *(__be32 *)neigh->primary_key; int probes = atomic_read(&neigh->probes); @@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES); if (probes < 0) { + memset(&null_dev_hw_addr, 0, dev->addr_len); if (!(neigh->nud_state & NUD_VALID)) if (!(neigh->nud_state & NUD_VALID)) pr_debug("trying to ucast probe in NUD_INVALID\n"); neigh_ha_snapshot(dst_ha, neigh, dev); - dst_hw = dst_ha; + if (memcmp(&dst_ha, &null_dev_hw_addr, dev->addr_len) != 0) + dst_hw = dst_ha; } else { probes -= NEIGH_VAR(neigh->parms, APP_PROBES); if (probes < 0) { -- 2.8.0.rc3.226.g39d4020
Re: Initial thoughts on TXDP
On 01.12.2016 21:13, Sowmini Varadhan wrote: > On (12/01/16 11:05), Tom Herbert wrote: >> >> Polling does not necessarily imply that networking monopolizes the CPU >> except when the CPU is otherwise idle. Presumably the application >> drives the polling when it is ready to receive work. > > I'm not grokking that- "if the cpu is idle, we want to busy-poll > and make it 0% idle"? Keeping the CPU 0% idle has all sorts > of issues, see slide 20 of > http://www.slideshare.net/shemminger/dpdk-performance > >>> and one other critical difference from the hot-potato-forwarding >>> model (the sort of OVS model that DPDK etc might arguably be a fit for) >>> does not apply: in order to figure out the ethernet and IP headers >>> in the response correctly at all times (in the face of things like VRRP, >>> gw changes, gw's mac addr changes etc) the application should really >>> be listening on NETLINK sockets for modifications to the networking >>> state - again points to needing a select() socket set where you can >>> have both the I/O fds and the netlink socket, >>> >> I would think that that sort of management would not be implemented in a >> fast path processing thread for an application. > > sure, but my point was that *XDP and other stack-bypass methods need > to provide a select()able socket: when your use-case is not about just > networking, you have to snoop on changes to the control plane, and update > your data path. In the OVS case (pure networking) the OVS control plane > updates are intrinsic to OVS. For the rest of the request/response world, > we need a select()able socket set to do this elegantly (not really > possible in DPDK, for example) Busypoll on steroids is what Windows does by mapping the user space "doorbell" into a vDSO and letting user space loop on that, maybe with MWAIT/MONITOR. The interesting thing is that you can map other events to this notification event, too. It sounds like a usable idea to me and resembles what we already do with futexes. Bye, Hannes
stmmac: turn coalescing / NAPI off in stmmac
> > @@ -2771,12 +2771,8 @@ static netdev_features_t stmmac_fix_features(struct > > net_device *dev, > > features &= ~NETIF_F_CSUM_MASK; > > > > /* Disable tso if asked by ethtool */ > > - if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) { > > - if (features & NETIF_F_TSO) > > - priv->tso = true; > > - else > > - priv->tso = false; > > - } > > + if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) > > + priv->tso = !!(features & NETIF_F_TSO); > > > > Pavel, this really seems arbitrary. > > Whilst I really appreciate you're looking into this driver a bit because > of some issues you are trying to resolve, I'd like to ask that you not > start bombarding me with nit-pick cleanups here and there and instead > concentrate on the real bug or issue. Well, fixing clean code is easier than fixing strange code... Plus I was hoping to make the mainainers to talk. The driver is listed as supported after all. Anyway... since you asked. I belive I have way to disable NAPI / tx coalescing in the driver. Unfortunately, locking is missing on the rx path, and needs to be extended to _irqsave variant on tx path. So patch currently looks like this (hand edited, can't be applied, got it working few hours ago). Does it look acceptable? I'd prefer this to go after the patch that pulls common code to single place, so that single place needs to be patched. Plus I guess I should add ifdefs, so that more advanced NAPI / tx coalescing code can be reactivated when it is fixed. Trivial fixes can go on top. Does that sound like a plan? Which tree do you want patches against? https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/ ? 
Best regards, Pavel diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 0b706a7..c0016c8 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -1395,9 +1397,10 @@ static void __stmmac_tx_clean(struct stmmac_priv *priv) static void stmmac_tx_clean(struct stmmac_priv *priv) { - spin_lock(&priv->tx_lock); + unsigned long flags; + spin_lock_irqsave(&priv->tx_lock, flags); __stmmac_tx_clean(priv); - spin_unlock(&priv->tx_lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); } static inline void stmmac_enable_dma_irq(struct stmmac_priv *priv) @@ -1441,6 +1444,8 @@ static void stmmac_tx_err(struct stmmac_priv *priv) netif_wake_queue(priv->dev); } +static int stmmac_rx(struct stmmac_priv *priv, int limit); + /** * stmmac_dma_interrupt - DMA ISR * @priv: driver private structure @@ -1452,10 +1457,17 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv) { int status; int rxfifosz = priv->plat->rx_fifo_size; + unsigned long flags; status = priv->hw->dma->dma_interrupt(priv->ioaddr, &priv->xstats); if (likely((status & handle_rx)) || (status & handle_tx)) { + int r; + spin_lock_irqsave(&priv->tx_lock, flags); + r = stmmac_rx(priv, 999); + spin_unlock_irqrestore(&priv->tx_lock, flags); +#if 0 if (likely(napi_schedule_prep(&priv->napi))) { //pr_err("napi: schedule\n"); stmmac_disable_dma_irq(priv); @@ -1463,7 +1475,8 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv) } else pr_err("napi: schedule failed\n"); #endif + stmmac_tx_clean(priv); } if (unlikely(status & tx_hard_error_bump_tc)) { /* Try to bump up the dma threshold on this failure */ @@ -1638,7 +1651,7 @@ static void stmmac_tx_timer(unsigned long data) { struct stmmac_priv *priv = (struct stmmac_priv *)data; - stmmac_tx_clean(priv); + //stmmac_tx_clean(priv); } /** @@ -1990,6 +2003,8 @@ static void stmmac_xmit_common(struct sk_buff *skb, struct net_device *dev, int if (likely(priv->tx_coal_frames >
priv->tx_count_frames)) { mod_timer(&priv->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer)); + priv->hw->desc->set_tx_ic(desc); + priv->xstats.tx_set_ic_bit++; } else { priv->tx_count_frames = 0; priv->hw->desc->set_tx_ic(desc); @@ -2038,8 +2053,9 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev) struct dma_desc *desc, *first, *mss_desc = NULL; u8 proto_hdr_len; int i; + unsigned long flags; - spin_lock(&priv->tx_lock); + spin_lock_irqsave(&priv->tx_lock, flags); /* Compute header lengths */ proto_hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb); @@ -2052,7 +2068,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device
Re: Initial thoughts on TXDP
Side note: On 01.12.2016 20:51, Tom Herbert wrote: >> > E.g. "mini-skb": Even if we assume that this provides a speedup >> > (where does that come from? should make no difference if a 32 or >> > 320 byte buffer gets allocated). >> > > It's the zero'ing of three cache lines. I believe we talked about that > at netdev. Jesper and I played with that again very recently: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590 In micro-benchmarks we saw a pretty good speed-up from not using the rep stosb generated by the gcc builtin but plain movq's. Probably the cost model for __builtin_memset in gcc is wrong? When Jesper is free we wanted to benchmark this and maybe come up with an arch-specific way of clearing, if it turns out to really improve throughput. SIMD instructions seem even faster, but kernel_fpu_begin/end() kills all the benefits. Bye, Hannes
Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
On 01.12.2016 00:39, Erik Nordmark wrote: > Implemented RFC7527 Enhanced DAD. > IPv6 duplicate address detection can fail if there is some temporary > loopback of Ethernet frames. RFC7527 solves this by including a random > nonce in the NS messages used for DAD, and if an NS is received with the > same nonce it is assumed to be a looped back DAD probe and is ignored. > RFC7527 is enabled by default. Can be disabled by setting both of > conf/{all,interface}/enhanced_dad to zero. > > Signed-off-by: Erik Nordmark > Signed-off-by: Bob Gilligan > --- Reviewed-by: Hannes Frederic Sowa Thanks! > @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb) > have_ifp: > if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) { > if (dad) { > + if (nonce != 0 && ifp->dad_nonce == nonce) { > + u8 *np = (u8 *) &nonce; > + /* Matching nonce if looped back */ > + ND_PRINTK(2, notice, > + "%s: IPv6 DAD loopback for > address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n", > + ifp->idev->dev->name, > + &ifp->addr, > + np[0], np[1], np[2], np[3], > + np[4], np[5]); > + goto out; > + } > /* >* We are colliding with another node >* who is doing DAD > I think it could be a "%pM" because it looks like a MAC address, but better leave it like that. :) Bye, Hannes
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote: > (Cc. netdev, we might have an issue with Paolo's UDP accounting and > small socket queues) > > On Wed, 30 Nov 2016 16:35:20 + > Mel Gorman wrote: > > > > I don't quite get why you are setting the socket recv size > > > (with -- -s and -S) to such a small number, size + 256. > > > > > > > Maybe I missed something at the time I wrote that but why would it > > need to be larger? > > Well, to me it is quite obvious that we need some queue to avoid packet > drops. We have two processes, netperf and netserver, that are sending > packets between each other (UDP_STREAM mostly netperf -> netserver). > These PIDs are getting scheduled and migrated between CPUs, and thus > do not get executed equally fast; thus a queue is needed to absorb the > fluctuations. > > The network stack is even partly catching your config "mistake" and > increases the socket queue size, so we can at minimum handle one max frame > (due to the skb "truesize" concept, approx PAGE_SIZE + overhead). > > Hopefully for localhost testing a small queue should not > result in packet drops. Testing... oops, this does result in packet > drops. > > Test command extracted from mmtests, UDP_STREAM size 1024: > > netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \ > -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895 > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) > port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET > Socket Message Elapsed Messages > Size Size Time Okay Errors Throughput > bytes bytes secs # # 10^6bits/sec > > 4608 1024 60.00 50024301 0 6829.98 > 2560 60.00 46133211 6298.72 > > Dropped packets: 50024301-46133211=3891090 > > To get a better drop indication, during this I run a command to get > system-wide network counters from the last second, so below numbers are > per second.
> > $ nstat > /dev/null && sleep 1 && nstat > #kernel > IpInReceives 885162 0.0 > IpInDelivers 885161 0.0 > IpOutRequests 885162 0.0 > UdpInDatagrams 776105 0.0 > UdpInErrors 109056 0.0 > UdpOutDatagrams 885160 0.0 > UdpRcvbufErrors 109056 0.0 > IpExtInOctets 931190476 0.0 > IpExtOutOctets 931189564 0.0 > IpExtInNoECTPkts 885162 0.0 > > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See > UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop > happens kernel side in __udp_queue_rcv_skb[1], because the receiving > process didn't empty its queue fast enough, see [2]. > > Although upstream changes are coming in this area, [2] is replaced with > __udp_enqueue_schedule_skb, which I actually tested with... hmm > > Retesting with kernel 4.7.0-baseline+ ... shows something else. > To Paolo, you might want to look into this. And it could also explain why > I've not seen the mentioned speedup by the mm-change, as I've been testing > this patch on top of net-next (at 93ba550) with Paolo's UDP changes. Thank you for reporting this. It seems that the commit 123b4a633580 ("udp: use it's own memory accounting schema") is too strict while checking the rcvbuf. For very small values of rcvbuf, it allows a single skb to be enqueued, while previously we allowed 2 of them to enter the queue, even if the first one's truesize exceeded rcvbuf, as in your test-case. Can you please try the following patch ?
Thank you, Paolo --- net/ipv4/udp.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index e1d0bf8..2f5dc92 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb) struct sk_buff_head *list = &sk->sk_receive_queue; int rmem, delta, amt, err = -ENOMEM; int size = skb->truesize; + int limit; /* try to avoid the costly atomic add/sub pair when the receive * queue is full; always allow at least a packet */ rmem = atomic_read(&sk->sk_rmem_alloc); - if (rmem && (rmem + size > sk->sk_rcvbuf)) + limit = size + sk->sk_rcvbuf; + if (rmem > limit) goto drop; /* we drop only if the receive buf is full and the receive * queue contains some other skb */ rmem = atomic_add_return(size, &sk->sk_rmem_alloc); - if ((rmem > sk->sk_rcvbuf) && (rmem > size)) + if (rmem > limit) goto uncharge_drop; spin_lock(&list->lock);
Re: [PATCH 1/1] NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
Daniele Palmas writes: > This patch adds support for PID 0x1040 of Telit LE922A. LGTM Acked-by: Bjørn Mork
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 1:47 PM, Rick Jones wrote: > On 12/01/2016 12:18 PM, Tom Herbert wrote: >> >> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones wrote: >>> >>> Just how much per-packet path-length are you thinking will go away under >>> the >>> likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does >>> some >>> non-trivial things to effective overhead (service demand) and so >>> throughput: >>> >> For plain in order TCP packets I believe we should be able to process >> each packet at nearly the same speed as GRO. Most of the protocol >> processing we do between GRO and the stack is the same; the >> differences are that we need to do a connection lookup in the stack >> path (note we now do this in UDP GRO and that hasn't shown up as a >> major hit). We also need to consider enqueue/dequeue on the socket, >> which is a major reason to try for lockless sockets in this instance. > > > So waving hands a bit, and taking the service demand for the GRO-on receive > test in my previous message (860 ns/KB), that would be ~ (1448/1024)*860 or > ~1.216 usec of CPU time per TCP segment, including ACK generation which > unless an explicit ACK-avoidance heuristic a la HP-UX 11/Solaris 2 is put in > place would be for every-other segment. Etc etc. > >> Sure, but try running something that emulates a more realistic workload >> than a TCP stream, like an RR test with relatively small payload and many >> connections. > > > That is a good point, which of course is why the RR tests are there in > netperf :) Don't get me wrong, I *like* seeing path-length reductions. What > would you posit is a relatively small payload? The promotion of IR10 > suggests that perhaps 14KB or so is sufficiently common, so I'll grasp at > that as the length of a piece of string: > We have to consider both request size and response size in RPC. Presumably, something like a memcache server is mostly serving data as opposed to reading it, so we are looking at receiving much smaller packets than are being sent.
Requests are going to be quite small, say 100 bytes, and unless we are doing a significant amount of pipelining on connections GRO would rarely kick in. Response size will have a lot of variability, anything from a few kilobytes up to a megabyte. I'm sorry I can't be more specific; this is an artifact of datacenters that have 100s of different applications and communication patterns. Maybe a 100b request size and 8K, 16K, 64K response sizes might be good for a test. > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_RR -- -P 12867 -r 128,14K > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET > to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr > > 16384 87380 128 14336 10.00 8118.31 1.57 -1.00 46.410 -1.000 > 16384 87380 > stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_RR -- -P 12867 -r 128,14K > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET > to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr > > 16384 87380 128 14336 10.00 5837.35 2.20 -1.00 90.628 -1.000 > 16384 87380 > > So, losing GRO doubled the service demand. I suppose I could see cutting > path-length in half based on the things you listed which would be bypassed? > > I'm sure mileage will vary with different NICs and CPUs. The ones used here > happened to be to hand. > This is also biased because you're using a single connection, but is consistent with data we've seen in the past.
To be clear, I'm not saying GRO is bad; the fact that GRO has such a visible impact in your test means that the GRO path is significantly more efficient. Closing the gap seen in your numbers would be a benefit; it would mean we have improved per-packet processing. Tom > happy benchmarking, > > rick > > Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly > goes to more than doubling, and halving to 7K narrows the delta: > > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_RR -- -P 12867 -r 128,28K > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET > to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote >
Re: [WIP] net+mlx4: auto doorbell
On Thu, 2016-12-01 at 15:20 -0500, David Miller wrote: > From: Eric Dumazet> Date: Thu, 01 Dec 2016 09:04:17 -0800 > > > On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote: > > > >> When qdisc layer or trafgen/af_packet see this indication it knows it > >> should/must flush the queue when it don't have more work left. Perhaps > >> through net_tx_action(), by registering itself and e.g. if qdisc_run() > >> is called and queue is empty then check if queue needs a flush. I would > >> also allow driver to flush and clear this bit. > > > > net_tx_action() is not normally called, unless BQL limit is hit and/or > > some qdiscs with throttling (HTB, TBF, FQ, ...) > > The one thing I wonder about is whether we should "ramp up" into a mode > where the NAPI poll does the doorbells instead of going directly there. > > Maybe I misunderstand your algorithm, but it looks to me like if there > are any active packets in the TX queue at enqueue time you will defer > the doorbell to the interrupt handler. > > Let's say we put 1 packet in, and hit the doorbell. > > Then another packet comes in and we defer the doorbell to the IRQ. > > At this point there are a couple things I'm unclear about. > > For example, if we didn't hit the doorbell, will the chip still take a > peek at the second descriptor? Depending upon how the doorbell works > it might, or it might not. It might depend on the hardware. I can easily check on mlx4, by increasing tx-usecs and tx-frames, and sending 2 packets back to back. > > Either way, wouldn't there be a possible condition where the chip > wouldn't see the second enqueued packet and we'd thus have the wire > idle until the interrupt + NAPI runs and hits the doorbell? > > This is why I think we should "ramp up" the doorbell deferral, in > order to avoid this potential wire idle time situation. > > Maybe the situation I'm worried about is not possible, so please > explain it to me :-) This is absolutely the problem. 
We might need to enable this mode only above a given load. We could have an EWMA of the number of packets that TX completion runs can dequeue. And enable auto doorbell only if we have that many packets in the TX ring (instead of the "1 packet threshold" of the WIP)
Re: [WIP] net+mlx4: auto doorbell
On Thu, 2016-12-01 at 13:32 -0800, Alexander Duyck wrote:
> A few years back when I did something like this on ixgbe I was told by
> you that the issue was that doing something like this would add too
> much latency. I was just wondering what the latency impact is on a
> change like this and if this now isn't that much of an issue?

I believe it is clear that we can not use this trick without an admin
choice. In my experience, mlx4 can deliver way more interrupts than ixgbe.
(No idea why.)

This "auto doorbell" is tied to proper "ethtool -c tx-usecs|tx-frames|
tx-frames-irq" settings, or is simply a choice for hosts dedicated to
content delivery (like video servers) with 40Gbit+ cards.

Under stress, softirqs can be handled by ksoftirqd, and might be delayed
by ~10 ms or even more.

This WIP was minimal, but we certainly need to add belts and suspenders:

1) Timestamps: use ktime_get_ns() to remember a timestamp for the last
   doorbell, and force a doorbell if it gets too old, checked in
   ndo_start_xmit() every time we would like to not send the doorbell.

2) A maximum number of packets not yet doorbelled.

But we can not rely on another high resolution timer, since timers also
require softirqs being processed timely.
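The two safety valves Eric lists can be sketched as follows. This is a hypothetical shape, not driver code; the constants stand in for values that would really come from the `ethtool -c tx-usecs`/`tx-frames` settings he mentions:

```c
#include <stdint.h>

#define DB_MAX_AGE_NS	(20 * 1000ULL)	/* force a doorbell after 20 us */
#define DB_MAX_PKTS	32U		/* ... or after 32 undoorbelled pkts */

struct db_state {
	uint64_t last_db_ns;	/* ktime_get_ns() at the last doorbell */
	unsigned int pending;	/* packets queued since the last doorbell */
};

/* Called from ndo_start_xmit() each time we would like to skip the
 * doorbell: returns nonzero when we must ring it anyway, either because
 * too many packets are pending or the last doorbell is too old.
 */
static int must_ring_doorbell(const struct db_state *db, uint64_t now_ns)
{
	if (db->pending >= DB_MAX_PKTS)
		return 1;
	if (now_ns - db->last_db_ns > DB_MAX_AGE_NS)
		return 1;
	return 0;
}
```

The age check is what keeps the scheme safe when softirqs (and thus TX completions) are delayed by ksoftirqd under stress.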
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On 30.11.2016 19:22, Ido Schimmel wrote:
> On Wed, Nov 30, 2016 at 05:49:56PM +0100, Hannes Frederic Sowa wrote:
>> On 30.11.2016 17:32, Ido Schimmel wrote:
>>> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
On 30.11.2016 11:09, Jiri Pirko wrote:
> From: Ido Schimmel
>
> Make sure the device has a complete view of the FIB tables by invoking
> their dump during module init.
>
> Signed-off-by: Ido Schimmel
> Signed-off-by: Jiri Pirko
> ---
>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 23 ++
>  1 file changed, 23 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> index 14bed1d..d176047 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>  	return NOTIFY_DONE;
>  }
>
> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
> +{
> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
> +
> +	/* Flush pending FIB notifications and then flush the device's
> +	 * table before requesting another dump. Do that with RTNL held,
> +	 * as FIB notification block is already registered.
> +	 */
> +	mlxsw_core_flush_owq();
> +	rtnl_lock();
> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
> +	rtnl_unlock();
> +}
> +
>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>  {
> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
>  	int err;
>
>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>
>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
>  	register_fib_notifier(&mlxsw_sp->fib_nb);

Sorry to pick in here again: There is a race here. You need to protect
the registration of the fib notifier as well with the sequence counter.
Updates here are not ordered in relation to this code below.
>>> You mean updates that can be received after you registered the notifier
>>> and until the dump started? I'm aware of that and that's OK. This
>>> listener should be able to handle duplicates.
>>
>> I am not concerned about duplicates, but about ordering of deletes:
>> getting an add from the RCU code, you will add the node to hw while it is
>> deleted in the software path. You probably will ignore the delete
>> because nothing is installed in hw, and later add the node which was
>> actually deleted but just reordered, which happened on another CPU, no?
>
> Are you referring to reordering in the workqueue? We already covered
> this using an ordered workqueue, which has one context of execution
> system-wide.

Oops, sorry, I missed that mail. Probably read it on the mobile phone and
it became invisible for me later on. Busy day... ;)

The reordering in the workqueue seems fine to me and also still necessary.

Basically, if you delete a node right now, the kernel might simply do a
RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers or
synchronization with the reader side. Thus you might get a callback from
the notifier for a delete event on one CPU, and you end up queueing this
fib entry after the delete in the queue, because the RCU walk isn't
protected by any means.

Looking closer at this series again, I overlooked the fact that you fetch
fib_seq within a rtnl_lock()/rtnl_unlock() pair, which first of all orders
the fetching of fib_seq, and thus the RCU dumping, after any concurrently
executing fib table update. Also, the mutex lock and unlock provide proper
acquire and release fences, so the CPU indeed sees the effect of a
RCU_INIT_POINTER update done on another CPU, because they pair with the
rtnl_unlock() which might happen on the other CPU.

My question is if this is a bit of luck, and if we should make this
explicit by putting the registration itself under the protection of the
sequence counter. I favor the additional protection, e.g.
in case we some day optimize the fib_seq code. Otherwise we should
probably document this fact. :)

>>> I've a follow-up patchset that introduces a new event in the switchdev
>>> notification chain called SWITCHDEV_SYNC, which is sent when port
>>> netdevs are enslaved to / released from a master device (points in time
>>> where kernel<->device can get out of sync). It will invoke
>>> re-propagation of configuration from different parts of the stack
>>> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
>>> in duplicates.
>>
>> Okay, understood. I wonder how we can protect against accidental abort
>> calls, actually. E.g. if I start to inject routes into my routing
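The "register the listener under the protection of the sequence counter" idea discussed in this thread can be modeled in userspace C. All names here are hypothetical stand-ins for the FIB notifier machinery, and the locking is stubbed out; the point is the retry loop (snapshot the counter, dump, re-check, register while still holding the lock so no update can slip in between a consistent dump and the registration):

```c
static unsigned int fib_seq;		/* bumped on every FIB table update */
static int registered;
static int dump_runs;

static void rtnl_lock(void)   { }	/* stubbed for the sketch */
static void rtnl_unlock(void) { }
static unsigned int fib_seq_read(void) { return fib_seq; }

static void fib_dump_to_listener(void)
{
	/* Model a concurrent route update racing with the first dump. */
	if (dump_runs++ == 0)
		fib_seq++;
}

static void register_fib_listener(void)
{
	unsigned int seq;

	for (;;) {
		rtnl_lock();
		seq = fib_seq_read();	/* snapshot before the dump */
		rtnl_unlock();

		fib_dump_to_listener();	/* RCU walk, may race with updates */

		rtnl_lock();
		if (seq == fib_seq_read())
			break;		/* dump was consistent */
		rtnl_unlock();		/* an update raced in: redo the dump */
	}
	/* Still under the lock: nothing can change the tables between the
	 * consistent dump and the registration itself. */
	registered = 1;
	rtnl_unlock();
}
```

This makes the ordering explicit instead of relying on the acquire/release fences that `rtnl_lock()`/`rtnl_unlock()` happen to provide today.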
[PATCH] net: ethernet: ti: davinci_cpdma: sanitize inter-module API
The davinci_cpdma module is a helper library that is used by the actual
device drivers and does nothing by itself, so all its API functions need
to be exported. Four new functions were added recently without an export,
so now we get a build error:

ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!

This exports those symbols. After taking a closer look, I found two more
global functions in this file that are not exported and not used anywhere,
so they can obviously be removed. There is also one function that is only
used internally in this module, and should be 'static'.

Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Fixes: 0fc6432cc78d ("net: ethernet: ti: davinci_cpdma: add weight function for channels")
Signed-off-by: Arnd Bergmann
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 63 +++--
 drivers/net/ethernet/ti/davinci_cpdma.h | 4 ---
 2 files changed, 21 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c776e4575d2d..b9d40f0cdf6c 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -628,6 +628,23 @@ int cpdma_ctlr_destroy(struct cpdma_ctlr *ctlr)
 }
 EXPORT_SYMBOL_GPL(cpdma_ctlr_destroy);

+static int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+	if (chan->state != CPDMA_STATE_ACTIVE) {
+		spin_unlock_irqrestore(&chan->lock, flags);
+		return -EINVAL;
+	}
+
+	dma_reg_write(chan->ctlr, enable ? chan->int_set : chan->int_clear,
+		      chan->mask);
+	spin_unlock_irqrestore(&chan->lock, flags);
+
+	return 0;
+}
+
 int cpdma_ctlr_int_ctrl(struct cpdma_ctlr *ctlr, bool enable)
 {
 	unsigned long flags;
@@ -796,6 +813,7 @@ int cpdma_chan_set_weight(struct cpdma_chan *ch, int weight)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_weight);

 /* cpdma_chan_get_min_rate - get minimum allowed rate for channel
  * Should be called before cpdma_chan_set_rate.
@@ -810,6 +828,7 @@ u32 cpdma_chan_get_min_rate(struct cpdma_ctlr *ctlr)

 	return DIV_ROUND_UP(divident, divisor);
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_min_rate);

 /* cpdma_chan_set_rate - limits bandwidth for transmit channel.
  * The bandwidth * limited channels have to be in order beginning from lowest.
@@ -853,6 +872,7 @@ int cpdma_chan_set_rate(struct cpdma_chan *ch, u32 rate)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_rate);

 u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
 {
@@ -865,6 +885,7 @@ u32 cpdma_chan_get_rate(struct cpdma_chan *ch)

 	return rate;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_rate);

 struct cpdma_chan *cpdma_chan_create(struct cpdma_ctlr *ctlr, int chan_num,
 				     cpdma_handler_fn handler, int rx_type)
@@ -1270,46 +1291,4 @@ int cpdma_chan_stop(struct cpdma_chan *chan)
 }
 EXPORT_SYMBOL_GPL(cpdma_chan_stop);

-int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&chan->lock, flags);
-	if (chan->state != CPDMA_STATE_ACTIVE) {
-		spin_unlock_irqrestore(&chan->lock, flags);
-		return -EINVAL;
-	}
-
-	dma_reg_write(chan->ctlr, enable ? chan->int_set : chan->int_clear,
-		      chan->mask);
-	spin_unlock_irqrestore(&chan->lock, flags);
-
-	return 0;
-}
-
-int cpdma_control_get(struct cpdma_ctlr *ctlr, int control)
-{
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&ctlr->lock, flags);
-	ret = _cpdma_control_get(ctlr, control);
-	spin_unlock_irqrestore(&ctlr->lock, flags);
-
-	return ret;
-}
-
-int cpdma_control_set(struct cpdma_ctlr *ctlr, int control, int value)
-{
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&ctlr->lock, flags);
-	ret = _cpdma_control_set(ctlr, control, value);
-	spin_unlock_irqrestore(&ctlr->lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(cpdma_control_set);
-
 MODULE_LICENSE("GPL");
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.h b/drivers/net/ethernet/ti/davinci_cpdma.h
index 4a167db2abab..36d0a09a3d44 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.h
+++ b/drivers/net/ethernet/ti/davinci_cpdma.h
@@ -87,7 +87,6 @@ int cpdma_chan_process(struct cpdma_chan *chan, int quota);

 int cpdma_ctlr_int_ctrl(struct cpdma_ctlr *ctlr, bool enable);
 void cpdma_ctlr_eoi(struct cpdma_ctlr *ctlr, u32 value);
-int cpdma_chan_int_ctrl(struct
Re: [flamebait] xdp, well meaning but pointless
On Thu, Dec 1, 2016 at 1:27 PM, Hannes Frederic Sowa wrote:
> On 01.12.2016 22:12, Tom Herbert wrote:
>> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa wrote:
>>> Hello,
>>>
>>> this is a good conversation and I simply want to bring my worries
>>> across. I don't have good solutions for the problems XDP tries to
>>> solve, but I fear we could get caught up in maintenance problems in
>>> the long term, given the ideas floating around on how to evolve XDP
>>> currently.
>>>
>>> On 01.12.2016 17:28, Thomas Graf wrote:
>>>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>>>> First of all, this is a rant targeted at XDP and not at eBPF as a
>>>>> whole. XDP manipulates packets at free will and thus all security
>>>>> guarantees are off, as in any user space solution.
>>>>>
>>>>> Secondly, user space provides policy, acl, more controlled memory
>>>>> protection, restartability and better debuggability. If I had multi
>>>>> tenant workloads I would definitely put more complex "business/acl"
>>>>> logic into user space, so I can make use of LSM and other features
>>>>> to especially prevent a network facing service from attacking the
>>>>> tenants. If stuff gets put into the kernel you run user controlled
>>>>> code in the kernel, exposing a much bigger attack vector.
>>>>>
>>>>> What use case do you see in XDP specifically, e.g. for container
>>>>> networking?
>>>>
>>>> DDOS mitigation to protect distributed applications in large clusters.
>>>> Relying on CDNs works to protect API gateways and frontends (as long as
>>>> they don't throw you out of their network) but offers no protection
>>>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>>>> level and allowing the mitigation capability to scale up with the number
>>>> of servers is natural and cheap.
>>>
>>> So far we e.g. always considered L2 attacks a problem of the network
>>> admin to correctly protect the environment. Are you talking about
>>> protecting the L3 data plane?
>>> Are there custom proprietary protocols in place which need custom
>>> protocol parsers that need involvement of the kernel before it could
>>> verify the packet?
>>>
>>> In the past we tried to protect the L3 data plane as well as we can in
>>> Linux, to allow the plain old server admin to set an IP address on an
>>> interface and install whatever software in user space. We try not only
>>> to protect it but also try to achieve fairness by adding a lot of
>>> counters everywhere. Are protections missing right now, or are we
>>> talking about better performance?
>>>
>> The technical plenary at the last IETF in Seoul a couple of weeks ago
>> was exclusively focussed on DDOS in light of the recent attack against
>> Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare
>> presentation by Nick Sullivan
>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>> alluded to some implementation of DDOS mitigation. In particular, on
>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>> numbers he gave were based on iptables+BPF, and that was a whole
>> 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the
>> mic, and that's also when I introduced XDP to the whole IETF :-) ). If
>> that's the best we can do, the Internet is in a world of hurt. DDOS
>> mitigation alone is probably a sufficient motivation to look at XDP. We
>> need something that drops bad packets as quickly as possible when under
>> attack, we need this to be integrated into the stack, we need it to be
>> programmable to deal with the increasing savvy of attackers, and we
>> don't want to be forced to be dependent on HW solutions. This is why
>> we created XDP!
>
> I totally understand that.
> But in my reply to David in this thread I mentioned DNS apex processing
> as being problematic, which is actually referred to in your linked slide
> deck on page 9 ("What do floods look like"), and the problem of parsing
> DNS packets in XDP due to string processing and looping inside eBPF.

I agree that eBPF is not going to be sufficient for everything we'll want
to do. Undoubtedly, we'll continue to see new additions of more helpers
to assist in processing, but at some point we will want to load a kernel
module that handles more complex processing and insert it at the XDP
callout. Nothing in the design of XDP precludes doing that, and I have
already posted the patches to generalize the XDP callout for that. Taking
either of these routes has tradeoffs, but regardless of whether this is
BPF or module code, the principles of XDP and its value in helping solve
some class of problems remain.

Tom

> Not to mention the fact that you might have to deal with fragments in
> the Internet. Some DOS mitigations were already abused to generate
> blackholes for other users. Filtering such stuff is quite complicated.
>
> I argued also
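The contrast the thread keeps circling around can be made concrete. The kind of drop decision that suits XDP well is a chain of fixed-offset header checks; what follows is a userspace C illustration of that style (not an actual eBPF/XDP program, and all names are made up). Parsing *inside* a DNS payload, where names are a variable-count sequence of length-prefixed labels, needs a loop, which is exactly the part that is awkward under the eBPF verifier's bounded-loop rules:

```c
#include <stddef.h>
#include <stdint.h>

enum verdict { PASS, DROP };

/* Drop IPv4/UDP packets aimed at port 53: every access below is at a
 * fixed (or header-length-derived) offset, with an explicit bounds
 * check first -- the shape the verifier is happy with.
 */
static enum verdict classify(const uint8_t *pkt, size_t len)
{
	size_t ihl;

	if (len < 14 + 20 + 8)			/* eth + min IPv4 + UDP */
		return PASS;
	if (pkt[12] != 0x08 || pkt[13] != 0x00)	/* ethertype != IPv4 */
		return PASS;
	ihl = (pkt[14] & 0x0f) * 4;		/* IPv4 header length */
	if (pkt[23] != 17)			/* protocol != UDP */
		return PASS;
	if (len < 14 + ihl + 8)
		return PASS;
	/* UDP destination port, network byte order */
	if (pkt[14 + ihl + 2] == 0 && pkt[14 + ihl + 3] == 53)
		return DROP;
	return PASS;
}
```

Walking the DNS question name that starts 12 bytes into the UDP payload would require iterating labels until a zero length byte, which is where the string-processing complaint above bites.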
Re: Initial thoughts on TXDP
On 12/01/2016 12:18 PM, Tom Herbert wrote:
> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones wrote:
>>>> Just how much per-packet path-length are you thinking will go away
>>>> under the likes of TXDP? It is admittedly "just" netperf, but losing
>>>> TSO/GSO does some non-trivial things to effective overhead (service
>>>> demand) and so throughput:
>>>
>>> For plain in-order TCP packets I believe we should be able to process
>>> each packet at nearly the same speed as GRO. Most of the protocol
>>> processing we do between GRO and the stack is the same; the differences
>>> are that we need to do a connection lookup in the stack path (note we
>>> now do this in UDP GRO and that hasn't shown up as a major hit). We
>>> also need to consider enqueue/dequeue on the socket, which is a major
>>> reason to try for lockless sockets in this instance.
>>
>> So waving hands a bit, and taking the service demand for the GRO-on
>> receive test in my previous message (860 ns/KB), that would be
>> ~ (1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including
>> ACK generation, which unless an explicit ACK-avoidance heuristic a la
>> HP-UX 11/Solaris 2 is put in place would be for every other segment.
>> Etc etc.
>
> Sure, but try running something that emulates a more realistic workload
> than a TCP stream, like an RR test with relatively small payload and
> many connections.

That is a good point, which of course is why the RR tests are there in
netperf :) Don't get me wrong, I *like* seeing path-length reductions.

What would you posit is a relatively small payload? The promotion of IR10
suggests that perhaps 14KB or so is sufficiently common, so I'll grasp at
that as the length of a piece of string:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.
CPU    CPU    S.dem  S.dem
Send   Recv   Size    Size   Time    Rate    local  remote local  remote
bytes  bytes  bytes   bytes  secs.   per sec % S    % U    us/Tr  us/Tr

16384  87380  128     14336  10.00   8118.31 1.57   -1.00  46.410 -1.000
16384  87380

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.  CPU    CPU    S.dem  S.dem
Send   Recv   Size    Size   Time    Rate    local  remote local  remote
bytes  bytes  bytes   bytes  secs.   per sec % S    % U    us/Tr  us/Tr

16384  87380  128     14336  10.00   5837.35 2.20   -1.00  90.628 -1.000
16384  87380

So, losing GRO doubled the service demand. I suppose I could see cutting
path-length in half based on the things you listed which would be
bypassed? I'm sure mileage will vary with different NICs and CPUs. The
ones used here happened to be to hand.

happy benchmarking,

rick

Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly
goes to more than doubling, and halving to 7K narrows the delta:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.  CPU    CPU    S.dem  S.dem
Send   Recv   Size    Size   Time    Rate    local  remote local  remote
bytes  bytes  bytes   bytes  secs.
per sec % S    % U    us/Tr  us/Tr

16384  87380  128     28672  10.00   6732.32 1.79   -1.00  63.819 -1.000
16384  87380

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.  CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate    local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec % S    % U    us/Tr   us/Tr

16384  87380  128     28672  10.00   3780.47 2.32   -1.00  147.280 -1.000
16384  87380

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.  CPU    CPU    S.dem  S.dem
Send   Recv   Size    Size   Time    Rate    local  remote local  remote
bytes  bytes  bytes   bytes  secs.   per sec % S    % U    us/Tr
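Rick's hand-waving arithmetic above ((1448/1024)*860, giving ~1.216 usec per segment) can be checked directly. A tiny helper, hypothetical and only for sanity-checking the numbers quoted in this thread, where service demand is reported in ns per KB (1024 bytes) and a full-size TCP segment carries 1448 bytes of payload:

```c
/* Per-segment CPU cost from a ns-per-KB service demand figure:
 * scale by seg_bytes / 1024.
 */
static double ns_per_segment(double ns_per_kb, double seg_bytes)
{
	return ns_per_kb * seg_bytes / 1024.0;
}
```

With 860 ns/KB and 1448-byte segments this yields roughly 1216 ns, matching the ~1.216 usec figure in the mail.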
Re: DSA vs. SWTICHDEV ?
Hi Andrew,

On 12/01/2016 12:31 PM, Andrew Lunn wrote:
> Hi Murali
>
>> 2. Switch mode, where it implements a simple Ethernet switch. Currently
>>    it doesn't have address learning capability, but in the future it
>>    can.
>
> If it does not have address learning capabilities, does it act like a
> plain old hub? What comes in one port goes out all others?

Thanks for the response! Yes. It is a plain hub; it replicates frames to
both ports. So we need to run a bridge layer for address learning in
software.

> Or can you do the learning in software on the host and program tables,
> which the hardware then uses?

I think not. I see we have a non-Linux implementation that does address
learning in software using a hash table, looking up the MAC for each
packet to see which port it needs to be sent to.

Murali

>> 3. Switch with HSR/PRP offload, where it provides HSR/PRP protocol
>>    support and cut-through switching.
>>
>> So a device needs to function in one of these modes. A regular Ethernet
>> driver that provides two network devices, one per port, and switchdev
>> for each physical port (in switch mode) will look ideal in this case.
>
> Yes, this seems the right model to use.
>
> Andrew

--
Murali Karicheri
Linux Kernel, Keystone
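The software MAC-learning scheme Murali describes (a hash table of learned source MACs, consulted per packet to pick the egress port, flooding on a miss) can be sketched minimally. This is hypothetical illustration code, not from the Keystone driver; a real forwarding database would also need collision handling and entry aging:

```c
#include <string.h>

#define FDB_SIZE	256
#define PORT_FLOOD	-1	/* unknown destination: replicate to all ports */

struct fdb_entry {
	unsigned char mac[6];
	int port;		/* PORT_FLOOD when the slot is empty */
};

static unsigned int fdb_hash(const unsigned char *mac)
{
	return mac[5] % FDB_SIZE;	/* last octet is enough for a sketch */
}

/* Learn: remember which port this source MAC was seen on. */
static void fdb_learn(struct fdb_entry *fdb, const unsigned char *src,
		      int ingress_port)
{
	struct fdb_entry *e = &fdb[fdb_hash(src)];

	memcpy(e->mac, src, 6);
	e->port = ingress_port;
}

/* Forwarding decision: known destination goes out one port, anything
 * else is flooded, exactly the hub behaviour discussed above.
 */
static int fdb_lookup(const struct fdb_entry *fdb, const unsigned char *dst)
{
	const struct fdb_entry *e = &fdb[fdb_hash(dst)];

	if (e->port != PORT_FLOOD && !memcmp(e->mac, dst, 6))
		return e->port;
	return PORT_FLOOD;
}
```

Without the learn step every lookup misses and the device degenerates to the plain-hub replication Andrew asks about.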
Re: [WIP] net+mlx4: auto doorbell
On Mon, Nov 28, 2016 at 10:58 PM, Eric Dumazet wrote:
> On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>
>> Not sure if this has been tried before, but the doorbell avoidance could
>> be done by the driver itself, because it knows a TX completion will come
>> shortly (well... if softirqs are not delayed too much!)
>>
>> The doorbell would be forced only if:
>>
>> ("skb->xmit_more is not set" AND "TX engine is not 'started' yet")
>> OR
>> ("too many [1] packets were put in the TX ring buffer, no point
>> deferring more")
>>
>> Start the pump, but once it is started, let the doorbells be done by
>> TX completion.
>>
>> ndo_start_xmit and the TX completion handler would have to maintain a
>> shared state describing whether packets were ready but the doorbell
>> deferred.
>>
>> Note that TX completion means "at least one packet was drained",
>> otherwise busy polling, constantly calling napi->poll(), would force a
>> doorbell too soon for devices sharing a NAPI for both RX and TX.
>>
>> But then, maybe busy poll would like to force a doorbell...
>>
>> I could try these ideas on mlx4 shortly.
>>
>> [1] The limit could be derived from active "ethtool -c" params,
>> e.g. tx-frames.
>
> I have a WIP that increases the pktgen rate by 75% on mlx4 when bulking
> is not used.
>
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:26  eth0   0.00  5822800.00  0.00  597064.41  0.00  0.00  1.00
> 22:43:27  eth0  24.00  5788237.00  2.09  593520.26  0.00  0.00  0.00
> 22:43:28  eth0  12.00      581.00  1.43  596551.47  0.00  0.00  1.00
> 22:43:29  eth0  22.00  5841516.00  1.61  598982.87  0.00  0.00  0.00
> 22:43:30  eth0   4.00  4389137.00  0.71  450058.08  0.00  0.00  1.00
> 22:43:31  eth0   4.00  5871008.00  0.72  602007.79  0.00  0.00  0.00
> 22:43:32  eth0  12.00  5891809.00  1.43  604142.60  0.00  0.00  1.00
> 22:43:33  eth0  10.00  5901904.00  1.12  605175.70  0.00  0.00  0.00
> 22:43:34  eth0   5.00  5907982.00  0.69  605798.99  0.00  0.00  1.00
> 22:43:35  eth0   2.00  5847086.00  0.12  599554.71  0.00  0.00  0.00
> Average:  eth0   9.50  5707925.60  0.99  585285.69  0.00  0.00  0.50
>
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:47  eth0   9.00  10226424.00  1.02  1048608.05  0.00  0.00  1.00
> 22:43:48  eth0   1.00  10316955.00  0.06  1057890.89  0.00  0.00  0.00
> 22:43:49  eth0   1.00  10310104.00  0.10  1057188.32  0.00  0.00  1.00
> 22:43:50  eth0   0.00  10249423.00  0.00  1050966.23  0.00  0.00  0.00
> 22:43:51  eth0   0.00  10210441.00  0.00  1046969.05  0.00  0.00  1.00
> 22:43:52  eth0   2.00  10198389.00  0.16  1045733.17  0.00  0.00  1.00
> 22:43:53  eth0   8.00  10079257.00  1.43  1033517.83  0.00  0.00  0.00
> 22:43:54  eth0   0.00   7693509.00  0.00       75.16  0.00  0.00  0.00
> 22:43:55  eth0   2.00  10343076.00  0.20  1060569.32  0.00  0.00  1.00
> 22:43:56  eth0   1.00  10224571.00  0.14  1048417.93  0.00  0.00  0.00
> Average:  eth0   2.40   9985214.90  0.31  1023874.60  0.00  0.00  0.50
>
> And about an 11% improvement on a mono-flow UDP_STREAM test.
>
> skb_set_owner_w() is now the most consuming function.

A few years back when I did something like this on ixgbe I was told by
you that the issue was that doing something like this would add too much
latency. I was just wondering what the latency impact is on a change
like this, and if this now isn't that much of an issue?
>
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:50:47  eth0   3.00  1355422.00  0.45  319706.04  0.00  0.00  0.00
> 22:50:48  eth0   2.00  1344270.00  0.42  317035.21  0.00  0.00  1.00
> 22:50:49  eth0   3.00  1350503.00  0.51  318478.34  0.00  0.00  0.00
> 22:50:50  eth0  29.00  1348593.00  2.86  318113.02  0.00  0.00  1.00
> 22:50:51  eth0  14.00  1354855.00  1.83  319508.56  0.00  0.00  0.00
> 22:50:52  eth0   7.00  1357794.00  0.73  320226.89  0.00  0.00  1.00
> 22:50:53  eth0   5.00  1326130.00  0.63  312784.72  0.00  0.00  0.00
> 22:50:54  eth0   2.00   994584.00  0.12  234598.40  0.00  0.00  1.00
> 22:50:55
Re: [PATCH iproute2] ip: update link types to show 6lowpan and ieee802.15.4 monitor
Hello.

On 28.10.2016 11:42, Stefan Schmidt wrote:
> Both types have been missing here and thus ip always showed
> only the numbers.
>
> Based on a suggestion from Alexander Aring.
>
> Signed-off-by: Stefan Schmidt

Did you somehow mangle this patch manually? Looking at the patch in your
git repo, it shows no author name. My patch was sent with git
format-patch and git send-email as usual and shows the right author. Was
there something wrong on my side or yours? Just checking to avoid it in
the future.

http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=8ae2c5382bd9d98a8f7ddcb1faad1a978d773909

regards
Stefan Schmidt
Re: [flamebait] xdp, well meaning but pointless
On 01.12.2016 22:12, Tom Herbert wrote:
> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa wrote:
>> Hello,
>>
>> this is a good conversation and I simply want to bring my worries
>> across. I don't have good solutions for the problems XDP tries to
>> solve, but I fear we could get caught up in maintenance problems in
>> the long term, given the ideas floating around on how to evolve XDP
>> currently.
>>
>> On 01.12.2016 17:28, Thomas Graf wrote:
>>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>>> First of all, this is a rant targeted at XDP and not at eBPF as a
>>>> whole. XDP manipulates packets at free will and thus all security
>>>> guarantees are off, as in any user space solution.
>>>>
>>>> Secondly, user space provides policy, acl, more controlled memory
>>>> protection, restartability and better debuggability. If I had multi
>>>> tenant workloads I would definitely put more complex "business/acl"
>>>> logic into user space, so I can make use of LSM and other features
>>>> to especially prevent a network facing service from attacking the
>>>> tenants. If stuff gets put into the kernel you run user controlled
>>>> code in the kernel, exposing a much bigger attack vector.
>>>>
>>>> What use case do you see in XDP specifically, e.g. for container
>>>> networking?
>>>
>>> DDOS mitigation to protect distributed applications in large clusters.
>>> Relying on CDNs works to protect API gateways and frontends (as long as
>>> they don't throw you out of their network) but offers no protection
>>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>>> level and allowing the mitigation capability to scale up with the number
>>> of servers is natural and cheap.
>>
>> So far we e.g. always considered L2 attacks a problem of the network
>> admin to correctly protect the environment. Are you talking about
>> protecting the L3 data plane? Are there custom proprietary protocols in
>> place which need custom protocol parsers that need involvement of the
>> kernel before it could verify the packet?
>>
>> In the past we tried to protect the L3 data plane as well as we can in
>> Linux, to allow the plain old server admin to set an IP address on an
>> interface and install whatever software in user space. We try not only
>> to protect it but also try to achieve fairness by adding a lot of
>> counters everywhere. Are protections missing right now, or are we
>> talking about better performance?
>>
> The technical plenary at the last IETF in Seoul a couple of weeks ago
> was exclusively focussed on DDOS in light of the recent attack against
> Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare
> presentation by Nick Sullivan
> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
> alluded to some implementation of DDOS mitigation. In particular, on
> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
> numbers he gave were based on iptables+BPF, and that was a whole
> 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the
> mic, and that's also when I introduced XDP to the whole IETF :-) ). If
> that's the best we can do, the Internet is in a world of hurt. DDOS
> mitigation alone is probably a sufficient motivation to look at XDP. We
> need something that drops bad packets as quickly as possible when under
> attack, we need this to be integrated into the stack, we need it to be
> programmable to deal with the increasing savvy of attackers, and we
> don't want to be forced to be dependent on HW solutions. This is why
> we created XDP!

I totally understand that. But in my reply to David in this thread I
mentioned DNS apex processing as being problematic, which is actually
referred to in your linked slide deck on page 9 ("What do floods look
like"), and the problem of parsing DNS packets in XDP due to string
processing and looping inside eBPF.

Not to mention the fact that you might have to deal with fragments in
the Internet.
Some DOS mitigations were already abused to generate blackholes for
other users. Filtering such stuff is quite complicated.

I argued also under the aspect of what Thomas said, that the outside
world of the cluster is already protected by a CDN.

Bye,
Hannes
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On Thu, Dec 01, 2016 at 10:09:19PM +0100, Hannes Frederic Sowa wrote:
> On 01.12.2016 21:54, Ido Schimmel wrote:
> > On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote:
> >> On 01.12.2016 21:04, David Miller wrote:
> >>>
> >>> Hannes and Ido,
> >>>
> >>> It looks like we are very close to having this in mergable shape; can
> >>> you guys work out this final issue and figure out if it really is
> >>> a merge stopper or not?
> >>
> >> Sure, if the fib notification register could be done under protection
> >> of the sequence counter, I don't see any more problems.
> >
> > Did you maybe miss my reply yesterday? Because I was trying to
> > understand what "ordering" you're referring to, but didn't receive a
> > reply from you.
>
> Oh, strange, I am pretty sure I replied to that. Let me resend it. :)

I did get this reply, and then replied myself here:
https://marc.info/?l=linux-netdev&m=148053017425465&w=2
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On 01.12.2016 21:54, Ido Schimmel wrote:
> On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote:
>> On 01.12.2016 21:04, David Miller wrote:
>>>
>>> Hannes and Ido,
>>>
>>> It looks like we are very close to having this in mergable shape; can
>>> you guys work out this final issue and figure out if it really is
>>> a merge stopper or not?
>>
>> Sure, if the fib notification register could be done under protection
>> of the sequence counter, I don't see any more problems.
>
> Did you maybe miss my reply yesterday? Because I was trying to
> understand what "ordering" you're referring to, but didn't receive a
> reply from you.

Oh, strange, I am pretty sure I replied to that. Let me resend it.

>> The sync handler is nice to have and can be done in a later patch
>> series.
>
> Sync handler?

I was talking about SWITCHDEV_SYNC.

Bye,
Hannes
Re: [patch] net: renesas: ravb: uninitialized return value
Hello! On 12/01/2016 11:57 PM, Dan Carpenter wrote: We want to set the other "err" variable here so that we can return it later. My version of GCC misses this issue but I caught it with a static checker. Fixes: 9f70eb339f52 ("net: ethernet: renesas: ravb: fix fixed-link phydev leaks") Hm, I somehow missed this one, probably due to the horrific CC list. :-( Signed-off-by: Dan Carpenter
Acked-by: Sergei Shtylyov
MBR, Sergei
Re: [flamebait] xdp, well meaning but pointless
On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa wrote: > Hello, > > this is a good conversation and I simply want to bring my worries > across. I don't have good solutions for the problems XDP tries to solve > but I fear we could get caught up in maintenance problems in the long > term given the ideas floating around on how to evolve XDP currently. > > On 01.12.2016 17:28, Thomas Graf wrote: >> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote: >>> First of all, this is a rant targeted at XDP and not at eBPF as a whole. >>> XDP manipulates packets at free will and thus all security guarantees >>> are off as well as in any user space solution. >>> >>> Secondly user space provides policy, acl, more controlled memory >>> protection, restartability and better debugability. If I had multi >>> tenant workloads I would definitely put more complex "business/acl" >>> logic into user space, so I can make use of LSM and other features to >>> especially prevent a network facing service to attack the tenants. If >>> stuff gets put into the kernel you run user controlled code in the >>> kernel exposing a much bigger attack vector. >>> >>> What use case do you see in XDP specifically e.g. for container networking? >> >> DDOS mitigation to protect distributed applications in large clusters. >> Relying on CDN works to protect API gateways and frontends (as long as >> they don't throw you out of their network) but offers no protection >> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server >> level and allowing the mitigation capability to scale up with the number >> of servers is natural and cheap. > > So far we e.g. always considered L2 attacks a problem of the network > admin to correctly protect the environment. Are you talking about > protecting the L3 data plane? Are there custom proprietary protocols in > place which need custom protocol parsers that need involvement of the > kernel before it could verify the packet?
> > In the past we tried to protect the L3 data plane as good as we can in > Linux to allow the plain old server admin to set an IP address on an > interface and install whatever software in user space. We try not only > to protect it but also try to achieve fairness by adding a lot of > counters everywhere. Are protections missing right now or are we talking > about better performance? > The technical plenary at the last IETF in Seoul a couple of weeks ago was exclusively focussed on DDOS in light of the recent attack against Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare presentation by Nick Sullivan (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) alluded to some implementation of DDOS mitigation. In particular, on slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel" numbers he gave were based on iptables+BPF, and that was a whole 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the mic and that's also when I introduced XDP to the whole IETF :-) ). If that's the best we can do, the Internet is in a world of hurt. DDOS mitigation alone is probably a sufficient motivation to look at XDP. We need something that drops bad packets as quickly as possible when under attack, we need this to be integrated into the stack, we need it to be programmable to deal with the increasing savvy of attackers, and we don't want to be forced to be dependent on HW solutions. This is why we created XDP! Tom > To provide fairness you often have to share validated data within the > kernel and with XDP. This requires consistent lookup methods for sockets > in the lower level. Those can be exported to XDP via external functions > and become part of uAPI which will limit our ability to change those > functions in future.
When the discussion started about early demuxing in > XDP I became really nervous, because suddenly the XDP program has to > decide correctly which protocol type it has and look in the correct > socket table for the socket. Different semantics for sockets can apply > here, e.g. some sockets are RCU managed, some end up using reference > counts. A wrong decision here would cause havoc in the kernel (XDP > considers packet as UDP but kernel stack as TCP). Also, who knows that > we won't have per-cpu socket tables we would keep that as uAPI (this is > btw. the dragonflyBSD approach to scaling)? Imagine someone writing a > SIP rewriter in XDP and depending on a coherent view of all sockets even > if their hash doesn't fit to the one of the queue? Suddenly something > which was thought of as being only mutable by one CPU becomes global > again and because of XDP we need to add locking because of uAPI. > > This discussion is parallel to the discussion about trace points, which > are not considered uAPI. If eBPF functions are not considered uAPI then > eBPF in the network stack will have much less value, because you > suddenly
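The early-drop verdict logic being argued about here can be sketched in plain C. A real XDP program is restricted C compiled to BPF, running at the driver hook and returning XDP_DROP/XDP_PASS; the sketch below only mirrors its parse-then-verdict shape in userspace, and the blocked port is an invented example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VERDICT_DROP 1   /* stands in for XDP_DROP */
#define VERDICT_PASS 2   /* stands in for XDP_PASS */

/* Minimal Ethernet+IPv4+UDP parse with the bounds checks the BPF
 * verifier would force on a real XDP program. */
static int verdict(const uint8_t *pkt, size_t len, uint16_t blocked_port)
{
    if (len < 14 + 20)                       /* Ethernet + minimal IPv4 */
        return VERDICT_PASS;
    if (pkt[12] != 0x08 || pkt[13] != 0x00)  /* not IPv4 */
        return VERDICT_PASS;
    const uint8_t *ip = pkt + 14;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;
    if (ip[9] != 17)                         /* not UDP */
        return VERDICT_PASS;
    if (len < 14 + ihl + 8)                  /* truncated UDP header */
        return VERDICT_DROP;
    uint16_t dport = (uint16_t)(ip[ihl + 2] << 8 | ip[ihl + 3]);
    return dport == blocked_port ? VERDICT_DROP : VERDICT_PASS;
}

/* Builds a hypothetical UDP packet to dport and applies the verdict,
 * with port 1900 (SSDP reflection floods) as the blocked example. */
static int verdict_for_port(uint16_t dport)
{
    uint8_t pkt[42] = {0};
    pkt[12] = 0x08; pkt[13] = 0x00;          /* ethertype IPv4 */
    pkt[14] = 0x45;                          /* version 4, ihl 5 */
    pkt[23] = 17;                            /* protocol UDP */
    pkt[36] = (uint8_t)(dport >> 8);
    pkt[37] = (uint8_t)(dport & 0xff);
    return verdict(pkt, sizeof(pkt), 1900);
}
```

The point Tom makes is that this decision happens before the stack allocates or touches anything else, which is where the drop-rate headroom comes from.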
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On 30.11.2016 17:32, Ido Schimmel wrote: > Hi Hannes, > > On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote: >> On 30.11.2016 11:09, Jiri Pirko wrote: >>> From: Ido Schimmel >>> >>> Make sure the device has a complete view of the FIB tables by invoking >>> their dump during module init. >>> >>> Signed-off-by: Ido Schimmel >>> Signed-off-by: Jiri Pirko >>> --- >>> .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 23 >>> ++ >>> 1 file changed, 23 insertions(+) >>> >>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c >>> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c >>> index 14bed1d..d176047 100644 >>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c >>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c >>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct >>> notifier_block *nb, >>> return NOTIFY_DONE; >>> } >>> >>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb) >>> +{ >>> + struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb); >>> + >>> + /* Flush pending FIB notifications and then flush the device's >>> +* table before requesting another dump. Do that with RTNL held, >>> +* as FIB notification block is already registered. >>> +*/ >>> + mlxsw_core_flush_owq(); >>> + rtnl_lock(); >>> + mlxsw_sp_router_fib_flush(mlxsw_sp); >>> + rtnl_unlock(); >>> +} >>> + >>> int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp) >>> { >>> + fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush; >>> int err; >>> >>> INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list); >>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp) >>> >>> mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event; >>> register_fib_notifier(&mlxsw_sp->fib_nb); >> >> Sorry to pick in here again: >> >> There is a race here. You need to protect the registration of the fib >> notifier as well by the sequence counter. Updates here are not ordered >> in relation to this code below.
> > You mean updates that can be received after you registered the notifier > and until the dump started? I'm aware of that and that's OK. This > listener should be able to handle duplicates. I am not concerned about duplicates, but about ordering: if a delete and an add get reordered by the RCU code, you will add the node to hw while it is deleted in the software path. You will probably ignore the delete because nothing is installed in hw, and later add the node which was actually deleted but just reordered because it happened on another CPU, no? > I've a follow up patchset that introduces a new event in switchdev > notification chain called SWITCHDEV_SYNC, which is sent when port > netdevs are enslaved / released from a master device (points in time > where kernel<->device can get out of sync). It will invoke > re-propagation of configuration from different parts of the stack > (e.g. bridge driver, 8021q driver, fib/neigh code), which can result > in duplicates. Okay, understood. I wonder how we can actually protect against accidental abort calls. E.g. if I start to inject routes into my routing domain, how can I make sure the box doesn't die once I insert too many routes? Do we need to touch quagga etc? Thanks, Hannes
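The "registration under protection of the sequence counter" Hannes asks for boils down to the classic retry loop: snapshot the counter, replay the FIB tables to the freshly registered notifier, and redo the dump if a writer bumped the counter in the meantime. A toy userspace sketch (names are illustrative, not the kernel's seqlock API):

```c
#include <assert.h>

/* Toy sequence counter: writers bump it on every FIB change; the
 * dumper retries until it sees no concurrent change. Illustrative
 * only -- not the kernel's seqlock API. */
static unsigned int fib_seq;

static void fib_change(void) { fib_seq++; }

/* Returns how many passes the dump needed. */
static int dump_fib(int simulate_race)
{
    int attempts = 0;
    unsigned int seq;

    do {
        seq = fib_seq;
        attempts++;
        /* ... walk the tables, replay entries to the notifier ... */
        if (simulate_race && attempts == 1)
            fib_change();         /* a route changed mid-dump */
    } while (seq != fib_seq);     /* inconsistent snapshot: redo */
    return attempts;
}
```

The retry discards any dump that interleaved with a writer, which is what removes the add/delete reordering window being discussed.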
[patch] net: renesas: ravb: uninitialized return value
We want to set the other "err" variable here so that we can return it later. My version of GCC misses this issue but I caught it with a static checker. Fixes: 9f70eb339f52 ("net: ethernet: renesas: ravb: fix fixed-link phydev leaks") Signed-off-by: Dan Carpenter
--- Applies to the net tree for 4.10. diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index 2c0357c..92d7692 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -1016,8 +1016,6 @@ static int ravb_phy_init(struct net_device *ndev) * at this time. */ if (priv->chip_id == RCAR_GEN3) { - int err; - err = phy_set_max_speed(phydev, SPEED_100); if (err) { netdev_err(ndev, "failed to limit PHY to 100Mbit/s\n");
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote: > On 01.12.2016 21:04, David Miller wrote: > > > > Hannes and Ido, > > > > It looks like we are very close to having this in mergable shape, can > > you guys work out this final issue and figure out if it really is > > a merge stopped or not? > > Sure, if the fib notification register could be done under protection of > the sequence counter I don't see any more problems. Did you maybe miss my reply yesterday? Because I was trying to understand what "ordering" you're referring to, but didn't receive a reply from you. > The sync handler is nice to have and can be done in a later patch series. Sync handler?
Re: [PATCH net] tcp: warn on bogus MSS and try to amend it
On Thu, Dec 01, 2016 at 03:29:49PM -0500, David Miller wrote: > From: Marcelo Ricardo Leitner > Date: Wed, 30 Nov 2016 11:14:32 -0200 > > > There have been some reports lately about TCP connection stalls caused > > by NIC drivers that aren't setting gso_size on aggregated packets on rx > > path. This causes TCP to assume that the MSS is actually the size of the > > aggregated packet, which is invalid. > > > > Although the proper fix is to be done at each driver, it's often hard > > and cumbersome for one to debug, come to such root cause and report/fix > > it. > > > > This patch amends this situation in two ways. First, it adds a warning > > on when this situation occurs, so it gives a hint to those trying to > > debug this. It also limits the maximum probed MSS to the advertised MSS, > > as it should never be any higher than that. > > > > The result is that the connection may not have the best performance ever > > but it shouldn't stall, and the admin will have a hint on what to look > > for. > > > > Tested with virtio by forcing gso_size to 0. > > > > Cc: Jonathan Maxwell > > Signed-off-by: Marcelo Ricardo Leitner > > I totally agree with this change, however I think the warning message can > be improved in two ways: > > > len = skb_shinfo(skb)->gso_size ? : skb->len; > > if (len >= icsk->icsk_ack.rcv_mss) { > > - icsk->icsk_ack.rcv_mss = len; > > + icsk->icsk_ack.rcv_mss = min_t(unsigned int, len, > > + tcp_sk(sk)->advmss); > > + if (icsk->icsk_ack.rcv_mss != len) > > + pr_warn_once("Seems your NIC driver is doing bad RX > > acceleration. TCP performance may be compromised.\n"); > > We know it's a bad GRO implementation that causes this so let's be specific > in the > message, perhaps something like: > > Driver has suspect GRO implementation, TCP performance may be > compromised. Okay. > > Also, we have skb->dev available here most likely, so prefixing the message > with > skb->dev->name would make analyzing this situation even easier for someone > hitting > this.
Nice, yes. And this skb is mostly non-forwardable as it's bigger than the MTU, so if someone is using net namespaces and this skb were routed through some veth interfaces, it would give a false hint then, but that shouldn't happen. Unless it would fit a (larger) veth MTU; but still, one will probably simplify the setup while debugging this. > > I'm not certain if an skb->dev==NULL check is necessary here or not, but it is > definitely something you need to consider. > > Thanks! > Will check. Thanks! Marcelo
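For reference, the clamping behaviour the patch introduces can be condensed into a few lines of userspace C (function and parameter names here are made up for illustration; the kernel code operates on icsk/skb state and uses pr_warn_once instead):

```c
#include <assert.h>
#include <stdio.h>

/* rcv_mss probing with the advertised-MSS clamp from the patch: a
 * merged GRO blob without gso_size must not be mistaken for the
 * peer's MSS. All names are illustrative. */
static unsigned int probe_rcv_mss(unsigned int gso_size, unsigned int skb_len,
                                  unsigned int advmss, unsigned int rcv_mss)
{
    unsigned int len = gso_size ? gso_size : skb_len; /* gso_size ?: skb->len */

    if (len >= rcv_mss) {
        unsigned int clamped = len < advmss ? len : advmss;
        if (clamped != len)   /* driver has a suspect GRO implementation */
            fprintf(stderr, "suspect GRO: clamping %u to %u\n", len, clamped);
        return clamped;
    }
    return rcv_mss;
}
```

A 64KB aggregate with gso_size unset gets clamped to the advertised MSS, while a properly tagged 1448-byte segment passes through unchanged.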
Re: [flamebait] xdp, well meaning but pointless
Hello, this is a good conversation and I simply want to bring my worries across. I don't have good solutions for the problems XDP tries to solve but I fear we could get caught up in maintenance problems in the long term given the ideas floating around on how to evolve XDP currently. On 01.12.2016 17:28, Thomas Graf wrote: > On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote: >> First of all, this is a rant targeted at XDP and not at eBPF as a whole. >> XDP manipulates packets at free will and thus all security guarantees >> are off as well as in any user space solution. >> >> Secondly user space provides policy, acl, more controlled memory >> protection, restartability and better debugability. If I had multi >> tenant workloads I would definitely put more complex "business/acl" >> logic into user space, so I can make use of LSM and other features to >> especially prevent a network facing service to attack the tenants. If >> stuff gets put into the kernel you run user controlled code in the >> kernel exposing a much bigger attack vector. >> >> What use case do you see in XDP specifically e.g. for container networking? > > DDOS mitigation to protect distributed applications in large clusters. > Relying on CDN works to protect API gateways and frontends (as long as > they don't throw you out of their network) but offers no protection > beyond that, e.g. a noisy/hostile neighbour. Doing this at the server > level and allowing the mitigation capability to scale up with the number > of servers is natural and cheap. So far we e.g. always considered L2 attacks a problem of the network admin to correctly protect the environment. Are you talking about protecting the L3 data plane? Are there custom proprietary protocols in place which need custom protocol parsers that need involvement of the kernel before it could verify the packet? 
In the past we tried to protect the L3 data plane as good as we can in Linux to allow the plain old server admin to set an IP address on an interface and install whatever software in user space. We try not only to protect it but also try to achieve fairness by adding a lot of counters everywhere. Are protections missing right now or are we talking about better performance? To provide fairness you often have to share validated data within the kernel and with XDP. This requires consistent lookup methods for sockets in the lower level. Those can be exported to XDP via external functions and become part of uAPI which will limit our ability to change those functions in future. When the discussion started about early demuxing in XDP I became really nervous, because suddenly the XDP program has to decide correctly which protocol type it has and look in the correct socket table for the socket. Different semantics for sockets can apply here, e.g. some sockets are RCU managed, some end up using reference counts. A wrong decision here would cause havoc in the kernel (XDP considers packet as UDP but kernel stack as TCP). Also, who knows that we won't have per-cpu socket tables we would keep that as uAPI (this is btw. the dragonflyBSD approach to scaling)? Imagine someone writing a SIP rewriter in XDP and depending on a coherent view of all sockets even if their hash doesn't fit to the one of the queue? Suddenly something which was thought of as being only mutable by one CPU becomes global again and because of XDP we need to add locking because of uAPI. This discussion is parallel to the discussion about trace points, which are not considered uAPI. If eBPF functions are not considered uAPI then eBPF in the network stack will have much less value, because you suddenly depend on specific kernel versions again and cannot simply load the code into the kernel. The API checks will become very difficult to implement, see also the ongoing MODVERSIONS discussions on LKML some days back. 
>>> I agree with you if the LB is a software based appliance in either a >>> dedicated VM or on dedicated baremetal. >>> >>> The reality is turning out to be different in many cases though, LB >>> needs to be performed not only for north south but east west as well. >>> So even if I would handle LB for traffic entering my datacenter in user >>> space, I will need the same LB for packets from my applications and >>> I definitely don't want to move all of that into user space. >> >> The open question to me is why is programmability needed here. >> >> Look at the discussion about ECMP and consistent hashing. It is not very >> easy to actually write this code correctly. Why can't we just put C code >> into the kernel that implements this once and for all and let user space >> update the policies? > > Whatever LB logic is put in place with native C code now is unlikely the > logic we need in two years. We can't really predict the future. If it > was the case, networking would have been done long ago and we would all > be working on self eating ice cream now. Did LB algorithms on the networking layer change that much?
Re: [net PATCH 0/2] Don't use lco_csum to compute IPv4 checksum
From: Jeff Kirsher Date: Wed, 30 Nov 2016 13:15:22 -0800 > On Wed, 2016-11-30 at 09:47 -0500, David Miller wrote: >> From: Alexander Duyck >> Date: Mon, 28 Nov 2016 10:42:18 -0500 >> >> > When I implemented the GSO partial support in the Intel drivers I was >> using >> > lco_csum to compute the checksum that we needed to plug into the IPv4 >> > checksum field in order to cancel out the data that was not a part of >> the >> > IPv4 header. However this didn't take into account that the transport >> > offset might be pointing to the inner transport header. >> > >> > Instead of using lco_csum I have just coded around it so that we can >> use >> > the outer IP header plus the IP header length to determine where we >> need to >> > start our checksum and then just call csum_partial ourselves. >> > >> > This should fix the SIT issue reported on igb interfaces as well as >> similar >> > issues that would pop up on other Intel NICs. >> >> Jeff, are you going to send me a pull request with this stuff or would >> you be OK with my applying these directly to 'net'? > > Go ahead and apply those to your net tree, I do not want to hold this up. Ok, done, thanks Jeff.
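The fix Alexander describes — start the checksum at the outer IP header plus its length and "just call csum_partial" — rests on ordinary 16-bit one's-complement summing. A simplified userspace version of that arithmetic (the real csum_partial batches in 32/64-bit chunks and is arch-optimized; this sketch is not):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement 16-bit sum over a byte range, folded -- the
 * arithmetic behind csum_partial, minus its word-at-a-time batching. */
static uint16_t csum16(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(data[i] << 8 | data[i + 1]);
    if (len & 1)                       /* odd trailing byte */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                  /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

Being position-independent over 16-bit words, such a sum can be started at any offset — which is why picking the right start (outer IP header + IHL rather than the inner transport offset) is the whole fix.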
Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
Hi Andrew, Andrew Lunn writes: >> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h >> b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h >> index ab52c37..9e51405 100644 >> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h >> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h >> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops { >> int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg, >> u16 val); >> >> +/* Switch Software Reset */ >> +int (*reset)(struct mv88e6xxx_chip *chip); >> + > > Hi Vivien > > In my huge patch series of 6390, i've been using a g1_ prefix for > functionality which is in global 1, g2_ for global 2, etc. This has > worked for everything so far with the exception of setting which > reserved MAC addresses should be sent to the CPU. Most devices have it > in g2, but 6390 has it in g1. > > Please could you add the prefix. I don't understand. It looks like you are talking about the second part of the comment I made on your RFC patchset, about the Rsvd2CPU feature: https://www.mail-archive.com/netdev@vger.kernel.org/msg139837.html Switch reset routines are implemented in this patch in global1.c as mv88e6185_g1_reset and mv88e6352_g1_reset. 6185 and 6352 are implementation references for other switches. Thanks, Vivien
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On 01.12.2016 21:04, David Miller wrote: > > Hannes and Ido, > > It looks like we are very close to having this in mergable shape, can > you guys work out this final issue and figure out if it really is > a merge stopped or not? Sure, if the fib notification register could be done under protection of the sequence counter I don't see any more problems. The sync handler is nice to have and can be done in a later patch series.
Re: [PATCH net-next 0/3] sfc: defalconisation fixups
From: Edward Cree Date: Thu, 1 Dec 2016 16:59:13 + > A bug fix, the Kconfig change, and cleaning up a bit more unused code. > > Edward Cree (3): > sfc: fix debug message format string in efx_farch_handle_rx_not_ok > sfc: don't select SFC_FALCON > sfc: remove RESET_TYPE_RX_RECOVERY Series applied, thank you.
Re: iproute2 public git outdated?
Hi Phil, I suggest you try again now; it seems the iproute2 git repo was updated in the last 2-4 hours, and "git log" on master now shows a patch from the 30th of November (actually it is your "Add notes about dropped IPv4 route cache" patch). Regards, Rami Rosen On 1 December 2016 at 14:18, Phil Sutter wrote: > Hi, > > I am using iproute2's public git repo at this URL: > > git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git > > To my surprise, neither master nor net-next branches have received new > commits since end of October. Did the repo location change or was it > just not updated for a while? > > Thanks, Phil
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 12:13 PM, Sowmini Varadhan wrote: > On (12/01/16 11:05), Tom Herbert wrote: >> >> Polling does not necessarily imply that networking monopolizes the CPU >> except when the CPU is otherwise idle. Presumably the application >> drives the polling when it is ready to receive work. > > I'm not grokking that- "if the cpu is idle, we want to busy-poll > and make it 0% idle"? Keeping CPU 0% idle has all sorts > of issues, see slide 20 of > http://www.slideshare.net/shemminger/dpdk-performance > >> > and one other critical difference from the hot-potato-forwarding >> > model (the sort of OVS model that DPDK etc might arguably be a fit for) >> > does not apply: in order to figure out the ethernet and IP headers >> > in the response correctly at all times (in the face of things like VRRP, >> > gw changes, gw's mac addr changes etc) the application should really >> > be listening on NETLINK sockets for modifications to the networking >> > state - again points to needing a select() socket set where you can >> > have both the I/O fds and the netlink socket, >> > >> I would think that that management would not be implemented in a >> fast path processing thread for an application. > > sure, but my point was that *XDP and other stack-bypass methods needs > to provide a select()able socket: when your use-case is not about just > networking, you have to snoop on changes to the control plane, and update > your data path. In the OVS case (pure networking) the OVS control plane > updates are intrinsic to OVS. For the rest of the request/response world, > we need a select()able socket set to do this elegantly (not really > possible in DPDK, for example) > I'm not sure that TXDP can be reconciled to help OVS. The point of TXDP is to drive applications closer to bare metal performance, as I mentioned this is only going to be worth it if the fast path can be kept simple and not complicated by a requirement for generalization.
It seems like the second we put OVS in we're doubling the data path and accepting the performance consequences of a complex path anyway. TXDP can't cover the whole system (any more than DPDK can) and needs to work in concert with other mechanisms -- the key is how to steer the work amongst the CPUs. For instance, if a latency critical thread is running on some CPU, we either need a dedicated queue for the connections of the thread (e.g. ntuple filtering or aRFS support) or we need a fast way to move unrelated packets received on a queue processed by that CPU to other CPUs (less efficient, but no special HW support is needed either). Tom > >> The *SOs are always an interesting question. They make for great >> benchmarks, but in real life the amount of benefit is somewhat >> unclear. Under the wrong conditions, like all cwnds have collapsed or > > I think Rick's already bringing up this one. > > --Sowmini >
pull-request: can-next 2016-12-01
Hello David, this is a pull request of 4 patches for net-next/master. There are two patches by Chris Paterson for the rcar_can and rcar_canfd device tree binding documentation. And a patch by Geert Uytterhoeven that corrects the order of interrupt specifiers. The fourth patch by Colin Ian King fixes a spelling error in the kvaser_usb driver. regards, Marc --- The following changes since commit 8f679ed88f8860206edddff725e2749b4cdbb0e8: driver: ipvlan: Remove useless member mtu_adj of struct ipvl_dev (2016-11-30 15:01:32 -0500) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git tags/linux-can-next-for-4.10-20161201 for you to fetch changes up to 0d8f8efd32bace9f222fcc92d4a3132d877e5df6: net: can: usb: kvaser_usb: fix spelling mistake of "outstanding" (2016-12-01 14:27:02 +0100) linux-can-next-for-4.10-20161201 Chris Paterson (2): can: rcar_can: Add r8a7796 support can: rcar_canfd: Add r8a7796 support Colin Ian King (1): net: can: usb: kvaser_usb: fix spelling mistake of "outstanding" Geert Uytterhoeven (1): can: rcar_canfd: Correct order of interrupt specifiers Documentation/devicetree/bindings/net/can/rcar_can.txt | 12 +++- Documentation/devicetree/bindings/net/can/rcar_canfd.txt | 14 -- drivers/net/can/usb/kvaser_usb.c | 4 ++-- 3 files changed, 17 insertions(+), 13 deletions(-) -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions| Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917- | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
Re: [PATCH v3 net-next 3/3] openvswitch: Fix skb->protocol for vlan frames.
On Wed, Nov 30, 2016 at 6:30 AM, Jiri Benc wrote: > On Tue, 29 Nov 2016 15:30:53 -0800, Jarno Rajahalme wrote: >> Do not always set skb->protocol to be the ethertype of the L3 header. >> For a packet with non-accelerated VLAN tags skb->protocol needs to be >> the ethertype of the outermost non-accelerated VLAN ethertype. > > Well, the current handling of skb->protocol matches what used to be the > handling of the kernel net stack before Jiri Pirko cleaned up the vlan > code. > > I'm not opposed to changing this but I'm afraid it needs much deeper > review. Because with this in place, no core kernel functions that > depend on skb->protocol may be called from within openvswitch. > Can you give a specific example where it does not work? >> @@ -361,6 +362,11 @@ static int parse_vlan(struct sk_buff *skb, struct >> sw_flow_key *key) >> if (res <= 0) >> return res; >> >> + /* If the outer vlan tag was accelerated, skb->protocol should >> + * reflect the inner vlan type. */ >> + if (!eth_type_vlan(skb->protocol)) >> + skb->protocol = key->eth.cvlan.tpid; > > This should not depend on the current value in skb->protocol which > could be arbitrary at this point (from the point of view of how this > patch understands the skb->protocol values). It's easy to fix, though - > just add a local bool variable tracking whether the skb->protocol has > been set. > The skb->protocol value is set by the caller, so it should not be arbitrary. Is it missing in any case?
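The hunk under discussion hinges on eth_type_vlan() and on picking the inner tpid when the outer tag was accelerated. A standalone sketch of that selection rule (the fixup function is a simplification for illustration, not the actual OVS code path):

```c
#include <assert.h>
#include <stdint.h>

#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8
#define ETH_P_IP     0x0800

/* Mirrors the kernel's eth_type_vlan(): true for both VLAN tpids. */
static int eth_type_vlan(uint16_t ethertype)
{
    return ethertype == ETH_P_8021Q || ethertype == ETH_P_8021AD;
}

/* Rule from the hunk: if skb->protocol is not already a VLAN
 * ethertype (the outer tag was accelerated, i.e. stripped by hw),
 * it should become the inner vlan tpid. Simplified stand-in. */
static uint16_t fixup_protocol(uint16_t skb_protocol, uint16_t inner_tpid)
{
    if (!eth_type_vlan(skb_protocol))
        return inner_tpid;
    return skb_protocol;
}
```

Jiri's objection maps onto the first branch: the rule silently trusts that a non-VLAN skb_protocol means "outer tag accelerated" rather than "value never set".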
Re: [PATCH net-next 5/6] net: dsa: mv88e6xxx: add helper for switch ready
Hi Andrew, Andrew Lunn writes: > As we have seen in the past, this sort of loop is broken if we end up > sleeping for a long time. Please take the opportunity to replace it > with one of our _wait() helpers, e.g. mv88e6xxx_g1_wait() That won't work. The _wait() helpers are made to wait on self-clear (SC) bits, i.e. looping until they are cleared to zero. Here we want the opposite. I will keep this existing wait loop for the moment and work soon on a new patchset to rework the wait routines. We need a generic access to test a given value against a given mask and wrappers for busy bits, etc. >> +int mv88e6xxx_g1_init_ready(struct mv88e6xxx_chip *chip, bool *ready) >> +{ >> +u16 val; >> +int err; >> + >> +/* Check the value of the InitReady bit 11 */ >> +err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val); >> +if (err) >> +return err; >> + >> +*ready = !!(val & GLOBAL_STATUS_INIT_READY); > > I would actually do the wait here. That is better indeed. Thanks, Vivien
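The "generic access to test a given value against a given mask" Vivien mentions could look roughly like the following userspace sketch, polled against a fake status register (names such as wait_mask are hypothetical; a real kernel helper would take chip/reg arguments and usleep_range() between reads):

```c
#include <assert.h>
#include <stdint.h>

#define GLOBAL_STATUS_INIT_READY 0x0800   /* InitReady, bit 11 */

/* Fake status register for the sketch; a real helper would take a
 * (chip, reg) read callback instead of a global. */
static uint16_t fake_status;

static uint16_t reg_read(void) { return fake_status; }

/* Generic wait: poll until (reg & mask) == val, bounded attempts.
 * val == 0 covers self-clear bits, val == mask covers ready bits. */
static int wait_mask(uint16_t mask, uint16_t val, int max_tries)
{
    int i;

    for (i = 0; i < max_tries; i++) {
        if ((reg_read() & mask) == val)
            return 0;
        /* a real implementation would sleep here */
    }
    return -1;   /* would be -ETIMEDOUT in the kernel */
}

static int demo_ready(void)
{
    fake_status = GLOBAL_STATUS_INIT_READY;
    return wait_mask(GLOBAL_STATUS_INIT_READY, GLOBAL_STATUS_INIT_READY, 16);
}

static int demo_timeout(void)
{
    fake_status = 0;
    return wait_mask(GLOBAL_STATUS_INIT_READY, GLOBAL_STATUS_INIT_READY, 3);
}
```

Parameterizing on both mask and expected value is what lets one helper serve the self-clear case and the InitReady case alike.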
Re: [PATCH net] tcp: warn on bogus MSS and try to amend it
From: Marcelo Ricardo Leitner Date: Wed, 30 Nov 2016 11:14:32 -0200 > There have been some reports lately about TCP connection stalls caused > by NIC drivers that aren't setting gso_size on aggregated packets on rx > path. This causes TCP to assume that the MSS is actually the size of the > aggregated packet, which is invalid. > > Although the proper fix is to be done at each driver, it's often hard > and cumbersome for one to debug, come to such root cause and report/fix > it. > > This patch amends this situation in two ways. First, it adds a warning > on when this situation occurs, so it gives a hint to those trying to > debug this. It also limits the maximum probed MSS to the advertised MSS, > as it should never be any higher than that. > > The result is that the connection may not have the best performance ever > but it shouldn't stall, and the admin will have a hint on what to look > for. > > Tested with virtio by forcing gso_size to 0. > > Cc: Jonathan Maxwell > Signed-off-by: Marcelo Ricardo Leitner I totally agree with this change, however I think the warning message can be improved in two ways: > len = skb_shinfo(skb)->gso_size ? : skb->len; > if (len >= icsk->icsk_ack.rcv_mss) { > - icsk->icsk_ack.rcv_mss = len; > + icsk->icsk_ack.rcv_mss = min_t(unsigned int, len, > +tcp_sk(sk)->advmss); > + if (icsk->icsk_ack.rcv_mss != len) > + pr_warn_once("Seems your NIC driver is doing bad RX > acceleration. TCP performance may be compromised.\n"); We know it's a bad GRO implementation that causes this so let's be specific in the message, perhaps something like: Driver has suspect GRO implementation, TCP performance may be compromised. Also, we have skb->dev available here most likely, so prefixing the message with skb->dev->name would make analyzing this situation even easier for someone hitting this. I'm not certain if an skb->dev==NULL check is necessary here or not, but it is definitely something you need to consider. Thanks!
[PATCH -next] net: ethernet: ti: davinci_cpdma: add missing EXPORTs
As of commit 8f32b90981dcdb355516fb95953133f8d4e6b11d ("net: ethernet: ti: davinci_cpdma: add set rate for a channel") the ARM allmodconfig builds would fail modpost with: ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined! ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined! ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined! ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined! Since these weren't declared as static, it is assumed they were meant to be shared outside the file, and that modular build testing was simply overlooked. Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Cc: Ivan Khoronzhuk
Cc: Mugunthan V N
Cc: Grygorii Strashko
Cc: linux-o...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker
--- drivers/net/ethernet/ti/davinci_cpdma.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c index c776e4575d2d..36518fc5c7cc 100644 --- a/drivers/net/ethernet/ti/davinci_cpdma.c +++ b/drivers/net/ethernet/ti/davinci_cpdma.c @@ -796,6 +796,7 @@ int cpdma_chan_set_weight(struct cpdma_chan *ch, int weight) spin_unlock_irqrestore(&ch->lock, flags); return ret; } +EXPORT_SYMBOL_GPL(cpdma_chan_set_weight); /* cpdma_chan_get_min_rate - get minimum allowed rate for channel * Should be called before cpdma_chan_set_rate. @@ -810,6 +811,7 @@ u32 cpdma_chan_get_min_rate(struct cpdma_ctlr *ctlr) return DIV_ROUND_UP(divident, divisor); } +EXPORT_SYMBOL_GPL(cpdma_chan_get_min_rate); /* cpdma_chan_set_rate - limits bandwidth for transmit channel. * The bandwidth * limited channels have to be in order beginning from lowest.
@@ -853,6 +855,7 @@ int cpdma_chan_set_rate(struct cpdma_chan *ch, u32 rate) spin_unlock_irqrestore(>lock, flags); return ret; } +EXPORT_SYMBOL_GPL(cpdma_chan_set_rate); u32 cpdma_chan_get_rate(struct cpdma_chan *ch) { @@ -865,6 +868,7 @@ u32 cpdma_chan_get_rate(struct cpdma_chan *ch) return rate; } +EXPORT_SYMBOL_GPL(cpdma_chan_get_rate); struct cpdma_chan *cpdma_chan_create(struct cpdma_ctlr *ctlr, int chan_num, cpdma_handler_fn handler, int rx_type) -- 2.11.0
Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: Hannes Frederic Sowa Date: Wed, 30 Nov 2016 14:12:48 +0100 > David, one question: do you remember if you measured with linked lists > at that time, or also with arrays? I actually would expect small arrays > that entirely fit into cachelines to be actually faster than our current > approach, which also walks a linked list, probably the best algorithm to > trash cache lines. I ask because I currently prefer this approach more > than having large allocations in the O(1) case because of easier code > and easier management. I did not try this, and I do agree with you that for extremely small table sizes a list or array would perform better because of the cache behavior.
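To make concrete why this thread is about *consistent* hashing rather than simple hashing over the nexthop array: with plain `hash % n` selection, most flows are reassigned to a different next hop whenever a path comes or goes, which breaks anycast/affinity. This is a toy illustration (names and hash function invented, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Trivial flow-tuple mixer, purely illustrative. */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
	uint32_t h = saddr * 2654435761u ^ daddr * 2246822519u;
	return h ^ ((uint32_t)sport << 16) ^ dport;
}

/* Naive ECMP selection: index into the small nexthop array by modulo. */
static int pick_nexthop_mod(uint32_t hash, int num_nexthops)
{
	return hash % num_nexthops;
}

/* Count how many of n synthetic flows land on a different next hop when
 * the nexthop count drops from 4 to 3 (one path going down). */
static int flows_rehashed(int n)
{
	int moved = 0;

	for (int i = 0; i < n; i++) {
		uint32_t h = flow_hash(0x0a000001u + i, 0x0a000002u,
				       (uint16_t)(1000 + i), 80);
		if (pick_nexthop_mod(h, 4) != pick_nexthop_mod(h, 3))
			moved++;
	}
	return moved;
}
```

With uniform hashes roughly three quarters of flows move under modulo selection; a consistent-hashing scheme bounds the disruption to only the flows that were on the removed path, regardless of whether the nexthops are stored in a list or a cacheline-friendly array.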
Re: [PATCH] stmmac: simplify flag assignment
From: Pavel Machek Date: Wed, 30 Nov 2016 12:44:31 +0100 > > Simplify flag assignment. > > Signed-off-by: Pavel Machek > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > index ed20668..0b706a7 100644 > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > @@ -2771,12 +2771,8 @@ static netdev_features_t stmmac_fix_features(struct > net_device *dev, > features &= ~NETIF_F_CSUM_MASK; > > /* Disable tso if asked by ethtool */ > - if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) { > - if (features & NETIF_F_TSO) > - priv->tso = true; > - else > - priv->tso = false; > - } > + if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) > + priv->tso = !!(features & NETIF_F_TSO); > Pavel, this really seems arbitrary. Whilst I really appreciate you're looking into this driver a bit because of some issues you are trying to resolve, I'd like to ask that you not start bombarding me with nit-pick cleanups here and there and instead concentrate on the real bug or issue. Thanks in advance.
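For readers unfamiliar with the idiom the patch relies on: `!!` collapses any non-zero bit pattern to 1 so a masked flag test can be assigned straight to a boolean. A standalone illustration (`DEMO_F_TSO` is an invented value, not the kernel's `NETIF_F_TSO`):

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_F_TSO 0x10000UL	/* illustrative flag bit, not the kernel's */

static bool tso_requested(unsigned long features)
{
	/* !! normalizes the masked value: any set bit -> 1, zero -> 0 */
	return !!(features & DEMO_F_TSO);
}
```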
Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration
On 11/23/2016 05:48 AM, Ido Schimmel wrote: > Hi Florian, > > On Tue, Nov 22, 2016 at 09:56:30AM -0800, Florian Fainelli wrote: >> On 11/22/2016 09:41 AM, Ido Schimmel wrote: >>> Hi Florian, >>> >>> On Mon, Nov 21, 2016 at 11:09:22AM -0800, Florian Fainelli wrote: Hi all, This patch series allows using the bridge master interface to configure an Ethernet switch port's CPU/management port with different VLAN attributes than those of the bridge downstream ports/members. Jiri, Ido, Andrew, Vivien, please review the impact on mlxsw and mv88e6xxx, I tested this with b53 and a mockup DSA driver. >>> >>> We'll need to add a check in mlxsw and ignore any VLAN configuration for >>> the bridge device itself. Otherwise, any configuration done on br0 will >>> be propagated to all of its slaves, which is incorrect. >>> Open questions: - if we have more than one bridge on top of a physical switch, the driver should keep track of that and verify that we are not going to change the CPU port VLAN attributes in a way that results in incompatible settings to be applied - if the default behavior is to have all VLANs associated with the CPU port be ingressing/egressing tagged to the CPU, is this really useful? >>> >>> First of all, I want to be sure that when we say "CPU port", we're >>> talking about the same thing. In mlxsw, the CPU port is a pipe between >>> the device and the host, through which all packets trapped to the host >>> go through. So, when a packet is trapped, the driver reads its Rx >>> descriptor, checks through which port it ingressed, resolves its netdev, >>> sets skb->dev accordingly and injects it to the Rx path via >>> netif_receive_skb(). The CPU port itself isn't represented using a >>> netdev. >> >> In the case of DSA, the CPU port is a normal Ethernet MAC driver, but in >> premise, this driver plus the DSA tag protocol hook do exactly the same >> things as you just describe. > > Thanks for the detailed explanation! 
I also took the time to read > dsa.txt, however I still don't quite understand the motivation for > VLAN filtering on the CPU port. In which cases would you like to prevent > packets from going to the host due to their VLAN header? This change > would make sense to me if you only had a limited number of VLANs you > could enable on the CPU port, but from your response I understand that > this isn't the case. It's not much about VLAN filtering per se, but more about the default VLAN membership of the CPU port, in the absence of any explicit configuration. As a user, I find it a little inconvenient to have to create one VLAN interface per VLAN that I am adding to the bridge to be able to terminate that traffic properly towards the host/CPU/management interface, especially when this VLAN is untagged. This is really the motivation for these patches: if there is only one VLAN configured, and it's the default VLAN for all ports, then the bridge master interface also terminates this VLAN with the same properties as those added by default (typically with default_pvid: VID 1 untagged, unless changed of course). If you don't want that as a user, you now have the ability to change it, and make this VLAN (or any other for that matter) be terminated differently at the host/CPU/management port level than how it is egressing at the downstream ports that are part of that VLAN. Maybe it's a bit overkill... > > FWIW, I don't have a problem with patches if they are useful for you, > I'm just trying to understand the use case. We can easily patch mlxsw to > ignore calls with orig_dev=br0. OK, if I resubmit, I will try to take care of mlxsw and rocker as well. Thanks! -- Florian
[PATCH iproute2 1/1] tc: updated man page to reflect handle-id use in filter GET command.
Signed-off-by: Roman Mashak --- man/man8/tc.8 | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/man/man8/tc.8 b/man/man8/tc.8 index 8a47a2b..d957ffa 100644 --- a/man/man8/tc.8 +++ b/man/man8/tc.8 @@ -32,7 +32,9 @@ class-id ] qdisc DEV .B [ parent qdisc-id -.B | root ] protocol +.B | root ] [ handle +handle-id ] +.B protocol protocol .B prio priority filtertype @@ -577,7 +579,7 @@ it is created. .TP get -Displays a single filter given the interface, parent ID, priority, protocol and handle ID. +Displays a single filter given the interface, qdisc-id, priority, protocol and handle-id. .TP show -- 1.9.1
Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet Date: Thu, 01 Dec 2016 09:04:17 -0800 > On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote: > >> When qdisc layer or trafgen/af_packet see this indication it knows it >> should/must flush the queue when it doesn't have more work left. Perhaps >> through net_tx_action(), by registering itself and e.g. if qdisc_run() >> is called and queue is empty then check if queue needs a flush. I would >> also allow driver to flush and clear this bit. > > net_tx_action() is not normally called, unless BQL limit is hit and/or > some qdiscs with throttling (HTB, TBF, FQ, ...) The one thing I wonder about is whether we should "ramp up" into a mode where the NAPI poll does the doorbells instead of going directly there. Maybe I misunderstand your algorithm, but it looks to me like if there are any active packets in the TX queue at enqueue time you will defer the doorbell to the interrupt handler. Let's say we put 1 packet in, and hit the doorbell. Then another packet comes in and we defer the doorbell to the IRQ. At this point there are a couple things I'm unclear about. For example, if we didn't hit the doorbell, will the chip still take a peek at the second descriptor? Depending upon how the doorbell works it might, or it might not. Either way, wouldn't there be a possible condition where the chip wouldn't see the second enqueued packet and we'd thus have the wire idle until the interrupt + NAPI runs and hits the doorbell? This is why I think we should "ramp up" the doorbell deferral, in order to avoid this potential wire idle time situation. Maybe the situation I'm worried about is not possible, so please explain it to me :-)
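The "ramp up" idea under debate can be sketched as a toy TX-queue model. This is not the mlx4 code, just an illustration of one possible policy: ring the doorbell immediately while the queue is idle or the busy streak is short, and defer to the completion IRQ only once a streak of back-to-back enqueues suggests the IRQ will arrive soon anyway. All names and the threshold scheme are assumptions for this sketch.

```c
#include <assert.h>

struct txq {
	unsigned int pending;     /* descriptors queued but not doorbelled */
	unsigned int inflight;    /* descriptors the NIC has been told about */
	unsigned int doorbells;   /* MMIO doorbells rung so far */
	unsigned int ramp;        /* defer only after this many busy enqueues */
	unsigned int busy_streak;
};

static void ring_doorbell(struct txq *q)
{
	q->inflight += q->pending;
	q->pending = 0;
	q->doorbells++;
}

/* Enqueue one packet.  Ring immediately while the queue is idle or the
 * streak is below the ramp threshold; otherwise defer to the TX IRQ,
 * so the wire is never left idle on the first packets of a burst. */
static void xmit(struct txq *q)
{
	q->pending++;
	if (q->inflight == 0 || q->busy_streak < q->ramp) {
		q->busy_streak += (q->inflight != 0);
		ring_doorbell(q);
	}
	/* else: deferred, the TX completion IRQ will flush */
}

/* TX completion IRQ: completions drain in-flight work; flush anything
 * that was deferred while the queue was busy. */
static void tx_irq(struct txq *q)
{
	q->inflight = 0;
	q->busy_streak = 0;
	if (q->pending)
		ring_doorbell(q);
}
```

With `ramp = 2`, the first three back-to-back packets each get an immediate doorbell and only the fourth is deferred to the IRQ, which addresses the "wire idle until NAPI runs" window for short bursts while still batching doorbells under sustained load.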
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones wrote: > On 12/01/2016 11:05 AM, Tom Herbert wrote: >> >> For the GSO and GRO the rationale is that performing the extra SW >> processing to do the offloads is significantly less expensive than >> running each packet through the full stack. This is true in a >> multi-layered generalized stack. In TXDP, however, we should be able >> to optimize the stack data path such that that would no longer be >> true. For instance, if we can process the packets received on a >> connection quickly enough so that it's about the same or just a little >> more costly than GRO processing then we might bypass GRO entirely. >> TSO is probably still relevant in TXDP since it reduces overheads >> processing TX in the device itself. > > > Just how much per-packet path-length are you thinking will go away under the > likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some > non-trivial things to effective overhead (service demand) and so throughput: > For plain in-order TCP packets I believe we should be able to process each packet at nearly the same speed as GRO. Most of the protocol processing we do between GRO and the stack is the same; the differences are that we need to do a connection lookup in the stack path (note we now do this in UDP GRO and that hasn't shown up as a major hit). We also need to consider enqueue/dequeue on the socket, which is a major reason to try for lockless sockets in this instance. 
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428   -1.000
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486   -1.000
>
> And that is still with the stretch-ACKs induced by GRO at the receiver.
>
Sure, but try running something that emulates a more realistic workload than a TCP stream, such as an RR test with relatively small payloads and many connections. 
> Losing GRO has quite similar results:
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860   -1.000
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502   -1.000
>
> I'm sure there is a very non-trivial "it depends" component here - netperf
> will get the peak benefit from *SO and so one will see the peak difference
> in service demands - but even if one gets only 6 segments per *SO that is a
> lot of path-length to make-up.
>
True, but I think there's a lot of path we'll be able to cut out. In this mode we don't need iptables, netfilter, the input route, the IPvlan check, or other similar lookups. Once we've successfully matched an established TCP state, anything related to policy on both TX and RX for that connection is inferred from the state. We want the processing path in this case to be concerned with just
Re: Initial thoughts on TXDP
On (12/01/16 11:05), Tom Herbert wrote: > > Polling does not necessarily imply that networking monopolizes the CPU > except when the CPU is otherwise idle. Presumably the application > drives the polling when it is ready to receive work. I'm not grokking that - "if the cpu is idle, we want to busy-poll and make it 0% idle"? Keeping the CPU 0% idle has all sorts of issues, see slide 20 of http://www.slideshare.net/shemminger/dpdk-performance > > and one other critical difference from the hot-potato-forwarding > model (the sort of OVS model that DPDK etc might arguably be a fit for) > does not apply: in order to figure out the ethernet and IP headers > in the response correctly at all times (in the face of things like VRRP, > gw changes, gw's mac addr changes etc) the application should really > be listening on NETLINK sockets for modifications to the networking > state - again points to needing a select() socket set where you can > have both the I/O fds and the netlink socket, > > I would think that that management would not be implemented in a > fast path processing thread for an application. sure, but my point was that *XDP and other stack-bypass methods need to provide a select()able socket: when your use-case is not about just networking, you have to snoop on changes to the control plane, and update your data path. In the OVS case (pure networking) the OVS control plane updates are intrinsic to OVS. For the rest of the request/response world, we need a select()able socket set to do this elegantly (not really possible in DPDK, for example) > The *SOs are always an interesting question. They make for great > benchmarks, but in real life the amount of benefit is somewhat > unclear. Under the wrong conditions, like all cwnds have collapsed or I think Rick's already bringing up this one. --Sowmini
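The "select()able socket set" point above can be sketched in a few lines on Linux: one poll() loop watching both a data fd and an rtnetlink socket subscribed to route/neighbour/link change notifications, so the fast path can refresh its cached L2/L3 state. Error handling is trimmed and the function names are invented for this sketch.

```c
#include <assert.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Open an rtnetlink socket subscribed to the control-plane events the
 * thread cares about (routes, neighbours, links). */
static int open_route_monitor(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = RTMGRP_IPV4_ROUTE | RTMGRP_NEIGH | RTMGRP_LINK;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* One event loop iteration over both the I/O fd and the netlink fd;
 * a positive return means data and/or control-plane work is pending. */
static int wait_for_events(int data_fd, int nl_fd, int timeout_ms)
{
	struct pollfd pfd[2] = {
		{ .fd = data_fd, .events = POLLIN },
		{ .fd = nl_fd,   .events = POLLIN },
	};
	return poll(pfd, 2, timeout_ms);
}
```

On readiness of the netlink fd the application would parse the nlmsghdr stream and update its cached gateway/MAC state before sending the next response; this is exactly the integration that a raw descriptor-ring interface does not give you for free.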
Re: [WIP] net+mlx4: auto doorbell
On Thu, 2016-12-01 at 20:17 +0100, Jesper Dangaard Brouer wrote: > On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet wrote: > > > BTW, if you are doing tests on mlx4 40Gbit, > > I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the > machines too. > > > would you check the > > following quick/dirty hack, using lots of low-rate flows ? > > What tool should I use to send "low-rate flows"? > You could use https://github.com/google/neper It supports SO_MAX_PACING_RATE, and you could launch 1600 flows, rate limited to 3028000 bytes per second (so sending one 2-MSS TSO packet every ms per flow) > And what am I looking for? Max throughput, in packets per second :/
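The per-flow rate limiting mentioned here is a plain socket option; a minimal sketch of what neper's pacing option amounts to. `SO_MAX_PACING_RATE` takes bytes per second, so 3028000 is roughly one 2-MSS TSO packet per millisecond (the fallback value 47 matches most architectures' headers and is an assumption for older toolchains):

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47	/* fallback for old headers, most arches */
#endif

/* Cap a socket's pacing rate in bytes per second (fq qdisc or TCP
 * internal pacing enforces it).  Returns setsockopt()'s result. */
static int set_pacing(int fd, unsigned int bytes_per_sec)
{
	return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
			  &bytes_per_sec, sizeof(bytes_per_sec));
}
```

With 1600 such flows at 3028000 B/s each, the NIC sees a steady mix of small TSO bursts rather than a few line-rate streams, which is the traffic pattern the doorbell experiment wants.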
Re: [PATCH 2/2] net: rfkill: Add rfkill-any LED trigger
> Hi Michał,
>
> [auto build test ERROR on mac80211-next/master]
> [also build test ERROR on v4.9-rc7 next-20161201]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Micha-K-pie/net-rfkill-Cleanup-error-handling-in-rfkill_init/20161202-002119
> base: https://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git master
> config: i386-randconfig-x004-201648 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>    net/rfkill/core.c: In function 'rfkill_set_block':
> >> net/rfkill/core.c:354:2: error: implicit declaration of function '__rfkill_any_led_trigger_event' [-Werror=implicit-function-declaration]
>      __rfkill_any_led_trigger_event();
>      ^~
>    net/rfkill/core.c: In function 'rfkill_init':
>    net/rfkill/core.c:1349:1: warning: label 'error_led_trigger' defined but not used [-Wunused-label]
>     error_led_trigger:
>     ^
>    At top level:
>    net/rfkill/core.c:243:13: warning: 'rfkill_any_led_trigger_unregister' defined but not used [-Wunused-function]
>     static void rfkill_any_led_trigger_unregister(void)
>     ^
>    net/rfkill/core.c:238:12: warning: 'rfkill_any_led_trigger_register' defined but not used [-Wunused-function]
>     static int rfkill_any_led_trigger_register(void)
>                ^~~
>    cc1: some warnings being treated as errors
>
> vim +/__rfkill_any_led_trigger_event +354 net/rfkill/core.c
>
>    348		rfkill->state &= ~RFKILL_BLOCK_SW_SETCALL;
>    349		rfkill->state &= ~RFKILL_BLOCK_SW_PREV;
>    350		curr = rfkill->state & RFKILL_BLOCK_SW;
>    351		spin_unlock_irqrestore(&rfkill->lock, flags);
>    352
>    353		rfkill_led_trigger_event(rfkill);
>  > 354		__rfkill_any_led_trigger_event();
>    355
>    356		if (prev != curr)
>    357			rfkill_event(rfkill);

Thanks, these are obviously all valid concerns. Sorry for being sloppy with the ifdefs. 
If I get positive feedback on the proposed feature itself, all these issues (and the warning pointed out in the other message) will be resolved in v2. -- Best regards, Michał Kępień
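The usual fix for this class of build error is to give the CONFIG-gated helpers empty `static inline` stubs so every caller compiles whether or not the option is enabled. A standalone sketch of the pattern, with `CONFIG_RFKILL_LEDS_DEMO` as an invented stand-in for the real Kconfig symbol:

```c
#include <assert.h>

#define CONFIG_RFKILL_LEDS_DEMO 1	/* pretend the option is enabled */

static int trigger_events;

#ifdef CONFIG_RFKILL_LEDS_DEMO
static void rfkill_any_led_trigger_event(void)
{
	trigger_events++;	/* the real LED trigger work would go here */
}
#else
/* Compiled-out stub: callers link and the compiler elides the call. */
static inline void rfkill_any_led_trigger_event(void) { }
#endif

static void rfkill_set_block_demo(void)
{
	/* no #ifdef needed at the call site */
	rfkill_any_led_trigger_event();
}
```

Keeping the `#ifdef` confined to the declarations avoids both the implicit-declaration error and the unused-function/unused-label warnings the robot flagged.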
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
Hannes and Ido, It looks like we are very close to having this in mergeable shape. Can you guys work out this final issue and figure out whether it really is a merge stopper or not? Thanks.
Re: Initial thoughts on TXDP
On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal wrote: > Tom Herbert wrote: >> Posting for discussion > > Warning: You are not going to like this reply... > >> Now that XDP seems to be nicely gaining traction > > Yes, I regret to see that. XDP seems useful to create impressive > benchmark numbers (and little else). > > I will send a separate email to keep that flamebait part away from > this thread though. > > [..] > >> addresses the performance gap for stateless packet processing). The >> problem statement is analogous to that which we had for XDP, namely >> can we create a mode in the kernel that offers the same performance >> that is seen with L4 protocols over kernel bypass > > Why? If you want to bypass the kernel, then DO IT. > I don't want kernel bypass. I want the Linux stack to provide something close to bare metal performance for TCP/UDP for some latency sensitive applications we run. > There is nothing wrong with DPDK. The ONLY problem is that the kernel > does not offer a userspace fastpath like Windows RIO or FreeBSD's netmap. > > But even without that it's not difficult to get DPDK running. > That is not true for large scale deployments. Also, TXDP is about accelerating transport layers like TCP; DPDK is just the interface from userspace to device queues. We need a whole lot more with DPDK, a userspace TCP/IP stack for instance, to consider that we have equivalent functionality. > (T)XDP seems born from spite, not technical rationale. > IMO everyone would be better off if we'd just have something netmap-esque > in the network core (also see below). 
> >> I imagine there are a few reasons why userspace TCP stacks can get >> good performance: >> >> - Spin polling (we already can do this in kernel) >> - Lockless, I would assume that threads typically have exclusive >> access to a queue pair for a connection >> - Minimal TCP/IP stack code >> - Zero copy TX/RX >> - Light weight structures for queuing >> - No context switches >> - Fast data path for in order, uncongested flows >> - Silo'ing between application and device queues > > I only see two cases: > > 1. Many applications running (standard OS model) that need to > send/receive data > -> Linux Network Stack > > 2. Single dedicated application that does all rx/tx > > -> no queueing needed (can block network rx completely if receiver > is slow) > -> no allocations needed at runtime at all > -> no locking needed (single producer, single consumer) > > If you have #2 and you need to be fast etc then full userspace > bypass is fine. We will -- no matter what we do in kernel -- never > be able to keep up with the speed you can get with that > because we have to deal with #1. (Plus the ease of use/freedom of doing > userspace programming). And yes, I think that #2 is something we > should address solely by providing netmap or something similar. > > But even considering #1 there are ways to speed stack up: > > I'd kill RPF/RPS so we don't have IPI anymore and skb stays > on same cpu up to where it gets queued (ofo or rx queue). > The reference to RPS and RFS is only to move packets off the hot CPU that are not in the datapath. For instance, if we get a FIN for a connection, we can put this into a slow path since FIN processing is not latency sensitive but may take a considerable amount of CPU to process. > Then we could tell driver what happened with the skb it gave us, e.g. > we can tell driver it can do immediate page/dma reuse, for example > in pure ack case as opposed to skb sitting in ofo or receive queue. 
> > (RPS/RFS functionality could still be provided via one of the gazillion > hooks we now have in the stack for those that need/want it), so I do > not think we would lose functionality. > >> - Call into TCP/IP stack with page data directly from driver-- no >> skbuff allocation or interface. This is essentially provided by the >> XDP API although we would need to generalize the interface to call >> stack functions (I previously posted patches for that). We will also >> need a new action, XDP_HELD?, that indicates the XDP function held the >> packet (put on a socket for instance). > > Seems this will not work at all with the planned page pool thing when > pages start to be held indefinitely. > The processing needed to gift a page to the stack shouldn't be very different than what a driver needs to do when an skbuff is created when XDP_PASS is returned. We probably would want to pass additional metadata, things like checksum and vlan information from the received descriptor, to the stack. A callback can be included if the stack decides it wants to hold on to the buffer and the driver needs to do dma_sync etc. for that. > You can also never get even close to userspace offload stacks once you > need/do this; allocations in hotpath are too expensive. > > [..] > >> - When we transmit, it would be nice to go straight from TCP >> connection to an XDP device queue and in particular skip the qdisc >>
Re: [PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.
On Wed, Nov 30, 2016 at 5:51 AM, Jiri Benc wrote: > On Tue, 29 Nov 2016 15:30:52 -0800, Jarno Rajahalme wrote: >> @@ -504,11 +485,20 @@ void ovs_vport_send(struct vport *vport, struct >> sk_buff *skb, u8 mac_proto) >> goto drop; >> } >> >> - if (unlikely(packet_length(skb, vport->dev) > mtu && >> - !skb_is_gso(skb))) { >> - net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n", >> - vport->dev->name, >> - packet_length(skb, vport->dev), mtu); >> + if (unlikely(!is_skb_forwardable(vport->dev, skb))) { > > How does this work when the vlan tag is accelerated? Then we can be > over MTU, yet the check will pass. > This is not changing any behavior compared to current OVS vlan checks. A single vlan header is not considered for the MTU check.
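Jiri's concern can be shown with a userspace model (names invented here, not the kernel's): with an accelerated VLAN tag, the 4 tag bytes live outside skb->len, so a length check that ignores them passes frames that end up 4 bytes over MTU once the tag is re-inserted on the wire.

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HLEN_DEMO  14	/* Ethernet header, illustrative constant */
#define VLAN_HLEN_DEMO  4	/* 802.1Q tag, illustrative constant */

/* Naive check: L3 payload length vs MTU, ignoring any accelerated tag. */
static bool fits_mtu(unsigned int skb_len, bool vlan_accel, unsigned int mtu)
{
	unsigned int payload = skb_len - ETH_HLEN_DEMO;

	(void)vlan_accel;	/* the tag bytes are simply not counted */
	return payload <= mtu;
}

/* VLAN-aware check: count the tag that will be re-inserted on the wire. */
static bool fits_mtu_vlan_aware(unsigned int skb_len, bool vlan_accel,
				unsigned int mtu)
{
	unsigned int payload = skb_len - ETH_HLEN_DEMO +
			       (vlan_accel ? VLAN_HLEN_DEMO : 0);
	return payload <= mtu;
}
```

A 1514-byte skb with an accelerated tag passes the naive check against a 1500-byte MTU but becomes a 1518-byte tagged frame on the wire; whether that is acceptable is exactly the "single vlan header is not considered" convention the reply describes.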
Re: Initial thoughts on TXDP
On 12/01/2016 11:05 AM, Tom Herbert wrote: For the GSO and GRO the rationale is that performing the extra SW processing to do the offloads is significantly less expensive than running each packet through the full stack. This is true in a multi-layered generalized stack. In TXDP, however, we should be able to optimize the stack data path such that that would no longer be true. For instance, if we can process the packets received on a connection quickly enough so that it's about the same or just a little more costly than GRO processing then we might bypass GRO entirely. TSO is probably still relevant in TXDP since it reduces overheads processing TX in the device itself. Just how much per-packet path-length are you thinking will go away under the likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some non-trivial things to effective overhead (service demand) and so throughput:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486   -1.000

And that is still with the stretch-ACKs induced by GRO at the receiver. 
Losing GRO has quite similar results:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502   -1.000

I'm sure there is a very non-trivial "it depends" component here - netperf will get the peak benefit from *SO and so one will see the peak difference in service demands - but even if one gets only 6 segments per *SO that is a lot of path-length to make up. 4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz And even if one does have the CPU cycles to burn so to speak, the effect on power consumption needs to be included in the calculus. happy benchmarking, rick jones