Re: [net-next 1/5] qede: use reset to set network header
> Since offset is zero, it's not necessary to use the set function. The
> reset function is straightforward, and will remove the unnecessary add
> operation in the set function.

Thanks,
Acked-by: Yuval Mintz
[For #2, bnx2x, as well]
Re: [PATCH v3 0/3] Add QLogic FastLinQ iSCSI (qedi) driver.
Hi Dave,

Please consider applying the qed patches 1 & 2 to net-next. Patch 3, qedi, goes to SCSI. Martin P. is ok to have you take qedi (already Acked) through your tree. Please let us know which way you prefer.

Thanks,
Manish

On 01/12/16 1:51 PM, "Rangankar, Manish" wrote:

>This series introduces a hardware offload iSCSI initiator driver for the
>41000 Series Converged Network Adapters (579xx chip) by QLogic. The overall
>driver design includes a common module ('qed') and protocol specific
>dependent modules ('qedi' for iSCSI).
>
>This is an open-iscsi driver; modifications to open-iscsi user components
>'iscsid', 'iscsiuio', etc. are required for the solution to work. The user
>space changes are also in the process of being submitted.
>
>https://groups.google.com/forum/#!forum/open-iscsi
>
>The 'qed' common module, under drivers/net/ethernet/qlogic/qed/, is
>enhanced with functionality required for the iSCSI support. This series
>is based on:
>
>net tree base: Merge of net and net-next as of 11/29/2016
>
>Changes from RFC v2:
>
>  1. qedi patches are squashed into a single patch to prevent kbuild robot
>     warnings.
>  2. Fixed 'hw_p_cpuq' incompatible pointer type.
>  3. Fixed sparse incompatible types in comparison expression.
>  4. Misc fixes with the latest 'checkpatch --strict' option.
>  5. Removed the int_mode option from MODULE_PARAM.
>  6. Prefixed all MODULE_PARAM params with qedi_*.
>  7. Use CONFIG_QED_ISCSI instead of CONFIG_QEDI.
>  8. Added bad task mem access fix.
>
>Manish Rangankar (1):
>  qedi: Add QLogic FastLinQ offload iSCSI driver framework.
>
>Yuval Mintz (2):
>  qed: Add support for hardware offloaded iSCSI.
>  qed: Add iSCSI out of order packet handling.
>
> MAINTAINERS                                    |    6 +
> drivers/net/ethernet/qlogic/Kconfig            |    3 +
> drivers/net/ethernet/qlogic/qed/Makefile       |    1 +
> drivers/net/ethernet/qlogic/qed/qed.h          |    8 +-
> drivers/net/ethernet/qlogic/qed/qed_dev.c      |   22 +
> drivers/net/ethernet/qlogic/qed/qed_iscsi.c    | 1277 +
> drivers/net/ethernet/qlogic/qed/qed_iscsi.h    |   52 +
> drivers/net/ethernet/qlogic/qed/qed_ll2.c      |  555 +-
> drivers/net/ethernet/qlogic/qed/qed_ll2.h      |    9 +
> drivers/net/ethernet/qlogic/qed/qed_ooo.c      |  501 +
> drivers/net/ethernet/qlogic/qed/qed_ooo.h      |  176 ++
> drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |    2 +
> drivers/net/ethernet/qlogic/qed/qed_roce.c     |    1 +
> drivers/net/ethernet/qlogic/qed/qed_spq.c      |   24 +
> drivers/scsi/Kconfig                           |    1 +
> drivers/scsi/Makefile                          |    1 +
> drivers/scsi/qedi/Kconfig                      |   10 +
> drivers/scsi/qedi/Makefile                     |    5 +
> drivers/scsi/qedi/qedi.h                       |  364
> drivers/scsi/qedi/qedi_dbg.c                   |  143 ++
> drivers/scsi/qedi/qedi_dbg.h                   |  144 ++
> drivers/scsi/qedi/qedi_debugfs.c               |  244 +++
> drivers/scsi/qedi/qedi_fw.c                    | 2378
> drivers/scsi/qedi/qedi_gbl.h                   |   73 +
> drivers/scsi/qedi/qedi_hsi.h                   |   52 +
> drivers/scsi/qedi/qedi_iscsi.c                 | 1624
> drivers/scsi/qedi/qedi_iscsi.h                 |  232 +++
> drivers/scsi/qedi/qedi_main.c                  | 2127 +
> drivers/scsi/qedi/qedi_sysfs.c                 |   52 +
> drivers/scsi/qedi/qedi_version.h               |   14 +
> include/linux/qed/qed_if.h                     |    2 +
> include/linux/qed/qed_iscsi_if.h               |  229 +++
> 32 files changed, 10303 insertions(+), 29 deletions(-)
> create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.c
> create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.h
> create mode 100644 drivers/net/ethernet/qlogic/qed/qed_ooo.c
> create mode 100644 drivers/net/ethernet/qlogic/qed/qed_ooo.h
> create mode 100644 drivers/scsi/qedi/Kconfig
> create mode 100644 drivers/scsi/qedi/Makefile
> create mode 100644 drivers/scsi/qedi/qedi.h
> create mode 100644 drivers/scsi/qedi/qedi_dbg.c
> create mode 100644 drivers/scsi/qedi/qedi_dbg.h
> create mode 100644 drivers/scsi/qedi/qedi_debugfs.c
> create mode 100644 drivers/scsi/qedi/qedi_fw.c
> create mode 100644 drivers/scsi/qedi/qedi_gbl.h
> create mode 100644 drivers/scsi/qedi/qedi_hsi.h
> create mode 100644 drivers/scsi/qedi/qedi_iscsi.c
> create mode 100644 drivers/scsi/qedi/qedi_iscsi.h
> create mode 100644 drivers/scsi/qedi/qedi_main.c
> create mode 100644 drivers/scsi/qedi/qedi_sysfs.c
> create mode 100644 drivers/scsi/qedi/qedi_version.h
> create mode 100644 include/linux/qed/qed_iscsi_if.h
>
>-- 
>1.8.3.1
>
Re: iproute2 public git outdated?
On Thu, Dec 01, 2016 at 10:39:10PM +0200, Rami Rosen wrote:
> I suggest that you try again now; it seems that the iproute2 git
> repo was updated in the last 2-4 hours, and "git log" on master now
> shows a patch from 30 November (actually it is your "Add notes about
> dropped IPv4 route cache" patch).

Thanks Rami for the heads-up. I noticed yesterday already (I checked
immediately after receiving feedback on some other patch from Stephen).

Cheers, Phil
Re: [PATCH 2/3] netns: add dummy struct inside "struct net_generic"
On Thu, Dec 1, 2016 at 9:42 PM, Cong Wang wrote:
> On Thu, Dec 1, 2016 at 5:12 PM, Alexey Dobriyan wrote:
>>  struct net_generic {
>> -	unsigned int len;
>> -	struct rcu_head rcu;
>> +	struct {
>> +		unsigned int len;
>> +		struct rcu_head rcu;
>> +	} s;
>>
>>  	void *ptr[0];
>>  };
>
> I think you can put them in a union, since rcu is only used
> for kfree_rcu() where len is not needed and rcu_head doesn't
> need to be initialized by callers.

Never mind, readers could still be reading ->len while we modify ->rcu,
so they can't be in a union.
[PATCH net-next v4] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
Implemented RFC7527 Enhanced DAD. IPv6 duplicate address detection can
fail if there is some temporary loopback of Ethernet frames. RFC7527
solves this by including a random nonce in the NS messages used for DAD,
and if an NS is received with the same nonce it is assumed to be a
looped-back DAD probe and is ignored.

RFC7527 is enabled by default. It can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.

Signed-off-by: Erik Nordmark
Signed-off-by: Bob Gilligan
Reviewed-by: Hannes Frederic Sowa
---
v2: renamed sysctl and made it default to true, plus minor code review fixes
v3: respun with later net-next; fixed whitespace issues
v4: fixed kbuild test robot for route.c; added Reviewed-by

 Documentation/networking/ip-sysctl.txt |  9 +
 include/linux/ipv6.h                   |  1 +
 include/net/if_inet6.h                 |  1 +
 include/net/ndisc.h                    |  5 -
 include/uapi/linux/ipv6.h              |  1 +
 net/ipv6/addrconf.c                    | 22 +-
 net/ipv6/ndisc.c                       | 31 ---
 net/ipv6/route.c                       |  2 +-
 8 files changed, 66 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5af48dd..d9ef566 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1729,6 +1729,15 @@ drop_unsolicited_na - BOOLEAN
 	By default this is turned off.
 
+enhanced_dad - BOOLEAN
+	Include a nonce option in the IPv6 neighbor solicitation messages used for
+	duplicate address detection per RFC7527. A received DAD NS will only signal
+	a duplicate address if the nonce is different. This avoids any false
+	detection of duplicates due to loopback of the NS messages that we send.
+	The nonce option will be sent on an interface unless both of
+	conf/{all,interface}/enhanced_dad are set to FALSE.
+	Default: TRUE
+
 icmp/*:
 ratelimit - INTEGER
 	Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 3f95233..671d014 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -68,6 +68,7 @@ struct ipv6_devconf {
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	__s32		seg6_require_hmac;
 #endif
+	__u32		enhanced_dad;
 
 	struct ctl_table_header *sysctl_header;
 };
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index b0576cb..0fa4c32 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -55,6 +55,7 @@ struct inet6_ifaddr {
 	__u8			stable_privacy_retry;
 
 	__u16			scope;
+	__u64			dad_nonce;
 
 	unsigned long		cstamp;	/* created timestamp */
 	unsigned long		tstamp; /* updated timestamp */
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index be1fe228..d562a2f 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -31,6 +31,7 @@ enum {
 	ND_OPT_PREFIX_INFO = 3,		/* RFC2461 */
 	ND_OPT_REDIRECT_HDR = 4,	/* RFC2461 */
 	ND_OPT_MTU = 5,			/* RFC2461 */
+	ND_OPT_NONCE = 14,		/* RFC7527 */
 	__ND_OPT_ARRAY_MAX,
 	ND_OPT_ROUTE_INFO = 24,		/* RFC4191 */
 	ND_OPT_RDNSS = 25,		/* RFC5006 */
@@ -121,6 +122,7 @@ struct ndisc_options {
 #define nd_opts_pi_end		nd_opt_array[__ND_OPT_PREFIX_INFO_END]
 #define nd_opts_rh		nd_opt_array[ND_OPT_REDIRECT_HDR]
 #define nd_opts_mtu		nd_opt_array[ND_OPT_MTU]
+#define nd_opts_nonce		nd_opt_array[ND_OPT_NONCE]
 #define nd_802154_opts_src_lladdr	nd_802154_opt_array[ND_OPT_SOURCE_LL_ADDR]
 #define nd_802154_opts_tgt_lladdr	nd_802154_opt_array[ND_OPT_TARGET_LL_ADDR]
@@ -398,7 +400,8 @@ static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev, cons
 int ndisc_rcv(struct sk_buff *skb);
 
 void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
-		   const struct in6_addr *daddr, const struct in6_addr *saddr);
+		   const struct in6_addr *daddr, const struct in6_addr *saddr,
+		   u64 nonce);
 
 void ndisc_send_rs(struct net_device *dev,
		   const struct in6_addr *saddr, const struct in6_addr *daddr);
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 53561be..eaf65dc 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -181,6 +181,7 @@ enum {
 	DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
 	DEVCONF_SEG6_ENABLED,
 	DEVCONF_SEG6_REQUIRE_HMAC,
+	DEVCONF_ENHANCED_DAD,
 	DEVCONF_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4c387dc..c1e124b 100644
---
Re: [PATCH 2/3] netns: add dummy struct inside "struct net_generic"
On Thu, Dec 1, 2016 at 5:12 PM, Alexey Dobriyan wrote:
>  struct net_generic {
> -	unsigned int len;
> -	struct rcu_head rcu;
> +	struct {
> +		unsigned int len;
> +		struct rcu_head rcu;
> +	} s;
>
>  	void *ptr[0];
>  };

I think you can put them in a union, since rcu is only used
for kfree_rcu() where len is not needed and rcu_head doesn't
need to be initialized by callers.
Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
On 12/1/16 2:28 PM, Hannes Frederic Sowa wrote:

> Reviewed-by: Hannes Frederic Sowa

Thanks - I'll add that. Also got a warning from the kbuild test robot
for route.c which I'll fix. Thanks!

> @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb)
>  have_ifp:
>  	if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
>  		if (dad) {
> +			if (nonce != 0 && ifp->dad_nonce == nonce) {
> +				u8 *np = (u8 *)&ifp->dad_nonce;
> +				/* Matching nonce if looped back */
> +				ND_PRINTK(2, notice,
> +					  "%s: IPv6 DAD loopback for address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n",
> +					  ifp->idev->dev->name,
> +					  &ifp->addr,
> +					  np[0], np[1], np[2], np[3],
> +					  np[4], np[5]);
> +				goto out;
> +			}
>  			/*
>  			 * We are colliding with another node
>  			 * who is doing DAD
>
> I think it could be a "%pM" because it looks like a MAC address, but
> better leave it like that. :)

It is 6 bytes, but it isn't a mac address so I'll leave it.

Thanks,
Erik
Re: linux-next: manual merge of the wireless-drivers-next tree with the net-next tree
Stephen Rothwell writes:

> Hi all,
>
> Today's linux-next merge of the wireless-drivers-next tree got a
> conflict in:
>
>   drivers/net/wireless/ath/ath10k/mac.c
>
> between commit:
>
>   f3fe4e93dd63 ("mac80211: add a HW flag for supporting HW TX fragmentation")
>
> from the net-next tree and commit:
>
>   ff32eeb86aa1 ("ath10k: advertize hardware packet loss mechanism")
>
> from the wireless-drivers-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging. You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

The fix looks good, thanks. I sent a pull request to Dave yesterday which
should fix this.

-- 
Kalle Valo
Re: [PATCH 23/39] Annotate hardware config module parameters in drivers/net/wireless/
David Howells writes:

> When the kernel is running in secure boot mode, we lock down the kernel to
> prevent userspace from modifying the running kernel image. Whilst this
> includes prohibiting access to things like /dev/mem, it must also prevent
> access by means of configuring driver modules in such a way as to cause a
> device to access or modify the kernel image.
>
> To this end, annotate module_param* statements that refer to hardware
> configuration and indicate for future reference what type of parameter they
> specify. The parameter parser in the core sees this information and can
> skip such parameters with an error message if the kernel is locked down.
> The module initialisation then runs as normal, but just sees whatever the
> default values for those parameters is.
>
> Note that we do still need to do the module initialisation because some
> drivers have viable defaults set in case parameters aren't specified and
> some drivers support automatic configuration (e.g. PNP or PCI) in addition
> to manually coded parameters.
>
> This patch annotates drivers in drivers/net/wireless/.
>
> Suggested-by: One Thousand Gnomes
> Signed-off-by: David Howells
> cc: Kalle Valo
> cc: linux-wirel...@vger.kernel.org
> cc: netdev@vger.kernel.org
> ---
>
>  drivers/net/wireless/cisco/airo.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Via which tree are you planning to submit this? Should I take it to
wireless-drivers, or will someone else handle it? I didn't get the cover
letter so I have no idea.

-- 
Kalle Valo
Re: [PATCH main-v4.9-rc7] net/ipv6: allow sysctl to change link-local address generation mode
On Thu, Dec 1, 2016 at 6:14 PM, Felix Jia wrote:
> +static void addrconf_addrgenmode_change(struct net *net)
> +{
> +	struct net_device *dev;
> +	struct inet6_dev *idev;
> +
> +	read_lock(&dev_base_lock);
> +	for_each_netdev(net, dev) {
> +		rcu_read_lock();
> +		idev = __in6_dev_get(dev);
> +		if (idev) {
> +			idev->cnf.addrgenmode = ipv6_devconf_dflt.addrgenmode;
> +			idev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
> +			rtnl_lock();
> +			addrconf_dev_config(idev->dev);
> +			rtnl_unlock();

You can't call rtnl_lock() in atomic context.

> +		}
> +		rcu_read_unlock();
> +	}
> +	read_unlock(&dev_base_lock);

Did you test your patch seriously?
[PATCH net-next] samples/bpf: silence compiler warnings
Silence some of the clang compiler warnings, like:

include/linux/fs.h:2693:9: warning: comparison of unsigned enum expression < 0 is always false
arch/x86/include/asm/processor.h:491:30: warning: taking address of packed member 'sp0' of class or structure 'x86_hw_tss' may result in an unaligned pointer value
include/linux/cgroup-defs.h:326:16: warning: field 'cgrp' with variable sized type 'struct cgroup' not at the end of a struct or class is a GNU extension

since they add too much noise to the samples/bpf/ build.

Signed-off-by: Alexei Starovoitov
---
 samples/bpf/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 3ceb5a9d86df..270e2fd7337a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -158,4 +158,6 @@ $(obj)/%.o: $(src)/%.c
 	$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
 		-D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
 		-Wno-compare-distinct-pointer-types \
+		-Wno-gnu-variable-sized-type-not-at-end \
+		-Wno-address-of-packed-member -Wno-tautological-compare \
 		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
-- 
2.8.0
[PATCH main-v4.9-rc7] net/ipv6: allow sysctl to change link-local address generation mode
The address generation mode for IPv6 link-local addresses can only be
configured by netlink messages. This patch adds the ability to change
the address generation mode via sysctl.

A possible improvement is to remove the addrgenmode variable from the
idev structure and use the sysctl storage for the flag.

The patch is based on v4.9-rc7 in mainline.

Signed-off-by: Felix Jia
Cc: Carl Smith
---
 include/linux/ipv6.h      |  1 +
 include/uapi/linux/ipv6.h |  1 +
 net/ipv6/addrconf.c       | 77 ++-
 3 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index a064997..0d9e5d4 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -64,6 +64,7 @@ struct ipv6_devconf {
 	} stable_secret;
 	__s32		use_oif_addrs_only;
 	__s32		keep_addr_on_down;
+	__s32		addrgenmode;
 
 	struct ctl_table_header *sysctl_header;
 };
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 8c27723..0524e2c 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -178,6 +178,7 @@ enum {
 	DEVCONF_DROP_UNSOLICITED_NA,
 	DEVCONF_KEEP_ADDR_ON_DOWN,
 	DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
+	DEVCONF_ADDRGENMODE,
 	DEVCONF_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4bc5ba3..8d1f6e6 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -238,6 +238,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
 	.use_oif_addrs_only	= 0,
 	.ignore_routes_with_linkdown = 0,
 	.keep_addr_on_down	= 0,
+	.addrgenmode		= IN6_ADDR_GEN_MODE_EUI64,
 };
 
 static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
@@ -284,6 +285,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
 	.use_oif_addrs_only	= 0,
 	.ignore_routes_with_linkdown = 0,
 	.keep_addr_on_down	= 0,
+	.addrgenmode		= IN6_ADDR_GEN_MODE_EUI64,
 };
 
 /* Check if a valid qdisc is available */
@@ -378,7 +380,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 	if (ndev->cnf.stable_secret.initialized)
 		ndev->addr_gen_mode = IN6_ADDR_GEN_MODE_STABLE_PRIVACY;
 	else
-		ndev->addr_gen_mode = IN6_ADDR_GEN_MODE_EUI64;
+		ndev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
 
 	ndev->cnf.mtu6 = dev->mtu;
 	ndev->nd_parms = neigh_parms_alloc(dev, &nd_tbl);
@@ -4950,6 +4952,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
 	array[DEVCONF_DROP_UNICAST_IN_L2_MULTICAST] = cnf->drop_unicast_in_l2_multicast;
 	array[DEVCONF_DROP_UNSOLICITED_NA] = cnf->drop_unsolicited_na;
 	array[DEVCONF_KEEP_ADDR_ON_DOWN] = cnf->keep_addr_on_down;
+	array[DEVCONF_ADDRGENMODE] = cnf->addrgenmode;
 }
 
 static inline size_t inet6_ifla6_size(void)
@@ -5496,6 +5499,71 @@ int addrconf_sysctl_mtu(struct ctl_table *ctl, int write,
 	return proc_dointvec_minmax(&lctl, write, buffer, lenp, ppos);
 }
 
+static void addrconf_addrgenmode_change(struct net *net)
+{
+	struct net_device *dev;
+	struct inet6_dev *idev;
+
+	read_lock(&dev_base_lock);
+	for_each_netdev(net, dev) {
+		rcu_read_lock();
+		idev = __in6_dev_get(dev);
+		if (idev) {
+			idev->cnf.addrgenmode = ipv6_devconf_dflt.addrgenmode;
+			idev->addr_gen_mode = ipv6_devconf_dflt.addrgenmode;
+			rtnl_lock();
+			addrconf_dev_config(idev->dev);
+			rtnl_unlock();
+		}
+		rcu_read_unlock();
+	}
+	read_unlock(&dev_base_lock);
+}
+
+static int addrconf_sysctl_addrgenmode(struct ctl_table *ctl, int write,
+				       void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int new_val;
+	struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
+	struct net *net = (struct net *)ctl->extra2;
+
+	if (write) { /* sysctl write request */
+		ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
+		new_val = *((int *)ctl->data);
+
+		/* request for the all */
+		if (&net->ipv6.devconf_all->addrgenmode == ctl->data) {
+			ipv6_devconf_dflt.addrgenmode = new_val;
+			addrconf_addrgenmode_change(net);
+
+		/* request for default */
+		} else if (&net->ipv6.devconf_dflt->addrgenmode == ctl->data) {
+			ipv6_devconf_dflt.addrgenmode = new_val;
+
+		/* request for individual inet device */
+		} else {
+			if (!idev) {
+				return ret;
+			}
+			if
[net-next 2/5] bnx2x: use reset to set network header
Since offset is zero, it's not necessary to use the set function. The reset function is straightforward, and will remove the unnecessary add operation in the set function.

Signed-off-by: Zhang Shengju
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 3fd36b4..3e199d3 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -724,7 +724,7 @@ static void bnx2x_gro_ipv6_csum(struct bnx2x *bp, struct sk_buff *skb)
 static void bnx2x_gro_csum(struct bnx2x *bp, struct sk_buff *skb,
 			   void (*gro_func)(struct bnx2x*, struct sk_buff*))
 {
-	skb_set_network_header(skb, 0);
+	skb_reset_network_header(skb);
 	gro_func(bp, skb);
 	tcp_gro_complete(skb);
 }
-- 
1.8.3.1
[net-next 3/5] mlx4: use reset to set mac header
Since offset is zero, it's not necessary to use the set function. The reset function is straightforward, and will remove the unnecessary add operation in the set function.

Signed-off-by: Zhang Shengju
---
 drivers/net/ethernet/mellanox/mlx4/en_selftest.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_selftest.c b/drivers/net/ethernet/mellanox/mlx4/en_selftest.c
index c06346a..95290e1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_selftest.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_selftest.c
@@ -68,7 +68,7 @@ static int mlx4_en_test_loopback_xmit(struct mlx4_en_priv *priv)
 	memcpy(ethh->h_dest, priv->dev->dev_addr, ETH_ALEN);
 	eth_zero_addr(ethh->h_source);
 	ethh->h_proto = htons(ETH_P_ARP);
-	skb_set_mac_header(skb, 0);
+	skb_reset_mac_header(skb);
 
 	for (i = 0; i < packet_size; ++i)	/* fill our packet */
 		packet[i] = (unsigned char)(i & 0xff);
-- 
1.8.3.1
[net-next 1/5] qede: use reset to set network header
Since offset is zero, it's not necessary to use the set function. The reset function is straightforward, and will remove the unnecessary add operation in the set function.

Signed-off-by: Zhang Shengju
---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index b84a2c4..31c8449 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -1220,7 +1220,7 @@ static void qede_gro_receive(struct qede_dev *edev,
 #ifdef CONFIG_INET
 	if (skb_shinfo(skb)->gso_size) {
-		skb_set_network_header(skb, 0);
+		skb_reset_network_header(skb);
 
 		switch (skb->protocol) {
 		case htons(ETH_P_IP):
-- 
1.8.3.1
[net-next 0/5] use reset to set headers
This patch series replaces the 'set' functions with 'reset', since the offset is zero. It's not necessary to use set; the reset functions are straightforward, and remove the unnecessary add operation in the set functions.

Zhang Shengju (5):
  qede: use reset to set network header
  bnx2x: use reset to set network header
  mlx4: use reset to set mac header
  iwlwifi: use reset to set transport header
  staging: wilc1000: use reset to set mac header

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c  | 2 +-
 drivers/net/ethernet/mellanox/mlx4/en_selftest.c | 2 +-
 drivers/net/ethernet/qlogic/qede/qede_main.c     | 2 +-
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c     | 2 +-
 drivers/staging/wilc1000/linux_mon.c             | 4 ++--
 5 files changed, 6 insertions(+), 6 deletions(-)

-- 
1.8.3.1
[net-next 5/5] staging: wilc1000: use reset to set mac header
Since offset is zero, it's not necessary to use the set function. The reset function is straightforward, and will remove the unnecessary add operation in the set function.

Signed-off-by: Zhang Shengju
---
 drivers/staging/wilc1000/linux_mon.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/wilc1000/linux_mon.c b/drivers/staging/wilc1000/linux_mon.c
index 242f82f..f328d75 100644
--- a/drivers/staging/wilc1000/linux_mon.c
+++ b/drivers/staging/wilc1000/linux_mon.c
@@ -111,7 +111,7 @@ void WILC_WFI_monitor_rx(u8 *buff, u32 size)
 	}
 
 	skb->dev = wilc_wfi_mon;
-	skb_set_mac_header(skb, 0);
+	skb_reset_mac_header(skb);
 	skb->ip_summed = CHECKSUM_UNNECESSARY;
 	skb->pkt_type = PACKET_OTHERHOST;
 	skb->protocol = htons(ETH_P_802_2);
@@ -215,7 +215,7 @@ static netdev_tx_t WILC_WFI_mon_xmit(struct sk_buff *skb,
 	cb_hdr->tx_flags = 0x0004;
 
 	skb2->dev = wilc_wfi_mon;
-	skb_set_mac_header(skb2, 0);
+	skb_reset_mac_header(skb2);
 	skb2->ip_summed = CHECKSUM_UNNECESSARY;
 	skb2->pkt_type = PACKET_OTHERHOST;
 	skb2->protocol = htons(ETH_P_802_2);
-- 
1.8.3.1
[net-next 4/5] iwlwifi: use reset to set transport header
Since offset is zero, it's not necessary to use the set function. The reset function is straightforward, and will remove the unnecessary add operation in the set function.

Signed-off-by: Zhang Shengju
---
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
index 5f840f1..e44e5ad 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
@@ -2196,7 +2196,7 @@ static int iwl_fill_data_tbs_amsdu(struct iwl_trans *trans, struct sk_buff *skb,
 			memcpy(skb_put(csum_skb, tcp_hdrlen(skb)),
 			       tcph, tcp_hdrlen(skb));
-			skb_set_transport_header(csum_skb, 0);
+			skb_reset_transport_header(csum_skb);
 			csum_skb->csum_start =
 				(unsigned char *)tcp_hdr(csum_skb) -
 				csum_skb->head;
-- 
1.8.3.1
Re: [PATCH v8 3/8] thunderbolt: Communication with the ICM (firmware)
On 09/28/2016 07:44 AM, Amir Levy wrote:
> This patch provides the communication protocol between the Intel
> Connection Manager (ICM) firmware that is operational in the Thunderbolt
> controller in non-Apple hardware. The ICM firmware-based controller is
> used for establishing and maintaining the Thunderbolt Networking
> connection - we need to be able to communicate with it.

I'm a bit late to the party, but here goes. I have two big questions:

1. Why is this using netlink at all? A system has zero or more Thunderbolt controllers, they're probed just like any other PCI devices (by nhi_probe() if I'm understanding correctly), they'll have nodes in sysfs, etc. Shouldn't there be a simple char device per Thunderbolt controller that a daemon can connect to? This will clean up lots of things:

a) You can actually enforce one-daemon-at-a-time in a very natural way. Your current code seems to try, but it's rather buggy. Your subscription count is a guess, your unsubscribe is entirely unchecked, and you are entirely unable to detect if a daemon crashes AFAICT.

b) You won't need all of the complexity that's currently there to figure out *which* Thunderbolt device a daemon is talking to.

c) You can use regular ioctl passing *structs* instead of netlink attrs. There's nothing wrong with netlink attrs, except that your driver seems to have a whole lot of boilerplate that just converts back and forth to regular structures.

d) The userspace code that does stuff like "send message, wait 150ms, receive reply, complain if no reply" goes away because ioctl is synchronous. (Or you can use read and write, but it's still simpler.)

e) You could have one daemon per Thunderbolt device if you were so inclined.

f) You get privilege separation in userspace. Creating a netlink socket and dropping privilege is busted^Winteresting. Opening a device node and dropping privilege works quite nicely.

2. Why do you need a daemon anyway? Functionally, what exactly does it do?
(Okay, I get that it seems to talk to a giant pile of code running in SMM, and I get that Intel, for some bizarre reason, wants everyone except Apple to use this code in SMM, and that Apple (for entirely understandable reasons) turned it off, but that's beside the point. What does the user code do that's useful and that the kernel can't do all by itself? The only really interesting bit I can see is the part that approves PCI devices. I'm not going to review this in detail, but here's a tiny bit: +static int nhi_genl_unsubscribe(__always_unused struct sk_buff *u_skb, + __always_unused struct genl_info *info) +{ + atomic_dec_if_positive(); + + return 0; +} + This, for example, is really quite buggy. This entire function here: +static int nhi_genl_query_information(__always_unused struct sk_buff *u_skb, + struct genl_info *info) +{ + struct tbt_nhi_ctxt *nhi_ctxt; + struct sk_buff *skb; + bool msg_too_long; + int res = -ENODEV; + u32 *msg_head; + + if (!info || !info->userhdr) + return -EINVAL; + + skb = genlmsg_new(NLMSG_ALIGN(nhi_genl_family.hdrsize) + + nla_total_size(sizeof(DRV_VERSION)) + + nla_total_size(sizeof(nhi_ctxt->nvm_ver_offset)) + + nla_total_size(sizeof(nhi_ctxt->num_ports)) + + nla_total_size(sizeof(nhi_ctxt->dma_port)) + + nla_total_size(0),/* nhi_ctxt->support_full_e2e */ + GFP_KERNEL); + if (!skb) + return -ENOMEM; + + msg_head = genlmsg_put_reply(skb, info, _genl_family, 0, +NHI_CMD_QUERY_INFORMATION); + if (!msg_head) { + res = -ENOMEM; + goto genl_put_reply_failure; + } + + if (mutex_lock_interruptible(_list_mutex)) { + res = -ERESTART; + goto genl_put_reply_failure; + } + + nhi_ctxt = nhi_search_ctxt(*(u32 *)info->userhdr); + if (nhi_ctxt && !nhi_ctxt->d0_exit) { + *msg_head = nhi_ctxt->id; + + msg_too_long = !!nla_put_string(skb, NHI_ATTR_DRV_VERSION, + DRV_VERSION); + + msg_too_long = msg_too_long || + nla_put_u16(skb, NHI_ATTR_NVM_VER_OFFSET, + nhi_ctxt->nvm_ver_offset); + + msg_too_long = msg_too_long || + nla_put_u8(skb, 
NHI_ATTR_NUM_PORTS, + nhi_ctxt->num_ports); + + msg_too_long = msg_too_long || + nla_put_u8(skb, NHI_ATTR_DMA_PORT, + nhi_ctxt->dma_port); + + if (msg_too_long) { + res = -EMSGSIZE; +
[PATCH net-next 2/8] drivers: net: xgene: Configure classifier with pagepool
This patch configures classifier with the pagepool information. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_cle.c | 16 ++-- drivers/net/ethernet/apm/xgene/xgene_enet_cle.h | 2 ++ drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 7 +-- drivers/net/ethernet/apm/xgene/xgene_enet_hw.h| 6 -- drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 11 +-- drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 3 ++- drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 9 ++--- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 7 +-- 8 files changed, 47 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c index 7aac0fb..caa55bd 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c @@ -52,6 +52,7 @@ static void xgene_cle_dbptr_to_hw(struct xgene_enet_pdata *pdata, { buf[0] = SET_VAL(CLE_DROP, dbptr->drop); buf[4] = SET_VAL(CLE_FPSEL, dbptr->fpsel) | +SET_VAL(CLE_NFPSEL, dbptr->nxtfpsel) | SET_VAL(CLE_DSTQIDL, dbptr->dstqid); buf[5] = SET_VAL(CLE_DSTQIDH, (u32)dbptr->dstqid >> CLE_DSTQIDL_LEN) | @@ -349,8 +350,12 @@ static int xgene_cle_set_rss_idt(struct xgene_enet_pdata *pdata) fpsel = xgene_enet_get_fpsel(pool_id); dstqid = xgene_enet_dst_ring_num(pdata->rx_ring[idx]); nfpsel = 0; - idt_reg = 0; + if (pdata->rx_ring[idx]->page_pool) { + pool_id = pdata->rx_ring[idx]->page_pool->id; + nfpsel = xgene_enet_get_fpsel(pool_id); + } + idt_reg = 0; xgene_cle_idt_to_hw(pdata, dstqid, fpsel, nfpsel, _reg); ret = xgene_cle_dram_wr(>cle, _reg, 1, i, RSS_IDT, CLE_CMD_WR); @@ -400,9 +405,9 @@ static int xgene_cle_setup_rss(struct xgene_enet_pdata *pdata) static int xgene_enet_cle_init(struct xgene_enet_pdata *pdata) { struct xgene_enet_cle *enet_cle = >cle; + u32 def_qid, def_fpsel, def_nxtfpsel, pool_id; struct xgene_cle_dbptr dbptr[DB_MAX_PTRS]; struct xgene_cle_ptree_branch *br; - u32 
def_qid, def_fpsel, pool_id; struct xgene_cle_ptree *ptree; struct xgene_cle_ptree_kn kn; int ret; @@ -707,13 +712,20 @@ static int xgene_enet_cle_init(struct xgene_enet_pdata *pdata) def_qid = xgene_enet_dst_ring_num(pdata->rx_ring[0]); pool_id = pdata->rx_ring[0]->buf_pool->id; def_fpsel = xgene_enet_get_fpsel(pool_id); + def_nxtfpsel = 0; + if (pdata->rx_ring[0]->page_pool) { + pool_id = pdata->rx_ring[0]->page_pool->id; + def_nxtfpsel = xgene_enet_get_fpsel(pool_id); + } memset(dbptr, 0, sizeof(struct xgene_cle_dbptr) * DB_MAX_PTRS); dbptr[DB_RES_ACCEPT].fpsel = def_fpsel; + dbptr[DB_RES_ACCEPT].nxtfpsel = def_nxtfpsel; dbptr[DB_RES_ACCEPT].dstqid = def_qid; dbptr[DB_RES_ACCEPT].cle_priority = 1; dbptr[DB_RES_DEF].fpsel = def_fpsel; + dbptr[DB_RES_DEF].nxtfpsel = def_nxtfpsel; dbptr[DB_RES_DEF].dstqid = def_qid; dbptr[DB_RES_DEF].cle_priority = 7; xgene_cle_setup_def_dbptr(pdata, enet_cle, [DB_RES_DEF], diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.h b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.h index 9ac9f8e..903be0c 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.h @@ -91,6 +91,8 @@ #define CLE_DSTQIDH_LEN5 #define CLE_FPSEL_POS 21 #define CLE_FPSEL_LEN 4 +#define CLE_NFPSEL_POS 17 +#define CLE_NFPSEL_LEN 4 #define CLE_PRIORITY_POS 5 #define CLE_PRIORITY_LEN 3 diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index 1007074..c395df3 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -550,12 +550,14 @@ static void xgene_enet_config_ring_if_assoc(struct xgene_enet_pdata *pdata) } static void xgene_enet_cle_bypass(struct xgene_enet_pdata *pdata, - u32 dst_ring_num, u16 bufpool_id) + u32 dst_ring_num, u16 bufpool_id, + u16 nxtbufpool_id) { u32 cb; - u32 fpsel; + u32 fpsel, nxtfpsel; fpsel = xgene_enet_get_fpsel(bufpool_id); + nxtfpsel = 
xgene_enet_get_fpsel(nxtbufpool_id); xgene_enet_rd_csr(pdata, CLE_BYPASS_REG0_0_ADDR, ); cb |= CFG_CLE_BYPASS_EN0; @@ -565,6 +567,7 @@ static void xgene_enet_cle_bypass(struct
[PATCH net-next 5/8] drivers: net: xgene: fix: RSS for non-TCP/UDP
This patch fixes RSS feature, for non-TCP/UDP packets. Signed-off-by: Khuong DinhSigned-off-by: Iyappan Subramanian --- drivers/net/ethernet/apm/xgene/xgene_enet_cle.c | 90 - drivers/net/ethernet/apm/xgene/xgene_enet_cle.h | 1 + 2 files changed, 89 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c index caa55bd..1dc6c20 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c @@ -485,11 +485,11 @@ static int xgene_enet_cle_init(struct xgene_enet_pdata *pdata) }, { .valid = 0, - .next_packet_pointer = 260, + .next_packet_pointer = 26, .jump_bw = JMP_FW, .jump_rel = JMP_ABS, .operation = EQT, - .next_node = LAST_NODE, + .next_node = RSS_IPV4_OTHERS_NODE, .next_branch = 0, .data = 0x0, .mask = 0x @@ -667,6 +667,92 @@ static int xgene_enet_cle_init(struct xgene_enet_pdata *pdata) } }, { + /* RSS_IPV4_OTHERS_NODE */ + .node_type = EWDN, + .last_node = 0, + .hdr_len_store = 1, + .hdr_extn = NO_BYTE, + .byte_store = NO_BYTE, + .search_byte_store = BOTH_BYTES, + .result_pointer = DB_RES_DROP, + .num_branches = 6, + .branch = { + { + /* SRC IPV4 B01 */ + .valid = 0, + .next_packet_pointer = 28, + .jump_bw = JMP_FW, + .jump_rel = JMP_ABS, + .operation = EQT, + .next_node = RSS_IPV4_OTHERS_NODE, + .next_branch = 1, + .data = 0x0, + .mask = 0x + }, + { + /* SRC IPV4 B23 */ + .valid = 0, + .next_packet_pointer = 30, + .jump_bw = JMP_FW, + .jump_rel = JMP_ABS, + .operation = EQT, + .next_node = RSS_IPV4_OTHERS_NODE, + .next_branch = 2, + .data = 0x0, + .mask = 0x + }, + { + /* DST IPV4 B01 */ + .valid = 0, + .next_packet_pointer = 32, + .jump_bw = JMP_FW, + .jump_rel = JMP_ABS, + .operation = EQT, + .next_node = RSS_IPV4_OTHERS_NODE, + .next_branch = 3, + .data = 0x0, + .mask = 0x + }, + { + /* DST IPV4 B23 */ + .valid = 0, + .next_packet_pointer = 34, + .jump_bw = JMP_FW, + .jump_rel = JMP_ABS, + .operation = EQT, + .next_node = 
RSS_IPV4_OTHERS_NODE, + .next_branch = 4, + .data = 0x0, + .mask = 0x + }, + { + /* TCP SRC Port */ + .valid = 0, + .next_packet_pointer = 36, + .jump_bw = JMP_FW, +
[PATCH net-next 7/8] drivers: net: xgene: Add flow control initialization
This patch adds flow control/pause frame initialization and advertising capabilities. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 57 +++ drivers/net/ethernet/apm/xgene/xgene_enet_hw.h| 7 +++ drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 44 - drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 17 +++ drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h | 7 +++ 5 files changed, 131 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index 23a0175..06e6816 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -577,6 +577,17 @@ static void xgene_gmac_init(struct xgene_enet_pdata *pdata) /* Rtype should be copied from FP */ xgene_enet_wr_csr(pdata, RSIF_RAM_DBG_REG0_ADDR, 0); + /* Configure HW pause frame generation */ + xgene_enet_rd_mcx_csr(pdata, CSR_MULTI_DPF0_ADDR, ); + value = (DEF_QUANTA << 16) | (value & 0x); + xgene_enet_wr_mcx_csr(pdata, CSR_MULTI_DPF0_ADDR, value); + + xgene_enet_wr_csr(pdata, RXBUF_PAUSE_THRESH, DEF_PAUSE_THRES); + xgene_enet_wr_csr(pdata, RXBUF_PAUSE_OFF_THRESH, DEF_PAUSE_OFF_THRES); + + xgene_gmac_flowctl_tx(pdata, pdata->tx_pause); + xgene_gmac_flowctl_rx(pdata, pdata->rx_pause); + /* Rx-Tx traffic resume */ xgene_enet_wr_csr(pdata, CFG_LINK_AGGR_RESUME_0_ADDR, TX_PORT0); @@ -749,6 +760,48 @@ static void xgene_gport_shutdown(struct xgene_enet_pdata *pdata) } } +static u32 xgene_enet_flowctrl_cfg(struct net_device *ndev) +{ + struct xgene_enet_pdata *pdata = netdev_priv(ndev); + struct phy_device *phydev = ndev->phydev; + u16 lcladv, rmtadv = 0; + u32 rx_pause, tx_pause; + u8 flowctl = 0; + + if (!phydev->duplex || !pdata->pause_autoneg) + return 0; + + if (pdata->tx_pause) + flowctl |= FLOW_CTRL_TX; + + if (pdata->rx_pause) + flowctl |= FLOW_CTRL_RX; + + lcladv = mii_advertise_flowctrl(flowctl); + + if (phydev->pause) + 
rmtadv = LPA_PAUSE_CAP; + + if (phydev->asym_pause) + rmtadv |= LPA_PAUSE_ASYM; + + flowctl = mii_resolve_flowctrl_fdx(lcladv, rmtadv); + tx_pause = !!(flowctl & FLOW_CTRL_TX); + rx_pause = !!(flowctl & FLOW_CTRL_RX); + + if (tx_pause != pdata->tx_pause) { + pdata->tx_pause = tx_pause; + pdata->mac_ops->flowctl_tx(pdata, pdata->tx_pause); + } + + if (rx_pause != pdata->rx_pause) { + pdata->rx_pause = rx_pause; + pdata->mac_ops->flowctl_rx(pdata, pdata->rx_pause); + } + + return 0; +} + static void xgene_enet_adjust_link(struct net_device *ndev) { struct xgene_enet_pdata *pdata = netdev_priv(ndev); @@ -763,6 +816,8 @@ static void xgene_enet_adjust_link(struct net_device *ndev) mac_ops->tx_enable(pdata); phy_print_status(phydev); } + + xgene_enet_flowctrl_cfg(ndev); } else { mac_ops->rx_disable(pdata); mac_ops->tx_disable(pdata); @@ -836,6 +891,8 @@ int xgene_enet_phy_connect(struct net_device *ndev) phy_dev->supported &= ~SUPPORTED_10baseT_Half & ~SUPPORTED_100baseT_Half & ~SUPPORTED_1000baseT_Half; + phy_dev->supported |= SUPPORTED_Pause | + SUPPORTED_Asym_Pause; phy_dev->advertising = phy_dev->supported; return 0; diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h index 7ba649d..5f83037 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h @@ -172,6 +172,13 @@ enum xgene_enet_rm { #define CFG_CLE_FPSEL0(val)(((val) << 16) & GENMASK(19, 16)) #define CSR_ECM_CFG_0_ADDR 0x0220 #define CSR_ECM_CFG_1_ADDR 0x0224 +#define CSR_MULTI_DPF0_ADDR0x0230 +#define RXBUF_PAUSE_THRESH 0x0534 +#define RXBUF_PAUSE_OFF_THRESH 0x0540 +#define DEF_PAUSE_THRES0x7d +#define DEF_PAUSE_OFF_THRES0x6d +#define DEF_QUANTA 0x8000 +#define NORM_PAUSE_OPCODE 0x0001 #define PAUSE_XON_EN BIT(30) #define MULTI_DPF_AUTOCTRL BIT(28) #define CFG_CLE_NXTFPSEL0(val) (((val) << 20) & GENMASK(23, 20)) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c index 1c9b8ba..a8e063b 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c +++
[PATCH net-next 3/8] drivers: net: xgene: Add support for Jumbo frame
This patch adds support for jumbo frame, by allocating additional buffer (page) pool and configuring the hardware. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 3 + drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 305 -- drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 20 +- drivers/net/ethernet/apm/xgene/xgene_enet_ring2.c | 1 + drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 3 + drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 4 + 6 files changed, 311 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index c395df3..fc9010f 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -679,6 +679,9 @@ static void xgene_gport_shutdown(struct xgene_enet_pdata *pdata) for (i = 0; i < pdata->rxq_cnt; i++) { ring = pdata->rx_ring[i]->buf_pool; pb |= BIT(xgene_enet_get_fpsel(ring->id)); + ring = pdata->rx_ring[i]->page_pool; + if (ring) + pb |= BIT(xgene_enet_get_fpsel(ring->id)); } xgene_enet_wr_ring_if(pdata, ENET_CFGSSQMIFPRESET_ADDR, pb); diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c index c89acf5..698df27 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c @@ -37,6 +37,9 @@ static void xgene_enet_init_bufpool(struct xgene_enet_desc_ring *buf_pool) struct xgene_enet_raw_desc16 *raw_desc; int i; + if (!buf_pool) + return; + for (i = 0; i < buf_pool->slots; i++) { raw_desc = _pool->raw_desc16[i]; @@ -47,6 +50,86 @@ static void xgene_enet_init_bufpool(struct xgene_enet_desc_ring *buf_pool) } } +static u16 xgene_enet_get_data_len(u64 bufdatalen) +{ + u16 hw_len, mask; + + hw_len = GET_VAL(BUFDATALEN, bufdatalen); + + if (unlikely(hw_len == 0x7800)) { + return 0; + } else if (!(hw_len & BIT(14))) { + mask = GENMASK(13, 0); 
+ return (hw_len & mask) ? (hw_len & mask) : SIZE_16K; + } else if (!(hw_len & GENMASK(13, 12))) { + mask = GENMASK(11, 0); + return (hw_len & mask) ? (hw_len & mask) : SIZE_4K; + } else { + mask = GENMASK(11, 0); + return (hw_len & mask) ? (hw_len & mask) : SIZE_2K; + } +} + +static u16 xgene_enet_set_data_len(u32 size) +{ + u16 hw_len; + + hw_len = (size == SIZE_4K) ? BIT(14) : 0; + + return hw_len; +} + +static int xgene_enet_refill_pagepool(struct xgene_enet_desc_ring *buf_pool, + u32 nbuf) +{ + struct xgene_enet_raw_desc16 *raw_desc; + struct xgene_enet_pdata *pdata; + struct net_device *ndev; + dma_addr_t dma_addr; + struct device *dev; + struct page *page; + u32 slots, tail; + u16 hw_len; + int i; + + if (unlikely(!buf_pool)) + return 0; + + ndev = buf_pool->ndev; + pdata = netdev_priv(ndev); + dev = ndev_to_dev(ndev); + slots = buf_pool->slots - 1; + tail = buf_pool->tail; + + for (i = 0; i < nbuf; i++) { + raw_desc = _pool->raw_desc16[tail]; + + page = dev_alloc_page(); + if (unlikely(!page)) + return -ENOMEM; + + dma_addr = dma_map_page(dev, page, 0, + PAGE_SIZE, DMA_FROM_DEVICE); + if (unlikely(dma_mapping_error(dev, dma_addr))) { + put_page(page); + return -ENOMEM; + } + + hw_len = xgene_enet_set_data_len(PAGE_SIZE); + raw_desc->m1 = cpu_to_le64(SET_VAL(DATAADDR, dma_addr) | + SET_VAL(BUFDATALEN, hw_len) | + SET_BIT(COHERENT)); + + buf_pool->frag_page[tail] = page; + tail = (tail + 1) & slots; + } + + pdata->ring_ops->wr_cmd(buf_pool, nbuf); + buf_pool->tail = tail; + + return 0; +} + static int xgene_enet_refill_bufpool(struct xgene_enet_desc_ring *buf_pool, u32 nbuf) { @@ -64,8 +147,9 @@ static int xgene_enet_refill_bufpool(struct xgene_enet_desc_ring *buf_pool, ndev = buf_pool->ndev; dev = ndev_to_dev(buf_pool->ndev); pdata = netdev_priv(ndev); + bufdatalen = BUF_LEN_CODE_2K | (SKB_BUFFER_SIZE & GENMASK(11, 0)); - len = XGENE_ENET_MAX_MTU; + len = XGENE_ENET_STD_MTU; for (i = 0; i < nbuf; i++) { raw_desc =
[PATCH net-next 4/8] drivers: net: xgene: Add change_mtu function
This patch implements ndo_change_mtu() callback function that enables mtu change. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 6 ++ drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 20 drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 1 + drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 6 ++ drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 7 +++ 5 files changed, 40 insertions(+) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index fc9010f..92cc7e5 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -504,6 +504,11 @@ static void xgene_gmac_set_speed(struct xgene_enet_pdata *pdata) xgene_enet_wr_mcx_csr(pdata, ICM_CONFIG2_REG_0_ADDR, icm2); } +static void xgene_enet_set_frame_size(struct xgene_enet_pdata *pdata, int size) +{ + xgene_enet_wr_mcx_mac(pdata, MAX_FRAME_LEN_ADDR, size); +} + static void xgene_gmac_init(struct xgene_enet_pdata *pdata) { u32 value; @@ -903,6 +908,7 @@ void xgene_enet_mdio_remove(struct xgene_enet_pdata *pdata) .tx_disable = xgene_gmac_tx_disable, .set_speed = xgene_gmac_set_speed, .set_mac_addr = xgene_gmac_set_mac_addr, + .set_framesize = xgene_enet_set_frame_size, }; const struct xgene_port_ops xgene_gport_ops = { diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c index 698df27..6c7eea8 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c @@ -1500,12 +1500,31 @@ static int xgene_enet_set_mac_address(struct net_device *ndev, void *addr) return ret; } +static int xgene_change_mtu(struct net_device *ndev, int new_mtu) +{ + struct xgene_enet_pdata *pdata = netdev_priv(ndev); + int frame_size; + + if (!netif_running(ndev)) + return 0; + + frame_size = (new_mtu > ETH_DATA_LEN) ? 
(new_mtu + 18) : 0x600; + + xgene_enet_close(ndev); + ndev->mtu = new_mtu; + pdata->mac_ops->set_framesize(pdata, frame_size); + xgene_enet_open(ndev); + + return 0; +} + static const struct net_device_ops xgene_ndev_ops = { .ndo_open = xgene_enet_open, .ndo_stop = xgene_enet_close, .ndo_start_xmit = xgene_enet_start_xmit, .ndo_tx_timeout = xgene_enet_timeout, .ndo_get_stats64 = xgene_enet_get_stats64, + .ndo_change_mtu = xgene_change_mtu, .ndo_set_mac_address = xgene_enet_set_mac_address, }; @@ -1832,6 +1851,7 @@ static int xgene_enet_init_hw(struct xgene_enet_pdata *pdata) buf_pool->id, ring_id); } + ndev->max_mtu = XGENE_ENET_MAX_MTU; pdata->phy_speed = SPEED_UNKNOWN; pdata->mac_ops->init(pdata); diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h index 023ed76..61aa987 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h @@ -156,6 +156,7 @@ struct xgene_mac_ops { void (*rx_disable)(struct xgene_enet_pdata *pdata); void (*set_speed)(struct xgene_enet_pdata *pdata); void (*set_mac_addr)(struct xgene_enet_pdata *pdata); + void (*set_framesize)(struct xgene_enet_pdata *pdata, int framesize); void (*set_mss)(struct xgene_enet_pdata *pdata, u16 mss, u8 index); void (*link_state)(struct work_struct *work); }; diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c index a10ab64..6283f2b 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c @@ -343,6 +343,11 @@ static void xgene_sgmac_set_speed(struct xgene_enet_pdata *p) xgene_enet_wr_mcx_csr(p, icm2_addr, icm2); } +static void xgene_sgmac_set_frame_size(struct xgene_enet_pdata *pdata, int size) +{ + xgene_enet_wr_mac(pdata, MAX_FRAME_LEN_ADDR, size); +} + static void xgene_sgmii_enable_autoneg(struct xgene_enet_pdata *p) { u32 data, loop = 10; @@ -595,6 +600,7 @@ static 
void xgene_enet_link_state(struct work_struct *work) .tx_disable = xgene_sgmac_tx_disable, .set_speed = xgene_sgmac_set_speed, .set_mac_addr = xgene_sgmac_set_mac_addr, + .set_framesize = xgene_sgmac_set_frame_size, .link_state = xgene_enet_link_state }; diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c b/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c index 4109776..2a9761b 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c @@ -250,6 +250,12 @@
[PATCH net-next 1/8] drivers: net: xgene: Add helper function
This is a prepartion patch and adds xgene_enet_get_fpsel() helper function to get buffer pool number. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_cle.c | 4 ++-- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 19 +++ drivers/net/ethernet/apm/xgene/xgene_enet_hw.h| 8 drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 20 +++- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 20 +++- 5 files changed, 31 insertions(+), 40 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c index 23d72af..7aac0fb 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_cle.c @@ -346,7 +346,7 @@ static int xgene_cle_set_rss_idt(struct xgene_enet_pdata *pdata) for (i = 0; i < XGENE_CLE_IDT_ENTRIES; i++) { idx = i % pdata->rxq_cnt; pool_id = pdata->rx_ring[idx]->buf_pool->id; - fpsel = xgene_enet_ring_bufnum(pool_id) - 0x20; + fpsel = xgene_enet_get_fpsel(pool_id); dstqid = xgene_enet_dst_ring_num(pdata->rx_ring[idx]); nfpsel = 0; idt_reg = 0; @@ -706,7 +706,7 @@ static int xgene_enet_cle_init(struct xgene_enet_pdata *pdata) def_qid = xgene_enet_dst_ring_num(pdata->rx_ring[0]); pool_id = pdata->rx_ring[0]->buf_pool->id; - def_fpsel = xgene_enet_ring_bufnum(pool_id) - 0x20; + def_fpsel = xgene_enet_get_fpsel(pool_id); memset(dbptr, 0, sizeof(struct xgene_cle_dbptr) * DB_MAX_PTRS); dbptr[DB_RES_ACCEPT].fpsel = def_fpsel; diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index 5390ae8..1007074 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -555,7 +555,7 @@ static void xgene_enet_cle_bypass(struct xgene_enet_pdata *pdata, u32 cb; u32 fpsel; - fpsel = xgene_enet_ring_bufnum(bufpool_id) - 0x20; + fpsel = xgene_enet_get_fpsel(bufpool_id); xgene_enet_rd_csr(pdata, 
CLE_BYPASS_REG0_0_ADDR, ); cb |= CFG_CLE_BYPASS_EN0; @@ -652,16 +652,14 @@ static int xgene_enet_reset(struct xgene_enet_pdata *pdata) static void xgene_enet_clear(struct xgene_enet_pdata *pdata, struct xgene_enet_desc_ring *ring) { - u32 addr, val, data; - - val = xgene_enet_ring_bufnum(ring->id); + u32 addr, data; if (xgene_enet_is_bufpool(ring->id)) { addr = ENET_CFGSSQMIFPRESET_ADDR; - data = BIT(val - 0x20); + data = BIT(xgene_enet_get_fpsel(ring->id)); } else { addr = ENET_CFGSSQMIWQRESET_ADDR; - data = BIT(val); + data = BIT(xgene_enet_ring_bufnum(ring->id)); } xgene_enet_wr_ring_if(pdata, addr, data); @@ -671,24 +669,21 @@ static void xgene_gport_shutdown(struct xgene_enet_pdata *pdata) { struct device *dev = >pdev->dev; struct xgene_enet_desc_ring *ring; - u32 pb, val; + u32 pb; int i; pb = 0; for (i = 0; i < pdata->rxq_cnt; i++) { ring = pdata->rx_ring[i]->buf_pool; + pb |= BIT(xgene_enet_get_fpsel(ring->id)); - val = xgene_enet_ring_bufnum(ring->id); - pb |= BIT(val - 0x20); } xgene_enet_wr_ring_if(pdata, ENET_CFGSSQMIFPRESET_ADDR, pb); pb = 0; for (i = 0; i < pdata->txq_cnt; i++) { ring = pdata->tx_ring[i]; - - val = xgene_enet_ring_bufnum(ring->id); - pb |= BIT(val); + pb |= BIT(xgene_enet_ring_bufnum(ring->id)); } xgene_enet_wr_ring_if(pdata, ENET_CFGSSQMIWQRESET_ADDR, pb); diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h index 06e598c..e73cbb1 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h @@ -346,6 +346,14 @@ static inline bool xgene_enet_is_bufpool(u16 id) return ((id & RING_BUFNUM_MASK) >= 0x20) ? 
true : false; } +static inline u8 xgene_enet_get_fpsel(u16 id) +{ + if (xgene_enet_is_bufpool(id)) + return xgene_enet_ring_bufnum(id) - RING_BUFNUM_BUFPOOL; + + return 0; +} + static inline u16 xgene_enet_get_numslots(u16 id, u32 size) { bool is_bufpool = xgene_enet_is_bufpool(id); diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c index d12e9cb..8e4209c 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c @@ -501,7 +501,7 @@ static
[PATCH net-next 0/8] drivers: net: xgene: Add Jumbo and Pause frame support
This patch set adds, 1. Jumbo frame support 2. Pause frame based flow control and fixes RSS for non-TCP/UDP packets. Signed-off-by: Iyappan Subramanian--- Iyappan Subramanian (8): drivers: net: xgene: Add helper function drivers: net: xgene: Configure classifier with pagepool drivers: net: xgene: Add support for Jumbo frame drivers: net: xgene: Add change_mtu function drivers: net: xgene: fix: RSS for non-TCP/UDP drivers: net: xgene: Add flow control configuration drivers: net: xgene: Add flow control initialization drivers: net: xgene: ethtool: Add get/set_pauseparam drivers/net/ethernet/apm/xgene/xgene_enet_cle.c| 110 ++- drivers/net/ethernet/apm/xgene/xgene_enet_cle.h| 3 + .../net/ethernet/apm/xgene/xgene_enet_ethtool.c| 70 + drivers/net/ethernet/apm/xgene/xgene_enet_hw.c | 140 - drivers/net/ethernet/apm/xgene/xgene_enet_hw.h | 27 +- drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 336 +++-- drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 30 +- drivers/net/ethernet/apm/xgene/xgene_enet_ring2.c | 1 + drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 146 +++-- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 121 +++- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h | 9 + 11 files changed, 895 insertions(+), 98 deletions(-) -- 1.9.1
[PATCH net-next 6/8] drivers: net: xgene: Add flow control configuration
This patch adds functions to configure mac, when flow control and pause frame settings change. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c| 48 + drivers/net/ethernet/apm/xgene/xgene_enet_hw.h| 6 +++ drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 6 +++ drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 64 -- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c | 66 ++- drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.h | 2 + 6 files changed, 176 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index 92cc7e5..23a0175 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -509,6 +509,51 @@ static void xgene_enet_set_frame_size(struct xgene_enet_pdata *pdata, int size) xgene_enet_wr_mcx_mac(pdata, MAX_FRAME_LEN_ADDR, size); } +static void xgene_gmac_enable_tx_pause(struct xgene_enet_pdata *pdata, + bool enable) +{ + u32 data; + + xgene_enet_rd_mcx_csr(pdata, CSR_ECM_CFG_0_ADDR, ); + + if (enable) + data |= MULTI_DPF_AUTOCTRL | PAUSE_XON_EN; + else + data &= ~(MULTI_DPF_AUTOCTRL | PAUSE_XON_EN); + + xgene_enet_wr_mcx_csr(pdata, CSR_ECM_CFG_0_ADDR, data); +} + +static void xgene_gmac_flowctl_tx(struct xgene_enet_pdata *pdata, bool enable) +{ + u32 data; + + xgene_enet_rd_mcx_mac(pdata, MAC_CONFIG_1_ADDR, ); + + if (enable) + data |= TX_FLOW_EN; + else + data &= ~TX_FLOW_EN; + + xgene_enet_wr_mcx_mac(pdata, MAC_CONFIG_1_ADDR, data); + + pdata->mac_ops->enable_tx_pause(pdata, enable); +} + +static void xgene_gmac_flowctl_rx(struct xgene_enet_pdata *pdata, bool enable) +{ + u32 data; + + xgene_enet_rd_mcx_mac(pdata, MAC_CONFIG_1_ADDR, ); + + if (enable) + data |= RX_FLOW_EN; + else + data &= ~RX_FLOW_EN; + + xgene_enet_wr_mcx_mac(pdata, MAC_CONFIG_1_ADDR, data); +} + static void xgene_gmac_init(struct xgene_enet_pdata *pdata) { u32 value; @@ -909,6 +954,9 
@@ void xgene_enet_mdio_remove(struct xgene_enet_pdata *pdata) .set_speed = xgene_gmac_set_speed, .set_mac_addr = xgene_gmac_set_mac_addr, .set_framesize = xgene_enet_set_frame_size, + .enable_tx_pause = xgene_gmac_enable_tx_pause, + .flowctl_tx = xgene_gmac_flowctl_tx, + .flowctl_rx = xgene_gmac_flowctl_rx, }; const struct xgene_port_ops xgene_gport_ops = { diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h index bd6cb6c..7ba649d 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h @@ -170,6 +170,10 @@ enum xgene_enet_rm { #define CFG_WAITASYNCRD_SET(dst, val) xgene_set_bits(dst, val, 0, 16) #define CFG_CLE_DSTQID0(val) ((val) & GENMASK(11, 0)) #define CFG_CLE_FPSEL0(val)(((val) << 16) & GENMASK(19, 16)) +#define CSR_ECM_CFG_0_ADDR 0x0220 +#define CSR_ECM_CFG_1_ADDR 0x0224 +#define PAUSE_XON_EN BIT(30) +#define MULTI_DPF_AUTOCTRL BIT(28) #define CFG_CLE_NXTFPSEL0(val) (((val) << 20) & GENMASK(23, 20)) #define ICM_CONFIG0_REG_0_ADDR 0x0400 #define ICM_CONFIG2_REG_0_ADDR 0x0410 @@ -198,6 +202,8 @@ enum xgene_enet_rm { #define SOFT_RESET1BIT(31) #define TX_EN BIT(0) #define RX_EN BIT(2) +#define TX_FLOW_EN BIT(4) +#define RX_FLOW_EN BIT(5) #define ENET_LHD_MODE BIT(25) #define ENET_GHD_MODE BIT(26) #define FULL_DUPLEX2 BIT(0) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h index 61aa987..5257174 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h @@ -159,6 +159,9 @@ struct xgene_mac_ops { void (*set_framesize)(struct xgene_enet_pdata *pdata, int framesize); void (*set_mss)(struct xgene_enet_pdata *pdata, u16 mss, u8 index); void (*link_state)(struct work_struct *work); + void (*enable_tx_pause)(struct xgene_enet_pdata *pdata, bool enable); + void (*flowctl_rx)(struct xgene_enet_pdata *pdata, bool enable); + void 
(*flowctl_tx)(struct xgene_enet_pdata *pdata, bool enable); }; struct xgene_port_ops { @@ -234,6 +237,9 @@ struct xgene_enet_pdata { bool mdio_driver; struct gpio_desc *sfp_rdy; bool sfp_gpio_en; + u32 pause_autoneg; + bool
[PATCH net-next 8/8] drivers: net: xgene: ethtool: Add get/set_pauseparam
This patch adds get_pauseparam and set_pauseparam functions. Signed-off-by: Iyappan SubramanianSigned-off-by: Quan Nguyen --- .../net/ethernet/apm/xgene/xgene_enet_ethtool.c| 70 ++ 1 file changed, 70 insertions(+) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_ethtool.c b/drivers/net/ethernet/apm/xgene/xgene_enet_ethtool.c index d372d42..28fdedc 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_ethtool.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_ethtool.c @@ -163,6 +163,74 @@ static void xgene_get_ethtool_stats(struct net_device *ndev, *data++ = *(u64 *)(pdata + gstrings_stats[i].offset); } +static void xgene_get_pauseparam(struct net_device *ndev, +struct ethtool_pauseparam *pp) +{ + struct xgene_enet_pdata *pdata = netdev_priv(ndev); + + pp->autoneg = pdata->pause_autoneg; + pp->tx_pause = pdata->tx_pause; + pp->rx_pause = pdata->rx_pause; +} + +static int xgene_set_pauseparam(struct net_device *ndev, + struct ethtool_pauseparam *pp) +{ + struct xgene_enet_pdata *pdata = netdev_priv(ndev); + struct phy_device *phydev = ndev->phydev; + u32 oldadv, newadv; + + if (pdata->phy_mode == PHY_INTERFACE_MODE_RGMII || + pdata->phy_mode == PHY_INTERFACE_MODE_SGMII) { + if (!phydev) + return -EINVAL; + + if (!(phydev->supported & SUPPORTED_Pause) || + (!(phydev->supported & SUPPORTED_Asym_Pause) && +pp->rx_pause != pp->tx_pause)) + return -EINVAL; + + pdata->pause_autoneg = pp->autoneg; + pdata->tx_pause = pp->tx_pause; + pdata->rx_pause = pp->rx_pause; + + oldadv = phydev->advertising; + newadv = oldadv & ~(ADVERTISED_Pause | ADVERTISED_Asym_Pause); + + if (pp->rx_pause) + newadv |= ADVERTISED_Pause | ADVERTISED_Asym_Pause; + + if (pp->tx_pause) + newadv ^= ADVERTISED_Asym_Pause; + + if (oldadv ^ newadv) { + phydev->advertising = newadv; + + if (phydev->autoneg) + return phy_start_aneg(phydev); + + if (!pp->autoneg) { + pdata->mac_ops->flowctl_tx(pdata, + pdata->tx_pause); + pdata->mac_ops->flowctl_rx(pdata, + pdata->rx_pause); + } + } + + } else { + if 
(pp->autoneg) + return -EINVAL; + + pdata->tx_pause = pp->tx_pause; + pdata->rx_pause = pp->rx_pause; + + pdata->mac_ops->flowctl_tx(pdata, pdata->tx_pause); + pdata->mac_ops->flowctl_rx(pdata, pdata->rx_pause); + } + + return 0; +} + static const struct ethtool_ops xgene_ethtool_ops = { .get_drvinfo = xgene_get_drvinfo, .get_link = ethtool_op_get_link, @@ -171,6 +239,8 @@ static void xgene_get_ethtool_stats(struct net_device *ndev, .get_ethtool_stats = xgene_get_ethtool_stats, .get_link_ksettings = xgene_get_link_ksettings, .set_link_ksettings = xgene_set_link_ksettings, + .get_pauseparam = xgene_get_pauseparam, + .set_pauseparam = xgene_set_pauseparam }; void xgene_enet_set_ethtool_ops(struct net_device *ndev) -- 1.9.1
Re: [PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
On Thu, 2016-12-01 at 15:47 -0800, Mahesh Bandewar (महेश बंडेवार) wrote: > [...] > >> @@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, > >> struct sk_buff *skb) > >> > >> probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES); > >> if (probes < 0) { > >> + memset(_dev_hw_addr, 0, dev->addr_len); > >> if (!(neigh->nud_state & NUD_VALID)) > >> pr_debug("trying to ucast probe in NUD_INVALID\n"); > >> neigh_ha_snapshot(dst_ha, neigh, dev); > >> - dst_hw = dst_ha; > >> + if (memcmp(_ha, _dev_hw_addr, dev->addr_len) != 0) > >> + dst_hw = dst_ha; > >> } else { > >> probes -= NEIGH_VAR(neigh->parms, APP_PROBES); > >> if (probes < 0) { > > > > Why is is an IPv4 specific issue ? > I think the issue is that neigh_ha_snapshot() gets neigh->ha > unconditionally even if the neigh state is NUD_INVALID. > > > What about IPv6 ? > Well it's not ARP. The ndisc_solicit() calls ndisc_send_ns() with > neigh parameter for unicast probe while call with NULL for the > broadcast probe case. However it does not use this parameter in > unicast case and probably relies on the route-entry. Hence it is not > subjected to the same issue. Well, it looks like the issue is in neighbour code. Fact that IPv6 might not be impacted is not the point. > > > > > > I would try something in neighbour code, maybe : > > > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > > index > > 782dd866366554e53dda3e6c69c807ec90bd0e08..fdfb177eecb6a9b1479eedde457cb1f652d32c68 > > 100644 > > --- a/net/core/neighbour.c > > +++ b/net/core/neighbour.c > > @@ -916,7 +916,10 @@ static void neigh_timer_handler(unsigned long arg) > > neigh_dbg(2, "neigh %p is probed\n", neigh); > > neigh->nud_state = NUD_PROBE; > > neigh->updated = jiffies; > > - atomic_set(>probes, 0); > > + atomic_set(>probes, > > + (neigh->output == neigh_blackhole) ? 
> > + NEIGH_VAR(neigh->parms, > > UCAST_PROBES) : > > + 0); > This would work if we change the above line (in arp_solicit() code) > from 'if (probes < 0)' to 'if (probes <= 0)'. Then code at line 973 is wrong ? atomic_set(&neigh->probes, NEIGH_VAR(neigh->parms, UCAST_PROBES)); That would be a more serious issue :)
Re: [PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
[...] >> @@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, >> struct sk_buff *skb) >> >> probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES); >> if (probes < 0) { >> + memset(_dev_hw_addr, 0, dev->addr_len); >> if (!(neigh->nud_state & NUD_VALID)) >> pr_debug("trying to ucast probe in NUD_INVALID\n"); >> neigh_ha_snapshot(dst_ha, neigh, dev); >> - dst_hw = dst_ha; >> + if (memcmp(_ha, _dev_hw_addr, dev->addr_len) != 0) >> + dst_hw = dst_ha; >> } else { >> probes -= NEIGH_VAR(neigh->parms, APP_PROBES); >> if (probes < 0) { > > Why is is an IPv4 specific issue ? I think the issue is that neigh_ha_snapshot() gets neigh->ha unconditionally even if the neigh state is NUD_INVALID. > What about IPv6 ? Well it's not ARP. The ndisc_solicit() calls ndisc_send_ns() with neigh parameter for unicast probe while call with NULL for the broadcast probe case. However it does not use this parameter in unicast case and probably relies on the route-entry. Hence it is not subjected to the same issue. > > > I would try something in neighbour code, maybe : > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index > 782dd866366554e53dda3e6c69c807ec90bd0e08..fdfb177eecb6a9b1479eedde457cb1f652d32c68 > 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c > @@ -916,7 +916,10 @@ static void neigh_timer_handler(unsigned long arg) > neigh_dbg(2, "neigh %p is probed\n", neigh); > neigh->nud_state = NUD_PROBE; > neigh->updated = jiffies; > - atomic_set(>probes, 0); > + atomic_set(>probes, > + (neigh->output == neigh_blackhole) ? > + NEIGH_VAR(neigh->parms, UCAST_PROBES) > : > + 0); This would work if we change the above line (in arp_solicit() code) from 'if (probes < 0)' to 'if (probes <= 0)'. > notify = 1; > next = now + NEIGH_VAR(neigh->parms, RETRANS_TIME); > } > > Thanks. > >
Re: Initial thoughts on TXDP
On 12/01/2016 02:12 PM, Tom Herbert wrote:
> We have to consider both request size and response size in RPC. Presumably, something like a memcache server is mostly serving data as opposed to reading it, so we are looking at receiving much smaller packets than are being sent. Requests are going to be quite small, say 100 bytes, and unless we are doing a significant amount of pipelining on connections GRO would rarely kick in. Response size will have a lot of variability, anything from a few kilobytes up to a megabyte. I'm sorry I can't be more specific; this is an artifact of datacenters that have 100s of different applications and communication patterns. Maybe a 100B request size with 8K, 16K, 64K response sizes might be good for a test.

No worries on the specific sizes, it is a classic "How long is a piece of string?" sort of question. Not surprisingly, as the size of what is being received grows, so too does the delta between GRO on and off.

stack@np-cp1-c0-m1-mgmt:~/rjones2$ HDR="-P 1"; for r in 8K 16K 64K 1M; do for gro in on off; do sudo ethtool -K hed0 gro ${gro}; brand="$r gro $gro"; ./netperf -B "$brand" -c -H np-cp1-c1-m3-mgmt -t TCP_RR $HDR -- -P 12867 -r 128,${r} -o result_brand,throughput,local_sd; HDR="-P 0"; done; done

MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Result Tag,Throughput,Local Service Demand
"8K gro on",9899.84,35.947
"8K gro off",7299.54,61.097
"16K gro on",8119.38,58.367
"16K gro off",5176.87,95.317
"64K gro on",4429.57,110.629
"64K gro off",2128.58,289.913
"1M gro on",887.85,918.447
"1M gro off",335.97,3427.587

So that gives a feel for by how much this alternative mechanism would have to reduce path-length to maintain the CPU overhead, were the mechanism to preclude GRO.

rick
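Reading the numbers back, the claim that the delta grows with response size holds: the GRO on/off throughput ratio increases monotonically from roughly 1.36x at 8K to roughly 2.64x at 1M. A small check with the values copied from the table above:

```c
#include <assert.h>

/* Throughput pairs (GRO on, GRO off) from the netperf TCP_RR runs
 * above, for 8K, 16K, 64K and 1M response sizes. */
static const double gro_on[]  = { 9899.84, 8119.38, 4429.57, 887.85 };
static const double gro_off[] = { 7299.54, 5176.87, 2128.58, 335.97 };

/* Returns 1 if the on/off ratio strictly increases with response size. */
static int gro_ratio_monotonic(void)
{
	double prev = 0.0;

	for (int i = 0; i < 4; i++) {
		double r = gro_on[i] / gro_off[i];

		if (r <= prev)
			return 0;
		prev = r;
	}
	return 1;
}
```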
linux-next: manual merge of the wireless-drivers-next tree with the net-next tree
Hi all,

Today's linux-next merge of the wireless-drivers-next tree got a conflict in:

  drivers/net/wireless/ath/ath10k/mac.c

between commit:

  f3fe4e93dd63 ("mac80211: add a HW flag for supporting HW TX fragmentation")

from the net-next tree and commit:

  ff32eeb86aa1 ("ath10k: advertize hardware packet loss mechanism")

from the wireless-drivers-next tree.

I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts.

--
Cheers,
Stephen Rothwell

diff --cc drivers/net/wireless/ath/ath10k/mac.c
index 717b2fad9a8a,db6ddf974d1d..
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@@ -8005,7 -7993,7 +7993,8 @@@ int ath10k_mac_register(struct ath10k *
  	ieee80211_hw_set(ar->hw, WANT_MONITOR_VIF);
  	ieee80211_hw_set(ar->hw, CHANCTX_STA_CSA);
  	ieee80211_hw_set(ar->hw, QUEUE_CONTROL);
 +	ieee80211_hw_set(ar->hw, SUPPORTS_TX_FRAG);
+ 	ieee80211_hw_set(ar->hw, REPORTS_LOW_ACK);

  	if (!test_bit(ATH10K_FLAG_RAW_MODE, &ar->dev_flags))
  		ieee80211_hw_set(ar->hw, SW_CRYPTO_CONTROL);
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 2:47 PM, Hannes Frederic Sowa wrote:
> Side note:
>
> On 01.12.2016 20:51, Tom Herbert wrote:
>>> > E.g. "mini-skb": Even if we assume that this provides a speedup
>>> > (where does that come from? should make no difference if a 32 or
>>> > 320 byte buffer gets allocated).
>>> >
>> It's the zero'ing of three cache lines. I believe we talked about that
>> at netdev.
>
> Jesper and me played with that again very recently:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
>
> In micro-benchmarks we saw a pretty good speed up not using the rep
> stosb generated by the gcc builtin but plain movq's. Probably the cost
> model for __builtin_memset in gcc is wrong?
>
> When Jesper is free we wanted to benchmark this and maybe come up with
> an arch specific way of cleaning if it turns out to really improve
> throughput.
>
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.
>
One nice direction of XDP is that it forces drivers to defer allocating (and hence zero'ing) skbs. In the receive path I think we can exploit this property deeper into the stack. The only time we _really_ need to allocate an skbuff is when we need to put the packet onto a queue. All the other use cases are really just to pass a structure containing a packet from function to function. For that purpose we should be able to just pass a much smaller structure in a stack argument and only allocate an skbuff when we need to enqueue. In cases where we don't ever queue a packet we might never need to allocate any skbuff -- this includes pure acks and packets that end up being dropped. But even more than that, if a received packet generates a TX packet (like a SYN causes a SYN-ACK) then we might even be able to just recycle the received packet and avoid needing any skbuff allocation on transmit (XDP_TX already does this in a limited context) -- this could be a win for handling SYN attacks, for instance.
Also, since we don't queue on the socket buffer for UDP it's conceivable we could avoid skbuffs in an expedited UDP TX path. Currently, nearly the whole stack depends on packets always being passed in skbuffs, however __skb_flow_dissect is an interesting exception as it can handle packets passed in either an skbuff or by just a void *-- so we know that this "dual mode" is at least possible. Trying to retrain the whole stack to be able to handle both skbuffs and raw pages is probably untenable at this point, but selectively augmenting some critical performance functions for dual mode (ip_rcv, tcp_rcv, udp_rcv functions for instance) might work. Thanks, Tom > Bye, > Hannes >
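The "dual mode" point about __skb_flow_dissect can be sketched as a helper that accepts either an skb-like wrapper or a raw buffer; the struct and function below are illustrative stand-ins, not kernel APIs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for struct sk_buff: just a buffer plus length. */
struct fake_skb {
	const uint8_t *data;
	size_t len;
};

/* Dual-mode in the spirit of __skb_flow_dissect(): callers pass either
 * an skb (with data == NULL) or a raw buffer (with skb == NULL), and
 * the same parsing code runs in both cases. */
static int first_byte(const struct fake_skb *skb,
		      const uint8_t *data, size_t len)
{
	if (!data) {			/* skb mode */
		data = skb->data;
		len = skb->len;
	}
	return len ? data[0] : -1;
}

/* Exercise the skb-mode path. */
static int demo_skb_mode(void)
{
	static const uint8_t pkt[] = { 0x45, 0x00 };
	struct fake_skb skb = { pkt, sizeof(pkt) };

	return first_byte(&skb, NULL, 0);
}
```

Both call styles funnel into one body, which is what makes selectively retrofitting rcv functions for dual mode plausible without retraining the whole stack.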
[PATCH 6/7] net: ethernet: ti: cpsw: add support for descs_pool_size dt property
The CPSW CPDMA can process buffer descriptors placed in internal CPPI RAM as well as in DDR. This patch adds support in CPSW and CPDMA for the descs_pool_size DT property, which defines the total number of CPDMA CPPI descriptors to be used for both ingress/egress packet processing:
- the memory size required for the CPDMA descriptor pool is calculated based on the number of descriptors specified by the user in descs_pool_size and the CPDMA descriptor size;
- the CPDMA descriptor pool is allocated in DDR if the pool memory size > internal CPPI RAM, otherwise internal CPPI RAM is used;
- if descs_pool_size is not specified in DT, the default value 256 will be used, which allows placing the CPDMA descriptor pool into the internal CPPI RAM (current default behaviour);
- CPDMA will ignore descs_pool_size if descs_pool_size = 0, for backward compatibility with davinci_emac.

Signed-off-by: Grygorii Strashko
---
 drivers/net/ethernet/ti/cpsw.c          |  5 +
 drivers/net/ethernet/ti/cpsw.h          |  1 +
 drivers/net/ethernet/ti/davinci_cpdma.c | 12
 drivers/net/ethernet/ti/davinci_cpdma.h |  1 +
 4 files changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index dd5d830..a98c6260 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -145,6 +145,7 @@ do {						\
 		cpsw->data.active_slave)
 #define IRQ_NUM			2
 #define CPSW_MAX_QUEUES		8
+#define CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT 256

 static int debug_level;
 module_param(debug_level, int, 0);
@@ -2557,6 +2558,9 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 	if (of_property_read_bool(node, "dual_emac"))
 		data->dual_emac = 1;

+	data->descs_pool_size = CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT;
+	of_property_read_u32(node, "descs_pool_size", &data->descs_pool_size);
+
 	/*
 	 * Populate all the child nodes here...
 	 */
@@ -2967,6 +2971,7 @@ static int cpsw_probe(struct platform_device *pdev)
 	dma_params.has_ext_regs		= true;
 	dma_params.desc_hw_addr		= dma_params.desc_mem_phys;
 	dma_params.bus_freq_mhz		= cpsw->bus_freq_mhz;
+	dma_params.descs_pool_size	= cpsw->data.descs_pool_size;

 	cpsw->dma = cpdma_ctlr_create(&dma_params);
 	if (!cpsw->dma) {
diff --git a/drivers/net/ethernet/ti/cpsw.h b/drivers/net/ethernet/ti/cpsw.h
index 16b54c6..8835d79 100644
--- a/drivers/net/ethernet/ti/cpsw.h
+++ b/drivers/net/ethernet/ti/cpsw.h
@@ -38,6 +38,7 @@ struct cpsw_platform_data {
 	u32	mac_control;	/* Mac control register */
 	u16	default_vlan;	/* Def VLAN for ALE lookup in VLAN aware mode*/
 	bool	dual_emac;	/* Enable Dual EMAC mode */
+	u32	descs_pool_size;	/* Number of Rx/Tx Descriptors */
 };

 void cpsw_phy_sel(struct device *dev, phy_interface_t phy_mode, int slave);
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index ba892bb..f45bb8a 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -219,6 +219,18 @@ int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
 				cpdma_params->desc_align);
 	pool->num_desc	= pool->mem_size / pool->desc_size;

+	if (cpdma_params->descs_pool_size) {
+		/* recalculate memory size required for the cpdma descriptor
+		 * pool based on the number of descriptors specified by the
+		 * user and, if memory size > CPPI internal RAM size
+		 * (desc_mem_size), switch to using DDR
+		 */
+		pool->num_desc = cpdma_params->descs_pool_size;
+		pool->mem_size = pool->desc_size * pool->num_desc;
+		if (pool->mem_size > cpdma_params->desc_mem_size)
+			cpdma_params->desc_mem_phys = 0;
+	}
+
 	pool->gen_pool = devm_gen_pool_create(ctlr->dev, ilog2(pool->desc_size),
 					      -1, "cpdma");
 	if (IS_ERR(pool->gen_pool)) {
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.h b/drivers/net/ethernet/ti/davinci_cpdma.h
index 4a167db..cb45f8f 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.h
+++ b/drivers/net/ethernet/ti/davinci_cpdma.h
@@ -37,6 +37,7 @@ struct cpdma_params {
 	int		desc_mem_size;
 	int		desc_align;
 	u32		bus_freq_mhz;
+	u32		descs_pool_size;

 	/*
 	 * Some instances of embedded cpdma controllers have extra control and
--
2.10.1
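The sizing rule this patch adds can be sketched in a few lines. Field names follow the patch; the concrete sizes are illustrative (0x2000 matches the bd_ram_size used in the DT examples elsewhere in this series, 32 bytes is an assumed aligned descriptor size, and the physical address is made up):

```c
#include <assert.h>

/* Sketch of the descs_pool_size logic: compute the pool memory size
 * from the requested descriptor count, and fall back to DDR (signalled
 * here by zeroing desc_mem_phys) when the pool no longer fits in the
 * internal CPPI RAM. */
struct pool_cfg {
	unsigned int desc_size;		/* bytes per aligned descriptor */
	unsigned int desc_mem_size;	/* internal CPPI RAM size */
	unsigned int desc_mem_phys;	/* 0 => descriptors go to DDR */
};

static unsigned int pool_mem_size(struct pool_cfg *cfg, unsigned int num_desc)
{
	unsigned int mem_size = cfg->desc_size * num_desc;

	if (mem_size > cfg->desc_mem_size)
		cfg->desc_mem_phys = 0;	/* too big for CPPI RAM */
	return mem_size;
}

/* Returns 1 if num_desc descriptors still fit in internal CPPI RAM. */
static int fits_in_cppi(unsigned int num_desc)
{
	struct pool_cfg cfg = { 32, 0x2000, 0x4a102000 /* illustrative */ };

	pool_mem_size(&cfg, num_desc);
	return cfg.desc_mem_phys != 0;
}
```

With these assumed sizes the default of 256 descriptors exactly fills the 0x2000-byte CPPI RAM, and anything larger spills to DDR, which is the "current default behaviour" preserved by the patch.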
[PATCH 2/7] net: ethernet: ti: cpdma: fix desc re-queuing
The currently processed cpdma descriptor with the EOQ flag set may contain two values in its Next Descriptor Pointer field:
- a valid pointer: means CPDMA missed the addition of a new desc in the queue;
- null: no more descriptors in the queue.

In the latter case, it's not required to write to the HDP register, but CPDMA currently does. Hence, add an additional check for Next Descriptor Pointer != null in the cpdma_chan_process() function before writing to the HDP register.

Signed-off-by: Grygorii Strashko
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 0924014..379314f 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -1152,7 +1152,7 @@ static int __cpdma_chan_process(struct cpdma_chan *chan)
 	chan->count--;
 	chan->stats.good_dequeue++;

-	if (status & CPDMA_DESC_EOQ) {
+	if ((status & CPDMA_DESC_EOQ) && chan->head) {
 		chan->stats.requeue++;
 		chan_write(chan, hdp, desc_phys(pool, chan->head));
 	}
--
2.10.1
[PATCH 1/7] net: ethernet: ti: cpdma: am437x: allow descs to be placed in ddr
It's observed that cpsw/cpdma is not working properly when CPPI descriptors are placed in DDR instead of internal CPPI RAM on the am437x SoC:
- rx/tx silently stops processing packets;
- or, after boot it works for some time, but gets stuck once network load is increased (ping works, but iperf does not).
(The same issue has not been reproduced on am335x and am57xx.)

It seems that the write to the HDP register is processed faster by the interconnect than the write of the descriptor memory buffer in DDR, which is probably caused by store buffer / write buffer differences, as these functions are implemented differently across devices.

So, to fix this I came up with two changes:

1) all accesses to the channel HDP/CP/RXFREE registers should be done using the sync IO accessors readl()/writel(), because all previous memory writes have to be completed before starting the channel (write to HDP) or completing desc processing;

2) change 1 alone doesn't work on am437x, and an additional read of a desc's field is required right after the new descriptor is filled with data and before the pointer to it is stored in the prev_desc->hw_next field or the HDP register.

Signed-off-by: Grygorii Strashko
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c776e45..0924014 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -167,10 +167,10 @@ static struct cpdma_control_info controls[] = {

 /* various accessors */
 #define dma_reg_read(ctlr, ofs)		__raw_readl((ctlr)->dmaregs + (ofs))
-#define chan_read(chan, fld)		__raw_readl((chan)->fld)
+#define chan_read(chan, fld)		readl((chan)->fld)
 #define desc_read(desc, fld)		__raw_readl(&(desc)->fld)
 #define dma_reg_write(ctlr, ofs, v)	__raw_writel(v, (ctlr)->dmaregs + (ofs))
-#define chan_write(chan, fld, v)	__raw_writel(v, (chan)->fld)
+#define chan_write(chan, fld, v)	writel(v, (chan)->fld)
 #define desc_write(desc, fld, v)	__raw_writel((u32)(v), &(desc)->fld)

 #define cpdma_desc_to_port(chan, mode, directed)	\
@@ -1064,6 +1064,7 @@ int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
 	desc_write(desc, sw_token,	token);
 	desc_write(desc, sw_buffer,	buffer);
 	desc_write(desc, sw_len,	len);
+	desc_read(desc, sw_len);

 	__cpdma_chan_submit(chan, desc);
--
2.10.1
[PATCH 0/7] net: ethernet: ti: cpsw: support placing CPDMA descriptors into DDR
This series adds support for placing CPDMA descriptors into DDR by introducing a new DT property, "descs_pool_size", to specify the buffer descriptor pool size. "descs_pool_size" defines the total number of CPDMA CPPI descriptors to be used for both ingress/egress packet processing. If not specified, the default value 256 will be used, which allows placing the descriptor pool into the internal CPPI RAM. This significantly reduces the UDP packet drop rate for bandwidths >301 Mbits/sec (am57x).

Before enabling this feature, the am437x SoC has to be fixed, as it's proven not to work when CPDMA descriptors are placed in DDR. So, patch 1 fixes this issue.

Grygorii Strashko (7):
  net: ethernet: ti: cpdma: am437x: allow descs to be placed in ddr
  net: ethernet: ti: cpdma: fix desc re-queuing
  net: ethernet: ti: cpdma: minimize number of parameters in cpdma_desc_pool_create/destroy()
  net: ethernet: ti: cpdma: use devm_ioremap
  Documentation: DT: net: cpsw: allow to specify descriptors pool size
  net: ethernet: ti: cpsw: add support for descs_pool_size dt property
  Documentation: DT: net: cpsw: remove no_bd_ram property

 Documentation/devicetree/bindings/net/cpsw.txt |  8 ++-
 arch/arm/boot/dts/am33xx.dtsi                  |  1 -
 arch/arm/boot/dts/am4372.dtsi                  |  1 -
 arch/arm/boot/dts/dm814x.dtsi                  |  1 -
 arch/arm/boot/dts/dra7.dtsi                    |  1 -
 drivers/net/ethernet/ti/cpsw.c                 |  5 ++
 drivers/net/ethernet/ti/cpsw.h                 |  1 +
 drivers/net/ethernet/ti/davinci_cpdma.c        | 90 +++---
 drivers/net/ethernet/ti/davinci_cpdma.h        |  1 +
 9 files changed, 63 insertions(+), 46 deletions(-)
--
2.10.1
[PATCH 3/7] net: ethernet: ti: cpdma: minimize number of parameters in cpdma_desc_pool_create/destroy()
Update cpdma_desc_pool_create/destroy() to accept only one parameter, struct cpdma_ctlr *, as this structure contains all required information for pool creation/destruction.

Signed-off-by: Grygorii Strashko
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 66 -
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 379314f..db0a7fd 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -181,8 +181,10 @@ static struct cpdma_control_info controls[] = {
 		 (directed << CPDMA_TO_PORT_SHIFT));	\
 	} while (0)

-static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
+static void cpdma_desc_pool_destroy(struct cpdma_ctlr *ctlr)
 {
+	struct cpdma_desc_pool *pool = ctlr->pool;
+
 	if (!pool)
 		return;

@@ -191,7 +193,7 @@ static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
 		  gen_pool_size(pool->gen_pool),
 		  gen_pool_avail(pool->gen_pool));
 	if (pool->cpumap)
-		dma_free_coherent(pool->dev, pool->mem_size, pool->cpumap,
+		dma_free_coherent(ctlr->dev, pool->mem_size, pool->cpumap,
 				  pool->phys);
 	else
 		iounmap(pool->iomap);
@@ -203,57 +205,60 @@ static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
  * devices (e.g. cpsw switches) use plain old memory.  Descriptor pools
  * abstract out these details
  */
-static struct cpdma_desc_pool *
-cpdma_desc_pool_create(struct device *dev, u32 phys, dma_addr_t hw_addr,
-		       int size, int align)
+int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
 {
+	struct cpdma_params *cpdma_params = &ctlr->params;
 	struct cpdma_desc_pool *pool;
-	int ret;
+	int ret = 0;

-	pool = devm_kzalloc(dev, sizeof(*pool), GFP_KERNEL);
+	pool = devm_kzalloc(ctlr->dev, sizeof(*pool), GFP_KERNEL);
 	if (!pool)
 		goto gen_pool_create_fail;
+	ctlr->pool = pool;

-	pool->dev	= dev;
-	pool->mem_size	= size;
-	pool->desc_size	= ALIGN(sizeof(struct cpdma_desc), align);
-	pool->num_desc	= size / pool->desc_size;
+	pool->mem_size	= cpdma_params->desc_mem_size;
+	pool->desc_size	= ALIGN(sizeof(struct cpdma_desc),
+				cpdma_params->desc_align);
+	pool->num_desc	= pool->mem_size / pool->desc_size;

-	pool->gen_pool = devm_gen_pool_create(dev, ilog2(pool->desc_size), -1,
-					      "cpdma");
+	pool->gen_pool = devm_gen_pool_create(ctlr->dev, ilog2(pool->desc_size),
+					      -1, "cpdma");
 	if (IS_ERR(pool->gen_pool)) {
-		dev_err(dev, "pool create failed %ld\n",
-			PTR_ERR(pool->gen_pool));
+		ret = PTR_ERR(pool->gen_pool);
+		dev_err(ctlr->dev, "pool create failed %d\n", ret);
 		goto gen_pool_create_fail;
 	}

-	if (phys) {
-		pool->phys  = phys;
-		pool->iomap = ioremap(phys, size); /* should be memremap? */
-		pool->hw_addr = hw_addr;
+	if (cpdma_params->desc_mem_phys) {
+		pool->phys  = cpdma_params->desc_mem_phys;
+		pool->iomap = ioremap(pool->phys, pool->mem_size);
+		pool->hw_addr = cpdma_params->desc_hw_addr;
 	} else {
-		pool->cpumap = dma_alloc_coherent(dev, size, &pool->hw_addr,
-						  GFP_KERNEL);
+		pool->cpumap = dma_alloc_coherent(ctlr->dev, pool->mem_size,
+						  &pool->hw_addr, GFP_KERNEL);
 		pool->iomap = (void __iomem __force *)pool->cpumap;
 		pool->phys = pool->hw_addr; /* assumes no IOMMU, don't use this value */
 	}

-	if (!pool->iomap)
+	if (!pool->iomap) {
+		ret = -ENOMEM;
 		goto gen_pool_create_fail;
+	}

 	ret = gen_pool_add_virt(pool->gen_pool, (unsigned long)pool->iomap,
 				pool->phys, pool->mem_size, -1);
 	if (ret < 0) {
-		dev_err(dev, "pool add failed %d\n", ret);
+		dev_err(ctlr->dev, "pool add failed %d\n", ret);
 		goto gen_pool_add_virt_fail;
 	}

-	return pool;
+	return 0;

 gen_pool_add_virt_fail:
-	cpdma_desc_pool_destroy(pool);
+	cpdma_desc_pool_destroy(ctlr);
 gen_pool_create_fail:
-	return NULL;
+	ctlr->pool = NULL;
+	return ret;
 }

 static inline dma_addr_t desc_phys(struct cpdma_desc_pool *pool,
@@ -502,12 +507,7 @@ struct cpdma_ctlr *cpdma_ctlr_create(struct cpdma_params *params)
 	ctlr->chan_num = 0;
[PATCH 4/7] net: ethernet: ti: cpdma: use devm_ioremap
Use devm_ioremap() and simplify the code.

Signed-off-by: Grygorii Strashko
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index db0a7fd..ba892bb 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -195,8 +195,6 @@ static void cpdma_desc_pool_destroy(struct cpdma_ctlr *ctlr)
 	if (pool->cpumap)
 		dma_free_coherent(ctlr->dev, pool->mem_size, pool->cpumap,
 				  pool->phys);
-	else
-		iounmap(pool->iomap);
 }

 /*
@@ -231,7 +229,8 @@ int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)

 	if (cpdma_params->desc_mem_phys) {
 		pool->phys  = cpdma_params->desc_mem_phys;
-		pool->iomap = ioremap(pool->phys, pool->mem_size);
+		pool->iomap = devm_ioremap(ctlr->dev, pool->phys,
+					   pool->mem_size);
 		pool->hw_addr = cpdma_params->desc_hw_addr;
 	} else {
 		pool->cpumap = dma_alloc_coherent(ctlr->dev, pool->mem_size,
--
2.10.1
[PATCH 7/7] Documentation: DT: net: cpsw: remove no_bd_ram property
Although the no_bd_ram property is described in the TI CPSW bindings, support for it has never been introduced in the CPSW driver, so there are no real users of it. Hence, remove the no_bd_ram property. Signed-off-by: Grygorii Strashko--- Documentation/devicetree/bindings/net/cpsw.txt | 3 --- arch/arm/boot/dts/am33xx.dtsi | 1 - arch/arm/boot/dts/am4372.dtsi | 1 - arch/arm/boot/dts/dm814x.dtsi | 1 - arch/arm/boot/dts/dra7.dtsi| 1 - 5 files changed, 7 deletions(-) diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt index b99d196..4e8b673 100644 --- a/Documentation/devicetree/bindings/net/cpsw.txt +++ b/Documentation/devicetree/bindings/net/cpsw.txt @@ -25,7 +25,6 @@ Required properties: Optional properties: - ti,hwmods: Must be "cpgmac0" -- no_bd_ram: Must be 0 or 1 - dual_emac: Specifies Switch to act as Dual EMAC - syscon : Phandle to the system control device node, which is the control module device of the am33x @@ -73,7 +72,6 @@ Examples: cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; rx_descs = <64>; mac_control = <0x20>; slaves = <2>; @@ -102,7 +100,6 @@ Examples: cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; rx_descs = <64>; mac_control = <0x20>; slaves = <2>; diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi index 194d884..7af5520 100644 --- a/arch/arm/boot/dts/am33xx.dtsi +++ b/arch/arm/boot/dts/am33xx.dtsi @@ -777,7 +777,6 @@ cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; mac_control = <0x20>; slaves = <2>; active_slave = <0>; diff --git a/arch/arm/boot/dts/am4372.dtsi b/arch/arm/boot/dts/am4372.dtsi index a275fa9..4f651be 100644 --- a/arch/arm/boot/dts/am4372.dtsi +++ b/arch/arm/boot/dts/am4372.dtsi @@ -668,7 +668,6 @@ cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; mac_control = <0x20>; slaves = <2>; active_slave = <0>; diff --git
a/arch/arm/boot/dts/dm814x.dtsi b/arch/arm/boot/dts/dm814x.dtsi index ff90a6c..614a4ba 100644 --- a/arch/arm/boot/dts/dm814x.dtsi +++ b/arch/arm/boot/dts/dm814x.dtsi @@ -508,7 +508,6 @@ cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; mac_control = <0x20>; slaves = <2>; active_slave = <0>; diff --git a/arch/arm/boot/dts/dra7.dtsi b/arch/arm/boot/dts/dra7.dtsi index d4fcd68..cf7325d 100644 --- a/arch/arm/boot/dts/dra7.dtsi +++ b/arch/arm/boot/dts/dra7.dtsi @@ -1706,7 +1706,6 @@ cpdma_channels = <8>; ale_entries = <1024>; bd_ram_size = <0x2000>; - no_bd_ram = <0>; mac_control = <0x20>; slaves = <2>; active_slave = <0>; -- 2.10.1
[PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size
Add an optional property, "descs_pool_size", to specify the buffer descriptor pool size. "descs_pool_size" should define the total number of CPDMA CPPI descriptors to be used for both ingress/egress packet processing. If not specified, the default value 256 will be used, which allows placing the descriptor pool into the internal CPPI RAM on most TI SoCs.

Signed-off-by: Grygorii Strashko
---
 Documentation/devicetree/bindings/net/cpsw.txt | 5 +
 1 file changed, 5 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
index 5ad439f..b99d196 100644
--- a/Documentation/devicetree/bindings/net/cpsw.txt
+++ b/Documentation/devicetree/bindings/net/cpsw.txt
@@ -35,6 +35,11 @@ Optional properties:
 		  For example in dra72x-evm, pcf gpio has to be
 		  driven low so that cpsw slave 0 and phy data
 		  lines are connected via mux.
+- descs_pool_size	: total number of CPDMA CPPI descriptors to be used for
+			  both ingress/egress packets processing. if not
+			  specified the default value 256 will be used which
+			  will allow to place descriptors pool into the
+			  internal CPPI RAM.

 Slave Properties:
--
2.10.1
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On 02.12.2016 00:14, Ido Schimmel wrote: [...] >> Basically, if you delete a node right now the kernel might simply do a >> RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers >> or synchronization with the reader side. Thus you might get a callback >> from the notifier for a delete event on the one CPU and you end up >> queueing this fib entry after the delete queue, because the RCU walk >> isn't protected by any means. >> >> Looking closer at this series again, I overlooked the fact that you >> fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all >> orders fetching of fib_seq and thus the RCU dumping after any concurrent >> executing fib table update, also the mutex_lock and unlock provide >> proper acquire and release fences, so the CPU indeed sees the effect of >> a RCU_INIT_POINTER update done on another CPU, because they pair with >> the rtnl_unlock which might happen on the other CPU. > > Yep, Exactly. I had a feeling this is the issue you were referring to, > but then you were the one to suggest the use of RTNL, so I was quite > confused. At that time I actually had in mind that the fib_register would happen under the sequence lock, so I didn't look closely to the memory barrier pairings. I kinda still consider this to be a happy accident. ;) >> My question is if this is a bit of luck and if we should make this >> explicit by putting the registration itself under the protection of the >> sequence counter. I favor the additional protection, e.g. if we some day >> actually we optimize the fib_seq code? Otherwise we might probably >> document this fact. :) > > Well, some listeners don't require a dump, but only registration > (rocker) and in the future we might only need a dump (e.g., port being > moved to a different net namespace). So I'm not sure if bundling both > together is a good idea. 
> > Maybe we can keep register_fib_notifier() as-is and add 'bool register' > to fib_notifier_dump() so that when set, 'nb' is also registered after > RCU walk, but before we check if the dump is consistent (unregistered if > inconsistent)? I really like that. Would you mind adding this? [...] >> Quick follow-up question: How can I quickly find out the hw limitations >> via the kernel api? > > That's a good question. Currently, you can't. However, we already have a > mechanism in place to read device's capabilities from the firmware and > we can (and should) expose some of them to the user. The best API for > that would be devlink, as it can represent the entire device as opposed > to only a port netdev like other tools. > > We're also working on making the pipeline more visible to the user, so > that it would be easier for users to understand and debug their > networks. I believe a colleague of mine (Matty) presented this during > the last netdev conference. Thanks, I will look it up! Bye, Hannes
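The consistency scheme being discussed (snapshot fib_seq, do the RCU walk, then re-check the counter, unregistering and retrying on a mismatch) can be modeled in a few lines. Everything here is a toy stand-in, not the mlxsw or FIB notifier code:

```c
#include <assert.h>
#include <stddef.h>

static unsigned int fib_seq;		/* bumped on every FIB table update */

static void fib_update(void)
{
	fib_seq++;
}

/* Returns 1 if the dump observed no concurrent update (consistent:
 * keep the new listener registered), 0 otherwise (unregister it and
 * retry the dump).  The optional callback simulates an update racing
 * with the walk. */
static int dump_consistent(void (*concurrent)(void))
{
	unsigned int snap = fib_seq;

	/* ... the RCU walk of the FIB tables would happen here ... */
	if (concurrent)
		concurrent();

	return fib_seq == snap;
}
```

The ordering guarantees discussed above (RTNL acquire/release fences around reading fib_seq) are what make the snapshot comparison meaningful across CPUs; the sketch only captures the retry protocol itself.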
[PATCH 3/3] netns: fix net_generic() "id - 1" bloat
The net_generic() function is both a) inline and b) used ~600 times. It has the following code inside:

	...
	ptr = ng->ptr[id - 1];
	...

"id" is never a compile-time constant, so the compiler is forced to subtract 1, and those decrements or LEA [r32 - 1] instructions add up. We also start id'ing from 1 to catch bugs where a pernet subsystem id is not initialized and is 0. This is a quite pointless idea in general (nothing will work, or there will be immediate interference with the first registered subsystem), but it hints at what needs to be done for code size reduction: namely, overlaying the allocation of the pointer array and the fixed part of the structure at the beginning, and using the usual base-0 addressing. Ids are just cookies, their exact values do not matter, so let's start with 3 on x86_64.

Code size savings (oh boy): -4.2 KB

As usual, ignore the initial compiler stupidity part of the table.

add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
function                      old     new   delta
tipc_nametbl_insert_publ     1250    1270     +20
nlmclnt_lookup_host           686     703     +17
nfsd4_encode_fattr           5930    5941     +11
nfs_get_client               1050    1061     +11
register_pernet_operations    333     342      +9
tcf_mirred_init               843     849      +6
tcf_bpf_init                 1143    1149      +6
gss_setup_upcall              990     994      +4
idmap_name_to_id              432     434      +2
ops_init                      274     275      +1
nfsd_inject_forget_client     259     260      +1
nfs4_alloc_client             612     613      +1
tunnel_key_walker             164     163      -1
...
tipc_bcbase_select_primary    392     360     -32
mac80211_hwsim_new_radio     2808    2767     -41
ipip6_tunnel_ioctl           2228    2186     -42
tipc_bcast_rcv                715     672     -43
tipc_link_build_proto_msg    1140    1089     -51
nfsd4_lock                   3851    3796     -55
tipc_mon_rcv                 1012     956     -56
Total: Before=156643951, After=156639743, chg -0.00%

Signed-off-by: Alexey Dobriyan
---
 include/net/netns/generic.h | 16 +---
 net/core/net_namespace.c    | 20
 2 files changed, 21 insertions(+), 15 deletions(-)

--- a/include/net/netns/generic.h
+++ b/include/net/netns/generic.h
@@ -25,12 +25,14 @@
  */

 struct net_generic {
-	struct {
-		unsigned int len;
-		struct rcu_head rcu;
-	} s;
-
-	void *ptr[0];
+	union {
+		struct {
+			unsigned int len;
+			struct rcu_head rcu;
+		} s;
+
+		void *ptr[0];
+	};
 };

 static inline void *net_generic(const struct net *net, unsigned int id)
@@ -40,7 +42,7 @@ static inline void *net_generic(const struct net *net, unsigned int id)

 	rcu_read_lock();
 	ng = rcu_dereference(net->gen);
-	ptr = ng->ptr[id - 1];
+	ptr = ng->ptr[id];
 	rcu_read_unlock();

 	return ptr;
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -39,6 +39,9 @@ EXPORT_SYMBOL(init_net);

 static bool init_net_initialized;

+#define MIN_PERNET_OPS_ID	\
+	((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *))
+
 #define INITIAL_NET_GEN_PTRS	13 /* +1 for len +2 for rcu_head */

 static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
@@ -46,7 +49,7 @@ static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
 static struct net_generic *net_alloc_generic(void)
 {
 	struct net_generic *ng;
-	size_t generic_size = offsetof(struct net_generic, ptr[max_gen_ptrs]);
+	unsigned int generic_size = offsetof(struct net_generic, ptr[max_gen_ptrs]);

 	ng = kzalloc(generic_size, GFP_KERNEL);
 	if (ng)
@@ -60,12 +63,12 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	struct net_generic *ng, *old_ng;

 	BUG_ON(!mutex_is_locked(&net_mutex));
-	BUG_ON(id == 0);
+	BUG_ON(id < MIN_PERNET_OPS_ID);

 	old_ng = rcu_dereference_protected(net->gen,
					   lockdep_is_held(&net_mutex));

-	if (old_ng->s.len >= id) {
-		old_ng->ptr[id - 1] = data;
+	if (old_ng->s.len > id) {
+		old_ng->ptr[id] = data;
 		return 0;
 	}

@@ -84,8 +87,9 @@ static int net_assign_generic(struct net *net,
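The "start with 3 on x86_64" remark follows directly from the union overlay: MIN_PERNET_OPS_ID is the first ptr[] slot lying past the len/rcu header. A compile-time check of that arithmetic, using a two-pointer rcu_head stand-in (the real struct rcu_head is also two pointers) and a [1]-element array in place of the flexible array:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct rcu_head: two pointers, as in the kernel. */
struct rcu_head_stub {
	void *next;
	void (*func)(void *);
};

/* Mirrors the union overlay from the patch: the len/rcu header and the
 * pointer array share storage. */
struct net_generic_model {
	union {
		struct {
			unsigned int len;
			struct rcu_head_stub rcu;
		} s;
		void *ptr[1];	/* stand-in for ptr[0] */
	};
};

/* First id whose slot does not alias the header: round the header size
 * up to a whole number of pointer slots. */
#define MODEL_MIN_PERNET_OPS_ID \
	((sizeof(struct net_generic_model) + sizeof(void *) - 1) / sizeof(void *))

static size_t min_ops_id(void)
{
	return MODEL_MIN_PERNET_OPS_ID;
}

static size_t header_size(void)
{
	return sizeof(((struct net_generic_model *)0)->s);
}
```

On LP64 the header is 24 bytes (4-byte len padded to 8, plus 16 for rcu), so the first free slot is index 3; ILP32 (12-byte header, 4-byte slots) lands on 3 as well, which is why ids are just cookies starting at 3.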
Re: [PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
On Thu, 2016-12-01 at 14:56 -0800, Mahesh Bandewar wrote:
> From: Mahesh Bandewar
>
> If the initial broadcast probe(s) is/are lost, the neigh entry won't have
> a valid address for the neighbour. In a situation like this, the fallback
> should be to send a broadcast probe; however, the code logic continues
> sending ucast probes to 00:00:00:00:00:00. The default value of ucast
> probes is 3, so the system usually recovers after three such probes, but
> if the configured value is larger it takes that many probes (a probe is
> sent every second in the default config) / seconds to recover, making the
> machine not available on the network.
>
> This patch just ensures that the unicast address is not NULL, and
> otherwise falls back to sending a broadcast probe.
>
> Signed-off-by: Mahesh Bandewar
> ---
>  net/ipv4/arp.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
> index 89a8cac4726a..56fb33d5ed31 100644
> --- a/net/ipv4/arp.c
> +++ b/net/ipv4/arp.c
> @@ -330,6 +330,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  {
>  	__be32 saddr = 0;
>  	u8 dst_ha[MAX_ADDR_LEN], *dst_hw = NULL;
> +	u8 null_dev_hw_addr[MAX_ADDR_LEN];
>  	struct net_device *dev = neigh->dev;
>  	__be32 target = *(__be32 *)neigh->primary_key;
>  	int probes = atomic_read(&neigh->probes);
> @@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>
>  	probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES);
>  	if (probes < 0) {
> +		memset(&null_dev_hw_addr, 0, dev->addr_len);
>  		if (!(neigh->nud_state & NUD_VALID))
>  			pr_debug("trying to ucast probe in NUD_INVALID\n");
>  		neigh_ha_snapshot(dst_ha, neigh, dev);
> -		dst_hw = dst_ha;
> +		if (memcmp(&dst_ha, &null_dev_hw_addr, dev->addr_len) != 0)
> +			dst_hw = dst_ha;
>  	} else {
>  		probes -= NEIGH_VAR(neigh->parms, APP_PROBES);
>  		if (probes < 0) {

Why is this an IPv4-specific issue ?

What about IPv6 ?
I would try something in neighbour code, maybe : diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 782dd866366554e53dda3e6c69c807ec90bd0e08..fdfb177eecb6a9b1479eedde457cb1f652d32c68 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -916,7 +916,10 @@ static void neigh_timer_handler(unsigned long arg) neigh_dbg(2, "neigh %p is probed\n", neigh); neigh->nud_state = NUD_PROBE; neigh->updated = jiffies; - atomic_set(&neigh->probes, 0); + atomic_set(&neigh->probes, + (neigh->output == neigh_blackhole) ? + NEIGH_VAR(neigh->parms, UCAST_PROBES) : + 0); notify = 1; next = now + NEIGH_VAR(neigh->parms, RETRANS_TIME); } Thanks.
[PATCH] netlink: 2-clause nla_ok()
nla_ok() consists of 3 clauses: 1) int rem >= (int)sizeof(struct nlattr) 2) u16 nla_len >= sizeof(struct nlattr) 3) u16 nla_len <= int rem The statement is that clause (1) is redundant. What it does is ensure that "rem" is a positive number, so that in clause (3) a positive number is compared to a positive number with no problems. However, "u16" fully fits into "int", and integers do not change value when upcast even to a signed type. Negative integers will be rejected by clause (3) just fine. Small positive integers will be rejected by transitivity of the comparison operator. NOTE: all of the above DOES NOT apply to nlmsg_ok(), where ->nlmsg_len is u32(!), so 3 clauses AND A CAST TO INT are necessary. Obligatory space savings report: -1.6 KB $ ./scripts/bloat-o-meter ../vmlinux-000* ../vmlinux-001* add/remove: 0/0 grow/shrink: 3/63 up/down: 35/-1692 (-1657) function old new delta validate_scan_freqs 142 155 +13 tcf_em_tree_validate 867 879 +12 dcbnl_ieee_del 328 338 +10 netlbl_cipsov4_add_common.isra 218 215 -3 ... ovs_nla_put_actions 888 806 -82 netlbl_cipsov4_add_std 1648 1566 -82 nl80211_parse_sched_scan 2889 2780 -109 ip_tun_from_nlattr 3086 2945 -141 Signed-off-by: Alexey Dobriyan --- include/net/netlink.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- a/include/net/netlink.h +++ b/include/net/netlink.h @@ -698,8 +698,7 @@ static inline int nla_len(const struct nlattr *nla) */ static inline int nla_ok(const struct nlattr *nla, int remaining) { - return remaining >= (int) sizeof(*nla) && - nla->nla_len >= sizeof(*nla) && + return nla->nla_len >= sizeof(*nla) && nla->nla_len <= remaining; }
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On Thu, Dec 01, 2016 at 10:57:52PM +0100, Hannes Frederic Sowa wrote: > On 30.11.2016 19:22, Ido Schimmel wrote: > > On Wed, Nov 30, 2016 at 05:49:56PM +0100, Hannes Frederic Sowa wrote: > >> On 30.11.2016 17:32, Ido Schimmel wrote: > >>> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote: > On 30.11.2016 11:09, Jiri Pirko wrote: > > From: Ido Schimmel> > > > Make sure the device has a complete view of the FIB tables by invoking > > their dump during module init. > > > > Signed-off-by: Ido Schimmel > > Signed-off-by: Jiri Pirko > > --- > > .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 23 > > ++ > > 1 file changed, 23 insertions(+) > > > > diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c > > b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c > > index 14bed1d..d176047 100644 > > --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c > > +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c > > @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct > > notifier_block *nb, > > return NOTIFY_DONE; > > } > > > > +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb) > > +{ > > + struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, > > fib_nb); > > + > > + /* Flush pending FIB notifications and then flush the device's > > +* table before requesting another dump. Do that with RTNL held, > > +* as FIB notification block is already registered. 
> > +*/ > > + mlxsw_core_flush_owq(); > > + rtnl_lock(); > > + mlxsw_sp_router_fib_flush(mlxsw_sp); > > + rtnl_unlock(); > > +} > > + > > int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp) > > { > > + fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush; > > int err; > > > > INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list); > > @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp > > *mlxsw_sp) > > > > mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event; > > register_fib_notifier(&mlxsw_sp->fib_nb); > > Sorry to pick in here again: > > There is a race here. You need to protect the registration of the fib > notifier as well by the sequence counter. Updates here are not ordered > in relation to this code below. > >>> > >>> You mean updates that can be received after you registered the notifier > >>> and until the dump started? I'm aware of that and that's OK. This > >>> listener should be able to handle duplicates. > >> > >> I am not concerned about duplicates, but about ordering deletes and > >> getting an add from the RCU code: you will add the node to hw while it is > >> deleted in the software path. You probably will ignore the delete > >> because nothing is installed in hw and later add the node which was > >> actually deleted but just reordered, which happened on another CPU, no? > > > > Are you referring to reordering in the workqueue? We already covered > > this using an ordered workqueue, which has one context of execution > > system-wide. > > Oops, sorry, I missed that mail. Probably read it on the mobile phone and > it became invisible for me later on. Busy day... ;) Yet another reason not to read emails on your phone ;) > The reordering in the workqueue seems fine to me and also still necessary. Correct. > Basically, if you delete a node right now the kernel might simply do a > RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers > or synchronization with the reader side.
Thus you might get a callback > from the notifier for a delete event on the one CPU and you end up > queueing this fib entry after the delete queue, because the RCU walk > isn't protected by any means. > > Looking closer at this series again, I overlooked the fact that you > fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all > orders fetching of fib_seq and thus the RCU dumping after any concurrent > executing fib table update, also the mutex_lock and unlock provide > proper acquire and release fences, so the CPU indeed sees the effect of > a RCU_INIT_POINTER update done on another CPU, because they pair with > the rtnl_unlock which might happen on the other CPU. Yep, Exactly. I had a feeling this is the issue you were referring to, but then you were the one to suggest the use of RTNL, so I was quite confused. > My question is if this is a bit of luck and if we should make this > explicit by putting the registration itself under the protection of the > sequence counter. I favor the additional protection, e.g. if we some day > actually we optimize the fib_seq code? Otherwise we might probably > document this fact. :) Well,
[PATCH 2/3] netns: add dummy struct inside "struct net_generic"
This is a precursor to fixing the "[id - 1]" bloat inside net_generic(). The name "s" is chosen to complement the name "u" often used for dummy unions. Signed-off-by: Alexey Dobriyan --- include/net/netns/generic.h | 6 -- net/core/net_namespace.c | 8 2 files changed, 8 insertions(+), 6 deletions(-) --- a/include/net/netns/generic.h +++ b/include/net/netns/generic.h @@ -25,8 +25,10 @@ */ struct net_generic { - unsigned int len; - struct rcu_head rcu; + struct { + unsigned int len; + struct rcu_head rcu; + } s; void *ptr[0]; }; --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -50,7 +50,7 @@ static struct net_generic *net_alloc_generic(void) ng = kzalloc(generic_size, GFP_KERNEL); if (ng) - ng->len = max_gen_ptrs; + ng->s.len = max_gen_ptrs; return ng; } @@ -64,7 +64,7 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data) old_ng = rcu_dereference_protected(net->gen, lockdep_is_held(&net_mutex)); - if (old_ng->len >= id) { + if (old_ng->s.len >= id) { old_ng->ptr[id - 1] = data; return 0; } @@ -84,11 +84,11 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data) * the old copy for kfree after a grace period. */ - memcpy(&ng->ptr, &old_ng->ptr, old_ng->len * sizeof(void*)); + memcpy(&ng->ptr, &old_ng->ptr, old_ng->s.len * sizeof(void*)); ng->ptr[id - 1] = data; rcu_assign_pointer(net->gen, ng); - kfree_rcu(old_ng, rcu); + kfree_rcu(old_ng, s.rcu); return 0; }
[PATCH 1/3] netns: publish net_generic correctly
Publishing the net_generic pointer is done with a silly mistake: the new array is published BEFORE setting the freshly acquired pernet subsystem pointer. memcpy rcu_assign_pointer kfree_rcu ng->ptr[id - 1] = data; This bug was introduced with commit dec827d174d7f76c457238800183ca864a639365 ("[NETNS]: The generic per-net pointers.") in the glorious days of chopping the networking stack into containers proper, 8.5 years ago (whee...). How did it not trigger for so long? Well, you need a quite specific set of conditions: *) the race window opens once per pernet subsystem addition (read: modprobe or boot) *) not every pernet subsystem is eligible (need ->id and ->size) *) not every pernet subsystem is vulnerable (need incorrect or absent ordering of register_pernet_subsys() and actually using net_generic()) *) to hide the bug even more, the default is to preallocate 13 pointers, which is actually quite a lot. You need IPv6, netfilter, bridging etc. loaded together to trigger reallocation in the first place. Trimmed-down configs are OK. Signed-off-by: Alexey Dobriyan --- net/core/net_namespace.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -64,9 +64,10 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data) old_ng = rcu_dereference_protected(net->gen, lockdep_is_held(&net_mutex)); - ng = old_ng; - if (old_ng->len >= id) - goto assign; + if (old_ng->len >= id) { + old_ng->ptr[id - 1] = data; + return 0; + } ng = net_alloc_generic(); if (ng == NULL) @@ -84,11 +85,10 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data) */ memcpy(&ng->ptr, &old_ng->ptr, old_ng->len * sizeof(void*)); + ng->ptr[id - 1] = data; rcu_assign_pointer(net->gen, ng); kfree_rcu(old_ng, rcu); -assign: - ng->ptr[id - 1] = data; return 0; }
[PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
From: Mahesh Bandewar If the initial broadcast probe(s) is/are lost, the neigh entry won't have a valid address for the neighbour. In a situation like this, the fall back should be to send a broadcast probe; however, the code logic continues sending ucast probes to 00:00:00:00:00:00. The default value of ucast probes is 3, so the system usually recovers after three such probes, but if the configured value is larger it takes that many probes (a probe is sent every second in the default config) / seconds to recover, making the machine unavailable on the network. This patch just ensures that the unicast address is not NULL, otherwise it falls back to sending a broadcast probe. Signed-off-by: Mahesh Bandewar --- net/ipv4/arp.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 89a8cac4726a..56fb33d5ed31 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -330,6 +330,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) { __be32 saddr = 0; u8 dst_ha[MAX_ADDR_LEN], *dst_hw = NULL; + u8 null_dev_hw_addr[MAX_ADDR_LEN]; struct net_device *dev = neigh->dev; __be32 target = *(__be32 *)neigh->primary_key; int probes = atomic_read(&neigh->probes); @@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES); if (probes < 0) { + memset(&null_dev_hw_addr, 0, dev->addr_len); if (!(neigh->nud_state & NUD_VALID)) if (!(neigh->nud_state & NUD_VALID)) pr_debug("trying to ucast probe in NUD_INVALID\n"); neigh_ha_snapshot(dst_ha, neigh, dev); - dst_hw = dst_ha; + if (memcmp(&dst_ha, &null_dev_hw_addr, dev->addr_len) != 0) + dst_hw = dst_ha; } else { probes -= NEIGH_VAR(neigh->parms, APP_PROBES); if (probes < 0) { -- 2.8.0.rc3.226.g39d4020
Re: Initial thoughts on TXDP
On 01.12.2016 21:13, Sowmini Varadhan wrote: > On (12/01/16 11:05), Tom Herbert wrote: >> >> Polling does not necessarily imply that networking monopolizes the CPU >> except when the CPU is otherwise idle. Presumably the application >> drives the polling when it is ready to receive work. > > I'm not grokking that- "if the cpu is idle, we want to busy-poll > and make it 0% idle"? Keeping the CPU 0% idle has all sorts > of issues, see slide 20 of > http://www.slideshare.net/shemminger/dpdk-performance > >>> and one other critical difference from the hot-potato-forwarding >>> model (the sort of OVS model that DPDK etc might arguably be a fit for) >>> does not apply: in order to figure out the ethernet and IP headers >>> in the response correctly at all times (in the face of things like VRRP, >>> gw changes, gw's mac addr changes etc) the application should really >>> be listening on NETLINK sockets for modifications to the networking >>> state - again points to needing a select() socket set where you can >>> have both the I/O fds and the netlink socket, >>> >> I would think that that sort of management would not be implemented in a >> fast path processing thread for an application. > > sure, but my point was that *XDP and other stack-bypass methods need > to provide a select()able socket: when your use-case is not about just > networking, you have to snoop on changes to the control plane, and update > your data path. In the OVS case (pure networking) the OVS control plane > updates are intrinsic to OVS. For the rest of the request/response world, > we need a select()able socket set to do this elegantly (not really > possible in DPDK, for example) Busypoll on steroids is what Windows does by mapping the user space "doorbell" into a vDSO and letting user space loop on that, maybe with MWAIT/MONITOR. The interesting thing is that you can map other events to this notification event, too. It sounds like a usable idea to me and resembles what we already do with futexes. Bye, Hannes
stmmac: turn coalescing / NAPI off in stmmac
> > @@ -2771,12 +2771,8 @@ static netdev_features_t stmmac_fix_features(struct > > net_device *dev, > > features &= ~NETIF_F_CSUM_MASK; > > > > /* Disable tso if asked by ethtool */ > > - if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) { > > - if (features & NETIF_F_TSO) > > - priv->tso = true; > > - else > > - priv->tso = false; > > - } > > + if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) > > + priv->tso = !!(features & NETIF_F_TSO); > > > > Pavel, this really seems arbitrary. > > Whilst I really appreciate you're looking into this driver a bit because > of some issues you are trying to resolve, I'd like to ask that you not > start bombarding me with nit-pick cleanups here and there and instead > concentrate on the real bug or issue. Well, fixing clean code is easier than fixing strange code... Plus I was hoping to make the mainainers to talk. The driver is listed as supported after all. Anyway... since you asked. I belive I have way to disable NAPI / tx coalescing in the driver. Unfortunately, locking is missing on the rx path, and needs to be extended to _irqsave variant on tx path. So patch currently looks like this (hand edited, can't be applied, got it working few hours ago). Does it look acceptable? I'd prefer this to go after the patch that pulls common code to single place, so that single place needs to be patched. Plus I guess I should add ifdefs, so that more advanced NAPI / tx coalescing code can be reactivated when it is fixed. Trivial fixes can go on top. Does that sound like a plan? Which tree do you want patches against? https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/ ? 
Best regards, Pavel diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 0b706a7..c0016c8 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -1395,9 +1397,10 @@ static void __stmmac_tx_clean(struct stmmac_priv *priv) static void stmmac_tx_clean(struct stmmac_priv *priv) { - spin_lock(&priv->tx_lock); + unsigned long flags; + spin_lock_irqsave(&priv->tx_lock, flags); __stmmac_tx_clean(priv); - spin_unlock(&priv->tx_lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); } static inline void stmmac_enable_dma_irq(struct stmmac_priv *priv) @@ -1441,6 +1444,8 @@ static void stmmac_tx_err(struct stmmac_priv *priv) netif_wake_queue(priv->dev); } +static int stmmac_rx(struct stmmac_priv *priv, int limit); + /** * stmmac_dma_interrupt - DMA ISR * @priv: driver private structure @@ -1452,10 +1457,17 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv) { int status; int rxfifosz = priv->plat->rx_fifo_size; + unsigned long flags; status = priv->hw->dma->dma_interrupt(priv->ioaddr, &priv->xstats); if (likely((status & handle_rx)) || (status & handle_tx)) { + int r; + spin_lock_irqsave(&priv->tx_lock, flags); + r = stmmac_rx(priv, 999); + spin_unlock_irqrestore(&priv->tx_lock, flags); +#if 0 if (likely(napi_schedule_prep(&priv->napi))) { //pr_err("napi: schedule\n"); stmmac_disable_dma_irq(priv); @@ -1463,7 +1475,8 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv) } else pr_err("napi: schedule failed\n"); #endif + stmmac_tx_clean(priv); } if (unlikely(status & tx_hard_error_bump_tc)) { /* Try to bump up the dma threshold on this failure */ @@ -1638,7 +1651,7 @@ static void stmmac_tx_timer(unsigned long data) { struct stmmac_priv *priv = (struct stmmac_priv *)data; - stmmac_tx_clean(priv); + //stmmac_tx_clean(priv); } /** @@ -1990,6 +2003,8 @@ static void stmmac_xmit_common(struct sk_buff *skb, struct net_device *dev, int if (likely(priv->tx_coal_frames >
priv->tx_count_frames)) { mod_timer(&priv->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer)); + priv->hw->desc->set_tx_ic(desc); + priv->xstats.tx_set_ic_bit++; } else { priv->tx_count_frames = 0; priv->hw->desc->set_tx_ic(desc); @@ -2038,8 +2053,9 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev) struct dma_desc *desc, *first, *mss_desc = NULL; u8 proto_hdr_len; int i; + unsigned long flags; - spin_lock(&priv->tx_lock); + spin_lock_irqsave(&priv->tx_lock, flags); /* Compute header lengths */ proto_hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb); @@ -2052,7 +2068,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device
Re: Initial thoughts on TXDP
Side note: On 01.12.2016 20:51, Tom Herbert wrote: >> > E.g. "mini-skb": Even if we assume that this provides a speedup >> > (where does that come from? should make no difference if a 32 or >> > 320 byte buffer gets allocated). >> > > It's the zero'ing of three cache lines. I believe we talked about that > at netdev. Jesper and I played with that again very recently: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590 In micro-benchmarks we saw a pretty good speed-up from not using the rep stosb generated by the gcc builtin but plain movq's. Probably the cost model for __builtin_memset in gcc is wrong? When Jesper is free we wanted to benchmark this and maybe come up with an arch-specific way of clearing, if it turns out to really improve throughput. SIMD instructions seem even faster, but kernel_fpu_begin/end() kills all the benefits. Bye, Hannes
Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
On 01.12.2016 00:39, Erik Nordmark wrote: > Implemented RFC7527 Enhanced DAD. > IPv6 duplicate address detection can fail if there is some temporary > loopback of Ethernet frames. RFC7527 solves this by including a random > nonce in the NS messages used for DAD, and if an NS is received with the > same nonce it is assumed to be a looped back DAD probe and is ignored. > RFC7527 is enabled by default. Can be disabled by setting both of > conf/{all,interface}/enhanced_dad to zero. > > Signed-off-by: Erik Nordmark > Signed-off-by: Bob Gilligan > --- Reviewed-by: Hannes Frederic Sowa Thanks! > @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb) > have_ifp: > if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) { > if (dad) { > + if (nonce != 0 && ifp->dad_nonce == nonce) { > + u8 *np = (u8 *) &nonce; > + /* Matching nonce if looped back */ > + ND_PRINTK(2, notice, > + "%s: IPv6 DAD loopback for > address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n", > + ifp->idev->dev->name, > + &ifp->addr, > + np[0], np[1], np[2], np[3], > + np[4], np[5]); > + goto out; > + } > /* >* We are colliding with another node >* who is doing DAD > I think it could be a "%pM" because it looks like a MAC address, but better leave it like that. :) Bye, Hannes
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote: > (Cc. netdev, we might have an issue with Paolo's UDP accounting and > small socket queues) > > On Wed, 30 Nov 2016 16:35:20 + > Mel Gorman wrote: > > > > I don't quite get why you are setting the socket recv size > > > (with -- -s and -S) to such a small number, size + 256. > > > > > > > Maybe I missed something at the time I wrote that but why would it > > need to be larger? > > Well, to me it is quite obvious that we need some queue to avoid packet > drops. We have two processes, netperf and netserver, that are sending > packets between each other (UDP_STREAM mostly netperf -> netserver). > These PIDs are getting scheduled and migrated between CPUs, and thus > do not get executed equally fast; thus a queue is needed to absorb the > fluctuations. > > The network stack is even partly catching your config "mistake" and > increases the socket queue size, so we can at minimum handle one max frame > (due to the skb "truesize" concept, approx PAGE_SIZE + overhead). > > Hopefully for localhost testing a small queue should not > result in packet drops. Testing... oops, this does result in packet > drops. > > Test command extracted from mmtests, UDP_STREAM size 1024: > > netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \ > -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895 > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) > port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET > Socket Message Elapsed Messages > Size Size Time Okay Errors Throughput > bytes bytes secs # # 10^6bits/sec > > 4608 1024 60.00 50024301 0 6829.98 > 2560 60.00 46133211 6298.72 > > Dropped packets: 50024301-46133211=3891090 > > To get a better drop indication, during this I run a command to get > system-wide network counters from the last second, so below numbers are > per second.
> > $ nstat > /dev/null && sleep 1 && nstat > #kernel > IpInReceives 885162 0.0 > IpInDelivers 885161 0.0 > IpOutRequests 885162 0.0 > UdpInDatagrams 776105 0.0 > UdpInErrors 109056 0.0 > UdpOutDatagrams 885160 0.0 > UdpRcvbufErrors 109056 0.0 > IpExtInOctets 931190476 0.0 > IpExtOutOctets 931189564 0.0 > IpExtInNoECTPkts 885162 0.0 > > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See > UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop > happens kernel side in __udp_queue_rcv_skb[1], because the receiving > process didn't empty its queue fast enough, see [2]. > > Although upstream changes are coming in this area, [2] is replaced with > __udp_enqueue_schedule_skb, which I actually tested with... hmm > > Retesting with kernel 4.7.0-baseline+ ... shows something else. > To Paolo, you might want to look into this. And it could also explain why > I've not seen the mentioned speedup by the mm-change, as I've been testing > this patch on top of net-next (at 93ba550) with Paolo's UDP changes. Thank you for reporting this. It seems that the commit 123b4a633580 ("udp: use it's own memory accounting schema") is too strict while checking the rcvbuf. For very small values of rcvbuf, it allows a single skb to be enqueued, while previously we allowed 2 of them to enter the queue, even if the first one's truesize exceeded rcvbuf, as in your test-case. Can you please try the following patch ?
Thank you, Paolo --- net/ipv4/udp.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index e1d0bf8..2f5dc92 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb) struct sk_buff_head *list = &sk->sk_receive_queue; int rmem, delta, amt, err = -ENOMEM; int size = skb->truesize; + int limit; /* try to avoid the costly atomic add/sub pair when the receive * queue is full; always allow at least a packet */ rmem = atomic_read(&sk->sk_rmem_alloc); - if (rmem && (rmem + size > sk->sk_rcvbuf)) + limit = size + sk->sk_rcvbuf; + if (rmem > limit) goto drop; /* we drop only if the receive buf is full and the receive * queue contains some other skb */ rmem = atomic_add_return(size, &sk->sk_rmem_alloc); - if ((rmem > sk->sk_rcvbuf) && (rmem > size)) + if (rmem > limit) goto uncharge_drop; spin_lock(&list->lock);
Re: [PATCH 1/1] NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
Daniele Palmas writes: > This patch adds support for PID 0x1040 of Telit LE922A. LGTM Acked-by: Bjørn Mork
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 1:47 PM, Rick Jones wrote: > On 12/01/2016 12:18 PM, Tom Herbert wrote: >> >> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones wrote: >>> >>> Just how much per-packet path-length are you thinking will go away under >>> the >>> likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does >>> some >>> non-trivial things to effective overhead (service demand) and so >>> throughput: >>> >> For plain in order TCP packets I believe we should be able to process >> each packet at nearly the same speed as GRO. Most of the protocol >> processing we do between GRO and the stack is the same; the >> differences are that we need to do a connection lookup in the stack >> path (note we now do this in UDP GRO and that hasn't shown up as a >> major hit). We also need to consider enqueue/dequeue on the socket, >> which is a major reason to try for lockless sockets in this instance. > > > So waving hands a bit, and taking the service demand for the GRO-on receive > test in my previous message (860 ns/KB), that would be ~ (1448/1024)*860 or > ~1.216 usec of CPU time per TCP segment, including ACK generation which > unless an explicit ACK-avoidance heuristic a la HP-UX 11/Solaris 2 is put in > place would be for every-other segment. Etc etc. > >> Sure, but try running something that emulates a more realistic workload >> than a TCP stream, like an RR test with relatively small payload and many >> connections. > > > That is a good point, which of course is why the RR tests are there in > netperf :) Don't get me wrong, I *like* seeing path-length reductions. What > would you posit is a relatively small payload? The promotion of IR10 > suggests that perhaps 14KB or so is sufficiently common, so I'll grasp at > that as the length of a piece of string: > We have to consider both request size and response size in RPC. Presumably, something like a memcache server is mostly serving data as opposed to reading it, so we are looking at receiving much smaller packets than are being sent.
Requests are going to be quite small, say 100 bytes, and unless we are doing a significant amount of pipelining on connections GRO would rarely kick in. Response size will have a lot of variability, anything from a few kilobytes up to a megabyte. I'm sorry I can't be more specific; this is an artifact of datacenters that have 100s of different applications and communication patterns. Maybe a 100b request size and 8K, 16K, 64K response sizes might be good for a test. > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_RR -- -P 12867 -r 128,14K > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET > to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr > > 16384 87380 128 14336 10.00 8118.31 1.57 -1.00 46.410 -1.000 > 16384 87380 > stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_RR -- -P 12867 -r 128,14K > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET > to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr > > 16384 87380 128 14336 10.00 5837.35 2.20 -1.00 90.628 -1.000 > 16384 87380 > > So, losing GRO doubled the service demand. I suppose I could see cutting > path-length in half based on the things you listed which would be bypassed? > > I'm sure mileage will vary with different NICs and CPUs. The ones used here > happened to be to hand. > This is also biased because you're using a single connection, but is consistent with data we've seen in the past.
To be clear, I'm not saying GRO is bad; the fact that GRO has such a visible impact in your test means that the GRO path is significantly more efficient. Closing the gap seen in your numbers would be a benefit; it would mean we have improved per-packet processing. Tom > happy benchmarking, > > rick > > Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly > goes to more than doubling, and halving to 7K narrows the delta: > > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_RR -- -P 12867 -r 128,28K > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET > to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote >
Re: [WIP] net+mlx4: auto doorbell
On Thu, 2016-12-01 at 15:20 -0500, David Miller wrote: > From: Eric Dumazet> Date: Thu, 01 Dec 2016 09:04:17 -0800 > > > On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote: > > > >> When qdisc layer or trafgen/af_packet see this indication it knows it > >> should/must flush the queue when it don't have more work left. Perhaps > >> through net_tx_action(), by registering itself and e.g. if qdisc_run() > >> is called and queue is empty then check if queue needs a flush. I would > >> also allow driver to flush and clear this bit. > > > > net_tx_action() is not normally called, unless BQL limit is hit and/or > > some qdiscs with throttling (HTB, TBF, FQ, ...) > > The one thing I wonder about is whether we should "ramp up" into a mode > where the NAPI poll does the doorbells instead of going directly there. > > Maybe I misunderstand your algorithm, but it looks to me like if there > are any active packets in the TX queue at enqueue time you will defer > the doorbell to the interrupt handler. > > Let's say we put 1 packet in, and hit the doorbell. > > Then another packet comes in and we defer the doorbell to the IRQ. > > At this point there are a couple things I'm unclear about. > > For example, if we didn't hit the doorbell, will the chip still take a > peek at the second descriptor? Depending upon how the doorbell works > it might, or it might not. It might depend on the hardware. I can easily check on mlx4, by increasing tx-usecs and tx-frames, and sending 2 packets back to back. > > Either way, wouldn't there be a possible condition where the chip > wouldn't see the second enqueued packet and we'd thus have the wire > idle until the interrupt + NAPI runs and hits the doorbell? > > This is why I think we should "ramp up" the doorbell deferral, in > order to avoid this potential wire idle time situation. > > Maybe the situation I'm worried about is not possible, so please > explain it to me :-) This is absolutely the problem. 
We might need to enable this mode only above a given load. We could have an EWMA of the number of packets that TX completion runs can dequeue. And enable auto doorbell only if we have that many packets in the TX ring (instead of the "1 packet threshold" of the WIP)
Re: [WIP] net+mlx4: auto doorbell
On Thu, 2016-12-01 at 13:32 -0800, Alexander Duyck wrote:
> A few years back when I did something like this on ixgbe I was told by
> you that the issue was that doing something like this would add too
> much latency. I was just wondering what the latency impact is on a
> change like this and if this now isn't that much of an issue?

I believe it is clear that we can not use this trick without an admin
choice. In my experience, mlx4 can deliver way more interrupts than ixgbe.
(No idea why.)

This "auto doorbell" is tied to proper "ethtool -c tx-usecs|tx-frames|
tx-frames-irq" settings, or is simply a choice for hosts dedicated to
content delivery (like video servers) with 40Gbit+ cards.

Under stress, softirqs can be handled by ksoftirqd, and might be delayed
by ~10 ms or even more.

This WIP was minimal, but we certainly need to add belts and suspenders:

1) Timestamps: use ktime_get_ns() to remember a timestamp for the last
   doorbell, and force a doorbell if it gets too old, checked in
   ndo_start_xmit() every time we would like to not send the doorbell.

2) A maximum number of packets not yet doorbelled.

But we can not rely on another high resolution timer, since timers also
require softirqs being processed timely.
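The two safety valves Eric lists can be sketched as follows. This is a hypothetical shape, not driver code; the constants stand in for values that would really come from the `ethtool -c tx-usecs`/`tx-frames` settings he mentions:

```c
#include <stdint.h>

#define DB_MAX_AGE_NS	(20 * 1000ULL)	/* force a doorbell after 20 us */
#define DB_MAX_PKTS	32U		/* ... or after 32 undoorbelled pkts */

struct db_state {
	uint64_t last_db_ns;	/* ktime_get_ns() at the last doorbell */
	unsigned int pending;	/* packets queued since the last doorbell */
};

/* Called from ndo_start_xmit() each time we would like to skip the
 * doorbell: returns nonzero when we must ring it anyway, either because
 * too many packets are pending or the last doorbell is too old.
 */
static int must_ring_doorbell(const struct db_state *db, uint64_t now_ns)
{
	if (db->pending >= DB_MAX_PKTS)
		return 1;
	if (now_ns - db->last_db_ns > DB_MAX_AGE_NS)
		return 1;
	return 0;
}
```

The age check is what keeps the scheme safe when softirqs (and thus TX completions) are delayed by ksoftirqd under stress.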
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On 30.11.2016 19:22, Ido Schimmel wrote:
> On Wed, Nov 30, 2016 at 05:49:56PM +0100, Hannes Frederic Sowa wrote:
>> On 30.11.2016 17:32, Ido Schimmel wrote:
>>> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
On 30.11.2016 11:09, Jiri Pirko wrote:
> From: Ido Schimmel
>
> Make sure the device has a complete view of the FIB tables by invoking
> their dump during module init.
>
> Signed-off-by: Ido Schimmel
> Signed-off-by: Jiri Pirko
> ---
>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 23 ++
>  1 file changed, 23 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> index 14bed1d..d176047 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>  	return NOTIFY_DONE;
>  }
>
> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
> +{
> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
> +
> +	/* Flush pending FIB notifications and then flush the device's
> +	 * table before requesting another dump. Do that with RTNL held,
> +	 * as FIB notification block is already registered.
> +	 */
> +	mlxsw_core_flush_owq();
> +	rtnl_lock();
> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
> +	rtnl_unlock();
> +}
> +
>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>  {
> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
>  	int err;
>
>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>
>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
>  	register_fib_notifier(&mlxsw_sp->fib_nb);

Sorry to pick in here again: There is a race here. You need to protect
the registration of the fib notifier as well with the sequence counter.
Updates here are not ordered in relation to this code below.
>>> You mean updates that can be received after you registered the notifier
>>> and until the dump started? I'm aware of that and that's OK. This
>>> listener should be able to handle duplicates.
>>
>> I am not concerned about duplicates, but about ordering of deletes:
>> getting an add from the RCU code, you will add the node to hw while it is
>> deleted in the software path. You probably will ignore the delete
>> because nothing is installed in hw, and later add the node which was
>> actually deleted but just reordered, which happened on another CPU, no?
>
> Are you referring to reordering in the workqueue? We already covered
> this using an ordered workqueue, which has one context of execution
> system-wide.

Oops, sorry, I missed that mail. Probably read it on the mobile phone and
it became invisible for me later on. Busy day... ;)

The reordering in the workqueue seems fine to me and also still necessary.

Basically, if you delete a node right now, the kernel might simply do a
RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers or
synchronization with the reader side. Thus you might get a callback from
the notifier for a delete event on one CPU, and you end up queueing this
fib entry after the delete in the queue, because the RCU walk isn't
protected by any means.

Looking closer at this series again, I overlooked the fact that you fetch
fib_seq within a rtnl_lock()/rtnl_unlock() pair, which first of all orders
the fetching of fib_seq, and thus the RCU dumping, after any concurrently
executing fib table update. Also, the mutex lock and unlock provide proper
acquire and release fences, so the CPU indeed sees the effect of a
RCU_INIT_POINTER update done on another CPU, because they pair with the
rtnl_unlock() which might happen on the other CPU.

My question is if this is a bit of luck, and if we should make this
explicit by putting the registration itself under the protection of the
sequence counter. I favor the additional protection, e.g.
in case we some day optimize the fib_seq code. Otherwise we should
probably document this fact. :)

>>> I've a follow-up patchset that introduces a new event in the switchdev
>>> notification chain called SWITCHDEV_SYNC, which is sent when port
>>> netdevs are enslaved to / released from a master device (points in time
>>> where kernel<->device can get out of sync). It will invoke
>>> re-propagation of configuration from different parts of the stack
>>> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
>>> in duplicates.
>>
>> Okay, understood. I wonder how we can protect against accidental abort
>> calls, actually. E.g. if I start to inject routes into my routing
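The "register the listener under the protection of the sequence counter" idea discussed in this thread can be modeled in userspace C. All names here are hypothetical stand-ins for the FIB notifier machinery, and the locking is stubbed out; the point is the retry loop (snapshot the counter, dump, re-check, register while still holding the lock so no update can slip in between a consistent dump and the registration):

```c
static unsigned int fib_seq;		/* bumped on every FIB table update */
static int registered;
static int dump_runs;

static void rtnl_lock(void)   { }	/* stubbed for the sketch */
static void rtnl_unlock(void) { }
static unsigned int fib_seq_read(void) { return fib_seq; }

static void fib_dump_to_listener(void)
{
	/* Model a concurrent route update racing with the first dump. */
	if (dump_runs++ == 0)
		fib_seq++;
}

static void register_fib_listener(void)
{
	unsigned int seq;

	for (;;) {
		rtnl_lock();
		seq = fib_seq_read();	/* snapshot before the dump */
		rtnl_unlock();

		fib_dump_to_listener();	/* RCU walk, may race with updates */

		rtnl_lock();
		if (seq == fib_seq_read())
			break;		/* dump was consistent */
		rtnl_unlock();		/* an update raced in: redo the dump */
	}
	/* Still under the lock: nothing can change the tables between the
	 * consistent dump and the registration itself. */
	registered = 1;
	rtnl_unlock();
}
```

This makes the ordering explicit instead of relying on the acquire/release fences that `rtnl_lock()`/`rtnl_unlock()` happen to provide today.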
[PATCH] net: ethernet: ti: davinci_cpdma: sanitize inter-module API
The davinci_cpdma module is a helper library that is used by the actual
device drivers and does nothing by itself, so all its API functions need
to be exported. Four new functions were added recently without an export,
so now we get a build error:

ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!

This exports those symbols. After taking a closer look, I found two more
global functions in this file that are not exported and not used anywhere,
so they can obviously be removed. There is also one function that is only
used internally in this module, and should be 'static'.

Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Fixes: 0fc6432cc78d ("net: ethernet: ti: davinci_cpdma: add weight function for channels")
Signed-off-by: Arnd Bergmann
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 63 +++--
 drivers/net/ethernet/ti/davinci_cpdma.h | 4 ---
 2 files changed, 21 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c776e4575d2d..b9d40f0cdf6c 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -628,6 +628,23 @@ int cpdma_ctlr_destroy(struct cpdma_ctlr *ctlr)
 }
 EXPORT_SYMBOL_GPL(cpdma_ctlr_destroy);

+static int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+	if (chan->state != CPDMA_STATE_ACTIVE) {
+		spin_unlock_irqrestore(&chan->lock, flags);
+		return -EINVAL;
+	}
+
+	dma_reg_write(chan->ctlr, enable ? chan->int_set : chan->int_clear,
+		      chan->mask);
+	spin_unlock_irqrestore(&chan->lock, flags);
+
+	return 0;
+}
+
 int cpdma_ctlr_int_ctrl(struct cpdma_ctlr *ctlr, bool enable)
 {
 	unsigned long flags;
@@ -796,6 +813,7 @@ int cpdma_chan_set_weight(struct cpdma_chan *ch, int weight)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_weight);

 /* cpdma_chan_get_min_rate - get minimum allowed rate for channel
  * Should be called before cpdma_chan_set_rate.
@@ -810,6 +828,7 @@ u32 cpdma_chan_get_min_rate(struct cpdma_ctlr *ctlr)

 	return DIV_ROUND_UP(divident, divisor);
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_min_rate);

 /* cpdma_chan_set_rate - limits bandwidth for transmit channel.
  * The bandwidth * limited channels have to be in order beginning from lowest.
@@ -853,6 +872,7 @@ int cpdma_chan_set_rate(struct cpdma_chan *ch, u32 rate)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_rate);

 u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
 {
@@ -865,6 +885,7 @@ u32 cpdma_chan_get_rate(struct cpdma_chan *ch)

 	return rate;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_rate);

 struct cpdma_chan *cpdma_chan_create(struct cpdma_ctlr *ctlr, int chan_num,
 				     cpdma_handler_fn handler, int rx_type)
@@ -1270,46 +1291,4 @@ int cpdma_chan_stop(struct cpdma_chan *chan)
 }
 EXPORT_SYMBOL_GPL(cpdma_chan_stop);

-int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&chan->lock, flags);
-	if (chan->state != CPDMA_STATE_ACTIVE) {
-		spin_unlock_irqrestore(&chan->lock, flags);
-		return -EINVAL;
-	}
-
-	dma_reg_write(chan->ctlr, enable ? chan->int_set : chan->int_clear,
-		      chan->mask);
-	spin_unlock_irqrestore(&chan->lock, flags);
-
-	return 0;
-}
-
-int cpdma_control_get(struct cpdma_ctlr *ctlr, int control)
-{
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&ctlr->lock, flags);
-	ret = _cpdma_control_get(ctlr, control);
-	spin_unlock_irqrestore(&ctlr->lock, flags);
-
-	return ret;
-}
-
-int cpdma_control_set(struct cpdma_ctlr *ctlr, int control, int value)
-{
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&ctlr->lock, flags);
-	ret = _cpdma_control_set(ctlr, control, value);
-	spin_unlock_irqrestore(&ctlr->lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(cpdma_control_set);
-
 MODULE_LICENSE("GPL");
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.h b/drivers/net/ethernet/ti/davinci_cpdma.h
index 4a167db2abab..36d0a09a3d44 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.h
+++ b/drivers/net/ethernet/ti/davinci_cpdma.h
@@ -87,7 +87,6 @@ int cpdma_chan_process(struct cpdma_chan *chan, int quota);

 int cpdma_ctlr_int_ctrl(struct cpdma_ctlr *ctlr, bool enable);
 void cpdma_ctlr_eoi(struct cpdma_ctlr *ctlr, u32 value);
-int cpdma_chan_int_ctrl(struct
Re: [flamebait] xdp, well meaning but pointless
On Thu, Dec 1, 2016 at 1:27 PM, Hannes Frederic Sowa wrote:
> On 01.12.2016 22:12, Tom Herbert wrote:
>> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa wrote:
>>> Hello,
>>>
>>> this is a good conversation and I simply want to bring my worries
>>> across. I don't have good solutions for the problems XDP tries to
>>> solve, but I fear we could get caught up in maintenance problems in
>>> the long term, given the ideas floating around on how to evolve XDP
>>> currently.
>>>
>>> On 01.12.2016 17:28, Thomas Graf wrote:
>>>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>>>> First of all, this is a rant targeted at XDP and not at eBPF as a
>>>>> whole. XDP manipulates packets at free will and thus all security
>>>>> guarantees are off, as in any user space solution.
>>>>>
>>>>> Secondly, user space provides policy, acl, more controlled memory
>>>>> protection, restartability and better debuggability. If I had multi
>>>>> tenant workloads I would definitely put more complex "business/acl"
>>>>> logic into user space, so I can make use of LSM and other features
>>>>> to especially prevent a network facing service from attacking the
>>>>> tenants. If stuff gets put into the kernel you run user controlled
>>>>> code in the kernel, exposing a much bigger attack vector.
>>>>>
>>>>> What use case do you see in XDP specifically, e.g. for container
>>>>> networking?
>>>>
>>>> DDOS mitigation to protect distributed applications in large clusters.
>>>> Relying on CDNs works to protect API gateways and frontends (as long as
>>>> they don't throw you out of their network) but offers no protection
>>>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>>>> level and allowing the mitigation capability to scale up with the number
>>>> of servers is natural and cheap.
>>>
>>> So far we e.g. always considered L2 attacks a problem of the network
>>> admin to correctly protect the environment. Are you talking about
>>> protecting the L3 data plane?
>>> Are there custom proprietary protocols in place which need custom
>>> protocol parsers that need involvement of the kernel before it could
>>> verify the packet?
>>>
>>> In the past we tried to protect the L3 data plane as well as we can in
>>> Linux, to allow the plain old server admin to set an IP address on an
>>> interface and install whatever software in user space. We try not only
>>> to protect it but also try to achieve fairness by adding a lot of
>>> counters everywhere. Are protections missing right now, or are we
>>> talking about better performance?
>>>
>> The technical plenary at the last IETF in Seoul a couple of weeks ago
>> was exclusively focussed on DDOS in light of the recent attack against
>> Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare
>> presentation by Nick Sullivan
>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>> alluded to some implementation of DDOS mitigation. In particular, on
>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>> numbers he gave were based on iptables+BPF, and that was a whole
>> 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the
>> mic, and that's also when I introduced XDP to the whole IETF :-) ). If
>> that's the best we can do, the Internet is in a world of hurt. DDOS
>> mitigation alone is probably a sufficient motivation to look at XDP. We
>> need something that drops bad packets as quickly as possible when under
>> attack, we need this to be integrated into the stack, we need it to be
>> programmable to deal with the increasing savvy of attackers, and we
>> don't want to be forced to be dependent on HW solutions. This is why
>> we created XDP!
>
> I totally understand that.
> But in my reply to David in this thread I mentioned DNS apex processing
> as being problematic, which is actually referred to in your linked slide
> deck on page 9 ("What do floods look like"), and the problem of parsing
> DNS packets in XDP due to string processing and looping inside eBPF.

I agree that eBPF is not going to be sufficient for everything we'll want
to do. Undoubtedly, we'll continue to see new additions of more helpers
to assist in processing, but at some point we will want to load a kernel
module that handles more complex processing and insert it at the XDP
callout. Nothing in the design of XDP precludes doing that, and I have
already posted the patches to generalize the XDP callout for that. Taking
either of these routes has tradeoffs, but regardless of whether this is
BPF or module code, the principles of XDP and its value in helping solve
some class of problems remain.

Tom

> Not to mention the fact that you might have to deal with fragments in
> the Internet. Some DOS mitigations were already abused to generate
> blackholes for other users. Filtering such stuff is quite complicated.
>
> I argued also
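The contrast the thread keeps circling around can be made concrete. The kind of drop decision that suits XDP well is a chain of fixed-offset header checks; what follows is a userspace C illustration of that style (not an actual eBPF/XDP program, and all names are made up). Parsing *inside* a DNS payload, where names are a variable-count sequence of length-prefixed labels, needs a loop, which is exactly the part that is awkward under the eBPF verifier's bounded-loop rules:

```c
#include <stddef.h>
#include <stdint.h>

enum verdict { PASS, DROP };

/* Drop IPv4/UDP packets aimed at port 53: every access below is at a
 * fixed (or header-length-derived) offset, with an explicit bounds
 * check first -- the shape the verifier is happy with.
 */
static enum verdict classify(const uint8_t *pkt, size_t len)
{
	size_t ihl;

	if (len < 14 + 20 + 8)			/* eth + min IPv4 + UDP */
		return PASS;
	if (pkt[12] != 0x08 || pkt[13] != 0x00)	/* ethertype != IPv4 */
		return PASS;
	ihl = (pkt[14] & 0x0f) * 4;		/* IPv4 header length */
	if (pkt[23] != 17)			/* protocol != UDP */
		return PASS;
	if (len < 14 + ihl + 8)
		return PASS;
	/* UDP destination port, network byte order */
	if (pkt[14 + ihl + 2] == 0 && pkt[14 + ihl + 3] == 53)
		return DROP;
	return PASS;
}
```

Walking the DNS question name that starts 12 bytes into the UDP payload would require iterating labels until a zero length byte, which is where the string-processing complaint above bites.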
Re: Initial thoughts on TXDP
On 12/01/2016 12:18 PM, Tom Herbert wrote:
> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones wrote:
>>>> Just how much per-packet path-length are you thinking will go away
>>>> under the likes of TXDP? It is admittedly "just" netperf, but losing
>>>> TSO/GSO does some non-trivial things to effective overhead (service
>>>> demand) and so throughput:
>>>
>>> For plain in-order TCP packets I believe we should be able to process
>>> each packet at nearly the same speed as GRO. Most of the protocol
>>> processing we do between GRO and the stack is the same; the differences
>>> are that we need to do a connection lookup in the stack path (note we
>>> now do this in UDP GRO and that hasn't shown up as a major hit). We
>>> also need to consider enqueue/dequeue on the socket, which is a major
>>> reason to try for lockless sockets in this instance.
>>
>> So waving hands a bit, and taking the service demand for the GRO-on
>> receive test in my previous message (860 ns/KB), that would be
>> ~ (1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including
>> ACK generation, which unless an explicit ACK-avoidance heuristic a la
>> HP-UX 11/Solaris 2 is put in place would be for every other segment.
>> Etc etc.
>
> Sure, but try running something that emulates a more realistic workload
> than a TCP stream, like an RR test with relatively small payload and
> many connections.

That is a good point, which of course is why the RR tests are there in
netperf :) Don't get me wrong, I *like* seeing path-length reductions.

What would you posit is a relatively small payload? The promotion of IR10
suggests that perhaps 14KB or so is sufficiently common, so I'll grasp at
that as the length of a piece of string:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.
CPU    CPU    S.dem  S.dem
Send   Recv   Size    Size   Time    Rate    local  remote local  remote
bytes  bytes  bytes   bytes  secs.   per sec % S    % U    us/Tr  us/Tr

16384  87380  128     14336  10.00   8118.31 1.57   -1.00  46.410 -1.000
16384  87380

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.  CPU    CPU    S.dem  S.dem
Send   Recv   Size    Size   Time    Rate    local  remote local  remote
bytes  bytes  bytes   bytes  secs.   per sec % S    % U    us/Tr  us/Tr

16384  87380  128     14336  10.00   5837.35 2.20   -1.00  90.628 -1.000
16384  87380

So, losing GRO doubled the service demand. I suppose I could see cutting
path-length in half based on the things you listed which would be
bypassed? I'm sure mileage will vary with different NICs and CPUs. The
ones used here happened to be to hand.

happy benchmarking,

rick

Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly
goes to more than doubling, and halving to 7K narrows the delta:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.  CPU    CPU    S.dem  S.dem
Send   Recv   Size    Size   Time    Rate    local  remote local  remote
bytes  bytes  bytes   bytes  secs.
per sec % S    % U    us/Tr  us/Tr

16384  87380  128     28672  10.00   6732.32 1.79   -1.00  63.819 -1.000
16384  87380

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.  CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate    local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec % S    % U    us/Tr   us/Tr

16384  87380  128     28672  10.00   3780.47 2.32   -1.00  147.280 -1.000
16384  87380

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.  CPU    CPU    S.dem  S.dem
Send   Recv   Size    Size   Time    Rate    local  remote local  remote
bytes  bytes  bytes   bytes  secs.   per sec % S    % U    us/Tr
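Rick's hand-waving arithmetic above ((1448/1024)*860, giving ~1.216 usec per segment) can be checked directly. A tiny helper, hypothetical and only for sanity-checking the numbers quoted in this thread, where service demand is reported in ns per KB (1024 bytes) and a full-size TCP segment carries 1448 bytes of payload:

```c
/* Per-segment CPU cost from a ns-per-KB service demand figure:
 * scale by seg_bytes / 1024.
 */
static double ns_per_segment(double ns_per_kb, double seg_bytes)
{
	return ns_per_kb * seg_bytes / 1024.0;
}
```

With 860 ns/KB and 1448-byte segments this yields roughly 1216 ns, matching the ~1.216 usec figure in the mail.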
Re: DSA vs. SWTICHDEV ?
Hi Andrew,

On 12/01/2016 12:31 PM, Andrew Lunn wrote:
> Hi Murali
>
>> 2. Switch mode, where it implements a simple Ethernet switch. Currently
>>    it doesn't have address learning capability, but in the future it
>>    can.
>
> If it does not have address learning capabilities, does it act like a
> plain old hub? What comes in one port goes out all others?

Thanks for the response! Yes. It is a plain hub; it replicates frames to
both ports. So we need to run a bridge layer for address learning in
software.

> Or can you do the learning in software on the host and program tables,
> which the hardware then uses?

I think not. I see we have a non-Linux implementation that does address
learning in software using a hash table, looking up the MAC for each
packet to see which port it needs to be sent to.

Murali

>> 3. Switch with HSR/PRP offload, where it provides HSR/PRP protocol
>>    support and cut-through switching.
>>
>> So a device needs to function in one of these modes. A regular Ethernet
>> driver that provides two network devices, one per port, and switchdev
>> for each physical port (in switch mode) will look ideal in this case.
>
> Yes, this seems the right model to use.
>
> Andrew

--
Murali Karicheri
Linux Kernel, Keystone
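The software MAC-learning scheme Murali describes (a hash table of learned source MACs, consulted per packet to pick the egress port, flooding on a miss) can be sketched minimally. This is hypothetical illustration code, not from the Keystone driver; a real forwarding database would also need collision handling and entry aging:

```c
#include <string.h>

#define FDB_SIZE	256
#define PORT_FLOOD	-1	/* unknown destination: replicate to all ports */

struct fdb_entry {
	unsigned char mac[6];
	int port;		/* PORT_FLOOD when the slot is empty */
};

static unsigned int fdb_hash(const unsigned char *mac)
{
	return mac[5] % FDB_SIZE;	/* last octet is enough for a sketch */
}

/* Learn: remember which port this source MAC was seen on. */
static void fdb_learn(struct fdb_entry *fdb, const unsigned char *src,
		      int ingress_port)
{
	struct fdb_entry *e = &fdb[fdb_hash(src)];

	memcpy(e->mac, src, 6);
	e->port = ingress_port;
}

/* Forwarding decision: known destination goes out one port, anything
 * else is flooded, exactly the hub behaviour discussed above.
 */
static int fdb_lookup(const struct fdb_entry *fdb, const unsigned char *dst)
{
	const struct fdb_entry *e = &fdb[fdb_hash(dst)];

	if (e->port != PORT_FLOOD && !memcmp(e->mac, dst, 6))
		return e->port;
	return PORT_FLOOD;
}
```

Without the learn step every lookup misses and the device degenerates to the plain-hub replication Andrew asks about.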
Re: [WIP] net+mlx4: auto doorbell
On Mon, Nov 28, 2016 at 10:58 PM, Eric Dumazet wrote:
> On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>
>> Not sure if this has been tried before, but the doorbell avoidance could
>> be done by the driver itself, because it knows a TX completion will come
>> shortly (well... if softirqs are not delayed too much!)
>>
>> The doorbell would be forced only if:
>>
>> ("skb->xmit_more is not set" AND "TX engine is not 'started' yet")
>> OR
>> ("too many [1] packets were put in the TX ring buffer, no point
>> deferring more")
>>
>> Start the pump, but once it is started, let the doorbells be done by
>> TX completion.
>>
>> ndo_start_xmit and the TX completion handler would have to maintain a
>> shared state describing whether packets were ready but the doorbell
>> deferred.
>>
>> Note that TX completion means "at least one packet was drained",
>> otherwise busy polling, constantly calling napi->poll(), would force a
>> doorbell too soon for devices sharing a NAPI for both RX and TX.
>>
>> But then, maybe busy poll would like to force a doorbell...
>>
>> I could try these ideas on mlx4 shortly.
>>
>> [1] The limit could be derived from active "ethtool -c" params,
>> e.g. tx-frames.
>
> I have a WIP that increases the pktgen rate by 75% on mlx4 when bulking
> is not used.
>
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:26  eth0   0.00  5822800.00  0.00  597064.41  0.00  0.00  1.00
> 22:43:27  eth0  24.00  5788237.00  2.09  593520.26  0.00  0.00  0.00
> 22:43:28  eth0  12.00      581.00  1.43  596551.47  0.00  0.00  1.00
> 22:43:29  eth0  22.00  5841516.00  1.61  598982.87  0.00  0.00  0.00
> 22:43:30  eth0   4.00  4389137.00  0.71  450058.08  0.00  0.00  1.00
> 22:43:31  eth0   4.00  5871008.00  0.72  602007.79  0.00  0.00  0.00
> 22:43:32  eth0  12.00  5891809.00  1.43  604142.60  0.00  0.00  1.00
> 22:43:33  eth0  10.00  5901904.00  1.12  605175.70  0.00  0.00  0.00
> 22:43:34  eth0   5.00  5907982.00  0.69  605798.99  0.00  0.00  1.00
> 22:43:35  eth0   2.00  5847086.00  0.12  599554.71  0.00  0.00  0.00
> Average:  eth0   9.50  5707925.60  0.99  585285.69  0.00  0.00  0.50
>
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:47  eth0   9.00  10226424.00  1.02  1048608.05  0.00  0.00  1.00
> 22:43:48  eth0   1.00  10316955.00  0.06  1057890.89  0.00  0.00  0.00
> 22:43:49  eth0   1.00  10310104.00  0.10  1057188.32  0.00  0.00  1.00
> 22:43:50  eth0   0.00  10249423.00  0.00  1050966.23  0.00  0.00  0.00
> 22:43:51  eth0   0.00  10210441.00  0.00  1046969.05  0.00  0.00  1.00
> 22:43:52  eth0   2.00  10198389.00  0.16  1045733.17  0.00  0.00  1.00
> 22:43:53  eth0   8.00  10079257.00  1.43  1033517.83  0.00  0.00  0.00
> 22:43:54  eth0   0.00   7693509.00  0.00       75.16  0.00  0.00  0.00
> 22:43:55  eth0   2.00  10343076.00  0.20  1060569.32  0.00  0.00  1.00
> 22:43:56  eth0   1.00  10224571.00  0.14  1048417.93  0.00  0.00  0.00
> Average:  eth0   2.40   9985214.90  0.31  1023874.60  0.00  0.00  0.50
>
> And about an 11% improvement on a mono-flow UDP_STREAM test.
>
> skb_set_owner_w() is now the most consuming function.

A few years back when I did something like this on ixgbe I was told by
you that the issue was that doing something like this would add too much
latency. I was just wondering what the latency impact is on a change
like this, and if this now isn't that much of an issue?
>
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:50:47  eth0   3.00  1355422.00  0.45  319706.04  0.00  0.00  0.00
> 22:50:48  eth0   2.00  1344270.00  0.42  317035.21  0.00  0.00  1.00
> 22:50:49  eth0   3.00  1350503.00  0.51  318478.34  0.00  0.00  0.00
> 22:50:50  eth0  29.00  1348593.00  2.86  318113.02  0.00  0.00  1.00
> 22:50:51  eth0  14.00  1354855.00  1.83  319508.56  0.00  0.00  0.00
> 22:50:52  eth0   7.00  1357794.00  0.73  320226.89  0.00  0.00  1.00
> 22:50:53  eth0   5.00  1326130.00  0.63  312784.72  0.00  0.00  0.00
> 22:50:54  eth0   2.00   994584.00  0.12  234598.40  0.00  0.00  1.00
> 22:50:55
Re: [PATCH iproute2] ip: update link types to show 6lowpan and ieee802.15.4 monitor
Hello.

On 28.10.2016 11:42, Stefan Schmidt wrote:
> Both types have been missing here and thus ip always showed
> only the numbers.
>
> Based on a suggestion from Alexander Aring.
>
> Signed-off-by: Stefan Schmidt

Did you somehow mangle this patch manually? Looking at the patch in your
git repo, it shows no author name. My patch was sent with git
format-patch and git send-email as usual and shows the right author. Was
there something wrong on my side or yours? Just checking to avoid it in
the future.

http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=8ae2c5382bd9d98a8f7ddcb1faad1a978d773909

regards
Stefan Schmidt
Re: [flamebait] xdp, well meaning but pointless
On 01.12.2016 22:12, Tom Herbert wrote:
> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa wrote:
>> Hello,
>>
>> this is a good conversation and I simply want to bring my worries
>> across. I don't have good solutions for the problems XDP tries to
>> solve, but I fear we could get caught up in maintenance problems in
>> the long term, given the ideas floating around on how to evolve XDP
>> currently.
>>
>> On 01.12.2016 17:28, Thomas Graf wrote:
>>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>>> First of all, this is a rant targeted at XDP and not at eBPF as a
>>>> whole. XDP manipulates packets at free will and thus all security
>>>> guarantees are off, as in any user space solution.
>>>>
>>>> Secondly, user space provides policy, acl, more controlled memory
>>>> protection, restartability and better debuggability. If I had multi
>>>> tenant workloads I would definitely put more complex "business/acl"
>>>> logic into user space, so I can make use of LSM and other features
>>>> to especially prevent a network facing service from attacking the
>>>> tenants. If stuff gets put into the kernel you run user controlled
>>>> code in the kernel, exposing a much bigger attack vector.
>>>>
>>>> What use case do you see in XDP specifically, e.g. for container
>>>> networking?
>>>
>>> DDOS mitigation to protect distributed applications in large clusters.
>>> Relying on CDNs works to protect API gateways and frontends (as long as
>>> they don't throw you out of their network) but offers no protection
>>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>>> level and allowing the mitigation capability to scale up with the number
>>> of servers is natural and cheap.
>>
>> So far we e.g. always considered L2 attacks a problem of the network
>> admin to correctly protect the environment. Are you talking about
>> protecting the L3 data plane? Are there custom proprietary protocols in
>> place which need custom protocol parsers that need involvement of the
>> kernel before it could verify the packet?
>>
>> In the past we tried to protect the L3 data plane as well as we can in
>> Linux, to allow the plain old server admin to set an IP address on an
>> interface and install whatever software in user space. We try not only
>> to protect it but also try to achieve fairness by adding a lot of
>> counters everywhere. Are protections missing right now, or are we
>> talking about better performance?
>>
> The technical plenary at the last IETF in Seoul a couple of weeks ago
> was exclusively focussed on DDOS in light of the recent attack against
> Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare
> presentation by Nick Sullivan
> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
> alluded to some implementation of DDOS mitigation. In particular, on
> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
> numbers he gave were based on iptables+BPF, and that was a whole
> 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the
> mic, and that's also when I introduced XDP to the whole IETF :-) ). If
> that's the best we can do, the Internet is in a world of hurt. DDOS
> mitigation alone is probably a sufficient motivation to look at XDP. We
> need something that drops bad packets as quickly as possible when under
> attack, we need this to be integrated into the stack, we need it to be
> programmable to deal with the increasing savvy of attackers, and we
> don't want to be forced to be dependent on HW solutions. This is why
> we created XDP!

I totally understand that. But in my reply to David in this thread I
mentioned DNS apex processing as being problematic, which is actually
referred to in your linked slide deck on page 9 ("What do floods look
like"), and the problem of parsing DNS packets in XDP due to string
processing and looping inside eBPF.

Not to mention the fact that you might have to deal with fragments in
the Internet.
Some DOS mitigations were already abused to generate blackholes for
other users. Filtering such stuff is quite complicated.

I argued also under the aspect of what Thomas said, that the outside
world of the cluster is already protected by a CDN.

Bye,
Hannes
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On Thu, Dec 01, 2016 at 10:09:19PM +0100, Hannes Frederic Sowa wrote:
> On 01.12.2016 21:54, Ido Schimmel wrote:
> > On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote:
> >> On 01.12.2016 21:04, David Miller wrote:
> >>>
> >>> Hannes and Ido,
> >>>
> >>> It looks like we are very close to having this in mergable shape; can
> >>> you guys work out this final issue and figure out if it really is
> >>> a merge stopper or not?
> >>
> >> Sure, if the fib notification register could be done under protection
> >> of the sequence counter, I don't see any more problems.
> >
> > Did you maybe miss my reply yesterday? Because I was trying to
> > understand what "ordering" you're referring to, but didn't receive a
> > reply from you.
>
> Oh, strange, I am pretty sure I replied to that. Let me resend it. :)

I did get this reply, and then replied myself here:
https://marc.info/?l=linux-netdev&m=148053017425465&w=2
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On 01.12.2016 21:54, Ido Schimmel wrote:
> On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote:
>> On 01.12.2016 21:04, David Miller wrote:
>>>
>>> Hannes and Ido,
>>>
>>> It looks like we are very close to having this in mergable shape; can
>>> you guys work out this final issue and figure out if it really is
>>> a merge stopper or not?
>>
>> Sure, if the fib notification register could be done under protection
>> of the sequence counter, I don't see any more problems.
>
> Did you maybe miss my reply yesterday? Because I was trying to
> understand what "ordering" you're referring to, but didn't receive a
> reply from you.

Oh, strange, I am pretty sure I replied to that. Let me resend it.

>> The sync handler is nice to have and can be done in a later patch
>> series.
>
> Sync handler?

I was talking about SWITCHDEV_SYNC.

Bye,
Hannes
Re: [patch] net: renesas: ravb: uninitialized return value
Hello! On 12/01/2016 11:57 PM, Dan Carpenter wrote: We want to set the other "err" variable here so that we can return it later. My version of GCC misses this issue but I caught it with a static checker. Fixes: 9f70eb339f52 ("net: ethernet: renesas: ravb: fix fixed-link phydev leaks") Hm, I somehow missed this one, probably due to the horrific CC list. :-( Signed-off-by: Dan Carpenter
Acked-by: Sergei Shtylyov
MBR, Sergei
Re: [flamebait] xdp, well meaning but pointless
On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa wrote: > Hello, > > this is a good conversation and I simply want to bring my worries > across. I don't have good solutions for the problems XDP tries to solve > but I fear we could get caught up in maintenance problems in the long > term given the ideas floating around on how to evolve XDP currently. > > On 01.12.2016 17:28, Thomas Graf wrote: >> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote: >>> First of all, this is a rant targeted at XDP and not at eBPF as a whole. >>> XDP manipulates packets at free will and thus all security guarantees >>> are off as well as in any user space solution. >>> >>> Secondly user space provides policy, acl, more controlled memory >>> protection, restartability and better debugability. If I had multi >>> tenant workloads I would definitely put more complex "business/acl" >>> logic into user space, so I can make use of LSM and other features to >>> especially prevent a network facing service to attack the tenants. If >>> stuff gets put into the kernel you run user controlled code in the >>> kernel exposing a much bigger attack vector. >>> >>> What use case do you see in XDP specifically e.g. for container networking? >> >> DDOS mitigation to protect distributed applications in large clusters. >> Relying on CDN works to protect API gateways and frontends (as long as >> they don't throw you out of their network) but offers no protection >> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server >> level and allowing the mitigation capability to scale up with the number >> of servers is natural and cheap. > > So far we e.g. always considered L2 attacks a problem of the network > admin to correctly protect the environment. Are you talking about > protecting the L3 data plane? Are there custom proprietary protocols in > place which need custom protocol parsers that need involvement of the > kernel before it could verify the packet?
> > In the past we tried to protect the L3 data plane as good as we can in > Linux to allow the plain old server admin to set an IP address on an > interface and install whatever software in user space. We try not only > to protect it but also try to achieve fairness by adding a lot of > counters everywhere. Are protections missing right now or are we talking > about better performance? > The technical plenary at the last IETF in Seoul a couple of weeks ago was exclusively focussed on DDOS in light of the recent attack against Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare presentation by Nick Sullivan (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) alluded to some implementation of DDOS mitigation. In particular, on slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel" numbers he gave were based on iptables+BPF, and that was a whole 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the mic and that's also when I introduced XDP to the whole IETF :-) ). If that's the best we can do, the Internet is in a world of hurt. DDOS mitigation alone is probably a sufficient motivation to look at XDP. We need something that drops bad packets as quickly as possible when under attack, we need this to be integrated into the stack, we need it to be programmable to deal with the increasing savvy of attackers, and we don't want to be forced to be dependent on HW solutions. This is why we created XDP! Tom > To provide fairness you often have to share validated data within the > kernel and with XDP. This requires consistent lookup methods for sockets > in the lower level. Those can be exported to XDP via external functions > and become part of uAPI which will limit our ability to change those > functions in future.
When the discussion started about early demuxing in > XDP I became really nervous, because suddenly the XDP program has to > decide correctly which protocol type it has and look in the correct > socket table for the socket. Different semantics for sockets can apply > here, e.g. some sockets are RCU managed, some end up using reference > counts. A wrong decision here would cause havoc in the kernel (XDP > considers packet as UDP but kernel stack as TCP). Also, who knows that > we won't have per-cpu socket tables we would keep that as uAPI (this is > btw. the dragonflyBSD approach to scaling)? Imagine someone writing a > SIP rewriter in XDP and depending on a coherent view of all sockets even > if their hash doesn't fit to the one of the queue? Suddenly something > which was thought of as being only mutable by one CPU becomes global > again and because of XDP we need to add locking because of uAPI. > > This discussion is parallel to the discussion about trace points, which > are not considered uAPI. If eBPF functions are not considered uAPI then > eBPF in the network stack will have much less value, because you > suddenly
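The early-drop verdict logic being argued about here can be sketched in plain C. A real XDP program is restricted C compiled to BPF, running at the driver hook and returning XDP_DROP/XDP_PASS; the sketch below only mirrors its parse-then-verdict shape in userspace, and the blocked port is an invented example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VERDICT_DROP 1   /* stands in for XDP_DROP */
#define VERDICT_PASS 2   /* stands in for XDP_PASS */

/* Minimal Ethernet+IPv4+UDP parse with the bounds checks the BPF
 * verifier would force on a real XDP program. */
static int verdict(const uint8_t *pkt, size_t len, uint16_t blocked_port)
{
    if (len < 14 + 20)                       /* Ethernet + minimal IPv4 */
        return VERDICT_PASS;
    if (pkt[12] != 0x08 || pkt[13] != 0x00)  /* not IPv4 */
        return VERDICT_PASS;
    const uint8_t *ip = pkt + 14;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;
    if (ip[9] != 17)                         /* not UDP */
        return VERDICT_PASS;
    if (len < 14 + ihl + 8)                  /* truncated UDP header */
        return VERDICT_DROP;
    uint16_t dport = (uint16_t)(ip[ihl + 2] << 8 | ip[ihl + 3]);
    return dport == blocked_port ? VERDICT_DROP : VERDICT_PASS;
}

/* Builds a hypothetical UDP packet to dport and applies the verdict,
 * with port 1900 (SSDP reflection floods) as the blocked example. */
static int verdict_for_port(uint16_t dport)
{
    uint8_t pkt[42] = {0};
    pkt[12] = 0x08; pkt[13] = 0x00;          /* ethertype IPv4 */
    pkt[14] = 0x45;                          /* version 4, ihl 5 */
    pkt[23] = 17;                            /* protocol UDP */
    pkt[36] = (uint8_t)(dport >> 8);
    pkt[37] = (uint8_t)(dport & 0xff);
    return verdict(pkt, sizeof(pkt), 1900);
}
```

The point Tom makes is that this decision happens before the stack allocates or touches anything else, which is where the drop-rate headroom comes from.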
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On 30.11.2016 17:32, Ido Schimmel wrote: > Hi Hannes, > > On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote: >> On 30.11.2016 11:09, Jiri Pirko wrote: >>> From: Ido Schimmel >>> >>> Make sure the device has a complete view of the FIB tables by invoking >>> their dump during module init. >>> >>> Signed-off-by: Ido Schimmel >>> Signed-off-by: Jiri Pirko >>> --- >>> .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 23 >>> ++ >>> 1 file changed, 23 insertions(+) >>> >>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c >>> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c >>> index 14bed1d..d176047 100644 >>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c >>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c >>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct >>> notifier_block *nb, >>> return NOTIFY_DONE; >>> } >>> >>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb) >>> +{ >>> + struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb); >>> + >>> + /* Flush pending FIB notifications and then flush the device's >>> +* table before requesting another dump. Do that with RTNL held, >>> +* as FIB notification block is already registered. >>> +*/ >>> + mlxsw_core_flush_owq(); >>> + rtnl_lock(); >>> + mlxsw_sp_router_fib_flush(mlxsw_sp); >>> + rtnl_unlock(); >>> +} >>> + >>> int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp) >>> { >>> + fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush; >>> int err; >>> >>> INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list); >>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp) >>> >>> mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event; >>> register_fib_notifier(&mlxsw_sp->fib_nb); >> >> Sorry to pick in here again: >> >> There is a race here. You need to protect the registration of the fib >> notifier as well by the sequence counter. Updates here are not ordered >> in relation to this code below.
> > You mean updates that can be received after you registered the notifier > and until the dump started? I'm aware of that and that's OK. This > listener should be able to handle duplicates. I am not concerned about duplicates, but about ordering: if a delete and an add get reordered by the RCU code, you will add the node to hw while it is deleted in the software path. You will probably ignore the delete because nothing is installed in hw, and later add the node which was actually deleted but just reordered because it happened on another CPU, no? > I've a follow up patchset that introduces a new event in switchdev > notification chain called SWITCHDEV_SYNC, which is sent when port > netdevs are enslaved / released from a master device (points in time > where kernel<->device can get out of sync). It will invoke > re-propagation of configuration from different parts of the stack > (e.g. bridge driver, 8021q driver, fib/neigh code), which can result > in duplicates. Okay, understood. I wonder how we can actually protect against accidental abort calls. E.g. if I start to inject routes into my routing domain, how can I make sure the box doesn't die once I insert too many routes? Do we need to touch quagga etc? Thanks, Hannes
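The "registration under protection of the sequence counter" Hannes asks for boils down to the classic retry loop: snapshot the counter, replay the FIB tables to the freshly registered notifier, and redo the dump if a writer bumped the counter in the meantime. A toy userspace sketch (names are illustrative, not the kernel's seqlock API):

```c
#include <assert.h>

/* Toy sequence counter: writers bump it on every FIB change; the
 * dumper retries until it sees no concurrent change. Illustrative
 * only -- not the kernel's seqlock API. */
static unsigned int fib_seq;

static void fib_change(void) { fib_seq++; }

/* Returns how many passes the dump needed. */
static int dump_fib(int simulate_race)
{
    int attempts = 0;
    unsigned int seq;

    do {
        seq = fib_seq;
        attempts++;
        /* ... walk the tables, replay entries to the notifier ... */
        if (simulate_race && attempts == 1)
            fib_change();         /* a route changed mid-dump */
    } while (seq != fib_seq);     /* inconsistent snapshot: redo */
    return attempts;
}
```

The retry discards any dump that interleaved with a writer, which is what removes the add/delete reordering window being discussed.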
[patch] net: renesas: ravb: uninitialized return value
We want to set the other "err" variable here so that we can return it later. My version of GCC misses this issue but I caught it with a static checker. Fixes: 9f70eb339f52 ("net: ethernet: renesas: ravb: fix fixed-link phydev leaks") Signed-off-by: Dan Carpenter
--- Applies to the net tree for 4.10. diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index 2c0357c..92d7692 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -1016,8 +1016,6 @@ static int ravb_phy_init(struct net_device *ndev) * at this time. */ if (priv->chip_id == RCAR_GEN3) { - int err; - err = phy_set_max_speed(phydev, SPEED_100); if (err) { netdev_err(ndev, "failed to limit PHY to 100Mbit/s\n");
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote: > On 01.12.2016 21:04, David Miller wrote: > > > > Hannes and Ido, > > > > It looks like we are very close to having this in mergable shape, can > > you guys work out this final issue and figure out if it really is > > a merge stopped or not? > > Sure, if the fib notification register could be done under protection of > the sequence counter I don't see any more problems. Did you maybe miss my reply yesterday? Because I was trying to understand what "ordering" you're referring to, but didn't receive a reply from you. > The sync handler is nice to have and can be done in a later patch series. Sync handler?
Re: [PATCH net] tcp: warn on bogus MSS and try to amend it
On Thu, Dec 01, 2016 at 03:29:49PM -0500, David Miller wrote: > From: Marcelo Ricardo Leitner > Date: Wed, 30 Nov 2016 11:14:32 -0200 > > > There have been some reports lately about TCP connection stalls caused > > by NIC drivers that aren't setting gso_size on aggregated packets on rx > > path. This causes TCP to assume that the MSS is actually the size of the > > aggregated packet, which is invalid. > > > > Although the proper fix is to be done at each driver, it's often hard > > and cumbersome for one to debug, come to such root cause and report/fix > > it. > > > > This patch amends this situation in two ways. First, it adds a warning > > on when this situation occurs, so it gives a hint to those trying to > > debug this. It also limits the maximum probed MSS to the advertised MSS, > > as it should never be any higher than that. > > > > The result is that the connection may not have the best performance ever > > but it shouldn't stall, and the admin will have a hint on what to look > > for. > > > > Tested with virtio by forcing gso_size to 0. > > > > Cc: Jonathan Maxwell > > Signed-off-by: Marcelo Ricardo Leitner > > I totally agree with this change, however I think the warning message can > be improved in two ways: > > > len = skb_shinfo(skb)->gso_size ? : skb->len; > > if (len >= icsk->icsk_ack.rcv_mss) { > > - icsk->icsk_ack.rcv_mss = len; > > + icsk->icsk_ack.rcv_mss = min_t(unsigned int, len, > > + tcp_sk(sk)->advmss); > > + if (icsk->icsk_ack.rcv_mss != len) > > + pr_warn_once("Seems your NIC driver is doing bad RX > > acceleration. TCP performance may be compromised.\n"); > > We know it's a bad GRO implementation that causes this so let's be specific > in the > message, perhaps something like: > > Driver has suspect GRO implementation, TCP performance may be > compromised. Okay. > > Also, we have skb->dev available here most likely, so prefixing the message > with > skb->dev->name would make analyzing this situation even easier for someone > hitting > this.
Nice, yes. And this skb is mostly non-forwardable as it's bigger than the MTU, so if someone is using net namespaces and this skb were routed through some veth interfaces, it would give a false hint then, but that shouldn't happen. Unless it would fit a (larger) veth MTU; but still, one will probably simplify the setup while debugging this. > > I'm not certain if an skb->dev==NULL check is necessary here or not, but it is > definitely something you need to consider. > > Thanks! > Will check. Thanks! Marcelo
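For reference, the clamping behaviour the patch introduces can be condensed into a few lines of userspace C (function and parameter names here are made up for illustration; the kernel code operates on icsk/skb state and uses pr_warn_once instead):

```c
#include <assert.h>
#include <stdio.h>

/* rcv_mss probing with the advertised-MSS clamp from the patch: a
 * merged GRO blob without gso_size must not be mistaken for the
 * peer's MSS. All names are illustrative. */
static unsigned int probe_rcv_mss(unsigned int gso_size, unsigned int skb_len,
                                  unsigned int advmss, unsigned int rcv_mss)
{
    unsigned int len = gso_size ? gso_size : skb_len; /* gso_size ?: skb->len */

    if (len >= rcv_mss) {
        unsigned int clamped = len < advmss ? len : advmss;
        if (clamped != len)   /* driver has a suspect GRO implementation */
            fprintf(stderr, "suspect GRO: clamping %u to %u\n", len, clamped);
        return clamped;
    }
    return rcv_mss;
}
```

A 64KB aggregate with gso_size unset gets clamped to the advertised MSS, while a properly tagged 1448-byte segment passes through unchanged.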
Re: [flamebait] xdp, well meaning but pointless
Hello, this is a good conversation and I simply want to bring my worries across. I don't have good solutions for the problems XDP tries to solve but I fear we could get caught up in maintenance problems in the long term given the ideas floating around on how to evolve XDP currently. On 01.12.2016 17:28, Thomas Graf wrote: > On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote: >> First of all, this is a rant targeted at XDP and not at eBPF as a whole. >> XDP manipulates packets at free will and thus all security guarantees >> are off as well as in any user space solution. >> >> Secondly user space provides policy, acl, more controlled memory >> protection, restartability and better debugability. If I had multi >> tenant workloads I would definitely put more complex "business/acl" >> logic into user space, so I can make use of LSM and other features to >> especially prevent a network facing service to attack the tenants. If >> stuff gets put into the kernel you run user controlled code in the >> kernel exposing a much bigger attack vector. >> >> What use case do you see in XDP specifically e.g. for container networking? > > DDOS mitigation to protect distributed applications in large clusters. > Relying on CDN works to protect API gateways and frontends (as long as > they don't throw you out of their network) but offers no protection > beyond that, e.g. a noisy/hostile neighbour. Doing this at the server > level and allowing the mitigation capability to scale up with the number > of servers is natural and cheap. So far we e.g. always considered L2 attacks a problem of the network admin to correctly protect the environment. Are you talking about protecting the L3 data plane? Are there custom proprietary protocols in place which need custom protocol parsers that need involvement of the kernel before it could verify the packet? 
In the past we tried to protect the L3 data plane as good as we can in Linux to allow the plain old server admin to set an IP address on an interface and install whatever software in user space. We try not only to protect it but also try to achieve fairness by adding a lot of counters everywhere. Are protections missing right now or are we talking about better performance? To provide fairness you often have to share validated data within the kernel and with XDP. This requires consistent lookup methods for sockets in the lower level. Those can be exported to XDP via external functions and become part of uAPI which will limit our ability to change those functions in future. When the discussion started about early demuxing in XDP I became really nervous, because suddenly the XDP program has to decide correctly which protocol type it has and look in the correct socket table for the socket. Different semantics for sockets can apply here, e.g. some sockets are RCU managed, some end up using reference counts. A wrong decision here would cause havoc in the kernel (XDP considers packet as UDP but kernel stack as TCP). Also, who knows that we won't have per-cpu socket tables we would keep that as uAPI (this is btw. the dragonflyBSD approach to scaling)? Imagine someone writing a SIP rewriter in XDP and depending on a coherent view of all sockets even if their hash doesn't fit to the one of the queue? Suddenly something which was thought of as being only mutable by one CPU becomes global again and because of XDP we need to add locking because of uAPI. This discussion is parallel to the discussion about trace points, which are not considered uAPI. If eBPF functions are not considered uAPI then eBPF in the network stack will have much less value, because you suddenly depend on specific kernel versions again and cannot simply load the code into the kernel. The API checks will become very difficult to implement, see also the ongoing MODVERSIONS discussions on LKML some days back. 
>>> I agree with you if the LB is a software based appliance in either a >>> dedicated VM or on dedicated baremetal. >>> >>> The reality is turning out to be different in many cases though, LB >>> needs to be performed not only for north south but east west as well. >>> So even if I would handle LB for traffic entering my datacenter in user >>> space, I will need the same LB for packets from my applications and >>> I definitely don't want to move all of that into user space. >> >> The open question to me is why is programmability needed here. >> >> Look at the discussion about ECMP and consistent hashing. It is not very >> easy to actually write this code correctly. Why can't we just put C code >> into the kernel that implements this once and for all and let user space >> update the policies? > > Whatever LB logic is put in place with native C code now is unlikely the > logic we need in two years. We can't really predict the future. If it > was the case, networking would have been done long ago and we would all > be working on self eating ice cream now. Did LB algorithms on the networking layer change that much?
Re: [net PATCH 0/2] Don't use lco_csum to compute IPv4 checksum
From: Jeff Kirsher Date: Wed, 30 Nov 2016 13:15:22 -0800 > On Wed, 2016-11-30 at 09:47 -0500, David Miller wrote: >> From: Alexander Duyck >> Date: Mon, 28 Nov 2016 10:42:18 -0500 >> >> > When I implemented the GSO partial support in the Intel drivers I was >> using >> > lco_csum to compute the checksum that we needed to plug into the IPv4 >> > checksum field in order to cancel out the data that was not a part of >> the >> > IPv4 header. However this didn't take into account that the transport >> > offset might be pointing to the inner transport header. >> > >> > Instead of using lco_csum I have just coded around it so that we can >> use >> > the outer IP header plus the IP header length to determine where we >> need to >> > start our checksum and then just call csum_partial ourselves. >> > >> > This should fix the SIT issue reported on igb interfaces as well as >> similar >> > issues that would pop up on other Intel NICs. >> >> Jeff, are you going to send me a pull request with this stuff or would >> you be OK with my applying these directly to 'net'? > > Go ahead and apply those to your net tree, I do not want to hold this up. Ok, done, thanks Jeff.
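The fix Alexander describes — start the checksum at the outer IP header plus its length and "just call csum_partial" — rests on ordinary 16-bit one's-complement summing. A simplified userspace version of that arithmetic (the real csum_partial batches in 32/64-bit chunks and is arch-optimized; this sketch is not):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement 16-bit sum over a byte range, folded -- the
 * arithmetic behind csum_partial, minus its word-at-a-time batching. */
static uint16_t csum16(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(data[i] << 8 | data[i + 1]);
    if (len & 1)                       /* odd trailing byte */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                  /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

Being position-independent over 16-bit words, such a sum can be started at any offset — which is why picking the right start (outer IP header + IHL rather than the inner transport offset) is the whole fix.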
Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
Hi Andrew, Andrew Lunn writes: >> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h >> b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h >> index ab52c37..9e51405 100644 >> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h >> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h >> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops { >> int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg, >> u16 val); >> >> +/* Switch Software Reset */ >> +int (*reset)(struct mv88e6xxx_chip *chip); >> + > > Hi Vivien > > In my huge patch series of 6390, i've been using a g1_ prefix for > functionality which is in global 1, g2_ for global 2, etc. This has > worked for everything so far with the exception of setting which > reserved MAC addresses should be sent to the CPU. Most devices have it > in g2, but 6390 has it in g1. > > Please could you add the prefix. I don't understand. It looks like you are talking about the second part of the comment I made on your RFC patchset, about the Rsvd2CPU feature: https://www.mail-archive.com/netdev@vger.kernel.org/msg139837.html Switch reset routines are implemented in this patch in global1.c as mv88e6185_g1_reset and mv88e6352_g1_reset. 6185 and 6352 are implementation references for other switches. Thanks, Vivien
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
On 01.12.2016 21:04, David Miller wrote: > > Hannes and Ido, > > It looks like we are very close to having this in mergable shape, can > you guys work out this final issue and figure out if it really is > a merge stopped or not? Sure, if the fib notification register could be done under protection of the sequence counter I don't see any more problems. The sync handler is nice to have and can be done in a later patch series.
Re: [PATCH net-next 0/3] sfc: defalconisation fixups
From: Edward Cree Date: Thu, 1 Dec 2016 16:59:13 + > A bug fix, the Kconfig change, and cleaning up a bit more unused code. > > Edward Cree (3): > sfc: fix debug message format string in efx_farch_handle_rx_not_ok > sfc: don't select SFC_FALCON > sfc: remove RESET_TYPE_RX_RECOVERY Series applied, thank you.
Re: iproute2 public git outdated?
Hi Phil, I suggest you try again now; it seems the iproute2 git repo was updated in the last 2-4 hours, and "git log" on master now shows a patch from the 30th of November (actually it is your "Add notes about dropped IPv4 route cache" patch). Regards, Rami Rosen On 1 December 2016 at 14:18, Phil Sutter wrote: > Hi, > > I am using iproute2's public git repo at this URL: > > git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git > > To my surprise, neither master nor net-next branches have received new > commits since end of October. Did the repo location change or was it > just not updated for a while? > > Thanks, Phil
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 12:13 PM, Sowmini Varadhan wrote: > On (12/01/16 11:05), Tom Herbert wrote: >> >> Polling does not necessarily imply that networking monopolizes the CPU >> except when the CPU is otherwise idle. Presumably the application >> drives the polling when it is ready to receive work. > > I'm not grokking that- "if the cpu is idle, we want to busy-poll > and make it 0% idle"? Keeping CPU 0% idle has all sorts > of issues, see slide 20 of > http://www.slideshare.net/shemminger/dpdk-performance > >> > and one other critical difference from the hot-potato-forwarding >> > model (the sort of OVS model that DPDK etc might arguably be a fit for) >> > does not apply: in order to figure out the ethernet and IP headers >> > in the response correctly at all times (in the face of things like VRRP, >> > gw changes, gw's mac addr changes etc) the application should really >> > be listening on NETLINK sockets for modifications to the networking >> > state - again points to needing a select() socket set where you can >> > have both the I/O fds and the netlink socket, >> > >> I would think that that management would not be implemented in a >> fast path processing thread for an application. > > sure, but my point was that *XDP and other stack-bypass methods needs > to provide a select()able socket: when your use-case is not about just > networking, you have to snoop on changes to the control plane, and update > your data path. In the OVS case (pure networking) the OVS control plane > updates are intrinsic to OVS. For the rest of the request/response world, > we need a select()able socket set to do this elegantly (not really > possible in DPDK, for example) > I'm not sure that TXDP can be reconciled to help OVS. The point of TXDP is to drive applications closer to bare metal performance, as I mentioned this is only going to be worth it if the fast path can be kept simple and not complicated by a requirement for generalization.
It seems like the second we put OVS in we're doubling the data path and accepting the performance consequences of a complex path anyway. TXDP can't cover the whole system (any more than DPDK can) and needs to work in concert with other mechanisms -- the key is how to steer the work amongst the CPUs. For instance, if a latency critical thread is running on some CPU, we either need a dedicated queue for the connections of the thread (e.g. ntuple filtering or aRFS support) or we need a fast way to move unrelated packets received on a queue processed by that CPU to other CPUs (less efficient, but no special HW support is needed either). Tom > >> The *SOs are always an interesting question. They make for great >> benchmarks, but in real life the amount of benefit is somewhat >> unclear. Under the wrong conditions, like all cwnds have collapsed or > > I think Rick's already bringing up this one. > > --Sowmini >
pull-request: can-next 2016-12-01
Hello David, this is a pull request of 4 patches for net-next/master. There are two patches by Chris Paterson for the rcar_can and rcar_canfd device tree binding documentation. And a patch by Geert Uytterhoeven that corrects the order of interrupt specifiers. The fourth patch by Colin Ian King fixes a spelling error in the kvaser_usb driver. regards, Marc --- The following changes since commit 8f679ed88f8860206edddff725e2749b4cdbb0e8: driver: ipvlan: Remove useless member mtu_adj of struct ipvl_dev (2016-11-30 15:01:32 -0500) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git tags/linux-can-next-for-4.10-20161201 for you to fetch changes up to 0d8f8efd32bace9f222fcc92d4a3132d877e5df6: net: can: usb: kvaser_usb: fix spelling mistake of "outstanding" (2016-12-01 14:27:02 +0100) linux-can-next-for-4.10-20161201 Chris Paterson (2): can: rcar_can: Add r8a7796 support can: rcar_canfd: Add r8a7796 support Colin Ian King (1): net: can: usb: kvaser_usb: fix spelling mistake of "outstanding" Geert Uytterhoeven (1): can: rcar_canfd: Correct order of interrupt specifiers Documentation/devicetree/bindings/net/can/rcar_can.txt | 12 +++- Documentation/devicetree/bindings/net/can/rcar_canfd.txt | 14 -- drivers/net/can/usb/kvaser_usb.c | 4 ++-- 3 files changed, 17 insertions(+), 13 deletions(-) -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions| Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917- | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
Re: [PATCH v3 net-next 3/3] openvswitch: Fix skb->protocol for vlan frames.
On Wed, Nov 30, 2016 at 6:30 AM, Jiri Benc wrote: > On Tue, 29 Nov 2016 15:30:53 -0800, Jarno Rajahalme wrote: >> Do not always set skb->protocol to be the ethertype of the L3 header. >> For a packet with non-accelerated VLAN tags skb->protocol needs to be >> the ethertype of the outermost non-accelerated VLAN ethertype. > > Well, the current handling of skb->protocol matches what used to be the > handling of the kernel net stack before Jiri Pirko cleaned up the vlan > code. > > I'm not opposed to changing this but I'm afraid it needs much deeper > review. Because with this in place, no core kernel functions that > depend on skb->protocol may be called from within openvswitch. > Can you give a specific example where it does not work? >> @@ -361,6 +362,11 @@ static int parse_vlan(struct sk_buff *skb, struct >> sw_flow_key *key) >> if (res <= 0) >> return res; >> >> + /* If the outer vlan tag was accelerated, skb->protocol should >> + * reflect the inner vlan type. */ >> + if (!eth_type_vlan(skb->protocol)) >> + skb->protocol = key->eth.cvlan.tpid; > > This should not depend on the current value in skb->protocol which > could be arbitrary at this point (from the point of view of how this > patch understands the skb->protocol values). It's easy to fix, though - > just add a local bool variable tracking whether the skb->protocol has > been set. > The skb->protocol value is set by the caller, so it should not be arbitrary. Is it missing in any case?
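The hunk under discussion hinges on eth_type_vlan() and on picking the inner tpid when the outer tag was accelerated. A standalone sketch of that selection rule (the fixup function is a simplification for illustration, not the actual OVS code path):

```c
#include <assert.h>
#include <stdint.h>

#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8
#define ETH_P_IP     0x0800

/* Mirrors the kernel's eth_type_vlan(): true for both VLAN tpids. */
static int eth_type_vlan(uint16_t ethertype)
{
    return ethertype == ETH_P_8021Q || ethertype == ETH_P_8021AD;
}

/* Rule from the hunk: if skb->protocol is not already a VLAN
 * ethertype (the outer tag was accelerated, i.e. stripped by hw),
 * it should become the inner vlan tpid. Simplified stand-in. */
static uint16_t fixup_protocol(uint16_t skb_protocol, uint16_t inner_tpid)
{
    if (!eth_type_vlan(skb_protocol))
        return inner_tpid;
    return skb_protocol;
}
```

Jiri's objection maps onto the first branch: the rule silently trusts that a non-VLAN skb_protocol means "outer tag accelerated" rather than "value never set".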
Re: [PATCH net-next 5/6] net: dsa: mv88e6xxx: add helper for switch ready
Hi Andrew, Andrew Lunn writes: > As we have seen in the past, this sort of loop is broken if we end up > sleeping for a long time. Please take the opportunity to replace it > with one of our _wait() helpers, e.g. mv88e6xxx_g1_wait() That won't work. The _wait() helpers are made to wait on self-clear (SC) bits, i.e. looping until they are cleared to zero. Here we want the opposite. I will keep this existing wait loop for the moment and work soon on a new patchset to rework the wait routines. We need a generic access to test a given value against a given mask and wrappers for busy bits, etc. >> +int mv88e6xxx_g1_init_ready(struct mv88e6xxx_chip *chip, bool *ready) >> +{ >> +u16 val; >> +int err; >> + >> +/* Check the value of the InitReady bit 11 */ >> +err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val); >> +if (err) >> +return err; >> + >> +*ready = !!(val & GLOBAL_STATUS_INIT_READY); > > I would actually do the wait here. That is better indeed. Thanks, Vivien
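The "generic access to test a given value against a given mask" Vivien mentions could look roughly like the following userspace sketch, polled against a fake status register (names such as wait_mask are hypothetical; a real kernel helper would take chip/reg arguments and usleep_range() between reads):

```c
#include <assert.h>
#include <stdint.h>

#define GLOBAL_STATUS_INIT_READY 0x0800   /* InitReady, bit 11 */

/* Fake status register for the sketch; a real helper would take a
 * (chip, reg) read callback instead of a global. */
static uint16_t fake_status;

static uint16_t reg_read(void) { return fake_status; }

/* Generic wait: poll until (reg & mask) == val, bounded attempts.
 * val == 0 covers self-clear bits, val == mask covers ready bits. */
static int wait_mask(uint16_t mask, uint16_t val, int max_tries)
{
    int i;

    for (i = 0; i < max_tries; i++) {
        if ((reg_read() & mask) == val)
            return 0;
        /* a real implementation would sleep here */
    }
    return -1;   /* would be -ETIMEDOUT in the kernel */
}

static int demo_ready(void)
{
    fake_status = GLOBAL_STATUS_INIT_READY;
    return wait_mask(GLOBAL_STATUS_INIT_READY, GLOBAL_STATUS_INIT_READY, 16);
}

static int demo_timeout(void)
{
    fake_status = 0;
    return wait_mask(GLOBAL_STATUS_INIT_READY, GLOBAL_STATUS_INIT_READY, 3);
}
```

Parameterizing on both mask and expected value is what lets one helper serve the self-clear case and the InitReady case alike.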
Re: [PATCH net] tcp: warn on bogus MSS and try to amend it
From: Marcelo Ricardo Leitner Date: Wed, 30 Nov 2016 11:14:32 -0200 > There have been some reports lately about TCP connection stalls caused > by NIC drivers that aren't setting gso_size on aggregated packets on rx > path. This causes TCP to assume that the MSS is actually the size of the > aggregated packet, which is invalid. > > Although the proper fix is to be done at each driver, it's often hard > and cumbersome for one to debug, come to such root cause and report/fix > it. > > This patch amends this situation in two ways. First, it adds a warning > on when this situation occurs, so it gives a hint to those trying to > debug this. It also limits the maximum probed MSS to the advertised MSS, > as it should never be any higher than that. > > The result is that the connection may not have the best performance ever > but it shouldn't stall, and the admin will have a hint on what to look > for. > > Tested with virtio by forcing gso_size to 0. > > Cc: Jonathan Maxwell > Signed-off-by: Marcelo Ricardo Leitner I totally agree with this change, however I think the warning message can be improved in two ways: > len = skb_shinfo(skb)->gso_size ? : skb->len; > if (len >= icsk->icsk_ack.rcv_mss) { > - icsk->icsk_ack.rcv_mss = len; > + icsk->icsk_ack.rcv_mss = min_t(unsigned int, len, > +tcp_sk(sk)->advmss); > + if (icsk->icsk_ack.rcv_mss != len) > + pr_warn_once("Seems your NIC driver is doing bad RX > acceleration. TCP performance may be compromised.\n"); We know it's a bad GRO implementation that causes this so let's be specific in the message, perhaps something like: Driver has suspect GRO implementation, TCP performance may be compromised. Also, we have skb->dev available here most likely, so prefixing the message with skb->dev->name would make analyzing this situation even easier for someone hitting this. I'm not certain if an skb->dev==NULL check is necessary here or not, but it is definitely something you need to consider. Thanks!
[PATCH -next] net: ethernet: ti: davinci_cpdma: add missing EXPORTs
As of commit 8f32b90981dcdb355516fb95953133f8d4e6b11d ("net: ethernet: ti: davinci_cpdma: add set rate for a channel") the ARM allmodconfig builds would fail modpost with: ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined! ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined! ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined! ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined! Since these weren't declared as static, it is assumed they were meant to be shared outside the file, and that modular build testing was simply overlooked. Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Cc: Ivan Khoronzhuk
Cc: Mugunthan V N
Cc: Grygorii Strashko
Cc: linux-o...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker
--- drivers/net/ethernet/ti/davinci_cpdma.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c index c776e4575d2d..36518fc5c7cc 100644 --- a/drivers/net/ethernet/ti/davinci_cpdma.c +++ b/drivers/net/ethernet/ti/davinci_cpdma.c @@ -796,6 +796,7 @@ int cpdma_chan_set_weight(struct cpdma_chan *ch, int weight) spin_unlock_irqrestore(&ch->lock, flags); return ret; } +EXPORT_SYMBOL_GPL(cpdma_chan_set_weight); /* cpdma_chan_get_min_rate - get minimum allowed rate for channel * Should be called before cpdma_chan_set_rate. @@ -810,6 +811,7 @@ u32 cpdma_chan_get_min_rate(struct cpdma_ctlr *ctlr) return DIV_ROUND_UP(divident, divisor); } +EXPORT_SYMBOL_GPL(cpdma_chan_get_min_rate); /* cpdma_chan_set_rate - limits bandwidth for transmit channel. * The bandwidth * limited channels have to be in order beginning from lowest.
@@ -853,6 +855,7 @@ int cpdma_chan_set_rate(struct cpdma_chan *ch, u32 rate) spin_unlock_irqrestore(>lock, flags); return ret; } +EXPORT_SYMBOL_GPL(cpdma_chan_set_rate); u32 cpdma_chan_get_rate(struct cpdma_chan *ch) { @@ -865,6 +868,7 @@ u32 cpdma_chan_get_rate(struct cpdma_chan *ch) return rate; } +EXPORT_SYMBOL_GPL(cpdma_chan_get_rate); struct cpdma_chan *cpdma_chan_create(struct cpdma_ctlr *ctlr, int chan_num, cpdma_handler_fn handler, int rx_type) -- 2.11.0
Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: Hannes Frederic Sowa Date: Wed, 30 Nov 2016 14:12:48 +0100 > David, one question: do you remember if you measured with linked lists > at that time, or also with arrays? I actually would expect small arrays > that entirely fit into cachelines to be actually faster than our current > approach, which also walks a linked list, probably the best algorithm to > trash cache lines. I ask because I currently prefer this approach more > than having large allocations in the O(1) case because of easier code > and easier management. I did not try this, and I do agree with you that for extremely small table sizes a list or array would perform better because of the cache behavior.
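To make concrete why this thread is about *consistent* hashing rather than simple hashing over the nexthop array: with plain `hash % n` selection, most flows are reassigned to a different next hop whenever a path comes or goes, which breaks anycast/affinity. This is a toy illustration (names and hash function invented, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Trivial flow-tuple mixer, purely illustrative. */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
	uint32_t h = saddr * 2654435761u ^ daddr * 2246822519u;
	return h ^ ((uint32_t)sport << 16) ^ dport;
}

/* Naive ECMP selection: index into the small nexthop array by modulo. */
static int pick_nexthop_mod(uint32_t hash, int num_nexthops)
{
	return hash % num_nexthops;
}

/* Count how many of n synthetic flows land on a different next hop when
 * the nexthop count drops from 4 to 3 (one path going down). */
static int flows_rehashed(int n)
{
	int moved = 0;

	for (int i = 0; i < n; i++) {
		uint32_t h = flow_hash(0x0a000001u + i, 0x0a000002u,
				       (uint16_t)(1000 + i), 80);
		if (pick_nexthop_mod(h, 4) != pick_nexthop_mod(h, 3))
			moved++;
	}
	return moved;
}
```

With uniform hashes roughly three quarters of flows move under modulo selection; a consistent-hashing scheme bounds the disruption to only the flows that were on the removed path, regardless of whether the nexthops are stored in a list or a cacheline-friendly array.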
Re: [PATCH] stmmac: simplify flag assignment
From: Pavel Machek Date: Wed, 30 Nov 2016 12:44:31 +0100 > > Simplify flag assignment. > > Signed-off-by: Pavel Machek > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > index ed20668..0b706a7 100644 > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > @@ -2771,12 +2771,8 @@ static netdev_features_t stmmac_fix_features(struct > net_device *dev, > features &= ~NETIF_F_CSUM_MASK; > > /* Disable tso if asked by ethtool */ > - if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) { > - if (features & NETIF_F_TSO) > - priv->tso = true; > - else > - priv->tso = false; > - } > + if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) > + priv->tso = !!(features & NETIF_F_TSO); > Pavel, this really seems arbitrary. Whilst I really appreciate you're looking into this driver a bit because of some issues you are trying to resolve, I'd like to ask that you not start bombarding me with nit-pick cleanups here and there and instead concentrate on the real bug or issue. Thanks in advance.
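For readers unfamiliar with the idiom the patch relies on: `!!` collapses any non-zero bit pattern to 1 so a masked flag test can be assigned straight to a boolean. A standalone illustration (`DEMO_F_TSO` is an invented value, not the kernel's `NETIF_F_TSO`):

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_F_TSO 0x10000UL	/* illustrative flag bit, not the kernel's */

static bool tso_requested(unsigned long features)
{
	/* !! normalizes the masked value: any set bit -> 1, zero -> 0 */
	return !!(features & DEMO_F_TSO);
}
```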
Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration
On 11/23/2016 05:48 AM, Ido Schimmel wrote: > Hi Florian, > > On Tue, Nov 22, 2016 at 09:56:30AM -0800, Florian Fainelli wrote: >> On 11/22/2016 09:41 AM, Ido Schimmel wrote: >>> Hi Florian, >>> >>> On Mon, Nov 21, 2016 at 11:09:22AM -0800, Florian Fainelli wrote: Hi all, This patch series allows using the bridge master interface to configure an Ethernet switch port's CPU/management port with different VLAN attributes than those of the bridge downstream ports/members. Jiri, Ido, Andrew, Vivien, please review the impact on mlxsw and mv88e6xxx, I tested this with b53 and a mockup DSA driver. >>> >>> We'll need to add a check in mlxsw and ignore any VLAN configuration for >>> the bridge device itself. Otherwise, any configuration done on br0 will >>> be propagated to all of its slaves, which is incorrect. >>> Open questions: - if we have more than one bridge on top of a physical switch, the driver should keep track of that and verify that we are not going to change the CPU port VLAN attributes in a way that results in incompatible settings to be applied - if the default behavior is to have all VLANs associated with the CPU port be ingressing/egressing tagged to the CPU, is this really useful? >>> >>> First of all, I want to be sure that when we say "CPU port", we're >>> talking about the same thing. In mlxsw, the CPU port is a pipe between >>> the device and the host, through which all packets trapped to the host >>> go through. So, when a packet is trapped, the driver reads its Rx >>> descriptor, checks through which port it ingressed, resolves its netdev, >>> sets skb->dev accordingly and injects it to the Rx path via >>> netif_receive_skb(). The CPU port itself isn't represented using a >>> netdev. >> >> In the case of DSA, the CPU port is a normal Ethernet MAC driver, but in >> premise, this driver plus the DSA tag protocol hook do exactly the same >> things as you just describe. > > Thanks for the detailed explanation! 
I also took the time to read > dsa.txt, however I still don't quite understand the motivation for > VLAN filtering on the CPU port. In which cases would you like to prevent > packets from going to the host due to their VLAN header? This change > would make sense to me if you only had a limited number of VLANs you > could enable on the CPU port, but from your response I understand that > this isn't the case. It's not much about VLAN filtering per se, but more about the default VLAN membership of the CPU port, in the absence of any explicit configuration. As a user, I find it a little inconvenient to have to create one VLAN interface per VLAN that I am adding to the bridge to be able to terminate that traffic properly towards the host/CPU/management interface, especially when this VLAN is untagged. This is really the motivation for these patches: if there is only one VLAN configured, and it's the default VLAN for all ports, then the bridge master interface also terminates this VLAN with the same properties as those added by default (typically with default_pvid: VID 1 untagged, unless changed of course). If you don't want that as a user, you now have the ability to change it, and make this VLAN (or any other for that matter) be terminated differently at the host/CPU/management port level than how it is egressing at the downstream ports that are part of that VLAN. Maybe it's a bit overkill... > > FWIW, I don't have a problem with patches if they are useful for you, > I'm just trying to understand the use case. We can easily patch mlxsw to > ignore calls with orig_dev=br0. OK, if I resubmit, I will try to take care of mlxsw and rocker as well. Thanks! -- Florian
[PATCH iproute2 1/1] tc: updated man page to reflect handle-id use in filter GET command.
Signed-off-by: Roman Mashak --- man/man8/tc.8 | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/man/man8/tc.8 b/man/man8/tc.8 index 8a47a2b..d957ffa 100644 --- a/man/man8/tc.8 +++ b/man/man8/tc.8 @@ -32,7 +32,9 @@ class-id ] qdisc DEV .B [ parent qdisc-id -.B | root ] protocol +.B | root ] [ handle +handle-id ] +.B protocol protocol .B prio priority filtertype @@ -577,7 +579,7 @@ it is created. .TP get -Displays a single filter given the interface, parent ID, priority, protocol and handle ID. +Displays a single filter given the interface, qdisc-id, priority, protocol and handle-id. .TP show -- 1.9.1
Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet Date: Thu, 01 Dec 2016 09:04:17 -0800 > On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote: > >> When qdisc layer or trafgen/af_packet see this indication it knows it >> should/must flush the queue when it doesn't have more work left. Perhaps >> through net_tx_action(), by registering itself and e.g. if qdisc_run() >> is called and queue is empty then check if queue needs a flush. I would >> also allow driver to flush and clear this bit. > > net_tx_action() is not normally called, unless BQL limit is hit and/or > some qdiscs with throttling (HTB, TBF, FQ, ...) The one thing I wonder about is whether we should "ramp up" into a mode where the NAPI poll does the doorbells instead of going directly there. Maybe I misunderstand your algorithm, but it looks to me like if there are any active packets in the TX queue at enqueue time you will defer the doorbell to the interrupt handler. Let's say we put 1 packet in, and hit the doorbell. Then another packet comes in and we defer the doorbell to the IRQ. At this point there are a couple things I'm unclear about. For example, if we didn't hit the doorbell, will the chip still take a peek at the second descriptor? Depending upon how the doorbell works it might, or it might not. Either way, wouldn't there be a possible condition where the chip wouldn't see the second enqueued packet and we'd thus have the wire idle until the interrupt + NAPI runs and hits the doorbell? This is why I think we should "ramp up" the doorbell deferral, in order to avoid this potential wire idle time situation. Maybe the situation I'm worried about is not possible, so please explain it to me :-)
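The "ramp up" idea under debate can be sketched as a toy TX-queue model. This is not the mlx4 code, just an illustration of one possible policy: ring the doorbell immediately while the queue is idle or the busy streak is short, and defer to the completion IRQ only once a streak of back-to-back enqueues suggests the IRQ will arrive soon anyway. All names and the threshold scheme are assumptions for this sketch.

```c
#include <assert.h>

struct txq {
	unsigned int pending;     /* descriptors queued but not doorbelled */
	unsigned int inflight;    /* descriptors the NIC has been told about */
	unsigned int doorbells;   /* MMIO doorbells rung so far */
	unsigned int ramp;        /* defer only after this many busy enqueues */
	unsigned int busy_streak;
};

static void ring_doorbell(struct txq *q)
{
	q->inflight += q->pending;
	q->pending = 0;
	q->doorbells++;
}

/* Enqueue one packet.  Ring immediately while the queue is idle or the
 * streak is below the ramp threshold; otherwise defer to the TX IRQ,
 * so the wire is never left idle on the first packets of a burst. */
static void xmit(struct txq *q)
{
	q->pending++;
	if (q->inflight == 0 || q->busy_streak < q->ramp) {
		q->busy_streak += (q->inflight != 0);
		ring_doorbell(q);
	}
	/* else: deferred, the TX completion IRQ will flush */
}

/* TX completion IRQ: completions drain in-flight work; flush anything
 * that was deferred while the queue was busy. */
static void tx_irq(struct txq *q)
{
	q->inflight = 0;
	q->busy_streak = 0;
	if (q->pending)
		ring_doorbell(q);
}
```

With `ramp = 2`, the first three back-to-back packets each get an immediate doorbell and only the fourth is deferred to the IRQ, which addresses the "wire idle until NAPI runs" window for short bursts while still batching doorbells under sustained load.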
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones wrote: > On 12/01/2016 11:05 AM, Tom Herbert wrote: >> >> For the GSO and GRO the rationale is that performing the extra SW >> processing to do the offloads is significantly less expensive than >> running each packet through the full stack. This is true in a >> multi-layered generalized stack. In TXDP, however, we should be able >> to optimize the stack data path such that that would no longer be >> true. For instance, if we can process the packets received on a >> connection quickly enough so that it's about the same or just a little >> more costly than GRO processing then we might bypass GRO entirely. >> TSO is probably still relevant in TXDP since it reduces overheads >> processing TX in the device itself. > > > Just how much per-packet path-length are you thinking will go away under the > likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some > non-trivial things to effective overhead (service demand) and so throughput: > For plain in-order TCP packets I believe we should be able to process each packet at nearly the same speed as GRO. Most of the protocol processing we do between GRO and the stack is the same; the differences are that we need to do a connection lookup in the stack path (note we now do this in UDP GRO and that hasn't shown up as a major hit). We also need to consider enqueue/dequeue on the socket, which is a major reason to try for lockless sockets in this instance. 
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428   -1.000
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486   -1.000
>
> And that is still with the stretch-ACKs induced by GRO at the receiver.
>
Sure, but try running something that emulates a more realistic workload than a TCP stream, such as an RR test with relatively small payloads and many connections. 
> Losing GRO has quite similar results:
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860   -1.000
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502   -1.000
>
> I'm sure there is a very non-trivial "it depends" component here - netperf
> will get the peak benefit from *SO and so one will see the peak difference
> in service demands - but even if one gets only 6 segments per *SO that is a
> lot of path-length to make-up.
>
True, but I think there's a lot of path we'll be able to cut out. In this mode we don't need iptables, netfilter, the input route, the IPvlan check, or other similar lookups. Once we've successfully matched an established TCP state, anything related to policy on both TX and RX for that connection is inferred from the state. We want the processing path in this case to be concerned with just
Re: Initial thoughts on TXDP
On (12/01/16 11:05), Tom Herbert wrote: > > Polling does not necessarily imply that networking monopolizes the CPU > except when the CPU is otherwise idle. Presumably the application > drives the polling when it is ready to receive work. I'm not grokking that - "if the cpu is idle, we want to busy-poll and make it 0% idle"? Keeping the CPU 0% idle has all sorts of issues, see slide 20 of http://www.slideshare.net/shemminger/dpdk-performance > > and one other critical difference from the hot-potato-forwarding > model (the sort of OVS model that DPDK etc might arguably be a fit for) > does not apply: in order to figure out the ethernet and IP headers > in the response correctly at all times (in the face of things like VRRP, > gw changes, gw's mac addr changes etc) the application should really > be listening on NETLINK sockets for modifications to the networking > state - again points to needing a select() socket set where you can > have both the I/O fds and the netlink socket, > > I would think that that management would not be implemented in a > fast path processing thread for an application. sure, but my point was that *XDP and other stack-bypass methods need to provide a select()able socket: when your use-case is not about just networking, you have to snoop on changes to the control plane, and update your data path. In the OVS case (pure networking) the OVS control plane updates are intrinsic to OVS. For the rest of the request/response world, we need a select()able socket set to do this elegantly (not really possible in DPDK, for example) > The *SOs are always an interesting question. They make for great > benchmarks, but in real life the amount of benefit is somewhat > unclear. Under the wrong conditions, like all cwnds have collapsed or I think Rick's already bringing up this one. --Sowmini
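The "select()able socket set" point above can be sketched in a few lines on Linux: one poll() loop watching both a data fd and an rtnetlink socket subscribed to route/neighbour/link change notifications, so the fast path can refresh its cached L2/L3 state. Error handling is trimmed and the function names are invented for this sketch.

```c
#include <assert.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Open an rtnetlink socket subscribed to the control-plane events the
 * thread cares about (routes, neighbours, links). */
static int open_route_monitor(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = RTMGRP_IPV4_ROUTE | RTMGRP_NEIGH | RTMGRP_LINK;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* One event loop iteration over both the I/O fd and the netlink fd;
 * a positive return means data and/or control-plane work is pending. */
static int wait_for_events(int data_fd, int nl_fd, int timeout_ms)
{
	struct pollfd pfd[2] = {
		{ .fd = data_fd, .events = POLLIN },
		{ .fd = nl_fd,   .events = POLLIN },
	};
	return poll(pfd, 2, timeout_ms);
}
```

On readiness of the netlink fd the application would parse the nlmsghdr stream and update its cached gateway/MAC state before sending the next response; this is exactly the integration that a raw descriptor-ring interface does not give you for free.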
Re: [WIP] net+mlx4: auto doorbell
On Thu, 2016-12-01 at 20:17 +0100, Jesper Dangaard Brouer wrote: > On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet wrote: > > > BTW, if you are doing tests on mlx4 40Gbit, > > I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the > machines too. > > > would you check the > > following quick/dirty hack, using lots of low-rate flows ? > > What tool should I use to send "low-rate flows"? > You could use https://github.com/google/neper It supports SO_MAX_PACING_RATE, and you could launch 1600 flows, rate limited to 3028000 bytes per second (so sending one 2-MSS TSO packet every ms per flow) > And what am I looking for? Max throughput, in packets per second :/
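The per-flow rate limiting mentioned here is a plain socket option; a minimal sketch of what neper's pacing option amounts to. `SO_MAX_PACING_RATE` takes bytes per second, so 3028000 is roughly one 2-MSS TSO packet per millisecond (the fallback value 47 matches most architectures' headers and is an assumption for older toolchains):

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47	/* fallback for old headers, most arches */
#endif

/* Cap a socket's pacing rate in bytes per second (fq qdisc or TCP
 * internal pacing enforces it).  Returns setsockopt()'s result. */
static int set_pacing(int fd, unsigned int bytes_per_sec)
{
	return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
			  &bytes_per_sec, sizeof(bytes_per_sec));
}
```

With 1600 such flows at 3028000 B/s each, the NIC sees a steady mix of small TSO bursts rather than a few line-rate streams, which is the traffic pattern the doorbell experiment wants.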
Re: [PATCH 2/2] net: rfkill: Add rfkill-any LED trigger
> Hi Michał,
>
> [auto build test ERROR on mac80211-next/master]
> [also build test ERROR on v4.9-rc7 next-20161201]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Micha-K-pie/net-rfkill-Cleanup-error-handling-in-rfkill_init/20161202-002119
> base: https://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git master
> config: i386-randconfig-x004-201648 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>    net/rfkill/core.c: In function 'rfkill_set_block':
> >> net/rfkill/core.c:354:2: error: implicit declaration of function '__rfkill_any_led_trigger_event' [-Werror=implicit-function-declaration]
>      __rfkill_any_led_trigger_event();
>      ^~
>    net/rfkill/core.c: In function 'rfkill_init':
>    net/rfkill/core.c:1349:1: warning: label 'error_led_trigger' defined but not used [-Wunused-label]
>     error_led_trigger:
>     ^
>    At top level:
>    net/rfkill/core.c:243:13: warning: 'rfkill_any_led_trigger_unregister' defined but not used [-Wunused-function]
>     static void rfkill_any_led_trigger_unregister(void)
>     ^
>    net/rfkill/core.c:238:12: warning: 'rfkill_any_led_trigger_register' defined but not used [-Wunused-function]
>     static int rfkill_any_led_trigger_register(void)
>                ^~~
>    cc1: some warnings being treated as errors
>
> vim +/__rfkill_any_led_trigger_event +354 net/rfkill/core.c
>
>    348		rfkill->state &= ~RFKILL_BLOCK_SW_SETCALL;
>    349		rfkill->state &= ~RFKILL_BLOCK_SW_PREV;
>    350		curr = rfkill->state & RFKILL_BLOCK_SW;
>    351		spin_unlock_irqrestore(&rfkill->lock, flags);
>    352
>    353		rfkill_led_trigger_event(rfkill);
>  > 354		__rfkill_any_led_trigger_event();
>    355
>    356		if (prev != curr)
>    357			rfkill_event(rfkill);

Thanks, these are obviously all valid concerns. Sorry for being sloppy with the ifdefs. 
If I get positive feedback on the proposed feature itself, all these issues (and the warning pointed out in the other message) will be resolved in v2. -- Best regards, Michał Kępień
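The usual fix for this class of build error is to give the CONFIG-gated helpers empty `static inline` stubs so every caller compiles whether or not the option is enabled. A standalone sketch of the pattern, with `CONFIG_RFKILL_LEDS_DEMO` as an invented stand-in for the real Kconfig symbol:

```c
#include <assert.h>

#define CONFIG_RFKILL_LEDS_DEMO 1	/* pretend the option is enabled */

static int trigger_events;

#ifdef CONFIG_RFKILL_LEDS_DEMO
static void rfkill_any_led_trigger_event(void)
{
	trigger_events++;	/* the real LED trigger work would go here */
}
#else
/* Compiled-out stub: callers link and the compiler elides the call. */
static inline void rfkill_any_led_trigger_event(void) { }
#endif

static void rfkill_set_block_demo(void)
{
	/* no #ifdef needed at the call site */
	rfkill_any_led_trigger_event();
}
```

Keeping the `#ifdef` confined to the declarations avoids both the implicit-declaration error and the unused-function/unused-label warnings the robot flagged.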
Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
Hannes and Ido, It looks like we are very close to having this in mergeable shape. Can you guys work out this final issue and figure out whether it really is a merge stopper or not? Thanks.
Re: Initial thoughts on TXDP
On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal wrote: > Tom Herbert wrote: >> Posting for discussion > > Warning: You are not going to like this reply... > >> Now that XDP seems to be nicely gaining traction > > Yes, I regret to see that. XDP seems useful to create impressive > benchmark numbers (and little else). > > I will send a separate email to keep that flamebait part away from > this thread though. > > [..] > >> addresses the performance gap for stateless packet processing). The >> problem statement is analogous to that which we had for XDP, namely >> can we create a mode in the kernel that offers the same performance >> that is seen with L4 protocols over kernel bypass > > Why? If you want to bypass the kernel, then DO IT. > I don't want kernel bypass. I want the Linux stack to provide something close to bare metal performance for TCP/UDP for some latency sensitive applications we run. > There is nothing wrong with DPDK. The ONLY problem is that the kernel > does not offer a userspace fastpath like Windows RIO or FreeBSD's netmap. > > But even without that it's not difficult to get DPDK running. > That is not true for large scale deployments. Also, TXDP is about accelerating transport layers like TCP; DPDK is just the interface from userspace to device queues. We need a whole lot more with DPDK, a userspace TCP/IP stack for instance, to consider that we have equivalent functionality. > (T)XDP seems born from spite, not technical rationale. > IMO everyone would be better off if we'd just have something netmap-esque > in the network core (also see below). 
> >> I imagine there are a few reasons why userspace TCP stacks can get >> good performance: >> >> - Spin polling (we already can do this in kernel) >> - Lockless, I would assume that threads typically have exclusive >> access to a queue pair for a connection >> - Minimal TCP/IP stack code >> - Zero copy TX/RX >> - Light weight structures for queuing >> - No context switches >> - Fast data path for in order, uncongested flows >> - Silo'ing between application and device queues > > I only see two cases: > > 1. Many applications running (standard OS model) that need to > send/receive data > -> Linux Network Stack > > 2. Single dedicated application that does all rx/tx > > -> no queueing needed (can block network rx completely if receiver > is slow) > -> no allocations needed at runtime at all > -> no locking needed (single producer, single consumer) > > If you have #2 and you need to be fast etc then full userspace > bypass is fine. We will -- no matter what we do in kernel -- never > be able to keep up with the speed you can get with that > because we have to deal with #1. (Plus the ease of use/freedom of doing > userspace programming). And yes, I think that #2 is something we > should address solely by providing netmap or something similar. > > But even considering #1 there are ways to speed stack up: > > I'd kill RPF/RPS so we don't have IPI anymore and skb stays > on same cpu up to where it gets queued (ofo or rx queue). > The reference to RPS and RFS is only to move packets off the hot CPU that are not in the datapath. For instance, if we get a FIN for a connection, we can put this into a slow path since FIN processing is not latency sensitive but may take a considerable amount of CPU to process. > Then we could tell driver what happened with the skb it gave us, e.g. > we can tell driver it can do immediate page/dma reuse, for example > in pure ack case as opposed to skb sitting in ofo or receive queue. 
> > (RPS/RFS functionality could still be provided via one of the gazillion > hooks we now have in the stack for those that need/want it), so I do > not think we would lose functionality. > >> - Call into TCP/IP stack with page data directly from driver-- no >> skbuff allocation or interface. This is essentially provided by the >> XDP API although we would need to generalize the interface to call >> stack functions (I previously posted patches for that). We will also >> need a new action, XDP_HELD?, that indicates the XDP function held the >> packet (put on a socket for instance). > > Seems this will not work at all with the planned page pool thing when > pages start to be held indefinitely. > The processing needed to gift a page to the stack shouldn't be very different than what a driver needs to do when an skbuff is created when XDP_PASS is returned. We probably would want to pass additional metadata, things like checksum and vlan information from the received descriptor, to the stack. A callback can be included if the stack decides it wants to hold on to the buffer and the driver needs to do dma_sync etc. for that. > You can also never get even close to userspace offload stacks once you > need/do this; allocations in hotpath are too expensive. > > [..] > >> - When we transmit, it would be nice to go straight from TCP >> connection to an XDP device queue and in particular skip the qdisc >>
Re: [PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.
On Wed, Nov 30, 2016 at 5:51 AM, Jiri Benc wrote: > On Tue, 29 Nov 2016 15:30:52 -0800, Jarno Rajahalme wrote: >> @@ -504,11 +485,20 @@ void ovs_vport_send(struct vport *vport, struct >> sk_buff *skb, u8 mac_proto) >> goto drop; >> } >> >> - if (unlikely(packet_length(skb, vport->dev) > mtu && >> - !skb_is_gso(skb))) { >> - net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n", >> - vport->dev->name, >> - packet_length(skb, vport->dev), mtu); >> + if (unlikely(!is_skb_forwardable(vport->dev, skb))) { > > How does this work when the vlan tag is accelerated? Then we can be > over MTU, yet the check will pass. > This is not changing any behavior compared to current OVS vlan checks. A single vlan header is not considered for the MTU check.
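Jiri's concern can be shown with a userspace model (names invented here, not the kernel's): with an accelerated VLAN tag, the 4 tag bytes live outside skb->len, so a length check that ignores them passes frames that end up 4 bytes over MTU once the tag is re-inserted on the wire.

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HLEN_DEMO  14	/* Ethernet header, illustrative constant */
#define VLAN_HLEN_DEMO  4	/* 802.1Q tag, illustrative constant */

/* Naive check: L3 payload length vs MTU, ignoring any accelerated tag. */
static bool fits_mtu(unsigned int skb_len, bool vlan_accel, unsigned int mtu)
{
	unsigned int payload = skb_len - ETH_HLEN_DEMO;

	(void)vlan_accel;	/* the tag bytes are simply not counted */
	return payload <= mtu;
}

/* VLAN-aware check: count the tag that will be re-inserted on the wire. */
static bool fits_mtu_vlan_aware(unsigned int skb_len, bool vlan_accel,
				unsigned int mtu)
{
	unsigned int payload = skb_len - ETH_HLEN_DEMO +
			       (vlan_accel ? VLAN_HLEN_DEMO : 0);
	return payload <= mtu;
}
```

A 1514-byte skb with an accelerated tag passes the naive check against a 1500-byte MTU but becomes a 1518-byte tagged frame on the wire; whether that is acceptable is exactly the "single vlan header is not considered" convention the reply describes.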
Re: Initial thoughts on TXDP
On 12/01/2016 11:05 AM, Tom Herbert wrote: For the GSO and GRO the rationale is that performing the extra SW processing to do the offloads is significantly less expensive than running each packet through the full stack. This is true in a multi-layered generalized stack. In TXDP, however, we should be able to optimize the stack data path such that that would no longer be true. For instance, if we can process the packets received on a connection quickly enough so that it's about the same or just a little more costly than GRO processing then we might bypass GRO entirely. TSO is probably still relevant in TXDP since it reduces overheads processing TX in the device itself. Just how much per-packet path-length are you thinking will go away under the likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some non-trivial things to effective overhead (service demand) and so throughput:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486   -1.000

And that is still with the stretch-ACKs induced by GRO at the receiver. 
Losing GRO has quite similar results:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502   -1.000

I'm sure there is a very non-trivial "it depends" component here - netperf will get the peak benefit from *SO and so one will see the peak difference in service demands - but even if one gets only 6 segments per *SO that is a lot of path-length to make up. 4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz And even if one does have the CPU cycles to burn so to speak, the effect on power consumption needs to be included in the calculus. happy benchmarking, rick jones