[PATCH net] Revert "net: core: maybe return -EEXIST in __dev_alloc_name"
From: Johannes Berg

This reverts commit d6f295e9def0; some userspace (in the case
we noticed it's wpa_supplicant), is relying on the current
error code to determine that a fixed name interface already
exists.

Reported-by: Jouni Malinen
Signed-off-by: Johannes Berg
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 07ed21d64f92..f47e96b62308 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1106,7 +1106,7 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 	 * when the name is long and there isn't enough space left
 	 * for the digits, or if all bits are used.
 	 */
-	return p ? -ENFILE : -EEXIST;
+	return -ENFILE;
 }
 
 static int dev_alloc_name_ns(struct net *net,
-- 
2.14.2
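The dependency described in this revert can be pictured with a small userspace sketch. It is only an illustration, assuming a caller that gets back a kernel-style negative errno from an interface-creation request; the helper name fixed_name_taken() is hypothetical and is not taken from wpa_supplicant.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a failed "create interface"
 * request means that the fixed name is already in use.  With the
 * revert applied, __dev_alloc_name() reports -ENFILE in that case,
 * so that is the only code checked here.
 */
static bool fixed_name_taken(int err)
{
	assert(err <= 0);	/* kernel-style negative errno, or 0 */
	return err == -ENFILE;
}
```

A caller written this way would stop recognising the "name already exists" case if the kernel started returning -EEXIST instead, which is the kind of breakage the revert avoids.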
[PATCH v2] net: macb: change GFP_KERNEL to GFP_ATOMIC
Function gem_add_flow_filter called on line 2958 inside lock on line 2949
but uses GFP_KERNEL

Generated by: scripts/coccinelle/locks/call_kern.cocci

Fixes: ae8223de3df5 ("net: macb: Added support for RX filtering")
CC: Rafal Ozieblo
Signed-off-by: Julia Lawall
Signed-off-by: Fengguang Wu
---

v2: Fix some broken email addresses. No change to the patch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
head:   fb20eb9d798d2f4c1a75b7fe981d72dfa8d7270d
commit: ae8223de3df5a0ce651d14a50dad31b9cae029f2 [2033/2251] net: macb: Added support for RX filtering

 macb_main.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -2799,7 +2799,7 @@ static int gem_add_flow_filter(struct ne
 	int ret = -EINVAL;
 	bool added = false;
 
-	newfs = kmalloc(sizeof(*newfs), GFP_KERNEL);
+	newfs = kmalloc(sizeof(*newfs), GFP_ATOMIC);
 	if (newfs == NULL)
 		return -ENOMEM;
 	memcpy(&newfs->fs, fs, sizeof(newfs->fs));
Re: [Patch net-next] act_mirred: use tcfm_dev in tcf_mirred_get_dev()
Fri, Dec 01, 2017 at 10:46:42PM CET, xiyou.wangc...@gmail.com wrote:
>On Fri, Dec 1, 2017 at 9:56 AM, Jiri Pirko wrote:
>>
>> Isn't this here so user may specify a ifindex of netdev which is not yet
>> present on the system (not sure how much sense that would make though...)
>
>How is this even possible? If an ifindex is not present, we return ENODEV:

Right, I missed this. Thanks.

>
>if (parm->ifindex) {
>	dev = __dev_get_by_index(net, parm->ifindex);
>	if (dev == NULL) {
>		if (exists)
>			tcf_idr_release(*a, bind);
>		return -ENODEV;
>	}
[PATCH net] nfp: fix port stats for mac representors
From: Pieter Jansen van Vuuren

Previously we swapped the tx_packets, tx_bytes and tx_dropped counters
with rx_packets, rx_bytes and rx_dropped counters, respectively. This
behaviour is correct and expected for VF representors but it should not
be swapped for physical port mac representors.

Signed-off-by: Pieter Jansen van Vuuren
Reviewed-by: Simon Horman
Reviewed-by: Jakub Kicinski
---
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
index 924a05e05da0..78b36c67c232 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
@@ -84,16 +84,13 @@ nfp_repr_phy_port_get_stats64(struct nfp_port *port,
 {
 	u8 __iomem *mem = port->eth_stats;
 
-	/* TX and RX stats are flipped as we are returning the stats as seen
-	 * at the switch port corresponding to the phys port.
-	 */
-	stats->tx_packets = readq(mem + NFP_MAC_STATS_RX_FRAMES_RECEIVED_OK);
-	stats->tx_bytes = readq(mem + NFP_MAC_STATS_RX_IN_OCTETS);
-	stats->tx_dropped = readq(mem + NFP_MAC_STATS_RX_IN_ERRORS);
+	stats->tx_packets = readq(mem + NFP_MAC_STATS_TX_FRAMES_TRANSMITTED_OK);
+	stats->tx_bytes = readq(mem + NFP_MAC_STATS_TX_OUT_OCTETS);
+	stats->tx_dropped = readq(mem + NFP_MAC_STATS_TX_OUT_ERRORS);
 
-	stats->rx_packets = readq(mem + NFP_MAC_STATS_TX_FRAMES_TRANSMITTED_OK);
-	stats->rx_bytes = readq(mem + NFP_MAC_STATS_TX_OUT_OCTETS);
-	stats->rx_dropped = readq(mem + NFP_MAC_STATS_TX_OUT_ERRORS);
+	stats->rx_packets = readq(mem + NFP_MAC_STATS_RX_FRAMES_RECEIVED_OK);
+	stats->rx_bytes = readq(mem + NFP_MAC_STATS_RX_IN_OCTETS);
+	stats->rx_dropped = readq(mem + NFP_MAC_STATS_RX_IN_ERRORS);
 }
 
 static void
-- 
2.15.1
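The distinction this patch draws between VF representors and physical port MAC representors can be modelled in a few lines of userspace C. This is a rough sketch only: the struct and function names below are hypothetical stand-ins, not the nfp driver's types, and only the packet counters are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the hardware counters and the reported stats. */
struct hw_counters { uint64_t rx_frames, tx_frames; };
struct port_stats  { uint64_t rx_packets, tx_packets; };

/*
 * Fill 'out' from 'hw'.  A VF representor reports the switch-side view,
 * so RX and TX are swapped; a physical port MAC representor reports the
 * counters as they are.
 */
static void fill_stats(const struct hw_counters *hw, bool is_vf_repr,
		       struct port_stats *out)
{
	assert(hw && out);
	if (is_vf_repr) {
		out->tx_packets = hw->rx_frames;
		out->rx_packets = hw->tx_frames;
	} else {
		out->tx_packets = hw->tx_frames;
		out->rx_packets = hw->rx_frames;
	}
}
```

The patch itself only touches the physical port path; the swapped branch is shown here purely for contrast.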
[PATCH 1/1] timecounter: Make cyclecounter struct part of timecounter struct
There is no real need for the users of timecounters to define cyclecounter
and timecounter variables separately. Since timecounter will always be
based on cyclecounter, have cyclecounter struct as member of timecounter
struct.

Suggested-by: Chris Wilson
Signed-off-by: Sagar Arun Kamble
Cc: Chris Wilson
Cc: Richard Cochran
Cc: John Stultz
Cc: Thomas Gleixner
Cc: Stephen Boyd
Cc: linux-ker...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: netdev@vger.kernel.org
Cc: intel-wired-...@lists.osuosl.org
Cc: linux-r...@vger.kernel.org
Cc: alsa-de...@alsa-project.org
Cc: kvm...@lists.cs.columbia.edu
---
 arch/microblaze/kernel/timer.c                     | 20 ++--
 drivers/clocksource/arm_arch_timer.c               | 19 ++--
 drivers/net/ethernet/amd/xgbe/xgbe-dev.c           |  3 +-
 drivers/net/ethernet/amd/xgbe/xgbe-ptp.c           |  9 +++---
 drivers/net/ethernet/amd/xgbe/xgbe.h               |  1 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h        |  1 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c   | 20 ++--
 drivers/net/ethernet/freescale/fec.h               |  1 -
 drivers/net/ethernet/freescale/fec_ptp.c           | 30 +-
 drivers/net/ethernet/intel/e1000e/e1000.h          |  1 -
 drivers/net/ethernet/intel/e1000e/netdev.c         | 27
 drivers/net/ethernet/intel/e1000e/ptp.c            |  2 +-
 drivers/net/ethernet/intel/igb/igb.h               |  1 -
 drivers/net/ethernet/intel/igb/igb_ptp.c           | 25 ---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h           |  1 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c       | 17 +-
 drivers/net/ethernet/mellanox/mlx4/en_clock.c      | 28 -
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h       |  1 -
 .../net/ethernet/mellanox/mlx5/core/lib/clock.c    | 34 ++--
 drivers/net/ethernet/qlogic/qede/qede_ptp.c        | 20 ++--
 drivers/net/ethernet/ti/cpts.c                     | 36 --
 drivers/net/ethernet/ti/cpts.h                     |  1 -
 include/linux/mlx5/driver.h                        |  1 -
 include/linux/timecounter.h                        |  4 +--
 include/sound/hdaudio.h                            |  1 -
 kernel/time/timecounter.c                          | 28 -
 sound/hda/hdac_stream.c                            |  7 +++--
 virt/kvm/arm/arch_timer.c                          |  6 ++--
 28 files changed, 163 insertions(+), 182 deletions(-)

diff --git a/arch/microblaze/kernel/timer.c b/arch/microblaze/kernel/timer.c
index 7de941c..b7f89e9 100644
--- a/arch/microblaze/kernel/timer.c
+++ b/arch/microblaze/kernel/timer.c
@@ -199,27 +199,25 @@ static u64 xilinx_read(struct clocksource *cs)
 	return (u64)xilinx_clock_read();
 }
 
-static struct timecounter xilinx_tc = {
-	.cc = NULL,
-};
-
 static u64 xilinx_cc_read(const struct cyclecounter *cc)
 {
 	return xilinx_read(NULL);
 }
 
-static struct cyclecounter xilinx_cc = {
-	.read = xilinx_cc_read,
-	.mask = CLOCKSOURCE_MASK(32),
-	.shift = 8,
+static struct timecounter xilinx_tc = {
+	.cc.read = xilinx_cc_read,
+	.cc.mask = CLOCKSOURCE_MASK(32),
+	.cc.mult = 0,
+	.cc.shift = 8,
 };
 
 static int __init init_xilinx_timecounter(void)
 {
-	xilinx_cc.mult = div_sc(timer_clock_freq, NSEC_PER_SEC,
-				xilinx_cc.shift);
+	struct cyclecounter *cc = &xilinx_tc.cc;
+
+	cc->mult = div_sc(timer_clock_freq, NSEC_PER_SEC, cc->shift);
 
-	timecounter_init(&xilinx_tc, &xilinx_cc, sched_clock());
+	timecounter_init(&xilinx_tc, sched_clock());
 
 	return 0;
 }
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 57cb2f0..31543e5 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -179,11 +179,6 @@ static u64 arch_counter_read_cc(const struct cyclecounter *cc)
 	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
 };
 
-static struct cyclecounter cyclecounter __ro_after_init = {
-	.read = arch_counter_read_cc,
-	.mask = CLOCKSOURCE_MASK(56),
-};
-
 struct ate_acpi_oem_info {
 	char oem_id[ACPI_OEM_ID_SIZE + 1];
 	char oem_table_id[ACPI_OEM_TABLE_ID_SIZE + 1];
@@ -915,7 +910,10 @@ static u64 arch_counter_get_cntvct_mem(void)
 	return ((u64) vct_hi << 32) | vct_lo;
 }
 
-static struct arch_timer_kvm_info arch_timer_kvm_info;
+static struct arch_timer_kvm_info arch_timer_kvm_info = {
+	.timecounter.cc.read = arch_counter_read_cc,
+	.timecounter.cc.mask = CLOCKSOURCE_MASK(56),
+};
 
 struct arch_timer_kvm_info *arch_timer_get_kvm_info(void)
 {
@@ -925,6 +923,7 @@ struct arch_timer_kvm_info *arch_timer_get_kvm_info(void)
 static void __init
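The shape of the change proposed in this patch, a cyclecounter embedded in the timecounter and an init call that no longer takes a separate cyclecounter pointer, can be sketched in userspace C. This is a simplified model under stated assumptions: the field names follow the patch, but the types are reduced, and fake_read(), demo_tc and demo_timecounter_init() are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified userspace model; field names follow the patch, types are reduced. */
struct cyclecounter {
	uint64_t (*read)(const struct cyclecounter *cc);
	uint64_t mask;
	uint32_t mult;
	uint32_t shift;
};

struct timecounter {
	struct cyclecounter cc;	/* embedded, no longer a pointer */
	uint64_t cycle_last;
	uint64_t nsec;
};

static uint64_t fake_read(const struct cyclecounter *cc)
{
	(void)cc;
	return 1234;	/* constant value, stands in for a hardware register */
}

/* One object now carries both the counter description and the time state. */
static struct timecounter demo_tc = {
	.cc.read  = fake_read,
	.cc.mask  = 0xffffffffu,
	.cc.shift = 8,
};

/* Model of the two-argument init: the cyclecounter comes from inside 'tc'. */
static void demo_timecounter_init(struct timecounter *tc, uint64_t start_ns)
{
	assert(tc && tc->cc.read);
	tc->cycle_last = tc->cc.read(&tc->cc);
	tc->nsec = start_ns;
}
```

A driver in this model would call demo_timecounter_init(&demo_tc, start) and keep only the one variable around, which is the simplification the commit message argues for.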
[PATCH net-next v2] net: dsa: Allow compiling out legacy support
Introduce a configuration option: CONFIG_NET_DSA_LEGACY allowing to compile out
support for the old platform device and Device Tree binding registration.
Support for these configurations is scheduled to be removed in 4.17.

Signed-off-by: Florian Fainelli
---
Changes in v2:

- make the option enabled by default
- make the .probe function part of NET_DSA_LEGACY
- make mv88e6060 depend on NET_DSA_LEGACY
- move dsa_legacy_fdb_{add,del} out of net/dsa/legacy.c

 drivers/net/dsa/Kconfig          |  2 +-
 drivers/net/dsa/mv88e6xxx/chip.c |  4
 include/net/dsa.h                | 11 +++
 net/dsa/Kconfig                  |  9 +
 net/dsa/Makefile                 |  3 ++-
 net/dsa/dsa_priv.h               |  9 +
 net/dsa/legacy.c                 | 20
 net/dsa/slave.c                  | 20
 8 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index 83a9bc892a3b..2b81b97e994f 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -33,7 +33,7 @@ config NET_DSA_MT7530
 
 config NET_DSA_MV88E6060
 	tristate "Marvell 88E6060 ethernet switch chip support"
-	depends on NET_DSA
+	depends on NET_DSA && NET_DSA_LEGACY
 	select NET_DSA_TAG_TRAILER
 	---help---
 	  This enables support for the Marvell 88E6060 ethernet switch
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 8171055fde7a..b2afbd730051 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -3739,6 +3739,7 @@ static enum dsa_tag_protocol mv88e6xxx_get_tag_protocol(struct dsa_switch *ds,
 	return chip->info->tag_protocol;
 }
 
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
 static const char *mv88e6xxx_drv_probe(struct device *dsa_dev,
 				       struct device *host_dev, int sw_addr,
 				       void **priv)
@@ -3786,6 +3787,7 @@ static const char *mv88e6xxx_drv_probe(struct device *dsa_dev,
 
 	return NULL;
 }
+#endif
 
 static int mv88e6xxx_port_mdb_prepare(struct dsa_switch *ds, int port,
 				      const struct switchdev_obj_port_mdb *mdb,
@@ -3827,7 +3829,9 @@ static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, int port,
 }
 
 static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
 	.probe			= mv88e6xxx_drv_probe,
+#endif
 	.get_tag_protocol	= mv88e6xxx_get_tag_protocol,
 	.setup			= mv88e6xxx_setup,
 	.adjust_link		= mv88e6xxx_adjust_link,
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 2a05738570d8..e4326695653e 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -315,12 +315,14 @@ static inline u8 dsa_upstream_port(struct dsa_switch *ds)
 typedef int dsa_fdb_dump_cb_t(const unsigned char *addr, u16 vid,
 			      bool is_static, void *data);
 struct dsa_switch_ops {
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
 	/*
 	 * Legacy probing.
 	 */
 	const char	*(*probe)(struct device *dsa_dev,
 				  struct device *host_dev, int sw_addr,
 				  void **priv);
+#endif
 
 	enum dsa_tag_protocol (*get_tag_protocol)(struct dsa_switch *ds,
 						  int port);
@@ -472,11 +474,20 @@ struct dsa_switch_driver {
 	const struct dsa_switch_ops *ops;
 };
 
+#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
 /* Legacy driver registration */
 void register_switch_driver(struct dsa_switch_driver *type);
 void unregister_switch_driver(struct dsa_switch_driver *type);
 struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev);
 
+#else
+static inline void register_switch_driver(struct dsa_switch_driver *type) { }
+static inline void unregister_switch_driver(struct dsa_switch_driver *type) { }
+static inline struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev)
+{
+	return NULL;
+}
+#endif
 struct net_device *dsa_dev_to_net_device(struct device *dev);
 
 /* Keep inline for faster access in hot path */
diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index 03c3bdf25468..bbf2c82cf7b2 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -16,6 +16,15 @@ config NET_DSA
 
 if NET_DSA
 
+config NET_DSA_LEGACY
+	bool "Support for older platform device and Device Tree registration"
+	default y
+	---help---
+	  Say Y if you want to enable support for the older platform device and
+	  deprecated Device Tree binding registration.
+
+	  This feature is scheduled for removal in 4.17.
+
 # tagging formats
 config NET_DSA_TAG_BRCM
 	bool
diff --git a/net/dsa/Makefile b/net/dsa/Makefile
index 0e13c1f95d13..9e4d3536f977 100644
--- a/net/dsa/Makefile
+++ b/net/dsa/Makefile
@@ -1,7 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
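The header change in this patch follows a common pattern: when a feature is compiled out, callers still see the same function names, but as empty inline stubs. A minimal userspace sketch of that pattern follows. It assumes a plain preprocessor switch, DEMO_LEGACY, in place of CONFIG_NET_DSA_LEGACY, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch; in the kernel this role is played by CONFIG_NET_DSA_LEGACY. */
#ifndef DEMO_LEGACY
#define DEMO_LEGACY 0
#endif

struct demo_driver { const char *name; };

#if DEMO_LEGACY
static const struct demo_driver *registered;

static void demo_register(const struct demo_driver *drv)
{
	assert(drv);
	registered = drv;
}

static const struct demo_driver *demo_lookup(void)
{
	return registered;
}
#else
/* Stubs: callers compile unchanged, the legacy path simply does nothing. */
static inline void demo_register(const struct demo_driver *drv) { (void)drv; }
static inline const struct demo_driver *demo_lookup(void) { return NULL; }
#endif
```

With DEMO_LEGACY left at 0, a call to demo_register() compiles and links but has no effect, so callers need no #if blocks of their own.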
[PATCH net-next] enic: add sw timestamp support
Add ethtool ops to advertise sw timestamping. Call skb_tx_timestamp()
just before ringing the wq doorbell.

Signed-off-by: Govindarajulu Varadarajan
---
 drivers/net/ethernet/cisco/enic/enic_ethtool.c | 12
 drivers/net/ethernet/cisco/enic/enic_main.c    |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/cisco/enic/enic_ethtool.c b/drivers/net/ethernet/cisco/enic/enic_ethtool.c
index 462d0ce51240..efb9333c7cf8 100644
--- a/drivers/net/ethernet/cisco/enic/enic_ethtool.c
+++ b/drivers/net/ethernet/cisco/enic/enic_ethtool.c
@@ -18,6 +18,7 @@
 
 #include
 #include
+#include
 
 #include "enic_res.h"
 #include "enic.h"
@@ -578,6 +579,16 @@ static int enic_set_rxfh(struct net_device *netdev, const u32 *indir,
 	return __enic_set_rsskey(enic);
 }
 
+static int enic_get_ts_info(struct net_device *netdev,
+			    struct ethtool_ts_info *info)
+{
+	info->so_timestamping = SOF_TIMESTAMPING_TX_SOFTWARE |
+				SOF_TIMESTAMPING_RX_SOFTWARE |
+				SOF_TIMESTAMPING_SOFTWARE;
+
+	return 0;
+}
+
 static const struct ethtool_ops enic_ethtool_ops = {
 	.get_drvinfo = enic_get_drvinfo,
 	.get_msglevel = enic_get_msglevel,
@@ -597,6 +608,7 @@ static const struct ethtool_ops enic_ethtool_ops = {
 	.get_rxfh = enic_get_rxfh,
 	.set_rxfh = enic_set_rxfh,
 	.get_link_ksettings = enic_get_ksettings,
+	.get_ts_info = enic_get_ts_info,
 };
 
 void enic_set_ethtool_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index e130fb757e7b..d98676e43e03 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -856,6 +856,7 @@ static netdev_tx_t enic_hard_start_xmit(struct sk_buff *skb,
 	if (vnic_wq_desc_avail(wq) < MAX_SKB_FRAGS + ENIC_DESC_MAX_SPLITS)
 		netif_tx_stop_queue(txq);
 
+	skb_tx_timestamp(skb);
 	if (!skb->xmit_more || netif_xmit_stopped(txq))
 		vnic_wq_doorbell(wq);
 
-- 
2.15.0
Re: [PATCH/RFC] Re: 'perf test BPF' failing, libbpf regression wrt "basic API for BPF obj name"
On 12/1/17 9:51 AM, Arnaldo Carvalho de Melo wrote:
> But this is not just testcase expectations, the usecase is someone
> wanting to use a newer tool, with perhaps some new features of interest
> that don't depend on changes in the kernel, in an older kernel on a
> system where updating it is not possible or desirable.

I think it's also dangerous for the core library like libbpf to be
smarter than the tool that is using it.

In this case we added prog and map names by default into loader and
create_map functions to make sure that all tools pick them up
automatically and we can see a bit more human readable bpf names in
kernel stack traces and in debug tools like bpftool, bcc/bps.

When kernel is older and doesn't support prog/map names, it's perfectly
reasonable to fall back to map creation without the name, but library
shouldn't be doing it in all cases. Like prog_load command recently got
new prog_ifindex field. It would be incorrect to fallback to loading
without it.
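The fallback discussed in this message can be sketched as a small userspace model. It is not libbpf code: fake_kernel_create_map() and create_map_with_fallback() are hypothetical names, the "kernel" is simulated by a flag, and the choice of -EINVAL as the rejection code is an assumption made for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel: an old kernel rejects a named map. */
static int fake_kernel_create_map(const char *name, bool kernel_knows_names)
{
	if (name && !kernel_knows_names)
		return -EINVAL;	/* assumed error code for the sketch */
	return 3;		/* pretend file descriptor */
}

/*
 * Hypothetical library helper.  The name is purely cosmetic, so retrying
 * without it is harmless.  A field that changes behaviour (the message
 * mentions prog_ifindex) should not be dropped silently like this.
 */
static int create_map_with_fallback(const char *name, bool kernel_knows_names)
{
	int fd = fake_kernel_create_map(name, kernel_knows_names);

	if (fd == -EINVAL && name)
		fd = fake_kernel_create_map(NULL, kernel_knows_names);
	return fd;
}
```

The design point is that the retry is limited to one optional, cosmetic attribute; it is not a general "strip new fields until the call succeeds" loop.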
Re: [PATCH net-next V3 3/3] net: add a sysctl to make auto flowlabel consistent
On Fri, Dec 1, 2017 at 3:31 PM, Shaohua Li wrote:
> From: Shaohua Li
>
> Currently if there is negative routing, we change sock's txhash, so the
> sock will have a different flowlabel and route to different path.
> According to Tom, we'd better to have option to enable this, because some
> routers require flowlabel consistent. By default, we maintain consistent
> flowlabel, eg, negative routing doesn't change flowlabel.
>
> Suggested-by: Tom Herbert
> Signed-off-by: Shaohua Li
> ---
>  Documentation/networking/ip-sysctl.txt |  7 +++
>  include/net/netns/ipv6.h               |  1 +
>  include/net/sock.h                     | 28 +++-
>  net/ipv6/af_inet6.c                    |  1 +
>  net/ipv6/sysctl_net_ipv6.c             |  8
>  5 files changed, 32 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 46c7e10..14132a0 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -1345,6 +1345,13 @@ auto_flowlabels - INTEGER
>  	be disabled by the socket option
>  	Default: 1
>
> +consistent_auto_flowlabel - BOOLEAN

I think we should call it consistent_txhash since this isn't just
about the flow label.

> +	When auto_flowlabels is enabled, this option makes socket flowlabel
> +	consistent in the lifetime.
> +	TRUE: enabled
> +	FALSE: disabled
> +	Default: TRUE
> +
>  flowlabel_state_ranges - BOOLEAN
>  	Split the flow label number space into two ranges. 0-0x7 is
>  	reserved for the IPv6 flow manager facility, 0x8-0xF
> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
> index 987cc45..e55f851 100644
> --- a/include/net/netns/ipv6.h
> +++ b/include/net/netns/ipv6.h
> @@ -30,6 +30,7 @@ struct netns_sysctl_ipv6 {
>  	int ip6_rt_min_advmss;
>  	int flowlabel_consistency;
>  	int auto_flowlabels;
> +	int consistent_auto_flowlabel;
>  	int icmpv6_time;
>  	int anycast_src_echo_reply;
>  	int ip_nonlocal_bind;
> diff --git a/include/net/sock.h b/include/net/sock.h
> index b9cb9d2..45e868f 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1729,6 +1729,18 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
>  	return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
>  }
>
> +static inline
> +struct net *sock_net(const struct sock *sk)
> +{
> +	return read_pnet(&sk->sk_net);
> +}
> +
> +static inline
> +void sock_net_set(struct sock *sk, struct net *net)
> +{
> +	write_pnet(&sk->sk_net, net);
> +}
> +
>  static inline void sk_set_txhash(struct sock *sk, u32 hash)
>  {
>  	sk->sk_txhash = hash;
> @@ -1736,7 +1748,9 @@ static inline void sk_set_txhash(struct sock *sk, u32 hash)
>
>  static inline void sk_rethink_txhash(struct sock *sk)
>  {
> -	if (sk->sk_txhash) {
> +	struct net *net = sock_net(sk);
> +
> +	if (sk->sk_txhash && !net->ipv6.sysctl.consistent_auto_flowlabel) {
>  		u32 v = prandom_u32();
>  		sk->sk_txhash = v ?: 1;
>  	}
> @@ -2291,18 +2305,6 @@ static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
>  	__kfree_skb(skb);
>  }
>
> -static inline
> -struct net *sock_net(const struct sock *sk)
> -{
> -	return read_pnet(&sk->sk_net);
> -}
> -
> -static inline
> -void sock_net_set(struct sock *sk, struct net *net)
> -{
> -	write_pnet(&sk->sk_net, net);
> -}
> -
>  static inline struct sock *skb_steal_sock(struct sk_buff *skb)
>  {
>  	if (skb->sk) {
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index c26f712..fe9b312 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -807,6 +807,7 @@ static int __net_init inet6_net_init(struct net *net)
>  	net->ipv6.sysctl.icmpv6_time = 1*HZ;
>  	net->ipv6.sysctl.flowlabel_consistency = 1;
>  	net->ipv6.sysctl.auto_flowlabels = IP6_DEFAULT_AUTO_FLOW_LABELS;
> +	net->ipv6.sysctl.consistent_auto_flowlabel = 1;
>  	net->ipv6.sysctl.idgen_retries = 3;
>  	net->ipv6.sysctl.idgen_delay = 1 * HZ;
>  	net->ipv6.sysctl.flowlabel_state_ranges = 0;
> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
> index a789a8a..8908092 100644
> --- a/net/ipv6/sysctl_net_ipv6.c
> +++ b/net/ipv6/sysctl_net_ipv6.c
> @@ -126,6 +126,13 @@ static struct ctl_table ipv6_table_template[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec
>  	},
> +	{
> +		.procname	= "consistent_auto_flowlabel",
> +		.data		= &init_net.ipv6.sysctl.consistent_auto_flowlabel,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec
> +	},
>
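The behaviour change to sk_rethink_txhash() in the quoted patch can be modelled in userspace as below. This is a sketch under stated assumptions: struct demo_sock and demo_rethink_txhash() are hypothetical names, the sysctl is reduced to a per-socket boolean, and rand() stands in for the kernel's prandom_u32().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified userspace model of the socket fields touched by the patch. */
struct demo_sock {
	uint32_t txhash;
	bool consistent;	/* models the proposed sysctl */
};

/* Re-roll the hash unless it is unset or the sysctl asks for a stable one. */
static void demo_rethink_txhash(struct demo_sock *sk)
{
	assert(sk);
	if (sk->txhash && !sk->consistent) {
		uint32_t v = (uint32_t)rand();

		sk->txhash = v ? v : 1;	/* 0 means "no hash", so avoid it */
	}
}
```

With the flag set, a rethink leaves the hash, and therefore the derived flow label, untouched; with it clear, the old behaviour of picking a fresh non-zero hash is kept.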
Re: [PATCH net-next 1/5] rhashtable: Don't reset walker table in rhashtable_walk_start
On Fri, Dec 1, 2017 at 3:29 PM, Tom Herbert wrote:
> On Fri, Dec 1, 2017 at 2:18 PM, Herbert Xu wrote:
>> On Thu, Nov 30, 2017 at 04:03:01PM -0800, Tom Herbert wrote:
>>> Remove the code that resets the walker table. The walker table should
>>> only be initialized in the walk init function or when a future table is
>>> encountered. If the walker table is NULL this is the indication that
>>> the walk has completed and this information can be used to break a
>>> multi-call walk in the table (e.g. successive calls to netlink_dump
>>> that are dumping elements of an rhashtable).
>>>
>>> This also allows us to change rhashtable_walk_start to return void
>>> since the only error it was returning was -EAGAIN for a table change.
>>> This patch changes all the callers of rhashtable_walk_start to expect
>>> void which eliminates logic needed to check the return value for a
>>> rare condition. Note that -EAGAIN will be returned in a call
>>> to rhashtable_walk_next which seems to always follow the start
>>> of the walk so there should be no behavioral change in doing this.
>>>
>>> Signed-off-by: Tom Herbert
>>
>> Doesn't this mean that if a walk encounters a rehash you may end up
>> missing half or more of the hash table?
>>
> Because of tbl->rehash < tbl->size conditions in walk stop? How about
> we add a flag to iter that indicates table needs a reset and set it
> along with setting walker.tbl to NULL? On the next walk start do the
> reload when walker.tbl is NULL and flag is set. In this case walk
> start would automatically set walker.tbl which is already done by
> nearly all callers already in that they ignore -EAGAIN returned from
> start walk.
>
Herbert,

Looking at this some more, I am wondering if the walkers list is
necessary. When a rehash table is done, the new table is assigned to
ht->tbl and walker->tbl is cleared for all walkers. In walk start the
walker tbl is checked and if it's NULL ht->tbl is loaded. Assuming
that -EAGAIN isn't interesting to callers here, it seems like we could
just get iter->walker.tbl in each call to walk start and not need to
maintain the walkers list at all. Am I missing something?

Tom

> Thanks,
> Tom
>
>> Cheers,
>> --
>> Email: Herbert Xu
>> Home Page: http://gondor.apana.org.au/~herbert/
>> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
pull-request: bpf 2017-12-02
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a compilation warning in xdp redirect tracepoint due to
   missing bpf.h include that pulls in struct bpf_map, from Xie.

2) Limit the maximum number of attachable BPF progs for a given
   perf event as long as uabi is not frozen yet. The hard upper
   limit is now 64 and therefore the same as with BPF multi-prog
   for cgroups. Also add related error checking for the sample
   BPF loader when enabling and attaching to the perf event, from
   Yonghong.

3) Specifically set the RLIMIT_MEMLOCK for the test_verifier_log
   case, so that the test case can always pass and not fail in
   some environments due to too low default limit, also from
   Yonghong.

4) Fix up a missing license header comment for kernel/bpf/offload.c,
   from Jakub.

5) Several fixes for bpftool, among others a crash on incorrect
   arguments when json output is used, error message handling
   fixes on unknown options and proper destruction of json writer
   for some exit cases, all from Quentin.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

The following changes since commit 2e724dca7749223204bbae21745c0e3fc932700a:

  tipc: eliminate access after delete in group_filter_msg() (2017-11-27 14:44:45 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

for you to fetch changes up to 0ec9552b43b98deb882bf48efd347be4bd7afc9f:

  samples/bpf: add error checking for perf ioctl calls in bpf loader (2017-12-01 02:59:21 +0100)

Daniel Borkmann (1):
      Merge branch 'bpftool-misc-fixes'

Jakub Kicinski (1):
      bpf: offload: add a license header

Quentin Monnet (6):
      tools: bpftool: fix crash on bad parameters with JSON
      tools: bpftool: clean up the JSON writer before exiting in usage()
      tools: bpftool: make error message from getopt_long() JSON-friendly
      tools: bpftool: remove spurious line break from error message
      tools: bpftool: unify installation directories
      tools: bpftool: declare phony targets as such

Xie XiuQi (1):
      trace/xdp: fix compile warning: 'struct bpf_map' declared inside parameter list

Yonghong Song (3):
      tools/bpf: adjust rlimit RLIMIT_MEMLOCK for test_verifier_log
      bpf: set maximum number of attached progs to 64 for a single perf tp
      samples/bpf: add error checking for perf ioctl calls in bpf loader

 include/trace/events/xdp.h                      |  1 +
 kernel/bpf/core.c                               |  3 ++-
 kernel/bpf/offload.c                            | 15 +++
 kernel/trace/bpf_trace.c                        |  8 ++
 samples/bpf/bpf_load.c                          | 14 --
 tools/bpf/bpftool/Documentation/Makefile        |  2 +-
 tools/bpf/bpftool/Makefile                      |  7 ++---
 tools/bpf/bpftool/main.c                        | 36 -
 tools/bpf/bpftool/main.h                        |  5 ++--
 tools/testing/selftests/bpf/test_verifier_log.c |  7 +
 10 files changed, 77 insertions(+), 21 deletions(-)
Re: [PATCH net-next 0/2] allow setting gso_maximum values
On Fri, Dec 01, 2017 at 03:30:01PM -0800, Stephen Hemminger wrote:
> On Fri, 1 Dec 2017 12:11:56 -0800
> Stephen Hemminger wrote:
>
> > This is another way of addressing the GSO maximum performance issues for
> > containers on Azure. What happens is that the underlying infrastructure uses
> > a overlay network such that GSO packets over 64K - vlan header end up cause
> > either guest or host to have do expensive software copy and fragmentation.
> >
> > The netvsc driver reports GSO maximum settings correctly, the issue
> > is that containers on veth devices still have the larger settings.
> > One solution that was examined was propogating the values back
> > through the bridge device, but this does not work for cases where
> > virtual container network is done on L3.
> >
> > This patch set punts the problem to the orchestration layer that sets
> > up the container network. It also enables other virtual devices
> > to have configurable settings for GSO maximum.
> >
> > Stephen Hemminger (2):
> >   rtnetlink: allow GSO maximums to be passed to device
> >   veth: allow configuring GSO maximums
> >
> >  drivers/net/veth.c   | 20
> >  net/core/rtnetlink.c |  2 ++
> >  2 files changed, 22 insertions(+)
> >
>
> I would like a confirmation from Intel that is doing Docker testing
> that this works for them before merging.

This change and its iproute2 counterpart allow creating veth pairs with
specific gso_max{size,segs}. Thanks.

However, the docker code that sets up veth pairs is go-compiled in their
libnetwork. End-users won't be able to tweak gso settings at veth
creation. In this case, we would need to add ioctl
(ip/iplink.c:do_set) support to allow changes after veth is created.
x86 boot broken on -rc1?
Hi!

I'm hitting these after DaveM pulled rc1 into net-next on my
Xeon E5-2630 v4 box. It also happens on linux-next. Did anyone
else experience it? (.config attached)

[5.003771] WARNING: CPU: 14 PID: 1 at ../arch/x86/events/intel/uncore.c:936 uncore_pci_probe+0x285/0x2b0
[5.007544] Modules linked in:
[5.007544] CPU: 14 PID: 1 Comm: swapper/0 Not tainted 4.15.0-rc1-perf-00225-gb2a4e0a76b1d #782
[5.007544] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
[5.007544] task: 9e842725 task.stack: 8a63fd2d
[5.007544] RIP: 0010:uncore_pci_probe+0x285/0x2b0
[5.007544] RSP: :ad8580163d10 EFLAGS: 00010286
[5.007544] RAX: 98576cc3df30 RBX: b08037e0 RCX: b0c1a120
[5.007544] RDX: RSI: RDI: b0c1a960
[5.007544] RBP: 985b6c00ac00 R08: fffe R09: 000f
[5.007544] R10: 98576f1b6018 R11: 0022 R12: 985b6c641000
[5.007544] R13: 0001 R14: 0001 R15: 0001
[5.007544] FS: () GS:98576fb8() knlGS:
[5.007544] CS: 0010 DS: ES: CR0: 80050033
[5.007544] CR2: CR3: 000185c09001 CR4: 003606e0
[5.007544] DR0: DR1: DR2:
[5.007544] DR3: DR6: fffe0ff0 DR7: 0400
[5.007544] Call Trace:
[5.007544] local_pci_probe+0x3d/0x90
[5.007544] ? pci_match_device+0xd9/0x100
[5.007544] pci_device_probe+0x122/0x180
[5.007544] driver_probe_device+0x246/0x330
[5.007544] ? set_debug_rodata+0x11/0x11
[5.007544] __driver_attach+0x8a/0x90
[5.007544] ? driver_probe_device+0x330/0x330
[5.007544] bus_for_each_dev+0x5c/0x90
[5.007544] bus_add_driver+0x196/0x220
[5.007544] driver_register+0x57/0xc0
[5.007544] intel_uncore_init+0x1e3/0x249
[5.007544] ? uncore_type_init+0x193/0x193
[5.007544] ? set_debug_rodata+0x11/0x11
[5.007544] do_one_initcall+0x4b/0x190
[5.007544] kernel_init_freeable+0x16e/0x1f5
[5.007544] ? rest_init+0xd0/0xd0
[5.007544] kernel_init+0xa/0x100
[5.007544] ret_from_fork+0x1f/0x30
[5.007544] Code: 48 8b 52 08 48 85 d2 74 0d 89 44 24 04 48 89 df ff d2 8b 44 24 04 48 89 df 89 44 24 04 e8 54 0a 1c 00 8b 44 24 0
[5.007544] ---[ end trace 4dc4c3d5f5afcd2f ]---
[5.244504] bdx_uncore: probe of :ff:08.2 failed with error -22
[5.251604] bdx_uncore: probe of :ff:0b.1 failed with error -22
[5.258711] bdx_uncore: probe of :ff:10.1 failed with error -22
[5.265819] bdx_uncore: probe of :ff:14.0 failed with error -22
[5.272919] bdx_uncore: probe of :ff:14.1 failed with error -22
[5.280019] bdx_uncore: probe of :ff:15.0 failed with error -22
[5.287112] bdx_uncore: probe of :ff:15.1 failed with error -22
[5.294376] WARNING: CPU: 1 PID: 15 at ../arch/x86/events/intel/uncore.c:1065 uncore_change_type_ctx.isra.5+0xe6/0xf0
[5.298362] Modules linked in:
[5.298362] CPU: 1 PID: 15 Comm: cpuhp/1 Tainted: GW 4.15.0-rc1-perf-00225-gb2a4e0a76b1d #782
[5.298362] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
[5.298362] task: ae78bc8f task.stack: f79660c1
[5.298362] RIP: 0010:uncore_change_type_ctx.isra.5+0xe6/0xf0
[5.298362] RSP: :ad85833b3db8 EFLAGS: 00010213
[5.298362] RAX: RBX: 9857669b0200 RCX: 0001
[5.298362] RDX: 985b6f00 RSI: 985b66580400 RDI: b0c1ae8c
[5.298362] RBP: 985b66580400 R08: b0c1ae8c R09: 0001
[5.298362] R10: R11: 003d0900 R12:
[5.298362] R13: R14: 0001 R15: 0008
[5.298362] FS: () GS:985b6f00() knlGS:
[5.298362] CS: 0010 DS: ES: CR0: 80050033
[5.298362] CR2: CR3: 000185c09001 CR4: 003606e0
[5.298362] DR0: DR1: DR2:
[5.298362] DR3: DR6: fffe0ff0 DR7: 0400
[5.298362] Call Trace:
[5.298362] uncore_event_cpu_online+0x283/0x340
[5.298362] ? uncore_event_cpu_offline+0x180/0x180
[5.298362] cpuhp_invoke_callback+0x8c/0x620
[5.298362] ? __schedule+0x1ad/0x6c0
[5.298362] ? sort_range+0x20/0x20
[5.298362] cpuhp_thread_fun+0xbc/0x140
[5.298362] smpboot_thread_fn+0x114/0x1d0
[5.298362] kthread+0x111/0x130
[5.298362] ? kthread_create_on_node+0x40/0x40
[5.298362] ret_from_fork+0x1f/0x30
[5.298362] Code: 2a 44 89 73 10 41 83 c4 01 48 81 c5 40 01 00 00 45 3b 20 7c cf 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f f
[5.298362] ---[ end trace
[Patch net-next] net_sched: get rid of rcu_barrier() in tcf_block_put_ext()
Both Eric and Paolo noticed the rcu_barrier() we use in
tcf_block_put_ext() could be a performance bottleneck when
we have lots of filters.

Paolo provided the following to demonstrate the issue:

tc qdisc add dev lo root htb
for I in `seq 1 1000`; do
        tc class add dev lo parent 1: classid 1:$I htb rate 100kbit
        tc qdisc add dev lo parent 1:$I handle $((I + 1)): htb
        for J in `seq 1 10`; do
                tc filter add dev lo parent $((I + 1)): u32 match ip src 1.1.1.$J
        done
done
time tc qdisc del dev root

real    0m54.764s
user    0m0.023s
sys     0m0.000s

The rcu_barrier() there is to ensure we free the block after all chains
are gone, that is, to queue tcf_block_put_final() at the tail of workqueue.
We can achieve this ordering requirement by refcnt'ing tcf block instead,
that is, the tcf block is freed only when the last chain in this block is
gone. This also simplifies the code.

Paolo reported after this patch we get:

real    0m0.017s
user    0m0.000s
sys     0m0.017s

Tested-by: Paolo Abeni
Cc: Eric Dumazet
Cc: Jiri Pirko
Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
---
 include/net/sch_generic.h |  2 +-
 net/sched/cls_api.c       | 31 +--
 2 files changed, 10 insertions(+), 23 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 65d0d25f2648..b013ded1a38d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -278,7 +278,7 @@ struct tcf_block {
 	struct net *net;
 	struct Qdisc *q;
 	struct list_head cb_list;
-	struct work_struct work;
+	unsigned int nr_chains;
 };

 static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index ddcf04b4ab43..dec0d36078c8 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -190,6 +190,7 @@ static struct tcf_chain *tcf_chain_create(struct tcf_block *block,
 		return NULL;
 	list_add_tail(&chain->list, &block->chain_list);
 	chain->block = block;
+	block->nr_chains++;
 	chain->index = chain_index;
 	chain->refcnt = 1;
 	return chain;
@@ -218,8 +219,12 @@ static void tcf_chain_flush(struct tcf_chain *chain)

 static void tcf_chain_destroy(struct tcf_chain *chain)
 {
+	struct tcf_block *block = chain->block;
+
 	list_del(&chain->list);
 	kfree(chain);
+	if (!--block->nr_chains)
+		kfree(block);
 }

 static void tcf_chain_hold(struct tcf_chain *chain)
@@ -330,27 +335,13 @@ int tcf_block_get(struct tcf_block **p_block,
 }
 EXPORT_SYMBOL(tcf_block_get);

-static void tcf_block_put_final(struct work_struct *work)
-{
-	struct tcf_block *block = container_of(work, struct tcf_block, work);
-	struct tcf_chain *chain, *tmp;
-
-	rtnl_lock();
-
-	/* At this point, all the chains should have refcnt == 1. */
-	list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
-		tcf_chain_put(chain);
-	rtnl_unlock();
-	kfree(block);
-}
-
 /* XXX: Standalone actions are not allowed to jump to any chain, and bound
  * actions should be all removed after flushing.
  */
 void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
 		       struct tcf_block_ext_info *ei)
 {
-	struct tcf_chain *chain;
+	struct tcf_chain *chain, *tmp;

 	/* Hold a refcnt for all chains, except 0, so that they don't disappear
 	 * while we are iterating.
@@ -364,13 +355,9 @@ void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,

 	tcf_block_offload_unbind(block, q, ei);

-	INIT_WORK(&block->work, tcf_block_put_final);
-	/* Wait for existing RCU callbacks to cool down, make sure their works
-	 * have been queued before this. We can not flush pending works here
-	 * because we are holding the RTNL lock.
-	 */
-	rcu_barrier();
-	tcf_queue_work(&block->work);
+	/* At this point, all the chains should have refcnt >= 1. */
+	list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+		tcf_chain_put(chain);
 }
 EXPORT_SYMBOL(tcf_block_put_ext);

--
2.13.0
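The ordering argument in the changelog above can be tried outside the kernel. What follows is only a minimal userspace sketch under stated assumptions, not the kernel code: struct block, struct chain, block_create(), chain_create() and chain_destroy() are hypothetical stand-ins for the tcf structures and helpers, the freed pointer exists purely so the behaviour can be observed, and there is no locking because the sketch is single-threaded. It shows the block being released by whichever chain is destroyed last, so nothing has to wait for a barrier or a deferred work item.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tcf_block. */
struct block {
	unsigned int nr_chains;	/* number of chains still alive in this block */
	bool *freed;		/* observation hook: set when the block is released */
};

/* Hypothetical stand-in for struct tcf_chain. */
struct chain {
	struct block *block;
};

static struct block *block_create(bool *freed)
{
	struct block *b = calloc(1, sizeof(*b));

	if (b)
		b->freed = freed;
	return b;
}

/* Creating a chain counts it in the block, as the patch does in tcf_chain_create(). */
static struct chain *chain_create(struct block *b)
{
	struct chain *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	c->block = b;
	b->nr_chains++;
	return c;
}

/* Destroying the last chain releases the block, as the patch does in tcf_chain_destroy(). */
static void chain_destroy(struct chain *c)
{
	struct block *b = c->block;

	assert(b->nr_chains > 0);
	free(c);
	if (!--b->nr_chains) {
		*b->freed = true;
		free(b);
	}
}
```

With two chains in a block, destroying the first leaves the block alive and destroying the second frees it, whatever order the two are destroyed in.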
Re: [PATCH net-next] net: dsa: Allow compiling out legacy support
On 12/01/2017 07:21 AM, Vivien Didelot wrote:
> Hi Florian,
>
> Florian Fainelli writes:
>
>> +#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
>>  /* Legacy driver registration */
>>  void register_switch_driver(struct dsa_switch_driver *type);
>>  void unregister_switch_driver(struct dsa_switch_driver *type);
>>  struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev);
>>
>> +#else
>> +static inline void register_switch_driver(struct dsa_switch_driver *type) {
>> }
>> +static inline void unregister_switch_driver(struct dsa_switch_driver *type)
>> { }
>> +static inline struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev)
>> +{
>> +	return NULL;
>> +}
>> +#endif
>
> The .probe dsa_switch_ops is part of the legacy code, we may want to
> wrap it in a CONFIG_NET_DSA_LEGACY check as well.

Fixed, also made 88e6060 dependent on CONFIG_NET_DSA_LEGACY as a result.

>
>>  struct net_device *dsa_dev_to_net_device(struct device *dev);
>>
>>  /* Keep inline for faster access in hot path */
>> diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
>> index 03c3bdf25468..b6ec8e9069e4 100644
>> --- a/net/dsa/Kconfig
>> +++ b/net/dsa/Kconfig
>> @@ -16,6 +16,14 @@ config NET_DSA
>>
>>  if NET_DSA
>>
>> +config NET_DSA_LEGACY
>
> We need to have it enabled by default, otherwise we'll miss errors when
> touching the code shared by both legacy and new bindings.

Fixed.

>
>> +	bool "Support for older platform device and Device Tree registration"
>> +	---help---
>> +	  Say Y if you want to enable support for the older platform device and
>> +	  deprectaed Device Tree binding registration.
>
> deprecated*
>
>> +
>> +	  This feature is scheduled for removal in 4.17.
>> +
>>  /* legacy.c */
>> +#if IS_ENABLED(CONFIG_NET_DSA_LEGACY)
>>  int dsa_legacy_register(void);
>>  void dsa_legacy_unregister(void);
>>  int dsa_legacy_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
>> @@ -106,6 +107,28 @@ int dsa_legacy_fdb_add(struct ndmsg *ndm, struct nlattr
>> *tb[],
>>  int dsa_legacy_fdb_del(struct ndmsg *ndm, struct nlattr *tb[],
>>  		       struct net_device *dev,
>>  		       const unsigned char *addr, u16 vid);
>
> the dsa_legacy_fdb_{add,del} routines are "legacy" in terms of FDB
> handling, not in terms of DSA bindings, we must keep them.

Oh, right. This should probably be moved somewhere else then, right? The
whole idea was to compile out net/dsa/legacy.c
--
Florian
Re: [PATCH v4 3/8] MIPS: Octeon: Add a global resource manager.
On Fri, Dec 1, 2017 at 9:56 PM, David Daneywrote: > On 12/01/2017 12:41 PM, Philippe Ombredanne wrote: >> >> David, >> >> On Fri, Dec 1, 2017 at 9:01 PM, David Daney >> wrote: >>> >>> On 12/01/2017 11:49 AM, Philippe Ombredanne wrote: David, Greg, On Fri, Dec 1, 2017 at 6:42 PM, David Daney wrote: > > > On 11/30/2017 11:53 PM, Philippe Ombredanne wrote: [...] --- /dev/null +++ b/arch/mips/cavium-octeon/resource-mgr.c @@ -0,0 +1,371 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Resource manager for Octeon. + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file "COPYING" in the main directory of this archive + * for more details. + * + * Copyright (C) 2017 Cavium, Inc. + */ >> >> >> >> >> Since you nicely included an SPDX id, you would not need the >> boilerplate anymore. e.g. these can go alright? > > > > > They may not be strictly speaking necessary, but I don't think they > hurt > anything. Unless there is a requirement to strip out the license text, > we > would stick with it as is. I think the requirement is there and that would be much better for everyone: keeping both is redundant and does not bring any value, does it? Instead it kinda removes the benefits of having the SPDX id in the first place IMHO. Furthermore, as there have been already ~12K+ files cleaned up and still over 60K files to go, it would really nice if new files could adopt the new style: this way we will not have to revisit and repatch them in the future. >>> >>> I am happy to follow any style Greg would suggest. There doesn't seem to >>> be >>> much documentation about how this should be done yet. >> >> >> Thomas (tglx) has already submitted a first series of doc patches a >> few weeks ago. 
And AFAIK he might be working on posting the updates >> soon, whenever his real time clock yields a few cycles away from real >> time coding work ;) >> >> See also these discussions with Linus [1][2][3], Thomas[4] and Greg[5] >> on this and mostly related topics >> >> [1] https://lkml.org/lkml/2017/11/2/715 >> [2] https://lkml.org/lkml/2017/11/25/125 >> [3] https://lkml.org/lkml/2017/11/25/133 >> [4] https://lkml.org/lkml/2017/11/2/805 >> [5] https://lkml.org/lkml/2017/10/19/165 >> > > OK, you convinced me. > > Thanks, > David > No! Thank you to you: For doing real work on the kernel that makes my servers and laptops run, while I am nitpicking you on comments. -- Cordially Philippe Ombredanne
[PATCH net-next V3 2/3] net-next: copy user configured flowlabel to reset packet
From: Shaohua Li

Reset packet doesn't use user configured flowlabel, instead, it always
uses 0. This will cause inconsistency for flowlabel. tw sock already
records flowlabel info, so we can directly use it.

Cc: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Florent Fourcot
Cc: Cong Wang
Cc: Tom Herbert
Signed-off-by: Shaohua Li
---
 net/ipv6/tcp_ipv6.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 1e4ce06..b8383be 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -902,6 +902,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 	struct sock *sk1 = NULL;
 #endif
 	int oif = 0;
+	u8 tclass = 0;
+	__be32 flowlabel = 0;

 	if (th->rst)
 		return;
@@ -955,7 +957,21 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 		trace_tcp_send_reset(sk, skb);
 	}

-	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
+	if (sk) {
+		if (sk_fullsock(sk)) {
+			struct ipv6_pinfo *np = inet6_sk(sk);
+
+			tclass = np->tclass;
+			flowlabel = np->flow_label & IPV6_FLOWLABEL_MASK;
+		} else {
+			struct inet_timewait_sock *tw = inet_twsk(sk);
+
+			tclass = tw->tw_tclass;
+			flowlabel = cpu_to_be32(tw->tw_flowlabel);
+		}
+	}
+	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1,
+			     tclass, flowlabel);

 #ifdef CONFIG_TCP_MD5SIG
 out:
--
2.9.5
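The two branches in the hunk above read the label from different storage: the full socket keeps it in network byte order and the patch masks it, while the timewait socket value is passed through cpu_to_be32(), which implies it is kept in host byte order. A small userspace sketch of that difference, under stated assumptions: FLOWLABEL_BITS, reset_label_from_full_sock() and reset_label_from_timewait() are hypothetical names made up for this illustration, and htonl() stands in for the kernel's cpu_to_be32().

```c
#include <assert.h>
#include <arpa/inet.h>	/* htonl(), ntohl() */
#include <stdint.h>

/* The flow label is the low 20 bits of the first IPv6 header word (host order here). */
#define FLOWLABEL_BITS	0x000FFFFFu

/*
 * Full socket case: the stored value is in network byte order, so it is
 * masked in network byte order, the way the patch masks np->flow_label.
 */
static uint32_t reset_label_from_full_sock(uint32_t flow_label_be)
{
	return flow_label_be & htonl(FLOWLABEL_BITS);
}

/*
 * Timewait case: the label is assumed to be kept in host byte order, so it
 * is converted, the way the patch converts tw->tw_flowlabel.
 */
static uint32_t reset_label_from_timewait(uint32_t label_host)
{
	return htonl(label_host & FLOWLABEL_BITS);
}
```

Both helpers return a value in network byte order, so for the same connection either branch would hand the same label to the reset path.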
[PATCH net-next V3 3/3] net: add a sysctl to make auto flowlabel consistent
From: Shaohua LiCurrently if there is negative routing, we change sock's txhash, so the sock will have a different flowlabel and route to different path. According to Tom, we'd better to have option to enable this, because some routers require flowlabel consistent. By default, we maintain consistent flowlabel, eg, negative routing doesn't change flowlabel. Suggested-by: Tom Herbert Signed-off-by: Shaohua Li --- Documentation/networking/ip-sysctl.txt | 7 +++ include/net/netns/ipv6.h | 1 + include/net/sock.h | 28 +++- net/ipv6/af_inet6.c| 1 + net/ipv6/sysctl_net_ipv6.c | 8 5 files changed, 32 insertions(+), 13 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 46c7e10..14132a0 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1345,6 +1345,13 @@ auto_flowlabels - INTEGER be disabled by the socket option Default: 1 +consistent_auto_flowlabel - BOOLEAN + When auto_flowlabels is enabled, this option makes socket flowlabel + consistent in the lifetime. + TRUE: enabled + FALSE: disabled + Default: TRUE + flowlabel_state_ranges - BOOLEAN Split the flow label number space into two ranges. 0-0x7 is reserved for the IPv6 flow manager facility, 0x8-0xF diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h index 987cc45..e55f851 100644 --- a/include/net/netns/ipv6.h +++ b/include/net/netns/ipv6.h @@ -30,6 +30,7 @@ struct netns_sysctl_ipv6 { int ip6_rt_min_advmss; int flowlabel_consistency; int auto_flowlabels; + int consistent_auto_flowlabel; int icmpv6_time; int anycast_src_echo_reply; int ip_nonlocal_bind; diff --git a/include/net/sock.h b/include/net/sock.h index b9cb9d2..45e868f 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1729,6 +1729,18 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk) return sk ? 
sk->sk_uid : make_kuid(net->user_ns, 0); } +static inline +struct net *sock_net(const struct sock *sk) +{ + return read_pnet(>sk_net); +} + +static inline +void sock_net_set(struct sock *sk, struct net *net) +{ + write_pnet(>sk_net, net); +} + static inline void sk_set_txhash(struct sock *sk, u32 hash) { sk->sk_txhash = hash; @@ -1736,7 +1748,9 @@ static inline void sk_set_txhash(struct sock *sk, u32 hash) static inline void sk_rethink_txhash(struct sock *sk) { - if (sk->sk_txhash) { + struct net *net = sock_net(sk); + + if (sk->sk_txhash && !net->ipv6.sysctl.consistent_auto_flowlabel) { u32 v = prandom_u32(); sk->sk_txhash = v ?: 1; } @@ -2291,18 +2305,6 @@ static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb) __kfree_skb(skb); } -static inline -struct net *sock_net(const struct sock *sk) -{ - return read_pnet(>sk_net); -} - -static inline -void sock_net_set(struct sock *sk, struct net *net) -{ - write_pnet(>sk_net, net); -} - static inline struct sock *skb_steal_sock(struct sk_buff *skb) { if (skb->sk) { diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index c26f712..fe9b312 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -807,6 +807,7 @@ static int __net_init inet6_net_init(struct net *net) net->ipv6.sysctl.icmpv6_time = 1*HZ; net->ipv6.sysctl.flowlabel_consistency = 1; net->ipv6.sysctl.auto_flowlabels = IP6_DEFAULT_AUTO_FLOW_LABELS; + net->ipv6.sysctl.consistent_auto_flowlabel = 1; net->ipv6.sysctl.idgen_retries = 3; net->ipv6.sysctl.idgen_delay = 1 * HZ; net->ipv6.sysctl.flowlabel_state_ranges = 0; diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c index a789a8a..8908092 100644 --- a/net/ipv6/sysctl_net_ipv6.c +++ b/net/ipv6/sysctl_net_ipv6.c @@ -126,6 +126,13 @@ static struct ctl_table ipv6_table_template[] = { .mode = 0644, .proc_handler = proc_dointvec }, + { + .procname = "consistent_auto_flowlabel", + .data = _net.ipv6.sysctl.consistent_auto_flowlabel, + .maxlen = sizeof(int), + .mode = 0644, + 
.proc_handler = proc_dointvec + }, { } }; @@ -190,6 +197,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net) ipv6_table[11].data = >ipv6.sysctl.max_hbh_opts_cnt; ipv6_table[12].data = >ipv6.sysctl.max_dst_opts_len; ipv6_table[13].data = >ipv6.sysctl.max_hbh_opts_len; + ipv6_table[14].data = >ipv6.sysctl.consistent_auto_flowlabel; ipv6_route_table = ipv6_route_sysctl_init(net); if
[PATCH net-next V3 0/3] net: fix flowlabel inconsistency in reset packet
From: Shaohua Li

Hi,

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The tcp reset packet has a different flowlabel, so our router does not
close the tcp connection correctly. We are using flowlabel for load
balancing. Routers in the path maintain connection state, so if the flow
label changes, the packet is routed through a different router. In this
case, the old router doesn't get the reset packet to close the tcp
connection.

The reason is that a normal packet gets its skb->hash from sk->sk_txhash,
which is generated randomly. ip6_make_flowlabel then uses the hash to
create a flowlabel. The reset packet doesn't get assigned a hash, so the
flowlabel is calculated with flowi6. The patches fix the issue.

Thanks,
Shaohua

V2->V3:
- Address Tom's comments
- Add a new sysctl suggested by Tom

Shaohua Li (3):
  net-next: use five-tuple hash for sk_txhash
  net-next: copy user configured flowlabel to reset packet
  net: add a sysctl to make auto flowlabel consistent

 Documentation/networking/ip-sysctl.txt |  7 +++
 include/linux/tcp.h                    |  5 +
 include/net/netns/ipv6.h               |  1 +
 include/net/sock.h                     | 35 +++-
 include/net/tcp.h                      |  2 +-
 net/ipv4/datagram.c                    |  2 +-
 net/ipv4/syncookies.c                  |  4 +++-
 net/ipv4/tcp_input.c                   |  1 -
 net/ipv4/tcp_ipv4.c                    | 18 -
 net/ipv4/tcp_output.c                  |  1 -
 net/ipv6/af_inet6.c                    |  1 +
 net/ipv6/datagram.c                    |  4 +++-
 net/ipv6/syncookies.c                  |  3 ++-
 net/ipv6/sysctl_net_ipv6.c             |  8
 net/ipv6/tcp_ipv6.c                    | 37 --
 15 files changed, 92 insertions(+),
[PATCH net-next V3 1/3] net-next: use five-tuple hash for sk_txhash
From: Shaohua LiWe are using sk_txhash to calculate flowlabel, but sk_txhash isn't always available, for example, in inet_timewait_sock. This causes problem for reset packet, which will have a different flowlabel. This causes our router doesn't correctly close tcp connection. We are using flowlabel to do load balance. Routers in the path maintain connection state. So if flow label changes, the packet is routed through a different router. In this case, the old router doesn't get the reset packet to close the tcp connection. Per Tom's suggestion, we switch back to five-tuple hash, so we can reconstruct correct flowlabel for reset packet. At most places, we already have the flowi info, so we directly use it build sk_txhash. For synack, we do this after route search. At that time, we have the flowi info ready, so don't need to create the flowi info again. Cc: Martin KaFai Lau Cc: Eric Dumazet Cc: Florent Fourcot Cc: Cong Wang Cc: Tom Herbert Signed-off-by: Shaohua Li --- include/linux/tcp.h | 5 + include/net/sock.h| 17 ++--- include/net/tcp.h | 2 +- net/ipv4/datagram.c | 2 +- net/ipv4/syncookies.c | 4 +++- net/ipv4/tcp_input.c | 1 - net/ipv4/tcp_ipv4.c | 18 +- net/ipv4/tcp_output.c | 1 - net/ipv6/datagram.c | 4 +++- net/ipv6/syncookies.c | 3 ++- net/ipv6/tcp_ipv6.c | 19 ++- 11 files changed, 48 insertions(+), 28 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index df5d97a..227e8b2 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -139,6 +139,11 @@ struct tcp_request_sock { */ }; +static inline void tcp_rsk_set_txhash(struct tcp_request_sock *rsk, u32 hash) +{ + rsk->txhash = hash; +} + static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req) { return (struct tcp_request_sock *)req; diff --git a/include/net/sock.h b/include/net/sock.h index 79e1a2c..b9cb9d2 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1729,22 +1729,17 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk) 
return sk ? sk->sk_uid : make_kuid(net->user_ns, 0); } -static inline u32 net_tx_rndhash(void) +static inline void sk_set_txhash(struct sock *sk, u32 hash) { - u32 v = prandom_u32(); - - return v ?: 1; -} - -static inline void sk_set_txhash(struct sock *sk) -{ - sk->sk_txhash = net_tx_rndhash(); + sk->sk_txhash = hash; } static inline void sk_rethink_txhash(struct sock *sk) { - if (sk->sk_txhash) - sk_set_txhash(sk); + if (sk->sk_txhash) { + u32 v = prandom_u32(); + sk->sk_txhash = v ?: 1; + } } static inline struct dst_entry * diff --git a/include/net/tcp.h b/include/net/tcp.h index 4e09398..a5c28be 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1840,7 +1840,7 @@ struct tcp_request_sock_ops { __u16 *mss); #endif struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl, - const struct request_sock *req); + struct request_sock *req); u32 (*init_seq)(const struct sk_buff *skb); u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb); int (*send_synack)(const struct sock *sk, struct dst_entry *dst, diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c index f915abf..1f2f9fc 100644 --- a/net/ipv4/datagram.c +++ b/net/ipv4/datagram.c @@ -74,7 +74,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len inet->inet_daddr = fl4->daddr; inet->inet_dport = usin->sin_port; sk->sk_state = TCP_ESTABLISHED; - sk_set_txhash(sk); + sk_set_txhash(sk, get_hash_from_flowi4(fl4)); inet->inet_id = jiffies; sk_dst_set(sk, >dst); diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c index fda37f2..ecf6e7a 100644 --- a/net/ipv4/syncookies.c +++ b/net/ipv4/syncookies.c @@ -335,7 +335,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb) treq->rcv_isn = ntohl(th->seq) - 1; treq->snt_isn = cookie; treq->ts_off= 0; - treq->txhash= net_tx_rndhash(); req->mss= mss; ireq->ir_num= ntohs(th->dest); ireq->ir_rmt_port = th->source; @@ -376,6 +375,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct 
sk_buff *skb) opt->srr ? opt->faddr : ireq->ir_rmt_addr, ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid); security_req_classify_flow(req, flowi4_to_flowi()); + + tcp_rsk_set_txhash(treq,
Re: [PATCH net-next 0/2] allow setting gso_maximum values
On Fri, 1 Dec 2017 12:11:56 -0800
Stephen Hemminger wrote:

> This is another way of addressing the GSO maximum performance issues for
> containers on Azure. What happens is that the underlying infrastructure uses
> a overlay network such that GSO packets over 64K - vlan header end up cause
> either guest or host to have do expensive software copy and fragmentation.
>
> The netvsc driver reports GSO maximum settings correctly, the issue
> is that containers on veth devices still have the larger settings.
> One solution that was examined was propogating the values back
> through the bridge device, but this does not work for cases where
> virtual container network is done on L3.
>
> This patch set punts the problem to the orchestration layer that sets
> up the container network. It also enables other virtual devices
> to have configurable settings for GSO maximum.
>
> Stephen Hemminger (2):
>   rtnetlink: allow GSO maximums to be passed to device
>   veth: allow configuring GSO maximums
>
>  drivers/net/veth.c   | 20
>  net/core/rtnetlink.c |  2 ++
>  2 files changed, 22 insertions(+)
>

I would like confirmation from Intel, who are doing the Docker testing,
that this works for them before merging.
Re: [PATCH v5 net-next,mips 1/7] dt-bindings: Add Cavium Octeon Common Ethernet Interface.
On 12/01/2017 03:18 PM, David Daney wrote:
> From: Carlos Munoz
>
> Add bindings for Common Ethernet Interface (BGX) block.
>
> Acked-by: Rob Herring
> Signed-off-by: Carlos Munoz
> Signed-off-by: Steven J. Hill
> Signed-off-by: David Daney

Reviewed-by: Florian Fainelli
--
Florian
[PATCH net] Revert "tcp: must block bh in __inet_twsk_hashdance()"
From: Eric Dumazet

We had to disable BH _before_ calling __inet_twsk_hashdance() in commit
cfac7f836a71 ("tcp/dccp: block bh before arming time_wait timer").

This means we can revert 614bdd4d6e61 ("tcp: must block bh in
__inet_twsk_hashdance()").

Signed-off-by: Eric Dumazet
---
 net/ipv4/inet_timewait_sock.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index c690cd0d9b3f0af53c23b9a1ecc87be4098ae059..b563e0c46bac2362acccf38495546a8b6b726384 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -93,7 +93,7 @@ static void inet_twsk_add_bind_node(struct inet_timewait_sock *tw,
 }

 /*
- * Enter the time wait state.
+ * Enter the time wait state. This is called with locally disabled BH.
  * Essentially we whip up a timewait bucket, copy the relevant info into it
  * from the SK, and mess with hash chains and list linkage.
  */
@@ -111,7 +111,7 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	 */
 	bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), inet->inet_num,
 			hashinfo->bhash_size)];
-	spin_lock_bh(&bhead->lock);
+	spin_lock(&bhead->lock);
 	tw->tw_tb = icsk->icsk_bind_hash;
 	WARN_ON(!icsk->icsk_bind_hash);
 	inet_twsk_add_bind_node(tw, &tw->tw_tb->owners);
@@ -137,7 +137,7 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	if (__sk_nulls_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);

-	spin_unlock_bh(lock);
+	spin_unlock(lock);
 }
 EXPORT_SYMBOL_GPL(__inet_twsk_hashdance);
Re: [PATCH net-next 1/5] rhashtable: Don't reset walker table in rhashtable_walk_start
On Fri, Dec 1, 2017 at 2:18 PM, Herbert Xu wrote:
> On Thu, Nov 30, 2017 at 04:03:01PM -0800, Tom Herbert wrote:
>> Remove the code that resets the walker table. The walker table should
>> only be initialized in the walk init function or when a future table is
>> encountered. If the walker table is NULL this is the indication that
>> the walk has completed and this information can be used to break a
>> multi-call walk in the table (e.g. successive calls to nelink_dump
>> that are dumping elements of an rhashtable).
>>
>> This also allows us to change rhashtable_walk_start to return void
>> since the only error it was returning was -EAGAIN for a table change.
>> This patch changes all the callers of rhashtable_walk_start to expect
>> void which eliminates logic needed to check the return value for a
>> rare condition. Note that -EAGAIN will be returned in a call
>> to rhashtable_walk_next which seems to always follow the start
>> of the walk so there should be no behavioral change in doing this.
>>
>> Signed-off-by: Tom Herbert
>
> Doesn't this mean that if a walk encounters a rehash you may end up
> missing half or more of the hash table?
>
Because of tbl->rehash < tbl->size conditions in walk stop?

How about we add a flag to iter that indicates table needs a reset and
set it along with setting walker.tbl to NULL? On the next walk start do
the reload when walker.tbl is NULL and flag is set. In this case walk
start would automatically set walker.tbl, which is already done by
nearly all callers in that they ignore -EAGAIN returned from start walk.

Thanks,
Tom

> Cheers,
> --
> Email: Herbert Xu
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH net-next 2/3] bpf: allow disabling tunnel csum for ipv6
Before the patch, BPF_F_ZERO_CSUM_TX can be used only for ipv4 tunnel.
With introduction of ip6gretap collect_md mode, the flag should be also
supported for ipv6.

Signed-off-by: William Tu
Cc: Daniel Borkmann
---
 net/core/filter.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 6a85e67fafce..8ec5a504eb28 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3026,10 +3026,11 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb,
 						  IPV6_FLOWLABEL_MASK;
 	} else {
 		info->key.u.ipv4.dst = cpu_to_be32(from->remote_ipv4);
-		if (flags & BPF_F_ZERO_CSUM_TX)
-			info->key.tun_flags &= ~TUNNEL_CSUM;
 	}

+	if (flags & BPF_F_ZERO_CSUM_TX)
+		info->key.tun_flags &= ~TUNNEL_CSUM;
+
 	return 0;
 }

--
2.7.4
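The effect of moving the check out of the else branch can be seen in a small userspace model. This is only a sketch under stated assumptions: F_TUNINFO_IPV6, F_ZERO_CSUM_TX, TUN_CSUM and tun_flags_after_set_key() are hypothetical stand-ins with illustrative bit values, not the kernel definitions, and the per-family address setup is reduced to comments.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only; the kernel headers define the real ones. */
#define F_TUNINFO_IPV6	(1u << 0)	/* caller passes an IPv6 tunnel key */
#define F_ZERO_CSUM_TX	(1u << 1)	/* caller asks for no tunnel checksum */
#define TUN_CSUM	((uint16_t)0x0001)

/*
 * Models the flag handling after the patch: the family-specific setup stays
 * in the if/else, while the checksum flag is handled once for both families.
 */
static uint16_t tun_flags_after_set_key(uint32_t bpf_flags, uint16_t tun_flags)
{
	if (bpf_flags & F_TUNINFO_IPV6) {
		/* IPv6-only setup (destination address, label) would go here. */
	} else {
		/* IPv4-only setup (destination address) would go here. */
	}

	if (bpf_flags & F_ZERO_CSUM_TX)
		tun_flags &= (uint16_t)~TUN_CSUM;

	return tun_flags;
}
```

With the check inside the else branch, the IPv6 case would return with TUN_CSUM still set even when the caller asked for it to be cleared; in this model both families clear it.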
[PATCH net-next 0/3] add ip6 gre and gretap collect_md mode
Similar to gre, vxlan, geneve, ipip tunnels, allow ip6gretap tunnels to
operate in collect metadata mode. The first patch adds the support to
ip6_gre.c. The second patch enables unsetting the csum for ipv6 tunnel,
when using bpf_skb_[gs]et_tunnel_key() helpers. Finally, the last patch
adds the ip6 gre and gretap tunnel test cases to BPF sample code.

The corresponding iproute2 patch:
https://marc.info/?l=linux-netdev&m=151216943128087&w=2

William Tu (3):
  ip6_gre: add ip6 gre and gretap collect_md mode
  bpf: allow disabling tunnel csum for ipv6
  samples/bpf: extend test_tunnel_bpf.sh with ip6gre

 net/core/filter.c              |   5 +-
 net/ipv6/ip6_gre.c             | 105 +
 net/ipv6/ip6_tunnel.c          |   5 +-
 samples/bpf/tcbpf2_kern.c      |  43 +
 samples/bpf/test_tunnel_bpf.sh |  65 +
 5 files changed, 210 insertions(+), 13 deletions(-)

--
2.7.4
[PATCH net-next 3/3] samples/bpf: extend test_tunnel_bpf.sh with ip6gre
Extend existing tests for vxlan, gre, geneve, ipip, erspan, to include ip6 gre and gretap tunnel. Signed-off-by: William TuCc: Alexei Starovoitov --- samples/bpf/tcbpf2_kern.c | 43 samples/bpf/test_tunnel_bpf.sh | 65 ++ 2 files changed, 108 insertions(+) diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c index 370b749f5ee6..15a469220e19 100644 --- a/samples/bpf/tcbpf2_kern.c +++ b/samples/bpf/tcbpf2_kern.c @@ -81,6 +81,49 @@ int _gre_get_tunnel(struct __sk_buff *skb) return TC_ACT_OK; } +SEC("ip6gretap_set_tunnel") +int _ip6gretap_set_tunnel(struct __sk_buff *skb) +{ + struct bpf_tunnel_key key; + int ret; + + __builtin_memset(, 0x0, sizeof(key)); + key.remote_ipv6[3] = _htonl(0x11); /* ::11 */ + key.tunnel_id = 2; + key.tunnel_tos = 0; + key.tunnel_ttl = 64; + key.tunnel_label = 0xabcde; + + ret = bpf_skb_set_tunnel_key(skb, , sizeof(key), +BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX); + if (ret < 0) { + ERROR(ret); + return TC_ACT_SHOT; + } + + return TC_ACT_OK; +} + +SEC("ip6gretap_get_tunnel") +int _ip6gretap_get_tunnel(struct __sk_buff *skb) +{ + char fmt[] = "key %d remote ip6 ::%x label %x\n"; + struct bpf_tunnel_key key; + int ret; + + ret = bpf_skb_get_tunnel_key(skb, , sizeof(key), +BPF_F_TUNINFO_IPV6); + if (ret < 0) { + ERROR(ret); + return TC_ACT_SHOT; + } + + bpf_trace_printk(fmt, sizeof(fmt), +key.tunnel_id, key.remote_ipv6[3], key.tunnel_label); + + return TC_ACT_OK; +} + SEC("erspan_set_tunnel") int _erspan_set_tunnel(struct __sk_buff *skb) { diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh index 312e1722a39f..226f45381b76 100755 --- a/samples/bpf/test_tunnel_bpf.sh +++ b/samples/bpf/test_tunnel_bpf.sh @@ -33,6 +33,30 @@ function add_gre_tunnel { ip addr add dev $DEV 10.1.1.200/24 } +function add_ip6gretap_tunnel { + + # assign ipv6 address + ip netns exec at_ns0 ip addr add ::11/96 dev veth0 + ip netns exec at_ns0 ip link set dev veth0 up + ip addr add dev veth1 ::22/96 + ip link set dev veth1 up + + # in 
namespace + ip netns exec at_ns0 \ + ip link add dev $DEV_NS type $TYPE flowlabel 0xbcdef key 2 \ + local ::11 remote ::22 + + ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24 + ip netns exec at_ns0 ip addr add dev $DEV_NS fc80::100/96 + ip netns exec at_ns0 ip link set dev $DEV_NS up + + # out of namespace + ip link add dev $DEV type $TYPE external + ip addr add dev $DEV 10.1.1.200/24 + ip addr add dev $DEV fc80::200/24 + ip link set dev $DEV up +} + function add_erspan_tunnel { # in namespace ip netns exec at_ns0 \ @@ -113,6 +137,41 @@ function test_gre { cleanup } +function test_ip6gre { + TYPE=ip6gre + DEV_NS=ip6gre00 + DEV=ip6gre11 + config_device + # reuse the ip6gretap function + add_ip6gretap_tunnel + attach_bpf $DEV ip6gretap_set_tunnel ip6gretap_get_tunnel + # underlay + ping6 -c 4 ::11 + # overlay: ipv4 over ipv6 + ip netns exec at_ns0 ping -c 1 10.1.1.200 + ping -c 1 10.1.1.100 + # overlay: ipv6 over ipv6 + ip netns exec at_ns0 ping6 -c 1 fc80::200 + cleanup +} + +function test_ip6gretap { + TYPE=ip6gretap + DEV_NS=ip6gretap00 + DEV=ip6gretap11 + config_device + add_ip6gretap_tunnel + attach_bpf $DEV ip6gretap_set_tunnel ip6gretap_get_tunnel + # underlay + ping6 -c 4 ::11 + # overlay: ipv4 over ipv6 + ip netns exec at_ns0 ping -i .2 -c 1 10.1.1.200 + ping -c 1 10.1.1.100 + # overlay: ipv6 over ipv6 + ip netns exec at_ns0 ping6 -c 1 fc80::200 + cleanup +} + function test_erspan { TYPE=erspan DEV_NS=erspan00 @@ -175,6 +234,8 @@ function cleanup { ip link del veth1 ip link del ipip11 ip link del gretap11 + ip link del ip6gre11 + ip link del ip6gretap11 ip link del vxlan11 ip link del geneve11 ip link del erspan11 @@ -187,6 +248,10 @@ trap cleanup 0 2 3 6 9 cleanup echo "Testing GRE tunnel..." test_gre +echo "Testing IP6GRE tunnel..." +test_ip6gre +echo "Testing IP6GRETAP tunnel..." +test_ip6gretap echo "Testing ERSPAN tunnel..." test_erspan echo "Testing VXLAN tunnel..." -- 2.7.4
[PATCH net-next 1/3] ip6_gre: add ip6 gre and gretap collect_md mode
Similar to gre, vxlan, geneve, ipip tunnels, allow ip6 gre and gretap
tunnels to operate in collect metadata mode. bpf_skb_[gs]et_tunnel_key()
helpers can make use of it right away. OVS can use it as well in the
future.

Signed-off-by: William Tu
---
 net/ipv6/ip6_gre.c    | 105 +-
 net/ipv6/ip6_tunnel.c |   5 ++-
 2 files changed, 99 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 76379f01bcd2..1510ce9a4e4e 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -56,6 +56,7 @@
 #include
 #include
 #include
+#include
 
 static bool log_ecn_error = true;
@@ -69,6 +70,7 @@ static unsigned int ip6gre_net_id __read_mostly;
 struct ip6gre_net {
 	struct ip6_tnl __rcu *tunnels[4][IP6_GRE_HASH_SIZE];
 
+	struct ip6_tnl __rcu *collect_md_tun;
 	struct net_device *fb_tunnel_dev;
 };
@@ -229,6 +231,10 @@ static struct ip6_tnl *ip6gre_tunnel_lookup(struct net_device *dev,
 	if (cand)
 		return cand;
 
+	t = rcu_dereference(ign->collect_md_tun);
+	if (t && t->dev->flags & IFF_UP)
+		return t;
+
 	dev = ign->fb_tunnel_dev;
 	if (dev->flags & IFF_UP)
 		return netdev_priv(dev);
@@ -264,6 +270,9 @@ static void ip6gre_tunnel_link(struct ip6gre_net *ign, struct ip6_tnl *t)
 {
 	struct ip6_tnl __rcu **tp = ip6gre_bucket(ign, t);
 
+	if (t->parms.collect_md)
+		rcu_assign_pointer(ign->collect_md_tun, t);
+
 	rcu_assign_pointer(t->next, rtnl_dereference(*tp));
 	rcu_assign_pointer(*tp, t);
 }
@@ -273,6 +282,9 @@ static void ip6gre_tunnel_unlink(struct ip6gre_net *ign, struct ip6_tnl *t)
 	struct ip6_tnl __rcu **tp;
 	struct ip6_tnl *iter;
 
+	if (t->parms.collect_md)
+		rcu_assign_pointer(ign->collect_md_tun, NULL);
+
 	for (tp = ip6gre_bucket(ign, t);
 	     (iter = rtnl_dereference(*tp)) != NULL;
 	     tp = &iter->next) {
@@ -463,7 +475,22 @@ static int ip6gre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi)
 				      &ipv6h->saddr, &ipv6h->daddr, tpi->key,
 				      tpi->proto);
 	if (tunnel) {
-		ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
+		if (tunnel->parms.collect_md) {
+			struct metadata_dst *tun_dst;
+			__be64 tun_id;
+			__be16 flags;
+
+			flags = tpi->flags;
+			tun_id = key32_to_tunnel_id(tpi->key);
+
+			tun_dst = ipv6_tun_rx_dst(skb, flags, tun_id, 0);
+			if (!tun_dst)
+				return PACKET_REJECT;
+
+			ip6_tnl_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
+		} else {
+			ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
+		}
 
 		return PACKET_RCVD;
 	}
@@ -633,8 +660,38 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
 
 	/* Push GRE header. */
 	protocol = (dev->type == ARPHRD_ETHER) ? htons(ETH_P_TEB) : proto;
-	gre_build_header(skb, tunnel->tun_hlen, tunnel->parms.o_flags,
-			 protocol, tunnel->parms.o_key, htonl(tunnel->o_seqno));
+
+	if (tunnel->parms.collect_md) {
+		struct ip_tunnel_info *tun_info;
+		const struct ip_tunnel_key *key;
+		__be16 flags;
+
+		tun_info = skb_tunnel_info(skb);
+		if (unlikely(!tun_info ||
+			     !(tun_info->mode & IP_TUNNEL_INFO_TX) ||
+			     ip_tunnel_info_af(tun_info) != AF_INET6))
+			return -EINVAL;
+
+		key = &tun_info->key;
+		memset(fl6, 0, sizeof(*fl6));
+		fl6->flowi6_proto = IPPROTO_GRE;
+		fl6->daddr = key->u.ipv6.dst;
+		fl6->flowlabel = key->label;
+		fl6->flowi6_uid = sock_net_uid(dev_net(dev), NULL);
+
+		dsfield = key->tos;
+		flags = key->tun_flags & (TUNNEL_CSUM | TUNNEL_KEY);
+		tunnel->tun_hlen = gre_calc_hlen(flags);
+
+		gre_build_header(skb, tunnel->tun_hlen,
+				 flags, protocol,
+				 tunnel_id_to_key32(tun_info->key.tun_id), 0);
+
+	} else {
+		gre_build_header(skb, tunnel->tun_hlen, tunnel->parms.o_flags,
+				 protocol, tunnel->parms.o_key,
+				 htonl(tunnel->o_seqno));
+	}
 
 	return ip6_tnl_xmit(skb, dev, dsfield, fl6, encap_limit, pmtu,
 			    NEXTHDR_GRE);
@@ -645,13 +702,15 @@ static inline int ip6gre_xmit_ipv4(struct sk_buff *skb, struct net_device *dev)
 	struct ip6_tnl *t = netdev_priv(dev);
 	int encap_limit = -1;
 	struct
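In the collect_md transmit path above, the per-packet flags are masked down to checksum and key before the GRE header length is recomputed. The following user-space sketch shows that arithmetic under the usual GRE layout (4-byte base header plus 4 bytes for each optional field). The SK_* flag values and function names are placeholders chosen for this example; they are not the kernel's TUNNEL_* constants or its gre_calc_hlen().

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag bits for this sketch only. */
#define SK_TUNNEL_CSUM	0x1u
#define SK_TUNNEL_KEY	0x2u
#define SK_TUNNEL_SEQ	0x4u
#define SK_TUNNEL_ALL	(SK_TUNNEL_CSUM | SK_TUNNEL_KEY | SK_TUNNEL_SEQ)

/* GRE header length: 4-byte base plus 4 bytes per optional field. */
static unsigned int sketch_gre_hlen(unsigned int flags)
{
	unsigned int len = 4;	/* flags/version + protocol type */

	assert((flags & ~SK_TUNNEL_ALL) == 0);
	if (flags & SK_TUNNEL_CSUM)
		len += 4;	/* checksum + reserved */
	if (flags & SK_TUNNEL_KEY)
		len += 4;	/* key */
	if (flags & SK_TUNNEL_SEQ)
		len += 4;	/* sequence number */
	return len;
}

/*
 * Mirror of the masking step in the collect_md branch: only checksum
 * and key are honoured, so a requested sequence number is dropped.
 */
static unsigned int sketch_collect_md_hlen(unsigned int tun_flags)
{
	unsigned int flags = tun_flags & (SK_TUNNEL_CSUM | SK_TUNNEL_KEY);

	return sketch_gre_hlen(flags);
}
```

One consequence visible in the sketch: a metadata key that asks for checksum, key and sequence number still yields a 12-byte header, which is consistent with the patch passing 0 as the sequence number in the collect_md branch.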
Re: [PATCH net-next resubmit 2/2] net: phy: remove generic settings for callbacks config_aneg and read_status from drivers
On 12/01/2017 02:37 PM, Heiner Kallweit wrote:
> On 01.12.2017 at 21:42, David Miller wrote:
>> From: Heiner Kallweit
>> Date: Thu, 30 Nov 2017 23:47:52 +0100
>>
>>> Remove generic settings for callbacks config_aneg and read_status
>>> from drivers.
>>>
> When re-testing I just figured out that in drivers/net/phy/broadcom.c
> I mistakenly removed three lines too many.
> Do you prefer a fixed version of the patch or just a patch with the
> fix?

Once the patches have been applied by David, you should send an
incremental change to fix your previous patches. Thank you.
--
Florian
[PATCH v5 net-next,mips 5/7] MIPS: Octeon: Automatically provision CVMSEG space.
Remove CONFIG_CAVIUM_OCTEON_CVMSEG_SIZE and automatically calculate
the amount of CVMSEG space needed.

1st 128-bytes: Used by IOBDMA
2nd 128-bytes: Reserved by kernel for scratch/TLS emulation.
3rd 128-bytes: OCTEON-III LMTLINE

New config variable CONFIG_CAVIUM_OCTEON_EXTRA_CVMSEG provisions
additional lines, defaults to zero.

Signed-off-by: David Daney
Signed-off-by: Carlos Munoz
---
 arch/mips/cavium-octeon/Kconfig                    | 27
 arch/mips/cavium-octeon/setup.c                    | 16 ++--
 .../asm/mach-cavium-octeon/kernel-entry-init.h     | 20 +--
 arch/mips/include/asm/mipsregs.h                   |  2 ++
 arch/mips/include/asm/octeon/octeon.h              |  2 ++
 arch/mips/include/asm/processor.h                  |  2 +-
 arch/mips/kernel/octeon_switch.S                   |  2 --
 arch/mips/mm/tlbex.c                               | 29 ++
 drivers/staging/octeon/ethernet-defines.h          |  2 +-
 9 files changed, 50 insertions(+), 52 deletions(-)

diff --git a/arch/mips/cavium-octeon/Kconfig b/arch/mips/cavium-octeon/Kconfig
index ce469f982134..29c4d81364a6 100644
--- a/arch/mips/cavium-octeon/Kconfig
+++ b/arch/mips/cavium-octeon/Kconfig
@@ -11,21 +11,26 @@ config CAVIUM_CN63XXP1
 	  non-CN63XXP1 hardware, so it is recommended to select "n"
 	  unless it is known the workarounds are needed.
 
-config CAVIUM_OCTEON_CVMSEG_SIZE
-	int "Number of L1 cache lines reserved for CVMSEG memory"
-	range 0 54
-	default 1
-	help
-	  CVMSEG LM is a segment that accesses portions of the dcache as a
-	  local memory; the larger CVMSEG is, the smaller the cache is.
-	  This selects the size of CVMSEG LM, which is in cache blocks. The
-	  legally range is from zero to 54 cache blocks (i.e. CVMSEG LM is
-	  between zero and 6192 bytes).
-
 endif # CPU_CAVIUM_OCTEON
 
 if CAVIUM_OCTEON_SOC
 
+config CAVIUM_OCTEON_EXTRA_CVMSEG
+	int "Number of extra L1 cache lines reserved for CVMSEG memory"
+	range 0 50
+	default 0
+	help
+	  CVMSEG LM is a segment that accesses portions of the dcache
+	  as a local memory; the larger CVMSEG is, the smaller the
+	  cache is. The kernel uses two or three blocks (one for TLB
+	  exception handlers, one for driver IOBDMA operations, and on
+	  models that need it, one for LMTDMA operations). This
+	  selects an optional extra number of CVMSEG lines for use by
+	  other software.
+
+	  Normally no extra lines are required, and this parameter
+	  should be set to zero.
+
 config CAVIUM_OCTEON_LOCK_L2
 	bool "Lock often used kernel code in the L2"
 	default "y"
diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c
index 99e6a68bc652..51c4d3c3cada 100644
--- a/arch/mips/cavium-octeon/setup.c
+++ b/arch/mips/cavium-octeon/setup.c
@@ -68,6 +68,12 @@ extern void pci_console_init(const char *arg);
 static unsigned long long max_memory = ULLONG_MAX;
 static unsigned long long reserve_low_mem;
 
+/*
+ * modified in kernel-entry-init.h, must have an initial value to keep
+ * it from being clobbered when bss is zeroed.
+ */
+u32 octeon_cvmseg_lines = 2;
+
 DEFINE_SEMAPHORE(octeon_bootbus_sem);
 EXPORT_SYMBOL(octeon_bootbus_sem);
 
@@ -604,11 +610,7 @@ void octeon_user_io_init(void)
 	/* R/W If set, CVMSEG is available for loads/stores in
 	 * kernel/debug mode. */
-#if CONFIG_CAVIUM_OCTEON_CVMSEG_SIZE > 0
 	cvmmemctl.s.cvmsegenak = 1;
-#else
-	cvmmemctl.s.cvmsegenak = 0;
-#endif
 	if (OCTEON_IS_OCTEON3()) {
 		/* Enable LMTDMA */
 		cvmmemctl.s.lmtena = 1;
@@ -626,9 +628,9 @@ void octeon_user_io_init(void)
 
 	/* Setup of CVMSEG is done in kernel-entry-init.h */
 	if (smp_processor_id() == 0)
-		pr_notice("CVMSEG size: %d cache lines (%d bytes)\n",
-			  CONFIG_CAVIUM_OCTEON_CVMSEG_SIZE,
-			  CONFIG_CAVIUM_OCTEON_CVMSEG_SIZE * 128);
+		pr_notice("CVMSEG size: %u cache lines (%u bytes)\n",
+			  octeon_cvmseg_lines,
+			  octeon_cvmseg_lines * 128);
 
 	if (octeon_has_feature(OCTEON_FEATURE_FAU)) {
 		union cvmx_iob_fau_timeout fau_timeout;
diff --git a/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h b/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h
index c38b38ce5a3d..cdcca60978a2 100644
--- a/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h
+++ b/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h
@@ -26,11 +26,18 @@
 	# a3 = address of boot descriptor block
 	.set push
 	.set arch=octeon
+	mfc0	v1, CP0_PRID_REG
+	andi	v1, 0xff00
+	li	v0, 0x9500	# cn78XX or later
+	subu	v1, v1, v0
+
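The provisioning rule described in the commit message can be sketched in plain C as follows. This is a user-space approximation, not the boot-time assembly: it assumes the third line is added whenever the masked PRID is at least 0x9500 (the "cn78XX or later" test visible in the excerpt) and that the extra lines from CONFIG_CAVIUM_OCTEON_EXTRA_CVMSEG are simply added on top. All names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_CVMSEG_LINE_BYTES	128u	/* one L1 cache line */
#define SK_CVMSEG_MAX_EXTRA	50u	/* Kconfig range 0..50 */

/* Assumed model test: processor ID field 0x9500 or above needs LMTLINE. */
static bool sketch_needs_lmtline(uint32_t prid)
{
	return (prid & 0xff00u) >= 0x9500u;
}

/* Lines = IOBDMA + kernel scratch (+ LMTLINE) + configured extra lines. */
static uint32_t sketch_cvmseg_lines(uint32_t prid, uint32_t extra)
{
	uint32_t lines = 2;

	assert(extra <= SK_CVMSEG_MAX_EXTRA);
	if (sketch_needs_lmtline(prid))
		lines++;
	return lines + extra;
}

/* Size in bytes, as printed by the "CVMSEG size" notice. */
static uint32_t sketch_cvmseg_bytes(uint32_t lines)
{
	return lines * SK_CVMSEG_LINE_BYTES;
}
```

Under these assumptions an OCTEON-III part with no extra lines ends up with 3 lines (384 bytes), while an older part stays at the initial value of 2 lines (256 bytes).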
[PATCH v5 net-next,mips 2/7] MIPS: Octeon: Enable LMTDMA/LMTST operations.
From: Carlos Munoz

LMTDMA/LMTST operations move data between cores and I/O devices:

* LMTST operations can send an address and a variable length
  (up to 128 bytes) of data to an I/O device.

* LMTDMA operations can send an address and a variable length
  (up to 128 bytes) of data to the I/O device and then return a
  variable length (up to 128 bytes) response from the I/O device.

For both LMTST and LMTDMA, the data sent to the device is first stored
in the CVMSEG core local memory cache line indexed by
CVMMEMCTL[LMTLINE]; the data is then atomically transmitted to the
device with a store to the CVMSEG LMTDMA trigger location.

Reviewed-by: James Hogan
Signed-off-by: Carlos Munoz
Signed-off-by: Steven J. Hill
Signed-off-by: David Daney
---
 arch/mips/cavium-octeon/setup.c       |  6 ++
 arch/mips/include/asm/octeon/octeon.h | 12 ++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c
index a8034d0dcade..99e6a68bc652 100644
--- a/arch/mips/cavium-octeon/setup.c
+++ b/arch/mips/cavium-octeon/setup.c
@@ -609,6 +609,12 @@ void octeon_user_io_init(void)
 #else
 	cvmmemctl.s.cvmsegenak = 0;
 #endif
+	if (OCTEON_IS_OCTEON3()) {
+		/* Enable LMTDMA */
+		cvmmemctl.s.lmtena = 1;
+		/* Scratch line to use for LMT operation */
+		cvmmemctl.s.lmtline = 2;
+	}
 	/* R/W If set, CVMSEG is available for loads/stores in
 	 * supervisor mode. */
 	cvmmemctl.s.cvmsegenas = 0;
diff --git a/arch/mips/include/asm/octeon/octeon.h b/arch/mips/include/asm/octeon/octeon.h
index c99c4b6a79f4..92a17d67c1fa 100644
--- a/arch/mips/include/asm/octeon/octeon.h
+++ b/arch/mips/include/asm/octeon/octeon.h
@@ -179,7 +179,15 @@ union octeon_cvmemctl {
 		/* RO 1 = BIST fail, 0 = BIST pass */
 		__BITFIELD_FIELD(uint64_t wbfbist:1,
 		/* Reserved */
-		__BITFIELD_FIELD(uint64_t reserved:17,
+		__BITFIELD_FIELD(uint64_t reserved_52_57:6,
+		/* When set, LMTDMA/LMTST operations are permitted */
+		__BITFIELD_FIELD(uint64_t lmtena:1,
+		/* Selects the CVMSEG LM cacheline used by LMTDMA
+		 * LMTST and wide atomic store operations.
+		 */
+		__BITFIELD_FIELD(uint64_t lmtline:6,
+		/* Reserved */
+		__BITFIELD_FIELD(uint64_t reserved_41_44:4,
 		/* OCTEON II - TLB replacement policy: 0 = bitmask LRU; 1 = NLU.
 		 * This field selects between the TLB replacement policies:
 		 * bitmask LRU or NLU. Bitmask LRU maintains a mask of
@@ -275,7 +283,7 @@ union octeon_cvmemctl {
 		/* R/W Size of local memory in cache blocks, 54 (6912
 		 * bytes) is max legal value. */
 		__BITFIELD_FIELD(uint64_t lmemsz:6,
-		;)
+		;
 	} s;
 };
--
2.14.3
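The old 17-bit reserved field is split into reserved_52_57, lmtena, lmtline and reserved_41_44. Reading the bit positions off those reserved-field names suggests lmtena sits at bit 51 and lmtline at bits 45 to 50; that placement is an inference from the names, not something the patch states. A shift-and-mask sketch of what the setup.c hunk does with the bitfields, with illustrative names:

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions inferred from the reserved_52_57 / reserved_41_44 names. */
#define SK_LMTENA_SHIFT		51
#define SK_LMTLINE_SHIFT	45
#define SK_LMTLINE_MASK		0x3fULL	/* 6-bit field */

/* Enable LMT operations and select the CVMSEG line they use. */
static uint64_t sketch_enable_lmt(uint64_t reg, unsigned int line)
{
	assert(line <= SK_LMTLINE_MASK);
	reg &= ~(SK_LMTLINE_MASK << SK_LMTLINE_SHIFT);	/* clear old line */
	reg |= (uint64_t)line << SK_LMTLINE_SHIFT;
	reg |= 1ULL << SK_LMTENA_SHIFT;
	return reg;
}

/* Read back the selected CVMSEG line. */
static unsigned int sketch_get_lmtline(uint64_t reg)
{
	return (unsigned int)((reg >> SK_LMTLINE_SHIFT) & SK_LMTLINE_MASK);
}
```

Calling sketch_enable_lmt(reg, 2) corresponds to the lmtena = 1, lmtline = 2 assignments in the hunk, where line 2 is the third 128-byte CVMSEG line.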
[PATCH v5 net-next,mips 4/7] MIPS: Octeon: Add Free Pointer Unit (FPA) support.
From: Carlos Munoz>From the hardware user manual: "The FPA is a unit that maintains pools of pointers to free L2/DRAM memory. To provide QoS, the pools are referenced indirectly through 1024 auras. Both core software and hardware units allocate and free pointers." Signed-off-by: Carlos Munoz Signed-off-by: Steven J. Hill Signed-off-by: David Daney --- arch/mips/cavium-octeon/Kconfig | 8 + arch/mips/cavium-octeon/Makefile | 1 + arch/mips/cavium-octeon/octeon-fpa3.c | 363 ++ arch/mips/include/asm/octeon/octeon.h | 13 ++ 4 files changed, 385 insertions(+) create mode 100644 arch/mips/cavium-octeon/octeon-fpa3.c diff --git a/arch/mips/cavium-octeon/Kconfig b/arch/mips/cavium-octeon/Kconfig index 204a1670fd9b..ce469f982134 100644 --- a/arch/mips/cavium-octeon/Kconfig +++ b/arch/mips/cavium-octeon/Kconfig @@ -87,4 +87,12 @@ config OCTEON_ILM To compile this driver as a module, choose M here. The module will be called octeon-ilm +config OCTEON_FPA3 + tristate "Octeon III fpa driver" + help + This option enables a Octeon III driver for the Free Pool Unit (FPA). + The FPA is a hardware unit that manages pools of pointers to free + L2/DRAM memory. This driver provides an interface to reserve, + initialize, and fill fpa pools. + endif # CAVIUM_OCTEON_SOC diff --git a/arch/mips/cavium-octeon/Makefile b/arch/mips/cavium-octeon/Makefile index 28c0bb75d1a4..9d547c2cd77d 100644 --- a/arch/mips/cavium-octeon/Makefile +++ b/arch/mips/cavium-octeon/Makefile @@ -20,3 +20,4 @@ obj-$(CONFIG_MTD) += flash_setup.o obj-$(CONFIG_SMP)+= smp.o obj-$(CONFIG_OCTEON_ILM) += oct_ilm.o obj-$(CONFIG_USB)+= octeon-usb.o +obj-$(CONFIG_OCTEON_FPA3)+= octeon-fpa3.o diff --git a/arch/mips/cavium-octeon/octeon-fpa3.c b/arch/mips/cavium-octeon/octeon-fpa3.c new file mode 100644 index ..3f0c10e9d915 --- /dev/null +++ b/arch/mips/cavium-octeon/octeon-fpa3.c @@ -0,0 +1,363 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for the Octeon III Free Pool Unit (fpa). 
+ * + * Copyright (C) 2015-2017 Cavium, Inc. + */ + +#include + +#include + + +/* Registers are accessed via xkphys */ +#define SET_XKPHYS (1ull << 63) +#define NODE_OFFSET0x10ull +#define SET_NODE(node) ((node) * NODE_OFFSET) + +#define FPA_BASE 0x12800ull +#define SET_FPA_BASE(node) (SET_XKPHYS + SET_NODE(node) + FPA_BASE) + +#define FPA_GEN_CFG(n) (SET_FPA_BASE(n) + 0x0050) + +#define FPA_POOLX_CFG(n, p)(SET_FPA_BASE(n) + (p<<3) + 0x1000) +#define FPA_POOLX_START_ADDR(n, p) (SET_FPA_BASE(n) + (p<<3) + 0x1050) +#define FPA_POOLX_END_ADDR(n, p) (SET_FPA_BASE(n) + (p<<3) + 0x1060) +#define FPA_POOLX_STACK_BASE(n, p) (SET_FPA_BASE(n) + (p<<3) + 0x1070) +#define FPA_POOLX_STACK_END(n, p) (SET_FPA_BASE(n) + (p<<3) + 0x1080) +#define FPA_POOLX_STACK_ADDR(n, p) (SET_FPA_BASE(n) + (p<<3) + 0x1090) + +#define FPA_AURAX_POOL(n, a) (SET_FPA_BASE(n) + (a<<3) + 0x2000) +#define FPA_AURAX_CFG(n, a)(SET_FPA_BASE(n) + (a<<3) + 0x2010) +#define FPA_AURAX_CNT(n, a)(SET_FPA_BASE(n) + (a<<3) + 0x2020) +#define FPA_AURAX_CNT_LIMIT(n, a) (SET_FPA_BASE(n) + (a<<3) + 0x2040) +#define FPA_AURAX_CNT_THRESHOLD(n, a) (SET_FPA_BASE(n) + (a<<3) + 0x2050) +#define FPA_AURAX_POOL_LEVELS(n, a)(SET_FPA_BASE(n) + (a<<3) + 0x2070) +#define FPA_AURAX_CNT_LEVELS(n, a) (SET_FPA_BASE(n) + (a<<3) + 0x2080) + +static inline u64 oct_csr_read(u64 addr) +{ + return __raw_readq((void __iomem *)addr); +} + +static inline void oct_csr_write(u64 data, u64 addr) +{ + __raw_writeq(data, (void __iomem *)addr); +} + +static DEFINE_MUTEX(octeon_fpa3_lock); + +static int get_num_pools(void) +{ + if (OCTEON_IS_MODEL(OCTEON_CN78XX)) + return 64; + if (OCTEON_IS_MODEL(OCTEON_CNF75XX) || OCTEON_IS_MODEL(OCTEON_CN73XX)) + return 32; + return 0; +} + +static int get_num_auras(void) +{ + if (OCTEON_IS_MODEL(OCTEON_CN78XX)) + return 1024; + if (OCTEON_IS_MODEL(OCTEON_CNF75XX) || OCTEON_IS_MODEL(OCTEON_CN73XX)) + return 512; + return 0; +} + +/** + * octeon_fpa3_init() - Initialize the fpa to default values. 
+ * @node: Node of fpa to initialize. + * + * Return: 0 if successful. + * < 0 for error codes. + */ +int octeon_fpa3_init(int node) +{ + static bool init_done[2]; + u64 data; + int aura_cnt, i; + + mutex_lock(_fpa3_lock); + + if (init_done[node]) + goto done; + + aura_cnt = get_num_auras(); +
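The register macros above all follow one pattern: a per-node base, an 8-byte stride per pool or aura index, and a fixed register offset, with the valid index range depending on the SoC model. The sketch below restates that pattern in portable C with the macro argument handled as a proper function parameter. The enum, struct and function names are invented here, and the base and offset are left as parameters because the constants in the excerpt cannot be relied on.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model identifiers, not the kernel's OCTEON_* macros. */
enum sk_model { SK_CN78XX, SK_CNF75XX, SK_CN73XX, SK_OTHER };

struct sk_fpa_limits {
	unsigned int pools;
	unsigned int auras;
};

/* Pool and aura counts per model, as in get_num_pools()/get_num_auras(). */
static struct sk_fpa_limits sketch_fpa_limits(enum sk_model m)
{
	switch (m) {
	case SK_CN78XX:
		return (struct sk_fpa_limits){ 64, 1024 };
	case SK_CNF75XX:
	case SK_CN73XX:
		return (struct sk_fpa_limits){ 32, 512 };
	default:
		return (struct sk_fpa_limits){ 0, 0 };
	}
}

/* Address of an indexed 64-bit CSR: base + (index << 3) + offset. */
static uint64_t sketch_indexed_csr(uint64_t base, uint64_t offset,
				   unsigned int index)
{
	return base + ((uint64_t)index << 3) + offset;
}

/* Same as above, but refuse aura indexes the model does not have. */
static uint64_t sketch_aura_csr(enum sk_model m, uint64_t base,
				uint64_t offset, unsigned int aura)
{
	assert(aura < sketch_fpa_limits(m).auras);
	return sketch_indexed_csr(base, offset, aura);
}
```

Using a function (or parenthesizing the macro argument as `((p) << 3)`) avoids surprises when the index is an expression such as `i + 1`, which the unparenthesized `(p<<3)` form would still parse as intended only by luck of operator precedence.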
[PATCH v5 net-next,mips 3/7] MIPS: Octeon: Add a global resource manager.
From: Carlos Munoz

Add a global resource manager to manage tagged pointers within
bootmem allocated memory. This is used by various functional blocks
in the Octeon core like the FPA, Ethernet nexus, etc.

Signed-off-by: Carlos Munoz
Signed-off-by: Steven J. Hill
Signed-off-by: David Daney
---
 arch/mips/cavium-octeon/Makefile       |   1 +
 arch/mips/cavium-octeon/resource-mgr.c | 351 +
 arch/mips/include/asm/octeon/octeon.h  |  18 ++
 3 files changed, 370 insertions(+)
 create mode 100644 arch/mips/cavium-octeon/resource-mgr.c

diff --git a/arch/mips/cavium-octeon/Makefile b/arch/mips/cavium-octeon/Makefile
index 7c02e542959a..28c0bb75d1a4 100644
--- a/arch/mips/cavium-octeon/Makefile
+++ b/arch/mips/cavium-octeon/Makefile
@@ -10,6 +10,7 @@
 #
 
 obj-y := cpu.o setup.o octeon-platform.o octeon-irq.o csrc-octeon.o
+obj-y += resource-mgr.o
 obj-y += dma-octeon.o
 obj-y += octeon-memcpy.o
 obj-y += executive/
diff --git a/arch/mips/cavium-octeon/resource-mgr.c b/arch/mips/cavium-octeon/resource-mgr.c
new file mode 100644
index ..74efda5420ff
--- /dev/null
+++ b/arch/mips/cavium-octeon/resource-mgr.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Resource manager for Octeon.
+ *
+ * Copyright (C) 2017 Cavium, Inc.
+ */
+#include
+
+#include
+#include
+
+#define RESOURCE_MGR_BLOCK_NAME	"cvmx-global-resources"
+#define MAX_RESOURCES		128
+#define INST_AVAILABLE		-88
+#define OWNER			0xbadc0de
+
+struct global_resource_entry {
+	struct global_resource_tag tag;
+	u64 phys_addr;
+	u64 size;
+};
+
+struct global_resources {
+#ifdef __LITTLE_ENDIAN_BITFIELD
+	u32 rlock;
+	u32 pad;
+#else
+	u32 pad;
+	u32 rlock;
+#endif
+	u64 entry_cnt;
+	struct global_resource_entry resource_entry[];
+};
+
+static struct global_resources *res_mgr_info;
+
+
+/*
+ * The resource manager interacts with software running outside of the
+ * Linux kernel, which necessitates locking to maintain data structure
+ * consistency. These custom locking functions implement the locking
+ * protocol, and cannot be replaced by kernel locking functions that
+ * may use different in-memory structures.
+ */
+
+static void res_mgr_lock(void)
+{
+	while (cmpxchg(&res_mgr_info->rlock, 0, 1))
+		; /* Loop while not zero */
+	rmb();
+}
+
+static void res_mgr_unlock(void)
+{
+	/* Wait until all resource operations finish before unlocking. */
+	wmb();
+	WRITE_ONCE(res_mgr_info->rlock, 0);
+	/* Force a write buffer flush. */
+	wmb();
+}
+
+static int res_mgr_find_resource(struct global_resource_tag tag)
+{
+	struct global_resource_entry *res_entry;
+	int i;
+
+	for (i = 0; i < res_mgr_info->entry_cnt; i++) {
+		res_entry = &res_mgr_info->resource_entry[i];
+		if (res_entry->tag.lo == tag.lo && res_entry->tag.hi == tag.hi)
+			return i;
+	}
+	return -1;
+}
+
+/**
+ * res_mgr_create_resource() - Create a resource.
+ * @tag: Identifies the resource.
+ * @inst_cnt: Number of resource instances to create.
+ *
+ * Returns 0 if the resource was created successfully.
+ * Returns < 0 for error codes.
+ */
+int res_mgr_create_resource(struct global_resource_tag tag, int inst_cnt)
+{
+	struct global_resource_entry *res_entry;
+	u64 size;
+	u64 *res_addr;
+	int res_index, i, rc = 0;
+
+	res_mgr_lock();
+
+	/* Make sure resource doesn't already exist. */
+	res_index = res_mgr_find_resource(tag);
+	if (res_index >= 0) {
+		rc = -EEXIST;
+		goto err;
+	}
+
+	if (res_mgr_info->entry_cnt >= MAX_RESOURCES) {
+		pr_err("Resource max limit reached, not created\n");
+		rc = -ENOSPC;
+		goto err;
+	}
+
+	/*
+	 * Each instance is kept in an array of u64s. The first array element
+	 * holds the number of allocated instances.
+	 */
+	size = sizeof(u64) * (inst_cnt + 1);
+	res_addr = cvmx_bootmem_alloc_range(size, CVMX_CACHE_LINE_SIZE, 0, 0);
+	if (!res_addr) {
+		pr_err("Failed to allocate resource. not created\n");
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	/* Initialize the newly created resource. */
+	*res_addr = inst_cnt;
+	for (i = 1; i <= inst_cnt; i++)
+		res_addr[i] = INST_AVAILABLE;
+
+	res_index = res_mgr_info->entry_cnt;
+	res_entry = &res_mgr_info->resource_entry[res_index];
+	res_entry->tag = tag;
+	res_entry->phys_addr = virt_to_phys(res_addr);
+	res_entry->size = size;
+	res_mgr_info->entry_cnt++;
+
+err:
+	res_mgr_unlock();
+
+	return rc;
+}
+EXPORT_SYMBOL(res_mgr_create_resource);
+
+/**
+ *
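Two ideas in this excerpt are easy to try outside the kernel: the compare-and-swap spin lock on a shared word, and the instance array whose first element holds the count and whose remaining elements start out as INST_AVAILABLE. The following is a user-space approximation using C11 atomics and malloc() in place of cmpxchg(), the explicit barriers and cvmx_bootmem_alloc_range(); the function names are invented for the example and the default sequentially consistent ordering is a simplification.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Same marker value as INST_AVAILABLE (-88), stored in a 64-bit slot. */
#define SK_INST_AVAILABLE	((uint64_t)-88)

/* Spin until the lock word changes from 0 to 1. */
static void sketch_lock(atomic_uint *rlock)
{
	unsigned int expected = 0;

	while (!atomic_compare_exchange_weak(rlock, &expected, 1))
		expected = 0;	/* a failed exchange overwrote expected */
}

/* Release the lock by storing 0 again. */
static void sketch_unlock(atomic_uint *rlock)
{
	atomic_store(rlock, 0);
}

/*
 * Allocate and initialize an instance array: element 0 is the instance
 * count, elements 1..inst_cnt are marked available.
 * Returns NULL on allocation failure; the caller frees the array.
 */
static uint64_t *sketch_create_instances(int inst_cnt)
{
	uint64_t *res;
	int i;

	assert(inst_cnt >= 0);
	res = malloc(sizeof(uint64_t) * ((size_t)inst_cnt + 1));
	if (!res)
		return NULL;

	res[0] = (uint64_t)inst_cnt;
	for (i = 1; i <= inst_cnt; i++)
		res[i] = SK_INST_AVAILABLE;
	return res;
}
```

The lock word is a plain integer precisely so that software outside Linux can take the same lock, which is the reason the patch gives for not using a kernel spinlock here.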
[PATCH v5 net-next,mips 7/7] MAINTAINERS: Add entry for drivers/net/ethernet/cavium/octeon/octeon3-*
Signed-off-by: David Daney--- MAINTAINERS | 6 ++ 1 file changed, 6 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 77d819b458a9..5aff6fb41b21 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3249,6 +3249,12 @@ W: http://www.cavium.com S: Supported F: drivers/mmc/host/cavium* +CAVIUM OCTEON-III NETWORK DRIVER +M: David Daney +L: netdev@vger.kernel.org +S: Supported +F: drivers/net/ethernet/cavium/octeon/octeon3-* + CAVIUM OCTEON-TX CRYPTO DRIVER M: George Cherian L: linux-cry...@vger.kernel.org -- 2.14.3
[PATCH v5 net-next,mips 0/7] Cavium OCTEON-III network driver.
We are adding the Cavium OCTEON-III network driver. But since
interacting with the input and output queues is done via special CPU
local memory, we also need to add support to the MIPS/Octeon
architecture code. Aren't SoCs nice in this way?

The first five patches add the SoC support needed by the driver, the
last two add the driver and an entry in MAINTAINERS.

Since these touch several subsystems (mips, netdev), I would propose
merging via netdev, but defer to the maintainers if they think
something else would work better.

A separate pull request was recently done by Steven Hill for the
firmware required by the driver.

Changes from v4:

o Removed cleanup patch for previous generation SoC "staging" driver,
  as it will be sent as a follow-on.

o Fixed kernel doc formatting in all patches.

o Removed redundant licensing text boilerplate.

o Reviewed-by: header added to 2/7.

o Rewrote locking code in 3/7 to eliminate inline asm.

Changes from v3:

o Use phy_print_status() instead of open coding the equivalent.

o Print warning on phy mode mismatch.

o Improve dt-bindings and add Acked-by.

Changes from v2:

o Fix PKI (RX path) initialization to work with little endian kernel.

Changes from v1:

o Cleanup and use of standard bindings in the device tree bindings
  document.

o Added (hopefully) clarifying comments about several OCTEON
  architectural peculiarities.

o Removed unused testing code from the driver.

o Removed some module parameters that already default to the proper
  values.

o KConfig cleanup, including testing on x86_64, arm64 and mips.

o Fixed breakage to the driver for previous generation of OCTEON SoCs
  (in the staging directory still).

o Verified bisectability of the patch set.

Carlos Munoz (5):
  dt-bindings: Add Cavium Octeon Common Ethernet Interface.
  MIPS: Octeon: Enable LMTDMA/LMTST operations.
  MIPS: Octeon: Add a global resource manager.
  MIPS: Octeon: Add Free Pointer Unit (FPA) support.
  netdev: octeon-ethernet: Add Cavium Octeon III support.

David Daney (2):
  MIPS: Octeon: Automatically provision CVMSEG space.
  MAINTAINERS: Add entry for
    drivers/net/ethernet/cavium/octeon/octeon3-*

 .../devicetree/bindings/net/cavium-bgx.txt         |   61 +
 MAINTAINERS                                        |    6 +
 arch/mips/cavium-octeon/Kconfig                    |   35 +-
 arch/mips/cavium-octeon/Makefile                   |    2 +
 arch/mips/cavium-octeon/octeon-fpa3.c              |  363
 arch/mips/cavium-octeon/resource-mgr.c             |  351
 arch/mips/cavium-octeon/setup.c                    |   22 +-
 .../asm/mach-cavium-octeon/kernel-entry-init.h     |   20 +-
 arch/mips/include/asm/mipsregs.h                   |    2 +
 arch/mips/include/asm/octeon/octeon.h              |   45 +-
 arch/mips/include/asm/processor.h                  |    2 +-
 arch/mips/kernel/octeon_switch.S                   |    2 -
 arch/mips/mm/tlbex.c                               |   29 +-
 drivers/net/ethernet/cavium/Kconfig                |   55 +-
 drivers/net/ethernet/cavium/octeon/Makefile        |    6 +
 .../net/ethernet/cavium/octeon/octeon3-bgx-nexus.c |  701 +++
 .../net/ethernet/cavium/octeon/octeon3-bgx-port.c  | 2015 +++
 drivers/net/ethernet/cavium/octeon/octeon3-core.c  | 2069
 drivers/net/ethernet/cavium/octeon/octeon3-pki.c   |  824
 drivers/net/ethernet/cavium/octeon/octeon3-pko.c   | 1688
 drivers/net/ethernet/cavium/octeon/octeon3-sso.c   |  301 +++
 drivers/net/ethernet/cavium/octeon/octeon3.h       |  418
 drivers/staging/octeon/ethernet-defines.h          |    2 +-
 23 files changed, 8955 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/cavium-bgx.txt
 create mode 100644 arch/mips/cavium-octeon/octeon-fpa3.c
 create mode 100644 arch/mips/cavium-octeon/resource-mgr.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-bgx-nexus.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-bgx-port.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-core.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-pki.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-pko.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3-sso.c
 create mode 100644 drivers/net/ethernet/cavium/octeon/octeon3.h
--
2.14.3
[PATCH v5 net-next,mips 1/7] dt-bindings: Add Cavium Octeon Common Ethernet Interface.
From: Carlos Munoz

Add bindings for Common Ethernet Interface (BGX) block.

Acked-by: Rob Herring
Signed-off-by: Carlos Munoz
Signed-off-by: Steven J. Hill
Signed-off-by: David Daney
---
 .../devicetree/bindings/net/cavium-bgx.txt | 61 ++
 1 file changed, 61 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/cavium-bgx.txt

diff --git a/Documentation/devicetree/bindings/net/cavium-bgx.txt b/Documentation/devicetree/bindings/net/cavium-bgx.txt
new file mode 100644
index ..830c5f08
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/cavium-bgx.txt
@@ -0,0 +1,61 @@
+* Common Ethernet Interface (BGX) block
+
+Properties:
+
+- compatible: "cavium,octeon-7890-bgx": Compatibility with all cn7xxx SOCs.
+
+- reg: The base address of the BGX block.
+
+- #address-cells: Must be <1>.
+
+- #size-cells: Must be <0>. BGX addresses have no size component.
+
+A BGX block has several children, each representing an Ethernet
+interface.
+
+
+* Ethernet Interface (BGX port) connects to PKI/PKO
+
+Properties:
+
+- compatible: "cavium,octeon-7890-bgx-port": Compatibility with all
+	      cn7xxx SOCs.
+
+	      "cavium,octeon-7360-xcv": Compatibility with cn73xx SOCs
+	      for RGMII.
+
+- reg: The index of the interface within the BGX block.
+
+Optional properties:
+
+- local-mac-address: Mac address for the interface.
+
+- phy-handle: phandle to the phy node connected to the interface.
+
+- phy-mode: described in ethernet.txt.
+
+- fixed-link: described in fixed-link.txt.
+
+Example:
+
+	ethernet-mac-nexus@11800e000 {
+		compatible = "cavium,octeon-7890-bgx";
+		reg = <0x00011800 0xe000 0x 0x0100>;
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		ethernet@0 {
+			compatible = "cavium,octeon-7360-xcv";
+			reg = <0>;
+			local-mac-address = [ 00 01 23 45 67 89 ];
+			phy-handle = <>;
+			phy-mode = "rgmii-rxid";
+		};
+		ethernet@1 {
+			compatible = "cavium,octeon-7890-bgx-port";
+			reg = <1>;
+			local-mac-address = [ 00 01 23 45 67 8a ];
+			phy-handle = <>;
+			phy-mode = "sgmii";
+		};
+	};
--
2.14.3
[PATCH net-next v3 2/8] net: xdp: report flags program was installed with on query
Some drivers enforce that flags on program replacement and removal
must match the flags passed on install. This leaves the possibility
open to enable simultaneous loading of XDP programs both to HW and DRV.

Allow such drivers to report the flags back to the stack.

Signed-off-by: Jakub Kicinski
Reviewed-by: Simon Horman
Reviewed-by: Quentin Monnet
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 1 +
 include/linux/netdevice.h                           | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 1a603fdd9e80..ea6bbf1efefc 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3392,6 +3392,7 @@ static int nfp_net_xdp(struct net_device *netdev, struct netdev_bpf *xdp)
 		if (nn->dp.bpf_offload_xdp)
 			xdp->prog_attached = XDP_ATTACHED_HW;
 		xdp->prog_id = nn->xdp_prog ? nn->xdp_prog->aux->id : 0;
+		xdp->prog_flags = nn->xdp_prog ? nn->xdp_flags : 0;
 		return 0;
 	case BPF_OFFLOAD_VERIFIER_PREP:
 		return nfp_app_bpf_verifier_prep(nn->app, nn, xdp);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 667bdd3ad33e..cc4ce7456e38 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -820,6 +820,8 @@ struct netdev_bpf {
 		struct {
 			u8 prog_attached;
 			u32 prog_id;
+			/* flags with which program was installed */
+			u32 prog_flags;
 		};
 		/* BPF_OFFLOAD_VERIFIER_PREP */
 		struct {
--
2.15.0
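The behaviour described in the commit message, where a driver remembers the install flags, rejects replacement or removal with different flags, and reports the stored flags on query, can be modelled with a small state machine. The sketch below is a toy user-space model, not driver code: the struct, the function names and the choice of -EBUSY for a mismatch are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy per-device XDP state; prog_id 0 means no program is installed. */
struct sk_xdp_state {
	uint32_t prog_id;
	uint32_t flags;
};

/*
 * Install, replace or remove (prog_id == 0) a program. If a program is
 * already present, the caller must pass the flags used at install time.
 */
static int sketch_xdp_setup(struct sk_xdp_state *st, uint32_t prog_id,
			    uint32_t flags)
{
	assert(st != NULL);
	if (st->prog_id && st->flags != flags)
		return -EBUSY;	/* assumed error code for a flag mismatch */

	st->prog_id = prog_id;
	st->flags = prog_id ? flags : 0;
	return 0;
}

/* Query path: report the install flags, or 0 when nothing is attached. */
static uint32_t sketch_xdp_query_flags(const struct sk_xdp_state *st)
{
	assert(st != NULL);
	return st->prog_id ? st->flags : 0;
}
```

Reporting the flags on query lets the caller learn which flags a later replace or remove must carry, instead of having to remember them separately.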
[PATCH net-next v3 8/8] net: dummy: remove fake SR-IOV functionality
netdevsim driver seems like a better place for fake SR-IOV functionality. Remove the code previously added to dummy. Signed-off-by: Jakub KicinskiReviewed-by: Quentin Monnet Acked-by: Phil Sutter --- CC: Phil Sutter CC: Sabrina Dubroca --- drivers/net/dummy.c | 215 +--- 1 file changed, 1 insertion(+), 214 deletions(-) diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c index 58483af80bdb..30b1c8512049 100644 --- a/drivers/net/dummy.c +++ b/drivers/net/dummy.c @@ -42,48 +42,7 @@ #define DRV_NAME "dummy" #define DRV_VERSION"1.0" -#undef pr_fmt -#define pr_fmt(fmt) DRV_NAME ": " fmt - static int numdummies = 1; -static int num_vfs; - -struct vf_data_storage { - u8 vf_mac[ETH_ALEN]; - u16 pf_vlan; /* When set, guest VLAN config not allowed. */ - u16 pf_qos; - __be16 vlan_proto; - u16 min_tx_rate; - u16 max_tx_rate; - u8 spoofchk_enabled; - boolrss_query_enabled; - u8 trusted; - int link_state; -}; - -struct dummy_priv { - struct vf_data_storage *vfinfo; -}; - -static int dummy_num_vf(struct device *dev) -{ - return num_vfs; -} - -static struct bus_type dummy_bus = { - .name = "dummy", - .num_vf = dummy_num_vf, -}; - -static void release_dummy_parent(struct device *dev) -{ -} - -static struct device dummy_parent = { - .init_name = "dummy", - .bus= _bus, - .release= release_dummy_parent, -}; /* fake multicast ability */ static void set_multicast_list(struct net_device *dev) @@ -133,25 +92,10 @@ static netdev_tx_t dummy_xmit(struct sk_buff *skb, struct net_device *dev) static int dummy_dev_init(struct net_device *dev) { - struct dummy_priv *priv = netdev_priv(dev); - dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats); if (!dev->dstats) return -ENOMEM; - priv->vfinfo = NULL; - - if (!num_vfs) - return 0; - - dev->dev.parent = _parent; - priv->vfinfo = kcalloc(num_vfs, sizeof(struct vf_data_storage), - GFP_KERNEL); - if (!priv->vfinfo) { - free_percpu(dev->dstats); - return -ENOMEM; - } - return 0; } @@ -169,117 +113,6 @@ static int dummy_change_carrier(struct 
net_device *dev, bool new_carrier) return 0; } -static int dummy_set_vf_mac(struct net_device *dev, int vf, u8 *mac) -{ - struct dummy_priv *priv = netdev_priv(dev); - - if (!is_valid_ether_addr(mac) || (vf >= num_vfs)) - return -EINVAL; - - memcpy(priv->vfinfo[vf].vf_mac, mac, ETH_ALEN); - - return 0; -} - -static int dummy_set_vf_vlan(struct net_device *dev, int vf, -u16 vlan, u8 qos, __be16 vlan_proto) -{ - struct dummy_priv *priv = netdev_priv(dev); - - if ((vf >= num_vfs) || (vlan > 4095) || (qos > 7)) - return -EINVAL; - - priv->vfinfo[vf].pf_vlan = vlan; - priv->vfinfo[vf].pf_qos = qos; - priv->vfinfo[vf].vlan_proto = vlan_proto; - - return 0; -} - -static int dummy_set_vf_rate(struct net_device *dev, int vf, int min, int max) -{ - struct dummy_priv *priv = netdev_priv(dev); - - if (vf >= num_vfs) - return -EINVAL; - - priv->vfinfo[vf].min_tx_rate = min; - priv->vfinfo[vf].max_tx_rate = max; - - return 0; -} - -static int dummy_set_vf_spoofchk(struct net_device *dev, int vf, bool val) -{ - struct dummy_priv *priv = netdev_priv(dev); - - if (vf >= num_vfs) - return -EINVAL; - - priv->vfinfo[vf].spoofchk_enabled = val; - - return 0; -} - -static int dummy_set_vf_rss_query_en(struct net_device *dev, int vf, bool val) -{ - struct dummy_priv *priv = netdev_priv(dev); - - if (vf >= num_vfs) - return -EINVAL; - - priv->vfinfo[vf].rss_query_enabled = val; - - return 0; -} - -static int dummy_set_vf_trust(struct net_device *dev, int vf, bool val) -{ - struct dummy_priv *priv = netdev_priv(dev); - - if (vf >= num_vfs) - return -EINVAL; - - priv->vfinfo[vf].trusted = val; - - return 0; -} - -static int dummy_get_vf_config(struct net_device *dev, - int vf, struct ifla_vf_info *ivi) -{ - struct dummy_priv *priv = netdev_priv(dev); - - if (vf >= num_vfs) - return -EINVAL; - - ivi->vf = vf; - memcpy(>mac, priv->vfinfo[vf].vf_mac, ETH_ALEN); - ivi->vlan = priv->vfinfo[vf].pf_vlan; - ivi->qos = priv->vfinfo[vf].pf_qos; - ivi->spoofchk = priv->vfinfo[vf].spoofchk_enabled; - 
ivi->linkstate = priv->vfinfo[vf].link_state; - ivi->min_tx_rate = priv->vfinfo[vf].min_tx_rate; -
[PATCH net-next v3 5/8] netdevsim: add bpf offload support
Add support for loading programs for netdevsim devices and expose the related information via DebugFS. Both offload of XDP and cls_bpf programs is supported. Signed-off-by: Jakub KicinskiReviewed-by: Simon Horman Reviewed-by: Quentin Monnet --- drivers/net/netdevsim/Makefile| 1 + drivers/net/netdevsim/bpf.c | 373 ++ drivers/net/netdevsim/netdev.c| 116 +++- drivers/net/netdevsim/netdevsim.h | 40 4 files changed, 529 insertions(+), 1 deletion(-) create mode 100644 drivers/net/netdevsim/bpf.c diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile index 07867bfe873b..074ddebbc41d 100644 --- a/drivers/net/netdevsim/Makefile +++ b/drivers/net/netdevsim/Makefile @@ -4,3 +4,4 @@ obj-$(CONFIG_NETDEVSIM) += netdevsim.o netdevsim-objs := \ netdev.o \ + bpf.o \ diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c new file mode 100644 index ..8e4398a50903 --- /dev/null +++ b/drivers/net/netdevsim/bpf.c @@ -0,0 +1,373 @@ +/* + * Copyright (C) 2017 Netronome Systems, Inc. + * + * This software is licensed under the GNU General License Version 2, + * June 1991 as shown in the file COPYING in the top-level directory of this + * source tree. + * + * THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" + * WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, + * BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS + * FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE + * OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME + * THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 
+ */ + +#include +#include +#include +#include +#include +#include + +#include "netdevsim.h" + +struct nsim_bpf_bound_prog { + struct netdevsim *ns; + struct bpf_prog *prog; + struct dentry *ddir; + const char *state; + bool is_loaded; + struct list_head l; +}; + +static int nsim_debugfs_bpf_string_read(struct seq_file *file, void *data) +{ + const char **str = file->private; + + if (*str) + seq_printf(file, "%s\n", *str); + + return 0; +} + +static int nsim_debugfs_bpf_string_open(struct inode *inode, struct file *f) +{ + return single_open(f, nsim_debugfs_bpf_string_read, inode->i_private); +} + +static const struct file_operations nsim_bpf_string_fops = { + .owner = THIS_MODULE, + .open = nsim_debugfs_bpf_string_open, + .release = single_release, + .read = seq_read, + .llseek = seq_lseek +}; + +static int +nsim_bpf_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn) +{ + struct nsim_bpf_bound_prog *state; + + state = env->prog->aux->offload->dev_priv; + if (state->ns->bpf_bind_verifier_delay && !insn_idx) + msleep(state->ns->bpf_bind_verifier_delay); + + return 0; +} + +static const struct bpf_ext_analyzer_ops nsim_bpf_analyzer_ops = { + .insn_hook = nsim_bpf_verify_insn, +}; + +static bool nsim_xdp_offload_active(struct netdevsim *ns) +{ + return ns->xdp_prog_mode == XDP_ATTACHED_HW; +} + +static void nsim_prog_set_loaded(struct bpf_prog *prog, bool loaded) +{ + struct nsim_bpf_bound_prog *state; + + if (!prog || !prog->aux->offload) + return; + + state = prog->aux->offload->dev_priv; + state->is_loaded = loaded; +} + +static int +nsim_bpf_offload(struct netdevsim *ns, struct bpf_prog *prog, bool oldprog) +{ + nsim_prog_set_loaded(ns->bpf_offloaded, false); + + WARN(!!ns->bpf_offloaded != oldprog, +"bad offload state, expected offload %sto be active", +oldprog ? "" : "not "); + ns->bpf_offloaded = prog; + ns->bpf_offloaded_id = prog ? 
prog->aux->id : 0; + nsim_prog_set_loaded(prog, true); + + return 0; +} + +int nsim_bpf_setup_tc_block_cb(enum tc_setup_type type, + void *type_data, void *cb_priv) +{ + struct tc_cls_bpf_offload *cls_bpf = type_data; + struct bpf_prog *prog = cls_bpf->prog; + struct netdevsim *ns = cb_priv; + bool skip_sw; + + if (type != TC_SETUP_CLSBPF || + !tc_can_offload(ns->netdev) || + cls_bpf->common.protocol != htons(ETH_P_ALL) || + cls_bpf->common.chain_index) + return -EOPNOTSUPP; + + skip_sw = cls_bpf->gen_flags & TCA_CLS_FLAGS_SKIP_SW; + + if (nsim_xdp_offload_active(ns)) + return -EBUSY; + + if (!ns->bpf_tc_accept) + return -EOPNOTSUPP; + /* Note: progs without skip_sw will probably not be dev bound */ + if (prog && !prog->aux->offload && !ns->bpf_tc_non_bound_accept) + return -EOPNOTSUPP; + + switch (cls_bpf->command) { +
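The bookkeeping in nsim_bpf_offload() above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: every sketch_* name is hypothetical, the prog->aux->offload indirection is dropped, and a counter stands in for the kernel's WARN().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the netdevsim structures themselves. */
struct sketch_prog {
	unsigned int id;
	bool is_loaded;
};

struct sketch_ns {
	struct sketch_prog *offloaded;	/* currently offloaded program, if any */
	unsigned int offloaded_id;	/* 0 when nothing is offloaded */
	unsigned int warnings;		/* stands in for WARN() in this sketch */
};

static void sketch_set_loaded(struct sketch_prog *prog, bool loaded)
{
	if (!prog)
		return;
	prog->is_loaded = loaded;
}

/*
 * Replace the offloaded program. "oldprog" is what the caller believes:
 * true if it expects a program to be offloaded already.
 */
static int sketch_offload(struct sketch_ns *ns, struct sketch_prog *prog,
			  bool oldprog)
{
	sketch_set_loaded(ns->offloaded, false);

	/* Flag a mismatch between the caller's view and the device state. */
	if ((ns->offloaded != NULL) != oldprog)
		ns->warnings++;

	ns->offloaded = prog;
	ns->offloaded_id = prog ? prog->id : 0;
	sketch_set_loaded(prog, true);

	return 0;
}
```

As in the patch, a mismatch is reported but does not fail the call; the new program still replaces the old one.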
[PATCH net-next v3 3/8] net: xdp: make the stack take care of the tear down
Since day one of XDP drivers had to remember to free the program on the remove path. This leads to code duplication and is error prone. Make the stack query the installed programs on unregister and if something is installed, remove the program. Freeing of program attached to XDP generic is moved from free_netdev() as well. Because the remove will now be called before notifiers are invoked, BPF offload state of the program will not get destroyed before uninstall. Signed-off-by: Jakub KicinskiReviewed-by: Simon Horman Reviewed-by: Quentin Monnet --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 -- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 7 -- .../net/ethernet/netronome/nfp/nfp_net_common.c| 3 --- drivers/net/ethernet/qlogic/qede/qede_main.c | 4 --- drivers/net/tun.c | 4 --- net/core/dev.c | 29 -- 7 files changed, 22 insertions(+), 30 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 33c49ad697e4..413ad2444ba2 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -7800,8 +7800,6 @@ static void bnxt_remove_one(struct pci_dev *pdev) bnxt_dcb_free(bp); kfree(bp->edev); bp->edev = NULL; - if (bp->xdp_prog) - bpf_prog_put(bp->xdp_prog); bnxt_cleanup_pci(bp); free_netdev(dev); } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index d2b057a3e512..0f5c012de52e 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -4308,9 +4308,6 @@ static void mlx5e_nic_cleanup(struct mlx5e_priv *priv) { mlx5e_ipsec_cleanup(priv); mlx5e_vxlan_cleanup(priv); - - if (priv->channels.params.xdp_prog) - bpf_prog_put(priv->channels.params.xdp_prog); } static int mlx5e_init_nic_rx(struct mlx5e_priv *priv) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c index e379b78e86ef..54bfd7846f6d 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -82,12 +82,6 @@ static const char *nfp_bpf_extra_cap(struct nfp_app *app, struct nfp_net *nn) return nfp_net_ebpf_capable(nn) ? "BPF" : ""; } -static void nfp_bpf_vnic_free(struct nfp_app *app, struct nfp_net *nn) -{ - if (nn->dp.bpf_offload_xdp) - nfp_bpf_xdp_offload(app, nn, NULL); -} - static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type type, void *type_data, void *cb_priv) { @@ -168,7 +162,6 @@ const struct nfp_app_type app_bpf = { .extra_cap = nfp_bpf_extra_cap, .vnic_alloc = nfp_app_nic_vnic_alloc, - .vnic_free = nfp_bpf_vnic_free, .setup_tc = nfp_bpf_setup_tc, .tc_busy= nfp_bpf_tc_busy, diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index ea6bbf1efefc..ad3e9f6a61e5 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -3562,9 +3562,6 @@ struct nfp_net *nfp_net_alloc(struct pci_dev *pdev, bool needs_netdev, */ void nfp_net_free(struct nfp_net *nn) { - if (nn->xdp_prog) - bpf_prog_put(nn->xdp_prog); - if (nn->dp.netdev) free_netdev(nn->dp.netdev); else diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c index 8f9b3eb82137..57332b3e5e64 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_main.c +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c @@ -1068,10 +1068,6 @@ static void __qede_remove(struct pci_dev *pdev, enum qede_remove_mode mode) pci_set_drvdata(pdev, NULL); - /* Release edev's reference to XDP's bpf if such exist */ - if (edev->xdp_prog) - bpf_prog_put(edev->xdp_prog); - /* Use global ops since we've freed edev */ qed_ops->common->slowpath_stop(cdev); if (system_state == SYSTEM_POWER_OFF) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 
6a7bde9bc4b2..6f7e8e45c961 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -673,7 +673,6 @@ static void tun_detach(struct tun_file *tfile, bool clean) static void tun_detach_all(struct net_device *dev) { struct tun_struct *tun = netdev_priv(dev); - struct bpf_prog *xdp_prog = rtnl_dereference(tun->xdp_prog); struct tun_file *tfile, *tmp; int i, n = tun->numqueues; @@ -708,9 +707,6 @@ static
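The idea in the commit message (the stack queries the driver on unregister and removes whatever is installed) can be sketched in userspace like this. It is a toy model under stated assumptions: all sketch_* names are hypothetical, a plain counter replaces the real program refcount, and the net/core/dev.c part of the patch is not shown in this excerpt, so this is not the kernel implementation.

```c
#include <stddef.h>

/* Hypothetical userspace model; a counter stands in for the prog refcount. */
struct sketch_prog {
	int refcnt;
};

enum sketch_cmd { SKETCH_SETUP, SKETCH_QUERY };

struct sketch_req {
	enum sketch_cmd command;
	struct sketch_prog *prog;	/* SETUP: program to install, NULL removes */
	int attached;			/* QUERY: filled in by the driver */
};

struct sketch_dev {
	struct sketch_prog *xdp_prog;
	int (*op)(struct sketch_dev *dev, struct sketch_req *req);
};

static void sketch_prog_put(struct sketch_prog *prog)
{
	if (prog)
		prog->refcnt--;
}

/* Toy driver: swaps the program on SETUP, reports its state on QUERY. */
static int sketch_driver_op(struct sketch_dev *dev, struct sketch_req *req)
{
	switch (req->command) {
	case SKETCH_SETUP:
		sketch_prog_put(dev->xdp_prog);
		dev->xdp_prog = req->prog;
		return 0;
	case SKETCH_QUERY:
		req->attached = dev->xdp_prog != NULL;
		return 0;
	}
	return -1;
}

/* Core-side teardown: ask the driver, then remove whatever is installed. */
static void sketch_uninstall(struct sketch_dev *dev)
{
	struct sketch_req req = { .command = SKETCH_QUERY };

	if (!dev->op || dev->op(dev, &req) < 0 || !req.attached)
		return;

	req = (struct sketch_req){ .command = SKETCH_SETUP, .prog = NULL };
	dev->op(dev, &req);
}
```

With the uninstall living in one place, the driver remove paths no longer need their own "if (prog) put(prog)" blocks, which is what the per-driver hunks above delete.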
[PATCH net-next v3 0/8] xdp: make stack perform remove and add selftests
Hi!

The purpose of this series is to add a software model of BPF offloads to
make it easier for everyone to test them, and to make some of the more
arcane rules and assumptions clearer.

The series starts with 3 patches aiming to make XDP handling in the
drivers less error prone. Currently driver authors have to remember to
free XDP programs if XDP is active during unregister. With this series
the core will disable XDP on its own. This takes place after close, so
drivers are not expected to perform reconfiguration when disabling XDP
on a downed device.

The next two patches add the software netdev driver, followed by a Python
test which exercises all the corner cases that came to my mind. The test
needs to be run as root. It prints basic information to stdout, but can
also create a more detailed log of all commands when the --log option is
passed. The log is in Emacs Org-mode format.

  ./tools/testing/selftests/bpf/test_offload.py --log /tmp/log

The last two patches replace the SR-IOV API implementation of dummy.

v3:
 - move the freeing of vfs to release (Phil).
v2:
 - free device from the release function;
 - use bus-based name generation instead of netdev name.
v1:
 - replace the SR-IOV API implementation of dummy;
 - make the dev_xdp_uninstall() also handle the XDP generic (Daniel).
Jakub Kicinski (8):
  net: xdp: avoid output parameters when querying XDP prog
  net: xdp: report flags program was installed with on query
  net: xdp: make the stack take care of the tear down
  netdevsim: add software driver for testing offloads
  netdevsim: add bpf offload support
  selftests/bpf: add offload test based on netdevsim
  netdevsim: add SR-IOV functionality
  net: dummy: remove fake SR-IOV functionality

 MAINTAINERS                                        |   5 +
 drivers/net/Kconfig                                |  11 +
 drivers/net/Makefile                               |   1 +
 drivers/net/dummy.c                                | 215 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt.c          |   2 -
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 -
 drivers/net/ethernet/netronome/nfp/bpf/main.c      |   7 -
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |   4 +-
 drivers/net/ethernet/qlogic/qede/qede_main.c       |   4 -
 drivers/net/netdevsim/Makefile                     |   7 +
 drivers/net/netdevsim/bpf.c                        | 373 +++
 drivers/net/netdevsim/netdev.c                     | 502 +++
 drivers/net/netdevsim/netdevsim.h                  |  78 +++
 drivers/net/tun.c                                  |   4 -
 include/linux/netdevice.h                          |   5 +-
 net/core/dev.c                                     |  53 +-
 net/core/rtnetlink.c                               |   6 +-
 tools/testing/selftests/bpf/Makefile               |   5 +-
 tools/testing/selftests/bpf/sample_ret0.c          |   7 +
 tools/testing/selftests/bpf/test_offload.py        | 681 +
 20 files changed, 1715 insertions(+), 258 deletions(-)
 create mode 100644 drivers/net/netdevsim/Makefile
 create mode 100644 drivers/net/netdevsim/bpf.c
 create mode 100644 drivers/net/netdevsim/netdev.c
 create mode 100644 drivers/net/netdevsim/netdevsim.h
 create mode 100644 tools/testing/selftests/bpf/sample_ret0.c
 create mode 100755 tools/testing/selftests/bpf/test_offload.py

-- 
2.15.0
[PATCH net-next v3 4/8] netdevsim: add software driver for testing offloads
To be able to run selftests without any hardware required we need a software model. The model can also serve as an example implementation for those implementing actual HW offloads. The dummy driver have previously been extended to test SR-IOV, but the general consensus seems to be against adding further features to it. Add a new driver for purposes of software modelling only. eBPF and SR-IOV will be added here shortly, others are invited to further extend the driver with their offload models. Signed-off-by: Jakub KicinskiReviewed-by: Simon Horman Reviewed-by: Quentin Monnet --- MAINTAINERS | 5 ++ drivers/net/Kconfig | 11 drivers/net/Makefile | 1 + drivers/net/netdevsim/Makefile| 6 ++ drivers/net/netdevsim/netdev.c| 118 ++ drivers/net/netdevsim/netdevsim.h | 26 + 6 files changed, 167 insertions(+) create mode 100644 drivers/net/netdevsim/Makefile create mode 100644 drivers/net/netdevsim/netdev.c create mode 100644 drivers/net/netdevsim/netdevsim.h diff --git a/MAINTAINERS b/MAINTAINERS index 77d819b458a9..010e46a38373 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9599,6 +9599,11 @@ NETWORKING [WIRELESS] L: linux-wirel...@vger.kernel.org Q: http://patchwork.kernel.org/project/linux-wireless/list/ +NETDEVSIM +M: Jakub Kicinski +S: Maintained +F: drivers/net/netdevsim/* + NETXEN (1/10) GbE SUPPORT M: Manish Chopra M: Rahul Verma diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 0936da592e12..944ec3c9282c 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -497,4 +497,15 @@ config THUNDERBOLT_NET source "drivers/net/hyperv/Kconfig" +config NETDEVSIM + tristate "Simulated networking device" + depends on DEBUG_FS + help + This driver is a developer testing tool and software model that can + be used to test various control path networking APIs, especially + HW-offload related. + + To compile this driver as a module, choose M here: the module + will be called netdevsim. 
+ endif # NETDEVICES diff --git a/drivers/net/Makefile b/drivers/net/Makefile index 766f62d02a0b..04c3b747812c 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -78,3 +78,4 @@ obj-$(CONFIG_FUJITSU_ES) += fjes/ thunderbolt-net-y += thunderbolt.o obj-$(CONFIG_THUNDERBOLT_NET) += thunderbolt-net.o +obj-$(CONFIG_NETDEVSIM) += netdevsim/ diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile new file mode 100644 index ..07867bfe873b --- /dev/null +++ b/drivers/net/netdevsim/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-$(CONFIG_NETDEVSIM) += netdevsim.o + +netdevsim-objs := \ + netdev.o \ diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c new file mode 100644 index ..7599c72c477a --- /dev/null +++ b/drivers/net/netdevsim/netdev.c @@ -0,0 +1,118 @@ +/* + * Copyright (C) 2017 Netronome Systems, Inc. + * + * This software is licensed under the GNU General License Version 2, + * June 1991 as shown in the file COPYING in the top-level directory of this + * source tree. + * + * THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" + * WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, + * BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS + * FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE + * OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME + * THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 
+ */ + +#include +#include +#include +#include +#include +#include +#include + +#include "netdevsim.h" + +static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct netdevsim *ns = netdev_priv(dev); + + u64_stats_update_begin(>syncp); + ns->tx_packets++; + ns->tx_bytes += skb->len; + u64_stats_update_end(>syncp); + + dev_kfree_skb(skb); + + return NETDEV_TX_OK; +} + +static void nsim_set_rx_mode(struct net_device *dev) +{ +} + +static void +nsim_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats) +{ + struct netdevsim *ns = netdev_priv(dev); + unsigned int start; + + do { + start = u64_stats_fetch_begin(>syncp); + stats->tx_bytes = ns->tx_bytes; + stats->tx_packets = ns->tx_packets; + } while (u64_stats_fetch_retry(>syncp, start)); +} + +static const struct net_device_ops nsim_netdev_ops = { + .ndo_start_xmit = nsim_start_xmit, + .ndo_set_rx_mode= nsim_set_rx_mode, + .ndo_set_mac_address= eth_mac_addr, +
[PATCH net-next v3 1/8] net: xdp: avoid output parameters when querying XDP prog
The output parameters will get unwieldy if we want to add more
information about the program. Simply pass the entire
struct netdev_bpf in.

Signed-off-by: Jakub Kicinski
Reviewed-by: Simon Horman
Reviewed-by: Quentin Monnet
---
 include/linux/netdevice.h |  3 ++-
 net/core/dev.c            | 24 ++--
 net/core/rtnetlink.c      |  6 +-
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ef789e1d679e..667bdd3ad33e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3330,7 +3330,8 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
 int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		      int fd, u32 flags);
-u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t xdp_op, u32 *prog_id);
+void __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op,
+		     struct netdev_bpf *xdp);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index 07ed21d64f92..3f271c9cb5e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7073,17 +7073,21 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 }
 EXPORT_SYMBOL(dev_change_proto_down);
 
-u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t bpf_op, u32 *prog_id)
+void __dev_xdp_query(struct net_device *dev, bpf_op_t bpf_op,
+		     struct netdev_bpf *xdp)
 {
-	struct netdev_bpf xdp;
-
-	memset(&xdp, 0, sizeof(xdp));
-	xdp.command = XDP_QUERY_PROG;
+	memset(xdp, 0, sizeof(*xdp));
+	xdp->command = XDP_QUERY_PROG;
 
 	/* Query must always succeed. */
-	WARN_ON(bpf_op(dev, &xdp) < 0);
-	if (prog_id)
-		*prog_id = xdp.prog_id;
+	WARN_ON(bpf_op(dev, xdp) < 0);
+}
+
+static u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t bpf_op)
+{
+	struct netdev_bpf xdp;
+
+	__dev_xdp_query(dev, bpf_op, &xdp);
 
 	return xdp.prog_attached;
 }
@@ -7134,10 +7138,10 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		bpf_chk = generic_xdp_install;
 
 	if (fd >= 0) {
-		if (bpf_chk && __dev_xdp_attached(dev, bpf_chk, NULL))
+		if (bpf_chk && __dev_xdp_attached(dev, bpf_chk))
 			return -EEXIST;
 		if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) &&
-		    __dev_xdp_attached(dev, bpf_op, NULL))
+		    __dev_xdp_attached(dev, bpf_op))
 			return -EBUSY;
 
 		prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dabba2a91fc8..9c4cb584bfb0 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1261,6 +1261,7 @@ static u8 rtnl_xdp_attached_mode(struct net_device *dev, u32 *prog_id)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
 	const struct bpf_prog *generic_xdp_prog;
+	struct netdev_bpf xdp;
 
 	ASSERT_RTNL();
 
@@ -1273,7 +1274,10 @@ static u8 rtnl_xdp_attached_mode(struct net_device *dev, u32 *prog_id)
 	if (!ops->ndo_bpf)
 		return XDP_ATTACHED_NONE;
 
-	return __dev_xdp_attached(dev, ops->ndo_bpf, prog_id);
+	__dev_xdp_query(dev, ops->ndo_bpf, &xdp);
+	*prog_id = xdp.prog_id;
+
+	return xdp.prog_attached;
 }
 
 static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
-- 
2.15.0
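For readers who want the pattern outside the kernel, the refactoring in this patch (one query helper that fills the whole request struct, plus thin wrappers for callers that need a single field) might look like the following in plain C. All sketch_* names are hypothetical userspace stand-ins, and assert() takes the place of WARN_ON().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types; only the shape matters. */
enum sketch_cmd { SKETCH_QUERY_PROG = 1 };

struct sketch_bpf {
	enum sketch_cmd command;
	uint8_t prog_attached;
	uint32_t prog_id;
};

struct sketch_dev {
	uint32_t installed_id;	/* 0 means nothing is attached */
};

typedef int (*sketch_op_t)(struct sketch_dev *dev, struct sketch_bpf *bpf);

/* A toy driver callback that answers the query command. */
static int sketch_driver_op(struct sketch_dev *dev, struct sketch_bpf *bpf)
{
	if (bpf->command != SKETCH_QUERY_PROG)
		return -1;
	bpf->prog_attached = dev->installed_id != 0;
	bpf->prog_id = dev->installed_id;
	return 0;
}

/* Fill the whole request struct; callers pick out the fields they need. */
static void sketch_query(struct sketch_dev *dev, sketch_op_t op,
			 struct sketch_bpf *xdp)
{
	int err;

	memset(xdp, 0, sizeof(*xdp));
	xdp->command = SKETCH_QUERY_PROG;
	err = op(dev, xdp);
	assert(err == 0);	/* the query is expected to always succeed */
	(void)err;
}

/* Thin wrapper for callers that only care whether something is attached. */
static uint8_t sketch_attached(struct sketch_dev *dev, sketch_op_t op)
{
	struct sketch_bpf xdp;

	sketch_query(dev, op, &xdp);
	return xdp.prog_attached;
}
```

Adding another field to the reply later only touches the struct and the callers that want it; the helper signatures stay the same.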
[PATCH net-next v3 6/8] selftests/bpf: add offload test based on netdevsim
Add a test of BPF offload control path interfaces based on just-added netdevsim driver. Perform various checks of both the stack and the expected driver behaviour. Signed-off-by: Jakub KicinskiReviewed-by: Simon Horman Reviewed-by: Quentin Monnet --- tools/testing/selftests/bpf/Makefile| 5 +- tools/testing/selftests/bpf/sample_ret0.c | 7 + tools/testing/selftests/bpf/test_offload.py | 681 3 files changed, 691 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/bpf/sample_ret0.c create mode 100755 tools/testing/selftests/bpf/test_offload.py diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 333a48655ee0..2c9d8c63c6fa 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -17,9 +17,10 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \ test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \ - sockmap_verdict_prog.o dev_cgroup.o + sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o -TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh +TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \ + test_offload.py include ../lib.mk diff --git a/tools/testing/selftests/bpf/sample_ret0.c b/tools/testing/selftests/bpf/sample_ret0.c new file mode 100644 index ..fec99750d6ea --- /dev/null +++ b/tools/testing/selftests/bpf/sample_ret0.c @@ -0,0 +1,7 @@ +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) */ + +/* Sample program which should always load for testing control paths. */ +int func() +{ + return 0; +} diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py new file mode 100755 index ..3914f7a4585a --- /dev/null +++ b/tools/testing/selftests/bpf/test_offload.py @@ -0,0 +1,681 @@ +#!/usr/bin/python3 + +# Copyright (C) 2017 Netronome Systems, Inc. 
+# +# This software is licensed under the GNU General License Version 2, +# June 1991 as shown in the file COPYING in the top-level directory of this +# source tree. +# +# THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" +# WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, +# BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS +# FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE +# OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME +# THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + +from datetime import datetime +import argparse +import json +import os +import pprint +import subprocess +import time + +logfile = None +log_level = 1 +bpf_test_dir = os.path.dirname(os.path.realpath(__file__)) +pp = pprint.PrettyPrinter() +devs = [] # devices we created for clean up +files = [] # files to be removed + +def log_get_sec(level=0): +return "*" * (log_level + level) + +def log_level_inc(add=1): +global log_level +log_level += add + +def log_level_dec(sub=1): +global log_level +log_level -= sub + +def log_level_set(level): +global log_level +log_level = level + +def log(header, data, level=None): +""" +Output to an optional log. 
+""" +if logfile is None: +return +if level is not None: +log_level_set(level) + +if not isinstance(data, str): +data = pp.pformat(data) + +if len(header): +logfile.write("\n" + log_get_sec() + " ") +logfile.write(header) +if len(header) and len(data.strip()): +logfile.write("\n") +logfile.write(data) + +def skip(cond, msg): +if not cond: +return +print("SKIP: " + msg) +log("SKIP: " + msg, "", level=1) +os.sys.exit(0) + +def fail(cond, msg): +if not cond: +return +print("FAIL: " + msg) +log("FAIL: " + msg, "", level=1) +os.sys.exit(1) + +def start_test(msg): +log(msg, "", level=1) +log_level_inc() +print(msg) + +def cmd(cmd, shell=True, include_stderr=False, background=False, fail=True): +""" +Run a command in subprocess and return tuple of (retval, stdout); +optionally return stderr as well as third value. +""" +proc = subprocess.Popen(cmd, shell=shell, stdout=subprocess.PIPE, +stderr=subprocess.PIPE) +if background: +msg = "%s START: %s" % (log_get_sec(1), +datetime.now().strftime("%H:%M:%S.%f")) +log("BKG " + proc.args, msg) +return proc + +return cmd_result(proc, include_stderr=include_stderr, fail=fail) + +def cmd_result(proc, include_stderr=False, fail=False): +stdout, stderr = proc.communicate() +stdout
[PATCH net-next v3 7/8] netdevsim: add SR-IOV functionality
dummy driver was extended with VF-related netdev APIs for testing SR-IOV-related software. netdevsim did not exist back then. Implement SR-IOV functionality in netdevsim. Notable difference is that since netdevsim has no module parameters, we will actually create a device with sriov_numvfs attribute for each netdev. The zero MAC address is accepted as some HW use it to mean any address is allowed. Link state is also now validated. Signed-off-by: Jakub KicinskiReviewed-by: Quentin Monnet --- CC: Phil Sutter CC: Sabrina Dubroca --- drivers/net/netdevsim/netdev.c| 274 +- drivers/net/netdevsim/netdevsim.h | 12 ++ 2 files changed, 284 insertions(+), 2 deletions(-) diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c index 828c1ce49a8b..eb8c679fca9f 100644 --- a/drivers/net/netdevsim/netdev.c +++ b/drivers/net/netdevsim/netdev.c @@ -25,6 +25,125 @@ #include "netdevsim.h" +struct nsim_vf_config { + int link_state; + u16 min_tx_rate; + u16 max_tx_rate; + u16 vlan; + __be16 vlan_proto; + u16 qos; + u8 vf_mac[ETH_ALEN]; + bool spoofchk_enabled; + bool trusted; + bool rss_query_enabled; +}; + +static u32 nsim_dev_id; + +static int nsim_num_vf(struct device *dev) +{ + struct netdevsim *ns = to_nsim(dev); + + return ns->num_vfs; +} + +static struct bus_type nsim_bus = { + .name = DRV_NAME, + .dev_name = DRV_NAME, + .num_vf = nsim_num_vf, +}; + +static int nsim_vfs_enable(struct netdevsim *ns, unsigned int num_vfs) +{ + ns->vfconfigs = kcalloc(num_vfs, sizeof(struct nsim_vf_config), + GFP_KERNEL); + if (!ns->vfconfigs) + return -ENOMEM; + ns->num_vfs = num_vfs; + + return 0; +} + +static void nsim_vfs_disable(struct netdevsim *ns) +{ + kfree(ns->vfconfigs); + ns->vfconfigs = NULL; + ns->num_vfs = 0; +} + +static ssize_t +nsim_numvfs_store(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + struct netdevsim *ns = to_nsim(dev); + unsigned int num_vfs; + int ret; + + ret = kstrtouint(buf, 0, _vfs); + if (ret) + return 
ret; + + rtnl_lock(); + if (ns->num_vfs == num_vfs) + goto exit_good; + if (ns->num_vfs && num_vfs) { + ret = -EBUSY; + goto exit_unlock; + } + + if (num_vfs) { + ret = nsim_vfs_enable(ns, num_vfs); + if (ret) + goto exit_unlock; + } else { + nsim_vfs_disable(ns); + } +exit_good: + ret = count; +exit_unlock: + rtnl_unlock(); + + return ret; +} + +static ssize_t +nsim_numvfs_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + struct netdevsim *ns = to_nsim(dev); + + return sprintf(buf, "%u\n", ns->num_vfs); +} + +static struct device_attribute nsim_numvfs_attr = + __ATTR(sriov_numvfs, 0664, nsim_numvfs_show, nsim_numvfs_store); + +static struct attribute *nsim_dev_attrs[] = { + _numvfs_attr.attr, + NULL, +}; + +static const struct attribute_group nsim_dev_attr_group = { + .attrs = nsim_dev_attrs, +}; + +static const struct attribute_group *nsim_dev_attr_groups[] = { + _dev_attr_group, + NULL, +}; + +static void nsim_dev_release(struct device *dev) +{ + struct netdevsim *ns = to_nsim(dev); + + nsim_vfs_disable(ns); + free_netdev(ns->netdev); +} + +struct device_type nsim_dev_type = { + .groups = nsim_dev_attr_groups, + .release = nsim_dev_release, +}; + static int nsim_init(struct net_device *dev) { struct netdevsim *ns = netdev_priv(dev); @@ -37,8 +156,19 @@ static int nsim_init(struct net_device *dev) if (err) goto err_debugfs_destroy; + ns->dev.id = nsim_dev_id++; + ns->dev.bus = _bus; + ns->dev.type = _dev_type; + err = device_register(>dev); + if (err) + goto err_bpf_uninit; + + SET_NETDEV_DEV(dev, >dev); + return 0; +err_bpf_uninit: + nsim_bpf_uninit(ns); err_debugfs_destroy: debugfs_remove_recursive(ns->ddir); return err; @@ -52,6 +182,14 @@ static void nsim_uninit(struct net_device *dev) nsim_bpf_uninit(ns); } +static void nsim_free(struct net_device *dev) +{ + struct netdevsim *ns = netdev_priv(dev); + + device_unregister(>dev); + /* netdev and vf state will be freed out of device_release() */ +} + static netdev_tx_t 
nsim_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct netdevsim *ns = netdev_priv(dev); @@ -122,6 +260,123 @@ nsim_setup_tc_block(struct net_device *dev, struct tc_block_offload *f)
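The transition rules enforced by nsim_numvfs_store() above (writing the current value is a no-op, going from N to a different non-zero M is refused, and only 0 to N or N to 0 changes state) can be sketched in userspace as below. The sketch_* names are hypothetical, calloc()/free() replace kcalloc()/kfree(), and the rtnl locking and sysfs plumbing are left out.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace model of the per-device VF state. */
struct sketch_vf_config {
	int link_state;
};

struct sketch_ns {
	unsigned int num_vfs;
	struct sketch_vf_config *vfconfigs;
};

static int sketch_vfs_enable(struct sketch_ns *ns, unsigned int num_vfs)
{
	ns->vfconfigs = calloc(num_vfs, sizeof(*ns->vfconfigs));
	if (!ns->vfconfigs)
		return -ENOMEM;
	ns->num_vfs = num_vfs;

	return 0;
}

static void sketch_vfs_disable(struct sketch_ns *ns)
{
	free(ns->vfconfigs);
	ns->vfconfigs = NULL;
	ns->num_vfs = 0;
}

/* Returns 0 on success or a negative errno; locking is not modelled. */
static int sketch_set_numvfs(struct sketch_ns *ns, unsigned int num_vfs)
{
	if (ns->num_vfs == num_vfs)
		return 0;		/* nothing to do */
	if (ns->num_vfs && num_vfs)
		return -EBUSY;		/* resizing has to go through 0 */

	if (num_vfs)
		return sketch_vfs_enable(ns, num_vfs);

	sketch_vfs_disable(ns);
	return 0;
}
```

Refusing a direct N to M change keeps the enable path free of any resize logic: the array is only ever allocated from the empty state.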
[PATCH net 1/2] tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()
James Morris reported kernel stack corruption bug [1] while running the SELinux testsuite, and bisected to a recent commit bffa72cf7f9d ("net: sk_buff rbnode reorg") We believe this commit is fine, but exposes an older bug. SELinux code runs from tcp_filter() and might send an ICMP, expecting IP options to be found in skb->cb[] using regular IPCB placement. We need to defer TCP mangling of skb->cb[] after tcp_filter() calls. This patch adds tcp_v4_fill_cb()/tcp_v4_restore_cb() in a very similar way we added them for IPv6. [1] [ 339.806024] SELinux: failure in selinux_parse_skb(), unable to parse packet [ 339.822505] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: 81745af5 [ 339.822505] [ 339.852250] CPU: 4 PID: 3642 Comm: client Not tainted 4.15.0-rc1-test #15 [ 339.868498] Hardware name: LENOVO 10FGS0VA1L/30BC, BIOS FWKT68A 01/19/2017 [ 339.885060] Call Trace: [ 339.896875] [ 339.908103] dump_stack+0x63/0x87 [ 339.920645] panic+0xe8/0x248 [ 339.932668] ? ip_push_pending_frames+0x33/0x40 [ 339.946328] ? icmp_send+0x525/0x530 [ 339.958861] ? kfree_skbmem+0x60/0x70 [ 339.971431] __stack_chk_fail+0x1b/0x20 [ 339.984049] icmp_send+0x525/0x530 [ 339.996205] ? netlbl_skbuff_err+0x36/0x40 [ 340.008997] ? selinux_netlbl_err+0x11/0x20 [ 340.021816] ? selinux_socket_sock_rcv_skb+0x211/0x230 [ 340.035529] ? security_sock_rcv_skb+0x3b/0x50 [ 340.048471] ? sk_filter_trim_cap+0x44/0x1c0 [ 340.061246] ? tcp_v4_inbound_md5_hash+0x69/0x1b0 [ 340.074562] ? tcp_filter+0x2c/0x40 [ 340.086400] ? tcp_v4_rcv+0x820/0xa20 [ 340.098329] ? ip_local_deliver_finish+0x71/0x1a0 [ 340.111279] ? ip_local_deliver+0x6f/0xe0 [ 340.123535] ? ip_rcv_finish+0x3a0/0x3a0 [ 340.135523] ? ip_rcv_finish+0xdb/0x3a0 [ 340.147442] ? ip_rcv+0x27c/0x3c0 [ 340.158668] ? inet_del_offload+0x40/0x40 [ 340.170580] ? __netif_receive_skb_core+0x4ac/0x900 [ 340.183285] ? rcu_accelerate_cbs+0x5b/0x80 [ 340.195282] ? __netif_receive_skb+0x18/0x60 [ 340.207288] ? 
process_backlog+0x95/0x140 [ 340.218948] ? net_rx_action+0x26c/0x3b0 [ 340.230416] ? __do_softirq+0xc9/0x26a [ 340.241625] ? do_softirq_own_stack+0x2a/0x40 [ 340.253368] [ 340.262673] ? do_softirq+0x50/0x60 [ 340.273450] ? __local_bh_enable_ip+0x57/0x60 [ 340.285045] ? ip_finish_output2+0x175/0x350 [ 340.296403] ? ip_finish_output+0x127/0x1d0 [ 340.307665] ? nf_hook_slow+0x3c/0xb0 [ 340.318230] ? ip_output+0x72/0xe0 [ 340.328524] ? ip_fragment.constprop.54+0x80/0x80 [ 340.340070] ? ip_local_out+0x35/0x40 [ 340.350497] ? ip_queue_xmit+0x15c/0x3f0 [ 340.361060] ? __kmalloc_reserve.isra.40+0x31/0x90 [ 340.372484] ? __skb_clone+0x2e/0x130 [ 340.382633] ? tcp_transmit_skb+0x558/0xa10 [ 340.393262] ? tcp_connect+0x938/0xad0 [ 340.403370] ? ktime_get_with_offset+0x4c/0xb0 [ 340.414206] ? tcp_v4_connect+0x457/0x4e0 [ 340.424471] ? __inet_stream_connect+0xb3/0x300 [ 340.435195] ? inet_stream_connect+0x3b/0x60 [ 340.445607] ? SYSC_connect+0xd9/0x110 [ 340.455455] ? __audit_syscall_entry+0xaf/0x100 [ 340.466112] ? syscall_trace_enter+0x1d0/0x2b0 [ 340.476636] ? __audit_syscall_exit+0x209/0x290 [ 340.487151] ? SyS_connect+0xe/0x10 [ 340.496453] ? do_syscall_64+0x67/0x1b0 [ 340.506078] ? 
entry_SYSCALL64_slow_path+0x25/0x25 Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses") Signed-off-by: Eric DumazetReported-by: James Morris Tested-by: James Morris Tested-by: Casey Schaufler --- net/ipv4/tcp_ipv4.c | 59 - net/ipv6/tcp_ipv6.c | 10 + 2 files changed, 46 insertions(+), 23 deletions(-) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c6bc0c4d19c624888b0d0b5a4246c7183edf63f5..77ea45da0fe9c746907a312989658af3ad3b198d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1591,6 +1591,34 @@ int tcp_filter(struct sock *sk, struct sk_buff *skb) } EXPORT_SYMBOL(tcp_filter); +static void tcp_v4_restore_cb(struct sk_buff *skb) +{ + memmove(IPCB(skb), _SKB_CB(skb)->header.h4, + sizeof(struct inet_skb_parm)); +} + +static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph, + const struct tcphdr *th) +{ + /* This is tricky : We move IPCB at its correct location into TCP_SKB_CB() +* barrier() makes sure compiler wont play fool^Waliasing games. +*/ + memmove(_SKB_CB(skb)->header.h4, IPCB(skb), + sizeof(struct inet_skb_parm)); + barrier(); + + TCP_SKB_CB(skb)->seq = ntohl(th->seq); + TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin + + skb->len - th->doff * 4); + TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq); + TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th); +
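To illustrate why the ordering matters, here is a small userspace model of the two views of skb->cb[]: the IP layer keeps its data at the start of the scratch area, while TCP parks a copy further back and reuses the start for sequence numbers. The layouts and sketch_* names are hypothetical and much smaller than the real inet_skb_parm and tcp_skb_cb; only the save and restore steps mirror the patch.

```c
#include <stdint.h>
#include <string.h>

/* What the IP layer keeps at the start of the scratch area. */
struct sketch_ip_parm {
	uint32_t iif;
	uint16_t flags;
	uint16_t opt_len;
};

/* TCP's view of the same bytes: its own fields first, IP data parked after. */
struct sketch_tcp_cb {
	uint32_t seq;
	uint32_t end_seq;
	uint32_t ack_seq;
	struct sketch_ip_parm h4;
};

/* A union makes the overlap explicit and keeps the accesses well defined. */
union sketch_cb {
	struct sketch_ip_parm ip;
	struct sketch_tcp_cb tcp;
};

struct sketch_skb {
	union sketch_cb cb;
};

/* Switch to the TCP view: park the IP data, then overwrite the front. */
static void sketch_fill_cb(struct sketch_skb *skb, uint32_t seq, uint32_t len)
{
	/* memmove as in the patch; it stays correct even if the views overlap */
	memmove(&skb->cb.tcp.h4, &skb->cb.ip, sizeof(struct sketch_ip_parm));

	skb->cb.tcp.seq = seq;
	skb->cb.tcp.end_seq = seq + len;
	skb->cb.tcp.ack_seq = 0;
}

/* Switch back to the IP view before handing the skb to IP-level code. */
static void sketch_restore_cb(struct sketch_skb *skb)
{
	memmove(&skb->cb.ip, &skb->cb.tcp.h4, sizeof(struct sketch_ip_parm));
}
```

Between fill and restore, anything that reads the IP view sees TCP sequence numbers instead of IP data. That is the situation the commit message describes for code called from tcp_filter(), and it is why the fill is deferred until after the filter has run.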
[PATCH net 0/2] tcp: fix SELinux/Smack corruptions
James Morris reported a kernel stack corruption bug that we tracked back to commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses"). The first patch needs to be backported to kernels >= 3.18, while the second patch needs to be backported to kernels >= 4.9, since this was the time when inet_exact_dif_match appeared. David Ahern (1): tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match() Eric Dumazet (1): tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb() include/net/tcp.h | 3 +-- net/ipv4/tcp_ipv4.c | 59 - net/ipv6/tcp_ipv6.c | 10 + 3 files changed, 47 insertions(+), 25 deletions(-) -- 2.15.0.531.g2ccb3012c9-goog
[PATCH net 2/2] tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()
From: David Ahern After this fix ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()"), socket lookups happen while skb->cb[] has not been mangled yet by TCP. Fixes: a04a480d4392 ("net: Require exact match for TCP socket lookups if dif is l3mdev") Signed-off-by: David Ahern Signed-off-by: Eric Dumazet --- include/net/tcp.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 4e09398009c10a72478b43d3cffc24ba01612b91..6998707e81f343ef8d893c0b2ba16db541082230 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -844,12 +844,11 @@ static inline int tcp_v6_sdif(const struct sk_buff *skb) } #endif -/* TCP_SKB_CB reference means this can not be used from early demux */ static inline bool inet_exact_dif_match(struct net *net, struct sk_buff *skb) { #if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV) if (!net->ipv4.sysctl_tcp_l3mdev_accept && - skb && ipv4_l3mdev_skb(TCP_SKB_CB(skb)->header.h4.flags)) + skb && ipv4_l3mdev_skb(IPCB(skb)->flags)) return true; #endif return false; -- 2.15.0.531.g2ccb3012c9-goog
[PATCH iproute2 net-next] gre6: add collect metadata support
The patch adds 'external' option to support collect metadata gre6 tunnel. Example of L3 and L2 gre device: bash:~# ip link add dev ip6gre123 type ip6gre external bash:~# ip link add dev ip6gretap123 type ip6gretap external Signed-off-by: William Tu--- ip/link_gre6.c| 55 --- man/man8/ip-link.8.in | 6 ++ 2 files changed, 41 insertions(+), 20 deletions(-) diff --git a/ip/link_gre6.c b/ip/link_gre6.c index 0a82eaecf2cd..2cb46ca116d0 100644 --- a/ip/link_gre6.c +++ b/ip/link_gre6.c @@ -105,6 +105,7 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv, __u16 encapflags = TUNNEL_ENCAP_FLAG_CSUM6; __u16 encapsport = 0; __u16 encapdport = 0; + __u8 metadata = 0; int len; __u32 fwmark = 0; __u32 erspan_idx = 0; @@ -178,6 +179,9 @@ get_failed: if (greinfo[IFLA_GRE_ENCAP_SPORT]) encapsport = rta_getattr_u16(greinfo[IFLA_GRE_ENCAP_SPORT]); + if (greinfo[IFLA_GRE_COLLECT_METADATA]) + metadata = 1; + if (greinfo[IFLA_GRE_ENCAP_DPORT]) encapdport = rta_getattr_u16(greinfo[IFLA_GRE_ENCAP_DPORT]); @@ -355,6 +359,8 @@ get_failed: encapflags |= TUNNEL_ENCAP_FLAG_REMCSUM; } else if (strcmp(*argv, "noencap-remcsum") == 0) { encapflags &= ~TUNNEL_ENCAP_FLAG_REMCSUM; + } else if (strcmp(*argv, "external") == 0) { + metadata = 1; } else if (strcmp(*argv, "fwmark") == 0) { NEXT_ARG(); if (strcmp(*argv, "inherit") == 0) { @@ -388,26 +394,30 @@ get_failed: argc--; argv++; } - addattr32(n, 1024, IFLA_GRE_IKEY, ikey); - addattr32(n, 1024, IFLA_GRE_OKEY, okey); - addattr_l(n, 1024, IFLA_GRE_IFLAGS, , 2); - addattr_l(n, 1024, IFLA_GRE_OFLAGS, , 2); - addattr_l(n, 1024, IFLA_GRE_LOCAL, , sizeof(laddr)); - addattr_l(n, 1024, IFLA_GRE_REMOTE, , sizeof(raddr)); - if (link) - addattr32(n, 1024, IFLA_GRE_LINK, link); - addattr_l(n, 1024, IFLA_GRE_TTL, _limit, 1); - addattr_l(n, 1024, IFLA_GRE_ENCAP_LIMIT, _limit, 1); - addattr_l(n, 1024, IFLA_GRE_FLOWINFO, , 4); - addattr32(n, 1024, IFLA_GRE_FLAGS, flags); - addattr32(n, 1024, IFLA_GRE_FWMARK, fwmark); - if (erspan_idx != 0) - 
addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx); - - addattr16(n, 1024, IFLA_GRE_ENCAP_TYPE, encaptype); - addattr16(n, 1024, IFLA_GRE_ENCAP_FLAGS, encapflags); - addattr16(n, 1024, IFLA_GRE_ENCAP_SPORT, htons(encapsport)); - addattr16(n, 1024, IFLA_GRE_ENCAP_DPORT, htons(encapdport)); + if (!metadata) { + addattr32(n, 1024, IFLA_GRE_IKEY, ikey); + addattr32(n, 1024, IFLA_GRE_OKEY, okey); + addattr_l(n, 1024, IFLA_GRE_IFLAGS, , 2); + addattr_l(n, 1024, IFLA_GRE_OFLAGS, , 2); + addattr_l(n, 1024, IFLA_GRE_LOCAL, , sizeof(laddr)); + addattr_l(n, 1024, IFLA_GRE_REMOTE, , sizeof(raddr)); + if (link) + addattr32(n, 1024, IFLA_GRE_LINK, link); + addattr_l(n, 1024, IFLA_GRE_TTL, _limit, 1); + addattr_l(n, 1024, IFLA_GRE_ENCAP_LIMIT, _limit, 1); + addattr_l(n, 1024, IFLA_GRE_FLOWINFO, , 4); + addattr32(n, 1024, IFLA_GRE_FLAGS, flags); + addattr32(n, 1024, IFLA_GRE_FWMARK, fwmark); + if (erspan_idx != 0) + addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx); + + addattr16(n, 1024, IFLA_GRE_ENCAP_TYPE, encaptype); + addattr16(n, 1024, IFLA_GRE_ENCAP_FLAGS, encapflags); + addattr16(n, 1024, IFLA_GRE_ENCAP_SPORT, htons(encapsport)); + addattr16(n, 1024, IFLA_GRE_ENCAP_DPORT, htons(encapdport)); + } else { + addattr_l(n, 1024, IFLA_GRE_COLLECT_METADATA, NULL, 0); + } return 0; } @@ -426,6 +436,11 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]) if (!tb) return; + if (tb[IFLA_GRE_COLLECT_METADATA]) { + print_bool(PRINT_ANY, "collect_metadata", "external", true); + return; + } + if (tb[IFLA_GRE_FLAGS]) flags = rta_getattr_u32(tb[IFLA_GRE_FLAGS]); diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in index a6a10e577b1f..c9b9bb7b2a4e 100644 --- a/man/man8/ip-link.8.in +++ b/man/man8/ip-link.8.in @@ -755,6 +755,8 @@ the following additional arguments are supported: .BI "dscp inherit" ] [ .BI dev " PHYS_DEV " +] [ +.RB [ no ] external ] .in +8 @@ -833,6 +835,10 @@ or .IR 00 ".." ff when tunneling non-IP packets. The default value is
Re: [PATCH net-next] openvswitch: do not propagate headroom updates to internal port
On Thu, Nov 30, 2017 at 6:35 AM, Paolo Abeni wrote: > After commit 3a927bc7cf9d ("ovs: propagate per dp max headroom to > all vports") the need_headroom for the internal vport is updated > accordingly to the max needed headroom in its datapath. > > That avoids the pskb_expand_head() costs when sending/forwarding > packets towards tunnel devices, at least for some scenarios. > > We still require such copy when using the ovs-preferred configuration > for vxlan tunnels: > > br_int > / \ > tap vxlan >(remote_ip:X) > > br_phy > \ > NIC > > where the route towards the IP 'X' is via 'br_phy'. > > When forwarding traffic from the tap towards the vxlan device, we > will call pskb_expand_head() in vxlan_build_skb() because > br-phy->needed_headroom is equal to tun->needed_headroom. > > With this change we avoid updating the internal vport needed_headroom, > so that in the above scenario no head copy is needed, giving 5% > performance improvement in UDP throughput test. > > As a trade-off, packets sent from the internal port towards a tunnel > device will now experience the head copy overhead. The rationale is > that the latter use-case is less relevant performance-wise. > > Signed-off-by: Paolo Abeni Acked-by: Pravin B Shelar Thanks.
Re: [PATCH net-next 1/5] libbpf: add ability to guess program type based on section name
On Fri, 1 Dec 2017 10:22:57 +, Quentin Monnet wrote: > Thanks Roman! > One comment in-line. > > 2017-11-30 13:42 UTC+ ~ Roman Gushchin> > The bpf_prog_load() function will guess program type if it's not > > specified explicitly. This functionality will be used to implement > > loading of different programs without asking a user to specify > > the program type. In first order it will be used by bpftool. > > > > Signed-off-by: Roman Gushchin > > Cc: Alexei Starovoitov > > Cc: Daniel Borkmann > > Cc: Jakub Kicinski > > --- > > tools/lib/bpf/libbpf.c | 47 +++ > > 1 file changed, 47 insertions(+) > > > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c > > index 5aa45f89da93..9f2410beaa18 100644 > > --- a/tools/lib/bpf/libbpf.c > > +++ b/tools/lib/bpf/libbpf.c > > @@ -1721,6 +1721,41 @@ BPF_PROG_TYPE_FNS(tracepoint, > > BPF_PROG_TYPE_TRACEPOINT); > > BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP); > > BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT); > > > > +static enum bpf_prog_type bpf_program__guess_type(struct bpf_program *prog) > > +{ > > + if (!prog->section_name) > > + goto err; > > + > > + if (strncmp(prog->section_name, "socket", 6) == 0) > > + return BPF_PROG_TYPE_SOCKET_FILTER; > > + if (strncmp(prog->section_name, "kprobe/", 7) == 0) > > + return BPF_PROG_TYPE_KPROBE; > > + if (strncmp(prog->section_name, "kretprobe/", 10) == 0) > > + return BPF_PROG_TYPE_KPROBE; > > + if (strncmp(prog->section_name, "tracepoint/", 11) == 0) > > + return BPF_PROG_TYPE_TRACEPOINT; > > + if (strncmp(prog->section_name, "xdp", 3) == 0) > > + return BPF_PROG_TYPE_XDP; > > + if (strncmp(prog->section_name, "perf_event", 10) == 0) > > + return BPF_PROG_TYPE_PERF_EVENT; > > + if (strncmp(prog->section_name, "cgroup/skb", 10) == 0) > > + return BPF_PROG_TYPE_CGROUP_SKB; > > + if (strncmp(prog->section_name, "cgroup/sock", 11) == 0) > > + return BPF_PROG_TYPE_CGROUP_SOCK; > > + if (strncmp(prog->section_name, "cgroup/dev", 10) == 0) > > + return 
BPF_PROG_TYPE_CGROUP_DEVICE; > > + if (strncmp(prog->section_name, "sockops", 7) == 0) > > + return BPF_PROG_TYPE_SOCK_OPS; > > + if (strncmp(prog->section_name, "sk_skb", 6) == 0) > > + return BPF_PROG_TYPE_SK_SKB; > > I do not really like these hard-coded lengths, maybe we could work out > something nicer with a bit of pre-processing work? Perhaps something like: > > #define SOCKET_FILTER_SEC_PREFIX "socket" > #define KPROBE_SEC_PREFIX "kprobe/" > […] > > #define TRY_TYPE(string, __TYPE) \ > do {\ > if (!strncmp(string, __TYPE ## _SEC_PREFIX, \ >sizeof(__TYPE ## _SEC_PREFIX)))\ > return BPF_PROG_TYPE_ ## __TYPE;\ > } while(0); I like the suggestion, but I think return and goto statements hiding inside macros are slightly frowned upon in the netdev. Perhaps just a macro that wraps the strncmp() with sizeof would be enough? Without the return inside? > static enum bpf_prog_type bpf_program__guess_type(struct bpf_program *prog) > { > if (!prog->section_name) > goto err; > > TRY_TYPE(prog->section_name, SOCKET_FILTER); > TRY_TYPE(prog->section_name, KPROBE); > […] > > err: > pr_warning("…", > prog->section_name); > > return BPF_PROG_TYPE_UNSPEC; > }
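The alternative floated in the reply above (a macro that only wraps the strncmp() with sizeof, leaving the return in the caller) could look roughly like the sketch below. It is not taken from libbpf: `SEC_HAS_PREFIX`, `guess_type` and the `PROG_TYPE_*` enum are hypothetical stand-ins for illustration. Note that sizeof on a string literal counts the terminating NUL, so the sketch subtracts one; comparing sizeof(prefix) bytes would turn the prefix test into an exact match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the BPF_PROG_TYPE_* values (not the libbpf enum). */
enum prog_type {
	PROG_TYPE_UNSPEC,
	PROG_TYPE_SOCKET_FILTER,
	PROG_TYPE_KPROBE,
	PROG_TYPE_XDP,
};

/*
 * Prefix test with the length derived from the literal itself.
 * 'prefix' must be a string literal (or char array), not a pointer,
 * otherwise sizeof would yield the pointer size.
 * sizeof("abc") is 4 because of the trailing NUL, hence the "- 1".
 */
#define SEC_HAS_PREFIX(sec, prefix) \
	(strncmp((sec), (prefix), sizeof(prefix) - 1) == 0)

/* The control flow (return) stays visible in the caller, not in the macro. */
static enum prog_type guess_type(const char *section_name)
{
	if (!section_name)
		return PROG_TYPE_UNSPEC;

	if (SEC_HAS_PREFIX(section_name, "socket"))
		return PROG_TYPE_SOCKET_FILTER;
	if (SEC_HAS_PREFIX(section_name, "kprobe/"))
		return PROG_TYPE_KPROBE;
	if (SEC_HAS_PREFIX(section_name, "xdp"))
		return PROG_TYPE_XDP;

	return PROG_TYPE_UNSPEC;
}
```

With this shape, guess_type("kprobe/sys_open") matches the kprobe prefix, while a bare "kprobe" (no slash) falls through to the unspecified type.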
Re: [PATCH net-next resubmit 2/2] net: phy: remove generic settings for callbacks config_aneg and read_status from drivers
On 01.12.2017 at 21:42, David Miller wrote: > From: Heiner Kallweit > Date: Thu, 30 Nov 2017 23:47:52 +0100 > >> Remove generic settings for callbacks config_aneg and read_status >> from drivers. >> When re-testing I just figured out that in drivers/net/phy/broadcom.c I mistakenly removed three lines too many. Do you prefer a fixed version of the patch or just a patch with the fix? Sorry, Heiner >> Signed-off-by: Heiner Kallweit >> Reviewed-by: Florian Fainelli > > Applied. >
Re: [PATCH v2 net-next 3/4] inet: Add a 2nd listener hashtable (port+addr)
On Fri, 2017-12-01 at 12:52 -0800, Martin KaFai Lau wrote: > The current listener hashtable is hashed by port only. > When a process is listening at many IP addresses with the same port > (e.g. > [IP1]:443, [IP2]:443... [IPN]:443), the inet[6]_lookup_listener() > performance is degraded to a link list. It is prone to syn attack. > > UDP had a similar issue and a second hashtable was added to resolve > it. > > This patch adds a second hashtable for the listener's sockets. > The second hashtable is hashed by port and address. > > It cannot reuse the existing skc_portaddr_node which is shared > with skc_bind_node. TCP listener needs to use skc_bind_node. > Instead, this patch adds a hlist_node 'icsk_listen_portaddr_node' to > the inet_connection_sock which the listener (like TCP) also belongs > to. > > The new portaddr hashtable may need two lookup (First by IP:PORT. > Second by INADDR_ANY:PORT if the IP:PORT is a not found). Hence, > it implements a similar cut off as UDP such that it will only consult > the > new portaddr hashtable if the current port-only hashtable has >10 > sk in the link-list. > > lhash2 and lhash2_mask are added to 'struct inet_hashinfo'. I take > this chance to plug a 4 bytes hole. It is done by first moving > the existing bind_bucket_cachep up and then add the new > (int lhash2_mask, *lhash2) after the existing bhash_size. > > Signed-off-by: Martin KaFai Lau Nice work, thanks Martin! Reviewed-by: Eric Dumazet
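The two-step lookup described in the quoted commit message (first IP:PORT, then INADDR_ANY:PORT) can be illustrated with a small userspace sketch. This is not the kernel implementation: the table size, the hash function and every name below (`lhash2_lookup`, `struct listener`, `ADDR_ANY`) are assumptions made for the example, and the ">10 sockets in the port-only bucket" cut-off is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LHASH2_SIZE 16u	/* hypothetical table size, power of two */
#define ADDR_ANY    0u	/* stand-in for INADDR_ANY */

struct listener {
	uint32_t addr;
	uint16_t port;
	struct listener *next;	/* chain within one bucket */
};

struct lhash2 {
	struct listener *bucket[LHASH2_SIZE];
};

/* Toy hash over (addr, port); the kernel uses a different, keyed hash. */
static unsigned int portaddr_hash(uint32_t addr, uint16_t port)
{
	uint32_t h = (addr * 2654435761u) ^ port;

	return (h ^ (h >> 16)) & (LHASH2_SIZE - 1);
}

static void lhash2_insert(struct lhash2 *t, struct listener *l)
{
	unsigned int slot = portaddr_hash(l->addr, l->port);

	l->next = t->bucket[slot];
	t->bucket[slot] = l;
}

/* Walk a single bucket looking for an exact (addr, port) match. */
static struct listener *find_exact(const struct lhash2 *t,
				   uint32_t addr, uint16_t port)
{
	struct listener *l = t->bucket[portaddr_hash(addr, port)];

	for (; l; l = l->next)
		if (l->addr == addr && l->port == port)
			return l;
	return NULL;
}

/* First try IP:PORT, then fall back to the wildcard address. */
static struct listener *lhash2_lookup(const struct lhash2 *t,
				      uint32_t daddr, uint16_t dport)
{
	struct listener *l = find_exact(t, daddr, dport);

	if (!l)
		l = find_exact(t, ADDR_ANY, dport);
	return l;
}
```

The point of the second table is that each lookup touches at most two short buckets, instead of one long chain holding every listener bound to the same port.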
Re: [PATCH net-next 1/5] rhashtable: Don't reset walker table in rhashtable_walk_start
On Thu, Nov 30, 2017 at 04:03:01PM -0800, Tom Herbert wrote: > Remove the code that resets the walker table. The walker table should > only be initialized in the walk init function or when a future table is > encountered. If the walker table is NULL this is the indication that > the walk has completed and this information can be used to break a > multi-call walk in the table (e.g. successive calls to netlink_dump > that are dumping elements of an rhashtable). > > This also allows us to change rhashtable_walk_start to return void > since the only error it was returning was -EAGAIN for a table change. > This patch changes all the callers of rhashtable_walk_start to expect > void which eliminates logic needed to check the return value for a > rare condition. Note that -EAGAIN will be returned in a call > to rhashtable_walk_next which seems to always follow the start > of the walk so there should be no behavioral change in doing this. > > Signed-off-by: Tom Herbert Doesn't this mean that if a walk encounters a rehash you may end up missing half or more of the hash table? Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH net-next v2 7/8] netdevsim: add SR-IOV functionality
On Fri, 1 Dec 2017 22:58:29 +0100, Phil Sutter wrote: > > > > > > + ret = count; > > > > > > +exit_unlock: > > > > > > + rtnl_unlock(); > > > > > > + > > > > > > + return ret; > > > > > > +} > > > > > > > > > > [...] > > > > > > > > > > > +static void nsim_free(struct net_device *dev) > > > > > > +{ > > > > > > + struct netdevsim *ns = netdev_priv(dev); > > > > > > + > > > > > > + device_unregister(&ns->dev); > > > > > > } > > > > > > > > > > Shouldn't this also kfree(ns->vfconfigs)? > > > > > > > > It's in uninit, I will move it to release. > > > > > > Oh, I missed that. If you're certain this won't lead to memleaks, no > > > objection from my side. :) > > > > OK, I will respin v3 with the free moved :) > > So it did leak? I'm glad the traffic I caused wasn't completely > pointless then. :) There is a window where it could've been re-enabled and that would leak, yes. Thanks for catching it :)
Re: [RFC PATCH] net_sched: bulk free tcf_block
On Fri, Dec 1, 2017 at 3:05 AM, Paolo Abeni wrote: > > Thank you for the feedback. > > I tested your patch and in the above scenario I measure: > > real 0m0.017s > user 0m0.000s > sys 0m0.017s > > so it apparently works well for this case. Thanks a lot for testing it! I will test it further. If it goes well I will send a formal patch with your Tested-by unless you object to it. > > We could still have a storm of rtnl lock/unlock operations while > deleting a large tc tree with lot of filters, and I think we can reduce > them with bulk free, eventually applying it to filters, too. > > That will also reduce the pressure on the rtnl lock when e.g. OVS H/W > offload pushes a lot of rules/sec. > > WDYT? > Why is this specific to tc filters? From what you are saying, we need to batch all TC operations (qdisc, filter and action) rather than just filters? In the short term, I think batching rtnl lock/unlock is a good optimization, so I have no objection. For the long term, I think we need to revise the RTNL lock and probably move it down to each layer, but clearly that requires much more work. Thanks.
Re: [PATCH net-next v2 7/8] netdevsim: add SR-IOV functionality
On Fri, Dec 01, 2017 at 01:45:09PM -0800, Jakub Kicinski wrote: > On Fri, 1 Dec 2017 22:36:52 +0100, Phil Sutter wrote: > > On Fri, Dec 01, 2017 at 12:14:07PM -0800, Jakub Kicinski wrote: > > > On Fri, 1 Dec 2017 14:43:06 +0100, Phil Sutter wrote: > > > > On Thu, Nov 30, 2017 at 05:35:39PM -0800, Jakub Kicinski wrote: > > > > [...] > > > > > +static int nsim_vfs_enable(struct netdevsim *ns, unsigned int > > > > > num_vfs) > > > > > +{ > > > > > + ns->vfconfigs = kcalloc(num_vfs, sizeof(struct nsim_vf_config), > > > > > + GFP_KERNEL); > > > > > + if (!ns->vfconfigs) > > > > > + return -ENOMEM; > > > > > + ns->num_vfs = num_vfs; > > > > > + > > > > > + return 0; > > > > > +} > > > > > + > > > > > +static void nsim_vfs_disable(struct netdevsim *ns) > > > > > +{ > > > > > + kfree(ns->vfconfigs); > > > > > + ns->vfconfigs = NULL; > > > > > + ns->num_vfs = 0; > > > > > +} > > > > > > > > Why not something like: > > > > > > > > | static int nsim_vfs_set(struct netdevsim *ns, unsigned int num_vfs) > > > > | { > > > > | void *ptr = krealloc(ns->vfconfigs, > > > > |num_vfs * sizeof(struct nsim_vf_config), > > > > |GFP_KERNEL); > > > > | > > > > | if (!ptr) > > > > | return -ENOMEM; > > > > | > > > > | ns->vfconfigs = ptr; > > > > | ns->num_vfs = num_vfs; > > > > | return 0; > > > > | } > > > > > > Um. It either frees or allocates, never reallocates so I felt realloc > > > is misleading. ZERO_SIZE_PTR is less clearly a NULL than a NULL. I > > > will have to specify __GFP_ZERO. It's not a calloc so there could be > > > potentially some overflows? > > > > I don't understand: How can overflows happen if I use malloc() instead > > of calloc()? > > The multiplication may overflow. That's why we have kmalloc_array(). > Note this explicit check in kmalloc_array() (which is also called by > kcalloc): > > if (size != 0 && n > SIZE_MAX / size) > return NULL; > > Where: > > #define SIZE_MAX (~(size_t)0) Ah, I see. Thanks for educating me on this! 
> > > > > + ret = count; > > > > > +exit_unlock: > > > > > + rtnl_unlock(); > > > > > + > > > > > + return ret; > > > > > +} > > > > > > > > [...] > > > > > > > > > +static void nsim_free(struct net_device *dev) > > > > > +{ > > > > > + struct netdevsim *ns = netdev_priv(dev); > > > > > + > > > > > + device_unregister(>dev); > > > > > } > > > > > > > > Shouldn't this also kfree(ns->vfconfigs)? > > > > > > It's in uninit, I will move it to release. > > > > Oh, I missed that. If you're certain this won't lead to memleaks, no > > objection from my side. :) > > OK, I will respin v3 with the free moved :) So it did leak? I'm glad the traffic I caused wasn't completely pointless then. :) Thanks, Phil
Re: [Patch net-next] act_mirred: use tcfm_dev in tcf_mirred_get_dev()
On Fri, Dec 1, 2017 at 9:56 AM, Jiri Pirko wrote: > > Isn't this here so user may specify a ifindex of netdev which is not yet > present on the system (not sure how much sense that would make though...) How is this even possible? If an ifindex is not present, we return ENODEV: if (parm->ifindex) { dev = __dev_get_by_index(net, parm->ifindex); if (dev == NULL) { if (exists) tcf_idr_release(*a, bind); return -ENODEV; }
Re: [PATCH net-next v2 7/8] netdevsim: add SR-IOV functionality
On Fri, 1 Dec 2017 22:36:52 +0100, Phil Sutter wrote: > On Fri, Dec 01, 2017 at 12:14:07PM -0800, Jakub Kicinski wrote: > > On Fri, 1 Dec 2017 14:43:06 +0100, Phil Sutter wrote: > > > On Thu, Nov 30, 2017 at 05:35:39PM -0800, Jakub Kicinski wrote: > > > [...] > > > > +static int nsim_vfs_enable(struct netdevsim *ns, unsigned int num_vfs) > > > > +{ > > > > + ns->vfconfigs = kcalloc(num_vfs, sizeof(struct nsim_vf_config), > > > > + GFP_KERNEL); > > > > + if (!ns->vfconfigs) > > > > + return -ENOMEM; > > > > + ns->num_vfs = num_vfs; > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +static void nsim_vfs_disable(struct netdevsim *ns) > > > > +{ > > > > + kfree(ns->vfconfigs); > > > > + ns->vfconfigs = NULL; > > > > + ns->num_vfs = 0; > > > > +} > > > > > > Why not something like: > > > > > > | static int nsim_vfs_set(struct netdevsim *ns, unsigned int num_vfs) > > > | { > > > | void *ptr = krealloc(ns->vfconfigs, > > > | num_vfs * sizeof(struct nsim_vf_config), > > > | GFP_KERNEL); > > > | > > > | if (!ptr) > > > | return -ENOMEM; > > > | > > > | ns->vfconfigs = ptr; > > > | ns->num_vfs = num_vfs; > > > | return 0; > > > | } > > > > Um. It either frees or allocates, never reallocates so I felt realloc > > is misleading. ZERO_SIZE_PTR is less clearly a NULL than a NULL. I > > will have to specify __GFP_ZERO. It's not a calloc so there could be > > potentially some overflows? > > I don't understand: How can overflows happen if I use malloc() instead > of calloc()? The multiplication may overflow. That's why we have kmalloc_array(). Note this explicit check in kmalloc_array() (which is also called by kcalloc): if (size != 0 && n > SIZE_MAX / size) return NULL; Where: #define SIZE_MAX(~(size_t)0) > > > > + ret = count; > > > > +exit_unlock: > > > > + rtnl_unlock(); > > > > + > > > > + return ret; > > > > +} > > > > > > [...] 
> > > > > > > +static void nsim_free(struct net_device *dev) > > > > +{ > > > > + struct netdevsim *ns = netdev_priv(dev); > > > > + > > > > + device_unregister(>dev); > > > > } > > > > > > Shouldn't this also kfree(ns->vfconfigs)? > > > > It's in uninit, I will move it to release. > > Oh, I missed that. If you're certain this won't lead to memleaks, no > objection from my side. :) OK, I will respin v3 with the free moved :)
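The overflow point made above (the n * size multiplication can wrap, which is why kmalloc_array() checks before multiplying) can be reproduced in userspace. The sketch below mirrors the quoted check; `alloc_array_checked` is a hypothetical helper written for this example, not a kernel or libc function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a zeroed array of n elements of 'size' bytes each.
 * The guard is the same test quoted from kmalloc_array(): if n exceeds
 * SIZE_MAX / size, then n * size would wrap around and silently
 * request a much smaller buffer than the caller expects.
 */
static void *alloc_array_checked(size_t n, size_t size)
{
	void *p;

	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* multiplication would overflow */

	p = malloc(n * size);	/* safe: product fits in size_t */
	if (p)
		memset(p, 0, n * size);
	return p;
}
```

A bare malloc(n * size) or krealloc(ptr, n * size, ...) with a caller-controlled n has no such guard, which is the concern raised about the krealloc()-based variant.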
Re: [PATCH net-next v2 8/8] net: dummy: remove fake SR-IOV functionality
On Fri, Dec 01, 2017 at 12:19:52PM -0800, Jakub Kicinski wrote: > On Fri, 1 Dec 2017 14:46:34 +0100, Phil Sutter wrote: > > On Thu, Nov 30, 2017 at 05:35:40PM -0800, Jakub Kicinski wrote: > > > netdevsim driver seems like a better place for fake SR-IOV > > > functionality. Remove the code previously added to dummy. > > > > > > Signed-off-by: Jakub Kicinski > > > Reviewed-by: Quentin Monnet > > > > Acked-by: Phil Sutter > > Thanks! > > Did you have an opportunity to run your tests against this? I didn't > find anything that uses dummy's SR-IOV in selftests. In fact, at Red Hat nobody uses dummy for iproute SR-IOV testing yet (which was the motivation for it in the first place). Hence why I didn't see a problem with moving it from dummy over to something else. Hopefully upstream iproute will at some point contain a testsuite which makes use of this, but sadly that's still wishful thinking. :( Cheers, Phil
Re: [PATCH net-next v2 7/8] netdevsim: add SR-IOV functionality
On Fri, Dec 01, 2017 at 12:14:07PM -0800, Jakub Kicinski wrote: > On Fri, 1 Dec 2017 14:43:06 +0100, Phil Sutter wrote: > > On Thu, Nov 30, 2017 at 05:35:39PM -0800, Jakub Kicinski wrote: > > [...] > > > +static int nsim_vfs_enable(struct netdevsim *ns, unsigned int num_vfs) > > > +{ > > > + ns->vfconfigs = kcalloc(num_vfs, sizeof(struct nsim_vf_config), > > > + GFP_KERNEL); > > > + if (!ns->vfconfigs) > > > + return -ENOMEM; > > > + ns->num_vfs = num_vfs; > > > + > > > + return 0; > > > +} > > > + > > > +static void nsim_vfs_disable(struct netdevsim *ns) > > > +{ > > > + kfree(ns->vfconfigs); > > > + ns->vfconfigs = NULL; > > > + ns->num_vfs = 0; > > > +} > > > > Why not something like: > > > > | static int nsim_vfs_set(struct netdevsim *ns, unsigned int num_vfs) > > | { > > | void *ptr = krealloc(ns->vfconfigs, > > |num_vfs * sizeof(struct nsim_vf_config), > > |GFP_KERNEL); > > | > > | if (!ptr) > > | return -ENOMEM; > > | > > | ns->vfconfigs = ptr; > > | ns->num_vfs = num_vfs; > > | return 0; > > | } > > Um. It either frees or allocates, never reallocates so I felt realloc > is misleading. ZERO_SIZE_PTR is less clearly a NULL than a NULL. I > will have to specify __GFP_ZERO. It's not a calloc so there could be > potentially some overflows? I don't understand: How can overflows happen if I use malloc() instead of calloc()? 
> > > +static ssize_t > > > +nsim_numvfs_store(struct device *dev, struct device_attribute *attr, > > > + const char *buf, size_t count) > > > +{ > > > + struct netdevsim *ns = to_nsim(dev); > > > + unsigned int num_vfs; > > > + int ret; > > > + > > > + ret = kstrtouint(buf, 0, _vfs); > > > + if (ret) > > > + return ret; > > > + > > > + rtnl_lock(); > > > + if (ns->num_vfs == num_vfs) > > > + goto exit_good; > > > > Then replace this: > > > > > + if (ns->num_vfs && num_vfs) { > > > + ret = -EBUSY; > > > + goto exit_unlock; > > > + } > > > + > > > + if (num_vfs) { > > > + ret = nsim_vfs_enable(ns, num_vfs); > > > + if (ret) > > > + goto exit_unlock; > > > + } else { > > > + nsim_vfs_disable(ns); > > > + } > > > > with just: > > > > | nsim_vfs_set(ns, num_vfs); > > I'm trying to mirror the PCI subsystem behaviour here, which only > allows enable or disable, not increase. I felt we should follow how > real devices behave: > > /* enable VFs */ > if (pdev->sriov->num_VFs) { > dev_warn(>dev, "%d VFs already enabled. Disable before > enabling %d VFs\n", >pdev->sriov->num_VFs, num_vfs); > return -EBUSY; > } > > So IOW this is intentional. Ah, I see. Yes, then it makes sense! Keeping this virtual VF functionality as close to real ones as possible is certainly feasible. > > > + ret = count; > > > +exit_unlock: > > > + rtnl_unlock(); > > > + > > > + return ret; > > > +} > > > > [...] > > > > > +static void nsim_free(struct net_device *dev) > > > +{ > > > + struct netdevsim *ns = netdev_priv(dev); > > > + > > > + device_unregister(>dev); > > > } > > > > Shouldn't this also kfree(ns->vfconfigs)? > > It's in uninit, I will move it to release. Oh, I missed that. If you're certain this won't lead to memleaks, no objection from my side. :) Cheers, Phil
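The enable-or-disable-only behaviour discussed above (a non-zero VF count can only be set from zero, otherwise -EBUSY, mirroring the PCI subsystem) is sketched below as a standalone userspace function. It follows the logic of the quoted nsim_numvfs_store() hunk but is not the driver code: `struct sim_dev`, `struct vf_config` and `sim_set_numvfs` are hypothetical names, and locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct vf_config {
	int placeholder;	/* per-VF settings would live here */
};

struct sim_dev {
	struct vf_config *vfconfigs;
	unsigned int num_vfs;
};

/*
 * Set the number of VFs. Allowed transitions are 0 -> N (enable) and
 * N -> 0 (disable); N -> M with both non-zero is refused with -EBUSY,
 * so the array is only ever allocated or freed, never resized.
 */
static int sim_set_numvfs(struct sim_dev *d, unsigned int num_vfs)
{
	if (d->num_vfs == num_vfs)
		return 0;		/* nothing to do */

	if (d->num_vfs && num_vfs)
		return -EBUSY;		/* must disable first */

	if (num_vfs) {
		/* calloc checks the n * size multiplication for overflow */
		d->vfconfigs = calloc(num_vfs, sizeof(*d->vfconfigs));
		if (!d->vfconfigs)
			return -ENOMEM;
		d->num_vfs = num_vfs;
	} else {
		free(d->vfconfigs);
		d->vfconfigs = NULL;
		d->num_vfs = 0;
	}
	return 0;
}
```

Going from 4 VFs to 8 therefore takes two writes (0, then 8), which is the behaviour of real devices that the thread settles on.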
Re: [PATCH net-next 00/11] net: ethernet: ti: cpsw/ale clean up and optimization
From: Grygorii Strashko Date: Thu, 30 Nov 2017 18:21:09 -0600 > This is set of non critical clean ups and optimizations for TI > CPSW and ALE drivers. > > Rebased on top on net-next. Series applied, thank you.
Re: netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
On 12/01/2017 04:48 AM, Al Viro wrote: > On Fri, Dec 01, 2017 at 01:33:04AM +, Al Viro wrote: > >> Use of file descriptors should be limited to "got a number from userland, >> convert to struct file *" on the way in and "install struct file * into >> descriptor table and return the descriptor to userland" on the way out. >> And the latter - *ONLY* after the last possible point of failure. Once >> a file reference is inserted into descriptor table, that's it - you >> can't undo that. >> >> The only way to use bpf_obj_get_user() is to pass its return value to >> userland. As return value of syscall - not even put_user() (for that >> you'd need to reserve the descriptor, copy it to userland and only >> then attach struct file * to it). >> >> The whole approach stinks - what it needs is something that would >> take struct filename * and return struct bpf_prog * or struct file * >> reference. With bpf_obj_get_user() and this thing implemented >> via that. Agree, the "fix" is completely buggy due to fd being exposed to user space during that period of time ... >> I'm looking into that thing... 
> > What it tries to pull off is something not far from > > static struct bpf_prog *__get_prog(struct inode *inode, enum bpf_prog_type type) > { > struct bpf_prog *prog; > int err = inode_permission(inode, FMODE_READ | FMODE_WRITE); > if (err) > return ERR_PTR(err); > > if (inode->i_op == &bpf_map_iops) > return ERR_PTR(-EINVAL); > > if (inode->i_op != &bpf_prog_iops) > return ERR_PTR(-EACCES); > > prog = inode->i_private; > err = security_bpf_prog(prog); > if (err < 0) > return ERR_PTR(err); > > if (!bpf_prog_get_ok(prog, &type, false)) > return ERR_PTR(-EINVAL); > > return bpf_prog_inc(prog); > } > > struct bpf_prog *get_prog_path_type(const char *name, enum bpf_prog_type type) > { > struct path path; > struct bpf_prog *prog; > int err = kern_path(name, LOOKUP_FOLLOW, &path); > if (err) > return ERR_PTR(err); > prog = __get_prog(d_backing_inode(path.dentry), type); > if (!IS_ERR(prog)) > touch_atime(&path); > path_put(&path); > return prog; > } > > static int __bpf_mt_check_path(const char *path, struct bpf_prog **ret) > { > *ret = get_prog_path_type(path, BPF_PROG_TYPE_SOCKET_FILTER); > return PTR_ERR_OR_ZERO(*ret); > } > > That skips all tracepoint random shite (pardon the triple redundance) and makes > a somewhat arbitrary change for touch_atime() logics. And, of course, it is not > even compile-tested. > > Something similar to get_prog_path_type() above might make for a usable > primitive, IMO... The above looks good to me!
Re: [PATCH net-next V2 1/2] net-next: use five-tuple hash for sk_txhash
On Fri, Dec 1, 2017 at 1:00 PM, Shaohua Liwrote: > From: Shaohua Li > > We are using sk_txhash to calculate flowlabel, but sk_txhash isn't > always available, for example, in inet_timewait_sock. This causes > problem for reset packet, which will have a different flowlabel. This > causes our router doesn't correctly close tcp connection. We are using > flowlabel to do load balance. Routers in the path maintain connection > state. So if flow label changes, the packet is routed through a > different router. In this case, the old router doesn't get the reset > packet to close the tcp connection. > > Per Tom's suggestion, we switch back to five-tuple hash, so we can > reconstruct correct flowlabel for reset packet. > Thanks for doing this! > At most places, we already have the flowi info, so we directly use it > build sk_txhash. For synack, we do this after route search. At that > time, we have the flowi info ready, so don't need to create the flowi > info again. > > I don't change sk_rethink_txhash() though, it still uses random hash, > which is the whole point to select a different path after a negative > routing advise. > > Cc: Martin KaFai Lau > Cc: Eric Dumazet > Cc: Florent Fourcot > Cc: Cong Wang > Cc: Tom Herbert > Signed-off-by: Shaohua Li > --- > include/net/sock.h| 18 -- > include/net/tcp.h | 2 +- > net/ipv4/datagram.c | 2 +- > net/ipv4/syncookies.c | 4 +++- > net/ipv4/tcp_input.c | 1 - > net/ipv4/tcp_ipv4.c | 17 - > net/ipv4/tcp_output.c | 1 - > net/ipv6/datagram.c | 4 +++- > net/ipv6/syncookies.c | 3 ++- > net/ipv6/tcp_ipv6.c | 18 +- > 10 files changed, 39 insertions(+), 31 deletions(-) > > diff --git a/include/net/sock.h b/include/net/sock.h > index 79e1a2c..640db0f 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -1729,22 +1729,12 @@ static inline kuid_t sock_net_uid(const struct net > *net, const struct sock *sk) > return sk ? 
sk->sk_uid : make_kuid(net->user_ns, 0); > } > > -static inline u32 net_tx_rndhash(void) > -{ > - u32 v = prandom_u32(); > - > - return v ?: 1; > -} > - > -static inline void sk_set_txhash(struct sock *sk) > -{ > - sk->sk_txhash = net_tx_rndhash(); > -} > - > static inline void sk_rethink_txhash(struct sock *sk) > { > - if (sk->sk_txhash) > - sk_set_txhash(sk); > + if (sk->sk_txhash) { > + u32 v = prandom_u32(); > + sk->sk_txhash = v ?: 1; > + } We'll need to add configuration about whether rethink is done at all. Conservative approach is probably to disable it by default. That is the default behavior of the stack is that flow label is consistent for lifetime of a flow. > } > > static inline struct dst_entry * > diff --git a/include/net/tcp.h b/include/net/tcp.h > index 4e09398..a5c28be 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -1840,7 +1840,7 @@ struct tcp_request_sock_ops { > __u16 *mss); > #endif > struct dst_entry *(*route_req)(const struct sock *sk, struct flowi > *fl, > - const struct request_sock *req); > + struct request_sock *req); > u32 (*init_seq)(const struct sk_buff *skb); > u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb); > int (*send_synack)(const struct sock *sk, struct dst_entry *dst, > diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c > index f915abf..ed9ccb7 100644 > --- a/net/ipv4/datagram.c > +++ b/net/ipv4/datagram.c > @@ -74,7 +74,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr > *uaddr, int addr_len > inet->inet_daddr = fl4->daddr; > inet->inet_dport = usin->sin_port; > sk->sk_state = TCP_ESTABLISHED; > - sk_set_txhash(sk); > + sk->sk_txhash = get_hash_from_flowi4(fl4); Maybe keep sk_set_txhash but add an argument that gives the hash. Hiding behind a function gives us the place to add/change logic in the future. 
> inet->inet_id = jiffies; > > sk_dst_set(sk, >dst); > diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c > index fda37f2..76f1cf6 100644 > --- a/net/ipv4/syncookies.c > +++ b/net/ipv4/syncookies.c > @@ -335,7 +335,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct > sk_buff *skb) > treq->rcv_isn = ntohl(th->seq) - 1; > treq->snt_isn = cookie; > treq->ts_off= 0; > - treq->txhash= net_tx_rndhash(); > req->mss= mss; > ireq->ir_num= ntohs(th->dest); > ireq->ir_rmt_port = th->source; > @@ -376,6 +375,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct > sk_buff *skb) >
[PATCH net-next V2 2/2] net-next: copy user configured flowlabel to reset packet
From: Shaohua Li

Reset packets don't use the user-configured flowlabel; instead, they
always use 0. This makes the flowlabel inconsistent. The tw sock already
records the flowlabel info, so we can use it directly.

Cc: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Florent Fourcot
Cc: Cong Wang
Cc: Tom Herbert
Signed-off-by: Shaohua Li
---
 net/ipv6/tcp_ipv6.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a1a5802..9b678cd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -901,6 +901,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 	struct sock *sk1 = NULL;
 #endif
 	int oif = 0;
+	u8 tclass = 0;
+	__be32 flowlabel = 0;
 
 	if (th->rst)
 		return;
@@ -954,7 +956,21 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 		trace_tcp_send_reset(sk, skb);
 	}
 
-	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
+	if (sk) {
+		if (sk_fullsock(sk)) {
+			struct ipv6_pinfo *np = inet6_sk(sk);
+
+			tclass = np->tclass;
+			flowlabel = np->flow_label & IPV6_FLOWLABEL_MASK;
+		} else {
+			struct inet_timewait_sock *tw = inet_twsk(sk);
+
+			tclass = tw->tw_tclass;
+			flowlabel = cpu_to_be32(tw->tw_flowlabel);
+		}
+	}
+	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1,
+			     tclass, flowlabel);
 
 #ifdef CONFIG_TCP_MD5SIG
 out:
-- 
2.9.5
[PATCH net-next V2 1/2] net-next: use five-tuple hash for sk_txhash
From: Shaohua LiWe are using sk_txhash to calculate flowlabel, but sk_txhash isn't always available, for example, in inet_timewait_sock. This causes problem for reset packet, which will have a different flowlabel. This causes our router doesn't correctly close tcp connection. We are using flowlabel to do load balance. Routers in the path maintain connection state. So if flow label changes, the packet is routed through a different router. In this case, the old router doesn't get the reset packet to close the tcp connection. Per Tom's suggestion, we switch back to five-tuple hash, so we can reconstruct correct flowlabel for reset packet. At most places, we already have the flowi info, so we directly use it build sk_txhash. For synack, we do this after route search. At that time, we have the flowi info ready, so don't need to create the flowi info again. I don't change sk_rethink_txhash() though, it still uses random hash, which is the whole point to select a different path after a negative routing advise. Cc: Martin KaFai Lau Cc: Eric Dumazet Cc: Florent Fourcot Cc: Cong Wang Cc: Tom Herbert Signed-off-by: Shaohua Li --- include/net/sock.h| 18 -- include/net/tcp.h | 2 +- net/ipv4/datagram.c | 2 +- net/ipv4/syncookies.c | 4 +++- net/ipv4/tcp_input.c | 1 - net/ipv4/tcp_ipv4.c | 17 - net/ipv4/tcp_output.c | 1 - net/ipv6/datagram.c | 4 +++- net/ipv6/syncookies.c | 3 ++- net/ipv6/tcp_ipv6.c | 18 +- 10 files changed, 39 insertions(+), 31 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 79e1a2c..640db0f 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1729,22 +1729,12 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk) return sk ? 
sk->sk_uid : make_kuid(net->user_ns, 0); } -static inline u32 net_tx_rndhash(void) -{ - u32 v = prandom_u32(); - - return v ?: 1; -} - -static inline void sk_set_txhash(struct sock *sk) -{ - sk->sk_txhash = net_tx_rndhash(); -} - static inline void sk_rethink_txhash(struct sock *sk) { - if (sk->sk_txhash) - sk_set_txhash(sk); + if (sk->sk_txhash) { + u32 v = prandom_u32(); + sk->sk_txhash = v ?: 1; + } } static inline struct dst_entry * diff --git a/include/net/tcp.h b/include/net/tcp.h index 4e09398..a5c28be 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1840,7 +1840,7 @@ struct tcp_request_sock_ops { __u16 *mss); #endif struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl, - const struct request_sock *req); + struct request_sock *req); u32 (*init_seq)(const struct sk_buff *skb); u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb); int (*send_synack)(const struct sock *sk, struct dst_entry *dst, diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c index f915abf..ed9ccb7 100644 --- a/net/ipv4/datagram.c +++ b/net/ipv4/datagram.c @@ -74,7 +74,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len inet->inet_daddr = fl4->daddr; inet->inet_dport = usin->sin_port; sk->sk_state = TCP_ESTABLISHED; - sk_set_txhash(sk); + sk->sk_txhash = get_hash_from_flowi4(fl4); inet->inet_id = jiffies; sk_dst_set(sk, >dst); diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c index fda37f2..76f1cf6 100644 --- a/net/ipv4/syncookies.c +++ b/net/ipv4/syncookies.c @@ -335,7 +335,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb) treq->rcv_isn = ntohl(th->seq) - 1; treq->snt_isn = cookie; treq->ts_off= 0; - treq->txhash= net_tx_rndhash(); req->mss= mss; ireq->ir_num= ntohs(th->dest); ireq->ir_rmt_port = th->source; @@ -376,6 +375,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb) opt->srr ? 
opt->faddr : ireq->ir_rmt_addr, ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid); security_req_classify_flow(req, flowi4_to_flowi()); + + treq->txhash = get_hash_from_flowi4(); + rt = ip_route_output_key(sock_net(sk), ); if (IS_ERR(rt)) { reqsk_free(req); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 734cfc8..e886c28 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6288,7 +6288,6 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, } tcp_rsk(req)->snt_isn = isn; - tcp_rsk(req)->txhash = net_tx_rndhash();
[PATCH net-next V2 0/2] net: fix flowlabel inconsistency in reset packet
From: Shaohua Li

Hi,

Please see the tcpdump output below:

21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The tcp reset packet has a different flowlabel, so our router doesn't
correctly close the tcp connection. We are using the flowlabel to do
load balancing. Routers in the path maintain connection state, so if the
flow label changes, the packet is routed through a different router. In
this case, the old router doesn't get the reset packet to close the tcp
connection.

The reason is that a normal packet gets its skb->hash from
sk->sk_txhash, which is generated randomly. ip6_make_flowlabel then uses
the hash to create a flowlabel. The reset packet doesn't get assigned a
hash, so the flowlabel is calculated with flowi6. The patches fix the
issue.

Thanks,
Shaohua

Shaohua Li (2):
  net-next: use five-tuple hash for sk_txhash
  net-next: copy user configured flowlabel to reset packet

 include/net/sock.h    | 18 --
 include/net/tcp.h     |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 36 ++--
 10 files changed, 56 insertions(+), 32 deletions(-)

-- 
2.9.5
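The idea behind the series can be illustrated with a small user-space sketch: if the hash is derived only from the five-tuple, anything that can see the packet headers (including the code path that builds a reset without a full socket) arrives at the same 20-bit flow label. Everything below is an assumption made for illustration: `struct five_tuple`, `tuple_hash()` and `flowlabel_from_hash()` are hypothetical names, FNV-1a stands in for the kernel's flow hash, and the plain mask stands in for whatever ip6_make_flowlabel() really does with the hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical five-tuple; field names are illustrative only. */
struct five_tuple {
	uint8_t  saddr[16];
	uint8_t  daddr[16];
	uint16_t sport;
	uint16_t dport;
	uint8_t  proto;
};

/* FNV-1a over a byte range; a stand-in, not the kernel's hash. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/* Hash each field separately so struct padding never leaks in. */
static uint32_t tuple_hash(const struct five_tuple *t)
{
	uint32_t h = 2166136261u;	/* FNV-1a offset basis */

	h = fnv1a(h, t->saddr, sizeof(t->saddr));
	h = fnv1a(h, t->daddr, sizeof(t->daddr));
	h = fnv1a(h, &t->sport, sizeof(t->sport));
	h = fnv1a(h, &t->dport, sizeof(t->dport));
	h = fnv1a(h, &t->proto, sizeof(t->proto));
	return h ? h : 1;		/* keep 0 free to mean "not set" */
}

/* An IPv6 flow label is 20 bits wide. */
static uint32_t flowlabel_from_hash(uint32_t hash)
{
	return hash & 0xFFFFFu;
}
```

With a random per-socket hash, the label is lost as soon as the socket state that stored it is gone; with a tuple-derived hash, it can be recomputed from the headers alone, which is the property the reset path needs.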
Re: [PATCH v4 3/8] MIPS: Octeon: Add a global resource manager.
On 12/01/2017 12:41 PM, Philippe Ombredanne wrote: David, On Fri, Dec 1, 2017 at 9:01 PM, David Daneywrote: On 12/01/2017 11:49 AM, Philippe Ombredanne wrote: David, Greg, On Fri, Dec 1, 2017 at 6:42 PM, David Daney wrote: On 11/30/2017 11:53 PM, Philippe Ombredanne wrote: [...] --- /dev/null +++ b/arch/mips/cavium-octeon/resource-mgr.c @@ -0,0 +1,371 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Resource manager for Octeon. + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file "COPYING" in the main directory of this archive + * for more details. + * + * Copyright (C) 2017 Cavium, Inc. + */ Since you nicely included an SPDX id, you would not need the boilerplate anymore. e.g. these can go alright? They may not be strictly speaking necessary, but I don't think they hurt anything. Unless there is a requirement to strip out the license text, we would stick with it as is. I think the requirement is there and that would be much better for everyone: keeping both is redundant and does not bring any value, does it? Instead it kinda removes the benefits of having the SPDX id in the first place IMHO. Furthermore, as there have been already ~12K+ files cleaned up and still over 60K files to go, it would really nice if new files could adopt the new style: this way we will not have to revisit and repatch them in the future. I am happy to follow any style Greg would suggest. There doesn't seem to be much documentation about how this should be done yet. Thomas (tglx) has already submitted a first series of doc patches a few weeks ago. 
And AFAIK he might be working on posting the updates soon, whenever his real time clock yields a few cycles away from real time coding work ;) See also these discussions with Linus [1][2][3], Thomas[4] and Greg[5] on this and mostly related topics [1] https://lkml.org/lkml/2017/11/2/715 [2] https://lkml.org/lkml/2017/11/25/125 [3] https://lkml.org/lkml/2017/11/25/133 [4] https://lkml.org/lkml/2017/11/2/805 [5] https://lkml.org/lkml/2017/10/19/165 OK, you convinced me. Thanks, David
[PATCH v2 net-next 0/4] tcp: Add a 2nd listener hashtable (port+addr)
This patch set adds a 2nd listener hashtable. It is to resolve
the performance issue when a process is listening at many IP addresses
with the same port (e.g. [IP1]:443, [IP2]:443... [IPN]:443)

v2:
- Move the new lhash2 and lhash2_mask before the existing listening_hash
  to avoid adding another cacheline to inet_hashinfo
  (Suggested by Eric Dumazet, Thanks!)
- I take this chance to plug an existing 4 bytes hole while adding
  'unsigned int lhash2_mask'.
- Add some comments about lhash2 in inet_hashtables.h

Martin KaFai Lau (4):
  inet: Add a count to struct inet_listen_hashbucket
  udp: Move udp[46]_portaddr_hash() to net/ip[v6].h
  inet: Add a 2nd listener hashtable (port+addr)
  tcp: Enable 2nd listener hashtable in TCP

 include/net/inet_connection_sock.h |   2 +
 include/net/inet_hashtables.h      |  29 +--
 include/net/ip.h                   |   9 ++
 include/net/ipv6.h                 |  17
 net/ipv4/inet_hashtables.c         | 173 +++--
 net/ipv4/tcp.c                     |   3 +
 net/ipv4/udp.c                     |  22 ++---
 net/ipv6/inet6_hashtables.c        |  66 ++
 net/ipv6/udp.c                     |  32 ++-
 9 files changed, 301 insertions(+), 52 deletions(-)

-- 
2.9.5
[PATCH v2 net-next 2/4] udp: Move udp[46]_portaddr_hash() to net/ip[v6].h
This patch moves the udp[46]_portaddr_hash() to net/ip[v6].h. The function name is renamed to ipv[46]_portaddr_hash(). It will be used by a later patch which adds a second listener hashtable hashed by the address and port. Signed-off-by: Martin KaFai LauReviewed-by: Eric Dumazet --- include/net/ip.h | 9 + include/net/ipv6.h | 17 + net/ipv4/udp.c | 22 -- net/ipv6/udp.c | 32 4 files changed, 42 insertions(+), 38 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index 9896f46cbbf1..fc9bf1b1fe2c 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -26,12 +26,14 @@ #include #include #include +#include #include #include #include #include #include +#include #define IPV4_MAX_PMTU 65535U /* RFC 2675, Section 5.1 */ @@ -521,6 +523,13 @@ static inline unsigned int ipv4_addr_hash(__be32 ip) return (__force unsigned int) ip; } +static inline u32 ipv4_portaddr_hash(const struct net *net, +__be32 saddr, +unsigned int port) +{ + return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port; +} + bool ip_call_ra_chain(struct sk_buff *skb); /* diff --git a/include/net/ipv6.h b/include/net/ipv6.h index f73797e2fa60..25be4715578c 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -22,6 +22,7 @@ #include #include #include +#include #define SIN6_LEN_RFC2133 24 @@ -673,6 +674,22 @@ static inline bool ipv6_addr_v4mapped(const struct in6_addr *a) cpu_to_be32(0x))) == 0UL; } +static inline u32 ipv6_portaddr_hash(const struct net *net, +const struct in6_addr *addr6, +unsigned int port) +{ + unsigned int hash, mix = net_hash_mix(net); + + if (ipv6_addr_any(addr6)) + hash = jhash_1word(0, mix); + else if (ipv6_addr_v4mapped(addr6)) + hash = jhash_1word((__force u32)addr6->s6_addr32[3], mix); + else + hash = jhash2((__force u32 *)addr6->s6_addr32, 4, mix); + + return hash ^ port; +} + /* * Check for a RFC 4843 ORCHID address * (Overlay Routable Cryptographic Hash Identifiers) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 36f857c87fe2..e9c0d1e1772e 100644 --- 
a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -357,18 +357,12 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum, } EXPORT_SYMBOL(udp_lib_get_port); -static u32 udp4_portaddr_hash(const struct net *net, __be32 saddr, - unsigned int port) -{ - return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port; -} - int udp_v4_get_port(struct sock *sk, unsigned short snum) { unsigned int hash2_nulladdr = - udp4_portaddr_hash(sock_net(sk), htonl(INADDR_ANY), snum); + ipv4_portaddr_hash(sock_net(sk), htonl(INADDR_ANY), snum); unsigned int hash2_partial = - udp4_portaddr_hash(sock_net(sk), inet_sk(sk)->inet_rcv_saddr, 0); + ipv4_portaddr_hash(sock_net(sk), inet_sk(sk)->inet_rcv_saddr, 0); /* precompute partial secondary hash */ udp_sk(sk)->udp_portaddr_hash = hash2_partial; @@ -485,7 +479,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, u32 hash = 0; if (hslot->count > 10) { - hash2 = udp4_portaddr_hash(net, daddr, hnum); + hash2 = ipv4_portaddr_hash(net, daddr, hnum); slot2 = hash2 & udptable->mask; hslot2 = >hash2[slot2]; if (hslot->count < hslot2->count) @@ -496,7 +490,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, exact_dif, hslot2, skb); if (!result) { unsigned int old_slot2 = slot2; - hash2 = udp4_portaddr_hash(net, htonl(INADDR_ANY), hnum); + hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum); slot2 = hash2 & udptable->mask; /* avoid searching the same slot again. */ if (unlikely(slot2 == old_slot2)) @@ -1761,7 +1755,7 @@ EXPORT_SYMBOL(udp_lib_rehash); static void udp_v4_rehash(struct sock *sk) { - u16 new_hash = udp4_portaddr_hash(sock_net(sk), + u16 new_hash = ipv4_portaddr_hash(sock_net(sk), inet_sk(sk)->inet_rcv_saddr, inet_sk(sk)->inet_num); udp_lib_rehash(sk, new_hash); @@ -1952,9 +1946,9 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb, struct sk_buff *nskb; if (use_hash2) { - hash2_any = udp4_portaddr_hash(net,
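A rough user-space sketch of the port+address hash shape used in this patch may help: the address is mixed with a per-namespace seed, and the port is XORed in afterwards. Because the port only enters through the final XOR, a partial hash computed with port 0 can be stored and completed later, which is what the `hash2_partial` computation in udp_v4_get_port() relies on. The mixer below is an assumption (a generic 32-bit finalizer, not the kernel's jhash_1word()), and `mix_1word()` and `portaddr_hash()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in integer mixer; NOT the kernel's jhash_1word(). */
static uint32_t mix_1word(uint32_t word, uint32_t seed)
{
	uint32_t h = word ^ seed;

	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	h *= 0xc2b2ae35u;
	h ^= h >> 16;
	return h;
}

/*
 * Same shape as ipv4_portaddr_hash() in the patch: hash the address
 * with a namespace-specific seed, then XOR the port in.
 */
static uint32_t portaddr_hash(uint32_t ns_seed, uint32_t addr,
			      unsigned int port)
{
	return mix_1word(addr, ns_seed) ^ port;
}
```

For example, `portaddr_hash(seed, addr, 0) ^ port` gives the same value as `portaddr_hash(seed, addr, port)`, so the address-dependent part only has to be computed once per socket.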
[PATCH v2 net-next 3/4] inet: Add a 2nd listener hashtable (port+addr)
The current listener hashtable is hashed by port only. When a process is listening at many IP addresses with the same port (e.g. [IP1]:443, [IP2]:443... [IPN]:443), the inet[6]_lookup_listener() performance is degraded to a link list. It is prone to syn attack. UDP had a similar issue and a second hashtable was added to resolve it. This patch adds a second hashtable for the listener's sockets. The second hashtable is hashed by port and address. It cannot reuse the existing skc_portaddr_node which is shared with skc_bind_node. TCP listener needs to use skc_bind_node. Instead, this patch adds a hlist_node 'icsk_listen_portaddr_node' to the inet_connection_sock which the listener (like TCP) also belongs to. The new portaddr hashtable may need two lookup (First by IP:PORT. Second by INADDR_ANY:PORT if the IP:PORT is a not found). Hence, it implements a similar cut off as UDP such that it will only consult the new portaddr hashtable if the current port-only hashtable has >10 sk in the link-list. lhash2 and lhash2_mask are added to 'struct inet_hashinfo'. I take this chance to plug a 4 bytes hole. It is done by first moving the existing bind_bucket_cachep up and then add the new (int lhash2_mask, *lhash2) after the existing bhash_size. 
Signed-off-by: Martin KaFai Lau--- include/net/inet_connection_sock.h | 2 + include/net/inet_hashtables.h | 28 +-- net/ipv4/inet_hashtables.c | 168 +++-- net/ipv6/inet6_hashtables.c| 66 +++ 4 files changed, 249 insertions(+), 15 deletions(-) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 0358745ea059..8e1bf9ae4a5e 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops { * @icsk_af_ops Operations which are AF_INET{4,6} specific * @icsk_ulp_ops Pluggable ULP control hook * @icsk_ulp_data ULP private data + * @icsk_listen_portaddr_node hash to the portaddr listener hashtable * @icsk_ca_state:Congestion control state * @icsk_retransmits: Number of unrecovered [RTO] timeouts * @icsk_pending: Scheduled timer event @@ -101,6 +102,7 @@ struct inet_connection_sock { const struct inet_connection_sock_af_ops *icsk_af_ops; const struct tcp_ulp_ops *icsk_ulp_ops; void *icsk_ulp_data; + struct hlist_node icsk_listen_portaddr_node; unsigned int (*icsk_sync_mss)(struct sock *sk, u32 pmtu); __u8 icsk_ca_state:6, icsk_ca_setsockopt:1, diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index 4cce516c41ac..9141e95529e7 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -133,12 +133,13 @@ struct inet_hashinfo { /* Ok, let's try this, I give up, we do need a local binding * TCP hash as well as the others for fast bind/connect. */ + struct kmem_cache *bind_bucket_cachep; struct inet_bind_hashbucket *bhash; - unsigned intbhash_size; - /* 4 bytes hole on 64 bit */ - struct kmem_cache *bind_bucket_cachep; + /* The 2nd listener table hashed by local port and address */ + unsigned intlhash2_mask; + struct inet_listen_hashbucket *lhash2; /* All the above members are written once at bootup and * never written again _or_ are predominantly read-access. 
@@ -146,14 +147,25 @@ struct inet_hashinfo { * Now align to a new cache line as all the following members * might be often dirty. */ - /* All sockets in TCP_LISTEN state will be in here. This is the only -* table where wildcard'd TCP sockets can exist. Hash function here -* is just local port number. + /* All sockets in TCP_LISTEN state will be in listening_hash. +* This is the only table where wildcard'd TCP sockets can +* exist. listening_hash is only hashed by local port number. +* If lhash2 is initialized, the same socket will also be hashed +* to lhash2 by port and address. */ struct inet_listen_hashbucket listening_hash[INET_LHTABLE_SIZE] cacheline_aligned_in_smp; }; +#define inet_lhash2_for_each_icsk_rcu(__icsk, list) \ + hlist_for_each_entry_rcu(__icsk, list, icsk_listen_portaddr_node) + +static inline struct inet_listen_hashbucket * +inet_lhash2_bucket(struct inet_hashinfo *h, u32 hash) +{ + return >lhash2[hash & h->lhash2_mask]; +} + static inline struct inet_ehash_bucket *inet_ehash_bucket( struct inet_hashinfo *hashinfo, unsigned int hash) @@ -209,6 +221,10 @@ int __inet_inherit_port(const struct sock
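The lookup strategy described in the commit message (use the port-only bucket while it is short, and switch to the port+address table with an exact-then-wildcard probe once the bucket holds more than 10 sockets) can be sketched in user space as follows. This is only a model under stated assumptions: the structures, table sizes, hash function and names (`struct tables`, `add_listener()`, `lookup()`) are all hypothetical, and locking, RCU, scoring and network namespaces are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_ANY	0u
#define PORT_BUCKETS	32u
#define LHASH2_SZ	64u	/* power of two, so (size - 1) is a mask */
#define CUTOFF		10u	/* threshold named in the commit message */

struct listener {
	uint32_t addr;			/* ADDR_ANY means wildcard */
	uint16_t port;
	struct listener *next_port;	/* chain in the port-only bucket */
	struct listener *next_pa;	/* chain in the port+addr bucket */
};

struct bucket {
	unsigned int count;
	struct listener *head;
};

struct tables {
	struct bucket by_port[PORT_BUCKETS];
	struct bucket by_portaddr[LHASH2_SZ];
};

/* Hypothetical port+addr hash; any reasonable mixer would do here. */
static uint32_t pa_hash(uint32_t addr, uint16_t port)
{
	uint32_t h = addr * 2654435761u;

	h ^= h >> 15;
	return h ^ port;
}

/* Every listener is linked into both tables, as in the patch. */
static void add_listener(struct tables *t, struct listener *l)
{
	struct bucket *b1 = &t->by_port[l->port % PORT_BUCKETS];
	struct bucket *b2 =
		&t->by_portaddr[pa_hash(l->addr, l->port) & (LHASH2_SZ - 1)];

	l->next_port = b1->head;
	b1->head = l;
	b1->count++;
	l->next_pa = b2->head;
	b2->head = l;
	b2->count++;
}

static struct listener *scan_pa(struct bucket *b, uint32_t addr, uint16_t port)
{
	for (struct listener *l = b->head; l; l = l->next_pa)
		if (l->port == port && l->addr == addr)
			return l;
	return NULL;
}

static struct listener *lookup(struct tables *t, uint32_t daddr, uint16_t port)
{
	struct bucket *b1 = &t->by_port[port % PORT_BUCKETS];
	struct listener *l, *wild = NULL;

	if (b1->count <= CUTOFF) {
		/* Short chain: a linear scan of the port bucket is cheap. */
		for (l = b1->head; l; l = l->next_port) {
			if (l->port != port)
				continue;
			if (l->addr == daddr)
				return l;
			if (l->addr == ADDR_ANY)
				wild = l;
		}
		return wild;
	}

	/* Long chain: first try IP:PORT, then fall back to ANY:PORT. */
	l = scan_pa(&t->by_portaddr[pa_hash(daddr, port) & (LHASH2_SZ - 1)],
		    daddr, port);
	if (l)
		return l;
	return scan_pa(&t->by_portaddr[pa_hash(ADDR_ANY, port) & (LHASH2_SZ - 1)],
		       ADDR_ANY, port);
}
```

The per-bucket `count` is what makes the cutoff cheap to evaluate; without it the lookup would have to walk the chain just to learn that it is long.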
[PATCH v2 net-next 4/4] tcp: Enable 2nd listener hashtable in TCP
Enable the second listener hashtable in TCP.
The scale is the same as UDP which is one slot per 2MB.

Signed-off-by: Martin KaFai Lau
Reviewed-by: Eric Dumazet
---
 net/ipv4/tcp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bf97317e6c97..180311636023 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3577,6 +3577,9 @@ void __init tcp_init(void)
 	percpu_counter_init(&tcp_sockets_allocated, 0, GFP_KERNEL);
 	percpu_counter_init(&tcp_orphan_count, 0, GFP_KERNEL);
 	inet_hashinfo_init(&tcp_hashinfo);
+	inet_hashinfo2_init(&tcp_hashinfo, "tcp_listen_portaddr_hash",
+			    thash_entries, 21,  /* one slot per 2 MB*/
+			    0, 64 * 1024);
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-- 
2.9.5
[PATCH v2 net-next 1/4] inet: Add a count to struct inet_listen_hashbucket
This patch adds a count to the 'struct inet_listen_hashbucket'.
It counts how many sockets are hashed to a bucket. It will be
used to decide if the (to-be-added) portaddr listener's hashtable
should be used during inet[6]_lookup_listener().

Signed-off-by: Martin KaFai Lau
Reviewed-by: Eric Dumazet
---
 include/net/inet_hashtables.h |  1 +
 net/ipv4/inet_hashtables.c    | 11 +--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 2dbbbff5e1e3..4cce516c41ac 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -111,6 +111,7 @@ struct inet_bind_hashbucket {
  */
 struct inet_listen_hashbucket {
 	spinlock_t		lock;
+	unsigned int		count;
 	struct hlist_head	head;
 };
 
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 427b705d7c64..80cfd3fa21ca 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -476,6 +476,7 @@ int __inet_hash(struct sock *sk, struct sock *osk)
 		hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
 	else
 		hlist_add_head_rcu(&sk->sk_node, &ilb->head);
+	ilb->count++;
 	sock_set_flag(sk, SOCK_RCU_FREE);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 unlock:
@@ -502,6 +503,7 @@ EXPORT_SYMBOL_GPL(inet_hash);
 void inet_unhash(struct sock *sk)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_listen_hashbucket *ilb;
 	spinlock_t *lock;
 	bool listener = false;
 	int done;
@@ -510,7 +512,8 @@ void inet_unhash(struct sock *sk)
 		return;
 
 	if (sk->sk_state == TCP_LISTEN) {
-		lock = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)].lock;
+		ilb = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)];
+		lock = &ilb->lock;
 		listener = true;
 	} else {
 		lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
@@ -522,8 +525,11 @@ void inet_unhash(struct sock *sk)
 		done = __sk_del_node_init(sk);
 	else
 		done = __sk_nulls_del_node_init_rcu(sk);
-	if (done)
+	if (done) {
+		if (listener)
+			ilb->count--;
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+	}
 	spin_unlock_bh(lock);
 }
 EXPORT_SYMBOL_GPL(inet_unhash);
@@ -658,6 +664,7 @@ void inet_hashinfo_init(struct inet_hashinfo *h)
 	for (i = 0; i < INET_LHTABLE_SIZE; i++) {
 		spin_lock_init(&h->listening_hash[i].lock);
 		INIT_HLIST_HEAD(&h->listening_hash[i].head);
+		h->listening_hash[i].count = 0;
 	}
 }
 EXPORT_SYMBOL_GPL(inet_hashinfo_init);
-- 
2.9.5
Re: [PATCH net] tcp/dccp: block bh before arming time_wait timer
On Fri, 2017-12-01 at 15:12 -0500, David Miller wrote:
> From: Eric Dumazet
> Date: Fri, 01 Dec 2017 10:06:56 -0800
>
> > From: Eric Dumazet
> >
> > Maciej Żenczykowski reported some panics in tcp_twsk_destructor()
> > that might be caused by the following bug.
> >
> > timewait timer is pinned to the cpu, because we want to transition
> > timewait refcount from 0 to 4 in one go, once everything has been
> > initialized.
> >
> > At the time commit ed2e92394589 ("tcp/dccp: fix timewait races in
> > timer handling") was merged, TCP was always running from BH handler.
> >
> > After commit 5413d1babe8f ("net: do not block BH while processing
> > socket backlog") we definitely can run tcp_time_wait() from process
> > context.
> >
> > We need to block BH in the critical section so that the pinned timer
> > still serves its purpose.
> >
> > This bug is more likely to happen under stress and when very small
> > RTO are used in datacenter flows.
> >
> > Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
> > Signed-off-by: Eric Dumazet
> > Reported-by: Maciej Żenczykowski
>
> Applied and queued up for -stable, thanks Eric.

It just occurred to me that we can now revert 614bdd4d6e61d26 ("tcp:
must block bh in __inet_twsk_hashdance()")
Re: netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
On 12/01/2017 06:39 PM, Al Viro wrote:
[...]
> If that does not scream "wrong or missing primitive", I don't know what would.
> You want something along the lines of "create a filesystem object at given
> location, calling this function with this argument for actual object creation"?
> Fair enough, but then let's add a primitive that would do just that.
>
> And grepping around for similar sick tricks catches a slightly milder example -
> mq_open(2) doesn't play with encoding stuff into dev_t, but otherwise it's very
> similar and could also benefit from the same primitive.
>
> How about something like this:
> int vfs_mkobj(struct dentry *dentry, umode_t mode,
> 		int (*f)(struct dentry *, umode_t, void *),
> 		void *arg)
> {
> 	struct inode *dir = dentry->d_parent->d_inode;
> 	int error = may_create(dir, dentry);
> 	if (error)
> 		return error;
>
> 	mode &= S_IALLUGO;
> 	mode |= S_IFREG;
> 	error = security_inode_create(dir, dentry, mode);
> 	if (error)
> 		return error;
> 	error = f(dentry, mode, arg);
> 	if (!error)
> 		fsnotify_create(dir, dentry);
> 	return error;
> }
>
> exported by fs/namei.c, with your code doing
>
> 	switch (type) {
> 	case BPF_TYPE_PROG:
> 		error = vfs_mkobj(path.dentry, mode, bpf_mkprog, raw);
> 		break;
> 	case BPF_TYPE_MAP:
> 		error = vfs_mkobj(path.dentry, mode, bpf_mkmap, raw);
> 		break;
> 	default:
> 		error = -EPERM;
> 	}
> instead of that vfs_mknod() hack, with
>
> static int bpf_mkprog(struct inode *dir, struct dentry *dentry,
> 			umode_t mode, void *raw)
> {
> 	return bpf_mkobj_ops(dir, dentry, mode, raw, &bpf_prog_iops);
> }
>
> static int bpf_mkmap(struct inode *dir, struct dentry *dentry,
> 			umode_t mode, void *raw)
> {
> 	return bpf_mkobj_ops(dir, dentry, mode, raw, &bpf_map_iops);
> }
>
> static int bpf_mkobj_ops(struct inode *dir, struct dentry *dentry,
> 			umode_t mode, void *raw, struct inode_operations *iops)
> {
> 	struct inode *inode;
>
> 	inode = bpf_get_inode(dir->i_sb, dir, mode);
> 	if (IS_ERR(inode))
> 		return PTR_ERR(inode);
>
> 	inode->i_op = iops;
> 	inode->i_private = raw;
>
> 	bpf_dentry_finalize(dentry, inode, dir);
> 	return 0;
> }
>
> And to hell with messing with dev_t, ->d_fsdata or having ->mknod() there at
> all...
> Might want to replace security_path_mknod() with something saner, while we are
> at it.
>
> Objections?

No, thanks for looking into this, and sorry for this fugly hack! :( Not
that this makes it any better, but I think back then I took it over from
the mqueue implementation ... I should have known better and looked into
making this generic instead, sigh. The above looks good to me, so no
objections from my side and thanks for working on it!

> PS: mqueue.c would also benefit from such primitive - do_create() there would
> simply pass attr as callback's argument into vfs_mkobj(), with callback being
> the guts of mqueue_create()...
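The shape of the proposed primitive (generic checks done once, then a caller-supplied callback with an opaque argument doing the type-specific creation) can be shown with a toy user-space model. Nothing here is kernel API: `struct node`, `mkobj()`, `mk_prog()` and `mk_map()` are hypothetical stand-ins, and the "already exists" check merely plays the role that may_create() and the security hook play in the proposal.

```c
#include <assert.h>
#include <stddef.h>

enum obj_type { OBJ_NONE, OBJ_PROG, OBJ_MAP };	/* hypothetical */

struct node {
	enum obj_type type;
	void *private_data;
};

typedef int (*mkobj_fn)(struct node *n, unsigned int mode, void *arg);

/* Generic part: shared checks, then hand off to the callback. */
static int mkobj(struct node *n, unsigned int mode, mkobj_fn f, void *arg)
{
	if (n->type != OBJ_NONE)
		return -1;	/* stand-in for the permission/existence checks */
	mode &= 07777u;		/* keep permission bits only */
	return f(n, mode, arg);
}

/* Type-specific parts: the opaque argument carries the payload. */
static int mk_prog(struct node *n, unsigned int mode, void *arg)
{
	(void)mode;
	n->type = OBJ_PROG;
	n->private_data = arg;
	return 0;
}

static int mk_map(struct node *n, unsigned int mode, void *arg)
{
	(void)mode;
	n->type = OBJ_MAP;
	n->private_data = arg;
	return 0;
}
```

The point of the design is that the payload travels as a typed callback argument, so nothing has to be smuggled through an unrelated parameter such as dev_t.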
Re: [PATCH net-next resubmit 2/2] net: phy: remove generic settings for callbacks config_aneg and read_status from drivers
From: Heiner Kallweit
Date: Thu, 30 Nov 2017 23:47:52 +0100

> Remove generic settings for callbacks config_aneg and read_status
> from drivers.
>
> Signed-off-by: Heiner Kallweit
> Reviewed-by: Florian Fainelli

Applied.
Re: [PATCH net-next resubmit 1/2] net: phy: core: use genphy version of callbacks read_status and config_aneg per default
From: Heiner Kallweit
Date: Thu, 30 Nov 2017 23:46:19 +0100

> read_status and config_aneg are the only mandatory callbacks and most
> of the time the generic implementation is used by drivers.
> So make the core fall back to the generic version if a driver doesn't
> implement the respective callback.
>
> Also currently the core doesn't seem to verify that drivers implement
> the mandatory calls. If a driver doesn't do so we'd just get a NPE.
> With this patch this potential issue doesn't exist any longer.
>
> Signed-off-by: Heiner Kallweit
> Reviewed-by: Florian Fainelli

Applied.
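The fallback described in the quoted patch can be modelled in a few lines of user-space C: the core checks whether the driver filled in a callback and calls a generic implementation otherwise, so a NULL callback is no longer dereferenced. All types and names below (`struct phy_drv`, `struct phy_dev`, `generic_*`, `phy_do_*`) are hypothetical stand-ins for illustration, not the phylib structures themselves.

```c
#include <assert.h>
#include <stddef.h>

struct phy_dev;

/* Hypothetical driver ops; a NULL entry means "use the generic one". */
struct phy_drv {
	int (*config_aneg)(struct phy_dev *dev);
	int (*read_status)(struct phy_dev *dev);
};

struct phy_dev {
	const struct phy_drv *drv;
	int generic_calls;	/* counters exist only to make the demo observable */
	int driver_calls;
};

static int generic_config_aneg(struct phy_dev *dev)
{
	dev->generic_calls++;
	return 0;
}

static int generic_read_status(struct phy_dev *dev)
{
	dev->generic_calls++;
	return 0;
}

/* Example of a driver-specific override. */
static int custom_config_aneg(struct phy_dev *dev)
{
	dev->driver_calls++;
	return 0;
}

/* Core-side wrappers: fall back instead of calling through NULL. */
static int phy_do_config_aneg(struct phy_dev *dev)
{
	if (dev->drv->config_aneg)
		return dev->drv->config_aneg(dev);
	return generic_config_aneg(dev);
}

static int phy_do_read_status(struct phy_dev *dev)
{
	if (dev->drv->read_status)
		return dev->drv->read_status(dev);
	return generic_read_status(dev);
}
```

A driver that only needs special handling for one of the two operations sets that one pointer and leaves the other NULL, which is why the companion patch can drop the explicit generic assignments from drivers.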
Re: [PATCH v4 3/8] MIPS: Octeon: Add a global resource manager.
David, On Fri, Dec 1, 2017 at 9:01 PM, David Daneywrote: > On 12/01/2017 11:49 AM, Philippe Ombredanne wrote: >> >> David, Greg, >> >> On Fri, Dec 1, 2017 at 6:42 PM, David Daney >> wrote: >>> >>> On 11/30/2017 11:53 PM, Philippe Ombredanne wrote: >> >> [...] >> >> --- /dev/null >> +++ b/arch/mips/cavium-octeon/resource-mgr.c >> @@ -0,0 +1,371 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Resource manager for Octeon. >> + * >> + * This file is subject to the terms and conditions of the GNU >> General >> Public >> + * License. See the file "COPYING" in the main directory of this >> archive >> + * for more details. >> + * >> + * Copyright (C) 2017 Cavium, Inc. >> + */ Since you nicely included an SPDX id, you would not need the boilerplate anymore. e.g. these can go alright? >>> >>> >>> >>> They may not be strictly speaking necessary, but I don't think they hurt >>> anything. Unless there is a requirement to strip out the license text, >>> we >>> would stick with it as is. >> >> >> I think the requirement is there and that would be much better for >> everyone: keeping both is redundant and does not bring any value, does >> it? Instead it kinda removes the benefits of having the SPDX id in the >> first place IMHO. >> >> Furthermore, as there have been already ~12K+ files cleaned up and >> still over 60K files to go, it would really nice if new files could >> adopt the new style: this way we will not have to revisit and repatch >> them in the future. >> > > I am happy to follow any style Greg would suggest. There doesn't seem to be > much documentation about how this should be done yet. Thomas (tglx) has already submitted a first series of doc patches a few weeks ago. 
And AFAIK he might be working on posting the updates soon, whenever his real time clock yields a few cycles away from real time coding work ;) See also these discussions with Linus [1][2][3], Thomas[4] and Greg[5] on this and mostly related topics [1] https://lkml.org/lkml/2017/11/2/715 [2] https://lkml.org/lkml/2017/11/25/125 [3] https://lkml.org/lkml/2017/11/25/133 [4] https://lkml.org/lkml/2017/11/2/805 [5] https://lkml.org/lkml/2017/10/19/165 -- Cordially Philippe Ombredanne
Re: [PATCH v5 net-next 0/3] ip6_gre: add erspan native tunnel for ipv6
From: William Tu
Date: Thu, 30 Nov 2017 11:51:26 -0800
> The patch series adds support for ERSPAN tunnel over ipv6. The first patch
> refactors the existing ipv4 gre implementation and the second refactors the
> ipv6 gre's xmit code. Finally the last patch introduces erspan protocol.
Series applied, thanks William.
Re: [PATCH RFC 2/2] veth: propagate bridge GSO to peer
On Mon, 27 Nov 2017 19:02:01 -0700 David Ahern wrote: > On 11/27/17 6:42 PM, Solio Sarabia wrote: > > On Mon, Nov 27, 2017 at 01:15:02PM -0800, Stephen Hemminger wrote: > >> On Mon, 27 Nov 2017 12:14:19 -0800 > >> Solio Sarabia wrote: > >> > >>> On Sun, Nov 26, 2017 at 11:07:25PM -0800, Stephen Hemminger wrote: > On Sun, 26 Nov 2017 20:13:39 -0700 > David Ahern wrote: > > > On 11/26/17 11:17 AM, Stephen Hemminger wrote: > >> This allows veth device in containers to see the GSO maximum > >> settings of the actual device being used for output. > > > > veth devices can be added to a VRF instead of a bridge, and I do not > > believe the gso propagation works for L3 master devices. > > > > From a quick grep, team devices do not appear to handle gso changes > > either. > > This code should still work correctly, but no optimization would happen. > The gso_max_size of the VRF or team will > still be GSO_MAX_SIZE so there would be no change. If VRF or Team ever > got smart > enough to handle GSO limits, then the algorithm would handle it. > >>> > >>> This patch propagates gso value from bridge to its veth endpoints. > >>> However, since bridge is never aware of the GSO limit from underlying > >>> interfaces, bridge/veth still have larger GSO size. > >>> > >>> In the docker case, bridge is not linked directly to physical or > >>> synthetic interfaces; it relies on iptables to decide which interface to > >>> forward packets to. > >> > >> So for the docker case, then direct control of GSO values via netlink (ie > >> ip link set) > >> seems like the better solution. > > > > Adding ioctl support for 'ip link set' would work. I'm still concerned > > how to enforce the upper limit to not exceed that of the lower devices. > > > > Consider a system with three NICs, each reporting values in the range > > [60,000 - 62,780]. Users could set virtual interfaces' gso to 65,536, > > exceeding the limit, and having the host do sw gso (vms settings must > > not affect host performance.) 
> > > > Looping through interfaces? With the difference that now it'd be > > trigger upon user's request, not every time a veth is created (like one > > previous patch discussed.) > > > > You are concerned about the routed case right? One option is to have VRF > devices propagate gso sizes to all devices (veth, vlan, etc) enslaved to > it. VRF devices are Layer 3 master devices so an L3 parallel to a bridge. See the patch set I posted today which punts the problem to veth setup.
Re: [PATCH 0/2] net: ethtool: add support for ETH_RESET_AP
From: Scott Branden
Date: Thu, 30 Nov 2017 11:35:58 -0800
> Add support to reset application processors inside SmartNICs by
> defining new ETH_RESET_AP bit.
>
> And use new ETH_RESET_AP bit in bnxt ethernet driver.
Looks good, series applied, thanks!
Re: [PATCH net-next 0/3] rds-tcp netns delete related fixes
From: Sowmini Varadhan
Date: Thu, 30 Nov 2017 11:11:26 -0800
> Patchset contains cleanup and bug fixes. Patch 1 is the removal
> of some redundant code/functions. Patch 2 and 3 are fixes for
> corner cases identified by syzkaller. I've not been able to
> reproduce the actual use-after-free race flagged in the syzkaller
> reports, thus these fixes are based on code inspection plus
> manual testing to make sure the modified code paths are executed
> without problems in the commonly encountered timing cases.
Series applied, thanks.
Re: [net-next 1/1] tipc: fall back to smaller MTU if allocation of local send skb fails
From: Jon Maloy
Date: Thu, 30 Nov 2017 16:47:25 +0100
> When sending node local messages the code is using an 'mtu' of 66060
> bytes to avoid unnecessary fragmentation. During situations of low
> memory tipc_msg_build() may sometimes fail to allocate such large
> buffers, resulting in unnecessary send failures. This can easily be
> remedied by falling back to a smaller MTU, and then reassemble the
> buffer chain as if the message were arriving from a remote node.
>
> At the same time, we change the initial MTU setting of the broadcast
> link to a lower value, so that large messages always are fragmented
> into smaller buffers even when we run in single node mode. Apart from
> obtaining the same advantage as for the 'fallback' solution above, this
> turns out to give a significant performance improvement. This can
> probably be explained with the __pskb_copy() operation performed on the
> buffer for each recipient during reception. We found the optimal value
> for this, considering the most relevant skb pool, to be 3744 bytes.
>
> Acked-by: Ying Xue
> Signed-off-by: Jon Maloy
Applied, thanks Jon.
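The fallback idea in the quoted commit message might look roughly like the following user-space sketch. The function names, the allocator hook, and the reuse of the 3744-byte figure as the fallback size are all assumptions made for illustration; this is not the TIPC code.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_LOCAL_MTU		66060	/* large 'mtu' for node-local messages (from the commit text) */
#define TOY_FALLBACK_MTU	3744	/* assumed fallback size, borrowed from the commit text */

/* Hypothetical allocator hook so that allocation failure can be simulated. */
typedef void *(*toy_alloc_fn)(size_t size);

/*
 * Split 'len' bytes into buffers of at most 'mtu' bytes.
 * Returns the number of buffers, or -1 if any allocation fails.
 */
static int toy_msg_build(size_t len, size_t mtu, toy_alloc_fn alloc)
{
	int frags = 0;

	while (len > 0) {
		size_t chunk = len < mtu ? len : mtu;
		void *buf = alloc(chunk);

		if (!buf)
			return -1;
		free(buf);	/* a real sender would queue the buffer instead */
		len -= chunk;
		frags++;
	}
	return frags;
}

/* Try one large buffer first; on failure retry with the smaller MTU. */
static int toy_send_local(size_t len, toy_alloc_fn alloc)
{
	int frags = toy_msg_build(len, TOY_LOCAL_MTU, alloc);

	if (frags < 0)
		frags = toy_msg_build(len, TOY_FALLBACK_MTU, alloc);
	return frags;
}

/* Simulated memory pressure: refuse anything larger than 4096 bytes. */
static void *toy_alloc_small_only(size_t size)
{
	return size > 4096 ? NULL : malloc(size);
}
```

Under the simulated pressure a 60000-byte message is carried in 17 buffers (16 full 3744-byte buffers plus one 96-byte remainder) instead of failing outright.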
Re: [PATCH net-next v2 8/8] net: dummy: remove fake SR-IOV functionality
On Fri, 1 Dec 2017 14:46:34 +0100, Phil Sutter wrote:
> On Thu, Nov 30, 2017 at 05:35:40PM -0800, Jakub Kicinski wrote:
> > netdevsim driver seems like a better place for fake SR-IOV
> > functionality. Remove the code previously added to dummy.
> >
> > Signed-off-by: Jakub Kicinski
> > Reviewed-by: Quentin Monnet
>
> Acked-by: Phil Sutter
Thanks! Did you have an opportunity to run your tests against this? I didn't find anything that uses dummy's SR-IOV in selftests.
Re: [PATCH 0/4] SFP/phylink fixes
From: Russell King - ARM Linux
Date: Thu, 30 Nov 2017 13:58:35 +
> Here are four phylink fixes:
> - the "options" is a big-endian value, we must test the bits taking the
>   endian-ness into account.
> - improve the handling of RX_LOS polarity, taking no RX_LOS polarity
>   bits set to mean there is no RX_LOS functionality provided.
> - do not report modules that require the address mode switching as
>   supporting SFF8472.
> - ensure that the mac_link_down() function is called when phylink_stop()
>   is called.
Series applied, thank you.
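The first two fixes in the list above concern testing bits in a big-endian field. A minimal user-space sketch of that idea follows; the `TOY_` names and bit positions are hypothetical and chosen only for illustration, and `htons()` stands in for the kernel's `cpu_to_be16()`. How the sketch treats "both bits set" (inverted wins) is also an arbitrary assumption.

```c
#include <arpa/inet.h>	/* htons() */
#include <stdint.h>

/* Hypothetical bit positions, chosen for illustration only. */
#define TOY_OPT_LOS_INVERTED	(1u << 2)
#define TOY_OPT_LOS_NORMAL	(1u << 1)

enum toy_los { TOY_LOS_NONE, TOY_LOS_NORMAL, TOY_LOS_INVERTED };

/*
 * 'options_be' holds the 16-bit field exactly as stored in the module
 * EEPROM, i.e. big-endian. The masks are converted to big-endian before
 * testing, so the comparison is correct on any host byte order.
 */
static enum toy_los toy_los_polarity(uint16_t options_be)
{
	if (options_be & htons(TOY_OPT_LOS_INVERTED))
		return TOY_LOS_INVERTED;
	if (options_be & htons(TOY_OPT_LOS_NORMAL))
		return TOY_LOS_NORMAL;
	/* Neither polarity bit set: treat RX_LOS as not provided. */
	return TOY_LOS_NONE;
}
```

Testing `options_be & TOY_OPT_LOS_NORMAL` directly would look at the wrong byte on a little-endian host, which is the class of bug the first fix describes.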
Re: [PATCH] net: phy-micrel: check return code in flp center function
From: Max Uvarov
Date: Thu, 30 Nov 2017 13:08:29 +0300
> Fix obvious typo that first return value is set but not checked.
>
> Signed-off-by: Max Uvarov
Applied, thank you.
Re: [PATCH net v2] tipc: call tipc_rcv() only if bearer is up in tipc_udp_recv()
From: Tommi Rantala
Date: Wed, 29 Nov 2017 12:48:42 +0200
> Remove the second tipc_rcv() call in tipc_udp_recv(). We have just
> checked that the bearer is not up, and calling tipc_rcv() with a bearer
> that is not up leads to a TIPC div-by-zero crash in
> tipc_node_calculate_timer(). The crash is rare in practice, but can
> happen like this:
>
> We're enabling a bearer, but it's not yet up and fully initialized.
> At the same time we receive a discovery packet, and in tipc_udp_recv()
> we end up calling tipc_rcv() with the not-yet-initialized bearer,
> causing later the div-by-zero crash in tipc_node_calculate_timer().
>
> Jon Maloy explains the impact of removing the second tipc_rcv() call:
> "link setup in the worst case will be delayed until the next arriving
>  discovery messages, 1 sec later, and this is an acceptable delay."
>
> As the tipc_rcv() call is removed, just leave the function via the
> rcu_out label, so that we will kfree_skb().
 ...
> Fixes: c9b64d492b1f ("tipc: add replicast peer discovery")
> Signed-off-by: Tommi Rantala
> Cc: Jon Maloy
Applied and queued up for -stable, thanks.
Re: [RFC] virtio-net: help live migrate SR-IOV devices
On 11/30/2017 6:11 AM, Michael S. Tsirkin wrote: On Thu, Nov 30, 2017 at 10:08:45AM +0200, achiad shochat wrote: Re. problem #2: Indeed the best way to address it seems to be to enslave the VF driver netdev under a persistent anchor netdev. And it's indeed desired to allow (but not enforce) PV netdev and VF netdev to work in conjunction. And it's indeed desired that this enslavement logic work out of the box. But in case of PV+VF some configurable policies must be in place (and they'd better be generic rather than differ per PV technology). For example - based on which characteristics should the PV+VF coupling be done? netvsc uses MAC address, but that might not always be the desire. It's a policy but not guest userspace policy. The hypervisor certainly knows. Are you concerned that someone might want to create two devices with the same MAC for an unrelated reason? If so, hypervisor could easily set a flag in the virtio device to say "this is a backup, use MAC to find another device". This is something I was going to suggest: a flag or other configuration on the virtio device to help control how this new feature is used. I can imagine this might be useful to control from either the hypervisor side or the VM side. The hypervisor might want to (1) disable it (force it off), (2) enable it for VM choice, or (3) force it on for the VM. In case (2), the VM might be able to choose whether it wants to make use of the feature, or stick with the bonding solution. Either way, the kernel is making a feature available, and the user (VM or hypervisor) is able to control it by selecting the feature based on the policy desired. sln
Re: [PATCH net-next v2 7/8] netdevsim: add SR-IOV functionality
On Fri, 1 Dec 2017 14:43:06 +0100, Phil Sutter wrote:
> On Thu, Nov 30, 2017 at 05:35:39PM -0800, Jakub Kicinski wrote:
> [...]
> > +static int nsim_vfs_enable(struct netdevsim *ns, unsigned int num_vfs)
> > +{
> > +	ns->vfconfigs = kcalloc(num_vfs, sizeof(struct nsim_vf_config),
> > +				GFP_KERNEL);
> > +	if (!ns->vfconfigs)
> > +		return -ENOMEM;
> > +	ns->num_vfs = num_vfs;
> > +
> > +	return 0;
> > +}
> > +
> > +static void nsim_vfs_disable(struct netdevsim *ns)
> > +{
> > +	kfree(ns->vfconfigs);
> > +	ns->vfconfigs = NULL;
> > +	ns->num_vfs = 0;
> > +}
>
> Why not something like:
>
> | static int nsim_vfs_set(struct netdevsim *ns, unsigned int num_vfs)
> | {
> | 	void *ptr = krealloc(ns->vfconfigs,
> | 			     num_vfs * sizeof(struct nsim_vf_config),
> | 			     GFP_KERNEL);
> |
> | 	if (!ptr)
> | 		return -ENOMEM;
> |
> | 	ns->vfconfigs = ptr;
> | 	ns->num_vfs = num_vfs;
> | 	return 0;
> | }
Um. It either frees or allocates, never reallocates so I felt realloc is misleading. ZERO_SIZE_PTR is less clearly a NULL than a NULL. I will have to specify __GFP_ZERO. It's not a calloc so there could be potentially some overflows?
> > +static ssize_t
> > +nsim_numvfs_store(struct device *dev, struct device_attribute *attr,
> > +		  const char *buf, size_t count)
> > +{
> > +	struct netdevsim *ns = to_nsim(dev);
> > +	unsigned int num_vfs;
> > +	int ret;
> > +
> > +	ret = kstrtouint(buf, 0, &num_vfs);
> > +	if (ret)
> > +		return ret;
> > +
> > +	rtnl_lock();
> > +	if (ns->num_vfs == num_vfs)
> > +		goto exit_good;
>
> Then replace this:
>
> > +	if (ns->num_vfs && num_vfs) {
> > +		ret = -EBUSY;
> > +		goto exit_unlock;
> > +	}
> > +
> > +	if (num_vfs) {
> > +		ret = nsim_vfs_enable(ns, num_vfs);
> > +		if (ret)
> > +			goto exit_unlock;
> > +	} else {
> > +		nsim_vfs_disable(ns);
> > +	}
>
> with just:
>
> | nsim_vfs_set(ns, num_vfs);
I'm trying to mirror the PCI subsystem behaviour here, which only allows enable or disable, not increase. I felt we should follow how real devices behave:
	/* enable VFs */
	if (pdev->sriov->num_VFs) {
		dev_warn(&pdev->dev, "%d VFs already enabled. Disable before enabling %d VFs\n",
			 pdev->sriov->num_VFs, num_vfs);
		return -EBUSY;
	}
So IOW this is intentional.
> > +	ret = count;
> > +exit_unlock:
> > +	rtnl_unlock();
> > +
> > +	return ret;
> > +}
>
> [...]
>
> > +static void nsim_free(struct net_device *dev)
> > +{
> > +	struct netdevsim *ns = netdev_priv(dev);
> > +
> > +	device_unregister(&ns->dev);
> >  }
>
> Shouldn't this also kfree(ns->vfconfigs)?
It's in uninit, I will move it to release.
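The overflow concern raised in the reply ("It's not a calloc so there could be potentially some overflows?") can be illustrated in user space. The helper names below are hypothetical; the sketch only shows why an open-coded `count * size` passed to a realloc-style function needs the check that calloc-style allocators perform internally.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if count * size would wrap around SIZE_MAX, 0 otherwise. */
static int toy_mul_overflows(size_t count, size_t size)
{
	return size != 0 && count > SIZE_MAX / size;
}

/*
 * realloc-style resize with an explicit overflow check.
 * On overflow NULL is returned and 'ptr' is left untouched.
 * Unlike calloc(), newly allocated memory is NOT zeroed.
 */
static void *toy_realloc_array(void *ptr, size_t count, size_t size)
{
	if (toy_mul_overflows(count, size))
		return NULL;
	return realloc(ptr, count * size);
}
```

The missing zeroing is the user-space counterpart of the remark about having to specify `__GFP_ZERO`.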
Re: netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
On 12/01/2017 07:28 PM, Linus Torvalds wrote:
> [ Sorry for HTML email crud - traveling and on mobile right now ]
>
> On Nov 30, 2017 23:54, "Al Viro" wrote:
>
> Would cause problems for tracepoints in there, though. And that, BTW,
> is precisely why I don't want tracepoints in core VFS, TYVM - makes
> restructuring the code harder...
>
> Just ignore them, see if anybody notices, and then they can add them back.
> Tracepoints shouldn't hold up kernel development, and I doubt these are
> ones that could be noticed by normal users.
Yep, agree, if it really gets in the way, then let's remove them for now. After all, that was what was decided anyway.
[PATCH iproute2 net-next] iplink: allow configuring GSO max values
This allows sending GSO maximum values when configuring a device. The values are advisory. Most devices will ignore them but for some pseudo devices such as veth pairs they can be set.
Example:
	# ip link add dev vm1 type veth peer name vm2 gso_max_size 32768
Signed-off-by: Stephen Hemminger
---
 ip/iplink.c           | 19 ++-
 man/man8/ip-link.8.in | 13 +
 2 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/ip/iplink.c b/ip/iplink.c
index 0a8eb56fb252..6379b16a14f5 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -97,7 +97,8 @@ void iplink_usage(void)
 		"	[ master DEVICE ][ vrf NAME ]\n"
 		"	[ nomaster ]\n"
 		"	[ addrgenmode { eui64 | none | stable_secret | random } ]\n"
-		"	[ protodown { on | off } ]\n"
+		"	[ protodown { on | off } ]\n"
+		"	[ gso_max_size BYTES ] | [ gso_max_segs PACKETS ]\n"
 		"\n"
 		"	ip link show [ DEVICE | group GROUP ] [up] [master DEV] [vrf NAME] [type TYPE]\n");
@@ -848,6 +849,22 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 				return on_off("protodown", *argv);
 			addattr8(&req->n, sizeof(*req), IFLA_PROTO_DOWN, proto_down);
+		} else if (strcmp(*argv, "gso_max_size") == 0) {
+			unsigned int max_size;
+
+			NEXT_ARG();
+			if (get_unsigned(&max_size, *argv, 0) || max_size > UINT16_MAX)
+				invarg("Invalid \"gso_max_size\" value\n",
+				       *argv);
+			addattr32(&req->n, sizeof(*req), IFLA_GSO_MAX_SIZE, max_size);
+		} else if (strcmp(*argv, "gso_max_segs") == 0) {
+			unsigned int max_segs;
+
+			NEXT_ARG();
+			if (get_unsigned(&max_segs, *argv, 0) || max_segs > UINT16_MAX)
+				invarg("Invalid \"gso_max_segs\" value\n",
+				       *argv);
+			addattr32(&req->n, sizeof(*req), IFLA_GSO_MAX_SEGS, max_segs);
 		} else {
 			if (matches(*argv, "help") == 0)
 				usage();
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index a6a10e577b1f..0db2582e19f7 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -36,6 +36,11 @@ ip-link \- network device configuration
 .RB "[ " numrxqueues
 .IR QUEUE_COUNT " ]"
 .br
+.BR "[" gso_max_size
+.IR BYTES " ]"
+.RB "[ " gso_max_segs
+.IR SEGMENTS " ]"
+.br
 .BI type " TYPE"
 .RI "[ " ARGS " ]"
@@ -343,6 +348,14 @@ specifies the number of transmit queues for new device.
 specifies the number of receive queues for new device.
 .TP
+.BI gso_max_size " BYTES "
+specifies the recommended maximum size of a Generic Segment Offload packet the new device should accept.
+
+.TP
+.BI gso_max_segs " SEGMENTS "
+specifies the recommended maximum number of a Generic Segment Offload segments the new device should accept.
+
+.TP
 .BI index " IDX "
 specifies the desired index of the new virtual device. The link creation fails, if the index is busy.
--
2.11.0
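The argument handling added by the patch (parse an unsigned value, reject anything above `UINT16_MAX`) can be sketched in plain C as follows. `toy_get_unsigned()` is a hypothetical stand-in written for this example and is not iproute2's `get_unsigned()`; only the shape of the check is taken from the patch.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse 'arg' as an unsigned int. Returns 0 on success, -1 on error. */
static int toy_get_unsigned(unsigned int *val, const char *arg, int base)
{
	unsigned long res;
	char *end;

	if (!arg || !*arg)
		return -1;
	errno = 0;
	res = strtoul(arg, &end, base);
	/* Reject range errors, trailing garbage and values that do not fit. */
	if (errno || *end != '\0' || res > UINT_MAX)
		return -1;
	*val = (unsigned int)res;
	return 0;
}

/* Mirror of the patch's check: a valid number that also fits in 16 bits. */
static int toy_parse_gso_max(const char *arg, unsigned int *out)
{
	unsigned int v;

	if (toy_get_unsigned(&v, arg, 0) || v > UINT16_MAX)
		return -1;
	*out = v;
	return 0;
}
```

Base 0 lets `strtoul()` accept decimal, octal and `0x` hexadecimal input, so `32768` and `0x8000` are treated as the same value.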
Re: [PATCH net] tcp/dccp: block bh before arming time_wait timer
From: Eric Dumazet
Date: Fri, 01 Dec 2017 10:06:56 -0800
> From: Eric Dumazet
>
> Maciej Żenczykowski reported some panics in tcp_twsk_destructor()
> that might be caused by the following bug.
>
> timewait timer is pinned to the cpu, because we want to transition
> timewait refcount from 0 to 4 in one go, once everything has been
> initialized.
>
> At the time commit ed2e92394589 ("tcp/dccp: fix timewait races in timer
> handling") was merged, TCP was always running from BH handler.
>
> After commit 5413d1babe8f ("net: do not block BH while processing
> socket backlog") we definitely can run tcp_time_wait() from process
> context.
>
> We need to block BH in the critical section so that the pinned timer
> has still its purpose.
>
> This bug is more likely to happen under stress and when very small RTO
> are used in datacenter flows.
>
> Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
> Signed-off-by: Eric Dumazet
> Reported-by: Maciej Żenczykowski
Applied and queued up for -stable, thanks Eric.
[PATCH net-next 1/2] rtnetlink: allow GSO maximums to be passed to device
Allow GSO maximum segments and size as netlink parameters on input.
Signed-off-by: Stephen Hemminger
---
 net/core/rtnetlink.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dabba2a91fc8..8138194c5f81 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1569,6 +1569,8 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PROMISCUITY]	= { .type = NLA_U32 },
 	[IFLA_NUM_TX_QUEUES]	= { .type = NLA_U32 },
 	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
+	[IFLA_GSO_MAX_SEGS]	= { .type = NLA_U32 },
+	[IFLA_GSO_MAX_SIZE]	= { .type = NLA_U32 },
 	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 }, /* ignored */
 	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
--
2.11.0
[PATCH net-next 2/2] veth: allow configuring GSO maximums
Veth's can be used in environments (like Azure) where the underlying network device is impacted by large GSO packets. This patch allows gso maximum values to be passed in when creating the device via netlink.
In theory, other pseudo devices could also use netlink attributes to set GSO maximums but for now veth is what has been observed to be an issue.
Signed-off-by: Stephen Hemminger
---
 drivers/net/veth.c | 20
 1 file changed, 20 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f5438d0978ca..510c058ba227 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -410,6 +410,26 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	if (ifmp && (dev->ifindex != 0))
 		peer->ifindex = ifmp->ifi_index;
+	if (tbp[IFLA_GSO_MAX_SIZE]) {
+		u32 max_size = nla_get_u32(tbp[IFLA_GSO_MAX_SIZE]);
+
+		if (max_size > GSO_MAX_SIZE)
+			return -EINVAL;
+
+		peer->gso_max_size = max_size;
+		dev->gso_max_size = max_size;
+	}
+
+	if (tbp[IFLA_GSO_MAX_SEGS]) {
+		u32 max_segs = nla_get_u32(tbp[IFLA_GSO_MAX_SEGS]);
+
+		if (max_segs > GSO_MAX_SEGS)
+			return -EINVAL;
+
+		peer->gso_max_segs = max_segs;
+		dev->gso_max_segs = max_segs;
+	}
+
 	err = register_netdevice(peer);
 	put_net(net);
 	net = NULL;
--
2.11.0
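The validate-then-apply pattern in the hunk above can be reduced to a small stand-alone sketch. The struct, the function name and the `TOY_GSO_MAX_SIZE` bound are assumptions made for this example; the real code operates on `struct net_device` and the kernel's own limits.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_GSO_MAX_SIZE 65536u	/* assumed upper bound for this sketch */

/* Hypothetical, minimal stand-in for a network device. */
struct toy_netdev {
	uint32_t gso_max_size;
};

/*
 * Apply a requested GSO size to both ends of a device pair.
 * The value is checked first, so a rejected request changes nothing.
 */
static int toy_pair_set_gso_max_size(struct toy_netdev *dev,
				     struct toy_netdev *peer,
				     uint32_t max_size)
{
	if (max_size > TOY_GSO_MAX_SIZE)
		return -EINVAL;

	peer->gso_max_size = max_size;
	dev->gso_max_size = max_size;
	return 0;
}
```

Setting both ends together keeps the pair symmetric, which matches what the patch does for `dev` and `peer`.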
[PATCH net-next 0/2] allow setting gso_maximum values
This is another way of addressing the GSO maximum performance issues for containers on Azure. What happens is that the underlying infrastructure uses an overlay network such that GSO packets over 64K - vlan header end up causing either guest or host to do expensive software copy and fragmentation.
The netvsc driver reports GSO maximum settings correctly, the issue is that containers on veth devices still have the larger settings. One solution that was examined was propagating the values back through the bridge device, but this does not work for cases where virtual container network is done on L3.
This patch set punts the problem to the orchestration layer that sets up the container network. It also enables other virtual devices to have configurable settings for GSO maximum.
Stephen Hemminger (2):
  rtnetlink: allow GSO maximums to be passed to device
  veth: allow configuring GSO maximums
 drivers/net/veth.c   | 20
 net/core/rtnetlink.c | 2 ++
 2 files changed, 22 insertions(+)
--
2.11.0
Re: [PATCH net-next 00/13] nfp: bpf: jump resolution and memcpy update
On 12/01/2017 06:32 AM, Jakub Kicinski wrote:
> Hi!
>
> Jiong says:
>
> Currently, compiler will lower memcpy function call in XDP/eBPF C program
> into a sequence of eBPF load/store pairs for some scenarios.
>
> Compiler is thinking this "inline" optimization is beneficial as it could
> avoid function call and also increase code locality.
>
> However, Netronome NPU is not a traditional load/store architecture, so
> doing a sequence of individual load/store actions is not efficient.
>
> This patch set tries to identify the load/store sequences composed of
> load/store pairs that come from memcpy lowering, then accelerates them
> through NPU's Command Push Pull (CPP) instruction.
>
> This patch set registers a new optimization pass before doing the actual
> JIT work. It traverses the eBPF IR and, once it finds a candidate sequence,
> records the memory copy source, destination and length information in the
> first load instruction starting the sequence and marks all remaining
> instructions in the sequence as skippable. Later, when JITing the
> first load instruction, optimal instructions will be generated using the
> recorded information.
>
> For the safety of this transformation:
>
> - jump into the middle of the sequence will cancel the optimization.
>
> - overlapped memory access will cancel the optimization.
>
> - the load destination register still contains the same value as before
>   the transformation.
Series applied to bpf-next, thanks guys!
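The detection step described in the cover letter (finding runs of load/store pairs that look like a lowered memcpy) can be sketched on a toy instruction format. Everything here is hypothetical: `struct toy_insn` is not the eBPF encoding, and the sketch deliberately leaves out the safety checks the cover letter lists (jump targets inside the sequence, overlapping source and destination).

```c
#include <stddef.h>

enum toy_op { TOY_LOAD, TOY_STORE, TOY_OTHER };

/* Hypothetical, heavily simplified instruction; not the eBPF encoding. */
struct toy_insn {
	enum toy_op op;
	int val_reg;	/* register loaded into / stored from */
	int base_reg;	/* register holding the base address */
	int off;	/* byte offset from the base */
	int size;	/* access width in bytes */
};

/*
 * Count how many consecutive load/store pairs, starting at insns[start],
 * copy a contiguous ascending range: each pair loads into a register and
 * immediately stores that register, and each pair continues exactly where
 * the previous one ended on both the source and the destination side.
 */
static size_t toy_memcpy_pairs(const struct toy_insn *insns, size_t n, size_t start)
{
	size_t pairs = 0;
	size_t i = start;

	while (i + 1 < n) {
		const struct toy_insn *ld = &insns[i];
		const struct toy_insn *st = &insns[i + 1];

		if (ld->op != TOY_LOAD || st->op != TOY_STORE)
			break;
		if (ld->val_reg != st->val_reg || ld->size != st->size)
			break;
		if (pairs > 0) {
			const struct toy_insn *pld = &insns[i - 2];
			const struct toy_insn *pst = &insns[i - 1];

			if (ld->base_reg != pld->base_reg || st->base_reg != pst->base_reg)
				break;
			if (ld->off != pld->off + pld->size || st->off != pst->off + pst->size)
				break;
		}
		pairs++;
		i += 2;
	}
	return pairs;
}
```

A pass built on this would record source, destination and total length at the first load and mark the remaining instructions as skippable, as the cover letter describes.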
Re: [PATCH v4 3/8] MIPS: Octeon: Add a global resource manager.
On 12/01/2017 11:49 AM, Philippe Ombredanne wrote: David, Greg, On Fri, Dec 1, 2017 at 6:42 PM, David Daney wrote: On 11/30/2017 11:53 PM, Philippe Ombredanne wrote: [...] --- /dev/null +++ b/arch/mips/cavium-octeon/resource-mgr.c @@ -0,0 +1,371 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Resource manager for Octeon. + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file "COPYING" in the main directory of this archive + * for more details. + * + * Copyright (C) 2017 Cavium, Inc. + */ Since you nicely included an SPDX id, you would not need the boilerplate anymore. e.g. these can go alright? They may not be strictly speaking necessary, but I don't think they hurt anything. Unless there is a requirement to strip out the license text, we would stick with it as is. I think the requirement is there and that would be much better for everyone: keeping both is redundant and does not bring any value, does it? Instead it kinda removes the benefits of having the SPDX id in the first place IMHO. Furthermore, as there have been already ~12K+ files cleaned up and still over 60K files to go, it would be really nice if new files could adopt the new style: this way we will not have to revisit and repatch them in the future. I am happy to follow any style Greg would suggest. There doesn't seem to be much documentation about how this should be done yet. David Daney
Re: [PATCH iproute2] iproute2: Fix undeclared __kernel_long_t type build error in RHEL 6.8
On Fri, Dec 01, 2017 at 08:48:07AM -0800, Stephen Hemminger wrote:
> On Fri, 1 Dec 2017 13:04:51 +0200
> Leon Romanovsky wrote:
>
> > From: Leon Romanovsky
> >
> > Add asm/posix_types.h header file to the list of needed includes,
> > because the header files in RHEL 6.8 are too old and don't
> > have a declaration of __kernel_long_t.
> >
> > In file included from ../include/uapi/linux/kernel.h:5,
> >                  from ../include/uapi/linux/netfilter/x_tables.h:4,
> >                  from ../include/xtables.h:20,
> >                  from em_ipset.c:26:
> > ../include/uapi/linux/sysinfo.h:9: error: expected specifier-qualifier-list
> > before ‘__kernel_long_t’
> >
> > Cc: Riad Abo Raed
> > Cc: Guy Ergas
> > Signed-off-by: Leon Romanovsky
>
> I see the problem, but the solution of dragging in posix_types.h
> would be too much of a long term maintenance issue.
> All the headers in uapi are regularly generated from upstream
> kernel headers; I don't want to start making exceptions.
>
> Is it just the xtables stuff (which has always been problematic)?
Actually, the only place where __kernel_long_t and __kernel_ulong_t appear is struct sysinfo in include/uapi/linux/sysinfo.h and this structure isn't even used anywhere in iproute2 source (not even in the include/uapi/linux/kernel.h file which includes <linux/sysinfo.h>). So one could work around the problem by defining _LINUX_SYSINFO_H but that seems a bit of a dirty hack.
Michal Kubecek
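The workaround mentioned in the reply relies on how include guards behave: if the guard macro is already defined, the preprocessor skips the header body, including any declaration that would not compile. The single-file sketch below imitates that with a made-up header and made-up names (`TOY_SYSINFO_H`, `toy_kernel_long_t`); it does not touch the real `<linux/sysinfo.h>`.

```c
/* Pre-define the include guard, as if the header had already been included. */
#define TOY_SYSINFO_H

/* ---- contents of a hypothetical "toy_sysinfo.h" ---- */
#ifndef TOY_SYSINFO_H
#define TOY_SYSINFO_H
struct toy_sysinfo {
	toy_kernel_long_t uptime;	/* type unknown on this "old" system */
};
#define TOY_HAVE_SYSINFO 1
#endif
/* ---- end of header ---- */

/* Returns 1 if the header body above was skipped by the preprocessor. */
static int toy_header_skipped(void)
{
#ifdef TOY_HAVE_SYSINFO
	return 0;
#else
	return 1;
#endif
}
```

The file compiles even though `toy_kernel_long_t` is never declared, because the guarded block is discarded before the compiler proper sees it. It also shows why the reply calls the trick a hack: everything else the header would have provided is silently lost as well.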
[PATCH tip/core/rcu 14/21] netfilter: Remove now-redundant smp_read_barrier_depends()
READ_ONCE() now implies smp_read_barrier_depends(), which means that the instances in arpt_do_table(), ipt_do_table(), and ip6t_do_table() are now redundant. This commit removes them and adjusts the comments.
Signed-off-by: Paul E. McKenney
Cc: Pablo Neira Ayuso
Cc: Jozsef Kadlecsik
Cc: Florian Westphal
Cc: "David S. Miller"
Cc:
Cc:
Cc:
---
 net/ipv4/netfilter/arp_tables.c | 7 +--
 net/ipv4/netfilter/ip_tables.c  | 7 +--
 net/ipv6/netfilter/ip6_tables.c | 7 +--
 3 files changed, 3 insertions(+), 18 deletions(-)
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index f88221aebc9d..d242c2d29161 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -202,13 +202,8 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 	local_bh_disable();
 	addend = xt_write_recseq_begin();
-	private = table->private;
+	private = READ_ONCE(table->private); /* Address dependency. */
 	cpu     = smp_processor_id();
-	/*
-	 * Ensure we load private-> members after we've fetched the base
-	 * pointer.
-	 */
-	smp_read_barrier_depends();
 	table_base = private->entries;
 	jumpstack  = (struct arpt_entry **)private->jumpstack[cpu];
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 4cbe5e80f3bf..46866cc24a84 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -260,13 +260,8 @@ ipt_do_table(struct sk_buff *skb,
 	WARN_ON(!(table->valid_hooks & (1 << hook)));
 	local_bh_disable();
 	addend = xt_write_recseq_begin();
-	private = table->private;
+	private = READ_ONCE(table->private); /* Address dependency. */
 	cpu        = smp_processor_id();
-	/*
-	 * Ensure we load private-> members after we've fetched the base
-	 * pointer.
-	 */
-	smp_read_barrier_depends();
 	table_base = private->entries;
 	jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index f06e25065a34..ac1db84722a7 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -282,12 +282,7 @@ ip6t_do_table(struct sk_buff *skb,
 	local_bh_disable();
 	addend = xt_write_recseq_begin();
-	private = table->private;
-	/*
-	 * Ensure we load private-> members after we've fetched the base
-	 * pointer.
-	 */
-	smp_read_barrier_depends();
+	private = READ_ONCE(table->private); /* Address dependency. */
 	cpu        = smp_processor_id();
 	table_base = private->entries;
 	jumpstack  = (struct ip6t_entry **)private->jumpstack[cpu];
--
2.5.2
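The pattern the patch relies on is "load a published pointer once, then read fields through it". A rough user-space analogue using C11 atomics is sketched below. It is only an analogy: the kernel's `READ_ONCE()` is not implemented with C11 atomics, and the `toy_` types and functions are made up for this example.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the table and its private data. */
struct toy_table_info {
	int entries;
};

struct toy_table {
	_Atomic(struct toy_table_info *) private;
};

/* Writer side: fill in the structure first, then publish the pointer. */
static void toy_publish(struct toy_table *t, struct toy_table_info *info)
{
	atomic_store_explicit(&t->private, info, memory_order_release);
}

/*
 * Reader side: a single dependency-ordered load of the pointer, followed by
 * accesses through it. This plays the role of
 *     private = READ_ONCE(table->private);
 * with no separate barrier between the load and the dereference.
 */
static int toy_read_entries(struct toy_table *t)
{
	struct toy_table_info *private =
		atomic_load_explicit(&t->private, memory_order_consume);

	return private ? private->entries : -1;
}
```

Current compilers typically treat `memory_order_consume` as `memory_order_acquire`, so the sketch is conservative rather than a faithful model of address-dependency ordering.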