from:"Stephen Hemminger"

Re: [PATCH 01/10] appletalk: remove localtalk and ppp support

2023-10-09 Thread Stephen Hemminger

On Mon, 9 Oct 2023 18:49:43 +0200
Rodolfo Zitellini  wrote:

> Hi!
> I’ve been working on a new LocalTalk interface driver for the last couple 
> months, do you think it would be possible to at least postpone the removal of 
> LT a bit?
> 
> It is a driver for an open source device called TashTalk 
> (https://github.com/lampmerchant/tashtalk), which runs on a PIC micro that 
> does all the LT interfacing, and communicates back via serial to the host 
> system. My driver is relatively simple and works very well with netatalk 2.2 
> (which is still maintained and still has support for AppleTalk). The driver 
> is basically complete and trsted and I was preparing to submit a patch.
> 
> Still having LocalTalk in my view has many advantages for us enthusiasts that 
> still want to bridge old machines to the current world without modifications, 
> for example for printing on modern printers, netbooting, sharing files and 
> even tcp/ip. All this basically works out of the box via the driver, Linux 
> and available userspace tools (netatalk, macipgw).
> 
> The old ISA cards supported by COPS were basically unobtanium even 20 years 
> ago, but the solution of using a PIC and a serial port is very robust and 
> much more furure-proof. We also already have a device that can interface a 
> modern machine directly via USB to LocalTalk.
> 
> The development of the TashTalk has been also extensively discussed on thr 
> 68KMLA forum 
> (https://68kmla.org/bb/index.php?threads/modtashtalk-lt0-driver-for-linux.45031/)
> 
> I hope the decision to remove LocalTalk can be reconsidered at least for the 
> time being so there is a chance to submit a new, modern device making use of 
> this stack.
> 
> Many Thanks,
> Rodolfo Zitellini

Does it really need it to be a kernel protocol stack?
What about doing it in userspace or with BPF?

Re: [PATCH v8 net-next 2/2] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-19 Thread Stephen Hemminger

On Fri, 16 Apr 2021 13:11:59 -0700
Dexuan Cui  wrote:

> Add a VF driver for Microsoft Azure Network Adapter (MANA) that will be
> available in the future.
> 
> Co-developed-by: Haiyang Zhang 
> Signed-off-by: Haiyang Zhang 
> Co-developed-by: Shachar Raindel 
> Signed-off-by: Shachar Raindel 
> Signed-off-by: Dexuan Cui 

Reviewed-by: Stephen Hemminger

Re: [PATCH v8 net-next 1/2] hv_netvsc: Make netvsc/VF binding check both MAC and serial number

2021-04-19 Thread Stephen Hemminger

On Fri, 16 Apr 2021 13:11:58 -0700
Dexuan Cui  wrote:

> Currently the netvsc/VF binding logic only checks the PCI serial number.
> 
> The upcoming Microsoft Azure Network Adapter (MANA) supports multiple
> net_device interfaces (each such interface is called a "vPort", and has
> its unique MAC address) which are backed by the same VF PCI device, so
> the binding logic should check both the MAC address and the PCI serial
> number.
> 
> The change should not break any other existing VF drivers, because
> Hyper-V NIC SR-IOV implementation requires the netvsc network
> interface and the VF network interface have the same MAC address.
> 
> Co-developed-by: Haiyang Zhang 
> Signed-off-by: Haiyang Zhang 
> Co-developed-by: Shachar Raindel 
> Signed-off-by: Shachar Raindel 
> Signed-off-by: Dexuan Cui 

Acked-by: Stephen Hemminger

Re: [PATCH v7 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-16 Thread Stephen Hemminger

On Fri, 16 Apr 2021 17:58:45 +
Dexuan Cui  wrote:

> > >
> > > This probably should be a separate patch.
> > > I think it is trying to address the case of VF discovery in Hyper-V/Azure 
> > > where
> > > the reported
> > > VF from Hypervisor is bogus or confused.  
> > 
> > This is for the Multi vPorts feature of MANA driver, which allows one VF to
> > create multiple vPorts (NICs). They have the same PCI device and same VF
> > serial number, but different MACs.
> > 
> > So we put the change in one patch to avoid distro vendors missing this
> > change when backporting the MANA driver.
> > 
> > Thanks,
> > - Haiyang  
> 
> The netvsc change should come together in the same patch with this VF
> driver, otherwise the multi-vPorts functionality doesn't work properly.
> 
> The netvsc change should not break any other existing VF drivers, because
> Hyper-V NIC SR-IOV implementation requires the the NetVSC network
> interface and the VF network interface should have the same MAC address,
> otherwise things won't work.
> 
> Thanks,
> Dexuan

Distro vendors should be able to handle a patch series.
Don't see why this could not be two patch series.

Re: [PATCH v7 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-16 Thread Stephen Hemminger

On Thu, 15 Apr 2021 23:07:05 -0700
Dexuan Cui  wrote:

> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index 7349a70af083..f682a5572d84 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -2297,6 +2297,7 @@ static struct net_device *get_netvsc_byslot(const 
> struct net_device *vf_netdev)
>  {
>   struct device *parent = vf_netdev->dev.parent;
>   struct net_device_context *ndev_ctx;
> + struct net_device *ndev;
>   struct pci_dev *pdev;
>   u32 serial;
>  
> @@ -2319,8 +2320,17 @@ static struct net_device *get_netvsc_byslot(const 
> struct net_device *vf_netdev)
>   if (!ndev_ctx->vf_alloc)
>   continue;
>  
> - if (ndev_ctx->vf_serial == serial)
> - return hv_get_drvdata(ndev_ctx->device_ctx);
> + if (ndev_ctx->vf_serial != serial)
> + continue;
> +
> + ndev = hv_get_drvdata(ndev_ctx->device_ctx);
> + if (ndev->addr_len != vf_netdev->addr_len ||
> + memcmp(ndev->perm_addr, vf_netdev->perm_addr,
> +ndev->addr_len) != 0)
> + continue;
> +
> + return ndev;
> +
>   }
>  
>   netdev_notice(vf_netdev,


This probably should be a separate patch.
I think it is trying to address the case of VF discovery in Hyper-V/Azure where 
the reported
VF from Hypervisor is bogus or confused.

Re: [PATCH v6 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-15 Thread Stephen Hemminger

On Wed, 14 Apr 2021 22:45:19 -0700
Dexuan Cui  wrote:

> +static int mana_probe_port(struct mana_context *ac, int port_idx,
> +struct net_device **ndev_storage)
> +{
> + struct gdma_context *gc = ac->gdma_dev->gdma_context;
> + struct mana_port_context *apc;
> + struct net_device *ndev;
> + int err;
> +
> + ndev = alloc_etherdev_mq(sizeof(struct mana_port_context),
> +  gc->max_num_queues);
> + if (!ndev)
> + return -ENOMEM;
> +
> + *ndev_storage = ndev;
> +
> + apc = netdev_priv(ndev);
> + apc->ac = ac;
> + apc->ndev = ndev;
> + apc->max_queues = gc->max_num_queues;
> + apc->num_queues = min_t(uint, gc->max_num_queues, MANA_MAX_NUM_QUEUES);
> + apc->port_handle = INVALID_MANA_HANDLE;
> + apc->port_idx = port_idx;
> +
> + ndev->netdev_ops = _devops;
> + ndev->ethtool_ops = _ethtool_ops;
> + ndev->mtu = ETH_DATA_LEN;
> + ndev->max_mtu = ndev->mtu;
> + ndev->min_mtu = ndev->mtu;
> + ndev->needed_headroom = MANA_HEADROOM;
> + SET_NETDEV_DEV(ndev, gc->dev);
> +
> + netif_carrier_off(ndev);
> +
> + get_random_bytes(apc->hashkey, MANA_HASH_KEY_SIZE);

Current practice for network drivers is to use netdev_rss_key_fill() for this.

Re: [PATCH v6 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-15 Thread Stephen Hemminger

On Wed, 14 Apr 2021 22:45:19 -0700
Dexuan Cui  wrote:

> +static int mana_query_vport_cfg(struct mana_port_context *apc, u32 
> vport_index,
> + u32 *max_sq, u32 *max_rq, u32 *num_indir_entry)
> +{
> + struct mana_query_vport_cfg_resp resp = {};
> + struct mana_query_vport_cfg_req req = {};
> + int err;
> +
> + mana_gd_init_req_hdr(, MANA_QUERY_VPORT_CONFIG,
> +  sizeof(req), sizeof(resp));
> +
> + req.vport_index = vport_index;
> +
> + err = mana_send_request(apc->ac, , sizeof(req), ,
> + sizeof(resp));
> + if (err)
> + return err;
> +
> + err = mana_verify_resp_hdr(, MANA_QUERY_VPORT_CONFIG,
> +sizeof(resp));
> + if (err)
> + return err;
> +
> + if (resp.hdr.status)
> + return -EPROTO;
> +
> + *max_sq = resp.max_num_sq;
> + *max_rq = resp.max_num_rq;
> + *num_indir_entry = resp.num_indirection_ent;
> +
> + apc->port_handle = resp.vport;
> + memcpy(apc->mac_addr, resp.mac_addr, ETH_ALEN);

You could use ether_addr_copy here.


> +int mana_do_attach(struct net_device *ndev, enum mana_attach_caller caller)
> +{
> + struct mana_port_context *apc = netdev_priv(ndev);
> + struct gdma_dev *gd = apc->ac->gdma_dev;
> + u32 max_txq, max_rxq, max_queues;
> + int port_idx = apc->port_idx;
> + u32 num_indirect_entries;
> + int err;
> +
> + if (caller == MANA_OPEN)
> + goto start_open;
> +
> + err = mana_init_port_context(apc);
> + if (err)
> + return err;
> +
> + err = mana_query_vport_cfg(apc, port_idx, _txq, _rxq,
> +_indirect_entries);
> + if (err) {
> + netdev_err(ndev, "Failed to query info for vPort 0\n");
> + goto reset_apc;
> + }
> +
> + max_queues = min_t(u32, max_txq, max_rxq);
> + if (apc->max_queues > max_queues)
> + apc->max_queues = max_queues;
> +
> + if (apc->num_queues > apc->max_queues)
> + apc->num_queues = apc->max_queues;
> +
> + memcpy(ndev->dev_addr, apc->mac_addr, ETH_ALEN);

And here use ether_addr_copy().

Re: [Resend RFC PATCH V2 08/12] UIO/Hyper-V: Not load UIO HV driver in the isolation VM.

2021-04-14 Thread Stephen Hemminger

On Wed, 14 Apr 2021 17:45:51 +0200
Greg KH  wrote:

> On Wed, Apr 14, 2021 at 10:49:41AM -0400, Tianyu Lan wrote:
> > From: Tianyu Lan 
> > 
> > UIO HV driver should not load in the isolation VM for security reason.
> > Return ENOTSUPP in the hv_uio_probe() in the isolation VM.
> > 
> > Signed-off-by: Tianyu Lan 

This is debatable, in isolation VM's shouldn't userspace take responsibility
to validate host communication. If that is an issue please participate with
the DPDK community (main user of this) to make sure netvsc userspace driver
has the required checks.

Re: [PATCH net-next] net: Space: remove hp100 probe

2021-04-13 Thread Stephen Hemminger

On Tue, 13 Apr 2021 16:16:17 +0200
Arnd Bergmann  wrote:

> From: Arnd Bergmann 
> 
> The driver was removed last year, but the static initialization got left
> behind by accident.
> 
> Fixes: a10079c66290 ("staging: remove hp100 driver")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/net/Space.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/net/Space.c b/drivers/net/Space.c
> index 7bb699d7c422..a61cc7b26a87 100644
> --- a/drivers/net/Space.c
> +++ b/drivers/net/Space.c
> @@ -59,9 +59,6 @@ static int __init probe_list2(int unit, struct devprobe2 
> *p, int autoprobe)
>   * look for EISA/PCI cards in addition to ISA cards).
>   */
>  static struct devprobe2 isa_probes[] __initdata = {
> -#if defined(CONFIG_HP100) && defined(CONFIG_ISA) /* ISA, EISA */
> - {hp100_probe, 0},
> -#endif
>  #ifdef CONFIG_3C515
>   {tc515_probe, 0},
>  #endif

Thanks, do we even need to have the static initialization anymore?

Re: [PATCH RFC iproute2-next] iplink: allow to change iplink value

2021-04-11 Thread Stephen Hemminger

On Sat, 10 Apr 2021 15:34:50 +0200
Ansuel Smith  wrote:

> Allow to change the interface to which a given interface is linked to.
> This is useful in the case of multi-CPU port DSA, for changing the CPU
> port of a given user port.
> 
> Signed-off-by: Marek Behún 
> Cc: David Ahern 
> Cc: Stephen Hemminger 

This may work for DSA but it won't work for all the device types 
vlan/macsec/... that
now use the link attribute.  It looks like the change link handling for those
device types just ignores the link attribute (maybe ok). But before supporting 
this
as an API, it would be better if all the other drivers that use IFLA_LINK
had error checks in their change link handling.

Please add error checks in kernel first.

Re: [PATCH v3 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-08 Thread Stephen Hemminger

On Thu,  8 Apr 2021 15:58:40 -0700
Dexuan Cui  wrote:

> Add a VF driver for Microsoft Azure Network Adapter (MANA) that will be
> available in the future.
> 
> Co-developed-by: Haiyang Zhang 
> Signed-off-by: Haiyang Zhang 
> Signed-off-by: Dexuan Cui 
> ---
>  MAINTAINERS   |4 +-
>  drivers/net/ethernet/Kconfig  |1 +
>  drivers/net/ethernet/Makefile |1 +
>  drivers/net/ethernet/microsoft/Kconfig|   29 +
>  drivers/net/ethernet/microsoft/Makefile   |5 +
>  drivers/net/ethernet/microsoft/mana/Makefile  |6 +
>  drivers/net/ethernet/microsoft/mana/gdma.h|  728 +++
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 1515 ++
>  .../net/ethernet/microsoft/mana/hw_channel.c  |  859 
>  .../net/ethernet/microsoft/mana/hw_channel.h  |  186 ++
>  drivers/net/ethernet/microsoft/mana/mana.h|  531 +
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 1827 +
>  .../ethernet/microsoft/mana/mana_ethtool.c|  278 +++
>  .../net/ethernet/microsoft/mana/shm_channel.c |  292 +++
>  .../net/ethernet/microsoft/mana/shm_channel.h |   21 +
>  15 files changed, 6282 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/net/ethernet/microsoft/Kconfig
>  create mode 100644 drivers/net/ethernet/microsoft/Makefile
>  create mode 100644 drivers/net/ethernet/microsoft/mana/Makefile
>  create mode 100644 drivers/net/ethernet/microsoft/mana/gdma.h
>  create mode 100644 drivers/net/ethernet/microsoft/mana/gdma_main.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/hw_channel.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/hw_channel.h
>  create mode 100644 drivers/net/ethernet/microsoft/mana/mana.h
>  create mode 100644 drivers/net/ethernet/microsoft/mana/mana_en.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/mana_ethtool.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/shm_channel.c
>  create mode 100644 drivers/net/ethernet/microsoft/mana/shm_channel.h

Linux kernel doesn't do namespaces in the code, so every new driver needs
to worry about global symbols clashing

Please make sure there are not name conflicts with other drivers around 
variable or
functions name "gdma_". Noticed there is one driver in staging using similar
names (drivers/staging/ralink-gdma/ralink-gdma.c). Granted that is in the 
staging
doghouse so probably not important but might show up as a name conflict
with something like the randconfig testing.

Re: [PATCH v2 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-08 Thread Stephen Hemminger

On Thu, 8 Apr 2021 09:22:57 -0700
Randy Dunlap  wrote:

> On 4/8/21 2:15 AM, Dexuan Cui wrote:
> > diff --git a/drivers/net/ethernet/microsoft/Kconfig 
> > b/drivers/net/ethernet/microsoft/Kconfig
> > new file mode 100644
> > index ..12ef6b581566
> > --- /dev/null
> > +++ b/drivers/net/ethernet/microsoft/Kconfig
> > @@ -0,0 +1,30 @@
> > +#
> > +# Microsoft Azure network device configuration
> > +#
> > +
> > +config NET_VENDOR_MICROSOFT
> > +   bool "Microsoft Azure Network Device"  
> 
> Seems to me that should be generalized, more like:
> 
>   bool "Microsoft Network Devices"

Yes, that is what it should be at this level.

> 
> 
> > +   default y

This follows the existing policy for network vendor level

> > +   help
> > + If you have a network (Ethernet) device belonging to this class, say 
> > Y.
> > +
> > + Note that the answer to this question doesn't directly affect the
> > + kernel: saying N will just cause the configurator to skip the
> > + question about Microsoft Azure network device. If you say Y, you  
> 
>  about Microsoft networking devices.
> 
> > + will be asked for your specific device in the following question.
> > +
> > +if NET_VENDOR_MICROSOFT
> > +
> > +config MICROSOFT_MANA
> > +   tristate "Microsoft Azure Network Adapter (MANA) support"
> > +   default m  
> 
> Please drop the default m. We don't randomly add drivers to be built.

Yes, it should be no (or no default which is the default for default)

> Or leave this as is and change NET_VENDOR_MICROSOFT to be default n.
> 
> 
> > +   depends on PCI_MSI && X86_64
> > +   select PCI_HYPERV
> > +   help
> > + This driver supports Microsoft Azure Network Adapter (MANA).
> > + So far, the driver is only validated on X86_64.  
> 
> validated how?

Maybe change validated to supported?

Re: Kconfig, DEFAULT_NETSCH, and shooting yourself in the foot..

2021-01-03 Thread Stephen Hemminger

On Sun, 3 Jan 2021 15:30:30 +0900
Masahiro Yamada  wrote:

> On Sun, Jan 3, 2021 at 5:14 AM Valdis Klētnieks  
> wrote:
> >
> > Consider the following own goal I just discovered I scored:
> >
> > [~] zgrep -i fq_codel /proc/config.gz
> > CONFIG_NET_SCH_FQ_CODEL=m
> > CONFIG_DEFAULT_FQ_CODEL=y
> > CONFIG_DEFAULT_NET_SCH="fq_codel"
> >
> > Obviously, fq_codel didn't get set as the default, because that happens
> > before the module gets loaded (which may never happen if the sysadmin
> > thinks the DEFAULT_NET_SCH already made it happen)
> >
> > Whoops. My bad, probably - but
> >
> > The deeper question, part 1:
> >
> > There's this chunk in net/sched/Kconfig:
> >
> > config DEFAULT_NET_SCH
> > string
> > default "pfifo_fast" if DEFAULT_PFIFO_FAST
> > default "fq" if DEFAULT_FQ
> > default "fq_codel" if DEFAULT_FQ_CODEL
> > default "fq_pie" if DEFAULT_FQ_PIE
> > default "sfq" if DEFAULT_SFQ
> > default "pfifo_fast"
> > endif
> >
> > (And a similar chunk right above it with a similar issue)
> >
> > Should those be "if (foo=y)" so =m can't be chosen? (I'll be
> > happy to write the patch if that's what we want)
> >
> > Deeper question, part 2:
> >
> > Should there be a way in the Kconfig language to ensure that
> > these two chunks can't accidentally get out of sync?  There's other
> > places in the kernel where similar issues arise - a few days ago I was
> > chasing a CPU governor issue where it looked like it was possible
> > to set a default that was a module and thus possibly not actually loaded.
> >  
> 
> 
> If there is a restriction where a modular discipline cannot be the default,
> I think you can add 'depends on FOO = y'.
> 
> 
> 
> For example,
> 
> 
> choice
>prompt "Default"
> 
>config DEFAULT_FOO
>   bool "Use foo for default"
>   depends on FOO = y
> 
>config DEFAULT_BAR
>   bool "Use bar for default"
>   depends on BAR = y
> 
>config DEFAULT_FALLBACK
>   bool "fallback when nothing else is builtin"
> 
> endchoice
> 
> 
> 
> 
> 

You can use a qdisc that is a module, it just has to be available when device
is loaded. Typically that means putting it in initramfs.

[ANNOUNCE] iproute2-5.10

2020-12-20 Thread Stephen Hemminger

Just in time for the holidays, new iproute2!

This update is smaller than usual, not a lot of new features.
It does NOT include libbpf, that will be merged in 5.11 (iproute2-next).

Download:
https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-5.10.0.tar.gz

Repository for upcoming release:
git://git.kernel.org/pub/scm/network/iproute2/iproute2.git

And future release (net-next):
git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git

Thanks for all the contributions.

Report problems (or enhancements) to the net...@vger.kernel.org mailing list.

---
Andrea Claudi (4):
  man: tc-flower: fix manpage
  devlink: fix memory leak in cmd_dev_flash()
  tc: pedit: fix memory leak in print_pedit
  ss: mptcp: fix add_addr_accepted stat print

Antony Antony (2):
  ip xfrm: support printing XFRMA_SET_MARK_MASK attribute in states
  ip xfrm: support setting XFRMA_SET_MARK_MASK attribute in states

Ciara Loftus (1):
  ss: add support for xdp statistics

David Ahern (5):
  Update kernel headers
  Update kernel headers
  Update kernel headers
  Update kernel headers
  Update kernel headers

Guillaume Nault (5):
  m_vlan: add pop_eth and push_eth actions
  m_mpls: add mac_push action
  m_mpls: test the 'mac_push' action after 'modify'
  tc-vlan: fix help and error message strings
  tc-mpls: fix manpage example and help message string

Hoang Le (1):
  tipc: support 128bit node identity for peer removing

Jacob Keller (2):
  devlink: support setting the overwrite mask attribute
  devlink: display elapsed time during flash update

Jakub Kicinski (1):
  ip: promote missed packets to the -s row

Jiri Pirko (1):
  devlink: Add health reporter test command support

Johannes Berg (5):
  libnetlink: add rtattr_for_each_nested() iteration macro
  libnetlink: add nl_print_policy() helper
  genl: ctrl: support dumping netlink policy
  genl: ctrl: print op -> policy idx mapping
  libnetlink: define __aligned conditionally

Luca Boccassi (2):
  ip/netns: use flock when setting up /run/netns
  tc/mqprio: json-ify output

Nikolay Aleksandrov (6):
  bridge: mdb: add support for source address
  bridge: mdb: print fast_leave flag
  bridge: mdb: show igmpv3/mldv2 flags
  bridge: mdb: print filter mode when available
  bridge: mdb: print source list when available
  bridge: mdb: print protocol when available

Parav Pandit (2):
  devlink: Show external port attribute
  devlink: Show controller number of a devlink port

Roopa Prabhu (1):
  iplink: add support for protodown reason

Stephen Hemminger (15):
  v5.9.0
  uapi: updates from 5.10-rc1
  tc/m_gate: fix spelling errors
  man: fix spelling errors
  rdma: fix spelling error in comment
  uapi: update kernel headers from 5.10-rc2
  bridge: report correct version
  devlink: fix uninitialized warning
  bridge: fix string length warning
  tc: fix compiler warnings in ip6 pedit
  misc: fix compiler warning in ifstat and nstat
  f_u32: fix compiler gcc-10 compiler warning
  uapi: update devlink.h
  uapi: update devlink.h
  uapi: merge in change to bpf.h

Tuong Lien (2):
  tipc: add option to set master key for encryption
  tipc: add option to set rekeying for encryption

Wei Wang (1):
  iproute2: ss: add support to expose various inet sockopts

Re: [PATCH v3 4/6] drivers: hv: vmbus: Fix checkpatch SPLIT_STRING

2020-11-29 Thread Stephen Hemminger

On Sun, 29 Nov 2020 08:51:47 -0800
"Michael Kelley"  wrote:

> From: Matheus Castello  Sent: Tuesday, November
> 24, 2020 7:29 PM
> > 
> > Checkpatch emits WARNING: quoted string split across lines.
> > To keep the code clean and with the 80 column length indentation the
> > check and registration code for kmsg_dump_register has been transferred
> > to a new function hv_kmsg_dump_register.
> > 
> > Signed-off-by: Matheus Castello 
> > ---
> > This is the V3 of patch 4 of the series "Add improvements suggested by
> > checkpatch for vmbus_drv" with the changes suggested by Michael Kelley  
> and
> > Joe Perches. Thanks!
> > ---
> >  drivers/hv/vmbus_drv.c | 35 ---
> >  1 file changed, 20 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index 61d28c743263..edcc419ba328 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -1387,6 +1387,24 @@ static struct kmsg_dumper hv_kmsg_dumper = {
> > .dump = hv_kmsg_dump,
> >  };
> > 
> > +static void hv_kmsg_dump_register(void)
> > +{
> > +   int ret;
> > +
> > +   hv_panic_page = (void *)hv_alloc_hyperv_zeroed_page();

I know you just copy/pasted the original code, but one other thing.

Doesn't it already return void *?
arch/x86/include/asm/mshyperv.h:void *hv_alloc_hyperv_zeroed_page(void);

> > +   if (!hv_panic_page) {
> > +   pr_err("Hyper-V: panic message page memory allocation  
> failed\n");
> > +   return;
> > +   }
> > +
> > +   ret = kmsg_dump_register(_kmsg_dumper);
> > +   if (ret) {
> > +   pr_err("Hyper-V: kmsg dump register error 0x%x\n", ret);
> > +   hv_free_hyperv_page((unsigned long)hv_panic_page);
> > +   hv_panic_page = NULL;
> > +   }
> > +}
> > +
> >  static struct ctl_table_header *hv_ctl_table_hdr;
> > 
> >  /*
> > @@ -1477,21 +1495,8 @@ static int vmbus_bus_init(void)
> >  * capability is supported by the hypervisor.
> >  */
> > hv_get_crash_ctl(hyperv_crash_ctl);
> > -   if (hyperv_crash_ctl & HV_CRASH_CTL_CRASH_NOTIFY_MSG) {
> > -   hv_panic_page = (void  
> *)hv_alloc_hyperv_zeroed_page();
> > -   if (hv_panic_page) {
> > -   ret = kmsg_dump_register(_kmsg_dumper);
> > -   if (ret) {
> > -   pr_err("Hyper-V: kmsg dump  
> register "
> > -   "error 0x%x\n", ret);
> > -   hv_free_hyperv_page(
> > -   (unsigned long)hv_panic_page);
> > -   hv_panic_page = NULL;
> > -   }
> > -   } else
> > -   pr_err("Hyper-V: panic message page memory  
> "
> > -   "allocation failed");
> > -   }
> > +   if (hyperv_crash_ctl & HV_CRASH_CTL_CRASH_NOTIFY_MSG)
> > +   hv_kmsg_dump_register();
> > 
> > register_die_notifier(_die_block);
> > }
> > --
> > 2.29.2  
> 
> Reviewed-by: Michael Kelley 
>

Re: [PATCH v2] netem: fix zero division in tabledist

2020-10-28 Thread Stephen Hemminger

On Wed, 28 Oct 2020 17:07:31 +
Aleksandr Nogikh  wrote:

> From: Aleksandr Nogikh 
> 
> Currently it is possible to craft a special netlink RTM_NEWQDISC
> command that can result in jitter being equal to 0x8000. It is
> enough to set the 32 bit jitter to 0x0200 (it will later be
> multiplied by 2^6) or just set the 64 bit jitter via
> TCA_NETEM_JITTER64. This causes an overflow during the generation of
> uniformly distributed numbers in tabledist(), which in turn leads to
> division by zero (sigma != 0, but sigma * 2 is 0).
> 
> The related fragment of code needs 32-bit division - see commit
> 9b0ed89 ("netem: remove unnecessary 64 bit modulus"), so switching to
> 64 bit is not an option.
> 
> Fix the issue by keeping the value of jitter within the range that can
> be adequately handled by tabledist() - [0;INT_MAX]. As negative std
> deviation makes no sense, take the absolute value of the passed value
> and cap it at INT_MAX. Inside tabledist(), switch to unsigned 32 bit
> arithmetic in order to prevent overflows.
> 
> Signed-off-by: Aleksandr Nogikh 
> Reported-by: syzbot+ec762a6342ad0d3c0...@syzkaller.appspotmail.com

Acked-by: Stephen Hemminger

Re: [Bridge] [PATCH net-next v7 01/10] net: bridge: extend the process of special frames

2020-10-27 Thread Stephen Hemminger

On Tue, 27 Oct 2020 10:02:42 +
Henrik Bjoernlund via Bridge  wrote:

> +/* Return 0 if the frame was not processed otherwise 1
> + * note: already called with rcu_read_lock
> + */
> +static int br_process_frame_type(struct net_bridge_port *p,
> +  struct sk_buff *skb)
> +{
> + struct br_frame_type *tmp;
> +
> + hlist_for_each_entry_rcu(tmp, >br->frame_type_list, list)
> + if (unlikely(tmp->type == skb->protocol))
> + return tmp->frame_handler(p, skb);
> +
> + return 0;
> +}

Does the linear search of frame types have noticable impact on performance?
Hint: maybe a bitmap or something would be faster.

Re: [RFC PATCH iproute2] bridge: add support for L2 multicast groups

2020-10-22 Thread Stephen Hemminger

On Sat, 17 Oct 2020 21:45:26 +0300
Vladimir Oltean  wrote:

> Extend the 'bridge mdb' command for the following syntax:
> bridge mdb add dev br0 port swp0 grp 01:02:03:04:05:06 permanent
> 
> Signed-off-by: Vladimir Oltean 
> ---
>  bridge/mdb.c   | 54 ++
>  include/uapi/linux/if_bridge.h |  2 ++
>  2 files changed, 43 insertions(+), 13 deletions(-)
> 
> diff --git a/bridge/mdb.c b/bridge/mdb.c
> index 4cd7ca762b78..af160250928e 100644
> --- a/bridge/mdb.c
> +++ b/bridge/mdb.c
> @@ -149,6 +149,7 @@ static void print_mdb_entry(FILE *f, int ifindex, const 
> struct br_mdb_entry *e,
>   struct nlmsghdr *n, struct rtattr **tb)
>  {
>   const void *grp, *src;
> + const char *addr;
>   SPRINT_BUF(abuf);
>   const char *dev;
>   int af;
> @@ -156,9 +157,16 @@ static void print_mdb_entry(FILE *f, int ifindex, const 
> struct br_mdb_entry *e,
>   if (filter_vlan && e->vid != filter_vlan)
>   return;
>  
> - af = e->addr.proto == htons(ETH_P_IP) ? AF_INET : AF_INET6;
> - grp = af == AF_INET ? (const void *)>addr.u.ip4 :
> -   (const void *)>addr.u.ip6;
> + if (!e->addr.proto) {
> + af = AF_PACKET;
> + grp = (const void *)>addr.u.mac_addr;
> + } else if (e->addr.proto == htons(ETH_P_IP)) {
> + af = AF_INET;
> + grp = (const void *)>addr.u.ip4;
> + } else {
> + af = AF_INET6;
> + grp = (const void *)>addr.u.ip6;
> + }
>   dev = ll_index_to_name(ifindex);
>  

In C casts of pointer to void are not necessary.

Re: [PATCH net-next] drivers/net/wan/hdlc_fr: Reduce indentation in pvc_xmit

2020-10-03 Thread Stephen Hemminger

On Sat,  3 Oct 2020 10:35:28 -0700
Xie He  wrote:

> + if (dev->type == ARPHRD_ETHER) {
> + int pad = ETH_ZLEN - skb->len;
> +
> + if (pad > 0) { /* Pad the frame with zeros */
> + int len = skb->len;
> +
> + if (skb_tailroom(skb) < pad)
> + if (pskb_expand_head(skb, 0, pad, GFP_ATOMIC))
> + goto drop;
> + skb_put(skb, pad);
> + memset(skb->data + len, 0, pad);
>   }
>   }

This code snippet is basically an version of skb_pad().
Probably it is very old and pre-dates that.
Could the code use skb_pad?

Re: [PATCH net-next v2 6/6] bonding: make Kconfig toggle to disable legacy interfaces

2020-10-02 Thread Stephen Hemminger

On Fri, 2 Oct 2020 16:23:46 -0400
Jarod Wilson  wrote:

> On Fri, Oct 2, 2020 at 3:13 PM Stephen Hemminger
>  wrote:
> >
> > On Fri,  2 Oct 2020 13:40:01 -0400
> > Jarod Wilson  wrote:
> >  
> > > By default, enable retaining all user-facing API that includes the use of
> > > master and slave, but add a Kconfig knob that allows those that wish to
> > > remove it entirely do so in one shot.
> > >
> > > Cc: Jay Vosburgh 
> > > Cc: Veaceslav Falico 
> > > Cc: Andy Gospodarek 
> > > Cc: "David S. Miller" 
> > > Cc: Jakub Kicinski 
> > > Cc: Thomas Davis 
> > > Cc: net...@vger.kernel.org
> > > Signed-off-by: Jarod Wilson 
> > > ---
> > >  drivers/net/Kconfig   | 12 
> > >  drivers/net/bonding/bond_main.c   |  4 ++--
> > >  drivers/net/bonding/bond_options.c|  4 ++--
> > >  drivers/net/bonding/bond_procfs.c |  8 
> > >  drivers/net/bonding/bond_sysfs.c  | 14 ++
> > >  drivers/net/bonding/bond_sysfs_port.c |  6 --
> > >  6 files changed, 38 insertions(+), 10 deletions(-)
> > >  
> >
> > This is problematic. You are printing both old and new values.
> > Also every distribution will have to enable it.
> >
> > This looks like too much of change to users.  
> 
> I'd had a bit of feedback that people would rather see both, and be
> able to toggle off the old ones, rather than only having one or the
> other, depending on the toggle, so I thought I'd give this a try. I
> kind of liked the one or the other route, but I see the problems with
> that too.
> 
> For simplicity, I'm kind of liking the idea of just not updating the
> proc and sysfs interfaces, have a toggle entirely disable them, and
> work on enhancing userspace to only use netlink, but ... it's going to
> be a while before any such work makes its way to any already shipping
> distros. I don't have a satisfying answer here.
> 

I like the idea of having bonding proc and sysf apis optional.

Re: [PATCH net-next v2 6/6] bonding: make Kconfig toggle to disable legacy interfaces

2020-10-02 Thread Stephen Hemminger

On Fri,  2 Oct 2020 13:40:01 -0400
Jarod Wilson  wrote:

> By default, enable retaining all user-facing API that includes the use of
> master and slave, but add a Kconfig knob that allows those that wish to
> remove it entirely do so in one shot.
> 
> Cc: Jay Vosburgh 
> Cc: Veaceslav Falico 
> Cc: Andy Gospodarek 
> Cc: "David S. Miller" 
> Cc: Jakub Kicinski 
> Cc: Thomas Davis 
> Cc: net...@vger.kernel.org
> Signed-off-by: Jarod Wilson 
> ---
>  drivers/net/Kconfig   | 12 
>  drivers/net/bonding/bond_main.c   |  4 ++--
>  drivers/net/bonding/bond_options.c|  4 ++--
>  drivers/net/bonding/bond_procfs.c |  8 
>  drivers/net/bonding/bond_sysfs.c  | 14 ++
>  drivers/net/bonding/bond_sysfs_port.c |  6 --
>  6 files changed, 38 insertions(+), 10 deletions(-)
> 

This is problematic. You are printing both old and new values.
Also every distribution will have to enable it.

This looks like too much of change to users.

Re: [PATCH net-next v2 5/6] bonding: update Documentation for port/bond terminology

2020-10-02 Thread Stephen Hemminger

On Fri,  2 Oct 2020 13:40:00 -0400
Jarod Wilson  wrote:

> @@ -265,7 +265,7 @@ ad_user_port_key
>   This parameter has effect only in 802.3ad mode and is available through
>   SysFs interface.
>  
> -all_slaves_active
> +all_ports_active

You can change internal variable names, comments, and documentation all you 
want, thats great.

But you can't change user API, that includes:
   * definitions in uapi header
   * module parameters
   * sysfs file names or outputs

Re: [PATCH net-next 4/5] bonding: make Kconfig toggle to disable legacy interfaces

2020-09-22 Thread Stephen Hemminger

On Tue, 22 Sep 2020 16:47:07 -0700
Jay Vosburgh  wrote:

> Stephen Hemminger  wrote:
> 
> >On Tue, 22 Sep 2020 09:37:30 -0400
> >Jarod Wilson  wrote:
> >  
> >> By default, enable retaining all user-facing API that includes the use of
> >> master and slave, but add a Kconfig knob that allows those that wish to
> >> remove it entirely do so in one shot.
> >> 
> >> Cc: Jay Vosburgh 
> >> Cc: Veaceslav Falico 
> >> Cc: Andy Gospodarek 
> >> Cc: "David S. Miller" 
> >> Cc: Jakub Kicinski 
> >> Cc: Thomas Davis 
> >> Cc: net...@vger.kernel.org
> >> Signed-off-by: Jarod Wilson   
> >
> >Why not just have a config option to remove all the /proc and sysfs options
> >in bonding (and bridging) and only use netlink? New tools should be only able
> >to use netlink only.  
> 
>   I agree that new tooling should be netlink, but what value is
> provided by such an option that distros are unlikely to enable, and
> enabling will break the UAPI?
> 
> >Then you might convince maintainers to update documentation as well.
> >Last I checked there were still references to ifenslave.  
> 
>   Distros still include ifenslave, but it's now a shell script
> that uses sysfs.  I see it used in scripts from time to time.

Some bleeding edge distros have already dropped ifenslave and even ifconfig.
The Enterprise ones never will.

The one motivation would be for the embedded folks which are always looking
to trim out the fat. Although not sure if the minimal versions of commands
in busybox are pure netlink yet.

Re: [PATCH net-next 4/5] bonding: make Kconfig toggle to disable legacy interfaces

2020-09-22 Thread Stephen Hemminger

On Tue, 22 Sep 2020 09:37:30 -0400
Jarod Wilson  wrote:

> By default, enable retaining all user-facing API that includes the use of
> master and slave, but add a Kconfig knob that allows those that wish to
> remove it entirely do so in one shot.
> 
> Cc: Jay Vosburgh 
> Cc: Veaceslav Falico 
> Cc: Andy Gospodarek 
> Cc: "David S. Miller" 
> Cc: Jakub Kicinski 
> Cc: Thomas Davis 
> Cc: net...@vger.kernel.org
> Signed-off-by: Jarod Wilson 

Why not just have a config option to remove all the /proc and sysfs options
in bonding (and bridging) and only use netlink? New tools should be only able
to use netlink only.

Then you might convince maintainers to update documentation as well.
Last I checked there were still references to ifenslave.

Re: [PATCH v3] net: use exponential backoff in netdev_wait_allrefs

2020-09-17 Thread Stephen Hemminger

On Thu, 17 Sep 2020 16:49:53 -0700
frugg...@arista.com (Francesco Ruggeri) wrote:

> The combination of aca_free_rcu, introduced in commit 2384d02520ff
> ("net/ipv6: Add anycast addresses to a global hashtable"), and
> fib6_info_destroy_rcu, introduced in commit 9b0a8da8c4c6 ("net/ipv6:
> respect rcu grace period before freeing fib6_info"), can result in
> an extra rcu grace period being needed when deleting an interface,
> with the result that netdev_wait_allrefs ends up hitting the msleep(250),
> which is considerably longer than the required grace period.
> This can result in long delays when deleting a large number of interfaces,
> and it can be observed with this script:
> 

Is there anyway to make RCU trigger faster?

Re: [PATCH RFC 0/7] net: bridge: cfm: Add support for Connectivity Fault Management(CFM)

2020-09-04 Thread Stephen Hemminger

On Fri, 4 Sep 2020 09:15:20 +
Henrik Bjoernlund  wrote:

> Connectivity Fault Management (CFM) is defined in 802.1Q section 12.14.
> 
> Connectivity Fault Management (CFM) comprises capabilities for
> detecting, verifying, and isolating connectivity failures in
> Virtual Bridged Networks. These capabilities can be used in
> networks operated by multiple independent organizations, each
> with restricted management access to each other’s equipment.
> 
> CFM functions are partitioned as follows:
> — Path discovery
> — Fault detection
> — Fault verification and isolation
> — Fault notification
> — Fault recovery
> 
> The primary CFM protocol shims are called Maintenance Points (MPs).
> A MP can be either a MEP or a MHF.
> The MEP:
> -It is the Maintenance association End Point
>  described in 802.1Q section 19.2.
> -It is created on a specific level (1-7) and is assuring
>  that no CFM frames are passing through this MEP on lower levels.
> -It initiates and terminates/validates CFM frames on its level.
> -It can only exist on a port that is related to a bridge.
> The MHF:
> -It is the Maintenance Domain Intermediate Point
>  (MIP) Half Function (MHF) described in 802.1Q section 19.3.
> -It is created on a specific level (1-7).
> -It is extracting/injecting certain CFM frame on this level.
> -It can only exist on a port that is related to a bridge.
> -Currently not supported.
> 
> There are defined the following CFM protocol functions:
> -Continuity Check
> -Loopback. Currently not supported.
> -Linktrace. Currently not supported.
> 
> This CFM component supports create/delete of MEP instances and
> configuration of the different CFM protocols. Also status information
> can be fetched and delivered through notification due to defect status
> change.
> 
> The user interacts with CFM using the 'cfm' user space client program, the
> client talks with the kernel using netlink. The kernel will try to offload
> the requests to the HW via switchdev API (not implemented yet).
> 
> Any notification emitted by CFM from the kernel can be monitored in user
> space by starting 'cfm_server' program.
> 
> Currently this 'cfm' and 'cfm_server' programs are standalone placed in a
> cfm repository https://github.com/microchip-ung/cfm but it is considered
> to integrate this into 'iproute2'.
> 
> Reviewed-by: Horatiu Vultur  
> Signed-off-by: Henrik Bjoernlund  

Could this be done in userspace? It is a control plane protocol.
Could it be done by using eBPF?

Adding more code in bridge impacts a large number of users of Linux distros.
It creates bloat and potential security vulnerabilities.

Re: [PATCH v3] net: Use standardized (IANA) local port range

2020-08-28 Thread Stephen Hemminger

On Fri, 28 Aug 2020 22:44:47 +0200
Bart Groeneveld  wrote:

> IANA specifies User ports as 1024-49151,
> and Private ports (local/ephemeral/dynamic/w/e) as 49152-65535 [1].
> 
> This means Linux uses 32768-49151 'illegally'.
> This is not just a matter of following specifications:
> IANA actually assigns numbers in this range [1].
> 
> I understand that Linux uses 61000-65535 for masquarading/NAT [2],
> so I left the high value at 60999.
> This means the high value still does not follow the specification,
> but it also doesn't conflict with it.
> 
> This change will effectively halve the available ephemeral ports,
> increasing the risk of port exhaustion. But:
> a) I don't think that warrants ignoring standards.
>   Consider for example setting up a (corporate) firewall blocking
>   all unknown external services.
>   It will only allow outgoing trafiic at port 80,443 and 49152-65535.
>   A Linux computer behind such a firewall will not be able to connect
>   to *any* external service *half of the time*.
>   Of course, the firewall can be adjusted to also allow 32768-49151,
>   but that allows computers to use some services against the policy.
> b) It is only an issue with more than 11848 *outgoing* connections.
>   I think that is a niche case (I know, citation needed, but still).
>   If someone finds themselves in such a niche case,
>   they can still modify ip_local_port_range.
> 
> This patch keeps the low and high value at different parity,
> as to optimize port assignment [3].
> 
> [1]: 
> https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt
> [2]: https://marc.info/?l=linux-kernel=117900026927289
> [3]: See for example commit 1580ab63fc9a03593072cc5656167a75c4f1d173 
> ("tcp/dccp: better use of ephemeral ports in connect()")
> 
> Signed-off-by: Bart Groeneveld 

Changing the default range impacts existing users. Since Linux has been doing
this for so long, I don't think just because a standards body decided to reserve
some space is sufficient justification to do this.

Re: [PATCH iproute2 v5 0/2] iplink: hsr: add support for creating PRP device

2020-08-22 Thread Stephen Hemminger

On Mon, 17 Aug 2020 17:17:35 -0400
Murali Karicheri  wrote:

> This series enhances the iproute2 iplink module to add support
> for creating PRP device similar to HSR. The kernel part of this
> is already merged to v5.9 master
> 
> v4 - addressed comment from Stephen Hemminger
>- Sending this with a iproute2 prefix so that this can
>  be merged to v5.9 iprout2 if possible.
> v3 of the series is rebased to iproute2-next/master at
> git://git.kernel.org/pub/scm/network/iproute2/iproute2-next
> and send as v4.
> 
> Please apply this if looks good.
> 
> 
> Murali Karicheri (2):
>   iplink: hsr: add support for creating PRP device similar to HSR
>   ip: iplink: prp: update man page for new parameter
> 
>  ip/iplink_hsr.c   | 17 +++--
>  man/man8/ip-link.8.in |  9 -
>  2 files changed, 23 insertions(+), 3 deletions(-)

Applied. Than
ks for updating

Re: [net-next iproute2 PATCH v4 1/2] iplink: hsr: add support for creating PRP device similar to HSR

2020-08-16 Thread Stephen Hemminger

On Thu, 6 Aug 2020 16:37:11 -0400
Murali Karicheri  wrote:

> + 

> + print_int(PRINT_ANY,
> +   "proto",
> +   "proto %d ",
> +   rta_getattr_u8(tb[IFLA_HSR_PROTOCOL]));

Since this unsigned value, you probably want to use print_uint, or print_hhu.
Also please put as many arguments on one line that will fit in 80 (to 90) 
characters.

if (tb[IFLA_HSR_PROTOCOL])
print_hhu(PRINT_ANY, "proto", "proto %hhu ", 
  rta_getattr_u8(tb[IFLA_HSR_PROTOCOL]));

[ANNOUNCE] iproute2 5.8

2020-08-07 Thread Stephen Hemminger

 ip fou: respect preferred_family for IPv6

Stephen Hemminger (11):
  uapi: update headers
  devlink: update include files
  uapi: update to magic.h
  man/tc: remove obsolete reference to ipchains
  uapi: update bpf.h
  iplink_bareudp: use common include syntax
  rtacct: drop unused header
  genl: use <> for system includes
  uapi: update bpf.h
  replace SNAPSHOT with auto-generated version string
  lnstat: use same version as iproute2

Tony Ambardar (1):
  configure: support ipset version 7 with kernel version 5

Tuong Lien (1):
  tipc: enable printing of broadcast rcv link stats

William Tu (1):
  erspan: Add type I version 0 support.

Xin Long (7):
  iproute_lwtunnel: add options support for geneve metadata
  iproute_lwtunnel: add options support for vxlan metadata
  iproute_lwtunnel: add options support for erspan metadata
  tc: m_tunnel_key: add options support for vxlan
  tc: m_tunnel_key: add options support for erpsan
  tc: f_flower: add options support for vxlan
  tc: f_flower: add options support for erspan

RE: [PATCH] hv_utils: Add validation for untrusted Hyper-V values

2020-07-29 Thread Stephen Hemminger

Ok at least use the ratelimit form of kernel logging.

Netdev_err_ratelimited...

-Original Message-
From: Andres Beltran  
Sent: Wednesday, July 29, 2020 9:10 AM
To: Stephen Hemminger 
Cc: KY Srinivasan ; Haiyang Zhang ; 
Wei Liu ; linux-hyp...@vger.kernel.org; 
linux-kernel@vger.kernel.org; Michael Kelley ; Andrea 
Parri ; Saruhan Karademir 
Subject: Re: [PATCH] hv_utils: Add validation for untrusted Hyper-V values

On Tue, Jul 28, 2020 at 5:04 PM Stephen Hemminger
 wrote:
>
> You may want to use one of the macros that prints this once only.
> This is a "should never happen" type error, so if something goes wrong it 
> might happens so much that journal/syslog would get overloaded.

Certainly, printing error messages once would be ideal if we were only
dealing with Linux kernel bugs. But under the assumption that Hyper-V
can send bogus values at any time, I think it would be better to print
error messages every time so that we are aware of malicious/erroneous
data sent by the host.

Re: [PATCH] Drivers: hv: vmbus: Fix variable assignments in hv_ringbuffer_read()

2020-07-24 Thread Stephen Hemminger

On Fri, 24 Jul 2020 09:46:06 -0700
"Andres Beltran"  wrote:

> Assignments to buffer_actual_len and requestid happen before packetlen
> is checked to be within buflen. If this condition is true,
> hv_ringbuffer_read() returns with these variables already set to some
> value even though no data is actually read. This might create
> inconsistencies in any routine calling hv_ringbuffer_read(). Assign values
> to such pointers after the packetlen check.
> 
> Signed-off-by: Andres Beltran 
> ---
>  drivers/hv/ring_buffer.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
> index 356e22159e83..e277ce7372a4 100644
> --- a/drivers/hv/ring_buffer.c
> +++ b/drivers/hv/ring_buffer.c
> @@ -350,12 +350,13 @@ int hv_ringbuffer_read(struct vmbus_channel
> *channel,
>  
>   offset = raw ? 0 : (desc->offset8 << 3);
>   packetlen = (desc->len8 << 3) - offset;
> - *buffer_actual_len = packetlen;
> - *requestid = desc->trans_id;
>  
>   if (unlikely(packetlen > buflen))
>   return -ENOBUFS;
>  
> + *buffer_actual_len = packetlen;
> + *requestid = desc->trans_id;
> +
>   /* since ring is double mapped, only one copy is necessary */
>   memcpy(buffer, (const char *)desc + offset, packetlen);
>  

What is the rationale for this change, it may break other code.

A common API model in Windows world where this originated
is to have a call where caller first
makes request and then if the requested buffer is not big enough the
caller look at the actual length and allocate a bigger buffer.

Did you audit all the users of this API to make sure they aren't doing that.

Re: [PATCH v2] net-sysfs: add a newline when printing 'tx_timeout' by sysfs

2020-07-22 Thread Stephen Hemminger

On Wed, 22 Jul 2020 13:23:11 -0700 (PDT)
David Miller  wrote:

> From: Stephen Hemminger 
> Date: Wed, 22 Jul 2020 08:27:41 -0700
> 
> > On Tue, 21 Jul 2020 15:36:32 -0700 (PDT)
> > David Miller  wrote:
> >   
> >> From: Xiongfeng Wang 
> >> Date: Tue, 21 Jul 2020 15:02:57 +0800
> >>   
> >> > When I cat 'tx_timeout' by sysfs, it displays as follows. It's better to
> >> > add a newline for easy reading.
> >> > 
> >> > root@syzkaller:~# cat /sys/devices/virtual/net/lo/queues/tx-0/tx_timeout
> >> > 0root@syzkaller:~#
> >> > 
> >> > Signed-off-by: Xiongfeng Wang 
> >> 
> >> Applied, thank you.  
> > 
> > Could you add  
> 
> Stephen, of all people you should know by now that all of my commits
> are %100 immutable.  So commit log changes cannot be made after I've
> applied the patch, ever.

Will you send it to stable tree?
It could be added then.

Re: [PATCH v5 0/3] Drivers: hv: vmbus: vmbus_requestor data structure for VMBus hardening

2020-07-22 Thread Stephen Hemminger

On Wed, 22 Jul 2020 11:10:48 -0700
"Andres Beltran"  wrote:

> Currently, VMbus drivers use pointers into guest memory as request IDs
> for interactions with Hyper-V. To be more robust in the face of errors
> or malicious behavior from a compromised Hyper-V, avoid exposing
> guest memory addresses to Hyper-V. Also avoid Hyper-V giving back a
> bad request ID that is then treated as the address of a guest data
> structure with no validation. Instead, encapsulate these memory
> addresses and provide small integers as request IDs.
> 
> The first patch creates the definitions for the data structure, provides
> helper methods to generate new IDs and retrieve data, and
> allocates/frees the memory needed for vmbus_requestor.
> 
> The second and third patches make use of vmbus_requestor to send request
> IDs to Hyper-V in storvsc and netvsc respectively.
> 
> Thanks.
> Andres Beltran
> 
> Cc: James E.J. Bottomley 
> Cc: Martin K. Petersen 
> Cc: David S. Miller 
> 
> Andres Beltran (3):
>   Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus
> hardening
>   scsi: storvsc: Use vmbus_requestor to generate transaction IDs for
> VMBus hardening
>   hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus
> hardening
> 
>  drivers/hv/channel.c  | 175 ++
>  drivers/net/hyperv/hyperv_net.h   |  13 +++
>  drivers/net/hyperv/netvsc.c   |  79 +++---
>  drivers/net/hyperv/rndis_filter.c |   1 +
>  drivers/scsi/storvsc_drv.c|  85 +--
>  include/linux/hyperv.h|  22 
>  6 files changed, 350 insertions(+), 25 deletions(-)
> 


What is the performance impact of this?
It means keeping a global (bookkeeping) structure which should have
noticeable impact on mult-queue performance.

Re: [PATCH v2] net-sysfs: add a newline when printing 'tx_timeout' by sysfs

2020-07-22 Thread Stephen Hemminger

On Tue, 21 Jul 2020 15:36:32 -0700 (PDT)
David Miller  wrote:

> From: Xiongfeng Wang 
> Date: Tue, 21 Jul 2020 15:02:57 +0800
> 
> > When I cat 'tx_timeout' by sysfs, it displays as follows. It's better to
> > add a newline for easy reading.
> > 
> > root@syzkaller:~# cat /sys/devices/virtual/net/lo/queues/tx-0/tx_timeout
> > 0root@syzkaller:~#
> > 
> > Signed-off-by: Xiongfeng Wang   
> 
> Applied, thank you.

Could you add


Fixes: ccf5ff69fbbd ("net: new counter for tx_timeout errors in sysfs")
Cc: david.decoti...@google.com

Re: [PATCH v4] hv_netvsc: add support for vlans in AF_PACKET mode

2020-07-21 Thread Stephen Hemminger

On Tue, 21 Jul 2020 12:44:03 +0530
Sriram Krishnan  wrote:

> + /* When using AF_PACKET we need to drop VLAN header from
> +  * the frame and update the SKB to allow the HOST OS
> +  * to transmit the 802.1Q packet
> +  */
> + if (skb->protocol == htons(ETH_P_8021Q)) {
> + u16 vlan_tci = 0;
Unnecessary initialization.

> + skb_reset_mac_header(skb);
> + if (eth_type_vlan(eth_hdr(skb)->h_proto)) {
> + int pop_err;
> + pop_err = __skb_vlan_pop(skb, _tci);
> + if (likely(pop_err == 0)) {
> + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), 
> vlan_tci);
> + /* Update the NDIS header pkt lengths */
> + packet->total_data_buflen -= VLAN_HLEN;
> + rndis_msg->msg_len = packet->total_data_buflen;
> + rndis_msg->msg.pkt.data_len = 
> packet->total_data_buflen;
> + } else {
> + netdev_err(net, "Pop vlan err %x\n", pop_err);
> + goto drop;
> + }
> + }
> + }

Printing error messages is good for debugging but bad IRL.
Users ignore it, or it overflows the log buffer.

A better alternative would be to add a counter to netvsc_ethtool_stats.
Something like:

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index abda736e7c7d..2181d4538ab7 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -897,6 +897,7 @@ struct netvsc_ethtool_stats {
unsigned long rx_no_memory;
unsigned long stop_queue;
unsigned long wake_queue;
+   unsigned long vlan_error;
 };
 
 struct netvsc_ethtool_pcpu_stats {
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 6267f706e8ee..89568276e653 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -605,6 +605,28 @@ static int netvsc_xmit(struct sk_buff *skb, struct 
net_device *net, bool xdp_tx)
*hash_info = hash;
}
 
+   /* When using AF_PACKET we need to drop VLAN header from
+* the frame and update the SKB to allow the HOST OS
+* to transmit the 802.1Q packet
+*/
+   if (skb->protocol == htons(ETH_P_8021Q)) {
+   skb_reset_mac_header(skb);
+   if (eth_type_vlan(eth_hdr(skb)->h_proto)) {
+   u16 vlan_tci;
+
+   if (unlikely(__skb_vlan_pop(skb, _tci) != 0)) {
+   ++net_device_ctx->eth_stats.vlan_error;
+   goto drop;
+   }
+
+   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), 
vlan_tci);
+   /* Update the NDIS header pkt lengths */
+   packet->total_data_buflen -= VLAN_HLEN;
+   rndis_msg->msg_len = packet->total_data_buflen;
+   rndis_msg->msg.pkt.data_len = packet->total_data_buflen;
+   }
+   }
+
if (skb_vlan_tag_present(skb)) {
struct ndis_pkt_8021q_info *vlan;
 
@@ -1427,6 +1449,7 @@ static const struct {
{ "rx_no_memory", offsetof(struct netvsc_ethtool_stats, rx_no_memory) },
{ "stop_queue", offsetof(struct netvsc_ethtool_stats, stop_queue) },
{ "wake_queue", offsetof(struct netvsc_ethtool_stats, wake_queue) },
+   { "vlan_error", offsetof(struct netvsc_ethtool_stats, vlan_error) },
 }, pcpu_stats[] = {
{ "cpu%u_rx_packets",
offsetof(struct netvsc_ethtool_pcpu_stats, rx_packets) },

Re: [PATCH v5 net-next] net: hyperv: Add attributes to show TX indirection table

2020-07-20 Thread Stephen Hemminger

On Tue, 21 Jul 2020 03:50:00 +
Chi Song  wrote:

> +static void netvsc_attrs_init(void)
> +{
> + char buffer[4];
> + int i;
> +
> + for (i = 0; i < VRSS_SEND_TAB_SIZE; i++) {
> + sprintf(buffer, "%02u", i);
> + dev_attr_netvsc_dev_attrs[i].attr.name =
> + kstrdup(buffer, GFP_KERNEL);
> + dev_attr_netvsc_dev_attrs[i].attr.mode = 0444;
> + sysfs_attr_init(_attr_netvsc_dev_attrs[i].attr);
> +
> + dev_attr_netvsc_dev_attrs[i].show = tx_indirection_show;
> + dev_attr_netvsc_dev_attrs[i].store = NULL;
> + netvsc_dev_attrs[i] = _attr_netvsc_dev_attrs[i].attr;
> + }
> + netvsc_dev_attrs[VRSS_SEND_TAB_SIZE] = NULL;
You know that last line is unnecessary. The variable is static and 
starts out as all zero.

Overall looks good.

Acked-by: Stephen Hemminger

Re: [PATCH v2] AF_PACKET doesnt strip VLAN information

2020-07-20 Thread Stephen Hemminger

On Mon, 20 Jul 2020 17:22:49 -0400
Willem de Bruijn  wrote:

> On Mon, Jul 20, 2020 at 4:57 PM Stephen Hemminger
>  wrote:
> >
> > On Mon, 20 Jul 2020 09:52:27 -0400
> > Willem de Bruijn  wrote:
> >  
> > > On Mon, Jul 20, 2020 at 12:27 AM Sriram Krishnan (srirakr2)
> > >  wrote:  
> > > >
> > > > +Stephen Hemminger
> > > >
> > > > Hi Willem,
> > > > Thanks for looking into the code, I understand that this is more of a 
> > > > generic problem wherein many of the filtering functions assume the vlan 
> > > > tag to be in the skb rather than in the packet. Hence we moved the fix 
> > > > from the driver to the common AF packet that our solution uses.
> > > >
> > > > I recall from the v1 of the patch you had mentioned other common areas 
> > > > where this fix might be relevant (such as tap/tun), but I'm afraid I 
> > > > cant comprehensively test those patches out. Please let me know your 
> > > > thoughts  
> > >
> > > Please use plain text to respond. HTML replies do not reach the list.
> > >
> > > Can you be more precise in which other code besides the hyper-v driver
> > > is affected? Do you have an example?
> > >
> > > This is a resubmit of the original patch. My previous
> > > questions/concerns remain valid:
> > >
> > > - if the function can now fail, all callers must be updated to detect
> > > and handle that
> > >
> > > - any solution should probably address all inputs into the tx path:
> > > packet sockets, tuntap, virtio-net
> > >
> > > - this only addresses packet sockets with ETH_P_ALL/ETH_P_NONE. Not
> > > sockets that set ETH_P_8021Q
> > >
> > > - which code in the transmit stack requires the tag to be in the skb,
> > > and does this problem after this patch still persist for Q-in-Q?  
> >
> > It matters because the problem is generic, not just to the netvsc driver.
> > For example, BPF programs and netfilter rules will see different packets
> > when send is through AF_PACKET than they would see for sends from the
> > kernel stack.
> >
> > Presenting uniform data to the lower layers makes sense.  
> 
> Are all forwarded and locally generated packets guaranteed to always
> have VLAN information in the tag (so that this is indeed only an issue
> with input from userspace, through tuntap, virtio and packet sockets)?
> 
> I guess the first might be assured due to this in __netif_receive_skb_core:
> 
> if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
> skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
> skb = skb_vlan_untag(skb);
> if (unlikely(!skb))
> goto out;
> }
> 
> and the second by this in vlan_dev_hard_start_xmit:
> 
> if (veth->h_vlan_proto != vlan->vlan_proto ||
> vlan->flags & VLAN_FLAG_REORDER_HDR) {
> u16 vlan_tci;
> vlan_tci = vlan->vlan_id;
> vlan_tci |= vlan_dev_get_egress_qos_mask(dev, skb->priority);
> __vlan_hwaccel_put_tag(skb, vlan->vlan_proto, vlan_tci);
> }
> 
> But I don't know this code very well, so that is based on a very
> cursory glance only. Might well be missing other paths. (update: I
> think pktgen is another example.)
> 
> Netfilter and BPF still need to handle tags in the packet for Q-in-Q,
> right? So does this actually simplify their logic?
> 
> If the above holds and Q-in-Q is not a problem, then doing the same on
> ingress from userspace may make sense. I don't know the kind of BPF
> or netfilter programs what would be affected, and how.
> 
> Then it would be good to all those inputs at once to really plug the hole.
> See also virtio_net_hdr_to_skb for another example of code that
> applies to all of tuntap, virtio, pf_packet and uml.


Older versions of Linux used to handle outer VLAN differentl
based on what the driver supported. It was a mess.
Some drivers and code paths would strip and put in meta-data, some
would leave it in skb data. But in recent (like 5 yrs), the kernel
has tried to be more uniform and only have vlan as skb tag.
It looks like AF_PACKET was overlooked at that time.

Re: [PATCH v2] AF_PACKET doesnt strip VLAN information

2020-07-20 Thread Stephen Hemminger

On Mon, 20 Jul 2020 09:52:27 -0400
Willem de Bruijn  wrote:

> On Mon, Jul 20, 2020 at 12:27 AM Sriram Krishnan (srirakr2)
>  wrote:
> >
> > +Stephen Hemminger
> >
> > Hi Willem,
> > Thanks for looking into the code, I understand that this is more of a 
> > generic problem wherein many of the filtering functions assume the vlan tag 
> > to be in the skb rather than in the packet. Hence we moved the fix from the 
> > driver to the common AF packet that our solution uses.
> >
> > I recall from the v1 of the patch you had mentioned other common areas 
> > where this fix might be relevant (such as tap/tun), but I'm afraid I cant 
> > comprehensively test those patches out. Please let me know your thoughts  
> 
> Please use plain text to respond. HTML replies do not reach the list.
> 
> Can you be more precise in which other code besides the hyper-v driver
> is affected? Do you have an example?
> 
> This is a resubmit of the original patch. My previous
> questions/concerns remain valid:
> 
> - if the function can now fail, all callers must be updated to detect
> and handle that
> 
> - any solution should probably address all inputs into the tx path:
> packet sockets, tuntap, virtio-net
> 
> - this only addresses packet sockets with ETH_P_ALL/ETH_P_NONE. Not
> sockets that set ETH_P_8021Q
> 
> - which code in the transmit stack requires the tag to be in the skb,
> and does this problem after this patch still persist for Q-in-Q?

It matters because the problem is generic, not just to the netvsc driver.
For example, BPF programs and netfilter rules will see different packets
when send is through AF_PACKET than they would see for sends from the
kernel stack.

Presenting uniform data to the lower layers makes sense.

Re: [PATCH v3] net: hyperv: add support for vlans in netvsc driver

2020-07-20 Thread Stephen Hemminger

On Mon, 20 Jul 2020 22:15:51 +0530
Sriram Krishnan  wrote:

>  
> + /* When using AF_PACKET we need to remove VLAN from frame
> +  * and indicate VLAN information in SKB so HOST OS will
> +  * transmit the VLAN frame
> +  */
> + if (skb->protocol == htons(ETH_P_8021Q)) {
> + u16 vlan_tci = 0;
> + skb_reset_mac_header(skb);
> + if (eth_type_vlan(eth_hdr(skb)->h_proto)) {
> + int pop_err;
> + pop_err = __skb_vlan_pop(skb, _tci);
> + if (likely(pop_err == 0)) {
> + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), 
> vlan_tci);
> +
> + /* Update the NDIS header pkt lengths */
> + packet->total_data_buflen -= VLAN_HLEN;
> + rndis_msg->msg_len = packet->total_data_buflen;
> + rndis_msg->msg.pkt.data_len = 
> packet->total_data_buflen;
> +
> + } else {
> + netdev_err(net,"Pop vlan err %x\n",pop_err);
> + }
> + }
> + }

Minor comments.

1. Blank line between declaration and code.
2. Error handling is different than other parts of this code.
   probably just need a goto drop on error.

It seems like you are putting into message, then driver is putting
it into meta-data in next code block. Maybe it should be combined?

Re: [PATCH v2 net-next] net: hyperv: Add attributes to show TX indirection table

2020-07-20 Thread Stephen Hemminger

On Mon, 20 Jul 2020 00:12:10 -0700 (PDT)
Chi Song  wrote:

> An imbalanced TX indirection table causes netvsc to have low
> performance. This table is created and managed during runtime. To help
> better diagnose performance issues caused by imbalanced tables, add
> device attributes to show the content of TX indirection tables.
> 
> Signed-off-by: Chi Song 
> ---


> v2: remove RX as it's in ethtool already, show single value in each file,
>  and update description.
> 
> Thank you for comments. Let me know, if I miss something.
> 
> ---
>  drivers/net/hyperv/netvsc_drv.c | 53 +
>  1 file changed, 53 insertions(+)
> 
> diff --git a/drivers/net/hyperv/netvsc_drv.c
> b/drivers/net/hyperv/netvsc_drv.c
> index 6267f706e8ee..222c2fad9300 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -2370,6 +2370,55 @@ static int netvsc_unregister_vf(struct net_device
> *vf_netdev)
>   return NOTIFY_OK;
>  }
> 
> +static struct device_attribute
> dev_attr_netvsc_dev_attrs[VRSS_SEND_TAB_SIZE];
> +static struct attribute *netvsc_dev_attrs[VRSS_SEND_TAB_SIZE + 1];
> +
> +const struct attribute_group netvsc_dev_group = {
> + .name = NULL,
> + .attrs = netvsc_dev_attrs,
> +};
> +
> +static ssize_t tx_indirection_table_show(struct device *dev,
> +  struct device_attribute
> *dev_attr,
> +  char *buf)
> +{
> + struct net_device *ndev = to_net_dev(dev);
> + struct net_device_context *ndc = netdev_priv(ndev);
> + ssize_t offset = 0;

useless initialization

> + int index = dev_attr - dev_attr_netvsc_dev_attrs;
> +
> + offset = sprintf(buf, "%u\n", ndc->tx_table[index]);
> +
> + return offset;
why not just
return sprintf(buf, "%u\n", ndc->tx_table[index]);
> +}
> +
> +static void netvsc_attrs_init(void)
> +{
> + int i;
> + char buffer[32];
> +
> + for (i = 0; i < VRSS_SEND_TAB_SIZE; i++) {
> + sprintf(buffer, "tx_indirection_table_%02u", i);

Although this has one value per file it leads to a mess.
Why not put it in a separate directory (/sys/class/net/eth0/tx_indirection/N)?

Re: net: decnet: TODO Items

2020-07-19 Thread Stephen Hemminger

On Fri, 17 Jul 2020 11:48:16 +0530
Suraj Upadhyay  wrote:

> Hi Maintainers and Developers,
>   I am interested in the DECnet TODO list.
> I just need a quick response whether they are worth doing or not
> for the amount of development happening in this subsystem is extremely
> low and I can't help but question whether I should indulge in any of
> the listed works or not.
> 
> Thanks,
> 
> Suraj Upadhyay.
> 

The was a push to move decnet into staging and kill it.
But last time there were still some users.


pgpmxgFtOKLsW.pgp
Description: OpenPGP digital signature

Re: [PATCH] net: neterion: vxge: reduce stack usage in VXGE_COMPLETE_VPATH_TX

2020-07-19 Thread Stephen Hemminger

On Thu, 16 Jul 2020 17:32:47 +
Bixuan Cui  wrote:

> Fix the warning: [-Werror=-Wframe-larger-than=]
> 
> drivers/net/ethernet/neterion/vxge/vxge-main.c:
> In function'VXGE_COMPLETE_VPATH_TX.isra.37':
> drivers/net/ethernet/neterion/vxge/vxge-main.c:119:1:
> warning: the frame size of 1056 bytes is larger than 1024 bytes
> 
> Signed-off-by: Bixuan Cui 

Dropping the NR_SKB_COMPLETED to 16 won't have much impact
on performance, and shrink the size.

Doing 16 skb's at a time instead of 128 probably costs
less than one allocation. Especially since it is unlikely
that the device completed that many transmits at once.

Re: [Ksummit-discuss] [PATCH v2] CodingStyle: Inclusive Terminology

2020-07-08 Thread Stephen Hemminger

On Wed, 08 Jul 2020 00:23:59 -0700
Dan Williams  wrote:

> Linux maintains a coding-style and its own idiomatic set of terminology.
> Update the style guidelines to recommend replacements for the terms
> master/slave and blacklist/whitelist.
> 
> Link: 
> http://lore.kernel.org/r/159389297140.2210796.13590142254668787525.st...@dwillia2-desk3.amr.corp.intel.com
> Cc: Jonathan Corbet 
> Acked-by: Randy Dunlap 
> Acked-by: Dave Airlie 
> Acked-by: Kees Cook 
> Acked-by: SeongJae Park 
> Signed-off-by: Olof Johansson 
> Signed-off-by: Chris Mason 
> Signed-off-by: Greg Kroah-Hartman 
> Signed-off-by: Dan Williams 

Thanks for doing this.

Signed-off-by: Stephen Hemminger

Re: [Ksummit-discuss] [Tech-board-discuss] [PATCH] CodingStyle: Inclusive Terminology

2020-07-07 Thread Stephen Hemminger

On Tue, 7 Jul 2020 02:03:36 +0300
Pavel Begunkov  wrote:

> On 07/07/2020 01:28, Steven Rostedt wrote:
> > On Tue, 7 Jul 2020 01:17:47 +0300
> > Pavel Begunkov  wrote:
> >   
> >> Totally agree with you! But do we care then whether two _devices_ or 
> >> _objects_
> >> are slave-master? Can't see how it fundamentally differs.  
> > 
> > The term slave carries a lot more meaning than subordinate. I replied to
> > someone else but later realized that the person sent me their reply
> > offlist, so my reply to them was also offlist. What I told them was,
> > back in college (decades ago), when I first mentioned "master/slave" in
> > conversation (I think it was about hard drives), a person in that
> > conversation stated that those were not very nice terms to use. I blew
> > it off back then, but after listening to more people, I found that
> > using "slave" even to describe a device is not something that people
> > care to hear about.  
> 
> That's cultural, but honestly I've never seen such a person. I still
> don't understand, why having secondary or subordinate object belittling
> the owned side by not providing it the same rights and freedom is OK,
> but slave/master objects are not. Where is the line?
> 
> 
> > 
> > And in actuality, does one device actually enslave another device? I
> > think that terminology is misleading to begin with.  
> 
> As mentioned, I do like good clear terminology, and if it conveys the idea
> better, etc., then it's worth to try. And IMHO that's the right reasoning
> that should be behind. Otherwise, for almost every word we can find a person
> seeing something subjectively offensive or at least bad in it.

Wherever possible the kernel should use the same terminology as the current
standard in that area. Many of the master/slave references in the networking
code are for protocols based on IEEE 802 standards (unfortunately paywalled).
The current version of those standards do not use this kind of wording and the
standards committees are also actively working on inclusive language statemets.

As far as the use of master/slave for bonding, bridge, team etc, it
looks like Linux just invented using those terms since I don't see it
any other vendors implementations Cisco/Juniper/Arista/... Linux terms
are different than industry norms in networking, this is not a good
thing. But changing human expectations is hard.

Re: [PATCH v3 0/3] Drivers: hv: vmbus: vmbus_requestor data structure for VMBus hardening

2020-06-30 Thread Stephen Hemminger

On Tue, 30 Jun 2020 11:31:57 -0400
Andres Beltran  wrote:

> Currently, VMbus drivers use pointers into guest memory as request IDs
> for interactions with Hyper-V. To be more robust in the face of errors
> or malicious behavior from a compromised Hyper-V, avoid exposing
> guest memory addresses to Hyper-V. Also avoid Hyper-V giving back a
> bad request ID that is then treated as the address of a guest data
> structure with no validation. Instead, encapsulate these memory
> addresses and provide small integers as request IDs.
> 
> The first patch creates the definitions for the data structure, provides
> helper methods to generate new IDs and retrieve data, and
> allocates/frees the memory needed for vmbus_requestor.
> 
> The second and third patches make use of vmbus_requestor to send request
> IDs to Hyper-V in storvsc and netvsc respectively.
> 
> Thanks.
> Andres Beltran
> 
> Tested-by: Andrea Parri 
> 
> Cc: linux-s...@vger.kernel.org
> Cc: net...@vger.kernel.org
> Cc: James E.J. Bottomley 
> Cc: Martin K. Petersen 
> Cc: David S. Miller 
> 
> Andres Beltran (3):
>   Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus
> hardening
>   scsi: storvsc: Use vmbus_requestor to generate transaction IDs for
> VMBus hardening
>   hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus
> hardening
> 
>  drivers/hv/channel.c  | 154 ++
>  drivers/net/hyperv/hyperv_net.h   |  13 +++
>  drivers/net/hyperv/netvsc.c   |  79 ---
>  drivers/net/hyperv/rndis_filter.c |   1 +
>  drivers/scsi/storvsc_drv.c|  85 ++---
>  include/linux/hyperv.h|  22 +
>  6 files changed, 329 insertions(+), 25 deletions(-)
> 

How does this interact with use of the vmbus in usermode by DPDK through 
hv_uio_generic?
Will it still work?

Re: [iproute2-next] action police: make 'mtu' could be set independently in police action

2020-06-28 Thread Stephen Hemminger

On Sun, 28 Jun 2020 09:46:02 +0800
Po Liu  wrote:

> Current police action must set 'rate' and 'burst'. 'mtu' parameter
> set the max frame size and could be set alone without 'rate' and 'burst'
> in some situation. Offloading to hardware for example, 'mtu' could limit
> the flow max frame size.
> 
> Signed-off-by: Po Liu 
> ---
>  tc/m_police.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tc/m_police.c b/tc/m_police.c
> index a5bc20c0..89497f67 100644
> --- a/tc/m_police.c
> +++ b/tc/m_police.c
> @@ -161,8 +161,8 @@ action_ctrl_ok:
>   return -1;
>  
>   /* Must at least do late binding, use TB or ewma policing */
> - if (!rate64 && !avrate && !p.index) {
> - fprintf(stderr, "\"rate\" or \"avrate\" MUST be specified.\n");
> + if (!rate64 && !avrate && !p.index && !mtu) {
> + fprintf(stderr, "\"rate\" or \"avrate\" or \"mtu\"MUST be 
> specified.\n");

Missing blank.
Your message will come out as:
"rate" or "avrate" or "mtu"MUST be specified.


The quotes aren't adding to the readability, why not just remove them instead.

Re: [RFC,net-next, 2/5] vrf: track associations between VRF devices and tables

2020-06-13 Thread Stephen Hemminger

On Fri, 12 Jun 2020 18:49:34 +0200
Andrea Mayer  wrote:

> + /* shared_tables:
> +  * count how many distinct tables does not comply with the
> +  * strict mode requirement.
> +  * shared_table value must be 0 in order to switch to strict mode.
> +  *
> +  * example of evolution of shared_table:
> +  *| time
> +  * add  vrf0 --> table 100shared_tables = 0   | t0
> +  * add  vrf1 --> table 101shared_tables = 0   | t1
> +  * add  vrf2 --> table 100shared_tables = 1   | t2
> +  * add  vrf3 --> table 100shared_tables = 1   | t3
> +  * add  vrf4 --> table 101shared_tables = 2   v t4
> +  *
> +  * shared_tables is a "step function" (or "staircase function")
> +  * and is increased by one when the second vrf is associated to a table
> +  *
> +  * at t2, vrf0 and vrf2 are bound to table 100: shared_table = 1.
> +  *
> +  * at t3, another dev (vrf3) is bound to the same table 100 but the
> +  * shared_table counters is still 1.
> +  * This means that no matter how many new vrfs will register on the
> +  * table 100, the shared_table will not increase (considering only
> +  * table 100).
> +  *
> +  * at t4, vrf4 is bound to table 101, and shared_table = 2.
> +  *
> +  * Looking at the value of shared_tables we can immediately know if
> +  * the strict_mode can or cannot be enforced. Indeed, strict_mode
> +  * can be enforced iff shared_table = 0.
> +  *
> +  * Conversely, shared_table is decreased when a vrf is de-associated
> +  * from a table with exactly two associated vrfs.
> +  */
> + int shared_tables;

Should this be unsigned?
Should it be a fixed size?

Re: [PATCH ethtool v1] netlink: add master/slave configuration support

2020-06-09 Thread Stephen Hemminger

On Sun, 07 Jun 2020 16:45:32 -0700 (PDT)
David Miller  wrote:

> From: Stephen Hemminger 
> Date: Sun, 7 Jun 2020 15:30:19 -0700
> 
> > Open source projects have been working hard to remove the terms master and 
> > slave
> > in API's and documentation. Apparently, Linux hasn't gotten the message.
> > It would make sense not to introduce new instances.  
> 
> Would you also be against, for example, the use of the terminology
> expressing the "death" of allocated registers in a compiler backend,
> for example?
> 
> How far do you plan take this resistence of terminology when it
> clearly has a well defined usage and meaning in a specific technical
> realm which is entirely disconnected to what the terms might imply,
> meaning wise, in other realms?
> 
> And if you are going to say not to use this terminology, you must
> suggest a reasonable (and I do mean _reasonable_) well understood
> and _specific_ replacement.
> 
> Thank you.

How many times have you or Linus argued about variable naming.
Yes, words do matter and convey a lot of implied connotation and meaning.

Most projects and standards bodies are taking a stance on fixing the
language. The IETF is has proposed making changes as well.

There are a very specific set of trigger words and terms that
should be fixed. Most of these terms do have better alternatives.

A common example is that master/slave is unclear and would be clearer
as primary/secondary or active/backup or controller/worker.

Most of networking is based on standards. When the standards wording changes
(and it will happen soon); then Linux should also change the wording in the
source, api and documentation.

See:

[0] - 
<https://www.cs.cmu.edu/~mjw/Language/NonSexist/vuw.non-sexist-language-guidelines.txt>,
 <https://twitter.com/justkelly_ok/status/933011085594066944>
[1] - <https://github.com/django/django/pull/2692>
[2] - <https://bugs.python.org/issue34605>
[3] - <https://github.com/rust-lang-deprecated/rust-buildbot/issues/2>, 
<https://github.com/rust-community/foss-events-planner/issues/58>
[4] - <https://twitter.com/ISCdotORG/status/942815837299253248>
[5] - <https://gitlab.gnome.org/GNOME/geary/issues/324>
[6] - 
https://mail.gnome.org/archives/desktop-devel-list/2019-April/msg00049.html
[7] - https://www.ietf.org/archive/id/draft-knodel-terminology-01.txt

Re: [PATCH ethtool v1] netlink: add master/slave configuration support

2020-06-07 Thread Stephen Hemminger

On Tue, 26 May 2020 11:10:25 +0200
Oleksij Rempel  wrote:

> This UAPI is needed for BroadR-Reach 100BASE-T1 devices. Due to lack of
> auto-negotiation support, we needed to be able to configure the
> MASTER-SLAVE role of the port manually or from an application in user
> space.
> 
> The same UAPI can be used for 1000BASE-T or MultiGBASE-T devices to
> force MASTER or SLAVE role. See IEEE 802.3-2018:
> 22.2.4.3.7 MASTER-SLAVE control register (Register 9)
> 22.2.4.3.8 MASTER-SLAVE status register (Register 10)
> 40.5.2 MASTER-SLAVE configuration resolution
> 45.2.1.185.1 MASTER-SLAVE config value (1.2100.14)
> 45.2.7.10 MultiGBASE-T AN control 1 register (Register 7.32)
> 
> The MASTER-SLAVE role affects the clock configuration:
> 
> ---
> When the  PHY is configured as MASTER, the PMA Transmit function shall
> source TX_TCLK from a local clock source. When configured as SLAVE, the
> PMA Transmit function shall source TX_TCLK from the clock recovered from
> data stream provided by MASTER.
> 
> iMX6Q KSZ9031XXX
> --\/---\/\
>   ||   |||
>  MAC  || PHY Slave |<-->| PHY Master |
>   |<--- 125 MHz ---+-<--/  || \  |
> --/\---/\/
>^
> \-TX_TCLK
> 
> ---
> 
> Since some clock or link related issues are only reproducible in a
> specific MASTER-SLAVE-role, MAC and PHY configuration, it is beneficial
> to provide generic (not 100BASE-T1 specific) interface to the user space
> for configuration flexibility and trouble shooting.
> 
> Signed-off-by: Oleksij Rempel 

NAK
Open source projects have been working hard to remove the terms master and slave
in API's and documentation. Apparently, Linux hasn't gotten the message.
It would make sense not to introduce new instances.

[ANNOUNCE] iproute2-5.7.0 release

2020-06-02 Thread Stephen Hemminger

The 5.7 kernel has been released, and the last patches have
been applied to iproute2.

As usual lots of small fixes, across many utilities.
Several qdisc now have more parameters available.
Devlink get most of the fixes.

Download:
https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-5.7.0.tar.gz

Repository for upcoming release:
git://git.kernel.org/pub/scm/network/iproute2/iproute2.git

And future release (net-next):
git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git

Thanks for all the contributions.

Report problems (or enhancements) to the net...@vger.kernel.org mailing list.


Andrea Claudi (2):
  Revert "bpf: replace snprintf with asprintf when dealing with long 
buffers"
  bpf: Fixes a snprintf truncation warning

Antoine Tenart (4):
  macsec: report the offloading mode currently selected
  macsec: add support for changing the offloading mode
  man: document the ip macsec offload command
  macsec: add an accessor for validate_str

Bastien Roucariès (6):
  Better documentation of mcast_to_unicast option
  Improve hairpin mode description
  Document BPDU filter option
  Better documentation of BDPU guard
  Document root_block option
  State of bridge STP port are now case insensitive

Benjamin Lee (4):
  man: tc-htb.8: add missing qdisc parameter r2q
  man: tc-htb.8: add missing class parameter quantum
  man: tc-htb.8: fix class prio is not mandatory
  tc: fq_codel: fix class stat deficit is signed int

Benjamin Poirier (6):
  bridge: Use consistent column names in vlan output
  bridge: Fix typo
  bridge: Fix output with empty vlan lists
  json_print: Return number of characters printed
  bridge: Align output columns
  Replace open-coded instances of print_nl()

Brian Norris (2):
  man: add ip-netns(8) as generation target
  man: replace $(NETNS_ETC_DIR) and $(NETNS_RUN_DIR) in ip-netns(8)

Danielle Ratson (1):
  bash-completion: devlink: add bash-completion function

David Ahern (4):
  Update kernel headers
  Update kernel headers
  Update kernel headers
  Update kernel headers

Donald Sharp (1):
  nexthop: Fix Deletion display

Eric Dumazet (5):
  tc: fq_codel: add drop_batch parameter
  tc: fq: add timer_slack parameter
  ss: add support for Gbit speeds in sprint_bw()
  tc: fq: fix two issues
  utils: remove trailing zeros in print_time() and print_time64()

Ido Schimmel (3):
  devlink: Add devlink trap policer set and show commands
  devlink: Add ability to bind policer to trap group
  bash-completion: devlink: Extend bash-completion for new commands

Jacob Keller (1):
  devlink: add support for DEVLINK_CMD_REGION_NEW

Jakub Kicinski (1):
  tc: m_action: rename hw stats type uAPI

Jiri Pirko (14):
  devlink: add trap metadata type for flow action cookie
  tc: m_action: introduce support for hw stats type
  tc: show used HW stats types
  devlink: remove custom bool command line options parsing
  devlink: Fix help and man of "devlink health set" command
  devlink: fix encap mode manupulation
  devlink: Add alias "counters_enabled" for "counters" option
  devlink: rename dpipe_counters_enable struct field to 
dpipe_counters_enabled
  devlink: Fix help message for dpipe
  devlink: remove "dev" object sub help messages
  man: add man page for devlink dpipe
  devlink: remove unused "jw" field
  devlink: fix JSON output of mon command
  tc: m_action: check cookie hex string len

Leslie Monis (2):
  tc: pie: change maximum integer value of tc_pie_xstats->prob
  Revert "tc: pie: change maximum integer value of tc_pie_xstats->prob"

Moshe Shemesh (1):
  devlink: Add health error recovery status monitoring

Odin Ugedal (3):
  tc_util: detect overflow in get_size
  q_cake: Make fwmark uint instead of int
  q_cake: properly print memlimit

Parav Pandit (1):
  devlink: Introduce devlink port flavour virtual

Paul Blakey (1):
  man: tc-ct.8: Add manual page for ct tc action

Petr Machata (5):
  tc: q_red: Support 'nodrop' flag
  tc: p_ip6: Support pedit of IPv6 dsfield
  man: tc-pedit: Add examples for dsfield and retain
  man: tc-pedit: Drop the claim that pedit ex is only for IPv4
  ip: link_gre: Do not send ERSPAN attributes to GRE tunnels

Roman Mashak (1):
  tc: action: fix time values output in JSON format

Stephen Hemminger (7):
  man/tc-actions: fix formatting
  bridge: man page spelling fixes
  uapi: update bpf.h
  ss: update to bw print
  uapi: update to bpf.h
  uapi: fix comment in xfrm.h
  v5.7.0

Xin Long (1):
  xfrm: also check for ipv6 state in xfrm_state_keep

Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount

2020-05-20 Thread Stephen Hemminger

On Wed, 20 May 2020 21:37:11 +0200
Thomas Gleixner  wrote:

> David Miller  writes:
> > From: Thomas Gleixner 
> > Date: Wed, 20 May 2020 01:42:30 +0200  
> >>> Please try, it isn't that hard..
> >>>
> >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >>>
> >>> real  0m17.002s
> >>> user  0m1.064s
> >>> sys   0m0.375s  
> >> 
> >> And that solves the incorrectness of the current code in which way?  
> >
> > You mentioned that there wasn't a test case, he gave you one to try.  
> 
> If it makes you happy to compare incorrrect code with correct code, here
> you go:
> 
> 5 runs of 1000 device add, 1000 device rename and 1000 device del
> 
> CONFIG_PREEMPT_NONE=y
> 
>  Base  rwsem
>  add 0:05.01   0:05.28
>0:05.93   0:06.11
>0:06.52   0:06.26
>0:06.06   0:05.74
>0:05.71   0:06.07
> 
>  rename  0:32.57   0:33.04
>0:32.91   0:32.45
>0:32.72   0:32.53
>0:39.65   0:34.18
>0:34.52   0:32.50
> 
>  delete  3:48.65   3:48.91
>3:49.66   3:49.13
>3:45.29   3:48.26
>3:47.56   3:46.60
>3:50.01   3:48.06
> 
>  -
> 
> CONFIG_PREEMPT_VOLUNTARY=y
> 
>  Base  rwsem
>  add 0:06.80   0:06.42
>0:04.77   0:05.03
>0:05.74   0:04.62
>0:05.87   0:04.34
>0:04.20   0:04.12
> 
>  rename  0:33.33   0:42.02
>0:42.36   0:32.55
>0:39.58   0:31.60
>0:33.69   0:35.08
>0:34.24   0:33.97
> 
>  delete  3:47.82   3:44.00
>3:47.42   3:51.00
>3:48.52   3:48.88
>3:48.50   3:48.09
>3:50.03   3:46.56
> 
>  -
> 
> CONFIG_PREEMPT=y
> 
>  Base  rwsem
> 
>  add 0:07.89   0:07.72
>0:07.25   0:06.72
>0:07.42   0:06.51
>0:06.92   0:06.38
>0:06.20   0:06.72
> 
>  rename  0:41.77   0:32.39
>0:44.29   0:33.29
>0:36.19   0:34.86
>0:33.19   0:35.06
>0:37.00   0:34.78
> 
>  delete  2:36.96   2:39.97
>2:37.80   2:42.19
>2:44.66   2:48.40
>2:39.75   2:41.02
>2:40.77   2:38.36
> 
> The runtime variation is rather large and when running the same in a VM
> I got complete random numbers for both base and rwsem. The most amazing
> was delete where the time varies from 30s to 6m20s.
> 
> Btw, Sebastian noticed that rename spams dmesg:
> 
>   netdev_info(dev, "renamed from %s\n", oldname);
> 
> which eats about 50% of the Rename run time.
> 
>  Base  netdev_info() removed
> 
> Rename   0:34.84   0:17.48
> 
> That number at least makes tons of sense
> 
> Thanks,
> 
> tglx

Looks good thanks for following through.

Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount

2020-05-19 Thread Stephen Hemminger

On Tue, 19 May 2020 20:18:19 -0700
Eric Dumazet  wrote:

> On 5/19/20 7:57 PM, David Miller wrote:
> > From: Thomas Gleixner 
> > Date: Wed, 20 May 2020 01:42:30 +0200
> >   
> >> Stephen Hemminger  writes:  
> >>> On Wed, 20 May 2020 00:23:48 +0200
> >>> Thomas Gleixner  wrote:  
> >>>> No. We did not. -ENOTESTCASE  
> >>>
> >>> Please try, it isn't that hard..
> >>>
> >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >>>
> >>> real  0m17.002s
> >>> user  0m1.064s
> >>> sys   0m0.375s  
> >>
> >> And that solves the incorrectness of the current code in which way?  
> > 
> > You mentioned that there wasn't a test case, he gave you one to try.
> >   
> 
> I do not think this would ever use device rename, nor netdev_get_name()
> 
> None of this stuff is fast path really.
> 
> # time for ((i=1;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> 
> real  0m1.127s
> user  0m0.270s
> sys   0m1.039s

Your right it is a weak test, and most of the overhead is in the syscall
and all netlink events that happen.

It does end up looking up the new name, so would exercise that.
Better test is to use %d syntax or create 1000 dummy's then rename every one.

This is more of a stress test
# for ((i=0;i<1000;i++)); do echo link add dev dummy%d type dummy; done | time 
ip -batch -
0.00user 0.29system 0:02.11elapsed 13%CPU (0avgtext+0avgdata 2544maxresident)k
0inputs+0outputs (0major+148minor)pagefaults 0swaps

# for ((i=999;i>=0;i--)); do echo link set dummy$i name dummy$((i+1)); done | 
time ip -batch -
0.00user 0.26system 0:54.98elapsed 0%CPU (0avgtext+0avgdata 2508maxresident)k
0inputs+0outputs (0major+145minor)pagefaults 0swaps

Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount

2020-05-19 Thread Stephen Hemminger

On Wed, 20 May 2020 01:42:30 +0200
Thomas Gleixner  wrote:

> Stephen Hemminger  writes:
> > On Wed, 20 May 2020 00:23:48 +0200
> > Thomas Gleixner  wrote:  
> >> No. We did not. -ENOTESTCASE  
> >
> > Please try, it isn't that hard..
> >
> > # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >
> > real0m17.002s
> > user0m1.064s
> > sys 0m0.375s  
> 
> And that solves the incorrectness of the current code in which way?

Agree that the current code is has evolved over time to a state where it is not
correct in the case of Preempt-RT. The motivation for the changes to seqcount
goes back many years when there were ISP's that were concerned about scaling of 
tunnels, vlans etc.

Is it too much to ask for a simple before/after test of your patch as part 
of the submission. You probably measure latency changes to the nanosecond.

Getting it correct without causing user complaints.

Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount

2020-05-19 Thread Stephen Hemminger

On Wed, 20 May 2020 00:23:48 +0200
Thomas Gleixner  wrote:

> Stephen Hemminger  writes:
> > On Tue, 19 May 2020 23:45:23 +0200
> > "Ahmed S. Darwish"  wrote:
> >  
> >> Sequence counters write paths are critical sections that must never be
> >> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >> 
> >> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> >> netdev name retrieval.") handled a deadlock, observed with
> >> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> >> infinitely spinning: it got scheduled after the seqcount write side
> >> blocked inside its own critical section.
> >> 
> >> To fix that deadlock, among other issues, the commit added a
> >> cond_resched() inside the read side section. While this will get the
> >> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> >> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >> 
> >> The fix is also still broken: if the seqcount reader belongs to a
> >> real-time scheduling policy, it can spin forever and the kernel will
> >> livelock.
> >> 
> >> Disabling preemption over the seqcount write side critical section will
> >> not work: inside it are a number of GFP_KERNEL allocations and mutex
> >> locking through the drivers/base/ :: device_rename() call chain.
> >> 
> >> From all the above, replace the seqcount with a rwsem.
> >> 
> >> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and 
> >> netdev name retrieval.)
> >> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> >> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to 
> >> return an interface name)
> >> Cc: 
> >> Signed-off-by: Ahmed S. Darwish 
> >> Reviewed-by: Sebastian Andrzej Siewior   
> >
> > Have your performance tested this with 1000's of network devices?  
> 
> No. We did not. -ENOTESTCASE

Please try, it isn't that hard..

# time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done

real0m17.002s
user0m1.064s
sys 0m0.375s

Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount

2020-05-19 Thread Stephen Hemminger

On Tue, 19 May 2020 23:45:23 +0200
"Ahmed S. Darwish"  wrote:

> Sequence counters write paths are critical sections that must never be
> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> 
> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> netdev name retrieval.") handled a deadlock, observed with
> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> infinitely spinning: it got scheduled after the seqcount write side
> blocked inside its own critical section.
> 
> To fix that deadlock, among other issues, the commit added a
> cond_resched() inside the read side section. While this will get the
> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> 
> The fix is also still broken: if the seqcount reader belongs to a
> real-time scheduling policy, it can spin forever and the kernel will
> livelock.
> 
> Disabling preemption over the seqcount write side critical section will
> not work: inside it are a number of GFP_KERNEL allocations and mutex
> locking through the drivers/base/ :: device_rename() call chain.
> 
> From all the above, replace the seqcount with a rwsem.
> 
> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and 
> netdev name retrieval.)
> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to 
> return an interface name)
> Cc: 
> Signed-off-by: Ahmed S. Darwish 
> Reviewed-by: Sebastian Andrzej Siewior 

Have your performance tested this with 1000's of network devices?

The reason seqcount logic is was done here was to achieve scaleablity
and a semaphore does not scale as well.

Re: [v4,iproute2-next 1/2] iproute2-next:tc:action: add a gate control action

2020-05-06 Thread Stephen Hemminger

On Wed,  6 May 2020 16:40:19 +0800
Po Liu  wrote:

>   } else if (matches(*argv, "base-time") == 0) {
> + NEXT_ARG();
> + if (get_u64(_time, *argv, 10)) {
> + invalidarg = "base-time";
> + goto err_arg;
> + }
> + } else if (matches(*argv, "cycle-time") == 0) {
> + NEXT_ARG();
> + if (get_u64(_time, *argv, 10)) {
> + invalidarg = "cycle-time";
> + goto err_arg;
> + }
> + } else if (matches(*argv, "cycle-time-ext") == 0) {
> + NEXT_ARG();
> + if (get_u64(_time_ext, *argv, 10)) {
> + invalidarg = "cycle-time-ext";
> + goto err_arg;
> + }

Could all these time values use existing TC helper routines?
See get_time().  The way you have it makes sense for hardware
but stands out versus the rest of tc.

It maybe that the kernel UAPI is wrong, and should be using same
time units as rest of tc. Forgot to review that part of the patch.

Re: [v3,iproute2 1/2] iproute2:tc:action: add a gate control action

2020-05-04 Thread Stephen Hemminger

On Sun,  3 May 2020 14:32:50 +0800
Po Liu  wrote:

> + print_string(PRINT_ANY, "gate state", "\tgate-state %-8s",

NAK
Space in a json tag is not valid.

Please run a dump command and feed it into JSON validation checker like Python.

Re: [v3,iproute2 1/2] iproute2:tc:action: add a gate control action

2020-05-04 Thread Stephen Hemminger

On Sun,  3 May 2020 14:32:50 +0800
Po Liu  wrote:

> Introduce a ingress frame gate control flow action.
> Tc gate action does the work like this:
> Assume there is a gate allow specified ingress frames can pass at
> specific time slot, and also drop at specific time slot. Tc filter
> chooses the ingress frames, and tc gate action would specify what slot
> does these frames can be passed to device and what time slot would be
> dropped.
> Tc gate action would provide an entry list to tell how much time gate
> keep open and how much time gate keep state close. Gate action also
> assign a start time to tell when the entry list start. Then driver would
> repeat the gate entry list cyclically.
> For the software simulation, gate action require the user assign a time
> clock type.
> 
> Below is the setting example in user space. Tc filter a stream source ip
> address is 192.168.0.20 and gate action own two time slots. One is last
> 200ms gate open let frame pass another is last 100ms gate close let
> frames dropped.
> 
>  # tc qdisc add dev eth0 ingress
>  # tc filter add dev eth0 parent : protocol ip \
> 
> flower src_ip 192.168.0.20 \
> action gate index 2 clockid CLOCK_TAI \
> sched-entry open 2 -1 -1 \
> sched-entry close 1
> 
>  # tc chain del dev eth0 ingress chain 0
> 
> "sched-entry" follow the name taprio style. Gate state is
> "open"/"close". Follow the period nanosecond. Then next -1 is internal
> priority value means which ingress queue should put to. "-1" means
> wildcard. The last value optional specifies the maximum number of
> MSDU octets that are permitted to pass the gate during the specified
> time interval.
> 
> Below example shows filtering a stream with destination mac address is
> 10:00:80:00:00:00 and ip type is ICMP, follow the action gate. The gate
> action would run with one close time slot which means always keep close.
> The time cycle is total 2ns. The base-time would calculate by:
> 
>  13570 + (N + 1) * cycletime
> 
> When the total value is the future time, it will be the start time.
> The cycletime here would be 2ns for this case.
> 
>  #tc filter add dev eth0 parent :  protocol ip \
>flower skip_hw ip_proto icmp dst_mac 10:00:80:00:00:00 \
>action gate index 12 base-time 13570 \
>sched-entry CLOSE 2 \
>clockid CLOCK_TAI
> 
> Signed-off-by: Po Liu 

These changes are specific to net-next should be assigned to iproute2-next.
Will change delegation.

Re: [PATCH RFC] net: vlan: reverse 4 bytes of vlan header when setting initial MTU

2019-10-21 Thread Stephen Hemminger

On Mon, 21 Oct 2019 20:26:03 +0800
Yunsheng Lin  wrote:

> Currently the MTU of vlan netdevice is set to the same MTU
> of the lower device, which requires the underlying device
> to handle it as the comment has indicated:
> 
>   /* need 4 bytes for extra VLAN header info,
>* hope the underlying device can handle it.
>*/
>   new_dev->mtu = real_dev->mtu;
> 
> Currently most of the physical netdevs seems to handle above
> by reversing 2 * VLAN_HLEN for L2 packet len.
> 
> But for vlan netdev over vxlan netdev case, the vxlan does not
> seems to reverse the vlan header for vlan device, which may cause
> performance degradation because vxlan may emit a packet that
> exceed the MTU of the physical netdev, and cause the software
> TSO to happen in ip_finish_output_gso(), software TSO call stack
> as below:
> 
>  => ftrace_graph_call
>  => tcp_gso_segment
>  => tcp4_gso_segment
>  => inet_gso_segment
>  => skb_mac_gso_segment
>  => skb_udp_tunnel_segment
>  => udp4_ufo_fragment
>  => inet_gso_segment
>  => skb_mac_gso_segment
>  => __skb_gso_segment
>  => __ip_finish_output
>  => ip_output
>  => ip_local_out
>  => iptunnel_xmit
>  => udp_tunnel_xmit_skb
>  => vxlan_xmit_one
>  => vxlan_xmit
>  => dev_hard_start_xmit
>  => __dev_queue_xmit
>  => dev_queue_xmit
>  => vlan_dev_hard_start_xmit
>  => dev_hard_start_xmit
>  => __dev_queue_xmit
>  => dev_queue_xmit
>  => neigh_resolve_output
>  => ip_finish_output2
>  => __ip_finish_output
>  => ip_output
>  => ip_local_out
>  => __ip_queue_xmit
>  => ip_queue_xmit
>  => __tcp_transmit_skb
>  => tcp_write_xmit
>  => __tcp_push_pending_frames
>  => tcp_push
>  => tcp_sendmsg_locked
>  => tcp_sendmsg
>  => inet_sendmsg
>  => sock_sendmsg
>  => sock_write_iter
>  => new_sync_write
>  => __vfs_write
>  => vfs_write
>  => ksys_write
>  => __arm64_sys_write
>  => el0_svc_common.constprop.0
>  => el0_svc_handler
>  => el0_svc  
> 
> This patch set initial MTU of the vlan device to the MTU of the
> lower device minus vlan header to handle the above case.
> 
> Signed-off-by: Yunsheng Lin 

The MTU is visible to user space in many tools, and Linux (and BSD)
have always treated VLAN header as not part of the MTU. You can't change
that now.

Re: [RFC 1/2] vhost: IFC VF hardware operation layer

2019-10-15 Thread Stephen Hemminger

On Wed, 16 Oct 2019 09:03:17 +0800
Zhu Lingshan  wrote:

> + IFC_INFO(>dev, "PCI capability mapping:\n"
> + "common cfg: %p\n"
> + "notify base: %p\n"
> + "isr cfg: %p\n"
> + "device cfg: %p\n"
> + "multiplier: %u\n",
> + hw->common_cfg,
> + hw->notify_base,
> + hw->isr,
> + hw->dev_cfg,
> + hw->notify_off_multiplier);

Since kernel messages go to syslog, syslog does not handle multi-line
messages very well. This should be a single line.

Also, this is the kind of message that should be at the debug
level; something that is useful to the driver developers
but not something that needs to be filling up every end users log.

Re: [RFC 1/2] vhost: IFC VF hardware operation layer

2019-10-15 Thread Stephen Hemminger

On Wed, 16 Oct 2019 09:03:17 +0800
Zhu Lingshan  wrote:

> +int ifcvf_init_hw(struct ifcvf_hw *hw, struct pci_dev *dev)
> +{
> + int ret;
> + u8 pos;
> + struct virtio_pci_cap cap;
> + u32 i;
> + u16 notify_off;

For network code, the preferred declaration style is
reverse christmas tree.

Re: [PATCH] AF_PACKET doesnt strip VLAN information

2019-10-01 Thread Stephen Hemminger

On Mon, 30 Sep 2019 11:16:14 -0400
Willem de Bruijn  wrote:

> On Mon, Sep 30, 2019 at 1:24 AM Sriram Krishnan  wrote:
> >
> > When an application sends with AF_PACKET and places a vlan header on
> > the raw packet; then the AF_PACKET needs to move the tag into the skb
> > so that it gets processed normally through the rest of the transmit
> > path.
> >
> > This is particularly a problem on Hyper-V where the host only allows
> > vlan in the offload info.  
> 
> This sounds like behavior that needs to be addressed in the driver, instead?

This was what we did first, but the problem was more general.
For example, many filtering functions assume that vlan tag is in
skb meta data, not the packet data itself. Therefore AF_PACKET would
get around any filter rules.

> 
> > Cc: xe-linux-exter...@cisco.com
> > ---
> >  net/packet/af_packet.c | 26 --
> >  1 file changed, 24 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index e2742b0..cfe0904 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -1849,15 +1849,35 @@ static int packet_rcv_spkt(struct sk_buff *skb, 
> > struct net_device *dev,
> > return 0;
> >  }
> >
> > -static void packet_parse_headers(struct sk_buff *skb, struct socket *sock)
> > +static int packet_parse_headers(struct sk_buff *skb, struct socket *sock)
> >  {
> > if ((!skb->protocol || skb->protocol == htons(ETH_P_ALL)) &&
> > sock->type == SOCK_RAW) {  
> 
> If inside this branch, may miss packets with skb->protocol set to one
> of the VLAN Ethertypes.
> 
> > +   __be16 ethertype;
> > +
> > skb_reset_mac_header(skb);
> > +
> > +   ethertype = eth_hdr(skb)->h_proto;
> > +   /*
> > +* If Vlan tag is present in the packet
> > +*  move it to skb
> > +   */
> > +   if (eth_type_vlan(ethertype)) {
> > +   int err;
> > +   __be16 vlan_tci;
> > +
> > +   err = __skb_vlan_pop(skb, _tci);
> > +   if (unlikely(err))
> > +   return err;
> > +
> > +   __vlan_hwaccel_put_tag(skb, ethertype, vlan_tci);  
> 
> What happens with multiple tags (QinQ)?

Same as multiple tags in a normal sent packet. The second tag is in
the packet itself.

> 
> > +   }
> > +
> > skb->protocol = dev_parse_header_protocol(skb);
> > }
> >
> > skb_probe_transport_header(skb);
> > +   return 0;
> >  }
> >
> >  /*
> > @@ -1979,7 +1999,9 @@ static int packet_sendmsg_spkt(struct socket *sock, 
> > struct msghdr *msg,
> > if (unlikely(extra_len == 4))
> > skb->no_fcs = 1;
> >
> > -   packet_parse_headers(skb, sock);
> > +   err = packet_parse_headers(skb, sock);
> > +   if (err)
> > +   goto out_unlock;  
> 
> This only tests the new return value in one of three callers of
> packet_sendmsg_spkt.

Re: ERROR: "__umoddi3" [drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko] undefined!

2019-09-30 Thread Stephen Hemminger

On Mon, 30 Sep 2019 18:29:10 +0200
Borislav Petkov  wrote:

> On Mon, Sep 30, 2019 at 05:45:35PM +0200, Michal Kubecek wrote:
> > On Mon, Sep 30, 2019 at 04:13:17PM +0200, Borislav Petkov wrote:  
> > > I'm seeing this on i386 allyesconfig builds of current Linus master:
> > > 
> > > ERROR: "__umoddi3" [drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko] 
> > > undefined!
> > > make[1]: *** [__modpost] Error 1
> > > make: *** [modules] Error 2  
> > 
> > This is usually result of dividing (or modulo) by a 64-bit integer. Can
> > you identify where (file and line number) is the __umoddi3() call
> > generated?  
> 
> Did another 32-bit allyesconfig build. It said:
> 
> ld: drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.o: in 
> function `mlx5dr_icm_alloc_chunk':
> dr_icm_pool.c:(.text+0x733): undefined reference to `__umoddi3'
> make: *** [vmlinux] Error 1
> 
> The .s file then points to the exact location:
> 
> # drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c:140:   
> align_diff = icm_mr->icm_start_addr % align_base;
> pushl   %ebx# align_base
> pushl   %ecx# align_base
> call__umoddi3   #
> popl%edx#
> popl%ecx#
> 
> HTH.
> 

Could also us div_u64_rem here?

Re: ERROR: "__umoddi3" [drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko] undefined!

2019-09-30 Thread Stephen Hemminger

On Mon, 30 Sep 2019 18:29:10 +0200
Borislav Petkov  wrote:

> On Mon, Sep 30, 2019 at 05:45:35PM +0200, Michal Kubecek wrote:
> > On Mon, Sep 30, 2019 at 04:13:17PM +0200, Borislav Petkov wrote:  
> > > I'm seeing this on i386 allyesconfig builds of current Linus master:
> > > 
> > > ERROR: "__umoddi3" [drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko] 
> > > undefined!
> > > make[1]: *** [__modpost] Error 1
> > > make: *** [modules] Error 2  
> > 
> > This is usually result of dividing (or modulo) by a 64-bit integer. Can
> > you identify where (file and line number) is the __umoddi3() call
> > generated?  
> 
> Did another 32-bit allyesconfig build. It said:
> 
> ld: drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.o: in 
> function `mlx5dr_icm_alloc_chunk':
> dr_icm_pool.c:(.text+0x733): undefined reference to `__umoddi3'
> make: *** [vmlinux] Error 1
> 
> The .s file then points to the exact location:
> 
> # drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c:140:   
> align_diff = icm_mr->icm_start_addr % align_base;
> pushl   %ebx# align_base
> pushl   %ecx# align_base
> call__umoddi3   #
> popl%edx#
> popl%ecx#

Since align_base is a power of 2 masking should work as well.

Re: WARNING in hso_free_net_device

2019-09-04 Thread Stephen Hemminger

On Wed, 4 Sep 2019 16:27:50 -0400
Hui Peng  wrote:

> Hi, all:
> 
> I looked at the bug a little.
> 
> The issue is that in the error handling code, hso_free_net_device
> unregisters
> 
> the net_device (hso_net->net)  by calling unregister_netdev. In the
> error handling code path,
> 
> hso_net->net has not been registered yet.
> 
> I think there are two ways to solve the issue:
> 
> 1. fix it in drivers/net/usb/hso.c to avoiding unregistering the
> net_device when it is still not registered
> 
> 2. fix it in unregister_netdev. We can add a field in net_device to
> record whether it is registered, and make unregister_netdev return if
> the net_device is not registered yet.
> 
> What do you guys think ?

#1

Re: [v2] iproute2-next: police: support 64bit rate and peakrate in tc utility

2019-08-30 Thread Stephen Hemminger

On Fri, 30 Aug 2019 14:07:17 -0500
David Dai  wrote:

> + if (rate64) {
>   fprintf(stderr, "Double \"rate\" spec\n");
>   return -1;
>   }

The m_police filter should start using the common functions
for duparg and invarg that are in lib/utils.c

Re: [v1] iproute2: police: support 64bit rate and peakrate in tc utility

2019-08-29 Thread Stephen Hemminger

On Wed, 28 Aug 2019 17:52:56 -0500
David Dai  wrote:

> For high speed adapter like Mellanox CX-5 card, it can reach upto
> 100 Gbits per second bandwidth. Currently htb already supports 64bit rate
> in tc utility. However police action rate and peakrate are still limited
> to 32bit value (upto 32 Gbits per second). Taking advantage of the 2 new
> attributes TCA_POLICE_RATE64 and TCA_POLICE_PEAKRATE64 from kernel,
> tc can use them to break the 32bit limit, and still keep the backward 
> binary compatibility.
> 
> Tested-by: David Dai 
> Signed-off-by: David Dai 

This needs to go to iproute2-next not iproute2

Re: [PATCH] hv_netvsc: Fix a memory leak bug

2019-08-14 Thread Stephen Hemminger

On Wed, 14 Aug 2019 15:16:11 -0500
Wenwen Wang  wrote:

> In rndis_filter_device_add(), 'rndis_device' is allocated through kzalloc()
> by invoking get_rndis_device(). In the following execution, if an error
> occurs, the execution will go to the 'err_dev_remv' label. However, the
> allocated 'rndis_device' is not deallocated, leading to a memory leak bug.
> 
> Signed-off-by: Wenwen Wang 
> ---
>  drivers/net/hyperv/rndis_filter.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/hyperv/rndis_filter.c 
> b/drivers/net/hyperv/rndis_filter.c
> index 317dbe9..ed35085 100644
> --- a/drivers/net/hyperv/rndis_filter.c
> +++ b/drivers/net/hyperv/rndis_filter.c
> @@ -1420,6 +1420,7 @@ struct netvsc_device *rndis_filter_device_add(struct 
> hv_device *dev,
>  
>  err_dev_remv:
>   rndis_filter_device_remove(dev, net_device);
> + kfree(rndis_device);
>   return ERR_PTR(ret);
>  }
>  

The rndis_device is already freed by:

rndis_filter_device_remove
netvsc_device_remove
free_netvsc_device_rcu

free_netvsc_device called by rcu

static void free_netvsc_device(struct rcu_head *head)
{
struct netvsc_device *nvdev
= container_of(head, struct netvsc_device, rcu);
int i;

kfree(nvdev->extension);  << here

Re: [PATCH] isdn: hysdn: fix code style error from checkpatch

2019-08-02 Thread Stephen Hemminger

On Fri,  2 Aug 2019 19:50:17 +
Ricardo Bruno Lopes da Silva  wrote:

> Fix error bellow from checkpatch.
> 
> WARNING: Block comments use * on subsequent lines
> +/***
> +
> 
> Signed-off-by: Ricardo Bruno Lopes da Silva 

Read the TODO, these drivers are scheduled for removal, so changes
are not helpful at this time.

Re: [PATCH] isdn: hysdn: Fix error spaces around '*'

2019-08-02 Thread Stephen Hemminger

On Fri,  2 Aug 2019 19:56:02 +
Jose Carlos Cazarin Filho  wrote:

> Fix checkpath error:
> CHECK: spaces preferred around that '*' (ctx:WxV)
> +extern hysdn_card *card_root;/* pointer to first card */
> 
> Signed-off-by: Jose Carlos Cazarin Filho 


Read the TODO, these drivers are scheduled for removal, so changes
are not helpful at this time.

Re: [PATCH 1/2] Drivers: hv: Specify receive buffer size using Hyper-V page size

2019-07-26 Thread Stephen Hemminger

On Wed, 24 Jul 2019 22:03:14 -0700
"Himadri Pandya"  wrote:

> The recv_buffer is used to retrieve data from the VMbus ring buffer.
> VMbus ring buffers are sized based on the guest page size which
> Hyper-V assumes to be 4KB. But it may be different on some
> architectures. So use the Hyper-V page size to allocate the
> recv_buffer and set the maximum size to receive.
> 
> Signed-off-by: Himadri Pandya 

If pagesize is 64K, then doing it this way will waste lots of
memory.

Re: [PATCH 2/2] Drivers: hv: util: Specify ring buffer size using Hyper-V page size

2019-07-26 Thread Stephen Hemminger

On Wed, 24 Jul 2019 22:03:15 -0700
"Himadri Pandya"  wrote:

> VMbus ring buffers are sized based on the 4K page size used by
> Hyper-V. The Linux guest page size may not be 4K on all architectures
> so use the Hyper-V page size to specify the ring buffer size.
> 
> Signed-off-by: Himadri Pandya 
> ---
>  drivers/hv/hv_util.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/hv_util.c b/drivers/hv/hv_util.c
> index c2c08f26bd5f..766bd8457346 100644
> --- a/drivers/hv/hv_util.c
> +++ b/drivers/hv/hv_util.c
> @@ -413,8 +413,9 @@ static int util_probe(struct hv_device *dev,
>  
>   hv_set_drvdata(dev, srv);
>  
> - ret = vmbus_open(dev->channel, 4 * PAGE_SIZE, 4 * PAGE_SIZE, NULL,
> 0,
> - srv->util_cb, dev->channel);
> + ret = vmbus_open(dev->channel, 4 * HV_HYP_PAGE_SIZE,
> +  4 * HV_HYP_PAGE_SIZE, NULL, 0, srv->util_cb,
> +  dev->channel);
>   if (ret)
>   goto error;
>  

hv_util doesn't need lots of buffering. Why not define a fixed
value across all architectures. Maybe with some roundup to HV_HYP_PAGE_SIZE.

Re: [PATCH net-next v2 3/3] netlink: add validation of NLA_F_NESTED flag

2019-07-23 Thread Stephen Hemminger

On Thu,  2 May 2019 16:15:10 +0200 (CEST)
Michal Kubecek  wrote:

> Add new validation flag NL_VALIDATE_NESTED which adds three consistency
> checks of NLA_F_NESTED_FLAG:
> 
>   - the flag is set on attributes with NLA_NESTED{,_ARRAY} policy
>   - the flag is not set on attributes with other policies except NLA_UNSPEC
>   - the flag is set on attribute passed to nla_parse_nested()
> 
> Signed-off-by: Michal Kubecek 
> 
> v2: change error messages to mention NLA_F_NESTED explicitly

There are some cases where netlink related to IPv4 does not send nested
flag. You risk breaking older iproute2 and other tools being used on newer
kernel. I.e this patch may break binary compatibility. Have you tried running
with this on a very old distro (like Redhat Linux 9)?

Re: PROBLEM: Marvell 88E8040 (sky2) fails after hibernation

2019-07-15 Thread Stephen Hemminger

On Mon, 15 Jul 2019 21:09:44 +0200 (CEST)
Thomas Gleixner  wrote:

> Octavio,
> 
> On Mon, 15 Jul 2019, Octavio Alvarez wrote:
> > If I reboot with sky2.disable_msi=1, then I get IO-APIC and the bug does not
> > occur:
> > 
> >  19:  0  0  0  0   IO-APIC  19-fasteoi eth0
> > 
> > However, if I reboot without sky2.disable_msi=1 it properly starts as 
> > PCI-MSI
> > and then, after re-modprobing it it goes to IO-APIC, but the bug occurs
> > anyway:
> > 
> > $ cat /proc/interrupts | grep eth
> >  27:  0  1  0  0   PCI-MSI 3145728-edge
> > eth0
> > 
> > $ sudo modprobe -r sky2
> > [sudo] password for alvarezp:
> > 
> > $ sudo modprobe sky2 disable_msi=1
> > 
> > $ # hibernating and coming back hibernation
> > 
> > $ cat /proc/interrupts | grep eth
> >  19:  0  0  0  0   IO-APIC  19-fasteoi  eth0
> > 
> >   
> > > Also please check Linus suspicion about the module being reloaded after
> > > hibernation through some distro magic.  
> > 
> > This is not happening. Each time the driver is loaded the message "sky2:
> > driver version 1.30" is shown.
> > 
> > I confirm only 1 line for the sky2.disable_msi=1 from kernel boot and only 2
> > lines for re-modprobing.  
> 
> Odd. I still fail to make a connection to that commit you identified
> which merily restores the behaviour before the big changes.
> 
> As we cannot revert that commit by any means and as the hardware is known
> to have issues with MSI, the only option we have is to avoid MSI on that
> particular machine. I suspect that the fact that it is 'working' on some
> older kernel version does not necessarily mean that it works by design. It
> might as well be a works by chance thing.
> 
> Thanks for all the detective work you put into that and sorry that I can't
> come up with the magic cure for this.
> 
> Thanks,
> 
>   tglx

In the past, I had one ASUS motherboard with broken MSI and updating the
BIOS did fix it.

Re: [PATCH net-next] Name NICs based on vmbus offer and enable async probe by default

2019-07-09 Thread Stephen Hemminger

On Tue, 9 Jul 2019 22:56:30 +
Haiyang Zhang  wrote:

> - VRSS_CHANNEL_MAX);
> + if (probe_type == PROBE_PREFER_ASYNCHRONOUS) {
> + snprintf(name, IFNAMSIZ, "eth%d", dev->channel->dev_num);

What about PCI passthrough or VF devices that are also being probed and
consuming the ethN names.  Won't there be a collision?

[ANNOUNCE] iproute2 5.2

2019-07-08 Thread Stephen Hemminger

The 5.2 kernel has been released, and therefore time for another
update to iproute2.

Not a lot of big new features in this release. Just the usual array of
small fixes across the board.

Download:
https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-5.2.0.tar.gz

Repository for upcoming release:
git://git.kernel.org/pub/scm/network/iproute2/iproute2.git

And future release (net-next):
git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git

Thanks for all the contributions.

Report problems (or enhancements) to the net...@vger.kernel.org mailing list.

---
Andrea Claudi (6):
  Makefile: use make -C
  ip address: do not set nodad option for IPv4 addresses
  ip address: do not set home option for IPv4 addresses
  ip address: do not set mngtmpaddr option for IPv4 addresses
  man: tc-netem.8: fix URL for netem page
  tc: netem: fix r parameter in Bernoulli loss model

Baruch Siach (2):
  devlink: fix format string warning for 32bit targets
  devlink: fix libc and kernel headers collision

David Ahern (5):
  Update kernel headers
  Update kernel headers
  Update kernel headers
  uapi: wrap SIOCGSTAMP and SIOCGSTAMPNS in ifndef
  Update kernel headers

Davide Caratti (2):
  man: tc-skbedit.8: document 'inheritdsfield'
  tc: simple: don't hardcode the control action

Eyal Birger (1):
  tc: adjust xtables_match and xtables_target to changes in recent iptables

Gal Pressman (1):
  rdma: Update node type strings

Hangbin Liu (1):
  ip/iptoken: fix dump error when ipv6 disabled

Hoang Le (3):
  tipc: add link broadcast set method and ratio
  tipc: add link broadcast get
  tipc: add link broadcast man page

Ido Schimmel (2):
  ipneigh: Print neighbour offload indication
  devlink: Increase column size for larger shared buffers

Josh Hunt (1):
  ss: add option to print socket information on one line

Kristian Evensen (1):
  ip fou: Support binding FOU ports

Lucas Siba 2019-04-20 11:40 UTC (1):
  Update tc-bpf.8 man page examples

Lukasz Czapnik (1):
  tc: flower: fix port value truncation

Mahesh Bandewar (1):
  ip6tunnel: fix 'ip -6 {show|change} dev ' cmds

Matteo Croce (4):
  ip: reset netns after each command in batch mode
  netns: switch netns in the child when executing commands
  ip vrf: use hook to change VRF in the child
  netns: make netns_{save,restore} static

Michael Forney (1):
  ipmroute: Prevent overlapping storage of `filter` global

Mike Manning (1):
  iplink_vlan: add support for VLAN bridge binding flag

Moshe Shemesh (1):
  devlink: mnlg: Catch returned error value of dumpit commands

Nicolas Dichtel (3):
  lib: suppress error msg when filling the cache
  iplink: don't try to get ll addr len when creating an iface
  ip monitor: display interfaces from all groups

Nikolay Aleksandrov (2):
  iplink: bridge: add support for vlan_stats_per_port
  bridge: mdb: restore text output format

Paolo Abeni (2):
  tc: add support for plug qdisc
  m_mirred: don't bail if the control action is missing

Parav Pandit (1):
  devlink: Increase bus, device buffer size to 64 bytes

Pete Morici (1):
  Add support for configuring MACsec gcm-aes-256 cipher type.

Roman Mashak (1):
  tc: Fix binding of gact action by index.

Stefano Brivio (1):
  iproute: Set flags and attributes on dump to get IPv6 cached routes to be 
flushed

Stephen Hemminger (14):
  uapi: update to elf-em header
  uapi: add include/linux/net.h
  uapi: update headers to import asm-generic/sockios.h
  mailmap: add myself
  mailmap: map David's mail address
  uapi: add sockios.h
  uapi: merge bpf.h from 5.2
  rdma: update uapi headers
  man: fix macaddr section of ip-link
  uapi: minor upstream btf.h header change
  testsuite: intent if/else in Makefile
  uapi: update headers and add if_link.h and if_infiniband.h
  devlink: replace print macros with functions
  vv5.2.0

Steve Wise (4):
  Add .mailmap file
  rdma: add helper rd_sendrecv_msg()
  rdma: add 'link add/delete' commands
  rdma: man page update for link add/delete

Tomasz Torcz (1):
  ss: in --numeric mode, print raw numbers for data rates

Vinicius Costa Gomes (2):
  taprio: Add support for changing schedules
  taprio: Add support for cycle_time and cycle_time_extension

Re: [PATCH v2] PCI: hv: fix pci-hyperv build when SYSFS not enabled

2019-07-08 Thread Stephen Hemminger

On Sun, 7 Jul 2019 10:46:22 -0700
Randy Dunlap  wrote:

> On 7/3/19 11:06 AM, Haiyang Zhang wrote:
> > 
> >   
> >> -Original Message-
> >> From: Randy Dunlap 
> >> Sent: Wednesday, July 3, 2019 12:59 PM
> >> To: LKML ; linux-pci  >> p...@vger.kernel.org>  
> >> Cc: Matthew Wilcox ; Jake Oshins
> >> ; KY Srinivasan ; Haiyang
> >> Zhang ; Stephen Hemminger
> >> ; Sasha Levin ; Bjorn
> >> Helgaas ; linux-hyp...@vger.kernel.org; Dexuan
> >> Cui ; Yuehaibing 
> >> Subject: [PATCH v2] PCI: hv: fix pci-hyperv build when SYSFS not enabled
> >>
> >> From: Randy Dunlap 
> >>
> >> Fix build of drivers/pci/controller/pci-hyperv.o when
> >> CONFIG_SYSFS is not set/enabled by adding stubs for
> >> pci_create_slot() and pci_destroy_slot().
> >>
> >> Fixes these build errors:
> >>
> >> ERROR: "pci_destroy_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
> >> ERROR: "pci_create_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
> >>
> >> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot
> >> information")
> >>
> >> Signed-off-by: Randy Dunlap 
> >> Cc: Matthew Wilcox 
> >> Cc: Jake Oshins 
> >> Cc: "K. Y. Srinivasan" 
> >> Cc: Haiyang Zhang 
> >> Cc: Stephen Hemminger 
> >> Cc: Sasha Levin 
> >> Cc: Bjorn Helgaas 
> >> Cc: linux-...@vger.kernel.org
> >> Cc: linux-hyp...@vger.kernel.org
> >> Cc: Dexuan Cui 
> >> Cc: Yuehaibing 
> >> ---
> >> v2:
> >> - provide non-CONFIG_SYSFS stubs for pci_create_slot() and
> >>   pci_destroy_slot() [suggested by Matthew Wilcox ]
> >> - use the correct Fixes: tag [Dexuan Cui ]
> >>
> >>  include/linux/pci.h |   12 ++--
> >>  1 file changed, 10 insertions(+), 2 deletions(-)
> >>
> >> --- lnx-52-rc7.orig/include/linux/pci.h
> >> +++ lnx-52-rc7/include/linux/pci.h
> >> @@ -25,6 +25,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  #include 
> >>  #include 
> >>  #include 
> >> @@ -947,14 +948,21 @@ int pci_scan_root_bus_bridge(struct pci_
> >>  struct pci_bus *pci_add_new_bus(struct pci_bus *parent, struct pci_dev
> >> *dev,
> >>int busnr);
> >>  void pcie_update_link_speed(struct pci_bus *bus, u16 link_status);
> >> +#ifdef CONFIG_SYSFS
> >> +void pci_dev_assign_slot(struct pci_dev *dev);
> >>  struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
> >> const char *name,
> >> struct hotplug_slot *hotplug);
> >>  void pci_destroy_slot(struct pci_slot *slot);
> >> -#ifdef CONFIG_SYSFS
> >> -void pci_dev_assign_slot(struct pci_dev *dev);
> >>  #else
> >>  static inline void pci_dev_assign_slot(struct pci_dev *dev) { }
> >> +static inline struct pci_slot *pci_create_slot(struct pci_bus *parent,
> >> + int slot_nr,
> >> + const char *name,
> >> + struct hotplug_slot *hotplug) {
> >> +  return ERR_PTR(-EINVAL);
> >> +}
> >> +static inline void pci_destroy_slot(struct pci_slot *slot) { }
> >>  #endif
> >>  int pci_scan_slot(struct pci_bus *bus, int devfn);
> >>  struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn);
> >>  
> > 
> > The serial number in slot info is used to match VF NIC with Synthetic NIC.
> > Without selecting SYSFS, the SRIOV feature will fail on VM on Hyper-V and
> > Azure. The first version of this patch should be used.
> > 
> > @Stephen Hemminger how do you think?

Haiyang is right, accelerated networking won't work if slot is not recorded.

So the original patch (to depend on SYSFS) or using "select SYSFS" is
are necessary.

The whole thing is a bit of "angels dancing on the head of a pin" because
there is no good reason to build kernel without SYSFS in real world.
It would just be looking for trouble. As far as I can tell it is all
about getting "make randconfig" to work in more cases.

Re: [PATCH net-next v6 01/15] rtnetlink: provide permanent hardware address in RTM_NEWLINK

2019-07-02 Thread Stephen Hemminger

On Tue,  2 Jul 2019 13:49:44 +0200 (CEST)
Michal Kubecek  wrote:

> Permanent hardware address of a network device was traditionally provided
> via ethtool ioctl interface but as Jiri Pirko pointed out in a review of
> ethtool netlink interface, rtnetlink is much more suitable for it so let's
> add it to the RTM_NEWLINK message.
> 
> Add IFLA_PERM_ADDRESS attribute to RTM_NEWLINK messages unless the
> permanent address is all zeros (i.e. device driver did not fill it). As
> permanent address is not modifiable, reject userspace requests containing
> IFLA_PERM_ADDRESS attribute.
> 
> Note: we already provide permanent hardware address for bond slaves;
> unfortunately we cannot drop that attribute for backward compatibility
> reasons.
> 
> v5 -> v6: only add the attribute if permanent address is not zero
> 
> Signed-off-by: Michal Kubecek 

Do you want to make an iproute patch to display this?

Acked-by: Stephen Hemminger

Re: PROBLEM: Marvell 88E8040 (sky2) fails after hibernation

2019-06-24 Thread Stephen Hemminger

On Sun, 23 Jun 2019 14:54:13 +0200 (CEST)
Thomas Gleixner  wrote:

> Octavio,
> 
> On Sat, 22 Jun 2019, Thomas Gleixner wrote:
> > On Wed, 19 Jun 2019, Octavio Alvarez wrote:  
> > > On 6/13/19 3:45 PM, Thomas Gleixner wrote:  
> > > > Can you please provide the content of /proc/interrupts with the driver
> > > > loaded and working after boot (don't hibernate) for the following 
> > > > kernels:
> > > >   
> > > 
> > > $ cat linux-master-after-boot.txt
> > >CPU0   CPU1   CPU2   CPU3  
> >   
> > >  27:  1  0  0  0   PCI-MSI 3145728-edge 
> > > eth0  
> >   
> > > >Linus upstream + revert  
> > > 
> > > $ cat linux-master-reverted-after-boot.txt
> > >CPU0   CPU1   CPU2   CPU3  
> >   
> > >  27:  1  0  0  0   PCI-MSI 3145728-edge 
> > > eth0  
> >
> > > Meanwhile, here it is for 4.9, which is the latest Debian-provided kernel 
> > > and
> > > worked:
> > > 
> > > $ cat linux-4.9-after-boot.txt
> > >CPU0   CPU1   CPU2   CPU3  
> >   
> > >  24:  1  0  0  0   PCI-MSI 3145728-edge 
> > > eth0  
> >
> > > I will keep trying 4.14, unless you say otherwise.  
> > 
> > It would be interesting though I don't expect too much data.
> > 
> > So all of the above use PCI/MSI. That's at least a data point. I need to
> > stare into that driver again to figure out why this might make a
> > difference, but right now I'm lost.  
> 
> One other data point you could provide please:
> 
>  Load the driver on Linus master with the following module parameter:
> 
>disable_msi=1
> 
> That switches to INTx usage. Does the machine resume proper with that?
> 
> Thanks,
> 
>   tglx


Suspend/resume and hibernation issues are often related to BIOS issues.

Re: [PATCH v2 0/3] net: fddi: skfp: Use PCI generic definitions instead of private duplicates

2019-06-20 Thread Stephen Hemminger

On Thu, 20 Jun 2019 23:37:51 +0530
Puranjay Mohan  wrote:

> This patch series removes the private duplicates of PCI definitions in
> favour of generic definitions defined in pci_regs.h.
> 
> This driver only uses one of the generic PCI definitons, i.e.
> PCI_REVISION_ID, which is included from pci_regs.h and its private
> version is removed from skfbi.h with all other private defines.
> 
> The skfbi.h defines PCI_REV_ID which is renamed to PCI_REVISION_ID in
> drvfbi.c to make it compatible with the generic define in pci_regs.h.
> 
> Puranjay Mohan (3):
>   net: fddi: skfp: Rename PCI_REV_ID to PCI_REVISION_ID
>   net: fddi: skfp: Include generic PCI definitions
>   net: fddi: skfp: Remove unused private PCI definitions
> 
>  drivers/net/fddi/skfp/drvfbi.c  |   4 +-
>  drivers/net/fddi/skfp/h/skfbi.h | 207 +---
>  2 files changed, 3 insertions(+), 208 deletions(-)
> 

Does anyone still have the hardware to test this?
Maybe FDDI should be put out to pasture.

Re: Driver has suspect GRO implementation, TCP performance may be compromised.

2019-05-29 Thread Stephen Hemminger

On Wed, 29 May 2019 09:00:54 -0700
Eric Dumazet  wrote:

> On Wed, May 29, 2019 at 7:49 AM Paul Menzel  wrote:
> >
> > Dear Eric,
> >
> >
> > Thank you for the quick reply.
> >
> > On 05/28/19 19:18, Eric Dumazet wrote:  
> > > On 5/28/19 8:42 AM, Paul Menzel wrote:  
> >  
> > >> Occasionally, Linux outputs the message below on the workstation Dell
> > >> OptiPlex 5040 MT.
> > >>
> > >> TCP: net00: Driver has suspect GRO implementation, TCP performance 
> > >> may be compromised.
> > >>
> > >> Linux 4.14.55 and Linux 5.2-rc2 show the message, and the WWW also
> > >> gives some hits [1][2].
> > >>
> > >> ```
> > >> $ sudo ethtool -i net00
> > >> driver: e1000e
> > >> version: 3.2.6-k
> > >> firmware-version: 0.8-4
> > >> expansion-rom-version:
> > >> bus-info: :00:1f.6
> > >> supports-statistics: yes
> > >> supports-test: yes
> > >> supports-eeprom-access: yes
> > >> supports-register-dump: yes
> > >> supports-priv-flags: no
> > >> ```
> > >>
> > >> Can the driver e1000e be improved?
> > >>
> > >> Any idea, what triggers this, as I do not see it every boot? Download
> > >> of big files?
> > >>  
> > > Maybe the driver/NIC can receive frames bigger than MTU, although this 
> > > would be strange.
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 
> > > c61edd023b352123e2a77465782e0d32689e96b0..cb0194f66125bcba427e6e7e3cacf0c93040ef61
> > >  100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -150,8 +150,10 @@ static void tcp_gro_dev_warn(struct sock *sk, const 
> > > struct sk_buff *skb,
> > > rcu_read_lock();
> > > dev = dev_get_by_index_rcu(sock_net(sk), skb->skb_iif);
> > > if (!dev || len >= dev->mtu)
> > > -   pr_warn("%s: Driver has suspect GRO 
> > > implementation, TCP performance may be compromised.\n",
> > > -   dev ? dev->name : "Unknown driver");
> > > +   pr_warn("%s: Driver has suspect GRO 
> > > implementation, TCP performance may be compromised."
> > > +   " len %u mtu %u\n",
> > > +   dev ? dev->name : "Unknown driver",
> > > +   len, dev ? dev->mtu : 0);
> > > rcu_read_unlock();
> > > }
> > >  }  
> >
> > I applied your patch on commit 9fb67d643 (Merge tag 'pinctrl-v5.2-2' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl):
> >
> >  [ 5507.291769] TCP: net00: Driver has suspect GRO implementation, TCP 
> > performance may be compromised. len 1856 mtu 1500  
> 
> 
> The 'GRO' in the warning can be probably ignored, since this NIC does
> not implement its own GRO.
> 
> You can confirm this with this debug patch:
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 
> 0e09bede42a2bd2c912366a68863a52a22def8ee..014a43ce77e09664bda0568dd118064b006acd67
> 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -561,6 +561,9 @@ static void e1000_receive_skb(struct e1000_adapter 
> *adapter,
> if (staterr & E1000_RXD_STAT_VP)
> __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), tag);
> 
> +   if (skb->len > netdev->mtu)
> +   pr_err_ratelimited("received packet bigger (%u) than
> MTU (%u)\n",
> +  skb->len, netdev->mtu);
> napi_gro_receive(>napi, skb);
>  }

I think e1000 is one of those devices that only has receive limit as power of 2.
Therefore frames up to 2K can be received.

There always some confusion in Linux about whether MTU is transmit only or 
devices
have to enforce it on receive.

Re: [PATCH net-next] net: link_watch: prevent starvation when processing linkwatch wq

2019-05-27 Thread Stephen Hemminger

On Tue, 28 May 2019 09:04:18 +0800
Yunsheng Lin  wrote:

> On 2019/5/27 22:58, Stephen Hemminger wrote:
> > On Mon, 27 May 2019 09:47:54 +0800
> > Yunsheng Lin  wrote:
> >   
> >> When user has configured a large number of virtual netdev, such
> >> as 4K vlans, the carrier on/off operation of the real netdev
> >> will also cause it's virtual netdev's link state to be processed
> >> in linkwatch. Currently, the processing is done in a work queue,
> >> which may cause worker starvation problem for other work queue.
> >>
> >> This patch releases the cpu when link watch worker has processed
> >> a fixed number of netdev' link watch event, and schedule the
> >> work queue again when there is still link watch event remaining.
> >>
> >> Signed-off-by: Yunsheng Lin   
> > 
> > Why not put link watch in its own workqueue so it is scheduled
> > separately from the system workqueue?  
> 
> From testing and debuging, the workqueue runs on the cpu where the
> workqueue is schedule when using normal workqueue, even using its
> own workqueue instead of system workqueue. So if the cpu is busy
> processing the linkwatch event, it is not able to process other
> workqueue' work when the workqueue is scheduled on the same cpu.
> 
> Using unbound workqueue may solve the cpu starvation problem.
> But the __linkwatch_run_queue is called with rtnl_lock, so if it
> takes a lot time to process, other need to take the rtnl_lock may
> not be able to move forward.

Agree with the starvation issue. My cocern is that large number of
events that end up being delayed would impact things that are actually
watching for link events (like routing daemons).

It probably would be not accepted to do rtnl_unlock/sched_yield/rtnl_lock
in the loop, but that is another alternative.

Re: [PATCH net-next] macvlan: Replace strncpy() by strscpy()

2019-05-27 Thread Stephen Hemminger

On Mon, 27 May 2019 16:28:05 -0500
"Gustavo A. R. Silva"  wrote:

> On 5/27/19 4:20 PM, Stephen Hemminger wrote:
> > On Mon, 27 May 2019 13:38:55 -0500
> > "Gustavo A. R. Silva"  wrote:
> >   
> >> The strncpy() function is being deprecated. Replace it by the safer
> >> strscpy() and fix the following Coverity warning:
> >>
> >> "Calling strncpy with a maximum size argument of 16 bytes on destination
> >> array ifrr.ifr_ifrn.ifrn_name of size 16 bytes might leave the destination
> >> string unterminated."
> >>
> >> Notice that, unlike strncpy(), strscpy() always null-terminates the
> >> destination string.
> >>
> >> Addresses-Coverity-ID: 1445537 ("Buffer not null terminated")
> >> Signed-off-by: Gustavo A. R. Silva 
> >> ---
> >>  drivers/net/macvlan.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> >> index 61550122b563..0ccabde8e9c9 100644
> >> --- a/drivers/net/macvlan.c
> >> +++ b/drivers/net/macvlan.c
> >> @@ -831,7 +831,7 @@ static int macvlan_do_ioctl(struct net_device *dev, 
> >> struct ifreq *ifr, int cmd)
> >>struct ifreq ifrr;
> >>int err = -EOPNOTSUPP;
> >>  
> >> -  strncpy(ifrr.ifr_name, real_dev->name, IFNAMSIZ);
> >> +  strscpy(ifrr.ifr_name, real_dev->name, IFNAMSIZ);
> >>ifrr.ifr_ifru = ifr->ifr_ifru;
> >>  
> >>switch (cmd) {  
> > 
> > Why not use strlcpy like all the other places IFNAMSIZ is copied?
> >   
> 
> strlcpy() is also being deprecated.

Are you going to fix all these:
$ git grep strlcpy | grep IFNAMSIZ| wc -l
47

Re: [PATCH net-next] macvlan: Replace strncpy() by strscpy()

2019-05-27 Thread Stephen Hemminger

On Mon, 27 May 2019 13:38:55 -0500
"Gustavo A. R. Silva"  wrote:

> The strncpy() function is being deprecated. Replace it by the safer
> strscpy() and fix the following Coverity warning:
> 
> "Calling strncpy with a maximum size argument of 16 bytes on destination
> array ifrr.ifr_ifrn.ifrn_name of size 16 bytes might leave the destination
> string unterminated."
> 
> Notice that, unlike strncpy(), strscpy() always null-terminates the
> destination string.
> 
> Addresses-Coverity-ID: 1445537 ("Buffer not null terminated")
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/net/macvlan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index 61550122b563..0ccabde8e9c9 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -831,7 +831,7 @@ static int macvlan_do_ioctl(struct net_device *dev, 
> struct ifreq *ifr, int cmd)
>   struct ifreq ifrr;
>   int err = -EOPNOTSUPP;
>  
> - strncpy(ifrr.ifr_name, real_dev->name, IFNAMSIZ);
> + strscpy(ifrr.ifr_name, real_dev->name, IFNAMSIZ);
>   ifrr.ifr_ifru = ifr->ifr_ifru;
>  
>   switch (cmd) {

Why not use strlcpy like all the other places IFNAMSIZ is copied?

Re: [PATCH net-next] net: link_watch: prevent starvation when processing linkwatch wq

2019-05-27 Thread Stephen Hemminger

On Mon, 27 May 2019 09:47:54 +0800
Yunsheng Lin  wrote:

> When user has configured a large number of virtual netdev, such
> as 4K vlans, the carrier on/off operation of the real netdev
> will also cause it's virtual netdev's link state to be processed
> in linkwatch. Currently, the processing is done in a work queue,
> which may cause worker starvation problem for other work queue.
> 
> This patch releases the cpu when link watch worker has processed
> a fixed number of netdev' link watch event, and schedule the
> work queue again when there is still link watch event remaining.
> 
> Signed-off-by: Yunsheng Lin 

Why not put link watch in its own workqueue so it is scheduled
separately from the system workqueue?

Re: [PATCH net-next] hv_sock: perf: Allow the socket buffer size options to influence the actual socket buffers

2019-05-22 Thread Stephen Hemminger

On Wed, 22 May 2019 22:56:07 +
Sunil Muthuswamy  wrote:

> Currently, the hv_sock buffer size is static and can't scale to the
> bandwidth requirements of the application. This change allows the
> applications to influence the socket buffer sizes using the SO_SNDBUF and
> the SO_RCVBUF socket options.
> 
> Few interesting points to note:
> 1. Since the VMBUS does not allow a resize operation of the ring size, the
> socket buffer size option should be set prior to establishing the
> connection for it to take effect.
> 2. Setting the socket option comes with the cost of that much memory being
> reserved/allocated by the kernel, for the lifetime of the connection.
> 
> Perf data:
> Total Data Transfer: 1GB
> Single threaded reader/writer
> Results below are summarized over 10 iterations.
> 
> Linux hvsocket writer + Windows hvsocket reader:
> |-|
> |Packet size ->   |  128B   |   1KB   |   4KB   | 
>64KB |
> |-|
> |SO_SNDBUF size | | Throughput in MB/s (min/max/avg/median):  
> |
> |   v |   
> |
> |-|
> |  Default| 109/118/114/116 | 636/774/701/700 | 435/507/480/476 |   
> 410/491/462/470   |
> |  16KB   | 110/116/112/111 | 575/705/662/671 | 749/900/854/869 |   
> 592/824/692/676   |
> |  32KB   | 108/120/115/115 | 703/823/767/772 | 718/878/850/866 | 
> 1593/2124/2000/2085 |
> |  64KB   | 108/119/114/114 | 592/732/683/688 | 805/934/903/911 | 
> 1784/1943/1862/1843 |
> |-|
> 
> Windows hvsocket writer + Linux hvsocket reader:
> |-|
> |Packet size ->   | 128B|  1KB|  4KB| 
>64KB |
> |-|
> |SO_RCVBUF size | |   Throughput in MB/s (min/max/avg/median):
> |
> |   v |   
> |
> |-|
> |  Default| 69/82/75/73 | 313/343/333/336 |   418/477/446/445   |   
> 659/701/676/678   |
> |  16KB   | 69/83/76/77 | 350/401/375/382 |   506/548/517/516   |   
> 602/624/615/615   |
> |  32KB   | 62/83/73/73 | 471/529/496/494 |   830/1046/935/939  | 
> 944/1180/1070/1100  |
> |  64KB   | 64/70/68/69 | 467/533/501/497 | 1260/1590/1430/1431 | 
> 1605/1819/1670/1660 |
> |-|
> 
> Signed-off-by: Sunil Muthuswamy 

It looks like Exchange mangled you patch. It doesn't apply clean.


>

Re: [PATCH 2/2] ftpm: firmware TPM running in TEE

2019-04-02 Thread Stephen Hemminger

On Tue, 2 Apr 2019 12:33:16 -0700
"Sasha Levin"  wrote:

> +/*
> + * ftpm_tee_tpm_op_recv retrieve fTPM response.
> + * @param: chip, the tpm_chip description as specified in
> driver/char/tpm/tpm.h.
> + * @param: buf,  the buffer to store data.
> + * @param: count, the number of bytes to read.
> + * @return: In case of success the number of bytes received.
> + *   In other case, a < 0 value describing the issue.
> + */

You are using a docbook style comment but it doesn't start with
docbook prefix.

/**
 * ftpm_tee_tpm_op_recv retrieve fTPM response.
 *
 * @param: chip, the tpm_chip description as specified in driver/char/tpm/tpm.h.
 * @param: buf, the buffer to store data.
 * @param: count, the number of bytes to read.
 * @return: In case of success the number of bytes received.
 *  In other case, a < 0 value describing the issue.
 */

Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex

2019-03-28 Thread Stephen Hemminger

On Thu, 28 Mar 2019 00:30:57 -0400
Kimberly Brown  wrote:

> On Thu, Mar 21, 2019 at 04:04:20PM +, Michael Kelley wrote:
> > From: Kimberly Brown   Sent: Wednesday, March 20, 
> > 2019 8:48 PM  
> > > > > > Adding more locks will solve the problem but it seems like overkill.
> > > > > > Why not either use a reference count or an RCU style access for the
> > > > > > ring buffer?  
> > > > >
> > > > > I agree that a reference count or RCU could also solve this problem.
> > > > > Using mutex locks seemed like the most straightforward solution, but
> > > > > I'll certainly switch to a different approach if it's better!
> > > > >
> > > > > Are you concerned about the extra memory required for the mutex locks,
> > > > > read performance, or something else?  
> > > >
> > > > Locks in control path are ok, but my concern is performance of the
> > > > data path which puts packets in/out of rings. To keep reasonable 
> > > > performance,
> > > > no additional locking should be added in those paths.
> > > >
> > > > So if data path is using RCU, can/should the control operations also
> > > > use it?  
> > >   
> 
> 
> Hi Stephen,
> 
> Do you have any additional questions or suggestions for this race
> condition and the mutex locks? I think that your initial questions were
> addressed in the responses below. If there's anything else, please let
> me know!
> 
> Thanks,
> Kim
> 
> 
> > > The data path doesn't use RCU to protect the ring buffers.  
> > 
> > My $.02:  The mutex is obtained only in the sysfs path and the "delete
> > ringbuffers" path, neither of which is performance or concurrency 
> > sensitive. 
> > There's no change to any path that reads or writes data from/to the ring
> > buffers.  It seems like the mutex is the most straightforward solution to
> > preventing sysfs from accessing the ring buffer info while the memory is
> > being freed as part of "delete ringbuffers".
> > 
> > Michael  


I have no problems with the patch you did.
My discussion was more around the general issues with ringbuffers being detached
from the device. Not sure if it was even a good design choice but that is
something that is hard to fix now.

Re: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after tx_disable

2019-03-28 Thread Stephen Hemminger

On Thu, 28 Mar 2019 17:48:45 +
Haiyang Zhang  wrote:

> +static inline void netvsc_tx_enable(struct netvsc_device *nvscdev,
> + struct net_device *ndev)
> +{
> + nvscdev->tx_disable = false;
> + mb(); /* ensure queue wake up mechanism is on */
> +
> + netif_tx_wake_all_queues(ndev);
> +}

You don't need a full mb(). virt_wmb() should be sufficient.

Could I suggest an alternative approach.
You don't need to introduce a local tx_disable flag, the only place where a 
wakeup
could cause problems is after a send_completion was processed during detach 
state.

Instead, just avoid wakeup in that place.

--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -720,6 +720,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
struct netdev_queue *txq = netdev_get_tx_queue(ndev, q_idx);

if (netif_tx_queue_stopped(txq) &&
+   netif_device_present(ndev) &&
(hv_get_avail_to_write_percent(>outbound) >
 RING_AVAIL_PERCENT_HIWATER || queue_sends < 1)) {
netif_tx_wake_queue(txq);

Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex

2019-03-20 Thread Stephen Hemminger

On Sat, 16 Mar 2019 21:49:28 -0400
Kimberly Brown  wrote:

> On Thu, Mar 14, 2019 at 03:45:33PM -0700, Stephen Hemminger wrote:
> > On Thu, 14 Mar 2019 13:05:15 -0700
> > "Kimberly Brown"  wrote:
> >   
> > > Fix a race condition that can result in a ring buffer pointer being set
> > > to null while a "_show" function is reading the ring buffer's data. This
> > > problem was discussed here:
> > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org
> > > %2Flkml%2F2018%2F10%2F18%2F779data=02%7C01%7Csthemmin%40microsoft.com
> > > %7C1d7557d667b741bdbb6008d6a8b8620f%7C72f988bf86f141af91ab2d7cd011db47%7C1
> > > %7C0%7C636881907217609564sdata=1bUbLaxsODANM7lCBR8lxyYajNpufuwUW%2FOl
> > > vtGu2hU%3Dreserved=0
> > > 
> > > To fix the race condition, add a new mutex lock to the
> > > "hv_ring_buffer_info" struct. Add a new function,
> > > "hv_ringbuffer_pre_init()", where a channel's inbound and outbound
> > > ring_buffer_info mutex locks are initialized.
> > >
> > >  ... snip ...
> > 
> > Adding more locks will solve the problem but it seems like overkill.
> > Why not either use a reference count or an RCU style access for the
> > ring buffer?  
> 
> I agree that a reference count or RCU could also solve this problem.
> Using mutex locks seemed like the most straightforward solution, but
> I'll certainly switch to a different approach if it's better!
> 
> Are you concerned about the extra memory required for the mutex locks,
> read performance, or something else?

Locks in control path are ok, but my concern is performance of the
data path which puts packets in/out of rings. To keep reasonable performance,
no additional locking should be added in those paths.

So if data path is using RCU, can/should the control operations also
use it?

Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex

2019-03-14 Thread Stephen Hemminger

On Thu, 14 Mar 2019 13:05:15 -0700
"Kimberly Brown"  wrote:

> Fix a race condition that can result in a ring buffer pointer being set
> to null while a "_show" function is reading the ring buffer's data. This
> problem was discussed here:
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org
> %2Flkml%2F2018%2F10%2F18%2F779data=02%7C01%7Csthemmin%40microsoft.com
> %7C1d7557d667b741bdbb6008d6a8b8620f%7C72f988bf86f141af91ab2d7cd011db47%7C1
> %7C0%7C636881907217609564sdata=1bUbLaxsODANM7lCBR8lxyYajNpufuwUW%2FOl
> vtGu2hU%3Dreserved=0
> 
> To fix the race condition, add a new mutex lock to the
> "hv_ring_buffer_info" struct. Add a new function,
> "hv_ringbuffer_pre_init()", where a channel's inbound and outbound
> ring_buffer_info mutex locks are initialized.
> 
> Acquire/release the locks in the "hv_ringbuffer_cleanup()" function,
> which is where the ring buffer pointers are set to null.
> 
> Acquire/release the locks in the four channel-level "_show" functions
> that access ring buffer data. Remove the "const" qualifier from the
> "vmbus_channel" parameter and the "rbi" variable of the channel-level
> "_show" functions so that the locks can be acquired/released in these
> functions.
> 
> Acquire/release the locks in hv_ringbuffer_get_debuginfo(). Remove the
> "const" qualifier from the "hv_ring_buffer_info" parameter so that the
> locks can be acquired/released in this function.
> 
> Signed-off-by: Kimberly Brown 
> ---
>  drivers/hv/channel_mgmt.c |  2 +
>  drivers/hv/hyperv_vmbus.h |  1 +
>  drivers/hv/ring_buffer.c  | 19 -
>  drivers/hv/vmbus_drv.c| 82 +--
>  include/linux/hyperv.h|  7 +++-
>  5 files changed, 79 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 9d7f9c1c60c7..14543059cc3e 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -336,6 +336,8 @@ static struct vmbus_channel *alloc_channel(void)
>   tasklet_init(>callback_event,
>vmbus_on_event, (unsigned long)channel);
>  
> + hv_ringbuffer_pre_init(channel);
> +
>   return channel;
>  }
>  
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index a94aab94e0b5..e5467b821f41 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -193,6 +193,7 @@ extern void hv_synic_clockevents_cleanup(void);
>  
>  /* Interface */
>  
> +void hv_ringbuffer_pre_init(struct vmbus_channel *channel);
>  
>  int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
>  struct page *pages, u32 pagecnt);
> diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
> index 0386ff48c5ea..121a01c43298 100644
> --- a/drivers/hv/ring_buffer.c
> +++ b/drivers/hv/ring_buffer.c
> @@ -166,14 +166,18 @@ hv_get_ringbuffer_availbytes(const struct
> hv_ring_buffer_info *rbi,
>  }
>  
>  /* Get various debug metrics for the specified ring buffer. */
> -int hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info
> *ring_info,
> +int hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
>   struct hv_ring_buffer_debug_info
> *debug_info)
>  {
>   u32 bytes_avail_towrite;
>   u32 bytes_avail_toread;
>  
> - if (!ring_info->ring_buffer)
> + mutex_lock(_info->ring_buffer_mutex);
> +
> + if (!ring_info->ring_buffer) {
> + mutex_unlock(_info->ring_buffer_mutex);
>   return -EINVAL;
> + }
>  
>   hv_get_ringbuffer_availbytes(ring_info,
>_avail_toread,
> @@ -184,10 +188,19 @@ int hv_ringbuffer_get_debuginfo(const struct
> hv_ring_buffer_info *ring_info,
>   debug_info->current_write_index =
> ring_info->ring_buffer->write_index;
>   debug_info->current_interrupt_mask
>   = ring_info->ring_buffer->interrupt_mask;
> + mutex_unlock(_info->ring_buffer_mutex);
> +
>   return 0;
>  }
>  EXPORT_SYMBOL_GPL(hv_ringbuffer_get_debuginfo);
>  
> +/* Initialize a channel's ring buffer info mutex locks */
> +void hv_ringbuffer_pre_init(struct vmbus_channel *channel)
> +{
> + mutex_init(>inbound.ring_buffer_mutex);
> + mutex_init(>outbound.ring_buffer_mutex);
> +}
> +
>  /* Initialize the ring buffer. */
>  int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
>  struct page *pages, u32 page_cnt)
> @@ -240,8 +253,10 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info
> *ring_info,
>  /* Cleanup the ring buffer. */
>  void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info)
>  {
> + mutex_lock(_info->ring_buffer_mutex);
>   vunmap(ring_info->ring_buffer);
>   ring_info->ring_buffer = NULL;
> + mutex_unlock(_info->ring_buffer_mutex);
>  }
>  
>  /* Write to the ring buffer. */
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 7f15c41d952e..84f3a510b4c9 100644
> --- a/drivers/hv/vmbus_drv.c
> +++

Re: r8169 only works in promisc mode

2019-03-10 Thread Stephen Hemminger

On Sun, 10 Mar 2019 15:26:16 +0100
Heiner Kallweit  wrote:

> On 10.03.2019 15:18, Stephen Hemminger wrote:
> > On Sat, 9 Mar 2019 10:33:40 +0100
> > Heiner Kallweit  wrote:
> >   
> >> On 09.03.2019 03:16, Alex Xu (Hello71) wrote:  
> >>> After suspending, my r8169 (not actually 8169, just that driver) only 
> >>> receives packets when in promiscuous mode. I have tried disabling all 
> >>> offload features except highdma [fixed], and it doesn't fix the issue.
> >>>
> >>> I am using torvalds/linux, compiled about two days ago (not sure which 
> >>> commit). I think this happened after a recent upgrade (some time in the 
> >>> past week). Given the long testing cycle, I don't really want to bisect; 
> >>> I am hoping someone has some idea of where the issue could be.
> >>> 
> >> The provided information isn't enough to say anything. Please create a
> >> bug ticket in bugzilla. Needed information:
> >> - Last working and first failing kernel version (as reported by uname -a)
> >> - Is 5.0 ok?
> >> - Full dmesg log
> >> - lspci -vv output for the network card
> >> - If the issue happens only after resume from suspend, diff of the
> >>   chip registers (ethtool -d) before and after suspend.
> >> - Does removing / reloading the r8169 module fix the issue?
> >>  
> >>> Please CC me on replies.
> >>>
> >>> Thanks,
> >>> Alex.
> >>> 
> >> Heiner  
> > 
> > Bugzilla is not really used by Linux network developers.
> > I just filter/forward mail from bugzilla into netdev.
> >   
> To make sure I understand you correctly:
> Are you just saying "network developers don't actively scan bugzilla"
> or do you want to say that bugzilla shouldn't be used in general for
> netdev bugs/issues? What would be the right tool then?

both.

Please use the net...@vger.kernel.org mailing list for discussions.

Re: r8169 only works in promisc mode

2019-03-10 Thread Stephen Hemminger

On Sat, 9 Mar 2019 10:33:40 +0100
Heiner Kallweit  wrote:

> On 09.03.2019 03:16, Alex Xu (Hello71) wrote:
> > After suspending, my r8169 (not actually 8169, just that driver) only 
> > receives packets when in promiscuous mode. I have tried disabling all 
> > offload features except highdma [fixed], and it doesn't fix the issue.
> > 
> > I am using torvalds/linux, compiled about two days ago (not sure which 
> > commit). I think this happened after a recent upgrade (some time in the 
> > past week). Given the long testing cycle, I don't really want to bisect; 
> > I am hoping someone has some idea of where the issue could be.
> >   
> The provided information isn't enough to say anything. Please create a
> bug ticket in bugzilla. Needed information:
> - Last working and first failing kernel version (as reported by uname -a)
> - Is 5.0 ok?
> - Full dmesg log
> - lspci -vv output for the network card
> - If the issue happens only after resume from suspend, diff of the
>   chip registers (ethtool -d) before and after suspend.
> - Does removing / reloading the r8169 module fix the issue?
> 
> > Please CC me on replies.
> > 
> > Thanks,
> > Alex.
> >   
> Heiner

Bugzilla is not really used by Linux network developers.
I just filter/forward mail from bugzilla into netdev.

Re: [PATCH 1/2] ipmr: Make cache queue length configurable

2019-03-07 Thread Stephen Hemminger

On Thu,  7 Mar 2019 09:19:55 +1300
Brodie Greenfield  wrote:

> +ip_mr_cache_queue_length - INTEGER
> + Limit the number of multicast packets we can have in the queue to be
> + resolved.
> + Bear in mind that when an unresolved multicast packet is received,
> + there is an O(n) traversal of the queue. This should be considered
> + if increasing.
> +

Why not make it to a unsigned value? A negative value doesn't make
much sense here.

Although other sysctl values date back to a time when Linux was
sloppy about allowing negative values, it would be good to use unsigned
now.

Re: [PATCH 0/3] pci-hyperv: fix memory leak and add pci_destroy_slot()

2019-03-05 Thread Stephen Hemminger

On Mon, 4 Mar 2019 21:34:47 +
Dexuan Cui  wrote:

> Patch #1 fixes a memory leak caused by incorrectly-maintained hpdev->refs.
> 
> Patch #2 and #3 make sure the "slot" is removed in all the scenarios.
> Without them, in the quick hot-add/hot-remove test, systemd-dev may easily
> crash when trying to access a dangling sys file in /sys/bus/pci/slots/:
> "BUG: unable to handle kernel paging request".
> 
> BTW, Patch #2 was posted on Feb 7, 2019, and this is the v2: the change
> to hv_eject_device_work() in v1 is removed, as the change is only needed
> when we hot-remove the device and remove the pci-hyperv driver at the 
> same time. It looks more work is required to make this scenaro work
> correctly, and since removing the driver is not really a "usual" usage,
> we can address this scenario in the future.
> 
> Please review the patchset.
> 
> Dexuan Cui (3):
>   PCI: hv: Fix a memory leak in hv_eject_device_work()
>   PCI: hv: Add hv_pci_remove_slots() when we unload the driver
>   PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if
> necessary
> 
>  drivers/pci/controller/pci-hyperv.c | 23 +++
>  1 file changed, 23 insertions(+)


Thanks for fixing this.

Reviewed-by: Stephen Hemminger

Re: Request for suggestion on net device re-naming/re-ordering based on DT alias

2019-02-27 Thread Stephen Hemminger

On Wed, 27 Feb 2019 10:45:44 -0800
Florian Fainelli  wrote:

> On 2/27/19 10:40 AM, Stephen Hemminger wrote:
> > On Wed, 27 Feb 2019 17:24:03 +0530
> > Harini Katakam  wrote:
> >   
> >> Hi,
> >>
> >> We've had some users requesting control over net device name order
> >> when multiple ethernet devices are present on a system. I've tried a
> >> few solutions to this and looked it up on forums. But I apologize if
> >> I have missed something.
> >>
> >> I know that the current system allocates eth as per probe order
> >> but that is obviously not stably controlled by user (tried DT
> >> re-ordering and defer probe). One solution is to use DT alias names
> >> to write to (net_device)->name as follows:
> >> Devicetree:
> >> aliases {
> >>ethernet0 = 
> >>ethernet1 = 
> >> };
> >> Driver probe:
> >> +   /* Read ethernet DT alias id and assign to right device name*/
> >> +   id = of_alias_get_id(np, "ethernet");
> >> +   if (id < 0) {
> >> +   dev_warn(>dev, "failed to get alias id (%u)\n", id);
> >> +   return id;
> >> +   }
> >> +   snprintf(dev->name, sizeof(dev->name), "eth%d", id);
> >> +
> >>
> >> These three drivers seem to have something similar for mdio/phy bus IDs:
> >> drivers/net/ethernet/broadcom/genet/bcmmii.c:409: id =
> >> of_alias_get_id(dn, "eth");
> >> drivers/net/ethernet/samsung/sxgbe/sxgbe_platform.c:43: plat->bus_id =
> >> of_alias_get_id(np, "ethernet");
> >> drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c:404:
> >> plat->bus_id = of_alias_get_id(np, "ethernet");
> >>
> >> Drawback: This approach will break if alias is not provided for one
> >> of the interfaces on board. Not to mention, there could be systems
> >> with multiple ethernet makes (for ex. Cadence macb and Xilinx axienet)
> >> If one of the drivers does not have an alias read mechanism, it is
> >> possible to have clashing ID assignments. Is there any way this
> >> solution can be changed to be stable/acceptable?
> >>
> >> One other alternative I've tried is netdev kernel bootargs but this
> >> device name was not being picked by the kernel:
> >> https://www.kernel.org/doc/html/v4.19/admin-guide/kernel-parameters.html
> >> netdev= , , eth0
> >> netdev=, , eth1
> >>
> >> Could you please suggest any alternatives?
> >> Thanks!
> >>
> >> Regards,
> >> Harini  
> > 
> > Device naming is a hard problem, and there is no perfect solution.
> > 
> > Device tree should be providing hints to userspace policy for naming, not
> > trying to do it in the kernel.   
> 
> And Device Tree does already, if you look at the uevent attributes that
> the kernel sends, there should be ample information to uniquely
> determine which physical network device the network device maps to, e
> for instance:
> 
> cat /sys/class/net/eth*/device/uevent
> DRIVER=brcm-systemport
> OF_NAME=ethernet
> OF_FULLNAME=/rdb/ethernet@930
> OF_TYPE=network
> OF_COMPATIBLE_0=brcm,systemportlite-v1.00
> OF_COMPATIBLE_1=brcm,systemport
> OF_COMPATIBLE_N=2
> OF_ALIAS_0=eth0
> MODALIAS=of:NethernetTnetworkCbrcm,systemportlite-v1.00Cbrcm,systemport

Then you need to work with udev developers to handle device tree better.

By design ethN is not used for persistent naming, only other names like ensX.
This allows for safer renaming and provides way to handle devices that may
not match any rules.

Most of udev naming is based of properties in sysfs like pci-slot, port etc.
If you had that available on these devices, it would just work now.

Re: Request for suggestion on net device re-naming/re-ordering based on DT alias

2019-02-27 Thread Stephen Hemminger

On Wed, 27 Feb 2019 17:24:03 +0530
Harini Katakam  wrote:

> Hi,
> 
> We've had some users requesting control over net device name order
> when multiple ethernet devices are present on a system. I've tried a
> few solutions to this and looked it up on forums. But I apologize if
> I have missed something.
> 
> I know that the current system allocates eth as per probe order
> but that is obviously not stably controlled by user (tried DT
> re-ordering and defer probe). One solution is to use DT alias names
> to write to (net_device)->name as follows:
> Devicetree:
> aliases {
>ethernet0 = 
>ethernet1 = 
> };
> Driver probe:
> +   /* Read ethernet DT alias id and assign to right device name*/
> +   id = of_alias_get_id(np, "ethernet");
> +   if (id < 0) {
> +   dev_warn(>dev, "failed to get alias id (%u)\n", id);
> +   return id;
> +   }
> +   snprintf(dev->name, sizeof(dev->name), "eth%d", id);
> +
> 
> These three drivers seem to have something similar for mdio/phy bus IDs:
> drivers/net/ethernet/broadcom/genet/bcmmii.c:409: id =
> of_alias_get_id(dn, "eth");
> drivers/net/ethernet/samsung/sxgbe/sxgbe_platform.c:43: plat->bus_id =
> of_alias_get_id(np, "ethernet");
> drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c:404:
> plat->bus_id = of_alias_get_id(np, "ethernet");
> 
> Drawback: This approach will break if alias is not provided for one
> of the interfaces on board. Not to mention, there could be systems
> with multiple ethernet makes (for ex. Cadence macb and Xilinx axienet)
> If one of the drivers does not have an alias read mechanism, it is
> possible to have clashing ID assignments. Is there any way this
> solution can be changed to be stable/acceptable?
> 
> One other alternative I've tried is netdev kernel bootargs but this
> device name was not being picked by the kernel:
> https://www.kernel.org/doc/html/v4.19/admin-guide/kernel-parameters.html
> netdev= , , eth0
> netdev=, , eth1
> 
> Could you please suggest any alternatives?
> Thanks!
> 
> Regards,
> Harini

Device naming is a hard problem, and there is no perfect solution.

Device tree should be providing hints to userspace policy for naming, not
trying to do it in the kernel. 
See: 
https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/

Re: [PATCH hyperv-fixes] hv_netvsc: Fix IP header checksum for coalesced packets

2019-02-23 Thread Stephen Hemminger

On Fri, 22 Feb 2019 18:25:03 +
Haiyang Zhang  wrote:

> From: Haiyang Zhang 
> 
> Incoming packets may have IP header checksum verified by the host.
> They may not have IP header checksum computed after coalescing.
> This patch re-compute the checksum when necessary, otherwise the
> packets may be dropped, because Linux network stack always checks it.
> 
> Signed-off-by: Haiyang Zhang 
> ---
>  drivers/net/hyperv/netvsc_drv.c | 22 +++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index 256adbd044f5..cf4897043e83 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -744,6 +744,14 @@ void netvsc_linkstatus_callback(struct net_device *net,
>   schedule_delayed_work(_ctx->dwork, 0);
>  }
>  
> +static void netvsc_comp_ipcsum(struct sk_buff *skb)
> +{
> + struct iphdr *iph = (struct iphdr *)skb->data;

Can you use iphdr(skb) here?

> +
> + iph->check = 0;
> + iph->check = ip_fast_csum(iph, iph->ihl);
> +}
> +
>  static struct sk_buff *netvsc_alloc_recv_skb(struct net_device *net,
>struct netvsc_channel *nvchan)
>  {
> @@ -770,9 +778,17 @@ static struct sk_buff *netvsc_alloc_recv_skb(struct 
> net_device *net,
>   /* skb is already created with CHECKSUM_NONE */
>   skb_checksum_none_assert(skb);
>  
> - /*
> -  * In Linux, the IP checksum is always checked.
> -  * Do L4 checksum offload if enabled and present.
> + /* Incoming packets may have IP header checksum verified by the host.
> +  * They may not have IP header checksum computed after coalescing.
> +  * We compute it here if the flags are set, because on Linux, the IP
> +  * checksum is always checked.
> +  */
> + if (csum_info && csum_info->receive.ip_checksum_value_invalid &&
> + csum_info->receive.ip_checksum_succeeded &&
> + skb->protocol == htons(ETH_P_IP))
> + netvsc_comp_ipcsum(skb);

Does this still handle for coalesced and non-coalesced packets
which are received with bad IP checksum?  My concern is that you are
potentially correcting the checksum for a packet whose received checksum was 
bad.


> + /* Do L4 checksum offload if enabled and present.
>*/
>   if (csum_info && (net->features & NETIF_F_RXCSUM)) {
>   if (csum_info->receive.tcp_checksum_succeeded ||

RE: [PATCH V3 1/10] X86/Hyper-V: Add parameter offset for hyperv_fill_flush_guest_mapping_list()

2019-02-22 Thread Stephen Hemminger

int hyperv_fill_flush_guest_mapping_list(
struct hv_guest_mapping_flush_list *flush,
-   u64 start_gfn, u64 pages)
+   int offset, u64 start_gfn, u64 pages)
 {
u64 cur = start_gfn;
u64 additional_pages;
-   int gpa_n = 0;
+   int gpa_n = offset;
 
do {
/*

Do you mean to support negative offsets here? Maybe unsigned would be better?

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 2043 matches

Mail list logo