[PATCH net-next v3 3/5] bnxt_en: Use NETIF_F_GRO_HW.
Advertise NETIF_F_GRO_HW in hw_features if hardware GRO is supported. In bnxt_fix_features(), disable GRO_HW and LRO if current hardware configuration does not allow it. XDP setup will now rely on bnxt_fix_features() to turn off aggregation. During chip init, turn on or off hardware GRO based on NETIF_F_GRO_HW in features flag. Signed-off-by: Michael Chan--- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 24 +--- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index af6c83f..04dddf6 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -2755,7 +2755,7 @@ void bnxt_set_tpa_flags(struct bnxt *bp) return; if (bp->dev->features & NETIF_F_LRO) bp->flags |= BNXT_FLAG_LRO; - if (bp->dev->features & NETIF_F_GRO) + else if (bp->dev->features & NETIF_F_GRO_HW) bp->flags |= BNXT_FLAG_GRO; } @@ -2843,10 +2843,10 @@ int bnxt_set_rx_skb_mode(struct bnxt *bp, bool page_mode) min_t(u16, bp->max_mtu, BNXT_MAX_PAGE_MODE_MTU); bp->flags &= ~BNXT_FLAG_AGG_RINGS; bp->flags |= BNXT_FLAG_NO_AGG_RINGS | BNXT_FLAG_RX_PAGE_MODE; - bp->dev->hw_features &= ~NETIF_F_LRO; - bp->dev->features &= ~NETIF_F_LRO; bp->rx_dir = DMA_BIDIRECTIONAL; bp->rx_skb_func = bnxt_rx_page_skb; + /* Disable LRO or GRO_HW */ + netdev_update_features(bp->dev); } else { bp->dev->max_mtu = bp->max_mtu; bp->flags &= ~BNXT_FLAG_RX_PAGE_MODE; @@ -6788,6 +6788,12 @@ static netdev_features_t bnxt_fix_features(struct net_device *dev, if ((features & NETIF_F_NTUPLE) && !bnxt_rfs_capable(bp)) features &= ~NETIF_F_NTUPLE; + if (bp->flags & BNXT_FLAG_NO_AGG_RINGS) + features &= ~(NETIF_F_LRO | NETIF_F_GRO_HW); + + if (features & NETIF_F_GRO_HW) + features &= ~NETIF_F_LRO; + /* Both CTAG and STAG VLAN accelaration on the RX side have to be * turned on or off together. */ @@ -6821,9 +6827,9 @@ static int bnxt_set_features(struct net_device *dev, netdev_features_t features) bool update_tpa = false; flags &= ~BNXT_FLAG_ALL_CONFIG_FEATS; - if ((features & NETIF_F_GRO) && !BNXT_CHIP_TYPE_NITRO_A0(bp)) + if (features & NETIF_F_GRO_HW) flags |= BNXT_FLAG_GRO; - if (features & NETIF_F_LRO) + else if (features & NETIF_F_LRO) flags |= BNXT_FLAG_LRO; if (bp->flags & BNXT_FLAG_NO_AGG_RINGS) @@ -7924,8 +7930,8 @@ static int bnxt_get_dflt_rings(struct bnxt *bp, int *max_rx, int *max_tx, if (rc) return rc; bp->flags |= BNXT_FLAG_NO_AGG_RINGS; - bp->dev->hw_features &= ~NETIF_F_LRO; - bp->dev->features &= ~NETIF_F_LRO; + bp->dev->hw_features &= ~(NETIF_F_LRO | NETIF_F_GRO_HW); + bp->dev->features &= ~(NETIF_F_LRO | NETIF_F_GRO_HW); bnxt_set_ring_params(bp); } @@ -8108,7 +8114,11 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) dev->vlan_features = dev->hw_features | NETIF_F_HIGHDMA; dev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_RX | NETIF_F_HW_VLAN_STAG_TX; + if (!BNXT_CHIP_TYPE_NITRO_A0(bp)) + dev->hw_features |= NETIF_F_GRO_HW; dev->features |= dev->hw_features | NETIF_F_HIGHDMA; + if (dev->features & NETIF_F_GRO_HW) + dev->features &= ~NETIF_F_LRO; dev->priv_flags |= IFF_UNICAST_FLT; #ifdef CONFIG_BNXT_SRIOV -- 1.8.3.1
[PATCH net-next v3 1/5] net: Introduce NETIF_F_GRO_HW.
Introduce NETIF_F_GRO_HW feature flag for NICs that support hardware GRO. With this flag, we can now independently turn on or off hardware GRO when GRO is on. Previously, drivers were using NETIF_F_GRO to control hardware GRO and so it cannot be independently turned on or off without affecting GRO. Hardware GRO (just like GRO) guarantees that packets can be re-segmented by TSO/GSO to reconstruct the original packet stream. It is a subset of NETIF_F_GRO and depends on it, as well as NETIF_F_RXCSUM. Since NETIF_F_GRO is not propagated between upper and lower devices, NETIF_F_GRO_HW should follow suit since it is a subset of GRO. In other words, a lower device can independent have GRO/GRO_HW enabled or disabled and no feature propagation is required. This will preserve the current GRO behavior. Cc: Ariel EliorCc: everest-linux...@cavium.com Signed-off-by: Michael Chan --- Documentation/networking/netdev-features.txt | 8 include/linux/netdev_features.h | 3 +++ net/core/dev.c | 12 net/core/ethtool.c | 1 + 4 files changed, 24 insertions(+) diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt index 7413eb0..8f36527 100644 --- a/Documentation/networking/netdev-features.txt +++ b/Documentation/networking/netdev-features.txt @@ -163,3 +163,11 @@ This requests that the NIC receive all possible frames, including errored frames (such as bad FCS, etc). This can be helpful when sniffing a link with bad packets on it. Some NICs may receive more packets if also put into normal PROMISC mode. + +* rx-gro-hw + +This requests that the NIC enables Hardware GRO (generic receive offload). +Hardware GRO is basically the exact reverse of TSO, and is generally +stricter than Hardware LRO. A packet stream merged by Hardware GRO must +be re-segmentable by GSO or TSO back to the exact original packet stream. +Hardware GRO is dependent on GRO and RXCSUM. diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h index b1b0ca7..db84c51 100644 --- a/include/linux/netdev_features.h +++ b/include/linux/netdev_features.h @@ -78,6 +78,8 @@ enum { NETIF_F_HW_ESP_TX_CSUM_BIT, /* ESP with TX checksum offload */ NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */ + NETIF_F_GRO_HW_BIT, /* Hardware Generic receive offload */ + /* * Add your fresh new feature above and remember to update * netdev_features_strings[] in net/core/ethtool.c and maybe @@ -97,6 +99,7 @@ enum { #define NETIF_F_FRAGLIST __NETIF_F(FRAGLIST) #define NETIF_F_FSO__NETIF_F(FSO) #define NETIF_F_GRO__NETIF_F(GRO) +#define NETIF_F_GRO_HW __NETIF_F(GRO_HW) #define NETIF_F_GSO__NETIF_F(GSO) #define NETIF_F_GSO_ROBUST __NETIF_F(GSO_ROBUST) #define NETIF_F_HIGHDMA__NETIF_F(HIGHDMA) diff --git a/net/core/dev.c b/net/core/dev.c index e32cf5c..6ebd0e7 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -7424,6 +7424,18 @@ static netdev_features_t netdev_fix_features(struct net_device *dev, features &= ~dev->gso_partial_features; } + if (features & NETIF_F_GRO_HW) { + /* Hardware GRO depends on GRO and RXCSUM. */ + if (!(features & NETIF_F_GRO)) { + netdev_dbg(dev, "Dropping NETIF_F_GSO_HW since no GRO feature.\n"); + features &= ~NETIF_F_GRO_HW; + } + if (!(features & NETIF_F_RXCSUM)) { + netdev_dbg(dev, "Dropping NETIF_F_GSO_HW since no RXCSUM feature.\n"); + features &= ~NETIF_F_GRO_HW; + } + } + return features; } diff --git a/net/core/ethtool.c b/net/core/ethtool.c index f8fcf45..50a7920 100644 --- a/net/core/ethtool.c +++ b/net/core/ethtool.c @@ -73,6 +73,7 @@ int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info) [NETIF_F_LLTX_BIT] = "tx-lockless", [NETIF_F_NETNS_LOCAL_BIT] = "netns-local", [NETIF_F_GRO_BIT] = "rx-gro", + [NETIF_F_GRO_HW_BIT] = "rx-gro-hw", [NETIF_F_LRO_BIT] = "rx-lro", [NETIF_F_TSO_BIT] = "tx-tcp-segmentation", -- 1.8.3.1
[PATCH net-next v3 0/5] Introduce NETIF_F_GRO_HW
Introduce NETIF_F_GRO_HW feature flag and convert drivers that support hardware GRO to use the new flag. v3: - Let driver's ndo_fix_features() disable NETIF_F_LRO when NETIF_F_GRO_HW is set instead of doing it in common netdev_fix_features(). v2: - NETIF_F_GRO_HW flag propagation between upper and lower devices not required (see patch 1). - NETIF_F_GRO_HW depends on NETIF_F_GRO and NETIF_F_RXCSUM. - Add dev_disable_gro_hw() to disable GRO_HW for generic XDP. - Use ndo_fix_features() on all 3 drivers to drop GRO_HW when it is not supported Michael Chan (5): net: Introduce NETIF_F_GRO_HW. net: Disable GRO_HW when generic XDP is installed on a device. bnxt_en: Use NETIF_F_GRO_HW. bnx2x: Use NETIF_F_GRO_HW. qede: Use NETIF_F_GRO_HW. Documentation/networking/netdev-features.txt | 8 ++ drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 19 - drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 4 ++- drivers/net/ethernet/broadcom/bnxt/bnxt.c| 24 +++- drivers/net/ethernet/qlogic/qede/qede.h | 2 ++ drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 3 ++ drivers/net/ethernet/qlogic/qede/qede_filter.c | 20 - drivers/net/ethernet/qlogic/qede/qede_main.c | 17 --- include/linux/netdev_features.h | 3 ++ net/core/dev.c | 36 net/core/ethtool.c | 1 + 11 files changed, 104 insertions(+), 33 deletions(-) -- 1.8.3.1
[PATCH net-next v3 2/5] net: Disable GRO_HW when generic XDP is installed on a device.
Hardware should not aggregate any packets when generic XDP is installed. Cc: Ariel EliorCc: everest-linux...@cavium.com Signed-off-by: Michael Chan --- net/core/dev.c | 24 1 file changed, 24 insertions(+) diff --git a/net/core/dev.c b/net/core/dev.c index 6ebd0e7..ec08ace 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1542,6 +1542,29 @@ void dev_disable_lro(struct net_device *dev) } EXPORT_SYMBOL(dev_disable_lro); +/** + * dev_disable_gro_hw - disable HW Generic Receive Offload on a device + * @dev: device + * + * Disable HW Generic Receive Offload (GRO_HW) on a net device. Must be + * called under RTNL. This is needed if Generic XDP is installed on + * the device. + */ +static void dev_disable_gro_hw(struct net_device *dev) +{ + struct net_device *lower_dev; + struct list_head *iter; + + dev->wanted_features &= ~NETIF_F_GRO_HW; + netdev_update_features(dev); + + if (unlikely(dev->features & NETIF_F_GRO_HW)) + netdev_WARN(dev, "failed to disable GRO_HW!\n"); + + netdev_for_each_lower_dev(dev, lower_dev, iter) + dev_disable_gro_hw(lower_dev); +} + static int call_netdevice_notifier(struct notifier_block *nb, unsigned long val, struct net_device *dev) { @@ -4564,6 +4587,7 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp) } else if (new && !old) { static_key_slow_inc(_xdp_needed); dev_disable_lro(dev); + dev_disable_gro_hw(dev); } break; -- 1.8.3.1
[PATCH net-next v3 4/5] bnx2x: Use NETIF_F_GRO_HW.
Advertise NETIF_F_GRO_HW and turn on TPA_MODE_GRO when NETIF_F_GRO_HW is set. Disable NETIF_F_GRO_HW in bnx2x_fix_features() if the MTU does not support TPA_MODE_GRO. bnx2x_change_mtu() also needs to disable NETIF_F_GRO_HW if the MTU does not support it. Cc: Ariel EliorCc: everest-linux...@cavium.com Signed-off-by: Michael Chan --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 19 --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 4 +++- 2 files changed, 15 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c index 4c739d5..c75cfdd 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c @@ -2482,8 +2482,7 @@ static void bnx2x_bz_fp(struct bnx2x *bp, int index) */ if (bp->dev->features & NETIF_F_LRO) fp->mode = TPA_MODE_LRO; - else if (bp->dev->features & NETIF_F_GRO && -bnx2x_mtu_allows_gro(bp->dev->mtu)) + else if (bp->dev->features & NETIF_F_GRO_HW) fp->mode = TPA_MODE_GRO; else fp->mode = TPA_MODE_DISABLED; @@ -4874,6 +4873,9 @@ int bnx2x_change_mtu(struct net_device *dev, int new_mtu) */ dev->mtu = new_mtu; + if (!bnx2x_mtu_allows_gro(new_mtu)) + dev->features &= ~NETIF_F_GRO_HW; + if (IS_PF(bp) && SHMEM2_HAS(bp, curr_cfg)) SHMEM2_WR(bp, curr_cfg, CURR_CFG_MET_OS); @@ -4907,6 +4909,10 @@ netdev_features_t bnx2x_fix_features(struct net_device *dev, features &= ~NETIF_F_LRO; features &= ~NETIF_F_GRO; } + if (!bnx2x_mtu_allows_gro(dev->mtu)) + features &= ~NETIF_F_GRO_HW; + if (features & NETIF_F_GRO_HW) + features &= ~NETIF_F_LRO; return features; } @@ -4933,13 +4939,12 @@ int bnx2x_set_features(struct net_device *dev, netdev_features_t features) } } - /* if GRO is changed while LRO is enabled, don't force a reload */ - if ((changes & NETIF_F_GRO) && (features & NETIF_F_LRO)) - changes &= ~NETIF_F_GRO; + /* Don't care about GRO changes */ + changes &= ~NETIF_F_GRO; /* if GRO is changed while HW TPA is off, don't force a reload */ - if ((changes & NETIF_F_GRO) && bp->disable_tpa) - changes &= ~NETIF_F_GRO; + if ((changes & NETIF_F_GRO_HW) && bp->disable_tpa) + changes &= ~NETIF_F_GRO_HW; if (changes) bnx2x_reload = true; diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c index 91e2a75..1c7f821 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c @@ -13273,7 +13273,7 @@ static int bnx2x_init_dev(struct bnx2x *bp, struct pci_dev *pdev, dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_TSO | NETIF_F_TSO_ECN | NETIF_F_TSO6 | - NETIF_F_RXCSUM | NETIF_F_LRO | NETIF_F_GRO | + NETIF_F_RXCSUM | NETIF_F_LRO | NETIF_F_GRO | NETIF_F_GRO_HW | NETIF_F_RXHASH | NETIF_F_HW_VLAN_CTAG_TX; if (!chip_is_e1x) { dev->hw_features |= NETIF_F_GSO_GRE | NETIF_F_GSO_GRE_CSUM | @@ -13309,6 +13309,8 @@ static int bnx2x_init_dev(struct bnx2x *bp, struct pci_dev *pdev, dev->features |= dev->hw_features | NETIF_F_HW_VLAN_CTAG_RX; dev->features |= NETIF_F_HIGHDMA; + if (dev->features & NETIF_F_GRO_HW) + dev->features &= ~NETIF_F_LRO; /* Add Loopback capability to the device */ dev->hw_features |= NETIF_F_LOOPBACK; -- 1.8.3.1
[PATCH net-next v3 5/5] qede: Use NETIF_F_GRO_HW.
Advertise NETIF_F_GRO_HW and set edev->gro_disable according to the feature flag. Add qede_fix_features() to drop NETIF_F_GRO_HW if XDP is running or MTU does not support GRO_HW. qede_change_mtu() also checks and disables GRO_HW if MTU is not supported. Cc: Ariel EliorCc: everest-linux...@cavium.com Acked-by: Manish Chopra Signed-off-by: Michael Chan --- drivers/net/ethernet/qlogic/qede/qede.h | 2 ++ drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 3 +++ drivers/net/ethernet/qlogic/qede/qede_filter.c | 20 +--- drivers/net/ethernet/qlogic/qede/qede_main.c| 17 ++--- 4 files changed, 24 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/qlogic/qede/qede.h b/drivers/net/ethernet/qlogic/qede/qede.h index a3a70ad..8a33651 100644 --- a/drivers/net/ethernet/qlogic/qede/qede.h +++ b/drivers/net/ethernet/qlogic/qede/qede.h @@ -494,6 +494,8 @@ int qede_free_tx_pkt(struct qede_dev *edev, void qede_vlan_mark_nonconfigured(struct qede_dev *edev); int qede_configure_vlan_filters(struct qede_dev *edev); +netdev_features_t qede_fix_features(struct net_device *dev, + netdev_features_t features); int qede_set_features(struct net_device *dev, netdev_features_t features); void qede_set_rx_mode(struct net_device *ndev); void qede_config_rx_mode(struct net_device *ndev); diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c index dae7412..4ca3847 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c +++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c @@ -940,6 +940,9 @@ int qede_change_mtu(struct net_device *ndev, int new_mtu) DP_VERBOSE(edev, (NETIF_MSG_IFUP | NETIF_MSG_IFDOWN), "Configuring MTU size of %d\n", new_mtu); + if (new_mtu > PAGE_SIZE) + ndev->features &= ~NETIF_F_GRO_HW; + /* Set the mtu field and re-start the interface if needed */ args.u.mtu = new_mtu; args.func = _update_mtu; diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c b/drivers/net/ethernet/qlogic/qede/qede_filter.c index c1a0708..2de947e 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_filter.c +++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c @@ -895,19 +895,25 @@ static void qede_set_features_reload(struct qede_dev *edev, edev->ndev->features = args->u.features; } +netdev_features_t qede_fix_features(struct net_device *dev, + netdev_features_t features) +{ + struct qede_dev *edev = netdev_priv(dev); + + if (edev->xdp_prog || edev->ndev->mtu > PAGE_SIZE) + features &= ~NETIF_F_GRO_HW; + + return features; +} + int qede_set_features(struct net_device *dev, netdev_features_t features) { struct qede_dev *edev = netdev_priv(dev); netdev_features_t changes = features ^ dev->features; bool need_reload = false; - /* No action needed if hardware GRO is disabled during driver load */ - if (changes & NETIF_F_GRO) { - if (dev->features & NETIF_F_GRO) - need_reload = !edev->gro_disable; - else - need_reload = edev->gro_disable; - } + if (changes & NETIF_F_GRO_HW) + need_reload = true; if (need_reload) { struct qede_reload_args args; diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c index 57332b3..90d79ae 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_main.c +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c @@ -545,6 +545,7 @@ static int qede_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) #endif .ndo_vlan_rx_add_vid = qede_vlan_rx_add_vid, .ndo_vlan_rx_kill_vid = qede_vlan_rx_kill_vid, + .ndo_fix_features = qede_fix_features, .ndo_set_features = qede_set_features, .ndo_get_stats64 = qede_get_stats64, #ifdef CONFIG_QED_SRIOV @@ -572,6 +573,7 @@ static int qede_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) .ndo_change_mtu = qede_change_mtu, .ndo_vlan_rx_add_vid = qede_vlan_rx_add_vid, .ndo_vlan_rx_kill_vid = qede_vlan_rx_kill_vid, + .ndo_fix_features = qede_fix_features, .ndo_set_features = qede_set_features, .ndo_get_stats64 = qede_get_stats64, .ndo_udp_tunnel_add = qede_udp_tunnel_add, @@ -589,6 +591,7 @@ static int qede_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) .ndo_change_mtu = qede_change_mtu, .ndo_vlan_rx_add_vid = qede_vlan_rx_add_vid, .ndo_vlan_rx_kill_vid = qede_vlan_rx_kill_vid, + .ndo_fix_features = qede_fix_features, .ndo_set_features = qede_set_features, .ndo_get_stats64 = qede_get_stats64, .ndo_udp_tunnel_add = qede_udp_tunnel_add,
Re: [PATCH v5 2/2] sock: Move the socket inuse to namespace.
On Sat, Dec 9, 2017 at 6:09 AM, Cong Wangwrote: > On Thu, Dec 7, 2017 at 9:28 PM, Tonghao Zhang > wrote: >> >> Release the netlink sock created in kernel(not hold the _net_ namespace): >> > > You can avoid counting kernel sock by testing 'kern' in sk_alloc() > and testing 'sk->sk_net_refcnt' in __sk_free(). Hi cong, if we do it in this way, we will not counter the sock created in kernel, right ?
Incorrect source IP address on IGMP membership report
Closing a multicast socket after the final IPv4 address is deleted from an interface will generate a membership report that uses the source IP from a different interface. The following test script, run from an isolated netns, reproduces the issue: #!/bin/bash ip link add dummy0 type dummy ip link add dummy1 type dummy ip link set dummy0 up ip link set dummy1 up ip addr add 10.1.1.1/24 dev dummy0 ip addr add 192.168.99.99/24 dev dummy1 tcpdump -U -i dummy0 -w dummy0.pcap & socat EXEC:"sleep 2" UDP4-DATAGRAM:239.101.1.68:8889,ip-add-membership=239.0.1.68:10.1.1.1 & sleep 1 ip addr del 10.1.1.1/24 dev dummy0 sleep 5 kill %tcpdump After running this script, dummy0.pcap contains one Membership Report / Join Group packet with source IP 10.1.1.1, and two Membership Report / Leave Group packets with source IP 192.168.99.99. Sending out multicasts on the LAN using an unexpected source IP address seems to be causing issues in some enterprise environments[0], where the network infrastructure is set up to flag suspicious packets. I believe the source address is provided by ip_route_output_ports() called from igmpv3_newpack() in the kernel. Is this behavior intentional? If not, is it something that we should fix? I was able to suppress these reports with an iptables OUTPUT rule, but maybe there is a cleaner way. [0] https://bugs.chromium.org/p/chromium/issues/detail?id=786514
Re: [PATCH v5 2/2] sock: Move the socket inuse to namespace.
On Fri, Dec 8, 2017 at 9:24 PM, Eric Dumazetwrote: > On Fri, 2017-12-08 at 19:29 +0800, Tonghao Zhang wrote: >> hi all. we can add synchronize_rcu and rcu_barrier in >> sock_inuse_exit_net to >> ensure there are no outstanding rcu callbacks using this network >> namespace. >> we will not have to test if net->core.sock_inuse is NULL or not from >> sock_inuse_add(). :) >> >> static void __net_exit sock_inuse_exit_net(struct net *net) >> { >> free_percpu(net->core.prot_inuse); >> + >> + synchronize_rcu(); >> + rcu_barrier(); >> + >> + free_percpu(net->core.sock_inuse); >> } > > > Oh well. Do you have any idea of the major problem this would add ? > > Try the following, before and after your patches : > > for i in `seq 1 40` > do > (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) & > done > wait > > ( Check commit 8ca712c373a462cfa1b62272870b6c2c74aa83f9 ) > Yes, I did the test. The patches drop the performance. Before patch: # time ./add_del_unshare.sh net_namespace 97125 601658 : tunables00 0 : slabdata 25 25 0 real 8m19.665s user 0m4.268s sys 0m6.477s After : # time ./add_del_unshare.sh net_namespace102130 601658 : tunables00 0 : slabdata 26 26 0 real 8m52.563s user 0m4.040s sys 0m7.558s > > This is a complex problem, we wont accept patches that kill network > namespaces dismantling performance by adding brute force > synchronize_rcu() or rcu_barrier() calls. > > Why not freeing net->core.sock_inuse right before feeing net itself in > net_free() ? I try this way, alloc core.sock_inuse in net_alloc(), free it in net_free (). It does not drop performance, and we will not always to check the core.sock_inuse in sock_inuse_add(). After : # time ./add_del_unshare.sh net_namespace109135 601658 : tunables00 0 : slabdata 27 27 0 real 8m19.265s user 0m4.090s sys 0m8.185s > You do not have to hijack sock_inuse_exit_net() just because it has a > misleading name. > >
Fund Transfer
Hello, I'm Mr. Hazim, A banker working in a bank in my country; there is a certain deceased customer of my bank who left behind her fund. I seek your partnership in receiving this fund. If interested, reply immediately for detailed information. My sincere regards, Mr. Hazim Issa.
RE: [PATCH net-next 1/1] net: dsa: microchip: Add Microchip KSZ8895 DSA driver
> Ok, let me retry: > > > One thing to debug this problem is to dump the MIB counters. Use the > ethtool > > utility to show MIB counters of both ports: > > > > ethtool -S lan3 > > ethtool -S eth0 > > > > Assuming eth0 is the MAC controller that drives the switch, the receive > counters of > > the host port of the switch should match the transmit counters of > > lan3, and vice versa. > > Hmm. I'm getting some "interesting" results from mii-tool: > > root@miro:~# mii-tool lan3 > lan3: negotiated 1000baseT-HD flow-control, link ok > > But IIRC the switch is 100mbit? And dmesg does get it right. Its just > mii-tool that is confused. > > Link detection seems to work > > root@miro:/sys/bus/spi/devices/spi2.0# mii-tool lan1 > lan1: negotiated 1000baseT-HD flow-control, link ok > root@miro:/sys/bus/spi/devices/spi2.0# mii-tool lan1 > lan1: no link > > (But that really should be 100baseT, not 1000baseT). ethtool lan3 should also report the correct setting. > Is there register dump available somewhere? I was using > /sys/bus/spi/devices/spi32766.0/registers but this does not seem to be > available. There is a patch to add that functionality. It is very simple and I will send it to you later. Without that it is hard to debug the DSA driver if there is something wrong. I also have a simple utility to communicate with that registers file to read/write register individually. Is there a standard Linux utility for that function? > I tried that ethtool -S, and counters do not seem to match: > root@miro:~# ethtool -S eth0 > NIC statistics: > tx_dropped: 0 > tx_packets: 55 > tx_broadcast: 35 > tx_multicast: 20 > tx_crc_errors: 0 > tx_undersize: 0 > tx_oversize: 0 > tx_fragment: 0 > tx_jabber: 0 > tx_collision: 0 > tx_64byte: 0 > tx_65to127byte: 35 > tx_128to255byte: 0 > tx_256to511byte: 20 > tx_512to1023byte: 0 > tx_1024to2047byte: 0 > tx_GTE2048byte: 0 > tx_octets: 9576 > IEEE_tx_drop: 0 > IEEE_tx_frame_ok: 55 > IEEE_tx_1col: 0 > IEEE_tx_mcol: 0 > IEEE_tx_def: 0 > IEEE_tx_lcol: 0 > IEEE_tx_excol: 0 > IEEE_tx_macerr: 0 > IEEE_tx_cserr: 0 > IEEE_tx_sqe: 0 > IEEE_tx_fdxfc: 0 > IEEE_tx_octets_ok: 9576 > rx_packets: 0 > rx_broadcast: 0 > rx_multicast: 0 > rx_crc_errors: 0 > rx_undersize: 0 > rx_oversize: 0 > rx_fragment: 0 > rx_jabber: 0 > rx_64byte: 0 > rx_65to127byte: 0 > rx_128to255byte: 0 > rx_256to511byte: 0 > rx_512to1023byte: 0 > rx_1024to2047byte: 0 > rx_GTE2048byte: 0 > rx_octets: 0 > IEEE_rx_drop: 0 > IEEE_rx_frame_ok: 0 > IEEE_rx_crc: 0 > IEEE_rx_align: 0 > IEEE_rx_macerr: 0 > IEEE_rx_fdxfc: 0 > IEEE_rx_octets_ok: 0 These are the MIB counters from the switch host port: > p04_rx: 660 > p04_rx_hi: 0 > p04_rx_undersize: 0 > p04_rx_fragments: 20 This indicates a problem with the MAC. Are you using a MII or RMII version? > p04_rx_oversize: 0 > p04_rx_jabbers: 0 > p04_rx_symbol_err: 0 > p04_rx_crc_err: 0 > p04_rx_align_err: 0 > p04_rx_mac_ctrl: 0 > p04_rx_pause: 0 > p04_rx_bcast: 0 > p04_rx_mcast: 0 > p04_rx_ucast: 0 > p04_rx_64_or_less: 0 > p04_rx_65_127: 0 > p04_rx_128_255: 0 > p04_rx_256_511: 0 > p04_rx_512_1023: 0 > p04_rx_1024_1522: 0 > p04_tx: 388 > p04_tx_hi: 0 > p04_tx_late_col: 0 > p04_tx_pause: 0 > p04_tx_bcast: 0 > p04_tx_mcast: 3 This indicates the host port tried to send frames to the MAC. > p04_tx_ucast: 0 > p04_tx_deferred: 0 > p04_tx_total_col: 0 > p04_tx_exc_col: 0 > p04_tx_single_col: 0 > p04_tx_mult_col: 0 > p04_rx_discards: 0 > p04_tx_discards: 0 > root@miro:~# ethtool -S lan3 > NIC statistics: > tx_packets: 24 > tx_bytes: 1356 > rx_packets: 0 > rx_bytes: 0 The previous counters are from DSA. The rest are MIB counters of the port. > rx: 566 > rx_hi: 0 > rx_undersize: 0 > rx_fragments: 0 > rx_oversize: 0 > rx_jabbers: 0 > rx_symbol_err: 0 > rx_crc_err: 0 > rx_align_err: 0 > rx_mac_ctrl: 0 > rx_pause: 0 > rx_bcast: 0 > rx_mcast: 4 > rx_ucast: 0 > rx_64_or_less: 0 > rx_65_127: 1 > rx_128_255: 3 > rx_256_511: 0 > rx_512_1023: 0 > rx_1024_1522: 0 > tx: 0 > tx_hi: 0 > tx_late_col: 0 > tx_pause: 0 > tx_bcast: 0 > tx_mcast: 0 > tx_ucast: 0 > tx_deferred: 0 > tx_total_col: 0 > tx_exc_col: 0 > tx_single_col: 0 > tx_mult_col: 0 > rx_discards: 0 > tx_discards: 0 They just reported frames are received from the port. Because of problem with the host port there is no transmission coming from the host port.
[PATCH net-next 4/4] nfp: debug dump - decrease endian conversions
From: Carl HeymannConvert the requested dump level parameter to big-endian at the start of nfp_net_dump_calculate_size() and nfp_net_dump_populate_buffer(), then compare and assign it directly where needed in the traversal and prolog code. This decreases the total number of conversions used. Signed-off-by: Carl Heymann Reviewed-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c b/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c index cb74602f0907..cbff0adad235 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c @@ -126,13 +126,13 @@ struct nfp_dump_error { /* to track state through debug size calculation TLV traversal */ struct nfp_level_size { - u32 requested_level;/* input */ + __be32 requested_level; /* input */ u32 total_size; /* output */ }; /* to track state during debug dump creation TLV traversal */ struct nfp_dump_state { - u32 requested_level;/* input param */ + __be32 requested_level; /* input param */ u32 dumped_size;/* adds up to size of dumped data */ u32 buf_size; /* size of buffer pointer to by p */ void *p;/* current point in dump buffer */ @@ -334,7 +334,7 @@ nfp_calc_specific_level_size(struct nfp_pf *pf, struct nfp_dump_tl *dump_level, { struct nfp_level_size *lev_sz = param; - if (be32_to_cpu(dump_level->type) != lev_sz->requested_level) + if (dump_level->type != lev_sz->requested_level) return 0; return nfp_traverse_tlvs(pf, dump_level->data, @@ -348,7 +348,7 @@ s64 nfp_net_dump_calculate_size(struct nfp_pf *pf, struct nfp_dumpspec *spec, struct nfp_level_size lev_sz; int err; - lev_sz.requested_level = flag; + lev_sz.requested_level = cpu_to_be32(flag); lev_sz.total_size = ALIGN8(sizeof(struct nfp_dump_prolog)); err = nfp_traverse_tlvs(pf, spec->data, spec->size, _sz, @@ -733,7 +733,7 @@ nfp_dump_specific_level(struct nfp_pf *pf, struct nfp_dump_tl *dump_level, { struct nfp_dump_state *dump = param; - if (be32_to_cpu(dump_level->type) != dump->requested_level) + if (dump_level->type != dump->requested_level) return 0; return nfp_traverse_tlvs(pf, dump_level->data, @@ -753,7 +753,7 @@ static int nfp_dump_populate_prolog(struct nfp_dump_state *dump) if (err) return err; - prolog->dump_level = cpu_to_be32(dump->requested_level); + prolog->dump_level = dump->requested_level; return 0; } @@ -764,7 +764,7 @@ int nfp_net_dump_populate_buffer(struct nfp_pf *pf, struct nfp_dumpspec *spec, struct nfp_dump_state dump; int err; - dump.requested_level = dump_param->flag; + dump.requested_level = cpu_to_be32(dump_param->flag); dump.dumped_size = 0; dump.p = dest; dump.buf_size = dump_param->len; -- 2.15.1
[PATCH net-next 2/4] nfp: flower: remove dead code paths
From: John HurleyPort matching is selected by default on every rule so remove check for it and delete 'else' side of the statement. Remove nfp_flower_meta_one as now it will not feature in the code. Rename nfp_flower_meta_two given that one has been removed. 'Additional metadata' if statement can never be true so remove it as well. Signed-off-by: John Hurley Reviewed-by: Dirk van der Merwe Reviewed-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/flower/cmsg.h | 16 + drivers/net/ethernet/netronome/nfp/flower/match.c | 79 -- .../net/ethernet/netronome/nfp/flower/offload.c| 2 +- 3 files changed, 29 insertions(+), 68 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h index 66070741d55f..430a3932cf4d 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h +++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h @@ -165,20 +165,6 @@ struct nfp_fl_pop_vlan { __be16 reserved; }; -/* Metadata without L2 (1W/4B) - * - *3 2 1 - * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 - * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - * | key_layers |mask_id| reserved| - * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - */ -struct nfp_flower_meta_one { - u8 nfp_flow_key_layer; - u8 mask_id; - u16 reserved; -}; - struct nfp_fl_pre_tunnel { struct nfp_fl_act_head head; __be16 reserved; @@ -209,7 +195,7 @@ struct nfp_fl_set_vxlan { * NOTE: | TCI | * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ */ -struct nfp_flower_meta_two { +struct nfp_flower_meta_tci { u8 nfp_flow_key_layer; u8 mask_id; __be16 tci; diff --git a/drivers/net/ethernet/netronome/nfp/flower/match.c b/drivers/net/ethernet/netronome/nfp/flower/match.c index 60614d4f0e22..1f2b879e12d4 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/match.c +++ b/drivers/net/ethernet/netronome/nfp/flower/match.c @@ -38,7 +38,7 @@ #include "main.h" static void -nfp_flower_compile_meta_tci(struct nfp_flower_meta_two *frame, +nfp_flower_compile_meta_tci(struct nfp_flower_meta_tci *frame, struct tc_cls_flower_offload *flow, u8 key_type, bool mask_version) { @@ -46,7 +46,7 @@ nfp_flower_compile_meta_tci(struct nfp_flower_meta_two *frame, struct flow_dissector_key_vlan *flow_vlan; u16 tmp_tci; - memset(frame, 0, sizeof(struct nfp_flower_meta_two)); + memset(frame, 0, sizeof(struct nfp_flower_meta_tci)); /* Populate the metadata frame. */ frame->nfp_flow_key_layer = key_type; frame->mask_id = ~0; @@ -67,14 +67,6 @@ nfp_flower_compile_meta_tci(struct nfp_flower_meta_two *frame, } } -static void -nfp_flower_compile_meta(struct nfp_flower_meta_one *frame, u8 key_type) -{ - frame->nfp_flow_key_layer = key_type; - frame->mask_id = 0; - frame->reserved = 0; -} - static int nfp_flower_compile_port(struct nfp_flower_in_port *frame, u32 cmsg_port, bool mask_version, enum nfp_flower_tun_type tun_type) @@ -278,49 +270,32 @@ int nfp_flower_compile_flow_match(struct tc_cls_flower_offload *flow, ext = nfp_flow->unmasked_data; msk = nfp_flow->mask_data; - if (NFP_FLOWER_LAYER_PORT & key_ls->key_layer) { - /* Populate Exact Metadata. */ - nfp_flower_compile_meta_tci((struct nfp_flower_meta_two *)ext, - flow, key_ls->key_layer, false); - /* Populate Mask Metadata. */ - nfp_flower_compile_meta_tci((struct nfp_flower_meta_two *)msk, - flow, key_ls->key_layer, true); - ext += sizeof(struct nfp_flower_meta_two); - msk += sizeof(struct nfp_flower_meta_two); - - /* Populate Exact Port data. */ - err = nfp_flower_compile_port((struct nfp_flower_in_port *)ext, - nfp_repr_get_port_id(netdev), - false, tun_type); - if (err) - return err; - - /* Populate Mask Port Data. */ - err = nfp_flower_compile_port((struct nfp_flower_in_port *)msk, - nfp_repr_get_port_id(netdev), - true, tun_type); - if (err) - return err; - - ext += sizeof(struct
[PATCH net-next 0/4] nfp: dead code, clean ups and slight improvements
Hi! This series contains small clean ups from John and Carl, and brings no functional changes. John's improvements target the flower code. First he makes sure we don't allocate space in FW request messages for MAC matches if the TC rule does not contain any. The remaining two patches remove some dead code and unused defines. Carl follows up with a slight optimization to his recent ethtool FW state dumps, byte swapping input parameters once instead of the data for every dumped item. Carl Heymann (1): nfp: debug dump - decrease endian conversions John Hurley (3): nfp: flower: do not assume mac/mpls matches nfp: flower: remove dead code paths nfp: flower: remove unused defines drivers/net/ethernet/netronome/nfp/flower/cmsg.h | 19 +- drivers/net/ethernet/netronome/nfp/flower/match.c | 79 -- .../net/ethernet/netronome/nfp/flower/offload.c| 13 ++-- .../net/ethernet/netronome/nfp/nfp_net_debugdump.c | 14 ++-- 4 files changed, 44 insertions(+), 81 deletions(-) -- 2.15.1
[PATCH net-next 3/4] nfp: flower: remove unused defines
From: John HurleyDelete match field defines that are not supported at this time. Signed-off-by: John Hurley Reviewed-by: Dirk van der Merwe Reviewed-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/flower/cmsg.h | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h index 430a3932cf4d..d6b63c8f14da 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h +++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h @@ -50,9 +50,6 @@ #define NFP_FLOWER_LAYER_CTBIT(6) #define NFP_FLOWER_LAYER_VXLAN BIT(7) -#define NFP_FLOWER_LAYER_ETHER BIT(3) -#define NFP_FLOWER_LAYER_ARP BIT(4) - #define NFP_FLOWER_MASK_VLAN_PRIO GENMASK(15, 13) #define NFP_FLOWER_MASK_VLAN_CFI BIT(12) #define NFP_FLOWER_MASK_VLAN_VID GENMASK(11, 0) -- 2.15.1
[PATCH net-next 1/4] nfp: flower: do not assume mac/mpls matches
From: John HurleyRemove the matching of mac/mpls as a default selection. These are not necessarily set by a TC rule (unlike the port). Previously a mac/mpls field would exist in every match and be masked out if not used. This patch has no impact on functionality but removes unnessary memory assignment in the match cmsg. Signed-off-by: John Hurley Reviewed-by: Dirk van der Merwe Reviewed-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/flower/offload.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c index 553f94f55dce..1b7c59a8b139 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c @@ -150,10 +150,15 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls, return -EOPNOTSUPP; key_layer_two = 0; - key_layer = NFP_FLOWER_LAYER_PORT | NFP_FLOWER_LAYER_MAC; + key_layer = NFP_FLOWER_LAYER_PORT; key_size = sizeof(struct nfp_flower_meta_one) + - sizeof(struct nfp_flower_in_port) + - sizeof(struct nfp_flower_mac_mpls); + sizeof(struct nfp_flower_in_port); + + if (dissector_uses_key(flow->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS) || + dissector_uses_key(flow->dissector, FLOW_DISSECTOR_KEY_MPLS)) { + key_layer |= NFP_FLOWER_LAYER_MAC; + key_size += sizeof(struct nfp_flower_mac_mpls); + } if (dissector_uses_key(flow->dissector, FLOW_DISSECTOR_KEY_ENC_CONTROL)) { -- 2.15.1
Re: [PATCH v5 net-next] net/tcp: trace all TCP/IP state transition with tcp_set_state tracepoint
2017-12-09 0:28 GMT+08:00 David Miller: > From: Yafang Shao > Date: Fri, 8 Dec 2017 23:50:44 +0800 > >> 2017-12-08 23:42 GMT+08:00 David Miller : >>> From: Yafang Shao >>> Date: Fri, 8 Dec 2017 11:40:23 +0800 >>> It will looks like these, if (sk->sk_protocol == IPPROTO_TCP) __tcp_set_state(newsk, TCP_SYN_RECV); else newsk->sk_state = TCP_SYN_RECV; if (sk->sk_protocol == IPPROTO_TCP) __tcp_set_state(sk, TCP_CLOSE); else sk->sk_state = TCP_CLOSE; if (sk->sk_protocol == IPPROTO_TCP) tcp_state_store(sk, state); else sk_state_store(sk, state); Some redundant code. IMO, put these similar code into a wrapper is more nice. >>> >>> I think this discussion and how ugly this is getting shows that >>> tracing the state transitions of a socket is perhaps not best as a TCP >>> specific feature. >> >> Do you mean that tcp_set_state tracepoint should be replaced with >> sk_set_state tracepoint and move that tracepoint to >> trace/events/sock.h ? > > Yes, something like that. > > It will avoid all of these protocol specific checks and weird > dependencies. That looks like a good idea. I will try to reimplemnet it. Thanks Yafang
[PATCH] wcn36xx: Reduce spinlock in indication handler
The purpose of pushing indication on a list and handle these in a separate worker is to allow the handlers to sleep. It does therefor not make much sense to hold the queue spinlock through the entire indication worker function. By removing items from the queue early we don't need to hold the lock throughout the indication worker, allowing the individual handlers to sleep. Signed-off-by: Bjorn Andersson--- drivers/net/wireless/ath/wcn36xx/smd.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c b/drivers/net/wireless/ath/wcn36xx/smd.c index fa88a2a460aa..52daae863aed 100644 --- a/drivers/net/wireless/ath/wcn36xx/smd.c +++ b/drivers/net/wireless/ath/wcn36xx/smd.c @@ -2414,6 +2414,8 @@ static void wcn36xx_ind_smd_work(struct work_struct *work) hal_ind_msg = list_first_entry(>hal_ind_queue, struct wcn36xx_hal_ind_msg, list); + list_del(wcn->hal_ind_queue.next); + spin_unlock_irqrestore(>hal_ind_lock, flags); msg_header = (struct wcn36xx_hal_msg_header *)hal_ind_msg->msg; @@ -2450,8 +2452,6 @@ static void wcn36xx_ind_smd_work(struct work_struct *work) wcn36xx_err("SMD_EVENT (%d) not supported\n", msg_header->msg_type); } - list_del(wcn->hal_ind_queue.next); - spin_unlock_irqrestore(>hal_ind_lock, flags); kfree(hal_ind_msg); } int wcn36xx_smd_open(struct wcn36xx *wcn) -- 2.15.0
Re: [PATCH V2 net-next 2/8] net: hns3: Add mailbox support to VF driver
Sali, On Fri, Dec 8, 2017 at 10:16 PM, Salil Mehtawrote: > This patch adds the support of the mailbox to the VF driver. The > mailbox shall be used as an interface to communicate with the > PF driver for various purposes like {set|get} MAC related > operations, reset, link status etc. The mailbox supports both > synchronous and asynchronous command send to PF driver. > > Signed-off-by: Salil Mehta > Signed-off-by: lipeng [...] > --- /dev/null > +++ b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h > @@ -0,0 +1,94 @@ > +/* > + * Copyright (c) 2016-2017 Hisilicon Limited. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + */ Why not use the new SPDX ids? e.g. > +/* SPDX-License-Identifier: GPL-2.0+ */ > +/* Copyright (c) 2016-2017 Hisilicon Limited. */ See Linus posts and Thomas doc patches for details. -- Cordially Philippe Ombredanne
Re: [PATCH net-next v3 0/2] veth and GSO maximums
On Fri, 8 Dec 2017 15:50:25 -0800 Solio Sarabiawrote: > On Fri, Dec 08, 2017 at 02:23:23PM -0500, David Miller wrote: > > From: Stephen Hemminger > > Date: Thu, 7 Dec 2017 15:40:18 -0800 > > > > > This is the more general way to solving the issue of GSO limits > > > not being set correctly for containers on Azure. If a GSO packet > > > is sent to host that exceeds the limit (reported by NDIS), then > > > the host is forced to do segmentation in software which has noticeable > > > performance impact. > > > > > > The core rtnetlink infrastructure already has the messages and > > > infrastructure to allow changing gso limits. With an updated iproute2 > > > the following already works: > > > # ip li set dev dummy0 gso_max_size 3 > > > > > > These patches are about making it easier with veth. > > > > Ok, this is definitely a step forward. > > > > Series applied, thanks Stephen. > > Thanks. > Still not seeing the iproute2 patch though, either master or net-next. I held off on iproute2 until kernel change is accepted upstream.
Re: [PATCH net] enic: add wq clean up budget
On Wed, 6 Dec 2017, David Miller wrote: From: Govindarajulu VaradarajanDate: Tue, 5 Dec 2017 11:14:41 -0800 In case of tx clean up, we set '-1' as budget. This means clean up until wq is empty or till (1 << 32) pkts are cleaned. Under heavy load this will run for long time and cause "watchdog: BUG: soft lockup - CPU#25 stuck for 21s!" warning. This patch sets wq clean up budget to 256. Signed-off-by: Govindarajulu Varadarajan This driver with all of it's indirection and layers upon layers of macros for queue processing is so difficult to read, and this can't be generating nice optimal code either... Anyways, I was walking over the driver to see if the logic is contributing to this. The limit you are proposing sounds unnecessary, nobody else I can see needs this, and that includes all of the most heavily used drivers under load. I used 256 as the limit because most of the other drivers use it. * mlx4 uses MLX4_EN_DEFAULT_TX_WORK as the tx budget in mlx4_en_process_tx_cq() Added in commit fbc6daf19745 ("net/mlx4_en: Ignore budget on TX napi polling") * i40e uses vsi->work_limit as tx budget in i40e_clean_tx_irq(), which is set to I40E_DEFAULT_IRQ_WORK. Added in commit a619afe814453 ("i40e/i40evf: Add support for bulk free in Tx cleanup") * ixgbe uses q_vector->tx.work_limit as tx budget in ixgbe_clean_tx_irq(), which is set to IXGBE_DEFAULT_TX_WORK. Added in commit 592245559e900 ("ixgbe: Change default Tx work limit size to 256 buffers") If I had to guess I'd say that the problem is that the queue loop keeps sampling the head and tail pointers, where as it should just do that _once_ and only process that TX entries found in that snapshot and return to the poll() routine immedately afterwards. The only way to know the tail pointer at the time napi is scheduled is to read hw fetch_index register. This is discouraged by hw engineers. We work around this by using color bit. Every cq entry has color bit. It is either 0 or 1. Hw flips the bit when it creates a new cq entry. So every new cq entry will have a different color bit than previous. We reach end of the queue when previous color bit is same as current cq entry's color. i.e hw did not flip the bit, so its not a new cq entry. So enic driver cannot know the tail pointer at the time napi is scheduled, until we reach the tail pointer.
Re: [PATCH net-next v3 0/2] veth and GSO maximums
On Fri, Dec 08, 2017 at 02:23:23PM -0500, David Miller wrote: > From: Stephen Hemminger> Date: Thu, 7 Dec 2017 15:40:18 -0800 > > > This is the more general way to solving the issue of GSO limits > > not being set correctly for containers on Azure. If a GSO packet > > is sent to host that exceeds the limit (reported by NDIS), then > > the host is forced to do segmentation in software which has noticeable > > performance impact. > > > > The core rtnetlink infrastructure already has the messages and > > infrastructure to allow changing gso limits. With an updated iproute2 > > the following already works: > > # ip li set dev dummy0 gso_max_size 3 > > > > These patches are about making it easier with veth. > > Ok, this is definitely a step forward. > > Series applied, thanks Stephen. Thanks. Still not seeing the iproute2 patch though, either master or net-next.
Re: [PATCH v3 net-next 4/4] bpftool: implement cgroup bpf operations
On Fri, 8 Dec 2017 14:52:36 +, Roman Gushchin wrote: > +static int list_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type) > +{ > + __u32 prog_ids[1024] = {0}; > + char *attach_flags_str; > + __u32 prog_cnt, iter; > + __u32 attach_flags; > + char buf[16]; > + int ret; > + ... > + case BPF_F_ALLOW_OVERRIDE: > + attach_flags_str = "allow_override"; > + break; > + case 0: > + attach_flags_str = ""; > + break; > + default: > + snprintf(buf, sizeof(buf), "unknown(%x)", attach_flags); > + attach_flags_str = buf; nit: theoretically speaking strlen("unknown()") == 9 + 8 + 1 > sizeof(buf) so if we ever use all flags this may get truncated. I personally also like using %x without 0x in front, but I restrained myself in bpftool to avoid potential confusion (unknown(10) could be 10 or 16). Map flags do put the prefix in, perhaps it would be good to keep those consistent? > + } > + > + for (iter = 0; iter < prog_cnt; iter++) > + list_bpf_prog(prog_ids[iter], attach_type_strings[type], > + attach_flags_str); > + > + return 0; > +} > +static int do_attach(int argc, char **argv) > +{ > + int cgroup_fd, prog_fd; > + enum bpf_attach_type attach_type; > + int attach_flags = 0; > + int i; > + int ret = -1; nit: I was hoping you'd fix the order of variables in all functions.. > + if (argc < 4) { > + p_err("too few parameters for cgroup attach\n"); > + goto exit; > + } > + > + cgroup_fd = open(argv[0], O_RDONLY); > + if (cgroup_fd < 0) { > + p_err("can't open cgroup %s\n", argv[1]); > + goto exit; > + } > + > + attach_type = parse_attach_type(argv[1]); > + if (attach_type == __MAX_BPF_ATTACH_TYPE) { > + p_err("invalid attach type\n"); > + goto exit_cgroup; > + } > + > + argc -= 2; > + argv = [2]; > + prog_fd = prog_parse_fd(, ); > + if (prog_fd < 0) > + goto exit_cgroup; > + > + for (i = 0; i < argc; i++) { > + if (strcmp(argv[i], "allow_multi") == 0) { > + attach_flags |= BPF_F_ALLOW_MULTI; > + } else if (strcmp(argv[i], "allow_override") == 0) { > + attach_flags |= BPF_F_ALLOW_OVERRIDE; I don't feel about this strongly but as I said I was trying to follow iproute2's conventions, and it allows aliasing. So if you type "ip a" it will give you the first thing that starts with a, not necessarily alphabetically, more likely in order of usefulness or order in which things were added. IOW if "allow_" selects "allow_mutli" that's what I would actually expect it to do.. Maybe others disagree? > + } else { > + p_err("unknown option: %s\n", argv[i]); > + goto exit_cgroup; > + } > + } > + > + if (bpf_prog_attach(prog_fd, cgroup_fd, attach_type, attach_flags)) { > + p_err("failed to attach program"); > + goto exit_prog; > + } > + > + if (json_output) > + jsonw_null(json_wtr); > + > + ret = 0; > + > +exit_prog: > + close(prog_fd); > +exit_cgroup: > + close(cgroup_fd); > +exit: > + return ret; > +} > + > +static int do_detach(int argc, char **argv) > +{ > + int prog_fd, cgroup_fd; > + enum bpf_attach_type attach_type; > + int ret = -1; nit: order here too.. > + if (argc < 4) { > + p_err("too few parameters for cgroup detach\n"); > + goto exit; > + } > + > + cgroup_fd = open(argv[0], O_RDONLY); > + if (cgroup_fd < 0) { > + p_err("can't open cgroup %s\n", argv[1]); > + goto exit; > + } > + > + attach_type = parse_attach_type(argv[1]); > + if (attach_type == __MAX_BPF_ATTACH_TYPE) { > + p_err("invalid attach type"); > + goto exit_cgroup; > + } > + > + argc -= 2; > + argv = [2]; > + prog_fd = prog_parse_fd(, ); > + if (prog_fd < 0) > + goto exit_cgroup; > + > + if (bpf_prog_detach2(prog_fd, cgroup_fd, attach_type)) { > + p_err("failed to detach program"); > + goto exit_prog; > + } > + > + if (json_output) > + jsonw_null(json_wtr); > + > + ret = 0; > + > +exit_prog: > + close(prog_fd); > +exit_cgroup: > + close(cgroup_fd); > +exit: > + return ret; > +}
[PATCH] rtnetlink: fix typo in GSO max segments
Fixes: 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation") Signed-off-by: Stephen Hemminger--- net/core/rtnetlink.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 412ebf0b09c6..c688dc564b11 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -2684,7 +2684,7 @@ struct net_device *rtnl_create_link(struct net *net, if (tb[IFLA_GSO_MAX_SIZE]) netif_set_gso_max_size(dev, nla_get_u32(tb[IFLA_GSO_MAX_SIZE])); if (tb[IFLA_GSO_MAX_SEGS]) - dev->gso_max_size = nla_get_u32(tb[IFLA_GSO_MAX_SEGS]); + dev->gso_max_segs = nla_get_u32(tb[IFLA_GSO_MAX_SEGS]); return dev; } -- 2.11.0
Re: [PATCH v2 net-next 4/4] bpftool: implement cgroup bpf operations
On Fri, 8 Dec 2017 09:52:16 -0700, David Ahern wrote: > On 12/8/17 8:39 AM, Quentin Monnet wrote: > > I don't believe compatibility is an issue here, since the program and > > its documentation come together (so they should stay in sync) and are > > part of the kernel tree (so the tool should be compatible with the > > kernel sources it comes with). My concern is that there is no way to > > guess from the current description what the values for ATTACH_FLAG or > > ATTACH_TYPE can be, without reading the source code of the program—which > > is not exactly user-friendly. > > The tool should be backward and forward compatible across kernel > versions. Running a newer command on an older kernel should fail in a > deterministic. While the tool is in the kernel tree for ease of > development, that should not be confused with having a direct tie to any > kernel version. > > I believe man pages do include kernel version descriptions in flags > (e.g., man 7 socket -- flags are denoted with "since Linux x.y") which > is one way to handle it with the usual caveat that vendors might have > backported support to earlier kernels. Let's see if I understand correctly. We have a list of hard coded strings which the tool will recognize (static const char * const attach_type_strings[]). We should put that list into the man page and help so users know what values are possible. And in the "verbose" part of the man section mark each flag with kernel version it was introduced in. Roman, would you agree this is the best way forward?
Re: [PATCH net-next] ip6_vti: adjust vti mtu according to mtu of output device
On 12/8/2017 3:54 AM, Alexey Kodanev wrote: On 12/08/2017 10:02 AM, Steffen Klassert wrote: On Wed, Dec 06, 2017 at 07:38:19PM +0300, Alexey Kodanev wrote: Since you're planning to do a 2nd version anyway, can we get a couple of the commit message issues cleaned up? LTP/udp6_ipsec_vti tests fail when sending large UDP datagrams that require fragmentation and underlying device MTU <= 1500. s/underlying device/the underlying device has/ This happens because ip6_vti sets mtu to ETH_DATA_LEN and not updating it depending on a destiantion address. s/destiantion/destination/ Futhure attempts to send UDP packets may succeed because pmtu s/Futhure/Further/ get updated on ICMPV6_PKT_TOOBIG in vti6_err(). s/get/gets/ Here is the example when output device MTU set to 9000: s/output device MTU/the output device MTU is/ Thanks, sln
Re: [PATCH v8 0/5] Add the ability to do BPF directed error injection
On 12/08/2017 09:24 PM, Josef Bacik wrote: > On Fri, Dec 08, 2017 at 04:35:44PM +0100, Daniel Borkmann wrote: >> On 12/06/2017 05:12 PM, Josef Bacik wrote: >>> Jon noticed that I had a typo in my _ASM_KPROBE_ERROR_INJECT macro. I went >>> to >>> figure out why the compiler didn't catch it and it's because it was not used >>> anywhere. I had copied it from the trace blacklist code without >>> understanding >>> where it was used as cscope didn't find the original macro I was looking >>> for, so >>> I assumed it was some voodoo and left it in place. Turns out cscope failed >>> me >>> and I didn't need the macro at all, the trace blacklist thing I was looking >>> at >>> was for marking assembly functions as blacklisted and I have no intention of >>> marking assembly functions as error injectable at the moment. >>> >>> v7->v8: >>> - removed the _ASM_KPROBE_ERROR_INJECT since it was not needed. >> >> The series doesn't apply cleanly to the bpf-next tree, so one last respin >> with >> a rebase would unfortunately still be required, thanks! > > I've rebased and let it sit in my git tree to make sure kbuild test bot didn't > blow up, can you pull from > > git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git > bpf-override-return > > or do you want me to repost the whole series? Thanks, Yeah, the patches would need to end up on netdev, so once kbuild bot went through fine after your rebase, please send the series. Thanks, Daniel
Re: [PATCH net-next v2 5/5] qede: Use NETIF_F_GRO_HW.
On Fri, Dec 8, 2017 at 2:09 PM, Marcelo Ricardo Leitnerwrote: > Hi, > > On Thu, Dec 07, 2017 at 03:03:35AM -0500, Michael Chan wrote: >> --- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c >> +++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c >> @@ -940,6 +940,9 @@ int qede_change_mtu(struct net_device *ndev, int new_mtu) >> DP_VERBOSE(edev, (NETIF_MSG_IFUP | NETIF_MSG_IFDOWN), >> "Configuring MTU size of %d\n", new_mtu); >> >> + if (new_mtu > PAGE_SIZE) > > I don't know the specs for this card but if it needs to fit the whole > packet in a page, maybe it should consider the ethernet header size in > such checks? > I am not changing the logic in this patch, just moving the check from qede_alloc_sge_mem() to qede_fix_features(). Typically, the chip will also do header-data split when doing GRO_HW, so that's probably why it is checking the MTU and not the total packet size against page size. Ariel can confirm.
Re: [PATCH net-next v2 5/5] qede: Use NETIF_F_GRO_HW.
Hi, On Thu, Dec 07, 2017 at 03:03:35AM -0500, Michael Chan wrote: > --- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c > +++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c > @@ -940,6 +940,9 @@ int qede_change_mtu(struct net_device *ndev, int new_mtu) > DP_VERBOSE(edev, (NETIF_MSG_IFUP | NETIF_MSG_IFDOWN), > "Configuring MTU size of %d\n", new_mtu); > > + if (new_mtu > PAGE_SIZE) I don't know the specs for this card but if it needs to fit the whole packet in a page, maybe it should consider the ethernet header size in such checks? Marcelo
Re: [PATCH v5 2/2] sock: Move the socket inuse to namespace.
On Thu, Dec 7, 2017 at 9:28 PM, Tonghao Zhangwrote: > > Release the netlink sock created in kernel(not hold the _net_ namespace): > You can avoid counting kernel sock by testing 'kern' in sk_alloc() and testing 'sk->sk_net_refcnt' in __sk_free().
[PATCH V2 net-next 2/8] net: hns3: Add mailbox support to VF driver
This patch adds the support of the mailbox to the VF driver. The mailbox shall be used as an interface to communicate with the PF driver for various purposes like {set|get} MAC related operations, reset, link status etc. The mailbox supports both synchronous and asynchronous command send to PF driver. Signed-off-by: Salil MehtaSigned-off-by: lipeng --- Patch V2: Addressed some internal commnets Patch V1: Initial Submit --- drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h| 94 +++ .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c | 188 + 2 files changed, 282 insertions(+) create mode 100644 drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c diff --git a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h new file mode 100644 index 000..5a6ebfca --- /dev/null +++ b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h @@ -0,0 +1,94 @@ +/* + * Copyright (c) 2016-2017 Hisilicon Limited. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ + +#ifndef __HCLGE_MBX_H +#define __HCLGE_MBX_H +#include +#include +#include + +#define HCLGE_MBX_VF_MSG_DATA_NUM 16 + +enum HCLGE_MBX_OPCODE { + HCLGE_MBX_RESET = 0x01, /* (VF -> PF) assert reset */ + HCLGE_MBX_SET_UNICAST, /* (VF -> PF) set UC addr */ + HCLGE_MBX_SET_MULTICAST,/* (VF -> PF) set MC addr */ + HCLGE_MBX_SET_VLAN, /* (VF -> PF) set VLAN */ + HCLGE_MBX_MAP_RING_TO_VECTOR, /* (VF -> PF) map ring-to-vector */ + HCLGE_MBX_UNMAP_RING_TO_VECTOR, /* (VF -> PF) unamp ring-to-vector */ + HCLGE_MBX_SET_PROMISC_MODE, /* (VF -> PF) set promiscuous mode */ + HCLGE_MBX_SET_MACVLAN, /* (VF -> PF) set unicast filter */ + HCLGE_MBX_API_NEGOTIATE,/* (VF -> PF) negotiate API version */ + HCLGE_MBX_GET_QINFO,/* (VF -> PF) get queue config */ + HCLGE_MBX_GET_TCINFO, /* (VF -> PF) get TC config */ + HCLGE_MBX_GET_RETA, /* (VF -> PF) get RETA */ + HCLGE_MBX_GET_RSS_KEY, /* (VF -> PF) get RSS key */ + HCLGE_MBX_GET_MAC_ADDR, /* (VF -> PF) get MAC addr */ + HCLGE_MBX_PF_VF_RESP, /* (PF -> VF) generate respone to VF */ + HCLGE_MBX_GET_BDNUM,/* (VF -> PF) get BD num */ + HCLGE_MBX_GET_BUFSIZE, /* (VF -> PF) get buffer size */ + HCLGE_MBX_GET_STREAMID, /* (VF -> PF) get stream id */ + HCLGE_MBX_SET_AESTART, /* (VF -> PF) start ae */ + HCLGE_MBX_SET_TSOSTATS, /* (VF -> PF) get tso stats */ + HCLGE_MBX_LINK_STAT_CHANGE, /* (PF -> VF) link status has changed */ + HCLGE_MBX_GET_BASE_CONFIG, /* (VF -> PF) get config */ + HCLGE_MBX_BIND_FUNC_QUEUE, /* (VF -> PF) bind function and queue */ + HCLGE_MBX_GET_LINK_STATUS, /* (VF -> PF) get link status */ + HCLGE_MBX_QUEUE_RESET, /* (VF -> PF) reset queue */ +}; + +/* below are per-VF mac-vlan subcodes */ +enum hclge_mbx_mac_vlan_subcode { + HCLGE_MBX_MAC_VLAN_UC_MODIFY = 0, /* modify UC mac addr */ + HCLGE_MBX_MAC_VLAN_UC_ADD, /* add a new UC mac addr */ + HCLGE_MBX_MAC_VLAN_UC_REMOVE, /* remove a new UC mac addr */ + HCLGE_MBX_MAC_VLAN_MC_MODIFY, /* modify MC mac addr */ + HCLGE_MBX_MAC_VLAN_MC_ADD, /* add new MC mac addr */ + HCLGE_MBX_MAC_VLAN_MC_REMOVE, /* remove MC mac addr */ + HCLGE_MBX_MAC_VLAN_MC_FUNC_MTA_ENABLE, /* config func MTA enable */ +}; + +/* below are per-VF vlan cfg subcodes */ +enum hclge_mbx_vlan_cfg_subcode { + HCLGE_MBX_VLAN_FILTER = 0, /* set vlan filter */ + HCLGE_MBX_VLAN_TX_OFF_CFG, /* set tx side vlan offload */ + HCLGE_MBX_VLAN_RX_OFF_CFG, /* set rx side vlan offload */ +}; + +#define HCLGE_MBX_MAX_MSG_SIZE 16 +#define HCLGE_MBX_MAX_RESP_DATA_SIZE 8 + +struct hclgevf_mbx_resp_status { + struct mutex mbx_mutex; /* protects against contending sync cmd resp */ + u32 origin_mbx_msg; + bool received_resp; + int resp_status; + u8 additional_info[HCLGE_MBX_MAX_RESP_DATA_SIZE]; +}; + +struct hclge_mbx_vf_to_pf_cmd { + u8 rsv; + u8 mbx_src_vfid; /* Auto filled by IMP */ + u8 rsv1[2]; + u8 msg_len; + u8 rsv2[3]; + u8 msg[HCLGE_MBX_MAX_MSG_SIZE]; +}; + +struct hclge_mbx_pf_to_vf_cmd { + u8 dest_vfid; + u8 rsv[3]; + u8 msg_len; + u8 rsv1[3]; + u16 msg[8]; +}; + +#define hclge_mbx_ring_ptr_move_crq(crq) \ + (crq->next_to_use =
[PATCH V2 net-next 5/8] net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC
Most of the NAPI handling interface, skb buffer management, management of the RX/TX descriptors, ethool interface etc. has quite a bit of code which is common to VF and PF driver. This patch makes the exisitng PF's HNS3 ENET driver as the common ENET driver for both Virtual & Physical Function. This will help in reduction of redundancy and better management of code. Signed-off-by: Salil MehtaSigned-off-by: lipeng --- drivers/net/ethernet/hisilicon/hns3/Makefile | 6 +- drivers/net/ethernet/hisilicon/hns3/hnae3.c| 14 -- drivers/net/ethernet/hisilicon/hns3/hnae3.h| 7 --- .../hisilicon/hns3/{hns3pf => }/hns3_dcbnl.c | 2 +- .../hisilicon/hns3/{hns3pf => }/hns3_enet.c| 2 ++ .../hisilicon/hns3/{hns3pf => }/hns3_enet.h| 0 .../hisilicon/hns3/{hns3pf => }/hns3_ethtool.c | 22 +- .../net/ethernet/hisilicon/hns3/hns3pf/Makefile| 5 - .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 1 + 9 files changed, 46 insertions(+), 13 deletions(-) rename drivers/net/ethernet/hisilicon/hns3/{hns3pf => }/hns3_dcbnl.c (97%) rename drivers/net/ethernet/hisilicon/hns3/{hns3pf => }/hns3_enet.c (99%) rename drivers/net/ethernet/hisilicon/hns3/{hns3pf => }/hns3_enet.h (100%) rename drivers/net/ethernet/hisilicon/hns3/{hns3pf => }/hns3_ethtool.c (97%) diff --git a/drivers/net/ethernet/hisilicon/hns3/Makefile b/drivers/net/ethernet/hisilicon/hns3/Makefile index 244664b..48af9b1 100644 --- a/drivers/net/ethernet/hisilicon/hns3/Makefile +++ b/drivers/net/ethernet/hisilicon/hns3/Makefile @@ -1,8 +1,12 @@ # # Makefile for the HISILICON network device drivers. # - obj-$(CONFIG_HNS3) += hns3pf/ obj-$(CONFIG_HNS3) += hns3vf/ obj-$(CONFIG_HNS3) += hnae3.o + +obj-$(CONFIG_HNS3_ENET) += hns3.o +hns3-objs = hns3_enet.o hns3_ethtool.o + +hns3-$(CONFIG_HNS3_DCB) += hns3_dcbnl.o diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.c b/drivers/net/ethernet/hisilicon/hns3/hnae3.c index 5bcb223..02145f2 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.c +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.c @@ -196,9 +196,18 @@ int hnae3_register_ae_dev(struct hnae3_ae_dev *ae_dev) const struct pci_device_id *id; struct hnae3_ae_algo *ae_algo; struct hnae3_client *client; - int ret = 0; + int ret = 0, lock_acquired; + + /* we can get deadlocked if SRIOV is being enabled in context to probe +* and probe gets called again in same context. This can happen when +* pci_enable_sriov() is called to create VFs from PF probes context. +* Therefore, for simplicity uniformly defering further probing in all +* cases where we detect contention. +*/ + lock_acquired = mutex_trylock(_common_lock); + if (!lock_acquired) + return -EPROBE_DEFER; - mutex_lock(_common_lock); list_add_tail(_dev->node, _ae_dev_list); /* Check if there are matched ae_algo */ @@ -211,6 +220,7 @@ int hnae3_register_ae_dev(struct hnae3_ae_dev *ae_dev) if (!ae_dev->ops) { dev_err(_dev->pdev->dev, "ae_dev ops are null\n"); + ret = -EOPNOTSUPP; goto out_err; } diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 67c59e1..a9e2b32 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -452,9 +452,10 @@ struct hnae3_unic_private_info { struct hnae3_queue **tqp; /* array base of all TQPs of this instance */ }; -#define HNAE3_SUPPORT_MAC_LOOPBACK1 -#define HNAE3_SUPPORT_PHY_LOOPBACK2 -#define HNAE3_SUPPORT_SERDES_LOOPBACK 4 +#define HNAE3_SUPPORT_MAC_LOOPBACKBIT(0) +#define HNAE3_SUPPORT_PHY_LOOPBACKBIT(1) +#define HNAE3_SUPPORT_SERDES_LOOPBACK BIT(2) +#define HNAE3_SUPPORT_VF BIT(3) struct hnae3_handle { struct hnae3_client *client; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_dcbnl.c b/drivers/net/ethernet/hisilicon/hns3/hns3_dcbnl.c similarity index 97% rename from drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_dcbnl.c rename to drivers/net/ethernet/hisilicon/hns3/hns3_dcbnl.c index 925619a..eb82700 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_dcbnl.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_dcbnl.c @@ -93,7 +93,7 @@ void hns3_dcbnl_setup(struct hnae3_handle *handle) { struct net_device *dev = handle->kinfo.netdev; - if (!handle->kinfo.dcb_ops) + if ((!handle->kinfo.dcb_ops) || (handle->flags & HNAE3_SUPPORT_VF)) return; dev->dcbnl_ops = _dcbnl_ops; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c similarity index 99% rename from
[PATCH V2 net-next 3/8] net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support
This patch adds the support of hardware compatibiltiy layer to the HNS3 VF Driver. This layer implements various {set|get} operations over MAC address for a virtual port, RSS related configuration, fetches the link status info from PF, does various VLAN related configuration over the virtual port, queries the statistics from the hardware etc. This layer can directly interact with hardware through the IMP(Integrated Mangement Processor) interface or can use mailbox to interact with the PF driver. Signed-off-by: Salil MehtaSigned-off-by: lipeng --- Patch V2: Addressed some internal comments Patch V1: Initial Submit --- .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 1494 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h | 170 +++ 2 files changed, 1664 insertions(+) create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c new file mode 100644 index 000..a878c63 --- /dev/null +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -0,0 +1,1494 @@ +/* + * Copyright (c) 2016-2017 Hisilicon Limited. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ + +#include +#include "hclgevf_cmd.h" +#include "hclgevf_main.h" +#include "hclge_mbx.h" +#include "hnae3.h" + +#define HCLGEVF_NAME "hclgevf" + +static struct hnae3_ae_algo ae_algovf; + +static const struct pci_device_id ae_algovf_pci_tbl[] = { + {PCI_VDEVICE(HUAWEI, HNAE3_DEV_ID_100G_VF), 0}, + {PCI_VDEVICE(HUAWEI, HNAE3_DEV_ID_100G_RDMA_DCB_PFC_VF), 0}, + /* required last entry */ + {0, } +}; + +static inline struct hclgevf_dev *hclgevf_ae_get_hdev( + struct hnae3_handle *handle) +{ + return container_of(handle, struct hclgevf_dev, nic); +} + +static int hclgevf_tqps_update_stats(struct hnae3_handle *handle) +{ + struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle); + struct hnae3_queue *queue; + struct hclgevf_desc desc; + struct hclgevf_tqp *tqp; + int status; + int i; + + for (i = 0; i < hdev->num_tqps; i++) { + queue = handle->kinfo.tqp[i]; + tqp = container_of(queue, struct hclgevf_tqp, q); + hclgevf_cmd_setup_basic_desc(, +HCLGEVF_OPC_QUERY_RX_STATUS, +true); + + desc.data[0] = cpu_to_le32(tqp->index & 0x1ff); + status = hclgevf_cmd_send(>hw, , 1); + if (status) { + dev_err(>pdev->dev, + "Query tqp stat fail, status = %d,queue = %d\n", + status, i); + return status; + } + tqp->tqp_stats.rcb_rx_ring_pktnum_rcd += + le32_to_cpu(desc.data[4]); + + hclgevf_cmd_setup_basic_desc(, HCLGEVF_OPC_QUERY_TX_STATUS, +true); + + desc.data[0] = cpu_to_le32(tqp->index & 0x1ff); + status = hclgevf_cmd_send(>hw, , 1); + if (status) { + dev_err(>pdev->dev, + "Query tqp stat fail, status = %d,queue = %d\n", + status, i); + return status; + } + tqp->tqp_stats.rcb_tx_ring_pktnum_rcd += + le32_to_cpu(desc.data[4]); + } + + return 0; +} + +static u64 *hclgevf_tqps_get_stats(struct hnae3_handle *handle, u64 *data) +{ + struct hnae3_knic_private_info *kinfo = >kinfo; + struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle); + struct hclgevf_tqp *tqp; + u64 *buff = data; + int i; + + for (i = 0; i < hdev->num_tqps; i++) { + tqp = container_of(handle->kinfo.tqp[i], struct hclgevf_tqp, q); + *buff++ = tqp->tqp_stats.rcb_tx_ring_pktnum_rcd; + } + for (i = 0; i < kinfo->num_tqps; i++) { + tqp = container_of(handle->kinfo.tqp[i], struct hclgevf_tqp, q); + *buff++ = tqp->tqp_stats.rcb_rx_ring_pktnum_rcd; + } + + return buff; +} + +static int hclgevf_tqps_get_sset_count(struct hnae3_handle *handle, int strset) +{ + struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle); + + return hdev->num_tqps * 2; +} + +static u8 *hclgevf_tqps_get_strings(struct hnae3_handle *handle, u8 *data) +{ + struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle); + u8 *buff = data; +
[PATCH V2 net-next 6/8] net: hns3: Add mailbox support to PF driver
Command queue provides the provision of Mailbox command which can be used for communication between PF and VF. PF handles messages from various VFs for fetching various information like, queue, vlan, link status related etc. It also handles the request from various VFs to perform certain privileged operations. This patch adds the support of a message handler for handling such various command requests from VF. Signed-off-by: Salil MehtaSigned-off-by: lipeng --- .../net/ethernet/hisilicon/hns3/hns3pf/Makefile| 2 +- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 1 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 2 + .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c | 310 + 4 files changed, 314 insertions(+), 1 deletion(-) create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/Makefile b/drivers/net/ethernet/hisilicon/hns3/hns3pf/Makefile index d077fa0..393a1b3 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/Makefile +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/Makefile @@ -5,6 +5,6 @@ ccflags-y := -Idrivers/net/ethernet/hisilicon/hns3 obj-$(CONFIG_HNS3_HCLGE) += hclge.o -hclge-objs = hclge_main.o hclge_cmd.o hclge_mdio.o hclge_tm.o +hclge-objs = hclge_main.o hclge_cmd.o hclge_mdio.o hclge_tm.o hclge_mbx.o hclge-$(CONFIG_HNS3_DCB) += hclge_dcb.o diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index d07c700..980fcdf 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -21,6 +21,7 @@ #include "hclge_cmd.h" #include "hclge_dcb.h" #include "hclge_main.h" +#include "hclge_mbx.h" #include "hclge_mdio.h" #include "hclge_tm.h" #include "hnae3.h" diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index aacec43..028817c 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -554,4 +554,6 @@ int hclge_set_vf_vlan_common(struct hclge_dev *vport, int vfid, int hclge_buffer_alloc(struct hclge_dev *hdev); int hclge_rss_init_hw(struct hclge_dev *hdev); + +void hclge_mbx_handler(struct hclge_dev *hdev); #endif diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c new file mode 100644 index 000..a993735 --- /dev/null +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c @@ -0,0 +1,310 @@ +/* + * Copyright (c) 2016-2017 Hisilicon Limited. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ + +#include "hclge_main.h" +#include "hclge_mbx.h" +#include "hnae3.h" + +/* hclge_gen_resp_to_vf: used to generate a synchronous response to VF when PF + * receives a mailbox message from VF. + * @vport: pointer to struct hclge_vport + * @vf_to_pf_req: pointer to hclge_mbx_vf_to_pf_cmd of the original mailbox + * message + * @resp_status: indicate to VF whether its request success(0) or failed. + */ +static int hclge_gen_resp_to_vf(struct hclge_vport *vport, + struct hclge_mbx_vf_to_pf_cmd *vf_to_pf_req, + int resp_status, + u8 *resp_data, u16 resp_data_len) +{ + struct hclge_mbx_pf_to_vf_cmd *resp_pf_to_vf; + struct hclge_dev *hdev = vport->back; + enum hclge_cmd_status status; + struct hclge_desc desc; + + resp_pf_to_vf = (struct hclge_mbx_pf_to_vf_cmd *)desc.data; + + if (resp_data_len > HCLGE_MBX_MAX_RESP_DATA_SIZE) { + dev_err(>pdev->dev, + "PF fail to gen resp to VF len %d exceeds max len %d\n", + resp_data_len, + HCLGE_MBX_MAX_RESP_DATA_SIZE); + } + + hclge_cmd_setup_basic_desc(, HCLGEVF_OPC_MBX_PF_TO_VF, false); + + resp_pf_to_vf->dest_vfid = vf_to_pf_req->mbx_src_vfid; + resp_pf_to_vf->msg_len = vf_to_pf_req->msg_len; + + resp_pf_to_vf->msg[0] = HCLGE_MBX_PF_VF_RESP; + resp_pf_to_vf->msg[1] = vf_to_pf_req->msg[0]; + resp_pf_to_vf->msg[2] = vf_to_pf_req->msg[1]; + resp_pf_to_vf->msg[3] = (resp_status == 0) ? 0 : 1; + + if (resp_data && resp_data_len > 0) + memcpy(_pf_to_vf->msg[4], resp_data, resp_data_len); + + status = hclge_cmd_send(>hw, , 1); + if (status) + dev_err(>pdev->dev, + "PF failed(=%d) to send response to VF\n", status); + + return status; +} + +static int
[PATCH V2 net-next 7/8] net: hns3: Change PF to add ring-vect binding & resetQ to mailbox
This patch is required to support ring-vector binding and reset of TQPs requested by the VF driver to the PF driver. Mailbox handler is added with corresponding VF commands/messages to handle the request. Signed-off-by: Salil MehtaSigned-off-by: lipeng --- Patch V2: Addressed some internal comments Patch V1: Initial Submit --- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 138 +++-- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 7 +- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c | 106 3 files changed, 159 insertions(+), 92 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 980fcdf..3b1fc49 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3256,49 +3256,48 @@ int hclge_rss_init_hw(struct hclge_dev *hdev) return ret; } -int hclge_map_vport_ring_to_vector(struct hclge_vport *vport, int vector_id, - struct hnae3_ring_chain_node *ring_chain) +int hclge_bind_ring_with_vector(struct hclge_vport *vport, + int vector_id, bool en, + struct hnae3_ring_chain_node *ring_chain) { struct hclge_dev *hdev = vport->back; - struct hclge_ctrl_vector_chain_cmd *req; struct hnae3_ring_chain_node *node; struct hclge_desc desc; - int ret; + struct hclge_ctrl_vector_chain_cmd *req + = (struct hclge_ctrl_vector_chain_cmd *)desc.data; + enum hclge_cmd_status status; + enum hclge_opcode_type op; + u16 tqp_type_and_id; int i; - hclge_cmd_setup_basic_desc(, HCLGE_OPC_ADD_RING_TO_VECTOR, false); - - req = (struct hclge_ctrl_vector_chain_cmd *)desc.data; + op = en ? HCLGE_OPC_ADD_RING_TO_VECTOR : HCLGE_OPC_DEL_RING_TO_VECTOR; + hclge_cmd_setup_basic_desc(, op, false); req->int_vector_id = vector_id; i = 0; for (node = ring_chain; node; node = node->next) { - u16 type_and_id = 0; - - hnae_set_field(type_and_id, HCLGE_INT_TYPE_M, HCLGE_INT_TYPE_S, + tqp_type_and_id = le16_to_cpu(req->tqp_type_and_id[i]); + hnae_set_field(tqp_type_and_id, HCLGE_INT_TYPE_M, + HCLGE_INT_TYPE_S, hnae_get_bit(node->flag, HNAE3_RING_TYPE_B)); - hnae_set_field(type_and_id, HCLGE_TQP_ID_M, HCLGE_TQP_ID_S, - node->tqp_index); - hnae_set_field(type_and_id, HCLGE_INT_GL_IDX_M, - HCLGE_INT_GL_IDX_S, - hnae_get_bit(node->flag, HNAE3_RING_TYPE_B)); - req->tqp_type_and_id[i] = cpu_to_le16(type_and_id); - req->vfid = vport->vport_id; - + hnae_set_field(tqp_type_and_id, HCLGE_TQP_ID_M, + HCLGE_TQP_ID_S, node->tqp_index); + req->tqp_type_and_id[i] = cpu_to_le16(tqp_type_and_id); if (++i >= HCLGE_VECTOR_ELEMENTS_PER_CMD) { req->int_cause_num = HCLGE_VECTOR_ELEMENTS_PER_CMD; + req->vfid = vport->vport_id; - ret = hclge_cmd_send(>hw, , 1); - if (ret) { + status = hclge_cmd_send(>hw, , 1); + if (status) { dev_err(>pdev->dev, "Map TQP fail, status is %d.\n", - ret); - return ret; + status); + return -EIO; } i = 0; hclge_cmd_setup_basic_desc(, - HCLGE_OPC_ADD_RING_TO_VECTOR, + op, false); req->int_vector_id = vector_id; } @@ -3306,21 +3305,21 @@ int hclge_map_vport_ring_to_vector(struct hclge_vport *vport, int vector_id, if (i > 0) { req->int_cause_num = i; - - ret = hclge_cmd_send(>hw, , 1); - if (ret) { + req->vfid = vport->vport_id; + status = hclge_cmd_send(>hw, , 1); + if (status) { dev_err(>pdev->dev, - "Map TQP fail, status is %d.\n", ret); - return ret; + "Map TQP fail, status is %d.\n", status); + return -EIO; } } return 0; } -static int hclge_map_handle_ring_to_vector( -
[PATCH V2 net-next 4/8] net: hns3: Add HNS3 VF driver to kernel build framework
This patch introduces the new Makefiles and updates existing Makefiles required to build the HNS3 Virtual Function driver. This also updates the Kconfig for introduction of new menuconfig entries related to VF driver. Signed-off-by: Salil MehtaSigned-off-by: lipeng --- drivers/net/ethernet/hisilicon/Kconfig | 28 +++--- drivers/net/ethernet/hisilicon/hns3/Makefile | 1 + .../net/ethernet/hisilicon/hns3/hns3vf/Makefile| 8 +++ 3 files changed, 28 insertions(+), 9 deletions(-) create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/Makefile diff --git a/drivers/net/ethernet/hisilicon/Kconfig b/drivers/net/ethernet/hisilicon/Kconfig index 3b6..8bcf470 100644 --- a/drivers/net/ethernet/hisilicon/Kconfig +++ b/drivers/net/ethernet/hisilicon/Kconfig @@ -94,15 +94,6 @@ config HNS3_HCLGE compatibility layer. The engine would be used in Hisilicon hip08 family of SoCs and further upcoming SoCs. -config HNS3_ENET - tristate "Hisilicon HNS3 Ethernet Device Support" - depends on 64BIT && PCI - depends on HNS3 && HNS3_HCLGE - ---help--- - This selects the Ethernet Driver for Hisilicon Network Subsystem 3 for hip08 - family of SoCs. This module depends upon HNAE3 driver to access the HNAE3 - devices and their associated operations. - config HNS3_DCB bool "Hisilicon HNS3 Data Center Bridge Support" default n @@ -112,4 +103,23 @@ config HNS3_DCB If unsure, say N. +config HNS3_HCLGEVF +tristate "Hisilicon HNS3VF Acceleration Engine & Compatibility Layer Support" +depends on PCI_MSI +depends on HNS3 + depends on HNS3_HCLGE +---help--- + This selects the HNS3 VF drivers network acceleration engine & its hardware + compatibility layer. The engine would be used in Hisilicon hip08 family of + SoCs and further upcoming SoCs. + +config HNS3_ENET + tristate "Hisilicon HNS3 Ethernet Device Support" + depends on 64BIT && PCI + depends on HNS3 + ---help--- + This selects the Ethernet Driver for Hisilicon Network Subsystem 3 for hip08 + family of SoCs. This module depends upon HNAE3 driver to access the HNAE3 + devices and their associated operations. + endif # NET_VENDOR_HISILICON diff --git a/drivers/net/ethernet/hisilicon/hns3/Makefile b/drivers/net/ethernet/hisilicon/hns3/Makefile index a9349e1..244664b 100644 --- a/drivers/net/ethernet/hisilicon/hns3/Makefile +++ b/drivers/net/ethernet/hisilicon/hns3/Makefile @@ -3,5 +3,6 @@ # obj-$(CONFIG_HNS3) += hns3pf/ +obj-$(CONFIG_HNS3) += hns3vf/ obj-$(CONFIG_HNS3) += hnae3.o diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/Makefile b/drivers/net/ethernet/hisilicon/hns3/hns3vf/Makefile new file mode 100644 index 000..a513d00 --- /dev/null +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/Makefile @@ -0,0 +1,8 @@ +# +# Makefile for the HISILICON network device drivers. +# + +ccflags-y := -Idrivers/net/ethernet/hisilicon/hns3 + +obj-$(CONFIG_HNS3_HCLGEVF) += hclgevf.o +hclgevf-objs = hclgevf_main.o hclgevf_cmd.o hclgevf_mbx.o \ No newline at end of file -- 2.7.4
[PATCH V2 net-next 8/8] net: hns3: Add mailbox interrupt handling to PF driver
All PF mailbox events are conveyed through a common interrupt (vector 0). This interrupt vector is shared by reset and mailbox. This patch adds the handling of mailbox interrupt event and its deferred processing in context to a separate mailbox task. Signed-off-by: Salil MehtaSigned-off-by: lipeng --- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 68 +++--- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 8 ++- 2 files changed, 68 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 3b1fc49..e97fd66 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -2227,6 +2227,12 @@ static int hclge_mac_init(struct hclge_dev *hdev) return hclge_cfg_func_mta_filter(hdev, 0, hdev->accept_mta_mc); } +static void hclge_mbx_task_schedule(struct hclge_dev *hdev) +{ + if (!test_and_set_bit(HCLGE_STATE_MBX_SERVICE_SCHED, >state)) + schedule_work(>mbx_service_task); +} + static void hclge_reset_task_schedule(struct hclge_dev *hdev) { if (!test_and_set_bit(HCLGE_STATE_RST_SERVICE_SCHED, >state)) @@ -2372,9 +2378,18 @@ static void hclge_service_complete(struct hclge_dev *hdev) static u32 hclge_check_event_cause(struct hclge_dev *hdev, u32 *clearval) { u32 rst_src_reg; + u32 cmdq_src_reg; /* fetch the events from their corresponding regs */ rst_src_reg = hclge_read_dev(>hw, HCLGE_MISC_RESET_STS_REG); + cmdq_src_reg = hclge_read_dev(>hw, HCLGE_VECTOR0_CMDQ_SRC_REG); + + /* Assumption: If by any chance reset and mailbox events are reported +* together then we will only process reset event in this go and will +* defer the processing of the mailbox events. Since, we would have not +* cleared RX CMDQ event this time we would receive again another +* interrupt from H/W just for the mailbox. +*/ /* check for vector0 reset event sources */ if (BIT(HCLGE_VECTOR0_GLOBALRESET_INT_B) & rst_src_reg) { @@ -2395,7 +2410,12 @@ static u32 hclge_check_event_cause(struct hclge_dev *hdev, u32 *clearval) return HCLGE_VECTOR0_EVENT_RST; } - /* mailbox event sharing vector 0 interrupt would be placed here */ + /* check for vector0 mailbox(=CMDQ RX) event source */ + if (BIT(HCLGE_VECTOR0_RX_CMDQ_INT_B) & cmdq_src_reg) { + cmdq_src_reg &= ~BIT(HCLGE_VECTOR0_RX_CMDQ_INT_B); + *clearval = cmdq_src_reg; + return HCLGE_VECTOR0_EVENT_MBX; + } return HCLGE_VECTOR0_EVENT_OTHER; } @@ -2403,10 +2423,14 @@ static u32 hclge_check_event_cause(struct hclge_dev *hdev, u32 *clearval) static void hclge_clear_event_cause(struct hclge_dev *hdev, u32 event_type, u32 regclr) { - if (event_type == HCLGE_VECTOR0_EVENT_RST) + switch (event_type) { + case HCLGE_VECTOR0_EVENT_RST: hclge_write_dev(>hw, HCLGE_MISC_RESET_STS_REG, regclr); - - /* mailbox event sharing vector 0 interrupt would be placed here */ + break; + case HCLGE_VECTOR0_EVENT_MBX: + hclge_write_dev(>hw, HCLGE_VECTOR0_CMDQ_SRC_REG, regclr); + break; + } } static void hclge_enable_vector(struct hclge_misc_vector *vector, bool enable) @@ -2423,13 +2447,23 @@ static irqreturn_t hclge_misc_irq_handle(int irq, void *data) hclge_enable_vector(>misc_vector, false); event_cause = hclge_check_event_cause(hdev, ); - /* vector 0 interrupt is shared with reset and mailbox source events. -* For now, we are not handling mailbox events. -*/ + /* vector 0 interrupt is shared with reset and mailbox source events.*/ switch (event_cause) { case HCLGE_VECTOR0_EVENT_RST: hclge_reset_task_schedule(hdev); break; + case HCLGE_VECTOR0_EVENT_MBX: + /* If we are here then, +* 1. Either we are not handling any mbx task and we are not +*scheduled as well +*OR +* 2. We could be handling a mbx task but nothing more is +*scheduled. +* In both cases, we should schedule mbx task as there are more +* mbx messages reported by this interrupt. +*/ + hclge_mbx_task_schedule(hdev); + default: dev_dbg(>pdev->dev, "received unknown or unhandled event of vector0\n"); @@ -2708,6 +2742,21 @@ static void hclge_reset_service_task(struct work_struct *work) clear_bit(HCLGE_STATE_RST_HANDLING, >state); } +static void hclge_mailbox_service_task(struct work_struct *work)
[PATCH V2 net-next 1/8] net: hns3: Add HNS3 VF IMP(Integrated Management Proc) cmd interface
This patch adds support of command interface for communication with the IMP(Integrated Management Processor) for HNS3 Virtual Function Driver. Each VF has support of CQP(Command Queue Pair) ring interface. Each CQP consis of send queue CSQ and receive queue CRQ. There are various commands a VF may support, like to query frimware version, TQP management, statistics, interrupt related, mailbox etc. This also contains code to initialize the command queue, manage the command queue descriptors and Rx/Tx protocol with the command processor in the form of various commands/results and acknowledgements. Signed-off-by: Salil MehtaSigned-off-by: lipeng --- Patch V2: Reworked comments by David Miller(except one comment on the udelay() while holding locks. Needs further discussion) Link: https://lkml.org/lkml/2017/12/5/639 Patch V1: Initial Submit --- .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c | 348 + .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h | 262 2 files changed, 610 insertions(+) create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c new file mode 100644 index 000..d786ab8 --- /dev/null +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c @@ -0,0 +1,348 @@ +/* + * Copyright (c) 2016-2017 Hisilicon Limited. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ + +#include +#include +#include +#include +#include +#include +#include "hclgevf_cmd.h" +#include "hclgevf_main.h" +#include "hnae3.h" + +#define hclgevf_is_csq(ring) ((ring)->flag & HCLGEVF_TYPE_CSQ) +#define hclgevf_ring_to_dma_dir(ring) (hclgevf_is_csq(ring) ? \ + DMA_TO_DEVICE : DMA_FROM_DEVICE) +#define cmq_ring_to_dev(ring) (&(ring)->dev->pdev->dev) + +static int hclgevf_ring_space(struct hclgevf_cmq_ring *ring) +{ + int ntc = ring->next_to_clean; + int ntu = ring->next_to_use; + int used; + + used = (ntu - ntc + ring->desc_num) % ring->desc_num; + + return ring->desc_num - used - 1; +} + +static int hclgevf_cmd_csq_clean(struct hclgevf_hw *hw) +{ + struct hclgevf_cmq_ring *csq = >cmq.csq; + u16 ntc = csq->next_to_clean; + struct hclgevf_desc *desc; + int clean = 0; + u32 head; + + desc = >desc[ntc]; + head = hclgevf_read_dev(hw, HCLGEVF_NIC_CSQ_HEAD_REG); + while (head != ntc) { + memset(desc, 0, sizeof(*desc)); + ntc++; + if (ntc == csq->desc_num) + ntc = 0; + desc = >desc[ntc]; + clean++; + } + csq->next_to_clean = ntc; + + return clean; +} + +static bool hclgevf_cmd_csq_done(struct hclgevf_hw *hw) +{ + u32 head; + + head = hclgevf_read_dev(hw, HCLGEVF_NIC_CSQ_HEAD_REG); + + return head == hw->cmq.csq.next_to_use; +} + +static bool hclgevf_is_special_opcode(u16 opcode) +{ + u16 spec_opcode[] = {0x30, 0x31, 0x32}; + int i; + + for (i = 0; i < ARRAY_SIZE(spec_opcode); i++) { + if (spec_opcode[i] == opcode) + return true; + } + + return false; +} + +static int hclgevf_alloc_cmd_desc(struct hclgevf_cmq_ring *ring) +{ + int size = ring->desc_num * sizeof(struct hclgevf_desc); + + ring->desc = kzalloc(size, GFP_KERNEL); + if (!ring->desc) + return -ENOMEM; + + ring->desc_dma_addr = dma_map_single(cmq_ring_to_dev(ring), ring->desc, +size, DMA_BIDIRECTIONAL); + + if (dma_mapping_error(cmq_ring_to_dev(ring), ring->desc_dma_addr)) { + ring->desc_dma_addr = 0; + kfree(ring->desc); + ring->desc = NULL; + return -ENOMEM; + } + + return 0; +} + +static void hclgevf_free_cmd_desc(struct hclgevf_cmq_ring *ring) +{ + dma_unmap_single(cmq_ring_to_dev(ring), ring->desc_dma_addr, +ring->desc_num * sizeof(ring->desc[0]), +hclgevf_ring_to_dma_dir(ring)); + + ring->desc_dma_addr = 0; + kfree(ring->desc); + ring->desc = NULL; +} + +static int hclgevf_init_cmd_queue(struct hclgevf_dev *hdev, + struct hclgevf_cmq_ring *ring) +{ + struct hclgevf_hw *hw = >hw; + int ring_type = ring->flag; + u32 reg_val; + int ret; + + ring->desc_num = HCLGEVF_NIC_CMQ_DESC_NUM; + spin_lock_init(>lock); +
Re: NFS corruption, fixed by echo 1 > /proc/sys/vm/drop_caches -- next debugging steps?
On Fri, 2017-12-08 at 12:26 -0800, Matt Turner wrote: > > Thanks for the quick reply! > > I tried the patch on top of master, but unfortunately the corruption > still occurs. You might try replacing in sbdma_add_rcvbuffer() sb_new = netdev_alloc_skb(dev, size); by sb_new = alloc_skb(size, GFP_ATOMIC); Maybe the device does not like having a frame spanning 2 pages.
Re: [PATCH nf-next RFC,v2 4/6] netfilter: flow table support for IPv4
On Fri, Dec 08, 2017 at 11:04:13AM +0100, Florian Westphal wrote: > Pablo Neira Ayusowrote: > > This patch adds the IPv4 flow table type, that implements the datapath > > flow table to forward IPv4 traffic. Rationale is: > > > > 1) Look up for the packet in the flow table, from the ingress hook. > > 2) If there's a hit, decrement ttl and pass it on to the neighbour layer > >for transmission. > > 3) If there's a miss, packet is passed up to the classic forwarding > >path. > > Is there a plan to also handle zone IDs in future? Zone ID is meaningful to whoever applies the policy: in this offload approach this patchset implements, the policy resides in the kernel. > I't going to be messy for sure since we'd need to tell HW how to do > the zone mapping. Perhaps only support a builtin list, e.g. > vlan id == zone...? I've been considering a more simple solution, ie. add the input ifindex device in the flowtable hash lookup, as part of the flow tuple. All examples I've been observing for zones are basically mapping network interfaces to zones. > Don't yet see how it could be done in a generic way as the mappings can > be arbitrarily complex. > > Right now afaics one could install one flow table per zone and map > this in nft, but then we still miss the part that tells the hardware > how the zone identifier was derived. > > > +static bool ip_has_options(unsigned int thoff) > > +{ > > + return thoff > sizeof(struct iphdr); > > I'd use > thoff != sizeof(...) > > to catch case where ihl is < struct iphdr. ok. > > +nf_flow_offload_hook(void *priv, struct sk_buff *skb, > > +const struct nf_hook_state *state) > > +{ > > + struct flow_offload_tuple_rhash *tuplehash; > > + struct nf_flowtable *flow_table = priv; > > + struct flow_offload_tuple tuple = {}; > > + union nf_inet_addr nexthop; > > + struct flow_offload *flow; > > + struct net_device *outdev; > > + struct iphdr *iph; > > + > > + if (nf_flow_tuple_ip(skb, ) < 0) > > + return NF_ACCEPT; > > + > > + tuplehash = flow_offload_lookup(flow_table, ); > > + if (tuplehash == NULL) > > + return NF_ACCEPT; > > + > > + outdev = dev_get_by_index_rcu(_net, tuplehash->tuple.oifidx); > > state->net ? Yes, netns support is in my TODO list.
Re: BUG: KASAN: use-after-free in tcf_block_put_ext+0x5cf/0x5e0
On Thu, Dec 7, 2017 at 8:59 PM, Jakub Kicinskiwrote: > On Thu, 7 Dec 2017 20:51:27 -0800, Jakub Kicinski wrote: >> Running the netdevsim test after a week and a bit of not trying it: >> >> $ make -C tools/testing/selftests/bpf/ CLANG=clang LLC=llc >> # ./tools/testing/selftests/bpf/test_offload.py > > Ah, I didn't clarify - this is net-next at 66c5c5b56682 ("Merge branch > 'smc-fixes-next'") plus some BPF-related commits. Should be fixed by: commit df45bf84e4f5a48f23d4b1a07d21d566e8b587b2 Author: Jiri Pirko Date: Fri Dec 8 19:27:27 2017 +0100 net: sched: fix use-after-free in tcf_block_put_ext
Re: [PATCH nf-next RFC,v2 1/6] netfilter: nf_conntrack: add IPS_OFFLOAD status bit
On Fri, Dec 08, 2017 at 07:47:02AM +0100, Florian Westphal wrote: > Pablo Neira Ayusowrote: > > diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h > > b/include/uapi/linux/netfilter/nf_conntrack_common.h > > index dc947e59d03a..6b463b88182d 100644 > > --- a/include/uapi/linux/netfilter/nf_conntrack_common.h > > +++ b/include/uapi/linux/netfilter/nf_conntrack_common.h > > @@ -100,6 +100,10 @@ enum ip_conntrack_status { > > IPS_HELPER_BIT = 13, > > IPS_HELPER = (1 << IPS_HELPER_BIT), > > > > + /* Conntrack has been offloaded to flow table. */ > > + IPS_OFFLOAD_BIT = 14, > > + IPS_OFFLOAD = (1 << IPS_OFFLOAD_BIT), > > + > > /* Be careful here, modifying these bits can make things messy, > > * so don't let users modify them directly. > > */ > > I think this new bit has to be added to the UNCHANGEABLE mask below. Right. > > diff --git a/net/netfilter/nf_conntrack_core.c > > b/net/netfilter/nf_conntrack_core.c > > index 01130392b7c0..02e195accd47 100644 > > --- a/net/netfilter/nf_conntrack_core.c > > +++ b/net/netfilter/nf_conntrack_core.c > > @@ -901,6 +901,9 @@ static unsigned int early_drop_list(struct net *net, > > hlist_nulls_for_each_entry_rcu(h, n, head, hnnode) { > > tmp = nf_ct_tuplehash_to_ctrack(h); > > > > + if (test_bit(IPS_OFFLOAD_BIT, >status)) > > + continue; > > + > > nit: I would move this below the ASSURED bit check, AFAIU most > (all?) offloaded conntracks are not in ASSURED state since they never > see two-way communication but in case we've mixed flows or no offloading > in place then the ASSURED check takes care of skipping earlydrop > already. Offload happens once we enter established state (or later if your rule postpone it to a later stage), that is before we observe the full 3-way handshake in tcp, hence the assured bit. So I think we still need this here. > > +/* Set an arbitrary timeout large enough not to ever expire, this save > > + * us a check for the IPS_OFFLOAD_BIT from the packet path via > > + * nf_ct_is_expired(). > > + */ > > +static void nf_ct_offload_timeout(struct nf_conn *ct) > > +{ > > + ct->timeout = nfct_time_stamp + DAY; > > +} > > Not sure if its worth adding a test to avoid unconditional write, > e.g. something like > > > +static void nf_ct_offload_timeout(struct nf_conn *ct) > > +{ > if (nf_ct_expires(ct) < DAY/2)) > > + ct->timeout = nfct_time_stamp + DAY; > > but perhaps not worth it, gc_worker is infrequent. That's fine, I'll do this.
Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected
On 12/08/2017 at 09:11 PM Andreas Hartmann wrote: > On 12/08/2017 at 05:04 PM Willem de Bruijn wrote: >> On Fri, Dec 8, 2017 at 6:40 AM, Michal Kubecekwrote: >>> On Fri, Dec 08, 2017 at 11:31:50AM +0100, Andreas Hartmann wrote: On 12/08/2017 at 09:47 AM Michal Kubecek wrote: > On Fri, Dec 08, 2017 at 08:21:16AM +0100, Andreas Hartmann wrote: >> >> All my VMs are using virtio_net. BTW: I couldn't see the problems >> (sometimes, the VM couldn't be stopped at all) if all my VMs are using >> e1000 as interface instead. >> >> This finding now matches pretty much the responsible UDP-package which >> caused the stall. I already mentioned it here [2]. >> >> To prove it, I reverted from the patch series "[PATCH v2 RFC 0/13] >> Remove UDP Fragmentation Offload support" [3] >> >> 11/13 [v2,RFC,11/13] net: Remove all references to SKB_GSO_UDP. [4] >> 12/13 [v2,RFC,12/13] inet: Remove software UFO fragmenting code. [5] >> 13/13 [v2,RFC,13/13] net: Kill NETIF_F_UFO and SKB_GSO_UDP. [6] >> >> and applied it to Linux 4.14.4. It compiled fine and is running fine. >> The vnet doesn't die anymore. Yet, I can't say if the qemu stop hangs >> are gone, too. >> >> Obviously, there is something broken with the new UDP handling. Could >> you please analyze this problem? I could test some more patches ... . > > Any chance your VMs were live migrated from pre-4.14 host kernel? No - the VMs are not live migrated. They are always running on the same host - either with kernel < 4.14 or with kernel 4.14.x. >>> >>> This is disturbing... unless I'm mistaken, it shouldn't be possible to >>> have UFO enabled on a virtio device in a VM booted on a host with 4.14 >>> kernel. >> >> Indeed. When working on that revert patch I verified that UFO in >> the guest virtio_net was off before the revert patch, on after. >> >> Qemu should check host support with tap_probe_has_ufo >> before advertising support to the guest. Indeed, this is exactly >> what broke live migration in virtio_net_load_device at >> >> if (qemu_get_byte(f) && !peer_has_ufo(n)) { >> error_report("virtio-net: saved image requires TUN_F_UFO support"); >> return -1; >> } >> >> Which follows >> >>peer_has_ufo >> qemu_has_ufo >>tap_has_ufo >> s->has_ufo >> >> where s->has_ufo was set by tap_probe_has_ufo in net_tap_fd_init. >> >> Now, checking my qemu git branch, I ran pretty old 2.7.0-rc3. But this >> codepath does not seem to have changed between then and 2.10.1. >> >> I cherry-picked the revert onto 4.14.3. It did not apply cleanly, but the >> fix-up wasn't too hard. Compiled and booted, but untested otherwise. At >> >> https://github.com/wdebruij/linux/commits/v4.14.3-aargh-ufo > > I'm just running it at the moment. I didn't face any network hang until > now - although the critical UDP packages have been gone through. > Therefore: looks nice. Well, the patch does not fix hanging VMs, which have been shutdown and can't be killed any more. Because of the stack trace [] vhost_net_ubuf_put_and_wait+0x35/0x60 [vhost_net] [] vhost_net_ioctl+0x304/0x870 [vhost_net] [] do_vfs_ioctl+0x8f/0x5c0 [] SyS_ioctl+0x74/0x80 [] do_syscall_64+0x5b/0x100 [] entry_SYSCALL64_slow_path+0x25/0x25 [] 0x I was hoping, that the problems could be related - but that seems not to be true. Does anybody have any idea what happened here and how to analyze / fix it? Thanks, Andreas
Re: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave
On Fri, Dec 08, 2017 at 11:08:56AM -0500, Neil Horman wrote: > On Fri, Dec 08, 2017 at 04:04:58PM +, David Laight wrote: > > From: Marcelo Ricardo Leitner > > > Sent: 08 December 2017 16:00 > > ... > > > > Is it worth replacing the si struct with an index/enum value, and > > > > indexing an > > > > array of method pointer structs? That would save you at least one > > > > dereference. > > > > > > Hmmm, maybe, yes. It would be like > > > sctp_stream_interleave[asoc->stream.si].make_datafrag(...) > > > > If you only expect 2 choices then an if () is likely > > to produce better code that the above. > > > > The actual implementation can be hidden inside a #define > > or static inline function. > > > Thats the real question though, will we expect more than two interleaving > strategies? Currently its a boolean operation so the answer seems like yes, > but > is there a possiblity of a biased interleaving, or other worthwhile algorithm? For the chunk format, I don't think so. It would require another RFC update. For other possibilities on having a 3rd choice in there, I also don't think so. Stream scheduling is handled apart from it and rx buffer stuff is being covered by it now, don't see how we could have a 3rd option without another chunk format. But that said, I don't think the macro or even inline wrappers are as clear as the struct with function pointers here and in some cases it even won't avoid the second deref. I wouldn't like to compromise code readability and OO because of 1 fetch, specially considering that this will be barely noticeable. Neil's idea on using the array of structs and indexing on it is nice. Saves a deref and keeps the abstractions without inserting too much noise on it. Plus, it also allows replacing the struct pointer in sctp_stream with a bit. If then placed before the union in there, we can save up to the whole pointer. Looks like a good compromise to me. Marcelo
Re: NFS corruption, fixed by echo 1 > /proc/sys/vm/drop_caches -- next debugging steps?
On Fri, Dec 8, 2017 at 5:52 AM, Eric Dumazetwrote: > On Fri, 2017-12-08 at 05:42 -0800, Eric Dumazet wrote: >> On Thu, Dec 7, 2017 at 11:54 PM, Matt Turner >> wrote: >> > On Thu, Dec 7, 2017 at 11:00 PM, Matt Turner >> > wrote: >> > > On Sun, Mar 12, 2017 at 6:43 PM, Matt Turner >> > > wrote: >> > > > On a Broadcom BCM91250a MIPS system I can reliably trigger NFS >> > > > corruption on the first file read. >> > > > >> > > > To demonstrate, I downloaded five identical copies of the gcc- >> > > > 5.4.0 >> > > > source tarball. On the NFS server, they hash to the same value: >> > > > >> > > > server distfiles # md5sum gcc-5.4.0.tar.bz2* >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.1 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.2 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.3 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.4 >> > > > >> > > > On the MIPS system (the NFS client): >> > > > >> > > > bcm91250a-le distfiles # md5sum gcc-5.4.0.tar.bz2.2 >> > > > 35346975989954df8a8db2b034da610d gcc-5.4.0.tar.bz2.2 >> > > > bcm91250a-le distfiles # md5sum gcc-5.4.0.tar.bz2* >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.1 >> > > > 35346975989954df8a8db2b034da610d gcc-5.4.0.tar.bz2.2 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.3 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.4 >> > > > >> > > > The first file read will contain some corruption, and it is >> > > > persistent until... >> > > > >> > > > bcm91250a-le distfiles # echo 1 > /proc/sys/vm/drop_caches >> > > > bcm91250a-le distfiles # md5sum gcc-5.4.0.tar.bz2* >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.1 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.2 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.3 >> > > > 4c626ac2a83ef30dfb9260e6f59c2b30 gcc-5.4.0.tar.bz2.4 >> > > > >> > > > the caches are dropped, at which point it reads back properly. >> > > > >> > > > Note that the corruption is different across reboots, both in >> > > > the size >> > > > of the corruption and the location. I saw 1900~ and 1400~ byte >> > > > sequences corrupted on separate occasions, which don't >> > > > correspond to >> > > > the system's 16kB page size. >> > > > >> > > > I've tested kernels from v3.19 to 4.11-rc1+ (master branch from >> > > > today). All exhibit this behavior with differing frequencies. >> > > > Earlier >> > > > kernels seem to reproduce the issue less often, while more >> > > > recent >> > > > kernels reliably exhibit the problem every boot. >> > > > >> > > > How can I further debug this? >> > > >> > > I think I was wrong about the statement about kernels v3.19 to >> > > 4.11-rc1+. I found out I couldn't reproduce with 4.7-rc1 and then >> > > bisected to 4cd13c21b207e80ddb1144c576500098f2d5f882 ("softirq: >> > > Let >> > > ksoftirqd do its job"). Still reproduces with current tip of >> > > Linus' >> > > tree. >> > > >> > > Any ideas? The board's ethernet is an uncommon device supported >> > > by >> > > CONFIG_SB1250_MAC. Something about the ethernet driver maybe? >> > >> > With the patch reverted on master (reverts cleanly), NFS corruption >> > no >> > longer happens. >> >> Hi Matt. >> >> Thanks for bisecting. >> >> Patch simply exposes an existing bug more often by changing the way >> driver functions are scheduled. >> >> Which is probably a good thing. >> >> sbmac_intr() looks extremely suspicious to me. >> >> A NAPI driver hard interrupt should simply schedule NAPI. >> >> Apparently, if sbmac_intr() can not grab NAPIF_STATE_SCHED bit, it >> directly calls sbdma_rx_process() from >> hard interrupt context. >> >> Insane really. > > Please try this fix (not compiled on my x86 laptop, and I had no coffee > yet, so it might have some trivial errors) Thanks for the quick reply! I tried the patch on top of master, but unfortunately the corruption still occurs.
Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected
On 12/08/2017 at 05:04 PM Willem de Bruijn wrote: > On Fri, Dec 8, 2017 at 6:40 AM, Michal Kubecekwrote: >> On Fri, Dec 08, 2017 at 11:31:50AM +0100, Andreas Hartmann wrote: >>> On 12/08/2017 at 09:47 AM Michal Kubecek wrote: On Fri, Dec 08, 2017 at 08:21:16AM +0100, Andreas Hartmann wrote: > > All my VMs are using virtio_net. BTW: I couldn't see the problems > (sometimes, the VM couldn't be stopped at all) if all my VMs are using > e1000 as interface instead. > > This finding now matches pretty much the responsible UDP-package which > caused the stall. I already mentioned it here [2]. > > To prove it, I reverted from the patch series "[PATCH v2 RFC 0/13] > Remove UDP Fragmentation Offload support" [3] > > 11/13 [v2,RFC,11/13] net: Remove all references to SKB_GSO_UDP. [4] > 12/13 [v2,RFC,12/13] inet: Remove software UFO fragmenting code. [5] > 13/13 [v2,RFC,13/13] net: Kill NETIF_F_UFO and SKB_GSO_UDP. [6] > > and applied it to Linux 4.14.4. It compiled fine and is running fine. > The vnet doesn't die anymore. Yet, I can't say if the qemu stop hangs > are gone, too. > > Obviously, there is something broken with the new UDP handling. Could > you please analyze this problem? I could test some more patches ... . Any chance your VMs were live migrated from pre-4.14 host kernel? >>> >>> No - the VMs are not live migrated. They are always running on the same >>> host - either with kernel < 4.14 or with kernel 4.14.x. >> >> This is disturbing... unless I'm mistaken, it shouldn't be possible to >> have UFO enabled on a virtio device in a VM booted on a host with 4.14 >> kernel. > > Indeed. When working on that revert patch I verified that UFO in > the guest virtio_net was off before the revert patch, on after. > > Qemu should check host support with tap_probe_has_ufo > before advertising support to the guest. Indeed, this is exactly > what broke live migration in virtio_net_load_device at > > if (qemu_get_byte(f) && !peer_has_ufo(n)) { > error_report("virtio-net: saved image requires TUN_F_UFO support"); > return -1; > } > > Which follows > >peer_has_ufo > qemu_has_ufo >tap_has_ufo > s->has_ufo > > where s->has_ufo was set by tap_probe_has_ufo in net_tap_fd_init. > > Now, checking my qemu git branch, I ran pretty old 2.7.0-rc3. But this > codepath does not seem to have changed between then and 2.10.1. > > I cherry-picked the revert onto 4.14.3. It did not apply cleanly, but the > fix-up wasn't too hard. Compiled and booted, but untested otherwise. At > > https://github.com/wdebruij/linux/commits/v4.14.3-aargh-ufo I'm just running it at the moment. I didn't face any network hang until now - although the critical UDP packages have been gone through. Therefore: looks nice. Thanks, Andreas
[GIT] Networking
1) CAN fixes from Martin Kelly (cancel URBs properly in all the CAN usb drivers). 2) Revert returning -EEXIST from __dev_alloc_name() as this propagates to userspace and broke some apps. From Johannes Berg. 3) Fix conn memory leaks and crashes in TIPC, from Jon Malloc and Cong Wang. 4) Gianfar MAC can't do EEE so don't advertise it by default, from Claudiu Manoil. 5) Relax strict netlink attribute validation, but emit a warning. From David Ahern. 6) Fix regression in checksum offload of thunderx driver, from Florian Westphal. 7) Fix UAPI bpf issues on s390, from Hendrik Brueckner. 8) New card support in iwlwifi, from Ihab Zhaika. 9) BBR congestion control bug fixes from Neal Cardwell. 10) Fix port stats in nfp driver, from Pieter Jansen van Vuuren. 11) Fix leaks in qualcomm rmnet, from Subash Abhinov Kasiviswanathan. 12) Fix DMA API handling in sh_eth driver, from Thomas Petazzoni. 13) Fix spurious netpoll warnings in bnxt_en, from Calvin Owens. Please pull, thanks a lot! The following changes since commit 2391f0b4808e3d5af348324d69f5f45c56a26836: Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost (2017-12-04 11:32:02 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git for you to fetch changes up to fd29117aeb905aaacdf4ff5afbc7787fa50e16e4: Merge tag 'linux-can-fixes-for-4.15-20171208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can (2017-12-08 14:53:54 -0500) Al Viro (3): fix kcm_clone() socketpair(): allocate descriptors first make sock_alloc_file() do sock_release() on failures Andrew Lunn (2): net: dsa: mv88e6xxx: Fix interrupt masking on removal net: dsa: mv88e6xxx: Unregister MDIO bus on error path Andy Shevchenko (1): brcmfmac: Avoid build error with make W=1 Antoine Tenart (1): net: mvpp2: fix the RSS table entry offset Arend Van Spriel (1): brcmfmac: change driver unbind order of the sdio function devices Bert Kenward (1): sfc: pass valid pointers from efx_enqueue_unwind Bjørn Mork (1): usbnet: fix alignment for frames with no ethernet header Branislav Radocaj (1): net: ethernet: arc: fix error handling in emac_rockchip_probe Calvin Owens (1): bnxt_en: Fix sources of spurious netpoll warnings Chris Dion (1): net_sched: use macvlan real dev trans_start in dev_trans_start() Claudiu Manoil (1): gianfar: Disable EEE autoneg by default Cong Wang (1): tipc: fix a null pointer deref on error path Daniel Borkmann (1): Merge branch 'bpf-fix-broken-uapi-for-pt-regs' David Ahern (1): netlink: Relax attr validation for fixed length types David S. Miller (9): Merge branch 'RED-qdisc-fixes' Merge branch 'sh_eth-dma-mapping-fixes' Merge branch 'rmnet-Fix-leaks-in-failure-scenarios' Merge branch 'mv88e6xxx-error-patch-fixes' Merge git://git.kernel.org/.../bpf/bpf Merge branch 'tcp-bbr-sampling-fixes' Merge branch 'tcp-RACK-loss-recovery-bug-fixes' Merge tag 'wireless-drivers-for-davem-2017-12-08' of git://git.kernel.org/.../kvalo/wireless-drivers Merge tag 'linux-can-fixes-for-4.15-20171208' of git://git.kernel.org/.../mkl/linux-can David Spinadel (1): iwlwifi: mvm: enable RX offloading with TKIP and WEP Emmanuel Grumbach (3): iwlwifi: mvm: don't use transmit queue hang detection when it is not possible iwlwifi: mvm: fix the TX queue hang timeout for MONITOR vif type iwlwifi: mvm: fix packet injection Eric Dumazet (3): Revert "tcp: must block bh in __inet_twsk_hashdance()" net: remove hlist_nulls_add_tail_rcu() tcp: use current time in tcp_rcv_space_adjust() Florian Westphal (1): net: thunderx: Fix TCP/UDP checksum offload for IPv4 pkts Hendrik Brueckner (6): bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type s390/bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type arm64/bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type s390/uapi: correct whitespace & coding style in asm/ptrace.h selftests/bpf: sync kernel headers and introduce arch support in Makefile perf s390: add regs_query_register_offset() Håkon Bugge (1): rds: Fix NULL pointer dereference in __rds_rdma_map Ihab Zhaika (1): iwlwifi: add new cards for 9260 and 22000 series Joe Perches (1): xen-netback: Fix logging message with spurious period after newline Johannes Berg (2): iwlwifi: mvm: flush queue before deleting ROC Revert "net: core: maybe return -EEXIST in __dev_alloc_name" Jon Maloy (1): tipc: fix memory leak in tipc_accept_from_sock() Kalle Valo (2): Merge tag 'iwlwifi-for-kalle-2017-11-28' of git://git.kernel.org/.../iwlwifi/iwlwifi-fixes Merge tag 'iwlw
Re: [PATCH v5 0/6] enable creating [k,u]probe with perf_event_open
On 12/06/2017 11:45 PM, Song Liu wrote: > Changes PATCH v4 to PATCH v5: > Remove PERF_PROBE_CONFIG_IS_RETPROBE from uapi, use PMU_FORMAT_ATTR > instead. > > Changes PATCH v3 to PATCH v4: > Remove uapi define MAX_PROBE_FUNC_NAME_LEN, use KSYM_NAME_LEN instead. > Add flag PERF_PROBE_CONFIG_IS_RETPROBE for config field of [k,u]probe. > Optimize ifdef's of CONFIG_KPROBE_EVENTS and CONFIG_UPROBE_EVENTS. > Optimize checks in perf_event_is_tracing(). > Optimize perf_tp_register(). > > Changes PATCH v2 to PATCH v3: > Remove fixed type PERF_TYPE_KPROBE and PERF_TYPE_UPROBE, use dynamic > type instead. > Update userspace (samples/bpf, bcc) to look up type from sysfs. > Change License info in test_many_kprobe_user.c as Philippe Ombredanne > suggested. > > Changes PATCH v1 to PATCH v2: > Split PERF_TYPE_PROBE into PERF_TYPE_KPROBE and PERF_TYPE_UPROBE. > Split perf_probe into perf_kprobe and perf_uprobe. > Remove struct probe_desc, use config1 and config2 instead. > > Changes RFC v2 to PATCH v1: > Check type PERF_TYPE_PROBE in perf_event_set_filter(). > Rebase on to tip perf/core. > > Changes RFC v1 to RFC v2: > Fix build issue reported by kbuild test bot by adding ifdef of > CONFIG_KPROBE_EVENTS, and CONFIG_UPROBE_EVENTS. > > RFC v1 cover letter: > > This is to follow up the discussion over "new kprobe api" at Linux > Plumbers 2017: > > https://www.linuxplumbersconf.org/2017/ocw/proposals/4808 > > With current kernel, user space tools can only create/destroy [k,u]probes > with a text-based API (kprobe_events and uprobe_events in tracefs). This > approach relies on user space to clean up the [k,u]probe after using them. > However, this is not easy for user space to clean up properly. > > To solve this problem, we introduce a file descriptor based API. > Specifically, we extended perf_event_open to create [k,u]probe, and attach > this [k,u]probe to the file descriptor created by perf_event_open. These > [k,u]probe are associated with this file descriptor, so they are not > available in tracefs. > > We reuse large portion of existing trace_kprobe and trace_uprobe code. > Currently, the file descriptor API does not support arguments as the > text-based API does. This should not be a problem, as user of the file > decriptor based API read data through other methods (bpf, etc.). > > I also include a patch to to bcc, and a patch to man-page perf_even_open. > Please see the list below. A fork of bcc with this patch is also available > on github: > > https://github.com/liu-song-6/bcc/tree/perf_event_open Peter / Stephen, I presume this will be routed through one of you, if not please yell. Thanks, Daniel > man-pages patch: > perf_event_open.2: add type kprobe and uprobe > > bcc patch: > bcc: Try use new API to create [k,u]probe with perf_event_open > > kernel patches: > > Song Liu (6): > perf: prepare perf_event.h for new types perf_kprobe and perf_uprobe > perf: copy new perf_event.h to tools/include/uapi > perf: implement pmu perf_kprobe > perf: implement pmu perf_uprobe > bpf: add option for bpf_load.c to use perf_kprobe > bpf: add new test test_many_kprobe > > include/linux/trace_events.h | 8 ++ > include/uapi/linux/perf_event.h | 4 + > kernel/events/core.c | 131 +++- > kernel/trace/trace_event_perf.c | 102 +++ > kernel/trace/trace_kprobe.c | 91 +++-- > kernel/trace/trace_probe.h| 11 ++ > kernel/trace/trace_uprobe.c | 86 ++-- > samples/bpf/Makefile | 3 + > samples/bpf/bpf_load.c| 66 ++-- > samples/bpf/bpf_load.h| 14 +++ > samples/bpf/test_many_kprobe_user.c | 186 > ++ > tools/include/uapi/linux/perf_event.h | 4 + > 12 files changed, 677 insertions(+), 29 deletions(-) > create mode 100644 samples/bpf/test_many_kprobe_user.c > > -- > 2.9.5 >
Re: pull-request: can 2017-12-08
From: Marc Kleine-BuddeDate: Fri, 8 Dec 2017 16:58:11 +0100 > this is a pull request of 6 patches for net/master. > > Martin Kelly provides 5 patches for various USB based CAN drivers, that > properly cancel the URBs on adapter unplug, so that the driver doesn't > end up in an endless loop. Stephane Grosjean provides a patch to restart > the tx queue if zero length packages are transmitted. Pulled, thanks Marc.
Re: [PATCH net-next] macvlan: fix memory hole in macvlan_dev
From: Girish MoodalbailDate: Fri, 8 Dec 2017 06:03:26 -0800 > Move 'macaddr_count' from after 'netpoll' to after 'nest_level' to pack > and reduce a memory hole. > > Fixes: 88ca59d1aaf28c25 (macvlan: remove unused fields in struct macvlan_dev) > Signed-off-by: Girish Moodalbail Applied, thanks.
Re: [PATCH net-next 0/2] tools: bpftool: clean up and extend Makefiles
On 12/08/2017 12:00 AM, Jakub Kicinski wrote: > Hi! > > This is a follow up to a series of Makefile fixes for bpftool from > two weeks ago. I think there will have to be a merge back of net-next > into bpf-next (or rebase), AFAICT the previous series arrived in > net-next already, but not in bpf-next. I hope that makes sense. > FWIW this should not conflict with Roman's cgroup work. > > Quentin says: > > First patch of this series cleans up the two Makefiles (Makefile and > Documentation/Makefile) and make their contents more consistent. > > The second one add "uninstall" and "doc-uninstall" targets, to remove files > previously installed on the system with "install" and "doc-install" > targets. Series applied to bpf-next, thanks guys!
Re: [PATCH] slip: sl_alloc(): remove unused parameter "dev_t line"
From: Marc Kleine-BuddeDate: Fri, 8 Dec 2017 12:18:59 +0100 > The first and only parameter of sl_alloc() is unused, so remove it. > > Fixes: 5342b77c4123 slip: ("Clean up create and destroy") > Signed-off-by: Marc Kleine-Budde Applied. While reviewing this I noticed that slip_devs[], along with the arbitrary slip_maxdev limit, can be removed and replaced with a simply linked list. All the array is used for it proper teardown during module unload. This is made even more clear by the fact that the index stored in dev->base_addr is never ever used.
Re: [PATCH net] net: mvpp2: fix the RSS table entry offset
From: Antoine TenartDate: Fri, 8 Dec 2017 10:24:20 +0100 > The macro used to access or set an RSS table entry was using an offset > of 8, while it should use an offset of 0. This lead to wrongly configure > the RSS table, not accessing the right entries. > > Fixes: 1d7d15d79fb4 ("net: mvpp2: initialize the RSS tables") > Signed-off-by: Antoine Tenart Applied.
Re: pull-request: bpf-next 2017-12-07
From: Alexei StarovoitovDate: Thu, 7 Dec 2017 22:22:11 -0800 > The following pull-request contains BPF updates for your net-next tree. > > The main changes are: > > 1) Detailed documentation of BPF development process from Daniel. > > 2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops >from Lawrence. > > 3) Minor follow up for bpf_skb_set_tunnel_key() from William. > > Please consider pulling these changes from: > > git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git Pulled, thanks Alexei.
Re: [PATCH net-next v2 0/6] cxgb4: collect hardware logs via ethtool
From: Rahul LakkireddyDate: Fri, 8 Dec 2017 09:48:35 +0530 > Collect more hardware logs via ethtool --get-dump facility. > > Patch 1 collects on-chip memory layout information. > > Patch 2 collects on-chip MC memory dumps. > > Patch 3 collects HMA memory dump. > > Patch 4 evaluates and skips TX and RX payload regions in memory dumps. > > Patch 5 collects egress and ingress SGE queue contexts. > > Patch 6 collects PCIe configuration logs ... > --- > v2: > - Fix uninitialized variable "size" build warning in Patch 1. That looks better, series applied, thanks.
Re: [PATCH v2 1/3] Bluetooth: hci_ll: add support for setting public address
Hi David, >>> This adds support for setting the public address on Texas Instruments >>> Bluetooth chips using a vendor-specific command. >>> >>> This has been tested on a CC2560A. The TI wiki also indicates that this >>> command should work on TI WL17xx/WL18xx Bluetooth chips. >>> >>> Signed-off-by: David Lechner>>> --- >>> >>> v2 changes: >>> * This is a new patch in v2 >>> >>> drivers/bluetooth/hci_ll.c | 17 + >>> 1 file changed, 17 insertions(+) >>> >>> diff --git a/drivers/bluetooth/hci_ll.c b/drivers/bluetooth/hci_ll.c >>> index 974a788..b732004 100644 >>> --- a/drivers/bluetooth/hci_ll.c >>> +++ b/drivers/bluetooth/hci_ll.c >>> @@ -57,6 +57,7 @@ >>> #include "hci_uart.h" >>> >>> /* Vendor-specific HCI commands */ >>> +#define HCI_VS_WRITE_BD_ADDR 0xfc06 >>> #define HCI_VS_UPDATE_UART_HCI_BAUDRATE 0xff36 >>> >>> /* HCILL commands */ >>> @@ -662,6 +663,20 @@ static int download_firmware(struct ll_device *lldev) >>> return err; >>> } >>> >>> +static int ll_set_bdaddr(struct hci_dev *hdev, const bdaddr_t *bdaddr) >>> +{ >>> + bdaddr_t bdaddr_swapped; >>> + struct sk_buff *skb; >>> + >>> + baswap(_swapped, bdaddr); >>> + skb = __hci_cmd_sync(hdev, HCI_VS_WRITE_BD_ADDR, sizeof(bdaddr_t), >>> +_swapped, HCI_INIT_TIMEOUT); >>> + if (!IS_ERR(skb)) >>> + kfree_skb(skb); >>> + >> You have a trailing whitespace here. >> Does the HCI command really expect the BD_ADDR in the swapped order. The >> caller of hdev->set_bdaddr while provide it in the same order as the HCI >> Read BD Address command and everything in HCI. So it seems odd that you have >> to swap it for the vendor command. >> So have you actually tested this with btmgmt public-add and >> checked that the address comes out correctly. I think ll_set_bdaddr should >> function correctly for the mgmt interface. And if needed any other caller >> outside of mgmt has to do the swapping. > > I did test using btmgmt public-address 00:11:22:33:44:55, which is how I > found out that the order needed to be swapped. Like you, I was surprised. I > couldn't find any documentation from TI saying what the order is supposed to > be, so I can only assume that because this works, it is indeed correct as-is. then please add a comment for that and I would appreciate to have the parts from btmon showing the public-addr command and the following HCI Read BD Address command as part of the commit message. Just for being able to dig this out at some later point if needed. Regards Marcel
Re: [PATCH net-next v3 0/2] veth and GSO maximums
From: Stephen HemmingerDate: Thu, 7 Dec 2017 15:40:18 -0800 > This is the more general way to solving the issue of GSO limits > not being set correctly for containers on Azure. If a GSO packet > is sent to host that exceeds the limit (reported by NDIS), then > the host is forced to do segmentation in software which has noticeable > performance impact. > > The core rtnetlink infrastructure already has the messages and > infrastructure to allow changing gso limits. With an updated iproute2 > the following already works: > # ip li set dev dummy0 gso_max_size 3 > > These patches are about making it easier with veth. Ok, this is definitely a step forward. Series applied, thanks Stephen.
Re: [PATCH v2 1/3] Bluetooth: hci_ll: add support for setting public address
On 12/08/2017 02:07 AM, Marcel Holtmann wrote: Hi David, This adds support for setting the public address on Texas Instruments Bluetooth chips using a vendor-specific command. This has been tested on a CC2560A. The TI wiki also indicates that this command should work on TI WL17xx/WL18xx Bluetooth chips. Signed-off-by: David Lechner--- v2 changes: * This is a new patch in v2 drivers/bluetooth/hci_ll.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/bluetooth/hci_ll.c b/drivers/bluetooth/hci_ll.c index 974a788..b732004 100644 --- a/drivers/bluetooth/hci_ll.c +++ b/drivers/bluetooth/hci_ll.c @@ -57,6 +57,7 @@ #include "hci_uart.h" /* Vendor-specific HCI commands */ +#define HCI_VS_WRITE_BD_ADDR 0xfc06 #define HCI_VS_UPDATE_UART_HCI_BAUDRATE 0xff36 /* HCILL commands */ @@ -662,6 +663,20 @@ static int download_firmware(struct ll_device *lldev) return err; } +static int ll_set_bdaddr(struct hci_dev *hdev, const bdaddr_t *bdaddr) +{ + bdaddr_t bdaddr_swapped; + struct sk_buff *skb; + + baswap(_swapped, bdaddr); + skb = __hci_cmd_sync(hdev, HCI_VS_WRITE_BD_ADDR, sizeof(bdaddr_t), +_swapped, HCI_INIT_TIMEOUT); + if (!IS_ERR(skb)) + kfree_skb(skb); + You have a trailing whitespace here. Does the HCI command really expect the BD_ADDR in the swapped order. The caller of hdev->set_bdaddr while provide it in the same order as the HCI Read BD Address command and everything in HCI. So it seems odd that you have to swap it for the vendor command. So have you actually tested this with btmgmt public-add and checked that the address comes out correctly. I think ll_set_bdaddr should function correctly for the mgmt interface. And if needed any other caller outside of mgmt has to do the swapping. I did test using btmgmt public-address 00:11:22:33:44:55, which is how I found out that the order needed to be swapped. Like you, I was surprised. I couldn't find any documentation from TI saying what the order is supposed to be, so I can only assume that because this works, it is indeed correct as-is.
Re: [PATH net 0/4] tcp: RACK loss recovery bug fixes
From: Yuchung ChengDate: Thu, 7 Dec 2017 11:33:29 -0800 > This patch set has four minor bug fixes in TCP RACK loss recovery. Series applied, thank you.
Re: [PATCH net-next] net: dsa: lan9303: Protect ALR operations with mutex
From: Egil HjelmelandDate: Thu, 7 Dec 2017 19:56:04 +0100 > ALR table operations are a sequence of related register operations which > should be protected from concurrent access. The alr_cache should also be > protected. Add alr_mutex doing that. > > Signed-off-by: Egil Hjelmeland Applied.
Re: [patch net-next] net: sched: fix use-after-free in tcf_block_put_ext
From: Jiri PirkoDate: Fri, 8 Dec 2017 19:27:27 +0100 > From: Jiri Pirko > > Since the block is freed with last chain being put, once we reach the > end of iteration of list_for_each_entry_safe, the block may be > already freed. I'm hitting this only by creating and deleting clsact: ... > Fix this by holding the block also by chain 0 and put chain 0 > explicitly, out of the list_for_each_entry_safe loop at the very > end of tcf_block_put_ext. > > Fixes: efbf78973978 ("net_sched: get rid of rcu_barrier() in > tcf_block_put_ext()") > Signed-off-by: Jiri Pirko Applied, thanks Jiri.
Re: [PATCH] bnxt_en: Fix sources of spurious netpoll warnings
From: Michael ChanDate: Fri, 8 Dec 2017 09:35:42 -0800 > On Fri, Dec 8, 2017 at 9:05 AM, Calvin Owens wrote: >> After applying 2270bc5da3497945 ("bnxt_en: Fix netpoll handling") and >> 903649e718f80da2 ("bnxt_en: Improve -ENOMEM logic in NAPI poll loop."), >> we still see the following WARN fire: ... >> This happens because we increment rx_pkts on -ENOMEM and -EIO, resulting >> in rx_pkts > 0. Fix this by only bumping rx_pkts if we were actually >> given a non-zero budget. >> >> Signed-off-by: Calvin Owens > > Thanks. > Acked-by: Michael Chan Applied and queued up for -stable.
Re: [PATCH net-next 5/6] net: qualcomm: rmnet: Allow to configure flags for new devices
On 2017-12-08 09:26, Dan Williams wrote: On Sun, 2017-12-03 at 23:37 -0700, Subash Abhinov Kasiviswanathan wrote: Add an option to configure the rmnet aggregation and command features on device creation. This is achieved by using the vlan flags option. Does this overload the VLAN flags item with different meanings than VLAN_FLAG_* that are specific to rmnet? Dan Hi Dan Yes, that's right. I am using it to configure rmnet specific features. -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [net-next PATCH 00/14] lockless qdisc series
From: John FastabendDate: Thu, 07 Dec 2017 09:53:46 -0800 > This series adds support for building lockless qdiscs. Series applied, nice work. I particularly like how you solved the GSO and bad_tx requeueing. Thanks!
Re: [iproute2] ss: print tcpi_rcv_ssthresh
On Thu, 7 Dec 2017 16:12:00 -0800 Wei Wangwrote: > From: Wei Wang > > tcpi_rcv_ssthresh is an important stats when debugging receive side > behavior. > Add it to the ss output. > > Signed-off-by: Wei Wang > Signed-off-by: Eric Dumazet Applied
Re: [PATCH net 0/3] TCP BBR sampling fixes for loss recovery undo
From: Neal CardwellDate: Thu, 7 Dec 2017 12:43:29 -0500 > This patch series has a few minor bug fixes for cases where spurious > loss recoveries can trick BBR estimators into estimating that the > available bandwidth is much lower than the true available bandwidth. > In both cases the fix here is to just reset the estimator upon loss > recovery undo. Series applied and queued up for -stable, thanks Neal.
[patch net-next] net: sched: fix use-after-free in tcf_block_put_ext
From: Jiri PirkoSince the block is freed with last chain being put, once we reach the end of iteration of list_for_each_entry_safe, the block may be already freed. I'm hitting this only by creating and deleting clsact: [ 202.171952] == [ 202.180182] BUG: KASAN: use-after-free in tcf_block_put_ext+0x240/0x390 [ 202.187590] Read of size 8 at addr 880225539a80 by task tc/796 [ 202.194508] [ 202.196185] CPU: 0 PID: 796 Comm: tc Not tainted 4.15.0-rc2jiri+ #5 [ 202.203200] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016 [ 202.213613] Call Trace: [ 202.216369] dump_stack+0xda/0x169 [ 202.220192] ? dma_virt_map_sg+0x147/0x147 [ 202.224790] ? show_regs_print_info+0x54/0x54 [ 202.229691] ? tcf_chain_destroy+0x1dc/0x250 [ 202.234494] print_address_description+0x83/0x3d0 [ 202.239781] ? tcf_block_put_ext+0x240/0x390 [ 202.244575] kasan_report+0x1ba/0x460 [ 202.248707] ? tcf_block_put_ext+0x240/0x390 [ 202.253518] tcf_block_put_ext+0x240/0x390 [ 202.258117] ? tcf_chain_flush+0x290/0x290 [ 202.262708] ? qdisc_hash_del+0x82/0x1a0 [ 202.267111] ? qdisc_hash_add+0x50/0x50 [ 202.271411] ? __lock_is_held+0x5f/0x1a0 [ 202.275843] clsact_destroy+0x3d/0x80 [sch_ingress] [ 202.281323] qdisc_destroy+0xcb/0x240 [ 202.285445] qdisc_graft+0x216/0x7b0 [ 202.289497] tc_get_qdisc+0x260/0x560 Fix this by holding the block also by chain 0 and put chain 0 explicitly, out of the list_for_each_entry_safe loop at the very end of tcf_block_put_ext. Fixes: efbf78973978 ("net_sched: get rid of rcu_barrier() in tcf_block_put_ext()") Signed-off-by: Jiri Pirko --- net/sched/cls_api.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index d51051d..5b9b8a6 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -342,23 +342,24 @@ void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q, { struct tcf_chain *chain, *tmp; - /* Hold a refcnt for all chains, except 0, so that they don't disappear + /* Hold a refcnt for all chains, so that they don't disappear * while we are iterating. */ list_for_each_entry(chain, >chain_list, list) - if (chain->index) - tcf_chain_hold(chain); + tcf_chain_hold(chain); list_for_each_entry(chain, >chain_list, list) tcf_chain_flush(chain); tcf_block_offload_unbind(block, q, ei); - /* At this point, all the chains should have refcnt >= 1. Block will be -* freed after all chains are gone. -*/ + /* At this point, all the chains should have refcnt >= 1. */ list_for_each_entry_safe(chain, tmp, >chain_list, list) tcf_chain_put(chain); + + /* Finally, put chain 0 and allow block to be freed. */ + chain = list_first_entry(>chain_list, struct tcf_chain, list); + tcf_chain_put(chain); } EXPORT_SYMBOL(tcf_block_put_ext); -- 2.9.5
Re: [PATCH net] sfc: pass valid pointers from efx_enqueue_unwind
From: Bert KenwardDate: Thu, 7 Dec 2017 17:18:58 + > The bytes_compl and pkts_compl pointers passed to efx_dequeue_buffers > cannot be NULL. Add a paranoid warning to check this condition and fix > the one case where they were NULL. > > efx_enqueue_unwind() is called very rarely, during error handling. > Without this fix it would fail with a NULL pointer dereference in > efx_dequeue_buffer, with efx_enqueue_skb in the call stack. > > Fixes: e9117e5099ea ("sfc: Firmware-Assisted TSO version 2") > Reported-by: Jarod Wilson > Signed-off-by: Bert Kenward Applied and queued up for -stable, thanks.
Re: [PATCH net] gianfar: Disable EEE autoneg by default
From: Claudiu ManoilDate: Thu, 7 Dec 2017 18:44:23 +0200 > This controller does not support EEE, but it may connect to a PHY > which supports EEE and advertises EEE by default, while its link > partner also advertises EEE. If this happens, the PHY enters low > power mode when the traffic rate is low and causes packet loss. > This patch disables EEE advertisement by default for any PHY that > gianfar connects to, to prevent the above unwanted outcome. > > Signed-off-by: Shaohui Xie > Tested-by: Yangbo Lu > Signed-off-by: Claudiu Manoil Applied, thank you.
Re: WireGuard Upstreaming Roadmap (November 2017)
Hi Dave, On Fri, Dec 8, 2017 at 4:38 PM, David Millerwrote: > Sorry, you cannot force the discussion of a feature which will be submitted > upstream to occur on a private mailing list. > > It is _ABSOLUTELY_ appropriate to discss this on netdev since it is the > netdev community which must consider issues like this when looking at > whether to accept WireGuard upstream. > > Jason, this action and response was entirely inappropriate, and please > I'd like you to reply properly to questions about your feature here. Whoops, okay! Very sorry. I'm actually kind of happy to hear that. I had assumed that you'd be annoyed if WireGuard crypto discussion spewed over into netdev adding even more message volume there for something perhaps not directly relevant. But in fact, you're interested and find it important to discuss there. So, good news. And sorry for trying to shew it away before. I'll copy and paste the response I had made on the other list: > This is covered in the paper: > https://www.wireguard.com/papers/wireguard.pdf > > The basic answer is that WireGuard has message type identifiers, and > the handshake also hashes into it an identifier of the primitives > used. If there's ever a problem with those primitives chosen, it will > be possible to introduce new message type identifiers, if that kind of > "support everything even the broken garbage" approach is desired by > misguided people. However, a better approach, of course, is to keep > your secure network separate from your insecure network, and to not > allow insecure nodes on secure segments; when you mix the two, > disaster tends to strike. So, in other words, both approaches to "upgrading" > are possible, in this fantasy wireguardalypse. Take note, though, that > neither one of these approaches (support new and retain old protocol > too for old nodes, or only support new) are "agile" or are anything at > all like the 90s "cipher agility" -- the user still is not permitted > to "choose" ciphers. Regards, Jason
Re: [PATCH net-next] virtio_net: Disable interrupts if napi_complete_done rescheduled napi
From: Jason WangDate: Thu, 7 Dec 2017 15:08:45 +0800 > > > On 2017年12月07日 13:09, Michael S. Tsirkin wrote: >> On Thu, Dec 07, 2017 at 01:15:15PM +0900, Toshiaki Makita wrote: >>> Since commit 39e6c8208d7b ("net: solve a NAPI race") napi has been >>> able >>> to be rescheduled within napi_complete_done() even in non-busypoll >>> case, >>> but virtnet_poll() always enabled interrupts before complete, and when >>> napi was rescheduled within napi_complete_done() it did not disable >>> interrupts. >>> This caused more interrupts when event idx is disabled. >>> >>> According to commit cbdadbbf0c79 ("virtio_net: fix race in RX VQ >>> processing") we cannot place virtqueue_enable_cb_prepare() after >>> NAPI_STATE_SCHED is cleared, so disable interrupts again if >>> napi_complete_done() returned false. >>> >>> Tested with vhost-user of OVS 2.7 on host, which does not have the >>> event >>> idx feature. ... >>> Signed-off-by: Toshiaki Makita >> Acked-by: Michael S. Tsirkin >> >> it might make sense in net and not -next since tx napi regressed >> performance >> in some configs, this might bring it back at least partially. >> Jason - what do you think? > > No sure, the regression I saw was tested with event idx on. And > virtqueue_disable_cb() does almost nothing for event idx (or even a > little bit slower). > > The patch looks good. > > Acked-by: Jason Wang I'm going to put this into net-next for now.
Re: [PATCH] ptr_ring: add barriers
From: "Michael S. Tsirkin"Date: Tue, 5 Dec 2017 21:29:37 +0200 > Users of ptr_ring expect that it's safe to give the > data structure a pointer and have it be available > to consumers, but that actually requires an smb_wmb > or a stronger barrier. > > In absence of such barriers and on architectures that reorder writes, > consumer might read an un=initialized value from an skb pointer stored > in the skb array. This was observed causing crashes. > > To fix, add memory barriers. The barrier we use is a wmb, the > assumption being that producers do not need to read the value so we do > not need to order these reads. > > Reported-by: George Cherian > Suggested-by: Jason Wang > Signed-off-by: Michael S. Tsirkin > --- > > George, could you pls report whether this patch fixes > the issue for you? > > This seems to be needed in stable as well. I really need some testing feedback for this before I apply it and queue it up for -stable. George?
Re: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave
On Sat, Dec 9, 2017 at 1:29 AM, David Laightwrote: > From: Xin Long ... >> Hi, David, Sorry, I'm not sure we're worrying about the cpu cost or >> codes style now ? >> >> For cpu cost, I think 0x848(%r13) operation must be better than the >> generated code of if-else. > > Nope - the call xxx(%ryyy) is likely to be a data cache miss and a complete > cpu pipeline stall. > > The conditional will be a data cache hit and the code (for one branch) > will be prefetched and speculatively executed. > > Some very modern cpu might manage to predict indirect jumps, but for > most it is a full pipeline stall. Thanks for the CPU information. The thing is with if-else can't avoid xxx(%ryyy) in this case, as Marcelo said above. It seems if-else will just be a extra cost compare to this one.
Re: [PATCH] bnxt_en: Fix sources of spurious netpoll warnings
On Fri, Dec 8, 2017 at 9:05 AM, Calvin Owenswrote: > After applying 2270bc5da3497945 ("bnxt_en: Fix netpoll handling") and > 903649e718f80da2 ("bnxt_en: Improve -ENOMEM logic in NAPI poll loop."), > we still see the following WARN fire: > > [ cut here ] > WARNING: CPU: 0 PID: 1875170 at net/core/netpoll.c:165 > netpoll_poll_dev+0x15a/0x160 > bnxt_poll+0x0/0xd0 exceeded budget in poll > > Call Trace: >[] dump_stack+0x4d/0x70 >[] __warn+0xd3/0xf0 >[] warn_slowpath_fmt+0x4f/0x60 >[] netpoll_poll_dev+0x15a/0x160 >[] netpoll_send_skb_on_dev+0x168/0x250 >[] netpoll_send_udp+0x2dc/0x440 >[] write_ext_msg+0x20e/0x250 >[] call_console_drivers.constprop.23+0xa5/0x110 >[] console_unlock+0x339/0x5b0 >[] vprintk_emit+0x2c8/0x450 >[] vprintk_default+0x1f/0x30 >[] printk+0x48/0x50 >[] edac_raw_mc_handle_error+0x563/0x5c0 [edac_core] >[] edac_mc_handle_error+0x42b/0x6e0 [edac_core] >[] sbridge_mce_output_error+0x410/0x10d0 [sb_edac] >[] sbridge_check_error+0xac/0x130 [sb_edac] >[] edac_mc_workq_function+0x3c/0x90 [edac_core] >[] process_one_work+0x19b/0x480 >[] worker_thread+0x6a/0x520 >[] kthread+0xe4/0x100 >[] ret_from_fork+0x22/0x40 > > This happens because we increment rx_pkts on -ENOMEM and -EIO, resulting > in rx_pkts > 0. Fix this by only bumping rx_pkts if we were actually > given a non-zero budget. > > Signed-off-by: Calvin Owens Thanks. Acked-by: Michael Chan
RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave
From: Xin Long ... > Hi, David, Sorry, I'm not sure we're worrying about the cpu cost or > codes style now ? > > For cpu cost, I think 0x848(%r13) operation must be better than the > generated code of if-else. Nope - the call xxx(%ryyy) is likely to be a data cache miss and a complete cpu pipeline stall. The conditional will be a data cache hit and the code (for one branch) will be prefetched and speculatively executed. Some very modern cpu might manage to predict indirect jumps, but for most it is a full pipeline stall. David
Re: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave
On Sat, Dec 9, 2017 at 12:22 AM, David Laightwrote: > From: Xin Long >> Sent: 08 December 2017 16:18 >> > ... >> >> Alternatively you could preform the dereference in two steps (i.e. >> >> declare an si >> >> pointer on the stack and set it equal to asoc->stream.si, then deref >> >> si->make_datafrag at call time. That will at least give the compiler an >> >> opportunity to preload the first pointer. > > You want to save the function pointer itself. > > ... >> Another small difference: >> as you can see, comparing to (X), (Y) is using 0x28(%rsp) in the loop, >> instead of %r13. >> >> So that's what I can see from the related generated code. >> If 0x848(%r13) is not worse than 0x28(%rsp) for cpu, I think >> asoc->stream.si->make_datafrag() is even better. No ? > > That code must have far too many life local variables. > Otherwise there's be a caller saved register available. > Hi, David, Sorry, I'm not sure we're worrying about the cpu cost or codes style now ? For cpu cost, I think 0x848(%r13) operation must be better than the generated code of if-else. For the codes style, comparing to the if-else, I think this one is more readable. (ignore extendible stuff first, as probably no more new type of data chunk).
Re: [PATCH v4.1] phylib: Add device reset GPIO support
Hello! On 12/08/2017 12:53 PM, Geert Uytterhoeven wrote: On 12/04/2017 03:35 PM, Geert Uytterhoeven wrote: From: Sergei ShtylyovThe PHY devices sometimes do have their reset signal (maybe even power supply?) tied to some GPIO and sometimes it also does happen that a boot loader does not leave it deasserted. So far this issue has been attacked from (as I believe) a wrong angle: by teaching the MAC driver to manipulate the GPIO in question; that solution, when applied to the device trees, led to adding the PHY reset GPIO properties to the MAC device node, with one exception: Cadence MACB driver which could handle the "reset-gpios" prop in a PHY device subnode. I believe that the correct approach is to teach the 'phylib' to get the MDIO device reset GPIO from the device tree node corresponding to this device -- which this patch is doing... Note that I had to modify the AT803x PHY driver as it would stop working otherwise -- it made use of the reset GPIO for its own purposes... Signed-off-by: Sergei Shtylyov Acked-by: Rob Herring [geert: Propagate actual errors from fwnode_get_named_gpiod()] [geert: Avoid destroying initial setup] [geert: Consolidate GPIO descriptor acquiring code] Signed-off-by: Geert Uytterhoeven [...] diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c index 2df7b62c1a36811e..8f8b7747c54bc478 100644 --- a/drivers/net/phy/mdio_bus.c +++ b/drivers/net/phy/mdio_bus.c [...] @@ -48,9 +49,26 @@ int mdiobus_register_device(struct mdio_device *mdiodev) { + struct gpio_desc *gpiod = NULL; + if (mdiodev->bus->mdio_map[mdiodev->addr]) return -EBUSY; + /* Deassert the optional reset signal */ Umm, but why deassert it here for such a short time? That's a consequence of moving it from drivers/of/of_mdio.c to here. Well, you shouldn't do code moves without some thinking. ;-) Not that it was deasserted that much longer in drivers/of/of_mdio.c, though... There it had a reason, here I'm not seeing one. Perhaps using GPIOD_ASIS (or GPIOD_OUT_HIGH) instead of GPIOD_OUT_LOW and dropping mdio_device_reset(mdiodev, 1) afterwards would make more sense here? Gr{oetje,eeting}s, Geert MBR, Sergei
[PATCH iproute2 net-next 4/4] ss: Implement automatic column width calculation
Group fitting fields into lines and space them equally using the remaining screen width for each line. If columns don't fit on one line, break them into the least possible amount of lines and keep them aligned across lines. This is done by: - recording the length of the longest item in each column during formatting and buffering (which was added in the previous patch) - fitting as many fields as possible on each line of output - distributing the remaining padding space equally between the columns Signed-off-by: Stefano BrivioReviewed-by: Sabrina Dubroca --- misc/ss.c | 188 +++--- 1 file changed, 120 insertions(+), 68 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index abc0da7fa8fe..44ad9587b62c 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -128,19 +128,21 @@ struct column { const enum col_align align; const char *header; const char *ldelim; - int width; /* Including delimiter. -1: fit to content, 0: hide */ + int disabled; + int width; /* Calculated, including additional layout spacing */ + int max_len;/* Measured maximum field length in this column */ }; static struct column columns[] = { - { ALIGN_LEFT, "Netid","", 0 }, - { ALIGN_LEFT, "State"," ",0 }, - { ALIGN_LEFT, "Recv-Q", " ",7 }, - { ALIGN_LEFT, "Send-Q", " ",7 }, - { ALIGN_RIGHT, "Local Address:", " ",0 }, - { ALIGN_LEFT, "Port", "", 0 }, - { ALIGN_RIGHT, "Peer Address:"," ",0 }, - { ALIGN_LEFT, "Port", "", 0 }, - { ALIGN_LEFT, "", "", -1 }, + { ALIGN_LEFT, "Netid","", 0, 0, 0 }, + { ALIGN_LEFT, "State"," ",0, 0, 0 }, + { ALIGN_LEFT, "Recv-Q", " ",0, 0, 0 }, + { ALIGN_LEFT, "Send-Q", " ",0, 0, 0 }, + { ALIGN_RIGHT, "Local Address:", " ",0, 0, 0 }, + { ALIGN_LEFT, "Port", "", 0, 0, 0 }, + { ALIGN_RIGHT, "Peer Address:"," ",0, 0, 0 }, + { ALIGN_LEFT, "Port", "", 0, 0, 0 }, + { ALIGN_LEFT, "", "", 0, 0, 0 }, }; static struct column *current_field = columns; @@ -959,7 +961,7 @@ static void out(const char *fmt, ...) char *pos; int len; - if (!f->width) + if (f->disabled) return; if (!buffer.head) @@ -982,7 +984,7 @@ static int print_left_spacing(struct column *f, int stored, int printed) { int s; - if (f->width < 0 || f->align == ALIGN_LEFT) + if (!f->width || f->align == ALIGN_LEFT) return 0; s = f->width - stored - printed; @@ -1000,7 +1002,7 @@ static void print_right_spacing(struct column *f, int printed) { int s; - if (f->width < 0 || f->align == ALIGN_RIGHT) + if (!f->width || f->align == ALIGN_RIGHT) return; s = f->width - printed; @@ -1017,9 +1019,12 @@ static void field_flush(struct column *f) struct buf_chunk *chunk = buffer.tail; unsigned int pad = buffer.cur->len % 2; - if (!f->width) + if (f->disabled) return; + if (buffer.cur->len > f->max_len) + f->max_len = buffer.cur->len; + /* We need a new chunk if we can't store the next length descriptor. * Mind the gap between end of previous token and next aligned position * for length descriptor. @@ -1062,7 +1067,7 @@ static void field_set(enum col_id id) static void print_header(void) { while (!field_is_last(current_field)) { - if (current_field->width) + if (!current_field->disabled) out(current_field->header); field_next(); } @@ -1095,16 +1100,106 @@ static void buf_free_all(void) buffer.head = NULL; } +/* Calculate column width from contents length. If columns don't fit on one + * line, break them into the least possible amount of lines and keep them + * aligned across lines. Available screen space is equally spread between fields + * as additional spacing. + */ +static void render_calc_width(int screen_width) +{ + int first, len = 0, linecols = 0; + struct column *c, *eol = columns - 1; + + /* First pass: set width for each column to measured content length */ + for (first = 1, c = columns; c - columns < COL_MAX; c++) { + if (c->disabled) + continue; + + if (!first && c->max_len) + c->width = c->max_len + strlen(c->ldelim); + else + c->width = c->max_len; + + /* But
[PATCH iproute2 net-next 0/4] Abstract columns, properly space and wrap fields
Currently, 'ss' simply subdivides the whole available screen width between available columns, starting from a set of hardcoded amount of spacing and growing column widths. This makes the output unreadable in several cases, as it doesn't take into account the actual content width. Fix this by introducing a simple abstraction for columns, buffering the output, measuring the width of the fields, grouping fields into lines as they fit, equally distributing any remaining whitespace, and finally rendering the result. Some examples are reported below [1]. This implementation doesn't seem to cause any significant performance issues, as reported in 3/4. Patch 1/4 replaces all relevant printf() calls by the out() helper, which simply consists of the usual printf() implementation. Patch 2/4 implements column abstraction, with configurable column width and delimiters, and 3/4 splits buffering and rendering phases, employing a simple buffering mechanism with chunked allocation and introducing a rendering function. Up to this point, the output is still unchanged. Finally, 4/4 introduces field width calculation based on content length measured while buffering, in order to split fields onto multiple lines and equally space them within the single lines. Now that column behaviour is well-defined and more easily configurable, it should be easier to further improve the output by splitting logically separable information (e.g. TCP details) into additional columns. However, this patchset keeps the full "extended" information into a single column, for the moment being. [1] - 80 columns terminal, ss -Z -f netlink * before: Recv-Q Send-Q Local Address:Port Peer Address:Port 0 0rtnl:evolution-calen/2075 * pr oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 0 0rtnl:abrt-applet/32700 * pr oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 0 0rtnl:firefox/21619 * pr oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 0 0rtnl:evolution-calen/32639 * p roc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 [...] * after: Recv-Q Send-Q Local Address:Port Peer Address:Port 00 rtnl:evolution-calen/2075 * proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 00 rtnl:abrt-applet/32700 * proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 00 rtnl:firefox/21619 * proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 00 rtnl:evolution-calen/32639 * proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 [...] - 80 colums terminal, ss -tunpl * before: Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port udpUNCONN 0 0 *:37732 *:* udpUNCONN 0 0 *:5353 *:* udpUNCONN 0 0 192.168.122.1:53*:* udpUNCONN 0 0 *%virbr0:67*:* [...] * after: Netid StateRecv-Q Send-Q Local Address:Port Peer Address:Port udp UNCONN 00 *:37732*:* udp UNCONN 00 *:5353 *:* udp UNCONN 00 192.168.122.1:53 *:* udp UNCONN 00 *%virbr0:67 *:* [...] - 66 columns terminal, ss -tunpl * before: Netid State Recv-Q Send-Q Local Address:Port P eer Address:Port udpUNCONN 0 0 *:37732 *:* udpUNCONN 0 0 *:5353*:* udpUNCONN 0 0 192.168.122.1:53 *:* udpUNCONN 0 0 *%virbr0:67 *:* [...] * after: Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port udp UNCONN 0 0 *:37732 *:* udp UNCONN 0 0 *:5353 *:* udp UNCONN 0 0 192.168.122.1:53*:* udp UNCONN 0 0 *%virbr0:67*:* [...] Stefano Brivio (4): ss: Replace printf() calls for "main" output by calls to helper ss: Introduce columns lightweight abstraction ss: Buffer raw fields first, then render them as a table ss: Implement automatic column width calculation misc/ss.c | 893 +++--- 1 file changed, 621 insertions(+), 272 deletions(-) -- 2.9.4
[PATCH iproute2 net-next 2/4] ss: Introduce columns lightweight abstraction
Instead of embedding spacing directly while printing contents, logically declare columns and functions to buffer their content, to print left and right spacing around fields, to flush them to screen, and to print headers. This makes it a bit easier to handle layout changes and prepares for full output buffering, needed for optimal spacing in field output layout. Columns are currently set up to retain exactly the same output as before. This needs some slight adjustments of the values previously calculated in main(), as the width value introduced here already includes the width of left delimiters and spacing is not explicitly printed anymore whenever a field is printed. These calculations will go away altogether once automatic width calculation is implemented. We can also remove explicit printing of newlines after the final content for a given line is printed, flushing the last field on a line will cause field_flush() to print newlines where appropriate. No changes in output expected here. Signed-off-by: Stefano BrivioReviewed-by: Sabrina Dubroca --- misc/ss.c | 289 ++ 1 file changed, 198 insertions(+), 91 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index 72e117ca9405..9dbcfd514d48 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -103,11 +103,48 @@ int show_header = 1; int follow_events; int sctp_ino; -int netid_width; -int state_width; -int addr_width; -int serv_width; -char *odd_width_pad = ""; +enum col_id { + COL_NETID, + COL_STATE, + COL_RECVQ, + COL_SENDQ, + COL_ADDR, + COL_SERV, + COL_RADDR, + COL_RSERV, + COL_EXT, + COL_MAX +}; + +enum col_align { + ALIGN_LEFT, + ALIGN_CENTER, + ALIGN_RIGHT +}; + +struct column { + const enum col_align align; + const char *header; + const char *ldelim; + int width; /* Including delimiter. -1: fit to content, 0: hide */ + int stored; /* Characters buffered */ + int printed;/* Characters printed so far */ +}; + +static struct column columns[] = { + { ALIGN_LEFT, "Netid","", 0, 0, 0 }, + { ALIGN_LEFT, "State"," ",0, 0, 0 }, + { ALIGN_LEFT, "Recv-Q", " ",7, 0, 0 }, + { ALIGN_LEFT, "Send-Q", " ",7, 0, 0 }, + { ALIGN_RIGHT, "Local Address:", " ",0, 0, 0 }, + { ALIGN_LEFT, "Port", "", 0, 0, 0 }, + { ALIGN_RIGHT, "Peer Address:"," ",0, 0, 0 }, + { ALIGN_LEFT, "Port", "", 0, 0, 0 }, + { ALIGN_LEFT, "", "", -1, 0, 0 }, +}; + +static struct column *current_field = columns; +static char field_buf[BUFSIZ]; static const char *TCP_PROTO = "tcp"; static const char *SCTP_PROTO = "sctp"; @@ -825,13 +862,113 @@ static const char *vsock_netid_name(int type) static void out(const char *fmt, ...) { + struct column *f = current_field; va_list args; va_start(args, fmt); - vfprintf(stdout, fmt, args); + f->stored += vsnprintf(field_buf + f->stored, BUFSIZ - f->stored, + fmt, args); va_end(args); } +static int print_left_spacing(struct column *f) +{ + int s; + + if (f->width < 0 || f->align == ALIGN_LEFT) + return 0; + + s = f->width - f->stored - f->printed; + if (f->align == ALIGN_CENTER) + /* If count of total spacing is odd, shift right by one */ + s = (s + 1) / 2; + + if (s > 0) + return printf("%*c", s, ' '); + + return 0; +} + +static void print_right_spacing(struct column *f) +{ + int s; + + if (f->width < 0 || f->align == ALIGN_RIGHT) + return; + + s = f->width - f->printed; + if (f->align == ALIGN_CENTER) + s /= 2; + + if (s > 0) + printf("%*c", s, ' '); +} + +static int field_needs_delimiter(struct column *f) +{ + if (!f->stored) + return 0; + + /* Was another field already printed on this line? */ + for (f--; f >= columns; f--) + if (f->width) + return 1; + + return 0; +} + +/* Flush given field to screen together with delimiter and spacing */ +static void field_flush(struct column *f) +{ + if (!f->width) + return; + + if (field_needs_delimiter(f)) + f->printed = printf("%s", f->ldelim); + + f->printed += print_left_spacing(f); + f->printed += printf("%s", field_buf); + print_right_spacing(f); + + *field_buf = 0; + f->printed = 0; + f->stored = 0; +} + +static int field_is_last(struct column *f) +{ + return f - columns == COL_MAX -
[PATCH iproute2 net-next 3/4] ss: Buffer raw fields first, then render them as a table
This allows us to measure the maximum field length for each column before printing fields and will permit us to apply optimal field spacing and distribution. Structure of the output buffer with chunked allocation is described in comments. Output is still unchanged, original spacing is used. Running over one million sockets with -tul options by simply modifying main() to loop 50,000 times over the *_show() functions, buffering the whole output and rendering it at the end, with 10 UDP sockets, 10 TCP sockets, while throwing output away, doesn't show significant changes in execution time on my laptop with an Intel i7-6600U CPU: - before this patch: $ time ./ss -tul > /dev/null real0m29.899s user0m2.017s sys 0m27.801s - after this patch: $ time ./ss -tul > /dev/null real0m29.827s user0m1.942s sys 0m27.812s Signed-off-by: Stefano BrivioReviewed-by: Sabrina Dubroca --- misc/ss.c | 271 +++--- 1 file changed, 225 insertions(+), 46 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index 9dbcfd514d48..abc0da7fa8fe 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -47,6 +47,8 @@ #include #define MAGIC_SEQ 123456 +#define BUF_CHUNK (1024 * 1024) +#define LEN_ALIGN(x) (((x) + 1) & ~1) #define DIAG_REQUEST(_req, _r) \ struct {\ @@ -127,24 +129,45 @@ struct column { const char *header; const char *ldelim; int width; /* Including delimiter. -1: fit to content, 0: hide */ - int stored; /* Characters buffered */ - int printed;/* Characters printed so far */ }; static struct column columns[] = { - { ALIGN_LEFT, "Netid","", 0, 0, 0 }, - { ALIGN_LEFT, "State"," ",0, 0, 0 }, - { ALIGN_LEFT, "Recv-Q", " ",7, 0, 0 }, - { ALIGN_LEFT, "Send-Q", " ",7, 0, 0 }, - { ALIGN_RIGHT, "Local Address:", " ",0, 0, 0 }, - { ALIGN_LEFT, "Port", "", 0, 0, 0 }, - { ALIGN_RIGHT, "Peer Address:"," ",0, 0, 0 }, - { ALIGN_LEFT, "Port", "", 0, 0, 0 }, - { ALIGN_LEFT, "", "", -1, 0, 0 }, + { ALIGN_LEFT, "Netid","", 0 }, + { ALIGN_LEFT, "State"," ",0 }, + { ALIGN_LEFT, "Recv-Q", " ",7 }, + { ALIGN_LEFT, "Send-Q", " ",7 }, + { ALIGN_RIGHT, "Local Address:", " ",0 }, + { ALIGN_LEFT, "Port", "", 0 }, + { ALIGN_RIGHT, "Peer Address:"," ",0 }, + { ALIGN_LEFT, "Port", "", 0 }, + { ALIGN_LEFT, "", "", -1 }, }; static struct column *current_field = columns; -static char field_buf[BUFSIZ]; + +/* Output buffer: chained chunks of BUF_CHUNK bytes. Each field is written to + * the buffer as a variable size token. A token consists of a 16 bits length + * field, followed by a string which is not NULL-terminated. + * + * A new chunk is allocated and linked when the current chunk doesn't have + * enough room to store the current token as a whole. + */ +struct buf_chunk { + struct buf_chunk *next; /* Next chained chunk */ + char *end; /* Current end of content */ + char data[0]; +}; + +struct buf_token { + uint16_t len; /* Data length, excluding length descriptor */ + char data[0]; +}; + +static struct { + struct buf_token *cur; /* Position of current token in chunk */ + struct buf_chunk *head; /* First chunk */ + struct buf_chunk *tail; /* Current chunk */ +} buffer; static const char *TCP_PROTO = "tcp"; static const char *SCTP_PROTO = "sctp"; @@ -860,25 +883,109 @@ static const char *vsock_netid_name(int type) } } +/* Allocate and initialize a new buffer chunk */ +static struct buf_chunk *buf_chunk_new(void) +{ + struct buf_chunk *new = malloc(BUF_CHUNK); + + if (!new) + abort(); + + new->next = NULL; + + /* This is also the last block */ + buffer.tail = new; + + /* Next token will be stored at the beginning of chunk data area, and +* its initial length is zero. +*/ + buffer.cur = (struct buf_token *)new->data; + buffer.cur->len = 0; + + new->end = buffer.cur->data; + + return new; +} + +/* Return available tail room in given chunk */ +static int buf_chunk_avail(struct buf_chunk *chunk) +{ + return BUF_CHUNK - offsetof(struct buf_chunk, data) - + (chunk->end - chunk->data); +} + +/* Update end pointer and token length, link new chunk if we
[PATCH] bnxt_en: Fix sources of spurious netpoll warnings
After applying 2270bc5da3497945 ("bnxt_en: Fix netpoll handling") and 903649e718f80da2 ("bnxt_en: Improve -ENOMEM logic in NAPI poll loop."), we still see the following WARN fire: [ cut here ] WARNING: CPU: 0 PID: 1875170 at net/core/netpoll.c:165 netpoll_poll_dev+0x15a/0x160 bnxt_poll+0x0/0xd0 exceeded budget in poll Call Trace: [] dump_stack+0x4d/0x70 [] __warn+0xd3/0xf0 [] warn_slowpath_fmt+0x4f/0x60 [] netpoll_poll_dev+0x15a/0x160 [] netpoll_send_skb_on_dev+0x168/0x250 [] netpoll_send_udp+0x2dc/0x440 [] write_ext_msg+0x20e/0x250 [] call_console_drivers.constprop.23+0xa5/0x110 [] console_unlock+0x339/0x5b0 [] vprintk_emit+0x2c8/0x450 [] vprintk_default+0x1f/0x30 [] printk+0x48/0x50 [] edac_raw_mc_handle_error+0x563/0x5c0 [edac_core] [] edac_mc_handle_error+0x42b/0x6e0 [edac_core] [] sbridge_mce_output_error+0x410/0x10d0 [sb_edac] [] sbridge_check_error+0xac/0x130 [sb_edac] [] edac_mc_workq_function+0x3c/0x90 [edac_core] [] process_one_work+0x19b/0x480 [] worker_thread+0x6a/0x520 [] kthread+0xe4/0x100 [] ret_from_fork+0x22/0x40 This happens because we increment rx_pkts on -ENOMEM and -EIO, resulting in rx_pkts > 0. Fix this by only bumping rx_pkts if we were actually given a non-zero budget. Signed-off-by: Calvin Owens--- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index c5c38d4..f38160f 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -1883,7 +1883,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct bnxt_napi *bnapi, int budget) * here forever if we consistently cannot allocate * buffers. */ - else if (rc == -ENOMEM) + else if (rc == -ENOMEM && budget) rx_pkts++; else if (rc == -EBUSY) /* partial completion */ break; @@ -1969,7 +1969,7 @@ static int bnxt_poll_nitroa0(struct napi_struct *napi, int budget) cpu_to_le32(RX_CMPL_ERRORS_CRC_ERROR); rc = bnxt_rx_pkt(bp, bnapi, _cons, ); - if (likely(rc == -EIO)) + if (likely(rc == -EIO) && budget) rx_pkts++; else if (rc == -EBUSY) /* partial completion */ break; -- 2.9.5
RE: [PATCH net-next v2 5/5] qede: Use NETIF_F_GRO_HW.
> -Original Message- > From: Michael Chan [mailto:michael.c...@broadcom.com] > Sent: Thursday, December 07, 2017 1:34 PM > To: da...@davemloft.net > Cc: netdev@vger.kernel.org; andrew.gospoda...@broadcom.com; Elior, Ariel >; Dept-Eng Everest Linux L2 engeverestlinu...@cavium.com> > Subject: [PATCH net-next v2 5/5] qede: Use NETIF_F_GRO_HW. > > Advertise NETIF_F_GRO_HW and set edev->gro_disable according to the feature > flag. Add qede_fix_features() to drop NETIF_F_GRO_HW if XDP is running or > MTU does not support GRO_HW. qede_change_mtu() also checks and disables > GRO_HW if MTU is not supported. > > Cc: Ariel Elior > Cc: everest-linux...@cavium.com > Signed-off-by: Michael Chan > --- > drivers/net/ethernet/qlogic/qede/qede.h | 2 ++ > drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 3 +++ > drivers/net/ethernet/qlogic/qede/qede_filter.c | 20 +--- > drivers/net/ethernet/qlogic/qede/qede_main.c| 17 ++--- > 4 files changed, 24 insertions(+), 18 deletions(-) > > diff --git a/drivers/net/ethernet/qlogic/qede/qede.h > b/drivers/net/ethernet/qlogic/qede/qede.h > index a3a70ad..8a33651 100644 > --- a/drivers/net/ethernet/qlogic/qede/qede.h > +++ b/drivers/net/ethernet/qlogic/qede/qede.h > @@ -494,6 +494,8 @@ int qede_free_tx_pkt(struct qede_dev *edev, void > qede_vlan_mark_nonconfigured(struct qede_dev *edev); int > qede_configure_vlan_filters(struct qede_dev *edev); > > +netdev_features_t qede_fix_features(struct net_device *dev, > + netdev_features_t features); > int qede_set_features(struct net_device *dev, netdev_features_t features); > void qede_set_rx_mode(struct net_device *ndev); void > qede_config_rx_mode(struct net_device *ndev); diff --git > a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c > b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c > index dae7412..4ca3847 100644 > --- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c > +++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c > @@ -940,6 +940,9 @@ int qede_change_mtu(struct net_device *ndev, int > new_mtu) > DP_VERBOSE(edev, (NETIF_MSG_IFUP | NETIF_MSG_IFDOWN), > "Configuring MTU size of %d\n", new_mtu); > > + if (new_mtu > PAGE_SIZE) > + ndev->features &= ~NETIF_F_GRO_HW; > + > /* Set the mtu field and re-start the interface if needed */ > args.u.mtu = new_mtu; > args.func = _update_mtu; > diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c > b/drivers/net/ethernet/qlogic/qede/qede_filter.c > index c1a0708..2de947e 100644 > --- a/drivers/net/ethernet/qlogic/qede/qede_filter.c > +++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c > @@ -895,19 +895,25 @@ static void qede_set_features_reload(struct > qede_dev *edev, > edev->ndev->features = args->u.features; } > > +netdev_features_t qede_fix_features(struct net_device *dev, > + netdev_features_t features) > +{ > + struct qede_dev *edev = netdev_priv(dev); > + > + if (edev->xdp_prog || edev->ndev->mtu > PAGE_SIZE) > + features &= ~NETIF_F_GRO_HW; > + > + return features; > +} > + > int qede_set_features(struct net_device *dev, netdev_features_t features) { > struct qede_dev *edev = netdev_priv(dev); > netdev_features_t changes = features ^ dev->features; > bool need_reload = false; > > - /* No action needed if hardware GRO is disabled during driver load */ > - if (changes & NETIF_F_GRO) { > - if (dev->features & NETIF_F_GRO) > - need_reload = !edev->gro_disable; > - else > - need_reload = edev->gro_disable; > - } > + if (changes & NETIF_F_GRO_HW) > + need_reload = true; > > if (need_reload) { > struct qede_reload_args args; > diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c > b/drivers/net/ethernet/qlogic/qede/qede_main.c > index 57332b3..90d79ae 100644 > --- a/drivers/net/ethernet/qlogic/qede/qede_main.c > +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c > @@ -545,6 +545,7 @@ static int qede_ioctl(struct net_device *dev, struct ifreq > *ifr, int cmd) #endif > .ndo_vlan_rx_add_vid = qede_vlan_rx_add_vid, > .ndo_vlan_rx_kill_vid = qede_vlan_rx_kill_vid, > + .ndo_fix_features = qede_fix_features, > .ndo_set_features = qede_set_features, > .ndo_get_stats64 = qede_get_stats64, > #ifdef CONFIG_QED_SRIOV > @@ -572,6 +573,7 @@ static int qede_ioctl(struct net_device *dev, struct ifreq > *ifr, int cmd) > .ndo_change_mtu = qede_change_mtu, > .ndo_vlan_rx_add_vid = qede_vlan_rx_add_vid, > .ndo_vlan_rx_kill_vid = qede_vlan_rx_kill_vid, > + .ndo_fix_features = qede_fix_features, > .ndo_set_features = qede_set_features, > .ndo_get_stats64 = qede_get_stats64, > .ndo_udp_tunnel_add =
Re: [PATCH v2 net-next 4/4] bpftool: implement cgroup bpf operations
On 12/8/17 8:39 AM, Quentin Monnet wrote: > I don't believe compatibility is an issue here, since the program and > its documentation come together (so they should stay in sync) and are > part of the kernel tree (so the tool should be compatible with the > kernel sources it comes with). My concern is that there is no way to > guess from the current description what the values for ATTACH_FLAG or > ATTACH_TYPE can be, without reading the source code of the program—which > is not exactly user-friendly. > The tool should be backward and forward compatible across kernel versions. Running a newer command on an older kernel should fail in a deterministic. While the tool is in the kernel tree for ease of development, that should not be confused with having a direct tie to any kernel version. I believe man pages do include kernel version descriptions in flags (e.g., man 7 socket -- flags are denoted with "since Linux x.y") which is one way to handle it with the usual caveat that vendors might have backported support to earlier kernels.
Re: [PATCH RFC 0/4] Fixes for Marvell MII paged register access races
On Fri, Dec 08, 2017 at 05:17:14PM +0100, Andrew Lunn wrote: > Hi Russell > > > There is an open question whether there should be generic helpers for > > this. Generic helpers would mean: > > > > - Additional couple of function pointers in phy_driver to read/write the > > paging register. This has the restriction that there must only be one > > paging register. > > I must be missing something. I don't see why there is this > restriction. Don't we just need > > int phy_get_page(phydev); > int phy_set_page(phydev, page); The restriction occurs because a PHY may have several different registers, and knowing which of the registers need touching becomes an issue. We wouldn't want these accessors to needlessly access several registers each and every time we requested an access to the page register. There's also the issue of whether an "int" or whatever type we choose to pass the "page" around is enough bits. I haven't surveyed all the PHY drivers yet to know the answer to that. > We might even be able to do without phy_get_page(), if we assume page > 0 should be selected by default, and all paged reads/write return to > page 0 afterwards. That would break the marvell driver, where it polls the copper page (0) and fiber page (1) for link, and leaves the page set appropriately depending on which negotiated the link. Not nice, but that's how the driver is written to work. > > - The helpers become more expensive, and because they're in a separate > > compilation unit, the compiler will be unable to optimise them by > > inlining the static functions. > > The mdio bus is slow. Often there is a completion function triggered > by an interrupt etc. Worst case it is bit-banging. I suspect the gains > from inlining are just noise in the bigger picture. Yep. > > - The helpers would be re-usable, saving replications of that code, and > > making it more likely for phy authors to safely access the PHY. > > This is the key point for me. This has likely to of been broken for > years. If it is obviously broken, driver writers are more likely to > get it right. Here it is not obvious, so we should take it out of the > driver writers hands and do it in the core. See the patch below that partially does this on top of this series. drivers/net/phy/marvell.c | 70 +++ drivers/net/phy/phy-core.c | 118 + include/linux/phy.h| 8 +++ 3 files changed, 176 insertions(+), 20 deletions(-) diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c index 5548ac8fa239..183a60a06099 100644 --- a/drivers/net/phy/marvell.c +++ b/drivers/net/phy/marvell.c @@ -537,9 +537,9 @@ static int m88e1121_config_aneg_rgmii_delays(struct phy_device *phydev) else mscr = 0; - return marvell_modify_paged(phydev, MII_MARVELL_MSCR_PAGE, - MII_88E1121_PHY_MSCR_REG, - MII_88E1121_PHY_MSCR_DELAY_MASK, mscr); + return phy_modify_paged(phydev, MII_MARVELL_MSCR_PAGE, + MII_88E1121_PHY_MSCR_REG, + MII_88E1121_PHY_MSCR_DELAY_MASK, mscr); } static int m88e1121_config_aneg(struct phy_device *phydev) @@ -567,9 +567,9 @@ static int m88e1318_config_aneg(struct phy_device *phydev) { int err; - err = marvell_modify_paged(phydev, MII_MARVELL_MSCR_PAGE, - MII_88E1318S_PHY_MSCR1_REG, - 0, MII_88E1318S_PHY_MSCR1_PAD_ODD); + err = phy_modify_paged(phydev, MII_MARVELL_MSCR_PAGE, + MII_88E1318S_PHY_MSCR1_REG, + 0, MII_88E1318S_PHY_MSCR1_PAD_ODD); if (err < 0) return err; @@ -894,9 +894,9 @@ static int m88e1121_config_init(struct phy_device *phydev) int err; /* Default PHY LED config: LED[0] .. Link, LED[1] .. Activity */ - err = marvell_write_paged(phydev, MII_MARVELL_LED_PAGE, - MII_88E1121_PHY_LED_CTRL, - MII_88E1121_PHY_LED_DEF); + err = phy_write_paged(phydev, MII_MARVELL_LED_PAGE, + MII_88E1121_PHY_LED_CTRL, + MII_88E1121_PHY_LED_DEF); if (err < 0) return err; @@ -1544,7 +1544,7 @@ static u64 marvell_get_stat(struct phy_device *phydev, int i) int val; u64 ret; - val = marvell_read_paged(phydev, stat.page, stat.reg); + val = phy_read_paged(phydev, stat.page, stat.reg); if (val < 0) { ret = UINT64_MAX; } else { @@ -1684,8 +1684,8 @@ static int m88e1510_get_temp(struct phy_device *phydev, long *temp) *temp = 0; - ret = marvell_read_paged(phydev, MII_MARVELL_MISC_TEST_PAGE, -MII_88E1510_TEMP_SENSOR); + ret = phy_read_paged(phydev,
Re: [PATCH] net: dsa: allow XAUI phy interface mode
On Fri, Dec 08, 2017 at 05:26:43PM +0100, Andrew Lunn wrote: > On Fri, Dec 08, 2017 at 04:04:59PM +, Russell King wrote: > > XGMII is a 32-bit bus plus two clock signals per direction. XAUI is > > four serial lanes per direction. The 88e6190 supports XAUI but not > > XGMII as it doesn't have enough pins. The same is true of 88e6176. > > > > Match on PHY_INTERFACE_MODE_XAUI for the XAUI port type, but keep > > accepting XGMII for backwards compatibility. > > Hi Russell > > The backwards compatibility is for the ZII devel C, the DSA link > between the two switches? I don't think there are any other boards > using this. > > I will ask around, but i don't think we need to worry about backwards > compatibility. There is no region of flash set aside for the DT blob, > etc. I always TFTP boot using the one from the kernel, etc. Well, there's also kexec to consider, where it's better not to specify the --dtb option because that wipes out the modifications (such as the memory information) that the boot loader made to the pristine kbuild generated dtb. So, with kexec its preferable to use the existing dtb even with a newer kernel, or if not go through a full reboot to use a new dtb. Whether anyone uses kexec on this platform though would be an entirely separate question, but I'm generally not in favour of breaking compatibility with already merged DT without good reason. You do seem to be correct that this only applies (in mainline) to the ZII rev C board, so I guess including a patch to update its dts for the inter-switch connection would at least be sensible. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up
Re: [PATCH] selftests: net: Adding config fragment CONFIG_CGROUP_BPF=y
On 12/07/2017 08:23 PM, Naresh Kamboju wrote: > CONFIG_CGROUP_BPF=y is required for test_dev_cgroup test case. > > Signed-off-by: Naresh Kamboju> --- > tools/testing/selftests/net/config | 1 + Did you mean to add this to tools/testing/selftests/bpf/config instead? test_dev_cgroup is under bpf selftests. Please respin for the correct test suite. > 1 file changed, 1 insertion(+) > > diff --git a/tools/testing/selftests/net/config > b/tools/testing/selftests/net/config > index e57b4ac..02301c6 100644 > --- a/tools/testing/selftests/net/config > +++ b/tools/testing/selftests/net/config > @@ -1,3 +1,4 @@ > CONFIG_USER_NS=y > CONFIG_BPF_SYSCALL=y > CONFIG_TEST_BPF=m > +CONFIG_CGROUP_BPF=y >
Re: [PATCH v5 net-next] net/tcp: trace all TCP/IP state transition with tcp_set_state tracepoint
From: Yafang ShaoDate: Fri, 8 Dec 2017 23:50:44 +0800 > 2017-12-08 23:42 GMT+08:00 David Miller : >> From: Yafang Shao >> Date: Fri, 8 Dec 2017 11:40:23 +0800 >> >>> It will looks like these, >>> >>> if (sk->sk_protocol == IPPROTO_TCP) >>> __tcp_set_state(newsk, TCP_SYN_RECV); >>> else >>> newsk->sk_state = TCP_SYN_RECV; >>> >>> >>> if (sk->sk_protocol == IPPROTO_TCP) >>> __tcp_set_state(sk, TCP_CLOSE); >>> else >>> sk->sk_state = TCP_CLOSE; >>> >>> if (sk->sk_protocol == IPPROTO_TCP) >>> tcp_state_store(sk, state); >>> else >>> sk_state_store(sk, state); >>> >>> >>> Some redundant code. >>> >>> IMO, put these similar code into a wrapper is more nice. >> >> I think this discussion and how ugly this is getting shows that >> tracing the state transitions of a socket is perhaps not best as a TCP >> specific feature. > > Do you mean that tcp_set_state tracepoint should be replaced with > sk_set_state tracepoint and move that tracepoint to > trace/events/sock.h ? Yes, something like that. It will avoid all of these protocol specific checks and weird dependencies.
Re: [PATCH] net: dsa: allow XAUI phy interface mode
On Fri, Dec 08, 2017 at 04:04:59PM +, Russell King wrote: > XGMII is a 32-bit bus plus two clock signals per direction. XAUI is > four serial lanes per direction. The 88e6190 supports XAUI but not > XGMII as it doesn't have enough pins. The same is true of 88e6176. > > Match on PHY_INTERFACE_MODE_XAUI for the XAUI port type, but keep > accepting XGMII for backwards compatibility. Hi Russell The backwards compatibility is for the ZII devel C, the DSA link between the two switches? I don't think there are any other boards using this. I will ask around, but i don't think we need to worry about backwards compatibility. There is no region of flash set aside for the DT blob, etc. I always TFTP boot using the one from the kernel, etc. Andrew
Re: [PATCH net-next 5/6] net: qualcomm: rmnet: Allow to configure flags for new devices
On Sun, 2017-12-03 at 23:37 -0700, Subash Abhinov Kasiviswanathan wrote: > Add an option to configure the rmnet aggregation and command features > on device creation. This is achieved by using the vlan flags option. Does this overload the VLAN flags item with different meanings than VLAN_FLAG_* that are specific to rmnet? Dan > Signed-off-by: Subash Abhinov Kasiviswanathan >> --- > drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 16 > +--- > 1 file changed, 13 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c > b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c > index 5e530db..2f5f661 100644 > --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c > +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c > @@ -177,11 +177,20 @@ static int rmnet_newlink(struct net *src_net, > struct net_device *dev, > if (err) > goto err2; > > - netdev_dbg(dev, "data format [ingress 0x%08X]\n", > ingress_format); > - port->ingress_data_format = ingress_format; > port->rmnet_mode = mode; > > hlist_add_head_rcu(>hlnode, >muxed_ep[mux_id]); > + > + if (data[IFLA_VLAN_FLAGS]) { > + struct ifla_vlan_flags *flags; > + > + flags = nla_data(data[IFLA_VLAN_FLAGS]); > + ingress_format = flags->flags & flags->mask; > + } > + > + netdev_dbg(dev, "data format [ingress 0x%08X]\n", > ingress_format); > + port->ingress_data_format = ingress_format; > + > return 0; > > err2: > @@ -312,7 +321,8 @@ static int rmnet_rtnl_validate(struct nlattr > *tb[], struct nlattr *data[], > > static size_t rmnet_get_size(const struct net_device *dev) > { > - return nla_total_size(2); /* IFLA_VLAN_ID */ > + return nla_total_size(2) /* IFLA_VLAN_ID */ + > + nla_total_size(sizeof(struct ifla_vlan_flags)); /* > IFLA_VLAN_FLAGS */ > } > > struct rtnl_link_ops rmnet_link_ops __read_mostly = {
RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave
From: Xin Long > Sent: 08 December 2017 16:18 > ... > >> Alternatively you could preform the dereference in two steps (i.e. declare > >> an si > >> pointer on the stack and set it equal to asoc->stream.si, then deref > >> si->make_datafrag at call time. That will at least give the compiler an > >> opportunity to preload the first pointer. You want to save the function pointer itself. ... > Another small difference: > as you can see, comparing to (X), (Y) is using 0x28(%rsp) in the loop, > instead of %r13. > > So that's what I can see from the related generated code. > If 0x848(%r13) is not worse than 0x28(%rsp) for cpu, I think > asoc->stream.si->make_datafrag() is even better. No ? That code must have far too many life local variables. Otherwise there's be a caller saved register available. David
RE: [PATCH net 3/3] hv_netvsc: Fix the default receive buffer size
> -Original Message- > From: Dan Carpenter [mailto:dan.carpen...@oracle.com] > Sent: Friday, December 8, 2017 5:45 AM > To: Stephen Hemminger> Cc: KY Srinivasan ; Haiyang Zhang > ; Stephen Hemminger > ; de...@linuxdriverproject.org; > netdev@vger.kernel.org > Subject: Re: [PATCH net 3/3] hv_netvsc: Fix the default receive buffer size > > On Thu, Dec 07, 2017 at 04:10:55PM -0800, Stephen Hemminger wrote: > > From: Haiyang Zhang > > > > The intended size is 16 MB, and the default slot size is 1728. > > So, NETVSC_DEFAULT_RX should be 16*1024*1024 / 1728 = 9709. > > > > Fixes: 5023a6db73196 ("netvsc: increase default receive buffer size") > > Signed-off-by: Haiyang Zhang > > Signed-off-by: Stephen Hemminger > > --- > > drivers/net/hyperv/netvsc_drv.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/net/hyperv/netvsc_drv.c > > b/drivers/net/hyperv/netvsc_drv.c index dc70de674ca9..edfcde5d3621 > > 100644 > > --- a/drivers/net/hyperv/netvsc_drv.c > > +++ b/drivers/net/hyperv/netvsc_drv.c > > @@ -50,7 +50,7 @@ > > #define NETVSC_MIN_TX_SECTIONS 10 > > #define NETVSC_DEFAULT_TX 192 /* ~1M */ > > #define NETVSC_MIN_RX_SECTIONS 10 /* ~64K */ > > -#define NETVSC_DEFAULT_RX 10485 /* Max ~16M */ > > +#define NETVSC_DEFAULT_RX 9709/* ~16M */ > > How does this bug look like to the user? Memory corruption? > > It's weird to me reviewing this code that the default sizes are stored in > netvsc_drv.c and the max sizes are stored in hyperv_net.h. > Could we move these to hyperv_net.h? We could write it like: > #define NETVSC_DEFAULT_RX ((16 * 1024 * 1024) / > NETVSC_RECV_SECTION_SIZE) > > 16MB is sort of a weird default because it's larger than the 15MB allowed for > legacy versions, but it's smaller than the 32MB you'd want for the current > versions. As a tradeoff between traffic burst and latency, we set the default to be 16MB. And this default is reduced automatically for legacy hosts based on the NVSP version in patch 2. I will move the change to hyperv_net.h Thanks, - Haiyang
Re: [PATCH RFC 0/4] Fixes for Marvell MII paged register access races
Hi Russell > There is an open question whether there should be generic helpers for > this. Generic helpers would mean: > > - Additional couple of function pointers in phy_driver to read/write the > paging register. This has the restriction that there must only be one > paging register. I must be missing something. I don't see why there is this restriction. Don't we just need int phy_get_page(phydev); int phy_set_page(phydev, page); We might even be able to do without phy_get_page(), if we assume page 0 should be selected by default, and all paged reads/write return to page 0 afterwards. > - The helpers become more expensive, and because they're in a separate > compilation unit, the compiler will be unable to optimise them by > inlining the static functions. The mdio bus is slow. Often there is a completion function triggered by an interrupt etc. Worst case it is bit-banging. I suspect the gains from inlining are just noise in the bigger picture. > - The helpers would be re-usable, saving replications of that code, and > making it more likely for phy authors to safely access the PHY. This is the key point for me. This has likely to of been broken for years. If it is obviously broken, driver writers are more likely to get it right. Here it is not obvious, so we should take it out of the driver writers hands and do it in the core. > Another potential question is whether using the mdiobus lock (which > excludes all other MII bus access) is best - while it has the advantage > of also ensuring atomicity with userspace accesses, it means that no one > else can access an independent PHY on the same bus while a paged access > is on-going. It feels like a big hammer, but I'm not convinced that we > will see a lot of contention on it. I think you are right about there being little contention on the lock. I suspect most paged accesses are performed during initial setup and configuration. I guess the once per second poll does not need to use paged registers. And the weight of that hammer can be reduced a lot by using interrupts instead of polling. Andrew
Re: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave
On Sat, Dec 9, 2017 at 12:00 AM, Marcelo Ricardo Leitnerwrote: > On Fri, Dec 08, 2017 at 10:37:34AM -0500, Neil Horman wrote: >> On Fri, Dec 08, 2017 at 12:56:30PM -0200, Marcelo Ricardo Leitner wrote: >> > On Fri, Dec 08, 2017 at 02:06:04PM +, David Laight wrote: >> > > From: Xin Long >> > > > Sent: 08 December 2017 13:04 >> > > ... >> > > > @@ -264,8 +264,8 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct >> > > > sctp_association *asoc, >> > > > frag |= SCTP_DATA_SACK_IMM; >> > > > } >> > > > >> > > > - chunk = sctp_make_datafrag_empty(asoc, sinfo, len, >> > > > frag, >> > > > -0, GFP_KERNEL); >> > > > + chunk = asoc->stream.si->make_datafrag(asoc, sinfo, >> > > > len, frag, >> > > > + GFP_KERNEL); >> > > >> > > I know that none of the sctp code is very optimised, but that indirect >> > > call is going to be horrid. >> > >> > Yeah.. but there is no way to avoid the double derreference >> > considering we only have the asoc pointer in there and we have to >> > reach the contents of the data chunk operations struct, and the .si >> > part is the same as 'stream' part as it's a constant offset. >> > >> > Due to the for() in there, we could add a variable to store >> > asoc->stream.si outside the for and then we can do only a single deref >> > inside it. Xin, can you please try and see if the generated code is >> > different? >> > >> > Other suggestions? >> > >> Is it worth replacing the si struct with an index/enum value, and indexing an >> array of method pointer structs? That would save you at least one >> dereference. > > Hmmm, maybe, yes. It would be like > sctp_stream_interleave[asoc->stream.si].make_datafrag(...) > > Then same goes for pf->af, probably. > >> >> Alternatively you could preform the dereference in two steps (i.e. declare >> an si >> pointer on the stack and set it equal to asoc->stream.si, then deref >> si->make_datafrag at call time. That will at least give the compiler an >> opportunity to preload the first pointer. > > Yep, that was my 2nd paragraph above :-) but it only works for cases > such as this one. Now: for(N) { ... chunk = asoc->stream.si->make_datafrag(asoc, sinfo, len, frag, 0xfb58 <+360>: mov0x848(%r13),%rax < [a] 0xfb5f <+367>: movzbl %cl,%ecx 0xfb62 <+370>: mov$0x14000c0,%r8d 0xfb68 <+376>: mov%r12d,%edx 0xfb6b <+379>: mov(%rsp),%rsi 0xfb6f <+383>: mov%r13,%rdi <=(X) 0xfb72 <+386>: callq *0x8(%rax) < [b] 0xfb78 <+392>: mov%rax,%r15 } ret = N * ([a] + [b]) After using a variable: struct sctp_stream_interleave *si; ... si = asoc->stream.si; 0xfb44 <+340>: mov0x848(%r14),%rax 0xfb4e <+350>: mov%rax,0x20(%rsp) <- [1] for(N) { ... chunk = si->make_datafrag(asoc, sinfo, len, frag, GFP_KERNEL); 0xfb69 <+377>: mov0x20(%rsp),%rax <- [2] 0xfb6e <+382>: movzbl %cl,%ecx 0xfb71 <+385>: mov$0x14000c0,%r8d 0xfb77 <+391>: mov%r12d,%edx 0xfb7a <+394>: mov(%rsp),%rsi 0xfb7e <+398>: mov0x28(%rsp),%rdi <=(Y) 0xfb83 <+403>: callq *0x8(%rax) <- [3] 0xfb89 <+409>: mov%rax,%r14 } ret = [1] + N * ([2] + [3]) Another small difference: as you can see, comparing to (X), (Y) is using 0x28(%rsp) in the loop, instead of %r13. So that's what I can see from the related generated code. If 0x848(%r13) is not worse than 0x28(%rsp) for cpu, I think asoc->stream.si->make_datafrag() is even better. No ?
RE: [PATCH net 1/3] hv_netvsc: Correct the max receive buffer size
> -Original Message- > From: Dan Carpenter [mailto:dan.carpen...@oracle.com] > Sent: Friday, December 8, 2017 5:33 AM > To: Stephen Hemminger> Cc: KY Srinivasan ; Haiyang Zhang > ; Stephen Hemminger > ; de...@linuxdriverproject.org; > netdev@vger.kernel.org > Subject: Re: [PATCH net 1/3] hv_netvsc: Correct the max receive buffer size > > On Thu, Dec 07, 2017 at 04:10:53PM -0800, Stephen Hemminger wrote: > > From: Haiyang Zhang > > > > It should be 31 MB on recent host versions. > > > > Signed-off-by: Haiyang Zhang > > Signed-off-by: Stephen Hemminger > > This is very vague. What does "recent" mean in this context? There are also > some unrelated white space changes here which make the patch harder to > read. > > This patch kind of makes the bug fixed by patch 2 even worse because > before the receive buffer was capped at around 16MB and now we can set > the receive buffer to 31MB. It might make sense to fold the two patches > together. > > Is patch 2 a memory corruption bug? The changelog doesn't really say what > the user visible effects of the bug are. Basically if you make the buffer too > small then it's a performance issue but if you make it too large what happens? > It's not clear to me. For NVSP v2, and earlier host, the limit is 15MB. Later hosts, the limit is 31MB. Setting beyond it will be denied by the host, resulting the vNIC doesn't come up. I will merge this one together with the patch 2. Thanks, - Haiyang
Re: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave
On Fri, Dec 08, 2017 at 04:04:58PM +, David Laight wrote: > From: Marcelo Ricardo Leitner > > Sent: 08 December 2017 16:00 > ... > > > Is it worth replacing the si struct with an index/enum value, and > > > indexing an > > > array of method pointer structs? That would save you at least one > > > dereference. > > > > Hmmm, maybe, yes. It would be like > > sctp_stream_interleave[asoc->stream.si].make_datafrag(...) > > If you only expect 2 choices then an if () is likely > to produce better code that the above. > > The actual implementation can be hidden inside a #define > or static inline function. > Thats the real question though, will we expect more than two interleaving strategies? Currently its a boolean operation so the answer seems like yes, but is there a possiblity of a biased interleaving, or other worthwhile algorithm? Neil > David > > -- > To unsubscribe from this list: send the line "unsubscribe linux-sctp" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Re: pull-request: bpf-next 2017-12-07
From: Alexei StarovoitovDate: Thu, 7 Dec 2017 22:22:11 -0800 > The following pull-request contains BPF updates for your net-next tree. > > The main changes are: > > 1) Detailed documentation of BPF development process from Daniel. > > 2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops >from Lawrence. > > 3) Minor follow up for bpf_skb_set_tunnel_key() from William. > > Please consider pulling these changes from: > > git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git Pulled, thanks Alexei.
[PATCH] net: dsa: allow XAUI phy interface mode
XGMII is a 32-bit bus plus two clock signals per direction. XAUI is four serial lanes per direction. The 88e6190 supports XAUI but not XGMII as it doesn't have enough pins. The same is true of 88e6176. Match on PHY_INTERFACE_MODE_XAUI for the XAUI port type, but keep accepting XGMII for backwards compatibility. Signed-off-by: Russell King--- drivers/net/dsa/mv88e6xxx/port.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c index a7801f6668a5..6315774d72b3 100644 --- a/drivers/net/dsa/mv88e6xxx/port.c +++ b/drivers/net/dsa/mv88e6xxx/port.c @@ -338,6 +338,7 @@ int mv88e6390x_port_set_cmode(struct mv88e6xxx_chip *chip, int port, cmode = MV88E6XXX_PORT_STS_CMODE_2500BASEX; break; case PHY_INTERFACE_MODE_XGMII: + case PHY_INTERFACE_MODE_XAUI: cmode = MV88E6XXX_PORT_STS_CMODE_XAUI; break; case PHY_INTERFACE_MODE_RXAUI: -- 2.7.4
RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave
From: Marcelo Ricardo Leitner > Sent: 08 December 2017 16:00 ... > > Is it worth replacing the si struct with an index/enum value, and indexing > > an > > array of method pointer structs? That would save you at least one > > dereference. > > Hmmm, maybe, yes. It would be like > sctp_stream_interleave[asoc->stream.si].make_datafrag(...) If you only expect 2 choices then an if () is likely to produce better code that the above. The actual implementation can be hidden inside a #define or static inline function. David
Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected
On Fri, Dec 8, 2017 at 6:40 AM, Michal Kubecekwrote: > On Fri, Dec 08, 2017 at 11:31:50AM +0100, Andreas Hartmann wrote: >> On 12/08/2017 at 09:47 AM Michal Kubecek wrote: >> > On Fri, Dec 08, 2017 at 08:21:16AM +0100, Andreas Hartmann wrote: >> >> >> >> All my VMs are using virtio_net. BTW: I couldn't see the problems >> >> (sometimes, the VM couldn't be stopped at all) if all my VMs are using >> >> e1000 as interface instead. >> >> >> >> This finding now matches pretty much the responsible UDP-package which >> >> caused the stall. I already mentioned it here [2]. >> >> >> >> To prove it, I reverted from the patch series "[PATCH v2 RFC 0/13] >> >> Remove UDP Fragmentation Offload support" [3] >> >> >> >> 11/13 [v2,RFC,11/13] net: Remove all references to SKB_GSO_UDP. [4] >> >> 12/13 [v2,RFC,12/13] inet: Remove software UFO fragmenting code. [5] >> >> 13/13 [v2,RFC,13/13] net: Kill NETIF_F_UFO and SKB_GSO_UDP. [6] >> >> >> >> and applied it to Linux 4.14.4. It compiled fine and is running fine. >> >> The vnet doesn't die anymore. Yet, I can't say if the qemu stop hangs >> >> are gone, too. >> >> >> >> Obviously, there is something broken with the new UDP handling. Could >> >> you please analyze this problem? I could test some more patches ... . >> > >> > Any chance your VMs were live migrated from pre-4.14 host kernel? >> >> No - the VMs are not live migrated. They are always running on the same >> host - either with kernel < 4.14 or with kernel 4.14.x. > > This is disturbing... unless I'm mistaken, it shouldn't be possible to > have UFO enabled on a virtio device in a VM booted on a host with 4.14 > kernel. Indeed. When working on that revert patch I verified that UFO in the guest virtio_net was off before the revert patch, on after. Qemu should check host support with tap_probe_has_ufo before advertising support to the guest. Indeed, this is exactly what broke live migration in virtio_net_load_device at if (qemu_get_byte(f) && !peer_has_ufo(n)) { error_report("virtio-net: saved image requires TUN_F_UFO support"); return -1; } Which follows peer_has_ufo qemu_has_ufo tap_has_ufo s->has_ufo where s->has_ufo was set by tap_probe_has_ufo in net_tap_fd_init. Now, checking my qemu git branch, I ran pretty old 2.7.0-rc3. But this codepath does not seem to have changed between then and 2.10.1. I cherry-picked the revert onto 4.14.3. It did not apply cleanly, but the fix-up wasn't too hard. Compiled and booted, but untested otherwise. At https://github.com/wdebruij/linux/commits/v4.14.3-aargh-ufo > >> > If this is the case, you should try commit 0c19f846d582 ("net: >> > accept UFO datagrams from tuntap and packet"). >> >> It doesn't apply to 4.14.4 >> >> > Or disabling UFO in the guest should >> > work around the issue. >> >> ethtool -K ethX ufo off for each device / bridge in VM. >> >> Yes, this seems to work. I'll wait and see if the non stoppable >> qemu-problem on shutdown will remain. >> >> When will there be a fix for 4.14? It is clearly a regression. Is it >> possible / a good idea to just remove the complete patch series "Remove >> UDP Fragmentation Offload support"? > > I cannot give an exact date but the patch is queued for stable > (see http://patchwork.ozlabs.org/bundle/davem/stable/?state=* ) so that > it should land in stable-4.14 in near future (weeks at most). > > Michal Kubecek >