[PATCH net-next v3 3/5] bnxt_en: Use NETIF_F_GRO_HW.

2017-12-08 Thread Michael Chan
Advertise NETIF_F_GRO_HW in hw_features if hardware GRO is supported.
In bnxt_fix_features(), disable GRO_HW and LRO if current hardware
configuration does not allow it.  XDP setup will now rely on
bnxt_fix_features() to turn off aggregation.  During chip init, turn on
or off hardware GRO based on NETIF_F_GRO_HW in features flag.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index af6c83f..04dddf6 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2755,7 +2755,7 @@ void bnxt_set_tpa_flags(struct bnxt *bp)
return;
if (bp->dev->features & NETIF_F_LRO)
bp->flags |= BNXT_FLAG_LRO;
-   if (bp->dev->features & NETIF_F_GRO)
+   else if (bp->dev->features & NETIF_F_GRO_HW)
bp->flags |= BNXT_FLAG_GRO;
 }
 
@@ -2843,10 +2843,10 @@ int bnxt_set_rx_skb_mode(struct bnxt *bp, bool 
page_mode)
min_t(u16, bp->max_mtu, BNXT_MAX_PAGE_MODE_MTU);
bp->flags &= ~BNXT_FLAG_AGG_RINGS;
bp->flags |= BNXT_FLAG_NO_AGG_RINGS | BNXT_FLAG_RX_PAGE_MODE;
-   bp->dev->hw_features &= ~NETIF_F_LRO;
-   bp->dev->features &= ~NETIF_F_LRO;
bp->rx_dir = DMA_BIDIRECTIONAL;
bp->rx_skb_func = bnxt_rx_page_skb;
+   /* Disable LRO or GRO_HW */
+   netdev_update_features(bp->dev);
} else {
bp->dev->max_mtu = bp->max_mtu;
bp->flags &= ~BNXT_FLAG_RX_PAGE_MODE;
@@ -6788,6 +6788,12 @@ static netdev_features_t bnxt_fix_features(struct 
net_device *dev,
if ((features & NETIF_F_NTUPLE) && !bnxt_rfs_capable(bp))
features &= ~NETIF_F_NTUPLE;
 
+   if (bp->flags & BNXT_FLAG_NO_AGG_RINGS)
+   features &= ~(NETIF_F_LRO | NETIF_F_GRO_HW);
+
+   if (features & NETIF_F_GRO_HW)
+   features &= ~NETIF_F_LRO;
+
/* Both CTAG and STAG VLAN accelaration on the RX side have to be
 * turned on or off together.
 */
@@ -6821,9 +6827,9 @@ static int bnxt_set_features(struct net_device *dev, 
netdev_features_t features)
bool update_tpa = false;
 
flags &= ~BNXT_FLAG_ALL_CONFIG_FEATS;
-   if ((features & NETIF_F_GRO) && !BNXT_CHIP_TYPE_NITRO_A0(bp))
+   if (features & NETIF_F_GRO_HW)
flags |= BNXT_FLAG_GRO;
-   if (features & NETIF_F_LRO)
+   else if (features & NETIF_F_LRO)
flags |= BNXT_FLAG_LRO;
 
if (bp->flags & BNXT_FLAG_NO_AGG_RINGS)
@@ -7924,8 +7930,8 @@ static int bnxt_get_dflt_rings(struct bnxt *bp, int 
*max_rx, int *max_tx,
if (rc)
return rc;
bp->flags |= BNXT_FLAG_NO_AGG_RINGS;
-   bp->dev->hw_features &= ~NETIF_F_LRO;
-   bp->dev->features &= ~NETIF_F_LRO;
+   bp->dev->hw_features &= ~(NETIF_F_LRO | NETIF_F_GRO_HW);
+   bp->dev->features &= ~(NETIF_F_LRO | NETIF_F_GRO_HW);
bnxt_set_ring_params(bp);
}
 
@@ -8108,7 +8114,11 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
dev->vlan_features = dev->hw_features | NETIF_F_HIGHDMA;
dev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_CTAG_TX |
NETIF_F_HW_VLAN_STAG_RX | NETIF_F_HW_VLAN_STAG_TX;
+   if (!BNXT_CHIP_TYPE_NITRO_A0(bp))
+   dev->hw_features |= NETIF_F_GRO_HW;
dev->features |= dev->hw_features | NETIF_F_HIGHDMA;
+   if (dev->features & NETIF_F_GRO_HW)
+   dev->features &= ~NETIF_F_LRO;
dev->priv_flags |= IFF_UNICAST_FLT;
 
 #ifdef CONFIG_BNXT_SRIOV
-- 
1.8.3.1



[PATCH net-next v3 1/5] net: Introduce NETIF_F_GRO_HW.

2017-12-08 Thread Michael Chan
Introduce NETIF_F_GRO_HW feature flag for NICs that support hardware
GRO.  With this flag, we can now independently turn on or off hardware
GRO when GRO is on.  Previously, drivers were using NETIF_F_GRO to
control hardware GRO and so it cannot be independently turned on or
off without affecting GRO.

Hardware GRO (just like GRO) guarantees that packets can be re-segmented
by TSO/GSO to reconstruct the original packet stream.  It is a subset of
NETIF_F_GRO and depends on it, as well as NETIF_F_RXCSUM.

Since NETIF_F_GRO is not propagated between upper and lower devices,
NETIF_F_GRO_HW should follow suit since it is a subset of GRO.  In other
words, a lower device can independent have GRO/GRO_HW enabled or disabled
and no feature propagation is required.  This will preserve the current
GRO behavior.

Cc: Ariel Elior 
Cc: everest-linux...@cavium.com
Signed-off-by: Michael Chan 
---
 Documentation/networking/netdev-features.txt |  8 
 include/linux/netdev_features.h  |  3 +++
 net/core/dev.c   | 12 
 net/core/ethtool.c   |  1 +
 4 files changed, 24 insertions(+)

diff --git a/Documentation/networking/netdev-features.txt 
b/Documentation/networking/netdev-features.txt
index 7413eb0..8f36527 100644
--- a/Documentation/networking/netdev-features.txt
+++ b/Documentation/networking/netdev-features.txt
@@ -163,3 +163,11 @@ This requests that the NIC receive all possible frames, 
including errored
 frames (such as bad FCS, etc).  This can be helpful when sniffing a link with
 bad packets on it.  Some NICs may receive more packets if also put into normal
 PROMISC mode.
+
+*  rx-gro-hw
+
+This requests that the NIC enables Hardware GRO (generic receive offload).
+Hardware GRO is basically the exact reverse of TSO, and is generally
+stricter than Hardware LRO.  A packet stream merged by Hardware GRO must
+be re-segmentable by GSO or TSO back to the exact original packet stream.
+Hardware GRO is dependent on GRO and RXCSUM.
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index b1b0ca7..db84c51 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -78,6 +78,8 @@ enum {
NETIF_F_HW_ESP_TX_CSUM_BIT, /* ESP with TX checksum offload */
NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */
 
+   NETIF_F_GRO_HW_BIT, /* Hardware Generic receive offload */
+
/*
 * Add your fresh new feature above and remember to update
 * netdev_features_strings[] in net/core/ethtool.c and maybe
@@ -97,6 +99,7 @@ enum {
 #define NETIF_F_FRAGLIST   __NETIF_F(FRAGLIST)
 #define NETIF_F_FSO__NETIF_F(FSO)
 #define NETIF_F_GRO__NETIF_F(GRO)
+#define NETIF_F_GRO_HW __NETIF_F(GRO_HW)
 #define NETIF_F_GSO__NETIF_F(GSO)
 #define NETIF_F_GSO_ROBUST __NETIF_F(GSO_ROBUST)
 #define NETIF_F_HIGHDMA__NETIF_F(HIGHDMA)
diff --git a/net/core/dev.c b/net/core/dev.c
index e32cf5c..6ebd0e7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7424,6 +7424,18 @@ static netdev_features_t netdev_fix_features(struct 
net_device *dev,
features &= ~dev->gso_partial_features;
}
 
+   if (features & NETIF_F_GRO_HW) {
+   /* Hardware GRO depends on GRO and RXCSUM. */
+   if (!(features & NETIF_F_GRO)) {
+   netdev_dbg(dev, "Dropping NETIF_F_GSO_HW since no GRO 
feature.\n");
+   features &= ~NETIF_F_GRO_HW;
+   }
+   if (!(features & NETIF_F_RXCSUM)) {
+   netdev_dbg(dev, "Dropping NETIF_F_GSO_HW since no 
RXCSUM feature.\n");
+   features &= ~NETIF_F_GRO_HW;
+   }
+   }
+
return features;
 }
 
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index f8fcf45..50a7920 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -73,6 +73,7 @@ int ethtool_op_get_ts_info(struct net_device *dev, struct 
ethtool_ts_info *info)
[NETIF_F_LLTX_BIT] = "tx-lockless",
[NETIF_F_NETNS_LOCAL_BIT] =  "netns-local",
[NETIF_F_GRO_BIT] =  "rx-gro",
+   [NETIF_F_GRO_HW_BIT] =   "rx-gro-hw",
[NETIF_F_LRO_BIT] =  "rx-lro",
 
[NETIF_F_TSO_BIT] =  "tx-tcp-segmentation",
-- 
1.8.3.1



[PATCH net-next v3 0/5] Introduce NETIF_F_GRO_HW

2017-12-08 Thread Michael Chan
Introduce NETIF_F_GRO_HW feature flag and convert drivers that support
hardware GRO to use the new flag.

v3:
- Let driver's ndo_fix_features() disable NETIF_F_LRO when NETIF_F_GRO_HW
is set instead of doing it in common netdev_fix_features().

v2:
- NETIF_F_GRO_HW flag propagation between upper and lower devices not
required (see patch 1).
- NETIF_F_GRO_HW depends on NETIF_F_GRO and NETIF_F_RXCSUM.
- Add dev_disable_gro_hw() to disable GRO_HW for generic XDP.
- Use ndo_fix_features() on all 3 drivers to drop GRO_HW when it is not
supported

Michael Chan (5):
  net: Introduce NETIF_F_GRO_HW.
  net: Disable GRO_HW when generic XDP is installed on a device.
  bnxt_en: Use NETIF_F_GRO_HW.
  bnx2x: Use NETIF_F_GRO_HW.
  qede: Use NETIF_F_GRO_HW.

 Documentation/networking/netdev-features.txt |  8 ++
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c  | 19 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  4 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c| 24 +++-
 drivers/net/ethernet/qlogic/qede/qede.h  |  2 ++
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c  |  3 ++
 drivers/net/ethernet/qlogic/qede/qede_filter.c   | 20 -
 drivers/net/ethernet/qlogic/qede/qede_main.c | 17 ---
 include/linux/netdev_features.h  |  3 ++
 net/core/dev.c   | 36 
 net/core/ethtool.c   |  1 +
 11 files changed, 104 insertions(+), 33 deletions(-)

-- 
1.8.3.1



[PATCH net-next v3 2/5] net: Disable GRO_HW when generic XDP is installed on a device.

2017-12-08 Thread Michael Chan
Hardware should not aggregate any packets when generic XDP is installed.

Cc: Ariel Elior 
Cc: everest-linux...@cavium.com
Signed-off-by: Michael Chan 
---
 net/core/dev.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 6ebd0e7..ec08ace 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1542,6 +1542,29 @@ void dev_disable_lro(struct net_device *dev)
 }
 EXPORT_SYMBOL(dev_disable_lro);
 
+/**
+ * dev_disable_gro_hw - disable HW Generic Receive Offload on a device
+ * @dev: device
+ *
+ * Disable HW Generic Receive Offload (GRO_HW) on a net device.  Must be
+ * called under RTNL.  This is needed if Generic XDP is installed on
+ * the device.
+ */
+static void dev_disable_gro_hw(struct net_device *dev)
+{
+   struct net_device *lower_dev;
+   struct list_head *iter;
+
+   dev->wanted_features &= ~NETIF_F_GRO_HW;
+   netdev_update_features(dev);
+
+   if (unlikely(dev->features & NETIF_F_GRO_HW))
+   netdev_WARN(dev, "failed to disable GRO_HW!\n");
+
+   netdev_for_each_lower_dev(dev, lower_dev, iter)
+   dev_disable_gro_hw(lower_dev);
+}
+
 static int call_netdevice_notifier(struct notifier_block *nb, unsigned long 
val,
   struct net_device *dev)
 {
@@ -4564,6 +4587,7 @@ static int generic_xdp_install(struct net_device *dev, 
struct netdev_bpf *xdp)
} else if (new && !old) {
static_key_slow_inc(_xdp_needed);
dev_disable_lro(dev);
+   dev_disable_gro_hw(dev);
}
break;
 
-- 
1.8.3.1



[PATCH net-next v3 4/5] bnx2x: Use NETIF_F_GRO_HW.

2017-12-08 Thread Michael Chan
Advertise NETIF_F_GRO_HW and turn on TPA_MODE_GRO when NETIF_F_GRO_HW
is set.  Disable NETIF_F_GRO_HW in bnx2x_fix_features() if the MTU
does not support TPA_MODE_GRO.  bnx2x_change_mtu() also needs to disable
NETIF_F_GRO_HW if the MTU does not support it.

Cc: Ariel Elior 
Cc: everest-linux...@cavium.com
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c  | 19 ---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  4 +++-
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 4c739d5..c75cfdd 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -2482,8 +2482,7 @@ static void bnx2x_bz_fp(struct bnx2x *bp, int index)
 */
if (bp->dev->features & NETIF_F_LRO)
fp->mode = TPA_MODE_LRO;
-   else if (bp->dev->features & NETIF_F_GRO &&
-bnx2x_mtu_allows_gro(bp->dev->mtu))
+   else if (bp->dev->features & NETIF_F_GRO_HW)
fp->mode = TPA_MODE_GRO;
else
fp->mode = TPA_MODE_DISABLED;
@@ -4874,6 +4873,9 @@ int bnx2x_change_mtu(struct net_device *dev, int new_mtu)
 */
dev->mtu = new_mtu;
 
+   if (!bnx2x_mtu_allows_gro(new_mtu))
+   dev->features &= ~NETIF_F_GRO_HW;
+
if (IS_PF(bp) && SHMEM2_HAS(bp, curr_cfg))
SHMEM2_WR(bp, curr_cfg, CURR_CFG_MET_OS);
 
@@ -4907,6 +4909,10 @@ netdev_features_t bnx2x_fix_features(struct net_device 
*dev,
features &= ~NETIF_F_LRO;
features &= ~NETIF_F_GRO;
}
+   if (!bnx2x_mtu_allows_gro(dev->mtu))
+   features &= ~NETIF_F_GRO_HW;
+   if (features & NETIF_F_GRO_HW)
+   features &= ~NETIF_F_LRO;
 
return features;
 }
@@ -4933,13 +4939,12 @@ int bnx2x_set_features(struct net_device *dev, 
netdev_features_t features)
}
}
 
-   /* if GRO is changed while LRO is enabled, don't force a reload */
-   if ((changes & NETIF_F_GRO) && (features & NETIF_F_LRO))
-   changes &= ~NETIF_F_GRO;
+   /* Don't care about GRO changes */
+   changes &= ~NETIF_F_GRO;
 
/* if GRO is changed while HW TPA is off, don't force a reload */
-   if ((changes & NETIF_F_GRO) && bp->disable_tpa)
-   changes &= ~NETIF_F_GRO;
+   if ((changes & NETIF_F_GRO_HW) && bp->disable_tpa)
+   changes &= ~NETIF_F_GRO_HW;
 
if (changes)
bnx2x_reload = true;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 91e2a75..1c7f821 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -13273,7 +13273,7 @@ static int bnx2x_init_dev(struct bnx2x *bp, struct 
pci_dev *pdev,
 
dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
NETIF_F_TSO | NETIF_F_TSO_ECN | NETIF_F_TSO6 |
-   NETIF_F_RXCSUM | NETIF_F_LRO | NETIF_F_GRO |
+   NETIF_F_RXCSUM | NETIF_F_LRO | NETIF_F_GRO | NETIF_F_GRO_HW |
NETIF_F_RXHASH | NETIF_F_HW_VLAN_CTAG_TX;
if (!chip_is_e1x) {
dev->hw_features |= NETIF_F_GSO_GRE | NETIF_F_GSO_GRE_CSUM |
@@ -13309,6 +13309,8 @@ static int bnx2x_init_dev(struct bnx2x *bp, struct 
pci_dev *pdev,
 
dev->features |= dev->hw_features | NETIF_F_HW_VLAN_CTAG_RX;
dev->features |= NETIF_F_HIGHDMA;
+   if (dev->features & NETIF_F_GRO_HW)
+   dev->features &= ~NETIF_F_LRO;
 
/* Add Loopback capability to the device */
dev->hw_features |= NETIF_F_LOOPBACK;
-- 
1.8.3.1



[PATCH net-next v3 5/5] qede: Use NETIF_F_GRO_HW.

2017-12-08 Thread Michael Chan
Advertise NETIF_F_GRO_HW and set edev->gro_disable according to the
feature flag.  Add qede_fix_features() to drop NETIF_F_GRO_HW if
XDP is running or MTU does not support GRO_HW.  qede_change_mtu()
also checks and disables GRO_HW if MTU is not supported.

Cc: Ariel Elior 
Cc: everest-linux...@cavium.com
Acked-by: Manish Chopra 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/qlogic/qede/qede.h |  2 ++
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c |  3 +++
 drivers/net/ethernet/qlogic/qede/qede_filter.c  | 20 +---
 drivers/net/ethernet/qlogic/qede/qede_main.c| 17 ++---
 4 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index a3a70ad..8a33651 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -494,6 +494,8 @@ int qede_free_tx_pkt(struct qede_dev *edev,
 void qede_vlan_mark_nonconfigured(struct qede_dev *edev);
 int qede_configure_vlan_filters(struct qede_dev *edev);
 
+netdev_features_t qede_fix_features(struct net_device *dev,
+   netdev_features_t features);
 int qede_set_features(struct net_device *dev, netdev_features_t features);
 void qede_set_rx_mode(struct net_device *ndev);
 void qede_config_rx_mode(struct net_device *ndev);
diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c 
b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
index dae7412..4ca3847 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
@@ -940,6 +940,9 @@ int qede_change_mtu(struct net_device *ndev, int new_mtu)
DP_VERBOSE(edev, (NETIF_MSG_IFUP | NETIF_MSG_IFDOWN),
   "Configuring MTU size of %d\n", new_mtu);
 
+   if (new_mtu > PAGE_SIZE)
+   ndev->features &= ~NETIF_F_GRO_HW;
+
/* Set the mtu field and re-start the interface if needed */
args.u.mtu = new_mtu;
args.func = _update_mtu;
diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c 
b/drivers/net/ethernet/qlogic/qede/qede_filter.c
index c1a0708..2de947e 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
@@ -895,19 +895,25 @@ static void qede_set_features_reload(struct qede_dev 
*edev,
edev->ndev->features = args->u.features;
 }
 
+netdev_features_t qede_fix_features(struct net_device *dev,
+   netdev_features_t features)
+{
+   struct qede_dev *edev = netdev_priv(dev);
+
+   if (edev->xdp_prog || edev->ndev->mtu > PAGE_SIZE)
+   features &= ~NETIF_F_GRO_HW;
+
+   return features;
+}
+
 int qede_set_features(struct net_device *dev, netdev_features_t features)
 {
struct qede_dev *edev = netdev_priv(dev);
netdev_features_t changes = features ^ dev->features;
bool need_reload = false;
 
-   /* No action needed if hardware GRO is disabled during driver load */
-   if (changes & NETIF_F_GRO) {
-   if (dev->features & NETIF_F_GRO)
-   need_reload = !edev->gro_disable;
-   else
-   need_reload = edev->gro_disable;
-   }
+   if (changes & NETIF_F_GRO_HW)
+   need_reload = true;
 
if (need_reload) {
struct qede_reload_args args;
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 57332b3..90d79ae 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -545,6 +545,7 @@ static int qede_ioctl(struct net_device *dev, struct ifreq 
*ifr, int cmd)
 #endif
.ndo_vlan_rx_add_vid = qede_vlan_rx_add_vid,
.ndo_vlan_rx_kill_vid = qede_vlan_rx_kill_vid,
+   .ndo_fix_features = qede_fix_features,
.ndo_set_features = qede_set_features,
.ndo_get_stats64 = qede_get_stats64,
 #ifdef CONFIG_QED_SRIOV
@@ -572,6 +573,7 @@ static int qede_ioctl(struct net_device *dev, struct ifreq 
*ifr, int cmd)
.ndo_change_mtu = qede_change_mtu,
.ndo_vlan_rx_add_vid = qede_vlan_rx_add_vid,
.ndo_vlan_rx_kill_vid = qede_vlan_rx_kill_vid,
+   .ndo_fix_features = qede_fix_features,
.ndo_set_features = qede_set_features,
.ndo_get_stats64 = qede_get_stats64,
.ndo_udp_tunnel_add = qede_udp_tunnel_add,
@@ -589,6 +591,7 @@ static int qede_ioctl(struct net_device *dev, struct ifreq 
*ifr, int cmd)
.ndo_change_mtu = qede_change_mtu,
.ndo_vlan_rx_add_vid = qede_vlan_rx_add_vid,
.ndo_vlan_rx_kill_vid = qede_vlan_rx_kill_vid,
+   .ndo_fix_features = qede_fix_features,
.ndo_set_features = qede_set_features,
.ndo_get_stats64 = qede_get_stats64,
.ndo_udp_tunnel_add = qede_udp_tunnel_add,

Re: [PATCH v5 2/2] sock: Move the socket inuse to namespace.

2017-12-08 Thread Tonghao Zhang
On Sat, Dec 9, 2017 at 6:09 AM, Cong Wang  wrote:
> On Thu, Dec 7, 2017 at 9:28 PM, Tonghao Zhang  
> wrote:
>>
>> Release the netlink sock created in kernel(not hold the _net_ namespace):
>>
>
> You can avoid counting kernel sock by testing 'kern' in sk_alloc()
> and testing 'sk->sk_net_refcnt' in __sk_free().
Hi cong, if we do it in this way, we will not counter the sock created
in kernel, right ?


Incorrect source IP address on IGMP membership report

2017-12-08 Thread Kevin Cernekee
Closing a multicast socket after the final IPv4 address is deleted
from an interface will generate a membership report that uses the
source IP from a different interface.  The following test script, run
from an isolated netns, reproduces the issue:

#!/bin/bash

ip link add dummy0 type dummy
ip link add dummy1 type dummy
ip link set dummy0 up
ip link set dummy1 up
ip addr add 10.1.1.1/24 dev dummy0
ip addr add 192.168.99.99/24 dev dummy1

tcpdump -U -i dummy0 -w dummy0.pcap &
socat EXEC:"sleep 2"
UDP4-DATAGRAM:239.101.1.68:8889,ip-add-membership=239.0.1.68:10.1.1.1
&

sleep 1
ip addr del 10.1.1.1/24 dev dummy0
sleep 5
kill %tcpdump

After running this script, dummy0.pcap contains one Membership Report
/ Join Group packet with source IP 10.1.1.1, and two Membership Report
/ Leave Group packets with source IP 192.168.99.99.

Sending out multicasts on the LAN using an unexpected source IP
address seems to be causing issues in some enterprise environments[0],
where the network infrastructure is set up to flag suspicious packets.

I believe the source address is provided by ip_route_output_ports()
called from igmpv3_newpack() in the kernel.

Is this behavior intentional?  If not, is it something that we should fix?

I was able to suppress these reports with an iptables OUTPUT rule, but
maybe there is a cleaner way.


[0] https://bugs.chromium.org/p/chromium/issues/detail?id=786514


Re: [PATCH v5 2/2] sock: Move the socket inuse to namespace.

2017-12-08 Thread Tonghao Zhang
On Fri, Dec 8, 2017 at 9:24 PM, Eric Dumazet  wrote:
> On Fri, 2017-12-08 at 19:29 +0800, Tonghao Zhang wrote:
>> hi all. we can add synchronize_rcu and rcu_barrier in
>> sock_inuse_exit_net to
>> ensure there are no outstanding rcu callbacks using this network
>> namespace.
>> we will not have to test if net->core.sock_inuse is NULL or not from
>> sock_inuse_add(). :)
>>
>>  static void __net_exit sock_inuse_exit_net(struct net *net)
>>  {
>> free_percpu(net->core.prot_inuse);
>> +
>> +   synchronize_rcu();
>> +   rcu_barrier();
>> +
>> +   free_percpu(net->core.sock_inuse);
>>  }
>
>
> Oh well. Do you have any idea of the major problem this would add ?
>
> Try the following, before and after your patches :
>
> for i in `seq 1 40`
> do
>  (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
> done
> wait
>
> ( Check commit 8ca712c373a462cfa1b62272870b6c2c74aa83f9 )
>
Yes, I did the test. The patches drop the performance.
Before patch:
# time ./add_del_unshare.sh
net_namespace 97125   601658 : tunables00
  0 : slabdata 25 25  0

real 8m19.665s
user 0m4.268s
sys 0m6.477s

After :
# time ./add_del_unshare.sh
net_namespace102130   601658 : tunables00
  0 : slabdata 26 26  0

real 8m52.563s
user 0m4.040s
sys 0m7.558s

>
> This is a complex problem, we wont accept patches that kill network
> namespaces dismantling performance by adding brute force
> synchronize_rcu() or rcu_barrier() calls.
>
> Why not freeing net->core.sock_inuse right before feeing net itself in
> net_free() ?
I try this way, alloc  core.sock_inuse in net_alloc(), free it in net_free ().
It does not drop performance, and we will not always to check the
core.sock_inuse
in sock_inuse_add().

After :
# time ./add_del_unshare.sh
net_namespace109135   601658 : tunables00
  0 : slabdata 27 27  0

real 8m19.265s
user 0m4.090s
sys 0m8.185s

> You do not have to hijack sock_inuse_exit_net() just because it has a
> misleading name.
>
>


Fund Transfer

2017-12-08 Thread Hazim Issa


Hello,
I'm Mr. Hazim, A banker working in a bank in my country; there is a certain 
deceased customer of my bank who left behind her fund.

I seek your partnership in receiving this fund. If interested, reply 
immediately for detailed information.

My sincere regards, Mr. Hazim Issa.​


RE: [PATCH net-next 1/1] net: dsa: microchip: Add Microchip KSZ8895 DSA driver

2017-12-08 Thread Tristram.Ha
> Ok, let me retry:
> 
> > One thing to debug this problem is to dump the MIB counters.  Use the
> ethtool
> > utility to show MIB counters of both ports:
> >
> > ethtool -S lan3
> > ethtool -S eth0
> >
> > Assuming eth0 is the MAC controller that drives the switch, the receive
> counters of
> > the host port of the switch should match the transmit counters of
> > lan3, and vice versa.
> 
> Hmm. I'm getting some "interesting" results from mii-tool:
> 
> root@miro:~# mii-tool lan3
> lan3: negotiated 1000baseT-HD flow-control, link ok
> 
> But IIRC the switch is 100mbit? And dmesg does get it right. Its just
> mii-tool that is confused.
> 
> Link detection seems to work
> 
> root@miro:/sys/bus/spi/devices/spi2.0# mii-tool lan1
> lan1: negotiated 1000baseT-HD flow-control, link ok
> root@miro:/sys/bus/spi/devices/spi2.0# mii-tool lan1
> lan1: no link
> 
> (But that really should be 100baseT, not 1000baseT).

ethtool lan3 should also report the correct setting.

> Is there register dump available somewhere? I was using
> /sys/bus/spi/devices/spi32766.0/registers but this does not seem to be
> available.

There is a patch to add that functionality.  It is very simple and I will send 
it
to you later.  Without that it is hard to debug the DSA driver if there is
something wrong.

I also have a simple utility to communicate with that registers file to 
read/write
register individually.  Is there a standard Linux utility for that function?

> I tried that ethtool -S, and counters do not seem to match:
> root@miro:~# ethtool -S eth0
> NIC statistics:
>  tx_dropped: 0
>  tx_packets: 55
>  tx_broadcast: 35
>  tx_multicast: 20
>  tx_crc_errors: 0
>  tx_undersize: 0
>  tx_oversize: 0
>  tx_fragment: 0
>  tx_jabber: 0
>  tx_collision: 0
>  tx_64byte: 0
>  tx_65to127byte: 35
>  tx_128to255byte: 0
>  tx_256to511byte: 20
>  tx_512to1023byte: 0
>  tx_1024to2047byte: 0
>  tx_GTE2048byte: 0
>  tx_octets: 9576
>  IEEE_tx_drop: 0
>  IEEE_tx_frame_ok: 55
>  IEEE_tx_1col: 0
>  IEEE_tx_mcol: 0
>  IEEE_tx_def: 0
>  IEEE_tx_lcol: 0
>  IEEE_tx_excol: 0
>  IEEE_tx_macerr: 0
>  IEEE_tx_cserr: 0
>  IEEE_tx_sqe: 0
>  IEEE_tx_fdxfc: 0
>  IEEE_tx_octets_ok: 9576
>  rx_packets: 0
>  rx_broadcast: 0
>  rx_multicast: 0
>  rx_crc_errors: 0
>  rx_undersize: 0
>  rx_oversize: 0
>  rx_fragment: 0
>  rx_jabber: 0
>  rx_64byte: 0
>  rx_65to127byte: 0
>  rx_128to255byte: 0
>  rx_256to511byte: 0
>  rx_512to1023byte: 0
>  rx_1024to2047byte: 0
>  rx_GTE2048byte: 0
>  rx_octets: 0
>  IEEE_rx_drop: 0
>  IEEE_rx_frame_ok: 0
>  IEEE_rx_crc: 0
>  IEEE_rx_align: 0
>  IEEE_rx_macerr: 0
>  IEEE_rx_fdxfc: 0
>  IEEE_rx_octets_ok: 0

These are the MIB counters from the switch host port:

>  p04_rx: 660
>  p04_rx_hi: 0
>  p04_rx_undersize: 0
>  p04_rx_fragments: 20

This indicates a problem with the MAC.  Are you using a MII or RMII version?

>  p04_rx_oversize: 0
>  p04_rx_jabbers: 0
>  p04_rx_symbol_err: 0
>  p04_rx_crc_err: 0
>  p04_rx_align_err: 0
>  p04_rx_mac_ctrl: 0
>  p04_rx_pause: 0
>  p04_rx_bcast: 0
>  p04_rx_mcast: 0
>  p04_rx_ucast: 0
>  p04_rx_64_or_less: 0
>  p04_rx_65_127: 0
>  p04_rx_128_255: 0
>  p04_rx_256_511: 0
>  p04_rx_512_1023: 0
>  p04_rx_1024_1522: 0
>  p04_tx: 388
>  p04_tx_hi: 0
>  p04_tx_late_col: 0
>  p04_tx_pause: 0
>  p04_tx_bcast: 0
>  p04_tx_mcast: 3

This indicates the host port tried to send frames to the MAC.

>  p04_tx_ucast: 0
>  p04_tx_deferred: 0
>  p04_tx_total_col: 0
>  p04_tx_exc_col: 0
>  p04_tx_single_col: 0
>  p04_tx_mult_col: 0
>  p04_rx_discards: 0
>  p04_tx_discards: 0
> root@miro:~# ethtool -S lan3
> NIC statistics:
>  tx_packets: 24
>  tx_bytes: 1356
>  rx_packets: 0
>  rx_bytes: 0

The previous counters are from DSA.  The rest are MIB counters of the port.

>  rx: 566
>  rx_hi: 0
>  rx_undersize: 0
>  rx_fragments: 0
>  rx_oversize: 0
>  rx_jabbers: 0
>  rx_symbol_err: 0
>  rx_crc_err: 0
>  rx_align_err: 0
>  rx_mac_ctrl: 0
>  rx_pause: 0
>  rx_bcast: 0
>  rx_mcast: 4
>  rx_ucast: 0
>  rx_64_or_less: 0
>  rx_65_127: 1
>  rx_128_255: 3
>  rx_256_511: 0
>  rx_512_1023: 0
>  rx_1024_1522: 0
>  tx: 0
>  tx_hi: 0
>  tx_late_col: 0
>  tx_pause: 0
>  tx_bcast: 0
>  tx_mcast: 0
>  tx_ucast: 0
>  tx_deferred: 0
>  tx_total_col: 0
>  tx_exc_col: 0
>  tx_single_col: 0
>  tx_mult_col: 0
>  rx_discards: 0
>  tx_discards: 0

They just reported frames are received from the port.  Because of problem with
the host port there is no transmission coming from the host port.



[PATCH net-next 4/4] nfp: debug dump - decrease endian conversions

2017-12-08 Thread Jakub Kicinski
From: Carl Heymann 

Convert the requested dump level parameter to big-endian at the start of
nfp_net_dump_calculate_size() and nfp_net_dump_populate_buffer(), then
compare and assign it directly where needed in the traversal and prolog
code. This decreases the total number of conversions used.

Signed-off-by: Carl Heymann 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c
index cb74602f0907..cbff0adad235 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c
@@ -126,13 +126,13 @@ struct nfp_dump_error {
 
 /* to track state through debug size calculation TLV traversal */
 struct nfp_level_size {
-   u32 requested_level;/* input */
+   __be32 requested_level; /* input */
u32 total_size; /* output */
 };
 
 /* to track state during debug dump creation TLV traversal */
 struct nfp_dump_state {
-   u32 requested_level;/* input param */
+   __be32 requested_level; /* input param */
u32 dumped_size;/* adds up to size of dumped data */
u32 buf_size;   /* size of buffer pointer to by p */
void *p;/* current point in dump buffer */
@@ -334,7 +334,7 @@ nfp_calc_specific_level_size(struct nfp_pf *pf, struct 
nfp_dump_tl *dump_level,
 {
struct nfp_level_size *lev_sz = param;
 
-   if (be32_to_cpu(dump_level->type) != lev_sz->requested_level)
+   if (dump_level->type != lev_sz->requested_level)
return 0;
 
return nfp_traverse_tlvs(pf, dump_level->data,
@@ -348,7 +348,7 @@ s64 nfp_net_dump_calculate_size(struct nfp_pf *pf, struct 
nfp_dumpspec *spec,
struct nfp_level_size lev_sz;
int err;
 
-   lev_sz.requested_level = flag;
+   lev_sz.requested_level = cpu_to_be32(flag);
lev_sz.total_size = ALIGN8(sizeof(struct nfp_dump_prolog));
 
err = nfp_traverse_tlvs(pf, spec->data, spec->size, _sz,
@@ -733,7 +733,7 @@ nfp_dump_specific_level(struct nfp_pf *pf, struct 
nfp_dump_tl *dump_level,
 {
struct nfp_dump_state *dump = param;
 
-   if (be32_to_cpu(dump_level->type) != dump->requested_level)
+   if (dump_level->type != dump->requested_level)
return 0;
 
return nfp_traverse_tlvs(pf, dump_level->data,
@@ -753,7 +753,7 @@ static int nfp_dump_populate_prolog(struct nfp_dump_state 
*dump)
if (err)
return err;
 
-   prolog->dump_level = cpu_to_be32(dump->requested_level);
+   prolog->dump_level = dump->requested_level;
 
return 0;
 }
@@ -764,7 +764,7 @@ int nfp_net_dump_populate_buffer(struct nfp_pf *pf, struct 
nfp_dumpspec *spec,
struct nfp_dump_state dump;
int err;
 
-   dump.requested_level = dump_param->flag;
+   dump.requested_level = cpu_to_be32(dump_param->flag);
dump.dumped_size = 0;
dump.p = dest;
dump.buf_size = dump_param->len;
-- 
2.15.1



[PATCH net-next 2/4] nfp: flower: remove dead code paths

2017-12-08 Thread Jakub Kicinski
From: John Hurley 

Port matching is selected by default on every rule so remove check for it
and delete 'else' side of the statement. Remove nfp_flower_meta_one as now
it will not feature in the code. Rename nfp_flower_meta_two given that one
has been removed.

'Additional metadata' if statement can never be true so remove it as well.

Signed-off-by: John Hurley 
Reviewed-by: Dirk van der Merwe 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/cmsg.h   | 16 +
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 79 --
 .../net/ethernet/netronome/nfp/flower/offload.c|  2 +-
 3 files changed, 29 insertions(+), 68 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h 
b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
index 66070741d55f..430a3932cf4d 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
@@ -165,20 +165,6 @@ struct nfp_fl_pop_vlan {
__be16 reserved;
 };
 
-/* Metadata without L2 (1W/4B)
- * 
- *3   2   1
- *  1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
- * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- * |  key_layers   |mask_id|   reserved|
- * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- */
-struct nfp_flower_meta_one {
-   u8 nfp_flow_key_layer;
-   u8 mask_id;
-   u16 reserved;
-};
-
 struct nfp_fl_pre_tunnel {
struct nfp_fl_act_head head;
__be16 reserved;
@@ -209,7 +195,7 @@ struct nfp_fl_set_vxlan {
  *   NOTE: | TCI   |
  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  */
-struct nfp_flower_meta_two {
+struct nfp_flower_meta_tci {
u8 nfp_flow_key_layer;
u8 mask_id;
__be16 tci;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/match.c 
b/drivers/net/ethernet/netronome/nfp/flower/match.c
index 60614d4f0e22..1f2b879e12d4 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/match.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/match.c
@@ -38,7 +38,7 @@
 #include "main.h"
 
 static void
-nfp_flower_compile_meta_tci(struct nfp_flower_meta_two *frame,
+nfp_flower_compile_meta_tci(struct nfp_flower_meta_tci *frame,
struct tc_cls_flower_offload *flow, u8 key_type,
bool mask_version)
 {
@@ -46,7 +46,7 @@ nfp_flower_compile_meta_tci(struct nfp_flower_meta_two *frame,
struct flow_dissector_key_vlan *flow_vlan;
u16 tmp_tci;
 
-   memset(frame, 0, sizeof(struct nfp_flower_meta_two));
+   memset(frame, 0, sizeof(struct nfp_flower_meta_tci));
/* Populate the metadata frame. */
frame->nfp_flow_key_layer = key_type;
frame->mask_id = ~0;
@@ -67,14 +67,6 @@ nfp_flower_compile_meta_tci(struct nfp_flower_meta_two 
*frame,
}
 }
 
-static void
-nfp_flower_compile_meta(struct nfp_flower_meta_one *frame, u8 key_type)
-{
-   frame->nfp_flow_key_layer = key_type;
-   frame->mask_id = 0;
-   frame->reserved = 0;
-}
-
 static int
 nfp_flower_compile_port(struct nfp_flower_in_port *frame, u32 cmsg_port,
bool mask_version, enum nfp_flower_tun_type tun_type)
@@ -278,49 +270,32 @@ int nfp_flower_compile_flow_match(struct 
tc_cls_flower_offload *flow,
 
ext = nfp_flow->unmasked_data;
msk = nfp_flow->mask_data;
-   if (NFP_FLOWER_LAYER_PORT & key_ls->key_layer) {
-   /* Populate Exact Metadata. */
-   nfp_flower_compile_meta_tci((struct nfp_flower_meta_two *)ext,
-   flow, key_ls->key_layer, false);
-   /* Populate Mask Metadata. */
-   nfp_flower_compile_meta_tci((struct nfp_flower_meta_two *)msk,
-   flow, key_ls->key_layer, true);
-   ext += sizeof(struct nfp_flower_meta_two);
-   msk += sizeof(struct nfp_flower_meta_two);
-
-   /* Populate Exact Port data. */
-   err = nfp_flower_compile_port((struct nfp_flower_in_port *)ext,
- nfp_repr_get_port_id(netdev),
- false, tun_type);
-   if (err)
-   return err;
-
-   /* Populate Mask Port Data. */
-   err = nfp_flower_compile_port((struct nfp_flower_in_port *)msk,
- nfp_repr_get_port_id(netdev),
- true, tun_type);
-   if (err)
-   return err;
-
-   ext += sizeof(struct 

[PATCH net-next 0/4] nfp: dead code, clean ups and slight improvements

2017-12-08 Thread Jakub Kicinski
Hi!

This series contains small clean ups from John and Carl, and brings
no functional changes.

John's improvements target the flower code.  First he makes sure we don't
allocate space in FW request messages for MAC matches if the TC rule does
not contain any.  The remaining two patches remove some dead code and
unused defines.

Carl follows up with a slight optimization to his recent ethtool FW state
dumps, byte swapping input parameters once instead of the data for every
dumped item.

Carl Heymann (1):
  nfp: debug dump - decrease endian conversions

John Hurley (3):
  nfp: flower: do not assume mac/mpls matches
  nfp: flower: remove dead code paths
  nfp: flower: remove unused defines

 drivers/net/ethernet/netronome/nfp/flower/cmsg.h   | 19 +-
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 79 --
 .../net/ethernet/netronome/nfp/flower/offload.c| 13 ++--
 .../net/ethernet/netronome/nfp/nfp_net_debugdump.c | 14 ++--
 4 files changed, 44 insertions(+), 81 deletions(-)

-- 
2.15.1



[PATCH net-next 3/4] nfp: flower: remove unused defines

2017-12-08 Thread Jakub Kicinski
From: John Hurley 

Delete match field defines that are not supported at this time.

Signed-off-by: John Hurley 
Reviewed-by: Dirk van der Merwe 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/cmsg.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h 
b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
index 430a3932cf4d..d6b63c8f14da 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
@@ -50,9 +50,6 @@
 #define NFP_FLOWER_LAYER_CTBIT(6)
 #define NFP_FLOWER_LAYER_VXLAN BIT(7)
 
-#define NFP_FLOWER_LAYER_ETHER BIT(3)
-#define NFP_FLOWER_LAYER_ARP   BIT(4)
-
 #define NFP_FLOWER_MASK_VLAN_PRIO  GENMASK(15, 13)
 #define NFP_FLOWER_MASK_VLAN_CFI   BIT(12)
 #define NFP_FLOWER_MASK_VLAN_VID   GENMASK(11, 0)
-- 
2.15.1



[PATCH net-next 1/4] nfp: flower: do not assume mac/mpls matches

2017-12-08 Thread Jakub Kicinski
From: John Hurley 

Remove the matching of mac/mpls as a default selection. These are not
necessarily set by a TC rule (unlike the port). Previously a mac/mpls
field would exist in every match and be masked out if not used. This patch
has no impact on functionality but removes unnessary memory assignment in
the match cmsg.

Signed-off-by: John Hurley 
Reviewed-by: Dirk van der Merwe 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 553f94f55dce..1b7c59a8b139 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -150,10 +150,15 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls 
*ret_key_ls,
return -EOPNOTSUPP;
 
key_layer_two = 0;
-   key_layer = NFP_FLOWER_LAYER_PORT | NFP_FLOWER_LAYER_MAC;
+   key_layer = NFP_FLOWER_LAYER_PORT;
key_size = sizeof(struct nfp_flower_meta_one) +
-  sizeof(struct nfp_flower_in_port) +
-  sizeof(struct nfp_flower_mac_mpls);
+  sizeof(struct nfp_flower_in_port);
+
+   if (dissector_uses_key(flow->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS) ||
+   dissector_uses_key(flow->dissector, FLOW_DISSECTOR_KEY_MPLS)) {
+   key_layer |= NFP_FLOWER_LAYER_MAC;
+   key_size += sizeof(struct nfp_flower_mac_mpls);
+   }
 
if (dissector_uses_key(flow->dissector,
   FLOW_DISSECTOR_KEY_ENC_CONTROL)) {
-- 
2.15.1



Re: [PATCH v5 net-next] net/tcp: trace all TCP/IP state transition with tcp_set_state tracepoint

2017-12-08 Thread Yafang Shao
2017-12-09 0:28 GMT+08:00 David Miller :
> From: Yafang Shao 
> Date: Fri, 8 Dec 2017 23:50:44 +0800
>
>> 2017-12-08 23:42 GMT+08:00 David Miller :
>>> From: Yafang Shao 
>>> Date: Fri, 8 Dec 2017 11:40:23 +0800
>>>
 It will looks like these,

 if (sk->sk_protocol == IPPROTO_TCP)
 __tcp_set_state(newsk, TCP_SYN_RECV);
 else
 newsk->sk_state = TCP_SYN_RECV;


 if (sk->sk_protocol == IPPROTO_TCP)
   __tcp_set_state(sk, TCP_CLOSE);
 else
   sk->sk_state = TCP_CLOSE;

 if (sk->sk_protocol == IPPROTO_TCP)
   tcp_state_store(sk,  state);
 else
   sk_state_store(sk, state);


 Some redundant code.

 IMO, put these similar code into a wrapper is more nice.
>>>
>>> I think this discussion and how ugly this is getting shows that
>>> tracing the state transitions of a socket is perhaps not best as a TCP
>>> specific feature.
>>
>> Do you mean that tcp_set_state tracepoint should be replaced with
>> sk_set_state tracepoint and move that tracepoint to
>> trace/events/sock.h ?
>
> Yes, something like that.
>
> It will avoid all of these protocol specific checks and weird
> dependencies.

That looks like a good idea.
I will try to reimplemnet it.


Thanks
Yafang


[PATCH] wcn36xx: Reduce spinlock in indication handler

2017-12-08 Thread Bjorn Andersson
The purpose of pushing indication on a list and handle these in a
separate worker is to allow the handlers to sleep. It does therefor not
make much sense to hold the queue spinlock through the entire indication
worker function.

By removing items from the queue early we don't need to hold the lock
throughout the indication worker, allowing the individual handlers to
sleep.

Signed-off-by: Bjorn Andersson 
---
 drivers/net/wireless/ath/wcn36xx/smd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c 
b/drivers/net/wireless/ath/wcn36xx/smd.c
index fa88a2a460aa..52daae863aed 100644
--- a/drivers/net/wireless/ath/wcn36xx/smd.c
+++ b/drivers/net/wireless/ath/wcn36xx/smd.c
@@ -2414,6 +2414,8 @@ static void wcn36xx_ind_smd_work(struct work_struct *work)
hal_ind_msg = list_first_entry(>hal_ind_queue,
   struct wcn36xx_hal_ind_msg,
   list);
+   list_del(wcn->hal_ind_queue.next);
+   spin_unlock_irqrestore(>hal_ind_lock, flags);
 
msg_header = (struct wcn36xx_hal_msg_header *)hal_ind_msg->msg;
 
@@ -2450,8 +2452,6 @@ static void wcn36xx_ind_smd_work(struct work_struct *work)
wcn36xx_err("SMD_EVENT (%d) not supported\n",
  msg_header->msg_type);
}
-   list_del(wcn->hal_ind_queue.next);
-   spin_unlock_irqrestore(>hal_ind_lock, flags);
kfree(hal_ind_msg);
 }
 int wcn36xx_smd_open(struct wcn36xx *wcn)
-- 
2.15.0



Re: [PATCH V2 net-next 2/8] net: hns3: Add mailbox support to VF driver

2017-12-08 Thread Philippe Ombredanne
Sali,

On Fri, Dec 8, 2017 at 10:16 PM, Salil Mehta  wrote:
> This patch adds the support of the mailbox to the VF driver. The
> mailbox shall be used as an interface to communicate with the
> PF driver for various purposes like {set|get} MAC related
> operations, reset, link status etc. The mailbox supports both
> synchronous and asynchronous command send to PF driver.
>
> Signed-off-by: Salil Mehta 
> Signed-off-by: lipeng 
[...]
> --- /dev/null
> +++ b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
> @@ -0,0 +1,94 @@
> +/*
> + * Copyright (c) 2016-2017 Hisilicon Limited.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */

Why not use the new SPDX ids?
e.g.
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +/* Copyright (c) 2016-2017 Hisilicon Limited. */

See Linus posts and Thomas doc patches for details.

-- 
Cordially
Philippe Ombredanne


Re: [PATCH net-next v3 0/2] veth and GSO maximums

2017-12-08 Thread Stephen Hemminger
On Fri, 8 Dec 2017 15:50:25 -0800
Solio Sarabia  wrote:

> On Fri, Dec 08, 2017 at 02:23:23PM -0500, David Miller wrote:
> > From: Stephen Hemminger 
> > Date: Thu,  7 Dec 2017 15:40:18 -0800
> >   
> > > This is the more general way to solving the issue of GSO limits
> > > not being set correctly for containers on Azure. If a GSO packet
> > > is sent to host that exceeds the limit (reported by NDIS), then
> > > the host is forced to do segmentation in software which has noticeable
> > > performance impact.
> > > 
> > > The core rtnetlink infrastructure already has the messages and
> > > infrastructure to allow changing gso limits. With an updated iproute2
> > > the following already works:
> > >   # ip li set dev dummy0 gso_max_size 3
> > > 
> > > These patches are about making it easier with veth.  
> > 
> > Ok, this is definitely a step forward.
> > 
> > Series applied, thanks Stephen.  
> 
> Thanks.
> Still not seeing the iproute2 patch though, either master or net-next.

I held off on iproute2 until kernel change is accepted upstream.


Re: [PATCH net] enic: add wq clean up budget

2017-12-08 Thread Govindarajulu Varadarajan

On Wed, 6 Dec 2017, David Miller wrote:


From: Govindarajulu Varadarajan 
Date: Tue,  5 Dec 2017 11:14:41 -0800


In case of tx clean up, we set '-1' as budget. This means clean up until
wq is empty or till (1 << 32) pkts are cleaned. Under heavy load this
will run for long time and cause
"watchdog: BUG: soft lockup - CPU#25 stuck for 21s!" warning.

This patch sets wq clean up budget to 256.

Signed-off-by: Govindarajulu Varadarajan 


This driver with all of it's indirection and layers upon layers of
macros for queue processing is so difficult to read, and this can't
be generating nice optimal code either...

Anyways, I was walking over the driver to see if the logic is
contributing to this.

The limit you are proposing sounds unnecessary, nobody else I can
see needs this, and that includes all of the most heavily used
drivers under load.


I used 256 as the limit because most of the other drivers use it.

* mlx4 uses MLX4_EN_DEFAULT_TX_WORK as the tx budget in mlx4_en_process_tx_cq()
  Added in commit fbc6daf19745 ("net/mlx4_en: Ignore budget on TX napi polling")

* i40e uses vsi->work_limit as tx budget in i40e_clean_tx_irq(), which is
  set to I40E_DEFAULT_IRQ_WORK. Added in commit
  a619afe814453 ("i40e/i40evf: Add support for bulk free in Tx cleanup")

* ixgbe uses q_vector->tx.work_limit as tx budget in ixgbe_clean_tx_irq(),
  which is set to IXGBE_DEFAULT_TX_WORK. Added in commit
  592245559e900 ("ixgbe: Change default Tx work limit size to 256 buffers")



If I had to guess I'd say that the problem is that the queue loop
keeps sampling the head and tail pointers, where as it should just
do that _once_ and only process that TX entries found in that
snapshot and return to the poll() routine immedately afterwards.


The only way to know the tail pointer at the time napi is scheduled is to read
hw fetch_index register. This is discouraged by hw engineers.

We work around this by using color bit. Every cq entry has color bit. It is
either 0 or 1. Hw flips the bit when it creates a new cq entry. So every new
cq entry will have a different color bit than previous. We reach end of the
queue when previous color bit is same as current cq entry's color. i.e hw did
not flip the bit, so its not a new cq entry.

So enic driver cannot know the tail pointer at the time napi is scheduled, until
we reach the tail pointer.


Re: [PATCH net-next v3 0/2] veth and GSO maximums

2017-12-08 Thread Solio Sarabia
On Fri, Dec 08, 2017 at 02:23:23PM -0500, David Miller wrote:
> From: Stephen Hemminger 
> Date: Thu,  7 Dec 2017 15:40:18 -0800
> 
> > This is the more general way to solving the issue of GSO limits
> > not being set correctly for containers on Azure. If a GSO packet
> > is sent to host that exceeds the limit (reported by NDIS), then
> > the host is forced to do segmentation in software which has noticeable
> > performance impact.
> > 
> > The core rtnetlink infrastructure already has the messages and
> > infrastructure to allow changing gso limits. With an updated iproute2
> > the following already works:
> >   # ip li set dev dummy0 gso_max_size 3
> > 
> > These patches are about making it easier with veth.
> 
> Ok, this is definitely a step forward.
> 
> Series applied, thanks Stephen.

Thanks.
Still not seeing the iproute2 patch though, either master or net-next.


Re: [PATCH v3 net-next 4/4] bpftool: implement cgroup bpf operations

2017-12-08 Thread Jakub Kicinski
On Fri, 8 Dec 2017 14:52:36 +, Roman Gushchin wrote:
> +static int list_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type)
> +{
> + __u32 prog_ids[1024] = {0};
> + char *attach_flags_str;
> + __u32 prog_cnt, iter;
> + __u32 attach_flags;
> + char buf[16];
> + int ret;
> +
...
> + case BPF_F_ALLOW_OVERRIDE:
> + attach_flags_str = "allow_override";
> + break;
> + case 0:
> + attach_flags_str = "";
> + break;
> + default:
> + snprintf(buf, sizeof(buf), "unknown(%x)", attach_flags);
> + attach_flags_str = buf;

nit: theoretically speaking strlen("unknown()") == 9 + 8 + 1 > sizeof(buf)
 so if we ever use all flags this may get truncated.  I personally also 
 like using %x without 0x in front, but I restrained myself in bpftool 
 to avoid potential confusion (unknown(10) could be 10 or 16).  Map flags 
 do put the prefix in, perhaps it would be good to keep those consistent?

> + }
> +
> + for (iter = 0; iter < prog_cnt; iter++)
> + list_bpf_prog(prog_ids[iter], attach_type_strings[type],
> +   attach_flags_str);
> +
> + return 0;
> +}


> +static int do_attach(int argc, char **argv)
> +{
> + int cgroup_fd, prog_fd;
> + enum bpf_attach_type attach_type;
> + int attach_flags = 0;
> + int i;
> + int ret = -1;

nit: I was hoping you'd fix the order of variables in all functions..

> + if (argc < 4) {
> + p_err("too few parameters for cgroup attach\n");
> + goto exit;
> + }
> +
> + cgroup_fd = open(argv[0], O_RDONLY);
> + if (cgroup_fd < 0) {
> + p_err("can't open cgroup %s\n", argv[1]);
> + goto exit;
> + }
> +
> + attach_type = parse_attach_type(argv[1]);
> + if (attach_type == __MAX_BPF_ATTACH_TYPE) {
> + p_err("invalid attach type\n");
> + goto exit_cgroup;
> + }
> +
> + argc -= 2;
> + argv = [2];
> + prog_fd = prog_parse_fd(, );
> + if (prog_fd < 0)
> + goto exit_cgroup;
> +
> + for (i = 0; i < argc; i++) {
> + if (strcmp(argv[i], "allow_multi") == 0) {
> + attach_flags |= BPF_F_ALLOW_MULTI;
> + } else if (strcmp(argv[i], "allow_override") == 0) {
> + attach_flags |= BPF_F_ALLOW_OVERRIDE;

I don't feel about this strongly but as I said I was trying to follow
iproute2's conventions, and it allows aliasing.  So if you type "ip a"
it will give you the first thing that starts with a, not necessarily
alphabetically, more likely in order of usefulness or order in which
things were added.  IOW if "allow_" selects "allow_mutli" that's what I
would actually expect it to do..

Maybe others disagree?

> + } else {
> + p_err("unknown option: %s\n", argv[i]);
> + goto exit_cgroup;
> + }
> + }
> +
> + if (bpf_prog_attach(prog_fd, cgroup_fd, attach_type, attach_flags)) {
> + p_err("failed to attach program");
> + goto exit_prog;
> + }
> +
> + if (json_output)
> + jsonw_null(json_wtr);
> +
> + ret = 0;
> +
> +exit_prog:
> + close(prog_fd);
> +exit_cgroup:
> + close(cgroup_fd);
> +exit:
> + return ret;
> +}
> +
> +static int do_detach(int argc, char **argv)
> +{
> + int prog_fd, cgroup_fd;
> + enum bpf_attach_type attach_type;
> + int ret = -1;

nit: order here too..

> + if (argc < 4) {
> + p_err("too few parameters for cgroup detach\n");
> + goto exit;
> + }
> +
> + cgroup_fd = open(argv[0], O_RDONLY);
> + if (cgroup_fd < 0) {
> + p_err("can't open cgroup %s\n", argv[1]);
> + goto exit;
> + }
> +
> + attach_type = parse_attach_type(argv[1]);
> + if (attach_type == __MAX_BPF_ATTACH_TYPE) {
> + p_err("invalid attach type");
> + goto exit_cgroup;
> + }
> +
> + argc -= 2;
> + argv = [2];
> + prog_fd = prog_parse_fd(, );
> + if (prog_fd < 0)
> + goto exit_cgroup;
> +
> + if (bpf_prog_detach2(prog_fd, cgroup_fd, attach_type)) {
> + p_err("failed to detach program");
> + goto exit_prog;
> + }
> +
> + if (json_output)
> + jsonw_null(json_wtr);
> +
> + ret = 0;
> +
> +exit_prog:
> + close(prog_fd);
> +exit_cgroup:
> + close(cgroup_fd);
> +exit:
> + return ret;
> +}


[PATCH] rtnetlink: fix typo in GSO max segments

2017-12-08 Thread Stephen Hemminger
Fixes: 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device 
creation")
Signed-off-by: Stephen Hemminger 
---
 net/core/rtnetlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 412ebf0b09c6..c688dc564b11 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2684,7 +2684,7 @@ struct net_device *rtnl_create_link(struct net *net,
if (tb[IFLA_GSO_MAX_SIZE])
netif_set_gso_max_size(dev, nla_get_u32(tb[IFLA_GSO_MAX_SIZE]));
if (tb[IFLA_GSO_MAX_SEGS])
-   dev->gso_max_size = nla_get_u32(tb[IFLA_GSO_MAX_SEGS]);
+   dev->gso_max_segs = nla_get_u32(tb[IFLA_GSO_MAX_SEGS]);
 
return dev;
 }
-- 
2.11.0



Re: [PATCH v2 net-next 4/4] bpftool: implement cgroup bpf operations

2017-12-08 Thread Jakub Kicinski
On Fri, 8 Dec 2017 09:52:16 -0700, David Ahern wrote:
> On 12/8/17 8:39 AM, Quentin Monnet wrote:
> > I don't believe compatibility is an issue here, since the program and
> > its documentation come together (so they should stay in sync) and are
> > part of the kernel tree (so the tool should be compatible with the
> > kernel sources it comes with). My concern is that there is no way to
> > guess from the current description what the values for ATTACH_FLAG or
> > ATTACH_TYPE can be, without reading the source code of the program—which
> > is not exactly user-friendly.
> 
> The tool should be backward and forward compatible across kernel
> versions. Running a newer command on an older kernel should fail in a
> deterministic. While the tool is in the kernel tree for ease of
> development, that should not be confused with having a direct tie to any
> kernel version.
>
> I believe man pages do include kernel version descriptions in flags
> (e.g., man 7 socket -- flags are denoted with "since Linux x.y") which
> is one way to handle it with the usual caveat that vendors might have
> backported support to earlier kernels.

Let's see if I understand correctly.  We have a list of hard coded
strings which the tool will recognize (static const char * const
attach_type_strings[]).  We should put that list into the man page and
help so users know what values are possible.  And in the "verbose"
part of the man section mark each flag with kernel version it was
introduced in.

Roman, would you agree this is the best way forward?


Re: [PATCH net-next] ip6_vti: adjust vti mtu according to mtu of output device

2017-12-08 Thread Shannon Nelson

On 12/8/2017 3:54 AM, Alexey Kodanev wrote:

On 12/08/2017 10:02 AM, Steffen Klassert wrote:

On Wed, Dec 06, 2017 at 07:38:19PM +0300, Alexey Kodanev wrote:


Since you're planning to do a 2nd version anyway, can we get a couple of 
the commit message issues cleaned up?



LTP/udp6_ipsec_vti tests fail when sending large UDP datagrams
that require fragmentation and underlying device MTU <= 1500.


s/underlying device/the underlying device has/


This happens because ip6_vti sets mtu to ETH_DATA_LEN and not
updating it depending on a destiantion address.


s/destiantion/destination/



Futhure attempts to send UDP packets may succeed because pmtu


s/Futhure/Further/


get updated on ICMPV6_PKT_TOOBIG in vti6_err().


s/get/gets/



Here is the example when output device MTU set to 9000:


s/output device MTU/the output device MTU is/

Thanks,
sln



Re: [PATCH v8 0/5] Add the ability to do BPF directed error injection

2017-12-08 Thread Daniel Borkmann
On 12/08/2017 09:24 PM, Josef Bacik wrote:
> On Fri, Dec 08, 2017 at 04:35:44PM +0100, Daniel Borkmann wrote:
>> On 12/06/2017 05:12 PM, Josef Bacik wrote:
>>> Jon noticed that I had a typo in my _ASM_KPROBE_ERROR_INJECT macro.  I went 
>>> to
>>> figure out why the compiler didn't catch it and it's because it was not used
>>> anywhere.  I had copied it from the trace blacklist code without 
>>> understanding
>>> where it was used as cscope didn't find the original macro I was looking 
>>> for, so
>>> I assumed it was some voodoo and left it in place.  Turns out cscope failed 
>>> me
>>> and I didn't need the macro at all, the trace blacklist thing I was looking 
>>> at
>>> was for marking assembly functions as blacklisted and I have no intention of
>>> marking assembly functions as error injectable at the moment.
>>>
>>> v7->v8:
>>> - removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.
>>
>> The series doesn't apply cleanly to the bpf-next tree, so one last respin 
>> with
>> a rebase would unfortunately still be required, thanks!
> 
> I've rebased and let it sit in my git tree to make sure kbuild test bot didn't
> blow up, can you pull from
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
> bpf-override-return
> 
> or do you want me to repost the whole series?  Thanks,

Yeah, the patches would need to end up on netdev, so once kbuild bot went
through fine after your rebase, please send the series.

Thanks,
Daniel


Re: [PATCH net-next v2 5/5] qede: Use NETIF_F_GRO_HW.

2017-12-08 Thread Michael Chan
On Fri, Dec 8, 2017 at 2:09 PM, Marcelo Ricardo Leitner
 wrote:
> Hi,
>
> On Thu, Dec 07, 2017 at 03:03:35AM -0500, Michael Chan wrote:
>> --- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
>> +++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
>> @@ -940,6 +940,9 @@ int qede_change_mtu(struct net_device *ndev, int new_mtu)
>>   DP_VERBOSE(edev, (NETIF_MSG_IFUP | NETIF_MSG_IFDOWN),
>>  "Configuring MTU size of %d\n", new_mtu);
>>
>> + if (new_mtu > PAGE_SIZE)
>
> I don't know the specs for this card but if it needs to fit the whole
> packet in a page, maybe it should consider the ethernet header size in
> such checks?
>

I am not changing the logic in this patch, just moving the check from
qede_alloc_sge_mem() to qede_fix_features().

Typically, the chip will also do header-data split when doing GRO_HW,
so that's probably why it is checking the MTU and not the total packet
size against page size.  Ariel can confirm.


Re: [PATCH net-next v2 5/5] qede: Use NETIF_F_GRO_HW.

2017-12-08 Thread Marcelo Ricardo Leitner
Hi,

On Thu, Dec 07, 2017 at 03:03:35AM -0500, Michael Chan wrote:
> --- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
> +++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
> @@ -940,6 +940,9 @@ int qede_change_mtu(struct net_device *ndev, int new_mtu)
>   DP_VERBOSE(edev, (NETIF_MSG_IFUP | NETIF_MSG_IFDOWN),
>  "Configuring MTU size of %d\n", new_mtu);
>  
> + if (new_mtu > PAGE_SIZE)

I don't know the specs for this card but if it needs to fit the whole
packet in a page, maybe it should consider the ethernet header size in
such checks?

  Marcelo


Re: [PATCH v5 2/2] sock: Move the socket inuse to namespace.

2017-12-08 Thread Cong Wang
On Thu, Dec 7, 2017 at 9:28 PM, Tonghao Zhang  wrote:
>
> Release the netlink sock created in kernel(not hold the _net_ namespace):
>

You can avoid counting kernel sock by testing 'kern' in sk_alloc()
and testing 'sk->sk_net_refcnt' in __sk_free().


[PATCH V2 net-next 2/8] net: hns3: Add mailbox support to VF driver

2017-12-08 Thread Salil Mehta
This patch adds the support of the mailbox to the VF driver. The
mailbox shall be used as an interface to communicate with the
PF driver for various purposes like {set|get} MAC related
operations, reset, link status etc. The mailbox supports both
synchronous and asynchronous command send to PF driver.

Signed-off-by: Salil Mehta 
Signed-off-by: lipeng 
---
Patch V2: Addressed some internal commnets
Patch V1: Initial Submit
---
 drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h|  94 +++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c   | 188 +
 2 files changed, 282 insertions(+)
 create mode 100644 drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
 create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c

diff --git a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h 
b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
new file mode 100644
index 000..5a6ebfca
--- /dev/null
+++ b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
@@ -0,0 +1,94 @@
+/*
+ * Copyright (c) 2016-2017 Hisilicon Limited.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __HCLGE_MBX_H
+#define __HCLGE_MBX_H
+#include 
+#include 
+#include 
+
+#define HCLGE_MBX_VF_MSG_DATA_NUM  16
+
+enum HCLGE_MBX_OPCODE {
+   HCLGE_MBX_RESET = 0x01, /* (VF -> PF) assert reset */
+   HCLGE_MBX_SET_UNICAST,  /* (VF -> PF) set UC addr */
+   HCLGE_MBX_SET_MULTICAST,/* (VF -> PF) set MC addr */
+   HCLGE_MBX_SET_VLAN, /* (VF -> PF) set VLAN */
+   HCLGE_MBX_MAP_RING_TO_VECTOR,   /* (VF -> PF) map ring-to-vector */
+   HCLGE_MBX_UNMAP_RING_TO_VECTOR, /* (VF -> PF) unamp ring-to-vector */
+   HCLGE_MBX_SET_PROMISC_MODE, /* (VF -> PF) set promiscuous mode */
+   HCLGE_MBX_SET_MACVLAN,  /* (VF -> PF) set unicast filter */
+   HCLGE_MBX_API_NEGOTIATE,/* (VF -> PF) negotiate API version */
+   HCLGE_MBX_GET_QINFO,/* (VF -> PF) get queue config */
+   HCLGE_MBX_GET_TCINFO,   /* (VF -> PF) get TC config */
+   HCLGE_MBX_GET_RETA, /* (VF -> PF) get RETA */
+   HCLGE_MBX_GET_RSS_KEY,  /* (VF -> PF) get RSS key */
+   HCLGE_MBX_GET_MAC_ADDR, /* (VF -> PF) get MAC addr */
+   HCLGE_MBX_PF_VF_RESP,   /* (PF -> VF) generate respone to VF */
+   HCLGE_MBX_GET_BDNUM,/* (VF -> PF) get BD num */
+   HCLGE_MBX_GET_BUFSIZE,  /* (VF -> PF) get buffer size */
+   HCLGE_MBX_GET_STREAMID, /* (VF -> PF) get stream id */
+   HCLGE_MBX_SET_AESTART,  /* (VF -> PF) start ae */
+   HCLGE_MBX_SET_TSOSTATS, /* (VF -> PF) get tso stats */
+   HCLGE_MBX_LINK_STAT_CHANGE, /* (PF -> VF) link status has changed */
+   HCLGE_MBX_GET_BASE_CONFIG,  /* (VF -> PF) get config */
+   HCLGE_MBX_BIND_FUNC_QUEUE,  /* (VF -> PF) bind function and queue */
+   HCLGE_MBX_GET_LINK_STATUS,  /* (VF -> PF) get link status */
+   HCLGE_MBX_QUEUE_RESET,  /* (VF -> PF) reset queue */
+};
+
+/* below are per-VF mac-vlan subcodes */
+enum hclge_mbx_mac_vlan_subcode {
+   HCLGE_MBX_MAC_VLAN_UC_MODIFY = 0,   /* modify UC mac addr */
+   HCLGE_MBX_MAC_VLAN_UC_ADD,  /* add a new UC mac addr */
+   HCLGE_MBX_MAC_VLAN_UC_REMOVE,   /* remove a new UC mac addr */
+   HCLGE_MBX_MAC_VLAN_MC_MODIFY,   /* modify MC mac addr */
+   HCLGE_MBX_MAC_VLAN_MC_ADD,  /* add new MC mac addr */
+   HCLGE_MBX_MAC_VLAN_MC_REMOVE,   /* remove MC mac addr */
+   HCLGE_MBX_MAC_VLAN_MC_FUNC_MTA_ENABLE,  /* config func MTA enable */
+};
+
+/* below are per-VF vlan cfg subcodes */
+enum hclge_mbx_vlan_cfg_subcode {
+   HCLGE_MBX_VLAN_FILTER = 0,  /* set vlan filter */
+   HCLGE_MBX_VLAN_TX_OFF_CFG,  /* set tx side vlan offload */
+   HCLGE_MBX_VLAN_RX_OFF_CFG,  /* set rx side vlan offload */
+};
+
+#define HCLGE_MBX_MAX_MSG_SIZE 16
+#define HCLGE_MBX_MAX_RESP_DATA_SIZE   8
+
+struct hclgevf_mbx_resp_status {
+   struct mutex mbx_mutex; /* protects against contending sync cmd resp */
+   u32 origin_mbx_msg;
+   bool received_resp;
+   int resp_status;
+   u8 additional_info[HCLGE_MBX_MAX_RESP_DATA_SIZE];
+};
+
+struct hclge_mbx_vf_to_pf_cmd {
+   u8 rsv;
+   u8 mbx_src_vfid; /* Auto filled by IMP */
+   u8 rsv1[2];
+   u8 msg_len;
+   u8 rsv2[3];
+   u8 msg[HCLGE_MBX_MAX_MSG_SIZE];
+};
+
+struct hclge_mbx_pf_to_vf_cmd {
+   u8 dest_vfid;
+   u8 rsv[3];
+   u8 msg_len;
+   u8 rsv1[3];
+   u16 msg[8];
+};
+
+#define hclge_mbx_ring_ptr_move_crq(crq) \
+   (crq->next_to_use = 

[PATCH V2 net-next 5/8] net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC

2017-12-08 Thread Salil Mehta
Most of the NAPI handling interface, skb buffer management,
management of the RX/TX descriptors, ethool interface etc.
has quite a bit of code which is common to VF and PF driver.

This patch makes the exisitng PF's HNS3 ENET driver as the
common ENET driver for both Virtual & Physical Function. This
will help in reduction of redundancy and better management of
code.

Signed-off-by: Salil Mehta 
Signed-off-by: lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/Makefile   |  6 +-
 drivers/net/ethernet/hisilicon/hns3/hnae3.c| 14 --
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  7 ---
 .../hisilicon/hns3/{hns3pf => }/hns3_dcbnl.c   |  2 +-
 .../hisilicon/hns3/{hns3pf => }/hns3_enet.c|  2 ++
 .../hisilicon/hns3/{hns3pf => }/hns3_enet.h|  0
 .../hisilicon/hns3/{hns3pf => }/hns3_ethtool.c | 22 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/Makefile|  5 -
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  |  1 +
 9 files changed, 46 insertions(+), 13 deletions(-)
 rename drivers/net/ethernet/hisilicon/hns3/{hns3pf => }/hns3_dcbnl.c (97%)
 rename drivers/net/ethernet/hisilicon/hns3/{hns3pf => }/hns3_enet.c (99%)
 rename drivers/net/ethernet/hisilicon/hns3/{hns3pf => }/hns3_enet.h (100%)
 rename drivers/net/ethernet/hisilicon/hns3/{hns3pf => }/hns3_ethtool.c (97%)

diff --git a/drivers/net/ethernet/hisilicon/hns3/Makefile 
b/drivers/net/ethernet/hisilicon/hns3/Makefile
index 244664b..48af9b1 100644
--- a/drivers/net/ethernet/hisilicon/hns3/Makefile
+++ b/drivers/net/ethernet/hisilicon/hns3/Makefile
@@ -1,8 +1,12 @@
 #
 # Makefile for the HISILICON network device drivers.
 #
-
 obj-$(CONFIG_HNS3) += hns3pf/
 obj-$(CONFIG_HNS3) += hns3vf/
 
 obj-$(CONFIG_HNS3) += hnae3.o
+
+obj-$(CONFIG_HNS3_ENET) += hns3.o
+hns3-objs = hns3_enet.o hns3_ethtool.o
+
+hns3-$(CONFIG_HNS3_DCB) += hns3_dcbnl.o
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.c 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.c
index 5bcb223..02145f2 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.c
@@ -196,9 +196,18 @@ int hnae3_register_ae_dev(struct hnae3_ae_dev *ae_dev)
const struct pci_device_id *id;
struct hnae3_ae_algo *ae_algo;
struct hnae3_client *client;
-   int ret = 0;
+   int ret = 0, lock_acquired;
+
+   /* we can get deadlocked if SRIOV is being enabled in context to probe
+* and probe gets called again in same context. This can happen when
+* pci_enable_sriov() is called to create VFs from PF probes context.
+* Therefore, for simplicity uniformly defering further probing in all
+* cases where we detect contention.
+*/
+   lock_acquired = mutex_trylock(_common_lock);
+   if (!lock_acquired)
+   return -EPROBE_DEFER;
 
-   mutex_lock(_common_lock);
list_add_tail(_dev->node, _ae_dev_list);
 
/* Check if there are matched ae_algo */
@@ -211,6 +220,7 @@ int hnae3_register_ae_dev(struct hnae3_ae_dev *ae_dev)
 
if (!ae_dev->ops) {
dev_err(_dev->pdev->dev, "ae_dev ops are null\n");
+   ret = -EOPNOTSUPP;
goto out_err;
}
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 67c59e1..a9e2b32 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -452,9 +452,10 @@ struct hnae3_unic_private_info {
struct hnae3_queue **tqp;  /* array base of all TQPs of this instance */
 };
 
-#define HNAE3_SUPPORT_MAC_LOOPBACK1
-#define HNAE3_SUPPORT_PHY_LOOPBACK2
-#define HNAE3_SUPPORT_SERDES_LOOPBACK 4
+#define HNAE3_SUPPORT_MAC_LOOPBACKBIT(0)
+#define HNAE3_SUPPORT_PHY_LOOPBACKBIT(1)
+#define HNAE3_SUPPORT_SERDES_LOOPBACK BIT(2)
+#define HNAE3_SUPPORT_VF BIT(3)
 
 struct hnae3_handle {
struct hnae3_client *client;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_dcbnl.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_dcbnl.c
similarity index 97%
rename from drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_dcbnl.c
rename to drivers/net/ethernet/hisilicon/hns3/hns3_dcbnl.c
index 925619a..eb82700 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_dcbnl.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_dcbnl.c
@@ -93,7 +93,7 @@ void hns3_dcbnl_setup(struct hnae3_handle *handle)
 {
struct net_device *dev = handle->kinfo.netdev;
 
-   if (!handle->kinfo.dcb_ops)
+   if ((!handle->kinfo.dcb_ops) || (handle->flags & HNAE3_SUPPORT_VF))
return;
 
dev->dcbnl_ops = _dcbnl_ops;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
similarity index 99%
rename from 

[PATCH V2 net-next 3/8] net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support

2017-12-08 Thread Salil Mehta
This patch adds the support of hardware compatibiltiy layer to the
HNS3 VF Driver. This layer implements various {set|get} operations
over MAC address for a virtual port, RSS related configuration,
fetches the link status info from PF, does various VLAN related
configuration over the virtual port, queries the statistics from
the hardware etc.

This layer can directly interact with hardware through the
IMP(Integrated Mangement Processor) interface or can use mailbox
to interact with the PF driver.

Signed-off-by: Salil Mehta 
Signed-off-by: lipeng 
---
Patch V2: Addressed some internal comments
Patch V1: Initial Submit
---
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 1494 
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |  170 +++
 2 files changed, 1664 insertions(+)
 create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
 create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
new file mode 100644
index 000..a878c63
--- /dev/null
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -0,0 +1,1494 @@
+/*
+ * Copyright (c) 2016-2017 Hisilicon Limited.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include 
+#include "hclgevf_cmd.h"
+#include "hclgevf_main.h"
+#include "hclge_mbx.h"
+#include "hnae3.h"
+
+#define HCLGEVF_NAME   "hclgevf"
+
+static struct hnae3_ae_algo ae_algovf;
+
+static const struct pci_device_id ae_algovf_pci_tbl[] = {
+   {PCI_VDEVICE(HUAWEI, HNAE3_DEV_ID_100G_VF), 0},
+   {PCI_VDEVICE(HUAWEI, HNAE3_DEV_ID_100G_RDMA_DCB_PFC_VF), 0},
+   /* required last entry */
+   {0, }
+};
+
+static inline struct hclgevf_dev *hclgevf_ae_get_hdev(
+   struct hnae3_handle *handle)
+{
+   return container_of(handle, struct hclgevf_dev, nic);
+}
+
+static int hclgevf_tqps_update_stats(struct hnae3_handle *handle)
+{
+   struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
+   struct hnae3_queue *queue;
+   struct hclgevf_desc desc;
+   struct hclgevf_tqp *tqp;
+   int status;
+   int i;
+
+   for (i = 0; i < hdev->num_tqps; i++) {
+   queue = handle->kinfo.tqp[i];
+   tqp = container_of(queue, struct hclgevf_tqp, q);
+   hclgevf_cmd_setup_basic_desc(,
+HCLGEVF_OPC_QUERY_RX_STATUS,
+true);
+
+   desc.data[0] = cpu_to_le32(tqp->index & 0x1ff);
+   status = hclgevf_cmd_send(>hw, , 1);
+   if (status) {
+   dev_err(>pdev->dev,
+   "Query tqp stat fail, status = %d,queue = %d\n",
+   status, i);
+   return status;
+   }
+   tqp->tqp_stats.rcb_rx_ring_pktnum_rcd +=
+   le32_to_cpu(desc.data[4]);
+
+   hclgevf_cmd_setup_basic_desc(, HCLGEVF_OPC_QUERY_TX_STATUS,
+true);
+
+   desc.data[0] = cpu_to_le32(tqp->index & 0x1ff);
+   status = hclgevf_cmd_send(>hw, , 1);
+   if (status) {
+   dev_err(>pdev->dev,
+   "Query tqp stat fail, status = %d,queue = %d\n",
+   status, i);
+   return status;
+   }
+   tqp->tqp_stats.rcb_tx_ring_pktnum_rcd +=
+   le32_to_cpu(desc.data[4]);
+   }
+
+   return 0;
+}
+
+static u64 *hclgevf_tqps_get_stats(struct hnae3_handle *handle, u64 *data)
+{
+   struct hnae3_knic_private_info *kinfo = >kinfo;
+   struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
+   struct hclgevf_tqp *tqp;
+   u64 *buff = data;
+   int i;
+
+   for (i = 0; i < hdev->num_tqps; i++) {
+   tqp = container_of(handle->kinfo.tqp[i], struct hclgevf_tqp, q);
+   *buff++ = tqp->tqp_stats.rcb_tx_ring_pktnum_rcd;
+   }
+   for (i = 0; i < kinfo->num_tqps; i++) {
+   tqp = container_of(handle->kinfo.tqp[i], struct hclgevf_tqp, q);
+   *buff++ = tqp->tqp_stats.rcb_rx_ring_pktnum_rcd;
+   }
+
+   return buff;
+}
+
+static int hclgevf_tqps_get_sset_count(struct hnae3_handle *handle, int strset)
+{
+   struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
+
+   return hdev->num_tqps * 2;
+}
+
+static u8 *hclgevf_tqps_get_strings(struct hnae3_handle *handle, u8 *data)
+{
+   struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
+   u8 *buff = data;
+ 

[PATCH V2 net-next 6/8] net: hns3: Add mailbox support to PF driver

2017-12-08 Thread Salil Mehta
Command queue provides the provision of Mailbox command which
can be used for communication between PF and VF. PF handles
messages from various VFs for fetching various information like,
queue, vlan, link status related etc. It also handles the request
from various VFs to perform certain privileged operations.

This patch adds the support of a message handler for handling
such various command requests from VF.

Signed-off-by: Salil Mehta 
Signed-off-by: lipeng 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/Makefile|   2 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c|   1 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|   2 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c | 310 +
 4 files changed, 314 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/Makefile 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/Makefile
index d077fa0..393a1b3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/Makefile
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/Makefile
@@ -5,6 +5,6 @@
 ccflags-y := -Idrivers/net/ethernet/hisilicon/hns3
 
 obj-$(CONFIG_HNS3_HCLGE) += hclge.o
-hclge-objs = hclge_main.o hclge_cmd.o hclge_mdio.o hclge_tm.o
+hclge-objs = hclge_main.o hclge_cmd.o hclge_mdio.o hclge_tm.o hclge_mbx.o
 
 hclge-$(CONFIG_HNS3_DCB) += hclge_dcb.o
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index d07c700..980fcdf 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -21,6 +21,7 @@
 #include "hclge_cmd.h"
 #include "hclge_dcb.h"
 #include "hclge_main.h"
+#include "hclge_mbx.h"
 #include "hclge_mdio.h"
 #include "hclge_tm.h"
 #include "hnae3.h"
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index aacec43..028817c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -554,4 +554,6 @@ int hclge_set_vf_vlan_common(struct hclge_dev *vport, int 
vfid,
 
 int hclge_buffer_alloc(struct hclge_dev *hdev);
 int hclge_rss_init_hw(struct hclge_dev *hdev);
+
+void hclge_mbx_handler(struct hclge_dev *hdev);
 #endif
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
new file mode 100644
index 000..a993735
--- /dev/null
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
@@ -0,0 +1,310 @@
+/*
+ * Copyright (c) 2016-2017 Hisilicon Limited.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include "hclge_main.h"
+#include "hclge_mbx.h"
+#include "hnae3.h"
+
+/* hclge_gen_resp_to_vf: used to generate a synchronous response to VF when PF
+ * receives a mailbox message from VF.
+ * @vport: pointer to struct hclge_vport
+ * @vf_to_pf_req: pointer to hclge_mbx_vf_to_pf_cmd of the original mailbox
+ *   message
+ * @resp_status: indicate to VF whether its request success(0) or failed.
+ */
+static int hclge_gen_resp_to_vf(struct hclge_vport *vport,
+   struct hclge_mbx_vf_to_pf_cmd *vf_to_pf_req,
+   int resp_status,
+   u8 *resp_data, u16 resp_data_len)
+{
+   struct hclge_mbx_pf_to_vf_cmd *resp_pf_to_vf;
+   struct hclge_dev *hdev = vport->back;
+   enum hclge_cmd_status status;
+   struct hclge_desc desc;
+
+   resp_pf_to_vf = (struct hclge_mbx_pf_to_vf_cmd *)desc.data;
+
+   if (resp_data_len > HCLGE_MBX_MAX_RESP_DATA_SIZE) {
+   dev_err(>pdev->dev,
+   "PF fail to gen resp to VF len %d exceeds max len %d\n",
+   resp_data_len,
+   HCLGE_MBX_MAX_RESP_DATA_SIZE);
+   }
+
+   hclge_cmd_setup_basic_desc(, HCLGEVF_OPC_MBX_PF_TO_VF, false);
+
+   resp_pf_to_vf->dest_vfid = vf_to_pf_req->mbx_src_vfid;
+   resp_pf_to_vf->msg_len = vf_to_pf_req->msg_len;
+
+   resp_pf_to_vf->msg[0] = HCLGE_MBX_PF_VF_RESP;
+   resp_pf_to_vf->msg[1] = vf_to_pf_req->msg[0];
+   resp_pf_to_vf->msg[2] = vf_to_pf_req->msg[1];
+   resp_pf_to_vf->msg[3] = (resp_status == 0) ? 0 : 1;
+
+   if (resp_data && resp_data_len > 0)
+   memcpy(_pf_to_vf->msg[4], resp_data, resp_data_len);
+
+   status = hclge_cmd_send(>hw, , 1);
+   if (status)
+   dev_err(>pdev->dev,
+   "PF failed(=%d) to send response to VF\n", status);
+
+   return status;
+}
+
+static int 

[PATCH V2 net-next 7/8] net: hns3: Change PF to add ring-vect binding & resetQ to mailbox

2017-12-08 Thread Salil Mehta
This patch is required to support ring-vector binding and reset
of TQPs requested by the VF driver to the PF driver. Mailbox
handler is added with corresponding VF commands/messages to
handle the request.

Signed-off-by: Salil Mehta 
Signed-off-by: lipeng 
---
Patch V2: Addressed some internal comments
Patch V1: Initial Submit
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 138 +++--
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|   7 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c | 106 
 3 files changed, 159 insertions(+), 92 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 980fcdf..3b1fc49 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -3256,49 +3256,48 @@ int hclge_rss_init_hw(struct hclge_dev *hdev)
return ret;
 }
 
-int hclge_map_vport_ring_to_vector(struct hclge_vport *vport, int vector_id,
-  struct hnae3_ring_chain_node *ring_chain)
+int hclge_bind_ring_with_vector(struct hclge_vport *vport,
+   int vector_id, bool en,
+   struct hnae3_ring_chain_node *ring_chain)
 {
struct hclge_dev *hdev = vport->back;
-   struct hclge_ctrl_vector_chain_cmd *req;
struct hnae3_ring_chain_node *node;
struct hclge_desc desc;
-   int ret;
+   struct hclge_ctrl_vector_chain_cmd *req
+   = (struct hclge_ctrl_vector_chain_cmd *)desc.data;
+   enum hclge_cmd_status status;
+   enum hclge_opcode_type op;
+   u16 tqp_type_and_id;
int i;
 
-   hclge_cmd_setup_basic_desc(, HCLGE_OPC_ADD_RING_TO_VECTOR, false);
-
-   req = (struct hclge_ctrl_vector_chain_cmd *)desc.data;
+   op = en ? HCLGE_OPC_ADD_RING_TO_VECTOR : HCLGE_OPC_DEL_RING_TO_VECTOR;
+   hclge_cmd_setup_basic_desc(, op, false);
req->int_vector_id = vector_id;
 
i = 0;
for (node = ring_chain; node; node = node->next) {
-   u16 type_and_id = 0;
-
-   hnae_set_field(type_and_id, HCLGE_INT_TYPE_M, HCLGE_INT_TYPE_S,
+   tqp_type_and_id = le16_to_cpu(req->tqp_type_and_id[i]);
+   hnae_set_field(tqp_type_and_id,  HCLGE_INT_TYPE_M,
+  HCLGE_INT_TYPE_S,
   hnae_get_bit(node->flag, HNAE3_RING_TYPE_B));
-   hnae_set_field(type_and_id, HCLGE_TQP_ID_M, HCLGE_TQP_ID_S,
-  node->tqp_index);
-   hnae_set_field(type_and_id, HCLGE_INT_GL_IDX_M,
-  HCLGE_INT_GL_IDX_S,
-  hnae_get_bit(node->flag, HNAE3_RING_TYPE_B));
-   req->tqp_type_and_id[i] = cpu_to_le16(type_and_id);
-   req->vfid = vport->vport_id;
-
+   hnae_set_field(tqp_type_and_id, HCLGE_TQP_ID_M,
+  HCLGE_TQP_ID_S, node->tqp_index);
+   req->tqp_type_and_id[i] = cpu_to_le16(tqp_type_and_id);
if (++i >= HCLGE_VECTOR_ELEMENTS_PER_CMD) {
req->int_cause_num = HCLGE_VECTOR_ELEMENTS_PER_CMD;
+   req->vfid = vport->vport_id;
 
-   ret = hclge_cmd_send(>hw, , 1);
-   if (ret) {
+   status = hclge_cmd_send(>hw, , 1);
+   if (status) {
dev_err(>pdev->dev,
"Map TQP fail, status is %d.\n",
-   ret);
-   return ret;
+   status);
+   return -EIO;
}
i = 0;
 
hclge_cmd_setup_basic_desc(,
-  HCLGE_OPC_ADD_RING_TO_VECTOR,
+  op,
   false);
req->int_vector_id = vector_id;
}
@@ -3306,21 +3305,21 @@ int hclge_map_vport_ring_to_vector(struct hclge_vport 
*vport, int vector_id,
 
if (i > 0) {
req->int_cause_num = i;
-
-   ret = hclge_cmd_send(>hw, , 1);
-   if (ret) {
+   req->vfid = vport->vport_id;
+   status = hclge_cmd_send(>hw, , 1);
+   if (status) {
dev_err(>pdev->dev,
-   "Map TQP fail, status is %d.\n", ret);
-   return ret;
+   "Map TQP fail, status is %d.\n", status);
+   return -EIO;
}
}
 
return 0;
 }
 
-static int hclge_map_handle_ring_to_vector(
-

[PATCH V2 net-next 4/8] net: hns3: Add HNS3 VF driver to kernel build framework

2017-12-08 Thread Salil Mehta
This patch introduces the new Makefiles and updates existing
Makefiles required to build the HNS3 Virtual Function driver.
This also updates the Kconfig for introduction of new menuconfig
entries related to VF driver.

Signed-off-by: Salil Mehta 
Signed-off-by: lipeng 
---
 drivers/net/ethernet/hisilicon/Kconfig | 28 +++---
 drivers/net/ethernet/hisilicon/hns3/Makefile   |  1 +
 .../net/ethernet/hisilicon/hns3/hns3vf/Makefile|  8 +++
 3 files changed, 28 insertions(+), 9 deletions(-)
 create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/Makefile

diff --git a/drivers/net/ethernet/hisilicon/Kconfig 
b/drivers/net/ethernet/hisilicon/Kconfig
index 3b6..8bcf470 100644
--- a/drivers/net/ethernet/hisilicon/Kconfig
+++ b/drivers/net/ethernet/hisilicon/Kconfig
@@ -94,15 +94,6 @@ config HNS3_HCLGE
  compatibility layer. The engine would be used in Hisilicon hip08 
family of
  SoCs and further upcoming SoCs.
 
-config HNS3_ENET
-   tristate "Hisilicon HNS3 Ethernet Device Support"
-   depends on 64BIT && PCI
-   depends on HNS3 && HNS3_HCLGE
-   ---help---
- This selects the Ethernet Driver for Hisilicon Network Subsystem 3 
for hip08
- family of SoCs. This module depends upon HNAE3 driver to access the 
HNAE3
- devices and their associated operations.
-
 config HNS3_DCB
bool "Hisilicon HNS3 Data Center Bridge Support"
default n
@@ -112,4 +103,23 @@ config HNS3_DCB
 
  If unsure, say N.
 
+config HNS3_HCLGEVF
+tristate "Hisilicon HNS3VF Acceleration Engine & Compatibility Layer 
Support"
+depends on PCI_MSI
+depends on HNS3
+   depends on HNS3_HCLGE
+---help---
+ This selects the HNS3 VF drivers network acceleration engine & its 
hardware
+ compatibility layer. The engine would be used in Hisilicon hip08 
family of
+ SoCs and further upcoming SoCs.
+
+config HNS3_ENET
+   tristate "Hisilicon HNS3 Ethernet Device Support"
+   depends on 64BIT && PCI
+   depends on HNS3
+   ---help---
+ This selects the Ethernet Driver for Hisilicon Network Subsystem 3 
for hip08
+ family of SoCs. This module depends upon HNAE3 driver to access the 
HNAE3
+ devices and their associated operations.
+
 endif # NET_VENDOR_HISILICON
diff --git a/drivers/net/ethernet/hisilicon/hns3/Makefile 
b/drivers/net/ethernet/hisilicon/hns3/Makefile
index a9349e1..244664b 100644
--- a/drivers/net/ethernet/hisilicon/hns3/Makefile
+++ b/drivers/net/ethernet/hisilicon/hns3/Makefile
@@ -3,5 +3,6 @@
 #
 
 obj-$(CONFIG_HNS3) += hns3pf/
+obj-$(CONFIG_HNS3) += hns3vf/
 
 obj-$(CONFIG_HNS3) += hnae3.o
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/Makefile 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/Makefile
new file mode 100644
index 000..a513d00
--- /dev/null
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/Makefile
@@ -0,0 +1,8 @@
+#
+# Makefile for the HISILICON network device drivers.
+#
+
+ccflags-y := -Idrivers/net/ethernet/hisilicon/hns3
+
+obj-$(CONFIG_HNS3_HCLGEVF) += hclgevf.o
+hclgevf-objs = hclgevf_main.o hclgevf_cmd.o hclgevf_mbx.o
\ No newline at end of file
-- 
2.7.4




[PATCH V2 net-next 8/8] net: hns3: Add mailbox interrupt handling to PF driver

2017-12-08 Thread Salil Mehta
All PF mailbox events are conveyed through a common interrupt
(vector 0). This interrupt vector is shared by reset and mailbox.

This patch adds the handling of mailbox interrupt event and its
deferred processing in context to a separate mailbox task.

Signed-off-by: Salil Mehta 
Signed-off-by: lipeng 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 68 +++---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  8 ++-
 2 files changed, 68 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 3b1fc49..e97fd66 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2227,6 +2227,12 @@ static int hclge_mac_init(struct hclge_dev *hdev)
return hclge_cfg_func_mta_filter(hdev, 0, hdev->accept_mta_mc);
 }
 
+static void hclge_mbx_task_schedule(struct hclge_dev *hdev)
+{
+   if (!test_and_set_bit(HCLGE_STATE_MBX_SERVICE_SCHED, >state))
+   schedule_work(>mbx_service_task);
+}
+
 static void hclge_reset_task_schedule(struct hclge_dev *hdev)
 {
if (!test_and_set_bit(HCLGE_STATE_RST_SERVICE_SCHED, >state))
@@ -2372,9 +2378,18 @@ static void hclge_service_complete(struct hclge_dev 
*hdev)
 static u32 hclge_check_event_cause(struct hclge_dev *hdev, u32 *clearval)
 {
u32 rst_src_reg;
+   u32 cmdq_src_reg;
 
/* fetch the events from their corresponding regs */
rst_src_reg = hclge_read_dev(>hw, HCLGE_MISC_RESET_STS_REG);
+   cmdq_src_reg = hclge_read_dev(>hw, HCLGE_VECTOR0_CMDQ_SRC_REG);
+
+   /* Assumption: If by any chance reset and mailbox events are reported
+* together then we will only process reset event in this go and will
+* defer the processing of the mailbox events. Since, we would have not
+* cleared RX CMDQ event this time we would receive again another
+* interrupt from H/W just for the mailbox.
+*/
 
/* check for vector0 reset event sources */
if (BIT(HCLGE_VECTOR0_GLOBALRESET_INT_B) & rst_src_reg) {
@@ -2395,7 +2410,12 @@ static u32 hclge_check_event_cause(struct hclge_dev 
*hdev, u32 *clearval)
return HCLGE_VECTOR0_EVENT_RST;
}
 
-   /* mailbox event sharing vector 0 interrupt would be placed here */
+   /* check for vector0 mailbox(=CMDQ RX) event source */
+   if (BIT(HCLGE_VECTOR0_RX_CMDQ_INT_B) & cmdq_src_reg) {
+   cmdq_src_reg &= ~BIT(HCLGE_VECTOR0_RX_CMDQ_INT_B);
+   *clearval = cmdq_src_reg;
+   return HCLGE_VECTOR0_EVENT_MBX;
+   }
 
return HCLGE_VECTOR0_EVENT_OTHER;
 }
@@ -2403,10 +2423,14 @@ static u32 hclge_check_event_cause(struct hclge_dev 
*hdev, u32 *clearval)
 static void hclge_clear_event_cause(struct hclge_dev *hdev, u32 event_type,
u32 regclr)
 {
-   if (event_type == HCLGE_VECTOR0_EVENT_RST)
+   switch (event_type) {
+   case HCLGE_VECTOR0_EVENT_RST:
hclge_write_dev(>hw, HCLGE_MISC_RESET_STS_REG, regclr);
-
-   /* mailbox event sharing vector 0 interrupt would be placed here */
+   break;
+   case HCLGE_VECTOR0_EVENT_MBX:
+   hclge_write_dev(>hw, HCLGE_VECTOR0_CMDQ_SRC_REG, regclr);
+   break;
+   }
 }
 
 static void hclge_enable_vector(struct hclge_misc_vector *vector, bool enable)
@@ -2423,13 +2447,23 @@ static irqreturn_t hclge_misc_irq_handle(int irq, void 
*data)
hclge_enable_vector(>misc_vector, false);
event_cause = hclge_check_event_cause(hdev, );
 
-   /* vector 0 interrupt is shared with reset and mailbox source events.
-* For now, we are not handling mailbox events.
-*/
+   /* vector 0 interrupt is shared with reset and mailbox source events.*/
switch (event_cause) {
case HCLGE_VECTOR0_EVENT_RST:
hclge_reset_task_schedule(hdev);
break;
+   case HCLGE_VECTOR0_EVENT_MBX:
+   /* If we are here then,
+* 1. Either we are not handling any mbx task and we are not
+*scheduled as well
+*OR
+* 2. We could be handling a mbx task but nothing more is
+*scheduled.
+* In both cases, we should schedule mbx task as there are more
+* mbx messages reported by this interrupt.
+*/
+   hclge_mbx_task_schedule(hdev);
+
default:
dev_dbg(>pdev->dev,
"received unknown or unhandled event of vector0\n");
@@ -2708,6 +2742,21 @@ static void hclge_reset_service_task(struct work_struct 
*work)
clear_bit(HCLGE_STATE_RST_HANDLING, >state);
 }
 
+static void hclge_mailbox_service_task(struct work_struct *work)

[PATCH V2 net-next 1/8] net: hns3: Add HNS3 VF IMP(Integrated Management Proc) cmd interface

2017-12-08 Thread Salil Mehta
This patch adds support of command interface for communication with
the IMP(Integrated Management Processor) for HNS3 Virtual Function
Driver.

Each VF has support of CQP(Command Queue Pair) ring interface.
Each CQP consis of send queue CSQ and receive queue CRQ.
There are various commands a VF may support, like to query frimware
version, TQP management, statistics, interrupt related, mailbox etc.

This also contains code to initialize the command queue, manage the
command queue descriptors and Rx/Tx protocol with the command processor
in the form of various commands/results and acknowledgements.

Signed-off-by: Salil Mehta 
Signed-off-by: lipeng 
---
Patch V2: Reworked comments by David Miller(except one comment on the
  udelay() while holding locks. Needs further discussion)
  Link: https://lkml.org/lkml/2017/12/5/639
Patch V1: Initial Submit
---
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c   | 348 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h   | 262 
 2 files changed, 610 insertions(+)
 create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c
 create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c
new file mode 100644
index 000..d786ab8
--- /dev/null
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c
@@ -0,0 +1,348 @@
+/*
+ * Copyright (c) 2016-2017 Hisilicon Limited.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "hclgevf_cmd.h"
+#include "hclgevf_main.h"
+#include "hnae3.h"
+
+#define hclgevf_is_csq(ring) ((ring)->flag & HCLGEVF_TYPE_CSQ)
+#define hclgevf_ring_to_dma_dir(ring) (hclgevf_is_csq(ring) ? \
+   DMA_TO_DEVICE : DMA_FROM_DEVICE)
+#define cmq_ring_to_dev(ring)   (&(ring)->dev->pdev->dev)
+
+static int hclgevf_ring_space(struct hclgevf_cmq_ring *ring)
+{
+   int ntc = ring->next_to_clean;
+   int ntu = ring->next_to_use;
+   int used;
+
+   used = (ntu - ntc + ring->desc_num) % ring->desc_num;
+
+   return ring->desc_num - used - 1;
+}
+
+static int hclgevf_cmd_csq_clean(struct hclgevf_hw *hw)
+{
+   struct hclgevf_cmq_ring *csq = >cmq.csq;
+   u16 ntc = csq->next_to_clean;
+   struct hclgevf_desc *desc;
+   int clean = 0;
+   u32 head;
+
+   desc = >desc[ntc];
+   head = hclgevf_read_dev(hw, HCLGEVF_NIC_CSQ_HEAD_REG);
+   while (head != ntc) {
+   memset(desc, 0, sizeof(*desc));
+   ntc++;
+   if (ntc == csq->desc_num)
+   ntc = 0;
+   desc = >desc[ntc];
+   clean++;
+   }
+   csq->next_to_clean = ntc;
+
+   return clean;
+}
+
+static bool hclgevf_cmd_csq_done(struct hclgevf_hw *hw)
+{
+   u32 head;
+
+   head = hclgevf_read_dev(hw, HCLGEVF_NIC_CSQ_HEAD_REG);
+
+   return head == hw->cmq.csq.next_to_use;
+}
+
+static bool hclgevf_is_special_opcode(u16 opcode)
+{
+   u16 spec_opcode[] = {0x30, 0x31, 0x32};
+   int i;
+
+   for (i = 0; i < ARRAY_SIZE(spec_opcode); i++) {
+   if (spec_opcode[i] == opcode)
+   return true;
+   }
+
+   return false;
+}
+
+static int hclgevf_alloc_cmd_desc(struct hclgevf_cmq_ring *ring)
+{
+   int size = ring->desc_num * sizeof(struct hclgevf_desc);
+
+   ring->desc = kzalloc(size, GFP_KERNEL);
+   if (!ring->desc)
+   return -ENOMEM;
+
+   ring->desc_dma_addr = dma_map_single(cmq_ring_to_dev(ring), ring->desc,
+size, DMA_BIDIRECTIONAL);
+
+   if (dma_mapping_error(cmq_ring_to_dev(ring), ring->desc_dma_addr)) {
+   ring->desc_dma_addr = 0;
+   kfree(ring->desc);
+   ring->desc = NULL;
+   return -ENOMEM;
+   }
+
+   return 0;
+}
+
+static void hclgevf_free_cmd_desc(struct hclgevf_cmq_ring *ring)
+{
+   dma_unmap_single(cmq_ring_to_dev(ring), ring->desc_dma_addr,
+ring->desc_num * sizeof(ring->desc[0]),
+hclgevf_ring_to_dma_dir(ring));
+
+   ring->desc_dma_addr = 0;
+   kfree(ring->desc);
+   ring->desc = NULL;
+}
+
+static int hclgevf_init_cmd_queue(struct hclgevf_dev *hdev,
+ struct hclgevf_cmq_ring *ring)
+{
+   struct hclgevf_hw *hw = >hw;
+   int ring_type = ring->flag;
+   u32 reg_val;
+   int ret;
+
+   ring->desc_num = HCLGEVF_NIC_CMQ_DESC_NUM;
+   spin_lock_init(>lock);
+   

Re: NFS corruption, fixed by echo 1 > /proc/sys/vm/drop_caches -- next debugging steps?

2017-12-08 Thread Eric Dumazet
On Fri, 2017-12-08 at 12:26 -0800, Matt Turner wrote:
> 
> Thanks for the quick reply!
> 
> I tried the patch on top of master, but unfortunately the corruption
> still occurs.

You might try replacing in sbdma_add_rcvbuffer()

sb_new = netdev_alloc_skb(dev, size);

by 

sb_new = alloc_skb(size, GFP_ATOMIC);

Maybe the device does not like having a frame spanning 2 pages.




Re: [PATCH nf-next RFC,v2 4/6] netfilter: flow table support for IPv4

2017-12-08 Thread Pablo Neira Ayuso
On Fri, Dec 08, 2017 at 11:04:13AM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso  wrote:
> > This patch adds the IPv4 flow table type, that implements the datapath
> > flow table to forward IPv4 traffic. Rationale is:
> > 
> > 1) Look up for the packet in the flow table, from the ingress hook.
> > 2) If there's a hit, decrement ttl and pass it on to the neighbour layer
> >for transmission.
> > 3) If there's a miss, packet is passed up to the classic forwarding
> >path.
> 
> Is there a plan to also handle zone IDs in future?

Zone ID is meaningful to whoever applies the policy: in this offload
approach this patchset implements, the policy resides in the kernel.

> I't going to be messy for sure since we'd need to tell HW how to do
> the zone mapping.  Perhaps only support a builtin list, e.g.
> vlan id == zone...?

I've been considering a more simple solution, ie. add the input ifindex
device in the flowtable hash lookup, as part of the flow tuple. All
examples I've been observing for zones are basically mapping network
interfaces to zones.

> Don't yet see how it could be done in a generic way as the mappings can
> be arbitrarily complex.
> 
> Right now afaics one could install one flow table per zone and map
> this in nft, but then we still miss the part that tells the hardware
> how the zone identifier was derived.
> 
> > +static bool ip_has_options(unsigned int thoff)
> > +{
> > +   return thoff > sizeof(struct iphdr);
> 
> I'd use
>   thoff != sizeof(...)
> 
> to catch case where ihl is < struct iphdr.

ok.

> > +nf_flow_offload_hook(void *priv, struct sk_buff *skb,
> > +const struct nf_hook_state *state)
> > +{
> > +   struct flow_offload_tuple_rhash *tuplehash;
> > +   struct nf_flowtable *flow_table = priv;
> > +   struct flow_offload_tuple tuple = {};
> > +   union nf_inet_addr nexthop;
> > +   struct flow_offload *flow;
> > +   struct net_device *outdev;
> > +   struct iphdr *iph;
> > +
> > +   if (nf_flow_tuple_ip(skb, ) < 0)
> > +   return NF_ACCEPT;
> > +
> > +   tuplehash = flow_offload_lookup(flow_table, );
> > +   if (tuplehash == NULL)
> > +   return NF_ACCEPT;
> > +
> > +   outdev = dev_get_by_index_rcu(_net, tuplehash->tuple.oifidx);
> 
> state->net ?

Yes, netns support is in my TODO list.


Re: BUG: KASAN: use-after-free in tcf_block_put_ext+0x5cf/0x5e0

2017-12-08 Thread Cong Wang
On Thu, Dec 7, 2017 at 8:59 PM, Jakub Kicinski  wrote:
> On Thu, 7 Dec 2017 20:51:27 -0800, Jakub Kicinski wrote:
>> Running the netdevsim test after a week and a bit of not trying it:
>>
>> $ make -C tools/testing/selftests/bpf/ CLANG=clang LLC=llc
>> # ./tools/testing/selftests/bpf/test_offload.py
>
> Ah, I didn't clarify - this is net-next at 66c5c5b56682 ("Merge branch
> 'smc-fixes-next'") plus some BPF-related commits.

Should be fixed by:

commit df45bf84e4f5a48f23d4b1a07d21d566e8b587b2
Author: Jiri Pirko 
Date:   Fri Dec 8 19:27:27 2017 +0100

net: sched: fix use-after-free in tcf_block_put_ext


Re: [PATCH nf-next RFC,v2 1/6] netfilter: nf_conntrack: add IPS_OFFLOAD status bit

2017-12-08 Thread Pablo Neira Ayuso
On Fri, Dec 08, 2017 at 07:47:02AM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso  wrote:
> > diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h 
> > b/include/uapi/linux/netfilter/nf_conntrack_common.h
> > index dc947e59d03a..6b463b88182d 100644
> > --- a/include/uapi/linux/netfilter/nf_conntrack_common.h
> > +++ b/include/uapi/linux/netfilter/nf_conntrack_common.h
> > @@ -100,6 +100,10 @@ enum ip_conntrack_status {
> > IPS_HELPER_BIT = 13,
> > IPS_HELPER = (1 << IPS_HELPER_BIT),
> >  
> > +   /* Conntrack has been offloaded to flow table. */
> > +   IPS_OFFLOAD_BIT = 14,
> > +   IPS_OFFLOAD = (1 << IPS_OFFLOAD_BIT),
> > +
> > /* Be careful here, modifying these bits can make things messy,
> >  * so don't let users modify them directly.
> >  */
> 
> I think this new bit has to be added to the UNCHANGEABLE mask below.

Right.

> > diff --git a/net/netfilter/nf_conntrack_core.c 
> > b/net/netfilter/nf_conntrack_core.c
> > index 01130392b7c0..02e195accd47 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -901,6 +901,9 @@ static unsigned int early_drop_list(struct net *net,
> > hlist_nulls_for_each_entry_rcu(h, n, head, hnnode) {
> > tmp = nf_ct_tuplehash_to_ctrack(h);
> >  
> > +   if (test_bit(IPS_OFFLOAD_BIT, >status))
> > +   continue;
> > +
> 
> nit: I would move this below the ASSURED bit check, AFAIU most
> (all?) offloaded conntracks are not in ASSURED state since they never
> see two-way communication but in case we've mixed flows or no offloading
> in place then the ASSURED check takes care of skipping earlydrop
> already.

Offload happens once we enter established state (or later if your rule
postpone it to a later stage), that is before we observe the full
3-way handshake in tcp, hence the assured bit. So I think we still
need this here.

> > +/* Set an arbitrary timeout large enough not to ever expire, this save
> > + * us a check for the IPS_OFFLOAD_BIT from the packet path via
> > + * nf_ct_is_expired().
> > + */
> > +static void nf_ct_offload_timeout(struct nf_conn *ct)
> > +{
> > +   ct->timeout = nfct_time_stamp + DAY;
> > +}
> 
> Not sure if its worth adding a test to avoid unconditional write,
> e.g. something like
> 
> > +static void nf_ct_offload_timeout(struct nf_conn *ct)
> > +{
>   if (nf_ct_expires(ct) < DAY/2))
> > +   ct->timeout = nfct_time_stamp + DAY;
> 
> but perhaps not worth it, gc_worker is infrequent.

That's fine, I'll do this.


Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected

2017-12-08 Thread Andreas Hartmann
On 12/08/2017 at 09:11 PM Andreas Hartmann wrote:
> On 12/08/2017 at 05:04 PM Willem de Bruijn wrote:
>> On Fri, Dec 8, 2017 at 6:40 AM, Michal Kubecek  wrote:
>>> On Fri, Dec 08, 2017 at 11:31:50AM +0100, Andreas Hartmann wrote:
 On 12/08/2017 at 09:47 AM Michal Kubecek wrote:
> On Fri, Dec 08, 2017 at 08:21:16AM +0100, Andreas Hartmann wrote:
>>
>> All my VMs are using virtio_net. BTW: I couldn't see the problems
>> (sometimes, the VM couldn't be stopped at all) if all my VMs are using
>> e1000 as interface instead.
>>
>> This finding now matches pretty much the responsible UDP-package which
>> caused the stall. I already mentioned it here [2].
>>
>> To prove it, I reverted from the patch series "[PATCH v2 RFC 0/13]
>> Remove UDP Fragmentation Offload support" [3]
>>
>> 11/13 [v2,RFC,11/13] net: Remove all references to SKB_GSO_UDP. [4]
>> 12/13 [v2,RFC,12/13] inet: Remove software UFO fragmenting code. [5]
>> 13/13 [v2,RFC,13/13] net: Kill NETIF_F_UFO and SKB_GSO_UDP. [6]
>>
>> and applied it to Linux 4.14.4. It compiled fine and is running fine.
>> The vnet doesn't die anymore. Yet, I can't say if the qemu stop hangs
>> are gone, too.
>>
>> Obviously, there is something broken with the new UDP handling. Could
>> you please analyze this problem? I could test some more patches ... .
>
> Any chance your VMs were live migrated from pre-4.14 host kernel?

 No - the VMs are not live migrated. They are always running on the same
 host - either with kernel < 4.14 or with kernel 4.14.x.
>>>
>>> This is disturbing... unless I'm mistaken, it shouldn't be possible to
>>> have UFO enabled on a virtio device in a VM booted on a host with 4.14
>>> kernel.
>>
>> Indeed. When working on that revert patch I verified that UFO in
>> the guest virtio_net was off before the revert patch, on after.
>>
>> Qemu should check host support with tap_probe_has_ufo
>> before advertising support to the guest. Indeed, this is exactly
>> what broke live migration in virtio_net_load_device at
>>
>> if (qemu_get_byte(f) && !peer_has_ufo(n)) {
>> error_report("virtio-net: saved image requires TUN_F_UFO support");
>> return -1;
>> }
>>
>> Which follows
>>
>>peer_has_ufo
>>  qemu_has_ufo
>>tap_has_ufo
>>  s->has_ufo
>>
>> where s->has_ufo was set by tap_probe_has_ufo in net_tap_fd_init.
>>
>> Now, checking my qemu git branch, I ran pretty old 2.7.0-rc3. But this
>> codepath does not seem to have changed between then and 2.10.1.
>>
>> I cherry-picked the revert onto 4.14.3. It did not apply cleanly, but the
>> fix-up wasn't too hard. Compiled and booted, but untested otherwise. At
>>
>>   https://github.com/wdebruij/linux/commits/v4.14.3-aargh-ufo
> 
> I'm just running it at the moment. I didn't face any network hang until
> now - although the critical UDP packages have been gone through.
> Therefore: looks nice.

Well, the patch does not fix hanging VMs, which have been shutdown and
can't be killed any more.
Because of the stack trace

[] vhost_net_ubuf_put_and_wait+0x35/0x60 [vhost_net]
[] vhost_net_ioctl+0x304/0x870 [vhost_net]
[] do_vfs_ioctl+0x8f/0x5c0
[] SyS_ioctl+0x74/0x80
[] do_syscall_64+0x5b/0x100
[] entry_SYSCALL64_slow_path+0x25/0x25
[] 0x

I was hoping, that the problems could be related - but that seems not to
be true.

Does anybody have any idea what happened here and how to analyze / fix it?


Thanks,
Andreas


Re: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread 'Marcelo Ricardo Leitner'
On Fri, Dec 08, 2017 at 11:08:56AM -0500, Neil Horman wrote:
> On Fri, Dec 08, 2017 at 04:04:58PM +, David Laight wrote:
> > From: Marcelo Ricardo Leitner
> > > Sent: 08 December 2017 16:00
> > ...
> > > > Is it worth replacing the si struct with an index/enum value, and 
> > > > indexing an
> > > > array of method pointer structs?  That would save you at least one 
> > > > dereference.
> > > 
> > > Hmmm, maybe, yes. It would be like
> > > sctp_stream_interleave[asoc->stream.si].make_datafrag(...)
> > 
> > If you only expect 2 choices then an if () is likely
> > to produce better code that the above.
> > 
> > The actual implementation can be hidden inside a #define
> > or static inline function.
> > 
> Thats the real question though, will we expect more than two interleaving
> strategies?  Currently its a boolean operation so the answer seems like yes, 
> but
> is there a possiblity of a biased interleaving, or other worthwhile algorithm?

For the chunk format, I don't think so. It would require another RFC
update.
For other possibilities on having a 3rd choice in there, I also don't
think so. Stream scheduling is handled apart from it and rx buffer
stuff is being covered by it now, don't see how we could have a 3rd
option without another chunk format.

But that said, I don't think the macro or even inline wrappers are as
clear as the struct with function pointers here and in some cases it
even won't avoid the second deref. I wouldn't like to compromise code
readability and OO because of 1 fetch, specially considering that this
will be barely noticeable.

Neil's idea on using the array of structs and indexing on it is nice.
Saves a deref and keeps the abstractions without inserting too much
noise on it. Plus, it also allows replacing the struct pointer in
sctp_stream with a bit. If then placed before the union in there, we
can save up to the whole pointer. Looks like a good compromise to me.

  Marcelo


Re: NFS corruption, fixed by echo 1 > /proc/sys/vm/drop_caches -- next debugging steps?

2017-12-08 Thread Matt Turner
On Fri, Dec 8, 2017 at 5:52 AM, Eric Dumazet  wrote:
> On Fri, 2017-12-08 at 05:42 -0800, Eric Dumazet wrote:
>> On Thu, Dec 7, 2017 at 11:54 PM, Matt Turner 
>> wrote:
>> > On Thu, Dec 7, 2017 at 11:00 PM, Matt Turner 
>> > wrote:
>> > > On Sun, Mar 12, 2017 at 6:43 PM, Matt Turner 
>> > > wrote:
>> > > > On a Broadcom BCM91250a MIPS system I can reliably trigger NFS
>> > > > corruption on the first file read.
>> > > >
>> > > > To demonstrate, I downloaded five identical copies of the gcc-
>> > > > 5.4.0
>> > > > source tarball. On the NFS server, they hash to the same value:
>> > > >
>> > > > server distfiles # md5sum gcc-5.4.0.tar.bz2*
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.1
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.2
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.3
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.4
>> > > >
>> > > > On the MIPS system (the NFS client):
>> > > >
>> > > > bcm91250a-le distfiles # md5sum gcc-5.4.0.tar.bz2.2
>> > > > 35346975989954df8a8db2b034da610d  gcc-5.4.0.tar.bz2.2
>> > > > bcm91250a-le distfiles # md5sum gcc-5.4.0.tar.bz2*
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.1
>> > > > 35346975989954df8a8db2b034da610d  gcc-5.4.0.tar.bz2.2
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.3
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.4
>> > > >
>> > > > The first file read will contain some corruption, and it is
>> > > > persistent until...
>> > > >
>> > > > bcm91250a-le distfiles # echo 1 > /proc/sys/vm/drop_caches
>> > > > bcm91250a-le distfiles # md5sum gcc-5.4.0.tar.bz2*
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.1
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.2
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.3
>> > > > 4c626ac2a83ef30dfb9260e6f59c2b30  gcc-5.4.0.tar.bz2.4
>> > > >
>> > > > the caches are dropped, at which point it reads back properly.
>> > > >
>> > > > Note that the corruption is different across reboots, both in
>> > > > the size
>> > > > of the corruption and the location. I saw 1900~ and 1400~ byte
>> > > > sequences corrupted on separate occasions, which don't
>> > > > correspond to
>> > > > the system's 16kB page size.
>> > > >
>> > > > I've tested kernels from v3.19 to 4.11-rc1+ (master branch from
>> > > > today). All exhibit this behavior with differing frequencies.
>> > > > Earlier
>> > > > kernels seem to reproduce the issue less often, while more
>> > > > recent
>> > > > kernels reliably exhibit the problem every boot.
>> > > >
>> > > > How can I further debug this?
>> > >
>> > > I think I was wrong about the statement about kernels v3.19 to
>> > > 4.11-rc1+. I found out I couldn't reproduce with 4.7-rc1 and then
>> > > bisected to 4cd13c21b207e80ddb1144c576500098f2d5f882 ("softirq:
>> > > Let
>> > > ksoftirqd do its job"). Still reproduces with current tip of
>> > > Linus'
>> > > tree.
>> > >
>> > > Any ideas? The board's ethernet is an uncommon device supported
>> > > by
>> > > CONFIG_SB1250_MAC. Something about the ethernet driver maybe?
>> >
>> > With the patch reverted on master (reverts cleanly), NFS corruption
>> > no
>> > longer happens.
>>
>> Hi Matt.
>>
>> Thanks for bisecting.
>>
>> Patch simply exposes an existing bug more often by changing the way
>> driver functions are scheduled.
>>
>> Which is probably a good thing.
>>
>> sbmac_intr() looks extremely suspicious to me.
>>
>> A NAPI driver hard interrupt should simply schedule NAPI.
>>
>> Apparently, if sbmac_intr() can not grab NAPIF_STATE_SCHED bit, it
>> directly calls sbdma_rx_process() from
>> hard interrupt context.
>>
>> Insane really.
>
> Please try this fix (not compiled on my x86 laptop, and I had no coffee
> yet, so it might have some trivial errors)

Thanks for the quick reply!

I tried the patch on top of master, but unfortunately the corruption
still occurs.


Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected

2017-12-08 Thread Andreas Hartmann
On 12/08/2017 at 05:04 PM Willem de Bruijn wrote:
> On Fri, Dec 8, 2017 at 6:40 AM, Michal Kubecek  wrote:
>> On Fri, Dec 08, 2017 at 11:31:50AM +0100, Andreas Hartmann wrote:
>>> On 12/08/2017 at 09:47 AM Michal Kubecek wrote:
 On Fri, Dec 08, 2017 at 08:21:16AM +0100, Andreas Hartmann wrote:
>
> All my VMs are using virtio_net. BTW: I couldn't see the problems
> (sometimes, the VM couldn't be stopped at all) if all my VMs are using
> e1000 as interface instead.
>
> This finding now matches pretty much the responsible UDP-package which
> caused the stall. I already mentioned it here [2].
>
> To prove it, I reverted from the patch series "[PATCH v2 RFC 0/13]
> Remove UDP Fragmentation Offload support" [3]
>
> 11/13 [v2,RFC,11/13] net: Remove all references to SKB_GSO_UDP. [4]
> 12/13 [v2,RFC,12/13] inet: Remove software UFO fragmenting code. [5]
> 13/13 [v2,RFC,13/13] net: Kill NETIF_F_UFO and SKB_GSO_UDP. [6]
>
> and applied it to Linux 4.14.4. It compiled fine and is running fine.
> The vnet doesn't die anymore. Yet, I can't say if the qemu stop hangs
> are gone, too.
>
> Obviously, there is something broken with the new UDP handling. Could
> you please analyze this problem? I could test some more patches ... .

 Any chance your VMs were live migrated from pre-4.14 host kernel?
>>>
>>> No - the VMs are not live migrated. They are always running on the same
>>> host - either with kernel < 4.14 or with kernel 4.14.x.
>>
>> This is disturbing... unless I'm mistaken, it shouldn't be possible to
>> have UFO enabled on a virtio device in a VM booted on a host with 4.14
>> kernel.
> 
> Indeed. When working on that revert patch I verified that UFO in
> the guest virtio_net was off before the revert patch, on after.
> 
> Qemu should check host support with tap_probe_has_ufo
> before advertising support to the guest. Indeed, this is exactly
> what broke live migration in virtio_net_load_device at
> 
> if (qemu_get_byte(f) && !peer_has_ufo(n)) {
> error_report("virtio-net: saved image requires TUN_F_UFO support");
> return -1;
> }
> 
> Which follows
> 
>peer_has_ufo
>  qemu_has_ufo
>tap_has_ufo
>  s->has_ufo
> 
> where s->has_ufo was set by tap_probe_has_ufo in net_tap_fd_init.
> 
> Now, checking my qemu git branch, I ran pretty old 2.7.0-rc3. But this
> codepath does not seem to have changed between then and 2.10.1.
> 
> I cherry-picked the revert onto 4.14.3. It did not apply cleanly, but the
> fix-up wasn't too hard. Compiled and booted, but untested otherwise. At
> 
>   https://github.com/wdebruij/linux/commits/v4.14.3-aargh-ufo

I'm just running it at the moment. I didn't face any network hang until
now - although the critical UDP packages have been gone through.
Therefore: looks nice.


Thanks,
Andreas


[GIT] Networking

2017-12-08 Thread David Miller

1) CAN fixes from Martin Kelly (cancel URBs properly in all the CAN
   usb drivers).

2) Revert returning -EEXIST from __dev_alloc_name() as this propagates
   to userspace and broke some apps.  From Johannes Berg.

3) Fix conn memory leaks and crashes in TIPC, from Jon Malloc and Cong
   Wang.

4) Gianfar MAC can't do EEE so don't advertise it by default, from
   Claudiu Manoil.

5) Relax strict netlink attribute validation, but emit a warning.
   From David Ahern.

6) Fix regression in checksum offload of thunderx driver, from Florian
   Westphal.

7) Fix UAPI bpf issues on s390, from Hendrik Brueckner.

8) New card support in iwlwifi, from Ihab Zhaika.

9) BBR congestion control bug fixes from Neal Cardwell.

10) Fix port stats in nfp driver, from Pieter Jansen van Vuuren.

11) Fix leaks in qualcomm rmnet, from Subash Abhinov Kasiviswanathan.

12) Fix DMA API handling in sh_eth driver, from Thomas Petazzoni.

13) Fix spurious netpoll warnings in bnxt_en, from Calvin Owens.

Please pull, thanks a lot!

The following changes since commit 2391f0b4808e3d5af348324d69f5f45c56a26836:

  Merge tag 'for_linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost (2017-12-04 11:32:02 
-0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to fd29117aeb905aaacdf4ff5afbc7787fa50e16e4:

  Merge tag 'linux-can-fixes-for-4.15-20171208' of 
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can (2017-12-08 
14:53:54 -0500)


Al Viro (3):
  fix kcm_clone()
  socketpair(): allocate descriptors first
  make sock_alloc_file() do sock_release() on failures

Andrew Lunn (2):
  net: dsa: mv88e6xxx: Fix interrupt masking on removal
  net: dsa: mv88e6xxx: Unregister MDIO bus on error path

Andy Shevchenko (1):
  brcmfmac: Avoid build error with make W=1

Antoine Tenart (1):
  net: mvpp2: fix the RSS table entry offset

Arend Van Spriel (1):
  brcmfmac: change driver unbind order of the sdio function devices

Bert Kenward (1):
  sfc: pass valid pointers from efx_enqueue_unwind

Bjørn Mork (1):
  usbnet: fix alignment for frames with no ethernet header

Branislav Radocaj (1):
  net: ethernet: arc: fix error handling in emac_rockchip_probe

Calvin Owens (1):
  bnxt_en: Fix sources of spurious netpoll warnings

Chris Dion (1):
  net_sched: use macvlan real dev trans_start in dev_trans_start()

Claudiu Manoil (1):
  gianfar: Disable EEE autoneg by default

Cong Wang (1):
  tipc: fix a null pointer deref on error path

Daniel Borkmann (1):
  Merge branch 'bpf-fix-broken-uapi-for-pt-regs'

David Ahern (1):
  netlink: Relax attr validation for fixed length types

David S. Miller (9):
  Merge branch 'RED-qdisc-fixes'
  Merge branch 'sh_eth-dma-mapping-fixes'
  Merge branch 'rmnet-Fix-leaks-in-failure-scenarios'
  Merge branch 'mv88e6xxx-error-patch-fixes'
  Merge git://git.kernel.org/.../bpf/bpf
  Merge branch 'tcp-bbr-sampling-fixes'
  Merge branch 'tcp-RACK-loss-recovery-bug-fixes'
  Merge tag 'wireless-drivers-for-davem-2017-12-08' of 
git://git.kernel.org/.../kvalo/wireless-drivers
  Merge tag 'linux-can-fixes-for-4.15-20171208' of 
git://git.kernel.org/.../mkl/linux-can

David Spinadel (1):
  iwlwifi: mvm: enable RX offloading with TKIP and WEP

Emmanuel Grumbach (3):
  iwlwifi: mvm: don't use transmit queue hang detection when it is not 
possible
  iwlwifi: mvm: fix the TX queue hang timeout for MONITOR vif type
  iwlwifi: mvm: fix packet injection

Eric Dumazet (3):
  Revert "tcp: must block bh in __inet_twsk_hashdance()"
  net: remove hlist_nulls_add_tail_rcu()
  tcp: use current time in tcp_rcv_space_adjust()

Florian Westphal (1):
  net: thunderx: Fix TCP/UDP checksum offload for IPv4 pkts

Hendrik Brueckner (6):
  bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type
  s390/bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type
  arm64/bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type
  s390/uapi: correct whitespace & coding style in asm/ptrace.h
  selftests/bpf: sync kernel headers and introduce arch support in Makefile
  perf s390: add regs_query_register_offset()

Håkon Bugge (1):
  rds: Fix NULL pointer dereference in __rds_rdma_map

Ihab Zhaika (1):
  iwlwifi: add new cards for 9260 and 22000 series

Joe Perches (1):
  xen-netback: Fix logging message with spurious period after newline

Johannes Berg (2):
  iwlwifi: mvm: flush queue before deleting ROC
  Revert "net: core: maybe return -EEXIST in __dev_alloc_name"

Jon Maloy (1):
  tipc: fix memory leak in tipc_accept_from_sock()

Kalle Valo (2):
  Merge tag 'iwlwifi-for-kalle-2017-11-28' of 
git://git.kernel.org/.../iwlwifi/iwlwifi-fixes
  Merge tag 'iwlw

Re: [PATCH v5 0/6] enable creating [k,u]probe with perf_event_open

2017-12-08 Thread Daniel Borkmann
On 12/06/2017 11:45 PM, Song Liu wrote:
> Changes PATCH v4 to PATCH v5:
>   Remove PERF_PROBE_CONFIG_IS_RETPROBE from uapi, use PMU_FORMAT_ATTR
>   instead.
> 
> Changes PATCH v3 to PATCH v4:
>   Remove uapi define MAX_PROBE_FUNC_NAME_LEN, use KSYM_NAME_LEN instead.
>   Add flag PERF_PROBE_CONFIG_IS_RETPROBE for config field of [k,u]probe.
>   Optimize ifdef's of CONFIG_KPROBE_EVENTS and CONFIG_UPROBE_EVENTS.
>   Optimize checks in perf_event_is_tracing().
>   Optimize perf_tp_register().
> 
> Changes PATCH v2 to PATCH v3:
>   Remove fixed type PERF_TYPE_KPROBE and PERF_TYPE_UPROBE, use dynamic
>   type instead.
>   Update userspace (samples/bpf, bcc) to look up type from sysfs.
>   Change License info in test_many_kprobe_user.c as Philippe Ombredanne
>   suggested.
> 
> Changes PATCH v1 to PATCH v2:
>   Split PERF_TYPE_PROBE into PERF_TYPE_KPROBE and PERF_TYPE_UPROBE.
>   Split perf_probe into perf_kprobe and perf_uprobe.
>   Remove struct probe_desc, use config1 and config2 instead.
> 
> Changes RFC v2 to PATCH v1:
>   Check type PERF_TYPE_PROBE in perf_event_set_filter().
>   Rebase on to tip perf/core.
> 
> Changes RFC v1 to RFC v2:
>   Fix build issue reported by kbuild test bot by adding ifdef of
>   CONFIG_KPROBE_EVENTS, and CONFIG_UPROBE_EVENTS.
> 
> RFC v1 cover letter:
> 
> This is to follow up the discussion over "new kprobe api" at Linux
> Plumbers 2017:
> 
> https://www.linuxplumbersconf.org/2017/ocw/proposals/4808
> 
> With current kernel, user space tools can only create/destroy [k,u]probes
> with a text-based API (kprobe_events and uprobe_events in tracefs). This
> approach relies on user space to clean up the [k,u]probe after using them.
> However, this is not easy for user space to clean up properly.
> 
> To solve this problem, we introduce a file descriptor based API.
> Specifically, we extended perf_event_open to create [k,u]probe, and attach
> this [k,u]probe to the file descriptor created by perf_event_open. These
> [k,u]probe are associated with this file descriptor, so they are not
> available in tracefs.
> 
> We reuse large portion of existing trace_kprobe and trace_uprobe code.
> Currently, the file descriptor API does not support arguments as the
> text-based API does. This should not be a problem, as user of the file
> decriptor based API read data through other methods (bpf, etc.).
> 
> I also include a patch to to bcc, and a patch to man-page perf_even_open.
> Please see the list below. A fork of bcc with this patch is also available
> on github:
> 
>   https://github.com/liu-song-6/bcc/tree/perf_event_open

Peter / Stephen, I presume this will be routed through one of you, if
not please yell.

Thanks,
Daniel

> man-pages patch:
>   perf_event_open.2: add type kprobe and uprobe
> 
> bcc patch:
>   bcc: Try use new API to create [k,u]probe with perf_event_open
> 
> kernel patches:
> 
> Song Liu (6):
>   perf: prepare perf_event.h for new types perf_kprobe and perf_uprobe
>   perf: copy new perf_event.h to tools/include/uapi
>   perf: implement pmu perf_kprobe
>   perf: implement pmu perf_uprobe
>   bpf: add option for bpf_load.c to use perf_kprobe
>   bpf: add new test test_many_kprobe
> 
>  include/linux/trace_events.h  |   8 ++
>  include/uapi/linux/perf_event.h   |   4 +
>  kernel/events/core.c  | 131 +++-
>  kernel/trace/trace_event_perf.c   | 102 +++
>  kernel/trace/trace_kprobe.c   |  91 +++--
>  kernel/trace/trace_probe.h|  11 ++
>  kernel/trace/trace_uprobe.c   |  86 ++--
>  samples/bpf/Makefile  |   3 +
>  samples/bpf/bpf_load.c|  66 ++--
>  samples/bpf/bpf_load.h|  14 +++
>  samples/bpf/test_many_kprobe_user.c   | 186 
> ++
>  tools/include/uapi/linux/perf_event.h |   4 +
>  12 files changed, 677 insertions(+), 29 deletions(-)
>  create mode 100644 samples/bpf/test_many_kprobe_user.c
> 
> --
> 2.9.5
> 



Re: pull-request: can 2017-12-08

2017-12-08 Thread David Miller
From: Marc Kleine-Budde 
Date: Fri,  8 Dec 2017 16:58:11 +0100

> this is a pull request of 6 patches for net/master.
> 
> Martin Kelly provides 5 patches for various USB based CAN drivers, that
> properly cancel the URBs on adapter unplug, so that the driver doesn't
> end up in an endless loop. Stephane Grosjean provides a patch to restart
> the tx queue if zero length packages are transmitted.

Pulled, thanks Marc.


Re: [PATCH net-next] macvlan: fix memory hole in macvlan_dev

2017-12-08 Thread David Miller
From: Girish Moodalbail 
Date: Fri,  8 Dec 2017 06:03:26 -0800

> Move 'macaddr_count' from after 'netpoll' to after 'nest_level' to pack
> and reduce a memory hole.
> 
> Fixes: 88ca59d1aaf28c25 (macvlan: remove unused fields in struct macvlan_dev)
> Signed-off-by: Girish Moodalbail 

Applied, thanks.


Re: [PATCH net-next 0/2] tools: bpftool: clean up and extend Makefiles

2017-12-08 Thread Daniel Borkmann
On 12/08/2017 12:00 AM, Jakub Kicinski wrote:
> Hi!
> 
> This is a follow up to a series of Makefile fixes for bpftool from
> two weeks ago.  I think there will have to be a merge back of net-next
> into bpf-next (or rebase), AFAICT the previous series arrived in
> net-next already, but not in bpf-next.  I hope that makes sense.
> FWIW this should not conflict with Roman's cgroup work.
> 
> Quentin says:
> 
> First patch of this series cleans up the two Makefiles (Makefile and
> Documentation/Makefile) and make their contents more consistent.
> 
> The second one add "uninstall" and "doc-uninstall" targets, to remove files
> previously installed on the system with "install" and "doc-install"
> targets.

Series applied to bpf-next, thanks guys!


Re: [PATCH] slip: sl_alloc(): remove unused parameter "dev_t line"

2017-12-08 Thread David Miller
From: Marc Kleine-Budde 
Date: Fri,  8 Dec 2017 12:18:59 +0100

> The first and only parameter of sl_alloc() is unused, so remove it.
> 
> Fixes: 5342b77c4123 slip: ("Clean up create and destroy")
> Signed-off-by: Marc Kleine-Budde 

Applied.

While reviewing this I noticed that slip_devs[], along with the
arbitrary slip_maxdev limit, can be removed and replaced with
a simply linked list.

All the array is used for it proper teardown during module unload.
This is made even more clear by the fact that the index stored
in dev->base_addr is never ever used.


Re: [PATCH net] net: mvpp2: fix the RSS table entry offset

2017-12-08 Thread David Miller
From: Antoine Tenart 
Date: Fri,  8 Dec 2017 10:24:20 +0100

> The macro used to access or set an RSS table entry was using an offset
> of 8, while it should use an offset of 0. This lead to wrongly configure
> the RSS table, not accessing the right entries.
> 
> Fixes: 1d7d15d79fb4 ("net: mvpp2: initialize the RSS tables")
> Signed-off-by: Antoine Tenart 

Applied.


Re: pull-request: bpf-next 2017-12-07

2017-12-08 Thread David Miller
From: Alexei Starovoitov 
Date: Thu, 7 Dec 2017 22:22:11 -0800

> The following pull-request contains BPF updates for your net-next tree.
> 
> The main changes are:
> 
> 1) Detailed documentation of BPF development process from Daniel.
> 
> 2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops
>from Lawrence.
> 
> 3) Minor follow up for bpf_skb_set_tunnel_key() from William.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Pulled, thanks Alexei.


Re: [PATCH net-next v2 0/6] cxgb4: collect hardware logs via ethtool

2017-12-08 Thread David Miller
From: Rahul Lakkireddy 
Date: Fri,  8 Dec 2017 09:48:35 +0530

> Collect more hardware logs via ethtool --get-dump facility.
> 
> Patch 1 collects on-chip memory layout information.
> 
> Patch 2 collects on-chip MC memory dumps.
> 
> Patch 3 collects HMA memory dump.
> 
> Patch 4 evaluates and skips TX and RX payload regions in memory dumps.
> 
> Patch 5 collects egress and ingress SGE queue contexts.
> 
> Patch 6 collects PCIe configuration logs
 ...
> ---
> v2:
> - Fix uninitialized variable "size" build warning in Patch 1.

That looks better, series applied, thanks.


Re: [PATCH v2 1/3] Bluetooth: hci_ll: add support for setting public address

2017-12-08 Thread Marcel Holtmann
Hi David,

>>> This adds support for setting the public address on Texas Instruments
>>> Bluetooth chips using a vendor-specific command.
>>> 
>>> This has been tested on a CC2560A. The TI wiki also indicates that this
>>> command should work on TI WL17xx/WL18xx Bluetooth chips.
>>> 
>>> Signed-off-by: David Lechner 
>>> ---
>>> 
>>> v2 changes:
>>> * This is a new patch in v2
>>> 
>>> drivers/bluetooth/hci_ll.c | 17 +
>>> 1 file changed, 17 insertions(+)
>>> 
>>> diff --git a/drivers/bluetooth/hci_ll.c b/drivers/bluetooth/hci_ll.c
>>> index 974a788..b732004 100644
>>> --- a/drivers/bluetooth/hci_ll.c
>>> +++ b/drivers/bluetooth/hci_ll.c
>>> @@ -57,6 +57,7 @@
>>> #include "hci_uart.h"
>>> 
>>> /* Vendor-specific HCI commands */
>>> +#define HCI_VS_WRITE_BD_ADDR   0xfc06
>>> #define HCI_VS_UPDATE_UART_HCI_BAUDRATE 0xff36
>>> 
>>> /* HCILL commands */
>>> @@ -662,6 +663,20 @@ static int download_firmware(struct ll_device *lldev)
>>> return err;
>>> }
>>> 
>>> +static int ll_set_bdaddr(struct hci_dev *hdev, const bdaddr_t *bdaddr)
>>> +{
>>> +   bdaddr_t bdaddr_swapped;
>>> +   struct sk_buff *skb;
>>> +
>>> +   baswap(_swapped, bdaddr);
>>> +   skb = __hci_cmd_sync(hdev, HCI_VS_WRITE_BD_ADDR, sizeof(bdaddr_t),
>>> +_swapped, HCI_INIT_TIMEOUT);
>>> +   if (!IS_ERR(skb))
>>> +   kfree_skb(skb);
>>> +   
>> You have a trailing whitespace here.
>> Does the HCI command really expect the BD_ADDR in the swapped order. The 
>> caller of hdev->set_bdaddr while provide it in the same order as the HCI 
>> Read BD Address command and everything in HCI. So it seems odd that you have 
>> to swap it for the vendor command.
>> So have you actually tested this with btmgmt public-add  and 
>> checked that the address comes out correctly. I think ll_set_bdaddr should 
>> function correctly for the mgmt interface. And if needed any other caller 
>> outside of mgmt has to do the swapping.
> 
> I did test using btmgmt public-address 00:11:22:33:44:55, which is how I 
> found out that the order needed to be swapped. Like you, I was surprised. I 
> couldn't find any documentation from TI saying what the order is supposed to 
> be, so I can only assume that because this works, it is indeed correct as-is.

then please add a comment for that and I would appreciate to have the parts 
from btmon showing the public-addr command and the following HCI Read BD 
Address command as part of the commit message. Just for being able to dig this 
out at some later point if needed.

Regards

Marcel



Re: [PATCH net-next v3 0/2] veth and GSO maximums

2017-12-08 Thread David Miller
From: Stephen Hemminger 
Date: Thu,  7 Dec 2017 15:40:18 -0800

> This is the more general way to solving the issue of GSO limits
> not being set correctly for containers on Azure. If a GSO packet
> is sent to host that exceeds the limit (reported by NDIS), then
> the host is forced to do segmentation in software which has noticeable
> performance impact.
> 
> The core rtnetlink infrastructure already has the messages and
> infrastructure to allow changing gso limits. With an updated iproute2
> the following already works:
>   # ip li set dev dummy0 gso_max_size 3
> 
> These patches are about making it easier with veth.

Ok, this is definitely a step forward.

Series applied, thanks Stephen.


Re: [PATCH v2 1/3] Bluetooth: hci_ll: add support for setting public address

2017-12-08 Thread David Lechner

On 12/08/2017 02:07 AM, Marcel Holtmann wrote:

Hi David,


This adds support for setting the public address on Texas Instruments
Bluetooth chips using a vendor-specific command.

This has been tested on a CC2560A. The TI wiki also indicates that this
command should work on TI WL17xx/WL18xx Bluetooth chips.

Signed-off-by: David Lechner 
---

v2 changes:
* This is a new patch in v2

drivers/bluetooth/hci_ll.c | 17 +
1 file changed, 17 insertions(+)

diff --git a/drivers/bluetooth/hci_ll.c b/drivers/bluetooth/hci_ll.c
index 974a788..b732004 100644
--- a/drivers/bluetooth/hci_ll.c
+++ b/drivers/bluetooth/hci_ll.c
@@ -57,6 +57,7 @@
#include "hci_uart.h"

/* Vendor-specific HCI commands */
+#define HCI_VS_WRITE_BD_ADDR   0xfc06
#define HCI_VS_UPDATE_UART_HCI_BAUDRATE 0xff36

/* HCILL commands */
@@ -662,6 +663,20 @@ static int download_firmware(struct ll_device *lldev)
return err;
}

+static int ll_set_bdaddr(struct hci_dev *hdev, const bdaddr_t *bdaddr)
+{
+   bdaddr_t bdaddr_swapped;
+   struct sk_buff *skb;
+
+   baswap(_swapped, bdaddr);
+   skb = __hci_cmd_sync(hdev, HCI_VS_WRITE_BD_ADDR, sizeof(bdaddr_t),
+_swapped, HCI_INIT_TIMEOUT);
+   if (!IS_ERR(skb))
+   kfree_skb(skb);
+   


You have a trailing whitespace here.

Does the HCI command really expect the BD_ADDR in the swapped order. The caller of 
hdev->set_bdaddr while provide it in the same order as the HCI Read BD Address 
command and everything in HCI. So it seems odd that you have to swap it for the 
vendor command.

So have you actually tested this with btmgmt public-add  and checked 
that the address comes out correctly. I think ll_set_bdaddr should function correctly 
for the mgmt interface. And if needed any other caller outside of mgmt has to do the 
swapping.



I did test using btmgmt public-address 00:11:22:33:44:55, which is how I 
found out that the order needed to be swapped. Like you, I was 
surprised. I couldn't find any documentation from TI saying what the 
order is supposed to be, so I can only assume that because this works, 
it is indeed correct as-is.


Re: [PATH net 0/4] tcp: RACK loss recovery bug fixes

2017-12-08 Thread David Miller
From: Yuchung Cheng 
Date: Thu,  7 Dec 2017 11:33:29 -0800

> This patch set has four minor bug fixes in TCP RACK loss recovery.

Series applied, thank you.


Re: [PATCH net-next] net: dsa: lan9303: Protect ALR operations with mutex

2017-12-08 Thread David Miller
From: Egil Hjelmeland 
Date: Thu,  7 Dec 2017 19:56:04 +0100

> ALR table operations are a sequence of related register operations which
> should be protected from concurrent access. The alr_cache should also be
> protected. Add alr_mutex doing that.
> 
> Signed-off-by: Egil Hjelmeland 

Applied.


Re: [patch net-next] net: sched: fix use-after-free in tcf_block_put_ext

2017-12-08 Thread David Miller
From: Jiri Pirko 
Date: Fri,  8 Dec 2017 19:27:27 +0100

> From: Jiri Pirko 
> 
> Since the block is freed with last chain being put, once we reach the
> end of iteration of list_for_each_entry_safe, the block may be
> already freed. I'm hitting this only by creating and deleting clsact:
 ...
> Fix this by holding the block also by chain 0 and put chain 0
> explicitly, out of the list_for_each_entry_safe loop at the very
> end of tcf_block_put_ext.
> 
> Fixes: efbf78973978 ("net_sched: get rid of rcu_barrier() in 
> tcf_block_put_ext()")
> Signed-off-by: Jiri Pirko 

Applied, thanks Jiri.


Re: [PATCH] bnxt_en: Fix sources of spurious netpoll warnings

2017-12-08 Thread David Miller
From: Michael Chan 
Date: Fri, 8 Dec 2017 09:35:42 -0800

> On Fri, Dec 8, 2017 at 9:05 AM, Calvin Owens  wrote:
>> After applying 2270bc5da3497945 ("bnxt_en: Fix netpoll handling") and
>> 903649e718f80da2 ("bnxt_en: Improve -ENOMEM logic in NAPI poll loop."),
>> we still see the following WARN fire:
 ...
>> This happens because we increment rx_pkts on -ENOMEM and -EIO, resulting
>> in rx_pkts > 0. Fix this by only bumping rx_pkts if we were actually
>> given a non-zero budget.
>>
>> Signed-off-by: Calvin Owens 
> 
> Thanks.
> Acked-by: Michael Chan 

Applied and queued up for -stable.


Re: [PATCH net-next 5/6] net: qualcomm: rmnet: Allow to configure flags for new devices

2017-12-08 Thread Subash Abhinov Kasiviswanathan

On 2017-12-08 09:26, Dan Williams wrote:

On Sun, 2017-12-03 at 23:37 -0700, Subash Abhinov Kasiviswanathan
wrote:

Add an option to configure the rmnet aggregation and command features
on device creation. This is achieved by using the vlan flags option.


Does this overload the VLAN flags item with different meanings than
VLAN_FLAG_* that are specific to rmnet?

Dan


Hi Dan

Yes, that's right. I am using it to configure rmnet specific features.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [net-next PATCH 00/14] lockless qdisc series

2017-12-08 Thread David Miller
From: John Fastabend 
Date: Thu, 07 Dec 2017 09:53:46 -0800

> This series adds support for building lockless qdiscs.

Series applied, nice work.

I particularly like how you solved the GSO and bad_tx requeueing.

Thanks!



Re: [iproute2] ss: print tcpi_rcv_ssthresh

2017-12-08 Thread Stephen Hemminger
On Thu,  7 Dec 2017 16:12:00 -0800
Wei Wang  wrote:

> From: Wei Wang 
> 
> tcpi_rcv_ssthresh is an important stats when debugging receive side
> behavior.
> Add it to the ss output.
> 
> Signed-off-by: Wei Wang 
> Signed-off-by: Eric Dumazet 

Applied


Re: [PATCH net 0/3] TCP BBR sampling fixes for loss recovery undo

2017-12-08 Thread David Miller
From: Neal Cardwell 
Date: Thu,  7 Dec 2017 12:43:29 -0500

> This patch series has a few minor bug fixes for cases where spurious
> loss recoveries can trick BBR estimators into estimating that the
> available bandwidth is much lower than the true available bandwidth.
> In both cases the fix here is to just reset the estimator upon loss
> recovery undo.

Series applied and queued up for -stable, thanks Neal.


[patch net-next] net: sched: fix use-after-free in tcf_block_put_ext

2017-12-08 Thread Jiri Pirko
From: Jiri Pirko 

Since the block is freed with last chain being put, once we reach the
end of iteration of list_for_each_entry_safe, the block may be
already freed. I'm hitting this only by creating and deleting clsact:

[  202.171952] 
==
[  202.180182] BUG: KASAN: use-after-free in tcf_block_put_ext+0x240/0x390
[  202.187590] Read of size 8 at addr 880225539a80 by task tc/796
[  202.194508]
[  202.196185] CPU: 0 PID: 796 Comm: tc Not tainted 4.15.0-rc2jiri+ #5
[  202.203200] Hardware name: Mellanox Technologies Ltd. 
"MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
[  202.213613] Call Trace:
[  202.216369]  dump_stack+0xda/0x169
[  202.220192]  ? dma_virt_map_sg+0x147/0x147
[  202.224790]  ? show_regs_print_info+0x54/0x54
[  202.229691]  ? tcf_chain_destroy+0x1dc/0x250
[  202.234494]  print_address_description+0x83/0x3d0
[  202.239781]  ? tcf_block_put_ext+0x240/0x390
[  202.244575]  kasan_report+0x1ba/0x460
[  202.248707]  ? tcf_block_put_ext+0x240/0x390
[  202.253518]  tcf_block_put_ext+0x240/0x390
[  202.258117]  ? tcf_chain_flush+0x290/0x290
[  202.262708]  ? qdisc_hash_del+0x82/0x1a0
[  202.267111]  ? qdisc_hash_add+0x50/0x50
[  202.271411]  ? __lock_is_held+0x5f/0x1a0
[  202.275843]  clsact_destroy+0x3d/0x80 [sch_ingress]
[  202.281323]  qdisc_destroy+0xcb/0x240
[  202.285445]  qdisc_graft+0x216/0x7b0
[  202.289497]  tc_get_qdisc+0x260/0x560

Fix this by holding the block also by chain 0 and put chain 0
explicitly, out of the list_for_each_entry_safe loop at the very
end of tcf_block_put_ext.

Fixes: efbf78973978 ("net_sched: get rid of rcu_barrier() in 
tcf_block_put_ext()")
Signed-off-by: Jiri Pirko 
---
 net/sched/cls_api.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index d51051d..5b9b8a6 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -342,23 +342,24 @@ void tcf_block_put_ext(struct tcf_block *block, struct 
Qdisc *q,
 {
struct tcf_chain *chain, *tmp;
 
-   /* Hold a refcnt for all chains, except 0, so that they don't disappear
+   /* Hold a refcnt for all chains, so that they don't disappear
 * while we are iterating.
 */
list_for_each_entry(chain, >chain_list, list)
-   if (chain->index)
-   tcf_chain_hold(chain);
+   tcf_chain_hold(chain);
 
list_for_each_entry(chain, >chain_list, list)
tcf_chain_flush(chain);
 
tcf_block_offload_unbind(block, q, ei);
 
-   /* At this point, all the chains should have refcnt >= 1. Block will be
-* freed after all chains are gone.
-*/
+   /* At this point, all the chains should have refcnt >= 1. */
list_for_each_entry_safe(chain, tmp, >chain_list, list)
tcf_chain_put(chain);
+
+   /* Finally, put chain 0 and allow block to be freed. */
+   chain = list_first_entry(>chain_list, struct tcf_chain, list);
+   tcf_chain_put(chain);
 }
 EXPORT_SYMBOL(tcf_block_put_ext);
 
-- 
2.9.5



Re: [PATCH net] sfc: pass valid pointers from efx_enqueue_unwind

2017-12-08 Thread David Miller
From: Bert Kenward 
Date: Thu, 7 Dec 2017 17:18:58 +

> The bytes_compl and pkts_compl pointers passed to efx_dequeue_buffers
> cannot be NULL. Add a paranoid warning to check this condition and fix
> the one case where they were NULL.
> 
> efx_enqueue_unwind() is called very rarely, during error handling.
> Without this fix it would fail with a NULL pointer dereference in
> efx_dequeue_buffer, with efx_enqueue_skb in the call stack.
> 
> Fixes: e9117e5099ea ("sfc: Firmware-Assisted TSO version 2")
> Reported-by: Jarod Wilson 
> Signed-off-by: Bert Kenward 

Applied and queued up for -stable, thanks.


Re: [PATCH net] gianfar: Disable EEE autoneg by default

2017-12-08 Thread David Miller
From: Claudiu Manoil 
Date: Thu, 7 Dec 2017 18:44:23 +0200

> This controller does not support EEE, but it may connect to a PHY
> which supports EEE and advertises EEE by default, while its link
> partner also advertises EEE. If this happens, the PHY enters low
> power mode when the traffic rate is low and causes packet loss.
> This patch disables EEE advertisement by default for any PHY that
> gianfar connects to, to prevent the above unwanted outcome.
> 
> Signed-off-by: Shaohui Xie 
> Tested-by: Yangbo Lu 
> Signed-off-by: Claudiu Manoil 

Applied, thank you.


Re: WireGuard Upstreaming Roadmap (November 2017)

2017-12-08 Thread Jason A. Donenfeld
Hi Dave,

On Fri, Dec 8, 2017 at 4:38 PM, David Miller  wrote:
> Sorry, you cannot force the discussion of a feature which will be submitted
> upstream to occur on a private mailing list.
>
> It is _ABSOLUTELY_ appropriate to discss this on netdev since it is the
> netdev community which must consider issues like this when looking at
> whether to accept WireGuard upstream.
>
> Jason, this action and response was entirely inappropriate, and please
> I'd like you to reply properly to questions about your feature here.

Whoops, okay! Very sorry. I'm actually kind of happy to hear that. I
had assumed that you'd be annoyed if WireGuard crypto discussion
spewed over into netdev adding even more message volume there for
something perhaps not directly relevant. But in fact, you're
interested and find it important to discuss there. So, good news. And
sorry for trying to shew it away before. I'll copy and paste the
response I had made on the other list:

> This is covered in the paper:
> https://www.wireguard.com/papers/wireguard.pdf
>
> The basic answer is that WireGuard has message type identifiers, and
> the handshake also hashes into it an identifier of the primitives
> used. If there's ever a problem with those primitives chosen, it will
> be possible to introduce new message type identifiers, if that kind of
> "support everything even the broken garbage" approach is desired by
> misguided people. However, a better approach, of course, is to keep
> your secure network separate from your insecure network, and to not
> allow insecure nodes on secure segments; when you mix the two,
> disaster tends to strike. So, in other words, both approaches to "upgrading"
> are possible, in this fantasy wireguardalypse. Take note, though, that
> neither one of these approaches (support new and retain old protocol
> too for old nodes, or only support new) are "agile" or are anything at
> all like the 90s "cipher agility" -- the user still is not permitted
> to "choose" ciphers.

Regards,
Jason


Re: [PATCH net-next] virtio_net: Disable interrupts if napi_complete_done rescheduled napi

2017-12-08 Thread David Miller
From: Jason Wang 
Date: Thu, 7 Dec 2017 15:08:45 +0800

> 
> 
> On 2017年12月07日 13:09, Michael S. Tsirkin wrote:
>> On Thu, Dec 07, 2017 at 01:15:15PM +0900, Toshiaki Makita wrote:
>>> Since commit 39e6c8208d7b ("net: solve a NAPI race") napi has been
>>> able
>>> to be rescheduled within napi_complete_done() even in non-busypoll
>>> case,
>>> but virtnet_poll() always enabled interrupts before complete, and when
>>> napi was rescheduled within napi_complete_done() it did not disable
>>> interrupts.
>>> This caused more interrupts when event idx is disabled.
>>>
>>> According to commit cbdadbbf0c79 ("virtio_net: fix race in RX VQ
>>> processing") we cannot place virtqueue_enable_cb_prepare() after
>>> NAPI_STATE_SCHED is cleared, so disable interrupts again if
>>> napi_complete_done() returned false.
>>>
>>> Tested with vhost-user of OVS 2.7 on host, which does not have the
>>> event
>>> idx feature.
...
>>> Signed-off-by: Toshiaki Makita 
>> Acked-by: Michael S. Tsirkin 
>>
>> it might make sense in net and not -next since tx napi regressed
>> performance
>> in some configs, this might bring it back at least partially.
>> Jason - what do you think?
> 
> No sure, the regression I saw was tested with event idx on. And
> virtqueue_disable_cb() does almost nothing for event idx (or even a
> little bit slower).
> 
> The patch looks good.
> 
> Acked-by: Jason Wang 

I'm going to put this into net-next for now.


Re: [PATCH] ptr_ring: add barriers

2017-12-08 Thread David Miller
From: "Michael S. Tsirkin" 
Date: Tue, 5 Dec 2017 21:29:37 +0200

> Users of ptr_ring expect that it's safe to give the
> data structure a pointer and have it be available
> to consumers, but that actually requires an smb_wmb
> or a stronger barrier.
> 
> In absence of such barriers and on architectures that reorder writes,
> consumer might read an un=initialized value from an skb pointer stored
> in the skb array.  This was observed causing crashes.
> 
> To fix, add memory barriers.  The barrier we use is a wmb, the
> assumption being that producers do not need to read the value so we do
> not need to order these reads.
> 
> Reported-by: George Cherian 
> Suggested-by: Jason Wang 
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> George, could you pls report whether this patch fixes
> the issue for you?
> 
> This seems to be needed in stable as well.

I really need some testing feedback for this before I apply it
and queue it up for -stable.

George?


Re: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread Xin Long
On Sat, Dec 9, 2017 at 1:29 AM, David Laight  wrote:
> From: Xin Long ...
>> Hi, David, Sorry,  I'm not sure we're worrying about the cpu cost or
>> codes style now ?
>>
>> For cpu cost,  I think 0x848(%r13) operation must be better than the
>> generated code of if-else.
>
> Nope - the call xxx(%ryyy) is likely to be a data cache miss and a complete
> cpu pipeline stall.
>
> The conditional will be a data cache hit and the code (for one branch)
> will be prefetched and speculatively executed.
>
> Some very modern cpu might manage to predict indirect jumps, but for
> most it is a full pipeline stall.
Thanks for the CPU information.

The thing is with if-else can't avoid xxx(%ryyy) in this case, as Marcelo
said above. It seems if-else will just be a extra cost compare to this one.


Re: [PATCH] bnxt_en: Fix sources of spurious netpoll warnings

2017-12-08 Thread Michael Chan
On Fri, Dec 8, 2017 at 9:05 AM, Calvin Owens  wrote:
> After applying 2270bc5da3497945 ("bnxt_en: Fix netpoll handling") and
> 903649e718f80da2 ("bnxt_en: Improve -ENOMEM logic in NAPI poll loop."),
> we still see the following WARN fire:
>
>   [ cut here ]
>   WARNING: CPU: 0 PID: 1875170 at net/core/netpoll.c:165 
> netpoll_poll_dev+0x15a/0x160
>   bnxt_poll+0x0/0xd0 exceeded budget in poll
>   
>   Call Trace:
>[] dump_stack+0x4d/0x70
>[] __warn+0xd3/0xf0
>[] warn_slowpath_fmt+0x4f/0x60
>[] netpoll_poll_dev+0x15a/0x160
>[] netpoll_send_skb_on_dev+0x168/0x250
>[] netpoll_send_udp+0x2dc/0x440
>[] write_ext_msg+0x20e/0x250
>[] call_console_drivers.constprop.23+0xa5/0x110
>[] console_unlock+0x339/0x5b0
>[] vprintk_emit+0x2c8/0x450
>[] vprintk_default+0x1f/0x30
>[] printk+0x48/0x50
>[] edac_raw_mc_handle_error+0x563/0x5c0 [edac_core]
>[] edac_mc_handle_error+0x42b/0x6e0 [edac_core]
>[] sbridge_mce_output_error+0x410/0x10d0 [sb_edac]
>[] sbridge_check_error+0xac/0x130 [sb_edac]
>[] edac_mc_workq_function+0x3c/0x90 [edac_core]
>[] process_one_work+0x19b/0x480
>[] worker_thread+0x6a/0x520
>[] kthread+0xe4/0x100
>[] ret_from_fork+0x22/0x40
>
> This happens because we increment rx_pkts on -ENOMEM and -EIO, resulting
> in rx_pkts > 0. Fix this by only bumping rx_pkts if we were actually
> given a non-zero budget.
>
> Signed-off-by: Calvin Owens 

Thanks.
Acked-by: Michael Chan 


RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread David Laight
From: Xin Long ...
> Hi, David, Sorry,  I'm not sure we're worrying about the cpu cost or
> codes style now ?
> 
> For cpu cost,  I think 0x848(%r13) operation must be better than the
> generated code of if-else.

Nope - the call xxx(%ryyy) is likely to be a data cache miss and a complete
cpu pipeline stall.

The conditional will be a data cache hit and the code (for one branch)
will be prefetched and speculatively executed.

Some very modern cpu might manage to predict indirect jumps, but for
most it is a full pipeline stall.

David



Re: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread Xin Long
On Sat, Dec 9, 2017 at 12:22 AM, David Laight  wrote:
> From: Xin Long
>> Sent: 08 December 2017 16:18
>>
> ...
>> >> Alternatively you could preform the dereference in two steps (i.e. 
>> >> declare an si
>> >> pointer on the stack and set it equal to asoc->stream.si, then deref
>> >> si->make_datafrag at call time.  That will at least give the compiler an
>> >> opportunity to preload the first pointer.
>
> You want to save the function pointer itself.
>
> ...
>> Another small difference:
>>   as you can see, comparing to (X), (Y) is using 0x28(%rsp) in the loop,
>>   instead of %r13.
>>
>> So that's what I can see from the related generated code.
>> If 0x848(%r13) is not worse than 0x28(%rsp) for cpu, I think
>> asoc->stream.si->make_datafrag() is even better. No ?
>
> That code must have far too many life local variables.
> Otherwise there's be a caller saved register available.
>
Hi, David, Sorry,  I'm not sure we're worrying about the cpu cost or
codes style now ?

For cpu cost,  I think 0x848(%r13) operation must be better than the
generated code of if-else.

For the codes style, comparing to the if-else, I think this one is
more readable. (ignore extendible stuff first, as probably no more
new type of data chunk).


Re: [PATCH v4.1] phylib: Add device reset GPIO support

2017-12-08 Thread Sergei Shtylyov

Hello!

On 12/08/2017 12:53 PM, Geert Uytterhoeven wrote:


On 12/04/2017 03:35 PM, Geert Uytterhoeven wrote:

From: Sergei Shtylyov 
The PHY devices sometimes do have their reset signal (maybe even power
supply?) tied to some GPIO and sometimes it also does happen that a boot
loader does not leave it deasserted. So far this issue has been attacked
from (as I believe) a wrong angle: by teaching the MAC driver to
manipulate
the GPIO in question; that solution, when applied to the device trees, led
to adding the PHY reset GPIO properties to the MAC device node, with one
exception: Cadence MACB driver which could handle the "reset-gpios" prop
in a PHY device subnode. I believe that the correct approach is to teach
the 'phylib' to get the MDIO device reset GPIO from the device tree node
corresponding to this device -- which this patch is doing...

Note that I had to modify the AT803x PHY driver as it would stop working
otherwise -- it made use of the reset GPIO for its own purposes...

Signed-off-by: Sergei Shtylyov 
Acked-by: Rob Herring 
[geert: Propagate actual errors from fwnode_get_named_gpiod()]
[geert: Avoid destroying initial setup]
[geert: Consolidate GPIO descriptor acquiring code]
Signed-off-by: Geert Uytterhoeven 


[...]


diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 2df7b62c1a36811e..8f8b7747c54bc478 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c


[...]


@@ -48,9 +49,26 @@
 int mdiobus_register_device(struct mdio_device *mdiodev)
   {
+   struct gpio_desc *gpiod = NULL;
+
 if (mdiodev->bus->mdio_map[mdiodev->addr])
 return -EBUSY;
   + /* Deassert the optional reset signal */



Umm, but why deassert it here for such a short time?


That's a consequence of moving it from drivers/of/of_mdio.c to here.


   Well, you shouldn't do code moves without some thinking. ;-)


Not that it was deasserted that much longer in drivers/of/of_mdio.c, though...


   There it had a reason, here I'm not seeing one. Perhaps using GPIOD_ASIS 
(or GPIOD_OUT_HIGH) instead of GPIOD_OUT_LOW and dropping 
mdio_device_reset(mdiodev, 1) afterwards would make more sense here?



Gr{oetje,eeting}s,

 Geert


MBR, Sergei


[PATCH iproute2 net-next 4/4] ss: Implement automatic column width calculation

2017-12-08 Thread Stefano Brivio
Group fitting fields into lines and space them equally using the
remaining screen width for each line. If columns don't fit on
one line, break them into the least possible amount of lines and
keep them aligned across lines.

This is done by:
 - recording the length of the longest item in each column during
   formatting and buffering (which was added in the previous patch)
 - fitting as many fields as possible on each line of output
 - distributing the remaining padding space equally between the
   columns

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 misc/ss.c | 188 +++---
 1 file changed, 120 insertions(+), 68 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index abc0da7fa8fe..44ad9587b62c 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -128,19 +128,21 @@ struct column {
const enum col_align align;
const char *header;
const char *ldelim;
-   int width;  /* Including delimiter. -1: fit to content, 0: hide */
+   int disabled;
+   int width;  /* Calculated, including additional layout spacing */
+   int max_len;/* Measured maximum field length in this column */
 };
 
 static struct column columns[] = {
-   { ALIGN_LEFT,   "Netid","", 0 },
-   { ALIGN_LEFT,   "State"," ",0 },
-   { ALIGN_LEFT,   "Recv-Q",   " ",7 },
-   { ALIGN_LEFT,   "Send-Q",   " ",7 },
-   { ALIGN_RIGHT,  "Local Address:",   " ",0 },
-   { ALIGN_LEFT,   "Port", "", 0 },
-   { ALIGN_RIGHT,  "Peer Address:"," ",0 },
-   { ALIGN_LEFT,   "Port", "", 0 },
-   { ALIGN_LEFT,   "", "", -1 },
+   { ALIGN_LEFT,   "Netid","", 0, 0, 0 },
+   { ALIGN_LEFT,   "State"," ",0, 0, 0 },
+   { ALIGN_LEFT,   "Recv-Q",   " ",0, 0, 0 },
+   { ALIGN_LEFT,   "Send-Q",   " ",0, 0, 0 },
+   { ALIGN_RIGHT,  "Local Address:",   " ",0, 0, 0 },
+   { ALIGN_LEFT,   "Port", "", 0, 0, 0 },
+   { ALIGN_RIGHT,  "Peer Address:"," ",0, 0, 0 },
+   { ALIGN_LEFT,   "Port", "", 0, 0, 0 },
+   { ALIGN_LEFT,   "", "", 0, 0, 0 },
 };
 
 static struct column *current_field = columns;
@@ -959,7 +961,7 @@ static void out(const char *fmt, ...)
char *pos;
int len;
 
-   if (!f->width)
+   if (f->disabled)
return;
 
if (!buffer.head)
@@ -982,7 +984,7 @@ static int print_left_spacing(struct column *f, int stored, 
int printed)
 {
int s;
 
-   if (f->width < 0 || f->align == ALIGN_LEFT)
+   if (!f->width || f->align == ALIGN_LEFT)
return 0;
 
s = f->width - stored - printed;
@@ -1000,7 +1002,7 @@ static void print_right_spacing(struct column *f, int 
printed)
 {
int s;
 
-   if (f->width < 0 || f->align == ALIGN_RIGHT)
+   if (!f->width || f->align == ALIGN_RIGHT)
return;
 
s = f->width - printed;
@@ -1017,9 +1019,12 @@ static void field_flush(struct column *f)
struct buf_chunk *chunk = buffer.tail;
unsigned int pad = buffer.cur->len % 2;
 
-   if (!f->width)
+   if (f->disabled)
return;
 
+   if (buffer.cur->len > f->max_len)
+   f->max_len = buffer.cur->len;
+
/* We need a new chunk if we can't store the next length descriptor.
 * Mind the gap between end of previous token and next aligned position
 * for length descriptor.
@@ -1062,7 +1067,7 @@ static void field_set(enum col_id id)
 static void print_header(void)
 {
while (!field_is_last(current_field)) {
-   if (current_field->width)
+   if (!current_field->disabled)
out(current_field->header);
field_next();
}
@@ -1095,16 +1100,106 @@ static void buf_free_all(void)
buffer.head = NULL;
 }
 
+/* Calculate column width from contents length. If columns don't fit on one
+ * line, break them into the least possible amount of lines and keep them
+ * aligned across lines. Available screen space is equally spread between 
fields
+ * as additional spacing.
+ */
+static void render_calc_width(int screen_width)
+{
+   int first, len = 0, linecols = 0;
+   struct column *c, *eol = columns - 1;
+
+   /* First pass: set width for each column to measured content length */
+   for (first = 1, c = columns; c - columns < COL_MAX; c++) {
+   if (c->disabled)
+   continue;
+
+   if (!first && c->max_len)
+   c->width = c->max_len + strlen(c->ldelim);
+   else
+   c->width = c->max_len;
+
+   /* But 

[PATCH iproute2 net-next 0/4] Abstract columns, properly space and wrap fields

2017-12-08 Thread Stefano Brivio
Currently, 'ss' simply subdivides the whole available screen width
between available columns, starting from a set of hardcoded amount
of spacing and growing column widths.

This makes the output unreadable in several cases, as it doesn't take
into account the actual content width.

Fix this by introducing a simple abstraction for columns, buffering
the output, measuring the width of the fields, grouping fields into
lines as they fit, equally distributing any remaining whitespace, and
finally rendering the result. Some examples are reported below [1].

This implementation doesn't seem to cause any significant performance
issues, as reported in 3/4.

Patch 1/4 replaces all relevant printf() calls by the out() helper,
which simply consists of the usual printf() implementation.

Patch 2/4 implements column abstraction, with configurable column
width and delimiters, and 3/4 splits buffering and rendering phases,
employing a simple buffering mechanism with chunked allocation and
introducing a rendering function.

Up to this point, the output is still unchanged.

Finally, 4/4 introduces field width calculation based on content
length measured while buffering, in order to split fields onto
multiple lines and equally space them within the single lines.

Now that column behaviour is well-defined and more easily
configurable, it should be easier to further improve the output by
splitting logically separable information (e.g. TCP details) into
additional columns. However, this patchset keeps the full "extended"
information into a single column, for the moment being.


[1]

- 80 columns terminal, ss -Z -f netlink
  * before:
Recv-Q Send-Q Local Address:Port Peer Address:Port

0  0rtnl:evolution-calen/2075   * pr
oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
0  0rtnl:abrt-applet/32700  * pr
oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
0  0rtnl:firefox/21619  * pr
oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
0  0rtnl:evolution-calen/32639   * p
roc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[...]

  * after:
Recv-Q   Send-Q Local Address:Port  Peer Address:Port
00   rtnl:evolution-calen/2075  *
 proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
00   rtnl:abrt-applet/32700 *
 proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
00   rtnl:firefox/21619 *
 proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
00   rtnl:evolution-calen/32639 *
 proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[...]

- 80 colums terminal, ss -tunpl
  * before:
Netid  State  Recv-Q Send-Q Local Address:Port   Peer 
Address:Port
udpUNCONN 0  0 *:37732 *:*
udpUNCONN 0  0 *:5353  *:*
udpUNCONN 0  0  192.168.122.1:53*:*
udpUNCONN 0  0  *%virbr0:67*:*
[...]

  * after:
Netid   StateRecv-Q   Send-Q Local Address:Port  Peer Address:Port
udp UNCONN   00  *:37732*:*
udp UNCONN   00  *:5353 *:*
udp UNCONN   00  192.168.122.1:53   *:*
udp UNCONN   00   *%virbr0:67   *:*
[...]

 - 66 columns terminal, ss -tunpl
  * before:
Netid  State  Recv-Q Send-Q Local Address:Port   P
eer Address:Port
udpUNCONN 0  0   *:37732   *:*

udpUNCONN 0  0   *:5353*:*

udpUNCONN 0  0  192.168.122.1:53
*:*
udpUNCONN 0  0  *%virbr0:67  *:*
[...]

  * after:
Netid State  Recv-Q Send-Q Local Address:Port   Peer Address:Port
udp   UNCONN 0  0  *:37732 *:*
udp   UNCONN 0  0  *:5353  *:*
udp   UNCONN 0  0  192.168.122.1:53*:*
udp   UNCONN 0  0   *%virbr0:67*:*
[...]


Stefano Brivio (4):
  ss: Replace printf() calls for "main" output by calls to helper
  ss: Introduce columns lightweight abstraction
  ss: Buffer raw fields first, then render them as a table
  ss: Implement automatic column width calculation

 misc/ss.c | 893 +++---
 1 file changed, 621 insertions(+), 272 deletions(-)

-- 
2.9.4



[PATCH iproute2 net-next 2/4] ss: Introduce columns lightweight abstraction

2017-12-08 Thread Stefano Brivio
Instead of embedding spacing directly while printing contents,
logically declare columns and functions to buffer their content,
to print left and right spacing around fields, to flush them to
screen, and to print headers.

This makes it a bit easier to handle layout changes and prepares
for full output buffering, needed for optimal spacing in field
output layout.

Columns are currently set up to retain exactly the same output
as before. This needs some slight adjustments of the values
previously calculated in main(), as the width value introduced
here already includes the width of left delimiters and spacing
is not explicitly printed anymore whenever a field is printed.
These calculations will go away altogether once automatic width
calculation is implemented.

We can also remove explicit printing of newlines after the final
content for a given line is printed, flushing the last field on
a line will cause field_flush() to print newlines where
appropriate.

No changes in output expected here.

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 misc/ss.c | 289 ++
 1 file changed, 198 insertions(+), 91 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 72e117ca9405..9dbcfd514d48 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -103,11 +103,48 @@ int show_header = 1;
 int follow_events;
 int sctp_ino;
 
-int netid_width;
-int state_width;
-int addr_width;
-int serv_width;
-char *odd_width_pad = "";
+enum col_id {
+   COL_NETID,
+   COL_STATE,
+   COL_RECVQ,
+   COL_SENDQ,
+   COL_ADDR,
+   COL_SERV,
+   COL_RADDR,
+   COL_RSERV,
+   COL_EXT,
+   COL_MAX
+};
+
+enum col_align {
+   ALIGN_LEFT,
+   ALIGN_CENTER,
+   ALIGN_RIGHT
+};
+
+struct column {
+   const enum col_align align;
+   const char *header;
+   const char *ldelim;
+   int width;  /* Including delimiter. -1: fit to content, 0: hide */
+   int stored; /* Characters buffered */
+   int printed;/* Characters printed so far */
+};
+
+static struct column columns[] = {
+   { ALIGN_LEFT,   "Netid","", 0,  0,  0 },
+   { ALIGN_LEFT,   "State"," ",0,  0,  0 },
+   { ALIGN_LEFT,   "Recv-Q",   " ",7,  0,  0 },
+   { ALIGN_LEFT,   "Send-Q",   " ",7,  0,  0 },
+   { ALIGN_RIGHT,  "Local Address:",   " ",0,  0,  0 },
+   { ALIGN_LEFT,   "Port", "", 0,  0,  0 },
+   { ALIGN_RIGHT,  "Peer Address:"," ",0,  0,  0 },
+   { ALIGN_LEFT,   "Port", "", 0,  0,  0 },
+   { ALIGN_LEFT,   "", "", -1, 0,  0 },
+};
+
+static struct column *current_field = columns;
+static char field_buf[BUFSIZ];
 
 static const char *TCP_PROTO = "tcp";
 static const char *SCTP_PROTO = "sctp";
@@ -825,13 +862,113 @@ static const char *vsock_netid_name(int type)
 
 static void out(const char *fmt, ...)
 {
+   struct column *f = current_field;
va_list args;
 
va_start(args, fmt);
-   vfprintf(stdout, fmt, args);
+   f->stored += vsnprintf(field_buf + f->stored, BUFSIZ - f->stored,
+  fmt, args);
va_end(args);
 }
 
+static int print_left_spacing(struct column *f)
+{
+   int s;
+
+   if (f->width < 0 || f->align == ALIGN_LEFT)
+   return 0;
+
+   s = f->width - f->stored - f->printed;
+   if (f->align == ALIGN_CENTER)
+   /* If count of total spacing is odd, shift right by one */
+   s = (s + 1) / 2;
+
+   if (s > 0)
+   return printf("%*c", s, ' ');
+
+   return 0;
+}
+
+static void print_right_spacing(struct column *f)
+{
+   int s;
+
+   if (f->width < 0 || f->align == ALIGN_RIGHT)
+   return;
+
+   s = f->width - f->printed;
+   if (f->align == ALIGN_CENTER)
+   s /= 2;
+
+   if (s > 0)
+   printf("%*c", s, ' ');
+}
+
+static int field_needs_delimiter(struct column *f)
+{
+   if (!f->stored)
+   return 0;
+
+   /* Was another field already printed on this line? */
+   for (f--; f >= columns; f--)
+   if (f->width)
+   return 1;
+
+   return 0;
+}
+
+/* Flush given field to screen together with delimiter and spacing */
+static void field_flush(struct column *f)
+{
+   if (!f->width)
+   return;
+
+   if (field_needs_delimiter(f))
+   f->printed = printf("%s", f->ldelim);
+
+   f->printed += print_left_spacing(f);
+   f->printed += printf("%s", field_buf);
+   print_right_spacing(f);
+
+   *field_buf = 0;
+   f->printed = 0;
+   f->stored = 0;
+}
+
+static int field_is_last(struct column *f)
+{
+   return f - columns == COL_MAX - 

[PATCH iproute2 net-next 3/4] ss: Buffer raw fields first, then render them as a table

2017-12-08 Thread Stefano Brivio
This allows us to measure the maximum field length for each
column before printing fields and will permit us to apply
optimal field spacing and distribution. Structure of the output
buffer with chunked allocation is described in comments.

Output is still unchanged, original spacing is used.

Running over one million sockets with -tul options by simply
modifying main() to loop 50,000 times over the *_show()
functions, buffering the whole output and rendering it at the
end, with 10 UDP sockets, 10 TCP sockets, while throwing
output away, doesn't show significant changes in execution time
on my laptop with an Intel i7-6600U CPU:

- before this patch:
$ time ./ss -tul > /dev/null
real0m29.899s
user0m2.017s
sys 0m27.801s

- after this patch:
$ time ./ss -tul > /dev/null
real0m29.827s
user0m1.942s
sys 0m27.812s

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 misc/ss.c | 271 +++---
 1 file changed, 225 insertions(+), 46 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 9dbcfd514d48..abc0da7fa8fe 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -47,6 +47,8 @@
 #include 
 
 #define MAGIC_SEQ 123456
+#define BUF_CHUNK (1024 * 1024)
+#define LEN_ALIGN(x) (((x) + 1) & ~1)
 
 #define DIAG_REQUEST(_req, _r) \
struct {\
@@ -127,24 +129,45 @@ struct column {
const char *header;
const char *ldelim;
int width;  /* Including delimiter. -1: fit to content, 0: hide */
-   int stored; /* Characters buffered */
-   int printed;/* Characters printed so far */
 };
 
 static struct column columns[] = {
-   { ALIGN_LEFT,   "Netid","", 0,  0,  0 },
-   { ALIGN_LEFT,   "State"," ",0,  0,  0 },
-   { ALIGN_LEFT,   "Recv-Q",   " ",7,  0,  0 },
-   { ALIGN_LEFT,   "Send-Q",   " ",7,  0,  0 },
-   { ALIGN_RIGHT,  "Local Address:",   " ",0,  0,  0 },
-   { ALIGN_LEFT,   "Port", "", 0,  0,  0 },
-   { ALIGN_RIGHT,  "Peer Address:"," ",0,  0,  0 },
-   { ALIGN_LEFT,   "Port", "", 0,  0,  0 },
-   { ALIGN_LEFT,   "", "", -1, 0,  0 },
+   { ALIGN_LEFT,   "Netid","", 0 },
+   { ALIGN_LEFT,   "State"," ",0 },
+   { ALIGN_LEFT,   "Recv-Q",   " ",7 },
+   { ALIGN_LEFT,   "Send-Q",   " ",7 },
+   { ALIGN_RIGHT,  "Local Address:",   " ",0 },
+   { ALIGN_LEFT,   "Port", "", 0 },
+   { ALIGN_RIGHT,  "Peer Address:"," ",0 },
+   { ALIGN_LEFT,   "Port", "", 0 },
+   { ALIGN_LEFT,   "", "", -1 },
 };
 
 static struct column *current_field = columns;
-static char field_buf[BUFSIZ];
+
+/* Output buffer: chained chunks of BUF_CHUNK bytes. Each field is written to
+ * the buffer as a variable size token. A token consists of a 16 bits length
+ * field, followed by a string which is not NULL-terminated.
+ *
+ * A new chunk is allocated and linked when the current chunk doesn't have
+ * enough room to store the current token as a whole.
+ */
+struct buf_chunk {
+   struct buf_chunk *next; /* Next chained chunk */
+   char *end;  /* Current end of content */
+   char data[0];
+};
+
+struct buf_token {
+   uint16_t len;   /* Data length, excluding length descriptor */
+   char data[0];
+};
+
+static struct {
+   struct buf_token *cur;  /* Position of current token in chunk */
+   struct buf_chunk *head; /* First chunk */
+   struct buf_chunk *tail; /* Current chunk */
+} buffer;
 
 static const char *TCP_PROTO = "tcp";
 static const char *SCTP_PROTO = "sctp";
@@ -860,25 +883,109 @@ static const char *vsock_netid_name(int type)
}
 }
 
+/* Allocate and initialize a new buffer chunk */
+static struct buf_chunk *buf_chunk_new(void)
+{
+   struct buf_chunk *new = malloc(BUF_CHUNK);
+
+   if (!new)
+   abort();
+
+   new->next = NULL;
+
+   /* This is also the last block */
+   buffer.tail = new;
+
+   /* Next token will be stored at the beginning of chunk data area, and
+* its initial length is zero.
+*/
+   buffer.cur = (struct buf_token *)new->data;
+   buffer.cur->len = 0;
+
+   new->end = buffer.cur->data;
+
+   return new;
+}
+
+/* Return available tail room in given chunk */
+static int buf_chunk_avail(struct buf_chunk *chunk)
+{
+   return BUF_CHUNK - offsetof(struct buf_chunk, data) -
+  (chunk->end - chunk->data);
+}
+
+/* Update end pointer and token length, link new chunk if we 

[PATCH] bnxt_en: Fix sources of spurious netpoll warnings

2017-12-08 Thread Calvin Owens
After applying 2270bc5da3497945 ("bnxt_en: Fix netpoll handling") and
903649e718f80da2 ("bnxt_en: Improve -ENOMEM logic in NAPI poll loop."),
we still see the following WARN fire:

  [ cut here ]
  WARNING: CPU: 0 PID: 1875170 at net/core/netpoll.c:165 
netpoll_poll_dev+0x15a/0x160
  bnxt_poll+0x0/0xd0 exceeded budget in poll
  
  Call Trace:
   [] dump_stack+0x4d/0x70
   [] __warn+0xd3/0xf0
   [] warn_slowpath_fmt+0x4f/0x60
   [] netpoll_poll_dev+0x15a/0x160
   [] netpoll_send_skb_on_dev+0x168/0x250
   [] netpoll_send_udp+0x2dc/0x440
   [] write_ext_msg+0x20e/0x250
   [] call_console_drivers.constprop.23+0xa5/0x110
   [] console_unlock+0x339/0x5b0
   [] vprintk_emit+0x2c8/0x450
   [] vprintk_default+0x1f/0x30
   [] printk+0x48/0x50
   [] edac_raw_mc_handle_error+0x563/0x5c0 [edac_core]
   [] edac_mc_handle_error+0x42b/0x6e0 [edac_core]
   [] sbridge_mce_output_error+0x410/0x10d0 [sb_edac]
   [] sbridge_check_error+0xac/0x130 [sb_edac]
   [] edac_mc_workq_function+0x3c/0x90 [edac_core]
   [] process_one_work+0x19b/0x480
   [] worker_thread+0x6a/0x520
   [] kthread+0xe4/0x100
   [] ret_from_fork+0x22/0x40

This happens because we increment rx_pkts on -ENOMEM and -EIO, resulting
in rx_pkts > 0. Fix this by only bumping rx_pkts if we were actually
given a non-zero budget.

Signed-off-by: Calvin Owens 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c5c38d4..f38160f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1883,7 +1883,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct 
bnxt_napi *bnapi, int budget)
 * here forever if we consistently cannot allocate
 * buffers.
 */
-   else if (rc == -ENOMEM)
+   else if (rc == -ENOMEM && budget)
rx_pkts++;
else if (rc == -EBUSY)  /* partial completion */
break;
@@ -1969,7 +1969,7 @@ static int bnxt_poll_nitroa0(struct napi_struct *napi, 
int budget)
cpu_to_le32(RX_CMPL_ERRORS_CRC_ERROR);
 
rc = bnxt_rx_pkt(bp, bnapi, _cons, );
-   if (likely(rc == -EIO))
+   if (likely(rc == -EIO) && budget)
rx_pkts++;
else if (rc == -EBUSY)  /* partial completion */
break;
-- 
2.9.5



RE: [PATCH net-next v2 5/5] qede: Use NETIF_F_GRO_HW.

2017-12-08 Thread Chopra, Manish
> -Original Message-
> From: Michael Chan [mailto:michael.c...@broadcom.com]
> Sent: Thursday, December 07, 2017 1:34 PM
> To: da...@davemloft.net
> Cc: netdev@vger.kernel.org; andrew.gospoda...@broadcom.com; Elior, Ariel
> ; Dept-Eng Everest Linux L2  engeverestlinu...@cavium.com>
> Subject: [PATCH net-next v2 5/5] qede: Use NETIF_F_GRO_HW.
> 
> Advertise NETIF_F_GRO_HW and set edev->gro_disable according to the feature
> flag.  Add qede_fix_features() to drop NETIF_F_GRO_HW if XDP is running or
> MTU does not support GRO_HW.  qede_change_mtu() also checks and disables
> GRO_HW if MTU is not supported.
> 
> Cc: Ariel Elior 
> Cc: everest-linux...@cavium.com
> Signed-off-by: Michael Chan 
> ---
>  drivers/net/ethernet/qlogic/qede/qede.h |  2 ++
>  drivers/net/ethernet/qlogic/qede/qede_ethtool.c |  3 +++
> drivers/net/ethernet/qlogic/qede/qede_filter.c  | 20 +---
>  drivers/net/ethernet/qlogic/qede/qede_main.c| 17 ++---
>  4 files changed, 24 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qede/qede.h
> b/drivers/net/ethernet/qlogic/qede/qede.h
> index a3a70ad..8a33651 100644
> --- a/drivers/net/ethernet/qlogic/qede/qede.h
> +++ b/drivers/net/ethernet/qlogic/qede/qede.h
> @@ -494,6 +494,8 @@ int qede_free_tx_pkt(struct qede_dev *edev,  void
> qede_vlan_mark_nonconfigured(struct qede_dev *edev);  int
> qede_configure_vlan_filters(struct qede_dev *edev);
> 
> +netdev_features_t qede_fix_features(struct net_device *dev,
> + netdev_features_t features);
>  int qede_set_features(struct net_device *dev, netdev_features_t features);
> void qede_set_rx_mode(struct net_device *ndev);  void
> qede_config_rx_mode(struct net_device *ndev); diff --git
> a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
> b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
> index dae7412..4ca3847 100644
> --- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
> +++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
> @@ -940,6 +940,9 @@ int qede_change_mtu(struct net_device *ndev, int
> new_mtu)
>   DP_VERBOSE(edev, (NETIF_MSG_IFUP | NETIF_MSG_IFDOWN),
>  "Configuring MTU size of %d\n", new_mtu);
> 
> + if (new_mtu > PAGE_SIZE)
> + ndev->features &= ~NETIF_F_GRO_HW;
> +
>   /* Set the mtu field and re-start the interface if needed */
>   args.u.mtu = new_mtu;
>   args.func = _update_mtu;
> diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c
> b/drivers/net/ethernet/qlogic/qede/qede_filter.c
> index c1a0708..2de947e 100644
> --- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
> +++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
> @@ -895,19 +895,25 @@ static void qede_set_features_reload(struct
> qede_dev *edev,
>   edev->ndev->features = args->u.features;  }
> 
> +netdev_features_t qede_fix_features(struct net_device *dev,
> + netdev_features_t features)
> +{
> + struct qede_dev *edev = netdev_priv(dev);
> +
> + if (edev->xdp_prog || edev->ndev->mtu > PAGE_SIZE)
> + features &= ~NETIF_F_GRO_HW;
> +
> + return features;
> +}
> +
>  int qede_set_features(struct net_device *dev, netdev_features_t features)  {
>   struct qede_dev *edev = netdev_priv(dev);
>   netdev_features_t changes = features ^ dev->features;
>   bool need_reload = false;
> 
> - /* No action needed if hardware GRO is disabled during driver load */
> - if (changes & NETIF_F_GRO) {
> - if (dev->features & NETIF_F_GRO)
> - need_reload = !edev->gro_disable;
> - else
> - need_reload = edev->gro_disable;
> - }
> + if (changes & NETIF_F_GRO_HW)
> + need_reload = true;
> 
>   if (need_reload) {
>   struct qede_reload_args args;
> diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c
> b/drivers/net/ethernet/qlogic/qede/qede_main.c
> index 57332b3..90d79ae 100644
> --- a/drivers/net/ethernet/qlogic/qede/qede_main.c
> +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
> @@ -545,6 +545,7 @@ static int qede_ioctl(struct net_device *dev, struct ifreq
> *ifr, int cmd)  #endif
>   .ndo_vlan_rx_add_vid = qede_vlan_rx_add_vid,
>   .ndo_vlan_rx_kill_vid = qede_vlan_rx_kill_vid,
> + .ndo_fix_features = qede_fix_features,
>   .ndo_set_features = qede_set_features,
>   .ndo_get_stats64 = qede_get_stats64,
>  #ifdef CONFIG_QED_SRIOV
> @@ -572,6 +573,7 @@ static int qede_ioctl(struct net_device *dev, struct ifreq
> *ifr, int cmd)
>   .ndo_change_mtu = qede_change_mtu,
>   .ndo_vlan_rx_add_vid = qede_vlan_rx_add_vid,
>   .ndo_vlan_rx_kill_vid = qede_vlan_rx_kill_vid,
> + .ndo_fix_features = qede_fix_features,
>   .ndo_set_features = qede_set_features,
>   .ndo_get_stats64 = qede_get_stats64,
>   .ndo_udp_tunnel_add = 

Re: [PATCH v2 net-next 4/4] bpftool: implement cgroup bpf operations

2017-12-08 Thread David Ahern
On 12/8/17 8:39 AM, Quentin Monnet wrote:
> I don't believe compatibility is an issue here, since the program and
> its documentation come together (so they should stay in sync) and are
> part of the kernel tree (so the tool should be compatible with the
> kernel sources it comes with). My concern is that there is no way to
> guess from the current description what the values for ATTACH_FLAG or
> ATTACH_TYPE can be, without reading the source code of the program—which
> is not exactly user-friendly.
> 

The tool should be backward and forward compatible across kernel
versions. Running a newer command on an older kernel should fail in a
deterministic. While the tool is in the kernel tree for ease of
development, that should not be confused with having a direct tie to any
kernel version.

I believe man pages do include kernel version descriptions in flags
(e.g., man 7 socket -- flags are denoted with "since Linux x.y") which
is one way to handle it with the usual caveat that vendors might have
backported support to earlier kernels.


Re: [PATCH RFC 0/4] Fixes for Marvell MII paged register access races

2017-12-08 Thread Russell King - ARM Linux
On Fri, Dec 08, 2017 at 05:17:14PM +0100, Andrew Lunn wrote:
> Hi Russell
> 
> > There is an open question whether there should be generic helpers for
> > this.  Generic helpers would mean:
> > 
> > - Additional couple of function pointers in phy_driver to read/write the
> >   paging register.  This has the restriction that there must only be one
> >   paging register.
> 
> I must be missing something. I don't see why there is this
> restriction. Don't we just need
> 
> int phy_get_page(phydev);
> int phy_set_page(phydev, page);

The restriction occurs because a PHY may have several different
registers, and knowing which of the registers need touching becomes an
issue.  We wouldn't want these accessors to needlessly access several
registers each and every time we requested an access to the page
register.

There's also the issue of whether an "int" or whatever type we choose to
pass the "page" around is enough bits.  I haven't surveyed all the PHY
drivers yet to know the answer to that.

> We might even be able to do without phy_get_page(), if we assume page
> 0 should be selected by default, and all paged reads/write return to
> page 0 afterwards.

That would break the marvell driver, where it polls the copper page (0)
and fiber page (1) for link, and leaves the page set appropriately
depending on which negotiated the link.  Not nice, but that's how the
driver is written to work.

> > - The helpers become more expensive, and because they're in a separate
> >   compilation unit, the compiler will be unable to optimise them by
> >   inlining the static functions.
> 
> The mdio bus is slow. Often there is a completion function triggered
> by an interrupt etc. Worst case it is bit-banging. I suspect the gains
> from inlining are just noise in the bigger picture.

Yep.

> > - The helpers would be re-usable, saving replications of that code, and
> >   making it more likely for phy authors to safely access the PHY.
> 
> This is the key point for me. This has likely to of been broken for
> years. If it is obviously broken, driver writers are more likely to
> get it right. Here it is not obvious, so we should take it out of the
> driver writers hands and do it in the core.

See the patch below that partially does this on top of this series.

 drivers/net/phy/marvell.c  |  70 +++
 drivers/net/phy/phy-core.c | 118 +
 include/linux/phy.h|   8 +++
 3 files changed, 176 insertions(+), 20 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 5548ac8fa239..183a60a06099 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -537,9 +537,9 @@ static int m88e1121_config_aneg_rgmii_delays(struct 
phy_device *phydev)
else
mscr = 0;
 
-   return marvell_modify_paged(phydev, MII_MARVELL_MSCR_PAGE,
-   MII_88E1121_PHY_MSCR_REG,
-   MII_88E1121_PHY_MSCR_DELAY_MASK, mscr);
+   return phy_modify_paged(phydev, MII_MARVELL_MSCR_PAGE,
+   MII_88E1121_PHY_MSCR_REG,
+   MII_88E1121_PHY_MSCR_DELAY_MASK, mscr);
 }
 
 static int m88e1121_config_aneg(struct phy_device *phydev)
@@ -567,9 +567,9 @@ static int m88e1318_config_aneg(struct phy_device *phydev)
 {
int err;
 
-   err = marvell_modify_paged(phydev, MII_MARVELL_MSCR_PAGE,
-  MII_88E1318S_PHY_MSCR1_REG,
-  0, MII_88E1318S_PHY_MSCR1_PAD_ODD);
+   err = phy_modify_paged(phydev, MII_MARVELL_MSCR_PAGE,
+  MII_88E1318S_PHY_MSCR1_REG,
+  0, MII_88E1318S_PHY_MSCR1_PAD_ODD);
if (err < 0)
return err;
 
@@ -894,9 +894,9 @@ static int m88e1121_config_init(struct phy_device *phydev)
int err;
 
/* Default PHY LED config: LED[0] .. Link, LED[1] .. Activity */
-   err = marvell_write_paged(phydev, MII_MARVELL_LED_PAGE,
- MII_88E1121_PHY_LED_CTRL,
- MII_88E1121_PHY_LED_DEF);
+   err = phy_write_paged(phydev, MII_MARVELL_LED_PAGE,
+ MII_88E1121_PHY_LED_CTRL,
+ MII_88E1121_PHY_LED_DEF);
if (err < 0)
return err;
 
@@ -1544,7 +1544,7 @@ static u64 marvell_get_stat(struct phy_device *phydev, 
int i)
int val;
u64 ret;
 
-   val = marvell_read_paged(phydev, stat.page, stat.reg);
+   val = phy_read_paged(phydev, stat.page, stat.reg);
if (val < 0) {
ret = UINT64_MAX;
} else {
@@ -1684,8 +1684,8 @@ static int m88e1510_get_temp(struct phy_device *phydev, 
long *temp)
 
*temp = 0;
 
-   ret = marvell_read_paged(phydev, MII_MARVELL_MISC_TEST_PAGE,
-MII_88E1510_TEMP_SENSOR);
+   ret = phy_read_paged(phydev, 

Re: [PATCH] net: dsa: allow XAUI phy interface mode

2017-12-08 Thread Russell King - ARM Linux
On Fri, Dec 08, 2017 at 05:26:43PM +0100, Andrew Lunn wrote:
> On Fri, Dec 08, 2017 at 04:04:59PM +, Russell King wrote:
> > XGMII is a 32-bit bus plus two clock signals per direction.  XAUI is
> > four serial lanes per direction.  The 88e6190 supports XAUI but not
> > XGMII as it doesn't have enough pins.  The same is true of 88e6176.
> > 
> > Match on PHY_INTERFACE_MODE_XAUI for the XAUI port type, but keep
> > accepting XGMII for backwards compatibility.
> 
> Hi Russell
> 
> The backwards compatibility is for the ZII devel C, the DSA link
> between the two switches? I don't think there are any other boards
> using this.
> 
> I will ask around, but i don't think we need to worry about backwards
> compatibility. There is no region of flash set aside for the DT blob,
> etc. I always TFTP boot using the one from the kernel, etc.

Well, there's also kexec to consider, where it's better not to specify
the --dtb option because that wipes out the modifications (such as the
memory information) that the boot loader made to the pristine kbuild
generated dtb.  So, with kexec its preferable to use the existing dtb
even with a newer kernel, or if not go through a full reboot to use a
new dtb.

Whether anyone uses kexec on this platform though would be an entirely
separate question, but I'm generally not in favour of breaking
compatibility with already merged DT without good reason.

You do seem to be correct that this only applies (in mainline) to the
ZII rev C board, so I guess including a patch to update its dts for
the inter-switch connection would at least be sensible.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: [PATCH] selftests: net: Adding config fragment CONFIG_CGROUP_BPF=y

2017-12-08 Thread Daniel Borkmann
On 12/07/2017 08:23 PM, Naresh Kamboju wrote:
> CONFIG_CGROUP_BPF=y is required for test_dev_cgroup test case.
> 
> Signed-off-by: Naresh Kamboju 
> ---
>  tools/testing/selftests/net/config | 1 +

Did you mean to add this to tools/testing/selftests/bpf/config instead?

test_dev_cgroup is under bpf selftests. Please respin for the correct
test suite.

>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/testing/selftests/net/config 
> b/tools/testing/selftests/net/config
> index e57b4ac..02301c6 100644
> --- a/tools/testing/selftests/net/config
> +++ b/tools/testing/selftests/net/config
> @@ -1,3 +1,4 @@
>  CONFIG_USER_NS=y
>  CONFIG_BPF_SYSCALL=y
>  CONFIG_TEST_BPF=m
> +CONFIG_CGROUP_BPF=y
> 



Re: [PATCH v5 net-next] net/tcp: trace all TCP/IP state transition with tcp_set_state tracepoint

2017-12-08 Thread David Miller
From: Yafang Shao 
Date: Fri, 8 Dec 2017 23:50:44 +0800

> 2017-12-08 23:42 GMT+08:00 David Miller :
>> From: Yafang Shao 
>> Date: Fri, 8 Dec 2017 11:40:23 +0800
>>
>>> It will looks like these,
>>>
>>> if (sk->sk_protocol == IPPROTO_TCP)
>>> __tcp_set_state(newsk, TCP_SYN_RECV);
>>> else
>>> newsk->sk_state = TCP_SYN_RECV;
>>>
>>>
>>> if (sk->sk_protocol == IPPROTO_TCP)
>>>   __tcp_set_state(sk, TCP_CLOSE);
>>> else
>>>   sk->sk_state = TCP_CLOSE;
>>>
>>> if (sk->sk_protocol == IPPROTO_TCP)
>>>   tcp_state_store(sk,  state);
>>> else
>>>   sk_state_store(sk, state);
>>>
>>>
>>> Some redundant code.
>>>
>>> IMO, put these similar code into a wrapper is more nice.
>>
>> I think this discussion and how ugly this is getting shows that
>> tracing the state transitions of a socket is perhaps not best as a TCP
>> specific feature.
> 
> Do you mean that tcp_set_state tracepoint should be replaced with
> sk_set_state tracepoint and move that tracepoint to
> trace/events/sock.h ?

Yes, something like that.

It will avoid all of these protocol specific checks and weird
dependencies.


Re: [PATCH] net: dsa: allow XAUI phy interface mode

2017-12-08 Thread Andrew Lunn
On Fri, Dec 08, 2017 at 04:04:59PM +, Russell King wrote:
> XGMII is a 32-bit bus plus two clock signals per direction.  XAUI is
> four serial lanes per direction.  The 88e6190 supports XAUI but not
> XGMII as it doesn't have enough pins.  The same is true of 88e6176.
> 
> Match on PHY_INTERFACE_MODE_XAUI for the XAUI port type, but keep
> accepting XGMII for backwards compatibility.

Hi Russell

The backwards compatibility is for the ZII devel C, the DSA link
between the two switches? I don't think there are any other boards
using this.

I will ask around, but i don't think we need to worry about backwards
compatibility. There is no region of flash set aside for the DT blob,
etc. I always TFTP boot using the one from the kernel, etc.

  Andrew


Re: [PATCH net-next 5/6] net: qualcomm: rmnet: Allow to configure flags for new devices

2017-12-08 Thread Dan Williams
On Sun, 2017-12-03 at 23:37 -0700, Subash Abhinov Kasiviswanathan
wrote:
> Add an option to configure the rmnet aggregation and command features
> on device creation. This is achieved by using the vlan flags option.

Does this overload the VLAN flags item with different meanings than
VLAN_FLAG_* that are specific to rmnet?

Dan

> Signed-off-by: Subash Abhinov Kasiviswanathan
> 
> ---
>  drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 16
> +---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> index 5e530db..2f5f661 100644
> --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> @@ -177,11 +177,20 @@ static int rmnet_newlink(struct net *src_net,
> struct net_device *dev,
>   if (err)
>   goto err2;
>  
> - netdev_dbg(dev, "data format [ingress 0x%08X]\n",
> ingress_format);
> - port->ingress_data_format = ingress_format;
>   port->rmnet_mode = mode;
>  
>   hlist_add_head_rcu(>hlnode, >muxed_ep[mux_id]);
> +
> + if (data[IFLA_VLAN_FLAGS]) {
> + struct ifla_vlan_flags *flags;
> +
> + flags = nla_data(data[IFLA_VLAN_FLAGS]);
> + ingress_format = flags->flags & flags->mask;
> + }
> +
> + netdev_dbg(dev, "data format [ingress 0x%08X]\n",
> ingress_format);
> + port->ingress_data_format = ingress_format;
> +
>   return 0;
>  
>  err2:
> @@ -312,7 +321,8 @@ static int rmnet_rtnl_validate(struct nlattr
> *tb[], struct nlattr *data[],
>  
>  static size_t rmnet_get_size(const struct net_device *dev)
>  {
> - return nla_total_size(2); /* IFLA_VLAN_ID */
> + return nla_total_size(2) /* IFLA_VLAN_ID */ +
> + nla_total_size(sizeof(struct ifla_vlan_flags)); /*
> IFLA_VLAN_FLAGS */
>  }
>  
>  struct rtnl_link_ops rmnet_link_ops __read_mostly = {


RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread David Laight
From: Xin Long
> Sent: 08 December 2017 16:18
> 
...
> >> Alternatively you could preform the dereference in two steps (i.e. declare 
> >> an si
> >> pointer on the stack and set it equal to asoc->stream.si, then deref
> >> si->make_datafrag at call time.  That will at least give the compiler an
> >> opportunity to preload the first pointer.

You want to save the function pointer itself.

...
> Another small difference:
>   as you can see, comparing to (X), (Y) is using 0x28(%rsp) in the loop,
>   instead of %r13.
> 
> So that's what I can see from the related generated code.
> If 0x848(%r13) is not worse than 0x28(%rsp) for cpu, I think
> asoc->stream.si->make_datafrag() is even better. No ?

That code must have far too many life local variables.
Otherwise there's be a caller saved register available.

David



RE: [PATCH net 3/3] hv_netvsc: Fix the default receive buffer size

2017-12-08 Thread Haiyang Zhang


> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Friday, December 8, 2017 5:45 AM
> To: Stephen Hemminger 
> Cc: KY Srinivasan ; Haiyang Zhang
> ; Stephen Hemminger
> ; de...@linuxdriverproject.org;
> netdev@vger.kernel.org
> Subject: Re: [PATCH net 3/3] hv_netvsc: Fix the default receive buffer size
> 
> On Thu, Dec 07, 2017 at 04:10:55PM -0800, Stephen Hemminger wrote:
> > From: Haiyang Zhang 
> >
> > The intended size is 16 MB, and the default slot size is 1728.
> > So, NETVSC_DEFAULT_RX should be 16*1024*1024 / 1728 = 9709.
> >
> > Fixes: 5023a6db73196 ("netvsc: increase default receive buffer size")
> > Signed-off-by: Haiyang Zhang 
> > Signed-off-by: Stephen Hemminger 
> > ---
> >  drivers/net/hyperv/netvsc_drv.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/hyperv/netvsc_drv.c
> > b/drivers/net/hyperv/netvsc_drv.c index dc70de674ca9..edfcde5d3621
> > 100644
> > --- a/drivers/net/hyperv/netvsc_drv.c
> > +++ b/drivers/net/hyperv/netvsc_drv.c
> > @@ -50,7 +50,7 @@
> >  #define NETVSC_MIN_TX_SECTIONS 10
> >  #define NETVSC_DEFAULT_TX  192 /* ~1M */
> >  #define NETVSC_MIN_RX_SECTIONS 10  /* ~64K */
> > -#define NETVSC_DEFAULT_RX  10485   /* Max ~16M */
> > +#define NETVSC_DEFAULT_RX  9709/* ~16M */
> 
> How does this bug look like to the user?  Memory corruption?
> 
> It's weird to me reviewing this code that the default sizes are stored in
> netvsc_drv.c and the max sizes are stored in hyperv_net.h.
> Could we move these to hyperv_net.h?  We could write it like:
> #define NETVSC_DEFAULT_RX ((16 * 1024 * 1024) /
> NETVSC_RECV_SECTION_SIZE)
> 
> 16MB is sort of a weird default because it's larger than the 15MB allowed for
> legacy versions, but it's smaller than the 32MB you'd want for the current
> versions.

As a tradeoff between traffic burst and latency, we set the default to be 16MB.
And this default is reduced automatically for legacy hosts based on the NVSP
version in patch 2. 

I will move the change to hyperv_net.h

Thanks,
- Haiyang



Re: [PATCH RFC 0/4] Fixes for Marvell MII paged register access races

2017-12-08 Thread Andrew Lunn
Hi Russell

> There is an open question whether there should be generic helpers for
> this.  Generic helpers would mean:
> 
> - Additional couple of function pointers in phy_driver to read/write the
>   paging register.  This has the restriction that there must only be one
>   paging register.

I must be missing something. I don't see why there is this
restriction. Don't we just need

int phy_get_page(phydev);
int phy_set_page(phydev, page);

We might even be able to do without phy_get_page(), if we assume page
0 should be selected by default, and all paged reads/write return to
page 0 afterwards.

> - The helpers become more expensive, and because they're in a separate
>   compilation unit, the compiler will be unable to optimise them by
>   inlining the static functions.

The mdio bus is slow. Often there is a completion function triggered
by an interrupt etc. Worst case it is bit-banging. I suspect the gains
from inlining are just noise in the bigger picture.

> - The helpers would be re-usable, saving replications of that code, and
>   making it more likely for phy authors to safely access the PHY.

This is the key point for me. This has likely to of been broken for
years. If it is obviously broken, driver writers are more likely to
get it right. Here it is not obvious, so we should take it out of the
driver writers hands and do it in the core.

> Another potential question is whether using the mdiobus lock (which
> excludes all other MII bus access) is best - while it has the advantage
> of also ensuring atomicity with userspace accesses, it means that no one
> else can access an independent PHY on the same bus while a paged access
> is on-going.  It feels like a big hammer, but I'm not convinced that we
> will see a lot of contention on it.

I think you are right about there being little contention on the lock.
I suspect most paged accesses are performed during initial setup and
configuration. I guess the once per second poll does not need to use
paged registers. And the weight of that hammer can be reduced a lot by
using interrupts instead of polling.

  Andrew


Re: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread Xin Long
On Sat, Dec 9, 2017 at 12:00 AM, Marcelo Ricardo Leitner
 wrote:
> On Fri, Dec 08, 2017 at 10:37:34AM -0500, Neil Horman wrote:
>> On Fri, Dec 08, 2017 at 12:56:30PM -0200, Marcelo Ricardo Leitner wrote:
>> > On Fri, Dec 08, 2017 at 02:06:04PM +, David Laight wrote:
>> > > From: Xin Long
>> > > > Sent: 08 December 2017 13:04
>> > > ...
>> > > > @@ -264,8 +264,8 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct 
>> > > > sctp_association *asoc,
>> > > > frag |= SCTP_DATA_SACK_IMM;
>> > > > }
>> > > >
>> > > > -   chunk = sctp_make_datafrag_empty(asoc, sinfo, len, 
>> > > > frag,
>> > > > -0, GFP_KERNEL);
>> > > > +   chunk = asoc->stream.si->make_datafrag(asoc, sinfo, 
>> > > > len, frag,
>> > > > +  GFP_KERNEL);
>> > >
>> > > I know that none of the sctp code is very optimised, but that indirect
>> > > call is going to be horrid.
>> >
>> > Yeah.. but there is no way to avoid the double derreference
>> > considering we only have the asoc pointer in there and we have to
>> > reach the contents of the data chunk operations struct, and the .si
>> > part is the same as 'stream' part as it's a constant offset.
>> >
>> > Due to the for() in there, we could add a variable to store
>> > asoc->stream.si outside the for and then we can do only a single deref
>> > inside it. Xin, can you please try and see if the generated code is
>> > different?
>> >
>> > Other suggestions?
>> >
>> Is it worth replacing the si struct with an index/enum value, and indexing an
>> array of method pointer structs?  That would save you at least one 
>> dereference.
>
> Hmmm, maybe, yes. It would be like
> sctp_stream_interleave[asoc->stream.si].make_datafrag(...)
>
> Then same goes for pf->af, probably.
>
>>
>> Alternatively you could preform the dereference in two steps (i.e. declare 
>> an si
>> pointer on the stack and set it equal to asoc->stream.si, then deref
>> si->make_datafrag at call time.  That will at least give the compiler an
>> opportunity to preload the first pointer.
>
> Yep, that was my 2nd paragraph above :-) but it only works for cases
> such as this one.

Now:
  for(N) {
...
chunk = asoc->stream.si->make_datafrag(asoc, sinfo, len, frag,
 0xfb58 <+360>: mov0x848(%r13),%rax  < [a]
 0xfb5f <+367>: movzbl %cl,%ecx
 0xfb62 <+370>: mov$0x14000c0,%r8d
 0xfb68 <+376>: mov%r12d,%edx
 0xfb6b <+379>: mov(%rsp),%rsi
 0xfb6f <+383>: mov%r13,%rdi  <=(X)
 0xfb72 <+386>: callq  *0x8(%rax)  < [b]
 0xfb78 <+392>: mov%rax,%r15
   }

   ret = N * ([a] + [b])


After using a variable:
  struct sctp_stream_interleave *si;
  ...
  si = asoc->stream.si;
 0xfb44 <+340>: mov0x848(%r14),%rax
 0xfb4e <+350>: mov%rax,0x20(%rsp) <- [1]

  for(N) {
...
chunk = si->make_datafrag(asoc, sinfo, len, frag, GFP_KERNEL);
 0xfb69 <+377>: mov0x20(%rsp),%rax <- [2]
 0xfb6e <+382>: movzbl %cl,%ecx
 0xfb71 <+385>: mov$0x14000c0,%r8d
 0xfb77 <+391>: mov%r12d,%edx
 0xfb7a <+394>: mov(%rsp),%rsi
 0xfb7e <+398>: mov0x28(%rsp),%rdi <=(Y)
 0xfb83 <+403>: callq  *0x8(%rax) <- [3]
 0xfb89 <+409>: mov%rax,%r14
   }

   ret = [1] + N * ([2] + [3])


Another small difference:
  as you can see, comparing to (X), (Y) is using 0x28(%rsp) in the loop,
  instead of %r13.

So that's what I can see from the related generated code.
If 0x848(%r13) is not worse than 0x28(%rsp) for cpu, I think
asoc->stream.si->make_datafrag() is even better. No ?


RE: [PATCH net 1/3] hv_netvsc: Correct the max receive buffer size

2017-12-08 Thread Haiyang Zhang


> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Friday, December 8, 2017 5:33 AM
> To: Stephen Hemminger 
> Cc: KY Srinivasan ; Haiyang Zhang
> ; Stephen Hemminger
> ; de...@linuxdriverproject.org;
> netdev@vger.kernel.org
> Subject: Re: [PATCH net 1/3] hv_netvsc: Correct the max receive buffer size
> 
> On Thu, Dec 07, 2017 at 04:10:53PM -0800, Stephen Hemminger wrote:
> > From: Haiyang Zhang 
> >
> > It should be 31 MB on recent host versions.
> >
> > Signed-off-by: Haiyang Zhang 
> > Signed-off-by: Stephen Hemminger 
> 
> This is very vague.  What does "recent" mean in this context?  There are also
> some unrelated white space changes here which make the patch harder to
> read.
> 
> This patch kind of makes the bug fixed by patch 2 even worse because
> before the receive buffer was capped at around 16MB and now we can set
> the receive buffer to 31MB.  It might make sense to fold the two patches
> together.
> 
> Is patch 2 a memory corruption bug?  The changelog doesn't really say what
> the user visible effects of the bug are.  Basically if you make the buffer too
> small then it's a performance issue but if you make it too large what happens?
> It's not clear to me.

For NVSP v2, and earlier host, the limit is 15MB. Later hosts, the limit is 
31MB.
Setting beyond it will be denied by the host, resulting the vNIC doesn't come 
up.

I will merge this one together with the patch 2.

Thanks,
- Haiyang


Re: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread Neil Horman
On Fri, Dec 08, 2017 at 04:04:58PM +, David Laight wrote:
> From: Marcelo Ricardo Leitner
> > Sent: 08 December 2017 16:00
> ...
> > > Is it worth replacing the si struct with an index/enum value, and 
> > > indexing an
> > > array of method pointer structs?  That would save you at least one 
> > > dereference.
> > 
> > Hmmm, maybe, yes. It would be like
> > sctp_stream_interleave[asoc->stream.si].make_datafrag(...)
> 
> If you only expect 2 choices then an if () is likely
> to produce better code that the above.
> 
> The actual implementation can be hidden inside a #define
> or static inline function.
> 
Thats the real question though, will we expect more than two interleaving
strategies?  Currently its a boolean operation so the answer seems like yes, but
is there a possiblity of a biased interleaving, or other worthwhile algorithm?

Neil

>   David
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: pull-request: bpf-next 2017-12-07

2017-12-08 Thread David Miller
From: Alexei Starovoitov 
Date: Thu, 7 Dec 2017 22:22:11 -0800

> The following pull-request contains BPF updates for your net-next tree.
> 
> The main changes are:
> 
> 1) Detailed documentation of BPF development process from Daniel.
> 
> 2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops
>from Lawrence.
> 
> 3) Minor follow up for bpf_skb_set_tunnel_key() from William.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Pulled, thanks Alexei.


[PATCH] net: dsa: allow XAUI phy interface mode

2017-12-08 Thread Russell King
XGMII is a 32-bit bus plus two clock signals per direction.  XAUI is
four serial lanes per direction.  The 88e6190 supports XAUI but not
XGMII as it doesn't have enough pins.  The same is true of 88e6176.

Match on PHY_INTERFACE_MODE_XAUI for the XAUI port type, but keep
accepting XGMII for backwards compatibility.

Signed-off-by: Russell King 
---
 drivers/net/dsa/mv88e6xxx/port.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c
index a7801f6668a5..6315774d72b3 100644
--- a/drivers/net/dsa/mv88e6xxx/port.c
+++ b/drivers/net/dsa/mv88e6xxx/port.c
@@ -338,6 +338,7 @@ int mv88e6390x_port_set_cmode(struct mv88e6xxx_chip *chip, 
int port,
cmode = MV88E6XXX_PORT_STS_CMODE_2500BASEX;
break;
case PHY_INTERFACE_MODE_XGMII:
+   case PHY_INTERFACE_MODE_XAUI:
cmode = MV88E6XXX_PORT_STS_CMODE_XAUI;
break;
case PHY_INTERFACE_MODE_RXAUI:
-- 
2.7.4



RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread David Laight
From: Marcelo Ricardo Leitner
> Sent: 08 December 2017 16:00
...
> > Is it worth replacing the si struct with an index/enum value, and indexing 
> > an
> > array of method pointer structs?  That would save you at least one 
> > dereference.
> 
> Hmmm, maybe, yes. It would be like
> sctp_stream_interleave[asoc->stream.si].make_datafrag(...)

If you only expect 2 choices then an if () is likely
to produce better code that the above.

The actual implementation can be hidden inside a #define
or static inline function.

David



Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected

2017-12-08 Thread Willem de Bruijn
On Fri, Dec 8, 2017 at 6:40 AM, Michal Kubecek  wrote:
> On Fri, Dec 08, 2017 at 11:31:50AM +0100, Andreas Hartmann wrote:
>> On 12/08/2017 at 09:47 AM Michal Kubecek wrote:
>> > On Fri, Dec 08, 2017 at 08:21:16AM +0100, Andreas Hartmann wrote:
>> >>
>> >> All my VMs are using virtio_net. BTW: I couldn't see the problems
>> >> (sometimes, the VM couldn't be stopped at all) if all my VMs are using
>> >> e1000 as interface instead.
>> >>
>> >> This finding now matches pretty much the responsible UDP-package which
>> >> caused the stall. I already mentioned it here [2].
>> >>
>> >> To prove it, I reverted from the patch series "[PATCH v2 RFC 0/13]
>> >> Remove UDP Fragmentation Offload support" [3]
>> >>
>> >> 11/13 [v2,RFC,11/13] net: Remove all references to SKB_GSO_UDP. [4]
>> >> 12/13 [v2,RFC,12/13] inet: Remove software UFO fragmenting code. [5]
>> >> 13/13 [v2,RFC,13/13] net: Kill NETIF_F_UFO and SKB_GSO_UDP. [6]
>> >>
>> >> and applied it to Linux 4.14.4. It compiled fine and is running fine.
>> >> The vnet doesn't die anymore. Yet, I can't say if the qemu stop hangs
>> >> are gone, too.
>> >>
>> >> Obviously, there is something broken with the new UDP handling. Could
>> >> you please analyze this problem? I could test some more patches ... .
>> >
>> > Any chance your VMs were live migrated from pre-4.14 host kernel?
>>
>> No - the VMs are not live migrated. They are always running on the same
>> host - either with kernel < 4.14 or with kernel 4.14.x.
>
> This is disturbing... unless I'm mistaken, it shouldn't be possible to
> have UFO enabled on a virtio device in a VM booted on a host with 4.14
> kernel.

Indeed. When working on that revert patch I verified that UFO in
the guest virtio_net was off before the revert patch, on after.

Qemu should check host support with tap_probe_has_ufo
before advertising support to the guest. Indeed, this is exactly
what broke live migration in virtio_net_load_device at

if (qemu_get_byte(f) && !peer_has_ufo(n)) {
error_report("virtio-net: saved image requires TUN_F_UFO support");
return -1;
}

Which follows

   peer_has_ufo
 qemu_has_ufo
   tap_has_ufo
 s->has_ufo

where s->has_ufo was set by tap_probe_has_ufo in net_tap_fd_init.

Now, checking my qemu git branch, I ran pretty old 2.7.0-rc3. But this
codepath does not seem to have changed between then and 2.10.1.

I cherry-picked the revert onto 4.14.3. It did not apply cleanly, but the
fix-up wasn't too hard. Compiled and booted, but untested otherwise. At

  https://github.com/wdebruij/linux/commits/v4.14.3-aargh-ufo


>
>> > If this is the case, you should try commit 0c19f846d582 ("net:
>> > accept UFO datagrams from tuntap and packet").
>>
>> It doesn't apply to 4.14.4
>>
>> > Or disabling UFO in the guest should
>> > work around the issue.
>>
>> ethtool -K ethX ufo off for each device / bridge in VM.
>>
>> Yes, this seems to work. I'll wait and see if the non stoppable
>> qemu-problem on shutdown will remain.
>>
>> When will there be a fix for 4.14? It is clearly a regression. Is it
>> possible / a good idea to just remove the complete patch series "Remove
>> UDP Fragmentation Offload support"?
>
> I cannot give an exact date but the patch is queued for stable
> (see http://patchwork.ozlabs.org/bundle/davem/stable/?state=* ) so that
> it should land in stable-4.14 in near future (weeks at most).
>
>  Michal Kubecek
>


  1   2   3   >