[PATCH net-next v2 05/10] bnxt_en: Fix ethtool -l|-L inconsistent channel counts.

2016-09-19 Thread Michael Chan
The existing code is inconsistent in reporting and accepting the combined
channel count.  bnxt_get_channels() reports the maximum combined count as the
maximum rx count.  bnxt_set_channels() accepts a combined count that
cannot be bigger than max rx or max tx.

For example, if max rx = 2 and max tx = 1, we report the maximum supported
combined count as 2.  But if the user tries to set combined to 2, it will
fail because 2 is bigger than max tx, which is 1.

Fix the code to be consistent.  Max allowed combined = max(max_rx, max_tx).
We will accept a combined channel count <= max(max_rx, max_tx).
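
To make the arithmetic concrete, here is a small standalone sketch (an
illustration only, not driver code) that walks through the max_rx = 2,
max_tx = 1 example under the new policy; the old check would have rejected
this request because 2 > max_tx:

#include <stdio.h>

#define max_t(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

int main(void)
{
	int max_rx = 2, max_tx = 1;
	int combined = 2;	/* user runs: ethtool -L <dev> combined 2 */

	/* New policy: accept combined <= max(max_rx, max_tx). */
	if (combined > max_t(int, max_rx, max_tx)) {
		puts("rejected");
		return 1;
	}

	/* Clamp each direction to its own maximum. */
	printf("rx rings = %d, tx rings per TC = %d\n",
	       min_t(int, combined, max_rx),	/* 2 */
	       min_t(int, combined, max_tx));	/* 1 */
	return 0;
}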

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index ae4458d..de893c9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -348,7 +348,7 @@ static void bnxt_get_channels(struct net_device *dev,
int max_rx_rings, max_tx_rings, tcs;
 
bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, true);
-   channel->max_combined = max_rx_rings;
+   channel->max_combined = max_t(int, max_rx_rings, max_tx_rings);
 
if (bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, false)) {
max_rx_rings = 0;
@@ -406,8 +406,8 @@ static int bnxt_set_channels(struct net_device *dev,
if (tcs > 1)
max_tx_rings /= tcs;
 
-   if (sh && (channel->combined_count > max_rx_rings ||
-  channel->combined_count > max_tx_rings))
+   if (sh &&
+   channel->combined_count > max_t(int, max_rx_rings, max_tx_rings))
return -ENOMEM;
 
if (!sh && (channel->rx_count > max_rx_rings ||
@@ -430,8 +430,10 @@ static int bnxt_set_channels(struct net_device *dev,
 
if (sh) {
bp->flags |= BNXT_FLAG_SHARED_RINGS;
-   bp->rx_nr_rings = channel->combined_count;
-   bp->tx_nr_rings_per_tc = channel->combined_count;
+   bp->rx_nr_rings = min_t(int, channel->combined_count,
+   max_rx_rings);
+   bp->tx_nr_rings_per_tc = min_t(int, channel->combined_count,
+  max_tx_rings);
} else {
bp->flags &= ~BNXT_FLAG_SHARED_RINGS;
bp->rx_nr_rings = channel->rx_count;
-- 
1.8.3.1



[PATCH net-next v2 09/10] bnxt_en: Support for "ethtool -r" command

2016-09-19 Thread Michael Chan
From: Deepak Khungar 

Restart autoneg if autoneg is enabled.

Signed-off-by: Deepak Khungar 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index de893c9..0ee4d7b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -1849,6 +1849,25 @@ static int bnxt_get_module_eeprom(struct net_device *dev,
return rc;
 }
 
+static int bnxt_nway_reset(struct net_device *dev)
+{
+   int rc = 0;
+
+   struct bnxt *bp = netdev_priv(dev);
+   struct bnxt_link_info *link_info = &bp->link_info;
+
+   if (!BNXT_SINGLE_PF(bp))
+   return -EOPNOTSUPP;
+
+   if (!(link_info->autoneg & BNXT_AUTONEG_SPEED))
+   return -EINVAL;
+
+   if (netif_running(dev))
+   rc = bnxt_hwrm_set_link_setting(bp, true, false);
+
+   return rc;
+}
+
 const struct ethtool_ops bnxt_ethtool_ops = {
.get_link_ksettings = bnxt_get_link_ksettings,
.set_link_ksettings = bnxt_set_link_ksettings,
@@ -1881,4 +1900,5 @@ const struct ethtool_ops bnxt_ethtool_ops = {
.set_eee= bnxt_set_eee,
.get_module_info= bnxt_get_module_info,
.get_module_eeprom  = bnxt_get_module_eeprom,
+   .nway_reset = bnxt_nway_reset
 };
-- 
1.8.3.1



[PATCH net-next v2 10/10] bnxt_en: Fixed the VF link status after a link state change

2016-09-19 Thread Michael Chan
From: Eddie Wai 

The VF link state can be changed via the 'ip link set' command.
Currently, the new link state does not take effect immediately.

The fix is for the PF to send a link change async event to the
designated VF after a VF link state change.  This async event will
trigger the VF to update the link status.

Signed-off-by: Eddie Wai 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 84 -
 1 file changed, 42 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
index 50d2007..8be7185 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
@@ -19,6 +19,45 @@
 #include "bnxt_ethtool.h"
 
 #ifdef CONFIG_BNXT_SRIOV
+static int bnxt_hwrm_fwd_async_event_cmpl(struct bnxt *bp,
+ struct bnxt_vf_info *vf, u16 event_id)
+{
+   struct hwrm_fwd_async_event_cmpl_output *resp = bp->hwrm_cmd_resp_addr;
+   struct hwrm_fwd_async_event_cmpl_input req = {0};
+   struct hwrm_async_event_cmpl *async_cmpl;
+   int rc = 0;
+
+   bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FWD_ASYNC_EVENT_CMPL, -1, -1);
+   if (vf)
+   req.encap_async_event_target_id = cpu_to_le16(vf->fw_fid);
+   else
+   /* broadcast this async event to all VFs */
+   req.encap_async_event_target_id = cpu_to_le16(0xffff);
+   async_cmpl = (struct hwrm_async_event_cmpl *)req.encap_async_event_cmpl;
+   async_cmpl->type =
+   cpu_to_le16(HWRM_ASYNC_EVENT_CMPL_TYPE_HWRM_ASYNC_EVENT);
+   async_cmpl->event_id = cpu_to_le16(event_id);
+
+   mutex_lock(&bp->hwrm_cmd_lock);
+   rc = _hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+
+   if (rc) {
+   netdev_err(bp->dev, "hwrm_fwd_async_event_cmpl failed. rc:%d\n",
+  rc);
+   goto fwd_async_event_cmpl_exit;
+   }
+
+   if (resp->error_code) {
+   netdev_err(bp->dev, "hwrm_fwd_async_event_cmpl error %d\n",
+  resp->error_code);
+   rc = -1;
+   }
+
+fwd_async_event_cmpl_exit:
+   mutex_unlock(&bp->hwrm_cmd_lock);
+   return rc;
+}
+
 static int bnxt_vf_ndo_prep(struct bnxt *bp, int vf_id)
 {
if (!test_bit(BNXT_STATE_OPEN, &bp->state)) {
@@ -243,8 +282,9 @@ int bnxt_set_vf_link_state(struct net_device *dev, int 
vf_id, int link)
rc = -EINVAL;
break;
}
-   /* CHIMP TODO: send msg to VF to update new link state */
-
+   if (vf->flags & (BNXT_VF_LINK_UP | BNXT_VF_LINK_FORCED))
+   rc = bnxt_hwrm_fwd_async_event_cmpl(bp, vf,
+   HWRM_ASYNC_EVENT_CMPL_EVENT_ID_LINK_STATUS_CHANGE);
return rc;
 }
 
@@ -525,46 +565,6 @@ err_out1:
return rc;
 }
 
-static int bnxt_hwrm_fwd_async_event_cmpl(struct bnxt *bp,
- struct bnxt_vf_info *vf,
- u16 event_id)
-{
-   int rc = 0;
-   struct hwrm_fwd_async_event_cmpl_input req = {0};
-   struct hwrm_fwd_async_event_cmpl_output *resp = bp->hwrm_cmd_resp_addr;
-   struct hwrm_async_event_cmpl *async_cmpl;
-
-   bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FWD_ASYNC_EVENT_CMPL, -1, -1);
-   if (vf)
-   req.encap_async_event_target_id = cpu_to_le16(vf->fw_fid);
-   else
-   /* broadcast this async event to all VFs */
-   req.encap_async_event_target_id = cpu_to_le16(0xffff);
-   async_cmpl = (struct hwrm_async_event_cmpl *)req.encap_async_event_cmpl;
-   async_cmpl->type =
-   cpu_to_le16(HWRM_ASYNC_EVENT_CMPL_TYPE_HWRM_ASYNC_EVENT);
-   async_cmpl->event_id = cpu_to_le16(event_id);
-
-   mutex_lock(&bp->hwrm_cmd_lock);
-   rc = _hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
-
-   if (rc) {
-   netdev_err(bp->dev, "hwrm_fwd_async_event_cmpl failed. rc:%d\n",
-  rc);
-   goto fwd_async_event_cmpl_exit;
-   }
-
-   if (resp->error_code) {
-   netdev_err(bp->dev, "hwrm_fwd_async_event_cmpl error %d\n",
-  resp->error_code);
-   rc = -1;
-   }
-
-fwd_async_event_cmpl_exit:
-   mutex_unlock(&bp->hwrm_cmd_lock);
-   return rc;
-}
-
 void bnxt_sriov_disable(struct bnxt *bp)
 {
u16 num_vfs = pci_num_vf(bp->pdev);
-- 
1.8.3.1



[PATCH net-next v2 04/10] bnxt_en: Added support for Secure Firmware Update

2016-09-19 Thread Michael Chan
From: Rob Swindell 

Using Ethtool flashdev command, entire NVM package (*.pkg) files
may now be staged into the "update" area of the NVM and subsequently
verified and installed by the firmware using the newly introduced
command: NVM_INSTALL_UPDATE.
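
For context, a minimal userspace sketch of what drives this path: an
ETHTOOL_FLASHDEV request issued over SIOCETHTOOL.  This is not part of the
patch; the interface name and package file name are placeholders, and the
region is left at ETHTOOL_FLASH_ALL_REGIONS purely for illustration since
region handling is device-specific.  The named file is resolved through the
kernel's firmware loader (typically under /lib/firmware).

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_flash flash = {
		.cmd	= ETHTOOL_FLASHDEV,
		.region	= ETHTOOL_FLASH_ALL_REGIONS,
	};
	struct ifreq ifr;
	int fd, rc = 0;

	memset(&ifr, 0, sizeof(ifr));
	/* "eth0" and "update.pkg" are placeholders. */
	strncpy(flash.data, "update.pkg", sizeof(flash.data) - 1);
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&flash;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_FLASHDEV");
		rc = 1;
	}
	if (fd >= 0)
		close(fd);
	return rc;
}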

We also introduce use of the new firmware command FW_SET_TIME so that the
NVM-resident package installation log contains valid time-stamps.

v2: Fixed undefined rtc_time64_to_tm error on kernels that don't have
CONFIG_RTC_LIB.  The timestamp is optional, so don't provide it if not
available.  Use '!' instead of '== 0'.

Signed-off-by: Rob Swindell 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  28 
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |   1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 155 --
 drivers/net/ethernet/broadcom/bnxt/bnxt_fw_hdr.h  |  16 ++-
 4 files changed, 186 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index f6b4f34..088b922 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -4314,6 +4315,31 @@ hwrm_ver_get_exit:
return rc;
 }
 
+int bnxt_hwrm_fw_set_time(struct bnxt *bp)
+{
+#if IS_ENABLED(CONFIG_RTC_LIB)
+   struct hwrm_fw_set_time_input req = {0};
+   struct rtc_time tm;
+   struct timeval tv;
+
+   if (bp->hwrm_spec_code < 0x10400)
+   return -EOPNOTSUPP;
+
+   do_gettimeofday(&tv);
+   rtc_time_to_tm(tv.tv_sec, &tm);
+   bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FW_SET_TIME, -1, -1);
+   req.year = cpu_to_le16(1900 + tm.tm_year);
+   req.month = 1 + tm.tm_mon;
+   req.day = tm.tm_mday;
+   req.hour = tm.tm_hour;
+   req.minute = tm.tm_min;
+   req.second = tm.tm_sec;
+   return hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+#else
+   return -EOPNOTSUPP;
+#endif
+}
+
 static int bnxt_hwrm_port_qstats(struct bnxt *bp)
 {
int rc;
@@ -6811,6 +6837,8 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if (rc)
goto init_err;
 
+   bnxt_hwrm_fw_set_time(bp);
+
dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_SG |
   NETIF_F_TSO | NETIF_F_TSO6 |
   NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_GRE |
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 012cc51..41033d0 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1220,6 +1220,7 @@ int bnxt_hwrm_set_coal(struct bnxt *);
 int bnxt_hwrm_func_qcaps(struct bnxt *);
 int bnxt_hwrm_set_pause(struct bnxt *);
 int bnxt_hwrm_set_link_setting(struct bnxt *, bool, bool);
+int bnxt_hwrm_fw_set_time(struct bnxt *);
 int bnxt_open_nic(struct bnxt *, bool, bool);
 int bnxt_close_nic(struct bnxt *, bool, bool);
 int bnxt_get_max_rings(struct bnxt *, int *, int *, bool);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index b83e174..ae4458d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -21,6 +21,8 @@
 #include "bnxt_nvm_defs.h" /* NVRAM content constant and structure defs */
 #include "bnxt_fw_hdr.h"   /* Firmware hdr constant and structure defs */
 #define FLASH_NVRAM_TIMEOUT((HWRM_CMD_TIMEOUT) * 100)
+#define FLASH_PACKAGE_TIMEOUT  ((HWRM_CMD_TIMEOUT) * 200)
+#define INSTALL_PACKAGE_TIMEOUT((HWRM_CMD_TIMEOUT) * 200)
 
 static char *bnxt_get_pkgver(struct net_device *dev, char *buf, size_t buflen);
 
@@ -1028,6 +1030,10 @@ static u32 bnxt_get_link(struct net_device *dev)
return bp->link_info.link_up;
 }
 
+static int bnxt_find_nvram_item(struct net_device *dev, u16 type, u16 ordinal,
+   u16 ext, u16 *index, u32 *item_length,
+   u32 *data_length);
+
 static int bnxt_flash_nvram(struct net_device *dev,
u16 dir_type,
u16 dir_ordinal,
@@ -1179,7 +1185,6 @@ static int bnxt_flash_firmware(struct net_device *dev,
   (unsigned long)calculated_crc);
return -EINVAL;
}
-   /* TODO: Validate digital signature (RSA-encrypted SHA-256 hash) here */
rc = bnxt_flash_nvram(dev, dir_type, BNX_DIR_ORDINAL_FIRST,
  0, 0, fw_data, fw_size);
if (rc == 0)/* Firmware update successful */
@@ -1188,6 +1193,57 @@ static int bnxt_flash_firmware(struct net_device *dev,
return rc;
 }
 
+static int bnxt_flash_microcode(struct net_device *dev,
+   u16 dir_type,
+

[PATCH net-next v2 03/10] bnxt_en: Update to firmware interface spec 1.5.1.

2016-09-19 Thread Michael Chan
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |3 +
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |   14 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h | 1251 +++--
 3 files changed, 760 insertions(+), 508 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index d9b4cd1..f6b4f34 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4253,6 +4253,9 @@ static int bnxt_hwrm_queue_qportcfg(struct bnxt *bp)
if (bp->max_tc > BNXT_MAX_QUEUE)
bp->max_tc = BNXT_MAX_QUEUE;
 
+   if (resp->queue_cfg_info & QUEUE_QPORTCFG_RESP_QUEUE_CFG_INFO_ASYM_CFG)
+   bp->max_tc = 1;
+
qptr = &resp->queue_id0;
for (i = 0; i < bp->max_tc; i++) {
bp->q_info[i].queue_id = *qptr++;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index db4814e..012cc51 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -11,10 +11,10 @@
 #define BNXT_H
 
 #define DRV_MODULE_NAME"bnxt_en"
-#define DRV_MODULE_VERSION "1.3.0"
+#define DRV_MODULE_VERSION "1.5.0"
 
 #define DRV_VER_MAJ1
-#define DRV_VER_MIN3
+#define DRV_VER_MIN5
 #define DRV_VER_UPD0
 
 struct tx_bd {
@@ -106,11 +106,11 @@ struct tx_cmp {
 #define CMP_TYPE_REMOTE_DRIVER_REQ  34
 #define CMP_TYPE_REMOTE_DRIVER_RESP 36
 #define CMP_TYPE_ERROR_STATUS   48
-#define CMPL_BASE_TYPE_STAT_EJECT   (0x1aUL << 0)
-#define CMPL_BASE_TYPE_HWRM_DONE(0x20UL << 0)
-#define CMPL_BASE_TYPE_HWRM_FWD_REQ (0x22UL << 0)
-#define CMPL_BASE_TYPE_HWRM_FWD_RESP(0x24UL << 0)
-#define CMPL_BASE_TYPE_HWRM_ASYNC_EVENT (0x2eUL << 0)
+#define CMPL_BASE_TYPE_STAT_EJECT   0x1aUL
+#define CMPL_BASE_TYPE_HWRM_DONE0x20UL
+#define CMPL_BASE_TYPE_HWRM_FWD_REQ 0x22UL
+#define CMPL_BASE_TYPE_HWRM_FWD_RESP0x24UL
+#define CMPL_BASE_TYPE_HWRM_ASYNC_EVENT 0x2eUL
 
#define TX_CMP_FLAGS_ERROR  (1 << 6)
#define TX_CMP_FLAGS_PUSH   (1 << 7)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
index 517567f..04a96cc 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
@@ -39,7 +39,7 @@ struct eject_cmpl {
__le16 type;
#define EJECT_CMPL_TYPE_MASK0x3fUL
#define EJECT_CMPL_TYPE_SFT 0
-   #define EJECT_CMPL_TYPE_STAT_EJECT (0x1aUL << 0)
+   #define EJECT_CMPL_TYPE_STAT_EJECT 0x1aUL
__le16 len;
__le32 opaque;
__le32 v;
@@ -52,7 +52,7 @@ struct hwrm_cmpl {
__le16 type;
#define HWRM_CMPL_TYPE_MASK 0x3fUL
#define HWRM_CMPL_TYPE_SFT  0
-   #define HWRM_CMPL_TYPE_HWRM_DONE   (0x20UL << 0)
+   #define HWRM_CMPL_TYPE_HWRM_DONE   0x20UL
__le16 sequence_id;
__le32 unused_1;
__le32 v;
@@ -65,7 +65,7 @@ struct hwrm_fwd_req_cmpl {
__le16 req_len_type;
#define HWRM_FWD_REQ_CMPL_TYPE_MASK 0x3fUL
#define HWRM_FWD_REQ_CMPL_TYPE_SFT  0
-   #define HWRM_FWD_REQ_CMPL_TYPE_HWRM_FWD_REQ(0x22UL << 0)
+   #define HWRM_FWD_REQ_CMPL_TYPE_HWRM_FWD_REQ0x22UL
#define HWRM_FWD_REQ_CMPL_REQ_LEN_MASK  0xffc0UL
#define HWRM_FWD_REQ_CMPL_REQ_LEN_SFT   6
__le16 source_id;
@@ -81,7 +81,7 @@ struct hwrm_fwd_resp_cmpl {
__le16 type;
#define HWRM_FWD_RESP_CMPL_TYPE_MASK0x3fUL
#define HWRM_FWD_RESP_CMPL_TYPE_SFT 0
-   #define HWRM_FWD_RESP_CMPL_TYPE_HWRM_FWD_RESP  (0x24UL << 0)
+   #define HWRM_FWD_RESP_CMPL_TYPE_HWRM_FWD_RESP  0x24UL
__le16 source_id;
__le16 resp_len;
__le16 unused_1;
@@ -96,25 +96,26 @@ struct hwrm_async_event_cmpl {
__le16 type;
#define HWRM_ASYNC_EVENT_CMPL_TYPE_MASK 0x3fUL
#define HWRM_ASYNC_EVENT_CMPL_TYPE_SFT  0
-   #define HWRM_ASYNC_EVENT_CMPL_TYPE_HWRM_ASYNC_EVENT   (0x2eUL << 0)
+   #define HWRM_ASYNC_EVENT_CMPL_TYPE_HWRM_ASYNC_EV

[PATCH net-next v2 08/10] bnxt_en: Pad TX packets below 52 bytes.

2016-09-19 Thread Michael Chan
The hardware has a limitation that it won't pass host-to-BMC loopback
packets smaller than 52 bytes.
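
As a rough, userspace-only illustration of what raising the constant implies
(this is not the driver's TX path), frames shorter than BNXT_MIN_PKT_SIZE are
zero-padded up to that length before transmission:

#include <stdio.h>
#include <string.h>

#define BNXT_MIN_PKT_SIZE	52	/* was 45 before this patch */

static size_t pad_frame(unsigned char *buf, size_t len, size_t bufsize)
{
	if (len >= BNXT_MIN_PKT_SIZE || bufsize < BNXT_MIN_PKT_SIZE)
		return len;

	/* Zero-fill the tail so short host-to-BMC loopback frames are not
	 * dropped by the hardware.
	 */
	memset(buf + len, 0, BNXT_MIN_PKT_SIZE - len);
	return BNXT_MIN_PKT_SIZE;
}

int main(void)
{
	unsigned char frame[64] = { 0 };

	/* A 45-byte frame (the old minimum) now gets padded to 52 bytes. */
	printf("padded length = %zu\n", pad_frame(frame, 45, sizeof(frame)));
	return 0;
}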

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 41033d0..51b164a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -413,7 +413,7 @@ struct rx_tpa_end_cmp_ext {
 
 #define BNXT_RX_PAGE_SIZE (1 << BNXT_RX_PAGE_SHIFT)
 
-#define BNXT_MIN_PKT_SIZE  45
+#define BNXT_MIN_PKT_SIZE  52
 
 #define BNXT_NUM_TESTS(bp) 0
 
-- 
1.8.3.1



[PATCH net-next v2 01/10] bnxt_en: Use RSS flags defined in the bnxt_hsi.h file.

2016-09-19 Thread Michael Chan
And remove redundant definitions of the same flags.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 5 -
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 228c964..cee0e8d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3419,10 +3419,10 @@ static int bnxt_hwrm_vnic_set_rss(struct bnxt *bp, u16 
vnic_id, bool set_rss)
 
bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_VNIC_RSS_CFG, -1, -1);
if (set_rss) {
-   vnic->hash_type = BNXT_RSS_HASH_TYPE_FLAG_IPV4 |
-BNXT_RSS_HASH_TYPE_FLAG_TCP_IPV4 |
-BNXT_RSS_HASH_TYPE_FLAG_IPV6 |
-BNXT_RSS_HASH_TYPE_FLAG_TCP_IPV6;
+   vnic->hash_type = VNIC_RSS_CFG_REQ_HASH_TYPE_IPV4 |
+ VNIC_RSS_CFG_REQ_HASH_TYPE_TCP_IPV4 |
+ VNIC_RSS_CFG_REQ_HASH_TYPE_IPV6 |
+ VNIC_RSS_CFG_REQ_HASH_TYPE_TCP_IPV6;
 
req.hash_type = cpu_to_le32(vnic->hash_type);
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 23e04a6..db4814e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -389,11 +389,6 @@ struct rx_tpa_end_cmp_ext {
 
 #define INVALID_HW_RING_ID ((u16)-1)
 
-#define BNXT_RSS_HASH_TYPE_FLAG_IPV4   0x01
-#define BNXT_RSS_HASH_TYPE_FLAG_TCP_IPV4   0x02
-#define BNXT_RSS_HASH_TYPE_FLAG_IPV6   0x04
-#define BNXT_RSS_HASH_TYPE_FLAG_TCP_IPV6   0x08
-
 /* The hardware supports certain page sizes.  Use the supported page sizes
  * to allocate the rings.
  */
-- 
1.8.3.1



[PATCH net-next v2 00/10] bnxt: update for net-next.

2016-09-19 Thread Michael Chan
Misc. changes and minor bug fixes for net-next.  Please review.

v2: Updated "bnxt_en: Added support for Secure Firmware Update" patch.

Deepak Khungar (1):
  bnxt_en: Support for "ethtool -r" command

Eddie Wai (1):
  bnxt_en: Fixed the VF link status after a link state change

Michael Chan (7):
  bnxt_en: Use RSS flags defined in the bnxt_hsi.h file.
  bnxt_en: Simplify PCI device names and add additional PCI IDs.
  bnxt_en: Update to firmware interface spec 1.5.1.
  bnxt_en: Fix ethtool -l|-L inconsistent channel counts.
  bnxt_en: Re-arrange bnxt_hwrm_func_qcaps().
  bnxt_en: Call firmware to approve the random VF MAC address.
  bnxt_en: Pad TX packets below 52 bytes.

Rob Swindell (1):
  bnxt_en: Added support for Secure Firmware Update

 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  135 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |   22 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c |  187 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_fw_hdr.h  |   16 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h | 1251 -
 drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c   |   84 +-
 6 files changed, 1073 insertions(+), 622 deletions(-)

-- 
1.8.3.1



[PATCH net-next v2 02/10] bnxt_en: Simplify PCI device names and add additional PCI IDs.

2016-09-19 Thread Michael Chan
Remove "Single-port/Dual-port" from the device names.  Dual-port devices
will appear as 2 separate devices, so no need to call each a dual-port
device.  Use a more generic name for VF devices belonging to the same
chip fanmily.  Add some remaining NPAR device IDs.

Signed-off-by: David Christensen 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 68 ---
 1 file changed, 36 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index cee0e8d..d9b4cd1 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -93,50 +93,49 @@ enum board_idx {
BCM57404_NPAR,
BCM57406_NPAR,
BCM57407_SFP,
+   BCM57407_NPAR,
BCM57414_NPAR,
BCM57416_NPAR,
-   BCM57304_VF,
-   BCM57404_VF,
-   BCM57414_VF,
-   BCM57314_VF,
+   NETXTREME_E_VF,
+   NETXTREME_C_VF,
 };
 
 /* indexed by enum above */
 static const struct {
char *name;
 } board_info[] = {
-   { "Broadcom BCM57301 NetXtreme-C Single-port 10Gb Ethernet" },
-   { "Broadcom BCM57302 NetXtreme-C Dual-port 10Gb/25Gb Ethernet" },
-   { "Broadcom BCM57304 NetXtreme-C Dual-port 10Gb/25Gb/40Gb/50Gb 
Ethernet" },
+   { "Broadcom BCM57301 NetXtreme-C 10Gb Ethernet" },
+   { "Broadcom BCM57302 NetXtreme-C 10Gb/25Gb Ethernet" },
+   { "Broadcom BCM57304 NetXtreme-C 10Gb/25Gb/40Gb/50Gb Ethernet" },
{ "Broadcom BCM57417 NetXtreme-E Ethernet Partition" },
-   { "Broadcom BCM58700 Nitro 4-port 1Gb/2.5Gb/10Gb Ethernet" },
-   { "Broadcom BCM57311 NetXtreme-C Single-port 10Gb Ethernet" },
-   { "Broadcom BCM57312 NetXtreme-C Dual-port 10Gb/25Gb Ethernet" },
-   { "Broadcom BCM57402 NetXtreme-E Dual-port 10Gb Ethernet" },
-   { "Broadcom BCM57404 NetXtreme-E Dual-port 10Gb/25Gb Ethernet" },
-   { "Broadcom BCM57406 NetXtreme-E Dual-port 10GBase-T Ethernet" },
+   { "Broadcom BCM58700 Nitro 1Gb/2.5Gb/10Gb Ethernet" },
+   { "Broadcom BCM57311 NetXtreme-C 10Gb Ethernet" },
+   { "Broadcom BCM57312 NetXtreme-C 10Gb/25Gb Ethernet" },
+   { "Broadcom BCM57402 NetXtreme-E 10Gb Ethernet" },
+   { "Broadcom BCM57404 NetXtreme-E 10Gb/25Gb Ethernet" },
+   { "Broadcom BCM57406 NetXtreme-E 10GBase-T Ethernet" },
{ "Broadcom BCM57402 NetXtreme-E Ethernet Partition" },
-   { "Broadcom BCM57407 NetXtreme-E Dual-port 10GBase-T Ethernet" },
-   { "Broadcom BCM57412 NetXtreme-E Dual-port 10Gb Ethernet" },
-   { "Broadcom BCM57414 NetXtreme-E Dual-port 10Gb/25Gb Ethernet" },
-   { "Broadcom BCM57416 NetXtreme-E Dual-port 10GBase-T Ethernet" },
-   { "Broadcom BCM57417 NetXtreme-E Dual-port 10GBase-T Ethernet" },
+   { "Broadcom BCM57407 NetXtreme-E 10GBase-T Ethernet" },
+   { "Broadcom BCM57412 NetXtreme-E 10Gb Ethernet" },
+   { "Broadcom BCM57414 NetXtreme-E 10Gb/25Gb Ethernet" },
+   { "Broadcom BCM57416 NetXtreme-E 10GBase-T Ethernet" },
+   { "Broadcom BCM57417 NetXtreme-E 10GBase-T Ethernet" },
{ "Broadcom BCM57412 NetXtreme-E Ethernet Partition" },
-   { "Broadcom BCM57314 NetXtreme-C Dual-port 10Gb/25Gb/40Gb/50Gb 
Ethernet" },
-   { "Broadcom BCM57417 NetXtreme-E Dual-port 10Gb/25Gb Ethernet" },
-   { "Broadcom BCM57416 NetXtreme-E Dual-port 10Gb Ethernet" },
+   { "Broadcom BCM57314 NetXtreme-C 10Gb/25Gb/40Gb/50Gb Ethernet" },
+   { "Broadcom BCM57417 NetXtreme-E 10Gb/25Gb Ethernet" },
+   { "Broadcom BCM57416 NetXtreme-E 10Gb Ethernet" },
{ "Broadcom BCM57404 NetXtreme-E Ethernet Partition" },
{ "Broadcom BCM57406 NetXtreme-E Ethernet Partition" },
-   { "Broadcom BCM57407 NetXtreme-E Dual-port 25Gb Ethernet" },
+   { "Broadcom BCM57407 NetXtreme-E 25Gb Ethernet" },
+   { "Broadcom BCM57407 NetXtreme-E Ethernet Partition" },
{ "Broadcom BCM57414 NetXtreme-E Ethernet Partition" },
{ "Broadcom BCM57416 NetXtreme-E Ethernet Partition" },
-   { "Broadcom BCM57304 NetXtreme-C Ethernet Virtual Function" },
-   { "Broadcom BCM57404 NetXtreme-E Ethernet Virtual Function" },
-   { "Broadcom BCM57414 NetXtreme-E Ethernet Virtual Function" },
-   { "Broadcom BCM57314 NetXtreme-E Ethernet Virtual Function" },
+   { "Broadcom NetXtreme-E Ethernet Virtual Function" },
+   { "Broadcom NetXtreme-C Ethernet Virtual Function" },
 };
 
 static const struct pci_device_id bnxt_pci_tbl[] = {
+   { PCI_VDEVICE(BROADCOM, 0x16c0), .driver_data = BCM57417_NPAR },
{ PCI_VDEVICE(BROADCOM, 0x16c8), .driver_data = BCM57301 },
{ PCI_VDEVICE(BROADCOM, 0x16c9), .driver_data = BCM57302 },
{ PCI_VDEVICE(BROADCOM, 0x16ca), .driver_data = BCM57304 },
@@ -160,13 +159,19 @@ static const struct pci_device_id bnxt_pci_tbl[] = {
{ PCI_VDEVICE(BROADCOM, 0x16e7), .driver_data = BCM57404_NPAR },
{ PCI_VDEVICE(BROADCOM, 0x

Re: [net-next PATCH] net: netlink messages for HW addr programming

2016-09-19 Thread Roopa Prabhu
On 9/19/16, 10:49 PM, Jiri Pirko wrote:
> Tue, Sep 20, 2016 at 07:31:27AM CEST, ro...@cumulusnetworks.com wrote:
>> On 9/19/16, 7:46 AM, Patrick Ruddy wrote:
>>> On Sun, 2016-09-18 at 07:51 -0700, Roopa Prabhu wrote:
 On 9/15/16, 9:48 AM, Patrick Ruddy wrote:
> Add RTM_NEWADDR and RTM_DELADDR netlink messages with family
> AF_UNSPEC to indicate interest in specific unicast and multicast
> hardware addresses. These messages are sent when addresses are
> added or deleted from the appropriate interface driver.
> Added AF_UNSPEC GETADDR function to allow the netlink notifications
> to be replayed to avoid loss of state due to application start
> ordering or restart.
>
> Signed-off-by: Patrick Ruddy 
> ---
 RTM_NEWADDR and RTM_DELADDR are not used to add these entries to the 
 kernel.
 so, it seems a bit wrong to use RTM_NEWADDR and RTM_DELADDR to notify them 
 to
 userspace and also to request a special dump of these addresses.

 This could just be a new nested netlink attribute in the existing link 
 dump ?
>>> Hi Roopa
>>>
>>> Thanks for the review. I did initially code this using NEW/DEL/GET_LINK
>>> messages but was asked to change to ADDR messages by Stephen
>>> Hemminger (cc'd). 
>>>
>>> However I agree that these addresses fall between the LINK and ADDR
>>> areas so I'm happy to change this if we can reach some consensus on the
>>> format.
>>>
>> ok, thanks for the history. yes, they do lie in a weird spot.
> They are l2 addresses, they should be treated accordingly. Am I missing
> something?
>
>
>> the general convention for other rtnl registrations seems to be
>> AF_UNSPEC family means include all supported families. thats where this 
>> seems a bit odd.
>>
>> On the other hand, one reason I see where using RTM_*ADDR will be useful for 
>> this is if we wanted
>> to provide a way to add these uc and mc address via ip addr add in the 
>> future.
>> ip addr add  dev eth0
>>
>> Does this patch allow that in the future ?
> This should go under ip link I believe. "ip addr" is for l3.
>
>
yes, ...my initial comment was the same (two new attributes to cover UC and MC 
addresses).
patrick had it in link first..and there were some suggestions on doing it in 
addr. he is ok with either.

My questions were to make sure we don't lose anything ...by adding it under 
link.
there is no external way to add addrs to uc and mc lists today. hence would be 
nice
to cover that case as well when we are exposing the dev uc and mc lists to 
userspace.
and of course ..it does not have to be RTM_NEWADDR ...
RTM_NEWLINK can cover it both ways also.

so, if stephen has no major objections, we can still go with attributes in 
RTM_*LINK.




[PATCH net-next 1/7] net/faraday: Separate rx page storage from rxdesc

2016-09-19 Thread Joel Stanley
From: Andrew Jeffery 

The ftgmac100 hardware revision in e.g. the Aspeed AST2500 no longer
reserves all bits in RXDES#2 but instead uses the bottom 16 bits to
store MAC frame metadata. Avoid corruption by shifting struct page
pointers out to their own member in struct ftgmac100.
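
The slot lookup introduced below works by pointer difference: the index of a
descriptor within the descriptor array selects the matching entry in the new
rx_pages[] array.  A simplified standalone illustration with fake types (not
driver code):

#include <stdio.h>

#define RX_QUEUE_ENTRIES 4

struct fake_rxdes {
	unsigned int rxdes0, rxdes1, rxdes2, rxdes3;
};

int main(void)
{
	struct fake_rxdes descs[RX_QUEUE_ENTRIES];
	void *rx_pages[RX_QUEUE_ENTRIES] = { NULL };
	struct fake_rxdes *rxdes = &descs[2];

	/* (rxdes - descs) is the descriptor's index; use it to address a
	 * parallel page-pointer array instead of stashing the pointer in
	 * the hardware-owned rxdes2 word.
	 */
	rx_pages[rxdes - descs] = &descs[2];	/* placeholder for a page */
	printf("stored in slot %td\n", rxdes - descs);
	return 0;
}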

Signed-off-by: Andrew Jeffery 
Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 25 ++---
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 36361f8bf894..40622567159a 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -60,6 +60,8 @@ struct ftgmac100 {
struct ftgmac100_descs *descs;
dma_addr_t descs_dma_addr;
 
+   struct page *rx_pages[RX_QUEUE_ENTRIES];
+
unsigned int rx_pointer;
unsigned int tx_clean_pointer;
unsigned int tx_pointer;
@@ -341,18 +343,27 @@ static bool ftgmac100_rxdes_ipcs_err(struct 
ftgmac100_rxdes *rxdes)
return rxdes->rxdes1 & cpu_to_le32(FTGMAC100_RXDES1_IP_CHKSUM_ERR);
 }
 
+static inline struct page **ftgmac100_rxdes_page_slot(struct ftgmac100 *priv,
+ struct ftgmac100_rxdes 
*rxdes)
+{
+   return &priv->rx_pages[rxdes - priv->descs->rxdes];
+}
+
 /*
  * rxdes2 is not used by hardware. We use it to keep track of page.
  * Since hardware does not touch it, we can skip cpu_to_le32()/le32_to_cpu().
  */
-static void ftgmac100_rxdes_set_page(struct ftgmac100_rxdes *rxdes, struct 
page *page)
+static void ftgmac100_rxdes_set_page(struct ftgmac100 *priv,
+struct ftgmac100_rxdes *rxdes,
+struct page *page)
 {
-   rxdes->rxdes2 = (unsigned int)page;
+   *ftgmac100_rxdes_page_slot(priv, rxdes) = page;
 }
 
-static struct page *ftgmac100_rxdes_get_page(struct ftgmac100_rxdes *rxdes)
+static struct page *ftgmac100_rxdes_get_page(struct ftgmac100 *priv,
+struct ftgmac100_rxdes *rxdes)
 {
-   return (struct page *)rxdes->rxdes2;
+   return *ftgmac100_rxdes_page_slot(priv, rxdes);
 }
 
 /**
@@ -501,7 +512,7 @@ static bool ftgmac100_rx_packet(struct ftgmac100 *priv, int 
*processed)
 
do {
dma_addr_t map = ftgmac100_rxdes_get_dma_addr(rxdes);
-   struct page *page = ftgmac100_rxdes_get_page(rxdes);
+   struct page *page = ftgmac100_rxdes_get_page(priv, rxdes);
unsigned int size;
 
dma_unmap_page(priv->dev, map, RX_BUF_SIZE, DMA_FROM_DEVICE);
@@ -779,7 +790,7 @@ static int ftgmac100_alloc_rx_page(struct ftgmac100 *priv,
return -ENOMEM;
}
 
-   ftgmac100_rxdes_set_page(rxdes, page);
+   ftgmac100_rxdes_set_page(priv, rxdes, page);
ftgmac100_rxdes_set_dma_addr(rxdes, map);
ftgmac100_rxdes_set_dma_own(rxdes);
return 0;
@@ -791,7 +802,7 @@ static void ftgmac100_free_buffers(struct ftgmac100 *priv)
 
for (i = 0; i < RX_QUEUE_ENTRIES; i++) {
struct ftgmac100_rxdes *rxdes = &priv->descs->rxdes[i];
-   struct page *page = ftgmac100_rxdes_get_page(rxdes);
+   struct page *page = ftgmac100_rxdes_get_page(priv, rxdes);
dma_addr_t map = ftgmac100_rxdes_get_dma_addr(rxdes);
 
if (!page)
-- 
2.9.3



[PATCH net-next 2/7] net/faraday: Make EDO{R,T}R bits configurable

2016-09-19 Thread Joel Stanley
From: Andrew Jeffery 

These bits are #defined at a fixed location. In order to support future
hardware that has chosen to move these bits around, move them into a
member of struct ftgmac100.

Signed-off-by: Andrew Jeffery 
Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 40 +---
 drivers/net/ethernet/faraday/ftgmac100.h |  2 --
 2 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 40622567159a..62a88d1a1f99 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -79,6 +79,9 @@ struct ftgmac100 {
int int_mask_all;
bool use_ncsi;
bool enabled;
+
+   u32 rxdes0_edorr_mask;
+   u32 txdes0_edotr_mask;
 };
 
 static int ftgmac100_alloc_rx_page(struct ftgmac100 *priv,
@@ -259,10 +262,11 @@ static bool ftgmac100_rxdes_packet_ready(struct 
ftgmac100_rxdes *rxdes)
return rxdes->rxdes0 & cpu_to_le32(FTGMAC100_RXDES0_RXPKT_RDY);
 }
 
-static void ftgmac100_rxdes_set_dma_own(struct ftgmac100_rxdes *rxdes)
+static void ftgmac100_rxdes_set_dma_own(const struct ftgmac100 *priv,
+   struct ftgmac100_rxdes *rxdes)
 {
/* clear status bits */
-   rxdes->rxdes0 &= cpu_to_le32(FTGMAC100_RXDES0_EDORR);
+   rxdes->rxdes0 &= cpu_to_le32(priv->rxdes0_edorr_mask);
 }
 
 static bool ftgmac100_rxdes_rx_error(struct ftgmac100_rxdes *rxdes)
@@ -300,9 +304,10 @@ static bool ftgmac100_rxdes_multicast(struct 
ftgmac100_rxdes *rxdes)
return rxdes->rxdes0 & cpu_to_le32(FTGMAC100_RXDES0_MULTICAST);
 }
 
-static void ftgmac100_rxdes_set_end_of_ring(struct ftgmac100_rxdes *rxdes)
+static void ftgmac100_rxdes_set_end_of_ring(const struct ftgmac100 *priv,
+   struct ftgmac100_rxdes *rxdes)
 {
-   rxdes->rxdes0 |= cpu_to_le32(FTGMAC100_RXDES0_EDORR);
+   rxdes->rxdes0 |= cpu_to_le32(priv->rxdes0_edorr_mask);
 }
 
 static void ftgmac100_rxdes_set_dma_addr(struct ftgmac100_rxdes *rxdes,
@@ -393,7 +398,7 @@ ftgmac100_rx_locate_first_segment(struct ftgmac100 *priv)
if (ftgmac100_rxdes_first_segment(rxdes))
return rxdes;
 
-   ftgmac100_rxdes_set_dma_own(rxdes);
+   ftgmac100_rxdes_set_dma_own(priv, rxdes);
ftgmac100_rx_pointer_advance(priv);
rxdes = ftgmac100_current_rxdes(priv);
}
@@ -464,7 +469,7 @@ static void ftgmac100_rx_drop_packet(struct ftgmac100 *priv)
if (ftgmac100_rxdes_last_segment(rxdes))
done = true;
 
-   ftgmac100_rxdes_set_dma_own(rxdes);
+   ftgmac100_rxdes_set_dma_own(priv, rxdes);
ftgmac100_rx_pointer_advance(priv);
rxdes = ftgmac100_current_rxdes(priv);
} while (!done && ftgmac100_rxdes_packet_ready(rxdes));
@@ -556,10 +561,11 @@ static bool ftgmac100_rx_packet(struct ftgmac100 *priv, 
int *processed)
 /**
  * internal functions (transmit descriptor)
  */
-static void ftgmac100_txdes_reset(struct ftgmac100_txdes *txdes)
+static void ftgmac100_txdes_reset(const struct ftgmac100 *priv,
+ struct ftgmac100_txdes *txdes)
 {
/* clear all except end of ring bit */
-   txdes->txdes0 &= cpu_to_le32(FTGMAC100_TXDES0_EDOTR);
+   txdes->txdes0 &= cpu_to_le32(priv->txdes0_edotr_mask);
txdes->txdes1 = 0;
txdes->txdes2 = 0;
txdes->txdes3 = 0;
@@ -580,9 +586,10 @@ static void ftgmac100_txdes_set_dma_own(struct 
ftgmac100_txdes *txdes)
txdes->txdes0 |= cpu_to_le32(FTGMAC100_TXDES0_TXDMA_OWN);
 }
 
-static void ftgmac100_txdes_set_end_of_ring(struct ftgmac100_txdes *txdes)
+static void ftgmac100_txdes_set_end_of_ring(const struct ftgmac100 *priv,
+   struct ftgmac100_txdes *txdes)
 {
-   txdes->txdes0 |= cpu_to_le32(FTGMAC100_TXDES0_EDOTR);
+   txdes->txdes0 |= cpu_to_le32(priv->txdes0_edotr_mask);
 }
 
 static void ftgmac100_txdes_set_first_segment(struct ftgmac100_txdes *txdes)
@@ -701,7 +708,7 @@ static bool ftgmac100_tx_complete_packet(struct ftgmac100 
*priv)
 
dev_kfree_skb(skb);
 
-   ftgmac100_txdes_reset(txdes);
+   ftgmac100_txdes_reset(priv, txdes);
 
ftgmac100_tx_clean_pointer_advance(priv);
 
@@ -792,7 +799,7 @@ static int ftgmac100_alloc_rx_page(struct ftgmac100 *priv,
 
ftgmac100_rxdes_set_page(priv, rxdes, page);
ftgmac100_rxdes_set_dma_addr(rxdes, map);
-   ftgmac100_rxdes_set_dma_own(rxdes);
+   ftgmac100_rxdes_set_dma_own(priv, rxdes);
return 0;
 }
 
@@ -839,7 +846,8 @@ static int ftgmac100_alloc_buffers(struct ftgmac100 *priv)

[PATCH net-next 0/7] ftgmac100 support for ast2500

2016-09-19 Thread Joel Stanley
Hello Dave,

This series adds support to the ftgmac100 driver for the Aspeed ast2400 and
ast2500 SoCs. In particular, the patches ensure the driver works correctly on
the ast2500, where the MAC block has seen some changes in register layout.

They have been tested on ast2400 and ast2500 systems with the NCSI stack and
with a directly attached PHY.

Cheers,

Joel

Andrew Jeffery (2):
  net/ftgmac100: Separate rx page storage from rxdesc
  net/ftgmac100: Make EDO{R,T}R bits configurable

Gavin Shan (2):
  net/faraday: Avoid PHYSTS_CHG interrupt
  net/faraday: Clear stale interrupts

Joel Stanley (3):
  net/ftgmac100: Adapt for Aspeed SoCs
  net/faraday: Fix phy link irq on Aspeed G5 SoCs
  net/faraday: Configure old MDIO interface on Aspeed SoCs

 drivers/net/ethernet/faraday/ftgmac100.c | 92 
 drivers/net/ethernet/faraday/ftgmac100.h |  8 ++-
 2 files changed, 77 insertions(+), 23 deletions(-)

-- 
2.9.3



[PATCH net-next 3/7] net/faraday: Adapt for Aspeed SoCs

2016-09-19 Thread Joel Stanley
The RXDES and TXDES register bits in the ftgmac100 indicate EDO{R,T}R
at bit position 15 for the Faraday Tech IP. However, the version of this
IP present in the Aspeed SoCs has these bits at position 30 in the
registers.

It appears that ast2400 SoCs support both positions, with the 15th bit
marked as reserved but still functional. In the ast2500 this bit is
reused for another function, so we need a workaround.

Engineers from Aspeed confirmed that using bit 30 is correct for both
the ast2400 and ast2500 SoCs.

Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 62a88d1a1f99..47f512224b57 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1345,9 +1345,6 @@ static int ftgmac100_probe(struct platform_device *pdev)
priv->netdev = netdev;
priv->dev = &pdev->dev;
 
-   priv->rxdes0_edorr_mask = BIT(15);
-   priv->txdes0_edotr_mask = BIT(15);
-
spin_lock_init(&priv->tx_lock);
 
/* initialize NAPI */
@@ -1381,6 +1378,16 @@ static int ftgmac100_probe(struct platform_device *pdev)
  FTGMAC100_INT_PHYSTS_CHG |
  FTGMAC100_INT_RPKT_BUF |
  FTGMAC100_INT_NO_RXBUF);
+
+   if (of_machine_is_compatible("aspeed,ast2400") ||
+   of_machine_is_compatible("aspeed,ast2500")) {
+   priv->rxdes0_edorr_mask = BIT(30);
+   priv->txdes0_edotr_mask = BIT(30);
+   } else {
+   priv->rxdes0_edorr_mask = BIT(15);
+   priv->txdes0_edotr_mask = BIT(15);
+   }
+
if (pdev->dev.of_node &&
of_get_property(pdev->dev.of_node, "use-ncsi", NULL)) {
if (!IS_ENABLED(CONFIG_NET_NCSI)) {
-- 
2.9.3



[PATCH net-next 5/7] net/faraday: Clear stale interrupts

2016-09-19 Thread Joel Stanley
From: Gavin Shan 

A stale interrupt (PHYSTS_CHG in the ISR, bit #6 in register 0x0) can be
left over from the bootloader (u-boot) when the MAC is enabled. These stale
interrupts are not meant for the kernel and should be cleared.

This clears the stale interrupts in the ISR (0x0) when enabling the MAC.

Signed-off-by: Gavin Shan 
Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index f2ea6c2f1fbd..7ba0f2d58a8b 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1113,6 +1113,7 @@ static int ftgmac100_poll(struct napi_struct *napi, int 
budget)
 static int ftgmac100_open(struct net_device *netdev)
 {
struct ftgmac100 *priv = netdev_priv(netdev);
+   unsigned int status;
int err;
 
err = ftgmac100_alloc_buffers(priv);
@@ -1138,6 +1139,11 @@ static int ftgmac100_open(struct net_device *netdev)
 
ftgmac100_init_hw(priv);
ftgmac100_start_hw(priv, priv->use_ncsi ? 100 : 10);
+
+   /* Clear stale interrupts */
+   status = ioread32(priv->base + FTGMAC100_OFFSET_ISR);
+   iowrite32(status, priv->base + FTGMAC100_OFFSET_ISR);
+
if (netdev->phydev)
phy_start(netdev->phydev);
else if (priv->use_ncsi)
-- 
2.9.3



[PATCH net-next 4/7] net/faraday: Avoid PHYSTS_CHG interrupt

2016-09-19 Thread Joel Stanley
From: Gavin Shan 

Bit #11 in MACCR (0x50) selects the signal level for the PHY link
status change interrupt. It is cleared by default, meaning high-level
triggering is enabled. However, we see a continuous interrupt (bit #6)
in the ISR (0x0) for it, which is obviously a false alarm. The side
effect is CPU cycles wasted processing the false alarms.

This sets bit #11 in MACCR (0x50) to avoid the bogus interrupt.

Signed-off-by: Gavin Shan 
Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 1 +
 drivers/net/ethernet/faraday/ftgmac100.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 47f512224b57..f2ea6c2f1fbd 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -215,6 +215,7 @@ static void ftgmac100_init_hw(struct ftgmac100 *priv)
 FTGMAC100_MACCR_RXMAC_EN   | \
 FTGMAC100_MACCR_FULLDUP| \
 FTGMAC100_MACCR_CRC_APD| \
+FTGMAC100_MACCR_PHY_LINK_LEVEL | \
 FTGMAC100_MACCR_RX_RUNT| \
 FTGMAC100_MACCR_RX_BROADPKT)
 
diff --git a/drivers/net/ethernet/faraday/ftgmac100.h 
b/drivers/net/ethernet/faraday/ftgmac100.h
index c258586ce4a4..d07b6ea5d1b5 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.h
+++ b/drivers/net/ethernet/faraday/ftgmac100.h
@@ -152,6 +152,7 @@
 #define FTGMAC100_MACCR_FULLDUP(1 << 8)
 #define FTGMAC100_MACCR_GIGA_MODE  (1 << 9)
 #define FTGMAC100_MACCR_CRC_APD(1 << 10)
+#define FTGMAC100_MACCR_PHY_LINK_LEVEL (1 << 11)
 #define FTGMAC100_MACCR_RX_RUNT(1 << 12)
 #define FTGMAC100_MACCR_JUMBO_LF   (1 << 13)
 #define FTGMAC100_MACCR_RX_ALL (1 << 14)
-- 
2.9.3



[PATCH net-next 7/7] net/faraday: Configure old MDIO interface on Aspeed SoCs

2016-09-19 Thread Joel Stanley
The Aspeed G4 and G5 SoCs offer a new MDIO interface as an option. The
old interface is still available, so select it in order to remain
compatible with the ftgmac100 driver.

Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 9 +
 drivers/net/ethernet/faraday/ftgmac100.h | 5 +
 2 files changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 5466df028381..54c6ba3632a3 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1257,12 +1257,21 @@ static int ftgmac100_setup_mdio(struct net_device 
*netdev)
struct ftgmac100 *priv = netdev_priv(netdev);
struct platform_device *pdev = to_platform_device(priv->dev);
int i, err = 0;
+   u32 reg;
 
/* initialize mdio bus */
priv->mii_bus = mdiobus_alloc();
if (!priv->mii_bus)
return -EIO;
 
+   if (of_machine_is_compatible("aspeed,ast2400") ||
+   of_machine_is_compatible("aspeed,ast2500")) {
+   /* This driver supports the old MDIO interface */
+   reg = ioread32(priv->base + FTGMAC100_OFFSET_REVR);
+   reg &= ~FTGMAC100_REVR_NEW_MDIO_INTERFACE;
+   iowrite32(reg, priv->base + FTGMAC100_OFFSET_REVR);
+   }
+
priv->mii_bus->name = "ftgmac100_mdio";
snprintf(priv->mii_bus->id, MII_BUS_ID_SIZE, "%s-%d",
 pdev->name, pdev->id);
diff --git a/drivers/net/ethernet/faraday/ftgmac100.h 
b/drivers/net/ethernet/faraday/ftgmac100.h
index d07b6ea5d1b5..a7ce0ac8858a 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.h
+++ b/drivers/net/ethernet/faraday/ftgmac100.h
@@ -134,6 +134,11 @@
 #define FTGMAC100_DMAFIFOS_TXDMA_REQ   (1 << 31)
 
 /*
+ * Feature Register
+ */
+#define FTGMAC100_REVR_NEW_MDIO_INTERFACE  BIT(31)
+
+/*
  * Receive buffer size register
  */
 #define FTGMAC100_RBSR_SIZE(x) ((x) & 0x3fff)
-- 
2.9.3



[PATCH net-next 6/7] net/faraday: Fix phy link irq on Aspeed G5 SoCs

2016-09-19 Thread Joel Stanley
On Aspeed SoCs with a direct PHY connection (non-NCSI), we receive
continual PHYSTS_CHG interrupts:

 [   20.28] ftgmac100 1e660000.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG
 [   20.28] ftgmac100 1e660000.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG
 [   20.28] ftgmac100 1e660000.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG
 [   20.30] ftgmac100 1e660000.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG

This is because the driver was enabling low-level sensitive interrupt
generation where the systems are wired for high-level. All CPU cycles
are spent servicing this interrupt.

Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 7ba0f2d58a8b..5466df028381 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -223,6 +223,10 @@ static void ftgmac100_start_hw(struct ftgmac100 *priv, int 
speed)
 {
int maccr = MACCR_ENABLE_ALL;
 
+   if (of_machine_is_compatible("aspeed,ast2500")) {
+   maccr &= ~FTGMAC100_MACCR_PHY_LINK_LEVEL;
+   }
+
switch (speed) {
default:
case 10:
-- 
2.9.3



[PATCHv2 net] cxgb4/cxgb4vf: Allocate more queues for 25G and 100G adapter

2016-09-19 Thread Hariprasad Shenai
We were missing the check for 25G and 100G while checking the port speed,
which led to fewer queues being allocated for 25G and 100G adapters and
hence to low throughput. Add the missing check for both the NIC and vNIC
drivers.

Also fix the port advertisement for 25G and 100G in the ethtool output.
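
To illustrate why the old check under-allocated queues, here is a standalone
sketch of the before/after logic; the SPEED_* constants and helpers below are
simplified stand-ins, not the real FW_PORT_CAP_* definitions:

#include <stdbool.h>
#include <stdio.h>

#define SPEED_100M	(1 << 0)
#define SPEED_1G	(1 << 1)
#define SPEED_10G	(1 << 2)
#define SPEED_25G	(1 << 3)
#define SPEED_40G	(1 << 4)
#define SPEED_100G	(1 << 5)

/* Old check: only 10G/40G counted as "high speed". */
static bool old_is_x_10g_port(unsigned int supported)
{
	return (supported & SPEED_10G) || (supported & SPEED_40G);
}

/* New check: anything other than 100M/1G counts as "high speed". */
static bool new_is_x_10g_port(unsigned int supported)
{
	return (supported & ~(SPEED_100M | SPEED_1G)) != 0;
}

int main(void)
{
	unsigned int port_25g = SPEED_1G | SPEED_25G;

	/* Old: 0 -> port treated as low speed, fewer queues allocated.
	 * New: 1 -> port gets the high-speed queue allocation.
	 */
	printf("old=%d new=%d\n", old_is_x_10g_port(port_25g),
	       new_is_x_10g_port(port_25g));
	return 0;
}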

Signed-off-by: Hariprasad Shenai 
---
V2: Missed 25G in the first one

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |  4 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c| 15 +--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c |  7 ++-
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h  |  6 ++
 drivers/net/ethernet/chelsio/cxgb4vf/t4vf_common.h | 15 +++
 drivers/net/ethernet/chelsio/cxgb4vf/t4vf_hw.c |  9 +++--
 6 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 2e2aa9fec9bb..edd23386b47d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -419,8 +419,8 @@ struct link_config {
unsigned short supported;/* link capabilities */
unsigned short advertising;  /* advertised capabilities */
unsigned short lp_advertising;   /* peer advertised capabilities */
-   unsigned short requested_speed;  /* speed user has requested */
-   unsigned short speed;/* actual link speed */
+   unsigned int   requested_speed;  /* speed user has requested */
+   unsigned int   speed;/* actual link speed */
unsigned char  requested_fc; /* flow control user has requested */
unsigned char  fc;   /* actual link flow control */
unsigned char  autoneg;  /* autonegotiating? */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index c762a8c8c954..3ceafb55d6da 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -4305,10 +4305,17 @@ static const struct pci_error_handlers cxgb4_eeh = {
.resume = eeh_resume,
 };
 
+/* Return true if the Link Configuration supports "High Speeds" (those greater
+ * than 1Gb/s).
+ */
 static inline bool is_x_10g_port(const struct link_config *lc)
 {
-   return (lc->supported & FW_PORT_CAP_SPEED_10G) != 0 ||
-  (lc->supported & FW_PORT_CAP_SPEED_40G) != 0;
+   unsigned int speeds, high_speeds;
+
+   speeds = FW_PORT_CAP_SPEED_V(FW_PORT_CAP_SPEED_G(lc->supported));
+   high_speeds = speeds & ~(FW_PORT_CAP_SPEED_100M | FW_PORT_CAP_SPEED_1G);
+
+   return high_speeds != 0;
 }
 
 static inline void init_rspq(struct adapter *adap, struct sge_rspq *q,
@@ -4756,8 +4763,12 @@ static void print_port_info(const struct net_device *dev)
bufp += sprintf(bufp, "1000/");
if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_10G)
bufp += sprintf(bufp, "10G/");
+   if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_25G)
+   bufp += sprintf(bufp, "25G/");
if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_40G)
bufp += sprintf(bufp, "40G/");
+   if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_100G)
+   bufp += sprintf(bufp, "100G/");
if (bufp != buf)
--bufp;
sprintf(bufp, "BASE-%s", t4_get_port_type_description(pi->port_type));
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index dc92c80a75f4..660204bff726 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -3627,7 +3627,8 @@ void t4_ulprx_read_la(struct adapter *adap, u32 *la_buf)
 }
 
 #define ADVERT_MASK (FW_PORT_CAP_SPEED_100M | FW_PORT_CAP_SPEED_1G |\
-FW_PORT_CAP_SPEED_10G | FW_PORT_CAP_SPEED_40G | \
+FW_PORT_CAP_SPEED_10G | FW_PORT_CAP_SPEED_25G | \
+FW_PORT_CAP_SPEED_40G | FW_PORT_CAP_SPEED_100G | \
 FW_PORT_CAP_ANEG)
 
 /**
@@ -7196,8 +7197,12 @@ void t4_handle_get_port_info(struct port_info *pi, const 
__be64 *rpl)
speed = 1000;
else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_10G))
speed = 10000;
+   else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_25G))
+   speed = 25000;
else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_40G))
speed = 40000;
+   else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_100G))
+   speed = 100000;
 
lc = &pi->link_cfg;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
index a89b30720e38..30507d44422c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
@@ -2265,6 +2265,12 @@ enum fw_port_cap {
FW_P

Re: [PATCH net] cxgb4/cxgb4vf: Allocate more queues for 100G adapter

2016-09-19 Thread Hariprasad Shenai
On Mon, Sep 19, 2016 at 01:32:46PM +0530, Hariprasad Shenai wrote:
> We were missing check for 100G while checking port speed, which lead to
> less number of queues getting allocated for 100G and leading to low
> throughput. Adding the missing check for both NIC and vNIC driver.
> 
> Signed-off-by: Hariprasad Shenai 
> ---

Hi David,

I missed 25G, will send a V2 for the same with changes for 25Gbps adapter too.

Thanks,
Hari 


Re: [patch net-next RFC 0/2] fib4 offload: notifier to let hw to be aware of all prefixes

2016-09-19 Thread Roopa Prabhu
On Mon, Sep 19, 2016 at 11:02 PM, Jiri Pirko  wrote:
> Tue, Sep 20, 2016 at 07:49:47AM CEST, ro...@cumulusnetworks.com wrote:

[snip]

>>
>>Do you see any scale problems with using notifiers? As you know these asics 
>>can scale to 32k-128k routes.
>>32k-128k routes.
>
> I don't see any problem there. What do you think might be wrong?
>

we had seen some overhead with link notifiers in older kernels with a
large number of link flaps. But that could have been due to the rtnl lock.
I don't have any more data than that, so ignore that.
I don't see anything obvious from a perf perspective since rtnl is already
held, but I thought I would just ask.


Re: [patch net-next RFC 0/2] fib4 offload: notifier to let hw to be aware of all prefixes

2016-09-19 Thread Jiri Pirko
Tue, Sep 20, 2016 at 07:49:47AM CEST, ro...@cumulusnetworks.com wrote:
>On 9/19/16, 8:15 AM, Jiri Pirko wrote:
>> Mon, Sep 19, 2016 at 04:59:22PM CEST, ro...@cumulusnetworks.com wrote:
>>> On 9/18/16, 11:14 PM, Jiri Pirko wrote:
 Mon, Sep 19, 2016 at 01:16:17AM CEST, ro...@cumulusnetworks.com wrote:
> On 9/18/16, 1:00 PM, Florian Fainelli wrote:
>> Le 06/09/2016 à 05:01, Jiri Pirko a écrit :
>>> From: Jiri Pirko 
>>>
>>> This is RFC, unfinished. I came across some issues in the process so I 
>>> would
>>> like to share those and restart the fib offload discussion in order to 
>>> make it
>>> really usable.
>>>
>>> So the goal of this patchset is to allow driver to propagate all 
>>> prefixes
>>> configured in kernel down HW. This is necessary for routing to work
>>> as expected. If we don't do that HW might forward prefixes known to 
>>> kernel
>>> incorrectly. Take an example when default route is set in switch HW and 
>>> there
>>> is an IP address set on a management (non-switch) port.
>>>
>>> Currently, only fibs related to the switch port netdev are offloaded 
>>> using
>>> switchdev ops. This model is not extendable so the first patch 
>>> introduces
>>> a replacement: notifier to propagate fib additions and removals to 
>>> whoever
>>> interested. The second patch makes mlxsw to adopt this new way, 
>>> registering
>>> one notifier block for each mlxsw (asic) instance.
>> Instead of introducing another specialization of a notifier_block
>> implementation, could we somehow have a kernel-based netlink listener
>> which receives the same kind of event information from rtmsg_fib()?
>>
>> The reason is that having such a facility would hook directly onto
>> existing rtmsg_* calls that exist throughout the stack, and that seems
>> to scale better.
> I was thinking along the same lines. Instead of proliferating notifier 
> blocks
> through-out the stack for switchdev offload, putting existing events to 
> use would be nice.
>
> But the problem though is drivers having to parse the netlink msg again. 
> also, the intent
> here is to do the offload first ..before the route is added to the kernel 
> (though i don't see that in
> the current series). existing netlink rmsg_fib events are generated after 
> the route is added to the kernel.
>
>
> Jiri, instead of the notifier, do you see a problem with always calling 
> the existing switchdev
> offload api for every route  for every asic instance ?. the first device 
> where the route fits wins.
 There is no list of asic instances. Therefore the notifier fits much 
 better here.



> it seems similar to driver registering for notifier and looking at every 
> route ...
> am i missing something ?
> and the policies you mention could help around selecting the asic 
> instance (FCFS or mirror).
> you will need to abstract out the asic instance for switchdev api to call 
> on, but I thought you
> already have that in some form in your devlink infrastructure.
 switchdev asic instances and devlink instances are orthogonal.
>>> maybe it is not today...but the requirement for devlink was to provide a 
>>> way to communicate
>>> to the switch driver
>>> - global switch attributes or
>>> - things that cannot go via switch ports (exactly the problem you are 
>>> trying to solve for routes here)
>> Devlink is a general beast, not switch specific one. I see no need to
>> use fib->devlink->driver route inside kernel. Devlink is for userspace
>> facing.
>
>yes, sure. it has a dev abstraction and an api. devlink discussion started a 
>few years ago in the context
>of switch asics for the very same reason that it will help direct the offload 
>call to the
>switch device driver when you cant apply the settings on a per port basis.
>You have kept the abstraction and api generic ..which is a great thing.
>But that can't be the reason for it to not support its original intent...if 
>there is a way.
>
>>
>>
>>> so,  maybe an instance of switch asic modeled via devlink will help here 
>>> and possibly all/other switchdev
>>> offload hooks ?
>> Maybe, but in case of fibs, the notifier just fits great. I see no need
>> for anything else.
>
>I think its better to stick with 'offload api or notifier' whichever we pick ..
>to be consistent with other switchdev offload areas. That was the original 
>intent of
>introducing the switchdev api layer. If we are now replacing the switchdev api 
>with notifiers,

I strongly disagree. Making it uniform is not desirable. For some things,
a direct ndo/sdo makes sense and is better. For other things, a notifier
fits better. For example, when I was implementing LAG offload, I also
chose a notifier.


>assuming 'notifiers are the best way' to offload routes, lets keep it 
>consistent with
>oth

[PATCH v3] iproute2: build nsid-name cache only for commands that need it

2016-09-19 Thread Anton Aksola
Calling netns_map_init() before command parsing introduced a performance
issue with a large number of namespaces.

As commands such as add, del and exec do not need to iterate through
/var/run/netns, it would be good not to build the cache before executing
these commands.

Example:
unpatched:
time seq 1 1000 | xargs -n 1 ip netns add

real    0m16.832s
user    0m1.350s
sys     0m15.029s

patched:
time seq 1 1000 | xargs -n 1 ip netns add

real    0m3.859s
user    0m0.132s
sys     0m3.205s

Signed-off-by: Anton Aksola 
---
 ip/ip_common.h|  1 +
 ip/ipmonitor.c|  1 +
 ip/ipnetns.c  | 31 ++-
 testsuite/tests/ip/netns/set_nsid.t   | 22 ++
 testsuite/tests/ip/netns/set_nsid_batch.t | 18 ++
 5 files changed, 64 insertions(+), 9 deletions(-)
 create mode 100755 testsuite/tests/ip/netns/set_nsid.t
 create mode 100755 testsuite/tests/ip/netns/set_nsid_batch.t

diff --git a/ip/ip_common.h b/ip/ip_common.h
index 93ff5bc..fabc4b5 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -31,6 +31,7 @@ int print_netconf(const struct sockaddr_nl *who,
  struct rtnl_ctrl_data *ctrl,
  struct nlmsghdr *n, void *arg);
 void netns_map_init(void);
+void netns_nsid_socket_init(void);
 int print_nsid(const struct sockaddr_nl *who,
   struct nlmsghdr *n, void *arg);
 int do_ipaddr(int argc, char **argv);
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 2090a45..c892b8f 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -301,6 +301,7 @@ int do_ipmonitor(int argc, char **argv)
exit(1);
 
ll_init_map(&rth);
+   netns_nsid_socket_init();
netns_map_init();
 
if (rtnl_listen(&rth, accept_msg, stdout) < 0)
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index af87065..6b42751 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -194,6 +194,18 @@ static void netns_map_del(struct nsid_cache *c)
free(c);
 }
 
+void netns_nsid_socket_init(void)
+{
+   if (rtnsh.fd > -1 || !ipnetns_have_nsid())
+   return;
+
+   if (rtnl_open(&rtnsh, 0) < 0) {
+   fprintf(stderr, "Cannot open rtnetlink\n");
+   exit(1);
+   }
+
+}
+
 void netns_map_init(void)
 {
static int initialized;
@@ -204,11 +216,6 @@ void netns_map_init(void)
if (initialized || !ipnetns_have_nsid())
return;
 
-   if (rtnl_open(&rtnsh, 0) < 0) {
-   fprintf(stderr, "Cannot open rtnetlink\n");
-   exit(1);
-   }
-
dir = opendir(NETNS_RUN_DIR);
if (!dir)
return;
@@ -775,17 +782,23 @@ static int netns_monitor(int argc, char **argv)
 
 int do_netns(int argc, char **argv)
 {
-   netns_map_init();
+   netns_nsid_socket_init();
 
-   if (argc < 1)
+   if (argc < 1) {
+   netns_map_init();
return netns_list(0, NULL);
+   }
 
if ((matches(*argv, "list") == 0) || (matches(*argv, "show") == 0) ||
-   (matches(*argv, "lst") == 0))
+   (matches(*argv, "lst") == 0)) {
+   netns_map_init();
return netns_list(argc-1, argv+1);
+   }
 
-   if ((matches(*argv, "list-id") == 0))
+   if ((matches(*argv, "list-id") == 0)) {
+   netns_map_init();
return netns_list_id(argc-1, argv+1);
+   }
 
if (matches(*argv, "help") == 0)
return usage();
diff --git a/testsuite/tests/ip/netns/set_nsid.t 
b/testsuite/tests/ip/netns/set_nsid.t
new file mode 100755
index 000..606d45a
--- /dev/null
+++ b/testsuite/tests/ip/netns/set_nsid.t
@@ -0,0 +1,22 @@
+#!/bin/sh
+
+source lib/generic.sh
+
+ts_log "[Testing netns nsid]"
+
+NS=testnsid
+NSID=99
+
+ts_ip "$0" "Add new netns $NS" netns add $NS
+ts_ip "$0" "Set $NS nsid to $NSID" netns set $NS $NSID
+
+ts_ip "$0" "List netns" netns list
+test_on "$NS \(id: $NSID\)"
+
+ts_ip "$0" "List netns without explicit list or show" netns
+test_on "$NS \(id: $NSID\)"
+
+ts_ip "$0" "List nsid" netns list-id
+test_on "$NSID \(iproute2 netns name: $NS\)"
+
+ts_ip "$0" "Delete netns $NS" netns del $NS
diff --git a/testsuite/tests/ip/netns/set_nsid_batch.t 
b/testsuite/tests/ip/netns/set_nsid_batch.t
new file mode 100755
index 000..abb3f1b
--- /dev/null
+++ b/testsuite/tests/ip/netns/set_nsid_batch.t
@@ -0,0 +1,18 @@
+#!/bin/sh
+
+source lib/generic.sh
+
+ts_log "[Testing netns nsid in batch mode]"
+
+NS=testnsid
+NSID=99
+BATCHFILE=`mktemp`
+
+echo "netns add $NS" >> $BATCHFILE
+echo "netns set $NS $NSID" >> $BATCHFILE
+echo "netns list-id" >> $BATCHFILE
+ts_ip "$0" "Add ns, set nsid and list in batch mode" -b $BATCHFILE
+test_on "nsid $NSID \(iproute2 netns name: $NS\)"
+rm -f $BATCHFILE
+
+ts_ip "$0" "Delete netns $NS" netns del $NS
-- 
1.8.3.1



Re: [patch net-next RFC 0/2] fib4 offload: notifier to let hw to be aware of all prefixes

2016-09-19 Thread Roopa Prabhu
On 9/19/16, 8:15 AM, Jiri Pirko wrote:
> Mon, Sep 19, 2016 at 04:59:22PM CEST, ro...@cumulusnetworks.com wrote:
>> On 9/18/16, 11:14 PM, Jiri Pirko wrote:
>>> Mon, Sep 19, 2016 at 01:16:17AM CEST, ro...@cumulusnetworks.com wrote:
 On 9/18/16, 1:00 PM, Florian Fainelli wrote:
> Le 06/09/2016 à 05:01, Jiri Pirko a écrit :
>> From: Jiri Pirko 
>>
>> This is RFC, unfinished. I came across some issues in the process so I 
>> would
>> like to share those and restart the fib offload discussion in order to 
>> make it
>> really usable.
>>
>> So the goal of this patchset is to allow driver to propagate all prefixes
>> configured in kernel down HW. This is necessary for routing to work
>> as expected. If we don't do that HW might forward prefixes known to 
>> kernel
>> incorrectly. Take an example when default route is set in switch HW and 
>> there
>> is an IP address set on a management (non-switch) port.
>>
>> Currently, only fibs related to the switch port netdev are offloaded 
>> using
>> switchdev ops. This model is not extendable so the first patch introduces
>> a replacement: notifier to propagate fib additions and removals to 
>> whoever
>> interested. The second patch makes mlxsw to adopt this new way, 
>> registering
>> one notifier block for each mlxsw (asic) instance.
> Instead of introducing another specialization of a notifier_block
> implementation, could we somehow have a kernel-based netlink listener
> which receives the same kind of event information from rtmsg_fib()?
>
> The reason is that having such a facility would hook directly onto
> existing rtmsg_* calls that exist throughout the stack, and that seems
> to scale better.
 I was thinking along the same lines. Instead of proliferating notifier 
 blocks
 through-out the stack for switchdev offload, putting existing events to 
 use would be nice.

 But the problem though is drivers having to parse the netlink msg again. 
 also, the intent
 here is to do the offload first ..before the route is added to the kernel 
 (though i don't see that in
 the current series). Existing netlink rtmsg_fib events are generated after 
 the route is added to the kernel.


 Jiri, instead of the notifier, do you see a problem with always calling 
 the existing switchdev
 offload api for every route  for every asic instance ?. the first device 
 where the route fits wins.
>>> There is not list of asic instances. Therefore the notifier fits much 
>>> better here.
>>>
>>>
>>>
 it seems similar to driver registering for notifier and looking at every 
 route ...
 am i missing something ?
 and the policies you mention could help around selecting the asic instance 
 (FCFS or mirror).
 you will need to abstract out the asic instance for switchdev api to call 
 on, but I thought you
 already have that in some form in your devlink infrastructure.
>>> switchdev asic instances and devlink instances are orthogonal.
>> maybe it is not today...but the requirement for devlink was to provide a way 
>> to communicate
>> to the switch driver
>> - global switch attributes or
>> - things that cannot go via switch ports (exactly the problem you are trying 
>> to solve for routes here)
> Devlink is a general beast, not switch specific one. I see no need to
> use fib->devlink->driver route inside kernel. Devlink is for userspace
> facing.

yes, sure. It has a dev abstraction and an api. The devlink discussion started
a few years ago in the context of switch asics for the very same reason: it
helps direct the offload call to the switch device driver when you can't apply
the settings on a per port basis. You have kept the abstraction and api
generic, which is a great thing. But that can't be the reason for it to not
support its original intent... if there is a way.

>
>
>> so,  maybe an instance of switch asic modeled via devlink will help here and 
>> possibly all/other switchdev
>> offload hooks ?
> Maybe, but in case of fibs, the notifier just fits great. I see no need
> for anything else.

I think it's better to stick with 'offload api or notifier', whichever we pick,
to be consistent with other switchdev offload areas. That was the original
intent of introducing the switchdev api layer. If we are now replacing the
switchdev api with notifiers, assuming 'notifiers are the best way' to offload
routes, let's keep it consistent with other switchdev offload areas too.

I know you already have them for links... and that is good, because links
already have notifiers. We will need the same thing for acls. Having notifiers
for acls too seems like overkill. We will then have to extend this to
multicast and mpls routes too. Will all these be notifiers too?

Do you see any scale problems with using notifiers? As you know, these ASICs
can scale to 32k-128k.
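
For readers trying to picture the proposal being debated here, below is a rough
sketch of what a per-ASIC consumer of fib add/del notifications could look
like. It is not taken from the RFC patches: the event names, the info struct
and the registration entry point are hypothetical placeholders, and only the
generic notifier_block plumbing is existing kernel API.

#include <linux/notifier.h>
#include <linux/printk.h>
#include <linux/types.h>

/* Hypothetical event codes and payload, standing in for the RFC's real API. */
enum { EXAMPLE_FIB_ENTRY_ADD, EXAMPLE_FIB_ENTRY_DEL };

struct example_fib_info {
	u32 dst;	/* prefix */
	int dst_len;	/* prefix length */
	u32 tb_id;	/* routing table id */
};

/* One notifier block per ASIC instance, as proposed for mlxsw in the RFC. */
static int example_asic_fib_event(struct notifier_block *nb,
				  unsigned long event, void *ptr)
{
	const struct example_fib_info *info = ptr;

	switch (event) {
	case EXAMPLE_FIB_ENTRY_ADD:
		/* program the prefix into this ASIC instance's LPM table */
		pr_debug("offload add %x/%d tb %u\n",
			 info->dst, info->dst_len, info->tb_id);
		break;
	case EXAMPLE_FIB_ENTRY_DEL:
		/* remove the prefix from this ASIC instance */
		pr_debug("offload del %x/%d tb %u\n",
			 info->dst, info->dst_len, info->tb_id);
		break;
	}
	return NOTIFY_DONE;
}

static struct notifier_block example_asic_fib_nb = {
	.notifier_call = example_asic_fib_event,
};

/* The driver would hand example_asic_fib_nb to whatever registration helper
 * the final series adds (hypothetically, register_fib_notifier()).
 */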

Re: [net-next PATCH] net: netlink messages for HW addr programming

2016-09-19 Thread Jiri Pirko
Tue, Sep 20, 2016 at 07:31:27AM CEST, ro...@cumulusnetworks.com wrote:
>On 9/19/16, 7:46 AM, Patrick Ruddy wrote:
>> On Sun, 2016-09-18 at 07:51 -0700, Roopa Prabhu wrote:
>>> On 9/15/16, 9:48 AM, Patrick Ruddy wrote:
 Add RTM_NEWADDR and RTM_DELADDR netlink messages with family
 AF_UNSPEC to indicate interest in specific unicast and multicast
 hardware addresses. These messages are sent when addresses are
 added or deleted from the appropriate interface driver.
 Added AF_UNSPEC GETADDR function to allow the netlink notifications
 to be replayed to avoid loss of state due to application start
 ordering or restart.

 Signed-off-by: Patrick Ruddy 
 ---
>>> RTM_NEWADDR and RTM_DELADDR are not used to add these entries to the kernel.
>>> so, it seems a bit wrong to use RTM_NEWADDR and RTM_DELADDR to notify them 
>>> to
>>> userspace and also to request a special dump of these addresses.
>>>
>>> This could just be a new nested netlink attribute in the existing link dump 
>>> ?
>> Hi Roopa
>>
>> Thanks for the review. I did initially code this using NEW/DEL/GET_LINK
>> messages but was asked to change to to ADDR messages by Stephen
>> Hemminger (cc'd). 
>>
>> However I agree that these addresses fall between the LINK and ADDR
>> areas so I'm happy to change this if we can reach some consensus on the
>> format.
>>
>ok, thanks for the history. yes, they do lie in a weird spot.

They are l2 addresses, so they should be treated accordingly. Am I missing
something?


>the general convention for other rtnl registrations seems to be
>AF_UNSPEC family means include all supported families. thats where this seems 
>a bit odd.
>
>On the other hand, one reason I see where using RTM_*ADDR will be useful for 
>this is if we wanted
>to provide a way to add these uc and mc address via ip addr add in the future.
>ip addr add  dev eth0
>
>Does this patch allow that in the future ?

This should go under ip link, I believe. "ip addr" is for l3.


>
>also, will these l2 addresses now show up in 'ip addr show' output ?.
>
>thanks,
>Roopa
>


Re: [net-next PATCH] net: netlink messages for HW addr programming

2016-09-19 Thread Roopa Prabhu
On 9/19/16, 7:46 AM, Patrick Ruddy wrote:
> On Sun, 2016-09-18 at 07:51 -0700, Roopa Prabhu wrote:
>> On 9/15/16, 9:48 AM, Patrick Ruddy wrote:
>>> Add RTM_NEWADDR and RTM_DELADDR netlink messages with family
>>> AF_UNSPEC to indicate interest in specific unicast and multicast
>>> hardware addresses. These messages are sent when addresses are
>>> added or deleted from the appropriate interface driver.
>>> Added AF_UNSPEC GETADDR function to allow the netlink notifications
>>> to be replayed to avoid loss of state due to application start
>>> ordering or restart.
>>>
>>> Signed-off-by: Patrick Ruddy 
>>> ---
>> RTM_NEWADDR and RTM_DELADDR are not used to add these entries to the kernel.
>> so, it seems a bit wrong to use RTM_NEWADDR and RTM_DELADDR to notify them to
>> userspace and also to request a special dump of these addresses.
>>
>> This could just be a new nested netlink attribute in the existing link dump ?
> Hi Roopa
>
> Thanks for the review. I did initially code this using NEW/DEL/GET_LINK
> messages but was asked to change to to ADDR messages by Stephen
> Hemminger (cc'd). 
>
> However I agree that these addresses fall between the LINK and ADDR
> areas so I'm happy to change this if we can reach some consensus on the
> format.
>
ok, thanks for the history. Yes, they do lie in a weird spot.
The general convention for other rtnl registrations seems to be that the
AF_UNSPEC family means "include all supported families"; that's where this
seems a bit odd.

On the other hand, one reason I see where using RTM_*ADDR would be useful is
if we wanted to provide a way to add these uc and mc addresses via
ip addr add in the future:
ip addr add  dev eth0

Does this patch allow that in the future?

Also, will these l2 addresses now show up in 'ip addr show' output?

thanks,
Roopa



Re: [PATCH v2 4/6] net: ethernet: bgmac: convert to feature flags

2016-09-19 Thread Rafał Miłecki
On 17 August 2016 at 13:34, Rafał Miłecki  wrote:
> On 8 July 2016 at 01:08, Jon Mason  wrote:
>> mode = (bgmac_read(bgmac, BGMAC_DEV_STATUS) & BGMAC_DS_MM_MASK) >>
>> BGMAC_DS_MM_SHIFT;
>> -   if (ci->id != BCMA_CHIP_ID_BCM47162 || mode != 0)
>> +   if (bgmac->feature_flags & BGMAC_FEAT_CLKCTLST || mode != 0)
>> bgmac_set(bgmac, BCMA_CLKCTLST, BCMA_CLKCTLST_FORCEHT);
>> -   if (ci->id == BCMA_CHIP_ID_BCM47162 && mode == 2)
>> +   if (bgmac->feature_flags & BGMAC_FEAT_CLKCTLST && mode == 2)
>> bcma_chipco_chipctl_maskset(&bgmac->core->bus->drv_cc, 1, ~0,
>> BGMAC_CHIPCTL_1_RXC_DLL_BYPASS);
>
> Jon, it looks to me like you translated the two following conditions:
> ci->id != BCMA_CHIP_ID_BCM47162
> and
> ci->id == BCMA_CHIP_ID_BCM47162
> into the same flag check:
> bgmac->feature_flags & BGMAC_FEAT_CLKCTLST
>
> I don't think it's intentional, is it? Do you have a moment to fix this?

Ping

-- 
Rafał
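
For reference, one plausible shape for the fix being asked for, sketched here
rather than taken from any posted patch: give the BCM47162 quirk its own
feature flag, set only on that chip, and test it negated in the first
condition. BGMAC_FEAT_BCM47162_QUIRK is an invented name; the surrounding
calls are the ones quoted above.

/* Sketch only: assumes a hypothetical flag set exclusively on BCM47162. */
if (!(bgmac->feature_flags & BGMAC_FEAT_BCM47162_QUIRK) || mode != 0)
	bgmac_set(bgmac, BCMA_CLKCTLST, BCMA_CLKCTLST_FORCEHT);
if ((bgmac->feature_flags & BGMAC_FEAT_BCM47162_QUIRK) && mode == 2)
	bcma_chipco_chipctl_maskset(&bgmac->core->bus->drv_cc, 1, ~0,
				    BGMAC_CHIPCTL_1_RXC_DLL_BYPASS);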


[PATCH net-next] mlxsw: spectrum: Make offloads stats functions static

2016-09-19 Thread Or Gerlitz
The offload stats functions are local to this file, so make them static.

Fixes: fc1bbb0f1831 ('mlxsw: spectrum: Implement offload stats ndo [..]')
Signed-off-by: Or Gerlitz 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 171f8dd..efac909 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -819,7 +819,7 @@ err_span_port_mtu_update:
return err;
 }
 
-int
+static int
 mlxsw_sp_port_get_sw_stats64(const struct net_device *dev,
 struct rtnl_link_stats64 *stats)
 {
@@ -851,7 +851,7 @@ mlxsw_sp_port_get_sw_stats64(const struct net_device *dev,
return 0;
 }
 
-bool mlxsw_sp_port_has_offload_stats(int attr_id)
+static bool mlxsw_sp_port_has_offload_stats(int attr_id)
 {
switch (attr_id) {
case IFLA_OFFLOAD_XSTATS_CPU_HIT:
@@ -861,8 +861,8 @@ bool mlxsw_sp_port_has_offload_stats(int attr_id)
return false;
 }
 
-int mlxsw_sp_port_get_offload_stats(int attr_id, const struct net_device *dev,
-   void *sp)
+static int mlxsw_sp_port_get_offload_stats(int attr_id, const struct 
net_device *dev,
+  void *sp)
 {
switch (attr_id) {
case IFLA_OFFLOAD_XSTATS_CPU_HIT:
-- 
2.3.7



Re: Re: [PATCH] sunrpc: queue work on system_power_efficient_wq

2016-09-19 Thread Chunyan Zhang
Resending on behalf of Ke Wang.

Thanks,
Chunyan

On 20 September 2016 at 10:33, Ke Wang (王科)  wrote:
> May I have any comments on this patch?
> Or can this patch be merged directly into the next release?
>
> Thanks,
> Ke
> 
> From: Anna Schumaker 
> Sent: 2 September 2016, 2:46
> To: Chunyan Zhang; trond.mykleb...@primarydata.com; 
> anna.schuma...@netapp.com; da...@davemloft.net
> Cc: linux-...@vger.kernel.org; netdev@vger.kernel.org; 
> linux-ker...@vger.kernel.org; Ke Wang (王科)
> Subject: Re: [PATCH] sunrpc: queue work on system_power_efficient_wq
>
> On 09/01/2016 03:30 AM, Chunyan Zhang wrote:
>> From: Ke Wang 
>>
>> sunrpc uses a workqueue to clean its cache regularly. There is no real
>> dependency on executing the work on the cpu which queued it.
>>
>> On an idle system, especially for heterogeneous systems like big.LITTLE,
>> it is observed that a big idle cpu is woken up many times just to service
>> this work, which goes against the principle of power saving. It would be
>> better if we could schedule it on a cpu which the scheduler believes to be
>> the most appropriate one.
>>
>> After applying this patch, system_wq is replaced by
>> system_power_efficient_wq for sunrpc. This functionality is enabled when
>> CONFIG_WQ_POWER_EFFICIENT is selected.
>
> Makes sense to me, but I'm a little surprised that there isn't a 
> "schedule_delayed_power_efficient_work()" function to match how the normal 
> workqueue is used.
>
> Thanks,
> Anna
>
>>
>> Signed-off-by: Ke Wang 
>> ---
>>  net/sunrpc/cache.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
>> index 4d8e11f..8aabe12 100644
>> --- a/net/sunrpc/cache.c
>> +++ b/net/sunrpc/cache.c
>> @@ -353,7 +353,7 @@ void sunrpc_init_cache_detail(struct cache_detail *cd)
>>   spin_unlock(&cache_list_lock);
>>
>>   /* start the cleaning process */
>> - schedule_delayed_work(&cache_cleaner, 0);
>> + queue_delayed_work(system_power_efficient_wq, &cache_cleaner, 0);
>>  }
>>  EXPORT_SYMBOL_GPL(sunrpc_init_cache_detail);
>>
>> @@ -476,7 +476,8 @@ static void do_cache_clean(struct work_struct *work)
>>   delay = 0;
>>
>>   if (delay)
>> - schedule_delayed_work(&cache_cleaner, delay);
>> + queue_delayed_work(system_power_efficient_wq,
>> +&cache_cleaner, delay);
>>  }
>>
>>
>>
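
For reference, the helper Anna is alluding to does not appear to exist in
mainline as of this thread. If one were added, it could be as small as the
sketch below; the name and placement are hypothetical.

#include <linux/workqueue.h>

/* Hypothetical convenience wrapper mirroring schedule_delayed_work(). */
static inline bool
schedule_delayed_power_efficient_work(struct delayed_work *dwork,
				      unsigned long delay)
{
	return queue_delayed_work(system_power_efficient_wq, dwork, delay);
}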
>


Re: [PATCH v2 0/2] act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action

2016-09-19 Thread Shmulik Ladkani
This is for net-next, forgot to mention.

Deprecates the v1 of https://patchwork.ozlabs.org/patch/671403/


Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-19 Thread Sargun Dhillon
On Thu, Sep 15, 2016 at 09:41:33PM +0200, Mickaël Salaün wrote:
> 
> On 15/09/2016 06:48, Alexei Starovoitov wrote:
> > On Wed, Sep 14, 2016 at 09:38:16PM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 14, 2016 at 9:31 PM, Alexei Starovoitov
> >>  wrote:
> >>> On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote:
>  On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
>   wrote:
> > On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >
> > This RFC handle both cgroup and seccomp approaches in a similar 
> > way. I
> > don't see why building on top of cgroup v2 is a problem. Is there
> > security issues with delegation?
> 
>  What I mean is: cgroup v2 delegation has a functionality problem.
>  Tejun says [1]:
> 
>  We haven't had to face this decision because cgroup has never 
>  properly
>  supported delegating to applications and the in-use setups where this
>  happens are custom configurations where there is no boundary between
>  system and applications and adhoc trial-and-error is good enough a 
>  way
>  to find a working solution.  That wiggle room goes away once we
>  officially open this up to individual applications.
> 
>  Unless and until that changes, I think that landlock should stay away
>  from cgroups.  Others could reasonably disagree with me.
> >>>
> >>> Ours and Sargun's use cases for cgroup+lsm+bpf are not for security
> >>> and not for sandboxing. So the above doesn't matter in such contexts.
> >>> lsm hooks + cgroups provide convenient scope and existing entry 
> >>> points.
> >>> Please see checmate examples how it's used.
> >>>
> >>
> >> To be clear: I'm not arguing at all that there shouldn't be
> >> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> >> landlock interface shouldn't expose any cgroup integration, at least
> >> until the cgroup situation settles down a lot.
> >
> > ahh. yes. we're perfectly in agreement here.
> > I'm suggesting that the next RFC shouldn't include unpriv
> > and seccomp at all. Once bpf+lsm+cgroup is merged, we can
> > argue about unpriv with cgroups and even unpriv as a whole,
> > since it's not a given. Seccomp integration is also questionable.
> > I'd rather not have seccomp as a gate keeper for this lsm.
> > lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> > don't have one to one relationship, so mixing them up is only
> > asking for trouble further down the road.
> > If we really need to carry some information from seccomp to lsm+bpf,
> > it's easier to add eBPF support to seccomp and let bpf side deal
> > with passing whatever information.
> >
> 
>  As an argument for keeping seccomp (or an extended seccomp) as the
>  interface for an unprivileged bpf+lsm: seccomp already checks off most
>  of the boxes for safely letting unprivileged programs sandbox
>  themselves.
> >>>
> >>> you mean the attach part of seccomp syscall that deals with no_new_priv?
> >>> sure, that's reusable.
> >>>
>  Furthermore, to the extent that there are use cases for
>  unprivileged bpf+lsm that *aren't* expressible within the seccomp
>  hierarchy, I suspect that syscall filters have exactly the same
>  problem and that we should fix seccomp to cover it.
> >>>
> >>> not sure what you mean by 'seccomp hierarchy'. The normal process
> >>> hierarchy ?
> >>
> >> Kind of.  I mean the filter layers that are inherited across fork(),
> >> the TSYNC mechanism, etc.
> >>
> >>> imo the main deficiency of secccomp is inability to look into arguments.
> >>> One can argue that it's a blessing, since composite args
> >>> are not yet copied into the kernel memory.
> >>> But in a lot of cases the seccomp arguments are FDs pointing
> >>> to kernel objects and if programs could examine those objects
> >>> the sandboxing scope would be more precise.
> >>> lsm+bpf solves that part and I'd still argue that it's
> >>> orthogonal to seccomp's pass/reject flow.
> >>> I mean if seccomp says 'ok' the syscall should continue executing
> >>> as normal and whatever LSM hooks were triggered by it may have
> >>> their own lsm+bpf verdicts.
> >>
> >> I agree with all of this...
> >>
> >>> Furthermore in the process hierarchy different children
> >>> should be able to set their own lsm+bpf filters that are not
> >>> related to parallel seccomp+bpf hierarchy of programs.
> >>> seccomp syscall can be an interface to attach programs
> >>> to lsm hooks, but nothing more than that.
> >>
> >> I'm not sure what you mean.  I mean that, logically, I think we should
> >> be able to do:
> >>
> >> seccomp(attach a syscall filter);
> >> fork();
> >> child does seccomp(attach some lsm filters);
> >>
> >> I think that they *should* be related to the seccomp

Re: [PATCH] net: skbuff: Fix length validation in skb_vlan_pop()

2016-09-19 Thread Shmulik Ladkani
Hi,

On Mon, 19 Sep 2016 13:46:10 -0700 pravin shelar  wrote:
> On Mon, Sep 19, 2016 at 1:04 PM, Shmulik Ladkani
>  wrote:
> > Hi Pravin,
> >
> > On Sun, 18 Sep 2016 13:26:30 -0700 pravin shelar  wrote:  
> >> > +++ b/net/core/skbuff.c
> >> > @@ -4537,7 +4537,7 @@ int skb_vlan_pop(struct sk_buff *skb)
> >> > } else {
> >> > if (unlikely((skb->protocol != htons(ETH_P_8021Q) &&
> >> >   skb->protocol != htons(ETH_P_8021AD)) ||
> >> > -skb->len < VLAN_ETH_HLEN))
> >> > +skb->mac_len < VLAN_ETH_HLEN))  
> >>
> >> There is already check in __skb_vlan_pop() to validate skb for a vlan
> >> header. So it is safe to drop this check entirely.  
> >
> > Yep, I submitted a v2 with your suggestion, however I withdrew it, as
> > there is a slight behavior difference noticeable by 'skb_vlan_pop' callers.
> >
> > Suppose the rare case where skb->len is too small.
> >
> > pre:
> >   skb_vlan_pop returns 0 (at least for the correct tx path).
> >   Meaning, callers do not see it as a failure.
> > post:
> >   skb_ensure_writable fails (!pskb_may_pull), therefore -ENOMEM returned
> >   to the callers of 'skb_vlan_pop'.
> >
> > For ovs, it means do_execute_actions's loop is terminated, no further
> > actions are executed, and skb gets freed.
> >
> > For tc act vlan, it means skb gets dropped.
> >
> > This actually makes sense, but do we want to present this change?
> >  
> I think this is correct behavior over existing code.

Ok.
I'll submit a v3 identical to v2 but with proper statement of this
behavior change in the commit log.

Thanks.


[PATCH v4 net-next 10/16] tcp: allow congestion control module to request TSO skb segment count

2016-09-19 Thread Neal Cardwell
Add the tso_segs_goal() function in tcp_congestion_ops to allow the
congestion control module to specify the number of segments that
should be in a TSO skb sent by tcp_write_xmit() and
tcp_xmit_retransmit_queue(). The congestion control module can either
request a particular number of segments per TSO skb that we transmit,
or return 0 if it doesn't care.

This allows the upcoming BBR congestion control module to select small
TSO skb sizes if the module detects that the bottleneck bandwidth is
very low, or that the connection is policed to a low rate.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/tcp.h |  2 ++
 net/ipv4/tcp_output.c | 15 +--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a69ed7f..f8f581f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -913,6 +913,8 @@ struct tcp_congestion_ops {
u32  (*undo_cwnd)(struct sock *sk);
/* hook for packet ack accounting (optional) */
void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
+   /* suggest number of segments for each skb to transmit (optional) */
+   u32 (*tso_segs_goal)(struct sock *sk);
/* get info for inet_diag (optional) */
size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e02c8eb..0137956 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1566,6 +1566,17 @@ static u32 tcp_tso_autosize(const struct sock *sk, 
unsigned int mss_now)
return min_t(u32, segs, sk->sk_gso_max_segs);
 }
 
+/* Return the number of segments we want in the skb we are transmitting.
+ * See if congestion control module wants to decide; otherwise, autosize.
+ */
+static u32 tcp_tso_segs(struct sock *sk, unsigned int mss_now)
+{
+   const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
+   u32 tso_segs = ca_ops->tso_segs_goal ? ca_ops->tso_segs_goal(sk) : 0;
+
+   return tso_segs ? : tcp_tso_autosize(sk, mss_now);
+}
+
 /* Returns the portion of skb which can be sent right away */
 static unsigned int tcp_mss_split_point(const struct sock *sk,
const struct sk_buff *skb,
@@ -2061,7 +2072,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
}
}
 
-   max_segs = tcp_tso_autosize(sk, mss_now);
+   max_segs = tcp_tso_segs(sk, mss_now);
while ((skb = tcp_send_head(sk))) {
unsigned int limit;
 
@@ -2778,7 +2789,7 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
last_lost = tp->snd_una;
}
 
-   max_segs = tcp_tso_autosize(sk, tcp_current_mss(sk));
+   max_segs = tcp_tso_segs(sk, tcp_current_mss(sk));
tcp_for_write_queue_from(skb, sk) {
__u8 sacked;
int segs;
-- 
2.8.0.rc3.226.g39d4020



[PATCH v4 net-next 12/16] tcp: export tcp_mss_to_mtu() for congestion control modules

2016-09-19 Thread Neal Cardwell
Export tcp_mss_to_mtu(), so that congestion control modules can use
this to help calculate a pacing rate.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 net/ipv4/tcp_output.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0bf3d48..7d025a7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1362,6 +1362,7 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
}
return mtu;
 }
+EXPORT_SYMBOL(tcp_mss_to_mtu);
 
 /* MTU probing init per socket */
 void tcp_mtup_init(struct sock *sk)
-- 
2.8.0.rc3.226.g39d4020
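
As a hedged illustration of the intended use (not taken from BBR itself): a
module holding a bandwidth estimate in packets per second could account for
header overhead when converting it into a bytes-per-second pacing rate,
roughly as below. The function name and the unit convention are assumptions.

#include <net/tcp.h>

static u64 example_bw_to_pacing_rate(struct sock *sk, u64 bw_pkts_per_sec)
{
	/* tcp_mss_to_mtu() adds header overhead back on top of the MSS */
	unsigned int mtu = tcp_mss_to_mtu(sk, tcp_sk(sk)->mss_cache);

	return bw_pkts_per_sec * mtu;	/* bytes per second */
}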



[PATCH v4 net-next 00/16] tcp: BBR congestion control algorithm

2016-09-19 Thread Neal Cardwell
tcp: BBR congestion control algorithm

This patch series implements a new TCP congestion control algorithm:
BBR (Bottleneck Bandwidth and RTT). A paper with a detailed
description of BBR will be published in ACM Queue, September-October
2016, as "BBR: Congestion-Based Congestion Control". BBR is widely
deployed in production at Google.

The patch series starts with a set of supporting infrastructure
changes, including a few that extend the congestion control
framework. The last patch adds BBR as a TCP congestion control
module. Please see individual patches for the details.

- v3 -> v4:
 - Updated tcp_bbr.c in "tcp_bbr: add BBR congestion control"
   to use const to qualify all the constant parameters.
   Thanks to Stephen Hemminger.
 - In "tcp_bbr: add BBR congestion control", remove the bbr_rate_kbps()
   function, which had a 64-bit divide that would be problematic on some
   architectures, and just use bbr_rate_bytes_per_sec() directly.
   Thanks to Kenneth Klette Jonassen for suggesting this.
 - In "tcp: switch back to proper tcp_skb_cb size check in tcp_init()",
   switched from sizeof(skb->cb) to FIELD_SIZEOF.
   Thanks to Lance Richardson for suggesting this.
 - Updated "tcp_bbr: add BBR congestion control" commit message with
   performance data, more details about deployment at Google, and
   another reminder to use fq with BBR.
 - Updated tcp_bbr.c in "tcp_bbr: add BBR congestion control"
   to use MODULE_LICENSE("Dual BSD/GPL").

- v2 -> v3: fix another issue caught by build bots:
 - adjust rate_sample struct initialization syntax to allow gcc-4.4 to compile
   the "tcp: track data delivery rate for a TCP connection" patch; also
   adjusted some similar syntax in "tcp_bbr: add BBR congestion control"

- v1 -> v2: fix issues caught by build bots:
 - fix "tcp: export data delivery rate" to use rate64 instead of rate,
   so there is a 64-bit numerator for the do_div call
 - fix conflicting definitions for minmax caused by
   "tcp: use windowed min filter library for TCP min_rtt estimation"
   with a new commit:
   tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
 - fix warning about the use of __packed in
   "tcp: track data delivery rate for a TCP connection",
   which involves the addition of a new commit:
   tcp: switch back to proper tcp_skb_cb size check in tcp_init()  

Eric Dumazet (2):
  net_sched: sch_fq: add low_rate_threshold parameter
  tcp: switch back to proper tcp_skb_cb size check in tcp_init()

Neal Cardwell (8):
  lib/win_minmax: windowed min or max estimator
  tcp: use windowed min filter library for TCP min_rtt estimation
  tcp: count packets marked lost for a TCP connection
  tcp: allow congestion control module to request TSO skb segment count
  tcp: export tcp_tso_autosize() and parameterize minimum number of TSO
segments
  tcp: export tcp_mss_to_mtu() for congestion control modules
  tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88
  tcp_bbr: add BBR congestion control

Soheil Hassas Yeganeh (2):
  tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
  tcp: track application-limited rate samples

Yuchung Cheng (4):
  tcp: track data delivery rate for a TCP connection
  tcp: export data delivery rate
  tcp: allow congestion control to expand send buffer differently
  tcp: new CC hook to set sending rate with rate_sample in any CA state

 include/linux/tcp.h|  14 +-
 include/linux/win_minmax.h |  37 ++
 include/net/inet_connection_sock.h |   4 +-
 include/net/tcp.h  |  53 ++-
 include/uapi/linux/inet_diag.h |  13 +
 include/uapi/linux/pkt_sched.h |   2 +
 include/uapi/linux/tcp.h   |   3 +
 lib/Makefile   |   2 +-
 lib/win_minmax.c   |  98 
 net/ipv4/Kconfig   |  18 +
 net/ipv4/Makefile  |   3 +-
 net/ipv4/tcp.c |  26 +-
 net/ipv4/tcp_bbr.c | 896 +
 net/ipv4/tcp_cdg.c |  12 +-
 net/ipv4/tcp_cong.c|   2 +-
 net/ipv4/tcp_input.c   | 154 +++
 net/ipv4/tcp_minisocks.c   |   5 +-
 net/ipv4/tcp_output.c  |  27 +-
 net/ipv4/tcp_rate.c| 186 
 net/sched/sch_fq.c |  22 +-
 20 files changed, 1470 insertions(+), 107 deletions(-)
 create mode 100644 include/linux/win_minmax.h
 create mode 100644 lib/win_minmax.c
 create mode 100644 net/ipv4/tcp_bbr.c
 create mode 100644 net/ipv4/tcp_rate.c

-- 
2.8.0.rc3.226.g39d4020



[PATCH v4 net-next 13/16] tcp: allow congestion control to expand send buffer differently

2016-09-19 Thread Neal Cardwell
From: Yuchung Cheng 

Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase in CA_Recovery.

For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.

This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specify this callback, TCP continues to
use the previous default of 2.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/tcp.h| 2 ++
 net/ipv4/tcp_input.c | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3492041..1aa9628 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -917,6 +917,8 @@ struct tcp_congestion_ops {
void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
/* suggest number of segments for each skb to transmit (optional) */
u32 (*tso_segs_goal)(struct sock *sk);
+   /* returns the multiplier used in tcp_sndbuf_expand (optional) */
+   u32 (*sndbuf_expand)(struct sock *sk);
/* get info for inet_diag (optional) */
size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 17de77d..5af0bf3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -289,6 +289,7 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, 
const struct tcphdr
 static void tcp_sndbuf_expand(struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
+   const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
int sndmem, per_mss;
u32 nr_segs;
 
@@ -309,7 +310,8 @@ static void tcp_sndbuf_expand(struct sock *sk)
 * Cubic needs 1.7 factor, rounded to 2 to include
 * extra cushion (application might react slowly to POLLOUT)
 */
-   sndmem = 2 * nr_segs * per_mss;
+   sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
+   sndmem *= nr_segs * per_mss;
 
if (sk->sk_sndbuf < sndmem)
sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
-- 
2.8.0.rc3.226.g39d4020
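
A minimal illustration (not from the series): a module that keeps raising cwnd
during recovery and therefore wants a 3x send-buffer headroom would supply the
optional hook like this, wired up via .sndbuf_expand in its tcp_congestion_ops.

#include <net/tcp.h>

static u32 example_sndbuf_expand(struct sock *sk)
{
	return 3;	/* use a 3x multiplier instead of the default 2x */
}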



[PATCH v4 net-next 06/16] tcp: count packets marked lost for a TCP connection

2016-09-19 Thread Neal Cardwell
Count the number of packets that a TCP connection marks lost.

Congestion control modules can use this loss rate information for more
intelligent decisions about how fast to send.

Specifically, this is used in TCP BBR policer detection. BBR uses a
high packet loss rate as one signal in its policer detection and
policer bandwidth estimation algorithm.

The BBR policer detection algorithm cannot simply track retransmits,
because a retransmit can be (and often is) an indicator of packets
lost long, long ago. This is particularly true in a long CA_Loss
period that repairs the initial massive losses when a policer kicks
in.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  1 +
 net/ipv4/tcp_input.c | 25 -
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 6433cc8..38590fb 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -267,6 +267,7 @@ struct tcp_sock {
 * receiver in Recovery. */
u32 prr_out;/* Total number of pkts sent during Recovery. */
u32 delivered;  /* Total data packets delivered incl. rexmits */
+   u32 lost;   /* Total data packets lost incl. rexmits */
 
u32 rcv_wnd;/* Current receiver window  */
u32 write_seq;  /* Tail(+1) of data held in tcp send buffer */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ac5b38f..024b579 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -899,12 +899,29 @@ static void tcp_verify_retransmit_hint(struct tcp_sock 
*tp, struct sk_buff *skb)
tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
 }
 
+/* Sum the number of packets on the wire we have marked as lost.
+ * There are two cases we care about here:
+ * a) Packet hasn't been marked lost (nor retransmitted),
+ *and this is the first loss.
+ * b) Packet has been marked both lost and retransmitted,
+ *and this means we think it was lost again.
+ */
+static void tcp_sum_lost(struct tcp_sock *tp, struct sk_buff *skb)
+{
+   __u8 sacked = TCP_SKB_CB(skb)->sacked;
+
+   if (!(sacked & TCPCB_LOST) ||
+   ((sacked & TCPCB_LOST) && (sacked & TCPCB_SACKED_RETRANS)))
+   tp->lost += tcp_skb_pcount(skb);
+}
+
 static void tcp_skb_mark_lost(struct tcp_sock *tp, struct sk_buff *skb)
 {
if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED))) {
tcp_verify_retransmit_hint(tp, skb);
 
tp->lost_out += tcp_skb_pcount(skb);
+   tcp_sum_lost(tp, skb);
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
}
 }
@@ -913,6 +930,7 @@ void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, 
struct sk_buff *skb)
 {
tcp_verify_retransmit_hint(tp, skb);
 
+   tcp_sum_lost(tp, skb);
if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED))) {
tp->lost_out += tcp_skb_pcount(skb);
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
@@ -1890,6 +1908,7 @@ void tcp_enter_loss(struct sock *sk)
struct sk_buff *skb;
bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
bool is_reneg;  /* is receiver reneging on SACKs? */
+   bool mark_lost;
 
/* Reduce ssthresh if it has not yet been made inside this window. */
if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1923,8 +1942,12 @@ void tcp_enter_loss(struct sock *sk)
if (skb == tcp_send_head(sk))
break;
 
+   mark_lost = (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) ||
+is_reneg);
+   if (mark_lost)
+   tcp_sum_lost(tp, skb);
TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
-   if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED) || is_reneg) {
+   if (mark_lost) {
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
tp->lost_out += tcp_skb_pcount(skb);
-- 
2.8.0.rc3.226.g39d4020
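
A hedged sketch of how a module might consume the new counter; the struct and
function names are invented, and only tp->lost and tp->delivered come from
this series. The idea is to snapshot both counters once per measurement
interval and derive a loss percentage from the deltas.

#include <linux/printk.h>
#include <net/tcp.h>

struct example_loss_ca {
	u32 prior_lost;
	u32 prior_delivered;
};

static void example_update_loss_rate(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct example_loss_ca *ca = inet_csk_ca(sk);
	u32 lost = tp->lost - ca->prior_lost;
	u32 delivered = tp->delivered - ca->prior_delivered;

	if (lost + delivered)
		pr_debug("loss rate over interval: %u%%\n",
			 lost * 100 / (lost + delivered));

	ca->prior_lost = tp->lost;
	ca->prior_delivered = tp->delivered;
}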



[PATCH v4 net-next 01/16] tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict

2016-09-19 Thread Neal Cardwell
From: Soheil Hassas Yeganeh 

The upcoming change "lib/win_minmax: windowed min or max estimator"
introduces a struct called minmax, which is then included in
include/linux/tcp.h in the upcoming change "tcp: use windowed min
filter library for TCP min_rtt estimation". This would create a
compilation error for tcp_cdg.c, which defines its own minmax
struct. To avoid this naming conflict (and potentially others in the
future), this commit renames the version used in tcp_cdg.c to
cdg_minmax.

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Cc: Kenneth Klette Jonassen 
---
 net/ipv4/tcp_cdg.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index 03725b2..35b2803 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -56,7 +56,7 @@ MODULE_PARM_DESC(use_shadow, "use shadow window heuristic");
 module_param(use_tolerance, bool, 0644);
 MODULE_PARM_DESC(use_tolerance, "use loss tolerance heuristic");
 
-struct minmax {
+struct cdg_minmax {
union {
struct {
s32 min;
@@ -74,10 +74,10 @@ enum cdg_state {
 };
 
 struct cdg {
-   struct minmax rtt;
-   struct minmax rtt_prev;
-   struct minmax *gradients;
-   struct minmax gsum;
+   struct cdg_minmax rtt;
+   struct cdg_minmax rtt_prev;
+   struct cdg_minmax *gradients;
+   struct cdg_minmax gsum;
bool gfilled;
u8  tail;
u8  state;
@@ -353,7 +353,7 @@ static void tcp_cdg_cwnd_event(struct sock *sk, const enum 
tcp_ca_event ev)
 {
struct cdg *ca = inet_csk_ca(sk);
struct tcp_sock *tp = tcp_sk(sk);
-   struct minmax *gradients;
+   struct cdg_minmax *gradients;
 
switch (ev) {
case CA_EVENT_CWND_RESTART:
-- 
2.8.0.rc3.226.g39d4020



[PATCH v4 net-next 02/16] lib/win_minmax: windowed min or max estimator

2016-09-19 Thread Neal Cardwell
This commit introduces a generic library to estimate either the min or
max value of a time-varying variable over a recent time window. This
is code originally from Kathleen Nichols. The current form of the code
is from Van Jacobson.

A single struct minmax_sample will track the estimated windowed-max
value of the series if you call minmax_running_max() or the estimated
windowed-min value of the series if you call minmax_running_min().

Nearly equivalent code is already in place for minimum RTT estimation
in the TCP stack. This commit extracts that code and generalizes it to
handle both min and max. Moving the code here reduces the footprint
and complexity of the TCP code base and makes the filter generally
available for other parts of the codebase, including an upcoming TCP
congestion control module.

This library works well for time series where the measurements are
smoothly increasing or decreasing.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/win_minmax.h | 37 +
 lib/Makefile   |  2 +-
 lib/win_minmax.c   | 98 ++
 3 files changed, 136 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/win_minmax.h
 create mode 100644 lib/win_minmax.c

diff --git a/include/linux/win_minmax.h b/include/linux/win_minmax.h
new file mode 100644
index 000..5656960
--- /dev/null
+++ b/include/linux/win_minmax.h
@@ -0,0 +1,37 @@
+/**
+ * lib/minmax.c: windowed min/max tracker by Kathleen Nichols.
+ *
+ */
+#ifndef MINMAX_H
+#define MINMAX_H
+
+#include 
+
+/* A single data point for our parameterized min-max tracker */
+struct minmax_sample {
+   u32 t;  /* time measurement was taken */
+   u32 v;  /* value measured */
+};
+
+/* State for the parameterized min-max tracker */
+struct minmax {
+   struct minmax_sample s[3];
+};
+
+static inline u32 minmax_get(const struct minmax *m)
+{
+   return m->s[0].v;
+}
+
+static inline u32 minmax_reset(struct minmax *m, u32 t, u32 meas)
+{
+   struct minmax_sample val = { .t = t, .v = meas };
+
+   m->s[2] = m->s[1] = m->s[0] = val;
+   return m->s[0].v;
+}
+
+u32 minmax_running_max(struct minmax *m, u32 win, u32 t, u32 meas);
+u32 minmax_running_min(struct minmax *m, u32 win, u32 t, u32 meas);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 5dc77a8..df747e5 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -22,7 +22,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 sha1.o chacha20.o md5.o irq_regs.o argv_split.o \
 flex_proportions.o ratelimit.o show_mem.o \
 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o
+earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o win_minmax.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/win_minmax.c b/lib/win_minmax.c
new file mode 100644
index 000..c8420d4
--- /dev/null
+++ b/lib/win_minmax.c
@@ -0,0 +1,98 @@
+/**
+ * lib/minmax.c: windowed min/max tracker
+ *
+ * Kathleen Nichols' algorithm for tracking the minimum (or maximum)
+ * value of a data stream over some fixed time interval.  (E.g.,
+ * the minimum RTT over the past five minutes.) It uses constant
+ * space and constant time per update yet almost always delivers
+ * the same minimum as an implementation that has to keep all the
+ * data in the window.
+ *
+ * The algorithm keeps track of the best, 2nd best & 3rd best min
+ * values, maintaining an invariant that the measurement time of
+ * the n'th best >= n-1'th best. It also makes sure that the three
+ * values are widely separated in the time window since that bounds
+ * the worse case error when that data is monotonically increasing
+ * over the window.
+ *
+ * Upon getting a new min, we can forget everything earlier because
+ * it has no value - the new min is <= everything else in the window
+ * by definition and it's the most recent. So we restart fresh on
+ * every new min and overwrites 2nd & 3rd choices. The same property
+ * holds for 2nd & 3rd best.
+ */
+#include 
+#include 
+
+/* As time advances, update the 1st, 2nd, and 3rd choices. */
+static u32 minmax_subwin_update(struct minmax *m, u32 win,
+   const struct minmax_sample *val)
+{
+   u32 dt = val->t - m->s[0].t;
+
+   if (unlikely(dt > win)) {
+   /*
+* Passed entire window without a new val so make 2nd
+* choice the new val & 3rd choice the new 2nd choice.
+* we may have to iterate this since our 2nd choice
+* may also be outside the window (we checked on entry
+* that the third choice was in the window).
+*/
+   m->s[0] = m->s[1];
+   m->s[1] = m->s[2];
+  
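
A brief usage sketch for the new library (illustrative, not part of the
patch): track the windowed minimum of a series of RTT-like samples, with the
timestamp and the window expressed in the same unit (jiffies here). The filter
state is assumed to have been seeded once with minmax_reset().

#include <linux/jiffies.h>
#include <linux/win_minmax.h>

static struct minmax example_rtt_min;	/* windowed-min filter state */

static void example_rtt_sample(u32 rtt_us)
{
	u32 win = 10 * HZ;	/* ten-second window, in jiffies */

	minmax_running_min(&example_rtt_min, win, (u32)jiffies, rtt_us);
	/* the current windowed minimum is minmax_get(&example_rtt_min) */
}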

[PATCH v4 net-next 14/16] tcp: new CC hook to set sending rate with rate_sample in any CA state

2016-09-19 Thread Neal Cardwell
From: Yuchung Cheng 

This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.

This function can also be used by a congestion control that prefers
not to use the default cwnd reduction approach (i.e., the PRR
algorithm) during CA_Recovery to control the cwnd and sending rate
during loss recovery.

We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.

With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/tcp.h|  4 
 net/ipv4/tcp_cong.c  |  2 +-
 net/ipv4/tcp_input.c | 17 ++---
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1aa9628..f83b7f2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -919,6 +919,10 @@ struct tcp_congestion_ops {
u32 (*tso_segs_goal)(struct sock *sk);
/* returns the multiplier used in tcp_sndbuf_expand (optional) */
u32 (*sndbuf_expand)(struct sock *sk);
+   /* call when packets are delivered to update cwnd and pacing rate,
+* after all the ca_state processing. (optional)
+*/
+   void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
/* get info for inet_diag (optional) */
size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 882caa4..1294af4 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -69,7 +69,7 @@ int tcp_register_congestion_control(struct tcp_congestion_ops 
*ca)
int ret = 0;
 
/* all algorithms must implement ssthresh and cong_avoid ops */
-   if (!ca->ssthresh || !ca->cong_avoid) {
+   if (!ca->ssthresh || !(ca->cong_avoid || ca->cong_control)) {
pr_err("%s does not implement required ops\n", ca->name);
return -EINVAL;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5af0bf3..28cfe99 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2536,6 +2536,9 @@ static inline void tcp_end_cwnd_reduction(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
+   if (inet_csk(sk)->icsk_ca_ops->cong_control)
+   return;
+
/* Reset cwnd to ssthresh in CWR or Recovery (unless it's undone) */
if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR ||
(tp->undo_marker && tp->snd_ssthresh < TCP_INFINITE_SSTHRESH)) {
@@ -3312,8 +3315,15 @@ static inline bool tcp_may_raise_cwnd(const struct sock 
*sk, const int flag)
  * information. All transmission or retransmission are delayed afterwards.
  */
 static void tcp_cong_control(struct sock *sk, u32 ack, u32 acked_sacked,
-int flag)
+int flag, const struct rate_sample *rs)
 {
+   const struct inet_connection_sock *icsk = inet_csk(sk);
+
+   if (icsk->icsk_ca_ops->cong_control) {
+   icsk->icsk_ca_ops->cong_control(sk, rs);
+   return;
+   }
+
if (tcp_in_cwnd_reduction(sk)) {
/* Reduce cwnd if state mandates */
tcp_cwnd_reduction(sk, acked_sacked, flag);
@@ -3683,7 +3693,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff 
*skb, int flag)
delivered = tp->delivered - delivered;  /* freshly ACKed or SACKed */
lost = tp->lost - lost; /* freshly marked lost */
tcp_rate_gen(sk, delivered, lost, &now, &rs);
-   tcp_cong_control(sk, ack, delivered, flag);
+   tcp_cong_control(sk, ack, delivered, flag, &rs);
tcp_xmit_recovery(sk, rexmit);
return 1;
 
@@ -5981,7 +5991,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff 
*skb)
} else
tcp_init_metrics(sk);
 
-   tcp_update_pacing_rate(sk);
+   if (!inet_csk(sk)->icsk_ca_ops->cong_control)
+   tcp_update_pacing_rate(sk);
 
/* Prevent spurious tcp_cwnd_restart() on first data packet */
tp->lsndtime = tcp_time
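
To illustrate what opting into the new hook looks like (a sketch, not an
excerpt from BBR): a module supplies cong_control() plus the still-mandatory
ssthresh, and the stack then skips both the PRR cwnd reduction and the default
pacing-rate update for it. Names other than the ops fields are invented.

#include <net/tcp.h>

static void example_cong_control(struct sock *sk, const struct rate_sample *rs)
{
	/* A real module would turn rs->delivered over rs->interval_us into a
	 * bandwidth sample here and set cwnd, sk->sk_pacing_rate and its TSO
	 * segment goal accordingly.
	 */
}

static struct tcp_congestion_ops example_rate_cc = {
	.ssthresh	= tcp_reno_ssthresh,	/* ssthresh is still required */
	.cong_control	= example_cong_control,
	.name		= "example",
	.owner		= THIS_MODULE,
};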

[PATCH v4 net-next 09/16] tcp: export data delivery rate

2016-09-19 Thread Neal Cardwell
From: Yuchung Cheng 

This commit exports two new fields in struct tcp_info:

  tcpi_delivery_rate: The most recent goodput, as measured by
tcp_rate_gen(). If the socket is limited by the sending
application (e.g., no data to send), it reports the highest
measurement instead of the most recent. The unit is bytes per
second (like other rate fields in tcp_info).

  tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
was measured when the socket's throughput was limited by the
sending application.

This delivery rate information can be useful for applications that
want to know the current throughput the TCP connection is seeing,
e.g. adaptive bitrate video streaming. It can also be very useful for
debugging or troubleshooting.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  5 -
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c   | 11 ++-
 net/ipv4/tcp_rate.c  | 12 +++-
 4 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index fdcd00f..a17ae7b 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -213,7 +213,8 @@ struct tcp_sock {
u8 reord;/* reordering detected */
} rack;
u16 advmss; /* Advertised MSS   */
-   u8  unused;
+   u8  rate_app_limited:1,  /* rate_{delivered,interval_us} limited? */
+   unused:7;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
thin_dupack : 1,/* Fast retransmit on first dupack  */
@@ -271,6 +272,8 @@ struct tcp_sock {
u32 app_limited;/* limited until "delivered" reaches this val */
struct skb_mstamp first_tx_mstamp;  /* start of window send phase */
struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */
+   u32 rate_delivered;/* saved rate sample: packets delivered */
+   u32 rate_interval_us;  /* saved rate sample: time elapsed */
 
u32 rcv_wnd;/* Current receiver window  */
u32 write_seq;  /* Tail(+1) of data held in tcp send buffer */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 482898f..73ac0db 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -167,6 +167,7 @@ struct tcp_info {
__u8tcpi_backoff;
__u8tcpi_options;
__u8tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;
+   __u8tcpi_delivery_rate_app_limited:1;
 
__u32   tcpi_rto;
__u32   tcpi_ato;
@@ -211,6 +212,8 @@ struct tcp_info {
__u32   tcpi_min_rtt;
__u32   tcpi_data_segs_in;  /* RFC4898 tcpEStatsDataSegsIn */
__u32   tcpi_data_segs_out; /* RFC4898 tcpEStatsDataSegsOut */
+
+   __u64   tcpi_delivery_rate;
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c187552..cece376 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2695,7 +2695,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 {
const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */
const struct inet_connection_sock *icsk = inet_csk(sk);
-   u32 now = tcp_time_stamp;
+   u32 now = tcp_time_stamp, intv;
unsigned int start;
int notsent_bytes;
u64 rate64;
@@ -2785,6 +2785,15 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_min_rtt = tcp_min_rtt(tp);
info->tcpi_data_segs_in = tp->data_segs_in;
info->tcpi_data_segs_out = tp->data_segs_out;
+
+   info->tcpi_delivery_rate_app_limited = tp->rate_app_limited ? 1 : 0;
+   rate = READ_ONCE(tp->rate_delivered);
+   intv = READ_ONCE(tp->rate_interval_us);
+   if (rate && intv) {
+   rate64 = (u64)rate * tp->mss_cache * USEC_PER_SEC;
+   do_div(rate64, intv);
+   put_unaligned(rate64, &info->tcpi_delivery_rate);
+   }
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
 
diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
index 52ff84b..9be1581 100644
--- a/net/ipv4/tcp_rate.c
+++ b/net/ipv4/tcp_rate.c
@@ -149,12 +149,22 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 
lost,
 * for connections suffer heavy or prolonged losses.
 */
if (unlikely(rs->interval_us < tcp_min_rtt(tp))) {
-   rs->interval_us = -1;
if (!rs->is_retrans)
pr_debug("tcp rate: %ld %d %u %u %u\n",
 rs->interval_us, rs->delivered,
 inet_csk(sk)->icsk_ca_state,
 tp->rx_opt.sack_ok, tcp_min_rtt(tp));
+  
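
From userspace, the new fields can be read through the usual TCP_INFO
getsockopt once the kernel and UAPI headers carry them; a minimal sketch
(error handling trimmed, new-enough headers assumed):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>		/* struct tcp_info with tcpi_delivery_rate */

static void print_delivery_rate(int fd)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	memset(&ti, 0, sizeof(ti));
	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
		printf("delivery rate: %llu B/s%s\n",
		       (unsigned long long)ti.tcpi_delivery_rate,
		       ti.tcpi_delivery_rate_app_limited ? " (app limited)" : "");
}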

[PATCH v4 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Neal Cardwell
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".

BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.

BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.

The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately out-dated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.

In 1981 Kleinrock and Gale showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.

While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.

In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.

Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.

When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.

Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue it created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).
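
A rough sketch of that steady-state cycling, for illustration: one
phase paces above the estimated bandwidth to probe, the next paces
below it to drain, and the remaining phases cruise at the estimate.
The specific gain values and cycle length here are assumptions, not
taken from the BBR patch itself.

#include <stdio.h>

/* Illustration only: a PROBE_BW-style gain cycle applied to the
 * estimated bottleneck bandwidth (values are assumptions).
 */
static const double probe_bw_gain[] = { 1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 };

int main(void)
{
	double bw = 1250000.0;		/* estimated bottleneck bw, bytes/s */
	unsigned int phase;

	for (phase = 0; phase < 8; phase++)
		printf("phase %u: pace at %.0f bytes/s\n",
		       phase, bw * probe_bw_gain[phase]);
	return 0;
}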

BBR has been fully deployed on Google's wide-area backbone networks
and we're experimenting with BBR on Google.com and YouTube on a global
scale.  Replacing CUBIC with BBR has resulted in significant
improvements in network latency and application (RPC, browser, and
video) metrics. For more details please refer to our upcoming ACM
Queue publication.

Example performance results, to illustrate the difference between BBR
and CUBIC:

Resilience to random loss (e.g. from shallow buffers):
  Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
  path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
  rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).

Low latency with the bloated buffers common in today's

[PATCH v4 net-next 15/16] tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88

2016-09-19 Thread Neal Cardwell
The TCP CUBIC module already uses 64 bytes.
The upcoming TCP BBR module uses 88 bytes.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/inet_connection_sock.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_connection_sock.h 
b/include/net/inet_connection_sock.h
index 49dcad4..197a30d 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -134,8 +134,8 @@ struct inet_connection_sock {
} icsk_mtup;
u32   icsk_user_timeout;
 
-   u64   icsk_ca_priv[64 / sizeof(u64)];
-#define ICSK_CA_PRIV_SIZE  (8 * sizeof(u64))
+   u64   icsk_ca_priv[88 / sizeof(u64)];
+#define ICSK_CA_PRIV_SIZE  (11 * sizeof(u64))
 };
 
 #define ICSK_TIME_RETRANS  1   /* Retransmit timer */
-- 
2.8.0.rc3.226.g39d4020



[PATCH v4 net-next 11/16] tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments

2016-09-19 Thread Neal Cardwell
To allow congestion control modules to use the default TSO auto-sizing
algorithm as one of the ingredients in their own decision about TSO sizing:

1) Export tcp_tso_autosize() so that CC modules can use it.

2) Change tcp_tso_autosize() to allow callers to specify a minimum
   number of segments per TSO skb, in case the congestion control
   module has a different notion of the best floor for TSO skbs for
   the connection right now. For very low-rate paths or policed
   connections it can be appropriate to use smaller TSO skbs.
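
As an illustration of 2) above, a congestion control module could
defer to the default autosizing while imposing its own floor. The
sketch below is hypothetical: only the tcp_tso_autosize() signature
comes from this patch; the example hook, its 2-segment floor, and the
surrounding kernel context (the tcp_congestion_ops wiring and
tcp_current_mss()) are assumed.

/* Hypothetical .tso_segs_goal implementation, for illustration only. */
static u32 example_cc_tso_segs_goal(struct sock *sk)
{
	/* Use the stack's default TSO autosizing, but never ask for
	 * fewer than 2 segments per TSO skb on this connection.
	 */
	return tcp_tso_autosize(sk, tcp_current_mss(sk), 2);
}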

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/tcp.h | 2 ++
 net/ipv4/tcp_output.c | 9 ++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f8f581f..3492041 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -533,6 +533,8 @@ __u32 cookie_v6_init_sequence(const struct sk_buff *skb, 
__u16 *mss);
 #endif
 /* tcp_output.c */
 
+u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
+int min_tso_segs);
 void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
   int nonagle);
 bool tcp_may_send_now(struct sock *sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0137956..0bf3d48 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1549,7 +1549,8 @@ static bool tcp_nagle_check(bool partial, const struct 
tcp_sock *tp,
 /* Return how many segs we'd like on a TSO packet,
  * to send one TSO packet per ms
  */
-static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
+int min_tso_segs)
 {
u32 bytes, segs;
 
@@ -1561,10 +1562,11 @@ static u32 tcp_tso_autosize(const struct sock *sk, 
unsigned int mss_now)
 * This preserves ACK clocking and is consistent
 * with tcp_tso_should_defer() heuristic.
 */
-   segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+   segs = max_t(u32, bytes / mss_now, min_tso_segs);
 
return min_t(u32, segs, sk->sk_gso_max_segs);
 }
+EXPORT_SYMBOL(tcp_tso_autosize);
 
 /* Return the number of segments we want in the skb we are transmitting.
  * See if congestion control module wants to decide; otherwise, autosize.
@@ -1574,7 +1576,8 @@ static u32 tcp_tso_segs(struct sock *sk, unsigned int 
mss_now)
const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
u32 tso_segs = ca_ops->tso_segs_goal ? ca_ops->tso_segs_goal(sk) : 0;
 
-   return tso_segs ? : tcp_tso_autosize(sk, mss_now);
+   return tso_segs ? :
+   tcp_tso_autosize(sk, mss_now, sysctl_tcp_min_tso_segs);
 }
 
 /* Returns the portion of skb which can be sent right away */
-- 
2.8.0.rc3.226.g39d4020



[PATCH v4 net-next 08/16] tcp: track application-limited rate samples

2016-09-19 Thread Neal Cardwell
From: Soheil Hassas Yeganeh 

This commit adds code to track whether the delivery rate represented
by each rate_sample was limited by the application.

Upon each transmit, we store in the is_app_limited field in the skb a
boolean bit indicating whether there is a known "bubble in the pipe":
a point in the rate sample interval where the sender was
application-limited, and did not transmit even though the cwnd and
pacing rate allowed it.

This logic marks the flow app-limited on a write if *all* of the
following are true:

  1) There is less than 1 MSS of unsent data in the write queue
 available to transmit.

  2) There is no packet in the sender's queues (e.g. in fq or the NIC
 tx queue).

  3) The connection is not limited by cwnd.

  4) There are no lost packets to retransmit.

The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
the connection is application-limited at the moment. If the flow is
application-limited, it sets the tp->app_limited field. If the flow is
application-limited then that means there is effectively a "bubble" of
silence in the pipe now, and this silence will be reflected in a lower
bandwidth sample for any rate samples from now until we get an ACK
indicating this bubble has exited the pipe: specifically, until we get
an ACK for the next packet we transmit.
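
Roughly, the four conditions above translate into a check of the
following shape. This is a sketch, not the exact body of
tcp_rate_check_app_limited() (which is not quoted in this excerpt);
the helper and field names beyond tp->app_limited and tp->delivered
should be read as assumptions.

/* Sketch only: mark the flow app-limited when all four conditions
 * from the changelog hold.
 */
void tcp_rate_check_app_limited(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (/* 1) less than 1 MSS of unsent data in the write queue */
	    tp->write_seq - tp->snd_nxt < tp->mss_cache &&
	    /* 2) nothing queued below us (e.g. in fq or the NIC tx queue) */
	    sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1) &&
	    /* 3) not limited by cwnd */
	    tcp_packets_in_flight(tp) < tp->snd_cwnd &&
	    /* 4) no lost packets left to retransmit */
	    tp->lost_out <= tp->retrans_out)
		/* Remember the "delivered" level at which this bubble will
		 * have left the pipe; use 1 so that 0 means "not limited".
		 */
		tp->app_limited =
			(tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
}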

When we send every skb we record in scb->tx.is_app_limited whether the
resulting rate sample will be application-limited.

The code in tcp_rate_gen() checks to see when it is safe to mark all
known application-limited bubbles of silence as having exited the
pipe. It does this by checking to see when the delivered count moves
past the tp->app_limited marker. At this point it zeroes the
tp->app_limited marker, as all known bubbles are out of the pipe.

We make room for the tx.is_app_limited bit in the skb by borrowing a
bit from the in_flight field used by NV to record the number of bytes
in flight. The receive window in the TCP header is 16 bits, and the
max receive window scaling shift factor is 14 (RFC 1323). So the max
receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
only need 30 bits for the tx.in_flight used by NV.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  1 +
 include/net/tcp.h|  6 +-
 net/ipv4/tcp.c   |  8 
 net/ipv4/tcp_minisocks.c |  3 +++
 net/ipv4/tcp_rate.c  | 29 -
 5 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c50e6ae..fdcd00f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -268,6 +268,7 @@ struct tcp_sock {
u32 prr_out;/* Total number of pkts sent during Recovery. */
u32 delivered;  /* Total data packets delivered incl. rexmits */
u32 lost;   /* Total data packets lost incl. rexmits */
+   u32 app_limited;/* limited until "delivered" reaches this val */
struct skb_mstamp first_tx_mstamp;  /* start of window send phase */
struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b261c89..a69ed7f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -764,7 +764,9 @@ struct tcp_skb_cb {
union {
struct {
/* There is space for up to 24 bytes */
-   __u32 in_flight;/* Bytes in flight when packet sent */
+   __u32 in_flight:30,/* Bytes in flight at transmit */
+ is_app_limited:1, /* cwnd not fully used? */
+ unused:1;
/* pkts S/ACKed so far upon tx of skb, incl retrans: */
__u32 delivered;
/* start of send pipeline phase */
@@ -883,6 +885,7 @@ struct rate_sample {
int  losses;/* number of packets marked lost upon ACK */
u32  acked_sacked;  /* number of packets newly (S)ACKed upon ACK */
u32  prior_in_flight;   /* in flight before this ACK */
+   bool is_app_limited;/* is sample from packet with bubble in pipe? */
bool is_retrans;/* is sample from retransmission? */
 };
 
@@ -978,6 +981,7 @@ void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff 
*skb,
struct rate_sample *rs);
 void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
  struct skb_mstamp *now, struct rate_sample *rs);
+void tcp_rate_check_app_limited(struct sock *sk);
 
 /* These functions determine how the current flow behaves in respect of SACK
  * handling. SACK is negotiated with the peer, and therefore it can vary
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9d1ae18..c187552 100644
--- a/net/ipv4/tcp.c

[PATCH v4 net-next 03/16] tcp: use windowed min filter library for TCP min_rtt estimation

2016-09-19 Thread Neal Cardwell
Refactor the TCP min_rtt code to reuse the new win_minmax library in
lib/win_minmax.c to simplify the TCP code.

This is a pure refactor: the functionality is exactly the same. We
just moved the windowed min code to make TCP easier to read and
maintain, and to allow other parts of the kernel to use the windowed
min/max filter code.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  5 ++--
 include/net/tcp.h|  2 +-
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_input.c | 64 
 net/ipv4/tcp_minisocks.c |  2 +-
 5 files changed, 10 insertions(+), 65 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c723a46..6433cc8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -19,6 +19,7 @@
 
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -234,9 +235,7 @@ struct tcp_sock {
u32 mdev_max_us;/* maximal mdev for the last rtt period */
u32 rttvar_us;  /* smoothed mdev_max*/
u32 rtt_seq;/* sequence number to update rttvar */
-   struct rtt_meas {
-   u32 rtt, ts;/* RTT in usec and sampling time in jiffies. */
-   } rtt_min[3];
+   struct  minmax rtt_min;
 
u32 packets_out;/* Packets which are "in flight"*/
u32 retrans_out;/* Retransmitted packets out*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index fdfbedd..2f1648a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -671,7 +671,7 @@ static inline bool tcp_ca_dst_locked(const struct dst_entry 
*dst)
 /* Minimum RTT in usec. ~0 means not available. */
 static inline u32 tcp_min_rtt(const struct tcp_sock *tp)
 {
-   return tp->rtt_min[0].rtt;
+   return minmax_get(&tp->rtt_min);
 }
 
 /* Compute the actual receive window we are currently advertising.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a13fcb3..5b0b49c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -387,7 +387,7 @@ void tcp_init_sock(struct sock *sk)
 
icsk->icsk_rto = TCP_TIMEOUT_INIT;
tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
-   tp->rtt_min[0].rtt = ~0U;
+   minmax_reset(&tp->rtt_min, tcp_time_stamp, ~0U);
 
/* So many TCP implementations out there (incorrectly) count the
 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 70b892d..ac5b38f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2879,67 +2879,13 @@ static void tcp_fastretrans_alert(struct sock *sk, 
const int acked,
*rexmit = REXMIT_LOST;
 }
 
-/* Kathleen Nichols' algorithm for tracking the minimum value of
- * a data stream over some fixed time interval. (E.g., the minimum
- * RTT over the past five minutes.) It uses constant space and constant
- * time per update yet almost always delivers the same minimum as an
- * implementation that has to keep all the data in the window.
- *
- * The algorithm keeps track of the best, 2nd best & 3rd best min
- * values, maintaining an invariant that the measurement time of the
- * n'th best >= n-1'th best. It also makes sure that the three values
- * are widely separated in the time window since that bounds the worse
- * case error when that data is monotonically increasing over the window.
- *
- * Upon getting a new min, we can forget everything earlier because it
- * has no value - the new min is <= everything else in the window by
- * definition and it's the most recent. So we restart fresh on every new min
- * and overwrites 2nd & 3rd choices. The same property holds for 2nd & 3rd
- * best.
- */
 static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us)
 {
-   const u32 now = tcp_time_stamp, wlen = sysctl_tcp_min_rtt_wlen * HZ;
-   struct rtt_meas *m = tcp_sk(sk)->rtt_min;
-   struct rtt_meas rttm = {
-   .rtt = likely(rtt_us) ? rtt_us : jiffies_to_usecs(1),
-   .ts = now,
-   };
-   u32 elapsed;
-
-   /* Check if the new measurement updates the 1st, 2nd, or 3rd choices */
-   if (unlikely(rttm.rtt <= m[0].rtt))
-   m[0] = m[1] = m[2] = rttm;
-   else if (rttm.rtt <= m[1].rtt)
-   m[1] = m[2] = rttm;
-   else if (rttm.rtt <= m[2].rtt)
-   m[2] = rttm;
-
-   elapsed = now - m[0].ts;
-   if (unlikely(elapsed > wlen)) {
-   /* Passed entire window without a new min so make 2nd choice
-* the new min & 3rd choice the new 2nd. So forth and so on.
-*/
-   m[0] = m[1];
-   m[1] = m[2];
-   m[2] = rttm;
-   if (now - m[0].ts > wlen) {
-   m[0] = m[1];
-   m[1] = rttm;
-

[PATCH v4 net-next 05/16] tcp: switch back to proper tcp_skb_cb size check in tcp_init()

2016-09-19 Thread Neal Cardwell
From: Eric Dumazet 

Revert to the tcp_skb_cb size check that tcp_init() had before commit
b4772ef879a8 ("net: use common macro for assering skb->cb[] available
size in protocol families"). As related commit 744d5a3e9fe2 ("net:
move skb->dropcount to skb->cb[]") explains, the
sock_skb_cb_check_size() mechanism was added to ensure that there is
space for dropcount, "for protocol families using it". But TCP is not
a protocol using dropcount, so tcp_init() doesn't need to provision
space for dropcount in the skb->cb[], and thus we can revert to the
older form of the tcp_skb_cb size check. Doing so allows TCP to use 4
more bytes of the skb->cb[] space.

Fixes: b4772ef879a8 ("net: use common macro for assering skb->cb[] available 
size in protocol families")
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
---
 net/ipv4/tcp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5b0b49c..9d1ae18 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3244,11 +3244,12 @@ static void __init tcp_init_mem(void)
 
 void __init tcp_init(void)
 {
-   unsigned long limit;
int max_rshare, max_wshare, cnt;
+   unsigned long limit;
unsigned int i;
 
-   sock_skb_cb_check_size(sizeof(struct tcp_skb_cb));
+   BUILD_BUG_ON(sizeof(struct tcp_skb_cb) >
+FIELD_SIZEOF(struct sk_buff, cb));
 
percpu_counter_init(&tcp_sockets_allocated, 0, GFP_KERNEL);
percpu_counter_init(&tcp_orphan_count, 0, GFP_KERNEL);
-- 
2.8.0.rc3.226.g39d4020



[PATCH v4 net-next 04/16] net_sched: sch_fq: add low_rate_threshold parameter

2016-09-19 Thread Neal Cardwell
From: Eric Dumazet 

This commit adds to the fq module a low_rate_threshold parameter to
insert a delay after all packets if the socket requests a pacing rate
below the threshold.

This helps achieve more precise control of the sending rate with
low-rate paths, especially policers. The basic issue is that if a
congestion control module detects a policer at a certain rate, it may
want fq to be able to shape to that policed rate. That way the sender
can avoid policer drops by having the packets arrive at the policer at
or just under the policed rate.

The default threshold of 550Kbps was chosen analytically so that for
policers or links at 500Kbps or 512Kbps fq would very likely invoke
this mechanism, even if the pacing rate was briefly slightly above the
available bandwidth. This value was then empirically validated with
two years of production testing on YouTube video servers.
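
As a worked example of the behaviour below the threshold (plain
arithmetic on the mechanism, not a measurement): a socket paced at
512Kbps has sk_pacing_rate = 512000 / 8 = 64000 bytes/sec, so a
1500-byte packet is scheduled 1500 * NSEC_PER_SEC / 64000 ~= 23.4ms
after the previous one. Because the flow credit is zeroed below the
threshold, that spacing is applied after every packet rather than once
per quantum, which is what lets fq shape precisely to a low policed
rate.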

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/uapi/linux/pkt_sched.h |  2 ++
 net/sched/sch_fq.c | 22 +++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 2382eed..f8e39db 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -792,6 +792,8 @@ enum {
 
TCA_FQ_ORPHAN_MASK, /* mask applied to orphaned skb hashes */
 
+   TCA_FQ_LOW_RATE_THRESHOLD, /* per packet delay under this rate */
+
__TCA_FQ_MAX
 };
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index e5458b9..40ad4fc 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -94,6 +94,7 @@ struct fq_sched_data {
u32 flow_max_rate;  /* optional max rate per flow */
u32 flow_plimit;/* max packets per flow */
u32 orphan_mask;/* mask for orphaned skb */
+   u32 low_rate_threshold;
struct rb_root  *fq_root;
u8  rate_enable;
u8  fq_trees_log;
@@ -433,7 +434,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
struct fq_flow_head *head;
struct sk_buff *skb;
struct fq_flow *f;
-   u32 rate;
+   u32 rate, plen;
 
skb = fq_dequeue_head(sch, &q->internal);
if (skb)
@@ -482,7 +483,7 @@ begin:
prefetch(&skb->end);
f->credit -= qdisc_pkt_len(skb);
 
-   if (f->credit > 0 || !q->rate_enable)
+   if (!q->rate_enable)
goto out;
 
/* Do not pace locally generated ack packets */
@@ -493,8 +494,15 @@ begin:
if (skb->sk)
rate = min(skb->sk->sk_pacing_rate, rate);
 
+   if (rate <= q->low_rate_threshold) {
+   f->credit = 0;
+   plen = qdisc_pkt_len(skb);
+   } else {
+   plen = max(qdisc_pkt_len(skb), q->quantum);
+   if (f->credit > 0)
+   goto out;
+   }
if (rate != ~0U) {
-   u32 plen = max(qdisc_pkt_len(skb), q->quantum);
u64 len = (u64)plen * NSEC_PER_SEC;
 
if (likely(rate))
@@ -662,6 +670,7 @@ static const struct nla_policy fq_policy[TCA_FQ_MAX + 1] = {
[TCA_FQ_FLOW_MAX_RATE]  = { .type = NLA_U32 },
[TCA_FQ_BUCKETS_LOG]= { .type = NLA_U32 },
[TCA_FQ_FLOW_REFILL_DELAY]  = { .type = NLA_U32 },
+   [TCA_FQ_LOW_RATE_THRESHOLD] = { .type = NLA_U32 },
 };
 
 static int fq_change(struct Qdisc *sch, struct nlattr *opt)
@@ -716,6 +725,10 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt)
if (tb[TCA_FQ_FLOW_MAX_RATE])
q->flow_max_rate = nla_get_u32(tb[TCA_FQ_FLOW_MAX_RATE]);
 
+   if (tb[TCA_FQ_LOW_RATE_THRESHOLD])
+   q->low_rate_threshold =
+   nla_get_u32(tb[TCA_FQ_LOW_RATE_THRESHOLD]);
+
if (tb[TCA_FQ_RATE_ENABLE]) {
u32 enable = nla_get_u32(tb[TCA_FQ_RATE_ENABLE]);
 
@@ -781,6 +794,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt)
q->fq_root  = NULL;
q->fq_trees_log = ilog2(1024);
q->orphan_mask  = 1024 - 1;
+   q->low_rate_threshold   = 550000 / 8;
qdisc_watchdog_init(&q->watchdog, sch);
 
if (opt)
@@ -811,6 +825,8 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
nla_put_u32(skb, TCA_FQ_FLOW_REFILL_DELAY,
jiffies_to_usecs(q->flow_refill_delay)) ||
nla_put_u32(skb, TCA_FQ_ORPHAN_MASK, q->orphan_mask) ||
+   nla_put_u32(skb, TCA_FQ_LOW_RATE_THRESHOLD,
+   q->low_rate_threshold) ||
nla_put_u32(skb, TCA_FQ_BUCKETS_LOG, q->fq_trees_log))
goto nla_put_failure;
 
-- 
2.8.0.rc3.226.g39d4020



[PATCH v4 net-next 07/16] tcp: track data delivery rate for a TCP connection

2016-09-19 Thread Neal Cardwell
From: Yuchung Cheng 

This patch generates data delivery rate (throughput) samples on a
per-ACK basis. These rate samples can be used by congestion control
modules, and specifically will be used by TCP BBR in later patches in
this series.

Key state:

tp->delivered: Tracks the total number of data packets (original or not)
   delivered so far. This is an already-existing field.

tp->delivered_mstamp: the last time tp->delivered was updated.

Algorithm:

A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:

  d1: the current tp->delivered after processing the ACK
  t1: the current time after processing the ACK

  d0: the prior tp->delivered when the acked skb was transmitted
  t0: the prior tp->delivered_mstamp when the acked skb was transmitted

When an skb is transmitted, we snapshot d0 and t0 in its control
block in tcp_rate_skb_sent().

When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
to reflect the latest (d0, t0).

Finally, tcp_rate_gen() generates a rate sample by storing
(d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.

One caveat: if an skb was sent with no packets in flight, then
tp->delivered_mstamp may be either invalid (if the connection is
starting) or outdated (if the connection was idle). In that case,
we'll re-stamp tp->delivered_mstamp.

At first glance it seems t0 should always be the time when an skb was
transmitted, but actually this could over-estimate the rate due to
phase mismatch between transmit and ACK events. To track the delivery
rate, we ensure that if packets are in flight then t0 and t1 are
times at which packets were marked delivered.

If the initial and final RTTs are different then one may be corrupted
by some sort of noise. The noise we see most often is sending gaps
caused by delayed, compressed, or stretched acks. This either affects
both RTTs equally or artificially reduces the final RTT. We approach
this by recording the info we need to compute the initial RTT
(duration of the "send phase" of the window) when we recorded the
associated inflight. Then, for a filter to avoid bandwidth
overestimates, we generalize the per-sample bandwidth computation
from:

bw = delivered / ack_phase_rtt

to the following:

bw = delivered / max(send_phase_rtt, ack_phase_rtt)

In large-scale experiments, this filtering approach incorporating
send_phase_rtt is effective at avoiding bandwidth overestimates due to
ACK compression or stretched ACKs.
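
A compact, self-contained C sketch of that per-ACK computation follows;
rs->delivered and rs->interval_us are named in this changelog, while
everything else below is illustrative.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Sketch only: generate one bandwidth sample per ACK from the deltas
 * described above. Times are in microseconds, "delivered" counts packets.
 */
struct rate_sample_sketch {
	uint32_t delivered;	/* d1 - d0: packets (S)ACKed over the interval */
	uint64_t interval_us;	/* max(send_phase_rtt, ack_phase_rtt) */
};

static uint64_t sample_bw_pps(const struct rate_sample_sketch *rs)
{
	if (!rs->interval_us)
		return 0;	/* no valid interval yet */
	return (uint64_t)rs->delivered * 1000000 / rs->interval_us;
}

int main(void)
{
	/* 10 packets delivered; send phase 40ms, ACK phase 36ms */
	struct rate_sample_sketch rs = {
		.delivered = 10,
		.interval_us = 40000,	/* the larger of the two phases */
	};

	printf("bandwidth sample: %" PRIu64 " packets/s\n", sample_bw_pps(&rs));
	return 0;
}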

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h   |   2 +
 include/net/tcp.h |  35 +++-
 net/ipv4/Makefile |   2 +-
 net/ipv4/tcp_input.c  |  46 +++-
 net/ipv4/tcp_output.c |   4 ++
 net/ipv4/tcp_rate.c   | 149 ++
 6 files changed, 222 insertions(+), 16 deletions(-)
 create mode 100644 net/ipv4/tcp_rate.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 38590fb..c50e6ae 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -268,6 +268,8 @@ struct tcp_sock {
u32 prr_out;/* Total number of pkts sent during Recovery. */
u32 delivered;  /* Total data packets delivered incl. rexmits */
u32 lost;   /* Total data packets lost incl. rexmits */
+   struct skb_mstamp first_tx_mstamp;  /* start of window send phase */
+   struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */
 
u32 rcv_wnd;/* Current receiver window  */
u32 write_seq;  /* Tail(+1) of data held in tcp send buffer */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2f1648a..b261c89 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -763,8 +763,14 @@ struct tcp_skb_cb {
__u32   ack_seq;/* Sequence number ACK'd*/
union {
struct {
-   /* There is space for up to 20 bytes */
+   /* There is space for up to 24 bytes */
__u32 in_flight;/* Bytes in flight when packet sent */
+   /* pkts S/ACKed so far upon tx of skb, incl retrans: */
+   __u32 delivered;
+   /* start of send pipeline phase */
+   struct skb_mstamp first_tx_mstamp;
+   /* when we reached the "delivered" count */
+   struct skb_mstamp delivered_mstamp;
} tx;   /* only used for outgoing skbs */
union {
struct inet_skb_parmh4;
@@ -860,6 +866,26 @@ struct ack_sample {
u32 in_flight;
 };
 
+/* A rate sample measures the number of (original/retransmitted) data
+ * packets delivered "delivered" over 

[PATCH net-next] net: ethernet: mediatek: enhance with avoiding superfluous assignment inside mtk_get_ethtool_stats

2016-09-19 Thread sean.wang
From: Sean Wang 

data_src is unchanged inside the loop, so this patch moves
the assignment outside the loop to avoid the unnecessary
repeated assignment.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 481f360..ca6b501 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -2137,8 +2137,9 @@ static void mtk_get_ethtool_stats(struct net_device *dev,
}
}
 
+   data_src = (u64 *)hwstats;
+
do {
-   data_src = (u64 *)hwstats;
data_dst = data;
start = u64_stats_fetch_begin_irq(&hwstats->syncp);
 
-- 
1.9.1



Re: [PATCHv2 net-next 0/2] Preparation for mv88e6390

2016-09-19 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> These two patches are a couple of preparation steps for supporting the
> MV88E6390 family of chips. This is a new generation from Marvell,
> and will need more feature flags than are currently available in an
> unsigned long. Expand to an unsigned long long. The MV88E6390 also
> places its port registers somewhere else, so add a wrapper around port
> register access.
>
> v2: Rework wrappers to use mv88e6xxx_{read|write}
> Simplify some (err < 0) to (err)
> Add Reviewed by tag.
>
> Andrew Lunn (2):
>   net: dsa: mv88e6xxx: Convert flag bits to unsigned long long
>   WIP: net: dsa: mv88e6xxx: Implement interrupt support.

Since there's a need for a v3, note that the above summary is wrong.

Thanks,

Vivien


Re: [PATCHv2 net-next 1/2] net: dsa: mv88e6xxx: Add helper for accessing port registers

2016-09-19 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:



> @@ -585,19 +601,19 @@ static void mv88e6xxx_adjust_link(struct dsa_switch 
> *ds, int port,
> struct phy_device *phydev)
>  {
>   struct mv88e6xxx_chip *chip = ds->priv;
> - u32 reg;
> - int ret;
> + u16 reg;
> + int err;
>  
>   if (!phy_is_pseudo_fixed_link(phydev))
>   return;
>  
>   mutex_lock(&chip->reg_lock);
>  
> - ret = _mv88e6xxx_reg_read(chip, REG_PORT(port), PORT_PCS_CTRL);
> - if (ret < 0)
> + err = mv88e6xxx_port_read(chip, port, PORT_PCS_CTRL, ®);
> + if (err)
>   goto out;
>  
> - reg = ret & ~(PORT_PCS_CTRL_LINK_UP |
> + reg = reg & ~(PORT_PCS_CTRL_LINK_UP |

reg &= ~(PORT_PCS_CTRL_LINK_UP |

> PORT_PCS_CTRL_FORCE_LINK |
> PORT_PCS_CTRL_DUPLEX_FULL |
> PORT_PCS_CTRL_FORCE_DUPLEX |



>   /* Wait up to one second for reset to complete. */
>   timeout = jiffies + 1 * HZ;
>   while (time_before(jiffies, timeout)) {
> - ret = _mv88e6xxx_reg_read(chip, REG_GLOBAL, 0x00);
> - if (ret < 0)
> - return ret;
> + err = _mv88e6xxx_reg_read(chip, REG_GLOBAL, 0x00);
> + if (err)
> + return err;

Ouch! Wrong s/ret/err/ here.

>  
> - if ((ret & is_reset) == is_reset)
> + if ((err & is_reset) == is_reset)

Here as well. Keep ret or use mv88e6xxx_read().

>   break;
>   usleep_range(1000, 2000);
>   }
>   if (time_after(jiffies, timeout))
> - ret = -ETIMEDOUT;
> + err = -ETIMEDOUT;
>   else
> - ret = 0;
> + err = 0;
>  
> - return ret;
> + return err;
>  }
>  
>  static int mv88e6xxx_serdes_power_on(struct mv88e6xxx_chip *chip)

Thanks,

Vivien


Re: [PATCH net 0/3] mlx5 fixes to 4.8-rc6

2016-09-19 Thread David Miller
From: Or Gerlitz 
Date: Sun, 18 Sep 2016 18:20:26 +0300

> This series has a fix from Roi for a memory corruption bug in
> the bulk flow counters code and two late and hopefully last fixes
> from me to the new eswitch offloads code.
> 
> Series done over net commit 37dd348 "bna: fix crash in bnad_get_strings()"

Series applied, thanks.


Re: [PATCHv6 net-next 00/15] BPF hardware offload (cls_bpf for now)

2016-09-19 Thread David Miller
From: Jakub Kicinski 
Date: Sun, 18 Sep 2016 16:09:10 +0100

> As spotted by Daniel JIT might have accessed indexes past the end
> of verifier's reg_state array.

This series doesn't apply cleanly to net-next, please respin.

Thanks.


Re: [PATCH net-next 1/1] net sched actions police: peg drop stats for conforming traffic

2016-09-19 Thread David Miller
From: Jamal Hadi Salim 
Date: Sun, 18 Sep 2016 07:53:08 -0400

> From: Roman Mashak 
> 
> setting conforming action to drop is a valid policy.
> When it is set we need to at least see the stats indicating it
> for debugging.
> 
> Signed-off-by: Roman Mashak 
> Signed-off-by: Jamal Hadi Salim 

Applied.


Re: [PATCH net-next 1/1] net sched: stylistic cleanups

2016-09-19 Thread David Miller
From: Jamal Hadi Salim 
Date: Sun, 18 Sep 2016 08:45:33 -0400

> From: Jamal Hadi Salim 
> 
> Signed-off-by: Jamal Hadi Salim 

Applied.


Re: [PATCH] net: hns: mark symbols static where possible

2016-09-19 Thread David Miller
From: Baoyou Xie 
Date: Sun, 18 Sep 2016 17:07:25 +0800

> We get a few warnings when building kernel with W=1:
> drivers/net/ethernet/hisilicon/hisi_femac.c:943:5: warning: no previous 
> prototype for 'hisi_femac_drv_suspend' [-Wmissing-prototypes]
> drivers/net/ethernet/hisilicon/hisi_femac.c:960:5: warning: no previous 
> prototype for 'hisi_femac_drv_resume' [-Wmissing-prototypes]
> drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c:76:21: warning: no previous 
> prototype for 'hns_ae_get_handle' [-Wmissing-prototypes]
> 
> 
> In fact, these functions are only used in the file in which they are
> declared and don't need a declaration, but can be made static.
> So this patch marks these functions with 'static'.
> 
> Signed-off-by: Baoyou Xie 

Does not apply cleanly to net-next, please respin.


Re: [PATCH v3 net-next 1/2] net sched ife action: add 16 bit helpers

2016-09-19 Thread David Miller
From: Jamal Hadi Salim 
Date: Sun, 18 Sep 2016 07:31:42 -0400

> From: Jamal Hadi Salim 
> 
> encoder and checker for 16 bits metadata
> 
> Signed-off-by: Jamal Hadi Salim 

Applied.


Re: [PATCH] phy: mark lan88xx_suspend() static

2016-09-19 Thread David Miller
From: Baoyou Xie 
Date: Sun, 18 Sep 2016 16:26:34 +0800

> We get 1 warning when building kernel with W=1:
> drivers/net/phy/microchip.c:58:5: warning: no previous prototype for 
> 'lan88xx_suspend' [-Wmissing-prototypes]
> 
> In fact, this function is only used in the file in which it is
> declared and doesn't need a declaration, but can be made static.
> So this patch marks this function with 'static'.
> 
> Signed-off-by: Baoyou Xie 

Applied.


Re: [PATCH v3 net-next 2/2] net sched ife action: Introduce skb tcindex metadata encap decap

2016-09-19 Thread David Miller
From: Jamal Hadi Salim 
Date: Sun, 18 Sep 2016 07:31:43 -0400

> From: Jamal Hadi Salim 
> 
> Sample use case of how this is encoded:
> user space via tuntap (or a connected VM/Machine/container)
> encodes the tcindex TLV.
> 
> Sample use case of decoding:
> IFE action decodes it and the skb->tc_index is then used to classify.
> So something like this for encoded ICMP packets:
> 
> .. first decode then reclassify... skb->tcindex will be set
> sudo $TC filter add dev $ETH parent : prio 2 protocol 0xbeef \
> u32 match u32 0 0 flowid 1:1 \
> action ife decode reclassify
> 
> ...next match the decode icmp packet...
> sudo $TC filter add dev $ETH parent : prio 4 protocol ip \
> u32 match ip protocol 1 0xff flowid 1:1 \
> action continue
> 
> ... last classify it using the tcindex classifier and do someaction..
> sudo $TC filter add dev $ETH parent : prio 5 protocol ip \
> handle 0x11 tcindex classid 1:1 \
> action blah..
> 
> Signed-off-by: Jamal Hadi Salim 

Applied.


Re: [PATCH] cxgb4: mark symbols static where possible

2016-09-19 Thread David Miller
From: Baoyou Xie 
Date: Sun, 18 Sep 2016 17:18:23 +0800

> We get a few warnings when building kernel with W=1:
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:178:5: warning: no previous 
> prototype for 'setup_sge_queues_uld' [-Wmissing-prototypes]
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:205:6: warning: no previous 
> prototype for 'free_sge_queues_uld' [-Wmissing-prototypes]
> 
> 
> In fact, these functions are only used in the file in which they are
> declared and don't need a declaration, but can be made static.
> So this patch marks these functions with 'static'.
> 
> Signed-off-by: Baoyou Xie 

Does not apply cleanly to net-next, please respin.


Re: [PATCH] be2net: mark symbols static where possible

2016-09-19 Thread David Miller
From: Baoyou Xie 
Date: Sun, 18 Sep 2016 16:35:29 +0800

> We get 4 warnings when building kernel with W=1:
> drivers/net/ethernet/emulex/benet/be_main.c:4368:6: warning: no previous 
> prototype for 'be_calculate_pf_pool_rss_tables' [-Wmissing-prototypes]
> drivers/net/ethernet/emulex/benet/be_cmds.c:4385:5: warning: no previous 
> prototype for 'be_get_nic_pf_num_list' [-Wmissing-prototypes]
> drivers/net/ethernet/emulex/benet/be_cmds.c:4537:6: warning: no previous 
> prototype for 'be_reset_nic_desc' [-Wmissing-prototypes]
> drivers/net/ethernet/emulex/benet/be_cmds.c:4910:5: warning: no previous 
> prototype for '__be_cmd_set_logical_link_config' [-Wmissing-prototypes]
> 
> In fact, these functions are only used in the file in which they are
> declared and don't need a declaration, but can be made static.
> So this patch marks these functions with 'static'.
> 
> Signed-off-by: Baoyou Xie 

Applied.


Re: [v3 PATCH 1/2] rhashtable: Add rhlist interface

2016-09-19 Thread Herbert Xu
On Mon, Sep 19, 2016 at 11:16:21PM +0200, Thomas Graf wrote:
>
> Nice, I like how this simplifies users! Is this suitable for
> ILA as well?

Does it have duplicate objects and use inelastic_security? If so
then yes it should switch over to rhlist.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH RFC 2/6] rhashtable: Call library function alloc_bucket_locks

2016-09-19 Thread Herbert Xu
Tom Herbert  wrote:
> To allocate the array of bucket locks for the hash table we now
> call library function alloc_bucket_spinlocks. This function is
> based on the old alloc_bucket_locks in rhashtable and should
> produce the same effect.
> 
> Signed-off-by: Tom Herbert 

This conflicts with the work I'm doing to fix the resize ENOMEM
issue.  I'll be making the hashtable as well as the spinlock table
nested, in which case you must not directly dereference it as an
array.

If you're just trying to share the spinlocks for another purpose,
what we can do is provide a helper function to return the right
lock for a given key/object.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/2] net: ethernet: broadcom: b44: use new api ethtool_{get|set}_link_ksettings

2016-09-19 Thread David Miller
From: Philippe Reynes 
Date: Sun, 18 Sep 2016 00:11:35 +0200

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 2/2] net: ethernet: broadcom: bcm63xx: use new api ethtool_{get|set}_link_ksettings

2016-09-19 Thread David Miller
From: Philippe Reynes 
Date: Sun, 18 Sep 2016 16:59:07 +0200

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 1/2] net: ethernet: broadcom: bcm63xx: use phydev from struct net_device

2016-09-19 Thread David Miller
From: Philippe Reynes 
Date: Sun, 18 Sep 2016 16:59:06 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phydev in the private structure, and update the driver to use the
> one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH] net: ethernet: broadcom: bcmgenet: use new api ethtool_{get|set}_link_ksettings

2016-09-19 Thread David Miller
From: Philippe Reynes 
Date: Sun, 18 Sep 2016 17:16:45 +0200

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 1/2] net: ethernet: broadcom: b44: use phydev from struct net_device

2016-09-19 Thread David Miller
From: Philippe Reynes 
Date: Sun, 18 Sep 2016 00:11:34 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phydev in the private structure, and update the driver to use the
> one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH net-next 00/10] bnxt: update for net-next.

2016-09-19 Thread David Miller
From: Michael Chan 
Date: Mon, 19 Sep 2016 03:57:59 -0400

> Misc. changes and minor bug fixes for net-next.  Please review.

Series applied, thanks Michael.


Re: lsm naming dilemma. Re: [RFC v3 07/22] landlock: Handle file comparisons

2016-09-19 Thread Sargun Dhillon
I'm fine giving up the Checmate name. Landlock seems easy enough to
Google. I haven't gotten a chance to look through the entire patchset
yet, but it does seem like they are somewhat similar.

On Mon, Sep 19, 2016 at 5:12 PM, Alexei Starovoitov
 wrote:
> On Thu, Sep 15, 2016 at 11:25:10PM +0200, Mickaël Salaün wrote:
>> >> Agreed. With this RFC, the Checmate features (i.e. network helpers)
>> >> should be able to sit on top of Landlock.
>> >
>> > I think neither of them should be called fancy names for no technical 
>> > reason.
>> > We will have only one bpf based lsm. That's it and it doesn't
>> > need an obscure name. Directory name can be security/bpf/..stuff.c
>>
>> I disagree on an LSM named "BPF". I first started with the "seccomp LSM"
>> name (first RFC) but I later realized that it is confusing because
>> seccomp is associated to its syscall and the underlying features. Same
>> thing goes for BPF. It is also artificially hard to grep on a name too
>> used in the kernel source tree.
>> Making an association between the generic eBPF mechanism and a security
>> centric approach (i.e. LSM) seems a bit reductive (for BPF). Moreover,
>> the seccomp interface [1] can still be used.
>
> agree with above.
>
>> Landlock is a nice name to depict a sandbox as an enclave (i.e. a
>> landlocked country/state). I want to keep this name, which is simple,
>> express the goal of Landlock nicely and is comparable to other sandbox
>> mechanisms as Seatbelt or Pledge.
>> Landlock should not be confused with the underlying eBPF implementation.
>> Landlock could use more than only eBPF in the future and eBPF could be
>> used in other LSM as well.
>
> there will not be two bpf based LSMs.
> Therefore unless you can convince Sargun to give up his 'checmate' name,
> nothing goes in.
> The features you both need are 90% the same, so they must be done
> as part of single LSM whatever you both agree to call it.
>


Re: [PATCH net-next] net: dsa: mv88e6xxx: handle multiple ports in ATU

2016-09-19 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> On Mon, Sep 19, 2016 at 07:56:11PM -0400, Vivien Didelot wrote:
>> An address can be loaded in the ATU with multiple ports, for instance
>> when adding multiple ports to a Multicast group with "bridge mdb".
>>
>> The current code doesn't allow that. Add a helper to get a single entry
>> from the ATU, then set or clear the requested port, before loading the
>> entry back in the ATU.
>>
>> Note that the required _mv88e6xxx_atu_getnext function is defined below
>> mv88e6xxx_port_db_load_purge, so forward-declare it for the moment. The
>> ATU code will be isolated in future patches.
>>
>> Fixes: 83dabd1fa84c ("net: dsa: mv88e6xxx: make switchdev DB ops generic")
>
> Is this a real fixes? You don't make it clear what goes wrong. I
> assume adding the same MAC address for a second time but for a
> different port removes the first entry for the old port?

Yes, this is what happens, sorry for the bad message. Below is an
example with the relevant hardware bits.

Here's the current behavior, without this patch:

# bridge mdb add dev br0 port lan0 grp 238.39.20.86

FID  MAC Addr  State Trunk?  DPV/Trunk ID
  0  01:00:5e:27:14:56  MC_STATIC   n  0 - - - - - -

# bridge mdb add dev br0 port lan2 grp 238.39.20.86

FID  MAC Addr  State Trunk?  DPV/Trunk ID
  0  01:00:5e:27:14:56  MC_STATIC   n  - - 2 - - - -

Here's the new behavior, with this patch:

# bridge mdb add dev br0 port lan0 grp 238.39.20.86

FID  MAC Addr  State Trunk?  DPV/Trunk ID
  0  01:00:5e:27:14:56  MC_STATIC   n  0 - - - - - -

# bridge mdb add dev br0 port lan2 grp 238.39.20.86

FID  MAC Addr  State Trunk?  DPV/Trunk ID
  0  01:00:5e:27:14:56  MC_STATIC   n  0 - 2 - - - -

Thanks!

Vivien


Re: [PATCH net-next v6] gso: Support partial splitting at the frag_list pointer

2016-09-19 Thread David Miller
From: Alexander Duyck 
Date: Mon, 19 Sep 2016 08:12:15 -0700

> On Mon, Sep 19, 2016 at 3:58 AM, Steffen Klassert
>  wrote:
>> Since commit 8a29111c7 ("net: gro: allow to build full sized skb")
>> gro may build buffers with a frag_list. This can hurt forwarding
>> because most NICs can't offload such packets, they need to be
>> segmented in software. This patch splits buffers with a frag_list
>> at the frag_list pointer into buffers that can be TSO offloaded.
>>
>> Signed-off-by: Steffen Klassert 
 ...
> 
> Looks good.
> 
> Acked-by: Alexander Duyck 

Applied, thanks everyone.


Re: [PATCH net-next] net: dsa: mv88e6xxx: handle multiple ports in ATU

2016-09-19 Thread Andrew Lunn
On Mon, Sep 19, 2016 at 07:56:11PM -0400, Vivien Didelot wrote:
> An address can be loaded in the ATU with multiple ports, for instance
> when adding multiple ports to a Multicast group with "bridge mdb".
> 
> The current code doesn't allow that. Add a helper to get a single entry
> from the ATU, then set or clear the requested port, before loading the
> entry back in the ATU.
> 
> Note that the required _mv88e6xxx_atu_getnext function is defined below
> mv88e6xxx_port_db_load_purge, so forward-declare it for the moment. The
> ATU code will be isolated in future patches.
> 
> Fixes: 83dabd1fa84c ("net: dsa: mv88e6xxx: make switchdev DB ops generic")

Is this a real fixes? You don't make it clear what goes wrong. I
assume adding the same MAC address for a second time but for a
different port removes the first entry for the old port?

> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

Andrew


[PATCHv2 net-next 1/2] net: dsa: mv88e6xxx: Add helper for accessing port registers

2016-09-19 Thread Andrew Lunn
There is a device coming soon which places its port registers
somewhere different to all other Marvell switches supported so far.
Add helper functions for reading/writing port registers, making it
easier to handle this new device.

Signed-off-by: Andrew Lunn 
---
v2:
 Call mv88e6xxx_{read|write} in wrappers
 Change some (err < 0) to plain (err)
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 369 +-
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |   1 -
 2 files changed, 181 insertions(+), 189 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 70a812d159c9..6f2a145082e5 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -216,6 +216,22 @@ int mv88e6xxx_write(struct mv88e6xxx_chip *chip, int addr, 
int reg, u16 val)
return 0;
 }
 
+int mv88e6xxx_port_read(struct mv88e6xxx_chip *chip, int port, int reg,
+   u16 *val)
+{
+   int addr = chip->info->port_base_addr + port;
+
+   return mv88e6xxx_read(chip, addr, reg, val);
+}
+
+int mv88e6xxx_port_write(struct mv88e6xxx_chip *chip, int port, int reg,
+u16 val)
+{
+   int addr = chip->info->port_base_addr + port;
+
+   return mv88e6xxx_write(chip, addr, reg, val);
+}
+
 static int mv88e6xxx_phy_read(struct mv88e6xxx_chip *chip, int phy,
  int reg, u16 *val)
 {
@@ -585,19 +601,19 @@ static void mv88e6xxx_adjust_link(struct dsa_switch *ds, 
int port,
  struct phy_device *phydev)
 {
struct mv88e6xxx_chip *chip = ds->priv;
-   u32 reg;
-   int ret;
+   u16 reg;
+   int err;
 
if (!phy_is_pseudo_fixed_link(phydev))
return;
 
mutex_lock(&chip->reg_lock);
 
-   ret = _mv88e6xxx_reg_read(chip, REG_PORT(port), PORT_PCS_CTRL);
-   if (ret < 0)
+   err = mv88e6xxx_port_read(chip, port, PORT_PCS_CTRL, ®);
+   if (err)
goto out;
 
-   reg = ret & ~(PORT_PCS_CTRL_LINK_UP |
+   reg = reg & ~(PORT_PCS_CTRL_LINK_UP |
  PORT_PCS_CTRL_FORCE_LINK |
  PORT_PCS_CTRL_DUPLEX_FULL |
  PORT_PCS_CTRL_FORCE_DUPLEX |
@@ -639,7 +655,7 @@ static void mv88e6xxx_adjust_link(struct dsa_switch *ds, 
int port,
reg |= (PORT_PCS_CTRL_RGMII_DELAY_RXCLK |
PORT_PCS_CTRL_RGMII_DELAY_TXCLK);
}
-   _mv88e6xxx_reg_write(chip, REG_PORT(port), PORT_PCS_CTRL, reg);
+   mv88e6xxx_port_write(chip, port, PORT_PCS_CTRL, reg);
 
 out:
mutex_unlock(&chip->reg_lock);
@@ -799,22 +815,22 @@ static uint64_t _mv88e6xxx_get_ethtool_stat(struct 
mv88e6xxx_chip *chip,
 {
u32 low;
u32 high = 0;
-   int ret;
+   int err;
+   u16 reg;
u64 value;
 
switch (s->type) {
case PORT:
-   ret = _mv88e6xxx_reg_read(chip, REG_PORT(port), s->reg);
-   if (ret < 0)
+   err = mv88e6xxx_port_read(chip, port, s->reg, ®);
+   if (err)
return UINT64_MAX;
 
-   low = ret;
+   low = reg;
if (s->sizeof_stat == 4) {
-   ret = _mv88e6xxx_reg_read(chip, REG_PORT(port),
- s->reg + 1);
-   if (ret < 0)
+   err = mv88e6xxx_port_read(chip, port, s->reg + 1, ®);
+   if (err)
return UINT64_MAX;
-   high = ret;
+   high = reg;
}
break;
case BANK0:
@@ -893,6 +909,8 @@ static void mv88e6xxx_get_regs(struct dsa_switch *ds, int 
port,
   struct ethtool_regs *regs, void *_p)
 {
struct mv88e6xxx_chip *chip = ds->priv;
+   int err;
+   u16 reg;
u16 *p = _p;
int i;
 
@@ -903,11 +921,10 @@ static void mv88e6xxx_get_regs(struct dsa_switch *ds, int 
port,
mutex_lock(&chip->reg_lock);
 
for (i = 0; i < 32; i++) {
-   int ret;
 
-   ret = _mv88e6xxx_reg_read(chip, REG_PORT(port), i);
-   if (ret >= 0)
-   p[i] = ret;
+   err = mv88e6xxx_port_read(chip, port, i, ®);
+   if (!err)
+   p[i] = reg;
}
 
mutex_unlock(&chip->reg_lock);
@@ -938,7 +955,7 @@ static int mv88e6xxx_get_eee(struct dsa_switch *ds, int 
port,
e->eee_enabled = !!(reg & 0x0200);
e->tx_lpi_enabled = !!(reg & 0x0100);
 
-   err = mv88e6xxx_read(chip, REG_PORT(port), PORT_STATUS, ®);
+   err = mv88e6xxx_port_read(chip, port, PORT_STATUS, ®);
if (err)
goto out;
 
@@ -1106,12 +1123,13 @@ static int _mv88e6xxx_port_state(struct mv88e6xxx_chip 
*chip, int port,
 u8 state)
 {
str

[PATCHv2 net-next 2/2] net: dsa: mv88e6xxx: Convert flag bits to unsigned long long

2016-09-19 Thread Andrew Lunn
We are soon going to run out of flag bits on 32bit systems. Convert to
unsigned long long.

Signed-off-by: Andrew Lunn 
Reviewed-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 62 +--
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h 
b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index e349d0d64645..827988397fd8 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -452,36 +452,36 @@ enum mv88e6xxx_cap {
 };
 
 /* Bitmask of capabilities */
-#define MV88E6XXX_FLAG_EDSABIT(MV88E6XXX_CAP_EDSA)
-#define MV88E6XXX_FLAG_EEE BIT(MV88E6XXX_CAP_EEE)
-
-#define MV88E6XXX_FLAG_SMI_CMD BIT(MV88E6XXX_CAP_SMI_CMD)
-#define MV88E6XXX_FLAG_SMI_DATABIT(MV88E6XXX_CAP_SMI_DATA)
-
-#define MV88E6XXX_FLAG_PHY_PAGEBIT(MV88E6XXX_CAP_PHY_PAGE)
-
-#define MV88E6XXX_FLAG_SERDES  BIT(MV88E6XXX_CAP_SERDES)
-
-#define MV88E6XXX_FLAG_GLOBAL2 BIT(MV88E6XXX_CAP_GLOBAL2)
-#define MV88E6XXX_FLAG_G2_MGMT_EN_2X   BIT(MV88E6XXX_CAP_G2_MGMT_EN_2X)
-#define MV88E6XXX_FLAG_G2_MGMT_EN_0X   BIT(MV88E6XXX_CAP_G2_MGMT_EN_0X)
-#define MV88E6XXX_FLAG_G2_IRL_CMD  BIT(MV88E6XXX_CAP_G2_IRL_CMD)
-#define MV88E6XXX_FLAG_G2_IRL_DATA BIT(MV88E6XXX_CAP_G2_IRL_DATA)
-#define MV88E6XXX_FLAG_G2_PVT_ADDR BIT(MV88E6XXX_CAP_G2_PVT_ADDR)
-#define MV88E6XXX_FLAG_G2_PVT_DATA BIT(MV88E6XXX_CAP_G2_PVT_DATA)
-#define MV88E6XXX_FLAG_G2_SWITCH_MAC   BIT(MV88E6XXX_CAP_G2_SWITCH_MAC)
-#define MV88E6XXX_FLAG_G2_POT  BIT(MV88E6XXX_CAP_G2_POT)
-#define MV88E6XXX_FLAG_G2_EEPROM_CMD   BIT(MV88E6XXX_CAP_G2_EEPROM_CMD)
-#define MV88E6XXX_FLAG_G2_EEPROM_DATA  BIT(MV88E6XXX_CAP_G2_EEPROM_DATA)
-#define MV88E6XXX_FLAG_G2_SMI_PHY_CMD  BIT(MV88E6XXX_CAP_G2_SMI_PHY_CMD)
-#define MV88E6XXX_FLAG_G2_SMI_PHY_DATA BIT(MV88E6XXX_CAP_G2_SMI_PHY_DATA)
-
-#define MV88E6XXX_FLAG_PPU BIT(MV88E6XXX_CAP_PPU)
-#define MV88E6XXX_FLAG_PPU_ACTIVE  BIT(MV88E6XXX_CAP_PPU_ACTIVE)
-#define MV88E6XXX_FLAG_STU BIT(MV88E6XXX_CAP_STU)
-#define MV88E6XXX_FLAG_TEMPBIT(MV88E6XXX_CAP_TEMP)
-#define MV88E6XXX_FLAG_TEMP_LIMIT  BIT(MV88E6XXX_CAP_TEMP_LIMIT)
-#define MV88E6XXX_FLAG_VTU BIT(MV88E6XXX_CAP_VTU)
+#define MV88E6XXX_FLAG_EDSABIT_ULL(MV88E6XXX_CAP_EDSA)
+#define MV88E6XXX_FLAG_EEE BIT_ULL(MV88E6XXX_CAP_EEE)
+
+#define MV88E6XXX_FLAG_SMI_CMD BIT_ULL(MV88E6XXX_CAP_SMI_CMD)
+#define MV88E6XXX_FLAG_SMI_DATABIT_ULL(MV88E6XXX_CAP_SMI_DATA)
+
+#define MV88E6XXX_FLAG_PHY_PAGEBIT_ULL(MV88E6XXX_CAP_PHY_PAGE)
+
+#define MV88E6XXX_FLAG_SERDES  BIT_ULL(MV88E6XXX_CAP_SERDES)
+
+#define MV88E6XXX_FLAG_GLOBAL2 BIT_ULL(MV88E6XXX_CAP_GLOBAL2)
+#define MV88E6XXX_FLAG_G2_MGMT_EN_2X   BIT_ULL(MV88E6XXX_CAP_G2_MGMT_EN_2X)
+#define MV88E6XXX_FLAG_G2_MGMT_EN_0X   BIT_ULL(MV88E6XXX_CAP_G2_MGMT_EN_0X)
+#define MV88E6XXX_FLAG_G2_IRL_CMD  BIT_ULL(MV88E6XXX_CAP_G2_IRL_CMD)
+#define MV88E6XXX_FLAG_G2_IRL_DATA BIT_ULL(MV88E6XXX_CAP_G2_IRL_DATA)
+#define MV88E6XXX_FLAG_G2_PVT_ADDR BIT_ULL(MV88E6XXX_CAP_G2_PVT_ADDR)
+#define MV88E6XXX_FLAG_G2_PVT_DATA BIT_ULL(MV88E6XXX_CAP_G2_PVT_DATA)
+#define MV88E6XXX_FLAG_G2_SWITCH_MAC   BIT_ULL(MV88E6XXX_CAP_G2_SWITCH_MAC)
+#define MV88E6XXX_FLAG_G2_POT  BIT_ULL(MV88E6XXX_CAP_G2_POT)
+#define MV88E6XXX_FLAG_G2_EEPROM_CMD   BIT_ULL(MV88E6XXX_CAP_G2_EEPROM_CMD)
+#define MV88E6XXX_FLAG_G2_EEPROM_DATA  BIT_ULL(MV88E6XXX_CAP_G2_EEPROM_DATA)
+#define MV88E6XXX_FLAG_G2_SMI_PHY_CMD  BIT_ULL(MV88E6XXX_CAP_G2_SMI_PHY_CMD)
+#define MV88E6XXX_FLAG_G2_SMI_PHY_DATA BIT_ULL(MV88E6XXX_CAP_G2_SMI_PHY_DATA)
+
+#define MV88E6XXX_FLAG_PPU BIT_ULL(MV88E6XXX_CAP_PPU)
+#define MV88E6XXX_FLAG_PPU_ACTIVE  BIT_ULL(MV88E6XXX_CAP_PPU_ACTIVE)
+#define MV88E6XXX_FLAG_STU BIT_ULL(MV88E6XXX_CAP_STU)
+#define MV88E6XXX_FLAG_TEMPBIT_ULL(MV88E6XXX_CAP_TEMP)
+#define MV88E6XXX_FLAG_TEMP_LIMIT  BIT_ULL(MV88E6XXX_CAP_TEMP_LIMIT)
+#define MV88E6XXX_FLAG_VTU BIT_ULL(MV88E6XXX_CAP_VTU)
 
 /* EEPROM Programming via Global2 with 16-bit data */
 #define MV88E6XXX_FLAGS_EEPROM16   \
@@ -614,7 +614,7 @@ struct mv88e6xxx_info {
unsigned int num_ports;
unsigned int port_base_addr;
unsigned int age_time_coeff;
-   unsigned long flags;
+   unsigned long long flags;
 };
 
 struct mv88e6xxx_atu_entry {
-- 
2.9.3



[PATCHv2 net-next 0/2] Preparation for mv88e6390

2016-09-19 Thread Andrew Lunn
These two patches are a couple of preparation steps for supporting the
MV88E6390 family of chips. This is a new generation from Marvell,
and will need more feature flags than are currently available in an
unsigned long. Expand to an unsigned long long. The MV88E6390 also
places its port registers somewhere else, so add a wrapper around port
register access.

v2: Rework wrappers to use mv88e6xxx_{read|write}
Simplify some (err < 0) to (err)
Add Reviewed by tag.

Andrew Lunn (2):
  net: dsa: mv88e6xxx: Convert flag bits to unsigned long long
  WIP: net: dsa: mv88e6xxx: Implement interrupt support.

 drivers/net/dsa/mv88e6xxx/chip.c  | 174 ++
 drivers/net/dsa/mv88e6xxx/global2.c   | 128 -
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  90 +++---
 3 files changed, 359 insertions(+), 33 deletions(-)

-- 
2.9.3



Re: [PATCH net-next] net: dsa: mv88e6xxx: handle multiple ports in ATU

2016-09-19 Thread Andrew Lunn
On Mon, Sep 19, 2016 at 08:29:40PM -0400, Vivien Didelot wrote:
> Hi Andrew,
> 
> Andrew Lunn  writes:
> 
> > Hi Vivien
> >
> >> +  do {
> >> +  err = _mv88e6xxx_atu_getnext(chip, fid, &next);
> >> +  if (err)
> >> +  return err;
> >> +
> >> +  if (next.state == GLOBAL_ATU_DATA_STATE_UNUSED)
> >> +  break;
> >> +
> >> +  if (ether_addr_equal(next.mac, addr)) {
> >> +  *entry = next;
> >> +  return 0;
> >> +  }
> >> +  } while (!is_broadcast_ether_addr(next.mac));
> >
> > This is correct, but i wonder how well it scales? When we have a lot
> > of entries in the ATU, this is going to take time. At some point in
> > the future, we might want to keep a shadow copy of static entries of
> > the ATU in RAM. We then don't need to search for them.
> 
> There won't be any issue about time here, because we are searching a
> precise FID.

Ah, I didn't realise you can do that. However, it makes sense, the
hardware needs to do it all the time when it receives a frame.

 Andrew


Re: [PATCH net-next] net: dsa: mv88e6xxx: handle multiple ports in ATU

2016-09-19 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> Hi Vivien
>
>> +do {
>> +err = _mv88e6xxx_atu_getnext(chip, fid, &next);
>> +if (err)
>> +return err;
>> +
>> +if (next.state == GLOBAL_ATU_DATA_STATE_UNUSED)
>> +break;
>> +
>> +if (ether_addr_equal(next.mac, addr)) {
>> +*entry = next;
>> +return 0;
>> +}
>> +} while (!is_broadcast_ether_addr(next.mac));
>
> This is correct, but i wonder how well it scales? When we have a lot
> of entries in the ATU, this is going to take time. At some point in
> the future, we might want to keep a shadow copy of static entries of
> the ATU in RAM. We then don't need to search for them.

There won't be any issue about time here, because we are searching a
precise FID. What takes time is dumping the whole 4095 FIDs at once.

> But that is for later, once we have some complaints this is too slow.

Yes, there are several things we can cache in this driver, but to be
honest I'd prefer not to cache anything for the moment, until the driver
is robust. Caching data must be the very last point IMO.

Thanks,

Vivien


Re: [PATCH net-next] net: dsa: mv88e6xxx: handle multiple ports in ATU

2016-09-19 Thread Andrew Lunn
Hi Vivien

> + do {
> + err = _mv88e6xxx_atu_getnext(chip, fid, &next);
> + if (err)
> + return err;
> +
> + if (next.state == GLOBAL_ATU_DATA_STATE_UNUSED)
> + break;
> +
> + if (ether_addr_equal(next.mac, addr)) {
> + *entry = next;
> + return 0;
> + }
> + } while (!is_broadcast_ether_addr(next.mac));

This is correct, but i wonder how well it scales? When we have a lot
of entries in the ATU, this is going to take time. At some point in
the future, we might want to keep a shadow copy of static entries of
the ATU in RAM. We then don't need to search for them.

But that is for later, once we have some complaints this is too slow.

Andrew


lsm naming dilemma. Re: [RFC v3 07/22] landlock: Handle file comparisons

2016-09-19 Thread Alexei Starovoitov
On Thu, Sep 15, 2016 at 11:25:10PM +0200, Mickaël Salaün wrote:
> >> Agreed. With this RFC, the Checmate features (i.e. network helpers)
> >> should be able to sit on top of Landlock.
> > 
> > I think neither of them should be called fancy names for no technical 
> > reason.
> > We will have only one bpf based lsm. That's it and it doesn't
> > need an obscure name. Directory name can be security/bpf/..stuff.c
> 
> I disagree on an LSM named "BPF". I first started with the "seccomp LSM"
> name (first RFC) but I later realized that it is confusing because
> seccomp is associated to its syscall and the underlying features. Same
> thing goes for BPF. It is also artificially hard to grep on a name too
> used in the kernel source tree.
> Making an association between the generic eBPF mechanism and a security
> centric approach (i.e. LSM) seems a bit reductive (for BPF). Moreover,
> the seccomp interface [1] can still be used.

agree with above.

> Landlock is a nice name to depict a sandbox as an enclave (i.e. a
> landlocked country/state). I want to keep this name, which is simple,
> express the goal of Landlock nicely and is comparable to other sandbox
> mechanisms as Seatbelt or Pledge.
> Landlock should not be confused with the underlying eBPF implementation.
> Landlock could use more than only eBPF in the future and eBPF could be
> used in other LSM as well.

there will not be two bpf based LSMs.
Therefore unless you can convince Sargun to give up his 'checmate' name,
nothing goes in.
The features you both need are 90% the same, so they must be done
as part of a single LSM, whatever you both agree to call it.



[PATCH iproute2 net-next] iptnl: add support for collect_md flag in IPv4 and IPv6 tunnels

2016-09-19 Thread Alexei Starovoitov
Signed-off-by: Alexei Starovoitov 
---
Stephen,
please sync include/uapi/linux/if_tunnel.h from the kernel tree
before applying.
Thanks
---
 ip/link_ip6tnl.c | 10 ++
 ip/link_iptnl.c  | 11 +++
 2 files changed, 21 insertions(+)

diff --git a/ip/link_ip6tnl.c b/ip/link_ip6tnl.c
index 59162a33608f..051c89f4fe57 100644
--- a/ip/link_ip6tnl.c
+++ b/ip/link_ip6tnl.c
@@ -40,6 +40,7 @@ static void print_usage(FILE *f)
fprintf(f, "  [ noencap ] [ encap { fou | gue | none } ]\n");
fprintf(f, "  [ encap-sport PORT ] [ encap-dport PORT ]\n");
fprintf(f, "  [ [no]encap-csum ] [ [no]encap-csum6 ] [ 
[no]encap-remcsum ]\n");
+   fprintf(f, "  [ external ]\n");
fprintf(f, "\n");
fprintf(f, "Where: NAME  := STRING\n");
fprintf(f, "   ADDR  := IPV6_ADDRESS\n");
@@ -89,6 +90,7 @@ static int ip6tunnel_parse_opt(struct link_util *lu, int 
argc, char **argv,
__u16 encapflags = TUNNEL_ENCAP_FLAG_CSUM6;
__u16 encapsport = 0;
__u16 encapdport = 0;
+   __u8 metadata = 0;
 
if (!(n->nlmsg_flags & NLM_F_CREATE)) {
if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
@@ -141,6 +143,8 @@ static int ip6tunnel_parse_opt(struct link_util *lu, int 
argc, char **argv,
 
if (iptuninfo[IFLA_IPTUN_PROTO])
proto = rta_getattr_u8(iptuninfo[IFLA_IPTUN_PROTO]);
+   if (iptuninfo[IFLA_IPTUN_COLLECT_METADATA])
+   metadata = 1;
}
 
while (argc > 0) {
@@ -277,12 +281,18 @@ static int ip6tunnel_parse_opt(struct link_util *lu, int 
argc, char **argv,
encapflags |= TUNNEL_ENCAP_FLAG_REMCSUM;
} else if (strcmp(*argv, "noencap-remcsum") == 0) {
encapflags |= ~TUNNEL_ENCAP_FLAG_REMCSUM;
+   } else if (strcmp(*argv, "external") == 0) {
+   metadata = 1;
} else
usage();
argc--, argv++;
}
 
addattr8(n, 1024, IFLA_IPTUN_PROTO, proto);
+   if (metadata) {
+   addattr_l(n, 1024, IFLA_IPTUN_COLLECT_METADATA, NULL, 0);
+   return 0;
+   }
addattr_l(n, 1024, IFLA_IPTUN_LOCAL, &laddr, sizeof(laddr));
addattr_l(n, 1024, IFLA_IPTUN_REMOTE, &raddr, sizeof(raddr));
addattr8(n, 1024, IFLA_IPTUN_TTL, hop_limit);
diff --git a/ip/link_iptnl.c b/ip/link_iptnl.c
index 7ec377791d88..1875348a268a 100644
--- a/ip/link_iptnl.c
+++ b/ip/link_iptnl.c
@@ -36,6 +36,7 @@ static void print_usage(FILE *f, int sit)
fprintf(f, "  [ mode { ip6ip | ipip | any } ]\n");
fprintf(f, "  [ isatap ]\n");
}
+   fprintf(f, "  [ external ]\n");
fprintf(f, "\n");
fprintf(f, "Where: NAME := STRING\n");
fprintf(f, "   ADDR := { IP_ADDRESS | any }\n");
@@ -85,6 +86,7 @@ static int iptunnel_parse_opt(struct link_util *lu, int argc, 
char **argv,
__u16 encapflags = 0;
__u16 encapsport = 0;
__u16 encapdport = 0;
+   __u8 metadata = 0;
 
if (!(n->nlmsg_flags & NLM_F_CREATE)) {
if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
@@ -161,6 +163,8 @@ static int iptunnel_parse_opt(struct link_util *lu, int 
argc, char **argv,
if (iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIXLEN])
ip6rdrelayprefixlen =

rta_getattr_u16(iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]);
+   if (iptuninfo[IFLA_IPTUN_COLLECT_METADATA])
+   metadata = 1;
}
 
while (argc > 0) {
@@ -257,6 +261,8 @@ static int iptunnel_parse_opt(struct link_util *lu, int 
argc, char **argv,
encapflags |= TUNNEL_ENCAP_FLAG_REMCSUM;
} else if (strcmp(*argv, "noencap-remcsum") == 0) {
encapflags &= ~TUNNEL_ENCAP_FLAG_REMCSUM;
+   } else if (strcmp(*argv, "external") == 0) {
+   metadata = 1;
} else if (strcmp(*argv, "6rd-prefix") == 0) {
inet_prefix prefix;
 
@@ -291,6 +297,11 @@ static int iptunnel_parse_opt(struct link_util *lu, int 
argc, char **argv,
exit(-1);
}
 
+   if (metadata) {
+   addattr_l(n, 1024, IFLA_IPTUN_COLLECT_METADATA, NULL, 0);
+   return 0;
+   }
+
addattr32(n, 1024, IFLA_IPTUN_LINK, link);
addattr32(n, 1024, IFLA_IPTUN_LOCAL, laddr);
addattr32(n, 1024, IFLA_IPTUN_REMOTE, raddr);
-- 
2.8.0



[PATCH net-next] net: dsa: mv88e6xxx: handle multiple ports in ATU

2016-09-19 Thread Vivien Didelot
An address can be loaded in the ATU with multiple ports, for instance
when adding multiple ports to a Multicast group with "bridge mdb".

The current code doesn't allow that. Add a helper to get a single entry
from the ATU, then set or clear the requested port, before loading the
entry back into the ATU.

Note that the required _mv88e6xxx_atu_getnext function is defined below
mv88e6xxx_port_db_load_purge, so forward-declare it for the moment. The
ATU code will be isolated in future patches.

Fixes: 83dabd1fa84c ("net: dsa: mv88e6xxx: make switchdev DB ops generic")
Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 56 +++-
 1 file changed, 49 insertions(+), 7 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 70a812d..1d71802 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2091,12 +2091,48 @@ static int _mv88e6xxx_atu_load(struct mv88e6xxx_chip 
*chip,
return _mv88e6xxx_atu_cmd(chip, entry->fid, GLOBAL_ATU_OP_LOAD_DB);
 }
 
+static int _mv88e6xxx_atu_getnext(struct mv88e6xxx_chip *chip, u16 fid,
+ struct mv88e6xxx_atu_entry *entry);
+
+static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid,
+const u8 *addr, struct mv88e6xxx_atu_entry *entry)
+{
+   struct mv88e6xxx_atu_entry next;
+   int err;
+
+   eth_broadcast_addr(next.mac);
+
+   err = _mv88e6xxx_atu_mac_write(chip, next.mac);
+   if (err)
+   return err;
+
+   do {
+   err = _mv88e6xxx_atu_getnext(chip, fid, &next);
+   if (err)
+   return err;
+
+   if (next.state == GLOBAL_ATU_DATA_STATE_UNUSED)
+   break;
+
+   if (ether_addr_equal(next.mac, addr)) {
+   *entry = next;
+   return 0;
+   }
+   } while (!is_broadcast_ether_addr(next.mac));
+
+   memset(entry, 0, sizeof(*entry));
+   entry->fid = fid;
+   ether_addr_copy(entry->mac, addr);
+
+   return 0;
+}
+
 static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
const unsigned char *addr, u16 vid,
u8 state)
 {
-   struct mv88e6xxx_atu_entry entry = { 0 };
struct mv88e6xxx_vtu_stu_entry vlan;
+   struct mv88e6xxx_atu_entry entry;
int err;
 
/* Null VLAN ID corresponds to the port private database */
@@ -2107,12 +2143,18 @@ static int mv88e6xxx_port_db_load_purge(struct 
mv88e6xxx_chip *chip, int port,
if (err)
return err;
 
-   entry.fid = vlan.fid;
-   entry.state = state;
-   ether_addr_copy(entry.mac, addr);
-   if (state != GLOBAL_ATU_DATA_STATE_UNUSED) {
-   entry.trunk = false;
-   entry.portv_trunkid = BIT(port);
+   err = mv88e6xxx_atu_get(chip, vlan.fid, addr, &entry);
+   if (err)
+   return err;
+
+   /* Purge the ATU entry only if no port is using it anymore */
+   if (state == GLOBAL_ATU_DATA_STATE_UNUSED) {
+   entry.portv_trunkid &= ~BIT(port);
+   if (!entry.portv_trunkid)
+   entry.state = GLOBAL_ATU_DATA_STATE_UNUSED;
+   } else {
+   entry.portv_trunkid |= BIT(port);
+   entry.state = state;
}
 
return _mv88e6xxx_atu_load(chip, &entry);
-- 
2.10.0



Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Eric Dumazet
>
> It generates some slightly smaller code.
> if (bbr->lt_rtt_cnt < bbr_lt_intvl_min_rtts)
> - 3e7:  0f b6 c0              movzbl %al,%eax
> - 3ea:  83 f8 03              cmp    $0x3,%eax
> - 3ed:  0f 86 d4 00 00 00     jbe    4c7
> + 3e7:  3c 03                 cmp    $0x3,%al
> + 3e9:  0f 86 d1 00 00 00     jbe    4c0
>
> And different code for abs
> /* Is new bw close to the lt_bw from the previous interval? */
> diff = abs(bw - bbr->lt_bw);
> - 47a:  44 89 e2              mov    %r12d,%edx
> - 47d:  29 c2                 sub    %eax,%edx
> - 47f:  89 d1                 mov    %edx,%ecx
> - 481:  89 d3                 mov    %edx,%ebx
> + 475:  44 89 e3              mov    %r12d,%ebx
> + 478:  29 c3                 sub    %eax,%ebx
> + 47a:  89 da                 mov    %ebx,%edx
> + 47c:  c1 fa 1f              sar    $0x1f,%edx
> + 47f:  31 d3                 xor    %edx,%ebx
> + 481:  29 d3                 sub    %edx,%ebx
>
> The biggest change is getting rid of the always true conditional.
>
...
>
> Overall, it really makes little difference. Actual file sizes come out the 
> same.
> The idea is more to document what is variable
> and what is immutable in the algorithm.

SGTM, thanks Stephen !


Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Stephen Hemminger
On Mon, 19 Sep 2016 14:10:39 -0700
Eric Dumazet  wrote:

> On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger
>  wrote:
> 
> > Looks good, but could I suggest a simple optimization.
> > All these parameters are immutable in the version of BBR you are submitting.
> > Why not make the values const? And eliminate the always true long-term bw 
> > estimate
> > variable?
> >  
> 
> We could do that.
> 
> We used to have variables (aka module params) while BBR was cooking in
> our kernels ;)
> 
> Are you sure generated code is indeed 'optimized' ?


It generates some slightly smaller code.
if (bbr->lt_rtt_cnt < bbr_lt_intvl_min_rtts)
- 3e7:  0f b6 c0              movzbl %al,%eax
- 3ea:  83 f8 03              cmp    $0x3,%eax
- 3ed:  0f 86 d4 00 00 00     jbe    4c7
+ 3e7:  3c 03                 cmp    $0x3,%al
+ 3e9:  0f 86 d1 00 00 00     jbe    4c0

And different code for abs
/* Is new bw close to the lt_bw from the previous interval? */
diff = abs(bw - bbr->lt_bw);
- 47a:  44 89 e2              mov    %r12d,%edx
- 47d:  29 c2                 sub    %eax,%edx
- 47f:  89 d1                 mov    %edx,%ecx
- 481:  89 d3                 mov    %edx,%ebx
+ 475:  44 89 e3              mov    %r12d,%ebx
+ 478:  29 c3                 sub    %eax,%ebx
+ 47a:  89 da                 mov    %ebx,%edx
+ 47c:  c1 fa 1f              sar    $0x1f,%edx
+ 47f:  31 d3                 xor    %edx,%ebx
+ 481:  29 d3                 sub    %edx,%ebx

The biggest change is getting rid of the always true conditional.

-   u32 diff;
-
-   if (bbr->lt_bw &&  /* do we have bw from a previous interval? */
-   bbr_lt_bw_estimator) {  /* using long-term bw estimator enabled? */
-   /* Is new bw close to the lt_bw from the previous interval? */
-   diff = abs(bw - bbr->lt_bw);
- 485:  c1 f9 1f              sar    $0x1f,%ecx
-   if ((diff * BBR_UNIT <= bbr_lt_conv_thresh * bbr->lt_bw) ||
- 488:  c1 e2 05              shl    $0x5,%edx
-   u32 diff;
-
-   if (bbr->lt_bw &&  /* do we have bw from a previous interval? */
-   bbr_lt_bw_estimator) {  /* using long-term bw estimator enabled? */
-   /* Is new bw close to the lt_bw from the previous interval? */
-   diff = abs(bw - bbr->lt_bw);
- 48b:  31 cb                 xor    %ecx,%ebx
- 48d:  29 cb                 sub    %ecx,%ebx
-   if ((diff * BBR_UNIT <= bbr_lt_conv_thresh * bbr->lt_bw) ||
- 48f:  89 d9                 mov    %ebx,%ecx
- 491:  c1 e1 08              shl    $0x8,%ecx
- 494:  39 d1                 cmp    %edx,%ecx
- 496:  0f 87 6e 01 00 00     ja     60a
+ 485:  89 d9                 mov    %ebx,%ecx
+ 487:  c1 e2 05              shl    $0x5,%edx
+ 48a:  c1 e1 08              shl    $0x8,%ecx
+ 48d:  39 d1                 cmp    %edx,%ecx
+ 48f:  0f 87 6e 01 00 00     ja     603


Overall, it really makes little difference. Actual file sizes come out the same.
The idea is more to document what is variable
and what is immutable in the algorithm.
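
As a minimal sketch (not the tcp_bbr.c source) of what making the values
const buys: the bound can be folded into an immediate operand, which is
where the smaller cmp sequence quoted above comes from.

/* before: a writable tunable (module-parameter style)
 *     static unsigned int bbr_lt_intvl_min_rtts __read_mostly = 4;
 * after: an immutable algorithm constant
 */
static const unsigned int bbr_lt_intvl_min_rtts = 4;

static inline int bbr_lt_interval_done(unsigned int lt_rtt_cnt)
{
	/* the bound becomes an immediate operand (the cmp $0x3 above)
	 * rather than a value reloaded from memory */
	return lt_rtt_cnt < bbr_lt_intvl_min_rtts;
}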



[PATCH v5 net-next 1/1] net sched actions: fix GETing actions

2016-09-19 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

With the batch changes that translated transient actions into
a temporary list, what was lost in the translation was the fact that
tcf_action_destroy() will eventually delete the action from
the permanent location if the refcount is zero.

Example of what broke:
...add a gact action to drop
sudo $TC actions add action drop index 10
...now retrieve it, looks good
sudo $TC actions get action gact index 10
...retrieve it again and find it is gone!
sudo $TC actions get action gact index 10

Fixes: 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array")
Fixes: 824a7e8863b3 ("net_sched: remove an unnecessary list_del()")
Fixes: f07fed82ad79 ("net_sched: remove the leftover cleanup_a()")

Acked-by: Cong Wang 
Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_api.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index d09d068..411191b 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -592,6 +592,17 @@ err_out:
return ERR_PTR(err);
 }
 
+static void cleanup_a(struct list_head *actions, int ovr)
+{
+   struct tc_action *a;
+
+   if (!ovr)
+   return;
+
+   list_for_each_entry(a, actions, list)
+   a->tcfa_refcnt--;
+}
+
 int tcf_action_init(struct net *net, struct nlattr *nla,
  struct nlattr *est, char *name, int ovr,
  int bind, struct list_head *actions)
@@ -612,8 +623,15 @@ int tcf_action_init(struct net *net, struct nlattr *nla,
goto err;
}
act->order = i;
+   if (ovr)
+   act->tcfa_refcnt++;
list_add_tail(&act->list, actions);
}
+
+   /* Remove the temp refcnt which was necessary to protect against
+* destroying an existing action which was being replaced
+*/
+   cleanup_a(actions, ovr);
return 0;
 
 err:
@@ -883,6 +901,8 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct 
nlmsghdr *n,
goto err;
}
act->order = i;
+   if (event == RTM_GETACTION)
+   act->tcfa_refcnt++;
list_add_tail(&act->list, &actions);
}
 
-- 
1.9.1



Re: [PATCH net-next 2/2] net: dsa: mv88e6xxx: Convert flag bits to unsigned long long

2016-09-19 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> We are soon going to run out of flag bits on 32bit systems. Convert to
> unsigned long long.
>
> Signed-off-by: Andrew Lunn 

Reviewed-by: Vivien Didelot 

Thanks,

Vivien


Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: Add helper for accessing port registers

2016-09-19 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> @@ -41,6 +41,11 @@ static void assert_reg_lock(struct mv88e6xxx_chip *chip)
>   }
>  }
>  
> +static int mv88e6xxx_reg_port(struct mv88e6xxx_chip *chip, int port)
> +{
> + return chip->info->port_base_addr + port;
> +}
> +

If we really want such a helper, can you call it mv88e6xxx_port_addr()
instead, so that we respect an implicit mv88e6xxx_port_ namespace, and
use the correct "addr" term instead of the erroneous "reg" one.

>  /* The switch ADDR[4:1] configuration pins define the chip SMI device address
>   * (ADDR[0] is always zero, thus only even SMI addresses can be strapped).
>   *
> @@ -216,6 +221,42 @@ int mv88e6xxx_write(struct mv88e6xxx_chip *chip, int 
> addr, int reg, u16 val)
>   return 0;
>  }
>  
> +int mv88e6xxx_port_read(struct mv88e6xxx_chip *chip, int port, int reg,
> + u16 *val)
> +{
> + int addr = mv88e6xxx_reg_port(chip, port);
> + int err;
> +
> + assert_reg_lock(chip);
> +
> + err = mv88e6xxx_smi_read(chip, addr, reg, val);
> + if (err)
> + return err;
> +
> + dev_dbg(chip->dev, "<- port: 0x%.2x reg: 0x%.2x val: 0x%.4x\n",
> + port, reg, *val);
> +
> + return 0;
> +}
> +
> +int mv88e6xxx_port_write(struct mv88e6xxx_chip *chip, int port, int reg,
> +  u16 val)
> +{
> + int addr = mv88e6xxx_reg_port(chip, port);
> + int err;
> +
> + assert_reg_lock(chip);
> +
> + err = mv88e6xxx_smi_write(chip, addr, reg, val);
> + if (err)
> + return err;
> +
> + dev_dbg(chip->dev, "-> port: 0x%.2x reg: 0x%.2x val: 0x%.4x\n",
> + port, reg, val);
> +
> + return 0;
> +}
> +

mv88e6xxx_{read,write} are already doing this (wrapping the assert, smi
op and debug message). Plus, we could access the port registers through
a different interface, like remote management frames.

So please don't duplicate this logic; use the following instead, as the
previous mv88e6xxx_port_read() function was doing:

static int mv88e6xxx_port_read(struct mv88e6xxx_chip *chip,
   int port, int reg, u16 *val)
{
int addr = chip->info->port_base_addr + port;

return mv88e6xxx_read(chip, addr, reg, val);
}

static int mv88e6xxx_port_write(struct mv88e6xxx_chip *chip,
int port, int reg, u16 val)
{
int addr = chip->info->port_base_addr + port;

return mv88e6xxx_write(chip, addr, reg, val);
}

Note: in fact, I don't really see a need for the mv88e6xxx_port_addr
helper; the above code is quite clear. I'd suggest dropping it unless we
need a port address somewhere other than in mv88e6xxx_port_{read,write}.

>  static int mv88e6xxx_phy_read(struct mv88e6xxx_chip *chip, int phy,
> int reg, u16 *val)
>  {
> @@ -585,19 +626,19 @@ static void mv88e6xxx_adjust_link(struct dsa_switch 
> *ds, int port,
> struct phy_device *phydev)
>  {
>   struct mv88e6xxx_chip *chip = ds->priv;
> - u32 reg;
> - int ret;
> + u16 reg;
> + int err;
>  
>   if (!phy_is_pseudo_fixed_link(phydev))
>   return;
>  
>   mutex_lock(&chip->reg_lock);
>  
> - ret = _mv88e6xxx_reg_read(chip, REG_PORT(port), PORT_PCS_CTRL);
> - if (ret < 0)
> + err = mv88e6xxx_port_read(chip, port, PORT_PCS_CTRL, ®);
> + if (err < 0)
>   goto out;

Can you please get rid of the < 0 condition at the same time, if (err)
is enough.

(same for the rest of this patch).

Thanks,

Vivien


[PATCH net-next 0/3] BPF direct packet access improvements

2016-09-19 Thread Daniel Borkmann
This set adds write support to the currently available read support
for {cls,act}_bpf programs. The first patch is a fix for an affected commit
sitting in net-next and a prerequisite for the second one; the last patch
adds a number of test cases against the verifier. For details, please
see the individual patches.

Thanks!

Daniel Borkmann (3):
  bpf, verifier: enforce larger zero range for pkt on overloading stack buffs
  bpf: direct packet write and access for helpers for clsact progs
  bpf: add test cases for direct packet access

 include/linux/bpf.h |   4 +-
 include/linux/skbuff.h  |  14 +-
 include/uapi/linux/bpf.h|  21 +++
 kernel/bpf/helpers.c|   3 +
 kernel/bpf/verifier.c   |  56 --
 net/core/filter.c   | 134 --
 samples/bpf/test_verifier.c | 433 +++-
 7 files changed, 627 insertions(+), 38 deletions(-)

-- 
1.9.3



[PATCH net-next 1/3] bpf, verifier: enforce larger zero range for pkt on overloading stack buffs

2016-09-19 Thread Daniel Borkmann
Current contract for the following two helper argument types is:

  * ARG_CONST_STACK_SIZE: passed argument pair must be (ptr, >0).
  * ARG_CONST_STACK_SIZE_OR_ZERO: passed argument pair can be either
(NULL, 0) or (ptr, >0).

With 6841de8b0d03 ("bpf: allow helpers access the packet directly"), we can
pass also raw packet data to helpers, so depending on the argument type
being PTR_TO_PACKET, we now either assert memory via check_packet_access()
or check_stack_boundary(). As a result, the tests in check_packet_access()
currently allow more than intended with regard to reg->imm.

Back in 969bf05eb3ce ("bpf: direct packet access"), check_packet_access()
was fine to ignore size argument since in check_mem_access() size was
bpf_size_to_bytes() derived and prior to the call to check_packet_access()
guaranteed to be larger than zero.

However, for the above two argument types, it currently means we can end up
with a size <= 0, thus breaking the current guarantees for helpers. Enforce a
check for size <= 0 and bail out if so.

check_stack_boundary() doesn't have such an issue since it already tests
for access_size <= 0 and bails out, or for access_size == 0 in the case of a
NULL pointer being passed when allowed.
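
For reference, a sketch of where these argument types live, modeled on the
kernel's bpf_skb_load_bytes() proto at the time (the field values below are
illustrative, not part of this patch):

#include <linux/bpf.h>

/* the verifier pairs arg3 (pointer) with arg4 (size) and enforces the
 * (ptr, size) contract described above */
static const struct bpf_func_proto example_load_bytes_proto = {
	.func		= NULL,	/* helper body omitted in this sketch */
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_CTX,
	.arg2_type	= ARG_ANYTHING,		/* offset */
	.arg3_type	= ARG_PTR_TO_RAW_STACK,	/* destination buffer */
	.arg4_type	= ARG_CONST_STACK_SIZE,	/* paired size, must be > 0 */
};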

Fixes: 6841de8b0d03 ("bpf: allow helpers access the packet directly")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 Affected commit sits in net-next only.

 kernel/bpf/verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 90493a6..bc138f3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -671,7 +671,7 @@ static int check_packet_access(struct verifier_env *env, 
u32 regno, int off,
struct reg_state *reg = ®s[regno];
 
off += reg->off;
-   if (off < 0 || off + size > reg->range) {
+   if (off < 0 || size <= 0 || off + size > reg->range) {
verbose("invalid access to packet, off=%d size=%d, 
R%d(id=%d,off=%d,r=%d)\n",
off, size, regno, reg->id, reg->off, reg->range);
return -EACCES;
-- 
1.9.3



[PATCH net-next 2/3] bpf: direct packet write and access for helpers for clsact progs

2016-09-19 Thread Daniel Borkmann
This work implements direct packet access for helpers and direct packet
write in a similar fashion as already available for XDP types via commits
4acf6c0b84c9 ("bpf: enable direct packet data write for xdp progs") and
6841de8b0d03 ("bpf: allow helpers access the packet directly"), and as a
complementary feature to the already available direct packet read for tc
(cls/act) programs.

For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
and bpf_csum_update(). The first is generally needed for both read and
write, because they would otherwise be limited to the current linear
skb head. Usually, when the data_end test fails, programs just bail out,
or, in the direct read case, use bpf_skb_load_bytes() as an alternative
to overcome this limitation. If such data sits in non-linear parts, we
can just pull them in once with the new helper, retest and eventually
access them.

At the same time, this also makes sure the skb is uncloned, which is, of
course, a necessary condition for direct write. As this needs to be an
invariant for the write part only, the verifier detects writes and adds
a prologue that is calling bpf_skb_pull_data() to effectively unclone the
skb from the very beginning in case it is indeed cloned. The heuristic
makes use of a similar trick that was done in 233577a22089 ("net: filter:
constify detection of pkt_type_offset"). This comes at zero cost for other
programs that do not use the direct write feature. Should a program use
this feature only sparsely, with mostly read access and, for example, drop
return codes, then such a write action can be delegated to a tail-called
program, deferring the cost of potential uncloning to a late point in time
where it would have been paid similarly with bpf_skb_store_bytes() as well.
The advantage of direct write is that the writes are inlined, whereas the
helper cannot make any length assumptions and thus needs to generate a call
to memcpy() even for small sizes; the cost of the helper call itself with
its sanity checks is also avoided. Plus, when
direct read is already used, we don't need to cache or perform rechecks
on the data boundaries (due to verifier invalidating previous checks for
helpers that change skb->data), so more complex programs using rewrites
can benefit from switching to direct read plus write.

For direct packet access to helpers, we save the otherwise needed copy into
a temp struct sitting on stack memory when use-case allows. Both facilities
are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
this to map helpers and csum_diff, and can successively enable other helpers
where we find it makes sense. Helpers that definitely cannot be allowed for
this are those part of bpf_helper_changes_skb_data() since they can change
underlying data, and those that write into memory as this could happen for
packet typed args when still cloned. The bpf_csum_update() helper accounts
for the fact that we need to fix up checksum_complete when using direct write
instead of bpf_skb_store_bytes(), meaning programs can use available
helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
csum_block_add(), csum_block_sub() equivalents in eBPF together with the
new helper. A usage example will be provided for iproute2's examples/bpf/
directory.
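
As a rough usage sketch (not part of this patch; it assumes the new
BPF_FUNC_skb_pull_data helper id added here and the usual restricted-C
conventions for tc programs), a cls_bpf program could pull the bytes it
wants to rewrite before doing the direct write:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>

static int (*skb_pull_data)(void *skb, unsigned int len) =
	(void *) BPF_FUNC_skb_pull_data;

__attribute__((section("classifier"), used))
int pull_then_write(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth;

	if (data + ETH_HLEN > data_end) {
		/* header may sit in the non-linear part: pull it in once */
		if (skb_pull_data(skb, ETH_HLEN))
			return TC_ACT_SHOT;
		/* the pull invalidates skb->data/data_end, so re-derive
		 * the pointers and retest before any access */
		data = (void *)(long)skb->data;
		data_end = (void *)(long)skb->data_end;
		if (data + ETH_HLEN > data_end)
			return TC_ACT_SHOT;
	}

	eth = data;
	eth->h_dest[0] ^= 0x02;	/* direct packet write, no helper call */

	return TC_ACT_OK;
}

The pointer reload mirrors what the verifier enforces: helpers that change
skb->data invalidate previously checked packet pointers.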

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |   4 +-
 include/linux/skbuff.h   |  14 -
 include/uapi/linux/bpf.h |  21 
 kernel/bpf/helpers.c |   3 ++
 kernel/bpf/verifier.c|  54 ++-
 net/core/filter.c| 134 +--
 6 files changed, 196 insertions(+), 34 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9a904f6..5691fdc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -96,6 +96,7 @@ enum bpf_return_type {
 struct bpf_func_proto {
u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
bool gpl_only;
+   bool pkt_access;
enum bpf_return_type ret_type;
enum bpf_arg_type arg1_type;
enum bpf_arg_type arg2_type;
@@ -151,7 +152,8 @@ struct bpf_verifier_ops {
 */
bool (*is_valid_access)(int off, int size, enum bpf_access_type type,
enum bpf_reg_type *reg_type);
-
+   int (*gen_prologue)(struct bpf_insn *insn, bool direct_write,
+   const struct bpf_prog *prog);
u32 (*convert_ctx_access)(enum bpf_access_type type, int dst_reg,
  int src_reg, int ctx_off,
  struct bpf_insn *insn, struct bpf_prog *prog);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4c5662f..c6dab3f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -676,13 +676,23 @@ struct sk_buff {
 */
kmemcheck_bitfield_begin(flags1);
__u16   queue_mapping;
+
+/* if you move c

[PATCH net-next 3/3] bpf: add test cases for direct packet access

2016-09-19 Thread Daniel Borkmann
Add a couple of test cases for direct write and the negative size issue, and
also adjust the direct packet access test4: it asserted that writes are
not possible, but since we've just added support for writes, we need to
invert the verdict to ACCEPT, of course. Summary: 133 PASSED, 0 FAILED.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 samples/bpf/test_verifier.c | 433 +++-
 1 file changed, 430 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index 1f6cc9b..ac590d4 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -291,6 +291,29 @@ static struct bpf_test tests[] = {
.result = REJECT,
},
{
+   "invalid argument register",
+   .insns = {
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_get_cgroup_classid),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_get_cgroup_classid),
+   BPF_EXIT_INSN(),
+   },
+   .errstr = "R1 !read_ok",
+   .result = REJECT,
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
+   "non-invalid argument register",
+   .insns = {
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_1),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_get_cgroup_classid),
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_1, BPF_REG_6),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_get_cgroup_classid),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
"check valid spill/fill",
.insns = {
/* spill R1(ctx) into stack */
@@ -1210,6 +1233,54 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
{
+   "raw_stack: skb_load_bytes, negative len",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_2, 4),
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+   BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+   BPF_MOV64_IMM(BPF_REG_4, -8),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_skb_load_bytes),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "invalid stack type R3",
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
+   "raw_stack: skb_load_bytes, negative len 2",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_2, 4),
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+   BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+   BPF_MOV64_IMM(BPF_REG_4, ~0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_skb_load_bytes),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "invalid stack type R3",
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
+   "raw_stack: skb_load_bytes, zero len",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_2, 4),
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+   BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+   BPF_MOV64_IMM(BPF_REG_4, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_skb_load_bytes),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "invalid stack type R3",
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
"raw_stack: skb_load_bytes, no init",
.insns = {
BPF_MOV64_IMM(BPF_REG_2, 4),
@@ -1511,7 +1582,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
},
{
-   "direct packet access: test4",
+   "direct packet access: test4 (write)",
.insns = {
BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
offsetof(struct __sk_buff, data)),
@@ -1524,8 +1595,7 @@ static struct bpf_test tests[] = {
   

[PATCH net-next 0/2] Preparation for mv88e6390

2016-09-19 Thread Andrew Lunn
These two patches are a couple of preparation steps for supporting the
MV88E6390 family of chips. This is a new generation from Marvell,
and will need more feature flags than are currently available in an
unsigned long. Expand to an unsigned long long. The MV88E6390 also
places its port registers somewhere else, so add a wrapper around port
register access.
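
As a minimal sketch of the limitation (not from the series; the macros are
as defined in include/linux/bitops.h): BIT() shifts an unsigned long, which
is only 32 bits wide on 32-bit systems, so capability numbers of 32 and
above need the 64-bit variant and a wider flags field.

#define BIT(nr)		(1UL << (nr))	/* flag bits 0..31 on 32-bit builds */
#define BIT_ULL(nr)	(1ULL << (nr))	/* flag bits 0..63 everywhere */

/* hypothetical holder mirroring the flags field in mv88e6xxx_info */
struct example_info {
	unsigned long long flags;	/* was: unsigned long */
};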

Andrew Lunn (2):
  net: dsa: mv88e6xxx: Convert flag bits to unsigned long long
  WIP: net: dsa: mv88e6xxx: Implement interrupt support.

 drivers/net/dsa/mv88e6xxx/chip.c  | 174 ++
 drivers/net/dsa/mv88e6xxx/global2.c   | 128 -
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  90 +++---
 3 files changed, 359 insertions(+), 33 deletions(-)

-- 
2.9.3



[PATCH net-next 1/2] net: dsa: mv88e6xxx: Add helper for accessing port registers

2016-09-19 Thread Andrew Lunn
There is a device coming soon which places its port registers
somewhere different to all other Marvell switches supported so far.
Add helper functions for reading/writing port registers, making it
easier to handle this new device.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 394 ++
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |   1 -
 2 files changed, 206 insertions(+), 189 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 70a812d159c9..58b8b94d278e 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -41,6 +41,11 @@ static void assert_reg_lock(struct mv88e6xxx_chip *chip)
}
 }
 
+static int mv88e6xxx_reg_port(struct mv88e6xxx_chip *chip, int port)
+{
+   return chip->info->port_base_addr + port;
+}
+
 /* The switch ADDR[4:1] configuration pins define the chip SMI device address
  * (ADDR[0] is always zero, thus only even SMI addresses can be strapped).
  *
@@ -216,6 +221,42 @@ int mv88e6xxx_write(struct mv88e6xxx_chip *chip, int addr, 
int reg, u16 val)
return 0;
 }
 
+int mv88e6xxx_port_read(struct mv88e6xxx_chip *chip, int port, int reg,
+   u16 *val)
+{
+   int addr = mv88e6xxx_reg_port(chip, port);
+   int err;
+
+   assert_reg_lock(chip);
+
+   err = mv88e6xxx_smi_read(chip, addr, reg, val);
+   if (err)
+   return err;
+
+   dev_dbg(chip->dev, "<- port: 0x%.2x reg: 0x%.2x val: 0x%.4x\n",
+   port, reg, *val);
+
+   return 0;
+}
+
+int mv88e6xxx_port_write(struct mv88e6xxx_chip *chip, int port, int reg,
+u16 val)
+{
+   int addr = mv88e6xxx_reg_port(chip, port);
+   int err;
+
+   assert_reg_lock(chip);
+
+   err = mv88e6xxx_smi_write(chip, addr, reg, val);
+   if (err)
+   return err;
+
+   dev_dbg(chip->dev, "-> port: 0x%.2x reg: 0x%.2x val: 0x%.4x\n",
+   port, reg, val);
+
+   return 0;
+}
+
 static int mv88e6xxx_phy_read(struct mv88e6xxx_chip *chip, int phy,
  int reg, u16 *val)
 {
@@ -585,19 +626,19 @@ static void mv88e6xxx_adjust_link(struct dsa_switch *ds, 
int port,
  struct phy_device *phydev)
 {
struct mv88e6xxx_chip *chip = ds->priv;
-   u32 reg;
-   int ret;
+   u16 reg;
+   int err;
 
if (!phy_is_pseudo_fixed_link(phydev))
return;
 
mutex_lock(&chip->reg_lock);
 
-   ret = _mv88e6xxx_reg_read(chip, REG_PORT(port), PORT_PCS_CTRL);
-   if (ret < 0)
+   err = mv88e6xxx_port_read(chip, port, PORT_PCS_CTRL, ®);
+   if (err < 0)
goto out;
 
-   reg = ret & ~(PORT_PCS_CTRL_LINK_UP |
+   reg = reg & ~(PORT_PCS_CTRL_LINK_UP |
  PORT_PCS_CTRL_FORCE_LINK |
  PORT_PCS_CTRL_DUPLEX_FULL |
  PORT_PCS_CTRL_FORCE_DUPLEX |
@@ -639,7 +680,7 @@ static void mv88e6xxx_adjust_link(struct dsa_switch *ds, 
int port,
reg |= (PORT_PCS_CTRL_RGMII_DELAY_RXCLK |
PORT_PCS_CTRL_RGMII_DELAY_TXCLK);
}
-   _mv88e6xxx_reg_write(chip, REG_PORT(port), PORT_PCS_CTRL, reg);
+   mv88e6xxx_port_write(chip, port, PORT_PCS_CTRL, reg);
 
 out:
mutex_unlock(&chip->reg_lock);
@@ -799,22 +840,22 @@ static uint64_t _mv88e6xxx_get_ethtool_stat(struct 
mv88e6xxx_chip *chip,
 {
u32 low;
u32 high = 0;
-   int ret;
+   int err;
+   u16 reg;
u64 value;
 
switch (s->type) {
case PORT:
-   ret = _mv88e6xxx_reg_read(chip, REG_PORT(port), s->reg);
-   if (ret < 0)
+   err = mv88e6xxx_port_read(chip, port, s->reg, ®);
+   if (err < 0)
return UINT64_MAX;
 
-   low = ret;
+   low = reg;
if (s->sizeof_stat == 4) {
-   ret = _mv88e6xxx_reg_read(chip, REG_PORT(port),
- s->reg + 1);
-   if (ret < 0)
+   err = mv88e6xxx_port_read(chip, port, s->reg + 1, ®);
+   if (err < 0)
return UINT64_MAX;
-   high = ret;
+   high = reg;
}
break;
case BANK0:
@@ -893,6 +934,8 @@ static void mv88e6xxx_get_regs(struct dsa_switch *ds, int 
port,
   struct ethtool_regs *regs, void *_p)
 {
struct mv88e6xxx_chip *chip = ds->priv;
+   int err;
+   u16 reg;
u16 *p = _p;
int i;
 
@@ -903,11 +946,10 @@ static void mv88e6xxx_get_regs(struct dsa_switch *ds, int 
port,
mutex_lock(&chip->reg_lock);
 
for (i = 0; i < 32; i++) {
-   int ret;
 
-   ret = _mv88e6xxx_reg_read(chip, REG_PORT(port), 

[PATCH net-next 2/2] net: dsa: mv88e6xxx: Convert flag bits to unsigned long long

2016-09-19 Thread Andrew Lunn
We are soon going to run out of flag bits on 32bit systems. Convert to
unsigned long long.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 62 +--
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h 
b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index e349d0d64645..827988397fd8 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -452,36 +452,36 @@ enum mv88e6xxx_cap {
 };
 
 /* Bitmask of capabilities */
-#define MV88E6XXX_FLAG_EDSABIT(MV88E6XXX_CAP_EDSA)
-#define MV88E6XXX_FLAG_EEE BIT(MV88E6XXX_CAP_EEE)
-
-#define MV88E6XXX_FLAG_SMI_CMD BIT(MV88E6XXX_CAP_SMI_CMD)
-#define MV88E6XXX_FLAG_SMI_DATABIT(MV88E6XXX_CAP_SMI_DATA)
-
-#define MV88E6XXX_FLAG_PHY_PAGEBIT(MV88E6XXX_CAP_PHY_PAGE)
-
-#define MV88E6XXX_FLAG_SERDES  BIT(MV88E6XXX_CAP_SERDES)
-
-#define MV88E6XXX_FLAG_GLOBAL2 BIT(MV88E6XXX_CAP_GLOBAL2)
-#define MV88E6XXX_FLAG_G2_MGMT_EN_2X   BIT(MV88E6XXX_CAP_G2_MGMT_EN_2X)
-#define MV88E6XXX_FLAG_G2_MGMT_EN_0X   BIT(MV88E6XXX_CAP_G2_MGMT_EN_0X)
-#define MV88E6XXX_FLAG_G2_IRL_CMD  BIT(MV88E6XXX_CAP_G2_IRL_CMD)
-#define MV88E6XXX_FLAG_G2_IRL_DATA BIT(MV88E6XXX_CAP_G2_IRL_DATA)
-#define MV88E6XXX_FLAG_G2_PVT_ADDR BIT(MV88E6XXX_CAP_G2_PVT_ADDR)
-#define MV88E6XXX_FLAG_G2_PVT_DATA BIT(MV88E6XXX_CAP_G2_PVT_DATA)
-#define MV88E6XXX_FLAG_G2_SWITCH_MAC   BIT(MV88E6XXX_CAP_G2_SWITCH_MAC)
-#define MV88E6XXX_FLAG_G2_POT  BIT(MV88E6XXX_CAP_G2_POT)
-#define MV88E6XXX_FLAG_G2_EEPROM_CMD   BIT(MV88E6XXX_CAP_G2_EEPROM_CMD)
-#define MV88E6XXX_FLAG_G2_EEPROM_DATA  BIT(MV88E6XXX_CAP_G2_EEPROM_DATA)
-#define MV88E6XXX_FLAG_G2_SMI_PHY_CMD  BIT(MV88E6XXX_CAP_G2_SMI_PHY_CMD)
-#define MV88E6XXX_FLAG_G2_SMI_PHY_DATA BIT(MV88E6XXX_CAP_G2_SMI_PHY_DATA)
-
-#define MV88E6XXX_FLAG_PPU BIT(MV88E6XXX_CAP_PPU)
-#define MV88E6XXX_FLAG_PPU_ACTIVE  BIT(MV88E6XXX_CAP_PPU_ACTIVE)
-#define MV88E6XXX_FLAG_STU BIT(MV88E6XXX_CAP_STU)
-#define MV88E6XXX_FLAG_TEMPBIT(MV88E6XXX_CAP_TEMP)
-#define MV88E6XXX_FLAG_TEMP_LIMIT  BIT(MV88E6XXX_CAP_TEMP_LIMIT)
-#define MV88E6XXX_FLAG_VTU BIT(MV88E6XXX_CAP_VTU)
+#define MV88E6XXX_FLAG_EDSABIT_ULL(MV88E6XXX_CAP_EDSA)
+#define MV88E6XXX_FLAG_EEE BIT_ULL(MV88E6XXX_CAP_EEE)
+
+#define MV88E6XXX_FLAG_SMI_CMD BIT_ULL(MV88E6XXX_CAP_SMI_CMD)
+#define MV88E6XXX_FLAG_SMI_DATABIT_ULL(MV88E6XXX_CAP_SMI_DATA)
+
+#define MV88E6XXX_FLAG_PHY_PAGEBIT_ULL(MV88E6XXX_CAP_PHY_PAGE)
+
+#define MV88E6XXX_FLAG_SERDES  BIT_ULL(MV88E6XXX_CAP_SERDES)
+
+#define MV88E6XXX_FLAG_GLOBAL2 BIT_ULL(MV88E6XXX_CAP_GLOBAL2)
+#define MV88E6XXX_FLAG_G2_MGMT_EN_2X   BIT_ULL(MV88E6XXX_CAP_G2_MGMT_EN_2X)
+#define MV88E6XXX_FLAG_G2_MGMT_EN_0X   BIT_ULL(MV88E6XXX_CAP_G2_MGMT_EN_0X)
+#define MV88E6XXX_FLAG_G2_IRL_CMD  BIT_ULL(MV88E6XXX_CAP_G2_IRL_CMD)
+#define MV88E6XXX_FLAG_G2_IRL_DATA BIT_ULL(MV88E6XXX_CAP_G2_IRL_DATA)
+#define MV88E6XXX_FLAG_G2_PVT_ADDR BIT_ULL(MV88E6XXX_CAP_G2_PVT_ADDR)
+#define MV88E6XXX_FLAG_G2_PVT_DATA BIT_ULL(MV88E6XXX_CAP_G2_PVT_DATA)
+#define MV88E6XXX_FLAG_G2_SWITCH_MAC   BIT_ULL(MV88E6XXX_CAP_G2_SWITCH_MAC)
+#define MV88E6XXX_FLAG_G2_POT  BIT_ULL(MV88E6XXX_CAP_G2_POT)
+#define MV88E6XXX_FLAG_G2_EEPROM_CMD   BIT_ULL(MV88E6XXX_CAP_G2_EEPROM_CMD)
+#define MV88E6XXX_FLAG_G2_EEPROM_DATA  BIT_ULL(MV88E6XXX_CAP_G2_EEPROM_DATA)
+#define MV88E6XXX_FLAG_G2_SMI_PHY_CMD  BIT_ULL(MV88E6XXX_CAP_G2_SMI_PHY_CMD)
+#define MV88E6XXX_FLAG_G2_SMI_PHY_DATA BIT_ULL(MV88E6XXX_CAP_G2_SMI_PHY_DATA)
+
+#define MV88E6XXX_FLAG_PPU BIT_ULL(MV88E6XXX_CAP_PPU)
+#define MV88E6XXX_FLAG_PPU_ACTIVE  BIT_ULL(MV88E6XXX_CAP_PPU_ACTIVE)
+#define MV88E6XXX_FLAG_STU BIT_ULL(MV88E6XXX_CAP_STU)
+#define MV88E6XXX_FLAG_TEMPBIT_ULL(MV88E6XXX_CAP_TEMP)
+#define MV88E6XXX_FLAG_TEMP_LIMIT  BIT_ULL(MV88E6XXX_CAP_TEMP_LIMIT)
+#define MV88E6XXX_FLAG_VTU BIT_ULL(MV88E6XXX_CAP_VTU)
 
 /* EEPROM Programming via Global2 with 16-bit data */
 #define MV88E6XXX_FLAGS_EEPROM16   \
@@ -614,7 +614,7 @@ struct mv88e6xxx_info {
unsigned int num_ports;
unsigned int port_base_addr;
unsigned int age_time_coeff;
-   unsigned long flags;
+   unsigned long long flags;
 };
 
 struct mv88e6xxx_atu_entry {
-- 
2.9.3



Re: [PATCH] igb: mark igb_rxnfc_write_vlan_prio_filter() static

2016-09-19 Thread Jeff Kirsher
On Sun, 2016-09-18 at 16:50 +0800, Baoyou Xie wrote:
> We get 1 warning when building kernel with W=1:
> drivers/net/ethernet/intel/igb/igb_ethtool.c:2707:5: warning: no previous
> prototype for 'igb_rxnfc_write_vlan_prio_filter' [-Wmissing-prototypes]
> 
> In fact, this function is only used in the file in which it is
> declared and don't need a declaration, but can be made static.
> so this patch marks this function with 'static'.
> 
> Signed-off-by: Baoyou Xie 
> ---
>  drivers/net/ethernet/intel/igb/igb_ethtool.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Already fixed, see http://patchwork.ozlabs.org/patch/661904/



Re: [PATCHv6 net-next 04/15] bpf: don't (ab)use instructions to store state

2016-09-19 Thread Daniel Borkmann

On 09/19/2016 11:48 PM, Jakub Kicinski wrote:

On Mon, 19 Sep 2016 23:44:50 +0200, Daniel Borkmann wrote:

On 09/19/2016 11:36 PM, Jakub Kicinski wrote:

On Mon, 19 Sep 2016 23:03:17 +0200, Daniel Borkmann wrote:

On 09/18/2016 05:09 PM, Jakub Kicinski wrote:

Storing state in reserved fields of instructions makes
it impossible to run verifier on programs already
marked as read-only. Allocate and use an array of
per-instruction state instead.

While touching the error path rename and move existing
jump target.

Suggested-by: Alexei Starovoitov 
Signed-off-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 


I believe there's still an issue here. Could you please double check
and confirm?

I rebased my locally pending stuff on top of your set and suddenly my
test case breaks. So I did a bisect and it pointed me to this commit
eventually.

[...]

@@ -2697,11 +2706,8 @@ static int convert_ctx_accesses(struct verifier_env *env)
else
continue;

-   if (insn->imm != PTR_TO_CTX) {
-   /* clear internal mark */
-   insn->imm = 0;
+   if (env->insn_aux_data[i].ptr_type != PTR_TO_CTX)
continue;
-   }

cnt = env->prog->aux->ops->
convert_ctx_access(type, insn->dst_reg, insn->src_reg,


Looking at the code, I believe the issue is in above snippet. In the
convert_ctx_accesses() rewrite loop, each time we bpf_patch_insn_single()
a program, the program can grow in size (due to __sk_buff access rewrite,
for example). After rewrite, we do 'i += insn_delta' for adjustment to
process next insn.

However, env->insn_aux_data is alloced under the assumption that the
very initial, pre-verification prog->len doesn't change, right? So in
the above conversion access to env->insn_aux_data[i].ptr_type is off,
since after rewrites, corresponding mappings to ptr_type might not be
related anymore.

I noticed this with direct packet access where suddenly the data vs
data_end test failed and contained some "semi-random" value always
bailing out for me.


You are correct.  Should I respin or would you like to post your set? :)


Heh, if you don't mind I would go ahead tonight, the conflict at two spots
when exposing verifier is really minor turns out. Are you okay with this?


Yes, please go ahead :)


Ok, thanks!


What's the plan wrt env->insn_aux_data? Realloc plus rewrite of the array,
or do you see a more straight forward solution?


I was thinking about something like this: (untested)


Yep, much better. :)


Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-19 Thread Sargun Dhillon
On Mon, Sep 19, 2016 at 06:34:28PM +0200, Daniel Mack wrote:
> Hi,
> 
> On 09/16/2016 09:57 PM, Sargun Dhillon wrote:
> > On Wed, Sep 14, 2016 at 01:13:16PM +0200, Daniel Mack wrote:
> 
> >> I have no idea what makes you think this is limited to systemd. As I
> >> said, I provided an example for userspace that works from the command
> >> line. The same limitation apply as for all other users of cgroups.
> >>
> > So, at least in my work, we have Mesos, but on nearly every machine that 
> > Mesos 
> > runs, people also have systemd. Now, there's recently become a bit of a 
> > battle 
> > of ownership of things like cgroups on these machines. We can usually solve 
> > it 
> > by nesting under systemd cgroups, and thus so far we've avoided making too 
> > many 
> > systemd-specific concessions.
> > 
> > The reason this works (mostly), is because everything we touch has a sense 
> > of 
> > nesting, where we can apply policy at a place lower in the hierarchy, and 
> > yet 
> > systemd's monitoring and policy still stays in place. 
> > 
> > Now, with this patch, we don't have that, but I think we can reasonably add 
> > some 
> > flag like "no override" when applying policies, or alternatively something 
> > like 
> > "no new privileges", to prevent children from applying policies that 
> > override 
> > top-level policy.
> 
> Yes, but the API is already guarded by CAP_NET_ADMIN. Take that
> capability away from your children, and they can't tamper with the
> policy. Does that work for you?
> 
No. This can be addressed in a follow-on patch, but the use-case is that I have 
a container orchestrator (Docker, or Mesos), and systemd. The sysadmin controls 
systemd, and Docker is controlled by devs. Typically, the system owner wants 
some system level statistics, and filtering, and then we want to do 
per-container filtering.

We really want to be able to do nesting with userspace tools that are 
oblivious, 
and we want to delegate a level of the cgroup hierarchy to the tool that 
created 
it. I do not see Docker integrating with systemd any time soon, and that's 
really the only other alternative.

> > I realize there is a speed concern as well, but I think for 
> > people who want nested policy, we're willing to make the tradeoff. The cost
> > of traversing a few extra pointers still outweighs the overhead of network
> > namespaces, iptables, etc.. for many of us. 
> 
> Not sure. Have you tried it?
> 
Tried nested policies? Yes. I tried nested policy execution with syscalls, and 
I 
tested with bind and connect. The performance overhead was pretty minimal, but 
latency increased by 100 microseconds+ once the number of BPF hooks increased 
beyond 30. The BPF programs were trivial, and essentially did a map lookup, and 
returned 0.

I don't think that it's just raw cycles / execution time, but I didn't spend 
enough time digging into it to determine the performance hit. I'm waiting
for your patchset to land, and then I plan to work off of it.

> > What do you think Daniel?
> 
> I think we should look at an implementation once we really need it, and
> then revisit the performance impact. In any case, this can be changed
> under the hood, without touching the userspace API (except for adding
> flags if we need them).
> 
+1
> >> Not necessarily. You can as well do it the inetd way, and pass the
> >> socket to a process that is launched on demand, but do SO_ATTACH_FILTER
> >> + SO_LOCK_FILTER  in the middle. What happens with payload on the socket
> >> is not transparent to the launched binary at all. The proposed cgroup
> >> eBPF solution implements a very similar behavior in that regard.
> >
> > It would be nice to be able to see whether or not a filter is attached to a 
> > cgroup, but given this is going through syscalls, at least introspection
> > is possible as opposed to something like netlink.
> 
> Sure, there are many ways. I implemented the bpf cgroup logic using an
> own cgroup controller once, which made it possible to read out the
> status. But as we agreed on attaching programs through the bpf(2) system
> call, I moved back to the implementation that directly stores the
> pointers in the cgroup.
> 
> First enabling the controller through the fs-backed cgroup interface,
> then come back through the bpf(2) syscall and then go back to the fs
> interface to read out status values is a bit weird.
> 
Hrm, that makes sense. With the BPF syscall, would there be a way to get a
file descriptor for the currently attached BPF program?

> >> And FWIW, I agree with Thomas - there is nothing wrong with having
> >> multiple options to use for such use-cases.
> >
> > Right now, for containers, we have netfilter and network namespaces.
> > There's a lot of performance overhead that comes with this.
> 
> Out of curiosity: Could you express that in numbers? And how exactly are
> you testing?
> 
Sure. Our workload that we use as a baseline is Redis with redis-benchmark. We 
reconnect after every connection, and we're running "iso

Re: [PATCHv6 net-next 04/15] bpf: don't (ab)use instructions to store state

2016-09-19 Thread Jakub Kicinski
On Mon, 19 Sep 2016 23:44:50 +0200, Daniel Borkmann wrote:
> On 09/19/2016 11:36 PM, Jakub Kicinski wrote:
> > On Mon, 19 Sep 2016 23:03:17 +0200, Daniel Borkmann wrote:  
> >> On 09/18/2016 05:09 PM, Jakub Kicinski wrote:  
> >>> Storing state in reserved fields of instructions makes
> >>> it impossible to run verifier on programs already
> >>> marked as read-only. Allocate and use an array of
> >>> per-instruction state instead.
> >>>
> >>> While touching the error path rename and move existing
> >>> jump target.
> >>>
> >>> Suggested-by: Alexei Starovoitov 
> >>> Signed-off-by: Jakub Kicinski 
> >>> Acked-by: Alexei Starovoitov 
> >>> Acked-by: Daniel Borkmann   
> >>
> >> I believe there's still an issue here. Could you please double check
> >> and confirm?
> >>
> >> I rebased my locally pending stuff on top of your set and suddenly my
> >> test case breaks. So I did a bisect and it pointed me to this commit
> >> eventually.
> >>
> >> [...]  
> >>> @@ -2697,11 +2706,8 @@ static int convert_ctx_accesses(struct 
> >>> verifier_env *env)
> >>>   else
> >>>   continue;
> >>>
> >>> - if (insn->imm != PTR_TO_CTX) {
> >>> - /* clear internal mark */
> >>> - insn->imm = 0;
> >>> + if (env->insn_aux_data[i].ptr_type != PTR_TO_CTX)
> >>>   continue;
> >>> - }
> >>>
> >>>   cnt = env->prog->aux->ops->
> >>>   convert_ctx_access(type, insn->dst_reg, 
> >>> insn->src_reg,  
> >>
> >> Looking at the code, I believe the issue is in above snippet. In the
> >> convert_ctx_accesses() rewrite loop, each time we bpf_patch_insn_single()
> >> a program, the program can grow in size (due to __sk_buff access rewrite,
> >> for example). After rewrite, we do 'i += insn_delta' for adjustment to
> >> process next insn.
> >>
> >> However, env->insn_aux_data is alloced under the assumption that the
> >> very initial, pre-verification prog->len doesn't change, right? So in
> >> the above conversion access to env->insn_aux_data[i].ptr_type is off,
> >> since after rewrites, corresponding mappings to ptr_type might not be
> >> related anymore.
> >>
> >> I noticed this with direct packet access where suddenly the data vs
> >> data_end test failed and contained some "semi-random" value always
> >> bailing out for me.  
> >
> > You are correct.  Should I respin or would you like to post your set? :)  
> 
> Heh, if you don't mind I would go ahead tonight, the conflict at two spots
> when exposing verifier is really minor turns out. Are you okay with this?

Yes, please go ahead :)
 
> What's the plan wrt env->insn_aux_data? Realloc plus rewrite of the array,
> or do you see a more straight forward solution?

I was thinking about something like this: (untested)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1612f7364c42..5c4cae046251 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2657,13 +2657,13 @@ static int convert_ctx_accesses(struct bpf_verifier_env 
*env)
struct bpf_insn insn_buf[16];
struct bpf_prog *new_prog;
enum bpf_access_type type;
-   int i;
+   int i, delta = 0;
 
if (!env->prog->aux->ops->convert_ctx_access)
return 0;
 
for (i = 0; i < insn_cnt; i++, insn++) {
-   u32 insn_delta, cnt;
+   u32 cnt;
 
if (insn->code == (BPF_LDX | BPF_MEM | BPF_W) ||
insn->code == (BPF_LDX | BPF_MEM | BPF_DW))
@@ -2685,18 +2685,16 @@ static int convert_ctx_accesses(struct bpf_verifier_env 
*env)
return -EINVAL;
}
 
-   new_prog = bpf_patch_insn_single(env->prog, i, insn_buf, cnt);
+   new_prog = bpf_patch_insn_single(env->prog, i + delta, insn_buf,
+cnt);
if (!new_prog)
return -ENOMEM;
 
-   insn_delta = cnt - 1;
+   delta += cnt - 1;
 
/* keep walking new program and skip insns we just inserted */
env->prog = new_prog;
-   insn  = new_prog->insnsi + i + insn_delta;
-
-   insn_cnt += insn_delta;
-   i+= insn_delta;
+   insn  = new_prog->insnsi + i + delta;
}
 
return 0;


Re: [PATCHv6 net-next 04/15] bpf: don't (ab)use instructions to store state

2016-09-19 Thread Daniel Borkmann

On 09/19/2016 11:36 PM, Jakub Kicinski wrote:

On Mon, 19 Sep 2016 23:03:17 +0200, Daniel Borkmann wrote:

On 09/18/2016 05:09 PM, Jakub Kicinski wrote:

Storing state in reserved fields of instructions makes
it impossible to run verifier on programs already
marked as read-only. Allocate and use an array of
per-instruction state instead.

While touching the error path rename and move existing
jump target.

Suggested-by: Alexei Starovoitov 
Signed-off-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 


I believe there's still an issue here. Could you please double check
and confirm?

I rebased my locally pending stuff on top of your set and suddenly my
test case breaks. So I did a bisect and it pointed me to this commit
eventually.

[...]

@@ -2697,11 +2706,8 @@ static int convert_ctx_accesses(struct verifier_env *env)
else
continue;

-   if (insn->imm != PTR_TO_CTX) {
-   /* clear internal mark */
-   insn->imm = 0;
+   if (env->insn_aux_data[i].ptr_type != PTR_TO_CTX)
continue;
-   }

cnt = env->prog->aux->ops->
convert_ctx_access(type, insn->dst_reg, insn->src_reg,


Looking at the code, I believe the issue is in above snippet. In the
convert_ctx_accesses() rewrite loop, each time we bpf_patch_insn_single()
a program, the program can grow in size (due to __sk_buff access rewrite,
for example). After rewrite, we do 'i += insn_delta' for adjustment to
process next insn.

However, env->insn_aux_data is alloced under the assumption that the
very initial, pre-verification prog->len doesn't change, right? So in
the above conversion access to env->insn_aux_data[i].ptr_type is off,
since after rewrites, corresponding mappings to ptr_type might not be
related anymore.

I noticed this with direct packet access where suddenly the data vs
data_end test failed and contained some "semi-random" value always
bailing out for me.


You are correct.  Should I respin or would you like to post your set? :)


Heh, if you don't mind I would go ahead tonight, the conflict at two spots
when exposing verifier is really minor turns out. Are you okay with this?

What's the plan wrt env->insn_aux_data? Realloc plus rewrite of the array,
or do you see a more straight forward solution?

