[PATCH V2 net-next 00/11] add some new features and fix some bugs

2018-01-11 Thread Peng Li
This patchset adds 3 ethtool features: get_channels,
get_coalesce and set_coalesce, and fixes some bugs.

[patch 1/11] adds ethtool_ops.get_channels (ethtool -l) support
for VF.

[patch 2/11] removes the TSO config command from the VF driver,
as only the main PF can configure the TSO MSS length according to
the hardware.

[patch 3/11 - 4/11] add ethtool_ops {get|set}_coalesce
(ethtool -c/-C) support to PF.
[patch 5/11 - 9/11] fix some bugs related to {get|set}_coalesce.

[patch 10/11 - 11/11] fix the features handling in
hns3_nic_set_features(). The local variable "changed" was defined
to indicate which features changed, but was used only for the
NETIF_F_HW_VLAN_CTAG_RX feature. Add checks to improve reliability.

---
Change log:
V1 -> V2:
1. Rewrite the cover letter, as requested by David Miller.
---

Fuyun Liang (7):
  net: hns3: add ethtool_ops.get_coalesce support to PF
  net: hns3: add ethtool_ops.set_coalesce support to PF
  net: hns3: refactor interrupt coalescing init function
  net: hns3: refactor GL update function
  net: hns3: remove unused GL setup function
  net: hns3: change the unit of GL value macro
  net: hns3: add int_gl_idx setup for TX and RX queues

Jian Shen (2):
  net: hns3: add feature check when feature changed
  net: hns3: check for NULL function pointer in hns3_nic_set_features

Peng Li (2):
  net: hns3: add ethtool_ops.get_channels support for VF
  net: hns3: remove TSO config command from VF driver

 drivers/net/ethernet/hisilicon/hns3/hnae3.h        |   7 +
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c    | 148 ++---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h    |  26 ++-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 179 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    |   5 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h   |   8 -
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  |  50 +++---
 7 files changed, 336 insertions(+), 87 deletions(-)

-- 
1.9.1



[PATCH V2 net-next 09/11] net: hns3: add int_gl_idx setup for TX and RX queues

2018-01-11 Thread Peng Li
From: Fuyun Liang 

If int_gl_idx is not set, the default interrupt coalesce index
is 0, so the TX queues and the RX queues will both use GL0 as the
interrupt coalesce GL switch. But it should be GL1 for TX queues
and GL0 for RX queues.

This patch adds the int_gl_idx setup for TX queues and RX queues.
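
As a standalone illustration (not driver code), this is what the
hnae_set_field()/hnae_get_field() helpers used below do with the
2-bit GL index field; the macro values are the ones added by this
patch:

	u32 field = 0;

	/* encode: select GL1 for a TX ring */
	field &= ~HNAE3_RING_GL_IDX_M;
	field |= (HNAE3_RING_GL_TX << HNAE3_RING_GL_IDX_S) &
		 HNAE3_RING_GL_IDX_M;

	/* decode: (field & HNAE3_RING_GL_IDX_M) >> HNAE3_RING_GL_IDX_S
	 * yields 1, i.e. HNAE3_RING_GL_TX (GL1)
	 */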

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 
SoC")
Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h |  5 +
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 11 +++
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c |  5 +
 3 files changed, 21 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 0bad0e3..634e932 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -133,11 +133,16 @@ struct hnae3_vector_info {
 #define HNAE3_RING_TYPE_B 0
 #define HNAE3_RING_TYPE_TX 0
 #define HNAE3_RING_TYPE_RX 1
+#define HNAE3_RING_GL_IDX_S 0
+#define HNAE3_RING_GL_IDX_M GENMASK(1, 0)
+#define HNAE3_RING_GL_RX 0
+#define HNAE3_RING_GL_TX 1
 
 struct hnae3_ring_chain_node {
struct hnae3_ring_chain_node *next;
u32 tqp_index;
u32 flag;
+   u32 int_gl_idx;
 };
 
 #define HNAE3_IS_TX_RING(node) \
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 2e9e61c..34879c4 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2523,6 +2523,8 @@ static int hns3_get_vector_ring_chain(struct 
hns3_enet_tqp_vector *tqp_vector,
cur_chain->tqp_index = tx_ring->tqp->tqp_index;
hnae_set_bit(cur_chain->flag, HNAE3_RING_TYPE_B,
 HNAE3_RING_TYPE_TX);
+   hnae_set_field(cur_chain->int_gl_idx, HNAE3_RING_GL_IDX_M,
+  HNAE3_RING_GL_IDX_S, HNAE3_RING_GL_TX);
 
cur_chain->next = NULL;
 
@@ -2538,6 +2540,10 @@ static int hns3_get_vector_ring_chain(struct 
hns3_enet_tqp_vector *tqp_vector,
chain->tqp_index = tx_ring->tqp->tqp_index;
hnae_set_bit(chain->flag, HNAE3_RING_TYPE_B,
 HNAE3_RING_TYPE_TX);
+   hnae_set_field(chain->int_gl_idx,
+  HNAE3_RING_GL_IDX_M,
+  HNAE3_RING_GL_IDX_S,
+  HNAE3_RING_GL_TX);
 
cur_chain = chain;
}
@@ -2549,6 +2555,8 @@ static int hns3_get_vector_ring_chain(struct 
hns3_enet_tqp_vector *tqp_vector,
cur_chain->tqp_index = rx_ring->tqp->tqp_index;
hnae_set_bit(cur_chain->flag, HNAE3_RING_TYPE_B,
 HNAE3_RING_TYPE_RX);
+   hnae_set_field(cur_chain->int_gl_idx, HNAE3_RING_GL_IDX_M,
+  HNAE3_RING_GL_IDX_S, HNAE3_RING_GL_RX);
 
rx_ring = rx_ring->next;
}
@@ -2562,6 +2570,9 @@ static int hns3_get_vector_ring_chain(struct 
hns3_enet_tqp_vector *tqp_vector,
chain->tqp_index = rx_ring->tqp->tqp_index;
hnae_set_bit(chain->flag, HNAE3_RING_TYPE_B,
 HNAE3_RING_TYPE_RX);
+   hnae_set_field(chain->int_gl_idx, HNAE3_RING_GL_IDX_M,
+  HNAE3_RING_GL_IDX_S, HNAE3_RING_GL_RX);
+
cur_chain = chain;
 
rx_ring = rx_ring->next;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index d7352f5..27f0ab6 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -3409,6 +3409,11 @@ int hclge_bind_ring_with_vector(struct hclge_vport 
*vport,
   hnae_get_bit(node->flag, HNAE3_RING_TYPE_B));
hnae_set_field(tqp_type_and_id, HCLGE_TQP_ID_M,
   HCLGE_TQP_ID_S, node->tqp_index);
+   hnae_set_field(tqp_type_and_id, HCLGE_INT_GL_IDX_M,
+  HCLGE_INT_GL_IDX_S,
+  hnae_get_field(node->int_gl_idx,
+ HNAE3_RING_GL_IDX_M,
+ HNAE3_RING_GL_IDX_S));
req->tqp_type_and_id[i] = cpu_to_le16(tqp_type_and_id);
if (++i >= HCLGE_VECTOR_ELEMENTS_PER_CMD) {
req->int_cause_num = HCLGE_VECTOR_ELEMENTS_PER_CMD;
-- 
1.9.1



[PATCH V2 net-next 01/11] net: hns3: add ethtool_ops.get_channels support for VF

2018-01-11 Thread Peng Li
This patch adds ethtool_ops.get_channels() (ethtool -l) support for the VF driver.
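
For context, the max_combined value reported below is
min(rss_size_max * num_tc, num_tqps). A worked example with assumed
values (illustrative only, not taken from real hardware):

	/* rss_size_max = 16, num_tc = 4, num_tqps = 32
	 * ethtool -l would then report:
	 *   max_combined   = min(16 * 4, 32) = 32
	 *   combined_count = num_tqps        = 32
	 */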

Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  1 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 30 ++
 2 files changed, 31 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index d3cb3ec..f44336c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -900,6 +900,7 @@ static void hns3_get_channels(struct net_device *netdev,
.get_rxfh = hns3_get_rss,
.set_rxfh = hns3_set_rss,
.get_link_ksettings = hns3_get_link_ksettings,
+   .get_channels = hns3_get_channels,
 };
 
 static const struct ethtool_ops hns3_ethtool_ops = {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 655f522..5f9afa6 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1433,6 +1433,35 @@ static void hclgevf_uninit_ae_dev(struct hnae3_ae_dev 
*ae_dev)
ae_dev->priv = NULL;
 }
 
+static u32 hclgevf_get_max_channels(struct hclgevf_dev *hdev)
+{
+   struct hnae3_handle *nic = &hdev->nic;
+   struct hnae3_knic_private_info *kinfo = &nic->kinfo;
+
+   return min_t(u32, hdev->rss_size_max * kinfo->num_tc, hdev->num_tqps);
+}
+
+/**
+ * hclgevf_get_channels - Get the current channels enabled and max supported.
+ * @handle: hardware information for network interface
+ * @ch: ethtool channels structure
+ *
+ * We don't support separate tx and rx queues as channels. The other count
+ * represents how many queues are being used for control. max_combined counts
+ * how many queue pairs we can support. They may not be mapped 1 to 1 with
+ * q_vectors since we support a lot more queue pairs than q_vectors.
+ **/
+static void hclgevf_get_channels(struct hnae3_handle *handle,
+struct ethtool_channels *ch)
+{
+   struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
+
+   ch->max_combined = hclgevf_get_max_channels(hdev);
+   ch->other_count = 0;
+   ch->max_other = 0;
+   ch->combined_count = hdev->num_tqps;
+}
+
 static const struct hnae3_ae_ops hclgevf_ops = {
.init_ae_dev = hclgevf_init_ae_dev,
.uninit_ae_dev = hclgevf_uninit_ae_dev,
@@ -1462,6 +1491,7 @@ static void hclgevf_uninit_ae_dev(struct hnae3_ae_dev 
*ae_dev)
.get_tc_size = hclgevf_get_tc_size,
.get_fw_version = hclgevf_get_fw_version,
.set_vlan_filter = hclgevf_set_vlan_filter,
+   .get_channels = hclgevf_get_channels,
 };
 
 static struct hnae3_ae_algo ae_algovf = {
-- 
1.9.1



[PATCH V2 net-next 05/11] net: hns3: refactor interrupt coalescing init function

2018-01-11 Thread Peng Li
From: Fuyun Liang 

In the hardware, the configurable coalesce registers are GL0, GL1
and GL2. In the driver, the TX queues use register GL1 and the RX
queues use register GL0. This function initializes the interrupt
coalescing configuration, but does not distinguish between the TX
direction and the RX direction, which causes confusion.

This patch refactors the function to initialize the TX GL and the
RX GL separately. The initialization of the related variables is
also added in this patch.
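
The register-to-direction mapping assumed throughout this series (see
the hns3_set_vector_coalesce_{tx,rx}_gl() calls below):

	/* GL0 (HNS3_VECTOR_GL0_OFFSET) <- RX queues
	 * GL1 (HNS3_VECTOR_GL1_OFFSET) <- TX queues
	 * GL2 (HNS3_VECTOR_GL2_OFFSET) <- not used for per-direction setup
	 */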

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 29 +
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 32c9f88..59d8d9f 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -206,21 +206,32 @@ void hns3_set_vector_coalesce_tx_gl(struct 
hns3_enet_tqp_vector *tqp_vector,
writel(tx_gl_reg, tqp_vector->mask_addr + HNS3_VECTOR_GL1_OFFSET);
 }
 
-static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector)
+static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector,
+  struct hns3_nic_priv *priv)
 {
+   struct hnae3_handle *h = priv->ae_handle;
+
/* initialize the configuration for interrupt coalescing.
 * 1. GL (Interrupt Gap Limiter)
 * 2. RL (Interrupt Rate Limiter)
 */
 
-   /* Default :enable interrupt coalesce */
-   tqp_vector->rx_group.int_gl = HNS3_INT_GL_50K;
+   /* Default: enable interrupt coalescing self-adaptive and GL */
+   tqp_vector->tx_group.gl_adapt_enable = 1;
+   tqp_vector->rx_group.gl_adapt_enable = 1;
+
tqp_vector->tx_group.int_gl = HNS3_INT_GL_50K;
-   hns3_set_vector_coalesc_gl(tqp_vector, HNS3_INT_GL_50K);
-   /* for now we are disabling Interrupt RL - we
-* will re-enable later
-*/
-   hns3_set_vector_coalesce_rl(tqp_vector, 0);
+   tqp_vector->rx_group.int_gl = HNS3_INT_GL_50K;
+
+   hns3_set_vector_coalesce_tx_gl(tqp_vector,
+  tqp_vector->tx_group.int_gl);
+   hns3_set_vector_coalesce_rx_gl(tqp_vector,
+  tqp_vector->rx_group.int_gl);
+
+   /* Default: disable RL */
+   h->kinfo.int_rl_setting = 0;
+   hns3_set_vector_coalesce_rl(tqp_vector, h->kinfo.int_rl_setting);
+
tqp_vector->rx_group.flow_level = HNS3_FLOW_LOW;
tqp_vector->tx_group.flow_level = HNS3_FLOW_LOW;
 }
@@ -2654,7 +2665,7 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv 
*priv)
tqp_vector->rx_group.total_packets = 0;
tqp_vector->tx_group.total_bytes = 0;
tqp_vector->tx_group.total_packets = 0;
-   hns3_vector_gl_rl_init(tqp_vector);
+   hns3_vector_gl_rl_init(tqp_vector, priv);
tqp_vector->handle = h;
 
ret = hns3_get_vector_ring_chain(tqp_vector,
-- 
1.9.1



[PATCH V2 net-next 04/11] net: hns3: add ethtool_ops.set_coalesce support to PF

2018-01-11 Thread Peng Li
From: Fuyun Liang 

This patch adds ethtool_ops.set_coalesce support to PF.

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c    |  34 -
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h    |  17 +++
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 141 +
 3 files changed, 188 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 14c7625..32c9f88 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -170,14 +170,40 @@ static void hns3_set_vector_coalesc_gl(struct 
hns3_enet_tqp_vector *tqp_vector,
writel(gl_value, tqp_vector->mask_addr + HNS3_VECTOR_GL2_OFFSET);
 }
 
-static void hns3_set_vector_coalesc_rl(struct hns3_enet_tqp_vector *tqp_vector,
-  u32 rl_value)
+void hns3_set_vector_coalesce_rl(struct hns3_enet_tqp_vector *tqp_vector,
+u32 rl_value)
 {
+   u32 rl_reg = hns3_rl_usec_to_reg(rl_value);
+
/* this defines the configuration for RL (Interrupt Rate Limiter).
 * Rl defines rate of interrupts i.e. number of interrupts-per-second
 * GL and RL(Rate Limiter) are 2 ways to achieve interrupt coalescing
 */
-   writel(rl_value, tqp_vector->mask_addr + HNS3_VECTOR_RL_OFFSET);
+
+   if (rl_reg > 0 && !tqp_vector->tx_group.gl_adapt_enable &&
+   !tqp_vector->rx_group.gl_adapt_enable)
+   /* According to the hardware, the range of rl_reg is
+* 0-59 and the unit is 4.
+*/
+   rl_reg |=  HNS3_INT_RL_ENABLE_MASK;
+
+   writel(rl_reg, tqp_vector->mask_addr + HNS3_VECTOR_RL_OFFSET);
+}
+
+void hns3_set_vector_coalesce_rx_gl(struct hns3_enet_tqp_vector *tqp_vector,
+   u32 gl_value)
+{
+   u32 rx_gl_reg = hns3_gl_usec_to_reg(gl_value);
+
+   writel(rx_gl_reg, tqp_vector->mask_addr + HNS3_VECTOR_GL0_OFFSET);
+}
+
+void hns3_set_vector_coalesce_tx_gl(struct hns3_enet_tqp_vector *tqp_vector,
+   u32 gl_value)
+{
+   u32 tx_gl_reg = hns3_gl_usec_to_reg(gl_value);
+
+   writel(tx_gl_reg, tqp_vector->mask_addr + HNS3_VECTOR_GL1_OFFSET);
 }
 
 static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector)
@@ -194,7 +220,7 @@ static void hns3_vector_gl_rl_init(struct 
hns3_enet_tqp_vector *tqp_vector)
/* for now we are disabling Interrupt RL - we
 * will re-enable later
 */
-   hns3_set_vector_coalesc_rl(tqp_vector, 0);
+   hns3_set_vector_coalesce_rl(tqp_vector, 0);
tqp_vector->rx_group.flow_level = HNS3_FLOW_LOW;
tqp_vector->tx_group.flow_level = HNS3_FLOW_LOW;
 }
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 24f6109..7adbda8 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -451,11 +451,15 @@ enum hns3_link_mode_bits {
HNS3_LM_COUNT = 15
 };
 
+#define HNS3_INT_GL_MAX			0x1FE0
 #define HNS3_INT_GL_50K			0x000A
 #define HNS3_INT_GL_20K			0x0019
 #define HNS3_INT_GL_18K			0x001B
 #define HNS3_INT_GL_8K			0x003E
 
+#define HNS3_INT_RL_MAX			0x00EC
+#define HNS3_INT_RL_ENABLE_MASK		0x40
+
 struct hns3_enet_ring_group {
/* array of pointers to rings */
struct hns3_enet_ring *ring;
@@ -595,6 +599,12 @@ static inline void hns3_write_reg(void __iomem *base, u32 
reg, u32 value)
 #define hns3_get_handle(ndev) \
(((struct hns3_nic_priv *)netdev_priv(ndev))->ae_handle)
 
+#define hns3_gl_usec_to_reg(int_gl) (int_gl >> 1)
+#define hns3_gl_round_down(int_gl) round_down(int_gl, 2)
+
+#define hns3_rl_usec_to_reg(int_rl) (int_rl >> 2)
+#define hns3_rl_round_down(int_rl) round_down(int_rl, 4)
+
 void hns3_ethtool_set_ops(struct net_device *netdev);
 int hns3_set_channels(struct net_device *netdev,
  struct ethtool_channels *ch);
@@ -607,6 +617,13 @@ int hns3_clean_rx_ring(
struct hns3_enet_ring *ring, int budget,
void (*rx_fn)(struct hns3_enet_ring *, struct sk_buff *));
 
+void hns3_set_vector_coalesce_rx_gl(struct hns3_enet_tqp_vector *tqp_vector,
+   u32 gl_value);
+void hns3_set_vector_coalesce_tx_gl(struct hns3_enet_tqp_vector *tqp_vector,
+   u32 gl_value);
+void hns3_set_vector_coalesce_rl(struct hns3_enet_tqp_vector *tqp_vector,
+u32 rl_value);
+
 #ifdef CONFIG_HNS3_DCB
 void hns3_dcbnl_setup(struct hnae3_handle *handle);
 #else
diff --git 

[PATCH V2 net-next 03/11] net: hns3: add ethtool_ops.get_coalesce support to PF

2018-01-11 Thread Peng Li
From: Fuyun Liang 

This patch adds ethtool_ops.get_coalesce support to PF.

Whilst our hardware supports per queue values, external interfaces
support only a single shared value. As such we use the values for
queue 0.
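
In practice this means ethtool reads back whatever queue 0's vectors
hold. A sketch of the resulting mapping (assuming all queues are
programmed identically via ethtool -C):

	/* ethtool -c ethX -> hns3_get_coalesce() -> queue 0:
	 *   tx_coalesce_usecs     <- tx_vector->tx_group.int_gl
	 *   rx_coalesce_usecs     <- rx_vector->rx_group.int_gl
	 *   *_coalesce_usecs_high <- kinfo.int_rl_setting (shared RL)
	 */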

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h        |  2 ++
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h    |  1 +
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 37 ++
 3 files changed, 40 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index adec88d..0bad0e3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -448,6 +448,8 @@ struct hnae3_knic_private_info {
u16 num_tqps; /* total number of TQPs in this handle */
struct hnae3_queue **tqp;  /* array base of all TQPs in this instance */
const struct hnae3_dcb_ops *dcb_ops;
+
+   u16 int_rl_setting;
 };
 
 struct hnae3_roce_private_info {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index a2a7ea3..24f6109 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -464,6 +464,7 @@ struct hns3_enet_ring_group {
u16 count;
enum hns3_flow_level_range flow_level;
u16 int_gl;
+   u8 gl_adapt_enable;
 };
 
 struct hns3_enet_tqp_vector {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index f44336c..81b4b3b 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -887,6 +887,42 @@ static void hns3_get_channels(struct net_device *netdev,
h->ae_algo->ops->get_channels(h, ch);
 }
 
+static int hns3_get_coalesce_per_queue(struct net_device *netdev, u32 queue,
+  struct ethtool_coalesce *cmd)
+{
+   struct hns3_enet_tqp_vector *tx_vector, *rx_vector;
+   struct hns3_nic_priv *priv = netdev_priv(netdev);
+   struct hnae3_handle *h = priv->ae_handle;
+   u16 queue_num = h->kinfo.num_tqps;
+
+   if (queue >= queue_num) {
+   netdev_err(netdev,
+  "Invalid queue value %d! Queue max id=%d\n",
+  queue, queue_num - 1);
+   return -EINVAL;
+   }
+
+   tx_vector = priv->ring_data[queue].ring->tqp_vector;
+   rx_vector = priv->ring_data[queue_num + queue].ring->tqp_vector;
+
+   cmd->use_adaptive_tx_coalesce = tx_vector->tx_group.gl_adapt_enable;
+   cmd->use_adaptive_rx_coalesce = rx_vector->rx_group.gl_adapt_enable;
+
+   cmd->tx_coalesce_usecs = tx_vector->tx_group.int_gl;
+   cmd->rx_coalesce_usecs = rx_vector->rx_group.int_gl;
+
+   cmd->tx_coalesce_usecs_high = h->kinfo.int_rl_setting;
+   cmd->rx_coalesce_usecs_high = h->kinfo.int_rl_setting;
+
+   return 0;
+}
+
+static int hns3_get_coalesce(struct net_device *netdev,
+struct ethtool_coalesce *cmd)
+{
+   return hns3_get_coalesce_per_queue(netdev, 0, cmd);
+}
+
 static const struct ethtool_ops hns3vf_ethtool_ops = {
.get_drvinfo = hns3_get_drvinfo,
.get_ringparam = hns3_get_ringparam,
@@ -925,6 +961,7 @@ static void hns3_get_channels(struct net_device *netdev,
.nway_reset = hns3_nway_reset,
.get_channels = hns3_get_channels,
.set_channels = hns3_set_channels,
+   .get_coalesce = hns3_get_coalesce,
 };
 
 void hns3_ethtool_set_ops(struct net_device *netdev)
-- 
1.9.1



[PATCH V2 net-next 11/11] net: hns3: check for NULL function pointer in hns3_nic_set_features

2018-01-11 Thread Peng Li
From: Jian Shen 

It is necessary to check whether a hook is defined before calling
it, to improve reliability.

Signed-off-by: Jian Shen 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index a7ae4f3..ac84816 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1133,14 +1133,16 @@ static int hns3_nic_set_features(struct net_device 
*netdev,
}
}
 
-   if (changed & NETIF_F_HW_VLAN_CTAG_FILTER) {
+   if ((changed & NETIF_F_HW_VLAN_CTAG_FILTER) &&
+   h->ae_algo->ops->enable_vlan_filter) {
if (features & NETIF_F_HW_VLAN_CTAG_FILTER)
h->ae_algo->ops->enable_vlan_filter(h, true);
else
h->ae_algo->ops->enable_vlan_filter(h, false);
}
 
-   if (changed & NETIF_F_HW_VLAN_CTAG_RX) {
+   if ((changed & NETIF_F_HW_VLAN_CTAG_RX) &&
+   h->ae_algo->ops->enable_hw_strip_rxvtag) {
if (features & NETIF_F_HW_VLAN_CTAG_RX)
ret = h->ae_algo->ops->enable_hw_strip_rxvtag(h, true);
else
-- 
1.9.1



[PATCH V2 net-next 08/11] net: hns3: change the unit of GL value macro

2018-01-11 Thread Peng Li
From: Fuyun Liang 

Previously, the driver used 2us as the GL unit, but the time unit
the ethtool "-c" and "-C" commands use is 1us, so the GL unit the
driver actually uses is 1us.

This patch changes the unit of the GL value macros from 2us to 1us.
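
A worked check of the 50K entry (my arithmetic, not part of the patch):

	/* 50K interrupts/s => 1 / 50000 s = 20 us between interrupts.
	 * old macro, 2 us units: 20 / 2 = 10 = 0x000A
	 * new macro, 1 us units: 20 / 1 = 20 = 0x0014
	 * hns3_gl_usec_to_reg() (int_gl >> 1) converts the 1 us value
	 * back to the hardware's 2 us register units when writing.
	 */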

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 7adbda8..213f501 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -452,10 +452,10 @@ enum hns3_link_mode_bits {
 };
 
 #define HNS3_INT_GL_MAX			0x1FE0
-#define HNS3_INT_GL_50K			0x000A
-#define HNS3_INT_GL_20K			0x0019
-#define HNS3_INT_GL_18K			0x001B
-#define HNS3_INT_GL_8K			0x003E
+#define HNS3_INT_GL_50K			0x0014
+#define HNS3_INT_GL_20K			0x0032
+#define HNS3_INT_GL_18K			0x0036
+#define HNS3_INT_GL_8K			0x007C
 
 #define HNS3_INT_RL_MAX			0x00EC
 #define HNS3_INT_RL_ENABLE_MASK		0x40
-- 
1.9.1



[PATCH net-next v5 1/4] phy: add 2.5G SGMII mode to the phy_mode enum

2018-01-11 Thread Antoine Tenart
This patch adds one more generic PHY mode to the phy_mode enum, to allow
configuring generic PHYs to the 2.5G SGMII mode by using the set_mode
callback.

Signed-off-by: Antoine Tenart 
---
 include/linux/phy/phy.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/phy/phy.h b/include/linux/phy/phy.h
index 4f8423a948d5..5a80e9de3686 100644
--- a/include/linux/phy/phy.h
+++ b/include/linux/phy/phy.h
@@ -28,6 +28,7 @@ enum phy_mode {
PHY_MODE_USB_DEVICE,
PHY_MODE_USB_OTG,
PHY_MODE_SGMII,
+   PHY_MODE_2500SGMII,
PHY_MODE_10GKR,
PHY_MODE_UFS_HS_A,
PHY_MODE_UFS_HS_B,
-- 
2.14.3



[PATCH net-next v5 4/4] net: mvpp2: 2500baseX support

2018-01-11 Thread Antoine Tenart
This patch adds the 2500Base-X PHY mode support in the Marvell PPv2
driver. 2500Base-X is quite close to 1000Base-X and SGMII modes and uses
nearly the same code path.
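
The net-to-generic-PHY mode mapping this series ends up with in
mvpp22_comphy_init():

	/* PHY_INTERFACE_MODE_SGMII     -> PHY_MODE_SGMII
	 * PHY_INTERFACE_MODE_1000BASEX -> PHY_MODE_SGMII
	 * PHY_INTERFACE_MODE_2500BASEX -> PHY_MODE_2500SGMII
	 * PHY_INTERFACE_MODE_10GKR     -> PHY_MODE_10GKR
	 */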

Signed-off-by: Antoine Tenart 
Reviewed-by: Andrew Lunn 
---
 drivers/net/ethernet/marvell/mvpp2.c | 49 
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index 257a6b99b4ca..38f9a79481c6 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -4502,6 +4502,7 @@ static int mvpp22_gop_init(struct mvpp2_port *port)
break;
case PHY_INTERFACE_MODE_SGMII:
case PHY_INTERFACE_MODE_1000BASEX:
+   case PHY_INTERFACE_MODE_2500BASEX:
mvpp22_gop_init_sgmii(port);
break;
case PHY_INTERFACE_MODE_10GKR:
@@ -4540,7 +4541,8 @@ static void mvpp22_gop_unmask_irq(struct mvpp2_port *port)
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
+   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
/* Enable the GMAC link status irq for this port */
val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
val |= MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
@@ -4571,7 +4573,8 @@ static void mvpp22_gop_mask_irq(struct mvpp2_port *port)
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
+   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
val &= ~MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
writel(val, port->base + MVPP22_GMAC_INT_SUM_MASK);
@@ -4584,7 +4587,8 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
+   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
val = readl(port->base + MVPP22_GMAC_INT_MASK);
val |= MVPP22_GMAC_INT_MASK_LINK_STAT;
writel(val, port->base + MVPP22_GMAC_INT_MASK);
@@ -4599,6 +4603,16 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
mvpp22_gop_unmask_irq(port);
 }
 
+/* Sets the PHY mode of the COMPHY (which configures the serdes lanes).
+ *
+ * The PHY mode used by the PPv2 driver comes from the network subsystem, while
+ * the one given to the COMPHY comes from the generic PHY subsystem. Hence they
+ * differ.
+ *
+ * The COMPHY configures the serdes lanes regardless of the actual use of the
+ * lanes by the physical layer. This is why configurations like
+ * "PPv2 (2500BaseX) - COMPHY (2500SGMII)" are valid.
+ */
 static int mvpp22_comphy_init(struct mvpp2_port *port)
 {
enum phy_mode mode;
@@ -4612,6 +4626,9 @@ static int mvpp22_comphy_init(struct mvpp2_port *port)
case PHY_INTERFACE_MODE_1000BASEX:
mode = PHY_MODE_SGMII;
break;
+   case PHY_INTERFACE_MODE_2500BASEX:
+   mode = PHY_MODE_2500SGMII;
+   break;
case PHY_INTERFACE_MODE_10GKR:
mode = PHY_MODE_10GKR;
break;
@@ -4631,7 +4648,8 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
u32 val;
 
if (port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
+   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
val = readl(port->base + MVPP22_GMAC_CTRL_4_REG);
val |= MVPP22_CTRL4_SYNC_BYPASS_DIS | MVPP22_CTRL4_DP_CLK_SEL |
   MVPP22_CTRL4_QSGMII_BYPASS_ACTIVE;
@@ -4647,7 +4665,8 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
}
 
val = readl(port->base + MVPP2_GMAC_CTRL_0_REG);
-   if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX)
+   if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
+   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX)
val |= MVPP2_GMAC_PORT_TYPE_MASK;
else
val &= ~MVPP2_GMAC_PORT_TYPE_MASK;
@@ -4660,7 +4679,13 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
if 

[PATCH net-next v5 2/4] phy: cp110-comphy: 2.5G SGMII mode

2018-01-11 Thread Antoine Tenart
This patch allows the CP110 COMPHY to configure some lanes in the
2.5G SGMII mode. This mode is quite close to SGMII and uses nearly
the same code path.
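
The serdes generation values used per mode, as visible in the
mvebu_comphy_ethernet_init_reset() hunk of this diff:

	/* PHY_MODE_2500SGMII: GEN_RX/GEN_TX = 0x8, plus HALF_BUS
	 * PHY_MODE_SGMII:     GEN_RX/GEN_TX = 0x6
	 * PHY_MODE_10GKR:     GEN_RX/GEN_TX = 0xe
	 */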

Signed-off-by: Antoine Tenart 
---
 drivers/phy/marvell/phy-mvebu-cp110-comphy.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/phy/marvell/phy-mvebu-cp110-comphy.c 
b/drivers/phy/marvell/phy-mvebu-cp110-comphy.c
index a0d522154cdf..4ef429250d7b 100644
--- a/drivers/phy/marvell/phy-mvebu-cp110-comphy.c
+++ b/drivers/phy/marvell/phy-mvebu-cp110-comphy.c
@@ -135,19 +135,25 @@ struct mvebu_comhy_conf {
 static const struct mvebu_comhy_conf mvebu_comphy_cp110_modes[] = {
/* lane 0 */
MVEBU_COMPHY_CONF(0, 1, PHY_MODE_SGMII, 0x1),
+   MVEBU_COMPHY_CONF(0, 1, PHY_MODE_2500SGMII, 0x1),
/* lane 1 */
MVEBU_COMPHY_CONF(1, 2, PHY_MODE_SGMII, 0x1),
+   MVEBU_COMPHY_CONF(1, 2, PHY_MODE_2500SGMII, 0x1),
/* lane 2 */
MVEBU_COMPHY_CONF(2, 0, PHY_MODE_SGMII, 0x1),
+   MVEBU_COMPHY_CONF(2, 0, PHY_MODE_2500SGMII, 0x1),
MVEBU_COMPHY_CONF(2, 0, PHY_MODE_10GKR, 0x1),
/* lane 3 */
MVEBU_COMPHY_CONF(3, 1, PHY_MODE_SGMII, 0x2),
+   MVEBU_COMPHY_CONF(3, 1, PHY_MODE_2500SGMII, 0x2),
/* lane 4 */
MVEBU_COMPHY_CONF(4, 0, PHY_MODE_SGMII, 0x2),
+   MVEBU_COMPHY_CONF(4, 0, PHY_MODE_2500SGMII, 0x2),
MVEBU_COMPHY_CONF(4, 0, PHY_MODE_10GKR, 0x2),
MVEBU_COMPHY_CONF(4, 1, PHY_MODE_SGMII, 0x1),
/* lane 5 */
MVEBU_COMPHY_CONF(5, 2, PHY_MODE_SGMII, 0x1),
+   MVEBU_COMPHY_CONF(5, 2, PHY_MODE_2500SGMII, 0x1),
 };
 
 struct mvebu_comphy_priv {
@@ -206,6 +212,10 @@ static void mvebu_comphy_ethernet_init_reset(struct 
mvebu_comphy_lane *lane,
if (mode == PHY_MODE_10GKR)
val |= MVEBU_COMPHY_SERDES_CFG0_GEN_RX(0xe) |
   MVEBU_COMPHY_SERDES_CFG0_GEN_TX(0xe);
+   else if (mode == PHY_MODE_2500SGMII)
+   val |= MVEBU_COMPHY_SERDES_CFG0_GEN_RX(0x8) |
+  MVEBU_COMPHY_SERDES_CFG0_GEN_TX(0x8) |
+  MVEBU_COMPHY_SERDES_CFG0_HALF_BUS;
else if (mode == PHY_MODE_SGMII)
val |= MVEBU_COMPHY_SERDES_CFG0_GEN_RX(0x6) |
   MVEBU_COMPHY_SERDES_CFG0_GEN_TX(0x6) |
@@ -296,13 +306,13 @@ static int mvebu_comphy_init_plls(struct 
mvebu_comphy_lane *lane,
return 0;
 }
 
-static int mvebu_comphy_set_mode_sgmii(struct phy *phy)
+static int mvebu_comphy_set_mode_sgmii(struct phy *phy, enum phy_mode mode)
 {
struct mvebu_comphy_lane *lane = phy_get_drvdata(phy);
struct mvebu_comphy_priv *priv = lane->priv;
u32 val;
 
-   mvebu_comphy_ethernet_init_reset(lane, PHY_MODE_SGMII);
+   mvebu_comphy_ethernet_init_reset(lane, mode);
 
val = readl(priv->base + MVEBU_COMPHY_RX_CTRL1(lane->id));
val &= ~MVEBU_COMPHY_RX_CTRL1_CLK8T_EN;
@@ -487,7 +497,8 @@ static int mvebu_comphy_power_on(struct phy *phy)
 
switch (lane->mode) {
case PHY_MODE_SGMII:
-   ret = mvebu_comphy_set_mode_sgmii(phy);
+   case PHY_MODE_2500SGMII:
+   ret = mvebu_comphy_set_mode_sgmii(phy, lane->mode);
break;
case PHY_MODE_10GKR:
ret = mvebu_comphy_set_mode_10gkr(phy);
-- 
2.14.3



[PATCH net-next v5 3/4] net: mvpp2: 1000baseX support

2018-01-11 Thread Antoine Tenart
This patch adds the 1000Base-X PHY mode support in the Marvell PPv2
driver. 1000Base-X is quite close to SGMII and uses nearly the same
code path.
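
What 1000Base-X changes relative to SGMII in this patch (taken from
the mvpp2_port_mii_gmac_configure_mode() hunks below):

	/* - MVPP2_GMAC_PORT_TYPE_MASK is set (1000Base-X port type)
	 * - speed/duplex are forced via CONFIG_GMII_SPEED and
	 *   CONFIG_FULL_DUPLEX instead of AN_SPEED_EN/AN_DUPLEX_EN,
	 *   since 1000Base-X cannot negotiate them
	 */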

Signed-off-by: Antoine Tenart 
---
 drivers/net/ethernet/marvell/mvpp2.c | 45 
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index a19760736b71..257a6b99b4ca 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -4501,6 +4501,7 @@ static int mvpp22_gop_init(struct mvpp2_port *port)
mvpp22_gop_init_rgmii(port);
break;
case PHY_INTERFACE_MODE_SGMII:
+   case PHY_INTERFACE_MODE_1000BASEX:
mvpp22_gop_init_sgmii(port);
break;
case PHY_INTERFACE_MODE_10GKR:
@@ -4538,7 +4539,8 @@ static void mvpp22_gop_unmask_irq(struct mvpp2_port *port)
u32 val;
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-   port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+   port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
/* Enable the GMAC link status irq for this port */
val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
val |= MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
@@ -4568,7 +4570,8 @@ static void mvpp22_gop_mask_irq(struct mvpp2_port *port)
}
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-   port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+   port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
val &= ~MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
writel(val, port->base + MVPP22_GMAC_INT_SUM_MASK);
@@ -4580,7 +4583,8 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
u32 val;
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-   port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+   port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
val = readl(port->base + MVPP22_GMAC_INT_MASK);
val |= MVPP22_GMAC_INT_MASK_LINK_STAT;
writel(val, port->base + MVPP22_GMAC_INT_MASK);
@@ -4605,6 +4609,7 @@ static int mvpp22_comphy_init(struct mvpp2_port *port)
 
switch (port->phy_interface) {
case PHY_INTERFACE_MODE_SGMII:
+   case PHY_INTERFACE_MODE_1000BASEX:
mode = PHY_MODE_SGMII;
break;
case PHY_INTERFACE_MODE_10GKR:
@@ -4625,7 +4630,8 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
 {
u32 val;
 
-   if (port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+   if (port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
val = readl(port->base + MVPP22_GMAC_CTRL_4_REG);
val |= MVPP22_CTRL4_SYNC_BYPASS_DIS | MVPP22_CTRL4_DP_CLK_SEL |
   MVPP22_CTRL4_QSGMII_BYPASS_ACTIVE;
@@ -4640,9 +4646,11 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
writel(val, port->base + MVPP22_GMAC_CTRL_4_REG);
}
 
-   /* The port is connected to a copper PHY */
val = readl(port->base + MVPP2_GMAC_CTRL_0_REG);
-   val &= ~MVPP2_GMAC_PORT_TYPE_MASK;
+   if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX)
+   val |= MVPP2_GMAC_PORT_TYPE_MASK;
+   else
+   val &= ~MVPP2_GMAC_PORT_TYPE_MASK;
writel(val, port->base + MVPP2_GMAC_CTRL_0_REG);
 
val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG);
@@ -4651,6 +4659,19 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
   MVPP2_GMAC_AN_DUPLEX_EN;
if (port->phy_interface == PHY_INTERFACE_MODE_SGMII)
val |= MVPP2_GMAC_IN_BAND_AUTONEG;
+
+   if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX)
+   /* 1000BaseX port cannot negotiate speed nor can it
+* negotiate duplex: they are always operating with a
+* fixed speed of 1000Mbps in full duplex, so force
+* 1000 speed and full duplex here.
+*/
+   val |= MVPP2_GMAC_CONFIG_GMII_SPEED |
+  MVPP2_GMAC_CONFIG_FULL_DUPLEX;
+   else
+   val |= MVPP2_GMAC_AN_SPEED_EN |
+  MVPP2_GMAC_AN_DUPLEX_EN;
+
writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
 }
 
@@ -4671,7 +4692,8 @@ static void mvpp2_port_mii_gmac_configure(struct 
mvpp2_port *port)
 
/* Configure the PCS and in-band AN */

[PATCH net-next v5 0/4] net: mvpp2: 1000BaseX and 2500BaseX support

2018-01-11 Thread Antoine Tenart
Hi all,

This series adds 1000BaseX and 2500BaseX support to the Marvell PPv2
driver. In order to use it, the 2.5G SGMII mode is added to the Marvell
common PHY driver (cp110-comphy).

This was tested on a mcbin.

All patches should probably go through net-next as patch 4/4 depends on
patch 1/4 to build and work.

Please note the two mvpp2 patches do not conflict with the ACPI series
v2 Marcin sent a few days ago, and the two series can be processed in
parallel. (Marcin is aware of me sending this series).

Thanks!
Antoine

Since v4:
  - Fixed a compilation warning which was a real error in the code.

Since v3:
  - Stopped setting the MII_SPEED bit in the GMAC AN register, as the
GMII_SPEED bit takes over anyway.
  - Added Andrew's Reviewed-by on patch 4/4.

Since v2:
  - Added a comment before mvpp22_comphy_init() about the different PHY modes
used and why they differ between the PPv2 driver and the COMPHY one.

Since v1:
  - s/PHY_MODE_SGMII_2_5G/PHY_MODE_2500SGMII/
  - Fixed a build error in 'net: mvpp2: 1000baseX support' (which was solved in
the 2500baseX support one, but the bisection was broken).
  - Removed the dt patches, as the fourth network interface on the mcbin also
needs PHYLINK support in the PPv2 driver to be correctly supported.

Antoine Tenart (4):
  phy: add 2.5G SGMII mode to the phy_mode enum
  phy: cp110-comphy: 2.5G SGMII mode
  net: mvpp2: 1000baseX support
  net: mvpp2: 2500baseX support

 drivers/net/ethernet/marvell/mvpp2.c | 74 
 drivers/phy/marvell/phy-mvebu-cp110-comphy.c | 17 +--
 include/linux/phy/phy.h  |  1 +
 3 files changed, 79 insertions(+), 13 deletions(-)

-- 
2.14.3



[PATCH V2 net-next 02/11] net: hns3: remove TSO config command from VF driver

2018-01-11 Thread Peng Li
Only the main PF can configure the TSO MSS length, according to the
hardware. This patch removes the TSO config command from the VF driver.

Signed-off-by: Peng Li 
---
 .../net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h |  8 --------
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c    | 20 --------------------
 2 files changed, 28 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h
index ad8adfe..2caca93 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h
@@ -86,8 +86,6 @@ enum hclgevf_opcode_type {
HCLGEVF_OPC_QUERY_TX_STATUS = 0x0B03,
HCLGEVF_OPC_QUERY_RX_STATUS = 0x0B13,
HCLGEVF_OPC_CFG_COM_TQP_QUEUE   = 0x0B20,
-   /* TSO cmd */
-   HCLGEVF_OPC_TSO_GENERIC_CONFIG  = 0x0C01,
/* RSS cmd */
HCLGEVF_OPC_RSS_GENERIC_CONFIG  = 0x0D01,
HCLGEVF_OPC_RSS_INDIR_TABLE = 0x0D07,
@@ -202,12 +200,6 @@ struct hclgevf_cfg_tx_queue_pointer_cmd {
u8 rsv[14];
 };
 
-#define HCLGEVF_TSO_ENABLE_B   0
-struct hclgevf_cfg_tso_status_cmd {
-   u8 tso_enable;
-   u8 rsv[23];
-};
-
 #define HCLGEVF_TYPE_CRQ   0
 #define HCLGEVF_TYPE_CSQ   1
 #define HCLGEVF_NIC_CSQ_BASEADDR_L_REG 0x27000
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 5f9afa6..3d2bc9a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -201,20 +201,6 @@ static int hclge_get_queue_info(struct hclgevf_dev *hdev)
return 0;
 }
 
-static int hclgevf_enable_tso(struct hclgevf_dev *hdev, int enable)
-{
-   struct hclgevf_cfg_tso_status_cmd *req;
-   struct hclgevf_desc desc;
-
-   req = (struct hclgevf_cfg_tso_status_cmd *)desc.data;
-
-   hclgevf_cmd_setup_basic_desc(&desc, HCLGEVF_OPC_TSO_GENERIC_CONFIG,
-false);
-   hnae_set_bit(req->tso_enable, HCLGEVF_TSO_ENABLE_B, enable);
-
-   return hclgevf_cmd_send(&hdev->hw, &desc, 1);
-}
-
 static int hclgevf_alloc_tqps(struct hclgevf_dev *hdev)
 {
struct hclgevf_tqp *tqp;
@@ -1375,12 +1361,6 @@ static int hclgevf_init_ae_dev(struct hnae3_ae_dev 
*ae_dev)
goto err_config;
}
 
-   ret = hclgevf_enable_tso(hdev, true);
-   if (ret) {
-   dev_err(>dev, "failed(%d) to enable tso\n", ret);
-   goto err_config;
-   }
-
/* Initialize VF's MTA */
hdev->accept_mta_mc = true;
	ret = hclgevf_cfg_func_mta_filter(&hdev->nic, hdev->accept_mta_mc);
-- 
1.9.1



Re: [PATCH net-next] net: phy: Have __phy_modify return 0 on success

2018-01-11 Thread Sergei Shtylyov

Hello!

On 1/11/2018 11:55 PM, Andrew Lunn wrote:


__phy_modify would return the old value of the register before it was
modified. Thus on success, it does not return 0, but a positive value.
Thus functions using phy_modify, which is a wrapper around
__phy_modify, can start returning > 0 on success, rather than 0. As a
result, breakage has been noticed in various places, where 0 was
assumed.

Code inspection does not find any current location where the return of
the old value is currently used. So have __phy_modify return 0 on
success. When there is a real need for the old value, either a new
accessor can be added, or an additional parameter passed.
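
A hedged before/after sketch for callers (register constants are
illustrative):

	int ret = phy_modify(phydev, MII_BMCR, BMCR_PDOWN, 0);

	/* Before this patch: on success, ret held the old (pre-modify)
	 * register value, which can be > 0, so "if (ret)" error checks
	 * misfired. After: ret is 0 on success, negative on error.
	 */
	if (ret)
		return ret;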

Fixes: 2b74e5be17d2 ("net: phy: add phy_modify() accessor")
Reported-by: Geert Uytterhoeven 
Signed-off-by: Andrew Lunn 
---

Geert, Niklas

Please can you test this and let me know if it fixes the problems you
see.

  drivers/net/phy/phy-core.c | 13 ++---
  1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/phy/phy-core.c b/drivers/net/phy/phy-core.c
index e75989ce8850..36cad6b3b96d 100644
--- a/drivers/net/phy/phy-core.c
+++ b/drivers/net/phy/phy-core.c
@@ -336,16 +336,15 @@ EXPORT_SYMBOL(phy_write_mmd);
   */
  int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set)
  {
-   int ret, res;
+   int ret;
  
  	ret = __phy_read(phydev, regnum);

-   if (ret >= 0) {
-   res = __phy_write(phydev, regnum, (ret & ~mask) | set);
-   if (res < 0)
-   ret = res;
-   }
+   if (ret < 0)
+   return ret;
  
-	return ret;

+   ret = __phy_write(phydev, regnum, (ret & ~mask) | set);
+
+   return ret < 0 ? ret: 0;


   Need another space, before ':'...


  }
  EXPORT_SYMBOL_GPL(__phy_modify);
  


MBR, Sergei


Re: [PATCH net-next v4 0/4] net: mvpp2: 1000BaseX and 2500BaseX support

2018-01-11 Thread Antoine Tenart
Hi David,

On Thu, Jan 11, 2018 at 11:32:03AM -0500, David Miller wrote:
> 
> Actually, this introduced build warnings, I'm reverting.  Please fix this
> and repost.

The warning points at a real issue. I'm sorry about that; it seems I
forgot to test this one after the last change... I'll send a new
(tested) version.

Thanks!
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com


Re: [PATCH 10/18] qla2xxx: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Greg KH
On Thu, Jan 11, 2018 at 02:15:12PM -0800, Dan Williams wrote:
> On Sat, Jan 6, 2018 at 1:03 AM, Greg KH  wrote:
> > On Fri, Jan 05, 2018 at 05:10:48PM -0800, Dan Williams wrote:
> >> Static analysis reports that 'handle' may be a user controlled value
> >> that is used as a data dependency to read 'sp' from the
> >> 'req->outstanding_cmds' array.  In order to avoid potential leaks of
> >> kernel memory values, block speculative execution of the instruction
> >> stream that could issue reads based on an invalid value of 'sp'. In this
> >> case 'sp' is directly dereferenced later in the function.
> >
> > I'm pretty sure that 'handle' comes from the hardware, not from
> > userspace, from what I can tell here.  If we want to start auditing
> > __iomem data sources, great!  But that's a bigger task, and one I don't
> > think we are ready to tackle...
> 
> I think it falls in the hygiene bucket of shutting off an array index
> from a source that could be under attacker control. Should we leave
> this one un-patched while we decide if we generally have a problem
> with trusting completion 'tags' from hardware? My vote is patch it for
> now.

Hah, if you are worried about "tags" from hardware, we have a lot more
auditing to do, right?  I don't think anyone has looked into just basic
"bounds checking" for that type of information.  For USB devices we have
_just_ started doing that over the past year, the odds of anyone looking
at PCI devices for this same problem is slim-to-none.

Again, here are my questions/objections right now to this series:
- How can we audit this stuff?
- How did you audit this stuff to find these usages?
- How do you know that this series fixes all of the issues?
- What exact tree/date did you run your audit against?
- How do you know that linux-next does not contain a boatload
  more problems that we need to go back and fix after 4.16-rc1
  is out?
- How can we prevent this type of pattern showing up again?
- How can we audit the data coming from hardware correctly?

I'm all for merging this series, but if anyone things that somehow the
whole problem is now "solved" in this area, they are sorely mistaken.

thanks,

greg k-h


[PATCH net] ipv6: ip6_make_skb() needs to clear cork.base.dst

2018-01-11 Thread Eric Dumazet
From: Eric Dumazet 

In my last patch, I missed the fact that cork.base.dst was not
initialized in ip6_make_skb():

If ip6_setup_cork() returns an error, we might attempt a dst_release()
on some random pointer.

Fixes: 862c03ee1deb ("ipv6: fix possible mem leaks in ipv6_make_skb()")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
 net/ipv6/ip6_output.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 
688ba5f7516b37c87b879036dce781bdcfa01739..78a774e7af12b5725577fb4aa3c917af2e171a8d
 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1733,6 +1733,7 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
cork.base.flags = 0;
cork.base.addr = 0;
cork.base.opt = NULL;
+   cork.base.dst = NULL;
v6_cork.opt = NULL;
	err = ip6_setup_cork(sk, &cork, &v6_cork, ipc6, rt, fl6);
if (err) {



Re: [PATCH v7 7/8] dt-bindings: can: m_can: Document new can transceiver binding

2018-01-11 Thread Faiz Abbas
Hi Rob,

On Friday 12 January 2018 01:50 AM, Rob Herring wrote:
> On Wed, Jan 10, 2018 at 4:55 AM, Faiz Abbas  wrote:
>> From: Franklin S Cooper Jr 
>>
>> Add information regarding can-transceiver binding. This is especially
>> important for MCAN since the IP allows CAN FD mode to run significantly
>> faster than what most transceivers are capable of.
>>
>> Signed-off-by: Franklin S Cooper Jr 
>> Signed-off-by: Sekhar Nori 
>> Signed-off-by: Faiz Abbas 
>> ---
>>  Documentation/devicetree/bindings/net/can/m_can.txt | 9 +
>>  1 file changed, 9 insertions(+)
> 
> Why did you drop my ack from v6?

Sorry, I missed it. Will make sure its there in future versions.

Thanks,
Faiz


[RFC PATCH net-next v2 0/2] Enable virtio to act as a backup for a passthru device

2018-01-11 Thread Sridhar Samudrala
This patch series extends virtio_net to take over the VF datapath by
simulating a transparent bond, without creating any additional netdev.

I understand that there are some comments suggesting an alternate,
3-driver model (virtio_net, the VF driver, and a new virt_bond driver
that acts as a master to virtio_net and the VF).

Would like to get some feedback on the right way to solve the live
migration problem with direct attached devices in KVM environment.

Stephen,
Is the netvsc transparent bond implementation robust enough and deployed
in real environments? Or would netvsc switch over to a 3-driver model if
that solution becomes available?

Can we start with this implementation that is similar to netvsc and if
needed we can move to the 3 driver model later?

This patch series enables virtio to switch over to a VF datapath when a 
VF netdev is present with the same MAC address. It allows live migration 
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to unplug the VF device from the guest on the source
host and reset the MAC filter of the VF to initiate failover of datapath
to virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

It is based on netvsc implementation and it should be possible to make this
code generic and move it to a common location that can be shared by netvsc
and virtio.

This patch series is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization=151189725224231=2

v2:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
  for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
  east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu

Sridhar Samudrala (2):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  virtio_net: Extend virtio to use VF datapath when available

 drivers/net/virtio_net.c        | 309 +++-
 include/uapi/linux/virtio_net.h |   3 +
 2 files changed, 309 insertions(+), 3 deletions(-)

Sridhar Samudrala (1):
  qemu: Introduce VIRTIO_NET_F_BACKUP feature bit to virtio_net

 hw/net/virtio-net.c | 2 ++
 include/standard-headers/linux/virtio_net.h | 3 +++
 2 files changed, 5 insertions(+)

-- 
2.14.3


[RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-01-11 Thread Sridhar Samudrala
This feature bit can be used by hypervisor to indicate virtio_net device to
act as a backup for another device with the same MAC address.

Signed-off-by: Sridhar Samudrala 
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 12dfc5fee58e..f149a160a8c5 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-   VIRTIO_NET_F_SPEED_DUPLEX
+   VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP  62	/* Act as backup for another device
+* with the same MAC.
+*/
 #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3



[RFC PATCH 1/1] qemu: Introduce VIRTIO_NET_F_BACKUP feature bit to virtio_net

2018-01-11 Thread Sridhar Samudrala
This feature bit can be used by hypervisor to indicate virtio_net device to
act as a backup for another device with the same MAC address.

Signed-off-by: Sridhar Samudrala 

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index cd63659140..fa47e723b9 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -2193,6 +2193,8 @@ static Property virtio_net_properties[] = {
  true),
 DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
 DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
+DEFINE_PROP_BIT64("backup", VirtIONet, host_features,
+VIRTIO_NET_F_BACKUP, false),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/standard-headers/linux/virtio_net.h 
b/include/standard-headers/linux/virtio_net.h
index 17c8531a22..6afca6dcaf 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,9 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP  62   /* Act as backup for another device
+   * with the same MAC.
+   */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY


[RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-11 Thread Sridhar Samudrala
This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. The VF datapath is only used
for unicast traffic. Broadcasts/multicasts go via virtio datapath so that
east-west broadcasts don't use the PCI bandwidth. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to unplug the VF device from the guest on the source
host and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed, the
destination hypervisor sets the MAC filter on the VF and plugs it back to
the guest to switch over to VF datapath.
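
The resulting transmit policy in start_xmit(), as implemented in this
patch:

	/* VF registered && netif_running(VF) && !netpoll && unicast dst
	 *         -> virtnet_vf_xmit() (VF datapath)
	 * otherwise (multicast/broadcast, or VF absent/down)
	 *         -> virtio datapath, keeping east-west broadcasts
	 *            off the PCI link
	 */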

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization=151189725224231=2

Signed-off-by: Sridhar Samudrala 
Reviewed-by: Jesse Brandeburg 
---
 drivers/net/virtio_net.c | 307 ++-
 1 file changed, 305 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f149a160a8c5..0e58d364fde9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,6 +30,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 
@@ -120,6 +122,15 @@ struct receive_queue {
struct xdp_rxq_info xdp_rxq;
 };
 
+struct virtnet_vf_pcpu_stats {
+   u64 rx_packets;
+   u64 rx_bytes;
+   u64 tx_packets;
+   u64 tx_bytes;
+   struct u64_stats_sync   syncp;
+   u32 tx_dropped;
+};
+
 struct virtnet_info {
struct virtio_device *vdev;
struct virtqueue *cvq;
@@ -182,6 +193,10 @@ struct virtnet_info {
u32 speed;
 
unsigned long guest_offloads;
+
+   /* State to manage the associated VF interface. */
+   struct net_device __rcu *vf_netdev;
+   struct virtnet_vf_pcpu_stats __percpu *vf_stats;
 };
 
 struct padded_vnet_hdr {
@@ -1314,16 +1329,53 @@ static int xmit_skb(struct send_queue *sq, struct 
sk_buff *skb)
return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
 }
 
+/* Send skb on the slave VF device. */
+static int virtnet_vf_xmit(struct net_device *dev, struct net_device 
*vf_netdev,
+  struct sk_buff *skb)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   unsigned int len = skb->len;
+   int rc;
+
+   skb->dev = vf_netdev;
+   skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+   rc = dev_queue_xmit(skb);
+   if (likely(rc == NET_XMIT_SUCCESS || rc == NET_XMIT_CN)) {
+   struct virtnet_vf_pcpu_stats *pcpu_stats
+   = this_cpu_ptr(vi->vf_stats);
+
+   u64_stats_update_begin(&pcpu_stats->syncp);
+   pcpu_stats->tx_packets++;
+   pcpu_stats->tx_bytes += len;
+   u64_stats_update_end(&pcpu_stats->syncp);
+   } else {
+   this_cpu_inc(vi->vf_stats->tx_dropped);
+   }
+
+   return rc;
+}
+
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
int qnum = skb_get_queue_mapping(skb);
	struct send_queue *sq = &vi->sq[qnum];
+   struct net_device *vf_netdev;
int err;
struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
bool kick = !skb->xmit_more;
bool use_napi = sq->napi.weight;
 
+   /* If VF is present and up then redirect packets
+* called with rcu_read_lock_bh
+*/
+   vf_netdev = rcu_dereference_bh(vi->vf_netdev);
+   if (vf_netdev && netif_running(vf_netdev) &&
+   !netpoll_tx_running(dev) &&
+   is_unicast_ether_addr(eth_hdr(skb)->h_dest))
+   return virtnet_vf_xmit(dev, vf_netdev, skb);
+
/* Free up any pending old buffers before queueing new ones. */
free_old_xmit_skbs(sq);
 
@@ -1470,10 +1522,41 @@ static int virtnet_set_mac_address(struct net_device 
*dev, void *p)
return ret;
 }
 
+static void virtnet_get_vf_stats(struct net_device *dev,
+struct virtnet_vf_pcpu_stats *tot)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   int i;
+
+   memset(tot, 0, sizeof(*tot));
+
+   for_each_possible_cpu(i) {
+   const struct virtnet_vf_pcpu_stats *stats
+   = per_cpu_ptr(vi->vf_stats, i);
+   u64 rx_packets, rx_bytes, tx_packets, tx_bytes;
+   unsigned int start;
+
+   do {
+   start = u64_stats_fetch_begin_irq(&stats->syncp);
+   rx_packets = stats->rx_packets;
+   tx_packets = stats->tx_packets;
+   rx_bytes = stats->rx_bytes;
+   tx_bytes = 

[patch iproute2 v10 1/2] lib/libnetlink: Add a new function rtnl_talk_iov

2018-01-11 Thread Chris Mi
rtnl_talk can only send a single message to kernel. Add a new function
rtnl_talk_iov that can send multiple messages to kernel.
rtnl_talk_iov takes struct iovec * and iovlen as arguments.
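
A hypothetical caller (sketch only; req1/req2 are assumed to be
request structs with an embedded nlmsghdr, and error handling is
elided):

	struct iovec iov[2] = {
		{ .iov_base = &req1.n, .iov_len = req1.n.nlmsg_len },
		{ .iov_base = &req2.n, .iov_len = req2.n.nlmsg_len },
	};

	/* both requests go to the kernel in a single sendmsg() */
	if (rtnl_talk_iov(&rth, iov, 2, NULL) < 0)
		return -1;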

Signed-off-by: Chris Mi 
Signed-off-by: David Ahern 
---
 include/libnetlink.h |  3 +++
 lib/libnetlink.c | 65 +---
 2 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index a4d83b9e..d6322190 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -96,6 +96,9 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
 int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
  struct nlmsghdr **answer)
__attribute__((warn_unused_result));
+int rtnl_talk_iov(struct rtnl_handle *rtnl, struct iovec *iovec, size_t iovlen,
+ struct nlmsghdr **answer)
+   __attribute__((warn_unused_result));
 int rtnl_talk_extack(struct rtnl_handle *rtnl, struct nlmsghdr *n,
  struct nlmsghdr **answer, nl_ext_ack_fn_t errfn)
__attribute__((warn_unused_result));
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 00e6ce0c..7ca47b22 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -581,30 +581,30 @@ static void rtnl_talk_error(struct nlmsghdr *h, struct 
nlmsgerr *err,
strerror(-err->error));
 }
 
-static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
-  struct nlmsghdr **answer,
-  bool show_rtnl_err, nl_ext_ack_fn_t errfn)
+
+static int __rtnl_talk_iov(struct rtnl_handle *rtnl, struct iovec *iov,
+  size_t iovlen, struct nlmsghdr **answer,
+  bool show_rtnl_err, nl_ext_ack_fn_t errfn)
 {
-   int status;
-   unsigned int seq;
-   struct nlmsghdr *h;
struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
-   struct iovec iov = {
-   .iov_base = n,
-   .iov_len = n->nlmsg_len
-   };
+   struct iovec riov;
struct msghdr msg = {
	.msg_name = &nladdr,
	.msg_namelen = sizeof(nladdr),
-   .msg_iov = &iov,
-   .msg_iovlen = 1,
+   .msg_iov = iov,
+   .msg_iovlen = iovlen,
};
+   unsigned int seq = 0;
+   struct nlmsghdr *h;
+   int i, status;
char *buf;
 
-   n->nlmsg_seq = seq = ++rtnl->seq;
-
-   if (answer == NULL)
-   n->nlmsg_flags |= NLM_F_ACK;
+   for (i = 0; i < iovlen; i++) {
+   h = iov[i].iov_base;
+   h->nlmsg_seq = seq = ++rtnl->seq;
+   if (answer == NULL)
+   h->nlmsg_flags |= NLM_F_ACK;
+   }
 
status = sendmsg(rtnl->fd, , 0);
if (status < 0) {
@@ -612,8 +612,14 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
return -1;
}
 
+   /* change msg to use the response iov */
+   msg.msg_iov = &riov;
+   msg.msg_iovlen = 1;
+   i = 0;
while (1) {
+next:
status = rtnl_recvmsg(rtnl->fd, &msg, &buf);
+   ++i;
 
if (status < 0)
return status;
@@ -642,7 +648,7 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 
if (nladdr.nl_pid != 0 ||
h->nlmsg_pid != rtnl->local.nl_pid ||
-   h->nlmsg_seq != seq) {
+   h->nlmsg_seq > seq || h->nlmsg_seq < seq - iovlen) {
/* Don't forget to skip that message. */
status -= NLMSG_ALIGN(len);
h = (struct nlmsghdr *)((char *)h + NLMSG_ALIGN(len));
@@ -662,7 +668,10 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
*answer = (struct nlmsghdr *)buf;
else
free(buf);
-   return 0;
+   if (h->nlmsg_seq == seq)
+   return 0;
+   else
+   goto next;
}
 
if (rtnl->proto != NETLINK_SOCK_DIAG &&
@@ -671,7 +680,7 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 
errno = -err->error;
free(buf);
-   return -1;
+   return -i;
}
 
if (answer) {
@@ -698,12 +707,30 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
}
 }
 
+static int __rtnl_talk(struct rtnl_handle *rtnl, 

[patch iproute2 v10 2/2] tc: Add batchsize feature for filter and actions

2018-01-11 Thread Chris Mi
Currently in tc batch mode, only one command is read from the batch
file and sent to the kernel at a time. With this support, at most 128
commands can be accumulated before being sent to the kernel.

Now it only works for the following successive commands:
1. filter add/delete/change/replace
2. actions add/change/replace
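
The flush logic boils down to the following sketch (identifiers are
illustrative, not the literal patch code): requests are appended to an
iovec array as commands are parsed, and sent in one rtnl_talk_iov()
call once 128 are queued or the input ends.

	#define BATCH_MAX 128

	if (batch_cnt == BATCH_MAX || force_flush) {
		/* on error libnetlink returns -i, so the caller can
		 * tell which message in the batch was rejected
		 */
		if (rtnl_talk_iov(&rth, iovs, batch_cnt, NULL) < 0)
			return -1;
		batch_cnt = 0;
	}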

Signed-off-by: Chris Mi 
---
 tc/m_action.c  |  65 ---
 tc/tc.c| 199 -
 tc/tc_common.h |   5 +-
 tc/tc_filter.c | 104 ++
 4 files changed, 294 insertions(+), 79 deletions(-)

diff --git a/tc/m_action.c b/tc/m_action.c
index fc422364..611f6cc2 100644
--- a/tc/m_action.c
+++ b/tc/m_action.c
@@ -546,40 +546,61 @@ bad_val:
return ret;
 }
 
+struct tc_action_req {
+   struct nlmsghdr n;
+   struct tcamsg   t;
+   charbuf[MAX_MSG];
+};
+
 static int tc_action_modify(int cmd, unsigned int flags,
-   int *argc_p, char ***argv_p)
+   int *argc_p, char ***argv_p,
+   void *buf, size_t buflen)
 {
-   int argc = *argc_p;
+   struct tc_action_req *req, action_req;
char **argv = *argv_p;
+   struct rtattr *tail;
+   int argc = *argc_p;
+   struct iovec iov;
int ret = 0;
-   struct {
-   struct nlmsghdr n;
-   struct tcamsg   t;
-   charbuf[MAX_MSG];
-   } req = {
-   .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcamsg)),
-   .n.nlmsg_flags = NLM_F_REQUEST | flags,
-   .n.nlmsg_type = cmd,
-   .t.tca_family = AF_UNSPEC,
-   };
-   struct rtattr *tail = NLMSG_TAIL(&req.n);
+
+   if (buf) {
+   req = buf;
+   if (buflen < sizeof (struct tc_action_req)) {
+   fprintf(stderr, "buffer is too small: %zu\n", buflen);
+   return -1;
+   }
+   } else {
+   memset(&action_req, 0, sizeof (struct tc_action_req));
+   req = &action_req;
+   }
+
+   req->n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcamsg));
+   req->n.nlmsg_flags = NLM_F_REQUEST | flags;
+   req->n.nlmsg_type = cmd;
+   req->t.tca_family = AF_UNSPEC;
+   tail = NLMSG_TAIL(&req->n);
 
argc -= 1;
argv += 1;
-   if (parse_action(&argc, &argv, TCA_ACT_TAB, &req.n)) {
+   if (parse_action(&argc, &argv, TCA_ACT_TAB, &req->n)) {
fprintf(stderr, "Illegal \"action\"\n");
return -1;
}
-   tail->rta_len = (void *) NLMSG_TAIL(&req.n) - (void *) tail;
+   tail->rta_len = (void *) NLMSG_TAIL(&req->n) - (void *) tail;
+
+   *argc_p = argc;
+   *argv_p = argv;
+
+   if (buf)
+   return 0;
 
-   if (rtnl_talk(&rth, &req.n, NULL) < 0) {
+   iov.iov_base = &req->n;
+   iov.iov_len = req->n.nlmsg_len;
+   if (rtnl_talk_iov(&rth, &iov, 1, NULL) < 0) {
fprintf(stderr, "We have an error talking to the kernel\n");
ret = -1;
}
 
-   *argc_p = argc;
-   *argv_p = argv;
-
return ret;
 }
 
@@ -679,7 +700,7 @@ bad_val:
return ret;
 }
 
-int do_action(int argc, char **argv)
+int do_action(int argc, char **argv, void *buf, size_t buflen)
 {
 
int ret = 0;
@@ -689,12 +710,12 @@ int do_action(int argc, char **argv)
if (matches(*argv, "add") == 0) {
ret =  tc_action_modify(RTM_NEWACTION,
NLM_F_EXCL | NLM_F_CREATE,
-   &argc, &argv);
+   &argc, &argv, buf, buflen);
} else if (matches(*argv, "change") == 0 ||
  matches(*argv, "replace") == 0) {
ret = tc_action_modify(RTM_NEWACTION,
   NLM_F_CREATE | NLM_F_REPLACE,
-  &argc, &argv);
+  &argc, &argv, buf, buflen);
} else if (matches(*argv, "delete") == 0) {
argc -= 1;
argv += 1;
diff --git a/tc/tc.c b/tc/tc.c
index ad9f07e9..63e64fec 100644
--- a/tc/tc.c
+++ b/tc/tc.c
@@ -193,16 +193,16 @@ static void usage(void)
"-nm | -nam[es] | { -cf | -conf } path } | -j[son]\n");
 }
 
-static int do_cmd(int argc, char **argv)
+static int do_cmd(int argc, char **argv, void *buf, size_t buflen)
 {
if (matches(*argv, "qdisc") == 0)
return do_qdisc(argc-1, argv+1);
if (matches(*argv, "class") == 0)
return do_class(argc-1, argv+1);
if (matches(*argv, "filter") == 0)
-   return do_filter(argc-1, argv+1);
+   return do_filter(argc-1, argv+1, buf, buflen);
if (matches(*argv, "actions") == 0)
-   

[patch iproute2 v10 0/2] tc: Add batchsize feature to batch mode

2018-01-11 Thread Chris Mi
Currently in tc batch mode, only one command is read from the batch
file and sent to kernel to process. With this patchset, at most 128
commands can be accumulated before sending to kernel.

We introduced a new function in patch 1 to support for sending
multiple messages. In patch 2, we add this support for filter
add/delete/change/replace and actions add/change/replace commands.

But please note that kernel still processes the requests one by one.
To process the requests in parallel in kernel is another effort.
The time we're saving in this patchset is the user mode and kernel mode
context switch. So this patchset works on top of the current kernel.

Using the following script in kernel, we can generate 1,000,000 rules.
tools/testing/selftests/tc-testing/tdc_batch.py

Without this patchset, 'tc -b $file' execution time is:

real	0m15.555s
user	0m7.211s
sys	0m8.284s

With this patchset, 'tc -b $file' execution time is:

real	0m12.360s
user	0m6.082s
sys	0m6.213s

The insertion rate is improved by more than 10%.

v3
==
1. Instead of hacking function rtnl_talk directly, add a new function
   rtnl_talk_msg.
2. remove most of global variables to use parameter passing
3. divide the previous patch into 4 patches.

v4
==
1. Remove function setcmdlinetotal. Now in function batch, we read one
   more line to determine if we are reaching the end of file.
2. Remove function __rtnl_check_ack. Now __rtnl_talk calls __rtnl_talk_msg
   directly.
3. if (batch_size < 1)
batch_size = 1;

v5
==
1. Fix a bug that can't deal with batch file with blank line.
2. Describe the limitation in man page.

v6
==
1. Add support for mixed commands.
2. Fix a bug that not all messages are acked if batch size > 1.

v7
==
1. We can tell exactly which command fails.
2. Add a new function rtnl_talk_iov
3. Allocate the memory in function batch() instead of each client.
4. Remove option -bs.

v8
==
1. Replace strcmp with matches.
2. Recycle buffers.

v9
==
1. remove rtnl_talk_msg
2. use a table to determine if supporting batchsize feature or not

v10
===
1. Improve function batchsize_enabled.


Chris Mi (2):
  lib/libnetlink: Add a new function rtnl_talk_iov
  tc: Add batchsize feature for filter and actions

 include/libnetlink.h |   3 +
 lib/libnetlink.c |  65 -
 tc/m_action.c|  65 +++--
 tc/tc.c  | 199 +++
 tc/tc_common.h   |   5 +-
 tc/tc_filter.c   | 104 ---
 6 files changed, 343 insertions(+), 98 deletions(-)

-- 
2.14.2



[bpf-next PATCH v3 4/7] bpf: sockmap sample, report bytes/sec

2018-01-11 Thread John Fastabend
Report bytes/sec sent as well as total bytes. Useful to get a rough
idea of how different configurations and usage patterns perform with
sockmap.

Signed-off-by: John Fastabend 
---
 samples/sockmap/sockmap_user.c |   37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
index bbe9587..442fc00 100644
--- a/samples/sockmap/sockmap_user.c
+++ b/samples/sockmap/sockmap_user.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include <time.h>
 
 #include 
 #include 
@@ -189,14 +190,16 @@ static int sockmap_init_sockets(void)
 struct msg_stats {
size_t bytes_sent;
size_t bytes_recvd;
+   struct timespec start;
+   struct timespec end;
 };
 
 static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
struct msg_stats *s, bool tx)
 {
struct msghdr msg = {0};
+   int err, i, flags = MSG_NOSIGNAL;
struct iovec *iov;
-   int i, flags = MSG_NOSIGNAL;
 
iov = calloc(iov_count, sizeof(struct iovec));
if (!iov)
@@ -218,6 +221,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
msg.msg_iovlen = iov_count;
 
if (tx) {
+   clock_gettime(CLOCK_MONOTONIC, &s->start);
for (i = 0; i < cnt; i++) {
int sent = sendmsg(fd, , flags);
 
@@ -228,6 +232,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
}
s->bytes_sent += sent;
}
+   clock_gettime(CLOCK_MONOTONIC, &s->end);
} else {
int slct, recv, max_fd = fd;
struct timeval timeout;
@@ -235,6 +240,9 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
fd_set w;
 
total_bytes = (float)iov_count * (float)iov_length * (float)cnt;
+   err = clock_gettime(CLOCK_MONOTONIC, &s->start);
+   if (err < 0)
+   perror("recv start time: ");
while (s->bytes_recvd < total_bytes) {
timeout.tv_sec = 1;
timeout.tv_usec = 0;
@@ -246,15 +254,18 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
slct = select(max_fd + 1, &w, NULL, NULL, &timeout);
if (slct == -1) {
perror("select()");
+   clock_gettime(CLOCK_MONOTONIC, &s->end);
goto out_errno;
} else if (!slct) {
fprintf(stderr, "unexpected timeout\n");
+   clock_gettime(CLOCK_MONOTONIC, &s->end);
goto out_errno;
}
 
recv = recvmsg(fd, &msg, flags);
if (recv < 0) {
if (errno != EWOULDBLOCK) {
+   clock_gettime(CLOCK_MONOTONIC, &s->end);
perror("recv failed()\n");
goto out_errno;
}
@@ -262,6 +273,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
 
s->bytes_recvd += recv;
}
+   clock_gettime(CLOCK_MONOTONIC, &s->end);
}
 
for (i = 0; i < iov_count; i++)
@@ -273,11 +285,14 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
return errno;
 }
 
+static float giga = 1000000000;
+
 static int sendmsg_test(int iov_count, int iov_buf, int cnt, int verbose)
 {
int txpid, rxpid, err = 0;
struct msg_stats s = {0};
int status;
+   float sent_Bps = 0, recvd_Bps = 0;
 
errno = 0;
 
@@ -288,10 +303,16 @@ static int sendmsg_test(int iov_count, int iov_buf, int cnt, int verbose)
fprintf(stderr,
"msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n",
iov_count, iov_buf, cnt, err);
-   fprintf(stdout, "rx_sendmsg: TX_bytes %zu RX_bytes %zu\n",
-   s.bytes_sent, s.bytes_recvd);
shutdown(p2, SHUT_RDWR);
shutdown(p1, SHUT_RDWR);
+   if (s.end.tv_sec - s.start.tv_sec) {
+   sent_Bps = s.bytes_sent / (s.end.tv_sec - s.start.tv_sec);
+   recvd_Bps = s.bytes_recvd / (s.end.tv_sec - s.start.tv_sec);
+   }
+   fprintf(stdout,
+   "rx_sendmsg: TX: %zuB %fB/s %fGB/s RX: %zuB %fB/s %fGB/s\n",
+   s.bytes_sent, sent_Bps, sent_Bps/giga,
+   s.bytes_recvd, recvd_Bps, recvd_Bps/giga);
exit(1);
} else if (rxpid == -1) {

[bpf-next PATCH v3 6/7] bpf: sockmap put client sockets in blocking mode

2018-01-11 Thread John Fastabend
Put client sockets in blocking mode, otherwise with the sendmsg tests
it is easy to overrun the socket buffers, which results in the test
being aborted.

The original non-blocking mode was added to handle listen/accept with
a single thread; the client/accepted sockets do not need to be
non-blocking.

Signed-off-by: John Fastabend 
---
 samples/sockmap/sockmap_user.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
index c3295a7..818766b 100644
--- a/samples/sockmap/sockmap_user.c
+++ b/samples/sockmap/sockmap_user.c
@@ -109,7 +109,7 @@ static int sockmap_init_sockets(void)
}
 
/* Non-blocking sockets */
-   for (i = 0; i < 4; i++) {
+   for (i = 0; i < 2; i++) {
err = ioctl(*fds[i], FIONBIO, (char *)&one);
if (err < 0) {
perror("ioctl s1 failed()");



[bpf-next PATCH v3 7/7] bpf: sockmap set rlimit

2018-01-11 Thread John Fastabend
Avoid the extra step of setting the limit from the cmdline and do it
directly in the program.

Signed-off-by: John Fastabend 
---
 samples/sockmap/sockmap_user.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
index 818766b..a6dab97 100644
--- a/samples/sockmap/sockmap_user.c
+++ b/samples/sockmap/sockmap_user.c
@@ -27,6 +27,7 @@
 #include 
 
 #include 
+#include <sys/resource.h>
 #include 
 
 #include 
@@ -439,6 +440,7 @@ enum {
 int main(int argc, char **argv)
 {
int iov_count = 1, length = 1024, rate = 1, verbose = 0;
+   struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
int opt, longindex, err, cg_fd = 0;
int test = PING_PONG;
char filename[256];
@@ -493,6 +495,11 @@ int main(int argc, char **argv)
return -1;
}
 
+   if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+   perror("setrlimit(RLIMIT_MEMLOCK)");
+   return 1;
+   }
+
snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
 
running = 1;



[bpf-next PATCH v3 2/7] bpf: add sendmsg option for testing BPF programs

2018-01-11 Thread John Fastabend
When testing BPF programs using sockmap I often want to have more
control over how sendmsg is exercised. This becomes even more useful
as new sockmap program types are added.

This adds a test type option to select the type of test to run.
Currently, only "ping" and "sendmsg" are supported, but more can be
added as needed.

The new help argument gives the following,

 Usage: ./sockmap --cgroup 
 options:
 --help -h
 --cgroup   -c
 --rate -r
 --verbose  -v
 --iov_count-i
 --length   -l
 --test -t
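
The new flag is parsed via getopt_long(); a trimmed sketch of the
--test handling, consistent with the option table added below (the
exact optstring and error handling here are assumptions):

	while ((opt = getopt_long(argc, argv, "hvc:r:i:l:t:",
				  long_options, &longindex)) != -1) {
		switch (opt) {
		case 't':
			if (memcmp(optarg, "ping", 4) == 0)
				test = PING_PONG;
			else if (memcmp(optarg, "sendmsg", 7) == 0)
				test = SENDMSG;
			else {
				usage(argv);
				return -1;
			}
			break;
		/* other options elided */
		}
	}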

Signed-off-by: John Fastabend 
---
 samples/sockmap/sockmap_user.c |  147 +++-
 1 file changed, 144 insertions(+), 3 deletions(-)

diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
index 17400d4..8ec7dbf 100644
--- a/samples/sockmap/sockmap_user.c
+++ b/samples/sockmap/sockmap_user.c
@@ -56,6 +56,9 @@
{"cgroup",  required_argument,  NULL, 'c' },
{"rate",required_argument,  NULL, 'r' },
{"verbose", no_argument,NULL, 'v' },
+   {"iov_count",   required_argument,  NULL, 'i' },
+   {"length",  required_argument,  NULL, 'l' },
+   {"test",required_argument,  NULL, 't' },
{0, 0, NULL, 0 }
 };
 
@@ -182,6 +185,117 @@ static int sockmap_init_sockets(void)
return 0;
 }
 
+struct msg_stats {
+   size_t bytes_sent;
+   size_t bytes_recvd;
+};
+
+static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
+   struct msg_stats *s, bool tx)
+{
+   struct msghdr msg = {0};
+   struct iovec *iov;
+   int i, flags = 0;
+
+   iov = calloc(iov_count, sizeof(struct iovec));
+   if (!iov)
+   return -ENOMEM;
+
+   for (i = 0; i < iov_count; i++) {
+   char *d = calloc(iov_length, sizeof(char));
+
+   if (!d) {
+   fprintf(stderr, "iov_count %i/%i OOM\n", i, iov_count);
+   free(iov);
+   return -ENOMEM;
+   }
+   iov[i].iov_base = d;
+   iov[i].iov_len = iov_length;
+   }
+
+   msg.msg_iov = iov;
+   msg.msg_iovlen = iov_count;
+
+   if (tx) {
+   for (i = 0; i < cnt; i++) {
+   int sent = sendmsg(fd, , flags);
+
+   if (sent < 0) {
+   perror("send loop error:");
+   free(iov);
+   return sent;
+   }
+   s->bytes_sent += sent;
+   }
+   } else {
+   int slct, recv, max_fd = fd;
+   struct timeval timeout;
+   float total_bytes;
+   fd_set w;
+
+   total_bytes = (float)iov_count * (float)iov_length * (float)cnt;
+   while (s->bytes_recvd < total_bytes) {
+   timeout.tv_sec = 1;
+   timeout.tv_usec = 0;
+
+   /* FD sets */
+   FD_ZERO(&w);
+   FD_SET(fd, &w);
+
+   slct = select(max_fd + 1, &w, NULL, NULL, &timeout);
+   if (slct == -1) {
+   perror("select()");
+   goto out_errno;
+   } else if (!slct) {
+   fprintf(stderr, "unexpected timeout\n");
+   goto out_errno;
+   }
+
+   recv = recvmsg(fd, &msg, flags);
+   if (recv < 0) {
+   if (errno != EWOULDBLOCK) {
+   perror("recv failed()\n");
+   goto out_errno;
+   }
+   }
+
+   s->bytes_recvd += recv;
+   }
+   }
+
+   for (i = 0; i < iov_count; i++)
+   free(iov[i].iov_base);
+   free(iov);
+   return 0;
+out_errno:
+   free(iov);
+   return errno;
+}
+
+static int sendmsg_test(int iov_count, int iov_buf, int cnt, int verbose)
+{
+   struct msg_stats s = {0};
+   int err;
+
+   err = msg_loop(c1, iov_count, iov_buf, cnt, &s, true);
+   if (err) {
+   fprintf(stderr,
+   "msg_loop_tx: iov_count %i iov_buf %i cnt %i err %i\n",
+   iov_count, iov_buf, cnt, err);
+   return err;
+   }
+
+   msg_loop(p2, iov_count, iov_buf, cnt, &s, false);
+   if (err)
+   fprintf(stderr,
+   "msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n",
+   iov_count, iov_buf, cnt, err);
+
+   fprintf(stdout, "sendmsg: TX_bytes %zu RX_bytes %zu\n",
+   s.bytes_sent, s.bytes_recvd);
+   return err;
+}
+
 static int forever_ping_pong(int 

[bpf-next PATCH v3 5/7] bpf: sockmap sample add base test without any BPF for comparison

2018-01-11 Thread John Fastabend
Add a base test that does not use BPF hooks to test baseline case.

Signed-off-by: John Fastabend 
---
 samples/sockmap/sockmap_user.c |   26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
index 442fc00..c3295a7 100644
--- a/samples/sockmap/sockmap_user.c
+++ b/samples/sockmap/sockmap_user.c
@@ -287,18 +287,24 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
 
 static float giga = 1000000000;
 
-static int sendmsg_test(int iov_count, int iov_buf, int cnt, int verbose)
+static int sendmsg_test(int iov_count, int iov_buf, int cnt,
+   int verbose, bool base)
 {
-   int txpid, rxpid, err = 0;
+   float sent_Bps = 0, recvd_Bps = 0;
+   int rx_fd, txpid, rxpid, err = 0;
struct msg_stats s = {0};
int status;
-   float sent_Bps = 0, recvd_Bps = 0;
 
errno = 0;
 
+   if (base)
+   rx_fd = p1;
+   else
+   rx_fd = p2;
+
rxpid = fork();
if (rxpid == 0) {
-   err = msg_loop(p2, iov_count, iov_buf, cnt, &s, false);
+   err = msg_loop(rx_fd, iov_count, iov_buf, cnt, &s, false);
if (err)
fprintf(stderr,
"msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n",
@@ -427,6 +433,7 @@ static int forever_ping_pong(int rate, int verbose)
 enum {
PING_PONG,
SENDMSG,
+   BASE,
 };
 
 int main(int argc, char **argv)
@@ -466,6 +473,8 @@ int main(int argc, char **argv)
test = PING_PONG;
} else if (memcmp(optarg, "sendmsg", 7) == 0) {
test = SENDMSG;
+   } else if (memcmp(optarg, "base", 4) == 0) {
+   test = BASE;
} else {
usage(argv);
return -1;
@@ -491,6 +500,10 @@ int main(int argc, char **argv)
/* catch SIGINT */
signal(SIGINT, running_handler);
 
+   /* If base test skip BPF setup */
+   if (test == BASE)
+   goto run;
+
if (load_bpf_file(filename)) {
fprintf(stderr, "load_bpf_file: (%s) %s\n",
filename, strerror(errno));
@@ -522,6 +535,7 @@ int main(int argc, char **argv)
return err;
}
 
+run:
err = sockmap_init_sockets();
if (err) {
fprintf(stderr, "ERROR: test socket failed: %d\n", err);
@@ -531,7 +545,9 @@ int main(int argc, char **argv)
if (test == PING_PONG)
err = forever_ping_pong(rate, verbose);
else if (test == SENDMSG)
-   err = sendmsg_test(iov_count, length, rate, verbose);
+   err = sendmsg_test(iov_count, length, rate, verbose, false);
+   else if (test == BASE)
+   err = sendmsg_test(iov_count, length, rate, verbose, true);
else
fprintf(stderr, "unknown test\n");
 out:



[bpf-next PATCH v3 3/7] bpf: sockmap sample, use fork() for send and recv

2018-01-11 Thread John Fastabend
Currently for SENDMSG tests, first the send completes and then recv
runs. This does not work well for large data sizes and/or many
iterations. So fork the recv and send handlers so that we run both
send and recv. In the future we can add a parameter to do more than a
single fork of tx/rx.

With this we can get many GBps of data which helps exercise the
sockmap code.

Signed-off-by: John Fastabend 
---
 samples/sockmap/sockmap_user.c |   58 +---
 1 file changed, 42 insertions(+), 16 deletions(-)

diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
index 8ec7dbf..bbe9587 100644
--- a/samples/sockmap/sockmap_user.c
+++ b/samples/sockmap/sockmap_user.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <sys/wait.h>
 
 #include 
 #include 
@@ -195,7 +196,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
 {
struct msghdr msg = {0};
struct iovec *iov;
-   int i, flags = 0;
+   int i, flags = MSG_NOSIGNAL;
 
iov = calloc(iov_count, sizeof(struct iovec));
if (!iov)
@@ -274,25 +275,50 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
 
 static int sendmsg_test(int iov_count, int iov_buf, int cnt, int verbose)
 {
+   int txpid, rxpid, err = 0;
struct msg_stats s = {0};
-   int err;
-
-   err = msg_loop(c1, iov_count, iov_buf, cnt, &s, true);
-   if (err) {
-   fprintf(stderr,
-   "msg_loop_tx: iov_count %i iov_buf %i cnt %i err %i\n",
-   iov_count, iov_buf, cnt, err);
-   return err;
+   int status;
+
+   errno = 0;
+
+   rxpid = fork();
+   if (rxpid == 0) {
+   err = msg_loop(p2, iov_count, iov_buf, cnt, &s, false);
+   if (err)
+   fprintf(stderr,
+   "msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n",
+   iov_count, iov_buf, cnt, err);
+   fprintf(stdout, "rx_sendmsg: TX_bytes %zu RX_bytes %zu\n",
+   s.bytes_sent, s.bytes_recvd);
+   shutdown(p2, SHUT_RDWR);
+   shutdown(p1, SHUT_RDWR);
+   exit(1);
+   } else if (rxpid == -1) {
+   perror("msg_loop_rx: ");
+   return errno;
}
 
-   msg_loop(p2, iov_count, iov_buf, cnt, &s, false);
-   if (err)
-   fprintf(stderr,
-   "msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n",
-   iov_count, iov_buf, cnt, err);
+   txpid = fork();
+   if (txpid == 0) {
+   err = msg_loop(c1, iov_count, iov_buf, cnt, &s, true);
+   if (err)
+   fprintf(stderr,
+   "msg_loop_tx: iov_count %i iov_buf %i cnt %i err %i\n",
+   iov_count, iov_buf, cnt, err);
+   fprintf(stdout, "tx_sendmsg: TX_bytes %zu RX_bytes %zu\n",
+   s.bytes_sent, s.bytes_recvd);
+   shutdown(c1, SHUT_RDWR);
+   exit(1);
+   } else if (txpid == -1) {
+   perror("msg_loop_tx: ");
+   return errno;
+   }
 
-   fprintf(stdout, "sendmsg: TX_bytes %zu RX_bytes %zu\n",
-   s.bytes_sent, s.bytes_recvd);
+   assert(waitpid(rxpid, &status, 0) == rxpid);
+   if (!txpid)
+   goto out;
+   assert(waitpid(txpid, &status, 0) == txpid);
+out:
return err;
 }
 



[bpf-next PATCH v3 1/7] bpf: refactor sockmap sample program update for arg parsing

2018-01-11 Thread John Fastabend
The sockmap sample program takes arguments from the cmd line but it
reads them in using offsets into the argv array. Because we want to
add more arguments in the future, let's do proper argument handling.

Also refactor code to pull apart sock init and ping/pong test. This
allows us to add new tests in the future.

Signed-off-by: John Fastabend 
---
 samples/sockmap/sockmap_user.c |  164 
 1 file changed, 113 insertions(+), 51 deletions(-)

diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
index 7cc9d22..17400d4 100644
--- a/samples/sockmap/sockmap_user.c
+++ b/samples/sockmap/sockmap_user.c
@@ -35,6 +35,8 @@
 #include 
 #include 
 
+#include <getopt.h>
+
 #include "../bpf/bpf_load.h"
 #include "../bpf/bpf_util.h"
 #include "../bpf/libbpf.h"
@@ -46,15 +48,39 @@
 #define S1_PORT 1
 #define S2_PORT 10001
 
-static int sockmap_test_sockets(int rate, int dot)
+/* global sockets */
+int s1, s2, c1, c2, p1, p2;
+
+static const struct option long_options[] = {
+   {"help",no_argument,NULL, 'h' },
+   {"cgroup",  required_argument,  NULL, 'c' },
+   {"rate",required_argument,  NULL, 'r' },
+   {"verbose", no_argument,NULL, 'v' },
+   {0, 0, NULL, 0 }
+};
+
+static void usage(char *argv[])
 {
-   int i, sc, err, max_fd, one = 1;
-   int s1, s2, c1, c2, p1, p2;
+   int i;
+
+   printf(" Usage: %s --cgroup \n", argv[0]);
+   printf(" options:\n");
+   for (i = 0; long_options[i].name != 0; i++) {
+   printf(" --%-12s", long_options[i].name);
+   if (long_options[i].flag != NULL)
+   printf(" flag (internal value:%d)\n",
+   *long_options[i].flag);
+   else
+   printf(" -%c\n", long_options[i].val);
+   }
+   printf("\n");
+}
+
+static int sockmap_init_sockets(void)
+{
+   int i, err, one = 1;
struct sockaddr_in addr;
-   struct timeval timeout;
-   char buf[1024] = {0};
int *fds[4] = {&s1, &s2, &c1, &c2};
-   fd_set w;
 
s1 = s2 = p1 = p2 = c1 = c2 = 0;
 
@@ -63,8 +89,7 @@ static int sockmap_test_sockets(int rate, int dot)
*fds[i] = socket(AF_INET, SOCK_STREAM, 0);
if (*fds[i] < 0) {
perror("socket s1 failed()");
-   err = *fds[i];
-   goto out;
+   return errno;
}
}
 
@@ -74,7 +99,7 @@ static int sockmap_test_sockets(int rate, int dot)
(char *)&one, sizeof(one));
if (err) {
perror("setsockopt failed()");
-   goto out;
+   return errno;
}
}
 
@@ -83,7 +108,7 @@ static int sockmap_test_sockets(int rate, int dot)
err = ioctl(*fds[i], FIONBIO, (char *)&one);
if (err < 0) {
perror("ioctl s1 failed()");
-   goto out;
+   return errno;
}
}
 
@@ -96,14 +121,14 @@ static int sockmap_test_sockets(int rate, int dot)
err = bind(s1, (struct sockaddr *)&addr, sizeof(addr));
if (err < 0) {
perror("bind s1 failed()\n");
-   goto out;
+   return errno;
}
 
addr.sin_port = htons(S2_PORT);
err = bind(s2, (struct sockaddr *)&addr, sizeof(addr));
if (err < 0) {
perror("bind s2 failed()\n");
-   goto out;
+   return errno;
}
 
/* Listen server sockets */
@@ -111,14 +136,14 @@ static int sockmap_test_sockets(int rate, int dot)
err = listen(s1, 32);
if (err < 0) {
perror("listen s1 failed()\n");
-   goto out;
+   return errno;
}
 
addr.sin_port = htons(S2_PORT);
err = listen(s2, 32);
if (err < 0) {
perror("listen s1 failed()\n");
-   goto out;
+   return errno;
}
 
/* Initiate Connect */
@@ -126,46 +151,56 @@ static int sockmap_test_sockets(int rate, int dot)
err = connect(c1, (struct sockaddr *)&addr, sizeof(addr));
if (err < 0 && errno != EINPROGRESS) {
perror("connect c1 failed()\n");
-   goto out;
+   return errno;
}
 
addr.sin_port = htons(S2_PORT);
err = connect(c2, (struct sockaddr *)&addr, sizeof(addr));
if (err < 0 && errno != EINPROGRESS) {
perror("connect c2 failed()\n");
-   goto out;
+   return errno;
+   } else if (err < 0) {
+   err = 0;
}
 
/* Accept Connecrtions */
p1 = accept(s1, NULL, NULL);
if (p1 < 0) {
perror("accept s1 failed()\n");
-   goto out;
+   

[bpf-next PATCH v3 0/7] sockmap sample update

2018-01-11 Thread John Fastabend
The sockmap sample is pretty simple at the moment. All it does is open
a few sockets, attach BPF programs/sockmaps, and send a few packets.

However, for testing and debugging I wanted to have more control over
the sendmsg format and data than provided by tools like iperf3/netperf,
etc. The reason is that for testing BPF programs and the stream parser
it is helpful to be able to submit multiple sendmsg calls with
different msg layouts. For example, lots of 1B iovs or a single large
MB of data, etc.

Additionally, my current test setup requires an entire orchestration
layer (cilium) to run. As well as lighttpd and http traffic generators
or for kafka testing brokers and clients. This makes it a bit more
difficult when doing performance optimizations to incrementally test
small changes and come up with performance delta's and perf numbers.

By adding a few more options and an additional few tests the sockmap
sample program can show a more complete example and do some of the
above. Because the sample program is self contained it doesn't require
additional infrastructure to run either.

This series, although still fairly crude, does provide some nice
additions. They are

  - a new sendmsg test with forked sender and receiver processes
  - a new base test so we can get metrics/data without BPF
  - multiple GBps of throughput on base and sendmsg tests
  - automatically set rlimit and common variables

That said the UI is still primitive, more features could be added,
more tests might be useful, the reporting is bare bones, etc. But,
IMO lets push this now rather than sit on it for weeks until I get
time to do the above improvements. Additional patches can address
the other limitations/issues.

v2: removed bogus file added by patch 3/7
v3: 1/7 replace goto out with returns, remove sighandler update,
2/7 free iov in error cases
3/7 fix bogus makefile change, bail out early on errors

Thanks Daniel and Martin for the reviews!
---

John Fastabend (7):
  bpf: refactor sockmap sample program update for arg parsing
  bpf: add sendmsg option for testing BPF programs
  bpf: sockmap sample, use fork() for send and recv
  bpf: sockmap sample, report bytes/sec
  bpf: sockmap sample add base test without any BPF for comparison
  bpf: sockmap put client sockets in blocking mode
  bpf: sockmap set rlimit


 samples/sockmap/sockmap_user.c |  383 +++-
 1 file changed, 331 insertions(+), 52 deletions(-)

--
Signature


Re: [bpf-next PATCH v2 1/7] bpf: refactor sockmap sample program update for arg parsing

2018-01-11 Thread John Fastabend
On 01/11/2018 08:31 PM, John Fastabend wrote:
> On 01/10/2018 05:25 PM, Daniel Borkmann wrote:
>> On 01/10/2018 07:39 PM, John Fastabend wrote:
>>> sockmap sample program takes arguments from cmd line but it reads them
>>> in using offsets into the array. Because we want to add more arguments
>>> in the future lets do proper argument handling.
>>>
>>> Also refactor code to pull apart sock init and ping/pong test. This
>>> allows us to add new tests in the future.
>>>
>>> Signed-off-by: John Fastabend 
>>> ---
>>>  samples/sockmap/sockmap_user.c |  142 
>>> +---
>>>  1 file changed, 103 insertions(+), 39 deletions(-)
>> [...]
>>>  
>>> /* Accept Connecrtions */
>>> @@ -149,23 +177,32 @@ static int sockmap_test_sockets(int rate, int dot)
>>> goto out;
>>> }
>>>  
>>> -   max_fd = p2;
>>> -   timeout.tv_sec = 10;
>>> -   timeout.tv_usec = 0;
>>> -
>>> printf("connected sockets: c1 <-> p1, c2 <-> p2\n");
>>> printf("cgroups binding: c1(%i) <-> s1(%i) - - - c2(%i) <-> s2(%i)\n",
>>> c1, s1, c2, s2);
>>> +out:
>>> +   return err;
>>
>> Maybe rather than setting err and goto out where we now just return
>> err anyway, return from those places directly.
>>
> 
> Perhaps but how about doing this in another patch. This
> patch is not changing the goto err pattern. I can send
> a follow up.

OK I take it back. I went ahead and removed the goto as you
suggested. As Martin noticed the accept err was missing and
also most of those should have been errno instead of err.

v3 in-flight.

.John



Re: [bpf-next PATCH v2 3/7] bpf: sockmap sample, use fork() for send and recv

2018-01-11 Thread John Fastabend
On 01/10/2018 05:31 PM, Daniel Borkmann wrote:
> On 01/10/2018 07:39 PM, John Fastabend wrote:
>> Currently for SENDMSG tests first send completes then recv runs. This
>> does not work well for large data sizes and/or many iterations. So
>> fork the recv and send handler so that we run both send and recv. In
>> the future we can add a parameter to do more than a single fork of
>> tx/rx.
>>
>> With this we can get many GBps of data which helps exercise the
>> sockmap code.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  samples/sockmap/Makefile   |2 +
>>  samples/sockmap/sockmap_user.c |   58 
>> +---
>>  2 files changed, 43 insertions(+), 17 deletions(-)
>>
>> diff --git a/samples/sockmap/Makefile b/samples/sockmap/Makefile
>> index 73f1da4..4fefd66 100644
>> --- a/samples/sockmap/Makefile
>> +++ b/samples/sockmap/Makefile
>> @@ -8,7 +8,7 @@ HOSTCFLAGS += -I$(objtree)/usr/include
>>  HOSTCFLAGS += -I$(srctree)/tools/lib/
>>  HOSTCFLAGS += -I$(srctree)/tools/testing/selftests/bpf/
>>  HOSTCFLAGS += -I$(srctree)/tools/lib/ -I$(srctree)/tools/include
>> -HOSTCFLAGS += -I$(srctree)/tools/perf
>> +HOSTCFLAGS += -I$(srctree)/tools/perf -g
> 
> Slipped in here?
> 

Yep, removed in v3. Thanks.

>>  sockmap-objs := ../bpf/bpf_load.o $(LIBBPF) sockmap_user.o
>>  
>> diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
>> index 2d51672..48fa09a 100644
>> --- a/samples/sockmap/sockmap_user.c
>> +++ b/samples/sockmap/sockmap_user.c
>> @@ -23,6 +23,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
> [...]
> 



Re: [bpf-next PATCH v2 1/7] bpf: refactor sockmap sample program update for arg parsing

2018-01-11 Thread John Fastabend
On 01/10/2018 05:25 PM, Daniel Borkmann wrote:
> On 01/10/2018 07:39 PM, John Fastabend wrote:
>> sockmap sample program takes arguments from cmd line but it reads them
>> in using offsets into the array. Because we want to add more arguments
>> in the future lets do proper argument handling.
>>
>> Also refactor code to pull apart sock init and ping/pong test. This
>> allows us to add new tests in the future.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  samples/sockmap/sockmap_user.c |  142 
>> +---
>>  1 file changed, 103 insertions(+), 39 deletions(-)
> [...]
>>  
>>  /* Accept Connecrtions */
>> @@ -149,23 +177,32 @@ static int sockmap_test_sockets(int rate, int dot)
>>  goto out;
>>  }
>>  
>> -max_fd = p2;
>> -timeout.tv_sec = 10;
>> -timeout.tv_usec = 0;
>> -
>>  printf("connected sockets: c1 <-> p1, c2 <-> p2\n");
>>  printf("cgroups binding: c1(%i) <-> s1(%i) - - - c2(%i) <-> s2(%i)\n",
>>  c1, s1, c2, s2);
>> +out:
>> +return err;
> 
> Maybe rather than setting err and goto out where we now just return
> err anyway, return from those places directly.
> 

Perhaps but how about doing this in another patch. This
patch is not changing the goto err pattern. I can send
a follow up.

>> +}
>> +
>> +static int forever_ping_pong(int rate, int verbose)
>> +{
>> +struct timeval timeout;
>> +char buf[1024] = {0};
>> +int sc;
>> +
>> +timeout.tv_sec = 10;
>> +timeout.tv_usec = 0;
>>  
>>  /* Ping/Pong data from client to server */
>>  sc = send(c1, buf, sizeof(buf), 0);
>>  if (sc < 0) {
>>  perror("send failed()\n");
>> -goto out;
>> +return sc;
>>  }
>>  
>>  do {
>> -int s, rc, i;
>> +int s, rc, i, max_fd = p2;
>> +fd_set w;
>>  
>>  /* FD sets */
>>  FD_ZERO(&w);
> [...]
>> -err = sockmap_test_sockets(rate, dot);
>> +err = sockmap_init_sockets();
>>  if (err) {
>>  fprintf(stderr, "ERROR: test socket failed: %d\n", err);
>> -return err;
>> +goto out;
>>  }
>> -return 0;
>> +
>> +err = forever_ping_pong(rate, verbose);
>> +out:
>> +close(s1);
>> +close(s2);
>> +close(p1);
>> +close(p2);
>> +close(c1);
>> +close(c2);
>> +return err;
>>  }
>>  
>>  void running_handler(int a)
>>  {
>>  running = 0;
>> +printf("\n");
> 
> Do we need this out of the sighandler instead of e.g. main loop when
> we break out?

Not really, let me just remove it. I'll do
another patch with documentation and fix up
error messages so we don't have this issue.

I agree it's a bit clumsy.


[PATCH bpf-next v2 03/15] bpf: hashtab: move checks out of alloc function

2018-01-11 Thread Jakub Kicinski
Use the new callback to perform allocation checks for hash maps.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 kernel/bpf/hashtab.c | 55 +---
 1 file changed, 39 insertions(+), 16 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index b80f42adf068..7fd6519444d3 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -227,7 +227,7 @@ static int alloc_extra_elems(struct bpf_htab *htab)
 }
 
 /* Called from syscall */
-static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
+static int htab_map_alloc_check(union bpf_attr *attr)
 {
bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
   attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
@@ -241,9 +241,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
int numa_node = bpf_map_attr_numa_node(attr);
-   struct bpf_htab *htab;
-   int err, i;
-   u64 cost;
 
BUILD_BUG_ON(offsetof(struct htab_elem, htab) !=
 offsetof(struct htab_elem, hash_node.pprev));
@@ -254,33 +251,33 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
/* LRU implementation is much complicated than other
 * maps.  Hence, limit to CAP_SYS_ADMIN for now.
 */
-   return ERR_PTR(-EPERM);
+   return -EPERM;
 
if (attr->map_flags & ~HTAB_CREATE_FLAG_MASK)
/* reserved bits should not be used */
-   return ERR_PTR(-EINVAL);
+   return -EINVAL;
 
if (!lru && percpu_lru)
-   return ERR_PTR(-EINVAL);
+   return -EINVAL;
 
if (lru && !prealloc)
-   return ERR_PTR(-ENOTSUPP);
+   return -ENOTSUPP;
 
if (numa_node != NUMA_NO_NODE && (percpu || percpu_lru))
-   return ERR_PTR(-EINVAL);
+   return -EINVAL;
 
/* check sanity of attributes.
 * value_size == 0 may be allowed in the future to use map as a set
 */
if (attr->max_entries == 0 || attr->key_size == 0 ||
attr->value_size == 0)
-   return ERR_PTR(-EINVAL);
+   return -EINVAL;
 
if (attr->key_size > MAX_BPF_STACK)
/* eBPF programs initialize keys on stack, so they cannot be
 * larger than max stack size
 */
-   return ERR_PTR(-E2BIG);
+   return -E2BIG;
 
if (attr->value_size >= KMALLOC_MAX_SIZE -
MAX_BPF_STACK - sizeof(struct htab_elem))
@@ -289,7 +286,28 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 * sure that the elem_size doesn't overflow and it's
 * kmalloc-able later in htab_map_update_elem()
 */
-   return ERR_PTR(-E2BIG);
+   return -E2BIG;
+
+   return 0;
+}
+
+static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
+{
+   bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+  attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
+   bool lru = (attr->map_type == BPF_MAP_TYPE_LRU_HASH ||
+   attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
+   /* percpu_lru means each cpu has its own LRU list.
+* it is different from BPF_MAP_TYPE_PERCPU_HASH where
+* the map's value itself is percpu.  percpu_lru has
+* nothing to do with the map's value.
+*/
+   bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
+   bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
+   int numa_node = bpf_map_attr_numa_node(attr);
+   struct bpf_htab *htab;
+   int err, i;
+   u64 cost;
 
htab = kzalloc(sizeof(*htab), GFP_USER);
if (!htab)
@@ -1142,6 +1160,7 @@ static void htab_map_free(struct bpf_map *map)
 }
 
 const struct bpf_map_ops htab_map_ops = {
+   .map_alloc_check = htab_map_alloc_check,
.map_alloc = htab_map_alloc,
.map_free = htab_map_free,
.map_get_next_key = htab_map_get_next_key,
@@ -1152,6 +1171,7 @@ const struct bpf_map_ops htab_map_ops = {
 };
 
 const struct bpf_map_ops htab_lru_map_ops = {
+   .map_alloc_check = htab_map_alloc_check,
.map_alloc = htab_map_alloc,
.map_free = htab_map_free,
.map_get_next_key = htab_map_get_next_key,
@@ -1235,6 +1255,7 @@ int bpf_percpu_hash_update(struct bpf_map *map, void 
*key, void *value,
 }
 
 const struct bpf_map_ops htab_percpu_map_ops = {
+   .map_alloc_check = htab_map_alloc_check,
.map_alloc = htab_map_alloc,
.map_free = htab_map_free,
.map_get_next_key = htab_map_get_next_key,
@@ -1244,6 +1265,7 @@ const struct bpf_map_ops 

[PATCH bpf-next v2 07/15] bpf: offload: add map offload infrastructure

2018-01-11 Thread Jakub Kicinski
BPF map offload follows a similar path to program offload.  At creation
time users may specify ifindex of the device on which they want to
create the map.  Map will be validated by the kernel's
.map_alloc_check callback and device driver will be called for the
actual allocation.  Map will have an empty set of operations
associated with it (save for alloc and free callbacks).  The real
device callbacks are kept in map->offload->dev_ops because they
have slightly different signatures.  Map operations are called in
process context so the driver may communicate with HW freely,
msleep(), wait() etc.

Map alloc and free callbacks are muxed via existing .ndo_bpf, and
are always called with rtnl lock held.  Maps and programs are
guaranteed to be destroyed before .ndo_uninit (i.e. before
unregister_netdev() returns).  Map callbacks are invoked with
bpf_devs_lock *read* locked, drivers must take care of exclusive
locking if necessary.

All offload-specific branches are marked with unlikely() (through
bpf_map_is_dev_bound()), given that branch penalty will be
negligible compared to IO anyway, and we don't want to penalize
SW path unnecessarily.
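
From the driver side the contract looks roughly like the sketch below
(the my_* names are placeholders, not part of this patch; the nfp
patches later in the series are the real example). The callbacks may
sleep, since they run in process context with bpf_devs_lock read-held:

	static const struct bpf_map_dev_ops my_map_dev_ops = {
		.map_get_next_key	= my_map_get_next_key,
		.map_lookup_elem	= my_map_lookup_elem,
		.map_update_elem	= my_map_update_elem,
		.map_delete_elem	= my_map_delete_elem,
	};

	/* in the BPF_OFFLOAD_MAP_ALLOC handler of .ndo_bpf: */
	offmap->dev_ops = &my_map_dev_ops;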

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 include/linux/bpf.h|  59 +
 include/linux/netdevice.h  |   6 ++
 include/uapi/linux/bpf.h   |   1 +
 kernel/bpf/offload.c   | 188 +++--
 kernel/bpf/syscall.c   |  44 --
 kernel/bpf/verifier.c  |   7 ++
 tools/include/uapi/linux/bpf.h |   1 +
 7 files changed, 293 insertions(+), 13 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0534722ba1d8..b198c7554538 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -74,6 +74,33 @@ struct bpf_map {
char name[BPF_OBJ_NAME_LEN];
 };
 
+struct bpf_offloaded_map;
+
+struct bpf_map_dev_ops {
+   int (*map_get_next_key)(struct bpf_offloaded_map *map,
+   void *key, void *next_key);
+   int (*map_lookup_elem)(struct bpf_offloaded_map *map,
+  void *key, void *value);
+   int (*map_update_elem)(struct bpf_offloaded_map *map,
+  void *key, void *value, u64 flags);
+   int (*map_delete_elem)(struct bpf_offloaded_map *map, void *key);
+};
+
+struct bpf_offloaded_map {
+   struct bpf_map map;
+   struct net_device *netdev;
+   const struct bpf_map_dev_ops *dev_ops;
+   void *dev_priv;
+   struct list_head offloads;
+};
+
+static inline struct bpf_offloaded_map *map_to_offmap(struct bpf_map *map)
+{
+   return container_of(map, struct bpf_offloaded_map, map);
+}
+
+extern const struct bpf_map_ops bpf_map_offload_ops;
+
 /* function argument constraints */
 enum bpf_arg_type {
ARG_DONTCARE = 0,   /* unused argument in helper function */
@@ -369,6 +396,7 @@ int __bpf_prog_charge(struct user_struct *user, u32 pages);
 void __bpf_prog_uncharge(struct user_struct *user, u32 pages);
 
 void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
+void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
 
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
@@ -556,6 +584,15 @@ void bpf_prog_offload_destroy(struct bpf_prog *prog);
 int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
   struct bpf_prog *prog);
 
+int bpf_map_offload_lookup_elem(struct bpf_map *map, void *key, void *value);
+int bpf_map_offload_update_elem(struct bpf_map *map,
+   void *key, void *value, u64 flags);
+int bpf_map_offload_delete_elem(struct bpf_map *map, void *key);
+int bpf_map_offload_get_next_key(struct bpf_map *map,
+void *key, void *next_key);
+
+bool bpf_offload_dev_match(struct bpf_prog *prog, struct bpf_map *map);
+
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
 
@@ -563,6 +600,14 @@ static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux)
 {
return aux->offload_requested;
 }
+
+static inline bool bpf_map_is_dev_bound(struct bpf_map *map)
+{
+   return unlikely(map->ops == &bpf_map_offload_ops);
+}
+
+struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr);
+void bpf_map_offload_map_free(struct bpf_map *map);
 #else
 static inline int bpf_prog_offload_init(struct bpf_prog *prog,
union bpf_attr *attr)
@@ -574,6 +619,20 @@ static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux)
 {
return false;
 }
+
+static inline bool bpf_map_is_dev_bound(struct bpf_map *map)
+{
+   return false;
+}
+
+static inline struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
+{
+   return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline void bpf_map_offload_map_free(struct 

[PATCH bpf-next v2 04/15] bpf: add helper for copying attrs to struct bpf_map

2018-01-11 Thread Jakub Kicinski
All map types reimplement the field-by-field copy of union bpf_attr
members into struct bpf_map.  Add a helper to perform this operation.
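
The helper centralizes the copy each call site below used to do by
hand; mirroring the removed lines, its body is simply:

	void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
	{
		map->map_type = attr->map_type;
		map->key_size = attr->key_size;
		map->value_size = attr->value_size;
		map->max_entries = attr->max_entries;
		map->map_flags = attr->map_flags;
		map->numa_node = bpf_map_attr_numa_node(attr);
	}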

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h   |  1 +
 kernel/bpf/cpumap.c   |  8 +---
 kernel/bpf/devmap.c   |  8 +---
 kernel/bpf/hashtab.c  |  9 +
 kernel/bpf/lpm_trie.c |  7 +--
 kernel/bpf/sockmap.c  |  8 +---
 kernel/bpf/stackmap.c |  6 +-
 kernel/bpf/syscall.c  | 10 ++
 8 files changed, 17 insertions(+), 40 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2041ac5db2a3..cfbee9f83fbe 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -378,6 +378,7 @@ void bpf_map_put(struct bpf_map *map);
 int bpf_map_precharge_memlock(u32 pages);
 void *bpf_map_area_alloc(size_t size, int numa_node);
 void bpf_map_area_free(void *base);
+void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
 
 extern int sysctl_unprivileged_bpf_disabled;
 
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index ce5b669003b2..192151ec9d12 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -94,13 +94,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
if (!cmap)
return ERR_PTR(-ENOMEM);
 
-   /* mandatory map attributes */
-   cmap->map.map_type = attr->map_type;
-   cmap->map.key_size = attr->key_size;
-   cmap->map.value_size = attr->value_size;
-   cmap->map.max_entries = attr->max_entries;
-   cmap->map.map_flags = attr->map_flags;
-   cmap->map.numa_node = bpf_map_attr_numa_node(attr);
+   bpf_map_init_from_attr(&cmap->map, attr);
 
/* Pre-limit array size based on NR_CPUS, not final CPU check */
if (cmap->map.max_entries > NR_CPUS) {
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index ebdef54bf7df..565f9ece9115 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -93,13 +93,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
if (!dtab)
return ERR_PTR(-ENOMEM);
 
-   /* mandatory map attributes */
-   dtab->map.map_type = attr->map_type;
-   dtab->map.key_size = attr->key_size;
-   dtab->map.value_size = attr->value_size;
-   dtab->map.max_entries = attr->max_entries;
-   dtab->map.map_flags = attr->map_flags;
-   dtab->map.numa_node = bpf_map_attr_numa_node(attr);
+   bpf_map_init_from_attr(&dtab->map, attr);
 
/* make sure page count doesn't overflow */
cost = (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 7fd6519444d3..b76828f23b49 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -304,7 +304,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 */
bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
-   int numa_node = bpf_map_attr_numa_node(attr);
struct bpf_htab *htab;
int err, i;
u64 cost;
@@ -313,13 +312,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
if (!htab)
return ERR_PTR(-ENOMEM);
 
-   /* mandatory map attributes */
-   htab->map.map_type = attr->map_type;
-   htab->map.key_size = attr->key_size;
-   htab->map.value_size = attr->value_size;
-   htab->map.max_entries = attr->max_entries;
-   htab->map.map_flags = attr->map_flags;
-   htab->map.numa_node = numa_node;
+   bpf_map_init_from_attr(&htab->map, attr);
 
if (percpu_lru) {
/* ensure each CPU's lru list has >=1 elements.
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 885e45479680..584e02227671 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -522,12 +522,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
return ERR_PTR(-ENOMEM);
 
/* copy mandatory map attributes */
-   trie->map.map_type = attr->map_type;
-   trie->map.key_size = attr->key_size;
-   trie->map.value_size = attr->value_size;
-   trie->map.max_entries = attr->max_entries;
-   trie->map.map_flags = attr->map_flags;
-   trie->map.numa_node = bpf_map_attr_numa_node(attr);
+   bpf_map_init_from_attr(&trie->map, attr);
trie->data_size = attr->key_size -
  offsetof(struct bpf_lpm_trie_key, data);
trie->max_prefixlen = trie->data_size * 8;
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 079968680bc3..0314d1783d77 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -513,13 +513,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
if (!stab)
return ERR_PTR(-ENOMEM);
 
-   /* mandatory map attributes */
-   stab->map.map_type = attr->map_type;
- 

[PATCH bpf-next v2 06/15] bpf: offload: factor out netdev checking at allocation time

2018-01-11 Thread Jakub Kicinski
Add a helper to check if the netdev could be found and whether it
has a .ndo_bpf callback.  There is no need to check the callback
every time it's invoked; ndos can't reasonably be swapped for
a set without .ndo_bpf while a program is loaded.

bpf_dev_offload_check() will also be used by map offload.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 kernel/bpf/offload.c | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 001ddfde7874..cdd1e19a668b 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -30,9 +30,19 @@
 static DECLARE_RWSEM(bpf_devs_lock);
 static LIST_HEAD(bpf_prog_offload_devs);
 
+static int bpf_dev_offload_check(struct net_device *netdev)
+{
+   if (!netdev)
+   return -EINVAL;
+   if (!netdev->netdev_ops->ndo_bpf)
+   return -EOPNOTSUPP;
+   return 0;
+}
+
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 {
struct bpf_prog_offload *offload;
+   int err;
 
if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
attr->prog_type != BPF_PROG_TYPE_XDP)
@@ -49,12 +59,15 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 
offload->netdev = dev_get_by_index(current->nsproxy->net_ns,
   attr->prog_ifindex);
-   if (!offload->netdev)
-   goto err_free;
+   err = bpf_dev_offload_check(offload->netdev);
+   if (err)
+   goto err_maybe_put;
 
down_write(&bpf_devs_lock);
-   if (offload->netdev->reg_state != NETREG_REGISTERED)
+   if (offload->netdev->reg_state != NETREG_REGISTERED) {
+   err = -EINVAL;
goto err_unlock;
+   }
prog->aux->offload = offload;
list_add_tail(>offloads, _prog_offload_devs);
dev_put(offload->netdev);
@@ -63,10 +76,11 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
return 0;
 err_unlock:
up_write(&bpf_devs_lock);
-   dev_put(offload->netdev);
-err_free:
+err_maybe_put:
+   if (offload->netdev)
+   dev_put(offload->netdev);
kfree(offload);
-   return -EINVAL;
+   return err;
 }
 
 static int __bpf_offload_ndo(struct bpf_prog *prog, enum bpf_netdev_command cmd,
@@ -80,8 +94,6 @@ static int __bpf_offload_ndo(struct bpf_prog *prog, enum bpf_netdev_command cmd,
if (!offload)
return -ENODEV;
netdev = offload->netdev;
-   if (!netdev->netdev_ops->ndo_bpf)
-   return -EOPNOTSUPP;
 
data->command = cmd;
 
-- 
2.15.1



[PATCH bpf-next v2 15/15] nfp: bpf: implement bpf map offload

2018-01-11 Thread Jakub Kicinski
Plug in to the stack's map offload callbacks for BPF map offload.
The get-next call needs some special handling on the FW side: since
we can't send a NULL pointer to the FW, there is a separate
get-first-entry FW command.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c|   1 +
 drivers/net/ethernet/netronome/nfp/bpf/main.h|   4 +
 drivers/net/ethernet/netronome/nfp/bpf/offload.c | 104 +++
 3 files changed, 109 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 7d5cc59feb7e..8823c8360047 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -381,6 +381,7 @@ static void nfp_bpf_clean(struct nfp_app *app)
 
WARN_ON(!skb_queue_empty(>cmsg_replies));
WARN_ON(!list_empty(>map_list));
+   WARN_ON(bpf->maps_in_use || bpf->map_elems_in_use);
kfree(bpf);
 }
 
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 59197535c465..b80e75a8ecda 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -108,6 +108,8 @@ enum pkt_vec {
  * @cmsg_wq:   work queue for waiting for cmsg replies
  *
  * @map_list:  list of offloaded maps
+ * @maps_in_use:   number of currently offloaded maps
+ * @map_elems_in_use:  number of elements allocated to offloaded maps
  *
  * @adjust_head:   adjust head capability
  * @flags: extra flags for adjust head
@@ -138,6 +140,8 @@ struct nfp_app_bpf {
struct wait_queue_head cmsg_wq;
 
struct list_head map_list;
+   unsigned int maps_in_use;
+   unsigned int map_elems_in_use;
 
struct nfp_bpf_cap_adjust_head {
u32 flags;
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 6590228d3755..e2859b2e9c6a 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -36,6 +36,9 @@
  * Netronome network device driver: TC offload functions for PF and VF
  */
 
+#define pr_fmt(fmt)"NFP net bpf: " fmt
+
+#include 
 #include 
 #include 
 #include 
@@ -153,6 +156,103 @@ static int nfp_bpf_destroy(struct nfp_net *nn, struct bpf_prog *prog)
return 0;
 }
 
+static int
+nfp_bpf_map_get_next_key(struct bpf_offloaded_map *offmap,
+void *key, void *next_key)
+{
+   if (!key)
+   return nfp_bpf_ctrl_getfirst_entry(offmap, next_key);
+   return nfp_bpf_ctrl_getnext_entry(offmap, key, next_key);
+}
+
+static int
+nfp_bpf_map_delete_elem(struct bpf_offloaded_map *offmap, void *key)
+{
+   return nfp_bpf_ctrl_del_entry(offmap, key);
+}
+
+static const struct bpf_map_dev_ops nfp_bpf_map_ops = {
+   .map_get_next_key   = nfp_bpf_map_get_next_key,
+   .map_lookup_elem= nfp_bpf_ctrl_lookup_entry,
+   .map_update_elem= nfp_bpf_ctrl_update_entry,
+   .map_delete_elem= nfp_bpf_map_delete_elem,
+};
+
+static int
+nfp_bpf_map_alloc(struct nfp_app_bpf *bpf, struct bpf_offloaded_map *offmap)
+{
+   struct nfp_bpf_map *nfp_map;
+   long long int res;
+
+   if (!bpf->maps.types)
+   return -EOPNOTSUPP;
+
+   if (offmap->map.map_flags ||
+   offmap->map.numa_node != NUMA_NO_NODE) {
+   pr_info("map flags are not supported\n");
+   return -EINVAL;
+   }
+
+   if (!(bpf->maps.types & 1 << offmap->map.map_type)) {
+   pr_info("map type not supported\n");
+   return -EOPNOTSUPP;
+   }
+   if (bpf->maps.max_maps == bpf->maps_in_use) {
+   pr_info("too many maps for a device\n");
+   return -ENOMEM;
+   }
+   if (bpf->maps.max_elems - bpf->map_elems_in_use <
+   offmap->map.max_entries) {
+   pr_info("map with too many elements: %u, left: %u\n",
+   offmap->map.max_entries,
+   bpf->maps.max_elems - bpf->map_elems_in_use);
+   return -ENOMEM;
+   }
+   if (offmap->map.key_size > bpf->maps.max_key_sz ||
+   offmap->map.value_size > bpf->maps.max_val_sz ||
+   round_up(offmap->map.key_size, 8) +
+   round_up(offmap->map.value_size, 8) > bpf->maps.max_elem_sz) {
+   pr_info("elements don't fit in device constraints\n");
+   return -ENOMEM;
+   }
+
+   nfp_map = kzalloc(sizeof(*nfp_map), GFP_USER);
+   if (!nfp_map)
+   return -ENOMEM;
+
+   offmap->dev_priv = nfp_map;
+   nfp_map->offmap = offmap;
+   nfp_map->bpf = bpf;
+
+   res = nfp_bpf_ctrl_alloc_map(bpf, &offmap->map);
+   if (res < 0) {
+   kfree(nfp_map);
+  

[PATCH bpf-next v2 01/15] bpf: add map_alloc_check callback

2018-01-11 Thread Jakub Kicinski
.map_alloc callbacks contain a number of checks validating user-
provided map attributes against constraints of a particular map
type.  For offloaded maps we will need to check map attributes
without actually allocating any memory on the host.  Add a new
callback for validating attributes before any memory is allocated.
This callback can be selectively implemented by map types for
sharing code with offloads, or simply to separate the logical
steps of validation and allocation.
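
A sketch of how a map type can use the split (the my_map_* names are
illustrative, not from this patch):

	static int my_map_alloc_check(union bpf_attr *attr)
	{
		/* pure validation, no memory allocated yet */
		if (attr->key_size == 0 || attr->value_size == 0 ||
		    attr->max_entries == 0)
			return -EINVAL;
		return 0;
	}

	static struct bpf_map *my_map_alloc(union bpf_attr *attr)
	{
		/* attributes already validated by ->map_alloc_check() */
		struct bpf_map *map = kzalloc(sizeof(*map), GFP_USER);

		return map ? map : ERR_PTR(-ENOMEM);
	}

	const struct bpf_map_ops my_map_ops = {
		.map_alloc_check = my_map_alloc_check,
		.map_alloc	 = my_map_alloc,
	};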

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 include/linux/bpf.h  |  1 +
 kernel/bpf/syscall.c | 17 +
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 44f26f6df8fc..2041ac5db2a3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -25,6 +25,7 @@ struct bpf_map;
 /* map is generic key/value storage optionally accesible by eBPF programs */
 struct bpf_map_ops {
/* funcs callable from userspace (via syscall) */
+   int (*map_alloc_check)(union bpf_attr *attr);
struct bpf_map *(*map_alloc)(union bpf_attr *attr);
void (*map_release)(struct bpf_map *map, struct file *map_file);
void (*map_free)(struct bpf_map *map);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2bac0dc8baba..c0ac03a04880 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -96,16 +96,25 @@ static int check_uarg_tail_zero(void __user *uaddr,
 
 static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
 {
+   const struct bpf_map_ops *ops;
struct bpf_map *map;
+   int err;
 
-   if (attr->map_type >= ARRAY_SIZE(bpf_map_types) ||
-   !bpf_map_types[attr->map_type])
+   if (attr->map_type >= ARRAY_SIZE(bpf_map_types))
+   return ERR_PTR(-EINVAL);
+   ops = bpf_map_types[attr->map_type];
+   if (!ops)
return ERR_PTR(-EINVAL);
 
-   map = bpf_map_types[attr->map_type]->map_alloc(attr);
+   if (ops->map_alloc_check) {
+   err = ops->map_alloc_check(attr);
+   if (err)
+   return ERR_PTR(err);
+   }
+   map = ops->map_alloc(attr);
if (IS_ERR(map))
return map;
-   map->ops = bpf_map_types[attr->map_type];
+   map->ops = ops;
map->map_type = attr->map_type;
return map;
 }
-- 
2.15.1
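
For illustration, a map type can wire the new callback up as follows (a
sketch with hypothetical "example" names; bpf_map_init_from_attr() is
the attr-copy helper added later in this series):

	static int example_map_alloc_check(union bpf_attr *attr)
	{
		if (attr->max_entries == 0 || attr->key_size == 0 ||
		    attr->value_size == 0)
			return -EINVAL;
		return 0;
	}

	static struct bpf_map *example_map_alloc(union bpf_attr *attr)
	{
		struct bpf_map *map;

		/* attributes were already validated by ->map_alloc_check() */
		map = kzalloc(sizeof(*map), GFP_USER);
		if (!map)
			return ERR_PTR(-ENOMEM);
		bpf_map_init_from_attr(map, attr);
		return map;
	}

	static const struct bpf_map_ops example_map_ops = {
		.map_alloc_check = example_map_alloc_check,
		.map_alloc	 = example_map_alloc,
	};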



[PATCH bpf-next v2 09/15] nfp: bpf: add basic control channel communication

2018-01-11 Thread Jakub Kicinski
For map support we will need to send and receive control messages.
Add basic support for sending a message to FW, and waiting for a
reply.

Control messages are tagged with a 16 bit ID.  Add a simple ID
allocator and make sure we don't allow too many messages in flight,
to avoid request <> reply mismatches.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/Makefile|   1 +
 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c  | 238 +
 drivers/net/ethernet/netronome/nfp/bpf/fw.h|  22 ++
 drivers/net/ethernet/netronome/nfp/bpf/main.c  |   5 +
 drivers/net/ethernet/netronome/nfp/bpf/main.h  |  23 ++
 drivers/net/ethernet/netronome/nfp/nfp_app.h   |   9 +
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |  12 ++
 .../net/ethernet/netronome/nfp/nfp_net_common.c|   7 +
 8 files changed, 317 insertions(+)
 create mode 100644 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c

diff --git a/drivers/net/ethernet/netronome/nfp/Makefile b/drivers/net/ethernet/netronome/nfp/Makefile
index 6e5ef984398b..064f00e23a19 100644
--- a/drivers/net/ethernet/netronome/nfp/Makefile
+++ b/drivers/net/ethernet/netronome/nfp/Makefile
@@ -44,6 +44,7 @@ endif
 
 ifeq ($(CONFIG_BPF_SYSCALL),y)
 nfp-objs += \
+   bpf/cmsg.o \
bpf/main.o \
bpf/offload.o \
bpf/verifier.o \
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
new file mode 100644
index ..46753ee9f7c5
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
@@ -0,0 +1,238 @@
+/*
+ * Copyright (C) 2017 Netronome Systems, Inc.
+ *
+ * This software is dual licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree or the BSD 2-Clause License provided below.  You have the
+ * option to license this software under the complete terms of either license.
+ *
+ * The BSD 2-Clause License:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  1. Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ *  2. Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../nfp_app.h"
+#include "../nfp_net.h"
+#include "fw.h"
+#include "main.h"
+
+#define cmsg_warn(bpf, msg...) nn_dp_warn(&(bpf)->app->ctrl->dp, msg)
+
+#define NFP_BPF_TAG_ALLOC_SPAN (U16_MAX / 4)
+
+static bool nfp_bpf_all_tags_busy(struct nfp_app_bpf *bpf)
+{
+   u16 used_tags;
+
+   used_tags = bpf->tag_alloc_next - bpf->tag_alloc_last;
+
+   return used_tags > NFP_BPF_TAG_ALLOC_SPAN;
+}
+
+static int nfp_bpf_alloc_tag(struct nfp_app_bpf *bpf)
+{
+   /* All FW communication for BPF is request-reply.  To make sure we
+* don't reuse the message ID too early after timeout - limit the
+* number of requests in flight.
+*/
+   if (nfp_bpf_all_tags_busy(bpf)) {
+   cmsg_warn(bpf, "all FW request contexts busy!\n");
+   return -EAGAIN;
+   }
+
+   WARN_ON(__test_and_set_bit(bpf->tag_alloc_next, bpf->tag_allocator));
+   return bpf->tag_alloc_next++;
+}
+
+static void nfp_bpf_free_tag(struct nfp_app_bpf *bpf, u16 tag)
+{
+   WARN_ON(!__test_and_clear_bit(tag, bpf->tag_allocator));
+
+   while (!test_bit(bpf->tag_alloc_last, bpf->tag_allocator) &&
+  bpf->tag_alloc_last != bpf->tag_alloc_next)
+   bpf->tag_alloc_last++;
+}
+
+static unsigned int nfp_bpf_cmsg_get_tag(struct sk_buff *skb)
+{
+   struct cmsg_hdr *hdr;
+
+   hdr = (struct cmsg_hdr *)skb->data;
+
+   return be16_to_cpu(hdr->tag);
+}
+
+static struct sk_buff *__nfp_bpf_reply(struct nfp_app_bpf *bpf, u16 tag)
+{
+   unsigned int msg_tag;
+   struct sk_buff *skb;
+
+   skb_queue_walk(&bpf->cmsg_replies, skb) {
+   msg_tag = nfp_bpf_cmsg_get_tag(skb);
+   if (msg_tag == tag) {
+
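
The allocator above keeps a sliding window of in-flight tags between
tag_alloc_last and tag_alloc_next.  A self-contained user-space model
of the same scheme (illustrative only, not driver code):

	#include <stdbool.h>
	#include <stdint.h>

	#define SPAN	(UINT16_MAX / 4)

	static uint16_t alloc_next, alloc_last;
	static bool busy[UINT16_MAX + 1];

	static int tag_alloc(void)
	{
		/* limit requests in flight, cf. nfp_bpf_all_tags_busy() */
		if ((uint16_t)(alloc_next - alloc_last) > SPAN)
			return -1;
		busy[alloc_next] = true;
		return alloc_next++;
	}

	static void tag_free(uint16_t tag)
	{
		busy[tag] = false;
		/* advance the window past contiguously freed tags */
		while (!busy[alloc_last] && alloc_last != alloc_next)
			alloc_last++;
	}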

[PATCH bpf-next v2 10/15] nfp: bpf: implement helpers for FW map ops

2018-01-11 Thread Jakub Kicinski
Implement calls for FW map communication.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 210 +-
 drivers/net/ethernet/netronome/nfp/bpf/fw.h   |  65 
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  17 ++-
 3 files changed, 288 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
index 46753ee9f7c5..71e6586acc36 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
@@ -31,6 +31,7 @@
  * SOFTWARE.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -79,6 +80,28 @@ static void nfp_bpf_free_tag(struct nfp_app_bpf *bpf, u16 tag)
bpf->tag_alloc_last++;
 }
 
+static struct sk_buff *
+nfp_bpf_cmsg_alloc(struct nfp_app_bpf *bpf, unsigned int size)
+{
+   struct sk_buff *skb;
+
+   skb = nfp_app_ctrl_msg_alloc(bpf->app, size, GFP_KERNEL);
+   skb_put(skb, size);
+
+   return skb;
+}
+
+static struct sk_buff *
+nfp_bpf_cmsg_map_req_alloc(struct nfp_app_bpf *bpf, unsigned int n)
+{
+   unsigned int size;
+
+   size = sizeof(struct cmsg_req_map_op);
+   size += sizeof(struct cmsg_key_value_pair) * n;
+
+   return nfp_bpf_cmsg_alloc(bpf, size);
+}
+
 static unsigned int nfp_bpf_cmsg_get_tag(struct sk_buff *skb)
 {
struct cmsg_hdr *hdr;
@@ -159,7 +182,7 @@ nfp_bpf_cmsg_wait_reply(struct nfp_app_bpf *bpf, enum nfp_bpf_cmsg_type type,
return skb;
 }
 
-struct sk_buff *
+static struct sk_buff *
 nfp_bpf_cmsg_communicate(struct nfp_app_bpf *bpf, struct sk_buff *skb,
 enum nfp_bpf_cmsg_type type, unsigned int reply_size)
 {
@@ -206,6 +229,191 @@ nfp_bpf_cmsg_communicate(struct nfp_app_bpf *bpf, struct sk_buff *skb,
return ERR_PTR(-EIO);
 }
 
+static int
+nfp_bpf_ctrl_rc_to_errno(struct nfp_app_bpf *bpf,
+struct cmsg_reply_map_simple *reply)
+{
+   static const int res_table[] = {
+   [CMSG_RC_SUCCESS]   = 0,
+   [CMSG_RC_ERR_MAP_FD]= -EBADFD,
+   [CMSG_RC_ERR_MAP_NOENT] = -ENOENT,
+   [CMSG_RC_ERR_MAP_ERR]   = -EINVAL,
+   [CMSG_RC_ERR_MAP_PARSE] = -EIO,
+   [CMSG_RC_ERR_MAP_EXIST] = -EEXIST,
+   [CMSG_RC_ERR_MAP_NOMEM] = -ENOMEM,
+   [CMSG_RC_ERR_MAP_E2BIG] = -E2BIG,
+   };
+   u32 rc;
+
+   rc = be32_to_cpu(reply->rc);
+   if (rc >= ARRAY_SIZE(res_table)) {
+   cmsg_warn(bpf, "FW responded with invalid status: %u\n", rc);
+   return -EIO;
+   }
+
+   return res_table[rc];
+}
+
+long long int
+nfp_bpf_ctrl_alloc_map(struct nfp_app_bpf *bpf, struct bpf_map *map)
+{
+   struct cmsg_reply_map_alloc_tbl *reply;
+   struct cmsg_req_map_alloc_tbl *req;
+   struct sk_buff *skb;
+   u32 tid;
+   int err;
+
+   skb = nfp_bpf_cmsg_alloc(bpf, sizeof(*req));
+   if (!skb)
+   return -ENOMEM;
+
+   req = (void *)skb->data;
+   req->key_size = cpu_to_be32(map->key_size);
+   req->value_size = cpu_to_be32(map->value_size);
+   req->max_entries = cpu_to_be32(map->max_entries);
+   req->map_type = cpu_to_be32(map->map_type);
+   req->map_flags = 0;
+
+   skb = nfp_bpf_cmsg_communicate(bpf, skb, CMSG_TYPE_MAP_ALLOC,
+  sizeof(*reply));
+   if (IS_ERR(skb))
+   return PTR_ERR(skb);
+
+   reply = (void *)skb->data;
+   err = nfp_bpf_ctrl_rc_to_errno(bpf, &reply->reply_hdr);
+   if (err)
+   goto err_free;
+
+   tid = be32_to_cpu(reply->tid);
+   dev_consume_skb_any(skb);
+
+   return tid;
+err_free:
+   dev_kfree_skb_any(skb);
+   return err;
+}
+
+void nfp_bpf_ctrl_free_map(struct nfp_app_bpf *bpf, struct nfp_bpf_map *nfp_map)
+{
+   struct cmsg_reply_map_free_tbl *reply;
+   struct cmsg_req_map_free_tbl *req;
+   struct sk_buff *skb;
+   int err;
+
+   skb = nfp_bpf_cmsg_alloc(bpf, sizeof(*req));
+   if (!skb) {
+   cmsg_warn(bpf, "leaking map - failed to allocate msg\n");
+   return;
+   }
+
+   req = (void *)skb->data;
+   req->tid = cpu_to_be32(nfp_map->tid);
+
+   skb = nfp_bpf_cmsg_communicate(bpf, skb, CMSG_TYPE_MAP_FREE,
+  sizeof(*reply));
+   if (IS_ERR(skb)) {
+   cmsg_warn(bpf, "leaking map - I/O error\n");
+   return;
+   }
+
+   reply = (void *)skb->data;
+   err = nfp_bpf_ctrl_rc_to_errno(bpf, &reply->reply_hdr);
+   if (err)
+   cmsg_warn(bpf, "leaking map - FW responded with: %d\n", err);
+
+   dev_consume_skb_any(skb);
+}
+
+static int
+nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap,
+ enum nfp_bpf_cmsg_type op,
+  

[PATCH bpf-next v2 08/15] nfp: bpf: add map data structure

2018-01-11 Thread Jakub Kicinski
To be able to split code into reasonable chunks we need to add
the map data structures already.  Later patches will add code
piece by piece.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  7 ++-
 drivers/net/ethernet/netronome/nfp/bpf/main.h | 18 ++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index e8cfe300c8c4..c9fd7d417d1a 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -313,6 +313,8 @@ static int nfp_bpf_init(struct nfp_app *app)
bpf->app = app;
app->priv = bpf;
 
+   INIT_LIST_HEAD(&bpf->map_list);
+
err = nfp_bpf_parse_capabilities(app);
if (err)
goto err_free_bpf;
@@ -326,7 +328,10 @@ static int nfp_bpf_init(struct nfp_app *app)
 
 static void nfp_bpf_clean(struct nfp_app *app)
 {
-   kfree(app->priv);
+   struct nfp_app_bpf *bpf = app->priv;
+
+   WARN_ON(!list_empty(&bpf->map_list));
+   kfree(bpf);
 }
 
 const struct nfp_app_type app_bpf = {
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 66381afee2a9..23763b22f8fc 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -93,6 +93,8 @@ enum pkt_vec {
  * struct nfp_app_bpf - bpf app priv structure
  * @app:   backpointer to the app
  *
+ * @map_list:  list of offloaded maps
+ *
  * @adjust_head:   adjust head capability
  * @flags: extra flags for adjust head
  * @off_min:   minimal packet offset within buffer required
@@ -103,6 +105,8 @@ enum pkt_vec {
 struct nfp_app_bpf {
struct nfp_app *app;
 
+   struct list_head map_list;
+
struct nfp_bpf_cap_adjust_head {
u32 flags;
int off_min;
@@ -112,6 +116,20 @@ struct nfp_app_bpf {
} adjust_head;
 };
 
+/**
+ * struct nfp_bpf_map - private per-map data attached to BPF maps for offload
+ * @offmap:pointer to the offloaded BPF map
+ * @bpf:   back pointer to bpf app private structure
+ * @tid:   table id identifying map on datapath
+ * @l: link on the nfp_app_bpf->map_list list
+ */
+struct nfp_bpf_map {
+   struct bpf_offloaded_map *offmap;
+   struct nfp_app_bpf *bpf;
+   u32 tid;
+   struct list_head l;
+};
+
 struct nfp_prog;
 struct nfp_insn_meta;
 typedef int (*instr_cb_t)(struct nfp_prog *, struct nfp_insn_meta *);
-- 
2.15.1
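
For context, the structures added here are used by the later offload
patches roughly as follows (a sketch; the exact call sites land in the
final patch of the series):

	/* on successful allocation on the device */
	list_add_tail(&nfp_map->l, &bpf->map_list);
	bpf->maps_in_use++;
	bpf->map_elems_in_use += offmap->map.max_entries;

	/* on free */
	list_del_init(&nfp_map->l);
	bpf->maps_in_use--;
	bpf->map_elems_in_use -= offmap->map.max_entries;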



[PATCH bpf-next v2 12/15] nfp: bpf: add helpers for updating immediate instructions

2018-01-11 Thread Jakub Kicinski
Immediate loads are used to load the return address of a helper.
We need to be able to update those loads for relocations.
Immediate loads can be slightly more complex and spread over
two instructions in general, but here we only care about simple
loads of small (< 65k) constants, so complex cases are not handled.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_asm.c | 58 
 drivers/net/ethernet/netronome/nfp/nfp_asm.h |  4 ++
 2 files changed, 62 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_asm.c b/drivers/net/ethernet/netronome/nfp/nfp_asm.c
index 9ee3a3f60cc7..3f6952b66a49 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_asm.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_asm.c
@@ -50,6 +50,11 @@ const struct cmd_tgt_act cmd_tgt_act[__CMD_TGT_MAP_SIZE] = {
[CMD_TGT_READ_SWAP_LE] ={ 0x03, 0x40 },
 };
 
+static bool unreg_is_imm(u16 reg)
+{
+   return (reg & UR_REG_IMM) == UR_REG_IMM;
+}
+
 u16 br_get_offset(u64 instr)
 {
u16 addr_lo, addr_hi;
@@ -80,6 +85,59 @@ void br_add_offset(u64 *instr, u16 offset)
br_set_offset(instr, addr + offset);
 }
 
+static bool immed_can_modify(u64 instr)
+{
+   if (FIELD_GET(OP_IMMED_INV, instr) ||
+   FIELD_GET(OP_IMMED_SHIFT, instr) ||
+   FIELD_GET(OP_IMMED_WIDTH, instr) != IMMED_WIDTH_ALL) {
+   pr_err("Can't decode/encode immed!\n");
+   return false;
+   }
+   return true;
+}
+
+u16 immed_get_value(u64 instr)
+{
+   u16 reg;
+
+   if (!immed_can_modify(instr))
+   return 0;
+
+   reg = FIELD_GET(OP_IMMED_A_SRC, instr);
+   if (!unreg_is_imm(reg))
+   reg = FIELD_GET(OP_IMMED_B_SRC, instr);
+
+   return (reg & 0xff) | FIELD_GET(OP_IMMED_IMM, instr);
+}
+
+void immed_set_value(u64 *instr, u16 immed)
+{
+   if (!immed_can_modify(*instr))
+   return;
+
+   if (unreg_is_imm(FIELD_GET(OP_IMMED_A_SRC, *instr))) {
+   *instr &= ~FIELD_PREP(OP_IMMED_A_SRC, 0xff);
+   *instr |= FIELD_PREP(OP_IMMED_A_SRC, immed & 0xff);
+   } else {
+   *instr &= ~FIELD_PREP(OP_IMMED_B_SRC, 0xff);
+   *instr |= FIELD_PREP(OP_IMMED_B_SRC, immed & 0xff);
+   }
+
+   *instr &= ~OP_IMMED_IMM;
+   *instr |= FIELD_PREP(OP_IMMED_IMM, immed >> 8);
+}
+
+void immed_add_value(u64 *instr, u16 offset)
+{
+   u16 val;
+
+   if (!immed_can_modify(*instr))
+   return;
+
+   val = immed_get_value(*instr);
+   immed_set_value(instr, val + offset);
+}
+
 static u16 nfp_swreg_to_unreg(swreg reg, bool is_dst)
 {
bool lm_id, lm_dec = false;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_asm.h b/drivers/net/ethernet/netronome/nfp/nfp_asm.h
index 20e51cb60e69..5f9291db98e0 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_asm.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_asm.h
@@ -138,6 +138,10 @@ enum immed_shift {
IMMED_SHIFT_2B = 2,
 };
 
+u16 immed_get_value(u64 instr);
+void immed_set_value(u64 *instr, u16 immed);
+void immed_add_value(u64 *instr, u16 offset);
+
 #define OP_SHF_BASE0x080ULL
 #define OP_SHF_A_SRC   0x0ffULL
 #define OP_SHF_SC  0x300ULL
-- 
2.15.1
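
In a later relocation pass these helpers combine roughly like this (a
sketch; compare the RELO_IMMED_REL handling added elsewhere in the
series, where base_off stands for the per-vNIC code base offset):

	/* shift a previously emitted 16-bit immediate load, e.g. a
	 * return address, by the offset the code was relocated by
	 */
	immed_add_value(&prog[i], base_off);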



[PATCH bpf-next v2 11/15] nfp: bpf: parse function call and map capabilities

2018-01-11 Thread Jakub Kicinski
Parse helper function and supported map FW TLV capabilities.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/fw.h   | 16 +
 drivers/net/ethernet/netronome/nfp/bpf/main.c | 47 +++
 drivers/net/ethernet/netronome/nfp/bpf/main.h | 24 ++
 3 files changed, 87 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/fw.h b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
index e0ff68fc9562..cfcc7bcb2c67 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/fw.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
@@ -38,7 +38,14 @@
 #include 
 
 enum bpf_cap_tlv_type {
+   NFP_BPF_CAP_TYPE_FUNC   = 1,
NFP_BPF_CAP_TYPE_ADJUST_HEAD= 2,
+   NFP_BPF_CAP_TYPE_MAPS   = 3,
+};
+
+struct nfp_bpf_cap_tlv_func {
+   __le32 func_id;
+   __le32 func_addr;
 };
 
 struct nfp_bpf_cap_tlv_adjust_head {
@@ -51,6 +58,15 @@ struct nfp_bpf_cap_tlv_adjust_head {
 
 #define NFP_BPF_ADJUST_HEAD_NO_META	BIT(0)
 
+struct nfp_bpf_cap_tlv_maps {
+   __le32 types;
+   __le32 max_maps;
+   __le32 max_elems;
+   __le32 max_key_sz;
+   __le32 max_val_sz;
+   __le32 max_elem_sz;
+};
+
 /*
  * Types defined for map related control messages
  */
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index a14368c6449f..7d5cc59feb7e 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -251,6 +251,45 @@ nfp_bpf_parse_cap_adjust_head(struct nfp_app_bpf *bpf, void __iomem *value,
return 0;
 }
 
+static int
+nfp_bpf_parse_cap_func(struct nfp_app_bpf *bpf, void __iomem *value, u32 length)
+{
+   struct nfp_bpf_cap_tlv_func __iomem *cap = value;
+
+   if (length < sizeof(*cap)) {
+   nfp_err(bpf->app->cpp, "truncated function TLV: %d\n", length);
+   return -EINVAL;
+   }
+
+   switch (readl(&cap->func_id)) {
+   case BPF_FUNC_map_lookup_elem:
+   bpf->helpers.map_lookup = readl(&cap->func_addr);
+   break;
+   }
+
+   return 0;
+}
+
+static int
+nfp_bpf_parse_cap_maps(struct nfp_app_bpf *bpf, void __iomem *value, u32 length)
+{
+   struct nfp_bpf_cap_tlv_maps __iomem *cap = value;
+
+   if (length < sizeof(*cap)) {
+   nfp_err(bpf->app->cpp, "truncated maps TLV: %d\n", length);
+   return -EINVAL;
+   }
+
+   bpf->maps.types = readl(&cap->types);
+   bpf->maps.max_maps = readl(&cap->max_maps);
+   bpf->maps.max_elems = readl(&cap->max_elems);
+   bpf->maps.max_key_sz = readl(&cap->max_key_sz);
+   bpf->maps.max_val_sz = readl(&cap->max_val_sz);
+   bpf->maps.max_elem_sz = readl(&cap->max_elem_sz);
+
+   return 0;
+}
+
 static int nfp_bpf_parse_capabilities(struct nfp_app *app)
 {
struct nfp_cpp *cpp = app->pf->cpp;
@@ -276,11 +315,19 @@ static int nfp_bpf_parse_capabilities(struct nfp_app *app)
goto err_release_free;
 
switch (type) {
+   case NFP_BPF_CAP_TYPE_FUNC:
+   if (nfp_bpf_parse_cap_func(app->priv, value, length))
+   goto err_release_free;
+   break;
case NFP_BPF_CAP_TYPE_ADJUST_HEAD:
if (nfp_bpf_parse_cap_adjust_head(app->priv, value,
  length))
goto err_release_free;
break;
+   case NFP_BPF_CAP_TYPE_MAPS:
+   if (nfp_bpf_parse_cap_maps(app->priv, value, length))
+   goto err_release_free;
+   break;
default:
nfp_dbg(cpp, "unknown BPF capability: %d\n", type);
break;
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 047f253fc581..d381ae8629a2 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -112,6 +112,17 @@ enum pkt_vec {
  * @off_max:   maximum packet offset within buffer required
  * @guaranteed_sub:amount of negative adjustment guaranteed possible
  * @guaranteed_add:amount of positive adjustment guaranteed possible
+ *
+ * @maps:  map capability
+ * @types: supported map types
+ * @max_maps:  max number of maps supported
+ * @max_elems: max number of entries in each map
+ * @max_key_sz:max size of map key
+ * @max_val_sz:max size of map value
+ * @max_elem_sz:   max size of map entry (key + value)
+ *
 * @helpers:   helper addresses for various calls
+ * @map_lookup:map lookup helper address
  */
 struct nfp_app_bpf {

[PATCH bpf-next v2 13/15] nfp: bpf: add verification and codegen for map lookups

2018-01-11 Thread Jakub Kicinski
Verify that our current constraints on the location of the key are
met, and generate the code for calling map lookup on the datapath.

New relocation types have to be added - for helpers and return
addresses.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 86 +++
 drivers/net/ethernet/netronome/nfp/bpf/main.h | 15 +++-
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c | 39 ++
 3 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 47c5224f8d6f..77a5f35d7809 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -483,6 +483,21 @@ static void wrp_immed(struct nfp_prog *nfp_prog, swreg dst, u32 imm)
}
 }
 
+static void
+wrp_immed_relo(struct nfp_prog *nfp_prog, swreg dst, u32 imm,
+  enum nfp_relo_type relo)
+{
+   if (imm > 0xffff) {
+   pr_err("relocation of a large immediate!\n");
+   nfp_prog->error = -EFAULT;
+   return;
+   }
+   emit_immed(nfp_prog, dst, imm, IMMED_WIDTH_ALL, false, IMMED_SHIFT_0B);
+
+   nfp_prog->prog[nfp_prog->prog_len - 1] |=
+   FIELD_PREP(OP_RELO_TYPE, relo);
+}
+
 /* ur_load_imm_any() - encode immediate or use tmp register (unrestricted)
  * If the @imm is small enough encode it directly in operand and return
  * otherwise load @imm to a spare register and return its encoding.
@@ -1279,6 +1294,56 @@ static int adjust_head(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
return 0;
 }
 
+static int
+map_lookup_stack(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+   struct bpf_offloaded_map *offmap;
+   struct nfp_bpf_map *nfp_map;
+   bool load_lm_ptr;
+   u32 ret_tgt;
+   s64 lm_off;
+   swreg tid;
+
+   offmap = (struct bpf_offloaded_map *)meta->arg1.map_ptr;
+   nfp_map = offmap->dev_priv;
+
+   /* We only have to reload LM0 if the key is not at start of stack */
+   lm_off = nfp_prog->stack_depth;
+   lm_off += meta->arg2.var_off.value + meta->arg2.off;
+   load_lm_ptr = meta->arg2_var_off || lm_off;
+
+   /* Set LM0 to start of key */
+   if (load_lm_ptr)
+   emit_csr_wr(nfp_prog, reg_b(2 * 2), NFP_CSR_ACT_LM_ADDR0);
+
+   /* Load map ID into a register, it should actually fit as an immediate
+* but in case it doesn't deal with it here, not in the delay slots.
+*/
+   tid = ur_load_imm_any(nfp_prog, nfp_map->tid, imm_a(nfp_prog));
+
+   emit_br_relo(nfp_prog, BR_UNC, BR_OFF_RELO + BPF_FUNC_map_lookup_elem,
+2, RELO_BR_HELPER);
+   ret_tgt = nfp_prog_current_offset(nfp_prog) + 2;
+
+   /* Load map ID into A0 */
+   wrp_mov(nfp_prog, reg_a(0), tid);
+
+   /* Load the return address into B0 */
+   wrp_immed_relo(nfp_prog, reg_b(0), ret_tgt, RELO_IMMED_REL);
+
+   if (!nfp_prog_confirm_current_offset(nfp_prog, ret_tgt))
+   return -EINVAL;
+
+   /* Reset the LM0 pointer */
+   if (!load_lm_ptr)
+   return 0;
+
+   emit_csr_wr(nfp_prog, stack_reg(nfp_prog),  NFP_CSR_ACT_LM_ADDR0);
+   wrp_nops(nfp_prog, 3);
+
+   return 0;
+}
+
 /* --- Callbacks --- */
 static int mov_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
@@ -2058,6 +2123,8 @@ static int call(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
switch (meta->insn.imm) {
case BPF_FUNC_xdp_adjust_head:
return adjust_head(nfp_prog, meta);
+   case BPF_FUNC_map_lookup_elem:
+   return map_lookup_stack(nfp_prog, meta);
default:
WARN_ONCE(1, "verifier allowed unsupported function\n");
return -EOPNOTSUPP;
@@ -2794,6 +2861,7 @@ void *nfp_bpf_relo_for_vnic(struct nfp_prog *nfp_prog, struct nfp_bpf_vnic *bv)
 
for (i = 0; i < nfp_prog->prog_len; i++) {
enum nfp_relo_type special;
+   u32 val;
 
special = FIELD_GET(OP_RELO_TYPE, prog[i]);
switch (special) {
@@ -2813,6 +2881,24 @@ void *nfp_bpf_relo_for_vnic(struct nfp_prog *nfp_prog, struct nfp_bpf_vnic *bv)
case RELO_BR_NEXT_PKT:
br_set_offset([i], bv->tgt_done);
break;
+   case RELO_BR_HELPER:
+   val = br_get_offset(prog[i]);
+   val -= BR_OFF_RELO;
+   switch (val) {
+   case BPF_FUNC_map_lookup_elem:
+   val = nfp_prog->bpf->helpers.map_lookup;
+   break;
+   default:
+   pr_err("relocation of unknown helper %d\n",
+  val);
+   err = 

[PATCH bpf-next v2 14/15] nfp: bpf: add support for reading map memory

2018-01-11 Thread Jakub Kicinski
Map memory needs to use 40 bit addressing.  Add handling of such
accesses.  Since 40 bit addresses are formed by using both 32 bit
operands, we need to pre-calculate the actual address instead of
adding in the offset inside the instruction, like we did in 32 bit
mode.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 77 ---
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c |  8 +++
 2 files changed, 76 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 77a5f35d7809..cdc949fabe98 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -553,27 +553,51 @@ wrp_reg_subpart(struct nfp_prog *nfp_prog, swreg dst, swreg src, u8 field_len,
emit_ld_field_any(nfp_prog, dst, mask, src, sc, offset * 8, true);
 }
 
+static void
+addr40_offset(struct nfp_prog *nfp_prog, u8 src_gpr, swreg offset,
+ swreg *rega, swreg *regb)
+{
+   if (offset == reg_imm(0)) {
+   *rega = reg_a(src_gpr);
+   *regb = reg_b(src_gpr + 1);
+   return;
+   }
+
+   emit_alu(nfp_prog, imm_a(nfp_prog), reg_a(src_gpr), ALU_OP_ADD, offset);
+   emit_alu(nfp_prog, imm_b(nfp_prog), reg_b(src_gpr + 1), ALU_OP_ADD_C,
+reg_imm(0));
+   *rega = imm_a(nfp_prog);
+   *regb = imm_b(nfp_prog);
+}
+
 /* NFP has Command Push Pull bus which supports bulk memory operations. */
 static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
bool descending_seq = meta->ldst_gather_len < 0;
s16 len = abs(meta->ldst_gather_len);
swreg src_base, off;
+   bool src_40bit_addr;
unsigned int i;
u8 xfer_num;
 
off = re_load_imm_any(nfp_prog, meta->insn.off, imm_b(nfp_prog));
+   src_40bit_addr = meta->ptr.type == PTR_TO_MAP_VALUE;
src_base = reg_a(meta->insn.src_reg * 2);
xfer_num = round_up(len, 4) / 4;
 
+   if (src_40bit_addr)
+   addr40_offset(nfp_prog, meta->insn.src_reg, off, &src_base,
+ &off);
+
/* Setup PREV_ALU fields to override memory read length. */
if (len > 32)
wrp_immed(nfp_prog, reg_none(),
  CMD_OVE_LEN | FIELD_PREP(CMD_OV_LEN, xfer_num - 1));
 
/* Memory read from source addr into transfer-in registers. */
-   emit_cmd_any(nfp_prog, CMD_TGT_READ32_SWAP, CMD_MODE_32b, 0, src_base,
-off, xfer_num - 1, true, len > 32);
+   emit_cmd_any(nfp_prog, CMD_TGT_READ32_SWAP,
+src_40bit_addr ? CMD_MODE_40b_BA : CMD_MODE_32b, 0,
+src_base, off, xfer_num - 1, true, len > 32);
 
/* Move from transfer-in to transfer-out. */
for (i = 0; i < xfer_num; i++)
@@ -711,20 +735,20 @@ data_ld(struct nfp_prog *nfp_prog, swreg offset, u8 dst_gpr, int size)
 }
 
 static int
-data_ld_host_order(struct nfp_prog *nfp_prog, u8 src_gpr, swreg offset,
-  u8 dst_gpr, int size)
+data_ld_host_order(struct nfp_prog *nfp_prog, u8 dst_gpr,
+  swreg lreg, swreg rreg, int size, enum cmd_mode mode)
 {
unsigned int i;
u8 mask, sz;
 
-   /* We load the value from the address indicated in @offset and then
+   /* We load the value from the address indicated in rreg + lreg and then
 * mask out the data we don't need.  Note: this is little endian!
 */
sz = max(size, 4);
mask = size < 4 ? GENMASK(size - 1, 0) : 0;
 
-   emit_cmd(nfp_prog, CMD_TGT_READ32_SWAP, CMD_MODE_32b, 0,
-reg_a(src_gpr), offset, sz / 4 - 1, true);
+   emit_cmd(nfp_prog, CMD_TGT_READ32_SWAP, mode, 0,
+lreg, rreg, sz / 4 - 1, true);
 
i = 0;
if (mask)
@@ -740,6 +764,26 @@ data_ld_host_order(struct nfp_prog *nfp_prog, u8 src_gpr, swreg offset,
return 0;
 }
 
+static int
+data_ld_host_order_addr32(struct nfp_prog *nfp_prog, u8 src_gpr, swreg offset,
+ u8 dst_gpr, u8 size)
+{
+   return data_ld_host_order(nfp_prog, dst_gpr, reg_a(src_gpr), offset,
+ size, CMD_MODE_32b);
+}
+
+static int
+data_ld_host_order_addr40(struct nfp_prog *nfp_prog, u8 src_gpr, swreg offset,
+ u8 dst_gpr, u8 size)
+{
+   swreg rega, regb;
+
+   addr40_offset(nfp_prog, src_gpr, offset, &rega, &regb);
+
+   return data_ld_host_order(nfp_prog, dst_gpr, rega, regb,
+ size, CMD_MODE_40b_BA);
+}
+
 static int
 construct_data_ind_ld(struct nfp_prog *nfp_prog, u16 offset, u16 src, u8 size)
 {
@@ -1778,8 +1822,20 @@ mem_ldx_data(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
 
tmp_reg = re_load_imm_any(nfp_prog, meta->insn.off, 

[PATCH bpf-next v2 02/15] bpf: hashtab: move attribute validation before allocation

2018-01-11 Thread Jakub Kicinski
A number of attribute checks are currently performed after the hashtab
is already allocated.  Move them up so they can be split out into the
check function later on.  The checks now have to be performed on the
attr union directly instead of the members of bpf_map, since bpf_map
will be allocated later.  No functional changes.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 kernel/bpf/hashtab.c | 47 +++
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 3905d4bc5b80..b80f42adf068 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -269,6 +269,28 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
if (numa_node != NUMA_NO_NODE && (percpu || percpu_lru))
return ERR_PTR(-EINVAL);
 
+   /* check sanity of attributes.
+* value_size == 0 may be allowed in the future to use map as a set
+*/
+   if (attr->max_entries == 0 || attr->key_size == 0 ||
+   attr->value_size == 0)
+   return ERR_PTR(-EINVAL);
+
+   if (attr->key_size > MAX_BPF_STACK)
+   /* eBPF programs initialize keys on stack, so they cannot be
+* larger than max stack size
+*/
+   return ERR_PTR(-E2BIG);
+
+   if (attr->value_size >= KMALLOC_MAX_SIZE -
+   MAX_BPF_STACK - sizeof(struct htab_elem))
+   /* if value_size is bigger, the user space won't be able to
+* access the elements via bpf syscall. This check also makes
+* sure that the elem_size doesn't overflow and it's
+* kmalloc-able later in htab_map_update_elem()
+*/
+   return ERR_PTR(-E2BIG);
+
htab = kzalloc(sizeof(*htab), GFP_USER);
if (!htab)
return ERR_PTR(-ENOMEM);
@@ -281,14 +303,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
htab->map.map_flags = attr->map_flags;
htab->map.numa_node = numa_node;
 
-   /* check sanity of attributes.
-* value_size == 0 may be allowed in the future to use map as a set
-*/
-   err = -EINVAL;
-   if (htab->map.max_entries == 0 || htab->map.key_size == 0 ||
-   htab->map.value_size == 0)
-   goto free_htab;
-
if (percpu_lru) {
/* ensure each CPU's lru list has >=1 elements.
 * since we are at it, make each lru list has the same
@@ -304,22 +318,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
/* hash table size must be power of 2 */
htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);
 
-   err = -E2BIG;
-   if (htab->map.key_size > MAX_BPF_STACK)
-   /* eBPF programs initialize keys on stack, so they cannot be
-* larger than max stack size
-*/
-   goto free_htab;
-
-   if (htab->map.value_size >= KMALLOC_MAX_SIZE -
-   MAX_BPF_STACK - sizeof(struct htab_elem))
-   /* if value_size is bigger, the user space won't be able to
-* access the elements via bpf syscall. This check also makes
-* sure that the elem_size doesn't overflow and it's
-* kmalloc-able later in htab_map_update_elem()
-*/
-   goto free_htab;
-
htab->elem_size = sizeof(struct htab_elem) +
  round_up(htab->map.key_size, 8);
if (percpu)
@@ -327,6 +325,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
else
htab->elem_size += round_up(htab->map.value_size, 8);
 
+   err = -E2BIG;
/* prevent zero size kmalloc and check for u32 overflow */
if (htab->n_buckets == 0 ||
htab->n_buckets > U32_MAX / sizeof(struct bucket))
-- 
2.15.1
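
With the checks grouped up front, the follow-up patch ("bpf: hashtab:
move checks out of alloc function") can split them out mechanically,
along these lines (a sketch, not the final code):

	static int htab_map_alloc_check(union bpf_attr *attr)
	{
		if (attr->max_entries == 0 || attr->key_size == 0 ||
		    attr->value_size == 0)
			return -EINVAL;
		if (attr->key_size > MAX_BPF_STACK)
			return -E2BIG;
		if (attr->value_size >= KMALLOC_MAX_SIZE -
		    MAX_BPF_STACK - sizeof(struct htab_elem))
			return -E2BIG;
		return 0;
	}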



[PATCH bpf-next v2 00/15] bpf: support creating maps on networking devices

2018-01-11 Thread Jakub Kicinski
Hi!

This set adds support for creating maps on networking devices.  BPF is
programs+maps, the pure program offload has been around for quite some
time, this patchset adds the map part of the equation.

Maps are allocated on the target device from the start.  There is no
host copy when a map is created on the device.  Device maps are represented
by struct bpf_offloaded_map, regardless of type.  Host programs can't
access such maps, access is only possible from a program also loaded
to the same device and/or via the BPF syscall.

Offloaded programs are currently only allowed to perform lookups;
the control plane is responsible for populating the maps.

For brevity only infrastructure and basic NFP patches are included.
Target device reporting, netdevsim and tests will follow up as well as
some further optimizations to the NFP code.

v2:
 - leave out the array maps, we will add them trivially later to avoid
   merge conflicts with ongoing Spectre mitigations.

Jakub Kicinski (15):
  bpf: add map_alloc_check callback
  bpf: hashtab: move attribute validation before allocation
  bpf: hashtab: move checks out of alloc function
  bpf: add helper for copying attrs to struct bpf_map
  bpf: rename bpf_dev_offload -> bpf_prog_offload
  bpf: offload: factor out netdev checking at allocation time
  bpf: offload: add map offload infrastructure
  nfp: bpf: add map data structure
  nfp: bpf: add basic control channel communication
  nfp: bpf: implement helpers for FW map ops
  nfp: bpf: parse function call and map capabilities
  nfp: bpf: add helpers for updating immediate instructions
  nfp: bpf: add verification and codegen for map lookups
  nfp: bpf: add support for reading map memory
  nfp: bpf: implement bpf map offload

 drivers/net/ethernet/netronome/nfp/Makefile|   1 +
 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c  | 446 +
 drivers/net/ethernet/netronome/nfp/bpf/fw.h| 103 +
 drivers/net/ethernet/netronome/nfp/bpf/jit.c   | 163 +++-
 drivers/net/ethernet/netronome/nfp/bpf/main.c  |  60 ++-
 drivers/net/ethernet/netronome/nfp/bpf/main.h  |  95 -
 drivers/net/ethernet/netronome/nfp/bpf/offload.c   | 106 -
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c  |  47 +++
 drivers/net/ethernet/netronome/nfp/nfp_app.h   |   9 +
 drivers/net/ethernet/netronome/nfp/nfp_asm.c   |  58 +++
 drivers/net/ethernet/netronome/nfp/nfp_asm.h   |   4 +
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |  12 +
 .../net/ethernet/netronome/nfp/nfp_net_common.c|   7 +
 include/linux/bpf.h|  65 ++-
 include/linux/netdevice.h  |   6 +
 include/uapi/linux/bpf.h   |   1 +
 kernel/bpf/cpumap.c|   8 +-
 kernel/bpf/devmap.c|   8 +-
 kernel/bpf/hashtab.c   | 103 +++--
 kernel/bpf/lpm_trie.c  |   7 +-
 kernel/bpf/offload.c   | 224 ++-
 kernel/bpf/sockmap.c   |   8 +-
 kernel/bpf/stackmap.c  |   6 +-
 kernel/bpf/syscall.c   |  71 +++-
 kernel/bpf/verifier.c  |   7 +
 tools/include/uapi/linux/bpf.h |   1 +
 26 files changed, 1506 insertions(+), 120 deletions(-)
 create mode 100644 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c

-- 
2.15.1



[PATCH bpf-next v2 05/15] bpf: rename bpf_dev_offload -> bpf_prog_offload

2018-01-11 Thread Jakub Kicinski
With map offload coming, we need to call program offload structure
something less ambiguous.  Pure rename, no functional changes.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/offload.c |  2 +-
 include/linux/bpf.h  |  4 ++--
 kernel/bpf/offload.c | 10 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 320b2250d29a..6590228d3755 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -237,7 +237,7 @@ int nfp_net_bpf_offload(struct nfp_net *nn, struct bpf_prog *prog,
int err;
 
if (prog) {
-   struct bpf_dev_offload *offload = prog->aux->offload;
+   struct bpf_prog_offload *offload = prog->aux->offload;
 
if (!offload)
return -EINVAL;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cfbee9f83fbe..0534722ba1d8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -200,7 +200,7 @@ struct bpf_prog_offload_ops {
 int insn_idx, int prev_insn_idx);
 };
 
-struct bpf_dev_offload {
+struct bpf_prog_offload {
struct bpf_prog *prog;
struct net_device   *netdev;
void*dev_priv;
@@ -230,7 +230,7 @@ struct bpf_prog_aux {
 #ifdef CONFIG_SECURITY
void *security;
 #endif
-   struct bpf_dev_offload *offload;
+   struct bpf_prog_offload *offload;
union {
struct work_struct work;
struct rcu_head rcu;
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 040d4e0edf3f..001ddfde7874 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -32,7 +32,7 @@ static LIST_HEAD(bpf_prog_offload_devs);
 
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 {
-   struct bpf_dev_offload *offload;
+   struct bpf_prog_offload *offload;
 
if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
attr->prog_type != BPF_PROG_TYPE_XDP)
@@ -72,7 +72,7 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 static int __bpf_offload_ndo(struct bpf_prog *prog, enum bpf_netdev_command 
cmd,
 struct netdev_bpf *data)
 {
-   struct bpf_dev_offload *offload = prog->aux->offload;
+   struct bpf_prog_offload *offload = prog->aux->offload;
struct net_device *netdev;
 
ASSERT_RTNL();
@@ -110,7 +110,7 @@ int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env)
 int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env,
 int insn_idx, int prev_insn_idx)
 {
-   struct bpf_dev_offload *offload;
+   struct bpf_prog_offload *offload;
int ret = -ENODEV;
 
down_read(&bpf_devs_lock);
@@ -124,7 +124,7 @@ int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env,
 
 static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
 {
-   struct bpf_dev_offload *offload = prog->aux->offload;
+   struct bpf_prog_offload *offload = prog->aux->offload;
struct netdev_bpf data = {};
 
data.offload.prog = prog;
@@ -242,7 +242,7 @@ static int bpf_offload_notification(struct notifier_block *notifier,
ulong event, void *ptr)
 {
struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
-   struct bpf_dev_offload *offload, *tmp;
+   struct bpf_prog_offload *offload, *tmp;
 
ASSERT_RTNL();
 
-- 
2.15.1



Re: linux-next: build failure after merge of the net-next tree

2018-01-11 Thread Alexei Starovoitov
On Thu, Jan 11, 2018 at 10:11:45PM -0500, David Miller wrote:
> From: Alexei Starovoitov 
> Date: Wed, 10 Jan 2018 17:58:54 -0800
> 
> > On Thu, Jan 11, 2018 at 11:53:55AM +1100, Stephen Rothwell wrote:
> >> Hi all,
> >> 
> >> After merging the net-next tree, today's linux-next build (x86_64
> >> allmodconfig) failed like this:
> >> 
> >> kernel/bpf/verifier.o: In function `bpf_check':
> >> verifier.c:(.text+0xd86e): undefined reference to `bpf_patch_call_args'
> >> 
> >> Caused by commit
> >> 
> >>   1ea47e01ad6e ("bpf: add support for bpf_call to interpreter")
> >> 
> >> interacting with commit
> >> 
> >>   290af86629b2 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
> >> 
> >> from the bpf and net trees.
> >> 
> >> I have just reverted commit 290af86629b2 for today.  A better solution
>> would be nice (like fixing this in a merge between the net-next and net
> >> trees).
> > 
> > that's due to 'endif' from 290af86629b2 needs to be moved above
> > bpf_patch_call_args() definition.
> 
> That doesn't fix it, because then you'd need to expose
> interpreters_args as well and obviously that can't be right.
> 
> Instead, we should never call bpf_patch_call_args() when JIT always on
> is enabled.  So if we fail to JIT the subprogs we should fail
> immediately.

Right, as I was trying to say, one extra hunk would be needed for net-next.
I was reading this patch:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a2b211262c25..ca80559c4ec3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5267,7 +5267,11 @@ static int fixup_call_args(struct bpf_verifier_env *env)
depth = get_callee_stack_depth(env, insn, i);
if (depth < 0)
return depth;
+#ifdef CONFIG_BPF_JIT_ALWAYS_ON
+   return -ENOTSUPP;
+#else
bpf_patch_call_args(insn, depth);
+#endif
}
return 0;

but below should be fine too.
Will test it asap.

> This is the net --> net-next merge resolution I am about to use to fix
> this:
> 
> ...
>  +static int fixup_call_args(struct bpf_verifier_env *env)
>  +{
>  +struct bpf_prog *prog = env->prog;
>  +struct bpf_insn *insn = prog->insnsi;
> - int i, depth;
> ++int i, depth, err;
>  +
> - if (env->prog->jit_requested)
> - if (jit_subprogs(env) == 0)
> ++err = 0;
> ++if (env->prog->jit_requested) {
> ++err = jit_subprogs(env);
> ++if (err == 0)
>  +return 0;
> - 
> ++}
> ++#ifndef CONFIG_BPF_JIT_ALWAYS_ON
>  +for (i = 0; i < prog->len; i++, insn++) {
>  +if (insn->code != (BPF_JMP | BPF_CALL) ||
>  +insn->src_reg != BPF_PSEUDO_CALL)
>  +continue;
>  +depth = get_callee_stack_depth(env, insn, i);
>  +if (depth < 0)
>  +return depth;
>  +bpf_patch_call_args(insn, depth);
>  +}
> - return 0;
> ++err = 0;
> ++#endif
> ++return err;
>  +}
>  +
>   /* fixup insn->imm field of bpf_call instructions
>* and inline eligible helpers as explicit sequence of BPF instructions
>*


Re: [bpf-next PATCH v2 5/7] bpf: sockmap sample add base test without any BPF for comparison

2018-01-11 Thread John Fastabend
On 01/11/2018 01:10 PM, Martin KaFai Lau wrote:
> On Wed, Jan 10, 2018 at 10:40:11AM -0800, John Fastabend wrote:
>> Add a base test that does not use BPF hooks to test baseline case.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  samples/sockmap/sockmap_user.c |   26 +-
>>  1 file changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
>> index 812fc7e..eb19d14 100644
>> --- a/samples/sockmap/sockmap_user.c
>> +++ b/samples/sockmap/sockmap_user.c
>> @@ -285,18 +285,24 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
>>  
>>  static float giga = 1000000000;
>>  
>> -static int sendmsg_test(int iov_count, int iov_buf, int cnt, int verbose)
>> +static int sendmsg_test(int iov_count, int iov_buf, int cnt,
>> +int verbose, bool base)
>>  {
>> -int txpid, rxpid, err = 0;
>> +float sent_Bps = 0, recvd_Bps = 0;
>> +int rx_fd, txpid, rxpid, err = 0;
>>  struct msg_stats s = {0};
>>  int status;
>> -float sent_Bps = 0, recvd_Bps = 0;
>>  
>>  errno = 0;
>>  
>> +if (base)
>> +rx_fd = p1;
>> +else
>> +rx_fd = p2;
>> +
>>  rxpid = fork();
>>  if (rxpid == 0) {
>> -err = msg_loop(p2, iov_count, iov_buf, cnt, , false);
>> +err = msg_loop(rx_fd, iov_count, iov_buf, cnt, , false);
> I am likely missing something.  After receiving from p1, should the
> base-line case also send to c2 which then will be received by p2?
> 

Well I wanted a test to check socket to socket rates and see what
max throughput we could get with this simple tool. It provides a
good reference point for any other 'perf' data, throughput numbers,
etc. The numbers I see here, probably as expected, are very close
to what I get with iperf tests.

Adding another test base_bounce or base_proxy or something along
those lines might be another test we can add. I think you were
expecting this to be a 1:1 comparison with the sendmsg BPF test,
but it's not. We can probably add it, though.

Thanks,
John


Re: [bpf-next PATCH v2 3/7] bpf: sockmap sample, use fork() for send and recv

2018-01-11 Thread John Fastabend
On 01/11/2018 01:08 PM, Martin KaFai Lau wrote:
> On Wed, Jan 10, 2018 at 10:39:37AM -0800, John Fastabend wrote:
>> Currently for SENDMSG tests first send completes then recv runs. This
>> does not work well for large data sizes and/or many iterations. So
>> fork the recv and send handler so that we run both send and recv. In
>> the future we can add a parameter to do more than a single fork of
>> tx/rx.
>>
>> With this we can get many GBps of data which helps exercise the
>> sockmap code.
>>
>> Signed-off-by: John Fastabend 
>> ---

[...]

>>  static int sendmsg_test(int iov_count, int iov_buf, int cnt, int verbose)
>>  {
>> +int txpid, rxpid, err = 0;
>>  struct msg_stats s = {0};
>> -int err;
>> -
>> -err = msg_loop(c1, iov_count, iov_buf, cnt, , true);
>> -if (err) {
>> -fprintf(stderr,
>> -"msg_loop_tx: iov_count %i iov_buf %i cnt %i err %i\n",
>> -iov_count, iov_buf, cnt, err);
>> -return err;
>> +int status;
>> +
>> +errno = 0;
>> +
>> +rxpid = fork();
>> +if (rxpid == 0) {
>> +err = msg_loop(p2, iov_count, iov_buf, cnt, , false);
>> +if (err)
>> +fprintf(stderr,
>> +"msg_loop_rx: iov_count %i iov_buf %i cnt %i 
>> err %i\n",
>> +iov_count, iov_buf, cnt, err);
>> +fprintf(stdout, "rx_sendmsg: TX_bytes %zu RX_bytes %zu\n",
>> +s.bytes_sent, s.bytes_recvd);
>> +shutdown(p2, SHUT_RDWR);
>> +shutdown(p1, SHUT_RDWR);
>> +exit(1);
>> +} else if (rxpid == -1) {
>> +perror("msg_loop_rx: ");
>> +err = errno;
> Bail out here instead of continuing the tx side?
> 

Sure makes sense. No point in running the TX side here I guess.


Re: [bpf-next PATCH v2 1/7] bpf: refactor sockmap sample program update for arg parsing

2018-01-11 Thread John Fastabend
On 01/11/2018 01:05 PM, Martin KaFai Lau wrote:
> On Wed, Jan 10, 2018 at 10:39:04AM -0800, John Fastabend wrote:
>> sockmap sample program takes arguments from cmd line but it reads them
>> in using offsets into the array. Because we want to add more arguments
>> in the future lets do proper argument handling.
>>
>> Also refactor code to pull apart sock init and ping/pong test. This
>> allows us to add new tests in the future.
>>

[...]

>>  /* Accept Connections */
>> @@ -149,23 +177,32 @@ static int sockmap_test_sockets(int rate, int dot)
>>  goto out;
>>  }
>>  
>> -max_fd = p2;
>> -timeout.tv_sec = 10;
>> -timeout.tv_usec = 0;
>> -
>>  printf("connected sockets: c1 <-> p1, c2 <-> p2\n");
>>  printf("cgroups binding: c1(%i) <-> s1(%i) - - - c2(%i) <-> s2(%i)\n",
>>  c1, s1, c2, s2);
>> +out:
>> +return err;
> err is not updated with p1 and p2 in the case that accept() errors out.
> 

OK will fix to propagate the error, nice spot. This will avoid
trying to run the test without connected sockets.

Thanks,
John


[PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-11 Thread Jianchao Wang
A customer reported a memory corruption issue on a previous mlx4_en
driver version, where order-3 pages and multiple page reference
counting were still used.

We finally found that one of the root causes is that the HW may see
stale rx_descs, because the prod db update can reach the HW before the
rx_desc write does. Especially when crossing an order-3 page boundary
and starting on a new page, the HW may write to pages which may already
have been freed and reallocated to others.

To fix it, add a wmb between the rx_desc and prod db updates to enforce
the ordering. Even though order-0 pages and page recycling have since
been introduced, reordering between the rx_desc and prod db updates
could still lead to corruption of inbound packets.

Signed-off-by: Jianchao Wang 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 85e28ef..eefa82c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -555,7 +555,7 @@ static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
break;
ring->prod++;
} while (likely(--missing));
-
+   wmb(); /* ensure rx_desc updating reaches HW before prod db updating */
mlx4_en_update_rx_prod_db(ring);
 }
 
-- 
2.7.4
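
The generic form of the ordering rule applied by this fix (illustrative
pseudo-driver code, not mlx4-specific):

	/* 1. publish the descriptor contents */
	ring->rx_desc[i] = desc;
	/* 2. order the descriptor write before the doorbell write */
	wmb();
	/* 3. only now tell HW about the new producer index */
	writel(ring->prod, ring->doorbell);

Without step 2, a weakly ordered CPU or write buffer may let the
doorbell reach the device first, so the HW can fetch a stale
descriptor.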



Re: [PATCH net-next 00/11] add some new features and fix some bugs

2018-01-11 Thread lipeng (Y)



On 2018/1/12 1:07, David Miller wrote:

From: Peng Li 
Date: Thu, 11 Jan 2018 19:45:55 +0800


This patchset adds some new features and fixes some bugs:
[patch 1/11] adds ethtool_ops.get_channels support for VF.
[patch 2/11] removes TSO config command from VF driver.
[patch 3/11] adds ethtool_ops.get_coalesce support to PF.
[patch 4/11] adds ethtool_ops.set_coalesce support to PF.
[patch 5/11 - 11/11] do some code improvements and fix some bugs.

Can you please write a real commit message in your header postings
please?

Don't just copy the subject lines from the patches, and add one
sentence with a brief description.

Really write real paragraphs describing what the patch series
is doing, how it is doing it, and why it is doing it that
way.

A real explanation that tells the reader what exactly to
expect when they review the patches themselves.

Thanks for your advice.
A detailed explanation is better for review; I will write
the "real explanation" in the V2 patch-set.

Peng Li

Thank you.

.






MERGE net into net-next

2018-01-11 Thread David Miller

Daniel please look at how I resolved the BPF conflicts and build
failures.

The test_align.c one was pretty simple, but the one that fixes the
build failure due to overlap of the BPF call vs. JIT always on changes
is bit less trivial.

Thanks.


Re: linux-next: build failure after merge of the net-next tree

2018-01-11 Thread David Miller
From: Alexei Starovoitov 
Date: Wed, 10 Jan 2018 17:58:54 -0800

> On Thu, Jan 11, 2018 at 11:53:55AM +1100, Stephen Rothwell wrote:
>> Hi all,
>> 
>> After merging the net-next tree, today's linux-next build (x86_64
>> allmodconfig) failed like this:
>> 
>> kernel/bpf/verifier.o: In function `bpf_check':
>> verifier.c:(.text+0xd86e): undefined reference to `bpf_patch_call_args'
>> 
>> Caused by commit
>> 
>>   1ea47e01ad6e ("bpf: add support for bpf_call to interpreter")
>> 
>> interacting with commit
>> 
>>   290af86629b2 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
>> 
>> from the bpf and net trees.
>> 
>> I have just reverted commit 290af86629b2 for today.  A better solution
>> would be nice (like fixing this in a merge between the net-next and net
>> trees).
> 
> that's due to 'endif' from 290af86629b2 needs to be moved above
> bpf_patch_call_args() definition.

That doesn't fix it, because then you'd need to expose
interpreters_args as well and obviously that can't be right.

Instead, we should never call bpf_patch_call_args() when JIT always on
is enabled.  So if we fail to JIT the subprogs we should fail
immediately.

This is the net --> net-next merge resolution I am about to use to fix
this:

...
 +static int fixup_call_args(struct bpf_verifier_env *env)
 +{
 +  struct bpf_prog *prog = env->prog;
 +  struct bpf_insn *insn = prog->insnsi;
-   int i, depth;
++  int i, depth, err;
 +
-   if (env->prog->jit_requested)
-   if (jit_subprogs(env) == 0)
++  err = 0;
++  if (env->prog->jit_requested) {
++  err = jit_subprogs(env);
++  if (err == 0)
 +  return 0;
- 
++  }
++#ifndef CONFIG_BPF_JIT_ALWAYS_ON
 +  for (i = 0; i < prog->len; i++, insn++) {
 +  if (insn->code != (BPF_JMP | BPF_CALL) ||
 +  insn->src_reg != BPF_PSEUDO_CALL)
 +  continue;
 +  depth = get_callee_stack_depth(env, insn, i);
 +  if (depth < 0)
 +  return depth;
 +  bpf_patch_call_args(insn, depth);
 +  }
-   return 0;
++  err = 0;
++#endif
++  return err;
 +}
 +
  /* fixup insn->imm field of bpf_call instructions
   * and inline eligible helpers as explicit sequence of BPF instructions
   *



[PATCH net-next] i40evf: use GFP_ATOMIC under spin lock

2018-01-11 Thread Wei Yongjun
A spin lock is taken here so we should use GFP_ATOMIC.

Fixes: 504398f0a78e ("i40evf: use spinlock to protect (mac|vlan)_filter_list")
Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
index feb95b6..ca5b538 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
@@ -459,7 +459,7 @@ void i40evf_add_ether_addrs(struct i40evf_adapter *adapter)
more = true;
}
 
-   veal = kzalloc(len, GFP_KERNEL);
+   veal = kzalloc(len, GFP_ATOMIC);
if (!veal) {
spin_unlock_bh(&adapter->mac_vlan_list_lock);
return;
@@ -532,7 +532,7 @@ void i40evf_del_ether_addrs(struct i40evf_adapter *adapter)
  (count * sizeof(struct virtchnl_ether_addr));
more = true;
}
-   veal = kzalloc(len, GFP_KERNEL);
+   veal = kzalloc(len, GFP_ATOMIC);
if (!veal) {
spin_unlock_bh(&adapter->mac_vlan_list_lock);
return;
@@ -606,7 +606,7 @@ void i40evf_add_vlans(struct i40evf_adapter *adapter)
  (count * sizeof(u16));
more = true;
}
-   vvfl = kzalloc(len, GFP_KERNEL);
+   vvfl = kzalloc(len, GFP_ATOMIC);
if (!vvfl) {
spin_unlock_bh(&adapter->mac_vlan_list_lock);
return;
@@ -678,7 +678,7 @@ void i40evf_del_vlans(struct i40evf_adapter *adapter)
  (count * sizeof(u16));
more = true;
}
-   vvfl = kzalloc(len, GFP_KERNEL);
+   vvfl = kzalloc(len, GFP_ATOMIC);
if (!vvfl) {
spin_unlock_bh(&adapter->mac_vlan_list_lock);
return;
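
The rule the patch applies, in its minimal form (illustrative):

	spin_lock_bh(&adapter->mac_vlan_list_lock);
	/* GFP_KERNEL may sleep, and sleeping with a spinlock held is a
	 * bug, so only GFP_ATOMIC is safe inside this critical section
	 */
	buf = kzalloc(len, GFP_ATOMIC);
	if (!buf) {
		spin_unlock_bh(&adapter->mac_vlan_list_lock);
		return;
	}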



Re: [RFC bpf-next] bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info

2018-01-11 Thread Jakub Kicinski
On Thu, 11 Jan 2018 16:47:47 -0800, Jakub Kicinski wrote:
> Hi!
> 
> Jiong is working on dumping JITed NFP image via bpftool, Francois will be
> submitting support for NFP in binutils soon (whoop! :)).
> 
> We would appreciate if you could weigh in on the uAPI.  Is it OK to reuse
> the existing jited_prog_len/jited_prog_insns or should we add separate
> 2 new fields (plus the arch name) to avoid confusing old user space?

Ah, I skipped one chunk of Jiong's patch here, this would also be
necessary if we reuse fields:

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2bac0dc..c7831cd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1673,19 +1673,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
goto done;
}
 
-   ulen = info.jited_prog_len;
-   info.jited_prog_len = prog->jited_len;
-   if (info.jited_prog_len && ulen) {
-   if (bpf_dump_raw_ok()) {
-   uinsns = u64_to_user_ptr(info.jited_prog_insns);
-   ulen = min_t(u32, info.jited_prog_len, ulen);
-   if (copy_to_user(uinsns, prog->bpf_func, ulen))
-   return -EFAULT;
-   } else {
-   info.jited_prog_insns = 0;
-   }
-   }
-
ulen = info.xlated_prog_len;
info.xlated_prog_len = bpf_prog_insn_size(prog);
if (info.xlated_prog_len && ulen) {
@@ -1711,6 +1698,21 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
err = bpf_prog_offload_info_fill(&info, prog);
if (err)
return err;
+   else
+   goto done;
+   }
+
+   ulen = info.jited_prog_len;
+   info.jited_prog_len = prog->jited_len;
+   if (info.jited_prog_len && ulen) {
+   if (bpf_dump_raw_ok()) {
+   uinsns = u64_to_user_ptr(info.jited_prog_insns);
+   ulen = min_t(u32, info.jited_prog_len, ulen);
+   if (copy_to_user(uinsns, prog->bpf_func, ulen))
+   return -EFAULT;
+   } else {
+   info.jited_prog_insns = 0;
+   }
}
 
 done:

info.jited_prog_len is an in/out parameter, so we can't write it twice
if we share fields. Sorry for messing up.
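
For reference, the in/out flow from user space looks roughly like this
(a sketch using the existing bpf_obj_get_info_by_fd() wrapper; prog_fd
and dump() are assumptions of the example):

	struct bpf_prog_info info = {};
	__u32 info_len = sizeof(info);
	char buf[4096];

	info.jited_prog_len = sizeof(buf);	/* in: buffer size */
	info.jited_prog_insns = (__u64)(unsigned long)buf;
	if (!bpf_obj_get_info_by_fd(prog_fd, &info, &info_len))
		dump(buf, info.jited_prog_len);	/* out: actual length */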


Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
On Thu, Jan 11, 2018 at 5:19 PM, Linus Torvalds wrote:
> On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams wrote:
>>
>> This series incorporates Mark Rutland's latest ARM changes and adds
>> the x86 specific implementation of 'ifence_array_ptr'. That ifence
>> based approach is provided as an opt-in fallback, but the default
>> mitigation, '__array_ptr', uses a 'mask' approach that removes
>> conditional branches instructions, and otherwise aims to redirect
>> speculation to use a NULL pointer rather than a user controlled value.
>
> Do you have any performance numbers and perhaps example code
> generation? Is this noticeable? Are there any microbenchmarks showing
> the difference between lfence use and the masking model?

I don't have performance numbers, but here's a sample of the code
generation from __fcheck_files, where the 'and; lea; and' sequence is
the portion of array_ptr() after the mask generation with 'sbb'.

    fdp = array_ptr(fdt->fd, fd, fdt->max_fds);
     8e7:   8b 02                   mov    (%rdx),%eax
     8e9:   48 39 c7                cmp    %rax,%rdi
     8ec:   48 19 c9                sbb    %rcx,%rcx
     8ef:   48 8b 42 08             mov    0x8(%rdx),%rax
     8f3:   48 89 fe                mov    %rdi,%rsi
     8f6:   48 21 ce                and    %rcx,%rsi
     8f9:   48 8d 04 f0             lea    (%rax,%rsi,8),%rax
     8fd:   48 21 c8                and    %rcx,%rax
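
In C, the mask trick corresponds roughly to the following (a sketch;
the helper name and final form are still under discussion in this
series):

	/* ~0UL when 0 <= idx < sz, else 0, computed without a branch */
	static inline unsigned long idx_mask(unsigned long idx,
					     unsigned long sz)
	{
		return ~(long)(idx | (sz - 1 - idx)) >> (BITS_PER_LONG - 1);
	}

	mask = idx_mask(idx, sz);
	idx &= mask;					/* clip the index */
	ptr = (void *)((unsigned long)&base[idx] & mask); /* or NULL */

A mispredicted bounds check then speculates with a NULL pointer rather
than a user-controlled value.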


> Having both seems good for testing, but wouldn't we want to pick one in the 
> end?

I was thinking we'd keep it as a 'just in case' sort of thing, at
least until the 'probably safe' assumption of the 'mask' approach has
more time to settle out.

>
> Also, I do think that there is one particular array load that would
> seem to be pretty obvious: the system call function pointer array.
>
> Yes, yes, the actual call is now behind a retpoline, but that protects
> against a speculative BTB access, it's not obvious that it  protects
> against the mispredict of the __NR_syscall_max comparison in
> arch/x86/entry/entry_64.S.
>
> The act of fetching code is a kind of read too. And retpoline protects
> against BTB stuffing etc, but what if the _actual_ system call
> function address is wrong (due to mis-prediction of the system call
> index check)?
>
> Should the array access in entry_SYSCALL_64_fastpath be made to use
> the masking approach?

I'll take a look. I'm firmly in the 'patch first / worry later' stance
on these investigations.


Re: [PATCH net-next] net: phy: Have __phy_modify return 0 on success

2018-01-11 Thread Florian Fainelli
On 01/11/2018 12:55 PM, Andrew Lunn wrote:
> __phy_modify would return the old value of the register before it was
> modified. Thus on success, it does not return 0, but a positive value.
> Thus functions using phy_modify, which is a wrapper around
> __phy_modify, can start returning > 0 on success, rather than 0. As a
> result, breakage has been noticed in various places, where 0 was
> assumed.
> 
> Code inspection does not find any current location where the return of
> the old value is currently used. 

phy_restore_page() does actually use the old value returned by
__phy_modify(), but treats > 0 and == 0 the same way so it is
technically used, just not as a > 0 quantity it seems.
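
For illustration, the kind of caller the change unbreaks looks like this
(editor's sketch; example_config_aneg is a hypothetical driver function,
the phy_modify()/MII_BMCR usage is real):

	/* With the old semantics, phy_modify() could return the positive
	 * pre-modify register value, so this common "non-zero means error"
	 * pattern started reporting failure on success.
	 */
	static int example_config_aneg(struct phy_device *phydev)
	{
		int ret;

		ret = phy_modify(phydev, MII_BMCR, BMCR_ANENABLE, BMCR_ANENABLE);
		if (ret)	/* old behaviour: also true on success */
			return ret;

		return 0;
	}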

Russell, are there out of tree call sites (e.g: in your "phy" branch)
that we are going to be breaking if we accept this change?

> So have __phy_modify return 0 on
> success. When there is a real need for the old value, either a new
> accessor can be added, or an additional parameter passed.
> 
> Fixes: 2b74e5be17d2 ("net: phy: add phy_modify() accessor")

You would probably want to tag it with:

Fixes: fea23fb591cc ("net: phy: convert read-modify-write to phy_modify()")

as well, because that seems to be the problem Niklas and Geert encountered.

> Reported-by: Geert Uytterhoeven 
> Signed-off-by: Andrew Lunn 
> ---
> 
> Geert, Niklas
> 
> Please can you test this and let me know if it fixes the problems you
> see.
> 
>  drivers/net/phy/phy-core.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/phy/phy-core.c b/drivers/net/phy/phy-core.c
> index e75989ce8850..36cad6b3b96d 100644
> --- a/drivers/net/phy/phy-core.c
> +++ b/drivers/net/phy/phy-core.c
> @@ -336,16 +336,15 @@ EXPORT_SYMBOL(phy_write_mmd);
>   */
>  int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set)
>  {
> - int ret, res;
> + int ret;
>  
>   ret = __phy_read(phydev, regnum);
> - if (ret >= 0) {
> - res = __phy_write(phydev, regnum, (ret & ~mask) | set);
> - if (res < 0)
> - ret = res;
> - }
> + if (ret < 0)
> + return ret;
>  
> - return ret;
> + ret = __phy_write(phydev, regnum, (ret & ~mask) | set);
> +
> + return ret < 0 ? ret: 0;
>  }
>  EXPORT_SYMBOL_GPL(__phy_modify);
>  
> 


-- 
Florian


Re: [PATCH net-next v2] net: phy: remove parameter new_link from phy_mac_interrupt()

2018-01-11 Thread Florian Fainelli
On 01/10/2018 12:21 PM, Heiner Kallweit wrote:
> I see two issues with parameter new_link:
> 
> 1. It's not needed. See also phy_interrupt(), works w/o this parameter.
>phy_mac_interrupt sets the state to PHY_CHANGELINK and triggers the
>state machine which then calls phy_read_status. And phy_read_status
>updates the link state.
> 
> 2. phy_mac_interrupt is used in interrupt context and getting the link
>state may sleep (at least when having to access the PHY registers
>via MDIO bus).
> 
> So let's remove it.
> 
> Signed-off-by: Heiner Kallweit 

Reviewed-by: Florian Fainelli 
Tested-by: Florian Fainelli 

Thanks!
-- 
Florian
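
As an editorial aside: with the parameter gone, a MAC driver's link
interrupt handler reduces to something like this (editor's sketch;
link_status_changed() is a hypothetical helper, phy_mac_interrupt() with a
single argument is the post-patch API):

	static irqreturn_t example_mac_isr(int irq, void *dev_id)
	{
		struct net_device *ndev = dev_id;

		/* No link state is computed here; the PHY state machine
		 * rereads it from process context via phy_read_status().
		 */
		if (link_status_changed(ndev))	/* hypothetical */
			phy_mac_interrupt(ndev->phydev);

		return IRQ_HANDLED;
	}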


Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Linus Torvalds
On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams  wrote:
>
> This series incorporates Mark Rutland's latest ARM changes and adds
> the x86 specific implementation of 'ifence_array_ptr'. That ifence
> based approach is provided as an opt-in fallback, but the default
> mitigation, '__array_ptr', uses a 'mask' approach that removes
> conditional branch instructions, and otherwise aims to redirect
> speculation to use a NULL pointer rather than a user controlled value.

Do you have any performance numbers and perhaps example code
generation? Is this noticeable? Are there any microbenchmarks showing
the difference between lfence use and the masking model?

Having both seems good for testing, but wouldn't we want to pick one in the end?

Also, I do think that there is one particular array load that would
seem to be pretty obvious: the system call function pointer array.

Yes, yes, the actual call is now behind a retpoline, but that protects
against a speculative BTB access, it's not obvious that it protects
against the mispredict of the __NR_syscall_max comparison in
arch/x86/entry/entry_64.S.

The act of fetching code is a kind of read too. And retpoline protects
against BTB stuffing etc, but what if the _actual_ system call
function address is wrong (due to mis-prediction of the system call
index check)?

Should the array access in entry_SYSCALL_64_fastpath be made to use
the masking approach?

Linus


[PATCH v2 10/19] ipv4: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
Static analysis reports that 'offset' may be a user controlled value
that is used as a data dependency reading from a raw_frag_vec buffer.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid '*(rfv->c + offset)' value.

Based on an original patch by Elena Reshetova.

Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: Hideaki YOSHIFUJI 
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 net/ipv4/raw.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 125c1eab3eaa..91091a10294f 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -57,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -472,17 +473,18 @@ static int raw_getfrag(void *from, char *to, int offset, 
int len, int odd,
   struct sk_buff *skb)
 {
struct raw_frag_vec *rfv = from;
+   char *rfv_buf;
 
-   if (offset < rfv->hlen) {
+   rfv_buf = array_ptr(rfv->hdr.c, offset, rfv->hlen);
+   if (rfv_buf) {
int copy = min(rfv->hlen - offset, len);
 
if (skb->ip_summed == CHECKSUM_PARTIAL)
-   memcpy(to, rfv->hdr.c + offset, copy);
+   memcpy(to, rfv_buf, copy);
else
skb->csum = csum_block_add(
skb->csum,
-   csum_partial_copy_nocheck(rfv->hdr.c + offset,
- to, copy, 0),
+   csum_partial_copy_nocheck(rfv_buf, to, copy, 0),
odd);
 
odd = 0;



[PATCH v2 15/19] carl9170: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read from the 'ar9170_qmap' array. In
order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid result of 'ar9170_qmap[queue]'. In this case the
value of 'ar9170_qmap[queue]' is immediately reused as an index to the
'ar->edcf' array.

Based on an original patch by Elena Reshetova.

Cc: Christian Lamparter 
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/net/wireless/ath/carl9170/main.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/carl9170/main.c 
b/drivers/net/wireless/ath/carl9170/main.c
index 988c8857d78c..0acfa8c22b7d 100644
--- a/drivers/net/wireless/ath/carl9170/main.c
+++ b/drivers/net/wireless/ath/carl9170/main.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "hw.h"
@@ -1384,11 +1385,13 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,
   const struct ieee80211_tx_queue_params *param)
 {
struct ar9170 *ar = hw->priv;
+   const u8 *elem;
int ret;
 
mutex_lock(&ar->mutex);
-   if (queue < ar->hw->queues) {
-   memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
+   elem = array_ptr(ar9170_qmap, queue, ar->hw->queues);
+   if (elem) {
+   memcpy(&ar->edcf[*elem], param, sizeof(*param));
ret = carl9170_set_qos(ar);
} else {
ret = -EINVAL;



[PATCH v2 16/19] p54: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read from the 'priv->qos_params' array.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid result of 'priv->qos_params[queue]'.

Based on an original patch by Elena Reshetova.

Cc: Christian Lamparter 
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/net/wireless/intersil/p54/main.c |9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/intersil/p54/main.c 
b/drivers/net/wireless/intersil/p54/main.c
index ab6d39e12069..5ce693ff547e 100644
--- a/drivers/net/wireless/intersil/p54/main.c
+++ b/drivers/net/wireless/intersil/p54/main.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -411,12 +412,14 @@ static int p54_conf_tx(struct ieee80211_hw *dev,
   const struct ieee80211_tx_queue_params *params)
 {
struct p54_common *priv = dev->priv;
+   struct p54_edcf_queue_param *p54_q;
int ret;
 
mutex_lock(&priv->conf_mutex);
-   if (queue < dev->queues) {
-   P54_SET_QUEUE(priv->qos_params[queue], params->aifs,
-   params->cw_min, params->cw_max, params->txop);
+   p54_q = array_ptr(priv->qos_params, queue, dev->queues);
+   if (p54_q) {
+   P54_SET_QUEUE(p54_q[0], params->aifs, params->cw_min,
+   params->cw_max, params->txop);
ret = p54_set_edcf(priv);
} else
ret = -EINVAL;



[PATCH v2 18/19] cw1200: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read 'txq_params' from the
'priv->tx_queue_params.params' array.  In order to avoid potential leaks
of kernel memory values, block speculative execution of the instruction
stream that could issue reads based on an invalid value of 'txq_params'.
In this case 'txq_params' is referenced later in the function.

Based on an original patch by Elena Reshetova.

Cc: Solomon Peachy 
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/net/wireless/st/cw1200/sta.c |   11 +++
 drivers/net/wireless/st/cw1200/wsm.h |4 +---
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireless/st/cw1200/sta.c 
b/drivers/net/wireless/st/cw1200/sta.c
index 38678e9a0562..7521077e50a4 100644
--- a/drivers/net/wireless/st/cw1200/sta.c
+++ b/drivers/net/wireless/st/cw1200/sta.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cw1200.h"
 #include "sta.h"
@@ -612,18 +613,20 @@ int cw1200_conf_tx(struct ieee80211_hw *dev, struct 
ieee80211_vif *vif,
   u16 queue, const struct ieee80211_tx_queue_params *params)
 {
struct cw1200_common *priv = dev->priv;
+   struct wsm_set_tx_queue_params *txq_params;
int ret = 0;
/* To prevent re-applying PM request OID again and again*/
bool old_uapsd_flags;
 
mutex_lock(&priv->conf_mutex);
 
-   if (queue < dev->queues) {
+   txq_params = array_ptr(priv->tx_queue_params.params, queue,
+   dev->queues);
+   if (txq_params) {
old_uapsd_flags = le16_to_cpu(priv->uapsd_info.uapsd_flags);
 
-   WSM_TX_QUEUE_SET(&priv->tx_queue_params, queue, 0, 0, 0);
-   ret = wsm_set_tx_queue_params(priv,
-   &priv->tx_queue_params.params[queue], queue);
+   WSM_TX_QUEUE_SET(txq_params, 0, 0, 0);
+   ret = wsm_set_tx_queue_params(priv, txq_params, queue);
if (ret) {
ret = -EINVAL;
goto out;
diff --git a/drivers/net/wireless/st/cw1200/wsm.h 
b/drivers/net/wireless/st/cw1200/wsm.h
index 48086e849515..8c8d9191e233 100644
--- a/drivers/net/wireless/st/cw1200/wsm.h
+++ b/drivers/net/wireless/st/cw1200/wsm.h
@@ -1099,10 +1099,8 @@ struct wsm_tx_queue_params {
 };
 
 
-#define WSM_TX_QUEUE_SET(queue_params, queue, ack_policy, allowed_time,\
-   max_life_time)  \
+#define WSM_TX_QUEUE_SET(p, ack_policy, allowed_time, max_life_time)   \
 do {   \
-   struct wsm_set_tx_queue_params *p = &(queue_params)->params[queue]; \
p->ackPolicy = (ack_policy);\
p->allowedMediumTime = (allowed_time);  \
p->maxTransmitLifetime = (max_life_time);   \



[PATCH v2 19/19] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency reading 'rt' from the 'platform_label'
array.  In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid 'rt' value.

Based on an original patch by Elena Reshetova.

Eric notes:
"
When val is a pointer not an integer.
Then
array2[val] = y;
/* or */
y = array2[val];

Won't happen.

val->field;

Will happen.

Which looks similar.  However the address space of pointers is too
large, making it impossible for an attacker to know where to look in
the cache to see if "val->field" happened.  At least on the
assumption that val is an arbitrary value.

Further mpls_forward is small enough the entire scope of "rt" the
value read possibly past the bound check is auditable without too
much trouble.  I have looked and I don't see anything that could
possibly allow the value of "rt" to be exfiltrated.  The problem
continuing to be that it is a pointer and the only operation on the
pointer besides dereferencing it is testing if it is NULL.

Other types of timing attacks are very hard if not impossible
because any packet presenting with a value outside the bounds check
will be dropped.  So it will be hard if not impossible to find
something to time to see how long it took to drop the packet.
"

The motivation of resending this patch despite the NAK is to
continue a community wide discussion on the bar for judging Spectre
changes. I.e. is any user controlled speculative pointer in the
kernel a pointer too far, especially given the current array_ptr()
implementation.

Cc: "David S. Miller" 
Cc: Eric W. Biederman 
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 net/mpls/af_mpls.c |   12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 8ca9915befc8..c92b1033adc2 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct 
mpls_route *rt,
 static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
struct mpls_route *rt = NULL;
+   struct mpls_route __rcu **platform_label =
+   rcu_dereference(net->mpls.platform_label);
+   struct mpls_route __rcu **rtp;
 
-   if (index < net->mpls.platform_labels) {
-   struct mpls_route __rcu **platform_label =
-   rcu_dereference(net->mpls.platform_label);
-   rt = rcu_dereference(platform_label[index]);
-   }
+   rtp = array_ptr(platform_label, index, net->mpls.platform_labels);
+   if (rtp)
+   rt = rcu_dereference(*rtp);
return rt;
 }
 



[PATCH v2 09/19] ipv6: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
Static analysis reports that 'offset' may be a user controlled value
that is used as a data dependency reading from a raw6_frag_vec buffer.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid '*(rfv->c + offset)' value.

Based on an original patch by Elena Reshetova.

Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: Hideaki YOSHIFUJI 
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 net/ipv6/raw.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 761a473a07c5..0b7ceeb6f709 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -725,17 +726,18 @@ static int raw6_getfrag(void *from, char *to, int offset, 
int len, int odd,
   struct sk_buff *skb)
 {
struct raw6_frag_vec *rfv = from;
+   char *rfv_buf;
 
-   if (offset < rfv->hlen) {
+   rfv_buf = array_ptr(rfv->c, offset, rfv->hlen);
+   if (rfv_buf) {
int copy = min(rfv->hlen - offset, len);
 
if (skb->ip_summed == CHECKSUM_PARTIAL)
-   memcpy(to, rfv->c + offset, copy);
+   memcpy(to, rfv_buf, copy);
else
skb->csum = csum_block_add(
skb->csum,
-   csum_partial_copy_nocheck(rfv->c + offset,
- to, copy, 0),
+   csum_partial_copy_nocheck(rfv_buf, to, copy, 0),
odd);
 
odd = 0;



[PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
Changes since v1 [1]:
* fixup the ifence definition to use alternative_2 per recent AMD
  changes in tip/x86/pti (Tom)

* drop 'nospec_ptr' (Linus, Mark)

* rename 'nospec_array_ptr' to 'array_ptr' (Alexei)

* rename 'nospec_barrier' to 'ifence' (Peter, Ingo)

* clean up occasions of 'variable assignment in if()' (Sergei, Stephen)

* make 'array_ptr' use a mask instead of an architectural ifence by
  default (Linus, Alexei)

* provide a command line and compile-time opt-in to the ifence
  mechanism, if an architecture provides 'ifence_array_ptr'.

* provide an optimized mask generation helper, 'array_ptr_mask', for
  x86 (Linus)

* move 'get_user' hardening from '__range_not_ok' to '__uaccess_begin'
  (Linus)

* drop "Thermal/int340x: prevent bounds-check..." since userspace does
  not have arbitrary control over the 'trip' index (Srinivas)

* update the changelog of "net: mpls: prevent bounds-check..." and keep
  it in the series to continue the debate about Spectre hygiene patches.
  (Eric).

* record a reviewed-by from Laurent on "[media] uvcvideo: prevent
  bounds-check..."

* update the cover letter

[1]: https://lwn.net/Articles/743376/

---

Quoting Mark's original RFC:

"Recently, Google Project Zero discovered several classes of attack
against speculative execution. One of these, known as variant-1, allows
explicit bounds checks to be bypassed under speculation, providing an
arbitrary read gadget. Further details can be found on the GPZ blog [2]
and the Documentation patch in this series."

This series incorporates Mark Rutland's latest ARM changes and adds
the x86 specific implementation of 'ifence_array_ptr'. That ifence
based approach is provided as an opt-in fallback, but the default
mitigation, '__array_ptr', uses a 'mask' approach that removes
conditional branch instructions, and otherwise aims to redirect
speculation to use a NULL pointer rather than a user controlled value.

The mask is generated by the following from Alexei, and Linus:

mask = ~(long)(_i | (_s - 1 - _i)) >> (BITS_PER_LONG - 1);

...and Linus provided an optimized mask generation helper for x86:

asm ("cmpq %1,%2; sbbq %0,%0;"
:"=r" (mask)
:"r"(sz),"r" (idx)
:"cc");

The 'array_ptr' mechanism can be switched between 'mask' and 'ifence'
via the spectre_v1={mask,ifence} command line option, and the
compile-time default is set by selecting either CONFIG_SPECTRE1_MASK or
CONFIG_SPECTRE1_IFENCE.

The 'array_ptr' infrastructure is the primary focus of this patch set. The
individual patches that perform 'array_ptr' conversions are a point-in-time
(i.e. earlier kernel, early analysis tooling, x86 only, etc.) first pass at
finding some of these gadgets.

Another consideration for reviewing these patches is the 'hygiene'
argument. When a patch refers to hygiene it is concerned with stopping
speculation on an unconstrained or insufficiently constrained pointer
value under userspace control. That by itself is not sufficient for
attack (per current understanding) [3], but it is a necessary
pre-condition.  So 'hygiene' refers to cleaning up those suspect
pointers regardless of whether they are usable as a gadget.

These patches are also available via the 'nospec-v2' git branch
here:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec-v2

Note that the BPF fix for Spectre variant1 is merged in the bpf.git
tree [4], and is not included in this branch.

[2]: 
https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
[3]: https://spectreattack.com/spectre.pdf
[4]: 
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=b2157399cc98

---

Dan Williams (16):
  x86: implement ifence()
  x86: implement ifence_array_ptr() and array_ptr_mask()
  asm-generic/barrier: mask speculative execution flows
  x86: introduce __uaccess_begin_nospec and ASM_IFENCE
  x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  ipv6: prevent bounds-check bypass via speculative execution
  ipv4: prevent bounds-check bypass via speculative execution
  vfs, fdtable: prevent bounds-check bypass via speculative execution
  userns: prevent bounds-check bypass via speculative execution
  udf: prevent bounds-check bypass via speculative execution
  [media] uvcvideo: prevent bounds-check bypass via speculative execution
  carl9170: prevent bounds-check bypass via speculative execution
  p54: prevent bounds-check bypass via speculative execution
  qla2xxx: prevent bounds-check bypass via speculative execution
  cw1200: prevent bounds-check bypass via speculative execution
  net: mpls: prevent bounds-check bypass via speculative execution

Mark Rutland (3):
  Documentation: document array_ptr
  arm64: implement ifence_array_ptr()
  arm: implement ifence_array_ptr()

 Documentation/speculation.txt|  142 ++
 arch/arm/Kconfig

[RFC bpf-next] bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info

2018-01-11 Thread Jakub Kicinski
Hi!

Jiong is working on dumping the JITed NFP image via bpftool, and Francois will be
submitting support for NFP in binutils soon (whoop! :)).

We would appreciate it if you could weigh in on the uAPI.  Is it OK to reuse
the existing jited_prog_len/jited_prog_insns or should we add separate
2 new fields (plus the arch name) to avoid confusing old user space?

From: Jiong Wang 

For host JIT, there are "jited_len"/"bpf_func" fields in struct bpf_prog
used by all host JIT targets to get the jited image and its length. For
offload, targets are likely to have different offload mechanisms, so this
info is kept in device private data fields.

Therefore, the BPF_OBJ_GET_INFO_BY_FD syscall needs a unified way to get JIT
length and contents info for offload targets.

One way is to introduce a new callback to parse device private data and then
fill those fields in bpf_prog_info. This might be a little heavy; the other
way is to add generic fields which will be initialized by all offload targets.

This patch follows the second approach to introduce two new fields in
struct bpf_dev_offload and teach bpf_prog_get_info_by_fd about them to fill
correct jited_prog_len and jited_prog_insns in bpf_prog_info.

Also, currently userspace tools can't get offload architecture info from
bpf_prog_info. This info is necessary for choosing the correct disassembler.

This patch adds name info in both bpf_dev_offload and bpf_prog_info so it
can be used by tools to select the correct architecture.

The code logic in bpf_prog_offload_info_fill is adjusted slightly. Code
that only applies to offload is centered in bpf_prog_offload_info_fill as
much as possible.
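
Either way the uAPI is laid out, a userspace consumer would follow the
usual two-call pattern (editor's sketch using libbpf's
bpf_obj_get_info_by_fd(); fetch_jited_image() is a hypothetical helper):

	#include <stdlib.h>
	#include <string.h>
	#include <bpf/bpf.h>

	static void *fetch_jited_image(int prog_fd, __u32 *out_len)
	{
		struct bpf_prog_info info = {};
		__u32 len = sizeof(info);
		void *img;

		/* first call: jited_prog_len comes back holding the real size */
		if (bpf_obj_get_info_by_fd(prog_fd, &info, &len) ||
		    !info.jited_prog_len)
			return NULL;

		img = malloc(info.jited_prog_len);
		if (!img)
			return NULL;

		/* second call: pass the capacity back in, starting from a
		 * clean info so no other out-field is misread as an in-field
		 */
		*out_len = info.jited_prog_len;
		memset(&info, 0, sizeof(info));
		info.jited_prog_len = *out_len;
		info.jited_prog_insns = (__u64)(unsigned long)img;
		if (bpf_obj_get_info_by_fd(prog_fd, &info, &len)) {
			free(img);
			return NULL;
		}

		return img;
	}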

Signed-off-by: Jiong Wang 
---
 include/linux/bpf.h|  3 +++
 include/uapi/linux/bpf.h   |  2 ++
 kernel/bpf/offload.c   | 26 ++
 tools/include/uapi/linux/bpf.h |  2 ++
 4 files changed, 33 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9e03046d1df2..d0cb9735bbba 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -197,6 +197,9 @@ struct bpf_dev_offload {
struct list_head offloads;
bool dev_state;
const struct bpf_prog_offload_ops *dev_ops;
+   void *jited_image;
+   u32 jited_len;
+   char jited_arch_name[BPF_OFFLOAD_ARCH_NAME_LEN];
 };
 
 struct bpf_prog_aux {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 405317f9c064..124560b982df 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -226,6 +226,7 @@ enum bpf_attach_type {
 #define BPF_F_QUERY_EFFECTIVE  (1U << 0)
 
 #define BPF_OBJ_NAME_LEN 16U
+#define BPF_OFFLOAD_ARCH_NAME_LEN 16U
 
 /* Flags for accessing BPF object */
 #define BPF_F_RDONLY   (1U << 3)
@@ -927,6 +928,7 @@ struct bpf_prog_info {
__u32 ifindex;
__u64 netns_dev;
__u64 netns_ino;
+   char offload_arch_name[BPF_OFFLOAD_ARCH_NAME_LEN];
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 040d4e0edf3f..88b4396d19aa 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -216,9 +216,12 @@ int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
.prog   = prog,
.info   = info,
};
+   struct bpf_prog_aux *aux = prog->aux;
struct inode *ns_inode;
struct path ns_path;
+   char __user *uinsns;
void *res;
+   u32 ulen;
 
res = ns_get_path_cb(&ns_path, bpf_prog_offload_info_fill_ns, &args);
if (IS_ERR(res)) {
@@ -227,6 +230,29 @@ int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
return PTR_ERR(res);
}
 
+
+   down_read(&bpf_devs_lock);
+   if (!aux->offload) {
+   up_read(&bpf_devs_lock);
+   return -ENODEV;
+   }
+
+   ulen = info->jited_prog_len;
+   info->jited_prog_len = aux->offload->jited_len;
+   if (info->jited_prog_len && ulen) {
+   uinsns = u64_to_user_ptr(info->jited_prog_insns);
+   ulen = min_t(u32, info->jited_prog_len, ulen);
+   if (copy_to_user(uinsns, aux->offload->jited_image, ulen)) {
+   up_read(&bpf_devs_lock);
+   return -EFAULT;
+   }
+   }
+
+   memcpy(info->offload_arch_name, aux->offload->jited_arch_name,
+  sizeof(info->offload_arch_name));
+
+   up_read(&bpf_devs_lock);
+
ns_inode = ns_path.dentry->d_inode;
info->netns_dev = new_encode_dev(ns_inode->i_sb->s_dev);
info->netns_ino = ns_inode->i_ino;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e8c60acfa32..647aee66f4da 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -226,6 +226,7 @@ enum bpf_attach_type {
 #define BPF_F_QUERY_EFFECTIVE  (1U << 0)
 
 

RE: [patch net-next 5/5] mlxsw: spectrum: qdiscs: Support stats for PRIO qdisc

2018-01-11 Thread Yuval Mintz
> > Support basic stats for PRIO qdisc, which includes tx packets and bytes
> > count, drops count and backlog size. The rest of the stats are irrelevant
> > for this qdisc offload.
> > Since backlog is not only incremental but reflecting momentary value, in
> > case of a qdisc that stops being offloaded but is not destroyed, backlog
> > value needs to be updated about the un-offloading.
> > For that reason an unoffload function is being added to the ops struct.
> >
> > Signed-off-by: Nogah Frankel 
> > Reviewed-by: Yuval Mintz 
> > Signed-off-by: Jiri Pirko 
> > ---
> >  .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c   | 92 ++
> >  1 file changed, 92 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
> > index 9e83edde7b35..272c04951e5d 100644
> > --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
> > +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
> > @@ -66,6 +66,11 @@ struct mlxsw_sp_qdisc_ops {
> >   void *xstats_ptr);
> > void (*clean_stats)(struct mlxsw_sp_port *mlxsw_sp_port,
> > struct mlxsw_sp_qdisc *mlxsw_sp_qdisc);
> > +   /* unoffload - to be used for a qdisc that stops being offloaded
> without
> > +* being destroyed.
> > +*/
> > +   void (*unoffload)(struct mlxsw_sp_port *mlxsw_sp_port,
> > + struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void
> *params);
> 
> Hm.  Do you need this just because you didn't add the backlog pointer
> to destroy?  AFAIK on destroy we are free to reset stats as well, thus
> simplifying your driver...  Let me know if I misunderstand.

This is meant exactly for the scenario where the qdisc didn't get destroyed
yet is no longer offloaded; e.g., if the number of bands increased beyond
what we can offload. So we can't reset the statistics in this case.
[Although I might be the one to misunderstand you,
as the 'not destroyed' was explicitly mentioned twice above]

> 
> >  };
> >
> >  struct mlxsw_sp_qdisc {
> > @@ -73,6 +78,9 @@ struct mlxsw_sp_qdisc {
> > u8 tclass_num;
> > union {
> > struct red_stats red;
> > +   struct mlxsw_sp_qdisc_prio_stats {
> > +   u64 backlog;
> 
> This is not a prio stat, it's a standard qstat.  I've added it to
> struct mlxsw_sp_qdisc_stats.  The reason you need to treat it
> separately is that RED has non-standard backlog handling which I'm
> trying to fix...
> 
> > +   } prio;
> > } xstats_base;
> > struct mlxsw_sp_qdisc_stats {
> > u64 tx_bytes;
> > @@ -144,6 +152,9 @@ mlxsw_sp_qdisc_replace(struct mlxsw_sp_port
> *mlxsw_sp_port, u32 handle,
> >
> >  err_bad_param:
> >  err_config:
> > +   if (mlxsw_sp_qdisc->handle == handle && ops->unoffload)
> > +   ops->unoffload(mlxsw_sp_port, mlxsw_sp_qdisc, params);
> > +
> > mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc);
> > return err;
> >  }
> 
> > @@ -479,6 +567,10 @@ int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port
> *mlxsw_sp_port,
> > switch (p->command) {
> > case TC_PRIO_DESTROY:
> > return mlxsw_sp_qdisc_destroy(mlxsw_sp_port,
> mlxsw_sp_qdisc);
> > +   case TC_PRIO_STATS:
> > +   return mlxsw_sp_qdisc_get_stats(mlxsw_sp_port,
> mlxsw_sp_qdisc,
> > +   &p->stats);
> > +
> 
> nit: extra new line intentional? :)
> 
> > default:
> > return -EOPNOTSUPP;
> > }



[net 04/11] net/mlx5: Fix mlx5_get_uars_page to return error code

2018-01-11 Thread Saeed Mahameed
From: Eran Ben Elisha 

Change mlx5_get_uars_page to return an ERR_PTR in case of
allocation failure. Change all callers accordingly to
check IS_ERR(ptr) instead of NULL.
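
The resulting caller pattern is the standard ERR_PTR idiom (editor's
sketch mirroring the hunks below):

	struct mlx5_uars_page *up = mlx5_get_uars_page(mdev);

	if (IS_ERR(up))
		return PTR_ERR(up);	/* the real errno, e.g. -ENOMEM, not a bare NULL */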

Fixes: 59211bd3b632 ("net/mlx5: Split the load/unload flow into hardware and 
software flows")
Signed-off-by: Eran Ben Elisha 
Signed-off-by: Eugenia Emantayev 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/main.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/uar.c  | 14 ++
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 00cb184fa027..262c1aa2e028 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -4160,7 +4160,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
goto err_cnt;
 
dev->mdev->priv.uar = mlx5_get_uars_page(dev->mdev);
-   if (!dev->mdev->priv.uar)
+   if (IS_ERR(dev->mdev->priv.uar))
goto err_cong;
 
err = mlx5_alloc_bfreg(dev->mdev, &dev->bfreg, false, false);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index a4c82fa71aec..6dffa58fb178 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1135,8 +1135,9 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
}
 
dev->priv.uar = mlx5_get_uars_page(dev);
-   if (!dev->priv.uar) {
+   if (IS_ERR(dev->priv.uar)) {
dev_err(&pdev->dev, "Failed allocating uar, aborting\n");
+   err = PTR_ERR(dev->priv.uar);
goto err_disable_msix;
}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/uar.c 
b/drivers/net/ethernet/mellanox/mlx5/core/uar.c
index 222b25908d01..8b97066dd1f1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/uar.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/uar.c
@@ -168,18 +168,16 @@ struct mlx5_uars_page *mlx5_get_uars_page(struct 
mlx5_core_dev *mdev)
struct mlx5_uars_page *ret;
 
mutex_lock(&mdev->priv.bfregs.reg_head.lock);
-   if (list_empty(&mdev->priv.bfregs.reg_head.list)) {
-   ret = alloc_uars_page(mdev, false);
-   if (IS_ERR(ret)) {
-   ret = NULL;
-   goto out;
-   }
-   list_add(&ret->list, &mdev->priv.bfregs.reg_head.list);
-   } else {
+   if (!list_empty(&mdev->priv.bfregs.reg_head.list)) {
ret = list_first_entry(&mdev->priv.bfregs.reg_head.list,
   struct mlx5_uars_page, list);
kref_get(&ret->ref_count);
+   goto out;
}
+   ret = alloc_uars_page(mdev, false);
+   if (IS_ERR(ret))
+   goto out;
+   list_add(&ret->list, &mdev->priv.bfregs.reg_head.list);
 out:
mutex_unlock(&mdev->priv.bfregs.reg_head.lock);
 
-- 
2.13.0



[net 03/11] net/mlx5: Fix memory leak in bad flow of mlx5_alloc_irq_vectors

2018-01-11 Thread Saeed Mahameed
From: Alaa Hleihel 

Fix a memory leak where, in case pci_alloc_irq_vectors failed,
priv->irq_info was not released.

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Alaa Hleihel 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 95e188d0883e..a4c82fa71aec 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -319,6 +319,7 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
struct mlx5_eq_table *table = &priv->eq_table;
int num_eqs = 1 << MLX5_CAP_GEN(dev, log_max_eq);
int nvec;
+   int err;
 
nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
   MLX5_EQ_VEC_COMP_BASE;
@@ -328,21 +329,23 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev 
*dev)
 
priv->irq_info = kcalloc(nvec, sizeof(*priv->irq_info), GFP_KERNEL);
if (!priv->irq_info)
-   goto err_free_msix;
+   return -ENOMEM;
 
nvec = pci_alloc_irq_vectors(dev->pdev,
MLX5_EQ_VEC_COMP_BASE + 1, nvec,
PCI_IRQ_MSIX);
-   if (nvec < 0)
-   return nvec;
+   if (nvec < 0) {
+   err = nvec;
+   goto err_free_irq_info;
+   }
 
table->num_comp_vectors = nvec - MLX5_EQ_VEC_COMP_BASE;
 
return 0;
 
-err_free_msix:
+err_free_irq_info:
kfree(priv->irq_info);
-   return -ENOMEM;
+   return err;
 }
 
 static void mlx5_free_irq_vectors(struct mlx5_core_dev *dev)
-- 
2.13.0



[pull request][net 00/11] Mellanox, mlx5 fixes 2018-01-11

2018-01-11 Thread Saeed Mahameed
Hi Dave,

The following series includes fixes to mlx5 core and netdev driver.
To highlight, we have two critical fixes in this series:
the 1st patch, from Eran, addresses a Host2BMC breakage.

The 2nd patch, from Saeed, addresses the RDMA IRQ vector affinity settings
query issue; the patch provides the correct mlx5_core implementation for
RDMA to correctly query vector affinity.
I sent this patch privately to Sagi a week ago so he could test it,
but I didn't hear from him.

All other patches are trivial misc fixes.
Please pull and let me know if there's any problem.

for -stable v4.14-y and later:
("net/mlx5: Fix get vector affinity helper function")
("{net,ib}/mlx5: Don't disable local loopback multicast traffic when needed")

Note: Merging this series with net-next will produce the following conflict:
<<< HEAD
u8 disable_local_lb[0x1];
u8 reserved_at_3e2[0x1];
u8 log_min_hairpin_wq_data_sz[0x5];
u8 reserved_at_3e8[0x3];
===
u8 disable_local_lb_uc[0x1];
u8 disable_local_lb_mc[0x1];
u8 reserved_at_3e3[0x8];
>>> 359c96447ac2297fabe15ef30b60f3b4b71e7fd0

To resolve, use the following hunk:
i.e.:
<<
u8 disable_local_lb_uc[0x1];
u8 disable_local_lb_mc[0x1];
u8 log_min_hairpin_wq_data_sz[0x5];
u8 reserved_at_3e8[0x3];
>>

Thanks,
Saeed.

---

The following changes since commit ccc12b11c5332c84442ef120dcd631523be75089:

  ipv6: sr: fix TLVs not being copied using setsockopt (2018-01-10 16:03:55 
-0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5-fixes-2018-01-11

for you to fetch changes up to 237f258c42c905f71c694670fe4d9773d85c36ed:

  net/mlx5e: Remove timestamp set from netdevice open flow (2018-01-12 02:01:50 
+0200)


mlx5-fixes-2018-01-11


Alaa Hleihel (1):
  net/mlx5: Fix memory leak in bad flow of mlx5_alloc_irq_vectors

Eran Ben Elisha (2):
  {net,ib}/mlx5: Don't disable local loopback multicast traffic when needed
  net/mlx5: Fix mlx5_get_uars_page to return error code

Feras Daoud (2):
  net/mlx5: Update ptp_clock_event foreach PPS event
  net/mlx5e: Remove timestamp set from netdevice open flow

Gal Pressman (2):
  net/mlx5e: Keep updating ethtool statistics when the interface is down
  net/mlx5e: Don't override netdev features field unless in error flow

Maor Gottlieb (1):
  net/mlx5: Fix error handling in load one

Saeed Mahameed (1):
  net/mlx5: Fix get vector affinity helper function

Tariq Toukan (2):
  net/mlx5e: Add error print in ETS init
  net/mlx5e: Check support before TC swap in ETS init

 drivers/infiniband/hw/mlx5/main.c  | 11 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 16 +---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 48 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  2 +
 .../net/ethernet/mellanox/mlx5/core/en_selftest.c  | 27 
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |  3 +-
 .../net/ethernet/mellanox/mlx5/core/lib/clock.c|  6 ++-
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 28 -
 drivers/net/ethernet/mellanox/mlx5/core/uar.c  | 14 +++
 drivers/net/ethernet/mellanox/mlx5/core/vport.c| 22 +++---
 include/linux/mlx5/driver.h| 19 -
 include/linux/mlx5/mlx5_ifc.h  |  5 ++-
 14 files changed, 135 insertions(+), 71 deletions(-)


[net 07/11] net/mlx5e: Add error print in ETS init

2018-01-11 Thread Saeed Mahameed
From: Tariq Toukan 

ETS initialization might fail, add a print to indicate
such failures.

Fixes: 08fb1dacdd76 ("net/mlx5e: Support DCBNL IEEE ETS")
Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
index 9bcf38f4123b..a5c5134f5cb2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
@@ -922,8 +922,9 @@ static void mlx5e_dcbnl_query_dcbx_mode(struct mlx5e_priv 
*priv,
 
 static void mlx5e_ets_init(struct mlx5e_priv *priv)
 {
-   int i;
struct ieee_ets ets;
+   int err;
+   int i;
 
if (!MLX5_CAP_GEN(priv->mdev, ets))
return;
@@ -940,7 +941,10 @@ static void mlx5e_ets_init(struct mlx5e_priv *priv)
ets.prio_tc[0] = 1;
ets.prio_tc[1] = 0;
 
-   mlx5e_dcbnl_ieee_setets_core(priv, &ets);
+   err = mlx5e_dcbnl_ieee_setets_core(priv, &ets);
+   if (err)
+   netdev_err(priv->netdev,
+  "%s, Failed to init ETS: %d\n", __func__, err);
 }
 
 enum {
-- 
2.13.0



[net 11/11] net/mlx5e: Remove timestamp set from netdevice open flow

2018-01-11 Thread Saeed Mahameed
From: Feras Daoud 

To avoid configuration override, move the timestamp set call
from the netdevice open flow to the init flow.
This way, a close-open procedure will not override the timestamp
configuration.
In addition, rename the mlx5e_timestamp_set function
to mlx5e_timestamp_init.

Fixes: ef9814deafd0 ("net/mlx5e: Add HW timestamping (TS) support")
Signed-off-by: Feras Daoud 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c  | 2 ++
 drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c | 3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 543060c305a0..c2d89bfa1a70 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -895,7 +895,7 @@ int mlx5e_vlan_rx_kill_vid(struct net_device *dev, 
__always_unused __be16 proto,
   u16 vid);
 void mlx5e_enable_cvlan_filter(struct mlx5e_priv *priv);
 void mlx5e_disable_cvlan_filter(struct mlx5e_priv *priv);
-void mlx5e_timestamp_set(struct mlx5e_priv *priv);
+void mlx5e_timestamp_init(struct mlx5e_priv *priv);
 
 struct mlx5e_redirect_rqt_param {
bool is_rss;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 311d5ec8407c..d8aefeed124d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2669,7 +2669,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
netif_carrier_on(netdev);
 }
 
-void mlx5e_timestamp_set(struct mlx5e_priv *priv)
+void mlx5e_timestamp_init(struct mlx5e_priv *priv)
 {
priv->tstamp.tx_type   = HWTSTAMP_TX_OFF;
priv->tstamp.rx_filter = HWTSTAMP_FILTER_NONE;
@@ -2690,7 +2690,6 @@ int mlx5e_open_locked(struct net_device *netdev)
mlx5e_activate_priv_channels(priv);
if (priv->profile->update_carrier)
priv->profile->update_carrier(priv);
-   mlx5e_timestamp_set(priv);
 
if (priv->profile->update_stats)
queue_delayed_work(priv->wq, &priv->update_stats_work, 0);
@@ -4146,6 +4145,8 @@ static void mlx5e_build_nic_netdev_priv(struct 
mlx5_core_dev *mdev,
INIT_WORK(&priv->set_rx_mode_work, mlx5e_set_rx_mode_work);
INIT_WORK(&priv->tx_timeout_work, mlx5e_tx_timeout_work);
INIT_DELAYED_WORK(&priv->update_stats_work, mlx5e_update_stats_work);
+
+   mlx5e_timestamp_init(priv);
 }
 
 static void mlx5e_set_netdev_dev_addr(struct net_device *netdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 2c43606c26b5..3409d86eb06b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -877,6 +877,8 @@ static void mlx5e_init_rep(struct mlx5_core_dev *mdev,
 
mlx5e_build_rep_params(mdev, &priv->channels.params);
mlx5e_build_rep_netdev(netdev);
+
+   mlx5e_timestamp_init(priv);
 }
 
 static int mlx5e_init_rep_rx(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
index 8812d7208e8f..ee2f378c5030 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
@@ -86,6 +86,8 @@ void mlx5i_init(struct mlx5_core_dev *mdev,
mlx5e_build_nic_params(mdev, &priv->channels.params, 
profile->max_nch(mdev));
mlx5i_build_nic_params(mdev, &priv->channels.params);
 
+   mlx5e_timestamp_init(priv);
+
/* netdev init */
netdev->hw_features |= NETIF_F_SG;
netdev->hw_features |= NETIF_F_IP_CSUM;
@@ -450,7 +452,6 @@ static int mlx5i_open(struct net_device *netdev)
 
mlx5e_refresh_tirs(epriv, false);
mlx5e_activate_priv_channels(epriv);
-   mlx5e_timestamp_set(epriv);
 
mutex_unlock(&epriv->state_lock);
return 0;
-- 
2.13.0



[net 05/11] net/mlx5: Fix error handling in load one

2018-01-11 Thread Saeed Mahameed
From: Maor Gottlieb 

We didn't store the result of mlx5_init_once, so
mlx5_load_one returned success on error.  Fix that.

Fixes: 59211bd3b632 ("net/mlx5: Split the load/unload flow into hardware and 
software flows")
Signed-off-by: Maor Gottlieb 
Signed-off-by: Eugenia Emantayev 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 6dffa58fb178..0f88fd30a09a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1123,9 +1123,12 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
goto err_stop_poll;
}
 
-   if (boot && mlx5_init_once(dev, priv)) {
-   dev_err(&pdev->dev, "sw objs init failed\n");
-   goto err_stop_poll;
+   if (boot) {
+   err = mlx5_init_once(dev, priv);
+   if (err) {
+   dev_err(&pdev->dev, "sw objs init failed\n");
+   goto err_stop_poll;
+   }
}
 
err = mlx5_alloc_irq_vectors(dev);
-- 
2.13.0



[net 01/11] {net,ib}/mlx5: Don't disable local loopback multicast traffic when needed

2018-01-11 Thread Saeed Mahameed
From: Eran Ben Elisha 

There are system platform information management interfaces (such as
HOST2BMC) for which we cannot disable local loopback multicast traffic.

Separate disable_local_lb_mc and disable_local_lb_uc capability bits so
driver will not disable multicast loopback traffic if not supported.
(It is expected that Firmware will not set disable_local_lb_mc if
HOST2BMC is running for example.)

Function mlx5_nic_vport_update_local_lb will do best effort to
disable/enable UC/MC loopback traffic and return success only if it
succeeded in changing everything the firmware allows.

Adapt mlx5_ib and mlx5e to support the new cap bits.
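
The best-effort rule reads roughly as follows (editor's sketch, not the
actual vport.c hunk; set_lb_bit(), apply_vport_context(), LB_MC and LB_UC
are hypothetical stand-ins for the MLX5_SET()/modify-context plumbing,
while the MLX5_CAP_GEN() bits are the ones this patch introduces):

	static int update_local_lb_sketch(struct mlx5_core_dev *mdev, bool enable)
	{
		bool mc = MLX5_CAP_GEN(mdev, disable_local_lb_mc);
		bool uc = MLX5_CAP_GEN(mdev, disable_local_lb_uc);

		if (!mc && !uc)		/* firmware lets us change nothing */
			return 0;

		if (mc)			/* only touch bits firmware advertises */
			set_lb_bit(mdev, LB_MC, !enable);	/* hypothetical */
		if (uc)
			set_lb_bit(mdev, LB_UC, !enable);	/* hypothetical */

		return apply_vport_context(mdev);		/* hypothetical */
	}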

Fixes: 2c43c5a036be ("net/mlx5e: Enable local loopback in loopback selftest")
Fixes: c85023e153e3 ("IB/mlx5: Add raw ethernet local loopback support")
Fixes: bded747bb432 ("net/mlx5: Add raw ethernet local loopback firmware 
command")
Signed-off-by: Eran Ben Elisha 
Cc: kernel-t...@fb.com
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/main.c  |  9 +---
 .../net/ethernet/mellanox/mlx5/core/en_selftest.c  | 27 ++
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  3 +--
 drivers/net/ethernet/mellanox/mlx5/core/vport.c| 22 +-
 include/linux/mlx5/mlx5_ifc.h  |  5 ++--
 5 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 8ac50de2b242..00cb184fa027 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1324,7 +1324,8 @@ static int mlx5_ib_alloc_transport_domain(struct 
mlx5_ib_dev *dev, u32 *tdn)
return err;
 
if ((MLX5_CAP_GEN(dev->mdev, port_type) != MLX5_CAP_PORT_TYPE_ETH) ||
-   !MLX5_CAP_GEN(dev->mdev, disable_local_lb))
+   (!MLX5_CAP_GEN(dev->mdev, disable_local_lb_uc) &&
+!MLX5_CAP_GEN(dev->mdev, disable_local_lb_mc)))
return err;
 
mutex_lock(&dev->lb_mutex);
@@ -1342,7 +1343,8 @@ static void mlx5_ib_dealloc_transport_domain(struct 
mlx5_ib_dev *dev, u32 tdn)
mlx5_core_dealloc_transport_domain(dev->mdev, tdn);
 
if ((MLX5_CAP_GEN(dev->mdev, port_type) != MLX5_CAP_PORT_TYPE_ETH) ||
-   !MLX5_CAP_GEN(dev->mdev, disable_local_lb))
+   (!MLX5_CAP_GEN(dev->mdev, disable_local_lb_uc) &&
+!MLX5_CAP_GEN(dev->mdev, disable_local_lb_mc)))
return;
 
mutex_lock(&dev->lb_mutex);
@@ -4187,7 +4189,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
}
 
if ((MLX5_CAP_GEN(mdev, port_type) == MLX5_CAP_PORT_TYPE_ETH) &&
-   MLX5_CAP_GEN(mdev, disable_local_lb))
+   (MLX5_CAP_GEN(mdev, disable_local_lb_uc) ||
+MLX5_CAP_GEN(mdev, disable_local_lb_mc)))
mutex_init(&dev->lb_mutex);
 
dev->ib_active = true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
index 1f1f8af87d4d..5a4608281f38 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
@@ -238,15 +238,19 @@ static int mlx5e_test_loopback_setup(struct mlx5e_priv 
*priv,
int err = 0;
 
/* Temporarily enable local_lb */
-   if (MLX5_CAP_GEN(priv->mdev, disable_local_lb)) {
-   mlx5_nic_vport_query_local_lb(priv->mdev, &lbtp->local_lb);
-   if (!lbtp->local_lb)
-   mlx5_nic_vport_update_local_lb(priv->mdev, true);
+   err = mlx5_nic_vport_query_local_lb(priv->mdev, &lbtp->local_lb);
+   if (err)
+   return err;
+
+   if (!lbtp->local_lb) {
+   err = mlx5_nic_vport_update_local_lb(priv->mdev, true);
+   if (err)
+   return err;
}
 
err = mlx5e_refresh_tirs(priv, true);
if (err)
-   return err;
+   goto out;
 
lbtp->loopback_ok = false;
init_completion(&lbtp->comp);
@@ -256,16 +260,21 @@ static int mlx5e_test_loopback_setup(struct mlx5e_priv 
*priv,
lbtp->pt.dev = priv->netdev;
lbtp->pt.af_packet_priv = lbtp;
dev_add_pack(&lbtp->pt);
+
+   return 0;
+
+out:
+   if (!lbtp->local_lb)
+   mlx5_nic_vport_update_local_lb(priv->mdev, false);
+
return err;
 }
 
 static void mlx5e_test_loopback_cleanup(struct mlx5e_priv *priv,
struct mlx5e_lbt_priv *lbtp)
 {
-   if (MLX5_CAP_GEN(priv->mdev, disable_local_lb)) {
-   if (!lbtp->local_lb)
-   mlx5_nic_vport_update_local_lb(priv->mdev, false);
-   }
+   if (!lbtp->local_lb)
+   mlx5_nic_vport_update_local_lb(priv->mdev, false);
 
dev_remove_pack(&lbtp->pt);
mlx5e_refresh_tirs(priv, false);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 

[net 10/11] net/mlx5: Update ptp_clock_event foreach PPS event

2018-01-11 Thread Saeed Mahameed
From: Feras Daoud 

The PPS event did not update the ptp_clock_event fields; therefore, the
timestamp value was not updated correctly. This fix updates the
event source and the timestamp value for each PPS event.

Fixes: 7c39afb394c7 ("net/mlx5: PTP code migration to driver core section")
Signed-off-by: Feras Daoud 
Reported-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c 
b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
index fa8aed62b231..5701f125e99c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
@@ -423,9 +423,13 @@ void mlx5_pps_event(struct mlx5_core_dev *mdev,
 
switch (clock->ptp_info.pin_config[pin].func) {
case PTP_PF_EXTTS:
+   ptp_event.index = pin;
+   ptp_event.timestamp = timecounter_cyc2time(&clock->tc,
+   be64_to_cpu(eqe->data.pps.time_stamp));
if (clock->pps_info.enabled) {
ptp_event.type = PTP_CLOCK_PPSUSR;
-   ptp_event.pps_times.ts_real = 
ns_to_timespec64(eqe->data.pps.time_stamp);
+   ptp_event.pps_times.ts_real =
+   ns_to_timespec64(ptp_event.timestamp);
} else {
ptp_event.type = PTP_CLOCK_EXTTS;
}
-- 
2.13.0



[net 06/11] net/mlx5e: Keep updating ethtool statistics when the interface is down

2018-01-11 Thread Saeed Mahameed
From: Gal Pressman 

ethtool statistics should be updated even when the interface is down,
since they show more than just netdev counters, which might change while
the logical link is down.
One useful use case, for example, is when running RoCE traffic over the
interface (while the logical link is down, but physical link is up) and
examining rx_prioX_bytes.

Fixes: f62b8bb8f2d3 ("net/mlx5: Extend mlx5_core to support ConnectX-4 Ethernet 
functionality")
Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 8f05efa5c829..ea5fff2c3143 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -207,8 +207,7 @@ void mlx5e_ethtool_get_ethtool_stats(struct mlx5e_priv 
*priv,
return;
 
mutex_lock(&priv->state_lock);
-   if (test_bit(MLX5E_STATE_OPENED, &priv->state))
-   mlx5e_update_stats(priv, true);
+   mlx5e_update_stats(priv, true);
mutex_unlock(&priv->state_lock);
 
for (i = 0; i < mlx5e_num_stats_grps; i++)
-- 
2.13.0



[net 08/11] net/mlx5e: Check support before TC swap in ETS init

2018-01-11 Thread Saeed Mahameed
From: Tariq Toukan 

Should not do the following swap between TCs 0 and 1
when max num of TCs is 1:
tclass[prio=0]=1, tclass[prio=1]=0, tclass[prio=i]=i (for i>1)

Fixes: 08fb1dacdd76 ("net/mlx5e: Support DCBNL IEEE ETS")
Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
index a5c5134f5cb2..3d46ef48d5b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
@@ -937,9 +937,11 @@ static void mlx5e_ets_init(struct mlx5e_priv *priv)
ets.prio_tc[i] = i;
}
 
-   /* tclass[prio=0]=1, tclass[prio=1]=0, tclass[prio=i]=i (for i>1) */
-   ets.prio_tc[0] = 1;
-   ets.prio_tc[1] = 0;
+   if (ets.ets_cap > 1) {
+   /* tclass[prio=0]=1, tclass[prio=1]=0, tclass[prio=i]=i (for 
i>1) */
+   ets.prio_tc[0] = 1;
+   ets.prio_tc[1] = 0;
+   }
 
err = mlx5e_dcbnl_ieee_setets_core(priv, &ets);
if (err)
-- 
2.13.0



[net 09/11] net/mlx5e: Don't override netdev features field unless in error flow

2018-01-11 Thread Saeed Mahameed
From: Gal Pressman 

The set features function sets dev->features in order to keep track of which
features were successfully changed and which weren't (in case the user
asks for more than one change in a single command).

This breaks the logic in __netdev_update_features which assumes that
dev->features is not changed on success and checks for diffs between
features and dev->features (diffs that might not exist at this point
because of the driver override).

The solution is to keep track of successful/failed feature changes and
assign them to dev->features in case of failure only.
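
For context, the core-stack logic being preserved looks roughly like this
(editor's paraphrase of __netdev_update_features() in net/core/dev.c; not
a verbatim quote):

	features = netdev_get_wanted_features(dev);
	features = netdev_fix_features(dev, features);

	/* If the driver already rewrote dev->features, the requested bits
	 * look applied and ndo_set_features() is never called for them.
	 */
	if (dev->features == features)
		return 0;

	err = dev->netdev_ops->ndo_set_features(dev, features);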

Fixes: 0e405443e803 ("net/mlx5e: Improve set features ndo resiliency")
Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 43 +--
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index d9d8227f195f..311d5ec8407c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3219,12 +3219,12 @@ static int mlx5e_set_mac(struct net_device *netdev, 
void *addr)
return 0;
 }
 
-#define MLX5E_SET_FEATURE(netdev, feature, enable) \
+#define MLX5E_SET_FEATURE(features, feature, enable)   \
do {\
if (enable) \
-   netdev->features |= feature;\
+   *features |= feature;   \
else\
-   netdev->features &= ~feature;   \
+   *features &= ~feature;  \
} while (0)
 
 typedef int (*mlx5e_feature_handler)(struct net_device *netdev, bool enable);
@@ -3347,6 +3347,7 @@ static int set_feature_arfs(struct net_device *netdev, 
bool enable)
 #endif
 
 static int mlx5e_handle_feature(struct net_device *netdev,
+   netdev_features_t *features,
netdev_features_t wanted_features,
netdev_features_t feature,
mlx5e_feature_handler feature_handler)
@@ -3365,34 +3366,40 @@ static int mlx5e_handle_feature(struct net_device 
*netdev,
return err;
}
 
-   MLX5E_SET_FEATURE(netdev, feature, enable);
+   MLX5E_SET_FEATURE(features, feature, enable);
return 0;
 }
 
 static int mlx5e_set_features(struct net_device *netdev,
  netdev_features_t features)
 {
+   netdev_features_t oper_features = netdev->features;
int err;
 
-   err  = mlx5e_handle_feature(netdev, features, NETIF_F_LRO,
-   set_feature_lro);
-   err |= mlx5e_handle_feature(netdev, features,
+   err  = mlx5e_handle_feature(netdev, &oper_features, features,
+   NETIF_F_LRO, set_feature_lro);
+   err |= mlx5e_handle_feature(netdev, &oper_features, features,
NETIF_F_HW_VLAN_CTAG_FILTER,
set_feature_cvlan_filter);
-   err |= mlx5e_handle_feature(netdev, features, NETIF_F_HW_TC,
-   set_feature_tc_num_filters);
-   err |= mlx5e_handle_feature(netdev, features, NETIF_F_RXALL,
-   set_feature_rx_all);
-   err |= mlx5e_handle_feature(netdev, features, NETIF_F_RXFCS,
-   set_feature_rx_fcs);
-   err |= mlx5e_handle_feature(netdev, features, NETIF_F_HW_VLAN_CTAG_RX,
-   set_feature_rx_vlan);
+   err |= mlx5e_handle_feature(netdev, &oper_features, features,
+   NETIF_F_HW_TC, set_feature_tc_num_filters);
+   err |= mlx5e_handle_feature(netdev, &oper_features, features,
+   NETIF_F_RXALL, set_feature_rx_all);
+   err |= mlx5e_handle_feature(netdev, &oper_features, features,
+   NETIF_F_RXFCS, set_feature_rx_fcs);
+   err |= mlx5e_handle_feature(netdev, &oper_features, features,
+   NETIF_F_HW_VLAN_CTAG_RX, set_feature_rx_vlan);
 #ifdef CONFIG_RFS_ACCEL
-   err |= mlx5e_handle_feature(netdev, features, NETIF_F_NTUPLE,
-   set_feature_arfs);
+   err |= mlx5e_handle_feature(netdev, &oper_features, features,
+   NETIF_F_NTUPLE, set_feature_arfs);
 #endif
 
-   return err ? -EINVAL : 0;
+   if (err) {
+   netdev->features = oper_features;
+   return -EINVAL;
+   }
+
+   return 0;
 }
 
 static netdev_features_t mlx5e_fix_features(struct net_device *netdev,
-- 
2.13.0



[net 02/11] net/mlx5: Fix get vector affinity helper function

2018-01-11 Thread Saeed Mahameed
mlx5_get_vector_affinity used to call pci_irq_get_affinity. After
reverting the patch that set the device affinity via the PCI_IRQ_AFFINITY
API, calling pci_irq_get_affinity became useless and broke RDMA
mlx5 users.  To fix this, this patch provides an alternative way to
retrieve IRQ vector affinity using the legacy IRQ API, following
the smp_affinity procfs read implementation.

Fixes: 231243c82793 ("Revert mlx5: move affinity hints assignments to generic 
code")
Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic code")
Cc: Sagi Grimberg 
Cc: Qing Huang 
Signed-off-by: Saeed Mahameed 
---
 include/linux/mlx5/driver.h | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 1f509d072026..a0610427e168 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include <linux/irq.h>
 #include 
 #include 
 #include 
@@ -1231,7 +1232,23 @@ enum {
 static inline const struct cpumask *
 mlx5_get_vector_affinity(struct mlx5_core_dev *dev, int vector)
 {
-   return pci_irq_get_affinity(dev->pdev, MLX5_EQ_VEC_COMP_BASE + vector);
+   const struct cpumask *mask;
+   struct irq_desc *desc;
+   unsigned int irq;
+   int eqn;
+   int err;
+
+   err = mlx5_vector2eqn(dev, vector, &eqn, &irq);
+   if (err)
+   return NULL;
+
+   desc = irq_to_desc(irq);
+#ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK
+   mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
+#else
+   mask = desc->irq_common_data.affinity;
+#endif
+   return mask;
 }
 
 #endif /* MLX5_DRIVER_H */
-- 
2.13.0
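
To see why the NULL return matters, here is a hedged sketch of how a
consumer might use the helper to pick a CPU near a completion vector;
the caller and fallback policy are invented for illustration:

/* Hypothetical caller: fall back to all online CPUs when the helper
 * returns NULL (i.e. mlx5_vector2eqn() failed) or the mask is empty.
 */
static int example_pick_cpu(struct mlx5_core_dev *mdev, int vector)
{
	const struct cpumask *mask = mlx5_get_vector_affinity(mdev, vector);

	if (!mask || cpumask_empty(mask))
		mask = cpu_online_mask;

	return cpumask_first(mask);
}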



Re: [PATCH bpf-next v4 5/5] error-injection: Support fault injection framework

2018-01-11 Thread Akinobu Mita
2018-01-12 1:15 GMT+09:00 Masami Hiramatsu :
> On Thu, 11 Jan 2018 23:44:57 +0900
> Akinobu Mita  wrote:
>
>> 2018-01-11 9:51 GMT+09:00 Masami Hiramatsu :
>> > Support in-kernel fault-injection framework via debugfs.
>> > This allows you to inject a conditional error into a specified
>> > function using debugfs interfaces.
>> >
>> > Here is the result of the test script described in
>> > Documentation/fault-injection/fault-injection.txt
>> >
>> >   ===
>> >   # ./test_fail_function.sh
>> >   1+0 records in
>> >   1+0 records out
>> >   1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
>> >   btrfs-progs v4.4
>> >   See http://btrfs.wiki.kernel.org for more information.
>> >
>> >   Label:  (null)
>> >   UUID:   bfa96010-12e9-4360-aed0-42eec7af5798
>> >   Node size:  16384
>> >   Sector size:    4096
>> >   Filesystem size:    1001.00MiB
>> >   Block group profiles:
>> > Data: single  8.00MiB
>> > Metadata: DUP  58.00MiB
>> > System:   DUP  12.00MiB
>> >   SSD detected:   no
>> >   Incompat features:  extref, skinny-metadata
>> >   Number of devices:  1
>> >   Devices:
>> >  ID  SIZE  PATH
>> >   1  1001.00MiB  /dev/loop2
>> >
>> >   mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
>> >   SUCCESS!
>> >   ===
>> >
>> >
>> > Signed-off-by: Masami Hiramatsu 
>> > Reviewed-by: Josef Bacik 
>> > ---
>> >   Changes in v3:
>> >- Check and adjust error value for each target function
>> >- Clear kprobe flag for reuse
>> >- Add more documents and example
>> > ---
>> >  Documentation/fault-injection/fault-injection.txt |   62 ++
>> >  kernel/Makefile   |1
>> >  kernel/fail_function.c|  217 +
>> >  lib/Kconfig.debug |   10 +
>> >  4 files changed, 290 insertions(+)
>> >  create mode 100644 kernel/fail_function.c
>> >
>> > diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt
>> > index 918972babcd8..4aecbceef9d2 100644
>> > --- a/Documentation/fault-injection/fault-injection.txt
>> > +++ b/Documentation/fault-injection/fault-injection.txt
>> > @@ -30,6 +30,12 @@ o fail_mmc_request
>> >injects MMC data errors on devices permitted by setting
>> >debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
>> >
>> > +o fail_function
>> > +
>> > +  injects an error return on specific functions, which are marked by
>> > +  the ALLOW_ERROR_INJECTION() macro, by setting debugfs entries
>> > +  under /sys/kernel/debug/fail_function. No boot option supported.
>> > +
>> >  Configure fault-injection capabilities behavior
>> >  ---
>> >
>> > @@ -123,6 +129,24 @@ configuration of fault-injection capabilities.
>> > default is 'N', setting it to 'Y' will disable failure injections
>> > when dealing with private (address space) futexes.
>> >
>> > +- /sys/kernel/debug/fail_function/inject:
>> > +
>> > +   specifies the target function of error injection by name.
>> > +
>> > +- /sys/kernel/debug/fail_function/retval:
>> > +
>> > +   specifies the "error" return value to inject into the given
>> > +   function.
>> > +
>>
>> Is it possible to inject errors into multiple functions at the same time?
>
> Yes, it is.
>
>> If so, it will be more useful to support it in the fault injection, too.
>> Because some kind of bugs are caused by the combination of errors.
>> (e.g. another error in an error path)
>>
>> I suggest the following interface.
>>
>> - /sys/kernel/debug/fail_function/inject:
>>
>>   specifies the target function of error injection by name.
>>   /sys/kernel/debug/fail_function/<function>/ directory will be created.
>>
>> - /sys/kernel/debug/fail_function/uninject:
>>
>>   specifies the target function of error injection by name that is
>>   currently being injected.  /sys/kernel/debug/fail_function/<function>/
>>   directory will be removed.
>>
>> - /sys/kernel/debug/fail_function/<function>/retval:
>>
>>   specifies the "error" return value to inject into the given function.
>
> OK, that is easy to implement. But we might also need to consider using bpf
> if we do such complex error injection.
>
> BTW, would we need an "uninject" file? Or just make the inject file accept
> "!function" syntax to remove a function, as ftrace does?

That also sounds good.  Either way is fine with me.
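
For readers following the thread: independent of how the debugfs layout
ends up, the core mechanism under discussion is a kprobe whose
pre-handler forces the configured return value and skips the probed
function body. A simplified sketch of that idea, with invented names
(abbreviated; not the exact fail_function.c code):

#include <linux/kprobes.h>
#include <linux/ptrace.h>
#include <linux/fault-inject.h>
#include <linux/error-injection.h>

struct example_attr {
	struct kprobe kp;
	unsigned long retval;	/* value configured via debugfs */
};

static DECLARE_FAULT_ATTR(example_fault_attr);

static int example_kprobe_handler(struct kprobe *kp, struct pt_regs *regs)
{
	struct example_attr *attr = container_of(kp, struct example_attr, kp);

	if (should_fail(&example_fault_attr, 1)) {
		/* Force the error code and return straight to the
		 * caller without executing the probed function.
		 */
		regs_set_return_value(regs, attr->retval);
		override_function_with_return(regs);
		/* Non-zero: we changed the IP, skip single-stepping. */
		return 1;
	}

	return 0;
}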


Backporting "netfilter: xt_hashlimit: Fix integer divide round to zero." to stable kernels

2018-01-11 Thread Cyril Brulebois
Hi,

A customer of mine has been hitting hashlimit issues in netfilter after
switching from Debian 8 (3.16.y) to Debian 9 (4.9.y), making iptables
hashlimit rules unusable in production.

This issue has been fixed in mainline with this commit:
| commit ad5b55761956427f61ed9c96961bf9c5cd4f92dc
| Author: Alban Browaeys 
| Date:   Mon Feb 6 23:50:33 2017 +0100
| 
| netfilter: xt_hashlimit: Fix integer divide round to zero.

Backporting this commit to Debian 9's 4.9.y kernel series has been
confirmed to fix the bug there. It might be worth considering for
other stable kernels as well.

Downstream bug reports:
  https://bugs.debian.org/872907
  https://bugs.debian.org/884983

Thanks for considering.


Cheers,
-- 
Cyril Brulebois -- Debian Consultant @ DEBAMAX -- https://debamax.com/




Re: [patch net-next 5/5] mlxsw: spectrum: qdiscs: Support stats for PRIO qdisc

2018-01-11 Thread Jakub Kicinski
On Thu, 11 Jan 2018 11:21:02 +0100, Jiri Pirko wrote:
> From: Nogah Frankel 
> 
> Support basic stats for the PRIO qdisc: tx packet and byte counts, drop
> count and backlog size. The rest of the stats are irrelevant for this
> qdisc offload.
> Since backlog is not an incremental counter but reflects a momentary
> value, a qdisc that stops being offloaded without being destroyed needs
> its backlog value updated upon un-offloading.
> For that reason an unoffload function is added to the ops struct.
> 
> Signed-off-by: Nogah Frankel 
> Reviewed-by: Yuval Mintz 
> Signed-off-by: Jiri Pirko 
> ---
>  .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c   | 92 ++
>  1 file changed, 92 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
> index 9e83edde7b35..272c04951e5d 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
> @@ -66,6 +66,11 @@ struct mlxsw_sp_qdisc_ops {
> void *xstats_ptr);
>   void (*clean_stats)(struct mlxsw_sp_port *mlxsw_sp_port,
>   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc);
> + /* unoffload - to be used for a qdisc that stops being offloaded without
> +  * being destroyed.
> +  */
> + void (*unoffload)(struct mlxsw_sp_port *mlxsw_sp_port,
> +   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params);

Hm.  Do you need this just because you didn't add the backlog pointer
to destroy?  AFAIK on destroy we are free to reset stats as well, thus
simplifying your driver...  Let me know if I misunderstand.

>  };
>  
>  struct mlxsw_sp_qdisc {
> @@ -73,6 +78,9 @@ struct mlxsw_sp_qdisc {
>   u8 tclass_num;
>   union {
>   struct red_stats red;
> + struct mlxsw_sp_qdisc_prio_stats {
> + u64 backlog;

This is not a prio stat, it's a standard qstat.  I've added it to
struct mlxsw_sp_qdisc_stats.  The reason you need to treat it
separately is that RED has non-standard backlog handling which I'm
trying to fix...

> + } prio;
>   } xstats_base;
>   struct mlxsw_sp_qdisc_stats {
>   u64 tx_bytes;
> @@ -144,6 +152,9 @@ mlxsw_sp_qdisc_replace(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle,
>  
>  err_bad_param:
>  err_config:
> + if (mlxsw_sp_qdisc->handle == handle && ops->unoffload)
> + ops->unoffload(mlxsw_sp_port, mlxsw_sp_qdisc, params);
> +
>   mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc);
>   return err;
>  }

> @@ -479,6 +567,10 @@ int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port,
>   switch (p->command) {
>   case TC_PRIO_DESTROY:
>   return mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc);
> + case TC_PRIO_STATS:
> + return mlxsw_sp_qdisc_get_stats(mlxsw_sp_port, mlxsw_sp_qdisc,
> + >stats);
> +

nit: extra new line intentional? :)

>   default:
>   return -EOPNOTSUPP;
>   }



Re: [patch iproute2 v8 1/2] lib/libnetlink: Add functions rtnl_talk_msg and rtnl_talk_iov

2018-01-11 Thread David Ahern
On 1/11/18 8:08 AM, Phil Sutter wrote:
> On Wed, Jan 10, 2018 at 09:12:45PM +0100, Phil Sutter wrote:
>> On Wed, Jan 10, 2018 at 12:20:36PM -0700, David Ahern wrote:
>> [...]
>>> 2. I am using a batch file with drop filters:
>>>
>>> filter add dev eth2 ingress protocol ip pref 273 flower dst_ip
>>> 192.168.253.0/16 action drop
>>>
>>> and for each command tc is trying to dlopen m_drop.so:
>>>
>>> open("/usr/lib/tc//m_drop.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such
>>> file or directory)
>>
>> [...]
>>
>>> Can you look at a follow-on patch (not part of this set) to cache status
>>> of dlopen attempts?
>>
>> IMHO the logic used in get_action_kind() for gact is the culprit here:
>> After trying to dlopen m_drop.so, it dlopens m_gact.so although it is
>> present already. (Unless I missed something.)
> 
> Not quite, m_gact.c is statically compiled in and there is logic around
> dlopen(NULL, ...) to prevent calling it twice.
> 
>> I guess the better (and easier) fix would be to create some more struct
>> action_util instances in m_gact.c for the primitives it supports so that
>> the lookup in action_list succeeds for consecutive uses. Note that
>> parse_gact() even supports this already.
> 
> Sadly, this doesn't fly: If a lookup for action 'drop' is successful,
> that value is set as TCA_ACT_KIND and the kernel doesn't know about it.
> 
> I came up with an alternative solution; what do you think about the
> attached patch?

Looks ok to me and removes the repeated open overhead. Send it formally
and cc Jiri and Jamal.

Thanks,

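For context, the first idea Phil mentions would have amounted to alias
entries along these lines in m_gact.c (a sketch using tc's struct
action_util; illustrative, not the attached patch), and the quoted
follow-up explains exactly why it fails:

/* Hypothetical alias for a gact primitive. Lookup in action_list
 * would then succeed for "drop" without any dlopen(), but tc would
 * also emit "drop" as TCA_ACT_KIND, a kind the kernel does not know,
 * which is why this approach was abandoned.
 */
struct action_util drop_action_util = {
	.id = "drop",
	.parse_aopt = parse_gact,
	.print_aopt = print_gact,
};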

RE: [patch net-next 3/5] net: sch: prio: Add offload ability to PRIO qdisc

2018-01-11 Thread Yuval Mintz
> > > > +struct tc_prio_qopt_offload_params {
> > > > +   int bands;
> > > > +   u8 priomap[TC_PRIO_MAX + 1];
> > > > +   /* In case that a prio qdisc is offloaded and now is changed to a
> > > > +* non-offloadable config, it needs to update the backlog value
> > > > +* to negate the HW backlog value.
> > > > +*/
> > > > +   u32 *backlog;
> > > > +};
> > >
> > > Could we please pass the full qstats on replace and destroy.  This
> > > simplifies the driver code and allows handling the qlen as well as
> > > backlog.  Please see the 2 patch series I sent earlier yesterday.
> >
> > That might give the false impression that offloading driver is expected
> > to correct all the qstats fields during destruction, whereas for most of
> > them it doesn't seem appropriate.
> 
> The driver is supposed to return the momentary stats to their
> original/SW-only value.  And the driver knows exactly which stats
> those are, just look at your patch 5, you handle backlog completely
> differently than other stats already.

*we* surely understand that now. I'm just mentioning it might
confuse future offloaders; no strong objection here.
And I agree the alternative [passing pointers to each momentary stat]
is quite ugly.
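
Concretely, with the u32 *backlog field quoted above, the only
momentary stat an un-offloading driver hands back is the backlog it
had been contributing, along these lines (port structure and counter
name invented for illustration):

/* Hypothetical unoffload callback: subtract the HW-contributed
 * backlog so the software qdisc continues from its own view once
 * the offload is gone.
 */
static void example_prio_unoffload(struct example_port *port, void *params)
{
	struct tc_prio_qopt_offload_params *p = params;

	*p->backlog -= port->hw_backlog_bytes;
}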


[PATCH v2] bnx2x: disable GSO where gso_size is too big for hardware

2018-01-11 Thread Daniel Axtens
If a bnx2x card is passed a GSO packet with a gso_size larger than
~9700 bytes, it will cause a firmware error that will bring the card
down:

bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert!
bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2
bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e43e47 0x00463e01 0x00010052
bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 7_13_1
... (dump of values continues) ...

Detect when gso_size + header length is greater than the maximum
packet size (9700 bytes) and disable GSO. For simplicity and speed
this is approximated by comparing gso_size against 9200 and assuming
no-one will have more than 500 bytes of headers.

This raises the obvious question - how do we end up with a packet with
a gso_size that's greater than 9700? This has been observed on a
powerpc system when Open vSwitch is forwarding a packet from an
ibmveth device.

ibmveth is a bit special. It's the driver for communication between
virtual machines (aka 'partitions'/LPARs) running under IBM's
proprietary hypervisor on ppc machines. It allows sending very large
packets (up to 64kB) between LPARs. This involves some quite
'interesting' things: for example, when talking TCP, the MSS is stored in
the checksum field (see ibmveth_rx_mss_helper() in ibmveth.c).

Normally on a box like this, there would be a Virtual I/O Server
(VIOS) partition that owns the physical network card. VIOS lets the
AIX partitions know when they're talking to a real network and that
they should drop their MSS. This works fine if VIOS owns the physical
network card.

However, in this case, a Linux partition owns the card (this is known
as a NovaLink setup). The negotiation between VIOS and AIX uses a
non-standard TCP option, so Linux has never supported that.  Instead,
Linux just supports receiving large packets. It doesn't support any
form of messaging/MSS negotiation back to other LPARs.

To get some clarity about where the large MSS was coming from, I asked
Thomas Falcon, the maintainer of ibmveth, for some background:

"In most cases, large segments are an aggregation of smaller packets
by the Virtual I/O Server (VIOS) partition and then are forwarded to
the Linux LPAR / ibmveth driver. These segments can be as large as
64KB. In this case, since the customer is using Novalink, I believe
what is happening is pretty straightforward: the large segments are
created by the AIX partition and then forwarded to the Linux
partition, ... The ibmveth driver doesn't do any aggregation itself
but just ensures the proper bits are set before sending the frame up
to avoid giving the upper layers indigestion."

It is possible to stop AIX from sending these large segments, but it
requires configuration on each LPAR. While ibmveth's behaviour is
admittedly weird, we should fix this here: it shouldn't be possible
for it to cause a firmware panic on another card.

Cc: Thomas Falcon  # ibmveth
Cc: Yuval Mintz  # bnx2x
Thanks-to: Jay Vosburgh  # veth info
Signed-off-by: Daniel Axtens 

---
v2: change to a feature check as suggested by Eric Dumazet.

---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 7b08323e3f3d..bab909b5d7a2 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12934,6 +12934,17 @@ static netdev_features_t bnx2x_features_check(struct sk_buff *skb,
  struct net_device *dev,
  netdev_features_t features)
 {
+   /*
+* A skb with gso_size + header length > 9700 will cause a
+* firmware panic. Drop GSO support.
+*
+* To avoid costly calculations on all packets (and because
+* super-jumbo frames are rare), allow 500 bytes of headers
+* and just disable GSO if gso_size is greater than 9200.
+*/
+   if (unlikely(skb_is_gso(skb) && skb_shinfo(skb)->gso_size > 9200))
+   features &= ~NETIF_F_GSO_MASK;
+
features = vlan_features_check(skb, features);
return vxlan_features_check(skb, features);
 }
-- 
2.14.1



Re: [patch net-next 3/5] net: sch: prio: Add offload ability to PRIO qdisc

2018-01-11 Thread Jakub Kicinski
On Thu, 11 Jan 2018 23:50:27 +, Yuval Mintz wrote:
> > > +struct tc_prio_qopt_offload_params {
> > > + int bands;
> > > + u8 priomap[TC_PRIO_MAX + 1];
> > > + /* In case that a prio qdisc is offloaded and now is changed to a
> > > +  * non-offloadable config, it needs to update the backlog value
> > > +  * to negate the HW backlog value.
> > > +  */
> > > + u32 *backlog;
> > > +};  
> > 
> > Could we please pass the full qstats on replace and destroy.  This
> > simplifies the driver code and allows handling the qlen as well as
> > backlog.  Please see the 2 patch series I sent earlier yesterday.  
> 
> That might give the false impression that offloading driver is expected
> to correct all the qstats fields during destruction, whereas for most of
> them it doesn't seem appropriate.

The driver is supposed to return the momentary stats to their
original/SW-only value.  And the driver knows exactly which stats 
those are, just look at your patch 5, you handle backlog completely
differently than other stats already.


RE: [patch net-next 3/5] net: sch: prio: Add offload ability to PRIO qdisc

2018-01-11 Thread Yuval Mintz
> > +struct tc_prio_qopt_offload_params {
> > +   int bands;
> > +   u8 priomap[TC_PRIO_MAX + 1];
> > +   /* In case that a prio qdisc is offloaded and now is changed to a
> > +* non-offloadable config, it needs to update the backlog value
> > +* to negate the HW backlog value.
> > +*/
> > +   u32 *backlog;
> > +};
> 
> Could we please pass the full qstats on replace and destroy.  This
> simplifies the driver code and allows handling the qlen as well as
> backlog.  Please see the 2 patch series I sent earlier yesterday.

That might give the false impression that offloading driver is expected
to correct all the qstats fields during destruction, whereas for most of
them it doesn't seem appropriate.


Re: [patch net-next 3/5] net: sch: prio: Add offload ability to PRIO qdisc

2018-01-11 Thread Jakub Kicinski
On Thu, 11 Jan 2018 11:21:00 +0100, Jiri Pirko wrote:
> +struct tc_prio_qopt_offload_params {
> + int bands;
> + u8 priomap[TC_PRIO_MAX + 1];
> + /* In case that a prio qdisc is offloaded and now is changed to a
> +  * non-offloadable config, it needs to update the backlog value
> +  * to negate the HW backlog value.
> +  */
> + u32 *backlog;
> +};

Could we please pass the full qstats on replace and destroy.  This
simplifies the driver code and allows handling the qlen as well as
backlog.  Please see the 2 patch series I sent earlier yesterday. 


Re: [PATCH 34/38] arm: Implement thread_struct whitelist for hardened usercopy

2018-01-11 Thread Kees Cook
On Thu, Jan 11, 2018 at 2:24 AM, Russell King - ARM Linux wrote:
> On Wed, Jan 10, 2018 at 06:03:06PM -0800, Kees Cook wrote:
>> ARM does not carry FPU state in the thread structure, so it can declare
>> no usercopy whitelist at all.
>
> This comment seems to be misleading.  We have stored FP state in the
> thread structure for a long time - for example, VFP state is stored
> in thread->vfpstate.hard, so we _do_ have floating point state in
> the thread structure.
>
> What I think this commit message needs to describe is why we don't
> need a whitelist _despite_ having FP state in the thread structure.
>
> At the moment, the commit message is making me think that this patch
> is wrong and will introduce a regression.

Yeah, I will improve this comment; it's not clear enough. The places
where I see state copied to/from userspace are all either statically
sized or already use bounce buffers (or both), e.g.:

err |= __copy_from_user(&hwstate->fpregs, &ufp->fpregs,
sizeof(hwstate->fpregs));

I will adjust the commit log and comment to more clearly describe the
lack of whitelisting due to all-static sized copies.

Thanks!

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 13/38] ext4: Define usercopy region in ext4_inode_cache slab cache

2018-01-11 Thread Kees Cook
On Thu, Jan 11, 2018 at 9:01 AM, Theodore Ts'o  wrote:
> On Wed, Jan 10, 2018 at 06:02:45PM -0800, Kees Cook wrote:
>> The ext4 symlink pathnames, stored in struct ext4_inode_info.i_data
>> and therefore contained in the ext4_inode_cache slab cache, need
>> to be copied to/from userspace.
>
> Symlink operations to/from userspace aren't common or in the hot path,
> and when they are in i_data, limited to at most 60 bytes.  Is it worth
> it to copy through a bounce buffer so as to disallow any usercopies
> into struct ext4_inode_info?

If this is the only place it's exposed, yeah, that might be a way to
avoid the per-FS patches. This would, AIUI, require changing
readlink_copy() to include a bounce buffer, and that would require an
allocation. I kind of prefer just leaving the per-FS whitelists, as
then there's no global overhead added.

-Kees

-- 
Kees Cook
Pixel Security
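
For reference, the per-FS whitelist route being defended here boils
down to declaring, at slab-cache creation time, the one region of the
object that may be copied to/from userspace. A sketch using the
kmem_cache_create_usercopy() API from this series (flags and ctor
mirror what ext4 passes today; treat the exact call as illustrative):

/* Whitelist only i_data within ext4_inode_info objects; usercopy
 * to or from any other part of the object will be rejected.
 */
ext4_inode_cachep = kmem_cache_create_usercopy("ext4_inode_cache",
			sizeof(struct ext4_inode_info), 0,
			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD | SLAB_ACCOUNT,
			offsetof(struct ext4_inode_info, i_data),
			sizeof_field(struct ext4_inode_info, i_data),
			init_once);
if (!ext4_inode_cachep)
	return -ENOMEM;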


  1   2   3   >