Re: [PATCH 0/3] Check gso_size of packets when forwarding

2018-01-18 Thread Pravin Shelar
On Thu, Jan 18, 2018 at 5:28 PM, Daniel Axtens  wrote:
> Pravin Shelar  writes:
>
>> On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens  wrote:
>>> Pravin Shelar  writes:
>>>
 On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens  wrote:
> When regular packets are forwarded, we validate their size against the
> MTU of the destination device. However, when GSO packets are
> forwarded, we do not validate their size against the MTU. We
> implicitly assume that when they are segmented, the resultant packets
> will be correctly sized.
>
> This is not always the case.
>
> We observed a case where a packet received on an ibmveth device had a
> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
> device, where it caused a firmware assert. This is described in detail
> at [0] and was the genesis of this series. Rather than fixing it in
> the driver, this series fixes the forwarding path.
>
 Are there any other possible forwarding path in networking stack? TC
 is one subsystem that could forward such a packet to the bnx2x device,
 how is that handled ?
>>>
>>> So far I have only looked at bridges, openvswitch and macvlan. In
>>> general, if the code uses dev_forward_skb() it should automatically be
>>> fine as that invokes is_skb_forwardable(), which we patch.
>>>
>> But there are other ways to forward packets, e.g tc-mirred or bpf
>> redirect. We need to handle all these cases rather than fixing one at
>> a time. As Jason suggested netif_needs_gso() looks like good function
>> to validate if a device is capable of handling given GSO packet.
>
> I am not entirely sure this is a better solution.
>
> The biggest reason I am uncomfortable with this is that if
> netif_needs_gso() returns true, the skb will be segmented. The segment
> sizes will be based on gso_size. Since gso_size is greater than the MTU,
> the resulting segments will themselves be over-MTU. Those over-MTU
> segments will then be passed to the network card. I think we should not
> be creating over-MTU segments; we should instead be dropping the packet
> and logging.
>

Would this oversized segment cause the firmware assert?
If this solves the assert issue then I do not see much value in adding
checks in fast-path just for logging.


[PATCH V2 net-next 1/4] net: hns3: add support for get_regs

2018-01-18 Thread Peng Li
From: Fuyun Liang 

This patch adds get_regs support for ethtool cmd.

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|   3 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  23 +++
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |   4 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 176 +
 4 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 634e932..d104ce5 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -356,7 +356,8 @@ struct hnae3_ae_ops {
u32 stringset, u8 *data);
int (*get_sset_count)(struct hnae3_handle *handle, int stringset);
 
-   void (*get_regs)(struct hnae3_handle *handle, void *data);
+   void (*get_regs)(struct hnae3_handle *handle, u32 *version,
+void *data);
int (*get_regs_len)(struct hnae3_handle *handle);
 
u32 (*get_rss_key_size)(struct hnae3_handle *handle);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index 358f780..1c8b293 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -1063,6 +1063,27 @@ static int hns3_set_coalesce(struct net_device *netdev,
return 0;
 }
 
+static int hns3_get_regs_len(struct net_device *netdev)
+{
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+
+   if (!h->ae_algo->ops->get_regs_len)
+   return -EOPNOTSUPP;
+
+   return h->ae_algo->ops->get_regs_len(h);
+}
+
+static void hns3_get_regs(struct net_device *netdev,
+ struct ethtool_regs *cmd, void *data)
+{
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+
+   if (!h->ae_algo->ops->get_regs)
+   return;
+
+   h->ae_algo->ops->get_regs(h, >version, data);
+}
+
 static const struct ethtool_ops hns3vf_ethtool_ops = {
.get_drvinfo = hns3_get_drvinfo,
.get_ringparam = hns3_get_ringparam,
@@ -1103,6 +1124,8 @@ static const struct ethtool_ops hns3_ethtool_ops = {
.set_channels = hns3_set_channels,
.get_coalesce = hns3_get_coalesce,
.set_coalesce = hns3_set_coalesce,
+   .get_regs_len = hns3_get_regs_len,
+   .get_regs = hns3_get_regs,
 };
 
 void hns3_ethtool_set_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 3c3159b..2561e7a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -102,6 +102,10 @@ enum hclge_opcode_type {
HCLGE_OPC_STATS_64_BIT  = 0x0030,
HCLGE_OPC_STATS_32_BIT  = 0x0031,
HCLGE_OPC_STATS_MAC = 0x0032,
+
+   HCLGE_OPC_QUERY_REG_NUM = 0x0040,
+   HCLGE_OPC_QUERY_32_BIT_REG  = 0x0041,
+   HCLGE_OPC_QUERY_64_BIT_REG  = 0x0042,
/* Device management command */
 
/* MAC commond */
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 27f0ab6..c3d2cca 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5544,6 +5544,180 @@ static int hclge_set_channels(struct hnae3_handle 
*handle, u32 new_tqps_num)
return ret;
 }
 
+static int hclge_get_regs_num(struct hclge_dev *hdev, u32 *regs_num_32_bit,
+ u32 *regs_num_64_bit)
+{
+   struct hclge_desc desc;
+   u32 total_num;
+   int ret;
+
+   hclge_cmd_setup_basic_desc(, HCLGE_OPC_QUERY_REG_NUM, true);
+   ret = hclge_cmd_send(>hw, , 1);
+   if (ret) {
+   dev_err(>pdev->dev,
+   "Query register number cmd failed, ret = %d.\n", ret);
+   return ret;
+   }
+
+   *regs_num_32_bit = le32_to_cpu(desc.data[0]);
+   *regs_num_64_bit = le32_to_cpu(desc.data[1]);
+
+   total_num = *regs_num_32_bit + *regs_num_64_bit;
+   if (!total_num)
+   return -EINVAL;
+
+   return 0;
+}
+
+static int hclge_get_32_bit_regs(struct hclge_dev *hdev, u32 regs_num,
+void *data)
+{
+#define HCLGE_32_BIT_REG_RTN_DATANUM 8
+
+   struct hclge_desc *desc;
+   u32 *reg_val = data;
+   __le32 *desc_data;
+   int cmd_num;
+   int i, k, n;
+   int ret;
+
+   if (regs_num == 0)
+   return 0;
+
+   cmd_num = DIV_ROUND_UP(regs_num + 2, HCLGE_32_BIT_REG_RTN_DATANUM);
+   desc = kcalloc(cmd_num, sizeof(struct hclge_desc), GFP_KERNEL);
+   if (!desc)

[PATCH V2 net-next 2/4] net: hns3: add manager table initialization for hardware

2018-01-18 Thread Peng Li
From: Fuyun Liang 

The manager table is empty by default. If it is not initialized, the
management pkgs like LLDP will be dropped by hardware. Default entries
need to be added to manager table.

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |  22 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 101 +
 2 files changed, 123 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 2561e7a..1cd28e0 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -605,6 +605,28 @@ struct hclge_mac_vlan_mask_entry_cmd {
u8 rsv2[14];
 };
 
+#define HCLGE_MAC_MGR_MASK_VLAN_B  BIT(0)
+#define HCLGE_MAC_MGR_MASK_MAC_B   BIT(1)
+#define HCLGE_MAC_MGR_MASK_ETHERTYPE_B BIT(2)
+#define HCLGE_MAC_ETHERTYPE_LLDP   0x88cc
+
+struct hclge_mac_mgr_tbl_entry_cmd {
+   u8  flags;
+   u8  resp_code;
+   __le16  vlan_tag;
+   __le32  mac_addr_hi32;
+   __le16  mac_addr_lo16;
+   __le16  rsv1;
+   __le16  ethter_type;
+   __le16  egress_port;
+   __le16  egress_queue;
+   u8  sw_port_id_aware;
+   u8  rsv2;
+   u8  i_port_bitmap;
+   u8  i_port_direction;
+   u8  rsv3[2];
+};
+
 #define HCLGE_CFG_MTA_MAC_SEL_S0x0
 #define HCLGE_CFG_MTA_MAC_SEL_MGENMASK(1, 0)
 #define HCLGE_CFG_MTA_MAC_EN_B 0x7
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index c3d2cca..6e64bed 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -392,6 +392,16 @@ static const struct hclge_comm_stats_str 
g_mac_stats_string[] = {
HCLGE_MAC_STATS_FIELD_OFF(mac_rx_send_app_bad_pkt_num)}
 };
 
+static const struct hclge_mac_mgr_tbl_entry_cmd hclge_mgr_table[] = {
+   {
+   .flags = HCLGE_MAC_MGR_MASK_VLAN_B,
+   .ethter_type = cpu_to_le16(HCLGE_MAC_ETHERTYPE_LLDP),
+   .mac_addr_hi32 = cpu_to_le32(htonl(0x0180C200)),
+   .mac_addr_lo16 = cpu_to_le16(htons(0x000E)),
+   .i_port_bitmap = 0x1,
+   },
+};
+
 static int hclge_64_bit_update_stats(struct hclge_dev *hdev)
 {
 #define HCLGE_64_BIT_CMD_NUM 5
@@ -4249,6 +4259,91 @@ int hclge_rm_mc_addr_common(struct hclge_vport *vport,
return status;
 }
 
+static int hclge_get_mac_ethertype_cmd_status(struct hclge_dev *hdev,
+ u16 cmdq_resp, u8 resp_code)
+{
+#define HCLGE_ETHERTYPE_SUCCESS_ADD0
+#define HCLGE_ETHERTYPE_ALREADY_ADD1
+#define HCLGE_ETHERTYPE_MGR_TBL_OVERFLOW   2
+#define HCLGE_ETHERTYPE_KEY_CONFLICT   3
+
+   int return_status;
+
+   if (cmdq_resp) {
+   dev_err(>pdev->dev,
+   "cmdq execute failed for get_mac_ethertype_cmd_status, 
status=%d.\n",
+   cmdq_resp);
+   return -EIO;
+   }
+
+   switch (resp_code) {
+   case HCLGE_ETHERTYPE_SUCCESS_ADD:
+   case HCLGE_ETHERTYPE_ALREADY_ADD:
+   return_status = 0;
+   break;
+   case HCLGE_ETHERTYPE_MGR_TBL_OVERFLOW:
+   dev_err(>pdev->dev,
+   "add mac ethertype failed for manager table 
overflow.\n");
+   return_status = -EIO;
+   break;
+   case HCLGE_ETHERTYPE_KEY_CONFLICT:
+   dev_err(>pdev->dev,
+   "add mac ethertype failed for key conflict.\n");
+   return_status = -EIO;
+   break;
+   default:
+   dev_err(>pdev->dev,
+   "add mac ethertype failed for undefined, code=%d.\n",
+   resp_code);
+   return_status = -EIO;
+   }
+
+   return return_status;
+}
+
+static int hclge_add_mgr_tbl(struct hclge_dev *hdev,
+const struct hclge_mac_mgr_tbl_entry_cmd *req)
+{
+   struct hclge_desc desc;
+   u8 resp_code;
+   u16 retval;
+   int ret;
+
+   hclge_cmd_setup_basic_desc(, HCLGE_OPC_MAC_ETHTYPE_ADD, false);
+   memcpy(desc.data, req, sizeof(struct hclge_mac_mgr_tbl_entry_cmd));
+
+   ret = hclge_cmd_send(>hw, , 1);
+   if (ret) {
+   dev_err(>pdev->dev,
+   "add mac ethertype failed for cmd_send, ret =%d.\n",
+   ret);
+   return ret;
+   }
+
+   resp_code = (le32_to_cpu(desc.data[0]) >> 8) & 0xff;
+   retval = le16_to_cpu(desc.retval);
+
+   return hclge_get_mac_ethertype_cmd_status(hdev, retval, resp_code);

[PATCH V2 net-next 0/4] add some features to hns3 driver

2018-01-18 Thread Peng Li
This patchset adds some features to hns3 driver, include the support
for ethtool command -d, -p and support for manager table.

[Patch 1/4] adds support for ethtool command -d, its ops is get_regs.
driver will send command to command queue, and get regs number and
regs value from command queue.
[Patch 2/4] adds manager table initialization for hardware.
[Patch 3/4] adds support for ethtool command -p. For fiber ports, driver
sends command to command queue, and IMP will write SGPIO regs to control
leds.
[Patch 4/4] adds support for net status led for fiber ports. Net status
include  port speed, total rx/tx packets and link status. Driver send
the status to command queue, and IMP will write SGPIO to control leds.

---
Change log:
V1 -> V2:
1, fix comments from Andrew Lunn, remove the patch "net: hns3: add
ethtool -p support for phy device".
---

Fuyun Liang (2):
  net: hns3: add support for get_regs
  net: hns3: add manager table initialization for hardware

Jian Shen (2):
  net: hns3: add ethtool -p support for fiber port
  net: hns3: add net status led support for fiber port

 drivers/net/ethernet/hisilicon/hns3/hnae3.h|   5 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  35 ++
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |  47 +++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 456 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|   3 +
 5 files changed, 545 insertions(+), 1 deletion(-)

-- 
2.9.3



[PATCH V2 net-next 4/4] net: hns3: add net status led support for fiber port

2018-01-18 Thread Peng Li
From: Jian Shen 

Check the net status per second, include port speed, total rx/tx packets
and link status. Updating the led status for fiber port.

Signed-off-by: Jian Shen 
Signed-off-by: Peng Li 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |   1 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 109 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|   3 +
 3 files changed, 113 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 122f862..3fd10a6 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -115,6 +115,7 @@ enum hclge_opcode_type {
HCLGE_OPC_QUERY_LINK_STATUS = 0x0307,
HCLGE_OPC_CONFIG_MAX_FRM_SIZE   = 0x0308,
HCLGE_OPC_CONFIG_SPEED_DUP  = 0x0309,
+   HCLGE_OPC_STATS_MAC_TRAFFIC = 0x0314,
/* MACSEC command */
 
/* PFC/Pause CMD*/
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 12150f2..32bc6f6 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -39,6 +39,7 @@ static int hclge_set_mta_filter_mode(struct hclge_dev *hdev,
 static int hclge_set_mtu(struct hnae3_handle *handle, int new_mtu);
 static int hclge_init_vlan_config(struct hclge_dev *hdev);
 static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev);
+static int hclge_update_led_status(struct hclge_dev *hdev);
 
 static struct hnae3_ae_algo ae_algo;
 
@@ -505,6 +506,38 @@ static int hclge_32_bit_update_stats(struct hclge_dev 
*hdev)
return 0;
 }
 
+static int hclge_mac_get_traffic_stats(struct hclge_dev *hdev)
+{
+   struct hclge_mac_stats *mac_stats = >hw_stats.mac_stats;
+   struct hclge_desc desc;
+   __le64 *desc_data;
+   int ret;
+
+   /* for fiber port, need to query the total rx/tx packets statstics,
+* used for data transferring checking.
+*/
+   if (hdev->hw.mac.media_type != HNAE3_MEDIA_TYPE_FIBER)
+   return 0;
+
+   if (test_bit(HCLGE_STATE_STATISTICS_UPDATING, >state))
+   return 0;
+
+   hclge_cmd_setup_basic_desc(, HCLGE_OPC_STATS_MAC_TRAFFIC, true);
+   ret = hclge_cmd_send(>hw, , 1);
+   if (ret) {
+   dev_err(>pdev->dev,
+   "Get MAC total pkt stats fail, ret = %d\n", ret);
+
+   return ret;
+   }
+
+   desc_data = (__le64 *)([0]);
+   mac_stats->mac_tx_total_pkt_num += le64_to_cpu(*desc_data++);
+   mac_stats->mac_rx_total_pkt_num += le64_to_cpu(*desc_data);
+
+   return 0;
+}
+
 static int hclge_mac_update_stats(struct hclge_dev *hdev)
 {
 #define HCLGE_MAC_CMD_NUM 21
@@ -2846,13 +2879,20 @@ static void hclge_service_task(struct work_struct *work)
struct hclge_dev *hdev =
container_of(work, struct hclge_dev, service_task);
 
+   /* The total rx/tx packets statstics are wanted to be updated
+* per second. Both hclge_update_stats_for_all() and
+* hclge_mac_get_traffic_stats() can do it.
+*/
if (hdev->hw_stats.stats_timer >= HCLGE_STATS_TIMER_INTERVAL) {
hclge_update_stats_for_all(hdev);
hdev->hw_stats.stats_timer = 0;
+   } else {
+   hclge_mac_get_traffic_stats(hdev);
}
 
hclge_update_speed_duplex(hdev);
hclge_update_link_status(hdev);
+   hclge_update_led_status(hdev);
hclge_service_complete(hdev);
 }
 
@@ -5888,6 +5928,75 @@ static int hclge_set_led_id(struct hnae3_handle *handle,
return ret;
 }
 
+enum hclge_led_port_speed {
+   HCLGE_SPEED_LED_FOR_1G,
+   HCLGE_SPEED_LED_FOR_10G,
+   HCLGE_SPEED_LED_FOR_25G,
+   HCLGE_SPEED_LED_FOR_40G,
+   HCLGE_SPEED_LED_FOR_50G,
+   HCLGE_SPEED_LED_FOR_100G,
+};
+
+static u8 hclge_led_get_speed_status(u32 speed)
+{
+   u8 speed_led;
+
+   switch (speed) {
+   case HCLGE_MAC_SPEED_1G:
+   speed_led = HCLGE_SPEED_LED_FOR_1G;
+   break;
+   case HCLGE_MAC_SPEED_10G:
+   speed_led = HCLGE_SPEED_LED_FOR_10G;
+   break;
+   case HCLGE_MAC_SPEED_25G:
+   speed_led = HCLGE_SPEED_LED_FOR_25G;
+   break;
+   case HCLGE_MAC_SPEED_40G:
+   speed_led = HCLGE_SPEED_LED_FOR_40G;
+   break;
+   case HCLGE_MAC_SPEED_50G:
+   speed_led = HCLGE_SPEED_LED_FOR_50G;
+   break;
+   case HCLGE_MAC_SPEED_100G:
+   speed_led = HCLGE_SPEED_LED_FOR_100G;
+   break;
+   default:
+   speed_led = HCLGE_LED_NO_CHANGE;
+   }
+
+   return speed_led;
+}
+
+static int hclge_update_led_status(struct 

[PATCH V2 net-next 3/4] net: hns3: add ethtool -p support for fiber port

2018-01-18 Thread Peng Li
From: Jian Shen 

Add led location support for fiber port. The led will keep blinking
when locating.

Signed-off-by: Jian Shen 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  2 +
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 12 
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 20 +++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 70 ++
 4 files changed, 104 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index d104ce5..fd06bc7 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -405,6 +405,8 @@ struct hnae3_ae_ops {
int (*set_channels)(struct hnae3_handle *handle, u32 new_tqps_num);
void (*get_flowctrl_adv)(struct hnae3_handle *handle,
 u32 *flowctrl_adv);
+   int (*set_led_id)(struct hnae3_handle *handle,
+ enum ethtool_phys_id_state status);
 };
 
 struct hnae3_dcb_ops {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index 1c8b293..7410205 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -1084,6 +1084,17 @@ static void hns3_get_regs(struct net_device *netdev,
h->ae_algo->ops->get_regs(h, >version, data);
 }
 
+static int hns3_set_phys_id(struct net_device *netdev,
+   enum ethtool_phys_id_state state)
+{
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+
+   if (!h->ae_algo || !h->ae_algo->ops || !h->ae_algo->ops->set_led_id)
+   return -EOPNOTSUPP;
+
+   return h->ae_algo->ops->set_led_id(h, state);
+}
+
 static const struct ethtool_ops hns3vf_ethtool_ops = {
.get_drvinfo = hns3_get_drvinfo,
.get_ringparam = hns3_get_ringparam,
@@ -1126,6 +1137,7 @@ static const struct ethtool_ops hns3_ethtool_ops = {
.set_coalesce = hns3_set_coalesce,
.get_regs_len = hns3_get_regs_len,
.get_regs = hns3_get_regs,
+   .set_phys_id = hns3_set_phys_id,
 };
 
 void hns3_ethtool_set_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 1cd28e0..122f862 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -227,6 +227,9 @@ enum hclge_opcode_type {
 
/* Mailbox cmd */
HCLGEVF_OPC_MBX_PF_TO_VF= 0x2000,
+
+   /* Led command */
+   HCLGE_OPC_LED_STATUS_CFG= 0xB000,
 };
 
 #define HCLGE_TQP_REG_OFFSET   0x8
@@ -807,6 +810,23 @@ struct hclge_reset_cmd {
 #define HCLGE_NIC_CMQ_DESC_NUM 1024
 #define HCLGE_NIC_CMQ_DESC_NUM_S   3
 
+#define HCLGE_LED_PORT_SPEED_STATE_S   0
+#define HCLGE_LED_PORT_SPEED_STATE_M   GENMASK(5, 0)
+#define HCLGE_LED_ACTIVITY_STATE_S 0
+#define HCLGE_LED_ACTIVITY_STATE_M GENMASK(1, 0)
+#define HCLGE_LED_LINK_STATE_S 0
+#define HCLGE_LED_LINK_STATE_M GENMASK(1, 0)
+#define HCLGE_LED_LOCATE_STATE_S   0
+#define HCLGE_LED_LOCATE_STATE_M   GENMASK(1, 0)
+
+struct hclge_set_led_state_cmd {
+   u8 port_speed_led_config;
+   u8 link_led_config;
+   u8 activity_led_config;
+   u8 locate_led_config;
+   u8 rsv[20];
+};
+
 int hclge_cmd_init(struct hclge_dev *hdev);
 static inline void hclge_write_reg(void __iomem *base, u32 reg, u32 value)
 {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 6e64bed..12150f2 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5819,6 +5819,75 @@ static void hclge_get_regs(struct hnae3_handle *handle, 
u32 *version,
"Get 64 bit register failed, ret = %d.\n", ret);
 }
 
+static int hclge_set_led_status_sfp(struct hclge_dev *hdev, u8 
speed_led_status,
+   u8 act_led_status, u8 link_led_status,
+   u8 locate_led_status)
+{
+   struct hclge_set_led_state_cmd *req;
+   struct hclge_desc desc;
+   int ret;
+
+   hclge_cmd_setup_basic_desc(, HCLGE_OPC_LED_STATUS_CFG, false);
+
+   req = (struct hclge_set_led_state_cmd *)desc.data;
+   hnae_set_field(req->port_speed_led_config, HCLGE_LED_PORT_SPEED_STATE_M,
+  HCLGE_LED_PORT_SPEED_STATE_S, speed_led_status);
+   hnae_set_field(req->link_led_config, HCLGE_LED_ACTIVITY_STATE_M,
+  HCLGE_LED_ACTIVITY_STATE_S, act_led_status);
+   hnae_set_field(req->activity_led_config, HCLGE_LED_LINK_STATE_M,
+  

Re: [PATCH 0/3] Check gso_size of packets when forwarding

2018-01-18 Thread Daniel Axtens
Daniel Axtens  writes:

> Pravin Shelar  writes:
>
>> On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens  wrote:
>>> Pravin Shelar  writes:
>>>
 On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens  wrote:
> When regular packets are forwarded, we validate their size against the
> MTU of the destination device. However, when GSO packets are
> forwarded, we do not validate their size against the MTU. We
> implicitly assume that when they are segmented, the resultant packets
> will be correctly sized.
>
> This is not always the case.
>
> We observed a case where a packet received on an ibmveth device had a
> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
> device, where it caused a firmware assert. This is described in detail
> at [0] and was the genesis of this series. Rather than fixing it in
> the driver, this series fixes the forwarding path.
>
 Are there any other possible forwarding path in networking stack? TC
 is one subsystem that could forward such a packet to the bnx2x device,
 how is that handled ?
>>>
>>> So far I have only looked at bridges, openvswitch and macvlan. In
>>> general, if the code uses dev_forward_skb() it should automatically be
>>> fine as that invokes is_skb_forwardable(), which we patch.
>>>
>> But there are other ways to forward packets, e.g tc-mirred or bpf
>> redirect. We need to handle all these cases rather than fixing one at
>> a time. As Jason suggested netif_needs_gso() looks like good function
>> to validate if a device is capable of handling given GSO packet.
>
> I am not entirely sure this is a better solution.
>
> The biggest reason I am uncomfortable with this is that if
> netif_needs_gso() returns true, the skb will be segmented. The segment
> sizes will be based on gso_size. Since gso_size is greater than the MTU,
> the resulting segments will themselves be over-MTU. Those over-MTU
> segments will then be passed to the network card. I think we should not
> be creating over-MTU segments; we should instead be dropping the packet
> and logging.
>
> I do take the point that you and Jason are making: a more generic
> fix would  be good. I'm just not sure where to put it.

How about this? It puts the check in validate_xmit_skb(). (completely untested)

Ultimately I think we need to make a broader policy decision about who
is responsible for ensuring that packets are correctly sized - is it the
caller of dev_queue_xmit (as in bridges and openswitch) or is it lower
down the stack, such as validate_xmit_skb()? Making the caller check is
more efficient - packets created on the system are always going to fit
due to the checks in the protocol code, so rechecking is inefficient and
unnecessary. But checking later is more reliable as we don't have to
identify all forwarding paths.

Regards,
Daniel

diff --git a/net/core/dev.c b/net/core/dev.c
index 5af04be128c1..b8c3f33ece19 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1830,13 +1830,11 @@ static inline void net_timestamp_set(struct sk_buff 
*skb)
__net_timestamp(SKB);   \
}   \
 
-bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff 
*skb)
+static inline bool does_skb_fit_mtu(const struct net_device *dev,
+   const struct sk_buff *skb)
 {
unsigned int len;
 
-   if (!(dev->flags & IFF_UP))
-   return false;
-
len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
if (skb->len <= len)
return true;
@@ -1850,6 +1848,14 @@ bool is_skb_forwardable(const struct net_device *dev, 
const struct sk_buff *skb)
 
return false;
 }
+
+bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff 
*skb)
+{
+   if (!(dev->flags & IFF_UP))
+   return false;
+
+   return does_skb_fit_mtu(dev, skb);
+}
 EXPORT_SYMBOL_GPL(is_skb_forwardable);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
@@ -3081,6 +3087,9 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff 
*skb, struct net_device
if (unlikely(!skb))
goto out_null;
 
+   if (unlikely(!does_skb_fit_mtu(dev, skb)))
+   goto out_kfree_skb;
+
if (netif_needs_gso(skb, features)) {
struct sk_buff *segs;
 



>
>
> Regards,
> Daniel


Re: [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-18 Thread Benjamin Poirier
On 2018/01/18 18:42, Shrikrishna Khare wrote:
> 
> 
> On Thu, 18 Jan 2018, Benjamin Poirier wrote:
> 
> > On 2018/01/18 15:50, Benjamin Poirier wrote:
> > > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> > > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> > > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> > > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> > > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> > > icr=0x8004 (_INT_ASSERTED | _OTHER) in the same situation.
> > > 
> > > Some experimentation showed that this flaw in vmware e1000e emulation can
> > > be worked around by not setting Other in EIAC. This is how it was before
> > > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).
> > 
> > vmware folks, please comment.
> 
> Thank you for bringing this to our attention.
> 
> Using the reported build (ESX 6.5, 7526125) and 4.15.0-rc8+ kernel (which 
> has the said patch), I could bring up e1000e interface (version: 3.2.6-k),
> get dhcp address and even do large file downloads without difficulty.
> 
> Could you give us more pointers on how we may be able to reproduce this 
> locally? Was there anything different with the configuration when the 
> issue was observed? Is the issue consistently reproducible?

It's consistently reproducible, however I noticed that once in a while
there is a genuine "Other" interrupt that comes in and triggers the link
status change. The problem is with interrupts that are triggered via a
write to ICS (such as in e1000e_trigger_lsc()). Can you reproduce a
problem if you do:
ip link set ethX down
ip link set ethX up

If you're building your own kernel, you can add the following patch and
cat /sys/kernel/debug/tracing/trace_pipe

For me it shows on v4.15-rc8:
   <...>-2578  [000]  83527.938321: e1000e_trigger_lsc: trigger_lsc
   <...>-2578  [000] d.h. 83527.938398: e1000_msix_other: icr 0x0

With the patch that I submitted, it shows:
 wickedd-1329  [002] .N..20.123545: e1000e_trigger_lsc: trigger_lsc
  -0 [000] d.h.20.123630: e1000_msix_other: icr 0x8104
  -0 [000] d.h.20.123654: e1000_msix_other: lsc
  -0 [000] d.h.20.123676: e1000_msix_other: mod_timer


diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9f18d39bdc8f..16620ce840fc 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1918,22 +1918,29 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
bool enable = true;
 
icr = er32(ICR);
+   trace_printk("icr 0x%x\n", icr);
+
if (icr & E1000_ICR_RXO) {
+   trace_printk("rxo\n");
ew32(ICR, E1000_ICR_RXO);
enable = false;
/* napi poll will re-enable Other, make sure it runs */
if (napi_schedule_prep(>napi)) {
+   trace_printk("napi schedule\n");
adapter->total_rx_bytes = 0;
adapter->total_rx_packets = 0;
__napi_schedule(>napi);
}
}
if (icr & E1000_ICR_LSC) {
+   trace_printk("lsc\n");
ew32(ICR, E1000_ICR_LSC);
hw->mac.get_link_status = true;
/* guard against interrupt when we're going down */
-   if (!test_bit(__E1000_DOWN, >state))
+   if (!test_bit(__E1000_DOWN, >state)) {
+   trace_printk("mod_timer\n");
mod_timer(>watchdog_timer, jiffies + 1);
+   }
}
 
if (enable && !test_bit(__E1000_DOWN, >state))
@@ -4221,6 +4228,8 @@ static void e1000e_trigger_lsc(struct e1000_adapter 
*adapter)
 {
struct e1000_hw *hw = >hw;
 
+   trace_printk("trigger_lsc\n");
+
if (adapter->msix_entries)
ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
else


Re: [PATCH iproute2] devlink: Ignore unknown attributes

2018-01-18 Thread David Ahern
On 1/17/18 5:28 AM, Arkadi Sharshevsky wrote:
> In case of extending the UAPI old packages would break.
> 
> Signed-off-by: Arkadi Sharshevsky 
> ---
>  devlink/devlink.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/devlink/devlink.c b/devlink/devlink.c
> index 39cda06..c9d1838 100644
> --- a/devlink/devlink.c
> +++ b/devlink/devlink.c
> @@ -343,7 +343,7 @@ static int attr_cb(const struct nlattr *attr, void *data)
>   int type;
>  
>   if (mnl_attr_type_valid(attr, DEVLINK_ATTR_MAX) < 0)
> - return MNL_CB_ERROR;
> + return MNL_CB_OK;
>  
>   type = mnl_attr_get_type(attr);
>   if (mnl_attr_validate(attr, devlink_policy[type]) < 0)
> 

What's the point of calling mnl_attr_type_valid if you disregard a
failure? you might as well not call mnl_attr_type_valid at all.


Re: iscsi target regression due to "tcp: remove prequeue support" patch

2018-01-18 Thread Mike Christie
On 01/18/2018 09:10 AM, Florian Westphal wrote:
>> It would indicate users providing their own ->sk_data_ready() callback
>> must be responsible for waking up a kthread context blocked on
>> sock_recvmsg(..., MSG_WAITALL), when a second ->sk_data_ready() is
>> received before the first sock_recvmsg(..., MSG_WAITALL) completes.
> 
> I agree, it looks like we need something like this?
> (not even build tested):
> 
> diff --git a/drivers/target/iscsi/iscsi_target_nego.c 
> b/drivers/target/iscsi/iscsi_target_nego.c
> index b686e2ce9c0e..3723f8f419aa 100644
> --- a/drivers/target/iscsi/iscsi_target_nego.c
> +++ b/drivers/target/iscsi/iscsi_target_nego.c
> @@ -432,6 +432,9 @@ static void iscsi_target_sk_data_ready(struct sock *sk)
> if (test_and_set_bit(LOGIN_FLAGS_READ_ACTIVE, >login_flags)) {
> write_unlock_bh(>sk_callback_lock);
> pr_debug("Got LOGIN_FLAGS_READ_ACTIVE=1, conn: %p \n", 
> conn);
> +   if (WARN_ON(iscsi_target_sk_data_ready == 
> conn->orig_data_ready))
> +   return;
> +   conn->orig_data_ready(sk);
> return;

This allows iscsi login to work for me.

I ran it against the target-pending for-next branch.


[PATCH net] net: qdisc_pkt_len_init() should be more robust

2018-01-18 Thread Eric Dumazet
From: Eric Dumazet 

Without proper validation of DODGY packets, we might very well
feed qdisc_pkt_len_init() with invalid GSO packets.

tcp_hdrlen() might access out-of-bound data, so let's use
skb_header_pointer() and proper checks.

Whole story is described in commit d0c081b49137 ("flow_dissector:
properly cap thoff field")

We have the goal of validating DODGY packets earlier in the stack,
so we might very well revert this fix in the future.

Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
Cc: Jason Wang 
Reported-by: syzbot+9da69ebac7804...@syzkaller.appspotmail.com
---
 net/core/dev.c |   19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 
0e0ba36eeac9852b8df5ddd398dbc66ad6c83760..613fb4066be7bedbd3cd59ea9cf6eb7ef3100bd0
 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3151,10 +3151,21 @@ static void qdisc_pkt_len_init(struct sk_buff *skb)
hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
 
/* + transport layer */
-   if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
-   hdr_len += tcp_hdrlen(skb);
-   else
-   hdr_len += sizeof(struct udphdr);
+   if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) 
{
+   const struct tcphdr *th;
+   struct tcphdr _tcphdr;
+
+   th = skb_header_pointer(skb, skb_transport_offset(skb),
+   sizeof(_tcphdr), &_tcphdr);
+   if (likely(th))
+   hdr_len += __tcp_hdrlen(th);
+   } else {
+   struct udphdr _udphdr;
+
+   if (skb_header_pointer(skb, skb_transport_offset(skb),
+  sizeof(_udphdr), &_udphdr))
+   hdr_len += sizeof(struct udphdr);
+   }
 
if (shinfo->gso_type & SKB_GSO_DODGY)
gso_segs = DIV_ROUND_UP(skb->len - hdr_len,



Re: [PATCH-next] flow_netlink: Remove unneeded semicolons

2018-01-18 Thread Christopher Díaz Riveros
El jue, 18-01-2018 a las 16:25 -0500, David Miller escribió:
> From: Christopher Díaz Riveros 
> Date: Wed, 17 Jan 2018 16:10:28 -0500
> 
> > Trivial fix removes unneeded semicolons after if blocks.
> > 
> > This issue was detected by using the Coccinelle software.
> > 
> > Signed-off-by: Christopher Díaz Riveros 
> 
> Applied, thanks.

Thank you David,

It's actually my first patch accepted in the kernel :) sorry if it's a
very tiny trivial one, but I'll keep learning and sending more
meaningful ones in the future.

Regards,
-- 
Christopher Díaz Riveros
Gentoo Linux Developer
GPG Fingerprint: E517 5ECB 8152 98E4 FEBC  2BAA 4DBB D10F 0FDD 2547

signature.asc
Description: This is a digitally signed message part


Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc

2018-01-18 Thread Al Viro
On Thu, Jan 18, 2018 at 07:31:56PM +, Al Viro wrote:

> * SIOCADDRT/SIOCDELRT in compat ioctls

To bring back a question I'd asked back in October - what do
we do about SIOC...RT compat?

To recap:
* AF_INET sockets expect struct rtentry; it differs
between 32bit and 64bit, so routing_ioctl() in net/socket.c
is called from compat_sock_ioctl_trans() and does the right
thing.  All proto_ops instances with .family = PF_INET (and
only they) have inet_ioctl() as ->ioctl(), and end up with
ip_rt_ioctl() called for native ones.  Three of those have
->compat_ioctl() set to inet_compat_ioctl(), the rest have
it NULL.  In any case, inet_compat_ioctl() ignores those,
leaving them to compat_sock_ioctl_trans() to pick up.
* for AF_INET6 the situation is similar, except that
they use struct in6_rtmsg.  Compat is also dealt with in
routing_ioctl().  inet6_ioctl() for all such proto_ops
(and only those), ipv6_route_ioctl() is what ends up
handling the native ones.  No ->compat_ioctl() in any
of those.
* AF_PACKET sockets expect struct rt_entry and
actually bounce the native calls to inet_ioctl().  No
->compat_ioctl() there, but routing_ioctl() in net/socket.c
does the right thing.
* AF_APPLETALK sockets expect struct rt_entry.
Native handled in atrtr_ioctl(); there is ->compat_ioctl(),
but it ignores those ioctls, so we go through the conversion
in net/socket.c.  Also happens to work correctly.

* ax25, ipx, netrom, rose and x25 use structures
of their own, and those structures have identical layouts on
32bit and 64bit.  x25 has ->compat_ioctl() that does the
right thing (bounces to native), the rest either have
->compat_ioctl() ignoring those ioctls (ipx) or do not
have ->compat_ioctl() at all.  That ends up with generic
code picking those and buggering them up - routing_ioctl()
assumes that we want either in6_rtmsg (ipv6) or rtentry
(everything else).  Unfortunately, in case of these
protocols we should just leave the suckers alone.
Back then Ralf has verified that the bug exists
and said he'd put together a fix.  Looks like that fix
has fallen through the cracks, though.

* all other protocols fail those; usually with
ENOTTY, except for AF_QIPCRTR that fails with EINVAL.
Either way, compat is not an issue.

Note that handling of SIOCADDRT on e.g. raw ipv4
sockets from 32bit process is convoluted as hell.  The
call chain is
compat_sys_ioctl()
compat_sock_ioctl()
inet_compat_ioctl()
compat_raw_ioctl()
=> -ENOIOCTLCMD, possibly
by way of ipmr_compat_ioctl()
compat_sock_ioctl_trans()
routing_ioctl() [conversion done here]
sock_do_ioctl()
inet_ioctl()
ip_rt_ioctl()
A lot of those are method calls, BTW, and the overhead on those has
just grown...

Does anybody have objections against the following?

1) Somewhere in net/core (or net/compat.c, for that matter) add
int compat_get_rtentry(struct rtentry *r, struct rtentry32 __user *p);

2) In inet_compat_ioctl() recognize SIOC{ADD,DEL}RT and do
err = compat_get_rtentry(, (void __user *)arg);
if (!err)
err = ip_rt_ioctl(...)
return err;

3) Add inet_compat_ioctl() as ->compat_ioctl in all PF_INET proto_ops.

4) Lift copyin from atrtr_ioctl() to atalk_ioctl(), teach
atalk_compat_ioctl() about these ioctls (using compat_get_rtentry()
and atrtr_ioctl(), that is).

5) Add ->compat_ioctl() to AF_PACKET, let it just call inet_compat_ioctl()
for those two.

6) Lift copyin from ipv6_route_ioctl() to inet6_ioctl().  
Add inet6_compat_ioctl() that would recognize those two, do compat copyin
and call ipv6_route_ioctl().  Make it ->compat_ioctl for all PF_INET6
proto_ops.

7) Tell compat_sock_ioctl_trans() to move these two into the "just call
sock_do_ioctl()" group of cases.  Or, with Ralf's fix, just remove these
two cases from compat_sock_ioctl_trans() completely.  Either way,
routing_ioctl() becomes dead code and can be removed.


Re: [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-18 Thread Shrikrishna Khare


On Thu, 18 Jan 2018, Benjamin Poirier wrote:

> On 2018/01/18 15:50, Benjamin Poirier wrote:
> > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> > icr=0x8004 (_INT_ASSERTED | _OTHER) in the same situation.
> > 
> > Some experimentation showed that this flaw in vmware e1000e emulation can
> > be worked around by not setting Other in EIAC. This is how it was before
> > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).
> 
> vmware folks, please comment.

Thank you for bringing this to our attention.

Using the reported build (ESX 6.5, 7526125) and 4.15.0-rc8+ kernel (which 
has the said patch), I could bring up e1000e interface (version: 3.2.6-k),
get dhcp address and even do large file downloads without difficulty.

Could you give us more pointers on how we may be able to reproduce this 
locally? Was there anything different with the configuration when the 
issue was observed? Is the issue consistently reproducible?

Thanks,
Shri


Re: [PATCH net-next 3/5] net: hns3: add ethtool -p support for phy device

2018-01-18 Thread lipeng (Y)



On 2018/1/18 22:25, Andrew Lunn wrote:

+static int hclge_set_led_status_phy(struct phy_device *phydev, int value)
+{
+   int ret, cur_page;
+
+   mutex_lock(>lock);
+
+   ret = phy_read(phydev, HCLGE_PHY_PAGE_REG);
+   if (ret < 0)
+   goto out;
+   else
+   cur_page = ret;
+
+   ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, HCLGE_PHY_PAGE_LED);
+   if (ret)
+   goto out;
+
+   ret = phy_write(phydev, HCLGE_LED_FC_REG, value);
+   if (ret)
+   goto out;
+
+   ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, cur_page);
+
+out:
+   mutex_unlock(>lock);
+   return ret;
+}

Sorry, but NACK.

Please add an interface to phylib and the phy driver you are using to
do this.


  #define HCLGE_PHY_PAGE_MDIX   0
  #define HCLGE_PHY_PAGE_COPPER 0
+#define HCLGE_PHY_PAGE_LED 3
  
  /* Page Selection Reg. */

  #define HCLGE_PHY_PAGE_REG22
@@ -73,6 +74,15 @@
  /* Copper Specific Status Register */
  #define HCLGE_PHY_CSS_REG 17
  
+/* LED Function Control Register */

+#define HCLGE_LED_FC_REG   16
+
+/* LED Polarity Control Register */
+#define HCLGE_LED_PC_REG   17
+
+#define HCLGE_LED_FORCE_ON 9
+#define HCLGE_LED_FORCE_OFF8
+

By the looks of these defines, you assume you have a Marvell PHY.
Please make this generic so anybody with a Marvell PHY can use it.

Andrew

Hi  Andrw,

As your suggestion, we need add  interface to  phylib and the phy driver.
We will consider your suggestion and push this patch after we fix your 
comments.


so we will remove this patch  in V2 patch-set.

Thanks
Peng Li


.






[GIT] Networking

2018-01-18 Thread David Miller

1) Fix BPF divides by zero, from Eric Dumazet and Alexei Starovoitov.

2) Reject stores into bpf context via st and xadd, from Daniel
   Borkmann.

3) Fix a memory leak in TUN, from Cong Wang.

4) Disable RX aggregation on a specific troublesome configuration of
   r8152 in a Dell TB16b dock.

5) Fix sw_ctx leak in tls, from Sabrina Dubroca.

6) Fix program replacement in cls_bpf, from Daniel Borkmann.

7) Fix uninitialized station_info structures in cfg80211, from Johannes
   Berg.

8) Fix miscalculation of transport header offset field in flow
   dissector, from Eric Dumazet.

9) Fix LPM tree leak on failure in mlxsw driver, from Ido Schimmel.

Please pull, thanks a lot!

The following changes since commit 8cbab92dff778e516064c13113ca15d4869ec883:

  Merge tag 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma (2018-01-16 16:47:40 
-0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to a0dca10fce42ae82651edbe682b1c637a8ecd365:

  ibmvnic: Fix IPv6 packet descriptors (2018-01-18 21:19:06 -0500)


Alexei Starovoitov (1):
  bpf: fix 32-bit divide by zero

Alexey Kodanev (1):
  ip6_gre: init dev->mtu and dev->hard_header_len correctly

Arnd Bergmann (1):
  fm10k: mark PM functions as __maybe_unused

Christophe Leroy (1):
  net: fs_enet: do not call phy_stop() in interrupts

Cong Wang (1):
  tun: fix a memory leak for tfile->tx_array

Daniel Borkmann (4):
  bpf, arm64: fix stack_depth tracking in combination with tail calls
  bpf: reject stores into ctx via st and xadd
  bpf: fix cls_bpf on filter replace
  bpf: mark dst unknown on inconsistent {s, u}bounds adjustments

David S. Miller (4):
  Merge tag 'linux-can-fixes-for-4.15-20180116' of 
ssh://gitolite.kernel.org/.../mkl/linux-can
  Merge git://git.kernel.org/.../bpf/bpf
  Merge tag 'wireless-drivers-for-davem-2018-01-17' of 
git://git.kernel.org/.../kvalo/wireless-drivers
  Merge tag 'linux-can-fixes-for-4.15-20180118' of 
ssh://gitolite.kernel.org/.../mkl/linux-can

Eric Dumazet (2):
  bpf: fix divides by zero
  flow_dissector: properly cap thoff field

Guenter Roeck (1):
  bcma: Fix 'allmodconfig' and BCMA builds on MIPS targets

Ido Schimmel (1):
  mlxsw: spectrum_router: Free LPM tree upon failure

Ilya Lesokhin (1):
  net/tls: Only attach to sockets in ESTABLISHED state

James Hogan (1):
  ssb: Disable PCI host for PCI_DRIVERS_GENERIC

Johannes Berg (1):
  cfg80211: fix station info handling bugs

Kai-Heng Feng (1):
  r8152: disable RX aggregation on Dell TB16 dock

Marc Kleine-Budde (2):
  can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
  can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once

Rex Chang (1):
  Net: ethernet: ti: netcp: Fix inbound ping crash if MTU size is greater 
than 1500

Sabrina Dubroca (3):
  tls: fix sw_ctx leak
  tls: return -EBUSY if crypto_info is already set
  tls: reset crypto_info when do_tls_setsockopt_tx fails

Stephane Grosjean (1):
  can: peak: fix potential bug in packet fragmentation

Thomas Falcon (2):
  ibmvnic: Fix IP offload control buffer
  ibmvnic: Fix IPv6 packet descriptors

Wei Wang (1):
  ipv6: don't let tb6_root node share routes with other node

Wright Feng (1):
  brcmfmac: fix CLM load error for legacy chips when user helper is enabled

Xin Long (1):
  netlink: reset extack earlier in netlink_rcv_skb

 arch/arm64/net/bpf_jit_comp.c |  20 ++-
 drivers/bcma/Kconfig  |   2 +-
 drivers/net/can/usb/peak_usb/pcan_usb_fd.c|  21 +--
 drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c |  16 +++--
 drivers/net/ethernet/freescale/fs_enet/fs_enet.h  |   1 +
 drivers/net/ethernet/ibm/ibmvnic.c|  24 -
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c  |   9 ++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c |  20 +++
 drivers/net/ethernet/ti/netcp_core.c  |   2 +-
 drivers/net/tun.c |  15 ++--
 drivers/net/usb/r8152.c   |  13 +++
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/common.c |   9 ++---
 drivers/ssb/Kconfig   |   2 +-
 kernel/bpf/core.c |   4 +--
 kernel/bpf/verifier.c |  64 
+++--
 net/can/af_can.c  |  36 
---
 net/core/filter.c |   4 +++
 net/core/flow_dissector.c |   3 +-
 net/ipv6/ip6_fib.c

Re: [PATCH net v2] ibmvnic: Fix IPv6 packet descriptors

2018-01-18 Thread David Miller
From: Thomas Falcon 
Date: Thu, 18 Jan 2018 19:29:48 -0600

> Packet descriptor generation for IPv6 is broken.
> Properly set L3 and L4 protocol flags for IPv6 descriptors.
> 
> Signed-off-by: Thomas Falcon 
> ---
> Changes in v2:
> -keep declarations in reverse xmas tree format
> -Fix whitespace errors

Also applied, thank you.


Re: [PATCH net] ibmvnic: Fix IP offload control buffer

2018-01-18 Thread David Miller
From: Thomas Falcon 
Date: Thu, 18 Jan 2018 19:05:01 -0600

> Set some missing fields in the IP control offload buffer. This buffer is
> used to enable checksum and TCP segmentation offload in the VNIC server.
> The buffer length field and the checksum offloading bits were not set
> properly, so fix that here.
> 
> Signed-off-by: Thomas Falcon 

Applied.


Re: pull-request: can 2018-01-18

2018-01-18 Thread David Miller
From: Marc Kleine-Budde 
Date: Thu, 18 Jan 2018 11:15:05 +0100

> this is a pull reqeust of two patches for net/master:
> 
> The syzkaller project triggered two WARN_ONCE() in the af_can code from
> userspace and we decided to replace it by a pr_warn_once().

Pulled, thanks Marc.


Re: [PATCH net] ipv6: don't let tb6_root node share routes with other node

2018-01-18 Thread David Miller
From: Wei Wang 
Date: Thu, 18 Jan 2018 10:40:03 -0800

> From: Wei Wang 
> 
> After commit 4512c43eac7e, if we add a route to the subtree of tb6_root
> which does not have any route attached to it yet, the current code will
> let tb6_root and the node in the subtree share the same route.
> This could cause problem cause tb6_root has RTN_INFO flag marked and the
> tree repair and clean up code will not work properly.
> This commit makes sure tb6_root->leaf points back to null_entry instead
> of sharing route with other node.
> 
> It fixes the following syzkaller reported issue:
 ...
> Fixes: 4512c43eac7e ("ipv6: remove null_entry before adding default route")
> Signed-off-by: Wei Wang 
> Acked-by: Eric Dumazet 

Applied, thank you.


Re: [patch 1/1] net/sched/sch_prio.c: work around gcc-4.4.4 union initializer issues

2018-01-18 Thread David Miller
From: a...@linux-foundation.org
Date: Thu, 18 Jan 2018 16:30:49 -0800

> From: Andrew Morton 
> Subject: net/sched/sch_prio.c: work around gcc-4.4.4 union initializer issues
> 
> gcc-4.4.4 has problems witn anon union initializers.  Work around this.
> 
> net/sched/sch_prio.c: In function 'prio_dump_offload':
> net/sched/sch_prio.c:260: error: unknown field 'stats' specified in 
> initializer
> net/sched/sch_prio.c:260: warning: initialization makes integer from pointer 
> without a cast
> net/sched/sch_prio.c:261: error: unknown field 'stats' specified in 
> initializer
> net/sched/sch_prio.c:261: warning: initialization makes integer from pointer 
> without a cast
> 
> Fixes: 7fdb61b44c0c95 ("net: sch: prio: Add offload ability to PRIO qdisc")
> Cc: Nogah Frankel 
> Cc: Yuval Mintz 
> Cc: Jiri Pirko 
> Cc: David S. Miller 
> Signed-off-by: Andrew Morton 

Applied, thanks Andrew.


Re: [PATCH net v2] ip6_gre: init dev->mtu and dev->hard_header_len correctly

2018-01-18 Thread David Miller
From: Alexey Kodanev 
Date: Thu, 18 Jan 2018 20:51:12 +0300

> Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path,
> call common GRE functions") moved dev->mtu initialization
> from ip6gre_tunnel_setup() to ip6gre_tunnel_init(), as a
> result, the previously set values, before ndo_init(), are
> reset in the following cases:
> 
> * rtnl_create_link() can update dev->mtu from IFLA_MTU
>   parameter.
> 
> * ip6gre_tnl_link_config() is invoked before ndo_init() in
>   netlink and ioctl setup, so ndo_init() can reset MTU
>   adjustments with the lower device MTU as well, dev->mtu
>   and dev->hard_header_len.
> 
>   Not applicable for ip6gretap because it has one more call
>   to ip6gre_tnl_link_config(tunnel, 1) in ip6gre_tap_init().
> 
> Fix the first case by updating dev->mtu with 'tb[IFLA_MTU]'
> parameter if a user sets it manually on a device creation,
> and fix the second one by moving ip6gre_tnl_link_config()
> call after register_netdevice().
> 
> Fixes: b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE 
> functions")
> Fixes: db2ec95d1ba4 ("ip6_gre: Fix MTU setting")
> Signed-off-by: Alexey Kodanev 
> ---
> 
> v2: Instead of checking whether dev->mtu equals zero or not
> in ip6gre_tunnel_init_common(), update 'dev->mtu' once
> more with 'IFLA_MTU' parameter after register_netdevice().

Applied and queued up for -stable, thanks Alexey.

I was almost tricked that this approach would have problems
because you do the MTU setting after register_netdevice(), but
we hold the RTNL semaphore so even if the device becomes
globally visible the user can't perform a change of the
device's MTU before we do it.

Thanks.


Re: [patch net] mlxsw: spectrum_router: Free LPM tree upon failure

2018-01-18 Thread David Miller
From: Jiri Pirko 
Date: Thu, 18 Jan 2018 15:42:10 +0100

> From: Ido Schimmel 
> 
> When a new LPM tree is created, we try to replace the trees in the
> existing virtual routers with it. If we fail, the tree needs to be
> freed.
> 
> Currently, this does not happen in the unlikely case where we fail to
> bind the tree to the first virtual router, since its reference count
> never transitions from 1 to 0.
> 
> Fix that by taking a reference before binding the tree.
> 
> Fixes: fc922bb0dd94 ("mlxsw: spectrum_router: Use one LPM tree for all 
> virtual routers")
> Signed-off-by: Ido Schimmel 
> Signed-off-by: Jiri Pirko 

Applied, thanks Jiri.


Re: [PATCH] bridge: return boolean instead of integer in br_multicast_is_router

2018-01-18 Thread Gustavo A. R. Silva


Quoting Stephen Hemminger :


On Thu, 18 Jan 2018 17:37:45 -0600
"Gustavo A. R. Silva"  wrote:


Return statements in functions returning bool should use
true/false instead of 1/0.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 


LGTM

Fixes: 85b352693264 ("bridge: Fix build error when IGMP_SNOOPING is  
not enabled")

Reviewed-by: Stephen Hemminger 


Thanks Stephen.
--
Gustavo






[PATCH net v2] ibmvnic: Fix IPv6 packet descriptors

2018-01-18 Thread Thomas Falcon
Packet descriptor generation for IPv6 is broken.
Properly set L3 and L4 protocol flags for IPv6 descriptors.

Signed-off-by: Thomas Falcon 
---
Changes in v2:
-keep declarations in reverse xmas tree format
-Fix whitespace errors

 drivers/net/ethernet/ibm/ibmvnic.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 736df59..5397c47 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1280,6 +1280,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
unsigned char *dst;
u64 *handle_array;
int index = 0;
+   u8 proto = 0;
int ret = 0;
 
if (adapter->resetting) {
@@ -1368,17 +1369,18 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
}
 
if (skb->protocol == htons(ETH_P_IP)) {
-   if (ip_hdr(skb)->version == 4)
-   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4;
-   else if (ip_hdr(skb)->version == 6)
-   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6;
-
-   if (ip_hdr(skb)->protocol == IPPROTO_TCP)
-   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP;
-   else if (ip_hdr(skb)->protocol != IPPROTO_TCP)
-   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP;
+   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4;
+   proto = ip_hdr(skb)->protocol;
+   } else if (skb->protocol == htons(ETH_P_IPV6)) {
+   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6;
+   proto = ipv6_hdr(skb)->nexthdr;
}
 
+   if (proto == IPPROTO_TCP)
+   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP;
+   else if (proto == IPPROTO_UDP)
+   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP;
+
if (skb->ip_summed == CHECKSUM_PARTIAL) {
tx_crq.v1.flags1 |= IBMVNIC_TX_CHKSUM_OFFLOAD;
hdrs += 2;
-- 
1.8.3.1



Re: [PATCH 0/3] Check gso_size of packets when forwarding

2018-01-18 Thread Daniel Axtens
Pravin Shelar  writes:

> On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens  wrote:
>> Pravin Shelar  writes:
>>
>>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens  wrote:
 When regular packets are forwarded, we validate their size against the
 MTU of the destination device. However, when GSO packets are
 forwarded, we do not validate their size against the MTU. We
 implicitly assume that when they are segmented, the resultant packets
 will be correctly sized.

 This is not always the case.

 We observed a case where a packet received on an ibmveth device had a
 GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
 device, where it caused a firmware assert. This is described in detail
 at [0] and was the genesis of this series. Rather than fixing it in
 the driver, this series fixes the forwarding path.

>>> Are there any other possible forwarding path in networking stack? TC
>>> is one subsystem that could forward such a packet to the bnx2x device,
>>> how is that handled ?
>>
>> So far I have only looked at bridges, openvswitch and macvlan. In
>> general, if the code uses dev_forward_skb() it should automatically be
>> fine as that invokes is_skb_forwardable(), which we patch.
>>
> But there are other ways to forward packets, e.g tc-mirred or bpf
> redirect. We need to handle all these cases rather than fixing one at
> a time. As Jason suggested netif_needs_gso() looks like good function
> to validate if a device is capable of handling given GSO packet.

I am not entirely sure this is a better solution.

The biggest reason I am uncomfortable with this is that if
netif_needs_gso() returns true, the skb will be segmented. The segment
sizes will be based on gso_size. Since gso_size is greater than the MTU,
the resulting segments will themselves be over-MTU. Those over-MTU
segments will then be passed to the network card. I think we should not
be creating over-MTU segments; we should instead be dropping the packet
and logging.

I do take the point that you and Jason are making: a more generic
fix would be good. I'm just not sure where to put it.


Regards,
Daniel


Re: net: r8169: a question of memory barrier in the r8169 driver

2018-01-18 Thread Jia-Ju Bai


On 2018/1/19 9:11, Francois Romieu wrote:

Jia-Ju Bai  :
[...]

The function rtl8169_start_xmit reads tp->dirty_tx in TX_FRAGS_READY_FOR:
 if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) {
 netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n");
 goto err_stop_0;
 }
But there is no memory barrier around this code.

Is there a possible data race here?

This code would not even be needed if rtl8169_start_xmit was only your
usual ndo_start_xmit handler: Realtek {ab / re}used it for GSO handling
(see r8169_csum_workaround).

If the test is not a no-op in this GSO context, it's racy.



Thanks for reply.
I didn't clearly understand your meaning...
I wonder whether there is a possible data race and whether a "smp_mb" is 
needed before this code?

By the way, do you mean that this code can be removed?


Thanks,
Jia-Ju Bai


Re: net: r8169: a question of memory barrier in the r8169 driver

2018-01-18 Thread Francois Romieu
Jia-Ju Bai  :
[...]
> The function rtl8169_start_xmit reads tp->dirty_tx in TX_FRAGS_READY_FOR:
> if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) {
> netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n");
> goto err_stop_0;
> }
> But there is no memory barrier around this code.
> 
> Is there a possible data race here?

This code would not even be needed if rtl8169_start_xmit was only your
usual ndo_start_xmit handler: Realtek {ab / re}used it for GSO handling
(see r8169_csum_workaround).

If the test is not a no-op in this GSO context, it's racy.

-- 
Ueimor


Re: net: r8169: a question of memory barrier in the r8169 driver

2018-01-18 Thread Francois Romieu
Peter Zijlstra  :
[...]
> There is only 1 variable afaict. Memory barriers need at least 2 in
> order to be able to do _anything_.

I don't get your point: why don't {cur_tx, dirty_tx} qualify as said
two variables ?

-- 
Ueimor



Re: [PATCH net] ibmvnic: Fix IPv6 packet descriptors

2018-01-18 Thread Thomas Falcon
On 01/18/2018 07:05 PM, Thomas Falcon wrote:
> Packet descriptor generation for IPv6 is broken.
> Properly set L3 and L4 protocol flags for IPv6 descriptors.
>
> Signed-off-by: Thomas Falcon 

Sorry!  Forgot to check for formatting.  Please ignore this patch.

> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 22 --
>  1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
> b/drivers/net/ethernet/ibm/ibmvnic.c
> index 3700bfa..4bb41cd 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -1281,6 +1281,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
> net_device *netdev)
>   u64 *handle_array;
>   int index = 0;
>   int ret = 0;
> + u8 proto = 0;
>
>   if (adapter->resetting) {
>   if (!netif_subqueue_stopped(netdev, skb))
> @@ -1367,18 +1368,19 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
> net_device *netdev)
>   tx_crq.v1.vlan_id = cpu_to_be16(skb->vlan_tci);
>   }
>
> - if (skb->protocol == htons(ETH_P_IP)) {
> - if (ip_hdr(skb)->version == 4)
> - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4;
> - else if (ip_hdr(skb)->version == 6)
> - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6;
> -
> - if (ip_hdr(skb)->protocol == IPPROTO_TCP)
> - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP;
> - else if (ip_hdr(skb)->protocol != IPPROTO_TCP)
> - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP;
> +if (skb->protocol == htons(ETH_P_IP)) {
> + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4;
> + proto = ip_hdr(skb)->protocol;
> + } else if (skb->protocol == htons(ETH_P_IPV6)) {
> + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6;
> + proto = ipv6_hdr(skb)->nexthdr;
>   }
>
> + if (proto == IPPROTO_TCP)
> + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP;
> + else if (proto == IPPROTO_UDP)
> + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP;
> +
>   if (skb->ip_summed == CHECKSUM_PARTIAL) {
>   tx_crq.v1.flags1 |= IBMVNIC_TX_CHKSUM_OFFLOAD;
>   hdrs += 2;




[PATCH net] ibmvnic: Fix IP offload control buffer

2018-01-18 Thread Thomas Falcon
Set some missing fields in the IP control offload buffer. This buffer is
used to enable checksum and TCP segmentation offload in the VNIC server.
The buffer length field and the checksum offloading bits were not set
properly, so fix that here.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index b676fa9..3700bfa 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3351,7 +3351,11 @@ static void handle_query_ip_offload_rsp(struct 
ibmvnic_adapter *adapter)
return;
}
 
+   adapter->ip_offload_ctrl.len =
+   cpu_to_be32(sizeof(adapter->ip_offload_ctrl));
adapter->ip_offload_ctrl.version = cpu_to_be32(INITIAL_VERSION_IOB);
+   adapter->ip_offload_ctrl.ipv4_chksum = buf->ipv4_chksum;
+   adapter->ip_offload_ctrl.ipv6_chksum = buf->ipv6_chksum;
adapter->ip_offload_ctrl.tcp_ipv4_chksum = buf->tcp_ipv4_chksum;
adapter->ip_offload_ctrl.udp_ipv4_chksum = buf->udp_ipv4_chksum;
adapter->ip_offload_ctrl.tcp_ipv6_chksum = buf->tcp_ipv6_chksum;
-- 
1.8.3.1



[PATCH net] ibmvnic: Fix IPv6 packet descriptors

2018-01-18 Thread Thomas Falcon
Packet descriptor generation for IPv6 is broken.
Properly set L3 and L4 protocol flags for IPv6 descriptors.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 3700bfa..4bb41cd 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1281,6 +1281,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
u64 *handle_array;
int index = 0;
int ret = 0;
+   u8 proto = 0;
 
if (adapter->resetting) {
if (!netif_subqueue_stopped(netdev, skb))
@@ -1367,18 +1368,19 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
tx_crq.v1.vlan_id = cpu_to_be16(skb->vlan_tci);
}
 
-   if (skb->protocol == htons(ETH_P_IP)) {
-   if (ip_hdr(skb)->version == 4)
-   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4;
-   else if (ip_hdr(skb)->version == 6)
-   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6;
-
-   if (ip_hdr(skb)->protocol == IPPROTO_TCP)
-   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP;
-   else if (ip_hdr(skb)->protocol != IPPROTO_TCP)
-   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP;
+if (skb->protocol == htons(ETH_P_IP)) {
+   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4;
+   proto = ip_hdr(skb)->protocol;
+   } else if (skb->protocol == htons(ETH_P_IPV6)) {
+   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6;
+   proto = ipv6_hdr(skb)->nexthdr;
}
 
+   if (proto == IPPROTO_TCP)
+   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP;
+   else if (proto == IPPROTO_UDP)
+   tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP;
+
if (skb->ip_summed == CHECKSUM_PARTIAL) {
tx_crq.v1.flags1 |= IBMVNIC_TX_CHKSUM_OFFLOAD;
hdrs += 2;
-- 
1.8.3.1



linux-next: manual merge of the net-next tree with the net tree

2018-01-18 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  kernel/bpf/verifier.c

between commit:

  6f16101e6a8b ("bpf: mark dst unknown on inconsistent {s, u}bounds 
adjustments")

from the net tree and commit:

  f4d7e40a5b71 ("bpf: introduce function calls (verification)")

from the net-next tree.

I fixed it up (I just used the former version) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.



-- 
Cheers,
Stephen Rothwell


Re: KASAN: slab-out-of-bounds Read in __dev_queue_xmit

2018-01-18 Thread Eric Dumazet
On Thu, 2018-01-18 at 15:58 -0800, syzbot wrote:
> Hello,
> 
> syzbot hit the following crash on linux-next commit
> 0e08c463db387a2adcb0243b15ab868a73f87807
> 
> So far this crash happened 6 times on linux-next, mmots, upstream.
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+9da69ebac7804...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for  
> details.
> If you forward the report, please keep this part and the footer.
> 
> device syz0 entered promiscuous mode
> audit: type=1400 audit(1514752309.665:10): avc:  denied  { net_raw } for   
> pid=3143 comm="syzkaller343753" capability=13   
> scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023  
> tcontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tclass=cap_userns  
> permissive=1
> audit: type=1400 audit(1514752309.668:11): avc:  denied  { net_admin } for   
> pid=3143 comm="syzkaller343753" capability=12   
> scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023  
> tcontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tclass=cap_userns  
> permissive=1
> ==
> BUG: KASAN: slab-out-of-bounds in __tcp_hdrlen include/linux/tcp.h:35  
> [inline]
> BUG: KASAN: slab-out-of-bounds in tcp_hdrlen include/linux/tcp.h:40 [inline]
> BUG: KASAN: slab-out-of-bounds in qdisc_pkt_len_init net/core/dev.c:3160  
> [inline]
> BUG: KASAN: slab-out-of-bounds in __dev_queue_xmit+0x20d3/0x2200  
> net/core/dev.c:3465
> Read of size 2 at addr 8801c85791e0 by task syzkaller343753/3143
> 
> CPU: 0 PID: 3143 Comm: syzkaller343753 Not tainted  
> 4.15.0-rc4-next-20171221+ #78
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:17 [inline]
>   dump_stack+0x194/0x257 lib/dump_stack.c:53
>   print_address_description+0x73/0x250 mm/kasan/report.c:252
>   kasan_report_error mm/kasan/report.c:351 [inline]
>   kasan_report+0x25b/0x340 mm/kasan/report.c:409
>   __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428
>   __tcp_hdrlen include/linux/tcp.h:35 [inline]
>   tcp_hdrlen include/linux/tcp.h:40 [inline]
>   qdisc_pkt_len_init net/core/dev.c:3160 [inline]
>   __dev_queue_xmit+0x20d3/0x2200 net/core/dev.c:3465
>   dev_queue_xmit+0x17/0x20 net/core/dev.c:3554
>   packet_snd net/packet/af_packet.c:2943 [inline]
>   packet_sendmsg+0x3ad5/0x60a0 net/packet/af_packet.c:2968
>   sock_sendmsg_nosec net/socket.c:628 [inline]
>   sock_sendmsg+0xca/0x110 net/socket.c:638
>   sock_write_iter+0x31a/0x5d0 net/socket.c:907
>   call_write_iter include/linux/fs.h:1776 [inline]
>   new_sync_write fs/read_write.c:469 [inline]
>   __vfs_write+0x684/0x970 fs/read_write.c:482
>   vfs_write+0x189/0x510 fs/read_write.c:544
>   SYSC_write fs/read_write.c:589 [inline]
>   SyS_write+0xef/0x220 fs/read_write.c:581
>   entry_SYSCALL_64_fastpath+0x1f/0x96
> RIP: 0033:0x444df9
> RSP: 002b:007eff78 EFLAGS: 0297 ORIG_RAX: 0001
> RAX: ffda RBX: 7ffc3d2180f0 RCX: 00444df9
> RDX: 00ce RSI: 20fecf2b RDI: 0005
> RBP:  R08: 000120080522 R09: 000120080522
> R10: 000120080522 R11: 0297 R12: 004029f0
> R13: 00402a80 R14:  R15: 
> 
> Allocated by task 3143:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>   set_track mm/kasan/kasan.c:459 [inline]
>   kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>   __do_kmalloc_node mm/slab.c:3673 [inline]
>   __kmalloc_node_track_caller+0x47/0x70 mm/slab.c:3687
>   __kmalloc_reserve.isra.41+0x41/0xd0 net/core/skbuff.c:137
>   __alloc_skb+0x13b/0x780 net/core/skbuff.c:205
>   alloc_skb include/linux/skbuff.h:983 [inline]
>   alloc_skb_with_frags+0x10d/0x750 net/core/skbuff.c:5146
>   sock_alloc_send_pskb+0x787/0x9b0 net/core/sock.c:2088
>   packet_alloc_skb net/packet/af_packet.c:2802 [inline]
>   packet_snd net/packet/af_packet.c:2893 [inline]
>   packet_sendmsg+0x1ec2/0x60a0 net/packet/af_packet.c:2968
>   sock_sendmsg_nosec net/socket.c:628 [inline]
>   sock_sendmsg+0xca/0x110 net/socket.c:638
>   sock_write_iter+0x31a/0x5d0 net/socket.c:907
>   call_write_iter include/linux/fs.h:1776 [inline]
>   new_sync_write fs/read_write.c:469 [inline]
>   __vfs_write+0x684/0x970 fs/read_write.c:482
>   vfs_write+0x189/0x510 fs/read_write.c:544
>   SYSC_write fs/read_write.c:589 [inline]
>   SyS_write+0xef/0x220 fs/read_write.c:581
>   entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> Freed by task 0:
> (stack is not available)
> 
> The buggy address belongs to the object at 8801c8578d80
>   which belongs to the cache kmalloc-1024 of size 1024
> The buggy address is located 96 bytes to the right of
>   1024-byte 

Re: [PATCH net] net: validate untrusted gso packets

2018-01-18 Thread Willem de Bruijn
>>> This implements subset of function for codes which was removed by the
>>> commit
>>> I mentioned below.
>>
>> No, as I explain above, it performs a different check.
>>
>
>
> [...]

Clearly I was wrong, sorry. Thanks for pointing out that commit and
576a30eb6453 ("[NET]: Added GSO header verification").

> For performance reason. I think we should delay the check or
> segmentation
> as
> much as possible until it was really needed.

 Going through segmentation is probably as expensive as flow dissector,
 if not more so because of the indirect branches.
>>>
>>> I think we don't even need to care about this consider the evil packet
>>> should be rare.
>>
>> How does frequency matter when a single packet can crash a host?
>
>
> I mean consider we had fix the crash, we don't care how expensive do we spot
> this.
>
>>
>>> And what you propose here is just a very small subset of the
>>> necessary checking, more comes at gso header checking. So even if we care
>>> performance, it only help for some specific case.
>>
>> It also fixed the bug that Eric sent a separate patch for, as that did
>> not dissect as a valid TCP packet, either.
>
>
> I may miss something but how did this patch protects an evil thoff?

Actually, it blocked that specific reproducer because the ip protocol
did not match.

I think that __skb_flow_dissect_tcp should return a boolean, causing
dissection return FLOW_DISSECT_RET_OUT_BAD if the tcph is bad.
That would be needed to really catch it with flow dissection at the source.


[PATCH] selftests: bpf: update .gitignore with missing generated files

2018-01-18 Thread Shuah Khan
Update .gitignore with missing generated files.

Signed-off-by: Shuah Khan 
---
 tools/testing/selftests/bpf/.gitignore | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 541d9d7fad5a..1e09d77f1948 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -3,3 +3,10 @@ test_maps
 test_lru_map
 test_lpm_map
 test_tag
+FEATURE-DUMP.libbpf
+fixdep
+test_align
+test_dev_cgroup
+test_progs
+test_verifier_log
+feature
-- 
2.14.1



Re: [PATCH iproute2 v2 0/9] ip/tunnel: Improve tunnel parameters printing

2018-01-18 Thread Stephen Hemminger
On Thu, 18 Jan 2018 16:04:27 +0200
Serhey Popovych  wrote:

> Continue improvements to tunnel modules. Final goal
> is to make merge IP and IPv6 variants and make that
> transition as transparent as possible.
> 
> Everything within this series is open for your comments,
> suggestions and criticism.
> 
> See individual patch description message for details.
> 
> v2
>   1) Fix checkpatch issues: no assignment in condtional
>  anymore.
> 
>   2) Better code abstraction: introduce tnl_print_encap()
>  helper and get rid of duplicated code for printing
>  tunnel encapsulation options.
> 
>  Patch with subject
>  "ip/tunnel: Abstract tunnel encapsulation options printing"
>  replaces two previous in v1 series
>  "ip/tunnel: Use print_string() and simplify encap option printing"
>  and
>  "ip/tunnel: Minor cleanups in print routines"
> 
>   3) Include patch with subject
>  "tunnel: Return constant string without copying it"
>  in the series: it is related to tunneling code too.
> 
> Thanks,
> Serhii
> 
> Serhey Popovych (9):
>   iplink: Use ll_index_to_name() instead of if_indextoname()
>   ip/tunnel: Correct and unify ttl/hoplimit printing
>   ip/tunnel: Simplify and unify tos printing
>   ip/tunnel: Use print_0xhex() instead of print_string()
>   ip/tunnel: Abstract tunnel encapsulation options printing
>   gre/tunnel: Print erspan_index using print_uint()
>   vti/tunnel: Unify ikey/okey printing
>   vti6/tunnel: Unify and simplify link type help functions
>   tunnel: Return constant string without copying it
> 
>  bridge/fdb.c  |5 +-
>  bridge/link.c |   19 +++
>  ip/ip6tunnel.c|5 +-
>  ip/iplink_bond.c  |   38 ++
>  ip/iplink_geneve.c|   38 ++
>  ip/iplink_vxlan.c |   49 --
>  ip/iproute_lwtunnel.c |   11 ++---
>  ip/iptunnel.c |2 +-
>  ip/link_gre.c |  132 
> +++--
>  ip/link_gre6.c|  103 +-
>  ip/link_ip6tnl.c  |  105 ++-
>  ip/link_iptnl.c   |  128 +++
>  ip/link_vti.c |   42 
>  ip/link_vti6.c|   64 +++-
>  ip/tunnel.c   |  116 ---
>  ip/tunnel.h   |5 ++
>  16 files changed, 340 insertions(+), 522 deletions(-)
> 

This looks fine. Thanks for following up.
Applied.


Re: [PATCH iproute2] devlink: Ignore unknown attributes

2018-01-18 Thread Stephen Hemminger
On Wed, 17 Jan 2018 15:28:00 +0200
Arkadi Sharshevsky  wrote:

> In case of extending the UAPI old packages would break.
> 
> Signed-off-by: Arkadi Sharshevsky 

Looks like good future proofing.
Applied.


[patch 1/1] net/sched/sch_prio.c: work around gcc-4.4.4 union initializer issues

2018-01-18 Thread akpm
From: Andrew Morton 
Subject: net/sched/sch_prio.c: work around gcc-4.4.4 union initializer issues

gcc-4.4.4 has problems witn anon union initializers.  Work around this.

net/sched/sch_prio.c: In function 'prio_dump_offload':
net/sched/sch_prio.c:260: error: unknown field 'stats' specified in initializer
net/sched/sch_prio.c:260: warning: initialization makes integer from pointer 
without a cast
net/sched/sch_prio.c:261: error: unknown field 'stats' specified in initializer
net/sched/sch_prio.c:261: warning: initialization makes integer from pointer 
without a cast

Fixes: 7fdb61b44c0c95 ("net: sch: prio: Add offload ability to PRIO qdisc")
Cc: Nogah Frankel 
Cc: Yuval Mintz 
Cc: Jiri Pirko 
Cc: David S. Miller 
Signed-off-by: Andrew Morton 
---

 net/sched/sch_prio.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff -puN 
net/sched/sch_prio.c~net-sched-sch_prioc-work-around-gcc-444-union-initializer-issues
 net/sched/sch_prio.c
--- 
a/net/sched/sch_prio.c~net-sched-sch_prioc-work-around-gcc-444-union-initializer-issues
+++ a/net/sched/sch_prio.c
@@ -254,11 +254,15 @@ static int prio_dump_offload(struct Qdis
 {
struct net_device *dev = qdisc_dev(sch);
struct tc_prio_qopt_offload hw_stats = {
+   .command = TC_PRIO_STATS,
.handle = sch->handle,
.parent = sch->parent,
-   .command = TC_PRIO_STATS,
-   .stats.bstats = >bstats,
-   .stats.qstats = >qstats,
+   {
+   .stats = {
+   .bstats = >bstats,
+   .qstats = >qstats,
+   },
+   },
};
int err;
 
_


Re: [PATCH iproute2] iplink: Fix "alias" parameter length calculations

2018-01-18 Thread Stephen Hemminger
On Thu, 18 Jan 2018 16:24:40 +0200
Serhey Popovych  wrote:

> We need NEXT_ARG() to get *argv pointing to "alias"
> parameter value. Overwise we get and check "alias"
> string length.
> 
> Fixes: f88becf35e08 ("iplink: Process "alias" parameter correctly")
> Signed-off-by: Serhey Popovych 

Good catch. Applied thanks.


Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY

2018-01-18 Thread Willem de Bruijn
> Okay, I'm working on a patch that adds explicit checks

Sent: http://patchwork.ozlabs.org/patch/863237/

>
>   @@ -45,6 +45,9 @@ static struct sk_buff *sctp_gso_segment(struct sk_buff 
> *skb,
>   struct sk_buff *segs = ERR_PTR(-EINVAL);
>   struct sctphdr *sh;
>
>   +   if (!skb_shinfo(skb)->gso_type & SKB_GSO_SCTP)
>   +   goto out;
>   +
>
> to all transport layer gso offloads: {sctp, tcpv[46], ufov[46], espv[46]}. 
> This
> will block packets with gso_type X but protocol type Y from being parsed.
> But does allow entering a tunnel protocol handler if that is different from Y,
> unlike the above patch.
>
> tunnels segmentation itself is protected by skb->encapsulation. Only tunnel
> devices in the stack can set this field, not virtio_net_hdr_to_skb. Packets 
> that
> enter {gre, udp tunnel, ipxip4, ipxip6} without this bit are already dropped, 
> so
> no new checks are needed to these based on gso_type.

This is not yet sufficient. If a packet comes from userspace with tunnel
headers and passes through a tunnel that sets skb->encapsulation, it is
indistinguishable from a valid tunneled packet. That bit is not exclusive,


Re: [PATCH] bridge: return boolean instead of integer in br_multicast_is_router

2018-01-18 Thread Stephen Hemminger
On Thu, 18 Jan 2018 17:37:45 -0600
"Gustavo A. R. Silva"  wrote:

> Return statements in functions returning bool should use
> true/false instead of 1/0.
> 
> This issue was detected with the help of Coccinelle.
> 
> Signed-off-by: Gustavo A. R. Silva 

LGTM

Fixes: 85b352693264 ("bridge: Fix build error when IGMP_SNOOPING is not 
enabled")
Reviewed-by: Stephen Hemminger 


[PATCH net v2] gso: validate gso_type if SKB_GSO_DODGY

2018-01-18 Thread Willem de Bruijn
From: Willem de Bruijn 

Validate gso_type during segmentation as SKB_GSO_DODGY sources
may pass packets where the gso_type does not match the contents.

Syzkaller was able to enter the SCTP gso handler with a packet of
gso_type SKB_GSO_TCPV4.

On entry of transport layer gso handlers, verify that the gso_type
matches the transport protocol.

Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr")
Link: http://lkml.kernel.org/r/<001a1137452496ffc305617e5...@google.com>
Reported-by: syzbot+fee64147a25aecd48...@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn 

---
Similar checks existed until removed in commit 5c7cdf339af5 ("gso:
Remove arbitrary checks for unsupported GSO"). But those were limited
to the TSO path, not software GSO. I believe that this issue goes
back further, hence the Fixes at the first user of virtio_net_hdr.
---
 net/ipv4/esp4_offload.c  | 3 +++
 net/ipv4/tcp_offload.c   | 3 +++
 net/ipv4/udp_offload.c   | 3 +++
 net/ipv6/esp6_offload.c  | 3 +++
 net/ipv6/tcpv6_offload.c | 3 +++
 net/ipv6/udp_offload.c   | 3 +++
 net/sctp/offload.c   | 3 +++
 7 files changed, 21 insertions(+)

diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index b1338e576d00..29b333a62ab0 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -122,6 +122,9 @@ static struct sk_buff *esp4_gso_segment(struct sk_buff *skb,
if (!xo)
goto out;
 
+   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_ESP))
+   goto out;
+
seq = xo->seq.low;
 
x = skb->sp->xvec[skb->sp->len - 1];
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index b6a2aa1dcf56..4d58e2ce0b5b 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -32,6 +32,9 @@ static void tcp_gso_tstamp(struct sk_buff *skb, unsigned int 
ts_seq,
 static struct sk_buff *tcp4_gso_segment(struct sk_buff *skb,
netdev_features_t features)
 {
+   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4))
+   return ERR_PTR(-EINVAL);
+
if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
return ERR_PTR(-EINVAL);
 
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 01801b77bd0d..ea6e6e7df0ee 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -203,6 +203,9 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
goto out;
}
 
+   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP))
+   goto out;
+
if (!pskb_may_pull(skb, sizeof(struct udphdr)))
goto out;
 
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index dd9627490c7c..f52c314d4c97 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -149,6 +149,9 @@ static struct sk_buff *esp6_gso_segment(struct sk_buff *skb,
if (!xo)
goto out;
 
+   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_ESP))
+   goto out;
+
seq = xo->seq.low;
 
x = skb->sp->xvec[skb->sp->len - 1];
diff --git a/net/ipv6/tcpv6_offload.c b/net/ipv6/tcpv6_offload.c
index d883c9204c01..278e49cd67d4 100644
--- a/net/ipv6/tcpv6_offload.c
+++ b/net/ipv6/tcpv6_offload.c
@@ -46,6 +46,9 @@ static struct sk_buff *tcp6_gso_segment(struct sk_buff *skb,
 {
struct tcphdr *th;
 
+   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6))
+   return ERR_PTR(-EINVAL);
+
if (!pskb_may_pull(skb, sizeof(*th)))
return ERR_PTR(-EINVAL);
 
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index a0f89ad76f9d..2a04dc9c781b 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -42,6 +42,9 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
const struct ipv6hdr *ipv6h;
struct udphdr *uh;
 
+   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP))
+   goto out;
+
if (!pskb_may_pull(skb, sizeof(struct udphdr)))
goto out;
 
diff --git a/net/sctp/offload.c b/net/sctp/offload.c
index 275925b93b29..35bc7106d182 100644
--- a/net/sctp/offload.c
+++ b/net/sctp/offload.c
@@ -45,6 +45,9 @@ static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
struct sk_buff *segs = ERR_PTR(-EINVAL);
struct sctphdr *sh;
 
+   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP))
+   goto out;
+
sh = sctp_hdr(skb);
if (!pskb_may_pull(skb, sizeof(*sh)))
goto out;
-- 
2.16.0.rc1.238.g530d649a79-goog



[PATCH] bridge: return boolean instead of integer in br_multicast_is_router

2018-01-18 Thread Gustavo A. R. Silva
Return statements in functions returning bool should use
true/false instead of 1/0.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 net/bridge/br_private.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 80559fd..8e13a64 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -760,7 +760,7 @@ static inline void br_multicast_flood(struct 
net_bridge_mdb_entry *mdst,
 
 static inline bool br_multicast_is_router(struct net_bridge *br)
 {
-   return 0;
+   return false;
 }
 
 static inline bool br_multicast_querier_exists(struct net_bridge *br,
-- 
2.7.4



[PATCH] netfilter: return booleans instead of integers

2018-01-18 Thread Gustavo A. R. Silva
Return statements in functions returning bool should use
true/false instead of 1/0.

These issues were detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 net/netfilter/nf_conncount.c | 2 +-
 net/netfilter/xt_hashlimit.c | 2 +-
 net/netfilter/xt_ipcomp.c| 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index a955182..6d65389 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -71,7 +71,7 @@ static inline bool already_closed(const struct nf_conn *conn)
return conn->proto.tcp.state == TCP_CONNTRACK_TIME_WAIT ||
   conn->proto.tcp.state == TCP_CONNTRACK_CLOSE;
else
-   return 0;
+   return false;
 }
 
 static int key_diff(const u32 *a, const u32 *b, unsigned int klen)
diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 5da8746..ec51d9a 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -353,7 +353,7 @@ static int htable_create(struct net *net, struct 
hashlimit_cfg3 *cfg,
 static bool select_all(const struct xt_hashlimit_htable *ht,
   const struct dsthash_ent *he)
 {
-   return 1;
+   return true;
 }
 
 static bool select_gc(const struct xt_hashlimit_htable *ht,
diff --git a/net/netfilter/xt_ipcomp.c b/net/netfilter/xt_ipcomp.c
index 000e703..7ca64a5 100644
--- a/net/netfilter/xt_ipcomp.c
+++ b/net/netfilter/xt_ipcomp.c
@@ -58,7 +58,7 @@ static bool comp_mt(const struct sk_buff *skb, struct 
xt_action_param *par)
 */
pr_debug("Dropping evil IPComp tinygram.\n");
par->hotdrop = true;
-   return 0;
+   return false;
}
 
return spi_match(compinfo->spis[0], compinfo->spis[1],
-- 
2.7.4



Re: [PATCH 2/3] net: is_skb_forwardable: validate length of GSO packet segments

2018-01-18 Thread Marcelo Ricardo Leitner
On Tue, Jan 16, 2018 at 01:09:19PM +1100, Daniel Axtens wrote:
> is_skb_forwardable attempts to detect if a packet is too large to
> be sent to the destination device. However, this test does not
> consider GSO packets, and it is possible that a GSO packet, when
> resegmented, will be larger than the device can transmit.
> 
> Add detection for packets which will be too large by creating a
> skb_gso_validate_len() routine which is similar to
> skb_gso_validate_mtu(), but which considers L2 headers, and wire

Why not add a 3rd parameter to skb_gso_validate_mtu() instead, hlen?
After all it's the only difference between both.

And maybe rename _mtu if you think the term doesn't fit in the new
usecase or add a wrapper around it for each case.

As it is here, it gets very confusing on which function does what,
IMHO.

> it up in is_skb_forwardable().
> 
> Signed-off-by: Daniel Axtens 
> ---
>  include/linux/skbuff.h |  1 +
>  net/core/dev.c |  7 ---
>  net/core/skbuff.c  | 34 ++
>  3 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index b8acafd73272..6a9e4b6f80a7 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, 
> int shiftlen);
>  void skb_scrub_packet(struct sk_buff *skb, bool xnet);
>  unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
>  bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu);
> +bool skb_gso_validate_len(const struct sk_buff *skb, unsigned int len);
>  struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
>  struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
>  int skb_ensure_writable(struct sk_buff *skb, int write_len);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 94435cd09072..5af04be128c1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1841,11 +1841,12 @@ bool is_skb_forwardable(const struct net_device *dev, 
> const struct sk_buff *skb)
>   if (skb->len <= len)
>   return true;
>  
> - /* if TSO is enabled, we don't care about the length as the packet
> -  * could be forwarded without being segmented before
> + /*
> +  * if TSO is enabled, we need to check the size of the
> +  * segmented packets
>*/
>   if (skb_is_gso(skb))
> - return true;
> + return skb_gso_validate_len(skb, len);
>  
>   return false;
>  }
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 01e8285aea73..aefe049e4b0c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4945,6 +4945,40 @@ bool skb_gso_validate_mtu(const struct sk_buff *skb, 
> unsigned int mtu)
>  }
>  EXPORT_SYMBOL_GPL(skb_gso_validate_mtu);
>  
> +/**
> + * skb_gso_validate_len - Will a split GSO skb fit in a given length?
> + *
> + * @skb: GSO skb
> + * @len: length to validate against
> + *
> + * skb_gso_validate_len validates if a given skb will fit a wanted
> + * length once split, including L2, L3 and L4 headers.
> + *
> + * Similar to skb_gso_validate_mtu, but inclusive of L2 headers.
> + */
> +bool skb_gso_validate_len(const struct sk_buff *skb, unsigned int len)
> +{
> + const struct skb_shared_info *shinfo = skb_shinfo(skb);
> + const struct sk_buff *iter;
> + unsigned int hlen;
> +
> + hlen = skb_gso_mac_seglen(skb);
> +
> + if (shinfo->gso_size != GSO_BY_FRAGS)
> + return hlen <= len;
> +
> + /* Undo this so we can re-use header sizes */
> + hlen -= GSO_BY_FRAGS;
> +
> + skb_walk_frags(skb, iter) {
> + if (hlen + skb_headlen(iter) > len)
> + return false;
> + }
> +
> + return true;
> +}
> +EXPORT_SYMBOL_GPL(skb_gso_validate_len);
> +
>  static struct sk_buff *skb_reorder_vlan_header(struct sk_buff *skb)
>  {
>   if (skb_cow(skb, skb_headroom(skb)) < 0) {
> -- 
> 2.14.1
> 


[PATCH bpf-next v6 3/4] libbpf: add missing SPDX-License-Identifier

2018-01-18 Thread Eric Leblond
Signed-off-by: Eric Leblond 
Acked-by: Alexei Starovoitov 
---
 tools/lib/bpf/bpf.c| 2 ++
 tools/lib/bpf/bpf.h| 2 ++
 tools/lib/bpf/libbpf.c | 2 ++
 tools/lib/bpf/libbpf.h | 2 ++
 4 files changed, 8 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 4517dce6849d..991bd72c4153 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: LGPL-2.1
+
 /*
  * common eBPF ELF operations.
  *
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9f44c196931e..8d18fb73d7fb 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -1,3 +1,5 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
 /*
  * common eBPF ELF operations.
  *
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index c60122d3ea85..71ddc481f349 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: LGPL-2.1
+
 /*
  * Common eBPF ELF object loading operations.
  *
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e42f96900318..f85906533cdd 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1,3 +1,5 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
 /*
  * Common eBPF ELF object loading operations.
  *
-- 
2.15.1



[PATCH bpf-next v6 4/4] samples/bpf: use bpf_set_link_xdp_fd

2018-01-18 Thread Eric Leblond
Use bpf_set_link_xdp_fd instead of set_link_xdp_fd to remove some
code duplication and benefit of netlink ext ack errors message.

Signed-off-by: Eric Leblond 
---
 samples/bpf/bpf_load.c  | 102 
 samples/bpf/bpf_load.h  |   2 +-
 samples/bpf/xdp1_user.c |   4 +-
 samples/bpf/xdp_redirect_cpu_user.c |   6 +--
 samples/bpf/xdp_redirect_map_user.c |   8 +--
 samples/bpf/xdp_redirect_user.c |   8 +--
 samples/bpf/xdp_router_ipv4_user.c  |  10 ++--
 samples/bpf/xdp_tx_iptunnel_user.c  |   6 +--
 8 files changed, 22 insertions(+), 124 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 242631aa4ea2..69806d74fa53 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -695,105 +695,3 @@ struct ksym *ksym_search(long key)
return [0];
 }
 
-int set_link_xdp_fd(int ifindex, int fd, __u32 flags)
-{
-   struct sockaddr_nl sa;
-   int sock, seq = 0, len, ret = -1;
-   char buf[4096];
-   struct nlattr *nla, *nla_xdp;
-   struct {
-   struct nlmsghdr  nh;
-   struct ifinfomsg ifinfo;
-   char attrbuf[64];
-   } req;
-   struct nlmsghdr *nh;
-   struct nlmsgerr *err;
-
-   memset(, 0, sizeof(sa));
-   sa.nl_family = AF_NETLINK;
-
-   sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
-   if (sock < 0) {
-   printf("open netlink socket: %s\n", strerror(errno));
-   return -1;
-   }
-
-   if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) {
-   printf("bind to netlink: %s\n", strerror(errno));
-   goto cleanup;
-   }
-
-   memset(, 0, sizeof(req));
-   req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
-   req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
-   req.nh.nlmsg_type = RTM_SETLINK;
-   req.nh.nlmsg_pid = 0;
-   req.nh.nlmsg_seq = ++seq;
-   req.ifinfo.ifi_family = AF_UNSPEC;
-   req.ifinfo.ifi_index = ifindex;
-
-   /* started nested attribute for XDP */
-   nla = (struct nlattr *)(((char *))
-   + NLMSG_ALIGN(req.nh.nlmsg_len));
-   nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
-   nla->nla_len = NLA_HDRLEN;
-
-   /* add XDP fd */
-   nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
-   nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
-   nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
-   memcpy((char *)nla_xdp + NLA_HDRLEN, , sizeof(fd));
-   nla->nla_len += nla_xdp->nla_len;
-
-   /* if user passed in any flags, add those too */
-   if (flags) {
-   nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
-   nla_xdp->nla_type = 3/*IFLA_XDP_FLAGS*/;
-   nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
-   memcpy((char *)nla_xdp + NLA_HDRLEN, , sizeof(flags));
-   nla->nla_len += nla_xdp->nla_len;
-   }
-
-   req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
-
-   if (send(sock, , req.nh.nlmsg_len, 0) < 0) {
-   printf("send to netlink: %s\n", strerror(errno));
-   goto cleanup;
-   }
-
-   len = recv(sock, buf, sizeof(buf), 0);
-   if (len < 0) {
-   printf("recv from netlink: %s\n", strerror(errno));
-   goto cleanup;
-   }
-
-   for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
-nh = NLMSG_NEXT(nh, len)) {
-   if (nh->nlmsg_pid != getpid()) {
-   printf("Wrong pid %d, expected %d\n",
-  nh->nlmsg_pid, getpid());
-   goto cleanup;
-   }
-   if (nh->nlmsg_seq != seq) {
-   printf("Wrong seq %d, expected %d\n",
-  nh->nlmsg_seq, seq);
-   goto cleanup;
-   }
-   switch (nh->nlmsg_type) {
-   case NLMSG_ERROR:
-   err = (struct nlmsgerr *)NLMSG_DATA(nh);
-   if (!err->error)
-   continue;
-   printf("nlmsg error %s\n", strerror(-err->error));
-   goto cleanup;
-   case NLMSG_DONE:
-   break;
-   }
-   }
-
-   ret = 0;
-
-cleanup:
-   close(sock);
-   return ret;
-}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 7d57a4248893..453c200b389b 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -61,5 +61,5 @@ struct ksym {
 
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
-int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
+int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
 #endif
diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
index fdaefe91801d..b901ee2b3336 100644
--- a/samples/bpf/xdp1_user.c
+++ 

Re: [PATCH net] ipv6: don't let tb6_root node share routes with other node

2018-01-18 Thread Martin KaFai Lau
On Thu, Jan 18, 2018 at 03:31:29PM -0800, Wei Wang wrote:
> On Thu, Jan 18, 2018 at 2:47 PM, Martin KaFai Lau  wrote:
> > On Thu, Jan 18, 2018 at 10:40:03AM -0800, Wei Wang wrote:
> >> From: Wei Wang 
> >>
> >> After commit 4512c43eac7e, if we add a route to the subtree of tb6_root
> >> which does not have any route attached to it yet, the current code will
> >> let tb6_root and the node in the subtree share the same route.
> >> This could cause problem cause tb6_root has RTN_INFO flag marked and the
> > You meant the RTN_RTINFO check in fib6_purge_rt()?
> >
> Yes. Exactly.
Looks good to me.  Thanks for the fix!

Acked-by: Martin KaFai Lau 


[PATCH bpf-next 0/4] libbpf: add XDP binding support

2018-01-18 Thread Eric Leblond

Hello,

This patchset rebases the libbpf code on latest bpf-next code and addresses
remarks by Daniel.

Best regards,
--
Eric Leblond


[PATCH bpf-next v6 2/4] libbpf: add error reporting in XDP

2018-01-18 Thread Eric Leblond
Parse netlink ext attribute to get the error message returned by
the card. Code is partially take from libnl.

Signed-off-by: Eric Leblond 
Acked-by: Alexei Starovoitov 
---
 samples/bpf/Makefile   |   2 +-
 tools/lib/bpf/Build|   2 +-
 tools/lib/bpf/bpf.c|   8 +++
 tools/lib/bpf/nlattr.c | 187 +
 tools/lib/bpf/nlattr.h |  70 ++
 5 files changed, 267 insertions(+), 2 deletions(-)
 create mode 100644 tools/lib/bpf/nlattr.c
 create mode 100644 tools/lib/bpf/nlattr.h

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 7f61a3d57fa7..5c4cd3745282 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -45,7 +45,7 @@ hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 
 # Libbpf dependencies
-LIBBPF := ../../tools/lib/bpf/bpf.o
+LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
 
 test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build
index d8749756352d..64c679d67109 100644
--- a/tools/lib/bpf/Build
+++ b/tools/lib/bpf/Build
@@ -1 +1 @@
-libbpf-y := libbpf.o bpf.o
+libbpf-y := libbpf.o bpf.o nlattr.o
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index e6c61035b64c..4517dce6849d 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -26,6 +26,7 @@
 #include 
 #include "bpf.h"
 #include "libbpf.h"
+#include "nlattr.h"
 #include 
 #include 
 #include 
@@ -440,6 +441,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
struct nlmsghdr *nh;
struct nlmsgerr *err;
socklen_t addrlen;
+   int one = 1;
 
memset(, 0, sizeof(sa));
sa.nl_family = AF_NETLINK;
@@ -449,6 +451,11 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
return -errno;
}
 
+   if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK,
+  , sizeof(one)) < 0) {
+   fprintf(stderr, "Netlink error reporting not supported\n");
+   }
+
if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) {
ret = -errno;
goto cleanup;
@@ -525,6 +532,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
if (!err->error)
continue;
ret = err->error;
+   nla_dump_errormsg(nh);
goto cleanup;
case NLMSG_DONE:
break;
diff --git a/tools/lib/bpf/nlattr.c b/tools/lib/bpf/nlattr.c
new file mode 100644
index ..4719434278b2
--- /dev/null
+++ b/tools/lib/bpf/nlattr.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: LGPL-2.1
+
+/*
+ * NETLINK  Netlink attributes
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation version 2.1
+ * of the License.
+ *
+ * Copyright (c) 2003-2013 Thomas Graf 
+ */
+
+#include 
+#include "nlattr.h"
+#include 
+#include 
+#include 
+
+static uint16_t nla_attr_minlen[NLA_TYPE_MAX+1] = {
+   [NLA_U8]= sizeof(uint8_t),
+   [NLA_U16]   = sizeof(uint16_t),
+   [NLA_U32]   = sizeof(uint32_t),
+   [NLA_U64]   = sizeof(uint64_t),
+   [NLA_STRING]= 1,
+   [NLA_FLAG]  = 0,
+};
+
+static int nla_len(const struct nlattr *nla)
+{
+   return nla->nla_len - NLA_HDRLEN;
+}
+
+static struct nlattr *nla_next(const struct nlattr *nla, int *remaining)
+{
+   int totlen = NLA_ALIGN(nla->nla_len);
+
+   *remaining -= totlen;
+   return (struct nlattr *) ((char *) nla + totlen);
+}
+
+static int nla_ok(const struct nlattr *nla, int remaining)
+{
+   return remaining >= sizeof(*nla) &&
+  nla->nla_len >= sizeof(*nla) &&
+  nla->nla_len <= remaining;
+}
+
+static void *nla_data(const struct nlattr *nla)
+{
+   return (char *) nla + NLA_HDRLEN;
+}
+
+static int nla_type(const struct nlattr *nla)
+{
+   return nla->nla_type & NLA_TYPE_MASK;
+}
+
+static int validate_nla(struct nlattr *nla, int maxtype,
+   struct nla_policy *policy)
+{
+   struct nla_policy *pt;
+   unsigned int minlen = 0;
+   int type = nla_type(nla);
+
+   if (type < 0 || type > maxtype)
+   return 0;
+
+   pt = [type];
+
+   if (pt->type > NLA_TYPE_MAX)
+   return 0;
+
+   if (pt->minlen)
+   minlen = pt->minlen;
+   else if (pt->type != NLA_UNSPEC)
+   minlen = nla_attr_minlen[pt->type];
+
+   if (nla_len(nla) < minlen)
+   return -1;
+
+   if (pt->maxlen && nla_len(nla) > pt->maxlen)
+   return -1;
+
+   if (pt->type == NLA_STRING) {
+   char *data = nla_data(nla);
+   if 

[PATCH bpf-next v6 1/4] libbpf: add function to setup XDP

2018-01-18 Thread Eric Leblond
Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
slightly modified to be library compliant.

Signed-off-by: Eric Leblond 
Acked-by: Alexei Starovoitov 
---
 tools/lib/bpf/bpf.c| 126 +
 tools/lib/bpf/libbpf.c |   2 +
 tools/lib/bpf/libbpf.h |   4 ++
 3 files changed, 132 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 5128677e4117..e6c61035b64c 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -25,6 +25,16 @@
 #include 
 #include 
 #include "bpf.h"
+#include "libbpf.h"
+#include 
+#include 
+#include 
+
+#ifndef IFLA_XDP_MAX
+#define IFLA_XDP   43
+#define IFLA_XDP_FD1
+#define IFLA_XDP_FLAGS 3
+#endif
 
 /*
  * When building perf, unistd.h is overridden. __NR_bpf is
@@ -46,7 +56,9 @@
 # endif
 #endif
 
+#ifndef min
 #define min(x, y) ((x) < (y) ? (x) : (y))
+#endif
 
 static inline __u64 ptr_to_u64(const void *ptr)
 {
@@ -413,3 +425,117 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 
*info_len)
 
return err;
 }
+
+int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+   struct sockaddr_nl sa;
+   int sock, seq = 0, len, ret = -1;
+   char buf[4096];
+   struct nlattr *nla, *nla_xdp;
+   struct {
+   struct nlmsghdr  nh;
+   struct ifinfomsg ifinfo;
+   char attrbuf[64];
+   } req;
+   struct nlmsghdr *nh;
+   struct nlmsgerr *err;
+   socklen_t addrlen;
+
+   memset(, 0, sizeof(sa));
+   sa.nl_family = AF_NETLINK;
+
+   sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+   if (sock < 0) {
+   return -errno;
+   }
+
+   if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) {
+   ret = -errno;
+   goto cleanup;
+   }
+
+   addrlen = sizeof(sa);
+   if (getsockname(sock, (struct sockaddr *), ) < 0) {
+   ret = -errno;
+   goto cleanup;
+   }
+
+   if (addrlen != sizeof(sa)) {
+   ret = -LIBBPF_ERRNO__INTERNAL;
+   goto cleanup;
+   }
+
+   memset(, 0, sizeof(req));
+   req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+   req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+   req.nh.nlmsg_type = RTM_SETLINK;
+   req.nh.nlmsg_pid = 0;
+   req.nh.nlmsg_seq = ++seq;
+   req.ifinfo.ifi_family = AF_UNSPEC;
+   req.ifinfo.ifi_index = ifindex;
+
+   /* started nested attribute for XDP */
+   nla = (struct nlattr *)(((char *))
+   + NLMSG_ALIGN(req.nh.nlmsg_len));
+   nla->nla_type = NLA_F_NESTED | IFLA_XDP;
+   nla->nla_len = NLA_HDRLEN;
+
+   /* add XDP fd */
+   nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+   nla_xdp->nla_type = IFLA_XDP_FD;
+   nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+   memcpy((char *)nla_xdp + NLA_HDRLEN, , sizeof(fd));
+   nla->nla_len += nla_xdp->nla_len;
+
+   /* if user passed in any flags, add those too */
+   if (flags) {
+   nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+   nla_xdp->nla_type = IFLA_XDP_FLAGS;
+   nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+   memcpy((char *)nla_xdp + NLA_HDRLEN, , sizeof(flags));
+   nla->nla_len += nla_xdp->nla_len;
+   }
+
+   req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+   if (send(sock, , req.nh.nlmsg_len, 0) < 0) {
+   ret = -errno;
+   goto cleanup;
+   }
+
+   len = recv(sock, buf, sizeof(buf), 0);
+   if (len < 0) {
+   ret = -errno;
+   goto cleanup;
+   }
+
+   for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+nh = NLMSG_NEXT(nh, len)) {
+   if (nh->nlmsg_pid != sa.nl_pid) {
+   ret = -LIBBPF_ERRNO__WRNGPID;
+   goto cleanup;
+   }
+   if (nh->nlmsg_seq != seq) {
+   ret = -LIBBPF_ERRNO__INVSEQ;
+   goto cleanup;
+   }
+   switch (nh->nlmsg_type) {
+   case NLMSG_ERROR:
+   err = (struct nlmsgerr *)NLMSG_DATA(nh);
+   if (!err->error)
+   continue;
+   ret = err->error;
+   goto cleanup;
+   case NLMSG_DONE:
+   break;
+   default:
+   break;
+   }
+   }
+
+   ret = 0;
+
+cleanup:
+   close(sock);
+   return ret;
+}
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 30c776375118..c60122d3ea85 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -106,6 +106,8 @@ static const char *libbpf_strerror_table[NR_ERRNO] = {
[ERRCODE_OFFSET(PROG2BIG)]  = "Program too big",
   

Re: [PATCH bpf-next v5 2/4] libbpf: add error reporting in XDP

2018-01-18 Thread Eric Leblond
Hi,

Sorry for the delay, missed the mail.

On Sat, 2018-01-06 at 22:16 +0100, Daniel Borkmann wrote:
> On 01/04/2018 09:21 AM, Eric Leblond wrote:
> > Parse netlink ext attribute to get the error message returned by
> > the card. Code is partially take from libnl.
> > 
> > Signed-off-by: Eric Leblond 
> > Acked-by: Alexei Starovoitov 
> > ---
> >  samples/bpf/Makefile   |   2 +-
> >  tools/lib/bpf/Build|   2 +-
> >  tools/lib/bpf/bpf.c|  10 ++-
> >  tools/lib/bpf/nlattr.c | 187
> > +
> >  tools/lib/bpf/nlattr.h |  70 ++
> >  5 files changed, 268 insertions(+), 3 deletions(-)
> >  create mode 100644 tools/lib/bpf/nlattr.c
> >  create mode 100644 tools/lib/bpf/nlattr.h
> > 
> > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> > index 4fb944a7ecf8..c889ebcba9b3 100644
> > --- a/samples/bpf/Makefile
> > +++ b/samples/bpf/Makefile
> > @@ -44,7 +44,7 @@ hostprogs-y += xdp_monitor
> >  hostprogs-y += syscall_tp
> >  
> >  # Libbpf dependencies
> > -LIBBPF := ../../tools/lib/bpf/bpf.o
> > +LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
> >  CGROUP_HELPERS :=
> > ../../tools/testing/selftests/bpf/cgroup_helpers.o
> >  
> >  test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
> > diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build
> > index d8749756352d..64c679d67109 100644
> > --- a/tools/lib/bpf/Build
> > +++ b/tools/lib/bpf/Build
> > @@ -1 +1 @@
> > -libbpf-y := libbpf.o bpf.o
> > +libbpf-y := libbpf.o bpf.o nlattr.o
> > diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> > index e6c61035b64c..10d71b9fdbd0 100644
> > --- a/tools/lib/bpf/bpf.c
> > +++ b/tools/lib/bpf/bpf.c
> > @@ -26,6 +26,7 @@
> >  #include 
> >  #include "bpf.h"
> >  #include "libbpf.h"
> > +#include "nlattr.h"
> >  #include 
> >  #include 
> >  #include 
> > @@ -440,6 +441,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd,
> > __u32 flags)
> > struct nlmsghdr *nh;
> > struct nlmsgerr *err;
> > socklen_t addrlen;
> > +   int one;
> 
> Hmm, it's not initialized here to 1 ...
> 
> > memset(, 0, sizeof(sa));
> > sa.nl_family = AF_NETLINK;
> > @@ -449,6 +451,11 @@ int bpf_set_link_xdp_fd(int ifindex, int fd,
> > __u32 flags)
> > return -errno;
> > }
> >  
> > +   if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK,
> > +  , sizeof(one)) < 0) {
> 
> ... so we turn it on by chance here.

Indeed, fixing that.

> > +   fprintf(stderr, "Netlink error reporting not
> > supported\n");
> > +   }
> > +
> > if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) {
> > ret = -errno;
> > goto cleanup;
> > @@ -524,7 +531,8 @@ int bpf_set_link_xdp_fd(int ifindex, int fd,
> > __u32 flags)
> > err = (struct nlmsgerr *)NLMSG_DATA(nh);
> > if (!err->error)
> > continue;
> > -   ret = err->error;
> > +   ret = -err->error;
> 
> This one looks strange. Your prior patch added the 'ret = err->error'
> and this one negates it. Which variant is the correct version? From
> digging into the kernel code, my take is that 'ret = err->error' was
> the correct variant since it already holds the negative error code.
> Could you double check?

Yes all netlink_ack usage I have seen are using the negative value of
the error. Fixing that too.

BR,
-- 
Eric Leblond 
Blog: https://home.regit.org/


[PATCH] smc: return boolean instead of integer in using_ipsec

2018-01-18 Thread Gustavo A. R. Silva
Return statements in functions returning bool should use
true/false instead of 1/0.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 net/smc/smc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/smc/smc.h b/net/smc/smc.h
index 0bee9d1..ef13af4 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -258,7 +258,7 @@ static inline bool using_ipsec(struct smc_sock *smc)
 #else
 static inline bool using_ipsec(struct smc_sock *smc)
 {
-   return 0;
+   return false;
 }
 #endif
 
-- 
2.7.4



Re: [PATCH bpf] bpf: fix 32-bit divide by zero

2018-01-18 Thread Daniel Borkmann
On 01/18/2018 11:40 PM, Alexei Starovoitov wrote:
> On 1/18/18 2:30 PM, Eric Dumazet wrote:
>> On Fri, 2018-01-12 at 18:59 -0800, Alexei Starovoitov wrote:
>>> due to some JITs doing if (src_reg == 0) check in 64-bit mode
>>> for div/mod opreations mask upper 32-bits of src register
>>> before doing the check
>>
>> Is the plan to fix JIT, and if they can all be fixed,
>> revert this patch ?
> 
> No need to fix JITs.
> I'd rather remove 'cmp rX, 0' from JITs and let verifier emit it.
> It's more generic and gives us flexibility to decide what to do
> with divide by zero and other exceptions.

Yeah, working on this, but patch most likely for the next window.
The 'return 0' is really suboptimal as exception code for some
program types, so I'd like to have struct bpf_verifier_ops to
define such exception return code to be used, so that the verifier
can apply this via bpf insns directly.


Re: [PATCH net] ipv6: don't let tb6_root node share routes with other node

2018-01-18 Thread Wei Wang
On Thu, Jan 18, 2018 at 2:47 PM, Martin KaFai Lau  wrote:
> On Thu, Jan 18, 2018 at 10:40:03AM -0800, Wei Wang wrote:
>> From: Wei Wang 
>>
>> After commit 4512c43eac7e, if we add a route to the subtree of tb6_root
>> which does not have any route attached to it yet, the current code will
>> let tb6_root and the node in the subtree share the same route.
>> This could cause problem cause tb6_root has RTN_INFO flag marked and the
> You meant the RTN_RTINFO check in fib6_purge_rt()?
>
Yes. Exactly.

>> tree repair and clean up code will not work properly.
>> This commit makes sure tb6_root->leaf points back to null_entry instead
>> of sharing route with other node.
>>
>> It fixes the following syzkaller reported issue:
>> BUG: KASAN: use-after-free in ipv6_prefix_equal include/net/ipv6.h:540 
>> [inline]
>> BUG: KASAN: use-after-free in fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618
>> Read of size 8 at addr 8801bc043498 by task syz-executor5/19819
>>
>> CPU: 1 PID: 19819 Comm: syz-executor5 Not tainted 4.15.0-rc7+ #186
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>>  kasan_report_error mm/kasan/report.c:351 [inline]
>>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>>  ipv6_prefix_equal include/net/ipv6.h:540 [inline]
>>  fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618
>>  fib6_add+0x5fa/0x1540 net/ipv6/ip6_fib.c:1214
>>  __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1003
>>  ip6_route_add+0x141/0x190 net/ipv6/route.c:2790
>>  ipv6_route_ioctl+0x4db/0x6b0 net/ipv6/route.c:3299
>>  inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:520
>>  sock_do_ioctl+0x65/0xb0 net/socket.c:958
>>  sock_ioctl+0x2c2/0x440 net/socket.c:1055
>>  vfs_ioctl fs/ioctl.c:46 [inline]
>>  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>>  SYSC_ioctl fs/ioctl.c:701 [inline]
>>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>>  entry_SYSCALL_64_fastpath+0x23/0x9a
>> RIP: 0033:0x452ac9
>> RSP: 002b:7fd42b321c58 EFLAGS: 0212 ORIG_RAX: 0010
>> RAX: ffda RBX: 0071bea0 RCX: 00452ac9
>> RDX: 20fd7000 RSI: 890b RDI: 0013
>> RBP: 049e R08:  R09: 
>> R10:  R11: 0212 R12: 006f4f70
>> R13:  R14: 7fd42b3226d4 R15: 
>>
>> Fixes: 4512c43eac7e ("ipv6: remove null_entry before adding default route")
>> Signed-off-by: Wei Wang 
>> Acked-by: Eric Dumazet 
>> ---
>>  net/ipv6/ip6_fib.c | 10 --
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>> index 9dcc3924a975..217683d40f12 100644
>> --- a/net/ipv6/ip6_fib.c
>> +++ b/net/ipv6/ip6_fib.c
>> @@ -1226,8 +1226,14 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
>> *rt,
>>   }
>>
>>   if (!rcu_access_pointer(fn->leaf)) {
>> - atomic_inc(>rt6i_ref);
>> - rcu_assign_pointer(fn->leaf, rt);
>> + if (fn->fn_flags & RTN_TL_ROOT) {
>> + /* put back null_entry for root node */
>> + rcu_assign_pointer(fn->leaf,
>> + info->nl_net->ipv6.ip6_null_entry);
>> + } else {
>> + atomic_inc(>rt6i_ref);
>> + rcu_assign_pointer(fn->leaf, rt);
>> + }
>>   }
>>   fn = sn;
>>   }
>> --
>> 2.16.0.rc1.238.g530d649a79-goog
>>


Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue

2018-01-18 Thread Willem de Bruijn
On Thu, Jan 18, 2018 at 6:20 PM, Sowmini Varadhan
 wrote:
> On (01/18/18 18:09), Willem de Bruijn wrote:
>> If that is true in general for PF_RDS, then it is a reasonable approach.
>> How about treating it as a (follow-on) optimization path. Opportunistic
>> piggybacking of notifications on data reads is more widely applicable.
>
> sounds good.
>
>> > that's similar to what I have, except that it does not have the
>> > MSG_PEEK part (you'd need to enforce that the data portion
>> > is upper-bounded, and that the application has the responsibility
>> > of sending down "enough" buffer with recvmsg).
>>
>> Right. I think that an upper bound is the simplest solution here.
>>
>> By the way, if you allocate an skb immediately on page pinning, then
>> there are always sufficient skbs to store all notifications. On errqueue
>> enqueue just drop the new skb and copy its notification to the body of
>> the skb already on the queue, if one exists and it has room. That is
>> essentially what the tcp zerocopy code does with the [data, info] range.
>
> ok, I'll give that a shot (I'm working through the other review comments
> as well)
>
> fwiw, the data-corruption issue I mentioned turned out to be a day-one
> bug in rds-tcp (patched in http://patchwork.ozlabs.org/patch/863183/).
> The buffer reaping with zcopy (and aggressiveness of rds-stress) brought
> this one out..

Thanks. Good to hear that it's not in zerocopy, itself.


Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue

2018-01-18 Thread Sowmini Varadhan
On (01/18/18 18:09), Willem de Bruijn wrote:
> If that is true in general for PF_RDS, then it is a reasonable approach.
> How about treating it as a (follow-on) optimization path. Opportunistic
> piggybacking of notifications on data reads is more widely applicable.

sounds good.

> > that's similar to what I have, except that it does not have the
> > MSG_PEEK part (you'd need to enforce that the data portion
> > is upper-bounded, and that the application has the responsibility
> > of sending down "enough" buffer with recvmsg).
> 
> Right. I think that an upper bound is the simplest solution here.
> 
> By the way, if you allocate an skb immediately on page pinning, then
> there are always sufficient skbs to store all notifications. On errqueue
> enqueue just drop the new skb and copy its notification to the body of
> the skb already on the queue, if one exists and it has room. That is
> essentially what the tcp zerocopy code does with the [data, info] range.

ok, I'll give that a shot (I'm working through the other review comments
as well)

fwiw, the data-corruption issue I mentioned turned out to be a day-one
bug in rds-tcp (patched in http://patchwork.ozlabs.org/patch/863183/). 
The buffer reaping with zcopy (and aggressiveness of rds-stress) brought
this one out..

--Sowmini


[PATCH net-next v2] net: stmmac: Fix reception of Broadcom switches tags

2018-01-18 Thread Florian Fainelli
Broadcom tags inserted by Broadcom switches put a 4 byte header after
the MAC SA and before the EtherType, which may look like some sort of 0
length LLC/SNAP packet (tcpdump and wireshark do think that way). With
ACS enabled in stmmac the packets were truncated to 8 bytes on
reception, whereas clearing this bit allowed normal reception to occur.

In order to make that possible, we need to pass a net_device argument to
the different core_init() functions and we are dependent on the Broadcom
tagger padding packets correctly (which it now does). To be as little
invasive as possible, this is only done for gmac1000 when the network
device is DSA-enabled (netdev_uses_dsa() returns true).

Signed-off-by: Florian Fainelli 
---
Changes in v2:

- fixed build failure in dwmac4_core.c
- updated dwmac100_core.c to also clear the ASTP bit

 drivers/net/ethernet/stmicro/stmmac/common.h |  2 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c|  3 ++-
 drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c | 12 +++-
 drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c  | 15 +--
 drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c| 12 +++-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c|  2 +-
 6 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index ce2ea2d491ac..2ffe76c0ff74 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -474,7 +474,7 @@ struct mac_device_info;
 /* Helpers to program the MAC core */
 struct stmmac_ops {
/* MAC core initialization */
-   void (*core_init)(struct mac_device_info *hw, int mtu);
+   void (*core_init)(struct mac_device_info *hw, struct net_device *dev);
/* Enable the MAC RX/TX */
void (*set_mac)(void __iomem *ioaddr, bool enable);
/* Enable and verify that the IPC module is supported */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
index 9eb7f65d8000..a3fa65b1ca8e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
@@ -483,7 +483,8 @@ static int sun8i_dwmac_init(struct platform_device *pdev, 
void *priv)
return 0;
 }
 
-static void sun8i_dwmac_core_init(struct mac_device_info *hw, int mtu)
+static void sun8i_dwmac_core_init(struct mac_device_info *hw,
+ struct net_device *dev)
 {
void __iomem *ioaddr = hw->pcsr;
u32 v;
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
index 8a86340ff2d3..540d21786a43 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
@@ -25,18 +25,28 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "stmmac_pcs.h"
 #include "dwmac1000.h"
 
-static void dwmac1000_core_init(struct mac_device_info *hw, int mtu)
+static void dwmac1000_core_init(struct mac_device_info *hw,
+   struct net_device *dev)
 {
void __iomem *ioaddr = hw->pcsr;
u32 value = readl(ioaddr + GMAC_CONTROL);
+   int mtu = dev->mtu;
 
/* Configure GMAC core */
value |= GMAC_CORE_INIT;
 
+   /* Clear ACS bit because Ethernet switch tagging formats such as
+* Broadcom tags can look like invalid LLC/SNAP packets and cause the
+* hardware to truncate packets on reception.
+*/
+   if (netdev_uses_dsa(dev))
+   value &= ~GMAC_CONTROL_ACS;
+
if (mtu > 1500)
value |= GMAC_CONTROL_2K;
if (mtu > 2000)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
index 8ef517356313..91b23f9db31a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
@@ -25,15 +25,26 @@
 
***/
 
 #include 
+#include 
 #include 
 #include "dwmac100.h"
 
-static void dwmac100_core_init(struct mac_device_info *hw, int mtu)
+static void dwmac100_core_init(struct mac_device_info *hw,
+  struct net_device *dev)
 {
void __iomem *ioaddr = hw->pcsr;
u32 value = readl(ioaddr + MAC_CONTROL);
 
-   writel((value | MAC_CORE_INIT), ioaddr + MAC_CONTROL);
+   value |= MAC_CORE_INIT;
+
+   /* Clear ASTP bit because Ethernet switch tagging formats such as
+* Broadcom tags can look like invalid LLC/SNAP packets and cause the
+* hardware to truncate packets on reception.
+*/
+   if (netdev_uses_dsa(dev))
+   value &= ~MAC_CONTROL_ASTP;
+
+   writel(value, ioaddr + MAC_CONTROL);
 
 #ifdef 

[PATCH bpf-next 0/2] bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map

2018-01-18 Thread Yonghong Song
This patch set implements MAP_GET_NEXT_KEY command for LPM_TRIE map.
This command is really useful for key enumeration, and for key deletion
if what keys in the trie are unknown.

Patch #1 implements the functionality in the kernel and patch #2
adds a test case in tools/testing/selftests/bpf.

Yonghong Song (2):
  bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map
  tools/bpf: add a testcase for MAP_GET_NEXT_KEY command of LPM_TRIE map

 kernel/bpf/lpm_trie.c  |  95 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 122 +
 2 files changed, 215 insertions(+), 2 deletions(-)

-- 
2.9.5



[PATCH bpf-next 1/2] bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map

2018-01-18 Thread Yonghong Song
Current LPM_TRIE map type does not implement MAP_GET_NEXT_KEY
command. This command is handy when users want to enumerate
keys. Otherwise, a different map which supports key
enumeration may be required to store the keys. If the
map data is sparse and all map data are to be deleted without
closing file descriptor, using MAP_GET_NEXT_KEY to find
all keys is much faster than enumerating all key space.

This patch implements MAP_GET_NEXT_KEY command for LPM_TRIE map.
If user provided key pointer is NULL or the key does not have
an exact match in the trie, the first key will be returned.
Otherwise, the next key will be returned.

In this implemenation, key enumeration follows a postorder
traversal of internal trie. More specific keys
will be returned first than less specific ones, given
a sequence of MAP_GET_NEXT_KEY syscalls.

Signed-off-by: Yonghong Song 
---
 kernel/bpf/lpm_trie.c | 95 +--
 1 file changed, 93 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 584e022..d7ea962 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -591,9 +591,100 @@ static void trie_free(struct bpf_map *map)
raw_spin_unlock(>lock);
 }
 
-static int trie_get_next_key(struct bpf_map *map, void *key, void *next_key)
+static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
 {
-   return -ENOTSUPP;
+   struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
+   struct bpf_lpm_trie_key *key = _key, *next_key = _next_key;
+   struct lpm_trie_node *node, *next_node = NULL, *parent;
+   struct lpm_trie_node **node_stack = NULL;
+   struct lpm_trie_node __rcu **root;
+   int err = 0, stack_ptr = -1;
+   unsigned int next_bit;
+   size_t matchlen;
+
+   /* The get_next_key follows postorder. For the 4 node example in
+* the top of this file, the trie_get_next_key() returns the following
+* one after another:
+*   192.168.0.0/24
+*   192.168.1.0/24
+*   192.168.128.0/24
+*   192.168.0.0/16
+*
+* The idea is to return more specific keys before less specific ones.
+*/
+
+   /* Empty trie */
+   if (!rcu_dereference(trie->root))
+   return -ENOENT;
+
+   /* For invalid key, find the leftmost node in the trie */
+   if (!key || key->prefixlen > trie->max_prefixlen) {
+   root = >root;
+   goto find_leftmost;
+   }
+
+   node_stack = kmalloc(trie->max_prefixlen * sizeof(struct lpm_trie_node 
*),
+GFP_USER | __GFP_NOWARN);
+   if (!node_stack)
+   return -ENOMEM;
+
+   /* Try to find the exact node for the given key */
+   for (node = rcu_dereference(trie->root); node;) {
+   node_stack[++stack_ptr] = node;
+   matchlen = longest_prefix_match(trie, node, key);
+   if (node->prefixlen != matchlen ||
+   node->prefixlen == key->prefixlen)
+   break;
+
+   next_bit = extract_bit(key->data, node->prefixlen);
+   node = rcu_dereference(node->child[next_bit]);
+   }
+   if (!node || node->prefixlen != key->prefixlen ||
+   (node->flags & LPM_TREE_NODE_FLAG_IM)) {
+   root = >root;
+   goto find_leftmost;
+   }
+
+   /* The node with the exactly-matching key has been found,
+* find the first node in postorder after the matched node.
+*/
+   node = node_stack[stack_ptr];
+   while (stack_ptr > 0) {
+   parent = node_stack[stack_ptr - 1];
+   if (rcu_dereference(parent->child[0]) == node &&
+   rcu_dereference(parent->child[1])) {
+   root = >child[1];
+   goto find_leftmost;
+   }
+   if (!(parent->flags & LPM_TREE_NODE_FLAG_IM)) {
+   next_node = parent;
+   goto do_copy;
+   }
+
+   node = parent;
+   stack_ptr--;
+   }
+
+   /* did not find anything */
+   err = -ENOENT;
+   goto free_stack;
+
+find_leftmost:
+   /* Find the leftmost non-intermediate node, all intermediate nodes
+* have exact two children, so this function will never return NULL.
+*/
+   for (node = rcu_dereference(*root); node;) {
+   if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
+   next_node = node;
+   node = rcu_dereference(node->child[0]);
+   }
+do_copy:
+   next_key->prefixlen = next_node->prefixlen;
+   memcpy((void *)next_key + offsetof(struct bpf_lpm_trie_key, data),
+  next_node->data, trie->data_size);
+free_stack:
+   kfree(node_stack);
+   return err;
 }
 
 const struct bpf_map_ops trie_map_ops = {
-- 
2.9.5



Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue

2018-01-18 Thread Willem de Bruijn
On Thu, Jan 18, 2018 at 6:03 PM, Sowmini Varadhan
 wrote:
> On (01/18/18 17:54), Willem de Bruijn wrote:
>> > 2. If we have the option of passing completion-notification up as ancillary
>> >data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE)
>>
>> This assumes a somewhat symmetric workload, where there are enough recv
>> calls to reap the notification associated with the send calls.
>
> Your comment about the assumption is true, but at least for the database
> use-cases, we have a request-response model, so the assumption works out..
> I dont know if many other workloads that send large buffers have this
> pattern.

If that is true in general for PF_RDS, then it is a reasonable approach.
How about treating it as a (follow-on) optimization path. Opportunistic
piggybacking of notifications on data reads is more widely applicable.

>
>> I would stay with MSG_ERRQUEUE processing. One option is to pass data
>> up to userspace in the data portion of the notification skb instead of
>> encoding it in ancillary data, like tcp_get_timestamping_opt_stats.
>
> that's similar to what I have, except that it does not have the
> MSG_PEEK part (you'd need to enforce that the data portion
> is upper-bounded, and that the application has the responsibility
> of sending down "enough" buffer with recvmsg).

Right. I think that an upper bound is the simplest solution here.

By the way, if you allocate an skb immediately on page pinning, then
there are always sufficient skbs to store all notifications. On errqueue
enqueue just drop the new skb and copy its notification to the body of
the skb already on the queue, if one exists and it has room. That is
essentially what the tcp zerocopy code does with the [data, info] range.

> Note that any one of these choices are ok with me- I have no
> special attachments to any of them.


[PATCH bpf-next 2/2] tools/bpf: add a testcase for MAP_GET_NEXT_KEY command of LPM_TRIE map

2018-01-18 Thread Yonghong Song
A test case is added in tools/testing/selftests/bpf/test_lpm_map.c
for MAP_GET_NEXT_KEY command. A four node trie, which
is described in kernel/bpf/lpm_trie.c, is built and the
MAP_GET_NEXT_KEY results are checked.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_lpm_map.c | 122 +
 1 file changed, 122 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_lpm_map.c 
b/tools/testing/selftests/bpf/test_lpm_map.c
index f614806..0815108 100644
--- a/tools/testing/selftests/bpf/test_lpm_map.c
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -521,6 +521,126 @@ static void test_lpm_delete(void)
close(map_fd);
 }
 
+static void test_lpm_get_next_key(void)
+{
+   struct bpf_lpm_trie_key *key_p, *next_key_p;
+   size_t key_size;
+   __u32 value = 0;
+   int map_fd;
+
+   key_size = sizeof(*key_p) + sizeof(__u32);
+   key_p = alloca(key_size);
+   next_key_p = alloca(key_size);
+
+   map_fd = bpf_create_map(BPF_MAP_TYPE_LPM_TRIE, key_size, sizeof(value),
+   100, BPF_F_NO_PREALLOC);
+   assert(map_fd >= 0);
+
+   /* empty tree. get_next_key should return ENOENT */
+   assert(bpf_map_get_next_key(map_fd, NULL, key_p) == -1 &&
+  errno == ENOENT);
+
+   /* get and verify the first key, get the second one should fail. */
+   key_p->prefixlen = 16;
+   inet_pton(AF_INET, "192.168.0.0", key_p->data);
+   assert(bpf_map_update_elem(map_fd, key_p, , 0) == 0);
+
+   memset(key_p, 0, key_size);
+   assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0);
+   assert(key_p->prefixlen == 16 && key_p->data[0] == 192 &&
+  key_p->data[1] == 168);
+
+   assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == -1 &&
+  errno == ENOENT);
+
+   /* no exact matching key should get the first one in post order. */
+   key_p->prefixlen = 8;
+   assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0);
+   assert(key_p->prefixlen == 16 && key_p->data[0] == 192 &&
+  key_p->data[1] == 168);
+
+   /* add one more element (total two) */
+   key_p->prefixlen = 24;
+   inet_pton(AF_INET, "192.168.0.0", key_p->data);
+   assert(bpf_map_update_elem(map_fd, key_p, , 0) == 0);
+
+   memset(key_p, 0, key_size);
+   assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0);
+   assert(key_p->prefixlen == 24 && key_p->data[0] == 192 &&
+  key_p->data[1] == 168 && key_p->data[2] == 0);
+
+   memset(next_key_p, 0, key_size);
+   assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0);
+   assert(next_key_p->prefixlen == 16 && next_key_p->data[0] == 192 &&
+  next_key_p->data[1] == 168);
+
+   memcpy(key_p, next_key_p, key_size);
+   assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == -1 &&
+  errno == ENOENT);
+
+   /* Add one more element (total three) */
+   key_p->prefixlen = 24;
+   inet_pton(AF_INET, "192.168.128.0", key_p->data);
+   assert(bpf_map_update_elem(map_fd, key_p, , 0) == 0);
+
+   memset(key_p, 0, key_size);
+   assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0);
+   assert(key_p->prefixlen == 24 && key_p->data[0] == 192 &&
+  key_p->data[1] == 168 && key_p->data[2] == 0);
+
+   memset(next_key_p, 0, key_size);
+   assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0);
+   assert(next_key_p->prefixlen == 24 && next_key_p->data[0] == 192 &&
+  next_key_p->data[1] == 168 && next_key_p->data[2] == 128);
+
+   memcpy(key_p, next_key_p, key_size);
+   assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0);
+   assert(next_key_p->prefixlen == 16 && next_key_p->data[0] == 192 &&
+  next_key_p->data[1] == 168);
+
+   memcpy(key_p, next_key_p, key_size);
+   assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == -1 &&
+  errno == ENOENT);
+
+   /* Add one more element (total four) */
+   key_p->prefixlen = 24;
+   inet_pton(AF_INET, "192.168.1.0", key_p->data);
+   assert(bpf_map_update_elem(map_fd, key_p, , 0) == 0);
+
+   memset(key_p, 0, key_size);
+   assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0);
+   assert(key_p->prefixlen == 24 && key_p->data[0] == 192 &&
+  key_p->data[1] == 168 && key_p->data[2] == 0);
+
+   memset(next_key_p, 0, key_size);
+   assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0);
+   assert(next_key_p->prefixlen == 24 && next_key_p->data[0] == 192 &&
+  next_key_p->data[1] == 168 && next_key_p->data[2] == 1);
+
+   memcpy(key_p, next_key_p, key_size);
+   assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0);
+   assert(next_key_p->prefixlen == 24 && next_key_p->data[0] == 192 &&
+  next_key_p->data[1] == 168 && next_key_p->data[2] == 128);
+
+   

It's all about 14

2018-01-18 Thread Facebook Int'l


Hello ,


facebook is given out  14,000,000.USD (Fourteen Million Dollars) its all about 
14 Please, respond with your Unique Code (FB/BF14-13M5250UD) 
using your registration email, to the Verification Department at; 
dustinmoskovitz.faceb...@gmail.com


Dustin Moskovitz
Facebook Team
Copyright © 2018 Facebook Int'l


Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue

2018-01-18 Thread Sowmini Varadhan
On (01/18/18 17:54), Willem de Bruijn wrote:
> > 2. If we have the option of passing completion-notification up as ancillary
> >data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE)
> 
> This assumes a somewhat symmetric workload, where there are enough recv
> calls to reap the notification associated with the send calls.

Your comment about the assumption is true, but at least for the database
use-cases, we have a request-response model, so the assumption works out..
I dont know if many other workloads that send large buffers have this
pattern.

> I would stay with MSG_ERRQUEUE processing. One option is to pass data
> up to userspace in the data portion of the notification skb instead of
> encoding it in ancillary data, like tcp_get_timestamping_opt_stats.

that's similar to what I have, except that it does not have the
MSG_PEEK part (you'd need to enforce that the data portion
is upper-bounded, and that the application has the responsibility
of sending down "enough" buffer with recvmsg). 

Note that any one of these choices are ok with me- I have no
special attachments to any of them.

--Sowmini


Re: [PATCHv3 net-next 7/7] net: sched: cls_u32: add extack support

2018-01-18 Thread Alexander Aring
Hi,

On Thu, Jan 18, 2018 at 11:51 AM, Jiri Pirko  wrote:
> Thu, Jan 18, 2018 at 05:20:55PM CET, ar...@mojatatu.com wrote:
>>This patch adds extack support for the u32 classifier as example for
>>delete and init callback.
>>
>>Cc: David Ahern 
>>Signed-off-by: Alexander Aring 
>
> Acked-by: Jiri Pirko 
>
> Any plan to continue with other classifiers? Like cls_flower? :)
>

Not yet, the next step is extack support in tc actions. If you want to
do cls_flower extack support go ahead and make it.
As you are the author of it, you know better than everybody else how
the extack messages should look like.

- Alex


Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue

2018-01-18 Thread Willem de Bruijn
On Thu, Jan 18, 2018 at 12:12 PM, Sowmini Varadhan
 wrote:
> On (01/18/18 08:53), Eric Dumazet wrote:
>>
>> The thing is : MSG_PEEK 'support' will also need SO_PEEK_OFF support.
>
> sure, I'll drop the MSG_PEEK idea (which I wasnt very thrilled
> about anyway)
>
>> So lets properly design things, and not re-use legacy stuff that is
>> proven to be not multi-thread ready and too complex.
>>
>> If you want to design a new channel of communication, do it, and
>> maintain it.
>
> My instinct is to go with the fixed size ancillary data- which itself
> allows 2 options:
>
> 1. cmsg_data has a sock_extended_err preamble
>with ee_origin = SO_EE_ORIGIN_ZEROCOPY_COOKIE (or similar),
>and the ee_data is an array of 32 bit cookies (can pack at most 8
>32-bit cookies, if we want to pack this into an skb->cb)
>
>Using the sock_extended_err as preamble will allow this to be usable by
>existing tcp zcopy applications (they can use the ee_origin to find
>out if this a batch of cookies or the existing hi/lo values).
>
> 2. If we have the option of passing completion-notification up as ancillary
>data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE)

This assumes a somewhat symmetric workload, where there are enough recv
calls to reap the notification associated with the send calls.

>we dont have to try to retain "backward compat" to the
>SO_EE_ORIGIN_ZEROCOPY API: we can just use a completely new data
>struct for the notification and potentially pack more cookies into
>48 bytes (RDS could be the first guinea pig for this- doesnt even
>have to be done across all protocol families on day-1).
>
> I think the shmem channel suggestion would be an optional optimization
> that can be added later- it may not even be necessary, since most
> applications will likely be sending *and* receiving data, so passing up
> cookies with recvmsg should be "good enough" to save syscall overhead
> for the common case.
>
> I can work #2, if there are no objections to it.

I would stay with MSG_ERRQUEUE processing. One option is to pass data
up to userspace in the data portion of the notification skb instead of
encoding it in ancillary data, like tcp_get_timestamping_opt_stats.


Re: [PATCH RFC net-next 5/6] rds: support for zcopy completion notification

2018-01-18 Thread Willem de Bruijn
>> This changes the semantics of these fields. Please add a new SO_EE_CODE flag,
>> even if the semantics can be derived from the packet family (for now).
>
> sounds good, I'll take care of this (and other review comments)
> for the next rev.
>
>> Even better would be if we can avoid the cookies completely. I understand
>> the issue with concurrent send threads racing on obtaining a zckey value. If
>> the sender could learn which zckey was chosen for a call, would that suffice?
>
> I'm not sure I understand the proposal- you want sendmsg to return
> the cookie ("zckey") for you? How?
>
> even if we mangled sendmsg() to return something other than
> the existing POSIX semantics, how will the application be asynchronously
> notified that send has completed (on a per-message/per-datagram) basis?

I'm purposely glossing over how the kernel returns this item for now.
Was just wondering whether we can then assume mostly in order delivery
and reuse the existing notification interface from tcp zerocopy.

>From your experiments, it sounds like this is not the case. In which case
there is little benefit to trying to force linear IDs derived from sk->sk_zckey.

>> I suspect that in even with concurrent senders, notifications arrive
>> largely in
>> order, in which case we could just maintain the existing semantics and even
>> reuse that implementation.
>
> not so. rds-stress [1] with -d8 -t8 quickly disproves this on my 10G ixgbe
> connection.

Okay. In that case, the cmsg cookie approach sounds fine. I had the same
in an early tcp zerocopy test version, as a matter of fact.

>> > } rdma;
>> > struct rm_data_op {
>> > unsigned intop_active:1;
>> > -   unsigned intop_notify:1;
>> > +   unsigned intop_notify:1,
>> > +   op_zcopy:1,
>  :
>> > +   struct rds_znotifier*op_mmp_znotifier;
>>
>> not necessary if op_mmp_znotifier is NULL unless set in
>> rds_message_copy_from_user
>
> To make sure I dont misunderstand, you are suggesting that we dont
> need op_zcopy, but can just check for the null-ness of
> op_mmp_znotifier (yes, true, I agree)? or something else?

Yes, that's what I meant.


Re: [PATCH net] ipv6: don't let tb6_root node share routes with other node

2018-01-18 Thread Martin KaFai Lau
On Thu, Jan 18, 2018 at 10:40:03AM -0800, Wei Wang wrote:
> From: Wei Wang 
> 
> After commit 4512c43eac7e, if we add a route to the subtree of tb6_root
> which does not have any route attached to it yet, the current code will
> let tb6_root and the node in the subtree share the same route.
> This could cause problem cause tb6_root has RTN_INFO flag marked and the
You meant the RTN_RTINFO check in fib6_purge_rt()?

> tree repair and clean up code will not work properly.
> This commit makes sure tb6_root->leaf points back to null_entry instead
> of sharing route with other node.
> 
> It fixes the following syzkaller reported issue:
> BUG: KASAN: use-after-free in ipv6_prefix_equal include/net/ipv6.h:540 
> [inline]
> BUG: KASAN: use-after-free in fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618
> Read of size 8 at addr 8801bc043498 by task syz-executor5/19819
> 
> CPU: 1 PID: 19819 Comm: syz-executor5 Not tainted 4.15.0-rc7+ #186
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>  ipv6_prefix_equal include/net/ipv6.h:540 [inline]
>  fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618
>  fib6_add+0x5fa/0x1540 net/ipv6/ip6_fib.c:1214
>  __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1003
>  ip6_route_add+0x141/0x190 net/ipv6/route.c:2790
>  ipv6_route_ioctl+0x4db/0x6b0 net/ipv6/route.c:3299
>  inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:520
>  sock_do_ioctl+0x65/0xb0 net/socket.c:958
>  sock_ioctl+0x2c2/0x440 net/socket.c:1055
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>  SYSC_ioctl fs/ioctl.c:701 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> RIP: 0033:0x452ac9
> RSP: 002b:7fd42b321c58 EFLAGS: 0212 ORIG_RAX: 0010
> RAX: ffda RBX: 0071bea0 RCX: 00452ac9
> RDX: 20fd7000 RSI: 890b RDI: 0013
> RBP: 049e R08:  R09: 
> R10:  R11: 0212 R12: 006f4f70
> R13:  R14: 7fd42b3226d4 R15: 
> 
> Fixes: 4512c43eac7e ("ipv6: remove null_entry before adding default route")
> Signed-off-by: Wei Wang 
> Acked-by: Eric Dumazet 
> ---
>  net/ipv6/ip6_fib.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index 9dcc3924a975..217683d40f12 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -1226,8 +1226,14 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
> *rt,
>   }
>  
>   if (!rcu_access_pointer(fn->leaf)) {
> - atomic_inc(>rt6i_ref);
> - rcu_assign_pointer(fn->leaf, rt);
> + if (fn->fn_flags & RTN_TL_ROOT) {
> + /* put back null_entry for root node */
> + rcu_assign_pointer(fn->leaf,
> + info->nl_net->ipv6.ip6_null_entry);
> + } else {
> + atomic_inc(>rt6i_ref);
> + rcu_assign_pointer(fn->leaf, rt);
> + }
>   }
>   fn = sn;
>   }
> -- 
> 2.16.0.rc1.238.g530d649a79-goog
> 


Re: [PATCH bpf] bpf: fix 32-bit divide by zero

2018-01-18 Thread Alexei Starovoitov

On 1/18/18 2:30 PM, Eric Dumazet wrote:

On Fri, 2018-01-12 at 18:59 -0800, Alexei Starovoitov wrote:

due to some JITs doing if (src_reg == 0) check in 64-bit mode
for div/mod opreations mask upper 32-bits of src register
before doing the check



Is the plan to fix JIT, and if they can all be fixed,
revert this patch ?


No need to fix JITs.
I'd rather remove 'cmp rX, 0' from JITs and let verifier emit it.
It's more generic and gives us flexibility to decide what to do
with divide by zero and other exceptions.




Re: [PATCH bpf] bpf: fix 32-bit divide by zero

2018-01-18 Thread Eric Dumazet
On Fri, 2018-01-12 at 18:59 -0800, Alexei Starovoitov wrote:
> due to some JITs doing if (src_reg == 0) check in 64-bit mode
> for div/mod opreations mask upper 32-bits of src register
> before doing the check
> 

Is the plan to fix JIT, and if they can all be fixed,
revert this patch ?

x86 patch would be something like :

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 
87f214fbe66ec163d24b12b6defc7edab612ecc9..91e4ab69573e09f793eb1c1e29d1b5ffad1d5dc7
 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -548,8 +548,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
if (BPF_SRC(insn->code) == BPF_X) {
/* if (src_reg == 0) return 0 */
 
-   /* cmp r11, 0 */
-   EMIT4(0x49, 0x83, 0xFB, 0x00);
+   if (BPF_CLASS(insn->code) == BPF_ALU64) {
+   /* cmp r11, 0 */
+   EMIT4(0x49, 0x83, 0xFB, 0x00);
+   } else {
+   /* cmp r11d, 0 */
+   EMIT4(0x41, 0x83, 0xFB, 0x00);
+   }
 
/* jne .+9 (skip over pop, pop, xor and jmp) */
EMIT2(X86_JNE, 1 + 1 + 2 + 5);


[PATCH net v2 3/3] ibmvnic: Allocate and request vpd in init_resources

2018-01-18 Thread John Allen
In reset events in which our memory allocations need to be reallocated,
VPD data is being freed, but never reallocated. This can cause issues if
we later attempt to access that memory or reset and attempt to free the
memory. This patch moves the allocation of the VPD data to init_resources
so that it will be symmetrically freed during release resources.

Signed-off-by: John Allen 
---
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index bb56460..f0dbb76 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -867,7 +867,7 @@ static int ibmvnic_get_vpd(struct ibmvnic_adapter *adapter)
if (adapter->vpd->buff)
len = adapter->vpd->len;

-   reinit_completion(>fw_done);
+   init_completion(>fw_done);
crq.get_vpd_size.first = IBMVNIC_CRQ_CMD;
crq.get_vpd_size.cmd = GET_VPD_SIZE;
ibmvnic_send_crq(adapter, );
@@ -929,6 +929,13 @@ static int init_resources(struct ibmvnic_adapter *adapter)
if (!adapter->vpd)
return -ENOMEM;

+   /* Vital Product Data (VPD) */
+   rc = ibmvnic_get_vpd(adapter);
+   if (rc) {
+   netdev_err(netdev, "failed to initialize Vital Product Data 
(VPD)\n");
+   return rc;
+   }
+
adapter->map_id = 1;
adapter->napi = kcalloc(adapter->req_rx_queues,
sizeof(struct napi_struct), GFP_KERNEL);
@@ -1002,7 +1009,7 @@ static int __ibmvnic_open(struct net_device *netdev)
 static int ibmvnic_open(struct net_device *netdev)
 {
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
-   int rc, vpd;
+   int rc;

mutex_lock(>reset_lock);

@@ -1030,11 +1037,6 @@ static int ibmvnic_open(struct net_device *netdev)
rc = __ibmvnic_open(netdev);
netif_carrier_on(netdev);

-   /* Vital Product Data (VPD) */
-   vpd = ibmvnic_get_vpd(adapter);
-   if (vpd)
-   netdev_err(netdev, "failed to initialize Vital Product Data 
(VPD)\n");
-
mutex_unlock(>reset_lock);

return rc;



[PATCH net v2 2/3] ibmvnic: Revert to previous mtu when unsupported value requested

2018-01-18 Thread John Allen
If we request an unsupported mtu value, the vnic server will suggest a
different value. Currently we take the suggested value without question
and login with that value. However, the behavior doesn't seem completely
sane as attempting to change the mtu to some specific value will change
the mtu to some completely different value most of the time. This patch
fixes the issue by logging in with the previously used mtu value and
printing an error message saying that the given mtu is unsupported.

Signed-off-by: John Allen 
---
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index f8f1396..bb56460 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3608,7 +3608,17 @@ static void handle_request_cap_rsp(union ibmvnic_crq 
*crq,
 *req_value,
 (long int)be64_to_cpu(crq->request_capability_rsp.
   number), name);
-   *req_value = be64_to_cpu(crq->request_capability_rsp.number);
+
+   if (be16_to_cpu(crq->request_capability_rsp.capability) ==
+   REQ_MTU) {
+   pr_err("mtu of %llu is not supported. Reverting.\n",
+  *req_value);
+   *req_value = adapter->fallback.mtu;
+   } else {
+   *req_value =
+   be64_to_cpu(crq->request_capability_rsp.number);
+   }
+
ibmvnic_send_req_caps(adapter, 1);
return;
default:



[PATCH net v2 1/3] ibmvnic: Modify buffer size and number of queues on failover

2018-01-18 Thread John Allen
Using newer backing devices can cause the required padding at the end of
buffer as well as the number of queues to change after a failover.
Since we currently assume that these values never change, after a
failover to a backing device with different capabilities, we can get
errors from the vnic server, attempt to free long term buffers that are
no longer there, or not free long term buffers that should be freed.

This patch resolves the issue by checking whether any of these values
change, and if so perform the necessary re-allocations.

Signed-off-by: John Allen 
---
v2: Added the line to free the long term buff in reset rx pools which
caused me to hit a couple of other problems. On a failover, the number
of queues can also change which doesn't get updated in the current
reset code. Added some extra logic in do reset to check for changed
number of queues and added some logic to release pools to ensure that
it is only releasing pools that were allocated in the init pools
routines.
---
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 461014b7ccdd..12559c63c226 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -411,6 +411,10 @@ static int reset_rx_pools(struct ibmvnic_adapter *adapter)
struct ibmvnic_rx_pool *rx_pool;
int rx_scrqs;
int i, j, rc;
+   u64 *size_array;
+
+   size_array = (u64 *)((u8 *)(adapter->login_rsp_buf) +
+   be32_to_cpu(adapter->login_rsp_buf->off_rxadd_buff_size));

rx_scrqs = be32_to_cpu(adapter->login_rsp_buf->num_rxadd_subcrqs);
for (i = 0; i < rx_scrqs; i++) {
@@ -418,7 +422,17 @@ static int reset_rx_pools(struct ibmvnic_adapter *adapter)

netdev_dbg(adapter->netdev, "Re-setting rx_pool[%d]\n", i);

-   rc = reset_long_term_buff(adapter, _pool->long_term_buff);
+   if (rx_pool->buff_size != be64_to_cpu(size_array[i])) {
+   free_long_term_buff(adapter, _pool->long_term_buff);
+   rx_pool->buff_size = be64_to_cpu(size_array[i]);
+   alloc_long_term_buff(adapter, _pool->long_term_buff,
+rx_pool->size *
+rx_pool->buff_size);
+   } else {
+   rc = reset_long_term_buff(adapter,
+ _pool->long_term_buff);
+   }
+
if (rc)
return rc;

@@ -440,14 +454,12 @@ static int reset_rx_pools(struct ibmvnic_adapter *adapter)
 static void release_rx_pools(struct ibmvnic_adapter *adapter)
 {
struct ibmvnic_rx_pool *rx_pool;
-   int rx_scrqs;
int i, j;

if (!adapter->rx_pool)
return;

-   rx_scrqs = be32_to_cpu(adapter->login_rsp_buf->num_rxadd_subcrqs);
-   for (i = 0; i < rx_scrqs; i++) {
+   for (i = 0; i < adapter->num_active_rx_pools; i++) {
rx_pool = >rx_pool[i];

netdev_dbg(adapter->netdev, "Releasing rx_pool[%d]\n", i);
@@ -470,6 +482,7 @@ static void release_rx_pools(struct ibmvnic_adapter 
*adapter)

kfree(adapter->rx_pool);
adapter->rx_pool = NULL;
+   adapter->num_active_rx_pools = 0;
 }

 static int init_rx_pools(struct net_device *netdev)
@@ -494,6 +507,8 @@ static int init_rx_pools(struct net_device *netdev)
return -1;
}

+   adapter->num_active_rx_pools = 0;
+
for (i = 0; i < rxadd_subcrqs; i++) {
rx_pool = >rx_pool[i];

@@ -537,6 +552,8 @@ static int init_rx_pools(struct net_device *netdev)
rx_pool->next_free = 0;
}

+   adapter->num_active_rx_pools = rxadd_subcrqs;
+
return 0;
 }

@@ -587,13 +604,12 @@ static void release_vpd_data(struct ibmvnic_adapter 
*adapter)
 static void release_tx_pools(struct ibmvnic_adapter *adapter)
 {
struct ibmvnic_tx_pool *tx_pool;
-   int i, tx_scrqs;
+   int i;

if (!adapter->tx_pool)
return;

-   tx_scrqs = be32_to_cpu(adapter->login_rsp_buf->num_txsubm_subcrqs);
-   for (i = 0; i < tx_scrqs; i++) {
+   for (i = 0; i < adapter->num_active_tx_pools; i++) {
netdev_dbg(adapter->netdev, "Releasing tx_pool[%d]\n", i);
tx_pool = >tx_pool[i];
kfree(tx_pool->tx_buff);
@@ -604,6 +620,7 @@ static void release_tx_pools(struct ibmvnic_adapter 
*adapter)

kfree(adapter->tx_pool);
adapter->tx_pool = NULL;
+   adapter->num_active_tx_pools = 0;
 }

 static int init_tx_pools(struct net_device *netdev)
@@ -620,6 +637,8 @@ static int init_tx_pools(struct net_device *netdev)
if (!adapter->tx_pool)
return -1;

+   adapter->num_active_tx_pools = 0;
+
for (i = 0; i < tx_subcrqs; i++) {
tx_pool = >tx_pool[i];

@@ -667,6 +686,8 @@ 

[PATCH net v2 0/3] ibmvnic: Reset behavior fixes

2018-01-18 Thread John Allen
This patchset fixes a number of issues related to ibmvnic reset uncovered
from testing new Power9 machines with Everglades adapters and the new
functionality to change mtu and other parameters in the driver.

Changes since v1:
-In patch 1/3, added the line to free the long term buffers before
allocating a new one. This change inadvertently uncovered the problem
that the number of queues can change after a failover as well. To fix
this, we check whether or not the number of queues has changed in
do_reset and if they have, we do a full release and init of the queues.
-In patch 1/3, added variables to the adapter struct to track how
many rx/tx pools have actually been allocated and modify the release
pools routines to use these values rather than the possibly incorrect
req_rx/tx_queues values.

John Allen (3):
  ibmvnic: Modify buffer size and number of queues on failover
  ibmvnic: Revert to previous mtu when unsupported value requested
  ibmvnic: Allocate and request vpd in init_resources

 drivers/net/ethernet/ibm/ibmvnic.c | 73 ++
 drivers/net/ethernet/ibm/ibmvnic.h |  2 ++
 2 files changed, 60 insertions(+), 15 deletions(-)

-- 
2.9.5



Re: [pull request bpf-next 0/8] bpf: offload: array maps, reporting and test

2018-01-18 Thread Daniel Borkmann
On 01/18/2018 04:13 AM, Jakub Kicinski wrote:
> Hi!
> 
> This set bring in the rest of map offload code held up by urgent
> fixes and improvements to the BPF arrays.
> 
> The first 3 patches take care of array map offload, similarly to
> hash maps the attribute validation is split out to a separate map
> op, and used for both offloaded and non-offloaded case (allocation
> only happens if map is on the host).  Offload support comes down
> to allowing this map type through the offload check in the core.
> NFP driver also rejects the delete operation in case of array maps.
> 
> Subsequent patches add reporting of target device in a very similar
> way target device of programs is reported (ifindex+netns dev/ino).
> Netdevsim is extended with a trivial map implementation allowing us
> to test the offload in test_offload.py.
> 
> Last patch adds a small busy wait to NFP map IO, this improves the
> response times which is especially useful for map dumps.

Looks good to me as well, applied to bpf-next, thanks Jakub!


[PATCHv4 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-18 Thread William Tu
The patch adds support for openvswitch to configure erspan
v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
to uapi as a nested attribute to support all ERSPAN v1 and v2's
fields.  Note taht Previous commit "openvswitch: Add erspan tunnel
support." was reverted since it does not design properly.

Signed-off-by: William Tu 
---
 include/uapi/linux/openvswitch.h |  11 +++
 net/openvswitch/flow_netlink.c   | 154 ++-
 2 files changed, 164 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index dcfab5e3b55c..f1f98fd703fe 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -273,6 +273,16 @@ enum {
 
 #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
 
+enum {
+   OVS_ERSPAN_OPT_UNSPEC,
+   OVS_ERSPAN_OPT_IDX, /* u32 index */
+   OVS_ERSPAN_OPT_VER, /* u8 version number */
+   OVS_ERSPAN_OPT_DIR, /* u8 direction */
+   OVS_ERSPAN_OPT_HWID,/* u8 hardware ID */
+   __OVS_ERSPAN_OPT_MAX,
+};
+
+#define OVS_ERSPAN_OPT_MAX (__OVS_ERSPAN_OPT_MAX - 1)
 
 /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
  */
@@ -363,6 +373,7 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_PAD,
+   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* Nested OVS_ERSPAN_OPT_* */
__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index f143908b651d..c57b96b595b5 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "flow_netlink.h"
 
@@ -329,7 +330,8 @@ size_t ovs_tun_key_attr_size(void)
+ nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_CSUM */
+ nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_OAM */
+ nla_total_size(256)  /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
-   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with
+   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and
+* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with
 * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
 */
+ nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
@@ -384,6 +386,13 @@ static const struct ovs_len_tbl 
ovs_vxlan_ext_key_lens[OVS_VXLAN_EXT_MAX + 1] =
[OVS_VXLAN_EXT_GBP] = { .len = sizeof(u32) },
 };
 
+static const struct ovs_len_tbl ovs_erspan_opt_lens[OVS_ERSPAN_OPT_MAX + 1] = {
+   [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) },
+   [OVS_ERSPAN_OPT_VER]= { .len = sizeof(u8) },
+   [OVS_ERSPAN_OPT_DIR]= { .len = sizeof(u8) },
+   [OVS_ERSPAN_OPT_HWID]   = { .len = sizeof(u8) },
+};
+
 static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 
1] = {
[OVS_TUNNEL_KEY_ATTR_ID]= { .len = sizeof(u64) },
[OVS_TUNNEL_KEY_ATTR_IPV4_SRC]  = { .len = sizeof(u32) },
@@ -400,6 +409,8 @@ static const struct ovs_len_tbl 
ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
.next = ovs_vxlan_ext_key_lens 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct in6_addr) 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct in6_addr) 
},
+   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_NESTED,
+   .next = ovs_erspan_opt_lens },
 };
 
 static const struct ovs_len_tbl
@@ -631,6 +642,94 @@ static int vxlan_tun_opt_from_nlattr(const struct nlattr 
*attr,
return 0;
 }
 
+static int erspan_tun_opt_from_nlattr(const struct nlattr *attr,
+ struct sw_flow_match *match, bool is_mask,
+ bool log)
+{
+   unsigned long opt_key_offset;
+   struct erspan_metadata opts;
+   struct nlattr *a;
+   u16 hwid, dir;
+   int rem;
+
+   BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
+
+   memset(, 0, sizeof(opts));
+   nla_for_each_nested(a, attr, rem) {
+   int type = nla_type(a);
+
+   if (type > OVS_ERSPAN_OPT_MAX) {
+   OVS_NLERR(log, "ERSPAN option %d out of range max %d",
+ type, OVS_ERSPAN_OPT_MAX);
+   return -EINVAL;
+   }
+
+   if (!check_attr_len(nla_len(a),
+   ovs_erspan_opt_lens[type].len)) {
+   OVS_NLERR(log, "ERSPAN option %d has unexpected len %d 
expected %d",
+ type, nla_len(a),
+ ovs_erspan_opt_lens[type].len);
+ 

[PATCHv4 net-next 1/2] net: erspan: use bitfield instead of mask and offset

2018-01-18 Thread William Tu
Originally the erspan fields are defined as a group into a __be16 field,
and use mask and offset to access each field.  This is more costly due to
calling ntohs/htons.  The patch changes it to use bitfields.

Signed-off-by: William Tu 
---
 include/net/erspan.h | 127 ++-
 net/ipv4/ip_gre.c|  38 ++-
 net/ipv6/ip6_gre.c   |  36 ++-
 3 files changed, 121 insertions(+), 80 deletions(-)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index acdf6843095d..2b75821e2ebe 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -65,16 +65,30 @@
 #define GRA_MASK   0x0006
 #define O_MASK 0x0001
 
+#define HWID_OFFSET4
+#define DIR_OFFSET 3
+
 /* ERSPAN version 2 metadata header */
 struct erspan_md2 {
__be32 timestamp;
__be16 sgt; /* security group tag */
-   __be16 flags;
-#define P_OFFSET   15
-#define FT_OFFSET  10
-#define HWID_OFFSET4
-#define DIR_OFFSET 3
-#define GRA_OFFSET 1
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8hwid_upper:2,
+   ft:5,
+   p:1;
+   __u8o:1,
+   gra:2,
+   dir:1,
+   hwid:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8p:1,
+   ft:5,
+   hwid_upper:2;
+   __u8hwid:4,
+   dir:1,
+   gra:2,
+   o:1;
+#endif
 };
 
 enum erspan_encap_type {
@@ -95,15 +109,62 @@ struct erspan_metadata {
 };
 
 struct erspan_base_hdr {
-   __be16 ver_vlan;
-#define VER_OFFSET  12
-   __be16 session_id;
-#define COS_OFFSET  13
-#define EN_OFFSET   11
-#define BSO_OFFSET  EN_OFFSET
-#define T_OFFSET10
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8vlan_upper:4,
+   ver:4;
+   __u8vlan:8;
+   __u8session_id_upper:2,
+   t:1,
+   en:2,
+   cos:3;
+   __u8session_id:8;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8ver: 4,
+   vlan_upper:4;
+   __u8vlan:8;
+   __u8cos:3,
+   en:2,
+   t:1,
+   session_id_upper:2;
+   __u8session_id:8;
+#else
+#error "Please fix "
+#endif
 };
 
+static inline void set_session_id(struct erspan_base_hdr *ershdr, u16 id)
+{
+   ershdr->session_id = id & 0xff;
+   ershdr->session_id_upper = (id >> 8) & 0x3;
+}
+
+static inline u16 get_session_id(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->session_id_upper << 8) + ershdr->session_id;
+}
+
+static inline void set_vlan(struct erspan_base_hdr *ershdr, u16 vlan)
+{
+   ershdr->vlan = vlan & 0xff;
+   ershdr->vlan_upper = (vlan >> 8) & 0xf;
+}
+
+static inline u16 get_vlan(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->vlan_upper << 8) + ershdr->vlan;
+}
+
+static inline void set_hwid(struct erspan_md2 *md2, u8 hwid)
+{
+   md2->hwid = hwid & 0xf;
+   md2->hwid_upper = (hwid >> 4) & 0x3;
+}
+
+static inline u8 get_hwid(const struct erspan_md2 *md2)
+{
+   return (md2->hwid_upper << 4) + md2->hwid;
+}
+
 static inline int erspan_hdr_len(int version)
 {
return sizeof(struct erspan_base_hdr) +
@@ -120,7 +181,7 @@ static inline u8 tos_to_cos(u8 tos)
 }
 
 static inline void erspan_build_header(struct sk_buff *skb,
-   __be32 id, u32 index,
+   u32 id, u32 index,
bool truncate, bool is_ipv4)
 {
struct ethhdr *eth = eth_hdr(skb);
@@ -154,12 +215,12 @@ static inline void erspan_build_header(struct sk_buff 
*skb,
memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V1_MDSIZE);
 
/* Build base header */
-   ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) |
-(ERSPAN_VERSION << VER_OFFSET));
-   ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) |
-  ((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) |
-  (enc_type << EN_OFFSET & EN_MASK) |
-  ((truncate << T_OFFSET) & T_MASK));
+   ershdr->ver = ERSPAN_VERSION;
+   ershdr->cos = tos_to_cos(tos);
+   ershdr->en = enc_type;
+   ershdr->t = truncate;
+   set_vlan(ershdr, vlan_tci);
+   set_session_id(ershdr, id);
 
/* Build metadata */
ersmd = (struct erspan_metadata *)(ershdr + 1);
@@ -187,7 +248,7 @@ static inline __be32 erspan_get_timestamp(void)
 }
 
 static inline void erspan_build_header_v2(struct sk_buff *skb,
- __be32 id, u8 direction, u16 hwid,
+ u32 id, u8 direction, u16 hwid,
  bool truncate, bool is_ipv4)
 {
struct ethhdr *eth = eth_hdr(skb);
@@ -198,7 +259,6 @@ static inline void erspan_build_header_v2(struct sk_buff 
*skb,
__be16 tci;
  

[PATCHv4 net-next 0/2] net: erspan: add support for openvswitch

2018-01-18 Thread William Tu
The first patch refactors the erspan header definitions. 
Originally, the erspan fields are defined as a group into a __be16 field,
and use mask and offset to access each field.  This is more costly due to
calling ntohs/htons and error-prone.  The first patch changes it to use
bitfields.  The second patch introduces the new OVS tunnel key attribute
to program both v1 and v2 erspan tunnel for openvswitch.

William Tu (2):
  net: erspan: use bitfield instead of mask and offset
  openvswitch: add erspan version I and II support

 include/net/erspan.h | 127 +++-
 include/uapi/linux/openvswitch.h |  11 +++
 net/ipv4/ip_gre.c|  38 --
 net/ipv6/ip6_gre.c   |  36 -
 net/openvswitch/flow_netlink.c   | 154 ++-
 5 files changed, 285 insertions(+), 81 deletions(-)

---
v3->v4
  change from be32 to u32 for OVS_ERSPAN_OPT_IDX, suggested by Jiri Benc.

v2->v3
  revert the "openvswitch: Add erspan tunnel support." commit ceaa001a170e.
  redesign the OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS as nested attribute

v1->v2
  Fix compatibility issue suggested by Pravin.
-- 
2.7.4



Re: [PATCH 0/3] Check gso_size of packets when forwarding

2018-01-18 Thread Pravin Shelar
On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens  wrote:
> Pravin Shelar  writes:
>
>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens  wrote:
>>> When regular packets are forwarded, we validate their size against the
>>> MTU of the destination device. However, when GSO packets are
>>> forwarded, we do not validate their size against the MTU. We
>>> implicitly assume that when they are segmented, the resultant packets
>>> will be correctly sized.
>>>
>>> This is not always the case.
>>>
>>> We observed a case where a packet received on an ibmveth device had a
>>> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
>>> device, where it caused a firmware assert. This is described in detail
>>> at [0] and was the genesis of this series. Rather than fixing it in
>>> the driver, this series fixes the forwarding path.
>>>
>> Are there any other possible forwarding path in networking stack? TC
>> is one subsystem that could forward such a packet to the bnx2x device,
>> how is that handled ?
>
> So far I have only looked at bridges, openvswitch and macvlan. In
> general, if the code uses dev_forward_skb() it should automatically be
> fine as that invokes is_skb_forwardable(), which we patch.
>
But there are other ways to forward packets, e.g tc-mirred or bpf
redirect. We need to handle all these cases rather than fixing one at
a time. As Jason suggested netif_needs_gso() looks like good function
to validate if a device is capable of handling given GSO packet.


Re: [PATCH bpf-next 4/8] bpf: offload: report device information about offloaded maps

2018-01-18 Thread Alexei Starovoitov
On Wed, Jan 17, 2018 at 07:13:28PM -0800, Jakub Kicinski wrote:
> Tell user space about device on which the map was created.
> Unfortunate reality of user ABI makes sharing this code
> with program offload difficult but the information is the
> same.
> 
> Signed-off-by: Jakub Kicinski 

Acked-by: Alexei Starovoitov 

the series look good to me.
Thank you for breaking them down this way to avoid conflicts
between bpf-next and our array masking fixes in bpf.



Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-18 Thread Laurent Pinchart
Hi Will,

On Thursday, 18 January 2018 19:05:47 EET Will Deacon wrote:
> On Thu, Jan 18, 2018 at 08:58:08AM -0800, Dan Williams wrote:
> > On Thu, Jan 18, 2018 at 5:18 AM, Will Deacon wrote:
> >> On Thu, Jan 11, 2018 at 05:41:08PM -0800, Dan Williams wrote:
> >>> On Thu, Jan 11, 2018 at 5:19 PM, Linus Torvalds wrote:
>  On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams wrote:
> > This series incorporates Mark Rutland's latest ARM changes and adds
> > the x86 specific implementation of 'ifence_array_ptr'. That ifence
> > based approach is provided as an opt-in fallback, but the default
> > mitigation, '__array_ptr', uses a 'mask' approach that removes
> > conditional branches instructions, and otherwise aims to redirect
> > speculation to use a NULL pointer rather than a user controlled
> > value.
>  
>  Do you have any performance numbers and perhaps example code
>  generation? Is this noticeable? Are there any microbenchmarks showing
>  the difference between lfence use and the masking model?
> >>> 
> >>> I don't have performance numbers, but here's a sample code generation
> >>> from __fcheck_files, where the 'and; lea; and' sequence is portion of
> >>> array_ptr() after the mask generation with 'sbb'.
> >>> 
> >>> fdp = array_ptr(fdt->fd, fd, fdt->max_fds);
> >>>  
> >>>  8e7:   8b 02   mov(%rdx),%eax
> >>>  8e9:   48 39 c7cmp%rax,%rdi
> >>>  8ec:   48 19 c9sbb%rcx,%rcx
> >>>  8ef:   48 8b 42 08 mov0x8(%rdx),%rax
> >>>  8f3:   48 89 femov%rdi,%rsi
> >>>  8f6:   48 21 ceand%rcx,%rsi
> >>>  8f9:   48 8d 04 f0 lea(%rax,%rsi,8),%rax
> >>>  8fd:   48 21 c8and%rcx,%rax
>  
>  Having both seems good for testing, but wouldn't we want to pick one
>  in the end?
> >>> 
> >>> I was thinking we'd keep it as a 'just in case' sort of thing, at
> >>> least until the 'probably safe' assumption of the 'mask' approach has
> >>> more time to settle out.
> >> 
> >> From the arm64 side, the only concern I have (and this actually applies
> >> to our CSDB sequence as well) is the calculation of the array size by
> >> the caller. As Linus mentioned at the end of [1], if the determination
> >> of the size argument is based on a conditional branch, then masking
> >> doesn't help because you bound within the wrong range under speculation.
> >> 
> >> We ran into this when trying to use masking to protect our uaccess
> >> routines where the conditional bound is either KERNEL_DS or USER_DS.
> >> It's possible that a prior conditional set_fs(KERNEL_DS) could defeat
> >> the masking and so we'd need to throw some heavy barriers in set_fs to
> >> make it robust.
> > 
> > At least in the conditional mask case near set_fs() usage the approach
> > we are taking is to use a barrier. I.e. the following guidance from
> > Linus:
> > 
> > "Basically, the rule is trivial: find all 'stac' users, and use address
> > masking if those users already integrate the limit check, and lfence
> > they don't."
> > 
> > ...which translates to narrow the pointer for get_user() and use a
> > barrier  for __get_user().
> 
> Great, that matches my thinking re set_fs but I'm still worried about
> finding all the places where the bound is conditional for other users
> of the macro. Then again, finding the places that need this macro in the
> first place is tough enough so perhaps analysing the bound calculation
> doesn't make it much worse.

It might not now, but if the bound calculation changes later, I'm pretty sure 
we'll forget to update the speculation barrier macro at least in some cases. 
Without the help of static (or possibly dynamic) code analysis I think we're 
bound to reintroduce problems over time, but that's true for finding places 
where the barrier is needed, not just for barrier selection based on bound 
calculation.

-- 
Regards,

Laurent Pinchart



Re: [PATCH 27/38] sctp: Copy struct sctp_sock.autoclose to userspace using put_user()

2018-01-18 Thread Kees Cook
On Thu, Jan 18, 2018 at 1:31 PM, Laura Abbott  wrote:
> On 01/10/2018 06:02 PM, Kees Cook wrote:
>>
>> From: David Windsor 
>>
>> The autoclose field can be copied with put_user(), so there is no need to
>> use copy_to_user(). In both cases, hardened usercopy is being bypassed
>> since the size is constant, and not open to runtime manipulation.
>>
>> This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY
>> whitelisting code in the last public patch of grsecurity/PaX based on my
>> understanding of the code. Changes or omissions from the original code are
>> mine and don't reflect the original grsecurity/PaX code.
>>
>
> Just tried a quick rebase and it looks like this conflicts with
> c76f97c99ae6 ("sctp: make use of pre-calculated len")
> I don't think we can use put_user if we're copying via the full
> len?

It should be fine, since:

len = sizeof(int);

c76f97c99ae6 just does a swap of sizeof(int) with len, put_user() will
work in either case, since autoclose will always be int sized.

-Kees

>
> Thanks,
> Laura
>
>
>> Signed-off-by: David Windsor 
>> [kees: adjust commit log]
>> Cc: Vlad Yasevich 
>> Cc: Neil Horman 
>> Cc: "David S. Miller" 
>> Cc: linux-s...@vger.kernel.org
>> Cc: netdev@vger.kernel.org
>> Signed-off-by: Kees Cook 
>> ---
>>   net/sctp/socket.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>> index efbc8f52c531..15491491ec88 100644
>> --- a/net/sctp/socket.c
>> +++ b/net/sctp/socket.c
>> @@ -5011,7 +5011,7 @@ static int sctp_getsockopt_autoclose(struct sock
>> *sk, int len, char __user *optv
>> len = sizeof(int);
>> if (put_user(len, optlen))
>> return -EFAULT;
>> -   if (copy_to_user(optval, _sk(sk)->autoclose, sizeof(int)))
>> +   if (put_user(sctp_sk(sk)->autoclose, (int __user *)optval))
>> return -EFAULT;
>> return 0;
>>   }
>>
>



-- 
Kees Cook
Pixel Security


Re: [PATCH 27/38] sctp: Copy struct sctp_sock.autoclose to userspace using put_user()

2018-01-18 Thread Laura Abbott

On 01/10/2018 06:02 PM, Kees Cook wrote:

From: David Windsor 

The autoclose field can be copied with put_user(), so there is no need to
use copy_to_user(). In both cases, hardened usercopy is being bypassed
since the size is constant, and not open to runtime manipulation.

This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.



Just tried a quick rebase and it looks like this conflicts with
c76f97c99ae6 ("sctp: make use of pre-calculated len")
I don't think we can use put_user if we're copying via the full
len?

Thanks,
Laura


Signed-off-by: David Windsor 
[kees: adjust commit log]
Cc: Vlad Yasevich 
Cc: Neil Horman 
Cc: "David S. Miller" 
Cc: linux-s...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook 
---
  net/sctp/socket.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index efbc8f52c531..15491491ec88 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -5011,7 +5011,7 @@ static int sctp_getsockopt_autoclose(struct sock *sk, int 
len, char __user *optv
len = sizeof(int);
if (put_user(len, optlen))
return -EFAULT;
-   if (copy_to_user(optval, _sk(sk)->autoclose, sizeof(int)))
+   if (put_user(sctp_sk(sk)->autoclose, (int __user *)optval))
return -EFAULT;
return 0;
  }





[PATCH net] rds: tcp: compute m_ack_seq as offset from ->write_seq

2018-01-18 Thread Sowmini Varadhan
rds-tcp uses m_ack_seq to track the tcp ack# that indicates
that the peer has received a rds_message. The m_ack_seq is
used in rds_tcp_is_acked() to figure out when it is safe to
drop the rds_message from the RDS retransmit queue.

The m_ack_seq must be calculated as an offset from the right 
edge of the in-flight tcp buffer, i.e., it should be based on 
the ->write_seq, not the ->snd_nxt.

Signed-off-by: Sowmini Varadhan 
---
 net/rds/tcp.c  |5 +++--
 net/rds/tcp.h  |2 +-
 net/rds/tcp_send.c |4 ++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 6b7ee71..ab7356e 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -90,9 +90,10 @@ void rds_tcp_nonagle(struct socket *sock)
  sizeof(val));
 }
 
-u32 rds_tcp_snd_nxt(struct rds_tcp_connection *tc)
+u32 rds_tcp_write_seq(struct rds_tcp_connection *tc)
 {
-   return tcp_sk(tc->t_sock->sk)->snd_nxt;
+   /* seq# of the last byte of data in tcp send buffer */
+   return tcp_sk(tc->t_sock->sk)->write_seq;
 }
 
 u32 rds_tcp_snd_una(struct rds_tcp_connection *tc)
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 1aafbf7..864ca7d 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -54,7 +54,7 @@ struct rds_tcp_statistics {
 void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp);
 void rds_tcp_restore_callbacks(struct socket *sock,
   struct rds_tcp_connection *tc);
-u32 rds_tcp_snd_nxt(struct rds_tcp_connection *tc);
+u32 rds_tcp_write_seq(struct rds_tcp_connection *tc);
 u32 rds_tcp_snd_una(struct rds_tcp_connection *tc);
 u64 rds_tcp_map_seq(struct rds_tcp_connection *tc, u32 seq);
 extern struct rds_transport rds_tcp_transport;
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index dc860d1..9b76e0f 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -86,7 +86,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct 
rds_message *rm,
 * m_ack_seq is set to the sequence number of the last byte of
 * header and data.  see rds_tcp_is_acked().
 */
-   tc->t_last_sent_nxt = rds_tcp_snd_nxt(tc);
+   tc->t_last_sent_nxt = rds_tcp_write_seq(tc);
rm->m_ack_seq = tc->t_last_sent_nxt +
sizeof(struct rds_header) +
be32_to_cpu(rm->m_inc.i_hdr.h_len) - 1;
@@ -98,7 +98,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct 
rds_message *rm,
rm->m_inc.i_hdr.h_flags |= RDS_FLAG_RETRANSMITTED;
 
rdsdebug("rm %p tcp nxt %u ack_seq %llu\n",
-rm, rds_tcp_snd_nxt(tc),
+rm, rds_tcp_write_seq(tc),
 (unsigned long long)rm->m_ack_seq);
}
 
-- 
1.7.1



Re: [PATCH net] flow_dissector: properly cap thoff field

2018-01-18 Thread David Miller
From: Eric Dumazet 
Date: Wed, 17 Jan 2018 14:21:13 -0800

> From: Eric Dumazet 
> 
> syzbot reported yet another crash [1] that is caused by
> insufficient validation of DODGY packets.
> 
> Two bugs are happening here to trigger the crash.
> 
> 1) Flow dissection leaves with incorrect thoff field.
> 
> 2) skb_probe_transport_header() sets transport header to this invalid
> thoff, even if pointing after skb valid data.
> 
> 3) qdisc_pkt_len_init() reads out-of-bound data because it
> trusts tcp_hdrlen(skb)  
> 
> Possible fixes :
> 
> - Full flow dissector validation before injecting bad DODGY packets in
> the stack.
>  This approach was attempted here : https://patchwork.ozlabs.org/patch/
> 861874/
> 
> - Have more robust functions in the core.
>   This might be needed anyway for stable versions.
> 
> This patch fixes the flow dissection issue.
> 
> [1]
 ...
> Fixes: 34fad54c2537 ("net: __skb_flow_dissect() must cap its return value")
> Fixes: a6e544b0a88b ("flow_dissector: Jump to exit code in 
> __skb_flow_dissect")
> Signed-off-by: Eric Dumazet 
> Cc: Willem de Bruijn 
> Reported-by: syzbot 

Applied and queued up for -stable, thanks Eric.


Re: [PATCH-next] flow_netlink: Remove unneeded semicolons

2018-01-18 Thread David Miller
From: Christopher Díaz Riveros 
Date: Wed, 17 Jan 2018 16:10:28 -0500

> Trivial fix removes unneeded semicolons after if blocks.
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Christopher Díaz Riveros 

Applied, thanks.


Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY

2018-01-18 Thread Willem de Bruijn
> Aside from inet_gso_segment and ipv6_gso_segment, this only leaves the
> new nsh_gso_segment. Unlke mpls, it has its own gso_type, so
>
> if (!skb_shinfo(skb)->gso_type & SKB_GSO_NSH)
> goto out;

This last point was incorrect. There is no such type for this protocol, so
we have to do the same as for MPLS. Either assume that it is robust
or be conservative and add

if (skb_shinfo(skb)->gso_type & SKB_GSO_DODGY)
goto out;

It looks correct on the surface and may already be relied upon, so I
suggest the first approach.


Re: [PATCH v2] tools/bpftool: silence a static checker warning

2018-01-18 Thread Daniel Borkmann
On 01/18/2018 10:35 AM, Dan Carpenter wrote:
> There is a static checker warning that "proglen" has an upper bound but
> no lower bound.  The allocation will just fail harmlessly so it's not a
> big deal.
> 
> Signed-off-by: Dan Carpenter 

Applied to bpf-next, thanks Dan! (I changed the prefix from 'tools/bpftool'
into 'tools/bpf_jit_disasm' since it's not touching bpftool here.)


Re: pull-request: wireless-drivers 2018-01-17

2018-01-18 Thread David Miller
From: Kalle Valo 
Date: Wed, 17 Jan 2018 16:30:21 +0200

> here are few more important fixes to the net tree for 4.15, I hope they
> still make it. Please let me know if there are any problems.

Pulled, thanks Kalle.


Re: [PATCH] net/mlx5e: Fix trailing semicolon

2018-01-18 Thread David Miller
From: Luis de Bethencourt 
Date: Wed, 17 Jan 2018 12:09:15 +

> The trailing semicolon is an empty statement that does no operation.
> Removing it since it doesn't do anything.
> 
> Signed-off-by: Luis de Bethencourt 

Applied to net-next, thank you.


Re: [bpf-next PATCH] bpf: Sync kernel ABI header with tooling header

2018-01-18 Thread Daniel Borkmann
On 01/18/2018 05:49 PM, Jesper Dangaard Brouer wrote:
> Update tools/include/uapi/linux/bpf.h to bring it in sync with
> include/uapi/linux/bpf.h.  The listed commits forgot to update it.
> 
> Fixes: 02dd3291b2f0 ("bpf: finally expose xdp_rxq_info to XDP bpf-programs")
> Fixes: f19397a5c656 ("bpf: Add access to snd_cwnd and others in sock_ops")
> Fixes: 06ef0ccb5a36 ("bpf/cgroup: fix a verification error for a 
> CGROUP_DEVICE type prog")
> Signed-off-by: Jesper Dangaard Brouer 

Applied to bpf-next, thanks Jesper!


Re: [net] fm10k: mark PM functions as __maybe_unused

2018-01-18 Thread David Miller
From: David Miller 
Date: Thu, 18 Jan 2018 15:52:38 -0500 (EST)

> From: Jeff Kirsher 
> Date: Wed, 17 Jan 2018 07:57:32 -0800
> 
>> From: Arnd Bergmann 
>> 
>> A cleanup of the PM code left an incorrect #ifdef in place, leading
>> to a harmless build warning:
>> 
>> drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2502:12: error: 'fm10k_suspend' 
>> defined but not used [-Werror=unused-function]
>> drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2475:12: error: 'fm10k_resume' 
>> defined but not used [-Werror=unused-function]
>> 
>> It's easier to use __maybe_unused attributes here, since you
>> can't pick the wrong one.
>> 
>> Fixes: 8249c47c6ba4 ("fm10k: use generic PM hooks instead of legacy PCIe 
>> power hooks")
>> Signed-off-by: Arnd Bergmann 
>> Acked-by: Jacob Keller 
>> Tested-by: Krishneil Singh 
>> Signed-off-by: Jeff Kirsher 
> 
> Applied, thanks Arnd.

Of course, thank you to Jeff as well :-)


Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY

2018-01-18 Thread Willem de Bruijn
On Thu, Jan 18, 2018 at 4:14 AM, Jason Wang  wrote:
>
>
> On 2018年01月18日 14:04, Willem de Bruijn wrote:
>>
>> On Thu, Jan 18, 2018 at 12:20 AM, Willem de Bruijn
>>  wrote:
>>>
>>> On Wed, Jan 17, 2018 at 10:48 PM, Jason Wang  wrote:


 On 2018年01月18日 07:11, Willem de Bruijn wrote:
>
> From: Willem de Bruijn
>
> Validate gso_type of untrusted SKB_GSO_DODGY packets during
> segmentation.
>
> Untrusted user packets are limited to a small set of gso types in
> virtio_net_hdr_to_skb. But segmentation occurs on packet contents.
> Syzkaller was able to enter gso callbacks that are not hardened
> against untrusted user input.
>
> Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr")


 This commit is suspicious, I guess it should be 5c7cdf339af5 ("gso:
 Remove
 arbitrary checks for unsupported GSO")
>>>
>>> The specific SCTP path was introduced with commit 90017accff61 ("sctp:
>>> add GSO support"). But the main issue that packets can be delivered to
>>> gso handlers different from their gso_type goes back further.
>>>
>>> The commit you reference is actually older than the sctp gso patch, so
>>> it makes sense that it did not have a check in the sctp_gso_segment.
>>>
>>> I still think that we should check in inet_gso_segment when we have
>>> the proto, instead of in each {tcp, sctp, udp, esp, ...} handler having
>>> a check of the form.
>>>
>>>!(type & (SKB_GSO_TCPV4 |
>>>  SKB_GSO_TCPV6
>>
>> Unless we can create packets that legitimate combine
>> SKB_GSO_DODGY with tunnel headers.
>
>
> As you mentioned below, looks like we can e.g bridge between tunnels (vxlan,
> gre or others) and tap, or even bpf can produce this (e.g
> bpf_skb_adjust_room).
>
>>
>> virtio_net_hdr_to_skb does not accept tunneled gso types.
>
>
> Yes, Vlad is trying to extend virtio to support more kinds of gso types, so
> it will be supported for sure.
>
>>
>> But a tun device can be bridged with a gre tunnel in the
>> host, creating a path that will call gre_gso_segment.
>>
>> If that is possible, then this patch is indeed too strict and
>> we do need checks in the individual handlers.

Okay, I'm working on a patch that adds explicit checks

  @@ -45,6 +45,9 @@ static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
  struct sk_buff *segs = ERR_PTR(-EINVAL);
  struct sctphdr *sh;

  +   if (!skb_shinfo(skb)->gso_type & SKB_GSO_SCTP)
  +   goto out;
  +

to all transport layer gso offloads: {sctp, tcpv[46], ufov[46], espv[46]}. This
will block packets with gso_type X but protocol type Y from being parsed.
But does allow entering a tunnel protocol handler if that is different from Y,
unlike the above patch.

tunnels segmentation itself is protected by skb->encapsulation. Only tunnel
devices in the stack can set this field, not virtio_net_hdr_to_skb. Packets that
enter {gre, udp tunnel, ipxip4, ipxip6} without this bit are already dropped, so
no new checks are needed to these based on gso_type.

That leaves eth_proto callbacks. mpls had this check before commit 5c7cdf339af5
("gso: Remove arbitrary checks for unsupported GSO"):

 if (unlikely(skb_shinfo(skb)->gso_type &
 ~(SKB_GSO_TCPV4 |
   SKB_GSO_TCPV6 |
   SKB_GSO_UDP |
   SKB_GSO_DODGY |
   SKB_GSO_TCP_FIXEDID |
   SKB_GSO_TCP_ECN)))
 goto out;

which appears to exclude only other tunnel types at the time, no
transport layers. I don't think we need any new check here, then: it
was robust against handling dodgy sources.

Aside from inet_gso_segment and ipv6_gso_segment, this only leaves the
new nsh_gso_segment. Unlke mpls, it has its own gso_type, so

if (!skb_shinfo(skb)->gso_type & SKB_GSO_NSH)
goto out;


[PATCH] net: tcp: close sock if net namespace is exiting

2018-01-18 Thread Dan Streetman
When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.

For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open.  However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open.  In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence.  The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:

unregister_netdevice: waiting for lo to become free. Usage count = 1

These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.

After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.

Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman 
---
 include/net/net_namespace.h | 10 ++
 net/ipv4/tcp.c  |  3 +++
 net/ipv4/tcp_timer.c| 15 +++
 3 files changed, 28 insertions(+)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index f8a84a2c2341..f306b2aa15a4 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -223,6 +223,11 @@ int net_eq(const struct net *net1, const struct net *net2)
return net1 == net2;
 }
 
+static inline int check_net(const struct net *net)
+{
+   return refcount_read(>count) != 0;
+}
+
 void net_drop_ns(void *);
 
 #else
@@ -247,6 +252,11 @@ int net_eq(const struct net *net1, const struct net *net2)
return 1;
 }
 
+static inline int check_net(const struct net *net)
+{
+   return 1;
+}
+
 #define net_drop_ns NULL
 #endif
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d7cf861bf699..9389193e73f3 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2298,6 +2298,9 @@ void tcp_close(struct sock *sk, long timeout)
tcp_send_active_reset(sk, GFP_ATOMIC);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONMEMORY);
+   } else if (!check_net(sock_net(sk))) {
+   /* Not possible to send reset; just close */
+   tcp_set_state(sk, TCP_CLOSE);
}
}
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6db3124cdbda..41b40b805aa3 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -48,11 +48,19 @@ static void tcp_write_err(struct sock *sk)
  *  to prevent DoS attacks. It is called when a retransmission timeout
  *  or zero probe timeout occurs on orphaned socket.
  *
+ *  Also close if our net namespace is exiting; in that case there is no
+ *  hope of ever communicating again since all netns interfaces are already
+ *  down (or about to be down), and we need to release our dst references,
+ *  which have been moved to the netns loopback interface, so the namespace
+ *  can finish exiting.  This condition is only possible if we are a kernel
+ *  socket, as those do not hold references to the namespace.
+ *
  *  Criteria is still not confirmed experimentally and may change.
  *  We kill the socket, if:
  *  1. If number of orphaned sockets exceeds an administratively configured
  * limit.
  *  2. If we have strong memory pressure.
+ *  3. If our net namespace is exiting.
  */
 static int tcp_out_of_resources(struct sock *sk, bool do_reset)
 {
@@ -81,6 +89,13 @@ static int tcp_out_of_resources(struct sock *sk, bool 
do_reset)
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
return 1;
}
+
+   if (!check_net(sock_net(sk))) {
+   /* Not possible to send reset; just close */
+   tcp_done(sk);
+   return 1;
+   }
+
return 0;
 }
 
-- 
2.14.1



Re: [bpf-next PATCH] bpf: add comments to BPF ld/ldx sizes

2018-01-18 Thread Daniel Borkmann
On 01/17/2018 12:05 PM, Jesper Dangaard Brouer wrote:
> Doc BPF ld/ldx size defines as comments in code, as it makes in
> faster to lookup in a programming/review setting, than looking up
> the sizes in Documentation/networking/filter.txt.
> 
> Signed-off-by: Jesper Dangaard Brouer 

Applied to bpf-next, thanks Jesper!


Re: [PATCH bpf-next] bpf: change fake_ip for bpf_trace_printk helper

2018-01-18 Thread Daniel Borkmann
On 01/17/2018 06:19 PM, Yonghong Song wrote:
> Currently, for bpf_trace_printk helper, fake ip address 0x1
> is used with comments saying that fake ip will not be printed.
> This is indeed true for 4.12 and earlier version, but for
> 4.13 and later version, the ip address will be printed if
> it cannot be resolved with kallsym. Running samples/bpf/tracex5
> program and you will have the following in the debugfs
> trace_pipe output:
>   ...
>   <...>-1819  [003]    443.497877: 0x0001: mmap
>   <...>-1819  [003]    443.498289: 0x0001: syscall=102 (one of 
> get/set uid/pid/gid)
>   ...
> 
> The kernel commit changed this behavior is:
>   commit feaf1283d11794b9d518fcfd54b6bf8bee1f0b4b
>   Author: Steven Rostedt (VMware) 
>   Date:   Thu Jun 22 17:04:55 2017 -0400
> 
>   tracing: Show address when function names are not found
>   ...
> 
> This patch changed the comment and also altered the fake ip
> address to 0x0 as users may think 0x1 has some special meaning
> while it doesn't. The new output:
>   ...
>   <...>-1799  [002] 25.953576: 0: mmap
>   <...>-1799  [002] 25.953865: 0: read(fd=0, buf=053936b5, 
> size=512)
>   ...
> 
> Signed-off-by: Yonghong Song 

Applied to bpf-next, thanks Yonghong!


Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd

2018-01-18 Thread Christian Brauner
On Thu, Jan 18, 2018 at 09:29:14PM +0100, Jiri Benc wrote:
> On Thu, 18 Jan 2018 21:21:24 +0100, Christian Brauner wrote:
> > In such scenarios setting a netns id property is
> > not really wanted
> 
> Why? I think that's what you should do if you want to avoid setns. Just
> use netnsid. I don't see any problem with that and you didn't provide
> any explanation.

Ah, sorry, maybe I was to brief. When creating and destroying a lot of
short-lived network namespaces it seems unnecessary to give them all
label/netns id. Using a netns id makes much more sense when you want a
persistent, long-living network namespace. For example, iproute2 where
you want to create a persistent network namespace that sticks around via
ip netns add bla && ip netns set bla 5.

A more concrete scenario is creating a network namespace, moving a
device into it via RTM_SETLINK which also supports IFLA_NET_NS_{FD,PID}
and then wanting to query the device. This would be very easy to do if
one could reuse the IFLA_NET_NS_{FD,PID} without having to set a
netnsid.

Christian


Re: [net] fm10k: mark PM functions as __maybe_unused

2018-01-18 Thread David Miller
From: Jeff Kirsher 
Date: Wed, 17 Jan 2018 07:57:32 -0800

> From: Arnd Bergmann 
> 
> A cleanup of the PM code left an incorrect #ifdef in place, leading
> to a harmless build warning:
> 
> drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2502:12: error: 'fm10k_suspend' 
> defined but not used [-Werror=unused-function]
> drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2475:12: error: 'fm10k_resume' 
> defined but not used [-Werror=unused-function]
> 
> It's easier to use __maybe_unused attributes here, since you
> can't pick the wrong one.
> 
> Fixes: 8249c47c6ba4 ("fm10k: use generic PM hooks instead of legacy PCIe 
> power hooks")
> Signed-off-by: Arnd Bergmann 
> Acked-by: Jacob Keller 
> Tested-by: Krishneil Singh 
> Signed-off-by: Jeff Kirsher 

Applied, thanks Arnd.


Re: [PATCHv3 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-18 Thread William Tu
Hi Jiri,

On Thu, Jan 18, 2018 at 2:13 AM, Jiri Benc  wrote:
> Looks much better.
>
> On Wed, 17 Jan 2018 09:32:51 -0800, William Tu wrote:
>> + OVS_ERSPAN_OPT_IDX, /* be32 index */
>
> Why don't you convert this to u32 while passing to/from user space?
>
>> + [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) },
>
> sizeof(__be32) but see above.
>
> Thanks,
>
>  Jiri

Thanks. Your suggestion makes sense.
In the beginning I just want to avoid another ntoh, hton conversion.
Since the ERSPAN iproute2 also assume u32 to/from userspace, I will
change it here to use u32.

William


  1   2   3   4   >